This project implements the text domain similarity measure for Persian described in the paper "A Deep Learning-Based Approach for Measuring the Domain Similarity of Persian Texts."
The preprint version of the paper is available at: https://arxiv.org/abs/1909.09690
The data described in the paper is not included. You can provide your own data and run the project with it; you only need to specify your data path and file names.
The sequence of execution is as follows:
1. preproc_embedding carries out the preprocessing phase and creates the word embedding vectors.
2. ad_pair_maker generates paired advertisement texts and scores them based on the specified rule.
3. pair_shuffler shuffles the prepared dataset so that the pairs appear in random order.
4. data_splitter converts the data in the text file to numpy files, each of which contains 1 million paired texts.
5. cnn_tds, lstm_tds and w2v_mean_tds are the deep learning modules. They carry out the training, validation and testing phases, using data_loader and embedding_loader.
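
The shuffling and splitting steps can be illustrated with a minimal sketch. This is not the code of pair_shuffler or data_splitter; the function names, the (text_a, text_b, score) tuple layout, the fixed seed and the output file naming are all assumptions made for illustration. Only the chunk size of 1 million pairs per numpy file comes from the description above.

```python
import os
import random

import numpy as np


def shuffle_pairs(pairs, seed=0):
    """Return a shuffled copy of the pairs (hypothetical helper).

    A fixed seed is an assumption here; it only makes the order reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled


def split_into_chunks(pairs, chunk_size=1_000_000):
    """Yield object arrays holding at most chunk_size pairs each."""
    for start in range(0, len(pairs), chunk_size):
        yield np.array(pairs[start:start + chunk_size], dtype=object)


def save_chunks(pairs, out_dir, prefix="pairs", chunk_size=1_000_000):
    """Write each chunk to out_dir as <prefix>_<index>.npy (assumed naming)."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, chunk in enumerate(split_into_chunks(pairs, chunk_size)):
        path = os.path.join(out_dir, f"{prefix}_{index}.npy")
        # Object arrays are stored with pickle, which np.save allows by default.
        np.save(path, chunk)
        paths.append(path)
    return paths
```

With this layout, a saved chunk would be read back with np.load(path, allow_pickle=True), since the arrays hold Python objects rather than fixed-width numeric values.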