Text Domain Similarity

This project implements the text domain similarity measure for Persian described in the paper "A Deep Learning-Based Approach for Measuring the Domain Similarity of Persian Texts."

The preprint version of the paper is available at: https://arxiv.org/abs/1909.09690

Usage

The data described in the paper is not included in this repository. To run the project, provide your own data and specify its path and file names.

The sequence of execution is as follows:

1. preproc_embedding carries out the preprocessing phase and creates the word embedding vectors.
2. ad_pair_maker generates pairs of advertisement texts and scores each pair according to the specified rule.
3. pair_shuffler shuffles the prepared dataset so that the pairs appear in random order.
4. data_splitter converts the data in the text file to NumPy files, each containing 1 million paired texts.
5. cnn_tds, lstm_tds and w2v_mean_tds are the deep learning modules. Each carries out the training, validation and testing phases, using data_loader and embedding_loader.
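
The shuffling and splitting steps above can be illustrated with a minimal sketch. It is not the code from pair_shuffler or data_splitter: the function names, the (text_a, text_b, score) record layout, the seed and the toy chunk size are all assumptions made for illustration, and it uses only the Python standard library.

```python
import random
from typing import Iterator, List, Sequence, Tuple

# Hypothetical record layout: (text_a, text_b, similarity_score).
Pair = Tuple[str, str, float]


def shuffle_pairs(pairs: Sequence[Pair], seed: int = 0) -> List[Pair]:
    """Return a shuffled copy of the pairs; a fixed seed makes the order reproducible."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def split_into_chunks(pairs: Sequence[Pair], chunk_size: int) -> Iterator[List[Pair]]:
    """Yield consecutive chunks holding at most chunk_size pairs each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(pairs), chunk_size):
        yield list(pairs[start:start + chunk_size])


# Toy data standing in for the scored advertisement pairs.
toy_pairs = [(f"ad {i}", f"ad {i + 1}", float(i % 2)) for i in range(10)]
shuffled = shuffle_pairs(toy_pairs, seed=42)
chunks = list(split_into_chunks(shuffled, chunk_size=4))
```

With a chunk size of 1,000,000 instead of 4, each chunk would correspond to one output file. A chunk could then be written to disk with numpy.save, which is how a NumPy file is normally produced, though the file naming and array layout used by data_splitter are not specified here.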
