Text Preprocessing in Neural Text Classification

Jose Camacho-Collados and Mohammad Taher Pilehvar

This repository contains the pre-trained word embeddings and preprocessed text classification datasets for the paper On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis.

Pre-trained word embeddings

We release the 300-dimensional word embeddings used in our experiments as binary (.bin) files. The embeddings were trained on the UMBC corpus preprocessed with each of the following techniques:

  • Vanilla (simple tokenization): Download here [~1.8 GB]
  • Lowercased: Download here [~1.6 GB]
  • Lemmatized: Download here [~1.7 GB]
  • Multiword-grouped: Download here [~2.1 GB]
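
The binary layout of the .bin files is not specified above. As a minimal sketch, assuming they follow the common word2vec binary format (a text header with vocabulary size and dimensionality, then each word followed by a space and little-endian float32 values), they could be read with the standard library alone. The function names and the `limit` parameter are hypothetical and introduced only for illustration:

```python
import struct
from typing import BinaryIO, Dict, List, Optional


def _read_word(f: BinaryIO) -> str:
    """Read bytes up to the next space and decode them as a word."""
    chars = bytearray()
    while True:
        ch = f.read(1)
        if ch == b"":
            raise EOFError("unexpected end of file while reading a word")
        if ch == b" ":
            break
        if ch != b"\n":  # some writers put a newline after each vector
            chars.extend(ch)
    return chars.decode("utf-8", errors="replace")


def load_word2vec_bin(path: str, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Load up to `limit` vectors from a file in the ASSUMED word2vec binary format."""
    vectors: Dict[str, List[float]] = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <dim>\n"
        vocab_size, dim = (int(x) for x in f.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(count):
            word = _read_word(f)
            data = f.read(4 * dim)  # dim float32 values, 4 bytes each
            if len(data) != 4 * dim:
                raise EOFError("unexpected end of file while reading a vector")
            vectors[word] = list(struct.unpack(f"<{dim}f", data))
    return vectors
```

Since the files are between 1.6 and 2.1 GB, passing a `limit` is advisable for a first inspection. If gensim is installed and the format assumption holds, `KeyedVectors.load_word2vec_format(path, binary=True)` should read the same files considerably faster.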

Preprocessed datasets

We also release preprocessed versions of the text categorization and sentiment analysis datasets:

  • Text categorization: Available here
  • Sentiment analysis: Available here

Note 1: If you use any of these datasets, please acknowledge the original sources (you can find them in the reference paper).
Note 2: In each class file of the dataset directories, each line corresponds to one instance of the corpus: a phrase, sentence, or document, depending on the dataset.
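
The layout described in Note 2 (one file per class, one instance per line) can be read along the following lines. This is a sketch under assumptions: that each file name (without extension) serves as the class label, that the files are UTF-8 encoded, and that blank lines can be skipped. The function name `load_dataset` is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def load_dataset(directory: str) -> Tuple[List[str], List[str]]:
    """Return (texts, labels), reading one class per file and one instance per line."""
    texts: List[str] = []
    labels: List[str] = []
    for class_file in sorted(Path(directory).iterdir()):
        if not class_file.is_file():
            continue
        label = class_file.stem  # ASSUMPTION: the file name encodes the class
        with class_file.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    texts.append(line)
                    labels.append(label)
    return texts, labels
```

The two returned lists are aligned, so `texts[i]` belongs to class `labels[i]`.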

Code

The code to run our experiments is available in the following complementary repository: https://github.com/pilehvar/sensecnn

Reference paper

If you use any of these resources, please cite the following paper:

@InProceedings{camacho:preprocessing2018,
  author = 	"Camacho-Collados, Jose and Pilehvar, Mohammad Taher",
  title = 	"On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis",
  booktitle = 	"Proceedings of the EMNLP Workshop on Analyzing and interpreting neural networks for NLP",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  location = 	"Brussels, Belgium"
}