Text Preprocessing in Neural Text Classification
Jose Camacho Collados and Mohammad Taher Pilehvar
This repository contains the pre-trained word embeddings and preprocessed text classification datasets for the paper On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis.
Pre-trained word embeddings
We release the 300-dimensional word embeddings used in our experiments as binary (bin) files. The embeddings were trained on the UMBC corpus, with one set for each of the following preprocessing techniques:
- Vanilla (simple tokenization): Download here [~1.8 GB]
- Lowercased: Download here [~1.6 GB]
- Lemmatized: Download here [~1.7 GB]
- Multiword-grouped: Download here [~2.1 GB]
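The sketch below shows one way to read such a file using only the Python standard library. It assumes the bin files follow the common word2vec binary layout (a text header with the vocabulary size and dimension, then each word followed by a space and its float32 values); that layout is an assumption, not something stated here. The function name `read_word2vec_binary` and its `limit` parameter are hypothetical, and the demo reads a tiny in-memory file rather than one of the downloads.

```python
import io
import struct


def read_word2vec_binary(stream, limit=None):
    """Read word vectors from a binary stream.

    Assumes the word2vec binary layout: a header line "<vocab> <dim>",
    then for each entry the word, one space, and <dim> little-endian
    float32 values. `limit` (hypothetical) caps how many entries are read.
    """
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    if limit is not None:
        vocab_size = min(vocab_size, limit)

    vectors = {}
    for _ in range(vocab_size):
        # Collect the bytes of the word up to the separating space.
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # some writers emit a newline after each vector
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")

        # Read exactly dim float32 values for this word.
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors


# Demo on a tiny in-memory file with two 3-dimensional vectors.
demo = (
    b"2 3\n"
    + b"cat " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"dog " + struct.pack("<3f", 0.5, 0.25, 0.125) + b"\n"
)
embeddings = read_word2vec_binary(io.BytesIO(demo))
print(sorted(embeddings))
```

For a real file, open it with `open(path, "rb")` and pass the file object. If gensim is installed and the same format assumption holds, `KeyedVectors.load_word2vec_format(path, binary=True)` is likely the more convenient option for files of this size.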
Preprocessed datasets
We also release the text categorization and sentiment analysis datasets in preprocessed form.
Note 1: If you use any of these datasets, please acknowledge the original sources (you can find them in the reference paper).
Note 2: In each class file in the dataset directories, every line corresponds to one instance in the corpus: a phrase, a sentence, or a document, depending on the dataset.
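The one-instance-per-line layout described in Note 2 can be read with a few lines of Python. This is a minimal sketch under stated assumptions: that a dataset directory holds one plain-text file per class, that files are UTF-8 encoded, and that the file name (without extension) can serve as the class label. None of these details are confirmed here, and the function name `load_dataset` and the demo file names are hypothetical.

```python
import os
import tempfile


def load_dataset(directory):
    """Return a list of (text, label) pairs from a dataset directory.

    Assumes one file per class, one instance per line, and uses the
    file name without its extension as the label (an assumption).
    """
    instances = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        label = os.path.splitext(name)[0]
        with open(path, encoding="utf-8") as f:
            for line in f:
                text = line.rstrip("\n")
                if text:  # ignore empty lines
                    instances.append((text, label))
    return instances


# Demo on a temporary directory with two made-up class files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "positive.txt"), "w", encoding="utf-8") as f:
        f.write("great film\nloved it\n")
    with open(os.path.join(tmp, "negative.txt"), "w", encoding="utf-8") as f:
        f.write("bad film\n")
    for text, label in load_dataset(tmp):
        print(label, "|", text)
```

Because the released files are already preprocessed, the sketch keeps each line as-is and leaves tokenization (for example, splitting on whitespace) to the caller.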
Code
The code to run our experiments is available in the following complementary repository: https://github.com/pilehvar/sensecnn
Reference paper
If you use any of these resources, please cite the following paper:
@InProceedings{camacho:preprocessing2018,
author = "Camacho-Collados, Jose and Pilehvar, Mohammad Taher",
title = "On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis",
booktitle = "Proceedings of the EMNLP Workshop on Analyzing and interpreting neural networks for NLP",
year = "2018",
publisher = "Association for Computational Linguistics",
location = "Brussels, Belgium"
}