pedrada88/preproc-textclassification

Text Preprocessing in Neural Text Classification

Jose Camacho Collados and Mohammad Taher Pilehvar

This repository contains the pre-trained word embeddings and preprocessed text classification datasets for the paper "On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis".

Pre-trained word embeddings

We release the 300-dimensional word embeddings used in our experiments as binary (.bin) files. The embeddings were trained on the UMBC corpus with each of the following preprocessing techniques:

  • Vanilla (simple tokenization): Download here [~1.8 GB]
  • Lowercased: Download here [~1.6 GB]
  • Lemmatized: Download here [~1.7 GB]
  • Multiword-grouped: Download here [~2.1 GB]
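To make the four variants above concrete, here is a minimal sketch of what each preprocessing step does to a sentence. It is only an illustration: the regular-expression tokenizer, the `TOY_LEMMAS` dictionary, the `TOY_MULTIWORDS` lexicon and the underscore used to join multiword expressions are all assumptions made for this example, not the tools or conventions used to build the released embeddings (see the reference paper for those).

```python
import re

# Hypothetical toy resources, for illustration only. A real pipeline would
# use a proper lemmatizer and a multiword lexicon instead.
TOY_LEMMAS = {"networks": "network", "were": "be", "trained": "train"}
TOY_MULTIWORDS = {("neural", "networks"), ("new", "york")}


def tokenize(text):
    """Vanilla variant: split into word and punctuation tokens, keeping case."""
    return re.findall(r"\w+|[^\w\s]", text)


def lowercase(tokens):
    """Lowercased variant: fold every token to lower case."""
    return [token.lower() for token in tokens]


def lemmatize(tokens):
    """Lemmatized variant: replace tokens found in the toy dictionary
    (looked up case-insensitively) and leave all other tokens unchanged."""
    return [TOY_LEMMAS.get(token.lower(), token) for token in tokens]


def group_multiwords(tokens):
    """Multiword-grouped variant: merge adjacent token pairs listed in the
    toy lexicon into a single token joined by an underscore (an assumption)."""
    grouped = []
    i = 0
    while i < len(tokens):
        pair = tuple(token.lower() for token in tokens[i:i + 2])
        if len(pair) == 2 and pair in TOY_MULTIWORDS:
            grouped.append("_".join(tokens[i:i + 2]))
            i += 2
        else:
            grouped.append(tokens[i])
            i += 1
    return grouped


if __name__ == "__main__":
    vanilla = tokenize("Neural networks were trained in New York.")
    print(vanilla)
    print(lowercase(vanilla))
    print(lemmatize(vanilla))
    print(group_multiwords(vanilla))
```

Whichever embedding file you download, the input text should be preprocessed the same way as the corpus that file was trained on, so that tokens in your data match entries in the embedding vocabulary.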

Preprocessed datasets

We also release the text categorization and sentiment analysis datasets already preprocessed:

  • Text categorization: Available here
  • Sentiment analysis: Available here

Note 1: If you use any of these datasets, please acknowledge the original sources (you can find them in the reference paper).
Note 2: For each class file in the dataset directories, each line corresponds to an instance in the corpus, be it a phrase, sentence or document (depending on the dataset).
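The layout described in Note 2 can be read with a few lines of Python. The sketch below is a hedged example, not part of the released code: the function name `load_dataset` is hypothetical, and it assumes that each file in a dataset directory holds one class, that the file name (without extension) can serve as the class label, and that the files are UTF-8 encoded. Check these assumptions against the downloaded files before relying on it.

```python
from pathlib import Path


def load_dataset(directory):
    """Read a dataset directory with one file per class and one instance per line.

    Returns two parallel lists: the instance texts and their class labels.
    Assumption: the label is the file name without its extension.
    """
    texts, labels = [], []
    for class_file in sorted(Path(directory).iterdir()):
        if not class_file.is_file():
            continue  # skip sub-directories
        label = class_file.stem
        with class_file.open(encoding="utf-8") as handle:
            for line in handle:
                instance = line.strip()
                if instance:  # ignore blank lines
                    texts.append(instance)
                    labels.append(label)
    return texts, labels
```

For example, `texts, labels = load_dataset("path/to/dataset/train")` would give one entry per line across all class files, where the path is a placeholder for wherever you unpack a dataset.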

Code

The code to run our experiments is available in the following complementary repository: https://github.com/pilehvar/sensecnn

Reference paper

If you use any of these resources, please cite the following paper:

@InProceedings{camacho:preprocessing2018,
  author = 	"Camacho-Collados, Jose and Pilehvar, Mohammad Taher",
  title = 	"On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis",
  booktitle = 	"Proceedings of the EMNLP Workshop on Analyzing and interpreting neural networks for NLP",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  location = 	"Brussels, Belgium"
}
