A simple demo of a bag-of-words based text classifier in python using NTLK.
Mostly ripped off from a great pydata conference talk
To try it out, first install the requirements using pip:
$ pip install -r requirements.txt
You will also need the 'wordnet' and 'stopwords' datasets:
shell$ python >> import nltk >> nltk.download('wordnet') >> nltk.download('stopwords')
You're now set to run the demo script itself:
$ python classifier.py
nltk4: NLTK nytimes5: Nytimes jezebel: NLTK
Indicating that the trained classifier--when given an article about NLTK (data/nltk4.txt), by the New York Times (data/nytimes5.txt), and Jezebel (data/jezebel.txt)--classfies them as an NLTK, NYTimes, and NLTK article respectively.
The two additional preprocessing steps this toy classifier performs that greatly improves it's ability to model a document's classification is that
- stop words are ignored and
- words that share a common lemma are normalized--for example, the existence of the word "run" in one document is treated identically to the word "running" in another, making the model take into account such variations of a word. The process of lemmatization can be read about more extensively on wikipedia, of course.