classifier-demo

A simple demo of a bag-of-words based text classifier in Python using NLTK.

Mostly ripped off from a great PyData conference talk.
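
The gist of the bag-of-words approach: each document is reduced to a dict of the words it contains, and those feature dicts are used to train a classifier. Here is a rough sketch of the idea, using NLTK's Naive Bayes classifier and made-up toy data for illustration (not the actual contents of classifier.py):

import nltk

def bag_of_words(text):
    # Represent a document as a {word: True} presence dict.
    return {word: True for word in text.lower().split()}

# Tiny made-up training set; the real demo trains on the files in data/.
train = [
    (bag_of_words("nltk is a python library for natural language processing"), "NLTK"),
    (bag_of_words("the senate passed the budget bill on tuesday"), "Nytimes"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(bag_of_words("a python library for language processing")))
# -> NLTK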

To try it out, first install the requirements using pip:

$ pip install -r requirements.txt

You will also need the 'wordnet' and 'stopwords' datasets:

$ python
>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('stopwords')

You're now set to run the demo script itself:

$ python classifier.py

Sample output:

nltk4:  NLTK
nytimes5:  Nytimes
jezebel:  NLTK

This indicates that the trained classifier, when given an article about NLTK (data/nltk4.txt), a New York Times article (data/nytimes5.txt), and a Jezebel article (data/jezebel.txt), classifies them as an NLTK, NYTimes, and NLTK article respectively.

This toy classifier performs two additional preprocessing steps that greatly improve its ability to model a document's classification: 1) stop words are ignored, and 2) words that share a common lemma are normalized. For example, the word "run" in one document is treated identically to the word "running" in another, so the model takes such variations of a word into account. The process of lemmatization can be read about more extensively on Wikipedia, of course.
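
For a rough idea of what those two steps look like with the 'stopwords' and 'wordnet' datasets downloaded above (a sketch under simple assumptions, not necessarily how classifier.py implements it):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # 1) drop stop words, 2) collapse inflected forms onto a common lemma
    words = [w for w in text.lower().split() if w not in stop_words]
    # pos='v' makes verb forms like "running" reduce to "run"; the real
    # script may choose the part of speech differently.
    return [lemmatizer.lemmatize(w, pos='v') for w in words]

print(preprocess("the dogs were running in the park"))
# -> ['dog', 'run', 'park']   ("the", "were", "in" are dropped as stop words)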
