# Introduction


**What?** Classifying movie reviews with NLTK



# Import modules

In [2]:
import nltk
from nltk.corpus import movie_reviews

# Project's goal


- Using these corpora, we can build classifiers that will automatically tag new documents with appropriate category labels. 
- First, we construct a list of documents, labeled with the appropriate categories.
- For this example, we’ve chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.



In [4]:
movie_reviews.categories()

['neg', 'pos']

In [5]:
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

In [6]:
len(documents)

2000


- Next, we define a feature extractor for documents, so the classifier knows which aspects of the data to pay attention to (Example 6-2 in Bird et al., 2009).
- For document topic identification, we can define a feature for each word, indicating whether the document contains that word. 
- To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2,000 most frequent words in the overall corpus. 
- We can then define a feature extractor that simply checks whether each of these words is present in a given document.



In [8]:
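all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Caution: in NLTK 3, keys() is in insertion order, not frequency order;
# all_words.most_common(2000) would give the 2,000 most frequent words.
word_features = list(all_words.keys())[:2000]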

In [9]:
word_features 

['plot',
 ':',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 ',',
 'drink',
 'and',
 'then',
 'drive',
 '.',
 'they',
 'get',
 'into',
 'an',
 'accident',
 'one',
 'of',
 'the',
 'guys',
 'dies',
 'but',
 'his',
 'girlfriend',
 'continues',
 'see',
 'him',
 'in',
 'her',
 'life',
 'has',
 'nightmares',
 'what',
 "'",
 's',
 'deal',
 '?',
 'watch',
 'movie',
 '"',
 'sorta',
 'find',
 'out',
 'critique',
 'mind',
 '-',
 'fuck',
 'for',
 'generation',
 'that',
 'touches',
 'on',
 'very',
 'cool',
 'idea',
 'presents',
 'it',
 'bad',
 'package',
 'which',
 'is',
 'makes',
 'this',
 'review',
 'even',
 'harder',
 'write',
 'since',
 'i',
 'generally',
 'applaud',
 'films',
 'attempt',
 'break',
 'mold',
 'mess',
 'with',
 'your',
 'head',
 'such',
 '(',
 'lost',
 'highway',
 '&',
 'memento',
 ')',
 'there',
 'are',
 'good',
 'ways',
 'making',
 'all',
 'types',
 'these',
 'folks',
 'just',
 'didn',
 't',
 'snag',
 'correctly',
 'seem',
 'have',
 'taken',
 'pretty',
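
The listing above starts with the opening words of the first review rather than with common words such as "the". This suggests that `keys()` returned words in insertion order, not by frequency. A minimal sketch of the difference, using `collections.Counter` (which `nltk.FreqDist` subclasses in NLTK 3) and a made-up token list in place of the corpus. The helper names `first_seen_words` and `most_frequent_words` are illustrative assumptions, not part of the original notebook:

```python
from collections import Counter


def first_seen_words(tokens, n):
    """Return the first n distinct words in the order they were encountered."""
    freq = Counter(w.lower() for w in tokens)
    return list(freq.keys())[:n]


def most_frequent_words(tokens, n):
    """Return the n words with the highest counts, most frequent first."""
    freq = Counter(w.lower() for w in tokens)
    return [word for word, _count in freq.most_common(n)]


# Hypothetical token stream standing in for movie_reviews.words().
toy_tokens = ["the", "plot", "is", "bad", "the", "cast", "is", "good", "the", "end"]

print(first_seen_words(toy_tokens, 2))
print(most_frequent_words(toy_tokens, 2))
```

On the real corpus, the equivalent change would be to build `word_features` from `all_words.most_common(2000)`. The outputs recorded in this notebook were produced with `keys()`, so they would differ after such a change.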


In [10]:
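def document_features(document):
    # Use a set so each membership test below is a constant-time lookup.
    document_words = set(document)
    features = {}
    # Record, for every candidate word, whether the document contains it.
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features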
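
To see what the extractor produces without loading the corpus, here is a small self-contained sketch. It passes the vocabulary in as an argument instead of reading the global `word_features`. The name `contains_features` and the toy data are hypothetical and used only for illustration:

```python
def contains_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on whether
    it occurs in the document (a list of tokens)."""
    document_words = set(document)  # set gives fast membership tests
    return {"contains(%s)" % word: (word in document_words) for word in vocabulary}


# Hypothetical vocabulary and review, standing in for word_features
# and a tokenized movie review.
toy_vocabulary = ["plot", "good", "bad"]
toy_review = ["the", "plot", "was", "good"]

toy_features = contains_features(toy_review, toy_vocabulary)
print(toy_features)
```

Passing the vocabulary explicitly makes the function easier to test and to reuse with a different word list.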

In [11]:
document_features(movie_reviews.words('pos/cv957_8737.txt'))

{'contains(plot)': True,
 'contains(:)': True,
 'contains(two)': True,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': True,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': True,
 'contains(drink)': False,
 'contains(and)': True,
 'contains(then)': True,
 'contains(drive)': False,
 'contains(.)': True,
 'contains(they)': True,
 'contains(get)': True,
 'contains(into)': True,
 'contains(an)': True,
 'contains(accident)': False,
 'contains(one)': True,
 'contains(of)': True,
 'contains(the)': True,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': True,
 'contains(his)': True,
 'contains(girlfriend)': True,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': True,
 'contains(in)': True,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': True,
 'contains(nightmares)': False,
 'contains(what)': True,
 "contains(')": True,
 'contains(s)': T


- Now that we’ve defined our feature extractor, we can use it to train a classifier to label new movie reviews.
- To check how reliable the resulting classifier is, we compute its accuracy on the test set.
- We can then use `show_most_informative_features()` to find out which features the classifier found most informative.



In [13]:
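# Turn each (words, label) pair into a (feature dict, label) pair.
featuresets = [(document_features(d), c) for (d, c) in documents]
# Hold out the first 100 feature sets for testing and train on the rest.
# Note: documents was not shuffled, so these 100 are all 'neg' reviews.
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)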

In [14]:
nltk.classify.accuracy(classifier, test_set)

0.78
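
One caveat about this figure: `documents` was built one category at a time and never shuffled, so `featuresets[:100]` contains only 'neg' reviews. The 0.78 therefore measures accuracy on negative reviews alone. A minimal sketch of shuffling before splitting, using toy labeled items in place of the real feature sets. The helper name `shuffled_split` and the fixed seed are illustrative assumptions, not part of the original notebook:

```python
import random


def shuffled_split(items, n_test, seed=0):
    """Return (train, test) after shuffling a copy of items.

    A fixed seed keeps the split reproducible between runs.
    """
    rng = random.Random(seed)
    shuffled = list(items)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled items ordered by category, like the documents list.
toy_labeled = [("doc%d" % i, "neg") for i in range(10)] + \
              [("doc%d" % i, "pos") for i in range(10, 20)]

toy_train, toy_test = shuffled_split(toy_labeled, n_test=5)
print(len(toy_train), len(toy_test))
print(sorted({label for _doc, label in toy_test}))
```

Applied to the notebook, this would mean shuffling `documents` (or `featuresets`) before slicing. The accuracy and feature rankings would then differ from the values recorded here.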

In [15]:
classifier.show_most_informative_features(5)

Most Informative Features
    contains(recognizes) = True              pos : neg    =      8.1 : 1.0
    contains(schumacher) = True              neg : pos    =      7.8 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.8 : 1.0
        contains(turkey) = True              neg : pos    =      6.5 : 1.0
     contains(atrocious) = True              neg : pos    =      6.4 : 1.0



- Apparently, in this corpus, a review that mentions "schumacher" is almost 8 times more likely to be negative than positive, while a review that mentions "recognizes" is about 8 times more likely to be positive than negative.
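
Each ratio in the table compares how often a feature value occurs under one label with how often it occurs under the other in the training set. As a rough sketch of that arithmetic, the code below uses made-up counts and assumes add-0.5 smoothing over the two feature values (True/False), which we believe matches NLTK's default estimator. The function names are hypothetical:

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with add-gamma smoothing.

    count: training documents with this label that have the feature value
    total: all training documents with this label
    bins:  number of possible feature values (True and False here)
    """
    return (count + gamma) / (total + bins * gamma)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the smoothed probabilities under label A versus label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Hypothetical counts: the word appears in 7 of 9 reviews with label A
# and in 1 of 9 reviews with label B.
toy_ratio = likelihood_ratio(7, 9, 1, 9)
print("A : B = %.1f : 1.0" % toy_ratio)
```

Smoothing keeps the ratio finite even when a word never appears under one of the labels.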



# References


- Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, 2009.
- https://github.com/Sturzgefahr/Natural-Language-Processing-with-Python-Analyzing-Text-with-the-Natural-Language-Toolkit

