In [1]:
import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

Sometimes you may choose to ignore certain words entirely. These are called stop words: very frequent words such as "the" and "I". This can be done in several ways. One way is to sort the vocabulary (i.e. the words in the data) by frequency in the training set and treat the top 10-100 entries as stop words. Alternatively, you can use one of the many predefined stop word lists available online or in NLTK. Every instance of these stop words is then removed from both the training and test documents, as if it had never occurred. In most existing text classification applications, however, using a stop word list might not improve performance. Let's see an example.
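
The frequency-based approach described above can be sketched in a few lines. This is only an illustration on made-up data: the function name `frequency_stopwords`, the parameter `top_k`, and the toy documents are assumptions, not part of this notebook's pipeline.

```python
from collections import Counter

def frequency_stopwords(documents, top_k=10):
    """Return the top_k most frequent tokens across a list of tokenized documents."""
    counts = Counter(token.lower() for doc in documents for token in doc)
    return {token for token, _ in counts.most_common(top_k)}

# Toy training set (hypothetical); each document is already tokenized.
training_docs = [
    ["the", "film", "was", "the", "best"],
    ["the", "plot", "was", "dull"],
    ["i", "thought", "the", "ending", "was", "great"],
]

# Derive the stop words from the training data only, then filter the documents.
stop = frequency_stopwords(training_docs, top_k=2)
filtered = [[t for t in doc if t.lower() not in stop] for doc in training_docs]

print(stop)
print(filtered)
```

The stop word list is computed from the training documents only, and the same list would then be applied to the test documents.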

For this example, we will use the stop word list from NLTK. You can import it as follows:

In [2]:
from nltk.corpus import stopwords

Now you will need to write a function that takes as input some text called "words" and outputs a dictionary mapping each feature name (word) to True if the word exists in the data and is not a stop word, similarly to the previous example.

In [3]:
stopset = set(stopwords.words('english'))

def features_without_stopwords(words):
    return {word: True for word in words if word not in stopset}
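
To see what a function like `features_without_stopwords` returns, here is a small self-contained run. It is a sketch under assumptions: `toy_stopset` is a tiny hand-written stand-in for NLTK's English list, and `toy_features` is a hypothetical copy of the function that takes the stop set as a parameter so that it runs without downloading the corpus.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
toy_stopset = {"the", "a", "is", "of", "and"}

def toy_features(words, stop_words=toy_stopset):
    """Map every word that is not a stop word to True (presence features)."""
    return {word: True for word in words if word not in stop_words}

tokens = ["the", "plot", "is", "a", "mess", "and", "The", "ending", "drags"]
features = toy_features(tokens)
print(features)
# {'plot': True, 'mess': True, 'The': True, 'ending': True, 'drags': True}
```

Note that the membership test is case-sensitive: "The" survives because only the lowercase form is in the set, so text that has not been lowercased would need lowercasing first.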

In [5]:
# Note: this is the same code as before. The only thing that changes is the feature extraction method,
# which is user-specified.
def evaluate_classifier(feature_extraction):
    
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')
 
    negreviews = [(feature_extraction(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posreviews = [(feature_extraction(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negsplit = int(len(negreviews)*0.75)
    possplit = int(len(posreviews)*0.75)

    trainingset = negreviews[:negsplit] + posreviews[:possplit]
    testset = negreviews[negsplit:] + posreviews[possplit:]
    
    classifier = NaiveBayesClassifier.train(trainingset)
 
    print('accuracy:', nltk.classify.util.accuracy(classifier, testset))
    classifier.show_most_informative_features()
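
The train/test split inside `evaluate_classifier` can be looked at in isolation. The sketch below reproduces the same slicing on toy data; the helper name `split_by_class` and the integer "examples" are illustrative assumptions.

```python
def split_by_class(examples_by_label, train_fraction=0.75):
    """Split each class separately: first train_fraction for training, rest for testing."""
    train, test = [], []
    for label, examples in examples_by_label.items():
        cut = int(len(examples) * train_fraction)
        train += [(x, label) for x in examples[:cut]]
        test += [(x, label) for x in examples[cut:]]
    return train, test

# Toy data: 8 "reviews" per class, represented by integers.
toy = {"neg": list(range(8)), "pos": list(range(100, 108))}
train, test = split_by_class(toy)

print(len(train), len(test))
print(test)
```

Splitting each class separately keeps the class balance the same in both sets. As in `evaluate_classifier`, the slices are taken in the original order, without shuffling.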

We can evaluate the classifier as follows. Note that this code can take several minutes to run.

In [7]:
evaluate_classifier(features_without_stopwords)

accuracy: 0.724
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0


**Question**

Are stop words important for sentiment analysis? What do you think happened here?