# Sentiment Analysis: Python's NLTK Library

---

- Getting Started With NLTK
    - Installing and Importing
    - Compiling Data
    - Creating Frequency Distributions
    - Extracting Concordance and Collocations
  
  
- Using NLTK’s Pre-Trained Sentiment Analyzer
  
  
- Customizing NLTK’s Sentiment Analysis
    - Selecting Useful Features
    - Training and Using a Classifier
  
  
- Comparing Additional Classifiers
    - Installing and Importing scikit-learn
    - Using scikit-learn Classifiers With NLTK
  
  
- Conclusion

---

### 1. Getting Started With NLTK

The NLTK library contains various utilities that allow you to effectively manipulate and analyze linguistic data. Among its advanced features are text classifiers that you can use for many kinds of classification, including sentiment analysis.

<b>Sentiment analysis</b> is the practice of using algorithms to classify various samples of related text into overall positive and negative categories. With NLTK, you can employ these algorithms through powerful built-in machine learning operations to obtain insights from linguistic data.

In [1]:
import nltk

We installed and imported the library but still need to obtain a few additional resources. Some of them are text samples, and others are data models that certain NLTK functions require.

#### names: 
A list of common English names compiled by Mark Kantrowitz
#### stopwords:
A list of really common words, like articles, pronouns, prepositions, and conjunctions
#### state_union:
A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens
#### twitter_samples: 
A list of social media phrases posted to Twitter
#### movie_reviews: 
Two thousand movie reviews categorized by Bo Pang and Lillian Lee
#### averaged_perceptron_tagger: 
A data model that NLTK uses to categorize words into their part of speech
#### vader_lexicon: 
A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert
#### punkt:
A data model created by Jan Strunk that NLTK uses to split full texts into word lists

In [2]:
nltk.download([
    "names",
    "stopwords",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt",
])

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Mayank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\names.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mayank\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\Mayank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\state_union.zip.
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Mayank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Mayank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Mayank\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       dat

True

Should NLTK require additional resources that you haven’t installed, you’ll see a helpful LookupError with details and instructions to download the resource.

In [5]:
# w = nltk.corpus.shakespeare.words()

In [7]:
nltk.download('shakespeare')

[nltk_data] Downloading package shakespeare to
[nltk_data]     C:\Users\Mayank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\shakespeare.zip.


True

In [2]:
li = nltk.corpus.shakespeare.words('dream.xml')
print(li)



#### Load the State of the Union corpus

In [2]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]       # include only the words that are made up of letters

In [3]:
print(words[:100])

['PRESIDENT', 'HARRY', 'S', 'TRUMAN', 'S', 'ADDRESS', 'BEFORE', 'A', 'JOINT', 'SESSION', 'OF', 'THE', 'CONGRESS', 'April', 'Mr', 'Speaker', 'Mr', 'President', 'Members', 'of', 'the', 'Congress', 'It', 'is', 'with', 'a', 'heavy', 'heart', 'that', 'I', 'stand', 'before', 'you', 'my', 'friends', 'and', 'colleagues', 'in', 'the', 'Congress', 'of', 'the', 'United', 'States', 'Only', 'yesterday', 'we', 'laid', 'to', 'rest', 'the', 'mortal', 'remains', 'of', 'our', 'beloved', 'President', 'Franklin', 'Delano', 'Roosevelt', 'At', 'a', 'time', 'like', 'this', 'words', 'are', 'inadequate', 'The', 'most', 'eloquent', 'tribute', 'would', 'be', 'a', 'reverent', 'silence', 'Yet', 'in', 'this', 'decisive', 'hour', 'when', 'world', 'events', 'are', 'moving', 'so', 'rapidly', 'our', 'silence', 'might', 'be', 'misunderstood', 'and', 'might', 'give', 'comfort', 'to', 'our']


#### Remove stop words from your original word list

In [4]:
stopwords = nltk.corpus.stopwords.words("english")

In [5]:
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [6]:
words = [w for w in words if w.lower() not in stopwords]

In [6]:
print(words[:200])

['PRESIDENT', 'HARRY', 'TRUMAN', 'ADDRESS', 'JOINT', 'SESSION', 'CONGRESS', 'April', 'Mr', 'Speaker', 'Mr', 'President', 'Members', 'Congress', 'heavy', 'heart', 'stand', 'friends', 'colleagues', 'Congress', 'United', 'States', 'yesterday', 'laid', 'rest', 'mortal', 'remains', 'beloved', 'President', 'Franklin', 'Delano', 'Roosevelt', 'time', 'like', 'words', 'inadequate', 'eloquent', 'tribute', 'would', 'reverent', 'silence', 'Yet', 'decisive', 'hour', 'world', 'events', 'moving', 'rapidly', 'silence', 'might', 'misunderstood', 'might', 'give', 'comfort', 'enemies', 'infinite', 'wisdom', 'Almighty', 'God', 'seen', 'fit', 'take', 'us', 'great', 'man', 'loved', 'beloved', 'humanity', 'man', 'could', 'possibly', 'fill', 'tremendous', 'void', 'left', 'passing', 'noble', 'soul', 'words', 'ease', 'aching', 'hearts', 'untold', 'millions', 'every', 'race', 'creed', 'color', 'world', 'knows', 'lost', 'heroic', 'champion', 'justice', 'freedom', 'Tragic', 'fate', 'thrust', 'upon', 'us', 'grave',
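The same filter can be tried without downloading any corpus. The sketch below is only an illustration: `SAMPLE_STOPWORDS` and `remove_stopwords()` are hypothetical names, and the tiny hand-written set stands in for `nltk.corpus.stopwords.words("english")`. It converts the stop words to a `set`, which keeps each membership test fast on long word lists:

```python
# A tiny, hand-written stand-in for NLTK's English stop-word list (assumption).
SAMPLE_STOPWORDS = {"a", "of", "the", "with", "it", "is", "that", "i", "before", "you"}


def remove_stopwords(words, stopwords):
    """Keep alphabetic words whose lowercase form is not a stop word."""
    stopword_set = set(stopwords)  # set lookup is O(1) on average
    return [w for w in words if w.isalpha() and w.lower() not in stopword_set]


sample = ["It", "is", "with", "a", "heavy", "heart", "that", "I", "stand", "before", "you", "."]
filtered = remove_stopwords(sample, SAMPLE_STOPWORDS)
print(filtered)
```

With the real corpus, passing `nltk.corpus.stopwords.words("english")` as the second argument should behave the same way as the list comprehension used in this notebook.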

#### Word tokenization

In [7]:
from pprint import pprint

In [8]:
text = """
For some quick analysis, creating a corpus could be overkill.
If all you need is a word list,
there are simpler ways to achieve that goal."""
print(nltk.word_tokenize(text))

['For', 'some', 'quick', 'analysis', ',', 'creating', 'a', 'corpus', 'could', 'be', 'overkill', '.', 'If', 'all', 'you', 'need', 'is', 'a', 'word', 'list', ',', 'there', 'are', 'simpler', 'ways', 'to', 'achieve', 'that', 'goal', '.']


In [9]:
pprint(nltk.word_tokenize(text), width=89, compact=True)

['For', 'some', 'quick', 'analysis', ',', 'creating', 'a', 'corpus', 'could', 'be',
 'overkill', '.', 'If', 'all', 'you', 'need', 'is', 'a', 'word', 'list', ',', 'there',
 'are', 'simpler', 'ways', 'to', 'achieve', 'that', 'goal', '.']
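If the punkt resource isn’t available, a rough approximation of word tokenization can be written with the standard library. This is a sketch under stated assumptions, not the algorithm that `nltk.word_tokenize()` uses: `simple_tokenize()` is a hypothetical helper that splits on runs of word characters and single punctuation marks, so contractions such as "don't" will come out differently than they do in NLTK:

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("If all you need is a word list, there are simpler ways.")
print(tokens)
```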


#### To build a frequency distribution with NLTK, instantiate nltk.FreqDist with a word list

In [10]:
words_text = nltk.word_tokenize(text)
fd = nltk.FreqDist(words)       # built from the State of the Union word list in words; words_text is not used here

In [11]:
fd.most_common(3)

[('must', 1568), ('people', 1291), ('world', 1128)]

In [12]:
fd.tabulate(3)

  must people  world 
  1568   1291   1128 
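Because `nltk.FreqDist` is a subclass of Python’s `collections.Counter`, the counting behavior can be explored with the standard library alone. The sketch below uses a made-up word list (`words_sample` is a hypothetical name) rather than the State of the Union corpus:

```python
from collections import Counter

# A made-up word list, for illustration only.
words_sample = ["must", "people", "must", "world", "people", "must"]
counts = Counter(words_sample)

# most_common() works the same way on a Counter as on a FreqDist.
print(counts.most_common(2))

# Looking up a word that never occurs returns 0 instead of raising KeyError.
print(counts["must"], counts["missing"])
```

Methods such as `.tabulate()` are additions that NLTK makes on top of `Counter`, so they aren’t available in this plain version.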


#### You can use frequency distributions to query particular words

In [15]:
print(fd["America"])
print(fd["america"])
print(fd["AMERICA"])      # number of times each word occurs exactly as given.

1076
0
3


#### You can also iterate over them to perform custom analysis on word properties

In [13]:
# creating a new frequency distribution that’s based on the initial one but normalizes all words to lowercase
lower_fd = nltk.FreqDist([w.lower() for w in words])

In [14]:
lower_fd.most_common(3)

[('must', 1569), ('people', 1313), ('world', 1213)]

In [15]:
lower_from_fd = nltk.FreqDist([w.lower() for w in fd])     # iterating over fd yields each distinct word once, so this counts case variants, not occurrences

In [16]:
lower_from_fd.most_common(3)

[('world', 3), ('year', 3), ('new', 3)]
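The two results differ because iterating over a frequency distribution yields each distinct word once, while iterating over the word list yields every occurrence. The sketch below reproduces the effect with a plain `collections.Counter` and invented counts (`fd_sample` is a hypothetical name), and shows one way to merge case variants while keeping the real frequencies:

```python
from collections import Counter

# Invented counts, for illustration only.
fd_sample = Counter({"America": 5, "AMERICA": 2, "world": 4, "World": 1})

# Iterating over a Counter yields each distinct key once, so this counts
# how many case variants exist, not how often the word occurs.
variants = Counter(w.lower() for w in fd_sample)

# Summing the stored counts keeps the real frequencies.
merged = Counter()
for word, count in fd_sample.items():
    merged[word.lower()] += count

print(variants)
print(merged)
```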

#### Extracting Concordance and Collocations

In the context of NLP, a concordance is a collection of word locations along with their context. You can use concordances to find:
1. How many times a word appears
2. Where each occurrence appears
3. What words surround each occurrence
  
  
In NLTK, you can do this by calling .concordance(). To use it, you need an instance of the nltk.Text class, which can also be constructed with a word list.

In [17]:
text = nltk.Text(nltk.corpus.state_union.words())

In [18]:
text.concordance('america', lines=5)                      # ignores case of words

Displaying 5 of 1079 matches:
 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
 to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to 


In [19]:
concordance_list = text.concordance_list('america', lines=3)

In [20]:
for entry in concordance_list:
    print(entry.line)

 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom


.concordance_list() gives you a list of ConcordanceLine objects, which contain information about where each word occurs as well as a few more properties worth exploring. The list is also sorted in order of appearance.
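To make the idea of a concordance concrete, here is a minimal pure-Python sketch. It is not how `nltk.Text` implements `.concordance()`; `find_concordance()` and its `window` parameter are hypothetical names, and the context is measured in tokens rather than in characters:

```python
def find_concordance(tokens, target, window=3):
    """Return (index, left context, match, right context) for each case-insensitive match."""
    target = target.lower()
    matches = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            matches.append((i, " ".join(left), token, " ".join(right)))
    return matches


tokens = "Here in America we have labored long and America will continue the fight".split()
for index, left, match, right in find_concordance(tokens, "america"):
    print(f"{index:>3}: {left} [{match}] {right}")
```

The length of the returned list answers how many times the word appears, the index answers where, and the two context strings show what surrounds each occurrence.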
  
  
The nltk.Text class itself has a few other interesting features. One of them is .vocab(), which is worth mentioning because it creates a frequency distribution for a given text.

Check out how quickly you can create a custom nltk.Text instance and an accompanying frequency distribution:

In [21]:
words = nltk.word_tokenize(
    """Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex."""
)
text = nltk.Text(words)
fd = text.vocab()

In [22]:
fd.most_common(4)

[('is', 3), ('better', 3), ('than', 3), ('.', 3)]

In [23]:
fd.tabulate(11)

       is    better      than         . Beautiful      ugly  Explicit  implicit    Simple   complex 
        3         3         3         3         1         1         1         1         1         1 


.vocab() is essentially a shortcut to create a frequency distribution from an instance of nltk.Text. That way, you don’t have to make a separate call to instantiate a new nltk.FreqDist object.

#### Collocations
Collocations are series of words that frequently appear together in a given text. In the State of the Union corpus, for example, you’d expect to find the words United and States appearing next to each other very often. Those two words appearing together is a collocation.

Collocations can be made up of two or more words. NLTK provides classes to handle several types of collocations:

- Bigrams: Frequent two-word combinations
- Trigrams: Frequent three-word combinations
- Quadgrams: Frequent four-word combinations

In [23]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

In [24]:
tri_collocations = nltk.collocations.TrigramCollocationFinder.from_words(words)

One of the most useful tools that collocation finders offer is the ngram_fd property. This property holds a frequency distribution that is built for each collocation rather than for individual words.

In [25]:
tri_collocations.ngram_fd.most_common(4)

[(('the', 'United', 'States'), 294),
 (('the', 'American', 'people'), 185),
 (('of', 'the', 'world'), 154),
 (('of', 'the', 'United'), 145)]

In [26]:
words_lower = [w.lower() for w in nltk.corpus.state_union.words() if w.isalpha()]
tri_collocations_lower = nltk.collocations.TrigramCollocationFinder.from_words(words_lower)
tri_collocations_lower.ngram_fd.most_common(4)

[(('the', 'united', 'states'), 327),
 (('the', 'american', 'people'), 208),
 (('the', 'state', 'of'), 171),
 (('to', 'the', 'congress'), 164)]
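Counting n-grams by raw frequency, as the outputs above do, can be sketched with the standard library. This is an illustration under stated assumptions, not NLTK’s implementation: `count_ngrams()` is a hypothetical helper and the sample sentence is made up:

```python
from collections import Counter


def count_ngrams(words, n):
    """Count every run of n consecutive words."""
    # zip() over n shifted copies of the list yields each consecutive n-tuple.
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(ngrams)


sample = "the United States and the United Nations and the United States".split()
bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)

print(bigrams[("the", "United")])
print(trigrams[("the", "United", "States")])
```

NLTK’s collocation finders also provide scoring measures beyond raw counts, which this sketch doesn’t attempt to reproduce.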

---

### 2. Using NLTK’s Pre-Trained Sentiment Analyzer

NLTK already has a built-in, pretrained sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).

Since VADER is pretrained, you can get results more quickly than with many other analyzers. However, VADER is best suited for language used in social media, like short sentences with some slang and abbreviations. It’s less accurate when rating longer, structured sentences, but it’s often a good launching point.

In [27]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
score = sia.polarity_scores("Wow! NLTK is really hulk")

The negative, neutral, and positive scores are related: They all add up to 1 and can’t be negative. The compound score is calculated differently. It’s not just an average, and it can range from -1 to 1.

In [28]:
print(score)
print(type(score))
print(score['pos'])

{'neg': 0.0, 'neu': 0.494, 'pos': 0.506, 'compound': 0.6239}
<class 'dict'>
0.506


Put it to the test against real data using two different corpora.

In [29]:
twitter_samples = nltk.corpus.twitter_samples.strings()

In [30]:
type(twitter_samples)

list

In [31]:
twitter_samples[5:8]

["oh god, my babies' faces :( https://t.co/9fcwGvaki0",
 '@RileyMcDonough make me smile :((',
 '@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on search :( http://t.co/XvmTUikWln']

In [30]:
tweets = [t.replace('://', '//') for t in nltk.corpus.twitter_samples.strings()]

In [31]:
tweets[5:8]

["oh god, my babies' faces :( https//t.co/9fcwGvaki0",
 '@RileyMcDonough make me smile :((',
 '@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on search :( http//t.co/XvmTUikWln']

Notice that you use a different corpus method, .strings(), instead of .words(). This gives you a list of raw tweets as strings.
  
  
Now use the .polarity_scores() method of your SentimentIntensityAnalyzer instance to classify tweets:

In [32]:
def is_positive(tweet):
    """Returns True if the compound score is > 0, else returns False"""
    return sia.polarity_scores(tweet)['compound'] > 0

In [35]:
from random import shuffle

shuffle(tweets)
tweets[0:3]

['RT @NicolaSturgeon: If Miliband is going to let Tories in rather than work with SNP, we will definitely need lots of SNP MPs to protect Sco…',
 'An Asian man is wearing tweed. Farage is confused. Frightened. Faints.\n#AskFarage',
 "@iamsrk Lol. That look's like a scary room! Ghost story or murder mystery? Either way, just try it and get done asap :p"]

In [36]:
shuffle(tweets)
for tweet in tweets[:5]:
    print('>', is_positive(tweet))
tweets[:5]

> True
> True
> True
> False
> True


["@_Ms_R lol. it's 10:16 right now. when you just wrote. :D",
 'My Google+ account. :) http//t.co/R8jyDxlQyo',
 '@KaReeMLSheNawY it is not even a real word :D',
 'Watch: Ed Miliband stumbles off stage @KateBaldwin18 https//t.co/uM1h8Q70Vg #bbcqt #GE2015',
 'RT @BBCJamesCook: Ed Miliband says he\'d rather not have a Labour government than do a "confidence and supply" deal with the SNP. #GE2015']

In this case, is_positive() uses only the positivity of the compound score to make the call. You can choose any combination of VADER scores to tweak the classification to your needs.
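One possible tweak is to add a neutral band around zero instead of treating every positive compound score as positive. The sketch below works on a plain score dictionary, so it runs without NLTK. `label_from_scores()` is a hypothetical helper, and the default cutoff of 0.05 is an assumption based on a commonly cited convention for VADER, so treat it as a starting value to tune:

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dictionary to 'pos', 'neg', or 'neu'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# The dictionary printed earlier for "Wow! NLTK is really hulk".
example = {"neg": 0.0, "neu": 0.494, "pos": 0.506, "compound": 0.6239}
print(label_from_scores(example))

# The neg, neu, and pos scores add up to 1 (up to rounding).
print(round(example["neg"] + example["neu"] + example["pos"], 3))
```

In the notebook, the argument would be the result of `sia.polarity_scores(tweet)`.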

The second corpus is movie_reviews. The special thing about this corpus is that it’s already been classified. Therefore, you can use it to judge the accuracy of the algorithms you choose when rating similar texts.
  
  
Keep in mind that VADER is likely better at rating tweets than it is at rating long movie reviews. To get better results, you’ll set up VADER to rate individual sentences within the review rather than the entire text.
  
  
Since VADER needs raw strings for its rating, you can’t use .words() like you did earlier. Instead, make a list of the file IDs that the corpus uses, which you can use later to reference individual reviews:

In [33]:
type(nltk.corpus.movie_reviews.fileids())

list

In [34]:
nltk.corpus.movie_reviews.fileids()[0:5]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt']

In [35]:
nltk.corpus.movie_reviews.raw('neg/cv000_29416.txt')

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

In [36]:
positive_movie_reviews = nltk.corpus.movie_reviews.fileids(categories = ['pos'])

.fileids() exists in most, if not all, corpora. In the case of movie_reviews, each file corresponds to a single review. Note also that you’re able to filter the list of file IDs by specifying categories. This categorization is a feature specific to this corpus and others of the same type.

In [37]:
type(positive_movie_reviews)

list

In [38]:
positive_movie_reviews[0:5]

['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt']

In [39]:
negative_movie_reviews = nltk.corpus.movie_reviews.fileids(categories=['neg'])
all_review_ids = positive_movie_reviews + negative_movie_reviews

In [40]:
from statistics import mean

def is_positive(review_id):
    """True if the average of all sentence compound scores is positive."""
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [sia.polarity_scores(sent)['compound'] for sent in nltk.sent_tokenize(text)]
    return mean(scores) > 0

.raw() is another method that exists in most corpora. By specifying a file ID or a list of file IDs, you can obtain specific data from the corpus.

In [42]:
from random import shuffle
shuffle(all_review_ids)

correct = 0
for review_id in all_review_ids:
    if is_positive(review_id):
        if review_id in positive_movie_reviews:
            correct += 1
    else:
        if review_id in negative_movie_reviews:
            correct += 1

print(F"{correct / len(all_review_ids):.2%} correct")

64.00% correct


After rating all reviews, only 64 percent were correctly classified by VADER using the logic defined in is_positive().
  
  
A 64 percent accuracy rating isn’t great, but it’s a start. Have a little fun tweaking is_positive() to see if you can increase the accuracy.
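When tweaking is_positive(), it can help to see which kind of mistake dominates, not only the overall percentage. The following sketch is a hypothetical helper (`evaluate()` is not part of NLTK) that takes two parallel lists of labels and returns the accuracy together with a count of every (actual, predicted) pair. The labels are invented for illustration:

```python
from collections import Counter


def evaluate(predicted, actual):
    """Return overall accuracy and a Counter of (actual, predicted) pairs."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = Counter(zip(actual, predicted))
    correct = sum(count for (gold, guess), count in pairs.items() if gold == guess)
    return correct / len(actual), pairs


# Hypothetical labels for six reviews, for illustration only.
actual = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "pos", "pos", "pos", "neg"]
accuracy, pairs = evaluate(predicted, actual)

print(f"{accuracy:.2%} correct")
print(pairs[("neg", "pos")], "negative reviews were labeled positive")
```

If most errors turn out to be negative reviews labeled positive, raising the cutoff in is_positive() above 0 is one adjustment worth trying.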
  
  
In the next section, you’ll build a custom classifier that allows you to use additional features for classification and eventually increase its accuracy to an acceptable level.

---

### 3. Customizing NLTK’s Sentiment Analysis

NLTK offers a few built-in classifiers that are suitable for various types of analyses, including sentiment analysis. The trick is to figure out which properties of your dataset are useful in classifying each piece of data into your desired categories.
  
  
In the world of machine learning, these data properties are known as features, which you must reveal and select as you work with your data.

#### Selecting Useful Features

Since you’ve learned how to use frequency distributions, why not use them as a launching point for an additional feature?
   
   
By using the predefined categories in the movie_reviews corpus, you can create sets of positive and negative words, then determine which ones occur most frequently across each set. Begin by excluding unwanted words and building the initial category groups:

In [45]:
unwanted = [w for w in nltk.corpus.stopwords.words('english')]
# Movie reviews are likely to have lots of actor names, which shouldn’t be part of
# your feature sets, so add words from the names corpus to the unwanted list.
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])
print(unwanted)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [46]:
def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith('NN'):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted, 
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=['pos'])))]
negative_words = [word for word, tag in filter(
    skip_unwanted, 
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=['neg'])))]

In [47]:
positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

In [48]:
positive_fd.most_common(20)

[('one', 2821),
 ('like', 1799),
 ('good', 1231),
 ('also', 1200),
 ('even', 1179),
 ('well', 1094),
 ('much', 1038),
 ('would', 1019),
 ('first', 1004),
 ('two', 999),
 ('get', 850),
 ('best', 828),
 ('many', 780),
 ('make', 779),
 ('really', 777),
 ('little', 774),
 ('great', 751),
 ('new', 723),
 ('never', 721),
 ('could', 636)]

In [49]:
negative_fd.most_common(20)

[('one', 2495),
 ('like', 1886),
 ('even', 1386),
 ('good', 1146),
 ('would', 1090),
 ('bad', 1034),
 ('much', 1011),
 ('get', 990),
 ('two', 912),
 ('first', 832),
 ('make', 824),
 ('could', 791),
 ('really', 781),
 ('also', 767),
 ('well', 761),
 ('little', 726),
 ('never', 653),
 ('know', 640),
 ('big', 597),
 ('new', 569)]

Since many words are present in both positive and negative sets, begin by finding the common set so you can remove it from the distribution objects.

In [51]:
type(positive_fd)

nltk.probability.FreqDist

In [50]:
common_set = set(positive_fd).intersection(negative_fd)

In [51]:
for word in common_set:
    del positive_fd[word]    
    del negative_fd[word]

In [52]:
positive_fd.most_common(10)

[('shrek', 23),
 ('fei', 22),
 ('ordell', 20),
 ('soviet', 16),
 ('kimble', 16),
 ('en', 14),
 ('addresses', 14),
 ('lovingly', 14),
 ('nello', 14),
 ('horned', 13)]

In [53]:
negative_fd.most_common(10)

[('battlefield', 18),
 ('sphere', 18),
 ('nbsp', 18),
 ('heckerling', 15),
 ('spawn', 13),
 ('incoherent', 13),
 ('degenerates', 13),
 ('schumacher', 12),
 ('autistic', 12),
 ('horrid', 10)]
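The removal of shared words can be tried on toy data without the corpus. This sketch uses plain `collections.Counter` objects and a hypothetical helper, `top_unique_words()`. Unlike the `del` loop above, it builds a new counter and leaves the originals untouched:

```python
from collections import Counter


def top_unique_words(own_counts, other_counts, n):
    """Return the n most frequent words in own_counts that never occur in other_counts."""
    unique = Counter({w: c for w, c in own_counts.items() if w not in other_counts})
    return [word for word, _ in unique.most_common(n)]


# Toy counts, invented for illustration.
pos_counts = Counter({"good": 5, "lovingly": 3, "one": 9, "delightful": 2})
neg_counts = Counter({"good": 4, "horrid": 3, "one": 8, "incoherent": 2})

print(top_unique_words(pos_counts, neg_counts, 2))
print(top_unique_words(neg_counts, pos_counts, 2))
```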

Once you’re left with unique positive and negative words in each frequency distribution object, you can finally build sets from the most common words in each distribution. The number of words in each set is something you could tweak in order to determine its effect on sentiment analysis.

In [54]:
top_100_positive = [word for word, count in positive_fd.most_common(100)]
top_100_negative = [word for word, count in negative_fd.most_common(100)]

In [57]:
top_100_negative[:10]

['battlefield',
 'sphere',
 'nbsp',
 'heckerling',
 'spawn',
 'incoherent',
 'degenerates',
 'schumacher',
 'autistic',
 'horrid']

In [58]:
top_100_positive[:10]

['shrek',
 'fei',
 'ordell',
 'soviet',
 'kimble',
 'en',
 'addresses',
 'lovingly',
 'nello',
 'horned']

This is one example of a feature you can extract from your data, and it’s far from perfect. Looking closely at these sets, you’ll notice some uncommon names and words that aren’t necessarily positive or negative. 
  
Additionally, the other NLTK tools you’ve learned so far can be useful for building more features. One possibility is to leverage collocations that carry positive meaning, like the bigram “thumbs up!”

Here’s how you can set up the positive and negative bigram finders:

In [59]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

positive_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
    if w.isalpha() and w not in unwanted
])
negative_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
    if w.isalpha() and w not in unwanted
])

The rest is up to you! Try different combinations of features, think of ways to use the negative VADER scores, create ratios, polish the frequency distributions. The possibilities are endless!

#### Training and Using a Classifier

With your new feature set ready to use, the first prerequisite for training a classifier is to define a function that will extract features from a given piece of data.

extract_features() should return a dictionary, and it will create three features for each piece of text:
- The average compound score
- The average positive score
- The number of words in the text that are also part of the top 100 words in all positive reviews

In [56]:
def extract_features(text):
    features = dict()
    word_count = 0
    compound_scores = list()
    positive_scores = list()
    
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if word.lower() in top_100_positive:
                word_count += 1
        scores = sia.polarity_scores(sent)      # score each sentence once and reuse the result
        compound_scores.append(scores['compound'])
        positive_scores.append(scores['pos'])
    
    # Adding 1 to the final compound score to always have positive numbers
    # since some classifiers you'll use later don't work with negative numbers.
    features['wordcount'] = word_count
    features['mean_compound'] = mean(compound_scores) + 1
    features['mean_positive'] = mean(positive_scores)
    
    return features

In order to train and evaluate a classifier, you’ll need to build a list of features for each text you’ll analyze:

In [57]:
features = [
    (extract_features(nltk.corpus.movie_reviews.raw(review_id)), 'pos') 
    for review_id in nltk.corpus.movie_reviews.fileids(categories=['pos'])
]

features.extend([
    (extract_features(nltk.corpus.movie_reviews.raw(review_id)), 'neg') 
    for review_id in nltk.corpus.movie_reviews.fileids(categories=['neg'])
])

Training the classifier involves splitting the feature set so that one portion can be used for training and the other for evaluation, then calling .train():

In [62]:
train_count = len(features) // 4

In [66]:
print('features count:', len(features))
print('training count:', train_count)
print('evaluation count:', len(features) - train_count)

features count: 2000
training count: 500
evaluation count: 1500
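Because every call to shuffle() produces a different split, accuracy figures will vary from run to run. If you want a repeatable split, one option is to shuffle a copy with a seeded random generator. The sketch below is a hypothetical helper (`split_features()` and its parameters are not part of NLTK) and uses placeholder data in place of the real feature list:

```python
from random import Random


def split_features(features, train_fraction=0.25, seed=0):
    """Shuffle a copy of features and split it into (train, test) lists."""
    shuffled = list(features)
    Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder (feature dict, label) pairs standing in for the real list.
toy_features = [({"wordcount": i}, "pos" if i % 2 else "neg") for i in range(2000)]
train, test = split_features(toy_features)
print(len(train), len(test))
```

The default `train_fraction` of 0.25 mirrors the `len(features) // 4` used in this notebook; a larger training share is a common alternative.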


In [67]:
shuffle(features)
classifier = nltk.NaiveBayesClassifier.train(features[:train_count])

In [68]:
classifier.most_informative_features(10)

[('wordcount', 3),
 ('wordcount', 2),
 ('wordcount', 4),
 ('wordcount', 5),
 ('wordcount', 0),
 ('wordcount', 1),
 ('mean_positive', 0.11716666666666667),
 ('mean_positive', 0.159),
 ('mean_compound', 0.659528),
 ('mean_compound', 0.6766466666666666)]

In [86]:
shuffle(features)
classifier = nltk.NaiveBayesClassifier.train(features[:train_count])
nltk.classify.accuracy(classifier, features[train_count:])

0.6666666666666666

Adding a single feature has marginally improved VADER’s initial accuracy, from 64 percent to 67 percent. More features could help, as long as they truly indicate how positive a review is. You can use classifier.show_most_informative_features() to determine which features are most indicative of a specific property.
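nltk.NaiveBayesClassifier treats every distinct feature value as its own category, which is why the list of informative features above contains individual floats such as 0.159. One idea worth testing is to group the float features into coarse bins before training. The sketch below is a hypothetical helper (`bin_features()` and the bin width of 0.1 are assumptions), and whether it improves accuracy on this dataset is untested:

```python
def bin_features(feature_dict, width=0.1):
    """Replace float features with the index of the width-sized bin they fall into."""
    binned = {}
    for name, value in feature_dict.items():
        if isinstance(value, float):
            binned[name] = int(value // width)  # bin index, e.g. 0.159 -> 1
        else:
            binned[name] = value  # integer features such as wordcount are kept
    return binned


example = {"wordcount": 3, "mean_compound": 0.659528, "mean_positive": 0.159}
print(bin_features(example))
```

To try it, you could apply the helper to the first item of each tuple in the features list before calling .train().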

Feature engineering is a big part of improving the accuracy of a given algorithm, but it’s not the whole story. 
  
Another strategy is to use and compare different classifiers.

### 4. Comparing Additional Classifiers

In [58]:
import sklearn

NLTK provides a class that can use most classifiers from the popular machine learning framework scikit-learn.

Many of the classifiers that scikit-learn provides can be instantiated quickly since they have defaults that often work well. In this section, you’ll learn how to integrate them within NLTK to classify linguistic data.

The following imports cover a subset of the classifiers available to you. These will work within NLTK for sentiment analysis:

In [59]:
from sklearn.naive_bayes import (
    BernoulliNB,
    ComplementNB,
    MultinomialNB,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

To aid in accuracy evaluation, it’s helpful to have a mapping of classifier names and their instances:

In [60]:
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "ComplementNB": ComplementNB(),
    "MultinomialNB": MultinomialNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "LogisticRegression": LogisticRegression(),
    "MLPClassifier": MLPClassifier(max_iter=1000),
    "AdaBoostClassifier": AdaBoostClassifier(),
}

#### Using scikit-learn Classifiers With NLTK

Since NLTK allows you to integrate scikit-learn classifiers directly into its own classifier class, the training and classification processes will use the same methods you’ve already seen, .train() and .classify().

The features list contains tuples whose first item is a set of features given by extract_features(), and whose second item is the classification label from preclassified data in the movie_reviews corpus.

In [61]:
train_count = len(features) // 4
shuffle(features)

In [64]:
for name, sklearn_classifier in classifiers.items():
    classifier = nltk.classify.SklearnClassifier(sklearn_classifier)
    classifier.train(features[:train_count])
    accuracy = nltk.classify.accuracy(classifier, features[train_count:])
    print(F"{accuracy:.2%} - {name}")

66.60% - BernoulliNB
66.33% - ComplementNB
65.87% - MultinomialNB
69.60% - KNeighborsClassifier
63.60% - DecisionTreeClassifier
68.93% - RandomForestClassifier
70.87% - LogisticRegression
71.93% - MLPClassifier
69.67% - AdaBoostClassifier


For each scikit-learn classifier, call nltk.classify.SklearnClassifier to create a usable NLTK classifier that can be trained and evaluated exactly as you’ve seen before with nltk.NaiveBayesClassifier and NLTK’s other built-in classifiers. The .train() method and the nltk.classify.accuracy() function should receive different portions of the same list of features.
  
  
Now you’ve reached nearly 72 percent accuracy before even adding a second feature! While this doesn’t mean that the MLPClassifier will continue to be the best one as you engineer new features, having additional classification algorithms at your disposal is clearly advantageous.
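When comparing many classifiers, it can be easier to collect the accuracies in a dictionary and sort them than to read them off the printed list. The sketch below copies the figures from the run shown above into a hypothetical `results` dictionary. A new shuffle would produce different numbers, so the ranking is only illustrative:

```python
# Accuracies copied from the run printed above; a new shuffle will change them.
results = {
    "BernoulliNB": 0.6660,
    "ComplementNB": 0.6633,
    "MultinomialNB": 0.6587,
    "KNeighborsClassifier": 0.6960,
    "DecisionTreeClassifier": 0.6360,
    "RandomForestClassifier": 0.6893,
    "LogisticRegression": 0.7087,
    "MLPClassifier": 0.7193,
    "AdaBoostClassifier": 0.6967,
}

# Sort from most to least accurate.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    print(f"{accuracy:.2%} - {name}")

best_name, best_accuracy = ranked[0]
```

In the training loop, you could fill such a dictionary with `results[name] = accuracy` instead of printing each value.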

### Conclusion

You’re now familiar with the features of NLTK that allow you to process text into objects that you can filter and manipulate, which allows you to analyze text data to gain information about its properties. You can also use different classifiers to perform sentiment analysis on your data and gain insights about how your audience is responding to content.

In this tutorial, you learned how to:

- Split and filter text data in preparation for analysis
- Analyze word frequency
- Find concordance and collocations using different methods
- Perform quick sentiment analysis with NLTK’s built-in VADER
- Define features for custom classification
- Use and compare classifiers from scikit-learn for sentiment analysis within NLTK