<a href="https://colab.research.google.com/github/mcgmed/Nautral-Language-Processing/blob/main/NLTK_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import nltk
nltk.download()

NLTK will display a download manager showing all available and installed resources. Here are the ones you’ll need to download for this tutorial:

*   names: A list of common English names compiled by Mark Kantrowitz
*   stopwords: A list of really common words, like articles, pronouns, prepositions, and conjunctions
*   state_union: A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens
*   twitter_samples: A list of social media phrases posted to Twitter
*   movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee
*   averaged_perceptron_tagger: A data model that NLTK uses to categorize words into their part of speech
*   vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert
*   punkt: A data model created by Jan Strunk that NLTK uses to split full texts into word lists

A quick way to download specific resources directly from the console is to pass a list to nltk.download():

In [1]:
import nltk
nltk.download(["names", "stopwords", "state_union", "twitter_samples", "movie_reviews", "averaged_perceptron_tagger", "vader_lexicon", "punkt",])

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package state_union to /root/nltk_data...
[nltk_data]   Unzipping corpora/state_union.zip.
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Compiling Data

In [2]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

Note that you build a list of individual words with the corpus’s .words() method, but you use str.isalpha() to include only the words that are made up of letters. Otherwise, your word list may end up with “words” that are only punctuation marks.

Since all words in the stopwords list are lowercase, and those in the original list may not be, you use str.lower() to account for any discrepancies. Otherwise, you may end up with mixedCase or capitalized stop words still in your list.

In [3]:
stopwords = nltk.corpus.stopwords.words("english")
words = [w for w in words if w.lower() not in stopwords]

pprint() prints complex data structures. The normal print() function prints the entire content in a single line. This is fine if the printed content is small in length and is not a complex data structure. But the output will become difficult to read if the content is a complex data structure like a complex json or a long content.

In [4]:
from pprint import pprint

text = """For some quick analysis, creating a corpus could be overkill.
          If all you need is a word list, there are simpler ways to achieve that goal."""
pprint(nltk.word_tokenize(text), width=79, compact=True)

['For', 'some', 'quick', 'analysis', ',', 'creating', 'a', 'corpus', 'could',
 'be', 'overkill', '.', 'If', 'all', 'you', 'need', 'is', 'a', 'word', 'list',
 ',', 'there', 'are', 'simpler', 'ways', 'to', 'achieve', 'that', 'goal', '.']


In [5]:
words = [word for word in nltk.word_tokenize(text) if word.isalpha()]
words

['For',
 'some',
 'quick',
 'analysis',
 'creating',
 'a',
 'corpus',
 'could',
 'be',
 'overkill',
 'If',
 'all',
 'you',
 'need',
 'is',
 'a',
 'word',
 'list',
 'there',
 'are',
 'simpler',
 'ways',
 'to',
 'achieve',
 'that',
 'goal']

## Creating Frequency Distributions

In [6]:
words = [word for word in nltk.word_tokenize(text) if word.isalpha()]
fd = nltk.FreqDist(words)
fd.most_common(3)

[('a', 2), ('For', 1), ('some', 1)]

In [7]:
fd

FreqDist({'a': 2, 'For': 1, 'some': 1, 'quick': 1, 'analysis': 1, 'creating': 1, 'corpus': 1, 'could': 1, 'be': 1, 'overkill': 1, ...})

In [8]:
fd.tabulate()

       a      For     some    quick analysis creating   corpus    could       be overkill       If      all      you     need       is     word     list    there      are  simpler     ways       to  achieve     that     goal 
       2        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1 


In [9]:
fd['a']

2

In [10]:
fd['For']

1

In [11]:
fd['one']

0

In [12]:
for w in fd:
  print(w)

a
For
some
quick
analysis
creating
corpus
could
be
overkill
If
all
you
need
is
word
list
there
are
simpler
ways
to
achieve
that
goal


## Extracting Concordance and Collocations

Before invoking .concordance(), build a new word list from the original corpus text so that all the context, even stop words, will be there:

In [13]:
text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)

Displaying 5 of 1079 matches:
 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
 to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to 


Since .concordance() only prints information to the console, it’s not ideal for data manipulation. To obtain a usable list that will also give you information about the location of each occurrence, use .concordance_list():

In [14]:
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
  print(entry)

ConcordanceLine(left=['looked', 'forward', 'and', 'moved', 'forward', '.', 'That', 'is', 'what', 'he', 'would', 'want', 'us', 'to', 'do', '.', 'That', 'is', 'what'], query='America', right=['will', 'do', '.', 'So', 'much', 'blood', 'has', 'already', 'been', 'shed', 'for', 'the', 'ideals', 'which', 'we', 'cherish', ',', 'and'], offset=242, left_print=' would want us to do . That is what', right_print='will do . So much blood has already', line=' would want us to do . That is what America will do . So much blood has already')
ConcordanceLine(left=['even', 'a', 'momentary', 'pause', 'in', 'the', 'hard', 'fight', 'for', 'victory', '.', 'Today', ',', 'the', 'entire', 'world', 'is', 'looking', 'to'], query='America', right=['for', 'enlightened', 'leadership', 'to', 'peace', 'and', 'progress', '.', 'Such', 'a', 'leadership', 'requires', 'vision', ',', 'courage', 'and', 'tolerance', '.'], offset=294, left_print='ay , the entire world is looking to', right_print='for enlightened leadership to p

In [15]:
for entry in concordance_list:
  print(entry.line)

 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace


In [16]:
example = "Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex."
tokenized = nltk.word_tokenize(example)
text = nltk.Text(tokenized)
text.vocab() # Equivalent to fd = nltk.FreqDist(words)

FreqDist({'is': 3, 'better': 3, 'than': 3, '.': 3, 'Beautiful': 1, 'ugly': 1, 'Explicit': 1, 'implicit': 1, 'Simple': 1, 'complex': 1})

In [17]:
fd = text.vocab()
fd.tabulate(3)

    is better   than 
     3      3      3 


Collocations are series of words that frequently appear together in a given text. Collocations can be made up of two or more words. NLTK provides classes to handle several types of collocations:

Bigrams: Frequent two-word combinations
Trigrams: Frequent three-word combinations
Quadgrams: Frequent four-word combinations

NLTK provides specific classes for you to find collocations in your text. Following the pattern you’ve seen so far, these classes are also built from lists of words:

In [18]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
finder.ngram_fd.most_common(2)

[(('the', 'United', 'States'), 294), (('the', 'American', 'people'), 185)]

In [19]:
finder.ngram_fd.tabulate(2)

  ('the', 'United', 'States') ('the', 'American', 'people') 
                          294                           185 


## Using NLTK’s Pre-Trained Sentiment Analyzer

NLTK already has a built-in, pretrained sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).

Since VADER is pretrained, you can get results more quickly than with many other analyzers. However, VADER is best suited for language used in social media, like short sentences with some slang and abbreviations. It’s less accurate when rating longer, structured sentences, but it’s often a good launching point.

In [20]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")

{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}

You’ll get back a dictionary of different scores. The negative, neutral, and positive scores are related: They all add up to 1 and can’t be negative. The compound score is calculated differently. It’s not just an average, and it can range from -1 to 1.

In [23]:
tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

Notice that you use a different corpus method, .strings(), instead of .words(). This gives you a list of raw tweets as strings.

In [27]:
from random import shuffle

def is_positive(tweet: str):
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)

> True @FC_TEAMJK3T follback :)
> True @sehunshinedaily if it makes u feel better i never have nor will see anyone in kpop in the flesh :D
> False RT @lenathehyena: @johnmcternan @Dungarbhan @UKLabour @Ed_Miliband it's your bloke who said he'd prefer the Tories. Did you miss that? Catc…
> True RT @Markfergusonuk: David Cameron says he's hungrier than he was five years ago. So are all of the people reliant on food banks...
> False Farage never heard of pearl harbour?  #AskNigelFarage
> False RT @A_Liberty_Rebel: Farage is right. The EU’s protectionist CAP itself creates poverty among would-be exporters from Africa to Europe. #bb…
> False RT @timothy_stanley: #Farage's preference for reversing the smoking ban is a rare example of a conservatism that actually seeks to turn the…
> False UK audience grills Cameron, Miliband, Clegg in Question Time 'debate' http//t.co/sev4g8qh3c
> True RT @AndrewSparrow: On best PM, Cameron ahead of Miliband, 48% to 34% - http//t.co/Pu8rOdGS1a
> False RT @Ir

In [33]:
pprint(nltk.corpus.movie_reviews.fileids(categories=["pos"]), compact=True)

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt',
 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt',
 'pos/cv009_29592.txt', 'pos/cv010_29198.txt', 'pos/cv011_12166.txt',
 'pos/cv012_29576.txt', 'pos/cv013_10159.txt', 'pos/cv014_13924.txt',
 'pos/cv015_29439.txt', 'pos/cv016_4659.txt', 'pos/cv017_22464.txt',
 'pos/cv018_20137.txt', 'pos/cv019_14482.txt', 'pos/cv020_8825.txt',
 'pos/cv021_15838.txt', 'pos/cv022_12864.txt', 'pos/cv023_12672.txt',
 'pos/cv024_6778.txt', 'pos/cv025_3108.txt', 'pos/cv026_29325.txt',
 'pos/cv027_25219.txt', 'pos/cv028_26746.txt', 'pos/cv029_18643.txt',
 'pos/cv030_21593.txt', 'pos/cv031_18452.txt', 'pos/cv032_22550.txt',
 'pos/cv033_24444.txt', 'pos/cv034_29647.txt', 'pos/cv035_3954.txt',
 'pos/cv036_16831.txt', 'pos/cv037_18510.txt', 'pos/cv038_9749.txt',
 'pos/cv039_6170.txt', 'pos/cv040_8276.txt', 'pos/cv041_21113.txt',
 'pos/cv042_10982.txt', 'pos/

In [34]:
positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

In [35]:
from statistics import mean

def is_positive(review_id):
  """True if the average of all sentence compound scores is positive."""
  text = nltk.corpus.movie_reviews.raw(review_id)
  scores = [sia.polarity_scores(sentence)["compound"] for sentence in nltk.sent_tokenize(text)]
  return mean(scores) > 0

shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
  if is_positive(review_id):
    if review_id in positive_review_ids:
      correct += 1
    else:
      if review_id in negative_review_ids:
        correct += 1

print(F"{correct / len(all_review_ids):.2%} correct")

69.15% correct


## Customizing NLTK’s Sentiment Analysis

In [38]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith("NN"):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

This time, you also add words from the names corpus to the unwanted list on line 2 since movie reviews are likely to have lots of actor names, which shouldn’t be part of your feature sets.

In [None]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith("NN"):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

In [42]:
positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
  del positive_fd[word]
  del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}

In [46]:
pprint(top_100_positive, compact=True)

{'addresses', 'amistad', 'apostle', 'argento', 'attentive', 'audacious',
 'balancing', 'belgian', 'benefit', 'biased', 'brisk', 'broadcast', 'claiborne',
 'conveys', 'criticized', 'curdled', 'danish', 'deft', 'deftly', 'donkey',
 'elegantly', 'embeth', 'en', 'exhilarating', 'fa', 'falter', 'farquaad', 'fei',
 'flynt', 'forceful', 'freed', 'funnest', 'galactic', 'ghost', 'hanks',
 'horned', 'indistinguishable', 'jedi', 'kimble', 'kudos', 'legally',
 'lovingly', 'lumumba', 'masterfully', 'matches', 'maximus', 'melancholy',
 'methodical', 'monetary', 'motta', 'mulan', 'narrates', 'nello', 'niccol',
 'notoriously', 'ordell', 'organizing', 'perceived', 'pink', 'powerfully',
 'profile', 'propelled', 'pun', 'radio', 'redefines', 'rico', 'safely',
 'seahaven', 'shanghai', 'shrek', 'sobbing', 'societal', 'soviet', 'spacey',
 'sparks', 'stendhal', 'superficially', 'supreme', 'sweetback', 'tale',
 'taxing', 'textured', 'tibbs', 'tibetan', 'trimmed', 'ulee', 'unassuming',
 'uncompromising', 'uncut

In [47]:
pprint(top_100_negative, compact=True)

{'abysmal', 'amish', 'artemus', 'audible', 'autistic', 'babe', 'battlefield',
 'bean', 'brazilian', 'brenner', 'busted', 'chi', 'chuckled', 'club', 'comment',
 'consecutive', 'crucible', 'deems', 'degenerates', 'digested', 'disguise',
 'droppingly', 'ego', 'embarassing', 'enticing', 'favors', 'fetch', 'flipped',
 'flubber', 'forgetful', 'geronimo', 'glancing', 'godzilla', 'goo', 'gordy',
 'grunting', 'harlem', 'heckerling', 'horrid', 'iii', 'incoherent', 'injury',
 'interspersed', 'jericho', 'joely', 'lamest', 'leaden', 'leguizamo',
 'manchurian', 'mandingo', 'modeled', 'monumentally', 'mumbo', 'mystery',
 'nbsp', 'negated', 'nitro', 'ordering', 'pad', 'pathetically', 'performances',
 'peripheral', 'plodding', 'popped', 'potty', 'precinct', 'psychlo', 'putrid',
 'rabid', 'rambo', 'rotating', 'sans', 'schumacher', 'segal', 'sneering',
 'snipes', 'spawn', 'sphere', 'squabble', 'stalks', 'stinks', 'stupidest',
 'stupidly', 'supergirl', 'tearing', 'tectonic', 'tediously', 'terminal',
 'top

Here’s how you can set up the positive and negative bigram finders:

In [48]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

positive_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
    if w.isalpha() and w not in unwanted
])
negative_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
    if w.isalpha() and w not in unwanted
])

## Training and Using a Classifier

To be continued.