
In [None]:
import nltk
nltk.download()

NLTK will display a download manager showing all available and installed resources. Here are the ones you’ll need to download for this tutorial:

*   names: A list of common English names compiled by Mark Kantrowitz
*   stopwords: A list of really common words, like articles, pronouns, prepositions, and conjunctions
*   state_union: A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens
*   twitter_samples: A list of social media phrases posted to Twitter
*   movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee
*   averaged_perceptron_tagger: A data model that NLTK uses to categorize words into their part of speech
*   vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert
*   punkt: A data model created by Jan Strunk that NLTK uses to split full texts into sentences, which nltk.word_tokenize() relies on before splitting sentences into words

A quick way to download specific resources directly from the console is to pass a list to nltk.download():

In [2]:
import nltk
nltk.download(["names", "stopwords", "state_union", "twitter_samples", "movie_reviews", "averaged_perceptron_tagger", "vader_lexicon", "punkt",])

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package state_union to /root/nltk_data...
[nltk_data]   Unzipping corpora/state_union.zip.
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Compiling Data

In [3]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

Note that you build a list of individual words with the corpus’s .words() method, but you use str.isalpha() to include only the words that are made up of letters. Otherwise, your word list may end up with “words” that are only punctuation marks.

Since all words in the stopwords list are lowercase, and those in the original list may not be, you use str.lower() to account for any discrepancies. Otherwise, you may end up with mixedCase or capitalized stop words still in your list.

In [4]:
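# Load the English stop words into a set for fast membership tests.
stopwords = set(nltk.corpus.stopwords.words("english"))
# Compare in lowercase so capitalized stop words such as "The" are removed too.
words = [w for w in words if w.lower() not in stopwords]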
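To see both filters at work without downloading a corpus, here is a minimal sketch on a hand-made token list. The names sample_tokens and sample_stopwords are made up for illustration; in the tutorial itself the stop words come from nltk.corpus.stopwords.words("english").

```python
# Hypothetical mini stop-word list, standing in for NLTK's English list.
sample_stopwords = {"the", "of", "is", "and"}

# Hand-made tokens, including punctuation, a number, and capitalized stop words.
sample_tokens = ["The", "State", "of", "the", "Union", "is", "strong",
                 ",", "and", "1947", "."]

# str.isalpha() keeps only tokens made up entirely of letters.
alpha_only = [t for t in sample_tokens if t.isalpha()]

# Lowercasing before the lookup catches "The" as well as "the".
filtered = [t for t in alpha_only if t.lower() not in sample_stopwords]

print(alpha_only)
print(filtered)
```

Without the call to str.lower(), "The" would survive the second filter because only "the" is in the stop-word set.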

pprint() pretty-prints data structures. The built-in print() function writes a container's entire representation on a single line. That is fine for short content, but a long list or a deeply nested structure, such as parsed JSON, quickly becomes hard to read. pprint() instead wraps the output to fit a given width.

In [5]:
from pprint import pprint

text = """For some quick analysis, creating a corpus could be overkill.
          If all you need is a word list, there are simpler ways to achieve that goal."""
pprint(nltk.word_tokenize(text), width=79, compact=True)

['For', 'some', 'quick', 'analysis', ',', 'creating', 'a', 'corpus', 'could',
 'be', 'overkill', '.', 'If', 'all', 'you', 'need', 'is', 'a', 'word', 'list',
 ',', 'there', 'are', 'simpler', 'ways', 'to', 'achieve', 'that', 'goal', '.']
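The width and compact arguments control how the output is wrapped. As a rough illustration on a made-up list of placeholder strings, pformat() (which returns the string that pprint() would print) makes the difference easy to compare:

```python
from pprint import pformat

# Placeholder tokens, long enough that the list cannot fit on one 40-column line.
tokens = [f"token{i}" for i in range(12)]

# Default behavior: once the list overflows the width, each item gets its own line.
one_per_line = pformat(tokens, width=40)

# compact=True packs as many items as fit on each line.
packed = pformat(tokens, width=40, compact=True)

print(one_per_line)
print(packed)
```

With compact=True the same list takes far fewer lines, which is why the cell above uses it for the tokenized sentence.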


In [6]:
words = [word for word in nltk.word_tokenize(text) if word.isalpha()]
words

['For',
 'some',
 'quick',
 'analysis',
 'creating',
 'a',
 'corpus',
 'could',
 'be',
 'overkill',
 'If',
 'all',
 'you',
 'need',
 'is',
 'a',
 'word',
 'list',
 'there',
 'are',
 'simpler',
 'ways',
 'to',
 'achieve',
 'that',
 'goal']

## Creating Frequency Distributions

In [7]:
words = [word for word in nltk.word_tokenize(text) if word.isalpha()]
fd = nltk.FreqDist(words)
fd.most_common(3)

[('a', 2), ('For', 1), ('some', 1)]

In [8]:
fd

FreqDist({'a': 2, 'For': 1, 'some': 1, 'quick': 1, 'analysis': 1, 'creating': 1, 'corpus': 1, 'could': 1, 'be': 1, 'overkill': 1, ...})
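Note that FreqDist is case-sensitive, so "For" and "for" would be counted as different samples. If you want case-insensitive counts, one option is to lowercase the tokens before counting. The token list below is a made-up sample, not part of the tutorial data:

```python
import nltk

# Made-up sample tokens containing both "For" and "for".
sample_tokens = ["For", "some", "analysis", "a", "corpus", "is", "overkill",
                 "for", "a", "word", "list"]

# Lowercase first so that "For" and "for" are counted together.
lower_fd = nltk.FreqDist(t.lower() for t in sample_tokens)

print(lower_fd.most_common(2))
print(lower_fd.N())        # total number of tokens counted
print(lower_fd.freq("a"))  # relative frequency: count of "a" divided by N
```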

In [9]:
fd.tabulate()

       a      For     some    quick analysis creating   corpus    could       be overkill       If      all      you     need       is     word     list    there      are  simpler     ways       to  achieve     that     goal 
       2        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1        1 


In [10]:
fd['a']

2

In [11]:
fd['For']

1

In [12]:
fd['one']

0
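Looking up a word that never occurred returns 0 instead of raising a KeyError because FreqDist is a subclass of collections.Counter. A minimal sketch with a plain Counter on a made-up word list shows the same behavior:

```python
from collections import Counter

# Made-up word list for illustration.
sample_words = ["a", "word", "list", "is", "a", "list"]
counts = Counter(sample_words)

print(counts["a"])            # count of a word that is present
print(counts["one"])          # a missing word counts as 0, no KeyError
print("one" in counts)        # the lookup did not add the key
print(counts.most_common(2))  # same method that FreqDist exposes
```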

In [13]:
for w in fd:
  print(w)

a
For
some
quick
analysis
creating
corpus
could
be
overkill
If
all
you
need
is
word
list
there
are
simpler
ways
to
achieve
that
goal


## Extracting Concordance and Collocations

Before invoking .concordance(), build a new word list from the original corpus text so that all the context, even stop words, will be there:

In [14]:
text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)

Displaying 5 of 1079 matches:
 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
 to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to 


Since .concordance() only prints information to the console, it’s not ideal for data manipulation. To obtain a usable list that will also give you information about the location of each occurrence, use .concordance_list():

In [20]:
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
  print(entry)

ConcordanceLine(left=['looked', 'forward', 'and', 'moved', 'forward', '.', 'That', 'is', 'what', 'he', 'would', 'want', 'us', 'to', 'do', '.', 'That', 'is', 'what'], query='America', right=['will', 'do', '.', 'So', 'much', 'blood', 'has', 'already', 'been', 'shed', 'for', 'the', 'ideals', 'which', 'we', 'cherish', ',', 'and'], offset=242, left_print=' would want us to do . That is what', right_print='will do . So much blood has already', line=' would want us to do . That is what America will do . So much blood has already')
ConcordanceLine(left=['even', 'a', 'momentary', 'pause', 'in', 'the', 'hard', 'fight', 'for', 'victory', '.', 'Today', ',', 'the', 'entire', 'world', 'is', 'looking', 'to'], query='America', right=['for', 'enlightened', 'leadership', 'to', 'peace', 'and', 'progress', '.', 'Such', 'a', 'leadership', 'requires', 'vision', ',', 'courage', 'and', 'tolerance', '.'], offset=294, left_print='ay , the entire world is looking to', right_print='for enlightened leadership to p

In [21]:
for entry in concordance_list:
  print(entry.line)

 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
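Each ConcordanceLine also carries an offset, the index of the match in the token list, which lets you slice out your own context window. Here is a small sketch on a made-up token list; the names sample_text and hits and the window of two tokens on each side are arbitrary choices for illustration:

```python
import nltk

# Made-up sample tokens; nltk.Text itself needs no corpus download.
tokens = "Here in America , we work . America will lead . We trust america .".split()
sample_text = nltk.Text(tokens)

# Matching ignores case, so "america" also finds "America".
hits = sample_text.concordance_list("america")

for hit in hits:
    # hit.offset is the position of the match in the original token list.
    start = max(hit.offset - 2, 0)
    window = tokens[start:hit.offset + 3]
    print(hit.offset, hit.query, window)
```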


In [24]:
example = "Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex."
tokenized = nltk.word_tokenize(example)
text = nltk.Text(tokenized)
text.vocab()  # Equivalent to nltk.FreqDist(tokenized)

FreqDist({'is': 3, 'better': 3, 'than': 3, '.': 3, 'Beautiful': 1, 'ugly': 1, 'Explicit': 1, 'implicit': 1, 'Simple': 1, 'complex': 1})

In [25]:
fd = text.vocab()
fd.tabulate(3)

    is better   than 
     3      3      3 


Collocations are series of two or more words that frequently appear together in a given text. NLTK provides classes to handle several types of collocations:

*   Bigrams: Frequent two-word combinations
*   Trigrams: Frequent three-word combinations
*   Quadgrams: Frequent four-word combinations

Following the pattern you’ve seen so far, each collocation finder class is built from a list of words:

In [26]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
finder.ngram_fd.most_common(2)

[(('the', 'United', 'States'), 294), (('the', 'American', 'people'), 185)]

In [27]:
finder.ngram_fd.tabulate(2)

  ('the', 'United', 'States') ('the', 'American', 'people') 
                          294                           185 
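Bigrams can be found the same way with BigramCollocationFinder. The sketch below runs on a short made-up token list rather than the corpus, and uses .apply_freq_filter() to drop pairs that occur fewer than twice; the threshold of 2 is an arbitrary choice for this tiny sample:

```python
import nltk

# Made-up sample: three sentences with a repeated phrase, punctuation removed.
tokens = ("Beautiful is better than ugly Explicit is better than implicit "
          "Simple is better than complex").split()

bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)

# Keep only bigrams that occur at least twice.
bigram_finder.apply_freq_filter(2)

for bigram, count in bigram_finder.ngram_fd.most_common():
    print(bigram, count)
```

On a real corpus you would typically choose a higher threshold, since many word pairs occur two or three times by chance.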


## Using NLTK’s Pre-Trained Sentiment Analyzer

This will continue.