Plain text is one of the most common forms of data available today. Text analysis includes the analysis of word frequency distributions, pattern recognition, tagging, link and association analysis, sentiment analysis, and visualization. We will analyze text with the Python Natural Language Toolkit (NLTK) library. NLTK comes with collections of sample texts called corpora.

NLTK is a Python API for the analysis of texts written in natural languages, such as
English. NLTK was created in 2001 and was originally intended as a teaching tool.

In [1]:
import nltk

In [2]:
import pkgutil as pu
import pydoc
import nltk

print ("nltk version", nltk.__version__)

def clean(astr):
   s = astr
   # remove multiple spaces
   s = ' '.join(s.split())
   s = s.replace('=','')

   return s

def print_desc(prefix, pkg_path):
   # iter_modules() yields (module_finder, name, ispkg) tuples
   for _, mod_name, is_pkg in pu.iter_modules(path=pkg_path):
      name = prefix + "." + mod_name

      if is_pkg:
         try:
            docstr = pydoc.plain(pydoc.render_doc(name))
            docstr = clean(docstr)
            start = docstr.find("DESCRIPTION")
            docstr = docstr[start: start + 140]
            print (name, docstr)
         except Exception:
            # skip subpackages whose documentation cannot be rendered
            continue

print_desc("nltk", nltk.__path__)


nltk version 3.2.1
nltk.app DESCRIPTION chartparser: Chart Parser chunkparser: Regular-Expression Chunk Parser collocations: Find collocations in text concordance: Part
nltk.ccg DESCRIPTION For more information see nltk/doc/contrib/ccg/ccg.pdf PACKAGE CONTENTS api chart combinator lexicon logic DATA BackwardApplicati
nltk.chat DESCRIPTION A class for simple chatbots. These perform simple pattern matching on sentences typed by users, and respond with automatically g
nltk.chunk DESCRIPTION Classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This 
nltk.classify DESCRIPTION Classes and interfaces for labeling tokens with category labels (or "class labels"). Typically, labels are represented with stri
nltk.cluster DESCRIPTION This module contains a number of basic clustering algorithms. Clustering describes the task of discovering groups of similar ite
nltk.corpus 
nltk.draw DESCRIPTION # Natural Language Toolkit: graphi



We still need to download the NLTK corpora.

Start a Python session and run:

>import nltk

>nltk.download()

A GUI application should appear, where you can specify a destination and what
to download.

### Filtering out stopwords, names, and numbers

A common requirement in text analysis is to remove stopwords (common words
with low information value). NLTK has stopwords corpora for a number of
languages. Load the English stopwords corpus and print some of the words:

In [3]:
sw = set(nltk.corpus.stopwords.words('english'))
print ("Stop words",list(sw)[:8])

Stop words ['while', 'he', 'that', 'both', 'ourselves', 'won', 'been', 'about']


Notice that all the words in this corpus are in lowercase.

Project Gutenberg is a digital library of books, mostly with expired copyright,
which are available for free on the Internet. NLTK ships a selection of these
books as its gutenberg corpus; list the last few file IDs:

In [4]:
gb = nltk.corpus.gutenberg
print ("Gutenberg files",gb.fileids()[-5:])

Gutenberg files ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


Extract the first couple of sentences from the milton-paradise.txt file that we
will filter later:

In [5]:
text_sent = gb.sents("milton-paradise.txt")[:2]
print ("Unfiltered", text_sent)

Unfiltered [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']]


Now, filter out the stopwords as follows:

In [6]:
for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    print ("Filtered", filtered)

Filtered ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']
Filtered ['Book']


If we compare with the previous snippet, we notice that the words by and I have
been filtered out, as their lowercase forms were found in the stopwords corpus.
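The effect of the w.lower() call can be seen without NLTK. The following is a minimal sketch that assumes a tiny hand-written stopword set (toy_stopwords and remove_stopwords are illustrative names, not part of NLTK) and reuses the two sentences printed above:

```python
# Hypothetical mini stopword list; the real NLTK corpus is much larger.
toy_stopwords = {"by", "i", "the", "of"}

def remove_stopwords(tokens, stopwords):
    """Keep tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

sentences = [["[", "Paradise", "Lost", "by", "John", "Milton", "1667", "]"],
             ["Book", "I"]]

# Case-insensitive filtering, as in the cell above.
filtered_sentences = [remove_stopwords(s, toy_stopwords) for s in sentences]

# Without lowercasing, "I" does not match the lowercase entry "i".
case_sensitive = [t for t in sentences[1] if t not in toy_stopwords]

print(filtered_sentences)
print(case_sensitive)
```

Because the stopword list is all lowercase, comparing the raw token would let capitalized stopwords slip through.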

Sometimes, we want to remove numbers and names too. We can remove words based
on part-of-speech (POS) tags. In this tagging scheme, numbers correspond to the
Cardinal Number (CD) tag. Names correspond to the proper noun singular (NNP) tag.
Tagging is an inexact process based on heuristics.

In [10]:
#Tag the filtered text with the pos_tag() function:
for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    tagged = nltk.pos_tag(filtered)
    print ("Tagged", tagged)


Tagged [('[', 'JJ'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'NN')]
Tagged [('Book', 'NN')]


The pos_tag() function returns a list of tuples, where the second element in each
tuple is the tag. As you can see, some of the words are tagged as NNP, although they
probably shouldn't be. The tagger seems to favor NNP when the first character
of a word is uppercase, so if we set all the words to lowercase, we will get a different
result. The next cell removes the words tagged as NNP or CD:

In [12]:
for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    tagged = nltk.pos_tag(filtered)
    words= []

    for word in tagged:
        if word[1] != 'NNP' and word[1] != 'CD':
           words.append(word[0]) 

    print (words)


['[', ']']
['Book']
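The tag-based filter can also be written as a single comprehension over (word, tag) pairs. This sketch does not call the tagger; it assumes the tagged list printed above as fixed input, and UNWANTED_TAGS is an illustrative name:

```python
# Tagged tokens copied from the pos_tag() output shown earlier.
tagged = [('[', 'JJ'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'),
          ('Milton', 'NNP'), ('1667', 'CD'), (']', 'NN')]

# Tags to drop: proper nouns (NNP) and cardinal numbers (CD).
UNWANTED_TAGS = {'NNP', 'CD'}

kept = [word for word, tag in tagged if tag not in UNWANTED_TAGS]
print(kept)
```

Keeping the unwanted tags in a set makes it easy to extend the filter, for example with the plural proper noun tag.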


In the bag-of-words model, we create from a document a bag containing words
found in the document. In this model, we don't care about the word order. For each
word in the document, we count the number of occurrences. With these word counts,
we can do statistical analysis, for instance, to identify spam in e-mail messages.

If we have a group of documents, we can view each unique word in the corpus as a
feature; here, "feature" means parameter or variable. Using all the word counts, we
can build a feature vector for each document; "vector" is used here in the mathematical
sense. If a word is present in the corpus but not in the document, the value of this
feature will be 0. Surprisingly, NLTK doesn't have a handy utility currently to create a
feature vector. However, the machine learning Python library, scikit-learn, does have
a CountVectorizer class that we can use.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

gb = nltk.corpus.gutenberg
hamlet = gb.raw("shakespeare-hamlet.txt")
macbeth = gb.raw("shakespeare-macbeth.txt")

cv = CountVectorizer(stop_words='english')
print ("Feature vector", cv.fit_transform([hamlet, macbeth]).toarray())
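# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
print ("Features", cv.get_feature_names_out()[:5].tolist())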

Feature vector [[ 1  0  1 ..., 14  0  1]
 [ 0  1  0 ...,  1  1  0]]
Features ['1599', '1603', 'abhominably', 'abhorred', 'abide']
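To make the feature-vector idea concrete, here is a pure-Python sketch of what a count vectorizer does conceptually. It is not scikit-learn's implementation: tokenization is a plain split(), no stopwords are removed, and the two short documents are made up for illustration.

```python
from collections import Counter

# Two made-up documents standing in for a corpus.
docs = ["to be or not to be", "the play is the thing"]
tokenized = [d.split() for d in docs]

# Every unique word in the corpus is one feature (one vector position).
vocabulary = sorted({w for doc in tokenized for w in doc})

# One count vector per document; absent words get the value 0.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[w] for w in vocabulary])

print(vocabulary)
print(vectors)
```

Each row has the same length as the vocabulary, so documents of different sizes become directly comparable.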


### Analyzing word frequencies

The NLTK FreqDist class encapsulates a dictionary of words and counts for a
given list of words.

In [20]:
import string


gb = nltk.corpus.gutenberg
words = gb.words("shakespeare-caesar.txt")

sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]

fd = nltk.FreqDist(filtered)
print ("Words", list(fd.keys())[:5])
print ("Counts", list(fd.values())[:5])
# max() returns the most frequent word as a string
print ("Max", fd.max())
print ("Count", fd['d'])

fd = nltk.FreqDist(nltk.bigrams(filtered))
print ("Bigrams", list(fd.keys())[:5])
print ("Counts", list(fd.values())[:5])
print ("Bigram Max", list(fd.max()))
print ("Bigram count", fd[('let', 'vs')])



Words ['william', 'amaze', 'signifies', 'julius', 'priests']
Counts [1, 1, 1, 1, 2]
Max caesar
Count 0
Bigrams [('dangerous', 'flourish'), ('meanes', 'whereof'), ('bad', 'soules'), ('pleasure', 'portia'), ('rabblement', 'howted')]
Counts [1, 1, 1, 1, 1]
Bigram Max ['let', 'vs']
Bigram count 16


Tokenization can leave single-character fragments that are not English words (the
lookup of 'd' above checks for one; here its count is 0), so we may want to add the
heuristic that words have a minimum of two characters. The NLTK FreqDist class
allows dictionary-like access, but it also has convenience methods such as max().

The first half of the cell above concerns single words, but we can extend the analysis
to word pairs and triplets, also called bigrams and trigrams. We can find
them with the bigrams() and trigrams() functions; the second half of the cell counts
bigrams in this way.
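As an illustration of what such functions produce, the sketch below builds n-grams with zip() and counts them with collections.Counter. The ngrams helper and the token list are assumptions made for this example; they are not NLTK's code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up token list for illustration.
tokens = "let vs go and let vs see".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_list = ngrams(tokens, 3)

print(bigram_counts.most_common(1))
print(trigram_list[0])
```

A list of k tokens yields k - n + 1 n-grams, so the seven tokens here give six bigrams and five trigrams.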

### Naive Bayes classification

Let's try to classify words as stopwords or punctuation. As a feature, we will use the
word length, since stopwords and punctuation tend to be short.

In [21]:
import string
import random


sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)

def word_features(word):
   return {'len': len(word)}

def isStopword(word):
    return word in sw or word in punctuation

gb = nltk.corpus.gutenberg
words = gb.words("shakespeare-caesar.txt")

labeled_words = ([(word.lower(), isStopword(word.lower())) for word in words])
random.seed(42)
random.shuffle(labeled_words)
print (labeled_words[:5])

featuresets = [(word_features(word), label) for (word, label) in labeled_words]
cutoff = int(.9 * len(featuresets))
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print ("'behold' class", classifier.classify(word_features('behold')))
print ("'the' class", classifier.classify(word_features('the')))

print ("Accuracy", nltk.classify.accuracy(classifier, test_set))
# show_most_informative_features() prints its table itself and returns None
classifier.show_most_informative_features(5)



[('i', True), ('is', True), ('in', True), ('he', True), ('ambitious', False)]
'behold' class False
'the' class True
Accuracy 0.8521671826625387
Most Informative Features
                     len = 7               False : True   =     77.8 : 1.0
                     len = 6               False : True   =     52.2 : 1.0
                     len = 1                True : False  =     51.8 : 1.0
                     len = 2                True : False  =     10.9 : 1.0
                     len = 5               False : True   =     10.9 : 1.0
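To show the idea behind a Naive Bayes decision on a single feature, here is a hand-rolled sketch using word length. It is a simplification under stated assumptions: it uses add-one smoothing and a nine-word toy training set, whereas NLTK's classifier uses its own probability estimator, so the numbers will not match the output above. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_length_model(labeled_words):
    """Count labels and word lengths per label from (word, label) pairs."""
    label_counts = Counter()
    length_counts = defaultdict(Counter)
    for word, label in labeled_words:
        label_counts[label] += 1
        length_counts[label][len(word)] += 1
    n_lengths = len({len(word) for word, _ in labeled_words})
    return label_counts, length_counts, n_lengths

def classify_length(model, word):
    """Return the label with the highest log posterior for the word's length."""
    label_counts, length_counts, n_lengths = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        log_prior = math.log(n_label / total)
        # Add-one smoothing, with one extra bucket for unseen lengths.
        seen = length_counts[label][len(word)]
        log_likelihood = math.log((seen + 1) / (n_label + n_lengths + 1))
        scores[label] = log_prior + log_likelihood
    return max(scores, key=scores.get)

# Toy training data: True marks a stopword.
toy_data = [("i", True), ("is", True), ("in", True), ("he", True), ("the", True),
            ("ambitious", False), ("behold", False), ("caesar", False),
            ("brutus", False)]
model = train_length_model(toy_data)

behold_is_stop = classify_length(model, "behold")
and_is_stop = classify_length(model, "and")
print(behold_is_stop, and_is_stop)
```

The score for each label is the log prior plus the log likelihood of the observed length, which is the whole Naive Bayes computation when there is only one feature.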


### Sentiment analysis

Opinion mining, or sentiment analysis, is a research field dedicated to
the automatic evaluation of opinions as expressed on social media, product review
websites, or other forums. Often, we want to know whether an opinion is positive,
neutral, or negative.

As such, we can apply any number of classification algorithms. Another
approach is to semiautomatically (with some manual editing) compose a list of
words with an associated numerical sentiment score (the word "good" can have a
score of 5 and the word "bad" a score of -5). If we have such a list, we can look up all
words in a text document and, for example, sum up all the found sentiment scores.
The number of classes can be more than three, like a five-star rating scheme.
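The word-list approach can be sketched in a few lines. The lexicon below is hypothetical (only the scores for "good" and "bad" come from the text above), tokenization is a plain split(), and the three-way label thresholds are an assumption chosen for illustration.

```python
# Hypothetical sentiment lexicon; "good" and "bad" follow the text above.
toy_lexicon = {"good": 5, "great": 4, "bad": -5, "boring": -3}

def sentiment_score(text, lexicon):
    """Sum the scores of all words found in the lexicon; unknown words count 0."""
    return sum(lexicon.get(token, 0) for token in text.lower().split())

def sentiment_label(score):
    """Map a summed score to one of three classes (assumed thresholds)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

mixed = sentiment_score("A good film with a boring ending", toy_lexicon)
negative = sentiment_score("Bad and boring", toy_lexicon)
print(mixed, sentiment_label(mixed))
print(negative, sentiment_label(negative))
```

A finer scheme, such as five-star ratings, would only need more thresholds in sentiment_label.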

We will apply Naive Bayes classification to the NLTK movie reviews corpus with the
goal of classifying movie reviews as either positive or negative. You may consider more elaborate filtering
schemes, but keep in mind that excessive filtering may hurt accuracy.

In [22]:
import random
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy
import string


labeled_docs = [(list(movie_reviews.words(fid)), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.seed(42)
random.shuffle(labeled_docs)

review_words = movie_reviews.words()
print ("# Review Words", len(review_words))


# Review Words 1583820


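The complete corpus has tens of thousands of unique words that we can use as
features. However, using all these words might be inefficient, so we keep five
percent of the unique words. Note that the cell below takes the first N keys of the
FreqDist; in NLTK 3 the keys are not sorted by frequency, so use
words.most_common(N) if you want the most frequent words: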

In [25]:
sw = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def isStopWord(word):
    return word in sw or word in punctuation

filtered = [w.lower() for w in review_words if not isStopWord(w.lower())]
print ("# After filter", len(filtered))

words = FreqDist(filtered)
N = int(.05 * len(words.keys()))
word_features = list(words.keys())[:N]

# After filter 710579
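In NLTK 3, FreqDist is built on collections.Counter, so its keys are not ordered by frequency and most_common() is the way to get the top entries. The sketch below uses a plain Counter and a made-up token list to show the selection; the 50 percent fraction is chosen only to keep the example small.

```python
from collections import Counter

# Made-up tokens standing in for the filtered review words.
tokens = "the film was boring the plot was boring the end".split()
counts = Counter(tokens)

# Keep a fraction of the unique words, ranked by frequency.
fraction = 0.5
n_top = max(1, int(fraction * len(counts)))
top_words = [word for word, _ in counts.most_common(n_top)]

print(n_top, top_words)
```

With a FreqDist named words, the equivalent call would be words.most_common(N).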


For each document, we can extract features using a number of methods including
the following:

• Check whether the given document has a word or not

• Determine the number of occurrences of a word for a given document

• Normalize word counts so that the maximum normalized word count will
be less than or equal to 1

• Take the logarithm of counts plus one (to avoid taking the logarithm of zero)

• Combine all the previous points into one metric
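The first four options in the list above can be compared side by side. This is a sketch with a made-up document and vocabulary; feature_variants is an illustrative helper, not an NLTK function.

```python
import math
from collections import Counter

def feature_variants(doc_tokens, vocabulary):
    """Compute four alternative feature values for each vocabulary word."""
    counts = Counter(doc_tokens)
    # Largest count among vocabulary words; fall back to 1 to avoid dividing by 0.
    max_count = max((counts[w] for w in vocabulary), default=0) or 1
    return {
        w: {
            "present": counts[w] > 0,              # word occurs or not
            "count": counts[w],                    # raw number of occurrences
            "normalized": counts[w] / max_count,   # scaled to at most 1
            "log": math.log(counts[w] + 1),        # log of count plus one
        }
        for w in vocabulary
    }

doc = ["boring", "plot", "boring", "cast"]
vocab = ["boring", "plot", "great"]
variants = feature_variants(doc, vocab)
print(variants["boring"])
print(variants["great"])
```

Adding one before taking the logarithm maps a count of zero to a feature value of zero.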

The following function uses raw word counts as a metric:

In [26]:
def doc_features(doc):
    doc_words = FreqDist(w for w in doc if not isStopWord(w))
    features = {}
    for word in word_features:
        features['count (%s)' % word] = (doc_words.get(word, 0))
    return features


In [27]:
featuresets = [(doc_features(d), c) for (d,c) in labeled_docs]
train_set, test_set = featuresets[200:], featuresets[:200]
classifier = NaiveBayesClassifier.train(train_set)
print ("Accuracy", accuracy(classifier, test_set))

# show_most_informative_features() prints its table itself and returns None
classifier.show_most_informative_features()

Accuracy 0.6
Most Informative Features
          count (boring) = 2                 neg : pos    =     10.7 : 1.0
          count (murphy) = 1                 pos : neg    =      7.6 : 1.0
            count (tide) = 1                 pos : neg    =      5.8 : 1.0
         count (teaming) = 1                 pos : neg    =      5.8 : 1.0
       count (costuming) = 1                 pos : neg    =      5.8 : 1.0
        count (angelina) = 1                 neg : pos    =      5.5 : 1.0
           count (coats) = 1                 pos : neg    =      5.1 : 1.0
          count (winter) = 1                 pos : neg    =      5.1 : 1.0
        count (supercop) = 1                 pos : neg    =      5.1 : 1.0
             count (ass) = 2                 neg : pos    =      4.9 : 1.0


### Creating word clouds

In [29]:
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import FreqDist
import string

sw = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def isStopWord(word):
    return word in sw or word in punctuation

review_words = movie_reviews.words()
filtered = [w.lower() for w in review_words if not isStopWord(w.lower())]

words = FreqDist(filtered)
N = int(.01 * len(words.keys()))
tags = list(words.keys())[:N]

for tag in tags:
    print (tag, ':', words[tag])

recon : 2
interrotron : 2
bully : 24
ringwood : 1
bernstein : 10
blond : 34
tectonic : 7
mesa : 1
customary : 5
burgess : 4
gooden : 1
coil : 1
profusely : 1
50 : 59
dey : 4
neckbraces : 1
paraplegic : 1
attempt : 263
regretting : 1
weighs : 7
lobotomies : 1
therapeutic : 1
translators : 1
association : 11
crooked : 15
salma : 23
dalmantions : 1
semitic : 3
colleges : 2
rubble : 6
hordes : 5
uncles : 1
lodi : 1
lined : 4
kilmer : 48
jour : 1
makers : 58
embarrsingly : 1
system : 117
heroism : 4
yeshiva : 1
men_ : 1
shelled : 1
educated : 9
subjecting : 4
wrinkled : 2
acheives : 1
ambiguities : 2
oprah : 8
bimbos : 3
ascetic : 2
ogres : 2
righteousness : 1
trilogy : 60
obstructed : 1
tops : 17
tracer : 1
nighthawks : 1
supply : 22
sore : 8
inuits : 1
dislocation : 1
stitch : 1
fumbling : 8
fobbed : 1
36th : 1
devlin : 8
disgust : 11
detective : 127
ridiculousy : 1
splits : 5
sign : 87
tiresome : 32
dispatching : 2
1938 : 3
lau : 1
rem : 1
pseudoerotic : 1
roxburgh : 3
mozell : 1
arye : 

In [34]:
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.corpus import names
from nltk import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
import itertools
import pandas as pd
import numpy as np
import string

sw = set(stopwords.words('english'))
punctuation = set(string.punctuation)
#Improve filtering by using the isalpha() method and names corpus:
all_names = set([name.lower() for name in names.words()])

def isStopWord(word):
    return (word in sw or word in punctuation) or not word.isalpha() or word in all_names

review_words = movie_reviews.words()
filtered = [w.lower() for w in review_words if not isStopWord(w.lower())]

words = FreqDist(filtered)
#Create the list as follows:

texts = []

for fid in movie_reviews.fileids():
    texts.append(" ".join([w.lower() for w in movie_reviews.words(fid) if not isStopWord(w.lower()) and words[w.lower()] > 1]))
#Create the vectorizer; to be safe, let it ignore stopwords:
vectorizer = TfidfVectorizer(stop_words='english')
#Create the sparse term-document matrix:
matrix = vectorizer.fit_transform(texts)
#Sum the tf-idf values for each word and store it in a NumPy array:
sums = np.array(matrix.sum(axis=0)).ravel()
#Now, create a pandas DataFrame with the word rank weights and sort it:
ranks = []

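# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
for word, val in zip(vectorizer.get_feature_names_out(), sums):
    ranks.append((word, val))

df = pd.DataFrame(ranks, columns=["term", "tfidf"])
# DataFrame.sort() was removed from pandas; sort_values() replaces it
df = df.sort_values(by='tfidf')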
print (df.head())

N = int(.01 * len(df))
df = df.tail(N)

for term, tfidf in zip(df["term"].values, df["tfidf"].values):
    print (term, ":", tfidf)


                 term    tfidf
2791      cannibalize  0.03035
8737            greys  0.03035
19964  superintendent  0.03035
14011           ology  0.03035
2406          briefer  0.03035
matter : 10.1599120411
review : 10.1620487127
seeing : 10.1937953512
jokes : 10.1950100205
past : 10.2297321489
romantic : 10.2707679481
directed : 10.2767754198
start : 10.3021271495
finally : 10.3151923686
video : 10.3567282336
despite : 10.3635675871
ship : 10.3700292095
beautiful : 10.4155676298
scream : 10.4219706559
sequence : 10.4611040981
supposed : 10.4735674889
shot : 10.4976115264
face : 10.5203272836
turn : 10.5353055048
lives : 10.5361737181
later : 10.5365640805
tell : 10.5417039031
camera : 10.5807382349
works : 10.5848494682
children : 10.5921429454
live : 10.6586081401
daughter : 10.6854088195
earth : 10.6855377464
mr : 10.7112802669
car : 10.7143978876
believe : 10.7248097299
maybe : 10.7378725085
person : 10.7659913761
book : 10.7988840881
worst : 10.8018088939
hand : 10.8157477252
na
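The ranking above sums each term's tf-idf weight over all documents. The following pure-Python sketch shows that computation on two tiny made-up documents. It assumes a smoothed idf of ln((1 + n) / (1 + df)) + 1 and L2-normalized document vectors, which matches my understanding of TfidfVectorizer's defaults, but treat it as an approximation rather than the library's implementation.

```python
import math
from collections import Counter

def tfidf_sums(docs):
    """Sum L2-normalized tf-idf weights per term over all documents."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(w for doc in tokenized for w in set(doc))
    sums = Counter()
    for doc in tokenized:
        tf = Counter(doc)
        weights = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1)
                   for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        for w, v in weights.items():
            sums[w] += v / norm
    return sums

# Made-up documents: "a" occurs in both, "b" and "c" in one each.
sums = tfidf_sums(["a b", "a c"])
for term, value in sorted(sums.items(), key=lambda item: item[1]):
    print(term, ":", value)
```

A term that appears in many documents has a lower weight per document but can still rank high once the weights are summed, which is why common review words such as "camera" and "book" appear at the top of the list above.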

