
#The Natural Language Toolkit (NLTK)
(https://www.nltk.org/)

On Linux systems, install NLTK with pip,

    pip install nltk

In [11]:
from google.colab import drive
drive.mount('/content/drive') #add drive content to the notebook

Mounted at /content/drive


In [12]:
# Change to the directory that contains the dataset
%cd /content/drive/My' 'Drive/Colab' 'Notebooks/Python_Data_Analysis/data

/content/drive/My Drive/Colab Notebooks/Python_Data_Analysis/data


#Import the NLTK Library and the NLTK Downloader Tool
The NLTK library also includes a large collection of sample texts, called *corpora*, which are taken largely from literature and are very useful as a basis for applying the techniques developed with the NLTK library

In [0]:
import nltk

The NLTK Downloader, **nltk.download_shell()**, allows you to make selections through a guided choice of options

In [0]:
nltk.download_shell()

Example: use the Gutenberg corpus,

In [4]:
nltk.download_shell()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> gutenberg
    Downloading package gutenberg to /root/nltk_data...
      Unzipping corpora/gutenberg.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


Alternatively, you can download the dataset of interest directly by passing its identifier to **nltk.download()**

In [13]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [14]:
gb = nltk.corpus.gutenberg
print('Gutenberg files:', gb.fileids())

Gutenberg files: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


To access the internal content of one of these files,

In [17]:
macbeth = nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
len(macbeth)

23140

You can see the first 10 words of the text

In [18]:
macbeth[:10]

['[',
 'The',
 'Tragedie',
 'of',
 'Macbeth',
 'by',
 'William',
 'Shakespeare',
 '1603',
 ']']

If you use the **sents()** function instead, you obtain a more structured array, with each sentence as an element. Each of these elements is itself an array whose elements are words

In [0]:
macbeth_sents = nltk.corpus.gutenberg.sents('shakespeare-macbeth.txt')
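
As a rough illustration of the two shapes, the sketch below uses a small hand-written sample (an assumption, not data read from the corpus) to show that flattening a **sents()**-style nested list gives a **words()**-style flat list.

```python
from itertools import chain

# Hand-written sample mimicking the nested shape returned by sents():
# a list of sentences, each sentence being a list of word tokens.
sample_sents = [
    ['Enter', 'three', 'Witches', '.'],
    ['When', 'shall', 'we', 'three', 'meet', 'againe', '?'],
]

# Flattening the sentences gives the flat shape returned by words().
sample_words = list(chain.from_iterable(sample_sents))

print(len(sample_sents))   # number of sentences
print(len(sample_words))   # number of word tokens
print(sample_words[:4])
```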

#Search for a Word with NLTK

The **concordance()** function looks for all occurrences of a word passed as an argument within a corpus and displays each one with its surrounding context

In [25]:
text = nltk.Text(macbeth)
text.concordance('Stage')

Displaying 3 of 3 matches:
nts with Dishes and Seruice ouer the Stage . Then enter Macbeth Macb . If it we
with mans Act , Threatens his bloody Stage : byth ' Clock ' tis Day , And yet d
 struts and frets his houre vpon the Stage , And then is heard no more . It is 
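
To make the idea of a concordance concrete, here is a minimal pure-Python sketch, not NLTK's implementation. The helper name `concordance_lines`, the `width` parameter, and the case-insensitive matching are all assumptions of this example.

```python
def concordance_lines(tokens, target, width=3):
    """Return each occurrence of target with `width` tokens of context per side."""
    target = target.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [tok] + right))
    return lines

# Hand-copied fragment of the third match shown above.
sample = ['frets', 'his', 'houre', 'vpon', 'the', 'Stage', ',',
          'And', 'then', 'is', 'heard', 'no', 'more']

for line in concordance_lines(sample, 'stage'):
    print(line)
```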


The **common_contexts()** function searches for the given word in the text and returns the contexts in which it appears: the previous word and the next word, joined by an underscore.

In [26]:
text.common_contexts(['Stage'])

the_. bloody_: the_,
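
The previous-word/next-word pairs can be reproduced by hand. The sketch below is an illustration in plain Python rather than NLTK's code; the helper `contexts` is a hypothetical name, and the sample tokens are copied from the three concordance matches.

```python
from collections import Counter

def contexts(tokens, target):
    """Count (previous, next) token pairs around each occurrence of target."""
    pairs = Counter()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == target:
            pairs[(tokens[i - 1], tokens[i + 1])] += 1
    return pairs

sample = ['ouer', 'the', 'Stage', '.',
          'his', 'bloody', 'Stage', ':',
          'vpon', 'the', 'Stage', ',']

for (prev, nxt), count in contexts(sample, 'Stage').items():
    print(f'{prev}_{nxt}', count)
```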


**similar()** searches for all words that appear in the same contexts as the searched one,

In [28]:
text.similar('Stage')

day time face warre ayre king bleeding man reuolt serieant like
knowledge broyle shew head spring heeles hare thane skie


#Analyze the Frequency of Words
The **nltk.FreqDist()** function counts how many times each token occurs in a text

In [0]:
fd = nltk.FreqDist(macbeth)

If you want to see the 10 most common words in the text, use the **most_common()** function

In [30]:
fd.most_common(10)

[(',', 1962),
 ('.', 1235),
 ("'", 637),
 ('the', 531),
 (':', 477),
 ('and', 376),
 ('I', 333),
 ('of', 315),
 ('to', 311),
 ('?', 241)]
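
In NLTK 3, **FreqDist** builds on Python's **collections.Counter**, so the counting can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

# Made-up token list standing in for a tokenized text.
tokens = ['the', 'king', ',', 'the', 'thane', ',', 'and', 'the', 'crowne', '.']
counts = Counter(tokens)

# most_common(n) returns the n highest-count (token, count) pairs.
print(counts.most_common(2))

# Relative frequency of a token: its count divided by the total number of tokens.
print(counts['the'] / sum(counts.values()))
```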

These tokens carry little meaning during text analysis, so it is often necessary to eliminate them. Very common words of this kind, such as 'the', 'and', and 'of', are called stopwords

The NLTK library provides a list of pre-selected stopwords

In [31]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Once downloaded, select only those related to English,

In [32]:
sw = set(nltk.corpus.stopwords.words('english'))
#python set() is an unordered collection with no duplicate elements
print(len(sw))
list(sw)[:10]

179


['be', 'has', "should've", 'once', 'than', 't', 'up', 'a', 'this', 'these']

Use these stopwords to filter the **macbeth** variable,

In [33]:
macbeth_filtered = [w for w in macbeth if w.lower() not in sw]
fd = nltk.FreqDist(macbeth_filtered)
fd.most_common(10)

[(',', 1962),
 ('.', 1235),
 ("'", 637),
 (':', 477),
 ('?', 241),
 ('Macb', 137),
 ('haue', 117),
 ('-', 100),
 ('Enter', 80),
 ('thou', 63)]

Punctuation still needs to be eliminated. The set of punctuation characters can be obtained from the **string** module, as **string.punctuation**

In [0]:
import string
punctuation = set(string.punctuation)

macbeth_filtered2 = [w.lower() for w in macbeth
                     if w.lower() not in sw
                        and w.lower() not in punctuation]

In [36]:
fd = nltk.FreqDist(macbeth_filtered2)
fd.most_common(10)

[('macb', 137),
 ('haue', 122),
 ('thou', 90),
 ('enter', 81),
 ('shall', 68),
 ('macbeth', 62),
 ('vpon', 62),
 ('thee', 61),
 ('macd', 58),
 ('vs', 57)]
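
One caveat: testing membership in **string.punctuation** only matches single-character tokens, so a multi-character punctuation token would pass through the filter. The sketch below compares that test with a per-character check. The `stop` set is a hypothetical miniature stand-in for the NLTK list, and the tokens are made up.

```python
import string

punctuation = set(string.punctuation)
# Hypothetical miniature stopword set, standing in for the NLTK English list.
stop = {'the', 'and', 'of', 'to', 'i'}

tokens = ['The', 'Thane', 'of', 'Cawdor', ',', '--', 'and', 'Macbeth', '!']

def is_punctuation(token):
    """True when every character of the token is a punctuation character."""
    return all(ch in punctuation for ch in token)

# Same test as in the cell above: only single-character tokens are caught.
single_char_filter = [t.lower() for t in tokens
                      if t.lower() not in stop and t.lower() not in punctuation]

# Per-character test: also drops tokens such as '--'.
per_char_filter = [t.lower() for t in tokens
                   if t.lower() not in stop and not is_punctuation(t)]

print(single_char_filter)
print(per_char_filter)
```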

#Selection of Words from Text

Extract words based on their length,

In [37]:
long_words = [w for w in macbeth if len(w) > 12]
sorted(long_words)

['Assassination',
 'Chamberlaines',
 'Distinguishes',
 'Gallowgrosses',
 'Metaphysicall',
 'Northumberland',
 'Voluptuousnesse',
 'commendations',
 'multitudinous',
 'supernaturall',
 'vnaccompanied']

Look for all the words that contain a certain sequence of characters, such as **'ious'**.

In [38]:
ious_words = [w for w in macbeth if 'ious' in w]
ious_words = set(ious_words)
#set() is an unordered collection with no duplicate elements

sorted(ious_words)

['Auaricious',
 'Gracious',
 'Industrious',
 'Iudicious',
 'Luxurious',
 'Malicious',
 'Obliuious',
 'Pious',
 'Rebellious',
 'compunctious',
 'furious',
 'gracious',
 'pernicious',
 'pernitious',
 'pious',
 'precious',
 'rebellious',
 'sacrilegious',
 'serious',
 'spacious',
 'tedious']
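
The list above contains both 'Gracious' and 'gracious' because the match is case-sensitive. As a small variation (a sketch on made-up tokens, not part of the original analysis), lowercasing before building the set merges such pairs, and **endswith()** restricts the match to the end of the word.

```python
sample = ['Gracious', 'gracious', 'Pious', 'pious', 'furiously', 'King', 'tedious']

# 'ious' in w matches the sequence anywhere in the word.
anywhere = sorted({w.lower() for w in sample if 'ious' in w.lower()})

# endswith() keeps only the words that end with the sequence.
suffix_only = sorted({w.lower() for w in sample if w.lower().endswith('ious')})

print(anywhere)
print(suffix_only)
```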

#Bigrams and Collocations
*Bigrams* are pairs of adjacent words rather than single words. The words “is” and “yellow”, for example, form a bigram, since their combination is possible and meaningful, so “is yellow” can be found in textual data.

Some bigrams occur together much more often than others, such as “fast food”, “pay attention”, and “good morning”. These bigrams are called *collocations*.

In [39]:
bgrms = nltk.FreqDist(nltk.bigrams(macbeth_filtered2))
bgrms.most_common(15)

[(('enter', 'macbeth'), 16),
 (('exeunt', 'scena'), 15),
 (('thane', 'cawdor'), 13),
 (('knock', 'knock'), 10),
 (('st', 'thou'), 9),
 (('thou', 'art'), 9),
 (('lord', 'macb'), 9),
 (('haue', 'done'), 8),
 (('macb', 'haue'), 8),
 (('good', 'lord'), 8),
 (('let', 'vs'), 7),
 (('enter', 'lady'), 7),
 (('wee', 'l'), 7),
 (('would', 'st'), 6),
 (('macbeth', 'macb'), 6)]

You can also use *trigrams*, which are sequences of three adjacent words

In [40]:
tgrms = nltk.FreqDist(nltk.trigrams(macbeth_filtered2))
tgrms.most_common(10)

[(('knock', 'knock', 'knock'), 6),
 (('enter', 'macbeth', 'macb'), 5),
 (('enter', 'three', 'witches'), 4),
 (('exeunt', 'scena', 'secunda'), 4),
 (('good', 'lord', 'macb'), 4),
 (('three', 'witches', '1'), 3),
 (('exeunt', 'scena', 'tertia'), 3),
 (('thunder', 'enter', 'three'), 3),
 (('exeunt', 'scena', 'quarta'), 3),
 (('scena', 'prima', 'enter'), 3)]
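
Bigrams and trigrams are both sliding windows over the token list. The sketch below shows the idea with the standard library only; the helper `ngrams` is a hypothetical name for this example (not the NLTK function), and the tokens are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    # zip() stops at the shortest shifted copy, so incomplete windows are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['knock', 'knock', 'knock', 'enter', 'macbeth', 'knock', 'knock']

bigram_counts = Counter(ngrams(tokens, 2))
trigram_list = ngrams(tokens, 3)

print(bigram_counts.most_common(1))
print(len(trigram_list), trigram_list[0])
```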

#Use Text on the Network
The **urllib** library allows you to connect to web pages and download their text content from the Internet, including HTML pages.

In [0]:
from urllib import request

In [0]:
url = 'http://www.gutenberg.org/files/2554/2554-0.txt'
response = request.urlopen(url)
raw = response.read().decode('utf8')

In [44]:
raw[:75]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

The text begins with the Unicode character *\ufeff*, the byte order mark (BOM) that some files carry at the start. It appears because the **utf8** codec, which is valid in most cases, decodes the BOM as an ordinary character. The most suitable codec in this case is **utf-8-sig**, which strips a leading BOM.

In [50]:
url = 'http://www.gutenberg.org/files/2554/2554-0.txt'
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
raw[:75]

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
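
The difference between the two codecs can be checked offline. This sketch builds a short byte string with a UTF-8 BOM by hand (the sample text is an assumption, not the downloaded file) and decodes it both ways.

```python
import codecs

# Bytes as a server might send them: UTF-8 text preceded by a BOM.
data = codecs.BOM_UTF8 + 'The Project'.encode('utf-8')

with_bom = data.decode('utf-8')         # the BOM survives as '\ufeff'
without_bom = data.decode('utf-8-sig')  # the BOM is stripped

print(repr(with_bom))
print(repr(without_bom))

# utf-8-sig also accepts input that has no BOM at all.
print('plain'.encode('utf-8').decode('utf-8-sig'))
```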

Convert into a corpus compatible with NLTK

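**nltk.word_tokenize()** splits the character text into tokens (words). It relies on the Punkt tokenizer data, which may need to be downloaded first (the `punkt` package, or `punkt_tab` in recent NLTK releases)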

**nltk.Text()** converts tokens into a textual body suitable for NLTK

In [0]:
url = 'http://www.gutenberg.org/files/2554/2554-0.txt'
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
tokens = nltk.word_tokenize(raw)
webtext = nltk.Text(tokens)
webtext[:12]

#Extract the Text from the HTML Pages

In [55]:
url = 'http://news.bbc.co.uk/2/hi/health/2284783.stm'
html = request.urlopen(url).read().decode('utf8')
html[:120]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\r\n<html>\r\n<hea'

Conversion into an NLTK corpus needs an additional library, **bs4**, and its **BeautifulSoup** class, which provides you with suitable parsers that can recognize HTML tags and extract the text contained in them. The 'lxml' parser used here requires the **lxml** package to be installed.

In [0]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'lxml').get_text()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
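
To show what "extracting the text" means, here is a minimal sketch using only the standard library's **html.parser**. It is a stand-in for illustration, not how BeautifulSoup works internally. The class name `TextExtractor`, the choice to skip `<script>` and `<style>` contents, and the sample HTML are assumptions of this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

# Made-up page used only for this illustration.
sample_html = ('<html><head><title>News</title>'
               '<style>p {color: red}</style></head>'
               '<body><p>A sample <b>headline</b> for testing</p></body></html>')

parser = TextExtractor()
parser.feed(sample_html)
parser.close()
extracted = ' '.join(parser.parts)
print(extracted)
```

The resulting string can then be tokenized in the same way as the `raw` variable above.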

#Sentiment Analysis

In [58]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [0]:
import random
reviews = nltk.corpus.movie_reviews
documents = [(list(reviews.words(fileid)), category)
             for category in reviews.categories()
             for fileid in reviews.fileids(category)]
random.shuffle(documents)

Each element of the **documents** variable contains two fields: the first is the review (as a list of words) and the second is its evaluation ('pos' or 'neg')

In [66]:
first_review = ' '.join(documents[0][0])
#                       ^documents[review #][0:'review content', 1:evaluation]
print(first_review)

i don ' t box with kid gloves . i don ' t play nice , i ' m not a nice guy , and i never , ever , go easy on a film . i consider it to be a breech of some sort of code of ethics for a movie critic . however , i do some favors , and these often come in the form of points that i hand to certain groups due to the artistic bravery . rigormortis , the production company that has been my prime example of how money does not need to motivate filmmaking , gets several of these points each time . i still , however , will not go easy on them . they recently sent me a vhs copy of their down with america trilogy ( which begins , quite wittily , with a disclaimer that they are not trying to undermine america with the making of this film . ) and i decided to spend an hour of my day watching it . in the famous lines of many martyrs , i have no regrets . well , i do have some regrets , but that is not the point in the previous sentence . the point of it was that down with america was a film that , from

In [67]:
documents[0][1]

'pos'

Create a frequency distribution of all the words in the corpus, then convert this distribution into a list of words with the **list()** function

In [0]:
all_words = nltk.FreqDist(w.lower() for w in reviews.words())
word_features = list(all_words)

Define a function for the calculation of the features: words that are important enough to establish the opinion of a review

In [0]:
def document_features(document, word_features):
  document_words = set(document)
  features = {}
  for word in word_features:
    features['{}'.format(word)] = (word in document_words)
  return features
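
A quick check on toy input shows what the function returns. The sketch below restates the function in compact form so that it runs on its own; the `vocabulary` and `review` values are hypothetical.

```python
def document_features(document, word_features):
    """Map every word in word_features to True/False: is it in the document?"""
    document_words = set(document)
    return {word: (word in document_words) for word in word_features}

vocabulary = ['good', 'bad', 'plot']   # hypothetical tiny vocabulary
review = ['a', 'good', 'plot', 'and', 'good', 'acting']

features = document_features(review, vocabulary)
print(features)
```

Note that the second argument is the vocabulary to test against, not the label of the review.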

Now you can create feature sets from documents

In [0]:
# Pair each feature dict with its label ('pos' or 'neg').
# With the full vocabulary this step is memory- and time-intensive.
featuresets = [(document_features(d, word_features), c) for (d, c) in documents]

The aim is to create a set of all the words contained in the whole movie corpus, check whether each one is present (True or False) in each single review, and see how much it contributes to a positive or negative judgment.

In [72]:
len(featuresets)

2000

Use the first 1500 elements for the training set and the last 500 for the testing set

In [0]:
train_set, test_set = featuresets[:1500], featuresets[1500:]

Apply the Naive Bayes classifier. Each element of the training set is a (features, label) pair,

In [77]:
features, label = train_set[0]
print(label, len(features))

In [74]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
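
**nltk.NaiveBayesClassifier.train()** expects a list of (feature dict, label) pairs, so it can help to check that shape before training. The sketch below does this on toy data with the standard library only. The helper `is_labeled_featureset` is a hypothetical name, and the 75/25 split is chosen to mirror the 1500/500 split above.

```python
def is_labeled_featureset(item):
    """True when item looks like a (feature_dict, label) pair."""
    return (isinstance(item, tuple) and len(item) == 2
            and isinstance(item[0], dict))

# Toy data in the expected shape: (features, label) pairs.
labeled = [({'good': True, 'bad': False}, 'pos'),
           ({'good': False, 'bad': True}, 'neg'),
           ({'good': True, 'bad': True}, 'pos'),
           ({'good': False, 'bad': False}, 'neg')]

# A bare feature dict without its label does not have that shape.
unlabeled = [{'good': True, 'bad': False}]

# Hold out the last quarter of the data for testing.
split = int(len(labeled) * 0.75)
train, test = labeled[:split], labeled[split:]

print(all(is_labeled_featureset(x) for x in labeled))
print(all(is_labeled_featureset(x) for x in unlabeled))
print(len(train), len(test))
```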