## Processing a text file

In [1]:
import nltk

In [None]:
f = open('romeo.txt', 'r')  # open romeo.txt in read mode

In [None]:
rom = f.read()   # use f.read to read file data and store it in variable rom
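
The file handle `f` opened above is never closed. A minimal sketch of the same read using a `with` block, which closes the file automatically. The temporary sample file and the UTF-8 encoding are assumptions made so the sketch runs anywhere; they are not the notebook's `romeo.txt`.

```python
import os
import tempfile

# Write a small stand-in file (hypothetical content, not romeo.txt itself).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as out:
    out.write("But, soft! what light through yonder window breaks?\n")

# The `with` block closes the file as soon as the read finishes.
with open(sample_path, "r", encoding="utf-8") as f:
    sample_text = f.read()

print(sample_text[:20])
```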

In [None]:
print(rom[:1000])   # print the first 1000 characters

In [None]:
romList = rom.split()
print(romList)

In [None]:
romText = nltk.Text(romList)
romText.concordance('she')
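
To make clearer what a concordance is, here is a rough pure-Python sketch that lists each occurrence of a word with a few tokens of context on either side. `simple_concordance` and its `width` parameter are hypothetical names for illustration; this is not how `nltk.Text.concordance` is implemented, and its output format differs.

```python
def simple_concordance(tokens, target, width=4):
    """Return each occurrence of `target` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():      # compare case-insensitively
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample_tokens = "Be not her maid, since she is envious; Her vestal livery".split()
for line in simple_concordance(sample_tokens, "she"):
    print(line)   # not her maid, since [she] is envious; Her vestal
```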

In [None]:
nltk.download("punkt")

In [None]:
rWords = nltk.word_tokenize(rom)
rWords
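
The difference between `rom.split()` and `nltk.word_tokenize(rom)` is mainly how punctuation is handled. The sketch below approximates that with a regular expression; `rough_tokenize` is a hypothetical helper, and its results will not match NLTK's tokenizer in every case (contractions, for example).

```python
import re

def rough_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

line = "But, soft! what light through yonder window breaks?"
print(line.split()[:2])          # ['But,', 'soft!']  punctuation stays attached
print(rough_tokenize(line)[:4])  # ['But', ',', 'soft', '!']  punctuation is split off
```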

In [None]:
rSent = nltk.sent_tokenize(rom)

In [None]:
nltk.download("stopwords")

In [None]:
import string
string.punctuation

In [None]:
stopwords = nltk.corpus.stopwords.words("english")
useless = stopwords + list(string.punctuation)
useless

In [None]:
bag = []
for w in rWords:
    if w not in useless:
        bag.append(w)
print(bag)

In [None]:
bag = [w for w in rWords if w not in useless]    # one-line code
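
NLTK's English stop word list is all lowercase, so capitalized tokens such as `'But'` or `'It'` pass the filter above. One possible refinement, sketched here under the assumption that case should be ignored, is to lowercase each token for the comparison and to use a set for faster membership tests. The small stop word set below is a hypothetical stand-in for the full list.

```python
import string

# Hypothetical mini stop word list; the notebook uses NLTK's full English list.
sample_stopwords = {"but", "it", "is", "the", "and", "what", "through"}
useless_set = sample_stopwords | set(string.punctuation)

sample_tokens = ["But", ",", "soft", "!", "what", "light", "through",
                 "yonder", "window", "breaks", "?", "It", "is", "the", "east"]

# Lowercase each token only for the comparison; keep the original spelling.
filtered = [w for w in sample_tokens if w.lower() not in useless_set]
print(filtered)   # ['soft', 'light', 'yonder', 'window', 'breaks', 'east']
```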

## Getting text from an HTML file

In [None]:
import urllib.request

In [None]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"

In [None]:
html = urllib.request.urlopen(url).read()    # fetch the page and read its raw bytes

In [None]:
html[:1000]

In [None]:
from bs4 import BeautifulSoup   # BeautifulSoup parses HTML and can extract its text

In [None]:
raw = BeautifulSoup(html, "html.parser").get_text()   # name the parser explicitly to avoid a warning
raw[:1000]
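
To illustrate what stripping the markup involves, here is a rough sketch using only the standard library's `html.parser`. `TextExtractor` and the sample HTML string are hypothetical; skipping `<script>` and `<style>` is a choice made in this sketch, and the result is not guaranteed to match what BeautifulSoup's `get_text()` returns.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

# Hypothetical sample page, not the content of the BBC URL.
sample_html = ("<html><head><style>p {color: red}</style></head>"
               "<body><h1>Health news</h1><p>Scientists <b>study</b> genes.</p></body></html>")

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()
plain_text = " ".join(extractor.parts)
print(plain_text)   # Health news Scientists study genes.
```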

In [None]:
tokens = nltk.word_tokenize(raw)
tokens

In [None]:
text = nltk.Text(tokens)
text

In [None]:
text.concordance('gene')

## Movie Reviews Dataset
An NLTK corpus of 2000 text files, each a review of a movie. They are split into a `neg` folder
for the negative reviews and a `pos` folder for the positive reviews.

In [None]:
nltk.download("movie_reviews")

In [None]:
nltk.download()   # to download nltk data interactively

In [None]:
from nltk.corpus import movie_reviews

The `fileids` method provided by all the datasets in `nltk.corpus` gives access to a list of all the files available.

In [None]:
len(movie_reviews.fileids())

In [None]:
movie_reviews.fileids()[:5]

`fileids` can also filter the available files by category, which is the name of the subfolder they are located in. This gives us separate lists of positive and negative reviews.

In [None]:
negative_fileids = movie_reviews.fileids('neg')
positive_fileids = movie_reviews.fileids('pos')
len(negative_fileids), len(positive_fileids)
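
Since a category is just the subfolder name, the grouping can be pictured as splitting each file id at the slash. A small sketch with hypothetical file ids (the real ids in the corpus have different names):

```python
from collections import defaultdict

# Hypothetical file ids in "<category>/<name>" form.
sample_fileids = ["neg/review_001.txt", "neg/review_002.txt", "pos/review_001.txt"]

by_category = defaultdict(list)
for fileid in sample_fileids:
    category, _, _name = fileid.partition("/")   # text before the first slash
    by_category[category].append(fileid)

print({cat: len(ids) for cat, ids in by_category.items()})   # {'neg': 2, 'pos': 1}
```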

We can inspect one of the reviews using the `raw` method of `movie_reviews`. Each file is split into sentences, and the curators of this dataset also removed any direct mention of the movie's rating from each review.

In [None]:
print(movie_reviews.raw(fileids=positive_fileids[0]))   

The `movie_reviews` corpus already provides direct access to tokenized text through the `words` method:

In [None]:
movie_reviews.words(fileids=positive_fileids[0])

In [None]:
all_words = movie_reviews.words()
len(all_words)                        # how many words are in the entire movie reviews corpus?

In [None]:
filtered_words = [w for w in movie_reviews.words() if w not in useless] 

In [None]:
from collections import Counter
word_counter = Counter(filtered_words)

In [None]:
most_common_words = word_counter.most_common(10)
most_common_words
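
A tiny worked example of how `Counter` tallies tokens and how `most_common` orders them, using a hypothetical word list instead of the full corpus:

```python
from collections import Counter

sample_words = ["film", "good", "film", "plot", "good", "film"]
sample_counter = Counter(sample_words)

# most_common(n) returns the n most frequent items as (word, count) pairs.
print(sample_counter.most_common(2))   # [('film', 3), ('good', 2)]
```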

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
sorted_word_counts = sorted(word_counter.values(), reverse=True)

plt.loglog(sorted_word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank");

In [None]:
plt.hist(sorted_word_counts, bins=50);

In [None]:
plt.hist(sorted_word_counts, bins=50, log=True);

## Train a Classifier for Sentiment Analysis

Using the `build_bag_of_words_filtered` function defined below, we can build the negative and positive features separately.
For each of the 1000 negative and 1000 positive reviews, we create one dictionary of its words and attach the label "neg" or "pos" to it.

In [None]:
def build_bag_of_words_filtered(words):
    # map each word that is not a stop word or punctuation mark to 1
    return {word: 1 for word in words if word not in useless}
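
To see what such a feature dictionary looks like, here is the same idea applied to a short hypothetical token list. `sample_useless` stands in for the notebook's `useless` list, and `bag_of_words` takes it as a parameter so the sketch is self-contained.

```python
# Hypothetical stand-in for the notebook's `useless` list.
sample_useless = {"the", "was", "and", ",", "."}

def bag_of_words(words, ignore):
    """Map every word not in `ignore` to 1; repeated words collapse into one key."""
    return {word: 1 for word in words if word not in ignore}

review_tokens = ["the", "plot", "was", "dull", ",", "and",
                 "the", "acting", "was", "dull", "."]
features = bag_of_words(review_tokens, sample_useless)
print(features)   # {'plot': 1, 'dull': 1, 'acting': 1}
```

Note that the dictionary records only whether a word is present, not how often it occurs.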

In [None]:
negative_features = [
    (build_bag_of_words_filtered(movie_reviews.words(fileids=[f])), 'neg')
    for f in negative_fileids
]

In [None]:
print(negative_features[3])

In [None]:
positive_features = [
    (build_bag_of_words_filtered(movie_reviews.words(fileids=[f])), 'pos')
    for f in positive_fileids
]

In [None]:
print(positive_features[6])

One of the simplest supervised machine learning classifiers is the Naive Bayes classifier. We will train it on 80% of the data to learn which words are generally associated with positive or negative reviews.
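
For intuition, here is a minimal Naive Bayes sketch over word-presence dictionaries. It is an illustration under simplifying assumptions (add-one smoothing, only words present in the review are scored); `train_naive_bayes` and `classify` are hypothetical helpers, not NLTK's implementation, and the four training reviews are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled):
    """labelled: list of (feature_dict, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)   # label -> word -> reviews containing it
    vocab = set()
    for features, label in labelled:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(model, features):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)            # log prior
        for word in features:
            if word in vocab:                        # ignore unseen words
                p = (word_counts[label][word] + 1) / (n_docs + 2)  # add-one smoothing
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [({"great": 1, "fun": 1}, "pos"), ({"great": 1, "moving": 1}, "pos"),
         ({"boring": 1, "bad": 1}, "neg"), ({"bad": 1, "slow": 1}, "neg")]
model = train_naive_bayes(train)
print(classify(model, {"great": 1}))               # pos
print(classify(model, {"bad": 1, "boring": 1}))    # neg
```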

In [None]:
from nltk.classify import NaiveBayesClassifier

In [None]:
split = 800   # first 800 reviews of each class (80%) are used for training

In [None]:
sentiment_classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])

In [None]:
nltk.classify.util.accuracy(sentiment_classifier, positive_features[:split]+negative_features[:split])*100

The accuracy above is mostly a check that nothing went very wrong during training. The real measure of accuracy is on the remaining 20% of the data that wasn't used in training, the test data:

In [None]:
nltk.classify.util.accuracy(sentiment_classifier, positive_features[split:]+negative_features[split:])*100
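
Accuracy here is simply the fraction of labelled examples the classifier gets right. The sketch below computes it by hand on a toy data set to show why the held-out score is the one that matters; `accuracy`, `keyword_rule`, and the sample reviews are all hypothetical, and the rule stands in for a trained classifier.

```python
def accuracy(classify, labelled):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labelled if classify(features) == label)
    return correct / len(labelled)

# Hypothetical rule standing in for a trained classifier.
def keyword_rule(features):
    return "pos" if "great" in features else "neg"

sample_pos = [({"great": 1}, "pos"), ({"great": 1, "cast": 1}, "pos"), ({"fine": 1}, "pos")]
sample_neg = [({"dull": 1}, "neg"), ({"bad": 1}, "neg"), ({"great": 1, "waste": 1}, "neg")]

split = 2   # first two of each class for training, the rest held out
train_set = sample_pos[:split] + sample_neg[:split]
test_set = sample_pos[split:] + sample_neg[split:]

print(accuracy(keyword_rule, train_set) * 100)   # 100.0
print(accuracy(keyword_rule, test_set) * 100)    # 0.0
```

A rule that fits the training examples perfectly can still fail on unseen reviews, which is why the two numbers are reported separately.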

Accuracy here is around 70%, which is pretty good for such a simple model if we consider that the estimated accuracy for a person is about 80%. Finally, we can print the most informative features, i.e. the words that most strongly identify a positive or a negative review:

In [None]:
sentiment_classifier.show_most_informative_features()

In [2]:
f = open('romeo.txt', 'r')  # open romeo.txt in read mode

In [6]:
rom = f.read()

In [7]:
rom

"But, soft! what light through yonder window breaks?\nIt is the east, and Juliet is the sun.\nArise, fair sun, and kill the envious moon,\nWho is already sick and pale with grief,\nThat thou her maid art far more fair than she:\nBe not her maid, since she is envious;\nHer vestal livery is but sick and green\nAnd none but fools do wear it; cast it off.\nIt is my lady, O, it is my love!\nO, that she knew she were!\nShe speaks yet she says nothing: what of that?\nHer eye discourses; I will answer it.\nI am too bold, 'tis not to me she speaks:\nTwo of the fairest stars in all the heaven,\nHaving some business, do entreat her eyes\nTo twinkle in their spheres till they return.\nWhat if her eyes were there, they in her head?\nThe brightness of her cheek would shame those stars,\nAs daylight doth a lamp; her eyes in heaven\nWould through the airy region stream so bright\nThat birds would sing and think it were not night.\nSee, how she leans her cheek upon her hand!\nO, that I were a glove upo

In [8]:
romList = rom.split()
print(romList)

['But,', 'soft!', 'what', 'light', 'through', 'yonder', 'window', 'breaks?', 'It', 'is', 'the', 'east,', 'and', 'Juliet', 'is', 'the', 'sun.', 'Arise,', 'fair', 'sun,', 'and', 'kill', 'the', 'envious', 'moon,', 'Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief,', 'That', 'thou', 'her', 'maid', 'art', 'far', 'more', 'fair', 'than', 'she:', 'Be', 'not', 'her', 'maid,', 'since', 'she', 'is', 'envious;', 'Her', 'vestal', 'livery', 'is', 'but', 'sick', 'and', 'green', 'And', 'none', 'but', 'fools', 'do', 'wear', 'it;', 'cast', 'it', 'off.', 'It', 'is', 'my', 'lady,', 'O,', 'it', 'is', 'my', 'love!', 'O,', 'that', 'she', 'knew', 'she', 'were!', 'She', 'speaks', 'yet', 'she', 'says', 'nothing:', 'what', 'of', 'that?', 'Her', 'eye', 'discourses;', 'I', 'will', 'answer', 'it.', 'I', 'am', 'too', 'bold,', "'tis", 'not', 'to', 'me', 'she', 'speaks:', 'Two', 'of', 'the', 'fairest', 'stars', 'in', 'all', 'the', 'heaven,', 'Having', 'some', 'business,', 'do', 'entreat', 'her', 'eyes', 'T

In [9]:
romText = nltk.Text(romList)
romText.concordance('she')

Displaying 7 of 7 matches:
fair than she: Be not her maid, since she is envious; Her vestal livery is but 
is my lady, O, it is my love! O, that she knew she were! She speaks yet she say
y, O, it is my love! O, that she knew she were! She speaks yet she says nothing
s my love! O, that she knew she were! She speaks yet she says nothing: what of 
hat she knew she were! She speaks yet she says nothing: what of that? Her eye d
wer it. I am too bold, 'tis not to me she speaks: Two of the fairest stars in a
and think it were not night. See, how she leans her cheek upon her hand! O, tha


In [10]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\praja_m3gddx7\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [11]:
rWords = nltk.word_tokenize(rom)
rWords

['But',
 ',',
 'soft',
 '!',
 'what',
 'light',
 'through',
 'yonder',
 'window',
 'breaks',
 '?',
 'It',
 'is',
 'the',
 'east',
 ',',
 'and',
 'Juliet',
 'is',
 'the',
 'sun',
 '.',
 'Arise',
 ',',
 'fair',
 'sun',
 ',',
 'and',
 'kill',
 'the',
 'envious',
 'moon',
 ',',
 'Who',
 'is',
 'already',
 'sick',
 'and',
 'pale',
 'with',
 'grief',
 ',',
 'That',
 'thou',
 'her',
 'maid',
 'art',
 'far',
 'more',
 'fair',
 'than',
 'she',
 ':',
 'Be',
 'not',
 'her',
 'maid',
 ',',
 'since',
 'she',
 'is',
 'envious',
 ';',
 'Her',
 'vestal',
 'livery',
 'is',
 'but',
 'sick',
 'and',
 'green',
 'And',
 'none',
 'but',
 'fools',
 'do',
 'wear',
 'it',
 ';',
 'cast',
 'it',
 'off',
 '.',
 'It',
 'is',
 'my',
 'lady',
 ',',
 'O',
 ',',
 'it',
 'is',
 'my',
 'love',
 '!',
 'O',
 ',',
 'that',
 'she',
 'knew',
 'she',
 'were',
 '!',
 'She',
 'speaks',
 'yet',
 'she',
 'says',
 'nothing',
 ':',
 'what',
 'of',
 'that',
 '?',
 'Her',
 'eye',
 'discourses',
 ';',
 'I',
 'will',
 'answer',
 'it',


In [12]:
rSent = nltk.sent_tokenize(rom)

In [13]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\praja_m3gddx7\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [14]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [15]:
stopwords = nltk.corpus.stopwords.words("english")
useless = stopwords + list(string.punctuation)
useless

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [16]:
bag = []
for w in rWords:
    if w not in useless:
        bag.append(w)
print(bag)

['But', 'soft', 'light', 'yonder', 'window', 'breaks', 'It', 'east', 'Juliet', 'sun', 'Arise', 'fair', 'sun', 'kill', 'envious', 'moon', 'Who', 'already', 'sick', 'pale', 'grief', 'That', 'thou', 'maid', 'art', 'far', 'fair', 'Be', 'maid', 'since', 'envious', 'Her', 'vestal', 'livery', 'sick', 'green', 'And', 'none', 'fools', 'wear', 'cast', 'It', 'lady', 'O', 'love', 'O', 'knew', 'She', 'speaks', 'yet', 'says', 'nothing', 'Her', 'eye', 'discourses', 'I', 'answer', 'I', 'bold', "'t", 'speaks', 'Two', 'fairest', 'stars', 'heaven', 'Having', 'business', 'entreat', 'eyes', 'To', 'twinkle', 'spheres', 'till', 'return', 'What', 'eyes', 'head', 'The', 'brightness', 'cheek', 'would', 'shame', 'stars', 'As', 'daylight', 'doth', 'lamp', 'eyes', 'heaven', 'Would', 'airy', 'region', 'stream', 'bright', 'That', 'birds', 'would', 'sing', 'think', 'night', 'See', 'leans', 'cheek', 'upon', 'hand', 'O', 'I', 'glove', 'upon', 'hand', 'That', 'I', 'might', 'touch', 'cheek']


In [17]:
bag = [w for w in rWords if w not in useless]    # one-line code