## Week 5: Word-Level Text Analysis


Topics to Cover
---------------

-   comparing word frequency between authors

-   part-of-speech (POS) tagging

-   POS frequency comparison

-   Naive Bayes text classification

-   sentiment analysis

-   using WordNet


### Class Objective
Use text analysis techniques introduced by Montfort to examine and compare small text corpora.

#### Loading Corpora
Today we will be analyzing and comparing two small text corpora. Choose two text sets from the following list:

- [Works of Ralph Waldo Emerson](http://www.stephenmclaughlin.net/pcda/sample-data/week-4/Emerson.zip)
- [Works of Oscar Wilde](http://www.stephenmclaughlin.net/pcda/sample-data/week-4/Wilde.zip)

Open Terminal in macOS and launch our Docker container:

In [1]:
import nltk
import textblob
from operator import itemgetter
from pprint import pprint

In [2]:
# Download zipped texts from GitHub, then unzip the directory.

import os
os.chdir('/sharedfolder/')

!wget "https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Emerson.zip?raw=true" -O Emerson.zip
!unzip Emerson.zip

!wget "https://github.com/pcda17/pcda17.github.io/blob/master/week/5/Wilde.zip?raw=true" -O Wilde.zip
!unzip Wilde.zip

In [3]:
## Load each author's text files as a list of strings.

corpus1_dir = "/sharedfolder/Emerson/"
corpus2_dir = "/sharedfolder/Wilde/"

##

os.chdir(corpus1_dir)

corpus1_filenames = os.listdir("./")

corpus1_texts=[]

for filename in corpus1_filenames:
    with open(filename) as f:  # the file is closed automatically after reading
        text = f.read().replace("\n"," ") #replaces newline characters with spaces
    corpus1_texts.append(text)

##
    
os.chdir(corpus2_dir)

corpus2_filenames = os.listdir("./")

corpus2_texts=[]

for filename in corpus2_filenames:
    with open(filename) as f:  # the file is closed automatically after reading
        text = f.read().replace("\n"," ") #replaces newline characters with spaces
    corpus2_texts.append(text)
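
The two loading loops above do the same work, so they could be folded into one helper. Here is a minimal sketch; the name `load_texts` is ours, and it assumes the unzipped files end in `.txt`, which you should check against your own folders.

```python
from pathlib import Path

def load_texts(directory):
    """Return the contents of every .txt file in `directory` as a list of strings."""
    texts = []
    # sorted() gives a stable order; the "*.txt" pattern (an assumption about
    # the file names) skips stray files such as .DS_Store
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="replace" keeps the loop going if a file contains unexpected bytes
        with open(path, encoding="utf-8", errors="replace") as f:
            texts.append(f.read().replace("\n", " "))  # newlines become spaces
    return texts
```

With this in place, `corpus1_texts = load_texts("/sharedfolder/Emerson/")` would replace the first loop, and no `os.chdir` call is needed.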

In [4]:
import random

print('Number of texts:')
print(len(corpus2_texts))

print()

random_text = random.choice(corpus2_texts)
print('Random text head: \n\n' + random_text[:4000])

Number of texts:
17

Random text head: 

                                      1890                                    CHARMIDES                                  by Oscar Wilde                         I      He was a Grecian lad, who coming home       With pulpy figs and wine from Sicily     Stood at his galley's prow, and let the foam       Blow through his crisp brown curls unconsciously,     And holding wind and wave in boy's despite     Peered from his dripping seat across the wet and         stormy night.      Till with the dawn he saw a burnished spear       Like a thin thread of gold against the sky,     And hoisted sail, and strained the creeking gear,       And bade the pilot head her lustily     Against the nor-west gale, and all day long     Held on his way, and marked the rowers' time with         measured song.      And when the faint Corinthian hills were red       Dropped anchor in a little sandy bay,     And with fresh boughs of olive crowned his head,       And brushed

## *> TextBlob Review*

Let’s review the TextBlob package, introduced in this week’s reading by Nick Montfort. First, let’s load TextBlob and convert two texts to lists of words. 

Note that each is contained in a WordList object, which we can manipulate as if it were an ordinary list.


In [5]:
from textblob import TextBlob

text1=TextBlob(corpus1_texts[0])
print(text1.words[:15])

print()

text2=TextBlob(corpus2_texts[0])
print(text2.words[:15])

['AN', 'ADDRESS', 'Delivered', 'before', 'the', 'Senior', 'Class', 'in', 'Divinity', 'College', 'Cambridge', 'Sunday', 'Evening', 'July', '15']

['1898', 'THE', 'BALLAD', 'OF', 'READING', 'GAOL', 'by', 'Oscar', 'Wilde', 'I', 'He', 'did', 'not', 'wear', 'his']


In [6]:
print(text1.sentences[:5])
print()
print(text2.sentences[:5])

[Sentence("        AN ADDRESS           _Delivered before the Senior Class in Divinity College, Cambridge, Sunday Evening, July 15, 1838_             In this refulgent summer, it has been a luxury to draw the breath of life."), Sentence("The grass grows, the buds burst, the meadow is spotted with fire and gold in the tint of flowers."), Sentence("The air is full of birds, and sweet with the breath of the pine, the balm-of-Gilead, and the new hay."), Sentence("Night brings no gloom to the heart with its welcome shade."), Sentence("Through the transparent darkness the stars pour their almost spiritual rays.")]

[Sentence("                                      1898                            THE BALLAD OF READING GAOL                                  by Oscar Wilde                         I          He did not wear his scarlet coat,           For blood and wine are red,         And blood and wine were on his hands           When they found him with the dead,         The poor dead woman whom he loved,      

In [7]:
print(sorted(text1.words)[:500])  # prints first 500 words in alphabetized word list

["'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", "'s", '1', '15', '1838', '2', 'A', 'A', 'A', 'A', 'A', 'A', 'ADDRESS', 'AN', 'Accept', 'Ah', 'Alas', 'All', 'All', 'All', 'All', 'All', 'All', 'Alone', 'Already', 'Always', 'America', 'America', 'And', 'And', 'And', 'And', 'And', 'And', 'And', 'And', 'And', 'And', 'And', 'And', 'Apollo', 'As', 'As', 'As', 'As', 'Be', 'Beauty', 'Beauty', 'Beauty', 'Behold', 'Behold', 'Benevolence', 'Bible', 'Boldly', 'But', 'But', 'But', 'But', 'But', 'But', 'But', 'But', 'But', 'But', 'But', 'But', 'But', 'But', 'By', 'By', 'By', 'Cambridge', 'Can', 'Catholic', 'Certainly', 'Character', 'China', 'Christ', 'Christ', 'Christ', 'Christ', 'Christ', 'Christ', 'Christ', 'Christ', 'Christ', 'Christ', 'Christian', 'Christian', 'Christian', 'Christian', 'Christian', 'Christianity', 'Christianity', 'Christianity', 'Christianity', 'Christianity', 'Christianity', 'Christianity', 'Church', 'Church', 'Church', 'Church', 'Class

In [8]:
print(sorted(list(set(text1.words)))[:500]) # prints sorted list of unique words (first 500 items)

["'s", '1', '15', '1838', '2', 'A', 'ADDRESS', 'AN', 'Accept', 'Ah', 'Alas', 'All', 'Alone', 'Already', 'Always', 'America', 'And', 'Apollo', 'As', 'Be', 'Beauty', 'Behold', 'Benevolence', 'Bible', 'Boldly', 'But', 'By', 'Cambridge', 'Can', 'Catholic', 'Certainly', 'Character', 'China', 'Christ', 'Christian', 'Christianity', 'Church', 'Class', 'College', 'Courage', 'Cultus', 'Deity', 'Delivered', 'Denderah', 'Discharge', 'Divinity', 'Drawn', 'Duty', 'East', 'Egypt', 'England', 'Epaminondas', 'Europe', 'Evening', 'Every', 'Everything', 'Evil', 'Faith', 'For', 'Fox', 'French', 'Friends', 'From', 'Genius', 'George', 'Ghost', 'God', 'Good', 'Greece', 'Greek', 'Greeks', 'Guard', 'Having', 'He', 'Hebrew', 'Hebrews', 'Hindoos', 'Historical', 'Holy', 'How', 'I', 'If', 'Imitation', 'Imperial', 'In', 'India', 'Instantly', 'It', 'Its', 'Jehovah', 'Jesus', 'Joy', 'July', 'Law', 'Let', 'Life', 'Literature', 'Look', 'Lord', 'Man', 'Massena', 'Meantime', 'Men', 'Miracle', 'Miracles', 'Monster', 'Mora

In [None]:
# Each TextBlob object contains a dictionary with the number of times each word appears in a text.

from pprint import pprint

pprint(text1.word_counts)

## *> Quick Exercise*

Create a function that returns the top 20 most frequent words in a given TextBlob object. 


*Hint: Use `itemgetter` from the `operator` module to sort a list of lists by a given index.*


In [9]:
# A possible solution:

from operator import itemgetter
from pprint import pprint

freq_dict=text1.word_counts
freq_list=[]

for key in freq_dict:
    freq_list.append([key,freq_dict[key]])

sorted_freq_list=sorted(freq_list, key=itemgetter(1))[::-1]

pprint(sorted_freq_list[:30])

# What do you notice about these words?

[['the', 561],
 ['of', 339],
 ['and', 320],
 ['to', 196],
 ['is', 170],
 ['in', 165],
 ['a', 124],
 ['that', 123],
 ['it', 98],
 ['not', 87],
 ['he', 83],
 ['man', 63],
 ['as', 61],
 ['all', 59],
 ['with', 53],
 ['this', 52],
 ['his', 51],
 ['be', 50],
 ['by', 46],
 ['but', 46],
 ['or', 45],
 ['are', 44],
 ['i', 44],
 ['you', 44],
 ['soul', 41],
 ['which', 40],
 ['can', 39],
 ['we', 37],
 ['they', 37],
 ['for', 34]]
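
As an alternative to building and sorting the list by hand, the standard library's `collections.Counter` can do the ranking. This is only a sketch: the function name `most_frequent` is ours, and the toy dictionary stands in for `text1.word_counts`, using four counts copied from the output above.

```python
from collections import Counter

def most_frequent(word_counts, n=20):
    """Return the n most frequent (word, count) pairs from a word -> count mapping."""
    return Counter(word_counts).most_common(n)

# Toy counts standing in for text1.word_counts
sample_counts = {"the": 561, "of": 339, "and": 320, "soul": 41}
print(most_frequent(sample_counts, n=2))  # [('the', 561), ('of', 339)]
```

Calling `most_frequent(text1.word_counts)` should give the same ranking as the loop above, though words with equal counts may come out in a different order.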


## *> Word Frequency Sans Stopwords*

Next we'll load the `nltk` module, which was installed as a dependency of TextBlob.

In computational text analysis, the term “stopword” refers to words that appear
frequently in most texts in a given language — e.g., “I,” “the,” “and,” “while,”
and so on. NLTK provides a useful stopword list. Here we assign the English stopword 
list to the variable `stopwords_eng`.

In [10]:
import nltk
from nltk.corpus import stopwords

stopwords_eng = stopwords.words('english')

print(stopwords_eng)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [11]:
# Now let’s look at the most frequent words in a text, disregarding stopwords.

from textblob import Word

freq_dict = text1.word_counts

freq_sans_stopwords = []

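for key in freq_dict:
    # lemmatize() treats each word as a noun by default, so verb forms such as
    # 'was' and 'has' become 'wa' and 'ha' and slip past the stopword check.
    lemma = Word(key).lemmatize()
    if lemma not in stopwords_eng:
        freq_sans_stopwords.append([key,freq_dict[key]])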

sorted_freq_sans_stopwords = sorted(freq_sans_stopwords, key = itemgetter(1))[::-1]

pprint(sorted_freq_sans_stopwords[:20])

# How do you interpret this list? Does it give you any insight into the text you’re looking at?

[['man', 63],
 ['soul', 41],
 ['men', 32],
 ['god', 27],
 ['life', 26],
 ['shall', 25],
 ['one', 22],
 ['us', 22],
 ['good', 21],
 ['love', 20],
 ['was', 20],
 ['world', 20],
 ['see', 19],
 ['has', 18],
 ['sentiment', 17],
 ['heart', 17],
 ['let', 16],
 ['true', 15],
 ['nature', 15],
 ['would', 15]]



## *> Quick Exercise*

Referencing the code above, create a function that takes a TextBlob object and returns its word frequencies as a list of `[word, count]` pairs, with stopwords removed and the most frequent words first. Look at the top vocabulary for several texts by each of your authors. How similar or different are these frequency lists between texts and between authors?


In [12]:
# A possible solution:

def topwords(blob):
    stopwords_eng = stopwords.words('english')
    freq_dict=blob.word_counts
    freq_sans_stopwords = []
    for key in freq_dict:
        lemma = Word(key).lemmatize()
        if lemma not in stopwords_eng:
            freq_sans_stopwords.append([key, freq_dict[key]])
    sorted_freq_sans_stopwords=sorted(freq_sans_stopwords, key=itemgetter(1))[::-1]
    return sorted_freq_sans_stopwords

pprint(topwords(text1)[:20])
print()
pprint(topwords(text2)[:20])

[['man', 63],
 ['soul', 41],
 ['men', 32],
 ['god', 27],
 ['life', 26],
 ['shall', 25],
 ['one', 22],
 ['us', 22],
 ['good', 21],
 ['love', 20],
 ['was', 20],
 ['world', 20],
 ['see', 19],
 ['has', 18],
 ['sentiment', 17],
 ['heart', 17],
 ['let', 16],
 ['true', 15],
 ['nature', 15],
 ['would', 15]]

[['man', 38],
 ['was', 30],
 ['men', 22],
 ['day', 21],
 ['upon', 18],
 ['like', 17],
 ['little', 14],
 ['one', 14],
 ['does', 13],
 ['thing', 13],
 ['heart', 13],
 ['soul', 12],
 ['never', 12],
 ['dead', 11],
 ['night', 11],
 ['red', 10],
 ['saw', 10],
 ['every', 10],
 ['round', 10],
 ['us', 10]]
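
Raw counts are hard to compare when two texts differ in length. One way to put the lists above on equal footing is to convert counts to a rate per 1,000 words. The sketch below assumes that approach; the function names are ours, and the tiny dictionaries are invented stand-ins for `text1.word_counts` and `text2.word_counts`.

```python
def relative_freqs(word_counts, per=1000):
    """Convert raw counts to occurrences per `per` words."""
    total = sum(word_counts.values())
    if total == 0:
        return {}
    return {word: count * per / total for word, count in word_counts.items()}

def compare_freqs(counts_a, counts_b, words):
    """Return (word, rate_a, rate_b) for each word; absent words get a rate of 0."""
    rates_a = relative_freqs(counts_a)
    rates_b = relative_freqs(counts_b)
    return [(w, rates_a.get(w, 0.0), rates_b.get(w, 0.0)) for w in words]

# Invented toy counts, for illustration only
counts_a = {"man": 6, "soul": 4}
counts_b = {"man": 2, "night": 2}
print(compare_freqs(counts_a, counts_b, ["man", "soul"]))
# [('man', 600.0, 500.0), ('soul', 400.0, 0.0)]
```

With real data you might pass the top words from one author as `words` and see how often the other author uses them.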


## POS Tagging

We can also use TextBlob to create a list of part-of-speech tags for each word in a text.

Let’s take a close look at our results. Examine two or three sentences a word at a time and check whether parts of speech were tagged correctly. If you find any mistakes, can you guess why the tagging algorithm slipped up?

In [None]:
pprint(text1.tags)

In [14]:
# Following Montfort’s example, let’s create a function that counts the number of adjectives in a text.

def adjs(text):
    count = 0
    for word, tag in text.tags:
        if tag == 'JJ':
            count+=1
    return count

print(adjs(text1))

502


In [15]:
def adj_percent(text):
    # Returns a proportion between 0 and 1; multiply by 100 for a percentage.
    return float(adjs(text))/len(text.words)

print(adj_percent(text1))

0.06812321888994437




## *> Exercise*

Create a function called `POS_profile` that takes a TextBlob object and returns a list containing several parts of speech and their relative frequency within the text. Your POS profile should include the following parts of speech:

- nouns
- adjectives
- verbs
- adverbs
- pronouns

You can find a full list of POS tags used by TextBlob [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Note that several parts of speech are split into multiple codes (e.g., NN, NNS, NNP, and NNPS for different classes of noun).

Next, run your POS profile on each text in your two corpora. How much do these values vary between authors and among texts by the same author?


In [16]:
# A possible solution:

def POS_profile(blob):
    noun_codes=['NN','NNS','NNP','NNPS']
    adj_codes=['JJ','JJR','JJS']
    verb_codes=['VB','VBD','VBG','VBN','VBP','VBZ']
    adv_codes=['RB','RBR','RBS']
    pronoun_codes=['PRP']
    noun_count=0
    adj_count=0
    verb_count=0
    adv_count=0
    pronoun_count=0
    for (word, tag) in blob.tags:
        if tag in noun_codes: noun_count+=1
        if tag in adj_codes: adj_count+=1
        if tag in verb_codes: verb_count+=1
        if tag in adv_codes: adv_count+=1
        if tag in pronoun_codes: pronoun_count+=1
    word_count=len(blob.words)
    return [float(noun_count)/word_count, float(adj_count)/word_count, float(verb_count)/word_count, float(adv_count)/word_count, float(pronoun_count)/word_count]
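
A more compact variation is sketched below. It is not a drop-in replacement for the solution above: it takes a list of `(word, tag)` pairs such as `blob.tags`, matches tags by prefix (so `PRP$` is counted as a pronoun), divides by the number of tagged tokens instead of `len(blob.words)`, and returns a labeled dictionary. The names `POS_PREFIXES` and `pos_profile_dict` are ours.

```python
from collections import Counter

# Penn Treebank tag prefixes for each broad category (PRP also matches PRP$)
POS_PREFIXES = {
    "noun": "NN",
    "adjective": "JJ",
    "verb": "VB",
    "adverb": "RB",
    "pronoun": "PRP",
}

def pos_profile_dict(tags):
    """Return {category: share of tagged tokens} for a list of (word, tag) pairs."""
    tags = list(tags)
    counts = Counter()
    for _word, tag in tags:
        for category, prefix in POS_PREFIXES.items():
            if tag.startswith(prefix):
                counts[category] += 1
    total = len(tags)
    return {cat: (counts[cat] / total if total else 0.0) for cat in POS_PREFIXES}

# Hand-tagged toy sentence, for illustration only
toy_tags = [("He", "PRP"), ("saw", "VBD"), ("a", "DT"), ("red", "JJ"), ("spear", "NN")]
print(pos_profile_dict(toy_tags))
```

To profile a whole corpus, you could call `pos_profile_dict(TextBlob(t).tags)` for each text `t` and compare the resulting dictionaries.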



## Naive Bayes Classification

Review the classification examples from the Montfort text.

## *> Exercise*

Divide each of your corpora into two sets, one for training our classifier and one for testing. Split each text into a list of sentences and combine these to create four master lists: author 1 training, author 1 testing, author 2 training, author 2 testing.

Create a Naive Bayes classifier using your two training sets. Run the classifier on each sentence in your test sets and calculate the accuracy of your model.

Examine sentences that were misclassified. Why do you think the algorithm was misled?
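
One possible starting point is sketched below, under a few assumptions. The helper names are ours, the 80/20 split is arbitrary, and the classifier is TextBlob's `NaiveBayesClassifier` from `textblob.classifiers`. Each author's input is a list of sentence lists, one per text, so that whole texts land in either the training set or the test set.

```python
def labeled_split(sentences_by_text, label, train_fraction=0.8):
    """Split whole texts into train/test groups, then flatten to (sentence, label) pairs."""
    # Keep at least one text for training, even in a very small corpus
    cut = max(1, int(len(sentences_by_text) * train_fraction))
    train_texts, test_texts = sentences_by_text[:cut], sentences_by_text[cut:]
    train = [(str(s), label) for text in train_texts for s in text]
    test = [(str(s), label) for text in test_texts for s in text]
    return train, test

def train_and_score(train_pairs, test_pairs):
    """Train TextBlob's Naive Bayes classifier and return (classifier, accuracy)."""
    # Imported here so the splitting helper above works without TextBlob installed
    from textblob.classifiers import NaiveBayesClassifier
    classifier = NaiveBayesClassifier(train_pairs)
    return classifier, classifier.accuracy(test_pairs)
```

For example, build `[TextBlob(t).sentences for t in corpus1_texts]`, pass it to `labeled_split` with the label `"Emerson"`, do the same for Wilde, and hand the combined training and test lists to `train_and_score`. Training may be slow on thousands of sentences, so it can help to start with a few hundred per author. Calling `classifier.classify(sentence)` on individual test sentences lets you find the misclassified ones.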



## *> Sentiment Analysis*

If time permits.

## *> Using WordNet*

If time permits.