# Data 620 Assignment: High Frequency Words

Jithendra Seneviratne, Sheryl Piechocki 

June 23, 2020

1. Choose a corpus of interest.
2. How many total unique words are in the corpus? (Please feel free to define unique words in any interesting,
defensible way).
3. Taking the most common words, how many unique words represent half of the total words in the corpus?
4. Identify the 200 highest frequency words in this corpus.
5. Create a graph that shows the relative frequency of these 200 words.
6. Does the observed relative frequency of these words follow Zipf’s law? Explain.
7. In what ways do you think the frequency of the words in this corpus differ from “all words in all corpora.”

In [1]:
%matplotlib inline
# Standard library
import re
import string
import pprint
from urllib import request

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot

# NLTK: the tokenizer, tagger, WordNet and stop-word data must be available locally;
# fetch them with nltk.download() if they are missing.
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import sent_tokenize

init_notebook_mode(connected=True)

### 1. Choose a corpus of interest
We chose the text of Treasure Island by Robert Louis Stevenson as our corpus. The text is available from the Project Gutenberg website.

#### Process

For this analysis we'll clean the corpus by removing punctuation and other symbols, and then lemmatize it. We'll then remove stop words and look at the frequency of the unique words that remain. Hopefully, this cleaning process will yield sensible results for our most common words.

In [2]:
from urllib import request
url = "http://www.gutenberg.org/files/120/120-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

### Get the slice of the raw text that contains the book

In [3]:
print(raw.find("PART ONE--"))
print(raw.rfind("End of Project Gutenberg'"))

4276
372462


In [4]:
raw = raw[4276:372462]
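
The offsets 4276 and 372462 come from the `find` and `rfind` calls above and apply only to the copy of the file that was downloaded; they may shift if the header or footer of the file changes. A minimal sketch of deriving the slice from the markers directly, where `slice_between` is a hypothetical helper that is not part of the original notebook:

```python
def slice_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    if start == -1 or end == -1 or end < start:
        raise ValueError("marker not found; inspect the raw text")
    return text[start:end]

# Toy string standing in for the downloaded file (an assumption for illustration).
sample = "HEADER PART ONE-- the story End of Project Gutenberg's footer"
body = slice_between(sample, "PART ONE--", "End of Project Gutenberg'")
print(repr(body))
```

In the notebook this could replace the hardcoded slice with `raw = slice_between(raw, "PART ONE--", "End of Project Gutenberg'")`.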

### Remove punctuations and symbols

In [5]:
raw = raw.replace('-', ' ')
new_raw = re.sub(r'[^\w\s]', '', raw)
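
Two side effects of this pattern are worth keeping in mind when reading the results. Removing apostrophes merges contractions, which is why tokens such as "dont", "thats", "ill" and "im" show up among the most frequent words later on. Also, `\w` matches the underscore, so underscores survive this step; any token that still contains one is discarded later by the `isalpha()` filter. The sketch below illustrates both effects on a made-up sentence; the second pattern is an assumed variant, not the one used in this analysis.

```python
import re

# Made-up sentence for illustration only.
sample = "I don't know, said the _Hispaniola_'s captain."

# Pattern used in the notebook: punctuation goes, underscores stay.
kept = re.sub(r'[^\w\s]', '', sample)
print(kept)

# Assumed variant: remove underscores as well.
stripped = re.sub(r'[^\w\s]|_', '', sample)
print(stripped)
```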

### Tokenize the lower case of the text and get the count of tokens

In [6]:
# Lowercase the text, then split it into word tokens
tokens = word_tokenize(new_raw.lower())
print('Count of Tokens: ', len(tokens))

Count of Tokens:  68706


### Create function to lemmatize words

In [7]:
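def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the matching WordNet part of speech.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Default to noun, which is also the lemmatizer's own default.
        return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
def lemmatize_word(word):
    # The word is tagged in isolation, so the tag ignores sentence context.
    try:
        tag = get_wordnet_pos(nltk.pos_tag([word])[0][1])
        return lemmatizer.lemmatize(word, pos=tag)
    except Exception:
        # Fall back to the original token so later filters never receive None.
        return word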

### Lemmatize the tokens and remove numbers

In [8]:
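tokens_lem = [lemmatize_word(x) for x in tokens]
# Keep purely alphabetic tokens; this drops numbers and any token containing digits or underscores.
tokens_lem = [x for x in tokens_lem if x.isalpha()]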

### 2. How many total unique words are in the corpus? 

We define the total number of unique words as the number of distinct lemmatized tokens.

In [9]:
print('Count of Unique Words: ', len(set(tokens_lem))) 

Count of Unique Words:  4606


### Take a look at the first 10 tokens

In [10]:
print('First 10 tokens: ' , tokens_lem[:10])

First 10 tokens:  ['part', 'one', 'the', 'old', 'buccaneer', 'the', 'old', 'sea', 'dog', 'at']


### Get the frequency distribution of the tokens and look at top 10

In [11]:
fdist1 = FreqDist(tokens_lem)
print(fdist1)
print('--------------------------------------------')
top10 = dict(fdist1.most_common(10))
#top10

print('Treasure Island by Robert Louis Stevenson')
print('Top 10 Words by Frequency')
print('-------------------')
print('Word      Frequency')
print('-------------------')
for word, freq in top10.items():
    print('{:<11} {}'.format(word, freq))

<FreqDist with 4606 samples and 68661 outcomes>
--------------------------------------------
Treasure Island by Robert Louis Stevenson
Top 10 Words by Frequency
-------------------
Word      Frequency
-------------------
the         4365
and         2874
a           2373
be          2205
i           1748
of          1674
to          1521
have        1076
in          968
he          929


### Removing Stop words

Stop words dominate the top of the frequency list, so let's remove them and run our analysis again.

In [12]:
stop = stopwords.words('english') + ['mr',
                                     'mrs',
                                     'miss', 
                                     'say',
                                     'have', 
                                     'might',
                                     'thought',
                                     'would', 
                                     'could', 
                                     'make', 
                                     'much',
                                     'dear',
                                     'must',
                                     'know',
                                     'one',
                                     'good',
                                     'every',
                                     'towards',
                                     'give',
                                     'dr',
                                     'none',
                                     'go',
                                     'come',
                                     'upon',
                                     'get',
                                     'see',
                                     'like',
                                     'appear',
                                     'sometimes',
                                     'the',
                                     'and',
                                     'a',
                                     'be',
                                     'i',
                                     'of',
                                     'to',
                                     'have',
                                     'in',
                                     'he',
                                     'that',
                                     'you',
                                     'it',
                                     'his',
                                     'my',
                                     'with',
                                     'for',
                                     'on',
                                     'say',
                                     'but',
                                     'me',
                                     'at',
                                     'we',
                                     'all',
                                     'not',
                                     'this',
                                     'by',
                                     'him',
                                     'one',
                                     'there',
                                     'now',
                                     'man',
                                     'so',
                                     'do',
                                     'out',
                                     'they',
                                     'go',
                                     'well',
                                     'from',
                                     'come',
                                     'if',
                                     'like',
                                     'up',
                                     'see',
                                     'no',
                                     'when',
                                     'put',
                                     'take',
                                     'begin',
                                     'two',
                                     'three',
                                     'u',
                                     'still',
                                     'last',
                                     'never',
                                     'always',
                                     'thing',
                                     'tell']

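# A set gives constant-time lookups and collapses the duplicate entries in the list above.
stop_set = set(stop)
filtered_tokens = [word for word in tokens_lem if word not in stop_set]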

### 2. (version 2). How many total unique words are in the corpus?

The number of distinct tokens after removing stop words.

In [13]:
print('Count of Unique Words after Removing Stop Words: ', len(set(filtered_tokens))) 

Count of Unique Words after Removing Stop Words:  4454


In [14]:
fdist2 = FreqDist(filtered_tokens)
print(fdist2)
print('--------------------------------------------')
top10v2 = dict(fdist2.most_common(10))
#top10
print('Treasure Island by Robert Louis Stevenson')
print('Top 10 Words by Frequency (Excluding Stop Words)')
print('-------------------')
print('Word      Frequency')
print('-------------------')
for word, freq in top10v2.items():
    print('{:<11} {}'.format(word, freq))

<FreqDist with 4454 samples and 27519 outcomes>
--------------------------------------------
Treasure Island by Robert Louis Stevenson
Top 10 Words by Frequency (Excluding Stop Words)
-------------------
Word      Frequency
-------------------
hand        236
captain     234
silver      222
doctor      173
time        139
cry         137
look        136
ship        133
old         119
long        115


We can see that the top words now take on a more distinct flavor and match the corpus thematically.

### 3. Taking the most common words, how many unique words represent half of the total words in the corpus?

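The output below shows that the top 303 words account for 49.97% of the total words in the filtered corpus. Adding the 304th word ("admiral") brings the cumulative share to 50.04%, so 304 unique words are needed to represent half of the filtered corpus. Note that the table's index is zero-based, so row 302 is the 303rd word. With stop words included, far fewer words are needed: the corresponding table under question 5 shows the cumulative share crossing 50% at row 51, i.e. at the top 52 words.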


In [15]:
# Words in descending order of frequency, with their counts and relative frequencies
samples = [word for word, _ in fdist2.most_common()]
freqs = [fdist2[sample] for sample in samples]
rel_freqs = [fdist2[sample] / fdist2.N() for sample in samples]

df_words = pd.DataFrame()
df_words['frequency'] = freqs
df_words['relfreq'] = rel_freqs
df_words['word'] = samples

df_words['cum_perc'] = df_words['relfreq'].cumsum()
print('Middle Words and Corresponding Frequencies')
print('Treasure Island by Robert Louis Stevenson (Excluding Stop Words)')
df_words.loc[(df_words['cum_perc'] >= 0.495) & (df_words['cum_perc'] <= 0.505)]

Middle Words and Corresponding Frequencies
Treasure Island by Robert Louis Stevenson (Excluding Stop Words)


     frequency   relfreq     word  cum_perc
296         21  0.000763    blood  0.495076
297         21  0.000763      hot  0.495839
298         21  0.000763     pull  0.496602
299         21  0.000763   pistol  0.497365
300         21  0.000763  fortune  0.498129
301         21  0.000763    dance  0.498892
302         21  0.000763  current  0.499655
303         20  0.000727  admiral  0.500382
304         20  0.000727   benbow  0.501108
305         20  0.000727     year  0.501835
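
Reading the crossover point off a filtered table works, but the count can also be computed directly. The following is a minimal sketch using only the standard library; `words_to_cover` is a hypothetical helper and the toy sentence is made up for illustration.

```python
from collections import Counter

def words_to_cover(counts, share=0.5):
    """Smallest number of top-ranked words whose counts reach `share` of all tokens."""
    total = sum(counts.values())
    running = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= share * total:
            return n
    return len(counts)

# Toy corpus: 'the' occurs 3 times, 'and' twice, the rest once (8 tokens in total).
toy = Counter("the cat and the dog and the bird".split())
print(words_to_cover(toy))
```

Because NLTK's `FreqDist` is a subclass of `Counter`, a call such as `words_to_cover(fdist2)` should work on the notebook's own distributions.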


### 4. Identify the 200 highest frequency words in this corpus.

In [16]:
top200 = dict(fdist2.most_common(200))
print('Treasure Island by Robert Louis Stevenson')
print('Top 200 Words by Frequency (Excluding Stop Words)')
print('Word      Frequency')
print('-------------------')
for word, freq in top200.items():
    print('{:<11} {}'.format(word, freq))

Treasure Island by Robert Louis Stevenson
Top 200 Words by Frequency (Excluding Stop Words)
Word      Frequency
-------------------
hand        236
captain     234
silver      222
doctor      173
time        139
cry         137
look        136
ship        133
old         119
long        115
sea         112
back        111
ill         109
squire      106
little      104
men         102
sir         102
side        101
first       99
word        97
jim         97
head        94
way         93
lay         92
another     86
house       85
dont        85
round       84
john        84
eye         82
island      81
thats       81
great       79
enough      74
right       74
dead        73
sure        73
think       72
soon        71
even        70
moment      69
turn        67
voice       66
face        66
found       65
far         64
place       63
open        63
left        63
run         62
return      61
im          60
day         60
youll       60
boat        60
ask         59
may       

### 5. Create a graph that shows the relative frequency of these 200 words.

In [17]:
df_words.sort_values(by='relfreq',
                     inplace=True,
                     ascending=False)
fig = go.Figure()
fig.add_trace(go.Bar(x = df_words['word'][:200], y = df_words['relfreq'][:200], orientation = 'v'));

fig.update_layout(go.Layout(
    title='Relative Frequency of Top 200 Words in Treasure Island by Robert Louis Stevenson <br><sub>Excluding Stop Words</sub>',
    width=2000,
    height=400,
    yaxis=dict(
        title='Relative Frequency' , tickformat = '.1%'
    ),
    xaxis=dict(
        title='Word', tickfont=dict(size=7)
    )
));
fig.show();

In [18]:
# Same table as before, built from the unfiltered distribution (stop words included)
samples2 = [word for word, _ in fdist1.most_common()]
freqs2 = [fdist1[sample] for sample in samples2]
rel_freqs2 = [fdist1[sample] / fdist1.N() for sample in samples2]

df_words2 = pd.DataFrame()
df_words2['frequency'] = freqs2
df_words2['relfreq'] = rel_freqs2
df_words2['word'] = samples2

df_words2['cum_perc'] = df_words2['relfreq'].cumsum()
print('Middle Words and Corresponding Frequencies')
print('Treasure Island by Robert Louis Stevenson (Including Stop Words)')
df_words2.loc[(df_words2['cum_perc'] >= 0.495) & (df_words2['cum_perc'] <= 0.505)]

Middle Words and Corresponding Frequencies
Treasure Island by Robert Louis Stevenson (Including Stop Words)


    frequency   relfreq   word  cum_perc
49        182  0.002651   here  0.496759
50        182  0.002651     or  0.499410
51        176  0.002563   take  0.501973
52        174  0.002534  could  0.504508


In [19]:
df_words2.sort_values(by='relfreq',
                     inplace=True,
                     ascending=False)
fig2 = go.Figure();
fig2.add_trace(go.Bar(x = df_words2['word'][:200], y = df_words2['relfreq'][:200], orientation = 'v'));

fig2.update_layout(go.Layout(
    title='Relative Frequency of Top 200 Words in Treasure Island by Robert Louis Stevenson<br><sub>Including Stop Words</sub>',
    width=2000,
    height=400,
    yaxis=dict(
        title='Relative Frequency'  , tickformat = '.1%'
    ),
    xaxis=dict(
        title='Word', tickfont=dict(size=7)
    )
));
fig2.show();

### 6. Does the observed relative frequency of these words follow Zipf’s law? Explain.
According to the text *Natural Language Processing with Python*:

> Zipf’s Law: Let *f(w)* be the frequency of a word *w* in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf’s Law states that the frequency of a word type is inversely proportional to its rank (i.e., *f × r = k*, for some constant *k*).

Plotting word rank vs. word frequency, both in log scale, will yield a straight line with negative slope if Zipf's law holds true.
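
To spell out why a straight line is expected, take logarithms of the relation quoted above:

$$
f \times r = k \quad\Longrightarrow\quad \log f = \log k - \log r
$$

Under this idealized relation, the points on log-log axes fall on a line with slope $-1$ and intercept $\log k$. Real text is only expected to approximate this.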

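The first plot below shows word rank vs. word frequency after stop words have been excluded. The line is visibly curved rather than straight, so we conclude that Zipf's Law does not hold once stop words are removed. The head of the filtered list already hints at this: the top three words have nearly equal counts (236, 234, 222), whereas an inverse-rank relationship starting at 236 would predict roughly 236, 118 and 79.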

The second plot below includes the stop words. This line is close to straight with a negative slope. Therefore, we conclude that the unfiltered corpus of Treasure Island approximately follows Zipf's Law.
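
The visual judgement could be backed by a number: the least-squares slope of log frequency against log rank, which should be near -1 for a Zipf-like distribution. The sketch below is a standard-library illustration, not part of the original analysis; `loglog_slope` is a hypothetical helper, and the ideal frequencies are synthetic.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Synthetic frequencies that obey f * r = k exactly, with k = 1000.
ideal = [1000 / rank for rank in range(1, 101)]
slope = loglog_slope(ideal)
print(round(slope, 3))
```

Applied to the notebook's data, one might compare `loglog_slope(df_words2['frequency'])` with `loglog_slope(df_words['frequency'])`. A slope alone does not capture curvature, so it complements the plots rather than replacing them.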

In [20]:
fig = go.Figure()

fig.add_trace(go.Scatter(y = np.log(df_words['frequency']), x = np.log(df_words.index + 1)))

fig.update_layout(go.Layout(
    title='Word Rank vs Word Frequency in Treasure Island<br>by Robert Louis Stevenson <br><sub>Excluding Stop Words</sub>',
    width=500,
    height=500,
    xaxis=dict(
        title='Log Rank' #, tickangle = 90
    ),
    yaxis=dict(
        title='Log Frequency', tickfont=dict(size=7)
    )
))
fig.show()

In [21]:
fig = go.Figure()

fig.add_trace(go.Scatter(y = np.log(df_words2['frequency']), x = np.log(df_words2.index + 1)))

fig.update_layout(go.Layout(
    title='Word Rank vs Word Frequency in Treasure Island<br>by Robert Louis Stevenson<br><sub>Including Stop Words</sub>',
    width=500,
    height=500,
    xaxis=dict(
        title='Log Rank' 
    ),
    yaxis=dict(
        title='Log Frequency', tickfont=dict(size=7)
    )
))
fig.show()

### 7. In what ways do you think the frequency of the words in this corpus differ from “all words in all corpora.”

Once a corpus is cleaned, lemmatized, and filtered for stop words, the most frequent words typically reflect its subject matter. Before we filtered out stop words, the top words were 'the', 'and', 'a', 'be', 'i', 'of', and 'to', which would most likely also top the list for all words in all corpora. After filtering, the top words became 'hand', 'captain', 'silver', 'doctor', and 'time'. These make sense for a seafaring novel such as Treasure Island, but we would not expect them to rank as highly across all corpora. Further, before removal of stop words the corpus approximately follows Zipf's Law, whereas after removal it does not.
