# Data 620 Assignment: High Frequency Words

Jithendra Seneviratne, Sheryl Piechocki 

June 23, 2020

1. Choose a corpus of interest.
2. How many total unique words are in the corpus? (Please feel free to define unique words in any interesting, defensible way).
3. Taking the most common words, how many unique words represent half of the total words in the corpus?
4. Identify the 200 highest frequency words in this corpus.
5. Create a graph that shows the relative frequency of these 200 words.
6. Does the observed relative frequency of these words follow Zipf’s law? Explain.
7. In what ways do you think the frequency of the words in this corpus differ from “all words in all corpora.”

In [1]:
%matplotlib inline
# Standard library
import re
from urllib import request

# Third-party
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.graph_objs as go
from nltk import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from plotly.offline import init_notebook_mode

# Enable inline Plotly rendering in the notebook
init_notebook_mode(connected=True)

### 1. Choose a corpus of interest
We chose the text of the book Treasure Island by Robert Louis Stevenson as our corpus. The text is available on the Project Gutenberg website.

In [2]:
# Download the plain-text edition and decode it
url = "http://www.gutenberg.org/files/120/120-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

# Locate where the story starts and where the Project Gutenberg footer begins
print(raw.find("PART ONE--"))
print(raw.rfind("End of Project Gutenberg'"))


4276
372462
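The two offsets printed above are used as hard-coded slice bounds in the next cell. As a minimal sketch of an alternative, the slice can be derived from the markers directly, so that it still works if the file on the server changes. `slice_between` and the sample string are hypothetical and only illustrate the idea.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    # str.find and str.rfind return -1 when the marker is missing
    if start == -1 or end == -1 or end <= start:
        raise ValueError("marker not found or markers out of order")
    return text[start:end]


# Toy stand-in for the downloaded file (hypothetical content)
sample = (
    "License header\n"
    "PART ONE--The Old Buccaneer\n"
    "story text\n"
    "End of Project Gutenberg's Treasure Island"
)
body = slice_between(sample, "PART ONE--", "End of Project Gutenberg'")
print(repr(body))
```

With the real download, the call would be `slice_between(raw, "PART ONE--", "End of Project Gutenberg'")`, assuming both markers are present in the file.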


Get the total number of characters in the story text. Note that `len(raw)` counts characters (letters, whitespace, and punctuation symbols), not words.

In [3]:
raw = raw[4276:372462]
# len() of a string counts characters, not words
print('Total Characters: ', len(raw))

Total Characters:  368186


Remove punctuation symbols

In [4]:
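# Replace hyphens with spaces so hyphenated words split into separate tokens
raw = raw.replace('-', ' ')
# Drop every character that is not a word character or whitespace
new_raw = re.sub(r'[^\w\s]', '', raw)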


Convert the text to lower case, tokenize it, and count the tokens

In [5]:
#break up the string into words and change all to lower case
tokens = word_tokenize(new_raw.lower())
print('Count of Tokens: ' ,len(tokens))


Count of Tokens:  68706
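To make the effect of the cleaning steps concrete, here is a small sketch that applies the same hyphen replacement, punctuation removal, and lower-casing to a short sample. It uses `str.split` as a simplified stand-in for `word_tokenize`, which is an assumption made so the example runs without NLTK data; `clean_and_tokenize` is a hypothetical helper.

```python
import re


def clean_and_tokenize(text):
    """Hyphens to spaces, strip punctuation, lower-case, split on whitespace."""
    text = text.replace('-', ' ')
    text = re.sub(r'[^\w\s]', '', text)
    # str.split is a simplified stand-in for nltk.word_tokenize
    return text.lower().split()


sample = "PART ONE--The Old Buccaneer. The Old Sea-dog at the Admiral Benbow"
sample_tokens = clean_and_tokenize(sample)
print(sample_tokens)

# Apostrophes are removed too, so contractions collapse into a single token
print(clean_and_tokenize("I don't know"))
```

One consequence of this design is that contractions such as "don't" become "dont", which affects how they are lemmatized and counted.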


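Create a function to lemmatize words. Each token's part-of-speech tag is mapped to the corresponding WordNet tag before the token is lemmatized.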

In [6]:
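def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Default to noun, which is also the lemmatizer's own default
        return wordnet.NOUN

# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize_word(word):
    """Lemmatize one token; the token is tagged in isolation, without sentence context."""
    try:
        tag = get_wordnet_pos(nltk.pos_tag([word])[0][1])
        return lemmatizer.lemmatize(word, pos=tag)
    except Exception:
        # Fall back to the unchanged token instead of silently returning None
        return word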
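The tag mapping above can also be written as a lookup table. The sketch below is a stand-alone illustration that uses the plain string values of the WordNet constants (`'a'`, `'v'`, `'n'`, `'r'`) so that it runs without NLTK; `treebank_to_wordnet` and `TAG_PREFIX_TO_POS` are hypothetical names.

```python
# First letter of a Penn Treebank tag -> WordNet POS value
# (these strings are the values of wordnet.ADJ, VERB, NOUN, and ADV)
TAG_PREFIX_TO_POS = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}


def treebank_to_wordnet(tag, default='n'):
    """Return the WordNet POS for a Treebank tag, defaulting to noun."""
    return TAG_PREFIX_TO_POS.get(tag[:1], default)


for tag in ['JJ', 'VBD', 'NNS', 'RB', 'DT']:
    print(tag, '->', treebank_to_wordnet(tag))
```

Tags outside the four groups, such as determiners (`DT`), fall through to the noun default, which mirrors the `else` branch of `get_wordnet_pos`.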

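Lemmatize the tokens. Stop-word removal is left disabled (the filter is commented out), so `filtered_tokens` holds every lemmatized token.

In [7]:
tokens_lem = [lemmatize_word(x) for x in tokens]

In [8]:
# Stop-word filter is disabled; restore the trailing condition to apply it
filtered_tokens = [word for word in tokens_lem]  # if word not in stopwords.words('english')]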

### 2. How many total unique words are in the corpus? 
We define unique words as distinct lemmatized, lower-cased tokens, so the number of distinct tokens is the total number of unique words.

In [9]:
# Count of distinct words, or "word types" 
#print('Count of Unique Words: ', len(set(tokens))) 
print('Count of Unique Words: ', len(set(filtered_tokens))) 

Count of Unique Words:  4646


Take a look at the first 10 tokens.

In [10]:
tokens[:10]

['part', 'one', 'the', 'old', 'buccaneer', '1', 'the', 'old', 'sea', 'dog']

Get the frequency distribution of the tokens.

In [11]:
#fdist1 = FreqDist(tokens)
#print(fdist1)
fdist1 = FreqDist(filtered_tokens)
print(fdist1)


<FreqDist with 4646 samples and 68706 outcomes>
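In the summary above, "samples" are the distinct words (types) and "outcomes" are the total tokens counted. A minimal sketch with the standard library's `collections.Counter` shows the same two quantities on a toy token list; the list is made up for illustration.

```python
from collections import Counter

# Toy token list (hypothetical, not taken from the corpus counts)
toy_tokens = ['the', 'old', 'sea', 'dog', 'the', 'old', 'buccaneer']
counts = Counter(toy_tokens)

n_samples = len(counts)            # distinct words (types)
n_outcomes = sum(counts.values())  # total tokens
print(n_samples, n_outcomes)
print(counts.most_common(2))
```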


### 3. Taking the most common words, how many unique words represent half of the total words in the corpus?
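The output below shows that the top 51 words account for 49.91% of the total words in the corpus (lemmatized tokens, stop words not filtered). Adding the 52nd word, "take", raises the cumulative share to 50.16%, so 52 unique words are needed to reach half of the corpus.
A second figure recorded for this question, the top 64 words accounting for 49.96%, is also labeled non-filtered; it does not correspond to the output shown below and appears to come from a run with different preprocessing.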

In [12]:
# Words ordered from most to least frequent
samples = [word for word, _ in fdist1.most_common()]
freqs = [fdist1[sample] for sample in samples]
# Relative frequency = count / total number of tokens
rel_freqs = [freq / fdist1.N() for freq in freqs]

df_words = pd.DataFrame({'frequency': freqs, 'relfreq': rel_freqs, 'word': samples})

# Cumulative share of all tokens (a proportion between 0 and 1)
df_words['cum_perc'] = df_words['relfreq'].cumsum()
print('Middle Words and Corresponding Frequencies')
# Show the rows where the cumulative share crosses one half
df_words.loc[(df_words['cum_perc'] >= 0.495) & (df_words['cum_perc'] <= 0.505)]

Middle Words and Corresponding Frequencies


|    | frequency | relfreq  | word  | cum_perc |
|----|-----------|----------|-------|----------|
| 49 | 182       | 0.002649 | here  | 0.496434 |
| 50 | 182       | 0.002649 | or    | 0.499083 |
| 51 | 176       | 0.002562 | take  | 0.501645 |
| 52 | 174       | 0.002533 | could | 0.504177 |
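The number of words needed to reach half of the corpus can also be computed directly instead of being read off the table. The sketch below accumulates counts from the most frequent word downward and stops once the requested share is reached; `words_to_cover` and the toy counts are hypothetical.

```python
from collections import Counter


def words_to_cover(counts, share=0.5):
    """Smallest number of top-frequency words whose counts reach `share` of all tokens."""
    total = sum(counts.values())
    running = 0
    for n_words, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= share * total:
            return n_words
    return len(counts)


# Toy counts (hypothetical): 10 tokens in total
toy = Counter({'the': 5, 'and': 3, 'a': 1, 'ship': 1})
print(words_to_cover(toy))        # 'the' alone covers 5 of 10 tokens
print(words_to_cover(toy, 0.75))  # 'the' + 'and' cover 8 of 10 tokens
```

Because `FreqDist` is a subclass of `Counter`, `words_to_cover(fdist1)` should work on the notebook's frequency distribution as well.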


### 4. Identify the 200 highest frequency words in this corpus.

In [13]:
top200 = dict(fdist1.most_common(200))
print('Treasure Island by Robert Louis Stevenson')
print('Top 200 Words by Frequency')
print('Word      Frequency')
print('-------------------')
for word, freq in top200.items():
    print('{:<11} {}'.format(word, freq))

Treasure Island by Robert Louis Stevenson
Top 200 Words by Frequency
Word      Frequency
-------------------
the         4365
and         2874
a           2373
be          2205
i           1748
of          1674
to          1521
have        1076
in          968
he          929
that        836
you         825
it          800
his         650
my          618
with        616
for         588
on          507
say         503
but         470
me          431
at          418
we          417
all         353
not         350
this        310
by          304
him         291
one         281
there       274
now         272
man         265
so          264
do          256
out         254
hand        236
captain     234
they        233
go          229
well        224
silver      222
from        219
come        217
if          216
like        216
up          213
see         213
no          186
when        182
here        182
or          182
take        176
could       174
doctor      173
what        172
wou

### 5. Create a graph that shows the relative frequency of these 200 words.

In [14]:
df_words.sort_values(by='relfreq',
                     inplace=True,
                     ascending=False)
# Horizontal bar chart of the 200 most frequent words
fig = go.Figure()
fig.add_trace(go.Bar(y = df_words['word'][:200], x = df_words['relfreq'][:200], orientation = 'h'))

fig.update_layout(go.Layout(
    title='Relative Frequency of Top 200 Words in Treasure Island by Robert Louis Stevenson',
    width=900,
    height=1100,
    xaxis=dict(
        title='Relative Frequency' #, tickangle = 90
    ),
    yaxis=dict(
        # Reverse the axis so the most frequent word is listed at the top
        title='Word', tickfont=dict(size=7), autorange='reversed'
    )
))
fig.show()

### 6. Does the observed relative frequency of these words follow Zipf’s law? Explain.
According to the text Natural Language Processing with Python:
Zipf’s Law: Let *f(w)* be the frequency of a word *w* in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf’s Law states that the frequency of a word type is inversely proportional to its rank (i.e., *f × r = k*, for some constant *k*).

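Taking logarithms of *f × r = k* gives *log f = log k − log r*, so a corpus that follows Zipf's Law appears as a straight line with slope −1 when log frequency is plotted against log rank. The plot below of word rank vs. word frequency (both on a log scale) is approximately a straight line. Therefore, we conclude that this corpus of Treasure Island approximately follows Zipf's Law.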

In [15]:
# Log-log plot of frequency against rank; rank is the index from In [12] plus one
fig = go.Figure()

fig.add_trace(go.Scatter(y = np.log(df_words['frequency']), x = np.log(df_words.index + 1)))

fig.update_layout(go.Layout(
    title='Word Rank vs Word Frequency in Treasure Island by Robert Louis Stevenson',
    width=900,
    height=1100,
    xaxis=dict(
        title='Log Rank' #, tickangle = 90
    ),
    yaxis=dict(
        title='Log Frequency', tickfont=dict(size=7)
    )
))
fig.show()
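A visual check can be complemented by estimating the slope of the log-log relationship, since Zipf's Law predicts a slope close to −1. The sketch below is one simple way to do this with an ordinary least-squares fit; `loglog_slope` is a hypothetical helper, and the demonstration uses synthetic frequencies rather than the corpus counts.

```python
import math


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    `freqs` must be ordered from most to least frequent and contain
    at least two positive values.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic, perfectly Zipfian frequencies: f = k / r with k = 1000
ideal = [1000 / r for r in range(1, 201)]
print(round(loglog_slope(ideal), 3))
```

Applied to the notebook's data, `loglog_slope(list(df_words['frequency']))` would give an estimate to compare against −1, assuming the frame is still ordered by descending frequency.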

### 7. In what ways do you think the frequency of the words in this corpus differ from “all words in all corpora.”