In [2]:
import nltk, re, pprint
import matplotlib
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

**First, in the entry form below, load in the file or files.**  Start by taking a look at your text.  An easy way to get started is to read it in and then run it through the sentence tokenizer to divide it up, even if this division is not fully accurate.  You may have to do a bit of work to figure out which will be the "opening phrase" that Wolfram Alpha shows.  Below, write the code to read in the text and split it into sentences, and then print out the **opening phrase**.

In [3]:
with open("speeches.txt") as w:
    text = w.read()

In [4]:
sent_text = sent_tokenizer.tokenize(text)
len(sent_text)
# sent_text

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)
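
Python raises the `UnicodeDecodeError` above when a file containing non-ASCII bytes is decoded with the ASCII codec. Byte 0xe2 is commonly the first byte of a UTF-8 curly quote or dash. A minimal sketch of one way around it, assuming the file is UTF-8 encoded (a guess, not something the notebook confirms), is to pass the encoding explicitly. The `utf-8-sig` codec also drops a leading byte-order mark if one is present. The `read_text` helper and the demo file are hypothetical.

```python
import os
import tempfile

def read_text(path, encoding="utf-8-sig"):
    """Read a whole text file with an explicit encoding (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        return f.read()

# Demo on a throwaway file that starts with a BOM and contains a curly apostrophe.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"\xef\xbb\xbfThat\xe2\x80\x99s so nice.")

demo_text = read_text(demo_path)
print(repr(demo_text))
```

If the real file uses a different encoding, the same call with another `encoding` value would be the thing to try.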

In [6]:
para_text = re.sub('[\n]+','\n', text)
para_text = para_text.split('\n')
opening_phrase = para_text[1]
opening_phrase

"...Thank you so much.  That's so nice.  Isn't he a great guy.  He doesn't get a fair press; he doesn't get it.  It's just not fair.  And I have to tell you I'm here, and very strongly here, because I have great respect for Steve King and have great respect likewise for Citizens United, David and everybody, and tremendous resect for the Tea Party.  Also, also the people of Iowa.  They have something in common.  Hard-working people.  They want to work, they want to make the country great.  I love the people of Iowa.  So that's the way it is.  Very simple."

**Next, tokenize.**  Look at the several dozen sentences to see what kind of tokenization issues you'll have.  Write a regular expression tokenizer, using the nltk.regexp_tokenize() as seen in class, or using something more sophisticated if you prefer, to do a nice job of breaking your text up into words.  You may need to make changes to the regex pattern that is given in the book to make it work well for your text collection. 

*How you break up the words will have effects down the line for how you can manipulate your text collection.  You may want to refine this code later.*

In [52]:
text = text.replace('\ufeff', '')
new_text = re.sub('[\n]+','\n', text)
sent_text = sent_tokenizer.tokenize(new_text)
sent_text

['SPEECH 1\n...Thank you so much.',
 "That's so nice.",
 "Isn't he a great guy.",
 "He doesn't get a fair press; he doesn't get it.",
 "It's just not fair.",
 "And I have to tell you I'm here, and very strongly here, because I have great respect for Steve King and have great respect likewise for Citizens United, David and everybody, and tremendous resect for the Tea Party.",
 'Also, also the people of Iowa.',
 'They have something in common.',
 'Hard-working people.',
 'They want to work, they want to make the country great.',
 'I love the people of Iowa.',
 "So that's the way it is.",
 'Very simple.',
 'With that said, our country is really headed in the wrong direction with a president who is doing an absolutely terrible job.',
 "The world is collapsing around us, and many of the problems we've caused.",
 'Our president is either grossly incompetent, a word that more and more people are using, and I think I was the first to use it, or he has a completely different agenda than you wan

In [59]:
pattern = r'''(?x)  # set flag to allow verbose regexps
 (?:[A-Z]\.)+[A-Z]*        # abbreviations, e.g. U.S.A.
| [a-zA-Z]+(?:[-'][a-zA-Z]+)*            # words with optional internal hyphens or apostrophes
| \$?\d+(?:\.\d+)?%?     # numbers, optionally with $ or %, e.g. $12.40, $33, 82%
| [+/\-@&*.,;"'?():_`]   # special symbols
'''

Tokens = nltk.regexp_tokenize(new_text,pattern)
# Tokens
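
To see how the alternatives in such a pattern behave, it can help to run them on one short sentence. The sketch below uses `re.findall` from the standard library as a stand-in for `nltk.regexp_tokenize`, on the assumption that the two agree for a pattern with only non-capturing groups. The sample sentence is made up.

```python
import re

# Same alternatives as the notebook's pattern (assumed equivalent for this demo).
pattern = r'''(?x)                     # verbose mode: whitespace and comments ignored
   (?:[A-Z]\.)+[A-Z]*                # abbreviations, e.g. U.S.A.
 | [a-zA-Z]+(?:[-'][a-zA-Z]+)*       # words with internal hyphens or apostrophes
 | \$?\d+(?:\.\d+)?%?                # numbers, optionally with $ or %
 | [+/\-@&*.,;"'?():_`]              # single punctuation symbols
'''

sample = "The U.S.A. spent $12.40 on hard-working people; isn't that 50%?"
sample_tokens = re.findall(pattern, sample)
print(sample_tokens)
```

The order of the alternatives matters: the abbreviation branch is tried before the plain-word branch, which is why `U.S.A.` stays in one piece.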

**Compute word counts.** Now compute your frequency distribution using a FreqDist over the words. Let's not do lowercasing or stemming yet.  You can run this over the whole collection together, or sentence by sentence. Write the code for computing the FreqDist below and show the most common 20 words that result.

In [9]:
mp_freqdist = nltk.FreqDist(Tokens)  # compute the frequency distribution
mp_freqdist.most_common(20) 

[('.', 15848),
 (',', 8770),
 ('I', 6155),
 ('the', 5187),
 ('to', 5171),
 ('and', 3531),
 ('a', 3350),
 ('of', 2803),
 ('it', 2682),
 ('s', 2648),
 ('you', 2480),
 ('that', 2416),
 ('have', 2121),
 ('we', 2093),
 ('they', 1930),
 ('re', 1928),
 ('going', 1922),
 ('in', 1882),
 ('t', 1879),
 ('And', 1643)]

**Normalize the text.** Now adjust the output by normalizing the text: things you can try include removing stopwords, removing very short words, lowercasing the text, improving the tokenization, and/or doing other adjustments to bring content words higher up in the results.  The goal is to dig deeper into the collection to find interesting but relatively frequent words.  Show the code for these changes below.  

In [60]:
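punctuations = set(string.punctuation)
stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per token
# drop punctuation tokens and stopwords (compared case-insensitively)
modified_token = [i for i in Tokens if i not in punctuations and i.lower() not in stop_words]
# keep only words of five or more characters
new_token = [i for i in modified_token if len(i) >= 5]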
# new_token_long = [i for i in modified_token if len(i) >= 7]
# new_token
# new_token_long
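
Lowercasing is one of the adjustments the prompt suggests but the cell above does not apply. As a rough sketch of its effect, the snippet below counts a toy token list with and without case folding. It uses `collections.Counter` in place of `nltk.FreqDist`, and the tokens are made up for illustration.

```python
from collections import Counter

# Toy token list, made up for illustration; not taken from the speeches.
toy_tokens = ["going", "GOING", "Going", "people", "PEOPLE", "China"]

case_sensitive = Counter(toy_tokens)                   # each spelling counted separately
case_folded = Counter(t.lower() for t in toy_tokens)   # case variants merged

print(case_sensitive.most_common(2))
print(case_folded.most_common(2))
```

Folding case would merge the variants, but the capitalized-words list needs the original casing, so it may be worth keeping both versions of the token list.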

**Show adjusted word counts.** Show the most frequent 20 words that result from these adjustments.

In [11]:
freqdist = nltk.FreqDist(new_token)  # compute the frequency distribution
freqdist.most_common(20)

[('going', 1922),
 ('people', 1215),
 ('great', 619),
 ('think', 574),
 ('country', 504),
 ('money', 375),
 ('right', 361),
 ('really', 312),
 ('would', 279),
 ('never', 253),
 ('Trump', 248),
 ('thing', 233),
 ('things', 227),
 ('China', 199),
 ('world', 193),
 ('years', 192),
 ('America', 190),
 ('million', 186),
 ('happen', 183),
 ('something', 181)]

**Creating a table.**
Python provides an easy way to line columns up in a table.  You can specify a width for a string such as %6s, producing a string that is padded to width 6. It is right-justified by default, but a minus sign in front of the width switches it to left-justified, so %-3d means left-justify an integer with width 3.  *AND* if you don't know the width in advance, you can make it a variable by using an asterisk rather than a number, as in '%\*s' or '%-\*d'.  Check out this example (this is just fyi):

In [173]:
print ('%-16s' % 'Info type', '%-16s' % 'Value')
print ('%-16s' % 'number of words', '%-16d' % 100000)

Info type        Value           
number of words  100000          
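
The asterisk form can be sketched as follows: the width is passed as an extra argument ahead of the value, so it can be computed from the data. The variable names here are illustrative only.

```python
rows = [("Info type", "Value"), ("number of words", 100000)]

# Width of the first column: longest label plus one space of padding.
label_width = max(len(label) for label, _ in rows) + 1

# %-*s consumes two arguments: the width, then the string to left-justify.
lines = ["%-*s%s" % (label_width, label, value) for label, value in rows]
for line in lines:
    print(line)

# Without the minus sign the value is right-justified within the width.
right = "%*d" % (6, 42)
print(repr(right))
```

Computing the width from the labels keeps the columns aligned even when a label is longer than a fixed width such as 16.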


**Word Properties Table** Next there is a table of word properties, which you should compute (skip unique word stems, since we will do stemming in class on Wed).  Make a table that prints out:
1. number of words
2. number of unique words
3. average word length
4. longest word

You can make your table look prettier than the example I showed above if you like! (If you already know pandas, you can use that.)

You can decide for yourself if you want to eliminate punctuation and function words (stop words) or not.  It's your collection!  


In [42]:
# unique words, in order of first appearance
uniqueWords = list(dict.fromkeys(new_token))
longest_word = max(new_token, key=len)
longest_word

'Transpacific-Partnership'

In [44]:
x = [len(i) for i in new_token]
average_word_length = sum(x)/len(new_token)
average_word_length

6.744774314029242

In [45]:
print ('%-16s' % 'Number of Words', '%-16s' % len(new_token))
print ('%-16s' % 'Number of Unique words', '%-16d' % len(uniqueWords))
print ('%-16s' % 'Average Word Length', '%-16f' % average_word_length)
print ('%-16s' % 'Longest Word', '%-16s' % longest_word)

Number of Words  47123           
Number of Unique words 5933            
Average Word Length 6.744774        
Longest Word     Transpacific-Partnership
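
The four properties can also be gathered in one place. This is a minimal sketch with a hypothetical `word_properties` helper and a made-up token list. It uses a `set` for the unique-word count and sizes the label column from the longest label so that the values line up.

```python
def word_properties(tokens):
    """Return the four word properties for a list of tokens (hypothetical helper)."""
    vocabulary = set(tokens)  # duplicates collapse automatically in a set
    return {
        "number of words": len(tokens),
        "number of unique words": len(vocabulary),
        "average word length": sum(len(t) for t in tokens) / len(tokens),
        "longest word": max(tokens, key=len),
    }

# Toy tokens, made up for illustration.
props = word_properties(["great", "people", "great", "country"])

label_width = max(len(label) for label in props) + 1
for label, value in props.items():
    print("%-*s%s" % (label_width, label, value))
```

Wrapping the computation in a function also makes it easier to run the same analysis on a second collection without copying cells.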


**Most Frequent Words List.** Next is the most frequent words list.  This table shows the percent of the total as well as the most frequent words, so compute this number as well.  

In [222]:
frequent_word_list = []
common_tokens = freqdist.most_common(20)
for i in common_tokens:
    frequent_word_list.append([i[0],((i[1]/freqdist.N())*100)])
frequent_word_list
print ('%-16s' % 'Word', '%-16s' % 'Percentage')
for item in frequent_word_list:
    print ('%-16s' % item[0], '%-16f' % item[1])

Word             Percentage      
going            4.078688        
people           2.578359        
great            1.313584        
think            1.218089        
country          1.069541        
money            0.795790        
right            0.766080        
really           0.662097        
would            0.592068        
never            0.536893        
Trump            0.526282        
thing            0.494451        
things           0.481718        
China            0.422299        
world            0.409566        
years            0.407444        
America          0.403200        
million          0.394712        
happen           0.388345        
something        0.384101        
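
As a worked check of the percentage column, the first row can be reproduced from two numbers already shown in the notebook: the count for 'going' (1922) and the total number of filtered tokens (47123). The helper name is hypothetical.

```python
def percent_of_total(count, total):
    """Share of `total` taken by `count`, as a percentage."""
    return count / total * 100

# Figures from the notebook: 'going' occurs 1922 times among 47123 filtered tokens.
going_pct = percent_of_total(1922, 47123)
print("%-16s%f" % ("going", going_pct))
```

Note that the denominator is the filtered token count, so these are percentages of the words that survived stopword and length filtering, not of all words in the speeches.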


**Most Frequent Capitalized Words List** We haven't lower-cased the text so you should be able to compute this. Don't worry about whether capitalization comes from proper nouns, start of sentences, or elsewhere. You need to make a different FreqDist to do this one.  Write the code here for the new FreqDist and the List itself.  Show the list here.

In [13]:
capitalized_tokens = [i for i in new_token if i[0].isupper()]
capitalized_tokens
freqdist_capital = nltk.FreqDist(capitalized_tokens)  # compute the frequency distribution
freqdist_capital.most_common(20)

[('Trump', 248),
 ('China', 199),
 ('America', 190),
 ('Mexico', 155),
 ('Hillary', 151),
 ('Thank', 148),
 ('GOING', 127),
 ('United', 124),
 ('States', 120),
 ('Right', 116),
 ('Clinton', 112),
 ('Obama', 108),
 ("WE'RE", 98),
 ('Israel', 82),
 ('American', 82),
 ('President', 80),
 ('PEOPLE', 74),
 ('Nobody', 73),
 ('South', 72),
 ('Japan', 64)]

In [15]:
frequent_word_list_capital = []
common_tokens_capital = freqdist_capital.most_common(20)
for i in common_tokens_capital:
    frequent_word_list_capital.append([i[0],((i[1]/freqdist.N())*100)])
print ('%-16s' % 'Word', '%-16s' % 'Percentage')
for item in frequent_word_list_capital:
    print ('%-16s' % item[0], '%-16f' % item[1])

Word             Percentage      
Trump            0.526282        
China            0.422299        
America          0.403200        
Mexico           0.328926        
Hillary          0.320438        
Thank            0.314072        
GOING            0.269507        
United           0.263141        
States           0.254653        
Right            0.246164        
Clinton          0.237676        
Obama            0.229187        
WE'RE            0.207966        
Israel           0.174013        
American         0.174013        
President        0.169768        
PEOPLE           0.157036        
Nobody           0.154914        
South            0.152792        
Japan            0.135815        


**Sentence Properties Table** This summarizes number of sentences and average sentence length in words and characters (you decide if you want to include stopwords/punctuation or not).  Print those out in a table here.

In [53]:
avg_word = len(new_token)/len(sent_text)
avg_char = len(text)/len(sent_text)
print ('%-16s' % 'number of sentences', '%-16d' % len(sent_text))
print ('%-16s' % 'average sentence length in words', '%-16f' % avg_word)
print ('%-16s' % 'average sentence length in characters', '%-16f' % avg_char)

number of sentences 16493           
average sentence length in words 2.857152        
average sentence length in characters 54.342388       
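
The words-per-sentence figure above divides the filtered token count (no stopwords, no punctuation, no words under five characters) by the number of sentences, so it understates how many words a sentence actually contains. A rough sketch of the unfiltered alternative follows. The word pattern is a simplified assumption, and the helper name is hypothetical.

```python
import re

WORD = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")  # simplified word pattern (assumption)

def average_sentence_length(sentences):
    """Mean number of word tokens per sentence, with no filtering."""
    counts = [len(WORD.findall(s)) for s in sentences]
    return sum(counts) / len(counts)

# Three sentences copied from the tokenizer output shown earlier in the notebook.
sample_sentences = ["That's so nice.", "It's just not fair.", "Very simple."]
avg_len = average_sentence_length(sample_sentences)
print(avg_len)
```

Either definition is defensible as long as the table says which one it reports.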


**Reflect on the Output** (Write a brief paragraph below answering these questions.) What does it tell you about your collection?  What does it fail to tell you?  How does your collection perhaps differ from others?

### Reflection
This initial look gives a quick overview of the text collection. Statistics such as average sentence length suggest that the speeches are made up of short bursts of words rather than long sentences. We also see the number of unique tokens and their frequency of occurrence throughout the text. Although these highlight certain key terms that are common in a political campaign, they do not show the significance of those terms within the larger text.

So far, the initial look has not provided much deeper insight into the text collection. One apparent difference from other collections is that proper nouns make up a large share of the most common words, even in the list that was not restricted to capitalized words.

Some Fun Notes: Given the political nature of the speeches, it was still interesting to note the high use of terms such as America, million, Obama, China, Hillary, Mexico, etc. It was also interesting that 'Trump' was one of the most common words in speeches given by Donald Trump himself, which might need further analysis. His emphasis on making America great again also shows in the frequent use of the word 'great'.



**Compare to Another Collection** Now do the same analysis on another collection in NLTK.  
If your collection is a book, you can compare against another book.   Or you can contrast against an entirely different collection  (Brown corpus, presidential inaugural addresses, etc) to see the difference.
The list of collections is here: http://www.nltk.org/nltk_data/
Reflect on the similarities to or differences from your text collection.


In [17]:
with open("Dracula-Bram_Stoker.txt") as w:
    text = w.read()

In [22]:
sent_text = sent_tokenizer.tokenize(text)
len(sent_text)
sent_text[1]

'You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org/license\n\n\nTitle: Dracula\n\nAuthor: Bram Stoker\n\nRelease Date: August 16, 2013 [EBook #345]\n\nLanguage: English\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\n\n\n\n\nProduced by Chuck Greif and the Online Distributed\nProofreading Team at http://www.pgdp.net (This file was\nproduced from images generously made available by The\nInternet Archive)\n\n\n\n\n\n\n\n                                DRACULA\n\n\n\n\n\n                                DRACULA\n\n                                  _by_\n\n                              Bram Stoker\n\n                        [Illustration: colophon]\n\n                                NEW YORK\n\n                            GROSSET & DUNLAP\n\n                              _Publishers_\n\n      Copyright, 1897, in the United States of America, according\n                   to Act of
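
The output above shows the Project Gutenberg header being treated as part of a sentence. One possible cleanup, sketched here, is to keep only the text between the START and END marker lines before tokenizing. The START marker appears in the output above. The matching END marker is an assumption about the file, and the helper name and demo string are hypothetical.

```python
def strip_gutenberg_boilerplate(raw):
    """Keep only the text between the START and END marker lines, if both exist."""
    start = raw.find("*** START OF")
    end = raw.find("*** END OF")  # assumed to exist near the end of the file
    if start == -1 or end == -1 or end < start:
        return raw  # markers not found: leave the text untouched
    body_start = raw.find("\n", start)  # skip the rest of the START marker line
    if body_start == -1 or body_start > end:
        return raw
    return raw[body_start:end].strip()

# Tiny made-up stand-in for the real file.
demo = (
    "Title: Dracula\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\n"
    "CHAPTER I\nLeft Munich at 8:35 P. M.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK DRACULA ***\n"
    "License text\n"
)
cleaned = strip_gutenberg_boilerplate(demo)
print(cleaned)
```

This would still leave the title page and chapter headings in place, so words like 'CHAPTER' may need separate handling.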

In [18]:
para_text = re.sub('[\n]+','\n', text)
para_text = para_text.split('\n')
opening_phrase = para_text[1]
opening_phrase

'This eBook is for the use of anyone anywhere at no cost and with'

In [25]:
opening_phrase_modified = sent_text[2:4]
print(opening_phrase_modified)

['Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at\nVienna early next morning; should have arrived at 6:46, but train was an\nhour late.', 'Buda-Pesth seems a wonderful place, from the glimpse which I\ngot of it from the train and the little I could walk through the\nstreets.']


In [61]:
text = text.replace('\ufeff', '')
new_text_mod = re.sub('[\n]+','\n', text)
sent_text_mod = sent_tokenizer.tokenize(new_text_mod)  # tokenize the Dracula text, not the speeches
# sent_text_mod

In [62]:
pattern = r'''(?x)  # set flag to allow verbose regexps
 (?:[A-Z]\.)+[A-Z]*        # abbreviations, e.g. U.S.A.
| [a-zA-Z]+(?:[-'][a-zA-Z]+)*            # words with optional internal hyphens or apostrophes
| \$?\d+(?:\.\d+)?%?     # numbers, optionally with $ or %, e.g. $12.40, $33, 82%
| [+/\-@&*.,;"'?():_`]   # special symbols
'''

Tokens_mod = nltk.regexp_tokenize(new_text_mod,pattern)
# Tokens_mod

In [29]:
mp_freqdist_mod = nltk.FreqDist(Tokens_mod)  # compute the frequency distribution
mp_freqdist_mod.most_common(20) 

[(',', 11396),
 ('.', 8383),
 ('the', 7472),
 ('and', 5797),
 ('I', 4795),
 ('to', 4501),
 ('of', 3706),
 ('"', 2955),
 ('-', 2945),
 ('a', 2935),
 ('in', 2465),
 ('that', 2426),
 ('he', 1986),
 ('was', 1871),
 ('it', 1802),
 (';', 1684),
 ('is', 1501),
 ('for', 1480),
 ('as', 1476),
 ('me', 1453)]

In [57]:
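punctuations = set(string.punctuation)
stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per token
# drop punctuation tokens and stopwords (compared case-insensitively)
modified_token_mod = [i for i in Tokens_mod if i not in punctuations and i.lower() not in stop_words]
# keep only words of five or more characters
new_token_mod = [i for i in modified_token_mod if len(i) >= 5]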
# new_token_mod

In [33]:
freqdist_mod = nltk.FreqDist(new_token_mod)  # compute the frequency distribution
freqdist_mod.most_common(20)

[('could', 490),
 ('would', 427),
 ('shall', 425),
 ('Helsing', 295),
 ('seemed', 242),
 ('though', 217),
 ('think', 217),
 ('night', 209),
 ('looked', 185),
 ('Jonathan', 182),
 ('sleep', 179),
 ('great', 173),
 ('things', 171),
 ('little', 164),
 ('friend', 164),
 ('might', 159),
 ('found', 154),
 ('Professor', 153),
 ('Count', 149),
 ('thought', 146)]

In [34]:
# unique words, in order of first appearance
uniqueWords_mod = list(dict.fromkeys(new_token_mod))
longest_word_mod = max(new_token_mod, key=len)
longest_word_mod

'two-pages-to-the-week-with-Sunday-squeezed-in-a-corner'

In [35]:
x = [len(i) for i in new_token_mod]
average_word_length_mod = sum(x)/len(new_token_mod)
average_word_length_mod

6.710401249972874

In [36]:
print ('%-16s' % 'Number of Words', '%-16s' % len(new_token_mod))
print ('%-16s' % 'Number of Unique words', '%-16d' % len(uniqueWords_mod))
print ('%-16s' % 'Average Word Length', '%-16f' % average_word_length_mod)
print ('%-16s' % 'Longest Word', '%-16s' % longest_word_mod)

Number of Words  46081           
Number of Unique words 9140            
Average Word Length 6.710401        
Longest Word     two-pages-to-the-week-with-Sunday-squeezed-in-a-corner


In [37]:
frequent_word_list_mod = []
common_tokens_mod = freqdist_mod.most_common(20)
for i in common_tokens_mod:
    frequent_word_list_mod.append([i[0],((i[1]/freqdist_mod.N())*100)])
frequent_word_list_mod
print ('%-16s' % 'Word', '%-16s' % 'Percentage')
for item in frequent_word_list_mod:
    print ('%-16s' % item[0], '%-16f' % item[1])

Word             Percentage      
could            1.063345        
would            0.926629        
shall            0.922289        
Helsing          0.640177        
seemed           0.525162        
though           0.470910        
think            0.470910        
night            0.453549        
looked           0.401467        
Jonathan         0.394957        
sleep            0.388446        
great            0.375426        
things           0.371086        
little           0.355895        
friend           0.355895        
might            0.345045        
found            0.334194        
Professor        0.332024        
Count            0.323344        
thought          0.316833        


In [38]:
capitalized_tokens_mod = [i for i in new_token_mod if i[0].isupper()]
capitalized_tokens_mod
freqdist_capital_mod = nltk.FreqDist(capitalized_tokens_mod)  # compute the frequency distribution
freqdist_capital_mod.most_common(20)

[('Helsing', 295),
 ('Jonathan', 182),
 ('Professor', 153),
 ('Count', 149),
 ('Arthur', 133),
 ('Harker', 106),
 ('Madam', 92),
 ('Godalming', 86),
 ('Project', 83),
 ('Quincey', 80),
 ('Seward', 80),
 ('Morris', 74),
 ("Lucy's", 74),
 ('London', 67),
 ('September', 61),
 ('Gutenberg-tm', 55),
 ('CHAPTER', 54),
 ("Harker's", 52),
 ("Seward's", 46),
 ("Count's", 45)]

In [39]:
frequent_word_list_capital_mod = []
common_tokens_capital_mod = freqdist_capital_mod.most_common(20)
for i in common_tokens_capital_mod:
    frequent_word_list_capital_mod.append([i[0],((i[1]/freqdist_mod.N())*100)])
print ('%-16s' % 'Word', '%-16s' % 'Percentage')
for item in frequent_word_list_capital_mod:
    print ('%-16s' % item[0], '%-16f' % item[1])

Word             Percentage      
Helsing          0.640177        
Jonathan         0.394957        
Professor        0.332024        
Count            0.323344        
Arthur           0.288622        
Harker           0.230030        
Madam            0.199648        
Godalming        0.186628        
Project          0.180118        
Quincey          0.173607        
Seward           0.173607        
Morris           0.160587        
Lucy's           0.160587        
London           0.145396        
September        0.132376        
Gutenberg-tm     0.119355        
CHAPTER          0.117185        
Harker's         0.112845        
Seward's         0.099824        
Count's          0.097654        


In [40]:
avg_word_mod = len(new_token_mod)/len(sent_text_mod)
avg_char_mod = len(text)/len(sent_text)
print ('%-16s' % 'number of sentences', '%-16d' % len(sent_text))
print ('%-16s' % 'average sentence length in words', '%-16f' % avg_word_mod)
print ('%-16s' % 'average sentence length in characters', '%-16f' % avg_char_mod)

number of sentences 8569            
average sentence length in words 5.377640        
average sentence length in characters 101.195005      


In [54]:
print ('%-16s' % 'Trump Speeches')
print ('%-16s' % 'Number of Words', '%-16s' % len(new_token))
print ('%-16s' % 'Number of Unique words', '%-16d' % len(uniqueWords))
print ('%-16s' % 'Average Word Length', '%-16f' % average_word_length)
print ('%-16s' % 'Longest Word', '%-16s' % longest_word)
print ('%-16s' % 'number of sentences', '%-16d' % len(sent_text))
print ('%-16s' % 'average sentence length in words', '%-16f' % avg_word)
print ('%-16s' % 'average sentence length in characters', '%-16f' % avg_char)

Trump Speeches  
Number of Words  47123           
Number of Unique words 5933            
Average Word Length 6.744774        
Longest Word     Transpacific-Partnership
number of sentences 16493           
average sentence length in words 2.857152        
average sentence length in characters 54.342388       


In [56]:
print ('%-16s' % 'Dracula')
print ('%-16s' % 'Number of Words', '%-16s' % len(new_token_mod))
print ('%-16s' % 'Number of Unique words', '%-16d' % len(uniqueWords_mod))
print ('%-16s' % 'Average Word Length', '%-16f' % average_word_length_mod)
print ('%-16s' % 'Longest Word', '%-16s' % longest_word_mod)
print ('%-16s' % 'number of sentences', '%-16d' % len(sent_text_mod))
print ('%-16s' % 'average sentence length in words', '%-16f' % avg_word_mod)
print ('%-16s' % 'average sentence length in characters', '%-16f' % avg_char_mod)

Dracula         
Number of Words  46081           
Number of Unique words 9140            
Average Word Length 6.710401        
Longest Word     two-pages-to-the-week-with-Sunday-squeezed-in-a-corner
number of sentences 8569            
average sentence length in words 5.377640        
average sentence length in characters 101.195005      


### Comparison

There seem to be some interesting similarities and differences between the text collections based on this first look. My partner's collection, Dracula, begins with a lot of text metadata, license and usage policy, which differs from the structure of my collection. Comparing the first sentence in both texts, we see that for 'Speeches.txt' the initial sentence more closely resembles an actual sentence, whereas the first sentence of 'Dracula' contained the collection metadata along with chapter numbers and titles. That is incorrect and hence may require additional cleanup before tokenization.
A similarity between the texts was that initially the most common words for both were mostly stop words and punctuation rather than meaningful words. Both texts revealed better insights for common words after removing stop words, punctuation and short words. Because hyphens were allowed inside tokens, the longest word in both texts was a phrase rather than a single word.

Some very interesting comparisons can be made when looking at the text properties:

-- Both texts had a similar average word length of approximately 6.7

-- Although the number of words is similar in both texts, 'Dracula' contained approximately 3,200 more unique words than 'Trump Speeches', which might shed light on Donald Trump's pattern of reiterating his words and using similar arguments in all his speeches (but it might be premature to make such assumptions).

-- The difference in the number of sentences and average sentence length shows that 'Speeches.txt' contains almost double the number of sentences, each about half as long. This is an interesting characteristic to note.
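
The vocabulary comparison can be put on a common scale with a type-token ratio (unique words divided by total words). This is a small sketch using the counts from the two summary tables. The helper name is hypothetical, and the ratio is an added measure rather than something the assignment asked for.

```python
def type_token_ratio(unique_count, token_count):
    """Unique words divided by total words; higher means more varied vocabulary."""
    return unique_count / token_count

# Counts taken from the two summary tables above.
ttr_speeches = type_token_ratio(5933, 47123)
ttr_dracula = type_token_ratio(9140, 46081)
print("Trump Speeches %.3f" % ttr_speeches)
print("Dracula        %.3f" % ttr_dracula)
```

The ratio depends on text length, but the two filtered collections are close in size (47123 and 46081 tokens), so the comparison is reasonably like for like.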