# Homework 1: Word Frequencies

## Challenge
Can we identify different types of text documents based on the frequency of their words? Can we identify different authors, styles, or disciplines like medical versus information technology? The assignment is to compute word frequencies for different types of documents, and to develop patterns for document classification.

## Tasks
1. Write Python code to load different text documents and compute word frequencies. The most frequent words should be at the beginning of the list.
2. Identify a small set of words (about 5 to 10) that could represent a particular type of document.
3. Show how different types have different word lists ("signatures").
4. Discuss results and the feasibility of this method.
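
As a rough illustration of tasks 1 to 3, the sketch below counts words with `collections.Counter` and scores a text against hand-made signatures by counting shared words. The sample sentence, the `SIGNATURES` dictionary, and the helper names `word_frequencies` and `best_match` are hypothetical and exist only for this example; the assignment does not prescribe this scoring rule.

```python
from collections import Counter

# Hypothetical signatures: a few words that might mark each document type.
SIGNATURES = {
    "medical": {"patient", "artery", "muscle", "physician", "health"},
    "religious": {"lord", "god", "man", "earth", "king"},
}

def word_frequencies(text):
    """Return (word, count) pairs, most frequent first."""
    words = text.casefold().split()
    return Counter(words).most_common()

def best_match(text, signatures=SIGNATURES, top_n=12):
    """Pick the signature sharing the most words with the text's top words."""
    top_words = {w for w, _ in word_frequencies(text)[:top_n]}
    scores = {name: len(top_words & sig) for name, sig in signatures.items()}
    return max(scores, key=scores.get), scores

# Made-up sample sentence, for illustration only.
sample = "the patient has a torn muscle near the artery the physician said"
label, scores = best_match(sample)
print(label, scores)
```

Counting shared words is the simplest possible score; it ignores how long each document is and how often each signature word occurs.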

## Deliverable
Use this notebook to implement your assignment. Please observe the following:
1. Your notebook should contain the completely executed code and results.
2. Please organize your notebook to tell the story. Remove unnecessary clutter, test code, and anything that does not belong to the story.
3. Save your notebook in a directory named `HW1` in `MSA8010F16` in your *home* directory on the Hadoop Cluster. The path should be `~/MSA8010F16/HW1/HW1.ipynb`.
4. Also save the notebook in HTML as `~/MSA8010F16/HW1/HW1.html`
5. All file names are *case sensitive*!

In [70]:
##Step 1: Load the data
from urllib.request import urlopen
import string
with urlopen('http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt') as src:
    words = []
    txt = src.readlines()
    for t in txt[244:]:
        words = words + (t.decode().replace('\n','').casefold().split(' '))    

In [71]:
##Step 2: remove all the punctuation
def remove_punct(doc):
    import re
    from string import punctuation
    # re.escape keeps characters such as \, ] and ^ literal inside the character class
    r = re.compile('[{}]'.format(re.escape(punctuation)))
    output = []
    for x in doc:
        output.append(r.sub('',x))
    return output

words2 = remove_punct(words)

In [72]:
##Step 3: Remove empty entries
words3 = [w for w in words2 if w]

In [73]:
##Step 4: remove stop words
def remove_stops(doc):
    with urlopen('http://www.textfixer.com/resources/common-english-words.txt') as stop_words_src:
        # A set makes each membership test fast instead of scanning a list
        stop_words = set()
        for x in stop_words_src.readlines():
            # strip() drops any trailing newline so the last word on a line still matches
            stop_words.update(x.decode().strip().split(','))
    return [w for w in doc if w not in stop_words]

words4 = remove_stops(words3)

In [92]:
##Step 5: Find the top 25 meaningful words in the document
def find_top25_words(doc):
    from collections import Counter
    freq = Counter(doc)
    top25 = freq.most_common(25)    
    return (top25)    

print (find_top25_words(words4))

[('thou', 5485), ('thy', 4032), ('shall', 3591), ('thee', 3178), ('lord', 3059), ('king', 2861), ('good', 2812), ('now', 2778), ('sir', 2754), ('o', 2607), ('come', 2507), ('well', 2462), ('more', 2288), ('here', 2114), ('enter', 2098), ('love', 2053), ('ill', 1972), ('hath', 1941), ('man', 1835), ('one', 1779), ('go', 1733), ('upon', 1731), ('know', 1647), ('make', 1629), ('such', 1608)]


In [None]:
Now that all of the necessary functions are in place, they can be run quickly on other documents.
First, let's see whether other classic works from the Renaissance can be picked out using the words above.
If we take out thou, thy, thee, and shall, which really don't matter, along with the filler words now, o, hath, upon, and such, we're left with:
    lord
    king
    good
    sir
    come
    well
    more
    here
    enter
    love
    ill
    man
    one
    go
    know
    make
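
Since the same steps are repeated for every document that follows, they could be wrapped in one helper. The sketch below is one possible way to do that and is not the code used in the cells of this notebook: `top_words`, `PUNCT_RE`, `demo_stops`, and `demo_text` are hypothetical names, the function works on a string that is already in memory, and it splits on any whitespace rather than on single spaces.

```python
import re
from collections import Counter
from string import punctuation

# Matches any single ASCII punctuation character.
PUNCT_RE = re.compile('[{}]'.format(re.escape(punctuation)))

def top_words(text, stop_words, n=25):
    """Lowercase, strip punctuation, drop empties and stop words, then count."""
    words = text.casefold().split()
    words = [PUNCT_RE.sub('', w) for w in words]
    words = [w for w in words if w and w not in stop_words]
    return Counter(words).most_common(n)

# Tiny hypothetical stop-word set and sample text, for illustration only.
demo_stops = {"the", "and", "is", "my"}
demo_text = "The king is good, and the king is my lord. Good king!"
print(top_words(demo_text, demo_stops, n=3))
```

Passing the stop words in as an argument means the list only has to be downloaded once, however many documents are processed.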

In [93]:
##The Tragical History of Doctor Faustus by Christopher Marlowe
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/779/pg779.txt') as src:
    marlowe = []
    txt = src.readlines()
    for t in txt[55:2147]:
        marlowe = marlowe + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

marlowe2 = remove_punct(marlowe)
marlowe3 = [w for w in marlowe2 if w]
marlowe4 = remove_stops(marlowe3)
marlowe5 = find_top25_words(marlowe4)
print (marlowe5)

[('faustus', 292), ('thou', 119), ('mephist', 72), ('thee', 72), ('shall', 66), ('ill', 58), ('mephistophilis', 57), ('thy', 54), ('now', 51), ('come', 50), ('soul', 47), ('hell', 43), ('god', 43), ('lucifer', 42), ('o', 39), ('wagner', 37), ('see', 37), ('enter', 36), ('art', 35), ('tell', 33), ('good', 32), ('well', 30), ('sir', 28), ('doctor', 28), ('scholar', 27)]


In [None]:
If we remove faustus, mephist, and mephistophilis as character names, the remaining words are very similar to the Shakespearean words.
4 of the first 5 are the same: thou, thee, shall, thy.
These are all standard words; perhaps they could help us determine the era, while the remaining top words could help distinguish the authors?
    ill
    now
    come
    soul
    hell
    god
    lucifer
    see
    enter
    art
    tell
    good
    well
    sir
    doctor
    scholar
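
The split between "era" words and "author" words can be checked mechanically. The sketch below copies the two printed top-25 lists, drops the three character names mentioned above, and separates Marlowe's remaining words into those shared with Shakespeare and those that are not. The variable names are hypothetical, and unlike the hand-trimmed list above, nothing else (such as o or wagner) is removed.

```python
# Top-25 words as printed above for Shakespeare and Marlowe.
shakespeare = ["thou", "thy", "shall", "thee", "lord", "king", "good", "now",
               "sir", "o", "come", "well", "more", "here", "enter", "love",
               "ill", "hath", "man", "one", "go", "upon", "know", "make", "such"]
marlowe = ["faustus", "thou", "mephist", "thee", "shall", "ill",
           "mephistophilis", "thy", "now", "come", "soul", "hell", "god",
           "lucifer", "o", "wagner", "see", "enter", "art", "tell", "good",
           "well", "sir", "doctor", "scholar"]

# Character names removed by hand, as in the text.
names = {"faustus", "mephist", "mephistophilis"}
shakespeare_set = set(shakespeare)

marlowe_words = [w for w in marlowe if w not in names]
shared = [w for w in marlowe_words if w in shakespeare_set]
marlowe_only = [w for w in marlowe_words if w not in shakespeare_set]

print("shared with Shakespeare:", shared)
print("Marlowe only:", marlowe_only)
```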
Let's look at another work from the same period.

In [94]:
##Spenser's The Faerie Queene
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/15272/pg15272.txt') as src:
    spencer = []
    txt = src.readlines()
    for t in txt[994:8535]:
        spencer = spencer + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

spencer2 = remove_punct(spencer)
spencer3 = [w for w in spencer2 if w]
spencer4 = remove_stops(spencer3)
spencer5 = find_top25_words(spencer4)
print (spencer5)

[('knight', 169), ('faire', 128), ('great', 127), ('now', 121), ('gan', 112), ('through', 101), ('well', 93), ('doth', 92), ('long', 89), ('forth', 89), ('whose', 86), ('unto', 84), ('way', 80), ('ne', 78), ('full', 74), ('life', 73), ('day', 72), ('up', 72), ('thy', 71), ('such', 71), ('never', 71), ('both', 69), ('man', 67), ('upon', 66), ('made', 66)]


In [None]:
    knight
    faire
    great
    now
    gan
    well
    long
    way
    full
    life
    day
    never
    both
    man
    made
This document doesn't seem to fit the initial word list from Shakespeare and Marlowe.
The only words that appear in all 3 top-25 lists are now, well, and thy.
So we probably can't get as specific as era.
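
One way to put a number on how poorly the lists match is the Jaccard similarity, the size of the intersection divided by the size of the union. The sketch below applies it to the three full top-25 lists printed above (not the hand-trimmed ones). The `top25` dictionary and the `jaccard` helper are hypothetical names, and this measure is a suggestion, not something the assignment requires.

```python
from itertools import combinations

# Full top-25 word lists as printed above, stored as sets.
top25 = {
    "shakespeare": {"thou", "thy", "shall", "thee", "lord", "king", "good",
                    "now", "sir", "o", "come", "well", "more", "here", "enter",
                    "love", "ill", "hath", "man", "one", "go", "upon", "know",
                    "make", "such"},
    "marlowe": {"faustus", "thou", "mephist", "thee", "shall", "ill",
                "mephistophilis", "thy", "now", "come", "soul", "hell", "god",
                "lucifer", "o", "wagner", "see", "enter", "art", "tell",
                "good", "well", "sir", "doctor", "scholar"},
    "spenser": {"knight", "faire", "great", "now", "gan", "through", "well",
                "doth", "long", "forth", "whose", "unto", "way", "ne", "full",
                "life", "day", "up", "thy", "such", "never", "both", "man",
                "upon", "made"},
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

# Pairwise similarity between the three lists.
for x, y in combinations(top25, 2):
    print(x, y, round(jaccard(top25[x], top25[y]), 2))

# Words present in all three lists.
common = set.intersection(*top25.values())
print(sorted(common))
```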
Maybe we can distinguish between fiction authors? Let's try it with Jane Austen.

In [108]:
##Austen - Sense & Sensibility
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/austen-sense-758.txt') as src:
    sense = []
    txt = src.readlines()
    for t in txt:
        sense = sense + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

sense2 = remove_punct(sense)
sense3 = [w for w in sense2 if w]
sense4 = remove_stops(sense3)
sense5 = find_top25_words(sense4)
print (sense5)

[('elinor', 618), ('mrs', 526), ('very', 497), ('marianne', 490), ('more', 407), ('such', 358), ('one', 327), ('much', 287), ('herself', 249), ('time', 241), ('now', 234), ('know', 229), ('dashwood', 224), ('sister', 214), ('though', 213), ('edward', 210), ('well', 209), ('miss', 209), ('think', 206), ('jennings', 203), ('mother', 199), ('before', 198), ('never', 186), ('thing', 184), ('nothing', 180)]


In [None]:
mrs/miss
very
more
such
much
herself
time
now
know
sister
though
well
think
mother
before
never
thing
nothing

In [109]:
##Austen - Pride & Prejudice
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/austen-pride-757.txt') as src:
    pride = []
    txt = src.readlines()
    for t in txt:
        pride = pride + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

pride2 = remove_punct(pride)
pride3 = [w for w in pride2 if w]
pride4 = remove_stops(pride3)
pride5 = find_top25_words(pride4)
print (pride5)

[('mr', 815), ('elizabeth', 610), ('very', 487), ('darcy', 401), ('such', 389), ('mrs', 358), ('much', 328), ('more', 323), ('bennet', 308), ('one', 297), ('miss', 294), ('jane', 278), ('bingley', 272), ('know', 237), ('before', 229), ('herself', 227), ('though', 226), ('never', 221), ('well', 219), ('soon', 217), ('think', 211), ('now', 211), ('time', 203), ('good', 196), ('lady', 194)]


In [None]:
mr
very
such
mrs/miss
much
more
know
before
herself
though
never
well
soon
think
now
time
good
lady

In [110]:
##Austen - Emma
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/austen-emma-754.txt') as src:
    emma = []
    txt = src.readlines()
    for t in txt:
        emma = emma + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

emma2 = remove_punct(emma)
emma3 = [w for w in emma2 if w]
emma4 = remove_stops(emma3)
emma5 = find_top25_words(emma4)
print (emma5)

[('very', 1187), ('mr', 1124), ('emma', 751), ('mrs', 687), ('miss', 587), ('much', 474), ('such', 471), ('more', 463), ('one', 428), ('harriet', 391), ('thing', 385), ('think', 384), ('weston', 382), ('little', 361), ('being', 358), ('well', 353), ('never', 346), ('knightley', 337), ('know', 322), ('elton', 317), ('good', 307), ('now', 302), ('quite', 275), ('jane', 272), ('herself', 267)]


In [None]:
very
mr
mrs/miss
much
such
more
thing
think
little
being
well
never
know
good
now
quite
herself


And now let's compare to other women writers from around the same time.

In [111]:
## Louisa May Alcott - Little Women
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/li_women') as src:
    little = []
    txt = src.readlines()
    for t in txt:
        little = little + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

little2 = remove_punct(little)
little3 = [w for w in little2 if w]
little4 = remove_stops(little3)
little5 = find_top25_words(little4)
print (little5)

[('jo', 1254), ('one', 866), ('little', 728), ('up', 647), ('meg', 638), ('amy', 573), ('laurie', 552), ('dont', 551), ('very', 494), ('out', 482), ('beth', 418), ('good', 407), ('now', 399), ('go', 393), ('im', 390), ('well', 376), ('never', 375), ('much', 371), ('old', 366), ('see', 361), ('over', 353), ('more', 346), ('away', 331), ('mother', 329), ('time', 321)]


In [None]:
little
up
dont
very
out
good
now
go
im
well
never
much
old
see
over
more
away
mother
time

In [112]:
##Bronte - Jane Eyre
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/bronte-jane-178.txt') as src:
    jane = []
    txt = src.readlines()
    for t in txt:
        jane = jane + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

jane2 = remove_punct(jane)
jane3 = [w for w in jane2 if w]
jane4 = remove_stops(jane3)
jane5 = find_top25_words(jane4)
print (jane5)

[('now', 666), ('one', 579), ('mr', 542), ('out', 402), ('up', 384), ('very', 377), ('more', 362), ('little', 342), ('jane', 339), ('well', 325), ('rochester', 317), ('sir', 314), ('miss', 308), ('never', 294), ('before', 286), ('see', 276), ('such', 257), ('thought', 256), ('over', 255), ('mrs', 250), ('go', 248), ('down', 245), ('again', 245), ('still', 245), ('shall', 241)]


In [None]:
now
mr
out
up
very
more
little
well
sir
miss/mrs
never
before
see
such
thought
over
go
down
again
still
shall

Now compare with a book about a woman, written in the same time period by a man.

In [113]:
##Tolstoy - Anna Karenina
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/anna_karenina') as src:
    anna = []
    txt = src.readlines()
    for t in txt:
        anna = anna + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))

# Named anna rather than relig so it is not confused with the religious texts below
anna2 = remove_punct(anna)
anna3 = [w for w in anna2 if w]
anna4 = remove_stops(anna3)
anna5 = find_top25_words(anna4)
print (anna5)

[('levin', 1524), ('up', 1287), ('one', 1201), ('out', 1004), ('now', 896), ('vronsky', 779), ('more', 747), ('anna', 742), ('well', 696), ('come', 682), ('go', 678), ('very', 673), ('know', 669), ('went', 638), ('alexei', 625), ('himself', 615), ('see', 613), ('kitty', 600), ('over', 581), ('time', 554), ('thought', 552), ('felt', 551), ('stepan', 550), ('eyes', 546), ('yes', 544)]


In [None]:
up
out
now
more
well
come
go
very
know
went
himself
see
over
time
thought
felt
eyes
yes

Let's try something else and see if we can categorize religious documents.

In [95]:
##The Bible
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/bible10.txt') as src:
    relig = []
    txt = src.readlines()
    for t in txt:
        relig = relig + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

relig2 = remove_punct(relig)
relig3 = [w for w in relig2 if w]
relig4 = remove_stops(relig3)
relig5 = find_top25_words(relig4)
print (relig5)

[('shall', 9838), ('unto', 8997), ('lord', 7830), ('thou', 5474), ('thy', 4600), ('god', 4442), ('ye', 3982), ('thee', 3826), ('out', 2775), ('upon', 2748), ('man', 2613), ('israel', 2565), ('up', 2380), ('son', 2370), ('hath', 2264), ('king', 2258), ('people', 2139), ('came', 2093), ('house', 2024), ('come', 1971), ('one', 1967), ('children', 1802), ('before', 1796), ('day', 1734), ('land', 1718)]


In [None]:
Again, a lot of fluff. If we keep only nouns & verbs we're left with:
    lord
    god
    came/come
    man
    israel
    son
    king
    people
    house
    children
    day
    land

In [96]:
##St. Augustine writings
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/3296/pg3296.txt') as src:
    aug = []
    txt = src.readlines()
    for t in txt[37:9369]:
        aug = aug + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

aug2 = remove_punct(aug)
aug3 = [w for w in aug2 if w]
aug4 = remove_stops(aug3)
aug5 = find_top25_words(aug4)
print (aug5)

[('thou', 1087), ('thee', 921), ('thy', 898), ('things', 551), ('god', 507), ('unto', 354), ('one', 348), ('those', 313), ('lord', 306), ('o', 301), ('out', 285), ('more', 280), ('good', 279), ('now', 279), ('made', 267), ('man', 249), ('earth', 243), ('time', 227), ('upon', 223), ('being', 219), ('soul', 210), ('such', 208), ('through', 201), ('heaven', 198), ('life', 198)]


In [None]:
Keeping only nouns gives:
        things
        god
        lord
        more
        good
        now
        made
        man
        earth
        time
        being
        soul
        heaven
        life
So we have lord and god in common, but both words also appeared earlier: lord in the Shakespeare list and god in the Marlowe list,
although never both in the same list...

Let's try another one, not Christian.

In [97]:
##Legends of the Jews volume 1
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/1493/pg1493.txt') as src:
    leg = []
    txt = src.readlines()
    for t in txt[37:9369]:
        leg = leg + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

leg2 = remove_punct(leg)
leg3 = [w for w in leg2 if w]
leg4 = remove_stops(leg3)
leg5 = find_top25_words(leg4)
print (leg5)

[('god', 788), ('abraham', 551), ('thou', 503), ('upon', 367), ('thy', 285), ('one', 278), ('earth', 271), ('unto', 271), ('adam', 252), ('man', 239), ('before', 234), ('thee', 234), ('lord', 225), ('world', 213), ('day', 206), ('angels', 206), ('son', 189), ('men', 185), ('time', 183), ('first', 169), ('shall', 168), ('himself', 161), ('out', 161), ('made', 158), ('king', 154)]


In [None]:
nouns:
    god
    abraham
    earth
    adam
    man
    lord
    world
    day
    angels
    son
    men
    time
    himself
    made
    king
Now we have 3 Western religious documents with 3 words in common, plus a fourth shared by two of them:
    lord, god, man (and earth, in Augustine and Legends only)
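
The count of shared words can be reproduced by tallying, for each word, how many of the three hand-picked lists contain it. The sketch below uses the three word lists from the cells above (with "came/come" entered as come and the spelling israel); the `nouns` dictionary and `doc_freq` counter are hypothetical names.

```python
from collections import Counter

# Hand-picked word lists from the three cells above.
nouns = {
    "bible": ["lord", "god", "come", "man", "israel", "son", "king",
              "people", "house", "children", "day", "land"],
    "augustine": ["things", "god", "lord", "more", "good", "now", "made",
                  "man", "earth", "time", "being", "soul", "heaven", "life"],
    "legends": ["god", "abraham", "earth", "adam", "man", "lord", "world",
                "day", "angels", "son", "men", "time", "himself", "made",
                "king"],
}

# For each word, count the number of documents whose list contains it.
doc_freq = Counter(w for words in nouns.values() for w in set(words))

for word, count in sorted(doc_freq.items(), key=lambda item: (-item[1], item[0])):
    if count >= 2:
        print(word, count)
```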
Let's see if these carry over to other religions. Islam should be similar, so that's first.

In [98]:
##Islam - The Koran
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/2800/pg2800.txt') as src:
    islam= []
    txt = src.readlines()
    for t in txt[8876:39809]:
        islam = islam + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

islam2 = remove_punct(islam)
islam3 = [w for w in islam2 if w]
islam4 = remove_stops(islam3)
islam5 = find_top25_words(islam4)
print (islam5)

[('god', 2870), ('shall', 1725), ('ye', 1517), ('hath', 779), ('those', 721), ('lord', 667), ('thou', 612), ('thee', 485), ('believe', 398), ('o', 386), ('verily', 377), ('one', 355), ('day', 333), ('people', 331), ('earth', 330), ('thy', 326), ('see', 322), ('men', 319), ('down', 316), ('come', 312), ('sent', 294), ('signs', 264), ('fear', 261), ('upon', 246), ('before', 239)]


In [None]:
nouns:
    god
    lord
    believe
    day
    people
    earth
    see
    men
    come
    sent
    signs
    fear
Again the same words are present: lord, god, men (a form of man), and earth.
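
Treating men as a form of man is a manual normalization step. A minimal sketch of doing this in code is shown below, assuming a small hand-written map of variants; `VARIANTS` and `normalize` are hypothetical, and a real lemmatizer would cover far more forms. The `shared_western` set holds the three words common to the earlier religious lists.

```python
# Hypothetical variant map; a real lemmatizer would cover far more forms.
VARIANTS = {"men": "man", "came": "come", "sons": "son"}

def normalize(words):
    """Replace each known variant with its base form and return a set."""
    return {VARIANTS.get(w, w) for w in words}

# Words common to the three earlier religious lists.
shared_western = {"lord", "god", "man"}

# Hand-picked word list from the Koran cell above.
koran = ["god", "lord", "believe", "day", "people", "earth", "see", "men",
         "come", "sent", "signs", "fear"]

raw_overlap = shared_western & set(koran)
normalized_overlap = shared_western & normalize(koran)

print(sorted(raw_overlap))
print(sorted(normalized_overlap))
```

Without the mapping, man would be missed because the Koran list contains only the plural.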
Now let's see if this carries over to any of the Eastern religions.
Hinduism first:

In [99]:
##Hindu - The Mahabharata of Krishna-Dwaipayana Vyasa
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/7864/pg7864.txt') as src:
    hindu = []
    txt = src.readlines()
    for t in txt[237:21594]:
        hindu = hindu + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

hindu2 = remove_punct(hindu)
hindu3 = [w for w in hindu2 if w]
hindu4 = remove_stops(hindu3)
hindu5 = find_top25_words(hindu4)
print (hindu5)


[('o', 2320), ('thou', 1624), ('king', 1179), ('son', 955), ('great', 941), ('one', 833), ('thy', 825), ('thee', 814), ('unto', 795), ('those', 710), ('shall', 625), ('having', 545), ('continued', 538), ('hath', 531), ('thus', 460), ('upon', 419), ('sons', 418), ('became', 412), ('men', 391), ('therefore', 388), ('brahmana', 380), ('monarch', 378), ('art', 374), ('parva', 362), ('race', 362)]


In [None]:
    king
    son
    great
    having
    continued
    became
    men
    monarch
    art
    race

In [101]:
##Buddhist - Buddha, The Gospel
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/NONFICTION/gospel') as src:
    budd = []
    txt = src.readlines()
    for t in txt:
        budd = budd + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

budd2 = remove_punct(budd)
budd3 = [w for w in budd2 if w]
budd4 = remove_stops(budd3)
budd5 = find_top25_words(budd4)
print (budd5)

[('one', 741), ('blessed', 525), ('truth', 305), ('thou', 298), ('buddha', 239), ('o', 192), ('man', 190), ('life', 183), ('thy', 175), ('world', 153), ('good', 144), ('tathagata', 144), ('lord', 142), ('king', 142), ('mind', 141), ('self', 138), ('now', 137), ('great', 131), ('ananda', 125), ('heart', 116), ('evil', 116), ('having', 114), ('thee', 114), ('bhikkhus', 105), ('those', 101)]


In [None]:
    blessed
    truth
    man
    life
    world
    good
    lord
    king
    mind
    self
    heart
    evil
    having

In [102]:
##Atheism: Beyond Good and Evil, by Friedrich Nietzsche
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/4363/pg4363.txt') as src:
    ath = []
    txt = src.readlines()
    for t in txt[148:6138]:
        ath = ath + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

ath2 = remove_punct(ath)
ath3 = [w for w in ath2 if w]
ath4 = remove_stops(ath3)
ath5 = find_top25_words(ath4)
print (ath5)


[('one', 379), ('more', 248), ('man', 219), ('even', 187), ('such', 151), ('itself', 145), ('perhaps', 144), ('himself', 132), ('men', 121), ('good', 119), ('still', 115), ('something', 109), ('new', 98), ('always', 96), ('themselves', 90), ('german', 89), ('out', 89), ('much', 82), ('time', 81), ('soul', 80), ('great', 79), ('love', 77), ('very', 75), ('morality', 75), ('woman', 74)]


In [None]:
    man/men
    itself
    himself/themselves
    good
    something
    time
    soul
    love
    morality
    woman

In [None]:
Medical Texts

In [103]:
##Surgical Anatomy BY JOSEPH MACLISE
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/24440/pg24440.txt') as src:
    medi = []
    txt = src.readlines()
    for t in txt[1207:13394]:
        medi = medi + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

medi2 = remove_punct(medi)
medi3 = [w for w in medi2 if w]
medi4 = remove_stops(medi3)
medi5 = find_top25_words(medi4)[:20]  # keep the top 20, using the find_top25_words helper defined above
print (medi5)


[('plate', 754), ('artery', 643), ('muscle', 476), ('vessels', 463), ('part', 378), ('internal', 347), ('side', 328), ('b', 310), ('external', 306), ('between', 294), ('c', 288), ('fascia', 283), ('d', 279), ('two', 272), ('hernia', 267), ('canal', 255), ('left', 234), ('f', 223), ('right', 222), ('fig', 221)]


In [None]:
plate
artery
muscle
vessels
part
internal
side
external
between
fascia
hernia
canal
left
right

In [104]:
##Medical Papers - The Cleveland Medical Gazette, Vol. I. No. 3., January 1886
from urllib.request import urlopen
import string
with urlopen('https://www.gutenberg.org/files/52874/52874-0.txt') as src:
    medjo = []
    txt = src.readlines()
    for t in txt[43:1921]:
        medjo = medjo + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

medjo2 = remove_punct(medjo)
medjo3 = [w for w in medjo2 if w]
medjo4 = remove_stops(medjo3)
medjo5 = find_top25_words(medjo4)[:20]  # keep the top 20, using the find_top25_words helper defined above
print (medjo5)


[('medical', 62), ('dr', 59), ('one', 56), ('case', 41), ('more', 41), ('upon', 37), ('time', 36), ('physician', 34), ('physicians', 29), ('cases', 29), ('made', 29), ('two', 28), ('being', 26), ('medicine', 25), ('found', 25), ('first', 23), ('well', 23), ('very', 22), ('health', 22), ('patient', 22)]


In [None]:
medical
dr
case
time
physician
made
being
medicine
found
well
health
patient

In [105]:
##Medical Papers - Humanistic Nursing
from urllib.request import urlopen
import string
with urlopen('http://www.gutenberg.org/cache/epub/25020/pg25020.txt') as src:
    nurse = []
    txt = src.readlines()
    for t in txt[242:6094]:
        nurse = nurse + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

nurse2 = remove_punct(nurse)
nurse3 = [w for w in nurse2 if w]
nurse4 = remove_stops(nurse3)
nurse5 = find_top25_words(nurse4)[:20]  # keep the top 20, using the find_top25_words helper defined above
print (nurse5)

[('nursing', 767), ('nurse', 282), ('human', 231), ('being', 212), ('nurses', 183), ('man', 167), ('patient', 164), ('world', 161), ('experience', 158), ('one', 154), ('situation', 132), ('humanistic', 132), ('through', 130), ('more', 125), ('patients', 125), ('each', 122), ('lived', 109), ('clinical', 108), ('time', 104), ('dialogue', 101)]


In [None]:
nursing
human
being
man
patient
world
experience
situation
humanistic
patients
lived
clinical
time
dialogue

In [107]:
##Medical Papers - WHO document (PDF)
from urllib.request import urlopen

# urlopen returns bytes, so the file must be written in binary mode ('wb')
with urlopen('http://apps.who.int/medicinedocs/documents/s17078e/s17078e.pdf') as src, open("document.pdf", 'wb') as diag_file:
    diag_file.write(src.read())

# A PDF is a binary format, so the readlines()/decode() steps used for the
# plain-text sources do not apply. The text would first have to be extracted
# with a PDF library, which is not done here, so the word list stays empty.
diag = []

diag2 = remove_punct(diag)
diag3 = [w for w in diag2 if w]
diag4 = remove_stops(diag3)
diag5 = find_top25_words(diag4)[:20]
print (diag5)

[]
