<h1 style="color:blue; text-align:center;"> Lecture 24 </h1>
<hr style="height:5px;border-width:0;color:blue;background-color:blue">

<center><h1> Natural Language Tool Kit - NLTK </h1></center>

### Introduction to Natural Language Processing: 
https://www.ibm.com/topics/natural-language-processing

### Jumping NLP Curves

<img src="nlp.png" width="500" height="300">

#### The five language domains:

- Phonology—Study of the speech sound (i.e., phoneme) system of a language, including the rules for combining and using phonemes.
- Morphology—Study of the rules that govern how morphemes, the minimal meaningful units of language, are used in a language.
- Syntax—The rules that pertain to the ways in which words can be combined to form sentences in a language.
- Semantics—The meaning of words and combinations of words in a language.
- Pragmatics—The rules associated with the use of language in conversation and broader social situations.


## Tokenizing

In [13]:
import nltk

In [2]:
# nltk.download()

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [4]:
# Tokenizing: word tokenizers and sentence tokenizers
# Lexicons and corpora
# Corpus (plural: corpora): a body of text, e.g. medical journals, presidential speeches,
# the English language
# Lexicon: words and their meanings
# Example: investor-speak vs. regular English-speak
# Investor-speak: 'bull' = someone who is positive about the market
# English-speak: 'bull' = scary animal you don't want running at you

In [5]:
example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome.The sky is pinkish-blue. You should not eat cardboard."

In [6]:
print(sent_tokenize(example_text))

['Hello Mr. Smith, how are you doing today?', 'The weather is great and Python is awesome.The sky is pinkish-blue.', 'You should not eat cardboard.']


In [7]:
print(word_tokenize(example_text))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome.The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', 'not', 'eat', 'cardboard', '.']


In [8]:
for i in word_tokenize(example_text):
    print(i)

Hello
Mr.
Smith
,
how
are
you
doing
today
?
The
weather
is
great
and
Python
is
awesome.The
sky
is
pinkish-blue
.
You
should
not
eat
cardboard
.
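
Both tokenizers above keep `awesome.The` as a single token because the example string has no space after that period. One possible workaround, sketched here with only the standard `re` module, is to insert the missing space before tokenizing. The helper name `add_missing_spaces` and its regex are illustrative assumptions, not part of NLTK.

```python
import re

def add_missing_spaces(text):
    """Insert a space after a period that sits directly between a lowercase and an uppercase letter."""
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)

example_text = ("Hello Mr. Smith, how are you doing today? The weather is great "
                "and Python is awesome.The sky is pinkish-blue. "
                "You should not eat cardboard.")

cleaned = add_missing_spaces(example_text)
print(cleaned)
```

The pattern is deliberately narrow, but it would still split strings such as `e.g.Example`, so it is worth checking on a sample of the real data before using it.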


## Stop Words

In [9]:
from nltk.corpus import stopwords

In [10]:
example_sentence = "This is an example showing off stop word filtration."
stop_words = set(stopwords.words("english"))
words = word_tokenize(example_sentence)

In [11]:
filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
        
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']


In [12]:
# Another method: a list comprehension
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
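
Note that `'This'` survives the filter: the NLTK stop word list is lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The small `stop_words` set is a stand-in so the snippet runs without NLTK data; in the notebook it would be `set(stopwords.words("english"))`.

```python
# Stand-in stop list (assumption); replace with set(stopwords.words("english")) in the notebook.
stop_words = {"this", "is", "an", "off"}

# Tokens as produced by word_tokenize(example_sentence) above.
words = ["This", "is", "an", "example", "showing", "off",
         "stop", "word", "filtration", "."]

# Compare the lowercased token, but keep the original casing in the result.
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)
```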


## Punctuation

In [19]:
import string

In [20]:
punctuation =  set(string.punctuation)
print("Punctuation list is: \n",punctuation)

print("Stop list is: \n", stop_words)

Punctuation list is: 
 {'>', '#', ']', ':', '~', '+', '|', '{', '&', '/', '-', '<', '`', '=', '"', '%', '@', '*', ',', '[', '(', ')', '$', '.', '!', '?', '^', '\\', '_', '}', ';', "'"}
Stop list is: 
 {'isn', 'very', 'be', 'some', 'mustn', 'my', 'no', 'hasn', 'shouldn', 'down', 's', 'than', 'ain', 'if', 'will', 'most', 'too', 'y', 'by', 'after', 'your', 'him', 'over', 'herself', 'where', 'myself', "shouldn't", 'me', 'being', "don't", 'weren', 'can', 'between', 'of', 'above', 'just', "mustn't", 'do', 'the', 'then', 'his', 'how', 'did', 'won', 'now', "you're", 'own', 'are', 'up', 'himself', 'yourself', 'what', 'you', 'been', 'on', 'both', 'was', 'couldn', 'ourselves', "that'll", 'until', 'whom', 'why', 'not', 'once', 'their', 'is', 'so', 'ma', 'and', 'through', 'under', "hadn't", "hasn't", 'which', "it's", 'all', 'themselves', 'to', 'had', 'has', 'didn', 'only', 'shan', "wouldn't", 'itself', 'for', 'were', 'he', "you'll", 'haven', 'yourselves', 'am', 'into', 'during', 'aren', 'below', "d

In [21]:
filtered_data = []

for w in word_tokenize(example_text):
    if (w not in stop_words) and (w not in punctuation):
        filtered_data.append(w)

In [22]:
filtered_data

['Hello',
 'Mr.',
 'Smith',
 'today',
 'The',
 'weather',
 'great',
 'Python',
 'awesome.The',
 'sky',
 'pinkish-blue',
 'You',
 'eat',
 'cardboard']

## Stemming

In [23]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

In [24]:
ps = PorterStemmer()
example_words=["python","pythoner","pythoning","pythoned","pythonly"]
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [25]:
new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly atleast once."

In [26]:
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))

it
is
veri
import
to
be
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
atleast
onc
.


## Part Of Speech Tagging

In [27]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [28]:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [29]:
sample_text

'PRESIDENT GEORGE W. BUSH\'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream. Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King. (Applause.)\n\nPresident George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan. 31, 2006. White House photo by Eric DraperEvery time I\'m invited to this rostrum, I\'m humbled by the privilege, and mindful of the history we\'ve seen together. We have gathered under this Capitol dome in moments of national mourning and national achievement. We have serv

In [30]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [31]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [32]:
tokenized

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.",
 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.',
 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.',
 '(Applause.)',
 'President George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan.',
 '31, 2006.',
 "White House photo by Eric DraperEvery time I'm invited to this rostrum, I'm humbled by the privilege, and mindful of the history we've seen together.",
 'We have gathered under this Capitol dome in moments of national mourning and national ach

In [33]:
type(tokenized)

list

In [34]:
len(tokenized)

346

In [35]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
            
    except Exception as e:
        print(str(e))

In [36]:
process_content()

[('31', 'CD'), (',', ','), ('2006', 'CD'), ('.', '.')]
[('White', 'NNP'), ('House', 'NNP'), ('photo', 'NN'), ('by', 'IN'), ('Eric', 'NNP'), ('DraperEvery', 'NNP'), ('time', 'NN'), ('I', 'PRP'), ("'m", 'VBP'), ('invited', 'JJ'), ('to', 'TO'), ('this', 'DT'), ('rostrum', 'NN'), (',', ','), ('I', 'PRP'), ("'m", 'VBP'), ('humbled', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('privilege', 'NN'), (',', ','), ('and', 'CC'), ('mindful', 'NN'), ('of', 'IN'), ('the', 'DT'), ('history', 'NN'), ('we', 'PRP'), ("'ve", 'VBP'), ('seen', 'VBN'), ('together', 'RB'), ('.', '.')]
[('We', 'PRP'), ('have', 'VBP'), ('gathered', 'VBN'), ('under', 'IN'), ('this', 'DT'), ('Capitol', 'NNP'), ('dome', 'NN'), ('in', 'IN'), ('moments', 'NNS'), ('of', 'IN'), ('national', 'JJ'), ('mourning', 'NN'), ('and', 'CC'), ('national', 'JJ'), ('achievement', 'NN'), ('.', '.')]
[('We', 'PRP'), ('have', 'VBP'), ('served', 'VBN'), ('America', 'NNP'), ('through', 'IN'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('con

[('For', 'IN'), ('people', 'NNS'), ('everywhere', 'RB'), (',', ','), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('is', 'VBZ'), ('a', 'DT'), ('partner', 'NN'), ('for', 'IN'), ('a', 'DT'), ('better', 'JJR'), ('life', 'NN'), ('.', '.')]
[('Short-changing', 'VBG'), ('these', 'DT'), ('efforts', 'NNS'), ('would', 'MD'), ('increase', 'VB'), ('the', 'DT'), ('suffering', 'NN'), ('and', 'CC'), ('chaos', 'NN'), ('of', 'IN'), ('our', 'PRP$'), ('world', 'NN'), (',', ','), ('undercut', 'JJ'), ('our', 'PRP$'), ('long-term', 'JJ'), ('security', 'NN'), (',', ','), ('and', 'CC'), ('dull', 'VB'), ('the', 'DT'), ('conscience', 'NN'), ('of', 'IN'), ('our', 'PRP$'), ('country', 'NN'), ('.', '.')]
[('I', 'PRP'), ('urge', 'VBP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('to', 'TO'), ('serve', 'VB'), ('the', 'DT'), ('interests', 'NNS'), ('of', 'IN'), ('America', 'NNP'), ('by', 'IN'), ('showing', 'VBG'), ('the', 'DT'), ('compassion', 'NN'), ('of', 'IN'), ('America', 'NNP'), ('.', '.')]

[('Tonight', 'NNP'), ('I', 'PRP'), ('ask', 'VBP'), ('you', 'PRP'), ('to', 'TO'), ('pass', 'VB'), ('legislation', 'NN'), ('to', 'TO'), ('prohibit', 'VB'), ('the', 'DT'), ('most', 'RBS'), ('egregious', 'JJ'), ('abuses', 'NNS'), ('of', 'IN'), ('medical', 'JJ'), ('research', 'NN'), (':', ':'), ('human', 'JJ'), ('cloning', 'VBG'), ('in', 'IN'), ('all', 'DT'), ('its', 'PRP$'), ('forms', 'NNS'), (',', ','), ('creating', 'VBG'), ('or', 'CC'), ('implanting', 'VBG'), ('embryos', 'NN'), ('for', 'IN'), ('experiments', 'NNS'), (',', ','), ('creating', 'VBG'), ('human-animal', 'JJ'), ('hybrids', 'NNS'), (',', ','), ('and', 'CC'), ('buying', 'NN'), (',', ','), ('selling', 'NN'), (',', ','), ('or', 'CC'), ('patenting', 'VBG'), ('human', 'JJ'), ('embryos', 'NN'), ('.', '.')]
[('Human', 'NNP'), ('life', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('gift', 'NN'), ('from', 'IN'), ('our', 'PRP$'), ('Creator', 'NNP'), ('--', ':'), ('and', 'CC'), ('that', 'IN'), ('gift', 'NN'), ('should', 'MD'), ('never', 'RB'), ('be
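
Each tagged sentence is a plain list of `(word, tag)` tuples, so it can be processed with ordinary Python. As a small sketch, the snippet below counts tags and pulls out the nouns from one sentence copied from the output above; it uses only the standard library.

```python
from collections import Counter

# Tagged tokens copied from one sentence of the pos_tag output above.
tagged = [('We', 'PRP'), ('have', 'VBP'), ('gathered', 'VBN'), ('under', 'IN'),
          ('this', 'DT'), ('Capitol', 'NNP'), ('dome', 'NN'), ('in', 'IN'),
          ('moments', 'NNS'), ('of', 'IN'), ('national', 'JJ'), ('mourning', 'NN'),
          ('and', 'CC'), ('national', 'JJ'), ('achievement', 'NN'), ('.', '.')]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# All noun tags (NN, NNS, NNP, NNPS) start with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts.most_common(3))
print(nouns)
```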

## Chunking

In [37]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    counter = 0
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
           
            chunked.draw()
            
            counter += 1
            if counter==3:
                break
            
    except Exception as e:
        print(str(e))


In [38]:
process_content()
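
The chunk grammar `{<RB.?>*<VB.?>*<NNP>+<NN>?}` reads as: zero or more adverbs, then zero or more verbs, then one or more proper nouns, then an optional common noun. To make the idea concrete without opening a drawing window, the sketch below approximates the same pattern with the standard `re` module applied to a string of tags. This is an illustration of the matching idea, not NLTK's actual implementation, and the variable names are made up for the example.

```python
import re

# Tagged tokens copied from the pos_tag output earlier in this lecture.
tagged = [('White', 'NNP'), ('House', 'NNP'), ('photo', 'NN'), ('by', 'IN'),
          ('Eric', 'NNP'), ('DraperEvery', 'NNP'), ('time', 'NN'), ('I', 'PRP')]

# Encode the tag sequence as one string: "<NNP><NNP><NN><IN>..."
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Plain-regex approximation of <RB.?>*<VB.?>*<NNP>+<NN>?
chunk_pattern = re.compile(r"(?:<RB[^<>]?>)*(?:<VB[^<>]?>)*(?:<NNP>)+(?:<NN>)?")

chunks = []
for match in chunk_pattern.finditer(tag_string):
    first = tag_string[:match.start()].count("<")  # index of the first token in the chunk
    size = match.group().count("<")                # number of tokens in the chunk
    chunks.append([word for word, _ in tagged[first:first + size]])

print(chunks)
```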

## Chinking

In [39]:
def process_content():
    counter = 0
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""
            
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            chunked.draw()
            
            counter += 1
            if counter==3:
                break
            
    except Exception as e:
        print(str(e))
        

In [40]:
process_content()
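
Chinking works the other way round: the first rule `{<.*>+}` puts every token into one chunk, and the second rule `}<VB.?|IN|DT|TO>+{` removes runs of verbs, prepositions, determiners, and "to" from it. The effect can be sketched in plain Python by splitting the tagged sentence wherever an excluded tag occurs. This is a rough stand-in for what the grammar does, not NLTK's own code.

```python
import re
from itertools import groupby

# Tagged tokens copied from the pos_tag output earlier in this lecture.
tagged = [('We', 'PRP'), ('have', 'VBP'), ('gathered', 'VBN'), ('under', 'IN'),
          ('this', 'DT'), ('Capitol', 'NNP'), ('dome', 'NN'), ('in', 'IN'),
          ('moments', 'NNS')]

# Tags to chink (exclude), mirroring <VB.?|IN|DT|TO> in the grammar.
chink_tag = re.compile(r"VB.?|IN|DT|TO")

def is_chink(pair):
    """True if the token's tag is one of the excluded tags."""
    return chink_tag.fullmatch(pair[1]) is not None

# Group consecutive tokens by excluded / not excluded and keep the latter.
chinked = [[word for word, _ in group]
           for excluded, group in groupby(tagged, key=is_chink)
           if not excluded]

print(chinked)
```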

## Named Entity Recognition
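* NER is used for recognizing named entities in a text, such as people, organizations, and locations.
* With `binary=True`, `nltk.ne_chunk` marks each entity only as `NE`, without assigning it a type.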

In [41]:
def process_content():
    counter = 0
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            
            namedEnt.draw()
            
            counter += 1
            if counter==3:
                break
                
    except Exception as e:
        print(str(e))
            

In [42]:
process_content()

## Lemmatizing
* The NLTK lemmatization example requires `nltk.corpus.wordnet`. The lemmatizer also needs the word's part of speech; if no `pos` argument is given, it treats the word as a noun.
* Without knowing whether the word is used as a verb, noun, or adjective, lemmatization with NLTK will not be effective.
* Lemmatizing is similar to stemming, except that the end result is a real word.
* The result may be the same word or a quite different word with the same meaning; for example, "better" lemmatizes to "good" when it is treated as an adjective.

In [45]:
# NER Type Examples
# -----------------
# ORGANIZATION    Georgia-Pacific Corp., WHO
# PERSON        Eddy Bonte, President Obama
# LOCATION      Murray River, Mount Everest
# DATE          June, 2008-06-29
# TIME          Two fifty am, 1:30 p.m.
# MONEY         175 million Canadian Dollars, GBP 10.40
# PERCENT       twenty pct, 18.75 %
# FACILITY      Washington Monument, Stonehenge
# GPE           South East Asia, Midlothian

In [46]:
from nltk.stem import WordNetLemmatizer

In [47]:
lemmatizer = WordNetLemmatizer()

In [48]:
print(lemmatizer.lemmatize("thoroughly", pos="r"))

thoroughly


In [49]:
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run", "v"))

cat
cactus
goose
rock
python
good
best
run
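
Because the lemmatizer expects one of the single-letter codes used above (`"n"`, `"v"`, `"a"`, `"r"`), tags coming from `nltk.pos_tag` have to be converted first. A minimal sketch of such a mapping is shown below; the helper name `penn_to_wordnet` is hypothetical, and falling back to noun for unknown tags is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# A few (word, tag) pairs in the format produced by nltk.pos_tag.
tagged = [('national', 'JJ'), ('gathered', 'VBN'), ('together', 'RB'), ('moments', 'NNS')]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

In the notebook, each pair could then be passed on as `lemmatizer.lemmatize(word, pos=pos)`.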


## Corpora

In [50]:
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize

In [52]:
print(gutenberg.readme())

Project Gutenberg Selections
http://gutenberg.net/

This corpus contains etexts from from Project Gutenberg,
by the following authors:

* Jane Austen (3)
* William Blake (2)
* Thornton W. Burgess
* Sarah Cone Bryant
* Lewis Carroll
* G. K. Chesterton (3)
* Maria Edgeworth
* King James Bible
* Herman Melville
* John Milton
* William Shakespeare (3)
* Walt Whitman

The beginning of the body of each book could not be identified automatically,
so the semi-generic header of each file has been removed, and included below.
Some source files ended with a line "End of The Project Gutenberg Etext...",
and this has been deleted.

Information about Project Gutenberg (one page)

We produce about two million dollars for each hour we work.  The
fifty hours is one conservative estimate for how long it we take
to get any etext selected, entered, proofread, edited, copyright
searched and analyzed, the copyright letters written, etc.  This
projected audience is one hundred million readers.  If our value


In [53]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [54]:
sample = gutenberg.raw("bible-kjv.txt")
tok = sent_tokenize(sample)
print(tok[5:15])

['1:5 And God called the light Day, and the darkness he called Night.', 'And the evening and the morning were the first day.', '1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.', '1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.', '1:8 And God called the firmament Heaven.', 'And the evening and the\nmorning were the second day.', '1:9 And God said, Let the waters under the heaven be gathered together\nunto one place, and let the dry land appear: and it was so.', '1:10 And God called the dry land Earth; and the gathering together of\nthe waters called he Seas: and God saw that it was good.', '1:11 And God said, Let the earth bring forth grass, the herb yielding\nseed, and the fruit tree yielding fruit after his kind, whose seed is\nin itself, upon the earth: and it was so.', '1:12 And the earth brought forth grass, and

In [55]:
print(gutenberg.raw('shakespeare-hamlet.txt'))

[The Tragedie of Hamlet by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Barnardo and Francisco two Centinels.

  Barnardo. Who's there?
  Fran. Nay answer me: Stand & vnfold
your selfe

   Bar. Long liue the King

   Fran. Barnardo?
  Bar. He

   Fran. You come most carefully vpon your houre

   Bar. 'Tis now strook twelue, get thee to bed Francisco

   Fran. For this releefe much thankes: 'Tis bitter cold,
And I am sicke at heart

   Barn. Haue you had quiet Guard?
  Fran. Not a Mouse stirring

   Barn. Well, goodnight. If you do meet Horatio and
Marcellus, the Riuals of my Watch, bid them make hast.
Enter Horatio and Marcellus.

  Fran. I thinke I heare them. Stand: who's there?
  Hor. Friends to this ground

   Mar. And Leige-men to the Dane

   Fran. Giue you good night

   Mar. O farwel honest Soldier, who hath relieu'd you?
  Fra. Barnardo ha's my place: giue you goodnight.

Exit Fran.

  Mar. Holla Barnardo

   Bar. Say, what is Horatio there?
  Hor. A peece of

## WordNet
* WordNet is not really a corpus in the usual sense; it is a large lexical database of English and one of the most capable resources available through NLTK.
* With WordNet you can look up a word's synonyms, antonyms, and definitions, as well as examples of the word used in context.

In [56]:
from nltk.corpus import wordnet

In [57]:
syns = wordnet.synsets("program")
print(syns)

print(syns[0].name())

[Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]
plan.n.01


In [58]:
#just the word
print(syns[0].lemmas()[0].name())

plan


In [59]:
# definition
print(syns[0].definition())

a series of steps to be carried out or goals to be accomplished


In [60]:
#examples
print(syns[0].examples())

['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [61]:
synonyms = []
antonyms = []
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        print("l:",l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
            
# print(set(synonyms))
# print(set(antonyms))

l: Lemma('good.n.01.good')
l: Lemma('good.n.02.good')
l: Lemma('good.n.02.goodness')
l: Lemma('good.n.03.good')
l: Lemma('good.n.03.goodness')
l: Lemma('commodity.n.01.commodity')
l: Lemma('commodity.n.01.trade_good')
l: Lemma('commodity.n.01.good')
l: Lemma('good.a.01.good')
l: Lemma('full.s.06.full')
l: Lemma('full.s.06.good')
l: Lemma('good.a.03.good')
l: Lemma('estimable.s.02.estimable')
l: Lemma('estimable.s.02.good')
l: Lemma('estimable.s.02.honorable')
l: Lemma('estimable.s.02.respectable')
l: Lemma('beneficial.s.01.beneficial')
l: Lemma('beneficial.s.01.good')
l: Lemma('good.s.06.good')
l: Lemma('good.s.07.good')
l: Lemma('good.s.07.just')
l: Lemma('good.s.07.upright')
l: Lemma('adept.s.01.adept')
l: Lemma('adept.s.01.expert')
l: Lemma('adept.s.01.good')
l: Lemma('adept.s.01.practiced')
l: Lemma('adept.s.01.proficient')
l: Lemma('adept.s.01.skillful')
l: Lemma('adept.s.01.skilful')
l: Lemma('good.s.09.good')
l: Lemma('dear.s.02.dear')
l: Lemma('dear.s.02.good')
l: Lemma('dear.s

In [62]:
print(set(synonyms))
print(set(antonyms))

{'good', 'soundly', 'skilful', 'proficient', 'near', 'in_force', 'unspoilt', 'estimable', 'commodity', 'thoroughly', 'practiced', 'dependable', 'honest', 'expert', 'serious', 'full', 'trade_good', 'respectable', 'dear', 'honorable', 'upright', 'goodness', 'ripe', 'skillful', 'in_effect', 'well', 'effective', 'secure', 'sound', 'salutary', 'beneficial', 'just', 'unspoiled', 'safe', 'right', 'undecomposed', 'adept'}
{'evilness', 'badness', 'evil', 'ill', 'bad'}


In [63]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")
print(w1.wup_similarity(w2))

0.9090909090909091


In [64]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("car.n.01")
print(w1.wup_similarity(w2))

0.6956521739130435


In [65]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("cactus.n.01")
print(w1.wup_similarity(w2))

0.38095238095238093
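
`wup_similarity` is the Wu-Palmer measure: twice the depth of the deepest ancestor shared by the two senses, divided by the sum of the depths of the two senses. The sketch below applies that formula to a tiny hand-made taxonomy. The taxonomy is hypothetical and is not WordNet's real hierarchy (which also allows several parents per node), so the numbers differ from the outputs above; only the ordering is expected to look similar.

```python
# Hypothetical toy taxonomy (child -> parent); NOT WordNet's real hierarchy.
parent = {
    "vehicle": "entity", "plant": "entity",
    "vessel": "vehicle", "car": "vehicle",
    "ship": "vessel", "boat": "vessel",
    "cactus": "plant",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity on the toy tree."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a
    # is the deepest common ancestor.
    common = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(common) / (depth(a) + depth(b))

for other in ("boat", "car", "cactus"):
    print(other, round(wup("ship", other), 3))
```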


<h1 style="color:blue; text-align:center;"> Lecture 25 </h1>
<hr style="height:5px;border-width:0;color:blue;background-color:blue">

## Text Classification
* Text classifiers have many uses, for example deciding whether a piece of text is about stocks or about politics.
* Another kind of text classifier decides whether an email is spam or legitimate. The classifier built here labels a review as having either a positive or a negative connotation, that is, sentiment analysis as a form of opinion mining.

In [1]:
import random
import pickle

import nltk  # needed below for nltk.FreqDist and nltk.NaiveBayesClassifier
from nltk.corpus import movie_reviews

In [2]:
movie_reviews.fileids()[:10]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt']

In [3]:
f = movie_reviews.open('neg/cv000_29416.txt')
for line in f: 
    print(line)

plot : two teen couples go to a church party , drink and then drive . 

they get into an accident . 

one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 

what's the deal ? 

watch the movie and " sorta " find out . . . 

critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 

which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 

they seem to have taken this pretty neat concept , but executed it terribly . 

so what are the problems with the movie ? 

well , its main problem is that it's simply too jumbled . 

it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , 

In [4]:
movie_reviews.categories()

['neg', 'pos']

In [5]:
movie_reviews.fileids('neg')

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt',
 'neg/cv010_29063.txt',
 'neg/cv011_13044.txt',
 'neg/cv012_29411.txt',
 'neg/cv013_10494.txt',
 'neg/cv014_15600.txt',
 'neg/cv015_29356.txt',
 'neg/cv016_4348.txt',
 'neg/cv017_23487.txt',
 'neg/cv018_21672.txt',
 'neg/cv019_16117.txt',
 'neg/cv020_9234.txt',
 'neg/cv021_17313.txt',
 'neg/cv022_14227.txt',
 'neg/cv023_13847.txt',
 'neg/cv024_7033.txt',
 'neg/cv025_29825.txt',
 'neg/cv026_29229.txt',
 'neg/cv027_26270.txt',
 'neg/cv028_26964.txt',
 'neg/cv029_19943.txt',
 'neg/cv030_22893.txt',
 'neg/cv031_19540.txt',
 'neg/cv032_23718.txt',
 'neg/cv033_25680.txt',
 'neg/cv034_29446.txt',
 'neg/cv035_3343.txt',
 'neg/cv036_18385.txt',
 'neg/cv037_19798.txt',
 'neg/cv038_9781.txt',
 'neg/cv039_5963.txt',
 'neg/cv040_8829.txt',
 'neg/cv041_22364.txt',


In [6]:
movie_reviews.fileids('pos')[:10]

['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt',
 'pos/cv005_29443.txt',
 'pos/cv006_15448.txt',
 'pos/cv007_4968.txt',
 'pos/cv008_29435.txt',
 'pos/cv009_29592.txt']

In [7]:
words_in_this_file = movie_reviews.words('pos/cv992_11962.txt')

print(words_in_this_file[:100])

print("\nTotal words in this file: ", len(words_in_this_file))

['here', 'is', 'a', 'film', 'that', 'is', 'so', 'unexpected', ',', 'so', 'scary', ',', 'and', 'so', 'original', 'that', 'it', 'caught', 'me', 'off', 'guard', 'and', 'threw', 'me', 'for', 'a', 'loop', '.', 'okay', ',', 'it', 'isn', "'", 't', 'quite', 'original', ',', 'considering', 'it', 'is', 'a', 'sequel', 'to', 'the', 'box', 'office', 'hit', 'species', ',', 'but', 'it', 'certainly', 'is', 'smart', '.', 'most', 'films', 'of', 'this', 'genre', 'are', 'reminiscent', 'of', 'those', 'cheesy', 'b', '-', 'horror', 'films', 'from', 'the', '50s', 'and', '60s', ',', 'and', 'some', 'even', 'become', 'them', '.', 'however', ',', 'as', 'we', 'learned', 'with', 'the', '1995', 'small', '-', 'budget', 'horror', '/', 'sci', '-', 'fi', 'film', ',', 'sometimes']

Total words in this file:  1696


In [None]:
# documents = [(list(movie_reviews.words(fileid)), category)
#              for category in movie_reviews.categories()
#              for fileid in movie_reviews.fileids(category)]

In [8]:
documents = []
for category in movie_reviews.categories():
    print(category)
    for fileid in movie_reviews.fileids(category):
        print(fileid)
        documents.append([movie_reviews.words(fileid), category])
        print(documents[0])
        break
    break
    

neg
neg/cv000_29416.txt
[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg']


In [9]:
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append([movie_reviews.words(fileid), category])


In [10]:
random.shuffle(documents)

In [11]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())


In [14]:
all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]


In [15]:
print(all_words["stupid"])

253


## Words as Features for Learning

In [16]:
word_features = list(all_words.keys())[:3000]
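
One thing to be aware of: `list(all_words.keys())[:3000]` takes the first 3000 distinct words in the order they were first seen, not the 3000 most frequent ones. The feature names printed further down (`'plot'`, `':'`, `'two'`, `'teen'`, ...) are consistent with that, since they follow the opening words of the first review. If the intent is to use the most frequent words, `most_common` is one option. The sketch below shows the difference on a toy `Counter`; `nltk.FreqDist` is a `Counter` subclass, so the same call should work on `all_words`.

```python
from collections import Counter

# Toy token stream standing in for movie_reviews.words().
tokens = ["rare", "plot", "the", "movie", "the", "plot", "the", "movie", "plot"]
counts = Counter(tokens)

# First-seen order: what list(counts.keys())[:n] returns.
first_seen = list(counts.keys())[:2]

# Frequency order: the n most frequent words.
most_frequent = [word for word, _ in counts.most_common(2)]

print(first_seen)
print(most_frequent)
```

Switching the notebook to `most_common(3000)` would change the feature set, so the accuracy and the most informative features shown below would change as well.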

In [17]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
        
    return features


In [18]:
print(find_features(movie_reviews.words("neg/cv000_29416.txt")))



In [19]:
featuresets = [(find_features(rev), category) for (rev, category) 
               in documents]

In [24]:
featuresets[0]

({'plot': True,
  ':': True,
  'two': True,
  'teen': False,
  'couples': False,
  'go': False,
  'to': True,
  'a': True,
  'church': False,
  'party': False,
  ',': True,
  'drink': False,
  'and': True,
  'then': False,
  'drive': True,
  '.': True,
  'they': True,
  'get': False,
  'into': True,
  'an': True,
  'accident': False,
  'one': True,
  'of': True,
  'the': True,
  'guys': False,
  'dies': False,
  'but': True,
  'his': True,
  'girlfriend': False,
  'continues': False,
  'see': False,
  'him': True,
  'in': True,
  'her': False,
  'life': True,
  'has': True,
  'nightmares': False,
  'what': False,
  "'": True,
  's': True,
  'deal': False,
  '?': False,
  'watch': True,
  'movie': True,
  '"': False,
  'sorta': False,
  'find': False,
  'out': False,
  'critique': False,
  'mind': False,
  '-': True,
  'fuck': False,
  'for': True,
  'generation': False,
  'that': True,
  'touches': False,
  'on': True,
  'very': False,
  'cool': True,
  'idea': False,
  'presents': Fal

### Naive Bayes

In [25]:
training_set = featuresets[:1900]
testing_set = featuresets[1900:]
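
Since `documents` was shuffled with an unseeded `random.shuffle`, the 1900/100 split, and therefore the reported accuracy, will differ from run to run. A possible way to make the split repeatable is to shuffle with a fixed seed, as sketched below. The function name `split_dataset`, the seed value, and the toy data are assumptions made for illustration.

```python
import random

def split_dataset(items, n_train, seed=42):
    """Shuffle a copy of items with a fixed seed, then split into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private generator; global random state is untouched
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in for the 2000 (features, label) pairs in featuresets.
toy_featuresets = [({"id": i}, "neg" if i < 1000 else "pos") for i in range(2000)]

train, test = split_dataset(toy_featuresets, 1900)
print(len(train), len(test))
```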

In [26]:
training_set[0]

({'plot': True,
  ':': True,
  'two': True,
  'teen': False,
  'couples': False,
  'go': False,
  'to': True,
  'a': True,
  'church': False,
  'party': False,
  ',': True,
  'drink': False,
  'and': True,
  'then': False,
  'drive': True,
  '.': True,
  'they': True,
  'get': False,
  'into': True,
  'an': True,
  'accident': False,
  'one': True,
  'of': True,
  'the': True,
  'guys': False,
  'dies': False,
  'but': True,
  'his': True,
  'girlfriend': False,
  'continues': False,
  'see': False,
  'him': True,
  'in': True,
  'her': False,
  'life': True,
  'has': True,
  'nightmares': False,
  'what': False,
  "'": True,
  's': True,
  'deal': False,
  '?': False,
  'watch': True,
  'movie': True,
  '"': False,
  'sorta': False,
  'find': False,
  'out': False,
  'critique': False,
  'mind': False,
  '-': True,
  'fuck': False,
  'for': True,
  'generation': False,
  'that': True,
  'touches': False,
  'on': True,
  'very': False,
  'cool': True,
  'idea': False,
  'presents': Fal

In [28]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [29]:
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

Original Naive Bayes Algo accuracy percent: 75.0
Most Informative Features
                   sucks = True              neg : pos    =     16.3 : 1.0
                  annual = True              pos : neg    =      9.7 : 1.0
                  justin = True              neg : pos    =      9.0 : 1.0
                  stinks = True              neg : pos    =      9.0 : 1.0
                  turkey = True              neg : pos    =      8.4 : 1.0
           unimaginative = True              neg : pos    =      8.3 : 1.0
             silverstone = True              neg : pos    =      7.6 : 1.0
                everyday = True              pos : neg    =      7.0 : 1.0
                 frances = True              pos : neg    =      7.0 : 1.0
                  regard = True              pos : neg    =      7.0 : 1.0
                 idiotic = True              neg : pos    =      7.0 : 1.0
                 martian = True              neg : pos    =      7.0 : 1.0
                    mena 
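
A row such as `sucks = True   neg : pos = 16.3 : 1.0` says that, in the training set, a review containing the word is estimated to be about 16 times more likely under the `neg` label than under `pos`. The sketch below shows how such a ratio can be computed from document counts. The counts are invented, and the add-0.5 smoothing is an assumption about the estimator rather than something stated in this notebook.

```python
def smoothed_rate(count_true, n_docs):
    """Estimate P(feature=True | label) with add-0.5 smoothing over two outcomes (True/False)."""
    return (count_true + 0.5) / (n_docs + 0.5 * 2)

# Hypothetical counts: training reviews per label, and how many of them contain the word.
docs_per_label = {"neg": 950, "pos": 950}
contains_word = {"neg": 40, "pos": 2}

p_neg = smoothed_rate(contains_word["neg"], docs_per_label["neg"])
p_pos = smoothed_rate(contains_word["pos"], docs_per_label["pos"])

ratio = p_neg / p_pos
print(f"neg : pos = {ratio:.1f} : 1.0")
```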

### Save Classifier with Pickle

In [30]:
import pickle 

In [31]:
classifier = nltk.NaiveBayesClassifier.train(training_set)
# The with-statement closes the file even if pickling raises an error
with open("naivebayes.pickle", "wb") as save_classifier:
    pickle.dump(classifier, save_classifier)
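
Saving is only half of the workflow: the pickled classifier can later be loaded back with `pickle.load`, which avoids retraining. The round trip is sketched below with a plain dictionary standing in for the trained classifier and a temporary directory standing in for the working directory, so the snippet runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in object (assumption); in the notebook this would be the trained classifier.
model = {"labels": ["neg", "pos"], "note": "placeholder for a classifier"}

# Write to a temporary location so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "naivebayes.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the object back; "rb" because pickle files are binary.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.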

### Bag of Words:

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

- A vocabulary of known words.
- A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

The intuition is that documents are similar if they have similar content. Further, from the content alone we can learn something about the meaning of the document.

The bag-of-words can be as simple or complex as you like. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.

##### Steps:
- Data Gathering and Loading
- Vocabulary building
- Document vector creation (integer, binary, and one-hot encoding schemes)

Data:

- It was the best of times,
- it was the worst of times,
- it was the age of wisdom,
- it was the age of foolishness,

Vocabulary:

The unique words here (ignoring case and punctuation) are:

    “it”
    “was”
    “the”
    “best”
    “of”
    “times”
    “worst”
    “age”
    “wisdom”
    “foolishness”

That is a vocabulary of 10 words from a corpus containing 24 words.

Document Vectors:
    
The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent and 1 for present, i.e., a binary encoding scheme.

- "It was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
- "it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
- "it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
- "it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
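
The vocabulary and the binary vectors listed here can be reproduced with a few lines of plain Python. This is a minimal sketch: the `tokenize` helper is a simplification written for this example (lowercase, strip punctuation, split on whitespace), not an NLTK function.

```python
import string

docs = ["It was the best of times,",
        "it was the worst of times,",
        "it was the age of wisdom,",
        "it was the age of foolishness,"]

def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

# Vocabulary: unique words in the order they are first seen.
vocabulary = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# Binary encoding: 1 if the vocabulary word occurs in the document, else 0.
binary_vectors = [[1 if word in tokenize(doc) else 0 for word in vocabulary]
                  for doc in docs]

print(vocabulary)
for vector in binary_vectors:
    print(vector)
```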

##### Limitations of Bag-of-Words

The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on your specific text data.

It has been used with great success on prediction problems like language modeling and document classification.

Nevertheless, it suffers from some shortcomings, such as:

- Vocabulary: The vocabulary requires careful design, most specifically in order to manage its size, which impacts the sparsity of the document representations.

- Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.

- Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). Context and meaning can offer a lot to the model; if modeled, they could tell the difference between the same words differently arranged (“this is interesting” vs “is this interesting”), synonyms (“old bike” vs “used bike”), and much more.
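
The word-order limitation is easy to demonstrate: the two example phrases produce identical bags of words. Counting pairs of adjacent words (bigrams) is one common way to retain some order information; it is shown here as an illustration and is not part of the notebook above.

```python
from collections import Counter

def bag(tokens):
    """Bag of words: word counts with order discarded."""
    return Counter(tokens)

def bigrams(tokens):
    """Counts of adjacent word pairs, which keep local word order."""
    return Counter(zip(tokens, tokens[1:]))

a = "this is interesting".split()
b = "is this interesting".split()

print(bag(a) == bag(b))          # same words, same counts
print(bigrams(a) == bigrams(b))  # different adjacent pairs
```

The trade-off is a larger and sparser vocabulary, which ties back to the first two limitations.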
