# Basic NLP with NLTK and TextBlob (An abridged version)

Natural Language Processing with the Natural Language Toolkit

#### Install nltk if necessary

In [53]:
# pip install nltk
# nltk.download()

#### Import nltk and pandas

In [7]:
import nltk
import pandas as pd

### Sentence tokenization

Do the following:
* import `sent_tokenize`
* Try it on this blob of text:
```
Hello. How are you, dear Mr. Sir? Are you well?  Here: drink this! It will make you feel better.  I mean, it won't make you feel worse!
```
* Print out each of your sentences

In [8]:
from nltk.tokenize import sent_tokenize

text = """Hello. How are you, dear Mr. Sir? Are you well?
          Here: drink this! It will make you feel better.
          I mean, it won't make you feel worse!"""

sentences = sent_tokenize(text)
print(sentences)


['Hello.', 'How are you, dear Mr. Sir?', 'Are you well?', 'Here: drink this!', 'It will make you feel better.', "I mean, it won't make you feel worse!"]
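
To see why a trained sentence tokenizer helps, compare its output with a naive split on sentence-ending punctuation. The sketch below uses only the standard library `re` module. The pattern and the name `naive_sentences` are illustrative assumptions, not part of NLTK.

```python
import re

text = ("Hello. How are you, dear Mr. Sir? Are you well? "
        "Here: drink this! It will make you feel better. "
        "I mean, it won't make you feel worse!")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(sentence)
```

The naive rule breaks after the abbreviation "Mr." and yields seven pieces, whereas `sent_tokenize` keeps "How are you, dear Mr. Sir?" together and returns six.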


### Word tokenization

Do the following:
* import `TreebankWordTokenizer`
* Try it on the last sentence that you found with the sentence tokenization above
* Print your resulting words

In [9]:
# TreebankWordTokenizer assumes the input has already been segmented into sentences.

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentences[5])

['I', 'mean', ',', 'it', 'wo', "n't", 'make', 'you', 'feel', 'worse', '!']

Do the following:
* import `word_tokenize`
* Try it on the last sentence that you found with the sentence tokenization above
* Print your resulting words

In [10]:
from nltk.tokenize import word_tokenize
words = word_tokenize(sentences[5])
words

['I', 'mean', ',', 'it', 'wo', "n't", 'make', 'you', 'feel', 'worse', '!']

Do the following:
* import `wordpunct_tokenize`
* Try it on the last sentence that you found with the sentence tokenization above
* Print your resulting words
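
A minimal sketch for this exercise: NLTK documents `wordpunct_tokenize` as splitting text with the regular expression `\w+|[^\w\s]+`, so the same split can be reproduced with the standard library. The sentence is hard-coded here so the block runs on its own, and the equivalent NLTK call is shown in a comment.

```python
import re

sentence = "I mean, it won't make you feel worse!"

# Equivalent NLTK call (assuming nltk is installed):
#   from nltk.tokenize import wordpunct_tokenize
#   wordpunct_tokenize(sentence)
# The pattern matches runs of word characters, or runs of
# non-space punctuation characters.
tokens = re.findall(r"\w+|[^\w\s]+", sentence)
print(tokens)
```

Unlike the Treebank-style tokenizers, which split "won't" into `'wo'` and `"n't"`, this rule splits it into three tokens: `'won'`, `"'"`, and `'t'`.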

### Included text corpora

Also download these corpora (for example, with `nltk.download()`):

 * movie_reviews: IMDb reviews labeled positive or negative
 * treebank: tagged and parsed Wall Street Journal text
 * brown: tagged and categorized English text (news, fiction, etc.)

(There are over 60 others.)

# [TextBlob -- Sentiment Analysis, Text Classification and More](https://textblob.readthedocs.org/en/dev/)
TextBlob provides a simple API for common NLP tasks.

### Creating a TextBlob
* import TextBlob from textblob
* Create a TextBlob with the following text:
```
In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. "Whenever you feel like criticizing any one," he told me, "blah blah blah.
```
* Print out the `sentences` of this object
* Print out the `words` of this object
* Print out the `words` of the first sentence (Hint: `sentences` is a list)
* Use `word_counts` to print all of the word counts

In [18]:
# pip install textblob
from textblob import TextBlob

GATSBY_TEXT = """In my younger and more vulnerable years my father
                 gave me some advice that I've been turning over
                 in my mind ever since. "Whenever you feel like
                 criticizing any one," he told me, "blah blah blah."""

gatsby = TextBlob(GATSBY_TEXT)

In [21]:
gatsby.sentences

[Sentence("In my younger and more vulnerable years my father
                  gave me some advice that I've been turning over
                  in my mind ever since."), Sentence(""Whenever you feel like
                  criticizing any one," he told me, "blah blah blah.")]

In [22]:
gatsby.words

WordList(['In', 'my', 'younger', 'and', 'more', 'vulnerable', 'years', 'my', 'father', 'gave', 'me', 'some', 'advice', 'that', 'I', "'ve", 'been', 'turning', 'over', 'in', 'my', 'mind', 'ever', 'since', 'Whenever', 'you', 'feel', 'like', 'criticizing', 'any', 'one', 'he', 'told', 'me', 'blah', 'blah', 'blah'])

In [23]:
gatsby.sentences[0].words

WordList(['In', 'my', 'younger', 'and', 'more', 'vulnerable', 'years', 'my', 'father', 'gave', 'me', 'some', 'advice', 'that', 'I', "'ve", 'been', 'turning', 'over', 'in', 'my', 'mind', 'ever', 'since'])

In [24]:
for word, count in gatsby.word_counts.items():
    print("%15s %i" % (word, count))

            and 1
             ve 1
           feel 1
           over 1
           mind 1
          years 1
           blah 3
             in 2
            any 1
     vulnerable 1
          since 1
         father 1
           been 1
            you 1
           ever 1
           gave 1
        turning 1
           that 1
           some 1
         advice 1
            one 1
             he 1
             me 2
        younger 1
           like 1
              i 1
           told 1
       whenever 1
    criticizing 1
           more 1
             my 3


In [25]:
def get_count(item):
    return item[1]

for word, count in sorted(gatsby.word_counts.items(), key=get_count, reverse=True):
    print("%15s %i" % (word, count))

           blah 3
             my 3
             in 2
             me 2
            and 1
             ve 1
           feel 1
           over 1
           mind 1
          years 1
            any 1
     vulnerable 1
          since 1
         father 1
           been 1
            you 1
           ever 1
           gave 1
        turning 1
           that 1
           some 1
         advice 1
            one 1
             he 1
        younger 1
           like 1
              i 1
           told 1
       whenever 1
    criticizing 1
           more 1


### How do you really feel? TextBlob: Sentiment Analysis
* Create `TextBlob` objects for the following strings (each line is a separate one) and use `sentiment` to print their sentiment:
```
Oh my god I love this bootcamp, it's so awesome.
it's so awesome
Oh my god
I love this bootcamp
Oh my god I love this bootcamp
it's so awesome
I hate cupcakes
```

In [26]:
TextBlob("Oh my god I love this bootcamp, it's so awesome.").sentiment

Sentiment(polarity=0.75, subjectivity=0.8)

In [27]:
TextBlob("it's so awesome").sentiment

Sentiment(polarity=1.0, subjectivity=1.0)

In [28]:
TextBlob("Oh my god.").sentiment

Sentiment(polarity=0.0, subjectivity=0.0)

In [29]:
TextBlob("I love this bootcamp.").sentiment

Sentiment(polarity=0.5, subjectivity=0.6)

In [30]:
TextBlob("Oh my god I love this bootcamp.").sentiment

Sentiment(polarity=0.5, subjectivity=0.6)

In [31]:
TextBlob("it's so awesome.").sentiment

Sentiment(polarity=1.0, subjectivity=1.0)

In [32]:
print(TextBlob("I hate cupcakes.").sentiment)

Sentiment(polarity=-0.8, subjectivity=0.9)
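
One way to read these scores: the polarity (0.75) and subjectivity (0.8) of the combined sentence are the averages of the scores for "I love this bootcamp" (0.5, 0.6) and "it's so awesome" (1.0, 1.0), while "Oh my god" contributes nothing. The toy model below reproduces that arithmetic with a three-word lexicon whose values are copied from the outputs above. The lexicon and the function `toy_sentiment` are hypothetical. This is an illustration of lexicon averaging, not TextBlob's actual analyzer, which relies on a far larger lexicon and additional rules.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity),
# with values taken from the TextBlob outputs shown above.
TOY_LEXICON = {
    "love": (0.5, 0.6),
    "awesome": (1.0, 1.0),
    "hate": (-0.8, 0.9),
}

def toy_sentiment(text):
    """Average the scores of known words; return (0.0, 0.0) if none are known."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return (0.0, 0.0)
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return (polarity, subjectivity)

print(toy_sentiment("Oh my god I love this bootcamp, it's so awesome."))
print(toy_sentiment("Oh my god."))
print(toy_sentiment("I hate cupcakes."))
```

Words missing from the lexicon are simply ignored, which is why a sentence with no scored words comes out neutral.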


#### Stemming

Do the following:
* Create a `PorterStemmer`
* Create a `TextBlob` for this text:
```
Are you running in two marathons?
```
* For each word in this blob, print out its stemmed version using `stem`

In [33]:
stemmer = nltk.stem.porter.PorterStemmer()
for word in TextBlob("Are you running in two marathons?").words:
    print(stemmer.stem(word))

Are
you
run
in
two
marathon


To compare different NLTK stemmers in action, see:
http://text-processing.com/demo/stem/
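
To make the idea of stemming concrete, here is a deliberately naive suffix stripper written with the standard library only. The function `naive_stem` and its two rules are hypothetical and far cruder than the Porter algorithm. They only show the kind of rewriting a stemmer performs on the example sentence.

```python
def naive_stem(word):
    """Toy suffix stripper for illustration; NOT the Porter algorithm."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Collapse a doubled final consonant: "runn" -> "run".
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        # Drop a plural "s": "marathons" -> "marathon".
        word = word[:-1]
    return word

stems = [naive_stem(w) for w in ["Are", "you", "running", "in", "two", "marathons"]]
print(stems)
```

For this sentence the toy rules reduce "running" to "run" and "marathons" to "marathon". A real stemmer applies many more rules in sequence and handles far more suffixes.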

### Movie Reviews 
(without stopwords!)

Do the following (if necessary):
* import `nltk`, `TextBlob`, `movie_reviews`
* Get the first 100 `fileids` from `movie_reviews`
* Use `words` to get all of the words for every fileid and store this as a list of lists
* For each list of words, join it by whitespace to create a document
* Print your first doc

In [34]:
import nltk
from textblob import TextBlob
from nltk.corpus import movie_reviews

fileids = movie_reviews.fileids()[:100]
doc_words = [movie_reviews.words(fileid) for fileid in fileids]
documents = [' '.join(words) for words in doc_words]
print(documents[0:1])

[u'plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what \' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn \' t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it \' s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have n

##### Stopwords
* Import `stopwords` from `nltk.corpus`
* Create a variable to hold the English stopwords list
* Add '.', ',', '(', ')', "'", '"' to that list
* You can then remove stopwords from your documents using this list

In [35]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop += ['.', ',', '(', ')', "'", '"']
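
The removal step itself can be sketched as a list comprehension. To keep the block runnable without downloading any corpus, it uses a small hypothetical stand-in (`stop_sample`) for the `stop` list and a hard-coded list of tokens from the start of the first review. With the real data you would build the set from `stop` instead.

```python
# Hypothetical subset standing in for the full `stop` list built above.
# With NLTK data available, use: stop_set = set(stop)
stop_sample = {"to", "a", "and", "the", ".", ","}

words = ["two", "teen", "couples", "go", "to", "a", "church", "party", ",",
         "drink", "and", "then", "drive", "."]

# A set gives constant-time membership tests, which helps on large corpora.
filtered = [w for w in words if w.lower() not in stop_sample]
print(" ".join(filtered))
```

Applying the same comprehension to each entry of `doc_words` before joining would produce documents without stopwords.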