# NLP with NLTK

Natural Language Processing with the Natural Language Toolkit

[nltk](http://www.nltk.org/) is a Python package for NLP.

In [None]:
# pip install nltk
import nltk

Much of NLTK depends on additional data which you'll have to download. Use `nltk.download()` to get at least the following:

 * maxent_treebank_pos_tagger (in models)
 * punkt (in models)
 * maxent_ne_chunk (in models)
 * words (in corpora)

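You can install these and continue without restarting your kernel. Recent NLTK releases tag with `averaged_perceptron_tagger` rather than `maxent_treebank_pos_tagger`; if a later cell raises a `LookupError`, download the resource named in the error message.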

In [None]:
nltk.download()

### Sentence tokenization

In [None]:
from nltk.tokenize import sent_tokenize

text = """Hello. How are you, dear sir? Are you well?
          Here: drink this! It will make you feel better."""

sentences = sent_tokenize(text)
sentences

### Word tokenization

In [None]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentences[4])

In [None]:
from nltk.tokenize import word_tokenize
words = word_tokenize(sentences[4])
words

### When you say "word"...

In [None]:
word_tokenize(sentences[3])

In [None]:
word_tokenize("Who's going to that thing today?")

In [None]:
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize("Who's going to that thing today?")

Demo of different tokenizers: http://text-processing.com/demo/tokenize/
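
The tokenizers above differ mainly in where they split punctuation. As a rough sketch, `wordpunct_tokenize` behaves like the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which can be reproduced with the standard `re` module. The helper name `simple_wordpunct` is made up for this illustration.

```python
import re

def simple_wordpunct(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = simple_wordpunct("Who's going to that thing today?")
print(tokens)
```

Here the contraction ends up as three tokens (`Who`, `'`, `s`), whereas Treebank-style tokenizers keep `'s` together as a single token.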

### Part of speech tagging

In [None]:
from nltk.tag import pos_tag
words = word_tokenize("Who's going to that thing today?")
pos_tag(words)

WP: wh-pronoun ("who", "what")  
VBZ: verb, 3rd person sing. present ("takes")  
VBG: verb, gerund/present participle ("taking")  
TO: to ("to go", "to him")   
DT: determiner ("the", "this")  
NN: noun, singular or mass ("door")  
.: Punctuation (".", "?")  

All tags: http://www.monlp.com/2011/11/08/part-of-speech-tags/

Part of speech allows you to focus on different parts of language.

You may want to find keywords only among verbs, for example.

Or, when classifying, you may want to use nouns and adjectives as features, because they carry the most information about the subject (compared with function-word tags such as WP or TO). Part-of-speech tags let you do finer-grained text analysis.
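
A minimal sketch of that kind of filtering: given `(word, tag)` pairs in the shape `pos_tag` returns, keep only the words whose tag starts with a chosen prefix. The tagged list below is hand-written to mirror the tag legend above, not real tagger output, and `words_with_tags` is a hypothetical helper.

```python
# Hand-written (word, tag) pairs in the shape pos_tag returns;
# illustrative only, not actual tagger output.
tagged = [("Who", "WP"), ("'s", "VBZ"), ("going", "VBG"), ("to", "TO"),
          ("that", "DT"), ("thing", "NN"), ("today", "NN"), ("?", ".")]

def words_with_tags(tagged_words, prefixes):
    """Keep words whose tag starts with any of the given prefixes."""
    return [word for word, tag in tagged_words if tag.startswith(prefixes)]

verbs = words_with_tags(tagged, ("VB",))              # VB, VBZ, VBG, ...
content_words = words_with_tags(tagged, ("NN", "JJ"))  # nouns and adjectives
print(verbs)
print(content_words)
```

Matching on a prefix rather than the full tag catches the whole family at once, for example `VBZ` and `VBG` for verbs.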

### Chunking
Extracting phrases

In [None]:
from nltk.chunk import ne_chunk
words = word_tokenize("""I'm Irmak Sirer and I'm here to say
                         I love NLTK in a major way.""")
tags = pos_tag(words)
tree = ne_chunk(tags)
print(tree)

In [None]:
tree.draw()

### Included text corpora

Also install these!

 * movie_reviews: IMDb reviews labeled pos and neg  
 * treebank: tagged and parsed Wall Street Journal text  
 * brown: tagged & categorized English text (news, fiction, etc)  

(There are over 60 others.)

In [None]:
nltk.download()

In [None]:
from nltk.corpus import treebank_chunk
treebank_chunk.tagged_sents()[0]

In [None]:
treebank_chunk.chunked_sents()[0].draw()

# TextBlob

In [None]:
# pip install textblob
from textblob import TextBlob

GATSBY_TEXT = """In my younger and more vulnerable years my father
                 gave me some advice that I've been turning over
                 in my mind ever since. "Whenever you feel like
                 criticizing any one," he told me, "blah blah blah."""

gatsby = TextBlob(GATSBY_TEXT)

In [None]:
gatsby.tags

In [None]:
gatsby.noun_phrases

In [None]:
gatsby.sentiment

In [None]:
TextBlob("Oh my god I love this bootcamp, it's so awesome.").sentiment

In [None]:
TextBlob("Cupcakes are the worst.").sentiment

In [None]:
TextBlob("The color of this car is blue").sentiment

In [None]:
print(TextBlob("I hate cupcakes.").sentiment.polarity)

In [None]:
print(TextBlob("I hate cupcakes.").sentiment.subjectivity)

In [None]:
gatsby.sentences

In [None]:
gatsby.words

In [None]:
gatsby.sentences[0].words

#### Stemming

In [None]:
stemmer = nltk.stem.porter.PorterStemmer()
for word in TextBlob("I was going to go to many places").words:
    print(stemmer.stem(word))

To compare different NLTK stemmers in action:
http://text-processing.com/demo/stem/

In [None]:
for word, count in gatsby.word_counts.items():
    print("%15s %i" % (word, count))

In [None]:
def get_count(item):
    return item[1]

for word, count in sorted(gatsby.word_counts.items(), key=get_count, reverse=True):
    print("%15s %i" % (word, count))
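
The sort-by-count pattern above can also be written with the standard library's `collections.Counter`, whose `most_common` method returns `(item, count)` pairs in descending order of count. This is a small sketch on a made-up token list, not on the Gatsby text.

```python
from collections import Counter

# Stand-in token list; any list of lowercased words works the same way.
tokens = "the cat and the dog and the bird".split()

counts = Counter(tokens)
top_two = counts.most_common(2)
for word, count in top_two:
    print("%15s %i" % (word, count))
```

This avoids the need for a separate key function such as `get_count`.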

### Looking at tweets

In [None]:
import pymongo
from pprint import pprint
client = pymongo.MongoClient()
db = client.ds3

In [None]:
print(db.doge.count_documents({}))

In [None]:
tweet = db.doge.find_one({"favorite_count": {"$gte": 10}})
print(tweet['text'])

In [None]:
TextBlob(tweet['text']).words

##### Word counts in doge tweets

In [None]:
all_text = ""
for tweet in db.doge.find().limit(100):
    all_text += tweet["text"] + "\n"
    
for word, count in sorted(TextBlob(all_text).word_counts.items(), key=get_count, reverse=True)[:15]:
    print("%15s %i" % (word, count))

##### Without stopwords

In [None]:
nltk.download()

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

all_text = ""
for tweet in db.doge.find().limit(100):
    all_text += tweet["text"] + "\n"

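# Drop stopwords before taking the top 20, so they do not use up slots
counts = [(word, count) for word, count in TextBlob(all_text).word_counts.items()
          if word not in stop]
for word, count in sorted(counts, key=get_count, reverse=True)[:20]:
    print("%15s %i" % (word, count))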

### Movie Reviews

In [None]:
nltk.download()

In [None]:
import nltk
from textblob import TextBlob
from nltk.corpus import movie_reviews

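# fileids() lists every 'neg' file before any 'pos' file, so take 50 of each
# explicitly (pos first, then neg)
fileids = movie_reviews.fileids('pos')[:50] + movie_reviews.fileids('neg')[:50]
doc_words = [movie_reviews.words(fileid) for fileid in fileids]
documents = [' '.join(words) for words in doc_words]
print(documents[:3])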

##### Top bigrams in reviews

In [None]:
from nltk.util import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize

from collections import defaultdict
from operator import itemgetter

from nltk.corpus import stopwords
stop = stopwords.words('english')
stop += ['.', ',', '(', ')', "'", '"']

counter = defaultdict(int)

n = 2
for doc in documents:
    words = TextBlob(doc).words
    words = [w for w in words if w not in stop]
    bigrams = ngrams(words, n)
    for gram in bigrams:
        counter[gram] += 1
            
for gram, count in sorted(counter.items(), key = itemgetter(1), reverse=True)[:30]:
    phrase = " ".join(gram)
    print('%20s %i' % (phrase, count))
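
For intuition about what `ngrams(words, 2)` yields, here is a plain-Python sketch that pairs each token with its successor using `zip` and counts the pairs. The token list is invented and the `bigrams` helper is hypothetical, not NLTK's implementation.

```python
from collections import Counter

def bigrams(tokens):
    """Return each pair of adjacent tokens, like ngrams(tokens, 2)."""
    return zip(tokens, tokens[1:])

# Made-up tokens standing in for one cleaned review.
tokens = ["special", "effects", "are", "great",
          "special", "effects", "everywhere"]

counts = Counter(bigrams(tokens))
top_pair, top_count = counts.most_common(1)[0]
print(" ".join(top_pair), top_count)
```

Note that removing stopwords before building bigrams, as the cell above does, pairs up words that were not adjacent in the original text.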

### Using Sklearn algorithms with text data

In [None]:
# TF: how often a term occurs in this document
# IDF: inverse of how many documents in the corpus contain the term

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1,2))
doc_vectors = vectorizer.fit_transform(documents)

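# Read each label from the corpus instead of assuming the file order
classes = np.array([movie_reviews.categories(fileid)[0] for fileid in fileids])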


model = MultinomialNB().fit(doc_vectors, classes)
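
To make the TF and IDF comments concrete, here is a plain-Python sketch of the weighting on a three-document toy corpus. It assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and ignores `stop_words` and `ngram_range`. The corpus and the `tfidf` helper are invented for illustration; this is not the library's implementation.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized.
docs = [["great", "movie", "great", "cast"],
        ["terrible", "movie"],
        ["great", "fun"]]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf * idf weights for one document."""
    tf = Counter(doc)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(docs[0])
for term, weight in sorted(vec.items(), key=lambda item: item[1], reverse=True):
    print("%10s %.3f" % (term, weight))
```

In the first document, "great" scores highest because it occurs twice, while "cast" outranks "movie" because it appears in fewer documents.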

In [None]:
gatsby_vector = vectorizer.transform([GATSBY_TEXT])
model.predict(gatsby_vector)