# Section 5
# NLP (Natural Language Processing)

## Section 5.1: Key Concepts, text data cleaning
## Section 5.2: Count Vectorizer, TFIDF 
## Section 5.3: Example with Spam data 
## Section 5.4: Tweak model with Spam data
## Section 5.5: Pipeline with Spam data 

## Section 5.1: Key Concepts, text data cleaning

In [127]:
import numpy as np
from collections import Counter
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.stem import SnowballStemmer
import string
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC 
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline  import Pipeline, FeatureUnion, make_pipeline

In [66]:
stops = set(nltk.corpus.stopwords.words('english'))

In [68]:
#stops

In [3]:
corpus = ["Jeff stole my octopus sandwich.", 
    "'Help!' I sobbed, sandwichlessly.", 
    "'Drop the sandwiches!' said the sandwich police."]

# How do I turn a corpus of documents into a feature matrix? 
# Words --> numbers?????

## Corpus: list of documents

```
 [
 "Jeff stole my octopus sandwich.", 
 "'Help!' I sobbed, sandwichlessly.", 
 "'Drop the sandwiches!' said the sandwich police."
 ]
```

In [60]:
def our_tokenizer(doc, stops=None, stemmer=None):
    doc = word_tokenize(doc.lower())
    tokens = [''.join([char for char in tok if char not in string.punctuation]) for tok in doc]
    tokens = [tok for tok in tokens if tok]
    if stops:
        tokens = [tok for tok in tokens if (tok not in stops)]
    if stemmer:
        tokens = [stemmer.stem(tok) for tok in tokens]
    return tokens

In [5]:
tokenized_docs = [our_tokenizer(doc) for doc in corpus]
tokenized_docs

[['jeff', 'stole', 'my', 'octopus', 'sandwich'],
 ['help', 'i', 'sobbed', 'sandwichlessly'],
 ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police']]

## Step 1: lowercase, remove punctuation, split into tokens
```
[
 ['jeff', 'stole', 'my', 'octopus', 'sandwich'],
 ['help', 'i', 'sobbed', 'sandwichlessly'],
 ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police']
]
```

In [6]:
stopwords = set(nltk.corpus.stopwords.words('english')) 

In [57]:
'i' in stopwords 

True

In [7]:
tokenized_docs = [our_tokenizer(doc, stops=stopwords) for doc in corpus]
tokenized_docs

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['help', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwiches', 'said', 'sandwich', 'police']]

## Step 2: remove stop words
```
[
 ['jeff', 'stole', 'octopus', 'sandwich'],
 ['help', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwiches', 'said', 'sandwich', 'police']
]
```
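
Steps 1 and 2 can also be sketched without NLTK, which helps when the `punkt` and `stopwords` data are not installed. The regular expression and the tiny `TOY_STOPS` set below are assumptions made for illustration; they are not the NLTK tokenizer or stop list used above, and the regex splits contractions such as "don't" differently from `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer.
TOY_STOPS = {"i", "my", "the"}

def simple_tokenize(doc, stops=None):
    """Lowercase, keep runs of letters/digits, optionally drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    if stops:
        tokens = [tok for tok in tokens if tok not in stops]
    return tokens

corpus = [
    "Jeff stole my octopus sandwich.",
    "'Help!' I sobbed, sandwichlessly.",
    "'Drop the sandwiches!' said the sandwich police.",
]

step1 = [simple_tokenize(doc) for doc in corpus]                   # tokens only
step2 = [simple_tokenize(doc, stops=TOY_STOPS) for doc in corpus]  # stop words removed
```

On this corpus, `step1` and `step2` reproduce the two token lists shown for Step 1 and Step 2.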


In [64]:
tokenized_docs = [our_tokenizer(doc, stops=stopwords, stemmer=SnowballStemmer('english')) for doc in corpus]
tokenized_docs

[[u'jeff', u'stole', u'octopus', u'sandwich'],
 [u'help', u'sob', u'sandwichless'],
 [u'drop', u'sandwich', u'said', u'sandwich', u'polic']]

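## Step 3: Stemming/Lemmatization

The Snowball stemmer in the cell above cuts words down aggressively (`sobbed` becomes `sob`, `police` becomes `polic`). The rest of this section continues with the gentler, lemmatization-style result, in which only `sandwiches` changes to `sandwich`:
```
[
 ['jeff', 'stole', 'octopus', 'sandwich'],
 ['help', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwich', 'said', 'sandwich', 'police']
]
```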
### OK now what?

Vocabulary:
```
['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']

```

In [65]:
vocab_set = set()

In [10]:
for doc in tokenized_docs:
    vocab_set.update(doc)

In [11]:
vocab = sorted(list(vocab_set))
print(vocab)

['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']


## Section 5.2: Count Vectorizer, TFIDF 

# Count vectorization

Vocabulary:
```
['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']

```

```
['jeff', 'stole', 'octopus', 'sandwich']
[0, 0, 1, 1, 0, 0, 1, 0, 0, 1]

['help', 'sobbed', 'sandwichlessly']
[0, 1, 0, 0, 0, 0, 0, 1, 1, 0]

['drop', 'sandwich', 'said', 'sandwich', 'police']
[1, 0, 0, 0, 1, 1, 2, 0, 0, 0]
```
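
The count vectors above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not how `CountVectorizer` is implemented; `count_vector` is a hypothetical helper name.

```python
from collections import Counter

vocab = ['drop', 'help', 'jeff', 'octopus', 'police', 'said',
         'sandwich', 'sandwichlessly', 'sobbed', 'stole']

docs = [
    ['jeff', 'stole', 'octopus', 'sandwich'],
    ['help', 'sobbed', 'sandwichlessly'],
    ['drop', 'sandwich', 'said', 'sandwich', 'police'],
]

def count_vector(tokens, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocab]

count_matrix = [count_vector(doc, vocab) for doc in docs]
```

Each row of `count_matrix` lines up with the vocabulary, so column 6 (`'sandwich'`) holds 1, 0 and 2 for the three documents.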

## Term frequency
$$TF_{word,document} = \frac{\#\_of\_times\_word\_appears\_in\_document}{total\_\#\_of\_words\_in\_document}$$

```
['jeff', 'stole', 'octopus', 'sandwich']
[0, 0, 1/4, 1/4, 0, 0, 1/4, 0, 0, 1/4]

['help', 'sobbed', 'sandwichlessly']
[0, 1/3, 0, 0, 0, 0, 0, 1/3, 1/3, 0]

['drop', 'sandwich', 'said', 'sandwich', 'police']
[1/5, 0, 0, 0, 1/5, 1/5, 2/5, 0, 0, 0]
```
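
As a check on the fractions above, here is a small sketch that computes term frequency exactly with `fractions.Fraction`. The helper name `term_frequencies` is made up for this example.

```python
from collections import Counter
from fractions import Fraction

vocab = ['drop', 'help', 'jeff', 'octopus', 'police', 'said',
         'sandwich', 'sandwichlessly', 'sobbed', 'stole']

docs = [
    ['jeff', 'stole', 'octopus', 'sandwich'],
    ['help', 'sobbed', 'sandwichlessly'],
    ['drop', 'sandwich', 'said', 'sandwich', 'police'],
]

def term_frequencies(tokens, vocab):
    """TF = (times the word appears in the document) / (words in the document)."""
    counts = Counter(tokens)
    total = len(tokens)
    return [Fraction(counts[word], total) for word in vocab]

tf_matrix = [term_frequencies(doc, vocab) for doc in docs]
```

Because every token is in the vocabulary, each row of `tf_matrix` sums to exactly 1.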

## Document frequency
$$ DF_{word} = \frac{\#\_of\_documents\_containing\_word}{total\_\#\_of\_documents} $$

Vocabulary:
```
['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']
```

Document frequency for each word:
```
[1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 2/3, 1/3, 1/3, 1/3]
```

## Inverse document frequency
$$ IDF_{word} = \log\left(\frac{total\_\#\_of\_documents}{\#\_of\_documents\_containing\_word}\right) $$

Vocabulary:
```
['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']
```

IDF for each word:
```
[1.099, 1.099, 1.099, 1.099, 1.099, 1.099, 0.405, 1.099, 1.099, 1.099]
```
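
The DF and IDF figures above can be recomputed as follows. The sketch assumes the natural logarithm, which is what the numbers imply (ln 3 ≈ 1.099, ln 1.5 ≈ 0.405); the variable names are illustrative.

```python
import math

vocab = ['drop', 'help', 'jeff', 'octopus', 'police', 'said',
         'sandwich', 'sandwichlessly', 'sobbed', 'stole']

docs = [
    ['jeff', 'stole', 'octopus', 'sandwich'],
    ['help', 'sobbed', 'sandwichlessly'],
    ['drop', 'sandwich', 'said', 'sandwich', 'police'],
]

n_docs = len(docs)

# Number of documents that contain each vocabulary word at least once.
doc_counts = [sum(1 for doc in docs if word in doc) for word in vocab]

# DF is the fraction of documents containing the word; IDF is log(1 / DF).
doc_freq = [count / n_docs for count in doc_counts]
idf = [math.log(n_docs / count) for count in doc_counts]
```

Only `'sandwich'` appears in two documents, so it is the only word with a lower IDF.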

# TFIDF

Vocabulary:
```
['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']
```
TF * IDF:

```
['jeff', 'stole', 'octopus', 'sandwich']
[0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275]

['help', 'sobbed', 'sandwichlessly']
[0, 0.366, 0, 0, 0, 0, 0, 0.366, 0.366, 0]

['drop', 'sandwich', 'said', 'sandwich', 'police']
[0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]
```
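
Putting the two pieces together, the sketch below recomputes the TF * IDF rows by hand. `tfidf_matrix` is a hypothetical helper that follows the textbook formulas in this section. It is not what scikit-learn's `TfidfVectorizer` does by default: that class uses raw counts for TF, a smoothed IDF, and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

vocab = ['drop', 'help', 'jeff', 'octopus', 'police', 'said',
         'sandwich', 'sandwichlessly', 'sobbed', 'stole']

docs = [
    ['jeff', 'stole', 'octopus', 'sandwich'],
    ['help', 'sobbed', 'sandwichlessly'],
    ['drop', 'sandwich', 'said', 'sandwich', 'police'],
]

def tfidf_matrix(docs, vocab):
    """Textbook TF-IDF: (count / doc length) * ln(n_docs / docs containing word).

    Assumes every vocabulary word occurs in at least one document.
    """
    n_docs = len(docs)
    idf = {w: math.log(n_docs / sum(1 for d in docs if w in d)) for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[w] / len(doc) * idf[w] for w in vocab])
    return rows

rounded = [[round(v, 3) for v in row] for row in tfidf_matrix(docs, vocab)]
```

Rounded to three decimals, the rows match the vectors listed above.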

Now that we have turned our DOCUMENTS into VECTORS, we can feed them to any machine learning algorithm that accepts numeric features, and we can compare them with any vector similarity measure, such as cosine similarity.

Wow!

In [15]:
cosine_similarity([[0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275],  [0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]])

array([[ 1.        ,  0.08115802],
       [ 0.08115802,  1.        ]])

In [70]:
cosine_similarity([[0, 0.366, 0, 0, 0, 0, 0, 0.366, 0.366, 0],  [0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]])

array([[ 1.,  0.],
       [ 0.,  1.]])
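
To see where those two numbers come from, here is a plain-Python sketch of cosine similarity (dot product divided by the product of the vector lengths). The function name `cosine` is illustrative; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc1 = [0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275]
doc2 = [0, 0.366, 0, 0, 0, 0, 0, 0.366, 0.366, 0]
doc3 = [0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]

sim_1_3 = cosine(doc1, doc3)  # the two documents share only 'sandwich'
sim_2_3 = cosine(doc2, doc3)  # the two documents share no words
```

Documents 1 and 3 overlap only on `'sandwich'`, which has a low TF-IDF weight, so their similarity is small; documents 2 and 3 have no words in common, so it is exactly zero.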

## Section 5.3: Example with Spam data 

In [None]:
#revisit spam ham example 

In [71]:
df= pd.read_table('data/SMSSpamCollection', header=None)

In [72]:
df.head(3)

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |


In [73]:
df.columns=['spam', 'msg']

In [74]:
df.head(2)

|   | spam | msg |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |


In [75]:
stopwords_set=set(stopwords)

punctuation_set=set(string.punctuation)

In [77]:
len(stopwords_set)

179

In [79]:
len(punctuation_set)

32

In [84]:
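# Drop stop words and stand-alone punctuation tokens.
# Note: this check runs before lowercasing (done in a later cell), so capitalized
# stop words such as "The" are not removed here, and "point," keeps its comma.
df['msg_cleaned'] = df.msg.apply(
    lambda x: ' '.join(word for word in x.split()
                       if word not in stopwords_set and word not in punctuation_set))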

In [83]:
str1='Go until jurong point, crazy'.split()
' '.join(str1)

'Go until jurong point, crazy'

In [85]:
df.head(2)

|   | spam | msg | msg_cleaned |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go jurong point, crazy.. Available bugis n gre... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar... Joking wif u oni... |


In [86]:
df['msg_cleaned']= df.msg_cleaned.str.lower()  

In [87]:
df.head(2)

|   | spam | msg | msg_cleaned |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go jurong point, crazy.. available bugis n gre... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar... joking wif u oni... |
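
One thing to watch in the cleaning cells above: stop words are filtered before the text is lowercased, and punctuation is only removed when a token is a single punctuation character. The sketch below contrasts that order with a lowercase-first variant. `TOY_STOPS`, the sample message, and both function names are hypothetical and stand in for the NLTK stop list and the real SMS data.

```python
import string

TOY_STOPS = {"the", "is", "until"}   # hypothetical stand-in for the NLTK list
PUNCT = set(string.punctuation)

def clean_as_above(msg):
    """Same order as the notebook: filter first, lowercase afterwards."""
    kept = [w for w in msg.split() if w not in TOY_STOPS and w not in PUNCT]
    return ' '.join(kept).lower()

def clean_lower_first(msg):
    """Alternative order: lowercase and strip edge punctuation, then filter."""
    words = [w.strip(string.punctuation) for w in msg.lower().split()]
    return ' '.join(w for w in words if w and w not in TOY_STOPS)

msg = "The prize is yours until Friday , claim now!"

as_above = clean_as_above(msg)        # 'The' slips past the case-sensitive check
lower_first = clean_lower_first(msg)  # 'the' is removed, 'now!' loses its '!'
```

In practice the difference may matter less than it looks, because `CountVectorizer` lowercases and drops punctuation on its own by default, and the pipelines in Section 5.5 pass a stop list to the vectorizer as well.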


In [143]:
count_vect= CountVectorizer()

In [144]:
X= count_vect.fit_transform(df.msg_cleaned) 

In [145]:
X.shape

(5572, 8703)

In [146]:
y=df.spam

In [147]:
X_train, X_test, y_train, y_test= train_test_split(X,y)

In [148]:
lg= LogisticRegression()

lg.fit(X_train,y_train)
y_pred=lg.predict(X_test)
lg.score(X_test,y_test)

0.98277099784637478

In [149]:
confusion_matrix(y_test, y_pred)

array([[1207,    1],
       [  23,  162]])
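
Accuracy alone hides how the errors are split between the two classes. The sketch below reads precision and recall for the spam class off the confusion matrix above. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`'ham'`, then `'spam'`); the variable names are illustrative.

```python
# Confusion matrix from the cell above: rows = true label, columns = predicted label,
# label order ['ham', 'spam'].
cm = [[1207, 1],
      [23, 162]]

(true_ham, ham_as_spam), (spam_as_ham, true_spam) = cm

total = true_ham + ham_as_spam + spam_as_ham + true_spam
accuracy = (true_ham + true_spam) / total

# Of the messages flagged as spam, how many really were spam?
spam_precision = true_spam / (true_spam + ham_as_spam)

# Of the real spam messages, how many were caught?
spam_recall = true_spam / (true_spam + spam_as_ham)
```

Here the model almost never flags ham as spam (1 of 1208), but it lets 23 of the 185 spam messages through, which the 0.98 accuracy does not show.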

In [94]:
y_pred

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

## Section 5.4: Tweak model with Spam data 

In [164]:
## try tfidf  

tfidf= TfidfVectorizer()  

In [165]:
df.head(2)

|   | spam | msg | msg_cleaned | spam_num |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go jurong point, crazy.. available bugis n gre... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar... joking wif u oni... | 0 |


In [152]:
X= tfidf.fit_transform(df.msg_cleaned)
y=df.spam 
X_train, X_test, y_train, y_test= train_test_split(X,y)  

In [153]:
## try random forest 
rf= RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
rf.score(X_test,y_test)

0.97415649676956206

In [154]:
confusion_matrix(y_test, y_pred)   

array([[1222,    2],
       [  34,  135]])

In [155]:
#try gradient boost 
gb= GradientBoostingClassifier()
gb.fit(X_train,y_train)
y_pred=gb.predict(X_test)
gb.score(X_test,y_test)

0.97343862167982775

In [156]:
confusion_matrix(y_test, y_pred)

array([[1216,    8],
       [  29,  140]])

In [157]:
# Try tfidf with bigrams & trigrams 
tfidf=TfidfVectorizer(ngram_range=(1,3)) 
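
To make `ngram_range=(1,3)` concrete, here is a small sketch of which features a single tokenized message would contribute: all unigrams, bigrams and trigrams of consecutive words. `word_ngrams` is a hypothetical helper, not scikit-learn's internal code, and the order of the returned list is only for readability.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all runs of n consecutive tokens for n in [min_n, max_n]."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

features = word_ngrams(['free', 'entry', 'now'])
```

A three-word message yields six features instead of three, which is why the feature matrix grows quickly once bigrams and trigrams are included.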

In [158]:
X= tfidf.fit_transform(df.msg_cleaned)
y=df.spam
X_train, X_test, y_train, y_test= train_test_split(X,y)

In [159]:
#try gradient boost 
gb= GradientBoostingClassifier()
gb.fit(X_train,y_train)
y_pred=gb.predict(X_test)
gb.score(X_test,y_test)

0.96841349605168703

In [166]:
confusion_matrix(y_test, y_pred) 

array([[1209,    0],
       [  41,  143]])

In [171]:
tfidf=TfidfVectorizer()

In [172]:
X=tfidf.fit_transform(df.msg_cleaned)
y=df.spam
X_train, X_test, y_train, y_test=  train_test_split(X,y)

In [173]:
lg= LogisticRegression()
lg.fit(X_train,y_train)
y_pred=lg.predict(X_test)
lg.score(X_test,y_test)

0.95764536970567127

In [174]:
confusion_matrix(y_test, y_pred) 

array([[1204,    1],
       [  58,  130]])
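
A caveat on comparing the scores in this section: every experiment calls `train_test_split` again without a fixed seed, so each model is scored on a different random test set, and gaps of a percentage point may just be noise. One way to make the comparison fairer is sketched below, with one seeded, stratified split shared by all models. The toy messages and labels are hypothetical stand-ins for `df.msg_cleaned` and `df.spam`, and the scores on such a tiny set mean nothing in themselves.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df.msg_cleaned and df.spam.
messages = ["win free cash now", "free prize claim now", "win a free entry",
            "claim your cash prize", "lunch at noon today", "see you at home",
            "call me after work", "are you home today"]
labels = ["spam"] * 4 + ["ham"] * 4

X = TfidfVectorizer().fit_transform(messages)

# One split, fixed seed, class balance preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

models = {
    "logistic regression": LogisticRegression(),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Every model is trained and scored on exactly the same rows.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
```

With the split held constant, any difference between entries of `scores` comes from the models rather than from which messages happened to land in the test set.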

## Section 5.5: Pipeline with Spam data 

In [177]:
# Recent scikit-learn versions expect stop_words as a list rather than a set.
pipeline = Pipeline([('countvect', CountVectorizer(stop_words=list(stopwords_set))),
                     # ('tfidf', TfidfVectorizer(stop_words=list(stopwords_set))),
                     ('lg', LogisticRegression())])

In [178]:
X=df.msg_cleaned #note we are passing the cleaned msg to the pipeline 
y=df.spam
X_train, X_test, y_train, y_test= train_test_split(X,y) 


pipeline.fit(X_train, y_train) 
y_pred= pipeline.predict(X_test)
print(pipeline.score(X_test, y_test))
print(confusion_matrix(y_test, y_pred))

0.97415649677
[[1208    1]
 [  35  149]]
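
One practical benefit of the pipeline: in Sections 5.3 and 5.4 the vectorizer was fit on all messages before the split, so vocabulary (and, for TF-IDF, the IDF weights) from the test messages leaked into the features. Inside a `Pipeline`, `fit` only sees the training text. The sketch below illustrates this on hypothetical toy data; `toy_stops` is a made-up stop list, passed as a list because, as far as I know, recent scikit-learn releases reject a set for `stop_words`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the notebook would pass df.msg_cleaned and df.spam instead.
train_msgs = ["win free prize now", "free entry win cash",
              "are we meeting for lunch", "see you at home tonight"]
train_labels = ["spam", "spam", "ham", "ham"]
test_msgs = ["free cash prize", "lunch tonight zebra"]

# Hypothetical short stop list, passed as a list rather than a set.
toy_stops = ["are", "we", "for", "you", "at"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=toy_stops)),
    ("lg", LogisticRegression()),
])

# fit() learns the vocabulary and IDF weights from the training text only.
pipeline.fit(train_msgs, train_labels)
y_pred = pipeline.predict(test_msgs)

# "zebra" occurs only in the test messages, so it never enters the vocabulary.
learned_vocab = pipeline.named_steps["tfidf"].vocabulary_
```

Words that appear only at prediction time, like `"zebra"` here, are simply ignored by the fitted vectorizer.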


In [169]:
# Recent scikit-learn versions expect stop_words as a list rather than a set.
pipeline = Pipeline([('countvect', CountVectorizer(stop_words=list(stopwords_set))),
                     ('rf', RandomForestClassifier())])

In [139]:
X=df.msg_cleaned #note we are passing the cleaned msg to the pipeline 
y=df.spam
X_train, X_test, y_train, y_test= train_test_split(X,y) 


pipeline.fit(X_train, y_train) 
y_pred= pipeline.predict(X_test)
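print(pipeline.score(X_test, y_test))
print(confusion_matrix(y_test, y_pred))

# Not the best accuracy so far: 0.973 here vs. 0.983 for the first logistic regression.
# Each run also uses a different random split, so small gaps are not conclusive.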

0.97272074659
[[1188    7]
 [  31  167]]


In [140]:
# Recent scikit-learn versions expect stop_words as a list rather than a set.
pipeline = Pipeline([('countvect', CountVectorizer(stop_words=list(stopwords_set), ngram_range=(1, 3))),
                     ('rf', RandomForestClassifier())])

In [141]:
X=df.msg_cleaned #note we are passing the cleaned msg to the pipeline 
y=df.spam
X_train, X_test, y_train, y_test= train_test_split(X,y) 


pipeline.fit(X_train, y_train) 
y_pred= pipeline.predict(X_test)
print(pipeline.score(X_test, y_test))
print(confusion_matrix(y_test, y_pred))

0.964824120603
[[1221    0]
 [  49  123]]
