# Easy natural language processing

## Outline

* Quick rundown of NLP tasks
* Build your own spam detector
* Build your own sentiment analyzer
* NLTK library exploration
* Latent semantic analysis (LSA)
* Build your own article spinner

---------------------------------------------------------------------

## NLP applications

### Very good

* **Spam detection**
    - Filtering spam emails out of the inbox and/or sorting mail into categories such as "Primary", "Social", etc.


* **POS (parts-of-speech) tagging**
    - Given a sentence, we identify nouns, adjectives, verbs, etc.


* **NER (named-entity recognition)**
    - Given a sentence, we identify whether a word represents a person, an organization, etc.

===============================================================

### Pretty good

* **Sentiment analysis**
    - Assigning a score to a sentence based on word analysis.
    
    
* **Machine Translation**
    - Translating text into different languages
    
    
* **Information Extraction**
    - E.g. adding events to the calendar by automatically reading the content of a message.

===============================================================

### Needs Improvement

* **Machine conversations**
    - Speech recognition (Cortana, Siri), extracting meaning from spoken words and replying with a meaningful response.


* **Paraphrasing and summarization**
    - Summarize an article into a few sentences (AI base AMP)

---------------------------------------------------------------------

## Why is NLP hard?

### Ambiguity 1

Republicans Grill IRS Chief Over Lost Emails

Interpretations:

1. Republicans harshly question the chief about the emails
2. Republicans cook the chief using email as the fuel

### Ambiguity 2

I saw a man on a hill with a telescope.

Interpretations:

1. There's a man on a hill, and I'm watching him with my telescope.
2. There's a man on a hill, who I'm seeing, and he has a telescope.
3. There's a man, and he's on a hill that also has a telescope on it.
4. I'm on a hill, and I saw a man using a telescope

### Ambiguity 3

Twitter feeds: 'u', 'ur', 'lol', 'netflix and chill'

# Building a spam detector

Dataset: [https://archive.ics.uci.edu/ml/datasets/Spambase]

### Data description

**Columns 1 - 48: word frequency measure** - 100 * (number of times the word appears in the email) / (total number of words in the email). These are columns 0 - 47 in the zero-indexed table below.

**Last column is label**: 1 = spam, 0 = not spam
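
As a rough illustration of the word frequency measure described above, the sketch below computes it for a single word. The function name `word_frequency_percent`, the sample text, and the whitespace splitting are assumptions made for this example; Spambase was built with its own tokenization rules.

```python
def word_frequency_percent(document, word):
    """Return 100 * (occurrences of word) / (total words in document)."""
    # Assumption: lowercase + whitespace split is a simplified tokenizer
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word.lower()) / len(tokens)


# Made-up sample text: "free" appears 3 times out of 9 words
doc = "free money free prizes claim your free gift now"
print(word_frequency_percent(doc, "free"))
```

Because the value is scaled by document length, a long email and a short email with the same share of a word get the same feature value.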

In [42]:
# http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [43]:
data = pd.read_csv('spambase/spambase.data', header=None)
data.head()

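| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |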


In [44]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same ndarray
data = data.to_numpy()
type(data)

numpy.ndarray

In [46]:
X = data[:,:48]
Y = data[:,-1]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7, test_size=0.3, shuffle=True)

In [47]:
X_train.shape

(3220, 48)

In [48]:
X_test.shape

(1381, 48)

In [49]:
Y_train.shape

(3220,)

In [50]:
Y_test.shape

(1381,)

In [51]:
model = MultinomialNB()
model.fit(X_train, Y_train)
print('Classification rate for MultinomialNB: {}'.format(model.score(X_test, Y_test)))

Classification rate for MultinomialNB: 0.8725561187545257


In [52]:
from sklearn.ensemble import AdaBoostClassifier

In [53]:
model = AdaBoostClassifier()
model.fit(X_train, Y_train)
print('Classification rate for AdaBoost: {}'.format(model.score(X_test, Y_test)))

Classification rate for AdaBoost: 0.9261404779145547
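
The "classification rate" printed above is the value returned by `model.score`, which for scikit-learn classifiers is the mean accuracy on the given data. A minimal sketch of that calculation follows; the helper name `classification_rate` and the toy labels are made up for illustration.

```python
def classification_rate(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_rate(y_true, y_pred))  # 6 of 8 correct
```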


## More on feature extraction (getting the features)

Collectively called **"bag of words"**, the following techniques are available for extracting features from text:

1. word proportion (the one used in the example above)
2. raw word counts
3. binary (1 if word appears, 0 otherwise)
4. TF-IDF (takes into account the fact that some words appear in many documents and hence tell us little, e.g. 'and', 'in', 'or')

Useful link for TF-IDF [http://scikit-learn.org/stable/modules/feature_extraction.html]
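
The four variants listed above can be sketched in plain Python on a toy corpus. Everything here (the documents, the helper names, and the textbook `log(N / df)` form of IDF) is an assumption for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed IDF by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (made up); "the" appears in every document
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: in how many documents does each word appear?
df = {w: sum(w in doc for doc in tokenized) for w in vocab}


def raw_counts(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def proportions(tokens):
    return [c / len(tokens) for c in raw_counts(tokens)]


def binary(tokens):
    return [1 if c > 0 else 0 for c in raw_counts(tokens)]


def tf_idf(tokens):
    # Unsmoothed textbook form: count * log(N / df)
    n_docs = len(tokenized)
    return [c * math.log(n_docs / df[w])
            for c, w in zip(raw_counts(tokens), vocab)]


first = tokenized[0]
print(vocab)
print(raw_counts(first))
print(binary(first))
print(tf_idf(first))
```

In this corpus "the" occurs in all three documents, so its TF-IDF weight is zero even though it is the most frequent word in the first document.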

# Build a sentiment analyzer

* sentiment = how positive or negative some text is
* applications = Amazon reviews, Yelp reviews, hotel reviews, tweets

Dataset: [https://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html]

### Outline of our sentiment analyzer

* We'll just look at the **electronics category**, but you can try the same code on others
* We could use the 5-star ratings as targets and do regression, but let's just do classification since the reviews are already marked 'positive' and 'negative'
* **XML parser (BeautifulSoup)**
* Only look at key **'review_text'**
* We'll need 2 passes, one to determine vocabulary size and which index corresponds to which word, and one to create data vectors
* After that, we can just use any **scikit-learn classifier** as we did previously
* But we'll **use logistic regression so we can interpret the weights**
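
The two-pass idea from the outline can be sketched on a toy dataset before touching the real reviews. The review texts, the names `word_index` and `rows`, and the whitespace tokenization are assumptions for this example only.

```python
# Made-up reviews with labels: 1 = positive, 0 = negative
reviews = [
    ("great sound great price", 1),
    ("poor battery poor sound", 0),
]

# Pass 1: assign each distinct word the next free column index
word_index = {}
for text, _ in reviews:
    for token in text.split():
        if token not in word_index:
            word_index[token] = len(word_index)

# Pass 2: one row per review, one column per word, label in the last column
rows = []
for text, label in reviews:
    tokens = text.split()
    row = [0.0] * (len(word_index) + 1)
    for token in tokens:
        row[word_index[token]] += 1
    # Turn counts into proportions, leaving the label slot untouched
    row[:-1] = [count / len(tokens) for count in row[:-1]]
    row[-1] = label
    rows.append(row)

print(word_index)
for row in rows:
    print(row)
```

The first pass is needed because the width of every row depends on the final vocabulary size, which is unknown until all reviews have been read.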

In [54]:
import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

In [55]:
# Converts words to their base form
# More on lemmatizer
# http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization 

word_net_lemmatizer = WordNetLemmatizer()

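# Option 1: read the 'given' stopwords (words that carry neutral sentiment)
# from a local file, one word per line; rstrip() removes the trailing newline
# stopwords = set(word.rstrip() for word in open('stopwords.txt'))

# Option 2 (used here): the stopwords provided by nltk.
# The corpus is accessed through its full path because the name
# `stopwords` is rebound to a plain set on this line
stopwords = set(nltk.corpus.stopwords.words('english'))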

# More on BeautifulSoup
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/
# reading XML files using BeautifulSoup (using the lxml parser)
positive_reviews = BeautifulSoup(open('electronics/positive.review').read(), 'lxml')
positive_reviews = positive_reviews.find_all('review_text')   # find_all is the current name for findAll

negative_reviews = BeautifulSoup(open('electronics/negative.review').read(), 'lxml')
negative_reviews = negative_reviews.find_all('review_text')

type(positive_reviews)

bs4.element.ResultSet

In [56]:
# shuffling data
np.random.shuffle(positive_reviews)
np.random.shuffle(negative_reviews)

In [57]:
# checking that both classes have the same number of reviews
len(positive_reviews)

1000

In [58]:
len(negative_reviews)

1000

In [59]:
# creating the main vocabulary dictionary
vocabulary = {}
vocabulary_index = 0

positive_tokenized = []
negative_tokenized = []

In [60]:
# defining custom tokenizing function
def custom_tokenizer(s):
    # lowering all text
    s = s.lower()
    
    # similar to str.split(), but word_tokenize() also separates punctuation from words
    tokens = nltk.tokenize.word_tokenize(s)
    
    # remove the words with length <= 2
    tokens = [token for token in tokens if len(token) > 2]
    
    # lemmatize the tokens
    tokens = [word_net_lemmatizer.lemmatize(token) for token in tokens]
    
    # remove the stopwords from the tokens
    tokens = [token for token in tokens if token not in stopwords]
    
    return tokens
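
To see what the filtering steps in the tokenizer do without downloading any NLTK data, here is a simplified stand-in. It swaps `word_tokenize` for a regular expression, skips lemmatization entirely, and uses a tiny made-up stopword set, so it is only an approximation of `custom_tokenizer`; the names `simple_tokenizer` and `toy_stopwords` are hypothetical.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer)
toy_stopwords = {"the", "and", "this", "was", "for"}


def simple_tokenizer(s):
    # lowercase everything
    s = s.lower()
    # regex split instead of nltk.tokenize.word_tokenize (simplification)
    tokens = re.findall(r"[a-z0-9']+", s)
    # drop very short tokens (length <= 2)
    tokens = [t for t in tokens if len(t) > 2]
    # drop stopwords
    tokens = [t for t in tokens if t not in toy_stopwords]
    return tokens


print(simple_tokenizer("This router was EASY to set up, and the range is great!"))
# ['router', 'easy', 'set', 'range', 'great']
```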

In [61]:
# # nltk.download()
# tokens = nltk.tokenize.word_tokenize('This is to certify that Milind Dalvi successfully completed 13 hours of Complete Python Bootcamp: Go from zero to hero in Python online course on Aug. 28, 2017')
# # remove the words with length <= 2
# tokens = [token for token in tokens if len(token) > 2]
# tokens

In [62]:
# #lemmatize the tokens
# tokens = list(map(lambda token: word_net_lemmatizer.lemmatize(token), tokens))
# tokens

In [63]:
for review in positive_reviews:
    tokens = custom_tokenizer(review.text)
    positive_tokenized.append(tokens)
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = vocabulary_index
            vocabulary_index += 1
            
for review in negative_reviews:
    tokens = custom_tokenizer(review.text)
    negative_tokenized.append(tokens)
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = vocabulary_index
            vocabulary_index += 1

In [64]:
# defining token to vector function
def tokens_to_vector(tokens, label):
    # one slot per vocabulary word plus a final slot for the target label;
    # sizing this by len(tokens) instead raises
    # "ValueError: could not broadcast input array from shape (19) into shape (11089)"
    feature = np.zeros(len(vocabulary) + 1)
    for token in tokens:
        feature[vocabulary[token]] += 1
    total = feature.sum()
    if total > 0:   # a review can be empty after filtering
        feature = (feature * 100.0) / total   # percentage
    feature[-1] = label
    return feature

In [65]:
# Number of records N
N = len(positive_tokenized) + len(negative_tokenized)
data = np.zeros((N, len(vocabulary) + 1))
i = 0
for tokens in positive_tokenized:
    XY = tokens_to_vector(tokens, 1)   # positive reviews get label 1
    data[i, :] = XY
    i += 1
    
for tokens in negative_tokenized:
    XY = tokens_to_vector(tokens, 0)   # negative reviews get label 0
    data[i, :] = XY
    i += 1

In [None]:
X = data[:, :-1]
Y = data[:, -1]   # the label was stored in the last column
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7, test_size=0.3, shuffle=True)
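
The outline mentions using logistic regression so that the weights can be interpreted. One possible way to read them, sketched under assumptions: given a word-to-index dictionary like `vocabulary` and one learned weight per word (for a fitted scikit-learn `LogisticRegression` these would come from `model.coef_[0]`), list the words whose weights are far from zero. The function name, the threshold, and the weight values below are invented for illustration and are not results from this dataset.

```python
def most_informative_words(word_index, weights, threshold=0.5):
    """Return (word, weight) pairs with |weight| > threshold, most positive first."""
    pairs = [(word, weights[i]) for word, i in word_index.items()]
    strong = [(word, w) for word, w in pairs if abs(w) > threshold]
    return sorted(strong, key=lambda pair: pair[1], reverse=True)


# Hypothetical vocabulary and weights, standing in for `vocabulary`
# and `model.coef_[0]` from a trained model
word_index = {"excellent": 0, "cable": 1, "poor": 2, "love": 3, "return": 4}
weights = [1.6, 0.1, -1.8, 1.1, -0.9]

for word, weight in most_informative_words(word_index, weights):
    print(word, weight)
```

Large positive weights push a review towards the positive class and large negative weights towards the negative class, while words near zero (like "cable" here) contribute little either way.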