Natural Language Processing (NLP)
===
(1) NLP Tasks: How well does NLP perform?
    
    (a) Well: spam detection, part-of-speech (POS) tagging (see www.parts-of-speech.info), and named-entity recognition (NER).
    
    (b) Fairly well: sentiment analysis, machine translation (see https://translate.google.com), and information extraction.
    
    (c) Only sometimes: machine conversation (speech recognition can confuse "recognize speech" with "wreck a nice beach"), paraphrasing, and summarization, e.g., Pinterest.
    
    (d) Interesting progress: word2vec (uses an artificial neural network).

(2) Spam Detector

(3) Sentiment Analyzer

(4) Explore NLTK

(5) Latent Semantic Analysis (LSA)

(6) Article spinner

Why is NLP hard? Math is universal, but language is ambiguous! For example, "Republicans Grill IRS Chief Over Lost Emails" could be interpreted as "Republicans harshly question the chief about emails" or "Republicans cook the chief using email as fuel".

Another example: "I saw a man on a hill with a telescope." could be interpreted as:

(1) "There's a man on a hill, and I'm watching him with a telescope."

(2) "There's a man on a hill, who I'm seeing, and he has a telescope."

(3) "There's a man, and he's on a hill that also has a telescope on it."

(4) "I'm on a hill, and I saw a man using a telescope."

Informal language is another difficulty: Twitter is limited to 140 characters, so people write "U", "UR", "LOL", "netflix and chill".

Remember that language = text and voice. Voice requires signal processing.

Spam Detector
---
Let's get the pre-processed data here:
http://archive.ics.uci.edu/ml/datasets/Spambase

Two main takeaways:
(1) A lot of NLP is just pre-processing data, so that we can use ML algorithms we already know.

(2) You can choose ANY ML algorithm, as long as you can make the data fit.

We'll use scikit-learn for the models.

Pre-processing 
---
Columns 1...48

word frequency measure - the number of times the word appears in the email, divided by the total number of words in the email, x 100

The last column is the label

1=spam, 0=not spam
This is one example of a "term-document matrix" - terms go along the columns, documents (here, emails) go along the rows
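
To make the word frequency measure concrete, here is a minimal sketch that builds a tiny term-document matrix by hand. The vocabulary and the two example emails are made up for illustration, and splitting on whitespace is a simplification; the Spambase features were already computed for us.

```python
def word_frequency_percent(document, vocabulary):
    """Return 100 * count(term) / total_words for each vocabulary term."""
    words = document.lower().split()  # naive whitespace tokenization (assumption)
    total = len(words)
    return [100.0 * words.count(term) / total for term in vocabulary]

# hypothetical vocabulary and emails, for illustration only
vocabulary = ["free", "money", "meeting"]
emails = [
    "free money free prizes",
    "meeting moved to friday",
]

# rows are documents (emails), columns are terms
matrix = [word_frequency_percent(email, vocabulary) for email in emails]
for row in matrix:
    print(row)
```

Each row is one email and each column is one term, which is the same layout as the first 48 columns of the dataset.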


In [1]:
# sklearn.naive_bayes.MultinomialNB
from sklearn.naive_bayes import MultinomialNB
import pandas as pd 
import numpy as np

# the file has no header row; load it as a NumPy array and shuffle the rows
data = pd.read_csv("/home/mike/Downloads/spambase/spambase.data", header=None).values
np.random.shuffle(data)

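# the first 48 columns are the word-frequency features; the last column is the label
X = data[:,:48]
Y = data[:,-1]

# hold out the last 100 rows as the test set
Xtrain = X[:-100,]
Ytrain = Y[:-100,]

Xtest = X[-100:,]
Ytest = Y[-100:,]

model = MultinomialNB()
model.fit(Xtrain,Ytrain)
print("Classification rate for NB:", model.score(Xtest,Ytest))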

Classification rate for NB: 0.85


In [2]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier()
model.fit(Xtrain,Ytrain)
print("Classification rate for AdaBoost:", model.score(Xtest,Ytest))

Classification rate for AdaBoost: 0.92
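
The classification rate reported above is the accuracy: the fraction of test emails whose predicted label matches the true label. For scikit-learn classifiers, `score` returns this mean accuracy. A minimal sketch of the same calculation, using made-up predictions and labels rather than the model's actual output:

```python
def classification_rate(predictions, targets):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / float(len(targets))

# hypothetical predictions and labels (1=spam, 0=not spam)
predictions = [1, 0, 1, 1, 0]
targets = [1, 0, 0, 1, 0]

rate = classification_rate(predictions, targets)
print("Classification rate:", rate)
```

With only 100 test rows, each misclassified email moves the rate by 0.01, so the numbers will vary from one shuffle to the next.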


Other Types of Features
---

Most of these fit into what we call "bag of words":

(1) word proportion (we already saw this)

(2) raw word counts

(3) binary (1 if word appears, 0 otherwise)

(4) TF-IDF (takes into account the fact that some words appear in many documents, and hence don't really tell us much)
    
sklearn: http://scikit-learn.org/stable/modules/feature_extraction.html
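
The feature types listed above can be sketched in a few lines of plain Python. The three documents are made up for illustration, and the TF-IDF formula used here (raw count times log(N / document frequency)) is just one common variant, chosen as an assumption. scikit-learn's `CountVectorizer` and `TfidfVectorizer` do this for real data, and `TfidfVectorizer` uses a slightly different, smoothed formula by default.

```python
import math
from collections import Counter

# hypothetical documents, for illustration only
documents = [
    "free money free prizes",
    "meeting about money",
    "meeting moved to friday",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted(set(word for tokens in tokenized for word in tokens))

def raw_counts(tokens):
    """How many times each vocabulary term appears in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def binary(tokens):
    """1 if the term appears in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

# document frequency: in how many documents does each term appear?
doc_freq = {term: sum(1 for tokens in tokenized if term in tokens)
            for term in vocabulary}

def tf_idf(tokens):
    """Assumed variant: tf = raw count, idf = log(N / document frequency)."""
    n_docs = float(len(tokenized))
    counts = Counter(tokens)
    return [counts[term] * math.log(n_docs / doc_freq[term])
            for term in vocabulary]

print(vocabulary)
print(raw_counts(tokenized[0]))
print(binary(tokenized[0]))
print(tf_idf(tokenized[0]))
```

In the first document, "money" gets a lower TF-IDF value than "prizes" even though both appear once, because "money" also appears in another document.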
        

Building a very simple Sentiment Analyzer
---
What is it? 
    
    sentiment = how positive or negative some text is
    
    Amazon reviews, Yelp reviews, hotel reviews, tweets, ...
    
Our data:
https://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html

Outline of our Sentiment Analyzer
---
We'll just look at the electronics category, but you can try the same code on the others.  We could use the 5-star ratings as targets for regression, but let's just do classification, since the reviews are already marked "positive" and "negative".  We'll need an XML parser, and we'll use BeautifulSoup.  We're only going to look at the key "review_text" and ignore the extra data.

To create our feature vectors, we'll do the same thing as in the first dataset: count the number of occurrences of each word and divide by the total number of words.  For that to work we need two passes over the data.  The first pass collects the distinct words, so that we know the size of the feature vector and the index of each token/word; at this stage we can also remove stop words like "this", "is", or "I" to reduce the vocabulary size.  On the second pass, we assign the values to the data vector.

After that, we can use any scikit-learn classifier as we did previously, but we'll use logistic regression so that we can interpret the weights of the learned model as a score for each individual input word.  For example, a word like "horrible" with a weight of -1 is associated with negative reviews.
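
To see why the logistic regression weights can be read as per-word scores, here is a minimal sketch of how such a model turns word proportions into a probability. The weights and bias are hypothetical values chosen for illustration, not weights learned from the reviews.

```python
import math

# hypothetical weights and bias, for illustration only (not learned from data)
weights = {"excellent": 1.5, "easy": 1.0, "horrible": -1.0, "waste": -2.0}
bias = 0.0

def positive_probability(tokens):
    """Sigmoid of the weighted sum of word proportions."""
    total = float(len(tokens))
    score = bias
    for token in tokens:
        # each occurrence contributes weight * (1 / total); unknown words contribute 0
        score += weights.get(token, 0.0) / total
    return 1.0 / (1.0 + math.exp(-score))

p_good = positive_probability(["excellent", "easy", "setup"])
p_bad = positive_probability(["horrible", "waste", "money"])
print(p_good, p_bad)
```

Words with positive weights push the probability above 0.5 (a positive review) and words with negative weights push it below 0.5.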

In [6]:
import nltk
import numpy as np

from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

# WordNetLemmatizer turns words into their base form, so that e.g. "dogs" and "dog" count as the same word.
# This keeps our vocabulary from growing too large.
wordnet_lemmatizer = WordNetLemmatizer()

# from http://www.lextek.com/manuals/onix/stopwords1.html
stopwords = set(w.rstrip() for w in open('/home/mike/Downloads/machine_learning_examples-master/nlp_class/stopwords.txt'))

# load the reviews
# data courtesy of http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html
positive_reviews = BeautifulSoup(open('/home/mike/Downloads/machine_learning_examples-master/nlp_class/electronics/positive.review').read())
positive_reviews = positive_reviews.findAll('review_text')

negative_reviews = BeautifulSoup(open('/home/mike/Downloads/machine_learning_examples-master/nlp_class/electronics/negative.review').read())
negative_reviews = negative_reviews.findAll('review_text')

# there are more positive reviews than negative reviews
# so let's take a random sample so we have balanced classes
np.random.shuffle(positive_reviews)
positive_reviews = positive_reviews[:len(negative_reviews)]

# first let's just try to tokenize the text using nltk's tokenizer
# let's take the first review for example:
# t = positive_reviews[0]
# nltk.tokenize.word_tokenize(t.text)
#
# notice how it doesn't downcase, so It != it
# not only that, but do we really want to include the word "it" anyway?
# you can imagine it wouldn't be any more common in a positive review than a negative review
# so it might only add noise to our model.
# so let's create a function that does all this pre-processing for us

def my_tokenizer(s):
    s = s.lower() # downcase
    tokens = nltk.tokenize.word_tokenize(s) # split string into words (tokens)
    tokens = [t for t in tokens if len(t) > 2] # remove short words, they're probably not useful
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] # put words into base form
    tokens = [t for t in tokens if t not in stopwords] # remove stopwords
    return tokens


# create a word-to-index map so that we can create our word-frequency vectors later
# let's also save the tokenized versions so we don't have to tokenize again later
word_index_map = {}
current_index = 0
positive_tokenized = []
negative_tokenized = []

for review in positive_reviews:
    tokens = my_tokenizer(review.text)
    positive_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1

for review in negative_reviews:
    tokens = my_tokenizer(review.text)
    negative_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1


# now let's create our input matrices
def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index_map) + 1) # last element is for the label
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    x = x / x.sum() # normalize it before setting label
    x[-1] = label
    return x

N = len(positive_tokenized) + len(negative_tokenized)
# N x (D+1) matrix - keeping X and Y together for now so we can shuffle them more easily later
data = np.zeros((N, len(word_index_map) + 1))
i = 0
for tokens in positive_tokenized:
    xy = tokens_to_vector(tokens, 1)
    data[i,:] = xy
    i += 1

for tokens in negative_tokenized:
    xy = tokens_to_vector(tokens, 0)
    data[i,:] = xy
    i += 1

# shuffle the data and create train/test splits
# try it multiple times!
np.random.shuffle(data)

X = data[:,:-1]
Y = data[:,-1]

# last 100 rows will be test
Xtrain = X[:-100,]
Ytrain = Y[:-100,]
Xtest = X[-100:,]
Ytest = Y[-100:,]

model = LogisticRegression()
model.fit(Xtrain, Ytrain)
print("Classification rate:", model.score(Xtest, Ytest))

Classification rate: 0.67


In [7]:
# let's look at the weights for each word
# try it with different threshold values!
threshold = 0.5
for word, index in word_index_map.items():
    weight = model.coef_[0][index]
    if weight > threshold or weight < -threshold:
        print(word, weight)

unit -0.717914355533
fit 0.538816456084
easy 1.76958660878
support -0.86059544655
happy 0.690601251567
time -0.546546903041
love 1.21653505251
returned -0.791150191408
cable 0.645920811086
company -0.514905050724
paper 0.560488992688
try -0.659793813561
customer -0.686208736099
perfect 0.976268261198
waste -0.994578948086
highly 1.03353133764
then -1.08902468091
wa -1.56369137092
space 0.626725466711
price 2.74307767646
using 0.660881455752
lot 0.680385338126
you 1.05070795654
poor -0.759043961336
month -0.806202631089
tried -0.737323838327
stopped -0.551823183193
pretty 0.760439720798
look 0.547474831485
quality 1.3335500341
speaker 0.954018470374
ha 0.743957343761
recommend 0.643257957902
doe -1.18245881385
bad -0.690957509065
mouse 0.532031083526
item -0.971515536974
little 0.901291355375
sound 1.09392919589
n't -2.11720680364
money -1.04915938476
've 0.817587331682
hour -0.553014507251
bit 0.647221141343
comfortable 0.639402392728
value 0.525073779817
buy -0.885609434453
excellent 