# Basic Text Classification with Naive Bayes
### extra credit part 1
***
In the mini-project, you'll learn the basics of text analysis using a subset of movie reviews from the rotten tomatoes database. You'll also use a fundamental technique in Bayesian inference, called Naive Bayes. This mini-project is based on [Lab 10 of Harvard's CS109](https://github.com/cs109/2015lab10) class.  Please free to go to the original lab for additional exercises and solutions.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from six.moves import range
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Setup Pandas
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

critics = pd.read_csv('./critics.csv')
critics = critics[~critics.quote.isnull()] #drop rows with missing quotes
critics.head()

n_reviews = len(critics)
n_movies = critics.rtid.unique().size
n_critics = critics.critic.unique().size

print("Number of reviews: {:d}".format(n_reviews))
print("Number of critics: {:d}".format(n_critics))
print("Number of movies:  {:d}".format(n_movies))

texts = critics.quote

# practice CountVectorizer
vectorizer = CountVectorizer(stop_words = 'english',
                            ngram_range = (1, 2))

vectorizer.fit_transform(texts)
vocab = vectorizer.vocabulary_

print('Number of bigrams: ' + str(len(vocab.items())))
print('Practice assessing the dictionary...')
w2000 = [w for w in vocab.items() if w[1] == 2000]
print('The 2000th bigram: ' + w2000[0][0])


Number of reviews: 15561
Number of critics: 623
Number of movies:  1921
Number of bigrams: 147225
Practice assessing the dictionary...
The 2000th bigram: action delivers


In [2]:
# useful functions from Springboard's assignment 

def cv_score(clf, X, y, scorefunc):
    # compute average accuracy from KFold 
    result = 0.
    nfold = 5
    for train, test in KFold(nfold).split(X): # split data into train/test groups, 5 times
        clf.fit(X[train], y[train]) # fit the classifier, passed is as clf.
        result += scorefunc(clf, X[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average

def log_likelihood(clf, x, y):
    # compute log_likelihood value
    prob = clf.predict_log_proba(x)
    rotten = y == 0
    fresh = ~rotten
    return prob[rotten, 0].sum() + prob[fresh, 1].sum()

# split into train and test set
_, itest = train_test_split(range(critics.shape[0]), train_size=0.7,
                           random_state = 0)
mask = np.zeros(critics.shape[0], dtype=np.bool)
mask[itest] = True

### Springboard assignment

* Previously, we had accuracy around 0.72

### Improvement:
1. Add ngram: ngram_range = (1, 2)
2. Remove stop words 
3. chi-squared feature selection
4. Try TF-IDF and Random Forest

In [3]:
# Mutinomial NB with TF-IDF

# Tuning hyper paramaters:
# (1) alpha = Multinomial NB additive smoothing parameter
# (2) K = Select features according to the K highest scores.

# define grid to look for
alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
Ks = [4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000]

best_alpha = None
best_K = None
maxscore = -np.inf

# grid search
for alpha in alphas:
    for K in Ks: 
        
        vectorizer = TfidfVectorizer(stop_words = 'english',
                                     ngram_range = (1, 2))
        
        X_all = vectorizer.fit_transform(critics.quote)
        X_all = X_all.tocsc()
        y_all = (critics.fresh == 'fresh').values.astype(np.int)
        # x_all uses all avilable features (words)
        
        # chi-squared feature selection: x_new contains selected words 
        X_new = SelectKBest(chi2, k=K).fit_transform(X_all, y_all)

        Xtrainthis = X_new[mask] #define training set
        ytrainthis = y_all[mask]

        clf = MultinomialNB(alpha=alpha) # classifier
        
        cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)

        if cvscore > maxscore:
            maxscore = cvscore
            best_alpha = alpha
            best_K = K

# fitting the test set 

vectorizer = TfidfVectorizer(stop_words = 'english',
                             ngram_range = (1, 2))

X_all = vectorizer.fit_transform(critics.quote)
X_all = X_all.tocsc()
y_all = (critics.fresh == 'fresh').values.astype(np.int)
        
X_new = SelectKBest(chi2, k = best_K).fit_transform(X_all, y_all)

Xtrain = X_new[mask] # training set
ytrain = y_all[mask]
Xtest = X_new[~mask] # test set
ytest = y_all[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(Xtrain, ytrain)

training_accuracy = clf.score(Xtrain, ytrain)
test_accuracy = clf.score(Xtest, ytest)

print("Mutinomial NB with TF-IDF")
print("----------------------------------------")
print('Optimal alpha = ' + str(best_alpha))
print('Optimal K = ' + str(best_K))
print("----------------------------------------")
print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))
print("----------------------------------------")
print(confusion_matrix(ytest, clf.predict(Xtest)))

Mutinomial NB with TF-IDF
----------------------------------------
Optimal alpha = 0.001
Optimal K = 6500
----------------------------------------
Accuracy on training data: 0.919469
Accuracy on test data:     0.786173
----------------------------------------
[[2233 2015]
 [ 314 6330]]


In [4]:
# Mutinomial NB with CountVectorizer

# Tuning hyper paramaters:
# (1) alpha = Multinomial NB additive smoothing parameter
# (2) K = Select features according to the K highest scores.

# define grid to look for
alphas = [0.01, 0.1, 0.25, 0.5, 0.75, 1, 1.5]
Ks = [4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000]

best_alpha = None
best_K = None
maxscore = -np.inf

for alpha in alphas:
    for K in Ks: 
        
        vectorizer = CountVectorizer(stop_words = 'english',
                                     ngram_range = (1, 2))
        
        X_all = vectorizer.fit_transform(critics.quote)
        X_all = X_all.tocsc()
        y_all = (critics.fresh == 'fresh').values.astype(np.int)
        
        X_new = SelectKBest(chi2, k=K).fit_transform(X_all, y_all)

        Xtrainthis = X_new[mask]
        ytrainthis = y_all[mask]

        clf = MultinomialNB(alpha=alpha)
        
        cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)

        if cvscore > maxscore:
            maxscore = cvscore
            best_alpha = alpha
            best_K = K

# fitting the test set 

vectorizer = CountVectorizer(stop_words = 'english',
                             ngram_range = (1, 2))

X_all = vectorizer.fit_transform(critics.quote)
X_all = X_all.tocsc()
y_all = (critics.fresh == 'fresh').values.astype(np.int)
        
X_new = SelectKBest(chi2, k = best_K).fit_transform(X_all, y_all)

Xtrain = X_new[mask]
ytrain = y_all[mask]
Xtest = X_new[~mask]
ytest = y_all[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(Xtrain, ytrain)

training_accuracy = clf.score(Xtrain, ytrain)
test_accuracy = clf.score(Xtest, ytest)

print("Mutinomial NB with CountVectorizer")
print("----------------------------------------")
print('Optimal alpha = ' + str(best_alpha))
print('Optimal K = ' + str(best_K))
print("----------------------------------------")
print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))
print("----------------------------------------")
print(confusion_matrix(ytest, clf.predict(Xtest)))

Mutinomial NB with CountVectorizer
----------------------------------------
Optimal alpha = 0.5
Optimal K = 5000
----------------------------------------
Accuracy on training data: 0.899336
Accuracy on test data:     0.814818
----------------------------------------
[[2806 1442]
 [ 575 6069]]


In [5]:
# Random Forest with CountVectorizer

list_max_depth = [50,100,300,500,1000]
Ks = [4000, 4500, 5000, 5500, 6000]

best_max_depth = None
best_K = None
maxscore = -np.inf

for max_depth in list_max_depth:
    for K in Ks: 
        
        vectorizer = CountVectorizer(stop_words = 'english',
                                     ngram_range = (1, 2))
        
        X_all = vectorizer.fit_transform(critics.quote)
        X_all = X_all.tocsc()
        y_all = (critics.fresh == 'fresh').values.astype(np.int)
        
        X_new = SelectKBest(chi2, k=K).fit_transform(X_all, y_all)

        Xtrainthis = X_new[mask]
        ytrainthis = y_all[mask]

        clf = RandomForestClassifier(max_depth = max_depth,
                                     n_estimators = 100)
        
        cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)

        if cvscore > maxscore:
            best_max_depth = max_depth
            best_K = K

# fitting the test set 

vectorizer = CountVectorizer(stop_words = 'english',
                             ngram_range = (1, 2))

X_all = vectorizer.fit_transform(critics.quote)
X_all = X_all.tocsc()
y_all = (critics.fresh == 'fresh').values.astype(np.int)
        
X_new = SelectKBest(chi2, k = best_K).fit_transform(X_all, y_all)

Xtrain = X_new[mask]
ytrain = y_all[mask]
Xtest = X_new[~mask]
ytest = y_all[~mask]

clf = RandomForestClassifier(max_depth = best_max_depth,
                             n_estimators = 100).fit(Xtrain, ytrain)

training_accuracy = clf.score(Xtrain, ytrain)
test_accuracy = clf.score(Xtest, ytest)

print("Random Forest with CountVectorizer")
print("----------------------------------------")
print('Optimal max_depth = ' + str(max_depth))
print('Optimal K = ' + str(best_K))
print("----------------------------------------")
print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))
print("----------------------------------------")
print(confusion_matrix(ytest, clf.predict(Xtest)))

  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np.log(proba)
  return np

Random Forest with CountVectorizer
----------------------------------------
Optimal max_depth = 1000
Optimal K = 6000
----------------------------------------
Accuracy on training data: 0.982652
Accuracy on test data:     0.707033
----------------------------------------
[[1826 2422]
 [ 769 5875]]


#### Conclusion 
From the Springboard assignment, I made an improvement as follows:

1. Use Mutinomial NB
2. Use CountVectorizer
3. Remove stop words
4. Use bigrams
5. Use chi-squared feature selection

The accuracy increases from 0.72 to 0.81