# Homework with Yelp reviews data

In [39]:
import json

import pandas as pd


## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.json`** is the original format of the file. **`yelp.csv`** contains the same data, in a more convenient format. Both of the files are in the course repo (in the **`data`** directory), so there is no need to download the data from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.
- The **cool** column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
- The **useful** and **funny** columns are similar to the **cool** column.

**Goal:** Predict the star rating of a review using **only** the review text. (We will not be using the other columns.)

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [40]:
csv_file = '../data/yelp.csv'
df = pd.read_csv(csv_file)

In [41]:
df.describe()

Unnamed: 0,stars,cool,useful,funny
count,10000.0,10000.0,10000.0,10000.0
mean,3.7775,0.8768,1.4093,0.7013
std,1.214636,2.067861,2.336647,1.907942
min,1.0,0.0,0.0,0.0
25%,3.0,0.0,0.0,0.0
50%,4.0,0.0,1.0,0.0
75%,5.0,1.0,2.0,1.0
max,5.0,77.0,76.0,57.0


In [42]:
df.head(3)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0


In [43]:
def nunique(df, columns=None):
    """Print the number of distinct values in each of the given columns."""
    if not columns:
        raise Exception('Add Columns')
    for col in columns:
        num_unique = df[col].nunique()
        print("# unique %s : %s" % (col, num_unique))

In [44]:
nunique(df, columns=['business_id', 'user_id', 'type', 'cool', 'useful', 'funny'])


# unique business_id : 4174
# unique user_id : 6403
# unique type : 1
# unique cool : 29
# unique useful : 28
# unique funny : 29


In [45]:
def unique(df, columns=None):
    """Print the distinct values (and how many there are) in each of the given columns."""
    if not columns:
        raise Exception('Add Columns')
    for col in columns:
        values = df[col].unique()
        print("'%s' has %s unique values \n unique items: \n %s \n\n" % (col, len(values), values))

In [46]:
unique(df, columns=['business_id', 'user_id', 'type', 'cool', 'useful', 'funny'])

'business_id' has 4174 unique values 
 unique items: 
 ['9yKzy9PApeiPPOUJEtnvkg' 'ZRJwVLyzEJq1VAihDhYiow' '6oRAC4uyJCsJl1X0WZpVSA'
 ..., 'qhIlkXgcC4j34lNTIqu9WA' 'JOZqBKIOB8WEBAWm7v1JFA'
 'f96lWMIAUhYIYy9gOktivQ'] 


'user_id' has 6403 unique values 
 unique items: 
 ['rLtl8ZkDX5vH5nAx9C3q5Q' '0a2KyEL0d3Yb1V6aivbIuQ' '0hT2KtfLiobPvh6cDC8JQg'
 ..., 'gGbN1aKQHMgfQZkqlsuwzg' '0lyVoNazXa20WzUyZPLaQQ'
 'KSBFytcdjPKZgXKQnYQdkA'] 


'type' has 1 unique values 
 unique items: 
 ['review'] 


'cool' has 29 unique values 
 unique items: 
 [ 2  0  1  4  7  3  5 11  6  8 16 28 12 13 10 22 17 18  9 14 21 15 19 20 23
 77 27 38 32] 


'useful' has 28 unique values 
 unique items: 
 [ 5  0  1  2  3  7  4  6 16  9 17 19 28  8 15 10 12 23 20 11 13 18 14 24 76
 31 38 30] 


'funny' has 29 unique values 
 unique items: 
 [ 0  1  4  2  3  8  9  6  5 39  7 12 16 20 27 11 13 17 10 30 22 14 19 18 23
 21 15 24 57] 




In [47]:
def mapper(df, column):
    """Map each distinct value in the column to an integer code, in order of first appearance."""
    values = df[column].unique()
    return {value: code for code, value in enumerate(values)}


In [48]:
def unique_mapper(df, columns=None):
    """Replace the values in each given column with integer codes (modifies df in place)."""
    if not columns:
        raise Exception('Add Columns')
    for col in columns:
        map_dict = mapper(df, col)
        df[col] = df[col].map(map_dict)

In [49]:
unique_mapper(df, columns=['user_id', 'business_id', 'review_id'])

In [50]:
df.tail()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
9995,165,2012-07-28,9995,3,First visit...Had lunch here today - used my G...,review,6398,1,2,0
9996,783,2012-01-18,9996,4,Should be called house of deliciousness!\n\nI ...,review,6399,0,0,0
9997,1033,2010-11-16,9997,4,I recently visited Olive and Ivy for business ...,review,6400,0,0,0
9998,2562,2012-12-02,9998,2,My nephew just moved to Scottsdale recently so...,review,6401,0,0,0
9999,341,2010-10-16,9999,5,4-5 locations.. all 4.5 star average.. I think...,review,6402,0,0,0


## Task 1 (Alternative)

Ignore the **`yelp.csv`** file, and instead construct this DataFrame manually using **`yelp.json`**. This involves reading the file into Python, decoding the JSON, converting it to a DataFrame, and adding individual columns for each of the vote types.

**Note:** This may be a challenging task, so I recommend skipping it unless you are fluent with Python and Pandas.
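
To make the "adding individual columns for each of the vote types" step concrete, here is a minimal sketch of flattening one record. The helper name `flatten_review` and the abridged `sample_line` are illustrative assumptions; the field values are taken from the first row shown in this notebook, and most fields are left out for brevity.

```python
import json

def flatten_review(line):
    """Decode one JSON line and promote the nested vote counts to top-level keys."""
    review = json.loads(line)
    votes = review.pop("votes")  # e.g. {"cool": 2, "useful": 5, "funny": 0}
    review.update(votes)
    return review

# Abridged, illustrative record (not the full line from yelp.json)
sample_line = json.dumps({
    "business_id": "9yKzy9PApeiPPOUJEtnvkg",
    "stars": 5,
    "type": "review",
    "votes": {"cool": 2, "useful": 5, "funny": 0},
})

flat = flatten_review(sample_line)
print(sorted(flat.keys()))
```

A list of such flattened dictionaries can then be passed to `pd.DataFrame`, with one column per key.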

In [51]:
json_file = '../data/yelp.json'

reviews = []
with open(json_file, 'r') as f:
    for line in f:
        review = json.loads(line)
        # promote the nested vote counts (cool, useful, funny) to top-level keys
        review.update(review.pop('votes'))
        reviews.append(review)

json_dj = pd.DataFrame(reviews)

In [52]:
json_dj.head(10)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,type,useful,user_id
0,9yKzy9PApeiPPOUJEtnvkg,2,2011-01-26,0,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,5,rLtl8ZkDX5vH5nAx9C3q5Q
1,ZRJwVLyzEJq1VAihDhYiow,0,2011-07-27,0,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0,0a2KyEL0d3Yb1V6aivbIuQ
2,6oRAC4uyJCsJl1X0WZpVSA,0,2012-06-14,0,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,1,0hT2KtfLiobPvh6cDC8JQg
3,_1QQZuf4zZOyFCvXc0o6Vg,1,2010-05-27,0,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,2,uZetl9T0NcROGOyFfughhg
4,6ozycU1RpktNG2-1BroVtw,0,2012-01-05,0,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,0,vYmM4KTsC8ZfQBg-j5MWkw
5,-yxfBYGB6SEqszmxJxd97A,4,2007-12-13,1,m2CKSsepBCoRYWxiRUsxAg,4,"Quiessence is, simply put, beautiful. Full wi...",review,3,sqYN3lNgvPbPCTRsMFu27g
6,zp713qNhx8d9KCJJnrw1xA,7,2010-02-12,4,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,7,wFweIWhv2fREZV_dYkz_1g
7,hW0Ne_HTHEAgGF1rAdmR-g,0,2012-07-12,0,JL7GXJ9u4YMx7Rzs05NfiQ,4,"Luckily, I didn't have to travel far to make m...",review,1,1ieuYcKS7zeAv_U15AB13A
8,wNUea3IXZWD63bbOQaOH-g,0,2012-08-17,0,XtnfnYmnJYi71yIuGsXIUA,4,Definitely come for Happy hour! Prices are ama...,review,0,Vh_DlizgGhSqQh4qfZ2h6A
9,nMHhuYan8e3cONo3PornJA,0,2010-08-11,0,jJAIXA46pU1swYyRCdfXtQ,5,Nobuo shows his unique talents with everything...,review,1,sUNkXg8-KFtCMQDV6zRzQg


In [53]:
len(df.columns) == len(json_dj.columns)

True

In [54]:
nunique(json_dj, columns=['business_id', 'user_id', 'type', 'cool', 'useful', 'funny'])

# unique business_id : 4174
# unique user_id : 6403
# unique type : 1
# unique cool : 29
# unique useful : 28
# unique funny : 29


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** You will need to filter the DataFrame using an OR condition. [Working with DataFrames](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/) has an example of this.

In [55]:
# .copy() makes dfs an independent DataFrame, so the assignment in the next cell
# does not trigger a SettingWithCopyWarning
dfs = df[(df.stars==1) | (df.stars==5)].copy()


In [56]:
dfs['stars'] = dfs.stars.map({5:1, 1:0})


In [57]:
dfs.head(1)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,0,2011-01-26,0,1,My wife took me here on my birthday for breakf...,review,0,2,5,0


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [58]:
X = dfs.text
y = dfs.stars 

# split X and y into training and testing sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [59]:
X.shape

(4086,)

In [60]:
y.shape

(4086,)
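
As a sanity check on the shapes seen in the following tasks, this small sketch works out how many rows end up in each split. It assumes `train_test_split` is called without `test_size` (as above), in which case, to my knowledge, scikit-learn holds out 25% of the rows and rounds the test count up. The helper `split_sizes` is hypothetical, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = int(math.ceil(test_size * n_samples))
    return n_samples - n_test, n_test

print(split_sizes(4086))   # 1-star and 5-star reviews only -> (3064, 1022)
print(split_sizes(10000))  # all reviews -> (7500, 2500)
```

The first result matches the row counts of the training and testing document-term matrices in Task 4.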

## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [61]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
# vect = CountVectorizer()


In [62]:
# learn training data vocabulary and create document-term matrix in one step
# (fit_transform already fits, so a separate vect.fit call is not needed)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<3064x150034 sparse matrix of type '<type 'numpy.int64'>'
	with 307258 stored elements in Compressed Sparse Row format>

In [63]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1022x150034 sparse matrix of type '<type 'numpy.int64'>'
	with 59818 stored elements in Compressed Sparse Row format>
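
Both matrices have the same number of columns because the vocabulary is learned from the training text only, and tokens that appear only in the test text are dropped. The sketch below imitates that behaviour in plain Python. It is a simplification under stated assumptions: it lowercases and uses a two-or-more-character word pattern similar to CountVectorizer's default, but ignores stop words and n-grams, and the function names are illustrative.

```python
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # words of 2+ characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Learn a token -> column index mapping from the training documents."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Count known tokens per document; tokens outside the vocabulary are ignored."""
    columns = sorted(vocab, key=vocab.get)
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
        rows.append([counts[tok] for tok in columns])
    return rows

train_docs = ["Great food, great place", "Terrible food"]
test_docs = ["Great service"]

vocab = fit_vocabulary(train_docs)
print(vocab)                        # food, great, place, terrible -> columns 0..3
print(transform(train_docs, vocab)) # [[1, 2, 1, 0], [1, 0, 0, 1]]
print(transform(test_docs, vocab))  # [[0, 1, 0, 0]]  ("service" is not in the vocabulary)
```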

In [64]:
X_train_tokens = vect.get_feature_names()
    

## Task 5

Use Multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [2]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression

from sklearn import metrics


metrics.classification_report

<function sklearn.metrics.classification.classification_report>

In [66]:
def nb_models(X_train_dtm, X_test_dtm, y_train, y_test, clf=None):
    nb = clf()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    
    y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
    accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
    roc_auc = metrics.roc_auc_score(y_test, y_pred_prob)
    confusion = metrics.confusion_matrix(y_test, y_pred_class)
    
    
    return """Model: %s \n 
    Accuracy Score: %s \n 
    ROC AUC Score: %s \n 
    Confusion Matrix: %s
    
    """ % (clf, accuracy_score, roc_auc, confusion)

In [67]:
print(nb_models(X_train_dtm, X_test_dtm, y_train, y_test, clf=MultinomialNB))

Model: <class 'sklearn.naive_bayes.MultinomialNB'> 
 
    Accuracy Score: 0.853228962818 
 
    ROC AUC Score: 0.770704705821 
 
    Confusion Matrix: [[ 35 149]
 [  1 837]]
    
    


In [68]:
print(nb_models(X_train_dtm, X_test_dtm, y_train, y_test, clf=BernoulliNB))

Model: <class 'sklearn.naive_bayes.BernoulliNB'> 
 
    Accuracy Score: 0.820939334638 
 
    ROC AUC Score: 0.502717391304 
 
    Confusion Matrix: [[  1 183]
 [  0 838]]
    
    


In [69]:
print(nb_models(X_train_dtm, X_test_dtm, y_train, y_test, clf=LogisticRegression))



Model: <class 'sklearn.linear_model.logistic.LogisticRegression'> 
 
    Accuracy Score: 0.930528375734 
 
    ROC AUC Score: 0.967981477638 
 
    Confusion Matrix: [[135  49]
 [ 22 816]]
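
To read these confusion matrices, it helps to recompute the summary numbers by hand. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, in label order 0 then 1), so the matrix is `[[TN, FP], [FN, TP]]` with 5-star reviews as the positive class. The helper `binary_metrics` is illustrative.

```python
def binary_metrics(confusion):
    """Accuracy, sensitivity and specificity from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = confusion
    total = float(tn + fp + fn + tp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / float(tp + fn),  # share of 5-star reviews caught
        "specificity": tn / float(tn + fp),  # share of 1-star reviews caught
    }

# Matrices copied from the outputs above
multinomial_nb = [[35, 149], [1, 837]]
logistic_regression = [[135, 49], [22, 816]]

for name, matrix in [("MultinomialNB", multinomial_nb),
                     ("LogisticRegression", logistic_regression)]:
    print(name, {k: round(v, 4) for k, v in binary_metrics(matrix).items()})
```

The recomputed accuracies agree with the scores printed above, and the specificity makes the weakness of the Naive Bayes model visible: it labels most 1-star reviews as 5-star.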
    
    


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [70]:
print('Null accuracy: %s' % max(y_test.mean(), 1 - y_test.mean()))


Null accuracy: 0.819960861057


In [71]:
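def null_accuracy(y):
    # y is coded 0/1, so its mean is the share of 5-star (class 1) reviews
    five_star_share = y.mean()
    one_star_share = 1 - five_star_share

    return max(five_star_share, one_star_share)

# computed on the full 1-star/5-star subset, not just the test split
null_accuracy(y)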

0.81669114047968672
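
The `max(mean, 1 - mean)` approach only works because the labels are coded 0/1. A more general sketch, which also covers the 5-class case in Task 9, counts the labels directly. The function name is illustrative, and the label lists are rebuilt from the row totals of the confusion matrices in this notebook rather than from the data itself.

```python
from collections import Counter

def most_frequent_class_accuracy(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, float(count) / len(labels)

# Binary test set: 184 one-star (0) and 838 five-star (1) reviews
binary_labels = [0] * 184 + [1] * 838
print(most_frequent_class_accuracy(binary_labels))

# 5-class test set: row totals for stars 1 through 5
star_counts = {1: 185, 2: 234, 3: 365, 4: 884, 5: 832}
multi_labels = [star for star, n in star_counts.items() for _ in range(n)]
print(most_frequent_class_accuracy(multi_labels))
```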

## Task 7 (Challenge)

Calculate which 5 tokens are the most predictive of **5-star reviews**, and which 5 tokens are the most predictive of **1-star reviews**.

- **Hint:** Use the `feature_count_` attribute from the Naive Bayes model object as a shortcut, so that you don't have to do any NumPy math.

In [72]:
clf = MultinomialNB
nb = clf()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
roc_auc = metrics.roc_auc_score(y_test, y_pred_prob)
confusion = metrics.confusion_matrix(y_test, y_pred_class)
nb

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [74]:
features_count = nb.feature_count_ 

print(features_count)

[[ 26.   1.   1. ...,   0.   0.   0.]
 [ 39.   0.   0. ...,   1.   1.   1.]]


In [75]:
features_log = nb.feature_log_prob_
print(features_log)

[[ -9.02763207 -11.63032176 -11.63032176 ..., -12.32346894 -12.32346894
  -12.32346894]
 [ -9.23596182 -12.92484128 -12.92484128 ..., -12.23169409 -12.23169409
  -12.23169409]]


In [76]:
vect.get_feature_names()[:5]

[u'00', u'00 00', u'00 15', u'00 24', u'00 25']

In [77]:
fc_df = pd.DataFrame(features_count, columns=vect.get_feature_names())

In [78]:
fl_df = pd.DataFrame(features_log, columns=vect.get_feature_names())
fl_df

Unnamed: 0,00,00 00,00 15,00 24,00 25,00 30,00 50,00 actually,00 amazing,00 arriving,...,zwiebel,zwiebel kräuter,zzed,zzed pants,éclairs,éclairs napoleons,école,école lenôtre,ém,ém huge
0,-9.027632,-11.630322,-11.630322,-12.323469,-12.323469,-11.630322,-12.323469,-11.630322,-12.323469,-12.323469,...,-12.323469,-12.323469,-12.323469,-12.323469,-12.323469,-12.323469,-12.323469,-12.323469,-12.323469,-12.323469
1,-9.235962,-12.924841,-12.924841,-12.231694,-12.231694,-12.924841,-11.826229,-12.924841,-12.231694,-12.231694,...,-12.231694,-12.231694,-12.231694,-12.231694,-12.231694,-12.231694,-12.231694,-12.231694,-12.231694,-12.231694


In [79]:
fl_df_stacked = fl_df.head(2).stack().unstack(0)

In [80]:
fl_df_stacked.sort_values(1, ascending=False)[:5]

Unnamed: 0,0,1
great,-8.074974,-5.610288
place,-6.367632,-5.620998
food,-6.280836,-5.800363
good,-6.868148,-5.805206
just,-6.586897,-6.066276


In [81]:
fl_df_stacked.sort_values(1, ascending=True)[:5]

Unnamed: 0,0,1
says figured,-11.630322,-12.924841
sank waders,-11.630322,-12.924841
sank,-11.630322,-12.924841
sanitizing tools,-11.630322,-12.924841
sanitizing,-11.630322,-12.924841


In [82]:
fc_df

Unnamed: 0,00,00 00,00 15,00 24,00 25,00 30,00 50,00 actually,00 amazing,00 arriving,...,zwiebel,zwiebel kräuter,zzed,zzed pants,éclairs,éclairs napoleons,école,école lenôtre,ém,ém huge
0,26,1,1,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,39,0,0,1,1,0,2,0,1,1,...,1,1,1,1,1,1,1,1,1,1


In [83]:
fc_df_stacked = fc_df.head(2).stack().unstack(0)

In [84]:
fc_df_stacked.sort_values(1, ascending=False)[:5]

Unnamed: 0,0,1
great,69,1501
place,385,1485
food,420,1241
good,233,1235
just,309,951


In [85]:
fc_df_stacked.sort_values(1, ascending=True)[:5]

Unnamed: 0,0,1
says figured,1,0
sank waders,1,0
sank,1,0
sanitizing tools,1,0
sanitizing,1,0
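
Sorting by the 5-star count alone surfaces the most frequent tokens ("place", "food", "just"), which are also common in 1-star reviews. One possible alternative is to compare how often a token appears in each class, using a smoothed frequency ratio. The sketch below is illustrative only: it uses the five token counts shown above, and the class totals are computed from just those five tokens, so the ratios are not the real ones for the full vocabulary.

```python
def five_star_ratios(counts):
    """Smoothed ratio of 5-star frequency to 1-star frequency for each token.

    counts maps token -> (one_star_count, five_star_count).
    """
    total_one = sum(one for one, five in counts.values())
    total_five = sum(five for one, five in counts.values())
    n_tokens = len(counts)
    ratios = {}
    for token, (one, five) in counts.items():
        p_one = (one + 1.0) / (total_one + n_tokens)    # add-one smoothing
        p_five = (five + 1.0) / (total_five + n_tokens)
        ratios[token] = p_five / p_one
    return ratios

# (1-star count, 5-star count) copied from the feature_count_ output above
counts = {
    "great": (69, 1501),
    "place": (385, 1485),
    "food": (420, 1241),
    "good": (233, 1235),
    "just": (309, 951),
}

ratios = five_star_ratios(counts)
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)
```

Under this measure "great" stands out clearly, while "food" and "just" lean slightly toward 1-star reviews despite their high 5-star counts.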


## Task 8 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

#### Nathan's Answer: 

Many of the misclassified reviews consist mostly of neutral sentences or background, with only one or two clearly negative or positive sentences.


In [86]:
# exercise: find the false positives (1-star reviews predicted as 5-star)
X_test[y_test < y_pred_class].index

Int64Index([2175, 4556, 1048, 1781, 2674, 9984,  995, 2947, 5833,  281,
            ...
            3482, 9299, 3960, 6106, 4311, 7035, 8000, 3755,  507, 9037],
           dtype='int64', length=149)

In [87]:
X_test[2839]

'Never Again,\nI brought my Mountain Bike in (which I bought 1 week earlier from Walmart) to have them replace my flat inner tubes with puncture resistant tubes.)  I came back 20 minutes later when they said it would be done and as I rode away, the back tire rhythmically rubbed against the back brake.  When I returned and told them about it, I was told the back wheel was not true.  It was true for the whole short week since I purchased it from Walmart.  I plan to take it to their competitor (Landis) and pay the extra $15-$25) to have the wheel put back to the straight condition it was in when I brought it to Slippery Pig in the first place.   I would never give a shop another penny for an extra service I should never have needed on a brand new bike which coasted perfectly when I brought it in.'

In [88]:
# exercise: find the false negatives (5-star reviews predicted as 1-star)
X_test[y_test > y_pred_class].index

Int64Index([750], dtype='int64')

In [89]:
X_test[2504]

"I've passed by prestige nails in walmart 100s of times but never really thought of having a pedicure there (even though they are always busy!) As I stared at my feet, long overdue for a pedicure, I thought it was about time to try them...since walmart rarely let's me down why should the nail salon inside?\n\nTo my surprise I got a wonderful pedicure or $23 not too bad this day in age...my to mention it was just as good as going to the more upscale salon just across the street! \n\nI'm glad to be the first to review them they deserve it! Now if only they did facials at walmart and hair I'd be set!"
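
The comparisons `y_test < y_pred_class` and `y_test > y_pred_class` work because the classes are coded 0 (1-star) and 1 (5-star): a false positive has a true label of 0 and a prediction of 1, and a false negative is the reverse. A small sketch with made-up labels (the lists below are hypothetical, not taken from the data):

```python
def misclassified(y_true, y_pred):
    """Return positions of false positives and false negatives for 0/1 labels."""
    pairs = list(enumerate(zip(y_true, y_pred)))
    false_pos = [i for i, (t, p) in pairs if t < p]  # true 0, predicted 1
    false_neg = [i for i, (t, p) in pairs if t > p]  # true 1, predicted 0
    return false_pos, false_neg

y_true = [1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1]
print(misclassified(y_true, y_pred))  # ([1], [2])
```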

## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy.
- Print the confusion matrix.
- Comment on the results.

In [90]:
X = df.text
y = df.stars 

# split X and y into training and testing sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [217]:
# vect = CountVectorizer(stop_words='english')
vect = TfidfVectorizer(sublinear_tf=True, max_df=.5, analyzer='word', stop_words='english', ngram_range=(1, 1))
# vect = HashingVectorizer(stop_words='english', analyzer='word', norm=u'l2', ngram_range=(1, 5))


# learn training data vocabulary, then create document-term matrices
# (fit_transform already fits, so a separate vect.fit call is not needed)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)


In [93]:
y_test.value_counts().head(1) / len(y_test)


4    0.3536
Name: stars, dtype: float64

In [212]:
def nb_multi_models(X_train_dtm, X_test_dtm, y_train, y_test, clf=None):
    nb = clf()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    
    accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
    # ROC AUC is left out: the roc_auc_score call from the binary task does not
    # carry over to 5 classes as written, and printing the old global roc_auc
    # would only repeat the stale value from Task 7
    confusion = metrics.confusion_matrix(y_test, y_pred_class)
    
    print("""Model: %s \n 
    Accuracy Score: %s \n 
    Confusion Matrix: %s
    
    """ % (clf, accuracy_score, confusion))

    return nb

In [218]:
mnb = nb_multi_models(X_train_dtm, X_test_dtm, y_train, y_test, clf=MultinomialNB)

Model: <class 'sklearn.naive_bayes.MultinomialNB'> 
 
    Accuracy Score: 0.4332 
 
    Confusion Matrix: [[  0   0   0 147  38]
 [  0   0   0 214  20]
 [  0   0   1 341  23]
 [  0   0   0 780 104]
 [  0   0   0 530 302]]
    
    


In [219]:
nbern = nb_multi_models(X_train_dtm, X_test_dtm, y_train, y_test, clf=BernoulliNB)

Model: <class 'sklearn.naive_bayes.BernoulliNB'> 
 
    Accuracy Score: 0.4176 
 
    Confusion Matrix: [[ 19  10   7  57  92]
 [ 12  13  22 112  75]
 [  6  15  25 183 136]
 [  8  16  23 410 427]
 [  6  11   9 229 577]]
    
    


In [220]:
lb = nb_multi_models(X_train_dtm, X_test_dtm, y_train, y_test, clf=LogisticRegression)

Model: <class 'sklearn.linear_model.logistic.LogisticRegression'> 
 
    Accuracy Score: 0.5192 
 
    Confusion Matrix: [[ 61  24  17  36  47]
 [ 23  34  40  97  40]
 [  7   9  61 225  63]
 [  1   0  21 586 276]
 [  5   0   2 269 556]]
    
    


In [216]:
# df5 = pd.DataFrame(nbern.feature_count_ , columns=vect.get_feature_names())
# df5_stacked = df5.head(2).stack().unstack(0)
# df5_stacked.sort_values(1, ascending=False)[:5]
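
One way to compare the 5-class confusion matrices is to look at per-class recall and at how many predictions each class receives. The sketch below does this for the MultinomialNB matrix from this task, assuming scikit-learn's layout (rows are true star ratings, columns are predicted ratings). The helper names are illustrative.

```python
def per_class_recall(confusion, labels):
    """Share of each true class that was predicted correctly."""
    return {label: float(row[i]) / sum(row)
            for i, (label, row) in enumerate(zip(labels, confusion))}

def predicted_counts(confusion, labels):
    """Number of test reviews assigned to each class (column totals)."""
    return {label: sum(row[i] for row in confusion)
            for i, label in enumerate(labels)}

stars = [1, 2, 3, 4, 5]
multinomial_nb = [
    [0, 0, 0, 147, 38],
    [0, 0, 0, 214, 20],
    [0, 0, 1, 341, 23],
    [0, 0, 0, 780, 104],
    [0, 0, 0, 530, 302],
]

recall = per_class_recall(multinomial_nb, stars)
print({star: round(value, 3) for star, value in recall.items()})
print(predicted_counts(multinomial_nb, stars))
```

The column totals show that this model almost never predicts 1, 2 or 3 stars, so its accuracy comes almost entirely from the 4-star and 5-star classes.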

### Comments:

Looking at the three models tested, the confusion matrices tell a different story depending on which parameters were passed into the vectorizer. I noticed that the longer the n-grams I passed in, the less likely certain categories were to receive any predictions (either true or false).

As a general note, the advice to review false predictions is well taken. 

I also noticed how much quicker NB was at training and predicting than logistic regression. I imagine that this lag would grow with the size of the dataset. 

In general, it seems apparent that more robust feature extraction is needed.



