## Text Classification Using a Naive Bayes Model

Yelp review data is taken from the Kaggle competition.

**Goal:** Classify review text into star ratings from 1 to 5.

- Data is split into training and test sets.
- Review text is transformed into document-term matrices.
- A **Multinomial Naive Bayes** model is fit on the training data.
- **Accuracy**, a **confusion matrix**, and a **classification report** are generated.
- Predictions are examined by looking at false positives and false negatives.
- The frequency of each token in 1-star and 5-star reviews is calculated to see which tokens contribute most to 1-star and 5-star predictions.
- **Precision**, **recall/sensitivity**, **F1 score**, and **support** are calculated.

In [1]:
import pandas as pd

In [2]:
yelp = pd.read_csv('./data/yelp.csv')

In [15]:
yelp.shape

(10000, 10)

In [16]:
# check the data types of columns
yelp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
business_id    10000 non-null object
date           10000 non-null object
review_id      10000 non-null object
stars          10000 non-null int64
text           10000 non-null object
type           10000 non-null object
user_id        10000 non-null object
cool           10000 non-null int64
useful         10000 non-null int64
funny          10000 non-null int64
dtypes: int64(4), object(6)
memory usage: 781.3+ KB


In [18]:
# 5 random Yelp reviews
yelp.sample(5)

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 1090 | v68yAIw0gpn15xwtcR_rKg | 2012-05-28 | uMTrZsAO-U1IXZeMDo-U5g | 4 | This Dutch Bros had some of the best customer ... | review | sYxD6VdrAaajp-gxim0I_Q | 0 | 1 | 0 |
| 5599 | fP-BPL6iRu2tbcvlnjRshw | 2012-10-21 | 9o9V6DK7LRHgwzYYA7vUfg | 3 | Super cool place I've been to 3-4 times now an... | review | 5mvaOdV7nyKCP2Xct7yDTw | 0 | 0 | 0 |
| 3532 | fJ46ok6poCuLGT1O2M3xBA | 2011-03-08 | 18uXgzZnJRCfBeP3V3tKqQ | 5 | I had an amazing facial here! \nThe space is c... | review | 0o0VMEJeQY0pAAZ9nxErBA | 3 | 2 | 2 |
| 2527 | sYZt3f1YFlg0ycDMyO-vJw | 2010-03-25 | uUr6meezb22Ig40LtvfYyA | 2 | What kind of sports bar doesn't serve SLIDERS?... | review | OksbhhgC71Ary3zNHMypeQ | 0 | 0 | 0 |
| 2567 | 5Feaj6aixO_QxKFnkHf3xg | 2012-03-28 | ExR82nFB6qg5xvc9kU0bnw | 4 | Stayed here with my girlfriend for a Radiohead... | review | vt0y5E2LUFeG0_PZ1b3d_Q | 0 | 0 | 0 |


In [22]:
# class distribution
yelp.stars.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

In [27]:
# First, keep only two classes: 5-star and 1-star reviews
# (stored in its own variable rather than as an attribute on the DataFrame)
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

# equivalently, using the 'loc' indexer
yelp_best_worst = yelp.loc[(yelp.stars == 5) | (yelp.stars == 1), :]

In [28]:
# Examine the shape
yelp_best_worst.shape

(4086, 10)

In [31]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

In [33]:
# Split into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
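
By default `train_test_split` holds out 25% of the rows, which explains the 3064/1022 split that shows up in the matrix shapes further down. The sketch below reproduces those sizes with stand-in data that has the same class counts as the two-class subset (749 one-star, 3337 five-star). The placeholder texts and the `stratify=y` variant are illustrative assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the two-class subset
# (749 one-star and 3337 five-star reviews); the texts are placeholders.
y = np.array([1] * 749 + [5] * 3337)
X = np.array(["review %d" % i for i in range(len(y))])

# Default test_size is 0.25, so 4086 rows split into 3064 train / 1022 test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train), len(X_test))

# Optional (not used in the notebook): stratify=y keeps the 1-star/5-star
# proportions nearly identical in both splits.
_, _, y_train_s, y_test_s = train_test_split(X, y, random_state=1, stratify=y)
print((y_train_s == 5).mean(), (y_test_s == 5).mean())
```

With classes this imbalanced, stratifying may be worth considering so that the test set does not end up with an unrepresentative share of 1-star reviews.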

In [44]:
# Imports for building document-term matrices and fitting the model
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [68]:
# instantiate a CountVectorizer
vect = CountVectorizer()

# learn the vocabulary from the training set and build its document-term matrix
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 16825)

In [69]:
# build the test document-term matrix using the vocabulary learned from the training set
X_test_dtm = vect.transform(X_test)
X_test_dtm.shape

(1022, 16825)
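
Both matrices have 16825 columns because `transform` reuses the vocabulary learned by `fit_transform`; tokens that appear only in the test set are simply dropped. A minimal sketch of that behaviour on a few made-up reviews (the documents and the `toy_*` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews, for illustration only
train_docs = ["great food great service", "terrible food"]
test_docs = ["great pizza"]  # "pizza" never appears in the training docs

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)  # learns the vocabulary
test_dtm = toy_vect.transform(test_docs)        # reuses it, learns nothing new

print(sorted(toy_vect.vocabulary_))  # ['food', 'great', 'service', 'terrible']
print(train_dtm.toarray())           # one row per document, one column per token
print(test_dtm.toarray())            # only the "great" column is non-zero
```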

In [70]:
# create an instance of naive bayes model and train the model
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [73]:
# a pipeline can perform vectorization and model fitting in a single step
alt_model = make_pipeline(
    CountVectorizer(), MultinomialNB()
).fit(X_train, y_train)

# Predict classes for the test reviews; the pipeline takes raw text and vectorizes it internally
alt_test_pred = alt_model.predict(X_test)
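
A pipeline bundles the vectorizer and the classifier, so it is fit and queried with raw text rather than with document-term matrices. The sketch below, on made-up reviews with hypothetical `toy_*` names, checks that the two routes give the same predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews and star labels, for illustration only
docs = ["great food", "amazing service", "terrible food", "rude staff"]
stars = [5, 5, 1, 1]
new_docs = ["amazing food", "rude staff"]

# Two-step version: vectorize, then fit and predict on the matrices
toy_vect = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vect.fit_transform(docs), stars)
two_step_pred = toy_nb.predict(toy_vect.transform(new_docs))

# Pipeline version: fit and predict directly on raw text
toy_pipe = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, stars)
pipe_pred = toy_pipe.predict(new_docs)

print(two_step_pred, pipe_pred)
```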

In [74]:
# Predict classes for the test set reviews
test_pred = nb.predict(X_test_dtm)

In [75]:
# calculate accuracy
from sklearn import metrics
metrics.accuracy_score(y_test, test_pred)

0.9187866927592955

In [48]:
# confusion matrix
print(metrics.confusion_matrix(y_test, test_pred))

[[126  58]
 [ 25 813]]
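
The four cells of this matrix are enough to recover the accuracy reported above and to see how unevenly the two classes are handled. A small sketch, treating 5-star as the positive class (the same convention the false positive and false negative cells use):

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions,
# both ordered [1-star, 5-star]. Here 5-star is treated as the "positive" class.
cm = np.array([[126, 58],
               [25, 813]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # matches accuracy_score above
sensitivity = tp / (tp + fn)      # share of 5-star reviews predicted as 5-star
specificity = tn / (tn + fp)      # share of 1-star reviews predicted as 1-star

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```

The gap between sensitivity and specificity shows that most of the errors are 1-star reviews predicted as 5-star.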


In [49]:
# examine the class distribution of the test data set
y_test.value_counts()

5    838
1    184
Name: stars, dtype: int64

In [51]:
# calculate null accuracy: the accuracy of always predicting the most frequent class
# about 82.0% of the test reviews are 5-star reviews
y_test.value_counts().head(1) / len(y_test)

5    0.819961
Name: stars, dtype: float64
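
Null accuracy is the score a model would get by always predicting the most frequent class, so it is the baseline the 0.9188 accuracy should be compared against. One possible way to compute it with `value_counts(normalize=True)`, using labels rebuilt from the counts shown above (the `y_test_toy` name is hypothetical):

```python
import pandas as pd

# Rebuild the test labels from the counts shown above (838 five-star, 184 one-star)
y_test_toy = pd.Series([5] * 838 + [1] * 184, name="stars")

# Null accuracy: the accuracy of always predicting the most frequent class
null_accuracy = y_test_toy.value_counts(normalize=True).max()
print(round(null_accuracy, 6))
```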

In [55]:
# Now let's inspect the predictions, looking at false positives and false negatives

# first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)
X_test[y_test < test_pred].head(10)

2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
2674    I'm sorry to be what seems to be the lone one ...
9984    Went last night to Whore Foods to get basics t...
3392    I found Lisa G's while driving through phoenix...
8283    Don't know where I should start. Grand opening...
2765    Went last week, and ordered a dozen variety. I...
2839    Never Again,\nI brought my Mountain Bike in (w...
321     My wife and I live around the corner, hadn't e...
1919                                         D-scust-ing.
Name: text, dtype: object
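
The comparison `y_test < test_pred` selects false positives only because the two labels here are 1 and 5, so "true label smaller than prediction" can only mean a 1-star review predicted as 5-star. A sketch with explicit masks on made-up labels (all names and values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up labels and predictions, for illustration only
y_true = pd.Series([1, 5, 1, 5, 5], index=[10, 11, 12, 13, 14])
y_pred = np.array([5, 5, 1, 1, 5])

# Explicit masks, treating 5-star as the positive class
false_pos = (y_true == 1) & (y_pred == 5)   # 1-star predicted as 5-star
false_neg = (y_true == 5) & (y_pred == 1)   # 5-star predicted as 1-star

# With only the labels 1 and 5, these match the shortcut comparisons
assert false_pos.equals(y_true < y_pred)
assert false_neg.equals(y_true > y_pred)

print(list(y_true.index[false_pos]), list(y_true.index[false_neg]))
```

The explicit form stays correct if more classes are added, whereas the shortcut would then mix several kinds of error together.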

In [61]:
# The model is likely reacting to the words "good", "impressive" and "nice";
# a bag-of-words model cannot tell that "isn't" negates "impressive"
yelp.loc[1781, "text"]
X_test[1781]

"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating."

In [62]:
# a review this short gives the model too little text to classify reliably

X_test[1919]

'D-scust-ing.'
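
It helps to look at what the vectorizer actually extracts from this review. With default settings, `CountVectorizer` lowercases the text and keeps only runs of two or more word characters, so the hyphenated spelling is broken into fragments:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default analyzer lowercases text and keeps runs of 2+ word characters
analyze = CountVectorizer().build_analyzer()
pieces = analyze('D-scust-ing.')
print(pieces)  # ['scust', 'ing']
```

Neither fragment is likely to carry much 1-star signal, so the prediction probably leans on the class prior, which favours 5-star reviews.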

In [64]:
# first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)
X_test[y_test > test_pred].head(10)

7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
3448    I was there last week with my sisters and whil...
6050    I went to sears today to check on a layaway th...
2504    I've passed by prestige nails in walmart 100s ...
2475    This place is so great! I am a nanny and had t...
241     I was sad to come back to lai lai's and they n...
Name: text, dtype: object

In [65]:
# false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum"
X_test[4963]

'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I\'m in. The shoe SA\'s will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! \n\nI am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney\'s is not only the prompt attention of SA\'s, but the fact that they aren\'t rushing around trying to help 35 people at once. The SA\'s at Barney\'s are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. I hav

In [77]:
# Calculate which 10 tokens are most predictive

# store the vocabulary of X_train
# (scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2)
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

16825

In [78]:
# store the number of times each token appears in each class (row 0: one-star, row 1: five-star)
one_star_token_count = nb.feature_count_[0, :]
five_star_token_count = nb.feature_count_[1, :]

In [80]:
# create a DataFrame of tokens with their separate one-star and five-star counts
tokens = pd.DataFrame({
    'token':X_train_tokens,
    'one_star':one_star_token_count,
    'five_star':five_star_token_count}
).set_index('token')

In [81]:
# add 1 to one-star and five-star counts to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1

In [82]:
# first number is one-star reviews, second number is five-star reviews
nb.class_count_

array([ 565., 2499.])

In [83]:
# convert the counts into frequencies by dividing by the number of reviews in each class
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]

In [84]:
# calculate the ratio of five-star to one-star for each token
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star

In [86]:
# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows
tokens.sort_values('five_star_ratio', ascending=False).head(10)

| token | five_star | one_star | five_star_ratio |
|---|---|---|---|
| fantastic | 0.077231 | 0.00354 | 21.817727 |
| perfect | 0.098039 | 0.00531 | 18.464052 |
| yum | 0.02481 | 0.00177 | 14.017607 |
| favorite | 0.138055 | 0.012389 | 11.143029 |
| outstanding | 0.019608 | 0.00177 | 11.078431 |
| brunch | 0.016807 | 0.00177 | 9.495798 |
| gem | 0.016006 | 0.00177 | 9.043617 |
| mozzarella | 0.015606 | 0.00177 | 8.817527 |
| pasty | 0.015606 | 0.00177 | 8.817527 |
| amazing | 0.185274 | 0.021239 | 8.723323 |


In [87]:
# sort the DataFrame by five_star_ratio (ascending order),
# and examine the first 10 rows, tokens contributing to 1-star reviews
tokens.sort_values('five_star_ratio', ascending=True).head(10)

| token | five_star | one_star | five_star_ratio |
|---|---|---|---|
| staffperson | 0.0004 | 0.030088 | 0.013299 |
| refused | 0.0004 | 0.024779 | 0.016149 |
| disgusting | 0.0008 | 0.042478 | 0.018841 |
| filthy | 0.0004 | 0.019469 | 0.020554 |
| unprofessional | 0.0004 | 0.015929 | 0.025121 |
| unacceptable | 0.0004 | 0.015929 | 0.025121 |
| acknowledge | 0.0004 | 0.015929 | 0.025121 |
| ugh | 0.0008 | 0.030088 | 0.026599 |
| fuse | 0.0004 | 0.014159 | 0.028261 |
| boca | 0.0004 | 0.014159 | 0.028261 |
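
The ratio in these tables combines three steps: add 1 to each count, divide by the number of reviews in the class, then divide the five-star frequency by the one-star frequency. The sketch below wraps those steps in a hypothetical helper and checks it against the "fantastic" row. The raw counts (1 and 192) are back-computed from that row and are an assumption, not values read from the data.

```python
import numpy as np

def five_star_ratio(one_star_count, five_star_count, n_one_star, n_five_star):
    """Smoothed five-star to one-star frequency ratio, following the cells above."""
    one_star_freq = (np.asarray(one_star_count) + 1) / n_one_star
    five_star_freq = (np.asarray(five_star_count) + 1) / n_five_star
    return five_star_freq / one_star_freq

# Class sizes come from nb.class_count_ above. The raw counts (1 and 192) are
# back-computed from the table row for "fantastic", not read from the data.
ratio = five_star_ratio(one_star_count=1, five_star_count=192,
                        n_one_star=565, n_five_star=2499)
print(round(float(ratio), 4))
```

Dividing by the class sizes matters here because there are roughly four times as many five-star as one-star training reviews; without it, almost every token would look like a five-star token.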


In [89]:
# Multiclass classification: now predict all five star ratings
# define X and y using the original DataFrame
X = yelp.text
y = yelp.stars

In [91]:
# check distribution of classes
y.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

In [92]:
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [93]:
# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [94]:
# fit a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [95]:
# make class predictions
test_pred = nb.predict(X_test_dtm)

In [97]:
# calculate the accuracy
metrics.accuracy_score(y_test, test_pred)

0.4712

In [98]:
# calculate the null accuracy (always predicting the most frequent class)
y_test.value_counts().head(1) / len(y_test)

4    0.3536
Name: stars, dtype: float64

In [99]:
# print the confusion matrix
metrics.confusion_matrix(y_test, test_pred)

array([[ 55,  14,  24,  65,  27],
       [ 28,  16,  41, 122,  27],
       [  5,   7,  35, 281,  37],
       [  7,   0,  16, 629, 232],
       [  6,   4,   6, 373, 443]], dtype=int64)
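
Reading a five-class matrix takes a little more care than the two-class case. The sketch below recomputes the accuracy from the diagonal and measures how often the model predicts 4 or 5 stars; the matrix is copied from the output above, and the `predicted_4_or_5` and `actual_4_or_5` names are hypothetical.

```python
import numpy as np

# Five-class confusion matrix from above (rows: true stars 1-5, columns: predicted)
cm = np.array([[55, 14, 24, 65, 27],
               [28, 16, 41, 122, 27],
               [5, 7, 35, 281, 37],
               [7, 0, 16, 629, 232],
               [6, 4, 6, 373, 443]])

accuracy = np.trace(cm) / cm.sum()   # correct predictions sit on the diagonal

# Share of test reviews predicted as 4 or 5 stars vs. share that truly are
predicted_4_or_5 = cm[:, 3:].sum() / cm.sum()
actual_4_or_5 = cm[3:, :].sum() / cm.sum()

print(accuracy, round(predicted_4_or_5, 4), round(actual_4_or_5, 4))
```

The model assigns 4 or 5 stars to about 89% of the test reviews although only about 69% actually have those ratings, which suggests it leans heavily toward the two majority classes.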

In [100]:
# print the classification report
print(metrics.classification_report(y_test, test_pred))

             precision    recall  f1-score   support

          1       0.54      0.30      0.38       185
          2       0.39      0.07      0.12       234
          3       0.29      0.10      0.14       365
          4       0.43      0.71      0.53       884
          5       0.58      0.53      0.55       832

avg / total       0.46      0.47      0.43      2500



In [101]:
# Precision: of all the samples predicted as positive, how many are actually positive?
# Precision (P) = (correctly predicted positives) / (total predicted positives)
# P = TP / (TP + FP)

# calculate precision for class 1 (first column of the confusion matrix)
precision = 55 / float(55 + 28 + 5 + 7 + 6)
print(precision)

0.5445544554455446


In [102]:
# Recall: of all the actual positive samples, how many did we predict as positive?
# Recall (R) = (correctly predicted positives) / (total actual positives) = TP / (TP + FN)

# calculate recall/sensitivity for class 1 (first row of the confusion matrix)
recall = 55 / float(55 + 14 + 24 + 65 + 27)
print(recall)

0.2972972972972973


In [106]:
# F1 score is the harmonic mean of precision and recall.
# manually calculate the F1 score for class 1
f1 = 2 * (precision * recall) / (precision + recall)
print(f1)

0.38461538461538464


In [105]:
# support: How many observations exist for which a given class is the true class?
# manually calculate the support for class 1
support = 55 + 14 + 24 + 65 + 27
print(support)

185
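
The manual class-1 calculations extend to every class at once: precision divides the diagonal by the column sums, recall divides it by the row sums, and support is the row sums themselves. A sketch with NumPy, using the confusion matrix printed earlier (the variable names are illustrative):

```python
import numpy as np

# Five-class confusion matrix from above (rows: true stars 1-5, columns: predicted)
cm = np.array([[55, 14, 24, 65, 27],
               [28, 16, 41, 122, 27],
               [5, 7, 35, 281, 37],
               [7, 0, 16, 629, 232],
               [6, 4, 6, 373, 443]])

tp = np.diag(cm)                  # correct predictions per class
precision = tp / cm.sum(axis=0)   # column sums: everything predicted as that class
recall = tp / cm.sum(axis=1)      # row sums: everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)

for star, p, r, f, s in zip(range(1, 6), precision, recall, f1, support):
    print(f"{star}  {p:.2f}  {r:.2f}  {f:.2f}  {s}")
```

Rounded to two decimals, these values should agree with the rows of the classification report.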
