#### Exercise 3.04: Text Classification – Naive Bayes
In this exercise, we will classify Amazon reviews of musical instruments as positive or not using the Naive Bayes classification algorithm.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from pylab import *

# Download the tokenizer and lemmatizer resources used in the cleaning step
import nltk
nltk.download('punkt')
nltk.download('wordnet')

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /Users/LNonyane/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
review_data = pd.read_json('data/reviews_Musical_Instruments_5.json', lines=True)
review_data[['reviewText', 'overall']].head()

Unnamed: 0,reviewText,overall
0,"Not much to write about here, but it does exac...",5
1,The product does exactly as it should and is q...,5
2,The primary job of this device is to block the...,5
3,Nice windscreen protects my MXL mic and preven...,5
4,This pop filter is great. It looks and perform...,5


In [3]:
lemmatizer = WordNetLemmatizer()
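
def clean_review(text):
    # Replace punctuation and underscores with spaces, then tokenize,
    # lowercase, and lemmatize each token.
    text = re.sub(r'([^\s\w]|_)+', ' ', str(text))
    return ' '.join(lemmatizer.lemmatize(word.lower())
                    for word in word_tokenize(text))

review_data['cleaned_review_text'] = review_data['reviewText'].apply(clean_review)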
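
To see what the regular expression in the cleaning step does on its own, here is a minimal sketch that applies the same pattern to a made-up review. The sample string and the helper name `strip_punctuation` are hypothetical, and `str.split()` stands in for `word_tokenize` so that the snippet needs no NLTK data.

```python
import re

# Same pattern as in the cleaning step: one or more characters that are
# either not whitespace/word characters (punctuation) or underscores.
PUNCTUATION = re.compile(r'([^\s\w]|_)+')


def strip_punctuation(text):
    """Replace each run of punctuation or underscores with a single space."""
    return PUNCTUATION.sub(' ', str(text))


sample = "Great mic_stand!!! Works well, no issues."  # hypothetical review text
cleaned = strip_punctuation(sample)

# split() is a simple stand-in for word_tokenize in this sketch.
tokens = cleaned.lower().split()
print(tokens)
```

Note that the underscore is handled separately because `\w` treats it as a word character, so `[^\s\w]` alone would leave it in place.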

In [4]:
review_data[['cleaned_review_text', 'reviewText', 'overall']].head()

Unnamed: 0,cleaned_review_text,reviewText,overall
0,not much to write about here but it doe exactl...,"Not much to write about here, but it does exac...",5
1,the product doe exactly a it should and is qui...,The product does exactly as it should and is q...,5
2,the primary job of this device is to block the...,The primary job of this device is to block the...,5
3,nice windscreen protects my mxl mic and preven...,Nice windscreen protects my MXL mic and preven...,5
4,this pop filter is great it look and performs ...,This pop filter is great. It looks and perform...,5


In [5]:
tfidf_model = TfidfVectorizer(max_features=500)
# fit_transform() returns a sparse matrix; todense() converts it to a dense one for the DataFrame
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(review_data['cleaned_review_text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

Unnamed: 0,10,100,12,20,34,able,about,accurate,acoustic,actually,...,won,work,worked,worth,would,wrong,year,yet,you,your
0,0.0,0.0,0.0,0.0,0.0,0.0,0.159684,0.0,0.0,0.0,...,0.0,0.134327,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.085436,0.0,0.0,0.0,0.0,0.0,0.0,0.067074,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.115312,0.0,0.0,0.0,0.07988,0.111989
3,0.0,0.0,0.0,0.0,0.0,0.339573,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.303608,0.0
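
The weights in the table above can be hard to interpret without seeing how they arise. The following is a minimal pure-Python sketch of TF-IDF weighting, assuming the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which is how I understand the `TfidfVectorizer` defaults. The helper name `tfidf` and the three toy documents are hypothetical, and the sketch does not reproduce the vectorizer's tokenization or its `max_features` filtering.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        # Scale each row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocabulary, rows


docs = ["good mic", "good stand", "bad cable"]  # hypothetical cleaned reviews
vocabulary, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, dict(zip(vocabulary, (round(w, 3) for w in row))))
```

In the first toy document, "mic" receives a higher weight than "good" because "good" also occurs in another document, which is the behavior TF-IDF is designed to produce.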


Create a new column, target, which is 0 if the overall rating is 4 or lower, and 1 otherwise (that is, only 5-star reviews are labeled 1).

In [6]:
review_data['target'] = review_data['overall'].apply(lambda x : 0 if x<=4 else 1)
review_data['target'].value_counts()

1    6938
0    3323
Name: target, dtype: int64

Use sklearn's GaussianNB class to fit a Gaussian Naive Bayes model on the TF-IDF representation of the cleaned reviews.

In [7]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()  # create a Gaussian Naive Bayes classifier
nb.fit(tfidf_df, review_data['target'])  # train the model
predicted_labels = nb.predict(tfidf_df)  # predict on the same data used for training
nb.predict_proba(tfidf_df)[:, 1]  # predicted probability of class 1

array([9.97730158e-01, 3.63599675e-09, 9.45692105e-07, ...,
       2.46001047e-02, 3.43660991e-08, 1.72767906e-27])
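
To make the fitting and prediction steps less of a black box, here is a minimal pure-Python sketch of the Gaussian Naive Bayes idea: estimate a prior, a mean, and a variance per feature for each class, then combine the per-feature Gaussian log-densities with the log-prior. The function names `fit_gaussian_nb` and `predict_proba`, the toy data, and the `var_smoothing` handling (loosely modeled on the scikit-learn parameter of the same name) are assumptions for illustration, not the library's implementation.

```python
import math


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate per-class prior, feature means, and feature variances."""
    n_features = len(X[0])

    # Small constant added to every variance for numerical stability
    # (assumption: a fraction of the largest overall feature variance).
    overall_vars = []
    for j in range(n_features):
        column = [row[j] for row in X]
        mean = sum(column) / len(column)
        overall_vars.append(sum((v - mean) ** 2 for v in column) / len(column))
    epsilon = var_smoothing * max(overall_vars)

    model = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + epsilon
            for j in range(n_features)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict_proba(model, x):
    """Return {label: posterior probability} for one feature vector."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        total = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            # Log of the Gaussian density for this feature.
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        log_joint[label] = total

    # Normalize in log space to avoid underflow.
    top = max(log_joint.values())
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return {label: math.exp(v - top) / norm for label, v in log_joint.items()}


# Hypothetical two-feature data: class 0 is high on feature 2, class 1 on feature 1.
X = [[0.0, 0.9], [0.1, 0.8], [0.9, 0.1], [0.8, 0.0]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
probs = predict_proba(model, [0.85, 0.1])
print(probs)
```

Because the log-densities of all features are summed, the resulting posteriors tend to be pushed very close to 0 or 1 when there are many features, which is consistent with the extreme values in the output above.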

Use the crosstab function of pandas to compare the results of our classification model with the actual classes ('target', in this case) of the reviews.

In [8]:
review_data['predicted_labels'] = predicted_labels
pd.crosstab(review_data['target'], review_data['predicted_labels'])

target \ predicted_labels,0,1
0,2333,990
1,2380,4558


Here, we can see 2333 instances with the target label 0 that are correctly classified and 990 such instances that have been wrongly classified. Furthermore, 4558 instances with the target label 1 have been correctly classified, whereas 2380 such instances have been wrongly classified.

Relative to our earlier results, classification of the 0 label improved, but performance on the 1 label dropped.
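
The raw counts in the crosstab are easier to compare once they are converted to rates. The sketch below takes the four counts reported above and derives accuracy, per-class recall and precision, and the accuracy of always predicting the majority class. The variable and function names are illustrative only.

```python
# Counts copied from the crosstab output; keys are (actual, predicted).
confusion = {
    (0, 0): 2333, (0, 1): 990,
    (1, 0): 2380, (1, 1): 4558,
}
labels = (0, 1)
total = sum(confusion.values())

# Fraction of all reviews whose predicted label matches the target.
accuracy = sum(confusion[(k, k)] for k in labels) / total


def recall(label):
    """Share of reviews with this actual label that were predicted correctly."""
    actual = sum(confusion[(label, p)] for p in labels)
    return confusion[(label, label)] / actual


def precision(label):
    """Share of reviews predicted as this label that actually have it."""
    predicted = sum(confusion[(a, label)] for a in labels)
    return confusion[(label, label)] / predicted


# Accuracy obtained by always predicting the most frequent actual class.
class_sizes = {a: sum(confusion[(a, p)] for p in labels) for a in labels}
majority_baseline = max(class_sizes.values()) / total

print(f"accuracy: {accuracy:.3f}")
for k in labels:
    print(f"label {k}: recall={recall(k):.3f}, precision={precision(k):.3f}")
print(f"majority-class baseline: {majority_baseline:.3f}")
```

With these counts, accuracy is about 0.672, while always predicting label 1 would score about 0.676. Since the predictions were made on the same data the model was trained on, these figures are likely optimistic estimates of performance on unseen reviews.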