### Exercise 3.03: Text Classification – Logistic Regression
In this exercise, we will classify reviews of musical instruments on Amazon with the help of the logistic regression classification algorithm.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from pylab import *
import nltk

# Download the tokenizer and lemmatizer resources (skipped if already present)
nltk.download('punkt')
nltk.download('wordnet')

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /Users/LNonyane/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
review_data = pd.read_json('data/reviews_Musical_Instruments_5.json',lines=True)
review_data[['reviewText', 'overall']].head()

,reviewText,overall
0,"Not much to write about here, but it does exac...",5
1,The product does exactly as it should and is q...,5
2,The primary job of this device is to block the...,5
3,Nice windscreen protects my MXL mic and preven...,5
4,This pop filter is great. It looks and perform...,5


Use a lambda function to tokenize each 'reviewText' entry of this DataFrame, lowercase and lemmatize the tokens, and use the join function to concatenate the resulting list of words back into a single space-separated string. Before tokenizing, use a regular expression (re.sub) to replace every run of characters other than letters, digits, and whitespace with a single space.

In [4]:
lemmatizer = WordNetLemmatizer()
review_data['cleaned_review_text'] = review_data['reviewText'].apply(
    lambda x: ' '.join(
        # Lowercase and lemmatize every token
        lemmatizer.lemmatize(word.lower())
        # Replace punctuation and underscores with spaces, then tokenize
        for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))
    )
)
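
To see what the regular expression in this step does on its own, here is a minimal sketch that leaves NLTK out. The sample text is made up, and `str.split()` is only a rough stand-in for `word_tokenize`; the helper name `strip_punctuation` is hypothetical.

```python
import re

# Same pattern as in the cell above: one or more characters that are
# neither whitespace nor word characters, or that are underscores.
# (\w already matches the underscore, hence the explicit "|_" alternative.)
PATTERN = re.compile(r'([^\s\w]|_)+')

def strip_punctuation(text):
    """Replace each run of punctuation or underscores with a single space."""
    return PATTERN.sub(' ', str(text))

sample = "Great pop_filter!!! Works well... 5/5"  # made-up example text
cleaned = strip_punctuation(sample)

# Lowercase and split on whitespace (a simplification of word_tokenize)
tokens = cleaned.lower().split()
print(tokens)
```

Because digits count as word characters, the "5/5" in the sample survives as two separate tokens rather than being removed.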

In [5]:
review_data['cleaned_review_text']

0        not much to write about here but it doe exactl...
1        the product doe exactly a it should and is qui...
2        the primary job of this device is to block the...
3        nice windscreen protects my mxl mic and preven...
4        this pop filter is great it look and performs ...
                               ...                        
10256                   great just a expected thank to all
10257    i ve been thinking about trying the nanoweb st...
10258    i have tried coated string in the past includi...
10259    well made by elixir and developed with taylor ...
10260    these string are really quite good but i would...
Name: cleaned_review_text, Length: 10261, dtype: object

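Preview the cleaned text next to the original reviewText and the overall rating. Note that WordNetLemmatizer treats every token as a noun unless told otherwise, which is why "does" becomes "doe" and "as" becomes "a" in the cleaned text.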

In [6]:
review_data[['cleaned_review_text', 'reviewText', 'overall']].head()

,cleaned_review_text,reviewText,overall
0,not much to write about here but it doe exactl...,"Not much to write about here, but it does exac...",5
1,the product doe exactly a it should and is qui...,The product does exactly as it should and is q...,5
2,the primary job of this device is to block the...,The primary job of this device is to block the...,5
3,nice windscreen protects my mxl mic and preven...,Nice windscreen protects my MXL mic and preven...,5
4,this pop filter is great it look and performs ...,This pop filter is great. It looks and perform...,5


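Create a TF-IDF matrix from the cleaned reviews, keeping only the 500 most frequent terms (max_features=500), and transform it into a DataFrame whose columns are the vocabulary terms.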

In [7]:
tfidf_model = TfidfVectorizer(max_features=500)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(review_data['cleaned_review_text']).todense()) # todense() creates matrix
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

,10,100,12,20,34,able,about,accurate,acoustic,actually,...,won,work,worked,worth,would,wrong,year,yet,you,your
0,0.0,0.0,0.0,0.0,0.0,0.0,0.159684,0.0,0.0,0.0,...,0.0,0.134327,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.085436,0.0,0.0,0.0,0.0,0.0,0.0,0.067074,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.115312,0.0,0.0,0.0,0.07988,0.111989
3,0.0,0.0,0.0,0.0,0.0,0.339573,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.303608,0.0
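
To make the numbers in this matrix less opaque, the following sketch recomputes TF-IDF weights by hand for a toy corpus. It assumes the TfidfVectorizer defaults (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row) and simplifies tokenization to a whitespace split. The function name `tfidf_rows` and the three documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed idf and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["great pop filter", "great string", "string sound great"]  # toy corpus
vocab, rows = tfidf_rows(docs)

for row in rows:
    print([round(w, 3) for w in row])
```

In the first toy document, "great" receives a lower weight than "filter" because it occurs in every document, which is the same effect that keeps very common words from dominating the review vectors.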


Create a new column target, which will have 0 if the overall rating is 4 or lower, and 1 otherwise (that is, only 5-star reviews are labeled 1).

In [8]:
review_data['target'] = review_data['overall'].apply(lambda x : 0 if x<=4 else 1)
review_data['target'].value_counts()

1    6938
0    3323
Name: target, dtype: int64
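
The labeling rule can be checked in isolation with a small sketch. The ratings below are hypothetical, and `to_target` is just a named version of the lambda used in the cell above.

```python
from collections import Counter

def to_target(overall):
    """Return 1 for a 5-star rating and 0 for ratings of 4 or lower."""
    return 0 if overall <= 4 else 1

ratings = [5, 4, 3, 5, 1, 5]  # hypothetical ratings, not taken from the dataset
targets = [to_target(r) for r in ratings]
counts = Counter(targets)

print(targets)
print(counts)
```

In pandas, the same labels could also be produced without apply, for example with `(review_data['overall'] > 4).astype(int)`.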

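Use sklearn's LogisticRegression class to fit a logistic regression model on the TF-IDF representation of the cleaned reviews. Then predict labels for the same reviews; predict_proba(tfidf_df)[:,1] returns the estimated probability that each review belongs to class 1.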

In [9]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression() # log regression class instance
logreg.fit(tfidf_df,review_data['target']) # training model
predicted_labels = logreg.predict(tfidf_df)
logreg.predict_proba(tfidf_df)[:,1]

array([0.57146961, 0.68579907, 0.56068939, ..., 0.65979968, 0.5495679 ,
       0.21186011])
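
The relationship between these probabilities and the predicted labels can be sketched as follows. In the binary case, a logistic regression model passes a weighted sum of the features through the sigmoid function, and the predicted label is 1 when the resulting probability exceeds 0.5. The weights, bias, and feature row below are hypothetical; only the three probabilities at the end are taken from the output above.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(features, weights, bias):
    """Probability of class 1 for one row of features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_label(probability, threshold=0.5):
    """Turn a class-1 probability into a 0/1 label."""
    return 1 if probability > threshold else 0

# Hypothetical model parameters and a hypothetical two-feature row
weights, bias = [1.5, -2.0], 0.1
row = [0.4, 0.3]
p = predict_proba_positive(row, weights, bias)

# Last three probabilities from the output above
shown = [0.65979968, 0.5495679, 0.21186011]
labels = [predict_label(q) for q in shown]
print(round(p, 4), labels)
```

Under this rule, the last review shown (probability of about 0.21) is assigned to class 0, while the two before it are assigned to class 1.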

Use the crosstab function of pandas to compare the results of our classification model with the actual classes ('target', in this case) of the reviews.

In [11]:
review_data['predicted_labels'] = predicted_labels # create feature 'predicted_labels' in df review_data and assign predicted_labels values
pd.crosstab(review_data['target'], review_data['predicted_labels'])

target,predicted_labels=0,predicted_labels=1
0,1543,1780
1,626,6312


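Here, we can see that 1543 instances with the target label 0 are correctly classified and 1780 such instances are wrongly classified as 1. Furthermore, 6312 instances with the target label 1 are correctly classified, whereas 626 such instances are wrongly classified as 0. This corresponds to an accuracy of about 76.6% (7855 of 10261 reviews). Because the model is evaluated on the same data it was trained on, this figure likely overstates how well it would perform on unseen reviews.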
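
The counts in the crosstab are enough to derive the usual summary metrics. The sketch below uses only the four numbers shown above and treats class 1 as the positive class; the variable names are illustrative.

```python
# Counts from the crosstab (rows: target, columns: predicted_labels)
tn, fp = 1543, 1780   # target 0: predicted 0, predicted 1
fn, tp = 626, 6312    # target 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total       # share of all reviews classified correctly
precision = tp / (tp + fp)         # share of predicted 1s that are truly 1
recall = tp / (tp + fn)            # share of true 1s that were found
specificity = tn / (tn + fp)       # share of true 0s that were found

# Accuracy of always predicting the majority class (1)
majority_baseline = (tp + fn) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} "
      f"baseline={majority_baseline:.3f}")
```

Comparing accuracy with the majority-class baseline, and recall with specificity, shows how much better the model handles class 1 than class 0.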