## Sentiment Analysis
##### In this section, we explore two models that estimate the sentiment of a snippet of review text.

### Packages

In [1]:
import kaggle
import pandas as pd
import matplotlib.pyplot as plt 
import pyarrow
import fastparquet
import numpy as np
import os
from collections import Counter, defaultdict
import warnings
import seaborn as sns
warnings.filterwarnings("ignore")
from wordcloud import WordCloud 
#kaggle.api.authenticate()
import nltk
from string import punctuation
import textacy.preprocessing as tprep
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
import spacy
nlp = spacy.load("en_core_web_sm")
from sklearn.metrics import accuracy_score 
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
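# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay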
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

## Step 1: Data Preparation
####    Loading Dataset for Modeling

In [2]:
new_df=pd.read_parquet('prepared_text.parquet.gzip', engine='pyarrow')
new_df.head(3)

Unnamed: 0,Rating,lemmas,adjs_verbs,nouns,noun_phrases,adj_noun_phrases,entities,tokens,token_count,new_reviews
0,5,"[feel, lucky, find, use, phone, use, hard, pho...","[feel, lucky, find, hard, upgrade, sell, like,...","[phone, phone, line, son, year, thank, seller,...","[phone_line, thank_seller]","[hard_phone, hard_phone_line, old_one, recomme...",[],"[feel, lucky, found, used, phone, us, used, ha...",38,feel lucky found used phone us used hard phone...
1,4,"[nice, phone, nice, grade, pantach, revue, cle...","[nice, nice, clean, easy, android, fantastic, ...","[phone, grade, pantach, revue, set, set, phone...",[grade_pantach],"[nice_phone, nice_grade, nice_grade_pantach, c...",[android/GPE],"[nice, phone, nice, grade, pantach, revue, cle...",24,nice phone nice grade pantach revue clean set ...
2,5,[pleased],[pleased],[],[],[],[],[pleased],1,pleased


#### Removing unnecessary columns prior to modeling

In [3]:

new_df =new_df[[
    'Rating','new_reviews','tokens'
]]

In [4]:
new_df.info()
print('\n')
print(new_df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 405669 entries, 0 to 405668
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Rating       405669 non-null  int64 
 1   new_reviews  405669 non-null  object
 2   tokens       405669 non-null  object
dtypes: int64(1), object(2)
memory usage: 9.3+ MB


(405669, 3)


#### Undersampling Majority class 

In [5]:
# undersampling 5-star reviews to 70,000 rows; all other reviews are kept as-is
five_stars = new_df[new_df['Rating'] == 5].sample(n=70000)
non_five = new_df[new_df['Rating'] != 5 ]
df_bal = pd.concat([five_stars,non_five],axis=0)
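
The undersampling step above draws a different random subset on every run because `sample` is called without a seed. A minimal sketch of a repeatable version on a made-up toy frame is shown below; `toy_df`, `toy_bal`, and the seed value are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for new_df: 10 five-star reviews and 4 others (made-up data).
toy_df = pd.DataFrame({
    "Rating": [5] * 10 + [1, 2, 3, 4],
    "new_reviews": [f"review {i}" for i in range(14)],
})

# Keep a fixed-size random subset of the majority class.
# random_state (an assumed value here) makes the draw repeatable.
five_stars = toy_df[toy_df["Rating"] == 5].sample(n=4, random_state=42)
non_five = toy_df[toy_df["Rating"] != 5]
toy_bal = pd.concat([five_stars, non_five], axis=0)

# Class counts after undersampling.
print(toy_bal["Rating"].value_counts())
```

Fixing the seed would change which 5-star reviews are kept, so the scores reported later in this notebook would shift slightly if the same idea were applied to the full data.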

##### We labeled reviews with a rating of 4 or 5 as positive (1) and reviews with a rating of 1 or 2 as negative (0). Note that 3-star reviews are not removed, so they keep the default label of 0:

In [6]:
# Assigning a new [1,0] target class label based on the product rating
df_bal['sentiment'] = 0
df_bal.loc[df_bal['Rating'] > 3, 'sentiment'] = 1
df_bal.loc[df_bal['Rating'] < 3, 'sentiment'] = 0
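
Because the label column starts at 0 and only ratings above 3 are overwritten, the labelling cell above places 3-star reviews in the negative class. The sketch below reproduces that behavior on a toy frame and shows one possible alternative that drops neutral reviews first; the alternative and the names `toy` and `toy_no_neutral` are assumptions for illustration, not what the notebook does.

```python
import pandas as pd

# One made-up row per star rating.
toy = pd.DataFrame({"Rating": [1, 2, 3, 4, 5]})

# Same labelling as the notebook cell: default 0, then 1 for ratings above 3.
toy["sentiment"] = 0
toy.loc[toy["Rating"] > 3, "sentiment"] = 1
print(toy)

# Hypothetical alternative: exclude neutral 3-star reviews before labelling.
toy_no_neutral = toy[toy["Rating"] != 3].copy()
toy_no_neutral["sentiment"] = (toy_no_neutral["Rating"] > 3).astype(int)
print(toy_no_neutral)
```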

## Step 2: Train-Test Split on Balanced Dataset

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(df_bal['new_reviews'],
                                                       df_bal['sentiment'],
                                                        test_size=0.2,
                                                        random_state=42,
                                                        stratify=df_bal['sentiment'])
print ('Size of Training Data ', X_train.shape[0]) 
print ('Size of Test Data ', X_test.shape[0])
print ('Distribution of classes in Training Data :')
print ('Positive Sentiment ', str(sum(Y_train == 1)/ len(Y_train) * 100.0))
print ('Negative Sentiment ', str(sum(Y_train == 0)/ len(Y_train) * 100.0))
print ('Distribution of classes in Testing Data :')
print ('Positive Sentiment ', str(sum(Y_test == 1)/ len(Y_test) * 100.0)) 
print ('Negative Sentiment ', str(sum(Y_test == 0)/ len(Y_test) * 100.0))

Size of Training Data  205997
Size of Test Data  51500
Distribution of classes in Training Data :
Positive Sentiment  50.56529949465284
Negative Sentiment  49.434700505347166
Distribution of classes in Testing Data :
Positive Sentiment  50.565048543689315
Negative Sentiment  49.43495145631068
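
The near-identical class percentages in the training and test sets come from the `stratify` argument. A small sketch on made-up data, with assumed names `toy_X` and `toy_y`, illustrates the effect:

```python
from sklearn.model_selection import train_test_split

# Ten made-up documents with a 50/50 class balance.
toy_X = [f"doc {i}" for i in range(10)]
toy_y = [0, 1] * 5

# stratify keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```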


## Step 3: Text Vectorization
##### Using TF-IDF vectorization to create the vectorized representation:

In [8]:

tfidf = TfidfVectorizer(min_df = 10, ngram_range=(1,1))
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)
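
The vectorizer is fitted on the training text only and then reused on the test text, so words that appear only in the test set are ignored. A minimal sketch on made-up sentences follows; `min_df=1` is used here only because the toy corpus is tiny, and all variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good phone", "bad phone", "good battery"]
test_docs = ["good screen"]  # "screen" never appears in the training text

toy_tfidf = TfidfVectorizer(min_df=1, ngram_range=(1, 1))
train_matrix = toy_tfidf.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_tfidf.transform(test_docs)        # reuses the learned vocabulary

print(sorted(toy_tfidf.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what allows a model trained on one to score the other.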

## Step 4: Training the Machine Learning Models

#### Model 1 : Linear SVC

In [9]:
model_svc = LinearSVC(random_state=42, tol=1e-5)
model_svc.fit(X_train_tf, Y_train)
Y_pred = model_svc.predict(X_test_tf)
print ('Accuracy Score - ', accuracy_score(Y_test, Y_pred)) 
print ('ROC-AUC Score - ', roc_auc_score(Y_test, Y_pred))

Accuracy Score -  0.8829708737864078
ROC-AUC Score -  0.8829385881762057



As we can see, our model achieves an accuracy of around 88%. Now, let’s take a look at some of the model predictions alongside the review text as a sanity check of the model:

In [11]:
sample_reviews = df_bal.sample(10)
sample_reviews_tf = tfidf.transform(sample_reviews['new_reviews'])
sentiment_predictions_svc = model_svc.predict(sample_reviews_tf)
sentiment_predictions_svc = pd.DataFrame(data = sentiment_predictions_svc,
                                         index=sample_reviews.index,
                                         columns=['sentiment_prediction'])
sample_reviews_svc = pd.concat([sample_reviews, sentiment_predictions_svc], axis=1)
print ('Some sample reviews with their sentiment - ') 
sample_reviews_svc[['new_reviews','sentiment_prediction']]

Some sample reviews with their sentiment - 


Unnamed: 0,new_reviews,sentiment_prediction
301994,love many great features everything need phone...,1
113284,ready garbage fault shoulda done research first,0
297230,like phonei like phone letters small text stil...,1
145296,100 cant go wrong note couple programs google ...,1
211667,good,1
122761,ive 4 12 months holding alright freeze reviews...,0
222442,gift someone satisfied,1
126426,good features quality,1
123560,good,1
3726,works great,1


We can see that the model's predictions on this sample look reasonable. For instance, review 113284, in which the customer describes the product as garbage, is marked as negative.
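
The ROC-AUC score reported for the Linear SVC was computed from hard 0/1 predictions, which reduces the ROC curve to a single threshold. ROC-AUC is usually computed from continuous scores, which `LinearSVC` exposes through `decision_function`. The sketch below uses a tiny made-up corpus and, for brevity, scores the training data itself, so its numbers are not meaningful as an evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

# Made-up reviews and labels, for illustration only.
texts = [
    "great phone love it", "works great", "good quality",
    "terrible garbage phone", "bad battery broke", "awful waste money",
]
labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
svc = LinearSVC(random_state=42).fit(X, labels)

# Signed distance to the separating hyperplane: a continuous ranking score.
scores = svc.decision_function(X)
auc_from_scores = roc_auc_score(labels, scores)

# Same metric computed from hard labels, as in the notebook cell.
auc_from_labels = roc_auc_score(labels, svc.predict(X))

print(auc_from_scores, auc_from_labels)
```

On the real data this would amount to passing `model_svc.decision_function(X_test_tf)` as the second argument of `roc_auc_score`.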

#### Model 2: Multi-layer Perceptron Classifier

In [14]:
model_clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
model_clf.fit(X_train_tf, Y_train)
Y_pred_clf= model_clf.predict(X_test_tf)
print ('Accuracy Score - ', accuracy_score(Y_test, Y_pred_clf)) 
print ('ROC-AUC Score - ', roc_auc_score(Y_test, Y_pred_clf))

Accuracy Score -  0.8933009708737865
ROC-AUC Score -  0.8930619524994998


In [13]:
sentiment_predictions_clf = model_clf.predict(sample_reviews_tf)
sentiment_predictions_clf = pd.DataFrame(data = sentiment_predictions_clf,
                                         index=sample_reviews.index,
                                         columns=['sentiment_prediction'])
sample_reviews_clf = pd.concat([sample_reviews, sentiment_predictions_clf], axis=1)
print ('Some sample reviews with their sentiment - ') 
sample_reviews_clf[['new_reviews','sentiment_prediction']]

Some sample reviews with their sentiment - 


Unnamed: 0,new_reviews,sentiment_prediction
301994,love many great features everything need phone...,1
113284,ready garbage fault shoulda done research first,0
297230,like phonei like phone letters small text stil...,1
145296,100 cant go wrong note couple programs google ...,1
211667,good,1
122761,ive 4 12 months holding alright freeze reviews...,1
222442,gift someone satisfied,1
126426,good features quality,1
123560,good,1
3726,works great,1


This model achieved an accuracy of around 89%. However, when we look at the same sample reviews, review 122761, in which the customer mentions freezes, is predicted as positive although it reads as more negative; the Linear SVC labeled the same review as negative.
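
Disagreements like this one are easier to spot when the two models' predictions sit side by side. The sketch below rebuilds three rows from the sample tables above by hand; the column names `svc` and `mlp` and the frame names are illustrative assumptions.

```python
import pandas as pd

# Three rows copied from the sample outputs above (review text as displayed).
sample = pd.DataFrame(
    {
        "new_reviews": [
            "ready garbage fault shoulda done research first",
            "good",
            "ive 4 12 months holding alright freeze reviews...",
        ],
        "svc": [0, 1, 0],  # Linear SVC predictions
        "mlp": [0, 1, 1],  # MLP predictions
    },
    index=[113284, 211667, 122761],
)

# Keep only the reviews on which the two models disagree.
disagreements = sample[sample["svc"] != sample["mlp"]]
print(disagreements)
```

With the notebook's own objects, the same comparison could be made by concatenating `sentiment_predictions_svc` and `sentiment_predictions_clf` under distinct column names.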