# CUSTOMER SENTIMENT ANALYZER

## Table of Contents

1. [Introduction](#introduction)

2. [Libraries](#libraries)

3. [Data Preparation and Pre-processing](#data-preparation-and-pre-processing)
    - [Importing Data](#importing-data)
    - [Cleaning Data](#cleaning-data)
    - [Data with only Alphabets](#data-with-only-alphabets)
    - [Lower Case Data](#lower-case-data)
    - [Remove Stop Words](#remove-stop-words)
    - [Stemming](#stemming)
    - [Train-Test-Splitting](#train-test-splitting)
    - [Count Vectorization](#count-vectorization)
    - [TFIDF Vectorization](#tfidf-vectorization)
4. [Model Building](#model-building)
    - [Training the Model](#training-the-model)
5. [Evaluation](#evaluation)
6. [Conclusion](#conclusion)

## Introduction 

#### Problem Statement 
1. We have an IMDB movie reviews dataset with 50,000 rows and two columns, `review` and `sentiment`.
2. We have to build a model that predicts the sentiment of a review, trained on this data and usable on reviews collected afterwards.

#### Solution
1. Clean the Data
2. Pre-Process the Data (Stemming or Lemmatization, Count Vectorization, etc.)
3. Build the model (DecisionTree, RandomForest, XGBoostClassifier, LightGBMClassifier)
4. Evaluate the result using metrics (accuracy, precision, recall)
5. Select the model with best metrics

## Libraries

In [1]:
# importing libraries and data
import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mahes\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data Preparation and Pre-Processing

#### Importing Data

In [2]:
# importing data
data = pd.read_csv('IMDB_dataset.csv')

In [3]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


#### Cleaning Data

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [5]:
# From the output above, we can see that all values are non-null. Let's check whether there are duplicate rows.
data.duplicated().sum()

np.int64(418)

In [6]:
# We have 418 duplicates; we will drop all of them
# because duplicates can leak across the training, validation, and test sets and also take up unnecessary memory.
data.drop_duplicates(inplace=True)

In [7]:
data.shape

(49582, 2)
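The duplicate check and drop above can be sketched on a toy frame. The three-row DataFrame and the `reset_index` call are illustrative assumptions, not part of the notebook. Note that `drop_duplicates` keeps the original index labels, which is why later outputs still show labels up to 49999 for 49582 rows.

```python
import pandas as pd

# Toy frame (hypothetical data) with one exact duplicate row
toy = pd.DataFrame({
    "review": ["great film", "bad film", "great film"],
    "sentiment": ["positive", "negative", "positive"],
})

n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()              # keeps the first occurrence of each row
print(n_duplicates, deduped.shape, list(deduped.index))

# Optional: renumber the rows so the index is contiguous again
deduped = deduped.reset_index(drop=True)
print(list(deduped.index))
```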

#### Data with only Alphabets

In [8]:
pd.set_option('display.max_colwidth', None) # display the full contents of each row
data.head(1)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive


In [9]:
data['review'] = data['review'].apply(lambda x:re.sub('[^a-zA-Z ]','',x))

In [10]:
data.head(1)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that after watching just Oz episode youll be hooked They are right as this is exactly what happened with mebr br The first thing that struck me about Oz was its brutality and unflinching scenes of violence which set in right from the word GO Trust me this is not a show for the faint hearted or timid This show pulls no punches with regards to drugs sex or violence Its is hardcore in the classic use of the wordbr br It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary It focuses mainly on Emerald City an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda Em City is home to manyAryans Muslims gangstas Latinos Christians Italians Irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awaybr br I would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare Forget pretty pictures painted for mainstream audiences forget charm forget romanceOZ doesnt mess around The first episode I ever saw struck me as so nasty it was surreal I couldnt say I was ready for it but as I watched more I developed a taste for Oz and got accustomed to the high levels of graphic violence Not just violence but injustice crooked guards wholl be sold out for a nickel inmates wholl kill on order and get away with it well mannered middle class inmates being turned into prison bitches due to their lack of street skills or prison experience Watching Oz you may become comfortable with what is uncomfortable viewingthats if you can get in touch with your darker side,positive
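The output above shows that deleting every non-letter leaves the `<br />` tags behind as `br` tokens glued to their neighbours (`mebr br`). A minimal sketch of one possible alternative follows, assuming we strip HTML tags first and replace other non-letters with spaces. The helper `strip_then_clean` is hypothetical and is not what the notebook does; switching to it would change the features and the scores reported later.

```python
import re

def notebook_clean(text):
    # Same substitution as cell In [9]: delete everything except letters and spaces
    return re.sub('[^a-zA-Z ]', '', text)

def strip_then_clean(text):
    # Hypothetical alternative: drop HTML tags, then turn other non-letters into spaces
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub('[^a-zA-Z]', ' ', text)
    return ' '.join(text.split())  # collapse repeated whitespace

sample = "happened with me.<br /><br />The first thing"
print(notebook_clean(sample))    # happened with mebr br The first thing
print(strip_then_clean(sample))  # happened with me The first thing
```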


#### Lower Case Data

In [11]:
pd.reset_option('display.max_colwidth') # resetting the column width
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production br br The filmin...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically theres a family where a little boy J...,negative
4,Petter Matteis Love in the Time of Money is a ...,positive


In [12]:
data['review'] = data['review'].apply(lambda x:x.lower())

In [13]:
data.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


#### Remove Stop Words

In [14]:
eng_stopwords = stopwords.words('english') #stopwords.words({Language Name})
eng_stopwords

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [15]:
# to remove the stopwords from each sentence, we first split it into words on whitespace, i.e. ' '
data['review'] = data['review'].apply(lambda x:x.split())
data.head()

Unnamed: 0,review,sentiment
0,"[one, of, the, other, reviewers, has, mentione...",positive
1,"[a, wonderful, little, production, br, br, the...",positive
2,"[i, thought, this, was, a, wonderful, way, to,...",positive
3,"[basically, theres, a, family, where, a, littl...",negative
4,"[petter, matteis, love, in, the, time, of, mon...",positive


In [16]:
# removing stopwords from each row (the set is built once instead of once per word)
stopword_set = set(eng_stopwords)
data['review'] = data['review'].apply(lambda x: [word for word in x if word not in stopword_set])
data.head()

Unnamed: 0,review,sentiment
0,"[one, reviewers, mentioned, watching, oz, epis...",positive
1,"[wonderful, little, production, br, br, filmin...",positive
2,"[thought, wonderful, way, spend, time, hot, su...",positive
3,"[basically, theres, family, little, boy, jake,...",negative
4,"[petter, matteis, love, time, money, visually,...",positive
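One side effect of the earlier cleaning step is worth seeing in isolation: apostrophes were removed before this point, so contractions such as `doesnt` no longer match stop-word entries like `doesn't`. The sketch below uses a few entries copied from the list printed above; `sample_stopwords` and `tokens` are hypothetical names chosen for illustration.

```python
# A few entries copied from the stop-word list printed above (not the full list)
sample_stopwords = {"a", "about", "be", "does", "doesn", "doesn't", "it", "just"}

# Tokens as they look after In [9] and In [12]: apostrophes are already gone
tokens = ["it", "doesnt", "just", "mess", "about"]

kept = [word for word in tokens if word not in sample_stopwords]
print(kept)  # ['doesnt', 'mess']
```

This matches the stemmed output later in the notebook, where tokens such as `youll` survive stop-word removal.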


#### Stemming

In [17]:
data.head()

Unnamed: 0,review,sentiment
0,"[one, reviewers, mentioned, watching, oz, epis...",positive
1,"[wonderful, little, production, br, br, filmin...",positive
2,"[thought, wonderful, way, spend, time, hot, su...",positive
3,"[basically, theres, family, little, boy, jake,...",negative
4,"[petter, matteis, love, time, money, visually,...",positive


In [18]:
#Stemming
ps = PorterStemmer()
data['review'] = data['review'].apply(lambda x: [ps.stem(word) for word in x])
data.head()

Unnamed: 0,review,sentiment
0,"[one, review, mention, watch, oz, episod, youl...",positive
1,"[wonder, littl, product, br, br, film, techniq...",positive
2,"[thought, wonder, way, spend, time, hot, summe...",positive
3,"[basic, there, famili, littl, boy, jake, think...",negative
4,"[petter, mattei, love, time, money, visual, st...",positive


In [19]:
# joining the words back into a single string after stemming them
data['review'] = data['review'].apply(lambda x: ' '.join(x))
data.head()

Unnamed: 0,review,sentiment
0,one review mention watch oz episod youll hook ...,positive
1,wonder littl product br br film techniqu unass...,positive
2,thought wonder way spend time hot summer weeke...,positive
3,basic there famili littl boy jake think there ...,negative
4,petter mattei love time money visual stun film...,positive
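To see the stemmer on its own, here is a small sketch that runs `PorterStemmer` on a handful of words taken from the reviews above. The word list is only an illustration; the stems are the ones visible in the notebook output.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
words = ["reviewers", "mentioned", "watching", "wonderful", "little", "production"]
stems = [ps.stem(word) for word in words]
print(stems)  # ['review', 'mention', 'watch', 'wonder', 'littl', 'product']
```

The stems are not always dictionary words (`littl`), which is fine here because they are only used as feature keys for the vectorizer.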


In [20]:
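# function for pre-processing; it mirrors cells In [9] to In [19] so that new reviews get the same features as the training data
stopword_set = set(stopwords.words('english'))  # build the set once, not once per word

def preprocess(x):
    x = re.sub('[^a-zA-Z ]', '', x)  # delete non-letters, exactly as in In [9]
    x = x.lower()
    x = x.split()
    x = [i for i in x if i not in stopword_set]
    x = [ps.stem(i) for i in x]
    x = " ".join(x)
    return x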

In [21]:
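# Positive --> 1 and Negative --> 0
data['sentiment'] = [1 if word =='positive' else 0 for word in data['sentiment']]
# astype returns a new Series; the result is only displayed here, so the column itself stays int64
data['sentiment'].astype(np.int32)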

0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 49582, dtype: int32
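As an aside, the same encoding can be sketched with `Series.map`, assigning the converted result back so the dtype change actually sticks. This is an alternative under stated assumptions, not the notebook's code, and `labels` is a hypothetical sample. Unlike the list comprehension, a label outside the mapping becomes `NaN` instead of silently turning into 0, so the integer cast would then fail loudly.

```python
import pandas as pd

labels = pd.Series(["positive", "negative", "positive"])  # hypothetical sample

# Explicit mapping, then a cast whose result is kept
encoded = labels.map({"positive": 1, "negative": 0}).astype("int32")
print(encoded.tolist(), encoded.dtype)
```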

In [22]:
X = data['review']
y = data['sentiment']

In [23]:
y

0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 49582, dtype: int64

In [30]:
y.value_counts()

sentiment
1    24884
0    24698
Name: count, dtype: int64

#### Train-Test Splitting

In [24]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.8,shuffle=True)
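The call above sets no `random_state`, so every run produces a different split and slightly different scores. A minimal sketch of a repeatable, stratified split on toy data follows; the seed value 42 and the toy lists are arbitrary assumptions, and the outputs shown in this notebook come from the unseeded call, not from this variant.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 texts with balanced labels
X_toy = [f"review {i}" for i in range(10)]
y_toy = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    train_size=0.8,
    shuffle=True,
    stratify=y_toy,     # keep the class ratio the same in both parts
    random_state=42,    # arbitrary seed, makes the split repeatable
)
print(len(X_tr), len(X_te), sorted(y_te))
```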

In [25]:
print(X_train.dtypes)
print(X_test.dtypes)
print(y_train.dtypes)
print(y_test.dtypes)

object
object
int64
int64


In [26]:
y_train

41051    1
42389    0
44147    1
31320    1
21519    1
        ..
45769    1
48325    0
46058    1
44830    1
44969    1
Name: sentiment, Length: 39665, dtype: int64

In [27]:
y_test

16586    0
10877    0
24215    0
4949     0
49217    1
        ..
38774    1
12424    1
11555    0
26662    1
43891    1
Name: sentiment, Length: 9917, dtype: int64

In [32]:
X_train

41051    person absolut love movi nove read book first ...
42389    franco film divid categori earli often black w...
44147    movi rock jen sexi ever polli wow realli ever ...
31320    farrah fawcett superb power drama play marjori...
21519    death wish exactli bad movi terribl act implau...
                               ...                        
45769    im slowli plough mani hong kong action film ge...
48325    recap band five young american men left platoo...
46058    odd coupl classic film version neil simon famo...
44830    best batman movi doubt movi take place citi fi...
44969    read ashew comment thought must watch entir di...
Name: review, Length: 39665, dtype: object

In [33]:
X_test

16586    movi shown recent cabl channel want see anoth ...
10877    start script immit inan charact shallow formul...
24215    horribl act costum product valu edit script ev...
4949     tri sit bomb long agowhat disast act atrocious...
49217    ladi man suffer common problem among movi base...
                               ...                        
38774    anyon ever gone audit certainli relat one grea...
12424    hong kong man woman move day adjac apart respe...
11555    bad watchabl scienc fiction film suffer abomin...
26662    anim featur coproduct ireland belgium franc de...
43891    tcm keep awak time keep come film ive never he...
Name: review, Length: 9917, dtype: object

#### Count Vectorization

In [34]:
# from sklearn.feature_extraction.text import CountVectorizer
# cv = CountVectorizer()

#### TFIDF Vectorization

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=20000, min_df=5, max_df=0.85, ngram_range=(1,2))

In [36]:
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [37]:
print(f"Shape of X_train_tfidf: {X_train_tfidf.shape}")
print(f"Type of X_train_tfidf: {type(X_train_tfidf)}") 

Shape of X_train_tfidf: (39665, 20000)
Type of X_train_tfidf: <class 'scipy.sparse._csr.csr_matrix'>


## Model Building

#### Training the Model

In [53]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

In [54]:
model = LogisticRegressionCV(solver='liblinear', random_state=42,cv=5)

In [55]:
model.fit(X_train_tfidf,y_train)

parameter,value
Cs,10
fit_intercept,True
cv,5
dual,False
penalty,'l2'
scoring,None
solver,'liblinear'
tol,0.0001
max_iter,100
class_weight,None
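`LogisticRegressionCV` with `Cs=10` tries ten regularization strengths spaced logarithmically between 1e-4 and 1e4 and keeps the one with the best cross-validated score. The sketch below shows how the candidates and the selected value could be inspected. It runs on synthetic data from `make_classification` as a stand-in for the TF-IDF matrix, so the selected value says nothing about the notebook's model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical stand-in for the TF-IDF features and labels
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_model = LogisticRegressionCV(Cs=10, cv=5, solver='liblinear', random_state=42)
toy_model.fit(X_toy, y_toy)

candidate_Cs = toy_model.Cs_                 # the 10 values that were tried
best_C = float(np.ravel(toy_model.C_)[0])    # the value selected by cross-validation
print(candidate_Cs.shape, best_C)
```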


In [56]:
y_pred = model.predict(X_test_tfidf)

## Evaluation

In [57]:
accuracy_score(y_test,y_pred)

0.9000705858626601

In [58]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.91      0.89      0.90      4989
           1       0.89      0.91      0.90      4928

    accuracy                           0.90      9917
   macro avg       0.90      0.90      0.90      9917
weighted avg       0.90      0.90      0.90      9917
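To make the columns of the report concrete, the sketch below computes precision, recall, f1 and accuracy for the positive class by hand. The eight labels and predictions are made up for illustration and are unrelated to the notebook's test set.

```python
y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # hypothetical true labels
y_hat = [1, 1, 1, 0, 0, 0, 1, 1]    # hypothetical predictions

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives predicted as positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives predicted as negative

precision = tp / (tp + fp)   # share of predicted positives that are right
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
print(precision, recall, round(f1, 4), accuracy)
```

Read this way, the row for class 1 in the report says that 89% of the reviews predicted positive were positive, and 91% of the positive reviews were found.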



## Conclusion
1. The model reaches an f1-score of 0.90 on the test set, so we can use it to predict the sentiment of reviews.
2. We can also try other classification models on this data, such as XGBoostClassifier and DecisionTreeClassifier, and compare them using the same evaluation metrics.
3. To use this model on new reviews, we have to repeat the pre-processing done in the notebook, i.e. run the preprocess function on the review, transform it with the fitted TF-IDF vectorizer, and then predict the outcome.
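
The prediction step for a new review can be sketched with a scikit-learn pipeline that bundles the vectorizer and the classifier, so both are always applied together. Everything here is an assumption made for illustration: the four training strings are hypothetical, already stemmed texts, and plain `LogisticRegression` replaces the notebook's `LogisticRegressionCV` to keep the toy example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already pre-processed (stemmed) training texts and labels
train_texts = ["wonder film love act", "terribl act bad script",
               "great stori love movi", "bore bad wast time"]
train_labels = [1, 0, 1, 0]

# Vectorizer and classifier are fitted and applied as one unit
sentiment_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
sentiment_pipeline.fit(train_texts, train_labels)

# In the notebook this string would come from preprocess(raw_review)
new_review = "love wonder stori"
prediction = sentiment_pipeline.predict([new_review])
print(prediction)
```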