# Sentiment Analysis and Rating Prediction from Review Text

Sentiment analysis and rating prediction are important machine learning tasks that help companies find out whether users are happy or unhappy with a product or service. Users write reviews on various platforms, such as social networking sites like Facebook and Twitter, blogs, and the websites of service providers. Analyzing these reviews to gauge customer satisfaction helps companies improve both their products and their customer service.

In this project, I aim to build a machine learning system that predicts a user's rating from the text of their review. Specifically, I will build models for the following tasks:

1. Predict the user's sentiment (positive or negative).
2. Predict the user's product/service rating on a scale of 1 to 5.

I have already done the ETL in a separate notebook. Here, I load the data, prepare it for model training, validation, and testing, and describe the deep learning models employed. This notebook focuses only on rating prediction, i.e., multiclass classification. The modeling for sentiment analysis, i.e., binary classification, is described in another notebook.

So, let's just start with loading the required libraries.

In [1]:
import numpy as np
import pandas as pd
import gzip
import glob
import os
import re

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
# Embedding is already imported from keras.layers, so the duplicate import is dropped.
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, GRU, Convolution1D, Flatten, Dropout
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint, EarlyStopping


Using TensorFlow backend.


## Loading the data from the CSV

In [5]:
df_ReviewRating = pd.read_csv('AmazonBookReviews_Ratings.csv')
df_ReviewRating.shape

(484645, 3)

In [7]:
# Character length of every review (cast to str so non-string values do not fail).
lengths = df_ReviewRating['reviewText'].apply(str).map(len)
maxlength = int(lengths.max())
# Index label of the first review that reaches the maximum length.
maxindex = lengths.idxmax()
print('The length of longest review text: ' + str(maxlength))
print('The index of review with maximum length: ' + str(maxindex))

The length of longest review text: 499
The index of review with maximum length: 100
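
Note that `maxlength` is measured in characters, whereas the `maxlen` argument of `pad_sequences` used later counts tokens. A minimal sketch of the difference, using made-up reviews and a plain whitespace split as a stand-in for the Keras tokenizer (both are assumptions for illustration):

```python
# Toy reviews (hypothetical; not taken from the data set).
reviews = [
    "Great story. Need the next book now!",
    "Slow and boring.",
    "I love the pop culture essays in this collection.",
]

# Length in characters, as computed for `maxlength` above.
char_lengths = [len(text) for text in reviews]

# Length in tokens, approximated here with a whitespace split.
token_lengths = [len(text.split()) for text in reviews]

max_chars = max(char_lengths)
max_tokens = max(token_lengths)

print("longest review in characters:", max_chars)
print("longest review in tokens:", max_tokens)
```

A review cannot contain more tokens than characters, so padding to the character maximum never truncates anything, but it pads more than a token-based maximum would.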


## Splitting the Data Set into Train, Validation, and Test Sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(df_ReviewRating['reviewText'], df_ReviewRating['overall'], test_size=0.2, random_state=1)

In [9]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
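
Applying `test_size=0.2` twice gives a 64% / 16% / 20% split, because the validation set is 20% of the remaining 80%. The short check below compares that arithmetic with the set sizes printed in the cells that follow; the numbers are copied from those outputs.

```python
# Set sizes reported by the notebook outputs.
n_total = 484645
n_train, n_val, n_test = 310172, 77544, 96929

# The three subsets should account for every row exactly once.
assert n_train + n_val + n_test == n_total

# Expected fractions: 20% test, then 20% of the remaining 80% for validation.
test_frac = 0.2
val_frac = (1 - test_frac) * 0.2    # 0.16
train_frac = (1 - test_frac) * 0.8  # 0.64

for name, n, frac in [("train", n_train, train_frac),
                      ("val", n_val, val_frac),
                      ("test", n_test, test_frac)]:
    print(f"{name}: {n} rows, {n / n_total:.4f} of the data (expected {frac:.2f})")
```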

In [10]:
print(X_train.shape)
print(X_train[0:5])
print(y_train.shape)
print(y_train[0:5])

df_train = pd.concat([X_train, y_train], axis=1)
df_train.groupby('overall').count()/df_train.shape[0]*100

(310172,)
475274    Great story branch off of original. Need next ...
459122    I teach a watercolor class at a college for se...
229593    NIce collection of Destroyer novels.  You get ...
201419    I love Chuck Klosterman's pop culture essays, ...
279894    The writing is excellent.  Good characters and...
Name: reviewText, dtype: object
(310172,)
475274    5.0
459122    5.0
229593    5.0
201419    5.0
279894    5.0
Name: overall, dtype: float64


         reviewText
overall
1.0        3.418426
2.0        3.944908
3.0        9.165237
4.0       22.491069
5.0       60.980359


In [11]:
print(X_test.shape)
print(X_test[0:5])
print(y_test.shape)
print(y_test[0:5])
df_test = pd.concat([X_test, y_test], axis=1)
df_test.groupby('overall').count()/df_test.shape[0]*100

(96929,)
140694    Loved it. Once I started reading it I couldn't...
401595    Slow and boring. Gay son, military dad and wea...
155299    VERY GOOD!  Mix of historical and modern times...
109577    A must read. What a wonderful twist to a LOVE ...
137179    Pauline is a lovable heroine . She is truly he...
Name: reviewText, dtype: object
(96929,)
140694    5.0
401595    2.0
155299    5.0
109577    5.0
137179    3.0
Name: overall, dtype: float64


         reviewText
overall
1.0         3.48193
2.0        4.004993
3.0        9.040638
4.0       22.472119
5.0        61.00032


In [12]:
print(X_val.shape)
print(X_val[0:5])
print(y_val.shape)
print(y_val[0:5])
df_val = pd.concat([X_val, y_val], axis=1)
df_val.groupby('overall').count()/df_val.shape[0]*100

(77544,)
264019    As a fan of Dawn's, I was anxious to read her ...
118798    There are some real surprises in this one, one...
377831    In my opinion, as a writer/artist myself, this...
119666    This book is beautifully done, so easy to foll...
405407    This is a good resource for some one who has l...
Name: reviewText, dtype: object
(77544,)
264019    5.0
118798    5.0
377831    5.0
119666    5.0
405407    4.0
Name: overall, dtype: float64


         reviewText
overall
1.0        3.429021
2.0        3.987414
3.0        9.274734
4.0       22.308625
5.0       61.000206


We see that the distribution of classes is nearly identical in the train, validation, and test sets, so each split is representative of the full data set.
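
The percentages above also show a strong class imbalance: roughly 61% of the reviews are 5-star. This notebook does not correct for it. As a hedged sketch of one common option, inverse-frequency class weights can be derived from the training percentages. The formula `1 / (n_classes * share)` is the usual "balanced" heuristic, and passing the result to Keras through the `class_weight` argument of `fit` is an assumption about how one might use it, not something done here.

```python
# Training-set class shares in percent, copied from the table above.
train_share_pct = {1: 3.418426, 2: 3.944908, 3: 9.165237, 4: 22.491069, 5: 60.980359}

n_classes = len(train_share_pct)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# which reduces to 1 / (n_classes * share) when working with shares.
# Keys are shifted to 0..4 to match the zero-based labels used later.
class_weight = {
    rating - 1: 1.0 / (n_classes * pct / 100.0)
    for rating, pct in train_share_pct.items()
}

for label, weight in sorted(class_weight.items()):
    print(f"label {label} (rating {label + 1}): weight {weight:.3f}")
```

With these weights, rare ratings contribute more to the loss per example and the dominant 5-star class contributes less.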

## Data Preparation for Modeling

In this step, we first tokenize the text into words and then convert each review into an integer sequence of the same length.

### Tokenization

In [13]:
%%time
tokenizer = Tokenizer()
# Note: the vocabulary is built from all reviews, including the validation and test text.
tokenizer.fit_on_texts(df_ReviewRating['reviewText'])


CPU times: user 21.3 s, sys: 285 ms, total: 21.6 s
Wall time: 21.6 s


In [14]:
vocab_size = len(tokenizer.word_index) + 1
vocab_size

156075
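
To make the word index concrete, here is a simplified, pure-Python imitation of what `Tokenizer` does: lowercase the text, split it into words, and assign integer ids by descending frequency, starting at 1. The helper names and the toy corpus are hypothetical, ties are broken alphabetically as a simplification, and the real class also strips punctuation and offers more options. Index 0 is never assigned to a word, which is why `vocab_size` adds 1 to `len(tokenizer.word_index)`.

```python
from collections import Counter

def fit_word_index(texts):
    """Build a word -> id map; more frequent words get smaller ids (from 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties broken alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}

def texts_to_ids(texts, word_index):
    """Replace each known word by its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

corpus = ["good book", "good story good characters", "boring book"]
word_index = fit_word_index(corpus)
sequences = texts_to_ids(corpus, word_index)

# Id 0 is reserved for padding, so the embedding needs len(word_index) + 1 rows.
toy_vocab_size = len(word_index) + 1
print(word_index)
print(sequences)
print(toy_vocab_size)
```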

In [15]:
%%time
sequence_train = tokenizer.texts_to_sequences(X_train)
sequence_test = tokenizer.texts_to_sequences(X_test)
sequence_val = tokenizer.texts_to_sequences(X_val)

CPU times: user 15.2 s, sys: 240 ms, total: 15.4 s
Wall time: 15.4 s


In [16]:
%%time
X_train_pad = pad_sequences(sequence_train, maxlen=maxlength)
X_test_pad = pad_sequences(sequence_test, maxlen=maxlength)
X_val_pad = pad_sequences(sequence_val, maxlen=maxlength)

CPU times: user 4.51 s, sys: 1.36 s, total: 5.86 s
Wall time: 5.84 s
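
By default `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, also truncates from the front. The following pure-Python function (a hypothetical `pad_pre`, written only for illustration) mimics that default behaviour on toy input:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` ids of each sequence."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)  # pad at the front
    return padded

toy_sequences = [[1, 2], [1, 5, 1, 4, 3, 2], [3]]
toy_padded = pad_pre(toy_sequences, maxlen=4)
for row in toy_padded:
    print(row)
```

Every row has the same length afterwards, which is what allows the padded sequences to be stacked into a single 2-D array for the embedding layer.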


Let us convert the label vector to a one-hot encoded matrix using to_categorical. Since to_categorical expects the labels to start from 0, I subtract 1 from each rating. A rating of one is now represented by class 0 and a rating of five by class 4.

In [17]:
y_train_label = to_categorical(np.asarray(y_train - 1))
print(y_train[0:5])
y_test_label = to_categorical(np.asarray(y_test - 1))
print(y_test[0:5])
y_val_label = to_categorical(np.asarray(y_val - 1))
print(y_val[0:5])

475274    5.0
459122    5.0
229593    5.0
201419    5.0
279894    5.0
Name: overall, dtype: float64
140694    5.0
401595    2.0
155299    5.0
109577    5.0
137179    3.0
Name: overall, dtype: float64
264019    5.0
118798    5.0
377831    5.0
119666    5.0
405407    4.0
Name: overall, dtype: float64


In [43]:
y_train_label[0:5]

array([[ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.]])
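
The shift by one and the one-hot encoding can also be written out by hand. The helper below is a hypothetical stand-in for `to_categorical`, shown only to make the mapping explicit; taking the position of the largest entry in a row and adding 1 recovers the original rating.

```python
def one_hot_ratings(ratings, n_classes=5):
    """Map ratings 1..n_classes to one-hot rows with the 1 at index rating - 1."""
    rows = []
    for rating in ratings:
        row = [0.0] * n_classes
        row[int(rating) - 1] = 1.0
        rows.append(row)
    return rows

def decode_rating(row):
    """Invert the encoding: position of the largest entry, shifted back by 1."""
    return row.index(max(row)) + 1

encoded = one_hot_ratings([5.0, 2.0, 3.0])
decoded = [decode_rating(row) for row in encoded]
print(encoded)
print(decoded)
```

The same decoding step applies to the softmax output of the models below: the predicted rating is the index of the highest probability plus 1.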

## Model Training and Evaluation

In this project, I use three different deep learning models. The models and their evaluations follow.

### GRU-based Model

In [45]:
# embedding_dimensions =  vocab_size**0.25
embedding_dimensions = 100
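
The commented-out line is a fourth-root rule of thumb for choosing the embedding size. With the vocabulary size reported earlier it would suggest roughly 20 dimensions, so the fixed value of 100 is a considerably larger choice. A quick check (156075 is the `vocab_size` printed after tokenization):

```python
reported_vocab_size = 156075  # value printed by the notebook after tokenization

# Fourth-root heuristic from the commented-out line.
heuristic_dim = reported_vocab_size ** 0.25
print(round(heuristic_dim, 2))

# Size of the embedding matrix (rows x columns) under each choice.
params_heuristic = reported_vocab_size * round(heuristic_dim)
params_fixed = reported_vocab_size * 100
print(params_heuristic, params_fixed)
```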

Model definition ...

In [46]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength))
model.add(GRU(units=32, dropout=0.2))
model.add(Dense(5, activation='softmax'))
model.summary()  # summary() prints the table itself and returns None

Model compilation and training ...

In [47]:
%%time
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [48]:
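%%time
# `checkpoint` and `early_stop` were not defined anywhere in the notebook.
# The file name, monitored metric, and patience below are assumed values.
checkpoint = ModelCheckpoint('gru_model.h5', monitor='val_loss', save_best_only=True)
early_stop = EarlyStopping(monitor='val_loss', patience=2)
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label),
          callbacks=[checkpoint, early_stop])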
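
The `callbacks` list refers to `checkpoint` and `early_stop`. As an illustration of the idea behind early stopping (not the Keras implementation), the sketch below stops once the validation loss has failed to improve for `patience` consecutive epochs and remembers the best epoch, which is the one a best-only checkpoint would keep. The loss values and the patience of 2 are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops after `patience` consecutive epochs without a new best
    validation loss, or when the losses run out.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses for five epochs.
toy_val_losses = [0.92, 0.85, 0.86, 0.88, 0.90]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(stop_epoch, best_epoch)
```

In this toy run the loss is lowest at epoch 2 and training stops at epoch 4, so the fifth epoch is never run.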

Model evaluation ...

In [49]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

### LSTM-based Model

Model definition ...

In [None]:
model = Sequential() 
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength)) 
model.add(LSTM(100)) 
model.add(Dense(5, activation='softmax'))
model.summary()  # summary() prints the table itself and returns None
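
The size of this model can be worked out by hand, which is a useful cross-check against the summary output. The sketch uses the standard parameter formulas for embedding, LSTM, and dense layers; 156075 is the `vocab_size` printed earlier.

```python
reported_vocab_size = 156075  # printed earlier in the notebook
embedding_dim = 100
lstm_units = 100
n_classes = 5

# Embedding: one vector of length embedding_dim per vocabulary entry.
embedding_params = reported_vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense softmax layer: weights plus one bias per class.
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding matrix; the recurrent and output layers are small by comparison.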

Model compilation and training ...

In [None]:
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])

In [None]:
%%time
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label))

Model evaluation ...

In [55]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 67.02%
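
A single accuracy figure is hard to interpret on imbalanced data: about 61% of the test reviews are 5-star, so always predicting 5 stars would already score about 61%. Per-class recall makes the picture clearer. The sketch below computes accuracy and per-class recall from predicted class probabilities. The probabilities and labels are made up; in this notebook they would come from something like `model.predict(X_test_pad)` and the zero-based test labels.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_recall(probabilities, true_labels, n_classes):
    """Return overall accuracy and a dict of per-class recall."""
    predicted = [argmax(row) for row in probabilities]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    recall = {}
    for cls in range(n_classes):
        support = sum(t == cls for t in true_labels)
        hits = sum(p == t == cls for p, t in zip(predicted, true_labels))
        recall[cls] = hits / support if support else float("nan")
    return correct / len(true_labels), recall

# Hypothetical softmax outputs for four reviews and three classes.
toy_probs = [
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
toy_labels = [2, 1, 0, 2]
acc, recall = accuracy_and_recall(toy_probs, toy_labels, n_classes=3)
print(acc)
print(recall)
```

In the toy data the overall accuracy looks good while one class is never predicted correctly, which is exactly the situation a single accuracy number can hide.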


### 1D CNN-based Model

Model definition ...

In [None]:
model = Sequential() 
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength)) 
model.add(Convolution1D(64, 3, padding='same'))
model.add(Convolution1D(32, 3, padding='same'))
model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(5, activation='softmax'))
model.summary()  # summary() prints the table itself and returns None
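
Because every convolution uses `padding='same'` with the default stride of 1, the sequence length stays at `maxlength` through all three layers, so `Flatten` yields `maxlength * 16` features. A quick hand calculation of the shapes and parameter counts, using the values printed earlier in the notebook (499 and 100):

```python
seq_len = 499      # maxlength printed earlier
channels = 100     # embedding_dimensions
kernel_size = 3

conv_params = []
for filters in (64, 32, 16):
    # Each filter has kernel_size * in_channels weights plus one bias.
    conv_params.append(filters * (kernel_size * channels) + filters)
    channels = filters  # 'same' padding keeps seq_len; only channels change

flat_features = seq_len * channels
dense_hidden_params = flat_features * 180 + 180
dense_out_params = 180 * 5 + 5

print(conv_params, flat_features, dense_hidden_params, dense_out_params)
```

Most of the non-embedding parameters are in the first dense layer, because it is connected to every position of the flattened sequence. Note also that the convolution layers are defined without an `activation` argument, so they are applied linearly.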

Model compilation and training ...

In [None]:
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy']) 

In [None]:
%%time
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label))

Model evaluation ...

In [None]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

## Conclusion

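We have trained three different deep learning models (GRU, LSTM, and 1D CNN) to predict user ratings from the text of the reviews they wrote. Of the runs recorded in this notebook, only the LSTM model's test accuracy, 67.02%, is shown.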