# Sentiment Analysis and Rating Prediction From The Review Text

Sentiment analysis and rating prediction are important machine learning tasks that help companies find out whether users are happy or unhappy with a product or service. Users write reviews on various platforms, such as social networking sites like Facebook and Twitter, blogs, and the websites of service providers. Analyzing these reviews to gauge customer satisfaction helps companies improve both their products and their customer service.

In this project, I aim to build a machine learning system that predicts a user's rating from the text of their review. Specifically, I will build models for the following two tasks.

1. Predict the user's sentiment (positive or negative).
2. Predict the user's product/service rating on a scale of 1 to 5.

I have already done the ETL in another notebook. Here, I will just load the data, prepare it for model training, validation, and testing, and describe the deep learning models employed. This notebook focuses only on rating prediction, i.e. multiclass classification. The modeling for sentiment analysis, i.e. binary classification, is described in another notebook.

So, let's just start with loading the required libraries.

In [1]:
import numpy as np
import pandas as pd
import gzip
import glob
import os
import re

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, GRU, Convolution1D, Flatten, Dropout
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint, EarlyStopping


Using TensorFlow backend.


## Loading the data from the CSV

In [4]:
df_ReviewRating = pd.read_csv('AmazonBookReviews_Ratings.csv')
df_ReviewRating.shape

(68563, 3)

In [5]:
lenthsStr = df_ReviewRating['reviewText'].apply(str).map(len)
maxlength = max(lenthsStr)
maxindex = lenthsStr[lenthsStr == maxlength].index[0]
print('The length of longest review text: '+ str(maxlength))
print('The index of review with maximum length: '+ str(maxindex))

The length of longest review text: 499
The index of review with maximum length: 844


## Splitting Data Set into Train, Validation and Test

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df_ReviewRating['reviewText'], df_ReviewRating['overall'], test_size=0.2, random_state=1)

In [7]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
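
The two calls above first hold out 20% of the rows for testing and then 20% of the remainder for validation, which works out to roughly a 64/16/20 split. The size arithmetic can be sketched as follows. The sketch assumes that scikit-learn rounds a fractional test size up to the next whole row, which agrees with the shapes printed in the cells that follow. The helper name `split_sizes` is only illustrative and is not part of the notebook.

```python
import math

def split_sizes(n_rows, test_fraction=0.2, val_fraction=0.2):
    """Return (train, validation, test) sizes for two successive splits."""
    # First split: hold out the test set (assumed to be rounded up).
    n_test = math.ceil(n_rows * test_fraction)
    n_remaining = n_rows - n_test
    # Second split: carve the validation set out of what is left.
    n_val = math.ceil(n_remaining * val_fraction)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# 68563 is the row count reported by df_ReviewRating.shape.
sizes = split_sizes(68563)
print(sizes)
```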

In [8]:
print(X_train.shape)
print(X_train[0:5])
print(y_train.shape)
print(y_train[0:5])

df_train = pd.concat([X_train, y_train], axis=1)
df_train.groupby('overall').count()/df_train.shape[0]*100

(43880,)
11362    This must be the worst series I have read by a...
21675    I can't stand all of this female inner drama i...
9518     She is a woman I have had admiration for. Her ...
13030    There's nothing more festive than Christmas in...
65150    I started with book one, I Love My Breakup bec...
Name: reviewText, dtype: object
(43880,)
11362    2.0
21675    4.0
9518     3.0
13030    5.0
65150    5.0
Name: overall, dtype: float64


| overall | reviewText (%) |
|---------|----------------|
| 1.0 | 3.099362 |
| 2.0 | 3.719234 |
| 3.0 | 9.307201 |
| 4.0 | 23.144941 |
| 5.0 | 60.729262 |


In [9]:
print(X_test.shape)
print(X_test[0:5])
print(y_test.shape)
print(y_test[0:5])
df_test = pd.concat([X_test, y_test], axis=1)
df_test.groupby('overall').count()/df_test.shape[0]*100

(13713,)
15006    I love this series!!!! If you haven't read it ...
28600    In my opinion, Jessica Gibson had a good story...
60398    I actually liked the first book better. I hate...
43146    It was another example of why I like reading H...
51397    The story is a bit ridiculous at times but alw...
Name: reviewText, dtype: object
(13713,)
15006    5.0
28600    2.0
60398    3.0
43146    5.0
51397    3.0
Name: overall, dtype: float64


| overall | reviewText (%) |
|---------|----------------|
| 1.0 | 3.223219 |
| 2.0 | 3.981623 |
| 3.0 | 9.319624 |
| 4.0 | 22.577117 |
| 5.0 | 60.898418 |


In [10]:
print(X_val.shape)
print(X_val[0:5])
print(y_val.shape)
print(y_val[0:5])
df_val = pd.concat([X_val, y_val], axis=1)
df_val.groupby('overall').count()/df_val.shape[0]*100

(10970,)
20329    I truly loved this story and the author manage...
65932    This part here damn. I must say it will cause ...
4703     This is a concise and direct read on how to st...
65888    I really like the information provided in this...
15641    I love this entire series so far...characters ...
Name: reviewText, dtype: object
(10970,)
20329    5.0
65932    5.0
4703     4.0
65888    4.0
15641    5.0
Name: overall, dtype: float64


| overall | reviewText (%) |
|---------|----------------|
| 1.0 | 3.354603 |
| 2.0 | 3.956244 |
| 3.0 | 9.908842 |
| 4.0 | 22.260711 |
| 5.0 | 60.519599 |


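We see that the class distribution is very similar across the train, validation, and test sets, so each split is representative of the full data set. The classes are also heavily imbalanced: about 60% of the reviews carry a five-star rating, while one-star and two-star reviews each make up less than 4%.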
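
With classes this uneven, a useful reference point for the accuracies reported later is a trivial classifier that always predicts the most frequent rating. The sketch below computes that baseline from the test-set percentages shown above. The variable names are illustrative only.

```python
# Share of each rating in the test set, in percent, copied from the output above.
test_share = {1.0: 3.223219, 2.0: 3.981623, 3.0: 9.319624, 4.0: 22.577117, 5.0: 60.898418}

# The majority class is the rating with the largest share.
majority_rating = max(test_share, key=test_share.get)

# Always predicting that rating is correct exactly as often as the rating occurs.
baseline_accuracy = test_share[majority_rating]
print("Always predicting %.0f stars gives %.2f%% accuracy" % (majority_rating, baseline_accuracy))
```

Any model worth keeping should clearly exceed this figure on the test set.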

## Data Preparation for Modeling

In this step, we first tokenize the text into words, map each word to an integer index, and then pad the resulting sequences to the same length.

### Tokenization

In [11]:
%%time
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_ReviewRating['reviewText'])


CPU times: user 2.71 s, sys: 52 ms, total: 2.76 s
Wall time: 2.84 s


In [12]:
vocab_size = len(tokenizer.word_index) + 1
vocab_size

46363
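
To make the `+ 1` above concrete, here is a simplified pure-Python sketch of the idea behind the word index: words are ranked by frequency starting at 1, and index 0 is left free for padding. This is an illustration only, not the Keras implementation; the real `Tokenizer` also strips punctuation, for example. The function names and the toy texts are made up for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, with the most frequent word at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Index 0 is deliberately unused so it can serve as the padding value.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_texts = ["good book", "good story good"]
toy_index = build_word_index(toy_texts)
toy_vocab_size = len(toy_index) + 1  # +1 accounts for the reserved index 0
toy_sequences = texts_to_ids(toy_texts, toy_index)
print(toy_index, toy_vocab_size, toy_sequences)
```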

In [13]:
%%time
sequence_train = tokenizer.texts_to_sequences(X_train)
sequence_test = tokenizer.texts_to_sequences(X_test)
sequence_val = tokenizer.texts_to_sequences(X_val)

CPU times: user 2.17 s, sys: 84 ms, total: 2.25 s
Wall time: 2.33 s


In [14]:
%%time
X_train_pad = pad_sequences(sequence_train, maxlen=maxlength)
X_test_pad = pad_sequences(sequence_test, maxlen=maxlength)
X_val_pad = pad_sequences(sequence_val, maxlen=maxlength)

CPU times: user 464 ms, sys: 308 ms, total: 772 ms
Wall time: 802 ms
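
By default, `pad_sequences` pads with zeros at the front and, for sequences that are too long, keeps the last `maxlen` items. Note that `maxlength` was measured in characters while the sequences hold one id per word, so 499 is a safe but generous upper bound. The default behaviour can be sketched in plain Python as follows; `pad_pre` is a hypothetical helper written for this illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]                      # drop items from the front if needed
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([7, 3, 9], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)  # [0, 0, 7, 3, 9]
print(long)   # [3, 4, 5, 6]
```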


Let us convert the label vectors to one-hot encoded matrices using to_categorical. Since to_categorical expects class labels to start from 0, I subtract 1 from each rating. A rating of one is now represented by class 0 and a rating of five by class 4.

In [15]:
y_train_label = to_categorical(np.asarray(y_train - 1))
print(y_train[0:5])
y_test_label = to_categorical(np.asarray(y_test - 1))
print(y_test[0:5])
y_val_label = to_categorical(np.asarray(y_val - 1))
print(y_val[0:5])

11362    2.0
21675    4.0
9518     3.0
13030    5.0
65150    5.0
Name: overall, dtype: float64
15006    5.0
28600    2.0
60398    3.0
43146    5.0
51397    3.0
Name: overall, dtype: float64
20329    5.0
65932    5.0
4703     4.0
65888    4.0
15641    5.0
Name: overall, dtype: float64


In [16]:
y_train_label[0:5]

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]], dtype=float32)
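
The encoding shown above can be reproduced with a few lines of plain Python. This is only a sketch of what the shift by one and the one-hot step do; `one_hot_ratings` is a hypothetical helper, not a Keras function.

```python
def one_hot_ratings(ratings, num_classes=5):
    """Turn ratings 1..num_classes into one-hot rows, with rating r at column r - 1."""
    rows = []
    for rating in ratings:
        row = [0.0] * num_classes
        row[int(rating) - 1] = 1.0   # shift so that rating 1 maps to class 0
        rows.append(row)
    return rows

# The first five training labels printed earlier.
encoded = one_hot_ratings([2.0, 4.0, 3.0, 5.0, 5.0])
for row in encoded:
    print(row)
```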

## Model Training and Evaluation

In this project, I use three different deep learning models. The models and their evaluation follow.

### GRU based Model

In [17]:
# embedding_dimensions =  vocab_size**0.25
embedding_dimensions = 100

Model definition ...

In [18]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength))
model.add(GRU(units=32, dropout=0.2))
model.add(Dense(5, activation='softmax'))
print(model.summary()) 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 499, 100)          4636300   
_________________________________________________________________
gru_1 (GRU)                  (None, 32)                12768     
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 165       
Total params: 4,649,233
Trainable params: 4,649,233
Non-trainable params: 0
_________________________________________________________________
None
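
The parameter counts in the summary can be checked by hand. The sketch below uses the classic GRU formulation with one bias vector per gate, which reproduces the 12,768 reported above; Keras versions that use a second set of recurrent biases would report a slightly larger number. The function names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def gru_params(input_dim, units):
    # Three weight blocks (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights, and one bias vector.
    return 3 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = [embedding_params(46363, 100), gru_params(100, 32), dense_params(32, 5)]
total = sum(counts)
print(counts, total)
```

Almost all of the parameters sit in the embedding layer, which is typical when the vocabulary is large relative to the recurrent layer.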


Model compilation and training ...

In [19]:
%%time
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

CPU times: user 44 ms, sys: 8 ms, total: 52 ms
Wall time: 61.9 ms
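
For reference, the `categorical_crossentropy` loss compares the softmax output of the last layer with the one-hot label. The following plain-Python sketch shows both pieces for a single example. It is an illustration of the formulas, not the Keras code, and the inputs are toy values.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

# A model that cannot tell the five classes apart outputs equal scores.
uniform = softmax([0.0] * 5)
loss_uniform = categorical_crossentropy([0, 1, 0, 0, 0], uniform)
print(uniform, loss_uniform)
```

A five-class model that spreads its probability evenly has a loss of ln 5, about 1.61, so a training loss near that value suggests the model has not learned anything yet.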


In [20]:
%%time
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label),
          callbacks = [checkpoint, early_stop])

NameError: name 'checkpoint' is not defined
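
The error arises because `checkpoint` and `early_stop` are never defined in this notebook, although `ModelCheckpoint` and `EarlyStopping` are imported; presumably they were meant to be instances of those two classes. The idea behind early stopping can be sketched in plain Python as below. This is a simplified illustration with a hypothetical `patience` setting and made-up validation losses, not the Keras implementation and not values from this experiment.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss       # new best validation loss: reset the counter
            waited = 0
        else:
            waited += 1       # no improvement in this epoch
            if waited >= patience:
                return epoch
    return None

toy_val_losses = [0.95, 0.90, 0.92, 0.93, 0.91]  # made-up values for illustration
stop_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(stop_epoch)
```

A checkpoint callback complements this by saving the weights from the best epoch, so the model can be restored to that state after training stops.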

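Model evaluation ...

Because the training cell above stopped with a NameError, the model evaluated here was never fitted. The accuracy below therefore reflects randomly initialized weights rather than a trained GRU.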

In [21]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 18.75%


### LSTM based Model

Model definition ...

In [22]:
model = Sequential() 
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength)) 
model.add(LSTM(100)) 
model.add(Dense(5, activation='softmax'))
print(model.summary()) 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 499, 100)          4636300   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 505       
Total params: 4,717,205
Trainable params: 4,717,205
Non-trainable params: 0
_________________________________________________________________
None
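
As a sanity check, the LSTM parameter count in the summary follows from the standard formula with four weight blocks. The function name below is illustrative.

```python
def lstm_params(input_dim, units):
    # Four weight blocks (input gate, forget gate, output gate, candidate cell state),
    # each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

embedding_count = 46363 * 100          # vocab_size * embedding_dimensions
lstm_count = lstm_params(100, 100)     # 100-dimensional input, 100 LSTM units
dense_count = 100 * 5 + 5              # weights plus biases of the softmax layer
lstm_total = embedding_count + lstm_count + dense_count
print(lstm_count, dense_count, lstm_total)
```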


Model compilation and training ...

In [23]:
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])

In [24]:
%%time
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label))

Train on 43880 samples, validate on 10970 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 2h 36min 53s, sys: 1h 37min 2s, total: 4h 13min 56s
Wall time: 7h 40min 33s


<keras.callbacks.History at 0x7fb752c8ab00>

Model evaluation ...

In [25]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 63.93%


### 1D CNN based Model 

Model definition ...

In [26]:
model = Sequential() 
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength)) 
model.add(Convolution1D(64, 3, padding='same'))
model.add(Convolution1D(32, 3, padding='same'))
model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(5, activation='softmax'))
print(model.summary()) 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 499, 100)          4636300   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 499, 64)           19264     
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 499, 32)           6176      
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 499, 16)           1552      
_________________________________________________________________
flatten_1 (Flatten)          (None, 7984)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 7984)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 180)               1437300   
__________
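
The printed summary is cut off after `dense_3`, but the numbers it does show can be reproduced by hand. The sketch below assumes the usual counting rule for a 1D convolution (kernel size times input channels times filters, plus one bias per filter). The function name is illustrative.

```python
def conv1d_params(in_channels, filters, kernel_size):
    # Each filter spans `kernel_size` positions across all input channels, plus a bias.
    return kernel_size * in_channels * filters + filters

conv_counts = [
    conv1d_params(100, 64, 3),  # conv1d_1: input is the 100-dimensional embedding
    conv1d_params(64, 32, 3),   # conv1d_2
    conv1d_params(32, 16, 3),   # conv1d_3
]

# padding='same' keeps all 499 positions, so Flatten yields 499 * 16 values.
flattened = 499 * 16
dense_3_count = flattened * 180 + 180
print(conv_counts, flattened, dense_3_count)
```

Because the sequence length is never reduced, the first dense layer alone holds more than 1.4 million weights, far more than the three convolution layers combined.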

Model compilation and training ...

In [27]:
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy']) 

In [28]:
%%time
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label))

Train on 43880 samples, validate on 10970 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 1h 56s, sys: 0 ns, total: 1h 56s
Wall time: 1h 22min 25s


<keras.callbacks.History at 0x7fb6782085c0>

Model evaluation ...

In [29]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 59.06%


## Conclusion

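We have trained three different deep learning models to predict user ratings from the text of their reviews. On the test set, the LSTM model reached 63.93% accuracy and the 1D CNN model 59.06%. The 18.75% reported for the GRU model comes from a model whose training cell failed with a NameError, so it is not a meaningful result. For context, 60.90% of the test reviews carry a five-star rating, so the LSTM model improves only slightly on always predicting five stars, and the 1D CNN model falls below that baseline.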