# Sentiment Analysis and Rating Prediction From The Review Text

Sentiment analysis and rating prediction are among the important machine learning topics that help companies find out whether users are happy or unhappy with the product or service provided. Users write reviews of products and services on various platforms, such as social networking websites like Facebook and Twitter, blogs, and service-offering websites. Analyzing such reviews to gauge customer satisfaction helps companies improve their products as well as their customer service.

In this project, I aim to build a machine learning system that predicts a user's rating from the text of their review. Specifically, I will build models for the following tasks.

1. Predict the user's sentiment (positive or negative).
2. Predict the user's product/service rating on a scale of 1 to 5.

I have already done the ETL in a separate notebook. Here, I will load the data, prepare it for model training, validation, and testing, and describe the deep learning models employed. This notebook focuses only on sentiment prediction, i.e. binary classification. Rating prediction, i.e. multiclass classification, is covered in another notebook.

So, let's just start with loading the required libraries.

In [3]:
import numpy as np
import pandas as pd
import gzip
import glob
import os
import re

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

In [5]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, GRU, Convolution1D, Flatten, Dropout
from keras.utils.np_utils import to_categorical
from keras.callbacks import ModelCheckpoint, EarlyStopping


Using TensorFlow backend.


## Loading the data from CSV

In [59]:
df_sentiments = pd.read_csv('AmazonBookReviews_Sentiment.csv')
df_sentiments.shape

(440262, 3)

In [61]:
# Character length of each review (not the number of tokens)
lengths_str = df_sentiments['reviewText'].apply(str).map(len)
maxlength = max(lengths_str)
maxindex = lengths_str[lengths_str == maxlength].index[0]
print('The length of longest review text: '+ str(maxlength))
print('The index of review with maximum length: '+ str(maxindex))

The length of longest review text: 499
The index of review with maximum length: 93


## Splitting Data Set into Train, Validation and Test

In [62]:
X_train, X_test, y_train, y_test = train_test_split(df_sentiments['reviewText'], df_sentiments['sentiment'], test_size=0.2, random_state=1)

In [63]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
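
Because the second call splits the already reduced training portion, the two `test_size=0.2` values do not produce an 80/10/10 split but roughly 64/16/20. The arithmetic below is a minimal sketch of that, assuming a fractional `test_size` is rounded up to a whole number of rows; under that assumption it reproduces the shapes printed in this notebook.

```python
import math

n_total = 440262      # rows in df_sentiments, from the shape printed in this notebook
test_size = 0.2       # value passed to both train_test_split calls

# Assumption: the held-out count is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_total)
n_remaining = n_total - n_test
n_val = math.ceil(test_size * n_remaining)
n_train = n_remaining - n_val

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name:<10} {n:>7} rows  {n / n_total:.1%}")
```

If class proportions had to be matched exactly across the splits, `train_test_split` also accepts a `stratify` argument; here the random split already gives very similar proportions.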

In [64]:
print(X_train.shape)
print(X_train[0:5])
print(y_train.shape)
print(y_train[0:5])

df_train = pd.concat([X_train, y_train], axis=1)
df_train.groupby('sentiment').count()/df_train.shape[0]*100

(281767,)
29672     I LOVED Brown River Queen.  However, because i...
305974    If you like Dave Barry's sense of humor, then ...
377265    OMG I love this series its got romance,action,...
277198    GREAT STORY, so I bought it for my kindle.  Th...
363491    Surprising ending.  I liked the book and chara...
Name: reviewText, dtype: object
(281767,)
29672     1.0
305974    1.0
377265    1.0
277198    1.0
363491    1.0
Name: sentiment, dtype: float64


           reviewText
sentiment
0.0          8.093922
1.0         91.906078


In [65]:
print(X_test.shape)
print(X_test[0:5])
print(y_test.shape)
print(y_test[0:5])
df_test = pd.concat([X_test, y_test], axis=1)
df_test.groupby('sentiment').count()/df_test.shape[0]*100

(88053,)
186694    Intresting mix of vocations, and how people ca...
17142     John Holt's writings are very engrossing and e...
302953    Enjoyable and thought-provoking. This was my H...
159171    Interesting idea, not well executed.  There wa...
155024    I started reading the Slavers Wars series by M...
Name: reviewText, dtype: object
(88053,)
186694    1.0
17142     1.0
302953    1.0
159171    0.0
155024    1.0
Name: sentiment, dtype: float64


           reviewText
sentiment
0.0          8.182572
1.0         91.817428


In [66]:
print(X_val.shape)
print(X_val[0:5])
print(y_val.shape)
print(y_val[0:5])
df_val = pd.concat([X_val, y_val], axis=1)
df_val.groupby('sentiment').count()/df_val.shape[0]*100

(70442,)
374739    This is a funny and illuminating book, focused...
119343    Great read. Loved this story of three sisters ...
9476      The insights and stories really help you to sl...
250850    Wow! Odd has always been my favorite. This sho...
41347     I've been a Kennedy fan for years, their flaws...
Name: reviewText, dtype: object
(70442,)
374739    1.0
119343    1.0
9476      1.0
250850    1.0
41347     1.0
Name: sentiment, dtype: float64


           reviewText
sentiment
0.0           8.28483
1.0          91.71517


We see that the distribution of classes in the train, validation and test sets is representative of the real data set: in each split, roughly 92% of the reviews are positive and 8% are negative.
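
With such an imbalance, it helps to know what a trivial model would score and how much the minority class would need to be up-weighted. The sketch below uses only the percentages printed above. The "balanced" weighting formula is a common heuristic that I am assuming here for illustration; it is not something this notebook applies.

```python
# Class shares in the training split, taken from the percentages printed above.
train_share = {0: 0.08093922, 1: 0.91906078}
# Share of positive reviews in the test split, also from the output above.
test_positive_share = 0.91817428

# A model that always predicts "positive" is right whenever the review is positive.
baseline_accuracy = test_positive_share

# "Balanced" heuristic: n_samples / (n_classes * n_samples_in_class),
# which reduces to 1 / (n_classes * class_share).
n_classes = len(train_share)
class_weight = {label: 1.0 / (n_classes * share) for label, share in train_share.items()}

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
print("Balanced class weights:", {k: round(v, 3) for k, v in class_weight.items()})
```

Any test accuracy reported later should be read against this baseline. Weights like these could be supplied through the `class_weight` argument of `fit`, or the negative class could be oversampled instead, for example with the imported `resample` helper.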

## Data Preparation for Modeling

In this step, we first tokenize the text into words and then convert each review into an integer sequence of the same length.

### Tokenization

In [67]:
%%time
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_sentiments['reviewText'])


CPU times: user 19.5 s, sys: 314 ms, total: 19.8 s
Wall time: 19.8 s


In [68]:
vocab_size = len(tokenizer.word_index) + 1
vocab_size

148508
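
The `+ 1` is there because word indices start at 1, leaving index 0 free to act as the padding value. The pure-Python sketch below imitates the idea behind the tokenizer on two made-up sentences. The regular expression and the helper names `build_word_index` and `texts_to_sequences_sketch` are my own simplifications, not the Keras implementation.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"[a-z0-9']+"   # simplified word pattern, an assumption of this sketch

def build_word_index(texts):
    """Map each word to an integer rank, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(TOKEN_PATTERN, text.lower()))
    # Index 0 is deliberately left unused so it can serve as the padding value.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences_sketch(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in re.findall(TOKEN_PATTERN, t.lower()) if w in word_index]
            for t in texts]

toy_texts = ["I loved this book", "This book was boring"]   # made-up examples
toy_index = build_word_index(toy_texts)
toy_vocab_size = len(toy_index) + 1                          # +1 for the reserved index 0
toy_sequences = texts_to_sequences_sketch(toy_texts, toy_index)
print(toy_index)
print(toy_vocab_size, toy_sequences)
```

Note that the tokenizer in this notebook is fitted on all of `df_sentiments`, so words that occur only in the validation or test reviews also receive an index.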

In [69]:
%%time
sequence_train = tokenizer.texts_to_sequences(X_train)
sequence_test = tokenizer.texts_to_sequences(X_test)
sequence_val = tokenizer.texts_to_sequences(X_val)

CPU times: user 14.5 s, sys: 604 ms, total: 15.1 s
Wall time: 15.1 s


In [70]:
%%time
X_train_pad = pad_sequences(sequence_train, maxlen=maxlength)
X_test_pad = pad_sequences(sequence_test, maxlen=maxlength)
X_val_pad = pad_sequences(sequence_val, maxlen=maxlength)

CPU times: user 4.34 s, sys: 1.45 s, total: 5.78 s
Wall time: 5.77 s
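
By default, `pad_sequences` adds zeros at the front of short sequences and cuts long ones from the front. The helper `pad_pre` below is a hypothetical pure-Python sketch of that default behavior on made-up token ids, not the library code.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy = [[3, 4, 1, 2], [5, 6], [7, 8, 9, 1, 2, 3]]         # made-up token ids
toy_padded = pad_pre(toy, maxlen=5)
print(toy_padded)
```

In this notebook `maxlength` is 499, which is the character length of the longest review rather than a token count. A review cannot have more tokens than characters, so nothing is truncated, but most rows will consist largely of leading zeros. A smaller `maxlen` based on token counts would probably shrink the inputs considerably.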


Let us convert the labels vector to a matrix (one-hot encoded) using to_categorical.

In [71]:
y_train_label = to_categorical(np.asarray(y_train))
print(y_train[0:5])
y_test_label = to_categorical(np.asarray(y_test))
print(y_test[0:5])
y_val_label = to_categorical(np.asarray(y_val))
print(y_val[0:5])

29672     1.0
305974    1.0
377265    1.0
277198    1.0
363491    1.0
Name: sentiment, dtype: float64
186694    1.0
17142     1.0
302953    1.0
159171    0.0
155024    1.0
Name: sentiment, dtype: float64
374739    1.0
119343    1.0
9476      1.0
250850    1.0
41347     1.0
Name: sentiment, dtype: float64


In [72]:
y_train_label[0:5]

array([[ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])
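
Each row has a 1 in the column of its class, so `[0., 1.]` stands for sentiment 1.0 (positive) and `[1., 0.]` for sentiment 0.0 (negative). The helper `one_hot` below is a hypothetical pure-Python sketch of the same encoding, shown only to make the mapping explicit.

```python
def one_hot(labels, num_classes=None):
    """Turn integer-like class labels into rows containing a single 1.0."""
    ids = [int(label) for label in labels]
    if num_classes is None:
        num_classes = max(ids) + 1       # infer the class count from the largest label
    return [[1.0 if column == i else 0.0 for column in range(num_classes)] for i in ids]

encoded = one_hot([1.0, 1.0, 0.0])       # two positive reviews, one negative
print(encoded)
```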

## Model Training and Evaluation

In this project, I decided to use three different deep learning models. The models and their evaluation follow.

In [41]:
callback_list = [EarlyStopping(), ModelCheckpoint('weights.{epoch:02d}-{val_loss:.2f}.hdf5')]
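
With its default settings, `EarlyStopping` watches the validation loss with zero patience, and `ModelCheckpoint` fills the file name pattern with the epoch number and validation loss after every epoch. The sketch below imitates both on made-up loss values. The function `first_stopping_epoch` is my own simplification of the stopping rule, not the Keras code.

```python
checkpoint_pattern = 'weights.{epoch:02d}-{val_loss:.2f}.hdf5'   # pattern from the cell above

def first_stopping_epoch(val_losses):
    """Return the 1-based epoch with the first non-improving validation loss, or None."""
    best = float('inf')
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
        else:
            return epoch          # zero patience: stop at the first epoch without improvement
    return None

made_up_val_losses = [0.25, 0.21, 0.22, 0.20]    # illustrative values, not results
stop_epoch = first_stopping_epoch(made_up_val_losses)
for epoch, loss in enumerate(made_up_val_losses[:stop_epoch], start=1):
    print(checkpoint_pattern.format(epoch=epoch, val_loss=loss))
```

In this made-up run training stops after the third epoch, although the lowest validation loss was reached in the second. The model left in memory is therefore not the best one, which is why the checkpoint files are useful. Note also that only the GRU model's `fit` call in this notebook receives `callback_list`.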

### GRU based Model

In [42]:
# embedding_dimensions =  vocab_size**0.25
embedding_dimensions = 100

Model definition ...

In [43]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength))
model.add(GRU(units=32, dropout=0.2))
model.add(Dense(2, activation='sigmoid'))
model.summary()  # summary() prints the table itself and returns None

Model compilation and training ...

In [44]:
%%time
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
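
The output layer uses two sigmoid units with `binary_crossentropy` on one-hot labels. This trains, but it treats the two outputs as independent yes/no questions, so the predicted scores need not sum to 1. The sketch below contrasts this with softmax on made-up pre-activation values. It is only an illustration of the difference, not a claim about which choice performs better on this data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    shift = max(xs)                               # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0]    # made-up pre-activation values for (negative, positive)
independent = [sigmoid(x) for x in logits]
coupled = softmax(logits)
print("sigmoid:", independent, "sum =", sum(independent))
print("softmax:", coupled, "sum =", sum(coupled))
```

Two common alternatives would be `Dense(2, activation='softmax')` with `categorical_crossentropy`, or a single `Dense(1, activation='sigmoid')` unit trained directly on the 0/1 labels. As far as I know, pairing `binary_crossentropy` with `metrics=['accuracy']` also makes Keras compute accuracy per output unit rather than per review, which is worth keeping in mind when reading the scores.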

In [45]:
%%time
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label),
          callbacks=callback_list)

Model evaluation ...

In [46]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))
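
Since about 92% of the test reviews are positive, accuracy alone says little about how well negative reviews are detected. Precision and recall for the negative class are more informative. The sketch below computes them on made-up labels; in the notebook, predicted classes could be obtained by taking the arg-max over the two columns of `model.predict(X_test_pad)`. The function name `negative_class_report` is hypothetical.

```python
def negative_class_report(y_true, y_pred, negative=0):
    """Accuracy plus precision and recall of the negative class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == negative and p == negative)
    fp = sum(1 for t, p in pairs if t != negative and p == negative)
    fn = sum(1 for t, p in pairs if t == negative and p != negative)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Made-up labels: 10 reviews, 2 of them negative; the "model" finds only one.
y_true_toy = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
report = negative_class_report(y_true_toy, y_pred_toy)
print(report)
```

In this toy case the accuracy is 90% even though half of the negative reviews are missed.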

### 1D CNN based Model

Model definition ...

In [47]:
model = Sequential() 
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength)) 
model.add(Convolution1D(64, 3, padding='same'))
model.add(Convolution1D(32, 3, padding='same'))
model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(2, activation='sigmoid'))
model.summary()  # summary() prints the table itself and returns None
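
The size of this network can be worked out by hand from the layer definitions. The sketch below does so using the vocabulary size (148508), embedding dimension (100) and sequence length (499) from earlier in this notebook. It assumes stride 1 and one bias per filter or unit, which I believe are the defaults.

```python
vocab_size, embedding_dim, seq_len = 148508, 100, 499   # values from earlier cells

def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) plus one bias per filter.
    return (kernel_size * in_channels + 1) * filters

def dense_params(in_features, units):
    # One weight per input feature plus one bias, for every unit.
    return (in_features + 1) * units

layers = [("embedding", vocab_size * embedding_dim)]
channels = embedding_dim
for filters in (64, 32, 16):
    layers.append((f"conv1d_{filters}", conv1d_params(channels, filters, 3)))
    channels = filters
flat = seq_len * channels            # padding='same' keeps all 499 time steps
layers.append(("dense_180", dense_params(flat, 180)))
layers.append(("dense_2", dense_params(180, 2)))

cnn_total = sum(n for _, n in layers)
for name, n in layers + [("total", cnn_total)]:
    print(f"{name:<12}{n:>12,}")
```

Most of the parameters sit in the embedding matrix and in the first dense layer after `Flatten`. Also, the three convolution layers are defined without an activation function, so stacked together they amount to a single linear filter over the embeddings.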

Model compilation and training ...

In [None]:
model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy']) 

In [48]:
%%time
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label))

Model evaluation ...

In [49]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

### LSTM based Model

Model definition ...

In [None]:
model = Sequential() 
model.add(Embedding(vocab_size, embedding_dimensions, input_length=maxlength)) 
model.add(LSTM(100)) 
model.add(Dense(2, activation='sigmoid'))
model.summary()  # summary() prints the table itself and returns None
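
For comparison with the GRU model, the recurrent layer sizes can be estimated with the usual gate-counting formula: an LSTM has four gate-like weight blocks and a GRU has three. The sketch below assumes one bias vector per gate. Some GRU implementations keep a second bias per gate, which would add a few more parameters.

```python
def recurrent_params(input_dim, units, n_gates):
    # Each gate has an input kernel, a recurrent kernel and one bias vector.
    return n_gates * (units * (input_dim + units) + units)

embedding_dim = 100                                                    # input size per time step
lstm_params = recurrent_params(embedding_dim, units=100, n_gates=4)   # LSTM(100)
gru_params = recurrent_params(embedding_dim, units=32, n_gates=3)     # GRU(units=32)
print("LSTM(100):", lstm_params)
print("GRU(32):  ", gru_params)
```

Either way, the recurrent layers are small next to the embedding matrix of `vocab_size * embedding_dimensions` weights that all three models share.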

Model compilation and training ...

In [None]:
model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy']) 

In [None]:
%%time
model.fit(X_train_pad, y_train_label,
          batch_size=128,
          epochs=5,
          validation_data=(X_val_pad, y_val_label))

Model evaluation ...

In [None]:
scores = model.evaluate(X_test_pad, y_test_label, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

## Conclusion

We have trained three different deep learning models (GRU, 1D CNN and LSTM) to predict the sentiment, positive or negative, of a review from its text.