# Sentiment Analysis and Rating Prediction From The Review Text

Sentiment analysis and rating prediction are important machine learning topics that help companies find out whether users are happy or unhappy with the service or product provided. Users write reviews of products and services on various platforms, such as social networking websites like Facebook and Twitter, blogs, and service-offering websites. Analysing such reviews to measure customer satisfaction helps companies improve both their products and their customer service.

In this project, I aim to build a machine learning system that predicts a user's rating from their text review. More precisely, I will build models for the following tasks.

1. Predict the user's sentiment (positive or negative).
2. Predict the user's product/service rating on a scale of 1 to 5.

In the following, I describe the data set I have used for the purpose, detail the ETL performed and explain the deep learning model employed.

## The Data Set

To build and test the model for the project, I decided to work with Amazon's Books review data set, freely available at http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_5.json.gz. The data set contains about 8.9 million instances in which users wrote reviews of books and provided their ratings. Each review record has the fields helpful, reviewText, overall, reviewerName, unixReviewTime, reviewerID, asin, reviewTime and summary.

## Import the required libraries

In [2]:
# !pip uninstall keras --yes
# !pip install keras==2.1.2 

In [3]:
import numpy as np
import pandas as pd
import gzip
import glob
import os
import re

## ETL

Let us first download the data set.

In [None]:
!wget -O reviews_Books.json.gz http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_5.json.gz

The second step is to parse the data set into a data frame.

In [6]:
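import ast

def parse(path):
  # Yield one review record (a dict) per line of the gzipped file
  with gzip.open(path, 'rb') as g:
    for l in g:
      # literal_eval accepts only Python literals, so it is safer than eval
      yield ast.literal_eval(l.decode('utf-8'))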

Because of the memory limits of the free tier of the IBM Cloud data platform, I got a memory error when trying to read the whole data set into a data frame at once. So I loaded the data manually in six steps, restarting the kernel after each step. In each step, I saved the loaded chunk to a csv file, which gave a total of six csv files.
1. AmazonBooksReviews1.csv
2. AmazonBooksReviews2.csv
3. AmazonBooksReviews3.csv
4. AmazonBooksReviews4.csv
5. AmazonBooksReviews5.csv
6. AmazonBooksReviews6.csv

In [7]:
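def getDF(path, startIndex, recordCount):
  # Load records startIndex .. recordCount-1 (0-based) into a data frame.
  # Note: recordCount is the end index of the chunk, not the chunk size.
  i = 0
  counter = 0
  df = {}
  for d in parse(path):
    # Skip the records that belong to earlier chunks
    if counter < startIndex:
        counter += 1
        continue
    df[i] = d
    i += 1
    # Report progress occasionally instead of once per record
    if i % 100000 == 0:
        print(i)
    if i + startIndex >= recordCount:
        break
  return pd.DataFrame.from_dict(df, orient='index')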

In [8]:
df = getDF('reviews_Books.json.gz',7900001,8900000)
df.to_csv('AmazonBooksReviews6.csv')
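
The six manual steps could in principle be scripted so that only one chunk is held in memory at a time. The following is a minimal sketch, not the procedure used in this notebook: `export_in_chunks`, its arguments and the output naming are hypothetical, and whether a given chunk size fits into the free tier's memory is untested.

```python
import ast
import gzip
import itertools

import pandas as pd


def export_in_chunks(path, chunk_size, prefix):
    """Write consecutive chunks of a gzipped review file to numbered CSV files.

    Hypothetical helper: name, arguments and file naming are illustrative.
    """
    written = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Each line is assumed to hold one review as a dict literal.
        records = (ast.literal_eval(line) for line in handle)
        for part in itertools.count(1):
            # Pull at most chunk_size records from the stream.
            chunk = list(itertools.islice(records, chunk_size))
            if not chunk:
                break
            out_name = f"{prefix}{part}.csv"
            pd.DataFrame(chunk).to_csv(out_name, index=False)
            written.append(out_name)
    return written
```

A call such as `export_in_chunks('reviews_Books.json.gz', 1500000, 'AmazonBooksReviews')` would produce files named like the ones listed above. Unlike the cells above, this sketch writes the files with `index=False`, so they would not contain the extra index column.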

The next step is to read all the csv files into a single data frame.

In [6]:
path =os.getcwd() 
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()
allFiles

['/gpfs/global_fs01/sym_shared/YPProdSpark/user/s551-0e069dd8b32632-b79326335986/notebook/work/AmazonBooksReviews1.csv',
 '/gpfs/global_fs01/sym_shared/YPProdSpark/user/s551-0e069dd8b32632-b79326335986/notebook/work/AmazonBooksReviews2.csv',
 '/gpfs/global_fs01/sym_shared/YPProdSpark/user/s551-0e069dd8b32632-b79326335986/notebook/work/AmazonBooksReviews3.csv',
 '/gpfs/global_fs01/sym_shared/YPProdSpark/user/s551-0e069dd8b32632-b79326335986/notebook/work/AmazonBooksReviews4.csv',
 '/gpfs/global_fs01/sym_shared/YPProdSpark/user/s551-0e069dd8b32632-b79326335986/notebook/work/AmazonBooksReviews5.csv',
 '/gpfs/global_fs01/sym_shared/YPProdSpark/user/s551-0e069dd8b32632-b79326335986/notebook/work/AmazonBooksReviews6.csv']

In [9]:
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)

df_complete = pd.concat(list_, axis = 0, ignore_index = True)
print(df_complete.shape)
print(df_complete.columns)

(8898036, 10)
Index(['Unnamed: 0', 'helpful', 'reviewText', 'overall', 'reviewerName',
       'unixReviewTime', 'reviewerID', 'asin', 'reviewTime', 'summary'],
      dtype='object')
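
The `Unnamed: 0` column in the output above appears because `to_csv` writes the row index by default and `read_csv` then reads it back as an ordinary, unnamed column. The small example below illustrates this on made-up data (the two sample reviews are not from the data set) and shows that `index=False` avoids it.

```python
import io

import pandas as pd

# Toy frame standing in for one chunk of reviews (values are made up).
frame = pd.DataFrame({"reviewText": ["Great book", "Too long"],
                      "overall": [5.0, 2.0]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = list(pd.read_csv(without_index).columns)

print(cols_default)  # ['Unnamed: 0', 'reviewText', 'overall']
print(cols_clean)    # ['reviewText', 'overall']
```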


Next, we drop all the columns except those holding the reviews (reviewText) and the ratings (overall).

In [10]:
df_ReviewRating = df_complete[['reviewText','overall']]
df_ReviewRating.shape

(8898036, 2)
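
Since only two columns are needed, one alternative is to ask `read_csv` for just those columns, so the other eight are never loaded. This is a sketch under that assumption rather than what the cells above do; `load_review_columns` is a hypothetical helper name.

```python
import io

import pandas as pd


def load_review_columns(sources, columns=("reviewText", "overall")):
    """Read only the given columns from each CSV source and concatenate them.

    Hypothetical helper; `sources` may be file paths or file-like objects.
    """
    frames = [pd.read_csv(src, usecols=list(columns)) for src in sources]
    return pd.concat(frames, axis=0, ignore_index=True)


# Two tiny in-memory CSV files with made-up rows, mimicking the saved chunks.
part1 = io.StringIO('Unnamed: 0,helpful,reviewText,overall\n0,"[0, 0]",Loved it,5.0\n')
part2 = io.StringIO('Unnamed: 0,helpful,reviewText,overall\n0,"[1, 2]",Not for me,2.0\n')
df_small = load_review_columns([part1, part2])
print(df_small.shape)
```

With the real files, `load_review_columns(allFiles)` would play the role of the read, concatenate and select steps above.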

Let us look at the distribution of ratings in the data set (as a percentage of all reviews).

In [13]:
df_ReviewRating.groupby('overall').count()/df_ReviewRating.shape[0]*100

| overall | reviewText (%) |
|---------|----------------|
| 1.0 | 3.639208 |
| 2.0 | 4.665108 |
| 3.0 | 10.734571 |
| 4.0 | 24.982951 |
| 5.0 | 55.97152 |
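
The same percentages can also be computed with `value_counts(normalize=True)`, which returns each rating's share directly. A minimal sketch on made-up ratings follows; `rating_share` is a hypothetical helper name.

```python
import pandas as pd


def rating_share(df, column="overall"):
    """Return the percentage of rows per rating, ordered by rating."""
    return df[column].value_counts(normalize=True).sort_index() * 100


# Made-up ratings, only to show the shape of the result.
demo = pd.DataFrame({"overall": [5.0, 5.0, 4.0, 3.0, 1.0]})
share = rating_share(demo)
print(share)
```

Note that `value_counts` counts rows by rating regardless of whether `reviewText` is missing, whereas `groupby(...).count()` counts only non-null review texts, so the two can differ slightly when some reviews are empty.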


In [11]:
print(df_ReviewRating['reviewText'].apply(str).map(len).max())

32658


The longest review in the data frame has 32658 characters. Allowing reviews of that length exhausts the free tier's resources. Therefore, I decided to keep only the reviews that are shorter than 500 characters (and longer than 10 characters).

In [12]:
# Keep only the reviews that are shorter than 500 and longer than 10 characters
df_ReviewRatingFiltered = df_ReviewRating[df_ReviewRating['reviewText'].apply(str).map(len)<500]
df_ReviewRatingFiltered = df_ReviewRatingFiltered[df_ReviewRatingFiltered['reviewText'].apply(str).map(len)>10]
df_ReviewRatingFiltered.shape[0]

4846446
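
The two filtering steps can be folded into one by computing the lengths once and using `Series.between`. The sketch below assumes the same bounds as above (more than 10 and fewer than 500 characters, i.e. 11 to 499 inclusive); `filter_by_length` is a hypothetical helper name and the sample reviews are made up.

```python
import pandas as pd


def filter_by_length(df, column="reviewText", min_chars=11, max_chars=499):
    """Keep rows whose text length lies in [min_chars, max_chars]."""
    # Convert to str first so that missing values do not raise an error.
    lengths = df[column].astype(str).str.len()
    return df[lengths.between(min_chars, max_chars)]


demo = pd.DataFrame({
    "reviewText": ["short", "x" * 50, "y" * 500, None],
    "overall": [4.0, 5.0, 1.0, 3.0],
})
kept = filter_by_length(demo)
print(kept.shape[0])  # 1
```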

This leaves us with about 4.8 million instances. Even these are too many to model on the free tier, so in the later stages I will use only about 10 percent of them.

In [14]:
df_finalRating = df_ReviewRatingFiltered.sample(frac=0.1)
df_finalRating = df_finalRating.reset_index(drop=True)
print(df_finalRating.shape)
df_finalRating.groupby('overall').count()/df_finalRating.shape[0]*100

(484645, 2)


| overall | reviewText (%) |
|---------|----------------|
| 1.0 | 3.432822 |
| 2.0 | 3.963726 |
| 3.0 | 9.157837 |
| 4.0 | 22.458088 |
| 5.0 | 60.987527 |


The distribution is roughly similar to that of the original data set, although the share of five-star reviews rises from about 56% to about 61%. We will save this data set as a csv file, to be used later for rating prediction.
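
If the sample should match the rating distribution of the filtered data exactly, and be reproducible across runs, a stratified sample with a fixed seed is one option. This is an alternative to the plain `sample(frac=0.1)` used above, not what the notebook does; it assumes pandas 1.1 or later (for `groupby(...).sample`), and `stratified_sample` is a hypothetical helper name.

```python
import pandas as pd


def stratified_sample(df, by, frac, seed):
    """Sample the same fraction from every group of `by`, reproducibly."""
    sampled = df.groupby(by).sample(frac=frac, random_state=seed)
    return sampled.reset_index(drop=True)


# Made-up data: 100 five-star and 20 one-star reviews.
demo = pd.DataFrame({
    "overall": [5.0] * 100 + [1.0] * 20,
    "reviewText": ["some review text"] * 120,
})
sample = stratified_sample(demo, "overall", 0.1, seed=0)
print(sample["overall"].value_counts())
```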

In [15]:
df_finalRating.to_csv('AmazonBookReviews_Ratings.csv')

#### Preparation for sentiment analysis

For sentiment analysis, I remove the middle rating, i.e. 3. Ratings above three are considered happy (represented by 1) and ratings below three unhappy (represented by 0).

In [21]:
# Take an explicit copy so that the assignments below do not trigger SettingWithCopyWarning
df_sentiments = df_finalRating[~ (df_finalRating.overall==3)].copy()
df_sentiments.loc[df_sentiments.overall < 3.0, 'sentiment'] = 0
df_sentiments.loc[df_sentiments.overall > 3.0, 'sentiment'] = 1
df_sentiments = df_sentiments[['reviewText','sentiment']]


In [22]:
print(df_sentiments.shape)
df_sentiments.groupby('sentiment').count()/df_sentiments.shape[0]*100

(440262, 2)


| sentiment | reviewText (%) |
|-----------|----------------|
| 0.0 | 8.142197 |
| 1.0 | 91.857803 |
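
The labelling rule described above (drop rating 3, map ratings above 3 to 1 and below 3 to 0) can also be written as a small function. The sketch below is an equivalent formulation on made-up data, not the notebook's own cell; `label_sentiment` is a hypothetical helper name, and it yields integer labels rather than the floats shown in the output above.

```python
import numpy as np
import pandas as pd


def label_sentiment(df, rating_col="overall", text_col="reviewText"):
    """Drop neutral (3-star) reviews and add a binary sentiment column."""
    # .copy() makes the result independent of the original frame.
    labelled = df[df[rating_col] != 3].copy()
    labelled["sentiment"] = np.where(labelled[rating_col] > 3, 1, 0)
    return labelled[[text_col, "sentiment"]]


demo = pd.DataFrame({
    "reviewText": ["loved it", "good", "okay", "weak", "awful"],
    "overall": [5.0, 4.0, 3.0, 2.0, 1.0],
})
labels = label_sentiment(demo)
print(labels["sentiment"].tolist())  # [1, 1, 0, 0]
```

With about 92% of the labels positive in the output above, the two classes are strongly imbalanced, which is worth keeping in mind when the models are evaluated.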


Now, let's save the data frame to a csv file.

In [24]:
df_sentiments.to_csv('AmazonBookReviews_Sentiment.csv')

This completes the ETL, with clean copies of the data saved in csv files. The data preparation for the modeling task, the modeling itself and the evaluation are done in other notebooks.