<a href="https://colab.research.google.com/github/kkarimi62/IBM-Machine-Learning-Professional-Certificate/blob/main/deepLearningProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of Amazon Product Reviews Using Deep Learning

## Introduction

This project will classify the sentiment of Amazon product reviews as positive or negative. The dataset contains a large number of reviews (approximately 100,000) from Amazon with associated labels.


## Import Libraries and Define Auxiliary Functions
We import the following libraries:


In [1]:
import numpy as np
import pandas as pd
from google.colab import files
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

## Dataset Source

We use the Amazon Customer Reviews Dataset, available on [Kaggle](https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset/data), a collection of reviews written in the Amazon.com marketplace between 1995 and 2015, together with associated metadata. More specifically, we use the subset `amazon_reviews_us_Mobile_Electronics_v1_00.tsv.zip`, which covers the Mobile Electronics category, for sentiment analysis.

The dataset consists of the following columns:
* `marketplace`: 2 letter country code of the marketplace where the review was written.
* `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
* `review_id`: The unique ID of the review.
* `product_id`: The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id.
* `product_parent`: Random identifier that can be used to aggregate reviews for the same product.
* `product_title`: Title of the product.
* `product_category`: Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).
* `star_rating`: The 1-5 star rating of the review.
* `helpful_votes`: Number of helpful votes.
* `total_votes`: Number of total votes the review received.
* `vine`: Review was written as part of the Vine program.
* `verified_purchase`: The review is on a verified purchase.
* `review_headline`: The title of the review.
* `review_body`: The review text.
* `review_date`: The date the review was written.

### Data Collection

We obtain the Amazon Reviews dataset from Kaggle, upload the zip archive to the Colab session, and unzip it.


In [3]:
uploaded = files.upload()

#--- unzip
!unzip amazon_reviews_us_Mobile_Electronics_v1_00.tsv.zip

Saving amazon_reviews_us_Mobile_Electronics_v1_00.tsv.zip to amazon_reviews_us_Mobile_Electronics_v1_00.tsv.zip
Archive:  amazon_reviews_us_Mobile_Electronics_v1_00.tsv.zip
  inflating: amazon_reviews_us_Mobile_Electronics_v1_00.tsv  


### Data Preprocessing

We load the dataset into a pandas DataFrame and display summary information. Only the `review_body` and `star_rating` columns are kept for the sentiment analysis, and rows with missing values are dropped. `star_rating` contains integers from 1 to 5 and is converted to binary sentiment labels: ratings above 3 are labeled positive (1) and ratings of 3 or below negative (0). We then create a `Dataset` object from these two columns using TensorFlow's `from_tensor_slices`.


In [4]:
# Read the dataset
# skip line numbers 35246 and 87073 because they seem to have an inconsistent format as opposed to the other lines!
raw_data     = pd.read_csv('amazon_reviews_us_Mobile_Electronics_v1_00.tsv',sep='\t',skiprows=[35246-1,87073-1])

# Display the first few rows of the dataset
print("First few rows of the dataset:")
display(round(raw_data.head(),2))

# Data information
print("\nData Information:")
display(raw_data.info())

# Descriptive statistics
print("\nDescriptive Statistics:")
display(round(raw_data.describe(),2).T)

# Distribution of the target variable (star_rating)
print("\nDistribution of star_rating:")
display(raw_data['star_rating'].value_counts(normalize=True))

#--- Remove missing values
data         = raw_data['review_body star_rating'.split()].dropna()
print("\nMissing Values:")
display( data.isnull().sum() )

#--- Create dataset object
review_body  = data.review_body
star_rating  = data.star_rating
binary_label = np.where( star_rating > 3, 1, 0 )
text_dataset = tf.data.Dataset.from_tensor_slices( ( review_body, binary_label ) )

First few rows of the dataset:


Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,20422322,R8MEA6IGAHO0B,B00MC4CED8,217304173,BlackVue DR600GW-PMP,Mobile_Electronics,5.0,0.0,0.0,N,Y,Very Happy!,"As advertised. Everything works perfectly, I'm...",2015-08-31
1,US,40835037,R31LOQ8JGLPRLK,B00OQMFG1Q,137313254,GENSSI GSM / GPS Two Way Smart Phone Car Alarm...,Mobile_Electronics,5.0,0.0,1.0,N,Y,five star,it's great,2015-08-31
2,US,51469641,R2Y0MM9YE6OP3P,B00QERR5CY,82850235,iXCC Multi pack Lightning cable,Mobile_Electronics,5.0,0.0,0.0,N,Y,great cables,These work great and fit my life proof case fo...,2015-08-31
3,US,4332923,RRB9C05HDOD4O,B00QUFTPV4,221169481,abcGoodefg® FBI Covert Acoustic Tube Earpiece ...,Mobile_Electronics,4.0,0.0,0.0,N,Y,Work very well but couldn't get used to not he...,Work very well but couldn't get used to not he...,2015-08-31
4,US,44855305,R26I2RI1GFV8QG,B0067XVNTG,563475445,Generic Car Dashboard Video Camera Vehicle Vid...,Mobile_Electronics,2.0,0.0,0.0,N,Y,Cameras has battery issues,"Be careful with these products, I have bought ...",2015-08-31



Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104852 entries, 0 to 104851
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   marketplace        104852 non-null  object 
 1   customer_id        104852 non-null  int64  
 2   review_id          104852 non-null  object 
 3   product_id         104852 non-null  object 
 4   product_parent     104852 non-null  int64  
 5   product_title      104852 non-null  object 
 6   product_category   104852 non-null  object 
 7   star_rating        104850 non-null  float64
 8   helpful_votes      104850 non-null  float64
 9   total_votes        104850 non-null  float64
 10  vine               104850 non-null  object 
 11  verified_purchase  104850 non-null  object 
 12  review_headline    104848 non-null  object 
 13  review_body        104848 non-null  object 
 14  review_date        104850 non-null  object 
dtypes: float64(3), int64(2), object(

None


Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
customer_id,104852.0,27937830.0,15087220.0,10071.0,14714015.0,26503567.5,42235508.5,53096568.0
product_parent,104852.0,501519600.0,287166100.0,53524.0,259373148.0,493728853.0,744008282.0,999950751.0
star_rating,104850.0,3.76,1.52,1.0,3.0,4.0,5.0,5.0
helpful_votes,104850.0,1.24,7.07,0.0,0.0,0.0,1.0,769.0
total_votes,104850.0,1.62,7.91,0.0,0.0,0.0,1.0,791.0



Distribution of star_rating:


star_rating
5.0    0.497835
4.0    0.172275
1.0    0.167582
3.0    0.092704
2.0    0.069604
Name: proportion, dtype: float64


Missing Values:


review_body    0
star_rating    0
dtype: int64
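
The label mapping can be checked on a handful of ratings. This is a minimal sketch: the `toy_ratings` array is made up for illustration, and only the `> 3` threshold comes from the notebook. Note that 3-star reviews fall into the negative class.

```python
import numpy as np

# Hypothetical ratings, one per star level (not taken from the dataset)
toy_ratings = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Same rule as the notebook: ratings above 3 are positive (1), the rest negative (0)
toy_labels = np.where(toy_ratings > 3, 1, 0)

print(toy_labels)  # [0 0 0 1 1]
```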

In [5]:
for review, star in text_dataset.take( 5 ):
  print( review, star, '\n' )

tf.Tensor(b"As advertised. Everything works perfectly, I'm very happy with the camera. As a matter of fact I'm going to buy another one for my 2nd car.", shape=(), dtype=string) tf.Tensor(1, shape=(), dtype=int64) 

tf.Tensor(b"it's great", shape=(), dtype=string) tf.Tensor(1, shape=(), dtype=int64) 

tf.Tensor(b'These work great and fit my life proof case for the iPhone 6', shape=(), dtype=string) tf.Tensor(1, shape=(), dtype=int64) 

tf.Tensor(b"Work very well but couldn't get used to not hearing anything out of the ear they v were plugged into.", shape=(), dtype=string) tf.Tensor(1, shape=(), dtype=int64) 

tf.Tensor(b"Be careful with these products, I have bought several of these cameras and the image is pretty decent but battery doesn't hold any charge!!!!", shape=(), dtype=string) tf.Tensor(0, shape=(), dtype=int64) 



We shuffle the dataset and split it into training (80%) and test (20%) subsets, batched in groups of 32.

In [6]:
def train_test_split(dataset,train_ratio=0.8, batch_size = 32):
    numpy_array       =  np.array( list( dataset.as_numpy_iterator() ) )
    num_samples       = numpy_array.shape[ 0 ]
    indices           = np.arange( num_samples )
    np.random.shuffle( indices )
    #
    num_train_samples = int( train_ratio * num_samples )
    train_indices     = indices[ :num_train_samples ]
    test_indices      = indices[ num_train_samples: ]

    train_data        = numpy_array[ train_indices ]
    test_data         = numpy_array[ test_indices ]

    train_dataset     = tf.data.Dataset.from_tensor_slices( (train_data[:,0],train_data[:,1].astype(float)) )
    test_dataset      = tf.data.Dataset.from_tensor_slices( (test_data[:,0],test_data[:,1].astype(float)) )

    return train_dataset.batch( batch_size, drop_remainder=True ) , test_dataset.batch( batch_size, drop_remainder=True )


train_dataset, test_dataset = train_test_split( text_dataset )


`train_dataset` and `test_dataset` contain inputs and targets that are `tf` strings and floats:

In [7]:
for input, target in train_dataset:
  print('input.shape:',input.shape)
  print('input.dtype:',input.dtype)
  print('target.shape:',target.shape)
  print('target.dtype:',target.dtype)
  print('input[0]:',input[0])
  print('target[0]:',target[0])
  break

input.shape: (32,)
input.dtype: <dtype: 'string'>
target.shape: (32,)
target.dtype: <dtype: 'float64'>
input[0]: tf.Tensor(b"I truly cannot understand what all the commotion is about. This is a excellent product at $35.00. You are not going to find another for this price.  I've bought multiple units and none have failed.  I just installed in a BMW 740il. Took an hour and a half but after you could only see one cable. I hid the microphone where the original one was and WOW what clarity and it's hidden.  No one I have installed this for has complained about bluetooth connectivity or ipod functionality.  Of course there is going to be half a second lag now and then.  This is not made by apple people.  For 34.99 you cannot possible go wrong.", shape=(), dtype=string)
target[0]: tf.Tensor(1.0, shape=(), dtype=float64)
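
The shuffle inside `train_test_split` relies on NumPy's global random state, so the split changes between runs unless a seed is set. A minimal sketch of a seeded variant of the index shuffling is given here; `split_indices` and its `seed` parameter are hypothetical names and not part of the notebook.

```python
import numpy as np

def split_indices(num_samples, train_ratio=0.8, seed=0):
    """Return shuffled, disjoint train and test index arrays."""
    rng = np.random.default_rng(seed)        # seeded generator for repeatable shuffles
    indices = rng.permutation(num_samples)   # shuffled copy of 0..num_samples-1
    num_train = int(train_ratio * num_samples)
    return indices[:num_train], indices[num_train:]

train_idx, test_idx = split_indices(10, train_ratio=0.8, seed=42)
print(len(train_idx), len(test_idx))  # 8 2
```

Calling the function twice with the same seed returns the same partition, which makes accuracy figures comparable across runs.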


## Bag-of-words approach

We vectorize the review text by mapping each review to a fixed-length vector over a vocabulary of word indices.
The vocabulary is built from the training text with `max_tokens=20000`, keeping the most frequent terms.
Setting `pad_to_max_tokens = True` pads the output feature axis to `max_tokens`, so every review yields a vector of the same size even if the vocabulary turns out smaller.
With `output_mode = 'multi_hot'`, each review becomes a binary array of size `max_tokens` with ones at the indices of the words it contains. This is the so-called *bag-of-words* approach, in which the text encoding discards word order.

In [8]:
max_tokens             = 20000

vectorize_layer        = keras.layers.TextVectorization( max_tokens             = max_tokens,
                                                         output_mode            = 'multi_hot',
                                                         pad_to_max_tokens      = True,
                                                       )

#--- prepare a dataset that only includes raw texts (no labels)
text_only_train_ds     = train_dataset.map(lambda x, y: x )

#--- dataset vocab
vectorize_layer.adapt( text_only_train_ds )
print('subset of the built vocabulary:')
display( np.random.choice( vectorize_layer.get_vocabulary(), 10 ) )

#--- indexing training and test datasets
int_train_ds           = train_dataset.map( lambda x, y: ( vectorize_layer( x ), y ) )
int_test_ds            = test_dataset.map(  lambda x, y: ( vectorize_layer( x ), y ) )

subset of the built vocabulary:


array(['compressed', 'established', 'scratched', '500ma', 'pursuing',
       'evaluation', 'reduced', 'sue', 'bluered', 'toggles'], dtype='<U19')
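
To make the multi-hot encoding concrete, here is a minimal NumPy sketch on a toy vocabulary. It only loosely mirrors `TextVectorization`: `toy_vocab` and `multi_hot` are hypothetical, and the tokenizer is a plain lowercase whitespace split, whereas the Keras layer also strips punctuation by default.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for out-of-vocabulary words
toy_vocab = ["[UNK]", "great", "battery", "works", "bad"]
word_to_index = {word: i for i, word in enumerate(toy_vocab)}

def multi_hot(text, word_to_index):
    """Encode text as a binary vector marking which vocabulary words occur."""
    vector = np.zeros(len(word_to_index), dtype=np.float32)
    for word in text.lower().split():
        vector[word_to_index.get(word, 0)] = 1.0   # unknown words map to index 0
    return vector

a = multi_hot("battery works great", word_to_index)
b = multi_hot("great battery works", word_to_index)
print(a)                     # [0. 1. 1. 1. 0.]
print(np.array_equal(a, b))  # True: word order is discarded
```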

Let's explore the content of `int_test_ds`:

In [9]:
for input, target in int_test_ds:
  print('input.shape:',input.shape)
  print('input.dtype:',input.dtype)
  print('target.shape:',target.shape)
  print('target.dtype:',target.dtype)
  print('input[0]:',input[0])
  print('target[0]:',target[0])
  break

input.shape: (32, 20000)
input.dtype: <dtype: 'float32'>
target.shape: (32,)
target.dtype: <dtype: 'float64'>
input[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
target[0]: tf.Tensor(1.0, shape=(), dtype=float64)


Inputs are size-32 batches of 20,000-dimensional binary arrays. Let's write a model-building function:

In [10]:
def get_bagOfWords_model( input_dim, hidden_dim ):
  inputs = keras.Input( shape = input_dim )
  x      = keras.layers.Dense(hidden_dim, activation='relu')(inputs)
  x      = keras.layers.Dropout(0.5)(x)
  output = keras.layers.Dense( 1, activation = 'sigmoid' )( x )
  model  = keras.Model(inputs, output)

  model.compile( optimizer = 'rmsprop',
                 loss      = 'binary_crossentropy',
                 metrics   = ['accuracy'] )
  return model

We train and evaluate our model.

In [11]:
model = get_bagOfWords_model( input_dim  = ( max_tokens, ),   # shape must be a tuple; ( max_tokens ) is just an int
                              hidden_dim = 16 )
model.summary()

callbacks = [ tf.keras.callbacks.ModelCheckpoint( filepath = 'bag_of_words_model.keras',
                                                  monitor  = 'val_accuracy',
                                                  mode     = 'max',
                                                  save_best_only=True )
            ]
history  = model.fit( int_train_ds,
                     validation_data = int_test_ds,
                     callbacks       = callbacks,
                     epochs          = 10 )

model    = keras.models.load_model( 'bag_of_words_model.keras' )

print(f'Test accuracy:{model.evaluate( int_test_ds )[1]:.2f}')

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy:0.88
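
The parameter counts in the summary above can be reproduced by hand: a dense layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic follows; `dense_params` is a hypothetical helper.

```python
def dense_params(input_units, output_units):
    """Weights plus biases of a fully connected layer."""
    return input_units * output_units + output_units

hidden = dense_params(20000, 16)   # multi-hot input -> 16 hidden units
output = dense_params(16, 1)       # hidden units -> single sigmoid output
print(hidden, output, hidden + output)  # 320016 17 320033
```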


After 10 epochs, we reach a test accuracy of 88%, which exceeds our baseline of 67%, the accuracy obtained by always predicting the positive class (the share of 4- and 5-star reviews in the dataset).
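
The 67% baseline follows from the `star_rating` distribution printed earlier. The sketch below assumes those proportions, which were computed on the raw data before the few rows with missing values were dropped, so the figure is approximate.

```python
# Proportions copied from the star_rating distribution printed earlier
star_share = {5.0: 0.497835, 4.0: 0.172275, 1.0: 0.167582, 3.0: 0.092704, 2.0: 0.069604}

# Share of reviews labeled positive under the "rating > 3" rule
positive_share = sum(share for star, share in star_share.items() if star > 3)

# Accuracy of always predicting the majority class
baseline_accuracy = max(positive_share, 1.0 - positive_share)
print(f"{baseline_accuracy:.2f}")  # 0.67
```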

## Sequence model approach

Here we build our sequence model, which takes integer sequences as input. Setting `output_sequence_length = 200` truncates longer reviews and zero-pads shorter ones to 200 tokens. This number corresponds to the 95% quantile of the distribution of review lengths (in number of words).

In [12]:
output_sequence_length = 200

vectorize_layer        = keras.layers.TextVectorization( max_tokens             = max_tokens,
                                                         output_mode            = 'int',
                                                         output_sequence_length = output_sequence_length,
                                                       )

#--- prepare a dataset that only includes raw texts (no labels)
text_only_train_ds     = train_dataset.map(lambda x, y: x )

#--- dataset vocab
vectorize_layer.adapt( text_only_train_ds )
print('subset of the built vocabulary:')
display( np.random.choice( vectorize_layer.get_vocabulary(), 10 ) )

#--- indexing training and test datasets
int_train_ds           = train_dataset.map( lambda x, y: ( vectorize_layer( x ), y ) )
int_test_ds            = test_dataset.map(  lambda x, y: ( vectorize_layer( x ), y ) )


subset of the built vocabulary:


array(['horn', 'ptt', 'backi', 'duragadget', 'discharge', 'scuffing',
       'locate', 'mines', 'mushy', 'themselves'], dtype='<U19')
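
The notebook does not show how the 95% quantile of review lengths was obtained. One way to estimate such a cutoff is sketched below, under the assumption that length is measured by whitespace-separated words; `toy_reviews` is a hypothetical stand-in for the training texts.

```python
import numpy as np

# Hypothetical reviews standing in for the training texts
toy_reviews = [
    "it's great",
    "works as advertised",
    "battery does not hold any charge",
    "good sound for the price but the cable is a bit short",
]

# Word count per review, using simple whitespace splitting
lengths = np.array([len(review.split()) for review in toy_reviews])

# Length at the 95th percentile, rounded up to a whole number of tokens
cutoff = int(np.ceil(np.percentile(lengths, 95)))
print(lengths, cutoff)
```

On the real data, `toy_reviews` would be replaced by the `review_body` column.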

In [13]:
for input, target in int_test_ds.take(1):
  print('input.shape:',input.shape)
  print('input.dtype:',input.dtype)
  print('target.shape:',target.shape)
  print('target.dtype:',target.dtype)
  print('input[0]:',input[0])
  print('target[0]:',target[0])

input.shape: (32, 200)
input.dtype: <dtype: 'int64'>
target.shape: (32,)
target.dtype: <dtype: 'float64'>
input[0]: tf.Tensor(
[   9   96    8    7    1    3   54  175  530   16   11 1800   26    1
    3   95    2 3872   14   18   46   42 5255    1   15    4  345 2907
    4   49  113   51   90  115   18   42  674   31    3   49  187    7
  336   26   10   11  614  487    7 3330  144   44    4 1475   13   11
 1332 1758   10    7  451    5   37    3  254    4    4   19  110  662
 6390   31  277    7   25   35   40    7   25   66    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0 
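
The trailing zeros in the output above are padding: in `'int'` mode, index 0 is the padding token and index 1 marks out-of-vocabulary words. A minimal pure-Python sketch of the pad-or-truncate step is shown here; `pad_or_truncate` is a hypothetical helper, not a Keras function.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Cut a sequence to `length` or right-pad it with `pad_value`."""
    clipped = list(token_ids[:length])
    return clipped + [pad_value] * (length - len(clipped))

print(pad_or_truncate([9, 96, 8], 5))           # [9, 96, 8, 0, 0]
print(pad_or_truncate([9, 96, 8, 7, 1, 3], 5))  # [9, 96, 8, 7, 1]
```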

### Word embedding


Our revised model takes a sequence of 200 integers and converts it into a dense, real-valued array of size (200, `embed_dim=64`) using an embedding layer, which learns one 64-dimensional vector per vocabulary entry. We also replace the hidden dense layer with a bidirectional `LSTM`.

In [14]:
def get_model( input_dim, embed_dim, hidden_dim, output_dim ):
  inputs = keras.Input( shape = input_dim, dtype = tf.int32 )
  x      = keras.layers.Embedding( max_tokens, embed_dim, input_length=output_sequence_length )( inputs )
  x      = keras.layers.Bidirectional( keras.layers.LSTM(hidden_dim) )(x)
  x      = keras.layers.Dropout(0.5)(x)
  output = keras.layers.Dense( output_dim, activation = 'sigmoid' )( x )
  model  = keras.Model(inputs, output)

  model.compile( optimizer = 'rmsprop',
                 loss      = 'binary_crossentropy',
                 metrics   = ['accuracy'] )
  return model
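
Conceptually, the embedding layer is a lookup table: each token index selects one row of a trainable matrix. The NumPy sketch below illustrates the shapes involved. The random `embedding_matrix` is only a stand-in for learned weights, and the token indices are hypothetical.

```python
import numpy as np

vocab_size, embed_dim, seq_len = 20000, 64, 200

rng = np.random.default_rng(0)
# Stand-in for the trainable embedding matrix: one real-valued row per token index
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

# Hypothetical padded review: a few token indices followed by zeros
token_ids = np.zeros(seq_len, dtype=np.int64)
token_ids[:4] = [9, 96, 8, 7]

embedded = embedding_matrix[token_ids]   # row lookup, one vector per position
print(embedded.shape)                    # (200, 64)
```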


In [None]:
model = get_model( input_dim  = ( output_sequence_length, ),   # shape must be a tuple
                   embed_dim  = 64,
                   hidden_dim = 16,
                   output_dim = 1 )
model.summary()

callbacks = [ tf.keras.callbacks.ModelCheckpoint( filepath = 'embedding_lstm.keras',
                                                  monitor  = 'val_accuracy',
                                                  mode     = 'max',
                                                  save_best_only=True )
            ]
history  = model.fit( int_train_ds,
                     validation_data = int_test_ds,
                     callbacks       = callbacks,
                     epochs          = 10 )

model    = keras.models.load_model( 'embedding_lstm.keras' )

print(f'Test accuracy:{model.evaluate( int_test_ds )[1]:.2f}')

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 200)]             0         
                                                                 
 embedding (Embedding)       (None, 200, 64)           1280000   
                                                                 
 bidirectional (Bidirection  (None, 32)                10368     
 al)                                                             
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1290401 (4.92 MB)
Trainable params: 1290401 (4.92 MB)
Non-trainable params: 0 (0.00 Byte)
_____________________
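
The parameter counts in this summary can be verified by hand. An LSTM has four gates, each with input weights, recurrent weights, and a bias, and the bidirectional wrapper doubles that. The sketch below reproduces the arithmetic; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

embedding = 20000 * 64                    # one 64-dim vector per vocabulary entry
bidirectional = 2 * lstm_params(64, 16)   # forward and backward LSTM
dense = 2 * 16 * 1 + 1                    # 32 concatenated features -> 1 output
print(embedding, bidirectional, dense, embedding + bidirectional + dense)
# 1280000 10368 33 1290401
```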

We note that the above model trains relatively slowly compared with the multi-hot model (due to the embedding and LSTM layers). The observed test accuracy is above 90%, slightly higher than that of the bag-of-words model. This slight improvement comes despite the fact that the sequence model sees only the first 200 tokens of each review, as opposed to the full reviews processed by the first model.

## Conclusions and Discussions


The project successfully classified the sentiment of Amazon product reviews as positive or negative using deep learning techniques. Two primary models were explored: a Bag-of-Words model and a Sequence model with word embeddings and an LSTM layer. The Bag-of-Words model achieved a test accuracy of approximately 88%, significantly surpassing the baseline accuracy of 67%, which was based on the distribution of positive reviews in the dataset.

Data preprocessing steps, including converting star ratings to binary sentiment labels, were crucial for preparing the dataset. The Bag-of-Words model utilized a simple, yet effective text vectorization approach, transforming text into binary arrays based on word presence. This method, despite its simplicity, provided a robust baseline performance.

The Sequence model, incorporating an embedding layer and a bidirectional LSTM, aimed to capture more complex patterns in the text data. The LSTM, known for handling sequential data, models dependencies on word order within the review text, potentially improving the classification of nuanced sentiments. Its test accuracy was above 90%, slightly higher than that of the Bag-of-Words model, but a careful comparison between the two models needs further analysis.

The training process for both models used a checkpoint callback to monitor validation accuracy and save the best-performing model. Because the test set also served as the validation set, the reported test accuracies are likely somewhat optimistic; a separate validation split would give a cleaner estimate of generalization. With that caveat, the final evaluation metrics indicate strong performance for both models, with the Bag-of-Words model reaching a high accuracy despite its relatively simple architecture.
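
The selection rule behind `save_best_only=True` with `mode='max'` can be sketched in a few lines of plain Python. This is an illustration of the idea, not Keras code: `best_epoch` is a hypothetical helper and the accuracies are made-up values, not results from this notebook.

```python
def best_epoch(val_accuracies):
    """Return (epoch, accuracy) of the highest validation accuracy, 1-indexed."""
    best_index, best_value = 0, float("-inf")
    for index, value in enumerate(val_accuracies):
        if value > best_value:   # strictly better, so ties keep the earlier epoch
            best_index, best_value = index, value
    return best_index + 1, best_value

# Hypothetical per-epoch validation accuracies (not results from this notebook)
toy_history = [0.80, 0.85, 0.87, 0.86, 0.87]
print(best_epoch(toy_history))  # (3, 0.87)
```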

Several limitations were noted:

* The dataset was restricted to the Mobile and Electronics category, limiting the generalizability of the findings across different product categories.
* The Bag-of-Words approach, while effective, does not consider word order or semantic meaning, potentially limiting its ability to capture more complex sentiments.

Future work could explore:

* Expanding the dataset to include more diverse product categories.
* Incorporating more advanced models, such as transformers or attention mechanisms, to potentially enhance the sentiment classification accuracy.
* Analyzing and mitigating any biases present in the dataset that could affect model predictions.

The ability to accurately classify sentiment from product reviews has practical implications for businesses, helping them understand customer opinions and feedback. This can guide product improvements, marketing strategies, and customer engagement initiatives. The models developed in this project could be integrated into recommendation systems, automated customer service, and market analysis tools, providing valuable insights and enhancing user experience.








Please check out my [GitHub](https://github.com/kkarimi62/IBM-Machine-Learning-Professional-Certificate/blob/main/ClusteringAndDimensionReductionTechniques.ipynb) for further information.