# AMAZON CUSTOMER REVIEW

## 1.0 BUSINESS UNDERSTANDING

In today's online market, customer reviews are an essential part of purchasing decisions. Amazon, a giant online store, collects millions of product reviews that reflect customer satisfaction, product quality, and overall user experience. Processing this volume of data manually, however, is inefficient and time-consuming.

Sentiment analysis enables companies to analyze customer feedback automatically, extract meaningful information, and make informed decisions.

## 1.1 PROBLEM STATEMENT

Amazon receives millions of reviews, far too many to read manually. An automated sentiment analysis system is needed to categorize reviews as positive, negative, or neutral and to extract useful insights from them.

## 1.2 OBJECTIVES

### 1.2.1 Main Objective

To accurately determine the overall emotional tone (positive, negative, or neutral) of customer reviews.

### 1.2.2 Specific Objectives

* Identify trends in customer satisfaction.

* Improve customer experience by addressing negative feedback.

* Help businesses optimize their product offerings based on user sentiment.


## 1.3 Business Questions

* What percentage of customer reviews are positive, negative, or neutral?
* Are there specific features or keywords associated with reviews?
* Can sentiment analysis help predict potential customer dissatisfaction?
* Can the business use sentiment insights to improve product quality and customer support?
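
The first question above can be answered once each review carries a sentiment label. The source does not state how labels are derived, so the sketch below assumes a common convention: ratings of 4 and above are positive, ratings from 3 up to 4 are neutral, and anything lower is negative. The helper names `rating_to_sentiment` and `sentiment_shares` and the example ratings are hypothetical.

```python
from collections import Counter


def rating_to_sentiment(rating: float) -> str:
    """Map a 1-5 star rating to a coarse sentiment label (assumed cut-offs)."""
    if rating >= 4:
        return "positive"
    if rating >= 3:
        return "neutral"
    return "negative"


def sentiment_shares(ratings):
    """Return the percentage of ratings that fall under each sentiment label."""
    labels = [rating_to_sentiment(r) for r in ratings]
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        # No ratings supplied: report zero for every label
        return {label: 0.0 for label in ("positive", "neutral", "negative")}
    return {
        label: round(100 * counts[label] / total, 1)
        for label in ("positive", "neutral", "negative")
    }


# Hypothetical ratings, for illustration only (not taken from the dataset)
example_ratings = [5.0, 5.0, 4.0, 3.0, 1.0]
shares = sentiment_shares(example_ratings)
print(shares)
```

Once the dataset is loaded, the same mapping could be applied with `data['reviews.rating'].apply(rating_to_sentiment)`. Missing ratings would need to be imputed or dropped first, since a `NaN` fails both comparisons and would fall through to the negative branch.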


## 1.4 Metric of Success

## 2.0 DATA UNDERSTANDING

The dataset used for this sentiment analysis project consists of Amazon product reviews, which provide insights into customer opinions about various products. It contains 1,597 records with 27 columns, capturing details about the product, review content and user feedback.


The dataset comprises the following columns:

id → Unique identifier for each review.

asins → Amazon Standard Identification Number (ASIN) of the product.

brand → Brand of the product.

categories → Product categories (e.g., "Amazon Devices").

colors → Available colors of the product (often missing).

dateAdded → Date the review was added to the dataset.

dateUpdated → Date the review was last updated.

dimension → Physical dimensions of the product.

manufacturer → Manufacturer of the product.

manufacturerNumber → Manufacturer’s product number.

name → Product name.

prices → Pricing details of the product.

reviews.date → Date when the review was posted.

reviews.doRecommend → Whether the reviewer recommends the product (Yes/No).

reviews.numHelpful → Number of users who found the review helpful.

reviews.rating → Star rating given by the reviewer (1 to 5).

reviews.sourceURLs → URL of the original review page.

reviews.text → Full text of the review (Main feature for sentiment analysis).

reviews.title → Title of the review (Summary of the review).

reviews.username → Username of the reviewer.

reviews.userCity → City of the reviewer (Mostly missing).

reviews.userProvince → Province of the reviewer (Mostly missing).

sizes → Available sizes of the product (Mostly empty).

upc → Universal Product Code (UPC).

weight → Weight of the product.

### 2.1 Exploring The Dataset

In [1]:
# Import the relevant libraries
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Text preprocessing, feature extraction, modelling, and evaluation
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve
from sklearn.model_selection import train_test_split

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [2]:
# Load the dataset and preview the first five rows
data = pd.read_csv("Amazon Reviews.csv")

data.head(5)

Unnamed: 0,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,keys,...,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sizes,upc,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,,Cristina M,,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,,Ricky,,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,,Tedd Gardiner,,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,,,Dougal,,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,,,Miljan David Tanic,,,205 grams


In [3]:
#check the shape of the dataset
data.shape

(1597, 27)

In [4]:
#check the column information 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1597 entries, 0 to 1596
Data columns (total 27 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    1597 non-null   object 
 1   asins                 1597 non-null   object 
 2   brand                 1597 non-null   object 
 3   categories            1597 non-null   object 
 4   colors                774 non-null    object 
 5   dateAdded             1597 non-null   object 
 6   dateUpdated           1597 non-null   object 
 7   dimension             565 non-null    object 
 8   ean                   898 non-null    float64
 9   keys                  1597 non-null   object 
 10  manufacturer          965 non-null    object 
 11  manufacturerNumber    902 non-null    object 
 12  name                  1597 non-null   object 
 13  prices                1597 non-null   object 
 14  reviews.date          1217 non-null   object 
 15  reviews.doRecommend  

## Data cleaning

In [5]:
# Count the missing values in each column
data.isnull().sum()

id                         0
asins                      0
brand                      0
categories                 0
colors                   823
dateAdded                  0
dateUpdated                0
dimension               1032
ean                      699
keys                       0
manufacturer             632
manufacturerNumber       695
name                       0
prices                     0
reviews.date             380
reviews.doRecommend     1058
reviews.numHelpful       697
reviews.rating           420
reviews.sourceURLs         0
reviews.text               0
reviews.title             17
reviews.userCity        1597
reviews.userProvince    1597
reviews.username          17
sizes                   1597
upc                      699
weight                   911
dtype: int64

In [6]:
# Count the fully duplicated rows
data.duplicated().sum()

0

In [7]:
# Drop the columns whose share of missing values exceeds the 30% threshold
threshold = 0.3

# Count the missing values in every column
missing_count = data.isna().sum()

# Total number of rows in the DataFrame
total_row = len(data)

# List the columns whose missing fraction surpasses the threshold
drop_col = [i for i in data.columns if missing_count[i] / total_row > threshold]

# Drop those columns in place
data.drop(columns=drop_col, inplace=True)

data.head()

Unnamed: 0,id,asins,brand,categories,dateAdded,dateUpdated,keys,name,prices,reviews.date,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2015-08-08T00:00:00.000Z,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",Cristina M
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2015-09-01T00:00:00.000Z,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,Ricky
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2015-07-20T00:00:00.000Z,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,Tedd Gardiner
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2017-06-16T00:00:00.000Z,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,Dougal
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2016-08-11T00:00:00.000Z,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,Miljan David Tanic
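
An equivalent way to express the column filter above is through `isna().mean()`, which yields the fraction of missing values per column directly. The following is a minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration), not the notebook's own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "text": ["good", "bad", "fine", "great"],
    "rating": [5.0, np.nan, 4.0, 5.0],        # 25% missing, below the threshold
    "city": [np.nan, np.nan, np.nan, "X"],    # 75% missing, above the threshold
})

threshold = 0.3

# isna().mean() gives the fraction of missing values per column
missing_fraction = toy.isna().mean()

# Keep only the columns whose missing fraction does not exceed the threshold
kept = toy.loc[:, missing_fraction <= threshold]
print(list(kept.columns))
```

Selecting the columns to keep, rather than dropping in place, leaves the original frame untouched, which can make it easier to compare the before and after states.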


In [8]:
# Re-check the missing values after dropping the sparse columns
data.isnull().sum()

id                      0
asins                   0
brand                   0
categories              0
dateAdded               0
dateUpdated             0
keys                    0
name                    0
prices                  0
reviews.date          380
reviews.rating        420
reviews.sourceURLs      0
reviews.text            0
reviews.title          17
reviews.username       17
dtype: int64

In [9]:
# Parse the review date (ISO 8601 strings) into datetimes
# Note: format='ISO8601' requires pandas 2.0 or later
data['reviews.date'] = pd.to_datetime(data['reviews.date'], format='ISO8601')

# Derive a string column in 'MM-YY-DD' order (month, two-digit year, day)
data['formatted_date'] = data['reviews.date'].dt.strftime('%m-%y-%d')

# Preview the resulting DataFrame
data.head()

Unnamed: 0,id,asins,brand,categories,dateAdded,dateUpdated,keys,name,prices,reviews.date,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,formatted_date
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2015-08-08 00:00:00+00:00,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",Cristina M,08-15-08
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2015-09-01 00:00:00+00:00,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,Ricky,09-15-01
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2015-07-20 00:00:00+00:00,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,Tedd Gardiner,07-15-20
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2017-06-16 00:00:00+00:00,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,Dougal,06-17-16
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,kindlepaperwhite/b00qjdu3ky,Kindle Paperwhite,"[{""amountMax"":139.99,""amountMin"":139.99,""curre...",2016-08-11 00:00:00+00:00,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,Miljan David Tanic,08-16-11
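
The `'%m-%y-%d'` pattern used above places the two-digit year between the month and the day, so `08-15-08` means 8 August 2015. The sketch below reproduces that formatting on hypothetical timestamps and, as one possible alternative, derives a sortable `YYYY-MM-DD` string plus numeric year and month parts. The variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical timestamps in the same ISO 8601 style as `reviews.date`
dates = pd.Series(["2015-08-08T00:00:00.000Z", "2017-06-16T00:00:00.000Z", None])

# Parse to timezone-aware datetimes; missing entries become NaT
parsed = pd.to_datetime(dates, utc=True)

# Month, two-digit year, day (the order used in the notebook)
mm_yy_dd = parsed.dt.strftime("%m-%y-%d")

# A sortable alternative, plus numeric parts that are easy to group by
iso_day = parsed.dt.strftime("%Y-%m-%d")
year = parsed.dt.year
month = parsed.dt.month

print(mm_yy_dd.iloc[0], iso_day.iloc[0])
```

Strings in `MM-YY-DD` order do not sort chronologically, so numeric parts or the parsed datetime itself are usually more convenient for trend analysis over time.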


In [10]:
# Drop the columns that are not needed for the analysis
data.drop(
    columns=['reviews.username', 'id', 'asins', 'prices', 'reviews.sourceURLs',
             'dateAdded', 'reviews.date', 'keys', 'name'],
    inplace=True,
)

data.head()

Unnamed: 0,brand,categories,dateUpdated,reviews.rating,reviews.text,reviews.title,formatted_date
0,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",08-15-08
1,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,Allow me to preface this with a little history...,One Simply Could Not Ask For More,09-15-01
2,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,4.0,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,07-15-20
3,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I bought one of the first Paperwhites and have...,Love / Hate relationship,06-17-16
4,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I have to say upfront - I don't like coroporat...,I LOVE IT,08-16-11


In [11]:
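# Impute missing ratings with the median rating
data['reviews.rating'] = data['reviews.rating'].fillna(data['reviews.rating'].median())

# Drop the rows that still contain null values
# (dropna() returns a new DataFrame, so the result must be assigned back)
data = data.dropna()

# Check the final version
data.head(5)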

Unnamed: 0,brand,categories,dateUpdated,reviews.rating,reviews.text,reviews.title,formatted_date
0,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",08-15-08
1,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,Allow me to preface this with a little history...,One Simply Could Not Ask For More,09-15-01
2,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,4.0,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,07-15-20
3,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I bought one of the first Paperwhites and have...,Love / Hate relationship,06-17-16
4,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I have to say upfront - I don't like coroporat...,I LOVE IT,08-16-11
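
The imputation and row-dropping step can be checked on a small example. The sketch below uses a hypothetical toy frame (values invented for illustration) to show median imputation and to highlight that `dropna()` returns a new DataFrame, leaving the original unchanged unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "rating": [5.0, np.nan, 3.0, 4.0],
    "title": ["ok", "meh", None, "nice"],
})

# The median of the observed ratings [3, 4, 5] is 4.0
toy["rating"] = toy["rating"].fillna(toy["rating"].median())

# Calling dropna() without assignment leaves `toy` unchanged
toy.dropna()
rows_before = len(toy)

# Assigning the result back actually removes the row with a missing title
toy = toy.dropna()
rows_after = len(toy)

print(rows_before, rows_after)
```

Median imputation keeps the row count intact for the rating column, but it also pulls imputed reviews toward the most common rating, which is worth keeping in mind when ratings are later used as sentiment labels.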


In [12]:
df1 = data.copy()

df1.head(5)

Unnamed: 0,brand,categories,dateUpdated,reviews.rating,reviews.text,reviews.title,formatted_date
0,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",08-15-08
1,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,Allow me to preface this with a little history...,One Simply Could Not Ask For More,09-15-01
2,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,4.0,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,07-15-20
3,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I bought one of the first Paperwhites and have...,Love / Hate relationship,06-17-16
4,Amazon,"Amazon Devices,mazon.co.uk",2017-07-18T23:52:58Z,5.0,I have to say upfront - I don't like coroporat...,I LOVE IT,08-16-11


## Exploratory Data Analysis (EDA)