# **ANALYZING REVIEWS OF THE BOOK BECOMING ON AMAZON FROM 2018 TO 2020**

![Book Cover](images/book-cover.jpg)

## **About this project**

This small project has two aims: to get me back on track with data analytics, and to evaluate whether **Becoming** by **Michelle Obama** is worth reading. I’ve come across this book in many bookstores and have heard positive reviews, but I’m still unsure whether it suits me at this time. By analyzing readers' reviews, I can gain insight into what people think about the book and decide whether to buy it now or at a later stage in my life.

## **About the data**

The data for this project is pretty simple. It comes from the [Amazon Customer Review Dataverse](https://dataverse.harvard.edu/file.xhtml?fileId=5612736&version=1.0) on Harvard Dataverse. Now let's get to work!

## **Data Cleaning**

First things first! Let's import the data and see whether it is clean. If it isn't, I will clean it.

In [1]:
# Import libraries
import pandas as pd

In [2]:
# Load the data
data = pd.read_csv('export_book.csv')

In [3]:
# Display some of the data
print("No columns: ", data.shape[1], "\n", "No rows: ", data.shape[0])
data.head()

No columns:  10 
 No rows:  5000


|   | Unnamed: 0 | asin | product name | ratings | reviews | helpful | date | Unnamed: 6 | target | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1524763136 | Becoming | 5 | \n\n Slow and boring and self boasting.\n\n | 4100 | 13-Dec-18 | NaN | p | \n\n Slow and boring and self boasting.\n\n |
| 1 | 1 | 1524763136 | Becoming | 5 | \n\n The last thing I wanted to read was a sh... | 3892 | 11-Dec-18 | NaN | p | \n\n The last thing I wanted to read was a sh... |
| 2 | 2 | 1524763136 | Becoming | 1 | \n\n I believe I always loved Michelle Obama.... | 2824 | 13-Nov-18 | NaN | n | \n\n I believe I always loved Michelle Obama.... |
| 3 | 3 | 1524763136 | Becoming | 1 | \n\n Worst piece of crap ever\n\n | 3182 | 11-Dec-18 | NaN | n | \n\n Worst piece of crap ever\n\n |
| 4 | 4 | 1524763136 | Becoming | 1 | \n\n If you are an insomniac this book will d... | 2838 | 11-Dec-18 | NaN | n | \n\n If you are an insomniac this book will d... |
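A column named `Unnamed: 0` usually appears when a CSV was written with its row index included and then read back with the default settings. The sketch below shows this on a tiny, made-up CSV string (not the real `export_book.csv`), and how `index_col=0` treats that first column as the index instead. Whether this is really how the dataset's file was produced is an assumption.

```python
import io
import pandas as pd

# Toy CSV (hypothetical content) that mimics a file saved with its row index:
# the header cell of the first column is empty.
csv_text = ",ratings,reviews\n0,5,Great book\n1,1,Boring\n"

# Default read: the unnamed first column is kept as data and called "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.columns.tolist())  # ['Unnamed: 0', 'ratings', 'reviews']

# With index_col=0 the first column becomes the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_read.columns.tolist())  # ['ratings', 'reviews']
```

Dropping the column afterwards, as done in this notebook, gives the same result; `index_col=0` just avoids creating it in the first place.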


There are 10 columns, but some seem useless (because they contain nothing but NaN, or just a single value) and some seem to duplicate another column. Let's inspect them and remove the ones that are not necessary!

In [4]:
# Check the number of distinct values in each column
for col in data.columns:
    print(col, " : ", data[col].nunique())

Unnamed: 0  :  5000
asin  :  1
product name  :  1
ratings  :  5
reviews  :  4993
helpful  :  147
date  :  669
Unnamed: 6  :  0
target  :  2
text  :  4993


Let's look at them one by one:

- **`Unnamed: 0`**: The number of unique values in this column equals the number of rows in the dataset, so it is clearly just an index, or row ID. Since I only have one dataset, there is no need for this column.

- **`asin`**: This column is the opposite of the previous one: instead of 5000 unique values, it holds a single value throughout the dataset. The name stands for Amazon Standard Identification Number, so the value identifies the product (the book), not the individual review. Since every review here is for the same book, it carries no information for this analysis, and the column will be removed.

- **`product name`**: Just like the previous column, this one contains only one value. Since I am analyzing reviews of a single product, there is no need for this column either.

- **`ratings`**: The number of unique values in this column is as expected, since a book can be rated from 1 to 5 stars. This one will stay.

- **`reviews`**: Aha! This is the main character of this analysis. 4993 unique values out of 5000 rows is not bad. To be honest, I expected 5000 unique values, as no two reviews should be identical. I will definitely keep this column, but I will inspect its duplicate values later.

- **`helpful`**: This is the number of "helpful" votes a review received from other customers. The number of unique values looks reasonable to me. I will keep this one.

- **`date`**: The number of unique values in this column looks fine too. This one will stay.

- **`Unnamed: 6`**: Uh oh, this one is not good: it has 0 unique values. Looking at the rows printed above, its values are all NaN. Definitely delete it!

- **`target`**: Although the number of unique values in this column is fine, I can't tell what it is for, and I can't find the metadata for this specific dataset. So it will go.

- **`text`**: The first 5 rows printed above suggest that this column and **`reviews`** are the same, and both have 4993 unique values. This column will definitely be removed.

In short, the columns that will be removed are: **`Unnamed: 0`**, **`asin`**, **`product name`**, **`Unnamed: 6`**, **`target`**, **`text`** (6 columns).
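The manual inspection above can also be expressed as a small helper. This is only a sketch: `find_uninformative_columns` is a hypothetical function written for illustration, and the toy frame below stands in for the real data. It flags columns that are entirely NaN, columns with a single distinct value, and columns that are exact copies of an earlier column.

```python
import numpy as np
import pandas as pd

def find_uninformative_columns(df):
    """Hypothetical helper: flag empty, constant, and exactly duplicated columns."""
    report = {"empty": [], "constant": [], "duplicate_of": {}}
    for col in df.columns:
        n_unique = df[col].nunique()  # NaN is not counted by default
        if n_unique == 0:
            report["empty"].append(col)
        elif n_unique == 1:
            report["constant"].append(col)
    cols = list(df.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            # Series.equals compares the values element by element
            if second not in report["duplicate_of"] and df[first].equals(df[second]):
                report["duplicate_of"][second] = first
    return report

# Toy data shaped like a few of the columns in this notebook
toy = pd.DataFrame({
    "asin": [1524763136] * 3,
    "ratings": [5, 1, 4],
    "reviews": ["Good book.", "Boring", "Excellent"],
    "Unnamed: 6": [np.nan] * 3,
    "text": ["Good book.", "Boring", "Excellent"],
})
report = find_uninformative_columns(toy)
print(report)
```

A check like `df["reviews"].equals(df["text"])` on the real data would confirm what the first five rows only suggest, namely that the two columns are identical everywhere.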

In [5]:
# Remove the columns
data.drop(['Unnamed: 0', 'asin', 'product name', 'Unnamed: 6', 'target', 'text'], axis=1, inplace=True)
data.head()

|   | ratings | reviews | helpful | date |
|---|---|---|---|---|
| 0 | 5 | \n\n Slow and boring and self boasting.\n\n | 4100 | 13-Dec-18 |
| 1 | 5 | \n\n The last thing I wanted to read was a sh... | 3892 | 11-Dec-18 |
| 2 | 1 | \n\n I believe I always loved Michelle Obama.... | 2824 | 13-Nov-18 |
| 3 | 1 | \n\n Worst piece of crap ever\n\n | 3182 | 11-Dec-18 |
| 4 | 1 | \n\n If you are an insomniac this book will d... | 2838 | 11-Dec-18 |


Now let's look at the duplicate values in the `reviews` column!

In [6]:
data[data.reviews.duplicated(keep=False)].sort_values('reviews')

|   | ratings | reviews | helpful | date |
|---|---|---|---|---|
| 65 | 5 | \n\n Boring\n\n | 132 | 8-Jan-19 |
| 279 | 5 | \n\n Boring\n\n | 10 | 30-May-19 |
| 1216 | 5 | \n\n Boring\n\n | 39 | 16-Dec-18 |
| 4442 | 4 | \n\n Excellent\n\n | 0 | 13-Jul-20 |
| 4936 | 5 | \n\n Excellent\n\n | 0 | 6-May-20 |
| 4382 | 5 | \n\n Good book.\n\n | 0 | 26-Jul-20 |
| 4706 | 5 | \n\n Good book.\n\n | 0 | 2-Jun-20 |
| 2290 | 1 | \n\n Great book\n\n | 1 | 6-Jun-20 |
| 4868 | 5 | \n\n Great book\n\n | 0 | 10-May-20 |
| 647 | 5 | \n\n Love the book!\n\n | 9 | 15-Nov-18 |


Okay, this duplication seems acceptable: the duplicate reviews are very short, come from different dates, and some have different ratings and helpful votes. They are therefore still valid reviews, and I will keep them for the analysis.
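The distinction made here, the same text versus the same review, can be checked directly with `DataFrame.duplicated`. The rows below are toy data, and the last row is a deliberately invented exact copy of the first, to show what a true duplicate would look like.

```python
import pandas as pd

# Toy rows; the last one is an invented exact copy of the first
toy = pd.DataFrame({
    "ratings": [5, 5, 4, 5],
    "reviews": ["Boring", "Boring", "Excellent", "Boring"],
    "helpful": [132, 10, 0, 132],
    "date": ["8-Jan-19", "30-May-19", "13-Jul-20", "8-Jan-19"],
})

# Same review text, regardless of the other columns
same_text = toy.duplicated(subset=["reviews"], keep=False)

# Identical in every column: more likely one review recorded twice
same_row = toy.duplicated(keep=False)

print(same_text.tolist())  # [True, True, False, True]
print(same_row.tolist())   # [True, False, False, True]
```

If the second check flags no rows on the real data, the repeated texts are separate reviews that simply use the same short wording.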

But before that, let me clean the leading and trailing "\n\n" characters in each review.

In [7]:
# Strip leading and trailing whitespace (including the "\n\n") from the reviews.
# str.strip() alone leaves paragraph breaks inside a review untouched.
data['reviews'] = data['reviews'].str.strip()
data.head()

|   | ratings | reviews | helpful | date |
|---|---|---|---|---|
| 0 | 5 | Slow and boring and self boasting. | 4100 | 13-Dec-18 |
| 1 | 5 | The last thing I wanted to read was a shallow ... | 3892 | 11-Dec-18 |
| 2 | 1 | I believe I always loved Michelle Obama. Her ... | 2824 | 13-Nov-18 |
| 3 | 1 | Worst piece of crap ever | 3182 | 11-Dec-18 |
| 4 | 1 | If you are an insomniac this book will definit... | 2838 | 11-Dec-18 |
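Removing every `"\n\n"` and trimming only the ends of a string are not the same thing when a review contains a paragraph break in the middle. A minimal sketch on a made-up review:

```python
import pandas as pd

# Made-up review with a paragraph break in the middle
reviews = pd.Series(["\n\n  First paragraph.\n\nSecond paragraph.\n\n"])

# Removes every "\n\n", including the one between the two paragraphs
replaced = reviews.str.replace("\n\n", "", regex=False).str.strip()

# Trims whitespace (spaces and newlines) at the start and end only
stripped = reviews.str.strip()

print(repr(replaced[0]))  # 'First paragraph.Second paragraph.'
print(repr(stripped[0]))  # 'First paragraph.\n\nSecond paragraph.'
```

For later text analysis, the first variant glues the last word of one paragraph to the first word of the next, so trimming only the ends (or replacing inner breaks with a space) is the safer choice.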


It looks perfect now, but I suspect the data types may not be right. Let's inspect them!

In [8]:
data.dtypes

ratings     int64
reviews    object
helpful    object
date       object
dtype: object

Well, it is clear that I have to change the data type of the `helpful` and `date` columns.

In [9]:
# Change the data type of the `helpful` column
data['helpful'] = data['helpful'].str.replace(",", "").astype(int)
data.helpful.dtypes

dtype('int64')
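`astype(int)` raises an error as soon as one value cannot be parsed. A more forgiving variant, sketched here on toy values, uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be counted and inspected. The `"n/a"` entry is invented for illustration; nothing in the notebook suggests the real column contains such values.

```python
import pandas as pd

# Toy values; "n/a" is an invented bad entry
helpful = pd.Series(["4,100", "132", "0", "n/a"])

# Drop the thousands separator, then convert; bad entries become NaN
cleaned = helpful.str.replace(",", "", regex=False)
as_numbers = pd.to_numeric(cleaned, errors="coerce")

print(as_numbers.tolist())      # [4100.0, 132.0, 0.0, nan]
print(as_numbers.isna().sum())  # 1
```

Note that the result is a float column whenever a NaN is present; if there are no bad entries, `astype(int)` as used above is simpler.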

In [11]:
# Change the data type of the `date` column
# (no strftime here: formatting back to text would turn the column into strings again)
data['date'] = pd.to_datetime(data['date'], format='%d-%b-%y')
data.date.dtypes

dtype('<M8[ns]')
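The difference between a parsed datetime column and one formatted back to text with `strftime` can be sketched on a few toy dates. Only the parsed version supports date operations such as extracting the year or filtering by a date range.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Toy dates in the same day-month-year style as the dataset
dates = pd.Series(["13-Dec-18", "8-Jan-19", "13-Jul-20"])

# Parsed: a real datetime column
parsed = pd.to_datetime(dates, format="%d-%b-%y")

# Formatted: strftime returns plain strings again
formatted = parsed.dt.strftime("%Y-%m-%d")

print(is_datetime64_any_dtype(parsed))     # True
print(is_datetime64_any_dtype(formatted))  # False
print(parsed.dt.year.tolist())             # [2018, 2019, 2020]

# Date filtering works directly on the parsed column
print((parsed >= "2019-01-01").tolist())   # [False, True, True]
```

Both versions display as `2018-12-13` in `data.head()`, so the table alone does not reveal which type the column has.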

Okay! Let's see the clean data now!

In [12]:
data.head()

|   | ratings | reviews | helpful | date |
|---|---|---|---|---|
| 0 | 5 | Slow and boring and self boasting. | 4100 | 2018-12-13 |
| 1 | 5 | The last thing I wanted to read was a shallow ... | 3892 | 2018-12-11 |
| 2 | 1 | I believe I always loved Michelle Obama. Her ... | 2824 | 2018-11-13 |
| 3 | 1 | Worst piece of crap ever | 3182 | 2018-12-11 |
| 4 | 1 | If you are an insomniac this book will definit... | 2838 | 2018-12-11 |


Well, after some research, I think I need to process the data a little bit more as I want to analyze the text in the `reviews` column.

## References

Some articles or tutorials that helped me with this project.

1. [Use Sentiment Analysis With Python to Classify Movie Reviews](https://realpython.com/sentiment-analysis-python/) by [Kyle Stratis](https://realpython.com/team/kstratis/).