# **ANALYZING REVIEWS OF THE BOOK BECOMING ON AMAZON FROM 2018 TO 2020**

![Book Cover](images/book-cover.jpg)

*<div align="center">Image source: The New York Times</div>*

## **About this project**

This small project aims to help me get back on track with data analytics while evaluating whether **Becoming** by **Michelle Obama** is worth reading. I’ve come across this book in many bookstores and have heard positive reviews, but I’m still unsure whether it suits me right now. By analyzing readers' reviews, I can get a sense of what people think about the book and decide whether to buy it now or at a later stage in my life.

## **About the data**

The data in this project is pretty simple. It comes from the [Amazon Customer Review Dataverse](https://dataverse.harvard.edu/file.xhtml?fileId=5612736&version=1.0) on Harvard Dataverse. Now let's get to work!

## **Data Preprocessing**

First things first! Let's import the data and check whether it is clean. If it is not, I will clean it, and then process it further if the analysis needs it.

In [1]:
# Run this if the pandas library is not installed in your environment
# %pip install pandas --quiet

In [2]:
# Import library
import pandas as pd

In [3]:
# Load the data
data = pd.read_csv('export_book.csv')

In [4]:
# Display some of the data
print("No columns: ", data.shape[1], "\n", "No rows: ", data.shape[0])
data.head()

No columns:  10 
 No rows:  5000


| | Unnamed: 0 | asin | product name | ratings | reviews | helpful | date | Unnamed: 6 | target | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1524763136 | Becoming | 5 | \n\n Slow and boring and self boasting.\n\n | 4100 | 13-Dec-18 | NaN | p | \n\n Slow and boring and self boasting.\n\n |
| 1 | 1 | 1524763136 | Becoming | 5 | \n\n The last thing I wanted to read was a sh... | 3892 | 11-Dec-18 | NaN | p | \n\n The last thing I wanted to read was a sh... |
| 2 | 2 | 1524763136 | Becoming | 1 | \n\n I believe I always loved Michelle Obama.... | 2824 | 13-Nov-18 | NaN | n | \n\n I believe I always loved Michelle Obama.... |
| 3 | 3 | 1524763136 | Becoming | 1 | \n\n Worst piece of crap ever\n\n | 3182 | 11-Dec-18 | NaN | n | \n\n Worst piece of crap ever\n\n |
| 4 | 4 | 1524763136 | Becoming | 1 | \n\n If you are an insomniac this book will d... | 2838 | 11-Dec-18 | NaN | n | \n\n If you are an insomniac this book will d... |


There are 10 columns, but some seem useless (they may contain only NaN or a single value) and some seem to duplicate another column. Let's inspect them and remove the ones that are not necessary!

In [5]:
# Check the number of distinct values in each column
for col in data.columns:
    print(col, " : ", data[col].nunique())

Unnamed: 0  :  5000
asin  :  1
product name  :  1
ratings  :  5
reviews  :  4993
helpful  :  147
date  :  669
Unnamed: 6  :  0
target  :  2
text  :  4993


Let's look at them one by one:

- **`Unnamed: 0`**: The number of unique values in this column equals the number of rows in the dataset, so it is clearly just an index or row ID. Since I only have one dataset, there is no need for this column.

- **`asin`**: This column is the opposite of the previous one: it holds a single value across all 5000 rows. The ASIN (Amazon Standard Identification Number) identifies a product, so a single value simply means that every review belongs to the same book. It adds nothing to this analysis, so this column will be removed.

- **`product name`**: Just like the previous column, this one contains only one value. Since I analyze the reviews of a single product, there is no need for this column either.

- **`ratings`**: Five unique values is exactly what I expect, as a book can be rated from 1 to 5 stars. This one will stay.

- **`reviews`**: Aha! This is the main character of this analysis. 4993 unique values out of 5000 is not bad. To be honest, I expected 5000 unique values, as no two reviews should be identical. I will definitely keep this column, but I will inspect its duplicate values later.

- **`helpful`**: This is the number of "helpful" votes a review received from other customers. The number of unique values looks plausible to me. I will keep this one.

- **`date`**: The number of unique values in this column looks fine too. This one will stay.

- **`Unnamed: 6`**: Uh oh, this one is not good, as it has 0 unique values. Looking at the first rows above, its values are all NaN. Definitely delete it!

- **`target`**: Its two unique values ('n' and 'p') stand for 'negative' and 'positive': they are sentiment labels for the reviews. The column is not wrong, but I want to analyze the sentiment in more depth, with more labels than just 'negative' and 'positive', so this column will go away.

- **`text`**: It is easy to see from the first 5 rows printed above that this column and **`reviews`** hold the same content (they also have the same number of unique values). This column will be removed.

In short, the columns that will be removed are: **`Unnamed: 0`**, **`asin`**, **`product name`**, **`Unnamed: 6`**, **`target`**, **`text`** (6 columns).
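The column-by-column inspection above can also be sketched as a small helper. This is only an illustration under assumptions: `find_redundant_columns` is a hypothetical name, and the toy frame holds made-up values that merely mimic the shape of the review data.

```python
import pandas as pd

def find_redundant_columns(df: pd.DataFrame) -> dict:
    """Group columns that carry no information for a single-product dataset."""
    # Columns whose values are all missing (nunique() ignores NaN, so it returns 0)
    empty = [c for c in df.columns if df[c].nunique() == 0]
    # Columns holding a single repeated value
    constant = [c for c in df.columns if df[c].nunique() == 1]
    # Columns whose contents are identical to an earlier column
    duplicated = []
    cols = list(df.columns)
    for i, col in enumerate(cols):
        if any(df[col].equals(df[prev]) for prev in cols[:i]):
            duplicated.append(col)
    return {"empty": empty, "constant": constant, "duplicated": duplicated}

# Toy frame with made-up values (not the real dataset)
toy = pd.DataFrame({
    "asin": [1524763136, 1524763136, 1524763136],
    "ratings": [5, 1, 4],
    "reviews": ["Great", "Boring", "Good book."],
    "Unnamed: 6": [float("nan")] * 3,
    "text": ["Great", "Boring", "Good book."],
})
report = find_redundant_columns(toy)
print(report)
```

A report like this only flags candidates. Whether a column such as `target` should go is still a judgment call that depends on the analysis.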

In [6]:
# Remove the columns
data.drop(['Unnamed: 0', 'asin', 'product name', 'Unnamed: 6', 'target', 'text'], axis=1, inplace=True)
data.head()

| | ratings | reviews | helpful | date |
|---|---|---|---|---|
| 0 | 5 | \n\n Slow and boring and self boasting.\n\n | 4100 | 13-Dec-18 |
| 1 | 5 | \n\n The last thing I wanted to read was a sh... | 3892 | 11-Dec-18 |
| 2 | 1 | \n\n I believe I always loved Michelle Obama.... | 2824 | 13-Nov-18 |
| 3 | 1 | \n\n Worst piece of crap ever\n\n | 3182 | 11-Dec-18 |
| 4 | 1 | \n\n If you are an insomniac this book will d... | 2838 | 11-Dec-18 |


Now let's look at the duplicate values in the `reviews` column!

In [7]:
data[data.reviews.duplicated(keep=False)].sort_values('reviews')

| | ratings | reviews | helpful | date |
|---|---|---|---|---|
| 65 | 5 | \n\n Boring\n\n | 132 | 8-Jan-19 |
| 279 | 5 | \n\n Boring\n\n | 10 | 30-May-19 |
| 1216 | 5 | \n\n Boring\n\n | 39 | 16-Dec-18 |
| 4442 | 4 | \n\n Excellent\n\n | 0 | 13-Jul-20 |
| 4936 | 5 | \n\n Excellent\n\n | 0 | 6-May-20 |
| 4382 | 5 | \n\n Good book.\n\n | 0 | 26-Jul-20 |
| 4706 | 5 | \n\n Good book.\n\n | 0 | 2-Jun-20 |
| 2290 | 1 | \n\n Great book\n\n | 1 | 6-Jun-20 |
| 4868 | 5 | \n\n Great book\n\n | 0 | 10-May-20 |
| 647 | 5 | \n\n Love the book!\n\n | 9 | 15-Nov-18 |


Okay, this duplication looks acceptable: the duplicate reviews were posted on different dates, and some come with different ratings and helpful votes. They are short, generic phrases that different readers could easily write independently, so I treat them as valid and keep them for the analysis.
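To make the distinction between "same text" and "same row" explicit, here is a minimal sketch on a toy frame. The values are made up, and the second and fourth rows are deliberately exact copies.

```python
import pandas as pd

# Toy rows with made-up values; rows 1 and 3 are exact copies of each other
toy = pd.DataFrame({
    "ratings": [5, 5, 4, 5],
    "reviews": ["Boring", "Excellent", "Excellent", "Excellent"],
    "helpful": [132, 0, 0, 0],
    "date": ["8-Jan-19", "6-May-20", "13-Jul-20", "6-May-20"],
})

# Same review text, regardless of the other columns
same_text = toy["reviews"].duplicated(keep=False)
# Rows identical in every column: more likely a true duplicate record
same_row = toy.duplicated(keep=False)

print(int(same_text.sum()), int(same_row.sum()))
```

Rows flagged only by `same_text` are probably different readers using the same words, while rows flagged by `same_row` would deserve a closer look.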

But before that, let me clean the leading and trailing "\n\n" characters in each review.

In [8]:
# Remove "\n\n" from the reviews
data['reviews'] = data['reviews'].str.replace("\n\n", "").str.strip()
data.head()

| | ratings | reviews | helpful | date |
|---|---|---|---|---|
| 0 | 5 | Slow and boring and self boasting. | 4100 | 13-Dec-18 |
| 1 | 5 | The last thing I wanted to read was a shallow ... | 3892 | 11-Dec-18 |
| 2 | 1 | I believe I always loved Michelle Obama. Her ... | 2824 | 13-Nov-18 |
| 3 | 1 | Worst piece of crap ever | 3182 | 11-Dec-18 |
| 4 | 1 | If you are an insomniac this book will definit... | 2838 | 11-Dec-18 |


It looks good now, but I suspect that some data types are not right. Let's inspect them!

In [9]:
data.dtypes

ratings     int64
reviews    object
helpful    object
date       object
dtype: object

Well, it is clear that I have to change the data type of the `helpful` and `date` columns.
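A numeric column usually ends up as `object` when some entries cannot be parsed as numbers. As a sketch, assuming the culprit is a thousands separator (an assumption, with made-up values below), the offending entries can be listed like this:

```python
import pandas as pd

# Made-up vote counts stored as text, as they might look after read_csv
helpful = pd.Series(["4100", "1,024", "39", "0"])

# Values that cannot be parsed as numbers become NaN with errors="coerce"
parsed = pd.to_numeric(helpful, errors="coerce")
offenders = helpful[parsed.isna()]
print(offenders.tolist())

# After removing the separator, every value converts cleanly
cleaned = pd.to_numeric(helpful.str.replace(",", "", regex=False))
print(cleaned.tolist())
```

Listing the offenders first makes it easier to confirm that commas are the only problem before casting the whole column.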

In [10]:
# Change the data type of the `helpful` column
data['helpful'] = data['helpful'].str.replace(",", "").astype(int)
data.helpful.dtypes

dtype('int64')

In [11]:
# Change the data type of the `date` column
data['date'] = pd.to_datetime(data['date'], format='%d-%b-%y')
data.date.dtypes

dtype('<M8[ns]')
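As a quick sanity check on the parsing step, the sketch below uses made-up strings in the same day-month-year layout. It shows what each format code means and how unparseable entries could be counted with `errors="coerce"` (the notebook cell itself raises on bad input instead).

```python
import pandas as pd

# Made-up date strings in the same layout as the dataset, plus one bad entry
raw = pd.Series(["13-Dec-18", "8-Jan-19", "6-May-20", "not a date"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(raw, format="%d-%b-%y", errors="coerce")

# Strings that do not match the format become NaT
n_failed = int(parsed.isna().sum())
print(n_failed)
print(parsed.min(), parsed.max())
```

Checking the minimum and maximum afterwards is a cheap way to confirm that the dates fall inside the expected 2018 to 2020 window.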

Okay! Let's see the clean data now!

In [12]:
data.head()

| | ratings | reviews | helpful | date |
|---|---|---|---|---|
| 0 | 5 | Slow and boring and self boasting. | 4100 | 2018-12-13 |
| 1 | 5 | The last thing I wanted to read was a shallow ... | 3892 | 2018-12-11 |
| 2 | 1 | I believe I always loved Michelle Obama. Her ... | 2824 | 2018-11-13 |
| 3 | 1 | Worst piece of crap ever | 3182 | 2018-12-11 |
| 4 | 1 | If you are an insomniac this book will definit... | 2838 | 2018-12-11 |


Well, after some research, I think I need to process the data a little bit more as I want to analyze the text in the `reviews` column.

In [13]:
# I run this to install the nltk library for preprocessing the text data
# %pip install nltk --quiet

In [14]:
# Import the library
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
# Define a function for processing the reviews

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Lowercase and tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove the stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)

In [16]:
# Process the reviews
data['reviews'] = data['reviews'].apply(process_text)

In [17]:
# View the `reviews` column now
data.reviews.head()

0                            slow boring self boasting
1    last thing wanted read shallow minded patting ...
2    believe always loved michelle obama grace dign...
3                                worst piece crap ever
4          insomniac book definitely help real snoozer
Name: reviews, dtype: object
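One side effect to keep in mind: NLTK's English stop-word list contains negations such as "not" and "no", so `process_text` drops them, which can change the meaning of a review. The sketch below shows one possible way to keep them. It is an illustration only: it uses a tiny hand-written stop-word set and plain whitespace splitting instead of NLTK, and `process_text_keep_negations` is a hypothetical name.

```python
import string

# Tiny hand-written stop-word set for illustration only (an assumption);
# the notebook itself uses NLTK's full English list
STOP_WORDS = {"the", "is", "a", "this", "not", "no", "was", "it"}
NEGATIONS = {"not", "no"}

def process_text_keep_negations(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    # Strip punctuation, lowercase, and split on whitespace
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    # Drop stop words unless they are in the keep set
    return " ".join(t for t in tokens if t not in stop_words or t in keep)

with_negation = process_text_keep_negations("This book is not good.")
without_negation = process_text_keep_negations("This book is not good.", keep=set())
print(with_negation)      # book not good
print(without_negation)   # book good
```

Whether keeping negations actually improves the sentiment labels would need to be checked on the real reviews.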

Okay, the data is now clean. However, because I deleted the `target` column, which held the sentiment label of each review, I have to recreate it under a new name with more detailed sentiment labels so that the analysis can go deeper.

In [18]:
# I run this to install the textblob library 
# %pip install textblob --quiet

In [19]:
# Import the library
from textblob import TextBlob

In [20]:
# Define a function to get the sentiment of the reviews

def get_sentiment_label(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity >= 0.6:
        return 'very positive'
    elif polarity >= 0.2:
        return 'positive'
    elif polarity >= -0.2:
        return 'neutral'
    elif polarity >= -0.6:
        return 'negative'
    else:
        return 'very negative'

In [21]:
# Get the sentiment of the reviews

data['sentiment'] = data['reviews'].apply(get_sentiment_label)
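For reference, the thresholds in `get_sentiment_label` can also be written as bins with `pd.cut`, which works on a whole column of polarity scores at once. This is a sketch with made-up scores, not a replacement for the cell above. `BINS`, `LABELS`, and `label_polarity` are hypothetical names.

```python
import pandas as pd

# Bin edges and labels mirroring the thresholds in get_sentiment_label
BINS = [-float("inf"), -0.6, -0.2, 0.2, 0.6, float("inf")]
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

def label_polarity(polarity: pd.Series) -> pd.Series:
    # right=False makes each bin closed on the left, matching the >= checks
    return pd.cut(polarity, bins=BINS, labels=LABELS, right=False).astype(str)

# Made-up polarity scores, including the boundary values
scores = pd.Series([-1.0, -0.6, -0.2, 0.0, 0.2, 0.6, 1.0])
labels = label_polarity(scores).tolist()
print(labels)
```

Keeping the bin edges in one list makes it easier to experiment with different cut-offs later.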

In [22]:
# View the data now
data.head()

| | ratings | reviews | helpful | date | sentiment |
|---|---|---|---|---|---|
| 0 | 5 | slow boring self boasting | 4100 | 2018-12-13 | very negative |
| 1 | 5 | last thing wanted read shallow minded patting ... | 3892 | 2018-12-11 | neutral |
| 2 | 1 | believe always loved michelle obama grace dign... | 2824 | 2018-11-13 | positive |
| 3 | 1 | worst piece crap ever | 3182 | 2018-12-11 | very negative |
| 4 | 1 | insomniac book definitely help real snoozer | 2838 | 2018-12-11 | neutral |
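The first rows already show star ratings and sentiment labels disagreeing (for example, a 5-star review labelled "very negative"). A cross-tabulation is one way to see how often that happens. The sketch below uses a toy frame with made-up rows rather than the real data.

```python
import pandas as pd

# Toy ratings and sentiment labels (made-up rows, not the real dataset)
toy = pd.DataFrame({
    "ratings": [5, 5, 1, 1, 1, 4],
    "sentiment": ["very negative", "neutral", "positive",
                  "very negative", "neutral", "positive"],
})

# Rows are star ratings, columns are sentiment labels, cells are review counts
table = pd.crosstab(toy["ratings"], toy["sentiment"])
print(table)
```

Large counts in cells such as "5 stars / very negative" would suggest that either the rating or the automatic label is unreliable for those reviews.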


Okayyy! It's time for the main part: Analyze the data!

## **Data Analysis**

To get warmed up, I'd like to see the average rating of this book. This is a simple number that appears on almost every book or product page, and it gives buyers a quick, simplified view of the product they are considering.

In [23]:
# The average rating of the book
print("The average rating of the book:", round(data.ratings.mean(), 1))

The average rating of the book: 4.4


A pretty good score! It is not easy to reach this level. However, I want to use the data I have to better understand what readers think of this book. Let's dive into more insights! I'd like to see how many reviews this book got on Amazon per month from 2018 to 2020.

In [24]:
# Install the matplotlib and plotly libraries
# %pip install matplotlib plotly --quiet

In [25]:
# Import the library
import matplotlib.pyplot as plt
import plotly.express as px

In [115]:
# Number of reviews over time

# Get the number of reviews per month
# 'ME' (month end) replaces the deprecated 'M' alias in pandas 2.2+
no_monthly_reviews = data.groupby(pd.Grouper(key='date', freq='ME')).size().reset_index(name='review_counts')

# Plot the data
fig = px.line(
    no_monthly_reviews,
    x='date',
    y='review_counts',
    title='NUMBER OF REVIEWS OVER MONTHS',
    labels={'date': 'Month', 'review_counts': 'Number of Reviews'}
)

# Update figure for better readability and aesthetics
fig.update_layout(
    xaxis_tickformat='%m-%Y',
    title_font=dict(size=30, family='Arial, sans-serif', weight='bold'),
    title_x=0.5,
    width=1000,
    paper_bgcolor='lightblue',
    plot_bgcolor='lightblue'
)

fig.update_traces(
    mode='lines+markers',
    line=dict(color='#8c564b', width=2),
)

fig.update_xaxes(
    tickangle=-90,
    tickfont=dict(size=10),
    dtick="M1", # Show every month
    tickmode='array',
    tickvals=no_monthly_reviews['date'],
    ticktext=[date.strftime('%m-%Y') for date in no_monthly_reviews['date']],
    range=[no_monthly_reviews['date'].min() - pd.Timedelta(weeks=4), no_monthly_reviews['date'].max() + pd.Timedelta(weeks=4)],
)

fig.update_yaxes(
    tickfont=dict(size=10),
)

fig.show()

Hmm, the number of reviews is far from evenly spread between 11-2018 and 10-2020. The book was published in 11-2018 and got a significant number of reviews right away. The month with the most reviews is January 2019, with 891 reviews (almost 900). In contrast, the month with the fewest is October 2020, with only 19.

Comparing the first months after publication (11-2018 to 04-2019) with the remaining months (05-2019 to 10-2020), I can see the book's golden era (11-2018 to 04-2019) and the quieter period after it (05-2019 to 10-2020). When the book was freshly published, many readers bought it and reviewed it, which is why its top months by number of reviews are the ones right after publication.

I wonder whether the early reviews themselves played a part: maybe they said the book was not good, so people who read them decided not to read it. (There are many possible reasons I can think of, but not all of them can be answered with this limited dataset.) Let's explore the ratings in those months!
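Before that, a side note: the busiest and quietest months can be read off a monthly count table programmatically instead of from the chart. A minimal sketch, using a toy table with made-up counts shaped like `no_monthly_reviews`:

```python
import pandas as pd

# Made-up monthly counts with the same columns as no_monthly_reviews
monthly = pd.DataFrame({
    "date": pd.to_datetime(["2018-11-30", "2018-12-31", "2019-01-31", "2019-02-28"]),
    "review_counts": [10, 40, 90, 30],
})

# idxmax / idxmin return the row labels of the largest and smallest counts
busiest = monthly.loc[monthly["review_counts"].idxmax()]
quietest = monthly.loc[monthly["review_counts"].idxmin()]

print(busiest["date"].strftime("%m-%Y"), busiest["review_counts"])
print(quietest["date"].strftime("%m-%Y"), quietest["review_counts"])
```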

In [113]:
# Average ratings over time

# Get the average ratings per month
# 'ME' (month end) replaces the deprecated 'M' alias in pandas 2.2+
avg_monthly_ratings = data.groupby(pd.Grouper(key='date', freq='ME'))['ratings'].mean().reset_index(name='average_ratings')
avg_monthly_ratings['average_ratings'] = avg_monthly_ratings['average_ratings'].round(1)

# Plot the data
fig = px.line(
    avg_monthly_ratings,
    x='date',
    y='average_ratings',
    title='AVERAGE RATINGS OVER MONTHS',
    labels={'date': 'Month', 'average_ratings': 'Average Ratings'},
    width=1000,
    height=600,
)

fig.update_layout(
    title_font=dict(size=30, family='Arial, sans-serif', weight='bold'),
    title_x=0.5,
    paper_bgcolor='lightblue',
    plot_bgcolor='lightblue'
)

fig.update_traces(
    mode='lines+markers',
    line=dict(color='#8c564b', width=2),
)

fig.update_xaxes(
    tickangle=-90,
    tickfont=dict(size=10),
    dtick="M1",
    tickmode='array',
    tickvals=avg_monthly_ratings['date'],
    ticktext=[date.strftime('%m-%Y') for date in avg_monthly_ratings['date']],
    range=[avg_monthly_ratings['date'].min() - pd.Timedelta(weeks=4), avg_monthly_ratings['date'].max() + pd.Timedelta(weeks=4)],
)

fig.update_yaxes(range=[0, 5])

fig.show()

Hmm, it seems that my hypothesis is not true. The monthly average ratings are fairly stable, ranging from 4.0 to 4.7. The publication month has the lowest average rating, at 4.0. The highest monthly average appears in two months: May 2019 and October 2020.
Although the first months after publication brought the book many reviews, the ratings in the publication month were the lowest of the whole period.
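Because the number of reviews differs so much from month to month, a monthly average based on a handful of reviews is less reliable than one based on hundreds. One way to keep that in view is to compute the average and the review count side by side. This is a sketch on a toy frame with made-up rows.

```python
import pandas as pd

# Made-up reviews: one month with three reviews and one month with a single review
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-11-13", "2018-11-20", "2018-11-25", "2018-12-02"]),
    "ratings": [5, 3, 4, 5],
})

# Group by calendar month and compute both the mean rating and the sample size
monthly = (
    toy.groupby(toy["date"].dt.to_period("M"))["ratings"]
       .agg(average_ratings="mean", review_counts="size")
       .reset_index()
)
print(monthly)
```

In this toy example the second month has the higher average, but it rests on a single review, which is the kind of caveat worth checking in the real data too.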

## **References**

Some articles or tutorials I read to complete this project and other resources:

1. [In ‘Becoming,’ Michelle Obama Mostly Opts for Empowerment Over Politics](https://www.nytimes.com/2018/11/09/books/review-becoming-michelle-obama.html) by [Jennifer Szalai](https://www.nytimes.com/by/jennifer-szalai) on *The New York Times*.

2. [Use Sentiment Analysis With Python to Classify Movie Reviews](https://realpython.com/sentiment-analysis-python/) by [Kyle Stratis](https://realpython.com/team/kstratis/) on *Real Python*.

3. [Becoming (book)](https://www.google.com/search?q=becoming+michelle+obama+publication+date&oq=becoming+michelle+obama+publicat&gs_lcrp=EgZjaHJvbWUqBwgDECEYoAEyBggAEEUYOTIHCAEQIRigATIHCAIQIRigATIHCAMQIRigAdIBCDk4OTlqMGo3qAIAsAIA&sourceid=chrome&ie=UTF-8#:~:text=Becoming%20(book)%20%2D%20Wikipedia,wiki%20%E2%80%BA%20Becoming_(book)) on *Wikipedia*.