# Amazon Best Selling Book Data Analysis

## Business Understanding

Amazon offers a huge collection of books for its customers to buy or borrow. After buying a book, customers can rate it on a 5-star scale. Reviewers can be professionals such as journalists or editors, but they can also be amateurs or anyone with a point of view in a specific area. The rating system is based on a score and a detailed text review. Ratings can be used to recommend books and to help others decide whether to purchase a particular book.

In this notebook, I use the dataset "Amazon's best selling books between 2009 and 2019" and analyze the authors, the genres, and the most valuable books in it. Therefore, I list 3 business questions to answer through the exploration:


#### Who are the top 10 most popular authors?
#### How can we give a weighted rating to all the books?
#### How are sales distributed across genres?

## Data acquisition and understanding
First, we need to import the useful libraries and the dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
pd.set_option("display.width", 350)
[([root] + dirname + file)[0] for root, dirname, file in os.walk('/kaggle/input') if len(file)>0 ][0]

In [None]:
df=pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')


## Understanding Data
To understand the data, we need an overview of it, including its rows and columns; its size is also of interest.

In [None]:
df.shape

In [None]:
df.head(10)

In [None]:
df.describe()

> After the first load and check, the dataset has 550 rows and 7 columns.
> To clean the content, we need to prepare the data.

## Prepare Data

### Remove Duplications

Some books appear in several years, and the price may differ between listings.  
The genre should be the same, but reviews and ratings need to be checked.
I need to merge books with the same title, reviews, and ratings.

In [None]:
#check duplicates by .duplicated
df.duplicated(subset=['Name', 'User Rating', 'Reviews']).sum()


This gives 198 duplicate rows that can be merged.

In [None]:
# 1. Count the rows for each combination of name, rating, and reviews.
name_grouped = df.groupby(["Name", 'User Rating', 'Reviews']).count()

# 2. Keep the combinations that occur more than once, i.e. the duplicates.
duplication = name_grouped.loc[(name_grouped['Author'] > 1)]

duplication


So 97 books are listed more than once.
To identify a book, I use not only its name but also its rating and reviews, so I can safely merge the entries and simply pick the newest version (latest year). The years are already sorted, so we can just pick the **last entry** of each duplicate.

Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html

Then we can safely merge these duplicates in the original dataset.

In [None]:
merge_duplication=df.drop_duplicates(subset=['Name', 'User Rating', 'Reviews'],keep = 'last',inplace = False)
merge_duplication
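
As a small illustration of how `duplicated` and `drop_duplicates(keep='last')` behave, here is a sketch on a made-up three-row frame (the titles and numbers are invented for the example, not taken from the dataset). It assumes, as above, that rows for the same book are ordered by year.

```python
import pandas as pd

# Toy data (invented for illustration): "Book A" is listed in two years.
toy = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B"],
    "User Rating": [4.5, 4.5, 4.0],
    "Reviews": [100, 100, 50],
    "Price": [12, 10, 8],
    "Year": [2015, 2016, 2016],
})

key = ["Name", "User Rating", "Reviews"]

# duplicated() flags every row that repeats an earlier key combination.
n_duplicates = toy.duplicated(subset=key).sum()

# keep="last" retains the final row of each group, here the latest year.
deduped = toy.drop_duplicates(subset=key, keep="last")

print(n_duplicates)
print(deduped)
```

Note that the price of the surviving row is simply whatever the last listing had; earlier prices are discarded.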

Check whether any duplicated rows remain:

In [None]:
#check again
merge_duplication.reset_index().duplicated(['Name', 'User Rating', 'Reviews']).sum()

### Remove Price 0 

We can see in the dataframe description that some items are priced at 0, which may skew their ratings compared with normally priced books.

In [None]:
price0 = merge_duplication.loc[merge_duplication['Price'] == 0]
price0

I will remove these 7 rows from merge_duplication to get the prepared dataset.

In [None]:
df_cleaned = merge_duplication.drop(price0.index)
df_cleaned
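
The same filtering can also be written as a single boolean mask instead of dropping by index. This is only a sketch on invented data, and it assumes prices are never negative, so that `Price > 0` is equivalent to `Price != 0`.

```python
import pandas as pd

# Toy data (invented for illustration).
toy = pd.DataFrame({
    "Name": ["Free Book", "Paid Book", "Another Paid Book"],
    "Price": [0, 15, 9],
})

# Keep only the rows whose price is greater than 0.
paid_only = toy[toy["Price"] > 0]
print(paid_only)
```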

## Get a Quick Overview


To understand our business questions better, I need to understand the dataset better. Here are a couple of questions to answer:


How many books can one author publish? 
And how many books did the top authors publish in total? 

In [None]:
sorted_by_df_cleaned = df_cleaned.value_counts('Author')

top_popular= sorted_by_df_cleaned[:10]

plt.figure(figsize=(20,5))
sns.barplot(x=top_popular.index, y=top_popular.values)

plt.title('Top 10 Authors with the Most Books')
plt.ylabel('Number of Books')
plt.show()


This shows the top 10 authors with the most books published. Rick Riordan is the most popular author, and Stephen King is the 10th.

In [None]:
df_cleaned.loc[df_cleaned['Author'].isin(list(top_popular.index))]

To answer the questions: one author can have up to 10 books, and the top 10 authors published 62 books in total.
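
Both figures can be read directly off the counts. A minimal sketch on an invented author column (the names and the top-2 cutoff are placeholders, not the real data):

```python
import pandas as pd

# Toy data (invented for illustration): one row per book.
toy = pd.DataFrame({"Author": ["A", "A", "A", "B", "B", "C"]})

# Books per author, sorted from most to fewest.
counts = toy["Author"].value_counts()

top_n = counts.head(2)         # the top 2 authors in this toy example
max_books = counts.max()       # most books by a single author
total_top_books = top_n.sum()  # books written by the top authors combined

print(max_books, total_top_books)
```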

### Bayesian average

> From the overview, I can see a problem: the number of votes is small for less famous books.
> To address this, the 'Bayesian average' is a good fit to introduce here. 

*inspired by https://www.kaggle.com/paotografi/amazon-2009-2019-best-selling-book-eda*

My understanding is that the intuition behind the Bayesian average is to take the minimum and the average value of the dataset into account when evaluating a single element.
An individual value is calculated together with group-level factors, which act as "outside information".
In this way, books with fewer votes get a weighted rating that can be compared fairly with books with more votes.

From Wikipedia:
https://en.wikipedia.org/wiki/Bayesian_average


IMDb also uses it to give a weighted rating. Here is IMDb's explanation: 
*How does IMDB calculate the rank of movies and TV shows on the Top Rated Movies and Top Rated TV Show lists?
The following formula is used to calculate the Top Rated 250 titles. This formula provides a true 'Bayesian estimate', which takes into account the number of votes each title has received, minimum votes required to be on the list, and the mean vote for all titles*

'Bayesian estimate' from imdb: 
https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV?ref_=helpms_helpart_inline#calculate


To use the Bayesian estimate, 4 variables need to be considered:

1. v = votes: the number of reviews received by the author, i.e. the number of people who gave a rating
2. m = minimum votes: the lowest review count in the dataset
3. R = mean rating received by the author
4. C = mean rating across the dataset

Then we calculate the weighted rating using Bayesian average in this way:

$$\text{weighted\_rating} = \frac{R \cdot v}{v + m} + \frac{C \cdot m}{v + m}$$
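
To see how the formula behaves, here is a small sketch with invented numbers (a dataset mean of 4.0 and m = 100 are assumptions for the example only). An item rated 5.0 by only 10 voters is pulled strongly toward the dataset mean, while the same rating backed by 10,000 votes barely moves.

```python
def weighted_rating(R, v, m, C):
    """Bayesian average: blend the item's mean rating R (backed by v votes)
    with the dataset mean C, which carries a weight of m votes."""
    return (R * v + C * m) / (v + m)

# Invented numbers for illustration only.
few_votes = weighted_rating(R=5.0, v=10, m=100, C=4.0)
many_votes = weighted_rating(R=5.0, v=10_000, m=100, C=4.0)

print(round(few_votes, 3), round(many_votes, 3))
```

The larger m is relative to v, the more weight the dataset mean receives.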

In [None]:
# compute m (minimum review count) and C (mean rating of the dataset)
m = merge_duplication['Reviews'].min()
C = merge_duplication['User Rating'].mean()
['min review:',m,'mean rating:',C]

In [None]:
# prepare variables
author_counts = df_cleaned.value_counts('Author')
author_names = author_counts.index
book_counts = author_counts.values  # number of books per author
author_counts

In [None]:
# initialize v, R and the weighted ratings
ratings_sum = np.zeros(len(author_counts))
v = np.zeros(len(author_counts))
R = np.zeros(len(author_counts))
ratings = np.zeros(len(author_counts))

In [None]:
# compute the weighted rating for each author
get_rating_sum = lambda x: df_cleaned.loc[df_cleaned['Author'] == author_names[x], 'User Rating'].sum()
get_votes = lambda x: df_cleaned.loc[df_cleaned['Author'] == author_names[x], 'Reviews'].sum()
for i in range(0, len(author_counts)):
    ratings_sum[i] = get_rating_sum(i)
    R[i] = ratings_sum[i] / book_counts[i]  # mean rating over the author's books
    v[i] = get_votes(i)                     # total reviews over the author's books

    # Bayesian average
    ratings[i] = (R[i] * v[i] + C * m) / (v[i] + m)

ratings[:10]
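
The per-author loop can also be expressed with a single `groupby`. The sketch below is an alternative formulation on invented data, not the code used in this notebook; the short column names `R` and `v` are chosen here only to mirror the formula.

```python
import pandas as pd

# Toy data (invented for illustration): one row per book.
toy = pd.DataFrame({
    "Author": ["A", "A", "B"],
    "User Rating": [4.0, 5.0, 4.8],
    "Reviews": [100, 300, 20],
})

m = toy["Reviews"].min()       # minimum review count in the data
C = toy["User Rating"].mean()  # mean rating of the data

# R = mean rating per author, v = total reviews per author.
per_author = toy.groupby("Author").agg(
    R=("User Rating", "mean"),
    v=("Reviews", "sum"),
)

# Bayesian average, computed for all authors at once.
per_author["Weighted Rating"] = (
    per_author["R"] * per_author["v"] + C * m
) / (per_author["v"] + m)

print(per_author)
```

Author B has the higher raw mean, and with so few reviews the weighted rating sits much closer to the dataset mean than the raw value does.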

In [None]:
#put author rating into dataframe
df_rating=pd.DataFrame({
    'Author': author_names,
    'Books Written': author_counts,
    'Reviews': v,
    'Average Rating': R,
    'Weighted Rating': ratings
})
df_rating['Average Rating']=df_rating['Average Rating'].round(decimals=4)
df_rating.head()

After getting the weighted ratings, I can re-rank the authors.

In [None]:
top_rating=df_rating.nlargest(10 ,['Weighted Rating'])
plt.figure(figsize=(20,6))
sns.barplot(x=top_rating['Author'], y=top_rating['Weighted Rating'])
plt.title('Top 10 Authors by Weighted Rating')

plt.ylim(top_rating['Weighted Rating'].min()-0.0001,top_rating['Weighted Rating'].max()+0.0001)
plt.ylabel('Weighted Ratings')
plt.show()

top_rating

The new ranking differs from the ranking by total number of books published. Dav Pilkey has the highest rating.


### The Genre

Next, look into the Genre category and find out how many groups there are by grouping, counting, and summing.

In [None]:
df_cleaned.groupby('Genre').count()

It shows that there are more Non Fiction books in the list.

In [None]:
df_cleaned.groupby('Genre').sum(numeric_only=True)

Because there are only 2 categories, we can simply rename the Genre column to isFiction and make its values True or False.

In [None]:
is_fiction = df_cleaned.rename(columns={'Genre': 'isFiction'}).replace({'isFiction': {'Fiction': True, 'Non Fiction': False}})

is_fiction

In [None]:
genre_reviews = is_fiction.groupby("isFiction")["Reviews"].sum()
genre_ratings = is_fiction.groupby("isFiction")["User Rating"].sum()
genre_reviews_avg = is_fiction.groupby("isFiction")["Reviews"].mean()
genre_ratings_avg = is_fiction.groupby("isFiction")["User Rating"].mean()
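
The separate `groupby` calls can also be collected into one summary table with named aggregation. This is a sketch on invented data; it groups by the original `Genre` strings for readability, and the output column names are made up for the example.

```python
import pandas as pd

# Toy data (invented for illustration).
toy = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction"],
    "Reviews": [100, 300, 50],
    "User Rating": [4.0, 5.0, 4.5],
})

# One pass over the groups, one output column per statistic.
genre_summary = toy.groupby("Genre").agg(
    reviews_sum=("Reviews", "sum"),
    reviews_avg=("Reviews", "mean"),
    rating_avg=("User Rating", "mean"),
)

print(genre_summary)
```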

In [None]:
#1. v = votes: total reviews received by the genre
#2. m = min votes: the lowest review count in the dataset
#3. R = mean rating of the genre
#4. C = mean rating of the dataset

vs = genre_reviews
m = df_cleaned['Reviews'].min()
C = df_cleaned['User Rating'].mean()
Rs = genre_ratings_avg
w_rating=np.zeros(2)

for i in range(0,len(genre_reviews.index)):
    # positional access: the index labels are the booleans False/True, not 0/1
    R = Rs.iloc[i]
    v = vs.iloc[i]
    w_rating[i]=(R * v/(v+m))+(C * m/(v+m))
    

In [None]:
fig=plt.figure(figsize=(10,6))
sns.barplot(x=genre_reviews.index, y=w_rating)

plt.ylim(w_rating.min()-0.01, w_rating.max() + 0.01)
plt.title('Books Fiction or Non Fiction Weighted Rating')
plt.show()

The comparison shows that Fiction books have better ratings than Non Fiction books.