- In this study, we perform an Exploratory Data Analysis (EDA) of Amazon.com's bestselling books.
- The study aims to be beginner friendly and explains each step along the way.
- The dataset has 550 books along with their user rating, number of reviews, price, the year in which they were bestsellers, author and genre.
- The data covers bestsellers from 2009 to 2019.

- Let's import the required libraries

In [1]:
# Only the libraries used in this notebook are imported.
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [2]:
df = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
df.head()

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
0,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17350,8,2016,Non Fiction
1,11/22/63: A Novel,Stephen King,4.6,2052,22,2011,Fiction
2,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18979,15,2018,Non Fiction
3,1984 (Signet Classics),George Orwell,4.7,21424,6,2017,Fiction
4,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7665,12,2019,Non Fiction


In [3]:
df.shape

(550, 7)

- So we have 550 books and 7 features to work with.

In [4]:
df.isnull().sum()

Name           0
Author         0
User Rating    0
Reviews        0
Price          0
Year           0
Genre          0
dtype: int64

- We have a very clean dataset, which is rare in the real world.
- So enjoy working with data that has no missing values.
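
- On a less tidy dataset, the same check is often more useful as a small report with both counts and percentages. The sketch below uses a hypothetical toy frame (`toy`, with made-up rows and injected gaps), because the real dataset has nothing missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the bestseller data has no missing values,
# so gaps are injected here purely for illustration.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "Price": [8.0, np.nan, 15.0, 6.0],
    "Genre": ["Fiction", "Non Fiction", None, "Fiction"],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})
print(missing)
```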

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550 entries, 0 to 549
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         550 non-null    object 
 1   Author       550 non-null    object 
 2   User Rating  550 non-null    float64
 3   Reviews      550 non-null    int64  
 4   Price        550 non-null    int64  
 5   Year         550 non-null    int64  
 6   Genre        550 non-null    object 
dtypes: float64(1), int64(3), object(3)
memory usage: 30.2+ KB


- We have 4 numeric variables.
- We also have 3 non-numeric variables.
- The data types all look appropriate.
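
- Instead of counting dtypes by eye, the split can be pulled out with `select_dtypes`. This is a minimal sketch on two rows copied from the `df.head()` output; `sample` is a stand-in for the full `df`.

```python
import pandas as pd

# Two rows copied from the df.head() output; stand-in for the full dataset.
sample = pd.DataFrame({
    "Name": ["10-Day Green Smoothie Cleanse", "11/22/63: A Novel"],
    "Author": ["JJ Smith", "Stephen King"],
    "User Rating": [4.7, 4.6],
    "Reviews": [17350, 2052],
    "Price": [8, 22],
    "Year": [2016, 2011],
    "Genre": ["Non Fiction", "Fiction"],
})

# Split the column names by whether their dtype is numeric.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```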

In [6]:
df.describe()

Unnamed: 0,User Rating,Reviews,Price,Year
count,550.0,550.0,550.0,550.0
mean,4.618364,11953.281818,13.1,2014.0
std,0.22698,11731.132017,10.842262,3.165156
min,3.3,37.0,0.0,2009.0
25%,4.5,4058.0,7.0,2011.0
50%,4.7,8580.0,11.0,2014.0
75%,4.8,17253.25,16.0,2017.0
max,4.9,87841.0,105.0,2019.0


Before going further, let's summarize what we have got from the dataset.

- Our dataset has 550 books from different authors and genres.

- The object-typed variable Genre can be used to group the books and compare the groups.

- The Reviews and Price columns most probably have outliers (see the gap between mean and median, between the 75th percentile and the maximum, and between the 25th percentile and the minimum).

- The numerical variables deserve special attention in further analysis.


- Everything seems OK.  Let's move on to the next step: **analysis part**.
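
- The outlier suspicion can be made concrete with the common 1.5 x IQR rule of thumb. The sketch below only uses the quartiles, minima and maxima printed by `df.describe()`; the helper `iqr_fences` and the factor 1.5 are assumptions of this sketch, not something the dataset prescribes.

```python
# Values copied from the df.describe() output: (min, Q1, Q3, max).
stats = {
    "User Rating": (3.3, 4.5, 4.8, 4.9),
    "Reviews": (37.0, 4058.0, 17253.25, 87841.0),
    "Price": (0.0, 7.0, 16.0, 105.0),
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


fences = {}
flags = {}
for col, (lo, q1, q3, hi) in stats.items():
    lower, upper = iqr_fences(q1, q3)
    fences[col] = (lower, upper)
    # A column has low/high outliers if its min/max falls outside the fences.
    flags[col] = {"low": lo < lower, "high": hi > upper}
    print(col, fences[col], flags[col])
```

- Under this rule Reviews and Price have outliers on the high side, and User Rating has them on the low side.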

### Analysis Part

#### **Author**

In [7]:
df['Author'].nunique()

248

- There are 248 different authors in the dataset.

#### **Genre**

In [8]:
df['Genre'].value_counts(normalize=True)

Non Fiction    0.563636
Fiction        0.436364
Name: Genre, dtype: float64

- I was expecting more genres, so this is quite surprising.
- Still, Genre is useful for seeing differences between the two categories.

In [9]:
fig = px.histogram(df, x="Genre", title='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### **User Rating**

In [10]:
df['User Rating'].describe()

count    550.000000
mean       4.618364
std        0.226980
min        3.300000
25%        4.500000
50%        4.700000
75%        4.800000
max        4.900000
Name: User Rating, dtype: float64

- The mean and median ratings are quite close to each other (median = 4.7, mean = 4.618).
- Since the median is higher than the mean, we can expect outliers on the low side.
- Most probably we will have a left-skewed distribution.
- Still, we can expect the variable to be close to normally distributed.
- Let's see it.

In [11]:
fig = px.histogram(df, x= 'User Rating', title='User Rating', marginal="box", hover_data = df[['Name','Author']])
fig.show()

- Yep, as expected we have several outliers on the low side.
- The distribution is slightly left skewed, but still close to normal.
- Oh wait...
- Oh no!! J.K. Rowling's 'The Casual Vacancy' got the lowest rating.
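
- The lowest-rated book can also be looked up directly instead of being read off the hover data. A minimal sketch with `idxmin`, using four rows copied from the lowest-rated table in this notebook (`rows` stands in for `df`):

```python
import pandas as pd

# Four rows copied from the lowest-rated table; stand-in for the full df.
rows = pd.DataFrame({
    "Name": ["Allegiant", "The Casual Vacancy", "Gone Girl",
             "Go Set a Watchman: A Novel"],
    "Author": ["Veronica Roth", "J.K. Rowling", "Gillian Flynn", "Harper Lee"],
    "User Rating": [3.9, 3.3, 4.0, 3.6],
})

# idxmin returns the index label of the smallest rating;
# .loc then fetches that whole row.
lowest = rows.loc[rows["User Rating"].idxmin()]
print(lowest)
```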

#### **Reviews**

In [12]:
df['Reviews'].describe()

count      550.000000
mean     11953.281818
std      11731.132017
min         37.000000
25%       4058.000000
50%       8580.000000
75%      17253.250000
max      87841.000000
Name: Reviews, dtype: float64

- There is a large gap between the mean and the median (mean = 11953, median = 8580).
- We can expect a highly right-skewed distribution with outliers on the high side.
- Let's see it.

In [13]:
fig = px.histogram(df, x= 'Reviews', title='Reviews', marginal="box", hover_data = df[['Name','Author']])
fig.show()

- As expected, highly right skewed distribution with the outliers on the maximum side.

- By the way, Kristin Hannah's 'The Nightingale' is among the outliers with more than 49K reviews. It is a beautiful novel to read.
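
- Skewness can also be put into a number with `Series.skew()`, where a positive value means a longer right tail. The sketch below uses ten review counts copied from tables in this notebook; they were picked by hand and include the extremes, so the value is illustrative only and not the skewness of the full column.

```python
import pandas as pd

# Ten review counts copied from tables in this notebook (hand-picked,
# not a random sample of the 550 rows).
reviews = pd.Series(
    [17350, 2052, 18979, 21424, 7665, 87841, 79446, 61133, 37, 220],
    name="Reviews",
)

# Positive skewness indicates a right-skewed distribution,
# which also shows up as mean > median.
skewness = reviews.skew()
print("mean:", reviews.mean(), "median:", reviews.median())
print("skewness:", skewness)
```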

#### **Price**

In [14]:
df['Price'].describe()

count    550.000000
mean      13.100000
std       10.842262
min        0.000000
25%        7.000000
50%       11.000000
75%       16.000000
max      105.000000
Name: Price, dtype: float64

- The mean (13.1) is above the median (11), so we can expect a slightly right-skewed distribution.
- Still, the distribution should be close to normal.
- We can expect outliers on the high side.
- The minimum price is 0, so we also have free books.
- Let me note them down and check after the analysis; if any of them are still free, I would be happy to have them.

In [15]:
fig = px.histogram(df, x= 'Price', title='Price', marginal="box", hover_data = df[['Name','Author']])
fig.show()

- As we expected, we have a slightly right-skewed distribution with outliers on the high side.
- Also, 13 books fall in the $0-1 bin, so up to 13 of them are free. Sounds good to me.
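
- A histogram bin can mix free books with very cheap ones, so an exact filter is a safer way to count the free titles. The rows below are hypothetical placeholders (`toy` is not the real data); on the real dataset the same filter would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "Name": ["Free Book 1", "Free Book 2", "Cheap Book", "Regular Book"],
    "Price": [0, 0, 1, 12],
})

# Books priced at exactly 0 versus books that land in a 0-1 price bin.
free = toy[toy["Price"] == 0]
in_first_bin = toy[toy["Price"].between(0, 1)]
print(len(free), "free books;", len(in_first_bin), "books in the 0-1 bin")
```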

- Before moving on to the details, let's see the correlation matrix for our dataset.

In [16]:
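# Select the numeric columns explicitly: recent pandas versions raise an
# error when corr() is called on a frame that still contains text columns.
df[['User Rating', 'Reviews', 'Price']].corr()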

Unnamed: 0,User Rating,Reviews,Price
User Rating,1.0,-0.001729,-0.133086
Reviews,-0.001729,1.0,-0.109182
Price,-0.133086,-0.109182,1.0


In [17]:
index_vals = df['Genre'].astype('category').cat.codes

fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='User Rating',
                                 values=df['User Rating']),
                            dict(label='Reviews',
                                 values=df['Reviews']),
                            dict(label='Price',
                                 values=df['Price'])],
                showupperhalf=False, 
                text=df['Name'],
                marker=dict(color=index_vals,
                            showscale=False, # colors encode categorical variables
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='Books',
    width=1000,
    height=1000,
)

fig.show()

- No pair of variables is strongly correlated (the largest magnitude is about 0.13), so there is no relationship here to pursue further.
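
- `corr()` computes Pearson correlation by default, which is sensitive to skew and outliers. A rank-based Spearman correlation is a common cross-check; it is an addition of this sketch, not part of the analysis above. The five rows are copied from the `df.head()` output, far too few to draw conclusions from, so this only shows the call.

```python
import pandas as pd

# Numeric columns of the five rows shown by df.head().
sample = pd.DataFrame({
    "User Rating": [4.7, 4.6, 4.7, 4.7, 4.8],
    "Reviews": [17350, 2052, 18979, 21424, 7665],
    "Price": [8, 22, 15, 6, 12],
})

# Pearson measures linear association; Spearman works on ranks and is
# therefore less affected by extreme values.
pearson = sample.corr(method="pearson")
spearman = sample.corr(method="spearman")
print(spearman)
```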

- Now that we have an overall picture of the data, we can go into more detail.

In [18]:
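# Name the counts before moving Year out of the index, so the result does
# not depend on what the pandas version calls the counts column.
genre_by_year = (
    df.groupby('Year')['Genre']
      .value_counts()
      .rename('Genre count')
      .reset_index(level=0)
)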
genre_by_year

Genre,Year,Genre count
Non Fiction,2009,26
Fiction,2009,24
Non Fiction,2010,30
Fiction,2010,20
Non Fiction,2011,29
Fiction,2011,21
Non Fiction,2012,29
Fiction,2012,21
Non Fiction,2013,26
Fiction,2013,24
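
- The same counts can also be laid out as one row per year and one column per genre with `pd.crosstab`, which some readers find easier to scan. This sketch rebuilds a stand-in frame (`books`) from the 2009-2013 counts shown above; later years are not shown in the output, so they are left out.

```python
import pandas as pd

# Counts copied from the table above (2009-2013 only).
counts = {
    (2009, "Non Fiction"): 26, (2009, "Fiction"): 24,
    (2010, "Non Fiction"): 30, (2010, "Fiction"): 20,
    (2011, "Non Fiction"): 29, (2011, "Fiction"): 21,
    (2012, "Non Fiction"): 29, (2012, "Fiction"): 21,
    (2013, "Non Fiction"): 26, (2013, "Fiction"): 24,
}

# Expand the counts back into one row per book, as in the original df.
books = pd.DataFrame(
    [{"Year": year, "Genre": genre}
     for (year, genre), n in counts.items()
     for _ in range(n)]
)

# One row per year, one column per genre.
genre_table = pd.crosstab(books["Year"], books["Genre"])
print(genre_table)
```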


#### **Book Genre in Each Year**

In [19]:
fig = px.line(genre_by_year, x='Year', y='Genre count', color= genre_by_year.index, title='Books By Genre in Each Year')
fig.show()

- The number of non-fiction books dropped sharply in 2014 and then rose sharply in 2015.
- The number of fiction books rose significantly in 2014 and then dropped sharply in 2015.
- The yearly counts of both genres fluctuate. Since each year has 50 books, the two lines mirror each other.
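
- Year-over-year changes like these can be computed with `diff()` rather than read off the chart. The sketch uses only the 2009-2013 counts printed earlier, so the 2014 and 2015 swings described above are not reproduced here.

```python
import pandas as pd

# Counts per genre for 2009-2013, copied from the genre_by_year output.
genre_table = pd.DataFrame(
    {"Fiction": [24, 20, 21, 21, 24],
     "Non Fiction": [26, 30, 29, 29, 26]},
    index=pd.Index([2009, 2010, 2011, 2012, 2013], name="Year"),
)

# diff() subtracts the previous year's count from each year's count;
# the first year has no predecessor and becomes NaN.
yearly_change = genre_table.diff()
print(yearly_change)
```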

#### **Price of Books in Each Year**

In [20]:
fig = px.scatter(df, x='Year', y='Price', title='Price of the Books in Each Year', hover_data = df[['Name','Author']])
fig.show()

- Prices stay in much the same range from year to year, with several outliers.
- The 2013 and 2014 lists include a $105 book from the American Psychiatric Association.

#### **Number of Reviews in Each Year**

In [21]:
fig = px.scatter(df, x='Year', y='Reviews', title='Number of Reviews in Each Year', color='Genre',hover_data = df[['Name','Author']])
fig.show()

- The distribution is much the same in each year, especially after 2010.
- Several outliers affect the distribution, as mentioned before.

#### **User Rating in Each Year**

In [22]:
fig = px.scatter(df, x='Year', y='User Rating', title='User Rating in Each Year',color='Genre', hover_data = df[['Name','Author']])
fig.show()

- User ratings have almost the same distribution in each year, with some outliers.
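
- A per-year summary table can back up what the scatter plot suggests. The rows below are ratings copied from tables elsewhere in this notebook for three years; they come from the top and bottom lists, so they are not representative and only demonstrate the `groupby` call.

```python
import pandas as pd

# (Year, User Rating) pairs copied from tables in this notebook.
# They come from the extremes, so the numbers are illustrative only.
rows = pd.DataFrame({
    "Year": [2013] * 5 + [2014] * 4 + [2019] * 6,
    "User Rating": [4.9, 3.8, 3.9, 3.9, 4.0,
                    4.9, 3.9, 4.0, 4.7,
                    4.9, 4.9, 4.9, 4.8, 4.8, 4.8],
})

# One row per year with count, median, minimum and maximum rating.
summary = rows.groupby("Year")["User Rating"].agg(
    ["count", "median", "min", "max"]
)
print(summary)
```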

#### **Top 20 Highly Rated Books**

In [23]:
top_20 = df.sort_values('User Rating', ascending=False)[:20]
top_20

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
431,The Magnolia Story,Chip Gaines,4.9,7861,5,2016,Non Fiction
87,Dog Man: Lord of the Fleas: From the Creator o...,Dav Pilkey,4.9,5470,6,2018,Fiction
85,Dog Man: Fetch-22: From the Creator of Captain...,Dav Pilkey,4.9,12619,8,2019,Fiction
84,Dog Man: Brawl of the Wild: From the Creator o...,Dav Pilkey,4.9,7235,4,2019,Fiction
83,Dog Man: Brawl of the Wild: From the Creator o...,Dav Pilkey,4.9,7235,4,2018,Fiction
82,Dog Man: A Tale of Two Kitties: From the Creat...,Dav Pilkey,4.9,4786,8,2017,Fiction
81,Dog Man and Cat Kid: From the Creator of Capta...,Dav Pilkey,4.9,5062,6,2018,Fiction
252,"Oh, the Places You'll Go!",Dr. Seuss,4.9,21834,8,2019,Fiction
476,The Very Hungry Caterpillar,Eric Carle,4.9,19546,5,2013,Fiction
477,The Very Hungry Caterpillar,Eric Carle,4.9,19546,5,2014,Fiction


In [24]:
fig = px.bar(top_20, x='Name', y= 'User Rating',  hover_data = top_20[['Year','Genre', 'Price']], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Several books, listed once for each year they were bestsellers, are among the highest rated.
- Only one non-fiction book makes the top 20 highest-rated list.
- The maximum price among the top 20 is $10.
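
- Every row shown in the table above has a rating of 4.9, so which tied books land in the top 20 depends on how the sort orders ties. One option is to break ties by number of reviews. This is a sketch on four rows from that table (the Dog Man title is shortened here), not the method used above.

```python
import pandas as pd

# Four rows from the top-rated table; all share the same rating.
top_rows = pd.DataFrame({
    "Name": ["The Magnolia Story", "Dog Man: Fetch-22",
             "Oh, the Places You'll Go!", "The Very Hungry Caterpillar"],
    "User Rating": [4.9, 4.9, 4.9, 4.9],
    "Reviews": [7861, 12619, 21834, 19546],
})

# Sort by rating first, then by review count to order the ties.
ranked = top_rows.sort_values(
    ["User Rating", "Reviews"], ascending=[False, False]
)
print(ranked)
```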

#### **Lowest Rated 20 Books**

In [25]:
bottom_20 = df.sort_values('User Rating')[:20]
bottom_20

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
353,The Casual Vacancy,J.K. Rowling,3.3,9372,12,2012,Fiction
132,Go Set a Watchman: A Novel,Harper Lee,3.6,14982,19,2015,Fiction
106,Fifty Shades of Grey: Book One of the Fifty Sh...,E L James,3.8,47265,14,2012,Fiction
107,Fifty Shades of Grey: Book One of the Fifty Sh...,E L James,3.8,47265,14,2013,Fiction
393,The Goldfinch: A Novel (Pulitzer Prize for Fic...,Donna Tartt,3.9,33844,20,2014,Fiction
22,Allegiant,Veronica Roth,3.9,6310,13,2013,Fiction
392,The Goldfinch: A Novel (Pulitzer Prize for Fic...,Donna Tartt,3.9,33844,20,2013,Fiction
364,The Elegance of the Hedgehog,Muriel Barbery,4.0,1859,11,2009,Fiction
137,Gone Girl,Gillian Flynn,4.0,57271,9,2014,Fiction
136,Gone Girl,Gillian Flynn,4.0,57271,10,2013,Fiction


In [26]:
fig = px.bar(bottom_20, x='Name', y= 'User Rating',  hover_data = bottom_20[['Year','Genre', 'Price']], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The lowest rating is 3.3.
- Only one non-fiction book is in the list.
- The maximum price in the list is $20.

#### **Top 20 Reviewed Books**

In [27]:
top_20_reviews = df.sort_values('Reviews', ascending=False)[:20]
top_20_reviews

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
534,Where the Crawdads Sing,Delia Owens,4.8,87841,15,2019,Fiction
382,The Girl on the Train,Paula Hawkins,4.1,79446,18,2015,Fiction
383,The Girl on the Train,Paula Hawkins,4.1,79446,7,2016,Fiction
32,Becoming,Michelle Obama,4.8,61133,11,2018,Non Fiction
33,Becoming,Michelle Obama,4.8,61133,11,2019,Non Fiction
137,Gone Girl,Gillian Flynn,4.0,57271,9,2014,Fiction
135,Gone Girl,Gillian Flynn,4.0,57271,10,2012,Fiction
136,Gone Girl,Gillian Flynn,4.0,57271,10,2013,Fiction
368,The Fault in Our Stars,John Green,4.7,50482,13,2014,Fiction
367,The Fault in Our Stars,John Green,4.7,50482,7,2014,Fiction


In [28]:
fig = px.bar(top_20_reviews, x='Name', y= 'Reviews',  hover_data = top_20_reviews[['Year','Genre', 'Price']], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Several books, listed once for each year or edition, got the most reviews.
- Only one non-fiction book makes the list.
- The most expensive book in the list is one of my favorites, 'The Alchemist', at $35.
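
- Because a title appears once for every year it was a bestseller, a top-N list is partly filled with repeats. If one entry per title is preferred, duplicates can be dropped first. The sketch below uses rows from the table above and keeps the earliest year per title, which is an arbitrary choice made for this example.

```python
import pandas as pd

# Rows copied from the most-reviewed table above.
rows = pd.DataFrame({
    "Name": ["Where the Crawdads Sing", "The Girl on the Train",
             "The Girl on the Train", "Becoming", "Becoming",
             "Gone Girl", "Gone Girl", "Gone Girl"],
    "Reviews": [87841, 79446, 79446, 61133, 61133, 57271, 57271, 57271],
    "Year": [2019, 2015, 2016, 2018, 2019, 2014, 2012, 2013],
})

# Keep one row per title (the earliest year), then rank by reviews.
unique_titles = rows.sort_values("Year").drop_duplicates(
    subset="Name", keep="first"
)
top_titles = unique_titles.nlargest(3, "Reviews")
print(top_titles)
```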

#### **Lowest Number Reviewed 20 Books**

In [29]:
bottom_20_reviews = df.sort_values('Reviews')[:20]
bottom_20_reviews

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
78,Divine Soul Mind Body Healing and Transmission...,Zhi Gang Sha,4.6,37,6,2009,Non Fiction
300,Soul Healing Miracles: Ancient and New Sacred ...,Zhi Gang Sha,4.6,220,17,2013,Non Fiction
121,George Washington's Sacred Fire,Peter A. Lillback,4.5,408,20,2010,Non Fiction
512,True Compass: A Memoir,Edward M. Kennedy,4.5,438,15,2009,Non Fiction
359,The Daily Show with Jon Stewart Presents Earth...,Jon Stewart,4.4,440,11,2010,Non Fiction
11,A Patriot's History of the United States: From...,Larry Schweikart,4.6,460,2,2010,Non Fiction
39,"Broke: The Plan to Restore Our Trust, Truth an...",Glenn Beck,4.5,471,8,2010,Non Fiction
27,"Autobiography of Mark Twain, Vol. 1",Mark Twain,4.2,491,14,2010,Non Fiction
264,Percy Jackson and the Olympians Paperback Boxe...,Rick Riordan,4.8,548,2,2010,Fiction
31,"Barefoot Contessa, How Easy Is That?: Fabulous...",Ina Garten,4.7,615,21,2010,Non Fiction


In [30]:
fig = px.bar(bottom_20_reviews, x='Name', y= 'Reviews',  hover_data = bottom_20_reviews[['Year','Genre', 'Price']], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Two different books by Zhi Gang Sha have the fewest reviews.
- Only 3 fiction books are in the list of least-reviewed books.
- The other books are non-fiction.
- We could form hypotheses about why, but we would need other data to support them.
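
- A first step toward such a hypothesis is to tabulate the least-reviewed rows by year and genre. The sketch below uses only the ten rows shown in the table above (not all twenty), so it hints at a pattern rather than establishing one.

```python
import pandas as pd

# Year and Genre of the ten least-reviewed rows shown above, in table order.
low = pd.DataFrame({
    "Year": [2009, 2013, 2010, 2009, 2010, 2010, 2010, 2010, 2010, 2010],
    "Genre": ["Non Fiction"] * 8 + ["Fiction", "Non Fiction"],
})

# How the least-reviewed rows are spread over years and genres.
by_year = low["Year"].value_counts()
by_genre = low["Genre"].value_counts()
print(by_year)
print(by_genre)
```

- In these ten rows, most of the least-reviewed books come from 2009 and 2010, so the year may matter as much as the genre.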

- OK, let's see the top 25 authors in our dataset.

#### **Top 25 Authors** 

In [31]:
top_25_authors = df['Author'].value_counts()[:25]
top_25_authors

Jeff Kinney                           12
Rick Riordan                          11
Suzanne Collins                       11
Gary Chapman                          11
American Psychological Association    10
Gallup                                 9
Dr. Seuss                              9
Rob Elliott                            8
Dav Pilkey                             7
Stephenie Meyer                        7
Stephen R. Covey                       7
Eric Carle                             7
Bill O'Reilly                          7
Harper Lee                             6
E L James                              6
The College Board                      6
Don Miguel Ruiz                        6
Stieg Larsson                          6
J.K. Rowling                           6
Sarah Young                            6
John Grisham                           5
R. J. Palacio                          5
Laura Hillenbrand                      5
John Green                             5
Dale Carnegie   

In [32]:
fig = px.bar(top_25_authors, x= top_25_authors.index, y=top_25_authors.values, title='Top 25 Authors',labels={'y':'Number of Books', 'index':'Author'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Let's see the user ratings and review counts of these top 25 authors.

In [33]:
top_25_authors_ratings = df[df['Author'].isin(top_25_authors.index)][['Author','User Rating', 'Reviews']]
top_25_authors_ratings_grouped=top_25_authors_ratings.groupby('Author')[['User Rating','Reviews']].mean().sort_values('Reviews', ascending=False)

In [34]:
fig = px.bar(top_25_authors_ratings_grouped, x= top_25_authors_ratings_grouped.index, y='User Rating', title='Top 25 Authors with User Rating')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- As expected, they have very high user ratings (from 4 to 4.9).

In [35]:
top_25_authors_reviews_grouped=top_25_authors_ratings.groupby('Author')[['User Rating', 'Reviews']].mean().sort_values('Reviews', ascending=False)
# Plot the frame computed on the line above, so it is not left unused.
fig = px.bar(top_25_authors_reviews_grouped, x= top_25_authors_reviews_grouped.index, y='Reviews', title='Top 25 Authors with Reviews')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
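
- The per-author figures can also be collected in a single table with named aggregation, which keeps the number of entries next to the averages. This is a sketch on six rows copied from the top-rated table; the column names `books`, `mean_rating` and `mean_reviews` are made up for this example.

```python
import pandas as pd

# Six rows copied from the top-rated table; stand-in for the full df.
rows = pd.DataFrame({
    "Author": ["Dav Pilkey", "Dav Pilkey", "Dav Pilkey",
               "Eric Carle", "Eric Carle", "Dr. Seuss"],
    "User Rating": [4.9, 4.9, 4.9, 4.9, 4.9, 4.9],
    "Reviews": [5470, 12619, 7235, 19546, 19546, 21834],
})

# Named aggregation: each keyword becomes an output column.
summary = (
    rows.groupby("Author")
        .agg(books=("Reviews", "size"),
             mean_rating=("User Rating", "mean"),
             mean_reviews=("Reviews", "mean"))
        .sort_values("mean_reviews", ascending=False)
)
print(summary)
```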

- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was a pleasure to share this detailed, beginner-friendly EDA with you. Thanks for your time.

- All the best.