# Background

This document describes our analysis of a dataset of more than 13,700 books, carried out to uncover insights that could be useful to our company.

In [4]:
import pandas as pd
books_ori = pd.read_csv("data_input/books_c.csv")
books = books_ori.copy()  # work on a copy so the raw data stays untouched
books.shape

(13714, 10)

Understanding our data is crucial to this analysis. We start by looking at the data type of each column in our dataset:

In [5]:
books.dtypes

bookID                  int64
title                  object
authors                object
average_rating        float64
isbn                   object
isbn13                  int64
language_code          object
# num_pages             int64
ratings_count           int64
text_reviews_count      int64
dtype: object

We convert `isbn13` to `object` so that both `isbn` and `isbn13` share the same type. This keeps the later exploratory data analysis consistent:

In [6]:
books['isbn13'] = books['isbn13'].astype('object')
books.dtypes

bookID                  int64
title                  object
authors                object
average_rating        float64
isbn                   object
isbn13                 object
language_code          object
# num_pages             int64
ratings_count           int64
text_reviews_count      int64
dtype: object
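As a small illustration of the conversion above, the sketch below builds a two-row toy frame and casts an integer column to `object`. The values are made up and do not come from `books_c.csv`. The extra `isbn13_str` column is a hypothetical alternative: casting to `str` may be preferable if the identifiers are later compared as text, but that is an assumption about later use, not something this analysis requires.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from books_c.csv
toy = pd.DataFrame({
    "isbn": ["0000000001", "0000000002"],
    "isbn13": [9780000000019, 9780000000026],
})

# Cast the integer column to object, mirroring the step above
toy["isbn13"] = toy["isbn13"].astype("object")

# Hypothetical alternative: store the identifiers as text
toy["isbn13_str"] = toy["isbn13"].astype(str)

print(toy.dtypes)
```

Note that an `object` column produced this way still holds Python integers, whereas the `str` cast yields actual strings.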

We begin the analysis proper by looking at the three most prolific authors in our company's data:

In [36]:
mylist = books['authors'].value_counts().head(3).index.to_list()
mylist

['Agatha Christie', 'Stephen King', 'Orson Scott Card']

Our scout Andy recommended that we sign J.K. Rowling, a promising author from Great Britain. We want a more data-driven argument for whether we should spend top money to sign her to our publishing label, so we compute the mean of her books' average ratings:

In [11]:
cond1 = books['authors'] == 'J.K. Rowling'
mean_ratings = books.loc[cond1, 'average_rating'].mean()
round(mean_ratings, 2)

4.52
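A single mean is easier to judge next to other authors' figures. The sketch below shows one way such a comparison could be set up with `groupby`, reporting the mean rating alongside the number of titles it is based on. The author names and ratings are hypothetical toy values, not results from our dataset.

```python
import pandas as pd

# Toy data for illustration only; names and ratings are hypothetical
toy = pd.DataFrame({
    "authors": ["Author A", "Author A", "Author B", "Author B", "Author B"],
    "average_rating": [4.5, 4.3, 3.9, 4.1, 4.0],
})

# Mean rating and number of titles per author
summary = (
    toy.groupby("authors")["average_rating"]
       .agg(mean_rating="mean", n_titles="count")
       .round(2)
)

print(summary)
```

Including the title count alongside the mean helps show when an average rests on only a handful of books.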

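To help the company acquire shelf-worthy titles, I've compiled a list of commercially successful books with great ratings. Here, commercial success means more than 1,000,000 ratings, and the list shows the 10 such books with the highest average rating: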

In [12]:
books.columns

Index(['bookID', 'title', 'authors', 'average_rating', 'isbn', 'isbn13',
       'language_code', '# num_pages', 'ratings_count', 'text_reviews_count'],
      dtype='object')

In [22]:
cond1 = books['ratings_count'] > 1000000
greatbooks = books.loc[cond1].sort_values('average_rating', ascending=False).head(10)
result = greatbooks.loc[:,['title', 'authors', 'average_rating', 'ratings_count']]
result

| | title | authors | average_rating | ratings_count |
|---|---|---|---|---|
| 0 | Harry Potter and the Half-Blood Prince (Harry ... | J.K. Rowling | 4.56 | 1944099 |
| 4 | Harry Potter and the Prisoner of Azkaban (Harr... | J.K. Rowling | 4.55 | 2149872 |
| 1 | Harry Potter and the Order of the Phoenix (Har... | J.K. Rowling | 4.49 | 1996446 |
| 2 | Harry Potter and the Sorcerer's Stone (Harry P... | J.K. Rowling | 4.47 | 5629932 |
| 4455 | A Game of Thrones (A Song of Ice and Fire #1) | George R.R. Martin | 4.45 | 1598396 |
| 5300 | Harry Potter and the Chamber of Secrets (Harry... | J.K. Rowling | 4.41 | 2115562 |
| 6363 | The Book Thief | Markus Zusak | 4.37 | 1410666 |
| 25 | The Fellowship of the Ring (The Lord of the Ri... | J.R.R. Tolkien | 4.35 | 2009749 |
| 9319 | Where the Sidewalk Ends | Shel Silverstein | 4.3 | 1094416 |
| 2000 | The Hobbit or There and Back Again | J.R.R. Tolkien | 4.26 | 2364968 |
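The filter-then-rank pattern used for this list can also be written with `nlargest`, which may read more directly than sorting and taking the head. The sketch below is a minimal version on hypothetical toy data. The constant `MIN_RATINGS` is a name introduced here for illustration and holds the same 1,000,000 cutoff.

```python
import pandas as pd

# Toy data for illustration only; titles and numbers are hypothetical
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [4.6, 4.8, 4.2, 4.4],
    "ratings_count": [1_500_000, 900, 2_000_000, 1_200_000],
})

MIN_RATINGS = 1_000_000  # hypothetical name for the popularity cutoff

# Keep widely rated books, then take the 2 best by average rating
popular = toy[toy["ratings_count"] > MIN_RATINGS]
top = popular.nlargest(2, "average_rating")[["title", "average_rating", "ratings_count"]]

print(top)
```

In this toy frame, "Book B" has the highest rating but is excluded because it has too few ratings, which is the reason for applying the popularity filter before ranking.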


In [38]:
books.language_code.value_counts()

eng      10594
en-US     1699
spa        419
en-GB      341
ger        238
fre        209
jpn         64
por         27
mul         21
ita         19
zho         16
grc         12
en-CA        9
nl           7
rus          7
swe          6
glg          4
tur          3
enm          3
cat          3
lat          3
ara          2
heb          1
nor          1
wel          1
msa          1
dan          1
gla          1
srp          1
ale          1
Name: language_code, dtype: int64

In [26]:
result.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>title</th>\n      <th>authors</th>\n      <th>average_rating</th>\n      <th>ratings_count</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Harry Potter and the Half-Blood Prince (Harry ...</td>\n      <td>J.K. Rowling</td>\n      <td>4.56</td>\n      <td>1944099</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Harry Potter and the Prisoner of Azkaban (Harr...</td>\n      <td>J.K. Rowling</td>\n      <td>4.55</td>\n      <td>2149872</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Harry Potter and the Order of the Phoenix (Har...</td>\n      <td>J.K. Rowling</td>\n      <td>4.49</td>\n      <td>1996446</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Harry Potter and the Sorcerer\'s Stone (Harry P...</td>\n      <td>J.K. Rowling</td>\n      <td>4.47</td>\n      <td>5629932</td>\n    </tr>\n    <tr>\n      <th>4455</th>\n      <