Hello class, for hands-on practice today we'll put ourselves in the shoes of a data analyst at a book publishing firm (think Penguin Random House or HarperCollins) who has been tasked with studying a dataset to identify critical trends in book successes and failures. Book publishing is an extremely competitive field, so the better our chances of publishing the real winners (and avoiding the failures), the more $$$ it means for the company.

Start a new notebook, and use `books_c.csv` which I've uploaded to the GitHub repo for your analysis. Here are the guiding questions.

1. Read in `books_c.csv`. How many rows and columns are there?
2. If `isbn` is a character string (`object`), we should expect `isbn13` to also be a string. Perform the type conversion using `.astype()`. Are there any other columns that require a type conversion?
> isbn13: A 13-digit ISBN to identify the book, instead of the older 10-digit ISBN.
3. Which 3 authors are the most represented in the dataframe? 
4. Find all books by author 'J.K. Rowling'. What is her average rating? Round to 2 decimal places.
5. Return all books with at least 1 million (1,000,000) ratings count. Order them by average rating in descending order (`.sort_values('___',ascending=False)`) and print the top 10 books

In [11]:
import pandas as pd
books = pd.read_csv("data_input/books_c.csv")
books.head()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,# num_pages,ratings_count,text_reviews_count
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling,4.56,0439785960,9780439785969,eng,652,1944099,26249
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.49,0439358078,9780439358071,eng,870,1996446,27613
2,3,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling,4.47,0439554934,9780439554930,eng,320,5629932,70390
3,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439554896,9780439554893,eng,352,6267,272
4,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling,4.55,043965548X,9780439655484,eng,435,2149872,33964


1. Read in `books_c.csv`. How many rows and columns are there?

In [6]:
books.shape

(13714, 10)

2. If `isbn` is a character string (`object`), we should expect `isbn13` to also be a string. Perform the type conversion using `.astype()`. Are there any other columns that require a type conversion?

In [8]:
books.dtypes

bookID                  int64
title                  object
authors                object
average_rating        float64
isbn                   object
isbn13                  int64
language_code          object
# num_pages             int64
ratings_count           int64
text_reviews_count      int64
dtype: object

In [12]:
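# isbn13 is an identifier, not a quantity, so store it as a string
# (.astype('object') alone would keep the values as Python ints)
books['isbn13'] = books['isbn13'].astype(str)
# language_code repeats a small set of values, so category is a better fit
books['language_code'] = books['language_code'].astype('category')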


In [13]:
books.dtypes

bookID                   int64
title                   object
authors                 object
average_rating         float64
isbn                    object
isbn13                  object
language_code         category
# num_pages              int64
ratings_count            int64
text_reviews_count       int64
dtype: object
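
A side note on why identifiers are better kept as strings. The sketch below is only an illustration: `csv_text` is a made-up two-row sample (reusing two ISBNs from the table above), not the real `books_c.csv`. In the real file `isbn` already came through as `object`, most likely because some values end in `X`, but a digits-only ISBN column would be read as integers and lose its leading zeros. Passing `dtype` to `read_csv` is one way to avoid that.

```python
import io
import pandas as pd

# Hypothetical two-row sample, NOT the real books_c.csv
csv_text = (
    "isbn,isbn13\n"
    "0439785960,9780439785969\n"
    "0439358078,9780439358071\n"
)

# Default parsing: both columns look numeric, so both become integers
inferred = pd.read_csv(io.StringIO(csv_text))

# Converting afterwards gives strings, but the leading zero is already gone
converted = inferred.copy()
converted["isbn"] = converted["isbn"].astype(str)

# Declaring the identifiers as strings at read time keeps them intact
declared = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"isbn": str, "isbn13": str},
)

print(inferred["isbn"].tolist())
print(converted["isbn"].tolist())
print(declared["isbn"].tolist())
```

The takeaway: `.astype(str)` is fine for `isbn13` here because a 13-digit ISBN never starts with a zero that could be dropped, but declaring the type while reading is the safer habit for identifier columns in general.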

3. Which 3 authors are the most represented in the dataframe?

In [16]:
books['authors'].value_counts().head(3)

Agatha Christie     69
Stephen King        66
Orson Scott Card    48
Name: authors, dtype: int64
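
Why does `.head(3)` give the right answer here? Because `value_counts()` returns the counts already sorted from most to least frequent. A minimal sketch on a made-up `authors` Series (the letters are placeholders, not real data) also shows two variations you might find handy: `nlargest(3)` and `normalize=True` for shares instead of counts.

```python
import pandas as pd

# Hypothetical author column: A appears 3 times, B twice, C once, D four times
authors = pd.Series(list("AAABBCDDDD"), name="authors")

# value_counts() sorts in descending order of frequency by default
counts = authors.value_counts()

# Two ways to keep the three most frequent authors
top3_head = counts.head(3)
top3_nlargest = counts.nlargest(3)

# normalize=True returns each author's share of all rows instead of a count
shares = authors.value_counts(normalize=True).head(3)

print(top3_head)
print(shares)
```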

4. Find all books by author 'J.K. Rowling'. What is her average rating? Round to 2 decimal places.

In [37]:
books[books['authors'] == 'J.K. Rowling']

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,# num_pages,ratings_count,text_reviews_count
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling,4.56,0439785960,9780439785969,eng,652,1944099,26249
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.49,0439358078,9780439358071,eng,870,1996446,27613
2,3,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling,4.47,0439554934,9780439554930,eng,320,5629932,70390
3,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439554896,9780439554893,eng,352,6267,272
4,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling,4.55,043965548X,9780439655484,eng,435,2149872,33964
5,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling,4.78,0439682584,9780439682589,eng,2690,38872,154
7,10,Harry Potter Collection (Harry Potter #1-6),J.K. Rowling,4.73,0439827604,9780439827607,eng,3342,27410,820
693,2002,Harry Potter Schoolbooks Box Set: Two Classic ...,J.K. Rowling,4.4,043932162X,9780439321624,eng,240,11459,143
695,2005,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling,4.56,0747584664,9780747584667,eng,768,1173,72
1123,3357,Harry Potter Y La Piedra Filosofal (Harry Pott...,J.K. Rowling,4.47,0613359607,9780613359603,spa,254,84,5


In [27]:
books[books['authors'] == 'J.K. Rowling'].describe().loc['mean', 'average_rating'].round(2)

4.52

In [39]:
cond1 = books['authors'] == 'J.K. Rowling'
mean_ratings = books.loc[cond1, 'average_rating'].mean()
round(mean_ratings, 2)

4.52
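
If you want the same number for every author at once, `groupby` is a natural extension of the boolean-mask approach. The sketch below uses a small hand-made `sample` frame (six rows with ratings borrowed from the tables in this notebook, not the full dataset), so the numbers it prints are for the sample only. Note that both approaches take a plain mean of book-level averages; they do not weight by `ratings_count`.

```python
import pandas as pd

# Hypothetical six-row sample, not the full books dataframe
sample = pd.DataFrame({
    "authors": ["J.K. Rowling"] * 4 + ["George R.R. Martin", "Markus Zusak"],
    "average_rating": [4.56, 4.49, 4.47, 4.55, 4.45, 4.37],
})

# Boolean mask, then mean of the selected ratings (same idea as above)
is_rowling = sample["authors"] == "J.K. Rowling"
rowling_mean = round(sample.loc[is_rowling, "average_rating"].mean(), 2)

# groupby computes the mean rating for every author in one pass
by_author = sample.groupby("authors")["average_rating"].mean().round(2)

print(rowling_mean)
print(by_author)
```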

5. Return all books with at least 1 million (1,000,000) ratings count. Order them by average rating in descending order (`.sort_values('___',ascending=False)`) and print the top 10 books

In [36]:
books[books['ratings_count'] >= 1000000].sort_values('average_rating', ascending = False).head(10)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,# num_pages,ratings_count,text_reviews_count
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling,4.56,0439785960,9780439785969,eng,652,1944099,26249
4,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling,4.55,043965548X,9780439655484,eng,435,2149872,33964
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.49,0439358078,9780439358071,eng,870,1996446,27613
2,3,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling,4.47,0439554934,9780439554930,eng,320,5629932,70390
4455,13496,A Game of Thrones (A Song of Ice and Fire #1),George R.R. Martin,4.45,0553588486,9780553588484,eng,848,1598396,37379
5300,15881,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439064864,9780439064866,eng,341,2115562,32694
6363,19063,The Book Thief,Markus Zusak,4.37,0375831002,9780375831003,eng,552,1410666,84237
25,34,The Fellowship of the Ring (The Lord of the Ri...,J.R.R. Tolkien,4.35,0618346252,9780618346257,eng,398,2009749,12784
9319,30119,Where the Sidewalk Ends,Shel Silverstein,4.3,0060513039,9780060513030,eng,176,1094416,9754
2000,5907,The Hobbit or There and Back Again,J.R.R. Tolkien,4.26,0618260307,9780618260300,eng,366,2364968,31664


In [42]:
cond1 = books['ratings_count'] >= 1000000
result = books.loc[cond1].sort_values('average_rating', ascending = False).head(10)
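
The filter-then-sort pattern can also be written with `nlargest`, which sorts and truncates in one call. Here is a small sketch on the first five rows of the table shown at the top of the notebook (titles shortened by hand, and `sample` is just an illustrative name), keeping the top 3 instead of the top 10.

```python
import pandas as pd

# First five rows from the table at the top, titles shortened by hand
sample = pd.DataFrame({
    "title": [
        "Half-Blood Prince",
        "Order of the Phoenix",
        "Sorcerer's Stone",
        "Chamber of Secrets",
        "Prisoner of Azkaban",
    ],
    "average_rating": [4.56, 4.49, 4.47, 4.41, 4.55],
    "ratings_count": [1944099, 1996446, 5629932, 6267, 2149872],
})

# Keep only books with at least 1 million ratings
popular = sample.loc[sample["ratings_count"] >= 1_000_000]

# Option 1: sort descending, then take the first rows
top_sorted = popular.sort_values("average_rating", ascending=False).head(3)

# Option 2: nlargest does the sort and the cut in one step
top_nlargest = popular.nlargest(3, "average_rating")

print(top_sorted["title"].tolist())
```

With no tied ratings the two options return the same rows in the same order; when ratings tie, the order among the tied rows may differ between them.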

In [44]:
result.to_csv('jkrowlings.csv')
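
One thing to be aware of when exporting: `to_csv` writes the row index as an unnamed first column by default, and reading that file back produces a column called `Unnamed: 0`. The sketch below shows the difference using an in-memory string instead of a file; `result_sample` is a made-up two-row stand-in for `result`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for `result`, keeping its non-contiguous index
result_sample = pd.DataFrame(
    {
        "title": ["Half-Blood Prince", "Prisoner of Azkaban"],
        "average_rating": [4.56, 4.55],
    },
    index=[0, 4],
)

# With no path argument, to_csv returns the CSV text as a string
with_index = result_sample.to_csv()
without_index = result_sample.to_csv(index=False)

# Reading the two versions back shows the extra column
reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(reread_with.columns.tolist())
print(reread_without.columns.tolist())
```

If the index carries no meaning for whoever receives the file, passing `index=False` keeps the export clean.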

In [45]:
result.to_json()

'{"bookID":{"0":1,"4":5,"1":2,"2":3,"4455":13496,"5300":15881,"6363":19063,"25":34,"9319":30119,"2000":5907},"title":{"0":"Harry Potter and the Half-Blood Prince (Harry Potter  #6)","4":"Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)","1":"Harry Potter and the Order of the Phoenix (Harry Potter  #5)","2":"Harry Potter and the Sorcerer\'s Stone (Harry Potter  #1)","4455":"A Game of Thrones (A Song of Ice and Fire  #1)","5300":"Harry Potter and the Chamber of Secrets (Harry Potter  #2)","6363":"The Book Thief","25":"The Fellowship of the Ring (The Lord of the Rings  #1)","9319":"Where the Sidewalk Ends","2000":"The Hobbit or There and Back Again"},"authors":{"0":"J.K. Rowling","4":"J.K. Rowling","1":"J.K. Rowling","2":"J.K. Rowling","4455":"George R.R. Martin","5300":"J.K. Rowling","6363":"Markus Zusak","25":"J.R.R. Tolkien","9319":"Shel Silverstein","2000":"J.R.R. Tolkien"},"average_rating":{"0":4.56,"4":4.55,"1":4.49,"2":4.47,"4455":4.45,"5300":4.41,"6363":4.37,"25":4.35,"