## The Goodreads Project
Written by: Niyousha Mohammadshafie <br>
Date:  July 2020

### Loading the Libraries

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Loading the Data

In [22]:
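# error_bad_lines was removed in pandas 2.0; on_bad_lines='warn' skips malformed rows and reports each one
df = pd.read_csv('../Goodreads/books.csv', on_bad_lines='warn')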

b'Skipping line 3350: expected 12 fields, saw 13\nSkipping line 4704: expected 12 fields, saw 13\nSkipping line 5879: expected 12 fields, saw 13\nSkipping line 8981: expected 12 fields, saw 13\n'
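
The skipped lines are rows that contain more fields than the header declares. In pandas 1.3 and later this behaviour is controlled by the `on_bad_lines` parameter of `read_csv`. The following is a minimal sketch on a toy CSV (the rows are made up for illustration and are not taken from `books.csv`):

```python
import io

import pandas as pd

# Toy CSV: the third data row contains an extra comma,
# so it parses as 4 fields instead of the expected 3.
raw = io.StringIO(
    "title,authors,num_pages\n"
    "Book A,Author One,100\n"
    "Book B,Author Two,200\n"
    "Book C,Author Three, Jr.,300\n"
    "Book D,Author Four,400\n"
)

# 'skip' drops malformed rows silently; 'warn' drops them and reports each one.
toy = pd.read_csv(raw, on_bad_lines="skip")
print(toy["title"].tolist())
```

Only the three well-formed rows remain in `toy`; the row with the extra field is dropped.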


In [23]:
df.head(5)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic


In [24]:
df.shape

(11123, 12)

#### Columns Description:
- **bookID**: Unique ID of the books <br>
- **title**: Titles of the books<br>
- **authors**: Authors of the books<br>
- **average_rating**: The average rating of the books, as decided by the users<br>
- **isbn**: International Standard Book Number (10-digit format), which identifies a specific edition of a book<br>
- **isbn13**: The 13-digit ISBN format, in use since 2007<br>
- **language_code**: Language of the books<br>
- **num_pages**: Number of pages for the books<br>
- **ratings_count**: Number of ratings given for the books<br>
- **text_reviews_count**: The count of reviews left by users
- **publication_date**: Date of the publication of the books
- **publisher**: Name of the publisher of the books

### Data Cleaning

In [25]:
books = df.copy() 

In [26]:
# drop the identifier columns, which are not needed for this analysis
books.drop(['bookID', 'isbn', 'isbn13'], axis=1, inplace=True)
books.head()

Unnamed: 0,title,authors,average_rating,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,eng,352,6333,244,11/1/2003,Scholastic
3,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,eng,2690,41428,164,9/13/2004,Scholastic


In [49]:
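# check each column for missing values
# (pd.isnull(col) would only test the column *name*, so test the column's data instead)
columns = list(books.columns)
for col in columns:
    print(f'{col} : {books[col].isnull().any()}')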

title : False
authors : False
average_rating : False
language_code : False
  num_pages : False
ratings_count : False
text_reviews_count : False
publication_date : False
publisher : False


The data has no missing values. 
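
As a minimal sketch of how such a check can report counts rather than a single flag, the toy frame below (hypothetical values, not from `books.csv`) has one deliberately missing rating:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; one rating is deliberately missing.
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "average_rating": [4.1, np.nan, 3.7],
})

# isnull() flags every missing cell; sum() counts the flags per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A column with no missing data shows a count of 0, so the same call on `books` would list nine zeros if the statement above holds.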

In [39]:
columns

['title',
 'authors',
 'average_rating',
 'language_code',
 '  num_pages',
 'ratings_count',
 'text_reviews_count',
 'publication_date',
 'publisher']
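
Note that the `num_pages` column name carries leading spaces (`'  num_pages'`), so `books['num_pages']` would raise a `KeyError`. One possible fix, shown here on a toy frame and not part of the original notebook, is to strip whitespace from all column names:

```python
import pandas as pd

# Toy frame reproducing the padded column name seen above.
toy = pd.DataFrame({"title": ["Book A"], "  num_pages": [100]})

# str.strip() removes leading and trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```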

In [45]:
for col in columns:
    print(f'{col} : {books[col].dtypes}')

title : object
authors : object
average_rating : float64
language_code : object
  num_pages : int64
ratings_count : int64
text_reviews_count : int64
publication_date : object
publisher : object


In [28]:
# keep only the first listed author (co-authors are separated by '/')
books['authors'] = books['authors'].apply(lambda x: x.split('/')[0])

The same book often appears in several rows, one per edition, so these duplicate rows need to be removed.

In [52]:
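# sort so that, within each title/author pair, the edition with the most ratings comes last
books = books.sort_values(by=['title', 'authors', 'ratings_count'])
books.head(10)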

Unnamed: 0,title,authors,average_rating,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
1847,said the shotgun to the head.,Saul Williams,4.22,en-US,192,2762,214,9/1/2003,MTV Books
4072,$30 Film School: How to Write Direct Produce...,Michael W. Dean,3.49,eng,528,30,4,5/13/2003,Cengage Learning
5300,'Salem's Lot,Stephen King,4.02,eng,817,18,3,8/1/1976,Signet
1577,'Salem's Lot,Stephen King,4.02,eng,586,25,6,10/6/2010,Hodder & Stoughton Ltd
5298,'Salem's Lot,Stephen King,4.02,en-US,0,56,5,1/19/2004,Simon & Schuster Audio
1576,'Salem's Lot,Stephen King,4.02,en-US,427,178,35,11/13/1979,Signet
9249,'Salem's Lot,Stephen King,4.02,eng,427,186,22,8/1/1976,Signet
1573,'Salem's Lot,Stephen King,4.02,eng,17,227,54,1/19/2004,Simon & Schuster Audio
1574,'Salem's Lot,Stephen King,4.02,eng,405,1039,130,10/17/1975,Doubleday
1572,'Salem's Lot,Stephen King,4.25,eng,594,84123,571,11/1/2005,Doubleday


In [54]:
# drop repeated title/author rows, keeping the last one (the edition with the most ratings)
books.drop_duplicates(subset=['title', 'authors'], keep='last', inplace=True)

In [56]:
books.shape

(10421, 9)
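
The sort-then-drop step works because `keep='last'` retains the final row of each duplicate group, which after sorting by `ratings_count` is the most-rated edition. A small sketch, using three of the 'Salem's Lot rating counts shown earlier plus one hypothetical row:

```python
import pandas as pd

# Three 'Salem's Lot editions (ratings counts from the table above)
# plus one hypothetical unrelated book.
toy = pd.DataFrame({
    "title": ["'Salem's Lot", "'Salem's Lot", "'Salem's Lot", "Book B"],
    "authors": ["Stephen King", "Stephen King", "Stephen King", "Author Two"],
    "ratings_count": [1039, 84123, 18, 30],
})

# Ascending sort puts the most-rated edition last within each title/author pair.
toy = toy.sort_values(by=["title", "authors", "ratings_count"])

# keep='last' retains that final row and drops the other editions.
deduped = toy.drop_duplicates(subset=["title", "authors"], keep="last")
print(deduped)
```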

In [66]:
# extract the year from the month/day/year string and store it as an integer
books['publication year'] = books['publication_date'].apply(lambda x: x.split('/')[2]).astype('int64')
books['publication year'].dtype

dtype('int64')
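
An alternative to splitting the string is to parse the dates with `pd.to_datetime`. The sketch below assumes the month/day/year layout seen in `publication_date`; the third value is deliberately invalid to show how `errors='coerce'` handles a date that does not exist:

```python
import pandas as pd

# Two dates from the table above and one deliberately invalid date
# (November has only 30 days).
dates = pd.Series(["9/16/2006", "9/1/2004", "11/31/2000"])

# errors='coerce' turns unparseable dates into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# .dt.year extracts the year; rows that failed to parse become NaN.
years = parsed.dt.year
print(years)
```

The trade-off is that string splitting returns a year even for an impossible date, while parsing surfaces such rows as missing values.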

### Exploratory Data Analysis

**Question 1**: What is the most popular book?

In [37]:
highest_rating = books['average_rating'].max()
print(highest_rating)

5.0


In [38]:
books.loc[books['average_rating'] == highest_rating ]

Unnamed: 0,title,authors,average_rating,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
624,Comoediae 1: Acharenses/Equites/Nubes/Vespae/P...,Aristophanes/F.W. Hall/W.M. Geldart,5.0,grc,364,0,0,2/22/1922,Oxford University Press USA
786,Willem de Kooning: Late Paintings,Julie Sylvester/David Sylvester,5.0,eng,83,1,0,9/1/2006,Schirmer Mosel
855,Literature Circle Guide: Bridge to Terabithia:...,Tara MacCarthy,5.0,eng,32,4,1,1/1/2002,Teaching Resources
1243,Middlesex Borough (Images of America: New Jersey),Middlesex Borough Heritage Committee,5.0,eng,128,2,0,3/17/2003,Arcadia Publishing
4125,Zone of the Enders: The 2nd Runner Official St...,Tim Bogenn,5.0,eng,128,2,0,3/6/2003,BradyGames
4788,The Diamond Color Meditation: Color Pathway to...,John Diamond,5.0,eng,74,5,3,2/1/2006,Square One Publishers
4933,Bulgakov's the Master and Margarita: The Text ...,Elena N. Mahlow,5.0,eng,202,4,0,1/1/1975,Vantage Press
5023,The Complete Theory Fun Factory: Music Theory ...,Ian Martin/Katie Elliott,5.0,eng,96,1,0,6/1/2004,Boosey & Hawkes Inc
5474,The Goon Show Volume 4: My Knees Have Fallen ...,NOT A BOOK,5.0,eng,2,3,0,4/1/1996,BBC Physical Audio
5476,The Goon Show Volume 11: He's Fallen in the W...,NOT A BOOK,5.0,eng,2,2,0,10/2/1995,BBC Physical Audio


The highest rating is 5.0, but the books with a perfect 5.0 rating have usually received very few ratings or reviews from readers. So, although they have a perfect rating, they are not necessarily popular.
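
Because an average over one or two ratings says little, one option is to rank by average rating only among books that pass a minimum number of ratings. The sketch below uses hypothetical titles and a hypothetical cutoff (`MIN_RATINGS`); neither comes from the original analysis:

```python
import pandas as pd

# Toy rows with hypothetical titles: one perfect score backed by a single
# rating, and two widely rated books.
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "average_rating": [5.0, 4.25, 4.49],
    "ratings_count": [1, 84123, 2153167],
})

MIN_RATINGS = 1000  # hypothetical cutoff; tune it to the dataset

# Keep only books with enough ratings for the average to be meaningful.
well_rated = toy[toy["ratings_count"] >= MIN_RATINGS]

# Among those, take the book with the highest average rating.
top = well_rated.sort_values("average_rating", ascending=False).iloc[0]
print(top["title"])
```

Sorting by `ratings_count` alone is another reasonable definition of popularity; the threshold approach keeps the rating in the picture while excluding barely rated books.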