# Checkpoint Two: Exploratory Data Analysis

Now that your chosen dataset is approved, it is time to start working on your analysis. Use this notebook to perform your EDA and make notes where directed to as you work.

## Getting Started

Since we have not provided your dataset for you, you will need to load the necessary files in this repository. Make sure to include a link back to the original dataset here as well.

My dataset: https://www.kaggle.com/datasets/ishikajohari/best-books-10k-multi-genre-data?resource=download

Your first task in EDA is to import necessary libraries and create a dataframe(s). Make note in the form of code comments of what your thought process is as you work on this setup task.

In [1]:
#importing all the libraries necessary for EDA and data visualization later down the road
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style

df = pd.read_csv("goodreads_data.csv")

In [2]:
df.head()

,Unnamed: 0,Book,Author,Description,Genres,Avg_Rating,Num_Ratings,URL
0,0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...",4.27,5691311,https://www.goodreads.com/book/show/2657.To_Ki...
1,1,Harry Potter and the Philosopher’s Stone (Harr...,J.K. Rowling,Harry Potter thinks he is an ordinary boy - un...,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',...",4.47,9278135,https://www.goodreads.com/book/show/72193.Harr...
2,2,Pride and Prejudice,Jane Austen,"Since its immediate success in 1813, Pride and...","['Classics', 'Fiction', 'Romance', 'Historical...",4.28,3944155,https://www.goodreads.com/book/show/1885.Pride...
3,3,The Diary of a Young Girl,Anne Frank,Discovered in the attic in which she spent the...,"['Classics', 'Nonfiction', 'History', 'Biograp...",4.18,3488438,https://www.goodreads.com/book/show/48855.The_...
4,4,Animal Farm,George Orwell,Librarian's note: There is an Alternate Cover ...,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',...",3.98,3575172,https://www.goodreads.com/book/show/170448.Ani...


## Get to Know the Numbers

Now that you have everything set up, put any code that you use to get to know the dataframe and its rows and columns better in the cell below. You can use whatever techniques you like, except for visualizations. You will put those in a separate section.

When working on your code, make sure to leave comments so that your mentors can understand your thought process.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   10000 non-null  int64  
 1   Book         10000 non-null  object 
 2   Author       10000 non-null  object 
 3   Description  9923 non-null   object 
 4   Genres       10000 non-null  object 
 5   Avg_Rating   10000 non-null  float64
 6   Num_Ratings  10000 non-null  object 
 7   URL          10000 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 625.1+ KB
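
The `df.info()` output lists 'Num_Ratings' as `object` rather than a numeric dtype, so it would need converting before any numeric comparison. A minimal sketch of one way to do that, assuming the values are strings with thousands separators such as "5,691,311" (an assumption about the raw CSV, not something the output above confirms); the `sample` frame is a hypothetical stand-in for `df`:

```python
import pandas as pd

# Stand-in frame; the comma-formatted strings are an assumption about
# how Num_Ratings is stored in goodreads_data.csv.
sample = pd.DataFrame({"Num_Ratings": ["5,691,311", "9,278,135", "23", "2"]})

# Strip the thousands separators, then convert to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
sample["Num_Ratings"] = pd.to_numeric(
    sample["Num_Ratings"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(sample["Num_Ratings"].dtype)
print(sample["Num_Ratings"].tolist())
```

If the real column holds something other than comma-formatted integers, counting the NaN values after conversion would show how many entries failed to parse.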


In [4]:
#print the columns
df.columns

Index(['Unnamed: 0', 'Book', 'Author', 'Description', 'Genres', 'Avg_Rating',
       'Num_Ratings', 'URL'],
      dtype='object')

In [5]:
#rename the 'Unnamed' column
df = df.rename(columns={'Unnamed: 0':'ID'})

In [6]:
df.columns

Index(['ID', 'Book', 'Author', 'Description', 'Genres', 'Avg_Rating',
       'Num_Ratings', 'URL'],
      dtype='object')

In [7]:
# check the size of df
df.shape

(10000, 8)

In [8]:
# check for duplicates: count the rows that are exact copies of an earlier row
num_duplicates = df.duplicated().sum()
print("Number of duplicated rows: ", num_duplicates)

Number of duplicated rows:  0


In [9]:
# what's the avg rating of the data set?
df["Avg_Rating"].describe()

count    10000.000000
mean         4.068577
std          0.335359
min          0.000000
25%          3.880000
50%          4.080000
75%          4.260000
max          5.000000
Name: Avg_Rating, dtype: float64

In [10]:
# which books share the highest rating?
top_books = df[df.Avg_Rating == df.Avg_Rating.max()]

In [11]:
# how many?
top_books.shape

(120, 8)

In [12]:
top_books.head()

,ID,Book,Author,Description,Genres,Avg_Rating,Num_Ratings,URL
3737,3737,Joey Wheeler: The Official Character & Monster...,"Arthur ""Sam"" Murakami","""Check out this official character and monster...",[],5.0,2,https://www.goodreads.com/book/show/2114514.Jo...
4603,4603,"This Land of Streams: Spiritual, Friendship, R...",Maria Johnsen,You will find within this book Maria Johnsen's...,[],5.0,11,https://www.goodreads.com/book/show/20773953-t...
4824,4824,"Ama Dios (4 AoL Consciousness Books Combined, ...",Nataša Pantović,"Ama Dios, 4 AoL Consciousness Books Combined.A...","['Philosophy', 'Spirituality', 'Adult']",5.0,6,https://www.goodreads.com/book/show/44015366-a...
5809,5809,The Secrets of Albion Falls (The Secrets Serie...,Sass Cadeaux,Living a sequestered life in magical village n...,[],5.0,23,https://www.goodreads.com/book/show/17310646-t...
5843,5843,Eclavarda Rising,Stephen Christiansen,,[],5.0,2,https://www.goodreads.com/book/show/23353186-e...
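
The rows above show that a perfect 5.0 average can rest on as few as 2 ratings, so ranking by 'Avg_Rating' alone mostly surfaces obscure titles. One possible approach is to require a minimum number of ratings before ranking. This is only a sketch: the `books` frame is a stand-in with partly made-up values, `MIN_RATINGS` is a hypothetical threshold, and it assumes 'Num_Ratings' has already been converted to a numeric type.

```python
import pandas as pd

# Stand-in rows; titles are placeholders and the values only illustrate
# the pattern of tiny rating counts next to very large ones.
books = pd.DataFrame(
    {
        "Book": ["A", "B", "C", "D"],
        "Avg_Rating": [5.0, 5.0, 4.47, 4.27],
        "Num_Ratings": [2, 23, 9278135, 5691311],
    }
)

MIN_RATINGS = 1000  # hypothetical cutoff; pick one that suits the analysis

# Keep only books with enough ratings, then rank by average rating.
well_rated = books[books["Num_Ratings"] >= MIN_RATINGS]
top_books = well_rated.sort_values("Avg_Rating", ascending=False)

print(top_books[["Book", "Avg_Rating", "Num_Ratings"]])
```

The cutoff is a judgment call; looking at `describe()` for the rating counts first may help in choosing it.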


## Visualize

Create any visualizations for your EDA here. Make note in the form of code comments of what your thought process is for your visualizations.

In [13]:
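# get the 10 most common values in the Genres column
# note: each value is a whole genre list stored as a string, so this counts
# list combinations rather than individual genres
genre_count = df['Genres'].value_counts().head(10)
genre_count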

[]                                                                                            960
['Fiction']                                                                                    49
['Fantasy']                                                                                    42
['Nonfiction']                                                                                 24
['Romance']                                                                                    20
['Poetry']                                                                                     18
['Horror']                                                                                     15
['Thriller', 'Fiction', 'Mystery', 'Crime', 'Action', 'Suspense', 'Mystery Thriller']          11
['Fiction', 'Young Adult', 'Childrens', 'Middle Grade', 'Fantasy', 'Mystery', 'Adventure']     11
['Self Help']                                                                                   9
Name: Genres, dtype:

The output above reveals a significant amount of missing data in the Genres column: 960 rows contain an empty list ('[]'). I will address this in the next checkpoint when I clean the data.
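
Because each 'Genres' value is a whole list stored as a string, `value_counts()` counts list combinations rather than individual genres. A minimal sketch of counting and plotting individual genres instead, assuming every value is a valid Python list literal as in the rows shown earlier; `genres_raw` is a small stand-in for `df['Genres']`:

```python
import ast

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for df['Genres']; the strings follow the format seen in df.head().
genres_raw = pd.Series(
    [
        "['Classics', 'Fiction', 'Historical Fiction']",
        "['Fantasy', 'Fiction', 'Young Adult']",
        "[]",
        "['Fiction']",
    ],
    name="Genres",
)

# literal_eval safely parses the string "['A', 'B']" into the list ['A', 'B'].
genre_lists = genres_raw.apply(ast.literal_eval)

# explode() gives each genre its own row. An empty list becomes NaN,
# which value_counts() leaves out by default.
genre_counts = genre_lists.explode().value_counts()
print(genre_counts)

# Horizontal bar chart of the most common individual genres.
fig, ax = plt.subplots()
genre_counts.head(10).sort_values().plot(kind="barh", ax=ax)
ax.set_xlabel("Number of books")
ax.set_title("Most common individual genres")
```

If any value is not a valid list literal, `ast.literal_eval` raises an error, so the real column may need a check or a try/except around the parsing step.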

## Summarize Your Results

With your EDA complete, answer the following questions.

1. Was there anything surprising about your dataset? 
The 'Genres' column has a lot of missing data: .value_counts() showed that 960 of the 10,000 rows contain an empty list ('[]').
2. Do you have any concerns about your dataset? 
The missing data concerns me; I hope there is not so much of it that it affects my analysis. I also want to make sure that the books I analyze have enough ratings to be meaningful (e.g., some books have only 2 ratings in total, which is not helpful for drawing larger insights).
3. Is there anything you want to make note of for the next phase of your analysis, which is cleaning data? 
I will need to clean the data so that a 'Genres' value of '[]' is treated as null rather than as a genre.
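
One way the cleaning step might treat '[]' as null is sketched below. It assumes the empty lists are stored as the exact string "[]", and `sample` is a hypothetical stand-in for the real dataframe:

```python
import pandas as pd

# Stand-in frame with two empty genre lists stored as the string "[]".
sample = pd.DataFrame({"Genres": ["['Fiction']", "[]", "['Poetry']", "[]"]})

# mask() replaces every value where the condition is True with NaN,
# so empty lists become proper missing values.
sample["Genres"] = sample["Genres"].mask(sample["Genres"] == "[]")

# Missing genres now show up in the usual null checks.
missing_genres = sample["Genres"].isna().sum()
print(missing_genres)
```

After this step, `df.info()` and `isna()` would report the empty genre lists alongside the other missing values, such as the null descriptions.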