In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('books.csv', encoding='latin1')
df.head()

Unnamed: 0,publisher,dagger,book_review_link,author,primary_isbn10,price,primary_isbn13,sunday_review_link,date,first_chapter_link,contributor,title,age_group,weeks_on_list
0,Riverhead,0,https://www.nytimes.com/2015/01/05/books/the-g...,Paula Hawkins,1594634025,0,9781590000000.0,https://www.nytimes.com/2015/02/01/books/revie...,2/19/17,,by Paula Hawkins,THE GIRL ON THE TRAIN,,102
1,Scribner,0,,Anthony Doerr,1501173219,0,9781500000000.0,https://www.nytimes.com/2014/05/11/books/revie...,5/7/17,,by Anthony Doerr,ALL THE LIGHT WE CANNOT SEE,,81
2,Vintage,0,,E L James,525431888,0,9780530000000.0,,3/5/17,,by E. L. James,FIFTY SHADES DARKER,,66
3,St. Martin's,0,,Kristin Hannah,1466850604,0,9781470000000.0,,10/29/17,,by Kristin Hannah,THE NIGHTINGALE,,63
4,Penguin Group,0,https://www.nytimes.com/2009/02/19/books/19mas...,Kathryn Stockett,1440697663,0,9781440000000.0,,4/8/12,,by Kathryn Stockett,THE HELP,,58


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2249 entries, 0 to 2248
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   publisher           2249 non-null   object 
 1   dagger              2249 non-null   int64  
 2   book_review_link    136 non-null    object 
 3   author              2249 non-null   object 
 4   primary_isbn10      2073 non-null   object 
 5   price               2249 non-null   int64  
 6   primary_isbn13      2248 non-null   object 
 7   sunday_review_link  130 non-null    object 
 8   date                2249 non-null   object 
 9   first_chapter_link  6 non-null      object 
 10  contributor         2249 non-null   object 
 11  title               2249 non-null   object 
 12  age_group           0 non-null      float64
 13  weeks_on_list       2249 non-null   int64  
dtypes: float64(1), int64(3), object(10)
memory usage: 246.1+ KB


# Missing Values

In [4]:
# Check for missing values in each column
df.isna().sum()

publisher                0
dagger                   0
book_review_link      2113
author                   0
primary_isbn10         176
price                    0
primary_isbn13           1
sunday_review_link    2119
date                     0
first_chapter_link    2243
contributor              0
title                    0
age_group             2249
weeks_on_list            0
dtype: int64

The columns 'book_review_link', 'sunday_review_link', and 'first_chapter_link' are almost entirely empty: roughly 94% to over 99% of their values are missing. All three columns are dropped because they carry too little information to be useful. I will use an API later to add more features to the dataset.
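The percentages quoted above can be computed directly instead of by hand. The following is a minimal sketch on a made-up toy frame, not the notebook's actual data; the 0.8 cutoff and the names `toy`, `threshold`, and `to_drop` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the books data (values are invented for illustration).
toy = pd.DataFrame({
    "title": [f"BOOK {i}" for i in range(10)],
    "book_review_link": ["http://example.com/a"] + [np.nan] * 9,
    "first_chapter_link": [np.nan] * 10,
})

# Fraction of missing values per column (0.0 = complete, 1.0 = all missing).
missing_frac = toy.isna().mean()

# Hypothetical cutoff: treat columns with at least 80% missing values as unusable.
threshold = 0.8
to_drop = missing_frac[missing_frac >= threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(missing_frac)
print(to_drop)
```

Using a fraction rather than a raw count makes the cutoff independent of the number of rows, which helps if the dataset grows later.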

In [5]:
df = df.drop(columns=['book_review_link', 'sunday_review_link', 'first_chapter_link'])
df.head()

Unnamed: 0,publisher,dagger,author,primary_isbn10,price,primary_isbn13,date,contributor,title,age_group,weeks_on_list
0,Riverhead,0,Paula Hawkins,1594634025,0,9781590000000.0,2/19/17,by Paula Hawkins,THE GIRL ON THE TRAIN,,102
1,Scribner,0,Anthony Doerr,1501173219,0,9781500000000.0,5/7/17,by Anthony Doerr,ALL THE LIGHT WE CANNOT SEE,,81
2,Vintage,0,E L James,525431888,0,9780530000000.0,3/5/17,by E. L. James,FIFTY SHADES DARKER,,66
3,St. Martin's,0,Kristin Hannah,1466850604,0,9781470000000.0,10/29/17,by Kristin Hannah,THE NIGHTINGALE,,63
4,Penguin Group,0,Kathryn Stockett,1440697663,0,9781440000000.0,4/8/12,by Kathryn Stockett,THE HELP,,58


In [6]:
df['dagger'].isna().sum()

0

In [7]:
df['dagger'].nunique()

1

In [8]:
df['age_group'].isna().sum()

2249

In [9]:
df['age_group'].nunique()

0

In [10]:
df['price'].isna().sum()

0

In [11]:
df['price'].nunique()

1

* The 'dagger' column has only one value, '0', so it carries no information.

* Every row in the 'age_group' column is null, so there is no information about the age group. It is better to drop the entire column.

* Similar to 'dagger', the 'price' column has only one value, '0', for every row, so there is no price information for any book.

==> Drop all three columns.
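The per-column checks above could also be done in one pass. This is a minimal sketch on a made-up toy frame; the names `toy`, `n_distinct`, and `uninformative` are assumptions for illustration, and the "at most one distinct value" rule is just one possible criterion.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same kinds of columns (values are invented).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "dagger": [0, 0, 0],
    "age_group": [np.nan, np.nan, np.nan],
    "price": [0, 0, 0],
    "weeks_on_list": [102, 81, 66],
})

# nunique() ignores NaN by default, so an all-null column has 0 distinct values
# and a constant column has exactly 1.
n_distinct = toy.nunique()

# Columns with at most one distinct value cannot distinguish between rows.
uninformative = n_distinct[n_distinct <= 1].index.tolist()

reduced = toy.drop(columns=uninformative)
print(uninformative)
```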

In [12]:
df = df.drop(columns=['dagger', 'age_group', 'price'])
df.head()

Unnamed: 0,publisher,author,primary_isbn10,primary_isbn13,date,contributor,title,weeks_on_list
0,Riverhead,Paula Hawkins,1594634025,9781590000000.0,2/19/17,by Paula Hawkins,THE GIRL ON THE TRAIN,102
1,Scribner,Anthony Doerr,1501173219,9781500000000.0,5/7/17,by Anthony Doerr,ALL THE LIGHT WE CANNOT SEE,81
2,Vintage,E L James,525431888,9780530000000.0,3/5/17,by E. L. James,FIFTY SHADES DARKER,66
3,St. Martin's,Kristin Hannah,1466850604,9781470000000.0,10/29/17,by Kristin Hannah,THE NIGHTINGALE,63
4,Penguin Group,Kathryn Stockett,1440697663,9781440000000.0,4/8/12,by Kathryn Stockett,THE HELP,58


In [13]:
df.isna().sum()

publisher           0
author              0
primary_isbn10    176
primary_isbn13      1
date                0
contributor         0
title               0
weeks_on_list       0
dtype: int64

An ISBN is a unique identifier for each book, so missing values cannot be filled with the mean or median of the other values in the same column. The missing entries in these two columns are left as they are for now.
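Because ISBNs are identifiers rather than quantities, it may help to read them as strings. The sample output shows a nine-digit value (525431888) in 'primary_isbn10', which could mean a leading zero was lost somewhere upstream; that is an assumption, not something verified here. The sketch below uses a small inline CSV invented for illustration and shows how `dtype=str` keeps leading zeros while leaving missing entries as null.

```python
import io

import pandas as pd

# Inline CSV invented for illustration; the last row has no ISBN.
csv_text = (
    "title,primary_isbn10\n"
    "FIFTY SHADES DARKER,0525431888\n"
    "THE HELP,1440697663\n"
    "NO ISBN,\n"
)

# Reading the column as str keeps leading zeros; empty fields stay null.
toy = pd.read_csv(io.StringIO(csv_text), dtype={"primary_isbn10": str})

first_isbn = toy.loc[0, "primary_isbn10"]
n_missing = int(toy["primary_isbn10"].isna().sum())
print(first_isbn, n_missing)
```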

# Reformat 'date' column
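
The 'date' column is stored as text such as '2/19/17'. One possible way to convert it is `pd.to_datetime` with an explicit format. This is a minimal sketch on a toy frame built from the dates shown in the sample rows; it assumes every value follows the month/day/two-digit-year pattern, which has not been checked for the full dataset.

```python
import pandas as pd

# Dates copied from the sample rows shown earlier in the notebook.
toy = pd.DataFrame({"date": ["2/19/17", "5/7/17", "10/29/17", "4/8/12"]})

# Assumed layout: month/day/two-digit year.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%y")

print(toy["date"])
```

Passing `errors="coerce"` would turn values that do not match the format into `NaT` instead of raising, which can be useful for spotting irregular entries.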

# Save cleaned dataset into a new csv file

In [14]:
# index=False avoids writing the row index as an extra unnamed column
# df.to_csv('cleaned_books.csv', index=False)