Importing the dataset

In [1]:
import pandas as pd

df = pd.read_json("../data/books.json")
df.head()

Unnamed: 0,title,price,rating,availability,description,category,product_page_url,image_url,upc,product_type,price_excl_tax,price_incl_tax,tax,num_reviews
0,Set Me Free,£17.46,Five,19,Aaron Ledbetter’s future had been planned out ...,Young Adult,http://books.toscrape.com/catalogue/set-me-fre...,http://books.toscrape.com/media/cache/b8/e9/b8...,ce6396b0f23f6ecc,Books,£17.46,£17.46,£0.00,0
1,Shakespeare's Sonnets,£20.66,Four,19,This book is an important and complete collect...,Poetry,http://books.toscrape.com/catalogue/shakespear...,http://books.toscrape.com/media/cache/4d/7a/4d...,30a7f60cd76ca58c,Books,£20.66,£20.66,£0.00,0
2,"Starving Hearts (Triangular Trade Trilogy, #1)",£13.99,Two,19,"Since her assault, Miss Annette Chetwynd has b...",Default,http://books.toscrape.com/catalogue/starving-h...,http://books.toscrape.com/media/cache/a0/7e/a0...,0312262ecafa5a40,Books,£13.99,£13.99,£0.00,0
3,The Black Maria,£52.15,One,19,"Praise for Aracelis Girmay:""[Girmay's] every l...",Poetry,http://books.toscrape.com/catalogue/the-black-...,http://books.toscrape.com/media/cache/d1/7a/d1...,1dfe412b8ac00530,Books,£52.15,£52.15,£0.00,0
4,The Boys in the Boat: Nine Americans and Their...,£22.60,Four,19,For readers of Laura Hillenbrand's Seabiscuit ...,Default,http://books.toscrape.com/catalogue/the-boys-i...,http://books.toscrape.com/media/cache/d1/2d/d1...,e10e1e165dc8be4a,Books,£22.60,£22.60,£0.00,0


For easier processing, strip the £ sign from the price columns and convert them to float. This applies to 'price', 'price_excl_tax', 'price_incl_tax' and 'tax'.

In [2]:
# Strip the pound sign from every price-like column and cast to float
price_cols = ['price', 'price_excl_tax', 'price_incl_tax', 'tax']
for col in price_cols:
    df[col] = df[col].str.replace('£', '', regex=False).astype(float)
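
If a price cell ever holds something other than a £-prefixed number, `astype(float)` raises an error. A more tolerant variant is sketched below on a small made-up frame: `pd.to_numeric` with `errors="coerce"` turns unparseable values into NaN so they can be inspected afterwards. The helper name `to_price` and the sample values are assumptions for illustration only and are not part of the scraped data.

```python
import pandas as pd

def to_price(series: pd.Series) -> pd.Series:
    """Strip the pound sign and convert to float; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace("£", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical sample: two valid prices and one malformed entry
sample = pd.DataFrame({"price": ["£17.46", "£20.66", "n/a"]})
sample["price"] = to_price(sample["price"])

# Number of values that could not be parsed
n_bad = int(sample["price"].isna().sum())
```

Coercing instead of raising trades an early failure for a NaN count that has to be checked explicitly, which may or may not be preferable depending on how much the source is trusted.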


Convert the ratings from words (One, Two, Three, Four, Five) to the integers 1 to 5.

In [3]:
rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
df['rating'] = df['rating'].map(rating_map)
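
Note that `Series.map` with a dict returns NaN for any label that is not a key of the dict, without raising. A quick check for such labels might look like the sketch below; the sample series, including the invalid label "Six", is made up for illustration.

```python
import pandas as pd

rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Hypothetical sample; "Six" is a deliberately invalid label
sample = pd.Series(["Five", "Two", "Six"])
mapped = sample.map(rating_map)

# Labels that were not found in rating_map
unmapped = sample[mapped.isna()].unique().tolist()
```

An empty `unmapped` list would indicate that every rating was converted.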


In case any categories or descriptions are missing, fill them with 'Unknown' and 'No description' respectively.

In [5]:
df['category'] = df['category'].fillna('Unknown')
df['description'] = df['description'].fillna('No description')
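
To see whether the fill actually changes anything, the missing values can be counted first. The sketch below does this on a small made-up frame and then fills both columns in one call by passing a dict to `DataFrame.fillna`; the sample rows are assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical sample with one missing value in each column
sample = pd.DataFrame({
    "category": ["Poetry", None],
    "description": [None, "A novel."],
})

# Count missing values per column before filling
missing_before = sample[["category", "description"]].isna().sum().to_dict()

# Fill each column with its own placeholder
sample = sample.fillna({"category": "Unknown", "description": "No description"})
```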


Since this was scraped from a website built for scraping practice, there isn't much to clean and most of the data is already in place. Below are the first rows of the cleaned dataset.

In [7]:
df.head(50)

Unnamed: 0,title,price,rating,availability,description,category,product_page_url,image_url,upc,product_type,price_excl_tax,price_incl_tax,tax,num_reviews
0,Set Me Free,17.46,5,19,Aaron Ledbetter’s future had been planned out ...,Young Adult,http://books.toscrape.com/catalogue/set-me-fre...,http://books.toscrape.com/media/cache/b8/e9/b8...,ce6396b0f23f6ecc,Books,17.46,17.46,0.0,0
1,Shakespeare's Sonnets,20.66,4,19,This book is an important and complete collect...,Poetry,http://books.toscrape.com/catalogue/shakespear...,http://books.toscrape.com/media/cache/4d/7a/4d...,30a7f60cd76ca58c,Books,20.66,20.66,0.0,0
2,"Starving Hearts (Triangular Trade Trilogy, #1)",13.99,2,19,"Since her assault, Miss Annette Chetwynd has b...",Default,http://books.toscrape.com/catalogue/starving-h...,http://books.toscrape.com/media/cache/a0/7e/a0...,0312262ecafa5a40,Books,13.99,13.99,0.0,0
3,The Black Maria,52.15,1,19,"Praise for Aracelis Girmay:""[Girmay's] every l...",Poetry,http://books.toscrape.com/catalogue/the-black-...,http://books.toscrape.com/media/cache/d1/7a/d1...,1dfe412b8ac00530,Books,52.15,52.15,0.0,0
4,The Boys in the Boat: Nine Americans and Their...,22.6,4,19,For readers of Laura Hillenbrand's Seabiscuit ...,Default,http://books.toscrape.com/catalogue/the-boys-i...,http://books.toscrape.com/media/cache/d1/2d/d1...,e10e1e165dc8be4a,Books,22.6,22.6,0.0,0
5,The Coming Woman: A Novel Based on the Life of...,17.93,3,19,"""If you have a heart, if you have a soul, Kare...",Default,http://books.toscrape.com/catalogue/the-coming...,http://books.toscrape.com/media/cache/97/36/97...,e72a5dfc7e9267b2,Books,17.93,17.93,0.0,0
6,The Dirty Little Secrets of Getting Your Dream...,33.34,4,19,Drawing on his extensive experience evaluating...,Business,http://books.toscrape.com/catalogue/the-dirty-...,http://books.toscrape.com/media/cache/e1/1b/e1...,2597b5a345f45e1b,Books,33.34,33.34,0.0,0
7,The Requiem Red,22.65,1,19,Patient Twenty-nine.A monster roams the halls ...,Young Adult,http://books.toscrape.com/catalogue/the-requie...,http://books.toscrape.com/media/cache/6b/07/6b...,f77dbf2323deb740,Books,22.65,22.65,0.0,0
8,Sapiens: A Brief History of Humankind,54.23,5,20,From a renowned historian comes a groundbreaki...,History,http://books.toscrape.com/catalogue/sapiens-a-...,http://books.toscrape.com/media/cache/ce/5f/ce...,4165285e1663650f,Books,54.23,54.23,0.0,0
9,Sharp Objects,47.82,4,20,"WICKED above her hipbone, GIRL across her hear...",Mystery,http://books.toscrape.com/catalogue/sharp-obje...,http://books.toscrape.com/media/cache/c0/59/c0...,e00eb4fd7b871a48,Books,47.82,47.82,0.0,0
