## Data Cleaning


Data cleaning is an important step before data is explored for analysis or visualization. Clean data yields more accurate and less biased analysis results.

In [2]:
#imports

import pandas as pd
import numpy as np
import os

#### Checking Order Details 

In [3]:
#get the current working directory
cwd = os.getcwd()

order_details = pd.read_csv(cwd+"/pizza_sales_data/order_details.csv")

In [4]:
#first few rows
order_details.head()

,order_details_id,order_id,pizza_id,quantity
0,1,1,hawaiian_m,1
1,2,2,classic_dlx_m,1
2,3,2,five_cheese_l,1
3,4,2,ital_supr_l,1
4,5,2,mexicana_m,1


In [5]:
#size of the data
order_details.shape

(48620, 4)

In [7]:
#let's check for any null or missing values
order_details.isnull().value_counts()

order_details_id  order_id  pizza_id  quantity
False             False     False     False       48620
dtype: int64
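
A per-column count of missing values is often easier to read than the combined `value_counts()` output. The following is a minimal sketch of that alternative, run on a small made-up frame (`sample` and its values are assumptions for illustration, not the real `order_details` file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for order_details; values are made up
# and two cells are deliberately left missing.
sample = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "pizza_id": ["hawaiian_m", None, "five_cheese_l", "mexicana_m"],
    "quantity": [1, 1, np.nan, 2],
})

# Number of missing values in each column.
null_counts = sample.isnull().sum()
print(null_counts)

# True if any cell anywhere in the frame is missing.
has_nulls = bool(sample.isnull().values.any())
print(has_nulls)
```

On a frame with no missing values, every entry of `null_counts` would be 0 and `has_nulls` would be `False`.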

In [34]:
#let's check total unique order_ids 
len(order_details.order_id.unique())

21350

There are 21350 unique orders, each containing one or more pizzas. The order ID is a unique identifier for each order placed by a table.
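
The claim that every order contains at least one pizza can be checked by summing `quantity` per `order_id`. This is a rough sketch on made-up rows (the `sample` frame and its values are assumptions for illustration, shaped like `order_details`):

```python
import pandas as pd

# Hypothetical toy rows in the same shape as order_details; values are made up.
sample = pd.DataFrame({
    "order_details_id": [1, 2, 3, 4, 5],
    "order_id": [1, 2, 2, 2, 3],
    "pizza_id": ["hawaiian_m", "classic_dlx_m", "five_cheese_l",
                 "ital_supr_l", "mexicana_m"],
    "quantity": [1, 1, 1, 2, 1],
})

# Number of distinct orders (same result as len(sample.order_id.unique())).
n_orders = sample["order_id"].nunique()

# Total pizzas per order: sum of quantity across the order's line items.
pizzas_per_order = sample.groupby("order_id")["quantity"].sum()

# Every order should contain at least one pizza.
all_non_empty = bool((pizzas_per_order >= 1).all())

print(n_orders, all_non_empty)
print(pizzas_per_order)
```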

#### Checking Orders dataset


In [25]:
orders = pd.read_csv(cwd + "/pizza_sales_data/orders.csv")

In [26]:
#check first few rows

orders.head()

,order_id,date,time
0,1,2015-01-01,11:38:36
1,2,2015-01-01,11:57:40
2,3,2015-01-01,12:12:28
3,4,2015-01-01,12:16:31
4,5,2015-01-01,12:21:30


In [27]:
#size of the dataset
orders.shape

(21350, 3)

In [35]:
#check for missing or null values
orders.isnull().value_counts()

order_id  date   time 
False     False  False    21350
dtype: int64
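
A null check does not confirm that the `date` and `time` strings are well formed. One possible follow-up, sketched here on made-up rows (the `sample` frame, the `order_datetime` column name, and the deliberately invalid date are assumptions for illustration), is to parse both columns into a single datetime and count the failures:

```python
import pandas as pd

# Hypothetical toy rows in the same shape as orders;
# the last date is deliberately invalid (month 13).
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "date": ["2015-01-01", "2015-01-01", "2015-13-01"],
    "time": ["11:38:36", "11:57:40", "12:12:28"],
})

# Combine the two text columns and parse them with an explicit format.
# errors="coerce" turns anything that does not match into NaT instead of raising.
sample["order_datetime"] = pd.to_datetime(
    sample["date"] + " " + sample["time"],
    format="%Y-%m-%d %H:%M:%S",
    errors="coerce",
)

# Count rows whose date/time could not be parsed.
n_unparseable = int(sample["order_datetime"].isna().sum())
print(n_unparseable)
```

A count of 0 on the real data would indicate that every row carries a valid timestamp.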

#### Checking Pizza_types

In [38]:
pizza_type = pd.read_csv(cwd + "/pizza_sales_data/pizza_types.csv",encoding = 'unicode_escape')

In [39]:
pizza_type.head()

,pizza_type_id,name,category,ingredients
0,bbq_ckn,The Barbecue Chicken Pizza,Chicken,"Barbecued Chicken, Red Peppers, Green Peppers,..."
1,cali_ckn,The California Chicken Pizza,Chicken,"Chicken, Artichoke, Spinach, Garlic, Jalapeno ..."
2,ckn_alfredo,The Chicken Alfredo Pizza,Chicken,"Chicken, Red Onions, Red Peppers, Mushrooms, A..."
3,ckn_pesto,The Chicken Pesto Pizza,Chicken,"Chicken, Tomatoes, Red Peppers, Spinach, Garli..."
4,southw_ckn,The Southwest Chicken Pizza,Chicken,"Chicken, Tomatoes, Red Peppers, Red Onions, Ja..."


In [40]:
pizza_type.shape

(32, 4)

In [41]:
#check for null or missing values

pizza_type.isnull().value_counts()

pizza_type_id  name   category  ingredients
False          False  False     False          32
dtype: int64

In [42]:
pizza_type.category.value_counts()

Supreme    9
Veggie     9
Classic    8
Chicken    6
Name: category, dtype: int64

#### Checking pizzas dataset


In [43]:
pizzas = pd.read_csv(cwd + "/pizza_sales_data/pizzas.csv")

In [44]:
pizzas.head()

,pizza_id,pizza_type_id,size,price
0,bbq_ckn_s,bbq_ckn,S,12.75
1,bbq_ckn_m,bbq_ckn,M,16.75
2,bbq_ckn_l,bbq_ckn,L,20.75
3,cali_ckn_s,cali_ckn,S,12.75
4,cali_ckn_m,cali_ckn,M,16.75


In [45]:
pizzas.shape

(96, 4)

In [46]:
#check for missing or null values
pizzas.isnull().value_counts()

pizza_id  pizza_type_id  size   price
False     False          False  False    96
dtype: int64

Overall, the data is clean and well structured: none of the four tables contains missing values, and no data discrepancies were found.
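
Null counts alone do not rule out every kind of discrepancy. The sketch below shows a few further checks that could back up this conclusion: duplicated rows, key uniqueness, and foreign keys with no match in the parent table. It runs on small made-up frames (all values are assumptions for illustration, and order 4 is deliberately left out of `orders`), not on the real files:

```python
import pandas as pd

# Hypothetical toy frames mirroring the column layout of the real tables.
orders = pd.DataFrame({"order_id": [1, 2, 3]})
pizzas = pd.DataFrame({"pizza_id": ["hawaiian_m", "mexicana_m"]})
order_details = pd.DataFrame({
    "order_details_id": [1, 2, 3],
    "order_id": [1, 2, 4],  # order 4 is deliberately missing from orders
    "pizza_id": ["hawaiian_m", "mexicana_m", "hawaiian_m"],
})

# Count rows that are exact duplicates of an earlier row.
n_duplicates = int(order_details.duplicated().sum())

# The line-item key should never repeat.
ids_unique = order_details["order_details_id"].is_unique

# Rows whose foreign key has no match in the parent table.
orphan_orders = order_details[~order_details["order_id"].isin(orders["order_id"])]
orphan_pizzas = order_details[~order_details["pizza_id"].isin(pizzas["pizza_id"])]

print(n_duplicates, ids_unique, len(orphan_orders), len(orphan_pizzas))
```

On consistent data, `n_duplicates` would be 0 and both orphan frames would be empty.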