## Pizza Sales Exploratory Data Analysis

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
# create path to csvs
path = 'pizza_store_tables'


In [3]:
csv_list = os.listdir(path)
csv_list

['.ipynb_checkpoints',
 'orders.csv',
 'order_details - Copy.csv',
 'order_details.csv',
 'pizzas.csv',
 'pizza_types - Copy.csv',
 'pizza_types.csv']

### Orders Table EDA
- Check data types
- Check for null values
- Check for duplicates in columns that should have unique values
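
The three checks above can be bundled into one reusable helper. This is only a minimal sketch: `basic_checks` and the toy frame are hypothetical names and made-up rows for illustration, not part of the original notebook or data.

```python
import pandas as pd

def basic_checks(df, key):
    """Summarise dtypes, null counts, and duplicated key values for one table."""
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_keys": int(df[key].duplicated().sum()),
    }

# Toy stand-in for the orders table (made-up rows, not the real data)
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 2],
    "date": ["2015-01-01", "2015-01-01", None],
})

report = basic_checks(toy_orders, key="order_id")
print(report)
```

On the real tables the same call would be something like `basic_checks(orders_df, key='order_id')`, which should report zero nulls and zero duplicate keys if the checks below hold.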

In [4]:
# read in orders table
order_path = os.path.join(path, 'orders.csv')
orders_df = pd.read_csv(order_path)
orders_df.head()

Unnamed: 0,order_id,date,time
0,1,2015-01-01,11:38:36
1,2,2015-01-01,11:57:40
2,3,2015-01-01,12:12:28
3,4,2015-01-01,12:16:31
4,5,2015-01-01,12:21:30


#### Check data types and for null values

In [5]:
orders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21350 entries, 0 to 21349
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   order_id  21350 non-null  int64 
 1   date      21350 non-null  object
 2   time      21350 non-null  object
dtypes: int64(1), object(2)
memory usage: 500.5+ KB


No nulls; order_id is int64 as expected.
date and time are stored as objects (strings) and may need to be converted to datetime for further analysis.

#### Check for duplicates in the order_id column

In [6]:
orders_df['order_id'].nunique()

21350

No duplicates

### Order Details table EDA
Perform the same checks as for the orders table.

In [7]:
# read in order_details table
order_details_path = os.path.join(path,'order_details.csv')
order_details_df = pd.read_csv(order_details_path)
order_details_df.head()

Unnamed: 0,order_details_id,order_id,pizza_id,quantity
0,1,1,hawaiian_m,1
1,2,2,classic_dlx_m,1
2,3,2,five_cheese_l,1
3,4,2,ital_supr_l,1
4,5,2,mexicana_m,1


In [8]:
order_details_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48620 entries, 0 to 48619
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   order_details_id  48620 non-null  int64 
 1   order_id          48620 non-null  int64 
 2   pizza_id          48620 non-null  object
 3   quantity          48620 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


No null values

In [9]:
order_details_df.nunique()

order_details_id    48620
order_id            21350
pizza_id               91
quantity                4
dtype: int64

#### Results:
- order_details_id: no duplicates; this is the primary key for this table
- order_id: 21350 unique values, matching the number of order_ids in the orders table; this is a foreign key
- pizza_id: 91 unique values, so the pizzas table should contain at least 91 pizza_ids
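
Matching unique counts are a good sign, but they do not by themselves show that every order_id in order_details exists in orders. A direct membership check is one way to confirm it. The sketch below uses small made-up frames (`toy_orders`, `toy_details`) as stand-ins for the real tables.

```python
import pandas as pd

# Toy stand-ins for the two tables (made-up rows, not the real data)
toy_orders = pd.DataFrame({"order_id": [1, 2, 3]})
toy_details = pd.DataFrame({
    "order_details_id": [1, 2, 3, 4],
    "order_id": [1, 2, 2, 4],
})

# Detail rows whose order_id has no matching row in orders (broken foreign key)
orphans = toy_details[~toy_details["order_id"].isin(toy_orders["order_id"])]

# Orders that have no detail rows at all
empty_orders = toy_orders[~toy_orders["order_id"].isin(toy_details["order_id"])]

print(orphans)
print(empty_orders)
```

With the real data, both frames would be expected to come back empty if the foreign key is intact.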

#### Check min and max order quantities to ensure no odd results such as negative units ordered

In [10]:
order_details_df['quantity'].max()


4

In [11]:
order_details_df['quantity'].min()

1

### Pizzas table EDA

In [12]:
# read in pizzas csv
pizzas_path = os.path.join(path,'pizzas.csv')
pizzas_df = pd.read_csv(pizzas_path)
pizzas_df.head()

Unnamed: 0,pizza_id,pizza_type_id,size,price
0,bbq_ckn_s,bbq_ckn,S,12.75
1,bbq_ckn_m,bbq_ckn,M,16.75
2,bbq_ckn_l,bbq_ckn,L,20.75
3,cali_ckn_s,cali_ckn,S,12.75
4,cali_ckn_m,cali_ckn,M,16.75


pizza_id and pizza_type_id are separate fields: pizza_id combines a pizza type with a size, so there should be more pizza_ids than pizza_type_ids.

In [13]:
pizzas_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   pizza_id       96 non-null     object 
 1   pizza_type_id  96 non-null     object 
 2   size           96 non-null     object 
 3   price          96 non-null     float64
dtypes: float64(1), object(3)
memory usage: 3.1+ KB


No null values, price is float64 as expected

In [14]:
pizzas_df.describe()

Unnamed: 0,price
count,96.0
mean,16.440625
std,4.090266
min,9.75
25%,12.5
50%,16.25
75%,20.25
max,35.95


In [15]:
pizzas_df['pizza_id'].nunique()

96

There are 96 unique pizza_ids, which is consistent with the 91 found in the order_details table. If all 91 ordered ids appear in this table, then 5 pizza_ids were never ordered during the year.
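
To see which pizza_ids were never ordered, rather than only how many, a set difference between the two pizza_id columns is one option. The frames below are toy stand-ins with made-up rows; on the real data the same expressions would use `pizzas_df` and `order_details_df`.

```python
import pandas as pd

# Toy stand-ins (made-up rows, not the real data)
toy_pizzas = pd.DataFrame({"pizza_id": ["bbq_ckn_s", "bbq_ckn_m", "bbq_ckn_l"]})
toy_details = pd.DataFrame({"pizza_id": ["bbq_ckn_m", "bbq_ckn_m", "bbq_ckn_l"]})

menu_ids = set(toy_pizzas["pizza_id"])
ordered_ids = set(toy_details["pizza_id"])

# On the menu but never ordered
never_ordered = sorted(menu_ids - ordered_ids)

# Ordered but missing from the menu (should be empty if the data is consistent)
unknown_ids = sorted(ordered_ids - menu_ids)

print(never_ordered, unknown_ids)
```

The second difference doubles as a validation step: any id in `unknown_ids` would be dropped silently by an inner merge against the pizzas table.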

In [16]:
pizzas_df['pizza_type_id'].nunique()

32

There are 32 unique pizza_type_ids, so the pizza_types table should contain at least 32.

### Pizza Types EDA

In [17]:
pizza_types_df = pd.read_csv(os.path.join(path,'pizza_types.csv'),encoding= 'unicode_escape')
pizza_types_df.head()

Unnamed: 0,pizza_type_id,name,category,ingredients
0,bbq_ckn,The Barbecue Chicken Pizza,Chicken,"Barbecued Chicken, Red Peppers, Green Peppers,..."
1,cali_ckn,The California Chicken Pizza,Chicken,"Chicken, Artichoke, Spinach, Garlic, Jalapeno ..."
2,ckn_alfredo,The Chicken Alfredo Pizza,Chicken,"Chicken, Red Onions, Red Peppers, Mushrooms, A..."
3,ckn_pesto,The Chicken Pesto Pizza,Chicken,"Chicken, Tomatoes, Red Peppers, Spinach, Garli..."
4,southw_ckn,The Southwest Chicken Pizza,Chicken,"Chicken, Tomatoes, Red Peppers, Red Onions, Ja..."


In [18]:
pizza_types_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   pizza_type_id  32 non-null     object
 1   name           32 non-null     object
 2   category       32 non-null     object
 3   ingredients    32 non-null     object
dtypes: object(4)
memory usage: 1.1+ KB


No null values

In [19]:
pizza_types_df.nunique()

pizza_type_id    32
name             32
category          4
ingredients      32
dtype: int64

There are 32 unique pizza_type_ids with no duplicates, which validates the 32 from the pizzas table. There are 32 unique sets of ingredients, which confirms there are no duplicated ingredient sets for the 32 pizza types.
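
Equal counts suggest the two tables agree, and an outer merge with `indicator=True` can confirm that the ids themselves match. This is a sketch on made-up toy frames, not the real tables.

```python
import pandas as pd

# Toy stand-ins (made-up rows, not the real data)
toy_pizzas = pd.DataFrame({
    "pizza_id": ["bbq_ckn_s", "bbq_ckn_m", "cali_ckn_s"],
    "pizza_type_id": ["bbq_ckn", "bbq_ckn", "cali_ckn"],
})
toy_types = pd.DataFrame({"pizza_type_id": ["bbq_ckn", "hawaiian"]})

# Outer merge on the distinct ids; _merge records which side each id came from
check = (
    toy_pizzas[["pizza_type_id"]]
    .drop_duplicates()
    .merge(toy_types[["pizza_type_id"]], on="pizza_type_id",
           how="outer", indicator=True)
)

# Anything not marked 'both' exists in only one of the two tables
mismatches = check[check["_merge"] != "both"]
print(mismatches)
```

On the real data, `mismatches` would be expected to be empty if every pizza_type_id in pizzas has a row in pizza_types and vice versa.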

### Analysis and Dashboard Data Validation

In [21]:
# combine date and time object fields and convert to date time
orders_df['order_date'] = orders_df['date'] + ' ' + orders_df['time']
orders_df['order_date'] = pd.to_datetime(orders_df['order_date'])
orders_df.head()

Unnamed: 0,order_id,date,time,order_date
0,1,2015-01-01,11:38:36,2015-01-01 11:38:36
1,2,2015-01-01,11:57:40,2015-01-01 11:57:40
2,3,2015-01-01,12:12:28,2015-01-01 12:12:28
3,4,2015-01-01,12:16:31,2015-01-01 12:16:31
4,5,2015-01-01,12:21:30,2015-01-01 12:21:30


In [22]:
print(orders_df['date'].min(),orders_df['date'].max())

2015-01-01 2015-12-31


In [23]:
print(orders_df['time'].min(),orders_df['time'].max())

09:52:21 23:05:52


The dashboard should not 

In [26]:
# get sales $ for each month
orders_df2 = pd.merge(left=orders_df,right=order_details_df,how='inner',on='order_id')
orders_df2.head()

Unnamed: 0,order_id,date,time,order_date,order_details_id,pizza_id,quantity
0,1,2015-01-01,11:38:36,2015-01-01 11:38:36,1,hawaiian_m,1
1,2,2015-01-01,11:57:40,2015-01-01 11:57:40,2,classic_dlx_m,1
2,2,2015-01-01,11:57:40,2015-01-01 11:57:40,3,five_cheese_l,1
3,2,2015-01-01,11:57:40,2015-01-01 11:57:40,4,ital_supr_l,1
4,2,2015-01-01,11:57:40,2015-01-01 11:57:40,5,mexicana_m,1


In [27]:
line_item_df = pd.merge(left=orders_df2,right=pizzas_df,how='inner',on='pizza_id')
line_item_df.head()

Unnamed: 0,order_id,date,time,order_date,order_details_id,pizza_id,quantity,pizza_type_id,size,price
0,1,2015-01-01,11:38:36,2015-01-01 11:38:36,1,hawaiian_m,1,hawaiian,M,13.25
1,77,2015-01-02,12:22:46,2015-01-02 12:22:46,179,hawaiian_m,1,hawaiian,M,13.25
2,146,2015-01-03,14:22:10,2015-01-03 14:22:10,357,hawaiian_m,1,hawaiian,M,13.25
3,163,2015-01-03,16:54:54,2015-01-03 16:54:54,389,hawaiian_m,1,hawaiian,M,13.25
4,247,2015-01-04,20:55:29,2015-01-04 20:55:29,568,hawaiian_m,1,hawaiian,M,13.25
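
Because this is an inner merge, line items with an unknown pizza_id would be dropped without warning, and duplicate pizza_ids in the pizzas table would multiply rows. One way to guard against both, sketched here on made-up toy frames, is the `validate` argument of `pd.merge` combined with a row-count comparison.

```python
import pandas as pd

# Toy stand-ins (made-up rows, not the real data)
toy_lines = pd.DataFrame({
    "order_details_id": [1, 2, 3],
    "pizza_id": ["hawaiian_m", "hawaiian_m", "bbq_ckn_s"],
    "quantity": [1, 2, 1],
})
toy_pizzas = pd.DataFrame({
    "pizza_id": ["hawaiian_m", "bbq_ckn_s"],
    "price": [13.25, 12.75],
})

# validate='many_to_one' raises MergeError if pizza_id is duplicated on the right
merged = pd.merge(toy_lines, toy_pizzas, how="inner", on="pizza_id",
                  validate="many_to_one")

# An inner merge that loses no rows should return as many rows as the left table
rows_preserved = len(merged) == len(toy_lines)
print(rows_preserved)
```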


In [28]:
line_item_df['line_cost']=line_item_df['quantity'] * line_item_df['price']
line_item_df.head()

Unnamed: 0,order_id,date,time,order_date,order_details_id,pizza_id,quantity,pizza_type_id,size,price,line_cost
0,1,2015-01-01,11:38:36,2015-01-01 11:38:36,1,hawaiian_m,1,hawaiian,M,13.25,13.25
1,77,2015-01-02,12:22:46,2015-01-02 12:22:46,179,hawaiian_m,1,hawaiian,M,13.25,13.25
2,146,2015-01-03,14:22:10,2015-01-03 14:22:10,357,hawaiian_m,1,hawaiian,M,13.25,13.25
3,163,2015-01-03,16:54:54,2015-01-03 16:54:54,389,hawaiian_m,1,hawaiian,M,13.25,13.25
4,247,2015-01-04,20:55:29,2015-01-04 20:55:29,568,hawaiian_m,1,hawaiian,M,13.25,13.25


In [29]:
order_df3 = line_item_df.groupby(['order_id']).agg({'order_date':'max','quantity':'sum','line_cost':'sum'}).reset_index()
order_df3.head()

Unnamed: 0,order_id,order_date,quantity,line_cost
0,1,2015-01-01 11:38:36,1,13.25
1,2,2015-01-01 11:57:40,5,92.0
2,3,2015-01-01 12:12:28,2,37.25
3,4,2015-01-01 12:16:31,1,16.5
4,5,2015-01-01 12:21:30,1,16.5


In [30]:
monthly_sales_df = order_df3.resample('M',on='order_date').agg({'order_id':'count','quantity':'sum','line_cost':'sum'})
monthly_sales_df

Unnamed: 0_level_0,order_id,quantity,line_cost
order_date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-01-31,1845,4232,69793.3
2015-02-28,1685,3961,65159.6
2015-03-31,1840,4261,70397.1
2015-04-30,1799,4151,68736.8
2015-05-31,1853,4328,71402.75
2015-06-30,1773,4107,68230.2
2015-07-31,1935,4392,72557.9
2015-08-31,1841,4168,68278.25
2015-09-30,1661,3890,64180.05
2015-10-31,1646,3883,64027.6
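
The displayed output lists January through October only. As a cross-check, the monthly totals can also be built with a `groupby` on the calendar month and reconciled against the overall total. The sketch below uses made-up toy rows; depending on the pandas version, the `'M'` alias passed to `resample` may raise a deprecation warning, which grouping by `dt.to_period('M')` avoids.

```python
import pandas as pd

# Toy stand-in for the per-order frame (made-up rows, not the real data)
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": pd.to_datetime([
        "2015-01-01 11:38:36",
        "2015-01-31 20:00:00",
        "2015-02-01 12:00:00",
    ]),
    "quantity": [1, 5, 2],
    "line_cost": [13.25, 92.0, 37.25],
})

# Group by calendar month instead of resampling
monthly = (
    toy_orders
    .groupby(toy_orders["order_date"].dt.to_period("M"))
    .agg(orders=("order_id", "count"),
         quantity=("quantity", "sum"),
         sales=("line_cost", "sum"))
)

# Reconcile: monthly sales should add up to the overall total
totals_match = abs(monthly["sales"].sum() - toy_orders["line_cost"].sum()) < 1e-9
print(monthly)
print(totals_match)
```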


In [34]:
december_line_item_df = line_item_df[line_item_df['date']>'2015-11-30']
december_line_item_df.head()

Unnamed: 0,order_id,date,time,order_date,order_details_id,pizza_id,quantity,pizza_type_id,size,price,line_cost
430,19680,2015-12-01,12:24:32,2015-12-01 12:24:32,44786,hawaiian_m,1,hawaiian,M,13.25,13.25
431,19690,2015-12-01,13:24:16,2015-12-01 13:24:16,44803,hawaiian_m,1,hawaiian,M,13.25,13.25
432,19704,2015-12-01,16:22:17,2015-12-01 16:22:17,44838,hawaiian_m,1,hawaiian,M,13.25,13.25
433,19712,2015-12-01,17:45:53,2015-12-01 17:45:53,44853,hawaiian_m,1,hawaiian,M,13.25,13.25
434,19722,2015-12-01,19:21:05,2015-12-01 19:21:05,44875,hawaiian_m,1,hawaiian,M,13.25,13.25


In [35]:
december_line_item_df['quantity'].sum()

3935

In [36]:
december_line_item_df['line_cost'].sum()

64701.149999999994

In [37]:
december_line_item_df.describe()

Unnamed: 0,order_id,order_details_id,quantity,price,line_cost
count,3859.0,3859.0,3859.0,3859.0,3859.0
mean,20511.754859,46691.0,1.019694,16.442083,16.7663
std,490.732223,1114.141673,0.140818,3.57985,4.360662
min,19671.0,44762.0,1.0,9.75,9.75
25%,20073.5,45726.5,1.0,12.5,12.75
50%,20505.0,46691.0,1.0,16.5,16.5
75%,20943.5,47655.5,1.0,20.25,20.5
max,21350.0,48620.0,3.0,35.95,51.0


In [38]:
print(december_line_item_df.time.max(),december_line_item_df.time.min())

23:03:23 11:02:20


In [33]:
december_line_item_df.resample('60min',on='order_date').agg({'order_id':'count','quantity':'sum','line_cost':'sum'})

Unnamed: 0_level_0,order_id,quantity,line_cost
order_date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-12-01 11:00:00,11,11,163.75
2015-12-01 12:00:00,23,24,396.60
2015-12-01 13:00:00,26,26,410.25
2015-12-01 14:00:00,6,6,103.50
2015-12-01 15:00:00,7,7,98.70
...,...,...,...
2015-12-31 19:00:00,22,23,396.15
2015-12-31 20:00:00,16,17,278.50
2015-12-31 21:00:00,7,7,113.20
2015-12-31 22:00:00,1,1,20.25


In [39]:
line_item_df[line_item_df['time']<'11:00:00']

Unnamed: 0,order_id,date,time,order_date,order_details_id,pizza_id,quantity,pizza_type_id,size,price,line_cost
1029,10862,2015-07-02,10:34:34,2015-07-02 10:34:34,24700,classic_dlx_m,1,classic_dlx,M,16.0,16.0
4639,7521,2015-05-07,10:54:15,2015-05-07 10:54:15,17097,thai_ckn_l,1,thai_ckn,L,20.75,20.75
6778,16439,2015-10-04,10:54:55,2015-10-04 10:54:55,37234,prsc_argla_l,1,prsc_argla,L,20.75,20.75
13677,10862,2015-07-02,10:34:34,2015-07-02 10:34:34,24702,southw_ckn_l,1,southw_ckn,L,20.75,20.75
16267,5247,2015-03-30,10:50:46,2015-03-30 10:50:46,11943,cali_ckn_m,1,cali_ckn,M,16.75,16.75
16469,9991,2015-06-17,10:52:26,2015-06-17 10:52:26,22718,cali_ckn_m,1,cali_ckn,M,16.75,16.75
16741,16439,2015-10-04,10:54:55,2015-10-04 10:54:55,37233,cali_ckn_m,1,cali_ckn,M,16.75,16.75
20938,3283,2015-02-25,10:54:03,2015-02-25 10:54:03,7448,four_cheese_l,2,four_cheese,L,17.95,35.9
22088,3283,2015-02-25,10:54:03,2015-02-25 10:54:03,7449,napolitana_s,1,napolitana,S,12.0,12.0
22241,9991,2015-06-17,10:52:26,2015-06-17 10:52:26,22719,napolitana_s,1,napolitana,S,12.0,12.0
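
The filter above returns line items, so an order with several pizzas appears more than once (order 10862, for example, appears twice). Collapsing to one row per order makes the early orders easier to count. This is a sketch on made-up toy rows; the string comparison works because the times are zero-padded `HH:MM:SS`.

```python
import pandas as pd

# Toy stand-in for the line-item frame (made-up subset of columns and rows)
toy_lines = pd.DataFrame({
    "order_id": [10862, 10862, 7521, 1],
    "time": ["10:34:34", "10:34:34", "10:54:15", "11:38:36"],
    "line_cost": [16.0, 20.75, 20.75, 13.25],
})

# Line items placed before 11:00
early = toy_lines[toy_lines["time"] < "11:00:00"]

# One row per order: order time, number of line items, and order total
early_orders = early.groupby("order_id").agg(
    time=("time", "first"),
    lines=("line_cost", "size"),
    total=("line_cost", "sum"),
)
print(early_orders)
```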
