In [6]:
%load_ext sql


The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [7]:
%sql mysql+pymysql://root:1234@localhost:3306/pizza_runner


# Case Study #2 - Pizza Runner 🍕
<img src="https://i.imgur.com/0I5LUMK.png" width="800" height="800" alt="Pizza Runner"/>


## Introduction
Did you know that over 115 million kilograms of pizza is consumed daily worldwide??? (Well according to Wikipedia anyway…)

Danny was scrolling through his Instagram feed when something really caught his eye - “80s Retro Styling and Pizza Is The Future!”

Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire - so he had one more genius idea to combine with it - he was going to Uberize it - and so Pizza Runner was launched!

Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.

Danny has shared with you 6 tables for this case study:

1. runners
1. runner_orders
1. pizza_names
1. customer_orders
1. pizza_recipes
1. pizza_toppings



# Entity Relationship Diagram
![](https://i.imgur.com/wlIToXm.png)

# Explore our database

In [8]:
%%sql
# lets explore our tables
SELECT *
FROM runners

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
4 rows affected.


runner_id,registration_date
1,2021-01-01
2,2021-01-03
3,2021-01-08
4,2021-01-15


In [9]:
%%sql

SELECT *
FROM runner_orders

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
10 rows affected.


order_id,runner_id,pickup_time,distance,duration,cancellation
1,1,2020-01-01 18:15:34,20km,32 minutes,
2,1,2020-01-01 19:10:54,20km,27 minutes,
3,1,2020-01-03 00:12:37,13.4km,20 mins,
4,2,2020-01-04 13:53:03,23.4,40,
5,3,2020-01-08 21:10:57,10,15,
6,3,,,,Restaurant Cancellation
7,2,2020-01-08 21:30:45,25km,25mins,
8,2,2020-01-10 00:15:02,23.4 km,15 minute,
9,2,,,,Customer Cancellation
10,1,2020-01-11 18:50:20,10km,10minutes,


In [10]:
%%sql
# lets explore our tables
SELECT *
FROM customer_orders

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
14 rows affected.


order_id,customer_id,pizza_id,exclusions,extras,order_time
1,101,1,,,2020-01-01 18:05:02
2,101,1,,,2020-01-01 19:00:52
3,102,1,,,2020-01-02 23:51:23
3,102,2,,,2020-01-02 23:51:23
4,103,1,4,,2020-01-04 13:23:46
4,103,1,4,,2020-01-04 13:23:46
4,103,2,4,,2020-01-04 13:23:46
5,104,1,,1,2020-01-08 21:00:29
6,101,2,,,2020-01-08 21:03:13
7,105,2,,1,2020-01-08 21:20:29


In [11]:
%%sql
# lets explore our tables
SELECT *
FROM pizza_toppings

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
12 rows affected.


topping_id,topping_name
1,Bacon
2,BBQ Sauce
3,Beef
4,Cheese
5,Chicken
6,Mushrooms
7,Onions
8,Pepperoni
9,Peppers
10,Salami


In [12]:
%%sql
# lets explore our tables
SELECT *
FROM pizza_recipes

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
2 rows affected.


pizza_id,toppings
1,"1, 2, 3, 4, 5, 6, 8, 10"
2,"4, 6, 7, 9, 11, 12"


In [13]:
%%sql
# lets explore our tables
SELECT *
FROM pizza_names

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
2 rows affected.


pizza_id,pizza_name
1,Meatlovers
2,Vegetarian


# Data Preprocessing 

## Handling Null Values For `customer_orders`:

- **Exclusions and extras**: These columns mix empty strings, the text 'null', and actual `NULL` values, so I chose a single consistent representation for "no exclusions" and "no extras".
- Using `NULL` whenever there are no exclusions or extras keeps later filters simple (`IS NULL` / `IS NOT NULL`). This is done by converting all empty strings and 'null' text values in the table to `NULL`.


- First, I created a temporary table from `customer_orders` so that changes can be made without affecting the original table.
- This allows safe experimentation: the source data stays untouched, and the temporary table exists only for the current database session.

In [14]:
%%sql
CREATE TEMPORARY TABLE temp_customer_orders AS
SELECT *
FROM customer_orders;

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
14 rows affected.


[]

In [15]:
%%sql
UPDATE temp_customer_orders
SET exclusions = NULLIF(NULLIF(exclusions, ''), 'null'),
    extras = NULLIF(NULLIF(extras, ''), 'null');


 * mysql+pymysql://root:***@localhost:3306/pizza_runner
14 rows affected.


[]
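
The normalization done by the `UPDATE` above can be sketched in Python. This is only an illustration: `normalize_blank`, `PLACEHOLDERS`, and the sample values are hypothetical names and data, and the sketch uses exact string matches rather than MySQL's collation-aware comparison.

```python
from typing import Optional

# Hypothetical set of placeholders that should be treated as "no value".
PLACEHOLDERS = {"", "null"}


def normalize_blank(value: Optional[str]) -> Optional[str]:
    """Return None for NULL-like placeholders, otherwise the value unchanged."""
    if value is None or value in PLACEHOLDERS:
        return None
    return value


# Sample values of the kind found in the exclusions and extras columns.
samples = ["", "null", None, "4", "1, 5"]
print([normalize_blank(v) for v in samples])
```

After this step, a single `IS NULL` style check is enough to tell whether a pizza was changed.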

In [16]:
%%sql
SELECT *
FROM temp_customer_orders;

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
14 rows affected.


order_id,customer_id,pizza_id,exclusions,extras,order_time
1,101,1,,,2020-01-01 18:05:02
2,101,1,,,2020-01-01 19:00:52
3,102,1,,,2020-01-02 23:51:23
3,102,2,,,2020-01-02 23:51:23
4,103,1,4,,2020-01-04 13:23:46
4,103,1,4,,2020-01-04 13:23:46
4,103,2,4,,2020-01-04 13:23:46
5,104,1,,1,2020-01-08 21:00:29
6,101,2,,,2020-01-08 21:03:13
7,105,2,,1,2020-01-08 21:20:29


## Handling `runner_orders`:

### Handling nulls of `runner_orders`:

- I created a temporary copy, `temp_runner_orders`, and replaced the string 'null' with actual `NULL` values in the `pickup_time`, `distance`, `duration`, and `cancellation` columns; empty strings in `cancellation` are converted as well. This standardizes how missing values are stored, so cancelled and delivered orders can be separated with simple `IS NULL` checks.


In [17]:
%%sql
CREATE TEMPORARY TABLE temp_runner_orders AS
SELECT *
FROM runner_orders;

 * mysql+pymysql://root:***@localhost:3306/pizza_runner


10 rows affected.


[]

In [18]:
%%sql
UPDATE temp_runner_orders
SET 
   pickup_time = NULLIF(pickup_time, 'null'),
   distance = NULLIF(distance, 'null'),
   duration = NULLIF(duration, 'null'),
   cancellation = NULLIF(NULLIF(cancellation, 'null'), '')
;

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
10 rows affected.


[]

In [19]:
%%sql
select *
from temp_runner_orders

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
10 rows affected.


order_id,runner_id,pickup_time,distance,duration,cancellation
1,1,2020-01-01 18:15:34,20km,32 minutes,
2,1,2020-01-01 19:10:54,20km,27 minutes,
3,1,2020-01-03 00:12:37,13.4km,20 mins,
4,2,2020-01-04 13:53:03,23.4,40,
5,3,2020-01-08 21:10:57,10,15,
6,3,,,,Restaurant Cancellation
7,2,2020-01-08 21:30:45,25km,25mins,
8,2,2020-01-10 00:15:02,23.4 km,15 minute,
9,2,,,,Customer Cancellation
10,1,2020-01-11 18:50:20,10km,10minutes,


### Conversion of distance and duration columns to numeric types 

- Removed all non-numeric characters (everything except digits and the decimal point) from the `distance` and `duration` fields in the `temp_runner_orders` table using `REGEXP_REPLACE`.
- Changed the data type of the `distance` column to `DECIMAL(10, 1)`.
- Changed the data type of the `duration` column to `DECIMAL(10)`.
- Both columns now store numeric values that can be used in calculations.
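
As a rough cross-check of the cleaning rule, the same "keep only digits and dots" pattern can be tried in Python on values in the formats seen in `runner_orders`. The helper name `to_number` is hypothetical, and Python's `re` module stands in for MySQL's `REGEXP_REPLACE`.

```python
import re
from decimal import Decimal
from typing import Optional


def to_number(raw: Optional[str]) -> Optional[Decimal]:
    """Strip everything except digits and dots, then parse as a decimal."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^\d.]", "", raw)
    return Decimal(cleaned) if cleaned else None


# Values in the formats that appear in the distance and duration columns.
for raw in ["20km", "23.4 km", "25mins", "15 minute", "10minutes", None]:
    print(repr(raw), "->", to_number(raw))
```

Cancelled orders have no distance or duration, so they stay missing rather than becoming zero.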

In [20]:
%%sql

# Update the distance to clean non-numeric characters
UPDATE temp_runner_orders
SET distance = REGEXP_REPLACE(distance, '[^\\d.]', '');

# Alter the column type to DECIMAL
ALTER TABLE temp_runner_orders
MODIFY COLUMN distance DECIMAL(10, 1);

# Update the duration to clean non-numeric characters
UPDATE temp_runner_orders
SET duration = REGEXP_REPLACE(duration, '[^\\d.]', '');

# Alter the column type to DECIMAL
ALTER TABLE temp_runner_orders
MODIFY COLUMN duration DECIMAL(10);

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
10 rows affected.
10 rows affected.
10 rows affected.
10 rows affected.


[]

In [21]:
%%sql
select *
from temp_runner_orders

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
10 rows affected.


order_id,runner_id,pickup_time,distance,duration,cancellation
1,1,2020-01-01 18:15:34,20.0,32.0,
2,1,2020-01-01 19:10:54,20.0,27.0,
3,1,2020-01-03 00:12:37,13.4,20.0,
4,2,2020-01-04 13:53:03,23.4,40.0,
5,3,2020-01-08 21:10:57,10.0,15.0,
6,3,,,,Restaurant Cancellation
7,2,2020-01-08 21:30:45,25.0,25.0,
8,2,2020-01-10 00:15:02,23.4,15.0,
9,2,,,,Customer Cancellation
10,1,2020-01-11 18:50:20,10.0,10.0,


# Case Study Questions

## A. Pizza Metrics

### 1. How many pizzas were ordered?

In [22]:
%%sql 

SELECT 
    COUNT(*) AS pizzas_ordered
FROM 
    temp_customer_orders
    

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
1 rows affected.


pizzas_ordered
14


### 2. How many unique customer orders were made?

In [23]:
%%sql 

SELECT 
    COUNT(DISTINCT order_id) AS unique_customer_orders
FROM 
    temp_customer_orders
    

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
1 rows affected.


unique_customer_orders
10


### 3. How many successful orders were delivered by each runner?

In [24]:
%%sql 

SELECT
    runner_id,
    COUNT(*) AS Delivered_orders
FROM 
    temp_runner_orders
WHERE 
    cancellation is NULL
GROUP BY
    runner_id

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
3 rows affected.


runner_id,Delivered_orders
1,4
2,3
3,1


### 4. How many of each type of pizza was delivered?

In [25]:
%%sql 

SELECT
    p.pizza_name,
    COUNT(*) AS deliver_count
FROM 
    temp_customer_orders AS c
INNER JOIN
    pizza_names AS p ON c.pizza_id = p.pizza_id
INNER JOIN
    temp_runner_orders AS r ON c.order_id = r.order_id
WHERE
    r.cancellation IS NULL
GROUP BY
    pizza_name


 * mysql+pymysql://root:***@localhost:3306/pizza_runner
2 rows affected.


pizza_name,deliver_count
Meatlovers,9
Vegetarian,3


### 5. How many Vegetarian and Meatlovers were ordered by each customer?

In [26]:
%%sql 

SELECT
    customer_id,
    p.pizza_name,
    COUNT(customer_id) AS pizzas_ordered 
FROM 
    temp_customer_orders AS c
INNER JOIN
    pizza_names AS p ON c.pizza_id = p.pizza_id
GROUP BY
    customer_id,
    pizza_name


 * mysql+pymysql://root:***@localhost:3306/pizza_runner
8 rows affected.


customer_id,pizza_name,pizzas_ordered
101,Meatlovers,2
102,Meatlovers,2
102,Vegetarian,1
103,Meatlovers,3
103,Vegetarian,1
104,Meatlovers,3
101,Vegetarian,1
105,Vegetarian,1


### 6. What was the maximum number of pizzas delivered in a single order?

In [27]:
%%sql 

SELECT
    COUNT(c.order_id) AS max_pizzas_ordered 
FROM 
    temp_customer_orders AS c
INNER JOIN
    temp_runner_orders AS r ON c.order_id = r.order_id
WHERE
    r.cancellation IS NULL
GROUP BY
    c.order_id
ORDER BY   
    max_pizzas_ordered DESC
LIMIT 1


 * mysql+pymysql://root:***@localhost:3306/pizza_runner
1 rows affected.


max_pizzas_ordered
3


### 7. For each customer, how many delivered pizzas had at least 1 change and how many had no changes?

In [28]:
%%sql 

SELECT
    c.customer_id,
    SUM(CASE WHEN (exclusions IS NULL AND extras IS NULL) THEN 1
    ELSE 0 END) AS no_change,
    SUM(CASE WHEN (exclusions IS NOT NULL OR extras IS NOT NULL) THEN 1
    ELSE 0 END) AS "change"
FROM 
    temp_customer_orders AS c
INNER JOIN 
    temp_runner_orders AS r ON c.order_id = r.order_id
WHERE
    r.cancellation IS NULL
GROUP BY
    customer_id



 * mysql+pymysql://root:***@localhost:3306/pizza_runner
5 rows affected.


customer_id,no_change,change
101,2,0
102,3,0
103,0,3
104,1,2
105,0,1


### 8. How many pizzas were delivered that had both exclusions and extras?


In [29]:
%%sql 

SELECT
    SUM(CASE WHEN (exclusions IS NOT NULL AND extras IS NOT NULL) THEN 1
    ELSE 0 END) AS "exclusions and extras"
FROM 
    temp_customer_orders AS c
INNER JOIN 
    temp_runner_orders AS r ON c.order_id = r.order_id
WHERE
    r.cancellation IS NULL




 * mysql+pymysql://root:***@localhost:3306/pizza_runner
1 rows affected.


exclusions and extras
1


### 9. What was the total volume of pizzas ordered for each hour of the day?


In [30]:
%%sql
SELECT
    HOUR(order_time) AS hour_of_day,
    COUNT(*) AS total_pizzas_ordered
FROM
    temp_customer_orders
GROUP BY
    hour_of_day


 * mysql+pymysql://root:***@localhost:3306/pizza_runner
6 rows affected.


hour_of_day,total_pizzas_ordered
18,3
19,1
23,3
13,3
21,3
11,1


### 10. What was the volume of orders for each day of the week?

In [31]:
%%sql
SELECT
    DAYOFWEEK(order_time) AS day_of_week,
    DAYNAME(order_time) AS day_name,

    COUNT(*) AS total_pizzas_ordered
FROM
    temp_customer_orders
GROUP BY
    day_of_week, day_name

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
4 rows affected.


day_of_week,day_name,total_pizzas_ordered
4,Wednesday,5
5,Thursday,3
7,Saturday,5
6,Friday,1


## B. Runner and Customer Experience

### 1. How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01)

In [70]:
%%sql
SELECT
    DATEDIFF(registration_date, '2021-01-01') DIV 7 + 1 AS week_period,
    COUNT(runner_id) AS num_runners_signed_up
FROM
    runners
GROUP BY
    week_period
ORDER BY
    week_period;



 * mysql+pymysql://root:***@localhost:3306/pizza_runner
3 rows affected.


week_period,num_runners_signed_up
1,2
2,1
3,1
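
Since the question defines week 1 as starting on 2021-01-01, the buckets can be cross-checked by counting whole 7-day periods from that date, independent of how a database numbers calendar weeks. The sketch below uses the registration dates from the `runners` table; `WEEK_START` and `week_period` are hypothetical names.

```python
from collections import Counter
from datetime import date

# Registration dates as returned by the runners table.
registrations = {
    1: date(2021, 1, 1),
    2: date(2021, 1, 3),
    3: date(2021, 1, 8),
    4: date(2021, 1, 15),
}

WEEK_START = date(2021, 1, 1)  # start of week 1, as stated in the question


def week_period(day: date) -> int:
    """1-based index of the 7-day period, counted from WEEK_START."""
    return (day - WEEK_START).days // 7 + 1


signups = Counter(week_period(d) for d in registrations.values())
print(sorted(signups.items()))
```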


### 2. What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pickup the order?

In [33]:
%%sql
SELECT
    r.runner_id,
    AVG(TIME_TO_SEC(TIMEDIFF(r.pickup_time, c.order_time)) / 60) AS avg_minutes_to_pickup 
FROM
    temp_runner_orders as r 
RIGHT JOIN
    temp_customer_orders as c USING(order_id)
WHERE 
    TIMEDIFF(r.pickup_time, c.order_time) IS NOT NULL
GROUP BY 
    r.runner_id;


 * mysql+pymysql://root:***@localhost:3306/pizza_runner
3 rows affected.


runner_id,avg_minutes_to_pickup
1,15.67776667
2,23.71998
3,10.4667
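
One thing to keep in mind with the query above: `temp_customer_orders` has one row per pizza, so after the join an order with several pizzas contributes its pickup time several times to `AVG`. The sketch below uses made-up numbers (all values and names are hypothetical) to show how a per-row average can differ from a per-order average.

```python
from statistics import mean

# Hypothetical joined rows: (order_id, minutes_to_pickup), one row per pizza.
joined_rows = [
    (1, 10.0),  # order 1: one pizza
    (2, 20.0),  # order 2: two pizzas, so its time appears twice
    (2, 20.0),
]

# Averaging the joined rows weights each order by its number of pizzas.
per_row_avg = mean(minutes for _, minutes in joined_rows)

# Collapsing to one value per order first gives every order equal weight.
per_order = dict(joined_rows)
per_order_avg = mean(per_order.values())

print(round(per_row_avg, 2), per_order_avg)
```

Which of the two is appropriate depends on whether the average should be taken over pizzas or over orders.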


### 3. Is there any relationship between the number of pizzas and how long the order takes to prepare?

In [100]:
%%sql
WITH CTE AS(
    SELECT
    order_id,
    COUNT(order_id) AS num_pizzas_in_order,
    MAX(TIMEDIFF(r.pickup_time, c.order_time)) AS minutes_to_pickup
FROM
    temp_runner_orders AS r 
RIGHT JOIN
    temp_customer_orders AS c
USING(order_id)
WHERE 
    TIMEDIFF(r.pickup_time, c.order_time) IS NOT NULL
GROUP BY 
    order_id
)

SELECT
    num_pizzas_in_order,
    AVG(TIME_TO_SEC(minutes_to_pickup)/60) AS avg_minutes_to_pickup
FROM
    CTE
GROUP BY(num_pizzas_in_order)

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
3 rows affected.


num_pizzas_in_order,avg_minutes_to_pickup
1,12.35666
2,18.375
3,29.2833


- As the number of pizzas in an order increases, the average time to prepare the order also increases.
- This suggests that larger orders need more preparation time because more pizzas have to be made, although the sample of orders is small.
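
To put the relationship in per-pizza terms, the averages from the query output above can be divided by the order size. This is a back-of-the-envelope sketch on a handful of orders, not a statistical test; the variable names are hypothetical.

```python
# Average minutes to pickup by order size, copied from the query output above.
avg_minutes = {1: 12.35666, 2: 18.375, 3: 29.2833}

# Dividing by the number of pizzas gives a rough per-pizza preparation time.
per_pizza = {n: round(total / n, 1) for n, total in avg_minutes.items()}
print(per_pizza)
```

On this reading, each pizza takes very roughly 9 to 12 minutes regardless of order size.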

### 4. What was the average distance travelled for each customer?

In [67]:
%%sql
SELECT
    customer_id,
    AVG(distance) AS distance_travelled
FROM
    temp_customer_orders AS c
left JOIN
    temp_runner_orders AS r
USING(order_id)
WHERE
    distance is not null
GROUP BY
    customer_id


 * mysql+pymysql://root:***@localhost:3306/pizza_runner
5 rows affected.


customer_id,distance_travelled
101,20.0
102,16.73333
103,23.4
104,10.0
105,25.0


### 5. What was the difference between the longest and shortest delivery times for all orders?

In [71]:
%%sql
SELECT
    MAX(duration) - MIN(duration) AS TIME_DIFF
FROM
    temp_runner_orders


 * mysql+pymysql://root:***@localhost:3306/pizza_runner
1 rows affected.


TIME_DIFF
30


### 6. What was the average speed for each runner for each delivery and do you notice any trend for these values?

In [98]:
%%sql
SELECT
    runner_id,
    order_id,
    ROUND(((distance / duration) * 60 ), 1) AS Speed
FROM
    temp_runner_orders
WHERE
    duration is not null
ORDER BY
    runner_id, order_id 

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
8 rows affected.


runner_id,order_id,Speed
1,1,37.5
1,2,44.4
1,3,40.2
1,10,60.0
2,4,35.1
2,7,60.0
2,8,93.6
3,5,40.0


- Runner 1's average speed varied between 37.5 km/h and 60 km/h.
- Runner 2 showed the widest range, from 35.1 km/h to 93.6 km/h. Notably, orders 4 and 8 covered the same distance of 23.4 km, yet order 4 was delivered at 35.1 km/h while order 8 was delivered at 93.6 km/h. A difference this large is worth investigating.
- Runner 3 completed only one delivery, at 40 km/h, so no trend can be inferred.

---
For runners 1 and 2, speed generally increased with later orders, which could suggest that they are becoming more familiar with the routes, possibly discovering more efficient roads or shortcuts.
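
The speed figures follow from distance (km) divided by duration (minutes), scaled to an hour. A small sketch that recomputes them from the cleaned `temp_runner_orders` values shown earlier; `speed_kmh` is a hypothetical helper name.

```python
# (runner_id, order_id, distance_km, duration_min) from temp_runner_orders.
deliveries = [
    (1, 1, 20.0, 32), (1, 2, 20.0, 27), (1, 3, 13.4, 20), (1, 10, 10.0, 10),
    (2, 4, 23.4, 40), (2, 7, 25.0, 25), (2, 8, 23.4, 15),
    (3, 5, 10.0, 15),
]


def speed_kmh(distance_km: float, duration_min: float) -> float:
    """Average speed in km/h: kilometres per minute times 60."""
    return round(distance_km / duration_min * 60, 1)


for runner_id, order_id, km, minutes in deliveries:
    print(runner_id, order_id, speed_kmh(km, minutes))
```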

### 7. What is the successful delivery percentage for each runner?

In [95]:
%%sql
SELECT
    runner_id,
    (COUNT(CASE WHEN  duration IS NOT NULL THEN 1 END) /
    COUNT(runner_id)) * 100 AS successful_delivery_percentage 
FROM
    temp_runner_orders
GROUP BY 
    runner_id

 * mysql+pymysql://root:***@localhost:3306/pizza_runner
3 rows affected.


runner_id,successful_delivery_percentage
1,100.0
2,75.0
3,50.0
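
The `COUNT(CASE WHEN ... END)` pattern counts only rows where the condition holds, because `COUNT` skips `NULL`. The same ratio can be sketched in Python from the delivery status of each order in `temp_runner_orders`; the variable names are hypothetical.

```python
from collections import defaultdict

# (runner_id, delivered) per order, from temp_runner_orders:
# an order counts as delivered when its duration is not NULL.
orders = [
    (1, True), (1, True), (1, True), (1, True),
    (2, True), (2, True), (2, True), (2, False),
    (3, True), (3, False),
]

totals = defaultdict(int)
delivered = defaultdict(int)
for runner_id, ok in orders:
    totals[runner_id] += 1
    delivered[runner_id] += ok  # True counts as 1, False as 0

percentages = {r: delivered[r] / totals[r] * 100 for r in totals}
print(percentages)
```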


## C. Ingredient Optimisation