# Introduction to Python Project: FoodHub Data Analysis

### Problem Statement

**Context**

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

**Objective**

The food aggregator company has stored the data of the different orders made by registered customers in its online portal. It wants to analyze the data to get a clear picture of the demand for different restaurants, which will help it enhance the customer experience. Suppose you are hired as a Data Scientist at this company and the Data Science team has shared some key questions that need to be answered. Perform the data analysis to answer these questions and help the company improve its business.


### Data Dictionary

The data includes various information related to a food order. A detailed data dictionary is provided below.

- order_id: Unique ID of the order
- customer_id: ID of the customer who ordered the food
- restaurant_name: Name of the restaurant
- cuisine_type: Cuisine ordered by the customer
- cost_of_the_order: Cost of the order
- day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
- rating: Rating given by the customer out of 5
- food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
- delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off confirmation.
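
The two duration columns are described as differences between timestamps. A minimal sketch of that calculation follows. It assumes hypothetical timestamp columns (`confirmed_at`, `picked_up_at`, `dropped_off_at`) and made-up values, because the dataset itself only contains the derived minutes.

```python
import pandas as pd

# Hypothetical raw timestamps; these columns are NOT part of foodhub_order.csv.
events = pd.DataFrame({
    "confirmed_at": pd.to_datetime(["2024-01-06 12:00", "2024-01-06 12:10"]),
    "picked_up_at": pd.to_datetime(["2024-01-06 12:25", "2024-01-06 12:33"]),
    "dropped_off_at": pd.to_datetime(["2024-01-06 12:45", "2024-01-06 13:01"]),
})

# Restaurant confirmation -> pick-up confirmation, in whole minutes.
events["food_preparation_time"] = (
    (events["picked_up_at"] - events["confirmed_at"]).dt.total_seconds() // 60
).astype(int)

# Pick-up confirmation -> drop-off confirmation, in whole minutes.
events["delivery_time"] = (
    (events["dropped_off_at"] - events["picked_up_at"]).dt.total_seconds() // 60
).astype(int)

print(events[["food_preparation_time", "delivery_time"]])
```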

### Let us start by importing the required libraries

In [1]:
# import libraries for data manipulation
import pandas as pd

# import libraries for data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# misc/other imports
import warnings

# import libraries for python utils
import tabulate

In [2]:
# GLOBAL OPTIONS ---

# To avoid clutter in the output, suppress warnings
warnings.filterwarnings('ignore')

# Set global figure size for all plots
plt.rcParams['figure.figsize'] = (15, 5)

plt.rcParams['font.size'] = 12

# Set global color cycle for lines (this will affect line color in most plots)
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#ce8d61'])

# Set the grid style globally (for both major and minor grids)
#plt.rcParams['axes.grid'] = True
plt.rcParams['grid.linestyle'] = '--'
plt.rcParams['grid.alpha'] = 0.6
# todo:
#plt.rcParams['grid.color'] = 'gray'
plt.rcParams['grid.color'] = '#E2E0E0FF'

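# Seaborn style ---

# Set global Seaborn style and context
# Note: set_theme() applies its own style, palette, and context, so it may
# override some of the rcParams set above (e.g. the color cycle and grid style)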
# todo: decide palette
# ref: https://r02b.github.io/seaborn_palettes/
sns.set_theme(style="whitegrid", palette="muted", context="notebook")

##### 🛠️ Python Utils

In [3]:
# UTILS (plot) ---

def set_small_fig():
    '''
    (6, 6) -> ar: 1
    Square figure,
    e.g. scatter plot
    '''
    plt.rcParams['figure.figsize'] = (6, 6)

def set_medium_fig():
    '''
    (10, 6) -> ar: 1.67
    Normal | Medium figure
    '''
    plt.rcParams['figure.figsize'] = (10, 6)

def set_large_fig():
    '''
    (12, 6) -> ar: 2
    Large figure
    '''
    plt.rcParams['figure.figsize'] = (12, 6)

def set_vlarge_fig():
    '''
    (15, 5) -> ar: 3
    Very large figure (e.g. time series covering a long span)
    '''
    plt.rcParams['figure.figsize'] = (15, 5)

In [21]:
# UTILS (PANDAS) ---

def tableit(series: pd.Series, index_name: str = None):
    """
    Display a pandas Series as a formatted table using the tabulate library.
    Args:
        series (pd.Series): The pandas Series to be displayed.
        index_name (str): The name of the index column. Defaults to `series.name`.
    Returns:
        None
    """
    i_name = index_name if index_name else series.name
    table = tabulate.tabulate(series.items(), headers=[i_name, "value"], tablefmt='grid')
    print(table)

# todo: decide default value for show_index
def tableit_df(df: pd.DataFrame, show_index: bool = True):
    """
    Display a pandas DataFrame as a formatted table using the tabulate library.
    Args:
        df (pd.DataFrame): The pandas DataFrame to be displayed.
        show_index (bool): Whether to display the DataFrame index. Default is True.
    Returns:
        None
    """
    # Convert DataFrame to a tabulated string
    table = tabulate.tabulate(df, headers='keys', tablefmt='pretty', showindex=show_index)
    print(table)

> ⚡ **_Note_**
> 
> Some cells in this notebook contain commented-out calls such as `print()` or `repr()`, replaced by `tableit()`/`tableit_df()` for more structured output. These comments are left in place intentionally: if the notebook is exported to another format and the tables do not render as expected, the cells can easily be reverted to the default display.

### Understanding the structure of the data

In [22]:
# todo: delete in future
# uncomment and run the following lines for Google Colab
# from google.colab import drive
# drive.mount('/content/drive')

In [23]:
# Load the dataset
df = pd.read_csv('foodhub_order.csv')
# Keep an untouched copy of the original DataFrame
original_df = df.copy()

In [24]:
# View the first 5 rows
df.head(5)

| | order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | Not given | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | Not given | 25 | 23 |
| 2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.23 | Weekday | 5 | 23 | 28 |
| 3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.2 | Weekend | 3 | 25 | 15 |
| 4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.59 | Weekday | 4 | 25 | 24 |


🔍 **_Observation_**:

1. The dataset contains information about food orders from various restaurants.
2. There are 9 columns in the dataset: order_id, customer_id, restaurant_name, cuisine_type, cost_of_the_order, day_of_the_week, rating, food_preparation_time, and delivery_time.
3. The 'rating' column contains both numeric and non-numeric values (e.g., 'Not given').
4. The 'day_of_the_week' column indicates whether the order was placed on a weekday or weekend.
5. The 'food_preparation_time' and 'delivery_time' columns are measured in minutes.

### **Question 1:** How many rows and columns are present in the data? [0.5 mark]

In [25]:
rows, cols = df.shape
rows, cols

(1898, 9)

🔍 **_Observation_**:

1. The dataset contains 1898 rows and 9 columns.
2. Rows correspond to orders and columns correspond to features of each order.

### **Question 2:** What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]

In [26]:
tableit(df.dtypes, index_name='column')

+-----------------------+---------+
| column                | value   |
+=======================+=========+
| order_id              | int64   |
+-----------------------+---------+
| customer_id           | int64   |
+-----------------------+---------+
| restaurant_name       | object  |
+-----------------------+---------+
| cuisine_type          | object  |
+-----------------------+---------+
| cost_of_the_order     | float64 |
+-----------------------+---------+
| day_of_the_week       | object  |
+-----------------------+---------+
| rating                | object  |
+-----------------------+---------+
| food_preparation_time | int64   |
+-----------------------+---------+
| delivery_time         | int64   |
+-----------------------+---------+


🔍 **_Observation_**:

1. The 'order_id', 'customer_id', 'food_preparation_time', and 'delivery_time' columns hold integer values (i.e., numeric).
2. The 'cost_of_the_order' column holds float values (i.e., numeric).
3. The 'restaurant_name', 'cuisine_type', 'day_of_the_week', and 'rating' columns are of type object, indicating they contain string values.



> 🔧 **_Actionable Insights_** 
> 
> Although the 'rating' column is currently of type `object`, it should ideally be numeric (for example, an integer type), as ratings are *ordinal* in nature. The non-numeric 'Not given' entries have to be handled before the conversion. This will be addressed later.

In [31]:
# The question suggests `info()`, so it is used here as well
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB


### **Question 3:** Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]

In [32]:
has_missing_values = df.isnull().any().any()
has_missing_values

False

🔍 **_Observation_**:

There are *no* explicit missing values in the dataset. However, the **rating** column contains entries labeled as **"Not given"** which indicate the *absence* of a customer rating. These entries should be treated as missing values during analysis to ensure accurate insights.
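
One possible way to treat the "Not given" entries as missing is sketched below on a toy Series rather than on the real column. Using `pd.to_numeric(errors="coerce")` and the nullable `Int64` dtype is an assumption about how the treatment could be done, not the notebook's chosen method. Whether to leave these entries missing or impute them remains a judgment call.

```python
import pandas as pd

# Toy ratings that mimic the column's mix of digit strings and "Not given".
rating = pd.Series(["5", "Not given", "3", "4", "Not given"], name="rating")

# Coerce non-numeric labels to NaN, then switch to the nullable integer dtype
# so missing entries are kept as <NA> instead of forcing a float column.
rating_num = pd.to_numeric(rating, errors="coerce").astype("Int64")

n_missing = int(rating_num.isna().sum())   # number of unrated orders
mean_rating = float(rating_num.mean())     # mean over rated orders only

print(rating_num)
print(n_missing, mean_rating)
```

With the column in this form, `isna()` reports the unrated orders directly and aggregations such as `mean()` skip them by default.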

### **Question 4:** Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]

In [33]:
stat_summary = df.describe(include='all').T
stat_summary

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| order_id | 1898.0 | | | | 1477495.5 | 548.049724 | 1476547.0 | 1477021.25 | 1477495.5 | 1477969.75 | 1478444.0 |
| customer_id | 1898.0 | | | | 171168.478398 | 113698.139743 | 1311.0 | 77787.75 | 128600.0 | 270525.0 | 405334.0 |
| restaurant_name | 1898.0 | 178.0 | Shake Shack | 219.0 | | | | | | | |
| cuisine_type | 1898.0 | 14.0 | American | 584.0 | | | | | | | |
| cost_of_the_order | 1898.0 | | | | 16.498851 | 7.483812 | 4.47 | 12.08 | 14.14 | 22.2975 | 35.41 |
| day_of_the_week | 1898.0 | 2.0 | Weekend | 1351.0 | | | | | | | |
| rating | 1898.0 | 4.0 | Not given | 736.0 | | | | | | | |
| food_preparation_time | 1898.0 | | | | 27.37197 | 4.632481 | 20.0 | 23.0 | 27.0 | 31.0 | 35.0 |
| delivery_time | 1898.0 | | | | 24.161749 | 4.972637 | 15.0 | 20.0 | 25.0 | 28.0 | 33.0 |


🔍 **_Observation_**:

1. The dataset contains 1898 entries for each column.
2. The average cost of the orders is approximately 16.50 dollars.
3. Order costs range from 4.47 to 35.41 dollars.
4. The average food preparation time is approximately 27 minutes.
5. Food preparation times range from a minimum of 20 minutes to a maximum of 35 minutes.
6. The average delivery time is around 24 minutes, and 75% of orders are delivered within 28 minutes.
7. The restaurant with the most orders is "Shake Shack", with 219 orders.
8. The most common cuisine type is "American", with 584 orders.
9. The majority of orders (1351 of 1898) are placed on weekends.
10. The 'rating' column has 736 entries labeled as "Not given", indicating missing ratings.

In [35]:
# Q
# What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed ?

# Get the statistical summary for the 'food_preparation_time' column
food_preparation_summary = df['food_preparation_time'].describe()
food_preparation_summary

count    1898.000000
mean       27.371970
std         4.632481
min        20.000000
25%        23.000000
50%        27.000000
75%        31.000000
max        35.000000
Name: food_preparation_time, dtype: float64

❓ *Question*

What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed?

✅ *Answer*

The minimum food preparation time is 20 minutes, the average is 27.37 minutes, and the maximum is 35 minutes.

📌 *Points*
1. 25% of the orders have a preparation time of 23 minutes or less.
2. The median food preparation time is 27 minutes, meaning that half of the orders are prepared in 27 minutes or less. 
3. 25% of the orders have a preparation time of 31 minutes or more.

📝 *Gist*

On average, restaurants take about 27 minutes to prepare food, with the middle half of orders falling between 23 and 31 minutes.
While the fastest orders are ready in 20 minutes, some take up to 35 minutes.
The modest spread in preparation time (a standard deviation of about 4.6 minutes) suggests a fairly consistent pace across orders.


### **Question 5:** How many orders are not rated? [1 mark]

In [36]:
df['rating'].value_counts()

Not given    736
5            588
4            386
3            188
Name: rating, dtype: int64

✅ Answer

There are 736 orders that are not rated, as indicated by the "Not given" entries in the rating column.

>  ⚡ Note
>
> The 'rating' column contains 736 entries marked as 'Not given,' indicating that these orders were not rated by customers. There are no missing values in the column, so these entries clearly represent the absence of a rating.


### Exploratory Data Analysis (EDA)

### Univariate Analysis

### **Question 6:** Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]

In [None]:
# Write the code here

### **Question 7**: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]

In [None]:
# Write the code here

#### Observations:


### **Question 8**: Which is the most popular cuisine on weekends? [1 mark]

In [None]:
# Write the code here

#### Observations:


### **Question 9**: What percentage of the orders cost more than 20 dollars? [2 marks]

In [None]:
# Write the code here

#### Observations:


### **Question 10**: What is the mean order delivery time? [1 mark]

In [None]:
# Write the code here

#### Observations:


### **Question 11:** The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]

In [None]:
# Write the code here
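
As a starting point, the most frequent customers can be found by counting how often each `customer_id` occurs. The sketch below uses made-up customer IDs rather than the real data, and `value_counts().head(3)` is only one possible approach. Note that ties at the cut-off are not handled specially here.

```python
import pandas as pd

# Made-up customer IDs standing in for df['customer_id'].
toy = pd.DataFrame({
    "customer_id": [101, 101, 101, 102, 102, 103, 104, 104, 104, 104],
})

# value_counts() sorts by frequency (descending); keep the first three.
top3 = toy["customer_id"].value_counts().head(3)

print(top3)
```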

#### Observations:


### Multivariate Analysis

### **Question 12**: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]


In [None]:
# Write the code here

### **Question 13:** The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]

In [None]:
# Write the code here
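
The offer condition combines two per-restaurant aggregates: the number of ratings and the mean rating. A minimal sketch follows, assuming that "Not given" entries are excluded before aggregating. The helper name `eligible_restaurants` and the toy data are hypothetical, and the toy call lowers the count threshold so the tiny example returns something.

```python
import pandas as pd

def eligible_restaurants(orders, min_count=50, min_mean=4.0):
    """Restaurants with more than `min_count` ratings and a mean rating above `min_mean`."""
    # Turn ratings into numbers; "Not given" becomes NaN and is dropped.
    rated = orders.assign(rating=pd.to_numeric(orders["rating"], errors="coerce"))
    rated = rated.dropna(subset=["rating"])
    # One row per restaurant with its rating count and mean rating.
    stats = rated.groupby("restaurant_name")["rating"].agg(["count", "mean"])
    return stats[(stats["count"] > min_count) & (stats["mean"] > min_mean)]

# Toy data (hypothetical restaurants A, B, C).
toy = pd.DataFrame({
    "restaurant_name": ["A", "A", "A", "B", "B", "C"],
    "rating": ["5", "4", "Not given", "3", "4", "5"],
})

result = eligible_restaurants(toy, min_count=1, min_mean=4.0)
print(result)
```

On the real data the defaults (`min_count=50`, `min_mean=4.0`) match the thresholds stated in the question.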

#### Observations:


### **Question 14:** The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]

In [None]:
# Write the code here
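
The commission rule is a tiered rate. The sketch below reads the tiers as mutually exclusive: 25% for orders above 20 dollars, 15% for orders above 5 and up to 20 dollars, and nothing otherwise. That reading is an assumption, and the function name `commission` and the toy costs are hypothetical.

```python
import numpy as np
import pandas as pd

def commission(cost: pd.Series) -> pd.Series:
    """Commission per order under the assumed mutually exclusive tiers."""
    # np.select picks the first matching condition, so > 20 wins over > 5.
    rate = np.select([cost > 20, cost > 5], [0.25, 0.15], default=0.0)
    return cost * rate

# Toy order costs covering each tier, including the boundary value 20.
toy_cost = pd.Series([4.00, 10.00, 20.00, 30.00])

revenue = commission(toy_cost)
total_revenue = revenue.sum()

print(revenue.tolist(), total_revenue)
```

An order of exactly 20 dollars falls into the 15% tier here because the question says "greater than 20".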

#### Observations:


### **Question 15:** The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]

In [None]:
# Write the code here
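
Total time is the sum of preparation and delivery time, and the share of orders above a threshold is the mean of a boolean mask. A small sketch on made-up values, not the real dataset:

```python
import pandas as pd

# Made-up durations in minutes.
toy = pd.DataFrame({
    "food_preparation_time": [20, 35, 30, 28],
    "delivery_time": [15, 30, 30, 33],
})

# Time from order placement to drop-off.
total_time = toy["food_preparation_time"] + toy["delivery_time"]

# The mean of a boolean Series is the fraction of True values.
pct_over_60 = (total_time > 60).mean() * 100

print(total_time.tolist(), pct_over_60)
```

An order taking exactly 60 minutes is not counted, since the question asks for "more than 60 minutes".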

#### Observations:


### **Question 16:** The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]

In [None]:
# Write the code here
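
A group-wise mean is one straightforward way to compare the two day types. The sketch below uses made-up delivery times, so the numbers say nothing about the actual dataset.

```python
import pandas as pd

# Made-up delivery times in minutes.
toy = pd.DataFrame({
    "day_of_the_week": ["Weekday", "Weekday", "Weekend", "Weekend"],
    "delivery_time": [28, 30, 20, 24],
})

# Mean delivery time per day type.
mean_by_day = toy.groupby("day_of_the_week")["delivery_time"].mean()

# Difference between the two groups, in minutes.
diff = mean_by_day["Weekday"] - mean_by_day["Weekend"]

print(mean_by_day)
print(diff)
```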

#### Observations:


### Conclusion and Recommendations

### **Question 17:** What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]

### Conclusions:
*  

### Recommendations:

*  