# FoodHub Data Analysis


### Context

The number of restaurants in New York is increasing day by day. Many students and busy professionals rely on these restaurants because of their hectic lifestyles. Online food delivery is a great option for them, bringing good food from their favorite restaurants. FoodHub, a food aggregator company, offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

### Objective

The food aggregator company has stored the data of the orders made by registered customers in its online portal. It wants to analyze the data to get a fair idea of the demand for different restaurants, which will help it enhance the customer experience. Suppose you are hired as a Data Scientist at this company and the Data Science team has shared some key questions that need to be answered. Perform the data analysis to answer these questions and help the company improve its business.

### Data Description

The dataset contains information about each food order. The detailed data dictionary is given below.

### Data Dictionary

* order_id: Unique ID of the order
* customer_id: ID of the customer who ordered the food
* restaurant_name: Name of the restaurant
* cuisine_type: Cuisine ordered by the customer
* cost_of_the_order: Cost of the order
* day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
* rating: Rating given by the customer out of 5
* food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
* delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off confirmation.

# Loading Modules & Data

### Import the required libraries

In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

### Import your data

In [2]:
df = pd.read_csv("foodhub_order.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'foodhub_order.csv'

# Data Exploration

##### Do sanity checks on the data

In [None]:
# Initial glance at the dataset
df

In [None]:
# Determine the dimensions of the dataset
df.shape

In [None]:
# Statistical summary of the numerical columns
df.describe()

In [None]:
# Statistical summary of the categorical columns
df.describe(include='O')

### Standardising Data Types
Check whether any column contains multiple data types.
Question: Is there a quicker, more efficient way to do this?
How can the data types be optimised?

In [None]:
### Check each column to ensure the data types within it are consistent

# df['restaurant_name'].unique() .... string or object
# df['cuisine_type'].unique() ..... string or object
# df['cost_of_the_order'].unique() .... float
# df['day_of_the_week'].unique() ... string or object
# df['rating'].unique()
# df['order_id'].unique()
# df['food_preparation_time'].unique()
# df['delivery_time'].unique()
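On the question of a quicker check: one possible approach, sketched here under assumptions, is to try parsing every non-numeric column as numbers and report the values that fail. The frame `toy` and the helper `non_numeric_values` are hypothetical names used only for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (not the real dataset)
toy = pd.DataFrame({
    "rating": ["5", "Not given", "4", "Not given"],
    "cuisine_type": ["Korean", "Japanese", "Mexican", "American"],
    "delivery_time": [20, 23, 28, 15],
})

def non_numeric_values(frame):
    """For each non-numeric column, list the distinct values that cannot be parsed as numbers."""
    report = {}
    for col in frame.select_dtypes(exclude="number").columns:
        # Values that cannot be parsed become NaN
        parsed = pd.to_numeric(frame[col], errors="coerce")
        failed = parsed.isna() & frame[col].notna()
        report[col] = frame.loc[failed, col].unique().tolist()
    return report

report = non_numeric_values(toy)
print(report)
```

A column where only a few distinct values fail to parse (such as `rating` with `'Not given'`) is probably numeric data with a placeholder, while a column where every value fails is genuinely text. For text columns with few distinct values, `astype('category')` is one option for reducing memory use.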

In [None]:
# Replace "Not given" in the rating column with NaN and convert the column to float
df['rating'] = df['rating'].replace('Not given', np.nan)
df['rating'] = df['rating'].astype(float)

In [None]:
df['rating']

In [None]:
df.dtypes

## Questions for guidance
The questions or tasks below are meant to guide you in extracting insights for the business. You are encouraged to ask more questions.


### **Question 1:** How many orders are not rated?

In [None]:
nan_count = df['rating'].isna().sum()
nan_count

736 orders are not rated.

## Exploratory Data Analysis (EDA)

### Univariate Analysis

### **Question 2:** Explore all the variables and provide observations on their distributions. (Choose appropriate plots as you wish)

Order_id

In [None]:
plt.hist(x=df['order_id'], bins=30, edgecolor='black', alpha=0.8)
plt.xlabel('order_id')
plt.ylabel('Frequency')
plt.title('Frequency/Distribution of Order ID')
plt.grid(True)

Customer ID

In [None]:
plt.hist(x=df['customer_id'], bins=20, edgecolor='black', alpha=0.8)
# Add labels and a title
plt.xlabel('customer_id')
plt.ylabel('Frequency')
plt.title('Frequency/Distribution of Customer ID')
plt.grid(True)

Restaurant Name

In [None]:
df['restaurant_name'].nunique()

In [None]:
sns.countplot(data=df, x='restaurant_name')
# Add labels and a title
plt.xlabel('restaurant_name')
plt.ylabel('Frequency')
plt.title('Frequency of Restaurant Name')
plt.xticks(rotation=90)

In [None]:
# Sort the restaurant names by frequency in descending order and select the top 10
top_restaurants = df['restaurant_name'].value_counts().head(10)

plt.bar(top_restaurants.index, top_restaurants.values, color='blue')
plt.xlabel('Restaurant')
plt.ylabel('Frequency')
plt.title('Top 10 Restaurants by Frequency')
plt.xticks(rotation=90)

Cuisine Type

In [None]:
df['cuisine_type'].unique()

In [None]:
# Sort the cuisines by frequency in descending order and select the top 10
top_cuisines = df['cuisine_type'].value_counts().head(10)

plt.bar(top_cuisines.index, top_cuisines.values, color='blue')
plt.xlabel('Cuisine')
plt.ylabel('Frequency')
plt.title('Top 10 Cuisines by Frequency')
plt.xticks(rotation=90);

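**Observation**

American appears to be the most frequently ordered cuisine. The cuisine types found in the data include Japanese, Mexican, American, Indian, Italian, Mediterranean, Chinese, Middle Eastern, Thai, Southern, French, Spanish, and Vietnamese (listed here unranked; the bar chart above gives the ranking of the top 10).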

Cost Of The Order

In [None]:
plt.hist(x=df['cost_of_the_order'], bins=20, edgecolor='black', alpha=0.8)
# Add labels and a title
plt.xlabel('cost_of_the_order')
plt.ylabel('Frequency')
plt.title('Frequency/Distribution of Order Cost')
plt.grid(True)

In [None]:
sns.histplot(df['cost_of_the_order'], kde=True);

day_of_the_week

In [None]:
df.columns.tolist()

In [None]:
df['day_of_the_week'].unique()

In [None]:
day_of_the_week= df['day_of_the_week'].value_counts()
day_of_the_week

In [None]:
# Count the orders placed on weekdays and on weekends
day_of_the_week = df['day_of_the_week'].value_counts()

plt.bar(day_of_the_week.index, day_of_the_week.values, color='#00FFFF',alpha =0.5)
plt.xlabel('Day_of_the_week')
plt.ylabel('Frequency')
plt.title('Frequency of orders by time of week')
plt.xticks(rotation=90);

In [None]:
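day_of_the_week = day_of_the_week.reset_index()
day_of_the_week.columns = ['index', 'day_of_the_week']  # fixed names; reset_index() defaults vary across pandas versions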

In [None]:
plt.figure(figsize=(7, 4))
ax = sns.barplot(data=day_of_the_week, x='index', y='day_of_the_week', color='teal', alpha=0.5)
# Label each bar with its order count
for container in ax.containers:
    ax.bar_label(container)
# Remove the axis spines, axis labels and y ticks for a cleaner chart
sns.despine(left=True, bottom=True)
ax.set(xlabel=None, ylabel=None, yticks=[])
plt.tick_params(axis='both', which='major', labelsize=10)
plt.show()

In [None]:
sns.set(rc={"figure.figsize":(8,3)})
ax = sns.barplot(data=day_of_the_week, x='index', y='day_of_the_week', color='cyan', width=0.6)
ax.set(yticks=[])
ax.set(title="Orders on Weekday vs Weekend")
for p in ax.patches:
    # get the height of each bar
    height = p.get_height()
    # add the count as text just above each bar
    ax.text(x=p.get_x() + p.get_width() / 2,
            y=height + 100,
            s="{:.0f}".format(height),
            ha="center")

In [None]:
sns.set(rc={"figure.figsize":(8,4)})
ax = sns.barplot(data=day_of_the_week, x='index', y='day_of_the_week', color='cyan')
ax.set(yticks=[])
ax.set(title="Orders on Weekday vs Weekend")
for p in ax.patches:
    # get the height of each bar
    height = p.get_height()
    # add the count as text just inside the top of each bar
    ax.text(x=p.get_x() + p.get_width() / 2,
            y=height - 100,
            s="{:.0f}".format(height),
            ha="center")

### **Question 3**: Which are the top 5 restaurants in terms of the number of orders received?

In [None]:
# Write the code here
# Feel free to add more cells
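One way to approach this question, sketched here on a hypothetical toy frame rather than the real `df`, is to count the orders per restaurant with `value_counts()` and keep the first five entries. The restaurant names `A` to `E` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy orders standing in for df (names are made up)
toy = pd.DataFrame({
    "order_id": range(1, 9),
    "restaurant_name": ["A", "B", "A", "C", "A", "B", "D", "E"],
})

# value_counts() sorts by frequency in descending order, so head(5) gives the top 5
top5 = toy["restaurant_name"].value_counts().head(5)
print(top5)
```

On the real data the same line would presumably be `df['restaurant_name'].value_counts().head(5)`.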

### **Question 4**: Which is the most popular cuisine on weekends?

In [None]:
popular_cuisine_df = df[df['day_of_the_week'] == 'Weekend']\
  .groupby('cuisine_type')[['order_id']].count()\
  .sort_values(by='order_id', ascending=False)\
  .reset_index()\
  .rename(columns={'order_id': 'Count'})


In [None]:
popular_cuisine_df.head()

In [None]:
sns.barplot(data=popular_cuisine_df,y='cuisine_type',x='Count');


**Observation**

On weekends, the three most popular cuisines are American, Japanese, and Italian, with American the most popular. The least popular cuisines are Vietnamese, Spanish, and Southern.

### **Question 5**: What percentage of the orders cost more than 20 dollars?

In [None]:
df.head(2)

In [None]:
cost_above20 = len(df[df['cost_of_the_order'] > 20])

In [None]:
round(cost_above20/len(df) *100,2)

### **Question 6**: What is the mean order delivery time?

In [None]:
mean_delivery_time = df['delivery_time'].mean()
mean_delivery_time

### **Question 7:** The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed.

In [None]:
Frequent_customer = df['customer_id'].value_counts().reset_index()
Frequent_customer.columns = ['customer_id', 'count']
Frequent_customer.head(3)

In [None]:
df

## Bivariate/Multivariate Analysis

### **Question 8**: Perform a bivariate/multivariate analysis to explore relationships between the important variables in the dataset.


In [None]:
# Write the code here
# Feel free to add more cells
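As a possible starting point for this question, the sketch below computes a correlation matrix of the numeric columns and two grouped summaries. It runs on a hypothetical toy frame (`toy`) with made-up values, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for df (values are illustrative only)
toy = pd.DataFrame({
    "cuisine_type": ["Korean", "Japanese", "Japanese", "American", "American", "American"],
    "day_of_the_week": ["Weekend", "Weekend", "Weekday", "Weekday", "Weekend", "Weekday"],
    "cost_of_the_order": [30.75, 12.08, 12.23, 29.20, 11.59, 25.22],
    "food_preparation_time": [25, 25, 23, 25, 25, 20],
    "delivery_time": [20, 23, 28, 15, 24, 24],
})

# Pairwise Pearson correlation between the numeric variables
numeric_cols = ["cost_of_the_order", "food_preparation_time", "delivery_time"]
corr = toy[numeric_cols].corr()
print(corr)

# Order cost summarised per cuisine
cost_by_cuisine = toy.groupby("cuisine_type")["cost_of_the_order"].agg(["mean", "median", "count"])
print(cost_by_cuisine)

# Mean delivery time per part of the week
delivery_by_day = toy.groupby("day_of_the_week")["delivery_time"].mean()
print(delivery_by_day)
```

The correlation matrix can be passed to `sns.heatmap(corr, annot=True)` for a visual summary, and `sns.boxplot` is one option for comparing cost or delivery time across cuisines.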

### **Question 9:** The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer.

In [49]:
df.head(2)

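| Unnamed: 0 | order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | NaN | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | NaN | 25 | 23 |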
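A minimal sketch of one way to find the qualifying restaurants: group by restaurant, compute the rating count and mean, and filter on both. The frame `toy` and the constants `MIN_COUNT` and `MIN_MEAN` are hypothetical; the count threshold is lowered from 50 to 2 only so the toy data can satisfy it. It assumes `rating` has already been converted to float with NaN for unrated orders.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings standing in for df (restaurant names are made up)
toy = pd.DataFrame({
    "restaurant_name": ["A"] * 4 + ["B"] * 3 + ["C"] * 2,
    "rating": [5, 4, 5, np.nan, 3, 4, 4, 5, 5],
})

MIN_COUNT = 2  # the question uses 50; lowered here only for the toy data
MIN_MEAN = 4

# count() ignores NaN, so unrated orders do not inflate the rating count
stats = toy.groupby("restaurant_name")["rating"].agg(["count", "mean"])
eligible = stats[(stats["count"] > MIN_COUNT) & (stats["mean"] > MIN_MEAN)]
print(eligible)
```

With the real data, setting `MIN_COUNT = 50` and grouping `df` instead of `toy` should apply the criteria stated in the question.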


### **Question 10:** The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders.

In [36]:
# Write the code here
# Feel free to add more cells
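The commission rule can be sketched with `np.select`, which picks the first matching condition for each order. This example assumes that the 15% rate applies to orders costing more than 5 and up to 20 dollars, and that orders of 5 dollars or less earn no commission; the question does not state the latter explicitly. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy costs standing in for df['cost_of_the_order']
toy = pd.DataFrame({"cost_of_the_order": [30.75, 12.08, 4.50, 20.00, 25.00]})
cost = toy["cost_of_the_order"]

# Conditions are checked in order: 25% above $20, else 15% above $5, else 0% (assumed)
rate = np.select(
    [(cost > 20).to_numpy(), (cost > 5).to_numpy()],
    [0.25, 0.15],
    default=0.0,
)

# Revenue is the commission summed over all orders
revenue = (cost * rate).sum()
print(round(revenue, 2))
```

An order costing exactly 20 dollars falls into the 15% band here because the question says "greater than 20".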

### **Question 11:** The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.)

In [37]:
# Write the code here
# Feel free to add more cells
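A short sketch of this calculation, on a hypothetical toy frame: add the preparation and delivery times, then take the mean of the boolean mask, which equals the share of orders above the threshold.

```python
import pandas as pd

# Hypothetical toy times (in minutes) standing in for df
toy = pd.DataFrame({
    "food_preparation_time": [25, 25, 35, 30, 20],
    "delivery_time": [20, 23, 30, 31, 40],
})

# Total time from order placement to delivery
total_time = toy["food_preparation_time"] + toy["delivery_time"]

# The mean of a boolean Series is the fraction of True values
pct_over_60 = (total_time > 60).mean() * 100
print(round(pct_over_60, 2))
```

An order taking exactly 60 minutes is not counted, since the question asks for more than 60 minutes.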

### **Question 12:** The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends?
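One way to compare the two groups is a `groupby` on `day_of_the_week` followed by the mean of `delivery_time`. The sketch below uses a hypothetical toy frame, so its output does not reflect the real data.

```python
import pandas as pd

# Hypothetical toy rows standing in for df
toy = pd.DataFrame({
    "day_of_the_week": ["Weekend", "Weekend", "Weekday", "Weekday", "Weekend"],
    "delivery_time": [20, 23, 28, 30, 26],
})

# Mean delivery time (in minutes) for weekdays and weekends
mean_by_day = toy.groupby("day_of_the_week")["delivery_time"].mean()
print(mean_by_day)
```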

## Conclusion and Recommendations

### **Question 13:** What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.)

### Conclusions:
*  

### Recommendations:

*  

---