# **Project Name**    -
Bridging the Gap: Optimizing Uber’s Supply and Demand through Data Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
Member- Nikhar Roy Chaudhuri

# **Project Summary -**

Write the summary here within 500-600 words.
This project aims to analyze Uber ride request data to identify and address the supply-demand gap issues in urban transport. The main objective is to determine when and where customer demand exceeds driver availability, leading to unfulfilled requests or long wait times.

**Data Preparation**

The dataset includes ride request timestamps, pickup points, and trip statuses. After cleaning, we extracted features such as request hour and day of the week to analyze patterns more effectively.

**Exploratory Data Analysis (EDA)**

The EDA revealed two major peak periods:

- Morning Rush (5 AM–9 AM): High demand from the City to the Airport, with a large number of unfulfilled requests due to fewer available drivers in the City.
- Evening Rush (5 PM–9 PM): High demand from the Airport to the City, where supply again falls short, leading to many "No Cars Available" statuses.

Heatmaps and bar charts confirmed that these are critical time windows where the supply-demand gap is most prominent. A boxplot of pickup delays further highlighted longer waiting times during peak hours. Weekday patterns (limited to Monday and Tuesday) showed consistent demand, suggesting that time of day and location are the bigger challenges.

**Key Insights**

- City mornings and Airport evenings face the highest unfulfilled demand.
- Driver unavailability is the primary reason for request failures.
- Pickup delays rise sharply during peak times.

**Recommendations**

- Incentivize drivers to operate in high-demand areas during peak hours.
- Balance supply dynamically using predictive demand models.
- Promote scheduled rides, especially for airport pickups, to ensure availability.

By addressing these critical gaps with data-driven strategies, Uber can improve customer experience, reduce cancellations, and optimize resource deployment.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**
Uber is facing a significant gap between ride demand and driver supply during certain hours and pickup locations. This leads to a high number of unfulfilled requests, customer dissatisfaction, and revenue loss. The goal is to identify patterns causing this mismatch and suggest data-driven strategies to bridge the supply-demand gap effectively.

#### **Define Your Business Objective?**

Answer Here-To analyze Uber ride request data and identify key causes of supply-demand gaps—such as peak-hour shortages, driver unavailability, and high cancellation rates—to help Uber improve driver allocation and minimize unmet demand.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below and answers the questions that follow it.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('Uber Request Data.csv')


### Dataset First View

In [None]:
# Dataset First Look
df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


In [None]:
# Remove duplicate rows
df.drop_duplicates(inplace=True)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


In [None]:
# ✅ Show missing values as percentage of total records
(df.isnull().sum() / len(df)) * 100


In [None]:
# Visualizing the missing values

In [None]:
#  Visualize missing values using seaborn heatmap
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(), cbar=False, cmap='Blues')
plt.title("Missing Values Heatmap")
plt.show()


### What did you know about your dataset?

Answer Here-
- The dataset contains 6,745 Uber request entries and 6 original columns.
- After cleaning, we added new columns such as `request_hour` and `request_date` to support time-based analysis.
- Approximately **39.3%** of `driver_id` values are missing, which likely indicates unassigned drivers or system unavailability.
- About **58%** of entries have missing `drop_timestamp`, suggesting many trips were either cancelled or not fulfilled.
- There are no null values in `request_id`, `pickup_point`, `status`, or `request_timestamp`, which are key fields for analysis.
- The dataset is well-structured for analyzing supply-demand gaps, time patterns, and cancellation causes.
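
One way to check the interpretation of the missing values is to cross-tabulate missingness against trip status. The sketch below runs on a tiny hypothetical frame (the rows are illustrative, not taken from the dataset) that reuses the original column names. On the real `df`, the same calls would show which statuses account for the missing `Driver id` and `Drop timestamp` values.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's columns (not real data)
toy = pd.DataFrame({
    "Status": ["Trip Completed", "Cancelled", "No Cars Available", "No Cars Available"],
    "Driver id": [1.0, 2.0, None, None],
    "Drop timestamp": ["11/7/2016 13:00", None, None, None],
})

# Rows: trip status; columns: whether Driver id is missing
driver_missing = pd.crosstab(toy["Status"], toy["Driver id"].isna()).rename(
    columns={False: "driver_present", True: "driver_missing"}
)

# Share of rows per status that have no drop timestamp
drop_missing_share = toy["Drop timestamp"].isna().groupby(toy["Status"]).mean()

print(driver_missing)
print(drop_missing_share)
```

If the missing `Driver id` rows fall almost entirely under one status, that supports reading them as unassigned requests rather than data-entry gaps.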


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns


In [None]:
# Dataset Describe
df.describe(include='all')


### Variables Description

Answer Here-
- `Request id`: A unique integer identifier for each Uber ride request.
- `Pickup point`: The location where the ride was requested — either "City" or "Airport".
- `Driver id`: Numeric ID of the assigned driver. Around 39% are missing, indicating no driver was assigned.
- `Status`: The final outcome of the request — can be "Trip Completed", "Cancelled", or "No Cars Available".
- `Request timestamp`: The date and time when the ride was requested, in mixed datetime format.
- `Drop timestamp`: The date and time when the trip ended. Missing for about 58% of rows due to unfulfilled requests.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
#  Loop through all columns and print number of unique values
for col in df.columns:
    print(f" {col} ➤ {df[col].nunique()} unique values")

# Optional: Show actual unique values for key categorical columns
print("\n Unique values in 'Pickup point':", df['Pickup point'].unique())
print(" Unique values in 'Status':", df['Status'].unique())
print("Unique driver IDs (sample):", df['Driver id'].dropna().unique()[:10])  # show only a sample


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# 1. Standardize column names (remove extra spaces or fix case)
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

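# 2. Convert timestamp columns to datetime with dayfirst=True
#    Note: the raw timestamps are in mixed formats. With errors='coerce', any value
#    pandas cannot parse becomes NaT, so check the NaT count after conversion.
df['request_timestamp'] = pd.to_datetime(df['request_timestamp'], dayfirst=True, errors='coerce')
df['drop_timestamp'] = pd.to_datetime(df['drop_timestamp'], dayfirst=True, errors='coerce')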


#  3. Fill missing driver_id with 'Unavailable' and convert column to string for consistency
df['driver_id'] = df['driver_id'].fillna('Unavailable').astype(str)

#  4. Fill missing status with 'Missing' if any (precautionary)
df['status'] = df['status'].fillna('Missing')

#  5. Add new column: Hour of request
df['request_hour'] = df['request_timestamp'].dt.hour

# 6. Add new column: Date of request
df['request_date'] = df['request_timestamp'].dt.date

# 7. (Optional) Preview the cleaned data
df.head()


In [None]:
df.to_csv("Cleaned_Uber_Request_Data.csv", index=False)


### What all manipulations have you done and insights you found?

Answer Here-**Manipulations Done:**
- Cleaned and standardized column names for consistency.
- Converted `request_timestamp` and `drop_timestamp` to proper datetime format using `dayfirst=True`.
- Handled missing values:
  - Filled missing `driver_id` with `'Unavailable'`.
  - Filled missing `status` with `'Missing'` (precautionary, if any).
- Created two new columns:
  - `request_hour` — to analyze hourly trends
  - `request_date` — for date-wise aggregation

**Insights from Cleaned Output:**
- The data shows many missing `drop_timestamp` values (`NaT`), indicating trips were not completed — this aligns with cancelled rides or no-driver scenarios.
- Missing `driver_id` values were successfully replaced, ensuring the dataset remains consistent for grouped analysis.
- `request_hour` and `request_date` columns were correctly extracted, confirming timestamps are now usable for time-based visualizations.
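
Because the raw timestamps are in mixed formats, it is worth confirming how many values survive parsing. The sketch below uses a few hypothetical sample strings in two layouts (slash-separated without seconds, dash-separated with seconds) and compares the default single-format inference with per-element parsing via `format="mixed"`, which assumes pandas 2.0 or later. It is a diagnostic sketch, not the notebook's actual wrangling step.

```python
import pandas as pd

# Hypothetical sample strings in two layouts (illustrative only)
sample = pd.Series([
    "11/7/2016 11:51",
    "12/7/2016 21:08",
    "13-07-2016 08:33:16",
    "15-07-2016 23:49:03",
])

# Default behaviour: one format is inferred and non-matching rows may be coerced to NaT
single = pd.to_datetime(sample, dayfirst=True, errors="coerce")

# Per-element inference (requires pandas >= 2.0)
mixed = pd.to_datetime(sample, format="mixed", dayfirst=True, errors="coerce")

print("NaT with single inferred format:", int(single.isna().sum()))
print("NaT with format='mixed':", int(mixed.isna().sum()))
```

On the real data, a large NaT count in `request_timestamp` would mean that every hour-based or weekday-based chart is built on a subset of the requests.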


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
# Chart 1 – Request volume by hour
import matplotlib.pyplot as plt
import seaborn as sns

# Set the plot style
plt.figure(figsize=(12,6))
sns.countplot(x='request_hour', data=df, palette='viridis')

# Add titles and labels
plt.title("Number of Uber Requests by Hour of the Day", fontsize=14)
plt.xlabel("Hour of Day (0-23)", fontsize=12)
plt.ylabel("Number of Requests", fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a countplot to visualize the number of Uber requests by hour because it effectively highlights demand trends across different times of the day. Since time-based analysis is crucial in transport data, this chart helps spot peak and off-peak hours clearly.


##### 2. What is/are the insight(s) found from the chart?

Answer Here-The chart reveals two major demand spikes:
- Morning hours (5 AM – 10 AM)
- Evening hours (5 PM – 9 PM)

These align with typical office commute times. Demand is comparatively low during late night and mid-afternoon periods. This indicates potential strain during peak hours, especially if driver availability is not scaled proportionally.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. These insights help Uber optimize driver allocation, reduce rider wait times, and improve customer satisfaction during rush hours. By forecasting demand peaks, Uber can implement surge pricing or offer incentives to drivers during high-demand periods, which positively impacts revenue.

There are no direct insights leading to negative growth here. However, failure to address the high demand during peak hours may lead to lost customers due to long wait times or unfulfilled rides, which could indirectly hurt the business.


#### Chart - 2

In [None]:
# Chart - 2 visualization code

In [None]:
# Chart 2 – Number of requests from Airport vs City
plt.figure(figsize=(8,5))
sns.countplot(x='pickup_point', data=df, palette='pastel')

# Titles and labels
plt.title("Number of Requests by Pickup Point", fontsize=14)
plt.xlabel("Pickup Location", fontsize=12)
plt.ylabel("Number of Requests", fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I selected this chart to compare the number of ride requests between the two primary pickup locations: Airport and City. A bar chart is ideal for comparing categorical variables and clearly shows which location sees higher traffic.



##### 2. What is/are the insight(s) found from the chart?

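Answer Here-City pickups account for slightly more requests than Airport pickups. Summing the status counts reported for Chart 3 gives roughly 3,507 City requests versus 3,238 Airport requests, so demand is split fairly evenly between the two locations.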

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. These insights help Uber plan driver distribution between Airport and City more efficiently. If one location is underserved, customers may face cancellations or delays, which negatively affects user experience. By balancing supply based on this chart, Uber can improve operational efficiency and customer satisfaction.

There are no direct signs of negative growth, but a **lack of supply at either location despite high demand** could potentially harm Uber's reputation and lead to loss of users if not addressed.


#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
# ----- Chart 3  : Heat-map  -----
import seaborn as sns
import matplotlib.pyplot as plt

pivot = df.pivot_table(
    index='pickup_point',
    columns='status',
    values='request_id',
    aggfunc='count',
    fill_value=0
)

plt.figure(figsize=(8,4))
sns.heatmap(pivot, annot=True, fmt="d", cmap="YlGnBu")
plt.title("Request Counts – Pickup vs Status")
plt.ylabel("Pickup Location")
plt.xlabel("Trip Status")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-A heatmap is excellent for comparing two categorical variables — in this case, pickup point and trip status. It visually encodes both the count (through color intensity) and actual values (annotations), making it easy to spot patterns, especially where service is falling short.


##### 2. What is/are the insight(s) found from the chart?

Answer Here-
- Airport pickups have a significantly higher count of “No Cars Available” requests (1713) than Cancelled (198) or Completed (1327).
- City pickups, on the other hand, show a more balanced distribution:
  - Trip Completed (1504) is the highest.
  - Cancelled (1066) is also relatively high.
  - No Cars Available (937) is lower than at the Airport.

This suggests that lack of car availability is a major issue at the airport, while cancellation is a bigger issue in the city.
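
Raw counts can hide differences in scale between the two pickup points, so converting them to within-location shares makes the comparison easier. The sketch below rebuilds the table from the counts quoted in this answer and normalizes each row. The variable names are illustrative.

```python
import pandas as pd

# Counts as quoted in the Chart 3 answer
counts = pd.DataFrame(
    {
        "Cancelled": [198, 1066],
        "No Cars Available": [1713, 937],
        "Trip Completed": [1327, 1504],
    },
    index=["Airport", "City"],
)

# Share of each status within a pickup point (each row sums to about 100)
row_share = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
print(row_share)
```

Read this way, roughly 53% of Airport requests end in "No Cars Available" compared with about 27% in the City, while about 30% of City requests are cancelled compared with about 6% at the Airport.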

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, absolutely.

By deploying more drivers to airport locations, the company can reduce the “No Cars Available” rate, directly improving customer satisfaction and revenue.
Meanwhile, tackling high cancellation rates in the city may involve improving driver behavior or system incentives.
These changes can improve the fulfillment rate, reduce lost opportunities, and increase customer loyalty.

Negative Growth Insight:
Yes — the 1713 unserved requests at the airport reflect missed revenue and poor customer experience. If not resolved, this could push users toward competitors or alternative transport modes.



#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='request_hour', hue='status', data=df, palette='muted')
plt.title('Trip Status by Hour of the Day', fontsize=14)
plt.xlabel('Hour of Day (0-23)', fontsize=12)
plt.ylabel('Number of Requests', fontsize=12)
plt.legend(title='Trip Status')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-We chose a grouped bar chart to show trip status distribution (Trip Completed, Cancelled, No Cars Available) across different hours of the day. This format clearly reveals time-based trends in trip outcomes, which wouldn't be as easy to spot in pie charts or heatmaps.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-
- Early mornings (5 AM–9 AM) and evenings (5 PM–9 PM) are peak hours for trip requests.
- Morning hours (5 AM–9 AM) see high cancellations, especially around 6–8 AM.
- Evening hours (5 PM–9 PM) are dominated by "No Cars Available" issues, likely due to high demand but low driver availability.
- Midday hours (10 AM–4 PM) show relatively more Trip Completed requests.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, absolutely.

Uber can optimize driver supply by increasing availability in the evening peak hours to reduce “No Cars Available.”
Morning cancellations might signal user drop-offs due to driver unresponsiveness or delays — addressing this improves customer experience.
Strategic driver incentive programs could be scheduled during these high-demand windows to close the service gap.

If these issues persist, customer dissatisfaction rises, usage drops, and growth slows due to lost trust and poor service availability in key time slots.


#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
import matplotlib.pyplot as plt

# Count of each status
status_counts = df['status'].value_counts()

# Pie chart
plt.figure(figsize=(7, 7))
colors = ['#66b3ff', '#ff9999', '#99ff99']
plt.pie(status_counts, labels=status_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)
plt.title('Trip Status Distribution')
plt.axis('equal')  # Make it a circle
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-A pie chart is ideal for visualizing proportions within a single categorical variable. It gives a quick and intuitive view of how the total Uber requests are split among the three status categories (Trip Completed, Cancelled, No Cars Available). This helps decision-makers instantly spot imbalance or inefficiencies.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-
- Only 42% of requests were successfully completed.
- A significant 39.3% of requests could not be fulfilled due to no cars being available.
- 18.7% were cancelled, either by users or drivers.

This shows that more than half of Uber service attempts fail, which is a major concern for customer satisfaction and operational efficiency.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. The insights highlight major service gaps — especially the large number of unfulfilled requests due to unavailable cars. By addressing these gaps (e.g., better driver scheduling, dynamic incentives), Uber can significantly improve trip completion rates, enhance customer experience, and potentially increase revenue.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

In [None]:
import matplotlib.pyplot as plt

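# Flag requests that had a driver assigned (missing driver_id was filled with 'Unavailable')
# reindex fixes the order so the counts always line up with the labels below
assigned = (df['driver_id'] != 'Unavailable').value_counts().reindex([True, False], fill_value=0)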

# Labels
labels = ['Driver Assigned', 'Driver Not Assigned']
colors = ['#4CAF50', '#FF6F61']

# Plot
plt.figure(figsize=(6,6))
plt.pie(assigned, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors, wedgeprops=dict(width=0.4))
plt.title('Driver Assignment Distribution')
plt.axis('equal')  # Equal aspect ratio ensures it looks like a circle
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-This donut chart was chosen because it clearly represents the proportion of ride requests that had a driver assigned versus those that didn’t. The circular format with a hollow center emphasizes segmentation and is visually distinct from the bar charts and heatmaps used earlier, offering a fresh, intuitive view of fulfillment efficiency.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-About 61% of requests had a driver assigned, while roughly 39% had no driver assigned. This reveals a significant driver shortage or allocation issue, where nearly 4 out of 10 requests fail at the assignment stage before the ride can even begin.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, this insight has both positive and negative implications:

Positive impact: It helps the business identify a key operational gap in driver availability, enabling data-backed resource planning, surge allocation, and driver hiring strategies.
Negative growth: A 39% driver assignment failure rate can directly lead to poor customer experience, reduced bookings, and eventual churn. Immediate steps are needed to bridge this gap in high-demand areas or hours.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

In [None]:
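# Group by hour and pickup point, then count requests
# fill_value=0 keeps the table integer-typed so fmt='d' works even if a combination has no requests
heatmap_data = df.groupby(['request_hour', 'pickup_point'])['request_id'].count().unstack(fill_value=0)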

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='YlGnBu')

plt.title('Heatmap: Number of Requests by Hour & Pickup Point')
plt.xlabel('Pickup Point')
plt.ylabel('Hour of the Day (0–23)')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-A heatmap is ideal for showing the density of values across two categorical dimensions — in this case, hour of the day and pickup point. It helps quickly identify patterns in high or low request volumes across time and location without needing multiple charts.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-
- City: A high number of ride requests occurs in the early morning hours (5 AM–9 AM), likely due to the office commute rush.
- Airport: Peak requests happen in the evening hours (5 PM–9 PM), possibly due to flight arrivals or end-of-day travel.
- Midday hours in both locations show relatively low request volumes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-To minimize cancellations and improve customer satisfaction:

- Reallocate drivers dynamically based on these time-location patterns.
- Ensure higher driver availability in the city during morning rush hours and at the airport during evenings.
- Use predictive scheduling to reduce unmet demand and improve fulfillment rates during these peak periods.
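
To decide where reallocation would matter most, the time-location pattern can be turned into an explicit gap table. The sketch below is a minimal illustration on hypothetical toy rows that use the cleaned column names. It assumes that demand is every request and supply is every completed trip, which is one possible definition rather than the only one.

```python
import pandas as pd

# Hypothetical toy rows using the cleaned column names (not real data)
toy = pd.DataFrame({
    "request_hour": [5, 5, 5, 18, 18, 18, 18],
    "pickup_point": ["City", "City", "City", "Airport", "Airport", "Airport", "City"],
    "status": [
        "Cancelled", "Trip Completed", "Cancelled",
        "No Cars Available", "No Cars Available", "Trip Completed", "Trip Completed",
    ],
})

# Assumption: demand = all requests, supply = completed trips
toy["fulfilled"] = toy["status"].eq("Trip Completed")

gap = (
    toy.groupby(["request_hour", "pickup_point"])["fulfilled"]
    .agg(demand="size", supply="sum")
    .assign(gap=lambda t: t["demand"] - t["supply"])
    .sort_values("gap", ascending=False)
)
print(gap)
```

Sorting by `gap` puts the hour and pickup-point combinations with the most unmet requests at the top, which is where driver incentives would be targeted first.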

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
# Filter for only rows with valid request timestamps
valid_df = df.dropna(subset=['request_timestamp'])

# Total requests per pickup point
total_requests = valid_df.groupby('pickup_point')['request_id'].count()

# Unavailable drivers per pickup point
unavailable_requests = valid_df[valid_df['driver_id'] == 'Unavailable'].groupby('pickup_point')['request_id'].count()

# Calculate % unassigned
unassigned_percent = (unavailable_requests / total_requests) * 100

# Plot
plt.figure(figsize=(8, 5))
unassigned_percent.plot(kind='bar', color='tomato')
plt.title('Driver Unavailability Rate by Pickup Point')
plt.ylabel('% of Requests with No Driver Assigned')
plt.xlabel('Pickup Point')
plt.ylim(0, 100)
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-This chart was chosen to highlight the operational inefficiencies in assigning drivers from different pickup points. Instead of raw counts, we show the percentage of requests without a driver, giving a clearer picture of supply gaps — especially helpful for management decisions.

##### 2. What is/are the insight(s) found from the chart?

Answer Here- The Airport has a significantly higher driver unavailability rate (~50%) compared to the City (~25%).
This implies a severe driver shortage at the Airport, which can lead to lost revenue, longer wait times, and customer dissatisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-
- Reallocate or increase driver availability at the Airport during peak request times.
- Incentivize drivers to accept airport pickups by offering bonuses or priority queues.
- Use predictive scheduling to match driver supply to anticipated demand based on historical patterns.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load cleaned Uber dataset
df = pd.read_csv("Cleaned_Uber_Request_Data.csv")

# Convert 'request_timestamp' to datetime
df['request_timestamp'] = pd.to_datetime(df['request_timestamp'])

# Extract hour and weekday from the timestamp
df['hour'] = df['request_timestamp'].dt.hour
df['weekday'] = df['request_timestamp'].dt.day_name()

# Create a combined column for status and pickup point (optional but flexible)
df['Status@Pickup'] = df['status'] + ' @ ' + df['pickup_point']

# Create pivot table: weekdays vs hours (count of requests)
pivot_heatmap = pd.pivot_table(
    df,
    index='weekday',
    columns='hour',
    values='Status@Pickup',  # could also use 'request_id'
    aggfunc='count',
    fill_value=0
)

# Reorder weekdays properly
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
pivot_heatmap = pivot_heatmap.reindex(weekday_order)

# Plot the heatmap
plt.figure(figsize=(14, 7))
sns.heatmap(pivot_heatmap, cmap="viridis", linewidths=0.3, linecolor='gray')
plt.title("Heatmap: Total Requests by Hour and Weekday (All Statuses Combined)")
plt.xlabel("Hour of Day")
plt.ylabel("Day of Week")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-This heatmap was selected because it provides a compact visual summary of Uber request activity across both hours of the day and days of the week, revealing trends and patterns that are not easily captured in simple bar or pie charts. It helps in identifying peak and off-peak hours, especially when analyzed across weekdays.

##### 2. What is/are the insight(s) found from the chart?

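Answer Here-
- The heatmap shows data only for Monday and Tuesday, with no entries for the remaining days. Because the raw timestamps are in mixed formats and were parsed with `errors='coerce'`, this may partly reflect rows coerced to `NaT` rather than a true absence of requests, so the `NaT` count should be checked before drawing weekday conclusions.
- Peak demand occurs in the early morning (5–9 AM) and late evening (5–10 PM) on both days.
- Monday generally shows slightly higher activity than Tuesday across most hours, consistent with the earlier visuals.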

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Ensure higher driver availability during morning and evening peak hours on weekdays to reduce cancellations and missed rides.
Expand data collection to other days of the week to gain a full understanding of weekly trends and make more accurate demand forecasts.
Use this pattern to schedule driver shifts strategically for maximum efficiency and customer satisfaction.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned Uber dataset
df = pd.read_csv("Cleaned_Uber_Request_Data.csv")

# Convert timestamps to datetime format
df['request_timestamp'] = pd.to_datetime(df['request_timestamp'], errors='coerce')
df['drop_timestamp'] = pd.to_datetime(df['drop_timestamp'], errors='coerce')

# Filter only rows where trip was completed and timestamps are not missing
# .copy() avoids SettingWithCopyWarning when new columns are added below
completed = df[(df['status'] == 'Trip Completed') &
               df['request_timestamp'].notna() &
               df['drop_timestamp'].notna()].copy()

# Create 'request_hour' column if not already present
if 'request_hour' not in completed.columns:
    completed['request_hour'] = completed['request_timestamp'].dt.hour

# Calculate pickup delay in minutes
# Note: the dataset has no pickup timestamp, so this is request-to-drop time and includes the ride itself
completed['pickup_delay'] = (completed['drop_timestamp'] - completed['request_timestamp']).dt.total_seconds() / 60

# Filter out extreme or invalid values (optional)
completed = completed[(completed['pickup_delay'] >= 0) & (completed['pickup_delay'] <= 120)]

# Create the boxplot
plt.figure(figsize=(14, 6))
sns.boxplot(x='request_hour', y='pickup_delay', data=completed)

# Add chart details
plt.title("Pickup Delay (in minutes) Across Hours of the Day")
plt.xlabel("Hour of the Day")
plt.ylabel("Pickup Delay (minutes)")
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-We used a boxplot to visualize the distribution and variability of pickup delays across different hours of the day. Unlike bar charts or heatmaps, boxplots show spread, outliers, medians, and quartiles, all of which are useful for identifying patterns or anomalies in trip delays.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-
- Pickup delays (measured here as the time from request to drop, since the dataset has no pickup timestamp) are relatively high and consistently spread out across the day.
- Some hours (like 2 AM and 3 AM) have higher median delays, possibly due to fewer drivers being available.
- There is noticeable variation in the delay across hours, with several outliers (delays >75 mins), indicating that wait times may be unpredictable during certain periods.
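
The medians and spread read off the boxplot can also be tabulated, which makes it easier to rank hours. The sketch below uses hypothetical toy values with the same column names as the chart code. It computes the quartiles per hour with `groupby(...).quantile` and derives the interquartile range.

```python
import pandas as pd

# Hypothetical toy delays in minutes (not real data)
toy = pd.DataFrame({
    "request_hour": [2, 2, 2, 9, 9, 9],
    "pickup_delay": [40.0, 60.0, 80.0, 30.0, 35.0, 40.0],
})

# Quartiles per hour, reshaped so each quantile is a column
summary = (
    toy.groupby("request_hour")["pickup_delay"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
)

# Interquartile range as a simple measure of spread
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

Applied to the `completed` frame, sorting this table by `median` or `iqr` would list the hours with the longest or least predictable delays.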

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Improve driver availability during late-night and early-morning hours to reduce delays and increase customer satisfaction.
Use this hourly pickup delay trend to restructure driver shift allocations, especially for high-delay periods.
Consider using incentives or surge pricing to encourage drivers to be available during hours with the highest delay variance.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('Cleaned_Uber_Request_Data.csv')

# Filter only cancelled requests
cancelled = df[df['status'] == 'Cancelled']

# Calculate cancellation count per hour
cancelled_by_hour = cancelled['request_hour'].value_counts().sort_index()

# Calculate total requests per hour
total_by_hour = df['request_hour'].value_counts().sort_index()

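# Compute cancellation rate (%)
# Reindex so hours with no cancellations show as 0% instead of NaN
cancellation_rate = (cancelled_by_hour.reindex(total_by_hour.index, fill_value=0) / total_by_hour) * 100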

# Plot
plt.figure(figsize=(12, 6))
plt.plot(cancellation_rate.index, cancellation_rate.values, marker='o', color='crimson')
plt.title("Cancellation Rate by Hour of the Day")
plt.xlabel("Hour of the Day")
plt.ylabel("Cancellation Rate (%)")
plt.grid(True)
plt.xticks(range(0, 24))
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-This line chart clearly shows trends over time (hourly) for cancellation rates, which wouldn’t be easily understood using bar or pie charts. It highlights peak cancellation hours smoothly and is ideal for identifying patterns across a day.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-
- Cancellation rates spike sharply between 5 AM and 9 AM, peaking around 7 AM.
- This indicates that the morning rush faces either a severe lack of drivers or frequent rider cancellations.
- Cancellation rates remain relatively low and stable after noon.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Focus on addressing early morning ride cancellations:

Incentivize drivers to be available during 5 AM–9 AM.
Consider advance booking or priority allocation for high-demand morning hours.
Investigate if cancellations are rider-initiated or due to driver shortage, and act accordingly.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load cleaned dataset
df = pd.read_csv("Cleaned_Uber_Request_Data.csv")

# Convert request timestamp to datetime
df['request_timestamp'] = pd.to_datetime(df['request_timestamp'], errors='coerce')

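# Drop rows whose timestamp could not be parsed, then extract the hour
# (otherwise NaN hours would silently be labelled 'Late Night' below)
df = df.dropna(subset=['request_timestamp']).copy()
df['request_hour'] = df['request_timestamp'].dt.hour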

# Define a function to classify into dayparts
def get_daypart(hour):
    if 4 <= hour < 8:
        return 'Early Morning'
    elif 8 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 16:
        return 'Afternoon'
    elif 16 <= hour < 20:
        return 'Evening'
    elif 20 <= hour <= 23:
        return 'Night'
    else:
        return 'Late Night'

# Apply the function to create a new column
df['daypart'] = df['request_hour'].apply(get_daypart)

# Group by pickup point and daypart
daypart_counts = df.groupby(['pickup_point', 'daypart']).size().unstack(fill_value=0)

# Reorder the columns for a consistent daypart sequence (a missing daypart becomes a column of zeros)
daypart_order = ['Early Morning', 'Morning', 'Afternoon', 'Evening', 'Night', 'Late Night']
daypart_counts = daypart_counts.reindex(columns=daypart_order, fill_value=0)

# Plotting
daypart_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='Set2')

# Chart labels and formatting
plt.title("Ride Requests by Pickup Point Across Dayparts")
plt.xlabel("Pickup Point")
plt.ylabel("Number of Requests")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This stacked bar chart shows how much each part of the day (daypart) contributes to the total number of ride requests at each pickup point (City or Airport). It gives more operational insight than the hourly trend alone.

##### 2. What is/are the insight(s) found from the chart?

The City dominates ride requests across all dayparts, especially during Morning and Evening.
Airport requests are mostly concentrated in the Early Morning and Night, hinting at flight schedules and airport activity.
Afternoon and Late Night see relatively low demand across both pickup points.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Increase driver availability at the Airport during night/early morning.
Deploy more drivers in the City during peak dayparts (Morning/Evening).
This segmentation allows better time-based resource allocation to meet user demand more efficiently.
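
To turn this segmentation into an allocation signal, request counts can be paired with the share of requests that went unfulfilled in each daypart. The following is a minimal sketch on made-up rows; it uses `pd.cut` as an alternative to the `get_daypart` function with the same hour boundaries, and it assumes completed rides carry the label 'Trip Completed'.

```python
import pandas as pd

# Made-up rows for illustration only; 'Trip Completed' is an assumed label.
sample = pd.DataFrame({
    "pickup_point": ["City", "City", "City", "Airport", "Airport", "Airport"],
    "request_hour": [5, 6, 18, 21, 22, 13],
    "status": ["Cancelled", "Trip Completed", "Trip Completed",
               "No Cars Available", "No Cars Available", "Trip Completed"],
})

# Same boundaries as get_daypart: left-closed bins [0,4), [4,8), ..., [20,24).
bins = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late Night", "Early Morning", "Morning", "Afternoon", "Evening", "Night"]
sample["daypart"] = pd.cut(sample["request_hour"], bins=bins, labels=labels, right=False)

# A request counts as unfulfilled when it did not end in a completed trip.
sample["unfulfilled"] = sample["status"] != "Trip Completed"

# Percentage of unfulfilled requests per pickup point and daypart.
unfulfilled_share = (
    sample.groupby(["pickup_point", "daypart"], observed=True)["unfulfilled"].mean() * 100
)

print(unfulfilled_share)
```

Unlike the `apply`-based function, `pd.cut` leaves a missing hour as missing instead of assigning it to a daypart.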


#### Chart - 13

In [None]:
# Chart - 13 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the cleaned Uber dataset
df = pd.read_csv("Cleaned_Uber_Request_Data.csv")

# Convert request timestamp to datetime
df['request_timestamp'] = pd.to_datetime(df['request_timestamp'], errors='coerce')

# Sort the DataFrame by timestamp
df = df.sort_values(by='request_timestamp')

# Calculate time gaps between requests for each pickup point
df['time_gap'] = df.groupby('pickup_point')['request_timestamp'].diff().dt.total_seconds() / 60  # in minutes

# Drop NA values from first entries in each group
gap_data = df.dropna(subset=['time_gap'])

# Plot histograms of time gaps for each pickup point
plt.figure(figsize=(12, 6))
for point in gap_data['pickup_point'].unique():
    subset = gap_data[gap_data['pickup_point'] == point]
    plt.hist(subset['time_gap'], bins=50, alpha=0.6, label=point)

# Chart formatting
plt.title('Distribution of Time Gaps Between Ride Requests by Pickup Point')
plt.xlabel('Time Gap Between Requests (minutes)')
plt.ylabel('Frequency')
plt.legend(title='Pickup Point')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This histogram compares the time between consecutive ride requests at each pickup point (Airport vs City). It shows where demand is continuous and where it is sporadic, which the previous visuals do not capture.

##### 2. What is/are the insight(s) found from the chart?

City ride requests have shorter time gaps, indicating frequent and consistent demand.
Airport shows longer time gaps between requests, likely due to fewer users or longer intervals between flights.
There’s a clear contrast in usage patterns by location.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Concentrate more drivers in the City, where demand is continuous.
For the Airport, consider scheduling driver arrivals based on expected flight schedules or passenger clusters.
This helps optimize the fleet and prevents driver idle time in low-frequency zones.
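
Overlapping histograms can be hard to compare by eye, so a small summary table of the gaps per pickup point is a useful companion. This is a sketch on made-up timestamps (the dates are arbitrary and not taken from the dataset); it repeats the same `groupby(...).diff()` step used for the chart.

```python
import pandas as pd

# Made-up timestamps for illustration only.
sample = pd.DataFrame({
    "pickup_point": ["City", "Airport", "City", "City", "Airport", "Airport"],
    "request_timestamp": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 08:00", "2024-01-01 08:02",
        "2024-01-01 08:06", "2024-01-01 08:10", "2024-01-01 08:30",
    ]),
})

# Minutes between consecutive requests at the same pickup point.
sample = sample.sort_values("request_timestamp")
sample["time_gap"] = (
    sample.groupby("pickup_point")["request_timestamp"].diff().dt.total_seconds() / 60
)

# Count, median and mean gap per pickup point (first request of each group has no gap).
gap_summary = sample.groupby("pickup_point")["time_gap"].agg(["count", "median", "mean"])

print(gap_summary)
```

The median is usually the safer figure to quote here, because overnight lulls can produce a few very long gaps that inflate the mean.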

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the cleaned Uber dataset
df = pd.read_csv("Cleaned_Uber_Request_Data.csv")

# Convert timestamps to datetime format
df['request_timestamp'] = pd.to_datetime(df['request_timestamp'], errors='coerce')
df['drop_timestamp'] = pd.to_datetime(df['drop_timestamp'], errors='coerce')

# Time from request to drop-off in minutes (includes any wait before pickup;
# NaN for rows that have no drop_timestamp)
df['trip_duration'] = (df['drop_timestamp'] - df['request_timestamp']).dt.total_seconds() / 60

# Encode categorical variables into numeric
df_numeric = df.copy()
df_numeric['pickup_point'] = df_numeric['pickup_point'].astype('category').cat.codes
df_numeric['status'] = df_numeric['status'].astype('category').cat.codes

# Select only numeric columns relevant for correlation
corr_data = df_numeric[['pickup_point', 'status', 'request_hour', 'trip_duration']]

# Generate the correlation matrix
correlation_matrix = corr_data.corr()

# Plot the correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Uber Ride Features")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap gives a high-level view of the relationships between variables such as pickup point, trip status, request hour, and trip duration. It can reveal hidden patterns or signal potential multicollinearity.

##### 2. What is/are the insight(s) found from the chart?

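The heatmap reports a positive correlation between status and trip_duration, but this value needs caution: status is encoded with arbitrary category codes, and trip_duration can only be computed for rows that have a drop_timestamp. If only completed trips record a drop time, the figure says little about whether longer trips are more likely to be completed.
pickup_point and trip_duration are weakly negatively correlated, implying slightly shorter trips from the City than from the Airport.
request_hour shows no strong correlation with trip duration or status. Pearson correlation captures only linear trends, so this does not contradict the hourly peaks seen in the earlier charts.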
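
Because category codes impose an arbitrary numeric order on status, one alternative is to correlate against one indicator column per status instead. The sketch below uses made-up rows and assumed status labels; it illustrates the encoding and is not a recomputation of the heatmap above.

```python
import pandas as pd

# Made-up rows for illustration only; the status labels are assumptions.
sample = pd.DataFrame({
    "request_hour": [5, 6, 7, 12, 13, 18, 19, 20],
    "status": ["Cancelled", "Cancelled", "Cancelled", "Trip Completed",
               "Trip Completed", "No Cars Available", "No Cars Available",
               "Trip Completed"],
})

# One 0/1 indicator column per status, so no ordering between statuses is implied.
status_flags = pd.get_dummies(sample["status"], dtype=float)

# Correlate the hour with each indicator separately.
features = pd.concat([sample[["request_hour"]], status_flags], axis=1)
indicator_corr = features.corr()

print(indicator_corr.round(2))
```

Each coefficient then answers a single question, for example whether cancellations lean towards earlier or later hours, instead of depending on how the statuses happened to be numbered.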

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the cleaned Uber dataset
df = pd.read_csv("Cleaned_Uber_Request_Data.csv")

# Convert timestamps to datetime
df['request_timestamp'] = pd.to_datetime(df['request_timestamp'], errors='coerce')
df['drop_timestamp'] = pd.to_datetime(df['drop_timestamp'], errors='coerce')

# Time from request to drop-off in minutes (includes any wait before pickup;
# NaN for rows that have no drop_timestamp)
df['trip_duration'] = (df['drop_timestamp'] - df['request_timestamp']).dt.total_seconds() / 60

# Encode categorical variables
df['pickup_point_code'] = df['pickup_point'].astype('category').cat.codes
df['status_code'] = df['status'].astype('category').cat.codes

# Prepare final DataFrame for plotting
plot_df = df[['pickup_point_code', 'status_code', 'request_hour', 'trip_duration']]

# Drop NaNs and infinite values (float NaN keeps the columns numeric, unlike pd.NA)
plot_df = plot_df.replace([float('inf'), float('-inf')], float('nan')).dropna()

# Create the pair plot
sns.pairplot(plot_df, diag_kind='hist', plot_kws={'alpha': 0.5})
plt.suptitle("Pair Plot of Uber Ride Variables", y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The pair plot gives a compact overview of bivariate relationships and distributions across multiple variables. It shows in one visual how pickup location, request hour, status, and trip duration relate to each other.


##### 2. What is/are the insight(s) found from the chart?

Trip duration varies widely but shows no strong linear relationship with request hour, suggesting ride duration is not strongly dependent on time of day.
pickup_point_code and status_code are encoded categories, so their scatter plots form vertical bands; these panels show group membership rather than a numeric trend.
Request hours cluster around certain times, as the histogram shows.
Status code 2 dominates. Because cat.codes numbers the labels in sorted order, code 2 is the last label alphabetically (likely the completed-trip status rather than 'Cancelled'), and rows without a trip_duration are dropped before plotting, so this panel should not be read as evidence of a high cancellation rate.
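
Rather than guessing which status a code stands for, the mapping can be read directly from the categorical column. A minimal sketch, assuming the three status labels shown (the actual labels in the cleaned file may differ):

```python
import pandas as pd

# Assumed status labels, for illustration only.
status = pd.Series(
    ["Trip Completed", "Cancelled", "No Cars Available", "Cancelled"]
).astype("category")

# cat.codes index into cat.categories, which pandas sorts for string labels.
code_to_label = dict(enumerate(status.cat.categories))

print(code_to_label)
print(list(status.cat.codes))
```

Applied to the notebook's data, `dict(enumerate(df['status'].astype('category').cat.categories))` would give the same kind of lookup for `status_code`.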

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective, Uber should use historical request data to predict peak demand hours and locations, then proactively allocate more drivers during those periods, especially City mornings and Airport evenings. This should reduce cancellations, shorten wait times, and improve ride fulfillment.
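
As a first step towards such a prediction, historical requests can simply be ranked by how many went unfulfilled in each pickup point and hour. The sketch below is a frequency-based baseline on made-up rows, not a forecasting model, and it assumes completed rides are labelled 'Trip Completed'.

```python
import pandas as pd

# Made-up rows for illustration only; 'Trip Completed' is an assumed label.
sample = pd.DataFrame({
    "pickup_point": ["City"] * 4 + ["Airport"] * 4,
    "request_hour": [6, 6, 6, 14, 19, 19, 19, 10],
    "status": ["Cancelled", "Cancelled", "Trip Completed", "Trip Completed",
               "No Cars Available", "No Cars Available", "No Cars Available",
               "Trip Completed"],
})

# A request is unfulfilled when it did not end in a completed trip.
sample["unfulfilled"] = sample["status"] != "Trip Completed"

# Requests and unfulfilled requests per (pickup point, hour), worst windows first.
gap_by_window = (
    sample.groupby(["pickup_point", "request_hour"])["unfulfilled"]
    .agg(requests="count", unfulfilled="sum")
    .sort_values("unfulfilled", ascending=False)
)

# The windows with the largest gap are candidates for extra driver allocation.
top_windows = gap_by_window.head(2)
print(top_windows)
```

The top rows of this table give a shortlist of windows to target with incentives before investing in a full predictive model.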

# **Conclusion**

Through detailed analysis of Uber's ride request data, we identified clear supply-demand mismatches during peak hours, particularly in the mornings from the City and evenings from the Airport. High cancellation rates and driver unavailability during these periods highlight operational inefficiencies. By leveraging data-driven insights to optimize driver allocation, predict peak demand, and implement targeted interventions, Uber can significantly reduce missed rides, improve customer satisfaction, and enhance overall service efficiency.
