# 📚 Step 1: Importing Required Libraries

In this section, we import all the essential Python libraries for data manipulation, 
visualization, and exploratory data analysis (EDA).  

- **pandas**: For data loading and manipulation  
- **numpy**: For numerical computations  
- **matplotlib & seaborn**: For visualizations  
- **missingno**: For visualizing missing values  
- **warnings**: To suppress warning messages for cleaner outputs  

We’ll also set some plotting styles for consistency across the notebook.
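
Silencing every warning can also hide messages worth reading, such as deprecation notices. As a narrower option, the sketch below ignores only `FutureWarning` inside a `with` block. The `noisy()` helper is made up for illustration and is not part of this notebook.

```python
import warnings


def noisy():
    """Hypothetical helper that emits two different warning types."""
    warnings.warn("deprecated behaviour", FutureWarning)
    warnings.warn("something looks off", UserWarning)
    return 42


# record=True collects the warnings that pass the filters instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                            # show everything...
    warnings.filterwarnings("ignore", category=FutureWarning)  # ...except FutureWarning
    result = noisy()

remaining = [str(w.message) for w in caught]
print(result, remaining)
```

Outside the `with` block the previous warning filters are restored, so the suppression stays local to the noisy call.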


In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Missing value visualization
import missingno as msno

# Ignore warnings for clean outputs
import warnings
warnings.filterwarnings("ignore")

# Set visualization style
sns.set_theme(style="whitegrid", palette="pastel")

print("✅ Libraries imported successfully!")

# 📂 Step 2: Data Loading & Initial Exploration

In this step, we will:
- Load the dataset into a Pandas DataFrame  
- Check the shape (rows × columns)  
- Preview the first few rows to understand the structure  
- Display column information (data types, null values)  

This gives us the **first feel of the dataset** before cleaning or transformation.

In [None]:
# Load dataset
df = pd.read_csv("/kaggle/input/uber-ride-analytics-dashboard/ncr_ride_bookings.csv")

# Shape of the dataset
print("Dataset Shape:", df.shape)

# Preview first 5 rows
display(df.head())

# Info about dataset (columns, dtypes, non-null counts)
df.info()

# 🔎 Step 2 Analysis: Data Loading & Structure

From the `info()` output, here are the key observations about our dataset:

- **Rows & Columns**:  
  The dataset contains **150,000 rows** and **21 columns**, which is fairly large and suitable for deriving strong insights.

- **Data Types**:  
  - **Categorical/Object columns (12 total)**: Date, Time, Booking/Customer IDs, Booking Status, Vehicle Type, Pickup & Drop Locations, Cancellation/Incomplete reasons, and Payment Method.  
  - **Numerical/Float columns (9 total)**: Avg VTAT, Avg CTAT, Cancelled/Incomplete rides, Booking Value, Ride Distance, Ratings.

- **Missing Values**:  
  - **Highly sparse features**:  
    - *Cancelled Rides by Customer* (only 10,500 non-nulls, ~93% missing)  
    - *Reason for cancelling by Customer* (10,500 non-nulls)  
    - *Cancelled Rides by Driver* (27,000 non-nulls, ~82% missing)  
    - *Incomplete Rides* & *Incomplete Rides Reason* (9,000 non-nulls, ~94% missing)  
  - **Moderately sparse features**:  
    - *Booking Value*, *Ride Distance*, *Payment Method* (~32% missing each)  
    - *Driver Ratings* & *Customer Rating* (~38% missing each)  
  - **Well-filled features**: Date, Time, Booking/Customer IDs, Status, Vehicle Type, Pickup/Drop locations (no missing values).  

- **Memory Usage**: ~24 MB — lightweight enough for Kaggle analysis without optimization.

📌 **Implications**:  
- Columns like cancellations/incompletion are **event-driven** (filled only when such cases occur), so missing values may actually represent “not applicable”.  
- Booking value, distance, and ratings will require **careful handling of NaNs** in later analysis.  
- IDs will mostly serve as identifiers, not features for EDA.
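
The dtype tally, non-null counts, and memory figure above can also be pulled out programmatically instead of being read off `info()`. This is a minimal sketch on a made-up four-row frame (`toy` and its values are assumptions, not the real dataset).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ride bookings frame (values are made up).
toy = pd.DataFrame({
    "Booking Status": ["Completed", "Cancelled by Driver", "Completed", "Incomplete"],
    "Vehicle Type": ["Auto", "Go Mini", "Bike", "Auto"],
    "Booking Value": [250.0, np.nan, 120.0, 90.0],
    "Cancelled Rides by Driver": [np.nan, 1.0, np.nan, np.nan],
})

# Number of columns per dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Non-null count and missing share per column, as a single table.
overview = pd.DataFrame({
    "non_null": toy.notna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})

# deep=True also counts the Python strings held in text columns.
memory_bytes = toy.memory_usage(deep=True).sum()

print(dtype_counts)
print(overview)
print(f"{memory_bytes} bytes")
```

On the full dataset the same three expressions would run on `df` directly.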

---

✅ Next Step: We’ll move to a **Data Quality & Missing Values Check** (using `missingno` and summary stats) to get a visual sense of missingness and distributions.

# 🧹 Step 3: Data Quality & Missing Values Check

In this section, we will:
- Generate a **summary of missing values** in each column.  
- Visualize missingness using the **`missingno` library**.  
- Check basic descriptive statistics of numerical columns to understand ranges and outliers.

This will help us decide the best strategy for handling missing values in later steps 
(e.g., imputation, treating NaN as "no event", or dropping columns).

In [None]:
# 1. Missing values summary
missing_summary = df.isnull().sum().sort_values(ascending=False)
missing_percent = (missing_summary / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Values': missing_summary,
    'Missing %': missing_percent.round(2)
})

print("🔍 Missing Values Summary:")
display(missing_df)

# 2. Visualize missingness (pass figsize to missingno, which sizes its own figure)
msno.bar(df, figsize=(10, 5))
plt.show()

# 3. Visualize matrix of missing data (pattern view)
msno.matrix(df, figsize=(12, 6))
plt.show()

# 4. Quick statistics of numerical features
display(df.describe().T)

# 🧹 Step 3 Analysis: Data Quality & Missing Values

After examining the missing values summary, the visualizations, and the descriptive statistics, here are the key insights:

---

### 1. Missing Data Insights
- **Highly Sparse Features**  
  - *Incomplete Rides* & *Incomplete Rides Reason*: ~94% missing  
  - *Cancelled Rides by Customer* & *Reason for Cancelling by Customer*: ~93% missing  
  - *Cancelled Rides by Driver* & *Driver Cancellation Reason*: ~82% missing  
  ➡️ These are **event-driven** columns, filled only when cancellations/incompletion occur.  
  - NaN here does **not mean lost data**, but rather “not applicable” (i.e., that event did not occur for the booking).  

- **Moderately Sparse Features**  
  - *Booking Value*, *Ride Distance*, *Payment Method*, *Avg CTAT*: ~32% missing  
  - *Driver Ratings*, *Customer Rating*: ~38% missing  
  ➡️ These are critical features and missingness seems tied to ride outcomes (e.g., no value when a ride was incomplete).  

- **Well-Populated Features**  
  - *Date, Time, Booking Status, Vehicle Type, Pickup & Drop Locations*: 0% missing  
  ➡️ Reliable for temporal and categorical analysis.  

---

### 2. Patterns in Missingness
- `missingno` plots confirm that missing values cluster together:
  - Cancellations and their reasons are missing **together** for completed rides.  
  - Booking value, ride distance, and payment method are missing **together**, indicating that financials are only recorded when a ride is completed.  
  - Ratings are missing whenever rides are incomplete or cancelled.  
➡️ Missingness is **structural (not random)**, which makes handling more straightforward.  
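
One way to check the structural-missingness claim numerically, rather than by eye from the `missingno` plots, is to compute the share of missing values within each booking status. The sketch below uses a made-up six-row frame, so the statuses and fares are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: fares exist only for completed rides (made-up values).
toy = pd.DataFrame({
    "Booking Status": ["Completed", "Completed", "Cancelled by Driver",
                       "Cancelled by Customer", "No Driver Found", "Completed"],
    "Booking Value": [320.0, 150.0, np.nan, np.nan, np.nan, 610.0],
})

# Share of rows with a missing Booking Value within each status.
missing_rate = (
    toy["Booking Value"].isna()
    .groupby(toy["Booking Status"])
    .mean()
)
print(missing_rate)
```

Rates near 0 for some statuses and near 1 for others would support the reading that missingness follows ride outcome rather than chance.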

---

### 3. Numerical Feature Summary
- **Avg VTAT (Vehicle Wait Time)**:  
  Mean ~8.5 min, range 2–20. Reasonable for ride-hailing pickup times.  

- **Avg CTAT (Customer Trip Time)**:  
  Mean ~29 min, range 10–45. Reflects typical trip durations.  

- **Cancellations & Incompletion**:  
  Always recorded as `1` when the event occurs, missing otherwise → can be safely converted into **binary flags (0 = No, 1 = Yes)**.  

- **Booking Value**:  
  Mean ~₹508, with a wide range (₹20 – ₹4277). Strong right skew → may require log transformation.  

- **Ride Distance**:  
  Mean ~24.6 km, capped at 50 km → suggests long rides are common, with system-imposed limits.  

- **Driver & Customer Ratings**:  
  Centered at 4.2–4.4 (out of 5), with low variance. Indicates overall high satisfaction, but subtle differences across segments may be meaningful.  
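
Before treating the cancellation and incompletion columns as binary flags, it is worth confirming that their non-null entries really are all `1`. A minimal sketch on made-up data (the `toy` frame is an assumption):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Cancelled Rides by Customer": [np.nan, 1.0, np.nan, np.nan],
    "Incomplete Rides": [np.nan, np.nan, 1.0, np.nan],
})

# Every non-null entry should be exactly 1 before NaN is read as "no event".
only_ones = {col: set(toy[col].dropna().unique()) == {1.0} for col in toy.columns}

# Convert to compact integer flags: 1 = event occurred, 0 = no event.
flags = toy.fillna(0).astype("int8")

print(only_ones)
print(flags)
```

If any column reported `False` here, the values would be counts rather than flags and filling with `0` would need a second look.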

---

### 4. Data Cleaning Implications
- **Cancellations & Incompletion**: Replace NaN with `0` (no event).  
- **Booking Value, Distance, Payment, Ratings**: Check relationship with booking status before imputing/dropping.  
- **Outlier Handling**: Booking value likely has extreme highs; consider winsorization or log scaling.  
- **Date/Time Transformation**: Must combine into `datetime` and extract useful features for demand analysis.  
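
As a rough illustration of the winsorization option, the sketch below clips a small made-up fare series at its 5th and 95th percentiles. The cut-offs are an assumption chosen for the example, not a recommendation for this dataset.

```python
import pandas as pd

# Made-up fares with one extreme value at the top.
values = pd.Series([20, 150, 300, 450, 500, 520, 610, 700, 900, 4277], dtype="float64")

# Percentile bounds (5% / 95% are illustrative choices).
lower, upper = values.quantile([0.05, 0.95])

# Winsorize: pull values outside the bounds back to the bounds.
winsorized = values.clip(lower=lower, upper=upper)

print(values.describe())
print(winsorized.describe())
```

Clipping keeps every row, which matters when the fare is analyzed alongside other columns. The median is unchanged while the mean moves toward it.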

---

✅ **Conclusion**:  
The dataset is mostly clean with well-structured event-driven missingness. With proper feature engineering and handling of NaNs, it’s ready for **temporal, categorical, and behavioral analysis**.  

Next, we’ll proceed to **Feature Engineering & Preprocessing** to prepare the dataset for EDA.

# 🛠️ Step 4: Feature Engineering & Preprocessing

In this step, we will prepare the dataset for analysis by performing the following:

1. **Datetime Transformation**  
   - Combine `Date` and `Time` columns into a single `datetime` column.  
   - Extract useful temporal features like `Year`, `Month`, `Day`, `Weekday`, and `Hour`.  

2. **Event-Driven NaN Handling**  
   - Replace NaN in `Cancelled Rides by Customer/Driver` and `Incomplete Rides` with `0`.  
   - For reason columns, replace NaN with `"Not Applicable"`.  

3. **Categorical Cleaning**  
   - Ensure IDs have consistent formatting (strip quotes).  
   - Standardize text columns (strip whitespace).  

This ensures the dataset is analysis-ready for temporal, categorical, and behavioral insights.

In [None]:
# 1. Datetime Transformation
df['datetime'] = pd.to_datetime(df['Date'] + " " + df['Time'], errors='coerce')

# Extract features
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['day'] = df['datetime'].dt.day
df['weekday'] = df['datetime'].dt.day_name()
df['hour'] = df['datetime'].dt.hour

# 2. Event-Driven NaN Handling
event_cols_binary = [
    'Cancelled Rides by Customer',
    'Cancelled Rides by Driver',
    'Incomplete Rides'
]

event_cols_reason = [
    'Reason for cancelling by Customer',
    'Driver Cancellation Reason',
    'Incomplete Rides Reason'
]

# Binary flags → fill NaN with 0
df[event_cols_binary] = df[event_cols_binary].fillna(0)

# Reasons → fill NaN with "Not Applicable"
df[event_cols_reason] = df[event_cols_reason].fillna("Not Applicable")

# 3. Categorical Cleaning
id_cols = ['Booking ID', 'Customer ID']
for col in id_cols:
    df[col] = df[col].astype(str).str.replace('"', '').str.strip()

cat_cols = ['Booking Status', 'Vehicle Type', 'Pickup Location', 'Drop Location', 'Payment Method']
for col in cat_cols:
    # .str.strip() leaves NaN untouched; astype(str) may turn NaN into the text "nan"
    df[col] = df[col].str.strip()

print("✅ Feature Engineering & Preprocessing completed!")
df.head()

# 🛠️ Step 4 Analysis: Feature Engineering & Preprocessing

### ✅ Transformations Applied
1. **Datetime Handling**
   - Combined `Date` and `Time` into a single `datetime` column.  
   - Extracted additional features: `year`, `month`, `day`, `weekday`, `hour`.  
   - These will allow **temporal trend analysis** (e.g., busiest hours, weekday vs weekend patterns, seasonal demand).

2. **Event-Driven NaN Handling**
   - Replaced NaN in `Cancelled Rides by Customer`, `Cancelled Rides by Driver`, and `Incomplete Rides` with **0** (no event).  
   - Filled reason columns (`Reason for cancelling by Customer`, `Driver Cancellation Reason`, `Incomplete Rides Reason`) with **"Not Applicable"**.  
   - This preserves categorical integrity without losing information.

3. **Categorical Cleaning**
   - Cleaned `Booking ID` and `Customer ID` by removing unwanted quotes.  
   - Stripped whitespace in categorical columns like `Booking Status`, `Vehicle Type`, `Pickup Location`, `Drop Location`, and `Payment Method`.  
   - Ensures consistent grouping in visualizations.
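
For the datetime step in particular, `errors='coerce'` silently turns unparseable rows into `NaT`, so it helps to count them afterwards. The sketch below assumes an ISO-style `YYYY-MM-DD HH:MM:SS` layout and uses made-up rows. The `is_weekend` column is a hypothetical extra feature, not one created in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2024-03-23", "2024-11-29", "not a date"],
    "Time": ["12:29:38", "18:01:39", "08:56:10"],
})

# Explicit format (an assumption about the file layout); bad rows become NaT.
toy["datetime"] = pd.to_datetime(
    toy["Date"] + " " + toy["Time"],
    format="%Y-%m-%d %H:%M:%S",
    errors="coerce",
)

# How many rows failed to parse?
n_unparsed = int(toy["datetime"].isna().sum())

# Monday=0 ... Sunday=6, so 5 and 6 are the weekend; NaT rows evaluate to False.
toy["is_weekend"] = toy["datetime"].dt.dayofweek >= 5

print(n_unparsed)
print(toy)
```

A non-zero `n_unparsed` on the real data would mean some rides drop out of every hour, weekday, and month plot.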

---

### 🔑 Why This Matters
- **Temporal features** will help us analyze demand peaks, cancellation trends by time, and ride durations by day-of-week.  
- **Event-driven NaN replacement** makes cancellation/incompletion variables usable as **binary flags**, which can be directly analyzed.  
- **Clean categorical values** prevent duplicate categories (e.g., `"UPI "` vs `"UPI"`).  
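
The effect of whitespace stripping on category counts can be seen on a tiny made-up series. The values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

payments = pd.Series(["UPI ", "UPI", " Cash", np.nan, "Cash"], name="Payment Method")

# Distinct labels before cleaning (NaN is not counted by nunique).
before = payments.nunique()

# The .str accessor strips text and leaves missing values as missing.
cleaned = payments.str.strip()
after = cleaned.nunique()

print(before, after)
print(cleaned.value_counts(dropna=False))
```

Stray spaces double the number of apparent categories here, and the missing entry stays missing after cleaning.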

---

### 📌 Next Step
We’re ready to begin **Exploratory Data Analysis (EDA)**.  
We’ll start with **Univariate Analysis**:
- Distribution of `Booking Status`  
- Popularity of `Vehicle Type`  
- Payment Method usage  
- Booking Value and Ride Distance distributions  
- Ratings overview  

This will give us the first big-picture view of rides in the NCR dataset.

# 📊 Step 5: Univariate Analysis

In this step, we analyze each feature **individually** to understand the basic distributions and trends.  
This includes both **categorical features** (Booking Status, Vehicle Type, Payment Method, Locations) and **numerical features** (Booking Value, Ride Distance, Ratings, Times).

We’ll cover:

1. **Booking Outcomes** – distribution of Booking Status  
2. **Vehicle Type Preferences** – popularity among customers  
3. **Payment Methods** – cashless adoption vs others  
4. **Booking Value & Ride Distance** – distributions and skewness  
5. **Driver & Customer Ratings** – overall satisfaction levels  
6. **Temporal Trends (initial glance)** – busiest hours and weekdays

In [None]:
# --- 1. Booking Status Distribution ---
plt.figure(figsize=(8,5))
sns.countplot(x="Booking Status", data=df, order=df['Booking Status'].value_counts().index, palette="Set2")
plt.title("Booking Status Distribution", fontsize=14)
plt.xlabel("Booking Status")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

# --- 2. Vehicle Type Popularity ---
plt.figure(figsize=(8,5))
sns.countplot(y="Vehicle Type", data=df, order=df['Vehicle Type'].value_counts().index, palette="muted")
plt.title("Vehicle Type Popularity", fontsize=14)
plt.xlabel("Count")
plt.ylabel("Vehicle Type")
plt.show()

# --- 3. Payment Method Distribution ---
plt.figure(figsize=(8,5))
sns.countplot(x="Payment Method", data=df, order=df['Payment Method'].value_counts().index, palette="coolwarm")
plt.title("Payment Method Distribution", fontsize=14)
plt.xlabel("Payment Method")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

# --- 4a. Booking Value Distribution ---
plt.figure(figsize=(8,5))
sns.histplot(df['Booking Value'].dropna(), bins=50, kde=True, color="skyblue")
plt.title("Booking Value Distribution", fontsize=14)
plt.xlabel("Booking Value (₹)")
plt.ylabel("Frequency")
plt.show()

# --- 4b. Ride Distance Distribution ---
plt.figure(figsize=(8,5))
sns.histplot(df['Ride Distance'].dropna(), bins=50, kde=True, color="orange")
plt.title("Ride Distance Distribution", fontsize=14)
plt.xlabel("Ride Distance (km)")
plt.ylabel("Frequency")
plt.show()

# --- 5. Ratings Distributions ---
fig, axes = plt.subplots(1,2, figsize=(14,5))
sns.histplot(df['Driver Ratings'].dropna(), bins=20, kde=False, ax=axes[0], color="teal")
axes[0].set_title("Driver Ratings Distribution")
axes[0].set_xlabel("Driver Rating")
axes[0].set_ylabel("Count")

sns.histplot(df['Customer Rating'].dropna(), bins=20, kde=False, ax=axes[1], color="purple")
axes[1].set_title("Customer Ratings Distribution")
axes[1].set_xlabel("Customer Rating")
axes[1].set_ylabel("Count")
plt.show()

# --- 6a. Hourly Ride Distribution ---
plt.figure(figsize=(10,5))
sns.countplot(x="hour", data=df, palette="viridis")
plt.title("Rides by Hour of Day", fontsize=14)
plt.xlabel("Hour of Day")
plt.ylabel("Number of Rides")
plt.show()

# --- 6b. Weekday Ride Distribution ---
plt.figure(figsize=(8,5))
sns.countplot(x="weekday", data=df, order=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"], palette="Spectral")
plt.title("Rides by Weekday", fontsize=14)
plt.xlabel("Day of Week")
plt.ylabel("Number of Rides")
plt.show()

# 📊 Step 5 Analysis: Univariate Insights

From the univariate plots, here are the major observations:

---

### 1. Booking Status Distribution
- ✅ **Completed rides dominate** the dataset, accounting for the majority of trips.  
- ❌ Among cancellations:  
  - **Driver cancellations** are the largest contributor, followed by **“No Driver Found”** and **Customer cancellations**.  
  - A smaller share of rides are marked as **Incomplete**.  
➡️ Suggests a potential supply-demand mismatch at times (drivers cancel or not available).
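
To put numbers next to the bar chart, the status shares can be computed with `value_counts(normalize=True)`. The counts in this sketch are made up and do not reflect the real dataset.

```python
import pandas as pd

# Made-up sample of ten bookings.
status = pd.Series(
    ["Completed"] * 6 + ["Cancelled by Driver"] * 2 + ["No Driver Found", "Incomplete"],
    name="Booking Status",
)

# Percentage of bookings in each status.
share_pct = (status.value_counts(normalize=True) * 100).round(1)
print(share_pct)
```

On the notebook's data the equivalent call would be on `df['Booking Status']`.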

---

### 2. Vehicle Type Popularity
- **Auto** is the most frequently used ride type, showing strong urban reliance.  
- **Go Mini** and **Go Sedan** follow closely, indicating balanced demand for small and medium cars.  
- **Bikes** have notable popularity (likely for short trips, low cost, and traffic convenience).  
- **Premier Sedan** serves a smaller premium niche, while **eBike** and **Uber XL** remain relatively rare.  
➡️ The platform caters to both **mass-market (Auto, Mini, Sedan, Bike)** and **premium users (Premier Sedan, XL)**.

---

### 3. Payment Method Usage
- **UPI** is the most common non-missing payment method, reflecting India’s digital payment adoption.  
- **Cash** is still widely used, suggesting hybrid adoption.  
- **Credit/Debit Cards** and **Wallets** are smaller contributors.  
- A large portion of Payment Method values are **NaN**, likely due to incomplete/cancelled rides where no payment was processed.  
➡️ Strong shift towards **UPI-led cashless economy**, but cash still relevant.

---

### 4. Booking Value Distribution
- Distribution is **right-skewed**:  
  - Most rides cost under ₹1000.  
  - A small fraction of outliers extend beyond ₹2000–₹4000.  
- Suggests the need for **log transformation** or outlier handling before modeling.  
➡️ Pricing appears consistent with NCR intercity ride market, but extreme values may distort averages.
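
Skewness can be quantified rather than judged from the histogram alone. The sketch below draws synthetic log-normal "fares" (an assumption standing in for `Booking Value`) and compares skewness before and after `log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed fares; parameters are arbitrary illustration values.
rng = np.random.default_rng(0)
fares = pd.Series(rng.lognormal(mean=6.0, sigma=0.8, size=5000))

skew_raw = fares.skew()            # sample skewness of the raw values
skew_log = np.log1p(fares).skew()  # skewness after log(1 + x)

print(f"raw skew: {skew_raw:.2f}, log1p skew: {skew_log:.2f}")
```

A skewness near zero after the transform suggests the log scale is a reasonable choice. If it stays large, clipping or a different transform may suit the data better.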

---

### 5. Ride Distance Distribution
- Distances are capped at **50 km**, with a relatively uniform spread beyond ~15 km.  
- Many rides fall between **5 km and 20 km**, which fits intra-city commutes, but the mean of ~24.6 km shows that longer trips make up a large share.  
➡️ Indicates NCR rides serve both **short daily trips** and **longer commutes**, with a mix of demand.

---

### 6. Ratings Distribution
- **Driver Ratings**: Centered around **4.2–4.4**, but with a tail of lower ratings (<3.5).  
- **Customer Ratings**: Heavily skewed to **5 stars**, showing customers rate drivers generously.  
➡️ Ratings inflation exists, but occasional low driver ratings may signal service quality issues.

---

### 7. Temporal Trends
- **Hourly Trends**:  
  - Low rides between midnight–5 AM.  
  - Sharp rise from **6 AM**, peaking at **10 AM (morning commute)**.  
  - A second peak at **5–7 PM (evening commute)**.  
  - Decline after 9 PM.  
- **Weekday Trends**:  
  - Rides are fairly **evenly spread across all weekdays**, including weekends.  
  - Slightly higher activity on **Mondays and Fridays**, possibly due to office commuting.  
➡️ NCR mobility follows a **bi-modal commute pattern** typical of metro cities.

---

### 📌 Key Takeaways
1. **Completion rate is strong**, but driver cancellations are a bottleneck.  
2. **Autos and budget cars dominate**, while premium categories remain niche.  
3. **UPI has overtaken cash**, but hybrid modes still exist.  
4. **Booking values are right-skewed and distances are capped at 50 km**, so transformations or outlier handling may be needed before modeling.  
5. **Customer satisfaction is high**, though driver ratings show more variance.  
6. **Morning and evening commute peaks** drive the majority of ride demand.  

---

✅ This univariate analysis sets the stage for deeper exploration of **relationships (bivariate analysis)** — e.g., cancellations by vehicle type, booking value vs distance, ratings vs payment method, etc.

# 🔗 Step 6: Bivariate Analysis

In this step, we explore **relationships between two variables** at a time.  
This helps us understand how different features interact with each other and influence ride outcomes, pricing, and customer satisfaction.

We’ll cover:

1. **Numerical Relationships**
   - Booking Value vs Ride Distance  
   - Booking Value vs Avg CTAT  
   - Correlation Heatmap for numerical features  

2. **Categorical vs Categorical**
   - Booking Status vs Payment Method  
   - Booking Status vs Vehicle Type  
   - Cancellation Reasons (Customer vs Driver)  

3. **Categorical vs Numerical**
   - Booking Value by Vehicle Type  
   - Ride Distance by Vehicle Type  
   - Ratings vs Vehicle Type  
   - Ratings vs Payment Method  

This analysis reveals behavioral, operational, and financial insights.

In [None]:
# --- 1a. Booking Value vs Ride Distance ---
plt.figure(figsize=(8,6))
sns.scatterplot(x="Ride Distance", y="Booking Value", data=df, alpha=0.4)
plt.title("Booking Value vs Ride Distance", fontsize=14)
plt.xlabel("Ride Distance (km)")
plt.ylabel("Booking Value (₹)")
plt.show()

# --- 1b. Booking Value vs Avg CTAT (Trip Time) ---
plt.figure(figsize=(8,6))
sns.scatterplot(x="Avg CTAT", y="Booking Value", data=df, alpha=0.4, color="purple")
plt.title("Booking Value vs Avg CTAT", fontsize=14)
plt.xlabel("Average Trip Time (min)")
plt.ylabel("Booking Value (₹)")
plt.show()

# --- 1c. Correlation Heatmap ---
plt.figure(figsize=(10,6))
corr = df[['Avg VTAT','Avg CTAT','Booking Value','Ride Distance','Driver Ratings','Customer Rating']].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0, linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Features", fontsize=14)
plt.show()

# --- 2a. Booking Status vs Payment Method ---
plt.figure(figsize=(10,6))
sns.countplot(x="Booking Status", hue="Payment Method", data=df, palette="Set3")
plt.title("Booking Status vs Payment Method", fontsize=14)
plt.xlabel("Booking Status")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.legend(title="Payment Method", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()

# --- 2b. Booking Status vs Vehicle Type ---
plt.figure(figsize=(10,6))
sns.countplot(x="Booking Status", hue="Vehicle Type", data=df, palette="Spectral")
plt.title("Booking Status vs Vehicle Type", fontsize=14)
plt.xlabel("Booking Status")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.legend(title="Vehicle Type", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()

# --- 2c. Cancellation Reasons (Customer vs Driver) ---
fig, axes = plt.subplots(1,2, figsize=(16,6))

sns.countplot(y="Reason for cancelling by Customer", data=df, order=df['Reason for cancelling by Customer'].value_counts().index, ax=axes[0], palette="muted")
axes[0].set_title("Customer Cancellation Reasons")
axes[0].set_ylabel("Reason")
axes[0].set_xlabel("Count")

sns.countplot(y="Driver Cancellation Reason", data=df, order=df['Driver Cancellation Reason'].value_counts().index, ax=axes[1], palette="muted")
axes[1].set_title("Driver Cancellation Reasons")
axes[1].set_ylabel("Reason")
axes[1].set_xlabel("Count")

plt.tight_layout()
plt.show()

# --- 3a. Booking Value by Vehicle Type ---
plt.figure(figsize=(10,6))
sns.boxplot(x="Vehicle Type", y="Booking Value", data=df, palette="Set2")
plt.title("Booking Value by Vehicle Type", fontsize=14)
plt.xlabel("Vehicle Type")
plt.ylabel("Booking Value (₹)")
plt.xticks(rotation=45)
plt.show()

# --- 3b. Ride Distance by Vehicle Type ---
plt.figure(figsize=(10,6))
sns.boxplot(x="Vehicle Type", y="Ride Distance", data=df, palette="Set3")
plt.title("Ride Distance by Vehicle Type", fontsize=14)
plt.xlabel("Vehicle Type")
plt.ylabel("Ride Distance (km)")
plt.xticks(rotation=45)
plt.show()

# --- 3c. Ratings vs Vehicle Type ---
fig, axes = plt.subplots(1,2, figsize=(16,6))
sns.boxplot(x="Vehicle Type", y="Driver Ratings", data=df, ax=axes[0], palette="Blues")
axes[0].set_title("Driver Ratings by Vehicle Type")
axes[0].set_ylabel("Driver Rating")
axes[0].set_xlabel("Vehicle Type")
axes[0].tick_params(axis='x', rotation=45)

sns.boxplot(x="Vehicle Type", y="Customer Rating", data=df, ax=axes[1], palette="Purples")
axes[1].set_title("Customer Ratings by Vehicle Type")
axes[1].set_ylabel("Customer Rating")
axes[1].set_xlabel("Vehicle Type")
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# --- 3d. Ratings vs Payment Method ---
fig, axes = plt.subplots(1,2, figsize=(16,6))
sns.boxplot(x="Payment Method", y="Driver Ratings", data=df, ax=axes[0], palette="coolwarm")
axes[0].set_title("Driver Ratings by Payment Method")
axes[0].set_ylabel("Driver Rating")
axes[0].set_xlabel("Payment Method")
axes[0].tick_params(axis='x', rotation=45)

sns.boxplot(x="Payment Method", y="Customer Rating", data=df, ax=axes[1], palette="coolwarm")
axes[1].set_title("Customer Ratings by Payment Method")
axes[1].set_ylabel("Customer Rating")
axes[1].set_xlabel("Payment Method")
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# 🔗 Step 6 Analysis: Bivariate Insights

From the bivariate exploration of ride bookings, here are the detailed findings:

---

### 1. Numerical Relationships
- **Booking Value vs Ride Distance**  
  - Fares trend only loosely upward with distance, and the scatter is wide.  
  - Many rides cluster below **₹2000 for <20 km**, while outliers exist at high fares (>₹3000).  
  - The spread indicates additional factors (vehicle type, surge pricing, traffic) affect price beyond distance.

- **Booking Value vs Avg CTAT (Trip Time)**  
  - Positive association, but again noisy.  
  - Longer trip times (>30 min) tend to cost more, yet some short trips also show high fares (likely premium rides or congestion).

- **Correlation Heatmap**  
  - Weak linear correlations overall:  
    - `Ride Distance` and `Avg CTAT` show only a weak positive correlation (0.10).  
    - Booking Value surprisingly has very weak correlation with both distance and trip time, suggesting pricing depends on **vehicle type, surge, or location demand**.  
    - Ratings show almost **no correlation** with operational metrics, so satisfaction has no visible linear link to fare or distance.
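
Pearson correlation only measures linear association, so a weak coefficient does not rule out a monotonic but curved relationship. The sketch below contrasts Pearson with Spearman rank correlation on synthetic data. The exponential fare curve is an assumption chosen to make the difference visible, not a model of the real pricing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
distance = rng.uniform(1, 50, size=500)

toy = pd.DataFrame({
    "Ride Distance": distance,
    "Booking Value": np.exp(distance / 8),  # monotonic but far from linear
})

# Same frame, two correlation methods.
pearson = toy.corr(method="pearson").loc["Ride Distance", "Booking Value"]
spearman = toy.corr(method="spearman").loc["Ride Distance", "Booking Value"]

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If both coefficients stay near zero on the real columns, the weak link between fare and distance is unlikely to be an artifact of curvature.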

---

### 2. Categorical vs Categorical
- **Booking Status vs Payment Method**  
  - Completed rides dominate UPI and Cash payments.  
  - Cancellations or incomplete rides mostly fall under `NaN` payment method (since no transaction occurred).  
  - Among processed payments, UPI is the leader.

- **Booking Status vs Vehicle Type**  
  - `Auto`, `Go Mini`, and `Go Sedan` dominate completed rides.  
  - Driver cancellations are more frequent in budget categories (Auto, Mini), possibly due to low fare or mismatch issues.  
  - Premium types (Premier Sedan, XL) have fewer rides but a slightly higher completion ratio.

- **Cancellation Reasons**  
  - **Customer side**: "Change of plans" and "Wrong address" stand out as top reasons (excluding Not Applicable).  
  - **Driver side**: "Customer related issue" and "Personal & car related issues" dominate.  
  - Interesting: "Customer was coughing/sick" and "More than permitted people" hint at **COVID-era behavioral shifts and safety concerns**.
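
Grouped count plots make it hard to compare completion ratios between vehicle types with very different volumes. A row-normalized crosstab gives the ratio directly. The frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Vehicle Type": ["Auto", "Auto", "Auto", "Auto",
                     "Uber XL", "Uber XL", "Uber XL", "Uber XL"],
    "Booking Status": ["Completed", "Completed", "Cancelled by Driver", "Cancelled by Driver",
                       "Completed", "Completed", "Completed", "Cancelled by Driver"],
})

# normalize="index" turns each row into shares that sum to 1.
rates = pd.crosstab(toy["Vehicle Type"], toy["Booking Status"], normalize="index")
print(rates.round(2))
```

Each row then reads as "of all bookings for this vehicle type, what share ended in each status".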

---

### 3. Categorical vs Numerical
- **Booking Value by Vehicle Type**  
  - Premium vehicles (Uber XL, Premier Sedan) have higher median booking values.  
  - Autos and Bikes have the lowest fares, with tight distributions.  
  - All categories show extreme outliers (>₹3000).  

- **Ride Distance by Vehicle Type**  
  - Distances are fairly similar across categories (~25 km median).  
  - Suggests **vehicle choice is more about comfort/cost** than trip length.

- **Ratings by Vehicle Type**  
  - Customer ratings skew higher (closer to 5), especially for `Go Sedan` and `Premier Sedan`.  
  - Driver ratings are slightly lower, with more spread in `Auto` and `Bike` segments.  
  - This points to **service quality differences across ride types**.

- **Ratings by Payment Method**  
  - Ratings remain stable across payment modes (~4.2–4.5).  
  - Slightly higher customer ratings for **UPI and Card** transactions, possibly reflecting smoother experience vs cash.
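
Small differences between boxplots are easier to judge with the group sizes and medians side by side. A minimal sketch on made-up ratings (all values are assumptions):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Payment Method": ["UPI", "UPI", "UPI", "Cash", "Cash", np.nan],
    "Customer Rating": [4.5, 4.7, 4.3, 4.2, 4.4, np.nan],
})

# Rows with a missing Payment Method are dropped by groupby by default.
summary = (
    toy.groupby("Payment Method")["Customer Rating"]
    .agg(["count", "median", "mean"])
    .round(2)
)
print(summary)
```

The `count` column shows whether a gap in medians rests on a large group or a handful of rides.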

---

### 📌 Key Takeaways
1. **Distance and time only weakly explain fares** → pricing model likely includes **vehicle type, surge, and geography**.  
2. **Driver cancellations are a key operational issue**, especially in budget rides.  
3. **UPI dominates completed rides**, aligning with India’s fintech adoption trend.  
4. **Premium rides deliver higher satisfaction**, while budget rides show wider variance in ratings.  
5. **Cancellation reasons highlight operational inefficiencies** (wrong address, plan changes) and **driver constraints** (customer issues, car-related problems).  

---

✅ This bivariate analysis gives strong insights into **behavioral and operational drivers** of ride outcomes.  
Next, we should look at **Step 7: Temporal Trends** to explore how demand, cancellations, and revenue vary by **time of day, day of week, and month**.

# ⏰ Step 7: Temporal Trends

In this step, we analyze **time-based ride patterns** to understand demand cycles and operational bottlenecks.

We will explore:

1. **Hourly Trends**  
   - Number of rides per hour  
   - Cancellations and incomplete rides by hour  

2. **Weekday Trends**  
   - Ride demand by weekday  
   - Cancellations vs completions across weekdays  

3. **Monthly Trends**  
   - Seasonal trends in rides, cancellations, and revenue  

4. **Revenue Over Time**  
   - Total booking value trends by month and weekday  

These insights reveal how **time of day and seasonality** influence demand and cancellations.

In [None]:
# --- 1a. Hourly Ride Demand ---
plt.figure(figsize=(10,5))
sns.countplot(x="hour", data=df, palette="viridis")
plt.title("Ride Demand by Hour of Day", fontsize=14)
plt.xlabel("Hour of Day")
plt.ylabel("Number of Rides")
plt.show()

# --- 1b. Hourly Booking Status Breakdown ---
plt.figure(figsize=(12,6))
sns.countplot(x="hour", hue="Booking Status", data=df, palette="Spectral")
plt.title("Hourly Distribution of Booking Status", fontsize=14)
plt.xlabel("Hour of Day")
plt.ylabel("Number of Rides")
plt.legend(title="Booking Status", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()

# --- 2a. Weekday Ride Demand ---
plt.figure(figsize=(10,5))
sns.countplot(x="weekday", data=df, order=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"], palette="coolwarm")
plt.title("Ride Demand by Weekday", fontsize=14)
plt.xlabel("Day of Week")
plt.ylabel("Number of Rides")
plt.show()

# --- 2b. Weekday Booking Status Breakdown ---
plt.figure(figsize=(12,6))
sns.countplot(x="weekday", hue="Booking Status", 
              data=df, order=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"], palette="Set2")
plt.title("Booking Status by Weekday", fontsize=14)
plt.xlabel("Day of Week")
plt.ylabel("Number of Rides")
plt.legend(title="Booking Status", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()

# --- 3a. Monthly Ride Demand ---
plt.figure(figsize=(10,5))
sns.countplot(x="month", data=df, palette="plasma")
plt.title("Ride Demand by Month", fontsize=14)
plt.xlabel("Month")
plt.ylabel("Number of Rides")
plt.show()

# --- 3b. Monthly Booking Status Breakdown ---
plt.figure(figsize=(12,6))
sns.countplot(x="month", hue="Booking Status", data=df, palette="cubehelix")
plt.title("Booking Status by Month", fontsize=14)
plt.xlabel("Month")
plt.ylabel("Number of Rides")
plt.legend(title="Booking Status", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()

# --- 4. Revenue Trends by Month & Weekday ---
monthly_revenue = df.groupby("month")['Booking Value'].sum()
weekday_revenue = df.groupby("weekday")['Booking Value'].sum()

plt.figure(figsize=(10,5))
monthly_revenue.plot(kind="bar", color="teal")
plt.title("Total Revenue by Month", fontsize=14)
plt.xlabel("Month")
plt.ylabel("Total Booking Value (₹)")
plt.show()

plt.figure(figsize=(10,5))
weekday_revenue.loc[["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]].plot(kind="bar", color="coral")
plt.title("Total Revenue by Weekday", fontsize=14)
plt.xlabel("Day of Week")
plt.ylabel("Total Booking Value (₹)")
plt.show()

# ⏰ Step 7 Analysis: Temporal Trends

From the hourly, weekday, and monthly patterns, here are the key insights:

---

### 1. Hourly Trends
- **Demand Curve**:  
  - Very low between **12 AM – 5 AM** (late-night hours).  
  - Sharp rise from **6 AM**, peaking around **9–10 AM (morning commute)**.  
  - A second, stronger peak between **5–7 PM (evening commute)**.  
  - Decline after 9 PM, stabilizing at low levels post-midnight.  
➡️ This confirms a **bi-modal urban commute pattern** typical of metro cities.

- **Cancellations by Hour**:  
  - Driver cancellations spike during the **morning and evening peaks**, likely due to high demand vs availability mismatches.  
  - Customer cancellations are evenly spread, but still higher during busy hours.  
➡️ Peak-hour stress leads to both **supply constraints (drivers)** and **last-minute customer changes**.
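
Raw cancellation counts rise at peak hours simply because there are more bookings, so a per-hour cancellation rate is a fairer comparison. The sketch below uses a made-up frame. `driver_cancel` is a hypothetical helper column, not one created earlier in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "hour": [9, 9, 9, 9, 14, 14, 18, 18, 18, 18],
    "Booking Status": [
        "Completed", "Completed", "Cancelled by Driver", "Cancelled by Driver",
        "Completed", "Completed",
        "Completed", "Completed", "Completed", "Cancelled by Driver",
    ],
})

# True where the driver cancelled; the mean of a boolean column is a rate.
toy["driver_cancel"] = toy["Booking Status"].eq("Cancelled by Driver")

by_hour = toy.groupby("hour")["driver_cancel"].agg(rides="count", cancel_rate="mean")
print(by_hour)
```

If the rate is flat across hours while the counts peak, the spike reflects volume rather than a peak-hour supply problem.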

---

### 2. Weekday Trends
- **Demand Consistency**:  
  - Ride demand is fairly **balanced across all weekdays and weekends**, with only minor differences.  
  - Suggests rides are not just office commute but also leisure and personal travel.  

- **Booking Status by Weekday**:  
  - Completion rates remain stable across weekdays.  
  - Driver cancellations slightly higher on **weekdays**, possibly linked to **office commute rush hours**.  
➡️ Rather than a weekend-dominated pattern, NCR shows **steady demand all week**.

---

### 3. Monthly Trends
- **Ride Demand**:  
  - Seasonal variation is mild — demand is **relatively stable month to month**.  
  - Slight dips in **February** and **September**, but overall consistent volume.  

- **Booking Status by Month**:  
  - Completion remains dominant each month.  
  - Cancellation proportions (driver & customer) do not show sharp seasonal swings.  
➡️ Suggests NCR ride demand is **less seasonal and more consistent year-round**.

---

### 4. Revenue Trends
- **Monthly Revenue**:  
  - Total revenue stays around ₹4M–₹4.5M per month, with **slight spikes in January and March**.  
  - No major seasonal revenue shocks.  

- **Weekday Revenue**:  
  - Highest revenue on **weekends (Saturday & Sunday)**, despite demand being balanced.  
  - This indicates weekend trips are **longer or higher-value rides**, boosting revenue.  
➡️ Suggests leisure/social trips during weekends are more lucrative than weekday commutes.
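
The "higher-value weekend trips" reading can be checked by putting ride counts, total revenue, and average value per ride side by side. This is a minimal sketch assuming the `weekday` and `Booking Value` columns from the code above; the helper name `weekday_revenue_summary` and the toy values are hypothetical.

```python
import pandas as pd

WEEKDAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

def weekday_revenue_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Rides with a recorded value, total revenue, and mean value per weekday."""
    summary = frame.groupby("weekday")["Booking Value"].agg(
        rides="count",          # counts only rows with a non-null Booking Value
        total_revenue="sum",
        avg_value="mean",
    )
    # reindex keeps calendar order and tolerates weekdays absent from the data
    return summary.reindex(WEEKDAY_ORDER)

# Toy data standing in for the real df (values are made up)
toy = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Saturday", "Saturday"],
    "Booking Value": [200.0, 400.0, 500.0, 700.0],
})

summary = weekday_revenue_summary(toy)
print(summary.dropna())
```

If `avg_value` is similar across days on the real data, higher weekend totals would come from ride volume, not from longer or pricier trips.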

---

### 📌 Key Takeaways
1. **Strong Bi-Modal Demand** → 9 AM and 6 PM peaks dominate, reflecting commute travel.  
2. **Driver Cancellations Align with Peaks** → indicates supply-demand mismatch during rush hours.  
3. **Balanced Weekday vs Weekend Demand** → NCR rides are for both work and leisure.  
4. **Revenue Peaks on Weekends** → similar ride volumes but higher-value trips boost weekend revenues.  
5. **Stable Monthly Demand** → ride-hailing demand in NCR is robust and less seasonal, showing strong dependence on the service year-round.  

---

✅ Temporal analysis highlights **when operational bottlenecks occur** (peak-hour cancellations) and **when revenue concentrates** (weekends).  
Next, we move to **Step 8: Geospatial Insights**, analyzing **Pickup & Drop Location hotspots** and travel corridors between areas.

# 🌍 Step 8: Geospatial Insights

In this step, we analyze **pickup and drop locations** to identify high-demand hotspots and travel flows.

We will cover:

1. **Top 15 Pickup Locations** – busiest starting points.  
2. **Top 15 Drop Locations** – most popular destinations.  
3. **Pickup vs Drop Frequency Comparison** – to see if certain areas are more source-heavy vs destination-heavy.  
4. **Pickup-Drop Pair Analysis** – common travel corridors (if feasible).  

This helps identify **urban mobility hotspots** and operational planning opportunities.


In [None]:
# --- 1. Top Pickup Locations ---
plt.figure(figsize=(10,6))
pickup_counts = df['Pickup Location'].value_counts().head(15)
sns.barplot(y=pickup_counts.index, x=pickup_counts.values, palette="viridis")
plt.title("Top 15 Pickup Locations", fontsize=14)
plt.xlabel("Number of Rides")
plt.ylabel("Pickup Location")
plt.show()

# --- 2. Top Drop Locations ---
plt.figure(figsize=(10,6))
drop_counts = df['Drop Location'].value_counts().head(15)
sns.barplot(y=drop_counts.index, x=drop_counts.values, palette="plasma")
plt.title("Top 15 Drop Locations", fontsize=14)
plt.xlabel("Number of Rides")
plt.ylabel("Drop Location")
plt.show()

# --- 3. Pickup vs Drop Frequency Comparison ---
top_pickup = pickup_counts.index
pickup_vs_drop = pd.DataFrame({
    "Pickup": df['Pickup Location'].value_counts(),
    "Drop": df['Drop Location'].value_counts()
}).fillna(0).loc[top_pickup]

pickup_vs_drop.plot(kind="bar", figsize=(12,6), color=["skyblue","orange"])
plt.title("Pickup vs Drop Frequency (Top Pickup Locations)", fontsize=14)
plt.xlabel("Location")
plt.ylabel("Number of Rides")
plt.xticks(rotation=45)
plt.show()

# --- 4. Common Pickup-Drop Pairs (Top 15) ---
pair_counts = df.groupby(['Pickup Location','Drop Location']).size().reset_index(name='counts')
top_pairs = pair_counts.sort_values('counts', ascending=False).head(15).copy()
# Label each corridor once so pairs that share a pickup do not overlap on the same bar
top_pairs['Route'] = top_pairs['Pickup Location'] + " → " + top_pairs['Drop Location']

plt.figure(figsize=(12,6))
sns.barplot(x="counts", y="Route", data=top_pairs, color="seagreen")
plt.title("Top 15 Pickup-Drop Pairs", fontsize=14)
plt.xlabel("Number of Rides")
plt.ylabel("Pickup → Drop")
plt.show()

# 🌍 Step 8 Analysis: Geospatial Insights

Analyzing pickup and drop locations reveals **urban mobility hotspots** and important ride corridors in the NCR dataset.

---

### 1. Top Pickup Locations
- **Khandsa, Barakhamba Road, Saket, Badarpur, Pragati Maidan, and AIIMS** dominate as pickup hubs.  
- These areas are either **major residential hubs (Khandsa, Badarpur, Mehrauli)** or **commercial/office centers (Barakhamba Road, Pragati Maidan, AIIMS, Udyog Vihar)**.  
- **University and educational zones** like **Vishwavidyalaya, Kanhaiya Nagar** also appear, suggesting strong student and commuter demand.

---

### 2. Top Drop Locations
- Popular destinations include **Ashram, Cyber Hub, Kalkaji, Lajpat Nagar, Nehru Place, Kashmere Gate ISBT, Udyog Vihar**.  
- These are **business hubs (Cyber Hub, Nehru Place, Udyog Vihar)**, **residential areas (Ashram, Basai Dhankot, Punjabi Bagh)**, and **transit nodes (Kashmere Gate ISBT, Sarai Kale Khan)**.  
- This mix reflects **office commutes + intercity connections + local leisure trips**.

---

### 3. Pickup vs Drop Frequency (Top Pickup Points)
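- Across the **top 15 pickup areas**, **pickup volumes are consistently higher than drop counts**.  
- Part of this gap is a selection effect: the locations were chosen because their pickup counts are the highest, so they will tend to look source-heavy. With that caveat, these areas appear to act more as **sources of trips** (residential localities) than as balanced nodes.  
- For example, Khandsa and Saket generate more outbound rides than inbound, while **Udyog Vihar** is closer to balanced (a work hub with both inbound and outbound traffic).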
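
Ranking every location by net flow (pickups minus drops) avoids restricting the comparison to the top pickup points. The following is a sketch assuming the `Pickup Location` and `Drop Location` columns; the helper name `net_flow` and the toy rows are illustrative only.

```python
import pandas as pd

def net_flow(frame: pd.DataFrame) -> pd.DataFrame:
    """Pickup count, drop count, and their difference for every location."""
    pickups = frame["Pickup Location"].value_counts()
    drops = frame["Drop Location"].value_counts()
    # Aligning on the union of locations; a location missing on one side counts as 0
    flows = pd.DataFrame({"Pickup": pickups, "Drop": drops}).fillna(0).astype(int)
    flows["Net"] = flows["Pickup"] - flows["Drop"]
    return flows.sort_values("Net", ascending=False)

# Toy data standing in for the real df (values are made up)
toy = pd.DataFrame({
    "Pickup Location": ["Saket", "Saket", "Saket", "Khandsa"],
    "Drop Location": ["Cyber Hub", "Khandsa", "Cyber Hub", "Saket"],
})

flows = net_flow(toy)
print(flows)
```

Locations at the top of the table are source-heavy and those at the bottom are destination-heavy. When every ride has both a pickup and a drop, the `Net` column sums to zero.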

---

### 4. Common Pickup-Drop Pairs
- The top corridors include:  
  - **DLF City Court → Bhiwadi**  
  - **Connaught Place → Paharganj / Vidhan Sabha**  
  - **New Delhi Railway Station → IIT Delhi / Shahdara / Tilak Nagar**  
  - **Ghaziabad → Badshahpur / Sohna Road**  
- These highlight a mix of **commuter flows (work, university)** and **residential-to-business trips**.  
- Transit points like **Railway Stations and ISBTs** also emerge as crucial connectors.  

---

### 📌 Key Takeaways
1. **Residential → Commercial Flows Dominate**: Many pickups from residential hubs feed into office/business centers.  
2. **Transit Hubs are Critical**: Kashmere Gate ISBT, Sarai Kale Khan, and New De

# 🌟 Step 9: Ratings & Satisfaction Analysis

In this step, we analyze how **customer and driver ratings** vary across operational factors to identify service quality drivers.

We will cover:

1. **Overall Rating Distributions** – baseline patterns for drivers & customers.  
2. **Ratings vs Time of Day** – do peak-hour stresses impact ratings?  
3. **Ratings vs Booking Status** – do cancellations/incomplete rides affect ratings?  
4. **Ratings vs Vehicle Type** – which vehicle categories deliver the best experience?  
5. **Ratings vs Payment Method** – does mode of payment affect satisfaction?  
6. **Low Rating Drivers (Outlier Focus)** – conditions under which ratings drop.  

This helps understand **service quality bottlenecks** and areas of improvement.

In [None]:
# --- 1. Rating Distributions (already seen but refined here) ---
fig, axes = plt.subplots(1,2, figsize=(12,5))
sns.histplot(df['Driver Ratings'], bins=20, kde=False, ax=axes[0], color="teal")
axes[0].set_title("Driver Ratings Distribution")
sns.histplot(df['Customer Rating'], bins=20, kde=False, ax=axes[1], color="purple")
axes[1].set_title("Customer Ratings Distribution")
plt.show()

# --- 2. Ratings vs Time of Day ---
plt.figure(figsize=(12,6))
sns.boxplot(x="hour", y="Driver Ratings", data=df, palette="Blues")
plt.title("Driver Ratings by Hour of Day")
plt.show()

plt.figure(figsize=(12,6))
sns.boxplot(x="hour", y="Customer Rating", data=df, palette="Purples")
plt.title("Customer Ratings by Hour of Day")
plt.show()

# --- 3. Ratings vs Booking Status ---
plt.figure(figsize=(12,6))
sns.boxplot(x="Booking Status", y="Driver Ratings", data=df, palette="coolwarm")
plt.title("Driver Ratings by Booking Status")
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(12,6))
sns.boxplot(x="Booking Status", y="Customer Rating", data=df, palette="coolwarm")
plt.title("Customer Ratings by Booking Status")
plt.xticks(rotation=45)
plt.show()

# --- 4. Ratings vs Vehicle Type ---
plt.figure(figsize=(12,6))
sns.boxplot(x="Vehicle Type", y="Driver Ratings", data=df, palette="crest")
plt.title("Driver Ratings by Vehicle Type")
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(12,6))
sns.boxplot(x="Vehicle Type", y="Customer Rating", data=df, palette="mako")
plt.title("Customer Ratings by Vehicle Type")
plt.xticks(rotation=45)
plt.show()

# --- 5. Ratings vs Payment Method ---
plt.figure(figsize=(12,6))
sns.boxplot(x="Payment Method", y="Driver Ratings", data=df, palette="viridis")
plt.title("Driver Ratings by Payment Method")
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(12,6))
sns.boxplot(x="Payment Method", y="Customer Rating", data=df, palette="plasma")
plt.title("Customer Ratings by Payment Method")
plt.xticks(rotation=45)
plt.show()

# --- 6. Low Ratings Analysis ---
low_driver = df[df['Driver Ratings'] <= 3.5]
low_customer = df[df['Customer Rating'] <= 3.5]

print("Low Driver Ratings Breakdown by Booking Status:")
print(low_driver['Booking Status'].value_counts())

print("\nLow Customer Ratings Breakdown by Booking Status:")
print(low_customer['Booking Status'].value_counts())

print("\nLow Ratings by Vehicle Type (Driver):")
print(low_driver['Vehicle Type'].value_counts().head())

print("\nLow Ratings by Vehicle Type (Customer):")
print(low_customer['Vehicle Type'].value_counts().head())

# 🌟 Step 9 Analysis: Ratings & Satisfaction  

In this step, we explore **driver and customer ratings** in detail — their distributions, variation by time, vehicle types, payment methods, and booking status. We also analyze **low-rating breakdowns** to uncover service gaps.  

---

## 1. Distribution of Ratings  

- **Driver Ratings**:  
  - Centered between **4.0–4.5**, with most clustered around **4.2–4.4**.  
  - Few ratings fall below **3.5**, indicating generally **good driver performance**, though occasional dissatisfaction exists.  
  - Perfect 5-star ratings are less common compared to customer ratings.  

- **Customer Ratings**:  
  - Strong skew towards **4.5–5.0**, with a large spike at **5.0**.  
  - Suggests drivers may be more lenient in scoring customers.  

**✅ Key Insight**: Drivers are rated more critically than customers, but overall both groups maintain strong reputations.  

---

## 2. Ratings by Hour of Day  

- **Driver Ratings**:  
  - Stable throughout the day, with a slight dip during **early morning (2–6 AM)**.  
  - Ratings improve in the **evening (6–10 PM)**.  

- **Customer Ratings**:  
  - Consistent across the day, with a mild upward trend in the evenings.  

**✅ Key Insight**: Service quality is stable, but **night-time rides (2–6 AM)** need reliability improvements.  
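
With 24 boxplots side by side, small hourly differences are hard to read, and night-time hours have far fewer rated rides. A compact table of counts and averages can make the dip easier to judge. The sketch below assumes the `hour` and `Driver Ratings` columns; the helper name `rating_by_hour` and the toy values are hypothetical.

```python
import pandas as pd

def rating_by_hour(frame: pd.DataFrame, rating_col: str) -> pd.DataFrame:
    """Number of rated rides plus mean and median rating for each hour."""
    return frame.groupby("hour")[rating_col].agg(
        n_rated="count",        # unrated rides (NaN) are excluded
        mean_rating="mean",
        median_rating="median",
    )

# Toy data standing in for the real df (values are made up)
toy = pd.DataFrame({
    "hour": [3, 3, 19, 19, 19],
    "Driver Ratings": [4.0, 3.8, 4.4, 4.6, float("nan")],
})

hourly = rating_by_hour(toy, "Driver Ratings")
print(hourly)
```

A low `n_rated` for the 2–6 AM rows on the real data would mean the apparent dip rests on few observations and should be read cautiously.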

---

## 3. Ratings by Booking Status  

- Only **completed rides** receive ratings.  
- Most drivers receive **4–4.5**, though a small segment consistently gets ≤3.5.  

**✅ Key Insight**: Because only completed rides are rated, every low driver rating reflects the **ride experience** itself, not a cancellation.  

---

## 4. Ratings by Vehicle Type  

- **Driver Ratings**:  
  - Fairly consistent across categories (4.2–4.4 on average).  
  - **Go Mini & Premier Sedan** drivers perform slightly better.  
  - **Uber XL** shows greater variability.  

- **Customer Ratings**:  
  - Higher than driver ratings across all categories.  
  - **Go Sedan & Premier Sedan customers** tend to receive the best ratings.  
  - **Auto/Bike customers** get lower ratings, possibly due to **shorter trips or reliability issues**.  

**✅ Key Insight**: Premium vehicles (Go Sedan, Premier Sedan) yield **better mutual satisfaction**, while **Auto and Bike** show more friction. Go Mini is mixed: its drivers rate slightly better on average, yet it accounts for a large share of low ratings (section 6).  

---

## 5. Ratings by Payment Method  

- **Driver Ratings**:  
  - Stable across payment modes.  
  - **UPI, Cash, Debit Card** slightly outperform others.  
  - **Credit Card** has marginally lower ratings, perhaps due to transaction delays.  

- **Customer Ratings**:  
  - **UPI & Debit Card** users receive the highest ratings.  
  - **Cash & Wallet** payments show more variability.  

**✅ Key Insight**: **Digital payments** (UPI, Debit Card) correlate with **higher satisfaction** for both drivers and customers.  

---

## 6. Low Ratings Breakdown  

- **Drivers**:  
  - Low ratings heavily concentrated in **Auto & Go Mini (~45%)**, followed by Bike and Go Sedan.  
  - Fewer low ratings in **Premier Sedan**.  

- **Customers**:  
  - Similar pattern — **Auto & Go Mini customers** receive the most low ratings.  
  - Again, **Premier Sedan** customers fare better.  

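**✅ Key Insight**: Low ratings (both driver & customer) are **concentrated in budget rides**. Auto and Go Mini are also the most-booked categories, so part of this concentration reflects volume; a **service-quality gap** is confirmed only if the low-rating rate per rated ride is also higher.  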
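
Counts of low ratings favor whichever vehicle type has the most rides, so a rate per rated ride is a more direct comparison. This is a minimal sketch assuming the `Vehicle Type` and `Driver Ratings` columns and the same 3.5 threshold; the helper name `low_rating_rate` and the toy data are illustrative.

```python
import pandas as pd

def low_rating_rate(frame: pd.DataFrame, rating_col: str,
                    threshold: float = 3.5) -> pd.DataFrame:
    """Rated rides, low-rated rides, and low-rating share per vehicle type."""
    rated = frame.dropna(subset=[rating_col]).assign(
        is_low=lambda d: d[rating_col] <= threshold
    )
    return (
        rated.groupby("Vehicle Type")["is_low"]
        .agg(rated_rides="count", low_rides="sum", low_rate="mean")
        .sort_values("low_rate", ascending=False)
    )

# Toy data standing in for the real df (values are made up)
toy = pd.DataFrame({
    "Vehicle Type": ["Auto"] * 9 + ["Premier Sedan"] * 2,
    "Driver Ratings": [3.0, 3.4, 4.5, 4.2, 4.8, 4.4, 4.3, 4.1, float("nan"),
                       3.2, 4.6],
})

rates = low_rating_rate(toy, "Driver Ratings")
print(rates)
```

In the toy data Auto has more low-rated rides (2 versus 1) but a lower low-rating rate (25% versus 50%), which is the kind of reversal the raw counts cannot reveal.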

---

## ✅ Final Insights  

1. Overall ratings are **high**: Both drivers & customers average **>4.2**.  
2. **Premium vehicles outperform**: Go Sedan & Premier Sedan riders and drivers enjoy higher satisfaction.  
3. **Digital payments enhance experience**: UPI/Debit card transactions show the best feedback.  
4. **Budget rides underperform**: Auto, Bike, Go Mini have the most **low ratings**, signaling operational issues.  
5. **Night-time dips**: Ratings drop slightly between **2–6 AM**, possibly pointing to **safety & service challenges**.  

---

# 📌 Executive Summary of EDA: NCR Ride Bookings Dataset  

This Exploratory Data Analysis (EDA) covered **150,000 ride bookings** across NCR, exploring booking trends, vehicle preferences, cancellations, ratings, and financial performance.  

---

## 1. Data Quality & Structure  
- Dataset has **21 features** (Booking info, Vehicle, Location, Payment, Ratings, Cancellations).  
- Significant **missing values** (~80–94%) in cancellation-related fields → these were event-driven and handled contextually.  
- Cleaned **categorical fields** and engineered **datetime features** (year, month, day, weekday, hour).  

---

## 2. Booking Trends  
- **Completed rides dominate (~60%)**, but **cancellations** remain significant:  
  - Cancelled by Driver (~18%) > Cancelled by Customer (~7%).  
  - ~6% marked as “No Driver Found”.  
- **Autos & Go Mini** are most popular vehicles.  
- **UPI & Cash** are top payment methods. About **32% of entries are missing/NaN**, which may reflect offline cash settlements or rides that never reached the payment stage (cancelled or no driver found).  
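
One way to probe where the missing payment entries come from is to compute the share of missing `Payment Method` values within each `Booking Status`. The sketch below assumes those two column names; the helper name `missing_payment_by_status` and the toy rows are hypothetical.

```python
import pandas as pd

def missing_payment_by_status(frame: pd.DataFrame) -> pd.Series:
    """Fraction of rows with no Payment Method, per Booking Status."""
    return (
        frame["Payment Method"].isna()
        .groupby(frame["Booking Status"])
        .mean()
        .sort_values(ascending=False)
    )

# Toy data standing in for the real df (values are made up)
toy = pd.DataFrame({
    "Booking Status": ["Completed", "Completed", "Completed",
                       "Cancelled by Driver", "Cancelled by Driver",
                       "No Driver Found"],
    "Payment Method": ["UPI", "Cash", "UPI", None, None, None],
})

missing_share = missing_payment_by_status(toy)
print(missing_share)
```

If the real data shows shares near 1.0 for non-completed statuses and near 0.0 for completed rides, the missing values are structural (no payment took place) and do not indicate offline settlement.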

---

## 3. Demand Patterns  
- **Hourly Trends**:  
  - Demand peaks **twice daily** — Morning (8–11 AM) and Evening (5–8 PM).  
  - Cancellations are disproportionately high during evening rush hours.  
- **Weekday vs Weekend**:  
  - Demand is stable across days, but **weekends show slightly higher ride completions**.  
  - Weekdays see more **driver cancellations**, likely due to peak congestion.  
- **Monthly Seasonality**:  
  - Rides are evenly distributed, with **Q1 and Q3 slightly higher** demand.  
  - Revenue remains consistent across months.  

---

## 4. Ride Economics  
- **Booking Value Distribution**:  
  - Right-skewed → most trips cost **₹200–800**, but some go beyond **₹3000**.  
- **Ride Distance Distribution**:  
  - Majority trips are **short-haul (5–20 km)**.  
- **Revenue by Weekday**:  
  - **Weekend revenue is 30–35% higher** than weekdays, despite similar ride volumes.  
  - Indicates **longer, higher-value trips on weekends**.  

---

## 5. Cancellation Analysis  
- **Top Customer Cancellation Reasons**: Wrong address, change of plans, driver not moving.  
- **Top Driver Cancellation Reasons**: Customer not ready, health-related excuses, excess passengers.  
- Both drivers & customers show **avoidance behavior during peak hours**.  

---

## 6. Location Insights  
- **Top Pickup Hubs**: Khandsa, Saket, Barakhamba Road, AIIMS.  
- **Top Drop Points**: Ashram, Cyber Hub, Kalkaji, Kashmere Gate ISBT.  
- Pickup vs Drop analysis shows **imbalances in metro + IT corridor locations** → indicates potential **supply-demand mismatch**.  
- **Top Pickup-Drop Pairs** involve **business hubs, metro stations, and residential zones** (e.g., DLF City Court → Bhiwadi, Akshardham → RK Puram).  

---

## 7. Ratings Analysis  
- **Drivers**: Average **4.2–4.4**, but more critically rated than customers.  
- **Customers**: Highly skewed towards **4.5–5.0** (lenient rating behavior).  
- **Low ratings cluster** in **Auto, Go Mini, Bike rides** → indicates weaker service quality in budget categories.  
- **Digital Payments (UPI/Debit Card)** correlate with **higher satisfaction**.  
- **Night-time rides (2–6 AM)** show slightly lower ratings → hinting at safety/comfort issues.  

---

# 🚀 Recommended Business Actions  

### 🔹 Service Quality  
- **Budget Segments (Auto, Bike, Go Mini):**  
  - Introduce **driver training & quality assurance programs**.  
  - Launch **“Ride Quality Guarantee”** for budget users (refunds/discounts for poor rides).  

- **Premium Segments (Go Sedan, Premier Sedan):**  
  - Expand supply in metro corridors → higher ratings, better retention.  
  - Offer **loyalty perks** to premium users.  

---

### 🔹 Cancellations  
- **Driver-side cancellations (~18%)**:  
  - Incentivize driver acceptance during peak hours.  
  - Penalize habitual cancellers.  

- **Customer-side cancellations (~7%)**:  
  - Provide **real-time driver movement tracking** to reduce perception of delays.  
  - Enable **free reschedule** instead of outright cancellation.  

---

### 🔹 Payments  
- Encourage **UPI/Digital Payments** through cashback offers.  
- Streamline **credit card payment flows** to improve satisfaction.  

---

### 🔹 Time-based Demand Management  
- **Rush Hours (8–11 AM, 5–8 PM):**  
  - Dynamic pricing + demand prediction models.  
  - Pre-ride booking discounts to flatten peaks.  
- **Night-time (2–6 AM):**  
  - Offer **night shift driver bonuses**.  
  - Enhance **safety features** (panic buttons, dedicated helpline).  

---

### 🔹 Location Optimization  
- Use **pickup-drop imbalance analysis** to **rebalance fleet supply** across NCR.  
- Deploy **geo-fencing & surge pricing** in high-demand corridors (e.g., Cyber Hub, AIIMS, Saket).  

---

# ✅ Strategic Impact  

If implemented, these measures aim to achieve the following (the figures are targets, not estimates derived from this EDA):  
- Reduce **cancellations by 10–15%**.  
- Improve **budget ride ratings** closer to premium benchmarks.  
- Increase **digital payment adoption** by >20%.  
- Boost **revenue per trip**, especially during weekends & peak hours.  
- Enhance **overall customer trust and retention**.  

---

📌 **Next Step → Predictive Modeling**:  
- Build models to forecast **cancellations**, **ride demand by time/location**, and **ride value prediction**.  
- This will allow proactive **driver allocation, surge pricing, and personalized offers**.