# Uber Fare Analysis - Interactive Notebook

This notebook analyzes Uber fare data interactively, covering fare distributions, hourly and weekly patterns, peak vs. off-peak comparisons, and pickup geography.

## Table of Contents
1. [Data Loading & Setup](#data-loading)
2. [Data Overview](#data-cleaning)
3. [Exploratory Data Analysis](#eda)
4. [Time-based Analysis](#time-analysis)
5. [Fare Analysis](#fare-analysis)
6. [Geographic Analysis](#geo-analysis)
7. [Key Insights & Recommendations](#insights)

## 1. Data Loading & Setup <a id="data-loading"></a>

In [None]:
# statsmodels is required by trendline='ols' in the Geographic Analysis section
%pip install plotly statsmodels

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")

In [None]:
# Load the enhanced dataset
df = pd.read_csv('Data/enhanced/uber_enhanced.csv')

# Convert datetime column
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display first few rows
df.head()
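
The later sections rely on columns such as `hour`, `weekday`, and `is_peak` that are already present in the enhanced CSV; the notebook does not show how they were built. The following is a minimal sketch of one way such columns could be derived from `pickup_datetime`. The helper name `add_time_features`, the `'Off-Peak'` label, and the peak windows (7-9 AM and 5-7 PM, taken from the recommendation text later in this notebook) are assumptions, not the dataset's actual definitions.

```python
import pandas as pd


def add_time_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with hour, weekday, and is_peak columns (assumed definitions)."""
    out = frame.copy()
    ts = pd.to_datetime(out["pickup_datetime"])
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.day_name()
    # Assumed peak windows: 7-9 AM and 5-7 PM, end-exclusive
    peak_hours = {7, 8, 17, 18}
    out["is_peak"] = out["hour"].isin(peak_hours).map({True: "Peak", False: "Off-Peak"})
    return out


demo = pd.DataFrame({"pickup_datetime": ["2024-01-05 08:15:00", "2024-01-06 13:00:00"]})
features = add_time_features(demo)
print(features[["hour", "weekday", "is_peak"]])
```

If the enhanced file uses different peak windows or labels, the groupings in the later sections will differ from what this sketch produces.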

## 2. Data Overview <a id="data-cleaning"></a>

In [None]:
# Dataset info
print("Dataset Information:")
print("=" * 40)
df.info()

In [None]:
# Summary statistics
print("Summary Statistics:")
print("=" * 40)
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
print("=" * 40)
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Count'] > 0])

## 3. Exploratory Data Analysis <a id="eda"></a>

In [None]:
# Interactive fare distribution
fig = px.histogram(df, x='fare_amount', nbins=50, 
                   title='Uber Fare Distribution (Interactive)',
                   labels={'fare_amount': 'Fare Amount ($)', 'count': 'Frequency'})
fig.update_layout(showlegend=False)
fig.show()

In [None]:
# Box plot by time category
fig = px.box(df, x='time_category', y='fare_amount',
             title='Fare Distribution by Time of Day',
             labels={'fare_amount': 'Fare Amount ($)', 'time_category': 'Time Category'})
fig.show()

## 4. Time-based Analysis <a id="time-analysis"></a>

In [None]:
# Hourly patterns
hourly_stats = df.groupby('hour').agg({
    'fare_amount': ['mean', 'count'],
    'passenger_count': 'mean'
}).round(2)

hourly_stats.columns = ['Avg_Fare', 'Ride_Count', 'Avg_Passengers']
hourly_stats = hourly_stats.reset_index()

# Create subplots
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Average Fare by Hour', 'Number of Rides by Hour',
                   'Average Passengers by Hour', 'Combined Metrics'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": True}]]
)

# Average fare by hour
fig.add_trace(
    go.Scatter(x=hourly_stats['hour'], y=hourly_stats['Avg_Fare'],
              mode='lines+markers', name='Avg Fare', line=dict(color='blue')),
    row=1, col=1
)

# Ride count by hour
fig.add_trace(
    go.Bar(x=hourly_stats['hour'], y=hourly_stats['Ride_Count'],
           name='Ride Count', marker_color='orange'),
    row=1, col=2
)

# Average passengers by hour
fig.add_trace(
    go.Scatter(x=hourly_stats['hour'], y=hourly_stats['Avg_Passengers'],
              mode='lines+markers', name='Avg Passengers', line=dict(color='green')),
    row=2, col=1
)

# Combined metrics
fig.add_trace(
    go.Scatter(x=hourly_stats['hour'], y=hourly_stats['Avg_Fare'],
              mode='lines+markers', name='Avg Fare', line=dict(color='blue')),
    row=2, col=2
)

fig.add_trace(
    go.Scatter(x=hourly_stats['hour'], y=hourly_stats['Ride_Count']/1000,
              mode='lines+markers', name='Ride Count (k)',
              line=dict(color='red')),  # axis placement is handled by row/col/secondary_y below
    row=2, col=2, secondary_y=True
)

fig.update_layout(height=600, title_text="Hourly Analysis Dashboard")
fig.show()

In [None]:
# Weekly patterns
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekly_stats = df.groupby('weekday').agg({
    'fare_amount': ['mean', 'count']
}).round(2)

weekly_stats.columns = ['Avg_Fare', 'Ride_Count']
weekly_stats = weekly_stats.reindex(weekday_order)

# Interactive weekly analysis
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Average Fare by Day of Week', 'Number of Rides by Day of Week')
)

fig.add_trace(
    go.Bar(x=weekly_stats.index, y=weekly_stats['Avg_Fare'],
           name='Avg Fare', marker_color='skyblue'),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x=weekly_stats.index, y=weekly_stats['Ride_Count'],
           name='Ride Count', marker_color='lightcoral'),
    row=1, col=2
)

fig.update_layout(height=400, title_text="Weekly Patterns")
fig.show()

## 5. Fare Analysis <a id="fare-analysis"></a>

In [None]:
# Peak vs Off-peak analysis
peak_stats = df.groupby('is_peak').agg({
    'fare_amount': ['mean', 'median', 'count'],
    'passenger_count': 'mean',
    'distance': 'mean'
}).round(2)

peak_stats.columns = ['Avg_Fare', 'Median_Fare', 'Ride_Count', 'Avg_Passengers', 'Avg_Distance']
print("Peak vs Off-Peak Analysis:")
print("=" * 40)
print(peak_stats)

# Interactive comparison
fig = px.bar(peak_stats.reset_index(), x='is_peak', y=['Avg_Fare', 'Median_Fare'],
             title='Peak vs Off-Peak Fare Comparison',
             barmode='group')
fig.show()

In [None]:
# Passenger count analysis
passenger_stats = df.groupby('passenger_count').agg({
    'fare_amount': ['mean', 'count'],
    'distance': 'mean'
}).round(2)

passenger_stats.columns = ['Avg_Fare', 'Ride_Count', 'Avg_Distance']
passenger_stats = passenger_stats.reset_index()

# Interactive passenger analysis
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Average Fare by Passenger Count', 'Number of Rides by Passenger Count')
)

fig.add_trace(
    go.Bar(x=passenger_stats['passenger_count'], y=passenger_stats['Avg_Fare'],
           name='Avg Fare', marker_color='blue'),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x=passenger_stats['passenger_count'], y=passenger_stats['Ride_Count'],
           name='Ride Count', marker_color='green'),
    row=1, col=2
)

fig.update_layout(height=400, title_text="Passenger Count Analysis")
fig.show()

## 6. Geographic Analysis <a id="geo-analysis"></a>

In [None]:
# Sample data for visualization (to avoid overplotting);
# cap n at the dataset size so sampling does not fail on small datasets
sample_df = df.sample(n=min(5000, len(df)), random_state=42)

# Scatter plot of pickup locations colored by fare
fig = px.scatter(sample_df, x='pickup_longitude', y='pickup_latitude',
                 color='fare_amount', size='passenger_count',
                 title='Pickup Locations Colored by Fare Amount',
                 labels={'pickup_longitude': 'Longitude', 
                        'pickup_latitude': 'Latitude',
                        'fare_amount': 'Fare ($)'},
                 color_continuous_scale='Viridis')
fig.show()

In [None]:
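# Distance vs Fare relationship
# Note: trendline='ols' requires the statsmodels package; with color set,
# Plotly fits one trendline per time category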
fig = px.scatter(sample_df, x='distance', y='fare_amount',
                 color='time_category', size='passenger_count',
                 title='Distance vs Fare Relationship',
                 labels={'distance': 'Trip Distance (degrees)', 'fare_amount': 'Fare ($)'},
                 trendline='ols')
fig.show()

# Calculate correlation
correlation = df['distance'].corr(df['fare_amount'])
print(f"Correlation between distance and fare: {correlation:.3f}")
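
The axis label above gives trip distance in degrees. A degree of longitude covers less ground than a degree of latitude away from the equator, so degree-based distances are not directly comparable across directions. The sketch below shows the haversine formula for converting two coordinate pairs to kilometres. The function name `haversine_km` is hypothetical, and applying it would require dropoff coordinates, which this notebook does not show.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111 km
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
```

Correlating fare against a distance in kilometres may give a more interpretable slope (dollars per km) than a distance in degrees.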

## 7. Key Insights & Recommendations <a id="insights"></a>

In [None]:
# Calculate key metrics for insights
total_rides = len(df)
avg_fare = df['fare_amount'].mean()
total_revenue = df['fare_amount'].sum()
peak_rides = len(df[df['is_peak'] == 'Peak'])
peak_percentage = (peak_rides / total_rides) * 100
busiest_hour = df.groupby('hour').size().idxmax()
highest_fare_hour = df.groupby('hour')['fare_amount'].mean().idxmax()
most_common_passengers = df['passenger_count'].mode()[0]

print("KEY INSIGHTS & RECOMMENDATIONS")
print("=" * 50)
print(f"Total Rides Analyzed: {total_rides:,}")
print(f"Average Fare: ${avg_fare:.2f}")
print(f"Total Revenue: ${total_revenue:,.2f}")
print(f"Peak Rides: {peak_percentage:.1f}% of total")
print(f"Busiest Hour: {busiest_hour}:00")
print(f"Highest Fare Hour: {highest_fare_hour}:00")
print(f"Most Common Passenger Count: {most_common_passengers}")

print("\nBUSINESS RECOMMENDATIONS:")
print("=" * 30)
print("1. Peak Hour Strategy: Implement dynamic pricing during 7-9 AM and 5-7 PM")
print("2. End-of-Week Focus: Friday shows highest demand - optimize driver allocation")
print("3. Multi-passenger Incentives: Encourage ride-sharing for cost efficiency")
print("4. Geographic Optimization: Focus on high-fare areas for premium services")
print("5. Data-driven Pricing: Use time and location data for optimal fare structure")
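
The recommendations above are printed as fixed strings, so they do not change if the underlying data does. As a hedged sketch, the busiest weekday and the highest-volume hours can be read from the data instead, assuming the `weekday` and `hour` columns used earlier. `demand_summary` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def demand_summary(frame: pd.DataFrame, top_n: int = 3) -> dict:
    """Summarize ride demand: busiest weekday and the top_n busiest hours."""
    rides_by_day = frame.groupby("weekday").size()
    rides_by_hour = frame.groupby("hour").size()
    return {
        "busiest_weekday": rides_by_day.idxmax(),
        # Sort the top hours chronologically for readability
        "top_hours": sorted(rides_by_hour.nlargest(top_n).index.tolist()),
    }


demo = pd.DataFrame({
    "weekday": ["Friday", "Friday", "Monday"],
    "hour": [8, 8, 17],
})
print(demand_summary(demo))
```

Calling `demand_summary(df)` on the full dataset would show whether the hours and weekday named in the printed recommendations match what the data contains.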

In [None]:
# Create a summary dashboard
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Fare Distribution', 'Hourly Patterns', 
                   'Weekly Patterns', 'Peak vs Off-Peak'),
    specs=[[{"type": "histogram"}, {"type": "scatter"}],
           [{"type": "bar"}, {"type": "bar"}]]
)

# Fare distribution
fig.add_trace(
    go.Histogram(x=df['fare_amount'], nbinsx=30, name='Fare Distribution'),
    row=1, col=1
)

# Hourly patterns
hourly_fare = df.groupby('hour')['fare_amount'].mean()
fig.add_trace(
    go.Scatter(x=hourly_fare.index, y=hourly_fare.values,
              mode='lines+markers', name='Hourly Avg Fare'),
    row=1, col=2
)

# Weekly patterns
weekly_rides = df.groupby('weekday').size().reindex(weekday_order)
fig.add_trace(
    go.Bar(x=weekly_rides.index, y=weekly_rides.values, name='Weekly Rides'),
    row=2, col=1
)

# Peak vs Off-peak
peak_comparison = df.groupby('is_peak')['fare_amount'].mean()
fig.add_trace(
    go.Bar(x=peak_comparison.index, y=peak_comparison.values, name='Peak Analysis'),
    row=2, col=2
)

fig.update_layout(height=600, title_text="Uber Fare Analysis Summary Dashboard", showlegend=False)
fig.show()

## Conclusion

This interactive analysis has revealed several key insights about Uber fare patterns:

- **Time-based Patterns**: Clear peak and off-peak periods with distinct fare patterns
- **Weekly Trends**: Friday shows the highest demand for rides
- **Fare Distribution**: Most rides fall within the $5-15 range
- **Geographic Insights**: Fare amounts vary with pickup location
- **Passenger Patterns**: Single-passenger rides dominate the dataset

These insights can guide business decisions for pricing strategies, resource allocation, and service optimization.
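
Claims such as the share of rides in the $5-15 range can be checked directly rather than read off a histogram. A minimal sketch follows, assuming the `fare_amount` column used throughout; `share_in_range` is a hypothetical helper.

```python
import pandas as pd


def share_in_range(values: pd.Series, low: float, high: float) -> float:
    """Fraction of non-missing values that fall within [low, high], inclusive."""
    clean = values.dropna()
    return float(clean.between(low, high).mean())


demo_fares = pd.Series([4.0, 5.0, 10.0, 15.0, 20.0])
print(f"Share of fares in $5-15: {share_in_range(demo_fares, 5, 15):.0%}")
```

On the full dataset this would be `share_in_range(df['fare_amount'], 5, 15)`; similarly, `(df['passenger_count'] == 1).mean()` gives the share of single-passenger rides.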