This notebook reads taxi data to produce a clean, normalised dataset of hourly busyness. It loads a summary of hourly activity, checks that the required columns are present, computes activity percentiles for each taxi zone, day of week and hour, and then averages the median across days to get one value per hour per taxi zone. These averages are min-max normalised across the whole dataset to give a standard busyness measure between 0 and 1. The resulting dataset can be used downstream for machine learning tasks and for green scoring in sustainable routing.

In [1]:
# Importing.
import pandas as pd
import numpy as np

In [2]:
# Loading zone level summary data.
summary_df = pd.read_csv("zone_hourly_summary.csv")

# Ensuring required columns are present.
required_columns = ["PULocationID", "pickup_hour", "day_of_week", "zone_hourly_activity"]
missing = [col for col in required_columns if col not in summary_df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")
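
The check above only confirms that the columns exist. As a minimal sketch of a follow-up value check, the hypothetical helper `check_ranges` below reports out-of-range or missing values. It assumes `pickup_hour` is encoded 0-23 and `day_of_week` 0-6, which the sample output suggests but the notebook does not state; the example frame is made up for illustration.

```python
import pandas as pd

def check_ranges(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means none (hypothetical helper)."""
    problems = []
    # Assumption: pickup_hour is an hour of day in 0-23.
    if not df["pickup_hour"].between(0, 23).all():
        problems.append("pickup_hour outside 0-23")
    # Assumption: day_of_week is encoded 0-6.
    if not df["day_of_week"].between(0, 6).all():
        problems.append("day_of_week outside 0-6")
    # Activity counts should be present and non-negative.
    if df["zone_hourly_activity"].isna().any():
        problems.append("zone_hourly_activity has missing values")
    if (df["zone_hourly_activity"] < 0).any():
        problems.append("zone_hourly_activity has negative values")
    return problems

# Made-up rows, not taken from zone_hourly_summary.csv.
example = pd.DataFrame({
    "pickup_hour": [0, 23, 24],
    "day_of_week": [0, 6, 3],
    "zone_hourly_activity": [5, 2, -1],
})
problems = check_ranges(example)
print(problems)
```

In the notebook this could be called as `check_ranges(summary_df)` right after the column check, raising or logging if the list is non-empty.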

In [3]:
# Getting percentiles for each zone-hour-day group.
zone_percentiles = summary_df.groupby(
    ["PULocationID", "day_of_week", "pickup_hour"]
)["zone_hourly_activity"].agg(
    min="min",
    p10=lambda x: x.quantile(0.10),
    p25=lambda x: x.quantile(0.25),
    p50="median",  
    p75=lambda x: x.quantile(0.75),
    p90=lambda x: x.quantile(0.90),
    max="max"
).reset_index()
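
As an aside, the same percentiles can probably be obtained without per-column lambdas by passing a list of quantiles to `quantile` and unstacking the result. The sketch below shows this on a made-up frame with a single grouping key; it is an alternative formulation, not the method used in this notebook.

```python
import pandas as pd

# Made-up activity values for two zones (illustration only).
toy = pd.DataFrame({
    "PULocationID": [1, 1, 1, 1, 2, 2],
    "zone_hourly_activity": [1, 2, 3, 4, 10, 20],
})

# quantile() with a list returns one row per (group, quantile);
# unstack() pivots the quantile level into columns.
toy_q = (
    toy.groupby("PULocationID")["zone_hourly_activity"]
    .quantile([0.25, 0.50, 0.75])
    .unstack()
)
toy_q.columns = ["p25", "p50", "p75"]
toy_q = toy_q.reset_index()
print(toy_q)
```

Both approaches use pandas' default linear interpolation, so they should agree on the same data.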

In [4]:
# Averaging the median (p50) across days of week for each pickup_hour and PULocationID.
zone_medians = zone_percentiles.groupby(['pickup_hour', 'PULocationID'])['p50'].mean().reset_index()

# Min-max normalising across the whole dataset to get a 0-1 busyness score.
min_val = zone_medians['p50'].min()
max_val = zone_medians['p50'].max()
zone_medians['normalised_busyness'] = (zone_medians['p50'] - min_val) / (max_val - min_val)
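
The normalisation above is min-max scaling, (x - min) / (max - min), which divides by zero if every value is identical. A minimal sketch of a guarded version follows; the helper name `min_max_normalise` and the choice to return 0.0 for a constant series are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def min_max_normalise(values: pd.Series) -> pd.Series:
    """Scale values to 0-1; return all zeros if the range is zero (assumed convention)."""
    lo, hi = values.min(), values.max()
    span = hi - lo
    if pd.isna(span) or span == 0:
        # Constant or empty input: no spread to scale by.
        return pd.Series(0.0, index=values.index)
    return (values - lo) / span

# Made-up medians for illustration.
example = pd.Series([3.0, 5.0, 12.5])
scaled = min_max_normalise(example)
print(scaled.tolist())
```

Because the minimum and maximum are taken over all zones and hours together, scores are comparable across zones, but a few very busy zones will compress everyone else towards 0.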

In [5]:
# Merging normalised values back into the main percentiles DataFrame.
zone_final = pd.merge(
    zone_percentiles,
    zone_medians[['pickup_hour', 'PULocationID', 'normalised_busyness']],
    on=['pickup_hour', 'PULocationID'],
    how='left'
)
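
Since each (pickup_hour, PULocationID) pair should appear once in the medians table, the merge is many-to-one. The sketch below, on made-up frames, uses the `validate` argument of `pd.merge` to enforce that and then counts rows left without a score; whether such checks are wanted here is an assumption.

```python
import pandas as pd

# Made-up stand-ins for zone_percentiles and zone_medians.
percentiles = pd.DataFrame({
    "PULocationID": [4, 4, 7],
    "day_of_week": [0, 1, 0],
    "pickup_hour": [0, 0, 0],
    "p50": [12.5, 10.0, 3.0],
})
medians = pd.DataFrame({
    "pickup_hour": [0, 0],
    "PULocationID": [4, 7],
    "normalised_busyness": [1.0, 0.0],
})

# validate="many_to_one" raises MergeError if the right-hand keys are duplicated.
merged = pd.merge(
    percentiles,
    medians,
    on=["pickup_hour", "PULocationID"],
    how="left",
    validate="many_to_one",
)

# Rows with no matching median would show up as NaN after a left merge.
unmatched = int(merged["normalised_busyness"].isna().sum())
print(len(merged), unmatched)
```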

In [6]:
# Saving as CSV.
zone_final.to_csv("zone_hourly_busyness_stats.csv", index=False)

print("zone_hourly_busyness_stats.csv created successfully")
print(zone_final.head(10))

zone_hourly_busyness_stats.csv created successfully
   PULocationID  day_of_week  pickup_hour  min  p10    p25   p50    p75   p90  \
0             4            0            0    5  7.0   9.00  12.5  15.00  19.3   
1             4            0            1    2  3.0   5.00   6.5  11.00  16.5   
2             4            0            2    2  3.6   4.00   5.0   7.00  18.2   
3             4            0            3    1  2.0   2.75   5.0   6.25   9.7   
4             4            0            4    1  2.0   3.00   3.0   5.00   6.0   
5             4            0            5    1  1.0   2.00   3.0   5.00   6.8   
6             4            0            6    1  3.0   4.00   7.0   9.00  11.0   
7             4            0            7    4  6.1   8.00  12.0  18.00  20.9   
8             4            0            8    1  6.0  11.00  16.0  19.00  24.0   
9             4            0            9    4  6.0   9.00  13.0  15.00  19.0   

   max  normalised_busyness  
0   59             0.05791