# Uber GPA Lead Case Study – Project Plan

_Notebook: High-level project spec and analysis outline before loading data_

## 1. Context: Role, Case Prompt, and Constraints

**Role context (GPA Lead / Senior Manager, BizOps – Global Portfolio Analysis)**
- Challenge leadership to focus on the most consequential aspects of performance across 100+ metrics over 8+ years.
- Build a ‘management by exception’ view: surface outliers, regime shifts, and blind spots rather than re‑stating dashboards.
- Combine time‑series depth with business judgment: interpret *why* metrics move and what to do about it.

**Case prompt (simplified)**
- “Go find some publicly available data and show how Uber (or any other company) could make money from it.”
- Use external, public data only (no internal Uber data).
- Demonstrate: (i) structured thinking, (ii) smart use of time‑series analytics, and (iii) clear business storytelling.

**My chosen angle**
- Focus: **Uber airport trips in NYC** as a monetization and risk‑management opportunity.
- Rationale:
  - Airport pickups are high‑value, operationally critical trips (higher fares, time‑sensitive, peak congestion).
  - Airports sit at the intersection of several rich public datasets: flights, weather, transit, and local mobility.
  - This makes it a clean way to demonstrate how external signals can highlight **opportunities, risks, and blind spots**.

## 2. Project Goals

1. **Use external public data** to approximate how Uber’s NYC airport business responds to exogenous signals
   (weather, flight disruptions, transit outages).
2. **Surface 2–3 concrete revenue or margin opportunities** and 1–2 risk flags that a GPA Lead could credibly bring
   to Mobility leadership.
3. **Demonstrate a scalable analytics pattern** that generalizes to the “100 metrics over 8 years” problem:
   - Normalize and align heterogeneous time series.
   - Quantify interdependencies (e.g., weather ↔ trips ↔ average fare).
   - Prioritize ‘exceptions’ and structural breaks rather than reporting everything.
4. Package the work so it is easy to discuss verbally in 30 minutes: a few clear charts + a simple narrative
   about **signal → insight → action → $$**.

## 3. Business Questions & Working Hypotheses

**Core question:**
> How can Uber use public airport‑adjacent signals (flights, weather, transit) to improve pricing, supply positioning,
> and reliability for NYC airport trips?

**Hypothesis 1 – Severe weather and demand spikes**
- **H1a:** Severe weather at JFK/LGA/EWR (heavy rain, snow, low visibility) **increases demand** for airport trips
  relative to similar non‑weather days.
- **H1b:** On those days, **average fare per trip and ETAs** behave differently (e.g., higher fares, worse reliability),
  creating an opportunity to better balance surge, incentives, and reliability guarantees.

**Hypothesis 2 – Flight volume & delays as advance signals**
- **H2a:** Daily/hourly **flight arrivals and delay patterns** correlate strongly with airport ground trip demand.
- **H2b:** Flight‑schedule + delay data could be used to **pre‑position drivers** before spikes (storm systems,
  mass delays, diversion events), reducing surge volatility and cancellation rates.

**Hypothesis 3 – Transit disruptions and mode shift**
- **H3a:** **MTA outages** or severe slowdowns on airport‑adjacent lines (e.g., the E train or LIRR toward JFK), along with
  outages on the Port Authority's AirTrain, trigger **short‑notice uplifts in Uber trips** from key boroughs to the airports.
- **H3b:** Uber could systematically treat transit alerts as a **‘demand shock’ prior** and use them to adjust
  driver incentives and rider messaging (e.g., targeted notifications to likely airport riders).

These hypotheses are intentionally simple but map directly to the GPA mandate:
- They connect **external drivers → Uber performance metrics → monetizable actions**.
- They set up a workflow that extends cleanly to many more metrics and cities.

## 4. Public Data Sources and Intended Use

| Theme                | Dataset (example)                                      | Granularity / Period        | How I’ll use it |
|----------------------|---------------------------------------------------------|-----------------------------|-----------------|
| Airport trip demand  | NYC TLC trip records (yellow/green/HVFHV)             | Trip‑level, multi‑year      | Filter trips to/from JFK/LGA/EWR; construct daily/hourly metrics: trip count, avg fare, trip distance, pickup/dropoff borough mix. |
| Flights & delays     | Port Authority of NY/NJ, BTS / FAA on‑time stats      | Flight‑level or aggregated  | Build daily/hourly features: total arrivals, % delayed, average delay minutes; identify disruption days. |
| Weather              | NOAA / NWS weather (METAR) for JFK/LGA/EWR            | Hourly                       | Create severity scores: precipitation, visibility, wind, storms; tag ‘event’ vs control days. |
| Transit disruptions  | MTA service alerts / GTFS‑RT / historical disruptions | Event‑level or daily flags  | Construct binary/ordinal variables for outages on airport‑relevant lines and dates. |
| Macro (optional)     | Gas prices, holidays, etc.                             | Weekly/daily                | Control variables for broader demand or supply shifts if needed. |

The initial case implementation can **start with TLC + weather** (most straightforward to acquire and clean),
then layer in flights and transit disruptions if time permits.

## 5. Key Metrics and Feature Engineering Plan

### 5.1 Airport trip metrics (target variables)
- `trips_airport_daily`: Number of completed Uber‑eligible trips per airport (JFK/LGA/EWR) per day.
- `avg_fare_airport_daily`: Mean/median fare (or revenue proxy) per trip per airport per day.
- `eta_proxy`: Travel‑time proxy for reliability (pickup‑to‑dropoff duration, or request‑to‑pickup wait time where request timestamps are available), subject to data availability.
- `origin_borough_share`: Distribution of pickup locations (e.g., Manhattan vs Brooklyn vs Queens).

### 5.2 External drivers (features)
- Weather features per airport per day (or hour):
  - Total precipitation, max wind speed, min visibility, temperature extremes.
  - A composite **`weather_severity_index`** to simplify interpretation.
- Flight features:
  - Total arrivals, share of flights delayed more than X minutes, number of cancellations, big‑event flags (storms, ATC ground stops).
- Transit disruption features:
  - Binary flags for outages on specific lines.
  - Simple severity buckets (minor / major / full outage).

### 5.3 Interactions & transformations
- Lagged relationships, e.g., weather at **t** with trips at **t** and **t+1**.
- Ratios, e.g., trips per 1,000 arriving passengers; fare per mile.
- Dummy variables for holidays, weekends, and known special events (e.g., UNGA, major sports events).
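
The lag and ratio transformations could look roughly like the sketch below. Pairing weather at **t** with trips at **t+1** is the same as pairing trips at **t** with weather lagged by one day, so the helper shifts drivers within each airport. The helper name, the `arriving_passengers` column, and the sample numbers are hypothetical.

```python
import pandas as pd

def add_lags(panel: pd.DataFrame, cols, lags=(1,), group_col="airport") -> pd.DataFrame:
    """Add lagged copies of `cols`, shifting within each airport so values never leak across groups."""
    out = panel.sort_values([group_col, "date"]).copy()
    for col in cols:
        for k in lags:
            out[f"{col}_lag{k}"] = out.groupby(group_col)[col].shift(k)
    return out

# Illustrative panel (made-up values).
panel = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"] * 2),
    "airport": ["JFK"] * 3 + ["LGA"] * 3,
    "weather_severity_index": [0.1, 2.0, 0.3, 0.0, 1.5, 0.2],
    "trips_airport_daily": [1000, 1400, 1200, 800, 1000, 900],
    "arriving_passengers": [50000, 40000, 60000, 40000, 35000, 42000],
})
panel = add_lags(panel, ["weather_severity_index"], lags=(1,))

# Ratio feature: trips per 1,000 arriving passengers.
panel["trips_per_1k_pax"] = 1000 * panel["trips_airport_daily"] / panel["arriving_passengers"]
print(panel)
```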

## 6. Analytical Approach

The analysis will be structured to mirror what the GPA Lead would do when handed a large internal data stack:

### 6.1 Phase 0 – Environment & repo structure (already done)
- Confirm Python environment (Anaconda), Jupyter/Spyder, and Git are set up.
- Repo layout (already created):
  - `data/` – raw and processed external datasets.
  - `notebooks/` – planning, EDA, and modeling notebooks.
  - `code/` – reusable Python modules (data loading, feature engineering, plots).
  - `reports/` – figures and final case slides.

### 6.2 Phase 1 – Data ingestion and cleaning
- Write loaders for each source (TLC, weather, flights, transit).
- Standardize date/time handling and time zones.
- Apply basic quality checks: missing values, outlier trip distances/fare amounts, duplicate flights, etc.
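
A rough sketch of the time‑zone and quality‑check steps, assuming weather observations arrive as naive UTC timestamps (METAR reports use UTC) while trip records carry naive New York local time. The function names, column names (`trip_miles`, `fare`), and thresholds are placeholders to be tuned during Phase 1.

```python
import pandas as pd

LOCAL_TZ = "America/New_York"

def utc_to_local_naive(ts: pd.Series) -> pd.Series:
    """Convert naive UTC timestamps to naive New York local time (handles DST)."""
    return ts.dt.tz_localize("UTC").dt.tz_convert(LOCAL_TZ).dt.tz_localize(None)

def basic_trip_qc(trips: pd.DataFrame, max_miles: float = 100.0, max_fare: float = 500.0) -> pd.DataFrame:
    """Drop exact duplicates and rows with missing or implausible distance/fare."""
    df = trips.drop_duplicates()
    # Comparisons against NaN are False, so rows with missing values are dropped too.
    ok_miles = (df["trip_miles"] > 0) & (df["trip_miles"] <= max_miles)
    ok_fare = (df["fare"] > 0) & (df["fare"] <= max_fare)
    return df[ok_miles & ok_fare & df["pickup_datetime"].notna()]

# Winter and summer observations at 17:00 UTC.
obs_utc = pd.Series(pd.to_datetime(["2023-01-15 17:00", "2023-07-15 17:00"]))
obs_local = utc_to_local_naive(obs_utc)
print(obs_local)

# Illustrative trips: one valid row plus a duplicate and four bad rows.
trips_raw = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2023-01-01 08:00", "2023-01-01 08:00", "2023-01-01 09:00",
         "2023-01-01 10:00", "2023-01-01 11:00", "2023-01-01 12:00"]
    ),
    "trip_miles": [15.0, 15.0, 12.0, 0.0, 18.0, 300.0],
    "fare": [60.0, 60.0, -5.0, 20.0, None, 900.0],
})
trips_clean = basic_trip_qc(trips_raw)
print(trips_clean)
```

Converting everything to one local clock before aggregating avoids assigning late‑evening weather or flights to the wrong calendar day.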

### 6.3 Phase 2 – Aggregation and feature engineering
- Aggregate trip‑level data to daily (and, if manageable, hourly) airport metrics.
- Construct external‑driver features and join into a **single time‑series panel** keyed by date (and airport).
- Build a small library of transformations (lags, rolling averages, ratios).
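
The join into a single panel might be sketched as follows, assuming both inputs are already aggregated to one row per (date, airport). The function name, column names, and rolling window are illustrative choices, not a fixed design.

```python
import pandas as pd

def build_panel(trips_daily: pd.DataFrame, weather_daily: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Left-join drivers onto trip metrics and add a per-airport rolling mean."""
    panel = trips_daily.merge(
        weather_daily,
        on=["date", "airport"],
        how="left",               # keep every trip day even if weather is missing
        validate="one_to_one",    # fail loudly if either side has duplicate keys
    )
    panel = panel.sort_values(["airport", "date"]).reset_index(drop=True)
    panel["trips_roll_mean"] = (
        panel.groupby("airport")["trips_airport_daily"]
             .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return panel

# Illustrative inputs (made-up values); weather is missing for 2023-01-02.
trips_daily = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "airport": ["JFK", "JFK", "JFK"],
    "trips_airport_daily": [100, 200, 300],
})
weather_daily = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-03"]),
    "airport": ["JFK", "JFK"],
    "weather_severity_index": [0.2, 1.8],
})
panel_joined = build_panel(trips_daily, weather_daily, window=2)
print(panel_joined)
```

The `validate` argument is a cheap guard against silently duplicating trip rows when a driver table has repeated keys.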

### 6.4 Phase 3 – EDA and interdependence
- Visual EDA:
  - Time‑series plots for trips, fares, and external drivers.
  - Scatter/heatmaps: e.g., weather severity vs. trips, delays vs. trips per passenger.
- Quantitative EDA:
  - Correlation matrices (with and without lags).
  - Simple baseline models: e.g., linear/regularized regression or gradient‑boosted trees to rank feature importance.
  - Structural break / regime‑change diagnostics on key series.
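
For the lagged correlation step, a simple starting point is to correlate the target at day t with a driver at day t − k for a few values of k. The sketch below uses a synthetic series where trips respond to the driver exactly one day later, so the peak should land at lag 1; the function name and data are illustrative, and structural‑break diagnostics are not covered here.

```python
import pandas as pd

def lagged_correlations(driver: pd.Series, target: pd.Series, max_lag: int = 3) -> pd.Series:
    """Pearson correlation of target(t) with driver(t - k) for k = 0..max_lag."""
    return pd.Series(
        {k: target.corr(driver.shift(k)) for k in range(max_lag + 1)},
        name="corr",
    )

# Synthetic example: trips respond to the driver with a one-day delay.
driver = pd.Series([0, 1, 0, 3, 0, 2, 1, 4, 0, 1], dtype=float)
target = 100 + 10 * driver.shift(1).fillna(0)

corrs = lagged_correlations(driver, target, max_lag=3)
print(corrs)
```

Correlation only ranks candidate lags; it says nothing about causality, which is why the plan pairs it with event‑vs‑control comparisons.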

### 6.5 Phase 4 – Surface opportunities & risks
- Define what counts as a **meaningful ‘exception’** (e.g., a day whose deviation from baseline falls in the top or bottom 5%, i.e., an unusual uplift or degradation).
- For each hypothesis, identify:
  - Conditions when performance materially improves (opportunity to **lean in**).
  - Conditions when performance materially worsens (risk to **mitigate**).
- Translate findings into potential levers:
  - Pricing / surge tuning.
  - Driver incentive design and pre‑positioning.
  - Rider messaging (e.g., airport‑specific notifications).
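
One way to operationalize the exception definition is to compare each day with a trailing baseline and flag the tails of the deviation distribution. The sketch below assumes a 28‑day trailing median as the baseline and 5% tails; both are placeholder choices, and the series is synthetic.

```python
import pandas as pd

def flag_exceptions(series: pd.Series, window: int = 28,
                    upper_q: float = 0.95, lower_q: float = 0.05) -> pd.DataFrame:
    """Flag days whose relative deviation from a trailing median sits in the tails."""
    # Shift by one day so the baseline never includes the day being judged.
    baseline = series.shift(1).rolling(window, min_periods=window // 2).median()
    deviation = (series - baseline) / baseline
    hi, lo = deviation.quantile(upper_q), deviation.quantile(lower_q)
    return pd.DataFrame({
        "value": series,
        "baseline": baseline,
        "deviation": deviation,
        "uplift": deviation >= hi,        # NaN deviations compare as False
        "degradation": deviation <= lo,
    })

# Synthetic daily trips: weekly wiggle plus one spike and one dip.
dates = pd.date_range("2023-01-01", periods=60, freq="D")
values = [1000.0 + 10 * (i % 7) for i in range(60)]
values[40] = 1600.0   # e.g., a storm-driven spike
values[50] = 500.0    # e.g., an outage-driven dip
daily_trips = pd.Series(values, index=dates)

exceptions = flag_exceptions(daily_trips)
print(exceptions[exceptions["uplift"] | exceptions["degradation"]])
```

A median baseline is used because it is less distorted by the very spikes it is meant to detect than a rolling mean would be.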

### 6.6 Phase 5 – Storyline and artifacts
- Distill to **2–3 killer charts** + a one‑page narrative:
  - “Here are the external signals that matter most for NYC airport performance.”
  - “Here’s how much money we could unlock or protect by acting on them.”
  - “Here’s how this generalizes across metrics and regions.”

## 7. Notebook & Module Roadmap

Planned artifacts in this repo:

1. `notebooks/01_project_plan.ipynb`  ← **(this notebook)**
   - Captures problem framing, hypotheses, and analysis plan.

2. `notebooks/02_data_ingest_and_qc.ipynb`
   - Prototype data loading and basic quality checks.
   - Output: cleaned, standardized daily airport panel saved to `data/processed/`.

3. `notebooks/03_eda_and_feature_ranking.ipynb`
   - Visual EDA and interdependence exploration.
   - Simple models to rank drivers of airport trips and fares.

4. `notebooks/04_opportunities_risks_and_scenarios.ipynb`
   - Define exception thresholds, scenario tests, and back‑of‑the‑envelope value estimates.
   - Draft talking points and charts for the interview.

5. `code/` modules (to be created as we go):
   - `data_io.py` – data loading and saving helpers.
   - `features_airport.py` – feature engineering for airport‑related metrics.
   - `plots.py` – standardized plotting helpers for time‑series and correlations.

The goal is to keep notebooks focused on *thinking and communication*, with reusable code moved into `code/`.

## 8. Risks, Limitations, and Extensions

- **No internal Uber data:** All results are directional and illustrative; precise elasticities and dollar values
  would require internal metrics (take rate, driver incentives, cancellations, etc.).
- **Data coverage & quality:** Public datasets may have gaps or coarse aggregation that limit granularity.
- **Time constraints:** For the interview, the priority is to show a **clear, defensible framework** and at least
  one well‑developed example, not to exhaust every dataset.

Potential extensions if time allows:
- Add **other airports and cities** (e.g., SFO in San Francisco, LHR in London) to show the framework scales.
- Layer on basic **profitability proxies** (e.g., revenue per trip × approximate take rate minus incentive proxy).
- Explore **clustering** of days (e.g., typical vs. storm vs. disruption regimes) as a bridge to the 100‑metrics problem.
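
The profitability proxy mentioned above could be as simple as the function below. Every input here is a placeholder: the take rate and per‑trip incentive are not public figures, so the numbers only illustrate the arithmetic.

```python
def contribution_proxy(trips: float, avg_fare: float,
                       take_rate: float, incentive_per_trip: float) -> float:
    """Revenue per trip x approximate take rate, minus an incentive proxy, summed over trips."""
    return trips * (avg_fare * take_rate - incentive_per_trip)

# Illustrative inputs only (not Uber data).
baseline_value = contribution_proxy(trips=1000, avg_fare=60.0, take_rate=0.25, incentive_per_trip=3.0)
print(baseline_value)
```

Running the same function on a baseline day and an exception day gives a back‑of‑the‑envelope delta to attach to each opportunity or risk.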

In [None]:
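# Quick sanity check: Python environment and paths
import sys
from pathlib import Path

# Assumes the notebook runs from notebooks/, so the repo root is one level up.
project_root = Path('..').resolve()
subdirs = sorted(p.name for p in project_root.iterdir() if p.is_dir())
print('Python version:', sys.version)
print('Project root:', project_root)
print('Subdirectories:', subdirs)

# Flag any folder from the planned repo layout (section 6.1) that is missing.
expected = {'data', 'notebooks', 'code', 'reports'}
print('Missing expected folders:', sorted(expected - set(subdirs)) or 'none')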