# Citi Bike NYC Expansion Dashboard - Data Collection & Processing

## Project Overview

**Objective:** Create a dashboard to guide Citi Bike's expansion by identifying optimal locations for new bike stations.

**Business Problem:** Bike shortages at popular stations indicate unmet demand. Strategic placement of new stations can maximize ROI on infrastructure investment.

**Data Sources:**
- Primary: Citi Bike trip data for all of 2022 (12 monthly CSV files)
- Supplementary: NOAA weather data from LaGuardia Airport weather station

**Research Questions:**
1. What are the most popular stations?
2. How does weather affect ridership patterns?
3. What are the most popular routes between stations?
4. Are existing stations evenly distributed?

**Methodology:**
- Concatenate the 12 monthly trip files into a single dataset
- Use the NOAA API to fetch the daily average temperature (TAVG) for 2022
- Merge trip data with weather data on the date field
- Export the cleaned dataset for visualization

**Author:** Saurabh Singh  
**Exercise:** 2.2 - Planning & Data Sourcing with APIs

---

## 1. Import Libraries

In [None]:
import pandas as pd
import requests
import json
from datetime import datetime
import glob
import os

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")

## 2. Load and Concatenate Citi Bike Data

**How this code works:**
1. `glob.glob()` finds all CSV files matching the pattern
2. List comprehension `[pd.read_csv(file) for file in all_files]` reads each file
3. `pd.concat()` stacks all DataFrames vertically (axis=0)
4. `ignore_index=True` discards the per-file row indexes and assigns one new sequential index

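This is more efficient than growing a DataFrame inside a loop (for example with `DataFrame.append()`, which was removed in pandas 2.0). Each loop step copies every row accumulated so far, whereas a single `pd.concat()` call copies the data once.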
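
The read-then-concatenate pattern can be tried on small in-memory frames before running it on the full files. The sketch below is only an illustration: the two frames and their station names are made-up stand-ins for monthly files, not real trip data.

```python
import pandas as pd

# Hypothetical stand-ins for two monthly trip files (values are made up).
jan = pd.DataFrame({"ride_id": ["a1", "a2"],
                    "start_station_name": ["Station A", "Station B"]})
feb = pd.DataFrame({"ride_id": ["b1"],
                    "start_station_name": ["Station C"]})

# Collect the frames in a list first, then concatenate once.
frames = [jan, feb]
combined = pd.concat(frames, axis=0, ignore_index=True)

# ignore_index=True replaces the per-frame indexes (0, 1 and 0) with 0..n-1.
print(combined.index.tolist())
print(len(combined))
```

Without `ignore_index=True`, the combined frame would keep the original labels (0, 1, 0), so label-based lookups such as `combined.loc[0]` would return more than one row.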

In [None]:
file_pattern = '/mnt/user-data/uploads/JC-*-cit*bike-tripdata.csv'
all_files = sorted(glob.glob(file_pattern))

print(f"Found {len(all_files)} files")

In [None]:
df = pd.concat([pd.read_csv(file) for file in all_files], axis=0, ignore_index=True)

print(f"Total rows: {len(df):,}")
print(f"Columns: {len(df.columns)}")

In [None]:
df.head()

## 3. Prepare Date Column

In [None]:
df['started_at'] = pd.to_datetime(df['started_at'])
df['date'] = df['started_at'].dt.date

print(f"Date range: {df['date'].min()} to {df['date'].max()}")
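
A small sketch of what the date step produces, using made-up timestamps rather than real trips: `.dt.date` drops the time of day and leaves plain Python `date` objects, so rides that start at different times on the same day share one key.

```python
import pandas as pd
from datetime import date

# Hypothetical start times (not real trip records).
trips = pd.DataFrame({"started_at": ["2022-01-01 08:15:00",
                                     "2022-01-01 23:59:59",
                                     "2022-01-02 00:00:01"]})

trips["started_at"] = pd.to_datetime(trips["started_at"])
trips["date"] = trips["started_at"].dt.date  # keep the calendar day only

# Three rides collapse onto two distinct calendar days.
print(trips["date"].nunique())
print(isinstance(trips["date"].iloc[0], date))
```

The key type matters for the later merge: the weather table needs to store its `date` column as the same kind of object, otherwise no rows will match.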

## 4. Fetch Weather Data from NOAA API

In [None]:
base_url = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data'
params = {
    'datasetid': 'GHCND',
    'stationid': 'GHCND:USW00014732',
    'startdate': '2022-01-01',
    'enddate': '2022-12-31',
    'datatypeid': 'TAVG',
    'units': 'metric',
    'limit': 1000
}
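# Read the token from an environment variable instead of hard-coding it.
# NOAA_API_TOKEN is an assumed variable name; set it before running this cell.
headers = {'token': os.environ['NOAA_API_TOKEN']}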

response = requests.get(base_url, params=params, headers=headers, timeout=30)
print(f"Status: {response.status_code}")
response.raise_for_status()  # stop on an HTTP error instead of parsing an error body
weather_data = response.json()

## 5. Process Weather Data

In [None]:
avg_temps = [item for item in weather_data['results'] if item['datatype'] == 'TAVG']
dates_temp = [item['date'] for item in avg_temps]
temps = [item['value'] for item in avg_temps]

df_weather = pd.DataFrame()
df_weather['date'] = [datetime.strptime(d, "%Y-%m-%dT%H:%M:%S").date() for d in dates_temp]
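# 'units': 'metric' is set in the request, so values already arrive in degrees Celsius;
# dividing by 10 is only needed for unscaled GHCND values (tenths of a degree).
df_weather['avgTemp'] = [float(v) for v in temps]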

df_weather.head()
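
The parsing step can be checked offline against a small payload. The sketch below assumes a response shaped like the fields read above (`results`, `date`, `datatype`, `value`). The `sample` dictionary and its temperatures are made up for illustration, and the values are stored as given with no scaling.

```python
from datetime import datetime, date
import pandas as pd

# Hypothetical payload mimicking only the fields this notebook reads.
sample = {
    "results": [
        {"date": "2022-01-01T00:00:00", "datatype": "TAVG", "value": 11.6},
        {"date": "2022-01-02T00:00:00", "datatype": "TAVG", "value": 11.4},
        {"date": "2022-01-02T00:00:00", "datatype": "TMAX", "value": 15.0},
    ]
}

# .get() returns an empty list when the 'results' key is absent,
# so the comprehension yields no rows rather than raising KeyError.
records = [r for r in sample.get("results", []) if r["datatype"] == "TAVG"]

weather = pd.DataFrame({
    "date": [datetime.strptime(r["date"], "%Y-%m-%dT%H:%M:%S").date() for r in records],
    "avgTemp": [float(r["value"]) for r in records],
})

print(weather)
```

Building the frame from a dictionary of lists in one call keeps the two columns aligned by construction, which is slightly safer than assigning them one at a time to an empty DataFrame.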

In [None]:
df_weather.to_csv('/home/claude/weather_data_2022.csv', index=False)
print("Weather data saved!")

## 6. Merge Datasets

In [None]:
%%time
df_merged = df.merge(df_weather, how='left', on='date', indicator=True)
print(f"Merged shape: {df_merged.shape}")

In [None]:
print("Merge quality:")
print(df_merged['_merge'].value_counts())
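
To show how the `_merge` column is read, here is a minimal sketch on made-up data. One trip falls on a day with no weather record, so it appears as `left_only`, while matched trips appear as `both`. All ride IDs, dates and temperatures below are hypothetical.

```python
import pandas as pd
from datetime import date

# Hypothetical trips: two on Jan 1, one on Jan 3.
trips = pd.DataFrame({
    "ride_id": ["a1", "a2", "a3"],
    "date": [date(2022, 1, 1), date(2022, 1, 1), date(2022, 1, 3)],
})
# Hypothetical weather: Jan 3 is deliberately missing.
weather = pd.DataFrame({
    "date": [date(2022, 1, 1), date(2022, 1, 2)],
    "avgTemp": [11.6, 11.4],
})

merged = trips.merge(weather, how="left", on="date", indicator=True)
counts = merged["_merge"].value_counts()

# A left merge keeps every trip; unmatched trips get NaN for avgTemp.
print(len(merged) == len(trips))
print(counts["both"], counts["left_only"])

# List the days that have trips but no weather record.
unmatched_dates = merged.loc[merged["_merge"] == "left_only", "date"].unique()
print(list(unmatched_dates))
```

In the real dataset, a nonzero `left_only` count would point to days missing from the weather response, and listing those dates is a quick way to spot where the gap is.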

## 7. Export Final Dataset

In [None]:
df_final = df_merged.drop(columns=['_merge'])
df_final.to_csv('/home/claude/merged_citibike_weather_2022.csv', index=False)
print("Export complete!")