# Citi Bike NYC Expansion Dashboard - Data Collection & Processing

## Project Overview

**Objective:** Create a strategic dashboard to guide Citi Bike's expansion strategy by identifying optimal locations for new bike stations.

**Business Problem:** Bike shortages at popular stations indicate unmet demand. Strategic placement of new stations can maximize ROI on infrastructure investment.

**Data Sources:**
- Primary: Citi Bike trip data for all of 2022 (12 monthly CSV files)
- Supplementary: NOAA weather data from LaGuardia Airport weather station

**Research Questions:**
1. What are the most popular stations?
2. How does weather affect ridership patterns?
3. What are the most popular routes between stations?
4. Are existing stations evenly distributed?

**Methodology:**
- Concatenate 12 monthly trip files into single dataset
- Use NOAA API to fetch daily temperature data for 2022
- Merge trip data with weather data on date field
- Export cleaned dataset for visualization

**Author:** Saurabh Singh  
**Exercise:** 2.2 - Planning & Data Sourcing with APIs

---

## 1. Import Libraries

In [1]:
import pandas as pd
import requests
import json
from datetime import datetime
import glob
import os

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")

Libraries imported successfully!
Pandas version: 3.0.0


## 2. Load and Concatenate Citi Bike Data

**How this code works:**
1. `glob.glob()` finds all CSV files matching the pattern
2. List comprehension `[pd.read_csv(file) for file in all_files]` reads each file
3. `pd.concat()` stacks all DataFrames vertically (axis=0)
4. `ignore_index=True` creates new sequential index

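This is more efficient than growing a DataFrame inside a loop, because `pd.concat()` copies the data once instead of creating an intermediate copy on every iteration. The older `DataFrame.append()` method that was often used for this was removed in pandas 2.0, so it is not available in the pandas version used here.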
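The effect of `ignore_index=True` is easiest to see on a small example. The sketch below uses two made-up frames in place of the monthly CSV files; the names `jan`, `feb`, `kept`, and `reset` are illustrative only.

```python
import pandas as pd

# Two toy "monthly" frames standing in for the real CSV files (hypothetical data).
jan = pd.DataFrame({"ride_id": ["a1", "a2"], "member_casual": ["member", "casual"]})
feb = pd.DataFrame({"ride_id": ["b1"], "member_casual": ["member"]})

# Collect the frames in a list first, then concatenate once.
frames = [jan, feb]
kept = pd.concat(frames, axis=0)                      # original row labels survive: 0, 1, 0
reset = pd.concat(frames, axis=0, ignore_index=True)  # fresh sequential labels: 0, 1, 2

print(kept.index.tolist())
print(reset.index.tolist())
```

Without `ignore_index=True` the row labels repeat across files, which can make label-based lookups such as `.loc[0]` return more than one row.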

In [2]:
file_pattern = 'data/JC-*-citibike-tripdata.csv'
all_files = sorted(glob.glob(file_pattern))

print(f"Found {len(all_files)} files")

Found 11 files


In [3]:
df = pd.concat([pd.read_csv(file) for file in all_files], axis=0, ignore_index=True)

print(f"Total rows: {len(df):,}")
print(f"Columns: {len(df.columns)}")

Total rows: 786,983
Columns: 13


In [4]:
df.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,CA5837152804D4B5,electric_bike,2022-01-26 18:50:39,2022-01-26 18:51:53,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,40.750604,-74.02402,40.750604,-74.02402,member
1,BA06A5E45B6601D2,classic_bike,2022-01-28 13:14:07,2022-01-28 13:20:23,Essex Light Rail,JC038,Essex Light Rail,JC038,40.712774,-74.036486,40.712774,-74.036486,member
2,7B6827D7B9508D93,classic_bike,2022-01-10 19:55:13,2022-01-10 20:00:37,Essex Light Rail,JC038,Essex Light Rail,JC038,40.712774,-74.036486,40.712774,-74.036486,member
3,6E5864EA6FCEC90D,electric_bike,2022-01-26 07:54:57,2022-01-26 07:55:22,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,40.750604,-74.02402,40.750604,-74.02402,member
4,E24954255BBDE32D,electric_bike,2022-01-13 18:44:46,2022-01-13 18:45:43,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,40.750604,-74.02402,40.750604,-74.02402,member


## 3. Prepare Date Column

In [5]:
df['started_at'] = pd.to_datetime(df['started_at'])
df['date'] = df['started_at'].dt.date

print(f"Date range: {df['date'].min()} to {df['date'].max()}")

Date range: 2022-01-01 to 2022-12-31
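The `date` column is the key for the merge in section 6, so its type matters: `.dt.date` yields plain Python `date` objects rather than pandas timestamps. A minimal sketch on two made-up rows (the frame name `trips` is illustrative):

```python
import datetime as dt
import pandas as pd

# Toy trip frame with the same timestamp format as the Citi Bike files.
trips = pd.DataFrame({"started_at": ["2022-01-26 18:50:39", "2022-01-28 13:14:07"]})
trips["started_at"] = pd.to_datetime(trips["started_at"])

# .dt.date drops the time of day and returns datetime.date objects.
trips["date"] = trips["started_at"].dt.date

first = trips["date"].iloc[0]
print(type(first), first)
```

Because the weather table also stores `datetime.date` values, the two key columns compare equal day by day.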


## 4. Fetch Weather Data from NOAA API

In [10]:
base_url = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data'
params = {
    'datasetid': 'GHCND',
    'stationid': 'GHCND:USW00014732',
    'startdate': '2022-01-01',
    'enddate': '2022-12-31',
    'datatypeid': 'TAVG',
    'units': 'metric',
    'limit': 1000
}
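# Read the API token from the environment instead of hardcoding it in the notebook.
# NOAA_TOKEN is an assumed variable name; set it before running this cell.
headers = {'token': os.environ['NOAA_TOKEN']}

response = requests.get(base_url, params=params, headers=headers, timeout=30)
print(f"Status: {response.status_code}")
response.raise_for_status()  # stop here if the request failed
weather_data = response.json()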

Status: 200


## 5. Process Weather Data

In [11]:
avg_temps = [item for item in weather_data['results'] if item['datatype'] == 'TAVG']
dates_temp = [item['date'] for item in avg_temps]
temps = [item['value'] for item in avg_temps]

df_weather = pd.DataFrame()
df_weather['date'] = [datetime.strptime(d, "%Y-%m-%dT%H:%M:%S").date() for d in dates_temp]
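# With 'units': 'metric' the API already returns TAVG in degrees Celsius,
# so the values must not be divided by 10 again.
df_weather['avgTemp'] = [float(v) for v in temps]

df_weather.head()

Unnamed: 0,date,avgTemp
0,2022-01-01,11.6
1,2022-01-02,11.4
2,2022-01-03,1.4
3,2022-01-04,-2.7
4,2022-01-05,3.2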


In [12]:
df_weather.to_csv('outputs/weather_data_2022.csv', index=False)
print("Weather data saved!")

Weather data saved!
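Before merging, it may be worth confirming that the weather table covers every day of the year, since a missing day would leave trips without a temperature. A minimal sketch, assuming a frame shaped like `df_weather`; the helper name `missing_days` and the toy temperatures are hypothetical:

```python
import datetime as dt
import pandas as pd

def missing_days(weather: pd.DataFrame, year: int) -> list:
    """Return the calendar days of `year` that have no row in `weather`."""
    expected = pd.date_range(f"{year}-01-01", f"{year}-12-31", freq="D").date
    present = set(weather["date"])
    return [d for d in expected if d not in present]

# Toy stand-in for df_weather with one day (Jan 3) deliberately left out.
# The temperatures are made up for illustration.
toy_weather = pd.DataFrame({
    "date": [dt.date(2022, 1, 1), dt.date(2022, 1, 2), dt.date(2022, 1, 4)],
    "avgTemp": [5.0, 4.0, -1.0],
})

gaps = missing_days(toy_weather, 2022)
has_duplicates = toy_weather["date"].duplicated().any()
print(f"{len(gaps)} days missing, first gap: {gaps[0]}")
print(f"Duplicate dates: {has_duplicates}")
```

Run against the real `df_weather`, an empty list and no duplicates would indicate one reading per day for the whole year.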


## 6. Merge Datasets

In [13]:
%%time
df_merged = df.merge(df_weather, how='left', on='date', indicator=True)
print(f"Merged shape: {df_merged.shape}")

Merged shape: (786983, 16)
CPU times: user 59.9 ms, sys: 27.4 ms, total: 87.2 ms
Wall time: 110 ms


In [14]:
print("Merge quality:")
print(df_merged['_merge'].value_counts())

Merge quality:
_merge
both          786983
left_only          0
right_only         0
Name: count, dtype: int64
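The `_merge` column only reports `both` for every row because each trip date has a matching weather row. The sketch below shows, on made-up data, what an unmatched day would look like; it also passes `validate="many_to_one"`, a standard `merge` option that raises an error if the weather table contains a date more than once. The frame names are illustrative.

```python
import datetime as dt
import pandas as pd

# Three toy trips over two days, but weather for the first day only.
trips = pd.DataFrame({
    "ride_id": ["r1", "r2", "r3"],
    "date": [dt.date(2022, 1, 1), dt.date(2022, 1, 1), dt.date(2022, 1, 2)],
})
weather = pd.DataFrame({"date": [dt.date(2022, 1, 1)], "avgTemp": [5.0]})

merged = trips.merge(
    weather,
    how="left",              # keep every trip, matched or not
    on="date",
    indicator=True,          # adds the _merge column
    validate="many_to_one",  # many trips per day, at most one weather row per day
)

counts = merged["_merge"].value_counts()
unmatched = merged[merged["_merge"] == "left_only"]
print(counts.to_dict())
print(unmatched["ride_id"].tolist())
```

Rows flagged `left_only` keep their trip columns but carry `NaN` in `avgTemp`, so they are easy to isolate before export.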


## 7. Export Final Dataset

In [16]:
df_final = df_merged.drop(columns=['_merge'])
# Export the merged frame (df_final), not the original trip-only frame (df)
df_final.to_csv('outputs/merged_citibike_weather_2022.csv', index=False)
print("Export complete!")

Export complete!
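When the exported CSV is read back for visualization, the `date` column arrives as text unless it is parsed explicitly. A minimal sketch of the round trip, using an in-memory buffer and made-up rows in place of the real file:

```python
import io
import pandas as pd

# Toy frame shaped like the exported dataset (hypothetical values).
exported = pd.DataFrame({
    "ride_id": ["r1", "r2"],
    "date": pd.to_datetime(["2022-01-01", "2022-01-02"]).date,
    "avgTemp": [5.0, 4.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
exported.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                         # date comes back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])  # date comes back as datetime64

print(plain["date"].dtype, parsed["date"].dtype)
print(list(parsed.columns))
```

Passing `index=False` on export also keeps the row index out of the file, so no extra unnamed column appears when the data is reloaded.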
