# Citi Bike NYC Expansion Dashboard - Data Collection & Processing

## Project Overview

**Objective:** Create a strategic dashboard to guide Citi Bike's expansion strategy by identifying optimal locations for new bike stations.

**Business Problem:** Bike shortages at popular stations indicate unmet demand. Strategic placement of new stations can maximize ROI on infrastructure investment.

**Data Sources:**
- Primary: Citi Bike trip data for all of 2022 (12 monthly CSV files)
- Supplementary: NOAA weather data from LaGuardia Airport weather station

**Research Questions:**
1. What are the most popular stations?
2. How does weather affect ridership patterns?
3. What are the most popular routes between stations?
4. Are existing stations evenly distributed?

**Methodology:**
- Concatenate the 12 monthly trip files into a single dataset
- Use the NOAA API to fetch daily temperature data for 2022
- Merge trip data with weather data on the date field
- Export cleaned dataset for visualization

**Author:** Saurabh Singh  
**Exercise:** 2.2 - Planning & Data Sourcing with APIs

---

## 1. Import Libraries

In [None]:
import pandas as pd
import requests
import json
from datetime import datetime
import os

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")

---

## 2. Load and Concatenate Citi Bike Data

### How this code works:

**Step 1: Create list of file paths using list comprehension**
- `os.listdir(folderpath)` returns the name of every entry in the `data` folder, in no guaranteed order
- The list comprehension loops through each name
- `os.path.join()` combines the folder path with each name to create a full file path
- Result: a list of paths to the 12 monthly CSV files, assuming the folder contains only those files

**Step 2: Read and concatenate files using a generator**
- A generator `(pd.read_csv(f) for f in filepaths)` is used instead of a list comprehension
- Generators use parentheses `()` while list comprehensions use brackets `[]`
- The generator avoids building a separate intermediate list, although `pd.concat()` still has to hold every DataFrame in memory to combine them, so the saving is small
- `pd.concat()` stacks all DataFrames vertically (axis=0)
- `ignore_index=True` creates a new sequential index for the combined dataset
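
The two steps can be tried on throwaway data before running them on the full trip files. The sketch below is only an illustration: it writes three tiny CSV files to a temporary folder (the file names and the `ride_id` and `month` values are made up), builds the path list with a list comprehension, and passes a generator to `pd.concat()`.

```python
import os
import tempfile

import pandas as pd

# Create a temporary folder holding three tiny CSV files (hypothetical sample data)
with tempfile.TemporaryDirectory() as folderpath:
    for month in ("01", "02", "03"):
        sample = pd.DataFrame(
            {"ride_id": [f"{month}-a", f"{month}-b"], "month": [int(month)] * 2}
        )
        sample.to_csv(os.path.join(folderpath, f"2022{month}-sample.csv"), index=False)

    # Step 1: list comprehension -> full paths, in sorted order, CSV files only
    filepaths = [
        os.path.join(folderpath, name)
        for name in sorted(os.listdir(folderpath))
        if name.endswith(".csv")
    ]

    # Step 2: generator expression -> one DataFrame per file, stacked row-wise
    combined = pd.concat((pd.read_csv(f) for f in filepaths), ignore_index=True)

print(combined.shape)           # rows from all files, same columns
print(combined.index.tolist())  # new sequential index from ignore_index=True
```

Without `ignore_index=True`, each file's rows would keep their own index starting at 0, so the combined index would contain duplicates.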

In [None]:
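# Step 1: Create a list with all CSV file paths using list comprehension
folderpath = "data"  # Folder containing the CSV files
# Sort the names for a stable month order and skip non-CSV entries (e.g. hidden system files)
filepaths = [os.path.join(folderpath, name) for name in sorted(os.listdir(folderpath)) if name.endswith('.csv')]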

print(f"Found {len(filepaths)} files")
print("Files:")
for file in filepaths:
    print(f"  - {file}")

In [None]:
# Step 2: Read and join all CSV files using a generator with pd.concat()
df = pd.concat((pd.read_csv(f) for f in filepaths), ignore_index=True)

print(f"Total rows: {len(df):,}")
print(f"Columns: {len(df.columns)}")
print(f"\nColumn names:")
print(df.columns.tolist())

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Display last few rows to verify all data was loaded
print("Last 5 rows:")
df.tail()

---

## 3. Obtain Weather Data from NOAA API

### NOAA API Configuration

We'll fetch daily average temperature data for 2022 from the LaGuardia Airport weather station.

**API Parameters:**
- Dataset: GHCND (Global Historical Climatology Network Daily)
- Station: GHCND:USW00014732 (LaGuardia Airport)
- Data Type: TAVG (Daily Average Temperature)
- Period: January 1, 2022 - December 31, 2022
- Limit: 1000 records (maximum allowed)
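
As a rough sketch of how these parameters map onto the request, the query string can be assembled from a dictionary instead of being typed into one long URL. The example uses only the standard library; the names `BASE_URL` and `params` are illustrative and not part of the NOAA API.

```python
from urllib.parse import urlencode

# Endpoint used by this notebook for NOAA Climate Data Online requests
BASE_URL = "https://www.ncdc.noaa.gov/cdo-web/api/v2/data"

# One entry per API parameter listed above
params = {
    "datasetid": "GHCND",
    "datatypeid": "TAVG",
    "limit": 1000,
    "stationid": "GHCND:USW00014732",
    "startdate": "2022-01-01",
    "enddate": "2022-12-31",
}

# safe=":" keeps the colon in the station ID readable instead of encoding it as %3A
url = f"{BASE_URL}?{urlencode(params, safe=':')}"
print(url)
```

When using `requests`, the same dictionary can instead be passed through the `params=` argument of `requests.get()`, which builds the query string for you.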

In [None]:
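# Read your NOAA token from an environment variable instead of hard-coding it in the notebook
# (NOAA_TOKEN is an assumed variable name; set it to the token from the NOAA email)
Token = os.environ.get('NOAA_TOKEN', '')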

# LaGuardia Airport weather station ID for New York
station_id = 'GHCND:USW00014732'

# Compile the API URL with all parameters
url = f'https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&datatypeid=TAVG&limit=1000&stationid={station_id}&startdate=2022-01-01&enddate=2022-12-31'

print(f"Requesting weather data from NOAA API...")
print(f"Station: LaGuardia Airport ({station_id})")
print(f"Period: 2022-01-01 to 2022-12-31")

In [None]:
# Make the API request with authentication token
r = requests.get(url, headers={'token': Token})

# Check if request was successful
if r.status_code == 200:
    print("✓ API request successful!")
else:
    print(f"✗ API request failed with status code: {r.status_code}")
    print(f"Response: {r.text}")

In [None]:
# Load the API response as JSON
d = json.loads(r.text)

print(f"JSON loaded successfully")
print(f"Number of results: {len(d.get('results', []))}")

### Wrangle Weather Data

Extract only the temperature values and dates from the JSON response using list comprehensions:

In [None]:
# Select all items in the response that correspond to TAVG (average temperature)
avg_temps = [item for item in d['results'] if item['datatype']=='TAVG']

# Get only the date field from all average temperature readings
dates_temp = [item['date'] for item in avg_temps]

# Get the temperature values from all average temperature readings
temps = [item['value'] for item in avg_temps]

print(f"Extracted {len(dates_temp)} temperature records")
print(f"Date range: {dates_temp[0]} to {dates_temp[-1]}")

### Create Weather DataFrame

Convert the extracted data into a clean DataFrame:
- Parse dates from ISO format (YYYY-MM-DDTHH:MM:SS) to datetime
- Convert temperature from tenths of Celsius to Celsius (NOAA stores temps × 10)
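
The extraction and both conversions can be checked on a small hand-written sample first. The records below are hypothetical: they only mimic the shape of entries in the API's `results` list, and the values are not real readings.

```python
from datetime import datetime

import pandas as pd

# Hypothetical records shaped like items in the API response's "results" list
results = [
    {"date": "2022-01-01T00:00:00", "datatype": "TAVG", "value": 116},
    {"date": "2022-01-02T00:00:00", "datatype": "TAVG", "value": 114},
    {"date": "2022-01-02T00:00:00", "datatype": "TMAX", "value": 150},
]

# Keep only the average-temperature records
avg_temps = [item for item in results if item["datatype"] == "TAVG"]

df_sample = pd.DataFrame(
    {
        # ISO timestamp string -> datetime (time stays at midnight)
        "date": [datetime.strptime(item["date"], "%Y-%m-%dT%H:%M:%S") for item in avg_temps],
        # Tenths of a degree Celsius -> degrees Celsius
        "avgTemp": [item["value"] / 10.0 for item in avg_temps],
    }
)
print(df_sample)
```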

In [None]:
# Create empty dataframe
df_weather = pd.DataFrame()

# Convert date strings to datetime objects (the time component is midnight for every record)
df_weather['date'] = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S") for s in dates_temp]

# Convert temperature from tenths of Celsius to Celsius by dividing by 10
df_weather['avgTemp'] = [float(v)/10.0 for v in temps]

print("Weather data DataFrame created successfully!")
print(f"Shape: {df_weather.shape}")

In [None]:
# Display first few rows
print("First 5 rows of weather data:")
df_weather.head()

In [None]:
# Display last few rows
print("Last 5 rows of weather data:")
df_weather.tail()

In [None]:
# Export weather data to CSV (create the output folder first if it does not exist)
os.makedirs('outputs', exist_ok=True)
df_weather.to_csv('outputs/weather_data_2022.csv', index=False)
print("Weather data saved to: outputs/weather_data_2022.csv")

---

## 4. Merge Citi Bike Data with Weather Data

### Prepare Date Columns for Merge

Before merging, we need to ensure both datasets have matching date formats:

In [None]:
# Convert started_at to datetime format
df['started_at'] = pd.to_datetime(df['started_at'])

# Extract date component (without time) for merging
df['date'] = df['started_at'].dt.date

# Convert weather date to date format (remove time component) for matching
df_weather['date'] = pd.to_datetime(df_weather['date']).dt.date

print("Date columns prepared for merge")
print(f"\nCiti Bike date range: {df['date'].min()} to {df['date'].max()}")
print(f"Weather date range: {df_weather['date'].min()} to {df_weather['date'].max()}")

### Perform the Merge

Merge the datasets using a left join on the date column:
- Left join ensures all Citi Bike trips are retained
- Weather data is matched by date
- `indicator=True` adds a `_merge` column recording whether each row was matched in both tables (`both`) or found only in the trip data (`left_only`)
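
A toy example may make the left join and the indicator column easier to picture. The two small frames below are made up for illustration: one trip falls on a date with no weather record, so it should come back flagged as `left_only` with a missing temperature.

```python
from datetime import date

import pandas as pd

# Hypothetical trips: two on 1 January, one on 3 January
trips = pd.DataFrame(
    {
        "ride_id": ["a", "b", "c"],
        "date": [date(2022, 1, 1), date(2022, 1, 1), date(2022, 1, 3)],
    }
)

# Hypothetical weather: no record for 3 January
weather = pd.DataFrame(
    {
        "date": [date(2022, 1, 1), date(2022, 1, 2)],
        "avgTemp": [11.6, 11.4],
    }
)

# Left join keeps every trip; _merge records where each row was found
merged = trips.merge(weather, how="left", on="date", indicator=True)
counts = merged["_merge"].value_counts()
match_pct = (merged["_merge"] == "both").mean() * 100

print(merged)
print(counts)
print(f"Match rate: {match_pct:.2f}%")
```

Any `left_only` rows in the real merge would point to dates missing from the weather data, which is what the verification step checks for.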

In [None]:
%%time
# Merge datasets on date field using left join
df_merged = df.merge(df_weather, how='left', on='date', indicator=True)

print(f"\nMerged dataset shape: {df_merged.shape}")
print(f"Columns: {df_merged.columns.tolist()}")

### Verify Merge Quality

Check the merge indicator to ensure all records matched successfully:

In [None]:
print("Merge quality:")
print(df_merged['_merge'].value_counts())

# Calculate match percentage
total_rows = len(df_merged)
both_count = (df_merged['_merge'] == 'both').sum()
match_pct = (both_count / total_rows) * 100

print(f"\nMatch rate: {match_pct:.2f}%")

In [None]:
# Drop the merge indicator column (cleanup)
df_merged = df_merged.drop(columns=['_merge'])

print("Merge indicator column removed")
print(f"Final dataset shape: {df_merged.shape}")

In [None]:
# Display sample of merged data
print("Sample of merged dataset:")
df_merged.head(10)

---

## 5. Export Final Dataset

In [None]:
# Export merged dataset to CSV
df_merged.to_csv('outputs/merged_citibike_weather_2022.csv', index=False)

print("✓ Export complete!")
print(f"\nFiles created:")
print(f"  1. outputs/merged_citibike_weather_2022.csv ({len(df_merged):,} rows)")
print(f"  2. outputs/weather_data_2022.csv ({len(df_weather):,} rows)")
print(f"\nDataset ready for visualization!")

---

## Summary

This notebook successfully:
1. ✓ Loaded and concatenated 12 months of Citi Bike trip data using list comprehensions and generators
2. ✓ Fetched daily temperature data from NOAA API for LaGuardia Airport
3. ✓ Merged trip data with weather data on the date field with a 100% match rate
4. ✓ Exported cleaned datasets for dashboard visualization

**Next Steps:**
- Create visualizations to answer research questions
- Identify most popular stations and routes
- Analyze weather impact on ridership
- Develop expansion recommendations