# Citi Bike June 2024 – Data Exploration

In [1]:
import pandas as pd
from pathlib import Path

raw_dir = Path("../data/raw")
files = sorted(raw_dir.glob("202406-citibike-tripdata_*.csv"))
files


[PosixPath('../data/raw/202406-citibike-tripdata_1.csv'),
 PosixPath('../data/raw/202406-citibike-tripdata_2.csv'),
 PosixPath('../data/raw/202406-citibike-tripdata_3.csv'),
 PosixPath('../data/raw/202406-citibike-tripdata_4.csv'),
 PosixPath('../data/raw/202406-citibike-tripdata_5.csv')]

In [2]:
df = pd.read_csv(files[0])
df.head()



,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,1C6A640ACEFB0795,classic_bike,2024-06-14 08:41:33.060,2024-06-14 08:51:11.113,E 85 St & York Ave,7146.04,E 63 St & 3 Ave,6830.02,40.775369,-73.948034,40.763954,-73.9646,member
1,6982C5274D493834,electric_bike,2024-06-13 20:11:55.677,2024-06-13 20:18:51.570,6 Ave & Broome St,5610.09,Bank St & Washington St,5964.01,40.72431,-74.00473,40.736197,-74.008592,member
2,235EDC45BD2151E4,electric_bike,2024-06-12 12:44:45.109,2024-06-12 12:55:27.631,6 Ave & Broome St,5610.09,E 17 St & 2 Ave,5896.01,40.72431,-74.00473,40.734312,-73.983725,member
3,62586586291415AC,classic_bike,2024-06-11 12:59:14.679,2024-06-11 13:20:40.846,Fulton St & Waverly Ave,4345.11,Atlantic Ave & Furman St,4614.04,40.683239,-73.965996,40.691652,-73.999979,member
4,746A5E2469FB7DD9,electric_bike,2024-06-11 19:14:36.519,2024-06-11 19:37:54.512,Carlton Ave & Dean St,4199.12,Canal St & Rutgers St,5303.08,40.680974,-73.97101,40.714275,-73.9899,member


In [3]:
list(df.columns)


['ride_id',
 'rideable_type',
 'started_at',
 'ended_at',
 'start_station_name',
 'start_station_id',
 'end_station_name',
 'end_station_id',
 'start_lat',
 'start_lng',
 'end_lat',
 'end_lng',
 'member_casual']

## Available Columns

- ride_id
- rideable_type
- started_at
- ended_at
- start_station_name
- start_station_id
- end_station_name
- end_station_id
- start_lat, start_lng
- end_lat, end_lng
- member_casual


In [4]:
df["started_at"] = pd.to_datetime(df["started_at"])
df["ended_at"] = pd.to_datetime(df["ended_at"])

df["ride_duration_min"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60

df["ride_duration_min"].describe()


count    1000000.000000
mean          16.129550
std           48.297100
min            0.106417
25%            5.671150
50%           10.042025
75%           17.686100
max         1501.578183
Name: ride_duration_min, dtype: float64

## Ride Duration Summary (Single CSV Sample)

- Median ride duration is ~10 minutes; the mean (~16 minutes) is pulled up by long rides
- 75% of rides are under ~18 minutes
- Long right tail: the maximum is ~1,500 minutes (about 25 hours)
- Duration outliers should be capped or filtered before modeling
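
The capping and filtering mentioned above could look roughly like the sketch below. The 1-minute lower bound and the 99th-percentile cap are assumed thresholds chosen for illustration, not values taken from this notebook, and the small DataFrame is synthetic stand-in data for `df`.

```python
import pandas as pd

# Synthetic stand-in for df["ride_duration_min"]; values are illustrative only.
sample = pd.DataFrame(
    {"ride_duration_min": [0.1, 5.7, 10.0, 17.7, 45.0, 1501.6]}
)

MIN_DURATION_MIN = 1.0  # assumed lower bound: drops very short rides
UPPER_QUANTILE = 0.99   # assumed quantile used as the upper cap

# Filter: keep only rides at or above the assumed minimum duration.
filtered = sample[sample["ride_duration_min"] >= MIN_DURATION_MIN].copy()

# Cap: clip the long right tail at the chosen quantile of the filtered data.
cap = filtered["ride_duration_min"].quantile(UPPER_QUANTILE)
filtered["ride_duration_capped"] = filtered["ride_duration_min"].clip(upper=cap)

print(filtered)
```

Capping keeps the long rides in the data while limiting their influence. Dropping them instead would be the alternative if they are believed to be docking errors rather than real trips.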


In [5]:
df["hour"] = df["started_at"].dt.hour
df["weekday"] = df["started_at"].dt.weekday  # Monday=0

df[["hour", "weekday"]].describe()


,hour,weekday
count,1000000.0,1000000.0
mean,14.200092,2.979914
std,5.138071,1.980651
min,0.0,0.0
25%,10.0,1.0
50%,15.0,3.0
75%,18.0,5.0
max,23.0,6.0


## Temporal Patterns (Sample)

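- Mean start hour is ~14.2 and the median is 15, so ride starts center on mid-afternoon
- Weekday mean (~2.98) and standard deviation (~1.98) are close to what an even spread over 0–6 would give (mean 3, std 2), which suggests rides are distributed fairly evenly across weekdays; per-day counts would confirm this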
- Hour-of-day and weekday are strong candidate features for demand forecasting
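
Summary statistics only hint at these patterns, so a direct count per hour and per weekday may be more informative. The sketch below uses a handful of synthetic timestamps as a stand-in for `df["started_at"]`; the names `rides_per_hour` and `weekday_share` are illustrative.

```python
import pandas as pd

# Synthetic stand-in for the notebook's started_at column.
starts = pd.to_datetime(pd.Series([
    "2024-06-10 08:15:00",  # Monday
    "2024-06-10 08:45:00",  # Monday
    "2024-06-11 17:30:00",  # Tuesday
    "2024-06-15 14:05:00",  # Saturday
    "2024-06-16 14:50:00",  # Sunday
]))
sample = pd.DataFrame({"started_at": starts})
sample["hour"] = sample["started_at"].dt.hour
sample["weekday"] = sample["started_at"].dt.weekday  # Monday=0

# Rides per hour of day; reindex so hours with no rides appear as 0.
rides_per_hour = (
    sample["hour"].value_counts().reindex(range(24), fill_value=0)
)

# Share of rides per weekday; reindex so missing weekdays appear as 0.0.
weekday_share = (
    sample["weekday"]
    .value_counts(normalize=True)
    .reindex(range(7), fill_value=0.0)
)

print(rides_per_hour)
print(weekday_share)
```

On the real data, roughly equal values in `weekday_share` would support the "fairly even" reading, while peaks in `rides_per_hour` would show where demand concentrates.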


## Modeling Goal (Next)

Objective:
- Predict hourly bike demand per station

Target variable:
- Number of rides starting at a station per hour

Granularity:
- (start_station_id, hour): one row per start station per clock hour

Next steps:
- Aggregate rides by station and hour
- Create time-based features
- Train a baseline demand forecasting model
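
The next steps listed above might be sketched as follows. This is only an outline under stated assumptions: the rides are synthetic, station IDs are treated as strings, and the "baseline" is a simple historical mean per station and hour of day rather than a trained model. The names `hour_ts`, `ride_count`, and `baseline_pred` are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the trip data (station IDs assumed to be strings).
rides = pd.DataFrame({
    "start_station_id": ["5610.09", "5610.09", "5610.09", "7146.04", "7146.04"],
    "started_at": pd.to_datetime([
        "2024-06-10 08:05:00", "2024-06-10 08:40:00", "2024-06-10 09:10:00",
        "2024-06-10 08:20:00", "2024-06-11 08:30:00",
    ]),
})

# Step 1: aggregate rides by station and hour.
# Truncate each start time to the top of its hour, then count rows per group.
rides["hour_ts"] = rides["started_at"].dt.floor("h")
demand = (
    rides.groupby(["start_station_id", "hour_ts"])
    .size()
    .rename("ride_count")
    .reset_index()
)

# Step 2: create time-based features from the hourly timestamp.
demand["hour"] = demand["hour_ts"].dt.hour
demand["weekday"] = demand["hour_ts"].dt.weekday  # Monday=0

# Step 3: naive baseline, the mean ride count per station and hour of day.
baseline = (
    demand.groupby(["start_station_id", "hour"])["ride_count"]
    .mean()
    .rename("baseline_pred")
    .reset_index()
)

print(demand)
print(baseline)
```

Note that `groupby(...).size()` only produces rows for station-hours with at least one ride. Hours with zero demand would need to be filled in explicitly (for example by reindexing against a full station-by-hour grid) before the counts are used as a forecasting target.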
