# Data Processing

## Bus Data

### Downloading the Bus Data

This downloads about 10 GB of data and may therefore take a while.

In [None]:
import downloader

downloader.download_bus_data()

### Downloading the Bus Schedule Data

Compressed bus schedule data conforming to the GTFS Static standard can be found at the following:
- https://mobilitydatabase.org/feeds/gtfs/mdb-50
- https://transitfeeds.com/p/sfmta/60

The schedule currently in use can be found at: 
- https://www.sfmta.com/reports/gtfs-transit-data

To add a schedule to the project:
1. Download the static GTFS zip file.
2. Decompress the zip file into a directory.
3. Name the directory according to the start and end dates in the `calendar.txt` file within (e.g. if the dates are `20250315` and `20250509`, name the directory `2025-03-15_2025-05-09`).
4. Place the directory within `data/bus-static/sf/`.
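
The steps above can be sketched in Python as follows. This is only an illustration, not part of the project's code: `install_gtfs_schedule` is a hypothetical helper, it assumes `calendar.txt` sits at the top level of the zip, and it assumes that when `calendar.txt` lists several services the directory should span the earliest `start_date` and the latest `end_date`.

```python
import csv
import io
import zipfile
from pathlib import Path


def install_gtfs_schedule(zip_path, dest_root="data/bus-static/sf"):
    """Hypothetical helper: extract a static GTFS zip into a directory
    named after the date range found in its calendar.txt."""
    with zipfile.ZipFile(zip_path) as archive:
        # utf-8-sig drops the byte-order mark that some feeds put in front of the header
        with archive.open("calendar.txt") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            rows = list(csv.DictReader(text))
        if not rows:
            raise ValueError("calendar.txt contains no service rows")

        # GTFS dates are YYYYMMDD strings, so plain string comparison orders them.
        # Assumption: use the earliest start and the latest end across all rows.
        start = min(row["start_date"].strip() for row in rows)
        end = max(row["end_date"].strip() for row in rows)

        def dashed(date):
            return f"{date[:4]}-{date[4:6]}-{date[6:]}"

        target = Path(dest_root) / f"{dashed(start)}_{dashed(end)}"
        target.mkdir(parents=True, exist_ok=True)
        archive.extractall(target)
    return target
```

Calling `install_gtfs_schedule("feed.zip")` on a feed whose dates are `20250315` and `20250509` would create `data/bus-static/sf/2025-03-15_2025-05-09`.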

We used the following schedules: 
- `2024-01-20_2024-03-01`
- `2024-03-02_2024-05-24`
- `2024-06-08_2024-08-16`
- `2024-06-22_2024-08-16`
- `2025-03-15_2025-05-09`

### Processing the Data
This will take a while, depending on your computer's processing power. Expect about 1 minute per processed day (a few hours in total).

In [None]:
import bus_processing

bus_processing.process_all()
bus_processing.combine_bus_aggregates()

Once the above is complete, we can load the resulting aggregate data:

In [1]:
from IPython.display import display
import bus_processing

bus_df = bus_processing.load_bucket_statistics()
display(bus_df)

| | time_bucket | delay_total | late_5_min | early_5_min | count |
|---|---|---|---|---|---|
| 0 | 2024-01-20 04:00:00-08:00 | 0 | 0 | 0 | 1 |
| 1 | 2024-01-20 04:20:00-08:00 | -5 | 0 | 0 | 6 |
| 2 | 2024-01-20 04:40:00-08:00 | -14 | 0 | 1 | 34 |
| 3 | 2024-01-20 05:00:00-08:00 | -6 | 1 | 2 | 93 |
| 4 | 2024-01-20 05:20:00-08:00 | -41 | 2 | 5 | 142 |
| ... | ... | ... | ... | ... | ... |
| 16019 | 2025-04-14 09:20:00-07:00 | 387 | 59 | 30 | 570 |
| 16020 | 2025-04-14 09:40:00-07:00 | 411 | 60 | 24 | 554 |
| 16021 | 2025-04-14 10:00:00-07:00 | 543 | 56 | 31 | 550 |
| 16022 | 2025-04-14 10:20:00-07:00 | 513 | 71 | 32 | 541 |
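
Because each row holds totals for one 20-minute bucket, per-observation rates can be derived by dividing by `count`. The sketch below uses the last four rows shown above. It assumes that `delay_total` is a sum of delays over the `count` observations in the bucket and that `late_5_min` and `early_5_min` are counts of observations; these meanings are inferred from the column names, and `add_rates` is a hypothetical helper rather than a function of `bus_processing`.

```python
import pandas as pd

# Rows copied from the displayed output; column meanings are assumed, not confirmed.
sample = pd.DataFrame(
    {
        "time_bucket": [
            "2025-04-14 09:20:00-07:00",
            "2025-04-14 09:40:00-07:00",
            "2025-04-14 10:00:00-07:00",
            "2025-04-14 10:20:00-07:00",
        ],
        "delay_total": [387, 411, 543, 513],
        "late_5_min": [59, 60, 56, 71],
        "early_5_min": [30, 24, 31, 32],
        "count": [570, 554, 550, 541],
    }
)


def add_rates(df):
    """Hypothetical helper: turn per-bucket totals into per-observation rates."""
    out = df.copy()
    out["mean_delay"] = out["delay_total"] / out["count"]
    out["late_share"] = out["late_5_min"] / out["count"]
    out["early_share"] = out["early_5_min"] / out["count"]
    return out


rates = add_rates(sample)
print(rates[["time_bucket", "mean_delay", "late_share", "early_share"]])
```

Buckets with very small counts, such as the early-morning rows, would give noisy rates, so a minimum-count filter may be worth applying before any comparison.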


## Calendar Data

In [2]:
from IPython.display import display
import calendar_processing

calendar_df = calendar_processing.build_calendar()
display(calendar_df)

| date | holiday | monday | tuesday | wednesday | thursday | friday | saturday | sunday |
|---|---|---|---|---|---|---|---|---|
| 2023-01-01 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2023-01-02 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2023-01-03 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2023-01-04 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2023-01-05 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-04-10 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2025-04-11 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2025-04-12 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2025-04-13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
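
A table of this shape, one row per day with a holiday flag and one-hot weekday columns, could be produced along the following lines. This is a sketch and not the actual `calendar_processing.build_calendar` implementation: `build_calendar_sketch` is a hypothetical name, and the holiday dates are passed in by the caller because the source of the project's holiday list is not stated here.

```python
import pandas as pd

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def build_calendar_sketch(start, end, holidays):
    """Hypothetical helper: daily calendar with a holiday flag and one-hot weekdays."""
    dates = pd.date_range(start, end, freq="D", name="date")
    calendar = pd.DataFrame(index=dates)
    # 1 if the date appears in the caller-supplied holiday list, else 0
    calendar["holiday"] = dates.isin(pd.to_datetime(list(holidays))).astype(int)
    # pandas numbers weekdays from Monday=0 to Sunday=6
    for number, name in enumerate(WEEKDAYS):
        calendar[name] = (dates.dayofweek == number).astype(int)
    return calendar


# The two holiday dates below mirror the first rows of the output above.
example = build_calendar_sketch("2023-01-01", "2023-01-05", ["2023-01-01", "2023-01-02"])
print(example)
```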


## Weather Data

In [3]:
import meteostat as mt
import datetime
from IPython.display import display

start = datetime.datetime(2023, 1, 1)
end = datetime.datetime(2025, 4, 14)
sf = 72494
weather_df = mt.Hourly(sf, start, end, "America/Los_Angeles")
weather_df = weather_df.fetch()

# Drop columns where more than 10% of the measurements are missing
weather_df = weather_df.drop(weather_df.columns[weather_df.isnull().sum()/len(weather_df) > 0.1], axis=1)

# There are small gaps in most features; assume conditions remain unchanged since last measurement
weather_df = weather_df.ffill()

display(weather_df)

| time | temp | dwpt | rhum | prcp | wdir | wspd | pres | coco |
|---|---|---|---|---|---|---|---|---|
| 2023-01-01 00:00:00-08:00 | 12.2 | 8.3 | 77.0 | 0.0 | 270.0 | 25.9 | 1008.3 | 2.0 |
| 2023-01-01 01:00:00-08:00 | 11.1 | 8.3 | 83.0 | 0.0 | 260.0 | 16.6 | 1008.2 | 2.0 |
| 2023-01-01 02:00:00-08:00 | 10.6 | 7.8 | 83.0 | 0.0 | 240.0 | 11.2 | 1008.6 | 2.0 |
| 2023-01-01 03:00:00-08:00 | 11.1 | 7.2 | 77.0 | 0.0 | 250.0 | 22.3 | 1009.2 | 2.0 |
| 2023-01-01 04:00:00-08:00 | 10.6 | 6.7 | 77.0 | 0.0 | 250.0 | 33.5 | 1009.7 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-04-13 20:00:00-07:00 | 15.0 | 6.8 | 58.0 | 0.0 | 290.0 | 24.1 | 1013.9 | 2.0 |
| 2025-04-13 21:00:00-07:00 | 15.0 | 6.0 | 55.0 | 0.0 | 300.0 | 20.5 | 1014.2 | 2.0 |
| 2025-04-13 22:00:00-07:00 | 13.9 | 7.2 | 64.0 | 0.0 | 290.0 | 22.3 | 1014.0 | 2.0 |
| 2025-04-13 23:00:00-07:00 | 13.3 | 6.6 | 64.0 | 0.0 | 300.0 | 9.4 | 1014.1 | 2.0 |
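
The two cleaning steps in the cell above (dropping sparse columns, then forward-filling the remaining gaps) can be checked on a small synthetic frame without calling Meteostat. The data below is made up for illustration, and `drop_sparse_columns` is a hypothetical helper that restates the same 10% threshold.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.1):
    """Hypothetical helper: keep columns whose share of missing values is at most max_missing."""
    missing_share = df.isnull().mean()  # same as isnull().sum() / len(df)
    return df.loc[:, missing_share <= max_missing]


# Synthetic hourly data: 'snow' is entirely missing, 'temp' has a single gap.
index = pd.date_range("2023-01-01", periods=20, freq="h", tz="America/Los_Angeles")
toy = pd.DataFrame(
    {"temp": np.linspace(10.0, 12.0, 20), "snow": np.nan, "prcp": 0.0},
    index=index,
)
toy.iloc[3, toy.columns.get_loc("temp")] = np.nan  # 1 of 20 values missing (5%)

# 'snow' exceeds the threshold and is dropped; the 'temp' gap is forward-filled.
cleaned = drop_sparse_columns(toy).ffill()
print(cleaned.head())
```

Forward-filling never fills a gap at the very start of a series, so a column whose first value is missing would still contain `NaN` after this step.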


## Combining the Data