# <a id='toc1_'></a>[Data Processing](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Processing](#toc1_)    
  - [Preface](#toc1_1_)    
  - [Bus Data](#toc1_2_)    
    - [Downloading the Bus Data](#toc1_2_1_)    
    - [Downloading the Bus Schedule Data](#toc1_2_2_)    
    - [Processing the Data](#toc1_2_3_)    
    - [Loading the Data](#toc1_2_4_)    
  - [Calendar Data](#toc1_3_)    
  - [Weather Data](#toc1_4_)    
  - [Combining the Data](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Preface](#toc0_)
You do not need to run the sections [Downloading the Bus Data](#downloading-the-bus-data), [Downloading the Bus Schedule Data](#downloading-the-bus-schedule-data), and [Processing the Data](#processing-the-data) if you want to use our dataset, because their end products (the bus-aggregates CSV files) are included in the repository. Run these sections only if you wish to bring the bus data up to date or use other bus schedule data.

The simpler version of our dataset (with all buses aggregated regardless of route) is included in the repository at `data/dataset.csv`. The more comprehensive version of our dataset (with bus aggregations separated by route) is too large to include, so you must run all code cells starting from the section [Loading the Data](#loading-the-data) to generate it.

## <a id='toc1_2_'></a>[Bus Data](#toc0_)

### <a id='toc1_2_1_'></a>[Downloading the Bus Data](#toc0_)

The cell below downloads about 10 GB of data from the [Bus Observatory](https://api.busobservatory.org/), so it may take a while.

In [None]:
import downloader

downloader.download_bus_data()

### <a id='toc1_2_2_'></a>[Downloading the Bus Schedule Data](#toc0_)

Compressed bus schedule data conforming to the GTFS Static standard can be found at the following:
- https://mobilitydatabase.org/feeds/gtfs/mdb-50
- https://transitfeeds.com/p/sfmta/60

The schedule currently in use can be found at: 
- https://www.sfmta.com/reports/gtfs-transit-data

To add a schedule to the project:
1. Download the static GTFS zip file.
2. Decompress the zip file as a directory.
3. Name the directory according to the start and end dates in the `calendar.txt` file within (e.g. if the dates are `20250315` and `20250509`, name the directory `2025-03-15_2025-05-09`).
4. Place the directory within `data/bus-static/sf/`.
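
The naming rule in step 3 can be sketched in code. This is a minimal illustration, not part of the project: the helper `schedule_dir_name` is hypothetical, and it assumes the directory should span the earliest `start_date` and the latest `end_date` across all rows of `calendar.txt` (GTFS stores both as `YYYYMMDD` strings). When reading a real file, opening it with `encoding="utf-8-sig"` guards against a leading byte-order mark.

```python
import csv
import io
from datetime import datetime


def schedule_dir_name(calendar_txt: str) -> str:
    """Build a `YYYY-MM-DD_YYYY-MM-DD` directory name from calendar.txt contents."""
    rows = list(csv.DictReader(io.StringIO(calendar_txt)))
    # YYYYMMDD strings sort chronologically, so min/max work on the raw text.
    start = min(row["start_date"] for row in rows)
    end = max(row["end_date"] for row in rows)

    def dashed(date: str) -> str:
        return datetime.strptime(date, "%Y%m%d").strftime("%Y-%m-%d")

    return f"{dashed(start)}_{dashed(end)}"


# Made-up calendar.txt contents for illustration only.
sample = (
    "service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date\n"
    "1,1,1,1,1,1,0,0,20250315,20250509\n"
    "2,0,0,0,0,0,1,0,20250315,20250509\n"
)
print(schedule_dir_name(sample))  # 2025-03-15_2025-05-09
```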

We used the following schedules:
- `2023-11-20_2024-02-02`
- `2023-12-23_2024-02-02`
- `2024-01-20_2024-03-01`
- `2024-03-02_2024-05-24`
- `2024-06-08_2024-08-16`
- `2024-06-22_2024-08-16`
- `2025-03-15_2025-05-09`

### <a id='toc1_2_3_'></a>[Processing the Data](#toc0_)
This will take a while, depending on your computer's processing power. Expect about 1 minute per processed day (a few hours in total).

In [None]:
import bus_processing

bus_processing.process_all(bucket_size=20)
bus_processing.combine_bus_aggregates("all")
bus_processing.combine_bus_aggregates("routes")
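
The internals of `bus_processing` are not shown in this notebook. As a rough illustration of what aggregating into 20-minute buckets could look like, here is a minimal sketch on made-up data. The input column names `timestamp` and `delay_min`, the unit (minutes, positive meaning late), and the 5-minute thresholds are all assumptions chosen to mirror the output columns in the next section, not the project's actual implementation.

```python
import pandas as pd

# Hypothetical raw observations: one row per vehicle position report.
obs = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-20 04:03", "2024-01-20 04:19", "2024-01-20 04:21",
        "2024-01-20 04:39", "2024-01-20 04:40",
    ]).tz_localize("America/Los_Angeles"),
    "delay_min": [-1, 6, 0, -7, 2],  # assumed unit: minutes, positive = late
})


def aggregate(obs: pd.DataFrame, bucket_size: int = 20) -> pd.DataFrame:
    """Sum delays and count late/early observations per time bucket."""
    flagged = obs.assign(
        # Round each timestamp down to the start of its bucket.
        time_bucket=obs["timestamp"].dt.floor(f"{bucket_size}min"),
        late_5_min=(obs["delay_min"] >= 5).astype(int),
        early_5_min=(obs["delay_min"] <= -5).astype(int),
    )
    return flagged.groupby("time_bucket").agg(
        delay_total=("delay_min", "sum"),
        late_5_min=("late_5_min", "sum"),
        early_5_min=("early_5_min", "sum"),
        count=("delay_min", "count"),
    ).reset_index()


buckets = aggregate(obs)
print(buckets)
```

Storing sums and counts rather than averages keeps the buckets additive, so coarser buckets can later be derived by summing rows.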

### <a id='toc1_2_4_'></a>[Loading the Data](#toc0_)

Once the above is complete, we can load the resulting aggregate data.

In [1]:
from IPython.display import display
import bus_processing

bus_all_df = bus_processing.load_bucket_statistics("all")
display(bus_all_df)

bus_routes_df = bus_processing.load_bucket_statistics("routes")
display(bus_routes_df)

Unnamed: 0,time_bucket,delay_total,late_5_min,early_5_min,count
0,2024-01-20 04:00:00-08:00,0,0,0,1
1,2024-01-20 04:20:00-08:00,-5,0,0,6
2,2024-01-20 04:40:00-08:00,-14,0,1,34
3,2024-01-20 05:00:00-08:00,-6,1,2,93
4,2024-01-20 05:20:00-08:00,-41,2,5,142
...,...,...,...,...,...
16019,2025-04-14 09:20:00-07:00,387,59,30,570
16020,2025-04-14 09:40:00-07:00,411,60,24,554
16021,2025-04-14 10:00:00-07:00,543,56,31,550
16022,2025-04-14 10:20:00-07:00,513,71,32,541


Unnamed: 0,vehicle.trip.route_id,vehicle.trip.direction_id,time_bucket,delay_total,late_5_min,early_5_min,count
0,1,0.0,2024-01-20 04:20:00-08:00,0,0,0,1
1,1,0.0,2024-01-20 04:40:00-08:00,0,0,0,2
2,1,0.0,2024-01-20 05:00:00-08:00,2,0,0,4
3,1,0.0,2024-01-20 05:20:00-08:00,-1,0,0,4
4,1,0.0,2024-01-20 05:40:00-08:00,5,0,0,6
...,...,...,...,...,...,...,...
1384710,TBUS,1.0,2025-04-14 05:00:00-07:00,-2,0,0,2
1384711,TBUS,1.0,2025-04-14 05:20:00-07:00,-5,0,0,3
1384712,TBUS,1.0,2025-04-14 05:40:00-07:00,-4,0,0,3
1384713,TBUS,1.0,2025-04-14 06:00:00-07:00,0,0,0,2


## <a id='toc1_3_'></a>[Calendar Data](#toc0_)

In [2]:
from IPython.display import display
import calendar_processing

calendar_df = calendar_processing.build_calendar()
display(calendar_df)

date,holiday,weekday
2023-01-01,1,6
2023-01-02,1,0
2023-01-03,0,1
2023-01-04,0,2
2023-01-05,0,3
...,...,...
2025-04-10,0,3
2025-04-11,0,4
2025-04-12,0,5
2025-04-13,0,6


## <a id='toc1_4_'></a>[Weather Data](#toc0_)

Weather data is provided by Meteostat. Descriptions for each data column can be found [here](https://dev.meteostat.net/python/hourly.html#data-structure).

In [3]:
import meteostat as mt
import datetime
from IPython.display import display

start = datetime.datetime(2023, 1, 1)
end = datetime.datetime(2025, 4, 14)
sf = 72494  # Meteostat weather station ID used for San Francisco
weather_df = mt.Hourly(sf, start, end, "America/Los_Angeles")
weather_df = weather_df.fetch()

# Drop columns with many missing measurements (more than 10% of rows)
weather_df = weather_df.drop(weather_df.columns[weather_df.isnull().sum()/len(weather_df) > 0.1], axis=1)

# There are small gaps in most features; assume conditions remain unchanged since last measurement
weather_df = weather_df.ffill()

display(weather_df)

time,temp,dwpt,rhum,prcp,wdir,wspd,pres,coco
2023-01-01 00:00:00-08:00,12.2,8.3,77.0,0.0,270.0,25.9,1008.3,2.0
2023-01-01 01:00:00-08:00,11.1,8.3,83.0,0.0,260.0,16.6,1008.2,2.0
2023-01-01 02:00:00-08:00,10.6,7.8,83.0,0.0,240.0,11.2,1008.6,2.0
2023-01-01 03:00:00-08:00,11.1,7.2,77.0,0.0,250.0,22.3,1009.2,2.0
2023-01-01 04:00:00-08:00,10.6,6.7,77.0,0.0,250.0,33.5,1009.7,2.0
...,...,...,...,...,...,...,...,...
2025-04-13 20:00:00-07:00,15.0,6.8,58.0,0.0,290.0,24.1,1013.9,2.0
2025-04-13 21:00:00-07:00,15.0,6.0,55.0,0.0,300.0,20.5,1014.2,2.0
2025-04-13 22:00:00-07:00,13.9,7.2,64.0,0.0,290.0,22.3,1014.0,2.0
2025-04-13 23:00:00-07:00,13.3,6.6,64.0,0.0,300.0,9.4,1014.1,2.0
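
To make the missing-value handling in the cell above concrete, here is a small sketch on a made-up frame. The data and the `snow` column are invented for illustration; only the 10% threshold and the forward fill come from the cell above.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2023-01-01", periods=20, freq="h", tz="America/Los_Angeles")
toy = pd.DataFrame({"temp": np.linspace(10.0, 12.0, 20), "snow": np.nan}, index=index)
toy.loc[index[3], "temp"] = np.nan  # one small gap: 5% of rows missing
toy.loc[index[0], "snow"] = 0.0     # mostly empty column: 95% of rows missing

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Drop columns above the 10% threshold, then carry the last measurement forward.
cleaned = toy.drop(columns=missing_fraction[missing_fraction > 0.1].index).ffill()

print(missing_fraction)
print(cleaned.head())
```

Here `snow` is dropped, while the single gap in `temp` is filled with the previous hour's value. Note that a forward fill cannot fill a gap in the very first row, since there is no earlier measurement to carry forward.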


## <a id='toc1_5_'></a>[Combining the Data](#toc0_)

In [4]:
from IPython.display import display
import os
import pandas as pd
from constants import DATA_DIR


def combine_data(bus_df: pd.DataFrame, calendar_df: pd.DataFrame, weather_df: pd.DataFrame):
    df = bus_df.copy()

    # Add columns to join on
    df["time"] = df.apply(lambda row: row["time_bucket"].replace(minute=0, second=0), axis=1).astype("datetime64[ns, America/Los_Angeles]")
    df["date"] = df.apply(lambda row: row["time_bucket"].date(), axis=1)

    # Use inner joins to remove the few rows that lack matches
    df = df.merge(calendar_df, on="date", how="inner")
    df = df.merge(weather_df, on="time", how="inner")
    df = df.drop(["time", "date"], axis=1)
    return df

df = combine_data(bus_all_df, calendar_df, weather_df)
display(df)
path = os.path.join(DATA_DIR, "dataset.csv")
df.to_csv(path, index=False)

df = combine_data(bus_routes_df, calendar_df, weather_df)
display(df)
path = os.path.join(DATA_DIR, "dataset_routes.csv")
df.to_csv(path, index=False)

Unnamed: 0,time_bucket,delay_total,late_5_min,early_5_min,count,holiday,weekday,temp,dwpt,rhum,prcp,wdir,wspd,pres,coco
0,2024-01-20 04:00:00-08:00,0,0,0,1,0,5,13.9,11.8,87.0,0.0,180.0,20.5,1007.6,4.0
1,2024-01-20 04:20:00-08:00,-5,0,0,6,0,5,13.9,11.8,87.0,0.0,180.0,20.5,1007.6,4.0
2,2024-01-20 04:40:00-08:00,-14,0,1,34,0,5,13.9,11.8,87.0,0.0,180.0,20.5,1007.6,4.0
3,2024-01-20 05:00:00-08:00,-6,1,2,93,0,5,13.9,11.8,87.0,0.3,180.0,20.5,1007.6,4.0
4,2024-01-20 05:20:00-08:00,-41,2,5,142,0,5,13.9,11.8,87.0,0.3,180.0,20.5,1007.6,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15989,2025-04-13 23:20:00-07:00,156,17,8,162,0,6,13.3,6.6,64.0,0.0,300.0,9.4,1014.1,2.0
15990,2025-04-13 23:40:00-07:00,129,21,9,159,0,6,13.3,6.6,64.0,0.0,300.0,9.4,1014.1,2.0
15991,2025-04-14 00:00:00-07:00,103,19,10,149,0,0,12.0,8.7,80.0,0.0,300.0,15.0,1014.0,1.0
15992,2025-04-14 00:20:00-07:00,108,12,9,127,0,0,12.0,8.7,80.0,0.0,300.0,15.0,1014.0,1.0


Unnamed: 0,vehicle.trip.route_id,vehicle.trip.direction_id,time_bucket,delay_total,late_5_min,early_5_min,count,holiday,weekday,temp,dwpt,rhum,prcp,wdir,wspd,pres,coco
0,1,0.0,2024-01-20 04:20:00-08:00,0,0,0,1,0,5,13.9,11.8,87.0,0.0,180.0,20.5,1007.6,4.0
1,1,0.0,2024-01-20 04:40:00-08:00,0,0,0,2,0,5,13.9,11.8,87.0,0.0,180.0,20.5,1007.6,4.0
2,1,0.0,2024-01-20 05:00:00-08:00,2,0,0,4,0,5,13.9,11.8,87.0,0.3,180.0,20.5,1007.6,4.0
3,1,0.0,2024-01-20 05:20:00-08:00,-1,0,0,4,0,5,13.9,11.8,87.0,0.3,180.0,20.5,1007.6,4.0
4,1,0.0,2024-01-20 05:40:00-08:00,5,0,0,6,0,5,13.9,11.8,87.0,0.3,180.0,20.5,1007.6,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1382476,TBUS,1.0,2025-04-13 07:00:00-07:00,2,1,0,5,0,6,10.6,7.8,83.0,0.0,0.0,0.0,1015.8,1.0
1382477,TBUS,1.0,2025-04-13 07:20:00-07:00,-5,0,1,3,0,6,10.6,7.8,83.0,0.0,0.0,0.0,1015.8,1.0
1382478,TBUS,1.0,2025-04-13 07:40:00-07:00,1,0,0,3,0,6,10.6,7.8,83.0,0.0,0.0,0.0,1015.8,1.0
1382479,TBUS,1.0,2025-04-13 08:00:00-07:00,-2,0,0,2,0,6,13.0,9.1,77.0,0.0,120.0,6.0,1016.0,2.0
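
The inner joins in `combine_data` silently drop bus rows that have no matching calendar date or weather hour. To inspect which rows would be lost, one option is a left join with pandas' `indicator=True`. The sketch below uses made-up frames that only mimic the shapes above (a `time_bucket` column on the bus side, a `time` index on the weather side); it is a diagnostic aid, not part of the pipeline.

```python
import pandas as pd

tz = "America/Los_Angeles"

# Toy bus aggregates with 20-minute buckets.
bus = pd.DataFrame({
    "time_bucket": pd.to_datetime(
        ["2024-01-20 04:00", "2024-01-20 04:20", "2024-01-20 05:00"]
    ).tz_localize(tz),
    "count": [1, 6, 93],
})

# Toy hourly weather, indexed by time; the 05:00 hour is deliberately missing.
weather = pd.DataFrame(
    {"temp": [13.9]},
    index=pd.to_datetime(["2024-01-20 04:00"]).tz_localize(tz).rename("time"),
)

# Truncate each bucket to the start of its hour, the granularity of the weather data.
bus["time"] = bus["time_bucket"].apply(lambda ts: ts.replace(minute=0, second=0))

# A left join keeps every bus row; the _merge column marks rows without a match.
merged = bus.merge(weather, on="time", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(f"{len(unmatched)} of {len(merged)} bus rows have no weather match")
```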
