<a href="https://colab.research.google.com/github/guzmanlopez/montevideo-bus-forecast/blob/main/montevideo_bus_forecast.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Course: Machine Learning for Graph Data

**Instructor:** Prof. Gonzalo Mateos (University of Rochester, USA).

**Guest instructor:** Fernando Gama (University of California, Berkeley, USA).

**Other instructors:** Marcelo Fiori and Federico La Rocca.

**Dates:** 01/02/2021 to 04/02/2021 and 11/02/2021.

**Web:** [Course home page on the Eva platform](https://eva.fing.edu.uy/course/view.php?id=1484)



---



## Final course project

### Forecasting passenger flow at the bus stops of Montevideo's Metropolitan Transport System (STM)

**Student:** Guzmán López


---



Mount Google Drive so the project repository can be cloned from GitHub:

In [2]:
from google.colab import drive
drive.mount("/content/gdrive")

%cd gdrive/My Drive/

Mounted at /content/gdrive
/content/gdrive/My Drive


In [None]:
!git clone https://github.com/guzmanlopez/montevideo-bus-forecast.git

In [3]:
%cd montevideo-bus-forecast/

# Discard Colaboratory changes: the GitHub repository is the source of truth
!git fetch --all
!git reset --hard origin/main
!git pull

/content/gdrive/My Drive/montevideo-bus-forecast
Fetching origin
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 12 (delta 9), reused 12 (delta 9), pack-reused 0
Unpacking objects: 100% (12/12), done.
From https://github.com/guzmanlopez/montevideo-bus-forecast
   2c78c46..c0b242d  main       -> origin/main
Checking out files: 100% (88/88), done.
HEAD is now at c0b242d Improve network plot
Already up to date.


Install all required libraries:

In [None]:
# Install required packages
!pip install altair
!pip install bokeh
!pip install geopandas
!pip install matplotlib
!pip install networkx
!pip install numpy
!pip install pandas
!pip install pretty-errors
!pip install pygeos
!pip install requests
!pip install scikit-learn
!pip install typer
!pip install vega

# Install PyTorch
!pip install torch==1.8.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

# Install PyTorch Geometric
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-geometric

# Install PyTorch Geometric Temporal
!pip install torch-geometric-temporal

Download and process the data to obtain the graph we will use for modeling:

In [None]:
# Note: this file can take a while to download because it is 2.5 GB in size
# %run src/preparation/download_stm_bus_data.py

In [None]:
# Bus stops and bus line tracks
%run src/preparation/download_bus_stops.py
%run src/preparation/download_bus_tracks.py

⬇️ Downloading data...
  STM bus stops from: https://intgis.montevideo.gub.uy/sit/tmp/v_uptu_paradas.zip
💾 Writing data...
Saved to: data/raw/stm_paradas.geojson

✅ Done! 

⬇️ Downloading data...
  STM bus line tracks from: https://intgis.montevideo.gub.uy/sit/tmp/uptu_variante_no_maximal.zip
💾 Writing data...
Saved to: data/raw/stm_recorridos.geojson

✅ Done! 



In [None]:
# Process the downloaded raw data
%run src/processing/process_stm_bus_data.py

🏋️‍♀️ Loading data...
Loading data/raw/stm_viajes_octubre.csv...
✅ Done! 

⚙️ Processing...
Pre-process raw data
ℹ️ Convert to timeseries and add time features
✅ Done! 

💾 Writing data...
File saved to data/processed/df_stm_bus_proc.pkl



In [None]:
# Process bus tracks and bus stops
%run src/processing/build_bus_line_tracks_and_stops.py

🏋️‍♀️ Loading data...
Loading data/processed/df_stm_bus_proc.pkl...
✅ Done! 

🏋️‍♀️ Loading data...
Loading data/raw/stm_paradas.geojson...
✅ Done! 

🏋️‍♀️ Loading data...
Loading data/raw/stm_recorridos.geojson...
✅ Done! 

🚌 103
⚙️ Processing...
build_bus_line_tracks_and_stops

⚙️ Processing...
get_longest_track_from_bus_line

✅ Done! 

⚙️ Processing...
cut_tracks_by_bus_stops for bus line

💾 Writing data...
Saved to: data/processed/bus_tracks/bus_track_proc_busline_103.geojson

✅ Done! 

💾 Writing data...
Saved to: data/processed/bus_stops/bus_stops_proc_103.geojson

🚌 G
⚙️ Processing...
build_bus_line_tracks_and_stops

⚙️ Processing...
get_longest_track_from_bus_line

✅ Done! 

⚙️ Processing...
cut_tracks_by_bus_stops for bus line

💾 Writing data...
Saved to: data/processed/bus_tracks/bus_track_proc_busline_G.geojson

✅ Done! 

💾 Writing data...
Saved to: data/processed/bus_stops/bus_stops_proc_G.geojson

🚌 183
⚙️ Processing...
build_bus_line_tracks_and_stops

⚙️ Processing...
get_

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


💾 Writing data...
Saved to: data/processed/bus_stops/bus_stops_proc_110.geojson



In [None]:
# Process bus tracks and bus stops: sort the stops along each track
%run src/processing/sort_bus_stops_along_bus_track.py

🏋️‍♀️ Loading data...
Loading data/processed/df_stm_bus_proc.pkl...
✅ Done! 

🚌 103
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stops_proc_103.geojson...
✅ Done! 

🚌 103
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_proc_busline_103.geojson...
✅ Done! 

🚌 103
⚙️ Processing...
Get order of bus stops along bus track
⚠️ This operation can take some time 

⚙️ Processing...
Simplifying bus track
ℹ️ Using 5 meters of tolerance
ℹ️ Removed 504 points of 1000 (50.4%) 

⚙️ Processing...
Densyfing bus track by adding points at equal intervals
ℹ️ Densify line lengths equal or above 5 meters
ℹ️ Add points every 5 meters
ℹ️ Added 6392 new points of 496 initial points (1288.7%) 

⚙️ Processing...
Sort points along a path using Nearest Neighbors and minimizing path cost
ℹ️ Using 3 neighbors
ℹ️ 5110/5110 nodes used to calculate shortest path
✅ Done! 
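The log above mentions densifying the bus track by adding points at regular intervals. The sketch below illustrates that idea in plain Python and is not the project's implementation: the `densify` function is a hypothetical helper, it assumes planar coordinates in meters, and it spaces the new points evenly so that no gap exceeds `step` (a slight variation on adding a point at exactly every `step` meters).

```python
import math


def densify(points, step):
    """Insert evenly spaced points along each segment of a polyline.

    `points` is a list of (x, y) tuples in a projected (metric) coordinate
    system; consecutive output points are at most `step` apart.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not points:
        return []
    dense = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        # Number of equal pieces so that no piece is longer than `step`
        pieces = max(1, math.ceil(length / step))
        for k in range(1, pieces + 1):
            t = k / pieces
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return dense


# Hypothetical track: a 10 m segment followed by a 12 m segment
track = [(0.0, 0.0), (10.0, 0.0), (10.0, 12.0)]
dense_track = densify(track, step=5.0)
print(len(track), "->", len(dense_track), "points")
```

A denser track gives the nearest-neighbor ordering step more closely spaced candidates, which is presumably why the pipeline densifies after simplifying.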



In [None]:
# Build the adjacency matrix
%run src/processing/build_adyacency_matrix.py

⚙️ Processing...
Loading data

🚌 103
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stop_ordered_103.geojson...
✅ Done! 

🚌 103
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_ordered_103.geojson...
✅ Done! 

🚌 G
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stop_ordered_G.geojson...
✅ Done! 

🚌 G
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_ordered_G.geojson...
✅ Done! 

🚌 183
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stop_ordered_183.geojson...
✅ Done! 

🚌 183
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_ordered_183.geojson...
✅ Done! 

🚌 185
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stop_ordered_185.geojson...
✅ Done! 

🚌 185
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_ordered_185.geojson...
✅ Done! 

🚌

In [None]:
# Build the directed graph
%run src/processing/build_graph.py

⚙️ Processing...
Building Networkx Directed graph

🏋️‍♀️ Loading data...
Loading data/processed/from_to_weight.csv...
✅ Done! 

ℹ️ 
Name: Bus lines of Montevideo
Type: DiGraph
Number of nodes: 606
Number of edges: 617
Average in degree:   1.0182
Average out degree:   1.0182
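In a directed graph, every edge contributes one unit of out-degree and one unit of in-degree, so both averages equal the number of edges divided by the number of nodes. The sketch below checks this on a small hypothetical edge list; the `(from_stop, to_stop, weight)` layout is an assumption, since the columns of `from_to_weight.csv` are not shown in this notebook.

```python
from collections import defaultdict

# Hypothetical (from_stop, to_stop, weight) rows; not taken from the real file
edges = [
    (1, 2, 350.0),
    (2, 3, 410.0),
    (3, 4, 280.0),
    (5, 3, 500.0),  # a second line joins at stop 3
]

in_degree = defaultdict(int)
out_degree = defaultdict(int)
nodes = set()
for src, dst, _weight in edges:
    out_degree[src] += 1
    in_degree[dst] += 1
    nodes.update((src, dst))

avg_in = sum(in_degree.values()) / len(nodes)
avg_out = sum(out_degree.values()) / len(nodes)
print(f"nodes={len(nodes)} edges={len(edges)} avg_in={avg_in:.4f} avg_out={avg_out:.4f}")

# The same identity applied to the summary printed above: 617 edges, 606 nodes
summary_average = round(617 / 606, 4)
print(summary_average)  # 1.0182
```

An average just above 1 is consistent with a graph made mostly of chains of consecutive stops, with a few stops where lines merge or split.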



# Exploratory Data Analysis

In [5]:
import altair as alt
import geopandas as gpd
import pandas as pd
from bokeh.io import output_notebook
from notebooks.eda.plots import (
    network_bokeh_plot,
    plot_boardings_by_day_name,
    plot_boardings_by_hour_and_day_name,
    plot_boardings_by_time,
)
from src.preparation.constants import BUS_LINES, CRS, DAY_NAME_MAPPING, PROCESSED_FILE
from src.preparation.utils import (
    load_pickle_file,
    load_spatial_data,
    load_stm_bus_line_track,
    load_stm_bus_stops,
)
from src.processing.utils import get_networkx_graph

alt.renderers.enable("colab")

  shapely_geos_version, geos_capi_version_string


RendererRegistry.enable('colab')

In [None]:
# Load pre-processed data
df_proc = load_pickle_file(PROCESSED_FILE)
df_proc

🏋️‍♀️ Loading data...
Loading data/processed/df_stm_bus_proc.pkl...
✅ Done! 



fecha_evento,ordinal_de_tramo,cantidad_pasajeros,codigo_parada_origen,dsc_linea,sevar_codigo,mes,nombre_dia,hora
2020-10-08 08:17:57-03:00,2,1,3895,195,1438.0,10,Jueves,8
2020-10-26 14:54:13-03:00,2,1,5709,181,7602.0,10,Lunes,14
2020-10-05 17:36:47-03:00,2,1,3663,151,3364.0,10,Lunes,17
2020-10-07 11:59:09-03:00,2,1,1567,137,815.0,10,Miércoles,11
2020-10-09 15:46:40-03:00,2,1,3133,145,3358.0,10,Viernes,15
...,...,...,...,...,...,...,...,...
2020-10-13 18:39:14-03:00,2,1,3200,102,254.0,10,Martes,18
2020-10-07 13:29:24-03:00,2,1,1145,494,2898.0,10,Miércoles,13
2020-10-05 06:33:48-03:00,2,1,4201,109,2214.0,10,Lunes,6
2020-10-16 06:34:05-03:00,2,1,4201,113,602.0,10,Viernes,6


In [None]:
df_hourly = df_proc.groupby([pd.Grouper(freq="1H")])["cantidad_pasajeros"].sum().reset_index()
plot_boardings_by_time(df_hourly)

In [None]:
# Daily sum of boardings and median by day name
df_day_name = (
    df_proc.groupby([pd.Grouper(freq="1D"), "nombre_dia"])["cantidad_pasajeros"]
    .sum()
    .groupby(["nombre_dia"])
    .median()
    .reset_index()
)
plot_boardings_by_day_name(df_day_name)
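The two-stage aggregation above first totals boardings per calendar day and then takes the median of those daily totals for each day of the week. The standard-library sketch below reproduces the same logic on a few hypothetical events (the timestamps and counts are made up for illustration), which may make the order of operations easier to follow.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical boarding events: (timestamp, passengers)
events = [
    (datetime(2020, 10, 5, 8, 0), 2),    # Monday
    (datetime(2020, 10, 5, 17, 30), 3),  # Monday
    (datetime(2020, 10, 12, 9, 15), 7),  # Monday
    (datetime(2020, 10, 6, 8, 45), 4),   # Tuesday
]

# Step 1: total boardings per calendar day
daily_totals = defaultdict(int)
for timestamp, passengers in events:
    daily_totals[timestamp.date()] += passengers

# Step 2: median of those daily totals for each day of the week
totals_by_day_name = defaultdict(list)
for day, total in daily_totals.items():
    totals_by_day_name[DAY_NAMES[day.weekday()]].append(total)

median_by_day_name = {name: median(totals) for name, totals in totals_by_day_name.items()}
print(median_by_day_name)
```

Using the median rather than the mean keeps a single unusual day, such as a holiday, from dominating the typical value for that weekday.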

In [None]:
# Hourly sum of boardings and median by day name
df_hourly_day_name = df_hourly.copy()
df_hourly_day_name.set_index("fecha_evento", inplace=True)
df_hourly_day_name.loc[:, "nombre_dia"] = df_hourly_day_name.index.day_name()
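# Assign the result back: inplace=True on a .loc selection may modify a temporary copy
df_hourly_day_name["nombre_dia"] = df_hourly_day_name["nombre_dia"].replace(DAY_NAME_MAPPING)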
df_hourly_day_name.loc[:, "hora"] = df_hourly_day_name.index.hour
df_hourly_day_name = df_hourly_day_name.groupby(["hora", "nombre_dia"]).median().reset_index()

plot_boardings_by_hour_and_day_name(df_hourly_day_name)

In [None]:
# Rank bus lines by typical weekly boardings (sum over day names of the median daily total)
df_weekly_by_day_name_and_line = (
    df_proc.groupby([pd.Grouper(freq="1D"), "nombre_dia", "dsc_linea"])["cantidad_pasajeros"]
    .sum()
    .groupby(["dsc_linea", "nombre_dia"])
    .median()
    .groupby("dsc_linea")
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)

df_weekly_by_day_name_and_line["decile_rank"] = pd.qcut(
    df_weekly_by_day_name_and_line["cantidad_pasajeros"], 10, labels=False
)

# Contribution of each decile
df_decile_rank_prop = (
    df_weekly_by_day_name_and_line.groupby("decile_rank")["cantidad_pasajeros"].sum().reset_index()
)
df_decile_rank_prop["proportion"] = (
    df_decile_rank_prop["cantidad_pasajeros"] / df_decile_rank_prop["cantidad_pasajeros"].sum()
)
df_decile_rank_prop

,decile_rank,cantidad_pasajeros,proportion
0,0,1509.5,0.001341
1,1,6931.0,0.006157
2,2,13356.5,0.011865
3,3,30346.5,0.026959
4,4,50580.0,0.044933
5,5,84003.5,0.074626
6,6,136517.5,0.121277
7,7,186790.5,0.165938
8,8,241547.5,0.214582
9,9,374083.5,0.332322
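The `proportion` column is each decile's boardings divided by the total over all deciles. As a quick check, the sketch below recomputes it from the `cantidad_pasajeros` values printed in the table above; only the variable names are introduced here.

```python
# Boardings per decile, copied from the table above (decile 0 to decile 9)
boardings = [
    1509.5, 6931.0, 13356.5, 30346.5, 50580.0,
    84003.5, 136517.5, 186790.5, 241547.5, 374083.5,
]

total = sum(boardings)
proportions = [value / total for value in boardings]

for decile, share in enumerate(proportions):
    print(f"decile {decile}: {share:.6f}")

# Combined share of the three highest deciles
top_three_share = sum(proportions[-3:])
print(f"top three deciles: {top_three_share:.4f}")
```

The highest decile alone accounts for roughly a third of the boardings, which supports restricting the model to the bus lines in that decile.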


In [None]:
# Select bus lines from the highest decile
df_bus_lines = df_weekly_by_day_name_and_line.loc[
    df_weekly_by_day_name_and_line["decile_rank"] == 9, :
]
df_bus_lines = df_bus_lines.sort_values("cantidad_pasajeros", ascending=False)
df_bus_lines

,dsc_linea,cantidad_pasajeros,decile_rank
0,103,41948.5,9
1,183,36571.5,9
2,306,32997.5,9
3,185,31321.0,9
4,181,29585.0,9
5,G,29426.0,9
6,546,27830.0,9
7,137,25128.0,9
8,110,24509.5,9
9,145,24112.0,9


In [None]:
# Load bus stops
gdf_bus_stops = load_stm_bus_stops()

# Load bus tracks
gdf_bus_tracks = load_stm_bus_line_track()

# Load ordered bus stops and bus tracks
bus_stops_frames, bus_tracks_frames = [], []

for bus_line in BUS_LINES:
    # Read all bus stops by bus line from geojson files
    bus_stops_frames.append(load_spatial_data(bus_line, type="bus_stop_ordered"))

    # Read all bus tracks by bus line from geojson files
    df = load_spatial_data(bus_line, type="bus_track_ordered")
    df["line"] = bus_line
    bus_tracks_frames.append(df)

# DataFrame.append was removed in pandas 2.0; concatenate the per-line frames once instead
all_bus_stops = pd.concat(bus_stops_frames).set_crs(CRS)
all_bus_tracks = pd.concat(bus_tracks_frames).set_crs(CRS)

🏋️‍♀️ Loading data...
Loading data/raw/stm_paradas.geojson...
✅ Done! 

🏋️‍♀️ Loading data...
Loading data/raw/stm_recorridos.geojson...
✅ Done! 

🚌 103
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stop_ordered_103.geojson...
✅ Done! 

🚌 103
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_ordered_103.geojson...
✅ Done! 

🚌 G
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stop_ordered_G.geojson...
✅ Done! 

🚌 G
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_ordered_G.geojson...
✅ Done! 

🚌 183
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stop_ordered_183.geojson...
✅ Done! 

🚌 183
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_ordered_183.geojson...
✅ Done! 

🚌 185
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stop_ordered_185.geojson...
✅ Done! 

🚌 185
🛣️ 

In [None]:
# Find bus stops that are listed for more than one line
shared_bus_stops = (
    all_bus_stops.groupby(["COD_UBIC_P"])
    .agg(lines=("DESC_LINEA", "|".join), number_of_lines=("DESC_LINEA", len))
    .round(0)
    .sort_values("COD_UBIC_P", ascending=True)
    .reset_index()
    .astype({"COD_UBIC_P": int})
)


print("Bus stops shared among the selected bus lines:\n")
(shared_bus_stops
.loc[shared_bus_stops["number_of_lines"] > 1, :]["lines"]
.drop_duplicates()
.reset_index(drop=True))

Bus stops shared among the selected bus lines:



0             163|137
1             185|137
2         185|306|137
3                 G|G
4               G|145
5             185|306
6     185|306|163|137
7       G|185|306|137
8               G|137
9             183|306
10            183|163
11            145|145
12            103|405
13        103|405|110
14            103|110
15              G|163
16        185|145|405
17            145|405
18            185|163
Name: lines, dtype: object
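Entries such as `G|G` and `145|145` in the output above show that the same line can be listed more than once at a stop, so the `number_of_lines > 1` filter also matches stops served by a single line. One possible way to count only distinct lines is sketched below in plain Python; the records are hypothetical and the variable names are introduced only for this illustration.

```python
from collections import defaultdict

# Hypothetical (stop_code, bus_line) records; a line may appear twice at one stop
records = [
    (100, "G"), (100, "G"),
    (200, "G"), (200, "145"),
    (300, "103"),
]

lines_by_stop = defaultdict(list)
for stop_code, bus_line in records:
    # Keep first-seen order while skipping repeated lines
    if bus_line not in lines_by_stop[stop_code]:
        lines_by_stop[stop_code].append(bus_line)

# A stop is shared only if at least two distinct lines serve it
shared = {
    stop: "|".join(lines)
    for stop, lines in lines_by_stop.items()
    if len(lines) > 1
}
print(shared)
```

With this rule, stop 100 (listed twice for line G) is no longer reported as shared, while stop 200 still is.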

In [6]:
# Load the graph
G = get_networkx_graph()

[97m[1m🏋️‍♀️ Loading data...[0m
[37mLoading data/processed/from_to_weight.csv...[0m
[92m[1m✅ Done! 
[0m
[97m[1m🏋️‍♀️ Loading data...[0m
[37mLoading data/raw/stm_paradas.geojson...[0m
[92m[1m✅ Done! 
[0m
[96m[22mℹ️ 
Name: Bus lines of Montevideo
Type: DiGraph
Number of nodes: 606
Number of edges: 617
Average in degree:   1.0182
Average out degree:   1.0182
[0m


In [10]:
output_notebook()

# Visualize the graph
network_bokeh_plot(
    G,
    title="Bus stop graph",
    colorby="in_degree",
    add_labels=False,
    save=False,
)