<a href="https://colab.research.google.com/github/guzmanlopez/montevideo-bus-forecast/blob/main/montevideo_bus_forecast.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Course: Machine Learning for Graph Data

**Instructor:** Prof. Gonzalo Mateos (University of Rochester, USA).

**Guest instructor:** Fernando Gama (University of California, Berkeley, USA).

**Other instructors:** Marcelo Fiori and Federico La Rocca.

**Dates:** 1 February 2021 to 4 February 2021, and 11 February 2021.

**Website:** [Course main page on the Eva platform](https://eva.fing.edu.uy/course/view.php?id=1484)



---



## Final course project

### Forecasting passenger flow at the bus stops of Montevideo's Metropolitan Transport System (STM)

**Student:** Guzmán López


---



Mount Google Drive and download the project repository from GitHub:

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

%cd gdrive/My Drive/

In [None]:
!git clone https://github.com/guzmanlopez/montevideo-bus-forecast.git

In [None]:
%cd montevideo-bus-forecast/
# !git checkout .
!git pull

Install all the required libraries:

In [None]:
# Install required packages
!pip install pandas
!pip install geopandas
!pip install networkx
!pip install numpy
!pip install altair
!pip install requests
!pip install typer
!pip install pretty-errors
!pip install matplotlib
# The PyPI package is named scikit-learn; the "sklearn" alias is deprecated
!pip install scikit-learn
!pip install pygeos

# Install PyTorch
!pip install torch==1.8.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

# Install PyTorch Geometric
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-geometric

# Install PyTorch Geometric Temporal
!pip install torch-geometric-temporal

Download and process the data to obtain the graph used for modeling:

In [None]:
# Note: this file can take a while to download because it is 2.5 GB
# %run src/preparation/download_stm_bus_data.py

In [15]:
# Bus stops and bus line tracks
%run src/preparation/download_bus_stops.py
%run src/preparation/download_bus_tracks.py

⬇️ Downloading data...
  STM bus stops from: https://intgis.montevideo.gub.uy/sit/tmp/v_uptu_paradas.zip
💾 Writing data...
Saved to: data/raw/stm_paradas.geojson

✅ Done! 

⬇️ Downloading data...
  STM bus line tracks from: https://intgis.montevideo.gub.uy/sit/tmp/uptu_variante_no_maximal.zip
💾 Writing data...
Saved to: data/raw/stm_recorridos.geojson

✅ Done! 



In [16]:
# Process the downloaded raw data
%run src/processing/process_stm_bus_data.py

🏋️‍♀️ Loading data...
Loading data/raw/stm_viajes_octubre.csv...
✅ Done! 

⚙️ Processing...
Pre-process raw data
ℹ️ Convert to timeseries and add time features
✅ Done! 

💾 Writing data...
File saved to data/processed/df_stm_bus_proc.pkl



In [17]:
# Process bus line tracks and bus stops
%run src/processing/build_bus_line_tracks_and_stops.py

🏋️‍♀️ Loading data...
Loading data/processed/df_stm_bus_proc.pkl...
✅ Done! 

🏋️‍♀️ Loading data...
Loading data/raw/stm_paradas.geojson...
✅ Done! 

🏋️‍♀️ Loading data...
Loading data/raw/stm_recorridos.geojson...
✅ Done! 

🚌 103
⚙️ Processing...
build_bus_line_tracks_and_stops

⚙️ Processing...
get_longest_track_from_bus_line

✅ Done! 

⚙️ Processing...
cut_tracks_by_bus_stops for bus line

💾 Writing data...
Saved to: data/processed/bus_tracks/bus_track_proc_busline_103.geojson

✅ Done! 

💾 Writing data...
Saved to: data/processed/bus_stops/bus_stops_proc_103.geojson

🚌 G
⚙️ Processing...
build_bus_line_tracks_and_stops

⚙️ Processing...
get_longest_track_from_bus_line

✅ Done! 

⚙️ Processing...
cut_tracks_by_bus_stops for bus line

💾 Writing data...
Saved to: data/processed/bus_tracks/bus_track_proc_busline_G.geojson

✅ Done! 

💾 Writing data...
Saved to: data/processed/bus_stops/bus_stops_proc_G.geojson

🚌 183
⚙️ Processing...
build_bus_line_tracks_and_stops

⚙️ Processing...
get_

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


💾 Writing data...
Saved to: data/processed/bus_stops/bus_stops_proc_110.geojson



In [None]:
# Process bus line tracks and bus stops: sort the stops along each track
%run src/processing/sort_bus_stops_along_bus_track.py

🏋️‍♀️ Loading data...
Loading data/processed/df_stm_bus_proc.pkl...
✅ Done! 

🚌 103
🚏 Bus stops
🏋️‍♀️ Loading data...
Loading file data/processed/bus_stops/bus_stops_proc_103.geojson...
✅ Done! 

🚌 103
🛣️ Bus tracks
🏋️‍♀️ Loading data...
Loading file data/processed/bus_tracks/bus_track_proc_busline_103.geojson...
✅ Done! 

🚌 103
⚙️ Processing...
Get order of bus stops along bus track
⚠️ This operation can take some time 

⚙️ Processing...
Simplifying bus track
ℹ️ Using 5 meters of tolerance
ℹ️ Removed 504 points of 1000 (50.4%) 

⚙️ Processing...
Densyfing bus track by adding points at equal intervals
ℹ️ Densify line lengths equal or above 5 meters
ℹ️ Add points every 5 meters
ℹ️ Added 6392 new points of 496 initial points (1288.7%) 

⚙️ Processing...
Sort points along a path using Nearest Neighbors and minimizing path cost
ℹ️ Using 3 neighbors
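
The log above mentions two geometric steps: densifying the track by adding points every 5 meters, and sorting points along a path with nearest neighbors. The sketch below illustrates both ideas in plain Python. It is not the project's implementation: `densify` and `greedy_order` are hypothetical helpers, coordinates are assumed to be planar and in meters, and the ordering is a simple greedy walk rather than the 3-neighbor, cost-minimizing search the script reports.

```python
import math


def densify(points, step=5.0):
    """Add points along each segment so consecutive points are at most `step` apart."""
    dense = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        # Split the segment into equal parts no longer than `step`
        n_parts = max(1, math.ceil(length / step))
        for k in range(1, n_parts + 1):
            t = k / n_parts
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return dense


def greedy_order(points, start=0):
    """Order point indices by repeatedly jumping to the nearest unvisited point."""
    remaining = list(range(len(points)))
    order = [remaining.pop(start)]
    while remaining:
        last = points[order[-1]]
        nearest = min(remaining, key=lambda i: math.dist(last, points[i]))
        remaining.remove(nearest)
        order.append(nearest)
    return order


track = densify([(0, 0), (10, 0)], step=5)
stop_order = greedy_order([(0, 0), (10, 0), (5, 0)])
```

A greedy walk can take wrong turns where a track doubles back on itself, which is presumably why the script also minimizes a path cost.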


In [None]:
# Build the adjacency matrix and the final graph used for modeling
%run src/processing/build_adyacency_matrix.py
%run src/processing/build_graph.py

# Exploratory Data Analysis

In [24]:
import geopandas as gpd
import networkx as nx
import pandas as pd
from notebooks.eda.plots import (
    plot_boardings_by_day_name,
    plot_boardings_by_hour_and_day_name,
    plot_boardings_by_time,
)
from src.preparation.constants import BUFFER, BUS_LINES, CRS, DAY_NAME_MAPPING, PROCESSED_FILE
from src.preparation.utils import (
    load_pickle_file,
    load_spatial_data,
    load_stm_bus_data,
    load_stm_bus_line_track,
    load_stm_bus_stops,
    save_pickle_file,
)
from src.processing.process_stm_bus_data import pre_process_data
from src.processing.utils import (
    build_adyacency_matrix,
    build_bus_line_tracks_and_stops,
    fix_bus_stop_order,
)


In [46]:
# Load pre-processed data
df_proc = load_pickle_file(PROCESSED_FILE)
df_proc

🏋️‍♀️ Loading data...
Loading data/processed/df_stm_bus_proc.pkl...
✅ Done! 



fecha_evento,ordinal_de_tramo,cantidad_pasajeros,codigo_parada_origen,dsc_linea,sevar_codigo,mes,nombre_dia,hora
2020-10-08 08:17:57-03:00,2,1,3895,195,1438.0,10,Jueves,8
2020-10-26 14:54:13-03:00,2,1,5709,181,7602.0,10,Lunes,14
2020-10-05 17:36:47-03:00,2,1,3663,151,3364.0,10,Lunes,17
2020-10-07 11:59:09-03:00,2,1,1567,137,815.0,10,Miércoles,11
2020-10-09 15:46:40-03:00,2,1,3133,145,3358.0,10,Viernes,15
...,...,...,...,...,...,...,...,...
2020-10-13 18:39:14-03:00,2,1,3200,102,254.0,10,Martes,18
2020-10-07 13:29:24-03:00,2,1,1145,494,2898.0,10,Miércoles,13
2020-10-05 06:33:48-03:00,2,1,4201,109,2214.0,10,Lunes,6
2020-10-16 06:34:05-03:00,2,1,4201,113,602.0,10,Viernes,6


In [27]:
df_hourly = df_proc.groupby([pd.Grouper(freq="1H")])["cantidad_pasajeros"].sum().reset_index()
plot_boardings_by_time(df_hourly)
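
As a minimal sketch of what the hourly aggregation does, the toy example below groups a few made-up boarding events into hourly bins. The data is invented for illustration, and the frequency is written as `"60min"` (equivalent to one hour) because that alias is accepted across pandas versions.

```python
import pandas as pd

# Made-up boarding events indexed by timestamp, mimicking df_proc
idx = pd.to_datetime(
    ["2020-10-05 06:10", "2020-10-05 06:50", "2020-10-05 07:05", "2020-10-06 06:30"]
)
toy = pd.DataFrame({"cantidad_pasajeros": [1, 2, 1, 3]}, index=idx)
toy.index.name = "fecha_evento"

# One row per hourly bin between the first and last event
hourly = toy.groupby(pd.Grouper(freq="60min"))["cantidad_pasajeros"].sum().reset_index()
print(hourly.head())
```

Hours with no events still appear in the result with a sum of 0, so the series is regular in time, which is convenient for plotting and for later modeling.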

In [28]:
# Daily sum of boardings and median by day name
df_day_name = (
    df_proc.groupby([pd.Grouper(freq="1D"), "nombre_dia"])["cantidad_pasajeros"]
    .sum()
    .groupby(["nombre_dia"])
    .median()
    .reset_index()
)
plot_boardings_by_day_name(df_day_name)
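
The two-stage aggregation above first sums boardings per calendar day and then takes the median of those daily totals for each day name. A small sketch with invented numbers shows the effect:

```python
import pandas as pd

# Made-up events: two Mondays (Lunes) and one Tuesday (Martes)
idx = pd.to_datetime(
    ["2020-10-05 08:00", "2020-10-05 17:00", "2020-10-12 08:00", "2020-10-06 09:00"]
)
toy = pd.DataFrame(
    {
        "cantidad_pasajeros": [10, 20, 50, 7],
        "nombre_dia": ["Lunes", "Lunes", "Lunes", "Martes"],
    },
    index=idx,
)

# Stage 1: total boardings per calendar day (and day name)
daily = toy.groupby([pd.Grouper(freq="1D"), "nombre_dia"])["cantidad_pasajeros"].sum()

# Stage 2: median of the daily totals for each day name
median_by_day_name = daily.groupby(level="nombre_dia").median()
print(median_by_day_name)
```

Using the median of daily totals rather than the mean keeps a single unusual day (a holiday, for instance) from dominating the typical value for that weekday.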

In [29]:
# Hourly sum of boardings and median by hour and day name
df_hourly_day_name = df_hourly.set_index("fecha_evento")
df_hourly_day_name["nombre_dia"] = df_hourly_day_name.index.day_name()
# Assign the result back: replace(inplace=True) on a .loc selection may act on a temporary copy
df_hourly_day_name["nombre_dia"] = df_hourly_day_name["nombre_dia"].replace(DAY_NAME_MAPPING)
df_hourly_day_name["hora"] = df_hourly_day_name.index.hour
df_hourly_day_name = df_hourly_day_name.groupby(["hora", "nombre_dia"]).median().reset_index()

plot_boardings_by_hour_and_day_name(df_hourly_day_name)
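
The time features used above come straight from the datetime index. The sketch below derives the day name and hour for two made-up timestamps. The `day_name_mapping` dictionary is a hypothetical two-entry stand-in for `DAY_NAME_MAPPING`, whose actual contents live in `src/preparation/constants.py` and are not shown here.

```python
import pandas as pd

# Hypothetical subset of the English-to-Spanish day name mapping
day_name_mapping = {"Monday": "Lunes", "Thursday": "Jueves"}

idx = pd.to_datetime(["2020-10-05 08:00", "2020-10-08 09:00"])
toy = pd.DataFrame({"cantidad_pasajeros": [3, 4]}, index=idx)

# day_name() returns English names by default; map them by assigning the result
toy["nombre_dia"] = toy.index.day_name()
toy["nombre_dia"] = toy["nombre_dia"].replace(day_name_mapping)
toy["hora"] = toy.index.hour
print(toy)
```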

In [44]:
# Rank bus lines by typical weekly boardings (sum over day names of the median daily boardings)
df_weekly_by_day_name_and_line = (
    df_proc.groupby([pd.Grouper(freq="1D"), "nombre_dia", "dsc_linea"])["cantidad_pasajeros"]
    .sum()
    .groupby(["dsc_linea", "nombre_dia"])
    .median()
    .groupby("dsc_linea")
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)

df_weekly_by_day_name_and_line["decile_rank"] = pd.qcut(
    df_weekly_by_day_name_and_line["cantidad_pasajeros"], 10, labels=False
)

# Contribution of each decile
df_decile_rank_prop = (
    df_weekly_by_day_name_and_line.groupby("decile_rank")["cantidad_pasajeros"].sum().reset_index()
)
df_decile_rank_prop["proportion"] = (
    df_decile_rank_prop["cantidad_pasajeros"] / df_decile_rank_prop["cantidad_pasajeros"].sum()
)
df_decile_rank_prop

,decile_rank,cantidad_pasajeros,proportion
0,0,1509.5,0.001341
1,1,6931.0,0.006157
2,2,13356.5,0.011865
3,3,30346.5,0.026959
4,4,50580.0,0.044933
5,5,84003.5,0.074626
6,6,136517.5,0.121277
7,7,186790.5,0.165938
8,8,241547.5,0.214582
9,9,374083.5,0.332322
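
To make the decile ranking concrete, here is a small sketch with ten invented lines. `pd.qcut` splits the lines into ten equal-count groups by boardings, and the share of each group is its summed boardings divided by the overall total. Line names and values are made up for illustration.

```python
import pandas as pd

# Ten hypothetical lines with strongly skewed boardings
toy = pd.DataFrame(
    {
        "dsc_linea": [f"L{i}" for i in range(10)],
        "cantidad_pasajeros": [5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560],
    }
)

# labels=False returns integer bin codes: 0 is the lowest decile, 9 the highest
toy["decile_rank"] = pd.qcut(toy["cantidad_pasajeros"], 10, labels=False)

# Share of total boardings contributed by each decile
share = (
    toy.groupby("decile_rank")["cantidad_pasajeros"].sum() / toy["cantidad_pasajeros"].sum()
)
print(share)
```

In the real data above, the top decile accounts for about 33% of boardings, which motivates restricting the graph to those lines.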


In [45]:
# Select bus lines from the highest decile
df_bus_lines = df_weekly_by_day_name_and_line.loc[
    df_weekly_by_day_name_and_line["decile_rank"] == 9, :
]
df_bus_lines = df_bus_lines.sort_values("cantidad_pasajeros", ascending=False)
df_bus_lines

,dsc_linea,cantidad_pasajeros,decile_rank
0,103,41948.5,9
1,183,36571.5,9
2,306,32997.5,9
3,185,31321.0,9
4,181,29585.0,9
5,G,29426.0,9
6,546,27830.0,9
7,137,25128.0,9
8,110,24509.5,9
9,145,24112.0,9


In [50]:
# Load bus stops
gdf_bus_stops = load_stm_bus_stops()

# Load bus tracks
gdf_bus_tracks = load_stm_bus_line_track()

388    B
Name: DESC_VARIA, dtype: object

In [None]:
# %% [markdown]
# ## Build bus line tracks

# Load bus stops
gdf_bus_stops = load_stm_bus_stops()

# Load bus tracks
gdf_bus_tracks = load_stm_bus_line_track()

# %%
# Read all bus stops by bus line from geojson files
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
bus_stop_frames = [load_spatial_data(bus_line, type="bus_stop") for bus_line in BUS_LINES]
all_bus_stops = gpd.GeoDataFrame(pd.concat(bus_stop_frames)).set_crs(CRS)

# Read all bus tracks by bus line from geojson files
bus_track_frames = []
for bus_line in BUS_LINES:
    df = load_spatial_data(bus_line, type="bus_line")
    df["line"] = bus_line
    bus_track_frames.append(df)
all_bus_tracks = gpd.GeoDataFrame(pd.concat(bus_track_frames)).set_crs(CRS)

# %%
# Get ordered bus stops and bus tracks from files
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
bus_stop_ordered_frames, bus_track_ordered_frames = [], []

for bus_line in BUS_LINES:
    bus_stop_ordered_frames.append(load_spatial_data(bus_line, type="bus_stop_ordered"))

    df_bus_track_ordered = load_spatial_data(bus_line, type="bus_track_ordered")
    df_bus_track_ordered["DESC_LINEA"] = bus_line
    bus_track_ordered_frames.append(df_bus_track_ordered)

all_bus_stops_ordered = gpd.GeoDataFrame(pd.concat(bus_stop_ordered_frames)).set_crs(CRS)
all_bus_stops_ordered = all_bus_stops_ordered.astype({"COD_UBIC_P": "int"})
all_bus_tracks_ordered = gpd.GeoDataFrame(pd.concat(bus_track_ordered_frames)).set_crs(CRS)

# %%
# Fix order from origin
for bus_line in BUS_LINES:
    if bus_line == "183":
        fix_bus_stop_order(bus_line, reorder=True)
    elif bus_line != "405":
        fix_bus_stop_order(bus_line)

# %%
# Check shared bus stops by lines
shared_bus_stops = (
    all_bus_stops_ordered.groupby(["COD_UBIC_P"])
    .agg(lines=("DESC_LINEA", "|".join), number_of_lines=("DESC_LINEA", len))
    .round(0)
    .sort_values("COD_UBIC_P", ascending=True)
    .reset_index()
    .astype({"COD_UBIC_P": int})
)

print(shared_bus_stops.loc[shared_bus_stops["number_of_lines"] > 1, :]["lines"].unique())


# %%
# Get distances for shared bus stations
shared = shared_bus_stops.loc[shared_bus_stops["number_of_lines"] > 1, :][["COD_UBIC_P", "lines"]]

for bus_stop, bus_lines in zip(shared["COD_UBIC_P"], shared["lines"]):
    bus_lines = bus_lines.split("|")
    for bus_line in bus_lines:
        # Assumed intent: select the shared stop itself within the current line
        bus_stops_by_line = all_bus_stops_ordered.loc[
            (all_bus_stops_ordered["DESC_LINEA"] == bus_line)
            & (all_bus_stops_ordered["COD_UBIC_P"] == bus_stop),
            :,
        ]
        bus_tracks_by_line = all_bus_tracks_ordered.loc[
            all_bus_tracks_ordered["DESC_LINEA"] == bus_line, :
        ]

# %%
# Build adjacency matrix (loaded here from previously computed CSV files)
# df_adyacency_matrix, df_from_to_weight = build_adyacency_matrix(control=True)
df_adyacency_matrix = pd.read_csv("data/processed/adyacency_matrix.csv", index_col=0)
df_adyacency_matrix.columns = df_adyacency_matrix.columns.astype(int)

df_from_to_weight = pd.read_csv("data/processed/from_to_weight.csv", index_col=0)

# %%
# Check adjacency matrix: weights between consecutive stops of line 103
bus_stops_103 = all_bus_stops_ordered.loc[all_bus_stops_ordered["DESC_LINEA"] == "103", :]
bus_stops_list = bus_stops_103["COD_UBIC_P"].unique()
dist_between_stops = list()

for i in range(0, (len(bus_stops_list) - 1)):
    bus_stop_start = bus_stops_103.loc[i, "COD_UBIC_P"]
    bus_stop_end = bus_stops_103.loc[(i + 1), "COD_UBIC_P"]
    d = df_adyacency_matrix.loc[bus_stop_start, bus_stop_end]
    dist_between_stops.append(d)

print(dist_between_stops)


# %%
G = nx.from_pandas_edgelist(
    df_from_to_weight, source="from", target="to", edge_attr="weight", create_using=nx.DiGraph
)
G.name = "Bus lines of Montevideo"
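# nx.info was removed in NetworkX 3.x; print the summary explicitly
print(f"{G.name}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")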

# %%
layout = nx.spring_layout(G)
nx.draw(G, layout, with_labels=True)


# %%
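# Build directed graph from the adjacency matrix
G = nx.from_pandas_adjacency(df_adyacency_matrix, create_using=nx.DiGraph)
G.name = "Graph of bus lines of Montevideo"
# nx.info was removed in NetworkX 3.x; print the summary explicitly
print(f"{G.name}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")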

# %%
A = df_adyacency_matrix.values
G = nx.from_numpy_array(A, parallel_edges=False, create_using=nx.DiGraph)
G.name = "Graph of main buses lines of Montevideo"
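# nx.info was removed in NetworkX 3.x; print the summary explicitly
print(f"{G.name}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")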


# %%
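# Toy example: build a directed graph from a small adjacency matrix
df = pd.DataFrame([[0, 0, 0], [1, 0, 0], [0, 1, 0]])
print(df)
A = df.values
# G = nx.from_numpy_array(A, parallel_edges=True, create_using=nx.DiGraph())

# Rebuild G from the toy matrix so the name and summary below describe it,
# not the graph left over from the previous cell
G = nx.from_pandas_adjacency(df, create_using=nx.DiGraph)
G.name = "Graph from pandas adjacency matrix"
print(f"{G.name}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")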
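
When reading a directed graph from an adjacency matrix, the orientation convention matters: in NetworkX, a nonzero entry at row `i`, column `j` is an edge from node `i` to node `j`. The plain-Python sketch below spells this out for the 3x3 toy matrix used above. `adjacency_to_edges` is a hypothetical helper written for illustration, not part of the project or of NetworkX.

```python
def adjacency_to_edges(matrix, labels):
    """List (source, target, weight) for every nonzero entry, reading rows as sources."""
    edges = []
    for i, row in enumerate(matrix):
        for j, weight in enumerate(row):
            if weight != 0:
                edges.append((labels[i], labels[j], weight))
    return edges


toy_matrix = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
toy_edges = adjacency_to_edges(toy_matrix, labels=[0, 1, 2])
print(toy_edges)
```

For this matrix the edges run 1 to 0 and 2 to 1, so node 0 has no outgoing edges. For the bus graph, the same reading means row labels are origin stops and column labels are destination stops.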
