<img heigth="8" src="https://i.imgur.com/s60I1zW.png" alt="pbs-enae">

<h1 align="left">What Does a Machine Learning Project Look Like from Start to Finish?</h1>

<h2 align="left"><i>Traffic Congestion Prediction</i></h2>

<p align="left">
  <h3>Joseph F. Vergel-Becerra | Machine Learning - Tools and Skill Courses</h3>
  <br>
  <b>Last updated:</b> <i>07/02/2023</i>
  <br><br>
  <a href="#tabla-de-contenido">Table of contents</a> •
  <a href="#contribuir">Contributing</a> •
  <a href="#agradecimientos">Acknowledgements</a>
  <br><br>
</p>
<table align="left">
  <td>
      <a href="https://img.shields.io/badge/version-0.1.0-blue.svg?cacheSeconds=2592000">
        <img src="https://img.shields.io/badge/version-0.1.0-blue.svg?cacheSeconds=2592000" alt="Version" height="18">
      </a>
  </td>
  <td>
    <a href="https://colab.research.google.com/github/joefavergel/pbs-enae-ml-course/blob/main/traffic_congestion_end-to-end_ml_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
      </a>
  </td>
  <td>
    <a href="https://github.com/joefavergel/pbs-enae-ml-course" target="_parent"><img src="https://img.shields.io/github/forks/joefavergel/pbs-enae-ml-course?style=social" alt="Fork"/>
      </a>
  </td>
</table>
<br>
<br>

---

Minimizing delivery time is one of the main objectives that last-mile logistics companies treat as crucial to their operations. Now imagine that you have just been hired onto the data science team at one of these companies, and your first challenge is to ***"optimize routes and deliveries so that they take the least possible time"***. To do so, you have data on traffic congestion in the cities where the company operates, and you must design and implement a proof of concept (PoC) that adds value to the company.

<a id="tabla-de-contenido"></a>

## Table of contents


<p align="left">
    <img width="500px" src="https://i.imgur.com/Br1iISZ.png" alt="table-of-contents" style="float: left;">
</p>

In [1]:
import sys
import urllib.request
from pathlib import Path

import sklearn
from IPython.display import HTML
from packaging import version


print("[INFO] This project requires Python 3.8 or later and Scikit-Learn 1.0.1 or later.")
assert sys.version_info >= (3, 8)
assert version.parse(sklearn.__version__) >= version.parse("1.0.1")
print("[INFO] Versions verified successfully!")


def css_styling():
    """Download the course stylesheet if it is missing and return it as HTML."""
    styles_path = Path("./styles/custom.css")
    if not styles_path.is_file():
        styles_path.parent.mkdir(parents=True, exist_ok=True)
        url = "https://github.com/joefavergel/pbs-enae-ml-course/blob/main/styles/custom.css?raw=true"
        urllib.request.urlretrieve(url, styles_path)

    # read_text() opens and closes the file, so no handle is left dangling
    return HTML(styles_path.read_text())


css_styling()

[INFO] This project requires Python 3.8 or later and Scikit-Learn 1.0.1 or later.
[INFO] Versions verified successfully!


## 2.1. Defining the problem

The objectives set by a company's [C-levels](https://academia.crandi.com/negocios-digitales/ejecutivos-c-level/) are usually defined in terms of strategic planning and the business goals of their respective divisions. Our case is no exception: ***"Optimize routes and deliveries so that they take the least possible time"*** will almost certainly require more than one data-driven solution, and no single "magic" model will achieve it with high accuracy. Nevertheless, a very good first step within the **business understanding** stage is to formulate a hypothesis based on what intuition and common sense suggest. This leads us to the following:
<br><br>

<div class=hypo>
<b>Hypothesis 1:</b> Traffic congestion in the cities of operation is possibly the main cause of delivery delays.
</div><br><br>

## 2.2. Working with real data and taking a first look

The project started a couple of months ago and, during the handover, all you received was the data collected so far. The data provided consist of **aggregated trip-logging metrics** for several fleets of commercial semi-trailer vehicles. Fig. 1 shows the provider's data collection methodology.

<p align="center">
  <a target="_blank">
    <img width="400px" src="https://assets-global.website-files.com/5f2a93fe880654a977c51043/60ca13eef83c460f414f849a_image6.gif" alt="intersection-ai">
  </a><br>
  <b>Figure 1:</b> Object detection and segmentation analysis from an aerial camera for one of the road intersections, supplied by the data provider¹.
</p>

Since this is a ***big data²*** problem, the provider **grouped the data by intersection, month, hour of day, direction driven through the intersection, and whether the day was a weekend or not**. Among the little documentation left by the previous data scientist, a `README.md` contained the following note about how the data were aggregated:

> *By our design and at our request, the data provider was asked for three different quantiles of two different metrics that capture how long it took the group of vehicles to pass through the intersection. Specifically, the 20th, 50th, and 80th percentiles of the total time stopped at an intersection, `TotalTimeStopped`, and of the distance between the intersection and the first place where a vehicle stopped while waiting, `DistanceToFirstStop`.²*

With this in mind, the first step is to take a look at the data.
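
To make the aggregation described in the note more concrete, the following is a minimal sketch of how raw trip records could be grouped and reduced to percentiles with pandas. The raw records are not part of this project, so `raw_trips` is synthetic, and the exact grouping keys and procedure used by the provider are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic raw trip records (hypothetical: the provider's raw data is not available).
rng = np.random.default_rng(seed=0)
n_trips = 1_000
raw_trips = pd.DataFrame({
    "IntersectionId": rng.integers(0, 3, size=n_trips),
    "Month": rng.integers(6, 8, size=n_trips),
    "Hour": rng.integers(0, 24, size=n_trips),
    "Weekend": rng.integers(0, 2, size=n_trips),
    "EntryHeading": rng.choice(["N", "S", "E", "W"], size=n_trips),
    "ExitHeading": rng.choice(["N", "S", "E", "W"], size=n_trips),
    "TotalTimeStopped": rng.exponential(scale=10.0, size=n_trips).round(),
})

# Assumed grouping keys, following the description in the text.
group_keys = ["IntersectionId", "Month", "Hour", "Weekend", "EntryHeading", "ExitHeading"]
percentiles = [0.2, 0.5, 0.8]

# One row per group, one column per requested percentile.
aggregated = (
    raw_trips.groupby(group_keys)["TotalTimeStopped"]
    .quantile(percentiles)
    .unstack()
)
aggregated.columns = [f"TotalTimeStopped_p{int(round(q * 100))}" for q in aggregated.columns]
aggregated = aggregated.reset_index()

print(aggregated.head())
```

The same pattern would apply to `DistanceToFirstStop`; only the aggregated column changes.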

---

<a href="https://www.mapbox.com/blog/300-more-lane-guidance-for-navigation-powered-by-ai-mapping">¹ mapbox: 300% More Lane Guidance For Navigation Powered By Ai Mapping</a>.<br>
² See **Exercise 1**.

In [81]:
from pathlib import Path
from zipfile import ZipFile
import urllib.request

DATASET = "bigquery-geotab-intersection-congestion"
DATA_PATH = f"datasets/{DATASET}/"


def load_traffic_congestion_data(dataset: str):
    """Download the zipped dataset (if not already cached) and extract it under datasets/."""
    zipfile_path = Path(f"datasets/{dataset}.zip")
    if not zipfile_path.is_file():
        # Download only once; later runs reuse the cached archive
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = f"https://github.com/joefavergel/datasets/blob/main/{dataset}.zip?raw=true"
        urllib.request.urlretrieve(url, zipfile_path)
    Path(f"datasets/{dataset}").mkdir(parents=True, exist_ok=True)
    try:
        # The context manager closes the archive once extraction finishes
        with ZipFile(zipfile_path) as archive:
            archive.extractall(f"datasets/{dataset}")
        print(f"[INFO] Dataset '{dataset}' downloaded and uncompressed correctly!")
    except Exception as e:
        print(f"[Exception] There's been a problem: {e}")


load_traffic_congestion_data(dataset=DATASET)

[INFO] Dataset 'bigquery-geotab-intersection-congestion' downloaded and uncompressed correctly!


In [53]:
import os

import pandas as pd

train = pd.read_csv(os.path.join(DATA_PATH, "train.csv"))
test = pd.read_csv(os.path.join(DATA_PATH, "test.csv"))

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(f"[INFO] Training dataset dimensions (rows, cols): {train.shape}")
    display(train.head())

    print(f"\n[INFO] Testing dataset dimensions (rows, cols): {test.shape}")
    display(test.head())

[INFO] Training dataset dimensions (rows, cols): (856387, 28)


Unnamed: 0,RowId,IntersectionId,Latitude,Longitude,EntryStreetName,ExitStreetName,EntryHeading,ExitHeading,Hour,Weekend,Month,Path,TotalTimeStopped_p20,TotalTimeStopped_p40,TotalTimeStopped_p50,TotalTimeStopped_p60,TotalTimeStopped_p80,TimeFromFirstStop_p20,TimeFromFirstStop_p40,TimeFromFirstStop_p50,TimeFromFirstStop_p60,TimeFromFirstStop_p80,DistanceToFirstStop_p20,DistanceToFirstStop_p40,DistanceToFirstStop_p50,DistanceToFirstStop_p60,DistanceToFirstStop_p80,City
0,1921357,0,33.791659,-84.430032,Marietta Boulevard Northwest,Marietta Boulevard Northwest,NW,NW,0,0,6,Marietta Boulevard Northwest_NW_Marietta Boule...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Atlanta
1,1921358,0,33.791659,-84.430032,Marietta Boulevard Northwest,Marietta Boulevard Northwest,SE,SE,0,0,6,Marietta Boulevard Northwest_SE_Marietta Boule...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Atlanta
2,1921359,0,33.791659,-84.430032,Marietta Boulevard Northwest,Marietta Boulevard Northwest,NW,NW,1,0,6,Marietta Boulevard Northwest_NW_Marietta Boule...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Atlanta
3,1921360,0,33.791659,-84.430032,Marietta Boulevard Northwest,Marietta Boulevard Northwest,SE,SE,1,0,6,Marietta Boulevard Northwest_SE_Marietta Boule...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Atlanta
4,1921361,0,33.791659,-84.430032,Marietta Boulevard Northwest,Marietta Boulevard Northwest,NW,NW,2,0,6,Marietta Boulevard Northwest_NW_Marietta Boule...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Atlanta



[INFO] Testing dataset dimensions (rows, cols): (1921357, 13)


Unnamed: 0,RowId,IntersectionId,Latitude,Longitude,EntryStreetName,ExitStreetName,EntryHeading,ExitHeading,Hour,Weekend,Month,Path,City
0,0,1,33.75094,-84.393032,Peachtree Street Southwest,Peachtree Street Southwest,NE,NE,0,0,6,Peachtree Street Southwest_NE_Peachtree Street...,Atlanta
1,1,1,33.75094,-84.393032,Peachtree Street Southwest,Mitchell Street Southwest,SW,SE,0,0,6,Peachtree Street Southwest_SW_Mitchell Street ...,Atlanta
2,2,1,33.75094,-84.393032,Peachtree Street Southwest,Peachtree Street Southwest,SW,SW,0,0,6,Peachtree Street Southwest_SW_Peachtree Street...,Atlanta
3,3,1,33.75094,-84.393032,Peachtree Street Southwest,Peachtree Street Southwest,NE,NE,1,0,6,Peachtree Street Southwest_NE_Peachtree Street...,Atlanta
4,4,1,33.75094,-84.393032,Peachtree Street Southwest,Peachtree Street Southwest,SW,SW,1,0,6,Peachtree Street Southwest_SW_Peachtree Street...,Atlanta
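
The two previews above differ in width (28 columns for training, 13 for testing). As a small sketch, the column names copied from those previews can be compared to list the metrics that exist only in the training set, and to check that the six columns named in the `README.md` note are among them. The helper names (`train_only`, `note_targets`) are illustrative, and treating these columns as the quantities to predict is an assumption rather than something the source states.

```python
# Column names copied from the train/test previews shown above.
metrics = ["TotalTimeStopped", "TimeFromFirstStop", "DistanceToFirstStop"]
percentiles = [20, 40, 50, 60, 80]

test_columns = [
    "RowId", "IntersectionId", "Latitude", "Longitude",
    "EntryStreetName", "ExitStreetName", "EntryHeading", "ExitHeading",
    "Hour", "Weekend", "Month", "Path", "City",
]
metric_columns = [f"{m}_p{p}" for m in metrics for p in percentiles]
train_columns = test_columns[:-1] + metric_columns + ["City"]

# Columns present in the training set but missing from the testing set.
train_only = [c for c in train_columns if c not in test_columns]

# The six columns singled out in the README note (p20, p50, p80 of two metrics).
note_targets = [
    f"{m}_p{p}"
    for m in ("TotalTimeStopped", "DistanceToFirstStop")
    for p in (20, 50, 80)
]

print(f"{len(train_only)} columns appear only in train")
print("Columns named in the note:", note_targets)
```

Inside the notebook, the same comparison can be made directly on the loaded dataframes with `train.columns.difference(test.columns)`.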


## 2.2. Exercise 1

a) Concisely define: _how much data is "big data"?_


b) Explain the former data scientist's note, briefly describing the procedure the provider had to carry out to meet "our company's" request.

In [38]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856387 entries, 0 to 856386
Data columns (total 28 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   RowId                    856387 non-null  int64  
 1   IntersectionId           856387 non-null  int64  
 2   Latitude                 856387 non-null  float64
 3   Longitude                856387 non-null  float64
 4   EntryStreetName          848239 non-null  object 
 5   ExitStreetName           850100 non-null  object 
 6   EntryHeading             856387 non-null  object 
 7   ExitHeading              856387 non-null  object 
 8   Hour                     856387 non-null  int64  
 9   Weekend                  856387 non-null  int64  
 10  Month                    856387 non-null  int64  
 11  Path                     856387 non-null  object 
 12  TotalTimeStopped_p20     856387 non-null  float64
 13  TotalTimeStopped_p40     856387 non-null  float64
 14  Tota

In [47]:
train.describe()

Unnamed: 0,RowId,IntersectionId,Latitude,Longitude,Hour,Weekend,Month,TotalTimeStopped_p20,TotalTimeStopped_p40,TotalTimeStopped_p50,...,TimeFromFirstStop_p20,TimeFromFirstStop_p40,TimeFromFirstStop_p50,TimeFromFirstStop_p60,TimeFromFirstStop_p80,DistanceToFirstStop_p20,DistanceToFirstStop_p40,DistanceToFirstStop_p50,DistanceToFirstStop_p60,DistanceToFirstStop_p80
count,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,...,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0,856387.0
mean,2349550.0,833.283384,39.618965,-77.916488,12.431234,0.27788,9.104808,1.755596,5.403592,7.722655,...,3.181096,9.162174,12.722165,18.926085,34.201656,6.765856,20.285128,28.837113,44.27231,83.991313
std,247217.8,654.308913,2.935437,5.952959,6.071843,0.447954,1.991094,7.146549,12.981674,15.68591,...,11.835994,20.446568,24.219271,29.851797,41.130668,29.535968,59.202108,75.217343,102.03225,160.709797
min,1921357.0,0.0,33.649973,-87.862288,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2135454.0,291.0,39.936739,-84.387607,8.0,0.0,7.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2349550.0,679.0,39.982974,-75.175055,13.0,0.0,9.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,27.0,0.0,0.0,0.0,0.0,60.4
75%,2563646.0,1264.0,41.910047,-75.100495,17.0,1.0,11.0,0.0,0.0,10.0,...,0.0,0.0,22.0,31.0,49.0,0.0,0.0,53.1,64.2,85.95
max,2777743.0,2875.0,42.381782,-71.02555,23.0,1.0,12.0,298.0,375.0,375.0,...,337.0,356.0,356.0,357.0,359.0,1901.9,2844.4,2851.1,3282.4,4079.2


In [37]:
train.isnull().sum()

RowId                         0
IntersectionId                0
Latitude                      0
Longitude                     0
EntryStreetName            8148
ExitStreetName             6287
EntryHeading                  0
ExitHeading                   0
Hour                          0
Weekend                       0
Month                         0
Path                          0
TotalTimeStopped_p20          0
TotalTimeStopped_p40          0
TotalTimeStopped_p50          0
TotalTimeStopped_p60          0
TotalTimeStopped_p80          0
TimeFromFirstStop_p20         0
TimeFromFirstStop_p40         0
TimeFromFirstStop_p50         0
TimeFromFirstStop_p60         0
TimeFromFirstStop_p80         0
DistanceToFirstStop_p20       0
DistanceToFirstStop_p40       0
DistanceToFirstStop_p50       0
DistanceToFirstStop_p60       0
DistanceToFirstStop_p80       0
City                          0
dtype: int64

In [36]:
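cols = train.columns
# select_dtypes is the public alternative to the private _get_numeric_data()
num_cols = list(train.select_dtypes(include="number").columns)
print("[INFO] Numerical features: ", num_cols)
# Every non-numeric column is treated as categorical (set order is arbitrary)
cat_cols = list(set(cols) - set(num_cols))
print("\n[INFO] Categorical features:", cat_cols)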

[INFO] Numerical features:  ['RowId', 'IntersectionId', 'Latitude', 'Longitude', 'Hour', 'Weekend', 'Month', 'TotalTimeStopped_p20', 'TotalTimeStopped_p40', 'TotalTimeStopped_p50', 'TotalTimeStopped_p60', 'TotalTimeStopped_p80', 'TimeFromFirstStop_p20', 'TimeFromFirstStop_p40', 'TimeFromFirstStop_p50', 'TimeFromFirstStop_p60', 'TimeFromFirstStop_p80', 'DistanceToFirstStop_p20', 'DistanceToFirstStop_p40', 'DistanceToFirstStop_p50', 'DistanceToFirstStop_p60', 'DistanceToFirstStop_p80']

[INFO] Categorical features: ['EntryStreetName', 'City', 'Path', 'ExitStreetName', 'EntryHeading', 'ExitHeading']


In [45]:
cities = list(train["City"].unique())
print("[INFO] Cities:", cities)
print("[INFO] Data by city:\n\n", train['City'].value_counts())

[INFO] Cities: ['Atlanta', 'Boston', 'Chicago', 'Philadelphia']
[INFO] Data by city:

 Philadelphia    390237
Boston          178617
Atlanta         156484
Chicago         131049
Name: City, dtype: int64


In [77]:
train_intersection_ids = set(train["IntersectionId"].unique())
print("[INFO] Number of unique intersections in training dataset: ", len(train_intersection_ids))
test_intersection_ids = set(test["IntersectionId"].unique())
print("[INFO] Number of unique intersections in testing dataset: ", len(test_intersection_ids))
print("[INFO] Number of common intersections: ",
    len(train_intersection_ids & test_intersection_ids)
)
# Set differences: IDs that appear in only one of the two datasets
test_inter_minus_train_inter = test_intersection_ids - train_intersection_ids
print("[INFO] Number of intersections that are in the testing set but not in the training set: ",
    len(test_inter_minus_train_inter)
)
train_inter_minus_test_inter = train_intersection_ids - test_intersection_ids
print("[INFO] Number of intersections that are in the training set but not in the testing set: ",
    len(train_inter_minus_test_inter)
)
alpha = len(test_inter_minus_train_inter) / len(train_inter_minus_test_inter)
print("[INFO] Proportionality factor between the two differences: ", round(alpha, 2))

[INFO] Number of unique intersections in training dataset:  2559
[INFO] Number of unique intersections in testing dataset:  2765
[INFO] Number of common intersections:  2485
[INFO] Number of intersections that are in the testing set but not in the training set:  280
[INFO] Number of intersections that are in the training set but not in the testing set:  74
[INFO] Proportionality factor between the two differences:  3.78


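> **_NOTE:_** If *A* = `test_intersection_ids` and *B* = `train_intersection_ids` $\Rightarrow$ the `280` intersections that appear only in the testing set correspond to the blue area, i.e., 3.78 times as many as their counterpart (the intersections that appear only in the training set). This is a clear example of what is known as _dataset shift³_: a discrepancy between the data a model is trained on and the data it is evaluated on.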

<p align="left">
    <img width="300px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/SetDifferenceA.svg/663px-SetDifferenceA.svg.png" alt="set-difference" style="float: left;">
</p>
<br>
<br>


---
<p>
    <a href="https://www.seldon.io/what-is-covariate-shift">³ Seldon: What is covariate shift in machine learning?</a>.<br>
</p>
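
Counting intersection IDs is only one way to gauge the discrepancy. A complementary check is the share of testing rows that belong to intersections never seen during training. The sketch below illustrates this on toy dataframes; `train_toy` and `test_toy` are hypothetical stand-ins with made-up values, not the project's data.

```python
import pandas as pd

# Toy stand-ins for the real `train` and `test` dataframes (hypothetical values).
train_toy = pd.DataFrame({"IntersectionId": [0, 0, 1, 2, 2, 3]})
test_toy = pd.DataFrame({"IntersectionId": [1, 1, 2, 4, 4, 4, 5, 6]})

train_ids = set(train_toy["IntersectionId"])
test_ids = set(test_toy["IntersectionId"])

only_in_test = test_ids - train_ids    # {4, 5, 6}
only_in_train = train_ids - test_ids   # {0, 3}

# Boolean mask over test rows whose intersection never appears in train.
unseen_mask = test_toy["IntersectionId"].isin(only_in_test)
unseen_row_share = unseen_mask.mean()

print(f"IDs only in test: {sorted(only_in_test)}")
print(f"IDs only in train: {sorted(only_in_train)}")
print(f"Share of test rows from unseen intersections: {unseen_row_share:.1%}")
```

A large share would suggest that features tied to a specific intersection ID may generalize poorly, and that location-based features such as `Latitude` and `Longitude` deserve attention.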

In [78]:
# Rows of the testing set whose intersection never appears in the training set
new_intersections = test.query("IntersectionId in @test_inter_minus_train_inter", engine="python")
new_intersections

Unnamed: 0,RowId,IntersectionId,Latitude,Longitude,EntryStreetName,ExitStreetName,EntryHeading,ExitHeading,Hour,Weekend,Month,Path,City
67537,67537,109,33.854340,-84.431184,Northside Parkway Northwest,Northside Parkway Northwest,N,N,0,0,6,Northside Parkway Northwest_N_Northside Parkwa...,Atlanta
67538,67538,109,33.854340,-84.431184,Northside Parkway Northwest,Northside Parkway Northwest,S,S,5,0,6,Northside Parkway Northwest_S_Northside Parkwa...,Atlanta
67539,67539,109,33.854340,-84.431184,Northside Parkway Northwest,Northside Parkway Northwest,N,N,6,0,6,Northside Parkway Northwest_N_Northside Parkwa...,Atlanta
67540,67540,109,33.854340,-84.431184,Northside Parkway Northwest,Northside Parkway Northwest,S,S,6,0,6,Northside Parkway Northwest_S_Northside Parkwa...,Atlanta
67541,67541,109,33.854340,-84.431184,Northside Parkway Northwest,Northside Parkway Northwest,N,N,7,0,6,Northside Parkway Northwest_N_Northside Parkwa...,Atlanta
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1916550,1916550,1979,40.004704,-75.176913,West Allegheny Avenue,West Allegheny Avenue,W,W,22,1,12,West Allegheny Avenue_W_West Allegheny Avenue_W,Philadelphia
1916551,1916551,1979,40.004704,-75.176913,North 29th Street,North 29th Street,N,N,23,1,12,North 29th Street_N_North 29th Street_N,Philadelphia
1916552,1916552,1979,40.004704,-75.176913,North 29th Street,North 29th Street,S,S,23,1,12,North 29th Street_S_North 29th Street_S,Philadelphia
1916553,1916553,1979,40.004704,-75.176913,West Allegheny Avenue,West Allegheny Avenue,E,E,23,1,12,West Allegheny Avenue_E_West Allegheny Avenue_E,Philadelphia


## 2.2. Exercise 2

1. I want to know the number of samples associated with each intersection. Process the dataframe to obtain this result as efficiently as possible.

In [79]:
# Put your code here!