# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final project: Machine Learning** </center>
---

**Date**: November 23, 2025

**Student Name**: Aura Melina Gutierrez Jimenez

**Professor**: Pablo Camarillo Ramirez

## Objective
Group the city's traffic count points by similarity in volume, location, and hourly pattern.


## Justification of the ML algorithm

I chose K-Means because the dataset has no "label" / target column to predict. However, I can still search for segments of locations with similar behavior (high/medium/low traffic zones, hourly peaks, etc.). In addition, K-Means is scalable, is implemented in Spark ML, and makes geospatial and volume clusters easy to interpret for visualization and mobility decisions.
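
To make the clustering idea concrete, here is a minimal plain-Python sketch of the K-Means assign/update loop. It is only an illustration under simplifying assumptions: the toy points are made up, the initial centers are passed in by hand (Spark's `KMeans` chooses its own initialization), and the helper names `squared_distance`, `mean_point`, and `kmeans` are hypothetical, not part of the notebook.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, initial_centers, iterations=10):
    centers = list(initial_centers)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [mean_point(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels

# Toy data (assumed): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, initial_centers=[points[0], points[3]])
print(centers)
print(labels)
```

No label column is needed at any point: the groups come only from distances between feature vectors, which is why this family of algorithms fits an unlabeled dataset.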

## Dataset description

**Dataset source**
I selected a public dataset that reflects real measurements of urban traffic infrastructure.

Dataset name: Chicago Average Daily Traffic Counts

Source: Kaggle Public Repository

URL: https://www.kaggle.com/datasets/chicago/chicago-average-daily-traffic-counts

Main file: average-daily-traffic-counts.csv

**Dataset size**

Number of rows: 1,280 records

Number of columns: 15 columns

File size: ~246 KB

**Number of dimensions**

1. LATITUDE (double)
2. LONGITUDE (double)
3. AVG_TRAFFIC_VOLUME (double, derived by averaging TOTAL_VEHICLE_VOLUME)

- All are continuous numeric variables
- They do not require encoding (they are already numeric)
- They will be standardized with StandardScaler so that all contribute equally
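
To illustrate why standardization matters here, the following minimal sketch applies a z-score transformation in plain Python. The coordinates and volumes are made-up illustrative values, not rows from the dataset, and `standardize` is a hypothetical helper. Spark's `StandardScaler` with `withMean=True` and `withStd=True` applies the equivalent per-feature operation.

```python
from statistics import mean, stdev

def standardize(column):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(column)
    sigma = stdev(column)  # sample (n - 1) standard deviation
    return [(x - mu) / sigma for x in column]

# Illustrative values only (assumed, not taken from the dataset).
latitudes = [41.65, 41.78, 41.88, 41.97]
volumes = [4200.0, 15300.0, 27800.0, 61500.0]

z_lat = standardize(latitudes)
z_vol = standardize(volumes)

# The raw ranges differ by several orders of magnitude, so volume would
# dominate any Euclidean distance computed on unscaled features.
print(max(latitudes) - min(latitudes), max(volumes) - min(volumes))

# After scaling, each column has mean 0 and standard deviation 1.
print([round(z, 3) for z in z_lat])
print([round(z, 3) for z in z_vol])
```

Because K-Means relies on Euclidean distance, leaving the features unscaled would let the volume column decide almost every cluster assignment on its own.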

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ML: K-means") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")



In [6]:
from auragutierrez.spark_utils import SparkUtils
from pyspark.sql.functions import col, to_timestamp

traffic_schema_columns = [
 ("ID", "int"),
 ("LOCATION_ADDRESS", "string"),
 ("STREET_NAME", "string"),
 ("DATE_OF_COUNT", "string"),
 ("TOTAL_VEHICLE_VOLUME", "int"),
 ("DIRECTIONAL_VOLUME", "string"),
 ("LATITUDE", "double"),
 ("LONGITUDE", "double"),
 ("LOCATION_RAW", "string"),
 ("BOUNDARIES_ZIP_CODES", "int"),
 ("COMMUNITY_AREAS", "int"),
 ("ZIP_CODES", "int"),
 ("CENSUS_TRACTS", "int"),
 ("WARDS", "int"),
 ("HISTORICAL_WARDS_2003_2015", "int")
]

traffic_schema = SparkUtils.generate_schema(traffic_schema_columns)
base_path = "/opt/spark/work-dir/data/"
df = spark.read \
 .option("header", "true") \
 .schema(traffic_schema) \
 .csv(base_path + "Traffic/")  # base_path already ends with "/"

In [13]:
from pyspark.sql.functions import col, to_timestamp, lit, upper, trim, hour, dayofweek, when, avg, count

df = df.select(
 "ID",
 "STREET_NAME",
 "DATE_OF_COUNT",
 "TOTAL_VEHICLE_VOLUME",
 "LATITUDE",
 "LONGITUDE")

df_transformed = df.withColumn(
 "COUNT_TIMESTAMP",
     to_timestamp(col("DATE_OF_COUNT"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
).drop("DATE_OF_COUNT")

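# Drop rows missing any value that is later used as a feature or timestamp
df_clean = df_transformed.filter(
 (col("TOTAL_VEHICLE_VOLUME").isNotNull()) &
 (col("TOTAL_VEHICLE_VOLUME") > 0) &
 (col("LATITUDE").isNotNull()) &
 (col("LONGITUDE").isNotNull()) &
 (col("COUNT_TIMESTAMP").isNotNull()))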

df_features = df_clean.withColumn(
 "HOUR_OF_DAY",
 hour(col("COUNT_TIMESTAMP"))
).withColumn(
 "DAY_TYPE",
 when(dayofweek(col("COUNT_TIMESTAMP")).isin(1, 7), lit("Weekend"))
 .otherwise(lit("WeekDay")))

df_aggregated = df_features.groupBy("LATITUDE", "LONGITUDE", "STREET_NAME", "HOUR_OF_DAY").agg(
    avg("TOTAL_VEHICLE_VOLUME").alias("AVG_TRAFFIC_VOLUME")
)

## ML Training process

In [8]:
from pyspark.ml.feature import VectorAssembler, StandardScaler

#Assemble the features into a single vector column
feature_columns = ["LATITUDE", "LONGITUDE", "AVG_TRAFFIC_VOLUME"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features_raw" )

assembled_df = assembler.transform(df_aggregated)

In [9]:
# Scale feature
scaler = StandardScaler(inputCol="features_raw",outputCol="features",
    withMean=True,  # Center the data
    withStd=True    # Scale to unit variance
)

scaler_model = scaler.fit(assembled_df)
df_scaled = scaler_model.transform(assembled_df)

In [16]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Test different values of K
k_values = range(2, 11)  # Test K from 2 to 10
silhouette_scores = []

# Reuse a single evaluator for every value of K
evaluator = ClusteringEvaluator()

for k in k_values:
    print(f"   Training with K={k}...", end=" ")

    # Train K-Means
    kmeans = KMeans().setK(k).setSeed(13)
    model = kmeans.fit(df_scaled)
    predictions = model.transform(df_scaled)

    # Calculate Silhouette Score
    silhouette = evaluator.evaluate(predictions)
    silhouette_scores.append(silhouette)

    print(f"Silhouette={silhouette:.4f}")

   Training with K=2... Silhouette=0.4784
   Training with K=3... Silhouette=0.4058
   Training with K=4... Silhouette=0.4612
   Training with K=5... Silhouette=0.4846
   Training with K=6... Silhouette=0.5001
   Training with K=7... Silhouette=0.4602
   Training with K=8... Silhouette=0.4467
   Training with K=9... Silhouette=0.4522
   Training with K=10... Silhouette=0.4005


In [9]:
optimal_k_idx = silhouette_scores.index(max(silhouette_scores))
optimal_k = list(k_values)[optimal_k_idx] 

#print(optimal_k)
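
As a quick check of the selection logic, this self-contained sketch reproduces it with the silhouette scores printed by the search loop (rounded to four decimals, as in that output). The dictionary `scores_by_k` and the name `best_k` are only a convenient restatement for illustration, not part of the original notebook.

```python
# Silhouette scores copied from the search output (K = 2..10).
scores_by_k = {
    2: 0.4784, 3: 0.4058, 4: 0.4612, 5: 0.4846, 6: 0.5001,
    7: 0.4602, 8: 0.4467, 9: 0.4522, 10: 0.4005,
}

# Pick the K whose silhouette score is highest.
best_k = max(scores_by_k, key=scores_by_k.get)
print(best_k, scores_by_k[best_k])
```

With these scores the maximum falls at K=6, which is consistent with the six cluster centers printed in the evaluation section.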

In [17]:
# Final model with optimal k
kmeans = KMeans().setK(optimal_k).setSeed(13)

model = kmeans.fit(df_scaled)
print("K-means model trained successfully")

#Save model
kmeans_model_path = "/opt/spark/work-dir/data/mlmodels/kmeans/traffic"
model.write().overwrite().save(kmeans_model_path)
model.__class__

K-means model trained successfully

pyspark.ml.clustering.KMeansModel

In [10]:
# Load the model
from pyspark.ml.clustering import KMeansModel

kmeans_model_path = "/opt/spark/work-dir/data/mlmodels/kmeans/traffic"

k_model = KMeansModel.load(kmeans_model_path)
df_predictions = k_model.transform(df_scaled)

## ML Evaluation

In [12]:
from pyspark.ml.evaluation import ClusteringEvaluator
# Calculate Silhouette Score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(df_predictions)

print(f"   - Final Silhouette Score: {silhouette}")

# Show the result
print("Cluster Centers: ")
for center in k_model.clusterCenters():
    print(center)

   - Final Silhouette Score: 0.5001202059880351
Cluster Centers: 
[-1.21350763  1.01948177 -0.49385911]
[ 1.0251249  -1.50842142  0.14296192]
[-0.71071943  0.31861707  3.95264883]
[-0.96098644 -0.30259012  0.58631197]
[ 0.40776896  0.20764467 -0.52359891]
[ 0.81829538 -0.13535195  1.15266818]


**Interpretation**
- Critical cluster (Cluster 2): Requires immediate attention, such as traffic-light optimization, reversible lanes, or alternate-route signage. Its scaled volume center (about 3.95) is far above that of every other cluster.

- Clusters 3 and 5 (main corridors): Need constant monitoring and medium-term investment, since their scaled volume centers (about 0.59 and 1.15) are clearly above average.

- Clusters 0, 1, and 4 (average/low traffic): Can serve as detour routes or as zones where traffic interventions can be lighter (e.g., parking management).
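
The grouping above can be cross-checked against the printed centers. The sketch below ranks the clusters by the third component of each center, which corresponds to `AVG_TRAFFIC_VOLUME` given the feature order `["LATITUDE", "LONGITUDE", "AVG_TRAFFIC_VOLUME"]`. Because the features were standardized, these values are expressed in standard deviations from the mean rather than in vehicles. The `centers` list is copied from the evaluation output; `VOLUME_INDEX` and `ranking` are illustrative names.

```python
# Cluster centers copied from the evaluation output (scaled units).
centers = [
    [-1.21350763, 1.01948177, -0.49385911],
    [1.0251249, -1.50842142, 0.14296192],
    [-0.71071943, 0.31861707, 3.95264883],
    [-0.96098644, -0.30259012, 0.58631197],
    [0.40776896, 0.20764467, -0.52359891],
    [0.81829538, -0.13535195, 1.15266818],
]

VOLUME_INDEX = 2  # position of AVG_TRAFFIC_VOLUME in the feature vector

# Sort cluster ids from highest to lowest scaled volume.
ranking = sorted(range(len(centers)),
                 key=lambda i: centers[i][VOLUME_INDEX],
                 reverse=True)

for cluster_id in ranking:
    print(cluster_id, round(centers[cluster_id][VOLUME_INDEX], 2))
```

The ranking places Cluster 2 first, followed by Clusters 5 and 3, with Clusters 1, 0, and 4 near or below the mean.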

In [24]:
sc.stop()