# 🎓 **Master's in Applied Artificial Intelligence**

## 📈 **Course: Big Data Analysis (Group 10)**

### 🏛️ Tecnológico de Monterrey

#### 👨‍🏫 **Lead professor:** Dr. Iván Olmos Pineda
#### 👩‍🏫 **Assistant professor:** Verónica Sandra Guzmán de Valle

### 📊 **Activity 3 | Supervised and unsupervised learning**

#### 📅 **May 22, 2025**

🧑‍💻 **A01016093:** Oscar Enrique García García 

# 1. Theoretical introduction

Machine learning models fall into two main categories: supervised learning and unsupervised learning.

## 🕵️ Supervised Learning

In supervised learning, the model is trained on a labeled dataset: every input record comes with its desired output. The goal is to learn a function that correctly predicts the output from the input variables.

Supervised learning covers two kinds of problems:

- Classification: predicting a category (for example, detecting whether a transaction is fraudulent).
- Regression: predicting a continuous value (for example, the price of a house).
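The two problem types above can be illustrated with a minimal pure-Python sketch. It is not part of the notebook's pipeline: the toy data, the least-squares line, and the nearest-centroid classifier are all assumptions chosen only to show what "learning a function from labeled examples" looks like.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on labeled pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def fit_centroids(samples):
    """Nearest-centroid classifier: store the mean input value per label."""
    totals, counts = {}, {}
    for x, label in samples:
        totals[label] = totals.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


def classify(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Regression: hypothetical (size, price) pairs generated by price = 2 * size + 1
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predicted_price = slope * 5.0 + intercept

# Classification: hypothetical (amount, label) transactions
centroids = fit_centroids([(10.0, "legit"), (20.0, "legit"),
                           (900.0, "fraud"), (1100.0, "fraud")])
predicted_label = classify(centroids, 950.0)

print(predicted_price, predicted_label)
```

In both cases the labels drive the fit: the regression output is a number, while the classifier returns one of the categories seen during training.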

### Representative algorithms:

- Logistic regression: LogisticRegression
- Linear regression: LinearRegression
- Decision trees: DecisionTreeClassifier / DecisionTreeRegressor
- Random forest: RandomForestClassifier / RandomForestRegressor
- Gradient-boosted trees: GBTClassifier / GBTRegressor
- Naive Bayes (probabilistic classification): NaiveBayes


## 🙈 Unsupervised Learning

Unsupervised learning is used when the data has no labels. The goal is to find hidden patterns, relationships, or structure in the data.

One of the main problems addressed by unsupervised learning is:

- Clustering: grouping similar objects (for example, customer segmentation).

### Representative algorithms:

- K-means: KMeans
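As a rough illustration of what K-means does, the following pure-Python sketch runs the classic assign-and-update loop (Lloyd's algorithm) on toy 2D points. It is a teaching sketch under stated assumptions, not Spark's implementation: the initial centroids are passed in explicitly (Spark chooses them with its own initialization strategy), and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]))
            for p in points]


def kmeans(points, initial_centroids, max_iterations=20):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # An empty cluster keeps its previous centroid
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Two well-separated toy groups; both initial centroids start in the first group
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (0.0, 1.0)])
print(centroids, labels)
```

Even with a poor starting position, the update step pulls one centroid toward the second group after the first iteration, and the assignment then settles on the two natural clusters.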

## PySpark MLlib

- The original machine learning library in PySpark (package pyspark.mllib).
- Built on RDDs (Resilient Distributed Datasets).

## PySpark ML

- The library currently recommended for machine learning in Spark (package pyspark.ml).
- Built on DataFrames (tabular like pandas DataFrames, but distributed and lazily evaluated).
- Better integration with SQL.
- Better optimization and parallelization.


# 2. Import libraries

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [3]:
spark = SparkSession.builder \
    .appName("Aprendizaje Supervisado y No Supervisado") \
    .getOrCreate()




In [4]:
df = spark.read.csv("/Users/oscgarcia/Documents/MNA/Análisis de Grandes Volúmenes de Datos/US_Accidents_Dec19.csv", header=True, inferSchema=True)
df.show(5)



+---+-------+--------+-------------------+-------------------+-----------------+------------------+-------+-------+------------+--------------------+--------------------+------------+----------+-----+----------+-------+----------+------------+-------------------+--------------+-------------+-----------+------------+--------------+--------------+---------------+-----------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+------------+--------------+--------------+-----------------+---------------------+
| ID| Source|Severity|         Start_Time|           End_Time|        Start_Lat|         Start_Lng|End_Lat|End_Lng|Distance(mi)|         Description|              Street|        City|    County|State|   Zipcode|Country|  Timezone|Airport_Code|  Weather_Timestamp|Temperature(F)|Wind_Chill(F)|Humidity(%)|Pressure(in)|Visibility(mi)|Wind_Direction|Wind_Speed(mph)|Precipitation(in)|Weather_Condition|Ameni

In [12]:
cols_to_select = ["Temperature(F)", "Weather_Condition", "Severity",
                  "Humidity(%)", "Pressure(in)", "Wind_Direction",
                  "Wind_Speed(mph)", "Precipitation(in)"]

# Keep only the weather-related columns and the Severity target
df = df.select(cols_to_select)
df.show()

+--------------+-----------------+--------+-----------+------------+--------------+---------------+-----------------+
|Temperature(F)|Weather_Condition|Severity|Humidity(%)|Pressure(in)|Wind_Direction|Wind_Speed(mph)|Precipitation(in)|
+--------------+-----------------+--------+-----------+------------+--------------+---------------+-----------------+
|          36.9|       Light Rain|       3|       91.0|       29.68|          Calm|           NULL|             0.02|
|          37.9|       Light Rain|       2|      100.0|       29.65|          Calm|           NULL|              0.0|
|          36.0|         Overcast|       2|      100.0|       29.67|            SW|            3.5|             NULL|
|          35.1|    Mostly Cloudy|       3|       96.0|       29.64|            SW|            4.6|             NULL|
|          36.0|    Mostly Cloudy|       2|       89.0|       29.65|            SW|            3.5|             NULL|
|          37.9|       Light Rain|       3|       97.0| 

In [17]:
reglas_particionamiento = [
    {"Weather_Condition": "Fair", "Severity": 2},
    {"Weather_Condition": "Mostly Cloudy", "Severity": 2},
    {"Weather_Condition": "Cloudy", "Severity": 2},
    {"Weather_Condition": "Partly Cloudy", "Severity": 2},
    {"Weather_Condition": "Clear", "Severity": 2},
    {"Weather_Condition": "Light Rain", "Severity": 2},
    {"Weather_Condition": "Overcast", "Severity": 2},
    {"Weather_Condition": "Clear", "Severity": 3},
    {"Weather_Condition": "Fair", "Severity": 3},
    {"Weather_Condition": "Mostly Cloudy", "Severity": 3},
]

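muestras = []
tamaño_muestra_por_particion = 10000

# Build one subsample per (Weather_Condition, Severity) rule
for regla in reglas_particionamiento:
    condicion_clima = regla["Weather_Condition"]
    severidad = regla["Severity"]

    # limit() keeps the first rows Spark returns for the filter;
    # it is not a random sample of the matching rows
    df_filtrado = df.filter(
        (col("Weather_Condition") == condicion_clima) &
        (col("Severity") == severidad)
    ).limit(tamaño_muestra_por_particion)

    muestras.append(df_filtrado)

# Union the per-rule subsamples into the final sample M
df_muestra_M = muestras[0]
for df_temp in muestras[1:]:
    df_muestra_M = df_muestra_M.union(df_temp)

df_muestra_M.show(5)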


+--------------+-----------------+--------+-----------+------------+--------------+---------------+-----------------+
|Temperature(F)|Weather_Condition|Severity|Humidity(%)|Pressure(in)|Wind_Direction|Wind_Speed(mph)|Precipitation(in)|
+--------------+-----------------+--------+-----------+------------+--------------+---------------+-----------------+
|          87.0|             Fair|       2|       35.0|       29.11|           SSE|            8.0|              0.0|
|          94.0|             Fair|       2|       31.0|       29.08|           SSE|            9.0|              0.0|
|          85.0|             Fair|       2|       44.0|       29.07|             S|           10.0|              0.0|
|          85.0|             Fair|       2|       44.0|       29.07|             S|           10.0|              0.0|
|          85.0|             Fair|       2|       44.0|       29.07|             S|           10.0|              0.0|
+--------------+-----------------+--------+-----------+-
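The sampling cell above takes up to 10,000 rows per (Weather_Condition, Severity) pair with `limit()`, which returns the first matching rows rather than a random draw. As a hedged alternative, the sketch below shows random per-stratum sampling in plain Python. The function name `stratified_sample`, the toy rows, and the seed are assumptions for illustration. At Spark scale, `DataFrame.sampleBy` (fraction-based, keyed on a single column) may be an option worth evaluating.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, per_stratum, seed=42):
    """Draw up to per_stratum random rows from every stratum defined by key(row)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    # Sorted strata keep the output order deterministic for a given seed
    for stratum in sorted(groups):
        members = groups[stratum]
        size = min(per_stratum, len(members))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical rows: (Weather_Condition, Severity, row id)
rows = ([("Fair", 2, i) for i in range(50)] +
        [("Clear", 3, i) for i in range(3)])
sample = stratified_sample(rows, key=lambda r: (r[0], r[1]), per_stratum=10)
print(len(sample))
```

A stratum with fewer rows than the requested size contributes everything it has, which mirrors how `limit()` behaves when a filter matches fewer than 10,000 rows.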

In [14]:
df_muestra_M.count()

                                                                                

100000

In [18]:
# Drop rows with a null in any of the selected columns
df_muestra_M = df_muestra_M.dropna()

# Convert categorical columns to numeric indices
indexers = [
    StringIndexer(inputCol="Weather_Condition", outputCol="Weather_Condition_idx"),
    StringIndexer(inputCol="Wind_Direction", outputCol="Wind_Direction_idx"),
]

for indexer in indexers:
    df_muestra_M = indexer.fit(df_muestra_M).transform(df_muestra_M)

# Assemble the feature vector (more columns can be added).
# NOTE: "Severity" is also the label of the classifier trained below,
# so including it here leaks the target into the features.
assembler = VectorAssembler(
    inputCols=["Temperature(F)", "Weather_Condition_idx", "Severity",
               "Humidity(%)", "Pressure(in)", "Wind_Direction_idx",
               "Wind_Speed(mph)", "Precipitation(in)"],
    outputCol="features"
)

df_preprocesado = assembler.transform(df_muestra_M)
df_preprocesado.show(5)




+--------------+-----------------+--------+-----------+------------+--------------+---------------+-----------------+---------------------+------------------+--------------------+
|Temperature(F)|Weather_Condition|Severity|Humidity(%)|Pressure(in)|Wind_Direction|Wind_Speed(mph)|Precipitation(in)|Weather_Condition_idx|Wind_Direction_idx|            features|
+--------------+-----------------+--------+-----------+------------+--------------+---------------+-----------------+---------------------+------------------+--------------------+
|          87.0|             Fair|       2|       35.0|       29.11|           SSE|            8.0|              0.0|                  0.0|               2.0|[87.0,0.0,2.0,35....|
|          94.0|             Fair|       2|       31.0|       29.08|           SSE|            9.0|              0.0|                  0.0|               2.0|[94.0,0.0,2.0,31....|
|          85.0|             Fair|       2|       44.0|       29.07|             S|           10.0| 
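To make the `Weather_Condition_idx` and `Wind_Direction_idx` columns easier to interpret, here is a small pure-Python emulation of how `StringIndexer` assigns indices with its default ordering (descending frequency, so the most frequent label gets 0.0). The helper name `fit_string_index` and the toy values are hypothetical, and the alphabetical tie-break reflects the documented Spark 3.x behavior as an assumption about the version in use.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


weather = ["Fair", "Clear", "Fair", "Cloudy", "Clear", "Fair"]
mapping = fit_string_index(weather)
indexed = [mapping[w] for w in weather]
print(mapping, indexed)
```

These indices encode frequency rank, not any physical ordering of weather conditions or wind directions. Distance-based methods such as K-means treat them as ordinary numbers, so a one-hot encoding is a common alternative to consider.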

                                                                                

In [19]:
train_data, test_data = df_preprocesado.randomSplit([0.8, 0.2], seed=42)
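`randomSplit` assigns rows to splits by random draws, so the resulting sizes only approximate the requested 80/20 weights. The sketch below imitates that idea in plain Python. It is an analogy rather than Spark's actual implementation, and the helper name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign every row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds of each split on the [0, 1) interval
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also absorbs any floating-point rounding gap
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(list(range(10000)), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Every row lands in exactly one split, and fixing the seed makes the partition repeatable for the same input order.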

In [20]:
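# Train a random forest that predicts Severity from the assembled features
modelo_rf = RandomForestClassifier(labelCol="Severity", featuresCol="features")
modelo_entrenado = modelo_rf.fit(train_data)
predicciones = modelo_entrenado.transform(test_data)

# metricName="accuracy" is the fraction of correctly classified rows,
# so the value printed as "Precisión" is accuracy, not per-class precision
evaluador = MulticlassClassificationEvaluator(labelCol="Severity", predictionCol="prediction", metricName="accuracy")
precision = evaluador.evaluate(predicciones)
print(f"Precisión: {precision}")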




Precisión: 1.0
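An accuracy of exactly 1.0 most likely reflects label leakage rather than a genuinely perfect model: `Severity` is both the `labelCol` of the classifier and one of the `inputCols` passed to `VectorAssembler`, so the trees can read the answer directly from the features. A minimal sketch of a guard follows. The helper names `feature_columns` and `check_no_leakage` are hypothetical, and the accuracy obtained after removing the label is not known here.

```python
LABEL_COL = "Severity"

# Same columns the notebook passes to VectorAssembler
candidate_cols = ["Temperature(F)", "Weather_Condition_idx", "Severity",
                  "Humidity(%)", "Pressure(in)", "Wind_Direction_idx",
                  "Wind_Speed(mph)", "Precipitation(in)"]


def feature_columns(candidates, label_col):
    """Return the candidate columns without the label column."""
    return [c for c in candidates if c != label_col]


def check_no_leakage(features, label_col):
    """Fail loudly if the label is still among the features."""
    if label_col in features:
        raise ValueError(f"label column {label_col!r} found in features")


safe_features = feature_columns(candidate_cols, LABEL_COL)
check_no_leakage(safe_features, LABEL_COL)
print(safe_features)
```

The resulting list could be passed as `inputCols` to a separate `VectorAssembler` used only for the classifier, while the clustering step may keep its own feature set.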


                                                                                

In [21]:
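# Cluster all preprocessed rows into k=3 groups (no explicit seed is passed)
modelo_kmeans = KMeans(k=3, featuresCol="features")
modelo_kmeans_entrenado = modelo_kmeans.fit(df_preprocesado)
clusters = modelo_kmeans_entrenado.transform(df_preprocesado)

# ClusteringEvaluator defaults to the silhouette metric with squared Euclidean distance
evaluador_cluster = ClusteringEvaluator()
silhouette = evaluador_cluster.evaluate(clusters)
print(f"Silhouette Score: {silhouette}")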


Silhouette Score: 0.48605629359221414
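The silhouette score ranges from -1 to 1: values near 1 mean points sit much closer to their own cluster than to the nearest other cluster, and values near 0 mean the clusters overlap. The brute-force sketch below computes it with squared Euclidean distance, the default distance measure of `ClusteringEvaluator`. It is an illustration on toy data under that assumption, not Spark's distributed implementation, and it expects at least two clusters.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (requires >= 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a singleton cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the zero self-distance is in the sum, hence the n - 1 divisor)
        a = sum(squared_distance(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(squared_distance(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # groups mixed together
print(good, bad)
```

Because the notebook's feature vector mixes unscaled columns with different ranges (for example humidity in percent and pressure in inches), the columns with larger ranges dominate these distances, which is worth keeping in mind when reading the score above.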


                                                                                