
# SCALABLE MACHINE LEARNING

A cosmetics store wants to build an intelligent system that classifies skin care products into categories based on their characteristics. The classification will help recommend suitable products to customers according to their skin type.

The store provides a dataset with product information, including:
- Key ingredients (`Ingredientes`), such as hyaluronic acid, retinol, vitamin C, etc.
- Hydration level (`Hidratación`)
- Absorption level (`Absorción`)
- Sun protection factor (`SPF`)
- Recommended skin type (`Tipo de Piel`): dry, oily, combination, or sensitive

The goal is to train a classification model with MLlib that predicts the recommended skin type for each product.

Important:
The `Tipo de Piel` column must be converted to numeric values:
- Seco (dry) → 0
- Graso (oily) → 1
- Mixto (combination) → 2
- Sensible (sensitive) → 3
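
The mapping above can be written down as a small lookup in plain Python before moving it into Spark. This is only a sketch: the names `SKIN_TYPE_TO_LABEL` and `encode_skin_type` are hypothetical and do not appear in the notebook.

```python
# Label encoding required by the task; both names below are illustrative.
SKIN_TYPE_TO_LABEL = {"Seco": 0, "Graso": 1, "Mixto": 2, "Sensible": 3}


def encode_skin_type(value: str) -> int:
    """Return the numeric label for a skin type, rejecting unknown values."""
    try:
        return SKIN_TYPE_TO_LABEL[value.strip()]
    except KeyError:
        raise ValueError(f"Unknown skin type: {value!r}") from None


print([encode_skin_type(v) for v in ["Seco", "Graso", "Mixto", "Sensible"]])
# prints [0, 1, 2, 3]
```

Raising on unknown values, instead of falling back to a default label, makes typos in the source data visible early.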

In [None]:
from google.colab import drive
drive.mount('/content/drive')

ruta_archivos = '/content/drive/MyDrive/bootcamp_ciencia_de_datos/evaluaciones/archivos_spark/'

Mounted at /content/drive


##**1. Data loading and exploration**

In [None]:
from pyspark.sql import SparkSession
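# import only what the notebook uses; a wildcard import would shadow built-ins such as sum and max
from pyspark.sql.functions import col, when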

In [None]:
ss = SparkSession.builder.appName('skincare').getOrCreate()
ss

In [None]:
df = ss.read.csv(ruta_archivos + 'skincare_products.csv', header=True, inferSchema=True)
df.show(truncate=False)

+--------------------+-----------+---------+---+------------+
|Ingredientes        |Hidratación|Absorción|SPF|Tipo de Piel|
+--------------------+-----------+---------+---+------------+
|Ácido Hialurónico   |Alto       |Medio    |0  |Seco        |
|Retinol             |Bajo       |Alto     |0  |Graso       |
|Vitamina C          |Medio      |Medio    |30 |Mixto       |
|Aloe Vera           |Alto       |Bajo     |15 |Sensible    |
|Niacinamida         |Medio      |Medio    |0  |Mixto       |
|Ceramidas           |Alto       |Bajo     |0  |Seco        |
|Ácido Salicílico    |Bajo       |Alto     |0  |Graso       |
|Centella Asiática   |Medio      |Medio    |20 |Sensible    |
|Extracto de Té Verde|Medio      |Alto     |0  |Mixto       |
|Manteca de Karité   |Alto       |Bajo     |0  |Seco        |
|Extracto de Regaliz |Medio      |Medio    |15 |Sensible    |
|Vitamina E          |Alto       |Medio    |25 |Seco        |
|Bakuchiol           |Bajo       |Medio    |0  |Mixto       |
|Ácido K

In [None]:
df.printSchema()

root
 |-- Ingredientes: string (nullable = true)
 |-- Hidratación: string (nullable = true)
 |-- Absorción: string (nullable = true)
 |-- SPF: integer (nullable = true)
 |-- Tipo de Piel: string (nullable = true)



In [None]:
col_num = [c[0] for c in df.dtypes if c[1] in ['int', 'bigint', 'double', 'float']]
df.select(col_num).describe().show(truncate=False)

+-------+-----------------+
|summary|SPF              |
+-------+-----------------+
|count  |20               |
|mean   |7.5              |
|stddev |10.19545822516343|
|min    |0                |
|max    |30               |
+-------+-----------------+
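
The `stddev` row reported by `describe()` is the sample standard deviation (n − 1 in the denominator). As a quick sanity check, the summary can be reproduced in plain Python, assuming the 20 SPF values copied by hand from the tables printed in this notebook:

```python
import statistics

# SPF values of the 20 products, in the order they are listed in the notebook
spf = [0, 0, 30, 15, 0, 0, 0, 20, 0, 0, 15, 25, 0, 0, 10, 20, 0, 0, 0, 15]

print(len(spf))                # 20
print(statistics.mean(spf))    # 7.5
print(statistics.stdev(spf))   # sample standard deviation, about 10.1955
print(min(spf), max(spf))      # 0 30
```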



##**2. Data preprocessing**

In [None]:
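# Encode the target: Seco → 0, Graso → 1, Mixto → 2, Sensible → 3.
# otherwise(3) assumes every remaining row is 'Sensible'.
df = (
    df.withColumn(
        "Tipo de Piel",
        when(col("Tipo de Piel")=='Seco', 0)
        .when(col("Tipo de Piel")=='Graso', 1)
        .when(col("Tipo de Piel")=='Mixto', 2)
        .otherwise(3)
    )
)

# Ordinal encoding of the hydration level: Bajo → 0, Medio → 1, Alto → 2 (via otherwise).
df = (
    df.withColumn(
        "Hidratación",
        when(col("Hidratación")=='Bajo', 0)
        .when(col("Hidratación")=='Medio', 1)
        .otherwise(2)
    )
)

# Same ordinal encoding for the absorption level.
df = (
    df.withColumn(
        "Absorción",
        when(col("Absorción")=='Bajo', 0)
        .when(col("Absorción")=='Medio', 1)
        .otherwise(2)
    )
)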

df.show(truncate=False)

+--------------------+-----------+---------+---+------------+
|Ingredientes        |Hidratación|Absorción|SPF|Tipo de Piel|
+--------------------+-----------+---------+---+------------+
|Ácido Hialurónico   |2          |1        |0  |0           |
|Retinol             |0          |2        |0  |1           |
|Vitamina C          |1          |1        |30 |2           |
|Aloe Vera           |2          |0        |15 |3           |
|Niacinamida         |1          |1        |0  |2           |
|Ceramidas           |2          |0        |0  |0           |
|Ácido Salicílico    |0          |2        |0  |1           |
|Centella Asiática   |1          |1        |20 |3           |
|Extracto de Té Verde|1          |2        |0  |2           |
|Manteca de Karité   |2          |0        |0  |0           |
|Extracto de Regaliz |1          |1        |15 |3           |
|Vitamina E          |2          |1        |25 |0           |
|Bakuchiol           |0          |1        |0  |2           |
|Ácido K

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.stat import Correlation

In [None]:
assembler = VectorAssembler(inputCols=['Hidratación', 'Absorción', 'SPF'], outputCol='atributos')
df_train = assembler.transform(df).select('atributos', 'Tipo de Piel')

df_train.show(truncate=False)

+--------------+------------+
|atributos     |Tipo de Piel|
+--------------+------------+
|[2.0,1.0,0.0] |0           |
|[0.0,2.0,0.0] |1           |
|[1.0,1.0,30.0]|2           |
|[2.0,0.0,15.0]|3           |
|[1.0,1.0,0.0] |2           |
|[2.0,0.0,0.0] |0           |
|[0.0,2.0,0.0] |1           |
|[1.0,1.0,20.0]|3           |
|[1.0,2.0,0.0] |2           |
|[2.0,0.0,0.0] |0           |
|[1.0,1.0,15.0]|3           |
|[2.0,1.0,25.0]|0           |
|[0.0,1.0,0.0] |2           |
|[1.0,2.0,0.0] |1           |
|[2.0,0.0,10.0]|3           |
|[1.0,1.0,20.0]|2           |
|[0.0,2.0,0.0] |1           |
|[2.0,1.0,0.0] |0           |
|[1.0,2.0,0.0] |2           |
|[2.0,0.0,15.0]|3           |
+--------------+------------+
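
Some rows in the table above share the same feature vector but carry different labels, which caps the accuracy any classifier can reach with these three features. A minimal plain-Python sketch of that check follows, assuming the 20 rows are copied by hand from the table; the variable names are illustrative.

```python
from collections import Counter, defaultdict

# (Hidratación, Absorción, SPF) -> label, copied from the assembled table above
rows = [
    ((2, 1, 0), 0), ((0, 2, 0), 1), ((1, 1, 30), 2), ((2, 0, 15), 3),
    ((1, 1, 0), 2), ((2, 0, 0), 0), ((0, 2, 0), 1), ((1, 1, 20), 3),
    ((1, 2, 0), 2), ((2, 0, 0), 0), ((1, 1, 15), 3), ((2, 1, 25), 0),
    ((0, 1, 0), 2), ((1, 2, 0), 1), ((2, 0, 10), 3), ((1, 1, 20), 2),
    ((0, 2, 0), 1), ((2, 1, 0), 0), ((1, 2, 0), 2), ((2, 0, 15), 3),
]

# Count the labels observed for each distinct feature vector
labels_by_features = defaultdict(Counter)
for features, label in rows:
    labels_by_features[features][label] += 1

# Feature vectors that appear with more than one label
conflicts = {f: dict(c) for f, c in labels_by_features.items() if len(c) > 1}
print(conflicts)
# prints {(1, 1, 20): {3: 1, 2: 1}, (1, 2, 0): {2: 2, 1: 1}}

# Best case: always predict the majority label of each feature vector
best_correct = sum(max(c.values()) for c in labels_by_features.values())
print(f"accuracy ceiling on these 20 rows: {best_correct / len(rows):.0%}")
# prints accuracy ceiling on these 20 rows: 90%
```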



##**3. Data split and model training**

In [None]:
# train/test split (80/20) with a fixed seed for reproducibility
train, test = df_train.randomSplit([0.8, 0.2], seed=42)

print('Train set preview:')
train.show(truncate=False)

print('\nTest set preview:')
test.show(truncate=False)

Train set preview:
+--------------+------------+
|atributos     |Tipo de Piel|
+--------------+------------+
|[0.0,1.0,0.0] |2           |
|[0.0,2.0,0.0] |1           |
|[0.0,2.0,0.0] |1           |
|[1.0,1.0,0.0] |2           |
|[1.0,1.0,15.0]|3           |
|[1.0,1.0,20.0]|3           |
|[1.0,2.0,0.0] |1           |
|[1.0,2.0,0.0] |2           |
|[1.0,2.0,0.0] |2           |
|[2.0,0.0,0.0] |0           |
|[2.0,0.0,10.0]|3           |
|[2.0,0.0,15.0]|3           |
|[2.0,0.0,15.0]|3           |
|[2.0,1.0,0.0] |0           |
|[2.0,1.0,0.0] |0           |
+--------------+------------+


Test set preview:
+--------------+------------+
|atributos     |Tipo de Piel|
+--------------+------------+
|[0.0,2.0,0.0] |1           |
|[1.0,1.0,20.0]|2           |
|[1.0,1.0,30.0]|2           |
|[2.0,0.0,0.0] |0           |
|[2.0,1.0,25.0]|0           |
+--------------+------------+



In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol='atributos', labelCol='Tipo de Piel', seed=42, maxDepth=3, maxBins=64)
dt_model = dt.fit(train)

##**4. Prediction and evaluation**

In [None]:
# predictions for the train and test sets
pred_train = dt_model.transform(train)
pred_test  = dt_model.transform(test)

print("\nPredictions for the test set:")
pred_test.select("Tipo de Piel", "prediction", "probability").show(truncate=False)


Predictions for the test set:
+------------+----------+-----------------+
|Tipo de Piel|prediction|probability      |
+------------+----------+-----------------+
|1           |1.0       |[0.0,0.6,0.4,0.0]|
|2           |3.0       |[0.0,0.0,0.0,1.0]|
|2           |3.0       |[0.0,0.0,0.0,1.0]|
|0           |0.0       |[1.0,0.0,0.0,0.0]|
|0           |3.0       |[0.0,0.0,0.0,1.0]|
+------------+----------+-----------------+
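
To see where the errors come from, the five test rows above can be tallied into a small confusion count in plain Python. This is a sketch that assumes the (label, prediction) pairs are copied by hand from the output; it does not use Spark.

```python
from collections import Counter

# (label, prediction) pairs copied from the test-set output above
pairs = [(1, 1), (2, 3), (2, 3), (0, 0), (0, 3)]

# Count how often each true label was mapped to each predicted label
confusion = Counter(pairs)
correct = sum(n for (label, pred), n in confusion.items() if label == pred)
accuracy = correct / len(pairs)

for (label, pred), n in sorted(confusion.items()):
    print(f"true={label} predicted={pred} count={n}")
print(f"accuracy = {accuracy:.2%}")
# last line printed: accuracy = 40.00%
```

All three misclassified rows are predicted as class 3 (Sensible), a class that does not appear among the true labels of this test set.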



In [None]:
# evaluate the model with accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="Tipo de Piel", predictionCol="prediction", metricName="accuracy")
accuracy_train = evaluator.evaluate(pred_train)
accuracy_test = evaluator.evaluate(pred_test)

print(f"Accuracy on the train set: {accuracy_train:.2%}")
print(f"Accuracy on the test set : {accuracy_test:.2%}")

Accuracy on the train set: 86.67%
Accuracy on the test set : 40.00%


##**5. Analysis of results and improvements**

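The model is overfitted rather than underfitted: accuracy drops from 86.67% on the train set to 40.00% on the test set. The estimate is also very noisy, because the dataset has only 20 products (15 for training, 5 for testing) and the test set contains no product labeled 3 (Sensible), even though the model predicts that class for three of the five test rows. In addition, some identical feature vectors carry different labels (for example, `[1.0,2.0,0.0]` appears with labels 1 and 2), so these three features cannot separate every product. As an improvement, cross-validation with hyperparameter tuning is proposed.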