# Preprocessing and model preparation

1. Load the clean dataset `spotify_clean_modeling.csv`.
2. Separate the predictor variables (`X`) from the target (`y = is_hit`).
3. Evaluate the models:
   a. RandomForestClassifier
   b. GradientBoostingClassifier
   c. XGBoost
   d. LightGBM
   e. LogisticRegression
   f. KNeighborsClassifier
4. Scale or normalize the numeric variables.
5. Split the data into training and test sets (`train_test_split`).
6. Save the processed data (`X_train`, `X_test`, `y_train`, `y_test`).


In [48]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from xgboost import XGBClassifier

# Path to the source file
DATA_PATH = "../data/processed/spotify_clean_modeling.csv"

# Check that the file exists
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"File not found at {DATA_PATH}")

# Load the CSV file
df = pd.read_csv(DATA_PATH)
print(f"Dataset loaded successfully into a DataFrame: {df.shape}")

display(df.columns)
display(df.head())


Dataset loaded successfully into a DataFrame: (232724, 15)


Index(['genre', 'popularity', 'acousticness', 'danceability', 'duration_ms',
       'energy', 'instrumentalness', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence', 'is_hit'],
      dtype='object')

| | genre | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | mode | speechiness | tempo | time_signature | valence | is_hit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Movie | 0 | 0.611 | 0.389 | 99373 | 0.91 | 0.0 | 0.346 | -1.828 | 1 | 0.0525 | 166.969 | 4 | 0.814 | 0 |
| 1 | Movie | 1 | 0.246 | 0.59 | 137373 | 0.737 | 0.0 | 0.151 | -5.559 | 0 | 0.0868 | 174.003 | 4 | 0.816 | 0 |
| 2 | Movie | 3 | 0.952 | 0.663 | 170267 | 0.131 | 0.0 | 0.103 | -13.879 | 0 | 0.0362 | 99.488 | 5 | 0.368 | 0 |
| 3 | Movie | 0 | 0.703 | 0.24 | 152427 | 0.326 | 0.0 | 0.0985 | -12.178 | 1 | 0.0395 | 171.758 | 4 | 0.227 | 0 |
| 4 | Movie | 4 | 0.95 | 0.331 | 82625 | 0.225 | 0.123 | 0.202 | -21.15 | 1 | 0.0456 | 140.576 | 4 | 0.39 | 0 |


## Normalizing duration_ms (milliseconds to minutes)

In [49]:
# Convert the duration from milliseconds to minutes
df["duration_min"] = df["duration_ms"] / 60000
df.drop(columns="duration_ms", inplace=True)


In [50]:
print(f"✅ Songs classified as HIT: {df['is_hit'].sum()} of {len(df)} ({df['is_hit'].mean()*100:.2f}%)")
# Correlation of each numeric column with is_hit
corr = df.corr(numeric_only=True)
corr["is_hit"].sort_values(ascending=False)

✅ Songs classified as HIT: 10544 of 232724 (4.53%)


is_hit              1.000000
popularity          0.391859
danceability        0.121247
loudness            0.108466
energy              0.061671
valence             0.037736
time_signature      0.037443
tempo               0.022765
speechiness        -0.008862
mode               -0.019692
duration_min       -0.023957
liveness           -0.043567
acousticness       -0.094851
instrumentalness   -0.097980
Name: is_hit, dtype: float64

The data is heavily imbalanced: only 4.53% of the songs are hits. A high accuracy (share of correct predictions) is therefore not enough, on its own, to judge the model.
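To make the point concrete, here is a small sketch that uses only the counts printed above (10,544 hits out of 232,724 songs). It computes the accuracy of a trivial model that always predicts "not a hit", together with the negative-to-positive ratio that the notebook later passes to XGBoost as `scale_pos_weight` (there it is computed on the training split, so the value will differ slightly). The variable names are illustrative and not part of the notebook.

```python
# Counts taken from the cell output above.
n_total = 232_724
n_hits = 10_544
n_non_hits = n_total - n_hits

# A model that always predicts 0 ("not a hit") is right on every non-hit.
baseline_accuracy = n_non_hits / n_total

# It never predicts a hit, so recall (and therefore F1) on the hit class is 0.
baseline_recall = 0 / n_hits

# Negatives per positive: the usual starting value for scale_pos_weight.
neg_pos_ratio = n_non_hits / n_hits

print(f"Baseline accuracy:       {baseline_accuracy:.4f}")
print(f"Baseline recall on hits: {baseline_recall:.4f}")
print(f"Negatives per positive:  {neg_pos_ratio:.2f}")
```

Any model has to beat roughly 95.5% accuracy just to match this do-nothing baseline, which is why the comparison below also tracks F1 and ROC-AUC.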

## Separating predictors (X) and target (y)


In [51]:

# Drop the target and popularity so the models rely only on genre and audio features
X = df.drop(columns=["is_hit", "popularity"])

y = df["is_hit"]


## Creating DataFrames to encode categorical variables and train models


In [52]:
# For tree-based models → LabelEncoder (integer codes for genre)
X_tree = X.copy()
le = LabelEncoder()
X_tree["genre"] = le.fit_transform(X_tree["genre"])

# For linear / distance-based models → OneHotEncoder
preprocessor_ohe = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"])
], remainder="passthrough")
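The outline at the top lists scaling of the numeric variables, but `preprocessor_ohe` passes the numeric columns through unchanged (`remainder="passthrough"`). Distance-based models such as KNN are sensitive to feature scale, and here `tempo` ranges from about 30 to 243 while most audio features lie between 0 and 1. The sketch below shows one possible variant that standardizes the numeric columns. The name `preprocessor_scaled` and the toy DataFrame are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same kind of columns as X (values are made up).
X_demo = pd.DataFrame({
    "genre": ["Movie", "Pop", "Rock", "Pop"],
    "tempo": [166.9, 99.5, 140.6, 120.0],
    "danceability": [0.39, 0.66, 0.33, 0.59],
})
numeric_cols = ["tempo", "danceability"]

# Hypothetical variant of preprocessor_ohe:
# one-hot encode genre and standardize the numeric columns.
preprocessor_scaled = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
        ("num", StandardScaler(), numeric_cols),
    ],
    sparse_threshold=0,  # always return a dense array, convenient for inspection
)

X_demo_t = preprocessor_scaled.fit_transform(X_demo)

# 3 one-hot genre columns followed by 2 standardized numeric columns.
print(X_demo_t.shape)
print(np.round(X_demo_t[:, 3:].mean(axis=0), 6))
```

Inside a `Pipeline`, the scaler is fitted on the training data only, so no statistics from the test set leak into the model.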

In [53]:
display(X_tree.dtypes)

X_tree.describe()




genre                 int64
acousticness        float64
danceability        float64
energy              float64
instrumentalness    float64
liveness            float64
loudness            float64
mode                  int64
speechiness         float64
tempo               float64
time_signature        int64
valence             float64
duration_min        float64
dtype: object

| | genre | acousticness | danceability | energy | instrumentalness | liveness | loudness | mode | speechiness | tempo | time_signature | valence | duration_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 | 232724.0 |
| mean | 13.62327 | 0.368562 | 0.554366 | 0.570958 | 0.148302 | 0.21501 | -9.569896 | 0.65203 | 0.120765 | 117.666494 | 3.885147 | 0.454919 | 3.918697 |
| std | 7.491218 | 0.354768 | 0.185608 | 0.263456 | 0.302769 | 0.198273 | 5.998215 | 0.476328 | 0.185519 | 30.898942 | 0.462956 | 0.260065 | 1.982265 |
| min | 0.0 | 0.0 | 0.0569 | 2e-05 | 0.0 | 0.00967 | -52.457 | 0.0 | 0.0222 | 30.379 | 0.0 | 0.0 | 0.25645 |
| 25% | 7.0 | 0.0376 | 0.435 | 0.385 | 0.0 | 0.0974 | -11.771 | 0.0 | 0.0367 | 92.959 | 4.0 | 0.237 | 3.047604 |
| 50% | 14.0 | 0.232 | 0.571 | 0.605 | 4.4e-05 | 0.128 | -7.762 | 1.0 | 0.0501 | 115.7775 | 4.0 | 0.444 | 3.673783 |
| 75% | 20.0 | 0.722 | 0.692 | 0.787 | 0.0358 | 0.264 | -5.501 | 1.0 | 0.105 | 139.0545 | 4.0 | 0.66 | 4.429467 |
| max | 26.0 | 0.996 | 0.989 | 0.999 | 0.999 | 1.0 | 3.744 | 1.0 | 0.967 | 242.903 | 5.0 | 1.0 | 92.548617 |


## Data split

### Train/test split


In [54]:
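# Both calls use the same test_size and random_state, so the rows
# (and therefore y_train / y_test) are identical in the two splits.
X_train_tree, X_test_tree, y_train, y_test = train_test_split(X_tree, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)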
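With only 4.53% positives, a purely random split can leave the training and test sets with slightly different hit rates. `train_test_split` accepts a `stratify` argument that preserves the class proportions in both parts. The sketch below illustrates this on synthetic labels; `X_demo` and `y_demo` are made-up stand-ins, not the Spotify data, and the notebook itself does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for (X, y): 10,000 rows with roughly 4.5% positives.
y_demo = (rng.random(10_000) < 0.045).astype(int)
X_demo = rng.normal(size=(10_000, 3))

# stratify=y_demo keeps the share of positives equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Hit rate overall: {y_demo.mean():.4f}")
print(f"Hit rate train:   {y_tr.mean():.4f}")
print(f"Hit rate test:    {y_te.mean():.4f}")
```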



In [55]:
X_train_tree.describe().T


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| genre | 186179.0 | 13.617218 | 7.486709 | 0.0 | 7.0 | 14.0 | 20.0 | 26.0 |
| acousticness | 186179.0 | 0.368902 | 0.354867 | 1e-06 | 0.0376 | 0.233 | 0.722 | 0.996 |
| danceability | 186179.0 | 0.554477 | 0.185612 | 0.0569 | 0.435 | 0.571 | 0.692 | 0.989 |
| energy | 186179.0 | 0.570743 | 0.263453 | 2e-05 | 0.385 | 0.605 | 0.787 | 0.999 |
| instrumentalness | 186179.0 | 0.148567 | 0.303068 | 0.0 | 0.0 | 4.5e-05 | 0.0359 | 0.999 |
| liveness | 186179.0 | 0.215131 | 0.198184 | 0.00967 | 0.0974 | 0.128 | 0.264 | 1.0 |
| loudness | 186179.0 | -9.572569 | 5.999228 | -52.457 | -11.779 | -7.762 | -5.502 | 3.744 |
| mode | 186179.0 | 0.651701 | 0.476433 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| speechiness | 186179.0 | 0.120547 | 0.18521 | 0.0222 | 0.0367 | 0.0501 | 0.105 | 0.967 |
| tempo | 186179.0 | 117.667631 | 30.906681 | 30.379 | 92.938 | 115.836 | 139.09 | 242.903 |


In [56]:
X_train_tree["genre"].unique()[:10]


array([25,  9,  6,  8, 14, 23, 17, 22, 24,  4])

## Dynamic model definition

1. Logistic Regression

Linear model.
Serves as a baseline. Fast, interpretable, and shows which variables push the probability of being a hit up or down.

2. Random Forest

Ensemble of many decision trees.
Robust, handles non-linearities, and picks up interactions between features automatically.

3. Gradient Boosting (classic scikit-learn GBM)

Builds trees sequentially, each one correcting the errors of the previous ones.
Often performs better than RandomForest, but trains more slowly.

4. XGBoost

Optimized implementation of gradient boosting with built-in regularization.
High accuracy, widely used in Kaggle competitions. Works very well on tabular datasets.

5. LightGBM

Very fast gradient boosting framework developed by Microsoft.
Works very well with large volumes of data (such as this dataset of about 230k rows).
Usually faster than XGBoost, with similar or better performance.
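One practical difference between these estimators is how they accept class weights, which matters for an imbalanced target like `is_hit`. `RandomForestClassifier`, `LogisticRegression` and `LGBMClassifier` take `class_weight`, and `XGBClassifier` takes `scale_pos_weight`. `GradientBoostingClassifier` has no such constructor parameter, but its `fit` method accepts `sample_weight`. The sketch below shows one way to pass balanced weights to it, using synthetic data rather than the Spotify dataset. Whether this would improve the scores reported in this notebook has not been tested here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (about 5% positives); a stand-in, not the real data.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# "balanced" gives every class the same total weight in the training set.
weights = compute_sample_weight("balanced", y_tr)

gb_plain = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_plain.fit(X_tr, y_tr)

gb_weighted = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_weighted.fit(X_tr, y_tr, sample_weight=weights)

print(f"Recall on positives, unweighted: {recall_score(y_te, gb_plain.predict(X_te)):.3f}")
print(f"Recall on positives, weighted:   {recall_score(y_te, gb_weighted.predict(X_te)):.3f}")
```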

In [57]:
# ===============================
# GENERAL CONFIGURATION
# ===============================

# Weight for the positive class (used by XGBoost): negatives / positives
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

# List that collects the results of all experiments
resultados_globales = []

# ===============================
# DEFINE EXPERIMENT BATCHES
# ===============================

batch_1 = {
    "RandomForest": {"n_estimators": 300, "max_depth": None, "min_samples_leaf": 2},
    "GradientBoosting": {"n_estimators": 400, "learning_rate": 0.05, "max_depth": 5},
    "XGBoost": {"n_estimators": 600, "learning_rate": 0.05, "max_depth": 6},
    "LightGBM": {"n_estimators": 600, "num_leaves": 64, "learning_rate": 0.03},
    "LogisticRegression": {"max_iter": 1000, "solver": "liblinear"},
    "KNeighbors": {"n_neighbors": 10, "weights": "distance"}
}

batch_2 = {
    "RandomForest": {"n_estimators": 800, "max_depth": 10, "min_samples_leaf": 1},
    "GradientBoosting": {"n_estimators": 800, "learning_rate": 0.02, "max_depth": 6},
    "XGBoost": {"n_estimators": 1000, "learning_rate": 0.03, "max_depth": 8},
    "LightGBM": {"n_estimators": 1000, "num_leaves": 128, "learning_rate": 0.02},
    "LogisticRegression": {"max_iter": 2000, "solver": "liblinear"},
    "KNeighbors": {"n_neighbors": 20, "weights": "uniform"}
}

# More batches (batch_3, batch_4, ...) can be added here
batches = {"Batch_1": batch_1, "Batch_2": batch_2}

# ===============================
# FUNCTION TO RUN A BATCH
# ===============================

def entrenar_batch(nombre_batch, config_batch):
    resultados = []

    # Tree-based models
    tree_models = {
        "RandomForest": RandomForestClassifier(n_jobs=-1, random_state=42, class_weight="balanced", **config_batch["RandomForest"]),
        "GradientBoosting": GradientBoostingClassifier(random_state=42, **config_batch["GradientBoosting"]),
        "XGBoost": XGBClassifier(
            n_jobs=-1,
            eval_metric="logloss",
            random_state=42,
            scale_pos_weight=pos_weight,
            **config_batch["XGBoost"]
        ),
        "LightGBM": LGBMClassifier(
            n_jobs=-1,
            random_state=42,
            class_weight="balanced",
            **config_batch["LightGBM"]
        )
    }

    for nombre, modelo in tree_models.items():
        modelo.fit(X_train_tree, y_train)
        y_pred = modelo.predict(X_test_tree)
        resultados.append({
            "Batch": nombre_batch,
            "Modelo": nombre,
            "Accuracy": accuracy_score(y_test, y_pred),
            "F1": f1_score(y_test, y_pred),
            # Note: this scores hard 0/1 predictions, so the value equals balanced accuracy
            "ROC_AUC": roc_auc_score(y_test, y_pred)
        })

    # Linear / distance-based models
    linear_models = {
        "LogisticRegression": LogisticRegression(class_weight="balanced", **config_batch["LogisticRegression"]),
        "KNeighbors": KNeighborsClassifier(n_jobs=-1, **config_batch["KNeighbors"])
    }

    for nombre, modelo in linear_models.items():
        clf = Pipeline(steps=[
            ("preprocess", preprocessor_ohe),
            ("model", modelo)
        ])
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        resultados.append({
            "Batch": nombre_batch,
            "Modelo": nombre,
            "Accuracy": accuracy_score(y_test, y_pred),
            "F1": f1_score(y_test, y_pred),
            # Note: this scores hard 0/1 predictions, so the value equals balanced accuracy
            "ROC_AUC": roc_auc_score(y_test, y_pred)
        })

    return resultados

# ===============================
# RUN ALL BATCHES
# ===============================

for nombre_batch, config in batches.items():
    resultados_globales.extend(entrenar_batch(nombre_batch, config))

# ===============================
# COMBINED RESULTS
# ===============================

df_resultados = pd.DataFrame(resultados_globales).sort_values(by=["Batch", "F1"], ascending=[True, False])
display(df_resultados)


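Note that `entrenar_batch` passes the output of `predict` to `roc_auc_score`. With hard 0/1 labels the ROC curve has a single interior point, and the resulting value is the mean of the true-positive and true-negative rates, which is balanced accuracy. ROC-AUC is more commonly computed from scores such as `predict_proba(...)[:, 1]`, so that it measures how well the model ranks hits above non-hits. The sketch below contrasts the two on synthetic data; it is an illustration, not a rerun of the experiments above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); not the Spotify dataset.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

y_pred = model.predict(X_te)               # hard 0/1 labels
y_score = model.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, y_pred)
auc_from_scores = roc_auc_score(y_te, y_score)
bal_acc = balanced_accuracy_score(y_te, y_pred)

print(f"roc_auc_score on labels: {auc_from_labels:.3f}")
print(f"balanced accuracy:       {bal_acc:.3f}")
print(f"roc_auc_score on scores: {auc_from_scores:.3f}")
```

The first two numbers coincide, which is worth keeping in mind when reading the `ROC_AUC` column in the results below.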

## Results table sorted by model and accuracy for each batch

In [None]:
df_ordenado = (
    df_resultados
        .sort_values(by=["Modelo", "Accuracy"], ascending=[True, False])
        .reset_index(drop=True)
)

display(df_ordenado)


# Interpretation of results

In [None]:
display(
    df_resultados
        .query("Modelo == 'GradientBoosting'")
        .sort_values(by="Accuracy", ascending=False)
        .reset_index(drop=True)
)

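Interpretation:

High accuracy (95%) is misleading because of the class imbalance.

Very low F1 (12-13%) → it fails to detect hits.

AUC ≈ 0.53 → barely better than chance.

👉 Conclusion: as configured here, GradientBoosting performs poorly on this problem. It is the only tree model trained without any class weighting, which likely contributes.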

In [None]:
display(
    df_resultados
        .query("Modelo == 'KNeighbors'")
        .sort_values(by="Accuracy", ascending=False)
        .reset_index(drop=True)
)

Interpretation:

Unstable behavior across batches.

Very poor F1 (especially in Batch_2).

AUC ≈ 0.51 → almost random.

👉 Conclusion: as configured here, KNN is not useful for this dataset. The numeric features reach it unscaled, which may hurt a distance-based model.

In [None]:
display(
    df_resultados
        .query("Modelo == 'LightGBM'")
        .sort_values(by="Accuracy", ascending=False)
        .reset_index(drop=True)
)

Interpretation:

F1 of 0.51 in Batch_2 → strong, given how rare the positive class is.

ROC-AUC of 0.89 → very good ability to separate hits from non-hits.

Better than every other model, with XGBoost close behind.

👉 Conclusion:
LightGBM is the best overall model (F1 and AUC).

In [None]:
display(
    df_resultados
        .query("Modelo == 'LogisticRegression'")
        .sort_values(by="Accuracy", ascending=False)
        .reset_index(drop=True)
)

Interpretation:

Very consistent across batches.

Low accuracy → expected here, since `class_weight="balanced"` trades overall accuracy for recall on hits.

Mid-range F1 (0.26) → better than classic GBM and KNN.

ROC-AUC of 0.81 → fairly good.

👉 Conclusion:
A simple model, but well suited to the problem. Good as a baseline.

In [None]:
display(
    df_resultados
        .query("Modelo == 'RandomForest'")
        .sort_values(by="Accuracy", ascending=False)
        .reset_index(drop=True)
)

Interpretation:

Batch_1 and Batch_2 behave very differently.

Batch_1's accuracy is inflated by the class imbalance.

Batch_2 has a better AUC but a lower F1.

👉 Conclusion:
RandomForest is sensitive to its hyperparameters here and scores below LightGBM/XGBoost.

In [None]:
display(
    df_resultados
        .query("Modelo == 'XGBoost'")
        .sort_values(by="Accuracy", ascending=False)
        .reset_index(drop=True)
)

Interpretation:

Very similar to LightGBM.

Strong F1 (0.49 in Batch_2).

Very high ROC-AUC (0.88).

👉 Conclusion:
Second-best model after LightGBM.

## Model summary

Best models (clearly)

LightGBM

XGBoost

Both reach:

F1 of about 0.49 or higher

AUC > 0.86

Both are consistent and robust on large datasets

In [None]:
df_batch1 = df_resultados[df_resultados["Batch"] == "Batch_1"]
df_batch1.plot(
    x="Modelo",
    y=["Accuracy", "F1", "ROC_AUC"],
    kind="bar",
    figsize=(10,5)
)
plt.title("Model comparison: 'Hit' prediction (Batch 1)")
plt.ylabel("Score")
plt.ylim(0,1)
plt.grid(True)
plt.show()

In [None]:
df_batch2 = df_resultados[df_resultados["Batch"] == "Batch_2"]
df_batch2.plot(
    x="Modelo",
    y=["Accuracy", "F1", "ROC_AUC"],
    kind="bar",
    figsize=(10,5)
)
plt.title("Model comparison: 'Hit' prediction (Batch 2)")
plt.ylabel("Score")
plt.ylim(0,1)
plt.grid(True)
plt.show()

## F1-score comparison by model and batch

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data=df_resultados, x="Modelo", y="F1", hue="Batch")
plt.title("F1-score comparison by model and batch")
plt.ylabel("F1-score")
plt.ylim(0,1)
plt.grid(True)
plt.show()


In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data=df_resultados, x="Modelo", y="ROC_AUC", hue="Batch")
plt.title("ROC_AUC comparison by model and batch")
plt.ylabel("ROC_AUC")
plt.ylim(0,1)
plt.grid(True)
plt.show()

## Model selection

Based on these results, we will use the LightGBM model (`LGBMClassifier`).

1. Best F1-score

F1 is the most important metric here because the dataset is highly imbalanced (4.53% hits).

Batch 2:

F1 = 0.5136

The highest of all models

Detects hits better than XGBoost, LogisticRegression, or RandomForest

2. Best ROC-AUC, or nearly equal to the best

Batch 2:

AUC = 0.8908

Very good separation between hits and non-hits

XGBoost reaches 0.8867, almost the same, but with a lower F1

3. Consistent between Batch_1 and Batch_2

In both batches it is the best model or very close to the best.

4. Strong performance on large datasets

The dataset has 232,724 songs.
LightGBM is designed for:

large datasets

high dimensionality

fast training

class imbalance, through built-in parameters such as `class_weight`

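5. Slightly better generalization than XGBoost

XGBoost performs very well, but in these experiments LightGBM scored slightly higher on the held-out test set (F1 of 0.5136 versus about 0.49), which suggests it generalizes at least as well in this scenario.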
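Since F1 is the deciding metric, it is worth remembering that `predict` applies a fixed 0.5 probability threshold, and on imbalanced data the threshold that maximizes F1 is often different. The sketch below shows one common way to look for a better threshold with `precision_recall_curve`. It uses synthetic data and a `LogisticRegression` stand-in instead of the selected LightGBM model, so the numbers are purely illustrative. In practice the threshold should be chosen on a validation split rather than on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); a stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=1
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# The last precision/recall pair has no associated threshold, so drop it.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero

best_threshold = thresholds[np.argmax(f1)]

f1_default = f1_score(y_val, (scores >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (scores >= best_threshold).astype(int))

print(f"F1 at threshold 0.5:  {f1_default:.3f}")
print(f"Best threshold:       {best_threshold:.3f}")
print(f"F1 at best threshold: {f1_tuned:.3f}")
```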