# Support Vector Machine (SVM)
## Epsilon-Support Vector Regression (ε-SVR) Model

The goal is to predict the 'track_popularity' of any song.

To do so, we will try several SVR configurations trained on different versions of the dataset:
- Scaled data,
- PCA with 6 components (*musical features* only),
- PCA with 9 components *(includes the 'genre' dummies).*
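
As background for the model named in the heading: ε-SVR penalizes a prediction only when its absolute error exceeds a tolerance ε (scikit-learn's `SVR` uses `epsilon=0.1` by default). The following is a minimal NumPy sketch of that ε-insensitive loss. The helper name and the sample values are illustrative and not part of this notebook.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y - f(x)| - epsilon). Illustrative helper."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Errors inside the epsilon tube cost nothing; larger ones grow linearly.
    return np.maximum(0.0, residual - epsilon)

# Made-up targets and predictions with errors of 0.05, 0.20 and 0.30.
losses = epsilon_insensitive_loss([0.50, 0.50, 0.50], [0.55, 0.70, 0.20])
print(losses)
```

Only the second and third predictions fall outside the tube, so only they contribute to the loss.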


In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# PCA with 6 components

In [2]:
df = pd.read_csv('df_scaled.csv')
df_pca = pd.read_csv('df_pca6.csv')

In [3]:
df_pca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23081 entries, 0 to 23080
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   PC0               23081 non-null  float64
 1   PC1               23081 non-null  float64
 2   PC2               23081 non-null  float64
 3   PC3               23081 non-null  float64
 4   PC4               23081 non-null  float64
 5   PC5               23081 non-null  float64
 6   track_id          23081 non-null  object 
 7   track_popularity  23081 non-null  float64
dtypes: float64(7), object(1)
memory usage: 1.4+ MB


In [4]:
y = df_pca['track_popularity']
X = df_pca.drop(columns=['track_popularity','track_id'])

We split the dataset into training and test sets.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Training

We search for the best Support Vector Regressor hyperparameters for our dataset.

In [9]:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

svm_hitters = SVR()

In [None]:
"""
grid = GridSearchCV(svm_hitters,
                    [{"C": [0.01, 0.1, 1, 5, 10, 100], "kernel": ["linear"]},
                     {"C": [0.01, 1, 100], "gamma": [0.1, 1, 10, 100], "kernel": ["rbf", "sigmoid"]},
                     {"C": [0.01, 1, 100], "degree": [2, 3, 4, 5, 6], "kernel": ["poly"]}],
                    refit=True,
                    cv=5,
                    scoring='neg_mean_absolute_error') 
grid.fit(X_train,y_train)
"""

KeyboardInterrupt: 

**Execution time: 20h** (interrupted without results)
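
To get a feel for the size of the interrupted search, the sketch below counts its candidates with scikit-learn's `ParameterGrid`. The grid is copied from the cell above. The fit count assumes 5-fold cross-validation plus one final refit, and that search was also launched without `n_jobs`, so the fits ran one after another.

```python
from sklearn.model_selection import ParameterGrid

# Same three sub-grids as the interrupted GridSearchCV call.
param_grid = [
    {"C": [0.01, 0.1, 1, 5, 10, 100], "kernel": ["linear"]},
    {"C": [0.01, 1, 100], "gamma": [0.1, 1, 10, 100], "kernel": ["rbf", "sigmoid"]},
    {"C": [0.01, 1, 100], "degree": [2, 3, 4, 5, 6], "kernel": ["poly"]},
]

n_candidates = len(ParameterGrid(param_grid))  # 6 + 24 + 15 combinations
cv_folds = 5
n_fits = n_candidates * cv_folds + 1           # +1 for the refit on the full training set
print(n_candidates, n_fits)                    # 45 226
```

Splitting the search by kernel, as done in the next section, makes it possible to see which family of candidates dominates the run time.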

## Grid Search

In [10]:
param_grid_linear = [{"C": [0.01, 0.1, 1, 5, 10, 100], "kernel": ["linear"]}]

grid_linear = GridSearchCV(svm_hitters,
                           param_grid_linear,
                           refit=True,
                           verbose=1,
                           cv=5,
                           n_jobs=-1,
                           scoring='neg_mean_absolute_error')

grid_linear.fit(X_train, y_train)

**Execution time: 15m 12.5s**

In [11]:
param_grid_rbf_sigmoid = [{"C": [0.01, 1, 100], "gamma": [0.1, 1, 10, 100], "kernel": ["rbf", "sigmoid"]}]

grid_rbf_sigmoid = GridSearchCV(svm_hitters,
                                param_grid_rbf_sigmoid,
                                refit=True,
                                verbose=2,
                                cv=5,
                                n_jobs=-1,
                                scoring='neg_mean_absolute_error')

grid_rbf_sigmoid.fit(X_train, y_train)

**Execution time: 14m 10s**

In [12]:
"""
param_grid_poly = [{"C": [0.01, 1, 100], "degree": [2, 3, 4, 5, 6], "kernel": ["poly"]}]

grid_poly = GridSearchCV(svm_hitters,
                         param_grid_poly,
                         refit=True,
                         cv=5,
                         n_jobs=-1,
                         scoring='neg_mean_absolute_error')

grid_poly.fit(X_train, y_train)
"""

KeyboardInterrupt: 

**Execution time: 7h 20m** (interrupted without results)
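
One possible way to get 'poly' results in a reasonable time is to tune on a random subsample first, since SVR fit time grows faster than linearly with the number of samples. The sketch below is only an illustration under stated assumptions: synthetic data stands in for `X_train` and `y_train`, and the subsample size of 300 and the reduced grid are arbitrary choices, not values used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Synthetic stand-in for (X_train, y_train): 6 PCA-like features, target in [0, 1].
X_full = rng.normal(size=(2000, 6))
y_full = np.clip(0.4 + 0.05 * X_full[:, 0] + rng.normal(scale=0.2, size=2000), 0, 1)

# Draw a random subsample without replacement (size chosen arbitrarily).
idx = rng.choice(len(X_full), size=300, replace=False)
X_sub, y_sub = X_full[idx], y_full[idx]

# Small illustrative grid; the real search would reuse param_grid_poly.
grid_poly_sub = GridSearchCV(
    SVR(kernel="poly"),
    {"C": [0.01, 1], "degree": [2, 3]},
    cv=3,
    scoring="neg_mean_absolute_error",
)
grid_poly_sub.fit(X_sub, y_sub)
print(grid_poly_sub.best_params_)
```

Parameters found this way are only a starting point and would still need to be confirmed on the full training set.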

## Randomized Search + Bayesian Search

In [None]:
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVR
import numpy as np

In [None]:
# Define the initial model: SVR with the 'poly' kernel
model_poly = SVR(kernel='poly')

# Define the search space for RandomizedSearchCV
param_dist_poly = {
    "C": [0.01, 1, 100],
    "degree": [2, 3, 4, 5, 6]
}

# Run RandomizedSearchCV
random_search_poly = RandomizedSearchCV(
    estimator=model_poly,
    param_distributions=param_dist_poly,
    n_iter=10,  # Samples 10 of the 15 possible combinations; adjust as needed
    cv=3,
    verbose=3,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    random_state=42
)


print('Starting RandomizedSearchCV …')
random_search_poly.fit(X_train, y_train)
print('RandomizedSearchCV finished.')

# Print the best parameters found
print(f"Best initial parameters ('poly' kernel): {random_search_poly.best_params_}")

# Keep the best parameters from RandomizedSearchCV to center the refined search
best_C = random_search_poly.best_params_['C']
best_degree = random_search_poly.best_params_['degree']



In [None]:

# Import BayesSearchCV (from the scikit-optimize package)
from skopt import BayesSearchCV
from skopt.space import Real, Integer

# Define the refined search space for BayesSearchCV, centered on the random-search result
param_dist_poly = {
    'C': Real(max(0.01, best_C * 0.5), best_C * 2, prior='log-uniform'),
    'degree': Integer(max(2, best_degree - 1), min(6, best_degree + 1)),
    'kernel': ['poly']
}

# Use BayesSearchCV for the fine-grained search
bayes_search_poly = BayesSearchCV(
    estimator=SVR(kernel='poly'),
    search_spaces=param_dist_poly,
    n_iter=30,  # More iterations for a more exhaustive search
    cv=5,
    verbose=3,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    random_state=42
)

print('Starting BayesSearchCV …')
bayes_search_poly.fit(X_train, y_train)
print('BayesSearchCV finished.')

# Print the refined best parameters
print(f"Best refined parameters ('poly' kernel): {bayes_search_poly.best_params_}")
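
To make the refined search space concrete, the sketch below reproduces the bound arithmetic from the cell above in plain Python. `refined_bounds` is a hypothetical helper and the three `(best_C, best_degree)` pairs are illustrative inputs, since the random search produced no recorded output here.

```python
def refined_bounds(best_C, best_degree):
    """Mirror the bounds of the refined search space (hypothetical helper)."""
    # C is searched between half and double the best value, never below 0.01.
    c_low, c_high = max(0.01, best_C * 0.5), best_C * 2
    # The degree moves one step either way, clipped to the original range [2, 6].
    d_low, d_high = max(2, best_degree - 1), min(6, best_degree + 1)
    return (c_low, c_high), (d_low, d_high)

# Illustrative random-search outcomes at the edges and middle of the grid.
for c, d in [(0.01, 2), (1, 3), (100, 6)]:
    print(c, d, refined_bounds(c, d))
```

At the edges of the original grid the interval is clipped, so the refined search cannot explore beyond the values already tried for `degree`.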

In [14]:
best_params_linear = grid_linear.best_params_
best_score_linear = grid_linear.best_score_

best_params_rbf_sigmoid = grid_rbf_sigmoid.best_params_
best_score_rbf_sigmoid = grid_rbf_sigmoid.best_score_

#best_params_poly = grid_poly.best_params_
#best_score_poly = grid_poly.best_score_

In [16]:
results = {
    "linear": {"params": best_params_linear, "score": best_score_linear},
    "rbf_sigmoid": {"params": best_params_rbf_sigmoid, "score": best_score_rbf_sigmoid},
    #"poly": {"params": best_params_poly, "score": best_score_poly}
}

# Find the best model: scores are negated MAE, so the highest score is the lowest error
best_model = max(results, key=lambda x: results[x]["score"])
print("Best model:", best_model, results[best_model])


Best model: rbf_sigmoid {'params': {'C': 0.01, 'gamma': 0.1, 'kernel': 'rbf'}, 'score': -0.1598405126301487}
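
The selection above relies on scikit-learn's sign convention: with `scoring='neg_mean_absolute_error'`, `best_score_` is the negated MAE, so a higher score means a smaller error. A small sketch with hypothetical scores (not the ones obtained in this notebook) shows the idea:

```python
# Hypothetical cross-validated scores in the "neg_mean_absolute_error" convention.
scores = {"linear": -0.1612, "rbf_sigmoid": -0.1598}

best = max(scores, key=scores.get)  # highest score = smallest MAE
best_mae = -scores[best]            # flip the sign to report the MAE itself
print(best, best_mae)
```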


## To review

In [17]:
grid_rbf_sigmoid.best_params_

{'C': 0.01, 'gamma': 0.1, 'kernel': 'rbf'}

In [None]:
# View the full search results as a DataFrame for easier inspection
pd.DataFrame(grid_rbf_sigmoid.cv_results_).sort_values("rank_test_score")

In [19]:
# root_mean_squared_error requires scikit-learn >= 1.4
from sklearn.metrics import (mean_absolute_error, r2_score,
                             root_mean_squared_error, 
                             mean_absolute_percentage_error)

# Best estimator from the rbf/sigmoid search, refit on the full training set
svm_hitters_best = grid_rbf_sigmoid.best_estimator_

y_pred = svm_hitters_best.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MAE: {mae}")
print(f"Test RMSE: {rmse}")
print(f"Test MAPE: {mape}")
print(f"Test R2: {r2}")

Test MAE: 0.15994229831958814
Test RMSE: 0.1943452915874804
Test MAPE: 10542280272399.533
Test R2: 0.010002396511244682
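
A MAPE on the order of 10^13 next to an MAE of about 0.16 is most likely an artifact of the metric rather than of the model. scikit-learn divides each absolute error by `max(|y_true|, eps)`, so any target equal to zero produces an enormous term. The sketch below reproduces the effect with made-up values, on the assumption (not verified here) that the test set contains songs whose scaled popularity is 0.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up targets: one true value is exactly 0, as can happen with a scaled target.
y_true = np.array([0.0, 0.5, 0.8])
y_pred = np.array([0.1, 0.5, 0.8])

# The single zero target dominates the average.
mape_all = mean_absolute_percentage_error(y_true, y_pred)

# Restricting the metric to non-zero targets removes the blow-up.
mask = y_true != 0
mape_nonzero = mean_absolute_percentage_error(y_true[mask], y_pred[mask])
print(mape_all, mape_nonzero)
```

If that assumption holds for this dataset, MAE and RMSE are the more informative figures to compare.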


## Comparison

We train a Ridge linear regression to compare the metrics.

In [20]:
from sklearn.linear_model import Ridge

ridge_hitters = Ridge()

grid_ridge = GridSearchCV(ridge_hitters,
                    {"alpha": np.linspace(0, 20, 1000)},
                    refit=True,
                    cv=5,
                    scoring='neg_mean_absolute_error')
grid_ridge.fit(X_train,y_train)

In [21]:
grid_ridge.best_params_

{'alpha': 0.0}
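
A best `alpha` of 0.0 means the search preferred no regularization at all, which makes the Ridge model equivalent to ordinary least squares. The following sketch checks that equivalence on synthetic data; the arrays and coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic, well-conditioned data with 6 features.
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([0.3, -0.2, 0.1, 0.0, 0.05, -0.1])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

ridge_zero = Ridge(alpha=0.0).fit(X_demo, y_demo)
ols = LinearRegression().fit(X_demo, y_demo)

# With alpha = 0 the penalty term vanishes, so both fits solve the same problem.
same = np.allclose(ridge_zero.coef_, ols.coef_, atol=1e-6)
print(same)
```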

In [22]:
ridge_hitters = grid_ridge.best_estimator_

y_pred = ridge_hitters.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MAE: {mae}")
print(f"Test RMSE: {rmse}")
print(f"Test MAPE: {mape}")
print(f"Test R2: {r2}")

Test MAE: 0.16069644864677907
Test RMSE: 0.1946560227523725
Test MAPE: 10382514201157.889
Test R2: 0.006834128019646002
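
Both models reach an R2 of roughly 0.01 (0.0100 for SVR and 0.0068 for Ridge), which is very close to what a constant prediction would give. A mean-only baseline makes that reference point explicit. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data, so the numbers it prints are not results for this dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(1)

# Synthetic stand-in data: features carry no signal, target is uniform in [0, 1].
X_demo = rng.normal(size=(500, 6))
y_demo = rng.uniform(0, 1, size=500)

# The baseline ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_r2 = r2_score(y_demo, y_base)    # 0 by construction on the data it was fit on
baseline_mae = mean_absolute_error(y_demo, y_base)
print(baseline_r2, baseline_mae)
```

Fitting the same baseline on `X_train`, `y_train` and scoring it on the test set would show how much of the reported MAE is already achieved without using any feature.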
