# Support Vector Machine (SVM)
## Epsilon-Support Vector Regression (ε-SVR) model

The goal is to predict the 'track_popularity' of any song.

To do so, we will try several SVR model configurations trained on different versions of the dataset:
- Scaled data,
- 6-component PCA (*musical features* only),
- 9-component PCA *(includes 'genre' dummies).*
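The notebook loads these versions from precomputed CSV files, so how they were built is not shown here. As a rough sketch of how such versions could be produced, the block below uses synthetic data: the feature names, the genre labels, and the choice to leave the dummies unscaled before the PCA are all assumptions, not the original preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-in for the real data: 200 tracks, 8 numeric features.
feature_names = [f"feature_{i}" for i in range(8)]
raw = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(200, 8)), columns=feature_names)

# Version 1: scaled data (zero mean, unit variance per column).
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=feature_names)

# Version 2: PCA with 6 components on the scaled numeric features only.
pca6 = PCA(n_components=6)
df_pca6_sketch = pd.DataFrame(
    pca6.fit_transform(scaled), columns=[f"PC{i}" for i in range(6)]
)

# Version 3: PCA with 9 components after appending dummy columns for a
# hypothetical 'genre' variable (assumption: dummies are not rescaled).
genre = pd.Series(rng.choice(["pop", "rock", "rap"], size=200), name="genre")
dummies = pd.get_dummies(genre, prefix="genre", dtype=float)
with_genre = pd.concat([scaled, dummies], axis=1)
pca9 = PCA(n_components=9)
df_pca9_sketch = pd.DataFrame(
    pca9.fit_transform(with_genre), columns=[f"PC{i}" for i in range(9)]
)

print(df_pca6_sketch.shape, df_pca9_sketch.shape)
print("Variance kept by 6 components:", pca6.explained_variance_ratio_.sum())
```

The column names `PC0` to `PC5` mirror the ones that appear in `df_pca6.csv`; everything else is illustrative.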


In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# PCA with 6 components

In [2]:
df = pd.read_csv('df_scaled.csv')
df_pca = pd.read_csv('df_pca6.csv')

In [3]:
df_pca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23081 entries, 0 to 23080
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   PC0               23081 non-null  float64
 1   PC1               23081 non-null  float64
 2   PC2               23081 non-null  float64
 3   PC3               23081 non-null  float64
 4   PC4               23081 non-null  float64
 5   PC5               23081 non-null  float64
 6   track_id          23081 non-null  object 
 7   track_popularity  23081 non-null  float64
dtypes: float64(7), object(1)
memory usage: 1.4+ MB


In [14]:
y = df_pca['track_popularity']
X = df_pca.drop(columns=['track_popularity','track_id'])

We split the dataset into training and test sets.

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Training

We search for the best Support Vector Regressor parameters for our dataset.

https://chatgpt.com/share/670c4daf-008c-8009-ab58-a9cb405f0f5a
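The search in this notebook varies `C`, `kernel`, `gamma` and `degree`, and leaves `epsilon` at scikit-learn's default of 0.1. To make the role of that parameter concrete, here is a minimal sketch of the ε-insensitive loss that gives ε-SVR its name; the helper name and the sample values are illustrative only.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)


# Hypothetical targets and predictions, chosen to show both regimes.
y_true_demo = np.array([1.0, 2.0, 3.0])
y_pred_demo = np.array([1.05, 2.5, 1.0])

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=0.1)
print(losses)
```

The first prediction falls inside the tube and costs nothing, while the other two are penalized only by the amount they exceed ε.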

In [None]:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Base estimator; epsilon stays at its default value
svm_hitters = SVR()

# One parameter grid per kernel family, so each kernel only receives
# the parameters listed for it
grid = GridSearchCV(svm_hitters,
                    [{"C": [0.01, 0.1, 1, 5, 10, 100], "kernel": ["linear"]},
                     {"C": [0.01, 0.1, 1, 5, 10, 100], "gamma": [0.1, 0.5, 1, 2, 10, 100], "kernel": ["rbf", "sigmoid"]},
                     {"C": [0.01, 0.1, 1, 5, 10, 100], "degree": [2, 3, 4, 5, 6], "kernel": ["poly"]}],
                    refit=True,
                    cv=5,
                    scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)

**Execution time: 1100 min**
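The grid above contains 6 + 72 + 30 = 108 candidates, which with 5 folds means 540 fits plus the final refit. One possible way to cut that cost, sketched here on synthetic data, is to sample a fixed number of candidates with `RandomizedSearchCV` instead of trying every combination. The parameter ranges, `n_iter`, the restriction to the RBF kernel and the synthetic dataset are assumptions for illustration, not settings used in this notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for X_train / y_train (assumption: 6 features, like the PCA set).
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)

# Sample C and gamma on a log scale instead of enumerating a fixed grid.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-2, 1e1),
    "kernel": ["rbf"],
}

search = RandomizedSearchCV(
    SVR(),
    param_distributions,
    n_iter=10,  # only 10 sampled candidates are evaluated
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=42,
    n_jobs=1,  # -1 would spread the fits over all available cores
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best CV MAE:", -search.best_score_)
```

The trade-off is that a random search can miss the best grid point, so its result is only a starting point for a narrower follow-up search.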

In [None]:
grid.best_params_

In [None]:
# View all search results as a DataFrame, which makes them easier to inspect
pd.DataFrame(grid.cv_results_).sort_values("rank_test_score")

In [None]:
from sklearn.metrics import (mean_absolute_error, r2_score,
                             root_mean_squared_error, 
                             mean_absolute_percentage_error)

svm_hitters_best = grid.best_estimator_

y_pred = svm_hitters_best.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MAE: {mae}")
print(f"Test RMSE: {rmse}")
print(f"Test MAPE: {mape}")
print(f"Test R2: {r2}")
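A note on reading the MAPE value: scikit-learn divides each absolute error by the absolute true value, so any target equal to zero makes the metric explode. Whether `track_popularity` contains zeros in this dataset is not shown here, so the following is only a sketch with made-up numbers of what that effect looks like.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical targets: the first one is exactly zero.
y_true_demo = np.array([0.0, 50.0, 80.0])
y_pred_demo = np.array([5.0, 45.0, 88.0])

# MAPE over all samples: dominated by the zero target.
mape_all = mean_absolute_percentage_error(y_true_demo, y_pred_demo)

# MAPE restricted to non-zero targets, for comparison.
nonzero = y_true_demo != 0
mape_nonzero = mean_absolute_percentage_error(y_true_demo[nonzero], y_pred_demo[nonzero])

print(f"MAPE with the zero target: {mape_all:.3e}")
print(f"MAPE without it: {mape_nonzero:.3f}")
```

If the reported MAPE is many orders of magnitude above 1, zero or near-zero targets are a likely cause, and MAE or RMSE are the safer metrics to compare.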

## Comparison

We train a Ridge linear regression to compare the metrics.

In [None]:
from sklearn.linear_model import Ridge

ridge_hitters = Ridge()

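# Search 1000 alpha values evenly spaced in [0, 20];
# alpha=0 is equivalent to ordinary least squares
grid = GridSearchCV(ridge_hitters,
                    {"alpha": np.linspace(0, 20, 1000)},
                    refit=True,
                    cv=5,
                    scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)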

In [None]:
grid.best_params_

In [None]:
ridge_hitters = grid.best_estimator_

y_pred = ridge_hitters.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MAE: {mae}")
print(f"Test RMSE: {rmse}")
print(f"Test MAPE: {mape}")
print(f"Test R2: {r2}")
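Because the cells above reuse the names `grid`, `y_pred`, `mae`, `rmse`, `mape` and `r2`, the SVR values are overwritten once the Ridge cells run. A possible way to keep both sets of metrics side by side is sketched below with a hypothetical helper, `regression_report`, applied to synthetic data and untuned models; none of these names or numbers come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def regression_report(model, X_eval, y_eval):
    """Return the test metrics of a fitted regressor as a dict."""
    y_hat = model.predict(X_eval)
    return {
        "MAE": mean_absolute_error(y_eval, y_hat),
        # sqrt of MSE, which also works on older scikit-learn versions
        "RMSE": float(np.sqrt(mean_squared_error(y_eval, y_hat))),
        "R2": r2_score(y_eval, y_hat),
    }


# Synthetic stand-in for the PCA dataset (assumption: 6 features).
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Untuned models, standing in for the two best estimators.
models = {
    "SVR": SVR(kernel="linear").fit(X_tr, y_tr),
    "Ridge": Ridge(alpha=1.0).fit(X_tr, y_tr),
}

# One row per model, one column per metric.
comparison = pd.DataFrame.from_dict(
    {name: regression_report(model, X_te, y_te) for name, model in models.items()},
    orient="index",
)
print(comparison)
```

In the notebook itself, the same helper could be called once with `svm_hitters_best` and once with `ridge_hitters`, using the real `X_test` and `y_test`.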