# LAB | Hyperparameter Tuning

**Load the data**

Final step: maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.
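To make "default values" concrete, the sketch below prints a few of the defaults scikit-learn uses for `GradientBoostingClassifier`, the model tuned later in this lab. Which parameters to print is only an illustrative choice.

```python
from sklearn.ensemble import GradientBoostingClassifier

# An estimator created without arguments uses scikit-learn's defaults
default_gb = GradientBoostingClassifier()
defaults = default_gb.get_params()

# Print the hyperparameters that this lab tunes later on
for name in ["n_estimators", "learning_rate", "max_depth",
             "min_samples_split", "subsample"]:
    print(f"{name}: {defaults[name]}")
```

Any value not passed explicitly when building a model keeps its default, so this list is the baseline that the searches below try to beat.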

In [25]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [26]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [27]:
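# Drop rows containing any missing value (inspection only: the cells below
# keep working on `spaceship` and impute the missing values instead)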
spaceship_dropped = spaceship.dropna()
spaceship_dropped.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

In [28]:
# Cabin is too granular: keep only its first letter as a new column
spaceship['Cabin_letter'] = spaceship['Cabin'].str[0]
spaceship[['Cabin', 'Cabin_letter']].head()

Unnamed: 0,Cabin,Cabin_letter
0,B/0/P,B
1,F/0/S,F
2,A/0/S,A
3,A/0/S,A
4,F/1/S,F
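The first letter is not the only information in `Cabin`. Assuming the values follow a `deck/num/side` pattern, as the sample rows suggest, a minimal sketch for splitting them into three columns could look like this (the column names `Deck`, `CabinNum` and `Side` are made up for illustration):

```python
import pandas as pd

# Toy rows copied from the head() output above, plus one missing value
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", None, "F/1/S"]})

# Split on "/" into three new columns; missing cabins stay missing
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]  # hypothetical names
cabins = pd.concat([cabins, parts], axis=1)

print(cabins)
```

This lab only keeps the first letter, which is the simpler option; the split is worth trying only if the extra columns turn out to help the model.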


In [29]:

#Drop PassengerId and Name
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])
spaceship.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Cabin_letter
0,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B
1,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F
2,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A
3,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A
4,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F


In [30]:
# Drop Cabin
spaceship = spaceship.drop(columns=['Cabin'])
spaceship.head()

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Cabin_letter
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F


In [31]:
# One-hot encode the categorical columns (HomePlanet, CryoSleep, Destination, VIP, Cabin_letter);
# the boolean Transported column is left as-is
spaceship = pd.get_dummies(spaceship, drop_first=True)
spaceship.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True,Cabin_letter_B,Cabin_letter_C,Cabin_letter_D,Cabin_letter_E,Cabin_letter_F,Cabin_letter_G,Cabin_letter_T
0,39.0,0.0,0.0,0.0,0.0,0.0,False,True,False,False,False,True,False,True,False,False,False,False,False,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,False,False,False,False,True,False,False,False,False,False,True,False,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,False,True,True,False,False,False,False,False,False,False
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,False,True,False,False,False,False,False,False,False,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,False,False,False,False,True,False,False,False,False,False,True,False,False
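One detail of `pd.get_dummies(..., drop_first=True)` is easy to miss: a row whose category is missing gets all-zero dummies, which is the same encoding as the dropped baseline category. The toy frame below (made-up values) shows this, together with the `dummy_na=True` option that adds an explicit missing-value indicator.

```python
import pandas as pd

toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", None, "Mars"],
    "Age": [24.0, 39.0, 30.0, 51.0],
})

# drop_first=True drops the first category (Earth), which becomes the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)

# Optional: dummy_na=True adds an indicator column for missing values
encoded_na = pd.get_dummies(toy, drop_first=True, dummy_na=True)
print(encoded_na)
```

In `encoded`, row 0 (Earth) and row 2 (missing) look identical in the `HomePlanet_*` columns. Whether that matters for this dataset is an open question; the sketch only shows how to keep the two cases apart if needed.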


In [32]:
spaceship.isnull().sum()

Age                          179
RoomService                  181
FoodCourt                    183
ShoppingMall                 208
Spa                          183
VRDeck                       188
Transported                    0
HomePlanet_Europa              0
HomePlanet_Mars                0
CryoSleep_True                 0
Destination_PSO J318.5-22      0
Destination_TRAPPIST-1e        0
VIP_True                       0
Cabin_letter_B                 0
Cabin_letter_C                 0
Cabin_letter_D                 0
Cabin_letter_E                 0
Cabin_letter_F                 0
Cabin_letter_G                 0
Cabin_letter_T                 0
dtype: int64

In [33]:
# Fill the remaining nulls in the numeric columns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
# 1. Scaling before KNNImputer is essential.
# Otherwise columns with large values (such as Spa) dominate the distance.
# Note: this scales every column, including the target Transported.
scaler = StandardScaler()
spaceship_scaled = scaler.fit_transform(spaceship)
# 2. Configure the imputer (5 neighbors is also the default)
imputer = KNNImputer(n_neighbors=5)
# 3. Fill the gaps
# This returns a NumPy array, so we convert it back to a DataFrame
spaceship_imputed = imputer.fit_transform(spaceship_scaled)
spaceship_final = pd.DataFrame(spaceship_imputed, columns=spaceship.columns)
print(f"Remaining nulls: {spaceship_final.isnull().sum().sum()}")

Remaining nulls: 0
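The cell above scales and imputes every column, including the target `Transported`. A minimal sketch of an alternative, on a made-up toy frame, is to scale and impute only the feature columns and carry the target over untouched. The names `toy`, `feature_cols` and `toy_final` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 33.0, 16.0, 45.0],
    "Spa": [0.0, 549.0, 6715.0, np.nan, 565.0, 10.0],
    "VRDeck": [0.0, 44.0, 49.0, 193.0, 2.0, 7.0],
    "Transported": [False, True, False, False, True, True],
})
feature_cols = ["Age", "Spa", "VRDeck"]

# Scale the features only (StandardScaler ignores NaN when fitting)
scaled = StandardScaler().fit_transform(toy[feature_cols])

# Impute in the scaled space so no single column dominates the distance
imputed = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Rebuild the frame and attach the untouched target as 0/1
toy_final = pd.DataFrame(imputed, columns=feature_cols, index=toy.index)
toy_final["Transported"] = toy["Transported"].astype(int)
print(toy_final)
```

The feature values in `toy_final` are on the standardized scale, while the target keeps its original 0/1 coding.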


In [34]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# 1. Scale the features of spaceship_final
scaler_final = StandardScaler()
X_scaled = scaler_final.fit_transform(spaceship_final.drop(columns=['Transported']))

# 2. Separate the target variable
# Take it from `spaceship`: in spaceship_final the target was standardized too,
# so casting that column to int would give -1/0 instead of 0/1
y = spaceship['Transported'].astype(int)

# 3. Feature selection using a random forest
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median"  # keep the features whose importance is at or above the median
)

selector.fit(X_scaled, y)

X_selected = selector.transform(X_scaled)

print(f"Number of original features: {X_scaled.shape[1]}")
print(f"Number of selected features: {X_selected.shape[1]}")

Number of original features: 19
Number of selected features: 10
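The counts above say how many features survived, but not which ones. A fitted `SelectFromModel` exposes a boolean mask through `get_support()`, which can be mapped back to column names. The sketch below uses synthetic data and hypothetical column names (`feat_0` ... `feat_7`) as a stand-in.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with hypothetical column names
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, random_state=0)
feature_names = [f"feat_{i}" for i in range(X_toy.shape[1])]
X_toy = pd.DataFrame(X_toy, columns=feature_names)

selector_toy = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median",
)
selector_toy.fit(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns
mask = selector_toy.get_support()
selected = X_toy.columns[mask].tolist()
print(selected)
```

In this notebook the same idea would be `spaceship_final.drop(columns=['Transported']).columns[selector.get_support()]`, since `X_scaled` is a NumPy array without column names.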


In [35]:
# Make sure the target is 0/1: take Transported from the unscaled frame and cast it to int
y = spaceship['Transported'].astype(int)  # True->1, False->0

In [36]:
# Train/test split on the 10 selected features
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)
print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")

Train shape: (6954, 10), Test shape: (1739, 10)
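In the cells above, the scaler, the imputer and the feature selector are all fitted on the full dataset before the split, so the test rows influence the preprocessing. One common way to avoid that is to split first and wrap the steps in a `Pipeline` that is fitted on the training rows only. This is a sketch on synthetic data, not a rerun of the lab, and the step names are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features, with some values knocked out
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

# Split first, so the test rows never influence the preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("impute", KNNImputer(n_neighbors=5)),
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=42),
        threshold="median",
    )),
    ("model", GradientBoostingClassifier(random_state=42)),
])

# Every step is fitted on the training rows only
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

A pipeline like this can also be passed directly to `GridSearchCV` or `RandomizedSearchCV`, in which case the preprocessing is refitted inside each cross-validation fold.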


- Now let's use the best model we got so far in order to see how it can improve when we fine-tune its hyperparameters.

In [37]:
# Gradient Boosting Classifier was the best model in the previous lab
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    r2_score,
    mean_absolute_error
)
# Initialize the model
gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
# Train
gb.fit(X_train, y_train)
# Predict
y_pred_gb = gb.predict(X_test)
# --------------------------
# Evaluation
# --------------------------
# Note: R2 and MAE are regression metrics; with 0/1 labels MAE equals 1 - accuracy
accuracy = accuracy_score(y_test, y_pred_gb)
r2 = r2_score(y_test, y_pred_gb)
mae = mean_absolute_error(y_test, y_pred_gb)
print("🏆 Gradient Boosting Classifier")
print("-" * 50)
print(f"Accuracy: {accuracy:.4f}")
print(f"R2 Score: {r2:.4f}")
print(f"MAE: {mae:.4f}")
print("\n📊 Confusion matrix:")
print(confusion_matrix(y_test, y_pred_gb))
print("\n📄 Classification report:")
print(classification_report(y_test, y_pred_gb))
print("-" * 50)

🏆 Gradient Boosting Classifier
--------------------------------------------------
Accuracy: 0.7872
R2 Score: 0.1489
MAE: 0.2128

📊 Confusion matrix:
[[629 232]
 [138 740]]

📄 Classification report:
              precision    recall  f1-score   support

           0       0.82      0.73      0.77       861
           1       0.76      0.84      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739

--------------------------------------------------
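All three headline numbers can be recovered from the confusion matrix alone, which also shows why R2 and MAE add little for a binary classifier: with 0/1 labels every error has absolute size 1, so MAE is simply the error rate. A short check using the matrix printed above:

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[629, 232],
               [138, 740]])

n = cm.sum()
errors = cm[0, 1] + cm[1, 0]

accuracy = np.trace(cm) / n
mae = errors / n  # for 0/1 labels every error has |y - y_hat| = 1

# R2 = 1 - SS_res / SS_tot, where SS_tot depends only on the true class balance
positives = cm[1].sum()
ss_tot = positives * (n - positives) / n
r2 = 1 - errors / ss_tot

print(f"Accuracy: {accuracy:.4f}")
print(f"MAE:      {mae:.4f}")
print(f"R2:       {r2:.4f}")
```

For comparing classifiers, accuracy together with the precision and recall in the classification report already carries this information.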


- Evaluate your model

In [38]:
# Evaluation
# --------------------------
# Note: R2 and MAE are regression metrics; with 0/1 labels MAE equals 1 - accuracy
accuracy = accuracy_score(y_test, y_pred_gb)
r2 = r2_score(y_test, y_pred_gb)
mae = mean_absolute_error(y_test, y_pred_gb)
print("🏆 Gradient Boosting Classifier")
print("-" * 50)
print(f"Accuracy: {accuracy:.4f}")
print(f"R2 Score: {r2:.4f}")
print(f"MAE: {mae:.4f}")
print("\n📊 Confusion matrix:")
print(confusion_matrix(y_test, y_pred_gb))
print("\n📄 Classification report:")
print(classification_report(y_test, y_pred_gb))
print("-" * 50)

🏆 Gradient Boosting Classifier
--------------------------------------------------
Accuracy: 0.7872
R2 Score: 0.1489
MAE: 0.2128

📊 Confusion matrix:
[[629 232]
 [138 740]]

📄 Classification report:
              precision    recall  f1-score   support

           0       0.82      0.73      0.77       861
           1       0.76      0.84      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739

--------------------------------------------------


**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [39]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# 1. Define a parameter range (wider than before)
param_dist = {
    'n_estimators': [100, 300, 500],      # Number of trees
    'learning_rate': [0.01, 0.05, 0.1],   # Learning rate (step size)
    'max_depth': [3, 4, 5, 6],            # Depth of the trees
    'min_samples_split': [2, 5, 10],      # Minimum number of samples needed to split a node
    'subsample': [0.8, 0.9, 1.0]          # Fraction of the samples used to fit each tree
}

# 2. Configure the randomized search
random_search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,    # This is the key: only 10 random combinations are tried
    cv=5, 
    n_jobs=-1, 
    scoring='accuracy',
    random_state=42
)

# 3. Fit
random_search.fit(X_train, y_train)

# 4. The best model is already refitted and available in random_search.best_estimator_
best_gb_model = random_search.best_estimator_

y_pred = best_gb_model.predict(X_test)
print(f"Best accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Also report R2 and MAE
print(f"Best R2: {r2_score(y_test, y_pred):.4f}")
print(f"Best MAE: {mean_absolute_error(y_test, y_pred):.4f}")


Best accuracy: 0.7930
Best R2: 0.1719
Best MAE: 0.2070
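To see how much work `n_iter=10` saves, the size of the full grid can be counted with `ParameterGrid`, and the kind of draws a randomized search makes can be previewed with `ParameterSampler`. The sketch reuses the parameter lists from the cell above; the sampled combinations are only an illustration and are not guaranteed to match the ones `RandomizedSearchCV` actually tried.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'subsample': [0.8, 0.9, 1.0],
}
cv = 5

# Every combination an exhaustive grid search would have to try
n_combinations = len(ParameterGrid(param_dist))

# Ten random draws from the same lists
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

print(f"Full grid: {n_combinations} combinations, {n_combinations * cv} CV fits")
print(f"Randomized: {len(sampled)} combinations, {len(sampled) * cv} CV fits")
```

With 5-fold cross-validation that is 50 fits instead of 1620, at the cost of exploring only about 3% of the grid.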


- Run Grid Search

In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, r2_score, mean_absolute_error, confusion_matrix, classification_report

# 1. Define a finer grid based on the previous results
# We already know that values close to these work well
param_grid = {
    'n_estimators': [300, 400, 500],
    'learning_rate': [0.01, 0.05],
    'max_depth': [4, 5],
    'subsample': [0.8, 0.9]
}

# 2. Run Grid Search (this time it tries every combination of this small grid)
grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)

print("🚀 Starting refinement with Grid Search...")
grid_search.fit(X_train, y_train)

# 3. Extract the best model and evaluate it
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# 4. Final results
print("\n" + "="*30)
print("🏆 FINAL OPTIMIZED MODEL")
print("="*30)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")

print("\n📊 Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

print("\n📄 Classification report:")
print(classification_report(y_test, y_pred))

🚀 Starting refinement with Grid Search...

==============================
🏆 FINAL OPTIMIZED MODEL
==============================
Best parameters: {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 400, 'subsample': 0.9}
Accuracy: 0.7878
R2 Score: 0.1512
MAE: 0.2122

📊 Confusion matrix:
[[618 243]
 [126 752]]

📄 Classification report:
              precision    recall  f1-score   support

           0       0.83      0.72      0.77       861
           1       0.76      0.86      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739
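The search picks its winner by cross-validated accuracy on the training set, not by the test accuracy reported above, so it helps to look at both. A fitted `GridSearchCV` keeps the per-combination scores in `cv_results_` and the winning score in `best_score_`. The sketch below uses synthetic data and a deliberately tiny grid so that it runs quickly; the grid values are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

small_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    small_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# One row per parameter combination, best rank first
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))

print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy:    {search.score(X_te, y_te):.4f}")
```

In this notebook the same attributes are available on `grid_search` and `random_search`. If several combinations have nearly the same `mean_test_score`, the choice among them is mostly noise.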



- Evaluate your model

In [41]:
# 4. Final results
print("\n" + "="*30)
print("🏆 FINAL OPTIMIZED MODEL")
print("="*30)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")

print("\n📊 Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

print("\n📄 Classification report:")
print(classification_report(y_test, y_pred))


==============================
🏆 FINAL OPTIMIZED MODEL
==============================
Best parameters: {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 400, 'subsample': 0.9}
Accuracy: 0.7878
R2 Score: 0.1512
MAE: 0.2122

📊 Confusion matrix:
[[618 243]
 [126 752]]

📄 Classification report:
              precision    recall  f1-score   support

           0       0.83      0.72      0.77       861
           1       0.76      0.86      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739



Randomized Search came out ahead in this case because we gave it more freedom (wider ranges) to explore. The Grid Search we ran stayed "boxed in" to a smaller, more conservative set of values, which produced a slightly weaker model on the test set.
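How large is "slightly"? A rough way to judge is to convert the accuracy gap into a number of test rows and compare it with the binomial standard error of a single accuracy estimate. The figures below are the ones printed by the Randomized Search and Grid Search cells; the standard-error formula is a simplification that ignores the fact that both models are scored on the same rows.

```python
import math

n_test = 1739        # size of the test set used above
acc_random = 0.7930  # accuracy printed by the Randomized Search cell
acc_grid = 0.7878    # accuracy printed by the Grid Search cell

gap = acc_random - acc_grid
extra_correct = round(gap * n_test)

# Binomial standard error of one accuracy estimate (rough approximation)
se_grid = math.sqrt(acc_grid * (1 - acc_grid) / n_test)

print(f"Gap: {gap:.4f} (about {extra_correct} test rows)")
print(f"Standard error of one estimate: {se_grid:.4f}")
```

The gap is smaller than one standard error, so on this evidence alone the two tuned models are hard to tell apart.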

In [42]:
import pandas as pd
# Random Search results (as printed by the Randomized Search cell above)
random_results = {
    'Method': 'Random Search',
    'Accuracy': 0.7930,
    'R2': 0.1719,
    'MAE': 0.2070
}
# Grid Search results
grid_results = {
    'Method': 'Grid Search',
    'Accuracy': 0.7878,
    'R2': 0.1512,
    'MAE': 0.2122
}
# Build the DataFrame
df_comparativa = pd.DataFrame([random_results, grid_results])
# Show the table
print(df_comparativa)

          Method  Accuracy      R2     MAE
0  Random Search    0.7930  0.1719  0.2070
1    Grid Search    0.7878  0.1512  0.2122
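Typing the metrics in by hand makes it easy for the table to drift away from the actual cell outputs. A small helper that computes each row from the predictions avoids that. The function name `summarize` and the toy labels below are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, mean_absolute_error, r2_score

def summarize(method, y_true, y_hat):
    """Return one comparison-table row computed from predictions."""
    return {
        "Method": method,
        "Accuracy": accuracy_score(y_true, y_hat),
        "R2": r2_score(y_true, y_hat),
        "MAE": mean_absolute_error(y_true, y_hat),
    }

# Tiny made-up labels and predictions, only to show the call pattern
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_a = [0, 1, 1, 0, 0, 0, 1, 1]  # 7 of 8 correct
pred_b = [0, 1, 0, 0, 0, 1, 1, 1]  # 5 of 8 correct

comparison = pd.DataFrame([
    summarize("Model A", y_true, pred_a),
    summarize("Model B", y_true, pred_b),
])
print(comparison.round(4))
```

In this notebook the rows would come from `summarize("Random Search", y_test, best_gb_model.predict(X_test))` and `summarize("Grid Search", y_test, best_model.predict(X_test))`.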
