# LAB | Hyperparameter Tuning

**Load the data**

This is the final step toward maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before and then compare the best-performing models you have obtained so far, this time fine-tuning their hyperparameters.
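To see which default values the earlier models were using, a minimal sketch is shown below. It relies on `get_params()`, which every scikit-learn estimator exposes; the variable names `rf_defaults` and `logreg_defaults` are only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# get_params() returns every hyperparameter of an estimator with its current value.
# For a freshly constructed estimator these are the library defaults.
rf_defaults = RandomForestClassifier().get_params()
logreg_defaults = LogisticRegression().get_params()

print("RandomForest n_estimators:", rf_defaults["n_estimators"])
print("RandomForest max_depth:", rf_defaults["max_depth"])
print("LogisticRegression C:", logreg_defaults["C"])
```

Any key returned by `get_params()` is a candidate for tuning, although usually only a few of them matter for a given model.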

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [8]:


# Show general information: columns, data types, and non-null counts
print(spaceship.info())




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
None


In [9]:
# Show the number of null values per column (very useful for data cleaning)
print(spaceship.isnull().sum())

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


In [10]:
# Cleaning
# 1. Count duplicate rows
print("Duplicates:", spaceship.duplicated().sum())

# 2. Drop duplicates if any exist
spaceship.drop_duplicates(inplace=True)

# 3. Fill nulls in numeric columns with the column mean
spaceship.fillna(spaceship.mean(numeric_only=True), inplace=True)

# 4. Fill nulls in categorical columns with 'Unknown'
#    (assign the result back instead of calling fillna(inplace=True) on a column
#    selection, which pandas warns will stop working in pandas 3.0)
for col in spaceship.select_dtypes(include='object').columns:
    spaceship[col] = spaceship[col].fillna('Unknown')


Duplicates: 0


In [11]:
# Null handling
# Show the percentage of null values per column
porcentaje_nulos = spaceship.isnull().mean() * 100
print(porcentaje_nulos.sort_values(ascending=False))


PassengerId     0.0
HomePlanet      0.0
CryoSleep       0.0
Cabin           0.0
Destination     0.0
Age             0.0
VIP             0.0
RoomService     0.0
FoodCourt       0.0
ShoppingMall    0.0
Spa             0.0
VRDeck          0.0
Name            0.0
Transported     0.0
dtype: float64


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [12]:
from sklearn.preprocessing import StandardScaler

# Separate the target variable
y = spaceship['Transported']  # The variable we want to predict
X = spaceship.drop(columns=['Transported', 'Name', 'PassengerId', 'Cabin'])  # Drop columns not used as features

# Convert categorical variables to numeric (one-hot encoding)
X = pd.get_dummies(X)

# Scale the columns (after encoding, every column is numeric)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit the scaler and apply it

# Convert back to a DataFrame to keep the column names
X = pd.DataFrame(X_scaled, columns=X.columns)
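The cell above covers scaling, while the feature selection step from the list is not shown. One possible way to do it is sketched below with `SelectKBest` and the ANOVA F-score. This is an assumption about the approach, not necessarily what the earlier labs used. The data is a synthetic stand-in generated with `make_classification`, and the names `X_demo`, `y_demo`, and `k=5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a scaled feature matrix (assumption: not the lab data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(10)])

# Keep the 5 features with the highest ANOVA F-score against the target
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_demo, y_demo)

# get_support() returns a boolean mask over the original columns
selected_columns = X_demo.columns[selector.get_support()]
print(list(selected_columns))
print(X_selected.shape)
```

In a real workflow the selector would normally be fitted on the training split only, so that the test rows do not influence which features are kept.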


- Now let's use the best model we got so far in order to see how it can improve when we fine-tune its hyperparameters.

In [13]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the baseline model with default hyperparameters
model = RandomForestClassifier(random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate with accuracy (fraction of correct predictions)
accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline model accuracy: {accuracy:.4f}")


Baseline model accuracy: 0.7734


In [14]:
# Logistic Regression

from sklearn.linear_model import LogisticRegression

# Create the logistic regression model
logreg = LogisticRegression(random_state=42, max_iter=1000)  # The solver sometimes needs more iterations to converge

# Train
logreg.fit(X_train, y_train)

# Predict
y_pred_log = logreg.predict(X_test)

# Evaluate
accuracy_log = accuracy_score(y_test, y_pred_log)
print(f"Logistic Regression accuracy: {accuracy_log:.4f}")


Logistic Regression accuracy: 0.7763


I will use Logistic Regression because it achieved the better test accuracy (0.7763 vs. 0.7734 for Random Forest).
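Because the two accuracies come from a single train/test split and differ by less than half a percentage point, a cross-validated comparison can give a steadier picture. The sketch below uses `cross_val_score` on synthetic data from `make_classification`, so the printed numbers say nothing about the Spaceship Titanic dataset. The names `X_demo`, `y_demo`, and `cv_summary` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the lab data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Score each model on 5 folds and keep the mean and standard deviation
cv_summary = {}
for name, estimator in models.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the gap between the two means is smaller than the fold-to-fold standard deviation, the models are hard to tell apart on accuracy alone.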

**Grid/Random Search**

For this lab we will use Grid Search.
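For reference, the random-search alternative named in the heading could look roughly like the sketch below. It samples `C` from a log-uniform distribution instead of a fixed list. The data is synthetic, and the names `X_demo`, `y_demo`, and `random_search` and the value `n_iter=8` are hypothetical choices for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the lab data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Distributions are sampled; lists are sampled uniformly
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}

random_search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=8,            # number of sampled candidates
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)
```

Random search fixes the number of candidates up front with `n_iter`, which helps when the grid would otherwise grow too large to evaluate exhaustively.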

- Define hyperparameters to fine tune.

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the base model
logreg = LogisticRegression(random_state=42, max_iter=1000)

# Define the hyperparameters to try
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],          # regularization strengths (smaller C = stronger regularization)
    'penalty': ['l1', 'l2'],               # regularization types
    'solver': ['liblinear']                 # solver that supports both l1 and l2
}

# Create the Grid Search object with 5-fold cross-validation
grid_search = GridSearchCV(estimator=logreg,
                           param_grid=param_grid,
                           cv=5,                   # 5-fold cross-validation
                           scoring='accuracy',     # metric to optimize
                           verbose=1,              # print progress
                           n_jobs=-1)              # use all CPU cores
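A variation worth knowing, which differs from the approach in this lab, is to put the scaler and the model in a `Pipeline` and tune the pipeline. The scaler is then refitted on the training part of each fold, so validation rows never influence the scaling. The sketch below uses synthetic data, and the step names `scaler` and `logreg` and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the lab data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>
pipe_grid = {"logreg__C": [0.1, 1, 10]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="accuracy")
pipe_search.fit(X_demo, y_demo)
print(pipe_search.best_params_)
```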



- Run Grid Search

In [16]:
# Run Grid Search on the training data
grid_search.fit(X_train, y_train)

# Show the best hyperparameters found
print("Best hyperparameters:", grid_search.best_params_)



Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best hyperparameters: {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
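The "10 candidates" and "50 fits" in the log follow directly from the grid: 5 values of `C` times 2 penalties times 1 solver gives 10 combinations, and each is fitted once per fold. A small sketch using `ParameterGrid` reproduces the count (the names `candidates` and `n_folds` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the lab
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}

# ParameterGrid enumerates every combination that GridSearchCV will try
candidates = list(ParameterGrid(param_grid))
n_folds = 5

print("Candidates:", len(candidates))
print("Total fits:", len(candidates) * n_folds)
```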


- Evaluate your model

In [17]:
# Evaluate the best model on the test data
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Accuracy with the best hyperparameters: {accuracy_best:.4f}")


Accuracy with the best hyperparameters: 0.7769
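To judge whether a small gain like this is meaningful, it can help to look at the cross-validation scores of every candidate rather than only the winner. The sketch below reads `cv_results_` from a fitted search into a DataFrame. It runs on synthetic data, so the values are not the lab's, and the names `demo_search` and `results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the lab data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_tr, y_tr)

# One row per candidate, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)

# best_score_ is the mean CV accuracy of the winner; score() uses the held-out test set
print("Best CV accuracy:", demo_search.best_score_)
print("Held-out test accuracy:", demo_search.score(X_te, y_te))
```

When several candidates have mean scores within one standard deviation of each other, the choice among them has little practical effect on accuracy.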
