# LAB | Hyperparameter Tuning

**Load the data**

This is the final step to maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The metadata can be found here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then take the best-performing model you have so far and fine-tune its hyperparameters.
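
To make "default values" concrete, here is a minimal sketch that prints a few defaults of `GradientBoostingClassifier`, the model used later in this lab. The variable names `default_model` and `defaults` are only illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Instantiate the model without arguments to see what "default" means
default_model = GradientBoostingClassifier()
defaults = default_model.get_params()

# A few hyperparameters that are commonly tuned for gradient boosting
for name in ["n_estimators", "learning_rate", "max_depth",
             "min_samples_split", "min_samples_leaf"]:
    print(f"{name}: {defaults[name]}")
```

Any value you do not pass explicitly stays at the default shown by `get_params()`.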

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same preprocessing as before:
- Feature Scaling
- Feature Selection


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv"
df = pd.read_csv(url)

# Drop PassengerId and Name columns
df.drop(columns=['PassengerId', 'Name'], inplace=True)

# Keep only the first letter of Cabin (the deck category).
# astype(str) turns missing cabins into "nan", so they end up in their own "n" category.
df['Cabin'] = df['Cabin'].astype(str).str[0]

# Convert categorical variables to dummies
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Define features (X) and target (y)
X = df.drop(columns=['Transported'])
y = df['Transported']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

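# Fill missing numerical values with the training-set median
# (reuse the training medians for the test set to avoid data leakage)
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)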

# Feature Scaling (Standardization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling and engineering completed successfully!")

Feature scaling and engineering completed successfully!
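
One way to keep imputation and scaling tied to the training data is to wrap both in a scikit-learn pipeline. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the Spaceship Titanic features, and the names `X_demo`, `preprocess`, `X_tr_scaled` and `X_te_scaled` are assumptions, not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

# Blank out roughly 5% of the values to mimic missing data
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Median imputation followed by standardization
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Medians, means and standard deviations are learned from the training split only
X_tr_scaled = preprocess.fit_transform(X_tr)
# The test split is transformed with the statistics learned above
X_te_scaled = preprocess.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Calling `fit_transform` only on the training split and `transform` on the test split means no test-set statistic influences the preprocessing.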


- Now let's use the best model we have so far to see how much it improves when we fine-tune its hyperparameters.

In [4]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df.drop(columns=['Transported'])  # Features (independent variables)
y = df['Transported']  # Target variable

# Split the data into 80% training and 20% testing.
# Same random_state and stratify as before, so the rows match X_train_scaled / X_test_scaled.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Print shapes to verify split
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

print("Train-Test Split completed successfully!")

X_train shape: (6954, 20)
X_test shape: (1739, 20)
y_train shape: (6954,)
y_test shape: (1739,)
Train-Test Split completed successfully!


In [6]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

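# Initialize the baseline Gradient Boosting model
# (n_estimators, learning_rate and max_depth are scikit-learn's defaults, written out explicitly)
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)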

# Train the model
gb_clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_gb = gb_clf.predict(X_test_scaled)

- Evaluate your model

In [7]:
# Evaluate model performance
accuracy_gb = accuracy_score(y_test, y_pred_gb)
class_report_gb = classification_report(y_test, y_pred_gb)

# Print results
print(f"Gradient Boosting Accuracy: {accuracy_gb:.4f}")
print("\nGradient Boosting Classification Report:\n", class_report_gb)

Gradient Boosting Accuracy: 0.8033

Gradient Boosting Classification Report:
               precision    recall  f1-score   support

       False       0.82      0.77      0.80       863
        True       0.79      0.84      0.81       876

    accuracy                           0.80      1739
   macro avg       0.80      0.80      0.80      1739
weighted avg       0.80      0.80      0.80      1739



**Grid/Random Search**

For this lab we will use Grid Search.
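
For comparison, random search samples a fixed number of candidates from distributions instead of trying every combination. The following is a minimal sketch of that alternative on synthetic data. The ranges in `param_distributions`, the value `n_iter=5` and the name `X_demo` are assumptions chosen to keep the example fast, not recommended settings.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Distributions to sample from (illustrative ranges)
param_distributions = {
    "n_estimators": randint(20, 61),       # integers from 20 to 60
    "learning_rate": uniform(0.01, 0.19),  # floats between 0.01 and 0.20
    "max_depth": randint(2, 5),            # integers from 2 to 4
}

random_search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled candidates
    scoring="accuracy",
    cv=3,
    random_state=42,   # makes the sampled candidates reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(f"{random_search.best_score_:.4f}")
```

The cost of random search is controlled by `n_iter`, whereas the cost of grid search grows with the product of the list lengths in the grid.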

- Define the hyperparameters to fine-tune.

In [10]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Define the model
gb_clf = GradientBoostingClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200],  # Number of boosting stages (trees)
    'learning_rate': [0.01, 0.05, 0.1],  # Shrinkage applied to each tree's contribution
    'max_depth': [3, 5, 7],  # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 3, 5]  # Minimum samples required at a leaf node
}
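
Before launching a grid search it helps to know how many models it will train. A small sketch using `ParameterGrid` counts the combinations in the grid above; `cv_folds = 3` mirrors the 3-fold cross-validation used in this lab, and the names `n_candidates` and `cv_folds` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 5]
}

# Number of distinct hyperparameter combinations: 2 * 3 * 3 * 3 * 3
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold
cv_folds = 3
print(f"{n_candidates} candidates, {n_candidates * cv_folds} fits")
# prints: 162 candidates, 486 fits
```

Adding one more value to any list multiplies the total, so grids grow quickly.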

- Run Grid Search

In [11]:
# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=gb_clf, 
    param_grid=param_grid, 
    scoring='accuracy', 
    cv=3,  # 3-fold cross-validation
    n_jobs=-1,  # Use all CPU cores
    verbose=2
)

# Fit Grid Search to the scaled training data
grid_search.fit(X_train_scaled, y_train)

# Print best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

Fitting 3 folds for each of 162 candidates, totalling 486 fits
Best Parameters: {'learning_rate': 0.05, 'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 100}
Best Accuracy: 0.7970951970089158


In [12]:
print(grid_search.best_params_)
print(f"Best Accuracy: {grid_search.best_score_:.4f}")

{'learning_rate': 0.05, 'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 100}
Best Accuracy: 0.7971
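
Note that `best_score_` is the mean cross-validated accuracy of the best candidate on the training data, so it is not directly comparable to a single held-out test accuracy. A like-for-like baseline scores the untuned model with the same number of folds. The sketch below shows the idea on synthetic data; `X_demo`, `baseline` and `n_estimators=50` are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Untuned model, scored with the same cv and metric a grid search would use
baseline = GradientBoostingClassifier(n_estimators=50, random_state=42)
baseline_scores = cross_val_score(baseline, X_demo, y_demo, scoring="accuracy", cv=3)

print(f"Baseline CV accuracy: {baseline_scores.mean():.4f}")
```

Comparing this mean with `grid_search.best_score_` isolates the effect of tuning from the effect of which rows landed in the test split.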


- Evaluate your model

In [13]:
# Train the final Gradient Boosting model with the best parameters found in Grid Search
final_gb_clf = GradientBoostingClassifier(
    learning_rate=0.05,
    max_depth=5,
    min_samples_leaf=3,
    min_samples_split=2,
    n_estimators=100,
    random_state=42
)

# Fit the model
final_gb_clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_final = final_gb_clf.predict(X_test_scaled)

# Evaluate model performance
accuracy_final = accuracy_score(y_test, y_pred_final)
class_report_final = classification_report(y_test, y_pred_final)

# Print results
print(f"Final Gradient Boosting Accuracy: {accuracy_final:.4f}")
print("\nFinal Gradient Boosting Classification Report:\n", class_report_final)

Final Gradient Boosting Accuracy: 0.8085

Final Gradient Boosting Classification Report:
               precision    recall  f1-score   support

       False       0.82      0.78      0.80       863
        True       0.79      0.84      0.81       876

    accuracy                           0.81      1739
   macro avg       0.81      0.81      0.81      1739
weighted avg       0.81      0.81      0.81      1739
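
Instead of retyping the winning values by hand, the fitted search object can supply them. The sketch below, on synthetic data with a deliberately tiny grid, shows two equivalent routes; `small_grid`, `search`, `best_model` and `manual_model` are illustrative names, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Tiny grid so the example runs quickly
small_grid = {"max_depth": [2, 3], "n_estimators": [20, 40]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    small_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_tr, y_tr)

# Route 1: with the default refit=True, best_estimator_ is already
# retrained on the whole training split with the best parameters
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))

# Route 2: unpack best_params_ into a fresh model and fit it yourself
manual_model = GradientBoostingClassifier(random_state=42, **search.best_params_)
manual_model.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Either route avoids copy-paste errors if the grid or the data changes and the search is rerun.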

