# LAB | Hyperparameter Tuning

**Load the data**

This is the final step to maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have so far and fine-tune their hyperparameters.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
spaceship_cleaned = pd.read_csv("spaceship_cleaned.csv")
spaceship_cleaned.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_T
0,39,0,0,0,0,0,0,1,0,0,...,0,1,0,1,0,0,0,0,0,0
1,24,109,9,25,549,44,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0
2,58,43,3576,0,6715,49,0,1,0,0,...,0,1,1,0,0,0,0,0,0,0
3,33,0,1283,371,3329,193,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
4,16,303,70,151,565,2,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0


Now perform the same preprocessing steps as before:
- Feature Scaling
- Feature Selection


In [5]:
# Separate the features from the target column
X = spaceship_cleaned.drop(columns=['Transported'])
y = spaceship_cleaned['Transported']

# Hold out 20% of the rows for testing (train_test_split was imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [6]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit on training data and transform
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data
X_test_scaled = scaler.transform(X_test)
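
As a quick sanity check on the scaling step, the sketch below uses a small made-up array (not the Spaceship Titanic data) to show that the scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows. The names `toy_train`, `toy_test` and `toy_scaler` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration only
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[4.0, 40.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from the training rows
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training mean/std

print(toy_train_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_train_scaled.std(axis=0))   # approximately 1 for each column
print(toy_test_scaled)                # test values may fall outside the training range
```

Fitting the scaler on the training split only keeps information from the test rows out of the preprocessing step.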


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [None]:
# RANDOM FOREST

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_clf = RandomForestClassifier(
    n_estimators=100,      # Number of trees in the forest
    max_depth=None,        # Expand until all leaves are pure or contain less than min_samples_split samples
    random_state=42,       # Reproducibility
    n_jobs=-1              # Utilize all available CPUs
)

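# Tree-based models are largely insensitive to feature scale, so the unscaled features are used here
rf_clf.fit(X_train, y_train)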

y_pred_rf = rf_clf.predict(X_test)

- Evaluate your model

In [8]:
rf_accuracy = accuracy_score(y_test, y_pred_rf) * 100
print(f"Random Forest Accuracy: {rf_accuracy:.2f}%")

Random Forest Accuracy: 81.62%


```
n_estimators: Number of trees in the forest.
Common values: [50, 100, 200, 500]

max_depth: Maximum depth of the trees.
Common values: [10, 20, 50, None]

min_samples_split: Minimum number of samples required to split a node.
Common values: [2, 5, 10]

min_samples_leaf: Minimum number of samples required at a leaf node.
Common values: [1, 2, 4]

max_features: Number of features to consider for the best split.
Common values: ['sqrt', 'log2', None]   ('auto' was removed in scikit-learn 1.3)

bootstrap: Whether bootstrap samples are used when building trees.
Options: [True, False]
```
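
To see which values the baseline model actually used for the hyperparameters listed above, one option is to read them from an untrained estimator with `get_params()`. This is a minimal sketch; the list `tuned_names` is an illustrative helper, and the printed defaults depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters discussed in this lab (illustrative helper list)
tuned_names = [
    "n_estimators", "max_depth", "min_samples_split",
    "min_samples_leaf", "max_features", "bootstrap",
]

# get_params() returns every constructor argument with its current value
defaults = RandomForestClassifier().get_params()
for name in tuned_names:
    print(f"{name}: {defaults[name]!r}")
```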


**Grid/Random Search**

For this lab we will use Grid Search.
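
For comparison, here is a hedged sketch of the Random Search alternative. Instead of trying every combination, `RandomizedSearchCV` samples a fixed number of combinations (`n_iter`) from lists or distributions. The data below is synthetic (generated with `make_classification`, not the Spaceship Titanic data), and the ranges in `param_distributions` are assumptions chosen only to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the lab dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Lists or distributions to sample from, instead of a fixed grid
param_distributions = {
    "n_estimators": randint(20, 60),      # integers from 20 to 59
    "max_depth": [5, 10, None],
    "min_samples_split": randint(2, 11),  # integers from 2 to 10
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # only 5 sampled combinations are evaluated
    cv=3,
    scoring="accuracy",
    random_state=42,     # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(f"Best CV accuracy: {random_search.best_score_:.3f}")
```

Random Search tends to be the cheaper choice when the grid would otherwise contain hundreds of combinations, at the cost of not covering every one of them.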

- Define the hyperparameters to fine-tune.

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100],        # Number of trees
    'max_depth': [10, None],            # Maximum depth of the trees
    'min_samples_split': [2, 5],      # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2],        # Minimum number of samples required at a leaf node
    'max_features': ['sqrt', 'log2'],     # Number of features to consider for the best split
    'bootstrap': [True, False]            # Whether to use bootstrap samples
}

In [22]:
rf = RandomForestClassifier(random_state=42)

- Run Grid Search

In [23]:
grid_search = GridSearchCV(
    estimator=rf,       # The fresh estimator defined in the previous cell
    param_grid=param_grid,
    cv=3,               # Reduced number of folds for faster computation
    scoring='accuracy', # Evaluate based on accuracy
    n_jobs=-1,          # Utilize all available CPUs
    verbose=2           # To track progress
)

grid_search.fit(X_train, y_train)


Fitting 3 folds for each of 64 candidates, totalling 192 fits
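
The numbers in the log line above follow directly from the grid: six hyperparameters with two values each give 2^6 = 64 combinations, and each one is trained once per fold. A small sketch that reproduces the count with `ParameterGrid`, using a copy of the grid from this lab (named `demo_grid` here for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid defined earlier in this lab
demo_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

n_candidates = len(ParameterGrid(demo_grid))  # every combination of the values above
n_folds = 3                                   # matches cv=3 in the grid search

print(n_candidates)            # 64
print(n_candidates * n_folds)  # 192
```

Adding one more value to any list, or one more hyperparameter, multiplies the number of fits, so it is worth checking this count before launching a long search.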


- Evaluate your model

In [24]:
# Use the best estimator to make predictions
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Calculate accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Best RF Model:", accuracy)


Accuracy with Best RF Model: 0.8154311649016641
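
The tuned model scores about 81.5% on the test set, essentially the same as the 81.62% obtained with the default hyperparameters, so it is worth looking at which combination was selected and how the candidates ranked during cross-validation. The sketch below shows the relevant attributes on a small synthetic problem (an assumption made so the example runs on its own); on the lab data, the same attributes would be read from `grid_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the lab dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=42),
    param_grid={"max_depth": [5, None], "min_samples_leaf": [1, 2]},
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# The winning combination and its mean cross-validated accuracy
print(demo_search.best_params_)
print(f"Mean CV accuracy of best candidate: {demo_search.best_score_:.3f}")

# Full table of candidates, best first
results = pd.DataFrame(demo_search.cv_results_)
print(
    results[["params", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
```

If the top candidates have nearly identical `mean_test_score` values, the grid is probably not covering the region where the model is sensitive, and a different set of values may be more informative than a finer grid around the same ones.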
