# LAB | Ensemble Methods

**Load the data**

In this challenge, we will be working with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [153]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [154]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.describe( include = 'all').T

column,count,unique,top,freq,mean,std,min,25%,50%,75%,max
PassengerId,8693.0,8693.0,0001_01,1.0,,,,,,,
HomePlanet,8492.0,3.0,Earth,4602.0,,,,,,,
CryoSleep,8476.0,2.0,False,5439.0,,,,,,,
Cabin,8494.0,6560.0,G/734/S,8.0,,,,,,,
Destination,8511.0,3.0,TRAPPIST-1e,5915.0,,,,,,,
Age,8514.0,,,,28.82793,14.489021,0.0,19.0,27.0,38.0,79.0
VIP,8490.0,2.0,False,8291.0,,,,,,,
RoomService,8512.0,,,,224.687617,666.717663,0.0,0.0,0.0,47.0,14327.0
FoodCourt,8510.0,,,,458.077203,1611.48924,0.0,0.0,0.0,76.0,29813.0
ShoppingMall,8485.0,,,,173.729169,604.696458,0.0,0.0,0.0,27.0,23492.0


In [155]:
# drop rows containing any missing value
spaceship = spaceship.dropna()

# Cleaning the data
spaceship['Transported'] = spaceship['Transported'].astype(int)

# Column `Cabin` is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Drop PassengerId, Name and Destination
spaceship = spaceship.drop(['PassengerId', 'Name', 'Destination'], axis=1)

# One-hot encode the remaining non-numerical columns (dropping the first level of each)
non_numerical_columns = spaceship.select_dtypes(include='object').columns
spaceship = pd.get_dummies(spaceship, columns=non_numerical_columns, drop_first=True)

spaceship.describe( include = 'all').T


column,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Age,6606.0,,,,28.894036,14.533429,0.0,19.0,27.0,38.0,79.0
RoomService,6606.0,,,,222.991674,644.987936,0.0,0.0,0.0,49.0,9920.0
FoodCourt,6606.0,,,,478.958523,1678.592291,0.0,0.0,0.0,82.75,29813.0
ShoppingMall,6606.0,,,,178.356494,576.328407,0.0,0.0,0.0,30.0,12253.0
Spa,6606.0,,,,313.16152,1144.016291,0.0,0.0,0.0,65.0,22408.0
VRDeck,6606.0,,,,303.780048,1127.142166,0.0,0.0,0.0,52.0,20336.0
Transported,6606.0,,,,0.503633,0.500025,0.0,0.0,1.0,1.0,1.0
HomePlanet_Europa,6606.0,2.0,False,4933.0,,,,,,,
HomePlanet_Mars,6606.0,2.0,False,5239.0,,,,,,,
CryoSleep_True,6606.0,2.0,False,4274.0,,,,,,,


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [156]:
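# Feature scaling
# Note: the scaler is fitted on every column, so the target `Transported` is standardized too
# (its values become about -1.007 and 0.993 instead of 0 and 1, as the table below shows).
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
spaceship_scaled = spaceship.copy()
spaceship_scaled[spaceship.columns] = sc.fit_transform(spaceship)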

spaceship_scaled.describe( include = 'all').T


column,count,mean,std,min,25%,50%,75%,max
Age,6606.0,9.035057000000001e-17,1.000076,-1.988259,-0.680829,-0.130333,0.6266,3.447896
RoomService,6606.0,-1.613403e-18,1.000076,-0.345756,-0.345756,-0.345756,-0.26978,15.035541
FoodCourt,6606.0,3.3343660000000003e-17,1.000076,-0.285355,-0.285355,-0.285355,-0.236054,17.476705
ShoppingMall,6606.0,-1.936084e-17,1.000076,-0.309494,-0.309494,-0.309494,-0.257436,20.952563
Spa,6606.0,2.904126e-17,1.000076,-0.273759,-0.273759,-0.273759,-0.216938,19.314857
VRDeck,6606.0,1.936084e-17,1.000076,-0.269534,-0.269534,-0.269534,-0.223396,17.773921
Transported,6606.0,-9.465298000000001e-17,1.000076,-1.007293,-1.007293,0.99276,0.99276,0.99276
HomePlanet_Europa,6606.0,3.8721670000000005e-17,1.000076,-0.582361,-0.582361,-0.582361,1.717147,1.717147
HomePlanet_Mars,6606.0,-4.5175290000000005e-17,1.000076,-0.510811,-0.510811,-0.510811,-0.510811,1.957672
CryoSleep_True,6606.0,8.658597e-17,1.000076,-0.738664,-0.738664,-0.738664,1.353795,1.353795


In [157]:
# Feature selection: separate the features (X) from the target (y)
X = spaceship_scaled.drop('Transported', axis=1)
y = spaceship_scaled['Transported']

**Perform Train Test Split**

In [158]:
# Perform Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
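
The scaling cell above standardizes every column, including `Transported`, and fits the scaler on the full dataset before the split. A minimal sketch of one alternative follows, assuming a small synthetic frame in place of the real data (the `toy` DataFrame and its values are made up for illustration): split first, fit `StandardScaler` on the training features only, and leave the 0/1 target untouched.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned spaceship frame (values are synthetic)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Age": rng.integers(0, 80, size=200).astype(float),
    "RoomService": rng.exponential(200.0, size=200),
    "Transported": rng.integers(0, 2, size=200),
})

X = toy.drop("Transported", axis=1)
y = toy["Transported"]  # the target stays 0/1, it is not scaled

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training features only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.mean().round(6))  # training features are centred
print(sorted(y.unique()))              # target values are unchanged
```

Tree-based ensembles are largely insensitive to feature scaling, so this mainly matters for keeping the comparison with the previous Lab consistent and for keeping the target interpretable.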

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [159]:
# Apply different ensemble methods to try to get a better model: bagging and pasting regressors
from sklearn.ensemble import BaggingRegressor

# Bagging Regressor (bootstrap=True by default, i.e. sampling with replacement)
bagging = BaggingRegressor(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

# Predict the labels
y_pred_train = bagging.predict(X_train)
y_pred_test = bagging.predict(X_test)

# Evaluate the model
from sklearn.metrics import mean_squared_error

print('Bagging Regressor')
print('Train MSE: ', mean_squared_error(y_train, y_pred_train))
print('Test MSE: ', mean_squared_error(y_test, y_pred_test))

Bagging Regressor
Train MSE:  0.22648373373720593
Test MSE:  0.5620103714250981
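
The cell above only covers bagging, since `BaggingRegressor` samples with replacement by default. Pasting uses the same estimator with `bootstrap=False`, so each base model sees a subsample drawn without replacement. The sketch below is one way to compare the two, under these assumptions: synthetic data from `make_classification` stands in for `X` and `y`, the classifier variant `BaggingClassifier` is swapped in because the target is binary, and `max_samples=0.8` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging: each tree sees a sample drawn WITH replacement
bagging_clf = BaggingClassifier(
    n_estimators=100, max_samples=0.8, bootstrap=True, random_state=42
)
# Pasting: each tree sees a sample drawn WITHOUT replacement
pasting_clf = BaggingClassifier(
    n_estimators=100, max_samples=0.8, bootstrap=False, random_state=42
)

results = {}
for name, model in [("bagging", bagging_clf), ("pasting", pasting_clf)]:
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)  # accuracy for classifiers
    print(f"{name}: test accuracy = {results[name]:.3f}")
```

With `bootstrap=False`, `max_samples` has to be below 1.0 for the base models to differ in their training data; at 1.0 every tree would be trained on the full training set.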


- Random Forests

In [160]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest.fit(X_train, y_train)

# Print scores (for regressors, .score() returns R^2, where higher is better)
print('Random Forest Regressor')
print('Train score: ', random_forest.score(X_train, y_train))
print('Test score: ', random_forest.score(X_test, y_test))

Random Forest Regressor
Train score:  0.7737737029587954
Test score:  0.43800517467069566
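
The gap between the train score (0.77) and the test score (0.44) suggests the forest is overfitting. One common way to probe this is to limit tree depth and watch how both scores move. The sketch below assumes synthetic data from `make_classification` in place of the lab's `X` and `y`, and the depth value 4 is an arbitrary illustration, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the lab's X and y (binary target, as in the lab)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gaps = {}
for max_depth in (None, 4):  # None = fully grown trees (the default)
    forest = RandomForestRegressor(
        n_estimators=100, max_depth=max_depth, random_state=42
    )
    forest.fit(X_train, y_train)
    train_r2 = forest.score(X_train, y_train)  # R^2 for regressors
    test_r2 = forest.score(X_test, y_test)
    gaps[max_depth] = (train_r2, test_r2)
    print(f"max_depth={max_depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```

A smaller difference between the two scores at a similar test score would indicate that the shallower forest generalizes about as well while memorizing less of the training set.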


- Gradient Boosting

In [161]:
# Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor

gradient_boosting = GradientBoostingRegressor(n_estimators=100, random_state=42)
gradient_boosting.fit(X_train, y_train)

# Print scores (for regressors, .score() returns R^2, where higher is better)
print('Gradient Boosting Regressor')
print('Train score: ', gradient_boosting.score(X_train, y_train))
print('Test score: ', gradient_boosting.score(X_test, y_test))

Gradient Boosting Regressor
Train score:  0.49163318950615653
Test score:  0.4580762744539145


- Adaptive Boosting

In [162]:
# Adaptive Boosting
from sklearn.ensemble import AdaBoostRegressor

ada_boost = AdaBoostRegressor(n_estimators=100, random_state=42)
ada_boost.fit(X_train, y_train)

# Print scores (for regressors, .score() returns R^2, where higher is better)
print('AdaBoost Regressor')
print('Train score: ', ada_boost.score(X_train, y_train))
print('Test score: ', ada_boost.score(X_test, y_test))

AdaBoost Regressor
Train score:  0.3563547716331642
Test score:  0.3602622333771275


Which model is the best and why?

In [None]:
# comment here

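# The 0.56 reported for the Bagging Regressor is a test MSE (lower is better), whereas the other
# three cells report R^2 from .score() (higher is better), so the numbers are not directly comparable.
# Among the models scored with R^2, Gradient Boosting has the highest test score (0.458 vs. 0.438 for
# Random Forest and 0.360 for AdaBoost) and a small train/test gap (0.492 vs. 0.458), so it looks
# like the best of the three. Random Forest overfits (0.774 train vs. 0.438 test).
# Bagging also overfits (train MSE 0.226 vs. test MSE 0.562); to rank it against the others,
# it should be evaluated with the same metric.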
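
To compare the four ensembles fairly, each one can be evaluated with the same metrics on the same test set. The sketch below is one way to do that, assuming synthetic data from `make_classification` in place of the lab's `X` and `y`; the `summary` dictionary and its keys are illustrative names. It also checks the identity R^2 = 1 - MSE / Var(y_test), which is why ranking by test MSE and ranking by test R^2 agree on a fixed test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the lab's X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Bagging": BaggingRegressor(n_estimators=100, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=42),
}

summary = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    # R^2 and MSE are tied together through the variance of the test target
    assert np.isclose(r2, 1 - mse / np.var(y_test))
    summary[name] = {"test_mse": mse, "test_r2": r2}
    print(f"{name:18s} test MSE={mse:.3f}  test R2={r2:.3f}")

# The model with the highest test R^2 is also the one with the lowest test MSE
best = max(summary, key=lambda n: summary[n]["test_r2"])
print("Best on the test set:", best)
```

Because the target is binary, the classifier counterparts of these estimators, scored with accuracy, would arguably be a more natural choice; the regressors are kept here only to mirror the cells above.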