# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The metadata is described here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [458]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

# Base learner and the ensemble methods compared in this lab
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

In [459]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.describe(include='all').T

|  | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 8693.0 | 8693.0 | 0001_01 | 1.0 |  |  |  |  |  |  |  |
| HomePlanet | 8492.0 | 3.0 | Earth | 4602.0 |  |  |  |  |  |  |  |
| CryoSleep | 8476.0 | 2.0 | False | 5439.0 |  |  |  |  |  |  |  |
| Cabin | 8494.0 | 6560.0 | G/734/S | 8.0 |  |  |  |  |  |  |  |
| Destination | 8511.0 | 3.0 | TRAPPIST-1e | 5915.0 |  |  |  |  |  |  |  |
| Age | 8514.0 |  |  |  | 28.82793 | 14.489021 | 0.0 | 19.0 | 27.0 | 38.0 | 79.0 |
| VIP | 8490.0 | 2.0 | False | 8291.0 |  |  |  |  |  |  |  |
| RoomService | 8512.0 |  |  |  | 224.687617 | 666.717663 | 0.0 | 0.0 | 0.0 | 47.0 | 14327.0 |
| FoodCourt | 8510.0 |  |  |  | 458.077203 | 1611.48924 | 0.0 | 0.0 | 0.0 | 76.0 | 29813.0 |
| ShoppingMall | 8485.0 |  |  |  | 173.729169 | 604.696458 | 0.0 | 0.0 | 0.0 | 27.0 | 23492.0 |


In [460]:
# Drop rows containing any missing value
spaceship = spaceship.dropna()

# Convert the boolean target to 0/1
spaceship['Transported'] = spaceship['Transported'].astype(int)

# Column `Cabin` is too granular: keep only its first letter, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Drop PassengerId and Name
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)

# One-hot encode the non-numerical columns
non_numerical_columns = spaceship.select_dtypes(include='object').columns

spaceship = pd.get_dummies(spaceship, columns=non_numerical_columns, drop_first=True)

spaceship.describe(include='all').T


|  | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 6606.0 |  |  |  | 28.894036 | 14.533429 | 0.0 | 19.0 | 27.0 | 38.0 | 79.0 |
| RoomService | 6606.0 |  |  |  | 222.991674 | 644.987936 | 0.0 | 0.0 | 0.0 | 49.0 | 9920.0 |
| FoodCourt | 6606.0 |  |  |  | 478.958523 | 1678.592291 | 0.0 | 0.0 | 0.0 | 82.75 | 29813.0 |
| ShoppingMall | 6606.0 |  |  |  | 178.356494 | 576.328407 | 0.0 | 0.0 | 0.0 | 30.0 | 12253.0 |
| Spa | 6606.0 |  |  |  | 313.16152 | 1144.016291 | 0.0 | 0.0 | 0.0 | 65.0 | 22408.0 |
| VRDeck | 6606.0 |  |  |  | 303.780048 | 1127.142166 | 0.0 | 0.0 | 0.0 | 52.0 | 20336.0 |
| Transported | 6606.0 |  |  |  | 0.503633 | 0.500025 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| HomePlanet_Europa | 6606.0 | 2.0 | False | 4933.0 |  |  |  |  |  |  |  |
| HomePlanet_Mars | 6606.0 | 2.0 | False | 5239.0 |  |  |  |  |  |  |  |
| CryoSleep_True | 6606.0 | 2.0 | False | 4274.0 |  |  |  |  |  |  |  |
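
To make the cleaning steps above easier to follow, here is a minimal sketch on a toy DataFrame. The three rows are invented for illustration; only the column names mirror the real data. It shows what keeping the first letter of `Cabin` and calling `pd.get_dummies` with `drop_first=True` do to the columns.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names mirror the lab data.
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/1/S", "G/734/S"],
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "Age": [39.0, 24.0, 58.0],
})

# Keep only the first character of the cabin string.
toy["Cabin"] = toy["Cabin"].str[0]

# drop_first=True drops the alphabetically first level of each encoded column
# (here Cabin "B" and HomePlanet "Earth"); that level becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["Cabin", "HomePlanet"], drop_first=True)

print(encoded.columns.tolist())
# ['Age', 'Cabin_F', 'Cabin_G', 'HomePlanet_Europa', 'HomePlanet_Mars']
```

This is why the encoded table has `HomePlanet_Europa` and `HomePlanet_Mars` but no `HomePlanet_Earth` column: a row with both dummies set to false is an Earth passenger.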


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [461]:
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)

# 1. Normalize: KNN, SVM, neural networks (distance-based models)
# 2. Standardize: linear regressions
# 3. Neither normalize nor standardize (ensembles): decision trees, random forest, gradient boosting

normalizer = MinMaxScaler()

normalizer.fit(X_train)

X_train_norm = normalizer.transform(X_train) 

X_test_norm = normalizer.transform(X_test)

X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

|  | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_True | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.405063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.050633 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.379747 | 0.0 | 0.007916 | 0.0 | 0.051276 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.21519 | 0.00131 | 0.0 | 0.046111 | 0.016378 | 4.9e-05 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.329114 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |


In [462]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

|  | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_True | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.632911 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.227848 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.189873 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.658228 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.78481 | 0.0 | 0.054775 | 0.0 | 0.07774 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
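
The scaler above is fitted on the training split only and then reused on the test split. The sketch below illustrates that pattern on two tiny invented arrays (`train` and `test` are hypothetical, not taken from the lab data). It shows that test values outside the training range land outside [0, 1], which is expected and not a bug.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented one-feature arrays, used only to illustrate the fit/transform split.
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[5.0], [30.0]])

# Learn min and max from the training data only, so no test information leaks in.
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)  # spans exactly [0, 1]
test_scaled = scaler.transform(test)    # uses the training min/max

# 5 maps to (5 - 0) / (20 - 0) = 0.25 and 30 maps to 30 / 20 = 1.5
print(test_scaled.ravel())
```

Fitting the scaler on the full dataset before splitting would let test-set statistics influence the training features, so the order used in the cell above (split first, then fit on `X_train`) is the safer one.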


**Perform Train Test Split**

In [463]:
# The train/test split was already performed above, before scaling
# (test_size=0.20, random_state=0), so it is not repeated here.

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [464]:

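# Bagging: 1000 decision trees, each trained on 100 rows sampled with replacement (bootstrap=True).
# Setting bootstrap=False instead would give pasting (sampling without replacement).
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=1000,
    max_samples=100, bootstrap=True, n_jobs=-1
)
bag_clf.fit(X_train_norm, y_train)

# Evaluate the model on the test set
y_pred = bag_clf.predict(X_test_norm)
accuracy_score(y_test, y_pred)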

0.7874432677760969
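
The cell above covers bagging only. As a rough sketch of the pasting half of this step, the code below trains the same kind of ensemble twice, once with `bootstrap=True` (bagging) and once with `bootstrap=False` (pasting). It runs on synthetic data from `make_classification` rather than the Spaceship Titanic features, and the helper `subsample_ensemble` and the smaller `n_estimators=100` are assumptions made to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the lab dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)


def subsample_ensemble(bootstrap):
    """Hypothetical helper: trees trained on 100-row subsamples each."""
    return BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=100,
        max_samples=100,
        bootstrap=bootstrap,  # True = bagging, False = pasting
        random_state=0,
    )


scores = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    clf = subsample_ensemble(bootstrap).fit(Xtr, ytr)
    scores[name] = accuracy_score(yte, clf.predict(Xte))

print(scores)
```

To try pasting in the lab itself, the only change to the `bag_clf` cell would be `bootstrap=False`.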

- Random Forests

In [465]:
# Random Forest Classifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train_norm, y_train)


In [466]:
# Evaluate the model
y_pred_rf = rnd_clf.predict(X_test_norm)
accuracy_score(y_test, y_pred_rf)

0.7723146747352496
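
A fitted random forest also exposes `feature_importances_`, which can help explain what the model relies on. The sketch below shows how to read them. It uses synthetic data, and the feature names `f0` to `f5` are placeholders, so the ranking it prints says nothing about the Spaceship Titanic columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=0
).fit(X_demo, y_demo)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

In the lab, the equivalent call would pair `rnd_clf.feature_importances_` with `X_train_norm.columns`.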

- Gradient Boosting

In [467]:
# Gradient Boosting: only 3 shallow trees (max_depth=2) with no shrinkage (learning_rate=1.0)
gbrt = GradientBoostingClassifier(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_train_norm, y_train)

In [468]:
# Evaluate the model
y_pred_gb = gbrt.predict(X_test_norm)
accuracy_score(y_test, y_pred_gb)

0.7662632375189108
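
With only three trees, the gradient boosting model above is likely underfit. One way to choose the number of trees is to fit a larger ensemble once and score every intermediate stage with `staged_predict`. The sketch below does this on synthetic data with a separate validation split. The values `n_estimators=200` and `learning_rate=0.1` are illustrative assumptions, not tuned settings for this lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the lab dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
Xtr, Xval, ytr, yval = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

gb = GradientBoostingClassifier(
    max_depth=2, n_estimators=200, learning_rate=0.1, random_state=0
)
gb.fit(Xtr, ytr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 200 trees.
val_acc = [accuracy_score(yval, pred) for pred in gb.staged_predict(Xval)]

# First stage that reaches the highest validation accuracy (1-based tree count).
best_n = int(np.argmax(val_acc)) + 1
print(best_n, max(val_acc))
```

The selection should be made on a validation split carved out of the training data, not on the test set, so that the final test accuracy stays an honest estimate.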

- Adaptive Boosting

In [469]:
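# Adaptive Boosting with decision stumps (max_depth=1) as weak learners
# Note: algorithm="SAMME.R" is deprecated in recent scikit-learn releases; drop the argument if it raises an error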
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5
)
ada_clf.fit(X_train_norm, y_train)



In [470]:
# Evaluate the model
y_pred_ada = ada_clf.predict(X_test_norm)
accuracy_score(y_test, y_pred_ada)

0.7874432677760969
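
To see how AdaBoost improves as stumps are added, `staged_score` reports the accuracy after each boosting round. The sketch below uses synthetic data and leaves out the `algorithm` argument so that it relies on whatever default the installed scikit-learn version provides. That omission is an assumption made for portability, not the configuration used in the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the lab dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Decision stumps as weak learners; `algorithm` is left at the library default.
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(Xtr, ytr)

# One accuracy value per boosting round, from the first stump to the full ensemble.
stage_acc = list(ada.staged_score(Xte, yte))
print(stage_acc[0], stage_acc[-1])
```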

Which model is the best and why?

In [471]:
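# Bagging and Adaptive Boosting tie for the best test accuracy (0.7874).
# Random Forest (0.7723) and Gradient Boosting (0.7663) score lower with the settings used here.
# None of the models was tuned: Gradient Boosting used only 3 trees and Random Forest was capped
# at 16 leaf nodes per tree, which may explain part of the gap.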
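
The scores compared above come from a single train/test split and differ by about 0.02 at most, so the ranking could change with a different split. A hedged way to compare the four model families more robustly is k-fold cross-validation. The sketch below runs on synthetic data with reduced `n_estimators` to stay fast. The dictionary `models` and its settings are illustrative assumptions, not the exact configurations from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the lab dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

# Illustrative settings, smaller than in the lab so the example runs quickly.
models = {
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, max_samples=100, random_state=0
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_leaf_nodes=16, random_state=0
    ),
    "gradient_boosting": GradientBoostingClassifier(
        max_depth=2, n_estimators=50, random_state=0
    ),
    "ada_boost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),
        n_estimators=50,
        learning_rate=0.5,
        random_state=0,
    ),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: mean and spread across folds.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

If two models' mean scores differ by less than the spread across folds, the data do not clearly favor one over the other.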