# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [31]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [32]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [33]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [34]:
#your code here
X_num = spaceship.select_dtypes(include='number')
X_cat = spaceship.select_dtypes(include='object')
X_num.shape , X_cat.shape

((8693, 6), (8693, 7))
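
The per-column types can also be listed directly with `dtypes`. The sketch below uses a small hypothetical frame (not the real dataset) to illustrate why a True/False column that contains missing values, such as `CryoSleep` or `VIP`, ends up counted among the `object` columns, while a complete boolean column keeps the `bool` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few Spaceship Titanic columns.
toy = pd.DataFrame({
    "CryoSleep": [False, True, np.nan],    # booleans mixed with NaN -> object
    "Age": [39.0, 24.0, np.nan],           # floats with NaN -> float64
    "Transported": [False, True, False],   # booleans without NaN -> bool
})

# Map each column name to the name of its dtype.
dtype_summary = toy.dtypes.astype(str).to_dict()
print(dtype_summary)
```

This is why `select_dtypes(include='object')` above returns seven columns rather than only the text ones.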

**Check for missing values**

In [35]:
#your code here
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical columns).
- Filling all missing values with an algorithm.

For this exercise, because each column has relatively few null values, we will drop every row that contains a missing value.
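
To make the first two strategies concrete, here is a minimal sketch on a hypothetical four-row frame (the column names are borrowed from the dataset, the values are made up). It compares dropping incomplete rows with filling the numeric column by its mean and the categorical column by its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Strategy 1: drop every row with at least one missing value.
dropped = toy.dropna()

# Strategy 2: fill the numeric column with its mean and the
# categorical column with its most frequent value (mode).
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(len(toy), len(dropped), int(filled.isnull().sum().sum()))
```

Dropping is the simplest option but discards whole rows; filling keeps every row at the cost of introducing estimated values.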

In [36]:
#your code here
spaceship.dropna(inplace=True)
spaceship.shape

(6606, 14)

- **Cabin** is too granular. Transform it so that it keeps only the first component of the value, which is one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}.

In [37]:
#your code here
spaceship['Cabin'] = spaceship['Cabin'].apply(lambda x : x.split('/')[0])
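
The same transformation can be written with pandas' vectorized string methods instead of `apply`. The sketch below runs on the four distinct `Cabin` values visible in the `head()` output above; the variable names are illustrative only.

```python
import pandas as pd

# Cabin values taken from the first rows shown earlier.
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# Split on "/" and keep the first piece of each value.
deck = cabins.str.split("/").str[0]
print(deck.tolist())  # ['B', 'F', 'A', 'F']
```

Unlike the `lambda` version, `.str` methods pass missing values through as `NaN` instead of raising an error, which can matter if the rows with nulls have not been dropped first.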

- Drop PassengerId and Name

In [38]:
#your code here
spaceship.drop(columns=['PassengerId','Name'], inplace=True)

- For non-numerical columns, create dummy variables.

In [39]:
#your code here
X_num = spaceship.select_dtypes(include='number')
X_cat = spaceship.select_dtypes(include='object')

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()

# Fit the encoder and transform the categorical data
X_cat_encoded = encoder.fit_transform(X_cat)

# Convert the result to a DataFrame for easier inspection (optional)
X_cat_encoded_df = pd.DataFrame(X_cat_encoded.toarray(), columns=encoder.get_feature_names_out())
X_cat_encoded_df

,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6601,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
6602,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
6603,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
6604,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
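
As a possible alternative to `OneHotEncoder`, `pd.get_dummies` produces the same kind of indicator columns directly as a DataFrame. The following is a sketch on a hypothetical three-row frame, not the lab's required approach.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "Cabin": ["B", "F", "A"],
    "Age": [39.0, 24.0, 58.0],
})

# Encode only the listed columns; the remaining columns are left untouched.
dummies = pd.get_dummies(toy, columns=["HomePlanet", "Cabin"], dtype=float)
print(dummies.columns.tolist())
```

One practical difference is that `get_dummies` keeps the original row index, so the result lines up with the numeric columns without resetting indexes. `OneHotEncoder`, on the other hand, remembers the categories it was fitted on and can be reused on new data.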


**Perform Train Test Split**

In [40]:
spaceship

,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,G,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,G,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,E,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


In [74]:
from sklearn.preprocessing import StandardScaler

standarizer = StandardScaler().fit(X_num)
x_standarized = standarizer.transform(X_num)
df_standarizado = pd.DataFrame(x_standarized, columns=X_num.columns)
df_standarizado

,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,0.695413,-0.345756,-0.285355,-0.309494,-0.273759,-0.269534
1,-0.336769,-0.176748,-0.279993,-0.266112,0.206165,-0.230494
2,2.002842,-0.279083,1.845163,-0.309494,5.596357,-0.226058
3,0.282540,-0.345756,0.479034,0.334285,2.636384,-0.098291
4,-0.887266,0.124056,-0.243650,-0.047470,0.220152,-0.267759
...,...,...,...,...,...,...
6601,0.833037,-0.345756,3.777285,-0.309494,1.162518,-0.203876
6602,-0.749641,-0.345756,-0.285355,-0.309494,-0.273759,-0.269534
6603,-0.199145,-0.345756,-0.285355,2.938900,-0.272885,-0.269534
6604,0.213728,-0.345756,0.339621,-0.309494,0.034826,2.600774


In [75]:
#your code here
df_standarizado.reset_index(drop=True, inplace=True)
X_cat_encoded_df.reset_index(drop=True, inplace=True)

y = spaceship['Age']
# Concatenate column-wise without ignore_index so the feature names are preserved
X = pd.concat([df_standarizado, X_cat_encoded_df], axis=1)

In [76]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
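
In the cells above, the scaler is fitted on all rows before the split. A common alternative is to split first and fit the scaler on the training rows only, so that no statistics from the test rows leak into preprocessing. The sketch below illustrates that order on synthetic data; `X_demo` and `y_demo` are made-up stand-ins, not the Spaceship Titanic features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and a binary target (assumed for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Fit on the training rows only, then apply the same statistics to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this order, the test set plays the role of unseen data for the scaler as well as for the model.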

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [77]:
#your code here
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
knn.score(X_test, y_test)

0.9495298863625894

In [78]:
from sklearn.metrics import r2_score
r2_score(y_test, pred)


0.9495298863625894

- Evaluate your model's performance and comment on it.

In [None]:
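#your code here
# At first I ran the program without normalizing anything and the score was around 0.6; after normalizing everything it is about 0.95.
# KNN is distance-based, so features on larger scales would otherwise dominate the neighbor search; this is why normalizing matters so much here.
# Note that the target used above is Age, which is also one of the standardized features in X, so this R² does not measure how well the model predicts Transported.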
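
Since `Transported` is a True/False label, one natural way to model it is with a KNN classifier scored by accuracy rather than a regressor scored by R². The sketch below shows that setup on synthetic data generated with `make_classification`. The data, the variable names, and the resulting score are illustrative assumptions and say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for an encoded feature matrix and a binary target.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Same number of neighbors as in the lab, but predicting a class label.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_tr, y_tr)

# Accuracy is the fraction of test rows whose label is predicted correctly.
accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"accuracy: {accuracy:.3f}")
```

To apply the same idea to the lab data, the target would be `spaceship['Transported']` and the feature matrix would keep `Age` as an ordinary input column.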