# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [3]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [6]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [9]:
#your code here
spaceship.isnull().sum()*100/len(spaceship)

PassengerId     0.000000
HomePlanet      2.312205
CryoSleep       2.496261
Cabin           2.289198
Destination     2.093639
Age             2.059128
VIP             2.335212
RoomService     2.082135
FoodCourt       2.105142
ShoppingMall    2.392730
Spa             2.105142
VRDeck          2.162660
Name            2.300702
Transported     0.000000
dtype: float64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.
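
The first two strategies can be sketched on a toy frame. This is a minimal illustration with made-up values; the column names `Age` and `HomePlanet` are borrowed from the dataset, and the `toy`, `dropped` and `filled` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one numeric and one categorical gap
toy = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 33.0],
    "HomePlanet": ["Europa", "Earth", "Earth", None],
})

# Strategy 1: drop every row that contains a missing value
dropped = toy.dropna()

# Strategy 2: fill numeric gaps with the mean, categorical gaps with the mode
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(dropped.shape)
print(filled)
```

Dropping is the simplest option but discards whole rows; filling keeps every row at the cost of introducing estimated values.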

For this exercise, because each column has only about 2% null values, we will drop rows containing any missing value. Note that the gaps fall on different rows, so this reduces the data from 8,693 to 6,606 rows.

In [13]:
#your code here
spaceship.dropna(inplace=True)
spaceship.shape

(6606, 14)

In [19]:
spaceship.Cabin

0          B/0/P
1          F/0/S
2          A/0/S
3          A/0/S
4          F/1/S
          ...   
8688      A/98/P
8689    G/1499/S
8690    G/1500/S
8691     E/608/S
8692     E/608/S
Name: Cabin, Length: 6606, dtype: object

- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [29]:
#your code here
# Keep only the deck letter, i.e. the first character of each Cabin string
spaceship['Cabin_transformed'] = spaceship['Cabin'].apply(lambda x: x[0])
spaceship['Cabin_transformed'][0]

'B'
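
As an alternative sketch, the same deck letter can be taken with the pandas `.str` accessor, and the remaining parts of the cabin string can be split out as well. This assumes every value follows the three-part `X/N/Y` pattern seen in the preview; the column names `deck`, `num` and `side` are illustrative and not part of the lab.

```python
import pandas as pd

# Toy cabins copied from the preview of the Cabin column
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "G/1499/S"], name="Cabin")

# Vectorised first-character extraction, same result as taking x[0] per value
deck = cabins.str[0]

# Split all three parts into separate columns (names are illustrative)
parts = cabins.str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]

print(deck.tolist())
print(parts)
```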

- Drop PassengerId and Name

In [30]:
#your code here
spaceship.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported', 'Cabin_transformed'],
      dtype='object')

In [33]:
spaceship = spaceship.drop(['PassengerId','Name'], axis = 1)
spaceship.columns

Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported', 'Cabin_transformed'],
      dtype='object')

- For non-numerical columns, create dummy variables.

In [34]:
#encoding non-numerical columns
spaceship.dtypes

HomePlanet            object
CryoSleep             object
Cabin                 object
Destination           object
Age                  float64
VIP                   object
RoomService          float64
FoodCourt            float64
ShoppingMall         float64
Spa                  float64
VRDeck               float64
Transported             bool
Cabin_transformed     object
dtype: object

In [51]:
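# get_dummies encodes every object column in a single call, so no loop is needed.
# Note: the raw Cabin column is still present, so it gets one dummy per distinct cabin.
spaceship_enc = pd.get_dummies(spaceship)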


In [52]:
spaceship_enc

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,...,VIP_False,VIP_True,Cabin_transformed_A,Cabin_transformed_B,Cabin_transformed_C,Cabin_transformed_D,Cabin_transformed_E,Cabin_transformed_F,Cabin_transformed_G,Cabin_transformed_T
0,39.0,0.0,0.0,0.0,0.0,0.0,False,False,True,False,...,True,False,False,True,False,False,False,False,False,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,True,False,False,...,True,False,False,False,False,False,False,True,False,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,False,True,False,...,False,True,True,False,False,False,False,False,False,False
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,False,True,False,...,True,False,True,False,False,False,False,False,False,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,True,False,False,...,True,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,False,False,True,False,...,False,True,True,False,False,False,False,False,False,False
8689,18.0,0.0,0.0,0.0,0.0,0.0,False,True,False,False,...,True,False,False,False,False,False,False,False,True,False
8690,26.0,0.0,0.0,1872.0,1.0,0.0,True,True,False,False,...,True,False,False,False,False,False,False,False,True,False
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,False,False,True,False,...,True,False,False,False,False,False,True,False,False,False
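
The effect of leaving a high-cardinality column in place can be seen on a small example. This is a sketch on made-up rows, assuming the same column names as the lab; `wide` and `narrow` are hypothetical names.

```python
import pandas as pd

# Toy rows (made-up) mimicking the lab's columns after the Cabin transform
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Earth"],
    "Cabin": ["B/0/P", "F/0/S", "F/1/S"],
    "Cabin_transformed": ["B", "F", "F"],
    "Age": [39.0, 24.0, 16.0],
})

# Encoding everything creates one column per distinct raw Cabin string
wide = pd.get_dummies(toy)

# Dropping the raw Cabin first keeps only the deck-level dummies
narrow = pd.get_dummies(toy.drop(columns=["Cabin"]))

print(wide.shape[1], narrow.shape[1])
```

With thousands of distinct cabins, the first variant produces thousands of nearly empty columns, which is worth keeping in mind before fitting a distance-based model.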


**Perform Train Test Split**

In [57]:
#your code here

features = spaceship_enc.drop(columns=["Transported"])
target = spaceship_enc["Transported"]
features.columns


Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars',
       'CryoSleep_False',
       ...
       'VIP_False', 'VIP_True', 'Cabin_transformed_A', 'Cabin_transformed_B',
       'Cabin_transformed_C', 'Cabin_transformed_D', 'Cabin_transformed_E',
       'Cabin_transformed_F', 'Cabin_transformed_G', 'Cabin_transformed_T'],
      dtype='object', length=5329)

In [58]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)
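
To make the split concrete, here is a small sketch on synthetic data showing the resulting shapes. The `stratify=y` argument is an addition not used in the lab; it keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made-up): 100 rows, 3 features, balanced boolean target
X = np.arange(300).reshape(100, 3)
y = np.array([True, False] * 50)

# Same test_size and random_state as the lab, plus stratify=y
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```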

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [59]:
from sklearn.neighbors import KNeighborsClassifier 


In [60]:
#your code here
knn = KNeighborsClassifier()

In [61]:
knn.fit(X_train, y_train)
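
KNN compares rows by distance, so columns with large ranges (such as the spending columns) can dominate columns that only take 0/1 values. One possible variation, not part of the lab, is to standardise the features before fitting. The sketch below uses synthetic data from `make_classification` as a stand-in, and `scaled_knn` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the Spaceship Titanic features
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

# Scale, then classify; the scaler is fitted on the training split only
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

print(scaled_knn.score(X_te, y_te))
```

Wrapping both steps in a pipeline means the test split is transformed with statistics learned from the training split, which avoids leaking test information into the model.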

- Evaluate your model's performance. Comment on it.

In [63]:
knn.score(X_test, y_test)

0.7723146747352496
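
For a classifier, `score` returns the accuracy, which hides how the errors are distributed between the two classes. A confusion matrix breaks this down. The sketch below uses made-up labels rather than the lab's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (made-up), not the lab's predictions
y_true = np.array([True, True, False, False, True, False])
y_pred = np.array([True, False, False, True, True, False])

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```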

In [79]:
pred = knn.predict(X_test)

In [112]:
pred

array([ True,  True,  True, ...,  True,  True,  True])

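In [109]:
# Put predictions and true labels side by side as 0/1 integers
# (named results to avoid shadowing the built-in eval)
results = pd.DataFrame({'pred': pred, 'y_test': y_test.to_numpy()}).astype(int)

In [111]:
# Count the rows predicted as transported (1)
count = int((results['pred'] == 1).sum())
print(count)

662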


In [97]:
print(sum(pred), sum(y_test))

662 661



In [68]:
#your code here
from sklearn.metrics import mean_squared_error

# Cast the boolean labels to int first: mean_squared_error subtracts the
# arrays, and NumPy raises a TypeError when subtracting booleans.
MSE = mean_squared_error(y_test.astype(int), pred.astype(int))
# Take the square root directly; the squared=False argument is deprecated
# in recent scikit-learn releases.
RMSE = np.sqrt(MSE)

print(MSE)
print(RMSE)
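
Because `Transported` is boolean, error metrics that subtract arrays need numeric labels. A minimal sketch with made-up labels shows the cast, and also that for 0/1 labels the MSE is the misclassification rate, so it carries the same information as accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Toy boolean labels (made-up), standing in for y_test and pred
y_true = np.array([True, False, True, True])
y_hat = np.array([True, True, True, False])

# Cast to int so the metric can subtract the arrays
mse = mean_squared_error(y_true.astype(int), y_hat.astype(int))
rmse = np.sqrt(mse)

# For 0/1 labels each squared error is 0 or 1, so MSE equals 1 - accuracy
print(mse, 1 - accuracy_score(y_true, y_hat))
```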