# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [3]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [4]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [5]:
#your code here
spaceship.info(), spaceship.isnull().sum().sort_values(ascending=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


(None,
 CryoSleep       217
 ShoppingMall    208
 VIP             203
 HomePlanet      201
 Name            200
 Cabin           199
 VRDeck          188
 Spa             183
 FoodCourt       183
 Destination     182
 RoomService     181
 Age             179
 PassengerId       0
 Transported       0
 dtype: int64)
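
Per-column counts do not show how many rows are affected in total, because the gaps can fall in different rows. The sketch below illustrates how to check both views. It uses a small, made-up `toy` frame (an assumption for illustration, not the lab data), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "VIP": [False, False, None, False],
    "Spa": [0.0, 549.0, 6715.0, np.nan],
})

# Share of missing values per column.
missing_share = toy.isnull().mean()

# Number of rows with at least one missing value,
# i.e. the rows that dropna(how="any") would remove.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_share)
print(f"Rows with any missing value: {rows_with_missing} of {len(toy)}")
```

Running the same two expressions on `spaceship` before dropping anything would show how many rows a row-wise drop costs.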

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical columns).
- Filling all missing values with an algorithm.

For this exercise, because each column has only a small share of null values (about 2 to 2.5%), we will drop every row that contains any missing value.

In [6]:
#your code here
spaceship.dropna(how='any', inplace=True)
spaceship.shape, spaceship.isnull().sum()

((6606, 14),
 PassengerId     0
 HomePlanet      0
 CryoSleep       0
 Cabin           0
 Destination     0
 Age             0
 VIP             0
 RoomService     0
 FoodCourt       0
 ShoppingMall    0
 Spa             0
 VRDeck          0
 Name            0
 Transported     0
 dtype: int64)
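
Dropping rows reduced the data from 8693 to 6606 rows. For comparison, the "fill with a value" strategy listed earlier could look roughly like the sketch below. It runs on a made-up `toy` frame, and the choice of median for numeric columns and mode for categorical columns is an assumption for illustration, not something the lab prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

filled = toy.copy()
num_cols = filled.select_dtypes(include="number").columns
cat_cols = filled.select_dtypes(exclude="number").columns

# Numeric columns: fill gaps with the column median.
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())

# Categorical columns: fill gaps with the most frequent value (mode).
for col in cat_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

In a modeling workflow, the fill values would normally be computed on the training split only and then applied to the test split, so that no information from the test set leaks into preprocessing.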

- **Cabin** is too granular. Transform it to keep only the deck letter, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}.

In [16]:
#your code here
spaceship['Deck'] = spaceship['Cabin'].astype(str).str[0]
spaceship['Deck'].value_counts().sort_values(ascending=False)

Deck
F    2152
G    1973
E     683
B     628
C     587
D     374
A     207
T       2
Name: count, dtype: int64
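
Taking the first character works because every cabin string starts with the deck letter. As an alternative sketch, the whole string can be split on `/` into its three parts. The cabin values below are copied from the first rows shown earlier; the column names `Deck`, `Num`, and `Side` are labels chosen here for illustration.

```python
import pandas as pd

# Cabin values taken from the first rows of the dataset shown earlier.
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# Split "deck/num/side" into three columns; the names are illustrative.
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]
parts["Num"] = parts["Num"].astype(int)

print(parts)
```

This keeps the side letter available as a possible extra feature, whereas the lab only asks for the deck.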

- Drop PassengerId and Name

In [17]:
#your code here
# Cabin is dropped as well, since the Deck column now replaces it
cols_to_drop = ['PassengerId', 'Name', 'Cabin']
spaceship = spaceship.drop(columns=cols_to_drop)
spaceship.head(3)

|   | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Deck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | B |
| 1 | Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | F |
| 2 | Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | A |


- For non-numerical columns, create dummy variables (one-hot encoding).

In [18]:
#your code here

# --- Target / features split ---
y = spaceship["Transported"].astype(int)
X = spaceship.drop(columns=["Transported"])

# --- Identify columns by dtype ---
cat_cols  = X.select_dtypes(include=["object", "category"]).columns.tolist()
bool_cols = X.select_dtypes(include=["bool"]).columns.tolist()

print("Categorical (object) columns to dummy:", cat_cols)
print("Boolean columns to cast to 0/1:", bool_cols)

# --- Make booleans numeric (0/1) ---
for c in bool_cols:
    X[c] = X[c].astype(int)

# --- One-hot encode only the categorical (object) columns ---
X = pd.get_dummies(X, columns=cat_cols, drop_first=False)

X.head()


Categorical (object) columns to dummy: ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck']
Boolean columns to cast to 0/1: []


|   | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_False | ... | VIP_False | VIP_True | Deck_A | Deck_B | Deck_C | Deck_D | Deck_E | Deck_F | Deck_G | Deck_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | True | False | True | ... | True | False | False | True | False | False | False | False | False | False |
| 1 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | False | False | True | ... | True | False | False | False | False | False | False | True | False | False |
| 2 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | True | False | True | ... | False | True | True | False | False | False | False | False | False | False |
| 3 | 33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | True | False | True | ... | True | False | True | False | False | False | False | False | False | False |
| 4 | 16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | False | False | True | ... | True | False | False | False | False | False | False | True | False | False |
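
The printed lists show that `CryoSleep` and `VIP` are stored with `object` dtype, so they end up in `cat_cols` and each becomes two dummy columns, while `bool_cols` stays empty. A possible variation, sketched below on a made-up `toy` frame (an assumption for illustration), casts such True/False columns to a single 0/1 column and asks `get_dummies` for integer output.

```python
import pandas as pd

# Hypothetical toy frame; CryoSleep is deliberately stored as object dtype.
toy = pd.DataFrame({
    "CryoSleep": pd.Series([False, True, False], dtype=object),
    "Deck": ["B", "F", "A"],
})

# An object column holding only True/False can be cast to a single 0/1 column.
toy["CryoSleep"] = toy["CryoSleep"].astype(bool).astype(int)

# dtype=int makes the dummy columns 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["Deck"], dtype=int)

print(encoded)
```

A single 0/1 column carries the same information as a `_False`/`_True` pair, so this variation reduces the feature count without losing anything.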


**Perform Train Test Split**

In [19]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape

((5284, 24), (1322, 24))
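
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both splits. A minimal sketch of how this could be checked is shown below. It uses synthetic data with a 70/30 class balance (an assumption for illustration, not the lab data); the same `.mean()` comparison would apply to `y_train` and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with a known 70/30 class balance (illustration only).
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series(np.r_[np.zeros(700, dtype=int), np.ones(300, dtype=int)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, the share of class 1 should be close in both splits.
print(f"Train positive share: {y_tr.mean():.3f}")
print(f"Test positive share:  {y_te.mean():.3f}")
```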

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [28]:
#your code here
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Scale every feature to unit variance; fit on the training set only to avoid leakage.
# with_mean=False skips centering, which does not change the Euclidean distances KNN uses.
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
cm  = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, digits=4)
print(f"Accuracy: {acc:.4f}\n")
print("Confusion matrix:\n", cm, "\n")
print("Classification report:\n", report)

Accuracy: 0.7761

Confusion matrix:
 [[544 112]
 [184 482]] 

Classification report:
               precision    recall  f1-score   support

           0     0.7473    0.8293    0.7861       656
           1     0.8114    0.7237    0.7651       666

    accuracy                         0.7761      1322
   macro avg     0.7794    0.7765    0.7756      1322
weighted avg     0.7796    0.7761    0.7755      1322
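
The cell above fixes `n_neighbors=10`. One possible way to choose that value from the data is to wrap the scaler and the classifier in a `Pipeline` and cross-validate several candidates. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the lab data, and the candidate list is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustration only, not the Spaceship Titanic set).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline refits the scaler inside each cross-validation fold,
# so validation folds never influence the scaling parameters.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

candidate_k = [3, 5, 10, 15, 25]  # arbitrary candidates for illustration
search = GridSearchCV(pipe, {"knn__n_neighbors": candidate_k}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_te, y_te)
print(f"Best k: {best_k}, test accuracy: {test_accuracy:.4f}")
```

On the lab data, the same pattern would take `X_train` and `y_train` in `search.fit` and leave the test set untouched until the final score.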



- Evaluate your model's performance and comment on it.

- Overall: Test accuracy is 0.7761 (1026 of 1322 predictions correct). The test set is nearly balanced (656 vs. 666), so this is well above the roughly 50% a majority-class guess would reach, and a reasonable baseline after one-hot encoding the categorical features and scaling with StandardScaler before KNN.

- By class: Precision and recall differ between the classes. Class 0 (not transported) has precision 0.7473 and recall 0.8293, while class 1 (transported) has precision 0.8114 and recall 0.7237. The lower recall on class 1 is where the model misses more true cases: it leans slightly toward predicting "not transported".

- Confusion matrix: Most predictions fall on the main diagonal (544 + 482 correct). Off the diagonal there are 112 false positives and 184 false negatives. Errors like these are expected for a simple KNN baseline, especially because one-hot encoding adds many binary columns (18 of the 24 features), and distance-based methods such as KNN can struggle as dimensionality grows.
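
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when interpreting them. The sketch below uses the matrix printed above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels); the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[544, 112],
               [184, 482]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all predictions that are correct
precision_1 = tp / (tp + fp)           # of predicted "transported", how many were right
recall_1 = tp / (tp + fn)              # of truly "transported", how many were found
recall_0 = tn / (tn + fp)              # of truly "not transported", how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f}  precision_1={precision_1:.4f}  "
      f"recall_1={recall_1:.4f}  recall_0={recall_0:.4f}  f1_1={f1_1:.4f}")
```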