# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [54]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [55]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [56]:
# Load the dataset
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv"
df = pd.read_csv(url)

# Check the shape of the data
print(df.shape)

(8693, 14)


**Check for data types**

In [57]:
# Check the data types of each column
print(df.dtypes)

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
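
`CryoSleep` and `VIP` hold True/False values yet are reported as `object`. A likely reason is that both columns contain missing values, and pandas cannot store NaN in a plain `bool` column. The toy series below (hypothetical data, not taken from the dataset) sketches that behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns: booleans with and without a missing value
with_missing = pd.Series([True, False, np.nan])
without_missing = pd.Series([True, False, True])

print(with_missing.dtype)     # object
print(without_missing.dtype)  # bool
```

`Transported` has no missing values, which is consistent with it being read as `bool`.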


**Check for missing values**

In [58]:
# Check for missing values
missing_values = df.isnull().sum()

# Display missing values per column
print(missing_values)

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.
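
The three strategies above can be sketched on a small toy DataFrame. The data below is made up for illustration, and `KNNImputer` is only one possible choice for the "algorithm" option; it is an assumption of this sketch, not something the lab prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "Spa": [0.0, 549.0, np.nan, 3329.0],
    "FoodCourt": [0.0, 9.0, 3576.0, 1283.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa"],
})
numeric_cols = ["Age", "Spa", "FoodCourt"]

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill numeric columns with the mean and categorical ones with the mode
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# Strategy 3: fill numeric columns with an algorithm (here: nearest neighbours)
imputed = toy.copy()
imputed[numeric_cols] = KNNImputer(n_neighbors=2).fit_transform(toy[numeric_cols])

print(dropped.shape)  # (2, 4)
print(filled)
print(imputed)
```

Dropping is the simplest option but discards whole rows, while the two filling strategies keep every row at the cost of introducing estimated values.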

For this exercise, because each column is missing only about 2% of its values, we will drop rows containing any missing value.

In [59]:
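# Drop rows with any missing values
# Note: the result is stored in df_cleaned; the cells below keep using df,
# so the missing values are still present in the later steps.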
df_cleaned = df.dropna()

# Check the new shape of the dataset
print(df_cleaned.shape)

(6606, 14)


- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [60]:
# Extract the first letter of the cabin as the deck category
# Note: astype(str) turns NaN into the string 'nan', so missing cabins become 'n'
df['Cabin'] = df['Cabin'].astype(str).str[0]

# Display unique values to confirm transformation
print(df['Cabin'].unique())

['B' 'F' 'A' 'G' 'n' 'E' 'D' 'C' 'T']
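
The `'n'` in the output comes from missing cabins: the string form of NaN is `"nan"`, and its first character is `"n"`. A minimal sketch on hypothetical cabin values shows one way to keep missing cabins as missing, by using the `.str` accessor directly instead of casting to `str` first:

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including one missing entry
cabins = pd.Series(["B/0/P", "F/1/S", np.nan, "G/12/P"])

# The string form of NaN starts with "n", which explains the extra category
print(str(np.nan)[0])  # n

# The .str accessor propagates NaN, so the missing cabin stays missing
deck = cabins.str[0]
print(deck.tolist())
```

Whether missing decks should stay as NaN or become their own category is a modelling choice; the sketch only makes that choice explicit.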


In [61]:
# Convert 'Cabin' into dummy variables
# (drop_first=True drops the first level, 'A', which becomes the baseline)
df = pd.get_dummies(df, columns=['Cabin'], drop_first=True)

- Drop PassengerId and Name

In [62]:
# Drop PassengerId and Name columns
df.drop(columns=['PassengerId', 'Name'], inplace=True)

# Display updated DataFrame
print(df.head())

  HomePlanet CryoSleep  Destination   Age    VIP  RoomService  FoodCourt  \
0     Europa     False  TRAPPIST-1e  39.0  False          0.0        0.0   
1      Earth     False  TRAPPIST-1e  24.0  False        109.0        9.0   
2     Europa     False  TRAPPIST-1e  58.0   True         43.0     3576.0   
3     Europa     False  TRAPPIST-1e  33.0  False          0.0     1283.0   
4      Earth     False  TRAPPIST-1e  16.0  False        303.0       70.0   

   ShoppingMall     Spa  VRDeck  Transported  Cabin_B  Cabin_C  Cabin_D  \
0           0.0     0.0     0.0        False     True    False    False   
1          25.0   549.0    44.0         True    False    False    False   
2           0.0  6715.0    49.0        False    False    False    False   
3         371.0  3329.0   193.0        False    False    False    False   
4         151.0   565.0     2.0         True    False    False    False   

   Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_n  
0    False    False    False    False    Fa

- For non-numerical columns, create dummy variables.

In [63]:
# Identify categorical columns (the target 'Transported' is bool, not object,
# so it is not selected here)
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# 'Cabin' was already encoded above; this guard only matters if it is still present
if 'Cabin' in categorical_cols:
    categorical_cols.remove('Cabin')

# Convert categorical columns to dummy variables
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Verify that all columns are now numeric
print(df.dtypes)

Age                          float64
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
Transported                     bool
Cabin_B                         bool
Cabin_C                         bool
Cabin_D                         bool
Cabin_E                         bool
Cabin_F                         bool
Cabin_G                         bool
Cabin_T                         bool
Cabin_n                         bool
HomePlanet_Europa               bool
HomePlanet_Mars                 bool
CryoSleep_True                  bool
Destination_PSO J318.5-22       bool
Destination_TRAPPIST-1e         bool
VIP_True                        bool
dtype: object
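
One detail worth knowing about `pd.get_dummies`: by default, a missing value is encoded as all zeros, so with `drop_first=True` it looks the same as the dropped baseline category. The sketch below uses a hypothetical one-column frame to show this, along with the `dummy_na=True` option that adds an explicit indicator for missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"HomePlanet": ["Europa", "Earth", np.nan, "Mars"]})

# drop_first=True removes the first category ("Earth"), which becomes the baseline
default = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)
print(default.columns.tolist())  # ['HomePlanet_Europa', 'HomePlanet_Mars']

# Row 1 ("Earth") and row 2 (missing) are both encoded as all zeros
print(default.iloc[[1, 2]])

# dummy_na=True adds a separate indicator column for missing values
with_na = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True, dummy_na=True)
print(with_na.columns.tolist())
```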


**Perform Train Test Split**

In [64]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df.drop(columns=['Transported'])  # Features (excluding the target variable)
y = df['Transported']  # Target variable

# Perform train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Print shapes to verify
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (6954, 20)
X_test shape: (1739, 20)
y_train shape: (6954,)
y_test shape: (1739,)
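
The `stratify=y` argument used above asks `train_test_split` to keep the class proportions of `y` the same in both splits. A small sketch on synthetic, deliberately imbalanced labels (not the lab data) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 negatives and 20 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep the 20% share of positives
print(yd_train.mean(), yd_test.mean())  # 0.2 0.2
```

Without `stratify`, the share of positives in each split would vary with the random seed.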


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [65]:
df.dtypes

Age                          float64
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
Transported                     bool
Cabin_B                         bool
Cabin_C                         bool
Cabin_D                         bool
Cabin_E                         bool
Cabin_F                         bool
Cabin_G                         bool
Cabin_T                         bool
Cabin_n                         bool
HomePlanet_Europa               bool
HomePlanet_Mars                 bool
CryoSleep_True                  bool
Destination_PSO J318.5-22       bool
Destination_TRAPPIST-1e         bool
VIP_True                        bool
dtype: object

In [67]:
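# Fill missing numerical values with the median
# Note: each set is filled with its own median here; reusing the training
# median for the test set would avoid relying on test-set statistics.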
X_train = X_train.fillna(X_train.median())
X_test = X_test.fillna(X_test.median())
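
A common alternative to the cell above is to compute the medians on the training set only and reuse them for the test set, so that no test-set statistic enters preprocessing. The sketch below uses hypothetical toy frames (`toy_train`, `toy_test`) rather than the lab data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits with missing ages
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 70.0, 80.0]})

# Learn the fill values from the training data only
train_medians = toy_train.median()

# Apply the same fill values to both splits
toy_train_filled = toy_train.fillna(train_medians)
toy_test_filled = toy_test.fillna(train_medians)

print(toy_test_filled["Age"].tolist())  # [30.0, 70.0, 80.0]
```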

In [68]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize KNN classifier with default settings
knn = KNeighborsClassifier()

# Fit the model to the training data
knn.fit(X_train, y_train)
print("Model training complete.")

Model training complete.
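
KNN classifies by distance, so features with large ranges (such as the spending columns here) can dominate features with small ranges (such as the 0/1 dummies). A frequently used remedy is to standardise features before fitting. The sketch below shows the pattern on synthetic data; the pipeline and the data are assumptions for illustration and were not run on the Spaceship Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature on a much larger scale than the others
X_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=42)
X_syn[:, 0] *= 1000

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# The scaler is fitted on the training data only, then applied before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(Xs_train, ys_train)

score = scaled_knn.score(Xs_test, ys_test)
print(f"Accuracy with scaling: {score:.4f}")
```

Wrapping both steps in a pipeline keeps the scaling parameters tied to the training split, which also helps when cross-validating later.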


- Evaluate your model's performance and comment on it.

In [69]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Generate classification report
class_report = classification_report(y_test, y_pred)

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print results
print(f"Model Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", class_report)
print("\nConfusion Matrix:\n", conf_matrix)


Model Accuracy: 0.7671

Classification Report:
               precision    recall  f1-score   support

       False       0.77      0.76      0.76       863
        True       0.77      0.78      0.77       876

    accuracy                           0.77      1739
   macro avg       0.77      0.77      0.77      1739
weighted avg       0.77      0.77      0.77      1739


Confusion Matrix:
 [[655 208]
 [197 679]]
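
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when interpreting them. In scikit-learn's layout, rows are the actual classes and columns the predicted ones, so the matrix above reads as TN=655, FP=208, FN=197, TP=679. A minimal sketch of the arithmetic:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[655, 208],
               [197, 679]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_true = tp / (tp + fp)  # of all predicted "True", how many were right
recall_true = tp / (tp + fn)     # of all actual "True", how many were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"Accuracy:  {accuracy:.4f}")        # 0.7671
print(f"Precision: {precision_true:.2f}")  # 0.77
print(f"Recall:    {recall_true:.2f}")     # 0.78
print(f"F1-score:  {f1_true:.2f}")         # 0.77
```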


In [None]:
# 1. Accuracy (76.71%)

# The model correctly predicts whether a passenger was transported about 77% of the time
# (1,334 of the 1,739 test passengers).
# This is moderate performance and leaves room for improvement.

# 2. Precision & Recall (Balanced Performance)

# For class "False" (not transported):
# Precision: 0.77 → When the model predicts "False," it is correct 77% of the time.
# Recall: 0.76 → The model correctly identifies 76% of actual "False" cases.
# For class "True" (transported):
# Precision: 0.77 → When the model predicts "True," it is correct 77% of the time.
# Recall: 0.78 → The model correctly identifies 78% of actual "True" cases.
# The scores for both classes are similar, meaning the model is not biased towards one class.

# 3. F1-Score (~0.77)

# An F1-score of about 0.77 for both classes shows that precision and recall are in balance.
# The test set is also nearly balanced (863 "False" vs. 876 "True"), so accuracy is not
# inflated by a majority class.