# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
# Step 0 - Import libraries
# pandas: data manipulation and analysis
# numpy: numerical operations
# train_test_split: split dataset into training and testing sets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [2]:
# Step 0b - Load Spaceship Titanic dataset
# pd.read_csv loads CSV data from the URL
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Preview first 5 rows of data
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [3]:
# Step 1 - Check the shape of your data
# .shape returns number of rows and columns in a DataFrame
spaceship.shape

(8693, 14)

**Check for data types**

In [4]:
# Step 2 - Check for data types
# .dtypes returns the data type of each column
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
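
`CryoSleep` and `VIP` hold True/False values but are reported as `object`, most likely because they also contain missing values. The following is a minimal sketch of that behavior on made-up Series (hypothetical values, not the lab data):

```python
import numpy as np
import pandas as pd

# Toy columns (hypothetical values, not taken from the dataset)
complete = pd.Series([True, False, True])
with_gap = pd.Series([True, False, np.nan])

print(complete.dtype)  # bool
print(with_gap.dtype)  # object

# Dropping the missing entry does not change the dtype on its own;
# an explicit cast is needed to get a boolean column back.
restored = with_gap.dropna().astype(bool)
print(restored.dtype)  # bool
```

This is also why these columns are treated as categorical later on unless they are cast explicitly.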

**Check for missing values**

In [5]:
# Step 3 - Check for missing values
# .isnull() returns True for NaN
# .sum() counts number of missing values per column
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
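
The per-column counts above do not show how many rows contain at least one missing value, which is what matters when whole rows are dropped. A small sketch on a made-up frame (hypothetical values, not the lab data) shows one way to check both:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "VIP": [False, True, None, False],
})

# Fraction of missing values per column
share_missing = toy.isnull().mean()

# Number of rows with at least one missing value,
# i.e. the rows that dropna() would remove
rows_with_gap = toy.isnull().any(axis=1).sum()

print(share_missing)
print(rows_with_gap)
```

In this toy frame each column is missing a quarter of its values, yet half of the rows are affected, because the gaps fall in different rows.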

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical columns).
- Filling all missing values with an algorithm.
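
The second strategy can be sketched as follows. The frame and its values are made up for illustration; only the column names are borrowed from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, standing in for the real dataset
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "HomePlanet": ["Earth", "Mars", None, "Earth"],
})

filled = toy.copy()
# Continuous column: fill with the mean of the observed values
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
# Categorical column: fill with the most frequent value (mode)
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(filled)
```

When imputing in a modeling workflow, the fill values are usually computed on the training set only and then reused on the test set, so that no information leaks from the test data.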

For this exercise, because each column is missing only about 2% of its values, we will drop every row that contains a missing value.

In [6]:
# Step 4 - Handle missing data
# Because each column has few missing values, we drop rows with any NaN
# .dropna() removes rows with NaN
spaceship = spaceship.dropna()

# Verify no missing values remain
spaceship.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

- **Cabin** is too granular: transform it to keep only its first letter, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [7]:
# Step 5 - Transform Cabin column
# Cabin is in format 'F/123/P'
# We only take the first character to reduce granularity
# .str[0] extracts the first letter
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Check unique values in Cabin
spaceship['Cabin'].unique()

array(['B', 'F', 'A', 'G', 'E', 'C', 'D', 'T'], dtype=object)
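
Keeping only the first character discards the rest of the cabin string. If the other parts were wanted as features too, one option is to split on `/`. This is a sketch on toy values, and the column names `Deck`, `Num` and `Side` are assumptions chosen for illustration:

```python
import pandas as pd

# Toy values in the same 'X/number/Y' shape as the Cabin column
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/1/S", "A/0/S"]})

# Split on '/' into three new columns (names are our own choice)
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

print(parts)
```

Note that `Num` comes back as strings and would need a cast before being used as a number.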

- Drop PassengerId and Name

In [8]:
# Step 6 - Drop PassengerId and Name
# .drop() removes columns
# axis=1 indicates columns
spaceship = spaceship.drop(['PassengerId','Name'], axis=1)

# Verify columns
spaceship.columns

Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported'],
      dtype='object')

- For non-numerical columns, create dummy variables.

In [9]:
# Step 7 - For non-numerical columns, create dummies (one-hot encoding)
# Do not encode the target column 'Transported'
# drop_first=True avoids multicollinearity by dropping the first category

target = spaceship['Transported']  # Save target separately
features = spaceship.drop('Transported', axis=1)  # Keep all other columns

# Apply one-hot encoding only to categorical columns in features
features = pd.get_dummies(features, drop_first=True)

# Verify columns
features.head()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_True | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | True | False | False | True | False | False | False | False | False | False | False | True | False |
| 1 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | False | False | False | False | False | False | False | True | False | False | False | True | False |
| 2 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | True | False | False | False | False | False | False | False | False | False | False | True | True |
| 3 | 33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | True | False | False | False | False | False | False | False | False | False | False | True | False |
| 4 | 16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | False | False | False | False | False | False | False | True | False | False | False | True | False |
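
To see what `drop_first=True` does, it can help to compare both encodings on a single toy column (made-up rows, reusing the `HomePlanet` categories from the data):

```python
import pandas as pd

# Toy column with the three HomePlanet categories
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, the dropped category (`Earth` here) is represented implicitly by a row where all remaining dummy columns are False.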


**Perform Train Test Split**

In [10]:
# Step 8 - Perform Train Test Split
# X: all feature columns after dummies
# y: target column (0/1 or False/True)
# test_size=0.2 means 20% test, 80% train
# random_state ensures reproducibility

X = features
y = target.astype(int)  # convert boolean to int if necessary

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Verify shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5284, 19), (1322, 19), (5284,), (1322,))
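
The split above is purely random. When class proportions should be preserved in both parts, `train_test_split` also accepts a `stratify` argument. The lab does not use it; the sketch below uses synthetic labels only to illustrate the option:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (30% positives), not the lab data
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep close to the 30% positive rate
print(y_tr.mean(), y_te.mean())
```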

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [11]:
# Step 9 - Model Selection (KNN)
# KNeighborsClassifier: distance-based classifier
# Initialize without hyperparameters (default n_neighbors=5)
knn = KNeighborsClassifier()

# Fit the model to training data
# .fit() trains the KNN model
knn.fit(X_train, y_train)

- Evaluate your model's performance and comment on it.

In [12]:
# Step 10 - Evaluate model performance
# .predict() generates predictions on test set
y_pred = knn.predict(X_test)

# accuracy_score compares true vs predicted
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of KNN model: {accuracy:.2f}")

# Comment:
# The default KNN (n_neighbors=5) reaches an accuracy of about 0.79 on the test set.
# Possible improvements:
# - Tune n_neighbors
# - Scale the numeric features, since KNN relies on distances
# - Apply feature selection or further feature engineering

Accuracy of KNN model: 0.79
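
Two of the improvements listed in the comment, scaling and tuning `n_neighbors`, could be combined roughly as follows. This is a sketch under assumptions: it runs on synthetic data from `make_classification` rather than the lab data, and the candidate values for `n_neighbors` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Putting the scaler inside the pipeline means it is fitted
# on the training folds only during cross-validation
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values for n_neighbors are arbitrary choices for illustration
search = GridSearchCV(
    pipe, {"knn__n_neighbors": [3, 5, 11, 21]}, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```

To try this on the lab data, `X_train`, `X_test`, `y_train` and `y_test` from the split above would take the place of the synthetic arrays.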

