# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [13]:
spaceship.shape

(8693, 14)

**Check for data types**

In [14]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
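
Note that `CryoSleep` and `VIP` are reported as `object` even though they hold True/False flags. A likely explanation, offered here as an assumption rather than something verified on this dataset, is that the missing values in those columns prevent pandas from using a pure `bool` dtype. The toy series below (hypothetical data, not taken from the CSV) sketches that behaviour and how the dtype can be restored once the missing entries are gone:

```python
import pandas as pd

# Hypothetical flag column with one missing entry
flags = pd.Series([True, False, None, True])
print(flags.dtype)  # mixed bool/None values fall back to object

# Once the missing entry is removed, the column can be cast to bool
clean_flags = flags.dropna().astype(bool)
print(clean_flags.dtype)
```

The cast is only safe when the remaining values are real booleans; strings such as "False" would all become `True`.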

**Check for missing values**

In [15]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.
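
The three strategies above can be sketched on a small, made-up DataFrame. Everything in this block (the `toy` frame, its values, and the choice of `KNNImputer` as the "algorithm") is an illustrative assumption, not part of the lab's solution:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Spa": [0.0, 10.0, np.nan, 20.0],
    "HomePlanet": ["Earth", "Earth", None, "Mars"],
})

# 1. Remove every row that contains a missing value
dropped = toy.dropna()

# 2. Fill with a fixed value: mean for numeric columns, mode for categorical ones
filled = toy.copy()
for col in ["Age", "Spa"]:
    filled[col] = filled[col].fillna(filled[col].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# 3. Fill with an algorithm: here, the average of the 2 most similar rows
numeric_cols = ["Age", "Spa"]
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy[numeric_cols]), columns=numeric_cols)

print(len(dropped), filled.isnull().sum().sum(), imputed.isnull().sum().sum())
```

Dropping rows is the simplest option but discards data; the two filling options keep every row at the cost of introducing estimated values.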

Each column is missing only about 2% of its values (at most 217 of 8,693 rows), so for this exercise we will simply drop every row that contains a missing value.
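
As a quick sanity check, the per-column missing rates can be worked out from the counts printed above. The sketch below hard-codes those counts (it does not load the CSV) and also computes an upper bound on how many rows `dropna()` could remove, which would only be reached if no row had more than one missing value:

```python
# Null counts copied from the isnull().sum() output above
null_counts = {
    "HomePlanet": 201, "CryoSleep": 217, "Cabin": 199, "Destination": 182,
    "Age": 179, "VIP": 203, "RoomService": 181, "FoodCourt": 183,
    "ShoppingMall": 208, "Spa": 183, "VRDeck": 188, "Name": 200,
}
n_rows = 8693  # from spaceship.shape

# Percentage of missing values per column
pct_missing = {col: 100 * n / n_rows for col, n in null_counts.items()}
worst_col = max(pct_missing, key=pct_missing.get)
print(worst_col, round(pct_missing[worst_col], 2))

# Worst case for row-wise dropping: every null sits in a different row
max_rows_dropped = sum(null_counts.values())
print(max_rows_dropped, round(100 * max_rows_dropped / n_rows, 1))
```

Because nulls are spread over twelve columns, dropping rows can cost far more than 2% of the data; comparing `len(spaceship)` with `len(spaceship.dropna())` gives the exact figure.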

In [17]:
spaceship_clean = spaceship.dropna().copy()  # .copy() makes an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning


- **Cabin** is too granular: reduce it to its deck letter, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [18]:
# Extract the deck letter (the part before the first '/')
spaceship_clean['Deck'] = spaceship_clean['Cabin'].apply(lambda x: x.split('/')[0])


In [19]:
print(spaceship_clean['Deck'].unique())

['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']
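
The same extraction can also be written with pandas' vectorized string methods instead of `apply`. This is a minimal sketch on a hypothetical three-row frame (`cabins` is not a name from the lab), assuming every cabin follows the `deck/number/side` pattern seen in the sample rows:

```python
import pandas as pd

# Hypothetical sample in the same format as the Cabin column
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S"]})

# Split on '/' and keep the first piece (the deck letter)
cabins["Deck"] = cabins["Cabin"].str.split("/").str[0]

print(cabins["Deck"].tolist())
```

Unlike the lambda version, `.str` methods propagate missing values instead of raising an error, which matters if the rows with a null `Cabin` have not been dropped first.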


- Drop PassengerId and Name

In [20]:
spaceship_clean = spaceship_clean.drop(columns=['PassengerId', 'Name'])

In [21]:
spaceship_clean.head(3)

,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck
0,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B
1,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F
2,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A


- For non-numerical columns, create dummy variables.

In [22]:
spaceship_clean.select_dtypes(include='object').columns

Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Deck'], dtype='object')

In [30]:
# Make sure Cabin is no longer present
if 'Cabin' in spaceship_clean.columns:
    spaceship_clean = spaceship_clean.drop(columns='Cabin')

# Apply one-hot encoding to the categorical columns
spaceship_encoded = pd.get_dummies(
    spaceship_clean,
    columns=['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck'],
    drop_first=False  # keep every category
)
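
To see what `pd.get_dummies` produces, here is a small sketch on made-up data (the `toy` frame is hypothetical). It also shows the effect of `drop_first`, which the cell above leaves at `False`:

```python
import pandas as pd

# Hypothetical sample with one numeric and two categorical columns
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0],
    "HomePlanet": ["Europa", "Earth", "Europa"],
    "VIP": [False, False, True],
})

# One indicator column per category; untouched columns come first
all_levels = pd.get_dummies(toy, columns=["HomePlanet", "VIP"], drop_first=False)
print(list(all_levels.columns))

# drop_first=True removes the first level of each encoded column
reduced = pd.get_dummies(toy, columns=["HomePlanet", "VIP"], drop_first=True)
print(list(reduced.columns))
```

Keeping every category is harmless for a distance-based model such as KNN; dropping the first level mainly matters for linear models, where the full set of indicators is redundant.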


**Perform Train Test Split**

In [31]:
X = spaceship_encoded.drop(columns='Transported')  # Features
y = spaceship_encoded['Transported']               # Target
# X holds every column except Transported: these are the predictors.
# y holds only the Transported column: this is what we want to predict.

In [32]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,                   # Data
    test_size=0.2,          # 20% held out for testing
    random_state=42,        # Seed for reproducibility
    stratify=y              # Keep the same class proportions in both splits
)
# X_train: 80% of the features, used for training.
# X_test: 20% of the features, used for testing.
# y_train: 80% of the true labels (Transported).
# y_test: 20% of the true labels, used to check whether the model's predictions were correct.
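
The effect of `stratify=y` can be checked on synthetic labels. In this sketch the arrays `X_toy` and `y_toy` are invented for illustration (100 samples, 40% positive); the point is only that the class share is preserved in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 40 positive and 60 negative
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([True] * 40 + [False] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positives overall, in the training split, and in the test split
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

All three proportions should match, whereas an unstratified split of a small dataset can drift by a few percentage points.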

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [33]:
from sklearn.neighbors import KNeighborsClassifier

In [34]:
knn = KNeighborsClassifier(n_neighbors=5)

In [35]:
knn.fit(X_train, y_train)

Parameter,Value
n_neighbors,5
weights,'uniform'
algorithm,'auto'
leaf_size,30
p,2
metric,'minkowski'
metric_params,None
n_jobs,None


In [36]:
y_pred = knn.predict(X_test)


In [38]:
print(y_pred)


[ True  True  True ...  True  True  True]
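
With the settings used above (uniform weights, Minkowski metric with `p=2`, i.e. Euclidean distance), a KNN prediction amounts to a majority vote among the closest training points. The sketch below reimplements that idea by hand on invented 2-D points and compares it with scikit-learn; the helper `knn_predict_by_hand`, the toy arrays, and the choice of `k=3` are all illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: two well-separated clusters
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([False, False, False, True, True, True])
queries = np.array([[0.2, 0.1], [5.5, 5.2]])

def knn_predict_by_hand(X_fit, y_fit, X_new, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    preds = []
    for point in X_new:
        distances = np.sqrt(((X_fit - point) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]      # indices of the k closest points
        votes = y_fit[nearest]
        preds.append(bool(votes.sum() > k / 2))  # True if most neighbours are True
    return np.array(preds)

manual = knn_predict_by_hand(X_toy, y_toy, queries, k=3)
sk = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict(queries)
print(manual, sk)
```

Because the vote depends on raw distances, columns with large values (such as the spending columns here) weigh more heavily than 0/1 dummy columns.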


In [39]:
import pandas as pd

resultado = pd.DataFrame({'Real': y_test, 'Predicho': y_pred})
print(resultado.head(10))  # Show the first 10 rows


       Real  Predicho
8153  False      True
7374   True      True
8231  False      True
5795   True      True
2536   True      True
7228   True      True
88    False     False
3395  False      True
995    True      True
2540  False     False


- Evaluate your model's performance and comment on it.

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("KNN model accuracy:", round(accuracy * 100, 2), "%")


KNN model accuracy: 76.78 %


In [40]:
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")


Precision: 0.76


In [41]:
# Of all the actual positives, how many were predicted correctly.

from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")


Recall: 0.78


In [42]:
# F1 balances precision and recall; useful when the classes are imbalanced
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.2f}")


F1 Score: 0.77


In [None]:
# Breakdown of correct and incorrect predictions per class
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)
# 496 true negatives (TN): the class was negative and the model predicted negative.
# 160 false positives (FP): the class was negative but the model predicted positive.
# 147 false negatives (FN): the class was positive but the model predicted negative.
# 519 true positives (TP): the class was positive and the model predicted positive.

Confusion matrix:
[[496 160]
 [147 519]]
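
All the scores reported above follow directly from these four counts. The sketch below recomputes them by hand from the printed matrix (the numbers are hard-coded, not recomputed from the model) and adds a majority-class baseline as a point of comparison for the comment on performance:

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([[496, 160],
               [147, 519]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict the more frequent class
baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} baseline={baseline:.4f}")
```

The test set is almost perfectly balanced (656 negatives vs. 666 positives), so the baseline sits near 50% and an accuracy of about 77% is a clear improvement; the errors are split fairly evenly between false positives and false negatives.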
