# LAB | Intro to Machine Learning

**Load the data**

In this challenge, we will be working with Spaceship Titanic data. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [44]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

In [36]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [37]:
spaceship.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')

**Check the shape of your data**

In [38]:
#your code here

spaceship.shape

(8693, 14)

**Check for data types**

In [39]:
#your code here

spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [40]:
#your code here

spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a value (for example, the mean for continuous columns or the mode for categorical columns).
- Filling all missing values with an algorithm.

For this exercise, because so few values are missing, dropping the rows that contain any missing value would be a reasonable choice. The cells below instead keep all 8,693 rows and impute: the mean for numerical columns and the most frequent value for categorical columns.
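As a rough illustration of the first two strategies listed above, the sketch below applies them to a small made-up DataFrame. The values and the `toy` name are assumptions for demonstration only, not rows from the lab data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Strategy 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Strategy 2: fill with the mean (numeric) or the mode (categorical)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print("rows before:", len(toy))
print("rows after dropna:", len(dropped))
print("missing values after filling:", int(filled.isnull().sum().sum()))
```

Dropping is simpler but shrinks the dataset; filling keeps every row at the cost of introducing estimated values.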

In [48]:
# Drop the 'Name' column; errors='ignore' keeps the cell safe to re-run
# (without it, running the cell a second time raises a KeyError)

spaceship = spaceship.drop(columns=['Name'], errors='ignore')


In [45]:
# Handle missing values before type conversion
# For numerical columns, we use the mean

num_imputer = SimpleImputer(strategy='mean')
num_columns = spaceship.select_dtypes(include=['float64']).columns
spaceship[num_columns] = num_imputer.fit_transform(spaceship[num_columns])

In [46]:
# For categorical columns, we use the most frequent value (mode)

cat_imputer = SimpleImputer(strategy='most_frequent')
cat_columns = spaceship.select_dtypes(include=['object']).columns
spaceship[cat_columns] = cat_imputer.fit_transform(spaceship[cat_columns])

In [47]:
# Convert boolean columns to numerical after imputation

spaceship['CryoSleep'] = spaceship['CryoSleep'].astype(float)
spaceship['Transported'] = spaceship['Transported'].astype(int)
spaceship['VIP'] = spaceship['VIP'].astype(int)

In [50]:
# Label encode categorical variables with a small number of unique values

le = LabelEncoder()
spaceship['HomePlanet'] = le.fit_transform(spaceship['HomePlanet'])
spaceship['Destination'] = le.fit_transform(spaceship['Destination'])

In [52]:
# One-hot encode the 'Cabin' column

spaceship = pd.get_dummies(spaceship, columns=['Cabin'], prefix='Cabin')
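One-hot encoding every distinct `Cabin` value produces thousands of sparse columns (the feature matrix in this notebook ends up with 6,570). A possible alternative, sketched here on toy rows taken from the `head()` output, assumes each cabin string follows a `deck/num/side` pattern. The column names `Deck`, `CabinNum`, and `Side` are illustrative and not part of the lab.

```python
import pandas as pd

# Toy rows copied from the head() output shown earlier
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"]})

# Split "deck/num/side" into three columns (names are illustrative)
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]
parts["CabinNum"] = parts["CabinNum"].astype(int)

# One-hot encode only the low-cardinality pieces
encoded = pd.get_dummies(parts, columns=["Deck", "Side"], dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

This keeps the number of cabin-derived features small, which tends to matter for distance-based models. It is a sketch of one option, not the approach used in the cells of this notebook.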

In [58]:
# Convert all boolean columns (if any remain) to numerical

bool_columns = spaceship.select_dtypes(include=['bool']).columns
spaceship[bool_columns] = spaceship[bool_columns].astype(int)

**KNN**

K Nearest Neighbors is a distance-based algorithm and requires all **input data to be numerical.**

Let's only select numerical columns as our features.
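Because KNN compares passengers by distance, features with large numeric ranges can dominate the result. The sketch below uses four made-up passengers (not rows from the dataset) with two features, age and spa spending, and compares Euclidean distances before and after standardizing each column. The helper `euclidean` is defined here for illustration.

```python
import numpy as np

# Made-up passengers: [Age, Spa spending]
a = np.array([20.0, 0.0])
b = np.array([60.0, 0.0])      # 40-year age gap to a, same spending
c = np.array([20.0, 100.0])    # same age as a, small spending gap
d = np.array([40.0, 1000.0])   # extra point so the columns have some spread

def euclidean(u, v):
    """Plain Euclidean distance between two vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw distances: the spending column dominates
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Standardize each column to zero mean and unit variance, then compare again
X = np.vstack([a, b, c, d])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_ab = euclidean(X_scaled[0], X_scaled[1])
scaled_ac = euclidean(X_scaled[0], X_scaled[2])

print("raw:    a-b =", raw_ab, " a-c =", raw_ac)
print("scaled: a-b =", round(scaled_ab, 3), " a-c =", round(scaled_ac, 3))
```

On the raw values, `b` looks closer to `a` than `c` does; after standardizing, the order flips. Which neighbor is "nearest" therefore depends on feature scale, so scaling is worth considering before fitting KNN.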

In [59]:
spaceship

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,...,Cabin_G/996/S,Cabin_G/998/P,Cabin_G/998/S,Cabin_G/999/P,Cabin_G/999/S,Cabin_T/0/P,Cabin_T/1/P,Cabin_T/2/P,Cabin_T/2/S,Cabin_T/3/P
0,0001_01,1,0.0,2,39.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,0002_01,0,0.0,2,24.0,0,109.0,9.0,25.0,549.0,...,0,0,0,0,0,0,0,0,0,0
2,0003_01,1,0.0,2,58.0,1,43.0,3576.0,0.0,6715.0,...,0,0,0,0,0,0,0,0,0,0
3,0003_02,1,0.0,2,33.0,0,0.0,1283.0,371.0,3329.0,...,0,0,0,0,0,0,0,0,0,0
4,0004_01,0,0.0,2,16.0,0,303.0,70.0,151.0,565.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,1,0.0,0,41.0,1,0.0,6819.0,0.0,1643.0,...,0,0,0,0,0,0,0,0,0,0
8689,9278_01,0,1.0,1,18.0,0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
8690,9279_01,0,0.0,2,26.0,0,0.0,0.0,1872.0,1.0,...,0,0,0,0,0,0,0,0,0,0
8691,9280_01,1,0.0,0,32.0,0,0.0,1049.0,0.0,353.0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
spaceship.dtypes

PassengerId     object
HomePlanet       int64
CryoSleep      float64
Destination      int64
Age            float64
                ...   
Cabin_T/0/P      int64
Cabin_T/1/P      int64
Cabin_T/2/P      int64
Cabin_T/2/S      int64
Cabin_T/3/P      int64
Length: 6572, dtype: object

In [63]:
#your code here

# convert categorical variables to numerical variables

# Ensure all columns except 'PassengerId' are numerical

spaceship.loc[:, spaceship.columns != 'PassengerId'] = spaceship.loc[:, spaceship.columns != 'PassengerId'].apply(pd.to_numeric)


Let's also define our target.

In [60]:
#your code here

# Separating features and target variable

features = spaceship.drop(columns=['Transported', 'PassengerId'])
target = spaceship['Transported']

features, target

(      HomePlanet  CryoSleep  Destination   Age  VIP  RoomService  FoodCourt  \
 0              1        0.0            2  39.0    0          0.0        0.0   
 1              0        0.0            2  24.0    0        109.0        9.0   
 2              1        0.0            2  58.0    1         43.0     3576.0   
 3              1        0.0            2  33.0    0          0.0     1283.0   
 4              0        0.0            2  16.0    0        303.0       70.0   
 ...          ...        ...          ...   ...  ...          ...        ...   
 8688           1        0.0            0  41.0    1          0.0     6819.0   
 8689           0        1.0            1  18.0    0          0.0        0.0   
 8690           0        0.0            2  26.0    0          0.0        0.0   
 8691           1        0.0            0  32.0    0          0.0     1049.0   
 8692           1        0.0            2  44.0    0        126.0     4688.0   
 
       ShoppingMall     Spa  VRDeck  .

**Train Test Split**

Now that we have split the data into **features** and **target** variables and imported the **train_test_split** function, split X and y (here, `features` and `target`) into X_train, X_test, y_train, and y_test. 80% of the data should be in the training set and 20% in the test set.

In [64]:
#your code here

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
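To see what an 80/20 split does, the sketch below runs `train_test_split` on a small made-up array and checks the resulting sizes. The `stratify=y` argument is an optional addition not used in the cell above; it keeps the class proportions the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 3 features, perfectly balanced binary target
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train:", X_train.shape, "test:", X_test.shape)
print("share of class 1 in train:", y_train.mean())
print("share of class 1 in test:", y_test.mean())
```

Fixing `random_state` makes the split reproducible, so later accuracy figures can be compared across runs.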

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

You need to choose between a **Classifier** and a **Regressor**. Consider the type of the target variable when deciding.

Initialize a KNN instance without setting any hyperparameter.

In [66]:
#your code here

from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier

knn = KNeighborsClassifier()

In [67]:
# Check the shapes of the splits to confirm

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (6954, 6570)
X_test shape: (1739, 6570)
y_train shape: (6954,)
y_test shape: (1739,)


Fit the model to your data.

In [68]:
#your code here

knn.fit(X_train, y_train)

In [69]:
# Predict the target variable for the test set

y_pred = knn.predict(X_test)

Evaluate your model.

In [71]:
#your code here

from sklearn.metrics import accuracy_score, classification_report

# Calculate accuracy

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7619321449108684


In [72]:
# Classification report

report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.74      0.75       861
           1       0.75      0.79      0.77       878

    accuracy                           0.76      1739
   macro avg       0.76      0.76      0.76      1739
weighted avg       0.76      0.76      0.76      1739



Key Metrics in Context of “Transported”:

1. Precision:

For class 0 (Not Transported): Precision of 0.77 means that 77% of the passengers the model predicted as not transported were actually not transported.
For class 1 (Transported): Precision of 0.75 means that 75% of the passengers the model predicted as transported were actually transported.

2. Recall:

For class 0 (Not Transported): Recall of 0.74 means that the model correctly identified 74% of the passengers who were actually not transported.
For class 1 (Transported): Recall of 0.79 means that the model correctly identified 79% of the passengers who were actually transported.

3. F1-Score:

For class 0 (Not Transported): The F1-score of 0.75 is the harmonic mean of precision (0.77) and recall (0.74), summarizing both in a single number.
For class 1 (Transported): The F1-score of 0.77 is the harmonic mean of precision (0.75) and recall (0.79).

4. Support:

For class 0 (Not Transported): The test set contains 861 passengers who were not transported.
For class 1 (Transported): The test set contains 878 passengers who were transported.

5. Accuracy:

The overall accuracy of 76% means that 76% of all test-set predictions (both transported and not transported) were correct.
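The F1-scores in the report can be reproduced from the rounded precision and recall values. The helper name `f1_from` is made up for this sketch; the formula is the standard harmonic mean of precision and recall.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the classification report above
f1_class0 = f1_from(0.77, 0.74)
f1_class1 = f1_from(0.75, 0.79)

print(round(f1_class0, 2), round(f1_class1, 2))  # prints: 0.75 0.77
```

Because the inputs are already rounded to two decimals, the match with the report is approximate rather than exact.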

**Congratulations, you have just developed your first Machine Learning model!**

Summary:

- Model Performance: Precision and recall are close for both classes, with slightly better recall for class 1. The overall accuracy is 76%, meaning the model predicted the correct class for 76% of the passengers in the test set.
- Class Balance: The support values (861 vs. 878) show that the two classes are nearly balanced in the test set, so accuracy is a meaningful summary here and neither class dominates the averages.

Conclusion:

- The KNN model with default hyperparameters reaches an accuracy of 76% on the test set. Precision and recall are similar for the two classes, which suggests the model does not strongly favor either one. There is still room for improvement, possibly through hyperparameter tuning, feature engineering, or trying different algorithms.
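As one possible next step for the hyperparameter tuning mentioned above, the sketch below searches over `n_neighbors` and `weights` with `GridSearchCV`, wrapping the classifier in a pipeline with `StandardScaler`. It runs on synthetic data from `make_classification` as a stand-in; the grid values are arbitrary choices, and the lab's own `features` and `target` would need to replace `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); swap in the lab's features and target
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale inside the pipeline so each CV fold is scaled using only its own training part
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate hyperparameters (illustrative values)
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```

Keeping the scaler inside the pipeline avoids leaking information from validation folds into the scaling step.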