# LAB | Intro to Machine Learning

**Load the data**

In this challenge, we will be working with Spaceship Titanic data. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [69]:
#import libraries
import pandas as pd
import requests
import numpy as np
import os
import time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score, classification_report

### Accessing the metadata, trying different options

In [39]:
# Test ping to GitHub raw content (-n 4 is the Windows flag for 4 echo requests; use -c 4 on Linux/macOS)
!ping raw.githubusercontent.com -n 4

# Time a curl request as a quick check that the connection works and how long a fetch takes
start = time.time()
metadata = !curl -k -s -L https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.md
end = time.time()
print(f"\nRequest time: {(end-start)*1000:.2f}ms")


Pinging raw.githubusercontent.com [185.199.110.133] with 32 bytes of data:
Reply from 185.199.110.133: bytes=32 time=13ms TTL=60
Reply from 185.199.110.133: bytes=32 time=12ms TTL=60
Reply from 185.199.110.133: bytes=32 time=12ms TTL=60
Reply from 185.199.110.133: bytes=32 time=13ms TTL=60

Ping statistics for 185.199.110.133:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 12ms, Maximum = 13ms, Average = 12ms

Request time: 273.40ms


In [40]:
# 1st option to see the metadata:
# fetch it with curl (-k skips SSL certificate verification, -s silences the progress meter, -L follows redirects)
metadata = !curl -k -s -L https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.md
print('\n'.join(metadata))

# How the print works:
# 1. metadata is a list where each element is one line of curl output
# 2. '\n'.join(metadata) joins the lines into a single string separated by newlines
# 3. print() displays the resulting string

# Spaceship Titanic Dataset

Metadata

1. **Id**: Unique identifier for each property.
1. **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
2. **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
3. **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
4. **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
5. **Destination** - The planet the passenger will be debarking to.
6. **Age** - The age of the passenger.
7. **VIP** - Whether the passenger has paid for special VIP service during the voyage.
RoomService, FoodCourt, ShoppingMal

In [41]:
# 2nd way to access the metadata file, using requests
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.md"
response = requests.get(url, timeout=10)  # timeout so a stalled connection cannot hang the cell

# Display the metadata, or report the HTTP status on failure
if response.status_code == 200:
    print(response.text)
else:
    print(f"Failed to retrieve metadata (HTTP {response.status_code})")

# Spaceship Titanic Dataset

Metadata

1. **Id**: Unique identifier for each property.
1. **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
2. **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
3. **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
4. **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
5. **Destination** - The planet the passenger will be debarking to.
6. **Age** - The age of the passenger.
7. **VIP** - Whether the passenger has paid for special VIP service during the voyage.
RoomService, FoodCourt, ShoppingMal

### Loading the dataset

In [52]:
# 1. Load dataset from GitHub
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

|  | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [53]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


**Check the shape of your data**

In [54]:
# 2. Shape Analysis
print("\n=== Dataset Shape ===")
print(f"Number of rows: {spaceship.shape[0]}")
print(f"Number of columns: {spaceship.shape[1]}")


=== Dataset Shape ===
Number of rows: 8693
Number of columns: 14


**Check for data types**

In [55]:
# 3. Data Types Check
print("\n=== Data Types ===")
print(spaceship.dtypes)


=== Data Types ===
PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object


**Check for missing values**

In [56]:
# 4. Missing Values Analysis
print("\n=== Missing Values ===")
missing_values = spaceship.isnull().sum()
missing_percentage = (missing_values / len(spaceship) * 100).round(2)

# Combine count and percentage in one DataFrame and sort ascending by count
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
}).sort_values('Missing Count', ascending=True)

# Display sorted results
print(missing_df)


=== Missing Values ===
              Missing Count  Missing Percentage
PassengerId               0                0.00
Transported               0                0.00
Age                     179                2.06
RoomService             181                2.08
Destination             182                2.09
FoodCourt               183                2.11
Spa                     183                2.11
VRDeck                  188                2.16
Cabin                   199                2.29
Name                    200                2.30
HomePlanet              201                2.31
VIP                     203                2.34
ShoppingMall            208                2.39
CryoSleep               217                2.50


There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.
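
As a rough illustration of the second strategy, here is a minimal sketch of filling with the mean and the mode. The `toy` DataFrame and its values are hypothetical stand-ins, not the Spaceship Titanic data; only the `fillna`, `mean` and `mode` calls are standard pandas.

```python
import pandas as pd

# Toy data (hypothetical values) with one continuous and one categorical column
toy = pd.DataFrame({
    "Age": [39.0, None, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

imputed = toy.copy()
# Continuous column: fill missing entries with the column mean
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())
# Categorical column: fill missing entries with the most frequent value (mode)
imputed["HomePlanet"] = imputed["HomePlanet"].fillna(imputed["HomePlanet"].mode()[0])

print(imputed)
```

Unlike dropping, this keeps every row, at the cost of introducing values that were never observed.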

For this exercise, because each column is missing only about 2% of its values, we will drop rows containing any missing value.

In [60]:
# 1. Store initial shape
initial_rows = spaceship.shape[0]
print(f"Initial number of rows: {initial_rows}")

# 2. Drop rows with missing values
spaceship_clean = spaceship.dropna()

# 3. Check new shape
final_rows = spaceship_clean.shape[0]
print(f"Final number of rows: {final_rows}")

# 4. Calculate the percentage of rows lost
lost_percentage = round((initial_rows - final_rows) / initial_rows * 100, 2)
print(f"Percentage of data lost: {lost_percentage}%")

# 5. Verify no missing values remain
print("\n=== Missing Values After Cleaning ===")
print(spaceship_clean.isnull().sum().sum())

Initial number of rows: 8693
Final number of rows: 6606
Percentage of data lost: 24.01%

=== Missing Values After Cleaning ===
0
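
Dropping rows costs far more than the ~2% per-column rates suggest, because a row is lost if any one of its columns is missing. The following back-of-the-envelope check multiplies the per-column completeness rates, using the counts from the missing-values table. It rests on the assumption (not verified here) that missingness is independent across columns.

```python
# Missing counts per column, copied from the "Missing Values" output
missing_counts = {
    "Age": 179, "RoomService": 181, "Destination": 182, "FoodCourt": 183,
    "Spa": 183, "VRDeck": 188, "Cabin": 199, "Name": 200,
    "HomePlanet": 201, "VIP": 203, "ShoppingMall": 208, "CryoSleep": 217,
}
n_rows = 8693

# Probability that a row is complete, ASSUMING independent missingness per column
p_complete = 1.0
for count in missing_counts.values():
    p_complete *= 1 - count / n_rows

expected_loss = 1 - p_complete
print(f"Expected share of rows with at least one missing value: {expected_loss:.1%}")
```

Under that assumption the estimate lands close to the 24.01% actually lost. The exact share can be computed on the real data with `spaceship.isnull().any(axis=1).mean()`.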


**KNN**

K Nearest Neighbors is a distance-based algorithm and requires all **input data to be numerical.**

Let's select only the numerical columns as our features.

In [65]:
# 1. Automatically select numerical features
X = spaceship_clean.select_dtypes(include=['int64', 'float64'])
# Drop the target in case it was selected (Transported is bool, so this is a no-op here)
X = X.drop('Transported', axis=1, errors='ignore')

# Verify selected columns
print("=== Automatically Selected Numerical Features ===")
print(X.dtypes)
print("\nShape:", X.shape)


=== Automatically Selected Numerical Features ===
Age             float64
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
dtype: object

Shape: (6606, 6)
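
To make "distance-based" concrete, here is a minimal from-scratch sketch of the idea: measure the Euclidean distance from a new point to every training point, then take a majority vote among the k closest. The `knn_predict` helper and the toy arrays are hypothetical illustrations, not scikit-learn's implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (hypothetical): two features, two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_toy = np.array([False, False, False, True, True, True])

print(knn_predict(X_toy, y_toy, np.array([1.2, 1.4]), k=3))  # point near the first group
print(knn_predict(X_toy, y_toy, np.array([8.7, 8.2]), k=3))  # point near the second group
```

Because the distance sums squared differences over all features, text columns cannot be used directly, and features with large numeric ranges weigh more heavily than small ones.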


Let's also define our target.

In [66]:
# 2. Define target variable
y = spaceship_clean['Transported']

# Check target distribution
print("\n=== Target Distribution ===")
print(y.value_counts(normalize=True).round(3))


=== Target Distribution ===
Transported
True     0.504
False    0.496
Name: proportion, dtype: float64


**Train Test Split**

Now that we have split the data into **features** and **target** variables and imported the **train_test_split** function, split X and y into X_train, X_test, y_train, and y_test. 80% of the data should be in the training set and 20% in the test set.

In [67]:
# 1. Split data into training and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X,                  # features
    y,                  # target
    test_size=0.2,     # 20% for testing
    random_state=42     # for reproducibility
)

# 2. Print shapes to verify split
print("=== Train-Test Split Shapes ===")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

=== Train-Test Split Shapes ===
X_train shape: (5284, 6)
X_test shape: (1322, 6)
y_train shape: (5284,)
y_test shape: (1322,)


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

You need to choose between a **classifier** and a **regressor**. Consider the type of the target variable when deciding.

Initialize a KNN instance without setting any hyperparameter.

In [70]:
# 1. Initialize KNN with default parameters
knn = KNeighborsClassifier()

Fit the model to your data.

In [75]:
# 2. Fit model to training data
knn.fit(X_train, y_train)

In [76]:
# 3. Make predictions
y_pred = knn.predict(X_test)

Evaluate your model.

In [79]:
# 4. Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Optional: Show detailed classification report
print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred))

Model Accuracy: 0.7716

=== Classification Report ===
              precision    recall  f1-score   support

       False       0.79      0.74      0.76       653
        True       0.76      0.80      0.78       669

    accuracy                           0.77      1322
   macro avg       0.77      0.77      0.77      1322
weighted avg       0.77      0.77      0.77      1322
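
The precision, recall and F1 columns in a report like this all derive from the confusion matrix. The sketch below recomputes them by hand for the `True` class. The labels are a hypothetical toy example, not the notebook's `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical), not the notebook's predictions
y_true = [True, True, True, True, False, False, False, False]
y_hat = [True, True, True, False, True, False, False, False]

# With labels sorted as [False, True], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision: of everything predicted True, how much really was True
precision_true = tp / (tp + fp)
# Recall: of everything that really was True, how much was found
recall_true = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"precision={precision_true:.2f} recall={recall_true:.2f} f1={f1_true:.2f}")
```

Swapping the roles of the two classes (treating `False` as the positive label) gives the other row of the report.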



### Step-by-Step Analysis of Model Results

1. **Accuracy Score**: 0.7716 (77.16%)
   - The model correctly predicts ~77% of the test cases
   - Well above the ~50.6% majority-class baseline (669 of 1322 test cases are True), but with room for improvement

2. **Class Distribution** (test set):
   - False: 653 cases
   - True: 669 cases
   - Nearly balanced classes

3. **Metrics per Class**:
   - Class False:
     - Precision: 0.79
     - Recall: 0.74
     - F1: 0.76
   
   - Class True:
     - Precision: 0.76
     - Recall: 0.80
     - F1: 0.78

4. **Overall Performance**:
   - Consistent metrics (~0.77)
   - Similar performance for both classes
   - No significant class bias

5. **Next Steps**:
   - Try different k values
   - Feature scaling
   - Feature selection/engineering
   - Cross-validation
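
Several of these next steps can be combined in one place. The sketch below puts a scaler and KNN in a pipeline and searches over `n_neighbors` with 5-fold cross-validation. It runs on synthetic data from `make_classification` as a stand-in (an assumption, to keep the example self-contained), and the candidate k values are arbitrary. In the notebook, `X` and `y` would replace `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); in the notebook, use X and y instead
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Keeping the scaler inside the pipeline means it is refit on each training fold,
# so no information from the validation fold leaks into the scaling
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# 5-fold cross-validated search over a handful of k values
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11, 15]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best k:", search.best_params_["knn__n_neighbors"])
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_te, y_te):.3f}")
```

Scaling matters for KNN in particular because the spending columns span thousands while `Age` spans tens, so unscaled distances are dominated by spending.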

**Congratulations, you have just developed your first Machine Learning model!**