# LAB | Intro to Machine Learning

**Load the data**

In this challenge, we will be working with Spaceship Titanic data. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [98]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [99]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [100]:
#your code here

spaceship.shape

(8693, 14)

**Check for data types**

In [101]:
#your code here

spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [102]:
#your code here

spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column is missing only about 2% of its values, we will drop rows containing any missing value.

In [103]:
#your code here

spaceship.dropna(inplace=True)
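
To make the first two strategies listed above concrete, here is a minimal sketch on a tiny made-up DataFrame (`toy` is a hypothetical stand-in, not the Spaceship Titanic data). It compares dropping incomplete rows with filling gaps using the mean or the mode.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (NOT the real dataset) with gaps in a numeric
# and a categorical column.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Strategy 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Strategy 2: fill numeric gaps with the column mean and
# categorical gaps with the most frequent value (the mode).
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print("rows after dropping:", len(dropped))
print("rows after filling: ", len(filled))
print(filled)
```

Dropping is the simplest option but discards every row with even a single gap, whereas filling keeps all rows at the cost of introducing estimated values.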

**KNN**

K Nearest Neighbors is a distance-based algorithm and requires all **input data to be numerical.**

Let's only select numerical columns as our features.

In [104]:
#your code here

spaceship_num = spaceship.select_dtypes(include=['int64', 'float64'])
spaceship_num.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,39.0,0.0,0.0,0.0,0.0,0.0
1,24.0,109.0,9.0,25.0,549.0,44.0
2,58.0,43.0,3576.0,0.0,6715.0,49.0
3,33.0,0.0,1283.0,371.0,3329.0,193.0
4,16.0,303.0,70.0,151.0,565.0,2.0
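
Because KNN compares rows by distance, columns measured in large units can dominate columns measured in small ones. The sketch below illustrates this on three made-up passengers and shows one common remedy, standardization with scikit-learn's `StandardScaler`. Scaling is not part of this lab's steps; the data and the helper `euclidean` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up passengers (NOT from the dataset): columns are [Age, Spa spending].
X = np.array([
    [20.0, 0.0],    # passenger 0
    [60.0, 0.0],    # passenger 1: differs from 0 only in Age
    [20.0, 500.0],  # passenger 2: differs from 0 only in Spa
])

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# In raw units the spending gap dwarfs the age gap.
d_age_raw = euclidean(X[0], X[1])
d_spa_raw = euclidean(X[0], X[2])
print("raw:   ", d_age_raw, d_spa_raw)

# After standardizing each column to zero mean and unit variance,
# both gaps are expressed in standard deviations and become comparable.
X_scaled = StandardScaler().fit_transform(X)
d_age_scaled = euclidean(X_scaled[0], X_scaled[1])
d_spa_scaled = euclidean(X_scaled[0], X_scaled[2])
print("scaled:", d_age_scaled, d_spa_scaled)
```

In this toy case the two scaled distances come out equal, since each column has the same pattern of one distinct value and two identical ones.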


Let's also define our target.

In [105]:
#your code here

spaceship_num.corr()

# All the variables are only weakly correlated with each other in general. Given that, and since we are not told which variable to use as the target, I will choose FoodCourt,
# which is among those with the strongest correlations; I also tried Age and it gave very poor results.

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
Age,1.0,0.074783,0.135844,0.042314,0.12382,0.105031
RoomService,0.074783,1.0,-0.013614,0.060478,0.012472,-0.026002
FoodCourt,0.135844,-0.013614,1.0,-0.01232,0.215995,0.216997
ShoppingMall,0.042314,0.060478,-0.01232,1.0,0.022168,0.000383
Spa,0.12382,0.012472,0.215995,0.022168,1.0,0.149447
VRDeck,0.105031,-0.026002,0.216997,0.000383,0.149447,1.0


In [106]:
features = spaceship_num.drop(['FoodCourt'], axis=1)
target = spaceship_num['FoodCourt']

**Train Test Split**

Now that we have split the data into **features** and **target** variables and imported the **train_test_split** function, split them into X_train, X_test, y_train, and y_test. 80% of the data should be in the training set and 20% in the test set.

In [107]:
#your code here

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
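
As a quick sanity check on what an 80/20 split produces, here is a small sketch on synthetic data (`toy_features` and `toy_target` are made-up stand-ins, not the lab's variables). It confirms the resulting sizes and that features and target stay aligned by index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 feature columns, 1 target.
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
toy_target = pd.Series(rng.normal(size=100), name="y")

# test_size=0.2 keeps 80% for training; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # 80 training rows, 20 test rows

# Each feature row keeps the same index label as its target value,
# and no row appears in both sets.
print((X_tr.index == y_tr.index).all())
print(set(X_tr.index).isdisjoint(X_te.index))
```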

In [108]:
X_train.head()

Unnamed: 0,Age,RoomService,ShoppingMall,Spa,VRDeck
7832,25.0,0.0,0.0,642.0,612.0
5842,36.0,0.0,1657.0,2799.0,1.0
3928,34.0,0.0,0.0,0.0,0.0
4091,37.0,0.0,0.0,0.0,0.0
7679,22.0,0.0,0.0,0.0,0.0


In [109]:
y_train.head()

7832    1673.0
5842    2624.0
3928       0.0
4091       0.0
7679       0.0
Name: FoodCourt, dtype: float64

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

You need to choose between a **Classifier** and a **Regressor**. Consider the type of the target variable when deciding.

Initialize a KNN instance without setting any hyperparameter.

In [110]:
#your code here
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()
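
The choice follows from the target's type: a continuous target such as `FoodCourt` calls for a regressor, while a boolean or categorical target (for instance, a column like `Transported`) would call for a classifier. A minimal sketch on synthetic data, where `y_bool` and `y_cont` are made-up targets for illustration only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Synthetic features and two made-up targets (NOT the lab's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_bool = (X[:, 0] + X[:, 1]) > 0      # boolean labels  -> classification
y_cont = 3.0 * X[:, 0] - X[:, 1]      # continuous values -> regression

# Both estimators are created without hyperparameters,
# so they use the default of 5 neighbors.
clf = KNeighborsClassifier().fit(X, y_bool)
reg = KNeighborsRegressor().fit(X, y_cont)

# The classifier predicts a class label by majority vote among neighbors;
# the regressor predicts the average target value of the neighbors.
print(clf.predict(X[:3]))
print(reg.predict(X[:3]))
```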

Fit the model to your data.

In [111]:
#your code here

knn.fit(X_train, y_train)

In [112]:
knn.score(X_test, y_test)

-0.008345805818149588
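
For a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. A value of 0 matches always predicting the mean of the test target, and a negative value means the model does worse than that baseline. The sketch below reproduces the calculation by hand on made-up numbers (`y_true` and `y_pred` here are hypothetical, not the lab's results).

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and deliberately poor predictions.
y_true = np.array([0.0, 0.0, 10.0, 30.0])
y_pred = np.array([20.0, 20.0, 0.0, 0.0])

# Residual sum of squares: error of the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual)                 # negative: worse than the mean baseline
print(r2_score(y_true, y_pred))  # scikit-learn's implementation agrees

# Predicting the mean everywhere gives exactly 0.
baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline))
```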

Evaluate your model.

In [113]:
#your code here

pred = knn.predict(X_test)

print(pred)

[ 0.   0.   0.  ...  0.   0.  98.4]
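
Printing the raw predictions makes them hard to judge by eye. One possible way to summarize them is with error metrics expressed in the target's own units, such as the mean absolute error (MAE) and the root mean squared error (RMSE). A sketch on made-up values (the arrays below are hypothetical, not the notebook's `pred` and `y_test`):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true spending values and predictions for four passengers.
y_true = np.array([0.0, 0.0, 3.0, 980.0])
y_pred = np.array([0.0, 10.0, 3.0, 480.0])

# MAE: average size of the errors, each error weighted equally.
mae = mean_absolute_error(y_true, y_pred)

# RMSE: square root of the mean squared error; large misses weigh more.
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

print("MAE: ", mae)
print("RMSE:", rmse)
```

With a target where most values are zero and a few are very large, RMSE tends to be driven by the handful of big misses, so reporting both metrics gives a fuller picture.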


In [114]:
y_test

8441        0.0
8058        0.0
320         0.0
2548        0.0
8027        3.0
8661        0.0
8076        0.0
6843      980.0
3035        0.0
3210        0.0
7456        0.0
7287        0.0
2438        0.0
1899        6.0
7571     4655.0
8041        0.0
8473       17.0
3488        0.0
3455        0.0
7345       60.0
5769        0.0
8541       12.0
4663       10.0
2836        0.0
2982        0.0
4010        0.0
3399      432.0
1286        0.0
5886        0.0
6463     1208.0
3219        0.0
2578      756.0
2490       33.0
2321        0.0
1589        0.0
613      3344.0
4235        0.0
3134       56.0
2719       25.0
5545        0.0
4194        0.0
810         0.0
660         0.0
3567        4.0
5785        0.0
4182        0.0
5357        0.0
5505      131.0
42        164.0
5657        0.0
6924      185.0
6491      752.0
7906     2154.0
3418        0.0
1394        0.0
8524      475.0
131         0.0
1603        4.0
1689      275.0
21          0.0
1539       32.0
5467        0.0
3115    

**Congratulations, you have just developed your first Machine Learning model!**