# Intro to Supervised Machine Learning


## Classification

#### Loading and preparing the data

In [1]:
from sklearn.datasets import load_iris, fetch_california_housing
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
iris = load_iris()
print(iris["DESCR"])

In [None]:
df = pd.DataFrame(iris["data"], columns = iris["feature_names"])
df["target"] = iris["target"]

df.head()

#### Checking for anomalies

In [None]:
df.info()
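
Beyond `df.info()`, a few explicit checks can make potential anomalies easier to spot. The cell below is a minimal sketch, not part of the original workflow: it rebuilds the same DataFrame and counts missing values, duplicated rows and samples per class. The variable names `missing_per_column`, `n_duplicates` and `class_counts` are our own choices for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the same DataFrame as above so this cell runs on its own.
iris = load_iris()
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
df["target"] = iris["target"]

# Missing values per column; a clean dataset shows only zeros.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that are exact copies of an earlier row (same measurements and label).
n_duplicates = df.duplicated().sum()
print("Duplicated rows:", n_duplicates)

# Number of samples per class, to check whether the classes are balanced.
class_counts = df["target"].value_counts().sort_index()
print(class_counts)
```

Duplicated rows are not necessarily errors, since two flowers can share the same measurements, so whether to drop them is a judgment call.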

#### Quick EDA

In [None]:
sns.pairplot(df, hue="target")
plt.show()

#### Train Test Split

First, we need to separate the features from the target.

In [6]:
features = df.drop(columns = ["target"])
target = df["target"]

Now we split the data into a training set and a test set, reserving 20% of the data for testing.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [None]:
X_train.head()

In [None]:
y_train.head()
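
To make the effect of the split concrete, the sketch below checks the sizes of the resulting sets. It also shows one optional variation that is not used in this notebook: passing `stratify` so that each class keeps the same proportion in both sets. The names with the `_s` suffix are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and target directly as pandas objects.
X, y = load_iris(return_X_y=True, as_frame=True)

# Same split as above: 20% of the 150 rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
print("Train shape:", X_train.shape)
print("Test shape: ", X_test.shape)

# Optional variation (an assumption, not used in this notebook):
# stratify=y keeps the class proportions equal in train and test.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)
print(y_test_s.value_counts().sort_index())
```

Fixing `random_state` makes the split reproducible, which is why the same rows end up in the test set every time the cell is run.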

For this dataset, we will use a KNN classifier.

In [10]:
from sklearn.neighbors import KNeighborsClassifier

We create an instance of the model. For now, we will use n_neighbors=3 (we will see how to optimize this hyperparameter later).

In [11]:
knn = KNeighborsClassifier(n_neighbors=3)

Training the model

In [None]:
knn.fit(X_train, y_train)

Now that our model is trained, we can make predictions for new data points.

In [None]:
pred = knn.predict(X_test)
pred
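
To see what `predict` does under the hood, here is a minimal sketch of the KNN vote done by hand. It uses a tiny hand-made 2D dataset (hypothetical values, chosen only for illustration) instead of the iris data, assumes Euclidean distance (the scikit-learn default) and compares the manual result with `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three points of class 0 and three of class 1.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
query = np.array([4.4, 4.0])

# Step 1: Euclidean distance from the query to every training point.
distances = np.linalg.norm(X_toy - query, axis=1)

# Step 2: indices of the 3 closest training points.
nearest_idx = np.argsort(distances)[:3]
neighbor_labels = y_toy[nearest_idx]

# Step 3: majority vote among the neighbors' labels.
manual_pred = np.bincount(neighbor_labels).argmax()

# The same prediction with scikit-learn.
toy_knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
sklearn_pred = toy_knn.predict(query.reshape(1, -1))[0]

print("Neighbor labels:", neighbor_labels)
print("Manual prediction:", manual_pred, "| sklearn prediction:", sklearn_pred)
```

Here two of the three nearest neighbors belong to class 1 and one belongs to class 0, so the vote goes to class 1.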

Let's compare the predictions with the true labels.

In [None]:
y_test.values

It seems that our model correctly predicted the vast majority of data points. To be precise, 29 out of 30 data points were correctly labeled.

To evaluate our model, we will use the "score" method, which for a classifier returns the accuracy on the data it is given.

In [None]:
knn.score(X_test, y_test)
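
Accuracy is simply the fraction of correctly predicted labels. As a sanity check, the sketch below computes it by hand and compares it with `knn.score` and with `accuracy_score` from `sklearn.metrics`. It repeats the split and the training so that the cell runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data, split and model as above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
pred = knn.predict(X_test)

# Accuracy by hand: correct predictions divided by total predictions.
n_correct = (pred == y_test).sum()
manual_accuracy = n_correct / len(y_test)

print("Correct predictions:", n_correct, "out of", len(y_test))
print("Manual accuracy:    ", manual_accuracy)
print("knn.score:          ", knn.score(X_test, y_test))
print("accuracy_score:     ", accuracy_score(y_test, pred))
```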

## Regression

#### Loading and preparing the data

In [None]:
california = fetch_california_housing()
print(california["DESCR"])

In [None]:
df_cali = pd.DataFrame(california["data"], columns = california["feature_names"])
df_cali["median_house_value"] = california["target"]

df_cali.head()

#### Checking for anomalies

In [None]:
df_cali.info()

#### Quick EDA

In [None]:
sns.pairplot(df_cali, y_vars=['median_house_value'], x_vars=df_cali.columns[:-1], kind='scatter')
plt.show()

#### Train Test Split

First, we need to separate the features from the target.

In [20]:
features = df_cali.drop(columns = ["median_house_value"])
target = df_cali["median_house_value"]

Now we split the data into a training set and a test set, reserving 20% of the data for testing.

In [21]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

In [None]:
X_train.head()

In [None]:
y_train.head()

Now we will use a KNN regressor with the hyperparameter n_neighbors=10.

In [24]:
from sklearn.neighbors import KNeighborsRegressor

In [25]:
knn = KNeighborsRegressor(n_neighbors=10)

In [None]:
knn.fit(X_train, y_train)
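
With its default settings, a KNN regressor predicts the average target value of the nearest neighbors instead of taking a vote. The sketch below illustrates this on a tiny one-dimensional dataset with hypothetical values (not the housing data), using n_neighbors=3 to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: one feature, target equal to the feature value.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
query = np.array([[2.2]])

# By hand: find the 3 closest points and average their targets.
distances = np.abs(X_toy[:, 0] - query[0, 0])
nearest_idx = np.argsort(distances)[:3]
manual_pred = y_toy[nearest_idx].mean()

# The same prediction with scikit-learn (uniform weights by default).
toy_knn = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)
sklearn_pred = toy_knn.predict(query)[0]

print("Neighbor targets:", y_toy[nearest_idx])
print("Manual prediction:", manual_pred, "| sklearn prediction:", sklearn_pred)
```

The three nearest points to 2.2 are 2, 3 and 1, so the prediction is their mean.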

We are going to evaluate our model's performance with R-squared (R2), which is what the "score" method returns for a regressor.

In [None]:
knn.score(X_test, y_test)
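
R2 compares the model's squared errors with those of a baseline that always predicts the mean of the true values: R2 = 1 - SS_res / SS_tot. The sketch below computes it by hand and checks it against `knn.score`. To keep the cell runnable without any download, it uses the diabetes dataset bundled with scikit-learn as a stand-in, so the value it prints is not the score of the housing model above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Stand-in regression data (an assumption for illustration only).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
knn = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Residual sum of squares: errors made by the model.
ss_res = np.sum((y_test - y_pred) ** 2)
# Total sum of squares: errors made by always predicting the mean.
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print("Manual R2:", manual_r2)
print("knn.score:", knn.score(X_test, y_test))
```

An R2 of 1 means perfect predictions, 0 means the model is no better than predicting the mean, and negative values mean it is worse than that baseline.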

This is a pretty bad model. Remember, we want R2 to be as high as possible (its maximum value is 1)!

KNN is a distance-based model, so features on different scales hurt its performance: features with larger numeric ranges dominate the distance computation.
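
The sketch below illustrates this effect on synthetic data (an assumption for illustration, not the housing data). One feature on a small scale drives the target, while a second feature is pure noise on a much larger scale. We compare a KNN regressor on the raw features with the same model preceded by `StandardScaler`, which is one common way to put features on a comparable scale.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: x_small drives the target, x_large is pure noise
# whose values are about 1000 times larger.
rng = np.random.default_rng(0)
n_samples = 500
x_small = rng.uniform(0, 1, n_samples)
x_large = rng.uniform(0, 1000, n_samples)
X = np.column_stack([x_small, x_large])
y = 10 * x_small + rng.normal(0, 0.1, n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# KNN on the raw features: distances are dominated by x_large.
raw_knn = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
r2_raw = raw_knn.score(X_test, y_test)

# Same model, but every feature is standardized first
# (zero mean, unit variance, fitted on the training set only).
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
scaled_knn.fit(X_train, y_train)
r2_scaled = scaled_knn.score(X_test, y_test)

print(f"R2 without scaling: {r2_raw:.3f}")
print(f"R2 with scaling:    {r2_scaled:.3f}")
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then applied to the test data, which avoids leaking information from the test set.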