<a href="https://colab.research.google.com/github/abelowska/dataPy/blob/main/Classes_03_KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Nonlinear regressors: K-Nearest Neighbours

Using the open-source [Obesity Levels Based On Eating Habits and Physical Condition dataset](https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition), we're going to **model** and then **predict** *weight* based on multiple features with a K-Nearest Neighbours (KNN) regressor, and compare it against simple linear regression. The dataset is provided by the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/), which hosts many datasets useful for studying and experimenting.

In [None]:
!pip install ucimlrepo

Imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid", palette="deep")

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn import set_config
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import (
    r2_score,
    median_absolute_error,
    mean_squared_error,
    mean_absolute_error,
    accuracy_score,
    classification_report,
    PredictionErrorDisplay,
)
from ucimlrepo import fetch_ucirepo

plt.rcParams["figure.figsize"] = (10,7)

In [None]:
# constants
test_size=0.2
random_state=42

In [None]:
def compute_score(y_true, y_pred):
  '''
  Helper function for printing scores.

  Parameters:
  y_true: ndarray of y values from original dataset.
  y_pred: ndarray of y values predicted with given model.

  Return:
  dictionary object that consists of R2 and median absolute error scores.

  '''
  return {
        "R2": f"{r2_score(y_true, y_pred):.3f}",
        "MedianAE": f"{median_absolute_error(y_true, y_pred):.3f}",
}

In [None]:
def compute_score_classification(y_true, y_pred):
  '''
  Helper function for printing scores.

  Parameters:
  y_true: ndarray of y values from original dataset.
  y_pred: ndarray of y values predicted with given model.

  Return:
  dictionary object that consists of accuracy and classification report.

  '''
  return {
        "Accuracy": f"{accuracy_score(y_true, y_pred):.3f}",
        "Classification Report": classification_report(y_true, y_pred),
}

In [None]:
def plot_prediction_error(y_test, y_pred, scores):
  _, ax = plt.subplots(figsize=(5, 5))

  y_test = y_test.to_numpy() if isinstance(y_test, pd.DataFrame) else y_test

  display_ = PredictionErrorDisplay.from_predictions(
      y_test,
      y_pred,
      kind="actual_vs_predicted",
      ax=ax,
      scatter_kwargs={"alpha": 0.5}
  )

  # generic title: this helper is reused for both the linear and the KNN model
  ax.set_title("Actual vs. predicted")
  for name, score in scores.items():
      ax.plot([], [], " ", label=f"{name}: {score}")
  ax.legend(loc="upper left")
  plt.tight_layout()

## Load dataset

In [None]:
# fetch dataset
obesity_data = fetch_ucirepo(id=544)
obesity_data_df = obesity_data.data.features
obesity_data_df.head()

Inspect the dataset

In [None]:
obesity_data_df.describe()

## Regression

Recall the model from the last classes:

*Weight ~ Age + FCVC + Height*

So far, we have modeled this relationship using linear regression. Let's see how well a non-linear estimator, KNN, manages to model the same relationship.

To create a model with the KNN estimator, you simply need to create an object of type [`KNeighborsRegressor()`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) instead of the linear regression object. Give yourself a moment to read the documentation of the KNN regressor.
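To get an intuition for what the regressor does, here is a minimal sketch on made-up toy data (the `*_toy` arrays and the query point are illustrative assumptions, not part of the obesity dataset). With default uniform weights, the prediction is simply the average target of the `k` closest training points, which we can compute by hand and compare with scikit-learn:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy, made-up data: one feature, five training points
X_toy = np.array([[1.0], [2.0], [3.0], [6.0], [7.0]])
y_toy = np.array([10.0, 12.0, 14.0, 30.0, 32.0])

query = np.array([[2.4]])
k = 3

# By hand: distance to every training point, keep the k closest,
# and average their target values
distances = np.abs(X_toy[:, 0] - query[0, 0])
nearest_idx = np.argsort(distances)[:k]
manual_prediction = y_toy[nearest_idx].mean()

# The same computation with scikit-learn (uniform weights by default)
knn_toy = KNeighborsRegressor(n_neighbors=k)
knn_toy.fit(X_toy, y_toy)
sklearn_prediction = knn_toy.predict(query)[0]

print(manual_prediction, sklearn_prediction)
```

Both values should agree, because no model is "trained" in the usual sense: KNN stores the training data and looks up neighbours at prediction time.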

### Exercise 1
Let's compare the performance of KNN to the linear model.

1. The first model is the simplest linear model, which we have already built:

In [None]:
# Linear regression model - for comparison
X = obesity_data_df[[
    'Age',
    'FCVC',
    'Height',
]]

y = obesity_data_df[['Weight']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

# create object of linear regression estimator
lm = linear_model.LinearRegression()

lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)

scores = compute_score(y_test, y_pred)
scores

In [None]:
plot_prediction_error(y_test, y_pred, scores)

2. KNN Regressor model with *default parameters*:

In [None]:
X = obesity_data_df[[
    'Age',
    'FCVC',
    'Height',
]]

y = obesity_data_df[['Weight']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

# create object of KNN estimator
knn = KNeighborsRegressor()

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

scores = compute_score(y_test, y_pred)
scores

In [None]:
plot_prediction_error(y_test, y_pred, scores)

There is quite a difference! Without any tuning, and with zero knowledge of the data, KNN gave a better result than the linear regression we built using everything we knew about the data.

### Exercise 2
Now try experimenting with KNN. Maybe you can get even more out of the model by changing its parameters?

Create at least 3 different KNN models, changing at least 2 different parameters.

In [None]:
# your code here

### (Exercise 2.1)

Overly complex models tend to overfit, i.e., they match the training data too closely and begin to perform poorly on the test data.
Plot training and testing performance against the number of neighbors to see how the model's tendency to overfit changes with the number of neighbors.
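As a starting point, here is a rough sketch of such a plot on synthetic data (a noisy sine curve generated below; the data, the variable names, and the range of 1 to 30 neighbors are all assumptions made for illustration). You will need to adapt it to the obesity dataset yourself:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, made-up data: a noisy sine curve
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Fit one model per value of n_neighbors and record R2 on both splits
neighbors = range(1, 31)
train_r2, test_r2 = [], []
for k in neighbors:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    train_r2.append(model.score(X_tr, y_tr))  # score() returns R2 for regressors
    test_r2.append(model.score(X_te, y_te))

fig, ax = plt.subplots()
ax.plot(neighbors, train_r2, marker="o", label="train R2")
ax.plot(neighbors, test_r2, marker="o", label="test R2")
ax.set_xlabel("number of neighbors")
ax.set_ylabel("R2")
ax.legend()
```

Note that with a single neighbor every training point is its own nearest neighbor, so the training score is perfect. A large gap between the training and testing curves is the usual sign of overfitting.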

In [None]:
# Your code here

What do you think this chart means for the problem of predicting weight based on age, FCVC, and height? How many people with similar profiles is it best to look at to make a good prediction of weight?

## Classification

Classification is a type of supervised learning task in machine learning and statistics where the goal is to assign labels or categories to input data. Essentially, classification involves predicting the **category or class** of new observations based on previous examples with known labels.
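For KNN, "predicting the class" comes down to a majority vote among the nearest neighbours. The sketch below illustrates this on made-up toy data (the labels `"low"` and `"high"` and all `*_toy` names are assumptions for illustration only), first by hand and then with scikit-learn:

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy, made-up data: one feature, two classes
X_toy = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
labels_toy = np.array(["low", "low", "low", "high", "high", "high"])

query = np.array([[3.6]])
k = 3

# By hand: find the k closest training points and take the most common label
distances = np.abs(X_toy[:, 0] - query[0, 0])
nearest_idx = np.argsort(distances)[:k]
manual_label = Counter(labels_toy[nearest_idx]).most_common(1)[0][0]

# The same vote with scikit-learn (uniform weights by default)
clf_toy = KNeighborsClassifier(n_neighbors=k)
clf_toy.fit(X_toy, labels_toy)
sklearn_label = clf_toy.predict(query)[0]

print(manual_label, sklearn_label)
```

Here two of the three nearest neighbours carry the label `"low"`, so both approaches assign the query point to that class.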

### Exercise 3

Originally, our dataset was designed for predicting obesity levels based on eating habits and physical condition. Now we are going to create a classification model:

*Obesity level ~ Age + FCVC + Height*

The obesity level is stored in the `data.targets` attribute of the `obesity_data` variable.

Look into the documentation of [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) and write the code, following the same pattern as in the regression analysis.
To check the classification results, use the predefined `compute_score_classification()` function and print each metric separately. How do you interpret the results of the model?
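If the two metrics are new to you, the following sketch shows what they look like on a handful of made-up labels (the label names and the `toy_scores` dictionary are illustrative assumptions; the dictionary mirrors the shape returned by `compute_score_classification()`):

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up true and predicted labels, for illustration only
y_true_toy = ["Normal", "Normal", "Obese", "Obese", "Overweight", "Overweight"]
y_pred_toy = ["Normal", "Obese", "Obese", "Obese", "Overweight", "Normal"]

toy_scores = {
    "Accuracy": f"{accuracy_score(y_true_toy, y_pred_toy):.3f}",
    "Classification Report": classification_report(y_true_toy, y_pred_toy),
}

# Print each metric separately; the report is a multi-line string,
# so it only reads well when passed through print()
for name, value in toy_scores.items():
    print(name)
    print(value)
```

Accuracy is the fraction of correct predictions (4 out of 6 here), while the report breaks performance down per class into precision, recall, and F1-score.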

In [None]:
X = obesity_data_df[[
    'Age',
    'FCVC',
    'Height',
]]

y = # your code here

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

# create object of KNN Classifier
knn = # your code here

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

scores = compute_score_classification(y_test, y_pred)
# print scores