# K-Nearest-Neighbors

👇 Load the `houses_clean.csv` dataset located in the `data` folder  
Or you can load it directly from this URL: [https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_clean.csv](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_clean.csv).  

The dataset description can be found [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset_description.txt).

In [None]:
# YOUR CODE HERE

💡 Most features are already preprocessed (scaled with normalization), as you did during the Data Preparation day  

💡 One feature, `GrLiveArea`, is not normalized. We keep it that way to see the impact of its normalization on our model performance later on  

👇 You can see this in the descriptive statistics: check the min and max of each feature  

In [None]:
df.describe()
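
As a rough illustration of reading the min and max, here is a minimal sketch on a toy DataFrame. The values and the `scaled_feature` column are made up, and the check assumes that "normalized" means min-max scaled to [0, 1], which may not hold for every column of the real dataset.

```python
import pandas as pd

# Toy stand-in for the houses data; values and `scaled_feature` are made up.
toy_df = pd.DataFrame({
    "scaled_feature": [0.0, 0.25, 1.0],
    "GrLiveArea": [850.0, 1500.0, 3200.0],
})

# Keep only the min and max rows of the summary, one column per feature.
value_ranges = toy_df.describe().loc[["min", "max"]]

# Assumption: a min-max normalized feature spans exactly [0, 1].
not_normalized = [
    col for col in value_ranges.columns
    if value_ranges.loc["min", col] != 0 or value_ranges.loc["max", col] != 1
]
print(not_normalized)
```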

# Default KNN

🎯 The task is to predict the price of houses (`SalePrice`) with all the features.

👇 Use cross validation to evaluate a default [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) on this task.  
❓ What is the proportion of the variance in `SalePrice` that is explained by the features?  
Save your answer in a variable named `base_knn_score`.

<details>
<summary> 💡 Hint </summary>
    <br>
    ℹ️ The proportion of the variance in the dependent variable that is explained by the independent variables is the R2 score.
</details>
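
To make the hint concrete, here is a minimal sketch of cross-validating a default KNN regressor. It runs on synthetic data, not on the houses dataset, and the names `X_demo`, `y_demo` and `demo_r2` are placeholders chosen for this example.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the houses dataset): 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + np.sin(4 * X_demo[:, 1]) + rng.normal(0, 0.1, size=200)

# A default KNeighborsRegressor uses n_neighbors=5.
# With no `scoring` argument, a regressor is scored with R2.
cv_results = cross_validate(KNeighborsRegressor(), X_demo, y_demo, cv=5)

# One R2 per fold; the mean is the cross-validated score.
demo_r2 = cv_results["test_score"].mean()
print(round(demo_r2, 3))
```

On the real data, the same pattern would take the feature matrix and `SalePrice` in place of `X_demo` and `y_demo`.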

In [None]:
# YOUR CODE HERE

### 🧪 Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('default_score',
                         score = base_knn_score)
result.write()
print(result.check())

# Scale sensitivity

KNNs and distance-based algorithms can be extremely sensitive to the scale of the features. 

👇 Rescale the feature set so that every feature lies within an **exact common range**, and save it under a variable named `X_rescaled`  
Then, evaluate a model on the rescaled features and save its score under variable name `rescaled_score`.

<details>
<summary> 💡 Hint </summary>
    
`MinMaxScaler()`

Even though only `GrLiveArea` needs to be normalized, using the `MinMaxScaler` on all your features is fine  
    
Indeed, Min-Max Scaling is an [idempotent](https://en.wikipedia.org/wiki/Idempotence) transformation: a feature that is already scaled is left unchanged, because if $X_{max}=1$ and $X_{min}=0$, then $X = \frac{X - X_{min}}{X_{max} - X_{min}}$
</details>
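
The idempotence claim can be checked on a toy matrix. This is a small sketch with made-up values; `X_toy`, `X_once` and `X_twice` are illustrative names only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix: first column already in [0, 1], second column on a larger scale.
X_toy = np.array([[0.0, 800.0], [0.5, 1500.0], [1.0, 3200.0]])

# Scale once, then scale the result a second time.
X_once = MinMaxScaler().fit_transform(X_toy)
X_twice = MinMaxScaler().fit_transform(X_once)

# The already-scaled column is untouched by the first pass.
print(np.allclose(X_once[:, 0], X_toy[:, 0]))
# A second pass changes nothing: the transformation is idempotent.
print(np.allclose(X_once, X_twice))
```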


In [None]:
# YOUR CODE HERE

👉 The R2 score should have increased!

💡 It is preferable for features to be in an exact common range when using distance-based algorithms.  
However, it does not always guarantee a better score.  
It is a trial and error process.

### 🧪 Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('scale_sensitivity',
                         base_score = base_knn_score,
                         rescaled_features = X_rescaled,
                         rescaled_score = rescaled_score)
result.write()
print(result.check())

# Optimizing $k$

👇 Fine-tune the parameter K (using the parameter `n_neighbors`) of a KNNRegressor on the rescaled features. Plot the evolution of the score as K increases from 1 to 25.
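
One possible way to organize the search is a simple loop over K, sketched below on synthetic data. The data, the 5-fold setting and the names `k_values`, `mean_scores` and `demo_best_k` are assumptions made for this example, and the K it finds says nothing about the houses dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data with features already in [0, 1].
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(300, 2))
y_demo = np.sin(6 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(0, 0.3, size=300)

# Cross-validated R2 for each K from 1 to 25 (inclusive).
k_values = list(range(1, 26))
mean_scores = []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k)
    mean_scores.append(cross_val_score(model, X_demo, y_demo, cv=5).mean())

# K with the highest mean score on this toy data.
demo_best_k = k_values[int(np.argmax(mean_scores))]
print(demo_best_k, round(max(mean_scores), 3))
```

Plotting `mean_scores` against `k_values` (for instance with matplotlib) gives the curve the exercise asks for.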

In [None]:
# YOUR CODE HERE

❓ Which value of K produces the best performance? Save your answer under variable name `best_k`.

In [None]:
# YOUR CODE HERE

<details>
<summary> 👉 Solution 👈</summary>
    
By looking at your graph, you should see that the score stops increasing around k = 5 and the maximum score is reached for k = 11.

</details>



❓ What is your interpretation of the poor performance of the model for values $k$ < 5?

<details>
<summary> 👉 Solution 👈</summary>
    
When K is too small, the model will tend to overfit to the training set. It will focus on too few points to be able to generalize well. Increasing K will give the model more examples to base its predictions on.

</details>



### 🧪 Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('optimal_k',
                         optimal_k = best_k)
result.write()
print(result.check())

# Overfitting a KNN 

💡 When the parameter K of KNNs is too small, there is a risk of overfitting the training set and not being able to generalize well. 

👇 Plot the learning curves of a KNN with parameter K=2.

In [None]:
# YOUR CODE HERE

👉 You should observe a high training score, but a low testing score. ⚠️ Overfitting alert ⚠️ This gap is caused by a value of K that is too low.
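
The gap between training and validation scores can be reproduced on synthetic data with scikit-learn's `learning_curve`. This is only a sketch: the data, the eight training sizes and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data with a noisy, non-linear target.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(400, 2))
y_demo = np.sin(6 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(0, 0.3, size=400)

# R2 on the training folds and on the validation folds, for growing training sets.
train_sizes, train_scores, test_scores = learning_curve(
    KNeighborsRegressor(n_neighbors=2),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
)

# Average over the 5 folds to get one point per training size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print(f"{int(n):4d} samples  train R2={tr:.3f}  validation R2={te:.3f}")
```

Plotting `train_mean` and `test_mean` against `train_sizes` gives the two learning curves. Changing `n_neighbors` lets you compare the gap for other values of K.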

# Ideal K

👇 This time, plot the learning curves for the ideal K value you found in the "Optimizing $k$" section.

In [None]:
# YOUR CODE HERE

👉 The curves should be close to converging, which indicates that the model is overfitting less and generalizing better.

💡 There are two key elements to remember when modeling with KNN models:  
    1. Distance-based algorithms are extremely sensitive to the scale of features  
    2. K must be tuned: it controls the tradeoff between overfitting (K too small) and underfitting (K too large)

❓ What is the average difference between actual price and predicted price of the optimized KNN model? Compute your answer and save it under variable name `price_error`.

<details>
<summary> 💡 Hint </summary>
    
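The metric you should calculate is the **Negative Mean Absolute Error (MAE)**. In scikit-learn, the corresponding scoring name is `neg_mean_absolute_error`: it returns the MAE with a minus sign, so that a higher score is always better.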

</details>
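
As an illustration of the sign convention, here is a minimal sketch on synthetic "prices". The data is made up, and `n_neighbors=11` simply reuses the K mentioned in the solution above; your own `best_k` may differ.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in prices (not the houses dataset).
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 100_000 + 50_000 * X_demo[:, 0] + rng.normal(0, 5_000, size=200)

# scikit-learn negates the MAE so that "greater is better" holds for every scorer.
neg_mae_scores = cross_val_score(
    KNeighborsRegressor(n_neighbors=11),  # assumption: K taken from the solution above
    X_demo,
    y_demo,
    cv=5,
    scoring="neg_mean_absolute_error",
)

demo_neg_mae = neg_mae_scores.mean()
# The score is negative; its absolute value is the average error in price units.
print(round(demo_neg_mae), round(-demo_neg_mae))
```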

In [None]:
# YOUR CODE HERE

### 🧪 Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('price_error',
                         error = price_error)
result.write()
print(result.check())

# Model Selection

❓ Which of these two models would you choose to predict house prices:
- The KNN model you just tuned
- A Linear Regression model

Save your answer as a string under variable name `best_model` as either "KNN" or "LinearReg".

<details>
<summary> 💡 Hint </summary>
    
To choose between them, evaluate a Linear Regression on the same task and compare its score with the KNN's. Make sure you are comparing the same metric!

</details>
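
A like-for-like comparison could look like the sketch below. It uses a deliberately non-linear synthetic target, so the outcome here illustrates the method and is not evidence about the houses data; `demo_scores` and `demo_best_model` are placeholder names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data with a strongly non-linear relationship (assumption for the demo).
rng = np.random.default_rng(4)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(2 * X_demo[:, 0]) + rng.normal(0, 0.1, size=300)

# Same data, same folds setting, same metric for both models.
demo_scores = {}
for name, model in [("KNN", KNeighborsRegressor()), ("LinearReg", LinearRegression())]:
    demo_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()

# Pick the model with the highest cross-validated R2.
demo_best_model = max(demo_scores, key=demo_scores.get)
print(demo_scores, demo_best_model)
```

Passing `scoring="neg_mean_absolute_error"` to both calls instead would compare the two models on the error metric.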




In [None]:
# YOUR CODE HERE

💡 Whichever of the two metrics you compare, the KNN model should outperform the Linear Regression. This could be due to its ability to capture non-linear patterns in the data.

### 🧪 Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('best_model',
                         model = best_model)
result.write()
print(result.check())

# 🏁