# Workflow & Hyperparameter Optimization

Our dataset describes heart disease as a binary classification problem (1: disease, 0: no disease).

Your goal is to fit a KNN classifier that best predicts the disease class (target 1) while keeping false negatives to a minimum!

👇 Import the data

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

In [None]:
data = pd.read_csv('heart.csv')
data

In [None]:
data.info()

In [None]:
X = data.drop(columns=['target'])
y = data['target']

In [None]:
y.value_counts()

## 1. Train/Test split

👇 Split the data to create your `X_train` `X_test` and `y_train` `y_test`

Use `test_size=0.3` and `random_state=0` so you can compare results with your buddy
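
As a rough illustration of the split described above, here is a minimal sketch. It assumes a synthetic dataset from `make_classification` as a stand-in for `X` and `y` (the sample and feature counts are made up, not taken from `heart.csv`).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: the real data comes from heart.csv)
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Hold out 30% of the rows; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

With the notebook's own `X` and `y`, only the `train_test_split` call would be needed.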

## 2. Scaling

❓ Scale your training set using the scaler of your choice
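
One possible way to do this is sketched below with `StandardScaler`; any other scikit-learn scaler would follow the same fit/transform pattern. The small array is a hypothetical stand-in for `X_train`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix standing in for X_train (two made-up features)
X_train = np.array([[63.0, 233.0], [37.0, 250.0], [41.0, 204.0], [56.0, 236.0]])

# Fit the scaler on the training data only, then transform that same data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

print(X_train_scaled.mean(axis=0))  # per-column means after scaling
print(X_train_scaled.std(axis=0))   # per-column standard deviations after scaling
```

Keeping the fitted `scaler` object around matters: the same object is what transforms the test set later.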

## 3. Baseline KNN model

❓ Cross-validate a basic KNN classifier on your training set using the "ROC area-under-curve" metric
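
A baseline along these lines could be sketched as follows. The data is a synthetic stand-in for the scaled training set (an assumption), and 5 folds is an arbitrary choice here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training set (assumption, not heart.csv)
X_raw, y_train = make_classification(n_samples=200, n_features=13, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_raw)

# Default KNN (n_neighbors=5), scored with ROC AUC on each of the 5 folds
baseline = KNeighborsClassifier()
scores = cross_val_score(baseline, X_train_scaled, y_train, cv=5, scoring="roc_auc")

print(scores.mean())
```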

## 4. Grid search

Use KNeighborsClassifier

👇 Grid search a KNN's hyperparameter k on the training data.
- Search k = [1,5,10,20,50]
- 5-fold cross validate
- Score with recall

In [None]:
# Instantiate model

# Hyperparameter Grid

# Instantiate Grid Search

# Fit data to Grid Search

❓ According to the grid search, what is the optimal K value?

❓ What is the best score the optimal K value produced?
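
The grid search steps listed above, and the attributes that answer these two questions, might look like the sketch below. It runs on synthetic stand-in data (an assumption), so the values it prints say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training set (assumption, not heart.csv)
X_raw, y_train = make_classification(n_samples=200, n_features=13, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_raw)

# Instantiate model
model = KNeighborsClassifier()

# Hyperparameter grid: the k values from the exercise
grid = {"n_neighbors": [1, 5, 10, 20, 50]}

# Instantiate grid search: 5-fold cross-validation, scored with recall
search = GridSearchCV(model, grid, cv=5, scoring="recall")

# Fit data to grid search
search.fit(X_train_scaled, y_train)

print(search.best_params_)  # optimal k among the candidates
print(search.best_score_)   # mean cross-validated recall for that k
```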

We now have an idea about where the best k lies, but some of the values we did not try could be better!

Re-run the grid search with k-values around your previous best value
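
One way to build such a finer grid is sketched here. `best_k = 10` is a hypothetical result of the coarse search, the window of four values on each side is an arbitrary choice, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training set (assumption, not heart.csv)
X_raw, y_train = make_classification(n_samples=200, n_features=13, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_raw)

best_k = 10  # hypothetical best value from the first, coarse grid search

# Every integer k within 4 of the previous best, never going below 1
k_values = list(range(max(1, best_k - 4), best_k + 5))

fine_search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": k_values}, cv=5, scoring="recall"
)
fine_search.fit(X_train_scaled, y_train)

print(fine_search.best_params_, fine_search.best_score_)
```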

❓ What is the best score and best k?

## 5. Optimizing multiple hyperparameters

👇 Is the default distance parameter of `KNeighborsClassifier` optimal for the task? Run a hyperparameter search to find out. (Look for the parameter `p` in the [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))

First, let's do a grid search for k and p at the same time. Try 10 combinations, e.g. k = [1, 10, 20, 30, 40]; p = [1, 2]

❓ What are the best parameters and the best score?
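
A joint search over k and p could be sketched as below, again on synthetic stand-in data (an assumption). The scoring metric is set to accuracy here only as an example; recall would work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training set (assumption, not heart.csv)
X_raw, y_train = make_classification(n_samples=200, n_features=13, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_raw)

# 5 values of k times 2 values of p gives 10 combinations
param_grid = {"n_neighbors": [1, 10, 20, 30, 40], "p": [1, 2]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train_scaled, y_train)

print(search.best_params_)
print(search.best_score_)
```

In `KNeighborsClassifier`, `p=1` corresponds to the Manhattan distance and `p=2` (the default) to the Euclidean distance.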

## 6. Random Search

Now let's see if a Random Search can find a better combination in only 10 tries.

👇 Redo the same with `RandomizedSearchCV`, but randomly sample `n_neighbors` from a `randint(1,40)` distribution

To compare apples to apples, run `RandomizedSearchCV` with `n_iter=10` to have the same number of total combinations to try. Also make sure you use the same scoring method for both, for example accuracy.
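
The randomized counterpart might look like this sketch. It assumes synthetic stand-in data and a fixed `random_state=0` for reproducibility. Note that `scipy.stats.randint(1, 40)` draws integers from 1 to 39, since the upper bound is exclusive.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training set (assumption, not heart.csv)
X_raw, y_train = make_classification(n_samples=200, n_features=13, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_raw)

# n_neighbors is sampled from a distribution, p from a fixed list
param_distributions = {"n_neighbors": randint(1, 40), "p": [1, 2]}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=10,           # same number of candidates as the 5 x 2 grid
    cv=5,
    scoring="accuracy",  # same metric as the grid search it is compared with
    random_state=0,      # assumption: fixed only to make the sketch reproducible
)
search.fit(X_train_scaled, y_train)

print(search.best_params_)
print(search.best_score_)
```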

<details>
    <summary>🤔 Is the best score better with the Randomized Search?</summary>


A better score is not guaranteed because the sampling is random, but there is a chance that `RandomizedSearchCV` will sample a better combination than the ones on the grid.

You can play with np.random.seed() to see that sometimes RandomSearch will outperform GridSearch and sometimes not.

It is important to note that our dataset is extremely small and our hyperparameter optimization is thus extremely dependent (and overfitting) on our train/test split. **Always make sure your dataset is much bigger than the total number of hyperparameter combinations you are trying out!**

Randomized Search becomes even more useful when we want to search over more than 2 numerical hyperparameters and sample all of them randomly, for example for SVMs!

One thing you can always do is run a coarse-grained grid search first, followed by a more fine-grained search around the best parameter that you found. You can also do a randomized search followed by a grid search, and vice versa.
</details>

## 7. Generalization

👇 This is your final chance to fine-tune your model: try to refine your `GridSearchCV`/`RandomizedSearchCV`, instantiate your best model, and re-fit it on the entire train set.

👇 The time has come to discover its performance on the **unseen** test set.

❓ Would you consider the optimized model to generalize well?

<details><summary>Hints</summary>

Seeing horrible test performance? You probably forgot to scale your test set too! Re-use the scaler fitted on the train set to transform your test set accordingly!
</details>
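
The final evaluation, including the scaling step from the hint, could be sketched as follows. The data is a synthetic stand-in, and `n_neighbors=10, p=1` are hypothetical "best" parameters chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (assumption, not heart.csv)
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the train set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Hypothetical best parameters; use the ones your own search returned
best_model = KNeighborsClassifier(n_neighbors=10, p=1)
best_model.fit(X_train_scaled, y_train)

# Recall on the unseen test set, to compare with the cross-validated score
test_recall = recall_score(y_test, best_model.predict(X_test_scaled))
print(test_recall)
```

A test score close to the cross-validated score from the search suggests the model generalizes; a much lower one suggests overfitting to the training folds.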

🏁 Congratulations! Please push the exercise once completed.