![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39129299-787f2b38-470a-11e8-958e-84f118846629.jpg"
    style="width:250px; float: right; margin: 0 40px 40px 40px;"></img>

# Tuning diabetes prediction model

In this project, we'll focus on two key concepts, **cross-validation** and **hyperparameter tuning**, to get the best accuracy out of the model.

We will continue working with the [Diabetes dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes), which has 8 numeric features plus a 0-1 class label.

![separator2](https://user-images.githubusercontent.com/7065401/39119518-59fa51ce-46ec-11e8-8503-5f8136558f2b.png)

### Hands on! 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Load the `data/diabetes_3.csv` file, and store it into `diabetes_df` DataFrame.

Invalid observations have already been removed from this file, and its classes are balanced.

In [None]:
diabetes_df = pd.read_csv('data/diabetes_3.csv')

diabetes_df.head()

### Show the shape of the resulting `diabetes_df`.

In [None]:
diabetes_df.shape

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Data preparation

Before modeling, prepare the data:

#### Create features $X$ and labels $y$

In [None]:
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']

#### Standardize the features

Use the `StandardScaler` to standardize the features (`X`) before moving to model creation.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)
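
One caveat with the cell above: fitting the scaler on all of `X` lets every validation fold influence the means and variances used for scaling. The following is a minimal sketch of one way to avoid that, assuming scikit-learn's `make_pipeline`. The names `X_demo`, `y_demo`, `pipeline` and `pipeline_scores` are hypothetical, and the data is a synthetic stand-in, not the diabetes file. The rest of this notebook keeps the simpler global scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption): 8 numeric features, binary label.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Inside cross_val_score the pipeline is re-fitted on the training part of
# every fold, so validation rows never contribute to the scaling statistics.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=4)
print(pipeline_scores.mean(), pipeline_scores.std())
```

With the real data, the same idea would be to pass the unscaled features and `y` to `cross_val_score` together with the pipeline.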

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Model creation and cross-validation evaluation

Build a `get_kneighbors_score` function that receives:
- `X`: features
- `y`: label
- `k`: neighbors

This function should train a `KNeighborsClassifier` and return the mean and standard deviation of the scores of a **4-fold Cross-validation**.


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def get_kneighbors_score(X, y, k):
    model = KNeighborsClassifier(n_neighbors=k)

    scores = cross_val_score(model, X, y, cv=4)
    
    return (scores.mean(), scores.std())
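
To make the 4-fold procedure less of a black box, here is a rough sketch of what `cross_val_score` does for a classifier when `cv` is an integer: it splits the data into stratified folds, trains on three of them and scores accuracy on the fourth, rotating the held-out fold. The helper `manual_cv_scores` and the synthetic `X_demo`, `y_demo` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the diabetes file.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

def manual_cv_scores(X, y, k, n_splits=4):
    """Accuracy on each held-out fold of a stratified K-fold split."""
    folds = StratifiedKFold(n_splits=n_splits)  # no shuffling, like cv=4
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X[train_idx], y[train_idx])               # train on 3 folds
        scores.append(model.score(X[val_idx], y[val_idx]))  # accuracy on the 4th
    return np.array(scores)

manual_scores = manual_cv_scores(X_demo, y_demo, k=5)
library_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_demo, y_demo, cv=4)
print(manual_scores, library_scores)
```

Both arrays should agree, which is why the mean and standard deviation returned by `get_kneighbors_score` summarize four separate held-out accuracies.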

#### Test your function

Use the **whole data** to test your `get_kneighbors_score()` function.

Print scores obtained by using `5`, `10` and `15` neighbors (`k`).

In [None]:
print(f"Using 5 neighbors: {get_kneighbors_score(X, y, 5)}")
print(f"Using 10 neighbors: {get_kneighbors_score(X, y, 10)}")
print(f"Using 15 neighbors: {get_kneighbors_score(X, y, 15)}")

Let's try to get the best `k` value.

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Getting the best number of neighbors

Train a KNN model for each candidate value of `k`.

Keep using a `KNeighborsClassifier` estimator and a **4-fold Cross-validation**.

Test the following `k` values:

In [None]:
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

# Reuse get_kneighbors_score(X, y, k) from above instead of redefining it
# with a different signature; keep only the mean accuracy for each k.
ACC_dev = []
for k in parameters:
    mean_score, _ = get_kneighbors_score(X, y, k)
    ACC_dev.append(mean_score)

ACC_dev
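
The loop above can also be expressed with scikit-learn's `GridSearchCV`, which runs the same cross-validation for every candidate and records the mean score of each. This is a sketch of that alternative, not the method the exercise asks for; `X_demo`, `y_demo`, `k_values`, `search` and `grid_means` are hypothetical names, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the diabetes file.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Same candidate values of k as in the cell above.
k_values = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

# One 4-fold cross-validation per candidate value of n_neighbors.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": k_values},
    cv=4,
)
search.fit(X_demo, y_demo)

# Mean accuracy per k, in the same order as k_values.
grid_means = search.cv_results_["mean_test_score"]
print(search.best_params_, search.best_score_)
```

On the notebook's data, `search.fit(X, y)` would take the place of the explicit loop, and `grid_means` would play the role of `ACC_dev`.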

#### Getting the learning curve

Plot the learning curve (mean cross-validation accuracy versus `k`). Which is the best `k` parameter?

In [None]:
parameters

In [None]:
ACC_dev

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

# Mean cross-validation accuracy for each tested k
ax.plot(parameters, ACC_dev, 'o-')
ax.set_xlabel('Neighbors')
ax.set_ylabel('Accuracy')

ax.grid()
plt.show()
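
Besides reading the best `k` off the plot, it can be picked programmatically. The sketch below uses a hypothetical helper, `best_k`, and made-up placeholder scores (`toy_scores`) that are not results from the diabetes data.

```python
import numpy as np

def best_k(k_values, mean_scores):
    """Return the k whose mean cross-validation score is highest."""
    # np.argmax returns the first index of the maximum, so ties go to
    # whichever k appears earlier in the list.
    return k_values[int(np.argmax(mean_scores))]

# Placeholder values for illustration only.
toy_k = [1, 3, 5, 8]
toy_scores = [0.70, 0.74, 0.76, 0.76]

print(best_k(toy_k, toy_scores))  # prints 5
```

In this notebook the call would be `best_k(parameters, ACC_dev)`. When several values of `k` score almost the same, the standard deviation across folds is worth checking before settling on one.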

![separator2](https://user-images.githubusercontent.com/7065401/39119518-59fa51ce-46ec-11e8-8503-5f8136558f2b.png)