<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:30%; left:10%;">
    Machine Learning Fundamentals
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:43%; left:10%;">
    Santiago Basulto
</h3>
</div>

<div style="width: 100%; background-color: #222; text-align: center">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Project
</h1>
    
<h3 style="color: #ef7d22; font-weight: normal;">
    Balancing diabetes observations
</h3>

<br><br> 
</div>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

<img src="img/diabetes.jpg"
    style="width:250px; float: right; margin: 0 40px 40px 40px;"></img>

Now we will continue using the [Diabetes dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes), which has 8 numeric features plus a 0-1 class label.

We'll check whether the data is balanced before training our model, and then examine the kinds of errors the model makes.

### Hands on! 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Load the `data/diabetes_2.csv` file, and store it into `diabetes_df` DataFrame.

This file already has the invalid observations removed.

In [None]:
diabetes_df = pd.read_csv('data/diabetes_2.csv')

diabetes_df.head()

### Show the shape of the resulting `diabetes_df`.

In [None]:
diabetes_df.shape

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Analyze `label` distribution

Are observations well balanced?

How many observations are there for 0 _(no diabetes)_ and 1 _(diabetes)_?

In [None]:
diabetes_df['label'].value_counts()
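
Raw counts can be hard to compare at a glance. As a small sketch, `value_counts(normalize=True)` returns the share of each class instead of the count. The `toy_labels` Series below is made up for illustration and is not the diabetes data.

```python
import pandas as pd

# Hypothetical toy labels (not the diabetes data): six 0s and two 1s.
toy_labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1], name='label')

# Absolute number of observations per class.
toy_counts = toy_labels.value_counts()

# Relative frequency of each class; the values sum to 1.
toy_proportions = toy_labels.value_counts(normalize=True)

print(toy_counts)
print(toy_proportions)
```

Applied to `diabetes_df['label']`, the same call would show how far the two classes are from an even 50/50 split.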

Show a barplot displaying these values:

In [None]:
diabetes_df['label'].value_counts().plot(kind='bar', figsize=(14,6))

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Balancing data

As observations are imbalanced, you will need to balance them.

Your task: down-sample the majority class by randomly removing `0` (no diabetes) observations.

#### Step 1

Separate observations from each class:

In [None]:
no_diabetes = diabetes_df[diabetes_df['label'] == 0]
yes_diabetes = diabetes_df[diabetes_df['label'] == 1]

no_diabetes.shape, yes_diabetes.shape

#### Step 2

Resample the majority class (_no diabetes_) without replacement to match the number of samples of the minority class.

In [None]:
from sklearn.utils import resample

no_diabetes_downsampled = resample(no_diabetes, 
                                   replace=False,
                                   n_samples=yes_diabetes.shape[0],
                                   random_state=1)

#### Step 3

Concatenate the minority class and the new re-sampled majority class.

In [None]:
diabetes_df = pd.concat([no_diabetes_downsampled, yes_diabetes])
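
Steps 1 to 3 can also be sketched with pandas alone, using `DataFrame.sample` in place of `sklearn.utils.resample`. This is an alternative shown on a made-up `toy_df`, not the method this project asks for; all `toy_*` names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: six majority (0) rows and two minority (1) rows.
toy_df = pd.DataFrame({
    'feature': [1, 2, 3, 4, 5, 6, 7, 8],
    'label':   [0, 0, 0, 0, 0, 0, 1, 1],
})

# Step 1: separate the two classes.
toy_majority = toy_df[toy_df['label'] == 0]
toy_minority = toy_df[toy_df['label'] == 1]

# Step 2: draw as many majority rows as there are minority rows,
# without replacement, so no row is picked twice.
toy_majority_down = toy_majority.sample(n=len(toy_minority),
                                        replace=False,
                                        random_state=1)

# Step 3: put both classes back together.
toy_balanced = pd.concat([toy_majority_down, toy_minority])

print(toy_balanced['label'].value_counts())
```

Either approach discards majority rows, so the balanced frame is smaller than the original one.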

#### Step 4

Analyze `label` distribution again to validate that your data is now balanced. 

In [None]:
diabetes_df['label'].value_counts().plot(kind='bar', figsize=(14,6))

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Modeling with the balanced data

We will keep using a [**k-nearest neighbors classifier**](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

Now that the diabetes observations are balanced, let's use them to train our model and check whether it improves.

#### Create features $X$ and labels $y$

In [None]:
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']

#### Split the dataset

As we now have fewer observations, we will use a smaller test set containing only 10% of them.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=10)
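
With a test set this small, a purely random split may leave the two classes uneven in `y_test`. One optional safeguard, not used in this project, is the `stratify` parameter of `train_test_split`. The sketch below uses made-up arrays (`toy_X`, `toy_y`) to show its effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 50 samples of each class.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([0] * 50 + [1] * 50)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y,
    test_size=0.1,
    stratify=toy_y,    # keep the class ratio the same in both subsets
    random_state=10,
)

# Number of test samples per class.
print(np.bincount(toy_y_test))
```

Here the 10 test samples are split evenly between the two classes, matching the 50/50 ratio of the full toy set.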

#### Standardize the features

Use the `StandardScaler` to standardize the features (`X_train` and `X_test`) before moving to model creation.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
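
To see what the scaler does, the sketch below reproduces its transform by hand on made-up arrays (`toy_train`, `toy_test`): each column has the training mean subtracted and is divided by the training standard deviation. The test set is scaled with the training statistics, which is why the scaler is fitted on the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features, not the diabetes data.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.5, 25.0], [5.0, 50.0]])

# Learn the per-column mean and standard deviation from the training set only.
toy_scaler = StandardScaler().fit(toy_train)

# Manual equivalent: StandardScaler uses the population std (ddof=0),
# which is also NumPy's default.
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)
manual_test = (toy_test - train_mean) / train_std

print(np.allclose(toy_scaler.transform(toy_test), manual_test))
```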

#### Build and fit a k-nearest neighbors classifier

Use `10` neighbors.

For training use `X_train` and `y_train`.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

k = 10
model = KNeighborsClassifier(n_neighbors=k)

model.fit(X_train, y_train)

#### Evaluating the model

Now use your model to get the predictions for the `X_test` set:

In [None]:
y_pred = model.predict(X_test)

Get the `score` of the model using the `X_test` and `y_test` data:

In [None]:
model.score(X_test, y_test)

Get the accuracy of your predictions:

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
accuracy
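
For a classifier, `model.score` and `accuracy_score` report the same quantity: the fraction of predictions that match the true labels. A minimal sketch with made-up labels (`toy_true`, `toy_pred`) shows the calculation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, not the notebook's predictions.
toy_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy is the share of positions where prediction and truth agree.
manual_accuracy = (toy_true == toy_pred).mean()
sklearn_accuracy = accuracy_score(toy_true, toy_pred)

print(manual_accuracy, sklearn_accuracy)
```

Six of the eight toy predictions are correct, so both values are 0.75.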

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Confusion matrix

Show a confusion matrix to understand the outputs of the model.

In [None]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

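Separate the values above into `tp`, `fn`, `fp` and `tn`, treating `0` _(no diabetes)_ as the positive class.

In [None]:
# Rows are true labels, columns are predictions; with class 0 as positive this reads tp, fn, fp, tn.
tp, fn, fp, tn = conf_matrix.ravel()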

Go ahead and manually calculate the precision and recall for the "No diabetes" class.

#### Precision

In [None]:
no_diabetes_precision = tp / (tp + fp)
no_diabetes_precision

#### Recall

In [None]:
no_diabetes_recall = tp / (tp + fn)
no_diabetes_recall
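
The manual formulas can be cross-checked against `precision_score` and `recall_score`, passing `pos_label=0` so that "no diabetes" is treated as the positive class. The sketch below uses made-up labels (`toy_true`, `toy_pred`), not the model's output.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels, not the notebook's predictions.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 0, 0, 1, 1]

# Rows are true labels, columns are predicted labels, both ordered [0, 1].
# Reading class 0 as "positive", the flattened matrix is tp, fn, fp, tn.
toy_tp, toy_fn, toy_fp, toy_tn = confusion_matrix(toy_true, toy_pred).ravel()

manual_precision = toy_tp / (toy_tp + toy_fp)   # 3 / (3 + 2)
manual_recall = toy_tp / (toy_tp + toy_fn)      # 3 / (3 + 1)

# pos_label=0 tells scikit-learn to score class 0 instead of the default class 1.
sk_precision = precision_score(toy_true, toy_pred, pos_label=0)
sk_recall = recall_score(toy_true, toy_pred, pos_label=0)

print(manual_precision, sk_precision)
print(manual_recall, sk_recall)
```

Without `pos_label=0`, both functions score class `1`, so the numbers would not match the "No diabetes" values computed above.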

Finally, call the `classification_report` function and check that its precision and recall values match the ones you calculated.

In [None]:
from sklearn.metrics import classification_report

model_report = classification_report(y_test, y_pred)

print('Model report: \n', model_report)

> Compare the results of this project with those of the previous **Diabetes analysis project**.


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98729912-57be3e80-237a-11eb-80e4-233ac344b391.png"></img>
</div>