# Diabetes Binary Classification Dataset

| Input Features |
|----------------|
| *preg_count* |
| *glucose_concentration* |
| *diastolic_bp* |
| *triceps_skin_fold_thickness* |
| *two_hr_serum_insulin* |
| *bmi* |
| *diabetes_pedi* |
| *age* |

|Target Feature | Objective |
|---------------|-----------|
| *diabetes_class* | Predict *diabetes_class* for given input features |

This example uses the *Pima Indians* diabetes data, contained in *pima_indians_diabetes_all.csv*. **Note: This data is no longer available from UCI due to permission restrictions.**

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Setup Columns

In [None]:
columns = ['diabetes_class', 'preg_count', 'glucose_concentration', 'diastolic_bp',
       'triceps_skin_fold_thickness', 'two_hr_serum_insulin', 'bmi',
       'diabetes_pedi', 'age']

In [None]:
df = pd.read_csv('pima_indians_diabetes_all.csv') # create dataframe

In [None]:
df.describe() # generate descriptive statistics

In [None]:
df['glucose_concentration'].hist() # display glucose concentration values in histogram
plt.show()
# note much of the data is anchored at minimum value of 0
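
A glucose concentration of 0 is not physiologically plausible, so zeros in this column may be placeholders for missing measurements; that reading is an assumption about this file, not something the data states. A minimal sketch of counting zero entries per column follows. The `toy` frame and its values are hypothetical stand-ins for the real CSV.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV; values are made up.
toy = pd.DataFrame({
    'glucose_concentration': [148, 0, 183, 89, 0],
    'bmi': [33.6, 26.6, 0.0, 28.1, 43.1],
    'age': [50, 31, 32, 21, 33],
})

# Compare every cell with 0, then sum the True values column by column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Running the same expression on `df` would show which columns contain zeros before deciding how to handle them.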

In [None]:
df['diabetes_class'].value_counts()

Class results: 500 cases are normal. 268 cases are diabetic.
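
The two counts can be turned into class shares, which gives a reference point for any later model: a classifier that always predicts "normal" would already be right for the normal share of the samples. This is a small arithmetic sketch using only the counts stated above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() result above.
normal_count = 500
diabetic_count = 268
total = normal_count + diabetic_count  # 768 samples overall

normal_share = normal_count / total
diabetic_share = diabetic_count / total

print(f'normal: {normal_share:.1%}, diabetic: {diabetic_share:.1%}')
# prints: normal: 65.1%, diabetic: 34.9%
```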

## Separate diabetic and normal samples

In [None]:
diabetic = df.diabetes_class == 1
normal = df.diabetes_class == 0

## Display the samples' glucose split in histogram

In [None]:
plt.hist(df[diabetic].glucose_concentration,label='diabetic')
plt.hist(df[normal].glucose_concentration,alpha=0.5, label='normal')
plt.title('Glucose Concentration')
plt.xlabel('Glucose Concentration')
plt.ylabel('Samples')
plt.legend()
plt.show()
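
Each `plt.hist` call above picks its own 10 bins over the range of the data it receives, so the bars of the two groups do not necessarily line up. One possible refinement, sketched here with NumPy only, is to compute a single set of bin edges from both groups and reuse it. The two arrays are hypothetical values for illustration, not rows from the dataset.

```python
import numpy as np

# Hypothetical glucose values for each group (illustration only).
diabetic_glucose = np.array([148, 183, 137, 197, 125])
normal_glucose = np.array([85, 89, 116, 110, 103])

# Derive one set of 10 equal-width bins spanning both groups.
combined = np.concatenate([diabetic_glucose, normal_glucose])
bin_edges = np.histogram_bin_edges(combined, bins=10)

# Count each group against the same edges so the bars are comparable.
diabetic_counts, _ = np.histogram(diabetic_glucose, bins=bin_edges)
normal_counts, _ = np.histogram(normal_glucose, bins=bin_edges)
print(bin_edges)
print(diabetic_counts, normal_counts)
```

Passing `bins=bin_edges` to both `plt.hist` calls would apply the same alignment to the plots.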

## BMI Histogram

In [None]:
plt.hist(df[diabetic].bmi,label='diabetic')
plt.hist(df[normal].bmi,alpha=0.5, label='normal')
plt.title('BMI')
plt.xlabel('BMI')
plt.ylabel('Samples')
plt.legend()
plt.show()

## Age Histogram

In [None]:
# Age
plt.hist(df[diabetic].age,label='diabetic')
plt.hist(df[normal].age,alpha=0.5,label='normal')
plt.title('Age')
plt.xlabel('Age')
plt.ylabel('Samples')
plt.legend()
plt.show()

# Training and Validation Set

## Target Variable as first column followed by input features

*Note: The training and validation files do not have a column header row.*

In [None]:
# Training = 70% of the data
# Validation = 30% of the data
# Randomize the dataset
np.random.seed(200)
l = list(df.index)
np.random.shuffle(l)
df = df.iloc[l]
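
As an alternative sketch, pandas can shuffle rows directly with `DataFrame.sample`. This is not the method used above and it will not reproduce the same row order as `np.random.shuffle` with seed 200; it only shows the same idea in one call. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; values are made up.
toy = pd.DataFrame({
    'diabetes_class': [0, 1, 0, 1, 0],
    'age': [21, 33, 45, 50, 29],
})

# frac=1 samples every row exactly once, i.e. a full shuffle;
# random_state makes the ordering repeatable.
shuffled = toy.sample(frac=1, random_state=200)
print(shuffled)
```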

In [None]:
rows = df.shape[0]
train = int(0.7 * rows)
test = rows - train

In [None]:
rows, train, test # display shape
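
For reference, the split arithmetic can be worked through by hand, assuming the file holds the 768 samples implied by the class counts (500 + 268). `int()` truncates, so the training set gets the rounded-down share and the validation set gets the remainder.

```python
rows = 768                 # assumed: 500 normal + 268 diabetic samples
train = int(0.7 * rows)    # 0.7 * 768 = 537.6, truncated to 537
test = rows - train        # remaining rows go to validation

print(rows, train, test)   # prints: 768 537 231
```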

## Write Training Set

In [None]:
df[:train].to_csv('diabetesTrain.csv',
                 index=False,
                 header=False,
                 columns=columns)

## Write Validation Set

In [None]:
df[train:].to_csv('diabetesValidation.csv',
                 index=False,
                 header=False,
                 columns=columns)

## Write Column List

In [None]:
with open('diabetesTrain_columnList.txt','w') as f:
    f.write(','.join(columns))
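
Because the training and validation files are written without a header, a consumer needs the column list to label them. The following is a minimal sketch of reading such a file back, using in-memory strings in place of the files on disk. The two data rows are illustrative values only, and `df_train` is a hypothetical name.

```python
import io
import pandas as pd

# Stand-in for the contents of diabetesTrain_columnList.txt.
column_list_text = ('diabetes_class,preg_count,glucose_concentration,diastolic_bp,'
                    'triceps_skin_fold_thickness,two_hr_serum_insulin,bmi,'
                    'diabetes_pedi,age')

# Stand-in for a headerless training file; rows are illustrative values.
csv_text = ('1,6,148,72,35,0,33.6,0.627,50\n'
            '0,1,85,66,29,0,26.6,0.351,31\n')

names = column_list_text.split(',')

# header=None tells pandas the first line is data, not column names.
df_train = pd.read_csv(io.StringIO(csv_text), header=None, names=names)
print(df_train)
```

With the real files, `io.StringIO(csv_text)` would be replaced by the path `'diabetesTrain.csv'` and `column_list_text` by the contents of the column list file.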