## Assignment 1 - Diabetes Prediction
A machine learning model that classifies whether the patients in the dataset have diabetes.<br><br>
**Team members:**
* Ayush Yadav ( IMT2017009 )
* Kaustubh Nair ( IMT2017025 )
* Sarthak Khoche ( IMT2017038 )

### Overview:
1. [**Missing Data Handling**](#missing_data)
2. [**Data Preprocessing**](#preprocessing)
3. [**Approach 1 (Using PCA)**](#1)
  1. [**Exploratory Data Analysis**](#1_eda)
  2. [**PCA**](#1_pca)
  3. [**Model building**](#1_model)
4. [**Approach 2 (Using //)**](#2)
  0. [**Helper Functions**](#helpers)
  1. [**Exploratory Data Analysis**](#2_eda)
  2. [**Model building**](#2_model)


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from pandas.plotting import scatter_matrix
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
%matplotlib inline
pd.set_option('float_format', '{:f}'.format)

<a id='missing_data'></a>
### Missing Data Handling

First, load the CSV data into a pandas DataFrame.

In [2]:
df = pd.read_csv("Pima_Indian_diabetes.csv")
df.describe()

statistic,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,742.0,752.0,768.0,746.0,768.0,757.0,768.0,749.0,768.0
mean,3.866601,119.966097,68.886078,20.309879,79.799479,31.711151,0.471876,33.761336,0.348958
std,3.479971,32.367659,19.427448,15.974523,115.244002,8.544789,0.331329,12.297409,0.476951
min,-5.412815,0.0,-3.496455,-11.94552,0.0,-16.288921,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.1,0.24375,24.0,0.0
50%,3.0,116.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.0,80.0,32.0,127.25,36.5,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Select the feature columns.

In [3]:
features = ['Pregnancies','Glucose', 'BloodPressure','SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

In [4]:
# count of null values
df.isnull().sum()

Pregnancies                 26
Glucose                     16
BloodPressure                0
SkinThickness               22
Insulin                      0
BMI                         11
DiabetesPedigreeFunction     0
Age                         19
Outcome                      0
dtype: int64

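Fill the missing values in each feature with a random number drawn from a normal distribution centred on the feature's mean, with half the feature's standard deviation as its spread. The draw happens once per feature, so all missing entries in a column receive the same value.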

In [5]:
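for feature in features:
    # Assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas
    fill_value = np.random.normal(df[feature].mean(), df[feature].std()/2)
    df[feature] = df[feature].fillna(value=fill_value)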

In [6]:
# count of null values
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
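
A scalar passed to `fillna` is broadcast to every missing entry in the column. The sketch below uses a made-up toy column (not the Pima data) to contrast that behaviour with drawing one value per missing entry. The per-entry variant is only an illustrative alternative, and the seed is an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Toy column with three missing entries (made-up values)
toy = pd.DataFrame({"Glucose": [99.0, np.nan, 140.0, np.nan, 116.0, np.nan]})
mean, std = toy["Glucose"].mean(), toy["Glucose"].std()

# One scalar draw: every missing entry receives the same number
single_draw = toy["Glucose"].fillna(rng.normal(mean, std / 2))

# One draw per missing entry: the filled values differ from each other
missing = toy["Glucose"].isna()
per_row = toy["Glucose"].copy()
per_row[missing] = rng.normal(mean, std / 2, size=missing.sum())

print(single_draw[missing].nunique(), per_row[missing].nunique())
```

With a single draw the imputed entries add no variance of their own, which is one reason a per-entry draw might be preferred in some settings.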

<a id='preprocessing'></a>
### Data Preprocessing

Replace negative values with zero.

In [7]:
df.where( df < 0).count()

Pregnancies                 3
Glucose                     0
BloodPressure               3
SkinThickness               5
Insulin                     0
BMI                         4
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [8]:
for feature in features:
    df.loc[df[feature] < 0, feature] = 0

In [9]:
df.where( df < 0).count()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
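
Replacing negative values with zero can also be written with `DataFrame.clip`. The sketch below is a minimal illustration on a made-up toy frame, not the notebook's data, and the column values are assumptions chosen for the example.

```python
import pandas as pd

# Toy frame with one negative BMI value (made-up numbers)
toy = pd.DataFrame({"BMI": [27.1, -16.3, 32.0], "Age": [21, 24, 29]})
features = ["BMI", "Age"]

negatives_before = (toy[features] < 0).sum()

# clip(lower=0) raises every value below zero to zero and leaves the rest unchanged
toy[features] = toy[features].clip(lower=0)

negatives_after = (toy[features] < 0).sum()
print(negatives_before["BMI"], negatives_after["BMI"])
```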

Cap high Insulin values at 300.

In [11]:
df.loc[df.Insulin > 300, 'Insulin'] = 300

<a id='1'></a>
### Approach 1

<a id='1_eda'></a>
### Exploratory Data Analysis

In [12]:
#plot = scatter_matrix(df, alpha=0.2, figsize=(15, 15))

<a id='1_pca'></a>
### PCA

In [13]:
x = df.loc[:, features].values
y = df.loc[:,['Outcome']].values
x = StandardScaler().fit_transform(x)

pca = PCA(n_components=7)
principal_components = pca.fit_transform(x)
component_names = [f'principal component {i}' for i in range(1, 8)]
principal_df = pd.DataFrame(data = principal_components, columns = component_names)

pca_df = pd.concat([principal_df, df[['Outcome']]], axis = 1)

In [14]:
pca_df.describe()
#plot = scatter_matrix(pca_df, alpha=0.2, figsize=(15, 15))

statistic,principal component 1,principal component 2,principal component 3,principal component 4,principal component 5,principal component 6,principal component 7,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.348958
std,1.440108,1.293926,1.009961,0.941964,0.882994,0.825394,0.696675,0.476951
min,-4.82955,-2.192926,-3.123318,-2.640715,-2.638223,-3.25256,-2.026239,0.0
25%,-0.958447,-1.071372,-0.657164,-0.591702,-0.534643,-0.460917,-0.404186,0.0
50%,-0.090395,-0.317594,-0.10193,-0.136476,-0.029289,-0.05097,-0.068738,0.0
75%,0.914813,0.953352,0.552948,0.490361,0.545982,0.374448,0.324094,1.0
max,4.986868,3.660206,3.863457,4.541344,2.760521,4.557745,3.237236,1.0
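
Keeping seven of eight components discards only the direction with the least variance. How much variance is retained can be read from `explained_variance_ratio_`. The sketch below uses random stand-in data rather than the notebook's features, so the printed numbers are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))   # stand-in for the eight features
X[:, 1] += 0.8 * X[:, 0]        # add correlation so the components differ in size
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=7).fit(X_scaled)

# Share of total variance captured by each kept component, largest first
ratios = pca.explained_variance_ratio_
retained = ratios.sum()
print(ratios.round(3))
print(f"variance retained by 7 of 8 components: {retained:.3f}")
```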


<a id='1_model'></a>
### Model Building

In [15]:
accuracy = []

for i in range(500):
    # Shuffle the rows, then separate the seven components from the label
    pca_df = pca_df.sample(frac=1)
    pca_X = pca_df[pca_df.columns[0:7]]
    pca_y = pca_df[pca_df.columns[7]]

    # Hold out 20% of the rows for validation and record the accuracy in percent
    train_X, val_X, train_y, val_y = train_test_split(pca_X, pca_y, test_size = 0.20)
    lr = LogisticRegression(max_iter=2000, solver='lbfgs')
    lr.fit(train_X, train_y)
    accuracy.append(lr.score(val_X, val_y)*100)

In [16]:
average_accuracy = sum(accuracy)/len(accuracy)
print(average_accuracy, max(accuracy), min(accuracy))

76.98181818181818 85.06493506493507 63.63636363636363
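
Repeated random 80/20 splits of this kind can also be expressed with scikit-learn's `ShuffleSplit` and `cross_val_score`. The sketch below is one possible formulation, run on synthetic data from `make_classification` as a stand-in, and it uses 50 splits and fixed seeds as assumptions to keep it quick and reproducible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in with the same shape as the seven-component PCA data
X, y = make_classification(n_samples=768, n_features=7, random_state=0)

# Each split shuffles the rows and holds out 20% for validation
cv = ShuffleSplit(n_splits=50, test_size=0.20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=cv)

print(f"mean={scores.mean():.3f} std={scores.std():.3f} "
      f"min={scores.min():.3f} max={scores.max():.3f}")
```

Reporting the standard deviation alongside the mean makes the spread between the best and worst splits easier to interpret.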


<a id='2'></a>
### Approach 2

<a id='helpers'></a>
#### Helper Function

In [17]:
def linear_regression(df, feature, target):
    """Replace zeros in `target` with predictions from a linear fit on `feature`.

    Zeros in `target` are treated as missing values. The model is fitted on the
    rows where `target` is non-zero, and `df` is modified in place.
    """
    # Cast to float so the predicted values can be stored without truncation
    df[target] = df[target].astype(float)
    zero_mask = df[target] == 0

    train_X = df.loc[~zero_mask, feature].values.reshape(-1, 1)
    train_y = df.loc[~zero_mask, target].values
    val_X = df.loc[zero_mask, feature].values.reshape(-1, 1)

    model = LinearRegression()
    model.fit(train_X, train_y)

    # Write the predictions back into the rows that held zeros
    df.loc[zero_mask, target] = model.predict(val_X)
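
To see what this kind of regression-based imputation does, the sketch below applies the same idea to a made-up toy frame. `impute_zeros` is a hypothetical stand-alone variant written for this example; unlike the helper above, it returns a copy instead of modifying its input.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_zeros(frame, feature, target):
    """Return a copy of frame with zeros in `target` replaced by predictions from `feature`."""
    out = frame.copy()
    out[target] = out[target].astype(float)
    zero = out[target] == 0
    # Fit on the rows with a known (non-zero) target, predict for the rest
    model = LinearRegression().fit(out.loc[~zero, [feature]], out.loc[~zero, target])
    out.loc[zero, target] = model.predict(out.loc[zero, [feature]])
    return out

# Toy data where the non-zero rows follow SkinThickness = BMI - 10 exactly
toy = pd.DataFrame({"BMI": [20.0, 25.0, 30.0, 35.0, 40.0],
                    "SkinThickness": [10.0, 0.0, 20.0, 0.0, 30.0]})
filled = impute_zeros(toy, "BMI", "SkinThickness")
print(filled)
```

Because the known rows lie on a straight line, the two zeros are filled with the values on that line (15 and 25).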

<a id='2_eda'></a>
### Exploratory Data Analysis

In [18]:
#plot = df.plot(x='SkinThickness', y='BMI', style='.')
#y_label = plot.set_ylabel('BMI')

In [19]:
linear_regression(df, 'BMI', 'SkinThickness')

In [20]:
#plot = df.plot(x='SkinThickness', y='BMI', style='.')
#y_label = plot.set_ylabel('BMI')

In [21]:
#plot = df.plot(x='Insulin', y='Glucose', style='.')
#y_label = plot.set_ylabel('Glucose')

In [22]:
linear_regression(df, 'Glucose', 'Insulin')

In [23]:
#plot = df.plot(x='Insulin', y='Glucose', style='.')
#y_label = plot.set_ylabel('Glucose')

In [24]:
for feature in features:
    df[feature] = (df[feature] - df[feature].mean())/(df[feature].std())
df.describe()

statistic,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,0.348958
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.476951
min,-1.14048,-3.747762,-3.552478,-2.558652,-2.793478,-3.8458,-1.188778,-1.060203,0.0
25%,-0.84525,-0.661992,-0.355549,-0.685152,-0.679293,-0.548022,-0.68852,-0.813966,0.0
50%,-0.25479,-0.100942,0.160084,-0.048871,-0.144938,0.059543,-0.299933,-0.321491,0.0
75%,0.630901,0.584784,0.572591,0.580585,0.565741,0.55829,0.465923,0.581378,1.0
max,3.878433,2.454948,2.738252,7.203728,2.355588,4.267157,5.879733,3.864539,1.0
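
The manual standardization here divides by the pandas `std()`, which uses the sample standard deviation (`ddof=1`), whereas `StandardScaler` in Approach 1 divides by the population standard deviation (`ddof=0`). The sketch below shows on a small made-up series that the two results differ only by a constant factor of sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

values = pd.Series([21.0, 24.0, 29.0, 41.0, 81.0])  # made-up values
n = len(values)

# Manual z-score with pandas: Series.std() defaults to ddof=1
z_pandas = (values - values.mean()) / values.std()

# StandardScaler divides by the ddof=0 standard deviation
z_sklearn = StandardScaler().fit_transform(values.to_frame()).ravel()

# The ratio is the same for every element: sqrt(n / (n - 1))
ratio = z_sklearn / z_pandas
print(ratio.round(6).tolist(), round(float(np.sqrt(n / (n - 1))), 6))
```

For 768 rows the factor is very close to 1, so the difference is unlikely to matter much in practice.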


<a id='2_model'></a>
### Model Building

In [25]:
accuracy = []

for i in range(500):
    # Shuffle the rows, then separate the eight features from the label
    df = df.sample(frac=1)
    X = df[df.columns[0:8]]
    y = df[df.columns[8]]

    # Hold out 20% of the rows for validation and record the accuracy in percent
    train_X, val_X, train_y, val_y = train_test_split(X, y, test_size = 0.20)
    lr = LogisticRegression(max_iter=2000, solver='lbfgs')
    lr.fit(train_X, train_y)
    accuracy_percent = lr.score(val_X, val_y)*100
    accuracy.append(accuracy_percent)

In [26]:
average_accuracy = np.mean(accuracy)
print(average_accuracy, max(accuracy), min(accuracy))

76.45194805194805 87.66233766233766 66.88311688311688
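
As a rough reference point for these accuracies: the `Outcome` mean of 0.348958 over 768 rows corresponds to 268 positive cases, so always predicting the majority class would already be right for 500 of 768 patients. The sketch below computes that baseline with scikit-learn's `DummyClassifier`. It is an added sanity check under that assumption, not part of the original analysis, and it scores on the full label vector rather than on a validation split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels reconstructed from the reported class balance: 268 positives, 500 negatives
y = np.array([1] * 268 + [0] * 500)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y).score(X, y)
print(f"{baseline * 100:.2f}")  # 65.10
```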
