In [29]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from mlxtend.preprocessing import minmax_scaling

## Data loading and overview of null values

The dataset contains 768 samples with 8 predictor variables. The `Outcome` column is the label we are going to predict.

In [2]:
df = pd.read_csv('data/pima-indians-diabetes-data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


`Outcome` is not evenly distributed; it is skewed toward healthy patients. This is not surprising, as there are many more healthy people in the population than people with diabetes.

In [3]:
fig = go.Figure()
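# Reindex so the counts line up with the labels regardless of which class is larger.
fig.add_bar(y=df.Outcome.value_counts().reindex([0, 1]), x=['Healthy', 'Diabetic'],
            marker_color=['lightskyblue', 'indigo'])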
fig.update_layout(title='Distribution of outcome')
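
The class imbalance can also be expressed as shares rather than read off the bar chart. A minimal sketch on a made-up `Outcome` series (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up labels: 0 = healthy, 1 = diabetic.
outcome = pd.Series([0, 0, 0, 1], name='Outcome')

# Share of each class, in a fixed order and with readable labels.
shares = (outcome.value_counts(normalize=True)
                 .reindex([0, 1])
                 .rename({0: 'Healthy', 1: 'Diabetic'}))
print(shares)
```

In the notebook, the same call on `df.Outcome` would give the actual proportions.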

The boxplot reveals that almost all variables are skewed toward 0, suggesting that some values are actually missing but were recorded as 0 instead of NaN. Some variables can legitimately be 0, however: the number of pregnancies and, of course, the outcome. An age of 0 would be possible in principle (a newborn), but it is not plausible in this dataset.

In [4]:
px.box(df)

## Investigation of missing values

First, we look at how many values are missing. As we can see, 376 samples are missing at least one value. The next cell shows that `Insulin` has the most missing values, followed by `SkinThickness`, `BloodPressure`, `BMI` and `Glucose`. `Age` and `DiabetesPedigreeFunction` aren't missing any values, so we won't need to impute them.

To get a clearer look at the missing values prior to imputation, we replace the 0s in the relevant columns with NaN.

In [5]:
df[df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']].eq(0).any(axis=1)]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
5,5,116,74,0,0,25.6,0.201,30,0
7,10,115,0,0,0,35.3,0.134,29,0
...,...,...,...,...,...,...,...,...,...
761,9,170,74,31,0,44.0,0.403,43,1
762,9,89,62,0,0,22.5,0.142,33,0
764,2,122,70,27,0,36.8,0.340,27,0
766,1,126,60,0,0,30.1,0.349,47,1


In [6]:
df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']].agg(lambda x: x.eq(0).sum())

Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
dtype: int64

In [7]:
df_zeroes = df.copy()

df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']] = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']].replace(0, np.nan)
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.0,180.0,32.9,0.171,63,0
764,2,122.0,70.0,27.0,,36.8,0.340,27,0
765,5,121.0,72.0,23.0,112.0,26.2,0.245,30,0
766,1,126.0,60.0,,,30.1,0.349,47,1


## Imputation of missing values

We chose the median as the imputation value. However, values may differ considerably between healthy and diabetic patients, so we compute the median for each group separately. `create_distplot()` is a helper that visualises the distribution of values before and after imputation, while also showing the differences between healthy and diabetic samples.

In [8]:
def find_median(data, var):
    # Keep only rows where `var` is present, then take the median per outcome group.
    temp = data[data[var].notnull()]
    temp = temp[[var, 'Outcome']].groupby('Outcome')[[var]].median().reset_index()

    return temp

def create_distplot(data, var):
    hist = [data.loc[(data.Outcome == 0), var], data.loc[(data.Outcome == 1), var]]

    fig = ff.create_distplot(hist, ['Healthy', 'Diabetes'], colors=['lightskyblue', 'indigo'], show_hist=True, bin_size=0, curve_type='kde')
    fig.update_layout(title=var)

    return fig
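
The group-wise median imputation described above can also be written without hard-coding the medians. The following is only a sketch under assumptions: `impute_group_median` is a hypothetical helper, not part of the notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the diabetes data (values are made up).
toy = pd.DataFrame({
    'Glucose': [100.0, np.nan, 110.0, 150.0, np.nan, 170.0],
    'Outcome': [0, 0, 0, 1, 1, 1],
})

def impute_group_median(data, cols, group_col='Outcome'):
    """Fill NaNs in `cols` with the median of the row's `group_col` group."""
    out = data.copy()
    # transform('median') returns a frame aligned with `out`, holding each
    # row's group median (NaNs are skipped when the median is computed).
    medians = out.groupby(group_col)[cols].transform('median')
    out[cols] = out[cols].fillna(medians)
    return out

imputed = impute_group_median(toy, ['Glucose'])
print(imputed)
```

Because the medians are computed from the data at call time, the same helper could be applied to several columns at once, for example `['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']`.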

### Glucose

We plot the distribution of the first variable, `Glucose`, and observe that healthy patients tend to have lower blood glucose levels than diabetic patients. There are, however, some zero values at the left end of the plot. We find the median of each group and impute it for the missing values in that group. The second plot shows that, after imputation, the variable no longer contains any values equal to 0.

In [9]:
create_distplot(df_zeroes, 'Glucose').show()

find_median(df, 'Glucose')

Unnamed: 0,Outcome,Glucose
0,0,107.0
1,1,140.0


In [10]:
df.loc[(df.Outcome == 0) & (df.Glucose.isnull()), 'Glucose'] = 107.0
df.loc[(df.Outcome == 1) & (df.Glucose.isnull()), 'Glucose'] = 140.0

create_distplot(df, 'Glucose')

### BloodPressure

We continue with the next variable - `BloodPressure` - and follow the same process as with the previous column: plot distribution, find medians, impute them to respective groups and verify the results in another plot of distributions.

In [11]:
create_distplot(df_zeroes, 'BloodPressure').show()

find_median(df, 'BloodPressure')

Unnamed: 0,Outcome,BloodPressure
0,0,70.0
1,1,74.5


In [12]:
df.loc[(df.Outcome == 0) & (df.BloodPressure.isnull()), 'BloodPressure'] = 70.0
df.loc[(df.Outcome == 1) & (df.BloodPressure.isnull()), 'BloodPressure'] = 74.5

create_distplot(df, 'BloodPressure')

### SkinThickness

In [13]:
create_distplot(df_zeroes, 'SkinThickness').show()

find_median(df, 'SkinThickness')

Unnamed: 0,Outcome,SkinThickness
0,0,27.0
1,1,32.0


In [14]:
df.loc[(df.Outcome == 0) & (df.SkinThickness.isnull()), 'SkinThickness'] = 27.0
df.loc[(df.Outcome == 1) & (df.SkinThickness.isnull()), 'SkinThickness'] = 32.0

create_distplot(df, 'SkinThickness')

### Insulin

In [15]:
create_distplot(df_zeroes, 'Insulin').show()

find_median(df, 'Insulin')

Unnamed: 0,Outcome,Insulin
0,0,102.5
1,1,169.5


In [16]:
df.loc[(df.Outcome == 0) & (df.Insulin.isnull()), 'Insulin'] = 102.5
df.loc[(df.Outcome == 1) & (df.Insulin.isnull()), 'Insulin'] = 169.5

create_distplot(df, 'Insulin')

### BMI

In [17]:
create_distplot(df_zeroes, 'BMI').show()

find_median(df, 'BMI')

Unnamed: 0,Outcome,BMI
0,0,30.1
1,1,34.3


In [18]:
df.loc[(df.Outcome == 0) & (df.BMI.isnull()), 'BMI'] = 30.1
df.loc[(df.Outcome == 1) & (df.BMI.isnull()), 'BMI'] = 34.3

create_distplot(df, 'BMI')

Here we can observe that the dataset contains no more missing values, which is also visible in the boxplot.

In [19]:
df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']].agg(lambda x: x.isna().sum())

Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
dtype: int64

In [20]:
px.box(df)

## Correlation

Next, we investigate which variables are correlated. First, we plot a correlation matrix, which shows the highest correlations between age and pregnancies, outcome and glucose, skin thickness and BMI, and insulin and glucose. Interestingly, only two pairs of variables are negatively correlated, and both correlations are very weak. In general, variables in this dataset tend to increase together, however slightly.

In [21]:
fig = go.Figure()
fig.add_heatmap(z=df.corr().round(4), x=df.columns, y=df.columns,
                text=df.corr().round(4),
                texttemplate="%{text}",
                colorscale=px.colors.diverging.RdBu,
                zmin=-1, zmax=1)
fig.update_layout(width=1000, height=1000)
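
The strongest pairs can also be listed programmatically instead of being read off the heatmap. A minimal sketch, assuming a hypothetical helper `top_correlations` and synthetic data in place of the notebook's `df`:

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is a noisy copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

def top_correlations(data, n=5):
    """Return the n variable pairs with the largest absolute correlation."""
    corr = data.corr()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation of 1) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n)

print(top_correlations(toy, n=3))
```

Sorting by absolute value keeps strongly negative pairs near the top as well, while the returned values retain their sign.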

Next, we will create scatterplots to investigate correlated variables. Again, we need to group the samples by the outcome.

In [22]:
def create_scatter(data, var1, var2):
    plot1 = [data.loc[(data.Outcome == 0), var1], data.loc[(data.Outcome == 1), var1]]
    plot2 = [data.loc[(data.Outcome == 0), var2], data.loc[(data.Outcome == 1), var2]]

    fig = go.Figure()

    fig.add_scatter(x=plot1[0], y=plot2[0], mode='markers', marker_color='lightskyblue', name='Healthy')
    fig.add_scatter(x=plot1[1], y=plot2[1], mode='markers', marker_color='indigo', name='Diabetic')

    fig.update_layout(width=1000, height=600,
                      title=f'Scatter of {var1} and {var2}',
                      xaxis_title=var1,
                      yaxis_title=var2)
    
    return fig

Insulin is a hormone responsible for maintaining the correct amount of glucose in the blood. When glucose rises, the body responds by releasing more insulin, so these variables are positively correlated, which we can observe in the scatterplot. However, because we had to impute many missing values, the imputed points form two horizontal lines, one at each group median. This artifact cannot be easily removed, but it is preferable to keeping missing values or zeroes, which could heavily skew the results and lead to incorrect predictions.

In [23]:
create_scatter(df, 'Glucose', 'Insulin')
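
To judge how much the imputed lines influence the picture, one option is to flag which values were originally recorded as 0 and look at the measured pairs only. This is a sketch on made-up rows, not the notebook's method; the column values are illustrative.

```python
import pandas as pd

# Made-up rows in the raw, zero-coded form (0 stands for "missing").
raw = pd.DataFrame({
    'Glucose': [148, 85, 0, 89, 137, 120],
    'Insulin': [0, 0, 94, 94, 168, 100],
})

# True where a value was recorded as 0 and would therefore be imputed.
was_missing = raw[['Glucose', 'Insulin']].eq(0)

# Rows in which both plotted variables were actually measured.
observed = ~was_missing.any(axis=1)

# Pearson correlation restricted to the measured pairs.
observed_corr = raw.loc[observed, 'Glucose'].corr(raw.loc[observed, 'Insulin'])
print(observed.tolist(), observed_corr)
```

In the notebook, `df_zeroes` holds the zero-coded copy of the data, so the same mask could be built from it and used to style or filter the scatter points.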

The number of pregnancies is correlated with age, because each pregnancy has to be carried to term, so more pregnancies require more time.

In [24]:
create_scatter(df, 'Pregnancies', 'Age')

Again, we can observe two lines caused by the imputation of missing values. There aren't any visible lines from the imputation of `BMI`, because only 11 values were imputed, while `SkinThickness` was missing more than 200 values. The variables are positively correlated, but we can also observe some outliers, all of which are diabetic.

In [25]:
create_scatter(df, 'BMI', 'SkinThickness')

`Outcome` is a binary variable, so all samples fall at either 1 (diabetic) or 0 (healthy). We can see that glucose levels are slightly higher in diabetic patients than in healthy ones. However, the healthy samples are more spread out.

In [26]:
create_scatter(df, 'Glucose', 'Outcome')

Plotting insulin against outcome doesn't show any clear correlation, probably because diabetic patients are treated with additional insulin, which helps keep blood glucose in the correct range. Another explanation may be the imputation itself, although, as seen in the imputation step, median insulin levels are higher in diabetic than in healthy patients, while the healthy samples are more spread out.

In [27]:
create_scatter(df, 'Insulin', 'Outcome')

As the last step of data wrangling, we scale the data to fit into the [0, 1] interval.

In [30]:
df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']] = minmax_scaling(df,['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.352941,0.670968,0.489796,0.304348,0.186899,0.314928,0.234415,0.483333,1
1,0.058824,0.264516,0.428571,0.239130,0.106370,0.171779,0.116567,0.166667,0
2,0.470588,0.896774,0.408163,0.271739,0.186899,0.104294,0.253629,0.183333,1
3,0.058824,0.290323,0.428571,0.173913,0.096154,0.202454,0.038002,0.000000,0
4,0.000000,0.600000,0.163265,0.304348,0.185096,0.509202,0.943638,0.200000,1
...,...,...,...,...,...,...,...,...,...
763,0.588235,0.367742,0.530612,0.445652,0.199519,0.300613,0.039710,0.700000,0
764,0.117647,0.503226,0.469388,0.217391,0.106370,0.380368,0.111870,0.100000,0
765,0.294118,0.496774,0.489796,0.173913,0.117788,0.163599,0.071307,0.150000,0
766,0.058824,0.529032,0.367347,0.271739,0.186899,0.243354,0.115713,0.433333,1
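
For reference, min-max scaling maps each column through (x - min) / (max - min). The sketch below reproduces that formula in plain pandas on made-up values; it is assumed to match what `minmax_scaling` computes with its default range, which is consistent with the first row above (6 pregnancies scaled to 0.352941, that is 6/17).

```python
import pandas as pd

# Made-up values; each column has its own minimum and maximum.
toy = pd.DataFrame({
    'Pregnancies': [0, 6, 17],
    'Age': [21, 51, 81],
})

# Column-wise min-max scaling: the minimum maps to 0, the maximum to 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Each column is scaled independently, so the result depends on the extreme values present in the data being scaled.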


In [31]:
df.to_csv('Data/cleaned.csv')