# Diabetes Dataset Preprocessing

Phabel Antonio López Delgado

Diabetes Dataset: <https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset> [Kaggle]

## 1) Data loading




## Dataset loading with Kaggle API: kagglehub

Download the dataset directly from Kaggle using its *kagglehub* API and the *KaggleDatasetAdapter* class.

In [None]:
# Import Libraries for Kaggle API loading
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd

Import the original raw dataset in CSV format and load it into a pandas DataFrame. Explore this original dataset.

In [None]:
# Load Data with the KaggleDatasetAdapter
diabetes_df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "akshaydattatraykhare/diabetes-dataset",
    "diabetes.csv",
)

# Quickly check first rows
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Check the dataset dimensions and variables to understand their nature. In this case, the diabetes dataset measures several variables per patient and records whether the patient has diabetes.

In [None]:
# Check out the entire dataset -> Dimensions are 768 x 9
diabetes_df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


## Describe Variables

Use describe() to get summary statistics for each variable: count, mean, standard deviation, minimum, quartiles, and maximum.

In [None]:
# Describe variables
diabetes_df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Use info() to check the dataframe's properties.

In [None]:
# General Info -> There are zero null values and all vars are numerical
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


## Exploratory Data Analysis

In [None]:
# Import libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Plot the distributions of the dataset's variables. Use plotly to build a dashboard of violin plots with embedded boxplots and individual data points, plus a histogram of the binary outcome.

In [None]:
# Get the columns to plot: numerical continuous variables, except binary variable "Outcome"
cols_to_plot = diabetes_df.columns.drop("Outcome")
num_cols = len(cols_to_plot)


# Determine the number of rows and columns for the subplot grid
# Set the number of columns in the grid
n_cols = 3
# Calculate the number of rows needed
n_rows = (num_cols + n_cols - 1) // n_cols


# Create the subplot figure
fig = make_subplots(# Number of rows
                    rows=n_rows,
                    # Number of cols
                    cols=n_cols,
                    # Titles per subplot
                    subplot_titles=[f"Violin Plot for {col}" for col in cols_to_plot] + ["Histogram of Outcome"],
                    # Adjust spacing between rows
                    vertical_spacing=0.1,
                    # Adjust spacing between columns
                    horizontal_spacing=0.1)


# Create a Violin plot per each variable, i.e. per column
for i, var in enumerate(cols_to_plot):
    # Current row
    row = i // n_cols + 1
    # Current col
    col = i % n_cols + 1
    # Create a violin plot trace for the current column
    violin_trace = go.Violin(# Variable for vertical violinplot
                             y=diabetes_df[var],
                             # Add Boxplot
                             box_visible=True,
                             # Add jittered datapoints
                             points='all',
                             # Name of the variable
                             name=var)
    # Add the trace to the subplot
    fig.add_trace(violin_trace, row=row, col=col)
    # Update the y-axis title for each subplot
    fig.update_yaxes(title_text="Distribution",
                     row=row,
                     col=col)
    # Update the x-axis title
    fig.update_xaxes(title_text=var,
                     row=row,
                     col=col)


# Add Histogram of Binary Outcome
histogram_trace = go.Histogram(x=diabetes_df["Outcome"],
                               name="Outcome",
                               opacity=0.6)
# Add the trace to the subplot
fig.add_trace(histogram_trace, row=3, col=3)
fig.update_yaxes(title_text="Distribution", row=3, col=3)
fig.update_xaxes(title_text="Outcome", row=3, col=3)
fig.update_traces(marker_line_width=1, marker_line_color="white")


# Update the overall layout of the figure
fig.update_layout(# Adjust overall height based on number of rows * Pixels
                  height=n_rows * 300,
                  # Adjust overall width based on number of columns * Pixels
                  width=n_cols * 300,
                  # Legend is not required
                  showlegend=False,
                  # Overall Title
                  title=dict(
                      # Set text
                      text="Plots for Diabetes Dataset Features",
                      # Center the title
                      x=0.5,
                      y=0.99,
                      xanchor='center',
                      yanchor='top',
                      # Size
                      font=dict(
                          size=40
                      )
                  )
)


# Show the Final Figure
fig.show()
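
The grid arithmetic used in the cell above (ceiling division for the number of rows, then integer division and modulo for the position of each panel) can be isolated in two small helpers. This is only a sketch: `grid_shape` and `grid_position` are hypothetical names that do not appear in the notebook.

```python
import math


def grid_shape(n_plots: int, n_cols: int) -> tuple[int, int]:
    """Return (n_rows, n_cols) for a grid large enough to hold n_plots panels."""
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols


def grid_position(i: int, n_cols: int) -> tuple[int, int]:
    """Map a zero-based panel index to a one-based (row, col) pair, as plotly expects."""
    return i // n_cols + 1, i % n_cols + 1


# Nine panels: eight feature violins plus the Outcome histogram
n_rows, n_cols = grid_shape(9, 3)
positions = [grid_position(i, n_cols) for i in range(9)]
print(n_rows, n_cols, positions[-1])
```

With helpers like these, the histogram position would not need to be hard-coded: the ninth panel (index 8) maps to row 3, column 3 in a three-column grid.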

Assess the correlation among variables with a heatmap.

In [None]:
# Correlation Matrix
corr_matrix = diabetes_df.corr(method="pearson", numeric_only=True).round(4)

# Heatmap
fig = px.imshow(# Data
                corr_matrix,
                # Continuous cmap colour
                color_continuous_scale='RdBu',
                # Add corr values
                text_auto=True,
                # Corr range
                range_color=[-1, 1]
                )

# Aesthetics & Layout
fig.update_layout(# Fixed overall height in pixels
                  height=800,
                  # Fixed overall width in pixels
                  width=800,
                  # Overall Title
                  title=dict(
                      # Set text
                      text="Correlation Heatmap of Diabetes Dataset",
                      # Center the title
                      x=0.5,
                      y=0.99,
                      xanchor='center',
                      yanchor='top',
                      # Size
                      font=dict(
                          size=30
                      )
                  )
)

# Show plot
fig.show()

Filter for strong absolute values of the correlation coefficients (|r| > 0.5). There do not seem to be any highly correlated variables.

In [None]:
# Correlation Matrix
corr_matrix = diabetes_df.corr(method="pearson", numeric_only=True).round(4)
# Mask the corr matrix to keep only |corr| > 0.5
corr_matrix = corr_matrix.where(abs(corr_matrix) > 0.5)

# Heatmap
fig = px.imshow(# Data
                corr_matrix,
                # Continuous cmap colour
                color_continuous_scale='hot',
                # Add corr values
                text_auto=True,
                range_color=[-1, 1]
                )

# Aesthetics & Layout
fig.update_layout(# Fixed overall height in pixels
                  height=800,
                  # Fixed overall width in pixels
                  width=800,
                  # Overall Title
                  title=dict(
                      # Set text
                      text="|Correlation| > 0.5 Heatmap of Diabetes Dataset",
                      # Center the title
                      x=0.5,
                      y=0.99,
                      xanchor='center',
                      yanchor='top',
                      # Size
                      font=dict(
                          size=30
                      )
                  )
)

# Show plot
fig.show()
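
As a complement to the masked heatmap, the strongly correlated pairs can also be listed as a table. The sketch below runs on a small made-up dataframe, not on the diabetes data; `strong_pairs` is a hypothetical helper name, and the 0.5 threshold mirrors the one used above.

```python
import numpy as np
import pandas as pd


def strong_pairs(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """List variable pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(method="pearson", numeric_only=True)
    # Keep the strict upper triangle: each pair appears once, the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    # Filter by absolute value and sort from strongest to weakest
    return pairs[pairs.abs() > threshold].sort_values(key=np.abs, ascending=False)


# Illustrative data only
demo_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 10.0],  # nearly proportional to "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to "a" and "b"
})
result = strong_pairs(demo_df)
print(result)
```

Calling `strong_pairs(diabetes_df)` would apply the same filter to the real data; because only the upper triangle is kept, the trivial self-correlations of 1.0 on the diagonal are excluded.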

## 2) Missing Values & Imputation

In [None]:
# Import libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Get the percentage of null values per variable. There are none.

In [None]:
# Check for Missing Values
diabetes_df.isnull().mean()*100

Unnamed: 0,0
Pregnancies,0.0
Glucose,0.0
BloodPressure,0.0
SkinThickness,0.0
Insulin,0.0
BMI,0.0
DiabetesPedigreeFunction,0.0
Age,0.0
Outcome,0.0
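
The describe() output earlier in this notebook reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. isnull() does not count such zeros. Whether they stand for unrecorded measurements is an assumption that the notebook does not confirm, but the share of zeros per column is easy to inspect. The sketch below uses only the five rows shown by head() above.

```python
import pandas as pd

# The five rows shown by diabetes_df.head() earlier in this notebook
sample_df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns may be a placeholder rather than a measurement
zero_pct = sample_df.eq(0).mean() * 100
print(zero_pct)
```

Running the same expression on the matching columns of `diabetes_df` would give the share of zeros in the full dataset.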


Visualize the absence of null values with a heatmap.

In [None]:
# Heatmap
missingvals_df = diabetes_df.isnull()

fig = px.imshow(# Data
                missingvals_df,
                # Values
                text_auto=True,
                # cmap
                color_continuous_scale='hot_r',
                # Range
                range_color=[0, 1]
)

# Labels
fig.update_yaxes(title_text="Patients")
fig.update_xaxes(title_text="Variables")

# Aesthetics & Layout
fig.update_layout(# Fixed overall height in pixels
                  height=800,
                  # Fixed overall width in pixels
                  width=800,
                  # Overall Title
                  title=dict(
                      # Set text
                      text="Heatmap of Missing Values",
                      # Center the title
                      x=0.5,
                      y=0.99,
                      xanchor='center',
                      yanchor='top',
                      # Size
                      font=dict(
                          size=30
                      )
                  ),
                  # Reverse y-axis to increase bottom->top
                  yaxis=dict(
                      autorange=True
                  )
)

# Show plot
fig.show()

## Outliers Analysis

### Boxplot

Use boxplots as a visual method. Besides the median and quartiles, each boxplot also shows the mean and standard deviation. All variables seem to have outliers.

In [None]:
# Get the columns to plot: numerical continuous variables, except binary variable "Outcome"
cols_to_plot = diabetes_df.columns.drop("Outcome")
num_cols = len(cols_to_plot)


# Determine the number of rows and columns for the subplot grid
# Set the number of columns in the grid
n_cols = 3
# Calculate the number of rows needed
n_rows = (num_cols + n_cols - 1) // n_cols


# Create the subplot figure
fig = make_subplots(# Number of rows
                    rows=n_rows,
                    # Number of cols
                    cols=n_cols,
                    # Titles per subplot
                    subplot_titles=[f"Boxplot for {col}" for col in cols_to_plot],
                    # Adjust spacing between rows
                    vertical_spacing=0.1,
                    # Adjust spacing between columns
                    horizontal_spacing=0.1)


# Create a Boxplot per each variable, i.e. per column
for i, var in enumerate(cols_to_plot):
    # Current row
    row = i // n_cols + 1
    # Current col
    col = i % n_cols + 1
    # Create a boxplot trace for the current column
    boxplot_trace = go.Box(# Variable for vertical boxplot
                           y=diabetes_df[var],
                           # Show mean and sd
                           boxmean='sd',
                           # Show Outliers and Suspected Outliers
                           boxpoints='suspectedoutliers',
                           # Name of the variable
                           name=var)
    # Add the trace to the subplot
    fig.add_trace(boxplot_trace, row=row, col=col)
    # Update the y-axis title for each subplot
    fig.update_yaxes(title_text="Distribution",
                     row=row,
                     col=col)
    # Update the x-axis title
    fig.update_xaxes(title_text=var,
                     row=row,
                     col=col)

# Update the overall layout of the figure
fig.update_layout(# Adjust overall height based on number of rows * Pixels
                  height=n_rows * 300,
                  # Adjust overall width based on number of columns * Pixels
                  width=n_cols * 300,
                  # Legend is not required
                  showlegend=False,
                  # Overall Title
                  title=dict(
                      # Set text
                      text="Boxplots for Diabetes Dataset Features",
                      # Center the title
                      x=0.5,
                      y=0.99,
                      xanchor='center',
                      yanchor='top',
                      # Size
                      font=dict(
                          size=40
                      )
                  )
)


# Show the Final Figure
fig.show()

### Statistical Methods: IQR

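Use the IQR statistical method to identify outliers: values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are flagged. Every variable except Outcome has outliers, but they are a small share of the data (under 6% per variable).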


In [None]:
# Import libraries
from sklearn.impute import SimpleImputer

In [None]:
# Set IQR
IQR = diabetes_df.quantile(0.75) - diabetes_df.quantile(0.25)
# Lower limit
lower_limit = diabetes_df.quantile(0.25) - 1.5 * IQR
# Upper limit
upper_limit = diabetes_df.quantile(0.75) + 1.5 * IQR
# Identify Outliers -> Nulls
diabetes_nonoutliers_df = diabetes_df[(diabetes_df > lower_limit) & (diabetes_df < upper_limit)]
# Percentage of outliers per variable
diabetes_nonoutliers_df.isnull().mean()*100

Unnamed: 0,0
Pregnancies,0.520833
Glucose,0.651042
BloodPressure,5.859375
SkinThickness,0.130208
Insulin,4.427083
BMI,2.473958
DiabetesPedigreeFunction,3.776042
Age,1.171875
Outcome,0.0
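
To make the IQR rule concrete, here is a minimal worked example on a single made-up series. `iqr_limits` is a hypothetical helper name; the 1.5 multiplier and the strict inequalities match the cell above.

```python
import pandas as pd


def iqr_limits(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for one variable."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative data: five ordinary values and one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 100], name="toy")
lower, upper = iqr_limits(toy)

# Values outside the fences become NaN, as in the dataframe-wide version above
inside = (toy > lower) & (toy < upper)
masked = toy.where(inside)

print(lower, upper)                   # -1.5 8.5
print(masked.isnull().mean() * 100)   # share of flagged values, in percent
```

Here Q1 = 2.25 and Q3 = 4.75, so IQR = 2.5 and the fences are -1.5 and 8.5; only the value 100 falls outside them.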


Impute the outliers with the mean of each variable. Afterwards, no null values remain.

In [None]:
# Impute values to replace outliers (marked as NaN)
for col in diabetes_nonoutliers_df.columns:
    # Imputer object
    imputer_mean = SimpleImputer(strategy="mean")
    # Impute with the column mean; ravel() flattens the (n, 1) result to 1-D
    diabetes_nonoutliers_df[col] = imputer_mean.fit_transform(diabetes_nonoutliers_df[[col]]).ravel()

# Check imputing results: percentage of remaining nulls per variable
diabetes_nonoutliers_df.isnull().mean()*100

Unnamed: 0,0
Pregnancies,0.0
Glucose,0.0
BloodPressure,0.0
SkinThickness,0.0
Insulin,0.0
BMI,0.0
DiabetesPedigreeFunction,0.0
Age,0.0
Outcome,0.0
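
For a single column, mean imputation can also be written with pandas alone. The sketch below is an illustration on made-up values, not a replacement for the SimpleImputer cell; it also shows a side effect worth keeping in mind: the mean is preserved, but the spread shrinks.

```python
import numpy as np
import pandas as pd

# Illustrative data: the last value was flagged as an outlier and set to NaN
masked = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan], name="toy")

# mean() skips NaN, so the fill value is the mean of the remaining values
fill_value = masked.mean()
imputed = masked.fillna(fill_value)

print(fill_value)                    # 3.0
print(masked.std(), imputed.std())   # the standard deviation drops after imputation
```

Every imputed entry sits exactly at the mean, which is why a quartile of a heavily imputed column can coincide with the column mean.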


In [None]:
# Update dataframe
diabetes_IQRfiltered_df = diabetes_nonoutliers_df.copy()
diabetes_IQRfiltered_df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.786649,121.686763,72.208852,20.434159,62.328338,32.204005,0.429832,32.805007,0.348958
std,3.270153,30.435949,11.146615,15.698281,77.358761,6.41048,0.244918,11.047789,0.476951
min,0.0,44.0,38.0,0.0,0.0,18.2,0.078,21.0,0.0
25%,1.0,99.75,64.0,0.0,0.0,27.5,0.24375,24.0,0.0
50%,3.0,117.0,72.208852,23.0,30.5,32.204005,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,110.0,36.3,0.58225,40.0,1.0
max,13.0,199.0,106.0,63.0,318.0,50.0,1.191,66.0,1.0


In [None]:
diabetes_IQRfiltered_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    float64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
 8   Outcome                   768 non-null    float64
dtypes: float64(9)
memory usage: 54.1 KB


In [None]:
# Remove all entries with at least 1 outlier since they are so few
#rows_to_keep_mask = outliers_df.isnull().all(axis=1)
# Keep rows where all values are NaN
#diabetes_IQRfiltered_df = diabetes_df[rows_to_keep_mask].copy()
# Check again the stats
#diabetes_IQRfiltered_df

In [None]:
# Check how many entries were kept ~83%
#keep_percentage = 100*(len(diabetes_nonoutliers_df) / len(diabetes_df))
#print(f"Percentage of data kept: {keep_percentage:.2f}%.")

When the boxplots are plotted again, the distributions show far fewer outliers, and several variables have none at all.

In [None]:
# Show boxplots again -> Fewer Outliers
# Get the columns to plot: numerical continuous variables, except binary variable "Outcome"
cols_to_plot = diabetes_IQRfiltered_df.columns.drop("Outcome")
num_cols = len(cols_to_plot)


# Determine the number of rows and columns for the subplot grid
# Set the number of columns in the grid
n_cols = 3
# Calculate the number of rows needed
n_rows = (num_cols + n_cols - 1) // n_cols


# Create the subplot figure
fig = make_subplots(# Number of rows
                    rows=n_rows,
                    # Number of cols
                    cols=n_cols,
                    # Titles per subplot
                    subplot_titles=[f"Boxplot for {col}" for col in cols_to_plot],
                    # Adjust spacing between rows
                    vertical_spacing=0.1,
                    # Adjust spacing between columns
                    horizontal_spacing=0.1)


# Create a Boxplot per each variable, i.e. per column
for i, var in enumerate(cols_to_plot):
    # Current row
    row = i // n_cols + 1
    # Current col
    col = i % n_cols + 1
    # Create a boxplot trace for the current column
    boxplot_trace = go.Box(# Variable for vertical boxplot
                           y=diabetes_IQRfiltered_df[var],
                           # Show mean and sd
                           boxmean='sd',
                           # Show Outliers and Suspected Outliers
                           boxpoints='suspectedoutliers',
                           # Name of the variable
                           name=var)
    # Add the trace to the subplot
    fig.add_trace(boxplot_trace, row=row, col=col)
    # Update the y-axis title for each subplot
    fig.update_yaxes(title_text="Distribution",
                     row=row,
                     col=col)
    # Update the x-axis title
    fig.update_xaxes(title_text=var,
                     row=row,
                     col=col)

# Update the overall layout of the figure
fig.update_layout(# Adjust overall height based on number of rows * Pixels
                  height=n_rows * 300,
                  # Adjust overall width based on number of columns * Pixels
                  width=n_cols * 300,
                  # Legend is not required
                  showlegend=False,
                  # Overall Title
                  title=dict(
                      # Set text
                      text="Boxplots for Diabetes Dataset Features",
                      # Center the title
                      x=0.5,
                      y=0.99,
                      xanchor='center',
                      yanchor='top',
                      # Size
                      font=dict(
                          size=40
                      )
                  )
)


# Show the Final Figure
fig.show()

### Algorithmic Methods: Isolation Forest

Try out the algorithmic method Isolation Forest with 5% contamination to moderately remove outliers.

In [None]:
# Import libraries
from sklearn.ensemble import IsolationForest

Apply the Isolation Forest algorithm and remove the rows it flags as outliers.

In [None]:
# Isolation Forest: -1 -> Outlier, 1 -> Inlier
# Create IsolationForest Model
IsoForest = IsolationForest(contamination=0.05, random_state=8)
# Fit model and predict labels without adding a column to the original dataframe
isof_labels = IsoForest.fit_predict(diabetes_df)
# Keep only the rows that were not flagged as outliers
diabetes_IsoFfiltered_df = diabetes_df[isof_labels != -1].copy()
# Check the filtered dataframe
diabetes_IsoFfiltered_df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
5,5,116,74,0,0,25.6,0.201,30,0
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


Check that ~95% of the data was kept.

In [None]:
# Check how many entries were kept ~95%
keep_percentage = 100*(len(diabetes_IsoFfiltered_df) / len(diabetes_df))
print(f"Percentage of data kept: {keep_percentage:.2f}%.")

Percentage of data kept: 94.92%.
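
The share of rows kept follows closely from the `contamination` parameter: scikit-learn sets the decision threshold so that roughly that fraction of the training rows is labelled -1. The 94.92% above is consistent with about 5% of the 768 rows (39 rows) being flagged. The sketch below illustrates this on synthetic Gaussian data, which is an assumption made purely for the demonstration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data for illustration only: 200 rows, 3 features
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3))

# Same settings as in the notebook cell above
model = IsolationForest(contamination=0.05, random_state=8)
labels = model.fit_predict(X)  # -1 -> outlier, 1 -> inlier

outlier_fraction = float(np.mean(labels == -1))
print(f"Flagged as outliers: {outlier_fraction:.2%}")
```

Because the flagged fraction is fixed in advance, the number of removed rows says little about how many anomalies the data really contains; the remaining boxplot outliers are a more informative check.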


Plot boxplots to assess the outlier removal. Many univariate outliers are still visible, which suggests that a contamination of 5% is too low to remove them all.

In [None]:
# Show boxplots again -> Fewer Outliers
# Get the columns to plot: numerical continuous variables, except binary variable "Outcome"
cols_to_plot = diabetes_IsoFfiltered_df.columns.drop("Outcome")
num_cols = len(cols_to_plot)


# Determine the number of rows and columns for the subplot grid
# Set the number of columns in the grid
n_cols = 3
# Calculate the number of rows needed
n_rows = (num_cols + n_cols - 1) // n_cols


# Create the subplot figure
fig = make_subplots(# Number of rows
                    rows=n_rows,
                    # Number of cols
                    cols=n_cols,
                    # Titles per subplot
                    subplot_titles=[f"Boxplot for {col}" for col in cols_to_plot],
                    # Adjust spacing between rows
                    vertical_spacing=0.1,
                    # Adjust spacing between columns
                    horizontal_spacing=0.1)


# Create a Boxplot per each variable, i.e. per column
for i, var in enumerate(cols_to_plot):
    # Current row
    row = i // n_cols + 1
    # Current col
    col = i % n_cols + 1
    # Create a boxplot trace for the current column
    boxplot_trace = go.Box(# Variable for vertical boxplot
                           y=diabetes_IsoFfiltered_df[var],
                           # Show mean and sd
                           boxmean='sd',
                           # Show Outliers and Suspected Outliers
                           boxpoints='suspectedoutliers',
                           # Name of the variable
                           name=var)
    # Add the trace to the subplot
    fig.add_trace(boxplot_trace, row=row, col=col)
    # Update the y-axis title for each subplot
    fig.update_yaxes(title_text="Distribution",
                     row=row,
                     col=col)
    # Update the x-axis title
    fig.update_xaxes(title_text=var,
                     row=row,
                     col=col)

# Update the overall layout of the figure
fig.update_layout(# Adjust overall height based on number of rows * Pixels
                  height=n_rows * 300,
                  # Adjust overall width based on number of columns * Pixels
                  width=n_cols * 300,
                  # Legend is not required
                  showlegend=False,
                  # Overall Title
                  title=dict(
                      # Set text
                      text="Boxplots for Diabetes Dataset Features",
                      # Center the title
                      x=0.5,
                      y=0.99,
                      xanchor='center',
                      yanchor='top',
                      # Size
                      font=dict(
                          size=40
                      )
                  )
)


# Show the Final Figure
fig.show()

The IQR method dealt with almost every outlier. Because outlying values were replaced with the column mean rather than dropped, all 768 samples are retained. Thus, the IQR-filtered dataset will be used for scaling.

## Scaling

Scale and normalize the data.

In [None]:
# Import libraries
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

Check which variables are compatible with a normal distribution using scipy's normaltest, at a significance level of 0.05.

In [None]:
# Run the normality test (D'Agostino-Pearson) on every feature column
normality_res = stats.normaltest(diabetes_IQRfiltered_df.drop("Outcome", axis=1).to_numpy())
# Collect statistics and p-values in a dataframe with one row per variable
normality_res_df = pd.DataFrame(normality_res, columns=diabetes_IQRfiltered_df.drop("Outcome", axis=1).columns, index=["Statistic", "P-value"]).transpose()

In [None]:
# Get normal variables
normal_var = normality_res_df[normality_res_df["P-value"] >= 0.05]
normal_var

Unnamed: 0,Statistic,P-value
BloodPressure,1.015173,0.601947


In [None]:
# Get not normal variables
non_normal_var = normality_res_df[normality_res_df["P-value"] < 0.05]
non_normal_var

Unnamed: 0,Statistic,P-value
Pregnancies,69.528186,7.982635e-16
Glucose,35.325554,2.133798e-08
SkinThickness,451.081032,1.119434e-98
Insulin,113.476043,2.285548e-25
BMI,12.995447,0.001506865
DiabetesPedigreeFunction,85.757909,2.387299e-19
Age,99.091269,3.038117e-22
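
The split into normal and non-normal variables can be reproduced on synthetic data to see how the test behaves. The sketch below is an illustration with made-up column names (`gaussian`, `skewed`); it builds the result table from named columns instead of transposing the raw result.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "gaussian": rng.normal(loc=0.0, scale=1.0, size=500),
    "skewed": rng.exponential(scale=1.0, size=500),
})

# One test per column (axis=0); the result unpacks into two arrays
statistic, p_value = stats.normaltest(demo_df.to_numpy(), axis=0)

result_df = pd.DataFrame(
    {"Statistic": statistic, "P-value": p_value},
    index=demo_df.columns,
)
# The same 0.05 significance level as in the notebook
result_df["Rejects normality"] = result_df["P-value"] < 0.05
print(result_df)
```

A p-value at or above 0.05 means the test fails to reject normality; it does not prove that the variable is normally distributed.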


Apply Z-score scaling (StandardScaler) to the variables for which normality was not rejected.

In [None]:
# Z-score scaling for [quasi]-Normal Data -> StandardScaler
# Vars
scaling_columns = normal_var.index
# Standard Scaling
scaler = StandardScaler()
# Copy dataframe
diabetes_StandardScale_df = diabetes_IQRfiltered_df.copy()
# Fit and transform
diabetes_StandardScale_df[scaling_columns] = scaler.fit_transform(diabetes_IQRfiltered_df[scaling_columns])

# Check
diabetes_StandardScale_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,-0.018749,35.0,0.0,33.6,0.627,50.0,1.0
1,1.0,85.0,-0.55738,29.0,0.0,26.6,0.351,31.0,0.0
2,8.0,183.0,-0.736923,0.0,0.0,23.3,0.672,32.0,1.0
3,1.0,89.0,-0.55738,23.0,94.0,28.1,0.167,21.0,0.0
4,0.0,137.0,-2.891447,35.0,168.0,43.1,0.429832,33.0,1.0
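
The scaled BloodPressure values can be checked by hand from the describe() output of the IQR-filtered data. One detail matters: describe() reports the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). The sketch below uses the rounded figures printed above, so agreement is only expected to about six decimals.

```python
import math

# Figures taken from the describe() output of the IQR-filtered dataframe
n = 768
mean_bp = 72.208852
sample_std_bp = 11.146615  # ddof=1, as reported by describe()

# Convert to the population standard deviation (ddof=0) used by StandardScaler
population_std_bp = sample_std_bp * math.sqrt((n - 1) / n)


def z_score(x: float) -> float:
    """Standardize one BloodPressure value."""
    return (x - mean_bp) / population_std_bp


z_row0 = z_score(72.0)  # row 0 of the table above
z_row4 = z_score(40.0)  # row 4 of the table above
print(round(z_row0, 6), round(z_row4, 6))
```

Both values agree with the BloodPressure column of the table above (-0.018749 and -2.891447) up to rounding.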


Apply Min-Max scaling to the non-normal variables, mapping each one to the range [0, 1].

In [None]:
# Min-Max Scaling for Non-Normal Data -> Normalization
# Vars
scaling_columns = non_normal_var.index
# Min-Max Scaling
scaler = MinMaxScaler()
# Copy dataframe
diabetes_MinMaxScale_df = diabetes_StandardScale_df.copy()
# Fit and transform
diabetes_MinMaxScale_df[scaling_columns] = scaler.fit_transform(diabetes_StandardScale_df[scaling_columns])

# Check
diabetes_MinMaxScale_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.461538,0.670968,-0.018749,0.555556,0.0,0.484277,0.493261,0.644444,1.0
1,0.076923,0.264516,-0.55738,0.460317,0.0,0.264151,0.245283,0.222222,0.0
2,0.615385,0.896774,-0.736923,0.0,0.0,0.160377,0.533693,0.244444,1.0
3,0.076923,0.290323,-0.55738,0.365079,0.295597,0.311321,0.079964,0.0,0.0
4,0.0,0.6,-2.891447,0.555556,0.528302,0.783019,0.316112,0.266667,1.0
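
Min-Max scaling maps each value to (x - min) / (max - min). The first row of the table above can be verified with the minima and maxima from the describe() output of the IQR-filtered data; `min_max` is a hypothetical helper name.

```python
def min_max(x: float, lo: float, hi: float) -> float:
    """Scale x to [0, 1] given the column minimum lo and maximum hi."""
    return (x - lo) / (hi - lo)


# Row 0 values with (min, max) from describe() of the IQR-filtered dataframe
pregnancies_row0 = min_max(6.0, 0.0, 13.0)
glucose_row0 = min_max(148.0, 44.0, 199.0)
age_row0 = min_max(50.0, 21.0, 66.0)

print(round(pregnancies_row0, 6), round(glucose_row0, 6), round(age_row0, 6))
```

The results match the first row of the scaled table (0.461538, 0.670968, 0.644444) up to rounding.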


Check the distributions after pre-processing. The data has almost no outliers and no null values; BloodPressure is standardized and the remaining features are scaled to [0, 1]. Ready to save.

In [None]:
# Check final data
# Get the columns to plot: numerical continuous variables, except binary variable "Outcome"
cols_to_plot = diabetes_MinMaxScale_df.columns.drop("Outcome")
num_cols = len(cols_to_plot)


# Determine the number of rows and columns for the subplot grid
# Set the number of columns in the grid
n_cols = 3
# Calculate the number of rows needed
n_rows = (num_cols + n_cols - 1) // n_cols


# Create the subplot figure
fig = make_subplots(# Number of rows
                    rows=n_rows,
                    # Number of cols
                    cols=n_cols,
                    # Titles per subplot
                    subplot_titles=[f"Violin Plot for {col}" for col in cols_to_plot] + ["Histogram of Outcome"],
                    # Adjust spacing between rows
                    vertical_spacing=0.1,
                    # Adjust spacing between columns
                    horizontal_spacing=0.1)


# Create a Violin plot per each variable, i.e. per column
for i, var in enumerate(cols_to_plot):
    # Current row
    row = i // n_cols + 1
    # Current col
    col = i % n_cols + 1
    # Create a violin plot trace for the current column
    violin_trace = go.Violin(# Variable for vertical violinplot
                             y=diabetes_MinMaxScale_df[var],
                             # Add Boxplot
                             box_visible=True,
                             # Add jittered datapoints
                             points='all',
                             # Name of the variable
                             name=var)
    # Add the trace to the subplot
    fig.add_trace(violin_trace, row=row, col=col)
    # Update the y-axis title for each subplot
    fig.update_yaxes(title_text="Distribution",
                     row=row,
                     col=col)
    # Update the x-axis title
    fig.update_xaxes(title_text=var,
                     row=row,
                     col=col)


# Add Histogram of Binary Outcome
histogram_trace = go.Histogram(x=diabetes_MinMaxScale_df["Outcome"],
                               name="Outcome",
                               opacity=0.6)
# Add the trace to the subplot
fig.add_trace(histogram_trace, row=3, col=3)
fig.update_yaxes(title_text="Distribution", row=3, col=3)
fig.update_xaxes(title_text="Outcome", row=3, col=3)
fig.update_traces(marker_line_width=1, marker_line_color="white")


# Update the overall layout of the figure
fig.update_layout(# Adjust overall height based on number of rows * Pixels
                  height=n_rows * 300,
                  # Adjust overall width based on number of columns * Pixels
                  width=n_cols * 300,
                  # Legend is not required
                  showlegend=False,
                  # Overall Title
                  title=dict(
                      # Set text
                      text="Plots for Diabetes Pre-Processed Dataset Features",
                      # Center the title
                      x=0.5,
                      y=0.99,
                      xanchor='center',
                      yanchor='top',
                      # Size
                      font=dict(
                          size=30
                      )
                  )
)


# Show the Final Figure
fig.show()

## Save Pre-processed Data

Save the pre-processed dataframe as a CSV file for further processing.

In [None]:
# Save csv file
diabetes_MinMaxScale_df.to_csv("diabetes_preprocessed.csv", index=False)
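
A quick round-trip check can confirm that a saved file reloads to the same data. The sketch below uses a two-row stand-in dataframe and a temporary directory (both assumptions made for the illustration) so that it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the pre-processed dataframe (values copied from the tables above)
demo_df = pd.DataFrame({
    "Pregnancies": [0.461538, 0.076923],
    "BloodPressure": [-0.018749, -0.55738],
    "Outcome": [1.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo_preprocessed.csv")
    # index=False keeps the row index out of the file, as in the cell above
    demo_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

# Raises an AssertionError if columns, dtypes, or values differ beyond tolerance
pd.testing.assert_frame_equal(demo_df, reloaded_df)
print(reloaded_df.shape)
```

Writing with `index=False` is what keeps an extra unnamed index column from appearing when the file is read back.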