# **MLCB - Assignment 2**
### Glykeria Spyrou

Setting up the project's directory and importing the proper libraries.

For the whole assignment, a `functions.py` script was created containing several custom functions that were developed and are used in multiple places.

In [1]:
# Load libraries
import numpy as np
import pandas as pd
import plotly.io as pio

import os
os.chdir('/Users/glykeriasp/Documents/DSIT/Machine Learning in Computational Biology/Assignments/Assignment 2/Assignment_2/src/')
from functions import boxplot_feats_plot, correlation_heatmap, pca_plot, outlier_mask

The dataset for the current assignment was downloaded via git with the following command:

`git clone https://github.com/MLCB2024Class/Assignment_2.git`

Then, with the following command, we created three subdirectories for source code, notebooks, and models within our working directory.

`mkdir src notebooks models`

### **TASK 1: Exploratory Data Analysis**

#### *Dataset Overview and Descriptive Statistics*

In [2]:
# Load the .csv file of the Diabetes dataset
os.chdir('/Users/glykeriasp/Documents/DSIT/Machine Learning in Computational Biology/Assignments/Assignment 2/Assignment_2')
data = pd.read_csv('data/Diabetes.csv')

In [3]:
# Shape of the dataset
print("Shape of the dataset:", data.shape)

# Printing the head of the dataset
print("\nFirst few rows of the dataset:")
print(data.head())

Shape of the dataset: (506, 10)

First few rows of the dataset:
    ID  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0  200            0      113             80             16        0  31.0   
1  552            6      114             88              0        0  27.8   
2   50            1      103             80             11       82  19.4   
3  631            0      102             78             40       90  34.5   
4   47            2       71             70             27        0  28.0   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.874   21        0  
1                     0.247   66        0  
2                     0.491   22        0  
3                     0.238   24        0  
4                     0.586   22        0  


In [4]:
# Display column data types and check for missing values
print("\nColumn data types and missing values:")
print(data.info())


Column data types and missing values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        506 non-null    int64  
 1   Pregnancies               506 non-null    int64  
 2   Glucose                   506 non-null    int64  
 3   BloodPressure             506 non-null    int64  
 4   SkinThickness             506 non-null    int64  
 5   Insulin                   506 non-null    int64  
 6   BMI                       506 non-null    float64
 7   DiabetesPedigreeFunction  506 non-null    float64
 8   Age                       506 non-null    int64  
 9   Outcome                   506 non-null    int64  
dtypes: float64(2), int64(8)
memory usage: 39.7 KB
None


As shown in the output above, there are no null values in any of the columns of the dataset.
We can also see that all columns are integers, except for 'BMI' and 'DiabetesPedigreeFunction', which are floats.

In [5]:
# Check for duplicate rows
duplicate_rows = data[data.duplicated()]
print("\nNumber of duplicate rows:", duplicate_rows.shape[0])


Number of duplicate rows: 0


In [6]:
# Descriptive statistics
print("\nDescriptive statistics:")
print(data.describe())


Descriptive statistics:
               ID  Pregnancies   Glucose  BloodPressure  SkinThickness  \
count  506.000000   506.000000  506.0000     506.000000     506.000000   
mean   385.225296     3.879447  120.5000      69.397233      20.480237   
std    220.920434     3.354809   31.6791      18.970491      15.602888   
min      0.000000     0.000000    0.0000       0.000000       0.000000   
25%    191.250000     1.000000   98.2500      62.000000       0.000000   
50%    382.500000     3.000000  116.0000      72.000000      23.000000   
75%    576.500000     6.000000  140.0000      80.000000      32.000000   
max    764.000000    17.000000  198.0000     122.000000      60.000000   

          Insulin         BMI  DiabetesPedigreeFunction         Age  \
count  506.000000  506.000000                506.000000  506.000000   
mean    76.666008   32.168775                  0.478875   33.268775   
std    107.365763    7.931377                  0.340221   11.542041   
min      0.000000    0.0

In the cell below the class balance of the dataset is described.

In [7]:
# Class balance
class_counts = data['Outcome'].value_counts()

# Calculating the proportions
healthy_count = class_counts[0]
diabetes_count = class_counts[1]

total_samples = data.shape[0]
healthy_prop = healthy_count/total_samples
diabetes_prop = diabetes_count/total_samples

print("Class Balance:")
print(f"Healthy Controls (Class 0): {healthy_count} instances, {healthy_prop:.2%} of total")
print(f"Diabetes Patients (Class 1): {diabetes_count} instances, {diabetes_prop:.2%} of total")

Class Balance:
Healthy Controls (Class 0): 329 instances, 65.02% of total
Diabetes Patients (Class 1): 177 instances, 34.98% of total
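
The same class-balance summary can be obtained more compactly with `value_counts(normalize=True)`. The snippet below is a minimal sketch on a toy `Outcome` column (the values are made up for illustration and are not the assignment data); the names `toy` and `balance` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the 'Outcome' column (illustrative values only)
toy = pd.DataFrame({"Outcome": [0, 0, 0, 1, 1]})

# Absolute counts and relative frequencies, both indexed by class label
counts = toy["Outcome"].value_counts()
props = toy["Outcome"].value_counts(normalize=True)

# Combine into one table, ordered by class label
balance = pd.DataFrame({"count": counts, "proportion": props}).sort_index()
print(balance)
```

Indexing the result by class label (rather than by position) keeps the lookup correct even when the minority class happens to be listed first.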


#### *Feature Assessment and Visualization*

For the assessment of the features, first of all, we are going to create boxplots that will help us examine their distributions.

In [8]:
feats_box = boxplot_feats_plot(data, "Boxplots of Features")
feats_box.show()

In [9]:
# Save as pdf
pio.write_image(feats_box, "plots/feat_boxplots.pdf", height=480, width=1100)

For the next task, we will perform correlation analysis in order to inspect the relationship between features in the dataset.
Again, all features will be selected to calculate the correlation coefficients.
At the end we can visualize the relationship between the features through a heatmap.

In [10]:
# Create the correlation heatmap
heatmap_fig = correlation_heatmap(data, 'Features Correlation Heatmap', 'ID')
heatmap_fig.show()

In [11]:
# Save as pdf
pio.write_image(heatmap_fig, "plots/heatmap_feat_corr.pdf", height=570, width=600)

In the heatmap above, negatively correlated feature pairs are shown in blue, positively correlated pairs in red, and uncorrelated pairs in white.
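
The internals of `correlation_heatmap` are not shown in this notebook. As a rough sketch of the computation such a heatmap is typically built on, the snippet below calculates a Pearson correlation matrix with pandas after dropping an identifier column. The toy columns `a`, `b`, and `c` are hypothetical, and the use of Pearson correlation is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): b rises with a, c falls with a
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "ID": np.arange(200),
    "a": x,
    "b": 2.0 * x + rng.normal(scale=0.1, size=200),
    "c": -x + rng.normal(scale=0.1, size=200),
})

# Drop the identifier, then compute pairwise Pearson coefficients
corr = toy.drop(columns="ID").corr(method="pearson")
print(corr.round(2))
```

The resulting square matrix has ones on the diagonal and values in [-1, 1] elsewhere, which is what the colour scale of a heatmap encodes.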


For the next task, principal component analysis (PCA) will be performed to reduce the dimensionality of our dataset.

In [12]:
# Specify features and target
features_rm = ['ID', 'Outcome']
target = ['Outcome']

class_names = {
    0: "Healthy",
    1: "Diabetes"
}

# Create PCA plot
pca_fig = pca_plot(data, features_rm=features_rm, target=target, class_names=class_names,
                   title='Principal Component Analysis Plot')
pca_fig.show()

In [13]:
# Save as pdf
pio.write_image(pca_fig, "plots/PCA_plot.pdf", height=450, width=550)
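
The `pca_plot` helper lives in `functions.py` and its implementation is not reproduced here. The sketch below shows one common way to obtain the two-dimensional projection behind such a plot: standardize the features, then project onto the first two principal axes using NumPy's SVD. The function name `pca_2d_sketch` and the toy data are hypothetical, and the standardization step is an assumption about the preprocessing.

```python
import numpy as np
import pandas as pd

def pca_2d_sketch(df, drop_cols):
    """Project standardized features onto the first two principal components."""
    X = df.drop(columns=drop_cols).to_numpy(dtype=float)
    # Standardize so features on large scales do not dominate the components
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized matrix: rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ Vt[:2].T
    # Fraction of total variance carried by each component
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:2]

# Toy data (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
toy["ID"] = np.arange(50)
toy["Outcome"] = rng.integers(0, 2, size=50)

scores, ratio = pca_2d_sketch(toy, drop_cols=["ID", "Outcome"])
print(scores.shape, ratio)
```

The two score columns can then be scattered against each other and coloured by `Outcome`; the explained-variance ratios indicate how much of the data's spread the 2D view retains.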

#### *Data Quality Evaluation*

As shown in the boxplots above, there are numerous entries containing zero values.
For the feature recording the number of pregnancies, zero values are considered acceptable.
However, in features such as glucose, blood pressure, skin thickness, insulin, and BMI, zero values are not acceptable: they skew the distributions, and the affected rows require removal.

In [14]:
subset = data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']]

zero_values = (subset == 0).any(axis=1)
rows_with_zeros = subset[zero_values]
print(rows_with_zeros)

     Glucose  BloodPressure  SkinThickness  Insulin   BMI
0        113             80             16        0  31.0
1        114             88              0        0  27.8
4         71             70             27        0  28.0
6         97             60             23        0  28.2
7         96              0              0        0  23.7
..       ...            ...            ...      ...   ...
496      154             78             32        0  32.4
499      142             80             15        0  32.4
501       76             62              0        0  34.0
502       97             70             40        0  38.1
504      124             70             20        0  27.4

[246 rows x 5 columns]
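
Before dropping rows, it can help to see which columns contribute most of the zeros. The snippet below is a small sketch on toy values (not the assignment data) that counts zeros per column and the number of rows containing at least one zero; the names `toy`, `zeros_per_column`, and `rows_with_any_zero` are hypothetical.

```python
import pandas as pd

# Toy subset (illustrative values only)
toy = pd.DataFrame({
    "Glucose": [113, 0, 103],
    "Insulin": [0, 0, 82],
    "BMI": [31.0, 27.8, 19.4],
})

# Number of zero entries in each column
zeros_per_column = (toy == 0).sum()

# Number of rows that contain a zero in any of the columns
rows_with_any_zero = (toy == 0).any(axis=1).sum()

print(zeros_per_column)
print("Rows with at least one zero:", rows_with_any_zero)
```

A per-column breakdown like this shows whether a single feature is responsible for most of the rows that would be discarded.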


Now, rows that contain zero values in the given columns will be removed.

In [15]:
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# create a mask object
mask = (data[cols] == 0).any(axis=1)

# create new dataset
data_new = data[~mask]
data_new.reset_index(inplace=True, drop=True)

Next, we use a function (`outlier_mask`, from `functions.py`) that detects rows with outliers based on the distribution of each feature column.
For this task, the IQR method is adopted, with the threshold set to 1.5 for both the upper and lower bounds.
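
The actual `outlier_mask` is defined in `functions.py` and is not reproduced in this notebook. The snippet below is a minimal sketch of the standard IQR rule, assuming the threshold multiplies the interquartile range and is applied to every numeric column. The name `iqr_outlier_mask_sketch` is hypothetical, and the real helper may treat columns such as `ID` or `Outcome` differently.

```python
import pandas as pd

def iqr_outlier_mask_sketch(df, threshold=1.5):
    """Return a boolean DataFrame: True where a value lies outside the IQR fences."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Fences: values beyond threshold * IQR from the quartiles count as outliers
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return (numeric < lower) | (numeric > upper)

# Toy data (illustrative only): the value 100 in column x is an outlier
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [10, 11, 12, 13, 14]})
mask = iqr_outlier_mask_sketch(toy)
print(mask.any(axis=1))
```

Collapsing the mask with `.any(axis=1)` flags a row as soon as one of its features is an outlier, which matches how the mask is used to filter rows in the cell that follows.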

In [16]:
# Create the mask df using the outlier_mask function imported from functions.py
mask_df = outlier_mask(data_new, threshold=1.5)

mask_rows = mask_df.any(axis=1)
data_new = data_new[~mask_rows]
data_new.reset_index(inplace=True, drop=True)

In [17]:
mask_rows

0      False
1      False
2      False
3       True
4       True
       ...  
255    False
256    False
257    False
258    False
259    False
Length: 260, dtype: bool

We now regenerate the figures created above to inspect the new dataset (no zero values and no outliers).

In [18]:
# Create boxplots for features
feats_box2 = boxplot_feats_plot(data_new, "Boxplots of Features (curated)")
feats_box2.show()

In [19]:
# Save as pdf
pio.write_image(feats_box2, "plots/feat_boxplots2.pdf", height=480, width=1100)

In [20]:
# Create the correlation heatmap
heatmap_fig2 = correlation_heatmap(data_new, 'Features Correlation Heatmap (curated)', 'ID')
heatmap_fig2.show()

In [21]:
# Save as pdf
pio.write_image(heatmap_fig2, "plots/heatmap_feat_corr2.pdf", height=570, width=600)

Now, perform an additional PCA for the curated data.

In [22]:
# Specify features and target
features_rm = ['ID', 'Outcome']
target = ['Outcome']

class_names = {
    0: "Healthy",
    1: "Diabetes"
}

# Create PCA plot
pca_fig2 = pca_plot(data_new, features_rm=features_rm, target=target, class_names=class_names,
                   title='Principal Component Analysis Plot (curated)')
pca_fig2.show()

In [23]:
# Save as pdf
pio.write_image(pca_fig2, "plots/PCA_plot2.pdf", height=450, width=550)

For the new dataset we examined the class balance again; as shown below, it shifts somewhat toward healthy controls (about 70% versus 65% in the original dataset).

In [24]:
# Class balance
class_counts = data_new['Outcome'].value_counts()

# Calculating the proportions
healthy_count = class_counts[0]
diabetes_count = class_counts[1]

total_samples = data_new.shape[0]
healthy_prop = healthy_count/total_samples
diabetes_prop = diabetes_count/total_samples

print("Class Balance:")
print(f"Healthy Controls (Class 0): {healthy_count} instances, {healthy_prop:.2%} of total")
print(f"Diabetes Patients (Class 1): {diabetes_count} instances, {diabetes_prop:.2%} of total")

Class Balance:
Healthy Controls (Class 0): 157 instances, 70.09% of total
Diabetes Patients (Class 1): 67 instances, 29.91% of total


Save the new dataset into the 'data' folder of the directory.

In [25]:
data_new.to_csv('data/Diabetes_curated.csv', index=True)