# Project Proposal: Building a Classifier to Predict Diabetes

## 1. Introduction

Our project investigates potential key factors associated with diabetes mellitus in females of Pima Indian descent aged 21 and above. Diabetes mellitus is a serious disease that affects many populations and causes severe health complications such as heart failure; the main associated cause of death is coronary heart disease (Das, 2014). People of Indian descent often have higher rates of diabetes, which suggests a potential genetic predisposition to insulin resistance, although many other factors can contribute to the presence of diabetes. Known risk factors for this disorder include parental diabetes, obesity, and genetic components (Das, 2014). High rates of diabetes mellitus are not only a severe health issue but also place a significant strain on the healthcare system (Krishnamoorthy et al., 2022).

In this project, we ask whether the five factors we have selected can predict the onset of diabetes within a five-year time frame with an accuracy above 75%. The original dataset consists of 9 attributes; we have narrowed our scope to the five factors most associated with potentially modifiable or reversible qualities. These are plasma glucose concentration level at 2 hours in an oral glucose tolerance test (i.e. glucose test, mg/dl), diastolic blood pressure (mmHg), triceps skin fold thickness (a measure of body fat, mm), 2-hour serum insulin (µU/mL), and body mass index (kg/m^2).

The dataset we have chosen is titled “Diabetes Dataset”. It was created from the findings of the National Institute of Diabetes and Digestive and Kidney Diseases and was uploaded to Kaggle by Mehmet Akturk.

## 2. Preliminary Analysis

Our GitHub repository can be found at: https://github.com/hesoru/DSCI_100_Diabetes_Prediction

### Dataset Source

Our dataset was obtained from Kaggle at the following URL: 

https://www.kaggle.com/datasets/mathchi/diabetes-data-set?fbclid=IwAR1DMzdJFDxoEqLDIZNTi3j7YJXTx_7BJwCl7sbn8syQKbQCnHfMtlsKH1E

This dataset was uploaded by user Mehmet Akturk and was sourced from the National Institute of Diabetes and Digestive and Kidney Diseases (donated by RMI Group Leader Vincent Sigillito).

The study population consisted of 768 female patients of at least 21 years of age and Pima Indian heritage, living near Phoenix, Arizona, USA. Researchers collected the following data:
 1. Number of times pregnant
 2. Plasma glucose concentration level at 2 hours in an oral glucose tolerance test (i.e. glucose test, mg/dl)
 3. Diastolic blood pressure (mmHg)
 4. Triceps skin fold thickness - a measure of body fat (mm)
 5. 2-Hour serum insulin (µU/mL)
 6. Body mass index (kg/m^2)
 7. Diabetes pedigree function (probability of diabetes based on family history) 
 8. Age
 9. Outcome (0 = glucose test negative for diabetes 5+ years after data collection, 1 = glucose test positive for diabetes within 5 years of data collection)

Diabetes was diagnosed by a plasma glucose concentration level greater than 200 mg/dl at 2 hours in an oral glucose tolerance test. All patients had a negative glucose test for diabetes at initial data collection. 

In [None]:
# This code cell is for installation only, start from below if not needed
install.packages("tidyverse")
install.packages("tidymodels")
install.packages("gridExtra")

In [None]:
set.seed(1000)
library("tidyverse")
library("tidymodels")
library("gridExtra")

### Tidying and Filtering Data

In [None]:
diabetes_dataset <- read_csv("https://raw.githubusercontent.com/hesoru/DSCI_100_Diabetes_Prediction/main/Dataset/diabetes.csv")
diabetes_dataset

To narrow down parameters for classification, we will only be using modifiable/reversible variables in analysis. These include:

1. Plasma glucose concentration level at 2 hours in an oral glucose tolerance test (i.e. glucose test, mg/dl)
2. Diastolic blood pressure (mmHg)
3. Triceps skin fold thickness - a measure of body fat (mm)
4. 2-Hour serum insulin (µU/mL)
5. Body mass index (kg/m^2)

We will remove irreversible or non-modifiable variables: pregnancies (not a reversible variable), diabetes pedigree function (probability of diabetes based on family history, not modifiable), and age.

The dataset has no N/A values; however, some cells have a value of 0.

In particular, glucose, blood pressure, skin thickness, insulin, and BMI cannot realistically have a reading of 0. Cells with a value of 0 in these columns will be treated as N/A, and the rows containing them will be filtered out.

In [None]:
diabetes_dataset_filtered <- diabetes_dataset |>
    select(-Pregnancies, -DiabetesPedigreeFunction, -Age) |>
    filter(Glucose != 0, BloodPressure != 0, SkinThickness != 0, Insulin != 0, BMI != 0)
diabetes_dataset_filtered
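
Before filtering, it can be useful to see how many placeholder zeros each clinical column contains. The following is a minimal sketch under stated assumptions: `toy_patients` is a hypothetical tibble with made-up values (not rows from the dataset) that reuses the real column names, and `clinical_columns` is a helper name introduced here. It also shows an equivalent way to express the filter by converting zeros to `NA` and dropping incomplete rows.

```r
library(tidyverse)

# Hypothetical toy data with the same column names as the real dataset;
# the values are made up for illustration only.
toy_patients <- tibble(
    Glucose       = c(148, 0, 183, 89),
    BloodPressure = c(72, 66, 0, 66),
    SkinThickness = c(35, 29, 0, 23),
    Insulin       = c(0, 94, 0, 94),
    BMI           = c(33.6, 26.6, 23.3, 28.1),
    Outcome       = c(1, 0, 1, 0)
)

clinical_columns <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Count the zeros in each clinical column.
zero_counts <- toy_patients |>
    summarise(across(all_of(clinical_columns), ~ sum(.x == 0)))
zero_counts

# Treat zeros as missing, then drop any row with a missing clinical value.
toy_patients_filtered <- toy_patients |>
    mutate(across(all_of(clinical_columns), ~ na_if(.x, 0))) |>
    drop_na(all_of(clinical_columns))
toy_patients_filtered
```

Applied to `diabetes_dataset`, the same pattern would show which columns drive most of the row loss, which is worth knowing because filtering on every column at once can discard a large share of the patients.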

### Split Clinical Data into Training and Testing Sets

In [None]:
diabetes_dataset_filtered_split <- initial_split(data = diabetes_dataset_filtered, prop = 0.75, strata = Outcome)
diabetes_dataset_filtered_training <- training(diabetes_dataset_filtered_split)
diabetes_dataset_filtered_testing <- testing(diabetes_dataset_filtered_split)

diabetes_dataset_filtered_training
diabetes_dataset_filtered_testing
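
Because the split is stratified by `Outcome`, the proportion of each outcome should be roughly the same in the training and testing sets. The sketch below checks this on a hypothetical toy dataset (`toy_data`, made-up values with an assumed 30% positive rate); the object names are illustrative and not part of our analysis.

```r
library(tidymodels)

set.seed(1000)

# Hypothetical toy data: 200 patients, 30% with Outcome = 1.
toy_data <- tibble(
    Glucose = rnorm(200, mean = 120, sd = 30),
    Outcome = factor(rep(c(0, 1), times = c(140, 60)))
)

toy_split <- initial_split(toy_data, prop = 0.75, strata = Outcome)

# Proportion of each outcome within the training and testing sets.
outcome_proportions <- bind_rows(
    training = training(toy_split),
    testing = testing(toy_split),
    .id = "set"
) |>
    count(set, Outcome) |>
    group_by(set) |>
    mutate(proportion = n / sum(n)) |>
    ungroup()
outcome_proportions
```

Running the same summary on `diabetes_dataset_filtered_training` and `diabetes_dataset_filtered_testing` would confirm that neither set over-represents one outcome.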

### Exploration of Training Data

In [None]:
patient_means_by_outcome <- diabetes_dataset_filtered_training |>
    group_by(Outcome) |>
    summarise(Patients = n(),
              Mean_Glucose = mean(Glucose),
              Mean_BP = mean(BloodPressure),
              Mean_SkinThickness = mean(SkinThickness),
              Mean_Insulin = mean(Insulin),
              Mean_BMI = mean(BMI))
patient_means_by_outcome

- 0 = negative glucose test for diabetes 5 years after data collection
- 1 = positive glucose test for diabetes within 5 years of data collection

**Interpretation:**
- 97 of the 293 patients in the training set received a positive glucose test for diabetes (1) within 5 years of data collection. This implies a startling rate of diabetes development; however, it is possible that some patients were already diabetic at data collection (false negatives).
- There is a large relative difference (at least 25%) in Mean_Glucose and Mean_Insulin between patients who later tested positive (pre-diabetics) and those who did not (non-diabetics).
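
The relative difference mentioned above can be computed directly from a table of group means. The sketch below assumes one possible definition, the difference between the two group means as a percentage of the Outcome = 0 mean, and uses hypothetical numbers in `toy_means` (not the results of our analysis).

```r
library(tidyverse)

# Hypothetical group means (made-up numbers, not the results of this analysis).
toy_means <- tibble(
    Outcome      = c(0, 1),
    Mean_Glucose = c(100, 130),
    Mean_BMI     = c(30, 33)
)

# Relative difference of each mean for Outcome = 1 versus Outcome = 0,
# expressed as a percentage of the Outcome = 0 mean.
relative_differences <- toy_means |>
    arrange(Outcome) |>
    summarise(across(starts_with("Mean_"),
                     ~ 100 * (.x[2] - .x[1]) / .x[1]))
relative_differences
```

The same `summarise()` call could be applied to `patient_means_by_outcome`, since its mean columns also start with `Mean_`.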

In [None]:
patient_distribution_glucose <- diabetes_dataset_filtered_training |>
    ggplot(aes(x = Glucose)) +
    geom_histogram(binwidth = 5) +
    labs(x = "GTT: Glucose Plasma Concentration (mg/dl)",
         y = "Number of Patients") +
    theme(text = element_text(size = 11)) +
    theme(legend.position = "none")

patient_distribution_BP <- diabetes_dataset_filtered_training |>
    ggplot(aes(x = BloodPressure)) +
    geom_histogram(binwidth = 4) +
    labs(x = "Diastolic Blood Pressure (mmHg)",
         y = "Number of Patients") +
    theme(text = element_text(size = 11)) +
    theme(legend.position = "none")

patient_distribution_SkinThickness <- diabetes_dataset_filtered_training |>
    ggplot(aes(x = SkinThickness)) +
    geom_histogram(binwidth = 2) +
    labs(x = "Triceps Skin Fold Thickness (mm)",
         y = "Number of Patients") +
    theme(text = element_text(size = 11)) +
    theme(legend.position = "none")

patient_distribution_Insulin <- diabetes_dataset_filtered_training |>
    ggplot(aes(x = Insulin)) +
    geom_histogram(binwidth = 25) +
    labs(x = "2-Hour Serum Insulin (µU/mL)",
         y = "Number of Patients") +
    theme(text = element_text(size = 11)) +
    theme(legend.position = "none")

patient_distribution_BMI <- diabetes_dataset_filtered_training |>
    ggplot(aes(x = BMI)) +
    geom_histogram(binwidth = 2) +
    labs(x = "Body Mass Index (kg/m^2)",
         y = "Number of Patients") +
    theme(text = element_text(size = 11)) +
    theme(legend.position = "none")

In [None]:
grid.arrange(patient_distribution_glucose,
             patient_distribution_BP,
             patient_distribution_SkinThickness,
             patient_distribution_Insulin,
             patient_distribution_BMI,
             ncol=2)

## 3. Methods

We will build a K-nearest neighbors classifier and evaluate it on our testing dataset (25% of our entire dataset):

**Predictors:**
1. Plasma glucose concentration level at 2 hours in an oral glucose tolerance test (i.e. glucose test, mg/dl)
2. Diastolic blood pressure (mmHg)
3. Triceps skin fold thickness - a measure of body fat (mm)
4. 2-Hour serum insulin (µU/mL)
5. Body mass index (kg/m^2)

**Predicted class:** Outcome (whether the patient will receive a positive glucose test for diabetes in the next 5 years).


Our classifier will be trained on our training data (75% of our entire dataset). We will tune K using only the training dataset, and then assess the classifier's accuracy, precision, and recall by comparing its predicted outcomes with the actual outcomes in the testing dataset.
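
One way the tuning step could look in tidymodels is sketched below. Several things here are assumptions rather than decisions stated in this proposal: the use of 5-fold cross-validation on the training set, standardizing the predictors before computing distances, the candidate values of K, and the `kknn` engine (which requires the separate `kknn` package). The data are a hypothetical simulated stand-in (`toy_data`) with only two predictors, so the resulting accuracies say nothing about the real dataset.

```r
library(tidymodels)

set.seed(1000)

# Hypothetical simulated data; Outcome is loosely driven by Glucose.
toy_data <- tibble(
    Glucose = rnorm(200, mean = 120, sd = 30),
    BMI     = rnorm(200, mean = 32, sd = 6)
) |>
    mutate(Outcome = factor(if_else(Glucose + rnorm(200, sd = 20) > 130, 1, 0)))

toy_split <- initial_split(toy_data, prop = 0.75, strata = Outcome)
toy_training <- training(toy_split)

# Standardize predictors so no single variable dominates the distance.
knn_recipe <- recipe(Outcome ~ ., data = toy_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# K is left as a tuning parameter.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Cross-validation folds are drawn from the training set only.
toy_folds <- vfold_cv(toy_training, v = 5, strata = Outcome)
k_values <- tibble(neighbors = seq(from = 1, to = 21, by = 2))

knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = toy_folds, grid = k_values) |>
    collect_metrics() |>
    filter(.metric == "accuracy")
knn_results
```

The `mean` column of `knn_results` holds the estimated accuracy for each candidate K; the testing set is not touched at any point during tuning.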


We aim to create the following visualizations:
- A bar plot with outcome (0 and 1) on the x-axis and the number of patients on the y-axis, comparing the classifier's predictions against the actual observations
- A plot of the estimated classifier accuracy (y-axis) against the number of neighbors (x-axis), used to tune K

Since we are using 5 predictor variables for our classifier, it is not practical to plot the training/testing data on a single scatterplot that includes all of them (5 axes!).
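
The planned bar plot of predicted versus actual outcome counts could be built roughly as follows. This is a sketch only: `toy_predictions` holds hypothetical outcomes for ten imaginary test patients, and the column name `.pred_class` is assumed to match what `predict()` returns for a tidymodels classifier.

```r
library(tidyverse)

# Hypothetical actual and predicted outcomes for ten test patients.
toy_predictions <- tibble(
    Outcome     = factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)),
    .pred_class = factor(c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0))
)

# Reshape so each row is one (source, outcome) pair, then count.
outcome_counts <- toy_predictions |>
    pivot_longer(cols = c(Outcome, .pred_class),
                 names_to = "source",
                 values_to = "outcome") |>
    mutate(source = if_else(source == "Outcome", "Actual", "Predicted")) |>
    count(source, outcome)

# Side-by-side bars for actual and predicted counts under each outcome.
outcome_bar_plot <- outcome_counts |>
    ggplot(aes(x = outcome, y = n, fill = source)) +
    geom_col(position = "dodge") +
    labs(x = "Outcome", y = "Number of Patients", fill = "Source")
outcome_bar_plot
```

Note that this plot compares totals only; matching totals do not guarantee that individual patients were classified correctly, which is why accuracy, precision, and recall are assessed separately.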

## 4. Expected Outcomes and Significance


We start from the assumption that it is feasible to predict diabetes from clinical data up to 5 years before a positive test. We therefore expect that our dataset, which covers a subset of commonly collected clinical measurements, can be used to train a predictor (classifier).

Since diabetes is not currently curable, predicting it early can be very valuable to potential patients. This makes it a case in which a type 1 error (a false positive) is more acceptable than a type 2 error (a false negative). Although we have no specific expectation for the accuracy of our predictor, we aim to reduce type 2 errors while maximizing accuracy.
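
To make the link between error types and our evaluation metrics concrete, the sketch below counts each kind of error by hand on hypothetical outcomes (`actual` and `predicted` are made-up vectors, with 1 meaning a positive test). Type 2 errors lower recall, while type 1 errors lower precision.

```r
# Hypothetical actual and predicted outcomes (1 = positive test for diabetes).
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

true_positives  <- sum(predicted == 1 & actual == 1)
false_positives <- sum(predicted == 1 & actual == 0)  # type 1 errors
false_negatives <- sum(predicted == 0 & actual == 1)  # type 2 errors
true_negatives  <- sum(predicted == 0 & actual == 0)

accuracy  <- (true_positives + true_negatives) / length(actual)
precision <- true_positives / (true_positives + false_positives)
recall    <- true_positives / (true_positives + false_negatives)

c(accuracy = accuracy, precision = precision, recall = recall)
```

In this toy example there are two type 1 errors and one type 2 error, giving an accuracy of 0.7, a precision of 0.6, and a recall of 0.75. Preferring fewer type 2 errors corresponds to favoring recall, even at some cost to precision.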

As stated earlier, creating prediction algorithms and identifying effective predictors in the process can be very helpful, as they give potential patients valuable time to take precautions before reaching an irreversible point. This knowledge also narrows the direction of further investigation, such as clarifying the causal relationships among predictors and between predictors and diabetes.