# Predicting Diabetes in the Pima Indian Female Population Based on Modifiable Factors

Ensure that the following packages are installed before proceeding:

In [3]:
# package installation commands are commented out for convenience;
# uncomment any line whose package is not yet installed.

#install.packages("tidyverse")
#install.packages("tidymodels")
#install.packages("gridExtra")
#install.packages("repr")
#install.packages("kknn")
#install.packages("cowplot")
#install.packages("shiny")

In [2]:
# set seed and load necessary packages

set.seed(1000)

library("tidyverse")
library("tidymodels")
library("gridExtra")
library("repr")
library("kknn")
library("cowplot")
library("shiny")


# Introduction

Diabetes mellitus is a serious disease that causes severe health complications such as heart failure; the main associated cause of death is coronary heart disease ([Das, 2014](https://doi.org/10.2174/1876524601407010005)). Our project attempts to predict the development of diabetes mellitus from modifiable measures of health (e.g., BMI), specifically in a population of adult female patients (aged 21 and above) of Pima Indian descent. The Pima, a Native American people, have high rates of diabetes, suggesting a potential genetic predisposition to insulin resistance; however, many other factors can play a role in the pathogenesis of diabetes. Known risk factors for this disorder include parental diabetes, obesity, and genetic components ([Das, 2014](https://doi.org/10.2174/1876524601407010005)). A high prevalence of diabetes mellitus is not only a severe health issue but also places a significant strain on the healthcare system ([Krishnamoorthy et al., 2022](https://doi.org/10.1007/s13300-022-01329-6)).

To predict the development of diabetes mellitus, we aim to train a K-nearest neighbors classifier on a dataset of Pima Indian female patients who were monitored longitudinally for the onset of diabetes. This `Diabetes Dataset` was uploaded to Kaggle by user Mehmet Akturk and sourced from the National Institute of Diabetes and Digestive and Kidney Diseases ([Smith et al., 1988](https://www.kaggle.com/datasets/mathchi/diabetes-data-set?fbclid=IwAR1DMzdJFDxoEqLDIZNTi3j7YJXTx_7BJwCl7sbn8syQKbQCnHfMtlsKH1E)). The study population consisted of adult female patients (at least 21 years of age) of Pima Indian heritage (n = 768), living near Phoenix, Arizona (USA). Researchers collected the following data:
 1. `Pregnancies`: Number of times pregnant
 2. `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (i.e., glucose test, mg/dl)
 3. `BloodPressure`: Diastolic blood pressure (mmHg)
 4. `SkinThickness`: Triceps skin fold thickness (mm) (measure of body fat)
 5. `Insulin`: 2-Hour serum insulin (µU/mL)
 6. `BMI`: Body mass index (kg/m^2)
 7. `DiabetesPedigreeFunction`: Diabetes pedigree function (probability of diabetes based on family history) 
 8. `Age`: Age (years)
 9. `Outcome`: 0 = glucose test negative for diabetes 5+ years after data collection, 1 = glucose test positive for diabetes within 5 years of data collection

Diabetes was diagnosed when plasma glucose concentration exceeded 200 mg/dl at 2 hours in an oral glucose tolerance test. All patients had a negative glucose test for diabetes at initial data collection ([Smith et al., 1988](https://www.kaggle.com/datasets/mathchi/diabetes-data-set?fbclid=IwAR1DMzdJFDxoEqLDIZNTi3j7YJXTx_7BJwCl7sbn8syQKbQCnHfMtlsKH1E)). For our classifier, we narrowed our scope to the 5 modifiable and reversible variables in the dataset, which allows doctors to focus on lifestyle factors within the patients' control.

In diabetes mellitus, the body loses the ability to produce or respond to insulin effectively, so those with low levels of insulin but high levels of glucose are subject to the condition ([Department of Health & Human Services, 2004](https://www.betterhealth.vic.gov.au/health/conditionsandtreatments/diabetes-and-insulin)). Diabetes mellitus is directly responsible for hyperglycemia (high blood glucose levels). If hyperglycemia is left untreated, severe health issues affecting the eyes, kidneys, nerves, and heart may occur and require emergency care ([Mayo Clinic Staff, 2022](https://www.mayoclinic.org/diseases-conditions/hyperglycemia/symptoms-causes/syc-20373631)). The disease can also damage the kidneys, resulting in salt and water retention and high blood pressure; high blood pressure can therefore be a key indicator of diabetes ([NewYork-Presbyterian Hospital, 2023](https://www.nyp.org/diabetes-and-endocrinology/diabetes-resource-center/diabetes-and-hypertension#:~:text=%E2%80%9CDiabetes%20causes%20damage%20by%20scarring,contribute%20to%20high%20blood%20pressure.%E2%80%9D)).

Obese individuals with a high BMI tend to exhibit insulin resistance and are therefore at a higher risk of developing diabetes mellitus. Skin fold thickness is a measure of body fat and is used to identify those at risk of type 2 diabetes, which is linked to obesity. Measuring it is non-invasive and accessible to those who may not have access to other forms of testing ([Ruiz-Alejos et al., 2020](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6960014/#:~:text=Bicipital%20and%20subscapular%20skinfolds%20were,fold%20risk%20for%20developing%20HT.)). Notably, insulin resistance caused by obesity can be reversed with adequate weight loss ([Klein et al., 2021](https://pubmed.ncbi.nlm.nih.gov/34986330/#:~:text=The%20cellular%20and%20physiological%20mechanisms,normalized%20with%20adequate%20weight%20loss.)).



**Research Question:**
Can our K-nearest neighbors classifier predict the onset of diabetes within a five-year time frame (`Outcome`) based on the 5 modifiable measures of health below with a high degree of accuracy, precision, and recall (>75%)?
 1. `Glucose`: Plasma glucose concentration level at 2 hours in an oral glucose tolerance test (i.e., glucose test, mg/dl)
 2. `BloodPressure`: Diastolic blood pressure (mmHg)
 3. `SkinThickness`: Triceps skin fold thickness (mm) (measure of body fat)
 4. `Insulin`: 2-Hour serum insulin (µU/mL)
 5. `BMI`: Body mass index (kg/m^2)

# Methods

Our classifier was trained to predict diabetes development within the next 5 years (i.e., pre-diabetes) using K-nearest neighbors analysis. First, all non-modifiable/irreversible variables were removed from our dataset, and rows with missing measurements (recorded as 0) were filtered out. The filtered dataset (n = 392) was split into training (75% of the data) and testing (25% of the data) sets. The training data was used for exploratory analysis of the dataset: the mean of each variable in the non-diabetic and pre-diabetic groups was calculated, and the distribution of each variable was visualized. To select the appropriate value of K (number of neighbors) for our classifier, the training data was split into 10 folds for cross-validation, and the model was tuned on these folds. Accuracy estimates were visualized for 20 values of K between 1 and 100 (K = 1, 6, 11, ..., 96), and K = 21, which gave the highest accuracy estimate, was chosen for the classifier. The classifier was then trained using the chosen K, with every variable in the filtered dataset other than `Outcome` used as a predictor of impending diabetes development (`Outcome`). The classifier was evaluated on the testing set, and the resulting predictions were compared to the actual outcomes in the study as a confusion matrix. The proportions of true negative, false negative, true positive, and false positive predictions by our classifier were calculated from the confusion matrix and visualized. Accuracy, precision, and recall of the classifier were all calculated from the confusion matrix to evaluate the quality of classifier predictions.

Data wrangling and cleaning were performed using the `tidyverse` package. Visualization was performed using the `tidyverse`, `cowplot`, and `gridExtra` packages. The `tidymodels` package was used to split the dataset into training and testing sets, build and train the classifier, tune K, and create the confusion matrix.
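As a rough illustration of what K-nearest neighbors classification does, the base-R sketch below standardizes the predictors, measures Euclidean distances, and takes an unweighted majority vote among the K closest training points. It is not the `kknn` implementation used in this analysis; the function `knn_predict` and the toy data are hypothetical and exist only for this example.

```r
# Minimal K-nearest neighbors sketch in base R (illustrative only; the
# analysis itself uses tidymodels with the kknn engine).
knn_predict <- function(train_x, train_y, new_x, k) {
  # Standardize predictors using the training means and standard deviations
  centers <- colMeans(train_x)
  scales <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = centers, scale = scales)
  new_z <- scale(new_x, center = centers, scale = scales)

  apply(new_z, 1, function(point) {
    # Euclidean distance from the new point to every training point
    distances <- sqrt(rowSums(sweep(train_z, 2, point)^2))
    nearest <- order(distances)[seq_len(k)]
    # Unweighted majority vote; ties go to the first factor level
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

# Hypothetical toy data: two predictors, two well-separated classes
toy_x <- rbind(c(90, 22), c(95, 24), c(100, 23),
               c(160, 35), c(170, 38), c(165, 36))
colnames(toy_x) <- c("Glucose", "BMI")
toy_y <- factor(c(0, 0, 0, 1, 1, 1))

new_patients <- rbind(c(92, 23), c(168, 37))
colnames(new_patients) <- c("Glucose", "BMI")

toy_predictions <- knn_predict(toy_x, toy_y, new_patients, k = 3)
toy_predictions
```

An odd K avoids tied votes in a two-class problem, which is one reason odd values are commonly preferred.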

# Analysis and Results

## Importing and Tidying Data

We will read in `diabetes.csv` from the `DSCI_100_Diabetes_Prediction` repository on GitHub. Here we will inspect the dataset variables and dimensions.

In [None]:
# read in data
diabetes_dataset <- read_csv("https://raw.githubusercontent.com/hesoru/DSCI_100_Diabetes_Prediction/main/Dataset/diabetes.csv")

# view dataset variables and tibble dimensions
colnames(diabetes_dataset)
dim(diabetes_dataset)

Dataset variables are listed above, and our unfiltered dataset has a sample size of n = 768.

Next we will remove the irreversible or non-modifiable variables from our classifier: `Pregnancies` (not a reversible variable), `DiabetesPedigreeFunction` (probability of diabetes based on family history), and `Age`.

The dataset has no missing (`NA`) values; however, some cells have a value equal to 0. `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` cannot have a reading of 0 in practice, so cells with a value of 0 in these columns will be treated as missing and the corresponding rows filtered out.

In [None]:
diabetes_dataset_filtered <- diabetes_dataset |>
    select(-Pregnancies, -DiabetesPedigreeFunction, -Age) |>
    filter(Glucose != 0, BloodPressure != 0, SkinThickness != 0, Insulin != 0, BMI != 0)

# view dataset variables and tibble dimensions
colnames(diabetes_dataset_filtered)
dim(diabetes_dataset_filtered)

Note that we have filtered out approximately half of the patients in our dataset, with a final sample size of n = 392.
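To see which measurements account for the dropped rows, one option is to recode the impossible zeros as `NA` and count them per column before filtering. The sketch below does this in base R on a small hypothetical data frame (`toy`), not on the real dataset; keeping only complete rows afterwards is equivalent to the `!= 0` filter used above.

```r
# Hypothetical toy rows standing in for diabetes_dataset
toy <- data.frame(
  Glucose = c(148, 0, 183, 89),
  BloodPressure = c(72, 66, 0, 66),
  SkinThickness = c(35, 29, 0, 23),
  Insulin = c(0, 94, 168, 94),
  BMI = c(33.6, 26.6, 23.3, 28.1),
  Outcome = c(1, 0, 1, 0)
)
measure_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Recode physiologically impossible zeros as NA, then count them per column
toy_na <- toy
toy_na[measure_cols][toy_na[measure_cols] == 0] <- NA
missing_per_column <- colSums(is.na(toy_na[measure_cols]))
missing_per_column

# Keep only rows with a complete set of measurements
toy_complete <- toy_na[complete.cases(toy_na[measure_cols]), ]
nrow(toy_complete)
```

`Outcome` is deliberately left out of `measure_cols`, since 0 is a valid value there.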

## Splitting Data into Training and Testing Sets

Now that our data is tidy, we will split our filtered data (n = 392) into training (75% of the data) and testing (25% of the data) sets, and convert the categorical variable `Outcome` into the factor data type.
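The cell that follows stratifies the split on `Outcome`, which keeps the proportion of pre-diabetic patients similar in both sets. The base-R sketch below shows the idea on hypothetical data by sampling 75% of the rows within each outcome class; it is a simplified illustration and not necessarily how `initial_split()` works internally.

```r
set.seed(1000)

# Hypothetical outcome vector with one third positives
toy <- data.frame(id = 1:60, Outcome = rep(c(0, 0, 1), times = 20))

# Sample 75% of the row indices within each outcome class
rows_by_class <- split(seq_len(nrow(toy)), toy$Outcome)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.75 * length(rows)))]
}))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# The positive rate is the same in both sets
mean(toy_train$Outcome)
mean(toy_test$Outcome)
```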

In [None]:
set.seed(1000)

diabetes_dataset_filtered_split <- initial_split(data = diabetes_dataset_filtered,
                                                 prop = 0.75,
                                                 strata = Outcome)
training_filtered <- training(diabetes_dataset_filtered_split) |>
    mutate(Outcome = as_factor(Outcome))
testing_filtered <- testing(diabetes_dataset_filtered_split) |>
    mutate(Outcome = as_factor(Outcome))

## Exploration of Training Data

First, we will find the mean of each variable in the groups that did (`Outcome` = 1) and did not (`Outcome` = 0) develop diabetes within 5 years of data collection.

In [None]:
patient_means_by_outcome <- training_filtered |>
    group_by(Outcome) |>
    summarise(Patients = n(),
              Mean_Glucose = mean(Glucose),
              Mean_BP = mean(BloodPressure),
              Mean_SkinThickness = mean(SkinThickness),
              Mean_Insulin = mean(Insulin),
              Mean_BMI = mean(BMI))
patient_means_by_outcome

**Table 1. Mean health measurements in Pima Indian female patients that did (`Outcome` = 1) and did not (`Outcome` = 0) develop diabetes within 5 years of data collection.**

**Interpretation:**
- 97/293 patients (33.1%) received a positive glucose test for diabetes (`Outcome` = 1) within 5 years of data collection. This implies a startling rate of diabetes development! However, it is likely that a large number of non-diabetics were filtered out when tidying our data. It is also possible that some patients were already diabetic at data collection (false negatives).
- There is a large relative difference (at least 25%) in `Mean_Glucose` and `Mean_Insulin` between pre-diabetics and non-diabetics.
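The relative differences mentioned above can be computed directly from a table of group means. The sketch below assumes the difference is expressed relative to the non-diabetic mean, which is our reading rather than something stated explicitly; the numbers in `toy_means` are hypothetical and are not the values from Table 1.

```r
# Hypothetical group means (not the values from Table 1)
toy_means <- data.frame(
  Outcome = c(0, 1),
  Mean_Glucose = c(110, 145),
  Mean_BMI = c(32, 36)
)

# Relative difference of the pre-diabetic mean from the non-diabetic mean
# (assumed baseline: the non-diabetic group)
relative_difference <- function(non_diabetic, pre_diabetic) {
  (pre_diabetic - non_diabetic) / non_diabetic
}

mean_cols <- c("Mean_Glucose", "Mean_BMI")
rel_diff <- sapply(toy_means[mean_cols], function(column) {
  relative_difference(column[toy_means$Outcome == 0],
                      column[toy_means$Outcome == 1])
})
round(rel_diff, 3)
```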

Next we plotted the distributions of each variable in the groups that did (pre-diabetic) and did not (non-diabetic) develop diabetes within 5 years of data collection.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 7)

patient_distribution_glucose <- training_filtered |>
    ggplot(aes(x = Glucose)) +
    geom_histogram(binwidth = 5, aes(fill = Outcome)) +  # binwidth overrides bins, so bins is dropped
    labs(x = "Glucose Plasma Concentration (mg/dl)",
         y = "Patients") +
    theme(text = element_text(size = 10)) +
    theme(legend.position = "none")

patient_distribution_BP <- training_filtered |>
    ggplot(aes(x = BloodPressure)) +
    geom_histogram(binwidth = 4, aes(fill = Outcome)) +  # binwidth overrides bins, so bins is dropped
    labs(x = "Diastolic Blood Pressure (mmHg)",
         y = "Patients") +
    theme(text = element_text(size = 10)) +
    theme(legend.position = "none")

patient_distribution_SkinThickness <- training_filtered |>
    ggplot(aes(x = SkinThickness)) +
    geom_histogram(binwidth = 2, aes(fill = Outcome)) +  # binwidth overrides bins, so bins is dropped
    labs(x = "Triceps Skin Fold Thickness (mm)",
         y = "Patients") +
    theme(text = element_text(size = 10)) +
    theme(legend.position = "none")

patient_distribution_Insulin <- training_filtered |>
    ggplot(aes(x = Insulin)) +
    geom_histogram(binwidth = 25, aes(fill = Outcome)) +  # binwidth overrides bins, so bins is dropped
    labs(x = "2-Hour Serum Insulin (µU/mL)",
         y = "Patients") +
    theme(text = element_text(size = 10)) +
    theme(legend.position = "none")

patient_distribution_BMI <- training_filtered |>
    ggplot(aes(x = BMI)) +
    geom_histogram(binwidth = 2, aes(fill = Outcome)) +  # binwidth overrides bins, so bins is dropped
    labs(x = "Body Mass Index (kg/m^2)",
         y = "Patients") +
    theme(text = element_text(size = 10)) +
    theme(legend.position = "none")

# plot created purely for its legend - plot will remain unvisualized
outcome_legend <- training_filtered |>
    ggplot(aes(x = BMI)) +
    geom_histogram(binwidth = 2, aes(fill = Outcome)) +  # binwidth overrides bins, so bins is dropped
    labs(x = "Body Mass Index (kg/m^2)",
         y = "Patients") +
    scale_fill_discrete(labels=c('Non-Diabetic', 'Pre-Diabetic')) +
    theme(text = element_text(size = 10)) 
# grab legend from outcome_legend
legend <- cowplot::get_legend(outcome_legend)

In [None]:
# plot all distributions on a grid
grid.arrange(patient_distribution_glucose,
             patient_distribution_BP,
             patient_distribution_SkinThickness,
             patient_distribution_Insulin,
             patient_distribution_BMI,
             legend,
             ncol=3)

**Figure 1. Distributions of health measurements in Pima Indian female patients that did (pre-diabetic) and did not (non-diabetic) develop diabetes within 5 years of data collection.** 

**Interpretation:**
- The distributions for glucose plasma concentration and 2-hour serum insulin appear to have different centers for non-diabetics and pre-diabetics. This is consistent with **Table 1**, which shows that the means for glucose plasma concentration and 2-hour serum insulin differ noticeably between non-diabetics and pre-diabetics (statistical significance was not tested).

## Selecting K for the Classifier

First we will set up the model recipe and the K-nearest neighbor classification specification, split the training data into 10 folds for cross-validation, and set the K-values to test. We chose to tune our classifier over 20 values of K between 1 and 100, in steps of 5 (K = 1, 6, 11, ..., 96).
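The recipe below scales and centers every predictor because K-nearest neighbors relies on distances, and a predictor with a large numeric range would otherwise dominate them. The base-R sketch below illustrates this with four hypothetical patients: before standardization the nearest neighbor of patient 1 is decided almost entirely by `Insulin`, and after standardization it changes.

```r
# Hypothetical patients: Insulin spans hundreds of units, BMI only tens
toy <- data.frame(
  Insulin = c(80, 400, 90, 120),
  BMI = c(22, 23, 40, 30)
)

# Raw Euclidean distances from patient 1 are dominated by Insulin
raw_dist <- as.matrix(dist(toy))[1, ]

# scale() centers each column to mean 0 and scales it to standard deviation 1
toy_std <- scale(toy)
std_dist <- as.matrix(dist(toy_std))[1, ]

# Position 1 is patient 1 itself; position 2 is its nearest neighbor
raw_nearest <- order(raw_dist)[2]
std_nearest <- order(std_dist)[2]
c(raw = raw_nearest, standardized = std_nearest)
```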

In [None]:
set.seed(1000)

knn_recipe <- recipe(Outcome ~ ., data = training_filtered) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

vfold <- vfold_cv(training_filtered, v = 10, strata = Outcome)

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

Now we will calculate and visualize the cross-validation accuracy of our model for each tested value of K.

In [None]:
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = vfold, grid = k_vals) |>
  collect_metrics() 
                 
accuracies <- knn_results |>
    filter(.metric =="accuracy")

In [None]:
cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") + 
  theme(text = element_text(size = 15))
cross_val_plot

**Figure 2. Estimated accuracy of model plotted against number of neighbors for K-nearest neighbors classification.**

Setting the number of neighbors to K = 21 provides the highest estimated accuracy, thus we will move forward using this K-value for the classification.
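Rather than reading the best K off the plot, it can also be extracted programmatically. The sketch below uses a hypothetical data frame, `toy_accuracies`, with the same `neighbors` and `mean` columns as `accuracies`; the accuracy values are made up for illustration and are not our cross-validation results.

```r
# Made-up cross-validation results on the same grid of K values
toy_accuracies <- data.frame(
  neighbors = seq(from = 1, to = 100, by = 5),
  mean = c(0.68, 0.72, 0.74, 0.75, 0.77, 0.76, 0.76, 0.75, 0.75, 0.74,
           0.74, 0.73, 0.73, 0.72, 0.72, 0.71, 0.71, 0.70, 0.70, 0.69)
)

# which.max() returns the first maximum, so ties resolve to the smallest K
best_row <- toy_accuracies[which.max(toy_accuracies$mean), ]
best_k <- best_row$neighbors
best_k
```

Note that `seq(from = 1, to = 100, by = 5)` stops at 96, so the grid contains 20 candidate values.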

## Training the Classifier

Now we will combine our recipe, K-nearest neighbor classification specifications (with our newly determined K), and training data to train our model.

In [None]:
set.seed(1000)

# our knn_recipe was defined previously when tuning the classifier

# model specification using the K chosen by cross-validation
diabetes_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 21) |> 
    set_engine("kknn") |>
    set_mode("classification")

# fit the recipe and model to the training data
diabetes_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(diabetes_spec) |>
    fit(data = training_filtered)

## Evaluating the Classifier's Performance

We will build a confusion matrix comparing our classifier's predictions of `Outcome` against the actual results.

In [None]:
set.seed(1000)

# predict Outcome for the testing set and keep the true labels alongside
diabetes_predictions <- predict(diabetes_fit, testing_filtered) |>
    bind_cols(testing_filtered)

# accuracy on the testing set
diabetes_metrics <- diabetes_predictions |>
    metrics(truth = Outcome, estimate = .pred_class) |>
    filter(.metric == "accuracy")

diabetes_conf_mat <- diabetes_predictions |>
    conf_mat(truth = Outcome, estimate = .pred_class)
diabetes_conf_mat

**Table 2. Confusion matrix comparing K-nearest neighbor classifier predictions (Prediction) against actual results (Truth).** Our classifier predicted whether patients will (`1`) or will not (`0`) develop diabetes within 5 years of data collection. Predictions were performed on our testing dataset (n = 99) of adult Pima Indian female patients. 

We can calculate accuracy, precision, and recall of our classifier using the data from our confusion matrix.

## $ \textrm{accuracy} = \frac{\textrm{number of correct predictions}}{\textrm{total number of predictions}} = \frac{\textrm{61 + 14}}{\textrm{61 + 19 + 5 + 14}} = 0.7576 $

## $ \textrm{precision} = \frac{\textrm{number of correct positive predictions}}{\textrm{total number of positive predictions}} = \frac{\textrm{14}}{\textrm{14 + 5}} = 0.7368 $

## $ \textrm{recall} = \frac{\textrm{number of correct positive predictions}}{\textrm{total number of positive test set observations}} = \frac{\textrm{14}}{\textrm{14 + 19}} = 0.4242 $
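The same arithmetic can be checked in R. The sketch below enters the four counts from the equations above by hand into a matrix named `conf_counts` (a name introduced here for illustration), with predictions in rows and true outcomes in columns, and recomputes the three metrics.

```r
# Counts from the equations above (rows = prediction, columns = truth)
conf_counts <- matrix(
  c(61, 5, 19, 14),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1"))
)

true_neg <- conf_counts["0", "0"]
false_pos <- conf_counts["1", "0"]
false_neg <- conf_counts["0", "1"]
true_pos <- conf_counts["1", "1"]

accuracy <- (true_pos + true_neg) / sum(conf_counts)
precision <- true_pos / (true_pos + false_pos)
recall <- true_pos / (true_pos + false_neg)

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```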

The confusion matrix can also be written in terms of the number of true negative, false negative, true positive, and false positive results by the classifier.

In [None]:
conf_mat_table <- data.frame(T_or_F = c("True", "False", "True", "False"),
                             Neg_or_Pos = c("Negative (Non-Diabetic)", "Negative (Non-Diabetic)",
                                            "Positive (Pre-Diabetic)", "Positive (Pre-Diabetic)"),
                             Count = c(61, 19, 14, 5)) |>
                             mutate(Proportion = Count/99)

These proportions are visualized below.

In [None]:
# stacked bars: true vs. false predictions within each predicted class
conf_mat_plot <- ggplot(conf_mat_table, aes(x = Neg_or_Pos, y = Proportion, fill = T_or_F)) +
       geom_bar(stat = "identity") +
       labs(x = "Negative or Positive Predictions",
            y = "Proportion of Total Predictions",
            fill = "Prediction") +
       theme(text = element_text(size = 17))
conf_mat_plot

**Figure 3. Evaluation of K-nearest neighbor classifier predictions.** Our classifier predicted whether patients will (Pre-Diabetic) or will not (Non-Diabetic) develop diabetes within 5 years of data collection. The proportion of true negative, false negative, true positive, and false positive predictions by our classifier was determined after classifying our testing dataset (n = 99) of adult Pima Indian female patients. 

# Discussion

This study found that a K-nearest neighbor classification algorithm trained on the 5 selected predictors (glucose plasma concentration, diastolic blood pressure, triceps skin thickness, 2-hour serum insulin, and body mass index) can achieve a relatively high accuracy (75.76%) and precision (73.68%) in identifying pre-diabetic patients as early as 5 years before diagnosis. This finding is expected: during training data exploration there was an observable difference in some of the selected predictors (glucose plasma concentration and 2-hour serum insulin) between non-diabetic and pre-diabetic patients. Although the statistical significance of these differences was not assessed, we hypothesized that such differences would be sufficient to facilitate classification.

Past studies also suggest a correlation between the selected predictors and diabetes mellitus. Several studies have successfully used blood pressure to predict diabetes ([Mbanya et al., 2016](https://doi.org/10.1111/jch.12774)), even among normoglycemic patients ([Edeoga et al., 2017](https://doi.org/10.1016/j.jdiacomp.2017.07.019)). Although many studies have focused primarily on gestational diabetes, the correlation between body fat percentage and diabetes is apparent ([Singh et al., 2023](https://doi.org/10.1016/j.jogc.2023.04.026); [Nassr et al., 2018](https://doi.org/10.1016/j.ejogrb.2018.07.001)). Skin thickness can be used to predict body fat percentage, and the accuracy of this method has been confirmed ([Jayawardena et al., 2020](https://doi.org/10.1016/j.dsx.2020.02.003)). Similarly, high levels of insulin are strongly associated with pre-diabetes ([Quan et al., 2021](https://doi.org/10.1007/s13410-021-00983-z)), and body mass index has been suggested as a predictor of whether a pre-diabetic patient develops diabetes ([He et al., 2022](https://doi.org/10.1111/jdi.13777)).

Measured against the 75% target in our research question, only accuracy (75.76%) met the threshold. The classifier's precision of 73.68% means that roughly three in four positive (pre-diabetic) predictions were correct, so a positive result can generally be trusted. The classifier's accuracy still has room for improvement: a classifier that labelled every patient as non-diabetic would achieve an accuracy of 66.67% on the testing set, which is not much lower than our accuracy of 75.76%. Recall, on the other hand, fell far below our expectations at 42.42%, meaning that over half of pre-diabetic patients were missed (false negatives). These pre-diabetic patients would remain undiagnosed and without medical intervention, with consequences for both the patient and the medical system. Despite this, the classifier is still an improvement on current practice, which diagnoses patients only when they are already diabetic (0% recall). Improving the accuracy and recall of our classifier could be the subject of future research: we suggest adding other predictors, such as the non-modifiable factors in the dataset (pregnancies, age, diabetes pedigree function) that we excluded from our own analysis.
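The 66.67% baseline quoted above follows from the testing-set counts: a classifier that always predicts "non-diabetic" is correct exactly for the patients who truly did not develop diabetes. A short check in R, using the counts from the confusion matrix (variable names are introduced here for illustration):

```r
# Testing-set counts from the confusion matrix
true_neg <- 61
false_pos <- 5
false_neg <- 19
true_pos <- 14
total <- true_neg + false_pos + false_neg + true_pos

# Always predicting "non-diabetic" is right for every truly negative patient
baseline_accuracy <- (true_neg + false_pos) / total
classifier_accuracy <- (true_neg + true_pos) / total

round(c(baseline = baseline_accuracy,
        classifier = classifier_accuracy,
        gain = classifier_accuracy - baseline_accuracy), 4)
```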

This study combines diabetes biomarkers identified in past studies into a practical classifier. In the pre-diabetic stage, diabetes can be prevented with lifestyle modifications and medical intervention. Our classifier could potentially give pre-diabetic patients valuable extra time to receive treatment and adapt to a different lifestyle. It is important to note that our classifier was trained on the female adult Pima Indian population and should not be considered representative of the global population. Further studies could assess the extent to which these findings can be extrapolated to the broader population. Additionally, replicating this study with a broader population would allow a more generalized algorithm to be developed, which would assist in the identification of pre-diabetics around the world.

# References

Das, R. N. (2014). Determinants of diabetes mellitus in the Pima Indian mothers and Indian Medical Students. *The Open Diabetes Journal*, 7(1), 5–13. https://doi.org/10.2174/1876524601407010005

Department of Health & Human Services. (2004, September 15). Diabetes and insulin. Better Health Channel. https://www.betterhealth.vic.gov.au/health/conditionsandtreatments/diabetes-and-insulin

Edeoga, C., Owei, I., Siwakoti, K., Umekwe, N., Ceesay, F., Wan, J., & Dagogo-Jack, S. (2017). Relationships between blood pressure and blood glucose among offspring of parents with type 2 diabetes: Prediction of incident dysglycemia in a biracial cohort. *Journal of Diabetes and Its Complications*, 31(11), 1580–1586. https://doi.org/10.1016/j.jdiacomp.2017.07.019

He, Y., Feng, Y., Shi, J., Tang, H., Chen, L., & Lou, Q. (2022). β-cell function and body mass index are predictors of exercise response in elderly patients with prediabetes. *Journal of Diabetes Investigation*, 13(7), 1253–1261. https://doi.org/10.1111/jdi.13777

Jayawardena, R., Waniganayake, Y. C., Abhayaratna, S. A., & Ranasinghe, P. (2020). Prediction of body fat in Sri Lankan adults: Development and validation of a skinfold thickness equation. *Diabetes & Metabolic Syndrome: Clinical Research & Reviews*, 14(2), 147–150. https://doi.org/10.1016/j.dsx.2020.02.003

Klein, S., Gastaldelli, A., Yki-Järvinen, H., & Scherer, P. (2021, January 4). Why does obesity cause diabetes?. Cell metabolism. https://pubmed.ncbi.nlm.nih.gov/34986330/#:~:text=The%20cellular%20and%20physiological%20mechanisms,normalized%20with%20adequate%20weight%20loss.


Krishnamoorthy, Y., Rajaa, S., Verma, M., Kakkar, R., & Kalra, S. (2022). Spatial patterns and determinants of diabetes mellitus in Indian adult population: A secondary data analysis from nationally representative surveys. *Diabetes Therapy*, 14(1), 63–75. https://doi.org/10.1007/s13300-022-01329-6

Mayo Clinic Staff. (2022, August 20). Hyperglycemia in diabetes. Mayo Clinic. https://www.mayoclinic.org/diseases-conditions/hyperglycemia/symptoms-causes/syc-20373631

Mbanya, V. N., Mbanya, J., Kufe, C., & Kengne, A. P. (2016). Effects of single and multiple blood pressure measurement strategies on the prediction of prevalent screen-detected diabetes mellitus: A population-based survey. *The Journal of Clinical Hypertension*, 18(9), 864–870. https://doi.org/10.1111/jch.12774

Nassr, A. A., Shazly, S. A., Trinidad, M. C., El-Nashar, S. A., Marroquin, A. M., & Brost, B. C. (2018). Body fat index: A novel alternative to body mass index for prediction of gestational diabetes and hypertensive disorders in pregnancy. *European Journal of Obstetrics & Gynecology and Reproductive Biology*, 228, 243–248. https://doi.org/10.1016/j.ejogrb.2018.07.001

NewYork-Presbyterian Hospital. (2023). Diabetes and hypertension - Diabetes Resource Center: NewYork-Presbyterian. NewYork-Presbyterian. https://www.nyp.org/diabetes-and-endocrinology/diabetes-resource-center/diabetes-and-hypertension#:~:text=%E2%80%9CDiabetes%20causes%20damage%20by%20scarring,contribute%20to%20high%20blood%20pressure.%E2%80%9D

Quan, H., Fang, T., Lin, L., Lin, L., Ou, Q., Zhang, H., Chen, K., & Zhou, Z. (2021). Effects of fasting proinsulin/fasting insulin, proinsulin/insulin, vitamin D3, and waistline on diabetes prediction among the Chinese Han population. *International Journal of Diabetes in Developing Countries*, 42(2), 218–226. https://doi.org/10.1007/s13410-021-00983-z

Ruiz-Alejos, A., Carrillo-Larco, R. M., Miranda, J. J., Gilman, R. H., Smeeth, L., & Bernabé-Ortiz, A. (2020, January). Skinfold thickness and the incidence of type 2 diabetes mellitus and hypertension: An analysis of the Peru Migrant Study. Public health nutrition. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6960014/#:~:text=Bicipital%20and%20subscapular%20skinfolds%20were,fold%20risk%20for%20developing%20HT.

Singh, D., Mittal, P., Bachani, S., Mukherjee, B., Mittal, M. K., & Suri, J. (2023). Ultrasonographic assessment of body fat index for prediction of gestational diabetes mellitus and neonatal complications. *Journal of Obstetrics and Gynaecology Canada*, 45(11), 102177. https://doi.org/10.1016/j.jogc.2023.04.026

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In *Proceedings of the Symposium on Computer Applications and Medical Care*, 261–265. https://www.kaggle.com/datasets/mathchi/diabetes-data-set?fbclid=IwAR1DMzdJFDxoEqLDIZNTi3j7YJXTx_7BJwCl7sbn8syQKbQCnHfMtlsKH1E