# Dinh et al. (2019)

## A data-driven approach to predicting diabetes and cardiovascular disease with machine learning

URL: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-0918-5


## Brief Summary

Dinh et al. (2019) use several ML models (logistic regression, support vector machines, random forest, and gradient boosting) on the NHANES dataset to predict i) diabetes and ii) cardiovascular disease ("CVD").

**Goal**: An identification mechanism for patients at risk of diabetes and cardiovascular disease, and the key contributors to diabetes.

**Results**:

Best scores:

- CVD prediction based on 131 NHANES variables achieved an AU-ROC score of 83.9%.
- Diabetes prediction based on 123 NHANES variables achieved an AU-ROC score of 95.7%.
- Pre-diabetic prediction based on 123 NHANES variables achieved an AU-ROC score of 84.4%.
- Top 5 predictors in diabetes patients were 1) `waist size`, 2) `age`, 3) `self-reported weight`, 4) `leg length`, 5) `sodium intake`.



This notebook replicates the results of the paper. It is structured in the following steps:

1. NHANES data 
2. Pre-processing of the data
3. Transformation of the data
4. Train/Test Split 
5. 10-fold cross-validation (CV)
6. Training monitoring using MLflow
7. Get metric results (AUC)


The structure of the analysis follows Figure 1 from the paper:

![Fig 1 from Dinh et al. 2019](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig1.png)


In [632]:
#jupyter nbconvert --to markdown R_replicate_Dinh.ipynb --output README.md

In [633]:
library(arrow)
library(dplyr)
library(readxl)

## 1. NHANES data

URL: https://www.cdc.gov/nchs/index.htm


## Target


- Case I: Diabetes.

    - Glucose >= 126 mg/dL. OR;
    - "Yes" to the question "Have you ever been told by a doctor that you have diabetes?"

- Case II: Undiagnosed Diabetes. 

    - Glucose >= 126 mg/dL. AND;
    - "No" to the question "Have you ever been told by a doctor that you have diabetes?"

- Cardio: Cardiovascular disease.

    - "Yes" to any of the questions "Have you ever been told by a doctor that you had congestive heart failure, coronary heart disease, a heart attack, or a stroke?"

- Pre diabetes

    - Glucose between 100 and 125 mg/dL

## Covariates

The paper does not say which NHANES variables were used. I emailed the corresponding author to ask for the list of variables, but I have not received an answer yet.

Given that NHANES has more than 3000 variables, I cannot simply pick the variables I believe are important.

For now, I will consider the variables taken from [Figure 5](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig5.png) and [Figure 6](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig6.png) of the paper. I compiled them by hand in an Excel file using the NHANES variable search tool:



In [634]:
DATA_PATH <- "/Users/pipegalera/dev/ml_diabetes/data/NHANES/"
dinh_2019_vars <- read_excel(paste0(DATA_PATH, "dinh_2019_variables_doc.xlsx"))

head(dinh_2019_vars[c("Variable Name", "NHANES Name")], n=15)


Variable Name,NHANES Name
<chr>,<chr>
Age,RIDAGEYR
Alcohol consumption,ALQ130
Alcohol intake,DRXTALCO
"Alcohol intake, First Day",DR1TALCO
"Alcohol intake, Second Day",DR2TALCO
Arm circumference,BMXARMC
Arm length,BMXARML
Blood osmolality,LBXSOSSI
Blood relatives have diabetes,MCQ250A
Blood urea nitrogen,LBDSBUSI


For the complete list (n=62), check the file `dinh_2019_variables_doc.xlsx` under NHANES data folder.

NHANES data is spread across multiple files (see `NHANES` under the data folder) that have to be compiled together. The data was downloaded automatically via a script, all the files were converted from SAS to parquet, and the files were stacked and merged on the respondent identifier (`SEQN`). For more details, please check the `nhanes_data_backfill` notebook.

Please note that no transformations were made to the covariates; the files were only arranged and stacked together.
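
The stacking and merging itself lives in the `nhanes_data_backfill` notebook and is not shown here. As a rough illustration only, using toy data frames with made-up values (`demo_1999`, `demo_2001` and `bmx_all` are hypothetical names, not files from the project), the idea could be sketched like this:

```r
library(dplyr)

# Toy stand-ins for one file type split across two survey cycles (made-up values)
demo_1999 <- data.frame(SEQN = c(1, 2), RIDAGEYR = c(2, 77))
demo_2001 <- data.frame(SEQN = c(9966, 9967), RIDAGEYR = c(39, 23))

# Toy stand-in for a second file type, already stacked across cycles
bmx_all <- data.frame(SEQN = c(1, 2, 9966), BMXARMC = c(15.2, 29.8, 33.0))

# Stack the cycles and keep the cycle label in a YEAR column
demo_all <- bind_rows(
  list(`1999-2000` = demo_1999, `2001-2002` = demo_2001),
  .id = "YEAR"
)

# Merge the file types on the respondent identifier
merged <- full_join(demo_all, bmx_all, by = "SEQN")
merged
```

Respondents missing from one file simply get `NA` in that file's columns, which matches the many empty cells in the raw data shown in this notebook.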

In [635]:
df <- read_parquet(paste0(DATA_PATH, "raw_data/dinh_raw_data.parquet"))

In [636]:
head(df)

SEQN,YEAR,RIDAGEYR,ALQ130,DRXTALCO,DR1TALCO,DR2TALCO,BMXARMC,BMXARML,LBXSOSSI,...,RHD143,DIQ010,MCQ160B,MCQ160b,MCQ160C,MCQ160c,MCQ160E,MCQ160e,MCQ160F,MCQ160f
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1999-2000,2,,0.0,,,15.2,18.6,,...,,2,,,,,,,,
2,1999-2000,77,1.0,0.0,,,29.8,38.2,288.0,...,,2,2.0,,2.0,,2.0,,2.0,
3,1999-2000,10,,0.0,,,19.7,25.5,,...,,2,,,,,,,,
4,1999-2000,1,,0.0,,,16.4,20.4,,...,,2,,,,,,,,
5,1999-2000,49,3.0,34.56,,,35.8,39.7,276.0,...,,2,2.0,,2.0,,2.0,,2.0,
6,1999-2000,19,,0.0,,,26.0,34.5,277.0,...,,2,,,,,,,,


In [637]:
tail(df)

SEQN,YEAR,RIDAGEYR,ALQ130,DRXTALCO,DR1TALCO,DR2TALCO,BMXARMC,BMXARML,LBXSOSSI,...,RHD143,DIQ010,MCQ160B,MCQ160b,MCQ160C,MCQ160c,MCQ160E,MCQ160e,MCQ160F,MCQ160f
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
83726,2013-2014,40,,,,,31.0,39.0,,...,,2,,2.0,,2.0,,2.0,,2.0
83727,2013-2014,26,3.0,,14.0,19.9,29.9,35.2,285.0,...,,2,,2.0,,2.0,,2.0,,2.0
83728,2013-2014,2,,,0.0,0.0,14.7,16.5,,...,,2,,,,,,,,
83729,2013-2014,42,,,0.0,0.0,37.0,37.6,277.0,...,,2,,2.0,,2.0,,2.0,,2.0
83730,2013-2014,7,,,,,19.0,26.0,,...,,2,,,,,,,,
83731,2013-2014,11,,,0.0,0.0,25.0,31.7,,...,,2,,,,,,,,


In [638]:
nrow(df)

In [639]:
colnames(df)

# 2. Pre-processing and Data modeling

There are some fixes before the data is ready for analysis. 


## 2.1 Extreme values and replacing Missing/Don't know answers

> The preprocessing stage also converted any undecipherable values (errors in datatypes and standard formatting) from the database to null representations.

For this, I checked each variable against its possible values in the NHANES documentation (https://wwwn.cdc.gov/nchs/nhanes/search/default.aspx). I did not find any extreme values outside the possible ranges. However, the data is reviewed and updated after the survey, so the NCHS may already have applied fixes.


I have replaced "Don't know" and "Refused" answers with NA values and converted the initial encoding of the categorical variables to the actual survey labels, given that the encoding is not consistent across years. For the model, I will encode the variables myself.

In [640]:
# https://wwwn.cdc.gov/nchs/nhanes/search/default.aspx


In [641]:
# Replace "Refused" or "Don't know" answers with NA
df_formatted <- df %>%
  mutate(
    ALQ130 = case_when(
      ALQ130 == 77 ~ NA,
      ALQ130 == 99 ~ NA, 
      ALQ130 == 777 ~ NA,
      ALQ130 == 999 ~ NA,
      TRUE ~ ALQ130
    ),
    WHD140 = case_when(
      WHD140 == 7777 ~ NA,
      WHD140 == 77777 ~ NA,
      WHD140 == 9999 ~ NA,
      WHD140 == 99999 ~ NA,
      TRUE ~ WHD140
    ),
    DIQ010 = case_when(
      DIQ010 == 1 ~ "Yes",
      DIQ010 == 2 ~ "No",
      DIQ010 == 3 ~ "Borderline",
      DIQ010 == 7 ~ NA,
      DIQ010 == 9 ~ NA,
      TRUE ~ as.character(DIQ010)
    ),
    MCQ250A = case_when(
      MCQ250A == 1 ~ "Yes",
      MCQ250A == 2 ~ "No",
      MCQ250A == 7 ~ NA,
      MCQ250A == 9 ~ NA,
      TRUE ~ as.character(MCQ250A)
    ),
    MCQ300C = case_when(
      MCQ300C == 1 ~ "Yes",
      MCQ300C == 2 ~ "No",
      MCQ300C == 7 ~ NA,
      MCQ300C == 9 ~ NA,
      TRUE ~ as.character(MCQ300C)
    ),
    MCQ300c = case_when(
      MCQ300c == 1 ~ "Yes",
      MCQ300c == 2 ~ "No",
      MCQ300c == 7 ~ NA,
      MCQ300c == 9 ~ NA,
      TRUE ~ as.character(MCQ300c)
    ),
    MCQ160B = case_when(
      MCQ160B == 1 ~ "Yes",
      MCQ160B == 2 ~ "No",
      MCQ160B == 7 ~ NA,
      MCQ160B == 9 ~ NA,
      TRUE ~ as.character(MCQ160B)
    ),
    MCQ160b = case_when(
      MCQ160b == 1 ~ "Yes",
      MCQ160b == 2 ~ "No",
      MCQ160b == 7 ~ NA,
      MCQ160b == 9 ~ NA,
      TRUE ~ as.character(MCQ160b)
    ),
    MCQ160C = case_when(
      MCQ160C == 1 ~ "Yes",
      MCQ160C == 2 ~ "No",
      MCQ160C == 7 ~ NA,
      MCQ160C == 9 ~ NA,
      TRUE ~ as.character(MCQ160C)
    ),
    MCQ160c = case_when(
      MCQ160c == 1 ~ "Yes",
      MCQ160c == 2 ~ "No",
      MCQ160c == 7 ~ NA,
      MCQ160c == 9 ~ NA,
      TRUE ~ as.character(MCQ160c)
    ),
    MCQ160E = case_when(
      MCQ160E == 1 ~ "Yes",
      MCQ160E == 2 ~ "No",
      MCQ160E == 7 ~ NA,
      MCQ160E == 9 ~ NA,
      TRUE ~ as.character(MCQ160E)
    ),
    MCQ160e = case_when(
      MCQ160e == 1 ~ "Yes",
      MCQ160e == 2 ~ "No",
      MCQ160e == 7 ~ NA,
      MCQ160e == 9 ~ NA,
      TRUE ~ as.character(MCQ160e)
    ),
    MCQ160F = case_when(
      MCQ160F == 1 ~ "Yes",
      MCQ160F == 2 ~ "No",
      MCQ160F == 7 ~ NA,
      MCQ160F == 9 ~ NA,
      TRUE ~ as.character(MCQ160F)
    ),
    MCQ160f = case_when(
      MCQ160f == 1 ~ "Yes",
      MCQ160f == 2 ~ "No",
      MCQ160f == 7 ~ NA,
      MCQ160f == 9 ~ NA,
      TRUE ~ as.character(MCQ160f)
    ),
    BPQ080 = case_when(
      BPQ080 == 1 ~ "Yes",
      BPQ080 == 2 ~ "No",
      BPQ080 == 7 ~ NA,
      BPQ080 == 9 ~ NA,
      TRUE ~ as.character(BPQ080)
    ),
    HUQ010 = case_when(
      HUQ010 == 1 ~ "Excellent",
      HUQ010 == 2 ~ "Very Good",
      HUQ010 == 3 ~ "Good",
      HUQ010 == 4 ~ "Fair",
      HUQ010 == 5 ~ "Poor",
      HUQ010 == 7 ~ NA,
      HUQ010 == 9 ~ NA,
      TRUE ~ as.character(HUQ010)
    ),
    HSD010 = case_when(
      HSD010 == 1 ~ "Excellent",
      HSD010 == 2 ~ "Very Good",
      HSD010 == 3 ~ "Good",
      HSD010 == 4 ~ "Fair",
      HSD010 == 5 ~ "Poor",
      HSD010 == 7 ~ NA,
      HSD010 == 9 ~ NA,
      TRUE ~ as.character(HSD010)
    ),
    INDHHIN2 = case_when(
      INDHHIN2 == 1 ~	"$0 to $ 4,999",
      INDHHIN2 == 2 ~	"$5,000 to $ 9,999",
      INDHHIN2 == 3 ~	"$10,000 to $14,999",
      INDHHIN2 == 4 ~	"$15,000 to $19,999",
      INDHHIN2 == 5 ~	"$20,000 to $24,999",
      INDHHIN2 == 6 ~	"$25,000 to $34,999",
      INDHHIN2 == 7 ~	"$35,000 to $44,999",
      INDHHIN2 == 8 ~	"$45,000 to $54,999",
      INDHHIN2 == 9 ~	"$55,000 to $64,999",
      INDHHIN2 == 10 ~ "$65,000 to $74,999",
      INDHHIN2 == 12 ~ "Over $20,000",
      INDHHIN2 == 13 ~ "Under $20,000",
      INDHHIN2 == 14 ~ "$75,000 to $99,999",
      INDHHIN2 == 15 ~ "$100,000 and Over",
      INDHHIN2 == 77 ~ NA,
      INDHHIN2 == 99 ~ NA,
      TRUE ~ as.character(INDHHIN2)
    ),
    RIDRETH1 = case_when(
      RIDRETH1 == 1 ~ "Mexican American",
      RIDRETH1 == 2 ~ "Other Hispanic",
      RIDRETH1 == 3 ~ "Non-Hispanic White",
      RIDRETH1 == 4 ~ "Non-Hispanic Black",
      RIDRETH1 == 5 ~ "Other Race - Including Multi-Racial",
      TRUE ~ as.character(RIDRETH1)
    ),

  )
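
The Yes/No recoding above repeats the same pattern for many columns. A more compact variant could use a small helper together with `across()`. This is only a sketch on toy data: `recode_yes_no` is a hypothetical helper name, and it assumes that codes 7 and 9 always mean "Refused" and "Don't know" for the columns it is applied to.

```r
library(dplyr)

# Hypothetical helper: map NHANES 1/2 codes to labels;
# 7 and 9 (assumed to be "Refused" / "Don't know") become NA
recode_yes_no <- function(x) {
  case_when(
    x == 1 ~ "Yes",
    x == 2 ~ "No",
    x %in% c(7, 9) ~ NA_character_,
    TRUE ~ as.character(x)
  )
}

# Toy data with made-up answers
toy <- data.frame(MCQ160B = c(1, 2, 7, NA), BPQ080 = c(2, 9, 1, 2))

toy_recoded <- toy |>
  mutate(across(c(MCQ160B, BPQ080), recode_yes_no))
toy_recoded
```

Columns with more answer categories, such as `HUQ010` or `INDHHIN2`, would still need their own mapping.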


## 2.2 Homogenize variables that are the same but have different names across NHANES years

Dietary intake was recorded on a single day in the 1999 and 2001 survey cycles and on two days from 2003 on, so the intake variables have to be homogenized. Dinh et al. (2019) do not specify which dietary records they used, but my best guess is that they took the average of both days on which the examination was performed.

This situation happens with:

- Alcohol intake (`DRXTALCO`, `DR1TALCO`, `DR2TALCO`)
- Caffeine intake (`DRXTCAFF`, `DR1TCAFF`, `DR2TCAFF`)
- Calcium intake (`DRXTCALC`, `DR1TCALC`, `DR2TCALC`)
- Carbohydrate intake (`DRXTCARB`, `DR1TCARB`, `DR2TCARB`)
- Fiber intake (`DRXTFIBE`, `DR1TFIBE`, `DR2TFIBE`)
- Kcal intake (`DRXTKCAL`, `DR1TKCAL`, `DR2TKCAL`)
- Sodium intake (`DRDTSODI`, `DR1TSODI`, `DR2TSODI`)


Also, small changes in the format of the same question are registered under different codes. Examples:

- `MCQ250A`, `MCQ300C` and `MCQ300c`
- `LBDHDDSI` and `LBDHDLSI`
- `LBXGLUSI` and `LBDGLUSI`

And identical questions are coded differently as well:

- `MCQ160B` and `MCQ160b`
- `MCQ160C` and `MCQ160c`
- `MCQ160E` and `MCQ160e`
- `MCQ160F` and `MCQ160f`


It can be seen here:

In [642]:
# Similar questions (or the same) with different NHANES variable codes
var_docs <- read_excel(paste0(DATA_PATH, "dinh_2019_variables_doc.xlsx"))
var_docs |> 
  filter(`NHANES Name` %in% c('MCQ250A', 'MCQ300C', 'MCQ300c', 'LBDHDDSI', 'LBDHDLSI', 'LBXGLUSI', 'LBDGLUSI'))

Variable Name,NHANES Name,NHANES File,NHANES Type of data,Variable Definition
<chr>,<chr>,<chr>,<chr>,<chr>
Blood relatives have diabetes,MCQ250A,MCQ,Questionnaire,"Including living and deceased, were any of {SP&apos;s/ your} biological that is, blood relatives including grandparents, parents, brothers, sisters ever told by a health professional that they had . . .diabetes?"
Close relative had diabetes,MCQ300c,MCQ,Questionnaire,"Including living and deceased, were any of {SP&apos;s/your} close biological that is, blood relatives including father, mother, sisters or brothers, ever told by a health professional that they had diabetes?"
Close relative had diabetes,MCQ300C,MCQ,Questionnaire,"Including living and deceased, were any of {SP&apos;s/your} close biological that is, blood relatives including father, mother, sisters or brothers, ever told by a health professional that they had diabetes?"
HDL-cholesterol,LBDHDLSI,"Lab13, l13_b, HDL",Laboratory,HDL-cholesterol (mmol/L)
HDL-cholesterol,LBDHDDSI,"Lab13, l13_b, HDL",Laboratory,HDL-cholesterol (mmol/L)
Plasma Glucose,LBXGLUSI,"LAB10AM, L10AM_B",Laboratory,Plasma glucose: SI(mmol/L)
Plasma Glucose,LBDGLUSI,GLU,Laboratory,Plasma glucose: SI(mmol/L)


In [643]:
var_docs |> 
  filter(`NHANES Name` %in% c('MCQ160b', 'MCQ160B', 'MCQ160c', 'MCQ160C', 'MCQ160F', 'MCQ160f', 'MCQ160E', 'MCQ160e'))

Variable Name,NHANES Name,NHANES File,NHANES Type of data,Variable Definition
<chr>,<chr>,<chr>,<chr>,<chr>
Told CHF by a Doctor,MCQ160B,MCQ,Questionnaire,Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .had congestive heart failure?
Told CHF by a Doctor,MCQ160b,MCQ,Questionnaire,Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .had congestive heart failure?
Told CHD by a Doctor,MCQ160C,MCQ,Questionnaire,Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .had coronary heart disease?
Told CHD by a Doctor,MCQ160c,MCQ,Questionnaire,Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .had coronary heart disease?
Told HA by a Doctor,MCQ160E,MCQ,Questionnaire,Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .had a heart attack (also called myocardial infarction (my-o-car-dee-al in-fark-shun))?
Told HA by a Doctor,MCQ160e,MCQ,Questionnaire,Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .had a heart attack (also called myocardial infarction (my-o-car-dee-al in-fark-shun))?
Told stroke by a Doctor,MCQ160F,MCQ,Questionnaire,Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .had a stroke?
Told stroke by a Doctor,MCQ160f,MCQ,Questionnaire,Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .had a stroke?


In [644]:
# unique(df$YEAR[!is.na(df$MCQ250A)])
# unique(df$YEAR[!is.na(df$MCQ300C)])

To fix that, I will create a function that averages the Day 1 and Day 2 values of each intake variable, giving only one variable, for example `Alcohol_Intake` instead of `DRXTALCO`, `DR1TALCO` and `DR2TALCO`. Where the single-day variable from the earlier cycles is available, it is used as is.

In [645]:
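# Returns a single intake value per respondent:
# - the single-day value (day0_col) when it is available,
# - otherwise the mean of the Day 1 and Day 2 values, ignoring missing days.
# Note: rowMeans(..., na.rm = TRUE) yields NaN when both days are missing.
create_intake_new_column <- function(df, day0_col, day1_col, day2_col) {
    ifelse(is.na(df[[day0_col]]), 
           rowMeans(df[, c(day1_col, day2_col)], na.rm = TRUE), 
           df[[day0_col]])
}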

df_formatted <- df_formatted |>
# Create new columns
  mutate(
    # Alcohol intake
    Alcohol_Intake = create_intake_new_column(df,'DRXTALCO', 'DR1TALCO', 'DR2TALCO'),
    # Caffeine intake
    Caffeine_Intake = create_intake_new_column(df,'DRXTCAFF', 'DR1TCAFF', 'DR2TCAFF'),
    # Calcium intake
    Calcium_Intake = create_intake_new_column(df,'DRXTCALC', 'DR1TCALC', 'DR2TCALC'),
    # Carbohydrate intake
    Carbohydrate_Intake = create_intake_new_column(df,'DRXTCARB', 'DR1TCARB', 'DR2TCARB'),
    # Fiber intake
    Fiber_Intake = create_intake_new_column(df,'DRXTFIBE', 'DR1TFIBE', 'DR2TFIBE'),
    # Kcal intake
    Kcal_Intake = create_intake_new_column(df,'DRXTKCAL', 'DR1TKCAL', 'DR2TKCAL'),
    # Sodium intake
    Sodium_Intake = create_intake_new_column(df,'DRDTSODI', 'DR1TSODI', 'DR2TSODI'),
    # Relative_Had_Diabetes
    Relative_Had_Diabetes = coalesce(MCQ250A, MCQ300C, MCQ300c),
    # Heart conditions
    Told_CHF = coalesce(MCQ160B, MCQ160b),
    Told_CHD = coalesce(MCQ160C, MCQ160c),
    Told_HA = coalesce(MCQ160E, MCQ160e),
    Told_stroke = coalesce(MCQ160F, MCQ160f),
    # HDL-cholesterol
    HDL_Cholesterol = coalesce(LBDHDLSI, LBDHDDSI),
    # Glucose
    Glucose = coalesce(LBXGLUSI, LBDGLUSI)
   ) |>
# Delete old columns that are not needed
  select(-c(DRXTALCO, DR1TALCO, DR2TALCO, DRXTCAFF, DR1TCAFF, DR2TCAFF,
            DRXTCALC, DR1TCALC, DR2TCALC, DRXTCARB, DR1TCARB, DR2TCARB,
            DRXTFIBE, DR1TFIBE, DR2TFIBE, DRXTKCAL, DR1TKCAL, DR2TKCAL,
            DRDTSODI, DR1TSODI, DR2TSODI, MCQ250A, MCQ300C, MCQ300c,
            MCQ160B, MCQ160b, MCQ160C, MCQ160c, MCQ160E, MCQ160e, MCQ160F,
            MCQ160f, LBDHDLSI, LBDHDDSI, LBXGLUSI, LBDGLUSI)
            )
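
To see how the intake logic behaves, here is a toy example with made-up values. The function body is repeated under a different, hypothetical name (`intake_sketch`) so that the block is self-contained. The one difference from the function above is an extra line that turns `NaN` into `NA` when both days are missing, which is my addition.

```r
# Same idea as create_intake_new_column, with NaN converted to NA (an addition)
intake_sketch <- function(df, day0_col, day1_col, day2_col) {
  two_day_mean <- rowMeans(df[, c(day1_col, day2_col)], na.rm = TRUE)
  two_day_mean[is.nan(two_day_mean)] <- NA  # both days missing
  ifelse(is.na(df[[day0_col]]), two_day_mean, df[[day0_col]])
}

# Made-up values covering the four possible situations
toy <- data.frame(
  DRXTALCO = c(10, NA, NA, NA),  # single-day value (earlier cycles)
  DR1TALCO = c(NA, 4, 6, NA),    # Day 1 (later cycles)
  DR2TALCO = c(NA, 8, NA, NA)    # Day 2 (later cycles)
)

alcohol_intake <- intake_sketch(toy, "DRXTALCO", "DR1TALCO", "DR2TALCO")
alcohol_intake
```

The first row keeps the single-day value, the second is the mean of both days, the third falls back to the only available day, and the last stays missing.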

In [646]:
#unique(df_formatted$YEAR[!is.na(df_formatted$Relative_Had_Diabetes)])

## 2.3 Choosing between different blood pressure readings

[From NHANES](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/BPX_H.htm): 

> After resting quietly in a seated position for 5 minutes and once the participants maximum inflation level (MIL) has been determined, three consecutive blood pressure readings are obtained. If a blood pressure measurement is interrupted or incomplete, a fourth attempt may be made. All BP determinations (systolic and diastolic) are taken in the mobile examination center (MEC). 

In Dinh et al. (2019) the authors do not say which readings they used. I'm assuming they took the last one, to avoid [white coat syndrome](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5352963/) and for data consistency.

In [647]:
df_formatted <- df_formatted |>
# Create new columns
  mutate(
    Diastolic_Blood_Pressure = coalesce(BPXDI4, BPXDI3, BPXDI2, BPXDI1),
    Systolic_Blood_Pressure = coalesce(BPXSY4, BPXSY3, BPXSY2, BPXSY1),
  ) |>
# Delete old columns that are not needed
  select(-c(BPXDI4, BPXDI3, BPXDI2, BPXDI1,
            BPXSY4, BPXSY3, BPXSY2, BPXSY1)
  )
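
As a quick illustration of what `coalesce()` does in this ordering, here is a toy example with made-up systolic readings. Listing the fourth reading first means the latest available reading wins.

```r
library(dplyr)

# Made-up readings: full set of three, only the first, and none at all
toy <- data.frame(
  BPXSY1 = c(120, 118, NA),
  BPXSY2 = c(122, NA, NA),
  BPXSY3 = c(124, NA, NA),
  BPXSY4 = c(NA_real_, NA, NA)
)

toy_bp <- toy |>
  mutate(Systolic_Blood_Pressure = coalesce(BPXSY4, BPXSY3, BPXSY2, BPXSY1))
toy_bp$Systolic_Blood_Pressure
```

A respondent with no readings at all keeps `NA`.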

## 2.4 Discretionary trimming of the data according to the authors

> In our study, all datasets were limited to non-pregnant subjects and adults of at least twenty years of age.

In [None]:
# TODO: need to add SEQ060 to RHD143, RHQ141

In [653]:
df_formatted |> 
    filter(RIDAGEYR >= 20) # Adults of at least 20 years of age

SEQN,YEAR,RIDAGEYR,ALQ130,BMXARMC,BMXARML,LBXSOSSI,LBDSBUSI,BMXBMI,LB2SCLSI,...,Sodium_Intake,Relative_Had_Diabetes,Told_CHF,Told_CHD,Told_HA,Told_stroke,HDL_Cholesterol,Glucose,Diastolic_Blood_Pressure,Systolic_Blood_Pressure
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
2,1999-2000,77,1,29.8,38.2,288,6.8,24.90,,...,5710.03,No,No,No,No,No,1.39,4.646,56,98
5,1999-2000,49,3,35.8,39.7,276,5.7,29.10,,...,3756.36,No,No,No,No,No,1.08,5.550,82,122
7,1999-2000,59,,31.7,38.1,283,3.6,29.39,,...,3808.53,Yes,No,No,No,No,2.73,4.756,82,124
10,1999-2000,43,1,37.6,43.0,281,4.6,30.94,,...,3377.12,,No,No,No,No,1.31,4.989,96,142
12,1999-2000,37,3,37.2,40.0,283,7.1,30.62,,...,7511.18,Yes,No,No,No,No,0.98,4.606,100,176
13,1999-2000,70,2,27.0,33.5,288,6.8,25.57,,...,1066.72,Yes,No,No,No,No,1.28,,70,130
14,1999-2000,81,1,33.4,36.7,284,5.0,27.33,,...,,No,No,No,No,No,1.04,,64,138
15,1999-2000,38,2,32.5,37.5,271,4.3,26.68,,...,3832.49,No,No,No,No,No,1.49,5.484,70,106
16,1999-2000,85,,23.6,34.8,290,10.0,19.96,,...,2129.94,No,No,No,No,No,1.41,,62,136
20,1999-2000,23,3,26.1,34.5,271,3.6,23.68,,...,2746.43,Yes,No,No,No,No,1.10,5.033,62,104


In [623]:
df_formatted |> 
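  # Note: filter() also drops rows where RHD143 is NA, so respondents
  # without a recorded answer to this question are removed here as well.
  filter(RHD143 != 1) |>  # Are you pregnant now? "Yes" = 1 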
  filter(RIDAGEYR >= 20)  |> # Age
  select(-c(RHD143))

SEQN,YEAR,RIDAGEYR,ALQ130,BMXARMC,BMXARML,LBXSOSSI,LBDSBUSI,BMXBMI,LB2SCLSI,...,Sodium_Intake,Relative_Had_Diabetes,Told_CHF,Told_CHD,Told_HA,Told_stroke,HDL_Cholesterol,Glucose,Diastolic_Blood_Pressure,Systolic_Blood_Pressure
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
21010,2003-2004,52,4,29.1,35.9,280,2.50,25.49,,...,1970.0,No,No,No,No,No,3.08,,84,134
21017,2003-2004,37,2,26.5,32.5,271,2.14,19.34,,...,3499.5,Yes,No,No,No,No,1.99,,62,104
21018,2003-2004,33,1,20.1,38.4,,,16.57,,...,2658.5,No,No,No,No,No,,,100,136
21048,2003-2004,42,2,25.9,35.9,277,4.28,21.85,,...,3295.5,No,No,No,No,No,1.01,,88,128
21052,2003-2004,37,3,29.1,34.8,272,3.93,20.01,,...,4187.5,No,No,No,No,No,1.40,,68,116
21059,2003-2004,51,1,25.2,32.0,272,3.93,20.61,,...,1803.5,Yes,No,No,No,No,2.30,,76,130
21097,2003-2004,23,10,27.2,36.3,277,3.57,22.52,,...,4128.5,No,No,No,No,No,1.16,,88,126
21115,2003-2004,32,,37.0,31.4,271,2.86,42.57,,...,3011.0,Yes,No,No,No,No,1.40,,52,108
21116,2003-2004,38,2,37.9,35.1,278,5.36,37.82,,...,2144.0,No,No,No,No,No,1.58,5.296,72,132
21122,2003-2004,37,1,26.1,32.4,273,1.79,22.36,,...,,No,No,No,No,No,1.71,,72,100
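
One caveat with the cell above: `filter()` drops rows where the condition evaluates to `NA`, so `RHD143 != 1` also removes every respondent with no recorded pregnancy answer. A sketch of an NA-safe variant on toy data follows. It assumes that a missing `RHD143` should be treated as "not pregnant", which is my assumption and not something stated in the paper.

```r
library(dplyr)

# Made-up respondents: missing answer, under 20, pregnant, not pregnant, missing answer
toy <- data.frame(
  SEQN = 1:5,
  RIDAGEYR = c(35, 19, 42, 28, 60),
  RHD143 = c(NA, 2, 1, 2, NA)
)

adults_not_pregnant <- toy |>
  filter(RIDAGEYR >= 20) |>                 # adults of at least 20 years of age
  filter(is.na(RHD143) | RHD143 != 1) |>    # keep missing answers, drop "Yes" = 1
  select(-RHD143)
adults_not_pregnant
```

To carry the trimming forward, the result would also need to be assigned back (for example to `df_formatted`).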


## 2.5 Creating the Target Variables

From [Tables 1 & 3 from Dinh et al. 2019](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Table1_3.png):

`Diabetes = 1` if

- Glucose >= 126 mg/dL. OR;
- "Yes" to the question "Have you ever been told by a doctor that you have diabetes?"

`undiagnosed diabetes = 1` if

- Glucose >= 126 mg/dL. AND;
- "No" to the question "Have you ever been told by a doctor that you have diabetes?"

`pre diabetes = 1` if

- Glucose between 100 and 125 mg/dL

`CVD = 1` if

- "Yes" to any of the questions "Have you ever been told by a doctor that you had congestive heart failure, coronary heart disease, a heart attack, or a stroke?"
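
The thresholds above are stated in mg/dL, while the NHANES plasma glucose variables used in this notebook are reported in mmol/L (see the variable definitions in section 2.2). A small sketch of the conversion, assuming a molar mass of glucose of about 180.16 g/mol (so 1 mmol/L is roughly 18.016 mg/dL); `mgdl_to_mmol` is a hypothetical helper name:

```r
# Approximate conversion from mg/dL to mmol/L for glucose
# (assumes a molar mass of ~180.16 g/mol)
mgdl_to_mmol <- function(mg_dl) mg_dl / 18.016

thresholds_mmol <- round(mgdl_to_mmol(c(prediabetes = 100, diabetes = 126)), 1)
thresholds_mmol
```

This is where cut-offs of roughly 5.6 and 7.0 mmol/L come from.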



In [625]:
colnames(df_formatted)

In [626]:
df_formatted <- df_formatted %>%
  mutate(
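    # Case I: diabetic (1) or not diabetic (0).
    # 126 mg/dL is roughly 7.0 mmol/L, the unit of the Glucose column.
    Diabetes_Case_I = case_when(
      (Glucose >= 7.0 | DIQ010 == "Yes") ~ 1,
      TRUE ~ 0),
    Diabetes_Case_II = case_when(
      # Undiagnosed diabetic: high glucose without a previous diagnosis
      (Glucose >= 7.0 & DIQ010 == "No") ~ 1,
      # Prediabetic: 100 to 125 mg/dL is roughly 5.6 to 7.0 mmol/L
      (Diabetes_Case_I == 0 & Glucose >= 5.6 & Glucose < 7.0) ~ 1,
      TRUE ~  0),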
    # Cardiovascular Disease
    CVD = case_when(
      (Told_CHF == "Yes" | Told_CHD == "Yes" | Told_HA == "Yes" | Told_stroke == "Yes") ~ 1,
      TRUE ~  0)
  )  |>
  select(-c(Told_CHF, Told_CHD, Told_HA, Told_stroke, Glucose, DIQ010))
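
A small toy check of the written definitions from section 2.5, with made-up glucose values in mmol/L. It is a sketch of the definitions as listed (thresholds applied with `>=`), not necessarily the exact labelling used by the authors.

```r
library(dplyr)

# Made-up cases: healthy, undiagnosed, diagnosed, prediabetic, diagnosed without glucose
toy <- data.frame(
  Glucose = c(4.6, 7.4, 7.4, 6.0, NA),
  DIQ010  = c("No", "No", "Yes", "No", "Yes")
)

labelled <- toy |>
  mutate(
    Diabetes_Case_I = case_when(
      Glucose >= 7.0 | DIQ010 == "Yes" ~ 1,
      TRUE ~ 0
    ),
    Diabetes_Case_II = case_when(
      Glucose >= 7.0 & DIQ010 == "No" ~ 1,                        # undiagnosed
      Diabetes_Case_I == 0 & Glucose >= 5.6 & Glucose < 7.0 ~ 1,  # prediabetic
      TRUE ~ 0
    )
  )
labelled
```

Note that `case_when()` treats an `NA` condition as not matched, so a respondent with missing glucose and no diagnosis ends up labelled 0 rather than `NA`.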

## 2.6 Column name formatting

In [627]:
df_formatted <- df_formatted %>% 
  rename(
    Alcohol_consumption = ALQ130,
    Arm_circumference = BMXARMC,
    Arm_length = BMXARML,
    Body_mass_index = BMXBMI,
    Height = BMXHT,
    Leg_length = BMXLEG,
    Waist_circumference = BMXWAIST,
    Weight = BMXWT,
    Told_High_Cholesterol = BPQ080,
    Pulse = BPXPLS,
    General_health = HSD010,
    Health_status = HUQ010,
    Household_income = INDHHIN2,
    Chloride = LB2SCLSI,
    LDL_cholesterol = LBDLDLSI,
    Lymphocytes = LBDLYMNO,
    Blood_urea_nitrogen = LBDSBUSI,
    Triglycerides = LBDSTRSI,
    Total_cholesterol = LBDTCSI,
    Mean_cell_volume = LBXMCVSI,
    Aspartate_aminotransferase_AST = LBXSASSI,
    Gamma_glutamyl_transferase = LBXSGTSI,
    Osmolality = LBXSOSSI,
    White_blood_cell_count = LBXWBCSI,
    Age = RIDAGEYR,
    Race_ethnicity = RIDRETH1,
    `Self-reported_greatest_weight` = WHD140,
    Survey_year = YEAR,
  )

In [628]:
head(df_formatted)

SEQN,Survey_year,Age,Alcohol_consumption,Arm_circumference,Arm_length,Osmolality,Blood_urea_nitrogen,Body_mass_index,Chloride,...,Fiber_Intake,Kcal_Intake,Sodium_Intake,Relative_Had_Diabetes,HDL_Cholesterol,Diastolic_Blood_Pressure,Systolic_Blood_Pressure,Diabetes_Case_I,Diabetes_Case_II,CVD
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1999-2000,2,,15.2,18.6,,,14.9,,...,7.41,1358.88,1621.35,,,,,0,0,0
2,1999-2000,77,1.0,29.8,38.2,288.0,6.8,24.9,,...,36.99,2463.0,5710.03,No,1.39,56.0,98.0,0,0,0
3,1999-2000,10,,19.7,25.5,,,17.63,,...,11.16,1517.69,1676.51,,0.78,62.0,112.0,0,0,0
4,1999-2000,1,,16.4,20.4,,,,,...,5.45,1474.93,1277.31,,,,,0,0,0
5,1999-2000,49,3.0,35.8,39.7,276.0,5.7,29.1,,...,17.28,2658.14,3756.36,No,1.08,82.0,122.0,0,0,0
6,1999-2000,19,,26.0,34.5,277.0,3.2,22.56,,...,6.24,1113.66,949.52,,1.57,80.0,112.0,0,0,0


## 2.7 Normalization and Categorical Encoding


> Normalization was performed on the data using the following standardization model: $x' = \frac{x - \bar{x}}{\sigma}$

Before we apply `scale`, we need to: 

1. Classify all the columns as either categorical or numerical.
2. Apply the standardization only to the numerical ones.


In [629]:
# Categorical variables
categorical_vars <- c(
  'SEQN',
  'Survey_year',
  'Race_ethnicity',
  'General_health',
  'Health_status',
  'Told_High_Cholesterol',
  'Household_income',
  'Relative_Had_Diabetes'
)

# Numerical variables
numerical_vars <- c(
  'Age',
  'Alcohol_consumption',
  'Arm_circumference',
  'Arm_length',
  'Osmolality',
  'Blood_urea_nitrogen',
  'Body_mass_index',
  'Chloride',
  'Gamma_glutamyl_transferase',
  'Height',
  'LDL_cholesterol',
  'Leg_length',
  'Lymphocytes',
  'Mean_cell_volume',
  'Pulse',
  'Self-reported_greatest_weight',
  'Total_cholesterol',
  'Triglycerides',
  'Waist_circumference',
  'Weight',
  'White_blood_cell_count',
  'Aspartate_aminotransferase_AST',
  'Alcohol_Intake',
  'Caffeine_Intake',
  'Calcium_Intake',
  'Carbohydrate_Intake',
  'Fiber_Intake',
  'Kcal_Intake',
  'Sodium_Intake',
  'HDL_Cholesterol',
  'Diastolic_Blood_Pressure',
  'Systolic_Blood_Pressure'
)

df_formatted <- df_formatted |> 
    mutate(
        across(all_of(numerical_vars), scale)
    )


In [630]:
df_formatted

SEQN,Survey_year,Age,Alcohol_consumption,Arm_circumference,Arm_length,Osmolality,Blood_urea_nitrogen,Body_mass_index,Chloride,...,Fiber_Intake,Kcal_Intake,Sodium_Intake,Relative_Had_Diabetes,HDL_Cholesterol,Diastolic_Blood_Pressure,Systolic_Blood_Pressure,Diabetes_Case_I,Diabetes_Case_II,CVD
<dbl>,<chr>,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>",...,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>",<chr>,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>",<dbl>,<dbl>,<dbl>
1,1999-2000,-1.1528298,,-1.6636987,-1.8937916,,,-1.38902690,,...,-0.79779041,-0.6751562,-0.89336922,,,,,0,0,0
2,1999-2000,1.8626044,-0.65078964,0.2069443,0.6926871,1.96698244,1.12256662,-0.05442850,,...,2.59705169,0.5726372,1.63753100,No,0.06515962,-0.62537804,-1.0586164,0,0,0
3,1999-2000,-0.8311835,,-1.0871307,-0.9832455,,,-1.02468154,,...,-0.36740982,-0.4956811,-0.85922508,,-1.51960665,-0.23674586,-0.3205259,0,0,0
4,1999-2000,-1.1930356,,-1.5099472,-1.6562579,,,,,...,-1.02273600,-0.5440052,-1.10633059,,,,,0,0,0
5,1999-2000,0.7368423,0.04596624,0.9757016,0.8906320,-0.34171723,0.58303298,0.50610283,,...,0.33497131,0.7931698,0.42820578,No,-0.74021340,1.05869472,0.2066816,0,0,0
6,1999-2000,-0.4693314,,-0.2799354,0.2044233,-0.14932559,-0.64317984,-0.36672453,,...,-0.93206915,-0.9522854,-1.30923318,,0.53279557,0.92915066,-0.3205259,0,0,0
7,1999-2000,1.1389002,,0.4503841,0.6794908,1.00502424,-0.44698579,0.54480618,,...,0.75272740,0.1325214,0.46049910,Yes,3.54644946,1.05869472,0.3121231,0,0,0
8,1999-2000,-0.7105661,,-1.0358802,0.4947423,-0.72650051,0.92637256,-1.30761640,,...,2.48687426,4.6304409,4.53106721,,0.97445174,-0.62537804,-1.1640579,0,0,0
9,1999-2000,-0.7909777,,-0.7796277,-0.1254847,,,-0.91124067,,...,-0.46725811,-0.4810347,-0.72717962,,0.32495737,-1.14355427,-0.4259674,0,0,0
10,1999-2000,0.4956076,-0.65078964,1.2063289,1.3261105,0.62024097,0.04349934,0.75166894,,...,1.17277886,1.7667610,0.19345554,,-0.14267858,1.96550312,1.2610966,0,0,0
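
The `<dbl[,1]>` column types in the output above appear because `scale()` returns a one-column matrix. If plain numeric columns are preferred, one option is to wrap it in `as.numeric()`. A minimal sketch on toy data, where `standardize` is a hypothetical helper name:

```r
library(dplyr)

# Hypothetical helper: z-score a vector and drop the matrix attributes
standardize <- function(x) as.numeric(scale(x))

# Made-up values; Race_ethnicity stands in for a categorical column
toy <- data.frame(
  Age = c(20, 40, 60),
  Weight = c(60, 80, 100),
  Race_ethnicity = c("a", "b", "c")
)
toy_numerical_vars <- c("Age", "Weight")

toy_scaled <- toy |>
  mutate(across(all_of(toy_numerical_vars), standardize))
toy_scaled
```

Each numerical column is centred on its mean and divided by its standard deviation, while the categorical column is left untouched.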
