# Dinh et al. (2019)

## A data-driven approach to predicting diabetes and cardiovascular disease with machine learning

URL: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-0918-5


## Brief Summary

Dinh et al. (2019) use several ML models (logistic regression, support vector machines, random forest, and gradient boosting) on the NHANES dataset to predict i) diabetes and ii) cardiovascular disease ("CVD").

**Goal**: Build an identification mechanism for patients at risk of diabetes and cardiovascular disease, and find the key contributors to diabetes.

**Results**:

Best scores:

- CVD prediction based on 131 NHANES variables achieved an AU-ROC score of 83.9%.
- Diabetes prediction based on 123 NHANES variables achieved an AU-ROC score of 95.7%.
- Pre-diabetes prediction based on 123 NHANES variables achieved an AU-ROC score of 84.4%.
- The top 5 predictors for diabetes patients were 1) `waist size`, 2) `age`, 3) `self-reported weight`, 4) `leg length`, 5) `sodium intake`.
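
For reference, the AU-ROC reported in these results is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. Below is a minimal base-R sketch of that definition using the rank-based (Mann-Whitney) formulation. The helper name `auc_rank` and the labels and scores are made up for illustration; this is not the paper's code.

```r
# Rank-based AU-ROC: Mann-Whitney U statistic divided by n_pos * n_neg.
# `auc_rank` is a hypothetical helper; labels must be 0/1, tied scores get average ranks.
auc_rank <- function(labels, scores) {
  ranks <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up example: two negatives and two positives
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(labels, scores)
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the function returns 0.75.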



This notebook replicates the results of the paper. It follows these steps:

1. NHANES data 
2. Pre-processing of the data
3. Transformation of the data
4. Train/Test Split 
5. 10-fold cross-validation
6. Training monitoring using MLflow
7. Get metric results (AUC)
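
Steps 4 and 5 can be sketched in base R as follows. This is only an illustration on simulated data: the 80/20 split ratio, the seed, and the toy data frame are assumptions, and the model-fitting step is left as a comment.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of the sketch

# Simulated stand-in for the modelling table (100 rows, binary outcome)
n <- 100
toy <- data.frame(id = seq_len(n), y = rbinom(n, 1, 0.3))

# Step 4: hold out 20% of the rows for testing (the ratio is an assumption)
test_idx <- sample(n, size = round(0.2 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Step 5: assign every training row to one of 10 folds of equal size
folds <- sample(rep(1:10, length.out = nrow(train)))

for (k in 1:10) {
  fit_rows <- train[folds != k, ]  # 9 folds used for fitting
  val_rows <- train[folds == k, ]  # 1 fold used for validation
  # Fit a model on fit_rows and score it (e.g. AUC) on val_rows here
}
```

The test set is never touched inside the loop, so it stays available for the final metric.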


The structure of the analysis emulates Figure 1 from the paper:

![Fig 1 from Dinh et al. 2019](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig1.png)


In [1]:
library(arrow)
library(dplyr)
library(readxl)


## 1. NHANES data

URL: https://www.cdc.gov/nchs/index.htm


## Target

From the paper, the definitions are clear: 

![Dinh et al.(2019), Table 4](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Table4.png)

- Case I: Diabetes.

    - Glucose >= 126 mg/dL. OR;
    - "Yes" to the question "Have you ever been told by a doctor that you have diabetes?"

- Case II: Undiagnosed Diabetes. 

    - Glucose >= 126 mg/dL. AND;
    - "No" to the question "Have you ever been told by a doctor that you have diabetes?"

- Cardio: Cardiovascular disease.

    - "Yes" to any of the questions "Have you ever been told by a doctor that you had congestive heart failure, coronary heart disease, a heart attack, or a stroke?"

The paper also defines and tests an additional target:

- Pre-diabetes

    - Glucose between 100 and 125 mg/dL (100 <= glucose <= 125)
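
The target definitions above can be expressed as logical columns. A minimal sketch on made-up data follows; the column names `glucose` and `told_diabetes` are hypothetical stand-ins and are not the actual NHANES variable codes.

```r
library(dplyr)

# Made-up data; `glucose` (mg/dL) and `told_diabetes` are hypothetical column names
toy <- tibble(
  glucose       = c(90, 110, 130, 140, 95, NA),
  told_diabetes = c("No", "No", "No", "Yes", "Yes", "No")
)

labelled <- toy |>
  mutate(
    # Case I: glucose >= 126 OR told by a doctor
    diabetes = (glucose >= 126) | (told_diabetes == "Yes"),
    # Case II: glucose >= 126 AND not told by a doctor
    undiagnosed_diabetes = (glucose >= 126) & (told_diabetes == "No"),
    # Pre-diabetes: glucose between 100 and 125
    prediabetes = glucose >= 100 & glucose <= 125
  )

labelled
```

Rows with a missing glucose value get `NA` for a target unless the questionnaire answer alone decides it, so missing measurements need an explicit decision before modelling.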

## Covariates

The paper does not say which NHANES variables the authors used. I emailed the corresponding author listed in the paper to ask for the list of variables, but I have not received an answer yet.

Given that NHANES has more than 3000 variables, I cannot just arbitrarily pick the variables I believe are important.

For now, I will use the variables shown in [Figure 5](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig5.png) and [Figure 6](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig6.png) of the paper. I compiled them by hand in an Excel file using the NHANES variable search tool:



In [143]:
DATA_PATH <- "/Users/pipegalera/dev/ml_diabetes/data/NHANES/raw_data/"
dinh_2019_vars <- read_excel(paste0(DATA_PATH, "dinh_2019_variables_doc.xlsx"))

head(dinh_2019_vars[c("Variable Name", "NHANES Name")], n=15)


| Variable Name | NHANES Name |
|---|---|
| `<chr>` | `<chr>` |
| Age | RIDAGEYR |
| Alcohol consumption | ALQ130 |
| Alcohol intake | DRXTALCO |
| Alcohol intake, First Day | DR1TALCO |
| Alcohol intake, Second Day | DR2TALCO |
| Arm circumference | BMXARMC |
| Arm length | BMXARML |
| Blood osmolality | LBXSOSSI |
| Blood relatives have diabetes | MCQ250A |
| Blood urea nitrogen | LBDSBUSI |


For the complete list (n=62), check the file `dinh_2019_variables_doc.xlsx` under the NHANES data folder.

NHANES data is spread across multiple files (see `NHANES` under the data folder) that have to be compiled together. The data was downloaded automatically via a script, all the files were converted from SAS to parquet, and the files were then stacked and merged on the respondent index (`SEQN`). For more details, please check the `nhanes_data_backfill` notebook.

Please note that no transformations are applied to the covariates; the files were only arranged and stacked together.
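
As a rough illustration of the stack-and-merge step described above (the actual logic lives in the `nhanes_data_backfill` notebook and may differ), files of the same type are stacked across survey cycles and the resulting tables are joined on `SEQN`. The small data frames below are made-up stand-ins for the converted parquet files.

```r
library(dplyr)

# Made-up stand-ins for two file types (demographics, body measures) in two cycles
demo_1999 <- tibble(SEQN = c(1, 2), RIDAGEYR = c(2, 77))
demo_2001 <- tibble(SEQN = c(9966, 9967), RIDAGEYR = c(39, 23))
bmx_1999  <- tibble(SEQN = c(1, 2), BMXWT = c(12.5, 75.4))
bmx_2001  <- tibble(SEQN = 9966, BMXWT = 80.1)

# Stack the same file type across cycles; `.id` records the cycle in a YEAR column
demo <- bind_rows(list(`1999-2000` = demo_1999, `2001-2002` = demo_2001), .id = "YEAR")
bmx  <- bind_rows(bmx_1999, bmx_2001)

# Merge file types on the respondent index; a full join keeps respondents
# that are missing from one of the files (their columns become NA)
merged <- full_join(demo, bmx, by = "SEQN")
merged
```

A full join is used here so that no respondent is silently dropped; whether the original script uses a full or a left join is not stated in this notebook.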

In [221]:
df <- read_parquet(paste0(DATA_PATH, "dinh_raw_data.parquet"))

In [222]:
head(df)

SEQN,YEAR,RIDAGEYR,ALQ130,DRXTALCO,DR1TALCO,DR2TALCO,BMXARMC,BMXARML,LBXSOSSI,...,BPXSY4,BPXSY2,BPXSY3,LBDTCSI,LBDSTRSI,BMXWAIST,BMXWT,LBXWBCSI,LBXSASSI,RHD143
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1999-2000,2,,0.0,,,15.2,18.6,,...,,,,,,45.7,12.5,,,
2,1999-2000,77,1.0,0.0,,,29.8,38.2,288.0,...,,98.0,98.0,5.56,1.298,98.0,75.4,7.6,19.0,
3,1999-2000,10,,0.0,,,19.7,25.5,,...,,104.0,112.0,3.34,,64.7,32.9,7.5,,
4,1999-2000,1,,0.0,,,16.4,20.4,,...,,,,,,,13.3,8.8,,
5,1999-2000,49,3.0,34.56,,,35.8,39.7,276.0,...,,122.0,122.0,7.21,3.85,99.9,92.5,5.9,22.0,
6,1999-2000,19,,0.0,,,26.0,34.5,277.0,...,,116.0,112.0,3.96,0.553,81.6,59.2,9.6,20.0,


In [223]:
tail(df)

SEQN,YEAR,RIDAGEYR,ALQ130,DRXTALCO,DR1TALCO,DR2TALCO,BMXARMC,BMXARML,LBXSOSSI,...,BPXSY4,BPXSY2,BPXSY3,LBDTCSI,LBDSTRSI,BMXWAIST,BMXWT,LBXWBCSI,LBXSASSI,RHD143
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
83726,2013-2014,40,,,,,31.0,39.0,,...,,,,,,97.7,79.0,,,
83727,2013-2014,26,3.0,,14.0,19.9,29.9,35.2,285.0,...,,116.0,112.0,4.91,0.858,87.1,71.8,5.1,27.0,
83728,2013-2014,2,,,0.0,0.0,14.7,16.5,,...,,,,,,47.2,11.3,6.6,,
83729,2013-2014,42,,,0.0,0.0,37.0,37.6,277.0,...,,130.0,138.0,3.93,1.197,102.7,89.6,6.4,26.0,
83730,2013-2014,7,,,,,19.0,26.0,,...,,,,4.32,,53.0,22.8,9.9,,
83731,2013-2014,11,,,0.0,0.0,25.0,31.7,,...,,94.0,90.0,,,73.5,42.3,,,


In [224]:
nrow(df)

In [225]:
colnames(df)

# 2. Pre-processing

The data needs a few fixes before it is ready for analysis.


## 2.1 Homogenize variables that are the same but are named differently across NHANES years

1. Intake variables went from a single day of records in the 1999-2002 cycles to two days from 2003 onwards, so the variables have to be homogenized. Dinh et al. (2019) do not specify which examination records they used, but my best guess is that they took the average of the two examination days.

This applies to:

- Alcohol intake (`DRXTALCO`, `DR1TALCO`, `DR2TALCO`)
- Caffeine intake (`DRXTCAFF`, `DR1TCAFF`, `DR2TCAFF`)
- Calcium intake (`DRXTCALC`, `DR1TCALC`, `DR2TCALC`)
- Carbohydrate intake (`DRXTCARB`, `DR1TCARB`, `DR2TCARB`)
- Fiber intake (`DRXTFIBE`, `DR1TFIBE`, `DR2TFIBE`)
- Kcal intake (`DRXTKCAL`, `DR1TKCAL`, `DR2TKCAL`)
- Sodium intake (`DRDTSODI`, `DR1TSODI`, `DR2TSODI`)


2. Also, questions whose format changed slightly between cycles are recorded under different codes. Examples:

    - `MCQ250A`, and `MCQ300C`
    - `LBDHDDSI` and `LBDHDLSI`.

In [226]:
# DRXTALCO only in 1999-2002
unique(df$YEAR[!is.na(df$DRXTALCO)])


In [227]:
# DRXTALCO was replaced by DR1TALCO and DR2TALCO from 2003 onwards due to a new procedure.
unique(df$YEAR[!is.na(df$DR1TALCO)])


In [228]:
# Similar questions (or the same) with different NHANES variable codes
var_docs <- read_excel(paste0(DATA_PATH, "dinh_2019_variables_doc.xlsx"))
var_docs |> 
  filter(`NHANES Name` %in% c('MCQ250A', 'MCQ300C', 'MCQ300c', 'LBDHDDSI', 'LBDHDLSI'))

| Variable Name | NHANES Name | NHANES File | NHANES Type of data | Variable Definition |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| Blood relatives have diabetes | MCQ250A | MCQ | Questionnaire | Including living and deceased, were any of {SP's/ your} biological that is, blood relatives including grandparents, parents, brothers, sisters ever told by a health professional that they had . . .diabetes? |
| Close relative had diabetes | MCQ300c | MCQ | Questionnaire | Including living and deceased, were any of {SP's/your} close biological that is, blood relatives including father, mother, sisters or brothers, ever told by a health professional that they had diabetes? |
| Close relative had diabetes | MCQ300C | MCQ | Questionnaire | Including living and deceased, were any of {SP's/your} close biological that is, blood relatives including father, mother, sisters or brothers, ever told by a health professional that they had diabetes? |
| HDL-cholesterol | LBDHDLSI | Lab13, l13_b, HDL | Laboratory | HDL-cholesterol (mmol/L) |
| HDL-cholesterol | LBDHDDSI | Lab13, l13_b, HDL | Laboratory | HDL-cholesterol (mmol/L) |


In [229]:
unique(df$YEAR[!is.na(df$MCQ250A)])


In [230]:
unique(df$YEAR[!is.na(df$MCQ300C)])


In [231]:
unique(df$YEAR[!is.na(df$MCQ300c)])

In [232]:
unique(df$YEAR[!is.na(df$LBDHDLSI)])

In [233]:
unique(df$YEAR[!is.na(df$LBDHDDSI)])

In [234]:
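# Return one intake value per row: the single-day value (1999-2002) when present,
# otherwise the mean of the day-1 and day-2 records (2003 onwards).
create_intake_new_column <- function(df, day0_col, day1_col, day2_col) {
    two_day_mean <- rowMeans(df[, c(day1_col, day2_col)], na.rm = TRUE)
    # rowMeans() returns NaN when both days are missing; store NA instead
    two_day_mean[is.nan(two_day_mean)] <- NA_real_
    ifelse(is.na(df[[day0_col]]), two_day_mean, df[[day0_col]])
}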

df_formated <- df |>
# Create new columns
  mutate(
    # Alcohol intake
    Alcohol_Intake = create_intake_new_column(df,'DRXTALCO', 'DR1TALCO', 'DR2TALCO'),
    # Caffeine intake
    Caffeine_Intake = create_intake_new_column(df,'DRXTCAFF', 'DR1TCAFF', 'DR2TCAFF'),
    # Calcium intake
    Calcium_Intake = create_intake_new_column(df,'DRXTCALC', 'DR1TCALC', 'DR2TCALC'),
    # Carbohydrate intake
    Carbohydrate_Intake = create_intake_new_column(df,'DRXTCARB', 'DR1TCARB', 'DR2TCARB'),
    # Fiber intake
    Fiber_Intake = create_intake_new_column(df,'DRXTFIBE', 'DR1TFIBE', 'DR2TFIBE'),
    # Kcal intake
    Kcal_Intake = create_intake_new_column(df,'DRXTKCAL', 'DR1TKCAL', 'DR2TKCAL'),
    # Sodium intake
    Sodium_Intake = create_intake_new_column(df,'DRDTSODI', 'DR1TSODI', 'DR2TSODI'),
    # Relative_Had_Diabetes
    Relative_Had_Diabetes = coalesce(MCQ250A, MCQ300C, MCQ300c),
    # HDL-cholesterol
    HDL_cholesterol = coalesce(LBDHDLSI, LBDHDDSI)
   ) |>
# Delete old columns that are not needed
  select(-c(DRXTALCO, DR1TALCO, DR2TALCO, DRXTCAFF, DR1TCAFF, DR2TCAFF,
            DRXTCALC, DR1TCALC, DR2TCALC, DRXTCARB, DR1TCARB, DR2TCARB,
            DRXTFIBE, DR1TFIBE, DR2TFIBE, DRXTKCAL, DR1TKCAL, DR2TKCAL,
            DRDTSODI, DR1TSODI, DR2TSODI, MCQ250A, MCQ300C, MCQ300c, LBDHDLSI, LBDHDDSI)
            )

In [235]:
unique(df_formated$YEAR[!is.na(df_formated$Relative_Had_Diabetes)])


## 2.2 Trimming the data according to the authors' criteria

> In our study, all datasets were limited to non-pregnant subjects and adults of at least twenty years of age.

In [240]:
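df_formated <- df_formated |> 
  # Are you pregnant now? = "No" (2). Rows with NA are kept as well: the question
  # is not asked of every respondent, and `RHD143 == 2` alone would drop them all.
  filter(is.na(RHD143) | RHD143 == 2) |>
  filter(RIDAGEYR >= 20) 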
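
One thing to watch with this kind of trimming: dplyr's `filter()` drops rows where the condition evaluates to `NA`, so a comparison on a question that was only asked of some respondents also removes everyone who was never asked. A toy sketch with made-up values, assuming (as in the cell above) that code 2 means "No":

```r
library(dplyr)

# Made-up rows; NA in RHD143 means the pregnancy question was not asked
toy <- tibble(
  SEQN     = 1:5,
  RIDAGEYR = c(19, 25, 40, 33, 61),
  RHD143   = c(NA, 2, 1, NA, NA)
)

# Strict comparison: rows with NA in RHD143 are dropped together with the "Yes" row
strict <- toy |>
  filter(RHD143 == 2, RIDAGEYR >= 20)

# NA-aware version: keeps respondents who answered "No" or were never asked
na_aware <- toy |>
  filter(is.na(RHD143) | RHD143 == 2, RIDAGEYR >= 20)

nrow(strict)    # only SEQN 2 remains
nrow(na_aware)  # SEQN 2, 4 and 5 remain
```

Comparing `nrow()` before and after each filter is a cheap way to confirm the trimming removes only the intended respondents.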

> The preprocessing stage also converted any undecipherable values (errors in datatypes and standard formatting) from the database to null representations.

For this, I've checked each variable against its possible values in the NHANES documentation.
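
A minimal sketch of that kind of check for a single variable: values that the codebook lists as special codes rather than real measurements are turned into `NA`. The codes 777 and 999 used below are an assumption for illustration; the codes actually documented for each variable and survey cycle should be taken from the NHANES documentation.

```r
library(dplyr)

# Made-up answers; 777 and 999 stand in for "Refused" / "Don't know" style codes
toy <- tibble(ALQ130 = c(1, 3, 777, 2, 999, NA))

# Assumed special codes; verify against the codebook for each cycle
special_codes <- c(777, 999)

cleaned <- toy |>
  mutate(ALQ130 = if_else(ALQ130 %in% special_codes, NA_real_, ALQ130))

cleaned$ALQ130
```

Existing `NA` values pass through unchanged because `NA %in% special_codes` is `FALSE`.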

In [245]:
colnames(df_formated)

In [251]:
unique(df_formated$ALQ130)

In [250]:
df_formated |>
  # Rows with ages outside the 20-80 range. `|` is needed here: no age can be
  # both below 20 and above 80, so `&` always returns an empty result.
  filter(RIDAGEYR < 20 | RIDAGEYR > 80)


In [243]:
sort(unique(df_formated$RIDAGEYR))


Next, I check whether there are any extreme values that might be due to data-entry errors. According to the paper:

> The preprocessing stage also converted any undecipherable values (errors in datatypes and standard formatting) from the database to null representations.

In [None]:
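# 'col1', 'col2', 'colN' were placeholders; the three body-measure columns
# below are only an illustrative choice of numeric columns to inspect
boxplot(df[, c('BMXWAIST', 'BMXWT', 'BMXARMC')])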


In [None]:
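# Min-max scale each listed column to [0, 1], ignoring missing values.
# A constant column yields NaN because its range is zero.
normalize_columns <- function(master, columns) {
  for (column in columns) {
    values <- master[[column]]
    col_min <- min(values, na.rm = TRUE)
    col_max <- max(values, na.rm = TRUE)
    master[[column]] <- (values - col_min) / (col_max - col_min)
  }
  return(master)
}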
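
As a quick illustration of the min-max scaling idea, here is a self-contained sketch on made-up values. The helper `min_max_scale` is a hypothetical stand-alone variant that works on one vector at a time, so it can be applied with `across()`; whether the paper uses this particular scaling is not stated here.

```r
library(dplyr)

# Hypothetical helper: rescale a numeric vector to [0, 1], ignoring NAs
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up measurements with one missing value
toy <- tibble(
  BMXWAIST = c(80, 100, NA, 120),
  BMXWT    = c(50, 75, 100, 60)
)

scaled <- toy |>
  mutate(across(c(BMXWAIST, BMXWT), min_max_scale))

scaled
```

Missing values stay missing after scaling, and the minimum and maximum of each column map to 0 and 1.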
