# Dinh et al. (2019)

## A data-driven approach to predicting diabetes and cardiovascular disease with machine learning

URL: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-0918-5


## Brief Summary

Dinh et al. (2019) apply several ML models (logistic regression, support vector machines, random forest, and gradient boosting) to the NHANES dataset to predict i) diabetes and ii) cardiovascular disease ("CVD").

**Goal**: An identification mechanism for patients at risk of diabetes and cardiovascular disease, and the key contributors to diabetes.

**Results**:

Best scores:

- CVD prediction based on 131 NHANES variables achieved an AU-ROC score of 83.9%.
- Diabetes prediction based on 123 NHANES variables achieved an AU-ROC score of 95.7%.
- Pre-diabetic prediction based on 123 NHANES variables achieved an AU-ROC score of 84.4%.
- Top 5 predictors in diabetes patients were 1) `waist size`, 2) `age`, 3) `self-reported weight`, 4) `leg length`, 5) `sodium intake`.



This notebook replicates the results of the paper in the following steps:

1. NHANES data 
2. Pre-processing of the data
3. Transformation of the data
4. Train/Test Split 
5. CV 10-fold
6. Training monitoring using MLflow
7. Get metric results (AUC)


The structure of the analysis follows Figure 1 from the paper:

![Fig 1 from Dinh et al. 2019](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig1.png)


In [1]:
library(arrow)
library(dplyr)
library(readxl)


## 1. NHANES data

URL: https://www.cdc.gov/nchs/index.htm


## Target

From the paper, the definitions are clear: 

![Dinh et al.(2019), Table 4](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Table4.png)

- Case I: Diabetes.

    - Glucose >= 126 mg/dL. OR;
    - "Yes" to the question "Have you ever been told by a doctor that you have diabetes?"

- Case II: Undiagnosed Diabetes. 

    - Glucose >= 126 mg/dL. AND;
    - "No" to the question "Have you ever been told by a doctor that you have diabetes?"

- Cardio: Cardiovascular disease.

    - "Yes" to any of the questions "Have you ever been told by a doctor that you had congestive heart failure, coronary heart disease, a heart attack, or a stroke?"

The paper also defines and tests an additional target:

- Pre-diabetes

    - Glucose >= 100 mg/dL AND <= 125 mg/dL
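
As a minimal sketch, the diabetes-related definitions above could be encoded as follows. The helper name `label_targets` and its arguments are hypothetical; the sketch assumes glucose is already expressed in mg/dL and that the self-reported answer has been recoded to a logical (`TRUE` for "Yes", `FALSE` for "No"). It is not the code used in the paper.

```r
# Hypothetical helper: encode the diabetes-related targets from the definitions above.
# Assumes glucose is in mg/dL and told_diabetes is TRUE ("Yes") or FALSE ("No").
label_targets <- function(glucose_mgdl, told_diabetes) {
  data.frame(
    diabetes    = glucose_mgdl >= 126 | told_diabetes,   # Case I
    undiagnosed = glucose_mgdl >= 126 & !told_diabetes,  # Case II
    prediabetes = glucose_mgdl >= 100 & glucose_mgdl <= 125
  )
}

# Toy values, made up for illustration only.
toy <- label_targets(
  glucose_mgdl  = c(90, 110, 130, 140),
  told_diabetes = c(TRUE, FALSE, FALSE, TRUE)
)
print(toy)
```

Missing glucose values propagate as `NA` in the comparisons, so rows without a lab measurement would need separate handling.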

## Covariates

The paper does not list which NHANES variables were used. I emailed the corresponding author to request the list, but have not received an answer yet.

Given that NHANES has more than 3000 variables, I cannot simply pick the variables I believe are important.

For now, I will use the variables shown in [Figure 5](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig5.png) and [Figure 6](https://raw.githubusercontent.com/pipegalera/ml_diabetes/main/images/dinh_2019_Fig6.png) of the paper. I compiled them by hand in an Excel file using the NHANES variable search tool:



In [8]:
DATA_PATH <- "/Users/pipegalera/dev/ml_diabetes/data/NHANES/raw_data/"
dinh_2019_vars <- read_excel(paste0(DATA_PATH, "dinh_2019_variables_doc.xlsx"))

head(dinh_2019_vars[, c("Variable Name", "NHANES Name")], n=15)


Variable Name,NHANES Name
<chr>,<chr>
Age,RIDAGEYR
Alcohol consumption,ALQ130
Alcohol intake,DRXTALCO
"Alcohol intake, First Day",DR1TALCO
"Alcohol intake, Second Day",DR2TALCO
Arm circumference,BMXARMC
Arm length,BMXARML
Blood osmolality,LBXSOSSI
Blood relatives have diabetes,MCQ250A
Blood urea nitrogen,LBDSBUSI


For the complete list (n=62), check the file `dinh_2019_variables_doc.xlsx` in the NHANES data folder.

NHANES data is spread across multiple files (see `NHANES` under the data folder) that have to be compiled together. The data was downloaded automatically via a script, all the files were converted from SAS to parquet, and the files were then stacked and merged on the individual index (`SEQN`). For more details, please check the `nhanes_data_backfill` notebook.

Please note that no transformations are applied to the covariates; the files were only arranged and stacked together.
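
The stack-and-merge step described above can be illustrated with a small sketch. The data frames and values below are made up, and base R is used to keep the example self-contained; the actual pipeline lives in the `nhanes_data_backfill` notebook and may differ.

```r
# Toy stand-ins for two NHANES components across two survey cycles (values are made up).
demo_1999 <- data.frame(SEQN = c(1, 2), RIDAGEYR = c(2, 77))
demo_2001 <- data.frame(SEQN = c(9966, 9967), RIDAGEYR = c(34, 51))
body_1999 <- data.frame(SEQN = c(1, 2), BMXWT = c(12.5, 75.4))
body_2001 <- data.frame(SEQN = 9966, BMXWT = 80.1)

# Stack the cycles of each component, tagging rows with their survey cycle.
demo <- rbind(
  transform(demo_1999, YEAR = "1999-2000"),
  transform(demo_2001, YEAR = "2001-2002")
)
body <- rbind(body_1999, body_2001)

# Merge the components on the respondent identifier, keeping unmatched respondents.
merged <- merge(demo, body, by = "SEQN", all = TRUE)
print(merged)
```

Respondents that appear in one component but not the other (here `SEQN` 9967) are kept with `NA` in the missing columns, which matches the idea that nothing is transformed or dropped at this stage.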

In [15]:
df <- read_parquet(paste0(DATA_PATH, "dinh_raw_data.parquet"))
head(df)

SEQN,YEAR,RIDAGEYR,ALQ130,DRXTALCO,DR1TALCO,DR2TALCO,BMXARMC,BMXARML,LBXSOSSI,...,BPXSY1,BPXSY4,BPXSY2,BPXSY3,LBDTCSI,LBDSTRSI,BMXWAIST,BMXWT,LBXWBCSI,LBXSASSI
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1999-2000,2,,5.397605e-79,,,15.2,18.6,,...,,,,,,,45.7,12.5,,
2,1999-2000,77,1.0,5.397605e-79,,,29.8,38.2,288.0,...,106.0,,98.0,98.0,5.56,1.298,98.0,75.4,7.6,19.0
3,1999-2000,10,,5.397605e-79,,,19.7,25.5,,...,110.0,,104.0,112.0,3.34,,64.7,32.9,7.5,
4,1999-2000,1,,5.397605e-79,,,16.4,20.4,,...,,,,,,,,13.3,8.8,
5,1999-2000,49,3.0,34.56,,,35.8,39.7,276.0,...,122.0,,122.0,122.0,7.21,3.85,99.9,92.5,5.9,22.0
6,1999-2000,19,,5.397605e-79,,,26.0,34.5,277.0,...,116.0,,116.0,112.0,3.96,0.553,81.6,59.2,9.6,20.0


In [16]:
tail(df)

SEQN,YEAR,RIDAGEYR,ALQ130,DRXTALCO,DR1TALCO,DR2TALCO,BMXARMC,BMXARML,LBXSOSSI,...,BPXSY1,BPXSY4,BPXSY2,BPXSY3,LBDTCSI,LBDSTRSI,BMXWAIST,BMXWT,LBXWBCSI,LBXSASSI
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
102951,2017-2018,4,,,5.397605e-79,5.397605e-79,20.9,24.0,,...,,,,,,,62.2,23.8,8.5,
102952,2017-2018,70,,,5.397605e-79,5.397605e-79,25.1,32.6,288.0,...,136.0,,142.0,140.0,3.08,1.106,82.2,49.0,5.1,27.0
102953,2017-2018,42,12.0,,5.397605e-79,,40.6,36.6,289.0,...,124.0,,122.0,116.0,4.71,1.287,114.8,97.4,8.3,29.0
102954,2017-2018,41,,,5.397605e-79,5.397605e-79,26.8,35.2,272.0,...,116.0,,118.0,114.0,4.45,0.723,86.4,69.1,5.1,15.0
102955,2017-2018,14,,,5.397605e-79,5.397605e-79,44.5,35.0,274.0,...,114.0,,114.0,114.0,3.88,1.005,113.5,111.9,11.4,16.0
102956,2017-2018,38,2.0,,5.397605e-79,5.397605e-79,40.0,38.0,277.0,...,150.0,,146.0,148.0,4.22,3.263,122.0,111.5,9.0,27.0


In [17]:
nrow(df)


## 2. Pre-processing

Some fixes are needed before the data is ready for analysis. For example, some variables measure the same thing but have different names in different NHANES years:

1. Intake variables went from 1 day of records (1999 to 2001) to 2 days (from 2003 on), so the variable has to be harmonized. Dinh et al. (2019) do not specify which examination records they used, but my best guess is that they took the average of both examination days.

This situation occurs with:

- Alcohol intake (`DRXTALCO`, `DR1TALCO`, `DR2TALCO`)
- Caffeine intake (`DRXTCAFF`, `DR1TCAFF`, `DR2TCAFF`)
- Calcium intake (`DRXTCALC`, `DR1TCALC`, `DR2TCALC`)
- Carbohydrate intake (`DRXTCARB`, `DR1TCARB`, `DR2TCARB`)
- Fiber intake (`DRXTFIBE`, `DR1TFIBE`, `DR2TFIBE`)
- Kcal intake (`DRXTKCAL`, `DR1TKCAL`, `DR2TKCAL`)
- Sodium intake (`DRDTSODI`, `DR1TSODI`, `DR2TSODI`)
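
One possible way to harmonize each of these triples is sketched below. The helper `harmonize_intake` is hypothetical, and falling back to a single recall day when only one is available is an assumption of this sketch, not something the paper states.

```r
# Hypothetical helper: build one intake column from the single-day variable
# (earlier cycles) and the two recall-day variables (later cycles).
harmonize_intake <- function(df, single_day, day1, day2) {
  # Average the available recall days; rows with neither day give NaN, reset to NA.
  two_day_mean <- rowMeans(df[, c(day1, day2)], na.rm = TRUE)
  two_day_mean[is.nan(two_day_mean)] <- NA
  # Prefer the single-day value when it exists, otherwise use the recall-day mean.
  ifelse(is.na(df[[single_day]]), two_day_mean, df[[single_day]])
}

# Toy values, made up for illustration only.
toy <- data.frame(
  DRXTALCO = c(10, NA, NA, NA),
  DR1TALCO = c(NA, 20, 30, NA),
  DR2TALCO = c(NA, 40, NA, NA)
)
toy$alcohol_intake <- harmonize_intake(toy, "DRXTALCO", "DR1TALCO", "DR2TALCO")
print(toy)
```

The same call can be repeated for each nutrient in the list by swapping the three column names.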


In [21]:
unique(df$YEAR[!is.na(df$DRXTALCO)])


In [23]:
unique(df$YEAR[!is.na(df$DR1TALCO)])


In [31]:
# Use the single-day value when available; otherwise average the two recall days
df$test <- ifelse(is.na(df$DRXTALCO), (df$DR1TALCO + df$DR2TALCO)/2, df$DRXTALCO)

In [46]:
unique(df$DR1TALCO[!is.na(df$DR1TALCO)])

In [52]:
df |>
filter(DR1TALCO == 58.9)

SEQN,YEAR,RIDAGEYR,ALQ130,DRXTALCO,DR1TALCO,DR2TALCO,BMXARMC,BMXARML,LBXSOSSI,...,BPXSY4,BPXSY2,BPXSY3,LBDTCSI,LBDSTRSI,BMXWAIST,BMXWT,LBXWBCSI,LBXSASSI,test
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80519,2013-2014,36,3,,58.9,5.397605e-79,33.1,33.3,270,...,,96,106,5.3,1.964,107.0,71.8,6.9,19,29.45
93076,2015-2016,35,2,,58.9,5.397605e-79,35.5,39.6,281,...,,116,116,4.97,0.305,86.3,81.9,5.6,23,29.45


In [43]:
df$SEQN <- as.integer(df$SEQN)
df$RIDAGEYR <- as.integer(df$RIDAGEYR)
df$RIAGENDR <- as.integer(df$RIAGENDR)


In [41]:
filtered_df <- df |> 
  filter(is.na(RHD143) | RHD143 == 2) |>  # Are you pregnant now? = "No", or question not asked
  filter(RIDAGEYR >= 20) |>
  filter(YEAR != "2015-2016")

> The data was further analyzed for missing values within the variables, and any with more than 50% of missing values were dropped from the dataset.

Since the pregnancy question (RHD143) was asked only of *Females only 20 YEARS - 44 YEARS*, its missing values are men, or women outside that age range, so I will not delete them.

In [51]:
missing_values <- sapply(df, function(x) mean(is.na(x))) * 100
missing_values_df <- data.frame(missing_percentage = missing_values) |>
                        arrange(desc(missing_percentage))
print(missing_values_df)

         missing_percentage
RHD143            93.640156
BPXDI4            91.407964
BPXSY4            91.407964
LBDLDL            80.323043
LBDTRSI           80.009124
LBDGLUSI          79.577893
LBXGH             57.676349
BPXDI3            33.969499
BPXSY3            33.968413
BPXDI2            33.337316
BPXSY2            33.337316
BPXDI1            32.356455
BPXSY1            32.356455
BPXPLS            27.077404
BMXWAIST          16.254263
BMXBMI            13.325802
BMXHT             12.859812
BMXWT              5.930786
DIQ010             4.515435
SEQN               0.000000
RIDAGEYR           0.000000
RIAGENDR           0.000000
YEAR               0.000000
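
A minimal sketch of the 50% rule quoted above, with an explicit exemption for variables such as `RHD143` that are missing by design. The helper `drop_sparse_columns` and its `keep` argument are hypothetical and are not taken from the paper.

```r
# Hypothetical helper: drop columns whose share of missing values exceeds a threshold,
# except for columns listed in `keep`.
drop_sparse_columns <- function(df, threshold = 50, keep = character()) {
  missing_pct <- sapply(df, function(x) mean(is.na(x))) * 100
  to_drop <- setdiff(names(missing_pct)[missing_pct > threshold], keep)
  df[, !(names(df) %in% to_drop), drop = FALSE]
}

# Toy values, made up for illustration only.
toy <- data.frame(
  SEQN   = 1:4,
  RHD143 = c(NA, NA, NA, 2),    # 75% missing, kept on purpose
  BPXSY4 = c(NA, NA, NA, 120),  # 75% missing, dropped
  BMXWT  = c(70, NA, 65, 80)    # 25% missing, kept
)
reduced <- drop_sparse_columns(toy, threshold = 50, keep = "RHD143")
print(names(reduced))
```

Applied to the table above, such a rule would flag the columns listed above `BPXDI3`, so which of them to exempt is a modelling decision rather than something the sketch settles.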



The next step is to check for extreme values that might be due to data-entry errors. According to the paper:

> The preprocessing stage also converted any undecipherable values (errors in datatypes and standard formatting) from the database to null representations.

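The paper does not describe how this conversion was implemented. One possible reading, sketched here with a hypothetical helper `to_numeric_or_na`, is to coerce each raw column to numeric and let entries that cannot be parsed become `NA`.

```r
# Hypothetical helper: coerce a raw column to numeric, turning unparseable entries into NA.
to_numeric_or_na <- function(x) {
  # as.numeric() warns on entries it cannot parse; the warning is expected here.
  suppressWarnings(as.numeric(as.character(x)))
}

# Toy values, made up for illustration only.
raw <- c("98.6", "n/a", "", "120", "12,5")
clean <- to_numeric_or_na(raw)
print(clean)
```

Entries such as `"n/a"`, empty strings, or numbers written with a decimal comma end up as `NA`, while well-formed numbers are preserved.
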
In [None]:
boxplot(df[, c('col1', 'col2', 'colN')])


In [None]:
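# Min-max scale each listed column to the [0, 1] range, ignoring missing values.
normalize_columns <- function(master, columns) {
  for (column in columns) {
    values <- master[[column]]
    col_min <- min(values, na.rm = TRUE)
    col_max <- max(values, na.rm = TRUE)
    master[[column]] <- (values - col_min) / (col_max - col_min)
  }
  return(master)
}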
