# Modelling: Predicting Diabetes with NHANES Data

### Author: hl-n

## Overview

This notebook covers the training, optimisation, and evaluation of machine learning models that predict the presence of diabetes from the National Health and Nutrition Examination Survey (NHANES) dataset. We compare several models, including logistic regression, decision trees, and random forests, to identify the best-performing approach for this task.


## Import Relevant Modules

In [1]:
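import os

# Assumes the notebook is run from a direct subdirectory of the project root:
# move up one level so that the `src` package and config.yaml resolve.
if os.path.basename(os.getcwd()) != "diabetes-prediction-NHANES":
    os.chdir("..")

from src.utils.config_utils import load_config
from src.data_preparation.data_ingestion import load_dataset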

## Loading and Processing Raw Data

Before modelling, we load and process the raw data using the data preparation steps from the EDA.

We start by loading the raw dataset with the data ingestion module. The URL of the raw dataset and the file path it is saved to are both stored in the config file.
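
The implementation of `load_dataset` is not shown in this notebook. As a minimal sketch of the download-and-cache pattern that its two arguments suggest, assuming the file is fetched from `url` only when `file_path` does not yet exist, a standard-library version could look like the following. The helper name `fetch_if_missing` is hypothetical and not part of the project.

```python
import urllib.request
from pathlib import Path


def fetch_if_missing(file_path: str, url: str) -> Path:
    """Hypothetical helper: download `url` to `file_path` unless it already exists."""
    path = Path(file_path)
    if not path.exists():
        # Create the parent directory before writing the downloaded bytes.
        path.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            path.write_bytes(response.read())
    # On later calls the cached copy is returned without touching the network.
    return path
```

The cached file could then be read into a DataFrame, for example with `pandas.read_csv`; whether the project's module works this way is an assumption.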

In [2]:
config = load_config(config_path="config.yaml")
df = load_dataset(
    file_path=config.get("raw_dataset_path"),
    url=config.get("raw_dataset_url")
)
df  # Preview the DataFrame

Unnamed: 0,seqn,sex,age,re,income,tx,dx,wt,ht,bmi,leg,arml,armc,waist,tri,sub,gh,albumin,bun,SCr
0,51624,male,34.166667,Non-Hispanic White,"[25000,35000)",0,0,87.4,164.7,32.22,41.5,40.0,36.4,100.4,16.4,24.9,5.2,4.8,6.0,0.94
1,51626,male,16.833333,Non-Hispanic Black,"[45000,55000)",0,0,72.3,181.3,22.00,42.0,39.5,26.6,74.7,10.2,10.5,5.7,4.6,9.0,0.89
2,51628,female,60.166667,Non-Hispanic Black,"[10000,15000)",1,1,116.8,166.0,42.39,35.3,39.0,42.2,118.2,29.6,35.6,6.0,3.9,10.0,1.11
3,51629,male,26.083333,Mexican American,"[25000,35000)",0,0,97.6,173.0,32.61,41.7,38.7,37.0,103.7,19.0,23.2,5.1,4.2,8.0,0.80
4,51630,female,49.666667,Non-Hispanic White,"[35000,45000)",0,0,86.7,168.4,30.57,37.5,36.1,33.3,107.8,30.3,28.0,5.3,4.3,13.0,0.79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6790,62155,male,33.000000,Mexican American,"[35000,45000)",0,0,94.3,163.5,35.28,34.4,34.7,35.5,112.3,20.2,,5.4,4.1,10.0,0.97
6791,62156,female,48.916667,Non-Hispanic White,"[0,5000)",0,1,87.1,156.9,35.38,33.9,34.5,37.0,99.4,28.6,25.4,5.5,4.1,7.0,0.89
6792,62157,male,27.500000,Other Hispanic,"[35000,45000)",0,0,57.0,164.3,21.12,35.3,33.7,29.6,73.2,4.2,6.8,5.6,4.5,11.0,0.94
6793,62158,male,75.750000,Non-Hispanic Black,"[10000,15000)",0,0,75.1,162.7,28.37,38.6,36.8,31.2,104.0,19.8,21.1,5.4,4.0,19.0,1.34
