## **Machine Learning Approach for Early Diagnosis of Type 2 Diabetes in the Spanish Adult Population**

MSc in Data Science, Universitat Oberta de Catalunya (UOC)

**Joana Llauradó Pont**, Master Student  
**Laia Carreté Muñoz**, Project Supervisor  
Laia Subirats, Coordinator Professor

---

### SECTION 2: DEVELOP AND TEST ML MODELS
Train different ML models to detect the likelihood of diabetes and compute evaluation metrics.

---

In [91]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [92]:
# Import libraries
import os
import joblib
import numpy as np
import pandas as pd

In [93]:
# Define the path to save results and create the directory if it does not exist
results_dir = "/content/drive/MyDrive/TFM 2024/results"
os.makedirs(results_dir, exist_ok=True)

DEFINE FUNCTIONS TO PREPARE DATA

We need to parse a participant's input data so that it can be handled by the preprocessor and the model.

Preprocessor expects 39 features:
['num__Edad_first' 'num__Colesterol_mean' 'num__Colesterol_std'
 'num__Colesterol_max' 'num__Colesterol_min' 'num__HDL-Colesterol_mean'
 'num__HDL-Colesterol_std' 'num__HDL-Colesterol_max'
 'num__HDL-Colesterol_min' 'num__Hb-Glicosilada_mean'
 'num__Hb-Glicosilada_std' 'num__Hb-Glicosilada_max'
 'num__Hb-Glicosilada_min' 'num__LDL-Calculado_mean'
 'num__LDL-Calculado_std' 'num__LDL-Calculado_max'
 'num__LDL-Calculado_min' 'num__Trigliceridos_mean'
 'num__Trigliceridos_std' 'num__Trigliceridos_max'
 'num__Trigliceridos_min' 'cat__Sexo_first_Hombre' 'cat__Sexo_first_Mujer'
 'cat__symptoms_first_Diabetes' 'cat__symptoms_first_Diabetes, sobrepeso'
 'cat__symptoms_first_anemia, fatiga'
 'cat__symptoms_first_fatiga, vision borrosa'
 'cat__symptoms_first_hiperlipidemia, sedentarismo'
 'cat__symptoms_first_hipertension' 'cat__symptoms_first_hipotension'
 'cat__symptoms_first_obesidad, hipertension, fatiga'
 'cat__symptoms_first_sedentarismo'
 'cat__symptoms_first_sedentarismo, hipertension, sin diagnostico'
 'cat__symptoms_first_sedentarismo, obesidad'
 'cat__symptoms_first_sedentarismo, sed habitual y necesidad de beber'
 'cat__symptoms_first_sedentarismo, sobrepeso'
 'cat__symptoms_first_sin diagnostico' 'cat__symptoms_first_sin sintomas'
 'cat__symptoms_first_sobrepeso']

### **CASE EXAMPLE**: Profile of an imaginary participant and the characteristics we may have:

* Symptoms: "hiperlipidemia", "sedentarismo", "hipertension", "sed habitual y necesidad de beber", "Diabetes", "sobrepeso", "obesidad, hipertension, fatiga", "fatiga, vision borrosa", "sin sintomas"
* Edad: Participant's age (numerical).
* Sexo: Gender of the participant (categorical: "Mujer", "Hombre").
* Prueba: The type of test performed, categorized into: "Colesterol", "HDL-Colesterol", "Hb-Glicosilada", "LDL-Calculado", "Trigliceridos"
* Resultado: Numerical value representing the test result.
* Rango Inferior and Superior: Lower and upper limits of the test's reference range.
  * "Colesterol": (70, 200)
  * "HDL-Colesterol": (40, 60)
  * "Hb-Glicosilada": (3.5, 6)
  * "LDL-Calculado": (0, 100)
  * "Trigliceridos": (35, 200)
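
The reference ranges above can also be used to flag individual results before they are aggregated. A minimal sketch, assuming both limits are inclusive; the names `REFERENCE_RANGES` and `is_out_of_range` are hypothetical and are not part of the trained pipeline:

```python
# Reference ranges copied from the list above: (lower limit, upper limit).
REFERENCE_RANGES = {
    "Colesterol": (70, 200),
    "HDL-Colesterol": (40, 60),
    "Hb-Glicosilada": (3.5, 6),
    "LDL-Calculado": (0, 100),
    "Trigliceridos": (35, 200),
}


def is_out_of_range(prueba, resultado):
    """Return True if the result falls outside the reference range of the test.

    Assumption: both limits are treated as inclusive.
    """
    lower, upper = REFERENCE_RANGES[prueba]
    return not (lower <= resultado <= upper)


print(is_out_of_range("Colesterol", 180))      # False: within (70, 200)
print(is_out_of_range("Hb-Glicosilada", 6.5))  # True: above the upper limit of 6
```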


In [98]:
import pandas as pd
import numpy as np
import joblib
from sklearn.preprocessing import OneHotEncoder

# Load model and preprocessor
model_path = "/content/drive/MyDrive/TFM 2024/results/random_forest_agg_test.joblib"
preprocessor_path = "/content/drive/MyDrive/TFM 2024/results/preprocessor_agg.joblib"

model = joblib.load(model_path)
preprocessor = joblib.load(preprocessor_path)

# Inspect the features expected by the preprocessor
preprocessor_features = preprocessor.get_feature_names_out()
print(f"Preprocessor expects {len(preprocessor_features)} features:")
print(preprocessor_features)

def arrange_features_for_preprocessing(participant_features):
    """
    Rearranges and aggregates participant data to fit the preprocessor's expected format.

    Expects one row per test result with the columns 'Prueba', 'Resultado', 'Edad',
    'Sexo' and, optionally, 'Symptoms'. Age, sex and symptoms are taken from the
    first row only. Returns a single-row DataFrame.
    """
    tests = ['Colesterol', 'HDL-Colesterol', 'Hb-Glicosilada', 'LDL-Calculado', 'Trigliceridos']

    # Age comes from the first row; default to 0 if the column is missing
    aggregated_data = {
        'Edad_first': participant_features['Edad'].iloc[0] if 'Edad' in participant_features else 0
    }

    # Aggregate numerical features per test; tests without any result default to 0
    for test in tests:
        results = participant_features.loc[participant_features['Prueba'] == test, 'Resultado']
        has_results = not results.empty
        aggregated_data[f'{test}_mean'] = results.mean() if has_results else 0
        aggregated_data[f'{test}_std'] = results.std(ddof=0) if has_results else 0
        aggregated_data[f'{test}_max'] = results.max() if has_results else 0
        aggregated_data[f'{test}_min'] = results.min() if has_results else 0

    # Ensure categorical fields are provided correctly
    aggregated_data["Sexo_first"] = participant_features['Sexo'].iloc[0]
    aggregated_data["symptoms_first"] = (
        participant_features['Symptoms'].iloc[0]
        if 'Symptoms' in participant_features.columns and pd.notna(participant_features['Symptoms'].iloc[0])
        else "unknown"
    )

    # Return as DataFrame
    return pd.DataFrame([aggregated_data])






EXAMPLE 1: PARTICIPANT WITH A SINGLE TEST AT ONE TIME POINT

In [101]:
# Participant data example
participant_features = pd.DataFrame([
    {"Prueba": "Colesterol", "Resultado": 180, "Edad": 60, "Sexo": "Mujer", "Symptoms": "hiperlipidemia"}
])

# Arrange features
arranged_features = arrange_features_for_preprocessing(participant_features)

# Preprocess input
processed_features = preprocessor.transform(arranged_features)

# Predict likelihood of T2DB
likelihood = model.predict_proba(processed_features)[:, 1]
print(f"Likelihood of T2DB: {likelihood[0]}")


Likelihood of T2DB: 0.26206177919261436
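
Note that the symptom value "hiperlipidemia" used in this example is not among the categories listed by the preprocessor (only "hiperlipidemia, sedentarismo" is). How an unseen category is handled depends on the encoder configuration, which is not shown in this notebook. Since the transform did not raise an error, the encoder was presumably fitted with `handle_unknown="ignore"`, but that is an assumption. The sketch below uses a toy `OneHotEncoder` (not the saved preprocessor) to show what that setting does: an unseen category is encoded as a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy encoder fitted on two symptom categories (illustrative, not the saved preprocessor).
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["hiperlipidemia, sedentarismo"], ["sin sintomas"]], dtype=object))

# A category seen during fitting activates exactly one column.
seen = encoder.transform(np.array([["sin sintomas"]], dtype=object)).toarray()

# A category not seen during fitting is encoded as a row of zeros.
unseen = encoder.transform(np.array([["hiperlipidemia"]], dtype=object)).toarray()

print(seen)
print(unseen)
```

If this is what happens in the saved preprocessor, the prediction above is made without any symptom information, so it may be preferable to pass one of the category strings the preprocessor was fitted on.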


EXAMPLE 2: PARTICIPANT WITH MULTIPLE TEST RESULTS

In [103]:
participant_features = pd.DataFrame([
    {"Prueba": "Colesterol", "Resultado": 180, "Edad": 50, "Sexo": "Mujer", "Symptoms": "hiperlipidemia"},
    {"Prueba": "Colesterol", "Resultado": 190, "Edad": 50, "Sexo": "Mujer", "Symptoms": "hiperlipidemia"}
])
# Arrange features
arranged_features = arrange_features_for_preprocessing(participant_features)


# Preprocess input
processed_features = preprocessor.transform(arranged_features)

# Predict likelihood of T2DB
likelihood = model.predict_proba(processed_features)[:, 1]
print(f"Likelihood of T2DB: {likelihood[0]}")


Likelihood of T2DB: 0.22983587883842188
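
With two cholesterol results, the four aggregates passed to the preprocessor can be checked by hand. The sketch below recomputes them with pandas. It also illustrates why `ddof=0` matters in the aggregation function: the default sample standard deviation (`ddof=1`) is undefined for a participant with a single result.

```python
import pandas as pd

results = pd.Series([180, 190])  # the two Colesterol results from the example above

print(results.mean())       # 185.0
print(results.std(ddof=0))  # 5.0 (population standard deviation)
print(results.max())        # 190
print(results.min())        # 180

# With a single measurement, ddof=0 gives 0.0 while the default ddof=1 gives NaN.
single = pd.Series([180])
print(single.std(ddof=0))   # 0.0
print(single.std())         # nan
```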


In [106]:
participant_features = pd.DataFrame([
    {"Prueba": "Colesterol", "Resultado": 180, "Edad": 50, "Sexo": "Mujer", "Symptoms": "hiperlipidemia"},
    {"Prueba": "Colesterol", "Resultado": 190, "Edad": 50, "Sexo": "Mujer", "Symptoms": "hiperlipidemia"},
    {"Prueba": "Hb-Glicosilada", "Resultado": 6.5, "Edad": 50, "Sexo": "Mujer", "Symptoms": "hiperlipidemia"},
    {"Prueba": "Hb-Glicosilada", "Resultado": 7.0, "Edad": 50, "Sexo": "Mujer", "Symptoms": "sed habitual y necesidad de beber"}
])
# Arrange features
arranged_features = arrange_features_for_preprocessing(participant_features)

# Preprocess input
processed_features = preprocessor.transform(arranged_features)

# Predict likelihood of T2DB
likelihood = model.predict_proba(processed_features)[:, 1]
print(f"Likelihood of T2DB: {likelihood[0]}")


Likelihood of T2DB: 0.38797340645194844
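
The three steps repeated in each example (arrange, preprocess, predict) could be wrapped in a single helper. This is only a sketch: `predict_likelihood` is a hypothetical name, and the stub classes stand in for the saved preprocessor and model so that the block runs on its own.

```python
import numpy as np


def predict_likelihood(participant_features, arrange, preprocessor, model):
    """Arrange, preprocess, and return the positive-class probability for one participant."""
    arranged = arrange(participant_features)
    processed = preprocessor.transform(arranged)
    return float(model.predict_proba(processed)[:, 1][0])


# Stand-ins for illustration only; in the notebook these would be
# arrange_features_for_preprocessing, preprocessor, and model.
class StubPreprocessor:
    def transform(self, X):
        return np.asarray(X, dtype=float)


class StubModel:
    def predict_proba(self, X):
        # Fixed probabilities for classes [0, 1]; not a real prediction.
        return np.tile([0.75, 0.25], (len(X), 1))


likelihood = predict_likelihood([[180.0]], lambda x: x, StubPreprocessor(), StubModel())
print(f"Likelihood: {likelihood}")
```

In the notebook, the call would presumably look like `predict_likelihood(participant_features, arrange_features_for_preprocessing, preprocessor, model)`.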
