# Diabetes prediction
Supervised Learning

> Faculty of Engineering at University of Porto  
> Bachelor's Degree in Informatics and Computing Engineering  
> Artificial Intelligence (L.EIC029) 2024/2025
>
> - Luís Paulo Gonçalves dos Reis (Regent of the course)
> - Telmo João Vales Ferreira Barros (Theoretical-Practical classes)

> **Class 09; Group 02**
>
> - Duarte Souto Assunção (up202208319@up.pt)
> - Guilherme Duarte Silva Matos (up202208755@up.pt)
> - João Vítor da Costa Ferreira (up202208393@up.pt)

Before running this notebook, read the [README.md](./README.md), which covers the software prerequisites and the installation process.

## Problem definition and data exploration

This is a binary classification problem, where the objective is to **predict the diabetes status of the patient**, given various biomedical measurements and patient characteristics.

The [given dataset](./dataset.csv) has these characteristics:
- 800 unique patients;
- 200 duplicate IDs with different genders (assumed to be different patients);
- 56% male and 43% female;
- 70.6% of patients are between 50 and 61 years old;
- Urea, Cr, TG, HDL and VLDL may have outliers;
- 84% of patients have diabetes, i.e., the dataset is highly imbalanced!


## 0. Crate imports and initialization

⚠️ Make sure to execute this **before everything**!

⚠️ This step may take ~5 minutes the first time it runs!

- This code snippet compiles all dependencies and auxiliary modules;
- The compiled artifacts are reused between runs in the same kernel instance.

In [47]:
// Auxiliary library at ./src (if needed)
:dep ai_lib = { package = "ai-diabetes", path = ".", version = "*" }

:dep csv = { version = "1.3.1" }
:dep rand = { version = "0.8.5" }

// ~ Scikit-learn
:dep linfa = { version = "0.7.1" }
:dep linfa-svm = { version = "0.7.2" }
:dep linfa-trees = { version = "0.7.1" }
:dep linfa-bayes = { version = "0.7.1" }
:dep linfa-preprocessing = { version = "0.7.1" }

// ~ NumPy
:dep ndarray = { version = "0.15", default-features = false } 
:dep ndarray-csv = { version = "0.5.1" }

// ~ Matplotlib
:dep plotters = { version = "0.3.7", features = ["boxplot", "evcxr", "all_series", "all_elements"] } 

// ~ Pandas
:dep polars = { version = "0.46.0", features = ["lazy", "csv", "polars-io", "describe"] }  

## 1. Data Loading and Type Conversion

This step loads the [dataset.csv](./dataset.csv) into:
- A two-dimensional array with the features (`data`);
- An array with the targets, i.e., the "Class" column (`targets`).

To ensure that all features share the same type (`f64`), some preprocessing is done in this step:
- The first two columns ("ID" and "No_Pation") are not included in `data`;
- The "Gender" column is encoded as 0 (female) or 1 (male) and stored as `f64`;
- The "Class" column is converted into a boolean: `true` for "Y" and `false` for "N" and "P".
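
The two categorical conversions can be sketched as standalone functions. This is a minimal illustration, not the notebook's code: `encode_gender` and `encode_class` are hypothetical names, and returning `Option` for unknown labels is a choice of this sketch. The mapping itself mirrors the loading cell, where only "Y" counts as diabetic.

```rust
/// Encode the "Gender" column as 0.0 / 1.0 (hypothetical helper).
fn encode_gender(raw: &str) -> Option<f64> {
    match raw.trim() {
        "M" | "m" => Some(1.0),
        "F" | "f" => Some(0.0),
        _ => None, // unexpected label: let the caller decide how to fail
    }
}

/// Encode the "Class" column as a boolean target (hypothetical helper).
/// Only "Y" maps to `true`; "N" and "P" both map to `false`,
/// mirroring the loading cell.
fn encode_class(raw: &str) -> Option<bool> {
    match raw.trim() {
        "Y" | "y" => Some(true),
        "N" | "n" | "P" | "p" => Some(false),
        _ => None,
    }
}
```

Trimming before matching matters because labels padded with whitespace (for example `"Y "`) would otherwise be rejected.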

In [48]:
use csv::ReaderBuilder;
use ndarray::Array1;
use ndarray::Array2;
use std::num::ParseFloatError;
use ndarray::Axis;
use ndarray::ArrayView;

const CSV_PATH: &str = "dataset.csv";
let columns: Vec<&'static str> = vec![
    "Gender", "AGE", "Urea", "Cr", "HbA1c", "Chol", "TG", "HDL", "LDL", "VLDL", "BMI",
];

let mut reader = ReaderBuilder::new()
    .has_headers(true)
    .delimiter(b',')
    .from_path(CSV_PATH)
    .expect("Cannot create reader");

let mut data: Array2<f64> = Array2::zeros((0, 11));
let mut targets: Array1<bool> = Array1::default(0);
for result in reader.records() {
    let record = result.expect("Error reading record");
    let row: Vec<String> = record.iter().map(|s| s.to_string()).collect();

    // The first two columns (ID, No_Pation) are not used in the model
    let mut filtered_row = row[2..].to_vec();

    // Encode "Gender" as a number: "M" -> 1, "F" -> 0
    match filtered_row[0].as_str().trim() {
        "M" | "m" => filtered_row[0] = "1".to_string(),
        "F" | "f" => filtered_row[0] = "0".to_string(),
        _ => {} // left as-is; parsing below fails with "Error parsing row"
    }

    // Encode "Class" as a number: only "Y" maps to 1; "N" and "P" both
    // map to 0. The value is turned into a `bool` target further below.
    match filtered_row[11].as_str().trim() {
        "N" | "n" | "P" | "p" => filtered_row[11] = "0".to_string(),
        "Y" | "y" => filtered_row[11] = "1".to_string(),
        _ => {} // left as-is; parsing below fails with "Error parsing row"
    }

    let parsed_row = filtered_row
        .iter()
        .map(|s| s.trim().parse::<f64>())
        .collect::<Result<Vec<f64>, ParseFloatError>>()
        .expect("Error parsing row");

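    // Fail loudly instead of silently dropping a row that cannot be appended
    data.append(
        Axis(0),
        ArrayView::from_shape((1, 11), &parsed_row[0..11]).unwrap(),
    )
    .expect("Error appending row");
    // The target is `true` only when "Class" was encoded as 1 ("Y")
    targets.append(
        Axis(0),
        ArrayView::from_shape(1, &[((parsed_row[11] - 1.0).abs() <= f64::EPSILON)])
            .unwrap(),
    )
    .expect("Error appending target");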
}

()

## 2. Data Cleaning

No data cleaning is needed due to the nature of this dataset:
- Although several columns contain outliers, these values are still valid and accurate;
- No `null` or missing values are present;
- There are 200 duplicate patient IDs, but their values differ and were recorded
years apart, so no duplicate is removed; each one is treated as a new patient.

Remember that the columns "ID" and "No_Pation" were already removed and that
this dataset is small, with only 1000 entries.

## 3. Data Preprocessing

### 3.1. Oversample non-diabetic classes

In [49]:
use ndarray::s;
use rand::Rng;

// Indices of the minority (non-diabetic) rows
let non_diabetic_indices: Vec<usize> = targets
    .iter()
    .enumerate()
    .filter_map(|(i, &target)| (!target).then_some(i))
    .collect();

let diabetic_count = targets.iter().filter(|&&t| t).count();
let non_diabetic_count = non_diabetic_indices.len();
// saturating_sub avoids an underflow if the non-diabetic class is not the minority
let oversample_count = diabetic_count.saturating_sub(non_diabetic_count);

for _ in 0..oversample_count {
    // Pick a non-diabetic row uniformly at random and append a copy of it
    let random_index =
        non_diabetic_indices[rand::thread_rng().gen_range(0..non_diabetic_count)];
    let new_row = data.slice(s![random_index, ..]).to_vec();
    let new_target = targets[random_index];
    data.append(Axis(0), ArrayView::from_shape((1, 11), &new_row).unwrap())
        .expect("Error appending oversampled row");
    targets.append(
        Axis(0),
        ArrayView::from_shape(1, &[new_target]).unwrap(),
    ).expect("Error appending oversampled target");
}

()
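
The idea behind this cell, random oversampling of the minority class until both classes have the same size, can be shown on a toy label vector. The sketch below is a hedged illustration: `oversample_indices` is a hypothetical helper, and the random source is passed in as a closure (in the notebook it could be backed by `rand::thread_rng().gen_range(0..n)`) so that the example has no dependencies and can be checked deterministically.

```rust
/// Return the indices of minority-class (`false`) rows to duplicate so that
/// both classes end up with the same number of rows.
/// `pick(n)` must return an index in `0..n`; it is injected by the caller,
/// which is an assumption of this sketch, not the notebook's design.
fn oversample_indices(targets: &[bool], mut pick: impl FnMut(usize) -> usize) -> Vec<usize> {
    let minority: Vec<usize> = targets
        .iter()
        .enumerate()
        .filter_map(|(i, &t)| (!t).then_some(i))
        .collect();
    if minority.is_empty() {
        return Vec::new(); // nothing to sample from
    }
    let majority_count = targets.len() - minority.len();
    // saturating_sub: no rows are added if `false` is not the minority
    let missing = majority_count.saturating_sub(minority.len());
    (0..missing).map(|_| minority[pick(minority.len())]).collect()
}
```

With four `true` and two `false` labels, the function returns two indices; appending copies of those rows yields four rows per class.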

### 3.2. Normalization

For this dataset, normalizing the features with the L1, L2, or max norm
results in worse models, especially for SVM, so normalization is left out of this
pipeline.

⚠️ The code is commented out! To run this step, remove the `/*` and `*/` markers.

In [50]:
use linfa_preprocessing::norm_scaling::NormScaler;

/* // <- Remove this comment marker to use L1 normalization
use linfa::traits::Transformer; // provides `transform`
use linfa::Dataset;
let temp_dataset = Dataset::new(data.clone(), targets.clone()).with_feature_names(columns.clone());
let scaler = NormScaler::l1();
let temp_dataset = scaler.transform(temp_dataset);
let data = temp_dataset.records().to_owned();
let targets = temp_dataset.targets().to_owned();
*/ // <-
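
To make the three norms concrete, here is a small dependency-free sketch of what norm scaling does to a single sample. It assumes, as the `linfa-preprocessing` documentation describes, that `NormScaler` rescales each sample (row) to unit norm rather than each feature (column); `Norm` and `normalize_row` are hypothetical names used only for illustration.

```rust
/// Norms referred to in the text (hypothetical enum, for illustration only).
#[derive(Clone, Copy)]
enum Norm {
    L1,
    L2,
    Max,
}

/// Scale one sample (row) so that its chosen norm becomes 1.
/// An all-zero row is returned unchanged to avoid dividing by zero.
fn normalize_row(row: &[f64], norm: Norm) -> Vec<f64> {
    let size = match norm {
        Norm::L1 => row.iter().map(|v| v.abs()).sum::<f64>(),
        Norm::L2 => row.iter().map(|v| v * v).sum::<f64>().sqrt(),
        Norm::Max => row.iter().fold(0.0_f64, |m, v| m.max(v.abs())),
    };
    if size == 0.0 {
        return row.to_vec();
    }
    row.iter().map(|v| v / size).collect()
}
```

For the row `[3.0, -4.0]`, the L2 norm is 5, so the result is `[0.6, -0.8]`; the L1 norm is 7 and the max norm is 4. Under this row-wise reading, every patient is rescaled by a different factor, so absolute measurement levels are no longer comparable between patients, which is one plausible reason for the worse models reported here.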

### 3.3. New features for highly correlated columns

No pair of features is correlated strongly enough to be merged into one column:
the strongest correlations are Urea/Cr (0.65) and HbA1c/BMI (0.62).
No action is needed.

See the correlation matrix below:

In [51]:
// Print the Pearson correlation matrix
use linfa::Dataset;

let temp_dataset = Dataset::new(data.clone(), targets.clone()).with_feature_names(columns.clone());
let correlation = temp_dataset.pearson_correlation();
println!("{}", correlation);

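|        | AGE  | Urea | Cr   | HbA1c | Chol  | TG   | HDL   | LDL   | VLDL  | BMI   |
|--------|------|------|------|-------|-------|------|-------|-------|-------|-------|
| Gender | 0.06 | 0.13 | 0.17 | 0.05  | -0.05 | 0.14 | -0.16 | 0.02  | 0.18  | 0.06  |
| AGE    |      | 0.11 | 0.07 | 0.46  | -0.00 | 0.16 | -0.02 | -0.02 | -0.02 | 0.47  |
| Urea   |      |      | 0.65 | 0.04  | -0.01 | 0.09 | -0.00 | -0.01 | -0.00 | 0.07  |
| Cr     |      |      |      | -0.00 | -0.04 | 0.10 | -0.01 | 0.05  | 0.02  | 0.04  |
| HbA1c  |      |      |      |       | 0.23  | 0.26 | 0.02  | -0.01 | 0.13  | 0.62  |
| Chol   |      |      |      |       |       | 0.27 | 0.09  | 0.42  | 0.09  | 0.12  |
| TG     |      |      |      |       |       |      | -0.11 | 0.03  | 0.17  | 0.19  |
| HDL    |      |      |      |       |       |      |       | -0.14 | -0.06 | 0.05  |
| LDL    |      |      |      |       |       |      |       |       | 0.05  | -0.03 |
| VLDL   |      |      |      |       |       |      |       |       |       | 0.23  |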
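
Each entry of the matrix is a Pearson correlation coefficient between two feature columns. As a reminder of what that number measures, here is a minimal standalone sketch of the standard formula; `pearson` is a hypothetical helper written for illustration and is not the implementation used by `linfa`.

```rust
/// Pearson correlation coefficient between two equally long columns.
/// Returns `None` when the inputs differ in length, have fewer than two
/// values, or when either column is constant (the coefficient is undefined).
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    // The 1/n factors of covariance and variances cancel out
    Some(cov / (var_x * var_y).sqrt())
}
```

Values close to 1 or -1 indicate a nearly linear relationship, in which case two columns carry largely redundant information; values near 0 indicate no linear relationship.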



## 4. Model Creation and Results

In [52]:
use ai_lib::Aux;
use ai_lib::ModelKind;
use linfa::prelude::*;

let dataset = Dataset::new(data.clone(), targets.clone()).with_feature_names(columns.clone());
let dataset = dataset.shuffle(&mut rand::thread_rng());
let (train_data, test_data) = dataset.split_with_ratio(0.8);

### 4.1. Support Vector Machines

In [53]:
let svm = Aux::train_and_test(ModelKind::SVM, &train_data, &test_data);
println!("{}", svm);

SVM:
Training took 377.84 ms;
Testing took 4.25 ms;
Accuracy of 97.92 %;
Sensitivity of 95.86 %
Precision of 100.00 %
F1 scores of 0.98
Confusion Matrix: 

classes    | false      | true      
false      | 162        | 7         
true       | 0          | 168       




### 4.2. Gaussian Naive Bayes

In [54]:
let gnb = Aux::train_and_test(ModelKind::GNB, &train_data, &test_data);
println!("{}", gnb);

GNB:
Training took 0.60 ms;
Testing took 0.11 ms;
Accuracy of 91.10 %;
Sensitivity of 93.41 %
Precision of 89.14 %
F1 scores of 0.91
Confusion Matrix: 

classes    | true       | false     
true       | 156        | 11        
false      | 19         | 151       




### 4.3. (Linear) Decision Trees

In [55]:
let ldt = Aux::train_and_test(ModelKind::LDT, &train_data, &test_data);
println!("{}", ldt);

LDT:
Training took 5.75 ms;
Testing took 0.02 ms;
Accuracy of 97.33 %;
Sensitivity of 98.26 %
Precision of 96.57 %
F1 scores of 0.97
Confusion Matrix: 

classes    | true       | false     
true       | 169        | 3         
false      | 6          | 159       
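
The metrics printed for the three models can be reproduced from their confusion matrices. The sketch below assumes that rows are the true classes, columns are the predictions, and the first listed class is treated as the positive one; this reading is an assumption, but it matches every printed figure. `Confusion` is a hypothetical helper and not part of `ai_lib`.

```rust
/// Counts of a binary confusion matrix (hypothetical helper).
struct Confusion {
    tp: f64,  // positives predicted as positive
    fn_: f64, // positives predicted as negative
    fp: f64,  // negatives predicted as positive
    tn: f64,  // negatives predicted as negative
}

impl Confusion {
    /// Build from the printed layout, ASSUMING rows are the true classes,
    /// columns the predictions, and the first listed class the positive one.
    fn from_rows(m: [[f64; 2]; 2]) -> Self {
        Confusion { tp: m[0][0], fn_: m[0][1], fp: m[1][0], tn: m[1][1] }
    }

    fn accuracy(&self) -> f64 {
        (self.tp + self.tn) / (self.tp + self.fn_ + self.fp + self.tn)
    }

    /// Also called recall: share of actual positives that were found.
    fn sensitivity(&self) -> f64 {
        self.tp / (self.tp + self.fn_)
    }

    /// Share of predicted positives that are actual positives.
    fn precision(&self) -> f64 {
        self.tp / (self.tp + self.fp)
    }

    /// Harmonic mean of precision and sensitivity.
    fn f1(&self) -> f64 {
        2.0 * self.tp / (2.0 * self.tp + self.fp + self.fn_)
    }
}
```

For example, `Confusion::from_rows([[156.0, 11.0], [19.0, 151.0]])` gives an accuracy of 307/337 ≈ 91.10 % and a sensitivity of 156/167 ≈ 93.41 %, the values printed for GNB. Note that the SVM matrix lists `false` first, so under this reading its sensitivity and precision describe the `false` class, while the GNB and LDT figures describe the `true` class.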




---

[License](LICENSE) | [Third Party Credits](THIRDPARTY.md)