# MedAId 
## A Package for Predicting Patient States Using Classification Tools 
#### Authors: Zofia Kamińska, Mateusz Deptuch, Karolina Dunal

### Tool Specification
Our tool is designed to assist doctors in the medical decision-making process. Its primary goal is to analyze tabular patient data, such as age, weight, and cholesterol levels, to predict:
- Whether a patient has a particular disease (binary classification).
- The severity level of the disease (multiclass classification).
- The risk of patient mortality (binary classification).

### Key Features
- Supports both binary and multiclass classification.
- Automated data processing: cleaning, exploratory analysis, and feature preparation.
- Interpretation of model results using tools like SHAP.
- Comparison of various ML models with different metrics (e.g., accuracy, ROC-AUC, sensitivity, specificity).

### Target Audience
The target audience includes doctors and medical personnel. The tool is designed for users who:
- Want to utilize patient data to make better medical decisions.
- Do not have advanced knowledge in programming or machine learning.
- Need intuitive visualizations and interpretations of model results.

# Overview of Existing Solutions
Below are existing tools with similar functionalities:

### 1. Pharm-AutoML
- **Description**: A tool focused on analyzing biomedical data using AutoML. It enables the analysis of genomics, pharmacogenomics, and biomarker data.
- **Advantages**: Specialization in biomedical fields, integrated biomarker models.
- **Limitations**: Limited application to tabular clinical data.

### 2. Cardea (MIT)
- **Description**: A machine learning platform focused on predicting patient outcomes based on clinical data such as Electronic Health Records (EHR).
- **Advantages**: Excellent integration with EHR, use of advanced models.
- **Limitations**: Focus on EHR may hinder application to simpler tabular data.

### 3. AutoPrognosis
- **Description**: AutoPrognosis is an advanced AutoML platform that automatically optimizes health models and processes medical data, offering a wide range of analyses, including classification, regression, and survival analysis. It allows full customization of processes and algorithm selection.
- **Advantages**: Offers advanced features and flexibility, supports diverse models, and provides interpretability tools, making it ideal for specialists with advanced needs.
- **Limitations**: While it has extensive capabilities, its use is more complex and requires greater technical knowledge, which can sometimes be a challenge in practice.

### 4. MLJAR
- **Description**: An AutoML tool that supports tabular data across various domains, including medicine.
- **Advantages**: Versatility, user-friendly reports, intuitive to use.
- **Limitations**: Lack of medical specialization, which may impact interpretability in clinical contexts.

### Comparison with Our Tool
Our tool stands out for its ease of use: it requires minimal coding, which suits users without advanced technical knowledge. It is optimized for medical tabular data, so it fits biomedical analyses better than general-purpose tools. Unlike MLJAR, its results are tailored to the needs of doctors. Cardea and Pharm-AutoML target narrower use cases (EHR data and biomarker data, respectively). Compared to AutoPrognosis, which offers more advanced features, our tool is simpler and more intuitive, and therefore easier to adopt in practice.


## Tool Architecture

### Folder Structure:
- `data/` - Input data
- `medaid/` - Source code of the tool
- `tests/` - Unit tests

## Data Processing Flow
The MedAId package consists of three main components:

1. **Data Processing** (`preprocessing/`): This component handles loading the data, cleaning, encoding categorical variables, and splitting the dataset into training and testing sets. It also handles any missing values and normalizes numerical features. The preprocessing step ensures that the data is ready for model training by transforming it into a suitable format for machine learning algorithms.

2. **Modeling** (`training/`): This component focuses on creating classification models, training them, evaluating and comparing their performance, and saving the models to files. It includes model selection (e.g., logistic regression, random forest, support vector machines), hyperparameter tuning, cross-validation, and model evaluation using metrics like accuracy, precision, recall, and ROC-AUC. The best-performing model is selected and saved for future use.

3. **Result Interpretation** (`reporting/`): This component generates visualizations of the model results, creates comprehensive reports, performs SHAP (Shapley Additive Explanations) analysis for model interpretability, and compares various metrics to help users understand the model's decision-making process. It includes graphs like ROC curves, confusion matrices, feature importance plots, and detailed model performance summaries.


### `preprocessing/`
This module implements the full data processing pipeline: handling numerical formats, removing text columns, imputing missing data, encoding categorical variables, and scaling features. The `Preprocessing` class integrates these components into a single pipeline, so all required stages are managed in one place.

#### `preprocessing.py`
#### Data Processing Stages:
1. **Handling Numerical Formats**  
   The function `handle_numeric_format()` in the `NumericCommaHandler` class handles converting numbers formatted with commas (e.g., `1,000`) into the standard numeric format.

2. **Removing Text Columns**  
   The `ColumnRemover` class is used to identify and remove text columns based on specified thresholds (e.g., when missing data in a column exceeds a predefined limit).

3. **Imputation of Missing Data**  
   The imputation function relies on various methods, such as linear regression (in the `Imputer` class) and Random Forest. Correlation thresholds are set to use appropriate imputation algorithms depending on the correlation of columns with other variables.

4. **Encoding Categorical Variables**  
   Categorical variables, including the target column, are encoded by the `Encoder` class using different encoding methods, including `LabelEncoder` and `OneHotEncoder`.

5. **Feature Scaling**  
   The `scale()` function in the `Scaler` class scales numerical columns in the DataFrame. Standardization is applied to columns with an approximately normal distribution, and normalization to columns with a skewed distribution.
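
The distribution-dependent scaling in step 5 can be illustrated with a minimal plain-Python sketch. It is not the code of the `Scaler` class: the skewness test, the `skew_threshold` value, and the use of min-max scaling as "normalization" are assumptions made for this example only.

```python
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def scale_column(values, skew_threshold=0.5):
    """Return (method, scaled_values) for one numeric column.

    skew_threshold is a hypothetical cut-off, not a MedAId parameter.
    """
    if abs(skewness(values)) < skew_threshold:
        # Roughly symmetric column: z-score standardization.
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        if std == 0:
            return "standardization", [0.0] * len(values)
        return "standardization", [(v - mean) / std for v in values]
    # Skewed column: min-max normalization to the [0, 1] range.
    low, high = min(values), max(values)
    return "normalization", [(v - low) / (high - low) for v in values]


print(scale_column([1, 2, 3, 4, 5])[0])    # symmetric column
print(scale_column([1, 1, 1, 2, 10])[0])   # right-skewed column
```

The first column is symmetric and is standardized; the second has a long right tail and is normalized instead.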

#### Key Functions of the `Preprocessing` Class:

- **`__init__(target_column, path, imputer_lr_correlation_threshold, imputer_rf_correlation_threshold, categorical_threshold, removal_correlation_threshold)`**:  
    Initializes the class object by setting parameters such as the target column, correlation thresholds, and other configuration options. It also creates instances of the appropriate processing components like `NumericCommaHandler`, `ColumnRemover`, `Encoder`, `Scaler`, `Imputer`, and `PreprocessingCsv`.

    **Parameter Descriptions:**
    - `target_column` (str): The name of the target column.
    - `path` (str): The directory path where processing details will be saved.
    - `imputer_lr_correlation_threshold` (float): The correlation threshold for imputation using linear regression.
    - `imputer_rf_correlation_threshold` (float): The correlation threshold for imputation using Random Forest.
    - `categorical_threshold` (float): The threshold for treating a column as free text rather than categorical. It is compared with the ratio of unique values to all values; if the ratio exceeds the threshold, the column is treated as text and removed.
    - `removal_correlation_threshold` (float): The correlation threshold for removing highly correlated columns (excluding the target column). Only one column from correlated groups is retained.

- **`preprocess(dataframe)`**:  
    The main processing function that performs all pipeline steps. It takes a DataFrame as input, processes it through each stage, and returns the processed DataFrame. After each stage, it logs processing details such as text column removal, imputation, encoding, and scaling.

- **`get_column_info()`**:  
    Returns details about the processing for each column, including information on removed columns, imputation methods, encoding, and scaling.

- **`save_column_info(text_column_removal_info, imputation_info, encoding_info, scaling_info)`**:  
    Saves processing details to a CSV file. This function uses the `PreprocessingCsv` class to store information about removed columns, imputation, encoding, and scaling.

- **`get_target_encoding_info()`**:  
    Returns information about the encoding method used for the target column.
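
The two removal thresholds described above (`categorical_threshold` and `removal_correlation_threshold`) can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the helper names, the threshold values used in the calls, and the rule of keeping the first column of each correlated group are all hypothetical.

```python
import math
import statistics


def is_text_column(values, categorical_threshold):
    """True if the ratio of unique values to all values exceeds the threshold."""
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        return False
    return len(set(non_missing)) / len(non_missing) > categorical_threshold


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def drop_correlated(columns, target, removal_correlation_threshold):
    """Keep one column per strongly correlated group; the target is skipped."""
    kept = []
    for name, values in columns.items():
        if name == target:
            continue
        # Keep the column only if it is not strongly correlated with a kept one.
        if all(abs(pearson(values, columns[k])) <= removal_correlation_threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data, not taken from the package's datasets.
print(is_text_column(["Anna", "Jan", "Ewa", "Piotr"], 0.5))       # every value unique
print(is_text_column(["F", "M", "F", "M", "F", "M"], 0.5))        # two categories
columns = {
    "weight_kg": [60, 70, 80, 90],
    "weight_lb": [132, 154, 176, 198],   # linear function of weight_kg
    "age": [30, 25, 60, 41],
    "label": [0, 0, 1, 1],
}
print(drop_correlated(columns, target="label", removal_correlation_threshold=0.9))
```

In the toy data, `weight_lb` is dropped because it duplicates the information in `weight_kg`, while `age` is kept.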

#### Details of Implementation of Individual Components:
The following classes and their methods are implemented in separate files.

- **`NumericCommaHandler`** - `numeric_format_handler.py`:  
  Handles the conversion of numbers formatted with commas (e.g., `1,000`) into a numeric format, ensuring data consistency within the DataFrame.

- **`ColumnRemover`** - `column_removal.py`:  
  Allows for the removal of text columns whose values are deemed irrelevant, based on various criteria such as the amount of missing data or correlation with the target column.

- **`Imputer`** - `imputer.py`:  
  Performs imputation of missing data using different methods, such as linear regression, Random Forest, or other algorithms, depending on correlations with other variables.

- **`Encoder`** - `encoder.py`:  
  Encodes categorical variables, including the target variable, using `LabelEncoder` and `OneHotEncoder`, and ensures that encoding information and mappings are stored.

- **`Scaler`** - `scaler.py`:  
  Scales numerical variables, deciding between standardization or normalization based on the detected distribution of data within the columns.

- **`PreprocessingCsv`** - `preprocessing_info.py`:  
  Saves processing details to a CSV file, enabling the tracking of applied methods and parameters throughout the data processing pipeline.
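
As an illustration of correlation-driven imputation, the sketch below picks a method from the strongest correlation a column has with any other column and then fills gaps with a one-feature least-squares fit. The ordering of the rules (linear regression above the higher threshold, Random Forest above the lower one, a simple fallback otherwise), the default threshold values, and the function names are assumptions for this example; the actual `Imputer` class may decide differently.

```python
import statistics


def choose_imputation_method(max_abs_correlation, lr_threshold=0.8, rf_threshold=0.2):
    """Pick an imputation method (hypothetical rule and default thresholds)."""
    if max_abs_correlation >= lr_threshold:
        return "linear_regression"
    if max_abs_correlation >= rf_threshold:
        return "random_forest"
    return "simple"  # e.g. median or most frequent value


def impute_linear(x, y):
    """Fill None entries of y using a least-squares line fitted on complete pairs."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mean_x = statistics.fmean([a for a, _ in pairs])
    mean_y = statistics.fmean([b for _, b in pairs])
    denom = sum((a - mean_x) ** 2 for a, _ in pairs)
    if denom == 0:
        # Predictor is constant: fall back to the mean of the observed values.
        return [b if b is not None else mean_y for b in y]
    slope = sum((a - mean_x) * (b - mean_y) for a, b in pairs) / denom
    intercept = mean_y - slope * mean_x
    return [b if b is not None else intercept + slope * a for a, b in zip(x, y)]


print(choose_imputation_method(0.9))
print(impute_linear([1, 2, 3, 4], [2, 4, None, 8]))
```

Here the missing value is predicted from the line fitted on the three complete pairs, which all lie on `y = 2x`.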


### `training/`
#### `medaid.py`:
This module is used for training models and hyperparameter optimization.
1. **`train(...)`**: This function handles the entire training process and hyperparameter optimization. It trains various classification models, evaluates their performance, and selects the best-performing model based on the specified evaluation metric.
2. **`search.py`**: Defines the classes for Random Search and Grid Search, which are used during the hyperparameter optimization process.
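
The difference between the two search strategies can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the classes defined in the package: the function names, the toy parameter grid, and the toy scoring function are hypothetical, and a real search would score each candidate with cross-validation.

```python
import itertools
import random


def grid_candidates(param_grid):
    """Grid search: every combination of the listed parameter values."""
    keys = list(param_grid)
    combos = itertools.product(*(param_grid[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def random_candidates(param_grid, n_iter, seed=0):
    """Random search: n_iter combinations sampled independently (repeats possible)."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in param_grid.items()} for _ in range(n_iter)]


def best_candidate(candidates, score_fn):
    """Return the candidate with the highest score."""
    return max(candidates, key=score_fn)


# Hypothetical grid for a decision tree.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}


def toy_score(params):
    # Stand-in for a cross-validated metric such as F1.
    return -abs(params["max_depth"] - 4) - params["min_samples_leaf"]


print(len(grid_candidates(param_grid)))            # 3 * 2 combinations
print(len(random_candidates(param_grid, n_iter=4)))
print(best_candidate(grid_candidates(param_grid), toy_score))
```

Grid search cost grows with the product of the value lists, while random search cost is fixed by `n_iter`, which is why the number of iterations is a separate setting.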

### `reporting/`
#### `plots.py`:
This module generates visualizations that support the analysis of model results. The plots are saved in subdirectories of the main `medaid#` results folder (where `#` is a number, e.g. `medaid3`):

1. **`distribution_plots(aid)`**: Creates histograms and bar plots for input variables.
2. **`correlation_plot(aid)`**: Generates a correlation matrix and dependency plots between features and the target variable.
3. **`make_confusion_matrix(aid)`**: Generates confusion matrices on the test set for each model.
4. **`shap_feature_importance_plot(aid)`**: Visualizes feature importance based on SHAP.
5. **`generate_supertree_visualizations(medaid, output_dir)`**: Creates interactive visualizations of SuperTree models.
6. **`makeplots(aid)`**: Runs all the above functions, generating a complete set of visualizations.

#### `mainreporter.py`:
The `MainReporter` class generates an HTML report with the results of data and model analysis. The report includes details about the data, preprocessing, feature distributions, correlation matrices, model results, and their in-depth analysis. The generated report is stored in the `reports/` folder inside the `medaid#` folder.

1. **`__init__(self, aid, path)`**: The constructor initializes the path to the result folder and the `aid` object containing data and models.
2. **`is_nan(value)`**: A helper function to check if a value is NaN.
3. **`generate_report()`**: Generates an HTML report, which includes:
   - Basic information about the data (number of rows, columns, unique target classes).
   - A preview of the data (first few rows of the DataFrame).
   - Details of preprocessing from the CSV file.
   - Feature distributions on plots.
   - Correlation analysis of features with the target and the full correlation matrix.
   - Details of the models used and their results (including Accuracy, Precision, Recall, F1).
   - Model-specific details (e.g., confusion matrix, feature importance, tree visualizations).
   - DecisionTree and RandomForest tree visualizations.

#### `predictexplain.py`:
The `PredictExplainer` class generates an explanation report for the model's prediction based on the input data, saved in the `medaid#` folder.

1. **`__init__(self, medaid, model)`**: Initializes the `PredictExplainer` class, assigning the `medaid` object and model, and loads preprocessing details from a CSV file.
2. **`preprocess_input_data(self, input_data)`**: Preprocesses the input data according to the stored preprocessing details, applying one-hot encoding, label encoding, imputation, and scaling based on previous settings.
3. **`analyze_prediction(self, prediction, target_column, prediction_proba)`**: Analyzes the predicted value for the target feature, compares it with the distribution in the dataset, and generates a classification report including a feature importance plot (SHAP) for classification tasks.
4. **`generate_html_report(self, df, input_data)`**: Using the other functions, generates an HTML report comparing the input data with the dataset, analyzes the predictions, and generates model interpretability plots.
5. **`generate_viz(self, input_data)`**: Generates visualizations for input data using SHAP (for most models) or LIME (for tree-based models).
6. **`generate_shap_viz(self, input_data)`**: Generates SHAP visualizations, including a force plot for a single prediction and a summary plot for the entire dataset, saving them as files.
7. **`generate_lime_viz(self, input_data)`**: Generates LIME visualizations for input data, saving the explanation plot to an HTML file.
8. **`predict_target(input_data)`**: Processes the input data, makes a prediction using the model, analyzes the result, and generates SHAP/LIME visualizations to increase interpretability.
9. **`classify_and_analyze_features(df, input_data)`**: Classifies features into binary, categorical-text, categorical-numeric, and continuous-numeric types, then provides detailed HTML reports based on their characteristics.
10. **`_analyze_binary(df, column, input_value)`**, **`_analyze_categorical_numbers(df, column, input_value)`**, **`_analyze_categorical_strings(df, column, input_value)`**, and **`_analyze_numerical_continuous(df, column, input_value)`**: These functions generate HTML content for different types of features (binary, categorical-numeric, categorical-text, and continuous-numeric), providing detailed information about the input value, its frequency in the dataset, and additional statistical details (such as comparisons with the mean, median, and standard deviation for continuous features).
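
The feature typing and the continuous-feature comparison described in items 9 and 10 can be sketched as follows. This is an approximation in plain Python rather than the `PredictExplainer` code: the `max_categories` cut-off, the function names, and the returned dictionary are assumptions made for illustration.

```python
import statistics


def classify_feature(values, max_categories=10):
    """Assign one of the four feature types (max_categories is hypothetical)."""
    non_missing = [v for v in values if v is not None]
    unique = set(non_missing)
    if len(unique) <= 2:
        return "binary"
    if all(isinstance(v, str) for v in non_missing):
        return "categorical-text"
    if len(unique) <= max_categories:
        return "categorical-numeric"
    return "continuous-numeric"


def describe_continuous(values, input_value):
    """Compare one input value with the column's mean, median, and spread."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        # Distance from the mean in standard deviations (0.0 for constant columns).
        "z_score": (input_value - mean) / std if std else 0.0,
    }


print(classify_feature([0, 1, 0, 1]))
print(classify_feature(["low", "mid", "high", "low"]))
print(classify_feature([1, 2, 3, 1, 2, 3]))
print(classify_feature([x / 2 for x in range(50)]))
print(describe_continuous([60, 65, 70, 75, 80], input_value=80))
```

A report generator could then branch on the returned type and render a frequency table for categorical features or the summary statistics for continuous ones.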


## Description of the `MedAId` Class

The `MedAId` class is the main object of the tool. It allows you to load data, preprocess it, train models, save results, and generate reports.

#### Methods:
- **`MedAId(...)`**: Constructor of the `MedAId` class; initializes the object with the following parameters:
    - **`dataset_path`**: Path to the CSV file with the data.
    - **`target_column`**: Name of the column containing the target variable.
    - **`models`**: List of models to test (default: `["logistic", "tree", "random_forest", "xgboost", "lightgbm"]`).
    - **`metric`**: Metric to optimize for (default: `f1`, possible values: `["accuracy", "f1", "recall", "precision"]`).
    - **`path`**: Path to save the results.
    - **`search`**: Hyperparameter optimization method (default: `random`).
    - **`cv`**: Number of cross-validation splits (default: `3`).
    - **`n_iter`**: Number of iterations for hyperparameter optimization (default: `20`).
    - **`test_size`**: Size of the test set (default: `0.2`).
    - **`n_jobs`**: Number of processor cores to use (default: `1`).
    - **`param_grids`**: Dictionary containing the parameter grid for each model.
    - **`imputer_lr_correlation_threshold`**: Minimum correlation for linear regression imputation.
    - **`imputer_rf_correlation_threshold`**: Minimum correlation for Random Forest imputation.
    - **`categorical_threshold`**: Threshold for distinguishing text columns from categorical ones (if the ratio of unique values to total values in a column is greater than this threshold, the column is considered text and removed).
    - **`removal_correlation_threshold`**: Correlation threshold for removing strongly correlated columns. The target variable is never removed, and only one column from each group of strongly correlated columns is kept.

- **`preprocess()`**: Conducts preprocessing of the data.
- **`train()`**: Performs preprocessing and trains models on the training data, saving the best models and their results.
- **`save()`**: Saves the models to the file `medaid.pkl` in the `medaid#/` folder.
- **`report()`**: Executes the `generate_report()` function from the `MainReporter` class, returning a report in HTML format with the results of data and model analysis, as described in the `reporting/` section.
- **`predict_explain(input_data, model)`**: Generates a report explaining the model's prediction for the input data, which is a single row from the DataFrame (excluding the target column). If the model or input data is not provided, the function uses default values: the first model from the `best_models` list and the first row from the DataFrame.


## Sample use


```python
from medaid.medaid import MedAId
```

```python
aid = MedAId(dataset_path='./data/multiclass/Obesity_Classification.csv',
             target_column='Label',
             metric="f1",
             search="random",
             path="",
             n_iter=10,
             cv=3)
```

```python
aid.train()
```

```
logistic progress: 100%|██████████| 10/10 [00:02<00:00,  3.39it/s]
tree progress: 100%|██████████| 10/10 [00:00<00:00, 34.87it/s]
random_forest progress: 100%|██████████| 10/10 [00:01<00:00,  5.87it/s]
xgboost progress: 100%|██████████| 10/10 [00:00<00:00, 20.88it/s]
lightgbm progress: 100%|██████████| 10/10 [00:04<00:00,  2.01it/s]

Finishing up...
```

```python
aid.save()
```

```python
aid.report()
```

```python
aid.models_ranking()
```

| | model | best_score | f1 | accuracy | precision | recall | test_best_score | test_f1 | test_accuracy | test_precision | test_recall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | random_forest | 0.965657 | 0.965657 | 0.965517 | 0.97318 | 0.965517 | 0.909091 | 0.909091 | 0.909091 | 0.909091 | 0.909091 |
| 1 | tree | 0.940752 | 0.940752 | 0.942529 | 0.962835 | 0.942529 | 0.865014 | 0.865014 | 0.863636 | 0.872727 | 0.863636 |
| 2 | xgboost | 0.928143 | 0.928143 | 0.930624 | 0.95178 | 0.930624 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | logistic | 0.882606 | 0.882606 | 0.884647 | 0.8996 | 0.884647 | 0.953047 | 0.953047 | 0.954545 | 0.961039 | 0.954545 |
| 4 | lightgbm | 0.857823 | 0.857823 | 0.860016 | 0.896312 | 0.860016 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

```python
aid.predict(aid.X_test.iloc[0:20], model_id=0)
```

```
{'Normal Weight': np.int64(0), 'Obese': np.int64(1), 'Overweight': np.int64(2), 'Underweight': np.int64(3)}
{'Normal Weight': np.int64(0), 'Obese': np.int64(1), 'Overweight': np.int64(2), 'Underweight': np.int64(3)}

['Underweight',
 'Underweight',
 'Overweight',
 'Underweight',
 'Underweight',
 'Normal Weight',
 'Overweight',
 'Normal Weight',
 'Underweight',
 'Underweight',
 'Overweight',
 'Underweight',
 'Obese',
 'Underweight',
 'Normal Weight',
 'Normal Weight',
 'Normal Weight',
 'Normal Weight',
 'Obese',
 'Overweight']
```

```python
aid.predict_explain(model=aid.best_models[1])
```

```python
aid.path
```

```
'/Users/mateuszdeptuch/SCHOOL/AUTOML/projekt2/medaid3'
```