# Multi-Modal Disease Prediction
## Introduction
Many reference kits in the biomedical domain focus on a single-model, single-modal solution. Relying exclusively on a single method has limitations: it makes it harder to design robust and accurate classifiers for complex datasets. To overcome these limitations, we provide this multi-modal disease prediction reference kit.

Multi-modal disease prediction is an Intel-optimized, end-to-end reference kit for fine-tuning and inference. It implements a multi-model, multi-modal solution that helps predict a diagnosis from categorized contrast-enhanced mammography data and radiologists’ notes.
 
Check out more workflow and reference kit examples in the [Developer Catalog](https://developer.intel.com/aireferenceimplementations).

## **Table of Contents**
- [Solution Technical Overview](#solution-technical-overview)
- [Dataset](#dataset)
- [Validated Hardware Details](#validated-hardware-details)
- [Software Requirements](#software-requirements)
- [How it Works?](#how-it-works)
    - [Architecture](#architecture)
- [Get Started](#get-started)
- [Download the Reference Kit Repository](#download-the-reference-kit-repository)
- [Download and Preprocess the Datasets](#download-and-preprocess-the-datasets)
- [Run Using Jupyter Lab](#run-using-jupyter-lab) 
- [Expected Output](#expected-output)
- [Result Visualization](#result-visualization)

<a id="solution-technical-overview"></a> 
## Solution Technical Overview
This reference kit demonstrates one possible implementation of a multi-model, multi-modal solution. The vision workflow trains an image classifier that takes in contrast-enhanced spectral mammography (CESM) images, while the natural language processing (NLP) workflow trains a document classifier that takes in annotation notes about a patient’s symptoms. Each pipeline produces its own prediction for the diagnosis of breast cancer. Finally, a weighted ensemble method combines the two predictions into the final prediction.

The goal is to minimize an expert’s involvement in categorizing samples as normal, benign, or malignant, by developing and optimizing a decision support system that automatically categorizes the CESM with the help of radiologist notes.

<a id="dataset"></a> 
### Dataset
The dataset is a collection of 2,006 high-resolution contrast-enhanced spectral mammography (CESM) images (1,003 low-energy images and 1,003 subtracted CESM images) with annotations, collected from 326 female patients. See Figure-2. Each patient has 8 images, 4 for each breast side, covering two views (top-down and angled top view) in both low-energy and subtracted CESM form. Medical reports written by radiologists are provided for each case, along with manual segmentation annotations for the abnormal findings in each image. As a preprocessing step, we segment the images using the manual segmentation to obtain the region of interest, and we group the annotation notes by subject and breast side.
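
The grouping of annotation notes by subject and breast side can be pictured with the minimal sketch below. It is not the kit's preprocessing code: the record keys (`patient_id`, `side`, `note`) and the sample notes are made-up placeholders, and the real annotation files may use different column names.

```python
from collections import defaultdict


def group_notes(records):
    """Concatenate annotation notes per (patient_id, side) pair.

    `records` is an iterable of dicts with the hypothetical keys
    'patient_id', 'side' ('L' or 'R') and 'note'.
    """
    grouped = defaultdict(list)
    for rec in records:
        note = rec["note"].strip()
        if note:  # skip empty notes
            grouped[(rec["patient_id"], rec["side"])].append(note)
    # One text document per patient and breast side
    return {key: " ".join(notes) for key, notes in grouped.items()}


# Placeholder records, invented for illustration only
sample = [
    {"patient_id": 1, "side": "L", "note": "Dense breast tissue."},
    {"patient_id": 1, "side": "L", "note": "No enhancing mass."},
    {"patient_id": 1, "side": "R", "note": "Irregular enhancing mass."},
]
grouped_notes = group_notes(sample)
print(grouped_notes)
```

Each resulting entry is one document that a text classifier could consume, keyed by the same subject and side that identify the matching images.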

  ![CESM Images](assets/cesm_and_annotation.png)

*Figure-2: Samples of low-energy and subtracted CESM images and medical reports written by radiologists, from the Categorized contrast enhanced mammography dataset. [(Khaled, 2022)](https://www.nature.com/articles/s41597-022-01238-0)*

For more details about the dataset, visit the [CESM](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=109379611#109379611bcab02c187174a288dbcbf95d26179e8) wiki page and read [Categorized contrast enhanced mammography dataset for diagnostic and artificial intelligence research](https://www.nature.com/articles/s41597-022-01238-0).

<a id="validated-hardware-details"></a> 
## Validated Hardware Details
Hardware and software setup requirements are workflow-specific and depend on how the workflow is run. A bare metal development system and a Docker image running locally have the same system requirements.

| Recommended Hardware         | Precision  |
| ---------------------------- | ---------- |
| Intel® 4th Gen Xeon® Scalable Performance processors| FP32, BF16 |
| Intel® 1st, 2nd, 3rd, and 4th Gen Xeon® Scalable Performance processors| FP32 |

To execute the reference solution presented here, use a CPU for fine-tuning.
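
Whether a given Linux machine can use BF16 can be checked from the CPU feature flags. The sketch below is not part of the reference kit: it assumes a Linux host exposing `/proc/cpuinfo` and looks for the `avx512_bf16` or `amx_bf16` flags; the helper names are hypothetical.

```python
def supports_bf16(cpu_flags):
    """Return True if the flags include a BF16-capable instruction set."""
    bf16_flags = {"avx512_bf16", "amx_bf16"}
    return bool(bf16_flags & set(cpu_flags))


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the feature flags of the first CPU listed (Linux only)."""
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass  # file not available, e.g. on a non-Linux system
    return []


print("BF16 available:", supports_bf16(read_cpu_flags()))
```

If the check prints `False`, FP32 remains the precision to use according to the table above.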

<a id="software-requirements"></a> 
## Software Requirements 
Linux OS (Ubuntu 22.04) is used to validate this reference solution. Make sure the following dependencies are installed.

1. `sudo apt update`
2. `sudo apt install -y build-essential gcc git libgl1-mesa-glx libglib2.0-0 python3-dev`
3. `sudo apt install python3.9 python3-pip`, plus a virtual environment tool such as python3-venv or [conda](#1-set-up-system-software)
4. `pip install dataset-librarian`


<a id="how-it-works"></a> 
## How It Works?

<a id="architecture"></a> 
### Architecture
![Use_case_flow](assets/e2e_flow_HLS_Disease_Prediction.png)
*Figure-1: Architecture of the reference kit* 

- Uses real-world CESM breast cancer datasets with “multi-modal and multi-model” approaches.
- Combines two domain toolkits (Intel® Transfer Learning Tool and Intel® Extension for Transformers) with Intel® Neural Compressor and other libraries and tools, and uses the Hugging Face model repository and APIs for the [ResNet-50](https://huggingface.co/microsoft/resnet-50) and [ClinicalBert](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT) models.
- The NLP reference implementation component uses the [HF Fine-tuning and Inference Optimization workload](https://github.com/intel/intel-extension-for-transformers/tree/main/workflows/hf_finetuning_and_inference_nlp), which is optimized for document classification. This NLP workload employs Intel® Neural Compressor and other libraries and tools, and uses the Hugging Face model repository and APIs for ClinicalBert models. The ClinicalBert model, which is pretrained with a masked language modeling task on a large corpus of English text from MIMIC-III data, is fine-tuned with the CESM breast cancer annotation dataset to generate a new BERT model.
- The vision reference implementation component uses the [TLT-based vision workload](https://github.com/IntelAI/transfer-learning), which is optimized for image fine-tuning and inference. This workload uses Intel® Transfer Learning Tool and tfhub's ResNet-50 model to fine-tune a new convolutional neural network model with the subtracted CESM image dataset. The images are preprocessed using segmented regions defined by domain experts to reduce redundancies during training.
- Predicts the diagnosis separately from the categorized contrast-enhanced mammography images and from the radiologists’ notes, then applies a weighted ensemble method to the sub-model results to create the final prediction.
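
As an illustration of the weighted ensemble step, the sketch below averages the class probabilities of two sub-models with fixed weights and picks the highest-scoring class. The weights (0.4 and 0.6) and the probability values are assumptions chosen for the example; the kit's actual weighting scheme is not described here and may differ.

```python
LABELS = ["Benign", "Malignant", "Normal"]


def weighted_ensemble(vision_probs, nlp_probs, vision_weight=0.4, nlp_weight=0.6):
    """Combine two probability vectors with a normalized weighted average."""
    total = vision_weight + nlp_weight
    combined = [
        (vision_weight * v + nlp_weight * n) / total
        for v, n in zip(vision_probs, nlp_probs)
    ]
    # Index of the class with the highest combined score
    best = max(range(len(combined)), key=combined.__getitem__)
    return LABELS[best], combined


# Hypothetical sub-model outputs for a single case
label, scores = weighted_ensemble([0.5, 0.4, 0.1], [0.2, 0.7, 0.1])
print(label, [round(s, 2) for s in scores])
```

In this example the vision model slightly favors "Benign", but the more heavily weighted NLP model moves the combined prediction to "Malignant".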

<a id="get-started"></a> 
## Get Started
Start by defining an environment variable that stores the workspace path. This can be an existing directory or one that is created in later steps. The variable is used by all subsequent commands, which are executed using absolute paths.


In [None]:
import os
from pathlib import Path
# The path can be changed to any desired location; expanduser() resolves "~" to an absolute path
WORKSPACE = os.path.expanduser("~/mtw/work")
os.environ["WORKSPACE"] = WORKSPACE
Path(WORKSPACE).mkdir(parents=True, exist_ok=True)
print("Work dir: {}".format(WORKSPACE))

<a id="download-the-reference-kit-repository"></a> 
### Download the Reference Kit Repository
To download the repository, follow the instructions in section `Download the Reference Kit Repository` of the README.md file.

The following cell changes Python's current working directory to the repository path, which is needed to run the cells below:

In [None]:
import os
os.chdir(f"{WORKSPACE}/brca_multimodal/")

<a id="download-and-preprocess-the-datasets"></a> 
### Download and Preprocess the Datasets
Use the link below to download the image dataset, or skip to the [Docker](#run-using-docker) section to download the dataset using a container.

- [High-resolution Contrast-enhanced spectral mammography (CESM) images](https://faspex.cancerimagingarchive.net/aspera/faspex/external_deliveries/260?passcode=5335d2514638afdaf03237780dcdfec29edf4238#)

Once you have downloaded and unzipped the image files and placed them in the `${WORKSPACE}/brca_multimodal/data` directory, execute the following command. It downloads the segmentation and annotation data and then applies the segmentation and preprocessing operations.

Command-line Interface:
- -d : Directory location where the raw dataset will be saved on your system. It's also where the preprocessed dataset files will be written. If not set, a directory with the dataset name will be created.
- --split_ratio: Split ratio of the test data, the default value is 0.1.

More details of the dataset_librarian can be found [here](https://pypi.org/project/dataset-librarian/).
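
To make the meaning of `--split_ratio` concrete, the sketch below holds out a fraction of IDs as test data using a seeded random shuffle. This is only an illustration: how `dataset_librarian` actually performs the split is not described here, and the function name and seed are hypothetical.

```python
import random


def split_ids(ids, split_ratio=0.1, seed=0):
    """Return (train_ids, test_ids), holding out `split_ratio` of the IDs."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_test = round(len(ids) * split_ratio)
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = split_ids(range(1, 101), split_ratio=0.1)
print(len(train_ids), "training IDs,", len(test_ids), "test IDs")
```

With 100 IDs and a ratio of 0.1, 10 IDs end up in the test set and the remaining 90 in the training set.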

In [None]:
# The first time you execute dataset_librarian you will be asked to accept the licensing agreement;
# scroll down and accept (y) the agreement to continue. The process ends once the "Preprocessing has finished" message appears.
import os
import dataset_librarian as dl
from dotenv import load_dotenv, dotenv_values

package_path = dl.__path__[0]
env_file_path = os.path.join(package_path, ".env")
USER_CONSENT = dotenv_values(env_file_path).get("USER_CONSENT")

command = f'python3.9 -m dataset_librarian.dataset -n brca --download --preprocess -d data/ --split_ratio 0.1; echo "Preprocessing has finished."'

if USER_CONSENT  == "y":
    os.popen(command, 'w')
else:
    os.popen(command, 'w').write(input())

Note: See this dataset's applicable license for terms and conditions. Intel Corporation does not own the rights to this dataset and does not confer any rights to it.

Once preprocessing has ended, you should have the following files and directories inside the `data/` directory:
```
data/ 
    |
    └──annotation/
        |── annotation.csv
        |── testing_data.csv
        └── training_data.csv
    └──CDD-CESM/
        |── Low energy images of CDD-CESM/
        └── Subtracted images of CDD-CESM/
    └──Medical reports for cases/
    └──segmented_images/
        |── Benign/
        |── Malignant/
        └── Normal/
    └──train_test_split_images
        |── test/
        └── train/
    └──vision_images
        |── Benign/
        |── Malignant/
        └── Normal/
    └──Radiology manual annotations.xlsx
    └──Radiology_hand_drawn_segmentations_v2.csv
    └──README.md
```
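
A quick way to confirm that preprocessing produced the layout above is to check for a few of the listed paths. The sketch below is a convenience check, not part of the kit; it assumes it is run from the repository root so that `data` resolves to the directory shown above, and the helper name is hypothetical.

```python
from pathlib import Path

# A subset of the entries shown in the directory tree above
EXPECTED = [
    "annotation/annotation.csv",
    "annotation/training_data.csv",
    "annotation/testing_data.csv",
    "segmented_images",
    "train_test_split_images/train",
    "train_test_split_images/test",
    "vision_images",
]


def missing_entries(data_dir, expected=EXPECTED):
    """Return the expected relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).exists()]


missing = missing_entries("data")
print("Missing entries:", missing if missing else "none")
```

An empty result suggests the download and preprocessing steps completed; any listed path points to the step that needs to be rerun.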

<a id="run-using-jupyter-lab"></a> 
## Run Using Jupyter Lab

### 1. Setup Workflow
This step involves the installation of the following components:

- HF Fine-tune & Inference Optimization workflow
- Transfer Learning based on TLT workflow

In [None]:
!bash setup_workflows.sh

### 2. Model Building Process

To train the multi-model disease prediction models, use the `breast_cancer_prediction.py` script with the arguments outlined in the `disease_prediction_baremetal.yaml` configuration file, which has the following structure:

```
disease_prediction_baremetal.yaml
    
    |
    └──overwrite_training_testing_ids
    └──output_dir
    └──test_size
    └──write
    └──nlp
        |── finetune
        |── inference
        └── other parameters for HF fine-tune and inference optimization workflow
    └──vision
        |── finetune
        |── inference
        └── other parameters for the TLT-based vision workflow
```

The `disease_prediction_baremetal.yaml` file includes the following parameters:

- output_dir: specifies the location of the output model and inference results
- write: a container parameter that is set to false for bare metal
- nlp:
  - finetune: runs nlp fine-tuning
  - inference: runs nlp inference
  - additional parameters for the HF fine-tune and inference optimization workflow (more information available [here](https://github.com/intel/intel-extension-for-transformers/tree/main/workflows/hf_finetuning_and_inference_nlp/config))

- vision:
  - finetune: runs vision fine-tuning
  - inference: runs vision inference
  - additional parameters for the TLT-based vision workflow (more information available [here](https://github.com/IntelAI/transfer-learning/tree/f2e83f1614901d44d0fdd66f983de50551691676/workflows/disease_prediction))



To run only the fine-tuning process, execute the following command, which sets the 'finetune' parameter to True and the 'inference' parameter to False:

In [None]:
%run src/breast_cancer_prediction.py --config_file configs/disease_prediction_baremetal.yaml --finetune True --inference False
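
The command above passes the flags as the strings `True` and `False`. How `breast_cancer_prediction.py` parses them is not shown here; the sketch below is one hypothetical way to handle such flags with `argparse`. Note that `type=bool` would not work for this, because `bool("False")` evaluates to `True` in Python.

```python
import argparse


def str_to_bool(value):
    """Convert strings such as 'True' or 'false' to a real boolean."""
    text = str(value).strip().lower()
    if text in {"true", "1", "yes", "y"}:
        return True
    if text in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Hypothetical parser mirroring the flags used in the command above."""
    parser = argparse.ArgumentParser(description="Illustrative flag parser")
    parser.add_argument("--config_file", required=True)
    parser.add_argument("--finetune", type=str_to_bool, default=False)
    parser.add_argument("--inference", type=str_to_bool, default=False)
    return parser


args = build_parser().parse_args([
    "--config_file", "configs/disease_prediction_baremetal.yaml",
    "--finetune", "True",
    "--inference", "False",
])
print("finetune:", args.finetune, "inference:", args.inference)
```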

### 3. Running Inference
After the models are trained and saved using the script from step 2, load the NLP and vision models using the inference option. This applies a weighted ensemble method to generate a final prediction. To run inference only, set the 'inference' parameter to true and the 'finetune' parameter to false, then run the command provided in step 2.

> Alternatively, you can combine the training and inference processes into one execution by setting both the 'finetune' and 'inference' parameters to true and running the command provided in step 2.

In [None]:
%run src/breast_cancer_prediction.py --config_file configs/disease_prediction_baremetal.yaml --finetune False --inference True

<a id="expected-output"></a> 
## Expected Output
A successful execution of inference returns the confusion matrix of the sub-models and ensembled model, as shown in these example results: 
```
------ Confusion Matrix for Vision model ------
           Benign  Malignant  Normal  Precision
Benign       18.0     11.000   1.000      0.486
Malignant     5.0     32.000   0.000      0.615
Normal       14.0      9.000  25.000      0.962
Recall        0.6      0.865   0.521      0.652

------ Confusion Matrix for NLP model ---------
           Benign  Malignant  Normal  Precision
Benign     25.000      4.000     1.0      0.893
Malignant   3.000     34.000     0.0      0.895
Normal      0.000      0.000    48.0      0.980
Recall      0.833      0.919     1.0      0.930

------ Confusion Matrix for Ensemble --------
           Benign  Malignant  Normal  Precision
Benign     26.000      4.000     0.0      0.897
Malignant   3.000     34.000     0.0      0.895
Normal      0.000      0.000    48.0      1.000
Recall      0.867      0.919     1.0      0.939

```
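
The example values above are consistent with the following reading: rows are true labels, columns are predicted labels, the "Precision" column and "Recall" row hold per-class metrics, and the bottom-right cell is the overall accuracy. The sketch below recomputes these metrics from the ensemble counts; the function name is hypothetical and the layout interpretation is inferred from the numbers rather than stated by the kit.

```python
def summarize(matrix):
    """Return per-class precision, per-class recall and overall accuracy.

    matrix[i][j] counts samples with true class i predicted as class j.
    """
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    precision = [matrix[i][i] / col_sums[i] if col_sums[i] else 0.0 for i in range(n)]
    recall = [matrix[i][i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n)]
    accuracy = sum(matrix[i][i] for i in range(n)) / sum(row_sums)
    return precision, recall, accuracy


# Ensemble counts from the example output (Benign, Malignant, Normal)
ensemble = [
    [26, 4, 0],
    [3, 34, 0],
    [0, 0, 48],
]
precision, recall, accuracy = summarize(ensemble)
print("Precision:", [round(p, 3) for p in precision])
print("Recall:   ", [round(r, 3) for r in recall])
print("Accuracy: ", round(accuracy, 3))
```

The rounded results match the ensemble table: precision 0.897, 0.895, 1.0; recall 0.867, 0.919, 1.0; accuracy 0.939.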

<a id="result-visualization"></a> 
## Result Visualization
By utilizing the displayed widget, users can access a comprehensive overview that includes radiologists' annotation notes, corresponding subtracted CESM images, and ensemble predictions. Scroll down to see the selected results.

In [None]:
import importlib
import widget_manager
importlib.reload(widget_manager)