<a href="https://colab.research.google.com/github/mauro-nievoff/MultiCaRe_Dataset/blob/main/demos/Extended_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MedicalDatasetCreator Demo

The MedicalDatasetCreator is a class used to simplify the creation of customized subsets of the [MultiCaRe Dataset](https://zenodo.org/records/10079370).

This notebook is divided into three sections:

1. Downloading the whole MultiCaRe Dataset
2. Defining your Filters
3. Creating a Customized Dataset

Before starting, run the following cells to set everything up:

In [1]:
!git clone https://github.com/mauro-nievoff/MultiCaRe_Dataset

from MultiCaRe_Dataset.multicare import MedicalDatasetCreator

Cloning into 'MultiCaRe_Dataset'...
remote: Enumerating objects: 113, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 113 (delta 17), reused 9 (delta 9), pack-reused 87
Receiving objects: 100% (113/113), 1.34 MiB | 15.20 MiB/s, done.
Resolving deltas: 100% (57/57), done.


In [2]:
import os
import pandas as pd

Ok now, let's start!

## 1. Downloading the Whole MultiCaRe Dataset

When the MedicalDatasetCreator class is instantiated, the MultiCaRe Dataset is imported from Zenodo into a folder called 'whole_multicare_dataset' in the main directory. This main directory is called 'MultiCaRe' by default, but this name can be changed (e.g. we are naming it 'medical_datasets' in this example).

This step may take 5 to 10 minutes, and you only need to run it once (even if you intend to create multiple subsets).

In [4]:
mdc = MedicalDatasetCreator(directory = 'medical_datasets')

Downloading the MultiCaRe Dataset from Zenodo. This may take 5 to 10 minutes.
Importing and pre-processing the main files.
Done!


Let's see what we find in the dataset folder:

In [None]:
sorted(os.listdir('medical_datasets/whole_multicare_dataset'))

['PMC1',
 'PMC2',
 'PMC3',
 'PMC4',
 'PMC5',
 'PMC6',
 'PMC7',
 'PMC8',
 'PMC9',
 'abstracts.parquet',
 'captions_and_labels.csv',
 'case_images.parquet',
 'cases.parquet',
 'data_dictionary.csv',
 'metadata.parquet']

The first nine elements are folders that contain subfolders with images, and then there are some files (parquets and csvs) with the relevant data and metadata.
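
To get a rough sense of how the images are distributed across those folders, the files under each one can be counted. This is a minimal sketch using only the standard library; `count_files_per_folder` is a hypothetical helper (not part of the MultiCaRe package), and it assumes that every file found under a folder is an image.

```python
import os

def count_files_per_folder(dataset_dir):
    """Return {top-level folder name: number of files found anywhere beneath it}."""
    counts = {}
    for entry in sorted(os.listdir(dataset_dir)):
        folder_path = os.path.join(dataset_dir, entry)
        if not os.path.isdir(folder_path):
            continue  # skip the parquet and csv files
        # os.walk visits every subfolder, so nested image folders are included.
        counts[entry] = sum(len(files) for _, _, files in os.walk(folder_path))
    return counts

dataset_dir = 'medical_datasets/whole_multicare_dataset'
if os.path.isdir(dataset_dir):  # only runs once the dataset has been downloaded
    print(count_files_per_folder(dataset_dir))
```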

If you want a better idea of the contents of these files, you can check _data_dictionary.csv_ (see the cell below). This step is optional: the _create_dataset_ method has all you need to create and preprocess a customized subset.

In [None]:
data_dictionary = pd.read_csv('medical_datasets/whole_multicare_dataset/data_dictionary.csv')

data_dictionary.head()

| Unnamed: 0 | file | field | explanation |
|---|---|---|---|
| 0 | captions_and_labels.csv | file_id | Primary key for each row. Each row contains on... |
| 1 | captions_and_labels.csv | file | Name of the image file. The file is in the fol... |
| 2 | captions_and_labels.csv | main_image | Id from the original image (it corresponds to ... |
| 3 | captions_and_labels.csv | patient_id | Id of the patient, created combining the PMC o... |
| 4 | captions_and_labels.csv | license | License of the article. The possible values ar... |


## 2. Defining your Filters

The part of the whole dataset that will be included in your customized subset depends on the list of filters that you use. Some filters work at an article level (e.g. _min_year_ filters by the year of the case report article), others work at a patient level (e.g. _gender_) and some others work at an image level (e.g. _caption_).

Each filter is a dictionary with a 'field' key (name of the filter, such as 'min_year' or 'gender') and a 'string_list' key (relevant values that are used to filter). They can also have other keys sometimes, as we will explain in this section.

Note: To see which values can be used in a string list, display the corresponding attribute of the MedicalDatasetCreator instance (e.g. mdc.year_list or mdc.gender_list).
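
As a minimal sketch of the filter structure described above, the hypothetical helper `make_filter` (an illustration only, not part of the MultiCaRe package) builds a filter dictionary from a field name, a string list, and any optional keys. The optional keys themselves are covered in the list that follows.

```python
def make_filter(field, string_list, **extra_keys):
    """Build a filter dictionary (hypothetical helper, not part of multicare)."""
    # Values are stored as strings, as in the notebook's own examples (e.g. ['18']).
    filter_dict = {'field': field, 'string_list': [str(value) for value in string_list]}
    filter_dict.update(extra_keys)  # optional keys such as 'operator' or 'match_type'
    return filter_dict

example_filters = [
    make_filter('min_year', [2010]),
    make_filter('gender', ['Female', 'Male']),
    make_filter('keywords', ['diabetes'], match_type='partial_match'),
]
print(example_filters)
```

Writing the dictionaries by hand works just as well; a helper like this only guards against misspelling the 'field' and 'string_list' keys.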

- _min_year_: Minimum article year that is included in the subset. The dataset includes articles from 1990 on, so values lower than that (like string_list = ['1980']) don't make sense.

- _max_year_: Maximum article year that is included in the subset. The dataset includes articles until 2023.

- _license_: Article license types that are included. These are the possible license types:
  - Commercial use allowed: CC0, CC BY, CC BY-SA, CC BY-ND
  - Non-commercial use only: CC BY-NC, CC BY-NC-SA, CC BY-NC-ND
  - Other: author_manuscript, NO-CC CODE
  
  So, if you intend to use the dataset for commercial purposes, you should use string_list = ['CC0', 'CC BY', 'CC BY-SA', 'CC BY-ND'].

- _keywords_: This filter considers the keywords from the article metadata
  - There are around 87K keywords, which can be displayed using the .keyword_list attribute.
  - You can add the key 'operator' to this type of filter when you include more than one keyword in your string_list. The value for 'operator' can be 'all' (default), 'any' or 'none', depending on whether the article metadata should include all the listed keywords, at least one of them, or none of them, respectively.
  - You can add the key 'match_type', which can be either 'full_match' (by default) or 'partial_match'. For example, the filter {'field': 'keywords', 'string_list': ['diabetes'], 'match_type': 'partial_match'} will retrieve all the cases with at least one keyword that contains the substring 'diabetes'. If 'full_match' were used, the filter would only retrieve cases which include the keyword 'diabetes' (exact match).

- _mesh_terms_: This filter considers the MeSH terms from the article metadata
  - There are more than 38K MeSH terms, which can be displayed using the .mesh_term_list attribute.
  - You can add the keys 'operator' and 'match_type' (see _keywords_).

- _min_age_: Minimum patient age that is included in the subset.

- _max_age_: Maximum patient age that is included in the subset.

- _gender_: Gender classes that should be included in the subset. The possible values are: 'Female', 'Male', 'Transgender' and 'Unknown'.

- _case_strings_: This filter looks for clinical cases that contain specific strings. It does not differentiate lowercase from uppercase. The key 'operator' can be added (see _keywords_).

- _caption_: This filter looks for image captions that contain specific strings. The key 'operator' can be added (see _keywords_). If you want this filter to differentiate between lowercase and uppercase, add the key 'matching_case' with the value True.

- _label_: This field refers to generic labels used to tag images. There are 19 of them, including Histology, Site, Position, Laterality, Image_Finding, Pathology_Test, Imaging_Test, Problem, Imaging_Technique, Other_Image_Type, EKG and EKG_Fiding, and also some less relevant classes (Assertion_Absent, Measurement_Value, Other, Modifier, Measurement_Unit, Negative_Entity_Class and Assertion_Present_And_Absent). The key 'operator' can be added (see _keywords_).

- _normalized_extractions_: This field refers to specific labels used to tag images (such as 'ct', 'bone' or 'h&e'), created by normalizing text extractions from captions. There are 176 of them, which are displayed in the cell below. The key 'operator' can be used here as well (see _keywords_).

In [31]:
print('Possible values for the normalized_extraction filter:\n')
for key in mdc.normalized_extraction_list.keys():
  if key != 'normalized_extractions': # key containing the full list, which is redundant here.
    print(f"'{key}': {mdc.normalized_extraction_list[key]}")

Possible values for the normalized_extraction filter:

'pathology_test': ['h&e', 'methenamine_silver', 'immunofluorescence', 'immunoreactivity', 'immunostaining', 'ihc', 'congo_red', 'ziehl_neelsen', 'masson_trichrome', 'culture', 'giemsa', 'acid_fast', 'pas', 'ki67', 'fish', 'papanicolaou', 'nuclear_staining', 'gram', 'red_stain', 'van_gieson', 'cytoplasmatic_staining', 'alcian_blue', 'green_birefringence', 'blue_stain', 'methylene_blue', 'cotton_blue', 'ck_5/6']
'image_type': ['x_ray', 'ct', 'echocardiogram', 'mri', 'ultrasound', 'cta', 'pet', 'ekg', 'angiography', 'gastroscopy', 'mra', 'colonoscopy', 'dsa', 'endoscopy', 'eeg', 'mammography', 'scintigraphy', 'fundus_photograph', 'oct', 'cystoscopy', 'mrcp', 'broncoscopy', 'opg', 'venogram', 'egd', 'emg', 'myelogram', 'autofluorescence', 'laryngoscopy', 'arthroscopy', 'ercp', 'spect', 'tractography']
'image_technique': ['t2', 'contrast', 'tracer', 't1', 'ir', 'flair', 'doppler', 'dwi', 'fat_suppression', 'intensity_projection', 'mip',
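
To make the 'operator' and 'match_type' options from the _keywords_ filter concrete, here is a small sketch of the semantics as described in this section. It is an illustration under assumptions, not the code used by MedicalDatasetCreator: the function name `entry_matches` is hypothetical, and case handling is ignored.

```python
def entry_matches(entry_terms, string_list, operator='all', match_type='full_match'):
    """Illustrate the 'operator' and 'match_type' semantics described above.

    Assumption-based sketch, not the implementation used by the multicare module.
    """
    def term_found(target):
        if match_type == 'partial_match':
            return any(target in term for term in entry_terms)  # substring match
        return target in entry_terms  # exact match

    hits = [term_found(target) for target in string_list]
    if operator == 'all':
        return all(hits)
    if operator == 'any':
        return any(hits)
    if operator == 'none':
        return not any(hits)
    raise ValueError(f"Unknown operator: {operator!r}")

article_keywords = ['type 2 diabetes', 'neuropathy']
print(entry_matches(article_keywords, ['diabetes']))                               # False
print(entry_matches(article_keywords, ['diabetes'], match_type='partial_match'))   # True
print(entry_matches(article_keywords, ['neuropathy', 'asthma'], operator='any'))   # True
print(entry_matches(article_keywords, ['neuropathy', 'asthma'], operator='none'))  # False
```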

## 3. Creating a Customized Dataset

Now that you know everything about filters, you can create your own subset by using the .create_dataset() method, which takes the following parameters:
- dataset_name (str): Name of the new subset. The data will be saved in a folder with this name inside the directory defined when instantiating the MedicalDatasetCreator class.
- filter_list (list): List of filter dictionaries.
- dataset_type (str): Required type of dataset. It can be either 'text', 'image', 'multimodal' (default value) or 'case_series'. All the dataset types will include a readme.txt, a reference_list.json (with citation information from case report articles) and an article_metadata.json. Apart from this, each dataset contains different files:
  - text: The dataset contains a csv file with case_id, pmcid, case_text, age and gender of the patient.
  - image: The dataset contains a folder with images, and a json file with file_id, file_path, normalized_extractions, labels, caption, raw_image_link (from PMC), case_id, license, and split_during_preprocessing (True if the raw image included more than one sub-image).
  - multimodal: The dataset contains a combination of the files from text and image datasets.
  - case_series: The dataset contains a folder with images (there is one folder per patient), and a csv file with cases including case_id, pmcid, case_text, age, gender, link to the case report article, amount_of_images for the specific case, and image_folder.

If you want to create multiple subsets, you just need to use the .create_dataset() method multiple times using different dataset names and filters (there is no need to instantiate the MedicalDatasetCreator class each time).
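
The loop below sketches that pattern. The helper `build_subsets` and both subset names are hypothetical; the keyword arguments passed to .create_dataset() are the ones listed above, and the filter values come from Section 2.

```python
def build_subsets(creator, subset_specs, dataset_type='multimodal'):
    """Call creator.create_dataset once per (name, filters) pair."""
    for dataset_name, filter_list in subset_specs.items():
        creator.create_dataset(dataset_name=dataset_name,
                               filter_list=filter_list,
                               dataset_type=dataset_type)

# Hypothetical subset names; the filter fields and values are taken from Section 2.
subset_specs = {
    'commercial_use_dataset': [
        {'field': 'license', 'string_list': ['CC0', 'CC BY', 'CC BY-SA', 'CC BY-ND']},
    ],
    'recent_female_dataset': [
        {'field': 'min_year', 'string_list': ['2015']},
        {'field': 'gender', 'string_list': ['Female']},
    ],
}

# With the instance created in Section 1:
# build_subsets(mdc, subset_specs, dataset_type='text')
```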

## Example

Let's create a multimodal subset of adult male patients with brain MRI images and cancer-related mentions in both the case text and the image captions. This should only take a few minutes.

In [34]:
filters = [{'field': 'min_age', 'string_list': ['18']},
           {'field': 'gender', 'string_list': ['Male']},
           {'field': 'case_strings', 'string_list': ['tumor', 'cancer', 'carcinoma'], 'operator': 'any'},
           {'field': 'caption', 'string_list': ['metastasis', 'tumor', 'mass'], 'operator': 'any'},
           {'field': 'normalized_extractions', 'string_list': ['mri', 'brain']}]

In [35]:
mdc.create_dataset(dataset_name = 'male_brain_tumor_dataset', filter_list = filters, dataset_type = 'multimodal')

The male_brain_tumor_dataset was successfully created!

Suggestions:
- Image captions: If you intend to use them, consider prioritizing images with 'split_during_preprocessing' == False.
  Many captions needed to be split during caption preprocessing, and the resulting strings may have some minor issues such as extra special characters or wrong capitalization.
- Image labels: They were created programatically based on image captions (they were not annotated manually).
  If you intend to use image labels, consider having them manually reviewed by a medical doctor or an SME.


In [42]:
print("The subset includes:")
print(f"  - Amount of patients: {len(mdc.filtered_cases)}")
print(f"  - Amount of images: {len(mdc.filtered_image_metadata_df)}")
print(f"  - Subset contents: {os.listdir(f'{mdc.directory}/{mdc.dataset_name}')}")

The subset includes:
  - Amount of patients: 10243
  - Amount of images: 352
  - Subset contents: ['case_report_citations.json', 'image_metadata.json', 'images', 'cases.csv', 'readme.txt', 'article_metadata.json']
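
The suggestions printed by .create_dataset() recommend prioritizing images with 'split_during_preprocessing' == False. The sketch below shows one way to select them. It assumes that image_metadata.json parses into a list of dictionaries (one per image) and that the flag is stored as a JSON boolean; the layout of that file is not shown in this notebook, so check it before relying on this. `load_unsplit_images` is a hypothetical helper.

```python
import json
import os

def load_unsplit_images(metadata_path):
    """Return the image records whose caption was not split during preprocessing.

    Assumes the JSON file holds a list of dictionaries (one per image).
    """
    with open(metadata_path, encoding='utf-8') as metadata_file:
        records = json.load(metadata_file)
    return [record for record in records
            if record.get('split_during_preprocessing') is False]

metadata_path = 'medical_datasets/male_brain_tumor_dataset/image_metadata.json'
if os.path.isfile(metadata_path):  # only runs after the subset has been created
    unsplit = load_unsplit_images(metadata_path)
    print(f"Images with unsplit captions: {len(unsplit)}")
```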


And that's it! Enjoy your customized datasets! 🙂