# Creating an Image Classification Dataset Based on MultiCaRe Subsets

First, install the `multiversity` library, import the `MedicalDatasetCreator` class, and instantiate it. Instantiating the class downloads the MultiCaRe dataset into the given directory.

In [None]:
%%capture
!pip install multiversity

In [2]:
from multiversity.multicare_dataset import MedicalDatasetCreator

In [3]:
mdc = MedicalDatasetCreator(directory = 'medical_datasets')

Downloading the MultiCaRe Dataset from Zenodo. This may take approximately 5 minutes.
Importing and pre-processing the main files.
Done!


The code below builds an image classification dataset by creating one subset per class. If every subset should share the same constraints (for example, a specific image type), add a common filter list to the `dataset_dict` parameter under the `common_filter_list` key. To understand how to use filters correctly, refer to [this demo](https://github.com/mauro-nievoff/MultiCaRe_Dataset/blob/main/Demos/customized_subset_creation.ipynb).

In [4]:
mri_filters = [
  {'field': 'min_year', 'string_list': ['2015']},
  {'field': 'min_age', 'string_list': ['18']},
  {'field': 'case_strings', 'string_list': ['lung cancer', 'lung carcinoma'], 'operator': 'any'},
  {'field': 'label', 'string_list': ['mri', 'head']},
  {'field': 'caption', 'string_list': ['metastasis', 'metastases'], 'operator': 'any'}
]

In [5]:
classifier_dict = {
    'dataset_name': 'gender_classifier',
    'common_filter_list': mri_filters,
    'class_subsets': [
        {'class': 'female',
         'filter_list': [{'field': 'gender', 'string_list': ['Female']}]},
        {'class': 'male',
         'filter_list': [{'field': 'gender', 'string_list': ['Male']}]}
    ]
}
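
To make the relationship between the common filters and the per-class filters concrete, the sketch below expands a dictionary of this shape into one combined filter list per class. It assumes that the common filters are simply applied together with each class's own filters; that is an assumption about the behavior, not the library's actual implementation, and `expand_class_filters` and `example_dict` are hypothetical names used only for illustration.

```python
def expand_class_filters(dataset_dict):
    """Return {class_name: combined_filter_list} for a classifier dictionary.

    Assumption: every class subset is filtered by the common filters
    plus its own filters. This mirrors the dictionary layout only and
    does not call the multiversity library.
    """
    common = dataset_dict.get('common_filter_list', [])
    return {
        subset['class']: list(common) + list(subset['filter_list'])
        for subset in dataset_dict['class_subsets']
    }


# Small example with the same structure as the classifier dictionary.
example_dict = {
    'dataset_name': 'gender_classifier',
    'common_filter_list': [{'field': 'min_age', 'string_list': ['18']}],
    'class_subsets': [
        {'class': 'female',
         'filter_list': [{'field': 'gender', 'string_list': ['Female']}]},
        {'class': 'male',
         'filter_list': [{'field': 'gender', 'string_list': ['Male']}]},
    ],
}

combined = expand_class_filters(example_dict)
for class_name, filters in combined.items():
    print(class_name, [f['field'] for f in filters])
```

Listing the combined filters per class this way can help spot a class whose filters contradict the common ones before any subset is created.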


In [6]:
# Set keep_label_columns to True to also keep the label columns from the original dataset ('image_type', etc.).
mdc.create_image_classification_dataset(dataset_dict = classifier_dict,
                                        keep_label_columns = False)

The gender_classifier_female was successfully created!
The gender_classifier_male was successfully created!


In [10]:
import pandas as pd

classification_dataset = pd.read_csv('medical_datasets/gender_classifier/gender_classifier.csv')

The result is a dataframe that contains the path to each image and its class.

In [11]:
classification_dataset.head()

Unnamed: 0,file_id,file,file_path,class
0,file_0007273,PMC10295819_encephalitis-2022-00122f1_A_1_6.webp,medical_datasets/gender_classifier_female/imag...,female
1,file_0007274,PMC10295819_encephalitis-2022-00122f1_B_2_6.webp,medical_datasets/gender_classifier_female/imag...,female
2,file_0007275,PMC10295819_encephalitis-2022-00122f1_C_3_6.webp,medical_datasets/gender_classifier_female/imag...,female
3,file_0007276,PMC10295819_encephalitis-2022-00122f1_D_4_6.webp,medical_datasets/gender_classifier_female/imag...,female
4,file_0007277,PMC10295819_encephalitis-2022-00122f1_E_5_6.webp,medical_datasets/gender_classifier_female/imag...,female
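
Because several images can come from the same article (the file names above share the `PMC10295819` prefix), a random row-level split could place images from one case in both the training and validation sets. A minimal sketch of an article-level split follows, assuming the article ID is the part of the `file` value before the first underscore; `split_by_article` and the toy file names are hypothetical, and only pandas is used.

```python
import pandas as pd


def split_by_article(df, val_fraction=0.2, seed=0):
    """Split rows into train/validation so no article appears in both.

    Assumption: the article ID is the text before the first underscore
    in the 'file' column (e.g. 'PMC10295819').
    """
    article_ids = df['file'].str.split('_', n=1).str[0]
    # Shuffle the unique article IDs reproducibly.
    unique_ids = article_ids.drop_duplicates().sample(frac=1.0, random_state=seed)
    n_val = max(1, int(round(len(unique_ids) * val_fraction)))
    val_ids = set(unique_ids.iloc[:n_val])
    is_val = article_ids.isin(val_ids)
    return df[~is_val], df[is_val]


# Toy rows that imitate the layout of the classification dataframe.
toy = pd.DataFrame({
    'file': ['PMC1_a_1_2.webp', 'PMC1_b_2_2.webp', 'PMC2_a_1_1.webp',
             'PMC3_a_1_1.webp', 'PMC4_a_1_1.webp'],
    'class': ['female', 'female', 'male', 'male', 'female'],
})

train_df, val_df = split_by_article(toy, val_fraction=0.25)

# Class counts per split, to check that both classes remain represented.
print(train_df['class'].value_counts())
print(val_df['class'].value_counts())
```

The same function could be applied to `classification_dataset`; checking the class counts of each split afterwards is worthwhile, since an article-level split does not guarantee balanced classes.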
