This repository hosts the Python script used to perform the exploratory data analysis (EDA) of the Hospital Italiano de Buenos Aires skin lesion dataset, shared through the ISIC Archive. The project was conducted by the Dermatology Department and the Artificial Intelligence and Data Science program of the institution's Health Informatics Department.
The dataset comprises information collected from 623 patients seen by expert dermatologists at the Hospital Italiano de Buenos Aires (Buenos Aires, Argentina). It contains 1,616 images (1,270 contact-polarized dermoscopy images and 346 clinical images) captured from 1,246 lesions corresponding to the most frequent diagnoses observed at the institution.
The dataset is registered as a collection in the ISIC Archive and can be downloaded in two ways:
- By visiting the ISIC Archive website or accessing the collection DOI https://doi.org/10.34970/587329.
- By using the official command line tool for interacting with the ISIC Archive (for more information, visit https://github.com/ImageMarkup/isic-cli):
# Install tool
pip install isic-cli
# Find the collection ID
isic collection list # Hospital Italiano de Buenos Aires Skin Lesions (ID 251)
# Download the collection metadata
isic metadata download --collections 251
# Download the collection images (and save them in the 'images/' folder).
isic image download --collections 251 images/
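The same download steps can also be driven from Python, for instance from a notebook cell. The following is only a minimal sketch: it wraps the `isic` commands listed in this README in `subprocess` calls, and the helper names `build_download_commands` and `run_downloads` are hypothetical, not part of isic-cli. Where the metadata ends up depends on the tool's own defaults, which this sketch does not change.

```python
import shutil
import subprocess

# Collection ID as reported by `isic collection list` in this README.
COLLECTION_ID = "251"


def build_download_commands(collection_id: str, image_dir: str = "images/") -> list[list[str]]:
    """Return the isic-cli invocations from this README as argument lists."""
    return [
        ["isic", "metadata", "download", "--collections", collection_id],
        ["isic", "image", "download", "--collections", collection_id, image_dir],
    ]


def run_downloads(collection_id: str = COLLECTION_ID) -> None:
    """Run each command in turn, stopping at the first failure."""
    if shutil.which("isic") is None:
        raise RuntimeError("isic-cli not found; install it with `pip install isic-cli`")
    for command in build_download_commands(collection_id):
        # check=True raises CalledProcessError if the CLI exits with an error.
        subprocess.run(command, check=True)
```

Calling `run_downloads()` runs both commands for collection 251. Passing the arguments as lists avoids shell quoting issues.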
In this Python notebook, we downloaded the collection from the ISIC Archive using the command line tool and carried out the exploratory data analysis (EDA). We evaluated the distribution of patients, lesions and images, and analyzed the characteristics of patients (age, sex, personal and family history of melanoma) and lesions (diagnosis, type of diagnostic confirmation, location).
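As a rough illustration of the first step of such an analysis, the sketch below counts patients, lesions and images from a metadata table. It is not the notebook's code. The column names (`isic_id`, `patient_id`, `lesion_id`, `image_type`, `diagnosis`) and the four sample rows are assumptions made up for the example and should be replaced with the columns of the downloaded metadata file.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for the downloaded metadata file.
# Column names are assumptions, not the confirmed ISIC schema.
SAMPLE_CSV = """isic_id,patient_id,lesion_id,image_type,diagnosis
IMG_1,P1,L1,dermoscopic,nevus
IMG_2,P1,L1,clinical,nevus
IMG_3,P1,L2,dermoscopic,melanoma
IMG_4,P2,L3,dermoscopic,basal cell carcinoma
"""


def summarize(rows):
    """Count patients, lesions and images, with per-type and per-diagnosis tallies."""
    rows = list(rows)
    # Keep one diagnosis per lesion so lesions with several images count once.
    lesions = {}
    for row in rows:
        lesions.setdefault(row["lesion_id"], row["diagnosis"])
    return {
        "patients": len({row["patient_id"] for row in rows}),
        "lesions": len(lesions),
        "images": len(rows),
        "images_by_type": Counter(row["image_type"] for row in rows),
        "lesions_by_diagnosis": Counter(lesions.values()),
    }


summary = summarize(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(summary["patients"], summary["lesions"], summary["images"])  # 2 3 4
```

Diagnoses are tallied per lesion rather than per image because a lesion can have both a dermoscopy and a clinical image, and counting rows would weight such lesions twice.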
In this Python notebook, we compared the dataset presented here with publicly available datasets.