Skip to content
Tools for workup of the HAM10000 dataset
Branch: master
Clone or download
Type Name Latest commit message Commit time
Failed to load latest commit information.
extract change link&license Apr 2, 2018
filter change link&license Apr 2, 2018
unify Initial commit Mar 27, 2018 Update Sci Data Bibtex citation Jul 23, 2018

HAM 10000 Dataset Tools

Creative Commons Lizenzvertrag

This repository gives access to the tools created and used for assembling the training dataset for the proposed HAM-10000 (Human Against Machine with 10000 training images) study, which extending part 3 of the ISIC 2018 challenge. The dataset itself is available for download at the Harvard dataverse or the ISIC-archive.


Following technique was used to leverage image data from PowerPoint slides, by extracting and ordering them with unique identifiers:


To more efficiently order large image sets of containing non-annotated overview (clinic), closeup (macro) and dermatoscopic (dsc) images, we fine-tuned a neural network to distinguish between those types automatically.

1. Annotation

  • filter/ An OpenCV based script to quickly annotate images within a subfolder into different image types. Results are stored in a CSV-file with the option to abort-and-resume annotation.

2. Training

Training was performed in Caffe / DIGITS abstracting away many training variables. We gained 1501 annotated images with the tool above and proceeded to training: GoogLeNet pretrained on ImageNet (taken from the NVIDIA DIGITS 5 Model Store) was fine-tuned on three classes for 20 epochs, landing at a final top-1 accuracy on the test-set of 98.68% (one dermatoscopic image classified as macro). The trained model files are provided in ./classify/caffe_model/*

3. Inference


Pathologic diagnoses in clinical practice are often non-standardized and verbose. The notebook below depicts our boilerplate used on different datasets to merge raw string data into a clean set of classes.

  • unify/unify_diagnoses.ipynb uses the pandas library to clean and unify diagnosis texts of dermatologic lesions into a confined set of diagnoses other or ambiguous classes.
    Note: The notebook contains only a subset of example terms for display purposes, as regular expressions are optimized to fit a given dataset. Therefore, most commonly the ones given will not be ready to be applied on a new set out of the box. Importantly, also the order of relabeling diagnoses matter, so we highly recommend manual checkup of relabeled diagnoses and stepwise iteration when applying to a new dataset.


To normalise image format without squeezing, one Bash/ImageMagick command was applied to final images before data submission to the archive:

find . -type f \( -iname \*.jpg -o -iname \*.jpeg -o -iname \*.tiff -o -iname \*.tif \) -print0 | xargs -0 -n1 mogrify -strip -rotate "90<" -resize "600x450^" -gravity center -crop 600x450+0+0 -density 72 -units PixelsPerInch -format jpg -quality 100


If tools or data helped your research, please cite:

  • Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018).
  author    = {Philipp Tschandl and
               Cliff Rosendahl and
               Harald Kittler},
  title     = {The {HAM10000} dataset, a large collection of multi-source dermatoscopic
               images of common pigmented skin lesions},
  journal   = {Sci. Data},
  volume    = {5},
  year      = {2018},
  pages     = {180161},
  doi       = {10.1038/sdata.2018.161}
You can’t perform that action at this time.