# Pre-processing of HAM10000 data for use in the Peltarion platform

This notebook guides you through the process of downloading and preparing the HAM10000 data set for use in the Peltarion platform.

## Downloading the data set
Before running this notebook you need to download the HAM10000 images and their corresponding metadata from this [ISIC webpage](https://www.isic-archive.com/#!/topWithHeader/onlyHeaderTop/gallery). Under __DATABASE ATTRIBUTES__ select __DATASET: HAM10000__. At the top of the page you should now see __Filtered images: 10015__. At the top right of the page click __Download as zip__ and select __Download Images and Metadata__. The downloaded zip file contains 10015 jpg images and a metadata.csv file with information about each of the images. Unzip the file into a local folder on your computer.
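
Before moving on, it can help to confirm that the archive unpacked completely. The sketch below counts the `.jpg` files in the image folder and checks that the metadata file is present. The helper name `check_download` is hypothetical, and the expected count of 10015 is taken from the description above.

```python
import os

def check_download(metadata_path, image_path, expected_images=10015):
    """Return (ok, n_images).

    ok is True when the metadata file exists and the image folder
    holds exactly `expected_images` files ending in .jpg.
    """
    has_metadata = os.path.isfile(metadata_path)
    n_images = 0
    if os.path.isdir(image_path):
        n_images = sum(1 for name in os.listdir(image_path)
                       if name.lower().endswith(".jpg"))
    return has_metadata and n_images == expected_images, n_images
```

Calling `check_download(metadata_path, image_path)` with the paths configured in the next section returns `(True, 10015)` for a complete download.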



In [1]:
import os
import numpy as np
import pandas as pd
from PIL import Image
np.random.seed(1) # Set seed explicitly to get a deterministic training/validation data split



## Configuration

### Configure paths

Modify __metadata_path__ and __image_path__ to match the location where you unzipped the downloaded file.
Set __out_path__ to the directory where this notebook should write its output. The directory must not exist yet: the `os.mkdir` call below creates it and raises `FileExistsError` if it is already there.


In [2]:
metadata_path = "/home/asa/demo/ISIC-images/metadata.csv"
image_path =  "/home/asa/demo/ISIC-images/HAM10000"
out_path = "/home/asa/demo/workshop"

In [3]:
os.mkdir(out_path)


### Specify options for the pre-processing
__TRAINING_PERCENT__  
Percentage of the images that should be used for training (the rest will be used for validation)

__IMAGE_SIZE__  
(width, height) Output size of every image after resizing. The original HAM10000 images are 600x450 pixels, which is large. To fit a reasonably sized batch (16-64 images) in GPU memory during training, you can resize them to smaller dimensions, for example 200x150, before importing them into the Peltarion platform. The cell below uses (60, 45), which keeps the original 4:3 aspect ratio.

__BALANCE_CLASSES__  
True or False. The classes in this data set are highly imbalanced, and training with balanced classes usually gives better model performance. Set this parameter to True to create a data set in which all 7 classes are equally represented in the training data.

In [None]:
TRAINING_PERCENT = 80
IMAGE_SIZE = (60,45)
BALANCE_CLASSES = True
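
To get a feel for how IMAGE_SIZE affects memory, the sketch below estimates the size of one batch of input images and checks whether a target size keeps the original aspect ratio. It assumes RGB images stored as 32-bit floats, which is an assumption made for illustration and not a documented property of the platform. The helper names are hypothetical. The estimate covers the input batch only; the activations of the model usually need considerably more memory.

```python
def batch_input_bytes(image_size, batch_size, channels=3, bytes_per_value=4):
    """Bytes needed to hold one batch of input images.

    Assumes RGB images stored as 32-bit floats (an assumption, not a
    documented property of the platform).
    """
    width, height = image_size
    return width * height * channels * bytes_per_value * batch_size


def keeps_aspect_ratio(image_size, original_size=(600, 450)):
    """True when image_size has the same width:height ratio as the original."""
    width, height = image_size
    orig_width, orig_height = original_size
    return width * orig_height == height * orig_width


for size in [(600, 450), (200, 150), (60, 45)]:
    mib = batch_input_bytes(size, batch_size=64) / 2**20
    print(size, f"{mib:.1f} MiB per batch of 64,", "4:3 kept:", keeps_aspect_ratio(size))
```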

## Read and clean up metadata

Keep only the columns that we will use, shorten the column names, set missing ages to 0, and replace the remaining NaNs with the string "unknown".

In [None]:
metadata = pd.read_csv(metadata_path)
cols = ["name", 
        "meta.clinical.age_approx", 
        "meta.clinical.benign_malignant",
        "meta.clinical.diagnosis",
        "meta.clinical.diagnosis_confirm_type",
        "meta.clinical.sex", 
        ]
cols_renamed = {"name": "image",
                "meta.clinical.age_approx": "age",
                "meta.clinical.benign_malignant": "benign_malignant",
                "meta.clinical.diagnosis": "diagnosis",
                "meta.clinical.diagnosis_confirm_type": "diagnosis_confirm_type",
                "meta.clinical.sex": "sex"
               }
metadata = metadata[cols]
metadata = metadata.rename(index=str, columns=cols_renamed)
metadata['image'] = metadata["image"]+".jpg"
metadata["age"] = metadata["age"].fillna(0)
metadata = metadata.fillna("unknown")

In [None]:
#metadata.groupby("benign_malignant").count()[["image"]]

In [None]:
metadata.groupby("diagnosis").count()[["image"]]

In [None]:
#metadata.groupby(["diagnosis", "benign_malignant"]).count()[["image"]]

## Resize images and write to out_path

In [None]:
num_samples = metadata.shape[0]
print("Starting the processing of " + str(num_samples) + " images. This can take a few minutes.")
for idx, row in metadata.iterrows():
    # The index holds strings after the rename above, so convert before doing arithmetic
    if int(idx) % 1000 == 0:
        print(str(idx) + " samples out of " + str(num_samples) + " samples processed")
    img_name = row["image"]
    # Open the image, resize it to IMAGE_SIZE (width, height) and save it under the same file name
    im = Image.open(os.path.join(image_path, img_name))
    im = im.resize(IMAGE_SIZE)
    im.save(os.path.join(out_path, img_name))
print("Done!")

## Split into train/val data according to TRAINING_PERCENT parameter

In [None]:
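# randint draws integers from 0 to 99, so "< TRAINING_PERCENT" selects TRAINING_PERCENT out of 100 values
metadata['subset'] = np.where(np.random.randint(0, 100, metadata.shape[0]) < TRAINING_PERCENT, 'train', 'val')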
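
The split compares a random integer from 0 to 99 with TRAINING_PERCENT, so the comparison operator decides the expected share of training rows. The plain-Python sketch below (the helper `train_share` is hypothetical) counts how many of the 100 possible draws land in the training set: a strict `<` selects exactly TRAINING_PERCENT of them, while an inclusive `<=` selects one more.

```python
def train_share(threshold, inclusive):
    """Fraction of the integers 0..99 that land in the training set."""
    draws = range(100)  # the values np.random.randint(0, 100) can produce
    if inclusive:
        selected = [d for d in draws if d <= threshold]
    else:
        selected = [d for d in draws if d < threshold]
    return len(selected) / 100


print(train_share(80, inclusive=True))   # 0.81
print(train_share(80, inclusive=False))  # 0.8
```

Because each row is assigned independently at random, the share observed in the actual data only approximates this expected value.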

In [None]:
#metadata.groupby(["subset"]).count()

## Perform class balancing

This step balances the 7 classes by oversampling: rows from the rarer classes are duplicated at random until every class is as large as the most common one.
Oversampling is applied only to the training data, not to the validation data.
The validation data keeps the true class distribution so that the validation metrics remain representative.

In [None]:
#metadata.groupby(["subset","diagnosis"]).count()[["image"]]

In [None]:
if BALANCE_CLASSES:
    train = metadata[metadata['subset']=='train']
    # Size of the largest training class; every other class is topped up to this size
    max_size = train['diagnosis'].value_counts().max()
    lst = [metadata]
    for _, group in train.groupby('diagnosis'):
        # Draw the missing rows with replacement from the class itself
        lst.append(group.sample(max_size-len(group), replace=True))
    metadata = pd.concat(lst)
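
The same oversampling idea can be sketched in plain Python without pandas. This is an illustration of the technique on toy labels, not the notebook's own code: the function name `oversample` and the example labels are hypothetical.

```python
import random
from collections import Counter

def oversample(labels, seed=1):
    """Return indices into `labels`, with rare classes repeated at random
    until every class occurs as often as the most common one."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    max_size = max(len(idxs) for idxs in by_class.values())
    balanced = list(range(len(labels)))  # keep every original row once
    for idxs in by_class.values():
        # Top up the class with draws (with replacement) from its own rows
        balanced += rng.choices(idxs, k=max_size - len(idxs))
    return balanced


# Toy labels for illustration only
labels = ["nevus"] * 6 + ["melanoma"] * 2 + ["dermatofibroma"]
counts = Counter(labels[i] for i in oversample(labels))
print(counts)  # every class now has 6 entries
```

Because the extra rows are duplicates, the balanced training set contains no new information; it only changes how often each class is seen during training.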

In [None]:
metadata.groupby(["subset","diagnosis"]).count()[["image"]]

## Write metadata to index.csv file


In [None]:
metadata.to_csv(os.path.join(out_path, "index.csv"), index=False)
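
Before bundling, it may be worth checking that every image listed in index.csv was actually written to the output directory. The sketch below uses only the standard library. The helper name `missing_images` is hypothetical, and the column name `image` follows the renaming done earlier in this notebook.

```python
import csv
import os

def missing_images(out_dir, index_name="index.csv", column="image"):
    """Return the sorted file names listed in the index that are absent from out_dir."""
    with open(os.path.join(out_dir, index_name), newline="") as f:
        listed = {row[column] for row in csv.DictReader(f)}
    return sorted(name for name in listed
                  if not os.path.isfile(os.path.join(out_dir, name)))
```

An empty list from `missing_images(out_path)` means that every row in the index points to an existing file. Oversampled rows repeat the same file name, so they are counted once.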

## Create the zip file that can be uploaded to the platform

This final step is performed outside of this notebook.
Bundle the produced index.csv file and the resized images into a single zip file.
You can do this in a terminal window by navigating to the __out_path__ that you specified above and running the command below:

zip mybundle.zip -r ./
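
If the `zip` command is not available, the bundle can also be created with Python's standard `zipfile` module. The sketch below is one possible way to do it, assuming that all files sit directly in the output directory, as they do after the steps above. The function name `make_bundle` is hypothetical.

```python
import os
import zipfile

def make_bundle(out_dir, bundle_path):
    """Zip every file in out_dir (flat, no subfolders) into bundle_path."""
    bundle_abs = os.path.abspath(bundle_path)
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for name in sorted(os.listdir(out_dir)):
            full = os.path.join(out_dir, name)
            # Skip subfolders, and skip the archive itself if it lives in out_dir
            if os.path.isfile(full) and os.path.abspath(full) != bundle_abs:
                bundle.write(full, arcname=name)
    return bundle_path
```

For example, `make_bundle(out_path, os.path.join(out_path, "mybundle.zip"))` writes the archive next to the images, with index.csv and the images at the top level of the zip file.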