# Melanoma Classification - Data Preparation
<b>Data Source:</b>
* https://challenge.isic-archive.com/data/#2019
* https://challenge.isic-archive.com/data/#2020

<b>Contents of this notebook</b>
1. Load groundtruth data
2. Combine 2019 and 2020 ISIC dataset
3. Remove duplicates 
4. Train Test Validation split
5. Save groundtruth data
6. Download and sort images 

Please note that nothing is saved to disk before step 5. For performance reasons, steps 1 to 4 operate only on the groundtruth csv files; the resulting split is then applied to the images in a single pass in step 6.

## 1. Load groundtruth data

In [1]:
import pandas as pd

print("Loading ISIC 2020 Groundtruth")
url_2020 = "https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_GroundTruth_v2.csv"
df_2020 =  pd.read_csv(url_2020)

print("Loading ISIC 2019 Groundtruth")
url_2019 = "https://isic-challenge-data.s3.amazonaws.com/2019/ISIC_2019_Training_GroundTruth.csv"
df_2019 =  pd.read_csv(url_2019)

Loading ISIC 2020 Groundtruth
Loading ISIC 2019 Groundtruth


## 2. Combine 2019 and 2020 ISIC dataset

#### Compare available columns

In [2]:
def print_avail_columns(name, df):
    col = df.columns.values.tolist()
    print("%s:\t %i columns available: %s" %(name, len(col), col))

print("Before transformation:")
print_avail_columns("2020 ISIC", df_2020)
print_avail_columns("2019 ISIC", df_2019)

Before transformation:
2020 ISIC:	 9 columns available: ['image_name', 'patient_id', 'lesion_id', 'sex', 'age_approx', 'anatom_site_general_challenge', 'diagnosis', 'benign_malignant', 'target']
2019 ISIC:	 10 columns available: ['image', 'MEL', 'NV', 'BCC', 'AK', 'BKL', 'DF', 'VASC', 'SCC', 'UNK']


#### Transform into the same format

In [3]:
df_2020 = df_2020[['image_name', 'target']]

# Vectorized comparison; cast to int so the dtype matches the 0/1 'target' column of the 2020 data
df_2019['target'] = (df_2019['MEL'] == 1).astype(int)
df_2019 = df_2019[['image', 'target']]
df_2019 = df_2019.rename(columns = {'image':'image_name'})

print("After transformation:")
print_avail_columns("2020 ISIC", df_2020)
print_avail_columns("2019 ISIC", df_2019)

After transformation:
2020 ISIC:	 2 columns available: ['image_name', 'target']
2019 ISIC:	 2 columns available: ['image_name', 'target']
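
The transformation can be illustrated on a toy frame. This is a minimal sketch with made-up image names and only three of the diagnosis columns; it assumes the one-hot columns hold 1/0 values and that the 2020 `target` column holds 0/1 integers, so the derived target is cast to `int` to keep both years in the same dtype.

```python
import pandas as pd

# Toy stand-in for the 2019 groundtruth (hypothetical image names, one-hot diagnosis columns)
toy_2019 = pd.DataFrame({
    "image": ["img_a", "img_b", "img_c"],
    "MEL": [1.0, 0.0, 0.0],
    "NV": [0.0, 1.0, 0.0],
    "BCC": [0.0, 0.0, 1.0],
})

# Melanoma vs. everything else: 1 where MEL is set, 0 otherwise
toy_2019["target"] = (toy_2019["MEL"] == 1).astype(int)

# Keep only the two columns shared with the 2020 format
toy_2019 = toy_2019[["image", "target"]].rename(columns={"image": "image_name"})
print(toy_2019)
```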


#### Compare dataset size

In [4]:
def print_melanoma_count(name, df):
    count = len(df.index)
    count_melanoma = len(df[df['target']==True])
    percentage_melanoma = count_melanoma/count
    print("%s:\t %i total samples \t %i melanoma samples \t %0.2f %% melanoma percentage" 
          %(name, count, count_melanoma, percentage_melanoma*100))

print_melanoma_count("2020 ISIC", df_2020)
print_melanoma_count("2019 ISIC", df_2019)

2020 ISIC:	 33126 total samples 	 584 melanoma samples 	 1.76 % melanoma percentage
2019 ISIC:	 25331 total samples 	 4522 melanoma samples 	 17.85 % melanoma percentage


#### Concat data

In [5]:
df = pd.concat([df_2020, df_2019]).drop_duplicates()
print_melanoma_count("Combined ISIC", df)

Combined ISIC:	 58457 total samples 	 5106 melanoma samples 	 8.73 % melanoma percentage
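
Since 33126 + 25331 = 58457, `drop_duplicates()` removed no rows here: it only drops rows that are identical in every column. If the intent is to guard against the same image appearing in both years, a check on `image_name` alone may be closer to it. The following is a sketch on toy data with hypothetical image names, not the notebook's actual frames.

```python
import pandas as pd

# Two toy yearly frames that share one image name (hypothetical names)
df_a = pd.DataFrame({"image_name": ["img_1", "img_2"], "target": [0, 1]})
df_b = pd.DataFrame({"image_name": ["img_2", "img_3"], "target": [1, 0]})

combined = pd.concat([df_a, df_b], ignore_index=True)
n_before = len(combined)

# Deduplicate on the image name only, keeping the first occurrence
combined = combined.drop_duplicates(subset="image_name")
n_dropped = n_before - len(combined)
print("Dropped %i rows with a repeated image_name" % n_dropped)
```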


## 3. Remove duplicates

In [6]:
df_duplicates_2020 = pd.read_csv('https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_Duplicates.csv')
img_to_drop = df_duplicates_2020.image_name_1.values.tolist()

print("There are %i duplicate pictures" %len(df_duplicates_2020.index))
print("There are %i images in total" %len(df.index))

df = df[~df.image_name.isin(img_to_drop)]
print("After dropping the duplicates, there are %i images in total" %len(df.index))

print_melanoma_count("\nDuplicate free, combined ISIC", df)

There are 425 duplicate pictures
There are 58457 images in total
After dropping the duplicates, there are 58032 images in total

Duplicate free, combined ISIC:	 58032 total samples 	 5103 melanoma samples 	 8.79 % melanoma percentage
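
The filtering step can be sketched on toy data. The duplicates file is treated here as a list of image pairs; the `image_name_2` column and all image names below are assumptions for illustration. Dropping only the first image of each pair keeps one copy of every duplicated picture.

```python
import pandas as pd

# Toy label frame and a toy duplicates list (hypothetical names)
labels = pd.DataFrame({
    "image_name": ["img_a", "img_b", "img_c", "img_d"],
    "target": [0, 0, 1, 0],
})
pairs = pd.DataFrame({"image_name_1": ["img_a"], "image_name_2": ["img_c"]})

# Remove the first image of each pair; its partner (image_name_2) stays in the data
to_drop = set(pairs["image_name_1"])
deduped = labels[~labels["image_name"].isin(to_drop)]
print(deduped)
```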


## 4. Train Test Validation split
<b>Training Set</b>
* 80%
* Used to train the learning algorithm

<b>Validation</b>
* Part of training set that is held out (10% of overall data)
* Used to evaluate different algorithms and hyperparameter settings

<b>Test</b>
* 10%
* Used to test the model and estimate its performance in application
* Will <b>not</b> be used for training under any circumstances

In [7]:
from sklearn.model_selection import train_test_split

random_state = 42

X = df.drop(columns = ['target']).copy()
y = df[['target']]

train_size=0.8 # 80% training set
X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, train_size=train_size, random_state=random_state)

test_size = 0.5 # 50% of remaining 20% -> 10%
X_valid, X_test, y_valid, y_test = train_test_split(X_remaining, y_remaining, test_size=test_size, random_state=random_state)

train = pd.concat([X_train.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1, join='inner')
valid = pd.concat([X_valid.reset_index(drop=True), y_valid.reset_index(drop=True)], axis=1, join='inner')
test = pd.concat([X_test.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1, join='inner')

print_melanoma_count("Train set", train)
print_melanoma_count("Valid set", valid)
print_melanoma_count("Test set", test)

Train set:	 46425 total samples 	 4096 melanoma samples 	 8.82 % melanoma percentage
Valid set:	 5803 total samples 	 526 melanoma samples 	 9.06 % melanoma percentage
Test set:	 5804 total samples 	 481 melanoma samples 	 8.29 % melanoma percentage
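
The melanoma share differs slightly between the three sets (8.82 %, 9.06 %, 8.29 %) because the split is purely random. If equal class ratios are wanted, the `stratify` argument of `train_test_split` is one option. The sketch below uses a toy frame with made-up image names and a 10 % positive rate; it is an alternative under those assumptions, not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 200 samples, 20 of them positive (hypothetical names)
toy = pd.DataFrame({
    "image_name": ["img_%03d" % i for i in range(200)],
    "target": [1] * 20 + [0] * 180,
})

# 80 % train, preserving the class ratio
toy_train, toy_remaining = train_test_split(
    toy, train_size=0.8, stratify=toy["target"], random_state=42)

# Split the remaining 20 % in half, again preserving the class ratio
toy_valid, toy_test = train_test_split(
    toy_remaining, test_size=0.5, stratify=toy_remaining["target"], random_state=42)

for name, part in [("train", toy_train), ("valid", toy_valid), ("test", toy_test)]:
    print("%s:\t %i samples \t %.2f %% positive" % (name, len(part), part["target"].mean() * 100))
```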


## 5. Saving groundtruth to computer

In [8]:
import os

data_root = 'data'
if not os.path.exists(data_root):
    os.makedirs(data_root)
    print("Created new directory %s" %data_root)
    
train.to_csv(data_root + "/ISIC_2020_2019_train.csv")
valid.to_csv(data_root + "/ISIC_2020_2019_valid.csv")
test.to_csv(data_root + "/ISIC_2020_2019_test.csv")

Created new directory data
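
By default `to_csv` also writes the row index, which shows up as an extra unnamed column when the file is read back. If only `image_name` and `target` are wanted in the saved files, passing `index=False` is one way to get that. A small round-trip sketch on toy data, using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import pandas as pd

split = pd.DataFrame({"image_name": ["img_a", "img_b"], "target": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_split.csv")  # hypothetical file name
    split.to_csv(path, index=False)                # do not write the row index
    restored = pd.read_csv(path)

print(restored.columns.tolist())
```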


## 6. Saving images to computer

In [9]:
from zipfile import ZipFile
import datetime
import io
import requests

def sort_zip_file(zip_file, zip_namelist, img_names, dest):
    print("Sorting into %s" %dest)

    dest = data_root + "/" + dest
    if not os.path.exists(dest):
        os.makedirs(dest)
        print("Created new directory %s" %dest)

    print("Started sorting at: ", datetime.datetime.now())
    # Assumes all images sit in one top-level folder inside the archive
    folder_name = zip_namelist[0].partition("/")[0]
    # A set makes each membership test O(1) instead of scanning the whole name list
    zip_names = set(zip_namelist)
    candidates = (folder_name + "/" + filename + ".jpg" for filename in img_names)
    common = [member for member in candidates if member in zip_names]
    for filename in common:
        zip_file.extract(filename, dest)
    print("Finished sorting at: ", datetime.datetime.now())

def download_and_sort_img(src, img_name_list, dest_list):  
    file_name = data_root + "/images.zip"
     
    print("Downloading and sorting %s" %src)
    print("Started download at: ", datetime.datetime.now())
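    with requests.get(src, stream=True) as response:
        response.raise_for_status()  # stop on an HTTP error instead of saving an error page
        with open(file_name, "wb") as file:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                file.write(chunk)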
    print("Finished download at: ", datetime.datetime.now())

    with ZipFile(file_name, 'r') as zip_file:
        zip_namelist = zip_file.namelist()
        for i in range(len(img_name_list)):
            sort_zip_file(zip_file, zip_namelist, img_name_list[i], dest_list[i])
    
    print("Removing %s .zip file" %file_name)
    print("Started removing at: ", datetime.datetime.now())
    os.remove(file_name)
    print("Finished removing at: ", datetime.datetime.now())
        
train_img = X_train["image_name"].values.tolist()
test_img = X_test["image_name"].values.tolist()
valid_img = X_valid["image_name"].values.tolist()

img_name_list = [train_img, test_img, valid_img]
dest_list = ["train", "test", "valid"]

img_2020 = 'https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_JPEG.zip'
img_2019 = 'https://isic-challenge-data.s3.amazonaws.com/2019/ISIC_2019_Training_Input.zip'

In [10]:
download_and_sort_img(img_2019, img_name_list, dest_list)

Downloading and sorting https://isic-challenge-data.s3.amazonaws.com/2019/ISIC_2019_Training_Input.zip
Started download at:  2022-10-12 23:08:35.010846
Finished download at:  2022-10-12 23:31:19.150132
Sorting into train
Created new directory data/train
Started sorting at:  2022-10-12 23:31:19.698541
Finished sorting at:  2022-10-12 23:49:01.005239
Sorting into test
Created new directory data/test
Started sorting at:  2022-10-12 23:49:01.036370
Finished sorting at:  2022-10-12 23:51:10.183360
Sorting into valid
Created new directory data/valid
Started sorting at:  2022-10-12 23:51:10.198988
Finished sorting at:  2022-10-12 23:53:13.783618
Removing data/images.zip .zip file
Started removing at:  2022-10-12 23:53:13.783618
Finished removing at:  2022-10-12 23:53:13.783618


In [11]:
download_and_sort_img(img_2020, img_name_list, dest_list)

Downloading and sorting https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_JPEG.zip
Started download at:  2022-10-12 23:53:13.844418
Finished download at:  2022-10-13 00:49:51.722518
Sorting into train
Started sorting at:  2022-10-13 00:49:52.558686
Finished sorting at:  2022-10-13 01:24:55.859148
Sorting into test
Started sorting at:  2022-10-13 01:24:55.859148
Finished sorting at:  2022-10-13 01:29:01.797043
Sorting into valid
Started sorting at:  2022-10-13 01:29:01.797043
Finished sorting at:  2022-10-13 01:33:07.350322
Removing data/images.zip .zip file
Started removing at:  2022-10-13 01:33:07.365717
Finished removing at:  2022-10-13 01:33:07.365717
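
The sorting logic can be tried out without downloading anything. The sketch below builds a small in-memory archive with fake image bytes and hypothetical names, then extracts only the requested members, the same selection that `sort_zip_file` performs. Note that extracted files keep the archive's top-level folder, so they end up under `dest/<folder>/`.

```python
import io
import os
import tempfile
from zipfile import ZipFile

# Build a toy archive in memory: one top-level folder with three fake images
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    for name in ["img_a", "img_b", "img_c"]:
        archive.writestr("images/%s.jpg" % name, b"fake image bytes")

# Names requested for one split; "img_missing" is not in the archive
wanted = ["img_a", "img_c", "img_missing"]

with tempfile.TemporaryDirectory() as dest, ZipFile(buffer, "r") as archive:
    members = set(archive.namelist())                 # set for fast membership tests
    folder = archive.namelist()[0].partition("/")[0]  # top-level folder in the archive
    candidates = ("%s/%s.jpg" % (folder, name) for name in wanted)
    selected = [member for member in candidates if member in members]
    archive.extractall(dest, members=selected)
    extracted = sorted(os.listdir(os.path.join(dest, folder)))

print(extracted)
```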
