# Comparing the data of different ISIC challenges
ISIC, short for International Skin Imaging Collaboration, has been holding challenges on classifying skin cancer in images for several years. The archive can be found at: https://challenge.isic-archive.com/

It is reasonable to expect that the proportion of images classified as malignant varies between these datasets. This notebook serves as a tool to compare the data of the different challenges. The following aspects are compared:
- available metadata
- total number of samples
- number of samples classified as melanoma = true
- images used in multiple ISIC challenges

Please note that this notebook does not save any data to the disk and works completely in RAM.

## 1. Load ISIC Challenge Data
In the first step, the training ground-truth files of the 2016 to 2020 challenges are downloaded from the ISIC archive.

In [1]:
import pandas as pd
import requests
from zipfile import ZipFile
import io

print("Loading ISIC 2020 Groundtruth")
url_2020 = "https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_GroundTruth_v2.csv"
df_2020 = pd.read_csv(url_2020)

print("Loading ISIC 2019 Groundtruth")
url_2019 = "https://isic-challenge-data.s3.amazonaws.com/2019/ISIC_2019_Training_GroundTruth.csv"
df_2019 = pd.read_csv(url_2019)

print("Loading ISIC 2018 Groundtruth")
url_2018 = "https://isic-challenge-data.s3.amazonaws.com/2018/ISIC2018_Task3_Training_GroundTruth.zip"
file_name = "ISIC2018_Task3_Training_GroundTruth/ISIC2018_Task3_Training_GroundTruth.csv"
zip_file = ZipFile(io.BytesIO(requests.get(url_2018).content))
df_2018 = pd.read_csv(zip_file.open(file_name))

print("Loading ISIC 2017 Groundtruth")
url_2017 = "https://isic-challenge-data.s3.amazonaws.com/2017/ISIC-2017_Training_Part3_GroundTruth.csv"
df_2017 = pd.read_csv(url_2017)

print("Loading ISIC 2016 Groundtruth")
url_2016 = "https://isic-challenge-data.s3.amazonaws.com/2016/ISBI2016_ISIC_Part3B_Training_GroundTruth.csv"
df_2016 = pd.read_csv(url_2016)

Loading ISIC 2020 Groundtruth
Loading ISIC 2019 Groundtruth
Loading ISIC 2018 Groundtruth
Loading ISIC 2017 Groundtruth
Loading ISIC 2016 Groundtruth
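
The 2018 ground truth is the only file shipped as a ZIP archive, so it is unpacked in memory rather than on disk. The following is a minimal sketch of that pattern. The helper name `read_csv_from_zip_bytes` and the archive contents are made up for illustration, and the archive is built locally so the example runs without network access.

```python
import io
from zipfile import ZipFile

import pandas as pd


def read_csv_from_zip_bytes(zip_bytes, member_name):
    """Read one CSV member of a ZIP archive that is held in memory."""
    with ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as handle:
            return pd.read_csv(handle)


# Build a tiny archive in memory (hypothetical file name and contents).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("demo/labels.csv", "image,MEL,NV\nIMG_A,1.0,0.0\nIMG_B,0.0,1.0\n")

df_demo = read_csv_from_zip_bytes(buffer.getvalue(), "demo/labels.csv")
print(df_demo)
```

In the notebook itself, the bytes would come from `requests.get(url_2018).content` instead of the local buffer.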


## 2. Drop duplicates in ISIC 2020
For the ISIC 2020 data, a list of duplicate images is available. So that each image is counted only once, the first image of every listed duplicate pair is removed.

In [2]:
df_duplicates = pd.read_csv('https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_Duplicates.csv')
img_to_drop = df_duplicates.image_name_1.values.tolist()
# Keep only the rows whose image name is not in the list of images to drop
df_2020 = df_2020[~df_2020.image_name.isin(img_to_drop)]
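
The effect of this filter can be checked on a toy example. This is only a sketch: the image names are invented, and the second column `image_name_2` is an assumption about the layout of the duplicates file, since the notebook only reads `image_name_1`.

```python
import pandas as pd

labels = pd.DataFrame({
    "image_name": ["IMG_A", "IMG_B", "IMG_C", "IMG_D"],
    "target": [0, 0, 1, 0],
})
# Hypothetical duplicate list: IMG_A and IMG_C are the same picture.
duplicates = pd.DataFrame({"image_name_1": ["IMG_A"], "image_name_2": ["IMG_C"]})

# Drop the first member of each pair; the second member stays in the data.
deduplicated = labels[~labels["image_name"].isin(duplicates["image_name_1"])]
print(deduplicated["image_name"].tolist())
```

Only one image of each pair is removed, so every lesion picture is still represented once.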

## 3. Split off validation & test set of ISIC 2020
For the ISIC 2020 challenge, no ground truth is publicly available for the test data. For this reason, we have to split off the validation and test data ourselves.
This step is necessary even when only comparing datasets, since the held-out data should ONLY be used for evaluating the model's final performance and must not be used before that. Only the 80% training portion is kept for the comparisons below.

In [3]:
from sklearn.model_selection import train_test_split

random_state = 42

X = df_2020.drop(columns = ['diagnosis', 'benign_malignant', 'target']).copy()
y = df_2020[['diagnosis', 'benign_malignant', 'target']]

train_size=0.8 # 80% training set
X_train, X_remaining, y_train, y_remaining = train_test_split(X,y, 
                                                              train_size=train_size, 
                                                              random_state=random_state)
df_2020 = pd.concat([X_train, y_train], axis=1, join='inner')
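
The split above is purely random. Because melanoma is rare in the 2020 data, one possible variation is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses toy data and pandas' `groupby(...).sample(...)`; it is an alternative for illustration, not what the notebook does. `train_test_split` also accepts a `stratify` argument for the same purpose.

```python
import pandas as pd

# Toy data: 100 images, 10 of them positive.
toy = pd.DataFrame({
    "image_name": [f"IMG_{i:03d}" for i in range(100)],
    "target": [1] * 10 + [0] * 90,
})

# Sample 80% within each class so the class ratio is preserved.
train = toy.groupby("target", group_keys=False).sample(frac=0.8, random_state=42)
held_out = toy.drop(train.index)

print(len(train), int(train["target"].sum()))
print(len(held_out), int(held_out["target"].sum()))
```

With a stratified split, the share of positive samples in the held-out part matches the training part, which matters most when positives are scarce.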

## 4. Compare available columns
Not all ISIC challenges have the same target. However, they all contain information about whether an image is classified as melanoma. In the next step, the attributes available in each challenge are listed.

In [4]:
def print_avail_columns(year, df):
    print("ISIC %s" %year)
    col = df.columns.values.tolist()
    print("%i columns available: %s\n" %(len(col), col))

print_avail_columns("2020", df_2020)
print_avail_columns("2019", df_2019)
print_avail_columns("2018", df_2018)
print_avail_columns("2017", df_2017)
print_avail_columns("2016", df_2016)

ISIC 2020
9 columns available: ['image_name', 'patient_id', 'lesion_id', 'sex', 'age_approx', 'anatom_site_general_challenge', 'diagnosis', 'benign_malignant', 'target']

ISIC 2019
10 columns available: ['image', 'MEL', 'NV', 'BCC', 'AK', 'BKL', 'DF', 'VASC', 'SCC', 'UNK']

ISIC 2018
8 columns available: ['image', 'MEL', 'NV', 'BCC', 'AKIEC', 'BKL', 'DF', 'VASC']

ISIC 2017
3 columns available: ['image_id', 'melanoma', 'seborrheic_keratosis']

ISIC 2016
2 columns available: ['ISIC_0000000', 'benign']
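
The 2016 column names `ISIC_0000000` and `benign` look like a data row rather than a header. This suggests, as an assumption not verified here, that the 2016 file has no header row, in which case `pd.read_csv` uses the first sample as column names. A minimal sketch of the difference on a toy file follows; the column name `label` and the rows after the first are invented.

```python
import io

import pandas as pd

# Toy stand-in for a label file without a header row.
raw = "ISIC_0000000,benign\nIMG_X,benign\nIMG_Y,malignant\n"

# Default behaviour: the first line becomes the header.
with_default = pd.read_csv(io.StringIO(raw))
# Explicit column names: every line is treated as data.
with_names = pd.read_csv(io.StringIO(raw), header=None, names=["image_name", "label"])

print(with_default.columns.tolist(), len(with_default))
print(with_names.columns.tolist(), len(with_names))
```

If the assumption holds for the real file, reading it with `header=None` would keep the first sample, and the counts reported for 2016 would change by one.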



## 5. Transform data into the same format
As seen in step 4, the datasets of the different challenges contain different columns. To make them easier to compare, they are now transformed into the same format.

After the transformation, all datasets will have two attributes:
1. image_name = the name of the image
2. target = true (melanoma), false (no melanoma)

In [5]:
# 2020: 'target' already exists (0 / 1)
df_2020 = df_2020[['image_name', 'target']]

# 2019 and 2018: one-hot columns, 'MEL' marks melanoma
df_2019['target'] = df_2019['MEL'] == 1
df_2019 = df_2019[['image', 'target']]
df_2019 = df_2019.rename(columns = {'image':'image_name'})

df_2018['target'] = df_2018['MEL'] == 1
df_2018 = df_2018[['image', 'target']]
df_2018 = df_2018.rename(columns = {'image':'image_name'})

# 2017: 'melanoma' column marks melanoma
df_2017['target'] = df_2017['melanoma'] == 1
df_2017 = df_2017[['image_id', 'target']]
df_2017 = df_2017.rename(columns = {'image_id':'image_name'})

# 2016: text labels, 'malignant' is mapped to True
df_2016 = df_2016.rename(columns = {'ISIC_0000000':'image_name', 'benign':'target'})
df_2016['target'] = df_2016['target'] == 'malignant'
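
The per-year transformations all follow the same pattern: pick the name column, then compare one label column against its positive value. As a sketch, this can be expressed with a single helper. The function `to_common_format` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd


def to_common_format(df, name_col, label_col, positive):
    """Return a two-column frame: image_name and a boolean target."""
    return pd.DataFrame({
        "image_name": df[name_col],
        "target": df[label_col] == positive,
    })


# Toy frames mimicking a one-hot layout and a text-label layout.
one_hot = pd.DataFrame({"image": ["IMG_A", "IMG_B"], "MEL": [1.0, 0.0], "NV": [0.0, 1.0]})
text_label = pd.DataFrame({"image_name": ["IMG_C", "IMG_D"], "label": ["benign", "malignant"]})

common_one_hot = to_common_format(one_hot, "image", "MEL", 1.0)
common_text = to_common_format(text_label, "image_name", "label", "malignant")
print(common_one_hot)
print(common_text)
```

A vectorized comparison like `df[label_col] == positive` gives the same result as a row-wise `apply` and is considerably faster on large frames.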

## 6. Compare percentage of melanoma classifications
In the next step, the percentage of images classified as melanoma is calculated for each challenge and compared.

In [6]:
def print_melanoma_count(year, df):
    count = len(df.index)
    count_melanoma = len(df[df['target']==True])
    percentage_melanoma = count_melanoma/count
    print("%s ISIC \n%i samples \t %i melanoma samples \t %0.2f %% melanoma percentage \n" 
          %(year, count, count_melanoma, percentage_melanoma*100))
    print("True / False values")
    print(df['target'].value_counts())
    print("\n\n")

print_melanoma_count("2020", df_2020)
print_melanoma_count("2019", df_2019)
print_melanoma_count("2018", df_2018)
print_melanoma_count("2017", df_2017)
print_melanoma_count("2016", df_2016)

2020 ISIC 
26160 samples 	 470 melanoma samples 	 1.80 % melanoma percentage 

True / False values
0    25690
1      470
Name: target, dtype: int64



2019 ISIC 
25331 samples 	 4522 melanoma samples 	 17.85 % melanoma percentage 

True / False values
False    20809
True      4522
Name: target, dtype: int64



2018 ISIC 
10015 samples 	 1113 melanoma samples 	 11.11 % melanoma percentage 

True / False values
False    8902
True     1113
Name: target, dtype: int64



2017 ISIC 
2000 samples 	 374 melanoma samples 	 18.70 % melanoma percentage 

True / False values
False    1626
True      374
Name: target, dtype: int64



2016 ISIC 
899 samples 	 173 melanoma samples 	 19.24 % melanoma percentage 

True / False values
False    726
True     173
Name: target, dtype: int64
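
For a side-by-side view, the counts printed above can be collected in one table. The sketch below copies the sample and melanoma counts from the output and recomputes the percentages; the column names are chosen for illustration.

```python
import pandas as pd

# Counts copied from the printed output of step 6.
counts = pd.DataFrame(
    {
        "samples": [26160, 25331, 10015, 2000, 899],
        "melanoma": [470, 4522, 1113, 374, 173],
    },
    index=["2020", "2019", "2018", "2017", "2016"],
)
counts["melanoma_pct"] = (counts["melanoma"] / counts["samples"] * 100).round(2)
print(counts)
```

The table makes the gap visible at a glance: the 2020 training portion contains under 2% melanoma samples, while the earlier challenges range from about 11% to about 19%.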





## 7. Check for duplicates
It is likely that some images have been used in multiple ISIC challenges. In the following step, all image names that occur in more than one dataset are collected. The datasets may also contain duplicates that are stored under different names. Those are not detected in this step.

In [7]:
import numpy as np

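def check_has_same_image(df1, df2):
    """Return the rows that occur in both dataframes.

    merge() without 'on' joins on all shared columns, so a row only
    matches if both image_name and target are equal.
    """
    merged = df1.merge(df2)
    print("%i images with the same name found \n" %len(merged.index))
    return merged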

dfs = [df_2020, df_2019, df_2018, df_2017, df_2016]
df_names = ["2020", "2019", "2018", "2017", "2016"]
df_all_names = pd.DataFrame(columns=["image_name", "target_x", "target_y"])

for i in range(len(dfs)):
    for j in range(len(dfs)-i-1):
        print("Comparing %s and %s dataset" %(df_names[i], df_names[j+1+i]))
        merged = check_has_same_image(dfs[i], dfs[j+1+i])
        df_all_names= pd.concat([df_all_names, merged]).drop_duplicates()

number_total_images = len(pd.concat(dfs)) # total images in all datasets, including duplicates
number_more_than_once = len(df_all_names.index) # images that appear more than once across the datasets
number_individual = len(pd.concat(dfs).drop_duplicates()) # total images in all datasets, excluding duplicates
number_only_once = number_individual - number_more_than_once # images that appear only once
number_to_remove = number_total_images - number_individual # images to discard so that no duplicates remain

print("In total: ")
print("There are %i images in total of which..." %number_total_images)
print("%i image names appear more than once (%0.2f%%)" 
      %(number_more_than_once, number_more_than_once/number_total_images*100))
print("%i image names only appear once (%0.2f%%)" 
      %(number_only_once, number_only_once/number_total_images*100))
print("%i image names are individual (%0.2f%%)" 
      %(number_individual, number_individual/number_total_images*100))
print("%i images should be discarded to achieve a duplicate-free dataset (%0.2f%%)" 
      %(number_to_remove , number_to_remove/number_total_images*100))

Comparing 2020 and 2019 dataset
0 images with the same name found 

Comparing 2020 and 2018 dataset
0 images with the same name found 

Comparing 2020 and 2017 dataset
0 images with the same name found 

Comparing 2020 and 2016 dataset
0 images with the same name found 

Comparing 2019 and 2018 dataset
10015 images with the same name found 

Comparing 2019 and 2017 dataset
717 images with the same name found 

Comparing 2019 and 2016 dataset
565 images with the same name found 

Comparing 2018 and 2017 dataset
0 images with the same name found 

Comparing 2018 and 2016 dataset
0 images with the same name found 

Comparing 2017 and 2016 dataset
742 images with the same name found 

In total: 
There are 64405 images in total of which...
11061 image names appear more than once (17.17%)
41794 image names only appear once (64.89%)
52855 image names are individual (82.07%)
11550 images should be discarded to achieve a duplicate-free dataset (17.93%)
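
The pairwise comparison above matches rows on both the image name and the label. An alternative that matches on the image name only is sketched below on toy data, using `duplicated` and `drop_duplicates` with a `subset`. The frames and the year labels `A` and `B` are invented, and the results on the real data may differ from the numbers printed above.

```python
import pandas as pd

# Toy stand-ins for two challenge years that share one image name.
year_a = pd.DataFrame({"image_name": ["IMG_1", "IMG_2"], "target": [True, False]})
year_b = pd.DataFrame({"image_name": ["IMG_2", "IMG_3"], "target": [False, True]})

# Stack both years and remember the origin of each row in the index.
combined = pd.concat([year_a, year_b], keys=["A", "B"], names=["year", "row"])

# Match on the image name only, ignoring the label column.
repeated = combined["image_name"].duplicated(keep=False)
unique_images = combined.drop_duplicates(subset="image_name", keep="first")

print(int(repeated.sum()))  # rows whose name occurs more than once
print(len(unique_images))   # rows left after keeping the first occurrence
```

Keeping the year in the index makes it possible to see which challenge each retained image came from.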
