# Melanoma Classification - Data Investigation
Data Source: https://challenge2020.isic-archive.com/


## 1. Downloading the data
This step will take a while. Please be patient and do not abort it prematurely. The download is about 25 GB, so make sure you have enough free disk space.
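Before starting, the free disk space can be checked from Python. The following is a minimal sketch using the standard library; the helper name `has_enough_space` and the threshold constant are illustrative assumptions, not part of the original notebook. Keep in mind that the unzipped images in step 2 need additional space on top of the archive itself.

```python
import shutil

# Assumption: roughly the 25 GB download size mentioned above.
REQUIRED_BYTES = 25 * 1024**3

def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes

free_gib = shutil.disk_usage(".").free / 1024**3
print("Free space: %.1f GiB" % free_gib)
if not has_enough_space():
    print("Warning: less than 25 GiB free; the download may not fit.")
```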

If you have already downloaded the data, move it into a folder called "data" in this project's directory so that the following steps run without problems.

The structure after this step will be as follows:
<pre>
data_investigation.ipynb
data
|---ISIC_2020_Training_Duplicates.csv
|---ISIC_2020_Training_GroundTruth_v2.csv
|---ISIC_2020_Training_JPEG.zip
</pre>
In the next step the images will be unzipped. Continue with step 2.

If you have already unzipped the data, the structure should look like this:
<pre>
data_investigation.ipynb
data
|---ISIC_2020_Training_Duplicates.csv
|---ISIC_2020_Training_GroundTruth_v2.csv
|---ISIC_2020_Training_JPEG
    |---train
        |---ISIC_0099474.jpg
        |---...
</pre>
If your data already has this structure, continue with step 3.
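The choice between steps 1, 2 and 3 described above can also be made programmatically. This is a sketch only: the function `next_step` is a hypothetical helper, and it assumes exactly the file and folder names shown in the two directory trees.

```python
import os

def next_step(data_dir="data"):
    """Return the step (1, 2 or 3) to continue with, based on what exists in `data_dir`."""
    csv_files = [
        "ISIC_2020_Training_Duplicates.csv",
        "ISIC_2020_Training_GroundTruth_v2.csv",
    ]
    has_csvs = all(os.path.isfile(os.path.join(data_dir, name)) for name in csv_files)
    has_zip = os.path.isfile(os.path.join(data_dir, "ISIC_2020_Training_JPEG.zip"))
    has_images = os.path.isdir(os.path.join(data_dir, "ISIC_2020_Training_JPEG", "train"))

    if has_csvs and has_images:
        return 3  # images are already unzipped
    if has_csvs and has_zip:
        return 2  # downloaded, but not yet unzipped
    return 1      # something is missing, so download first

print("Continue with step %i" % next_step())
```

The check only looks at whether the files and folders exist. It cannot tell whether a download or extraction was interrupted halfway.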

In [1]:
import os
import requests 

path = 'data'
if not os.path.exists(path):
    os.makedirs(path)
    print("Created new directory for data")

print('Downloading ISIC img data')
img_data = 'https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_JPEG.zip'
# Stream the archive to disk in chunks so the 25 GB file is never held in memory.
r = requests.get(img_data, allow_redirects=True, stream=True)
r.raise_for_status()  # stop on an HTTP error instead of saving an error page
with open("data/ISIC_2020_Training_JPEG.zip", "wb") as file:
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        if chunk:
            file.write(chunk)

print('Downloading ISIC meta data')
meta_data = 'https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_GroundTruth_v2.csv'
r = requests.get(meta_data, allow_redirects=True)
open('data/ISIC_2020_Training_GroundTruth_v2.csv', 'wb').write(r.content)

print('Downloading ISIC duplicate image list')
duplicate_list_data = 'https://isic-challenge-data.s3.amazonaws.com/2020/ISIC_2020_Training_Duplicates.csv'
r = requests.get(duplicate_list_data, allow_redirects=True)
open('data/ISIC_2020_Training_Duplicates.csv', 'wb').write(r.content)

Created new directory for data
Downloading ISIC img data
Downloading ISIC meta data
Downloading ISIC duplicate image list


11500
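Because the archive is large, re-running the download cell fetches everything again. One way to avoid that is to skip files that already exist. The sketch below is an assumption-laden alternative, not the notebook's method: it uses the standard library's `urllib.request` instead of `requests`, and `download_if_missing` is a hypothetical helper name. It writes to a temporary `.part` file first, so an interrupted download is not mistaken for a complete one.

```python
import os
import shutil
import urllib.request

def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(dest):
        return False
    tmp = dest + ".part"
    # copyfileobj streams the response in chunks instead of loading it into memory.
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    # Rename only after the whole file has been written.
    os.replace(tmp, dest)
    return True
```

It could be called once per file, for example `download_if_missing(img_data, "data/ISIC_2020_Training_JPEG.zip")`, with `img_data` being the URL variable from the cell above.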

## 2. Unzipping downloaded data
The images are included in a .zip file. In order to work with the images, we have to unzip the data.

This step will also take quite a while. Be patient :-)

In [2]:
from zipfile import ZipFile

print("Unzipping Image Data")
with ZipFile("data/ISIC_2020_Training_JPEG.zip", 'r') as zObject:
    zObject.extractall(path="data/ISIC_2020_Training_JPEG")
print("Finished unzipping images")

Unzipping Image Data
Finished unzipping images
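Since `extractall` gives no feedback while it runs, a variant that reports progress may be more comfortable for an archive of this size. This is a minimal sketch; the function `extract_with_progress` and its `report_every` parameter are illustrative assumptions.

```python
from zipfile import ZipFile

def extract_with_progress(zip_path, target_dir, report_every=1000):
    """Extract every entry of `zip_path` into `target_dir`, printing progress.

    Returns the number of archive entries (files and folders) processed.
    """
    with ZipFile(zip_path, "r") as archive:
        members = archive.infolist()
        total = len(members)
        for i, member in enumerate(members, start=1):
            archive.extract(member, path=target_dir)
            if i % report_every == 0 or i == total:
                print("Extracted %i of %i entries" % (i, total))
    return total
```

With the paths used in this notebook, the call would be `extract_with_progress("data/ISIC_2020_Training_JPEG.zip", "data/ISIC_2020_Training_JPEG")`.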


## 3. Filtering duplicates
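The data includes some duplicates. These are listed in the file "data/ISIC_2020_Training_Duplicates.csv".
We do not want to train on duplicate data and will therefore remove these images from our training data. Note that this step overwrites the ground-truth file and deletes the duplicate images from the hard drive.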

In [3]:
import pandas as pd
import os

df_duplicates = pd.read_csv("data/ISIC_2020_Training_Duplicates.csv")
df_groundtruth = pd.read_csv("data/ISIC_2020_Training_GroundTruth_v2.csv")
img_path = 'data/ISIC_2020_Training_JPEG/train'

def count_files(directory):
    """Count the regular files directly inside `directory`."""
    return sum(1 for name in os.listdir(directory) if os.path.isfile(os.path.join(directory, name)))

print("There are %i duplicate pictures in the data" % len(df_duplicates.index))
print("There are %i images in total in the meta-data file ('ground-truth')" % len(df_groundtruth.index))
print("There are %i images on the hard-drive" % count_files(img_path))

# The image listed in column image_name_1 of each duplicate row is dropped.
img_to_drop = df_duplicates.image_name_1.values.tolist()

print("Dropping duplicate images in ground truth")
df_groundtruth = df_groundtruth[~df_groundtruth.image_name.isin(img_to_drop)]
# index=False keeps the pandas index from being written as an extra column,
# which would otherwise show up as "Unnamed: 0" when the file is read again.
df_groundtruth.to_csv("data/ISIC_2020_Training_GroundTruth_v2.csv", index=False)
print("After dropping the duplicates, there are %i images in total in the meta-data file ('ground-truth')" % len(df_groundtruth.index))

print("Deleting duplicate images from the hard-drive")
for img_name in img_to_drop:
    filename = os.path.join(img_path, img_name + ".jpg")
    try:
        os.remove(filename)
    except FileNotFoundError:
        # Already deleted, e.g. by a previous run of this cell.
        pass

print("There are now %i images on the hard-drive" % count_files(img_path))

There are 425 duplicate pictures in the data
There are 33126 images in total in the meta-data file ('ground-truth')
There are 33126 images on the hard-drive
Dropping duplicate images in ground truth
After dropping the duplicates, there are 32701 images in total in the meta-data file ('ground-truth')
Deleting duplicate images from the hard-drive
There are now 32701 images on the hard-drive
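The matching counts above (32701 rows and 32701 files) suggest that the metadata and the image folder agree, but equal counts do not prove that the names match. A name-by-name comparison could be sketched as follows; `compare_names` is a hypothetical helper, and it assumes every image is stored as `<image_name>.jpg`, as in the directory tree from step 1.

```python
import os

def compare_names(metadata_names, image_dir):
    """Compare image names from the metadata with the .jpg files in `image_dir`.

    Returns two sorted lists: names without a file, and files without a metadata row.
    """
    on_disk = {
        os.path.splitext(name)[0]
        for name in os.listdir(image_dir)
        if name.lower().endswith(".jpg")
    }
    listed = set(metadata_names)
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

In this notebook it would be called as `missing, unlisted = compare_names(df_groundtruth.image_name, img_path)`. Both lists should be empty if the filtering worked as intended.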
