## Skin Cancer MNIST: HAM10000 Disease Classification (Extended Project)

**Author: Antreas Kasiotis**

**Student Number: B8035526**

----

### Introduction
This work develops an image classifier for dermatoscopic images of skin cancer. I will be working with the HAM10000 ("Human Against Machine with 10000 training images") dataset, which was released as a training set for academic machine learning purposes and is publicly available through the ISIC archive. The dataset also includes metadata for each image, with information about the patient's age and sex, the location of the disease on their body, the type of disease, and the technical validation that confirmed the disease.

### Data Exploration


In [6]:
# Importing libraries
import pandas as pd

Importing and inspecting the data

In [9]:
# Import the grayscale (L) image data
dataset_images_L = pd.read_csv("../cancer-data/hmnist_28_28_L.csv")
print(dataset_images_L.head(3))
print("shape of images: ",dataset_images_L.shape)

   pixel0000  pixel0001  pixel0002  pixel0003  pixel0004  pixel0005  \
0        169        171        170        177        181        182   
1         19         57        105        140        149        148   
2        155        163        161        167        167        172   

   pixel0006  pixel0007  pixel0008  pixel0009  ...  pixel0775  pixel0776  \
0        181        185        194        192  ...        184        186   
1        144        155        170        170  ...        172        175   
2        155        152        165        175  ...        163        178   

   pixel0777  pixel0778  pixel0779  pixel0780  pixel0781  pixel0782  \
0        185        180        157        140        140        159   
1        160        144        114         89         47         18   
2        157        166        167        148        141        136   

   pixel0783  label  
0        165      2  
1         18      2  
2        115      2  

[3 rows x 785 columns]
shape of imag

As we can see, the grayscale image dataset holds 784 pixel values per row. These are the intensity values of a 28x28 pixel image. The last column is named `label` and indicates the type of skin disease the patient has.
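
To make the 784-column layout concrete, the sketch below separates the pixel columns from the label and reshapes each row into a 28x28 array. It uses a small random stand-in (`demo_L`, a hypothetical name) rather than the real CSV, and it assumes the pixels are stored row by row, which the file itself does not state.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataset_images_L: 3 rows of 784 random pixel
# values plus a label column, mirroring the column names printed above.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 784))
columns = [f"pixel{i:04d}" for i in range(784)] + ["label"]
demo_L = pd.DataFrame(np.column_stack([pixels, [2, 2, 2]]), columns=columns)

# Separate the pixel columns from the label column.
features_L = demo_L.drop(columns="label").to_numpy()
labels_L = demo_L["label"].to_numpy()

# Assumption: pixels are stored row by row, so each row of 784 values
# reshapes into one 28x28 grayscale image.
images_L = features_L.reshape(-1, 28, 28)
print(images_L.shape)  # (3, 28, 28)
```

The same two steps should apply to the real `dataset_images_L` frame, with the first dimension equal to the number of images.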

In [7]:
# Import the RGB image data
dataset_images_RGB = pd.read_csv("../cancer-data/hmnist_28_28_RGB.csv")
print(dataset_images_RGB.head(3))
print("shape of images: ",dataset_images_RGB.shape)

   pixel0000  pixel0001  pixel0002  pixel0003  pixel0004  pixel0005  \
0        192        153        193        195        155        192   
1         25         14         30         68         48         75   
2        192        138        153        200        145        163   

   pixel0006  pixel0007  pixel0008  pixel0009  ...  pixel2343  pixel2344  \
0        197        154        185        202  ...        173        124   
1        123         93        126        158  ...         60         39   
2        201        142        160        206  ...        167        129   

   pixel2345  pixel2346  pixel2347  pixel2348  pixel2349  pixel2350  \
0        138        183        147        166        185        154   
1         55         25         14         28         25         14   
2        143        159        124        142        136        104   

   pixel2351  label  
0        177      2  
1         27      2  
2        117      2  

[3 rows x 2353 columns]
shape of ima

As we can see, the RGB image dataset holds 2352 values per row. These still describe a 28x28 pixel image, but because the data is stored in RGB, each pixel takes three columns (one each for red, green and blue), giving 28 x 28 x 3 = 2352 columns plus the label. Now let's inspect the metadata file.
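
The arithmetic behind the 2352 columns can be checked with a short sketch. It assumes the values are stored pixel by pixel as (R, G, B) triples, which is a guess based on the first columns of the output and not something the file documents. Only the first six values of `row` come from that output; the rest are zero placeholders.

```python
import numpy as np

HEIGHT, WIDTH, CHANNELS = 28, 28, 3
n_values = HEIGHT * WIDTH * CHANNELS
print(n_values)  # 2352

# Hypothetical flattened row: the first six values are copied from row 0 of
# the printed output, the remaining values are zero placeholders.
row = np.zeros(n_values, dtype=np.uint8)
row[:6] = [192, 153, 193, 195, 155, 192]

# Assumption: consecutive columns hold the R, G, B values of one pixel,
# so the row reshapes to (height, width, channels).
image = row.reshape(HEIGHT, WIDTH, CHANNELS)
print(image[0, 0])  # [192 153 193]
print(image[0, 1])  # [195 155 192]
```

If the channels were instead stored as three separate 28x28 planes, the reshape would need to be `(CHANNELS, HEIGHT, WIDTH)` followed by a transpose, so it is worth plotting one image to confirm the layout.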

In [8]:
# import the metadata
dataset_meta = pd.read_csv("../cancer-data/HAM10000_metadata.csv")
print(dataset_meta.head(3))
print("shape of metadata: ", dataset_meta.shape)

     lesion_id      image_id   dx dx_type   age   sex localization
0  HAM_0000118  ISIC_0027419  bkl   histo  80.0  male        scalp
1  HAM_0000118  ISIC_0025030  bkl   histo  80.0  male        scalp
2  HAM_0002730  ISIC_0026769  bkl   histo  80.0  male        scalp
shape of metadata:  (10015, 7)


As expected, the metadata holds, for each image, information about the patient's disease and personal characteristics.
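
The first rows of the metadata already show `lesion_id` HAM_0000118 under two different `image_id` values, so a lesion can have more than one image. The sketch below shows a few quick checks that may be useful before any analysis. It rebuilds only the three printed rows by hand (`meta_demo` is a hypothetical name), so the counts describe this sample and not the full dataset.

```python
import pandas as pd

# The three rows printed above, rebuilt by hand for illustration only.
meta_demo = pd.DataFrame({
    "lesion_id": ["HAM_0000118", "HAM_0000118", "HAM_0002730"],
    "image_id": ["ISIC_0027419", "ISIC_0025030", "ISIC_0026769"],
    "dx": ["bkl", "bkl", "bkl"],
    "dx_type": ["histo", "histo", "histo"],
    "age": [80.0, 80.0, 80.0],
    "sex": ["male", "male", "male"],
    "localization": ["scalp", "scalp", "scalp"],
})

# Number of distinct images recorded for each lesion.
images_per_lesion = meta_demo.groupby("lesion_id")["image_id"].nunique()
print(images_per_lesion)

# Missing values per column.
missing = meta_demo.isna().sum()
print(missing)

# Number of images per diagnosis.
dx_counts = meta_demo["dx"].value_counts()
print(dx_counts)
```

Running the same calls on `dataset_meta` would show how many lesions have several images, which matters later if images of the same lesion could end up in both the training and test sets.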

### Exploratory data analysis