## Skin Cancer MNIST: HAM10000 Disease Classification (Extended Project)

**Author: Antreas Kasiotis**

**Student Number: B8035526**

----

## Project Overview
The following work is an effort to develop an image classifier for dermatoscopic images of skin cancer. To tackle this project I will be working with the HAM10000 ("Human Against Machine with 10000 training images") dataset, which was released as a training set for academic machine learning purposes and is publicly available through the ISIC archive. The dataset also includes metadata for each image, with information about the patient's age and sex, the location of the disease on their body, the type of disease, and the technical validation that confirmed the diagnosis.

## Data Exploration

In [None]:
# Importing libraries
import pandas as pd

Importing and inspecting the data

In [None]:
# import the greyscale image data
dataset_images_L = pd.read_csv("../cancer-data/hmnist_28_28_L.csv")
print(dataset_images_L.head(3))
print("shape of images: ",dataset_images_L.shape)

As we can see, the greyscale image dataset holds 784 pixel values per image. These are the intensity values of a 28x28 pixel image (28 x 28 = 784). The last column is named `label` and indicates the type of skin disease the patient has.
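
To make this layout concrete, the sketch below folds one row of such a table back into a 28x28 array. It uses a small synthetic data frame in place of `hmnist_28_28_L.csv`; the column names `pixel0000` to `pixel0783` and the row-major pixel order are assumptions made for illustration, not something confirmed by the output above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the greyscale CSV: 3 images, 784 pixel columns + label.
# The column naming and the label values are assumptions for illustration only.
rng = np.random.default_rng(0)
pixel_cols = [f"pixel{i:04d}" for i in range(28 * 28)]
demo_L = pd.DataFrame(rng.integers(0, 256, size=(3, 784)), columns=pixel_cols)
demo_L["label"] = [4, 6, 2]

# Separate the pixel values from the label, then fold a 784-value row into 28x28
pixels = demo_L.drop(columns="label").to_numpy()
first_image = pixels[0].reshape(28, 28)

print(pixels.shape)       # (3, 784)
print(first_image.shape)  # (28, 28)
```

With this ordering, `first_image[1, 0]` is the 29th value of the flat row, i.e. the first pixel of the second image row.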

In [None]:
# import the RGB image data
dataset_images_RGB = pd.read_csv("../cancer-data/hmnist_28_28_RGB.csv")
print(dataset_images_RGB.head(3))
print("shape of images: ",dataset_images_RGB.shape)

As we can see, the RGB image dataset holds 2352 values per image. These again describe a 28x28 pixel image, but because the data is stored in RGB, each pixel takes up three columns, one each for red, green and blue (28 x 28 x 3 = 2352). Now let's inspect the metadata file.
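
The column count follows directly from the image dimensions. Below is a minimal sketch of that arithmetic, using a dummy array in place of the real data and assuming that the three channel values of each pixel are stored next to each other (an assumption about the CSV layout that should be checked against the file).

```python
import numpy as np

img_rows, img_cols, channels = 28, 28, 3

# Values per image: one per pixel for greyscale, three per pixel for RGB
values_grey = img_rows * img_cols
values_rgb = img_rows * img_cols * channels
print(values_grey, values_rgb)  # 784 2352

# A flat RGB row folds back into a (rows, cols, channels) image.
# Assumes channel-interleaved storage (R, G, B of pixel 0, then pixel 1, ...).
flat_row = np.arange(values_rgb)
rgb_image = flat_row.reshape(img_rows, img_cols, channels)
print(rgb_image.shape)   # (28, 28, 3)
print(rgb_image[0, 0])   # [0 1 2], the three values of the first pixel
```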

In [None]:
# import the metadata
dataset_meta = pd.read_csv("../cancer-data/HAM10000_metadata.csv")
print(dataset_meta.head(3))
print("shape of metadata: ", dataset_meta.shape)

As expected, the metadata dataset holds patient information for each image, covering both the disease and the patient's personal characteristics. Before moving on to the implementation of the models, I will carry out some exploratory data analysis on the metadata.

## Exploratory data analysis

#### Introduction
In this section I will be looking at the columns of the metadata dataset to better understand the characteristics of the disease and the patients.

In [None]:
# importing necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

#### Disease class frequencies

In [None]:
# Plotting the disease class frequencies
bar, ax = plt.subplots(figsize=(7, 5))
sns.countplot(x = 'dx', data = dataset_meta)
plt.xlabel('Disease', size=12)
plt.ylabel('Frequency', size=12)
plt.title('Frequency Distribution of Classes', size=16)
plt.show()

#### Age groups with the disease

In [None]:
# Plotting the frequencies of each age of those with the disease
bar, ax = plt.subplots(figsize=(7, 5))
sns.histplot(dataset_meta['age'])
plt.title('Age range of patients', size=14)
plt.show()

#### Localization of the disease on the body

In [None]:
# Plotting the distribution of the localization of the disease
disease_location = dataset_meta['localization'].value_counts()
plt.figure(figsize=(20, 8))
sns.countplot(x='localization', data=dataset_meta)
plt.title('Disease localisation distribution')
plt.show()

#### The genders of the patients

In [None]:
# Plotting the gender frequencies of the patients
bar, ax = plt.subplots(figsize=(7, 7))
plt.pie(dataset_meta['sex'].value_counts(),
        labels = dataset_meta['sex'].value_counts().index, 
        autopct="%.1f%%")
plt.title('Gender of Patient', size=16)
plt.show()

## Implementation of the Classifiers
At this phase of the project I will be using three different methods to build my skin cancer image classifiers. In this section I will also show the data pre-processing that was required to implement each classifier. In total I will be building three classifiers: a CNN (convolutional neural network), an LSTM (long short-term memory RNN), and an SVM (support vector machine).

#### CNN (convolutional neural network)

In [None]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import keras
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
# importing the model and layer classes from a single package (tensorflow.keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense


Below we can see the implementation of the CNN model, with comments explaining each part of the process.

In [None]:
# defining the number of classes
num_classes = 7
# defining the batch size and epochs for the model
batch_size = 128
epochs = 10

# defining the number of rows and columns representing the pixels
img_rows = 28
img_cols = 28

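# removing the 'label' column from the data frame so I only keep the image data
# (the greyscale data is used because the model below expects a single channel)
images = dataset_images_L.drop(['label'], axis=1)
# keeping only the label column
labels = dataset_images_L['label']

# Oversampling to overcome class imbalance
# (note: resampling before the train/test split lets duplicated images
# appear in both sets, which can inflate the test accuracy)
oversample = RandomOverSampler()
images, labels = oversample.fit_resample(images, labels)

# converting the images to an array and reshaping to (samples, rows, cols, channels)
images = np.array(images)
images = images.reshape(-1, img_rows, img_cols, 1)
print('Shape of images: ', images.shape)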

# Normalizing the images.
images = (images-np.mean(images))/np.std(images)

# Splitting my predictive and response data into training and testing sets with an 80:20 ratio
# while the state is set to a constant so that the splitting can be done reproducibly
x_train, x_test, y_train, y_test = train_test_split(images, labels, random_state=1, test_size=0.20)

# encoding my labels to one-hot vectors
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Model building
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(img_rows, img_cols, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.40))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

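# saving the model whenever the validation accuracy improves
# (the metric is logged as 'val_accuracy' in TensorFlow 2.x; older Keras versions use 'val_acc')
callback = tf.keras.callbacks.ModelCheckpoint(filepath='trained-models/cnn-best-model-L.h5', monitor='val_accuracy', mode='max', save_best_only=True, verbose=1)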

model.compile(loss=keras.losses.categorical_crossentropy, optimizer='adam', metrics=['accuracy'])

# Fitting the model
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2, callbacks=[callback])

# Evaluating the model
score = model.evaluate(x_test, y_test, verbose=0)
print('Summary: Loss over the test dataset: %.2f, Accuracy: %.2f' % (score[0], score[1]))
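
As a sanity check on the architecture above, the feature-map sizes can be worked out by hand. The sketch below assumes the Keras defaults that the model relies on (`padding='valid'`, convolution stride 1, pooling stride equal to the pool size). The helper names `conv_out` and `pool_out` are hypothetical and only illustrate the arithmetic behind the shapes that `model.summary()` reports.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a 'valid' (unpadded) convolution along one dimension."""
    return (size - kernel) // stride + 1


def pool_out(size, pool):
    """Output width of max pooling with stride equal to the pool size."""
    return size // pool


size = 28
size = conv_out(size, 3)   # Conv2D(32, 3x3):   28 -> 26
size = pool_out(size, 2)   # MaxPooling2D(2x2): 26 -> 13
size = conv_out(size, 3)   # Conv2D(64, 3x3):   13 -> 11
size = pool_out(size, 2)   # MaxPooling2D(2x2): 11 -> 5

flattened = size * size * 64            # 5 * 5 * 64 features enter Dense(128)
dense_params = flattened * 128 + 128    # weights plus biases of Dense(128)
print(size, flattened, dense_params)    # 5 1600 204928
```

Under these assumptions, most of the trainable parameters sit in the first dense layer rather than in the convolutional layers.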


#### LSTM (Long short-term memory RNN)

#### SVM (Support vector machine)

#### Conclusion