![image.png](https://media.sciencephoto.com/image/p7100248/800wm/P7100248-SEM_of_section_through_human_skin.jpg)
> Coloured scanning electron micrograph of a section through the human skin. <br>
Photo by <a href="https://www.sciencephoto.com/contributor/pzg/">STEVE GSCHMEISSNER</a>
  

# HAM10000: Neural Networks for Skin Lesion Classification


> "Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 (“Human Against Machine with 10000 training images”) dataset."<br>
[Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018)](https://www.nature.com/articles/sdata2018161).


This notebook explores the HAM10000 dataset and analyses the performance of different neural network strategies.


## <center style="background-color:Gainsboro; width:40%;">Contents</center>
1. [Overview](#1.-Overview)<br>
1.1. [Content](#1.1.-Content)<br>
1.2. [Acknowledgements](#1.2.-Acknowledgements)<br>
2. [The Data](#2.-The-Data)<br>
2.1. [Melanoma](#2.1.-Melanoma)<br>
3. [Models](#3.-Models)<br>
3.1 [CNN](#3.1-CNN)<br>
3.2 [ResNet-50](#3.2-ResNet-50)<br>
3.3 [XCeption](#3.3-XCeption)<br>
4. [Results and Conclusion](#4.-Results-and-Conclusion)<br>

***Please remember to upvote if you find this Notebook helpful!***

# 1. Overview

Dermatoscopy is a diagnostic technique that can improve the diagnosis of benign and malignant pigmented skin lesions. Besides increasing the accuracy of skin cancer detection compared to naked-eye exams, dermatoscopic images can also be used to train artificial neural networks (ANNs). In the past, promising attempts have been made to use ANNs to classify skin lesions. However, the lack of data and computing power limited the application of this method.

The [ISIC archive](https://isic-archive.com/) is the largest public database for dermatoscopic image analysis research, and it is where the original HAM10000 was made available. In 2018, the database contained approximately 13,000 dermatoscopic images. Currently, the database holds over 60,000 images, demonstrating the power of collaboration between different scientific groups.

As mentioned by the authors, the original paper and release of HAM10000 aimed to boost the research on the automated diagnosis of dermatoscopic images. We can say they have certainly achieved that goal after three successful challenges and an impressive expansion of the database.


## 1.1. Content ##

The HAM10000 dataset is composed of 10,015 dermatoscopic images of pigmented skin lesions. The data was collected from Australian and Austrian patients. Two institutions provided the images: the practice of Cliff Rosendahl in Queensland, Australia, and the Medical University of Vienna, Austria. According to the authors, seven classes are defined in this dataset, where some diagnoses were unified into one class for simplicity. Information regarding patient age, sex, lesion location and diagnosis is also provided with each image.

## 1.2. Acknowledgements ##

The dataset has been collated and published by [Tschandl, P., Rosendahl, C. & Kittler, H.](https://www.nature.com/articles/sdata2018161)

In [None]:
from numpy.random import seed
seed(1)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from glob import glob
import seaborn as sns
from PIL import Image
from sklearn.preprocessing import label_binarize
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import itertools

import keras
from keras.applications import ResNet50, Xception
from keras.models import Sequential, Model
from keras.layers import Activation,Dense, Dropout, Flatten, Conv2D, MaxPool2D,AveragePooling2D,GlobalMaxPooling2D
from keras import backend as K
from keras.wrappers.scikit_learn import KerasClassifier
from keras.layers import BatchNormalization
from keras.utils import to_categorical # convert to one-hot-encoding
from keras import regularizers
from keras.optimizers import Adam, SGD
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

np.random.seed(123)

# 2. The Data

In this section, we analyse our metadata and understand a bit more regarding the patients and dataset distribution. 

## Key Insights

* Small number of missing values, only in the Age feature; these were replaced with the most frequent value
* Similar distribution between males and females
* Melanocytic nevi is the dominant class in the dataset (67%). This could bias the models towards this type of skin lesion
* Most samples are from patients within 35 - 60 yrs old
* Melanoma, a malignant skin lesion, seems to be more common between the ages of 45 and 70. Males represent 62% of the cases of this type of lesion

A sample of each type of skin lesion present in the dataset is demonstrated in the chart below.

In [None]:
def basic_EDA(df):
    """Print the size of the dataframe and a summary of duplicated and null entries."""
    n_rows, n_cols = df.shape
    sum_duplicates = df.duplicated().sum()
    sum_null = df.isnull().sum().sum()
    # Rows containing at least one null value
    n_rows_with_null = df.isnull().any(axis=1).sum()
    print("Number of Samples: %d,\nNumber of Features: %d,\nDuplicated Entries: %d,\nNull Entries: %d,\nNumber of Rows with Null Entries: %d %.1f%%"
          % (n_rows, n_cols, sum_duplicates, sum_null, n_rows_with_null, (n_rows_with_null / n_rows) * 100))

def summary_table(df):
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    return summary

def countplot(df, x, x_axis_title, y_axis_title, plot_title):
    """Draw a count plot of column `x`, with bars ordered from most to least frequent."""
    plt.figure(figsize=(20,8))
    sns.set(style="ticks", font_scale = 1)
    ax = sns.countplot(data = df,x=x,order = df[x].value_counts().index,palette="Blues_d")
    sns.despine(top=True, right=True, left=True, bottom=False)
    plt.xticks(rotation=0,fontsize = 12)
    ax.set_xlabel(x_axis_title,fontsize = 14,weight = 'bold')
    ax.set_ylabel(y_axis_title,fontsize = 14,weight = 'bold')
    plt.title(plot_title, fontsize = 16,weight = 'bold')

In [None]:
#Lesion Dictionary
lesion_type_dict = {
    'nv': 'Melanocytic nevi',
    'mel': 'Melanoma',
    'bkl': 'Benign keratosis-like lesions ',
    'bcc': 'Basal cell carcinoma',
    'akiec': 'Actinic keratoses',
    'vasc': 'Vascular lesions',
    'df': 'Dermatofibroma'
}

base_skin_dir = os.path.join('..', 'input')
#Dictionary for Image Names
imageid_path_dict = {os.path.splitext(os.path.basename(x))[0]: x for x in glob(os.path.join(base_skin_dir, '*','*', '*.jpg'))}
#Read File csv
skin_df = pd.read_csv('../input/skin-cancer-mnist-ham10000/HAM10000_metadata.csv')
#Create useful Columns - Images Path, Lesion Type and Lesion Categorical Code
skin_df['path'] = skin_df['image_id'].map(imageid_path_dict.get)
skin_df['cell_type'] = skin_df['dx'].map(lesion_type_dict.get) 
skin_df['cell_type_idx'] = pd.Categorical(skin_df['cell_type']).codes

In [None]:
img = skin_df.sample(n=500,replace=False, random_state=1)
img['image'] = img['path'].map(lambda x: np.asarray(Image.open(x).resize((100,75))))
#Image Sampling
n_samples = 3

fig, m_axs = plt.subplots(7, n_samples, figsize = (4*n_samples, 3*7))

for n_axs, (type_name, type_rows) in zip(m_axs,img.sort_values(['cell_type']).groupby('cell_type')):
    n_axs[0].set_title(type_name)
    for c_ax, (_, c_row) in zip(n_axs, type_rows.sample(n_samples, random_state=1234).iterrows()):
        c_ax.imshow(c_row['image'])
        c_ax.axis('off')

The CSV file originally contains seven features. In the code above we add a few columns to support the data analysis and make it easier to load the images later on. The null entries are limited to the Age feature, so no major data cleansing is required and the dataset is pretty much ready to use.
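
Replacing the missing ages with the most frequent value can be sketched as follows. This is a minimal illustration on a hypothetical toy column (`ages` stands in for `skin_df['age']`); note that `mode()` returns a Series because ties are possible, so a single value has to be picked from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for skin_df['age']
ages = pd.Series([45.0, 50.0, np.nan, 45.0, np.nan, 70.0])

# mode() ignores NaN and returns a Series, so take its first entry
most_frequent_age = ages.mode()[0]
ages_filled = ages.fillna(most_frequent_age)

print(most_frequent_age)
print(ages_filled.isnull().sum())
```

Passing the whole `mode()` Series to `fillna` would align on the index and fill only the matching rows, which is why the scalar is extracted first.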

In [None]:
basic_EDA(skin_df)

In [None]:
summary_table(skin_df)

The summary helps to understand the type of metadata collected. Some lesions must have more than one image, since there are fewer unique lesion IDs than image IDs. The Uniques column also indicates the number of classes (dx = 7) and how the age, sex and localization features were organized.

All the features are pretty much self-explanatory. To clarify, the **dx_type** column records the technique used to confirm the diagnosis of the skin lesion.

The count plots below help to understand the distribution of the data.

In [None]:
countplot(skin_df,'dx_type', 'Diagnostic Type', 'Count', 'Samples per Type of Diagnosis')
countplot(skin_df,'cell_type', 'Type of Skin Lesion', 'Count', 'Samples per Type of Skin Lesion')
countplot(skin_df,'sex', 'Gender', 'Count', 'Samples per Gender')

* The main thing to keep in mind is the imbalance between the different classes of skin lesion. Melanocytic nevi account for approximately 67% of the samples
* The least represented classes are Dermatofibroma and Vascular lesions, with only 115 and 142 samples, respectively
* Approximately 55% of the samples are from male patients, so there is no significant difference between genders
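
One common way to counteract such an imbalance is to weight each class inversely to its frequency. The sketch below is not used in this notebook; the function name `inverse_frequency_weights` and the toy labels are hypothetical and only illustrate the idea.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n_samples / (num_classes * class_count)."""
    counts = np.bincount(labels, minlength=num_classes)
    weights = np.zeros(num_classes, dtype=float)
    present = counts > 0  # avoid dividing by zero for absent classes
    weights[present] = len(labels) / (num_classes * counts[present])
    return {cls: float(w) for cls, w in enumerate(weights)}

# Hypothetical toy labels: class 0 dominates, classes 1 and 2 are rare
toy_labels = np.array([0] * 8 + [1] + [2])
class_weight = inverse_frequency_weights(toy_labels, num_classes=3)
print(class_weight)
```

A dictionary of this form could in principle be passed to the `class_weight` argument of Keras `model.fit`, so that errors on rare classes contribute more to the loss.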

In [None]:
# mode() returns a Series, so take its first entry (the most frequent age) as the fill value
skin_df['age'] = skin_df['age'].fillna(skin_df['age'].mode()[0])

plt.figure(figsize=(20,8))
sns.set(style="ticks", font_scale = 1)
ax = sns.countplot(data = skin_df,x='age',palette="Blues_d")
sns.despine(top=True, right=True, left=True, bottom=False)
plt.xticks(rotation=0,fontsize = 12)
ax.set_xlabel('Age',fontsize = 14,weight = 'bold')
ax.set_ylabel('Count',fontsize = 14,weight = 'bold')
plt.title('Age Distribution', fontsize = 16,weight = 'bold');

* The samples are predominantly from patients within 40 - 55 years old
* The number of samples rises sharply after 25 years old, doubling at 30 years old and almost doubling again at 35 years old
* Between the ages of 60 and 70 the number of samples remains almost stable, returning to the downward trend after 75 years old

## 2.1. Melanoma


Melanoma is the deadliest type of skin lesion present in this dataset; surgical removal in the early stage of the cancer can provide a cure. Basal cell carcinoma is also malignant and actinic keratoses are considered pre-cancerous, while the remaining classes are benign, even though they may require treatment.

In this segment, we make a quick analysis of Melanoma skin lesions.

In [None]:
skin_mel = skin_df.loc[:,['age','sex','localization','cell_type']]
skin_mel = skin_mel[skin_mel['cell_type'] == 'Melanoma']

countplot(skin_mel,'sex', 'Gender', 'Count', 'Melanoma - Malignant Cancer per Gender')

* Regarding gender, in the previous section, we saw a similar distribution across the main two genders. However, considering only Melanoma we can see from the above plot that more than 60% of the cases reported are Male

In [None]:
plt.figure(figsize=(20,8))
sns.set(style="ticks", font_scale = 1)
ax = sns.countplot(data = skin_mel,x='age',palette="Blues_d", hue = 'sex')
sns.despine(top=True, right=True, left=True, bottom=False)
plt.xticks(rotation=0,fontsize = 12)
ax.set_xlabel('Age',fontsize = 14,weight = 'bold')
ax.set_ylabel('Count',fontsize = 14,weight = 'bold')
plt.title('Age Distribution - Melanoma', fontsize = 16,weight = 'bold');

* The age distribution is quite different for Melanoma diagnoses when compared to the whole dataset
* For both genders, there are two distinct peaks, at 55 and 70 years old. The peaks could be related to the ages at which people usually have a full health check-up
* Younger Melanoma samples are more likely to be from females. The gender difference is striking between the ages of 25 and 40
* For males, melanoma has a higher incidence in older patients
* After 40 years of age, the number of female cases seems to stabilize. The peaks at 55 and 70 years old are not as pronounced as in the male samples

In [None]:
skin_local = skin_mel.groupby(['localization']).size().sort_values(ascending=False, inplace=False).reset_index()
skin_local.columns = ['localization', 'count']
sort_by = skin_local['localization']

skin_heat = skin_mel.groupby(['age','localization']).size().reset_index()
skin_heat.columns = ['age', 'localization', 'count']
skin_heat.sort_values('count', ascending=False, inplace=True)

def heatmap(df, index,columns,values,vmax,sort_by,Title):
    df_wide = df.pivot(index=index, columns=columns, values=values)
    df_wide = df_wide.reindex(index=sort_by)
    plt.figure(figsize=(12,8))
    ax = sns.heatmap(df_wide, annot=True, fmt='.0f', yticklabels='auto', cmap=sns.color_palette("YlGnBu", as_cmap=True), center=.2,vmin = 0, vmax = vmax,linewidths=.5)
    ax.xaxis.tick_top() # x axis on top
    ax.xaxis.set_label_position('top')
    ax.set_xlabel(columns,fontsize = 14,weight = 'bold')
    ax.set_ylabel(index,fontsize = 14,weight = 'bold')    
    ax.set_title(Title,fontsize = 16,weight = 'bold',pad=20)
    plt.show()
    
heatmap(skin_heat,'localization', 'age','count', 20,sort_by,'Age and Localization of Melanomas')

* The heatmap gives a good representation of how age influences cancer incidence. Note the cluster between the ages of 45 and 70
* The back and the upper and lower extremities are the most common locations of melanoma. For the age groups of 50 and 70 years old, the face, abdomen, chest and trunk also show a higher number of cases
* The scalp seems to be a more common localization only at 70 years old
* The localizations do not seem to be related to the parts of the body most commonly exposed to the sun. If that were the case, the scalp, hands and face should have a higher incidence

# 3. Models

Here we explore three different CNN architectures: one built manually, ResNet-50 and XCeption. The data preparation is the same for all of them, and consists of:
* Add the images to the Dataframe
* Separate the dataframe into Features and Targets data
* Create Training and Test sets (80 - 20 ratio)
* Normalise the input. Following best practice, the normalisation uses the mean and standard deviation of the training set as reference. The test data must not be normalised with its own statistics, as it should remain unseen
* One Hot Encoding to transform the Target labels
* Separate the training set into Training and Validation sets (90 - 10 ratio)
* The CNN requires the images as 3-dimensional arrays (height = 75px, width = 100px, channels = 3)
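
The normalisation step can be sketched in isolation. The arrays below are random, hypothetical stand-ins for the image data (names prefixed with `toy_` are not part of the notebook); the point is that the test set reuses the statistics computed on the training set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the image arrays: (n_images, height, width, channels)
toy_train = rng.integers(0, 256, size=(8, 75, 100, 3)).astype(np.float64)
toy_test = rng.integers(0, 256, size=(2, 75, 100, 3)).astype(np.float64)

# Statistics come from the training images only
train_mean = toy_train.mean()
train_std = toy_train.std()

toy_train_norm = (toy_train - train_mean) / train_std
toy_test_norm = (toy_test - train_mean) / train_std  # reuse the training statistics

print(toy_train_norm.mean(), toy_train_norm.std())
```

After this step the training data has approximately zero mean and unit standard deviation, while the test data is only approximately standardised, which is expected.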

In [None]:
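# PIL's resize takes (width, height); the resulting array has shape (height, width, channels) = (75, 100, 3)
skin_df['image'] = skin_df['path'].map(lambda x: np.asarray(Image.open(x).resize((100,75))))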

features = skin_df.drop(columns=['cell_type_idx'])
target=skin_df['cell_type_idx']

# Create First Train and Test sets
x_train_o, x_test_o, y_train_o, y_test_o = train_test_split(features, target, test_size=0.20,random_state=123)

#The normalisation is done using the training set Mean and Std. Deviation as reference
x_train = np.asarray(x_train_o['image'].tolist())
x_test = np.asarray(x_test_o['image'].tolist())

x_train_mean = np.mean(x_train)
x_train_std = np.std(x_train)

x_train = (x_train - x_train_mean)/x_train_std
x_test = (x_test - x_train_mean)/x_train_std

# Perform one-hot encoding on the labels
y_train = to_categorical(y_train_o, num_classes = 7)
y_test = to_categorical(y_test_o, num_classes = 7)

#Splitting training into Train and Validatation sets
x_train, x_validate, y_train, y_validate = train_test_split(x_train, y_train, test_size = 0.1,random_state=123)

#Reshaping the Images into 3 channels (RGB)
x_train = x_train.reshape(x_train.shape[0], *(75,100, 3))
x_test = x_test.reshape(x_test.shape[0], *(75,100, 3))
x_validate = x_validate.reshape(x_validate.shape[0], *(75,100, 3))

## 3.1 CNN

The Convolutional Neural Network (CNN) is a specialised type of neural network, ideal for data that can be represented as a grid. CNNs are most commonly used for image recognition tasks, since an image can be perceived as a 2D grid of pixels. As described by Goodfellow et al. (2016), CNNs are neural networks that use the convolution operation in at least one of their layers.

The CNN architecture can vary, and later we will explore well-known CNN models. This section presents a basic CNN architecture inspired by previous works and refined by trial and error.

* Average Pooling performed better than MaxPooling during my trials
* Batch Normalization and Dropout seem to help avoid overfitting
* A smaller pooling window (2,2) seems to perform better than larger windows (3,3) and (5,5)
* Convolutional Layer > Activation Function > Batch Normalization > Dropout was the best combination I found 

>The original paper on Batch Normalization applies it before the activation function. I tried that ordering, and it reduced the accuracy by about 2%
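
To make the difference between the two pooling types concrete, here is a minimal NumPy sketch of non-overlapping 2x2 pooling on a single feature map. The helper `pool2d` and the toy values are hypothetical; Keras' `AveragePooling2D` and `MaxPool2D` layers apply the same idea per channel.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="avg"):
    """Non-overlapping pooling over a 2D array whose sides are multiples of `size`."""
    h, w = feature_map.shape
    # Split the map into (size x size) blocks: axes 1 and 3 index positions inside a block
    blocks = feature_map.reshape(h // size, size, w // size, size)
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    return blocks.max(axis=(1, 3))

toy_map = np.array([[1., 2., 0., 0.],
                    [3., 4., 0., 8.],
                    [1., 1., 2., 2.],
                    [1., 1., 2., 2.]])

avg_pooled = pool2d(toy_map, mode="avg")
max_pooled = pool2d(toy_map, mode="max")
print(avg_pooled)
print(max_pooled)
```

Average pooling summarises every value in the window, whereas max pooling keeps only the strongest activation (the isolated 8 in the top-right block dominates the max but is diluted in the average).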


In [None]:
#Model Parameters
input_shape = (75, 100, 3)
num_classes = 7

# `lr` is a deprecated alias of `learning_rate`; the remaining arguments are the Adam defaults
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)

epochs = 100
batch_size = 20

#Callbacks
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', patience=5, verbose=0, factor=0.5, min_lr=0.00001)
early_stopping_monitor = EarlyStopping(patience=20,monitor='val_accuracy')

#Data Augmentation
dataaugment = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=90,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0, # Randomly zoom image 
        width_shift_range=0,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False,  # randomly flip images
        shear_range = 10) 

dataaugment.fit(x_train)

def history(model):
    """Compile and train the model, then return the test predictions with the training, validation and test accuracies."""
    model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=["accuracy"])
    # Train on augmented batches; the validation set drives the callbacks
    fit_history = model.fit(dataaugment.flow(x_train,y_train, batch_size=batch_size),
                        epochs = epochs, validation_data = (x_validate,y_validate),
                        verbose = 0, steps_per_epoch=x_train.shape[0] // batch_size, 
                        callbacks=[learning_rate_reduction,early_stopping_monitor])

    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    predictions = model.predict(x_test)
    loss_v, accuracy_v = model.evaluate(x_validate, y_validate, verbose=0)
    loss_t, accuracy_t = model.evaluate(x_train, y_train, verbose=0)
    return (predictions,accuracy_t,accuracy_v,accuracy)
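
The effect of the learning rate callback configured above can be illustrated with a simplified simulation. This is a sketch of the plateau rule only, assuming the same `factor`, `patience` and `min_lr` values; it ignores the `min_delta` and `cooldown` options of the real `ReduceLROnPlateau`, and the function name and validation accuracies are hypothetical.

```python
def simulate_lr_schedule(val_accuracies, lr=0.001, factor=0.5, patience=5, min_lr=0.00001):
    """Return the learning rate in effect after each epoch (simplified plateau rule)."""
    best = float("-inf")
    wait = 0  # epochs since the last improvement
    schedule = []
    for acc in val_accuracies:
        if acc > best:
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                # Halve the learning rate, but never go below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule

# Hypothetical validation accuracies: improvement for 3 epochs, then a plateau
toy_val_acc = [0.60, 0.65, 0.70] + [0.70] * 10
lr_schedule = simulate_lr_schedule(toy_val_acc)
print(lr_schedule)
```

With a patience of 5, the learning rate is halved after every five consecutive epochs without improvement, which lets training take finer steps once progress stalls.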

In [None]:
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),activation='relu',padding = 'Same',input_shape=input_shape))
model.add(BatchNormalization())
##############################
model.add(Conv2D(64, (3, 3), activation='relu',padding = 'Same'))
model.add(BatchNormalization())
model.add(AveragePooling2D(pool_size = (2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), activation='relu',padding = 'Same'))
model.add(BatchNormalization())

model.add(Conv2D(64, (3, 3), activation='relu',padding = 'Same'))
model.add(BatchNormalization())
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
##############################
model.add(Conv2D(64, (3, 3), activation='relu',padding = 'Same'))
model.add(BatchNormalization())
model.add(AveragePooling2D(pool_size = (2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), activation='relu',padding = 'Same'))
model.add(BatchNormalization())

model.add(Conv2D(64, (3, 3), activation='relu',padding = 'Same'))
model.add(BatchNormalization())
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
##############################
model.add(Flatten())

model.add(BatchNormalization())
model.add(Dense(128, activation='relu'))
model.add(Activation('relu'))
model.add(Dropout(0.25))

#Output
model.add(BatchNormalization())
model.add(Dense(num_classes, activation='softmax'))

y_pred, accuracy_t,accuracy_v,accuracy = history(model)
print("Training: accuracy = %f" % (accuracy_t))
print("Validation: accuracy = %f" % (accuracy_v))
print("Test: accuracy = %f" % (accuracy))

* The accuracy values are not as high as one would wish for cancer prediction. However, there is only a small difference between the training and evaluation sets, which is a good indicator that the model is not overfitting

## 3.2 ResNet-50

The idea behind the residual networks (ResNet) is an attempt to overcome a problem faced by many researchers when working with deeper models where the training error starts to increase as more layers are added to the network. One hypothesis is that accuracy degradation occurs in deeper models because they are harder to optimize.
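
The core of a residual block is the skip connection: the block learns a transformation F(x) and outputs F(x) + x. The sketch below is a minimal NumPy illustration that uses small dense layers instead of convolutions for brevity; the function names and weights are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), with F(x) = relu(x @ w1) @ w2 as a two-layer transformation."""
    fx = relu(x @ w1) @ w2
    return relu(fx + x)  # the skip connection adds the input back

x = np.array([[1.0, 2.0, 3.0]])
zero_w = np.zeros((3, 3))

# With all-zero weights F(x) = 0, so the block simply passes x through
block_output = residual_block(x, zero_w, zero_w)
print(block_output)
```

Because an identity mapping is this easy to represent, adding more residual blocks should not make the network harder to optimise than its shallower counterpart.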

Deep residual learning was introduced by [He et al. (2015)](https://arxiv.org/abs/1512.03385). In the same year, the proposed architecture won first place in the most prestigious image recognition competitions, ILSVRC 2015 (ImageNet) and COCO 2015.

A quick literature review showed that ResNet was consistently among the pre-trained architectures used in the ISIC 2018 challenge. Here we use ResNet-50, the 50-layer variant.

* Best performance was achieved by training all layers of the ResNet50 and using the Imagenet dataset initial weights

In [None]:
base_model = ResNet50(include_top=False, input_shape=(75,100, 3),pooling = 'avg', weights = 'imagenet');

ResNet50model = Sequential()
ResNet50model.add(base_model)
ResNet50model.add(Dropout(0.2))
ResNet50model.add(Dense(128, activation="relu"))
ResNet50model.add(Dropout(0.2))
ResNet50model.add(Dense(num_classes, activation = 'softmax'))
###################################

for layer in base_model.layers:
    layer.trainable = True

ResNet50y_pred,ResNet50accuracy_t,ResNet50accuracy_v,ResNet50accuracy = history(ResNet50model)
    
print("ResNet50 Training: accuracy = %f" % (ResNet50accuracy_t))
print("ResNet50 Validation: accuracy = %f" % (ResNet50accuracy_v))
print("ResNet50 Test: accuracy = %f" % (ResNet50accuracy))

* While the results vary from run to run because of the randomness involved in training, ResNet50 usually performs better than the CNN model presented earlier. The test accuracy varies within the range of 76-77%, similar to the CNN. However, over 5 runs, ResNet50 has a smaller standard deviation than the CNN
* Ideally, the model results should be averaged across several runs. However, since this notebook already takes almost one hour to run, I removed the additional runs from the code

## 3.3 XCeption

The Inception architectures were first introduced by Szegedy et al. in 2014 as GoogLeNet, and since then they have been among the top-performing architectures on the ImageNet dataset.

According to the author of the XCeption paper, François Chollet, this architecture builds on the lessons learned from VGG-16 and previous Inception models. It also uses residual connections (a concept introduced in ResNet) and depthwise separable convolutions, inspired by the initial work of Vanhoucke (2014). In total, the feature extraction base of XCeption uses 36 convolutional layers.

> "XCeption is a linear stack of depthwise separable convolution layers with residual connections" <br>
Chollet, F. (2017)
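
A depthwise separable convolution splits a standard convolution into a per-channel spatial filter (depthwise) followed by a 1x1 convolution that mixes channels (pointwise). The sketch below compares the number of weights of the two variants; the helper names and layer sizes are hypothetical, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a standard convolution: one kernel x kernel filter per (input, output) channel pair."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """Weights of a depthwise separable convolution."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution mixing the channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels
n_standard = standard_conv_params(3, 64, 128)
n_separable = separable_conv_params(3, 64, 128)
print(n_standard, n_separable)
```

For this layer the separable version needs roughly eight times fewer weights, which is what allows XCeption to stack many such layers at a moderate parameter cost.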

In [None]:
training_shape = (75,100, 3)
base_model = Xception(include_top=False,weights='imagenet',input_shape = training_shape)

XCeptionmodel = base_model.output
XCeptionmodel = Flatten()(XCeptionmodel)

XCeptionmodel = BatchNormalization()(XCeptionmodel)
XCeptionmodel = Dense(128, activation='relu')(XCeptionmodel)
XCeptionmodel = Dropout(0.2)(XCeptionmodel)

XCeptionmodel = BatchNormalization()(XCeptionmodel)
XCeptionoutput = Dense(num_classes, activation = 'softmax')(XCeptionmodel)
XCeptionmodel = Model(inputs=base_model.input, outputs=XCeptionoutput)

for layer in base_model.layers:
    layer.trainable = True

XCeptiony_pred,XCeptionaccuracy_t,XCeptionaccuracy_v,XCeptionaccuracy = history(XCeptionmodel)
    
print("XCeption Training: accuracy = %f" % (XCeptionaccuracy_t))
print("XCeption Validation: accuracy = %f" % (XCeptionaccuracy_v))
print("XCeption Test: accuracy = %f" % (XCeptionaccuracy))

# 4. Results and Conclusion

Of the three architectures shown in this study, XCeption had a considerably better accuracy on the test set than the other models. Even though its training and test accuracies showed a bigger gap, the XCeption model did not demonstrate major overfitting issues. As expected, the test set presented slightly lower accuracy values for all networks.

The dataset was unbalanced, with the Melanocytic Nevi being the majority of the samples. For such cases, the accuracy metric can give us a false perception of the model reliability. Other metric scores (F1-Score, Precision, Recall) can provide a better idea of our model behaviour. 
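
A small sketch shows why accuracy alone can mislead on unbalanced data. The labels below are hypothetical (67 majority samples and 33 minority samples, mirroring the 67% share of Melanocytic Nevi), and the "model" always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 67 samples of the majority class (0), 33 of a minority class (1)
y_true = np.array([0] * 67 + [1] * 33)
y_pred = np.zeros_like(y_true)  # a trivial model that always predicts the majority class

accuracy = (y_true == y_pred).mean()
# Recall of the minority class: fraction of true minority samples that were detected
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(accuracy, minority_recall)
```

The trivial model reaches 67% accuracy while detecting none of the minority samples, so any accuracy figure should be read against this baseline.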

First, the confusion matrix can give us a few insights.

In [None]:
predictions = np.argmax(XCeptiony_pred, axis=1)
categories = ['Actinic keratoses', 'Basal cell carcinoma',
              'Benign keratosis-like lesions ', 
              'Dermatofibroma', 
              'Melanocytic nevi',
              'Melanoma', 
              'Vascular lesions']

CMatrix = pd.DataFrame(confusion_matrix(y_test_o, predictions), columns=categories, index =categories)

plt.figure(figsize=(12, 6))
ax = sns.heatmap(CMatrix, annot = True, fmt = 'g' ,vmin = 0, vmax = 20,cmap = 'Blues')
ax.set_xlabel('Predicted',fontsize = 14,weight = 'bold')
ax.set_xticklabels(ax.get_xticklabels(),rotation =90);
ax.set_ylabel('Actual',fontsize = 14,weight = 'bold')    
ax.set_title('Confusion Matrix - Test Set',fontsize = 16,weight = 'bold',pad=20);

From the Confusion Matrix, the main interest was to evaluate how the Melanoma samples were being classified. A significant number of samples were mistakenly classified as Melanocytic Nevi, almost as many as were correctly classified. It is concerning that so many malignant samples are being classified as benign. For this reason, Accuracy is a dangerous metric for this application.

Now, let's analyse the F1-Score results for each class. The F1 score takes into account precision and recall metrics, being more appropriate for unbalanced datasets.
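
For reference, precision, recall and F1 can be derived per class directly from a confusion matrix whose rows are the actual classes and whose columns are the predicted classes (the layout returned by scikit-learn). The helper `per_class_f1` and the toy matrix below are hypothetical.

```python
import numpy as np

def per_class_f1(conf_matrix):
    """Return per-class precision, recall and F1 from a (actual x predicted) confusion matrix."""
    cm = np.asarray(conf_matrix, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: samples predicted as each class
    actual = cm.sum(axis=1)     # row sums: samples that truly belong to each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Hypothetical two-class confusion matrix
toy_cm = [[8, 2],
          [4, 6]]
toy_precision, toy_recall, toy_f1 = per_class_f1(toy_cm)
print(toy_precision, toy_recall, toy_f1)
```

F1 is the harmonic mean of precision and recall, so a class scores well only when it is both rarely missed and rarely predicted by mistake.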

In [None]:
f1 = f1_score(y_test_o, predictions, average=None)
index = categories
f1_df = pd.DataFrame(f1,index, columns = ['F1'])
f1_df.sort_values(['F1'], ascending = False, inplace = True)

plt.figure(figsize=(22,10))
ax = sns.barplot(data =f1_df, x=f1_df.index, y = 'F1',palette = "Blues_d")
#Bar Labels
for p in ax.patches:
        ax.annotate("%.1f%%" % (100*p.get_height()), (p.get_x() + p.get_width() / 2., abs(p.get_height())),
        ha='center', va='bottom', color='black', xytext=(-3, 5),rotation = 'horizontal',textcoords='offset points')
sns.despine(top=True, right=True, left=True, bottom=False)
ax.set_xlabel('Skin Lesion Type',fontsize = 14,weight = 'bold')
ax.set_ylabel('F1-Score',fontsize = 14,weight = 'bold')
ax.set(yticklabels=[])
ax.axes.get_yaxis().set_visible(False) 
plt.title("F1-Score of XCeption Model for each Class", fontsize = 16,weight = 'bold');

We see that for the class with the majority of samples we had a pretty good result. The nevi and vascular lesions classes are the ones that made the model accuracy reach values above 80%. For the least represented classes, we have a disappointing F1 Score, particularly for the Melanoma skin lesion.

How the results could be improved, and ideas for future Notebooks:

* Optimisation methods to find optimal Data Augmentation, number of Convolutional layers and other hyperparameters. For this work, I mainly used Grid Search, which is not the most effective way to find hyperparameters
* Use additional data. The ISIC website contains additional pictures that could be used to improve the detection of the least represented classes
* On a similar note, use a GAN to generate more samples and improve model generalisation
* This [Notebook](https://www.kaggle.com/akarsh1/skin-cancer-classification-with-85-02-accuracy) used the output of the CNN as inputs for Logistic Regression and other ML methods with exciting outcomes
* As a final note, ensemble methods are always a good way to improve model accuracy
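
As an illustration of the last point, a simple ensemble averages the class probabilities predicted by several models and takes the most likely class. The probabilities below are hypothetical softmax outputs, not results from this notebook.

```python
import numpy as np

# Hypothetical softmax outputs of two models for 3 samples and 3 classes
probs_model_a = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.5, 0.3],
                          [0.4, 0.4, 0.2]])
probs_model_b = np.array([[0.5, 0.4, 0.1],
                          [0.1, 0.3, 0.6],
                          [0.2, 0.7, 0.1]])

# Average the probabilities across models, then pick the most likely class per sample
ensemble_probs = np.mean([probs_model_a, probs_model_b], axis=0)
ensemble_pred = ensemble_probs.argmax(axis=1)
print(ensemble_pred)
```

In this notebook the same idea could be applied to the prediction arrays returned for the CNN, ResNet50 and XCeption models, since they share the same test set and class order.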

## Please, remember to upvote if you found this useful :) [Kaggle Notebook](https://www.kaggle.com/jnegrini/ham10000-analysis-and-model-comparison)

# References

P. Tschandl, C. Rosendahl and H. Kittler, “The HAM10000 Dataset, a Large Collection of Multi-Source Dermatoscopic Images of Common Pigmented Skin Lesions,” Scientific Data, vol. 5, p. 180161, 2018.

K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv:1512.03385v1, 2015.

I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, Cambridge: MIT Press, 2016.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, “Going Deeper with Convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

V. Vanhoucke, “Learning Visual Representations at Scale,” ICLR, 2014.

F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258, 2017.