# Skin Cancer Detector with HAM10000

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import tensorflow as tf

### LOAD DATA
The dataset contains 10015 images, distributed in two parts. The file 'HAM10000_metadata.csv' contains the ground truth for the dataset. We use this file to locate the images in their folders and to inspect what each image represents.

In [29]:
path = '.'
file_name = os.path.join(path, 'HAM10000_metadata.csv')
df = pd.read_csv(file_name, na_values=['NA','?'])
disease_types = pd.unique(df['dx'])
print(disease_types)

['bkl' 'df' 'mel' 'vasc' 'bcc' 'nv' 'akiec']


As the output above shows, the dataset contains 7 diagnostic categories. Let's create a dictionary that maps each code to a readable name.

In [38]:
images_type = {'bkl': 'Benign Keratosis', 'nv': 'Melanocytic Nevi',
               'df': 'Dermatofibroma', 'mel': 'Melanoma',
               'vasc': 'Vascular Lesions', 'bcc': 'Basal Cell Carcinoma',
               'akiec': "Bowen's disease"}
images_type

{'bkl': 'Benign Keratosis',
 'nv': 'Melanocytic Nevi',
 'df': 'Dermatofibroma',
 'mel': 'Melanoma',
 'vasc': 'Vascular Lesions',
 'bcc': 'Basal Cell Carcinoma',
 'akiec': "Bowen's disease"}
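A dictionary like this can be applied to the `dx` column to get readable class names and a per-class image count. The snippet below is a minimal sketch on a toy frame with hypothetical rows, not the notebook's actual data; the column name `dx_name` is an assumption introduced for illustration.

```python
import pandas as pd

# Same mapping as the notebook's `images_type` dictionary
images_type = {'bkl': 'Benign Keratosis', 'nv': 'Melanocytic Nevi',
               'df': 'Dermatofibroma', 'mel': 'Melanoma',
               'vasc': 'Vascular Lesions', 'bcc': 'Basal Cell Carcinoma',
               'akiec': "Bowen's disease"}

# Toy stand-in for the metadata frame (hypothetical rows)
toy = pd.DataFrame({'dx': ['nv', 'mel', 'nv', 'bkl']})

# Translate each code into its readable name (hypothetical column `dx_name`)
toy['dx_name'] = toy['dx'].map(images_type)

# Count how many images fall into each class
class_counts = toy['dx_name'].value_counts()
print(class_counts)
```

On the real metadata, the same two calls would show how unevenly the images are spread across the 7 classes.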

The dataset was collected over the last 20 years. Most of the images were classified through histopathology ('histo'), in which a small sample of skin tissue is analysed under a microscope and classified accordingly. Other images were classified using confocal microscopy ('confocal'), which allows clinicians to identify where a lesion is present and where it is not. Finally, the data also contains images that were not classified as rigorously: 'follow_up' (data that needs follow-up examination) and 'consensus' (data classified by clinician consensus). Since these last two categories do not represent a rigorous result, we decided to exclude them from the scope of the analysis.

In [31]:
classification_type = pd.unique(df['dx_type'])
print(classification_type)
# Drop the rows that have the 'consensus' or 'follow_up' value in the column 'dx_type'
indexes_consensus = df[df['dx_type'] == 'consensus'].index
indexes_follow_up = df[df['dx_type'] == 'follow_up'].index
df.drop(indexes_consensus, inplace=True)
df.drop(indexes_follow_up, inplace=True)
classification_type = pd.unique(df['dx_type'])
classification_type

['histo' 'confocal']


array(['histo', 'confocal'], dtype=object)
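The same filtering can be written as a single boolean mask with `Series.isin`, which avoids the two separate `drop` calls. This is a sketch on a toy frame with hypothetical rows; the `reset_index(drop=True)` call is an optional design choice that the notebook itself does not make (its index keeps the original row numbers).

```python
import pandas as pd

# Toy stand-in for the metadata frame (hypothetical rows)
toy = pd.DataFrame({
    'image_id': ['a', 'b', 'c', 'd'],
    'dx_type': ['histo', 'consensus', 'confocal', 'follow_up'],
})

# Diagnosis methods to exclude from the analysis
excluded = ['consensus', 'follow_up']

# Keep only the rows whose dx_type is NOT in the excluded list
filtered = toy[~toy['dx_type'].isin(excluded)].reset_index(drop=True)
print(filtered['dx_type'].unique())
```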

After having deleted the pieces of data that are not relevant for the analysis, we now check if there are any missing values.

In [32]:
df.isnull().any()

lesion_id       False
image_id        False
dx              False
dx_type         False
age             False
sex             False
localization    False
dtype: bool

The 'age' column can have missing values. To handle them, we fill any missing entries with the median age.

In [33]:
median = df['age'].median()
df['age'] = df['age'].fillna(median)
df.isnull().any()

lesion_id       False
image_id        False
dx              False
dx_type         False
age             False
sex             False
localization    False
dtype: bool
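To see how many values are actually affected, it can help to count the missing entries before and after the fill instead of only checking `any()`. The following is a minimal sketch on a toy column with hypothetical ages.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'age' column, with one missing value (hypothetical data)
toy = pd.DataFrame({'age': [40.0, np.nan, 80.0, 60.0]})

# Number of missing ages before the fill
missing_before = int(toy['age'].isnull().sum())

# median() ignores NaN by default, so it is computed on the known ages only
median_age = toy['age'].median()
toy['age'] = toy['age'].fillna(median_age)

# Number of missing ages after the fill
missing_after = int(toy['age'].isnull().sum())
print(missing_before, missing_after, median_age)
```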

We continue the analysis by looking at the data we are going to use.

In [34]:
df

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization
0,HAM_0000118,ISIC_0027419,bkl,histo,80.0,male,scalp
1,HAM_0000118,ISIC_0025030,bkl,histo,80.0,male,scalp
2,HAM_0002730,ISIC_0026769,bkl,histo,80.0,male,scalp
3,HAM_0002730,ISIC_0025661,bkl,histo,80.0,male,scalp
4,HAM_0001466,ISIC_0031633,bkl,histo,75.0,male,ear
...,...,...,...,...,...,...,...
10010,HAM_0002867,ISIC_0033084,akiec,histo,40.0,male,abdomen
10011,HAM_0002867,ISIC_0033550,akiec,histo,40.0,male,abdomen
10012,HAM_0002867,ISIC_0033536,akiec,histo,40.0,male,abdomen
10013,HAM_0000239,ISIC_0032854,akiec,histo,80.0,male,face


Now we are going to load the images and, as far as our resources allow, adapt their size. The images in the dataset come in two directories, part-1 and part-2. We are going to load the images from each directory into separate variables and then merge them into a single array.

In [41]:
path_part_1 = '/Users/tommasocapecchi/Datasets/HAM10000/Images'

# Encode each diagnosis as an integer label and store every image's path
df['dx'] = df['dx'].astype('category')
df['label'] = df['dx'].cat.codes
df['img_path'] = [os.path.join(path_part_1, image_id + '.jpg')
                  for image_id in df['image_id']]

dataset = []
# Only the first 10 images are read here; remove the slice to load all of them
for image_path in df['img_path'][:10]:
    image = plt.imread(image_path, format='jpg')
    dataset.append(image)

dataset = np.array(dataset)
dataset.shape
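Since the images are described as coming in two directories, one way to unify them is to build a single lookup from `image_id` to file path across both folders. The sketch below is an assumption about how this could be done, not the notebook's method: the helper `index_images` and the folder names `part_1` and `part_2` are hypothetical, and the demonstration runs on temporary folders with empty placeholder files.

```python
import os
import tempfile

def index_images(directories):
    """Map image_id -> full path for every .jpg file found in the given directories."""
    paths = {}
    for directory in directories:
        for file_name in sorted(os.listdir(directory)):
            image_id, ext = os.path.splitext(file_name)
            if ext.lower() == '.jpg':
                paths[image_id] = os.path.join(directory, file_name)
    return paths

# Demonstration on temporary folders (hypothetical names and files)
with tempfile.TemporaryDirectory() as root:
    part_1 = os.path.join(root, 'part_1')
    part_2 = os.path.join(root, 'part_2')
    os.makedirs(part_1)
    os.makedirs(part_2)
    # Empty placeholder files standing in for real images
    open(os.path.join(part_1, 'ISIC_0000001.jpg'), 'w').close()
    open(os.path.join(part_2, 'ISIC_0000002.jpg'), 'w').close()

    image_paths = index_images([part_1, part_2])
    print(sorted(image_paths))
```

With such a lookup, a path column could be filled with `df['image_id'].map(image_paths)`, regardless of which folder each file lives in.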

Although the dataset contains more than 10000 images, the preprocessing removed about half of them because they were not classified using a scientific method. The removed images are those classified as 'consensus' or 'follow_up'.

After this brief pre-processing, we are left with 5409 RGB images, each with a dimension of 450x600.
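These dimensions give a rough idea of how much memory the full image array needs, which is relevant when deciding whether to resize. The arithmetic below is a back-of-the-envelope sketch; it assumes the images are stored as 8-bit integers (one byte per channel value), and shows the float32 case for comparison.

```python
# Array dimensions reported in the text: 5409 RGB images of 450x600 pixels
n_images, height, width, channels = 5409, 450, 600, 3

# One byte per value if stored as uint8 (assumption), four bytes if float32
bytes_uint8 = n_images * height * width * channels
bytes_float32 = bytes_uint8 * 4

# Convert to GiB for readability
gib_uint8 = bytes_uint8 / 1024**3
gib_float32 = bytes_float32 / 1024**3
print(f"uint8: {gib_uint8:.2f} GiB, float32: {gib_float32:.2f} GiB")
```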

ValueError: Must pass 2-d input. shape=(5409, 450, 600, 3)
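The cell that raised this error is not shown, so the following rests on an assumption: pandas raises this kind of `ValueError` when an array with more than two dimensions is passed to the `DataFrame` constructor. If a tabular view of the pixels is wanted, one option is to flatten each image into a single row first. The sketch uses a small dummy array rather than the real images.

```python
import numpy as np
import pandas as pd

# Small dummy stand-in for the image array: 4 images of 5x6 pixels, 3 channels
images = np.zeros((4, 5, 6, 3), dtype=np.uint8)

# Flatten every image into one row: shape (n_images, height * width * channels)
flat = images.reshape(len(images), -1)

# A 2-D array is accepted by the DataFrame constructor
frame = pd.DataFrame(flat)

# The original 4-D layout can be restored from the flat table when needed
restored = frame.to_numpy().reshape(images.shape)
print(frame.shape, restored.shape)
```

For model training, keeping the images as a 4-D NumPy array is usually simpler than storing pixels in a DataFrame.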