luisabreu102030/skin_cancer

Skin cancer images classification

Skin cancer

Skin cancer is the most common cancer in the United States of America (U.S.A.). It is estimated that about 4.9 million U.S. adults were treated for skin cancer each year from 2007 to 2011, at an average annual treatment cost of $8.1 billion. 1,2

It is estimated that approximately 9500 people are diagnosed with skin cancer every day, and that one in five Americans will develop skin cancer in their lifetime. 3,4,5

Skin cancer is the out-of-control growth of abnormal cells in the epidermis, the outermost skin layer, caused by unrepaired DNA damage that triggers mutations. These mutations lead the skin cells to multiply rapidly and form malignant tumors. The main types of skin cancer are basal cell carcinoma, squamous cell carcinoma, melanoma and Merkel cell carcinoma. 6

The main causes of skin cancer are excessive exposure to UV radiation from sunlight and the use of indoor UV tanning beds. Genetic predisposition also increases the probability of developing skin cancer. 5

If found early, skin cancer is highly treatable. Often a dermatologist can treat an early skin cancer by simply removing it, but cancers found late become more difficult to treat. 7

Dermatologists advise a monthly self-exam. During each exam one should look for warning signs such as changes in the size, shape, or color of a mole or other skin lesion, the appearance of a new growth on the skin, or a sore that doesn't heal. If one notices any spots on the skin that are different from the others, or anything changing, itching or bleeding, one should contact a dermatologist as soon as possible. 7

Dataset description

The HAM10000 ("Human Against Machine with 10000 training images") dataset was provided by the International Skin Imaging Collaboration (ISIC) for the 2018 challenge hosted at the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference in Granada, Spain. It is a collection of dermatoscopic images from different populations, acquired and stored by different modalities. The dataset consists of 10015 dermatoscopic images and corresponding metadata, which can serve as a training set for academic machine learning purposes.

The HAM10000 dataset has 7 different classes of skin lesion, listed below:

  • Actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec),
  • Basal cell carcinoma (bcc),
  • Benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl),
  • Dermatofibroma (df),
  • Melanoma (mel),
  • Melanocytic nevi (nv)
  • Vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc)

The HAM10000 dataset metadata has 7 different features, listed below:

  1. lesion_id : patient lesion id, one lesion can have multiple images
  2. image_id : image id, each image is unique
  3. dx : skin cancer class [akiec, bcc, bkl, df, mel, nv, vasc]
  4. dx_type : how the diagnosis was achieved [histo, follow_up, consensus, confocal]
    • histo : confirmed through histopathology
    • follow_up : diagnosis through follow-up examination
    • consensus : expert consensus
    • confocal : in-vivo confocal microscopy
  5. age : patient age
  6. sex : patient gender
  7. localization : cancer localization
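
As a rough illustration of the metadata layout described above, the sketch below parses a few rows into typed records. Only the seven column names come from the list above; the sample rows, the identifiers in them, and the `LesionRecord` and `parse_metadata` names are invented for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sample rows; only the column names follow the list above.
SAMPLE_CSV = """lesion_id,image_id,dx,dx_type,age,sex,localization
LESION_A,IMAGE_1,nv,follow_up,45.0,male,back
LESION_A,IMAGE_2,nv,follow_up,45.0,male,back
LESION_B,IMAGE_3,mel,histo,,female,lower extremity
"""


@dataclass
class LesionRecord:
    lesion_id: str        # one lesion can have several images
    image_id: str         # unique per image
    dx: str               # class label, e.g. "nv" or "mel"
    dx_type: str          # histo, follow_up, consensus or confocal
    age: Optional[float]  # None when the age is missing
    sex: str
    localization: str


def parse_metadata(text: str) -> List[LesionRecord]:
    """Parse CSV text into records, mapping an empty age to None."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        age = float(row["age"]) if row["age"] else None
        records.append(
            LesionRecord(
                row["lesion_id"], row["image_id"], row["dx"],
                row["dx_type"], age, row["sex"], row["localization"],
            )
        )
    return records


records = parse_metadata(SAMPLE_CSV)
unique_lesions = {r.lesion_id for r in records}
print(len(records), "images from", len(unique_lesions), "lesions")
```

Counting distinct `lesion_id` values separately from rows makes the "one lesion, multiple images" relationship explicit.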

Data exploration

In order to understand the visual characteristics of the different skin cancers, a figure was built with 3 randomly selected images from each class present in the dataset.

According to the exploration conducted on the dataset's metadata file, the dataset consists of 10015 images and 7 independent variables that provide information about them. There are 57 missing values in the independent variable "Age". To replace these missing values, the distribution of the Age data was studied and, as it was clearly symmetrical, the mean Age was used to fill the missing Age values.

A total of 7470 patients are recorded in the dataset: 4001 are men, 3419 are women, and the gender of 50 patients is undisclosed.
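
The mean imputation described above can be sketched as follows. The ages in the example are made up and the function name is hypothetical; the only thing taken from the text is the idea of filling missing values with the mean of the observed ones.

```python
from statistics import mean
from typing import List, Optional


def fill_missing_with_mean(ages: List[Optional[float]]) -> List[float]:
    """Replace None entries with the mean of the observed ages."""
    observed = [a for a in ages if a is not None]
    if not observed:
        raise ValueError("no observed ages to compute a mean from")
    fill_value = mean(observed)
    return [fill_value if a is None else a for a in ages]


# Made-up ages, not taken from the dataset.
ages = [40.0, None, 50.0, 60.0, None]
filled = fill_missing_with_mean(ages)
print(filled)
```

For a symmetrical distribution the mean and the median are close, which is why the mean is a reasonable fill value here; for a skewed distribution the median would usually be the safer choice.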

The skin cancer most represented in the dataset is Melanocytic Nevi, labeled nv, with 6705 example images. The second most represented are Melanoma, labeled mel, with 1113 images, and Benign Keratosis-like Lesions, labeled bkl, with 1099 images. Basal Cell Carcinoma, labeled bcc, has 514 examples, while Actinic Keratoses and Bowen's Disease, labeled akiec, have 327 images. Vascular Lesions, including angiomas, angiokeratomas, pyogenic granulomas, and hemorrhage, labeled vasc, and Dermatofibroma, labeled df, have significantly fewer example images, with 142 and 115 images respectively.
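
To make the imbalance concrete, the short calculation below turns the image counts quoted above into class shares. The counts are those reported in the paragraph; the variable names are illustrative.

```python
# Image counts per class, as reported in the paragraph above.
class_counts = {
    "nv": 6705, "mel": 1113, "bkl": 1099, "bcc": 514,
    "akiec": 327, "vasc": 142, "df": 115,
}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Ratio between the most and the least represented class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in shares.items():
    print(f"{label:>5}: {class_counts[label]:>5} images ({share:.1%})")
print(f"total: {total}, largest/smallest class ratio: {imbalance_ratio:.1f}")
```

With these counts nv makes up roughly two thirds of the images, so a classifier that always predicts nv would already reach an accuracy close to that share on a similarly distributed test set.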

It is possible to observe that 42.6% of the recorded skin cancers are located on the back and the lower extremity, while 25.4% are located on the trunk, upper extremity and abdomen.

Only a residual number of cancers were diagnosed through in-vivo confocal microscopy. The most used method is confirmation by histopathology exam. The second most used method to diagnose skin cancer was follow-up, which reinforces the importance of regular medical check-ups.

It is noteworthy that, although follow-up is the second most used method for diagnosing skin cancer, it was only used to diagnose Melanocytic Nevi. In-vivo confocal microscopy was only used to diagnose Benign Keratosis-like Lesions. Both Melanocytic Nevi and Benign Keratosis-like Lesions also required medical consensus and a histopathology exam for diagnosis, suggesting that these types of skin cancer may be more challenging to diagnose than the others present in the dataset.

The number of skin cancer example images is similar for both genders, with slightly more examples for male patients.

It is observed that skin cancer cases are more prevalent in patients aged between 40 and 55 years. There are records of patients with an age of 0, so it is necessary to verify this information.
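
One simple way to start that verification is to list the affected records so they can be checked by hand. The sketch below is only an illustration: the records, the `min_age` threshold and the function name are assumptions, not part of the original analysis.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, float]]


def suspicious_ages(records: List[Record], min_age: float = 1.0) -> List[Record]:
    """Return records whose age is below min_age and should be reviewed."""
    return [r for r in records if r["age"] < min_age]


# Made-up records, not taken from the dataset.
records = [
    {"image_id": "IMAGE_1", "age": 0.0},
    {"image_id": "IMAGE_2", "age": 45.0},
    {"image_id": "IMAGE_3", "age": 0.0},
]
to_review = suspicious_ages(records)
print([r["image_id"] for r in to_review])
```

An age of 0 may be a genuine infant record or a placeholder for an unknown age, so the flagged rows are best inspected rather than dropped automatically.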

The following plot reinforces the idea that skin cancer diagnoses are more frequent at the age of 40 than at younger ages.

At the age of 70, skin cancer located on the face is more common, while at the age of 50 it is more common to diagnose cancer in the trunk region. The back is the region where skin cancer is diagnosed most often.

In men the back is the area with the most skin cancer cases diagnosed, while in women it is the lower extremity. In the acral region (e.g. peripheral body parts, such as toes and fingers) only women have records of skin cancer diagnosis.

Melanocytic Nevi is widely distributed across all parts of the body, except for the face, where Benign Keratosis-like Lesions are more common. Melanoma is frequently diagnosed on the back and on the lower and upper extremities.

The following plot tries to capture, in an easier way, the visual relationships between the different features present in the dataset.

CNN for Human skin cancer images classification

The aim of this project was to gain familiarity with the techniques necessary to perform image classification and to develop a better understanding of the issue of skin cancer.

Data preparation

Data preparation is a crucial step in building a successful Convolutional Neural Network (CNN) for image classification, as it is for any other machine learning algorithm. Here are some key steps to follow during data preparation:

  1. Collect and organize the data : Gather a large number of images for the dataset and organize them into categories or labels. Ensure that the data is diverse and representative of the real-world scenarios that the model will encounter. In this case, as mentioned, I have used a dataset provided by ISIC for the 2018 challenge hosted at MICCAI.
  2. Data cleaning : The images in the dataset all had good quality, were unique and were representative of the skin cancers. Had this not been the case, irrelevant images, duplicates, and low-resolution images should have been removed.
  3. Split the dataset : Divide the dataset into training, validation, and testing sets. I have used 60% of the data for training, 20% for validation, and 20% for testing. The training set is used to train the model, the validation set is used to tune the hyperparameters and check for overfitting, and the testing set is used to evaluate the model's performance. In my case I did not tune hyperparameters, because the goal of this work was to learn the stages and steps needed to perform image classification.
  4. Preprocessing : Preprocessing is an essential step in image classification. Techniques such as resizing, normalization, and cropping can be used to make the images uniform and compatible with the model. It's important to ensure that the preprocessing techniques used are suitable for the type of images we have and the task we're trying to solve. In this work the images were resized from 600x450 to 100x75 and then standardized.
  5. Data augmentation : Data augmentation artificially increases the size of the dataset by creating new images from the existing ones, using techniques such as rotation, flipping, zooming, and brightness changes. When an instance of ImageDataGenerator is created with augmentation parameters, the generator randomly applies these augmentations to the input images during training. As a result, the model sees a larger variety of images during training, which helps prevent overfitting and improves the model's ability to generalize to new images. The specific augmentations to use depend on the nature of the problem and the type of images. In this work I created an instance of ImageDataGenerator with the following parameters, which are just some of the many augmentations it supports :
    • shear_range=0.2 : Shear transformation refers to slanting or skewing an image. This parameter specifies the shear intensity, that is, the maximum shear angle applied to the input images. The current Keras documentation gives this angle in degrees, not radians
    • zoom_range=0.2 : Zooming an image refers to scaling the image up or down. This parameter specifies the range of zoom values to apply to the input images. With a value of 0.2 the zoom factor is drawn from the range [0.8, 1.2], so the image can be zoomed in or out by up to 20%
    • horizontal_flip=True : Horizontal flipping refers to flipping the image horizontally. This parameter specifies whether to randomly flip the input images horizontally
  6. Data balancing : Ensure that each class has a similar number of samples to avoid bias in the model. Techniques such as oversampling or undersampling can be used to balance the dataset. In my case I have not applied any of these techniques.
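
The 60/20/20 split from step 3 could look roughly like the sketch below. It splits each class separately (a stratified split) so that rare classes appear in all three sets; the text does not say whether the original split was stratified, so that part, the function name and the toy labels are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    fractions: Tuple[float, float, float] = (0.6, 0.2, 0.2),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels with an imbalance loosely resembling the dataset.
labels = ["nv"] * 60 + ["mel"] * 30 + ["df"] * 10
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Since one lesion can have several images, a stricter variant would split by `lesion_id` rather than by image, so that images of the same lesion never end up in both the training and the testing set.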

By following these steps, it is possible to create a high-quality dataset that is suitable for training the CNN model for image classification.
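
To illustrate what the preprocessing and augmentation steps do to an image, here is a small dependency-free sketch. It is not the Keras code used in this work: images are plain nested lists of grayscale values, the resize is a simple subsampling, the standardization is done per image, and all function names are hypothetical. Only the sizes (600x450 to 100x75) and the `zoom_range=0.2` and horizontal-flip settings come from the text.

```python
import random
from statistics import mean, pstdev
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values, for simplicity


def downsample(image: Image, factor: int) -> Image:
    """Keep every factor-th row and column (a crude resize)."""
    return [row[::factor] for row in image[::factor]]


def standardize(image: Image) -> Image:
    """Shift and scale pixels to zero mean and unit standard deviation."""
    pixels = [p for row in image for p in row]
    mu, sigma = mean(pixels), pstdev(pixels)
    if sigma == 0:
        raise ValueError("a constant image cannot be standardized")
    return [[(p - mu) / sigma for p in row] for row in image]


def horizontal_flip(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def sample_zoom_factor(rng: random.Random, zoom_range: float = 0.2) -> float:
    """Draw a zoom factor uniformly from [1 - zoom_range, 1 + zoom_range]."""
    return rng.uniform(1 - zoom_range, 1 + zoom_range)


# A synthetic 600x450 image (450 rows of 600 pixels), reduced by a factor of 6.
full = [[float(col) for col in range(600)] for _ in range(450)]
small = downsample(full, 6)
print(len(small[0]), "x", len(small))

rng = random.Random(0)
augmented = horizontal_flip(standardize(small)) if rng.random() < 0.5 else standardize(small)
print("zoom factor for this sample:", round(sample_zoom_factor(rng), 3))
```

Both dimensions shrink by the same factor of 6, so the 4:3 aspect ratio of the original images is preserved and lesions are not distorted.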

Model architecture

The model architecture was chosen in a manner that prioritized the objective of learning image classification using CNN over the development of a high-performance model. While the architecture selection process was not exhaustive, it was not random and was based on the task at hand.

Result

The CNN model achieved an accuracy of 70% with a loss of 1.7726. The confusion matrix indicates that the skin cancer type with the highest prevalence in the dataset, Melanocytic Nevi, was classified with the highest accuracy. However, the model struggled to accurately classify Dermatofibroma and Vascular Lesions, most likely because of the small number of examples of these skin cancer types in the dataset.

This work has room for improvement, and future efforts should consider implementing the following steps to enhance the model:

  • Incorporate a hyperparameter optimization stage to improve the model's performance.
  • Enhance the data augmentation techniques to increase the diversity and quality of the dataset.
  • Due to the class imbalance in the dataset, utilize the F1 score instead of accuracy to evaluate the model's performance.
  • Apply various cross-validation techniques to ensure that the model is robust and generalizes well.
  • Experiment with other CNN architectures to find the best model for the task.
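
To show why the F1 score is suggested above, the sketch below compares accuracy with a macro-averaged F1 score on a toy imbalanced example. The labels are invented, and macro averaging (the unweighted mean of the per-class F1 scores) is an assumption, since the list above does not say which F1 variant is meant.

```python
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 = 2*TP / (2*TP + FP + FN) for each class present in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy example: a model that always predicts the majority class.
y_true = ["nv"] * 8 + ["mel"] + ["df"]
y_pred = ["nv"] * 10

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.2f}")
```

The always-majority predictor looks good on accuracy but poor on macro F1, because every class counts equally in the average regardless of how many images it has.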
