## Problem Statement

The goal is to build a CNN-based model that can accurately detect melanoma. Melanoma is a type of skin cancer that can be deadly if not detected early; it accounts for 75% of skin cancer deaths. A solution that evaluates images and alerts dermatologists to the presence of melanoma could substantially reduce the manual effort needed in diagnosis.

## Import Libraries

In [None]:
# Standard library
import glob
import os
import pathlib
from collections import Counter
from time import time

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import seaborn as sns
import tensorflow as tf
from IPython.display import display
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator, img_to_array
from sklearn.datasets import load_files
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tqdm import tqdm


In [None]:
!pip install tensorflow

In [None]:
import warnings

# Set the warning filter to "ignore"
warnings.filterwarnings("ignore")

In [None]:
import zipfile
#Unzipping compressed Input File
with zipfile.ZipFile('CNN_assignment.zip','r') as f:
    f.extractall('input/')

In [None]:
# Defining the path for train and test images
data_dir_train = pathlib.Path("input/Skin cancer ISIC The International Skin Imaging Collaboration/Train/")
data_dir_test = pathlib.Path('input/Skin cancer ISIC The International Skin Imaging Collaboration/Test/')

In [None]:
image_count_train = len(list(data_dir_train.glob('*/*.jpg')))
print(image_count_train)
image_count_test = len(list(data_dir_test.glob('*/*.jpg')))
print(image_count_test)

### Load using keras.preprocessing

Let's load these images off disk using the helpful `image_dataset_from_directory` utility.

### Create a dataset

Define some parameters for the loader:

In [None]:
batch_size = 32
img_height = 180
img_width = 180

In [None]:
## train dataset
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

In [None]:
## validation dataset
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
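As a rough sketch of what `validation_split=0.2` means for the image counts, the helper below assumes the loader reserves `int(validation_split * n)` files for validation and keeps the rest for training. The function name `split_sizes` and the total of 2,000 images are hypothetical and are not taken from this dataset.

```python
import math


def split_sizes(n_images, validation_split, batch_size):
    """Return (train_count, val_count, train_batches, val_batches)."""
    # Assumed rounding rule: the validation share is truncated to an integer
    val_count = int(validation_split * n_images)
    train_count = n_images - val_count
    # The last batch may be smaller than batch_size, hence the ceiling
    train_batches = math.ceil(train_count / batch_size)
    val_batches = math.ceil(val_count / batch_size)
    return train_count, val_count, train_batches, val_batches


# Hypothetical total of 2,000 images with the notebook's split and batch size
print(split_sizes(2000, 0.2, 32))  # (1600, 400, 50, 13)
```

Using the same `seed` in both loader calls matters here: it is what keeps the training and validation subsets from overlapping.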

In [None]:
# List the skin-lesion classes. The class_names attribute of train_ds holds the directory names in alphabetical order.
class_names = train_ds.class_names
print(class_names)

### Data Visualisation 

In [None]:
import matplotlib.pyplot as plt
import glob
import os
import PIL

# Create a 3x3 grid for visualization
plt.figure(figsize=(10, 10))

# Iterate through different lesion types
for lesion_type in range(9):
    class_path = glob.glob(os.path.join(data_dir_train, class_names[lesion_type], '*'))
    
    # Check if there are any images in this class
    if class_path:
        img = PIL.Image.open(str(class_path[0]))
        ax = plt.subplot(3, 3, lesion_type + 1)
        plt.imshow(img)
        plt.title(class_names[lesion_type])
        plt.axis("off")
    else:
        # If there are no images in this class, display a placeholder or message
        ax = plt.subplot(3, 3, lesion_type + 1)
        plt.text(0.5, 0.5, "No Image", fontsize=12, ha='center')
        plt.title(class_names[lesion_type])
        plt.axis("off")

plt.show()


In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### Training Simple CNN Model 

In [None]:
# Define the number of classes in your dataset
num_classes = 9  #  number of classes

# Create a Sequential model
model = keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(180, 180, 3)),  # Normalize pixel values to [0, 1]
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes, activation='softmax')
])


# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()
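To make the summary easier to read, the arithmetic behind the layer shapes and parameter counts can be reproduced by hand. This is a minimal sketch that assumes the Keras defaults used above: 3x3 kernels with `'valid'` padding and stride 1, and 2x2 max pooling. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    # 'valid' padding with stride 1 trims (kernel - 1) pixels per dimension
    return size - kernel + 1


def pool_out(size, pool=2):
    # 2x2 max pooling halves each dimension, rounding down
    return size // pool


def conv_params(kernel, in_channels, out_channels):
    # One kernel per (input channel, output channel) pair, plus one bias per filter
    return kernel * kernel * in_channels * out_channels + out_channels


size, channels = 180, 3
total = 0
for filters in (32, 64, 128):
    total += conv_params(3, channels, filters)
    size = pool_out(conv_out(size))
    channels = filters

flat = size * size * channels          # length of the flattened feature vector
dense_params = flat * 128 + 128        # Dense(128): weights plus biases
total += dense_params + (128 * 9 + 9)  # add the 9-way output layer

print(size, flat, dense_params, total)  # 20 51200 6553728 6648137
```

Almost all of the parameters sit in the first dense layer, which is one reason a model like this can memorize a small training set.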


### Fitting the Model

In [None]:
epochs = 20
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs,
  verbose=1
)


## Visualising Training Results

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

### Here's what these indicators mean:

**Low Validation Accuracy:** This suggests that the model is not performing well on data it hasn't seen during training. It's not generalizing the patterns it learned from the training data to new, unseen data.

**High Validation Loss:** A high validation loss means that the model's predictions are far off from the true labels on the validation dataset. It's making significant errors on the validation data.
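One way to quantify these indicators is to compute the per-epoch gap between training and validation accuracy. The sketch below uses a dictionary shaped like `history.history`; the numbers are hypothetical and are not results from this notebook.

```python
def generalization_gap(history):
    """Per-epoch difference between training and validation accuracy."""
    pairs = zip(history["accuracy"], history["val_accuracy"])
    return [round(train - val, 4) for train, val in pairs]


# Hypothetical values, for illustration only
example_history = {
    "accuracy": [0.40, 0.60, 0.80, 0.90],
    "val_accuracy": [0.38, 0.50, 0.55, 0.54],
}

gaps = generalization_gap(example_history)
print(gaps)
```

A gap that keeps widening while validation accuracy stalls is the pattern usually described as overfitting.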

**Overfitting** can happen for several reasons, including:

**Complex Model:** The model may be too complex for the data, allowing it to memorize the training data rather than learning meaningful patterns.

**Not Enough Data:** With a small dataset, the model may not have enough examples to learn generalizable patterns. It ends up fitting noise rather than the underlying data distribution.

**Training for Too Long:** If you train the model for too many epochs, it may start memorizing the training data instead of generalizing.

### To address overfitting, you can consider the following techniques:

**Regularization:** Add dropout layers or L1/L2 regularization to reduce overfitting.

**More Data:** Collect more data if possible. More data can help the model learn generalizable patterns.

**Simpler Model:** Use a simpler architecture that is less likely to overfit the data.

**Early Stopping:** Monitor the validation loss during training and stop training when it starts to increase. This prevents the model from overfitting.

**Data Augmentation:** Apply data augmentation techniques to artificially increase the size of your training dataset.

**Hyperparameter Tuning:** Adjust hyperparameters such as learning rate, batch size, and layer sizes to find the best model for your data.

**Cross-Validation:** Use techniques like k-fold cross-validation to get a better estimate of your model's performance.
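The early-stopping rule from the list above can be sketched in plain Python. This is an illustration of the patience logic only, not the code used in this notebook; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training runs to the final epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses that start rising after epoch 2
losses = [1.9, 1.6, 1.5, 1.55, 1.6, 1.7, 1.8]
print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

In Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback, which takes `monitor` and `patience` arguments and can be passed to `model.fit` through `callbacks`.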

### Training Model2 with Dropout layers

In [None]:
# Add a dropout layer to reduce the overfitting seen in the first model

# Define the number of classes in your dataset
num_classes = 9  # number of lesion classes in the dataset

# Create a Sequential model
model = keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(180, 180, 3)),  # Normalize pixel values

    # Convolutional layers (dropout is applied later, after the dense layer)
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Flatten the output from Convolutional layers
    layers.Flatten(),

    # Add Dense layers with Dropout
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),  # Add a dropout layer with a dropout rate of 0.5
    layers.Dense(num_classes, activation='softmax')  # Use 'softmax' for multi-class classification
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()


The **Adam optimizer** is a versatile choice that works well for many deep learning tasks, and **sparse_categorical_crossentropy** is a suitable loss function for multi-class classification when labels are provided as integers. Both are common starting points; you can tune hyperparameters and try other optimizers and loss functions based on the characteristics of your dataset and problem.
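To show what the loss measures when labels are integers, here is a minimal pure-Python sketch. It assumes the model outputs a probability per class (as the softmax layer above does); the function names and the example probabilities are hypothetical.

```python
import math


def sparse_categorical_crossentropy(probabilities, label):
    """Loss for one sample: negative log of the probability given to the true class."""
    return -math.log(probabilities[label])


def mean_loss(batch_probabilities, labels):
    """Average the per-sample losses over a batch."""
    losses = [
        sparse_categorical_crossentropy(probs, label)
        for probs, label in zip(batch_probabilities, labels)
    ]
    return sum(losses) / len(losses)


# Hypothetical 3-class predictions; labels are integer class indices, not one-hot vectors
batch = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
labels = [0, 2]
print(round(mean_loss(batch, labels), 4))  # 0.2899
```

The loss approaches zero as the probability assigned to the true class approaches 1, and grows without bound as that probability approaches 0.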

## Fitting the model 

In [None]:
epochs = 20
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs,
  verbose=1
)

## Visualising Training Results

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

## Training Model3 with Dropout and Data Augmentation

In [None]:
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.5),
    layers.RandomContrast(0.5),
    layers.RandomBrightness(0.3)
])
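As a small illustration of what the flip and rotation layers do, the sketch below applies the two flips to a tiny nested-list "image" and converts a rotation factor to degrees. It assumes the rotation factor is a fraction of a full turn in each direction, so `0.5` would allow rotations of up to half a turn either way. All names here are hypothetical.

```python
def flip_horizontal(image):
    # Mirror left-to-right: reverse the pixels within each row
    return [row[::-1] for row in image]


def flip_vertical(image):
    # Mirror top-to-bottom: reverse the order of the rows
    return image[::-1]


def rotation_range_degrees(factor):
    # Assumption: factor is a fraction of 360 degrees in each direction
    return (-factor * 360.0, factor * 360.0)


tiny = [[1, 2], [3, 4]]
print(flip_horizontal(tiny))        # [[2, 1], [4, 3]]
print(flip_vertical(tiny))          # [[3, 4], [1, 2]]
print(rotation_range_degrees(0.5))  # (-180.0, 180.0)
```

Flips and rotations are reasonable for skin-lesion images because a lesion has no canonical orientation, whereas strong contrast or brightness changes can alter the color cues a classifier relies on.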

In [None]:
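# Visualize how the augmentation pipeline transforms a single training image.
plt.figure(figsize=(10, 10))

for images, labels in train_ds.take(1):
    # Add a batch dimension so the augmentation layers receive a 4-D tensor
    sample = tf.expand_dims(images[0], 0)
    for i in range(9):
        # training=True keeps the random layers active outside of model.fit()
        augmented_image = data_augmentation(sample, training=True)
        ax = plt.subplot(3, 3, i + 1)
        # Clip to the valid pixel range before converting to uint8
        plt.imshow(np.clip(augmented_image[0].numpy(), 0, 255).astype("uint8"))
        plt.title(class_names[int(labels[0])])
        plt.axis("off")

plt.show()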

In [None]:
## Add dropout layers because the earlier models show evidence of overfitting

dropout_conv=0.05
dropout_dense=0.25

model3 = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(img_height, img_width, 3)),
    data_augmentation,
    tf.keras.layers.Rescaling(1./255),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(dropout_conv),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),  
    tf.keras.layers.Dropout(dropout_conv),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(dropout_conv),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(dropout_dense),
    tf.keras.layers.Dense(num_classes)
])

model3.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
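Because this model's final layer has no activation and the loss is built with `from_logits=True`, its raw outputs are logits rather than probabilities. A minimal sketch of converting logits to a predicted class follows; the logits and class names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits, class_names):
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical logits for a 3-class problem
names = ["class_a", "class_b", "class_c"]
name, prob = predict_class([2.0, 0.5, -1.0], names)
print(name, round(prob, 4))
```

In TensorFlow the same conversion can be applied to the model's predictions with `tf.nn.softmax`.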

## Training the model

In [None]:
# Train the model for 20 epochs, as specified in the project pipeline
epochs = 20
history = model3.fit(
  train_ds,
  validation_data = val_ds,
  epochs=epochs
)

## Visualising the results

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

As the curves show, the model with data augmentation and dropout layers does not improve on the training data; in fact, it performs noticeably worse. This suggests underfitting: the model struggles to capture the underlying patterns in the training data, which leads to poor training accuracy.

#### Finding the distribution of classes in the training dataset
#### **Context:** Real-life datasets often have class imbalance: one class can have a proportionately higher number of samples than the others. Class imbalance can have a detrimental effect on final model quality, so as a sanity check it is important to examine the distribution of classes in the data.

In [None]:
# Create a dictionary to store the class distribution
class_distribution = {}

# Iterate through the training dataset and count the samples for each class
for _, labels in train_ds:
    for label in labels.numpy():
        if label in class_distribution:
            class_distribution[label] += 1
        else:
            class_distribution[label] = 1

# Print the class distribution with actual label names
for class_label, count in class_distribution.items():
    class_name = class_names[class_label]  # Get the actual label name
    print(f"Class '{class_name}': {count} samples")
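The same count can be expressed more compactly with `collections.Counter`, which also makes it easy to report each class's share of the data. This is a sketch on a hypothetical list of integer labels and hypothetical class names, not on the notebook's dataset.

```python
from collections import Counter


def class_shares(labels, class_names):
    """Map each class name to (sample count, percentage of all samples)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        class_names[index]: (counts[index], round(100 * counts[index] / total, 1))
        for index in range(len(class_names))
    }


# Hypothetical labels for a 3-class problem
example_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
example_names = ["class_a", "class_b", "class_c"]
shares = class_shares(example_labels, example_names)
print(shares)
# {'class_a': (6, 60.0), 'class_b': (3, 30.0), 'class_c': (1, 10.0)}
```

Classes that never appear in `labels` are reported with a count of zero, because a `Counter` returns 0 for missing keys.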


#### **Findings:**
#### - Which class has the least number of samples?
Seborrheic keratosis has the fewest samples.
#### - Which classes dominate the data in terms of the proportionate number of samples?
Pigmented benign keratosis and melanoma.


####  Rectifying the class imbalance
#### **Context:** We use a Python package called `Augmentor` (https://augmentor.readthedocs.io/en/master/) to add more samples to every class so that no class has very few samples.

In [None]:
!pip install Augmentor

To use `Augmentor`, the following general procedure is followed:

1. Instantiate a `Pipeline` object pointing to a directory containing your initial image data set.<br>
2. Define a number of operations to perform on this data set using your `Pipeline` object.<br>
3. Execute these operations by calling the `Pipeline`'s `sample()` method.


In [None]:
data_dir_train

In [None]:
import Augmentor

# Iterate through the class names and augment each class directory
for class_name in class_names:
    # Join with the / operator so the path is valid on any operating system
    p = Augmentor.Pipeline(str(data_dir_train / class_name))
    p.rotate(probability=0.5, max_left_rotation=10, max_right_rotation=10)
    p.sample(500)  # Add 500 samples per class


In [None]:
image_count_train = len(list(data_dir_train.glob('*/output/*.jpg')))
print(image_count_train)

In [None]:
path_list = [x for x in glob.glob(os.path.join(data_dir_train, '*','output', '*.jpg'))]
path_list

In [None]:
lesion_list_new = [os.path.basename(os.path.dirname(os.path.dirname(y))) for y in glob.glob(os.path.join(data_dir_train, '*','output', '*.jpg'))]
lesion_list_new

In [None]:
dataframe_dict_new = dict(zip(path_list, lesion_list_new))

In [None]:
df2 = pd.DataFrame(list(dataframe_dict_new.items()),columns = ['Path','Label'])

In [None]:
df2['Label'].value_counts()

## Training Model4 with Data Augmentor

In [None]:
batch_size = 32
img_height = 180
img_width = 180

In [None]:
## Training set
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  seed=123,
  validation_split = 0.2,
  subset = "training",
  image_size=(img_height, img_width),
  batch_size=batch_size)

In [None]:
## Validation set
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  seed=123,
  validation_split = 0.2,
  subset = "validation",
  image_size=(img_height, img_width),
  batch_size=batch_size)

In [None]:
## define model 

num_classes = 9

model4 = tf.keras.Sequential([
  tf.keras.layers.InputLayer(input_shape=(img_height, img_width, 3)),
  tf.keras.layers.Rescaling(1./255),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_classes)
])

In [None]:
## compile model 

model4.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [None]:
epochs = 30

history = model4.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
    verbose=0  # Set verbose to 0 to suppress training output
)

## Visualising the trained model

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### **Analysis:** Did we get rid of underfitting/overfitting? Did class rebalancing help?

The model trained on the rebalanced dataset improves significantly on the previous models, reaching a training accuracy of 0.94 and a validation accuracy of 0.84 at 20 epochs. However, the gap between training and validation accuracy and the large fluctuations in the validation loss indicate some overfitting. A learning rate scheduler that reduces the learning rate once accuracy reaches 0.8 may stabilize training and lead to further improvement.

Key takeaways and suggestions:

1. **Overfitting**: The model is still overfitting, as indicated by the fluctuations in the validation loss. To mitigate this, consider introducing dropout layers or other regularization. Adding more layers or neurons increases model capacity and tends to make overfitting worse, so a simpler architecture is usually the safer direction.

2. **Hyperparameter Tuning**: The model's performance can be further improved through hyperparameter tuning. Experiment with different learning rates, batch sizes, and optimizer choices to find the best configuration for your dataset.

3. **Learning Rate Scheduler**: Implement a learning rate scheduler that reduces the learning rate when the accuracy reaches 0.8. This can help the model converge more steadily and potentially reach a better optimum.
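The learning-rate rule suggested above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: the base rate of `1e-3`, the reduced rate of `1e-4`, and the function name are all hypothetical choices.

```python
def learning_rates(accuracies, base_lr=1e-3, reduced_lr=1e-4, threshold=0.8):
    """Learning rate used in each epoch, given the accuracy seen at the end of each epoch."""
    rates = []
    lr = base_lr
    for accuracy in accuracies:
        rates.append(lr)  # rate in effect during this epoch
        if accuracy >= threshold:
            lr = reduced_lr  # takes effect from the next epoch onward
    return rates


# Hypothetical per-epoch accuracies crossing the 0.8 threshold at epoch 2
print(learning_rates([0.5, 0.7, 0.82, 0.85]))  # [0.001, 0.001, 0.001, 0.0001]
```

In Keras, a rule like this would typically be wired into training through a callback, for example `tf.keras.callbacks.ReduceLROnPlateau`, which lowers the rate when a monitored metric stops improving rather than at a fixed accuracy.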

By addressing these issues and fine-tuning the model, you can achieve even better results and a more robust model for skin cancer detection.