# CNN CLASSIFICATION - SKIN CANCER DATASET

## **To start the project, download the 10,015 images and place them in a folder named *'data'*.**
## **Link to download the dataset:** *https://www.kaggle.com/datasets/eliocordeiropereira/skin-cancer-the-ham10000-dataset* 
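
Before running the notebooks, it can help to confirm that the download is complete. The following is a minimal sketch, assuming the images are stored as `.jpg` files directly inside `./data`; the helper name `count_images` is hypothetical and not part of the project.

```python
from pathlib import Path

EXPECTED_IMAGES = 10015  # image count stated above


def count_images(folder="data", suffix=".jpg"):
    """Count files in `folder` whose extension matches `suffix` (case-insensitive)."""
    root = Path(folder)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.iterdir() if p.is_file() and p.suffix.lower() == suffix)


found = count_images()
status = "OK" if found == EXPECTED_IMAGES else "incomplete"
print(f"{found} of {EXPECTED_IMAGES} images found in ./data ({status})")
```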

## A. Project Description
### -> This is a **Computer Vision** project that classifies the type of ***Skin Cancer*** using a convolutional neural network (CNN).
### -> Dataset: Skin Cancer MNIST: HAM10000 (International Skin Imaging Collaboration, ISIC).

### -> Publicly available dataset containing 10,015 dermatoscopic images.

### -> A metadata file contains the demographic information for each lesion.
### -> Each lesion's diagnosis is confirmed by one of the following methods:

  - ***histo (histopathology)***
  - ***follow_up (follow up examination)***
  - ***consensus (expert consensus)***
  - ***confocal (in-vivo confocal microscopy)***

### -> Lesions are classified into 7 class labels (nv, mel, bkl, bcc, akiec, vasc, df):
  - ***Melanocytic nevi (nv)***: Melanocytic nevus - the medical term for a mole (benign).
  - ***Melanoma (mel)***: Melanoma - a type of skin cancer arising from melanocytes (the pigment-producing cells).
  - ***Dermatofibroma (df)***: Dermatofibroma - common and benign.
  - ***Actinic keratoses (akiec)***: Actinic keratoses and intraepithelial carcinoma (also called "Bowen's disease") - an early form of skin cancer.
  - ***Basal cell carcinoma (bcc)***: Basal cell carcinoma - the most common type of skin cancer.
  - ***Benign keratosis-like lesions (bkl)***: Benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses) - common and benign.
  - ***Vascular lesions (vasc)***: Vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage) (benign).

### -> Dataset Details:
  - CSV File: 'ham_meta.csv'
  - Total Records: 10,015 
  - Features: 7 ['lesion_id', 'image_id', 'dx', 'dx_type', 'age', 'sex', 'localization']
  - Total Number of Corresponding Images: 10,015
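
The metadata file can be read and sanity-checked against the details listed above. This is a sketch using only the standard library, assuming `ham_meta.csv` is a comma-separated file with a header row; `load_metadata` is a hypothetical helper name.

```python
import csv

EXPECTED_COLUMNS = ['lesion_id', 'image_id', 'dx', 'dx_type', 'age', 'sex', 'localization']


def load_metadata(csv_path="ham_meta.csv"):
    """Read the metadata CSV into a list of dicts and check that all 7 features exist."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        missing = [c for c in EXPECTED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return list(reader)
```

Calling `len(load_metadata())` on the full file should return 10,015 if the CSV matches the description above.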

## B. EDA Conclusion
- Feature 'age' has 57 missing values.
- For feature 'dx', class label 'nv' is the dominant category with 6,705 entries.
- For feature 'dx_type', 'histo' is the most frequent diagnosis method with 5,340 records.
- Feature 'sex' has 3 categories ['male', 'female', 'unknown']; 'male' occurs 5,406 times.
- For feature 'localization', 'back' is the most frequent category with 2,192 cases.
- The age distribution is slightly left-skewed (negatively skewed), with a skewness of -0.1668.
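
The figures above can be recomputed from the metadata rows. The sketch below is standard-library Python and assumes each row is a dict keyed by the feature names, with missing ages stored as empty strings. The skewness function uses the sample-adjusted Fisher-Pearson estimator, which is an assumption about how the reported value was obtained; `summarize` and `adjusted_skewness` are hypothetical names.

```python
from collections import Counter
from math import sqrt


def adjusted_skewness(values):
    """Sample-adjusted Fisher-Pearson skewness (G1); needs at least 3 non-constant values."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return sqrt(n * (n - 1)) / (n - 2) * (m3 / m2 ** 1.5)


def summarize(rows):
    """Return the missing-age count, the top category per feature, and the age skewness."""
    ages = [float(r["age"]) for r in rows if (r.get("age") or "").strip()]
    top = lambda key: Counter(r[key] for r in rows).most_common(1)[0]
    return {
        "missing_age": len(rows) - len(ages),
        "top_dx": top("dx"),
        "top_dx_type": top("dx_type"),
        "top_sex": top("sex"),
        "top_localization": top("localization"),
        "age_skew": adjusted_skewness(ages),
    }
```

A negative `age_skew` indicates a longer tail toward younger ages, which matches the left-skew described above.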

## C. Data Preparation and Pre-Processing

#### 1. Total Images: 10,015.
#### 2. Read the metadata CSV file to get each image ID and its skin cancer class label.
#### 3. All input images are stored under './data'; they are copied label-wise (['nv', 'mel', 'bkl', 'df', 'akiec', 'bcc', 'vasc']) into per-class directories under './folders'.
#### 4. Split './folders' into train, validation, and test sets, creating the datasets under './images/train', './images/val', and './images/test'.
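
Steps 3 and 4 can be sketched as a per-label (stratified) split followed by a file copy. This is an illustration under stated assumptions, not the project's actual code: it assumes roughly 80/10/10 proportions, files named `<image_id>.jpg`, and it copies straight from `./data` into `./images/<split>/<label>` without the intermediate `./folders` step. The function names are hypothetical, and the exact split sizes may differ from those reported below.

```python
import random
import shutil
from pathlib import Path


def split_ids(ids_by_label, train=0.8, val=0.1, seed=42):
    """Split each label's image IDs into train/val/test lists (stratified by label)."""
    rng = random.Random(seed)
    splits = {"train": {}, "val": {}, "test": {}}
    for label, ids in ids_by_label.items():
        ids = sorted(ids)          # fixed starting order for reproducibility
        rng.shuffle(ids)
        n_train = int(len(ids) * train)
        n_val = int(len(ids) * val)
        splits["train"][label] = ids[:n_train]
        splits["val"][label] = ids[n_train:n_train + n_val]
        splits["test"][label] = ids[n_train + n_val:]   # remainder goes to test
    return splits


def copy_split(splits, src="data", dst="images"):
    """Copy '<src>/<image_id>.jpg' into '<dst>/<split>/<label>/'."""
    for split, by_label in splits.items():
        for label, ids in by_label.items():
            target = Path(dst) / split / label
            target.mkdir(parents=True, exist_ok=True)
            for image_id in ids:
                shutil.copy2(Path(src) / f"{image_id}.jpg", target)
```

Splitting per label keeps the class proportions similar across the three sets, which matters for a dataset as unbalanced as this one.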

## D. Model Implementation Details

### 1. Simple Model with Original Dataset (unbalanced).

#### - Input Train Data Size: (8010) images belonging to 7 classes.
#### - Validation Data Size: (998) images belonging to 7 classes.
#### - Test Data Size: (1007) images belonging to 7 classes.

#### - Class Labels: akiec (0), bcc (1), bkl (2), df (3), mel (4), nv (5), vasc (6)
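
The indices above are consistent with sorting the class folder names alphabetically, which is how Keras directory-based loaders order classes by default. A short sketch (the variable names are illustrative):

```python
labels = ['nv', 'mel', 'bkl', 'df', 'akiec', 'bcc', 'vasc']

# Sorting the folder names alphabetically reproduces the indices listed above.
class_to_index = {name: idx for idx, name in enumerate(sorted(labels))}
index_to_class = {idx: name for name, idx in class_to_index.items()}

print(class_to_index)
# {'akiec': 0, 'bcc': 1, 'bkl': 2, 'df': 3, 'mel': 4, 'nv': 5, 'vasc': 6}
```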

#### - Model:
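    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    num_classes = 7  # akiec, bcc, bkl, df, mel, nv, vasc

    mdl = Sequential()

    # Three Conv2D + MaxPooling2D blocks; filters grow from 32 to 128
    mdl.add(Conv2D(32, kernel_size=(3,3), padding='valid', activation='relu', input_shape=(128,128,3)))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    mdl.add(Conv2D(64, kernel_size=(3,3), padding='valid', activation='relu'))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    mdl.add(Conv2D(128, kernel_size=(3,3), padding='valid', activation='relu'))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    # Classifier head: flatten, two hidden Dense layers, softmax over the classes
    mdl.add(Flatten())

    mdl.add(Dense(128, activation='relu'))
    mdl.add(Dense(64, activation='relu'))
    mdl.add(Dense(num_classes, activation='softmax'))

    # Labels are integer class indices, hence the sparse categorical cross-entropy loss
    mdl.compile(optimizer='Adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), metrics=['accuracy'])

    # train_ds / valid_ds: datasets assumed to be loaded from './images/train' and './images/val'
    mdl.fit(train_ds, epochs=5, validation_data=valid_ds)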
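
To see how the 128 x 128 input shrinks through the network, the spatial sizes can be traced by hand. The sketch below applies the standard output-size rule for 'valid' padding to the three Conv2D + MaxPooling2D pairs of this model; `trace_shapes` is a hypothetical helper, not part of Keras.

```python
def trace_shapes(size=128, conv_filters=(32, 64, 128), kernel=3, pool=2):
    """Follow the spatial size through each Conv2D ('valid') + MaxPooling2D pair."""
    shapes = []
    for filters in conv_filters:
        size = size - kernel + 1            # 'valid' 3x3 convolution, stride 1
        shapes.append(("conv", size, size, filters))
        size = (size - pool) // pool + 1    # 'valid' 2x2 pooling, stride 2
        shapes.append(("pool", size, size, filters))
    flat = size * size * conv_filters[-1]   # length of the Flatten output
    return shapes, flat


shapes, flat = trace_shapes()
for name, h, w, c in shapes:
    print(f"{name}: {h} x {w} x {c}")
print("flatten:", flat)
# conv: 126 x 126 x 32
# pool: 63 x 63 x 32
# conv: 61 x 61 x 64
# pool: 30 x 30 x 64
# conv: 28 x 28 x 128
# pool: 14 x 14 x 128
# flatten: 25088
```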

        
#### - Accuracy:
- Train Dataset Accuracy : 74.00 % 
- Validation Dataset Accuracy : 70.54 %
- Test Dataset Accuracy : 79.92 % 


#### - Conclusion:
- ***Accuracy is acceptable on the train, validation, and test sets (roughly 70-80 %).***
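
Because 'nv' accounts for 6,705 of the 10,015 records, plain accuracy on the unbalanced dataset is worth comparing against a majority-class baseline, and a class-averaged metric can complement it. The following is a small standard-library sketch; both function names are hypothetical, and balanced accuracy is offered here as a suggestion rather than something the project reports.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Share of 'nv' in the full dataset, from the EDA counts above.
print(f"{6705 / 10015:.2%}")  # 66.95%
```

A model that only ever predicts 'nv' would score about 67 % accuracy on the full dataset but only 1/7 balanced accuracy, which is why the two numbers are useful side by side.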

### 2. Data Augmentation and Balancing

#### A. Load the original dataset with 10,015 images.

#### B. Dataset Augmentation Process
##### **B.1 Reduced Features Dataset from the original dataset.**
###### - **a. Small HAM10000 dataset after selecting specific features.**
###### - **b. Tagging the image as 'original' before augmentation. After augmentation, augmented (additional) images will be tagged as 'augmented'.**
###### - **c. Adding Extension '.jpg' to 'image_id' feature**

##### **B.2 Class Labels to be Augmented (using 1,000 images as the threshold value)**
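
Selecting the classes to augment can be sketched as a simple filter on the per-class image counts. The counts below are the HAM10000 class sizes as commonly reported and should be recomputed from the metadata before relying on them; the rule "augment every class with fewer than 1,000 images" is an assumption about how the threshold is applied, and `labels_to_augment` is a hypothetical helper.

```python
# HAM10000 class sizes as commonly reported; recompute from 'ham_meta.csv' to confirm.
class_counts = {'nv': 6705, 'mel': 1113, 'bkl': 1099, 'bcc': 514,
                'akiec': 327, 'vasc': 142, 'df': 115}

THRESHOLD = 1000


def labels_to_augment(counts, threshold=THRESHOLD):
    """Return the labels whose image count is below the threshold, smallest first."""
    below = [name for name, n in counts.items() if n < threshold]
    return sorted(below, key=counts.get)


print(labels_to_augment(class_counts))  # ['df', 'vasc', 'akiec', 'bcc']
```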

##### **B.3 Image Data Generator Object and Augmentation Utility Function**
###### - **a. Image Data Generator Object**
###### - **b. Augmentation Utility Function**
###### - **c. Folders Copy and Data Augmentation Process**
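
As an illustration of what an augmentation utility does, the sketch below generates randomly flipped and rotated copies of one image with NumPy. It is a simplified stand-in and not the image data generator object used in the project; the function names, the choice of transformations, and the seed are all assumptions.

```python
import numpy as np


def augment(image, rng):
    """Return a randomly flipped and rotated copy of an H x W x C image array."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                          # horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)                          # vertical flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))    # rotate by 0, 90, 180 or 270 degrees
    return np.ascontiguousarray(out)


def make_augmented(image, n_copies, seed=0):
    """Create `n_copies` augmented variants of one image, reproducibly."""
    rng = np.random.default_rng(seed)
    return [augment(image, rng) for _ in range(n_copies)]
```

For a square input such as 128 x 128 x 3, every variant keeps the original shape and pixel values; only their arrangement changes.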

##### **B.4 DataFrame Entry of Augmented Images**
###### - **a. Setting Path and Labels**
###### - **b. Preparing Data Structure for the Data Frame Entry**
###### - **c. Augmented Dataset (DataFrame)**
###### - **d. Combining Reduced and Augmented DataFrames**

##### **B.5 Grouping and Resampling the Sub Datasets to Balance the Dataset**
###### - **a. Grouping Data into Sub Datasets**
###### - **b. Resampling 500 images for each Class Label (without replacement)**
###### - **c. Creating Balanced DataFrame after Resampling**
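
The grouping and resampling step can be sketched with the standard library as follows. It assumes the combined dataset is a list of dict rows with the class label under `'dx'`; the helper name `balance` and the seed are hypothetical. With 7 classes and 500 rows each, the result has 3,500 rows.

```python
import random
from collections import defaultdict


def balance(rows, per_class=500, label_key="dx", seed=42):
    """Draw `per_class` rows per label without replacement and return them combined."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)   # a. group rows by class label

    rng = random.Random(seed)
    balanced = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) < per_class:
            raise ValueError(f"{label!r} has only {len(members)} rows")
        balanced.extend(rng.sample(members, per_class))  # b. sample without replacement
    return balanced                                      # c. combined balanced dataset
```

Sampling without replacement is only possible when every class has at least 500 rows, which is why the smaller classes are augmented first.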

##### **B.6 Adding Image Path to Dataset**
###### - **a. Source Directory of all Images**
###### - **b. Adding Path to Dataset**

##### **B.7 Saving the Final Balanced DataFrame into CSV file**

##### **B.8 Copying Augmented Images to the Source (all data gen) Folder**
###### - **a. Setting Path and Labels**
###### - **b. Augmented Data Copy Process**

#### C. Displaying Sample Images

##### **C.1 Final Dataset.**

##### **C.2 Original Dataset Image.**

##### **C.3 Augmented Dataset Image.**

### 3. Simple Model with Final Dataset (balanced) of 3500 images.

#### - Input Train Data Size: (2800) images belonging to 7 classes.
#### - Validation Data Size: (350) images belonging to 7 classes.
#### - Test Data Size: (350) images belonging to 7 classes.

#### - Class Labels: akiec (0), bcc (1), bkl (2), df (3), mel (4), nv (5), vasc (6)

#### - Model:
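    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    nclasses = 7  # akiec, bcc, bkl, df, mel, nv, vasc

    mdl = Sequential()

    # Same architecture as the first model: three Conv2D + MaxPooling2D blocks
    mdl.add(Conv2D(32, kernel_size=(3,3), padding='valid', activation='relu', input_shape=(128,128,3)))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    mdl.add(Conv2D(64, kernel_size=(3,3), padding='valid', activation='relu'))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    mdl.add(Conv2D(128, kernel_size=(3,3), padding='valid', activation='relu'))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    # Classifier head: flatten, two hidden Dense layers, softmax over the classes
    mdl.add(Flatten())

    mdl.add(Dense(128, activation='relu'))
    mdl.add(Dense(64, activation='relu'))
    mdl.add(Dense(nclasses, activation='softmax'))

    mdl.compile(optimizer='Adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), metrics=['accuracy'])

    # Xtrain/ytrain and Xval/yval: image arrays and integer labels prepared from the balanced dataset
    mdl.fit(Xtrain, ytrain, epochs=20, validation_data=(Xval, yval))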

        
#### - Accuracy:
- Train Dataset Accuracy : 87.61 % 
- Validation Dataset Accuracy : 58.86 %
- Test Dataset Accuracy : 54.57 % 


#### - Conclusion:
- ***The model overfits: training accuracy (87.61 %) is far above validation (58.86 %) and test (54.57 %) accuracy.***
- ***Adding more data or tuning the hyper-parameters should be tried to reduce this gap.***
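
The size of the overfitting gap, and how far the model is above chance, can be read directly from the accuracies reported above. A minimal sketch (the `report` helper is hypothetical; chance level assumes the 7 classes are perfectly balanced):

```python
def report(train_acc, val_acc, test_acc, n_classes=7):
    """Summarize the train/validation gap and the chance level for balanced classes."""
    return {
        "train_val_gap": round(train_acc - val_acc, 2),    # percentage points
        "train_test_gap": round(train_acc - test_acc, 2),  # percentage points
        "chance_level": round(100 / n_classes, 2),         # random guessing, in %
    }


print(report(87.61, 58.86, 54.57))
# {'train_val_gap': 28.75, 'train_test_gap': 33.04, 'chance_level': 14.29}
```

So the model is well above the 14.29 % chance level on unseen data, but the gap of roughly 29 to 33 percentage points shows that much of the training accuracy does not generalize.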

### 4. All CNN Models Comparison

### 5. Production Model (Best Model)

#### - Input Data Size: 2800 images belonging to 7 different class labels
#### - Model: 
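    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    nclasses = 7  # akiec, bcc, bkl, df, mel, nv, vasc

    mdl = Sequential()

    # Three Conv2D + MaxPooling2D blocks; filters shrink from 128 to 32
    mdl.add(Conv2D(128, kernel_size=(3,3), padding='valid', activation='relu', input_shape=(128,128,3)))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    mdl.add(Conv2D(64, kernel_size=(3,3), padding='valid', activation='relu'))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    mdl.add(Conv2D(32, kernel_size=(3,3), padding='valid', activation='relu'))
    mdl.add(MaxPooling2D(pool_size=(2,2), padding='valid', strides=2))

    # Smaller classifier head than the simple model (64 and 32 units)
    mdl.add(Flatten())

    mdl.add(Dense(64, activation='relu'))
    mdl.add(Dense(32, activation='relu'))
    mdl.add(Dense(nclasses, activation='softmax'))

    mdl.summary()
    mdl.compile(optimizer='RMSProp', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), metrics=['accuracy'])
    # Xtrain/ytrain and Xval/yval: image arrays and integer labels prepared from the balanced dataset
    hist = mdl.fit(Xtrain, ytrain, batch_size=16, epochs=18, validation_data=(Xval, yval))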
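
The production model reverses the filter order and uses a smaller dense head, which reduces its size considerably. The sketch below counts trainable parameters by hand for both architectures using the standard formulas for Conv2D and Dense layers; `count_params` is a hypothetical helper, and `mdl.summary()` remains the authoritative source for the real numbers.

```python
def count_params(conv_filters, dense_units, size=128, channels=3, kernel=3, pool=2):
    """Trainable parameters of a Conv2D/MaxPooling2D stack followed by Dense layers."""
    total = 0
    for filters in conv_filters:
        total += kernel * kernel * channels * filters + filters  # conv weights + biases
        size = (size - kernel + 1) // pool                       # 'valid' conv, then 2x2 pool
        channels = filters
    inputs = size * size * channels                              # Flatten output length
    for units in dense_units:
        total += inputs * units + units                          # dense weights + biases
        inputs = units
    return total


print(count_params((32, 64, 128), (128, 64, 7)))  # simple model: 3313351
print(count_params((128, 64, 32), (64, 32, 7)))   # production model: 499623
```

Most of the difference comes from the first Dense layer, whose input shrinks from 14 x 14 x 128 to 14 x 14 x 32 values.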
    
        
#### - Accuracy:
- Train Set Accuracy: 86.46 %
- Test Set Accuracy: 49.14 %

#### - Conclusion:
- ***The model overfits: training accuracy (86.46 %) is well above test accuracy (49.14 %).***
- ***More data and fine-tuning are needed to reduce overfitting and improve overall performance.***

## E. Gradio Prediction App

#### - Production Model: *'sk_best_mdl.keras'*.
#### - Class Labels = 'akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc'

#### - Test Image: Uploaded by the user.
#### - The uploaded image is pre-processed to match the model input (128 x 128 x 3) before prediction.
#### - A Gradio function and interface serve the predictions.
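
One piece of the prediction function can be sketched without loading the model: mapping the 7-element softmax output to a `{label: confidence}` dictionary, which is the format Gradio's `Label` output component accepts. The helper name `to_label_scores` and the example probabilities are hypothetical; the label order follows the class list above.

```python
CLASS_LABELS = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']


def to_label_scores(probabilities):
    """Map a 7-element softmax output to a {label: confidence} dict."""
    if len(probabilities) != len(CLASS_LABELS):
        raise ValueError("expected one probability per class label")
    return {label: float(p) for label, p in zip(CLASS_LABELS, probabilities)}


# Made-up probabilities, for illustration only.
scores = to_label_scores([0.02, 0.03, 0.05, 0.01, 0.10, 0.78, 0.01])
print(max(scores, key=scores.get))  # nv
```

In the app, the probabilities would come from the production model's prediction on the pre-processed image, and the returned dictionary would be passed to the interface's output.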