# Experiment 1 - Minimum Balancing with 64x64 Resizing

This experiment consists of the data pre-processing pipeline described below:

- Data Balancing with Reduction + Oversampling, yielding a final dataset of 2,100 training images.
- Image Resizing to 64x64 in order to improve performance.
- Pixel Values Normalisation to range 0 to 1.

No further data augmentation was implemented during this experiment.

## Initial Setup

This initial setup mounts Google Drive and adds the project folder to the Python path, which allows classes to be imported directly from other notebooks available on Google Colab.

In [1]:
#!pip install nbimporter

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
import sys
sys.path.append('/content/drive/MyDrive/Colab Notebooks/skin-cancer-project')

import nbimporter

## Libraries Import

In [4]:
import pandas as pd
import numpy as np
import keras
import tensorflow as tf
from isic2018_task3_data_preprocessing import DataBalancer, DataPreparer
import matplotlib.pyplot as plt

## Class Instances

In [5]:
db = DataBalancer()
dp = DataPreparer()

## Data Pre-Processing

### Capture and prepare labels data

In this step, the CSV file containing the lesion type diagnoses is loaded into a DataFrame. It is then transformed to collapse the one-hot diagnosis columns into a single lesion type feature, encode the categorical labels, and store the full path of each image based on the Google Drive folder structure.

In [6]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/skin-cancer-project/datasets/train/ISIC2018_Task3_Training_GroundTruth.csv')
df.head()

Unnamed: 0,image,MEL,NV,BCC,AKIEC,BKL,DF,VASC
0,ISIC_0024306,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,ISIC_0024307,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,ISIC_0024308,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,ISIC_0024309,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,ISIC_0024310,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
df = dp.labelPathMapper(df)
df.head()

Unnamed: 0,image,lesion_type,label_encoded,img_path
0,ISIC_0024306,NV,0,/content/drive/Colab Notebooks/skin-cancer-pro...
1,ISIC_0024307,NV,0,/content/drive/Colab Notebooks/skin-cancer-pro...
2,ISIC_0024308,NV,0,/content/drive/Colab Notebooks/skin-cancer-pro...
3,ISIC_0024309,NV,0,/content/drive/Colab Notebooks/skin-cancer-pro...
4,ISIC_0024310,MEL,1,/content/drive/Colab Notebooks/skin-cancer-pro...
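
The label transformation above can be illustrated with a minimal sketch. This is not the actual `labelPathMapper` implementation, whose source is not shown here. The function name `map_labels_and_paths`, the `IMAGE_DIR` folder, the `.jpg` extension and the use of `pd.factorize` for encoding are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical base folder; the real notebook builds paths from its Google Drive layout.
IMAGE_DIR = "/path/to/train/images"

def map_labels_and_paths(df, image_dir=IMAGE_DIR):
    """Collapse one-hot diagnosis columns into one label, encode it, and build image paths."""
    class_cols = [c for c in df.columns if c != "image"]
    out = df[["image"]].copy()
    # idxmax along each row returns the name of the column holding the 1.0.
    out["lesion_type"] = df[class_cols].idxmax(axis=1)
    # factorize assigns integer codes in order of first appearance (assumed encoding scheme).
    out["label_encoded"], _ = pd.factorize(out["lesion_type"])
    # The ".jpg" extension is an assumption about how the image files are named.
    out["img_path"] = image_dir + "/" + out["image"] + ".jpg"
    return out

# Two rows shaped like the ground-truth CSV, with a reduced set of diagnosis columns.
sample = pd.DataFrame({
    "image": ["ISIC_0024306", "ISIC_0024310"],
    "MEL": [0.0, 1.0],
    "NV": [1.0, 0.0],
    "DF": [0.0, 0.0],
})
mapped = map_labels_and_paths(sample)
print(mapped)
```

With this scheme the first label encountered receives code 0, so the actual integer assigned to each lesion type depends on row order. A fixed mapping dictionary would be a safer choice if the encoding must be stable across runs.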


### Data Balancing

For this experiment, data balancing consists of drawing n random samples from each category, where n is the record count of the smallest lesion type in the dataset. Based on the ISIC 2018 training dataset, as observed in the file ISIC2018_Task3_Data_Analysis, this corresponds to lesion type DF with 115 records.

There are 7 categories, so sampling 115 records per category yields 805 records. This is insufficient data for training purposes, so oversampling is performed to reach 300 images per category, providing 2,100 records for training.

In [8]:
ds = db.minBalancing(df)
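
The reduction and oversampling steps can be sketched in pandas as follows. This is an illustration under stated assumptions, not the actual `minBalancing` code: the function name `min_balance`, its parameters, and the choice to oversample by duplicating rows at random with replacement are hypothetical. The toy data uses three classes and a target of 6 so that the result is easy to check by hand.

```python
import pandas as pd

def min_balance(df, label_col="lesion_type", target=300, seed=42):
    """Reduce every class to the smallest class size, then oversample up to `target`."""
    # Size of the smallest class (115 for DF, according to the text above).
    n_min = df[label_col].value_counts().min()
    parts = []
    for _, group in df.groupby(label_col):
        # Reduction: keep n_min random rows of this class.
        reduced = group.sample(n=n_min, random_state=seed)
        # Oversampling (assumed strategy): duplicate random rows until the target is reached.
        extra = reduced.sample(n=max(target - n_min, 0), replace=True, random_state=seed)
        parts.append(pd.concat([reduced, extra]))
    # Shuffle so that the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Arithmetic from the text: 7 classes at 115 and at 300 records each.
print(7 * 115, 7 * 300)

# Toy data: class sizes 10, 4 and 7, so the smallest class has 4 records.
toy = pd.DataFrame({
    "image": [f"img_{i}" for i in range(21)],
    "lesion_type": ["A"] * 10 + ["B"] * 4 + ["C"] * 7,
})
balanced = min_balance(toy, target=6)
counts = balanced["lesion_type"].value_counts()
print(counts)
```

Because the extra rows are drawn only from the reduced subset, each class still contains at most n_min distinct images after oversampling. The duplicates increase the record count but add no new visual information.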

### Image Resizing and Normalisation

In [9]:
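# Target image dimensions in pixels
HEIGHT = 64
WIDTH = 64

def resize_wrapper(path, label):
    # map passes only (path, label), so the target size is bound here
    return dp.imageResizer(path, label, HEIGHT, WIDTH)

ds = ds.map(resize_wrapper)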

In [10]:
# Scale pixel values to the range 0 to 1
ds = ds.map(dp.pixelNormalizer)
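
For reference, the two mapping steps above can be sketched with standard TensorFlow calls. This is a minimal illustration, not the actual `imageResizer` or `pixelNormalizer` code. The function names `resize_image` and `normalize_pixels` are hypothetical, and the sketch operates on synthetic in-memory images rather than on file paths, so the file reading and decoding that the real pipeline needs is left out.

```python
import numpy as np
import tensorflow as tf

HEIGHT, WIDTH = 64, 64

def resize_image(image, label):
    # tf.image.resize uses bilinear interpolation by default and returns float32.
    return tf.image.resize(image, [HEIGHT, WIDTH]), label

def normalize_pixels(image, label):
    # Scale 8-bit intensities from the range 0 to 255 down to the range 0 to 1.
    return tf.cast(image, tf.float32) / 255.0, label

# Synthetic stand-in data: four random 100x120 RGB images with integer labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 100, 120, 3), dtype=np.uint8)
labels = np.array([0, 1, 2, 3])

demo_ds = tf.data.Dataset.from_tensor_slices((images, labels))
demo_ds = demo_ds.map(resize_image).map(normalize_pixels)

image, label = next(iter(demo_ds))
print(image.shape, image.dtype)
```

Resizing before normalising keeps the division on the smaller 64x64 tensor. The order does not change the result for bilinear interpolation, since both operations are linear.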