## Final Project, Part 1: *Proposal*

1. Draft a well-formed problem statement relevant to a business problem affecting your team, division, or organization.
2. Include the following elements:
   - Hypothesis/assumptions
   - Goals and success metrics
   - Risks or limitations
3. Identify at least one relevant internal dataset and confirm that you have (or can get) the right access permissions.

- Original dataset from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T
- Kaggle dataset and competition from https://www.kaggle.com/kmader/skin-cancer-mnist-ham10000/home
- Scientific paper from https://arxiv.org/abs/1803.10417


**null hypothesis:** There is no difference in dermatoscopic images of pigmented skin lesions between skin cancer diagnostic categories.

**alternative hypothesis:** There is a difference in dermatoscopic images of pigmented skin lesions between skin cancer diagnostic categories.

**goals and success:** Correctly classify dermatoscopic images of pigmented skin lesions into a skin cancer diagnostic category at a rate higher than chance. There are 7 categories, so uniform chance is 1/7 (about 14.3%) and the success metric is an accuracy of 15% or greater.
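
The threshold follows from simple arithmetic: with 7 equally likely categories, uniform guessing is right 1/7 of the time. The sketch below shows that calculation. `chance_accuracy` and `majority_baseline` are hypothetical helper names, and the counts passed to the second helper are made-up numbers for illustration. When classes are imbalanced, always predicting the most common class can be a stricter baseline than uniform chance.

```python
def chance_accuracy(n_classes):
    """Expected accuracy when guessing uniformly at random among n_classes."""
    return 1.0 / n_classes


def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


print(f"Uniform chance over 7 classes: {chance_accuracy(7):.1%}")

# Made-up counts, only to show how imbalance raises the baseline
toy_counts = {"common": 70, "rare_a": 20, "rare_b": 10}
print(f"Majority baseline on toy counts: {majority_baseline(toy_counts):.1%}")
```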

**risks or limitations:** Requires enough sample data in each diagnostic category. Also requires images to be sufficiently similar within a diagnostic category and sufficiently distinct between diagnostic categories.

## Final Project, Part 2: *Brief*

Exploratory data analysis is a crucial step in any data workflow. Create a Jupyter Notebook that explores your data mathematically and visually. Explore features, apply descriptive statistics, look at distributions, and determine how to handle sampling or any missing values.

1. Create an exploratory data analysis notebook.
2. Perform statistical analysis, along with any visualizations.
3. Determine how to handle sampling or missing values.
4. Clearly identify shortcomings, assumptions, and next steps.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

ham_df = pd.read_csv("./data/HAM10000_metadata.csv")

In [None]:
display(ham_df.shape)
display(ham_df.head())
display(ham_df.describe(include="all"))
display(ham_df.dtypes)

- The HAM metadata has 10015 rows and 7 columns
- There are 7,470 lesions; however, some were photographed at different magnifications and angles, resulting in 10,015 images according to the paper. This serves as natural data augmentation
- The 7 unique `dx` values indicate that this is a multi-class rather than a binary classification problem
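
Because one lesion can appear in several images, a purely random image-level split may place views of the same lesion in both the training and validation sets, which could inflate validation accuracy. The sketch below shows one way to split by lesion instead. It assumes each metadata row carries a `lesion_id` and an `image_id`; the helper name `split_by_lesion` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def split_by_lesion(rows, valid_pct=0.2, seed=42):
    """Split image ids so that all images of a lesion land on the same side."""
    images_by_lesion = defaultdict(list)
    for row in rows:
        images_by_lesion[row["lesion_id"]].append(row["image_id"])

    # Shuffle lesions (not images) and reserve a fraction of them for validation
    lesion_ids = sorted(images_by_lesion)
    random.Random(seed).shuffle(lesion_ids)
    n_valid = max(1, round(len(lesion_ids) * valid_pct))
    valid_lesions = set(lesion_ids[:n_valid])

    train, valid = [], []
    for lesion_id, image_ids in images_by_lesion.items():
        (valid if lesion_id in valid_lesions else train).extend(image_ids)
    return train, valid


# Toy metadata: 10 lesions, every second lesion photographed twice
toy_rows = []
for n in range(10):
    toy_rows.append({"lesion_id": f"lesion_{n}", "image_id": f"img_{n}_a"})
    if n % 2 == 0:
        toy_rows.append({"lesion_id": f"lesion_{n}", "image_id": f"img_{n}_b"})

train_ids, valid_ids = split_by_lesion(toy_rows)
print(len(train_ids), "train images,", len(valid_ids), "validation images")
```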

In [None]:
print("Null values: \n", ham_df.isnull().sum())

In [None]:
age_null = ham_df[ham_df['age'].isnull()]
age_null.head()

- There are NaN and unknown values in the patient information. I propose we neither replace nor drop them, as I am not planning to use these columns as features; they serve only to show the distribution of the data.
- The features would be the images themselves.
- This is despite the distribution of diagnoses varying with patient information.
    - e.g. melanoma on areas not exposed to sun (incidence higher on the leg for women and the back for men) as opposed to akiec (actinic keratoses on the face and Bowen's disease in other areas)

In [None]:
print("Percentage distribution of target diagnostic values: \n", round(ham_df['dx'].value_counts(normalize=True)*100, 2))

variables = ['dx', 'dx_type', 'age', 'sex', 'localization']
for var in variables:
    ham_df[var].value_counts().plot(kind="bar")
    plt.title(var)
    plt.show();

ham_df['age'].plot(kind="hist", bins=15)
plt.title("age")
plt.show();

- `dx`
    - mostly nv (Melanocytic nevi) as these are benign/non-cancerous
    - much less sample data for the others
        - need to ensure there are enough sample data to include smaller categories
        - when modelling need to consider techniques such as oversampling, stratified folds
- `dx_type`
    - will be accepting all forms of diagnosis as accurate and not just histopathology
- `localization`
    - in reality, these would vary by `dx` and `sex`
    
- Target value is `dx`
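
Random oversampling, one of the techniques mentioned above, can be sketched as follows. This is a minimal illustration on toy labels, not the approach used later in this notebook. `oversample` is a hypothetical helper, and in practice it should be applied to the training split only so that duplicated images do not end up in validation.

```python
import random
from collections import Counter, defaultdict


def oversample(items, labels, seed=42):
    """Duplicate minority-class items at random until every class matches the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    target = max(len(group) for group in by_label.values())
    out_items, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra copies with replacement to reach the target size
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_items.extend(group + extra)
        out_labels.extend([label] * target)
    return out_items, out_labels


# Toy example: 6 "nv" images against 2 "df" images
toy_items = [f"img_{i}" for i in range(8)]
toy_labels = ["nv"] * 6 + ["df"] * 2
balanced_items, balanced_labels = oversample(toy_items, toy_labels)
print(Counter(balanced_labels))
```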

<img src="dx_samples.png" alt="Diagnostic category samples" title="Diagnostic category samples"/>
- From the images we can see that there are visual differences within categories
    - this is because they are categorised by biology
    - for example, `vasc` includes angiomas, angiokeratomas, pyogenic granulomas and haemorrhage, and therefore presentations may vary
    - in `bkl`, a subset (lichen-planus like keratoses) has features similar to `mel`, and therefore ML may misclassify these cases

**Next Steps**
- load and process data into `ham_df`
- use `hmnist` file and/or learn how images can be processed into data
- train/test split
- start with a simple model
- read and find out what models are available and test these
- explore errors

## Final Project, Part 3: *Technical Notebook*

Develop a prototype model or process to successfully resolve the business problem you've chosen. Document your work in a technical notebook that can be shared with your peers.

Build upon your earlier analysis, following the performance metrics you established as part of your problem's evaluation criteria. Demonstrate your approach logically, including all relevant code and data. Polish your notebook for peer audiences by cleanly formatting sections, headers, and descriptions in markdown. Include comments in any code.

1. A detailed Jupyter Notebook with a summary of your analysis, approach, and evaluation metrics.
2. Clearly formatted structure with section headings and markdown descriptions.
3. Comments explaining your code.

## (i) Organising the data

In [1]:
from glob import glob
import os 
import shutil
import pandas as pd

ham_df = pd.read_csv("./data/HAM10000_metadata.csv")

In [2]:
# Credit to Kevin Mader for this code
base_skin_dir = os.path.join(".", "data")

imageid_path_dict = {os.path.splitext(os.path.basename(x))[0]: x
                     for x in glob(os.path.join(base_skin_dir, "*", "*.jpg"))}

ham_df['path'] = ham_df['image_id'].map(imageid_path_dict.get)

In [3]:
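# Placing label (diagnosis) into the name of image file,
# i.e. <prefix>_<number>.jpg becomes <dx>_<number>.jpg
for dx, filename in zip(ham_df['dx'], ham_df['path']):
    directory, basename = os.path.split(filename)   # split folder from file name
    number = basename.split("_", 1)[1]              # keep "<number>.jpg"
    os.rename(filename, os.path.join(directory, dx + "_" + number))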

- Renamed the 10,015 image files so that the diagnosis appears in each file name. This is helpful later for labelling, and it also helps me see what lesions in each diagnostic category look like
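
The renaming scheme can be checked on path strings without touching the disk. The sketch below assumes the original files are named like `ISIC_0027419.jpg` (an illustrative name) and uses a hypothetical helper `labelled_name`. The regular expression mirrors the one used later in this notebook to read the label back out of the file name.

```python
import posixpath
import re


def labelled_name(path, dx):
    """Return the path with the file-name prefix replaced by the diagnosis label."""
    directory, basename = posixpath.split(path)
    number = basename.split("_", 1)[1]          # e.g. "0027419.jpg"
    return posixpath.join(directory, f"{dx}_{number}")


# Label = everything in the file name before "_<digits>.jpg"
label_pattern = re.compile(r"([^/]+)_\d+\.jpg$")

old_path = "./data/HAM10000_images_part_1/ISIC_0027419.jpg"
new_path = labelled_name(old_path, "bkl")
label = label_pattern.search(new_path).group(1)
print(new_path, "->", label)
```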

In [6]:
# Path where the image files are currently stored
path_1 = "./data/HAM10000_images_part_1"
path_2 = "./data/HAM10000_images_part_2"

# Create a new directory to store all images together (no error if it already exists)
os.makedirs("./data/images/", exist_ok=True)

# Function to copy files to this image directory
def copy_files(path):
    src_files = os.listdir(path)
    for filename in src_files:
        full_filename = os.path.join(path, filename)
        if (os.path.isfile(full_filename)):
            shutil.copy(full_filename, "./data/images/"+filename)
   
copy_files(path_1)
copy_files(path_2)

In [None]:
# Create a new directory to store some images
path_img = "./data/images"
os.mkdir("./data/test_images/")

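# Function to move the first `repeat` images of a diagnosis to the test directory
def move_files(dx, repeat):
    src_files = sorted(os.listdir(path_img))        # sorted so the selection is reproducible
    # Match on the "<dx>_" prefix so one label cannot match inside another file name
    matches = [filename for filename in src_files if filename.startswith(dx + "_")]
    for filename in matches[:repeat]:
        full_filename = os.path.join(path_img, filename)
        shutil.move(full_filename, "./data/test_images/" + filename)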

# Set aside these images for testing at the end
move_files("akiec", 3)
move_files("bcc", 2)
move_files("bkl", 2)
move_files("df", 2)
move_files("mel", 2)
move_files("nv", 2)
move_files("vasc", 2)

- I have placed all images in one folder for train & validation
- From train & validation folder, I have removed 15 images to test at the very end

## (ii) Deep learning

In [None]:
from fastai.vision import *
from fastai.metrics import error_rate
import re
import numpy as np
import matplotlib.pyplot as plt

# !curl https://course.fast.ai/setup/colab | bash
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
path_img = "./data/images/"
fnames = get_image_files(path_img)                  # grabs array of image files

np.random.seed(42)                                  # sets same validation set

# These are the data augmentations we'll be using (regularisation)
tfms = get_transforms(flip_vert=True,               # 8 symmetric dihedral rots/flips
                      max_rotate=20,                # rotate degrees
                      max_zoom=1.1,                 # zoom
                      max_lighting=0.3,             # lighting and contrast
                      max_warp=0.3,                 # magnitude of warp
                      p_affine=.9,                  # prob of applying affine and warp
                      p_lighting=.5)                # prob of applying lighting

In [None]:
src = ImageItemList.from_folder(path_img).random_split_by_pct(0.2, seed=42)

regex = r'([^/]+)_\d+\.jpg$'                        # label = file-name text before "_<digits>.jpg"
pat = re.compile(regex)

# We want to visualise these augmentations to ensure they look reasonable
get_data = (src.label_from_re(regex)
           .transform(tfms, size=128)
           .databunch(bs=64).normalize(imagenet_stats))

def _plot(i,j,ax):
    x, y = get_data.train_ds[211]
    x.show(ax, y=y)

plot_multi(_plot, 3, 3, figsize=(9,9))
plt.savefig("akiec_augmentations.png")

<img src="akiec_augmentations.png" alt="Actinic keratosis augmentations" title="Actinic keratosis augmentations"/>
- I am visualising the augmentations to check that the parameters I passed produce plausible images
- Data augmentation is an important step in computer vision because it effectively gives the model many more training examples
- NB: fast.ai applies `padding_mode="reflection"` by default to fill the otherwise black borders generated by data transformations
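
The `flip_vert=True` setting is commented above as giving 8 symmetric dihedral rotations and flips: 4 rotations by 90 degrees, each with or without a mirror flip. The sketch below enumerates those 8 variants on a tiny nested list standing in for an image. It illustrates the idea only and is not fastai's implementation; the helper names are hypothetical.

```python
def rot90(m):
    """Rotate a 2-D nested list by 90 degrees clockwise."""
    return [list(row) for row in zip(*m[::-1])]


def flip_h(m):
    """Mirror a 2-D nested list left to right."""
    return [row[::-1] for row in m]


def dihedral_variants(m):
    """Return the 8 rotations/flips of a square 2-D nested list."""
    variants = []
    current = m
    for _ in range(4):
        variants.append(current)
        variants.append(flip_h(current))
        current = rot90(current)
    return variants


image = [[1, 2],
         [3, 4]]
variants = dihedral_variants(image)
for v in variants:
    print(v)
```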

In [None]:
# Data bunch contains train and validation
data = ImageDataBunch.from_name_re(path_img,        # path with images
                                   fnames,          # filenames
                                   pat,             # regex
                                   valid_pct=0.2,   # randomly sets aside 20% for validation
                                   ds_tfms=tfms,    # data augmentation
                                   size=128,        # image size
                                   bs=64)           # value depends on gpu
data.normalize(imagenet_stats)                      # pixel values of RGB to have mean 0 and std 1

# print("With our split, there are {0} train images and {1} validation images."
#       .format(len(data.train_ds), len(data.valid_ds)))
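
`normalize(imagenet_stats)` rescales each colour channel using the mean and standard deviation of ImageNet, the dataset behind the pretrained weights, so the result is only approximately zero-mean and unit-variance on these skin images. A minimal sketch of the per-channel arithmetic follows, using the commonly quoted ImageNet statistics and a hypothetical helper `normalize_pixel`.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)   # per-channel mean (R, G, B), pixels scaled to [0, 1]
IMAGENET_STD = (0.229, 0.224, 0.225)    # per-channel standard deviation


def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


print(normalize_pixel(IMAGENET_MEAN))       # a pixel equal to the mean
print(normalize_pixel((1.0, 1.0, 1.0)))     # a pure white pixel
```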

In [7]:
print(data.classes)                                 # there are 7 classes (labels)
data.show_batch(rows=4, figsize=(9,9))              # show some contents

In [None]:
# Create convolutional neural net with pretrained weights from imagenet
learn = create_cnn(data, 
                   models.resnet50,                 # architecture
                   metrics=accuracy)                # against val set
learn.model

- We will be using a convolutional neural network with the resnet50 (50 layers) architecture. ResNet won the ImageNet competition in 2015
- Our goal is to improve accuracy
- We will be looking at the learning rate plot to select a good learning rate
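
One common reading of the learning rate plot is to pick a value where the loss is falling fastest, well before it starts to climb. The sketch below does this on made-up (learning rate, loss) pairs. `steepest_descent_lr` is a hypothetical helper, and the numbers are not taken from this model.

```python
import math


def steepest_descent_lr(lrs, losses):
    """Return the geometric midpoint of the interval where loss drops fastest per decade of lr."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        decades = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / decades
        if slope < best_slope:
            best_slope = slope
            best_lr = math.sqrt(lrs[i - 1] * lrs[i])
    return best_lr


# Made-up learning rate finder output
toy_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
toy_losses = [2.0, 1.9, 1.5, 0.9, 1.2, 3.0]
suggested_lr = steepest_descent_lr(toy_lrs, toy_losses)
print(f"Suggested learning rate: {suggested_lr:.1e}")
```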

In [None]:
learn.lr_find()                                     # learning rate finder
learn.recorder.plot()

In [None]:
lr = 1e-2/2                                         # using largest downward slope
learn.fit_one_cycle(15, slice(lr))

In [None]:
learn.save("stage-1-resnet50-128")                  # save weights
learn.unfreeze()                                    # unfreeze so we can train the whole model

In [None]:
learn.lr_find()
learn.recorder.plot()                               # before it goes up

In [None]:
learn.fit_one_cycle(5, slice(1e-4/2, lr/10))

- Choosing an appropriate number of epochs and learning rate helps keep the model from over- or under-fitting
- If we are happy with the train loss, valid loss and accuracy then we can save the model (in case we want to reload it)
- Next we will repeat with progressive resizing of images, which helps reduce overfitting and improve generalisation
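
Passing `slice(start, stop)` to `fit_one_cycle`, as in the unfrozen stage above, trains the earliest layers at the lower rate and the final layers at the higher rate, with the groups in between spaced multiplicatively. The sketch below reproduces that spacing for illustration only. `spread_learning_rates` is a hypothetical helper, and the assumption of 3 layer groups should be checked against `learn.layer_groups`.

```python
def spread_learning_rates(start, stop, n_groups):
    """Return n_groups learning rates from start to stop with a constant ratio between neighbours."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]


lr = 1e-2 / 2
# Same endpoints as slice(1e-4/2, lr/10); 3 groups is an assumption
rates = spread_learning_rates(1e-4 / 2, lr / 10, n_groups=3)
print([f"{r:.2e}" for r in rates])
```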

In [None]:
learn.save("stage-2-resnet50-128")

In [None]:
# Repeat above with larger image size - progressive resizing
data = ImageDataBunch.from_name_re(path_img, 
                                   fnames, 
                                   pat, 
                                   valid_pct=0.2, 
                                   ds_tfms=tfms, 
                                   size=256,        # using larger image size
                                   bs=64)
data.normalize(imagenet_stats)

In [None]:
learn.data = data
learn.freeze()                                      # freeze model

data.train_ds[0][0].shape

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
lr = 1e-2/2
learn.fit_one_cycle(10, slice(lr))

In [None]:
learn.save("stage-1-resnet50-256")
learn.unfreeze()

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(7, slice(1e-5, lr/10))

In [None]:
learn.save("stage-2-resnet50-256")

- Our model has a 93% accuracy against the validation set
- We are happy with this result as it has surpassed our initial goal and lets us reject the null hypothesis
- We could continue progressive resizing, but we would need to decrease the batch size. The further gain in accuracy would likely be small and training would take a lot longer

In [None]:
learn.recorder.plot_losses()

In [None]:
interp = ClassificationInterpretation.from_learner(learn)

# Predict top losses
interp.plot_top_losses(9, figsize=(15,11))
plt.savefig("pred_act_loss_prob.png")

In [None]:
# Prediction and actual that was wrong most often
display(interp.most_confused(min_val=2))

# Confusion matrix
interp.plot_confusion_matrix(figsize=(9,9))
plt.savefig("confusion_matrix.png")

<img src="pred_act_loss_prob.png" alt="Prediction Actual Loss Probability" title="Prediction Actual Loss Probability"/>
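- These are the validation images with the highest loss, i.e. predictions that were confidently wrong
- We can inspect them to see whether the model is genuinely mistaken or whether the image is ambiguous or possibly mislabelled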
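
Assuming the model is trained with cross-entropy loss, the loss on one image is the negative log of the probability the model gave to the true class, so the highest losses belong to images where the model put very little probability on the actual diagnosis. A minimal sketch with made-up predictions follows; `top_losses` and the tuples below are hypothetical.

```python
import math


def top_losses(predictions, k):
    """Rank predictions by cross-entropy loss, highest first.

    Each prediction is (image_name, actual_label, probability_given_to_actual_label).
    """
    scored = [(name, actual, -math.log(p_actual)) for name, actual, p_actual in predictions]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:k]


# Made-up predictions, not output from this model
toy_predictions = [
    ("img_a", "mel", 0.95),
    ("img_b", "bkl", 0.02),
    ("img_c", "nv", 0.60),
    ("img_d", "mel", 0.10),
]
ranked = top_losses(toy_predictions, k=2)
for name, actual, loss in ranked:
    print(f"{name}: actual={actual}, loss={loss:.2f}")
```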

<img src="confusion_matrix.png" alt="Confusion Matrix" title="Confusion Matrix"/>
- Most of the predictions fall on the diagonal (correctly predicted); however, we can also see where predictions go wrong by looking at the darker off-diagonal boxes (e.g. between mel and bkl)
- We are happy with the confusion matrix, as the incorrect predictions are fairly evenly distributed on the whole
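
Because the classes are imbalanced, overall accuracy can hide weak performance on rare diagnoses, so it may help to read per-class recall off the confusion matrix. The sketch below uses a made-up 3-class matrix where rows are actual classes and columns are predictions. `per_class_recall` and `overall_accuracy` are hypothetical helpers.

```python
def per_class_recall(matrix, classes):
    """Recall per class: correct predictions on the diagonal divided by the row total."""
    return {
        name: matrix[i][i] / sum(matrix[i])
        for i, name in enumerate(classes)
    }


def overall_accuracy(matrix):
    """Sum of the diagonal divided by the total number of predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


# Made-up counts, not results from this model
toy_classes = ["bkl", "mel", "nv"]
toy_matrix = [
    [15, 4, 1],    # actual bkl
    [5, 12, 3],    # actual mel
    [2, 3, 155],   # actual nv
]
recall = per_class_recall(toy_matrix, toy_classes)
print(recall)
print(f"Overall accuracy: {overall_accuracy(toy_matrix):.1%}")
```

On these toy numbers the overall accuracy is high while recall on `mel` is much lower, which is the pattern worth checking for on the real matrix.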

In [None]:
# Saves all information for inference eg. transforms, model weights
learn.export()

### Thank you to the fast.ai lessons taught by Jeremy Howard. His lessons on deep learning can be found at https://course.fast.ai/