## Final Project, Part 1: *Proposal*

1. Draft a well-formed problem statement relevant to a business problem affecting your team, division, or organization.
2. Include the following elements:
   - Hypothesis/assumptions
   - Goals and success metrics
   - Risks or limitations
3. Identify at least one relevant internal dataset and confirm that you have (or can get) the right access permissions.

- Original dataset from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T
- Kaggle dataset and competition from https://www.kaggle.com/kmader/skin-cancer-mnist-ham10000/home
- Scientific paper from https://arxiv.org/abs/1803.10417


**null hypothesis:** There is no difference in dermatoscopic images of pigmented skin lesions across skin cancer diagnostic categories.

**alternative hypothesis:** There is a difference in dermatoscopic images of pigmented skin lesions across skin cancer diagnostic categories.

**goals and success:** Correctly classify dermatoscopic images of pigmented skin lesions into a skin cancer diagnostic category at a rate higher than chance. There are 7 categories, so uniform guessing gives about 14% accuracy (1/7); the success metric is therefore an accuracy of 15% or greater.
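
The chance threshold can be made explicit. The sketch below computes two baselines: uniform guessing over the classes, and always predicting the most common class, which is the stricter bar if the classes turn out to be imbalanced. The label counts are hypothetical placeholders, not the real HAM10000 counts, and `chance_baselines` is an illustrative helper.

```python
from collections import Counter

def chance_baselines(labels):
    """Return (uniform_chance, majority_class_rate) for a list of labels."""
    counts = Counter(labels)
    uniform = 1 / len(counts)                      # guess each class equally often
    majority = max(counts.values()) / len(labels)  # always guess the largest class
    return uniform, majority

# Hypothetical label list with 7 categories (counts are made up for illustration)
labels = (["nv"] * 60 + ["mel"] * 12 + ["bkl"] * 10 + ["bcc"] * 8
          + ["akiec"] * 5 + ["vasc"] * 3 + ["df"] * 2)
uniform, majority = chance_baselines(labels)
print(f"uniform chance: {uniform:.3f}, majority baseline: {majority:.3f}")
```

With these made-up counts the majority baseline is far above 1/7, so it may be worth reporting model accuracy against both numbers.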

**risks or limitations:** Requires enough sample data in each diagnostic category. Also requires enough similarity between images within a diagnostic category, as well as clear distinctions between diagnostic categories.

## Final Project, Part 2: *Brief*

Exploratory data analysis is a crucial step in any data workflow. Create a Jupyter Notebook that explores your data mathematically and visually. Explore features, apply descriptive statistics, look at distributions, and determine how to handle sampling or any missing values.

1. Create an exploratory data analysis notebook.
2. Perform statistical analysis, along with any visualizations.
3. Determine how to handle sampling or missing values.
4. Clearly identify shortcomings, assumptions, and next steps.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

ham_df = pd.read_csv("./data/HAM10000_metadata.csv")

In [None]:
display(ham_df.shape)
display(ham_df.head())
display(ham_df.describe(include="all"))
display(ham_df.dtypes)

- The HAM metadata has 10,015 rows and 7 columns.
- There are 7,470 lesions; however, some were photographed at different magnifications and angles, resulting in 10,015 images according to the paper. This serves as natural data augmentation.
- The 7 unique `dx` values suggest we are doing multi-class classification rather than binary classification.
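
Because one lesion can appear in several images, a random image-level split can place views of the same lesion in both the training and validation sets, which may inflate validation accuracy. A minimal sketch of a lesion-level split follows. It assumes the metadata has a `lesion_id` column alongside `image_id`; the rows are invented for illustration and `split_by_lesion` is a hypothetical helper.

```python
import random

def split_by_lesion(rows, valid_pct=0.2, seed=42):
    """Split (lesion_id, image_id) rows so each lesion lands in exactly one set."""
    lesion_ids = sorted({lesion for lesion, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(lesion_ids)
    n_valid = max(1, round(len(lesion_ids) * valid_pct))
    valid_lesions = set(lesion_ids[:n_valid])
    train = [img for lesion, img in rows if lesion not in valid_lesions]
    valid = [img for lesion, img in rows if lesion in valid_lesions]
    return train, valid

# Hypothetical metadata rows: (lesion_id, image_id)
rows = [("L1", "img_a"), ("L1", "img_b"), ("L2", "img_c"), ("L3", "img_d"),
        ("L3", "img_e"), ("L4", "img_f"), ("L5", "img_g")]
train, valid = split_by_lesion(rows)
print(len(train), "training images,", len(valid), "validation images")
```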

In [None]:
print("Null values: \n", ham_df.isnull().sum())

In [None]:
age_null = ham_df[ham_df['age'].isnull()]
age_null.head()

- There are NaN and unknown values for patient information. I propose we neither replace nor drop them, as I am considering not using these variables as features. They serve only to show the distribution of the data.
- The features would be the images themselves.
- This is despite the distribution of diagnoses varying with patient information.
    - e.g. melanoma on areas not exposed to sun (incidence higher on the leg for women and the back for men), as opposed to akiec (actinic keratoses on the face and Bowen's disease in other areas)
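
To see how much patient information is affected, empty cells and the literal string `unknown` can be counted together. A small sketch using only the standard library; the CSV text is invented for illustration, and treating empty cells and `unknown` as the two missing markers is an assumption about the file.

```python
import csv
import io
from collections import Counter

def count_missing(rows, missing_tokens=("", "unknown")):
    """Count cells per column that are empty or marked as unknown."""
    counts = Counter()
    for row in rows:
        for col, val in row.items():
            if val.strip().lower() in missing_tokens:
                counts[col] += 1
    return counts

# Hypothetical excerpt in the style of the metadata file
sample = """image_id,dx,age,sex,localization
img_a,nv,45.0,male,back
img_b,mel,,unknown,unknown
img_c,bkl,60.0,female,unknown
"""
rows = list(csv.DictReader(io.StringIO(sample)))
missing = count_missing(rows)
print(dict(missing))
```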

In [None]:
print("Percentage distribution of target diagnostic values: \n", round(ham_df['dx'].value_counts(normalize=True)*100, 2))

variables = ['dx', 'dx_type', 'age', 'sex', 'localization']
for var in variables:
    ham_df[var].value_counts().plot(kind="bar")
    plt.title(var)
    plt.show();

ham_df['age'].plot(kind="hist", bins=15)
plt.title("age")
plt.show();

- `dx`
    - mostly nv (Melanocytic nevi) as these are benign/non-cancerous
    - much less sample data for the others
        - need to ensure there are enough sample data to include smaller categories
        - when modelling need to consider techniques such as oversampling, stratified folds
- `dx_type`
    - will be accepting all forms of diagnosis as accurate, not just histopathology
- `localization`
    - in reality, these would vary by `dx` and `sex`
    
- Target value is `dx`
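
The stratified splitting mentioned under `dx` can be sketched in plain Python. This is an illustration under stated assumptions rather than the method used later in the notebook: the labels are made up, and `stratified_split` is a hypothetical helper that keeps at least one validation example per class.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_pct=0.2, seed=42):
    """Return (train_idx, valid_idx) keeping each class's share roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_valid = max(1, round(len(idxs) * valid_pct))  # at least one per class
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical imbalanced labels
labels = ["nv"] * 20 + ["mel"] * 5 + ["vasc"] * 2
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), "train,", len(valid_idx), "validation")
```

Even the smallest class contributes a validation example here, which a purely random 20% split does not guarantee.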

<img src="dx_samples.png" alt="Diagnostic category samples" title="Diagnostic category samples"/>
- From the images we can see that there are visual differences within categories
    - this is because they are categorised by biology
    - for example, `vasc` includes angiomas, angiokeratomas, pyogenic granulomas and haemorrhage, and therefore presentations may vary
    - in `bkl`, a subset (lichen-planus like keratoses) has features similar to `mel`, and therefore ML may misclassify these cases

**Next Steps**
- load and process data into `ham_df`
- use `hmnist` file and/or learn how images can be processed into data
- test train split
- start with a simple model
- read and find out what models are available and test these
- explore errors

## Final Project, Part 3: *Technical Notebook*

Develop a prototype model or process to successfully resolve the business problem you've chosen. Document your work in a technical notebook that can be shared with your peers.

Build upon your earlier analysis, following the performance metrics you established as part of your problem's evaluation criteria. Demonstrate your approach logically, including all relevant code and data. Polish your notebook for peer audiences by cleanly formatting sections, headers, and descriptions in markdown. Include comments in any code.

1. A detailed Jupyter Notebook with a summary of your analysis, approach, and evaluation metrics.
2. Clearly formatted structure with section headings and markdown descriptions.
3. Comments explaining your code.

## Organising the data

In [None]:
from glob import glob
import os 
import shutil
import pandas as pd

ham_df = pd.read_csv("./data/HAM10000_metadata.csv")

In [None]:
# Credit to Kevin Mader for code
base_skin_dir = os.path.join(".", "data")

imageid_path_dict = {os.path.splitext(os.path.basename(x))[0]: x
                     for x in glob(os.path.join(base_skin_dir, '*', '*.jpg'))}

ham_df['path'] = ham_df['image_id'].map(imageid_path_dict.get)

In [None]:
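# Place the label (diagnosis) into the image file name,
# e.g. ISIC_0027419.jpg -> nv_0027419.jpg
for dx, filename in zip(ham_df['dx'], ham_df['path']):
    folder, base = os.path.split(filename)  # avoids hard-coded slice positions
    suffix = base[base.index('_'):]  # keeps "_<digits>.jpg"
    os.rename(filename, os.path.join(folder, dx + suffix))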
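
The renaming rule can be checked without touching the disk. A small sketch, assuming file names of the form `ISIC_<digits>.jpg`; `labelled_name` is a hypothetical helper, not part of the original notebook, and the path is only an example.

```python
import os

def labelled_name(path, dx):
    """Return the path with the prefix before the first '_' replaced by dx."""
    folder, base = os.path.split(path)
    suffix = base[base.index("_"):]  # e.g. "_0027419.jpg"
    return os.path.join(folder, dx + suffix)

new_path = labelled_name("./data/HAM10000_images_part_1/ISIC_0027419.jpg", "nv")
print(os.path.basename(new_path))
```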

In [None]:
# Path where the image files are currently stored
path_1 = "./data/HAM10000_images_part_1"
path_2 = "./data/HAM10000_images_part_2"

# Create a new directory to store all images (no error if it already exists)
os.makedirs("./data/images", exist_ok=True)

# Function to copy files into this image directory
def copy_files(path):
    src_files = os.listdir(path)
    for filename in src_files:
        full_filename = os.path.join(path, filename)
        if os.path.isfile(full_filename):
            shutil.copy(full_filename, os.path.join("./data/images", filename))

copy_files(path_1)
copy_files(path_2)

## Deep learning

In [None]:
!curl https://course.fast.ai/setup/colab | bash

%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from fastai.vision import *
from fastai.metrics import error_rate
import re
import numpy as np
# from PIL import Image

In [None]:
path_img = "./data/images/"
fnames = get_image_files(path_img) # grabs array of image files
fnames[:5]

In [None]:
np.random.seed(42) # fixes the random validation split so it is reproducible
pat = re.compile(r'/([^/]+)_\d+\.jpg$') # regex that captures the label from the file name
tfms = get_transforms(flip_vert=True, # with the default horizontal flip, allows the 8 dihedral rotations/flips
                      max_lighting=0.1,
                      max_zoom=1.05,
                      max_warp=0.1) # perspective warping

# ImageDataBunch contains train, validation (and optional test) sets
data = ImageDataBunch.from_name_re(path_img, # path with images
                                   fnames, # filenames
                                   pat, # regex
                                   valid_pct=0.2, # randomly sets aside 20% for validation
                                   ds_tfms=tfms, # data augmentation
                                   size=224, # image size
                                   bs=64) # value depends on gpu
data.normalize(imagenet_stats) # normalise RGB channels with the ImageNet mean and std
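
The label regex can be sanity-checked on a couple of paths before building the data bunch. In this sketch the dot before `jpg` is escaped so that it only matches a literal period; the file paths are hypothetical examples of the renamed layout.

```python
import re

strict = re.compile(r'/([^/]+)_\d+\.jpg$')  # escaped dot: literal "."
loose = re.compile(r'/([^/]+)_\d+.jpg$')    # unescaped dot: any character

# Hypothetical file paths in the renamed layout
paths = ["./data/images/nv_0027419.jpg", "./data/images/akiec_0029417.jpg"]
labels = [strict.search(p).group(1) for p in paths]
print(labels)

# A malformed name slips through the loose pattern but not the strict one
odd = "./data/images/nv_123xjpg"
print(bool(loose.search(odd)), bool(strict.search(odd)))
```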

In [None]:
data.show_batch(rows=4, figsize=(9,9)) # show some contents

In [None]:
data.classes # there are 7 classes (labels)

In [None]:
len(data.train_ds), len(data.valid_ds)

In [None]:
# learn = ConvLearner(data, models.resnet34, metrics=error_rate)
# Create convolutional neural net with pretrained weights from imagenet
learn = create_cnn(data, 
                   models.resnet34, # architecture - use models.resnet50 for larger
                   metrics=error_rate) # error rate (1 - accuracy) against validation set
learn.model

In [None]:
# learn.lr_find()

In [None]:
# learn.recorder.plot() # find the biggest downward slope

In [None]:
# lr = ?
# learn.fit_one_cycle(4,slice(lr))

In [None]:
learn.fit_one_cycle(4, 3e-3) #  goes through dataset 4 times
# There is 78-79% accuracy as is
# Total time: 03:20
# epoch	train_loss	valid_loss	error_rate
# 1	1.169474	0.753518	0.260110
# 2	0.739373	0.644836	0.231653
# 3	0.627650	0.591759	0.216675
# 4	0.576652	0.576550	0.212681

In [None]:
# Notes on reading the training output:
# - If the learning rate is too high, valid loss becomes very high (it is usually < 1).
# - If the learning rate is too low, the error rate improves very slowly;
#   learn.recorder.plot_losses() will show the losses going down very slowly,
#   and training loss will be higher than valid loss.
# - Training loss higher than valid loss means we haven't fitted enough,
#   so increase the number of epochs or the learning rate.
# - If there are too few epochs, training loss is much higher than valid loss.
# - If there are too many epochs, the model overfits: the error rate improves, then gets worse.
# - In short, training loss should end up lower than valid loss.
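
These rules of thumb can be applied to the four-epoch run recorded above. The sketch below is only a heuristic reading of the metrics; `diagnose` is a hypothetical helper and not part of fastai.

```python
def diagnose(train_losses, valid_losses, error_rates):
    """Rule-of-thumb reading of per-epoch training metrics."""
    notes = []
    if train_losses[-1] > valid_losses[-1]:
        notes.append("train loss > valid loss: likely under-fitted; "
                     "try more epochs or a higher learning rate")
    best_epoch = error_rates.index(min(error_rates))
    if best_epoch < len(error_rates) - 1:
        notes.append("error rate rose after its minimum: possible over-fitting")
    return notes

# Metrics copied from the four-epoch run
train = [1.169474, 0.739373, 0.627650, 0.576652]
valid = [0.753518, 0.644836, 0.591759, 0.576550]
err = [0.260110, 0.231653, 0.216675, 0.212681]
notes = diagnose(train, valid, err)
for note in notes:
    print(note)
```

Here the final training loss is only marginally above the validation loss and the error rate is still falling, which is consistent with training for a little longer.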

In [None]:
learn.save("stage-1") # save weights

In [None]:
interp = ClassificationInterpretation.from_learner(learn) # interpret data and model

In [None]:
interp.plot_top_losses(9, figsize=(15,11))

In [None]:
interp.plot_confusion_matrix(figsize=(12,12)) 
# diags are correct. dark areas outside diagonals are incorrect

In [None]:
interp.most_confused(min_val=2) # pred and actual that got wrong most often
# should have predicted mel but predicted nv, happened 122 times
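
The 122 mel-as-nv errors are easier to judge as a per-class recall. A sketch, assuming rows of the confusion matrix are actual classes and columns are predictions; every count other than 122 is invented for illustration, and `per_class_recall` is a hypothetical helper.

```python
def per_class_recall(confusion, classes):
    """confusion[i][j] = number of images of actual class i predicted as class j."""
    recall = {}
    for i, name in enumerate(classes):
        total = sum(confusion[i])
        recall[name] = confusion[i][i] / total if total else float("nan")
    return recall

classes = ["mel", "nv"]
confusion = [[100, 122],   # actual mel: 100 correct (made up), 122 predicted as nv
             [30, 1300]]   # actual nv: counts made up
recall = per_class_recall(confusion, classes)
for name, value in recall.items():
    print(f"{name}: {value:.2f}")
```

A high overall accuracy can hide a low recall on a minority class such as `mel`, which matters most in a diagnostic setting.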

In [None]:
# Fine tune
learn.unfreeze() # unfreeze rest of model so we can train the whole model
# learn.fit_one_cycle(1)
# Accuracy improved by 2% only

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot() # pick a point before the loss rises and go ~10x lower for the first slice value; the second is the stage-1 lr divided by 5 or 10

In [None]:
learn.fit_one_cycle(2, # epochs; consider increasing
                    max_lr=slice(3e-6, # choose the value where the slope is steepest
                                 3e-4)) # train early layers at 3e-6, last layers at 3e-4
# alternative range noted: slice(3e-5, 3e-4)
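
As I understand fastai v1, a `slice(start, stop)` learning rate is spread across the model's layer groups with multiplicatively even spacing, so earlier layers train more slowly. The sketch below reproduces that idea in plain Python; the exact library behaviour and the number of layer groups (three here) are assumptions, and `spread_lrs` is a hypothetical helper.

```python
def spread_lrs(start, stop, n_groups):
    """Geometrically space n_groups learning rates from start to stop."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]

lrs = spread_lrs(3e-6, 3e-4, 3)  # assuming 3 layer groups
print([f"{lr:.0e}" for lr in lrs])
```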

In [None]:
learn.save("stage-2")

In [None]:
learn.load("stage-2")

In [None]:
interp = ClassificationInterpretation.from_learner(learn)

In [None]:
interp.plot_confusion_matrix(figsize=(12,12)) 

In [None]:
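# Build an empty DataBunch that only carries the classes and transforms, for inference
data2 = ImageDataBunch.single_from_classes(path_img,
                                           data.classes, # class names from the training DataBunch
                                           ds_tfms=tfms,
                                           size=224).normalize(imagenet_stats)
learn = create_cnn(data2, models.resnet34)
learn.load("stage-2")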

In [None]:
img = open_image(fnames[0]) # load one image to classify (any image path works here)
pred_class, pred_idx, outputs = learn.predict(img)
pred_class

In [None]:
learn.lr_find() # find optimal learning rate ie. how quickly is it updating parameters of model

In [None]:
learn.recorder.plot() # select value before it starts to increase

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, 
                    max_lr=slice(1e-6,1e-4) #train early layers at 1e-6, last layers at 1e-4
                   )
# About 80% accuracy