In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In this kernel I use the fastai library, which is built on top of PyTorch. The notebook follows a workflow similar to lessons 1 and 2 of the <a href='https://course.fast.ai/'>Practical Deep Learning for Coders, v3</a> course taught by Jeremy Howard. I used to use Keras for deep learning but found that fastai gave me better results with fewer lines of code.

I also used some code from Gunther's <a href='https://www.kaggle.com/guntherthepenguin/fastai-v1-densenet169'> [fastai v1] Densenet169</a> kernel, so you should check that one out as well. The code I took from that kernel is marked with a comment linking to it.

Be sure to turn on the GPU and select "Internet connected" in the kernel settings at the bottom right. Without an internet connection, loading the pretrained PyTorch model fails with an error.

In [None]:
#Imports
from fastai.vision import *
from sklearn.metrics import roc_auc_score
np.random.seed(11)
%matplotlib inline

In [None]:
bs = 1024  #Sets batch size
path = Path('../input/') #Sets path to data
sz = 96 #Pixel size of images (96, 96)

In [None]:
#Read in the data
df = pd.read_csv(path/'train_labels.csv')
df.head()

In [None]:
df.shape

In [None]:
#Check the fraction of images with label=1
df.iloc[:,1].mean()

About 40% of our data is positive.

In [None]:
#Take a sample
df = df.sample(n=25000, random_state=11)

There is no need to train on all 200,000+ samples for a baseline. Taking a sample lets us train and test ideas more quickly.

In [None]:
#Data transformation function
tfms = get_transforms(flip_vert=True,max_warp=0)

We can set flip_vert=True because the histopathology images are viewed top-down, so it's okay to flip an image in any direction. The default value of flip_vert in fastai is False. We also set max_warp=0 because none of the images in the data are warped.
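
As a rough illustration of what "any direction" covers, the sketch below enumerates the eight orientations (four rotations, each optionally mirrored) that flips and 90-degree rotations can produce. It uses plain NumPy on a toy array rather than fastai's own transform code, and `dihedral_variants` is a hypothetical helper name.

```python
import numpy as np

def dihedral_variants(img):
    """Return the 8 flip/rotation variants of a (H, W) or (H, W, C) array."""
    variants = []
    for k in range(4):                              # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(img, k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))   # mirrored copy of each rotation
    return variants

toy = np.arange(9).reshape(3, 3)                    # toy stand-in for an image
views = dihedral_variants(toy)
unique_views = {v.tobytes() for v in views}         # count distinct orientations
print(len(views), len(unique_views))
```

For an image with no preferred "up", every one of these orientations is an equally valid training example.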

In [None]:
#Read the data in from the dataframe
data = (ImageDataBunch.from_df(path,           #Path the data is stored in
                               df=df,          #Dataframe with image filenames and labels
                               folder='train', #Folder containing train+validation images
                               suffix='.tif',  #Suffix of the image names
                               ds_tfms=tfms,   #Apply the defined transformations
                               size=sz,        #Set the image pixel size (96, 96)
                               bs=bs,          #Set the batch size (1024)
                               test='test',    #Set the folder containing test images
                               num_workers=0)  #Need to assign num_workers when running in kernel
                               .normalize())   #Normalize the data

Called without arguments, .normalize() estimates the per-channel mean and standard deviation from a batch of the training data, then applies the same mean subtraction and standard deviation division to the train, validation, and test data. By default, 20% of the training data is held out as the validation set.
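
The idea of reusing training statistics on every split can be sketched with NumPy. This is only an illustration on random toy arrays, not fastai's implementation, and the `normalize` helper below is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(11)
# Toy stand-ins for image batches shaped (N, C, H, W); not the real data.
train = rng.uniform(0.0, 1.0, size=(32, 3, 8, 8))
valid = rng.uniform(0.0, 1.0, size=(8, 3, 8, 8))

# Per-channel statistics come from the training images only.
mean = train.mean(axis=(0, 2, 3), keepdims=True)
std = train.std(axis=(0, 2, 3), keepdims=True)

def normalize(batch, mean, std):
    """Subtract the training mean and divide by the training std, per channel."""
    return (batch - mean) / std

train_n = normalize(train, mean, std)
valid_n = normalize(valid, mean, std)   # same stats reused, not recomputed

print(train_n.mean(axis=(0, 2, 3)).round(6))
print(valid_n.mean(axis=(0, 2, 3)).round(3))
```

The validation means come out close to zero but not exactly zero, which is expected: the statistics were never fitted to that split.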

In [None]:
#Display a few images
data.show_batch(rows=3, figsize=(7,5))

Images are displayed with their label.

In [None]:
#https://www.kaggle.com/guntherthepenguin/fastai-v1-densenet169
#Define auc metric to track while training
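def auc_score(y_pred,y_true,tens=True):
    #Score the sigmoid of the positive-class (column 1) outputs against the true labels
    score=roc_auc_score(y_true,torch.sigmoid(y_pred)[:,1])
    #Return a tensor when tens=True (the form used as a fastai metric), else a plain float
    return tensor(score) if tens else score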

In [None]:
#Create our cnn
learn = create_cnn(data, models.resnet34, metrics=auc_score,path='.')
#Need to assign model path when running in kernel

This creates a learner with a ResNet34 architecture and pretrained weights. It uses the DataBunch we created earlier and tracks the AUC score at the end of each epoch.
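
AUC depends only on how the predictions rank positives against negatives, so applying a sigmoid to the raw outputs does not change the score. The sketch below shows this with a small pairwise implementation in NumPy on made-up toy values; `pairwise_auc` is a hypothetical helper, not scikit-learn's `roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]              # every positive vs every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

labels = np.array([0, 0, 1, 1, 0, 1])               # toy labels
logits = np.array([-2.0, 0.5, 0.1, 1.5, -0.3, 2.2]) # toy raw model outputs
probs = 1.0 / (1.0 + np.exp(-logits))               # sigmoid keeps the ordering

print(pairwise_auc(labels, logits), pairwise_auc(labels, probs))
```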

Fastai's fit_one_cycle implements the one-cycle policy, which builds on the cyclical learning rates from Smith's paper <a href='https://arxiv.org/abs/1506.01186'>
Cyclical Learning Rates for Training Neural Networks</a>. As described in the paper, 'this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations.'
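
A minimal sketch of a single-cycle schedule may make this concrete: the rate warms up to a peak and then anneals to a much smaller value. The function name `one_cycle_lr` and the parameter values (`pct_start`, `div_factor`, `final_div`) are illustrative assumptions, not a copy of fastai's implementation or its defaults.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.3, div_factor=25.0, final_div=1e4):
    """Learning rate at `step`: cosine warm-up to lr_max, then cosine decay."""
    warm_steps = max(1, int(total_steps * pct_start))
    if step < warm_steps:
        pct = step / warm_steps
        start, end = lr_max / div_factor, lr_max    # warm-up phase
    else:
        pct = (step - warm_steps) / max(1, total_steps - warm_steps - 1)
        start, end = lr_max, lr_max / final_div     # annealing phase
    # Cosine interpolation from start (pct=0) to end (pct=1)
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

schedule = [one_cycle_lr(s, 100, 3e-3) for s in range(100)]
peak = max(schedule)
print(schedule[0], peak, schedule[-1])
```

The low starting rate lets the randomly initialized head settle before the largest updates arrive, and the long decay lets training finish with small, stable steps.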


In [None]:
lr = 3e-3 #Assign learning rate

#This trains our added layers for three epochs with ResNet layers frozen
learn.fit_one_cycle(3, lr)

In the lectures, learn.lr_find() was used to find a learning rate before any training. The kernel was running very slowly, so I removed that step here. Learning rates between 1e-3 and 3e-3 tend to work well for the first round of training. Next I plot the learning rate to get a better understanding of what fit_one_cycle() is doing.

In [None]:
#Plot the learning rates to see the change over iterations
learn.recorder.plot_lr()

Now with the added dense layers trained with reasonable parameters, we can unfreeze the earlier layers of the ResNet model and fine tune them for our task. 

We will also set a range of learning rates. We want the earlier layers of our model to have lower learning rates than the later layers. This is because the earlier layers find general features like a line/edge which we do not want to disrupt. 
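
To see what a range of learning rates looks like per layer group, here is a small sketch. It assumes the rates are spaced geometrically between the two ends of the range and that the model has three layer groups; both are assumptions to check against the fastai documentation, and `spread_lrs` is a hypothetical helper.

```python
import numpy as np

def spread_lrs(lr_low, lr_high, n_groups):
    """Geometrically spaced learning rates, lowest for the earliest layer group."""
    return np.geomspace(lr_low, lr_high, num=n_groups)

# Assumed: 3 layer groups and the slice(5e-4, lr/5) range used later, with lr = 3e-3.
lrs = spread_lrs(5e-4, 3e-3 / 5, n_groups=3)
for group, group_lr in enumerate(lrs):
    print(f"layer group {group}: lr = {group_lr:.2e}")
```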

In [None]:
#Unfreezes all layers
learn.unfreeze()

First we run the learning rate finder. You can also do this before the first round of training, but I chose not to in this notebook because it was taking too long in the kernel.

In [None]:
#Find optimal learning rate
learn.lr_find()

In [None]:
#Observe learning rate increase through every iteration
learn.recorder.plot_lr()

In [None]:
#Observe the loss as we increase learning rate
learn.recorder.plot()

A general rule of thumb from one of the lectures for setting the learning rates after unfreezing the layers:

1) Set the first learning rate to a value where the plot above still shows the steepest drop in loss. Sometimes that is not very clear, so at least make sure it is 10x smaller than the learning rate at which the loss starts to increase.

2) Set the second learning rate (used for the later dense layers) to 1/5 or 1/10 of the learning rate we initially used to train the model.
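
The two rules can be sketched numerically. The loss curve below is synthetic and made up purely for illustration (it is not output from this kernel), and the variable names are hypothetical; in practice you would read these points off the learning rate finder plot.

```python
import numpy as np

# Synthetic loss-vs-learning-rate curve, made up purely for illustration.
lrs = np.logspace(-6, 0, 200)
x = np.log10(lrs)
losses = 0.7 - 0.4 / (1 + np.exp(-3 * (x + 4))) + np.exp(4 * (x + 1))

slopes = np.gradient(losses, x)             # d(loss) / d(log10 lr)
lr_steepest = lrs[np.argmin(slopes)]        # rule 1: where the loss falls fastest
lr_at_min = lrs[np.argmin(losses)]          # point after which the loss increases
lr_safe_cap = lr_at_min / 10                # rule 1 fallback: stay 10x below it

initial_lr = 3e-3                           # the rate used for the frozen stage
upper_options = (initial_lr / 5, initial_lr / 10)   # rule 2

print(f"steepest drop near {lr_steepest:.1e}, cap at {lr_safe_cap:.1e}")
print(f"upper bound options: {upper_options}")
```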

In [None]:
learn.fit_one_cycle(5, max_lr=slice(5e-4, lr/5))

Training on less than 10% of the data (20,000 of the 200,000+ images once the 20% validation split is held out), we are able to quickly build a model with a high AUC. The validation score is still improving, so the model can be trained for more epochs.

There are plenty of steps to take from here:

1. Plot the images the model misclassified with high confidence.

2. Try different data augmentations with the get_transforms() function.

3. Try different learning rates and train for more epochs.

4. Use more data.

5. Try a more complex model.

6. Predict with test time augmentation.

The submission code comes from <a href='https://www.kaggle.com/guntherthepenguin/fastai-v1-densenet169'> [fastai v1] Densenet169</a>. That kernel also shows how to implement test time augmentation, which can increase your LB score.
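
The idea behind test time augmentation can be sketched independently of fastai: predict on several augmented copies of each image and average the results. In the sketch below, `predict_proba` is a hypothetical stand-in for a trained model and the images are random toy arrays, so only the averaging logic is meaningful.

```python
import numpy as np

def predict_proba(batch):
    """Hypothetical stand-in for a trained model: maps (N, H, W) images to P(label=1)."""
    weights = np.linspace(-1.0, 1.0, batch.shape[1] * batch.shape[2])
    logits = batch.reshape(len(batch), -1) @ weights
    return 1.0 / (1.0 + np.exp(-logits))

def tta_predict(batch):
    """Average predictions over the 8 flip/rotation variants of each square image."""
    preds = []
    for k in range(4):
        rotated = np.rot90(batch, k, axes=(1, 2))
        preds.append(predict_proba(rotated))
        preds.append(predict_proba(np.flip(rotated, axis=2)))
    return np.mean(preds, axis=0)

rng = np.random.default_rng(11)
images = rng.uniform(0.0, 1.0, size=(4, 6, 6))  # toy square images, not real data
plain = predict_proba(images)
averaged = tta_predict(images)
print(plain.round(3), averaged.round(3))
```

Because the average runs over all eight orientations, the result no longer depends on how a given test image happens to be oriented.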

In [None]:
preds_test,y_test=learn.get_preds(ds_type=DatasetType.Test)

In [None]:
sub=pd.read_csv(path/'sample_submission.csv').set_index('id')
sub.head()

In [None]:
#https://www.kaggle.com/guntherthepenguin/fastai-v1-densenet169
clean_fname=np.vectorize(lambda fname: str(fname).split('/')[-1].split('.')[0])
fname_cleaned=clean_fname(data.test_ds.items)
fname_cleaned=fname_cleaned.astype(str)

In [None]:
sub.loc[fname_cleaned,'label']=to_np(preds_test[:,1])
sub.to_csv('submission.csv')
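
The filename cleaning step above turns each test file path into the bare image id used as the submission index. A roughly equivalent sketch with `pathlib` is shown below; the file names are hypothetical examples, and `image_id` is an illustrative helper rather than part of the original kernel.

```python
from pathlib import Path

def image_id(file_path):
    """Strip the directory and the final extension, leaving the bare image id."""
    return Path(str(file_path)).stem

# Hypothetical file names in the same folder/suffix layout as the test set.
examples = ["../input/test/abc123.tif", "../input/test/def456.tif"]
ids = [image_id(p) for p in examples]
print(ids)
```

One difference to keep in mind: `stem` removes only the last extension, while splitting on the first `.` also drops anything after an earlier dot in the name. For ids without dots the two approaches agree.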