In this notebook, I'm going to train a neural network to classify whether the person whose face appears in an image is a man or a woman. The dataset can be found in the ```data``` folder, or **you can create your own dataset** using the notebook *Data Processing*.

## Some preparations

First, some options and imports. We will use the ```fastai``` library, which is built on top of ```PyTorch```.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.imports import *

In [3]:
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

To make the neural net train more smoothly and quickly, all images are resized to the side length given by the ```sz``` variable. The ```PATH``` variable is the folder in which the dataset is stored.

In [4]:
PATH = "data/"
sz=224

## Model creation

To save time and computational resources, I am using a pretrained neural net from the [```resnet models```](https://github.com/KaimingHe/deep-residual-networks) family (here, ```resnet34```) as a starting point.

In [5]:
arch=resnet34

epoch      trn_loss   val_loss   accuracy                
    0      0.802056   25.399355  0.243056  



Also, to increase the accuracy, I'm going to apply ```data augmentation``` to the data. Since people's faces are not normally seen upside down, we will apply only the side-on transformations (such as horizontal reflection) together with a slight zoom-in of up to 1.1x.

In [None]:
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
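To make "horizontal reflection" concrete, here is a minimal NumPy sketch of a random left-right flip. It is not how ```fastai``` implements its transforms; the function name ```random_horizontal_flip``` and the tiny fake image are assumptions made purely for illustration.

```python
import numpy as np

def random_horizontal_flip(img, rng, p=0.5):
    """Mirror an H x W x C image left-to-right with probability p."""
    if rng.random() < p:
        return img[:, ::-1]  # reverse the width axis
    return img

# Tiny fake 2 x 3 "image" with one channel, just to show the effect.
img = np.arange(6).reshape(2, 3, 1)
always_flipped = random_horizontal_flip(img, np.random.default_rng(0), p=1.0)
print(img[..., 0])
print(always_flipped[..., 0])
```

A flipped face is still a plausible face, which is why this kind of augmentation is safe here, whereas a vertical flip would produce images the model is unlikely to see in practice.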

Now, let's create a ```learner``` (i.e. a neural network object) using the model structure ```arch``` and the dataset ```data``` we've just defined. The ```precompute``` option is set to **True**, meaning that the activations of the *penultimate layer* (i.e. the layer before the last one) are precomputed. Since in the next step we only need to train the last layer, this precomputation saves time; but we will have to turn it off later for the data augmentation to take effect.

In [None]:
learner = ConvLearner.pretrained(arch, data, precompute=True)
learn = learner  # later cells refer to the same object as `learn`

## Training process

As we use a model pretrained on 1000 classes, we need to replace its last layer with a new one whose output matches our 2 classes. By default, ```fastai``` handles this replacement for us automatically. Let's train the learner for one epoch to see where we are.

In [11]:
learner.fit(1e-2, 1)

epoch      trn_loss   val_loss   accuracy                
    0      0.692106   0.367836   0.877604  



[0.36783618, 0.8776041567325592]

So, currently, without data augmentation and by training only the last layer for 1 epoch, we have an accuracy of about 88%. Now, let's turn off the precomputation, so that the data can be augmented.

In [12]:
learn.precompute=False

We will now train the network with data generated by the data augmentation process. We also apply *stochastic gradient descent with restarts (SGDR)*, which means the learning rate is gradually decreased and then periodically reset to a higher value, so that the learner is able to "*jump out*" of a minimum if that minimum is too sensitive to changes (i.e. small changes in the weights cause big changes in the loss). That way, the learner is more likely to settle in a more stable minimum.

We will continue the process with a ```learning rate``` of ```0.01```, which is gradually decreased over a cycle before being pushed back to the start value (0.01).
The ```cycle_len``` parameter sets the length of the cycle over which the ```learning rate``` is decreased, here 1 epoch. The number ```3``` is the number of cycles, so since each cycle lasts 1 epoch, we will have 3 epochs.

In [13]:
learn.fit(1e-2, 3, cycle_len=1)

epoch      trn_loss   val_loss   accuracy                
    0      0.369047   0.294012   0.903646  
    1      0.357488   0.240393   0.895833                
    2      0.350418   0.216709   0.919271                



[0.21670878, 0.9192708432674408]
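The shape of the learning rate under SGDR can be sketched in a few lines of plain Python. This is only an illustration, assuming a cosine-shaped decay within each cycle; the function name ```sgdr_lr``` and the value of ```iters_per_cycle``` are made up for the example and are not taken from the ```fastai``` source.

```python
import math

def sgdr_lr(iteration, iters_per_cycle, lr_max=1e-2):
    """Cosine-annealed learning rate that restarts every `iters_per_cycle` steps."""
    pos = iteration % iters_per_cycle  # position inside the current cycle
    return lr_max / 2 * (1 + math.cos(math.pi * pos / iters_per_cycle))

iters_per_cycle = 10  # hypothetical number of mini-batches in one epoch
n_cycles = 3          # matches the 3 cycles of 1 epoch each used above
schedule = [sgdr_lr(i, iters_per_cycle) for i in range(n_cycles * iters_per_cycle)]

for i in (0, 5, 9, 10):
    print(i, round(schedule[i], 5))
```

The rate starts at 0.01, decays towards zero by the end of the cycle, and jumps back to 0.01 at the first iteration of the next cycle.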

Now, we want to train not only the last layer, but the whole neural net. By default, all the layers except the last one are ```frozen```, i.e. not trainable, so we need to ```unfreeze``` them first.

In [17]:
learn.unfreeze()

In a pretrained model, the earlier layers are more general and the later layers are more task-specific. We therefore want to apply smaller learning rates to the earlier layers and bigger learning rates to the later layers. We can do that by passing a 1-D numpy array of learning rates.

In [18]:
lr=np.array([1e-4,1e-3,1e-2])
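One way to read this array is as one learning rate per group of layers, ordered from the earliest layers to the newly added ones. The sketch below only illustrates that pairing; the group names are hypothetical labels, and the assumption that the network is split into exactly three groups is mine, not something shown in this notebook.

```python
import numpy as np

lr = np.array([1e-4, 1e-3, 1e-2])

# Hypothetical labels for the layer groups, earliest layers first.
layer_groups = ["early conv layers", "later conv layers", "new final layers"]

lr_by_group = dict(zip(layer_groups, lr.tolist()))
for name, rate in lr_by_group.items():
    print(f"{name}: {rate:g}")
```

Each group is updated 10 times more gently than the group after it, so the general-purpose early features change the least.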

*SGDR* is great at helping us avoid sensitive minima, but once we have found a minimum that is not sensitive, restarting the learning rate every epoch makes it harder to settle into the "sweet spot". Hence, over time, it makes sense to lengthen the cycles, so that the weights have more chances to be fine-tuned. We set the ```cycle_mult``` parameter to 2, which means the cycle length is doubled after every cycle. That way, if we start with cycle length = 1 and have 3 cycles, we should end up with ```1 + 1*2 + 1*2*2 = 7``` epochs.

In [19]:
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss   accuracy                
    0      0.476566   0.263725   0.905382  
    1      0.416922   0.225691   0.923611                
    2      0.373433   0.238075   0.931424                
    3      0.342238   0.211669   0.953125                
    4      0.315419   0.2079     0.953125                
    5      0.277854   0.206081   0.945312                
    6      0.251074   0.208454   0.945312                



[0.20845367, 0.9453125]
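The epoch count above follows directly from the cycle settings. Here is a small sketch of the arithmetic; the helper name ```total_epochs``` is made up for illustration and is not part of ```fastai```.

```python
def total_epochs(n_cycles, cycle_len=1, cycle_mult=1):
    """Return the length of each cycle (in epochs) and their sum."""
    lengths = [cycle_len * cycle_mult ** i for i in range(n_cycles)]
    return lengths, sum(lengths)

lengths, total = total_epochs(3, cycle_len=1, cycle_mult=2)
print(lengths, total)  # [1, 2, 4] 7

# Without cycle_mult, 3 cycles of length 1 give the 3 epochs of the previous run.
print(total_epochs(3, cycle_len=1))  # ([1, 1, 1], 3)
```

Together with the single epoch of the first run, this gives 1 + 3 + 7 = 11 epochs in total.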

In [20]:
learn.sched.plot_lr()

In [21]:
learn.save('224_all')

In [22]:
learn.load('224_all')

## Fine-tuning results

Lastly, at test time, since our images are not always square, there is a chance that the prediction is made on an area that does not contain the main object, which makes it wrong. We can partly avoid that issue by making predictions on different areas of the same picture and taking the average of these predictions as the final prediction. In practice, we use the original image and 4 randomly augmented versions of it to make the prediction. This is called ```test-time augmentation``` or ```TTA```.

In [23]:
log_preds,y = learn.TTA()
probs = np.mean(np.exp(log_preds),0)
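The averaging step can be tried on made-up numbers. The sketch below assumes the log-predictions have shape (number of versions, number of images, number of classes); the ```fake_``` names and the random values are purely illustrative and are not outputs of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_versions, n_images, n_classes = 5, 4, 2  # 1 original + 4 augmented versions

# Fake log-probabilities: a log-softmax over random scores.
scores = rng.normal(size=(n_versions, n_images, n_classes))
fake_log_preds = scores - np.log(np.exp(scores).sum(axis=2, keepdims=True))

# Convert back to probabilities, then average over the versions (axis 0).
fake_probs = np.mean(np.exp(fake_log_preds), axis=0)
fake_preds = np.argmax(fake_probs, axis=1)

print(fake_probs.shape)        # (4, 2)
print(fake_probs.sum(axis=1))  # each row still sums to 1
```

Averaging probabilities rather than log-probabilities keeps each row a valid probability distribution, which is why ```np.exp``` is applied before ```np.mean```.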

                                             

Our final result is 95% accuracy. Not bad for a dataset of almost 500 images and a training process of 11 epochs.

In [24]:
accuracy_np(probs, y)

0.95

## Result analysis

We will analyze the results using a confusion matrix.

In [25]:
preds = np.argmax(probs, axis=1)
probs = probs[:,1]

In [26]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y, preds)

In [27]:
plot_confusion_matrix(cm, data.classes)

[[47  3]
 [ 2 48]]
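The confusion matrix also lets us read off per-class numbers. Here is a short sketch using the values printed above, assuming the ```sklearn``` convention that rows are true labels and columns are predicted labels; the variable names are chosen for this example only.

```python
import numpy as np

cm_values = np.array([[47, 3],
                      [2, 48]])

accuracy = np.trace(cm_values) / cm_values.sum()          # correct / all
recall = np.diag(cm_values) / cm_values.sum(axis=1)       # per true class
precision = np.diag(cm_values) / cm_values.sum(axis=0)    # per predicted class

print(accuracy)
print(recall)
print(precision)
```

The overall accuracy of 95 correct out of 100 validation images matches the 0.95 reported earlier, and the two classes are misclassified at similar rates (3 and 2 errors out of 50 each).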
