## Image segmentation with CamVid

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.vision import *
from fastai.callbacks.hooks import *
from fastai.utils.mem import *

In [3]:
path = untar_data(URLs.CAMVID)
path.ls()

[WindowsPath('C:/Users/kting/.fastai/data/camvid/codes.txt'),
 WindowsPath('C:/Users/kting/.fastai/data/camvid/images'),
 WindowsPath('C:/Users/kting/.fastai/data/camvid/labels'),
 WindowsPath('C:/Users/kting/.fastai/data/camvid/valid.txt')]

In [4]:
codes = np.loadtxt(path/'codes.txt', dtype=str); codes, len(codes)

(array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child', 'Column_Pole',
        'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving', 'ParkingBlock',
        'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone',
        'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc', 'Void', 'Wall'], dtype='<U17'),
 32)

In [5]:
path_lbl = path/'labels' # These are masks. They colour code the actual images.
path_img = path/'images'

In [6]:
# Map an image file to its mask: same file name with "_P" appended to the stem, located in path_lbl.
get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'
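
To make the naming rule concrete, here is a small standalone sketch of the same mapping using only `pathlib`. The directory names and the file name `0016E5_08370.png` are placeholders chosen for illustration, not paths read from the dataset.

```python
from pathlib import PurePosixPath

# Hypothetical stand-ins for path_lbl and one entry of fnames.
labels_dir = PurePosixPath("camvid/labels")
image_file = PurePosixPath("camvid/images/0016E5_08370.png")

def label_path_for(image_path):
    # Same rule as get_y_fn: keep the stem, append "_P", keep the suffix.
    return labels_dir / f"{image_path.stem}_P{image_path.suffix}"

mask_file = label_path_for(image_file)
print(mask_file)  # camvid/labels/0016E5_08370_P.png
```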

In [7]:
fnames = get_image_files(path_img)
img_f = fnames[0] # Pick the first image arbitrarily
mask = open_mask(get_y_fn(img_f)) 
src_size = np.array(mask.shape[1:])
src_size

array([720, 960])

## Understanding What's Going On with Individual Examples
### You can skip this section if you know what's up.

In [None]:
fnames = get_image_files(path_img)
fnames[:3]

In [None]:
# These are masks
lbl_names = get_image_files(path_lbl)
lbl_names[:3]

In [None]:
img_f = fnames[0] # Pick the first image arbitrarily
img = open_image(img_f)
img.show(figsize=(5,5))

In [None]:
# Load the mask for the first image. This also checks that get_y_fn resolves to the right file.

mask = open_mask(get_y_fn(img_f)) 
print(mask.data)
mask.show(figsize=(5,5), alpha=1)

TY - Since the above is a mask, we use `open_mask` rather than `open_image`, which is for normal images. Had we used `open_image` on a mask, we would get a very dark image. You can see this for yourself by opening the mask files in your file explorer.

TY - 720x960 is the size of the mask. Every value is an integer, one per pixel. The number is not a greyscale value, though: it ranges from 0 to 31, because `codes.txt` lists 32 distinct categories.

Each integer represents a particular component in the image (like 'Building' or 'Wall'). fastai displays a mask opened with `open_mask()` using a distinct colour per integer (a colour code) for easy viewing. An ordinary image viewer, which does not know the file is a mask, treats the integers as greyscale values, and values between 0 and 31 all look nearly black.
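
As a rough illustration of how the integers relate to `codes.txt`, the sketch below decodes a made-up 2x3 mask by hand. The tiny mask and the `codes_subset` lookup are invented for the example; only the index-to-name pairs (4 is 'Building', 17 is 'Road', 21 is 'Sky', 26 is 'Tree') come from the `codes` array printed earlier.

```python
# Index-to-name pairs taken from the codes array above (subset only).
codes_subset = {4: "Building", 17: "Road", 21: "Sky", 26: "Tree"}

# A made-up 2x3 mask: each integer is the class index of one pixel.
tiny_mask = [
    [21, 21, 26],
    [4, 17, 17],
]

# Replace every class index with its readable name.
decoded = [[codes_subset[idx] for idx in row] for row in tiny_mask]
for row in decoded:
    print(row)
```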

## Datasets

In [8]:
free = gpu_mem_get_free_no_cache()
# the max size of bs depends on the available GPU RAM
if free > 8200: bs=8
else:           bs=4
print(f"using bs={bs}, have {free}MB of GPU RAM free")

using bs=4, have 5991MB of GPU RAM free


Every pixel gets its own classification, so we keep the batch size small: that is a lot of pixels for the GPU to handle at once.
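
A quick back-of-the-envelope count shows the scale. It assumes the half-resolution size (`src_size//2`, i.e. 360x480) used in this section, `bs=4`, and the 32 classes from `codes.txt`. It counts only the final per-pixel scores, not the intermediate activations that account for most of the real memory use.

```python
height, width = 720 // 2, 960 // 2   # src_size // 2
batch_size = 4
n_classes = 32

# One classification per pixel, for every image in the batch.
pixels_per_batch = batch_size * height * width
# One score per class for each of those pixels.
scores_per_batch = pixels_per_batch * n_classes

print(pixels_per_batch)   # 691200
print(scores_per_batch)   # 22118400
```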

The people who created this dataset gave us a list of file names (`valid.txt`) that are meant to form the validation set, taken from non-contiguous parts of the video. So here is how to split validation and training sets using a file of file names. I don't split randomly in this case because the pictures are frames from videos. A random split would put two neighbouring frames on opposite sides, one in the validation set and one in the training set. That would be far too easy, and it would be cheating.
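
The idea behind splitting from a list of names can be sketched without fastai. The file names and the contents of `valid_names` below are invented; in the real dataset the names come from `valid.txt`.

```python
# Hypothetical frame names from two separate video sequences.
all_files = ["seq1_001.png", "seq1_002.png", "seq2_001.png", "seq2_002.png"]

# Stand-in for the names listed in valid.txt: one whole sequence is held out.
valid_names = {"seq2_001.png", "seq2_002.png"}

train_files = [f for f in all_files if f not in valid_names]
valid_files = [f for f in all_files if f in valid_names]

print(train_files)
print(valid_files)
```

Because whole sequences are held out, no validation frame has a near-identical neighbour in the training set.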

In [9]:
src = (SegmentationItemList.from_folder(path_img)
       .split_by_fname_file('../valid.txt')
       .label_from_func(get_y_fn, classes=codes))

In [10]:
data = (src.transform(get_transforms(), 
                      size=src_size//2,
                      tfm_y=True)  # apply the same transform (e.g. a flip) to the target mask as to the image
        .databunch(bs=bs)
        .normalize(imagenet_stats))

In [None]:
data.show_batch(2, figsize=(10,7))

In [None]:
data.show_batch(2, figsize=(10,7), ds_type=DatasetType.Valid)

# Model

## Accuracy Function
Accuracy for pixel-wise segmentation is basically `correctly classified pixels / total number of pixels`.

If you imagine each pixel as a separate object you are classifying, it is exactly the same accuracy as in ordinary classification. So you could just pass in `accuracy` as your metric, but in this case we don't.

The reason for creating a new metric called `acc_camvid` is that the CamVid paper says void pixels should be removed when reporting accuracy. Like all metrics, it takes the actual output of the neural net (the input to the metric) and the target (the labels we are trying to predict).


In [11]:
name2id = {v:k for k,v in enumerate(codes)}
void_code = name2id['Void']

def acc_camvid(input, target):
    target = target.squeeze(1)
    mask = target != void_code
    return (input.argmax(dim=1)[mask]==target[mask]).float().mean()

The function `acc_camvid` above can look mysterious, so let me try to elucidate.

`input` and `target` are both tensors (arrays). They hold one batch at a time: `acc_camvid` is called on every batch of the validation dataset. `target` has shape `bs x 1 x 720 x 960` (assuming we did not resize), and `squeeze(1)` drops the singleton channel axis to give `bs x 720 x 960`.

`input` is the model's prediction for the same batch. Its shape is `bs x 32 x 720 x 960`, where 32 is the number of categories. Each pixel carries 32 scores, one per category, so we use `argmax(dim=1)` to pick the index of the highest score. That index is exactly the integer code used in the mask to say which category a pixel belongs to.

`argmax` therefore collapses the 32-wide axis into a single integer per pixel, and the result has the same shape as the squeezed `target`. A pixel-wise comparison gives a boolean tensor that is mostly 1s when the prediction closely matches the target and mostly 0s when it is poor. Taking the mean, i.e. the number of 1s divided by the number of pixels compared, gives an accuracy between 0 and 1.

As for the void handling: `mask` is True wherever the target is not 'Void', and indexing with `[mask]` keeps only those pixels on both sides of the comparison, so void pixels never enter the mean. Note that the mask comes from the target, not the prediction: a prediction that is entirely void simply scores 0 on every non-void pixel. If a batch had no non-void target pixels at all, the mean of the empty selection would be NaN.
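
To see the void handling step by step, here is a plain-Python restatement of the same logic on a single made-up 2x2 image with three classes, where class 2 plays the role of 'Void'. It is not the tensor code that fastai runs; the names `toy_acc`, `argmax` and `VOID` and all the numbers are invented for the illustration.

```python
VOID = 2  # stand-in for void_code in this toy example

# scores[row][col] holds one score per class for that pixel.
scores = [
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]],
]
target = [
    [0, 1],
    [0, VOID],
]

def argmax(values):
    # Index of the largest score, i.e. the predicted class.
    return max(range(len(values)), key=values.__getitem__)

def toy_acc(scores, target, void=VOID):
    hits = []
    for score_row, target_row in zip(scores, target):
        for pixel_scores, true_class in zip(score_row, target_row):
            if true_class == void:
                continue  # void target pixels are left out of the average
            hits.append(argmax(pixel_scores) == true_class)
    # With no non-void pixels there is nothing to average.
    return sum(hits) / len(hits) if hits else float("nan")

print(toy_acc(scores, target))
```

Three pixels are compared (the void one is skipped) and two of them match, so the toy accuracy is 2/3 rather than 2/4.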

In [12]:
metrics=acc_camvid
# metrics=accuracy

In [13]:
wd=1e-2 # Weight decay

In [None]:
learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd)

In [None]:
lr_find(learn)
learn.recorder.plot()

In [None]:
lr=3e-3

In [None]:
learn.fit_one_cycle(10, slice(lr), pct_start=0.9)

# pct_start is the fraction of all iterations during which the LR is increasing.
# Recall that one epoch contains many iterations (unless the batch size equals the whole dataset, in which case there is one iteration per epoch).
# With the default of 0.3, the LR rises for the first 30% of iterations and falls over the remaining 70%. Here, with 0.9, it rises for the first 90%.

TY - `pct_start` seems weird, but I talk about it below.

In [None]:
learn.save('stage-1')

In [None]:
learn.load('stage-1');

In [None]:
learn.show_results(rows=3, figsize=(8,9))

In [None]:
learn.unfreeze()

In [None]:
lrs = slice(lr/400,lr/4)
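
Passing a `slice(start, stop)` gives different learning rates to different parts of the model: the earliest layer group trains at `start`, the last at `stop`, and the groups in between sit on a multiplicative scale. The sketch below assumes geometric spacing and three layer groups. The helper name `spread_lrs` and the group count are assumptions for illustration, not fastai code.

```python
def spread_lrs(start, stop, n_groups):
    # Geometric spacing: each group is a constant multiple of the previous one.
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

lr = 3e-3
group_lrs = spread_lrs(lr / 400, lr / 4, n_groups=3)
print(group_lrs)
```

Under these assumptions the three groups train at roughly 7.5e-6, 7.5e-5 and 7.5e-4, so the pretrained early layers move far more slowly than the newly added ones.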

In [None]:
learn.fit_one_cycle(12, lrs, pct_start=0.8)

In [None]:
learn.save('stage-2');

# Go big
## Further training on the same images, but bigger. Progressive resizing.
### Note that we got 0.936091.

You may have to restart your kernel and come back to this stage if you run out of memory, and may also need to decrease `bs`.

We are stepping up our training by using bigger images: `size = src_size` (720\*960), where previously we set `size = src_size//2`.

We have to destroy our `learn` and make a new one, because two things change: (1) the dimensions of our images, and (2) the batch size, which becomes smaller since full-size images are heavier on the GPU.

Destroying our `learn` does not mean we are starting from scratch. Once we have initialised the new learner, we can reload the parameters we saved earlier. This works even though the input size has changed because the network is built from convolutional layers, whose weights are small kernels that do not depend on the image dimensions.
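
One way to see why parameters trained at half resolution can be reloaded at full resolution is that a convolution kernel has a fixed size regardless of the image it slides over. The plain-Python sketch below applies the same made-up 3x3 kernel to a 4x4 and an 8x8 input; the function `conv2d_valid` and the toy data are invented for the illustration and are far simpler than the real U-Net layers.

```python
def conv2d_valid(image, kernel):
    # Slide the kernel over every position where it fits entirely.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

kernel = [[1, 0, -1]] * 3          # one fixed 3x3 kernel: 9 weights
small = [[r + c for c in range(4)] for r in range(4)]
large = [[r + c for c in range(8)] for r in range(8)]

small_out = conv2d_valid(small, kernel)
large_out = conv2d_valid(large, kernel)
print(len(small_out), len(small_out[0]))   # 2 2
print(len(large_out), len(large_out[0]))   # 6 6
```

The same nine weights handle both inputs; only the size of the output changes.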

In [None]:
learn.destroy()

free = gpu_mem_get_free_no_cache()
# the max size of bs depends on the available GPU RAM
if free > 8200: bs=3
else:           bs=1
print(f"using bs={bs}, have {free}MB of GPU RAM free")

In [None]:
data = (src.transform(get_transforms(), 
                      size = src_size, 
                      tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))

In [None]:
learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd)

In [None]:
learn.load('stage-2');

In [None]:
lr_find(learn)
learn.recorder.plot()

In [None]:
lr=1e-3

In [None]:
learn.fit_one_cycle(10, slice(lr), pct_start=0.8)

In [None]:
learn.save('stage-1-big')

In [None]:
learn.load('stage-1-big');

In [None]:
learn.unfreeze()

In [None]:
lr=1e-3 # From Stage-1
lrs = slice(1e-6,lr/10)

In [None]:
learn.fit_one_cycle(10, lrs)

# Note that we got 0.936091.

In [None]:
learn.save('stage-2-big')

In [None]:
learn.load('stage-2-big');

In [None]:
learn.show_results(rows=3, figsize=(10,10))

In [None]:
learn.recorder.plot_losses()

In [None]:
learn.recorder.plot_lr()

# Let's try FP16

## I don't think it works.

In [14]:
free = gpu_mem_get_free_no_cache()
# the max size of bs depends on the available GPU RAM
if free > 8200: bs=3
else:           bs=1
print(f"using bs={bs}, have {free}MB of GPU RAM free")

using bs=1, have 5991MB of GPU RAM free


In [15]:
src = (SegmentationItemList.from_folder(path_img)
       .split_by_fname_file('../valid.txt')
       .label_from_func(get_y_fn, classes=codes))

data = (src.transform(get_transforms(), 
                      size = src_size, 
                      tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))


In [16]:
learn.destroy()

NameError: name 'learn' is not defined

In [17]:
learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd).to_fp16()

In [18]:
learn.load('stage-2');

In [19]:
lr_find(learn)
learn.recorder.plot()

epoch,train_loss,valid_loss,acc_camvid,time


LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.


RuntimeError: CUDA out of memory. Tried to allocate 2.05 GiB (GPU 0; 6.00 GiB total capacity; 309.02 MiB already allocated; 2.05 GiB free; 2.30 GiB reserved in total by PyTorch)

## pct_start

https://github.com/hiromis/notes/blob/master/Lesson3.md

Getting the right learning rate is important. When you get the right learning rate, it zooms into the best spot very quickly. Otherwise, it will take very long, or it might diverge instead.

Now as you get closer to the final spot, something interesting happens which is that you really want your learning rate to decrease because you're getting close to the right spot.

Think of your loss function's surface not as a smooth curve (or slope): it actually tends to look bumpy. So you want a learning rate that is high enough to jump over the bumps, but once you get close to the best answer, you don't want to be just jumping backwards and forwards between bumps without finding the minimum. You want your learning rate to go down so that as you get closer, you take smaller and smaller steps. That's why we want our learning rate to go down at the end.

If you start off with a really small learning rate, it tends to plod down and get stuck in the first small dips it meets. But if you gradually increase the learning rate, it jumps out of them and moves down the surface. As the learning rate keeps rising, the loss may even start going up again for a while, and at the highest rates it bumps backwards and forwards. Eventually the learning rate comes down again, and the model tends to find its way to the flat areas.

So it turns out that gradually increasing the learning rate is a really good way of helping the model to explore the whole function surface, and try and find areas where both the loss is low and also it's not bumpy. Because if it was bumpy, it would get kicked out again. This allows us to train at really high learning rates, so it tends to mean that we solve our problem much more quickly, and we tend to end up with much more generalizable solutions.
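
The rise-then-fall shape described above can be sketched as a simple schedule function. This is a simplified illustration, not fastai's scheduler: the function names, the cosine interpolation, the starting divisor of 25 and the final divisor are all assumptions made for the example.

```python
import math

def cos_interp(start, end, frac):
    # Cosine interpolation from start (frac=0) to end (frac=1).
    return end + (start - end) * (1 + math.cos(math.pi * frac)) / 2

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.3,
                 start_div=25.0, final_div=1e4):
    # Warm-up phase: the first pct_start fraction of steps raises the rate.
    warm_steps = round(total_steps * pct_start)
    if step < warm_steps:
        return cos_interp(lr_max / start_div, lr_max, step / warm_steps)
    # Annealing phase: the remaining steps bring the rate down again.
    frac = (step - warm_steps) / max(1, total_steps - warm_steps - 1)
    return cos_interp(lr_max, lr_max / final_div, frac)

schedule = [one_cycle_lr(s, 100, lr_max=3e-3, pct_start=0.9) for s in range(100)]
peak_step = max(range(100), key=schedule.__getitem__)
print(peak_step)
```

With `pct_start=0.9`, as in the first `fit_one_cycle` call, the rate in this sketch climbs for the first 90 of 100 steps and only comes down over the last 10.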