# Project: Dense Prediction: Monocular Depth Estimation and Semantic Segmentation

<img src='https://i.imgur.com/I2rSgxd.png' width=200> <img src='https://i.imgur.com/1oP2EIg.png' width=200>

# Part 1
## Introduction

- In this part of the project, you are tasked with creating a model that **estimates depth from a single input image**. The input is an RGB image and the output is a single-channel dense depth map in which each pixel is the estimated distance from the 'camera sensor' to an object in the scene, in real-world units (e.g. meters). Depth estimation from a single image is a fundamental vision task with many useful applications, including scene understanding and reconstruction.

- You are to develop a convolutional neural network (CNN) that formulates the problem as a regression of the depth map from a single RGB image. 

- In this section, we provide all the source code needed for loading and evaluating your model. You will reuse the model in the next section.

- Your task in this section is to modify the script in order to:
    - Define a [UNet](https://arxiv.org/abs/1505.04597) model that takes an RGB image and outputs a single channel depth map. **[25 points]**
    - Define an appropriate loss function. **[15 points]**
    - Tune the model to achieve an RMSE of **0.035** or less on the given validation set. **[25 points]**


<hr/>

**Note**: Make sure that your Colab notebook is a GPU instance. Also, the first time you run the training, the instance might crash for exceeding the allocated memory. This is expected behaviour, especially with large batch sizes. Colab will suggest restarting the session and switching to an instance with more memory.

**Note**: This project is more open-ended than the previous projects. Multiple solutions can be considered _correct_. As there already exist implementations of various deep networks for this task on the interwebs, **plagiarism will NOT be tolerated**. Your code will be judged for similarity against code available online and other students' code. You are expected to justify every design decision when your project is being evaluated.

**Note**: The networks you will design/implement will be much larger than what you have previously designed. Please bring hardware concerns to the attention of the [TA](mailto:wamiq.para@kaust.edu.sa). You will need to begin early to test out new ideas/hyperparameters and training will take much longer. Best of luck!

<hr/>

## Downloading Data
Run the following cell to download the dataset and extract the zip archive.

If you are not running a Linux/Mac machine, please download the following zip file manually and extract it in the same directory as the notebook.

In [None]:
! wget -nc https://densedepth2019.s3.amazonaws.com/UnrealData256.zip
! unzip -nq UnrealData256.zip

--2020-10-14 22:00:57--  https://densedepth2019.s3.amazonaws.com/UnrealData256.zip
Resolving densedepth2019.s3.amazonaws.com (densedepth2019.s3.amazonaws.com)... 52.217.39.204
Connecting to densedepth2019.s3.amazonaws.com (densedepth2019.s3.amazonaws.com)|52.217.39.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 800935450 (764M) [application/zip]
Saving to: ‘UnrealData256.zip’


2020-10-14 22:04:48 (3.32 MB/s) - ‘UnrealData256.zip’ saved [800935450/800935450]



## Hyperparameters

You are expected to change `batch_size` and `learning_rate` from their default values.

In [None]:
import os
import gc
import time
import datetime

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

epochs = 10
batch_size = 8
learning_rate = 1
workers = 1 # The number of parallel processes used to read data
gpu_id = [0] # only modify if your machine has more than one GPU card

In [None]:
%cd Depth_Estimation/

## Data Loader (no tasks required)

In [None]:
if __name__ == '__main__':
    from loaders import prep_loaders
    train_loader, valid_loader = prep_loaders('UnrealData256', batch_size=batch_size, workers=workers)

## Sanity check 

In [None]:
# Examine training data
%pylab inline
import torchvision
sample = next(iter(train_loader))  # Python 3 idiom; iterators have no .next() method
print(sample['image'].shape, sample['depth'].shape)
figure(figsize=(9,9)); imshow(torchvision.utils.make_grid(sample['image'], padding=0).permute((1, 2, 0)))
figure(figsize=(9,9)); imshow(torchvision.utils.make_grid(sample['depth'], padding=0, normalize=True, scale_each=True).permute((1, 2, 0))[:,:,0])

## Model [25 points]

Define your model here. The current model is going to perform very poorly on the task. 
But it will be fast. You are welcome to run it.

In [5]:
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.A = nn.Conv2d(3, 1, kernel_size=3, padding=1, stride=2)

    def forward(self, x):
        return self.A(x)
    
def create_model_gpu():
    model = Model()
    model = model.cuda()
    model = nn.DataParallel(model, device_ids=[g for g in gpu_id])
    return model

model = create_model_gpu()
print('Ready to train.')

#model.load_state_dict(torch.load('trained_model.pkl'))

Ready to train.
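Before designing a replacement, it can help to check what the placeholder actually produces. The sketch below runs the same single convolution on the CPU with a dummy batch; the 256x256 input size is an assumption suggested by the dataset name, so compare the printed shape with the `sample['depth'].shape` reported by the sanity check above.

```python
import torch
import torch.nn as nn

# Same layer as the placeholder Model, run on the CPU for a quick shape check.
conv = nn.Conv2d(3, 1, kernel_size=3, padding=1, stride=2)

# Dummy batch: 2 RGB images of 256x256 (the size is an assumption).
x = torch.zeros(2, 3, 256, 256)
out = conv(x)
print(out.shape)  # torch.Size([2, 1, 128, 128])

# Learnable parameters: 3*3*3 weights + 1 bias.
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 28
```

With `stride=2` the spatial resolution is halved, and 28 parameters with a 3x3 receptive field leave very little capacity for reasoning about scene geometry.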


## Loss Function [15 points]

Define a loss function that is suitable for the dense regression task.
Why will the current loss not work? Submit the answer in the notebook.

In [6]:
import torch
from math import exp
import torch.nn.functional as F


def loss_fn(y_pred, y):
    return torch.mean(y.sub(y_pred))
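One way to explore the question above is to evaluate the placeholder expression on small hand-made tensors. The sketch below is illustrative only: the values are made up, and `F.l1_loss` is shown purely as a point of comparison, not as the required loss.

```python
import torch
import torch.nn.functional as F

def signed_mean_loss(y_pred, y):
    # Same expression as the placeholder loss: mean of (y - y_pred).
    return torch.mean(y.sub(y_pred))

y = torch.tensor([1.0, 2.0, 3.0, 4.0])

# Errors of opposite sign cancel: every value is off by 1, yet the loss is 0.
y_cancel = torch.tensor([2.0, 1.0, 4.0, 3.0])
print(signed_mean_loss(y_cancel, y).item())  # 0.0
print(F.l1_loss(y_cancel, y).item())         # 1.0

# Over-predicting everywhere makes the loss arbitrarily negative.
y_huge = y + 1000.0
print(signed_mean_loss(y_huge, y).item())    # -1000.0
```

A quantity that an optimizer can push towards minus infinity by inflating every prediction has no minimum at the correct depth map.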

## Training + Evaluation [25 points]

Tune the hyperparameters and the architecture to achieve the target RMSE.

In [7]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
run_id = 'model_gpu{}_n{}_bs{}_lr{}'.format(gpu_id, epochs, batch_size, learning_rate); print('\n\nTraining', run_id)
save_path = run_id + '.pkl'

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

class RMSE(object):
    def __init__(self):
        self.sq_errors = []
        self.num_pix = 0
        
    def get(self):
        return np.sqrt(
                    np.sum(np.array(self.sq_errors))/self.num_pix
                )
    
    def add_batch(self, pred, target):
        sqe = (pred-target)**2
        self.sq_errors.append(np.sum(sqe))
        self.num_pix += target.size
        
    def reset(self):
        self.sq_errors = []
        self.num_pix = 0


# Used to keep track of statistics
class AverageMeter(object):
    def __init__(self):
        self.val = 0; self.avg = 0; self.sum = 0; self.count = 0
    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

REPORTS_PER_EPOCH = 10
ITER_PER_EPOCH = len(train_loader)
ITER_PER_REPORT = ITER_PER_EPOCH//REPORTS_PER_EPOCH

metrics = RMSE()

for epoch in range(epochs):
    model.train()
    
    # Progress reporting
    batch_time = AverageMeter()
    losses = AverageMeter()
    N = len(train_loader)
    end = time.time()

    for i, (sample) in enumerate(train_loader):

        # Load a batch and send it to GPU
        x = sample['image'].float().cuda()
        y = sample['depth'].float().cuda()

        # Forward pass: compute predicted y by passing x to the model.
        y_pred = model(x)

        # Compute and print loss.
        loss = loss_fn(y_pred, y)
        
        # Record loss
        losses.update(loss.data.item(), x.size(0))

        # Before the backward pass, use the optimizer object to zero all of the
        # gradients for the variables it will update (which are the learnable
        # weights of the model).
        optimizer.zero_grad()

        # Backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()

        # Calling the step function on an Optimizer makes an update to its parameters
        optimizer.step()
        
        # Measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()
        eta = str(datetime.timedelta(seconds=int(batch_time.val*(N - i))))

        # Log training progress
        if i % ITER_PER_REPORT == 0:
            print('\nEpoch: [{0}][{1}/{2}]\t' 'Time {batch_time.val:.3f} ({batch_time.sum:.3f})\t' 'ETA {eta}\t'
             'Training Loss {loss.val:.4f} ({loss.avg:.4f})'.format(epoch, i, N, batch_time=batch_time, loss=losses, eta=eta))
        elif i % max(1, ITER_PER_REPORT//50) == 0:  # max() avoids modulo by zero on short epochs
            print('.', end='')
            
        #break # useful for quick debugging        
    torch.cuda.empty_cache(); del x, y; gc.collect()
    
    # Validation after each epoch
    model.eval()
    metrics.reset()
    for i, (sample) in enumerate(valid_loader):
        x, y = sample['image'].float().cuda(), sample['depth'].numpy()
        with torch.no_grad():
            y_pred = model(x).detach().cpu().numpy()

        metrics.add_batch(y_pred, y)
        print('_', end='')
    print('\nValidation RMSE {avg_rmse}'.format(avg_rmse=metrics.get()))
    

# Save model
torch.save(model.state_dict(), save_path)
print('\nTraining done. Model saved ({}).'.format(save_path))
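The `RMSE` class above pools squared errors and pixel counts over the whole validation set before taking the square root, which is not the same as averaging per-batch RMSE values. A minimal NumPy sketch of the difference, using made-up numbers and a hypothetical `pooled_rmse` helper that mirrors the class:

```python
import numpy as np

def pooled_rmse(batches):
    # Accumulate squared error and pixel count across batches, as the RMSE class does.
    total_sq, total_px = 0.0, 0
    for pred, target in batches:
        total_sq += np.sum((pred - target) ** 2)
        total_px += target.size
    return np.sqrt(total_sq / total_px)

# Two toy "batches" of different size (values are made up for illustration).
b1 = (np.array([0.0, 0.0]), np.array([3.0, 3.0]))  # error 3 on 2 pixels
b2 = (np.zeros(6), np.ones(6))                      # error 1 on 6 pixels

print(pooled_rmse([b1, b2]))  # sqrt((18 + 6) / 8) = sqrt(3)
mean_of_rmses = np.mean([pooled_rmse([b1]), pooled_rmse([b2])])
print(mean_of_rmses)          # (3 + 1) / 2 = 2.0
```

The pooled value is the one to compare against the 0.035 target, since that is what the validation loop reports.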

## Visual Test of the Trained Model (no tasks required)

In [None]:
# Load model from disk
#model = create_model_gpu()
#model.load_state_dict(torch.load('trained_model.pkl'))
#model.eval() # set to evaluation mode

# Visualize validation sample
sample = next(iter(valid_loader))  # Python 3 idiom; iterators have no .next() method
x = sample['image'].float().cuda()
y_pred, y = model(x), sample['depth']

figure(figsize=(20,20)); imshow(torchvision.utils.make_grid(sample['image'], padding=0).permute((1, 2, 0)))
figure(figsize=(20,20)); imshow(torchvision.utils.make_grid(sample['depth'], padding=0, normalize=True, scale_each=True).permute((1, 2, 0))[:,:,0])
figure(figsize=(20,20)); imshow(torchvision.utils.make_grid(y_pred.detach().cpu(), padding=0, normalize=True, scale_each=True).permute((1, 2, 0))[:,:,0])

# Part 2

## Semantic Segmentation

In this part of the project, you will reuse the model you created in the previous part to perform semantic segmentation: instead of assigning a real number to each
pixel, you will assign it a class.

The tasks are as follows:
- Write a Dataset class that processes the segmentation data. **[10 points]**
- Modify the UNet model so that it takes an RGB image and now outputs a single channel _label map_, and define an appropriate loss function. **[5 points]**
- Tune the model to achieve an mIOU of **0.45** or higher on the given validation set. **[20 points]**
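For reference, mean IoU averages the per-class intersection-over-union. The sketch below computes it from a confusion matrix in NumPy; `mean_iou` is a hypothetical helper, `255` as the ignored label is an assumption, and the course's `Metrics` class in `utils` may handle details (such as classes absent from the data) differently.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    # Hypothetical helper; the course's `Metrics` class may differ in detail.
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(num_classes * target + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0  # skip classes that appear in neither array
    return (intersection[present] / union[present]).mean()

target = np.array([0, 0, 1, 1, 255])
pred = np.array([0, 1, 1, 1, 0])
print(mean_iou(pred, target, num_classes=3))  # (1/2 + 2/3) / 2 = 0.5833...
```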

### Dataset [10 points]
We are going to use the [PASCAL VOC dataset](https://drive.google.com/drive/folders/1G54WDNnOQecr5T0sEvZcuyme0WT5Qje3?usp=sharing), which is a commonly used benchmark. In order to reduce the
computational requirements, you should downsample the dataset to 256x256, similar to the previous project.

Now you have to implement the Dataset. Look at the file `loaders.py`.

The class you will need to emulate is `class ImageDepthDataset(Dataset)`. Your new class should be called `VOCSeg`, and it must _inherit_ from the `Dataset` class,
just like `ImageDepthDataset`.
You need to fill in the `__len__` and the `__getitem__` methods.
The `__getitem__` method should return a dict containing the RGB image and the labeled segmentation map (the training loop below reads them as `sample['image']` and `sample['label']`).

Make sure you downsample the image and the labels to 256x256, otherwise the training will take too much time.

Make sure that the labels are in the range `0..N-1`, where
N is the number of classes - 21 in our case. You can have one special label for unknown regions.

We provide the map from RGB color to label for convenience in `get_pascal_labels()`. The map should be read as follows: if a pixel has color `[0, 0, 0]`, it has label 0; if the color is
`[128, 0, 0]`, the label is 1.

It is also very common to change the RGB range from 0-255 to 0-1 or -1 to 1. Take a look at [torchvision.transforms.ToTensor](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.ToTensor)
and [torchvision.transforms.Normalize](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.Normalize)

The PASCAL VOC dataset has predefined train/val sets. Make sure your class implementation can take this _split_ as an argument. Now create train/val loaders using the `get_seg_loaders` function (look at `prep_loaders`), and we should be good to go.
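The colour-to-label conversion described above can be sketched as follows. This is an illustration under stated assumptions: `encode_segmap` is a hypothetical helper name, only the first three entries of the PASCAL VOC colour map are listed (the full table is what `get_pascal_labels()` is described as providing), and `255` is an assumed choice for the special "unknown" label.

```python
import numpy as np

# First three entries of the PASCAL VOC colour map (background, then two classes).
PALETTE = np.array([[0, 0, 0], [128, 0, 0], [0, 128, 0]], dtype=np.uint8)
IGNORE_INDEX = 255  # assumed label for colours not in the palette

def encode_segmap(mask_rgb, palette=PALETTE, ignore_index=IGNORE_INDEX):
    """Convert an (H, W, 3) colour mask to an (H, W) array of class indices."""
    labels = np.full(mask_rgb.shape[:2], ignore_index, dtype=np.int64)
    for class_idx, colour in enumerate(palette):
        # A pixel belongs to the class only if all three channels match.
        labels[np.all(mask_rgb == colour, axis=-1)] = class_idx
    return labels

mask = np.array([[[0, 0, 0], [128, 0, 0]],
                 [[0, 128, 0], [224, 224, 192]]], dtype=np.uint8)
print(encode_segmap(mask))
# [[  0   1]
#  [  2 255]]
```

When downsampling label images to 256x256, nearest-neighbour interpolation keeps every pixel on an existing palette colour; bilinear resizing would blend colours at object boundaries and produce values that match no class.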

In [None]:
# Sanity check
if __name__ == '__main__':
    from loaders import get_seg_loaders
    train_loader, valid_loader = get_seg_loaders(root_dir='./VOC2012')

    # we have read all files
    assert len(train_loader.dataset) == 1464
    assert len(valid_loader.dataset) == 1449

You should implement a few more sanity checks - the range of data in the RGB part, the range of data in the label part, whether the dataset returns tensors,
whether the labels have the datatype `torch.long` etc.
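A possible shape for these extra checks is sketched below. `check_sample` is a hypothetical helper; it assumes the dataset returns a dict with `'image'` and `'label'` keys, that images are scaled to [0, 1] without further normalization, and that `255` marks unknown pixels. Adjust the bounds if you normalize to a different range.

```python
import torch

def check_sample(sample, num_classes=21, ignore_index=255):
    # `sample` is assumed to be a dict with 'image' and 'label' keys,
    # matching how the training loop indexes each batch.
    image, label = sample['image'], sample['label']
    assert torch.is_tensor(image) and torch.is_tensor(label)
    assert image.dim() == 3 and image.shape[0] == 3, 'expected (3, H, W)'
    assert label.shape == image.shape[1:], 'label must match image size'
    assert label.dtype == torch.long
    assert image.min() >= 0.0 and image.max() <= 1.0, 'assumes [0, 1] scaling'
    valid = label[label != ignore_index]
    assert valid.numel() == 0 or (valid.min() >= 0 and valid.max() < num_classes)
    return True

# Synthetic stand-in for `train_loader.dataset[0]`.
fake = {'image': torch.rand(3, 256, 256),
        'label': torch.randint(0, 21, (256, 256))}
print(check_sample(fake))  # True
```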

## Modifying the Loss and Architecture [5 points]
You will have to perform some form of surgery on the network you constructed in Part 1.

1. The number of channels the last layer predicts must change to the number of classes in the dataset.
2. The loss function must change to reflect the fact that we are now performing per-pixel classification. (What loss did you use for classification in Project 1?)
3. You might get a CUDA assert error. This means that you have a label that is not a valid channel index of the _logits_ (i.e. it is greater than or equal to the number of channels). This is very common with semantic segmentation, where you might want to label some region as unknown because its label is in doubt, for example near the edges of objects. Look up how to ignore a certain label with a classification loss.
4. Take care of input label and logit sizes. We want predictions to be 256x256 as well.
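Points 1 to 4 can be checked on random tensors before touching the real network. The sketch below uses `nn.CrossEntropyLoss` with its `ignore_index` argument; the value `255` for the unknown label and the 8x8 spatial size are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

num_classes = 21
ignore_index = 255  # assumed value for the 'unknown' label

criterion = nn.CrossEntropyLoss(ignore_index=ignore_index)

logits = torch.randn(2, num_classes, 8, 8, requires_grad=True)  # (N, C, H, W)
labels = torch.randint(0, num_classes, (2, 8, 8))               # (N, H, W), dtype long
labels[:, 0, :] = ignore_index                                   # mark one row as unknown

loss = criterion(logits, labels)
loss.backward()

# Ignored pixels contribute no gradient.
print(loss.item() > 0)                               # True
print(logits.grad[:, :, 0, :].abs().sum().item())    # 0.0
```

Note the shapes: the logits carry one channel per class, while the labels have no channel dimension and must share the spatial size of the logits.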

### !! 
### <span style="color:red"> At this point, we highly recommend restarting your notebook for Part 2 before you begin modifying/training the model</span>

In [None]:
import os
import gc
import time
import datetime

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

epochs = 10
batch_size = 8
learning_rate = 1
workers = 1 # The number of parallel processes used to read data
gpu_id = [0] # only modify if your machine has more than one GPU card

In [None]:
if __name__ == '__main__':
    from loaders import get_seg_loaders
    train_loader, valid_loader = get_seg_loaders(root_dir='./VOC2012')

In [None]:
import torch.nn as nn
import torch.nn.functional as F

# You can copy the depth model code with modifications here
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        #TODO Make sure you have the right number of channels in the last layer
        self.A = nn.Conv2d(3, 1, kernel_size=3, padding=1, stride=2)

    def forward(self, x):
        return self.A(x)

def create_model_gpu():
    model = Model()
    model = model.cuda()
    model = nn.DataParallel(model, device_ids=[g for g in gpu_id])
    return model

model = create_model_gpu()
print('Ready to train.')

In [None]:
import torch

def loss_fn(y_pred, y):
    #TODO
    return torch.mean(y.sub(y_pred))

## Training and Evaluation [18 points]
Tune the hyperparameters to get the maximum possible score on the PASCAL VOC challenge. 
And answer the following questions:
1. What is the relationship between the _size_ of the class and the IOU? How would you quantify this relationship?
2. What is the relationship between the number of instances and the IOU, i.e. how many times a class exists in an image vs the IOU?
3. The segmentation dataset is small. Initialize the weights of the segmentation net with the weights of the trained depth network.
4. Which weights can you not transfer?
5. Fine-tune (i.e. train with a lower learning rate) the model in 3 for the same number of epochs as the model with a random initialization (or ImageNet-initialized weights).
6. What trend do you observe?
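For item 3, one common approach is to copy only those tensors whose name and shape match in both networks and leave the rest at their fresh initialization. The sketch below demonstrates the mechanics on a deliberately tiny stand-in network (`tiny_net` is hypothetical, not the UNet you built).

```python
import torch
import torch.nn as nn

def tiny_net(out_channels):
    # Stand-in for the real network: a shared body plus a task-specific head.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, out_channels, kernel_size=1),
    )

depth_net = tiny_net(out_channels=1)   # pretend this was trained on depth
seg_net = tiny_net(out_channels=21)    # new segmentation network

# Keep only tensors whose name and shape match the target network.
target_state = seg_net.state_dict()
transferable = {k: v for k, v in depth_net.state_dict().items()
                if k in target_state and v.shape == target_state[k].shape}

result = seg_net.load_state_dict(transferable, strict=False)
print(sorted(transferable))        # ['0.bias', '0.weight']
print(sorted(result.missing_keys)) # ['2.bias', '2.weight']
```

In this notebook the model is wrapped in `nn.DataParallel`, so the keys in a saved state dict carry a `module.` prefix; load the checkpoint into a model wrapped the same way, or strip the prefix first.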


In [None]:
from utils import Metrics

In [None]:
run_id = 'seg_model_gpu{}_n{}_bs{}_lr{}'.format(gpu_id, epochs, batch_size, learning_rate); print('\n\nTraining', run_id)
save_path = run_id + '.pkl'

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

metrics = Metrics(train_loader.dataset.num_classes, train_loader.dataset.class_names)

# Used to keep track of statistics
class AverageMeter(object):
    def __init__(self):
        self.val = 0; self.avg = 0; self.sum = 0; self.count = 0
    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

REPORTS_PER_EPOCH = 10
ITER_PER_EPOCH = len(train_loader)
ITER_PER_REPORT = ITER_PER_EPOCH//REPORTS_PER_EPOCH


for epoch in range(epochs):
    model.train()

    # Progress reporting
    batch_time = AverageMeter()
    losses = AverageMeter()
    N = len(train_loader)
    end = time.time()

    for i, (sample) in enumerate(train_loader):

        # Load a batch and send it to GPU
        x = sample['image'].float().cuda()
        y = sample['label'].long().cuda()  # class indices must be torch.long for classification losses

        # Forward pass: compute predicted y by passing x to the model.
        y_pred = model(x)

        # Compute and print loss.
        loss = loss_fn(y_pred, y)

        # Record loss
        losses.update(loss.data.item(), x.size(0))

        # Before the backward pass, use the optimizer object to zero all of the
        # gradients for the variables it will update (which are the learnable
        # weights of the model).
        optimizer.zero_grad()

        # Backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()

        # Calling the step function on an Optimizer makes an update to its parameters
        optimizer.step()

        # Measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()
        eta = str(datetime.timedelta(seconds=int(batch_time.val*(N - i))))

        # Log training progress
        if i % ITER_PER_REPORT == 0:
            print('\nEpoch: [{0}][{1}/{2}]\t' 'Time {batch_time.val:.3f} ({batch_time.sum:.3f})\t' 'ETA {eta}\t'
             'Training Loss {loss.val:.4f} ({loss.avg:.4f})'.format(epoch, i, N, batch_time=batch_time, loss=losses, eta=eta))
        elif i % max(1, ITER_PER_REPORT//50) == 0:  # previous condition duplicated the `if` and never fired
            print('.', end='')

        #break # useful for quick debugging
    torch.cuda.empty_cache(); del x, y; gc.collect()

    # Validation after each epoch
    model.eval()
    metrics.reset()
    for i, (sample) in enumerate(valid_loader):
        x, y = sample['image'].float().cuda(), sample['label'].numpy()
        with torch.no_grad():
            y_pred = model(x)
            y_pred = torch.argmax(y_pred, dim=1) # get the most likely prediction

        metrics.add_batch(y, y_pred.detach().cpu().numpy())
        print('_', end='')
    print('\nValidation stats ', metrics.get_table())


# Save model
torch.save(model.state_dict(), save_path)
print('\nTraining done. Model saved ({}).'.format(save_path))

### Visualization  [2 points]
Use the `decode_segmap` function to visualize images and their segmentation. The images must be from the validation set.


