# Assignment 3

# Instructions

1. You have to use only this notebook for all your code.
2. All the results and plots should be included in this notebook.
3. For final submission, submit this notebook along with the report (usually 2-4 pages, LaTeX typeset, which includes the challenges faced and details of additional steps, if any).
4. Marking scheme
    -  **60%**: Your code should be able to detect bounding boxes using ResNet-18, with correct data loading and preprocessing. Plot any 5 correct and 5 incorrect sample detections from the test set in this notebook for both approaches (1-layer and 2-layer detection), for a total of 20 plots.
    -  **20%**: Use two layers (multi-scale feature maps) to detect objects independently, as in SSD (https://arxiv.org/abs/1512.02325). In this method, the 1st detection comes from the last layer of ResNet-18 and the 2nd detection can come from any layer before the last one. SSD uses lower-resolution layers to detect larger-scale objects.
    -  **20%**: Implement Non-maximum suppression (NMS) (should not be imported from any library) on the candidate bounding boxes.
    
5. Report the AP for each of the three classes and the mAP score for the complete test set.
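
To make the multi-scale item above more concrete, here is a minimal sketch of how two ResNet-18 feature maps tile a 224x224 input. For that input size, `layer3` produces a 14x14 map and `layer4` a 7x7 map. The function name `default_boxes` and the box sides (64 and 128 pixels) are assumptions chosen for illustration; they are not taken from the SSD paper or from this assignment.

```python
# Spatial sizes of two ResNet-18 feature maps for a 224x224 input.
FEATURE_MAP_SIZES = {"layer3": 14, "layer4": 7}

# Hypothetical square default-box sides in pixels (an assumption for this sketch):
# the coarser map (layer4) is paired with the larger box.
DEFAULT_BOX_SIDES = {"layer3": 64, "layer4": 128}


def default_boxes(feature_size, box_side, image_size=224):
    """Return one square box per feature-map cell, centred on that cell."""
    stride = image_size / feature_size  # pixels covered by one cell
    half = box_side / 2
    boxes = []
    for row in range(feature_size):
        for col in range(feature_size):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            # Clip the box to the image so it can be cropped directly.
            boxes.append((max(0.0, cx - half), max(0.0, cy - half),
                          min(float(image_size), cx + half),
                          min(float(image_size), cy + half)))
    return boxes


for name, size in FEATURE_MAP_SIZES.items():
    boxes = default_boxes(size, DEFAULT_BOX_SIDES[name])
    print(name, "cells:", len(boxes), "first box:", boxes[0])
```

Each layer then scores its own boxes independently, and the two sets of candidates are merged before non-maximum suppression.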

In [1]:
from __future__ import division, print_function, unicode_literals
import numpy as np
import torch
import torch.utils.data
import torchvision.transforms as transforms
from torch.autograd import Variable
import matplotlib.pyplot as plt

import os
import xml.etree.ElementTree as ET
from PIL import Image
from shutil import copyfile
from torchvision import datasets, models
import torch.nn as nn
import torch.optim as optim
import shutil

%matplotlib inline
plt.ion()
# Import other modules if required
# Can use other libraries as well

resnet_input = 224 #size of resnet18 input images

In [2]:
# Choose your hyper-parameters using validation data
batch_size = 64
epochs = 10
learning_rate =  0.005
# hyp_momentum = 0.9

## Build the data
Use the following links to locally download the data:
<br/>Training and validation:
<br/>http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
<br/>Testing data:
<br/>http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
<br/>The dataset consists of images from 20 classes, with detection annotations included. The JPEGImages folder contains the images, and the Annotations folder has the object-wise labels in one XML file per image. You have to extract the object information, i.e. [xmin, ymin] (the top-left x,y coordinates) and [xmax, ymax] (the bottom-right x,y coordinates), only for objects belonging to the three classes (aeroplane, bottle, chair). To parse the XML files, you can use xml.etree.ElementTree. <br/>
<br/> Organize the data as follows:
<br/> For every image in the dataset, crop each object patch from the image one by one using its coordinates [xmin, ymin, xmax, ymax], resize the patch to resnet_input, and store it with its class label. Do the same for the training/validation and test datasets. <br/>
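
The extraction step described above can be sketched as follows. This is a minimal, standard-library-only illustration: the annotation text is made up (only its layout follows Pascal VOC), and `parse_voc_boxes` is a hypothetical helper name rather than part of the assignment template.

```python
import xml.etree.ElementTree as ET

TARGET_CLASSES = ("aeroplane", "bottle", "chair")

# Made-up annotation in the Pascal VOC layout, used only for illustration.
SAMPLE_XML = """
<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>chair</name>
    <bndbox><xmin>86</xmin><ymin>110</ymin><xmax>231</xmax><ymax>281</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>60</xmax><ymax>120</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc_boxes(xml_text, wanted=TARGET_CLASSES):
    """Return [(class_name, (xmin, ymin, xmax, ymax)), ...] for the wanted classes."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        if name not in wanted:
            continue  # skip objects outside the classes of interest
        bndbox = obj.find("bndbox")
        coords = tuple(int(bndbox.find(tag).text)
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, coords))
    return boxes


print(parse_voc_boxes(SAMPLE_XML))
```

Each returned tuple can be passed to PIL's `Image.crop` and the result resized to `(resnet_input, resnet_input)` before saving it under its class folder.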
##### Important
You also have to collect data for an extra background class, which stands for anything that does not belong to any of the 20 classes. For this, you can crop and resize random patches from an image. A good idea is to extract patches that have low "intersection over union" with every object from the 20 Pascal VOC classes present in the image. The number of background images should be roughly the same as the number of images for each of the other classes. Hence there are four classes in total. This is important for applying the sliding window method later.
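
One way to pick such background patches is to draw random windows and keep those whose IoU with every annotated object stays under a small limit. The sketch below is only an illustration under assumptions: the helper names, the 224-pixel patch size, the `max_iou=0.1` limit and the fixed seed are all hypothetical choices, and the IoU here uses plain widths and heights (no +1 pixel convention).

```python
import random


def box_iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def sample_background_boxes(width, height, object_boxes, patch=224,
                            max_iou=0.1, wanted=4, attempts=200, seed=0):
    """Randomly place patch x patch windows that barely overlap any object."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    found = []
    if width < patch or height < patch:
        return found  # image too small to hold a full patch
    for _ in range(attempts):
        if len(found) == wanted:
            break
        x = rng.randint(0, width - patch)
        y = rng.randint(0, height - patch)
        candidate = (x, y, x + patch, y + patch)
        if all(box_iou(candidate, obj) <= max_iou for obj in object_boxes):
            found.append(candidate)
    return found


print(sample_background_boxes(500, 375, [(291, 104, 461, 281)]))
```

Passing the boxes of all 20 VOC classes as `object_boxes`, not only the three target classes, keeps other objects out of the background set.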


In [3]:
classes = ('__background__',
           'aeroplane',
           'bottle','chair'
           )

In [4]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [28]:
k = 0
def intersection_over_union(orig_boxes,bg_boxes,img,file):
    global k
    for bg_box in bg_boxes:
        flag = 0
        for orig_box in orig_boxes:
            boxA = orig_box
            boxB = bg_box
            xA = max(boxA[0], boxB[0])
            yA = max(boxA[1], boxB[1])
            xB = min(boxA[2], boxB[2])
            yB = min(boxA[3], boxB[3])
 
            # compute the area of intersection rectangle
            interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)
 
            # compute the area of both the prediction and ground-truth
            # rectangles
            boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
            boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)
 
            # compute the intersection over union by dividing the
            # intersection area by the union area (sum of both box
            # areas minus the intersection area)
            iou = float(interArea) / float(boxAArea + boxBArea - interArea)
            if file.split('.')[0]=='003644':
                    print(orig_box,bg_box,interArea,iou)
            if interArea > 0:
                flag=1
                break
#             if interArea==0 and k<700:
        if not flag and k < 700:
            cropped_img = img.crop(bg_box)
            new_image_path = "data2/processed_data/{}/{}_{}".format('__background__', str(k), file.split('.')[0] + ".jpg")
            cropped_img.save(new_image_path)
            k+=1

In [29]:
def build_dataset():
    c=0
    found_classes = {'__background__': 0}
    processed_dir = 'data2/processed_data'
    if os.path.exists(processed_dir):
        shutil.rmtree(processed_dir)
    os.makedirs(processed_dir)
    data_path = "data2/VOCdevkit/Annotations/"
    images_path = "data2/VOCdevkit/JPEGImages/"
    for i in classes:
        directory = "data2/processed_data/{}".format(i)
        if not os.path.exists(directory):
            os.makedirs(directory)
    count = 0
    for file in os.listdir(data_path):
        flag = 0
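        # Use a context manager so the annotation file is closed after reading
        with open(data_path + file, "r") as annotation_file:
            xml = ET.fromstring(annotation_file.read())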
        objects = xml.findall('./object')
        boxes = []
        for obj in objects:
            img_class = obj.find('name').text
            # Record the box of every annotated object so that background
            # patches can be checked against all of them
            box = obj.find('bndbox')
            xmin = int(box.find('xmin').text)
            ymin = int(box.find('ymin').text)
            xmax = int(box.find('xmax').text)
            ymax = int(box.find('ymax').text)
            boxes.append([xmin, ymin, xmax, ymax])
            if img_class in classes:
                # Crop and save the patch only for the classes of interest
                c+=1
                found_classes[img_class] = found_classes.get(img_class, 0) + 1
                image_path = images_path + file.split('.')[0] + ".jpg"
                img = Image.open(image_path)
                cropped_img = img.crop((xmin, ymin, xmax, ymax))
                new_image_path = "data2/processed_data/{}/{}_{}".format(img_class, str(c), file.split('.')[0] + ".jpg")
                cropped_img.save(new_image_path)
                flag = 1
        image_path = images_path + file.split('.')[0] + ".jpg"
        img = Image.open(image_path)
        width, height = img.size
        if width<224 or height<224:
            continue
        bg = []
        coordinates = [0,0,224,224]
        bg.append(coordinates)
        coordinates = [width-224,0,width,224]
        bg.append(coordinates)
        coordinates = [0,height-224,224,height]
        bg.append(coordinates)
        coordinates = [width-224,height-224,width,height]
        bg.append(coordinates)
        intersection_over_union(boxes,bg,img,file)
#         if not flag and count < 600:
#             image_path = images_path + file.split('.')[0] + ".jpg"
# #                 img = Image.open(image_path)
# #                 area = (xmin, ymin, xmax, ymax)
# #                 cropped_img = img.crop(area)
#             new_image_path = "data2/processed_data/{}/{}_{}".format(classes[0], str(c), file.split('.')[0] + ".jpg")
# #                 cropped_img.save(new_image_path)
#             copyfile(image_path, new_image_path)
#             count+=1
    found_classes['__background__'] = k
    print(found_classes)

In [30]:
build_dataset()

[291, 104, 461, 281] [0, 0, 224, 224] 0 0.0
[236, 101, 331, 194] [0, 0, 224, 224] 0 0.0
[86, 110, 231, 281] [0, 0, 224, 224] 15985 0.26752242602758064
[291, 104, 461, 281] [276, 0, 500, 224] 20691 0.34272510435301135
[291, 104, 461, 281] [0, 57, 224, 281] 0 0.0
[236, 101, 331, 194] [0, 57, 224, 281] 0 0.0
[86, 110, 231, 281] [0, 57, 224, 281] 23908 0.46128615254008376
[291, 104, 461, 281] [276, 57, 500, 281] 30438 0.6012444444444445
{'__background__': 700, 'chair': 1432, 'aeroplane': 331, 'bottle': 634}


In [31]:
data_dir = 'data2/processed_data/'
def load_split_train_test(datadir, valid_size = .05):
    train_transforms = transforms.Compose([transforms.Resize([resnet_input,resnet_input]),
                                       transforms.ToTensor(),
                                       ])
    test_transforms = transforms.Compose([transforms.Resize([resnet_input,resnet_input]),
                                      transforms.ToTensor(),
                                      ])
    train_data = datasets.ImageFolder(datadir,       
                    transform=train_transforms)
    test_data = datasets.ImageFolder(datadir,
                    transform=test_transforms)
    num_train = len(train_data)
    print(num_train)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))
    np.random.shuffle(indices)
    from torch.utils.data.sampler import SubsetRandomSampler
    train_idx, test_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    test_sampler = SubsetRandomSampler(test_idx)
    trainloader = torch.utils.data.DataLoader(train_data,
                   sampler=train_sampler, batch_size=batch_size)
    testloader = torch.utils.data.DataLoader(test_data,
                   sampler=test_sampler, batch_size=batch_size)
    return trainloader, testloader
trainloader, testloader = load_split_train_test(data_dir, .1)
# print(trainloader.dataset.)
print(trainloader.dataset.classes)

3097
['__background__', 'aeroplane', 'bottle', 'chair']


In [32]:
# class voc_dataset(torch.utils.data.Dataset): # Extend PyTorch's Dataset class
#     def __init__(self, root_dir, train, transform=None):
#         # Begin
        
#     def __len__(self):
#         # Begin
        
#     def __getitem__(self, idx):
#        # Begin
    

## Train the network
<br/>You can train the network on the created dataset. This will yield a classification network on the 4 classes of the VOC dataset. 

In [33]:
# composed_transform = transforms.Compose([transforms.Scale((resnet_input,resnet_input)),
#                                          transforms.ToTensor(),
#                                          transforms.RandomHorizontalFlip()])
# train_dataset = voc_dataset(root_dir='', train=True, transform=composed_transform) # Supply proper root_dir
# test_dataset = voc_dataset(root_dir='', train=False, transform=composed_transform) # Supply proper root_dir

# train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
# test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

### Fine-tuning
Fine-tune the pre-trained network in the following section:

In [34]:
device = torch.device("cuda" if torch.cuda.is_available() 
                                  else "cpu")
model = models.resnet18(pretrained=True)

model.fc = nn.Linear(model.fc.in_features, 4)

# Add code for using CUDA here

In [35]:
criterion = nn.CrossEntropyLoss()
# Update if any errors occur
# optimizer = optim.SGD(resnet18.parameters(), learning_rate, hyp_momentum)

# criterion = nn.NLLLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=learning_rate)
model.to(device)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Co

In [36]:
#One Layer Detection
def train():
#     epochs = 3
    steps = 0
    running_loss = 0
    print_every = 20
    train_losses, test_losses = [], []
    for epoch in range(epochs):
        for inputs, labels in trainloader:
            steps += 1
            inputs, labels = inputs.to(device),labels.to(device)
            optimizer.zero_grad()
            logps = model.forward(inputs)
            loss = criterion(logps, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

            if steps % print_every == 0:
                test_loss = 0
                accuracy = 0
                model.eval()
                with torch.no_grad():
                    for inputs, labels in testloader:
                        inputs, labels = inputs.to(device), labels.to(device)
                        logps = model.forward(inputs)
                        batch_loss = criterion(logps, labels)
                        test_loss += batch_loss.item()
                        # The model outputs raw logits, so use softmax to get class probabilities
                        ps = torch.softmax(logps, dim=1)
                        top_p, top_class = ps.topk(1, dim=1)
                        equals = top_class == labels.view(*top_class.shape)
                        accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
                # running_loss covers the last print_every steps, matching the printed value
                train_losses.append(running_loss/print_every)
                test_losses.append(test_loss/len(testloader))                    
                print(f"Epoch {epoch+1}/{epochs}.. "
                      f"Train loss : {running_loss/print_every:.3f}.. "
                      f"Test loss: {test_loss/len(testloader):.3f}.. "
                      f"Test accuracy: {accuracy/len(testloader):.3f}")
                acc = accuracy/len(testloader)
                if acc >= 0.99:
                    torch.save(model, 'models/resnet18new.pth')
                running_loss = 0
                model.train()
    torch.save(model, 'models/resnet18new.pth')

In [37]:
%time train()

Epoch 1/10.. Train loss : 0.864.. Test loss: 0.421.. Test accuracy: 0.880
Epoch 1/10.. Train loss : 0.388.. Test loss: 0.270.. Test accuracy: 0.912
Epoch 2/10.. Train loss : 0.293.. Test loss: 0.413.. Test accuracy: 0.844
Epoch 2/10.. Train loss : 0.260.. Test loss: 0.227.. Test accuracy: 0.922
Epoch 3/10.. Train loss : 0.239.. Test loss: 0.235.. Test accuracy: 0.916
Epoch 3/10.. Train loss : 0.185.. Test loss: 0.239.. Test accuracy: 0.909
Epoch 4/10.. Train loss : 0.203.. Test loss: 0.216.. Test accuracy: 0.929
Epoch 4/10.. Train loss : 0.183.. Test loss: 0.207.. Test accuracy: 0.921
Epoch 5/10.. Train loss : 0.185.. Test loss: 0.229.. Test accuracy: 0.932
Epoch 5/10.. Train loss : 0.175.. Test loss: 0.220.. Test accuracy: 0.927
Epoch 5/10.. Train loss : 0.172.. Test loss: 0.188.. Test accuracy: 0.936
Epoch 6/10.. Train loss : 0.157.. Test loss: 0.196.. Test accuracy: 0.926
Epoch 6/10.. Train loss : 0.215.. Test loss: 0.260.. Test accuracy: 0.912
Epoch 7/10.. Train loss : 0.146.. Test

In [None]:
#Two Layer Detection (SSD)
def train():
    # Begin
    raise NotImplementedError("Two-layer (SSD-style) training is not written yet")

In [None]:
%time train()

# Testing and Accuracy Calculation
For applying detection, use a sliding window method to test the trained network above on the detection task:<br/>
Take windows of varying sizes and aspect ratios and slide them across the test image (with some stride in pixels) from left to right and top to bottom, compute the class scores for each window, and keep only the windows whose score is above a certain threshold. A similar approach is used in the Faster R-CNN paper (Ren, He, Girshick, and Sun), which uses three different scales/sizes and three different aspect ratios, for a total of nine windows at each sliding position. You need to write this code and use it in the testing code to find the predicted boxes and their classes.
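
The window enumeration described above could look like the sketch below. It only generates candidate boxes; scoring each crop with the trained classifier is left out. The function name `sliding_windows` and the default scales, aspect ratios and stride are assumptions for illustration, not values prescribed by the assignment.

```python
def sliding_windows(image_width, image_height,
                    scales=(64, 128, 224),
                    aspect_ratios=(0.5, 1.0, 2.0),
                    stride=32):
    """Return (xmin, ymin, xmax, ymax) windows covering the image.

    Each scale/aspect-ratio pair keeps an area of roughly scale * scale,
    with aspect ratio defined as width / height.
    """
    windows = []
    for scale in scales:
        for ratio in aspect_ratios:
            win_w = int(round(scale * ratio ** 0.5))
            win_h = int(round(scale / ratio ** 0.5))
            if win_w > image_width or win_h > image_height:
                continue  # this window shape does not fit in the image
            # Slide top to bottom, and left to right within each row.
            for y in range(0, image_height - win_h + 1, stride):
                for x in range(0, image_width - win_w + 1, stride):
                    windows.append((x, y, x + win_w, y + win_h))
    return windows


candidates = sliding_windows(500, 375)
print(len(candidates), candidates[:3])
```

Each window would then be cropped, resized to `resnet_input`, and classified; windows predicted as background or scoring below the chosen threshold are discarded before non-maximum suppression.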

In [None]:
def sliding_window():
    # Begin
    raise NotImplementedError("Sliding window generation is not written yet")

Apply non_maximum_supression to reduce the number of boxes. You are free to choose the threshold value for non-maximum suppression, but choose it wisely from the range [0, 1].
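
As a worked example of the overlap measure that the threshold is compared against, take two boxes from the `build_dataset()` debug output earlier in this notebook, $A = [86, 110, 231, 281]$ and $B = [0, 0, 224, 224]$, and use the same inclusive-pixel (+1) convention as the code:

$$
\begin{aligned}
|A \cap B| &= (224 - 86 + 1)(224 - 110 + 1) = 139 \cdot 115 = 15985,\\
|A| &= (231 - 86 + 1)(281 - 110 + 1) = 146 \cdot 172 = 25112,\\
|B| &= (224 - 0 + 1)(224 - 0 + 1) = 225^2 = 50625,\\
\mathrm{IoU}(A, B) &= \frac{15985}{25112 + 50625 - 15985} = \frac{15985}{59752} \approx 0.268.
\end{aligned}
$$

If these were two candidate detections of the same class, a threshold of 0.3 would keep both of them, since 0.268 is below the threshold. The pair $A$ and $[0, 57, 224, 281]$ from the same output has an IoU of about 0.461, so with the same threshold the lower-scoring box of that pair would be suppressed.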

In [None]:
def iou(boxA,boxB):
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])

    interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)

    boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
    boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)

    iou_ = float(interArea) / float(boxAArea + boxBArea - interArea)
    return iou_

In [None]:
def non_maximum_supression(boxes,threshold = 0.3):
    # Each entry is (coordinates, class, score); group the candidates by class
    boxes_dict = {}
    for box in boxes:
        if box[1] in boxes_dict:
            boxes_dict[box[1]].append(box)
        else:
            boxes_dict[box[1]] = [box]
    bounding_box = []
    for cls, cls_boxes in boxes_dict.items():
        # Visit the candidates of this class from highest to lowest score
        remaining = sorted(cls_boxes, key=lambda b: b[2], reverse=True)
        while remaining:
            best_box = remaining.pop(0)
            bounding_box.append(best_box)
            # Drop every candidate that overlaps the kept box too much
            remaining = [b for b in remaining
                         if iou(best_box[0], b[0]) <= threshold]
    return bounding_box

Test the trained model on the test dataset.
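
Since the instructions ask for per-class AP and mAP, here is a minimal sketch of how those could be computed once each detection has been marked as a true or false positive. Two assumptions are worth flagging: the matching rule (for example IoU of at least 0.5 with a not-yet-matched ground-truth box) is left to the caller, and the AP below is the area under the monotone precision envelope, whereas the official VOC2007 metric uses 11-point interpolation and gives slightly different numbers. The function names are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class."""
    if num_ground_truth == 0 or not scores:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Make precision non-increasing as recall grows (precision envelope).
    for j in range(len(precisions) - 2, -1, -1):
        precisions[j] = max(precisions[j], precisions[j + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap):
    """mAP is the plain mean of the per-class AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


# Toy numbers, made up for illustration only.
ap_chair = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
print(ap_chair, mean_average_precision({"chair": ap_chair, "bottle": 1.0}))
```

Running this once per class (aeroplane, bottle, chair) over all test-set detections and averaging the three values gives the mAP requested in the instructions.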

In [None]:
#One Layer Detection
def test(resnet18):
    # Write loops for testing the model on the test set
    # Also print out the accuracy of the model
    raise NotImplementedError("One-layer detection test loop is not written yet")

In [None]:
%time test(resnet18)

In [None]:
#Two Layer Detection
def test(resnet18):
    # Write loops for testing the model on the test set
    # Also print out the accuracy of the model
    raise NotImplementedError("Two-layer detection test loop is not written yet")

In [None]:
%time test(resnet18)