# Assignment 3

# Instructions

1. You have to use only this notebook for all your code.
2. All the results and plots should be mentioned in this notebook.
3. For final submission, submit this notebook along with the report (usually 2-4 pages, LaTeX typeset, covering the challenges faced and details of additional steps, if any).
4. Marking scheme
    -  **60%**: Your code should be able to detect bounding boxes using ResNet18, with correct data loading and preprocessing. Plot any 5 correct and 5 incorrect sample detections from the test set in this notebook for both approaches (1 layer and 2 layer detection), so 20 plots in total.
    -  **20%**: Use two layers (multi-scale feature maps) to detect objects independently, as in SSD (https://arxiv.org/abs/1512.02325). In this method, the 1st detection is made from the last layer of ResNet18 and the 2nd detection can be made from any layer before the last layer. SSD uses lower resolution layers to detect larger scale objects.
    -  **20%**: Implement Non-maximum suppression (NMS) (should not be imported from any library) on the candidate bounding boxes.
    
5. Report the AP for each of the three classes and the mAP score for the complete test set.
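
A minimal sketch of how the per-class AP and the mAP could be computed, assuming each detection has already been marked as a true or false positive (an IoU of at least 0.5 with an unmatched ground-truth box is the usual PASCAL VOC criterion) and assuming the 11-point interpolation of the VOC2007 protocol. The function names `average_precision_11pt` and `mean_average_precision` are illustrative and not part of the assignment.

```python
def average_precision_11pt(scores, is_true_positive, num_ground_truth):
    """11-point interpolated AP for a single class.

    scores: confidence of each detection
    is_true_positive: parallel list of booleans
    num_ground_truth: number of ground-truth boxes of this class in the test set
    """
    if num_ground_truth == 0:
        return 0.0
    # rank detections from most to least confident
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # average the best precision reachable at recall >= 0.0, 0.1, ..., 1.0
    ap = 0.0
    for k in range(11):
        threshold = k / 10
        reachable = [p for p, r in zip(precisions, recalls) if r >= threshold]
        ap += max(reachable) if reachable else 0.0
    return ap / 11


def mean_average_precision(per_class_ap):
    """Mean of the per-class AP values given as {class_name: ap}."""
    return sum(per_class_ap.values()) / len(per_class_ap)


# toy example with made-up detections for one class
toy_ap = average_precision_11pt([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
```

The per-class values would then be collected in a dictionary keyed by class name and passed to `mean_average_precision`.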

SETUP

In [0]:
#This code will run on Google Colab
#Create a folder named CS783_A3 in google drive 
#Extract the given PASCAL VOC train and test zip files in this folder such that you have this sort of folder arrangement 
# My Drive/CS783_A3/VOCtrainval_06-Nov-2007/VOCdevkit/VOC2007/<all internal folders>
# My Drive/CS783_A3/VOCtest_06-Nov-2007/VOCdevkit/VOC2007/<all internal folders>

In [0]:
#Google colab authentication
from google.colab import drive
drive.mount('/content/drive')

In [0]:
from __future__ import division, print_function, unicode_literals
import numpy as np
import torch
import torch.utils.data
from torch.autograd import Variable
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.ion()
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models
# Import other modules if required
# Can use other libraries as well


resnet_input = 224  # size of resnet18 input images

In [0]:
# Choose your hyper-parameters using validation data
batch_size = 64
num_epochs = 5
learning_rate =  0.003

## Build the data
Use the following links to locally download the data:
<br/>Training and validation:
<br/>http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
<br/>Testing data:
<br/>http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
<br/>The dataset consists of images from 20 classes, with detection annotations included. The JPEGImages folder holds the images, and the Annotations folder holds the object-wise labels, one xml file per image. You have to extract the object information, i.e. [xmin, ymin] (the top-left x,y coordinates) and [xmax, ymax] (the bottom-right x,y coordinates), of only the objects belonging to the three classes (aeroplane, bottle, chair). For parsing the xml files, you can use xml.etree.ElementTree. <br/>
<br/> Organize the data as follows:
<br/> For every image in the dataset, crop each object patch from the image one by one using its coordinates [xmin, ymin, xmax, ymax], resize the patch to resnet_input, and store it with its class label. Do the same for the training/validation and test datasets. <br/>
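
A small sketch of the annotation parsing step, reading each field by tag name with `xml.etree.ElementTree`. The XML string below is a made-up, trimmed-down stand-in for a real annotation file, and `parse_objects` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PASCAL VOC layout (all values are made up).
SAMPLE_XML = """
<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>chair</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>
"""

TARGET_CLASSES = ("aeroplane", "bottle", "chair")


def parse_objects(root, wanted=TARGET_CLASSES):
    """Return (class_name, [xmin, ymin, xmax, ymax]) for every wanted object."""
    found = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        if name not in wanted:
            continue  # skip the other VOC classes
        bndbox = obj.find("bndbox")
        box = [int(float(bndbox.find(tag).text))
               for tag in ("xmin", "ymin", "xmax", "ymax")]
        found.append((name, box))
    return found


objects = parse_objects(ET.fromstring(SAMPLE_XML.strip()))
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`. Looking fields up by tag avoids relying on the order of child elements.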
##### Important
You also have to collect data for an extra background class, which stands for anything that is not an object of the 20 classes. For this, you can crop and resize random patches from an image. A good idea is to extract patches that have low "intersection over union" (IoU) with every annotated object of the 20 Pascal VOC classes in the image. The number of background images should be roughly the same as the number of images of each object class. The total number of classes is therefore four. This is important for applying the sliding window method later.
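
The low-IoU idea could be sketched as follows: compute the IoU between a random candidate patch and every annotated box, and accept the patch only if all values stay below a cut-off. The helper name `sample_background_box`, the patch size of 64, and the cut-off of 0.1 are assumptions chosen for illustration.

```python
import random


def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_background_box(width, height, object_boxes,
                          patch=64, max_iou=0.1, tries=100, rng=random):
    """Try random square patches; return the first one with low IoU, else None."""
    for _ in range(tries):
        x = rng.randint(0, width - patch)
        y = rng.randint(0, height - patch)
        candidate = [x, y, x + patch, y + patch]
        if all(iou(candidate, box) <= max_iou for box in object_boxes):
            return candidate
    return None  # no acceptable patch found within the allowed tries
```

A returned box can be cropped with `img[y1:y2, x1:x2]` and resized like the object patches.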


In [0]:
import xml.etree.ElementTree as ET
import cv2
import os
import glob
import numpy as np
import operator

classes=['chair','bottle','aeroplane'] 
allclass=['aeroplane','bicycle','bird','boat','bottle','bus','car','cat','chair','cow',
          'diningtable','dog','horse','motorbike','person','pottedplant','sheep','sofa',
          'train','tvmonitor']   

In [0]:
#extracting chairs, bottles and aeroplanes
key='train'
path='/content/drive/My Drive/CS783_A3/resized_full_'+key
os.makedirs(path, exist_ok=True)

for title in classes:
    os.makedirs(os.path.join(path, title), exist_ok=True)

#sort both lists so that annotations[j] and imagedict[j] refer to the same image
annotations = sorted(glob.glob('/content/drive/My Drive/CS783_A3/VOCtrainval_06-Nov-2007/VOCdevkit/VOC2007/Annotations/*.xml'))
imagedict = sorted(glob.glob('/content/drive/My Drive/CS783_A3/VOCtrainval_06-Nov-2007/VOCdevkit/VOC2007/JPEGImages/*.jpg'))


#same for both
counts = {title: 0 for title in classes}#running index per class, used as the file name
for j in range(len(imagedict)):
    tree = ET.parse(annotations[j])
    root = tree.getroot()
    img = cv2.imread(imagedict[j])
    for obj in root.iter('object'):
        title = obj.find('name').text
        if title not in classes:
            continue
        bndbox = obj.find('bndbox')
        #box is [xmin, ymin, xmax, ymax]
        box = [int(np.rint(float(bndbox.find(tag).text))) for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        crop = img[box[1]:box[3], box[0]:box[2]]
        crop = cv2.resize(crop, (224, 224))
        #write into the class sub-folder (the original path was missing a separator)
        cv2.imwrite(os.path.join(path, title, str(counts[title]) + '.jpg'), crop)
        counts[title] += 1


In [0]:

#for extracting background

def distance(p0, p1):
            return np.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2)

      
#sort both lists so that annotations[j] and imagedict[j] refer to the same image
annotations = sorted(glob.glob('/content/drive/My Drive/CS783_A3/VOCtrainval_06-Nov-2007/VOCdevkit/VOC2007/Annotations/*.xml'))
imagedict = sorted(glob.glob('/content/drive/My Drive/CS783_A3/VOCtrainval_06-Nov-2007/VOCdevkit/VOC2007/JPEGImages/*.jpg'))

final=[]
k=0

for j in range(len(imagedict)):
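    img = cv2.imread(imagedict[j])#take image
    #note: the corner logic below hard-codes a 500x375 frame; images of other sizes are handled only through NumPy's slice clipping
    tree = ET.parse(annotations[j])#xml parse
    root = tree.getroot()#get root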
    
    
    image=[]#this will store the bounding box of a single object present in an image
    bb=[]#this contains the list of bounding boxes for all objects in an image
    back=[]#this contains the list of all background patches extracted from an image
           #it holds 1 patch if no objects are present (the entire picture is background), 8 if a single object is present, 12 otherwise
                                                                                        
    
    for obj in root.iter('object'):
        title=obj[0].text
        if(title in allclass):
            for bndbox in obj.iter('bndbox'):
                image=[]
                for i in range(4):
                    image.append(int(np.rint(float(bndbox[i].text))))
                bb.append(image)
            
    if(len(bb)==0):#no object present, then whole image is background
        back.append(img)
        backsort = back.copy()
    
    else:#if one or more objects present, extract 8 backgrounds, 2 from each corner (one horizontal, one vertical)
        k=0
        xmin=[]
        xmax=[]
        ymin=[]
        ymax=[]
        for i in range(len(bb)):
            xmin.append(bb[i][0])
            xmax.append(bb[i][2])
            ymin.append(bb[i][1])
            ymax.append(bb[i][3])
            k+=1
        
        c1=[(a,b) for a,b in zip(xmin,ymin)]#top left corner
        c2=[(a,b) for a,b in zip(xmax,ymin)]#top right corner
        c3=[(a,b) for a,b in zip(xmin,ymax)]#bottom left corner
        c4=[(a,b) for a,b in zip(xmax,ymax)]#bottom right corner
        
        #C1
        l=0
        u=0
        dist=[]
        for p in c1:
            dist.append(distance((0,0),p))
        point = c1[dist.index(min(dist))]#closest top-left to (0,0)
        r1=point[0]
        if(r1==min(xmin)):
            d1=375
        else:
            d1=min([x[1] for x in bb if x[0]<r1])
        d2=point[1]
        if(d2==min(ymin)):
            r2=500
        else:
            r2=min([x[0] for x in bb if x[1]<d2])
        back.append(img[u:d1,l:r1])
        back.append(img[u:d2,l:r2])
            
        #C2
        r=500
        u=0
        dist=[]
        for p in c2:
            dist.append(distance((500,0),p))
        point = c2[dist.index(min(dist))]#closest top-right to (500,0)
        l1=point[0]
        if(l1==max(xmax)):
            d1=375
        else:
            d1=min([x[1] for x in bb if x[2]>l1])
        d2=point[1]
        if(d2==min(ymin)):
            l2=0
        else:
            l2=max([x[2] for x in bb if x[1]<d2])
        back.append(img[u:d1,l1:r])
        back.append(img[u:d2,l2:r])
        
        #C3
        l=0
        d=375
        dist=[]
        for p in c3:
            dist.append(distance((0,375),p))#closest bottom-left to (0,375)
        point = c3[dist.index(min(dist))]
        r1=point[0]
        if(r1==min(xmin)):
            u1=0
        else:
            u1=max([x[3] for x in bb if x[0]<r1])
        u2=point[1]
        if(u2==max(ymax)):
            r2=500
        else:
            r2=min([x[0] for x in bb if x[3]>u2])
        back.append(img[u1:d,l:r1])
        back.append(img[u2:d,l:r2])
        
        #C4
        r=500
        d=375
        dist=[]
        for p in c4:
            dist.append(distance((500,375),p))
        point = c4[dist.index(min(dist))]#closest bottom right to (500,375)
        l1=point[0]
        if(l1==max(xmax)):
            u1=0
        else:
            u1=max([x[3] for x in bb if x[2]>l1])
        u2=point[1]
        if(u2==max(ymax)):
            l2=0
        else:
            l2=max([x[2] for x in bb if x[3]>u2])
        back.append(img[u1:d,l1:r])
        back.append(img[u2:d,l2:r])
        
        if(len(bb)>1):#if more than 1 object present, extract 4 more backgrounds from within detected objects
            
            leftright = min(xmax)
            ilr = xmax.index(leftright)
            rightleft = max(xmin)
            irl = xmin.index(rightleft)
            topbot = min(ymax)
            itb = ymax.index(topbot)
            bottop = max(ymin)
            ibt = ymin.index(bottop)
            
            bb1=bb.copy()
            bb2=bb.copy()
            bb1.remove(bb[ilr])
            if(ilr!=irl):
                bb1.remove(bb[irl])
            bb2.remove(bb[itb])
            if(itb!=ibt):
                bb2.remove(bb[ibt])
            
            if(len(bb1)==0):
                toptop=0
                botbot=375
            else:
                toptop=min([x[1] for x in bb1])
                botbot=max([x[3] for x in bb1])
            if(len(bb2)==0):
                leftleft=0
                rightright=500
            else:
                leftleft=min(x[0] for x in bb2)
                rightright=max(x[2] for x in bb2)
            
            im1 = img[0:toptop,leftright:rightleft]
            im2 = img[botbot:375,leftright:rightleft]
            im3 = img[topbot:bottop,0:leftleft]
            im4 = img[topbot:bottop,rightright:500]
            
            back.append(im1)
            back.append(im2)
            back.append(im3)
            back.append(im4)
        
        val={}
        backsort=[]
        for i in range(len(back)):
            val[i] = back[i].size#create dictionary based on size for backgrounds extracted from this image
            
        valsort = sorted(val.items(), key=operator.itemgetter(1),reverse=True)#sort in descending order to get maximum area backgrounds from that image
        backsort=[back[x[0]] for x in valsort]
        backsort = backsort[:len(bb)]#take as many backgrounds as there are objects in that image

    final.append(backsort)#stores a list of all backgrounds extract from entire training data



full=[]    
sizes=[]
for lst in final:
    sizes.append(len(lst))
    for img in lst:
        if(img.size!=0):
            full.append(img)#store all backgrounds

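#the patches have different shapes, so store them in an object array (np.save pickles it)
full_arr = np.empty(len(full), dtype=object)
for i, patch in enumerate(full):
    full_arr[i] = patch
np.save('/content/drive/My Drive/CS783_A3/all_backgrounds.npy',full_arr)#store for future use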


#sorting full acc. to size        
val={}
fullsort=[]
for i in range(len(full)):
    val[i] = full[i].size#create dictionary based on size for entire dataset
    
valsort = sorted(val.items(), key=operator.itemgetter(1),reverse=True)#sort in descending order to get maximum area backgrounds from entire dataset
fullsort=[full[x[0]] for x in valsort]


#for limited background samples, slice fullsort accordingly
fullsort = fullsort[10000:]# we keep the smallest backgrounds, as they contain the least information and truly serve as background

#create background folder
k=0
path='/content/drive/My Drive/CS783_A3/resized_full_train/background'
os.mkdir(path)

for img in fullsort:
    res=cv2.resize(img,(224,224))
    cv2.imwrite(path+'/'+str(k)+'.jpg',res)#write images to folder
    k+=1    

## Train the network
<br/>You can train the network on the created dataset. This yields a classification network over 4 classes: the three VOC classes plus background.

In [0]:
train_dir = '/content/drive/My Drive/CS783_A3/resized_full_train'

def load_train(datadir):
    train_transforms = transforms.Compose([transforms.Resize(224), transforms.ToTensor(), transforms.RandomHorizontalFlip()])
    train_data = datasets.ImageFolder(datadir, transform=train_transforms)
    trainloader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
    return trainloader

trainloader = load_train(train_dir)

### Fine-tuning
Fine-tune the pre-trained network in the following section:





ONE LAYER 

In [0]:
#code for using CUDA
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [0]:
model = models.resnet18(pretrained=True)

In [0]:
for param in model.parameters():#freezing all layers
    param.requires_grad = False
    
num_ftrs = model.fc.in_features

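# The head ends in Softmax so that the detection code can threshold class probabilities.
# Note that nn.CrossEntropyLoss already applies log-softmax to its input, so training on
# these normalized outputs tends to weaken the gradients; returning raw logits and applying
# softmax only at inference is the more common arrangement.
model.fc = nn.Sequential(nn.ReLU(),nn.Linear(num_ftrs, 4),nn.Softmax(dim=1))
criterion = nn.CrossEntropyLoss()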
optimizer = optim.Adam(model.fc.parameters(), lr=learning_rate)

In [0]:
model.to(device)#for seeing Resnet Architecture

In [0]:
#for checking model summary and finding number of trainable parameters
from torchsummary import summary
summary(model,(3,224,224))

SINGLE LAYER TRAINING

In [0]:
#declaring training parameters
epochs = 5
steps = 0#for counting steps
running_loss = 0
print_every = 10#to print loss after every print_every steps
train_losses = []

In [0]:
model.train()#set training mode before the first step
for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if steps % print_every == 0:
            #average loss over the last print_every steps
            train_losses.append(running_loss/print_every)
            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. ")
            running_loss = 0

In [0]:
#saving the model  
torch.save(model, '/content/drive/My Drive/CS783_A3/single_layer')

Model for second layer detection
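
As a worked check of which feature map the second detector sees, the spatial size after each ResNet18 stage can be derived from the standard convolution output formula, floor((n + 2p - k) / s) + 1. The sketch below lists only the layer that changes the resolution in each stage of torchvision's ResNet18; `feature_map_sizes` is an illustrative helper name.

```python
def conv_out(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (stage, kernel, stride, padding, output channels) for torchvision's ResNet18
STAGES = [
    ("conv1", 7, 2, 3, 64),
    ("maxpool", 3, 2, 1, 64),
    ("layer1", 3, 1, 1, 64),
    ("layer2", 3, 2, 1, 128),
    ("layer3", 3, 2, 1, 256),
    ("layer4", 3, 2, 1, 512),
]


def feature_map_sizes(input_size=224):
    """Return {stage: (channels, height, width)} for a square input."""
    sizes = {}
    n = input_size
    for name, kernel, stride, padding, channels in STAGES:
        n = conv_out(n, kernel, stride, padding)
        sizes[name] = (channels, n, n)
    return sizes


sizes = feature_map_sizes(224)
for name, shape in sizes.items():
    print(name, shape)
```

For a 224x224 input this gives a 256-channel 14x14 map after layer3 and a 512-channel 7x7 map after layer4, which is why a head attached after layer3 takes 256 input features.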

In [0]:
#model obtained by dropping layer4, avgpool and fc, so the features end at layer3 (256 channels); earlier pretrained layers are kept

class ResNet18_new(nn.Module):
    def __init__(self, model):
        super(ResNet18_new, self).__init__()
        self.features = nn.Sequential(*list(model.children())[:-3])
        self.adpavg2dpool=nn.AdaptiveAvgPool2d(output_size=(1, 1))
        self.fc=nn.Sequential(nn.Linear(256,4),nn.Softmax(dim=1))
        ct = 0 
        for child in self.features.children():#freeze the first six children (conv1 through layer2); layer3 keeps requires_grad=True
            ct += 1
            if ct < 7:
                for param in child.parameters():
                    param.requires_grad = False
    def forward(self, x):
        x = self.features(x)
        x= self.adpavg2dpool(x)
        x = x.view(x.size(0), - 1)
        x=self.fc(x)                                       
        return x
base_model = models.resnet18(pretrained=True)#separate name, so the trained single-layer `model` is not overwritten
res18 = ResNet18_new(base_model)
res18.to(device)

In [0]:
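criterion = nn.CrossEntropyLoss()
#only the new fc head is optimized here; layer3 has requires_grad=True but is not updated
#unless its parameters are also passed to the optimizer
optimizer = optim.Adam(res18.fc.parameters(), lr=0.003)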

In [0]:
#declaring training parameters

epochs = 5
steps = 0#for counting steps
running_loss = 0
print_every = 10#to print loss after every print_every steps
train_losses = []

training of model for second layer detection

In [0]:
res18.train()#set training mode before the first step
for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = res18(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if steps % print_every == 0:
            #average loss over the last print_every steps
            train_losses.append(running_loss/print_every)
            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. ")
            running_loss = 0

In [0]:
torch.save(res18, '/content/drive/My Drive/CS783_A3/second_layer')

# Testing and Accuracy Calculation
For detection, use a sliding window method to test the trained network on the detection task:<br/>
Take windows of varying sizes and aspect ratios and slide them across the test image with a fixed stride, from left to right and top to bottom. Compute the class scores for each window and keep only the windows whose score exceeds a threshold. A similar idea appears in Faster R-CNN (Ren, He, Girshick, and Sun), which uses three scales and three aspect ratios, giving nine anchor boxes at each sliding position. Write this code and use it in the testing code to find the predicted boxes and their classes.
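
One way to sketch the scale and aspect-ratio scheme: for a scale s and a width-to-height ratio r, use width s·sqrt(r) and height s/sqrt(r), so every window at a given scale covers roughly the same area. The scales (32, 64, 128), the ratios (0.5, 1, 2), and the helper names `make_window_sizes` and `slide` are illustrative assumptions, not values prescribed by the assignment.

```python
import math


def make_window_sizes(scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs, one per (scale, aspect ratio) combination."""
    sizes = []
    for s in scales:
        for r in ratios:
            w = int(round(s * math.sqrt(r)))
            h = int(round(s / math.sqrt(r)))
            sizes.append((w, h))
    return sizes


def slide(image_size, window, stride):
    """Yield every (xmin, ymin, xmax, ymax) window that fits in a square image."""
    w, h = window
    for y in range(0, image_size - h + 1, stride):
        for x in range(0, image_size - w + 1, stride):
            yield (x, y, x + w, y + h)


window_sizes = make_window_sizes()
all_windows = [box for size in window_sizes for box in slide(224, size, 20)]
```

Each box in `all_windows` would then be cropped, resized to the network input size, and scored by the classifier.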

In [0]:
def sliding_window(size, stepSize, windowSize):
    # slide a (width, height) window across a size x size image
    for y in range(0, size, stepSize):
        for x in range(0, size, stepSize):
            # yield [xmin, ymin, xmax, ymax] only if the window fits inside the image
            if x + windowSize[0] <= size and y + windowSize[1] <= size:
                yield [x, y, x + windowSize[0], y + windowSize[1]]

Apply non-maximum suppression to reduce the number of boxes. You are free to choose the overlap threshold, but choose it carefully; it must lie in [0, 1].
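
A common formulation of non-maximum suppression ranks the candidates by confidence score, keeps the best one, and discards every remaining box whose IoU with it exceeds the threshold. The following is a minimal pure-Python sketch of that variant, assuming boxes are [xmin, ymin, xmax, ymax] lists with one score each; the names `iou` and `nms` are illustrative.

```python
def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Return the indices of the boxes kept after score-ranked suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring box still in play
        keep.append(best)
        # drop every remaining box that overlaps the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```

In a multi-class setting this would typically be run once per class, so that boxes of different classes do not suppress each other.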

In [0]:
def non_max_suppression_fast(boxes, overlapThresh):
	# if there are no boxes, return an empty list
	if len(boxes) == 0:
		return []
	# if the bounding boxes are integers, convert them to floats --
	# this is important since we'll be doing a bunch of divisions
	if boxes.dtype.kind == "i":
		boxes = boxes.astype("float")
	# initialize the list of picked indexes	
	pick = []
	# grab the coordinates of the bounding boxes
	x1 = boxes[:,0]
	y1 = boxes[:,1]
	x2 = boxes[:,2]
	y2 = boxes[:,3]
	# compute the area of the bounding boxes and sort the bounding
	# boxes by the bottom-right y-coordinate of the bounding box
	area = (x2 - x1 + 1) * (y2 - y1 + 1)
	idxs = np.argsort(y2)
	# keep looping while some indexes still remain in the indexes
	# list
	while len(idxs) > 0:
		# grab the last index in the indexes list and add the
		# index value to the list of picked indexes
		last = len(idxs) - 1
		i = idxs[last]
		pick.append(i)
		# find the largest (x, y) coordinates for the start of
		# the bounding box and the smallest (x, y) coordinates
		# for the end of the bounding box
		xx1 = np.maximum(x1[i], x1[idxs[:last]])
		yy1 = np.maximum(y1[i], y1[idxs[:last]])
		xx2 = np.minimum(x2[i], x2[idxs[:last]])
		yy2 = np.minimum(y2[i], y2[idxs[:last]])
		# compute the width and height of the bounding box
		w = np.maximum(0, xx2 - xx1 + 1)
		h = np.maximum(0, yy2 - yy1 + 1)
		# compute the ratio of overlap
		overlap = (w * h) / area[idxs[:last]]
		# delete all indexes from the index list whose overlap exceeds the threshold
		idxs = np.delete(idxs, np.concatenate(([last],
			np.where(overlap > overlapThresh)[0])))
	# return only the bounding boxes that were picked using the
	# integer data type
	return boxes[pick].astype("int")

Single layer detection

Test the trained model on the test dataset.
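
Boxes found on an image that was resized to 224x224 are expressed in that resized frame. To compare them with the VOC annotations, which use the original image size, they can be scaled back. A minimal sketch, with `to_original_coords` as a hypothetical helper name:

```python
def to_original_coords(box, orig_width, orig_height, resized=224):
    """Map [xmin, ymin, xmax, ymax] from the resized frame to the original image."""
    sx = orig_width / resized   # horizontal scale factor
    sy = orig_height / resized  # vertical scale factor
    x1, y1, x2, y2 = box
    return [int(round(x1 * sx)), int(round(y1 * sy)),
            int(round(x2 * sx)), int(round(y2 * sy))]


full_frame = to_original_coords([0, 0, 224, 224], orig_width=500, orig_height=375)
```

The original width and height can be taken from `img.shape` before the resize.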

In [0]:
import os
import glob
import cv2
from torch.autograd import Variable
from PIL import Image, ImageDraw, ImageFont

ONE LAYER DETECTION

In [0]:
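thresh=0.5
path1='/content/drive/My Drive/CS783_A3/VOCtest_06-Nov-2007/VOCdevkit/VOC2007/JPEGImages'
path2='/content/drive/My Drive/CS783_A3/outputs'
os.makedirs(path2, exist_ok=True)
images=glob.glob(path1+'/*.jpg')
#reload the trained single-layer model (on PyTorch >= 2.6, pass weights_only=False to load a full pickled module)
model = torch.load('/content/drive/My Drive/CS783_A3/single_layer', map_location=device)
model.eval()#inference mode for batch norm
#ImageFolder assigns labels in alphabetical order of the folder names
label_names=['aeroplane','background','bottle','chair']
window_sizes=[(64,64),(64,128),(128,64),(24,48),(24,24),(48,24),(104,104),(104,208),(208,104)]
to_tensor=transforms.ToTensor()
for im in images:
    name=os.path.splitext(os.path.basename(im))[0]
    #OpenCV loads BGR, while ImageFolder fed RGB images during training
    img=cv2.cvtColor(cv2.imread(im), cv2.COLOR_BGR2RGB)
    img=cv2.resize(img,(224,224))
    candidates=[]#reset for every image
    for window_size in window_sizes:
        for window in sliding_window(224, 20, window_size):
            crop=img[window[1]:window[3],window[0]:window[2]]
            res=cv2.resize(crop,(224,224))
            t=to_tensor(res).unsqueeze(0).to(device)
            with torch.no_grad():
                out=model(t)[0].cpu().numpy()#class probabilities (the head ends in Softmax)
            label=int(out.argmax())
            #keep confident windows that are not background
            if out[label]>=thresh and label_names[label]!='background':
                candidates.append(window+[label])
    output_nms=non_max_suppression_fast(np.array(candidates), 0.3)
    #save one "class xmin ymin xmax ymax" line per box, in 224x224 image coordinates
    with open(path2+'/'+name+'.txt','w') as f:
        for a in output_nms:
            f.write(label_names[int(a[4])]+' '+str(a[0])+' '+str(a[1])+' '+str(a[2])+' '+str(a[3])+'\n')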

TWO LAYER DETECTION

In [0]:
#defining custom architecture
class ResNet18_new(nn.Module):
    def __init__(self, model):
        super(ResNet18_new, self).__init__()
        self.features = nn.Sequential(*list(model.children())[:-3])
        self.adpavg2dpool=nn.AdaptiveAvgPool2d(output_size=(1, 1))
        self.fc=nn.Linear(256,4)
        ct = 0 
        for child in self.features.children():
            ct += 1
            if ct < 7:
                for param in child.parameters():
                    param.requires_grad = False
    def forward(self, x):
        x = self.features(x)
        x= self.adpavg2dpool(x)
        x = x.view(x.size(0), - 1)
        x=self.fc(x)                                       
        return x

model2 = models.resnet18(pretrained=True)
res18 = ResNet18_new(model2)

In [0]:
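thresh=0.5
path1='/content/drive/My Drive/CS783_A3/VOCtest_06-Nov-2007/VOCdevkit/VOC2007/JPEGImages'
#separate folder (name chosen here) so the single-layer outputs are not overwritten
path2='/content/drive/My Drive/CS783_A3/outputs_second_layer'
os.makedirs(path2, exist_ok=True)
images=glob.glob(path1+'/*.jpg')
#reload the trained second-layer model (on PyTorch >= 2.6, pass weights_only=False to load a full pickled module)
res18 = torch.load('/content/drive/My Drive/CS783_A3/second_layer', map_location=device)
res18.eval()#inference mode for batch norm
#ImageFolder assigns labels in alphabetical order of the folder names
label_names=['aeroplane','background','bottle','chair']
window_sizes=[(64,64),(64,128),(128,64),(24,48),(24,24),(48,24),(104,104),(104,208),(208,104)]
to_tensor=transforms.ToTensor()
for im in images:
    name=os.path.splitext(os.path.basename(im))[0]
    #OpenCV loads BGR, while ImageFolder fed RGB images during training
    img=cv2.cvtColor(cv2.imread(im), cv2.COLOR_BGR2RGB)
    img=cv2.resize(img,(224,224))
    candidates=[]#reset for every image
    for window_size in window_sizes:
        for window in sliding_window(224, 20, window_size):
            crop=img[window[1]:window[3],window[0]:window[2]]
            res=cv2.resize(crop,(224,224))
            t=to_tensor(res).unsqueeze(0).to(device)
            with torch.no_grad():
                out=res18(t)[0].cpu().numpy()#scores from the layer3-based detector, not `model`
            label=int(out.argmax())
            #keep confident windows that are not background
            if out[label]>=thresh and label_names[label]!='background':
                candidates.append(window+[label])
    output_nms=non_max_suppression_fast(np.array(candidates), 0.3)
    #save one "class xmin ymin xmax ymax" line per box, in 224x224 image coordinates
    with open(path2+'/'+name+'.txt','w') as f:
        for a in output_nms:
            f.write(label_names[int(a[4])]+' '+str(a[0])+' '+str(a[1])+' '+str(a[2])+' '+str(a[3])+'\n')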