<a href="https://www.nvidia.com/dli"> <img src="imgs/header.png" alt="Header" style="width: 400px;"/> </a>

<h1 align="center">Deep Learning for Intelligent Video Analytics</h1>
<h4 align="center">(Part 1)</h4>


<img src="imgs/intro.gif" alt="AFRL1" style="margin-top:50px"/>
<p style="text-align: center;color:gray"> Figure 1. Real-time object detection for "vehicle" class  </p>

Welcome to the *Deep Learning for Intelligent Video Analytics (IVA)* course! 

Billions of cameras generate immense volumes of video data every day. Extracting actionable insights at this scale, such as identifying, tracking, segmenting and predicting different types of objects, is a non-trivial task, and deep learning is widely regarded as one of the most effective ways to deal with data at such a high rate and scale. Optimizing retail stores and merchandising based on people flow, managing congestion, analyzing traffic patterns that precede collisions, and smart parking systems are a few examples of deep-learning-based Intelligent Video Analytics (IVA) applications. 

In this workshop, you'll learn how to:

- Efficiently process and prepare video feeds using hardware accelerated decoding methods (lab 1 and lab 3)
- Train and evaluate deep learning models and leverage "transfer learning" techniques to elevate efficiency and accuracy of these models and mitigate data sparsity issues (lab 2)
- Explore the strategies and trade-offs involved in developing high-quality neural network models to track moving objects in large-scale video datasets (lab 2)
- Deploy end-to-end accelerated video analytics solutions using __DeepStream SDK__ (lab 3)


Upon completion, you'll be able to design, train, test and deploy building blocks of a hardware-accelerated traffic management system based on parking lot camera feeds.


#### Prerequisites
Having previous knowledge of video processing methods, deep learning models and object-detection algorithms is beneficial, but not necessary. We also assume the learner is familiar with the foundational concepts of programming, especially with __Python__ and __C++__.



#### About this Jupyter notebook
Before we get started, there are a few items to consider about this Jupyter notebook:

1. The notebook is rendered in your browser, but the contents are streamed by an interactive IPython kernel running on a GPU-enabled instance.

2. The notebook is composed of cells; cells can contain code which you can run, or they can hold text and/or images which are there for you to read.

3. You can execute code cells by clicking the ```Run``` icon in the menu, or via the following keyboard shortcuts ```Shift-Enter``` (run and advance) or ```Ctrl-Enter``` (run and stay in the current cell).

4. To interrupt cell execution, click the ```Stop``` button on the toolbar or navigate to the ```Kernel``` menu and select ```Interrupt```.


This tutorial covers the following topics:

* [1. INTRODUCTION](#1)
    * [1.1 Object detection, from still images to videos](#1-1)
    * [1.2 TensorFlow Object Detection API](#1-2)
    * [1.3 Annotations](#1-3)
* [2. Dataset: NVIDIA Endeavor Parking Dataset](#2)
* [3. Prepare data for the model](#3)
    * [3.1 Ingest raw annotation data into pandas DataFrame](#3-1)
    * [Exercise 1](#e1)
    * [Exercise 2](#e2)
* [4. Working with the video data](#4)
    * [4.1 Converting video file into frame images](#4-1)
    * [Exercise 3](#e3)
* [5. Inference](#5)
    * [5.1 Frame-by-Frame detection](#5-1)
    * [5.2 Quantitative analysis - Intersection over Union](#5-2)
* [6. Crop and Normalize the Annotations](#6)
    * [Exercise 4](#e4)
* [7. Create TFRecord files](#7)
    * [7.1 Encode annotations and images into TensorFlow Examples](#7-1)
    * [7.2 Create training and validation splits](#7-2)



<a name="1"></a>
## 1. INTRODUCTION

<a name="1-1"></a>
### 1.1 Object detection, from still images to videos

With the rapid increase in the number of traffic cameras, the broadening prospects of autonomous vehicles, and the promising outlook of “__smart cities__”, demand for faster and more efficient object detection and tracking models is rising. An American can be caught on camera more than 75 times per day, resulting in [4 billion hours](https://www.forbes.com/sites/singularity/2012/08/30/dear-republicans-beware-big-brother-is-watching-you/#4317353620da) of video footage per week to be processed and possibly passed through an object detection pipeline!

In general, object detection is the process of finding instances of pre-defined classes (e.g. pedestrians, animals, buildings, and cars) within images (frames) and video datasets. Despite being well explored in the domain of image processing, object detection has been studied less with regard to temporal movement and video data. To detect and classify objects in images, the prevailing deep learning approach is to first train a deep network on a sizable dataset -- *usually ImageNet or COCO*. The idea behind this step is to extract and model the visual features of the different species, models or other subclasses associated with the original class.  Later, the object detection inference step is conducted by bounding box regression over the areas of interest, followed by labeling the test images or videos.

While frame-by-frame detection saw a surge in the early days of deep-learning-based IVA applications, efforts have since shifted towards techniques that track objects over time. Compared to still images, dealing with video data requires more computational effort and must overcome the barriers of real-time data processing. Moreover, objects in video may be degraded, obscured, or show lower feature quality due to motion blur.

In addition, a large bottleneck in the processing, exploitation and dissemination chain is converting raw data into information. Many applications need to process thousands of hours of collected data. Each frame needs to be viewed, studied and converted into usable and actionable information. Artificial intelligence can help us reduce the burden on analysts who are required to perform this task.

In this course, we start with the naive approach to IVA by examining frame-by-frame data preparation and object detection models, and then explore more video-specific models that harness temporal object tracking.


<a name="1-2"></a>
### 1.2 TensorFlow Object Detection API


In this course, we are using the [TensorFlow Object Detection API](https://github.com/tensorflow/models/tree/master/research/object_detection). The API is an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models.  Within the object detection API, there are five different detection models based on recent advancements in deep learning:

1. Single Shot Multibox Detector ([SSD](https://arxiv.org/abs/1512.02325)) with [MobileNets](https://arxiv.org/abs/1704.04861)
2. SSD with [Inception v2](https://arxiv.org/abs/1512.00567)
3. [Region-Based Fully Convolutional Networks](https://arxiv.org/abs/1605.06409) (R-FCN) with [Resnet](https://arxiv.org/abs/1512.03385) 101
4. [Faster RCNN](https://arxiv.org/abs/1506.01497) with Resnet 101
5. Faster RCNN with [Inception Resnet v2](https://arxiv.org/abs/1602.07261)

In this lab, we will concentrate on training and testing SSD with Inception v2, Faster RCNN with Inception Resnet v2, and NasNet. We must be aware of some pitfalls including overfitting and variation in data. In the next lab we will learn more about handling large amounts of data using the `pandas` Python package.

For the task of detecting objects in streaming IVA data, each network has strengths and weaknesses that need to be considered when developing a deployable system. For example, for a given GPU, SSD can process data at or near common video frame rates (25 - 30 fps). However, although its accuracy is reasonable, it can produce __many false negatives and false alarms__, depending on the amount and variety of data we train with.  In contrast, NasNet produces very accurate detection proposals with less training data at the cost of processing speed: typically single-digit (or lower) fps for a given GPU. 
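
To put these frame rates in perspective, the sketch below estimates how long a single GPU would need to work through a backlog of recorded video at a given inference speed. The 1,000-hour backlog, the 30 fps recording rate, the 5 fps slow-model rate and the function name are illustrative assumptions, not measurements from this lab.

```python
def processing_hours(video_hours, video_fps, model_fps):
    """Hours needed to run a detector over `video_hours` of footage.

    video_fps: frame rate the footage was recorded at
    model_fps: frames per second the detector can process
    """
    total_frames = video_hours * 3600 * video_fps
    return total_frames / model_fps / 3600


# Hypothetical numbers: 1,000 hours of 30 fps footage
fast = processing_hours(1000, video_fps=30, model_fps=30)  # real-time detector
slow = processing_hours(1000, video_fps=30, model_fps=5)   # single-digit fps detector
print(fast, slow)
```

Under these assumptions, the slower model needs six times as long as the footage itself, which is why the speed and accuracy trade-off matters when choosing a network.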

Below, we will briefly review these three types of models.


#### Single-Shot Multibox Detector (SSD)

Since the introduction of the Region-based Convolutional Neural Networks (R-CNNs), the detection operation has been split into two sub-tasks: 

- __Localization__: Where the model uses regression to return the coordinates of the potential object within the frame (image). The network is trained with ground truth bounding-boxes, and L2 distance is used to measure the loss value between the ground truth and the regressed coordinates.

- __Classification__: Classification is the task of labeling the given frame with one of the classes that the model has been trained with. 
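
As a minimal illustration of the localization loss described above, the following sketch computes the L2 distance between a ground-truth box and a regressed box. The box values and the function name are hypothetical, and real detectors often use variants of this loss.

```python
import math


def l2_distance(box_a, box_b):
    """Euclidean (L2) distance between two boxes given as (xmin, ymin, xmax, ymax)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(box_a, box_b)))


ground_truth = (100, 50, 180, 110)   # hypothetical annotated box
predicted = (103, 54, 180, 110)      # hypothetical regressed box

loss = l2_distance(ground_truth, predicted)
print(loss)  # 5.0 for this pair, since sqrt(3**2 + 4**2) = 5
```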

Single-Shot Multibox Detector (SSD) networks combine the bounding-box localization and classification tasks in a single forward pass of the network. SSDs are built on top of the VGG-16 architecture, in which the fully connected layers are replaced by a series of new convolutional feature-extraction layers. Each of these layers outputs a set of *k* bounding boxes (based on *prior* information) together with their respective coordinates. Below, you can see the SSD network architecture:


<img src="imgs/ssd.jpg" alt="SSD" style="width: 800px;"/>
<p style="text-align: center;color:gray"> Figure 2. SSD Architecture</p>

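To make the idea of *prior* boxes more concrete, here is a simplified sketch that places *k* default boxes on every cell of a square feature map. The feature-map size, scale and aspect ratios are assumptions chosen for illustration. An actual SSD configuration uses several feature maps with different scales.

```python
import math


def default_boxes(feature_map_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) prior boxes, normalized to [0, 1], for a square feature map."""
    boxes = []
    for row in range(feature_map_size):
        for col in range(feature_map_size):
            # Center of the current feature-map cell
            cx = (col + 0.5) / feature_map_size
            cy = (row + 0.5) / feature_map_size
            # One box per aspect ratio, all sharing the same scale
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes


priors = default_boxes(feature_map_size=4, scale=0.3, aspect_ratios=(1.0, 2.0, 0.5))
print(len(priors))   # 4 * 4 cells * 3 boxes per cell = 48
print(priors[0])
```

The network then predicts, for each of these priors, a small offset to the box plus a class score, rather than regressing coordinates from scratch.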

#### Faster-RCNN

Faster-RCNN has a more complex architecture, with more moving parts and separate pieces compared to SSDs.
Unlike the SSD network, Faster-RCNN performs the localization and classification tasks in different networks. The localization network is called the Region Proposal Network (RPN). It has two outputs: a SoftMax layer whose class types are "foreground" and "background", and a regressor for the proposed "anchors".
Next, the original feature map, together with the outputs of the RPN, is fed into the second network, where the actual class labels are generated:

<img src="imgs/RCNN.jpg" alt="RCNN" style="width: 800px;"/>
<p style="text-align: center;color:gray"> Figure 3. Faster R-CNN Architecture</p>


#### NasNet

[NasNet](https://ai.googleblog.com/2017/11/automl-for-large-scale-image.html) is one of the most accurate models built so far, achieving 82.7% accuracy on the ImageNet validation set and surpassing all previous Inception models. NasNet uses an approach called AutoML to search for the layers that work best with the underlying dataset. In the case of NasNet, the results of applying AutoML to COCO and ImageNet are combined to form the NasNet architecture.


Later in this course, we will demonstrate how to develop more advanced systems by taking advantage of temporal feature correlation to increase both accuracy and performance. We will also make use of [__DeepStream__](https://developer.nvidia.com/deepstream-sdk) to enable system scaling for multiple video streams.

<img src="imgs/nas.jpg" alt="RCNN" style="width: 600px;"/>
<p style="text-align: center;color:gray"> Figure 4. AutoML reinforcement learning network selection</p>


<a name="1-3"></a>
### 1.3 Annotations

You often need to increase the number of training and test samples to train and evaluate object models. To do so, you need to expand the ground truth data. There are several proprietary and open source image mark-up tools. All videos used in this course are annotated using `Vatic`. For details of this annotation tool, please visit their [website](http://www.cs.columbia.edu/~vondrick/vatic/).

<br/>

<img src="imgs/vatic.jpg" alt="Vatic imaging" style="width: 800px;"/>
<p style="text-align: center;color:gray"> Figure 5. Vatic annotation tool </p>

In addition, the ontology and taxonomy of the annotations need to be considered carefully. They need to be descriptive and flexible enough to incorporate different object types. Also, this lab will only consider object detection, not segmentation or pixel-level classification. Those are achievable using techniques described in other DLI labs, including the Deep Learning for Object Segmentation lab. However, to train a model for such tasks, your data also needs to be labelled accordingly (polygons and masks in addition to bounding boxes). Therefore, when you are considering which object labelling tool to use, you must take the final goal of your task into account.
<br /><br />

With this introduction, we are going to start our inference task by introducing the Dataset in the next section.


<a name="2"></a>
## 2. Dataset: NVIDIA Endeavor Parking Dataset

For this course, we are using video files recorded at the NVIDIA headquarters parking lot. The videos are recorded using omnidirectional cameras, so straight lines appear curved in the raw footage. The raw files are therefore not directly useful for our video processing tasks and must first be unwarped. The videos we are using are already pre-processed and ready to use. Later in the course, when we work with the DeepStream SDK, we will learn how to unwarp the videos as part of the pipelines we will be building.

<img src="imgs/360.png" alt="Vatic imaging"/>
<p style="text-align: center;color:gray"> Figure 6. A sample of the 360 camera recording and its respective unwarped result using DeepStream's Gst-nvdewarper plugin</p>

The Endeavor parking lot dataset annotation is provided in JSON format. Each entry starts with a `track_id`, a unique index for each car within each recording. Each `track_id` holds a set of bounding boxes, one per frame, with their respective positions. Below, you can see the elements of the annotation format:

__track_id__ : the unique identifier of the vehicle in the video.
> __boxes__ : a collection of bounding box annotations, each indicating the position of the vehicle in a single frame.
> > __frame_id__ : a sequential integer indicating the frame number 
> > > __attributes__ : a set of *arbitrary* attributes, often describing the car make, model, color and park status.<br />
> > > __occluded__ : specifies if the car is fully visible or occluded.<br />
> > > __outside__ : indicating whether the car is located within the frame boundaries or outside.<br />
> > > __xbr__ : an integer in the range between [0, frame width], indicating the right-most location of the bounding box in coordinates relative to the frame size.<br />
> > > __xtl__ : an integer in the range between [0, frame width], indicating the left-most location of the bounding box in coordinates relative to the frame size.<br />
> > > __ybr__ : an integer in the range between [0, frame height], indicating the bottom-most location of the bounding box in coordinates relative to the frame size.<br />
> > > __ytl__ : an integer in the range between [0, frame height], indicating the top-most location of the bounding box in coordinates relative to the frame size.<br />

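As a rough sketch of how this nesting looks once loaded into Python, the dictionary below mimics a single track with two frames and walks it from track to frame to box fields. The field names follow the list above, while every value (including the `Sedan` attribute) is made up for illustration.

```python
# Hypothetical annotation for one vehicle; values are illustrative only
annotations = {
    "0": {                                   # track_id
        "boxes": {
            "10": {                          # frame_id
                "attributes": ["Sedan", "Moving"],
                "occluded": 0,
                "outside": 0,
                "xtl": 100, "ytl": 50, "xbr": 180, "ybr": 110,
            },
            "11": {
                "attributes": ["Sedan", "Moving"],
                "occluded": 0,
                "outside": 1,                # vehicle has left the frame
                "xtl": 100, "ytl": 50, "xbr": 180, "ybr": 110,
            },
        }
    }
}

# Walk the nesting: track -> frame -> box fields, keeping visible boxes only
visible = []
for track_id, track in annotations.items():
    for frame_id, box in track["boxes"].items():
        if box["outside"] == 0:
            visible.append((track_id, int(frame_id),
                            box["xtl"], box["ytl"], box["xbr"], box["ybr"]))

print(visible)
```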


Below, you can see a snapshot of the JSON file for a sample video:

<img src="imgs/json_structure.png" alt="Vatic imaging"/>
<p style="text-align: center;color:gray"> Figure 7. A snapshot of the JSON annotation file  </p>

Now, let's import the libraries that we will need during this course:

In [None]:
#%matplotlib notebook
%matplotlib inline
import pylab as pl
pl.rcParams['figure.figsize'] = (8, 4)
import os, sys, shutil
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import io
import base64
from IPython.display import HTML
from IPython.display import clear_output
from IPython import display
from matplotlib.pyplot import cm
import time
import cv2
import pickle
import json
import sort

from os.path import join

from mpl_toolkits.mplot3d import Axes3D

        
import pandas as pd

For the rest of this course, we are going to utilize a config file to access specifications of the __data__. Also, we will be using other config files referring to properties of our __models__.

In [None]:
import configparser
config = configparser.ConfigParser()
config.sections()
config.read("utils/iva.conf")
config = config["General"]
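
For readers who have not used `configparser` before, the sketch below shows roughly what such a file and lookup might look like. The key names are the ones used in this notebook, but the values are placeholders rather than the real contents of `utils/iva.conf`.

```python
import configparser

# Hypothetical contents: key names appear in this notebook,
# values are placeholders, not the real paths.
example_conf = """
[General]
JSON_Sample = /path/to/sample.json
Video_Sample = /path/to/sample.mp4
Test_Video_ID = sample_video
Base_Dest_Folder = /path/to/output
"""

parser = configparser.ConfigParser()
parser.read_string(example_conf)
general = parser["General"]

# Keys are case-insensitive; values are always returned as strings
print(general["Video_Sample"])
print(general["video_sample"] == general["Video_Sample"])
```

Because every value comes back as a string, numeric settings need an explicit conversion (for example with `int(...)` or `general.getint(...)`).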

Let's inspect some of the raw data to get an understanding of the type of data we will need to model. We have made a smaller version of one of the raw videos for viewing. Please note that we have resized the video and reduced the frame-rate so the playback works well in this environment.

In [None]:
def disp_video(fname):
    """Embed an mp4 file in the notebook as a base64-encoded HTML5 video."""
    import base64
    from IPython.display import HTML
    # Read the raw bytes; the context manager closes the file afterwards
    with open(fname, 'rb') as f:
        video = f.read()
    encoded = base64.b64encode(video)
    return HTML(data='''<video alt="test" width="640" height="480" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))

In [None]:
mp4_path = 'imgs/sample.mp4'
print ("Loading video...")
disp_video(mp4_path)

Also, let's inspect one of these files and see what the raw data looks like.

In [None]:
%%bash
head -c 1000 /dli/data/videos/126_206-A0-3.json

<a name="3"></a>
## 3. Prepare data for the model

In order to leverage the TensorFlow object detection API and measure the related KPIs, we need to convert our raw data into `Pandas DataFrame` objects.  Afterwards, we can combine the images and annotations for inference, and measure accuracy of our models.  

<a name="3-1"></a>
### 3.1 Ingest raw annotation data into pandas DataFrame

Later in this course, we need to convert our data into TensorFlow record files, or TFRecords for model training purposes.  These files are record-oriented binary files and are easily consumable by TensorFlow processes. The TFRecord specification encodes an image frame and all the annotations associated with that frame into a single row.  However, the annotation data provided in the parking lot dataset is organized by track_id, not frame_id.  This difference means the data needs to be sorted and organized, so it can effectively be converted into a TFRecord.  Pandas is going to be very useful for accomplishing this data pre-processing step.
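
The reorganization from track-oriented to frame-oriented data can be sketched in plain Python as follows. The tiny `by_track` dictionary is a hypothetical stand-in for the real annotation file, and the notebook itself performs this step with pandas.

```python
from collections import defaultdict

# Hypothetical track-oriented annotations (same nesting as the dataset's JSON)
by_track = {
    "0": {"boxes": {"10": {"xtl": 100, "ytl": 50, "xbr": 180, "ybr": 110},
                    "11": {"xtl": 104, "ytl": 50, "xbr": 184, "ybr": 110}}},
    "1": {"boxes": {"11": {"xtl": 300, "ytl": 200, "xbr": 360, "ybr": 240}}},
}

# Invert the nesting so that every frame lists all boxes that appear in it
by_frame = defaultdict(list)
for track_id, track in by_track.items():
    for frame_id, box in track["boxes"].items():
        by_frame[int(frame_id)].append({"track_id": track_id, **box})

for frame_id in sorted(by_frame):
    print(frame_id, [b["track_id"] for b in by_frame[frame_id]])
```

Each entry of `by_frame` now holds everything a single per-frame record needs: one frame number and all of its bounding boxes.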

For the next few exercises, we are going to work with data from a single video.  This includes the frames and the associated metadata.  It is possible to train on this amount of data, but we will later work with a pre-generated model that was created using the same approach with a larger number of entries.  These larger files and models have been pre-generated in the interest of time.

Let’s look at how we can ingest a raw annotation file into a DataFrame.  


In [None]:
with open(config["JSON_Sample"], 'r') as f: 
    data = pd.read_json(f)

All the data that was contained in the text file is now in the `data` variable. Let's explore this data and see how we can manipulate it. We can simply print the data or, for less verbose output, use the `head()` function to display the first few rows of the DataFrame.

In [None]:
print(data.iloc[0].head())

Due to the specifics of the Vatic annotation tool, there are several redundancies in the data file. For instance, when a vehicle leaves the frame, its bounding box annotation persists until the end of the video! The only way to filter out these annotations is to use the `outside` property, which is set to `1` as soon as the car leaves the field of view. Those annotations provide no benefit to our training or evaluation tasks and can be safely removed.

Another issue that requires our attention is the redundancy of frames. Many of the frames contain no cars at all! Many more (depending on the camera) contain only parked cars with no movement. Including these frames in the training dataset would lead to redundant or biased samples and negatively affect the training quality. To overcome this issue, we will only incorporate frames that contain moving vehicles and ignore the rest.

<img src="imgs/similars.jpg" alt="Vatic imaging"/>
<p style="text-align: center;color:gray"> Figure 8. Redundancy of annotated frames</p>

The following code snippet takes care of the issues mentioned above and generates a list of frames containing moving vehicles only.


In [None]:
tracks = data.keys()
frames_list = []
frame_existance = np.zeros(15000)

# Collect every frame id that contains at least one visible, moving vehicle
seen = set()
for track in tracks:
    boxes = data[track]["boxes"]
    for k, v in boxes.items():
        if v['outside'] == 0 and 'Moving' in v['attributes'] and k not in seen:
            seen.add(k)
            frames_list.append(k)

# Mark the selected frames in a 0/1 mask for plotting
for i in frames_list:
    frame_existance[int(i)] = 1

Let's take a look at the final set of frames that:
- Contain moving vehicles
- Are stripped of annotations with the `outside` property set to `1`

In [None]:

y_pos = np.arange(len(frame_existance))
pl.rcParams['figure.figsize'] = (18, 3)
 
plt.bar(y_pos, frame_existance, align='center', alpha=0.5)
plt.yticks([])

plt.title('Frame indices that include moving cars')
 
plt.show()

As you can see, only a tiny portion of the frames contain moving vehicles. So far, the number of frames to process has been reduced dramatically.

<a name="e1"></a>
### Exercise 1:

Calculate the percentage of frames including moving vehicles below:

In [None]:
# YOUR CODE GOES HERE

Click [here](#a1) for answer.

As we saw before, the highest level of the annotations is the `track_id` field. Further down, each bounding box is categorized under the frame numbers. We need to flatten this structure to obtain a more understandable and easier-to-manipulate __DataFrame__. In addition, the provided bounding boxes are annotated on a different frame size from what we have and need to be adjusted accordingly. The annotations were drawn on frames of size `(611, 480)`. Let's look at the frame size of the provided videos:

In [None]:
# get video frame size
input_video = cv2.VideoCapture(config["Video_Sample"])
retVal, im = input_video.read()
size = im.shape[1], im.shape[0]
input_video.release()
print("Video frame size (width, height):", size)

The following commented code flattens the DataFrame and normalizes the bounding boxes according to the existing frame sizes. This is a long and time-consuming process, so we have set a limit of `1` for the number of tracks to be processed (we will read the processed data from a text file instead). If you wish to see the working example, uncomment the following section by selecting the code and pressing `Ctrl + /` and run the code afterwards.

In [None]:
# print("processing length:", len(frames_list))
# annotated_frames = pd.DataFrame()
# ANNOTATE_SIZE = (611, 480)
# limit = 1  # set this limit to avoid time-consuming DataFrame generation

# if len(frames_list) > 0:
#     for i in range(len(tracks)):
        
#         # remove the following line if the DataFrame is not read from CSV file
#         if i == limit: break
#         boxes = data[list(tracks)[i]]["boxes"]
#         print("\rprocessing track no: {}".format(i), end = '')
#         for k, v in boxes.items():
            
#             if k in frames_list:#  and v['outside']!=1:
#                 # resizing the annotations
                
#                 xmin, ymin, xmax, ymax = v["xtl"], v["ytl"], v["xbr"], v["ybr"]
#                 xmin = int((float(xmin) / ANNOTATE_SIZE[0]) * size[0])
#                 xmax = int((float(xmax) / ANNOTATE_SIZE[0]) * size[0])
#                 ymin = int((float(ymin) / ANNOTATE_SIZE[1]) * size[1])
#                 ymax = int((float(ymax) / ANNOTATE_SIZE[1]) * size[1])
                
                
#                 annotated_frames = annotated_frames.append(pd.DataFrame({
#                     "frame_no": int(k),
#                     "track_id": [list(tracks)[i]],
#                     "occluded": [v["occluded"]],
#                     "outside": [v["outside"]],
#                     "xmin": [xmin],
#                     "ymin": [ymin],
#                     "xmax": [xmax],
#                     "ymax": [ymax],
#                     "label": ['vehicle'],
#                     "attributes": [','.join(v["attributes"])],
#                     "crop": [(0,0,0,0)],
#                     "camera": config["Test_Video_ID"]
#                 }), ignore_index=True)
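
The rescaling step in the commented cell can be isolated into a small helper, sketched below. The `scale_box` name, the sample boxes and the `(1222, 960)` video size are assumptions chosen so the result is easy to verify by hand. The sketch also collects rows in a plain list instead of appending to a DataFrame inside the loop.

```python
def scale_box(box, from_size, to_size):
    """Rescale (xmin, ymin, xmax, ymax) from one frame size to another.

    Sizes are (width, height) tuples.
    """
    sx = to_size[0] / from_size[0]
    sy = to_size[1] / from_size[1]
    xmin, ymin, xmax, ymax = box
    return int(xmin * sx), int(ymin * sy), int(xmax * sx), int(ymax * sy)


ANNOTATE_SIZE = (611, 480)    # annotation frame size stated above
video_size = (1222, 960)      # hypothetical video size, exactly 2x for easy checking

rows = []
for frame_no, box in [(10, (100, 50, 180, 110)), (11, (104, 50, 184, 110))]:
    xmin, ymin, xmax, ymax = scale_box(box, ANNOTATE_SIZE, video_size)
    rows.append({"frame_no": frame_no, "xmin": xmin, "ymin": ymin,
                 "xmax": xmax, "ymax": ymax, "label": "vehicle"})

print(rows[0])
```

A list like `rows` can be turned into a DataFrame in one call with `pd.DataFrame(rows)`, which is usually much faster than growing a DataFrame one row at a time.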

We have processed the frames offline and have written them to a text file. Next, we are going to retrieve the DataFrame.

In [None]:
import ast
annotated_frames = pd.read_csv(config['Path_To_DF_File'], converters={2:ast.literal_eval})
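
The `converters` argument deserves a short explanation. Assuming the column at position 2 holds tuple values such as `crop` (an assumption, since the file layout is not shown here), those tuples are stored as text in a CSV file and need to be parsed back. The standard-library sketch below illustrates the round trip with made-up data.

```python
import ast
import csv
import io

# Write one row whose last field is a tuple, much like a tuple-valued column saved to CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["frame_no", "label", "crop"])
writer.writerow([10, "vehicle", (0, 0, 0, 0)])

# Reading it back yields plain strings
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_crop = rows[1][2]
print(repr(raw_crop))            # the tuple comes back as text

# ast.literal_eval safely turns the text back into a Python tuple
crop = ast.literal_eval(raw_crop)
print(crop, type(crop).__name__)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it is a safer choice for parsing values read from a file.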

In [None]:
print("Length of the full DF object:", len(annotated_frames))
annotated_frames.head()

The annotated frames include the *outside* vehicles that need to be removed.

In [None]:
outside_filter = annotated_frames["outside"] == 0
annotated_frames = annotated_frames[outside_filter]

In [None]:
annotated_frames.head()

Let's find out how many objects labeled "occluded" are in the dataset. We can do this using handy Boolean filters.

In [None]:
occluded_filter = annotated_frames["occluded"] == 1
occluded_only = annotated_frames[occluded_filter]
print ('Total number of occluded objects: {}'.format(len(occluded_only)))
occluded_only.head()

<a name="e2"></a>
### Exercise 2:

In addition to the data columns we have used, the `annotated_frames` object contains some unstructured labels, stored in the `attributes` column. One of these values is the vehicle type (sedan, SUV, etc.). Try to find out how many of those vehicles are sedans:

In [None]:
#INSERT YOUR CODE HERE

Click [here](#a2) for answer.

<a name="4"></a>
## 4. Working with the video data

As you saw before, some cars are small relative to the scene. Additionally, the video is acquired with a non-square aspect ratio. These are things to keep in mind and take into consideration later, when we are setting up for training.

<a name="4-1"></a>
### 4.1 Converting video file into frame images

Because the object detection model operates on frame-based data, we will need to generate frames from the original movie file. To do so, we are going to use OpenCV to open the video file. We will be using the original mp4 file. In our case we are going to write out every annotated frame, but see if you can find a way to write out every nth frame.

In addition to converting video frames into `jpg` images, we are creating a video for which the annotations are displayed as bounding boxes:



In [None]:

colors = [(255, 255, 0), (255, 0, 255), (0, 255, 255), (0, 0, 255), (255, 0, 0), (0, 255, 0), (0, 0, 0), (255, 100, 0), (100, 255, 0), (100, 0, 255), (255, 0, 100)]

def save_images(video_path, image_folder, frames_list, annotated_frames,  video_out_path = '', fps=10):

    if not os.path.exists(image_folder):
        print("Creating image folder")
        os.makedirs(image_folder)
        
    input_video = cv2.VideoCapture(video_path)
    # Check that the video opened before reading; otherwise `im` would be None
    if not input_video.isOpened():
        print("Sorry, couldn't open video")
        return

    retVal, im = input_video.read()
    size = im.shape[1], im.shape[0]
    fourcc = cv2.VideoWriter_fourcc('h','2','6','4')
    output_video = cv2.VideoWriter(video_out_path, fourcc, fps, size)

    frameCount = 0
    index_ = 1
    
    while retVal:
        
        #print("\r Processing frame no:", frameCount, end = '')
        if str(frameCount) in frames_list:
            print("\rsaving frame no:{}, index:{} out of {}".format(frameCount,index_,len(frames_list)), end = '')
            
            cv2.imwrite(join(image_folder, '{}.jpg'.format(frameCount)), im)
            
            index_ += 1
            #print("frame:",'{}.jpg'.format(frameCount))
            frame_items = annotated_frames[annotated_frames["frame_no"]==int(frameCount)]
            for index, box in frame_items.iterrows():
                #print(box["crop"])
                xmin, ymin, xmax, ymax = box["xmin"], box["ymin"], box["xmax"], box["ymax"]
                xmin2, ymin2, xmax2, ymax2 = box["crop"][0], box["crop"][1], box["crop"][2], box["crop"][3]
                cv2.rectangle(im, (xmin, ymin), (xmax, ymax), colors[0], 1)
                cv2.rectangle(im, (int(xmin2), int(ymin2)), (int(xmax2), int(ymax2)), colors[1], 1)
            output_video.write(im)

        retVal, im = input_video.read()
        frameCount += 1

    input_video.release()
    output_video.release()
    return size        

Calling this method over our video sample will take a moment to complete.

In [None]:
save_images(config["Video_Sample"], 
            '{}/images/{}'.format(config["Base_Dest_Folder"], config["Test_Video_ID"]),
            frames_list,
            annotated_frames,
            '{}/videos/{}.mp4'.format(config["Base_Dest_Folder"], config["Test_Video_ID"]))

Let's sort the frames based on the frame_no and extract the number of unique vehicles in the entire scene:

In [None]:
annotated_frames = annotated_frames.sort_values(by=['frame_no'])
print("Number of unique track IDs in the video:", annotated_frames['track_id'].nunique())


We can also determine the average number of pixels that fall on each target class (vehicle in this case). This is a simple area calculation using the bounding box coordinates associated with each annotation. The plot will show the average area for each "track_id" as a bar chart.

In [None]:
import matplotlib.pyplot as plt

def calc_targ_area(row):
    area = (row['xmax'] - row['xmin']) * (row['ymax'] - row['ymin'])
    row['area'] = area
    return row

#filter for frames that include items
inside_items = annotated_frames[annotated_frames['outside']==0]

# Group the data by track_id and calculate the area for each annotation in that track
label_groups = inside_items.groupby(['track_id']).apply(calc_targ_area)
label_groups = label_groups.groupby(['track_id']).mean()


# Build up and view a histogram
y_pos = np.arange(len(label_groups))
plt.bar(y_pos, label_groups["area"], align='center', alpha=0.5)
plt.title('Average area of each vehicle in the video')
plt.xlabel("Track ID")
plt.ylabel("Area")
plt.show()
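
As an alternative sketch, the same per-track average can be computed with column arithmetic instead of a per-group `apply`. The small DataFrame below is made-up data, and only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical annotations for two tracks
frames = pd.DataFrame({
    "track_id": [0, 0, 1],
    "xmin": [10, 12, 50],
    "ymin": [20, 20, 60],
    "xmax": [30, 32, 60],
    "ymax": [40, 40, 80],
})

# Column arithmetic computes every box area at once
frames["area"] = (frames["xmax"] - frames["xmin"]) * (frames["ymax"] - frames["ymin"])

# Average area per track
mean_area = frames.groupby("track_id")["area"].mean()
print(mean_area)
```

Vectorized column operations like this tend to scale better than row-wise or group-wise Python functions when the DataFrame grows large.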


<a name="e3"></a>
### Exercise 3

Investigate the data some more.  It would be interesting to determine the average widths and heights of the target boxes.

In [None]:
# YOUR CODE GOES HERE

annotated_frames.head()

Click [here](#a3) for answer.

<a name="5"></a>
## 5. Inference


<a name="5-1"></a>
### 5.1 Frame-by-Frame detection

Before we run the training algorithm, we will review the inference process for Faster RCNN with ResNet, NasNet and SSD. Here, we use an inference graph, an optimized version of the model, to detect objects within frames of the video data. In the functions below, we load the graph, create a session and loop over the frames, running a feed-forward pass on each. A TensorFlow graph defines the dependencies between the operations of your model, and a TensorFlow session runs parts of the graph across one or more devices. For more information on graphs and sessions, refer to the TensorFlow documentation.
The model also provides scores, bounding box locations and classes for a predefined number of proposals, which we can alter in the model config file. Reducing the number of proposals will increase speed but may negatively impact the accuracy of the model.
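
To illustrate how these outputs are typically consumed, the sketch below keeps only detections above a score threshold and converts their boxes to pixel coordinates. It relies on the TensorFlow Object Detection API convention that `detection_boxes` are normalized and ordered `[ymin, xmin, ymax, xmax]`. The function name, sample values and frame size are hypothetical.

```python
def filter_detections(boxes, scores, classes, threshold, width, height):
    """Keep detections scoring at least `threshold` and convert boxes to pixels.

    boxes are normalized [ymin, xmin, ymax, xmax] lists.
    """
    kept = []
    for box, score, cls in zip(boxes, scores, classes):
        if score < threshold:
            continue
        ymin, xmin, ymax, xmax = box
        kept.append({
            "class": int(cls),
            "score": score,
            # Reordered to (xmin, ymin, xmax, ymax) in pixels
            "box": (int(xmin * width), int(ymin * height),
                    int(xmax * width), int(ymax * height)),
        })
    return kept


# Hypothetical outputs for one 1280x720 frame
boxes = [[0.25, 0.5, 0.5, 0.75], [0.1, 0.1, 0.2, 0.2]]
scores = [0.9, 0.3]
classes = [1.0, 1.0]

detections = filter_detections(boxes, scores, classes, threshold=0.5, width=1280, height=720)
print(detections)
```

Raising the threshold removes more low-confidence proposals, which reduces false alarms at the risk of missing true objects.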

We also create a method to extract the ground-truth data for a given frame. Using this method, we can compare the ground truth with the detections produced by inference.

In [None]:
def get_info_from_DF(frame_no):
    result = []
    temp = annotated_frames[annotated_frames["frame_no"] == frame_no]
    for i, box in temp.iterrows():
        result.append([int(box["xmin"]), int(box["ymin"]), int(box["xmax"]), int(box["ymax"])])
    return result

In [None]:
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util
def detect_frames(path_to_graph, path_to_labels,
                  data_folder, video_path, min_index, max_index, frame_rate, threshold):
    # We load the label maps and access category names and their associated indices
    label_map = label_map_util.load_labelmap(path_to_labels)
    categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=90, use_display_name=True)
    category_index = label_map_util.create_category_index(categories)

    # Import a graph by reading it as a string, parsing this string then importing it using the tf.import_graph_def command
    print('Importing graph...')
    detection_graph = tf.Graph()
    with detection_graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(path_to_graph, 'rb') as fid:
            serialized_graph = fid.read()
            od_graph_def.ParseFromString(serialized_graph)
            tf.import_graph_def(od_graph_def, name='')

    # Generate a video object
    fourcc = cv2.VideoWriter_fourcc('h','2','6','4') 

    print('Starting session...')
    with detection_graph.as_default():
        with tf.Session(graph=detection_graph) as sess:
            # Define input and output Tensors for detection_graph
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Each box represents a part of the image where a particular object was detected.
            detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Each score represents the level of confidence for each of the objects.
            # Score is shown on the result image, together with the class label.
            detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
            detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
            num_detections = detection_graph.get_tensor_by_name('num_detections:0')

            frames_path = data_folder
            
            
            num_frames = max_index - min_index
    
            reference_image = os.listdir(data_folder)[0]
            image = cv2.imread(join(data_folder, reference_image))
            height, width, channels = image.shape 
            out = cv2.VideoWriter(video_path, fourcc, frame_rate, (width, height))
            print('Running Inference:')
            total_time = 0
            for fdx, file_name in \
                    enumerate(sorted(os.listdir(data_folder),  key=lambda fname: int(fname.split('.')[0]) )):
                
                if fdx<=min_index or fdx>=max_index:
                    continue;
                image = cv2.imread(join(frames_path, file_name))
                image_np = np.array(image)
                # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
                image_np_expanded = np.expand_dims(image_np, axis=0)
                bboxes = get_info_from_DF(int(file_name.split('.')[0]))
                # Actual detection.
                tic = time.time()
                (boxes, scores, classes, num) = sess.run(
                    [detection_boxes, detection_scores, detection_classes, num_detections],
                    feed_dict={image_tensor: image_np_expanded})
                toc = time.time()
                t_diff = toc - tic
                total_time = total_time + t_diff
                # Visualization of the results of a detection.
                vis_util.visualize_boxes_and_labels_on_image_array(
                    image,
                    np.squeeze(boxes),
                    np.squeeze(classes).astype(np.int32),
                    np.squeeze(scores),
                    category_index,
                    use_normalized_coordinates=True,
                    line_thickness=2,
                    min_score_thresh= threshold)

                
                cv2.putText(image, 'frame: {}'.format(file_name), (30, 30),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255))
                            
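                # Draw the ground-truth boxes in red
                for bbox in bboxes:
                    cv2.rectangle(image, (int(bbox[0]), int(bbox[1])), (int(bbox[2]), int(bbox[3])), (0, 0, 255), 2)
                # Draw the FPS label once per frame, whether or not the frame has ground-truth boxes
                cv2.putText(image, 'FPS (GPU Inference) %.2f' % round(1 / t_diff, 2), (30, 60),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255))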
                    
                    
                prog = 'Completed %.2f%% in %.2f seconds' % ((100 * float(fdx - min_index + 1) / num_frames), total_time)
                print('\r{}'.format(prog), end = "")
                cv2.imwrite("data/temp/{}.jpg".format(fdx), image)
                out.write(image)
        out.release()

We are going to analyse variations of the three models we reviewed earlier in this lab, "RCNN", "SSD", and "NasNet".
You can review a list of available trained models at [TensorFlow Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md#coco-trained-models), where the model check-points (trained over the COCO dataset) are provided for download.

Visit the section titled `COCO-trained models` and compare the `speed` and `Mean Average Precision` values.  We are going to test three models next. Try to find out which model is the fastest among the three. Which model is the most accurate?

In [None]:
models = {'faster_rcnn_resnet_101': '/dli/data/tmp/faster_rcnn_resnet/frozen_inference_graph.pb' ,
          'nasnet': '/dli/data/tmp/faster_rcnn_nas/faster_rcnn_nas_coco_2018_01_28/frozen_inference_graph.pb',
          'ssd_mobilenet_v2':'/dli/data/tmp/ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph.pb'}

image_folder = '{}/images/{}'.format(config["Base_Dest_Folder"], config["Test_Video_ID"])
model_name = 'faster_rcnn_resnet_101'
PATH_TO_LABELS = config["Path_To_COCO_Labels"]
PATH_TO_DATA = image_folder
VIDEO_OUT_PATH = 'imgs/inference_COCO.mp4'

Next, we run the inference process for each model and, for now, compare the results visually. The `detect_frames` function allows you to provide a minimum and maximum frame index for inference, which we have set to "100" and "200" to save time.
You can also provide the cut-off confidence threshold for the network suggestions, which is set to 0.5 here.

In [None]:
# select model name: possible values:
#                                        'faster_rcnn_resnet_101'
#                                        'nasnet'
#                                        'ssd_mobilenet_v2'

model_name = 'faster_rcnn_resnet_101'
detect_frames(models[model_name], PATH_TO_LABELS, PATH_TO_DATA, VIDEO_OUT_PATH, 100, 200, 10, 0.5) 
disp_video(VIDEO_OUT_PATH)

Qualitatively speaking, the NasNet model produces the best results among the three. However, there is a threefold increase in inference time compared to `faster_rcnn_resnet_101`, which is our second-best model in terms of accuracy (we will quantify accuracy in the next section). On the other hand, SSD has produced the lowest quality results compared to the other two. This is not to say that SSD is not useful; on the contrary, it is an extremely efficient detector but requires a large dataset in a specific domain for training in order to generalize.

<a name="5-2"></a>
### 5.2 Quantitative analysis - Intersection over Union

In order to determine how the model has performed quantitatively, at least from a detection perspective, we must consider the IoU (Intersection over Union) calculations and false negative rates. For object detection, it is wise to calculate IoU when the detection score is above a fixed threshold. Otherwise, in this case, every box from the 300 proposals will be considered. It is common practice to set the threshold at 0.5. 
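
As a minimal sketch of that filtering step, the snippet below keeps only the proposals whose score reaches the threshold. The helper name `filter_by_score` and the example boxes and scores are made up for illustration; they are not part of the lab code.

```python
# Hypothetical helper (not part of the lab code): keep only the proposals
# whose confidence score reaches the threshold.
def filter_by_score(boxes, scores, threshold=0.5):
    """Return the (box, score) pairs whose score is >= threshold."""
    return [(box, score) for box, score in zip(boxes, scores) if score >= threshold]


# Made-up proposals in normalized (ymin, xmin, ymax, xmax) order.
example_boxes = [
    (0.10, 0.10, 0.40, 0.30),
    (0.50, 0.55, 0.90, 0.95),
    (0.00, 0.00, 0.05, 0.05),
]
example_scores = [0.92, 0.50, 0.07]

kept = filter_by_score(example_boxes, example_scores)
print(len(kept), "of", len(example_boxes), "proposals kept")
```

With a real model the lists would hold all 300 proposals, most of which carry very low scores.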

In addition, we must also consider the frame rate (fps, frames per second) of each model in turn. These concepts have been established in previous DLI courses; however, to review our definitions:

<img src="imgs/IoU.jpg" alt="meta_arch" style="width: 600px;"/>
<p style="text-align: center;color:gray"> Figure 9. IoU measure   </p>

Here, we calculate the IoU for each detection for each ground truth bounding box from each frame. This is a good measure of performance for a detector, but it lacks some crucial information when applied to tracks. More on this later.
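
To make the definition concrete, here is a small worked example with made-up numbers. The helper `iou_xyxy` is hypothetical and assumes both boxes are given as (xmin, ymin, xmax, ymax) in pixels, which is simpler than the mixed coordinate orders handled later in this lab.

```python
# Hypothetical helper for illustration; both boxes use (xmin, ymin, xmax, ymax).
def iou_xyxy(box_a, box_b):
    # Width and height of the overlap (negative when the boxes do not overlap)
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(0, inter_w) * max(0, inter_h)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)   # area 100
detection = (5, 0, 15, 10)      # area 100, shifted right by half a box

# Intersection is 5 x 10 = 50, union is 100 + 100 - 50 = 150
print(iou_xyxy(ground_truth, detection))
```

A detection shifted by half the box width therefore scores only about one third, which shows how quickly IoU drops with localization error.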

First, we must define a set of functions which will generate a dictionary of models, detections, ground truth bounding boxes and scores for every frame in each video. The script below does this. Because it takes quite some time to execute, we have also provided a pickled dictionary for model comparison with this lab.

In [None]:
def detect_frames_for_comparison(path_to_graph, path_to_labels,
                                 data_folder, min_index, max_index):
    # We load the label maps and access category names and their associated indices
    label_map = label_map_util.load_labelmap(path_to_labels)
    categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=1, use_display_name=True)
    category_index = label_map_util.create_category_index(categories)
    
    

    # Import a graph by reading it as a string, parsing this string then importing it using the tf.import_graph_def command
    print('Importing graph...')
    detection_graph = tf.Graph()
    with detection_graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(path_to_graph, 'rb') as fid:
            serialized_graph = fid.read()
            od_graph_def.ParseFromString(serialized_graph)
            tf.import_graph_def(od_graph_def, name='')

    # Generate a video object

    print('Starting session...')
    output = []
    with detection_graph.as_default():
        with tf.Session(graph=detection_graph) as sess:
            # Define input and output Tensors for detection_graph
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Each box represents a part of the image where a particular object was detected.
            detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Each score represents the level of confidence for each of the objects.
            # Score is shown on the result image, together with the class label.
            detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
            detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
            num_detections = detection_graph.get_tensor_by_name('num_detections:0')

            frames_path = data_folder
            xml_path = join(data_folder, 'xml')
            num_frames = max_index - min_index
            reference_image = os.listdir(data_folder)[0]
            image = cv2.imread(join(data_folder, reference_image))
            height, width, channels = image.shape
            print('Running Inference:')
            for fdx, file_name in \
                    enumerate(sorted(os.listdir(data_folder),  key=lambda fname: int(fname.split('.')[0]) )):
                if fdx<=min_index or fdx>=max_index:
                    continue;
                image = cv2.imread(join(frames_path, file_name))
                image_np = np.array(image)
                # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
                image_np_expanded = np.expand_dims(image_np, axis=0)
                bboxes = get_info_from_DF(int(file_name.split(".")[0]))
                # Actual detection.
                tic = time.time()
                (boxes, scores, classes, num) = sess.run(
                    [detection_boxes, detection_scores, detection_classes, num_detections],
                    feed_dict={image_tensor: image_np_expanded})
                toc = time.time()
                t_diff = toc - tic
                fps = 1/t_diff
                
                boxes = np.squeeze(boxes)
                classes = np.squeeze(classes)
                scores = np.squeeze(scores)
                
                vis_util.visualize_boxes_and_labels_on_image_array(
                    image,
                    boxes,
                    classes.astype(np.int32),
                    scores,
                    category_index,
                    use_normalized_coordinates=True,
                    line_thickness=2,
                    min_score_thresh=0.5)


                #cv2.imwrite(join('/dli/dli-v3/iv05/data/temp', file_name),image)
                prog = '\rCompleted %.2f %%' % (100 * float(fdx - min_index + 1) / num_frames)
                print('{}'.format(prog), end = "")
                boxes = np.array([(i[0]*height, i[1]*width, i[2]*height, i[3]*width) for i in boxes])
                output.append((bboxes, (boxes, scores, classes, num, fps)))

    return output

In [None]:
PATH_TO_DATA = image_folder

model_name = 'faster_rcnn_resnet_101'
detections = detect_frames_for_comparison(models[model_name], PATH_TO_LABELS, PATH_TO_DATA, 100, 200)
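
Note that the two kinds of boxes use different conventions: the detector returns normalized (ymin, xmin, ymax, xmax) values, which `detect_frames_for_comparison` scales to pixels while keeping that order, whereas `get_info_from_DF` returns ground truth as pixel (xmin, ymin, xmax, ymax). As a minimal sketch, the hypothetical helper below converts one normalized detection box into the ground-truth convention. The 882 x 692 frame size is the one used later in this lab, and the box values are made up.

```python
# Hypothetical helper (not part of the lab code): convert a normalized
# (ymin, xmin, ymax, xmax) detection into pixel (xmin, ymin, xmax, ymax).
def detection_to_pixel_xyxy(box, img_height, img_width):
    ymin, xmin, ymax, xmax = box
    return (xmin * img_width, ymin * img_height, xmax * img_width, ymax * img_height)


pixel_box = detection_to_pixel_xyxy((0.25, 0.5, 0.75, 1.0), img_height=692, img_width=882)
print(pixel_box)
```

The lab code does not reorder the detections; instead, `bbox_IoU` unpacks its two arguments in their respective orders.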

Given two sets of coordinates, the `bbox_IoU` function generates the IoU measure:

In [None]:
# function to compute the intersection over union of these two bounding boxes
def bbox_IoU(A, B):
  # A = detection box as (ymin, xmin, ymax, xmax)
  # B = ground-truth box as (xmin, ymin, xmax, ymax)
  # assign for readability 
  yminA, xminA, ymaxA, xmaxA = A
  xminB, yminB, xmaxB, ymaxB = B

  # figure out the intersecting rectangle coordinates
  xminI = max(xminA, xminB)
  yminI = max(yminA, yminB)
  xmaxI = min(xmaxA, xmaxB)
  ymaxI = min(ymaxA, ymaxB)

  # compute the width and height of the intersecting rectangle
  wI = xmaxI - xminI
  hI = ymaxI - yminI

  # compute the area of intersection rectangle (enforce area>=0)
  areaI = max(0, wI) * max(0, hI)


  # compute areas of the input bounding boxes 
  areaA = (xmaxA - xminA) * (ymaxA - yminA)
  areaB = (xmaxB - xminB) * (ymaxB - yminB)

  # if intersecting area is zero, we're done (avoids IoU=0/0 also)
  if areaI == 0: return 0, areaI, areaA, areaB

  # finally, compute and return the intersection over union
  return areaI / (areaA + areaB - areaI), areaI, areaA, areaB

Now, we loop over the generated data to output the IoU measures for each frame by calling the `bbox_IoU` function:

In [None]:
vid_calcs = list()
for frame_idx in range(len(detections)):
      det_boxes = detections[frame_idx][1][0]
      scores = detections[frame_idx][1][1]
      fps = detections[frame_idx][1][4]
      bbox_frame = detections[frame_idx][0]

      max_IoU_per_detection = list()

      #We loop over each bounding box and find the maximum IoU for each detection.
      for b_idx, bbox in enumerate(bbox_frame):
        IoU = 0
        for det_idx, det_box in enumerate(det_boxes):  
          if scores[det_idx] < 0.5: continue #We only include bounding box proposals with scores above and equal to 0.5         
          iou, I, A, B = bbox_IoU(det_box, bbox)
          IoU = max(iou, IoU)
        max_IoU_per_detection.append((IoU, fps))
      vid_calcs.append(max_IoU_per_detection)
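
The per-frame results can be condensed into a few summary numbers. The sketch below is one possible way to do so, not the lab's own evaluation code: the helper `summarize` is hypothetical, it assumes the same nested layout as `vid_calcs` (a list of `(IoU, fps)` tuples per frame), and it counts a ground-truth box as a false negative when its best IoU falls below a chosen threshold, which is an assumption made here for illustration.

```python
# Hypothetical helper: summarize a vid_calcs-style structure, i.e. one list per
# frame holding an (IoU, fps) tuple for each ground-truth box in that frame.
def summarize(per_frame, iou_threshold=0.5):
    ious = [iou for frame in per_frame for iou, _ in frame]
    fps_values = [fps for frame in per_frame for _, fps in frame]
    if not ious:
        return None  # no ground-truth boxes at all
    # Assumption: a box whose best IoU is below the threshold counts as missed
    missed = sum(1 for iou in ious if iou < iou_threshold)
    return {
        "mean_iou": sum(ious) / len(ious),
        "false_negative_rate": missed / len(ious),
        "mean_fps": sum(fps_values) / len(fps_values),
    }


# Made-up data: three frames with one box each and one frame without boxes
example_calcs = [[(0.8, 10.0)], [(0.0, 12.0)], [(0.6, 8.0)], []]
summary = summarize(example_calcs)
print(summary)
```

Running the same summary for each model gives a compact way to compare accuracy against frame rate.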

Let's visualize the IoU results for detections:

In [None]:
IoU_list=[]
for item in vid_calcs:
    IoU_list.append(item[0][0])


y_pos = np.arange(len(IoU_list))
pl.rcParams['figure.figsize'] = (18, 3)
 
plt.bar(y_pos, IoU_list, align='center', alpha=0.5)

plt.title('IoU measure for detections')
 
plt.show()

The SSD model generates very low IoU values compared to the other two, and the number of missing detections is high. On the other hand, the NasNet model produces much higher IoU values and fewer missed frames on the test dataset. The main caveat of using NasNet is the prolonged inference time, which makes it an unsuitable model for many online applications given conventional hardware limitations.

The `faster_rcnn_resnet_101` model, on the other hand, strikes a good balance between accuracy and performance when applied to our dataset. In the next lab, we will try to further improve the accuracy of the model by applying `transfer learning` techniques.

<a name="6"></a>
## 6. Crop and Normalize the Annotations

The images are cropped before being encoded into the Example data structure format. The model we are using requires a fixed-size input of 448 x 448 pixels, which means any image fed to this model via the object detection API will be resized to those dimensions.  Due to the aspect ratio and high resolution of the original images, the model preprocessing would significantly alter the data and potentially lead to poor results.  Additionally, the targets are relatively small compared to the entire image and are even more susceptible to the resizing process.  Cropping at the images' native resolution removes some of the unintended resizing artifacts in both the overall image quality and the targets of interest.

However, this requirement makes processing the data slightly more complicated.  The functions below are used to determine the crop values in image space, calculate the adjustments required to exclude annotations outside of the crop area, and re-index the remaining annotations to the new cropped image coordinate system.  Additionally, the bounding box values are normalized relative to image height and width to make the data more flexible for future use.  We chose to crop this data around the center point of the images; however, there is no limitation on where this crop occurs.

It should be noted that these choices have downstream effects on any inference pipeline built against this model.  Just like the training data, any data passing through the graph for inference will need to be similarly resized. Sensors collect much larger fields of view than the constraints imposed by this model, and others like it, allow.  Therefore, if inference pipelines are going to be built off models based on these flavors of object detectors, the processing pipeline needs to handle segmentation, overlap, and all the annotation bookkeeping that is associated with keeping those things in order.
We are also going to add the `width` and `height` columns to the DataFrame for convenience. 

In [None]:
img_height = 692
img_width = 882
annotated_frames.insert(1, 'width', img_width)
annotated_frames.insert(1, 'height', img_height)
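
Returning to the inference-pipeline point above, the segmentation-and-overlap bookkeeping can be illustrated with a small sketch. This is not part of the lab code: the helper `tile_origins` and the 64-pixel overlap are assumptions chosen for illustration, and the sketch only computes where 448-pixel tiles would start along one axis of an 882 x 692 frame.

```python
# Hypothetical helper: start offsets of fixed-size tiles that cover one image
# axis, with neighbouring tiles sharing at least `overlap` pixels.
def tile_origins(length, tile, overlap):
    if tile >= length:
        return [0]  # a single tile already covers the whole axis
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile sits flush with the far edge
    return origins


x_origins = tile_origins(882, 448, 64)  # along the image width
y_origins = tile_origins(692, 448, 64)  # along the image height
print(x_origins, y_origins)
```

Each (x, y) pair of origins defines one crop to run through the detector. Detections then have to be shifted back by the tile origin and de-duplicated in the overlapping regions.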

Below, we define the crop size for each image. There are many ways one could implement a function to set the crop size for the images. For example, you could crop all the images around the center point and filter the DataFrame to remove those without a moving vehicle. Here, we take an extra step and center the crop area around the middle of the moving object within each frame. We make sure that the coordinates fit within the image boundaries and no invalid bounding box (like negative coordinates) is generated. 

In [None]:
# define the crop size which is equal to the input size of our neural network
g_image_size = (448.0, 448.0)

def set_crop_size(crop_size, frames):
    for i, box in frames.iterrows():
        center_box_x = int (box['xmin'] + (box['xmax'] - box['xmin']) / 2)
        center_box_y = int (box['ymin'] + (box['ymax'] - box['ymin']) / 2)

        start_x = center_box_x - crop_size[0] / 2    
        end_x = start_x + crop_size[0]
        if start_x < 0:
            if box['xmin'] - 5 >= 0:
                start_x = box['xmin'] - 5
            else:
                start_x = box['xmin']
            end_x = start_x + crop_size[0]
        elif end_x >= box['width']:
            end_x = box['width']
            start_x = end_x - crop_size[0]

        start_y = center_box_y - crop_size[1] / 2    
        end_y = start_y + crop_size[1]
        if start_y < 0:
            if box['ymin'] - 5 >= 0:
                start_y = box['ymin'] - 5
            else:
                start_y = box['ymin']
            end_y = start_y + crop_size[1]
        elif end_y >= box['height']:
            end_y = box['height']
            start_y = end_y - crop_size[1]

        frames.at[i,'crop'] = [(start_x, start_y, end_x, end_y)]
    return frames
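
The core idea of keeping a crop window inside the image can be seen more easily on a single axis. The sketch below is a simplified variant, not a copy of `set_crop_size`: the hypothetical helper `clamp_window` shifts the window back inside the image instead of anchoring it near the box edge, and it assumes the window is no larger than the image.

```python
# Hypothetical one-axis helper: centre a window of `size` on `center` and shift
# it so that it stays within [0, limit]. Assumes size <= limit.
def clamp_window(center, size, limit):
    start = center - size / 2
    start = max(0, min(start, limit - size))
    return start, start + size


# Made-up box centres along the width of an 882-pixel-wide frame
print(clamp_window(100, 448, 882))  # window would start left of the image
print(clamp_window(441, 448, 882))  # window fits without adjustment
print(clamp_window(800, 448, 882))  # window would end right of the image
```

Whichever variant is used, the result should be a window of exactly the network input size that still contains the whole bounding box.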


Another issue with the normalized data arises when the object dimensions are larger than the crop size. We need to handle those cases explicitly, as we cannot crop the object itself to achieve the proper network input. 

In our sample video, such cases appear when the vehicle reaches the bottom of the frame. You can see examples below:

<img src="imgs/resize_samples.jpg" alt="Vatic imaging"/>
<p style="text-align: center;color:gray"> Figure 10. Vehicles with enclosing boxes larger than the network input size</p>


Our choice to resolve the issue is to resize frames for which the enclosing vehicle is larger than the input size and then perform the cropping. For convenience, we also add a new column, `resize`, which indicates whether the frame has been resized. By default, we set its value to False. 

In [None]:
annotated_frames.insert(1, 'resize', False)

Let's take a look at the DataFrame:

In [None]:
annotated_frames.head()

 
<a name="e4"></a>
### Exercise 4

We need to drop unnecessary columns from our data.  The code below is already set up to drop the `occluded` column. Add the necessary code to remove the `outside` column.


In [None]:
# Remove the unnecessary columns since they are all the same value now
annotated_frames = annotated_frames.drop("occluded", axis=1)
annotated_frames.head()
# <<<<<<<<<<<<<<<<<<<<YOUR CODE HERE >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Based on the bounding box height and width, we will be creating two sets of DataFrames, the `normal_size_frames` containing normal-sized bounding boxes and the `oversized_frames` containing oversized bounding boxes. 

In [None]:
normal_size_frames = annotated_frames[annotated_frames.apply(lambda x: x['xmax'] - x['xmin'] <= g_image_size[0], axis=1) &
                                     annotated_frames.apply(lambda x: x['ymax'] - x['ymin'] <= g_image_size[1], axis=1)]

oversized_frames = annotated_frames[annotated_frames.apply(lambda x: x['xmax'] - x['xmin'] > g_image_size[0], axis=1) |
                                     annotated_frames.apply(lambda x: x['ymax'] - x['ymin'] > g_image_size[1], axis=1)]                                 

print("Number of frames within the crop size:{}, number of oversized vehicles/frames: {}".format(len(normal_size_frames),
                                                                                                 len(oversized_frames)))

normal_size_frames = set_crop_size(g_image_size, normal_size_frames)
normal_size_frames.head()

Resizing frames affects the bounding box annotation values (xmin, ymin, xmax and ymax), which need to be adjusted according to the resize ratio. To fit an object within the crop, we calculate `max(bounding_box_width, bounding_box_height) + some_offset_value` and divide the network input size by it to obtain the resize ratio. Finally, we adjust the __bounding_box__ coordinates accordingly.

In [None]:
for i, box in oversized_frames.iterrows():
    resize_ratio = 0.0
    diff_x = box['xmax'] - box['xmin'] + 50 # adding offset to prevent round up errors
    diff_y = box['ymax'] - box['ymin'] + 50
    
    #find the maximum of x and y required ratio reduction
    resize_ratio = g_image_size[0]/max(diff_x, diff_y)
    
    #correct the existing bounding box values according to the ratio
    oversized_frames.at[i,'xmin'] = int(box['xmin'] * resize_ratio) 
    oversized_frames.at[i,'xmax'] = int(box['xmax'] * resize_ratio) 
    oversized_frames.at[i,'ymin'] = int(box['ymin'] * resize_ratio) 
    oversized_frames.at[i,'ymax'] = int(box['ymax'] * resize_ratio)
    
    #correct height and width values and set the resize column value to True
    oversized_frames.at[i,'width'] = int(box['width'] * resize_ratio)
    oversized_frames.at[i,'height'] = int(box['height'] * resize_ratio)
    oversized_frames.at[i,'resize'] = True

oversized_frames = set_crop_size(g_image_size, oversized_frames)    
oversized_frames.head()
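
To make the resize arithmetic concrete, here is a worked example on a single made-up bounding box. The helper `resize_ratio` is hypothetical, but it follows the same formula as the loop above, using the 448-pixel input size and the 50-pixel offset.

```python
# Hypothetical helper mirroring the formula used in the loop above.
def resize_ratio(box, target=448.0, offset=50):
    xmin, ymin, xmax, ymax = box
    longest_side = max(xmax - xmin, ymax - ymin) + offset
    return target / longest_side


# Made-up oversized box: 600 pixels wide and 490 pixels tall
box = (100, 150, 700, 640)
ratio = resize_ratio(box)  # 448 / (600 + 50)

# Scale and truncate the coordinates the same way the loop does
scaled_box = tuple(int(value * ratio) for value in box)
print(round(ratio, 4), scaled_box)
```

After scaling, the longest side of the box is shorter than 448 pixels, so the vehicle fits inside a single crop.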

Below, you can see how many oversized frames have been generated:

In [None]:
print('Number of oversized frames:',  len(oversized_frames))

We need to make sure that the obtained values are still valid and that rounding errors have not affected our results. Also, since the image coordinates will change to those of the crop area, we need to subtract the top and left coordinates of the crop box from the bounding box coordinates.

In [None]:
def normalize_frames(frames):
    normalized_frames = frames[frames.apply(lambda x: x['crop'][0][0] >= 0, axis=1) &
                                     frames.apply(lambda x: x['crop'][0][1] >= 0, axis=1) &
                                     frames.apply(lambda x: x['crop'][0][2] <= x['width'], axis=1) &
                                     frames.apply(lambda x: x['crop'][0][3] <= x['height'], axis=1) &
                                     frames.apply(lambda x: x['crop'][0][0] <= x['xmin'], axis=1) &
                                     frames.apply(lambda x: x['crop'][0][1] <= x['ymin'], axis=1) &
                                     frames.apply(lambda x: x['crop'][0][2] >= x['xmax'], axis=1) &
                                     frames.apply(lambda x: x['crop'][0][3] >= x['ymax'], axis=1)]
    for i, box in normalized_frames.iterrows():
        normalized_frames.at[i, 'xmin'] = box['xmin'] - int(box["crop"][0][0])
        normalized_frames.at[i, 'ymin'] = box['ymin'] - int(box["crop"][0][1])
        normalized_frames.at[i, 'xmax'] = box['xmax'] - int(box["crop"][0][0])
        normalized_frames.at[i, 'ymax'] = box['ymax'] - int(box["crop"][0][1])
        
    return normalized_frames
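
The same checks and the coordinate shift can be followed on a single box. The sketch below uses a hypothetical helper, `to_crop_coordinates`, and made-up values; it mirrors the conditions in `normalize_frames` without using a DataFrame.

```python
# Hypothetical single-box version of the checks in normalize_frames.
def to_crop_coordinates(box, crop, img_width, img_height):
    """Return the box relative to the crop's top-left corner, or None if the
    crop leaves the image or does not fully contain the box."""
    xmin, ymin, xmax, ymax = box
    crop_x0, crop_y0, crop_x1, crop_y1 = crop
    crop_inside_image = (crop_x0 >= 0 and crop_y0 >= 0 and
                         crop_x1 <= img_width and crop_y1 <= img_height)
    box_inside_crop = (crop_x0 <= xmin and crop_y0 <= ymin and
                       crop_x1 >= xmax and crop_y1 >= ymax)
    if not (crop_inside_image and box_inside_crop):
        return None
    return (xmin - int(crop_x0), ymin - int(crop_y0),
            xmax - int(crop_x0), ymax - int(crop_y0))


valid = to_crop_coordinates((300, 200, 400, 260), (126.0, 6.0, 574.0, 454.0), 882, 692)
invalid = to_crop_coordinates((300, 200, 400, 260), (-10.0, 0.0, 438.0, 448.0), 882, 692)
print(valid, invalid)
```

Rows that fail either check are dropped, which is why the counts printed in the next cells can be smaller than the counts printed earlier.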

In [None]:
cropped_frames = normalize_frames(normal_size_frames)
print('Number of normal sized objects:',  len(cropped_frames))

In [None]:
cropped_frames_oversize = normalize_frames(oversized_frames)
print('Number of oversized objects:',  len(cropped_frames_oversize))

We are going to draw samples from both `normal_size_frames` and `oversized_frames` and plot them for comparison. A sample set of 8 cropped images is shown below.

<img src="imgs/sample8.jpg" alt="Vatic imaging"/>
<p style="text-align: center;color:gray"> Figure 11. Randomly selected frame examples</p>


We need a few auxiliary functions to read, crop and plot those images:

- __crop_image__:          returns the region of the image given by the input reference coordinates
- __showarray__:           draws an array representing the cropped image
- __draw_rectangle__:      draws the matching vehicle bounding boxes for each sample
- __plot_random_samples__: samples the given DataFrame and applies the above methods to each item of the sample's DataFrame


In [None]:
from IPython.display import clear_output, Image, display
from IPython.display import Image as IPyImage  # alias survives the later `from PIL import Image`
from io import BytesIO
import PIL.Image
# Helper function to crop images
def crop_image(pil_image, coordinates):
    # coordinates are (xmin, ymin, xmax, ymax); the image is indexed as an array [rows, columns]
    xmin, ymin, xmax, ymax = int(coordinates[0]), int(coordinates[1]), int(coordinates[2]), int(coordinates[3])    
    crop_img = pil_image[ymin:ymax, xmin:xmax]
    return crop_img

def showarray(a, fmt='jpeg'):
    a = np.uint8(np.clip(a, 0, 255))
    f = BytesIO()  # encoded image data is binary, so a bytes buffer is required
    PIL.Image.fromarray(a).save(f, fmt)
    display(IPyImage(data=f.getvalue()))
    
def draw_rectangle(draw, coordinates, color, width=1):
    for i in range(width):
        rect_start = (coordinates[0][0] - i, coordinates[0][1] - i)
        rect_end = (coordinates[1][0] + i, coordinates[1][1] + i)
        draw.rectangle((rect_start, rect_end), outline = color)


The `plot_random_samples` function selects 8 random samples from a given set and, for each sample, searches the given DataFrame for ground truth data, draws the boxes on the frames, and finally plots the resulting samples. 


In [None]:
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from matplotlib.pyplot import imshow

def plot_random_samples(frames):
    sample_frames = frames.sample(n=8)
    fig=plt.figure(figsize=(15, 8))
    columns = 4
    rows = 2
    i = 1 

    for index, box in sample_frames.iterrows():
        #print(box["crop"])

        im = Image.open('{}/images/{}/{}.jpg'.format(config["Base_Dest_Folder"], config["Test_Video_ID"], box["frame_no"]))


        if box['resize']:
            im = im.resize((int(box['width']), int(box['height'])), Image.ANTIALIAS)

        xmin, ymin, xmax, ymax = box["xmin"], box["ymin"], box["xmax"], box["ymax"]


        cropped_im = im.crop(box["crop"][0])


        draw = ImageDraw.Draw(cropped_im)
        draw.rectangle(((xmin, ymin), (xmax, ymax)), fill=None, outline='red')
        draw_rectangle(draw, ((xmin, ymin), (xmax, ymax)), color=colors[2], width=3)


        fig.add_subplot(rows, columns, i)
        i += 1
        plt.imshow(np.asarray(cropped_im))
    plt.show()


First, we plot samples from the oversized DataFrame:

In [None]:
plot_random_samples(cropped_frames_oversize)

Next, we plot samples from our normal frames:

In [None]:
plot_random_samples(cropped_frames)

We need to concatenate the two sets of frames to form our final dataset and plot samples again.

In [None]:
temp_frames = [cropped_frames_oversize, cropped_frames]
cropped_frames = pd.concat(temp_frames)

Let's draw sample images from the combined set:

In [None]:
plot_random_samples(cropped_frames)

With that step, we have a DataFrame containing all the frame information required to crop the image and draw enclosing bounding boxes. Moreover, the noisy data are filtered, and the oversized frames are scaled down, so that the contained vehicles fit our model's input schema.  

Data preparation could be the longest and most critical task towards building a successful IVA application, and usually is very dependent on the type of data, cameras, illumination, weather, etc. Now we are ready to move to the next step which is creating __TFRecords__.


<a name="7"></a>
## 7. Create TFRecord files

TFRecord is a binary format for storing TensorFlow datasets. This allows a compact representation of the data as well as increased performance for data retrieval and memory management. Now that we have normalized our annotation coordinates, let's start packing the data into a TFRecord.  We are going to walk through the functionality and put it all together at the end to perform the processing.  



Since TFRecords store data in binary format, we need to provide the data in a `structured` format. TensorFlow provides two protocol message types to serialize your data structure into TFRecords: `tf.train.Example` and `tf.train.SequenceExample`, both of which convert the data into TensorFlow's standard model by providing a {"string": tf.train.Feature} mapping of the data.

An important consideration when building the TFRecords is the scale of the bounding box coordinates, which should be normalized to a float value between `0` and `1`. Also, as discussed earlier, our images need to get cropped prior to being encoded into the Example data structure. In addition to generating TFRecords, the following function crops the image by utilizing the `crop` coordinates.


In [None]:
from PIL import *
# test_output = {}
def To_tf_example(frame_data, img_path, img_name, 
                         label_map_dict,
                         img_size,
                         single_class):
    

    pil_image = Image.open(os.path.join(img_path,img_name))
    
    
    if frame_data['resize']:
        pil_image = pil_image.resize((int(frame_data['width']), int(frame_data['height'])), Image.ANTIALIAS)
    
    cropped_im = pil_image.crop(frame_data["crop"][0])

    # Encode the cropped image as JPEG bytes (tobytes() replaces the deprecated tostring())
    encoded = cv2.imencode('.jpg', np.asarray(cropped_im))[1].tobytes()

    xmin = []
    ymin = []
    xmax = []
    ymax = []
    classes = []
    classes_text = []

    
    
    # Append the  coordinates to the overall lists of coordinates
    xmin.append(float(frame_data['xmin'])/float(img_size[0]))
    ymin.append(float(frame_data['ymin'])/float(img_size[1]))
    xmax.append(float(frame_data['xmax'])/float(img_size[0]))
    ymax.append(float(frame_data['ymax'])/float(img_size[1]))
    


    # If only detecting object/not object then ignore the class-specific labels
    if single_class:
        classes.append(1)
    else:
        class_name = frame_data['label']
        classes_text.append(class_name.encode('utf8'))
        classes.append(label_map_dict[class_name])

    # Generate a TF Example using the object information
    example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(int(img_size[1])),
        'image/width': dataset_util.int64_feature(int(img_size[0])),
        'image/filename': dataset_util.bytes_feature(
            img_name.encode('utf8')),
        'image/source_id': dataset_util.bytes_feature(
            img_name.encode('utf8')),
        'image/filepath': dataset_util.bytes_feature(
            img_path.encode('utf8')),
        'image/encoded': dataset_util.bytes_feature(encoded),
        'image/format': dataset_util.bytes_feature('jpeg'.encode('utf8')),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmin),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmax),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymin),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymax),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes)}))
    return example
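
The normalization step inside `To_tf_example` can be checked in isolation. The following is a minimal sketch and not part of the lab code: `normalize_box` is a hypothetical helper and the pixel values are made up for illustration. It assumes the box is given in pixels relative to an image of size `(width, height)`, matching the `img_size[0]` / `img_size[1]` convention used above.

```python
def normalize_box(xmin, ymin, xmax, ymax, img_size):
    """Scale pixel coordinates to [0, 1] given img_size = (width, height)."""
    width, height = float(img_size[0]), float(img_size[1])
    box = (xmin / width, ymin / height, xmax / width, ymax / height)
    # Values outside [0, 1] mean the box does not lie fully inside the image
    if not all(0.0 <= v <= 1.0 for v in box):
        raise ValueError("box {} falls outside the image".format(box))
    return box

# Illustrative values only: a 200x100-pixel box in a 1280x720 image
print(normalize_box(320, 180, 520, 280, (1280, 720)))
```

A check like this is useful when the crop changes the image size: if the annotations are still expressed relative to the uncropped frame, some normalized values fall outside `[0, 1]` and the sketch raises an error instead of silently writing them.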

<a name="7-1"></a>
### 7.1 Encode annotations and images into TensorFlow Examples

The `generate_tf_records` function takes in image-specific DataFrames. These DataFrames come from the filtering process in the previous section, which removed objects that were occluded, lost, or outside of the crop area. The function below calls `To_tf_example` to crop the original image and encode it as JPEG, and the resulting Example is then serialized and written to the TFRecord file. Each annotation that is associated with the frame and survived the culling is also written into the record.

In [None]:
from object_detection.utils import dataset_util
from object_detection.utils import label_map_util
def generate_tf_records(writer, 
                        frames_df,
                        image_folder,
                        reference_frames,
                        label_map_dict):

    # Write one serialized Example per annotation row whose frame is in reference_frames
    for index, the_item in frames_df.iterrows():
        # Check whether the frame belongs to the reference set, i.e. train or test
        if int(the_item["frame_no"]) in reference_frames:
            
            print("\r frame: {:>6}".format(int(the_item["frame_no"])), end='\r', flush=True)
            file_name = "{}.jpg".format(the_item["frame_no"])
            tf_example = To_tf_example(the_item,image_folder, file_name, label_map_dict, g_image_size, False)
            
            writer.write(tf_example.SerializeToString())

<a name="7-2"></a>
### 7.2 Create training and validation splits

In order to support the training process, we will need to produce a record containing training examples and another record containing validation examples. The examples in the training set will be used during the learning phase of training to determine the model parameters (weights and biases) that best fit the data. The examples in the validation set will be used to perform an evaluation of the model and assess the current performance of the model as it is training. A holdout, or test dataset, will need to be reserved for testing the final, trained model.

The validation data acts as an indicator of the current performance of the model and can signal whether the model is overfitting or has converged early. Whether that signal is positive or negative, it can lead the researcher to stop training early, either to keep the current model as the final one or to change the configuration and train again. The validation data is therefore a hybrid dataset: it neither affects the low-level training nor contributes to the final evaluation, but it does influence the overall process.
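
One common way to act on the validation signal is a patience rule: stop when the validation loss has not improved for a fixed number of evaluations. The sketch below is illustrative only. `should_stop` and the loss values are hypothetical and are not part of the lab code or of the TensorFlow Object Detection API.

```python
def should_stop(val_losses, patience=3):
    """Return True if the last `patience` evaluations did not beat the best earlier loss."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Made-up validation losses: improvement stalls after the third evaluation
history = [0.90, 0.70, 0.62, 0.63, 0.65, 0.66]
print(should_stop(history))
```

A rising validation loss while the training loss keeps falling is the typical sign of overfitting, which is why the rule looks only at the validation values.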
  

The DataFrame we just created holds all the annotations of a single video. One of the advantages of the `pandas` library is that its data structures can be fed directly into other data manipulation libraries. We are going to use `scikit-learn` to split the data into training and validation sets, given a desired split fraction.


In [None]:
import ast
from sklearn.model_selection import train_test_split
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util
from object_detection.utils import dataset_util
from random import shuffle

unique_frames = cropped_frames.frame_no.unique()
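# Shuffle and split the unique frame numbers, so that all annotations of a
# frame end up in the same set. train_test_split also shuffles by default,
# which makes the explicit shuffle redundant but harmless.
shuffle(unique_frames)
split = 0.2  # fraction of frames held out for validation
train, test = train_test_split(unique_frames, test_size=split)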
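
Splitting on unique frame numbers, rather than on individual annotation rows, keeps every annotation of a frame in the same set. The following stdlib-only sketch illustrates that idea under stated assumptions: the annotation tuples and the `split_frames` helper are hypothetical, and the lab itself uses `train_test_split` as shown above.

```python
import random

def split_frames(frame_numbers, val_fraction=0.2, seed=0):
    """Shuffle unique frame numbers and split them into (train, validation) sets."""
    unique = sorted(set(frame_numbers))
    rng = random.Random(seed)  # local generator, so the split is reproducible
    rng.shuffle(unique)
    n_val = max(1, int(round(len(unique) * val_fraction)))
    return set(unique[n_val:]), set(unique[:n_val])

# Hypothetical annotations: (frame_no, label); a frame can hold several objects
annotations = [(1, "car"), (1, "truck"), (2, "car"), (3, "car"), (4, "bus"),
               (5, "car"), (5, "car"), (6, "car"), (7, "bus"), (8, "car"),
               (9, "car"), (10, "truck")]
train_frames, val_frames = split_frames([f for f, _ in annotations])
train_rows = [a for a in annotations if a[0] in train_frames]
val_rows = [a for a in annotations if a[0] in val_frames]
print(len(train_frames), len(val_frames), len(train_rows) + len(val_rows))
```

Because membership is decided per frame, no frame contributes annotations to both sets, which avoids evaluating the model on images it has already seen during training.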

<a name="7-3"></a>
### 7.3 String all the functionality together and create the TFRecord

Now we're going to put everything together to generate the TFRecord files.  This involves the following steps:

1. Grab the label map file (discussed below) and read it into a dictionary (this is done using a convenience utility provided by the API)
2. Grab the parent directory of all the data
3. Build output paths for the generated record files
4. For each DataFrame group (train and test), pass them off to the `generate_tf_records` function
5. Write annotations as a TensorFlow Example, per frame
6. Save Examples to a TFRecord File


To string all this functionality together and create a TFRecord for our training and test data, we need a label map in a specific format. This file maps class IDs to class names, and the IDs must start at 1. The file is provided for you and wired into the function below.

Its contents look like this:

```
item {
  id: 1
  name: 'Object'
}
```
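
To make the format concrete, the sketch below writes and re-reads a label map using only the standard library. It is an illustration under assumptions: `build_label_map` and `parse_label_map` are hypothetical helpers, and the regular expression only handles the simple `id` / `name` layout shown above. In the lab, `label_map_util.get_label_map_dict` does the parsing.

```python
import re

def build_label_map(class_names):
    """Render class names as label map items with IDs starting at 1."""
    items = ["item {{\n  id: {}\n  name: '{}'\n}}\n".format(i, name)
             for i, name in enumerate(class_names, start=1)]
    return "\n".join(items)

def parse_label_map(text):
    """Return a {name: id} dict from the simple layout produced above."""
    pattern = re.compile(r"id:\s*(\d+)\s*name:\s*'([^']+)'")
    return {name: int(idx) for idx, name in pattern.findall(text)}

label_map_text = build_label_map(["Object"])
print(label_map_text)
print(parse_label_map(label_map_text))
```

The resulting dictionary maps each class name to its integer ID, which is the shape of lookup that `To_tf_example` performs with `label_map_dict[class_name]`.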

Remember, we are only working with data from a single video due to time and compute constraints in the lab environment.  All these approaches scale to larger datasets and were used to create larger record files to train the model you will work with later in the lab.

This process will take a while and is a good time to read ahead a bit.


In [None]:
video_list = ['126_206-A0-3']
label_map_dict = label_map_util.get_label_map_dict(config["Label_Map"])
train_writer = tf.python_io.TFRecordWriter(join(config["Base_Dest_Folder"],'train.record'))
eval_writer = tf.python_io.TFRecordWriter(join(config["Base_Dest_Folder"],'eval.record'))
for xx in video_list:
    #create train record
    generate_tf_records(train_writer, 
                        cropped_frames,
                        '{}/images/{}'.format(config["Base_Dest_Folder"], config["Test_Video_ID"]),
                        train,
                        label_map_dict)
    #create eval record
    generate_tf_records(eval_writer, 
                        cropped_frames,
                        '{}/images/{}'.format(config["Base_Dest_Folder"], config["Test_Video_ID"]),
                        test,
                        label_map_dict)
        
train_writer.close()   
eval_writer.close()   

## Summary


Congratulations on completing the first part of this IVA course! If you have any spare time left, please alter the script above to experiment with other model and video combinations.

So far, we have learned:
* The different object detection methods, how they differ, and the pros and cons of each
* How to convert the frames of a video and their annotations into a format that can be easily consumed by the TensorFlow Object Detection API and by subsequent metric definitions
* How to view the output to get a qualitative appreciation of the model accuracy
* How to quantitatively measure the accuracy and performance of object detection models using IoU metrics

In the next lab, we are going to learn how to train a model and fine-tune the network weights. We will also learn how to track objects in videos.  Thanks for your time!


## Answers
<a name="a1"></a>
### Exercise 1:

In [None]:
print("Ratio of frames with moving vehicles to total: {0:.2f}%".format((frame_existance == 1.0).sum() / len(frame_existance) * 100))

click [here](#e1) to go back

<a name="a2"></a>
### Exercise 2:

In [None]:
# `== True` also filters out rows whose "attributes" value is missing (NaN)
sedans = annotated_frames[annotated_frames["attributes"].str.contains("sedan") == True]
print('Total number of sedans: {}'.format(len(sedans)))

click [here](#e2) to go back

<a name="a3"></a>
### Exercise 3:

In [None]:
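# Compute the height and width of every bounding box, then average them per track
def calc_average_HW(group):
    # `group` is the DataFrame holding all rows of one track_id
    group['Average_Height'] = group['ymax'] - group['ymin']
    group['Average_Width'] = group['xmax'] - group['xmin']
    return group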
Average_HW = inside_items.groupby(['track_id']).apply(calc_average_HW)
Average_HW = Average_HW.groupby(['track_id']).mean()

Average_HW.head()

click [here](#e3) to go back