# Object Recognition in Videos

In this project we import a pre-trained object detection model and use it to identify objects in a video. We draw boxes around the identified objects in each frame and then reassemble the frames into a new video that shows the boxes.

## Team

*   Alex de Magalhaes
*   Lynn He
*   Wayne Chim



## Workflow

Our goal was to process a video frame-by-frame, identify objects in each frame, and draw a bounding box and label around each object.
 
We used the pre-built [SSD MobileNet V1 COCO](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md) model ('ssd_mobilenet_v1_coco'). To start, we processed this video [found on Pixabay](https://pixabay.com/videos/cars-motorway-speed-motion-traffic-1900/).
 

The [COCO labels file](https://github.com/nightrome/cocostuff/blob/master/labels.txt) maps each class ID to an object name, which lets us label the detected objects.


 

## Skills
* Classification
* Saving and Loading Models
* OpenCV
* Video Processing

## Importing Video

First, we import all of our libraries.

In [0]:
import urllib.request
import os
import tarfile
import shutil
import tensorflow as tf
import cv2 as cv
import numpy as np
import matplotlib.pyplot as plt

In this experimental code block, we worked with a single frame before moving on to frames sampled from the entire video.

In [0]:
#Put the name of the video you want to process here!
video_name = 'cars.mp4'

#Experimenting with different videos and individual frames
video = cv.VideoCapture(video_name)
frame_index = 10 #Other frames we tried: 123, 200
video.set(cv.CAP_PROP_POS_FRAMES, frame_index)
ret, image = video.read()
if not ret:
  raise Exception(f'Problem reading frame {frame_index} from video')

video.release()
image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
plt.imshow(image)
plt.show()

Processing every frame of the video would be impractical, so a for loop grabs the first frame of every second. Since the video lasts 60 seconds, we work with approximately 60 frames. Each frame is padded to square dimensions and appended to a list that is later passed into the model. Passing the whole list through a single TensorFlow session run saves time, because the frames are processed together as one batch rather than one at a time.

In [0]:
video = cv.VideoCapture(video_name)

fps = int(video.get(cv.CAP_PROP_FPS))
total_frame = int(video.get(cv.CAP_PROP_FRAME_COUNT))

input_images = []
#In this case, one frame per second is retrieved from the video as part of the testing list
for i in range(0, total_frame, fps):
  video.set(cv.CAP_PROP_POS_FRAMES, i)
  ret, image = video.read()
  #Stop if a frame cannot be read (for example, at the end of the video)
  if not ret:
    break
  image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
  height = image.shape[0]
  width = image.shape[1]
  
  #Each frame is padded with white borders to make it square
  left_pad, right_pad, top_pad, bottom_pad = 0, 0, 0, 0
  if height > width:
    left_pad = int((height-width) / 2)
    right_pad = height-width-left_pad
  elif width > height:
    top_pad = int((width-height) / 2)
    bottom_pad = width-height-top_pad

  img_square = cv.copyMakeBorder(
     image,
     top_pad,
     bottom_pad,
     left_pad,
     right_pad,
     cv.BORDER_CONSTANT,
     value=(255,255,255))

  #shape is (height, width, channels); height and width are equal after padding
  video_h = img_square.shape[0]
  video_w = img_square.shape[1]
  #Finally the processed frames are added to the list
  input_images.append(img_square)
  
video.release()
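
The sampling and padding arithmetic in the loop above can be pulled out into two small helpers, which makes it easier to check on its own. This is only a sketch: the function names `sample_frame_indices` and `square_padding` are ours for illustration and are not part of OpenCV, and the clip dimensions in the example are made up.

```python
def sample_frame_indices(total_frames, fps):
    """Return the index of the first frame of every second."""
    # Guard against a reported frame rate of 0, which would make range() fail
    step = max(int(fps), 1)
    return list(range(0, total_frames, step))


def square_padding(height, width):
    """Return (top, bottom, left, right) padding that makes a frame square."""
    top = bottom = left = right = 0
    if height > width:
        left = (height - width) // 2
        right = height - width - left
    elif width > height:
        top = (width - height) // 2
        bottom = width - height - top
    return top, bottom, left, right


# Hypothetical example: a 60-second clip at 30 frames per second, 1280x720 pixels
indices = sample_frame_indices(total_frames=1800, fps=30)
print(len(indices), indices[:3])
print(square_padding(height=720, width=1280))
```

The four padding values are in the same order that `cv.copyMakeBorder` takes them (top, bottom, left, right).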

The MobileNet model archive is downloaded and extracted, the frozen graph is loaded, and the names of the output nodes are defined. The next step is to pass the selected frames into the model for processing.

In [0]:
base_url = 'http://download.tensorflow.org/models/object_detection/'
file_name = 'ssd_mobilenet_v1_coco_2018_01_28.tar.gz'

url = base_url + file_name

urllib.request.urlretrieve(url, file_name)

#Work out the extracted directory name and remove any previous copy before extracting
dir_name = file_name[0:-len('.tar.gz')] #Name of the extracted directory

if os.path.exists(dir_name):
  shutil.rmtree(dir_name) 

tarfile.open(file_name, 'r:gz').extractall('./')

#Load the frozen graph and list the names of the output nodes
frozen_graph = os.path.join(dir_name, 'frozen_inference_graph.pb')

with tf.gfile.FastGFile(frozen_graph,'rb') as f:
  graph_def = tf.GraphDef()
  graph_def.ParseFromString(f.read())

outputs = (
  'num_detections',
  'detection_classes',
  'detection_scores',
  'detection_boxes',
)
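
The cell above downloads the archive every time it runs. A minimal sketch of a cached variant follows, assuming it is acceptable to reuse an archive that is already on disk. The helper name `fetch_and_extract` is hypothetical, and the sketch only uses the standard library.

```python
import os
import tarfile
import urllib.request


def fetch_and_extract(url, archive_path, extract_to='.'):
    """Download a .tar.gz archive unless it is already on disk, then extract it."""
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    # The with-block closes the archive even if extraction fails
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(extract_to)
```

In this notebook it could be called as `fetch_and_extract(url, file_name)`. The trade-off is that a partially downloaded archive would be reused as well, so deleting the file by hand may be needed after an interrupted download.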

Using a TensorFlow session, the list of frames is passed into the model and the outputs are collected in a list called 'detections'. The items in the list are then separated into the number of detections, detection classes, detection scores, and detection boxes.

In [0]:
#The model is run in a TensorFlow session
with tf.Session() as sess:
  sess.graph.as_default()
  tf.import_graph_def(graph_def, name='')

  detections = sess.run(
      [sess.graph.get_tensor_by_name(f'{op}:0') for op in outputs],
      feed_dict={ 'image_tensor:0': input_images }
  )

#Each output is assigned a more descriptive name
num_detections = detections[0]
detection_classes = detections[1]
detection_scores = detections[2]
detection_boxes = detections[3]
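
Each of these outputs has one row per frame, and within a row only the first `num_detections` entries describe real detections. The sketch below shows how one frame's outputs could be filtered by confidence. It uses plain Python lists and made-up values so that it runs without TensorFlow; the function name `confident_detections` and the default threshold of 0.3 are our own choices for illustration.

```python
def confident_detections(num, classes, scores, boxes, threshold=0.3):
    """Return (class_id, score, box) tuples for one frame above the threshold."""
    kept = []
    # Only the first `num` entries of each output are valid detections
    for i in range(int(num)):
        if scores[i] > threshold:
            kept.append((int(classes[i]), float(scores[i]), list(boxes[i])))
    return kept


# Made-up outputs for a single frame with two valid detections
frame_num = 2.0
frame_classes = [3.0, 10.0, 1.0]
frame_scores = [0.9, 0.2, 0.0]
frame_boxes = [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0]]

print(confident_detections(frame_num, frame_classes, frame_scores, frame_boxes))
```

With the notebook's arrays, frame `x` would be filtered with `confident_detections(num_detections[x], detection_classes[x], detection_scores[x], detection_boxes[x])`.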

Instead of hard coding the labels, we cleaned up the .txt file of ID codes (linked above) and used it to build a dictionary. As the algorithm iterates, it looks up the label that corresponds to each ID code (e.g. '3' is a car and '10' is a traffic light).

In [0]:
#Initialize a dictionary
labels = {}

#Open the .txt file that contains all the labels; the with-block closes it when done
#Each line has the form 'ID: name', so the class ID becomes the key and the object name the value
with open("labels.txt", "r") as f:
  for x in f:
    key,_,value = x.partition(':')
    #strip() removes the newline and any spaces around the ID and the name
    labels[key.strip()] = value.strip()
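
The parsing step can be tried without the labels file. The sketch below assumes each line has the form `ID: name`; the function name `parse_labels` and the sample lines are ours for illustration.

```python
def parse_labels(lines):
    """Build a {class_id: name} dictionary from lines of the form 'ID: name'."""
    parsed = {}
    for line in lines:
        key, sep, value = line.partition(':')
        # Skip blank or malformed lines that have no ':' separator
        if not sep:
            continue
        parsed[key.strip()] = value.strip()
    return parsed


# Made-up sample in the same 'ID: name' format
sample = ['3: car\n', '10: traffic light\n', '\n']
print(parse_labels(sample))
```

Because the keys are strings, a class ID returned by the model has to be converted before the lookup, for example `str(int(class_id))`.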

A new video is created to hold all the frames with bounding boxes and labels added. The algorithm iterates through each frame, reads the number of objects detected, and draws and labels a bounding box around each object in the frame. For our sample video, we put white boxes around the cars and blue boxes around non-car objects.

The frame width and height are used to scale the normalized coordinates in the detection boxes output to pixel coordinates.

Finally, the output video is released as good practice.

In [0]:
#Put the name of the new file here:
video_name = 'cars-detect.mp4'

#Initialize a new video object for the output
fourcc = cv.VideoWriter_fourcc(*'mp4v')
output_video = cv.VideoWriter(video_name, fourcc, fps, (video_w, video_h))

#First for-loop iterates through each frame 
for x in range(0,len(num_detections)):
  frame_copy = np.copy(input_images[x])
  height,width,_ = frame_copy.shape
  #Second for-loop iterates through each object detected in each frame
  for i in range(0,int(num_detections[x])):
    #Only keep detections in the current frame with a confidence level over 30%
    if detection_scores[x][i] > .3:
      #Car objects have their own colored bounding boxes
      if detection_classes[x][i] == 3: 
         color = [250,230,230]
      #Other objects have blue boxes (the frame is still in RGB order here)
      else:
         color = [0,0,255]
      #Normalized box coordinates are scaled to pixel coordinates
      left = int(width*detection_boxes[x][i][1])
      top = int(height*detection_boxes[x][i][0])
      right = int(width*detection_boxes[x][i][3])
      bottom = int(height*detection_boxes[x][i][2])
      #Bounding boxes are drawn
      cv.rectangle(frame_copy,
                    (left, top),
                    (right, bottom),
                    (color),
                    2)
      #Labels are retrieved from the dictionary
      label = labels[str(int(detection_classes[x][i]))]
      #Labels are drawn just above each box
      cv.putText(frame_copy, label, (left, top-10), cv.FONT_HERSHEY_TRIPLEX, .5, [255,0,255], 1)

  #Show the annotated frame; matplotlib expects RGB
  plt.subplots()
  plt.imshow(frame_copy)
  plt.show()

  #OpenCV writes frames in BGR order, so convert before writing to the output video
  output_video.write(cv.cvtColor(frame_copy, cv.COLOR_RGB2BGR))

#Output video is released as good practice and to free up memory
output_video.release()
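
The scaling step in the loop above can be checked in isolation. The sketch below assumes, as the indexing in the loop does, that each box is stored as normalized `[ymin, xmin, ymax, xmax]`. The function name `box_to_pixels` and the example box are made up for illustration.

```python
def box_to_pixels(box, width, height):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel corners."""
    ymin, xmin, ymax, xmax = box
    left = int(width * xmin)
    top = int(height * ymin)
    right = int(width * xmax)
    bottom = int(height * ymax)
    # Corners are returned as (x, y) pairs, the order cv.rectangle expects
    return (left, top), (right, bottom)


# Made-up box on a 1280x1280 padded frame
print(box_to_pixels([0.25, 0.5, 0.75, 1.0], width=1280, height=1280))
```

Note that these pixel coordinates refer to the padded, square frame, not to the original frame.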


Our final product is a shortened video with relatively accurate labels for each detection. The model works best on simple backgrounds with little to no noise, but even then it may make incorrect detections, such as labeling the side of the highway as a chair.