# Improving Machine Learning Experience With GStreamer Techniques
Marco A. Franchi

## *Abstract*

*Machine learning algorithms have become very common in multimedia applications: traffic counting, people counting, real-time surveillance cameras, baggage tracking, face and expression recognition, and so on. However, there is a noticeable gap between machine learning and multimedia solutions: even simple embedded systems can play 4K video at 60 frames per second, whereas the best neural network solution on the same embedded system takes 300 milliseconds to process a 224x224 frame. To narrow this gap, a vast number of solutions have been developed: dedicated inference hardware, model manipulation, accelerated image pre-processing, and video manipulation techniques. This paper studies these video manipulation techniques, presenting the most common ones, such as frame skipping, frame dropping, resizing, color conversion, and overlays, and discussing the benefits and drawbacks of adopting each of them.*



## Introduction

To narrow the gap between machine learning inference time and multimedia capability, which reaches 4K at 60 fps, several video manipulation solutions have been proposed. Among them, the most common are overlay solutions, which create alpha layers over the video and draw information on them. Overlays are very common in object detection, where they are responsible for drawing a box around, selecting, or coloring an object in the scene. The most common overlays are SVG, Cairo and OpenCV.
Alongside the overlays, it is important to use a capable video framework that can handle all the elements involved in inference and display. One of the best and most widely used is GStreamer. GStreamer arranges plugins into pipelines, which makes it well suited to running quick tests.
Apart from the video solutions, it is important to choose the best machine learning algorithms as well. For object detection, the most common and highly regarded ones are Single Shot Detection (SSD) and TensorFlow. Both offer remarkable inference performance, and the TFLite version has proven to be a great tool for embedded systems.
Thus, with all these tools available, this paper compares combinations of these algorithms for object detection. The comparison aims to demonstrate how simple approaches can increase the video frame rate, to identify the best scenarios for each neural network algorithm, and to show how the overlay plugins behave in these tests.

## Materials and Methods

This section describes the materials used, such as the video files, models and programming language, and the adopted methodology.

### Programming Language



This paper uses Python 3 and all the required support for the overlays, GStreamer, TensorFlow and SSD. These tools are described in detail in this section.
As a first test, OpenCV is used to open a video file in a separate window.

In [6]:
import cv2 as opencv

In [7]:
window = "Video test"
opencv.namedWindow(window)
file = "../data/video/video_device.mp4"
cap = opencv.VideoCapture(file)

# Read and display frames until the file ends or Esc is pressed
while cap.isOpened():
    ret, frame = cap.read()

    if ret:
        opencv.imshow(window, frame)
        # Wait 133 ms between frames; key code 27 is Esc
        if opencv.waitKey(133) == 27:
            break
    else:
        break

opencv.destroyAllWindows()
cap.release()
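
The `waitKey(133)` call above pauses about 133 ms between frames, which caps playback at roughly 7.5 frames per second regardless of the rate stored in the file. As a minimal sketch, the delay could instead be derived from a frame rate. The helper name `frame_delay_ms` and its fallback value are hypothetical; in the test above the rate could be read with `cap.get(opencv.CAP_PROP_FPS)`.

```python
def frame_delay_ms(fps, fallback_ms=33):
    """Return the waitKey delay, in whole milliseconds, for a given frame rate.

    fallback_ms is used when the rate is unknown (reported as 0 or negative).
    """
    if fps <= 0:
        return fallback_ms
    # waitKey(0) blocks until a key is pressed, so never return less than 1 ms
    return max(1, round(1000 / fps))


# Example: delay to use for a 30 fps file
delay = frame_delay_ms(30)
```

This only limits how fast frames are shown; it does not account for the time spent decoding or processing each frame.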

### Models and Labels

Since the focus of this paper is on video techniques, and since YOLO, SSD and TFLite already have a huge number of pre-trained models available, this paper does not cover pre-processing or training models.
Instead, the following pre-tested models and labels are used in the tests:

**Models**
* tiny_yolov3.tflite
* mobilenet_ssd_v2_coco_quant_postprocess.tflite
* mobilenet_v2_1.0_224_quant.tflite

**Labels**
* coco_labels.txt
* labels.txt

The following code section gets all the required models and labels.

In [9]:
window="Video test"
opencv.namedWindow(window)
file="../figures/video_device.mp4"
cap=opencv.VideoCapture(file)
    
while (cap.isOpened()):
    ret,frame=cap.read()
        
    if ret:
        opencv.imshow(window,frame)
        if opencv.waitKey(133)==27:
            break
    else :
        break
            
opencv.destroyAllWindows()
cap.release()
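
The overlay code later in this paper looks labels up with `labels.get(obj.id, obj.id)`, so the label files need to be loaded into a dictionary keyed by class id. A minimal sketch follows, assuming each line holds either `<id> <name>` or just `<name>` (in which case the zero-based line number is used as the id). The function name `load_labels` and the file layout are assumptions, not taken from the original code.

```python
import re


def load_labels(path):
    """Load a label file into a {class_id: name} dictionary."""
    labels = {}
    with open(path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            line = line.strip()
            if not line:
                continue  # skip empty lines
            match = re.match(r"^(\d+)\s+(.+)$", line)
            if match:
                # "<id> <name>" layout: use the id written in the file
                labels[int(match.group(1))] = match.group(2).strip()
            else:
                # "<name>" layout: use the line number as the id
                labels[index] = line
    return labels
```

A call such as `load_labels("coco_labels.txt")` would then return a dictionary that can be passed as the `labels` argument of the overlay functions.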

### GStreamer and v4l2

As mentioned before, this paper uses the GStreamer framework to play the video files.
For comparison purposes, the following approaches are tested:
* OpenCV handling v4l2 directly;
* GStreamer appsink pipeline + OpenCV v4l2 output;
* GStreamer appsink + appsrc pipelines;
* GStreamer overlay plugins support.

The workflows below describe the differences between the approaches:

### Workflow

**OpenCV handling v4l2 directly:**
![opencv](../data/images/opencv_v4l2.png)
In this approach, OpenCV handles the entire process. It is easy to test, but not fast enough for ML purposes, since the display always has to wait for the processed frame.


**GStreamer appsink pipeline + OpenCV v4l2 output**
![appsink](../data/images/appsink_opencv_v4l2.png)
This approach performs better than plain OpenCV v4l2 because the GStreamer pipeline can drop frames. This means that the displayed frame rate is not affected by the inference process. However, the videoconvert element, which appsink requires in order to display the results on screen, is a disadvantage: it only runs on the CPU, and resizing and color conversion on the CPU have a high processing cost.
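
The effect of frame dropping can be illustrated without GStreamer. The sketch below is a simplified, single-threaded simulation, not the actual queue implementation: a one-slot queue always keeps the newest frame (the behavior of `leaky=downstream` with `max-size-buffers=1`), and a slow consumer takes a frame only when it is idle. The function name and the timing values are hypothetical.

```python
from collections import deque


def simulate_leaky_queue(n_frames, frame_interval_ms, inference_ms):
    """Return the indices of the frames that reach a slow consumer placed
    behind a one-slot queue that always keeps the newest frame."""
    queue = deque(maxlen=1)  # appending to a full deque discards the older item
    processed = []
    busy_until = 0  # time (ms) at which the consumer becomes free again
    for index in range(n_frames):
        now = index * frame_interval_ms
        queue.append(index)      # the producer never blocks
        if now >= busy_until:    # consumer is idle: take the newest frame
            processed.append(queue.pop())
            busy_until = now + inference_ms
    return processed


# 60 frames at 20 fps (one every 50 ms) with a 300 ms inference step
frames = simulate_leaky_queue(60, 50, 300)
```

With these numbers only every sixth frame reaches the inference step (10 out of 60), while the producer side keeps running at its full rate.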


**GStreamer appsink + appsrc pipelines**
![appsink/appsrc](../data/images/appsink_appsrc.png)
This combination looks very promising: it can use the appsink frame-dropping property, but it does not require videoconvert, since its output is not displayed directly. Instead, the resulting data is handled by the inference process, which can include resizing and color conversion, and the results are then sent back to GStreamer only to be displayed on the screen.
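
To make the resizing step concrete, the sketch below shows nearest-neighbour scaling of a raw, tightly packed RGB frame as it could arrive from appsink. It is a pure-Python illustration under stated assumptions (no row padding, three bytes per pixel); the function name `resize_nearest` is hypothetical, and a real application would use an optimized routine instead.

```python
def resize_nearest(frame, src_size, dst_size, channels=3):
    """Resize a packed raw frame (bytes) with nearest-neighbour sampling.

    src_size and dst_size are (width, height) tuples.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    if len(frame) != src_w * src_h * channels:
        raise ValueError("frame length does not match src_size")
    out = bytearray(dst_w * dst_h * channels)
    for y in range(dst_h):
        src_y = y * src_h // dst_h          # nearest source row
        for x in range(dst_w):
            src_x = x * src_w // dst_w      # nearest source column
            s = (src_y * src_w + src_x) * channels
            d = (y * dst_w + x) * channels
            out[d:d + channels] = frame[s:s + channels]
    return bytes(out)
```

Each output pixel is a plain copy of one input pixel, which is why this method is cheap compared with interpolating filters.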


**GStreamer overlay plugins support:**
![overlay](../data/images/gstreamer_overlay.png)

This is the best approach, but also the most difficult to use. Here, the tee element splits the stream into two branches, each running in its own thread: one is processed by the inference process, the other is displayed. The videobox keeps the frame and is able to return it to the overlay plugin, so what we see is the combination of two frames: one is the original video, untouched, and the other is an alpha image, with all the required resizing applied, displayed over the original video.

The following modularized source code implements the four proposed solutions.

In [3]:
def opencv_v4l2():
    window = "Video test"
    opencv.namedWindow(window)
    file = "../data/video/video_device.mp4"
    # OpenCV handles the video file directly with the VideoCapture v4l2-based function
    cap = opencv.VideoCapture(file)

    while cap.isOpened():
        ret, frame = cap.read()

        if ret:
            # OpenCV displays the resulting frame by using v4l2 again
            opencv.imshow(window, frame)
            # Esc (key code 27) stops the playback
            if opencv.waitKey(133) == 27:
                break
        else:
            break

    opencv.destroyAllWindows()
    cap.release()

In [2]:
def gstreamer_pipeline_opencv_v4l2():
    window = "Video test"
    opencv.namedWindow(window)
    file = "../data/video/video_device.mp4"
    # Uses a GStreamer pipeline with the leaky property, which allows dropping frames
    pipeline = ("filesrc location={} ! decodebin ! "
                "queue max-size-buffers=1 leaky=downstream ! videoconvert ! "
                "appsink emit-signals=true max-buffers=1 drop=true").format(file)
    # CAP_GSTREAMER makes OpenCV parse the string as a GStreamer pipeline
    # (this requires an OpenCV build with GStreamer support)
    cap = opencv.VideoCapture(pipeline, opencv.CAP_GSTREAMER)

    while cap.isOpened():
        ret, frame = cap.read()

        if ret:
            # OpenCV displays the resulting frame by using v4l2 again
            opencv.imshow(window, frame)
            if opencv.waitKey(133) == 27:
                break
        else:
            break

    opencv.destroyAllWindows()
    cap.release()

In [2]:
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

def gstreamer_appsink_appsrc():
    Gst.init(None)
    # First pipeline: decodes the file and hands frames to the application through appsink
    pipeline1_cmd = "filesrc location=../data/video/video_device.mp4 do-timestamp=True ! decodebin ! videoconvert ! \
        videoscale n-threads=4 method=nearest-neighbour ! \
        video/x-raw,format=RGB,width=1920,height=1080 ! \
        queue leaky=downstream max-size-buffers=1 ! appsink name=sink \
        drop=True max-buffers=1 emit-signals=True max-lateness=8000000000"

    # Second pipeline: receives frames from the application through appsrc and displays them
    pipeline2_cmd = "appsrc name=appsource1 is-live=True block=True ! \
        video/x-raw,format=RGB,width=1920,height=1080, \
        framerate=20/1,interlace-mode=(string)progressive ! \
        videoconvert ! autovideosink"

    pipeline1 = Gst.parse_launch(pipeline1_cmd)
    appsink = pipeline1.get_by_name('sink')

    pipeline2 = Gst.parse_launch(pipeline2_cmd)
    appsource = pipeline2.get_by_name('appsource1')

    def on_new_frame(sink):
        # Placeholder callback: the inference step would process the frame here
        # before it is pushed to the display pipeline
        sample = sink.emit("pull-sample")
        appsource.emit("push-buffer", sample.get_buffer())
        return Gst.FlowReturn.OK

    appsink.connect("new-sample", on_new_frame)

    pipeline1.set_state(Gst.State.PLAYING)
    bus1 = pipeline1.get_bus()
    pipeline2.set_state(Gst.State.PLAYING)

    # Block until the decoding pipeline reaches end-of-stream or reports an error
    bus1.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                            Gst.MessageType.EOS | Gst.MessageType.ERROR)

    pipeline1.set_state(Gst.State.NULL)
    pipeline2.set_state(Gst.State.NULL)

In [2]:
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

def gstreamer_overlay(videofile, src_size, appsink_size):
    # src_size and appsink_size are (width, height) tuples: the resolution of
    # the source video and the input resolution expected by the model
    # Fit the source frame inside the appsink frame, keeping the aspect ratio
    scale = min(
        appsink_size[0] / src_size[0],
        appsink_size[1] / src_size[1])
    scaled_size = tuple(int(x * scale) for x in src_size)
    scale_caps = 'video/x-raw,width={width},height={height}'.format(
        width=scaled_size[0], height=scaled_size[1])

    # One tee branch feeds the inference (appsink); the other draws the SVG
    # overlay over the untouched video and displays it
    PIPELINE = """filesrc location={videofile} ! decodebin ! tee name=t
         t. ! {leaky_q} ! imxvideoconvert_g2d ! {scale_caps} ! videobox name=box autocrop=true
            ! {sink_caps} ! {sink_element}
         t. ! queue ! imxvideoconvert_g2d
            ! rsvgoverlay name=overlay ! waylandsink
    """

    SINK_ELEMENT = 'appsink name=appsink emit-signals=true max-buffers=1 drop=true'
    SINK_CAPS = 'video/x-raw,format=RGB,width={width},height={height}'
    LEAKY_Q = 'queue max-size-buffers=1 leaky=downstream'

    sink_caps = SINK_CAPS.format(width=appsink_size[0], height=appsink_size[1])
    pipeline_cmd = PIPELINE.format(videofile=videofile, leaky_q=LEAKY_Q,
                                   sink_caps=sink_caps,
                                   sink_element=SINK_ELEMENT, scale_caps=scale_caps)

    Gst.init(None)
    pipeline = Gst.parse_launch(pipeline_cmd)
    pipeline.set_state(Gst.State.PLAYING)

    # Block until end-of-stream or an error, then release the pipeline
    bus = pipeline.get_bus()
    bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                           Gst.MessageType.EOS | Gst.MessageType.ERROR)
    pipeline.set_state(Gst.State.NULL)
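
The scale computation at the top of `gstreamer_overlay` fits the source frame inside the model input while keeping its aspect ratio. The sketch below works through that arithmetic on its own. The padding values assume that the remaining space is split evenly on both sides, which is an assumption about how the frame is boxed rather than something read from the pipeline; the function name `fit_inside` is hypothetical.

```python
def fit_inside(src_size, dst_size):
    """Return the scaled (width, height) and the (x, y) padding needed to
    center a src_size frame inside dst_size without changing its aspect ratio."""
    scale = min(dst_size[0] / src_size[0], dst_size[1] / src_size[1])
    scaled = tuple(int(x * scale) for x in src_size)
    pad_x = (dst_size[0] - scaled[0]) // 2
    pad_y = (dst_size[1] - scaled[1]) // 2
    return scaled, (pad_x, pad_y)


# A 1920x1080 video fitted into a 300x300 model input
scaled, padding = fit_inside((1920, 1080), (300, 300))
```

For this example the frame is scaled to 300x168 and padded by 66 pixels at the top and bottom, so the model sees the full picture without distortion.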

### Overlays

The last modularized GStreamer code has an overlay element. This overlay is used to display the following on the screen:
* the inference time
* the video FPS (frames per second)
* a box around each detected object
* a label for each object
* the confidence of each detection

One example is shown in the image below:

![detection](../data/images/car_detection.gif)
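
The inference time and FPS values shown in the overlay have to be measured by the application. A minimal sketch of one way to do this follows, assuming timestamps in seconds from a monotonic clock such as `time.monotonic()`. The names `RateMeter` and `format_lines` are hypothetical and are not part of the original code.

```python
from collections import deque


class RateMeter:
    """Average frame rate over the last `window` timestamps."""

    def __init__(self, window=30):
        self.stamps = deque(maxlen=window)

    def tick(self, now):
        """Record a frame at time `now` (seconds) and return the current FPS."""
        self.stamps.append(now)
        if len(self.stamps) < 2:
            return 0.0
        elapsed = self.stamps[-1] - self.stamps[0]
        # N timestamps span N - 1 frame intervals
        return (len(self.stamps) - 1) / elapsed if elapsed > 0 else 0.0


def format_lines(inference_ms, fps):
    """Build the text lines drawn by the overlay."""
    return ['Inference: {:.2f} ms'.format(inference_ms),
            'FPS: {:.1f}'.format(fps)]
```

In an application, `tick(time.monotonic())` would be called once per displayed frame, and the inference time would be taken as the difference between two `time.perf_counter()` readings around the model invocation.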

Comparing these overlay methods is very important for the results of this paper. Three overlays are compared:
* OpenCV
* Cairo
* SVG

The source code below is used for each one:

In [2]:
def generate_opencv(self, opencv_im, objs, labels):
    height, width, channels = opencv_im.shape
    for obj in objs:
        # Bounding boxes are normalized to [0, 1]; convert them to pixels
        x0, y0, x1, y1 = list(obj.bbox)
        x0, y0, x1, y1 = (int(x0 * width), int(y0 * height),
                          int(x1 * width), int(y1 * height))

        percent = int(100 * obj.score)
        label = '{}% {}'.format(percent, labels.get(obj.id, obj.id))

        opencv_im = opencv.rectangle(
            opencv_im, (x0, y0), (x1, y1), (0, 255, 0), 2)
        opencv_im = opencv.putText(opencv_im, label, (x0, y0 + 30),
                                   opencv.FONT_HERSHEY_SIMPLEX, 1.0,
                                   (255, 0, 0), 2)
    # Return the annotated image so the caller can display it
    return opencv_im
        
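def generate_cairo(self, img, objs, labels):
    # Assumption: img is a writable 4-channel (BGRx) uint8 NumPy array, which
    # matches the 4-bytes-per-pixel layout of cairo.FORMAT_RGB24
    height, width = img.shape[:2]
    stride = cairo.ImageSurface.format_stride_for_width(cairo.FORMAT_RGB24, width)
    surface = cairo.ImageSurface.create_for_data(img, cairo.FORMAT_RGB24,
                                                 width, height, stride)
    context = cairo.Context(surface)
    # Paint the overlay surface at the bottom-right corner of the frame
    overlay = self.overlay()
    x = width - overlay.get_width()
    y = height - overlay.get_height()

    context.set_source_surface(overlay, x, y)
    context.paint()
    # Make sure the drawing is written back into img
    surface.flush()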

def generate_svg(self, src_size, inference_size,
                 inference_box, objs, labels, text_lines):
    dwg = svgwrite.Drawing('', size=src_size)
    src_w, src_h = src_size
    inf_w, inf_h = inference_size
    box_x, box_y, box_w, box_h = inference_box
    scale_x, scale_y = src_w / box_w, src_h / box_h

    for y, line in enumerate(text_lines, start=1):
        self.shadow_text(dwg, 10, y * 20, line)
    for obj in objs:
        x0, y0, x1, y1 = list(obj.bbox)
        x, y, w, h = x0, y0, x1 - x0, y1 - y0
        x, y, w, h = (int(x * inf_w), int(y * inf_h),
                      int(w * inf_w), int(h * inf_h))
        x, y = x - box_x, y - box_y
        x, y, w, h = x * scale_x, y * scale_y, w * scale_x, h * scale_y
        percent = int(100 * obj.score)
        label = '{}% {}'.format(percent, labels.get(obj.id, obj.id))
        self.shadow_text(dwg, x, y - 5, label)
        dwg.add(dwg.rect(insert=(x, y), size=(w, h),
                        fill='none', stroke='red', stroke_width='2'))
    return dwg.tostring()
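
The coordinate mapping inside `generate_svg` is the least obvious part of the overlay code: a normalized bounding box is first converted to pixels of the model input, then shifted by the position of the video inside that input, and finally scaled up to the source resolution. The sketch below isolates the same arithmetic in a standalone function so it can be checked with numbers. The function name `bbox_to_source` and the sizes in the example are hypothetical.

```python
def bbox_to_source(bbox, src_size, inference_size, inference_box):
    """Map a normalized (x0, y0, x1, y1) box to (x, y, w, h) in source pixels,
    following the same steps as generate_svg."""
    src_w, src_h = src_size
    inf_w, inf_h = inference_size
    box_x, box_y, box_w, box_h = inference_box
    scale_x, scale_y = src_w / box_w, src_h / box_h

    x0, y0, x1, y1 = bbox
    x, y, w, h = x0, y0, x1 - x0, y1 - y0
    # Normalized coordinates -> pixels of the model input
    x, y, w, h = (int(x * inf_w), int(y * inf_h),
                  int(w * inf_w), int(h * inf_h))
    # Remove the offset of the video area inside the model input
    x, y = x - box_x, y - box_y
    # Model-input pixels -> source-video pixels
    return x * scale_x, y * scale_y, w * scale_x, h * scale_y


# 1920x1080 video shown as a 300x168 area, 66 pixels from the top of a 300x300 input
x, y, w, h = bbox_to_source((0.5, 0.5, 0.75, 0.75),
                            (1920, 1080), (300, 300), (0, 66, 300, 168))
```

In this example the box starts at the center of the model input, which maps to approximately (960, 540), the center of the source video.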

## Results

The following results.

## Conclusion

This is the conclusion:
