Resource Constrained Object Detection


A research project supervised by Dr David Thomas (Imperial College) on a lightweight convolutional neural network solution for detecting obstacle balls with computer vision. The models are trained with PyTorch / TFLite and deployed on the following resource-constrained environments: the Intel DE10-Lite FPGA and the Raspberry Pi 3.

Demos: hard-coded ver. and DL ver.

Limitations / Constraints

  • Limited dataset of about 500 images (approx. 100 per class)
  • Dataset includes images of small objects
  • Limited compute power / resources
  • Constrained by a low target inference time (~30 ms)
  • Constrained by a high target frame rate (30-60 fps)

Datasets

Manually captured images are split into three categories by lighting, and thus brightness level: dark, normal and bright. This is mainly to ensure robust inference of the NNs under varying light settings. Each image contains a single ball in one of five colours: red, green, blue, yellow and pink.

The dataset contains the raw images (1920x1080) and a label CSV file with the bounding-box dimensions and the colour of the ball in each image. The dataset is published with a DOI via Zenodo.

Ball Dataset

Implementation and Model Training

Input and output formats

The input tensor to the CNN is an RGB image of size 320x240. The output tensors are a tensor of size [1x4] for the object bounding-box regression and a tensor of size [1x5] for the classification scores of each ball class.
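As a purely illustrative sketch of these shapes in PyTorch (the NCHW layout and variable names here are assumptions, not the project's code):

```python
import torch

# One RGB frame of size 320x240 in the NCHW layout expected by PyTorch conv layers.
image = torch.rand(1, 3, 240, 320)

# Shapes produced by the two output heads:
bbox = torch.zeros(1, 4)    # bounding-box regression output
scores = torch.zeros(1, 5)  # classification scores, one per ball colour
```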

Image data augmentation

To make the most of a limited dataset and to prevent the model from overfitting to the training data, the input images are augmented, i.e. flipped, cropped, etc., so that the training images vary slightly between passes and the model is never fed exactly the same tensors.
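A minimal sketch of such a pipeline with torchvision (the exact transforms and parameters are assumptions; geometric transforms such as flips and crops must also be applied to the bounding-box labels):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.2),  # vary lighting, echoing the dark/normal/bright split
    T.RandomHorizontalFlip(p=0.5),                # flip; the bbox label must be mirrored accordingly
    T.ToTensor(),                                 # PIL image -> float tensor in [0, 1]
])
```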

Loss functions

Also known as the cost function, the loss is what training minimises, so for the dual-output CNN (bbox regression & classification) an appropriate loss function is necessary to train the NNs at all and to reach the desired performance. Cross-entropy loss was used for the classification task; for the bbox regression task, L1 loss (mean absolute error), L2 loss (mean squared error) or IoU loss (intersection over union) was used.
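A sketch of one way to combine the two losses for the dual-output network (the weighting factor `alpha` is an assumption, not taken from the project):

```python
import torch.nn.functional as F

def detection_loss(pred_bbox, pred_scores, true_bbox, true_class, alpha=1.0):
    cls_loss = F.cross_entropy(pred_scores, true_class)  # classification head
    reg_loss = F.l1_loss(pred_bbox, true_bbox)           # L1 / MAE; F.mse_loss would give L2
    return cls_loss + alpha * reg_loss
```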

Simple CNN

The initial attempt was to design a CNN architecture with a few conv2d layers, each followed by an activation and max pooling, and fully connected layers at the end.
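A minimal sketch of this kind of architecture (the layer sizes are assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 60 * 80, 128), nn.ReLU(),  # 240x320 input halved twice -> 60x80
        )
        self.bbox_head = nn.Linear(128, 4)  # bounding-box regression
        self.cls_head = nn.Linear(128, 5)   # ball-colour scores

    def forward(self, x):
        h = self.fc(self.features(x))
        return self.bbox_head(h), self.cls_head(h)

model = SimpleCNN()
bbox, scores = model(torch.rand(1, 3, 240, 320))
assert bbox.shape == (1, 4) and scores.shape == (1, 5)
```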

Results:

  • Inference time of ~30 ms
  • Very low robustness and accuracy; fitted to the training set only
  • Does not work on images with different backgrounds (high variance)

Transfer learning with pre-trained state-of-the-art CNNs

Some state-of-the-art CNNs, such as the ResNets and MobileNets, were trained on the dataset with their pre-trained early layers frozen (they are great feature extractors) and trainable fully connected layers at the end. Training, and the progress in validation loss over the number of epochs, was much better than for a simple CNN (just a few conv2d layers with fc layers). However, when the Torch model was converted to a TFLite model for deployment on the Raspberry Pi, the inference time was over 3000 ms due to the limited compute power. Although transfer learning is beneficial given a limited, small dataset, the computational cost is too high for a CPU to run in real time with low inference time.
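A sketch of this setup using torchvision's MobileNetV2 (the choice of backbone and the head sizes are assumptions):

```python
import torch.nn as nn
from torchvision import models

backbone = models.mobilenet_v2(pretrained=True)
for p in backbone.features.parameters():
    p.requires_grad = False  # freeze the pre-trained feature extractor

# Replace the classifier with a trainable head for this task;
# the 9 outputs (4 bbox values + 5 class scores) are split downstream.
backbone.classifier = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(backbone.last_channel, 9),
)
```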

EfficientDet

Google's EfficientDet, with EfficientNet as its backbone and BiFPN as its feature network, uses compound scaling and was able to achieve both high accuracy and low inference time as a deep learning object detector.

  • EfficientDet-Lite0: ~1000 ms
  • EfficientDet-Lite2: ~2700 ms
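The EfficientDet-Lite family can be trained with Google's TFLite Model Maker; a sketch, assuming the labels are in Model Maker's CSV format (`labels.csv` is a placeholder):

```python
from tflite_model_maker import object_detector

spec = object_detector.EfficientDetLite0Spec()
train_data, val_data, test_data = object_detector.DataLoader.from_csv('labels.csv')
model = object_detector.create(train_data, model_spec=spec,
                               validation_data=val_data, epochs=50)
model.export(export_dir='./model')  # writes a .tflite file for deployment
```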

Optimizations

Evaluation Metrics

  • Accuracy and robustness metrics
  • Speed metric
  • Resource utilisation metric

Quantization Techniques

  • Post-training quantization (a sketch follows this list)
  • Quantization-aware training
  • Torch to TFLite conversion

To convert a PyTorch model to TFLite, run the following command in a terminal:

python3 torch_to_tflite.py --torch ./trained_model/CNN2.pt --tflite ./model/CNN2.tflite
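For post-training quantization specifically, the TFLite converter can quantize the model at conversion time; a minimal sketch (paths are placeholders):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('./model/saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()
with open('./model/CNN2_quant.tflite', 'wb') as f:
    f.write(tflite_model)
```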

Room for improvement

Biased dataset

I started this project by collecting the ball dataset, drawing purely on my experience with hard-coded computer vision, which works (only in the optimal light setting) directly on individual pixel values that are Gaussian-filtered to minimise noise. I thought that collecting images in fixed categories of light settings (dark, normal, bright) would help the neural networks generalise better, but it turns out that was not the case despite image augmentation, possibly due to this nature of CNNs.

Importance of neural network accelerators

When deploying the trained models on the Raspberry Pi 3, only a minimal CNN, e.g. 2 conv2d layers followed by 2 fc layers, was just about able to satisfy the resource constraints and the target performance in terms of frame rate and inference time. This points to how the current processors on edge devices are not designed solely for DL inference, and how the software stack and compilers on top of the hardware adapt CNNs to optimise for the CPU. Inference performance can be improved by using an AI accelerator, such as Google's Edge TPU, alongside the CPU on edge devices.
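For reference, the per-frame inference time on the Pi can be measured with the TFLite interpreter; a sketch using the lightweight tflite_runtime package (the model path follows the repo layout above; the random frame is a stand-in for camera input):

```python
import time
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path='./model/CNN2.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

frame = np.random.rand(*inp['shape']).astype(np.float32)  # dummy input frame
interpreter.set_tensor(inp['index'], frame)

start = time.perf_counter()
interpreter.invoke()
print(f"inference time: {(time.perf_counter() - start) * 1e3:.1f} ms")
```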

Related papers
