Object detection is a fundamental task for an autonomous vehicle. This repo contains the configurations and code required to detect road users in camera images using the TensorFlow Object Detection API. The object detection models are trained on the 2D bounding box annotations provided in the Waymo Open Dataset. We train the models to detect three classes: Vehicles, Pedestrians and Cyclists.
Please use the provided Dockerfile for a hassle-free environment setup.
Build the image with:
docker build -t project-dev -f Dockerfile .
Create a container with:
docker run --gpus all -v <PATH TO LOCAL PROJECT FOLDER>:/app/project/ --network=host -ti project-dev bash
Add any other flags you find useful for your system (e.g., --shm-size).
We use images from the forward-facing camera in the Waymo Open Dataset. The dataset contains images from various times of the day as well as a number of different driving conditions. Sample images from the dataset are shown below:
The majority of the annotations in the images are Vehicles, which make up around 75% of the annotations. Pedestrians account for 23.5%, while the remaining 0.7% are bicyclists.
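The class shares above can be reproduced by counting annotation labels. The following is a minimal sketch, not the analysis code from this repo: the label ids in `CLASS_NAMES` and the toy input are assumptions for illustration and should be replaced by the label map actually used.

```python
from collections import Counter

# Hypothetical label ids; replace with the ids from your own label map.
CLASS_NAMES = {1: "vehicle", 2: "pedestrian", 4: "cyclist"}


def class_shares(label_ids):
    """Return the percentage of annotations that belong to each class."""
    counts = Counter(label_ids)
    total = sum(counts.values())
    return {CLASS_NAMES[k]: 100.0 * v / total for k, v in counts.items()}


# Toy labels for illustration only, not real dataset statistics.
toy_labels = [1, 1, 1, 2]
shares = class_shares(toy_labels)
```

In practice `label_ids` would be collected by iterating over every annotation in the training records.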
The classwise scale and aspect ratio distributions are highlighted in the following figures. Pedestrian and bicyclist annotations have a smaller scale, in the range 0.005-0.01. Vehicles have a larger scale range, indicating that the variation in their sizes is larger.
For aspect ratios, pedestrians and cyclists have a similar range, concentrated between 2 and 4, while that of Vehicles is generally smaller, in the range 0.5-2.
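For reference, here is one way the per-box scale and aspect ratio could be computed. This is a sketch under two assumptions that the text does not state: scale is taken as box area divided by image area, and aspect ratio is taken as height divided by width (chosen only because pedestrians, which are taller than wide, fall in the 2-4 range).

```python
def box_scale_and_aspect(box, image_width, image_height):
    """Compute (scale, aspect) for a box given as (xmin, ymin, xmax, ymax) in pixels.

    Assumptions: scale = box area / image area, aspect = height / width.
    """
    xmin, ymin, xmax, ymax = box
    width = xmax - xmin
    height = ymax - ymin
    scale = (width * height) / (image_width * image_height)
    aspect = height / width
    return scale, aspect


# Toy box: 100 px wide and 200 px tall in a 1000 x 1000 image.
scale, aspect = box_scale_and_aspect((0, 0, 100, 200), 1000, 1000)
```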
The classwise position distributions are shown in the heat maps below:
Vehicles are more likely to appear at the center of the images. Pedestrians are also distributed fairly uniformly around the image center. The distribution is much more scattered for bicyclists, which could be because there are very few bicyclist instances in the dataset.
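A position heat map of this kind can be built by binning box centers into a coarse grid. The sketch below is illustrative only; the function name, the grid size and the pixel box format are assumptions rather than the code used to produce the figures.

```python
def center_heatmap(boxes, image_width, image_height, grid=4):
    """Count box centers per grid cell; boxes are (xmin, ymin, xmax, ymax) in pixels."""
    heat = [[0] * grid for _ in range(grid)]
    for xmin, ymin, xmax, ymax in boxes:
        cx = (xmin + xmax) / 2.0
        cy = (ymin + ymax) / 2.0
        # Clamp so that a center on the right or bottom edge stays in the last cell.
        col = min(int(cx / image_width * grid), grid - 1)
        row = min(int(cy / image_height * grid), grid - 1)
        heat[row][col] += 1
    return heat


# Toy boxes: two near the image center, one in the top-left corner.
toy_boxes = [(40, 40, 60, 60), (45, 45, 55, 55), (0, 0, 10, 10)]
heat = center_heatmap(toy_boxes, image_width=100, image_height=100)
```

Running this once per class and plotting each `heat` grid gives one heat map per class.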
We use a hold-out style validation approach. We use a subset of the Waymo Open Dataset consisting of 500 sequences. 90% of the data is used for training and the remaining 10% is used for validation.
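A 90/10 split could be produced along the following lines. This is a sketch that assumes the split is made at the sequence level with a fixed random seed; the function name, the placeholder sequence ids and the seed are hypothetical.

```python
import random


def split_sequences(sequence_ids, train_fraction=0.9, seed=0):
    """Shuffle sequence ids deterministically and split them into train and validation."""
    ids = sorted(sequence_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return ids[:n_train], ids[n_train:]


# 500 placeholder sequence ids standing in for the real record names.
all_ids = ["sequence_{:03d}".format(i) for i in range(500)]
train_ids, val_ids = split_sequences(all_ids)
```

Splitting by sequence rather than by frame keeps near-duplicate consecutive frames from landing in both sets.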
This is a baseline model taken from the TensorFlow model zoo. The model is pretrained on the MSCOCO dataset and therefore serves as a good starting point. It uses a ResNet50 backbone with 5 SSD detection layers and is trained using a cosine-decayed learning rate coupled with a momentum optimizer.
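To illustrate what a cosine-decayed learning rate looks like, here is a minimal sketch of the plain cosine schedule. It leaves out any warm-up phase, and the base learning rate and step count are hypothetical values rather than the ones in the baseline config.

```python
import math


def cosine_decay_lr(step, base_lr, total_steps):
    """Plain cosine decay from base_lr at step 0 down to 0 at total_steps."""
    step = min(step, total_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


# Hypothetical settings for illustration.
BASE_LR = 0.04
TOTAL_STEPS = 50000

lr_start = cosine_decay_lr(0, BASE_LR, TOTAL_STEPS)
lr_mid = cosine_decay_lr(TOTAL_STEPS // 2, BASE_LR, TOTAL_STEPS)
lr_end = cosine_decay_lr(TOTAL_STEPS, BASE_LR, TOTAL_STEPS)
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.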
The model performance after training for 50K epochs is shown in the following figures.
The model achieves a mAP of only 11% in spite of very high precision for large and medium-sized boxes. Precision and recall for small boxes, however, are very low at 4% and 10% respectively.
Also, from the plots of classification and total loss, we notice that after around 30K epochs the eval loss (red) for classification begins to increase even though the training loss (blue) continues its downward trend. This indicates that the model begins to overfit at this point.
To improve model performance, the following changes were made:
To improve performance on smaller objects, the anchor scale, originally set to 4.0, was reduced to 3.0. This choice was based on the classwise scale distributions introduced in the Data Analysis. To understand the aspect ratios, a plot of the aspect ratio distributions within the dataset was generated:
Based on the plot, we selected 0.5, 1.0 and 3.0 as the default aspect ratios for the anchor boxes at each level.
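To make the effect of the scale and aspect ratio choices concrete, the sketch below derives anchor shapes for a single feature level. It is not the detection API's implementation: the stride value is hypothetical, and the aspect ratio is assumed to mean height divided by width. The API may use the opposite convention, so check before copying values into a config.

```python
import math


def anchor_shapes(anchor_scale, stride, aspect_ratios):
    """Return (width, height) per aspect ratio for one feature level.

    Assumptions: base size = anchor_scale * stride, ratio = height / width,
    and every anchor keeps the same area of base size squared.
    """
    base = anchor_scale * stride
    shapes = []
    for ratio in aspect_ratios:
        height = base * math.sqrt(ratio)
        width = base / math.sqrt(ratio)
        shapes.append((width, height))
    return shapes


# Hypothetical stride of 8 pixels for the finest level.
ratios = [0.5, 1.0, 3.0]
shapes = anchor_shapes(anchor_scale=3.0, stride=8, aspect_ratios=ratios)
```

With these inputs the square anchor is 24 pixels on each side, compared with 32 pixels at a scale of 4.0, which is why lowering the scale favors smaller objects.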
We ran an experiment sweeping the learning rate between 1e-5 and 1e-1 in order to select the learning rate with the steepest drop in loss. Accordingly, a step learning rate schedule starting at 1e-3 and then decreasing to 1e-4 was selected.
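The selection rule and the resulting schedule can be sketched as follows. The loss values are made up for illustration, "steepest drop" is interpreted here as the largest decrease between consecutive sweep points, and the step boundary is hypothetical because the text does not state it.

```python
def steepest_drop_lr(learning_rates, losses):
    """Return the learning rate at the start of the interval with the largest loss decrease."""
    best = min(range(len(losses) - 1), key=lambda i: losses[i + 1] - losses[i])
    return learning_rates[best]


def step_lr(step, boundary, initial_lr=1e-3, final_lr=1e-4):
    """Step schedule: initial_lr before the boundary step, final_lr from then on."""
    return initial_lr if step < boundary else final_lr


# Made-up sweep results, not measurements from this project.
sweep_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
sweep_losses = [2.0, 1.9, 1.5, 0.9, 3.0]
chosen_lr = steepest_drop_lr(sweep_lrs, sweep_losses)

# Hypothetical boundary step for the drop from 1e-3 to 1e-4.
BOUNDARY = 20000
```

In the toy sweep the loss diverges at 1e-1, so the chosen rate sits one decade below the point where the loss falls fastest.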
We added the following augmentations to reduce overfitting:
i. Random Image Cropping
ii. Random Image Scaling
iii. Random Hue and Contrast
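As an illustration of how a geometric augmentation interacts with the annotations, here is a simplified random crop that also clips the bounding boxes. It is a sketch only and not the detection API's crop operation; the function name, the `min_fraction` parameter and the pixel box format are assumptions.

```python
import random


def random_crop_boxes(boxes, image_width, image_height, min_fraction=0.5, rng=None):
    """Pick a random crop window and clip (xmin, ymin, xmax, ymax) boxes to it.

    Returns the window as (x0, y0, width, height) and the surviving boxes
    expressed in crop coordinates. Boxes with no overlap are dropped.
    """
    if rng is None:
        rng = random.Random()
    crop_w = int(image_width * rng.uniform(min_fraction, 1.0))
    crop_h = int(image_height * rng.uniform(min_fraction, 1.0))
    x0 = rng.randint(0, image_width - crop_w)
    y0 = rng.randint(0, image_height - crop_h)
    kept = []
    for xmin, ymin, xmax, ymax in boxes:
        nx0 = max(xmin, x0) - x0
        ny0 = max(ymin, y0) - y0
        nx1 = min(xmax, x0 + crop_w) - x0
        ny1 = min(ymax, y0 + crop_h) - y0
        if nx1 > nx0 and ny1 > ny0:
            kept.append((nx0, ny0, nx1, ny1))
    return (x0, y0, crop_w, crop_h), kept


# Toy boxes in a 1000 x 1000 image, with a seeded generator for repeatability.
toy_boxes = [(10, 10, 50, 50), (900, 900, 1000, 1000)]
window, cropped = random_crop_boxes(toy_boxes, 1000, 1000, rng=random.Random(0))
```

The same window would be used to slice the image array so that pixels and boxes stay aligned.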
A performance comparison of the baseline model with the updated model is shown in the figure below. The lines in orange represent the updated model. We can see from the precision and recall curves that the model precision for small boxes increases from 4% to 9% and recall from 10% to 17%. The mAP @ 0.5 IoU increases from 20% to ~37%.