Master Individual Project on Simultaneous Localization and Mapping (SLAM) in a dynamic outdoor environment
For the evaluation of the system, do I need to validate the tracking accuracy of the dynamic objects?
How can I implement it?
The public SLAM datasets do not provide ground truth for the dynamic objects, only for the robot (camera) pose.
ANS
(Ref: email of Dr. Julier)
As you've seen in your literature review, basically nobody has ground truth of the other objects. The reason is that they need to be "cooperative": they need to have their own ground truthing (e.g., GPS) on board or you have some other kind of super accurate sensing system. The other thing would be to manually set things up. E.g., with a dense laser scanner, you could manually fit a 3D model and figure out stuff from there.
Need to do
1. Have a look at DynSLAM (https://github.com/AndreiBarsan/DynSLAM). It is a dense SLAM system which attempts to reconstruct moving objects.
2. Try CUDA 8.
3. Implement DynSLAM, try it on different datasets and evaluate it.
Try testing it on scenes in London, which are quite different from the scenes in the KITTI dataset.
(London has more tall buildings, and cars may drive differently.)
Try the algorithm for sparse mapping.
4. Have a look at DATMO (https://www.ri.cmu.edu/pub_files/pub4/wang_chieh_chih_2007_1/wang_chieh_chih_2007_1.pdf), which describes the problem of jointly doing SLAM and object tracking. It is very detailed in the maths. The main thing to check at this stage is the introduction, which states the problem.
Done
Have installed CUDA 8 on my laptop.
Have tried DynSLAM on the provided demo sequence.
For notes on setting up DynSLAM, see https://github.com/mounchiliu/EverythingAboutSLAM/blob/master/Tip2_DynSLAM
Things to do
Have a look at DynSLAM: does it provide the trajectories of the dynamic objects?
Have a look at the code of DynSLAM.
Try DynSLAM on other datasets (e.g. the KITTI dataset).
Find limitations of the system.
- What if we increase the number of moving objects?
- What if the objects are occluded?
- Laser scanner? --- Mentioned in the paper:
'We base our experiments on the video sequences from (a) the KITTI odometry and (b) the KITTI tracking benchmarks, using the LIDAR as a ground truth for evaluating the quality of both the reconstructed static maps, as well as of the dynamic object instances.'
Things to do
Real-time?
Frame-to-frame tracked reconstruction of dynamic objects? Yes, the system reconstructs all the potentially dynamic objects.
Which semantic segmentation method does it use? The MNC network (Multi-task Network Cascades).
Things done
Successfully extracted objects' motion and the camera poses.
Drew the camera poses together with the object trajectories from frame to frame.
Problem: the provided dataset contains few dynamic objects, so the drawing cannot show the trajectories of the dynamic objects. I will try some other datasets to test the robustness of the system.
Things to do
Have a look at the segmentation network, try it out.
Test the system with other datasets which have several dynamic objects.
Have tested on KITTI tracking sequences which have several dynamic objects in one sequence.
Current problem:
- Large errors in the estimated vehicle motion when the camera rotates.
- Some dynamic objects are classified as 'uncertain'. The system does not seem to track these uncertain objects, so according to the results, for a sequence with multiple dynamic objects the system can only track some of them.
Things to do
Track the objects with the 3D model (the reconstruction).
1. Reconstruction
All the potentially dynamic objects are reconstructed starting from the first frame that observes them. For each object, the corresponding RGB and depth data are extracted using the mask resulting from the segmentation procedure. With this information and the estimated frame-to-frame motions of the objects, each object can be reconstructed and its reconstruction updated from frame to frame.
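As an illustration, here is a minimal sketch of the masked extraction and pose accumulation described above, assuming boolean instance masks and 4x4 homogeneous frame-to-frame motions; the function names are mine, not DynSLAM's:

```python
import numpy as np

def extract_object_rgbd(rgb, depth, mask):
    """Keep only the RGB and depth pixels belonging to one segmented
    object. rgb: HxWx3, depth: HxW, mask: HxW bool instance mask."""
    obj_rgb = np.where(mask[..., None], rgb, 0)
    obj_depth = np.where(mask, depth, 0.0)
    return obj_rgb, obj_depth

def accumulate_object_pose(pose_prev, T_frame_to_frame):
    """Chain the estimated per-frame object motion (4x4 homogeneous
    matrix) onto the pose used when fusing the new masked view into
    the object's reconstruction."""
    return pose_prev @ T_frame_to_frame
```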
- Results:
Rotation:
Some other object:
The red lines show the sparse flow estimation.
Potential problem:
In this case, (I guess) the visible surface of the dynamic object changes from frame to frame, which may cause errors in the sparse flow estimation: we cannot find enough matches for motion estimation. Therefore, the reconstruction is not as good as that of the other object.
- video
(The system also updates the dynamic object reconstructions in the static scene from frame to frame.)
The estimated trajectory of each object: (red line for camera poses, the others for tracked objects)
2. Crop the sequence to have fewer dynamic objects to track.
- video
The estimated trajectory of each object:
Things to do
- Have a look at libviso2 to see whether I can improve the sparse flow estimation. (Try optical flow?)
3. Test on detected static objects (e.g. parked car)
For these objects, I can set the motion to the identity matrix once the system detects that the object is static, to improve accuracy.
However, even without explicitly setting the identity matrix, the estimated motion matrix should be approximately the identity from frame to frame.
Here is the result:
There is a small drift for the static objects, and the error seems to grow over time.
For static objects, the identity matrix can be set to improve accuracy. However, for dynamic objects there is no such trick to improve the accuracy.
[Note: this may be because of errors in the camera poses. For dense SLAM, we may want to evaluate the camera poses and the object poses separately, so that the errors do not overlap.]
Question:
For dense SLAM, is there any optimization method to refine the estimated motions of the objects?
e.g. ORB-SLAM uses bundle adjustment to minimize the reprojection error. Can we do a 'local optimization' for each dynamic object?
->
[InfiniTAM (dense SLAM) has its own strategies for optimization.]
Try InfiniTAM while blocking out the static scene and the other objects, focusing on just one object.
1. Combine detected bounding boxes to get more information about the object
Previously, based on the results of the segmentation, there could be more than one detected bounding box on the same object.
I combined these bounding boxes according to the intersection area of each pair of boxes: if the ratio of the intersection area to a box's area is greater than a threshold, the two boxes are merged.
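A minimal sketch of this merging rule, assuming the intersection ratio is taken against the smaller box's area; the 0.5 threshold is illustrative:

```python
def merge_boxes(boxes, thresh=0.5):
    """Greedily merge axis-aligned boxes (x1, y1, x2, y2) when the
    intersection covers more than `thresh` of the smaller box's area."""
    boxes = [list(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
                ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
                inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
                area_a = (a[2] - a[0]) * (a[3] - a[1])
                area_b = (b[2] - b[0]) * (b[3] - b[1])
                if inter > thresh * min(area_a, area_b):
                    # Replace the pair with their union box and restart.
                    boxes[i] = [min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3])]
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```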
Result:
Potential problem and ideas to solve it
There are several missed or false detections from the semantic segmentation algorithm, which may cause data association problems. (The object may be lost for several frames of tracking and reconstruction.)
e.g.
Previous frame:
Current frame:
Ideas:
We may assume that a dynamic object moves at a constant velocity over a short period (e.g. two consecutive frames). Then the object's velocity can be used to predict its bounding box in the next frame, so that a missed detection can be compensated.
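A minimal sketch of the constant-velocity prediction, with boxes as (x1, y1, x2, y2) corner tuples (my convention, not the system's):

```python
import numpy as np

def predict_box(box_prev, box_curr):
    """Extrapolate the per-frame displacement of the bounding box one
    frame forward, assuming constant velocity, so that a missed
    detection in the next frame can be bridged."""
    prev = np.asarray(box_prev, dtype=float)
    curr = np.asarray(box_curr, dtype=float)
    return curr + (curr - prev)  # per-frame corner velocity added once
```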
2. Separate the sparse flow of the scene
Previously, the system computed the camera pose by finding matches over the whole scene (including dynamic objects) and running RANSAC on them. The matches of the whole scene (before RANSAC) were then passed to the dynamic instance reconstruction process, which selects the matches located on each dynamic object using the bounding boxes from the semantic segmentation results.
e.g. The flow (after RANSAC) for estimating the camera pose
I separated the sparse flow by retaining only the matches located on the static scene; the camera pose is then obtained by running RANSAC on these matches.
e.g. (After RANSAC)
Raw flow on potential dynamic objects:
e.g. (Raw flow Before RANSAC)
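A minimal sketch of the static-only pose estimation, using OpenCV's essential-matrix RANSAC as a stand-in for the libviso2-style visual odometry the system actually uses:

```python
import numpy as np
import cv2

def camera_pose_from_static_matches(pts_prev, pts_curr, boxes, K):
    """Discard matches inside the bounding boxes of potentially dynamic
    objects, then estimate the camera motion with RANSAC on the rest.
    pts_prev/pts_curr: Nx2 float32 matched points, boxes: list of
    (x1, y1, x2, y2) dynamic-object boxes, K: 3x3 camera intrinsics."""
    def on_dynamic_object(p):
        return any(x1 <= p[0] <= x2 and y1 <= p[1] <= y2
                   for (x1, y1, x2, y2) in boxes)

    keep = np.array([not on_dynamic_object(p) for p in pts_curr])
    p0, p1 = pts_prev[keep], pts_curr[keep]
    # RANSAC rejects the remaining outliers among the static matches.
    E, inliers = cv2.findEssentialMat(p0, p1, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, p0, p1, K, mask=inliers)
    return R, t
```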
Some ideas to improve the reconstruction
Run RANSAC on the flow of each object? Or switch to dense optical flow to get more matches for tracking the object?
Notes
- For the evaluation of the system, we need to consider the camera pose and the object poses (velocity, position, ...).
Ground truth for the dynamic objects? -> use some videos for simulation? Or just judge where they should be?
- For improving the estimation of dynamic object poses, could we use ground information (e.g. shadows) to enhance the tracking? Or use the bounding box, which roughly represents the average motion of the dynamic object, to help the estimation?
- (Evaluation) Compare the results after removing matches on dynamic objects.
Add semi-dense optical flow for pose estimation on dynamic objects
To avoid expensive computation, I use semi-dense optical flow to enhance the pose estimation of dynamic objects -> if an object does not have enough sparse flow for pose estimation, the system falls back to semi-dense optical flow.
- Get all the points within the bounding box in the current left image.
- For each point, calculate the local gradient and retain the points with a salient gradient (if the gradient at a point is larger than a threshold value, keep the point).
- Find matches for these points using optical flow between "current left image - current right image", "current left image - previous left image", and "current left image - previous right image"
-> changed to: find matches using optical flow between the left and right images and between consecutive frames, then match them in a circle.
- Estimate the pose with the PnP algorithm, alternating with RANSAC iterations to keep the pose estimate with the most inliers. (A sketch of this pipeline is given below.)
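A minimal sketch of this pipeline in Python/OpenCV; it uses pyramidal LK flow plus depth back-projection in place of the full four-image circle matching, and the gradient threshold is illustrative:

```python
import numpy as np
import cv2

def semidense_object_pose(left_prev, left_curr, depth_prev, box, K,
                          grad_thresh=30.0):
    """Semi-dense pose estimation for one object.

    left_prev/left_curr: grayscale images, depth_prev: depth map of the
    previous frame, box: integer (x1, y1, x2, y2) object bounding box,
    K: 3x3 intrinsics (float). Salient-gradient points inside the box
    are tracked with LK optical flow, back-projected to 3D with the
    previous depth, and the camera motion relative to the object is
    solved with PnP+RANSAC (its inverse gives the object motion)."""
    x1, y1, x2, y2 = box
    gx = cv2.Sobel(left_prev, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(left_prev, cv2.CV_32F, 0, 1)
    mag = np.sqrt(gx ** 2 + gy ** 2)

    # Keep only points inside the box whose gradient is salient.
    ys, xs = np.mgrid[y1:y2, x1:x2]
    salient = mag[y1:y2, x1:x2] > grad_thresh
    pts = np.stack([xs[salient], ys[salient]], axis=-1).astype(np.float32)
    pts = pts.reshape(-1, 1, 2)

    # Track the salient points into the current frame.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(left_prev, left_curr, pts, None)
    ok = status.ravel() == 1
    p0, p1 = pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)

    # Back-project previous-frame points to 3D using the depth map.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth_prev[p0[:, 1].astype(int), p0[:, 0].astype(int)]
    valid = z > 0
    obj3d = np.stack([(p0[valid, 0] - cx) * z[valid] / fx,
                      (p0[valid, 1] - cy) * z[valid] / fy,
                      z[valid]], axis=-1)

    # PnP with RANSAC keeps the pose hypothesis with the most inliers.
    _, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj3d.astype(np.float32), p1[valid].astype(np.float32), K, None)
    return rvec, tvec, inliers
```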
The original system may lose track of dynamic objects due to insufficient sparse flow. Now the system can keep tracking an object, and keep updating its instance reconstruction, unless there is
- a false segmentation detection (e.g. the car is labelled as something else, like a boat),
- or the current frame has only a low-resolution view of that object,
- or the frame captures only a small part of the object (in which case the system cannot get enough matches even using semi-dense optical flow).
Problem:
Drift of the object instance:
I found this may be because the detected bounding box is sometimes larger than the object: when the system selects points to match with optical flow for pose estimation, points on the static scene are also selected, which corrupts the pose estimation of the dynamic object.
Therefore, I shrink the bounding box when the system chooses points for optical flow, to ensure that most chosen points lie on the object.
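A minimal sketch of the shrinking step; the 0.8 ratio is illustrative:

```python
def shrink_box(box, ratio=0.8):
    """Shrink a bounding box (x1, y1, x2, y2) towards its centre so
    that points sampled for optical flow mostly lie on the object."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```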
Result:
(The wrong bounding box was being used: DynSLAM creates two bounding boxes, one for removing the object from the dynamic scene and one for object reconstruction. We should use the smaller one, the one for object reconstruction; with it, this problem no longer happens.)
Things to do
- Try optical flow on the whole sequences.
- Use a larger bounding box, then do the sparse flow?
Problems when the system finds matches
- Matches are influenced by lighting conditions (changes in light intensity, e.g. shadows or specular highlights).
- We may get wrong matches when there are similar patches nearby.
Use the bounding box to constrain the matches
The bounding box of the object captures the movement of the object to some degree. Therefore, we can calculate the movement of the bounding box; if a match does not accord with this movement (e.g. its displacement is much larger or smaller than the bounding box's, or points in the opposite direction), the match is labelled as an outlier.
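A minimal sketch of this consistency filter; the tolerance values are illustrative, not tuned:

```python
import numpy as np

def filter_matches_by_box_motion(p_prev, p_curr, box_motion, mag_tol=2.0):
    """Label matches that disagree with the bounding-box motion as outliers.

    p_prev/p_curr: Nx2 matched points, box_motion: 2-vector (dx, dy) of
    the bounding-box centre between frames. A match is kept if its
    displacement magnitude is within a factor `mag_tol` of the box
    motion and its direction is not opposite to it (positive dot
    product). Returns a boolean inlier mask."""
    disp = np.asarray(p_curr, float) - np.asarray(p_prev, float)
    box = np.asarray(box_motion, float)
    box_norm = np.linalg.norm(box)
    mags = np.linalg.norm(disp, axis=1)
    mag_ok = (mags < mag_tol * box_norm) & (mags > box_norm / mag_tol)
    dir_ok = disp @ box > 0.0
    return mag_ok & dir_ok
```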
Remove shadows in each frame.
Use template matching for tracking the object.
Have tried template matching on the HSV image, and template matching to track the object.
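A minimal sketch of HSV template matching with OpenCV, matching on the hue channel since it is less sensitive to lighting changes than raw intensity; `track_with_template` is my name, not from any library:

```python
import cv2

def track_with_template(frame_bgr, template_bgr):
    """Track an object by normalized cross-correlation template
    matching on the hue channel of the HSV image. Returns the
    top-left corner of the best match and the matching score."""
    frame_hue = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[..., 0]
    templ_hue = cv2.cvtColor(template_bgr, cv2.COLOR_BGR2HSV)[..., 0]
    scores = cv2.matchTemplate(frame_hue, templ_hue, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)
    return max_loc, max_val
```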
Notes
There are some proposed SLAM systems that deal with dynamic environments. According to the paper (https://www.cs.ox.ac.uk/files/9926/Visual%20Slam.pdf), the problem of SLAM in dynamic environments can be viewed from two perspectives.
- As a robustness problem -> pose estimation in visual SLAM should remain accurate (segment the static and dynamic features in the image and treat the dynamic parts as outliers; the pose estimate is computed from the static parts).
- Extending visual SLAM into dynamic environments -> (do data association) the system should be capable of segmenting the tracked features into different clusters, each associated with a different object.
(Note: systems like ORB-SLAM work reasonably well when most of the scene does not change; they rely on geometric consistency to reject the few points on dynamic objects as outliers.)
The first perspective
e.g.
DynaSLAM uses Mask R-CNN to segment the classes that are potentially dynamic. The camera is tracked using the static part of the image; dynamic objects are not used for tracking and mapping. The system also finds a way to complete the parts of the 3D map that are temporarily occluded by moving objects.
DS-SLAM adopts SegNet to provide pixel-wise semantic segmentation, which helps filter out dynamic objects effectively. It then uses a moving consistency check to determine whether the key points in the segmentation results are moving. For dynamic targets, all ORB feature points that fall within the target's outline are removed before matching; furthermore, dynamic targets are not used to construct the map.
The approaches mentioned above both drop the key points on dynamic objects and perform localization and mapping without them. For this project, however, we consider keeping the points located on dynamic objects as well, and using them for tracking and reconstructing the dynamic objects.
The second perspective
Notes for DynSLAM
DynSLAM presents a stereo-based dense mapping algorithm which is able to simultaneously reconstruct the static background, the moving objects and the potentially moving but currently stationary objects (e.g. parked cars) separately.
- Pre-process the input by computing a dense depth map, sparse scene flow, and a semantic segmentation of the RGB data.
- Compute VO from the sparse scene flow.
- Separate the inputs (colour, depth, and sparse flow) into multiple "frames": the background, plus each potentially dynamic object in the frame.
- Estimate the 3D motion of each new detection using the scene flow and semantic segmentation information, comparing it to the camera pose to classify each object as static, dynamic, or uncertain.
How does it estimate whether an object is dynamic or not?
- For each segmented object, the scene flow masked to that object instance is used to estimate the motion of the camera w.r.t. the object, under the assumption that the object is static. If the estimation succeeds, the 3D motion of the object is the inverse of that camera motion.
- For static objects, the motion estimated this way is nearly identical to the camera's actual movement, so the object's own 3D motion comes out close to the identity.
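A minimal sketch of this classification logic, assuming 4x4 homogeneous motion matrices; the thresholds and the handling of the 'uncertain' state are my guesses at how the paper's rule could look in code:

```python
import numpy as np

def classify_object(T_cam, T_cam_wrt_obj, trans_tol=0.1, rot_tol=0.05):
    """Classify an object by comparing the camera motion estimated
    w.r.t. the object with the true camera motion. If the object is
    static, T_cam_wrt_obj ~= T_cam, so the object's own motion
    T_obj = T_cam @ inv(T_cam_wrt_obj) is near the identity.
    Tolerances (metres, radians) are illustrative."""
    if T_cam_wrt_obj is None:            # motion estimation failed
        return "uncertain"
    T_obj = T_cam @ np.linalg.inv(T_cam_wrt_obj)
    trans = np.linalg.norm(T_obj[:3, 3])
    # Rotation angle recovered from the trace of the rotation block.
    cos_a = (np.trace(T_obj[:3, :3]) - 1.0) / 2.0
    angle = np.arccos(np.clip(cos_a, -1.0, 1.0))
    if trans < trans_tol and angle < rot_tol:
        return "static"
    return "dynamic"
```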
- For each rigid object of interest (moving or potentially moving), initialize or update its reconstruction.
- Update the static map reconstruction.
For the static map and the individual object reconstructions, DynSLAM uses InfiniTAM for volumetric fusion. DynSLAM separates the static background from the dynamic objects.
- The estimated vehicle movement (I suppose this refers to the camera pose) from the visual odometry is used to fuse the static parts of the input colour and depth maps, which are identified by the instance-aware semantic segmentation component.
- Both moving and potentially moving objects are reconstructed individually; the estimated 3D motions of the individual objects are used for the per-object volumetric fusion.
- Perform voxel garbage collection to remove voxels allocated spuriously due to artifacts in the depth map.
Notes for SLAM with DATMO (Wang, 2007)
- SLAM with generalized objects calculates a joint posterior over all objects and the robot, and contains a structure for motion modelling of generalized objects. --> It is computationally demanding and generally infeasible.
- SLAM with DATMO decomposes the estimation problem into two separate estimators (posteriors for stationary objects and for moving objects). --> It is lower-dimensional than SLAM with generalized objects, and the system is able to deal with issues of perception modelling, data association, and moving object detection.
Generally, a Bayesian formulation is introduced to solve SLAM and DATMO.
- How does this system detect moving objects? Two approaches: a consistency-based approach and a moving-object-map-based approach.
- Consistency-based Detection
Detect moving objects -> combine the segmentation results with moving point detection to determine the dynamic objects.
- Detect moving points: for a new scan, find the relative pose w.r.t. the local surrounding map -> transform the local surrounding map to the current coordinate system -> convert to a polar coordinate system -> detect moving points by comparing values along the range axis of the polar coordinate system.
- Determine dynamic objects: a segment is identified as a potential moving object if the ratio of the number of moving points to the total number of points is greater than 0.5.
Problems: temporarily stationary objects cannot be detected, and if the time between consecutive measurements is very short, the motion of moving objects will be too small to detect; e.g. pedestrians moving at very low speed are difficult to detect.
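A minimal sketch of the two steps, assuming both scans are already registered in the same frame and resampled over a common set of bearings:

```python
import numpy as np

def find_moving_points(range_prev, range_curr, range_tol=0.2):
    """Flag moving points by comparing two registered scans in polar
    form. range_prev/range_curr: arrays of range values indexed by
    bearing. A point whose range changed by more than `range_tol`
    (metres, illustrative) along the same bearing is flagged."""
    return np.abs(range_curr - range_prev) > range_tol

def is_moving_object(segment_moving_flags):
    """A segment is a potential moving object if more than half of its
    points are flagged as moving (the 0.5 ratio from the paper)."""
    flags = np.asarray(segment_moving_flags)
    return flags.mean() > 0.5
```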
- Moving Object Map based Detection
The system keeps a map that contains information about previous moving objects. If a blob appears in an area that was previously occupied by moving objects, it can be considered a potential moving object.