# *Task* - Time-Series Inverse Perspective Mapping

## Summary

Develop a methodology to fuse camera image information from multiple consecutive time steps in order to compute an advanced semantic grid map using the geometry-based Inverse Perspective Mapping (IPM) approach.

- [Background and Motivation](#background-and-motivation)
- [Task](#task)
- [Required Tools and Data](#required-tools-and-data)
- [Hints](#hints)

## Background and Motivation

Grid maps play an important role in environment perception and may be used for purposes such as lane detection or free space detection. One way to compute semantic grid maps is to geometrically transform semantically segmented camera images using *Inverse Perspective Mapping (IPM)*. One exemplary semantic grid map computed from 8 semantically segmented camera images is shown below.

![](./assets/ipm.png)

The classical IPM approach has several shortcomings due to its assumption of a flat world:
- objects with vertical extent (e.g., cars) are heavily distorted;
- the flat-world assumption often fails even for seemingly flat surfaces such as roads (leading to, e.g., non-parallel lane markings in the grid map);
- effective resolution drops with distance.
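
The first shortcoming can be reproduced with a minimal pinhole-camera sketch. All camera parameters below (intrinsics `K`, mounting height, axis conventions) are illustrative assumptions and are not taken from the provided datasets. The sketch builds the road-plane homography, applies IPM to the pixel of a point on the road, and then to the pixel of a point 1 m above the road.

```python
import numpy as np

# Hypothetical pinhole camera (all values are illustrative assumptions).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
CAMERA_HEIGHT = 1.5  # meters above the road plane

# World frame: x forward, y left, z up.
# Camera frame: x right, y down, z forward (optical axis).
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])
t = -R @ np.array([0.0, 0.0, CAMERA_HEIGHT])

# Homography from road-plane coordinates (x, y, 1) with z = 0 to pixels.
H = K @ np.column_stack((R[:, 0], R[:, 1], t))

def project(world_point):
    """Project a 3D world point to pixel coordinates (u, v)."""
    p = K @ (R @ world_point + t)
    return p[:2] / p[2]

def ipm(pixel):
    """Map a pixel to road-plane coordinates, assuming it shows a point at z = 0."""
    g = np.linalg.solve(H, np.array([pixel[0], pixel[1], 1.0]))
    return g[:2] / g[2]

road_point = np.array([10.0, 1.0, 0.0])      # lies on the road
elevated_point = np.array([10.0, 1.0, 1.0])  # 1 m above the road, e.g., on a car

road_estimate = ipm(project(road_point))
elevated_estimate = ipm(project(elevated_point))
print(road_estimate, elevated_estimate)
```

With these assumed values, the road point is recovered at (10, 1), whereas the elevated point is placed at (30, 3), i.e., it is smeared away from the camera along the viewing ray.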

One idea to improve on the basic IPM approach is to fuse camera image information from multiple consecutive time steps, using information about the ego motion of the automated vehicle.

## Task

The task is to develop a methodology to fuse camera image information from multiple consecutive time steps in order to compute an advanced semantic grid map using the geometry-based Inverse Perspective Mapping (IPM) approach.

### Subtasks

> ***Note:*** *The subtasks listed below do not have to be followed strictly. They serve the purpose of guiding you along your own research for this topic.*

1. Implement a basic TensorFlow data pipeline.
2. Implement a basic TensorFlow model for semantic image segmentation.
3. Train basic TensorFlow models on the provided datasets for semantic image segmentation.
   - either train separate models for front/left/right/rear camera images
   - or train a single model for all cameras
4. Using the trained models, compute semantic segmentation predictions for all input camera images.
5. Apply IPM to ground-truth and predicted semantic segmentation images to obtain semantic grid map estimations.
6. Research methods to combine semantic grid maps of multiple consecutive time steps using available ego motion information (e.g., current ego velocity).
7. Develop an algorithm to fuse semantic grid maps of multiple consecutive time steps by also considering the ego motion of the automated vehicle.
   - start by fusing the ground-truth semantic segmentation images
   - add functionality to also fuse the predicted semantic segmentation images, such that semantic segmentation errors can also be corrected
8. Evaluate the results of the advanced IPM algorithm in comparison to the single-shot IPM method and ground-truth bird's eye view data (suggested metric: *Mean IoU*).
   - evaluate performance on flat-world vs. static objects vs. dynamic objects
   - evaluate dependence on the number of included time steps
   - evaluate dependence on the time delta between included time steps
   - ...
9. Document your research, developed approach, and evaluations in a Jupyter notebook report. Explain and reproduce individual parts of your implemented functions with exemplary data.
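
For the evaluation subtask, the suggested *Mean IoU* metric could be computed along the following lines. This is only a sketch: the function name `mean_iou` is hypothetical, and skipping classes that are absent from both maps is one possible design choice among several.

```python
import numpy as np

def mean_iou(prediction, ground_truth, n_classes):
    """Mean intersection over union of two equally shaped maps of class indices."""
    ious = []
    for c in range(n_classes):
        pred_c = prediction == c
        gt_c = ground_truth == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (a design choice)
        intersection = np.logical_and(pred_c, gt_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# tiny made-up grid maps with three classes
ground_truth = np.array([[0, 0, 1],
                         [2, 1, 1]])
prediction = np.array([[0, 1, 1],
                       [2, 1, 0]])
score = mean_iou(prediction, ground_truth, n_classes=3)
print(score)
```

In this made-up example, the per-class IoUs are 1/3, 1/2 and 1, so the mean is 11/18. Computing the metric separately on masks for flat-world, static and dynamic regions would give the breakdown asked for above.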

## Required Tools and Data

### Tools

- TensorFlow
- Image Segmentation Training Pipeline & Model *(see [ACDC Exercise: Semantic Image Segmentation](https://github.com/ika-rwth-aachen/acdc-notebooks/blob/main/section_2_sensor_data_processing/1_semantic_image_segmentation.ipynb))*
- [Python IPM implementation from ika paper Cam2BEV](https://github.com/ika-rwth-aachen/Cam2BEV/tree/master/preprocessing/ipm)
  - the `ipm.py` script allows you to compute a semantic grid map that matches the viewport of the ground-truth drone camera
- *(potentially)* OpenCV

### Data

- [two synthetic datasets](data/) containing consecutive samples of ...
  - camera images (front, rear, left, right)
  - semantically segmented camera images (front, rear, left, right)
    - ground truth for semantic image segmentation model(s)
  - ground-truth semantically segmented drone camera images
    - ground truth for evaluation
  - camera intrinsics/extrinsics
  - ego motion of vehicle

## Hints

### Relevant ACDC Sections

- **Sensor Data Processing Algorithms**
  - Image Segmentation
  - Camera-based Semantic Grid Mapping

### Thoughts on Possible Fusion Algorithms

> ***Note:*** *The suggestions detailed below do not have to be chosen for the developed methodology. They only serve as inspiration.*

There is no obvious answer to the question of which information in which representation to fuse for this task. One reasonable option is to fuse the segmentation model's raw output, i.e., before it is converted to a semantic segmentation map using `argmax`. The raw model output is usually the output of a `softmax` activation, which can be interpreted as semantic class probabilities for every image pixel. The following code snippets walk you through the idea.

First install and import the required Python packages for this demo.

```python
import sys
!{sys.executable} -m pip install numpy
import numpy as np
```



For demo purposes, let's only consider two tiny 3x4 camera images, for which we predict pixel-level association to five semantic classes.

```python
N_IMAGES = 2
IMAGE_SHAPE = (3, 4, 3)
N_CLASSES = 5
```

For simplicity, let's only implement a dummy model function that yields output as a semantic image segmentation model would. The `softmax` outputs containing the class probabilities for each pixel can be converted to the final semantic segmentation map by applying `argmax`.

```python
# dummy model function creating a random softmax output
# (class dimension of each pixel sums to 1)
def model(img):
    random = np.random.random((img.shape[0], img.shape[1], N_CLASSES))
    norm_over_class_dim = np.linalg.norm(random, ord=1, axis=-1)
    softmax_output = random / np.expand_dims(norm_over_class_dim, axis=-1)
    return softmax_output

def modelOutputToSegmentationMap(model_output):
    return np.argmax(model_output, axis=-1)
```

Let's now create two random images, compute the dummy model outputs and print some information.

```python
# run the dummy model on random camera images and
# print information about the images and computed segmentation maps
model_outputs = []
for i in range(N_IMAGES):
    camera_image = np.random.random(IMAGE_SHAPE)
    model_output = model(camera_image)
    model_outputs.append(model_output)
    segmentation_map = modelOutputToSegmentationMap(model_output)
    print(f"Image {i+1}")
    print(f"  Image shape: {camera_image.shape}")
    print(f"  Model output shape: {model_output.shape}")
    print(f"  Class probabilities for top-left pixel: {model_output[0, 0, :]}")
    print(f"  Segmentation map (class indices): \n{segmentation_map}")
```

```
Image 1
  Image shape: (3, 4, 3)
  Model output shape: (3, 4, 5)
  Class probabilities for top-left pixel: [0.13317226 0.27574571 0.23495605 0.06773161 0.28839436]
  Segmentation map (class indices):
[[4 1 4 4]
 [4 0 2 2]
 [2 2 4 1]]
Image 2
  Image shape: (3, 4, 3)
  Model output shape: (3, 4, 5)
  Class probabilities for top-left pixel: [0.11428316 0.16525105 0.2131195  0.29853636 0.20880993]
  Segmentation map (class indices):
[[3 0 1 2]
 [3 2 2 1]
 [2 0 1 2]]
```


Instead of fusing the final segmentation maps, where each pixel has already been assigned one particular class, we can fuse the information of the two images one step earlier by averaging the class probabilities in the model outputs.

```python
# fuse by averaging the class probabilities over all model outputs
def fuseModelOutput(model_outputs):
    return sum(model_outputs) / len(model_outputs)

fused_model_output = fuseModelOutput(model_outputs)
fused_segmentation_map = modelOutputToSegmentationMap(fused_model_output)
print("Fused model output")
print(f"  Averaged probabilities for top-left pixel: {fused_model_output[0, 0, :]}")
print(f"  Averaged segmentation map (class indices): \n{fused_segmentation_map}")
```

```
Fused model output
  Averaged probabilities for top-left pixel: [0.12372771 0.22049838 0.22403778 0.18313399 0.24860215]
  Averaged segmentation map (class indices):
[[4 1 4 2]
 [3 0 2 1]
 [2 1 4 2]]
```


Note that the ego motion of the automated vehicle was not considered in this example. Applied to the problem at hand, one would first shift the second semantic segmentation class probability tensor in the direction of travel (as derived from the ego motion), so that both tensors refer to the same locations before they are averaged.
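
Such a shift could be sketched as follows for class probability grids. The helpers `shift_grid` and `fuse_with_ego_motion` are hypothetical. The sketch assumes that the ego motion between the two time steps has already been converted into a displacement of whole grid cells (`d_row`, `d_col`), whose sign depends on the grid convention. It also assumes pure translation, so rotation and sub-cell motion are ignored here.

```python
import numpy as np

def shift_grid(grid, d_row, d_col):
    """Shift an (H, W, C) grid by whole cells; return the shifted grid and a validity mask."""
    h, w = grid.shape[:2]
    # index of the source cell that ends up in each destination row/column
    src_rows = np.arange(h) - d_row
    src_cols = np.arange(w) - d_col
    valid = (((src_rows >= 0) & (src_rows < h))[:, None]
             & ((src_cols >= 0) & (src_cols < w))[None, :])
    shifted = np.roll(grid, shift=(d_row, d_col), axis=(0, 1))
    shifted[~valid] = 0.0  # discard cells that np.roll wrapped around the border
    return shifted, valid

def fuse_with_ego_motion(current, previous, d_row, d_col):
    """Average two (H, W, C) probability grids after aligning `previous` to `current`."""
    aligned, valid = shift_grid(previous, d_row, d_col)
    count = 1.0 + valid[..., None]  # number of observations per cell (1 or 2)
    return (current + aligned) / count

# made-up example: two classes on a 3x3 grid, static content moves down by one row
previous = np.zeros((3, 3, 2))
previous[..., 0] = 1.0
previous[0, 1] = [0.0, 1.0]  # object observed at cell (0, 1)
current = np.zeros((3, 3, 2))
current[..., 0] = 1.0
current[1, 1] = [0.2, 0.8]   # same object, now observed at cell (1, 1)

fused = fuse_with_ego_motion(current, previous, d_row=1, d_col=0)
print(fused[1, 1])
```

Cells without a valid earlier observation keep the current probabilities, so the fused grid still sums to 1 over the class dimension in every cell.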

Additionally, instead of averaging class probabilities, one could also incorporate heuristic rules such as: *if at least one of my considered samples is predicting 'road' for a particular region, then always assume 'road' for that region in the final output*. This could potentially lead to a mapping of the static world, where dynamic objects would be filtered out. The elimination of dynamic objects is one possible successful outcome of this research project.
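
The heuristic above could be sketched as follows. The function name `fuse_with_road_priority` and the class indices are hypothetical. The maps are assumed to be already aligned with respect to ego motion, and falling back to the most recent map wherever no sample predicts 'road' is merely one possible choice.

```python
import numpy as np

ROAD = 0  # hypothetical class index for 'road'

def fuse_with_road_priority(segmentation_maps, road_class=ROAD):
    """Fuse aligned maps of class indices; 'road' wins wherever any map predicts it."""
    stacked = np.stack(segmentation_maps)  # shape (T, H, W)
    fused = stacked[-1].copy()             # default: most recent map (an assumption)
    any_road = np.any(stacked == road_class, axis=0)
    fused[any_road] = road_class
    return fused

# made-up example: class 1 is a moving car, class 2 is a static building
maps = [
    np.array([[0, 0, 1],
              [0, 0, 2]]),
    np.array([[0, 1, 0],
              [0, 0, 2]]),
]
fused = fuse_with_road_priority(maps)
print(fused)
```

In this example, the moving car disappears from the fused map because each of its cells is seen as 'road' in the other sample, while the static building remains.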