# Reproducing and extending the paper Predicting Deeper into the Future of Semantic Segmentation

In this notebook we reproduce the results of the paper [Predicting Deeper into the Future of Semantic Segmentation](https://arxiv.org/pdf/1703.07684.pdf).

This PyTorch implementation was inspired by the Lua code of the original authors,
[Luc et al.](https://github.com/facebookresearch/SegmPred), and by [AliManUtd](https://github.com/AliManUtd/segm_pred_pytorch).

Stefan van der Heijden, 4473124
<br>
Wouter Kok, 4169778
<br>
Stas Mironov, 4457668

---


## Why predicting into the future?
A key component of intelligent decision-making is the prediction and anticipation of future events. In order to build smarter robotic systems and autonomous vehicles we should make decisions based on the analysis of the current situation and hypotheses made on what could happen next. While humans can, without any problems, instantly predict the movement of vehicles or pedestrians, for current computer vision systems this still remains an open challenge.

## Why semantic segmentation?
In this blog we investigate semantic segmentation as an approach to this problem. Semantic segmentation is one of the most complete forms of visual scene understanding, where the goal is to label each pixel with the corresponding semantic label (e.g., tree, pedestrian, car, etc.). However, the pixel-level annotations needed for semantic segmentation are expensive to acquire, and this is even worse if every video frame needs to be annotated. To solve this problem, a state-of-the-art semantic image segmentation model is used to label all frames in the videos. The future segmentation prediction models can then be learned from these automatically generated annotations, which are used as both inputs and targets for our models. By modeling pixel-level object labels and moving away from just RGB predictions, the network seems to be better at learning basic physics and object interaction dynamics.

<img align="center" src="https://drive.google.com/uc?id=1M2WhAJFAfOjHeAvXWKB_WB_HzgTvlR9S" />

In this blog we try to reproduce the 2017 paper [Predicting Deeper into the Future of Semantic Segmentation](https://arxiv.org/pdf/1703.07684.pdf) by Facebook AI Research, in which the authors introduce the novel task of predicting future frames in the space of semantic segmentation. They show that, compared with predicting RGB intensities, they can predict further into the future with improved accuracy. To accomplish this they propose an autoregressive model which convincingly predicts segmentations up to 0.5 seconds into the future.

To investigate these claims and results further, we composed a number of research questions. The main research question is: _Can we replicate the average per-pixel accuracy score achieved by the original paper in Table 1 for the S2S model?_

To extend our experiments, we also pose a number of subquestions that involve different datasets, different hyperparameters and different semantic image segmentation models:
- _Can we adapt the semantic segmentation to work on other datasets?_
- _Can the hyperparameters be tweaked to make the architecture more efficient?_
- _Will the system perform as well on different datasets and how do we compare them?_


## Related Work
A lot of prior work has been conducted on predicting future video frames, each with their own intrinsic properties. We will briefly discuss the most important ones and what differentiates them from our solution.

First of all, there is the task of video forecasting. Several authors developed methods related to our work to improve the temporal stability of semantic video segmentation. [Jin et al.](https://arxiv.org/pdf/1612.00119.pdf) train a model to predict the semantic segmentation of the immediate next image from the preceding input frames, and fuse this prediction with the segmentation computed from the next input frame. [Nilsson and Sminchisescu](https://arxiv.org/pdf/1612.08871.pdf) use a convolutional RNN model to accumulate the information from past and future frames in order to improve prediction of the current frame segmentation. Something similar has been done by [Patraucean et al.](https://arxiv.org/pdf/1511.06309.pdf), who use a convolutional RNN to implicitly predict the optical flow and use it to warp and aggregate per-frame segmentations. This is in contrast with our approach, which focuses on predicting future segmentations without seeing the corresponding frames. Most importantly, we target a longer time horizon than a single frame.

Another batch of promising related work revolves around generative models for future video frame forecasting. [Ranzato et al.](https://arxiv.org/pdf/1412.6604.pdf) introduced the first baseline of next video frame prediction. [Srivastava et al.](https://arxiv.org/pdf/1502.04681.pdf) developed a Long Short Term Memory (LSTM) architecture for the task, and demonstrated a gain in action classification using the learned features. Finally, [Mathieu et al.](https://arxiv.org/pdf/1511.05440.pdf) improved the predictions using a multi-scale convolutional architecture, adversarial training and a gradient difference loss.


## Architecture

The existing code base provided by the authors of the paper is written in Lua and their pretrained models and used batches are in the .T7 format. Since we wanted to port this project to PyTorch, we had to find a way to work with the .T7 files or make the batches ourselves. Fortunately, the older versions of PyTorch supported a way to read .T7 files, so after downgrading to PyTorch 0.4.1, we were able to work with the provided train and val batches. The pretrained models, however, turned out to be corrupted and were unusable.

The model we tried to recreate is called the S2S model, meaning that we use semantically segmented input frames to predict the semantic segmentation of the output frame. The model is a multi-scale network with two spatial scales, as can be seen in the picture. This means that the network makes a series of predictions, starting from the lowest resolution, and uses the prediction of size $S^{k-1}$ as a starting point to make the prediction of size $S^{k}$. At the lowest scale $S^{k-1}$, the network takes only the segmented frames as input. Each scale module is a four-layer convolutional network alternating convolutions and ReLU operations, as can be seen in the code provided below.

<img align="center" src="https://drive.google.com/uc?id=1Hj1DxitBW1GKOFUgP4dSE86KpnPFecES" />

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):

    # Needs the input channels (#input frames * #classes, 4*19=76) and the output channels (#classes, 19).
    def __init__(self, n_channels_input, n_channels_output, scale_small=True):
        super(Model, self).__init__()

        # Remember whether the small-scale module exists, so that forward() can skip it.
        self.scale_small = scale_small

        # Small kernel size for convolution (3x3).
        kernel_small = 3
        # Large kernel size for convolution (5x5).
        kernel_large = 5
        # Stride for convolution.
        stride = 1
        # Padding of 1 for 3x3 convolution.
        padding_small = 1
        # Padding of 2 for 5x5 convolution.
        padding_large = 2

        # The large-scale module sees the input frames and, when the small scale is used,
        # the upsampled small-scale prediction as extra channels.
        n_channels_large = n_channels_input

        # The first round of convolution on the smaller scale frames, upper right part of the architecture image.
        # Give the possibility to run with only one scale, default is TWO.
        if scale_small:
            # First we set up the 4 convolutional layers for the smaller scale, as described in the paper.
            self.conv1_small = nn.Conv2d(n_channels_input, 128, kernel_small, stride, padding_small)
            self.conv2_small = nn.Conv2d(128, 256, kernel_small, stride, padding_small)
            self.conv3_small = nn.Conv2d(256, 128, kernel_small, stride, padding_small)
            self.conv4_small = nn.Conv2d(128, n_channels_output, kernel_small, stride, padding_small)

            # After applying convolution, we upsample to go to the larger scale.
            self.upsample = nn.Upsample(scale_factor=2)
            n_channels_large += n_channels_output

        # The second round of convolution on the larger scale frames, lower left part of the architecture image.
        self.conv1 = nn.Conv2d(n_channels_large, 128, kernel_large, stride, padding_large)
        self.conv2 = nn.Conv2d(128, 256, kernel_small, stride, padding_small)
        self.conv3 = nn.Conv2d(256, 128, kernel_small, stride, padding_small)
        self.conv4 = nn.Conv2d(128, n_channels_output, kernel_large, stride, padding_large)

    # The forward pass of the network.
    # Requires 4 frames in small and large format with segmented annotation: [small, large].
    def forward(self, frames_tensor):
        # The large frames are the last entry; with two scales the small frames come first.
        large_input = frames_tensor[-1]
        outputs = []

        if self.scale_small:
            # First apply the small convolutions on the smaller frames, every layer alternated with a ReLU.
            temp_feature_map = F.relu(self.conv1_small(frames_tensor[0]))
            temp_feature_map = F.relu(self.conv2_small(temp_feature_map))
            temp_feature_map = F.relu(self.conv3_small(temp_feature_map))
            conv_scale_small = self.conv4_small(temp_feature_map)
            outputs.append(conv_scale_small)

            # Upsample the small-scale prediction to the larger size.
            conv_scale_large = self.upsample(conv_scale_small)

            # Concatenate the original large frames and the upsampled prediction along the channels.
            large_input = torch.cat((large_input, conv_scale_large), 1)

        # Apply the larger convolutions on the larger frames, every layer alternated with a ReLU.
        temp_feature_map = F.relu(self.conv1(large_input))
        temp_feature_map = F.relu(self.conv2(temp_feature_map))
        temp_feature_map = F.relu(self.conv3(temp_feature_map))
        temp_feature_map = self.conv4(temp_feature_map)

        # With two scales, the large-scale output is added to the upsampled small-scale prediction.
        if self.scale_small:
            temp_feature_map = temp_feature_map + conv_scale_large
        outputs.append(temp_feature_map)

        # Returns [small prediction, large prediction], or only [large prediction] for a single scale.
        return outputs
```
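
The forward pass above expects `frames_tensor` to hold the small-scale input first and the large-scale input second. As a minimal sketch of how such an input could be assembled, the helper below stacks four segmentation frames along the channel axis and builds a half-resolution copy. The name `build_two_scale_input` and the use of average pooling for the downsampling are assumptions made for illustration; they are not taken from the original code.

```python
import torch
import torch.nn.functional as F


def build_two_scale_input(frames):
    """Stack past frames along the channel axis and build a half-resolution copy.

    frames: list of tensors shaped (batch, n_classes, height, width).
    Returns [small, large], the order in which the model indexes them.
    """
    # Concatenate the frames channel-wise: 4 frames * 19 classes = 76 channels.
    large = torch.cat(frames, dim=1)
    # Assumption: halve the resolution with 2x2 average pooling.
    small = F.avg_pool2d(large, kernel_size=2)
    return [small, large]


# Four dummy segmentation frames with 19 classes at 128 x 256 pixels.
dummy_frames = [torch.rand(1, 19, 128, 256) for _ in range(4)]
small, large = build_two_scale_input(dummy_frames)
print(small.shape, large.shape)
```

The small tensor has half the height and width of the large one, so upsampling the small-scale prediction by a factor of 2 brings it back to the large resolution before the concatenation.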

In order to use all of the input frames in determining the movement of the objects, we use an autoregressive model. In this model we take the past four segmentations to predict the next segmentation $S_{t+1}$. To predict the segmentation $S_{t+2}$ after that, we take the three most recent input segmentations combined with the predicted frame $S_{t+1}$, and so on. This autoregressive approach is visualised in the image below.

<img align="center" src="https://drive.google.com/uc?id=17nm6dBVu2u674xuCdnVQc6kiwL5dKJsW" />
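
The autoregressive scheme can be sketched independently of the network itself. In the sketch below, `predict_next` is a hypothetical wrapper around the trained model that maps the four most recent frames to the next one; in practice the frames would be segmentation tensors, but plain numbers are used here so that the sliding window is easy to follow.

```python
from collections import deque


def predict_autoregressive(predict_next, past_frames, n_steps):
    """Roll a single-step predictor forward for n_steps.

    predict_next: callable mapping a list of the most recent frames to the next frame
                  (hypothetical wrapper around the trained model).
    past_frames:  the observed segmentations, oldest first.
    """
    # Sliding window over the most recent frames; appending drops the oldest one.
    window = deque(past_frames, maxlen=len(past_frames))
    predictions = []
    for _ in range(n_steps):
        next_frame = predict_next(list(window))
        predictions.append(next_frame)
        # The prediction becomes an input for the following step.
        window.append(next_frame)
    return predictions


# Toy check with numbers standing in for frames: the "model" returns the last frame + 1.
print(predict_autoregressive(lambda frames: frames[-1] + 1, [1, 2, 3, 4], 3))
# prints [5, 6, 7]
```

Because every prediction is fed back as an input, an error made in the first step is carried into all later steps.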

The loss function measures the difference between the predicted output frame and the true target output. It is the sum of an $\ell_{1}$ loss and a gradient difference loss between the model output $\hat{Y}$ and the target output $Y$:

\begin{equation}
    \mathcal{L}(\hat{Y}, Y) = \mathcal{L}_{\ell1}(\hat{Y}, Y) + \mathcal{L}_{gdl}(\hat{Y}, Y).
\end{equation}

Using $Y_{ij}$ to denote the pixel elements in $Y$, and similarly for $\hat{Y}$, the losses are defined as:

\begin{equation}
    \mathcal{L}_{\ell1}(\hat{Y}, Y) = \sum_{i,j}|Y_{ij}-\hat{Y}_{ij}|,
\end{equation}

\begin{equation}
    \begin{split}
        \mathcal{L}_{gdl}(\hat{Y}, Y) = \sum_{i,j} \Big| |Y_{i,j} - Y_{i-1,j}| - |\hat{Y}_{i,j} - \hat{Y}_{i-1,j}| \Big| \\ + \Big| |Y_{i,j-1} - Y_{i,j}| - |\hat{Y}_{i,j-1} - \hat{Y}_{i,j}| \Big|.
    \end{split}
\end{equation}

These two loss functions each have their own task. On one hand, the $\ell_{1}$ loss tries to match all pixel predictions independently to their corresponding target values. On the other hand, the gradient difference loss penalizes errors in the gradients of the prediction. It is more sensitive to high-frequency mismatches, which tend to be more significant (e.g., errors along the contours of an object), and relatively insensitive to low-frequency mismatches between prediction and target (e.g., adding a constant to all pixels does not affect the loss).
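
The two loss terms translate almost directly into tensor operations. The following is a minimal sketch, assuming predictions and targets shaped (batch, channels, height, width) and summing over all of these dimensions; the function names are ours and the code is not the authors' implementation.

```python
import torch


def l1_loss(pred, target):
    # Sum of absolute per-pixel differences.
    return (target - pred).abs().sum()


def gdl_loss(pred, target):
    """Gradient difference loss on tensors shaped (batch, channels, height, width)."""
    # Absolute differences between vertically adjacent pixels (index i vs. i-1).
    target_dy = (target[:, :, 1:, :] - target[:, :, :-1, :]).abs()
    pred_dy = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs()
    # Absolute differences between horizontally adjacent pixels (index j-1 vs. j).
    target_dx = (target[:, :, :, :-1] - target[:, :, :, 1:]).abs()
    pred_dx = (pred[:, :, :, :-1] - pred[:, :, :, 1:]).abs()
    return (target_dy - pred_dy).abs().sum() + (target_dx - pred_dx).abs().sum()


def total_loss(pred, target):
    return l1_loss(pred, target) + gdl_loss(pred, target)


target = torch.rand(1, 19, 8, 16, dtype=torch.float64)
# Adding a constant changes the l1 term, while the gradient term stays at (numerically) zero.
shifted = target + 0.5
print(l1_loss(shifted, target).item(), gdl_loss(shifted, target).item())
```

The last two lines illustrate the insensitivity to low-frequency mismatches: a constant offset is fully visible to the $\ell_{1}$ term but cancels out in the pixel differences used by the gradient term.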



## Dataset

The paper we set out to reproduce used the [Cityscapes dataset](https://www.cityscapes-dataset.com/). Cityscapes is a large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5000 frames in addition to a larger set of 20000 weakly annotated frames. The main problem for us was that each annotated image is the $20^{th}$ image from a 30 frame video snippet, whereas we needed a number of consecutive annotated frames in order to predict a future frame. The authors of the paper solved this by segmenting the frames themselves using the [Dilation10 network](https://arxiv.org/pdf/1511.07122.pdf). However, this segmented data was not made publicly available except for a small sample, so we used that sample in our initial design to train and validate the model.
The code for Dilation10 is publicly available: [Multi-Scale Context Aggregation by Dilated Convolutions](https://github.com/fyu/dilation).

<img src="https://drive.google.com/uc?id=1OomuyYhrePOoFHYGPtedZJkXtsUekzFT" />

<img src="https://drive.google.com/uc?id=1NSXtPwD9TwruNo95SFEfUtvvuGdgmtNr" />

<img src="https://drive.google.com/uc?id=1HxR8vNrdOAXhpRXRPbnA9AnuM3E3cHDj" />

```python
# All semantic classes of the Cityscapes dataset.
classes = ['road', 'sidewalk', 'building', 'wall', 'fence', 'pole',
           'traffic light', 'traffic sign', 'vegetation', 'terrain', 'sky',
           'person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle',
           'bicycle', 'unlabeled']

# The number of classes, minus one since 'unlabeled' is not a real class.
nclasses = len(classes) - 1

# The RGB color mapping for each of the semantic classes.
colormap = {0: [0.5, 0.25, 0.5],     # road
            1: [0.95, 0.14, 0.91],   # sidewalk
            2: [0.27, 0.27, 0.27],   # building
            3: [0.4, 0.4, 0.61],     # wall
            4: [0.745, 0.6, 0.6],    # fence
            5: [0.6, 0.6, 0.6],      # pole
            6: [0.98, 0.66, 0.11],   # traffic light
            7: [0.86, 0.86, 0],      # traffic sign
            8: [0.41, 0.55, 0.14],   # vegetation
            9: [0.59, 0.98, 0.59],   # terrain
            10: [0.27, 0.51, 0.71],  # sky
            11: [0.86, 0.27, 0.23],  # person
            12: [1, 0, 0],           # rider
            13: [0, 0, 0.55],        # car
            14: [0, 0, 0.27],        # truck
            15: [0, 0.55, 0.39],     # bus
            16: [0, 0.31, 0.39],     # train
            17: [0, 0, 0.9],         # motorcycle
            18: [0.46, 0.04, 0.13],  # bicycle
            19: [0, 0, 0]}           # unevaluated

# Indices of all classes that are capable of moving (person up to and including bicycle),
# using the same 0-based indices as the colormap above.
movingObjects = [11, 12, 13, 14, 15, 16, 17, 18]

# The dimensions of the input frames: height, width and number of color channels.
oh, ow, oc = 128, 256, 3
```

We also used a synthetic dataset called [Synthia](http://synthia-dataset.net/), which consists of a collection of photo-realistic frames rendered from a virtual city and comes with precise pixel-level semantic annotations for 13 classes: misc, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, cyclist and lane-marking. The two main selling points of this dataset were the enormous amount of data and the fact that all the frames were already semantically segmented. Because this dataset contains fewer classes than Cityscapes, a direct comparison between the results was quite hard. Hence we decided to re-train our model using data only from this dataset.

We tried to implement a Dilation network in PyTorch based on [Dilation-Pytorch-Semantic Segmentation](https://github.com/Blade6570/Dilation-Pytorch-Semantic-Segmentation). This network had pretrained weights for the Cityscapes dataset. As Synthia is similar to Cityscapes, the network was applicable to Synthia as well. However, we did not generate segmentations with it: running the network on all our images was not feasible, as a single image took 10 minutes. To get optimal results we would also have needed to retrain the network on Synthia RGB images. Because Synthia already comes with ground truth labels, we skipped the dilation step and used these labels directly.

<img src="https://drive.google.com/uc?id=1_JN7IiB0daaijIy269E0acw3xOFz4hfE" />


## Experiments

The goal of our implementation is to predict the semantic segmentation of the fifth frame in a sequence, based on the 4 frames before it. We then calculate the IoU score of this predicted fifth frame against the actual segmented fifth frame. In the image below, we show such a sequence of five frames of a Cityscapes street scene, with the fifth frame being computed by our implementation.

<img src="https://drive.google.com/uc?id=1n862xXv7UG8QaVotN9ywsyrbfu6VdITM" />
<img src="https://drive.google.com/uc?id=1P3Fit_GNmvAUQRZT62d77yGsNDUywl_y" />
<img src="https://drive.google.com/uc?id=1IwVpt1K01WzegJD15O5Pch_kh7BlIlC0" />
<img src="https://drive.google.com/uc?id=1elybUHCXl-6VRqtBfQKSsU0haQjisuXb" />
<img src="https://drive.google.com/uc?id=1jlGDVKZoasFE7X4FihT1a_wySYK3G-k_" />

As you can see, the prediction of the bigger objects works quite well, and even the movement is taken into account. The main problem is with the smaller objects, like the lampposts, which are completely absent in the predicted fifth frame. Despite this problem, we managed to get an average per-pixel accuracy score of 89.3%, compared to the 91.8% accuracy in the paper.
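
To make the two evaluation metrics concrete, here is a minimal NumPy sketch of the per-pixel accuracy and the mean IoU on integer label maps. Ignoring pixels whose target label is 19 (unlabeled) and averaging the IoU only over classes that occur in either map are assumptions of this sketch; the evaluation in the original code may treat these cases differently.

```python
import numpy as np


def pixel_accuracy(pred_labels, true_labels, ignore_label=19):
    """Fraction of evaluated pixels whose predicted class matches the target."""
    valid = true_labels != ignore_label
    return float((pred_labels[valid] == true_labels[valid]).mean())


def mean_iou(pred_labels, true_labels, n_classes=19, ignore_label=19):
    """Mean intersection over union over the classes that occur in either map."""
    valid = true_labels != ignore_label
    ious = []
    for c in range(n_classes):
        pred_c = (pred_labels == c) & valid
        true_c = (true_labels == c) & valid
        union = (pred_c | true_c).sum()
        if union > 0:
            ious.append((pred_c & true_c).sum() / union)
    return float(np.mean(ious))


# Tiny example with the classes road (0) and car (13) and one unlabeled pixel (19).
true = np.array([[0, 0, 13, 13], [0, 0, 13, 19]])
pred = np.array([[0, 0, 0, 13], [0, 13, 13, 13]])
print(pixel_accuracy(pred, true), mean_iou(pred, true))
```

In this example 5 of the 7 evaluated pixels are correct, while the IoU is 0.6 for road and 0.5 for car, which shows that the two metrics can differ noticeably on the same prediction.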

We also used our implementation for mid-term predictions, where we predicted 3 frames into the future. 

<img src="https://drive.google.com/uc?id=1XjQMqJ-1_wXk3odioUxwtg3bvcDX51Q7" />
<img src="https://drive.google.com/uc?id=1iH_gIBYaPMfsZAQvAzuBS6CJRZPCzcEp" />
<img src="https://drive.google.com/uc?id=1KcCVpuamgabQRoqgh77jxJMMlYAI7WZD" />
<img src="https://drive.google.com/uc?id=1LWO_UEl7X8kw62CpGT1wgfuItapjndZ9" />
<img src="https://drive.google.com/uc?id=1o2nAawgupcX108Ne326tGCvaIQ2vg6iH" />
<img src="https://drive.google.com/uc?id=1HIyE3n86rFUgwwKHkUljs4w_yjI2pKFe" />
<img src="https://drive.google.com/uc?id=1J-_ZGbN-tp0brI7HU6DPg2OK4-MnL20w" />

The classes that dominate the frames are slowly transformed into blobs that are enlarged at each time step. When we predict up to 3 frames ahead (mid-term predictions), we cannot see notable differences with the first prediction. Some artifacts pop up due to the "zoom in" effect, as the predicted frames simulate forward movement. We also calculated the average per-pixel accuracy score for this experiment and obtained 82.0%, which is 5.9 percentage points below the 87.9% achieved by the original paper. We attribute this gap mostly to the lack of data, as we assume that the model becomes better at predicting multiple frames into the future when more data is available. If we train with less data, it might not be as noticeable in the first prediction, but the errors add up when stacking multiple predictions that are each based on the previous ones.

### Dilation10 Cityscapes dataset

We initially trained on a small sample of the segmented Cityscapes dataset provided by the authors. The model was trained for 3000 epochs. This number is based on the Lua code from the authors, which uses 5000 epochs for the whole dataset. Since our dataset was significantly smaller, we decided that 3000 epochs was enough, which was supported by the IoU curves that already showed stagnation before that point. The paper states 960000 training iterations; however, the Lua code used 1000, so we used 1000 iterations as well. These choices are based on predicted performance and available computation power.

<img src="https://drive.google.com/uc?id=1E5TOGoaYUqpQFVq0vwzi7kmtViGsYqZ2" />

### Synthia

We believed the Synthia dataset to be the best dataset for this architecture, as all the ground truths are known and all images are provided both as semantic segmentation data and as RGB color images.

We wanted to reproduce the results we obtained from training on the batches provided by Facebook, but now with our own batches from the Synthia dataset. First we examined the Synthia dataset. The "spring" subset of the old European town sequence (seq04) seemed to be the most similar to the Cityscapes dataset. It has good visibility of the classes, it is not overly bright compared to "summer", and we do not have to factor in the noise of the "soft rain" sequence.

We used our self-made batch maker to create the sequences that the Facebook researchers mention in their readme: "val contains 500 test sequences from Cityscapes: 125 batches of 4 sequences x 7 frames (4 inputs + 3 targets) x 256 x 128 (contains both RGB images and their segmentations)."
After dealing with lots of errors, we decided to keep the setup the same as in the original code, which only uses 4 input frames and 1 target frame.
We have not made these batches publicly available, because their size makes them hard to upload anywhere. If anyone is interested, we are happy to publish them.
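
The grouping of frames into sequences and batches can be sketched as follows. This is not our actual batch maker: the function names, the sliding-window `step` and the choice to drop an incomplete final batch are assumptions made for illustration, and frame identifiers stand in for the loaded segmentation frames.

```python
def make_sequences(frame_ids, n_inputs=4, n_targets=1, step=1):
    """Split an ordered list of frame identifiers into (inputs, targets) pairs."""
    length = n_inputs + n_targets
    sequences = []
    for start in range(0, len(frame_ids) - length + 1, step):
        window = frame_ids[start:start + length]
        sequences.append((window[:n_inputs], window[n_inputs:]))
    return sequences


def make_batches(sequences, batch_size=4):
    """Group sequences into batches, dropping an incomplete final batch."""
    n_full = len(sequences) // batch_size
    return [sequences[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]


# Twelve consecutive frames give eight sequences of 4 inputs + 1 target, i.e. two batches of 4.
sequences = make_sequences(list(range(12)))
batches = make_batches(sequences)
print(sequences[0], len(sequences), len(batches))
# prints ([0, 1, 2, 3], [4]) 8 2
```

Setting `n_targets=3` would give the 4 inputs + 3 targets layout described in the readme quoted above.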

We altered the original code to handle the new .pth format (saved with `torch.save`) and ran the batches through the same test code we used for the Cityscapes batch validation. The results, however, were disappointing. Depending on the choice of optimizer, different classes dominated the frames. That is why we did not try mid-term predictions.

<img src="https://drive.google.com/uc?id=15fq55YRqUMZ3fOx7KEsUSmdPbKyIlZVs" />
<img src="https://drive.google.com/uc?id=1rbHWGrDdoA4W2fLTcdGurfoGnn9OfE1Y" />
<img src="https://drive.google.com/uc?id=158jsN4_x8_7EL9c6tl5HpIOZAz5cZWzF" />
<img src="https://drive.google.com/uc?id=1GZ6tzGbVDWXPHymNxt5HMXJAyfv-e4ix" />
<img src="https://drive.google.com/uc?id=1Ns1ETG3iTPSJS3TXitGMLgruMPdgnxCW" />

This brought us to reexamine the way we trained the models. For "spring" we initially took the same optimizer and learning rate as the original code, which uses SGD with a learning rate of 0.01. The optimizer and its performance are not further analysed in the original paper. Although stochastic gradient descent can outperform other optimizers when its parameters are fine-tuned, this did not seem to be the case here. We trained on "spring" seq04 front and right from both cameras, with SGD and Adam at a handful of different learning rates.

Intermediate results indicated that the model did not perform well, but as the results on the Cityscapes train and validation sets proved to be solid with the same model, the problem had to be data related. We suspect that this is due to us not having the same information that the authors used. After reverse engineering their batches, we noticed that the values contained in the input resemble probabilities, presumably those that represent how likely each pixel is to belong to a certain class. This data is not available in Synthia, or in most segmented datasets publicly available on the web. We came up with a validation approach in two steps. First we validate on part of the training set, as this should yield good results because the same data is used for training. If that succeeds, we know that the model is correct but that the training time or the amount of data may need to be increased. Afterwards we can validate on a subset of the data, e.g. "spring", which is not used for training, and see how our model performs. Unfortunately, in the first step the model only predicted 2 classes (buildings and the rest of the scene), so validating those results further didn't make sense.

## Conclusion

Our main goal at the start of the project was to achieve approximately the same accuracy on the dilated Cityscapes dataset as reported in the paper. Despite having only a small part of the training data, this proved to be quite successful: quantitatively, with a comparable per-pixel accuracy, and visually, with appealing predicted target frames. Training with the Cityscapes dataset thus gave promising results. We observed that the model does learn the right weights when trained on the few training batches provided.

For our extension, specifically testing these results on other data, the picture is much less clear. In the end, we cannot conclude with certainty that this model does not work on other datasets. Due to the nature of the original code, a lot of work was put into just reproducing the results of the original paper. Those results are also not conclusive, as we had to deal with a limited dataset for which the authors did not clarify their exact approach.

As we knew how the batches were supposed to be constructed, we carefully rebuilt them for the Synthia dataset. Yet the output of the model seems different. That might be because of the different number of classes in the segmentations of the two datasets. The number of input channels in the model had to be decreased, yet we are quite sure that this has virtually no impact.

A bigger reason might be the fact that the original batches use probabilities for each class at each pixel. This data was never published in the repository of the original paper, and in the Synthia dataset the provided segmentations are final, meaning they assign each pixel to exactly one class. When creating the batches we are thus forced to provide a probability of 1 (100%) for that class while leaving all the others at 0. We suspect that this prevents the model from learning the dynamic structure of the scenes, and hence leaves us with a result which consists of 2 (or even 1 in some cases) classes.
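
The conversion from final labels to 0/1 "probabilities" described above amounts to a one-hot encoding. A minimal NumPy sketch, assuming an integer label map of shape (height, width) and a channels-first output; the function name is ours:

```python
import numpy as np


def labels_to_one_hot(label_map, n_classes):
    """Turn an (H, W) integer label map into an (n_classes, H, W) array of 0/1 values."""
    # Row c of the identity matrix is the one-hot vector for class c.
    one_hot = np.eye(n_classes, dtype=np.float32)[label_map]
    # Move the class axis to the front: (H, W, n_classes) -> (n_classes, H, W).
    return one_hot.transpose(2, 0, 1)


labels = np.array([[0, 2], [1, 1]])
encoded = labels_to_one_hot(labels, n_classes=3)
print(encoded.shape)
# prints (3, 2, 2)
```

Unlike the soft outputs of a segmentation network, every pixel here carries exactly one value of 1, so the encoding holds no information about how uncertain a label is.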

For future work, it would be nice to have a video dataset that is segmented using the Dilation10 model mentioned in the paper, or any other segmentation model that provides probabilities for the possible classes. We expect that combining such data with the proposed model that we ported to PyTorch would yield proper results. Changing some of the parameters, such as using Adam instead of SGD as the optimizer and fine-tuning the learning rate with a grid search, could result in a significant improvement of the predictions.