# Video Representation Learning by Dense Predictive Coding (DPC)

- Self-supervised technique for learning spatio-temporal representations of video.
- For example, these features can be used in a downstream action recognition task.
- Hopefully also suitable for capturing features that can be used in our pipeline.

## Introduction
- Video data lends itself well to self-supervised learning, and there is a lot of it (YouTube, etc.).
- Self-supervised techniques in this domain are commonly based on predicting future frames of a video.
- The main difficulty is that the future is inherently non-deterministic.
- If the goal is just to learn a representation for downstream tasks, then it is likely unnecessary to waste model capacity on predicting pixel-level data of future frames.
- In this paper these future predictions are instead made in a latent representation space.
- The DPC model is similar to techniques like contrastive predictive coding (CPC) and word2vec.

## DPC overview
- DPC is designed to predict the future latent representations based on the recent past.
- DPC is trained with a variant of noise contrastive estimation (NCE), so it never has to predict the exact future (i.e. there is no regression loss). Instead it solves a multi-class classification problem: picking the correct future state out of many distractors.
- To do this the model has to learn to capture the shared semantics of multiple future states.

## DPC framework
- The input video is partitioned into multiple non-overlapping blocks $x_1, ..., x_n$, each consisting of a fixed number of frames.
    - $x_i \in \mathbb{R}^{T \times H \times W \times C}$
- An encoding function $f$ encodes each block separately into the latent representations $z_1, ..., z_n$.
    - $z_i \in \mathbb{R}^{H' \times W' \times C}$
- An aggregation function $g$ computes a context representation $c_t = g(z_1, ..., z_t)$ given the observed latents.
    - $c_i \in \mathbb{R}^{H' \times W' \times C}$
- A predictive function $\phi$ predicts the next latent from the context, $\hat{z}_{t+1} = \phi(c_t)$.
    - $\hat{z}_{i} \in \mathbb{R}^{H' \times W' \times C}$
- Later steps are predicted autoregressively: the prediction is fed back into the aggregator, $c_{t+1} = g(z_1, ..., z_t, \hat{z}_{t+1})$, then $\hat{z}_{t+2} = \phi(c_{t+1})$, and so on.
- This is summarized in the following slide.

## DPC framework
<img src="figs/dpc/dpc-fig-2.png"></img>
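
A minimal sketch of the control flow in the figure, in plain Python. It only illustrates the order of operations (split into blocks, encode, aggregate, predict, feed the prediction back). The helper names `split_into_blocks`, `rollout`, and `mean_vectors` are made up for this example, the toy `f`, `g`, and `phi` are stand-ins rather than the networks used in the paper, and dropping trailing frames that do not fill a block is an assumption.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_blocks(frames: Sequence, block_size: int) -> List[list]:
    """Partition frames into non-overlapping blocks of block_size frames.

    Assumption for this sketch: trailing frames that do not fill a block are dropped.
    """
    num_blocks = len(frames) // block_size
    return [list(frames[b * block_size:(b + 1) * block_size]) for b in range(num_blocks)]


def rollout(
    blocks: Sequence[Sequence[Vector]],
    f: Callable[[Sequence[Vector]], Vector],
    g: Callable[[Sequence[Vector]], Vector],
    phi: Callable[[Vector], Vector],
    num_predicted: int,
) -> List[Vector]:
    """Return the predicted latents for the num_predicted steps after the given blocks."""
    # Encode each observed block separately: z_i = f(x_i).
    history = [f(x) for x in blocks]
    predictions = []
    for _ in range(num_predicted):
        context = g(history)     # c = g(everything observed or predicted so far)
        z_hat = phi(context)     # prediction for the next step
        predictions.append(z_hat)
        history.append(z_hat)    # autoregression: feed the prediction back into g
    return predictions


def mean_vectors(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean; used below as a toy stand-in for both f and g."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


# Toy "video": four frames, each frame reduced to a 2-dimensional vector.
frames = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [2.0, 2.0]]
blocks = split_into_blocks(frames, block_size=2)
predictions = rollout(blocks, f=mean_vectors, g=mean_vectors, phi=list, num_predicted=2)
print(predictions)  # [[2.0, 1.0], [2.0, 1.0]]
```

In the real model the latents keep their spatial layout and $g$ is recurrent, so the context is updated step by step rather than recomputed from the whole history as in this toy version.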

## DPC loss
- $\hat{z}_i \in \mathbb{R}^{H' \times W' \times C}$ is the predicted latent embedding at timestep $i$. Note that the spatial layout is kept.
- $\hat{z}_{i,k} \in \mathbb{R}^{C}$ is the predicted latent embedding at timestep $i$ and at spatial index $k$.
- The DPC framework uses a variant of noise contrastive estimation in which the model is tasked to classify one positive sample against multiple negative samples. The samples that the predicted latents are compared against are taken from the $z_i$ computed by the encoder function $f$.
- For the predicted embedding $\hat{z}_{i,k}$ the only positive embedding is $z_{i,k}$.
- For the predicted embedding $\hat{z}_{i,k}$ the negative embeddings are all $z_{j,m}$ such that $(i, k) \neq (j, m)$.
- This is illustrated in the previous slide.
- The loss should then encourage the model to produce a higher similarity (dot-product) for the positive pair than any of the negative pairs. They define the loss as follows. $$\mathcal{L} = -\sum_{i,k} \log \frac{\exp(\hat{z}_{i,k}^\top z_{i,k})}{\sum_{j,m} \exp(\hat{z}_{i,k}^\top z_{j,m})}$$
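
The loss above is a softmax cross-entropy over dot-product scores, which may be easier to see in code. The following is a plain-Python sketch, not the authors' implementation. It assumes the $(i, k)$ pairs have been flattened into a single index `n`, so that `z_hat[n]` and `z[n]` refer to the same time step and spatial position; the function name `dpc_loss` is made up for this example.

```python
import math
from typing import Sequence


def dpc_loss(z_hat: Sequence[Sequence[float]], z: Sequence[Sequence[float]]) -> float:
    """Contrastive loss summed over all predicted embeddings.

    z_hat[n] is a predicted embedding and z[n] is its positive target;
    every other z[m] (m != n) acts as a negative.
    """
    total = 0.0
    for n, prediction in enumerate(z_hat):
        # Dot-product similarity between this prediction and every target.
        scores = [sum(p * t for p, t in zip(prediction, target)) for target in z]
        # Log of the softmax denominator, computed stably (log-sum-exp).
        max_score = max(scores)
        log_denominator = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        # Negative log-probability of picking the positive target.
        total -= scores[n] - log_denominator
    return total


targets = [[1.0, 0.0], [0.0, 1.0]]
print(dpc_loss(targets, targets))                     # predictions aligned with their targets
print(dpc_loss([[0.0, 0.0], [0.0, 0.0]], targets))    # uninformative predictions
```

With uninformative (all-zero) predictions every candidate gets the same score, so the loss is $N \log N$ for $N$ candidates; predictions that align with their own target score lower.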

## DPC curriculum learning
- The negatives can be divided into groups of different difficulty.
- Easy negatives: embeddings from other videos in the batch.
- Spatial negatives: embeddings from the same video but at different spatial indices.
- Temporal negatives (hard negatives): embeddings from the same video and the same spatial index but at different temporal indices.
- The authors propose a curriculum learning strategy in which the model is asked to predict progressively further into the future, which both
    - makes the task progressively more difficult on its own
    - and introduces more hard negatives along the way
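
The grouping above can be written as a small classification rule. This is an illustrative sketch: the `Index` tuple and the `negative_type` function are made up for this example, and the candidate pool (every time step being predicted, every spatial position, every video in the batch) is an assumption about how the negatives are gathered.

```python
from collections import Counter
from typing import NamedTuple


class Index(NamedTuple):
    video: int      # which video in the batch
    time: int       # temporal index of the block
    position: int   # spatial index within the feature map


def negative_type(anchor: Index, candidate: Index) -> str:
    """Classify a candidate embedding relative to the predicted embedding at anchor."""
    if candidate.video != anchor.video:
        return "easy"
    if candidate.position != anchor.position:
        return "spatial"
    if candidate.time != anchor.time:
        return "temporal"
    return "positive"


def count_types(num_videos: int, num_steps: int, num_positions: int) -> Counter:
    """Count candidate types for the anchor at video 0, time 0, position 0."""
    anchor = Index(0, 0, 0)
    return Counter(
        negative_type(anchor, Index(v, t, p))
        for v in range(num_videos)
        for t in range(num_steps)
        for p in range(num_positions)
    )


print(count_types(num_videos=2, num_steps=2, num_positions=4))
print(count_types(num_videos=2, num_steps=3, num_positions=4))
```

Under these assumptions, going from two predicted steps to three raises the number of temporal negatives per prediction from one to two, which is the sense in which predicting further ahead brings in more hard negatives.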

## DPC avoiding shortcuts
- To learn useful representations, the model has to be kept from learning trivial shortcuts.
- One such shortcut is simply learning optical flow, which the authors combat with frame-wise random data augmentation.
- The authors say that the curriculum learning strategy also alleviates this problem.
- Finally, splitting the input into blocks restricts the temporal receptive field so that the model cannot "cheat".

## DPC network architecture
- The encoding function $f$ is implemented as a 3D ResNet which uses a combination of 2D and 3D convolution kernels.
- The aggregation function $g$ is implemented as a ConvGRU with kernel size 1x1 thus sharing weights across spatial positions.
- The predictor function $\phi$ is implemented as a shallow MLP.
- The authors mention that it is preferable to use a weak aggregator in order to train a strong encoder.
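
To make the weight-sharing point concrete: a 1x1 convolution applies one and the same linear map to the channel vector at every spatial position. The sketch below shows only that property in plain Python; it is not a ConvGRU (there is no gating or hidden state), and `conv1x1` is a name made up for this example.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col][channel]


def conv1x1(feature_map: FeatureMap, weight: List[Vector]) -> FeatureMap:
    """Apply the same linear map (weight has shape C_out x C_in) at every position."""
    def apply(x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weight]

    return [[apply(x) for x in row] for row in feature_map]


weight = [[1.0, 1.0], [0.0, 2.0]]
fmap = [[[1.0, 0.0], [0.0, 1.0]]]      # 1 x 2 grid with 2 channels
swapped = [[[0.0, 1.0], [1.0, 0.0]]]   # same channel vectors, positions exchanged

out = conv1x1(fmap, weight)
out_swapped = conv1x1(swapped, weight)
print(out)          # [[[1.0, 0.0], [1.0, 2.0]]]
print(out_swapped)  # [[[1.0, 2.0], [1.0, 0.0]]]
```

Exchanging two positions in the input exchanges the corresponding outputs, because no position sees its neighbours and all positions share the same weights.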

## Experiments and analysis
- They evaluate the model's ability to produce a useful representation by fine-tuning a classifier on three action classification datasets: UCF101, HMDB51, and Kinetics-400 (K400).
- They study the following:
    - An ablation study on the different design choices.
    - The effect of training on larger and more diverse datasets.
    - The correlation between performance in the self-supervised task and the downstream task.
    - The effect on learnt representation when predicting further into the future.

## Experiments and analysis: ablation study

<img src="figs/dpc/dpc-table-1.png"></img>

## Experiments and analysis: benefits of large datasets

<img src="figs/dpc/dpc-table-2.png"></img>

## Experiments and analysis: correlation of self-supervised and classification accuracy

<img src="figs/dpc/dpc-fig-3.png"></img>

## Experiments and analysis: predicting further into the future

<img src="figs/dpc/dpc-table-3.png"></img>

## Discussion
- How difficult are these action classification datasets really? UCF101 at least seems pretty easy to solve with much simpler models. But I could be mistaken.
- What other tasks could the representation be evaluated on? This would benefit our experimentation as well.