# Video Representation Learning by Dense Predictive Coding (DPC)

- Self-supervised technique for learning spatio-temporal representations of video.
- For example, these features can be used in a downstream action recognition task.
- Hopefully also suitable for capturing features that can be used in our pipeline.

## Introduction
- Video data lends itself pretty well to self-supervised learning techniques.
- There's lots of it (YouTube, etc.).
- Self-supervised techniques in this domain are commonly based on predicting future frames in a video.
- The main difficulty is that the future is highly non-deterministic.
- If the goal is just to learn a representation for downstream tasks, then predicting pixel-level data of future frames likely wastes model capacity.
- In this paper these future predictions are instead made in a latent representation space.
- The DPC model is similar to techniques like contrastive predictive coding (CPC) and word2vec.

## DPC overview
- DPC is designed to predict the future latent representations based on the recent past.
- DPC is trained using a variant of noise contrastive estimation, so it never has to predict the exact future. Instead it solves a multi-class classification problem: picking the correct future state out of many distractors.
- To do this the model has to learn to capture the shared semantics of multiple future states.

## DPC framework
- The input video is partitioned into multiple non-overlapping blocks $x_1, ..., x_n$, each consisting of a fixed number of frames.
    - $x_t \in \mathbb{R}^{T \times H \times W \times C}$
- Each block is then encoded into the latent representations $z_1, ..., z_n$ via an encoding function $f$.
    - $z_t \in \mathbb{R}^{H' \times W' \times C}$
- An aggregation function $g$ computes a context representation $c_t = g(z_1, ..., z_t)$ given the observed latents.
    - $c_t \in \mathbb{R}^{H' \times W' \times C}$
- A predictive function $\phi$ predicts the future latents $\hat{z}_{t+1}, \hat{z}_{t+2}, ...$
    - $\hat{z}_{i} \in \mathbb{R}^{H' \times W' \times C}$
- The subsequent contexts $c_{t+1}, ...$ and predictions $\hat{z}_{t+2}, ...$ are computed autoregressively, with each predicted latent fed back into $g$.
- This is summarized in the following slide.
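
The steps above can be sketched in NumPy as follows. This is only a shape-level illustration: the toy `f`, `g` and `phi`, the constants `LATENT_CHANNELS` and `POOL`, and the helper names are hypothetical stand-ins with random weights, not the paper's learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_CHANNELS = 8  # channel count of z_t (assumption, chosen for illustration)
POOL = 4             # spatial downsampling factor H / H' (assumption)

# Random, untrained weights for the toy functions below.
W_f = rng.normal(scale=0.1, size=(3, LATENT_CHANNELS))
W_g = rng.normal(scale=0.1, size=(2 * LATENT_CHANNELS, LATENT_CHANNELS))
W_phi = rng.normal(scale=0.1, size=(LATENT_CHANNELS, LATENT_CHANNELS))


def partition_blocks(video, frames_per_block):
    """Split a (num_frames, H, W, C) video into non-overlapping blocks x_1..x_n."""
    n = video.shape[0] // frames_per_block
    video = video[: n * frames_per_block]  # drop leftover frames at the end
    return video.reshape(n, frames_per_block, *video.shape[1:])


def f(block):
    """Toy encoder: average over time, average-pool space, project channels."""
    _, h, w, c = block.shape
    frame = block.mean(axis=0)
    pooled = frame.reshape(h // POOL, POOL, w // POOL, POOL, c).mean(axis=(1, 3))
    return np.tanh(pooled @ W_f)  # shape (H', W', LATENT_CHANNELS)


def g(c_prev, z):
    """Toy recurrent aggregator: fold one latent into the running context."""
    return np.tanh(np.concatenate([c_prev, z], axis=-1) @ W_g)


def phi(c):
    """Toy predictor: map the context to the next latent."""
    return c @ W_phi


def dpc_rollout(video, frames_per_block, num_observed, num_predicted):
    blocks = partition_blocks(video, frames_per_block)
    z = [f(b) for b in blocks]
    c = np.zeros_like(z[0])
    for z_t in z[:num_observed]:      # c_t = g(z_1, ..., z_t)
        c = g(c, z_t)
    predictions = []
    for _ in range(num_predicted):    # autoregressive part
        z_hat = phi(c)
        predictions.append(z_hat)
        c = g(c, z_hat)               # the prediction is fed back into g
    return np.stack(z), np.stack(predictions)
```

For example, a 42-frame clip with 5 frames per block gives 8 blocks (the last 2 frames are dropped), and `dpc_rollout(video, 5, 5, 3)` returns the 8 encoded latents together with 3 predicted latents of the same spatial layout.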

## DPC framework
<img src="figs/dpc/dpc-fig-2.png"></img>

## DPC loss
- $\hat{z}_i \in \mathbb{R}^{H' \times W' \times C}$ is the predicted latent embedding at timestep $i$. Note that the spatial layout is kept.
- $\hat{z}_{i,k} \in \mathbb{R}^{C}$ is the predicted latent embedding at timestep $i$ and at spatial index $k$.
- The DPC framework uses a variant of noise contrastive estimation in which the model is tasked to classify one positive sample against multiple negative samples. The predicted latents are compared against samples taken from the $z_i$ computed by the encoder function $f$.
- For the predicted embedding $\hat{z}_{i,k}$ the only positive embedding is $z_{i,k}$.
- For the predicted embedding $\hat{z}_{i,k}$ the negative embeddings are taken from all $z_{j,m}$ such that $(j, m) \neq (i, k)$.
- This is illustrated in the previous slide.
- The loss is then defined as $$\mathcal{L} = -\sum_{i,k} \log \frac{\exp(\hat{z}_{i,k}^T \cdot z_{i,k})}{\sum_{j,m} \exp(\hat{z}_{i,k}^T \cdot z_{j,m})}$$
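
A minimal NumPy sketch of this loss, assuming the spatial grid has been flattened to one axis and the candidate pool is just the true latents of a single clip at the predicted timesteps. The function name `dpc_loss` and the array layout are illustrative choices, not taken from the paper's code.

```python
import numpy as np


def dpc_loss(z_hat, z):
    """Contrastive loss over predicted and true latents.

    z_hat, z: arrays of shape (N, K, C) with N predicted timesteps,
    K flattened spatial positions and C channels.
    """
    n, k, c = z_hat.shape
    pred = z_hat.reshape(n * k, c)
    target = z.reshape(n * k, c)

    # scores[a, b] is the dot product between prediction a and true latent b.
    scores = pred @ target.T

    # Log-softmax over all candidates, shifted by the row max for stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

    # The positive for prediction (i, k) is the true latent at the same (i, k),
    # which sits on the diagonal after flattening.
    return -log_prob.diagonal().sum()
```

With uninformative predictions (all zeros) every candidate gets equal probability, so the loss equals $NK \log(NK)$. Predictions that align with their own target and not with the others drive the loss toward zero.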

## DPC network architecture
- The encoding function $f$ is implemented as a 3D ResNet which uses a combination of 2D and 3D convolution kernels.
- The aggregation function $g$ is implemented as a ConvGRU with kernel size $1 \times 1$, thus sharing weights across spatial positions.
- The authors mention that it's preferable to use a weak aggregator in order to train a strong encoder.
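
To illustrate the weight sharing: a $1 \times 1$ convolution is the same channel-wise matrix multiplication applied at every spatial position. The sketch below uses the standard GRU update equations without biases and with random weights. The class `ConvGRU1x1` and its attribute names are hypothetical, and the authors' exact formulation may differ.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


class ConvGRU1x1:
    """GRU cell whose 1x1 convolutions are matmuls over the channel axis."""

    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)

        def init():
            return rng.normal(scale=0.1, size=(channels, channels))

        self.W_u, self.U_u = init(), init()  # update gate
        self.W_r, self.U_r = init(), init()  # reset gate
        self.W_c, self.U_c = init(), init()  # candidate state

    def step(self, h, x):
        """One update; h and x have shape (H', W', C)."""
        u = sigmoid(x @ self.W_u + h @ self.U_u)
        r = sigmoid(x @ self.W_r + h @ self.U_r)
        cand = np.tanh(x @ self.W_c + (r * h) @ self.U_c)
        return (1.0 - u) * h + u * cand

    def aggregate(self, zs):
        """Fold a sequence of latents, shape (t, H', W', C), into a context c_t."""
        h = np.zeros_like(zs[0])
        for z in zs:
            h = self.step(h, z)
        return h
```

Because the same matrices act on every position, the context at a given spatial index depends only on the latents at that index. Running the cell on a single-position slice therefore reproduces the corresponding entry of the full output.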

## Experiments and analysis

## Discussion
- Is this even right for what we want to capture?