# Pretext Tasks in Self-Supervised Learning: In-Depth Mathematical Explanation

## Core Concepts of Self-Supervised Learning

### Pretext Tasks

In self-supervised learning (SSL), pretext tasks are tasks constructed from the data itself, so they require no human-labeled data. They are designed to make the model learn representations that transfer to downstream tasks. This section covers four common pretext tasks, giving for each the mathematical formulation, the training procedure, and the main advantages and drawbacks.

#### 1. Image Inpainting

**Objective**: Predict the missing parts of an image.

**Mathematical Explanation**:

- Let $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ represent an input image with height $H$, width $W$, and $C$ color channels.
- Define a binary mask $\mathbf{m} \in \{0,1\}^{H \times W}$ where $\mathbf{m}_{i,j} = 0$ indicates that the pixel at position $(i,j)$ is missing and needs to be predicted, and $\mathbf{m}_{i,j} = 1$ indicates that the pixel is present.
- The observed image can be denoted as $\mathbf{x}' = \mathbf{x} \odot \mathbf{m}$, where $\odot$ is element-wise multiplication and $\mathbf{m}$ is broadcast across the $C$ channels.
- The goal is to train a neural network $f_\theta$ with parameters $\theta$ to predict the missing pixels:
  $$ \hat{\mathbf{x}} = f_\theta(\mathbf{x}') $$

**Training**:

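- The loss function can be the mean squared error (MSE) between the predicted image and the original image, computed over the missing pixels only:
  $$ \mathcal{L}(\theta) = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} \left\| \hat{\mathbf{x}}_{i,j} - \mathbf{x}_{i,j} \right\|_2^2 $$
  where $\mathcal{M} = \{(i,j) : \mathbf{m}_{i,j} = 0\}$ is the set of missing pixel positions and $\hat{\mathbf{x}}_{i,j}, \mathbf{x}_{i,j} \in \mathbb{R}^C$ are the predicted and original pixel vectors.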
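
The masking step and the masked loss can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names `mask_image` and `masked_mse` are hypothetical, and the squared error of each pixel is assumed to be summed over its $C$ channels before averaging over the missing positions.

```python
import numpy as np

def mask_image(x, m):
    """Return x' = x * m, with the (H, W) mask broadcast over the channel axis."""
    return x * m[..., None]

def masked_mse(x_hat, x, m):
    """Squared error averaged over the missing pixels only (where m == 0)."""
    missing = (m == 0)
    n_missing = missing.sum()
    if n_missing == 0:
        raise ValueError("mask has no missing pixels")
    sq_err = ((x_hat - x) ** 2).sum(axis=-1)  # sum over channels -> (H, W)
    return sq_err[missing].sum() / n_missing

rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))        # toy image with H = W = 4, C = 3
m = np.ones((4, 4), dtype=int)
m[1:3, 1:3] = 0                  # a 2x2 hole in the centre
x_obs = mask_image(x, m)
loss = masked_mse(x_obs, x, m)   # baseline: "predict" zeros inside the hole
```

Here the observed image itself serves as a trivial prediction, which gives a baseline loss that a trained $f_\theta$ would be expected to beat.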

**Derivatives**:

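- To minimize the loss function, we use gradient descent. For a missing pixel $(i,j) \in \mathcal{M}$, the gradient of the loss with respect to the prediction is:
  $$ \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{x}}_{i,j}} = \frac{2}{|\mathcal{M}|} \left( \hat{\mathbf{x}}_{i,j} - \mathbf{x}_{i,j} \right) $$
  For observed pixels $(i,j) \notin \mathcal{M}$ this gradient is zero, since they do not appear in the loss.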
- The gradient of the loss with respect to the model parameters $\theta$ is obtained via the chain rule:
  $$ \frac{\partial \mathcal{L}}{\partial \theta} = \sum_{(i,j) \in \mathcal{M}} \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{x}}_{i,j}} \cdot \frac{\partial \hat{\mathbf{x}}_{i,j}}{\partial \theta} $$
- The parameters are updated using gradient descent:
  $$ \theta \leftarrow \theta - \eta \frac{\partial \mathcal{L}}{\partial \theta} $$
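
To make the chain rule and the update concrete, the sketch below runs gradient descent on a toy linear model $\hat{\mathbf{x}} = \theta \mathbf{x}'$ acting on the flattened image. The linear model, the image size, and the learning rate are all assumptions chosen for illustration; a real inpainting network would be a deep model trained with automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 3, 3, 1
x = rng.random((H, W, C))                 # toy "image"
m = np.ones((H, W))
m[1, 1] = 0                               # one missing pixel
x_obs = (x * m[..., None]).ravel()        # flattened observed image x'
target = x.ravel()
missing = np.repeat(m.ravel() == 0, C)    # per-element mask in ravel order
n_missing = int((m == 0).sum())           # |M|

def loss_and_grad(theta):
    """Masked MSE and its gradient for the toy model x_hat = theta @ x'."""
    x_hat = theta @ x_obs
    diff = np.where(missing, x_hat - target, 0.0)  # observed pixels contribute nothing
    loss = (diff ** 2).sum() / n_missing
    grad_x_hat = 2.0 * diff / n_missing            # dL/dx_hat
    grad_theta = np.outer(grad_x_hat, x_obs)       # chain rule: dx_hat[k]/dtheta[k, :] = x'
    return loss, grad_theta

theta = rng.normal(scale=0.1, size=(x_obs.size, x_obs.size))
eta = 0.1                                 # learning rate
losses = []
for _ in range(50):
    loss, grad = loss_and_grad(theta)
    losses.append(loss)
    theta -= eta * grad                   # theta <- theta - eta * dL/dtheta
```

Because the model is linear, the analytic gradient can be checked against finite differences, which is a useful sanity test before moving to a larger network.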

**Advantages**:

- Utilizes the spatial structure of images effectively.
- Helps the model learn context and relationships between different parts of the image.

**Drawbacks**:

- The effectiveness of the learned representations can depend on the type and size of the missing regions.
- May not generalize well to other types of data without significant modifications.

#### 2. Jigsaw Puzzles

**Objective**: Rearrange scrambled patches of an image.

**Mathematical Explanation**:

- Divide the image $\mathbf{x}$ into $N \times N$ grid patches, resulting in $N^2$ patches.
- Let $\mathbf{x}_{i,j}$ be the patch at position $(i,j)$.
- Scramble the patches using a permutation $\pi$, giving a scrambled image $\mathbf{x}^\pi$.
- The task is to predict the permutation $\pi^{-1}$ that reconstructs the original image.
- The model $f_\theta$ is trained to output an estimate $\hat{\pi}$ of the inverse permutation $\pi^{-1}$:
  $$ \hat{\pi} = f_\theta(\mathbf{x}^\pi) $$
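
The scrambling step can be sketched as below. The convention is an assumption, since the text does not fix one: the patch at original position $j$ is moved to slot $\pi_j$, so slot $k$ holds original patch $\pi^{-1}_k$. The helper `split_into_patches` is a hypothetical name, and the image size is assumed divisible by $N$.

```python
import numpy as np

def split_into_patches(x, n):
    """Split an (H, W, C) image into n*n patches, ordered row by row."""
    H, W, C = x.shape
    ph, pw = H // n, W // n
    grid = x[:n * ph, :n * pw].reshape(n, ph, n, pw, C).swapaxes(1, 2)
    return grid.reshape(n * n, ph, pw, C)

rng = np.random.default_rng(0)
x = rng.random((6, 6, 3))
patches = split_into_patches(x, 3)   # 9 patches of size 2x2

pi = rng.permutation(9)
scrambled = np.empty_like(patches)
scrambled[pi] = patches              # original patch j moves to slot pi[j]
pi_inv = np.argsort(pi)              # slot k holds original patch pi_inv[k]

# Given the inverse permutation (the prediction target), undo the scrambling:
restored = np.empty_like(scrambled)
restored[pi_inv] = scrambled         # put the patch in slot k back at position pi_inv[k]
```

Under this convention, `pi_inv` is exactly the per-slot label vector a model would be trained to predict.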

**Training**:

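- The loss function is typically the cross-entropy summed over the $N^2$ patch slots, where each slot is classified into one of the $N^2$ original positions:
  $$ \mathcal{L}(\theta) = -\sum_{k=1}^{N^2} \log p_{\theta}(\pi^{-1}_k | \mathbf{x}^\pi) $$
  Here $\pi^{-1}_k$ is the $k$-th entry of the inverse permutation and $p_{\theta}(\pi^{-1}_k | \mathbf{x}^\pi)$ is the probability the model assigns to it.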
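
A minimal sketch of this loss, assuming the model emits a $K \times K$ matrix of logits with $K = N^2$, where row $k$ scores the candidate original positions for slot $k$ and a softmax turns each row into probabilities. The function names and the output layout are illustrative assumptions.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def permutation_cross_entropy(logits, target):
    """Sum over slots k of -log p(target[k] | input), with p from a row-wise softmax."""
    log_p = log_softmax(logits)
    return -log_p[np.arange(len(target)), target].sum()

K = 9                                              # N = 3 grid
target = np.random.default_rng(0).permutation(K)   # true inverse permutation

# Uninformed model: uniform probabilities give a loss of K * log(K).
uniform_loss = permutation_cross_entropy(np.zeros((K, K)), target)

# Confident, correct model: large logit on the true position of each slot.
confident_loss = permutation_cross_entropy(20.0 * np.eye(K)[target], target)
```

The uniform case gives a useful reference value: a model whose loss stays near $K \log K$ has learned nothing about the arrangement.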

**Derivatives**:

- The gradient of the cross-entropy loss with respect to the model parameters $\theta$ is given by:
  $$ \frac{\partial \mathcal{L}}{\partial \theta} = -\sum_{k=1}^{N^2} \left( \frac{\partial \log p_{\theta}(\pi^{-1}_k | \mathbf{x}^\pi)}{\partial \theta} \right) $$
- Using the chain rule, the derivative can be expanded as:
  $$ \frac{\partial \log p_{\theta}(\pi^{-1}_k | \mathbf{x}^\pi)}{\partial \theta} = \frac{1}{p_{\theta}(\pi^{-1}_k | \mathbf{x}^\pi)} \cdot \frac{\partial p_{\theta}(\pi^{-1}_k | \mathbf{x}^\pi)}{\partial \theta} $$
- The parameters are updated using gradient descent:
  $$ \theta \leftarrow \theta - \eta \frac{\partial \mathcal{L}}{\partial \theta} $$
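
If the probabilities come from a softmax over logits $\mathbf{z}$ (an assumption; the text does not specify the output layer), the gradient of a single term $-\log p_y$ with respect to the logits simplifies to $\mathbf{p} - \mathbf{e}_y$, where $\mathbf{e}_y$ is the one-hot vector of the true class. The sketch below checks this identity numerically with central differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll(z, y):
    """Negative log-probability of class y under softmax(z)."""
    return -np.log(softmax(z)[y])

def nll_grad(z, y):
    """Analytic gradient of nll with respect to z: softmax(z) - one_hot(y)."""
    g = softmax(z)
    g[y] -= 1.0
    return g

z = np.array([0.5, -1.0, 2.0, 0.1])   # arbitrary logits for one slot
y = 2                                 # index of the true position
analytic = nll_grad(z, y)

eps = 1e-6
numeric = np.array([
    (nll(z + eps * e, y) - nll(z - eps * e, y)) / (2 * eps)
    for e in np.eye(len(z))
])
```

The two gradients should agree to within finite-difference error, and the analytic gradient sums to zero because the softmax probabilities sum to one.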

**Advantages**:

- Forces the model to learn spatial relationships and dependencies between different parts of the image.
- Simple to implement and effective in various computer vision tasks.

**Drawbacks**:

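- The number of possible permutations grows factorially with the number of patches, $(N^2)!$, so the full permutation space quickly becomes too large to cover; in practice, training is often restricted to a subset of permutations.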
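- The task might become trivial if the model can exploit low-level cues, such as edges that continue across patch boundaries, instead of learning semantic structure, which reduces the effectiveness of the learned representations.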
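
To give a sense of scale for the permutation space, the following sketch counts the arrangements of an $N \times N$ grid for a few small values of $N$. The helper name `num_permutations` is illustrative.

```python
import math

def num_permutations(n_items):
    """Number of distinct orderings of n_items distinguishable items."""
    return math.factorial(n_items)

# Grid size N -> number of ways to arrange the N*N patches
jigsaw_counts = {n: num_permutations(n * n) for n in (2, 3, 4)}
# 2x2 grid: 24, 3x3 grid: 362880, 4x4 grid: 20922789888000
```

Even a $4 \times 4$ grid has far more arrangements than could be treated as separate classes, which is why the per-slot formulation or a restricted permutation set is attractive.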

#### 3. Colorization

**Objective**: Convert grayscale images to color images.

**Mathematical Explanation**:

- Convert the input image $\mathbf{x}$ to grayscale, $\mathbf{x}_{gray}$.
- Train a model $f_\theta$ to predict the color channels from the grayscale image:
  $$ \hat{\mathbf{x}}_{color} = f_\theta(\mathbf{x}_{gray}) $$

**Training**:

- The loss function can be the MSE between the predicted and original color images, where $\hat{\mathbf{x}}_{i,j,c}$ denotes entry $(i,j,c)$ of $\hat{\mathbf{x}}_{color}$:
  $$ \mathcal{L}(\theta) = \frac{1}{HWC} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} \left( \hat{\mathbf{x}}_{i,j,c} - \mathbf{x}_{i,j,c} \right)^2 $$
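
The data preparation and loss can be sketched as follows. The grayscale conversion is an assumption: the text does not say how $\mathbf{x}_{gray}$ is computed, so the ITU-R BT.601 luma weights are used here as one common choice for RGB images. The function names are illustrative.

```python
import numpy as np

# ITU-R BT.601 luma weights for R, G, B (assumed conversion; they sum to 1)
LUMA = np.array([0.299, 0.587, 0.114])

def to_gray(x):
    """Convert an (H, W, 3) RGB image to an (H, W) grayscale image."""
    return x @ LUMA

def colorization_mse(x_hat, x):
    """Mean squared error over all H * W * C entries."""
    return np.mean((x_hat - x) ** 2)

rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))            # toy RGB image in [0, 1)
x_gray = to_gray(x)                  # model input

# Trivial baseline "prediction": copy the luminance into every channel.
baseline = np.repeat(x_gray[..., None], 3, axis=-1)
baseline_loss = colorization_mse(baseline, x)
```

The gray baseline is what a model that ignores image content would output, so its loss is a natural reference point for a trained colorization network.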

**Derivatives**:

- The gradient of the loss with respect to the predictions is:
  $$ \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{x}}_{i,j,c}} = \frac{2}{HWC} \left( \hat{\mathbf{x}}_{i,j,c} - \mathbf{x}_{i,j,c} \right) $$
- The gradient of the loss with respect to the model parameters $\theta$ is obtained via the chain rule:
  $$ \frac{\partial \mathcal{L}}{\partial \theta} = \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{x}}_{i,j,c}} \cdot \frac{\partial \hat{\mathbf{x}}_{i,j,c}}{\partial \theta} $$
- The parameters are updated using gradient descent:
  $$ \theta \leftarrow \theta - \eta \frac{\partial \mathcal{L}}{\partial \theta} $$

**Advantages**:

- Predicting plausible colors requires recognizing what is in the image (for example, sky versus grass), which pushes the model toward learning semantic information.
- Effective in learning representations that transfer well to other vision tasks.

**Drawbacks**:

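- The task is inherently ambiguous since multiple colorizations can be plausible for a single grayscale image. With an MSE loss, this ambiguity pulls predictions toward the average of the plausible colors, which tends to produce desaturated results.
- May require a different loss or additional regularization techniques to prevent the model from generating unrealistic colors.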

#### 4. Temporal Order Verification

**Objective**: Determine the correct sequence of frames in a video.

**Mathematical Explanation**:

- Let $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)$ be a sequence of $T$ frames from a video.
- Generate a shuffled sequence $\mathbf{X}^\pi$ using a permutation $\pi$.
- The task is to train a model $f_\theta$ to predict the permutation $\pi^{-1}$ that reconstructs the original sequence.
- The model is trained to output an estimate $\hat{\pi}$ of the inverse permutation $\pi^{-1}$:
  $$ \hat{\pi} = f_\theta(\mathbf{X}^\pi) $$
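
Building a training example for this task can be sketched as below. The convention is assumed rather than taken from the text: original frame $t$ is moved to slot $\pi_t$, so slot $s$ holds original frame $\pi^{-1}_s$. The function names `make_order_example` and `restore_order` are hypothetical.

```python
import numpy as np

def make_order_example(frames, rng):
    """Shuffle a (T, H, W, C) clip; return the shuffled clip and the inverse permutation."""
    T = frames.shape[0]
    pi = rng.permutation(T)
    shuffled = np.empty_like(frames)
    shuffled[pi] = frames          # original frame t moves to slot pi[t]
    pi_inv = np.argsort(pi)        # slot s holds original frame pi_inv[s]
    return shuffled, pi_inv

def restore_order(shuffled, pi_inv):
    """Undo the shuffle given the (true or predicted) inverse permutation."""
    restored = np.empty_like(shuffled)
    restored[pi_inv] = shuffled    # frame in slot s goes back to time step pi_inv[s]
    return restored

rng = np.random.default_rng(0)
video = rng.random((5, 8, 8, 3))   # T = 5 frames of 8x8 RGB
shuffled, pi_inv = make_order_example(video, rng)
restored = restore_order(shuffled, pi_inv)
```

The pair `(shuffled, pi_inv)` plays the role of input and label: no human annotation is needed because the label is produced by the shuffling itself.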

**Training**:

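- The loss function is the cross-entropy summed over the $T$ slots of the shuffled sequence, where each slot is classified into one of the $T$ original time steps:
  $$ \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(\pi^{-1}_t | \mathbf{X}^\pi) $$
  Here $\pi^{-1}_t$ is the $t$-th entry of the inverse permutation and $p_{\theta}(\pi^{-1}_t | \mathbf{X}^\pi)$ is the probability the model assigns to it.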

**Derivatives**:

- The gradient of the cross-entropy loss with respect to the model parameters $\theta$ is given by:
  $$ \frac{\partial \mathcal{L}}{\partial \theta} = -\sum_{t=1}^{T} \left( \frac{\partial \log p_{\theta}(\pi^{-1}_t | \mathbf{X}^\pi)}{\partial \theta} \right) $$
- Using the chain rule, the derivative can be expanded as:
  $$ \frac{\partial \log p_{\theta}(\pi^{-1}_t | \mathbf{X}^\pi)}{\partial \theta} = \frac{1}{p_{\theta}(\pi^{-1}_t | \mathbf{X}^\pi)} \cdot \frac{\partial p_{\theta}(\pi^{-1}_t | \mathbf{X}^\pi)}{\partial \theta} $$
- The parameters are updated using gradient descent:
  $$ \theta \leftarrow \theta - \eta \frac{\partial \mathcal{L}}{\partial \theta} $$

**Advantages**:

- Encourages the model to understand temporal dependencies and motion in videos.
- Useful for tasks that require temporal understanding, such as action recognition.

**Drawbacks**:

- The task becomes harder with longer video sequences, since a sequence of $T$ frames has $T!$ possible orderings.
- Requires careful selection of frame permutations to avoid trivial solutions that do not contribute to meaningful representation learning.

Each of these pretext tasks derives its supervision from structure already present in the data: spatial context (inpainting and jigsaw puzzles), color (colorization), or temporal order (video frames). The representations learned this way are meant to be transferable, and can improve performance on downstream tasks.
