# <img src="https://img.icons8.com/dusk/64/000000/artificial-intelligence.png" style="height:50px;display:inline"> EE 046202 - Technion - Unsupervised Learning & Data Analysis
---

#### <a href="https://lioritan.github.io">Lior Friedman</a>

## Tutorial 10 - Contrastive Learning Continues

### <img src="https://img.icons8.com/bubbles/50/000000/checklist.png" style="height:50px;display:inline"> Agenda
---
* [Bootstrap Your Own Latent (BYOL)](#-Bootstrap-Your-Own-Latent-(BYOL))
* [Barlow Twins](#-Barlow-Twins)
* [Lightly - implementing contrastive learning](#-Lightly---implementing-contrastive-learning)
* [Recommended Videos](#-Recommended-Videos)
* [Credits](#-Credits)

```python
# imports for the tutorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Jupyter magic: interactive figures (only valid inside a notebook)
%matplotlib notebook
```

### <img src="https://img.icons8.com/dusk/64/000000/paper.png" style="height:50px;display:inline">  Reminder: Contrastive learning
---
* Similar samples should be close in representation space (giving low loss), and dissimilar samples should be far apart.
* Use augmentations to find positive samples (similar things).
* <img src="./assets/selfsup_contrast_augs.png" style="height:200px;">
* Contrastive loss: collect a batch of $N$ samples and, for each sample, use the other $N-1$ samples as negatives.
* <img src="./assets/selfsup_infonce_loss.png" style="height:100px;">
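
As a rough illustration of the contrastive loss above, here is a minimal NumPy sketch of an InfoNCE-style loss for two augmented views of a batch. The function name `info_nce_loss`, the `temperature` value and the choice to use only cross-view negatives are assumptions made for this sketch; the figure may use a slightly different form.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss for two views of a batch, each of shape (N, D).

    Row i of z_a and row i of z_b form the positive pair; the other
    N-1 rows of z_b serve as negatives for row i of z_a.
    """
    # L2-normalize so that dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix
    # numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the positives sit on the diagonal
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# second view is a slightly perturbed copy: positives line up on the diagonal
loss_aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))
# second view is shifted by one row: every "positive" is actually a different sample
loss_misaligned = info_nce_loss(z, np.roll(z, 1, axis=0))
```

The loss is small when each sample is closest to its own augmented view and grows when the most similar sample in the batch is one of the negatives.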

### <img src="https://img.icons8.com/dusk/64/000000/mountain.png" style="height:50px;display:inline"> Bootstrap Your Own Latent (BYOL)
---
* <a href="https://arxiv.org/abs/2006.07733">Bootstrap Your Own Latent (BYOL)</a> creates two views of the data and tries to predict one from the other.
* Unlike previous contrastive methods **there are no negative samples**.
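* Create two augmented views of a sample $x$, $t(x)$ and $t'(x)$, feed them to two neural networks (online and target) and predict the target's output from the online network's output.
* The online network (parameters $\theta$) has an encoder $f_\theta$, a projector $g_\theta$ and a predictor $q_\theta$; the target network (parameters $\xi$) has only an encoder $f_\xi$ and a projector $g_\xi$.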
<img src="./assets/selfsup_byol.png" style="height:300px;">
1. Sample $t,t'\sim \tau$, get latent variables $z=g_\theta(f_\theta(t(x))),\quad z_\xi=g_\xi(f_\xi(t'(x)))$.
2. Online network produces prediction $q_\theta(z)$.
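3. $L_2$-normalize $q_\theta(z)$ and $z_\xi$, then compute $\mathcal{L}_{BYOL}=\mathrm{MSE}(q_\theta(z),z_\xi)$ on the normalized vectors.
4. Update the online network with SGD (gradients flow only through the online network), and update the target network via Polyak averaging (an exponential moving average): $\xi\leftarrow\alpha \xi+(1-\alpha)\theta$.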
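
The last two steps can be sketched in NumPy as follows. This is only an illustration under assumptions: the helper names `byol_loss` and `ema_update` are hypothetical, parameters are represented as a plain list of arrays, and the loss is written as the squared $L_2$ distance between normalized vectors (which equals $2-2\cdot$ cosine similarity).

```python
import numpy as np

def byol_loss(prediction, target):
    """Squared L2 distance between L2-normalized rows, averaged over the batch."""
    p = prediction / np.linalg.norm(prediction, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    # for unit vectors this equals 2 - 2 * cosine_similarity(p, t)
    return np.mean(np.sum((p - t) ** 2, axis=1))

def ema_update(target_params, online_params, alpha=0.99):
    """Polyak averaging: xi <- alpha * xi + (1 - alpha) * theta, per parameter array."""
    return [alpha * xi + (1.0 - alpha) * theta
            for xi, theta in zip(target_params, online_params)]

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))          # stand-in for online predictions q_theta(z)
loss_same = byol_loss(q, 3.0 * q)    # same direction, different scale
loss_opposite = byol_loss(q, -q)     # opposite direction

theta = [np.ones((2, 2))]            # toy online parameters
xi = [np.zeros((2, 2))]              # toy target parameters
xi = ema_update(xi, theta, alpha=0.9)
```

Because of the normalization, only the direction of the vectors matters: `loss_same` is zero even though the scales differ, while `loss_opposite` reaches the maximum value of 4. After one update with `alpha=0.9`, the target parameters have moved 10% of the way towards the online parameters.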

* **The MSE loss does not require negative samples**.
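* One proposed explanation is an *implicit* contrastive effect that comes from using **Batch Normalization** in the encoder and projector.
* In the experiments behind this explanation (see the BYOL analysis in the Credits), removing batch normalization made BYOL fail catastrophically; later work by the BYOL authors reports that other normalization schemes can also avoid collapse, so the explanation is not settled.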
* Why?
    * One purpose of negative examples in a contrastive loss is to prevent mode collapse (e.g. mapping every data point to the same all-zeros representation).
    * BYOL has no negative samples, so it needs some implicit dependency on the other samples in the batch.
    * This is exactly what batch normalization provides: no matter how similar the inputs in a batch are, their activations are normalized with the batch mean and standard deviation and then rescaled by the learned scale and shift.
    * Mode collapse is prevented because all samples in the mini-batch **cannot take on the same value after batch normalization**.
* In other words, BYOL learns by asking **“how is this image different from the average image?”**, whereas contrastive methods ask **“what distinguishes these two specific images from each other?”**
<img src="./assets/selfsup_compare.png" style="height:250px;">
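
To make the batch normalization argument concrete, here is a small NumPy sketch. It assumes a simplified, training-mode batch normalization written by hand (the helper `batch_norm` and its `gamma`, `beta` and `eps` defaults are hypothetical and not taken from any library) and applies it to a batch of nearly identical representations.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-8):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    # normalize with batch statistics, then apply the (learned) scale and shift
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
# 32 samples whose 4-dimensional representations are almost identical
nearly_collapsed = 5.0 + 1e-3 * rng.normal(size=(32, 4))
normalized = batch_norm(nearly_collapsed)

spread_before = nearly_collapsed.std(axis=0)  # tiny per-feature spread
spread_after = normalized.std(axis=0)         # close to 1 per feature
```

Even though the inputs differ only by tiny amounts, the normalized outputs are spread out with roughly unit standard deviation in every feature, so each output depends on how the sample differs from the batch average.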

### <img src="https://img.icons8.com/bubbles/50/000000/yin-yang.png" style="height:50px;display:inline"> Barlow Twins
---
* <a href="https://arxiv.org/abs/2103.03230">Barlow Twins</a> does something similar to CCA (canonical-correlation analysis).
* Feed two distorted versions of the sample into the *same* network to extract features and learn to make the cross-correlation matrix between these two groups of output features **close to the identity matrix**. 
* In other words, the goal is to keep the representations of different versions of one sample similar, while minimizing the *redundancy* between these vectors (the idea comes from neuroscience).
<img src="./assets/selfsup_barlow.png" style="height:300px;">
1. Sample a batch of size $N$; for each sample, apply random augmentations $t,t'$ and encode: $z^A=f_\theta(t(x)),\ z^B=f_\theta(t'(x))$.
2. Calculate the cross-correlation matrix $\mathcal{C}$.
    * $\mathcal{C}$ is a square matrix whose size equals the output dimensionality of the feature network.
    * Each entry $\mathcal{C}_{i,j}$ is the cosine similarity, computed along the batch dimension, between output dimension $i$ of $z^A$ and output dimension $j$ of $z^B$ (the features are mean-centered along the batch dimension):
    * $$\mathcal{C}_{i,j}=\frac{\sum_{b=1}^{N}z^A_{i,b}z^B_{j,b}}{\sqrt{\sum_{b=1}^{N}(z^A_{i,b})^2}\sqrt{\sum_{b=1}^{N}(z^B_{j,b})^2}}$$
    * $\mathcal{C}_{i,j}$ is between -1 (i.e. perfect anti-correlation) and 1 (i.e. perfect correlation).
3. $$\mathcal{L}_\mathrm{BT} = \underbrace{\sum_i (1-\mathcal{C}_{ii})^2}_{\text{invariance term}} + \lambda \underbrace{\sum_i\sum_{j\neq i} \mathcal{C}_{ij}^2}_{\text{redundancy reduction term}}$$
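
The three steps above can be sketched in NumPy as follows. The function names, the small `eps` guard and the value `lam=5e-3` are assumptions for illustration, and the features are mean-centered along the batch dimension before computing $\mathcal{C}$, as the Barlow Twins paper assumes. Arrays are stored as `(N, D)`, i.e. batch first.

```python
import numpy as np

def cross_correlation(z_a, z_b, eps=1e-12):
    """Cross-correlation matrix C of shape (D, D) for embeddings of shape (N, D)."""
    # center each feature along the batch dimension
    z_a = z_a - z_a.mean(axis=0, keepdims=True)
    z_b = z_b - z_b.mean(axis=0, keepdims=True)
    numerator = z_a.T @ z_b                    # sum over the batch index b
    norm_a = np.sqrt((z_a ** 2).sum(axis=0))   # per-feature norm of view A
    norm_b = np.sqrt((z_b ** 2).sum(axis=0))   # per-feature norm of view B
    return numerator / (np.outer(norm_a, norm_b) + eps)

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Invariance term plus lam times the redundancy reduction term."""
    c = cross_correlation(z_a, z_b)
    invariance = np.sum((1.0 - np.diag(c)) ** 2)
    redundancy = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # off-diagonal entries only
    return invariance + lam * redundancy

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 4))            # features that are roughly decorrelated
c_same = cross_correlation(z, z)         # identical views: diagonal equals 1
loss_decorrelated = barlow_twins_loss(z, z)

z_redundant = np.repeat(z[:, :1], 4, axis=1)  # every feature is a copy of the first
loss_redundant = barlow_twins_loss(z_redundant, z_redundant)
```

With identical views the invariance term vanishes in both cases, so the difference between `loss_decorrelated` and `loss_redundant` comes entirely from the redundancy reduction term: copying one feature four times fills the off-diagonal of $\mathcal{C}$ with ones.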

Notes:
* *Explicitly reduces redundancy*, so no need for batch normalization to avoid mode collapse in the representation.
* Pretty robust to batch size, but sensitive to the choice of augmentations.

### <img src="./assets/selfsup_lightly.png" style="height:50px;display:inline"> Lightly - implementing contrastive learning
---
* Lightly SSL is a computer vision framework for self-supervised learning. 
* Contains PyTorch-based implementations of many popular models, including everything we talked about.
* **TODO**

### <img src="https://img.icons8.com/bubbles/50/000000/video-playlist.png" style="height:50px;display:inline"> Recommended Videos
---
#### <img src="https://img.icons8.com/cute-clipart/64/000000/warning-shield.png" style="height:30px;display:inline"> Warning!
* These videos do not replace the lectures and tutorials.
* Please use these to get a better understanding of the material, and not as an alternative to the written material.

#### Video By Subject

* BYOL - <a href="https://www.youtube.com/watch?v=YPfUiOMYOEE"> BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning </a>

## <img src="https://img.icons8.com/dusk/64/000000/prize.png" style="height:50px;display:inline"> Credits
---
* <a href="https://github.com/taldatech/ee046211-deep-learning/blob/main/ee046211_tutorial_09_self_supervised_representation_learning.ipynb"> ee046211 - Deep Learning </a> @ Technion
* <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/"> Weng, Lilian. (May 2021). Contrastive representation learning. Lil’Log </a>
* <a href="https://imbue.com/research/2020-08-24-understanding-self-supervised-contrastive-learning/"> Understanding self-supervised and contrastive learning with BYOL </a>
* A Cookbook of Self-Supervised Learning, Balestriero et al. 2023
* <a href="https://paperswithcode.com/method/byol">Bootstrap Your Own Latent (BYOL)</a>
* <a href="https://paperswithcode.com/method/barlow-twins">Barlow Twins</a>
* <a href="https://github.com/lightly-ai/lightly">Lightly SSL</a>