# Self-Supervised Learning

## Contrastive Learning

Contrastive learning learns representations by modeling similarity across natural variations of the data. These variations are often organised into similar and dissimilar pairs, although related methods such as Bootstrap Your Own Latent (BYOL) learn from similar pairs alone.

It is often presented in the following stages:
1. Encode different "views" from natural variations of a single example.
2. Compute a similarity metric on the representations from the encoder(s).
3. Evaluate the pre-trained representations with a linear classifier on the downstream task(s).
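
Stage 3, evaluating frozen representations with a linear model, can be sketched as follows. This is a minimal illustration and not the CLMR evaluation code: the "encoder" is a hypothetical fixed random projection, the data is synthetic rather than audio, and the probe is a plain logistic regression written in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder: a fixed random projection stands in for a
# pre-trained network. Its weights are never updated below.
W_enc = rng.normal(size=(16, 8))


def encode(x):
    return np.tanh(x @ W_enc)


# Synthetic two-class data (not real audio): class 1 is shifted away from class 0.
n = 200
y = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, 16)) + 2.0 * y[:, None]

z = encode(x)  # frozen representations, shape (n, 8)

# Linear probe: logistic regression trained by gradient descent on z only.
w, b, lr = np.zeros(z.shape[1]), 0.0, 0.1
losses = []
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
    losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
    grad = p - y  # gradient of the loss with respect to the logits
    w -= lr * (z.T @ grad) / n
    b -= lr * grad.mean()

accuracy = np.mean((z @ w + b > 0) == (y == 1))
print(f"probe loss {losses[0]:.3f} -> {losses[-1]:.3f}, train accuracy {accuracy:.2f}")
```

Only `w` and `b` are trained, so the probe's score reflects how linearly separable the frozen representations already are.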


### Contrastive Losses


<div align="center">
    <img width="500" src="https://i.imgur.com/2uZeF4U.png"/>
    <div>Figure 1 from "Improved Baselines with Momentum Contrastive Learning" (Chen et al., 2020)</div>
</div>

Many contrastive learning methods use a variant of a contrastive loss function, which was first introduced in Noise Contrastive Estimation (Gutmann & Hyvärinen, 2010) and later adapted as the InfoNCE loss in Contrastive Predictive Coding (van den Oord et al., 2018).

This loss can be minimized using a variety of methods, which mostly differ in how they maintain the keys of the data examples. In SimCLR (Chen et al., 2020), the positive and negative keys all come from the current batch, and the encoder that produces them is updated end-to-end by back-propagation. Because the contrastive task becomes harder as the number of negative examples grows, SimCLR needs a large batch size. Momentum Contrast (MoCo), by contrast, maintains the negative keys in a queue and encodes only the queries and their positive keys in each batch.

For a query $q$, its positive key $k^{+}$, a set of negative keys $\{k^{-}\}$, and a temperature $\tau$, the loss is:


$$\mathcal{L}_{q, k^{+},\left\{k^{-}\right\}}=-\log \frac{\exp \left(q \cdot k^{+} / \tau\right)}{\exp \left(q \cdot k^{+} / \tau\right)+\sum_{k^{-}} \exp \left(q \cdot k^{-} / \tau\right)}$$
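
The loss above, together with a MoCo-style queue of negative keys, can be sketched in a few lines. This is an illustrative NumPy version, not the MoCo or SimCLR implementation: the vectors are random, `tau=0.07` and the queue size are assumed values, the `deque` is a hypothetical stand-in for the key queue, and the momentum-updated key encoder is left out entirely.

```python
from collections import deque

import numpy as np


def info_nce(q, k_pos, k_negs, tau=0.07):
    """InfoNCE loss for one query, one positive key and a set of negative keys."""
    # Similarity logits: the positive first, then every negative.
    logits = np.concatenate(([q @ k_pos], k_negs @ q)) / tau
    # Numerically stable log-softmax; the positive sits at index 0.
    logits = logits - logits.max()
    return -(logits[0] - np.log(np.exp(logits).sum()))


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
dim, queue_size = 8, 16

# FIFO queue of negative keys; deque(maxlen=...) silently drops the oldest key.
queue = deque(
    (l2_normalize(rng.normal(size=dim)) for _ in range(queue_size)),
    maxlen=queue_size,
)

q = l2_normalize(rng.normal(size=dim))
k_pos = l2_normalize(q + 0.1 * rng.normal(size=dim))  # a slightly perturbed "view" of q

loss = info_nce(q, k_pos, np.stack(list(queue)))
print(f"InfoNCE loss: {loss:.4f}")

# After the step, the newest key is enqueued and the oldest one falls out.
queue.append(k_pos)
```

Because the queue is decoupled from the batch, the number of negatives can be set independently of the batch size, which is the design difference from SimCLR described above.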


## CLMR

In the following examples, we look at how Contrastive Learning of Musical Representations (CLMR; Spijkervet & Burgoyne, 2021) uses self-supervised learning to learn powerful representations for the downstream task of music tagging.

<div align="center">
<img width="700" src="../images/janne/clmr_model.png"/>
</div>

In the figure above, a single audio example is transformed into two distinct augmented views by passing it through a chain of stochastic audio augmentations.




## Dataset
- We can initialise MagnaTagATune (MTAT), the Million Song Dataset (MSD), or your own set of .mp3 / .wav files.

In [8]:
!git clone https://github.com/spijkervet/clmr.git
!pip3 install clmr/

import sys
sys.path.append("clmr")



In [12]:
from clmr.datasets import get_dataset

train_dataset = get_dataset("magnatagatune", "./data", subset="train")




### Audio Data Augmentations
- Crop
- Filter
- Reverb
- Polarity
- Noise
- Pitch
- Gain
- Delay


In [1]:
import torchaudio
from torchaudio_augmentations import (
    RandomApply,
    ComposeMany,
    RandomResizedCrop,
    PolarityInversion,
    Noise,
    Gain,
    HighLowPass,
    Delay,
    PitchShift,
    Reverb,
)

# audio_fp = "test.wav"
# audio, sr = torchaudio.load(audio_fp)

# transformation = HighLowPass(sr)
# transformed_audio = transformation(audio)

Now, let's apply a series of transformations, each applied with an independent probability:

In [2]:
# train_transform = [
#     RandomResizedCrop(n_samples=args.audio_length),
#     RandomApply([PolarityInversion()], p=args.transforms_polarity),
#     RandomApply([Noise()], p=args.transforms_noise),
#     RandomApply([Gain()], p=args.transforms_gain),
#     RandomApply([HighLowPass(sample_rate=args.sample_rate)], p=args.transforms_filters),
#     RandomApply([Delay(sample_rate=args.sample_rate)], p=args.transforms_delay),
#     RandomApply([PitchShift(n_samples=args.audio_length, sample_rate=args.sample_rate)], p=args.transforms_pitch),
#     RandomApply([Reverb(sample_rate=args.sample_rate)], p=args.transforms_reverb),
# ]
# num_augmented_samples = 2
# transform = ComposeMany(train_transform, num_augmented_samples=num_augmented_samples)
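
Since the cell above depends on an `args` object that is not defined here, the idea of applying each transformation with its own probability and producing two views can be sketched with plain NumPy stand-ins. The functions `random_crop`, `polarity_inversion`, `gain`, `random_apply` and `compose_many` below are hypothetical simplifications written for this example and are not the `torchaudio_augmentations` classes; the probabilities, sample rate and clip length are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_crop(audio, n_samples):
    """Crop a random window of n_samples from a 1-D waveform."""
    start = rng.integers(0, len(audio) - n_samples + 1)
    return audio[start:start + n_samples]


def polarity_inversion(audio):
    return -audio


def gain(audio, min_db=-6.0, max_db=0.0):
    """Scale the waveform by a random gain drawn in decibels."""
    db = rng.uniform(min_db, max_db)
    return audio * 10.0 ** (db / 20.0)


def random_apply(transform, p):
    """Wrap a transform so it is applied with probability p, else skipped."""
    def wrapped(audio):
        return transform(audio) if rng.random() < p else audio
    return wrapped


def compose_many(transforms, num_augmented_samples):
    """Run the whole chain independently once per requested view."""
    def wrapped(audio):
        views = []
        for _ in range(num_augmented_samples):
            out = audio
            for t in transforms:
                out = t(out)
            views.append(out)
        return np.stack(views)
    return wrapped


# One second of a 440 Hz sine at an assumed 22050 Hz sample rate.
sample_rate, audio_length = 22050, 8192
audio = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)

transform = compose_many(
    [
        lambda a: random_crop(a, audio_length),
        random_apply(polarity_inversion, p=0.8),
        random_apply(gain, p=0.3),
    ],
    num_augmented_samples=2,
)
views = transform(audio)
print(views.shape)  # (2, 8192)
```

Each view draws its own crop position and its own coin flips, so the two rows of `views` are different augmentations of the same clip, which is what makes them a positive pair.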

## Loss

Here, we apply the InfoNCE loss proposed by van den Oord et al. (2018) for contrastive learning. For a batch of $N$ examples, each augmented twice to give $2N$ views, InfoNCE compares the similarity of a positive pair of representations $z_i$ and $z_j$ with the similarity of $z_i$ to every other representation in the batch, and applies a softmax over the obtained similarity values, scaled by a temperature $\tau$. We can write this loss more formally as follows:

$$\ell_{i, j}=-\log \frac{\exp \left(\operatorname{sim}\left(z_{i}, z_{j}\right) / \tau\right)}{\sum_{k=1}^{2 N} \mathbb{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(z_{i}, z_{k}\right) / \tau\right)}=-\operatorname{sim}\left(z_{i}, z_{j}\right) / \tau+\log \left[\sum_{k=1}^{2 N} \mathbb{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(z_{i}, z_{k}\right) / \tau\right)\right]$$


The similarity metric is the cosine similarity between our representations:

$$\operatorname{sim}\left(z_{i}, z_{j}\right)=\frac{z_{i}^{\top} \cdot z_{j}}{\left\|z_{i}\right\| \cdot\left\|z_{j}\right\|}$$
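
The loss and the cosine similarity above can be combined into one short function. The sketch below uses NumPy rather than the PyTorch code in CLMR, and it rests on two assumptions that are not stated in the text: the $2N$ rows are ordered so that `z[i]` and `z[i + N]` are the two views of example `i`, and `tau=0.5` is only an illustrative default.

```python
import numpy as np


def nt_xent(z, tau=0.5):
    """Mean InfoNCE loss over 2N views.

    Assumes rows are ordered so that z[i] and z[i + N] are the two views
    of example i.
    """
    two_n = z.shape[0]
    n = two_n // 2
    # Cosine similarity = dot product of L2-normalised rows.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / tau
    # The indicator 1[k != i]: exclude each view's similarity with itself.
    np.fill_diagonal(sim, -np.inf)
    # Index of the positive partner j for every i.
    pos = (np.arange(two_n) + n) % two_n
    # l_ij = -sim(z_i, z_j)/tau + log sum_{k != i} exp(sim(z_i, z_k)/tau)
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return np.mean(-sim[np.arange(two_n), pos] + log_denominator)


rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 16))
aligned = np.concatenate([anchors, anchors + 0.01 * rng.normal(size=(4, 16))])
unrelated = np.concatenate([anchors, rng.normal(size=(4, 16))])
print(f"aligned views:   {nt_xent(aligned):.4f}")
print(f"unrelated views: {nt_xent(unrelated):.4f}")
```

One would expect the batch whose second views are near-copies of the first to score a lower loss than the batch paired with unrelated vectors, since the positive term $-\operatorname{sim}(z_i, z_j)/\tau$ is then close to its minimum.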
