# PyTorch Tutorial

## CLMR

In the following examples, we look at how Contrastive Learning of Musical Representations (CLMR; Spijkervet & Burgoyne, 2021) uses self-supervised learning to learn powerful representations for the downstream task of music tagging.

<div align="center">
<img width="700" src="../images/janne/clmr_model.png"/>
</div>

In the above figure, we transform a single audio example into two distinct augmented views by processing it through a set of stochastic audio augmentations.

In [3]:
!git clone https://github.com/spijkervet/clmr.git
!pip3 install clmr/

import sys
sys.path.append("clmr")

fatal: destination path 'clmr' already exists and is not an empty directory.
Processing ./clmr
Building wheels for collected packages: clmr
  Building wheel for clmr (setup.py) ... done
  Created wheel for clmr: filename=clmr-0.1.0-py3-none-any.whl size=7258 sha256=11bd135a2d8a72ae95dca3dd62c0b59b639a64bbd879a4250d06dd3a01951213
  Stored in directory: /private/var/folders/5n/msbkkqhj2y9bhqwj2gvr70n40000gp/T/pip-ephem-wheel-cache-qq2_mgjk/wheels/47/96/99/4adc73d74f8b28040b7b1a2bbed5172c9ce04a9ba62645fc05
Successfully built clmr
Installing collected packages: clmr
  Attempting uninstall: clmr
    Found existing installation: clmr 0.1.0
    Uninstalling clmr-0.1.0:
      Successfully uninstalled clmr-0.1.0
Successfully installed clmr-0.1.0


In [12]:
from clmr.datasets import get_dataset



100.0%


Merging zip files...



### Audio Data Augmentations
- Crop
- Filter
- Reverb
- Polarity
- Noise
- Pitch
- Gain
- Delay


In [1]:
import torchaudio
from torchaudio_augmentations import (
    RandomApply,
    ComposeMany,
    RandomResizedCrop,
    PolarityInversion,
    Noise,
    Gain,
    HighLowPass,
    Delay,
    PitchShift,
    Reverb,
)

# audio_fp = "test.wav"
# audio, sr = torchaudio.load(audio_fp)

# transformation = HighLowPass(sr)
# transformed_audio = transformation(audio)

Now, let's chain a series of transformations, each of which is applied with its own independent probability:

In [2]:
# train_transform = [
#     RandomResizedCrop(n_samples=args.audio_length),
#     RandomApply([PolarityInversion()], p=args.transforms_polarity),
#     RandomApply([Noise()], p=args.transforms_noise),
#     RandomApply([Gain()], p=args.transforms_gain),
#     RandomApply([HighLowPass(sample_rate=args.sample_rate)], p=args.transforms_filters),
#     RandomApply([Delay(sample_rate=args.sample_rate)], p=args.transforms_delay),
#     RandomApply([PitchShift(n_samples=args.audio_length, sample_rate=args.sample_rate)], p=args.transforms_pitch),
#     RandomApply([Reverb(sample_rate=args.sample_rate)], p=args.transforms_reverb),
# ]
# num_augmented_samples = 2
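
To make the idea of "each transformation applied with an independent probability" concrete, here is a minimal, dependency-free sketch. It does not use the `torchaudio_augmentations` API: `RandomApplySketch`, `polarity_inversion`, `gain`, and `make_views` are hypothetical stand-ins written for illustration, and the waveform is a plain Python list rather than an audio tensor.

```python
import random


class RandomApplySketch:
    """Hypothetical stand-in: apply `transform` with probability `p`."""

    def __init__(self, transform, p, rng):
        self.transform = transform
        self.p = p
        self.rng = rng

    def __call__(self, samples):
        # Each wrapper draws its own random number, so the decisions
        # for different transformations are independent of each other.
        if self.rng.random() < self.p:
            return self.transform(samples)
        return samples


def polarity_inversion(samples):
    # Flip the sign of every sample.
    return [-s for s in samples]


def gain(samples, factor=0.5):
    # Scale the amplitude by a constant factor (illustrative value).
    return [factor * s for s in samples]


def make_views(samples, transforms, num_views=2):
    """Run the whole chain `num_views` times on the same input."""
    views = []
    for _ in range(num_views):
        view = list(samples)  # copy so the original stays untouched
        for transform in transforms:
            view = transform(view)
        views.append(view)
    return views


rng = random.Random(0)  # seeded only to make the sketch repeatable
waveform = [0.1, -0.2, 0.3, -0.4]
transforms = [
    RandomApplySketch(polarity_inversion, p=0.8, rng=rng),
    RandomApplySketch(gain, p=0.3, rng=rng),
]
views = make_views(waveform, transforms, num_views=2)
print(views)
```

Because the chain is re-sampled for every view, the two views of one clip generally differ, which is what the contrastive objective relies on. The probabilities `0.8` and `0.3` are arbitrary illustrative values, not CLMR's settings.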

## Loss

Here, we apply an InfoNCE loss, as proposed by van den Oord et al. (2018) for contrastive learning. The InfoNCE loss compares the similarity of our representations $z_i$ and $z_j$ with the similarity of $z_i$ to every other representation in the batch, and applies a softmax over the resulting similarity values. We can write this loss more formally as follows:

$$\ell_{i, j}=-\log \frac{\exp \left(\operatorname{sim}\left(z_{i}, z_{j}\right) / \tau\right)}{\sum_{k=1}^{2 N} \mathbb{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(z_{i}, z_{k}\right) / \tau\right)}=-\operatorname{sim}\left(z_{i}, z_{j}\right) / \tau+\log \left[\sum_{k=1}^{2 N} \mathbb{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(z_{i}, z_{k}\right) / \tau\right)\right]$$


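Here, $\tau$ is a temperature hyperparameter, $2N$ is the number of augmented views in a batch of $N$ examples, and the indicator $\mathbb{1}_{[k \neq i]}$ excludes the similarity of $z_i$ with itself. The similarity metric is the cosine similarity between our representations: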

$$\operatorname{sim}\left(z_{i}, z_{j}\right)=\frac{z_{i}^{\top} \cdot z_{j}}{\left\|z_{i}\right\| \cdot\left\|z_{j}\right\|}$$
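
The two formulas above can be checked on a toy batch. The following is a minimal sketch in plain Python, written term by term so that it mirrors the right-hand side of the loss. The function names `cosine_similarity` and `info_nce`, the temperature of `0.5`, and the four 2-dimensional vectors are assumptions made for illustration; this is not the CLMR implementation.

```python
import math


def cosine_similarity(a, b):
    """sim(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(z, i, j, temperature=0.5):
    """Loss l_{i,j} for anchor z[i] and positive z[j] over all 2N rows of z."""
    positive = cosine_similarity(z[i], z[j]) / temperature
    # Sum over every k != i, which is the role of the indicator function.
    log_denominator = math.log(
        sum(
            math.exp(cosine_similarity(z[i], z[k]) / temperature)
            for k in range(len(z))
            if k != i
        )
    )
    return -positive + log_denominator


# Toy batch with N = 2 clips, so 2N = 4 representations.
# Rows 0 and 1 are assumed to be views of one clip, rows 2 and 3 of another.
z = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]

loss_positive = info_nce(z, i=0, j=1)  # correct pairing
loss_mismatch = info_nce(z, i=0, j=2)  # deliberately wrong pairing
print(loss_positive, loss_mismatch)
```

The correctly paired views give the smaller loss, because only the $-\operatorname{sim}(z_i, z_j)/\tau$ term changes between the two calls while the log-sum term stays the same. In PyTorch the same quantity is usually computed in a vectorised way, for example with `torch.nn.functional.cosine_similarity` and `torch.nn.functional.cross_entropy`.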
