# WavAugment walkthrough

In this Colab notebook, we will go through some basic functionality that WavAugment provides. We will
*  install some required packages,
*  show how to apply simple augmentations on a speech sequence,
*  show how to combine and randomize them,
*  discuss some useful considerations and limitations.

Our overall goal is to cover most of the things that we found useful for deep self-supervised learning.

## Applying simple and useful augmentations

Let's import everything we will need.

In [None]:
import torch
import torchaudio
import torchaudio.functional as F

print(torch.__version__)
print(torchaudio.__version__)

import os
import random
import augment
import numpy as np

from IPython.display import Audio, display  # display is used by play_audio below
import matplotlib.pyplot as plt

Let's load one randomly chosen clip per GTZAN genre and listen to them:

In [None]:
print(os.getcwd())
print(os.listdir("../../../datasets/GTZAN/gtzan_genre/genres/blues"))

In [None]:
root = '../../../datasets/GTZAN/gtzan_genre/genres/'
genres = ["blues", "classical", "country", "disco", "hiphop", "jazz", "metal", "pop", "reggae", "rock"]

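test_list = []
for genre in genres:
    # Pick one random clip per genre; each entry is [name, waveform, sample_rate]
    genre_dir = os.path.join(root, genre)
    song = random.choice(os.listdir(genre_dir))
    audio, sr = torchaudio.load(os.path.join(genre_dir, song))
    test_list.append(['test_audio_' + genre, audio, sr])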

print(test_list)

In [None]:
for audio in test_list:
    print(audio[0])
    # A bare Audio(...) inside a loop is not rendered, so display it explicitly
    display(Audio(audio[1], rate=audio[2]))

In [None]:
def plot_waveform(waveform, sample_rate, title="Waveform", xlim=None):
    waveform = waveform.numpy()

    num_channels, num_frames = waveform.shape
    time_axis = torch.arange(0, num_frames) / sample_rate

    figure, axes = plt.subplots(num_channels, 1)
    if num_channels == 1:
        axes = [axes]
    for c in range(num_channels):
        axes[c].plot(time_axis, waveform[c], linewidth=1)
        axes[c].grid(True)
        if num_channels > 1:
            axes[c].set_ylabel(f"Channel {c+1}")
        if xlim:
            axes[c].set_xlim(xlim)
    figure.suptitle(title)
    plt.show(block=False)

In [None]:
def play_audio(waveform, sample_rate):
  waveform = waveform.numpy()

  num_channels, num_frames = waveform.shape
  if num_channels == 1:
    display(Audio(waveform[0], rate=sample_rate))
  elif num_channels == 2:
    display(Audio((waveform[0], waveform[1]), rate=sample_rate))
  else:
    raise ValueError("Waveforms with more than 2 channels are not supported.")

In [None]:
for i in range(len(test_list)):
    #plot_waveform(test_list[i][1], test_list[i][2], title=str(test_list[i][0]), xlim=None)
    play_audio(test_list[i][1], test_list[i][2])

Similarly to `sox`, the central entity of WavAugment is a sequence of effects, `augment.EffectChain`. As the name indicates, we can create various combinations of audio effects by chaining them together. This chain can be empty and do nothing:


In [None]:
empty_chain = augment.EffectChain()
for x in test_list:
    y = empty_chain.apply(x[1], src_info={'rate': x[2]})

or can contain one or more effects. Let us create a chain that applies a clipping effect:

In [None]:
clip_chain = augment.EffectChain().clip(0.6)
for x in test_list:
    y = clip_chain.apply(x[1], src_info={'rate': x[2]})
    #plot_waveform(y, x[2], title=str(x[0]), xlim=None)
    print(x[0])
    play_audio(y, x[2])

We can append effects one after another, as below, where the `rate` effect follows the `pitch` one:

In [None]:
for x in test_list:
    y = augment.EffectChain().pitch(1200).rate(x[2]).apply(x[1], src_info={'rate': x[2]})
    #plot_waveform(y, x[2], title=str(x[0]), xlim=None)
    print(x[0])
    play_audio(y, x[2])


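Here, we have raised the pitch by one octave: `pitch` takes the shift in cents (hundredths of a semitone), so 1200 moves the audio up by 12 semitones.

Similarly, a negative value lowers the pitch by the same amount; for example, `pitch(-1200)` shifts down by one octave.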
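
To make the cents unit concrete, the following minimal sketch converts a shift in cents into the corresponding frequency ratio, using the standard definition that 1200 cents equal one octave. The helper `cents_to_ratio` is a hypothetical name introduced here for illustration; it is not part of WavAugment.

```python
def cents_to_ratio(cents):
    """Frequency ratio for a pitch shift given in cents (1200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)


# One octave up, one octave down, and a single semitone up
for cents in (1200, -1200, 100):
    print(cents, round(cents_to_ratio(cents), 4))
```

A ratio of 2 doubles every frequency in the signal, while a ratio of 0.5 halves it.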

Why do we need to put `rate` after `pitch`? At the moment, WavAugment's `pitch` is a fairly thin wrapper around the corresponding effect of `libsox` [*]. Internally, `libsox` represents a change in pitch as a combination of the `tempo` and `rate` effects, so for the time being we need to change the rate back manually.

[*] This is subject to change in the future, as we iterate on the library.


Another effect that we found useful is `reverb`. The reverberations that are provided by `sox` are specified by three parameters: reverberance, damping factor, and room size. Check how it sounds:

In [None]:
for x in test_list:
    y = augment.EffectChain().reverb(100, 50, 100).channels(1).apply(x[1], src_info={'rate': x[2]})
    #plot_waveform(y, x[2], title=str(x[0]), xlim=None)
    print(x[0])
    play_audio(y, x[2])

Again, we need to add the `channels` effect due to peculiarities of `libsox`.

What else can we do? Another effect that is often used in the literature is replacing a small span of audio with silence. We can do that, too:

In [None]:
for x in test_list:
    y = augment.EffectChain().time_dropout(max_seconds=1.0).apply(x[1], src_info={'rate': x[2]})
    #plot_waveform(y, x[2], title=str(x[0]), xlim=None)
    print(x[0])
    play_audio(y, x[2])
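
To illustrate the idea behind this effect, here is a minimal plain-Python sketch that zeroes out one randomly placed span of at most `max_seconds`. It is an assumption about the general technique and may differ from WavAugment's own implementation; the function name `time_dropout` below is a hypothetical stand-alone helper, and a plain list stands in for the waveform so that the sketch runs without any audio libraries.

```python
import random


def time_dropout(samples, sample_rate, max_seconds, rng=random):
    """Return a copy of `samples` with one random span set to zero."""
    max_frames = int(max_seconds * sample_rate)
    # Span length in frames, never longer than the signal itself
    span = rng.randint(0, min(max_frames, len(samples)))
    # Start position chosen so that the span fits inside the signal
    start = rng.randint(0, len(samples) - span)
    out = list(samples)
    out[start:start + span] = [0.0] * span
    return out


rng = random.Random(0)
wave = [1.0] * 16000  # one second of dummy audio at 16 kHz
dropped = time_dropout(wave, 16000, max_seconds=0.5, rng=rng)
n_zeroed = dropped.count(0.0)
```

The output keeps the original length; only the values inside the chosen span change.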

Applying additive noise is a bit more involved, as we need a database of noise, such as [MUSAN](https://www.openslr.org/17/). For the sake of this small tutorial, we will use generated uniform noise. The additive noise effect consumes a Callable that returns the noise to be added: 


In [None]:
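for x in test_list:
    # Uniform noise in [0, 1) with the same shape as the input
    noise_generator = lambda: torch.zeros_like(x[1]).uniform_()
    # Use this clip's own sample rate rather than the leftover `sr` from loading
    y = augment.EffectChain().additive_noise(noise_generator, snr=5).apply(x[1], src_info={'rate': x[2]})
    #plot_waveform(y, x[2], title=str(x[0]), xlim=None)
    print(x[0])
    play_audio(y, x[2])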

Keep in mind that WavAugment normalizes neither the noise nor the input tensor.
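
Since no normalization is applied, it can help to see what a signal-to-noise ratio in decibels means for the relative scale of the two inputs. The sketch below shows one common convention: scale the noise so that the ratio of signal power to noise power matches the target. This is a generic illustration and not necessarily the formula WavAugment uses internally; `power` and `mix_at_snr` are hypothetical helpers, and the noise is assumed to have non-zero power.

```python
import math


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Add `noise` to `signal`, scaled so that the mix has the requested SNR in dB."""
    # From SNR_dB = 10 * log10(P_signal / P_noise)
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


signal = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [(-1) ** t * 0.3 for t in range(100)]  # deterministic stand-in for noise
mixed = mix_at_snr(signal, noise, snr_db=5)

# Recover the achieved SNR from the component that was added
added = [m - s for m, s in zip(mixed, signal)]
achieved_snr_db = 10 * math.log10(power(signal) / power(added))
```

Under this convention a lower SNR value gives louder noise relative to the signal.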

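A band-reject filter can be built from the sox `sinc` effect: giving it a high-pass frequency (500 Hz) above the low-pass frequency (100 Hz) rejects the band between them, and `-a 120` sets the stop-band attenuation in dB: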

In [None]:
for x in test_list:
    y = augment.EffectChain().sinc('-a', '120', '500-100').apply(x[1], src_info={'rate': x[2]})
    #plot_waveform(y, x[2], title=str(x[0]), xlim=None)
    print(x[0])
    play_audio(y, x[2])

## Randomization & Combining

In data augmentation we typically want to randomize the applied augmentation and/or its strength. Every effect in WavAugment accepts a Callable in place of any of its parameters, which provides a way to randomize the applied effect. For instance, passing a function instead of a fixed number to `pitch` lets the pitch shift differ from one application to the next.
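
As a rough sketch of how a Callable parameter behaves, the snippet below defines a random pitch shift and a small helper that treats fixed values and callables uniformly. The range of ±400 cents is an assumed value chosen for illustration, and `resolve` is a hypothetical helper that only mimics the idea; it is not WavAugment's code.

```python
import random

# Assumed range for illustration: up to 4 semitones (400 cents) in either direction
random_pitch_shift = lambda: random.randint(-400, 400)


def resolve(param):
    """Call `param` if it is callable, otherwise return it unchanged."""
    return param() if callable(param) else param


# A fixed value stays the same; a callable yields a fresh draw on every call
fixed = [resolve(-200) for _ in range(3)]
drawn = [resolve(random_pitch_shift) for _ in range(3)]

# With WavAugment, the callable is passed in place of the number, e.g.
#   augment.EffectChain().pitch("-q", random_pitch_shift).rate(sample_rate)
```

Passing the function itself, rather than the result of calling it once, is what keeps the shift random across applications.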

If an effect has several parameters, we can replace all or some of them:

In [None]:
random_room_size = lambda: np.random.randint(0, 101)
random_reverb = augment.EffectChain().reverb(50, 50, random_room_size).channels(1)

for x in test_list:
    y = random_reverb.apply(x[1], src_info={'rate': x[2]})
    #plot_waveform(y, x[2], title=str(x[0]), xlim=None)
    play_audio(y, x[2])

We can easily stack augmentations:

In [None]:
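# Assumed range for illustration: up to 400 cents in either direction
random_pitch_shift = lambda: np.random.randint(-400, 400)
for x in test_list:
    noise_generator = lambda: torch.zeros_like(x[1]).uniform_()
    combination = augment.EffectChain() \
      .pitch("-q", random_pitch_shift).rate(x[2]) \
      .reverb(50, 50, random_room_size).channels(1) \
      .additive_noise(noise_generator, snr=15) \
      .time_dropout(max_seconds=1.0)
    y = combination.apply(x[1], src_info={'rate': x[2]}, target_info={'rate': x[2]})
    play_audio(y, x[2])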

## Discussion & Limitations

*  Currently, all augmentations are non-batched (and run on CPU). Hence, it's a good idea to apply them inside a parallelized dataloader (see our [example](https://github.com/facebookresearch/WavAugment/blob/master/examples/python/librispeech_selfsupervised.py)),
* In some corner cases, `pitch` augmentation within libsox might return `NaN`. If this happens, it can be useful to handle this case (as we do [here](https://github.com/facebookresearch/WavAugment/blob/master/examples/python/librispeech_selfsupervised.py#L118)),
* To interpret what sox-based effects do and which parameters they take, please consult the sox [documentation](http://sox.sourceforge.net/sox.html). All effects apart from additive noise, time dropout, and clipping are based on sox,
* The full list of 64 supported effects is:

In [None]:
augment.EffectChain.KNOWN_EFFECTS
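
Returning to the `NaN` corner case mentioned in the list above, one possible way to guard against it is to retry the augmentation a few times and fall back to the unaugmented input. The sketch below is an assumption about a reasonable strategy, not a copy of the linked example; `augment_with_fallback` and `flaky` are hypothetical names, and plain lists stand in for tensors so that it runs without any audio libraries.

```python
import math


def augment_with_fallback(augment_fn, samples, max_tries=3):
    """Apply `augment_fn`; retry on NaN output and fall back to the input."""
    for _ in range(max_tries):
        out = augment_fn(samples)
        if not any(math.isnan(v) for v in out):
            return out
    return list(samples)  # give up and return the unaugmented audio


# Dummy augmentation that returns NaN once, then succeeds
calls = {"n": 0}


def flaky(samples):
    calls["n"] += 1
    if calls["n"] == 1:
        return [float("nan")] * len(samples)
    return [2 * s for s in samples]


result = augment_with_fallback(flaky, [0.1, 0.2])
always_nan = augment_with_fallback(lambda s: [float("nan")] * len(s), [0.1, 0.2])
```

With tensors, the same check can be written as `torch.isnan(y).any()`.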