## Using an RNN for feature extraction from audio input

Sources online suggest applying an RNN to the spectrogram (as Omer did with a CNN): each column of the spectrogram is treated as the input at one time step, and the columns are fed to the recurrent network in order.

Working with the raw audio is not as simple as I originally thought: even with an LSTM (which can handle longer sequences than a simple RNN), we are talking about sequences of length ~1e6, and I don't think we can train on those well naively. Truncated backpropagation through time (TBPTT) is an option, and if time allows we will try it as well, but since the spectrogram approach is the one commonly suggested online, we will go with it.
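To make the length gap concrete, here is a rough back-of-the-envelope comparison. It is only a sketch: it assumes librosa's defaults (a 22,050 Hz sample rate for `librosa.load` and `hop_length=512` for the mel spectrogram), a 20-second clip, and a hypothetical 3-minute song. The helper name `num_frames` is made up for this example.

```python
# Rough sequence-length comparison: raw samples vs. spectrogram frames.
# Assumed values: librosa defaults (sr=22050, hop_length=512) and a 20 s clip.
SAMPLE_RATE = 22050
HOP_LENGTH = 512
CLIP_SECONDS = 20
FULL_SONG_SECONDS = 180  # hypothetical 3-minute song


def num_frames(n_samples: int, hop_length: int) -> int:
    """Number of STFT frames when the signal is center-padded."""
    return 1 + n_samples // hop_length


# Time steps an RNN would see on raw audio.
clip_samples = SAMPLE_RATE * CLIP_SECONDS
song_samples = SAMPLE_RATE * FULL_SONG_SECONDS
# Time steps an RNN would see on the spectrogram of the same clip.
clip_frames = num_frames(clip_samples, HOP_LENGTH)

print(f"raw samples, 20 s clip : {clip_samples:,}")
print(f"raw samples, full song : {song_samples:,}")
print(f"spectrogram frames     : {clip_frames:,}")
```

Under these assumptions, the spectrogram shortens the sequence by roughly a factor of the hop length, which is what makes it a more practical input for a recurrent model.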

In [2]:
import sys; sys.path.append('..')

### Data processing:

#### Playing with the data:

In [None]:
from pychorus.helpers import find_and_output_chorus
import matplotlib.pyplot as plt
from IPython.display import Audio
import numpy as np
import torchaudio
import librosa
import librosa.display  # explicit import; older librosa versions do not load it automatically

For each song, we will focus only on the chorus, for reasons of both performance and computation. In terms of performance, the chorus carries the main message of the song in just a few lines, and it is usually the most powerful, highest-energy, loudest, catchiest, and most memorable part. Thus, it makes sense that most TikTokers will choose this part for their videos. In terms of computation, a shorter audio clip (only the chorus rather than the whole song) gives a smaller spectrogram and requires less computation.

To do so, we will use the pychorus library.

In [None]:
x, sr = librosa.load('../data/audio/0e3CM2Fm4cpDtxjzYkdLAr.mp3')
start = int(find_and_output_chorus(input_file='../data/audio/0e3CM2Fm4cpDtxjzYkdLAr.mp3', output_file=None, clip_length=20))
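One thing to watch for: as far as I can tell, `find_and_output_chorus` returns `None` when no chorus is detected, in which case the `int(...)` call above would fail. The sketch below shows one way to guard against that. `chorus_slice` is a hypothetical helper, and falling back to the middle of the track is an arbitrary choice made for this example, not something pychorus does.

```python
import numpy as np


def chorus_slice(x, sr, start_sec, clip_length=20):
    """Return `clip_length` seconds of `x` starting at `start_sec`.

    Hypothetical helper. If `start_sec` is None (no chorus detected),
    it falls back to a window centered in the track.
    """
    n_clip = int(clip_length * sr)
    if start_sec is None:
        start = (len(x) - n_clip) // 2
    else:
        start = int(start_sec * sr)
    # Clamp so the window never starts before 0 or runs past the end.
    start = max(0, min(start, max(0, len(x) - n_clip)))
    return x[start:start + n_clip]


# Usage with a synthetic 60 s signal at a toy sample rate of 100 Hz.
sr = 100
x = np.arange(60 * sr, dtype=np.float32)
detected = chorus_slice(x, sr, start_sec=12.5)  # chorus found at 12.5 s
fallback = chorus_slice(x, sr, start_sec=None)  # no chorus found
near_end = chorus_slice(x, sr, start_sec=55)    # window clamped to the end
```

The clamping also keeps every clip at the same length, which matters later when the spectrograms are stacked into batches.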

And now let's plot the predicted chorus and hear it:

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x[start*sr:(start+20)*sr], sr=sr)  # same 20 s window as the chorus clip

In [None]:
chorus = x[start*sr:(start+20)*sr]
Audio(data = chorus, rate=sr)

It really sounds like the actual chorus (cut off in the middle because I limit the duration to 20 seconds).

Let's plot the spectrogram of the chorus.

In [None]:
S = librosa.feature.melspectrogram(y=chorus, sr=sr) #n_fft=2048, hop_length=512 by default
fig, ax = plt.subplots()
S_dB = librosa.power_to_db(S, ref=np.max(S))
img = librosa.display.specshow(S_dB, x_axis='time',
                         y_axis='mel', sr=sr,
                         fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency spectrogram')

It seems that the high frequencies have high dB values, which might indicate a more rhythmic song (consistent with the previous plot, where we can see rapid changes in the signal). This in turn may indicate the virality of the song, but we will let the model decide.

Let's create the preprocessing pipeline.

#### Preprocessing pipeline:

The following is the basic pipeline:

                Raw audio -> calculate mean of channels -> extract chorus from audio -> create spectrogram from audio -> convert spectrogram from amplitude to dB

To avoid redundant calculations and speed up training, I will create all the spectrograms before training, save them as files, and only load them in each epoch.
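The helpers used below (`rechannel`, `get_chorus`, `createSpect`) live in `src.RNN_utils.audio_utils` and are not shown here. As a rough illustration of two of the pipeline steps, here is a NumPy sketch of channel averaging and dB conversion. The names `to_mono` and `power_to_db_sketch` are hypothetical, and the dB step assumes a power spectrogram (hence `10 * log10`), so it may differ from what the real helpers do.

```python
import numpy as np


def to_mono(waveform):
    """Average the channels of a (channels, samples) array into one channel."""
    return waveform.mean(axis=0, keepdims=True)


def power_to_db_sketch(power, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels relative to its maximum."""
    ref = max(amin, float(power.max()))
    db = 10.0 * np.log10(np.maximum(amin, power)) - 10.0 * np.log10(ref)
    # Limit the dynamic range so near-silent bins do not dominate.
    return np.maximum(db, db.max() - top_db)


# Toy stereo signal: 2 channels, 3 samples.
stereo = np.array([[1.0, 0.0, -1.0],
                   [0.0, 0.0, 1.0]])
mono = to_mono(stereo)  # shape (1, 3)

# Toy power values, each a factor of 10 below the previous one.
power = np.array([[1.0, 0.1, 0.01]])
db = power_to_db_sketch(power)
print(mono.shape, db)
```

Each factor of 10 in power corresponds to a 10 dB step, which is why the dB scale compresses the large dynamic range of audio into values a network can handle more easily.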

In [3]:
from torch.utils.data import random_split
from src.RNN_utils.audio_utils import rechannel, get_chorus, createSpect
import pandas as pd
import torch
import matplotlib.pyplot as plt
from tqdm import tqdm
import torchaudio

AUDIO_PATH = '../data/audio'

TENSOR_PATH = '../data/specs'

METADATA_PATH = '../data/metadata.csv'

In [None]:
import os

# create the output directory; do not fail if it already exists
os.makedirs(TENSOR_PATH, exist_ok=True)

In [4]:
df = pd.read_csv(METADATA_PATH)

Let's start by applying the pipeline to every song in the metadata and saving the resulting tensors as files:

In [5]:
for idx in tqdm(df.index):
    song_path = AUDIO_PATH + '/' + df.loc[idx,'id'] + '.mp3'
    class_id = df.loc[idx,'viral']
    # load the audio file
    aud = torchaudio.load(song_path)
    # convert the audio to mono
    aud = rechannel(aud, new_channel=1)
    # keep only the chorus part of the signal
    aud = get_chorus(song_path, 20, aud)
    # create the mel spectrogram
    sgram = createSpect(aud, n_mels=64)
    # save the spectrogram tensor under the song id
    torch.save(sgram, TENSOR_PATH + '/' + df.loc[idx,'id'] + '.pt')

100%|██████████| 3648/3648 [1:59:50<00:00,  1.97s/it]  


Now, let's use the SoundDS class to create a dataset from those tensors, and then create data loaders for both the training and validation (test) sets:

In [6]:
from src.RNN_utils.dataset import SoundDS
myds = SoundDS(pd.read_csv('../data/metadata.csv'), '../data/specs/')

# Random split of 80:20 between training and validation
num_items = len(myds)
num_train = round(num_items * 0.8)
num_val = num_items - num_train
train_ds, val_ds = random_split(myds, [num_train, num_val])

# Create training and validation data loaders
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=16, shuffle=True)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=16, shuffle=False)

Next we will check that everything is working properly:

In [9]:
inputs, labels = next(iter(train_dl))

In [11]:
print(f'Batch input shape: {inputs.shape}')
print(f'Batch label shape: {labels.shape}')

Batch input shape: torch.Size([16, 1, 64, 2206])
Batch label shape: torch.Size([16])


As we can see, each batch has 16 samples of shape (1, 64, 2206): 1 channel, 64 mel frequency bins, and 2206 time windows. With the data loaders in place, we can now move on to the model part!

### The Model:

We will use an RNN-based model in this notebook.

Because our input has length 2206, which is pretty long, we won't use the basic RNN unit but the LSTM (Long Short-Term Memory). The advantage of the LSTM over the basic RNN is its ability to "remember" information from much earlier inputs. In addition, it mitigates the vanishing gradient problem, which we might suffer from with the basic RNN because of our long input sequences.
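To illustrate how a batch of spectrograms can be fed to an LSTM one column at a time, here is a minimal sketch. It is not necessarily the model used in this notebook: the class name `SpectrogramLSTM`, the hidden size of 128, and the two output classes are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class SpectrogramLSTM(nn.Module):
    """Hypothetical LSTM classifier over mel-spectrogram columns."""

    def __init__(self, n_mels=64, hidden_size=128, num_classes=2):
        super().__init__()
        # Each time step receives one spectrogram column of n_mels values.
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden_size,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, 1, n_mels, time) -> (batch, time, n_mels)
        x = x.squeeze(1).permute(0, 2, 1).contiguous()
        _, (h_n, _) = self.lstm(x)
        # h_n: (num_layers, batch, hidden); classify from the final hidden state.
        return self.fc(h_n[-1])


model = SpectrogramLSTM()
dummy = torch.randn(4, 1, 64, 50)  # short random batch, just for a shape check
logits = model(dummy)
print(logits.shape)
```

The key step is the reshape in `forward`: the channel dimension is dropped and the frequency and time axes are swapped, so that the LSTM iterates over time windows and sees all mel bins of one window at each step.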

In [13]:
import torch.nn as nn
import torch