## Using RNN for feature extraction from audio input

Online sources suggest applying an RNN to the spectrogram (as Omer did with the CNN), treating each column of the spectrogram as the input at one time step and feeding the columns to the recurrent network in order.
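
A minimal sketch of this idea, assuming the spectrogram is stored as a `(n_mels, n_frames)` tensor. The sizes below are illustrative and not taken from our dataset; the point is only that each column becomes one time step of an `nn.LSTM`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_mels, n_frames = 64, 100           # illustrative sizes (assumption)
spec = torch.rand(n_mels, n_frames)  # stand-in for a real spectrogram

# Columns are time steps, so move time to the first axis: (n_frames, n_mels),
# then add a batch dimension: (1, n_frames, n_mels).
seq = spec.T.unsqueeze(0)

lstm = nn.LSTM(input_size=n_mels, hidden_size=32, batch_first=True)
out, (h_n, c_n) = lstm(seq)

print(out.shape)  # one hidden vector per spectrogram column
print(h_n.shape)  # final hidden state, usable as a fixed-size feature vector
```

For a single-layer, unidirectional LSTM the last row of `out` equals the final hidden state, so either can serve as the extracted feature.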

Working with the raw audio is not as simple as I originally thought: even with an LSTM (which can handle longer sequences than a simple RNN), we are talking about sequences of length ~1e6, and I think we won't be able to train this well naively. It is possible to use truncated backpropagation through time (TBPTT), and if time allows we will try that as well, but because the spectrogram approach is the commonly suggested one, we will go with it.
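
To make the orders of magnitude concrete, here is a rough calculation. The 3-minute song length is an assumption for illustration; 22050 Hz and a hop of 512 are librosa's defaults, and the frame formula is the one for a centered STFT.

```python
def n_samples(duration_s, sample_rate):
    """Number of raw audio samples in a clip."""
    return int(duration_s * sample_rate)

def n_frames(num_samples, hop_length):
    """Number of frames produced by a centered STFT."""
    return 1 + num_samples // hop_length

sr = 22050   # librosa.load's default sample rate
hop = 512    # librosa's default hop_length

full_song = n_samples(180, sr)   # assumed 3-minute song
clip = n_samples(20, sr)         # 20-second clip

print(full_song)             # 3969000
print(clip)                  # 441000
print(n_frames(clip, hop))   # 862
```

So the raw waveform gives sequences of hundreds of thousands to millions of steps, while the spectrogram of a short clip has on the order of a thousand.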

In [1]:
import sys; sys.path.append('..')
import torch
from torch.utils.data import random_split
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import torchaudio
from src.RNN_utils.dataset import SoundDS

AUDIO_PATH = '../data/audio'

TENSOR_PATH = '../data/specs'

METADATA_PATH = '../data/metadata.csv'

SEED = 42

torch.manual_seed(SEED)

<torch._C.Generator at 0x1e3273cfab0>

### Data processing:

#### Playing with the data:

In [None]:
from pychorus.helpers import find_and_output_chorus
import matplotlib.pyplot as plt
from IPython.display import Audio
import numpy as np
import librosa
import librosa.display  # needed for waveshow / specshow below

For each song, we will focus only on the chorus, for reasons of both performance and computation. In terms of performance, the chorus carries the main message of the song in just a few lines, and it is usually the most powerful, highest-energy, loudest, catchiest, and most memorable part of a song. Thus, it makes sense that most TikTokers will choose this part for their videos. In terms of computation, working on a shorter audio file (only the chorus rather than the whole song), and hence a smaller spectrogram, requires fewer computations.

In order to do so, we will use the pychorus library.

In [None]:
x, sr = librosa.load('../data/audio/0e3CM2Fm4cpDtxjzYkdLAr.mp3')
start = int(find_and_output_chorus(input_file='../data/audio/0e3CM2Fm4cpDtxjzYkdLAr.mp3', output_file=None, clip_length=20))

And now let's plot the predicted chorus and hear it:

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x[start*sr:(start+30)*sr], sr=sr)

In [None]:
chorus = x[start*sr:(start+20)*sr]
Audio(data = chorus, rate=sr)

And it really sounds like the real chorus (cut off in the middle because I limited the duration to 20 seconds).

Let's try to plot the spectrogram of the chorus.

In [None]:
S = librosa.feature.melspectrogram(y=chorus, sr=sr) #n_fft=2048, hop_length=512 by default
fig, ax = plt.subplots()
S_dB = librosa.power_to_db(S, ref=np.max(S))
img = librosa.display.specshow(S_dB, x_axis='time',
                         y_axis='mel', sr=sr,
                         fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency spectrogram')

It seems that the high frequencies have high dB values, which might indicate a more rhythmic song (consistent with the previous plot, where we can see rapid changes in the signal). This in turn may be indicative of the song's virality, but we will let the model decide.

Let's create the pipeline of the preprocessing.

#### Preprocessing pipeline:

The following is the basic pipeline:

                Raw audio -> average the channels (mono) -> extract the chorus -> create the mel spectrogram -> convert the spectrogram from amplitude to dB

To avoid redundant calculations and speed up training, I will create all the spectrograms before training, save them as files, and only load them in each epoch.

In [2]:
from torch.utils.data import random_split
from src.RNN_utils.audio_utils import rechannel, get_chorus, createSpect
import pandas as pd
import torch
from tqdm import tqdm
import torchaudio

AUDIO_PATH = '../data/audio'

TENSOR_PATH = '../data/specs'

METADATA_PATH = '../data/metadata.csv'

In [None]:
import os

# create the output folder; do not fail if it already exists
os.makedirs(TENSOR_PATH, exist_ok=True)

In [3]:
df = pd.read_csv(METADATA_PATH)

Let's start by applying the pipeline to every song in the metadata file and saving the resulting tensors as files:

In [None]:
for idx in tqdm(df.index):
    song_id = df.loc[idx, 'id']
    song_path = AUDIO_PATH + '/' + song_id + '.mp3'
    # load the audio file; torchaudio.load returns (waveform, sample_rate)
    aud = torchaudio.load(song_path)
    # convert the audio to mono
    aud = rechannel(aud, new_channel=1)
    # keep only the chorus part of the signal
    aud = get_chorus(song_path, 20, aud)
    # create the mel spectrogram
    sgram = createSpect(aud, n_mels=64)
    torch.save(sgram, TENSOR_PATH + '/' + song_id + '.pt')

Now, let's use the SoundDS class to create a dataset from those tensors, and then create data loaders for both the training and validation (test) sets:

In [4]:
from src.RNN_utils.dataset import SoundDS
myds = SoundDS(pd.read_csv('../data/metadata.csv'), '../data/specs/')

# Random split of 80:20 between training and validation
num_items = len(myds)
num_train = round(num_items * 0.8)
num_val = num_items - num_train
train_ds, val_ds = random_split(myds, [num_train, num_val])

# Create training and validation data loaders
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=16, shuffle=True)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=16, shuffle=False)
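
The source of `SoundDS` is not shown in this notebook. As a rough idea of what such a dataset might look like, here is a hypothetical `SpecDataset` that loads one saved tensor per item; the file naming (`<id>.pt`), the label list and the tensor sizes are all assumptions made for the sketch.

```python
import os
import tempfile
import torch
from torch.utils.data import Dataset, DataLoader

class SpecDataset(Dataset):
    """Hypothetical stand-in for SoundDS: loads one saved tensor per item."""
    def __init__(self, ids, labels, spec_dir):
        self.ids, self.labels, self.spec_dir = ids, labels, spec_dir

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        spec = torch.load(os.path.join(self.spec_dir, self.ids[idx] + ".pt"))
        return spec, self.labels[idx]

# Write a few fake spectrograms to a temporary folder, then batch them.
with tempfile.TemporaryDirectory() as tmp:
    ids = [f"song{i}" for i in range(6)]
    for name in ids:
        torch.save(torch.rand(50, 64), os.path.join(tmp, name + ".pt"))
    ds = SpecDataset(ids, [0, 1, 0, 1, 0, 1], tmp)
    xb, yb = next(iter(DataLoader(ds, batch_size=4, shuffle=False)))

print(xb.shape, yb.shape)
```

Because every saved tensor has the same shape, the default collate function can stack them into a single batch without any padding.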

Next we will check that everything is working properly:

In [5]:
inputs, labels = next(iter(train_dl))

In [6]:
print(f'Batch input shape: {inputs.shape}')
print(f'Batch label shape: {labels.shape}')

Batch input shape: torch.Size([16, 2206, 64])
Batch label shape: torch.Size([16])


As we can see, each batch has 16 samples of shape (2206, 64): 2206 time windows and 64 mel frequency bins. There is only one channel. With the data loaders ready, we can now move on to the model!

### The Model:

We will use an RNN-based model in this notebook.

Because our input has length 2206, which is quite long, we won't use the basic RNN unit but the LSTM (Long Short-Term Memory). The advantage of the LSTM over the basic RNN is its ability to "remember" information from much earlier inputs. In addition, it mitigates the vanishing gradient problem, which we might suffer from with the basic RNN because our input sequences are long.

In [2]:
import torch.nn as nn

In [3]:
class viralCls(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.0, num_classes=2):
        super().__init__()
        self.feature_extractor = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True, dropout=dropout)
        self.clf = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.LeakyReLU(),
            nn.Linear(64, 64),
            nn.LeakyReLU(),
            nn.Linear(64, num_classes),
            nn.Softmax(dim=1)  # outputs probabilities; note that nn.CrossEntropyLoss expects raw logits
        )
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_classes = num_classes

    def forward(self, X, h0=None, c0=None):
        batch_size = X.shape[0]
        if h0 is None or c0 is None:
            # random (standard normal) initial hidden and cell states, on the same device as the input
            state_shape = (self.num_layers, batch_size, self.hidden_size)
            h0 = torch.randn(state_shape, device=X.device)
            c0 = torch.randn(state_shape, device=X.device)

        # extract the features from the spectrogram
        out, _ = self.feature_extractor(X, (h0, c0))

        # classify according to the features at the last time step
        prob = self.clf(out[:, -1, :])
        return prob

Let's see if the new classifier is working on random input:

In [4]:
model = viralCls(5,10)
X = torch.rand(10,20,5)
model(X).shape

torch.Size([10, 2])

The input is 10 samples, each of length 20 with 5 features per time step. The output is a probability distribution over 2 classes for each of the 10 samples. Success!
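
One thing worth checking with this design: `nn.CrossEntropyLoss`, which is used for training later in this notebook, applies log-softmax to its input itself, so it expects raw logits. The small sketch below (toy tensors, not our model) compares the loss on a perfectly confident prediction given as probabilities with the same prediction given as logits.

```python
import math
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
target = torch.tensor([0])

# A perfectly confident, correct prediction expressed as probabilities.
probs = torch.tensor([[1.0, 0.0]])
loss_on_probs = criterion(probs, target).item()

# The same confidence expressed as raw logits.
logits = torch.tensor([[20.0, -20.0]])
loss_on_logits = criterion(logits, target).item()

print(round(loss_on_probs, 4))   # 0.3133, i.e. log(1 + e^-1)
print(loss_on_logits < 1e-6)     # True
```

With two classes, the loss therefore cannot fall below about 0.313 as long as the `Softmax` layer sits in front of it. A common arrangement is to return logits from the model and apply softmax only when probabilities are needed.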

### The training loop:

As before, I will first create the data loaders:

In [4]:
from src.RNN_utils.dataset import SoundDS
myds = SoundDS(pd.read_csv('../data/metadata.csv'), '../data/specs/')

# Random split of 80:20 between training and validation
num_items = len(myds)
num_train = round(num_items * 0.8)
num_val = num_items - num_train
train_ds, val_ds = random_split(myds, [num_train, num_val])

# Create training and validation data loaders
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=32, shuffle=False)

In [5]:
b_size, seq_len, input_size = next(iter(train_dl))[0].shape
num_batches = len(train_dl)
hidden_size = 64

#### Overfitting the model:

Let's create the classification model:

In [6]:
model = viralCls(input_size, hidden_size)

We will use cross entropy loss and Adam optimizer:

In [27]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=5e-3)

epochs = 10

Let's start by training the model to overfit the first batch:

In [28]:
(X,y) = next(iter(train_dl))
for epoch in range(epochs):
    optimizer.zero_grad()
    y_prob = model(X)
    loss = criterion(y_prob,y)
    loss.backward()
    optimizer.step()
    loss = loss.item()
    acc = torch.sum(torch.argmax(y_prob,dim=1)==y).item()/len(y)  # divide by the actual batch size
    #scheduler.step()
    print(f'Epoch #{epoch}: Loss - {loss}, Accuracy - {acc}')

Epoch #0: Loss - 0.6927495002746582, Accuracy - 0.53125
Epoch #1: Loss - 0.6708220839500427, Accuracy - 0.875
Epoch #2: Loss - 0.6415773630142212, Accuracy - 0.96875
Epoch #3: Loss - 0.5991092920303345, Accuracy - 0.9375
Epoch #4: Loss - 0.5533318519592285, Accuracy - 0.9375
Epoch #5: Loss - 0.5023677945137024, Accuracy - 0.96875
Epoch #6: Loss - 0.4437624514102936, Accuracy - 0.96875
Epoch #7: Loss - 0.38921135663986206, Accuracy - 1.0
Epoch #8: Loss - 0.35495197772979736, Accuracy - 1.0
Epoch #9: Loss - 0.3279470205307007, Accuracy - 1.0


#### Cross validation:

And now for the real training:

In [6]:
from src.RNN_utils.trainer import trainer
from src.RNN_utils.cross_val import crossValidate

model = viralCls(input_size, hidden_size)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=1e-3, weight_decay=3e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,step_size=20,gamma=0.5)

train_model = trainer(model,criterion,optimizer,scheduler)

In [7]:
cv_obj = crossValidate(train_ds=train_ds)

In [None]:
results = [cv_obj.runCV(train_model, epochs=10)]

In [None]:
from src.RNN_utils.cross_val import plotCV

plotCV(results, [{'lr':1e-3,'weight_decay':3e-3}],title='CV for RNN')

#### Training:

In [9]:
from src.RNN_utils.trainer import trainer

model = viralCls(input_size, hidden_size)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=1e-3, weight_decay=3e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,step_size=20,gamma=0.5)

train_model = trainer(model,criterion,optimizer,scheduler)
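
For reference, this is the learning-rate schedule these settings imply, written as a plain function. It assumes the trainer calls `scheduler.step()` once per epoch, which I have not shown here; `step_lr` is just an illustrative helper, not part of PyTorch.

```python
def step_lr(base_lr, epoch, step_size=20, gamma=0.5):
    """Learning rate in effect during `epoch` under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)

for epoch in (0, 19, 20, 39, 40, 49):
    print(epoch, step_lr(1e-3, epoch))
```

Over 50 epochs the rate is 1e-3 for epochs 0 to 19, 5e-4 for epochs 20 to 39, and 2.5e-4 for epochs 40 to 49.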

In [10]:
results = train_model.train(train_dl, 50, True)

Train Batch: 100%|██████████| 92/92 [00:49<00:00,  1.87it/s]


Epoch #0: Loss - 60.964875876903534, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:45<00:00,  2.04it/s]


Epoch #1: Loss - 60.423157036304474, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:50<00:00,  1.84it/s]


Epoch #2: Loss - 60.13418298959732, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:53<00:00,  1.72it/s]


Epoch #3: Loss - 60.00037455558777, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:52<00:00,  1.75it/s]


Epoch #4: Loss - 59.96689349412918, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:49<00:00,  1.84it/s]


Epoch #5: Loss - 59.251351058483124, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:52<00:00,  1.74it/s]


Epoch #6: Loss - 59.97588676214218, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:58<00:00,  1.58it/s]


Epoch #7: Loss - 59.71102350950241, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:56<00:00,  1.64it/s]


Epoch #8: Loss - 60.042579650878906, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:50<00:00,  1.83it/s]


Epoch #9: Loss - 58.92647331953049, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:52<00:00,  1.74it/s]


Epoch #10: Loss - 58.934528052806854, Accuracy - 0.6297185998627316


Train Batch: 100%|██████████| 92/92 [00:55<00:00,  1.66it/s]


Epoch #11: Loss - 58.69205057621002, Accuracy - 0.6365820178448868


Train Batch: 100%|██████████| 92/92 [01:01<00:00,  1.51it/s]


Epoch #12: Loss - 58.73748975992203, Accuracy - 0.651681537405628


Train Batch: 100%|██████████| 92/92 [01:01<00:00,  1.51it/s]


Epoch #13: Loss - 58.327474534511566, Accuracy - 0.6547700754975978


Train Batch: 100%|██████████| 92/92 [00:58<00:00,  1.58it/s]


Epoch #14: Loss - 58.0875568985939, Accuracy - 0.6564859299931366


Train Batch: 100%|██████████| 92/92 [01:06<00:00,  1.39it/s]


Epoch #15: Loss - 57.870153307914734, Accuracy - 0.6650652024708305


Train Batch: 100%|██████████| 92/92 [01:01<00:00,  1.51it/s]


Epoch #16: Loss - 56.92665520310402, Accuracy - 0.6750171585449554


Train Batch: 100%|██████████| 92/92 [01:04<00:00,  1.43it/s]


Epoch #17: Loss - 57.117179811000824, Accuracy - 0.6650652024708305


Train Batch: 100%|██████████| 92/92 [01:06<00:00,  1.38it/s]


Epoch #18: Loss - 56.90648704767227, Accuracy - 0.668496911461908


Train Batch: 100%|██████████| 92/92 [01:25<00:00,  1.07it/s]


Epoch #19: Loss - 57.1029235124588, Accuracy - 0.667124227865477


Train Batch: 100%|██████████| 92/92 [01:27<00:00,  1.05it/s]


Epoch #20: Loss - 55.18010175228119, Accuracy - 0.6976664378860673


Train Batch: 100%|██████████| 92/92 [01:24<00:00,  1.09it/s]


Epoch #21: Loss - 54.24480104446411, Accuracy - 0.7189430336307481


Train Batch: 100%|██████████| 92/92 [01:27<00:00,  1.05it/s]


Epoch #22: Loss - 52.08785858750343, Accuracy - 0.7378174330816747


Train Batch: 100%|██████████| 92/92 [01:14<00:00,  1.23it/s]


Epoch #23: Loss - 50.74869677424431, Accuracy - 0.7577213452299245


Train Batch: 100%|██████████| 92/92 [01:12<00:00,  1.27it/s]


Epoch #24: Loss - 49.168992817401886, Accuracy - 0.7769389155799589


Train Batch: 100%|██████████| 92/92 [01:10<00:00,  1.30it/s]


Epoch #25: Loss - 48.24203506112099, Accuracy - 0.7855181880576527


Train Batch: 100%|██████████| 92/92 [01:06<00:00,  1.38it/s]


Epoch #26: Loss - 46.74368596076965, Accuracy - 0.8098833218943033


Train Batch: 100%|██████████| 92/92 [01:02<00:00,  1.47it/s]


Epoch #27: Loss - 46.004360258579254, Accuracy - 0.8088538091969801


Train Batch: 100%|██████████| 92/92 [01:07<00:00,  1.36it/s]


Epoch #28: Loss - 45.35944101214409, Accuracy - 0.8239533287577213


Train Batch: 100%|██████████| 92/92 [01:15<00:00,  1.22it/s]


Epoch #29: Loss - 43.895823538303375, Accuracy - 0.8400823610157858


Train Batch: 100%|██████████| 92/92 [01:46<00:00,  1.15s/it]


Epoch #30: Loss - 44.01968061923981, Accuracy - 0.8400823610157858


Train Batch: 100%|██████████| 92/92 [01:11<00:00,  1.29it/s]


Epoch #31: Loss - 43.564445823431015, Accuracy - 0.8428277282086479


Train Batch: 100%|██████████| 92/92 [01:09<00:00,  1.32it/s]


Epoch #32: Loss - 42.0831498503685, Accuracy - 0.86238846945779


Train Batch: 100%|██████████| 92/92 [01:11<00:00,  1.29it/s]


Epoch #33: Loss - 41.92361205816269, Accuracy - 0.8661633493479753


Train Batch: 100%|██████████| 92/92 [01:12<00:00,  1.27it/s]


Epoch #34: Loss - 41.62326240539551, Accuracy - 0.8665065202470831


Train Batch: 100%|██████████| 92/92 [01:10<00:00,  1.30it/s]


Epoch #35: Loss - 40.008546620607376, Accuracy - 0.8846945778997941


Train Batch: 100%|██████████| 92/92 [01:09<00:00,  1.33it/s]


Epoch #36: Loss - 39.41250318288803, Accuracy - 0.8901853122855182


Train Batch: 100%|██████████| 92/92 [01:11<00:00,  1.29it/s]


Epoch #37: Loss - 39.26542356610298, Accuracy - 0.893273850377488


Train Batch: 100%|██████████| 92/92 [01:11<00:00,  1.29it/s]


Epoch #38: Loss - 39.527875155210495, Accuracy - 0.8953328757721345


Train Batch: 100%|██████████| 92/92 [01:15<00:00,  1.22it/s]


Epoch #39: Loss - 39.03679233789444, Accuracy - 0.9049416609471517


Train Batch: 100%|██████████| 92/92 [01:18<00:00,  1.17it/s]


Epoch #40: Loss - 37.963591039180756, Accuracy - 0.9094028826355525


Train Batch: 100%|██████████| 92/92 [01:18<00:00,  1.17it/s]


Epoch #41: Loss - 36.87475794553757, Accuracy - 0.9217570350034318


Train Batch: 100%|██████████| 92/92 [01:12<00:00,  1.27it/s]


Epoch #42: Loss - 36.56873497366905, Accuracy - 0.924159231297186


Train Batch: 100%|██████████| 92/92 [01:11<00:00,  1.29it/s]


Epoch #43: Loss - 36.49076610803604, Accuracy - 0.926904598490048


Train Batch: 100%|██████████| 92/92 [01:14<00:00,  1.24it/s]


Epoch #44: Loss - 36.15938702225685, Accuracy - 0.9293067947838023


Train Batch: 100%|██████████| 92/92 [01:16<00:00,  1.20it/s]


Epoch #45: Loss - 36.097174137830734, Accuracy - 0.9296499656829101


Train Batch: 100%|██████████| 92/92 [01:13<00:00,  1.24it/s]


Epoch #46: Loss - 36.47840404510498, Accuracy - 0.9299931365820179


Train Batch: 100%|██████████| 92/92 [01:15<00:00,  1.22it/s]


Epoch #47: Loss - 35.83904209733009, Accuracy - 0.9299931365820179


Train Batch: 100%|██████████| 92/92 [01:22<00:00,  1.11it/s]


Epoch #48: Loss - 35.774996131658554, Accuracy - 0.9289636238846946


Train Batch: 100%|██████████| 92/92 [01:16<00:00,  1.20it/s]

Epoch #49: Loss - 35.697693794965744, Accuracy - 0.9337680164722032





In [11]:
train_model.evaluate(val_dl)

Test Batch: 100%|██████████| 23/23 [00:08<00:00,  2.65it/s]


(15.972053527832031, 0.5967078189300411)
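
To compare these numbers with the training log, it helps to put the losses on the same scale. The sketch below assumes that the trainer reports the sum of the per-batch losses (its source is not shown here, so this is a guess based on the magnitudes) and divides by the number of batches.

```python
# Values copied from the training log and the evaluation output above.
train_loss_first, train_loss_last, n_train_batches = 60.964875876903534, 35.697693794965744, 92
val_loss, n_val_batches = 15.972053527832031, 23
train_acc, val_acc = 0.9337680164722032, 0.5967078189300411

# Assumption: the trainer reports the sum of per-batch losses.
print(round(train_loss_first / n_train_batches, 3))  # 0.663
print(round(train_loss_last / n_train_batches, 3))   # 0.388
print(round(val_loss / n_val_batches, 3))            # 0.694
print(round(train_acc - val_acc, 3))                 # 0.337
```

Under that assumption, the validation loss per batch (about 0.694) is close to ln 2 ≈ 0.693, and the gap of roughly 34 points between training and validation accuracy would be consistent with the model overfitting the training set.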