# Lab Audio / Audio Source Separation by U-Net

**WARNING: This Jupyter notebook was created as part of the Télécom-Paris teaching programme and is distributed only to its students. 
Any re-use, modification, distribution outside this framework or making it available (through GitHub, Colab or others) is forbidden.**

- Date: 2022/02/14
- Version: v0.1
- Author: geoffroy.peeters@telecom-paris.fr

## Objectives of this Lab:

The goal of this Lab is to implement a Blind-Audio-Source-Separation (BASS) model.
You will implement the U-net model proposed by Spotify in [Jansson et al., ISMIR, 2017](https://ejhumphrey.com/assets/pdf/jansson2017singing.pdf).
This model was a real breakthrough for STFT-based BASS systems.


## Deep Neural Network

The model is a U-Net, i.e. an Auto-Encoder with skip-connections between the Encoder and the Decoder.
The model takes as input the amplitude STFT $|X(t,f)|$ of a mixed signal.
It gives as output an estimate of the Ideal Soft/Ratio Mask (IRM) $M(t,f)$ to be applied to $|X(t,f)|$ to get the amplitude STFT of the isolated source (here the vocal) $|\hat{S}_j(t,f)|$: 

$$|\hat{S}_j(t,f)|=|X(t,f)| \odot M_j(t,f)$$

The mask is the output of the network, $M_j(t,f) = f_{\theta}(|X(t,f)|)$, and is trained in a supervised way given a training set made of examples $(x^{(i)}=|X(t,f)|,y^{(i)}=|S_j(t,f)|)$.

To train the system you will minimize the L1-loss between $|\hat{S}_j(t,f)|$ and $|S_j(t,f)|$.

**Note**: in our case, we deal with a single output $j$, hence we can omit the subscript $j$.
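The masking and the L1-loss described above can be sketched on toy arrays. This is only an illustration in NumPy: the array sizes, the constant mask and the 60% vocal share are arbitrary assumptions, and the constant mask merely stands in for the network output $f_{\theta}(|X(t,f)|)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrograms of shape (n_freq, n_frame); the sizes are arbitrary.
X_mag = rng.random((5, 4))        # |X(t,f)|: mixture
S_mag = 0.6 * X_mag               # |S(t,f)|: pretend the vocal holds 60% of the magnitude

# Stand-in for the network output: a soft mask with values in [0, 1].
mask = np.full_like(X_mag, 0.5)

# Element-wise product: |S_hat(t,f)| = |X(t,f)| * M(t,f)
S_hat_mag = X_mag * mask

# L1-loss averaged over all time-frequency bins.
l1_loss = np.mean(np.abs(S_hat_mag - S_mag))
print('L1 loss:', l1_loss)
```

In PyTorch, ``nn.L1Loss()`` with its default reduction computes the same mean absolute difference.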

## Dataset

You will train the network using the ``ccmixter`` dataset (https://sigsep.github.io/datasets/).
This dataset is not the largest but it is manageable in the framework of a Lab.
It contains 50 examples of pairs (mix,vocal) signals.

This dataset is available as a .json file and a .zip file.
- A light version of the audio (conversion to mono and downsampling to 22,050 Hz) is accessible here https://drive.google.com/file/d/1gOvAGRKifgB-UjBf7kJS3ZKnsfykGY8D/view?usp=sharing.
- The corresponding dataset definition is accessible through a .json file here https://drive.google.com/file/d/1DnNLlMaARjYCpoeUX4dDEmy2oN9cCVzj/view?usp=sharing. 

You will first upload both files to your Google Drive in a folder named 'My Drive/_sound/_ccmixter'.
You will then **mount your Google Drive** in this notebook to allow accessing its storage.
For this lab, we will also run the code on the GPU of Google Colab. Do not forget to set the **execution type** to GPU.

When creating your ``dataset`` class, you will split this dataset into a train and a test part.
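One simple way to make such a split is to do it at the file level, so that patches of the same song never end up in both parts. The sketch below keeps every tenth file for testing; the rule ``idx % 10`` is the one used by the ``dataset`` class later in this notebook, while the integer list is only a placeholder for the 50 dataset entries.

```python
# Placeholder for the 50 entries of the dataset description.
items = list(range(50))

# Every tenth file goes to the test part, the others to the training part.
test_items = [item for idx, item in enumerate(items) if idx % 10 == 0]
train_items = [item for idx, item in enumerate(items) if idx % 10 != 0]

print(len(train_items), len(test_items))
```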


## Feature extraction.

BASS can be performed directly at the waveform level, but this requires a lot of training data and computation time; here we use the STFT as representation.
We then perform the separation in the magnitude STFT domain and get back the audio using STFT$^{-1}$.

## Audio patches

As in the previous lab (Magna-tag-a-tune), data will be processed by patches/slices of STFT frames.

## Dataset and Dataloader 

As in the previous lab (Magna-tag-a-tune), you will create two ``dataset`` (one for training, the other for testing) and two ``dataloader`` (same) to help you with the management of the data on the fly.

In our case, the ``__getitem__`` method should return the input and ground-truth output $x=|X(t,f)|$ and $y=|S_j(t,f)|$.


## Testing

Once trained, you will test the performance of your system using the standard SDR, SIR and SAR performance measures. 
To get those, you will apply your trained model to all the patches of a given test file, and from those, you will reconstruct the estimated audio of the isolated track $\hat{s}_j(t)$ (*). 

With the ground-truth $s_j(t)$ and estimated $\hat{s}_j(t)$ isolated audio files, you will then compute the standard SDR, SIR and SAR performance measures.
You will repeat the process for all the files of the test-set and compute the averaged values.

(*) The temporal signal is reconstructed using the STFT$^{-1}$ algorithm (inverse DFT with the overlap-add method) using the estimated amplitude STFT $|\hat{S}_j(t,f)|$ and the phase of the original mixed signal $\phi_{X(t,f)}$.
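To give an intuition of what these measures express, here is a deliberately simplified signal-to-distortion ratio in dB. It is an assumption-laden sketch, not the BSS Eval definition: the standard SDR, SIR and SAR additionally decompose the error into interference and artifact terms, which this toy function (``simple_sdr_db``, a hypothetical name) does not do.

```python
import numpy as np

def simple_sdr_db(reference, estimate, eps=1e-12):
    """Energy ratio (in dB) between the reference and the estimation error."""
    error = reference - estimate
    return 10 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))

# Toy signals: a 5 Hz sine as reference, plus a 50 Hz distortion ten times smaller.
t = np.linspace(0, 1, 1000, endpoint=False)
s = np.sin(2 * np.pi * 5 * t)
s_hat = s + 0.1 * np.sin(2 * np.pi * 50 * t)

sdr = simple_sdr_db(s, s_hat)
print('simplified SDR (dB):', sdr)
```

All three measures are expressed in dB, and higher values indicate a better separation.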




In [None]:
import os
import shutil
import glob

import numpy as np
import matplotlib.pyplot as plt

import librosa

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from tqdm import tqdm
import pprint as pp
import json
import IPython

! pip install fast-bss-eval
import fast_bss_eval



# Parameters

In [None]:
use_colab = True
do_student = True

do_train = False
nb_epoch = 50

dataset_json = 'cc-mixter.json'
dataset_zip = 'dataset_ccmixter-mono-22k.zip'
subDIR = '/_ccmixter/'
dataset_subDIR = '/dataset_ccmixter-mono-22k/'
file_save_model = 'model-u-net'

# Set Google Drive

When you store data locally in Colab, these data are removed at the end of your session.
To keep your data permanently, you can connect Colab to your Google Drive and store the data there. This is done as follows.

In [None]:
from google.colab import drive
mnt_point = '/content/drive/'
drive.mount(mnt_point)
DIR = mnt_point + '/My Drive/_sound/' + subDIR

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


We now unzip the file containing the audio of the dataset.

In [None]:
if not os.path.isdir(DIR+dataset_subDIR):
  shutil.unpack_archive(DIR + dataset_zip, DIR)
print('number of files found:', len(glob.glob(DIR + dataset_subDIR + '*/*')))

number of files found: 264


# Load and prepare dataset

The dataset (the paths to the mix, instrumental and vocal audio files of each track) is described in a json file.
We load this file.

In [None]:
with open(DIR + dataset_json, 'r') as f:
    dataset_l = json.load(f)

# --- Replace old folder name by new ones
oldDIR = '/Users/peeters/_work/_sound/__music/__separation/_ccmixter/corpus/ccmixter-mono-22k/'
newDIR = DIR + dataset_subDIR
for data in dataset_l:
    data['mix'] = data['mix'].replace(oldDIR, newDIR)
    data['inst'] = data['inst'].replace(oldDIR, newDIR)
    data['vocal'] = data['vocal'].replace(oldDIR, newDIR)

pp.pprint(dataset_l[0:3])

[{'inst': '/content/drive//My '
          'Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/tmray_-_Forget_It_-_Demo/source-01.wav',
  'mix': '/content/drive//My '
         'Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/tmray_-_Forget_It_-_Demo/mix.wav',
  'vocal': '/content/drive//My '
           'Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/tmray_-_Forget_It_-_Demo/source-02.wav'},
 {'inst': '/content/drive//My '
          'Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/stellarartwars_-_Amy_Winehouse_Blues_(stems)/source-01.wav',
  'mix': '/content/drive//My '
         'Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/stellarartwars_-_Amy_Winehouse_Blues_(stems)/mix.wav',
  'vocal': '/content/drive//My '
           'Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/stellarartwars_-_Amy_Winehouse_Blues_(stems)/source-02.wav'},
 {'inst': '/content/drive//My '
          'Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/geertveneklaas_-_Blue_Boy/source-01.wav',
  'm

# Audio features

We write a function that will compute the STFT from an audio file and save its amplitude and phase into two separate ``.npy`` files.

Note that the function ``librosa.magphase`` actually returns the amplitude and the complex exponential of the phase ($A_X$ and $e^{j \Phi_X}$, such that $X=A_X \cdot e^{j \Phi_X}$).
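The decomposition described in this note can be checked on a toy complex matrix. The sketch below reproduces it with plain NumPy (it does not call ``librosa``); the random matrix and the random mask are placeholders for a real STFT and for a network output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complex "STFT" of shape (n_freq, n_frame).
X = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))

A_X = np.abs(X)                        # amplitude A_X
expiph_X = np.exp(1j * np.angle(X))    # complex exponential of the phase, modulus 1

# Multiplying both parts gives back the original complex STFT.
X_rebuilt = A_X * expiph_X

# The same product turns a masked amplitude into a complex STFT
# that keeps the phase of the mixture.
mask = rng.random((5, 4))              # placeholder for an estimated soft mask
S_hat = (mask * A_X) * expiph_X
```

Storing $e^{j \Phi_X}$ rather than the angle itself means the reconstruction is a single element-wise product.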

In [None]:

# --- In the original paper, the parameters correspond to
L_sec = 1024/8192; print('L_sec:', L_sec)
STEP_sec = 768/8192; print('STEP_sec:', STEP_sec)
print('path_sec:', STEP_sec*128)

# --- In our case we use the following parameters
win_size = 2*1024
hop_size = int(win_size/4)
# --- Those correspond to
sr_hz = 22050
L_sec = win_size/22050; print('L_sec:', L_sec)
STEP_sec = hop_size/22050; print('STEP_sec:', STEP_sec)
print('path_sec:', STEP_sec*128)


def F_stft(audio_filename):
    """
    description:
        compute the STFT, split it into amplitude and exp-phase and save them to two .npy files
        (if both files already exist, they are loaded instead of being recomputed)
    inputs:
        - audio_filename
    outputs:
        - am_stft_m (N/2+1, nb_frame): amplitude of the STFT
        - expiph_stft_m (N/2+1, nb_frame): complex exponential of the phase of the STFT
    """
    mag_filename = audio_filename + '.spec' + '.npy'
    phase_filename = audio_filename + '.phase' + '.npy'

    if not (os.path.isfile(mag_filename) and os.path.isfile(phase_filename)):
        if do_student:
            # --- START CODE HERE
            audio_v, sr = librosa.load(audio_filename, sr=sr_hz)
            audio_spectrum_m = librosa.stft(audio_v, n_fft=win_size, hop_length=hop_size)
            am_stft_m, expiph_stft_m = librosa.magphase(audio_spectrum_m)
            # --- STOP CODE HERE

        # cache the result so that the STFT is only computed once per file
        np.save(mag_filename, am_stft_m)
        np.save(phase_filename, expiph_stft_m)

    else:
        am_stft_m = np.load(mag_filename)
        expiph_stft_m = np.load(phase_filename)

    return am_stft_m, expiph_stft_m

L_sec: 0.125
STEP_sec: 0.09375
path_sec: 12.0
L_sec: 0.09287981859410431
STEP_sec: 0.023219954648526078
path_sec: 2.972154195011338


# Patches

As in the previous lab, we will split the big STFT matrix $W$ into a set of small matrices $A_m$ (we name those ```patches```), where each $A_m$ represents a temporal slice of the big matrix. 
The size of $A_m$ will be ```(nb_feature, size_of_patch)```.

$W$ will then be represented by a set of $A_m$. 
Each $A_m$ has the same number of rows (frequencies) as $W$ but only covers a patch/slice of the big matrix, starting at a different time $t$.

The two parameters we need to define are:
- the distance in frames between two successive patches ```patch_hop_frame```
- the half-width of the patch ```patch_halfduration_frame``` (a patch spans ```2*patch_halfduration_frame``` frames)

Given an audio file, the following function computes the start and end positions of each patch $A_m$ within $W$.

We will mostly re-use the function we wrote in the previous lab.
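As an illustration of the slicing, the sketch below lists the patch boundaries for a toy matrix of 20 frames. The helper name ``patch_bounds`` and the toy values (half-width 4, hop 2) are assumptions made for the example; the loop condition is the same as in the function of the next cell.

```python
def patch_bounds(nb_frame, half, hop):
    """Return the (start, stop) frame indices of each patch; stop is exclusive."""
    bounds = []
    middle = half
    # A patch is kept only if it ends strictly before the last frame.
    while middle + half < nb_frame:
        bounds.append((middle - half, middle + half))
        middle += hop
    return bounds

bounds = patch_bounds(nb_frame=20, half=4, hop=2)
print(bounds)
# [(0, 8), (2, 10), (4, 12), (6, 14), (8, 16), (10, 18)]
```

Each patch is ``2*half`` frames wide and successive patches overlap by ``2*half - hop`` frames. With the strict inequality, the last frames of a file may not be covered by any patch.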


In [None]:
patch_halfduration_frame = 64
patch_hop_frame = int(patch_halfduration_frame/2)

def F_slice_into_patches(patch_hop_frame, patch_halfduration_frame, patch_info_l, nb_frame, idx_file):
    """
    create the structure storing the patch-based slices of the spectrogram
    """
    middle_frame = patch_halfduration_frame
    while middle_frame + patch_halfduration_frame < nb_frame:
        if do_student:
            # --- START CODE HERE
            start_frame = middle_frame-patch_halfduration_frame
            stop_frame = middle_frame+patch_halfduration_frame
            # --- STOP CODE HERE
        
        patch_info_d = {'idx_file': idx_file, 
                        'start_frame': start_frame, 
                        'middle_frame': middle_frame, 
                        'end_frame': stop_frame}
        patch_info_l.append(patch_info_d)

        middle_frame += patch_hop_frame
    return patch_info_l

# Dataset

You will write a torch ``dataset`` such that when calling ``__getitem__`` you will receive 
- one patch of $x$ (a patch of the amplitude STFT of the mixed audio) and 
- the corresponding ground-truth patch $y$ (the corresponding patch of the amplitude STFT of the ground-truth isolated vocal).

**Note**: to speed-up the whole process you can store directly (when calling the ``__init__`` method) all the data in the memory of the GPU, using ``.cuda()``. Don't forget to first convert your data to ``.float`` since the model will use ``float`` values.

**Note**: to help you, you can take a look at the code of the ``dataset`` provided for the previous lab on Magna-tag-a-tune.

In [None]:
dataset_l[0]

{'inst': '/content/drive//My Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/tmray_-_Forget_It_-_Demo/source-01.wav',
 'mix': '/content/drive//My Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/tmray_-_Forget_It_-_Demo/mix.wav',
 'vocal': '/content/drive//My Drive/_sound//_ccmixter//dataset_ccmixter-mono-22k/tmray_-_Forget_It_-_Demo/source-02.wav'}

In [None]:
class SpectrogramDataset(Dataset):
    
    def __init__(self, dataset_l, do_train):
        
        if do_train: dataset_l = [dataset_l[idx] for idx in range(len(dataset_l)) if idx % 10 != 0]
        else:        dataset_l = [dataset_l[idx] for idx in range(len(dataset_l)) if idx % 10 == 0]
        
        self.dataset_l = dataset_l
        self.patch_info_l = []
        self.data_d = {}

        self.mix_filename_l = [data['mix'] for data in dataset_l]
        self.vocal_filename_l = [data['vocal'] for data in dataset_l]

        if do_student:
            # --- START CODE HERE
            for idx_file in tqdm(range(len(dataset_l)),desc='Load song'):
              mix_am, _ = F_stft(self.mix_filename_l[idx_file])
              vocal_am, _ = F_stft(self.vocal_filename_l[idx_file])
              nb_frame = mix_am.shape[1]
              self.patch_info_l = F_slice_into_patches(patch_hop_frame, patch_halfduration_frame, self.patch_info_l, nb_frame, idx_file)
              self.data_d[idx_file] = {'mix_am':torch.from_numpy(mix_am).float().cuda(), "vocal_am":torch.from_numpy(vocal_am).float().cuda()}
            # --- STOP CODE HERE
        

    def __len__(self):
        return len(self.patch_info_l)

    def __getitem__(self, idx_patch):
        """
        outputs:
            - 'x'(mix_am_stft_m) [N/2+1, size_of_patch] one patch of mixed amplitude STFT
            - 'y'(vocal_am_stft_m) [N/2+1, size_of_patch] the corresponding patch of isolated amplitude STFT
        """
        if do_student:
            # --- START CODE HERE
            idx_song = self.patch_info_l[idx_patch]['idx_file']
            mix_am_stft_m = self.data_d[idx_song]['mix_am']
            vocal_am_stft_m = self.data_d[idx_song]['vocal_am']
            #Get the patch of mix and vocal
            mix_am_stft_m = mix_am_stft_m[:,self.patch_info_l[idx_patch]['start_frame']:self.patch_info_l[idx_patch]['end_frame']]
            vocal_am_stft_m = vocal_am_stft_m[:,self.patch_info_l[idx_patch]['start_frame']:self.patch_info_l[idx_patch]['end_frame']]

            # --- STOP CODE HERE
        
        return {'x':mix_am_stft_m, 'y':vocal_am_stft_m}

We instantiate the class to get a training set and a test set.

**Note**: the first time you run this code, it can take some time since it has first to compute the STFT of all the files.

In [None]:
train_set = SpectrogramDataset(dataset_l=dataset_l, do_train=True)
test_set = SpectrogramDataset(dataset_l=dataset_l, do_train=False)

Load song: 100%|██████████| 45/45 [01:20<00:00,  1.78s/it]
Load song: 100%|██████████| 5/5 [00:02<00:00,  1.89it/s]


In [None]:
train_set[0]

{'x': tensor([[7.4227e-03, 2.7048e-02, 7.3964e+00,  ..., 9.7261e-02, 1.4050e-01,
          1.3435e-01],
         [3.7978e-03, 3.4408e-02, 7.3077e+00,  ..., 1.5772e-01, 3.7793e-02,
          8.7790e-02],
         [4.4469e-04, 3.4279e-02, 7.2088e+00,  ..., 3.9509e-01, 2.3003e-01,
          2.5028e-01],
         ...,
         [1.5805e-04, 3.2140e-02, 2.1280e-01,  ..., 5.3895e-03, 7.3881e-03,
          4.5101e-03],
         [2.7755e-04, 3.3521e-02, 1.5277e-01,  ..., 6.1820e-03, 1.2567e-02,
          8.3427e-03],
         [3.1129e-04, 3.3628e-02, 1.1065e-01,  ..., 1.2627e-02, 1.4889e-02,
          4.7268e-03]], device='cuda:0'),
 'y': tensor([[6.3131e-03, 6.2675e-03, 3.6780e-03,  ..., 1.2991e-01, 2.9494e-02,
          6.2350e-04],
         [2.8998e-03, 3.6636e-03, 3.4219e-03,  ..., 2.1631e-01, 9.7165e-02,
          7.8036e-02],
         [3.4405e-04, 5.9297e-04, 1.8479e-03,  ..., 3.7471e-01, 1.7147e-01,
          3.1662e-01],
         ...,
         [1.4152e-04, 6.4348e-05, 4.6063e-04,  ..., 

In [None]:
print('nb_path(train_set):', len(train_set))
print('nb_path(test_set):', len(test_set))
print('shape train:', train_set[0]['x'].shape)
print('shape test:', train_set[0]['y'].shape)

nb_path(train_set): 14513
nb_path(test_set): 886
shape train: torch.Size([1025, 128])
shape test: torch.Size([1025, 128])


You should get the following values

```
nb_path(train_set): 14513
nb_path(test_set): 886
shape train: torch.Size([1025, 128])
shape test: torch.Size([1025, 128])
```

# Dataloader

We then create the ``Dataloader`` for the training set and for the test set.

In [None]:
train_loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=0)
test_loader = DataLoader(test_set, batch_size=8, shuffle=False, num_workers=0)

In [None]:
one_mini_batch = next(iter(train_loader))
one_mini_batch['x'].size()

torch.Size([8, 1025, 128])

You should get the following value
```
torch.Size([8, 1025, 128])
```

# DNN model: U-Net

We now define the architecture of our U-Net, which estimates the soft mask.

The network takes as input a patch/slice of the magnitude STFT and outputs a mask of the same dimension, which will later be used to perform the separation. The model summary below uses an input of size (513 frequencies, 128 frames); with the STFT parameters chosen above (``win_size = 2*1024``), the patches of this notebook have size (1025 frequencies, 128 frames).

The network has the following architecture (see the figure below), where ``16*(5,5)`` denotes 16 filters of size (5,5), ``s`` denotes stride, ``p`` padding, ``BN`` batch-normalization over feature-maps, and ``in`` gives (as information) the expected input depth.

<img src="https://perso.telecom-paristech.fr/gpeeters/doc/Lab_DL_Source-Separation-U-Net.png">
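The spatial sizes along the encoder and decoder can be checked by hand with the usual output-size formulas of strided and transposed convolutions. The sketch below assumes kernel 5, stride 2 and padding 2 everywhere, as in the cells of this notebook, and an ``output_padding`` of 0 along frequency and 1 along time for the transposed convolutions; the helper names are hypothetical.

```python
def conv_out(n, kernel=5, stride=2, padding=2):
    """Output length of a strided convolution (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1

def convT_out(n, kernel=5, stride=2, padding=2, output_padding=0):
    """Output length of a transposed convolution (dilation 1)."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

def chain(n, fn, nb_layers=6, **kwargs):
    """Apply the same size formula nb_layers times and keep every intermediate size."""
    sizes = []
    for _ in range(nb_layers):
        n = fn(n, **kwargs)
        sizes.append(n)
    return sizes

enc_freq = chain(513, conv_out)                        # frequency axis, encoder
enc_time = chain(128, conv_out)                        # time axis, encoder
dec_freq = chain(9, convT_out, output_padding=0)       # frequency axis, decoder
dec_time = chain(2, convT_out, output_padding=1)       # time axis, decoder
print(enc_freq, enc_time, dec_freq, dec_time)
```

The frequency axis has an odd size (513) and the time axis an even one (128), which is why the two axes need a different ``output_padding`` for the decoder sizes to match the encoder sizes used in the skip-connections.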

In [None]:
class Encoder(nn.Module):
  def __init__(self):
    super(Encoder, self).__init__()
    self.channels = (1,16,32,64,128,256,512)
    self.nb_layers = len(self.channels)-1
    self.encoder_layers = nn.ModuleList([nn.Conv2d(self.channels[i], self.channels[i+1],
                                                   kernel_size=5, stride=2, padding=2) for i in range(len(self.channels)-1)])
    self.BN_layers = nn.ModuleList([nn.BatchNorm2d(self.channels[i+1]) for i in range(len(self.channels)-1)])
    self.LRelu = nn.LeakyReLU(0.2)

  def forward(self, x_in):
    # features[0] is the network input; features[i] is the output of encoder layer i
    features = [x_in]
    x = x_in
    for i in range(self.nb_layers):
      x = self.encoder_layers[i](x)
      x = self.BN_layers[i](x)
      x = self.LRelu(x)
      features.append(x)
    return features


class Decoder(nn.Module):
  def __init__(self):
    super(Decoder, self).__init__()
    self.channels=(256, 128, 64, 32, 16)
    self.nb_layers = len(self.channels)-1
    self.decoder_first_layer = nn.ConvTranspose2d(512, 256,
                                                   kernel_size=5, stride=2, padding=2, output_padding=(0, 1))
    self.decoder_layers = nn.ModuleList([nn.ConvTranspose2d(self.channels[i]*2, self.channels[i+1],
                                                   kernel_size=5, stride=2, padding=2, output_padding=(0, 1)) for i in range(len(self.channels)-1)])
    self.last_first_layer= nn.ConvTranspose2d(32, 1,
                                                   kernel_size=5, stride=2, padding=2, output_padding=(0, 1))
    self.BN_first_layer = nn.BatchNorm2d(256)
    self.BN_layers = nn.ModuleList([nn.BatchNorm2d(self.channels[i+1]) for i in range(len(self.channels)-1)])
    self.relu = nn.ReLU()
    self.sigmoid = nn.Sigmoid()
    self.dropout = nn.Dropout(p=0.5)

  def forward(self, x_in, encoder_feature):
    #print("------------")
    #print("Start Decoder")
    x = x_in
    x = self.decoder_first_layer(x)
    x = self.BN_first_layer(x)
    x = self.relu(x)
    x = self.dropout(x)
    #print(x.shape)
    #print(encoder_feature[i].shape)
    x = torch.cat([x, encoder_feature[0]], dim=1)

    for i in range(self.nb_layers):
      #print("Shape input:", x.shape)
      x = self.decoder_layers[i](x)
      x = self.BN_layers[i](x)
      x = self.relu(x)
      x = self.dropout(x)
      #print(x.shape)
      #print(encoder_feature[i+1].shape)
      x = torch.cat([x, encoder_feature[i+1]], dim=1)
    
    x = self.last_first_layer(x)
    x = self.sigmoid(x)
    #print("Decoder ok")
    return x

In [None]:
class UNet(nn.Module):
    def __init__(self):
        super(UNet, self).__init__()

        if do_student:
            # --- START CODE HERE
            self.encoder = Encoder()
            self.decoder = Decoder()
            # --- STOP CODE HERE
        
    def forward(self, mix):
        if do_student:
            # --- START CODE HERE
            enc_features = self.encoder(mix)
            out = self.decoder(enc_features[::-1][0], enc_features[::-1][1:] )
            #print("Last layer")
            # compute by multiply with input
            #print("Mask size:", out.shape)
            #out = torch.mul(enc_features[0], out)
            #print("y_hat size:", out.shape)
            # --- STOP CODE HERE
        
        return out

### Test

We instantiate the model, send it to the GPU and display its summary.
You should get the following output.

In [None]:
up_conv = nn.ConvTranspose2d(512, 256, (5,5), stride =2, padding =2, output_padding=(0, 1))
input = torch.randn(1, 512, 9, 2)
x = up_conv(input)
print(x.shape)

torch.Size([1, 256, 17, 4])


In [None]:
# check the output size of a strided (down-sampling) convolution
down_conv = nn.Conv2d(256, 512, 5, stride=2, padding=2)
x_in = torch.randn(1, 256, 17, 4)
x = down_conv(x_in)
print(x.shape)

torch.Size([1, 512, 9, 2])


In [None]:
model = UNet().cuda()

# --- Display the structure of the model
from torchsummary import summary
summary(model, input_size=(1, 513, 128))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1          [-1, 16, 257, 64]             416
       BatchNorm2d-2          [-1, 16, 257, 64]              32
         LeakyReLU-3          [-1, 16, 257, 64]               0
            Conv2d-4          [-1, 32, 129, 32]          12,832
       BatchNorm2d-5          [-1, 32, 129, 32]              64
         LeakyReLU-6          [-1, 32, 129, 32]               0
            Conv2d-7           [-1, 64, 65, 16]          51,264
       BatchNorm2d-8           [-1, 64, 65, 16]             128
         LeakyReLU-9           [-1, 64, 65, 16]               0
           Conv2d-10           [-1, 128, 33, 8]         204,928
      BatchNorm2d-11           [-1, 128, 33, 8]             256
        LeakyReLU-12           [-1, 128, 33, 8]               0
           Conv2d-13           [-1, 256, 17, 4]         819,456
      BatchNorm2d-14           [-1, 256

In [None]:
# --- Test the (un-trained) model on a single data to check that the dimensions are OK
x = one_mini_batch['x'][:,None,:,:]
y = one_mini_batch['y']
mask = model( x )
print('mask size:', mask.size())
y_hat = mask * x
print('y_hat size:', y_hat.size())
print('y_hat size:', y_hat.squeeze().size())
print('y size:', y.size())

mask size: torch.Size([8, 1, 1025, 128])
y_hat size: torch.Size([8, 1, 1025, 128])
y_hat size: torch.Size([8, 1025, 128])
y size: torch.Size([8, 1025, 128])


# Define score

In [None]:
class Score():
    def __init__(self):
        self.loss_l = []
    def add(self, loss):
        self.loss_l.append( float(loss) )
    def print(self, text, num_epoch):
        print('{} Epoch:{} \t loss:{:.3f}'.format(text, num_epoch, np.mean(self.loss_l)))
    def get_summary(self):
        output = [np.mean(self.loss_l)]
        return output
    def get_name(self):
        output = ['loss']
        return output
    def plot_curve(self, score_train_l, score_test_l):
        score_name_l = self.get_name()
        plt.figure(figsize=(16,4))
        nb_score = len(score_train_l[0])
        for num_score in range(nb_score):
            store_train_l, store_test_l = [], []
            for score_train in score_train_l:
                store_train_l.append(score_train[num_score])
            for score_test in score_test_l:  
                store_test_l.append(score_test[num_score])
            plt.subplot(1, nb_score, num_score+1)
            plt.plot(store_train_l, 'g', store_test_l, 'r')
            plt.legend(['train', 'test'])
            plt.grid(True), plt.xlabel('# Epoch'); plt.ylabel(score_name_l[num_score])

# Train the model

To train the model, we still need to define a loss to be minimized (L1-norm) and an optimizer (Adam with a learning rate of 1e-3).

In [None]:
if do_student:
    # --- START CODE HERE
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # --- STOP CODE HERE

As usual, we need to write the code for the **training** (using ``train_loader``):
- send the input $X$ to the model
- get the output estimated mask $M$
- multiply the estimated mask $M$ by the input $X$ to get the isolated source $\hat{Y}$
- compute the loss between $\hat{Y}$ and the ground-truth $Y$
- do the back-propagation, 
- update the parameters.

We do the same for the **testing** (using ``test_loader``).
- same as above without back-propagation and parameter updates.
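One point worth checking when implementing these steps is that $\hat{Y}$ and $Y$ have exactly the same shape before the loss is computed. The model works on tensors with a channel axis, ``[B, 1, F, T]``, whereas the ground truth returned by the ``dataset`` has shape ``[B, F, T]``. The sketch below uses NumPy with toy sizes to show what broadcasting does in that case; PyTorch follows the same broadcasting rules.

```python
import numpy as np

B, F, T = 2, 3, 4                    # toy batch size, frequencies, frames

y_hat = np.zeros((B, 1, F, T))       # mask * x, with the channel axis kept
y = np.ones((B, F, T))               # ground truth, without channel axis

# Mismatched shapes: the difference is silently broadcast to (B, B, F, T),
# i.e. every estimate is compared with every target of the batch.
diff_broadcast = y_hat - y

# Matching shapes: either drop the channel axis of y_hat or add one to y.
diff_squeezed = y_hat[:, 0] - y
diff_expanded = y_hat - y[:, None]

print(diff_broadcast.shape, diff_squeezed.shape, diff_expanded.shape)
```

A loss computed on the broadcast difference still returns a number, so the problem does not raise an error; it can only be seen by inspecting the shapes.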

In [None]:
if do_train:
    score_train_l = []
    score_test_l = []
    for num_epoch in range(nb_epoch):    
        
        # --- Train
        model.train()
        my_train_score = Score()
        for batch_idx, batch in enumerate(tqdm(train_loader)):
            if do_student:
                # --- START CODE HERE
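                model.zero_grad()
                x = batch['x'][:,None,:,:].cuda()
                # add the channel axis to y so that y_hat and y have the same shape [B,1,F,T];
                # otherwise [B,1,F,T] and [B,F,T] are broadcast to [B,B,F,T] inside the loss
                y = batch['y'][:,None,:,:].cuda()
                mask = model(x)
                y_hat = mask * x
                loss = criterion(y_hat, y)
                loss.backward()
                optimizer.step()
                my_train_score.add(loss)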
                # --- STOP CODE HERE

        my_train_score.print('Train', num_epoch)
        score_train_l.append( my_train_score.get_summary() )
        
            
        # --- Test
        model.eval()
        my_test_score = Score()
        # no gradient is needed for the evaluation
        with torch.no_grad():
            for batch_idx, batch in enumerate(test_loader):
                if do_student:
                    # --- START CODE HERE
                    x = batch['x'][:,None,:,:].cuda()
                    # add the channel axis to y so that y_hat and y have the same shape [B,1,F,T]
                    y = batch['y'][:,None,:,:].cuda()
                    mask = model(x)
                    y_hat = mask*x
                    loss = criterion(y_hat, y)
                    my_test_score.add(loss)
                    # --- STOP CODE HERE

        my_test_score.print('Test', num_epoch) 
        score_test_l.append(my_test_score.get_summary())
        
    torch.save(model, DIR + file_save_model)
else:
    model = torch.load(DIR + file_save_model)
    model.eval()

In [None]:
torch.save(model.state_dict(), DIR+"model_save")

## Display training/test curves

We now display the loss for the training and test set.

In [None]:
my_train_score.plot_curve(score_train_l, score_test_l)

NameError: ignored

<img src="https://i.ibb.co/4gHTd2C/Whats-App-Image-2022-03-08-at-03-08-48.jpg">

I checked the model and data loader but after 50 epochs, the loss decreases very slowly. The model needs a lot of time to train.

After 50 epochs, the training and test loss should be
```
Train Epoch:49 	 loss:0.184
Test Epoch:49 	 loss:0.309
```

# Evaluating the model using SDR, SIR, SAR

So far, we have minimized an L1-loss and tested the model using the same L1-loss. 
This is fine for optimization; however, it does not tell us about the quality of the BASS algorithm we have developed.

To evaluate this aspect, we will use the standard SDR, SIR and SAR performance measures. 
To compute those, we first need to get the audio signals which correspond to our separation.
We do this by applying our model to all the possible patches of a given test file; from the masked output (combined with the phase of the mixed signal), we get the audio using the STFT$^{-1}$ algorithm.

The function below first loads (or computes) the magnitude and phase of the STFT of the file, using ``F_stft``.
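Since the patches overlap, the per-patch outputs have to be reassembled into a single spectrogram before the inverse STFT. One possible strategy, sketched below on a toy matrix, keeps only the first ``hop`` frames of every patch except the last one, which is kept entirely. The helper name ``stitch_patches`` and the toy sizes are assumptions; averaging the overlapping frames would be an alternative.

```python
import numpy as np

def stitch_patches(patches, hop):
    """
    patches: array (nb_patch, n_freq, width) of overlapping patches, hop frames apart
    returns: array (n_freq, (nb_patch - 1) * hop + width)
    """
    heads = [patch[:, :hop] for patch in patches[:-1]]
    return np.concatenate(heads + [patches[-1]], axis=1)

# Toy "spectrogram" of 3 frequencies and 20 frames, cut into patches of width 8, hop 2.
full = np.arange(3 * 20, dtype=float).reshape(3, 20)
width, hop = 8, 2
starts = range(0, full.shape[1] - width, hop)
patches = np.stack([full[:, s:s + width] for s in starts])

rebuilt = stitch_patches(patches, hop)
print(patches.shape, rebuilt.shape)
```

The rebuilt matrix can be shorter than the original one, because the last frames may not be covered by any patch; the phase matrix then has to be truncated to the same number of frames.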

In [None]:
def F_get_separation_audio(mix_audio_filename):
    """
    description:
        perform BASS on the mixed audio file mix_audio_filename using 
        - the trained U-Net model 
        - the audio reconstruction using the estimated magnitude STFT, the original phase and the STFT-1 algorithm
    inputs:
        - mix_audio_filename: path to the mixed audio file
    outputs:
        - hatvocal_audio_v: separated audio signal as np.array
    """

    if do_student:
        # --- START CODE HERE
        #audio_v, phase = F_stft(mix_audio_filename)
        #mag_m = hdf5_fid[mix_audio_filename + '/ampl/']
        #phase_m = hdf5_fid[mix_audio_filename + '/expiph/']

        # We split mag_m to patchs and then pass it through our model.
        mag_m, phase_m = F_stft(mix_audio_filename)
        patch_info_file = []
        patch_info_file = F_slice_into_patches(patch_hop_frame, patch_halfduration_frame, patch_info_file, mag_m.shape[1], 0)
        to_predict = []
        for i in range(len(patch_info_file)):
          to_predict.append(mag_m[:,patch_info_file[i]['start_frame']:patch_info_file[i]['end_frame']])
        to_predict = np.array(to_predict)
        to_predict = torch.from_numpy(to_predict).float().cuda()
        to_predict = to_predict[:,None,:,:]
        #Make prediction.
        with torch.no_grad():
          mask = model(to_predict)
          y_hat = mask*to_predict
        #print(mag_m.shape)
        #print(y_hat.shape)
        y_hat = y_hat.detach().cpu().numpy()
        torch.cuda.empty_cache() # Free up memory of GPU after got prediction, else GPU will run out of memory.
        y_hat = y_hat.squeeze()
        #print(y_hat.shape)

        #We reconstruct the entire song by concatenate from patchs. (attention: patchs are overlaped together)
        mag_voice_predicted = y_hat[0,:,:patch_hop_frame]
        for i in range(1, y_hat.shape[0]-1):
          mag_voice_predicted = np.concatenate((mag_voice_predicted, y_hat[i,:,:patch_hop_frame]), axis=1)
        mag_voice_predicted = np.concatenate((mag_voice_predicted, y_hat[-1,:,:]), axis=1)
        phase_equal_length = phase_m[:,:mag_voice_predicted.shape[1]]
        vocal_audio_predicted_stft = mag_voice_predicted*phase_equal_length
        hatvocal_audio_v = librosa.istft(vocal_audio_predicted_stft, hop_length=hop_size)
        # --- STOP CODE HERE
    
    return hatvocal_audio_v
    
    

In [None]:
def F_get_SDR_SIR_SAR_onefile(data_d, do_display=False):
    """
    description:
        Compute the BASS standard performance measures SDR, SIR, SAR
    inputs:
        - data_d['mix', 'vocal', 'inst']: file paths to the mix, isolated vocal and isolated instrumental
    outputs:
        - sdr, sir, sar
    """
    mix_audio_v, sr_hz = librosa.load(data_d['mix'])
    vocal_audio_v, sr_hz = librosa.load(data_d['vocal'])
    inst_audio_v, sr_hz = librosa.load(data_d['inst'])
    
    hatvocal_audio_v = F_get_separation_audio(data_d['mix'])
    
    L = min(len(vocal_audio_v), len(hatvocal_audio_v))
    hatinst_audio_v = mix_audio_v[:L] - hatvocal_audio_v[:L]
    orig = np.array([vocal_audio_v[:L], inst_audio_v[:L]])
    pred = np.array([hatvocal_audio_v[:L], hatinst_audio_v])
    sdr, sir, sar, perm = fast_bss_eval.bss_eval_sources(orig, pred)
    
    
    if do_display:
        print('SDR: {}, SIR: {}, SAR: {}'.format(sdr, sir, sar))
        S = int(30*sr_hz)
        E = int((30+10)*sr_hz)
        IPython.display.display(IPython.display.Audio(data=mix_audio_v[S:E], rate=sr_hz))
        IPython.display.display(IPython.display.Audio(data=vocal_audio_v[S:E], rate=sr_hz))
        IPython.display.display(IPython.display.Audio(data=hatvocal_audio_v[S:E], rate=sr_hz))
    
    return sdr[0], sir[0], sar[0]

In [None]:
def F_get_SDR_SIR_SAR_allfile(set_l):
    L = len(set_l)
    sdr_l, sir_l, sar_l = np.zeros(L), np.zeros(L), np.zeros(L)
    for idx in range(L):
        sdr_l[idx], sir_l[idx], sar_l[idx] = F_get_SDR_SIR_SAR_onefile(set_l[idx])

    print('sdr:', np.mean(sdr_l))
    print('sir:', np.mean(sir_l))
    print('sar', np.mean(sar_l))

In [None]:
# --- Train-set
trainset_l = [dataset_l[idx] for idx in range(len(dataset_l)) if idx % 10 != 0]
F_get_SDR_SIR_SAR_allfile(trainset_l)
# --- Test-set
testset_l = [dataset_l[idx] for idx in range(len(dataset_l)) if idx % 10 == 0]
F_get_SDR_SIR_SAR_allfile(testset_l)


sdr: -5.935923739274343
sir: -2.6791162623299494
sar 12.017743746439615
sdr: -6.936703252792358
sir: -2.2409851789474486
sar -0.6179231762886047


You should get the following results (for the train and test resp.):
```
sdr: 7.730700800153945
sir: 17.16172769334581
sar 8.515425576104057

sdr: 0.9553127944469452
sir: 9.847758293151855
sar 2.2173049688339233
```

## Listen to the results on one example

In [None]:
F_get_SDR_SIR_SAR_onefile(testset_l[3], True);

SDR: [-5.8952613  7.334742 ], SIR: [-0.8102182  7.3409743], SAR: [-0.8488736 36.504757 ]
