In [26]:
################## IMPORT LIBRARIES ##################
import soundfile as sf
from IPython.display import Audio
import numpy as np
import random 
import pandas as pd
import os
from os.path import join as pjoin
import tqdm
from acoustics.bands import third
import scipy.signal as sig
# from masp import shoebox_room_sim as srs
from scipy.io import wavfile
#import mat73
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (4, 3)
import torch
import sklearn
import speechmetrics 

def summarize_eval(datapath, exptag, b_save=False):
    """Display per-condition means of the evaluation metrics for one experiment."""
    resultsdir = pjoin(datapath, exptag)
    # load the data frame containing evaluation metrics computed for each batch
    metricsvals = pd.read_csv(pjoin(resultsdir, "eval_all.csv"))
    # compute the mean of each measure across all testing batches, per condition label
    metricsvals_groups = metricsvals.groupby("label").mean().reset_index()
    display(metricsvals_groups)
    if b_save:
        # save the summary table as CSV in the current working directory
        metricsvals_groups.to_csv(exptag + "_evalsummary.csv")

In [27]:
datapath="/media/ssd2/RESULTS-reverb-match-cond-u-net/"
# datapath="/home/ubuntu/Data/RESULTS-reverb-match-cond-u-net"

###  <font color='fuchsia'>Compare training results: 18-11-2023 (conditions with different hyperparameter combinations) </font>

In [28]:
summarize_eval(datapath,"runs-exp-18-11-2023", b_save=False)

Unnamed: 0,label,nb_pesq_input,pesq_input,stoi_input,stftloss_input,nb_pesq_predict,pesq_predict,stoi_predict,stftloss_predict,styleloss_input,styleloss_predict
0,many-to-many_lr-0.0001_bs-24,2.792605,1.998135,0.866868,1.049151,2.854526,2.125407,0.879407,0.970231,0.998927,0.999879
1,many-to-many_lr-0.0001_bs-8,2.789234,1.99326,0.866119,1.049633,2.838643,2.094338,0.87327,0.992471,0.998654,0.99988
2,many-to-many_lr-0.001_bs-24,2.789005,1.991656,0.866514,1.050049,2.835408,2.085583,0.877903,1.000845,0.998617,0.999801
3,many-to-many_lr-0.001_bs-8,2.786195,1.991779,0.866549,1.050822,2.839366,2.094184,0.877632,0.999263,0.998437,0.999724
4,many-to-many_lr-1e-05_bs-24,2.788024,1.99015,0.866709,1.050695,2.801345,2.021275,0.860672,1.107349,0.999989,0.999988
5,many-to-many_lr-1e-05_bs-8,2.787288,1.993435,0.86623,1.050749,2.839811,2.103157,0.876933,1.009374,0.999883,0.999972
6,one-to-many_lr-0.0001_bs-24,2.470913,1.750308,0.900779,1.103646,2.78399,2.094574,0.90414,0.913602,0.997124,0.999694
7,one-to-many_lr-0.0001_bs-8,2.472859,1.747066,0.900035,1.102275,2.793292,2.116918,0.906913,0.892252,0.997353,0.999779
8,one-to-many_lr-0.001_bs-24,2.472379,1.748528,0.900662,1.101645,2.572778,1.831943,0.855065,1.005086,0.996392,0.99971
9,one-to-many_lr-0.001_bs-8,2.467431,1.742308,0.901678,1.103012,2.709755,2.009166,0.901923,0.907714,0.993417,0.999804


STFT Loss: 

- STFT loss measures how close two audio signals are to each other; minimizing it is one objective of our network.
- For almost all training parameters, the signal transformed by the trained network is closer to the target than the input signal is.
- The average input STFT loss is lower for many-to-many than for one-to-many. This is because in many-to-many the input signal already has some amount of reverb, which makes it initially closer to the target.
- For many-to-many the best parameters are lr=1e-4 and bs=24, and the worst are lr=1e-5 and bs=24 (the transformation moves the signal further from the target!)
- For one-to-many the best parameters are lr=1e-4 and bs=8, and the worst are lr=0.001 and bs=24
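
As a rough illustration of what such a loss measures, the sketch below computes one common STFT-based distance (spectral convergence plus a log-magnitude term) with NumPy. It is an assumption for illustration only: the notebook does not state which STFT loss variant, window sizes, or weights were used in training, and the names `stft_magnitude` and `stft_loss` are hypothetical. The signals are random noise, not data from the experiments.

```python
import numpy as np

def stft_magnitude(x, n_fft=512, hop=128):
    """Magnitude spectrogram (frames x frequency bins) with a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def stft_loss(target, estimate, n_fft=512, hop=128, eps=1e-8):
    """Assumed loss form: spectral convergence + mean absolute log-magnitude error."""
    mag_t = stft_magnitude(target, n_fft, hop)
    mag_e = stft_magnitude(estimate, n_fft, hop)
    spectral_convergence = np.linalg.norm(mag_t - mag_e) / (np.linalg.norm(mag_t) + eps)
    log_magnitude = np.mean(np.abs(np.log(mag_t + eps) - np.log(mag_e + eps)))
    return spectral_convergence + log_magnitude

# Synthetic signals (hypothetical): a target, a near copy, and an unrelated signal
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
close_estimate = target + 0.01 * rng.standard_normal(16000)
far_estimate = rng.standard_normal(16000)

loss_same = stft_loss(target, target)
loss_close = stft_loss(target, close_estimate)
loss_far = stft_loss(target, far_estimate)
print(loss_same, loss_close, loss_far)
```

In this sketch a lower value means the two signals have more similar spectrograms, which matches how `stftloss_input` and `stftloss_predict` are read in the table.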

STYLE SIMILARITY: 

- Style similarity is the cosine similarity loss between the style embeddings of two audio signals
- The style similarity depends on the trained model, which is why the values differ between conditions for both the input and the prediction.
- The objective of the training is to obtain a transformation which changes the style of the signal so that it is closer to the target at the output than at the input. This objective is fulfilled for most of the training parameters. If there is no difference in style before and after the transformation, then it probably means the encoder has not been trained properly.
- For many-to-many the best parameters (leading to the biggest difference) are lr=0.001 and bs=8, and the worst are lr=1e-5 and bs=24
- For one-to-many the best parameters are lr=0.001 and bs=8. The table above contains no one-to-many run with lr=1e-5; among the listed runs, the smallest difference is for lr=1e-4 and bs=8
- This measure is not very informative, as it depends on both the room encoding network and the style transfer network
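
The comparison described above can be sketched as follows. This is a minimal example, assuming the style embeddings are plain vectors; the three embedding vectors and the names `cosine_similarity` and `style_gain` are hypothetical and do not come from the trained encoder.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two embedding vectors (1 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Hypothetical style embeddings of the target, the input, and the prediction
target_emb = np.array([0.2, 0.9, 0.4])
input_emb = np.array([0.5, 0.7, 0.1])
predicted_emb = np.array([0.25, 0.85, 0.4])

sim_input = cosine_similarity(target_emb, input_emb)
sim_predict = cosine_similarity(target_emb, predicted_emb)

# A positive gain means the prediction is closer in style to the target than the input was
style_gain = sim_predict - sim_input
print(sim_input, sim_predict, style_gain)
```

The "difference" used to rank conditions above corresponds to `style_gain` here, that is, `styleloss_predict - styleloss_input` in the table.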

PESQ SIMILARITY: 

- PESQ similarity is a similarity measure based on the PESQ score, which is typically used to compare clean and enhanced signals. Typically, the closer the enhanced signal is to the clean signal, the higher the PESQ value, which indicates better performance of the enhancement algorithm. Here, we use this measure to compare PESQ(target signal, input signal) with PESQ(target signal, predicted signal). If our approach works, then PESQ(target signal, predicted signal) is higher than PESQ(target signal, input signal).
- For many-to-many the best parameters are lr=1e-4 and bs=24, and the worst are lr=1e-5 and bs=24 (same as STFT loss)
- For one-to-many the best parameters are lr=1e-4 and bs=8, and the worst are lr=0.001 and bs=24 (same as STFT loss)
- In the Korean study, PESQ goes from 2.643 to 3.105 for the best model

STOI SIMILARITY: 

- STOI similarity is a similarity measure based on the STOI score, which is typically used to compare clean and enhanced signals. Typically, the closer the enhanced signal is to the clean signal, the higher the STOI value, which indicates better performance of the enhancement algorithm. Here, we use this measure to compare STOI(target signal, input signal) with STOI(target signal, predicted signal). If our approach works, then STOI(target signal, predicted signal) is higher than STOI(target signal, input signal).
- For many-to-many the best parameters are lr=1e-4 and bs=24, and the worst are lr=1e-5 and bs=24 (same as STFT loss)
- For one-to-many the best parameters are lr=1e-4 and bs=8, and the worst are lr=0.001 and bs=24 (same as STFT loss)
- In the Korean study, STOI goes from 78% to 80.92% for the best model
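
The best and worst conditions listed for STFT loss, PESQ and STOI can be read off the summary table programmatically. The sketch below copies the `stftloss_predict`, `pesq_predict` and `stoi_predict` columns from the 18-11-2023 table into a DataFrame; splitting the label into `mode`, `lr` and `bs` columns and the helper `best_and_worst` are assumptions made for this example.

```python
import pandas as pd

# (mode, lr, bs, stftloss_predict, pesq_predict, stoi_predict) from the table above
rows = [
    ("many-to-many", 1e-4, 24, 0.970231, 2.125407, 0.879407),
    ("many-to-many", 1e-4, 8, 0.992471, 2.094338, 0.873270),
    ("many-to-many", 1e-3, 24, 1.000845, 2.085583, 0.877903),
    ("many-to-many", 1e-3, 8, 0.999263, 2.094184, 0.877632),
    ("many-to-many", 1e-5, 24, 1.107349, 2.021275, 0.860672),
    ("many-to-many", 1e-5, 8, 1.009374, 2.103157, 0.876933),
    ("one-to-many", 1e-4, 24, 0.913602, 2.094574, 0.904140),
    ("one-to-many", 1e-4, 8, 0.892252, 2.116918, 0.906913),
    ("one-to-many", 1e-3, 24, 1.005086, 1.831943, 0.855065),
    ("one-to-many", 1e-3, 8, 0.907714, 2.009166, 0.901923),
]
columns = ["mode", "lr", "bs", "stftloss_predict", "pesq_predict", "stoi_predict"]
df = pd.DataFrame(rows, columns=columns)

def best_and_worst(frame, column, higher_is_better):
    """Return ((lr, bs) of the best run, (lr, bs) of the worst run) for one metric."""
    ordered = frame.sort_values(column, ascending=not higher_is_better)
    first, last = ordered.iloc[0], ordered.iloc[-1]
    return (
        (float(first["lr"]), int(first["bs"])),
        (float(last["lr"]), int(last["bs"])),
    )

# STFT loss: lower is better; PESQ and STOI: higher is better
metrics = [("stftloss_predict", False), ("pesq_predict", True), ("stoi_predict", True)]
summary = {}
for mode, group in df.groupby("mode"):
    for column, higher_is_better in metrics:
        summary[(mode, column)] = best_and_worst(group, column, higher_is_better)

for key, value in summary.items():
    print(key, value)
```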

GENERAL OBSERVATIONS: 

- For many-to-many the condition with the lowest cumulative loss (lr=0.0001 and bs=8) is not the condition with the lowest STFT loss and the best objective metrics (lr=0.0001 and bs=24). This may mean that trying to fulfill the second objective (style loss) prevents the network from reaching a low STFT loss (?)
- For one-to-many the condition with the lowest cumulative loss is the same as the condition that leads to the lowest STFT loss

###  <font color='fuchsia'>Compare training results: 11-12-2023 (best* conditions trained longer) - compare checkpoints
*for many-to-many this is actually not the best condition; it is only the one that looks best based on the loss function (see the previous cell)

 </font>

In [30]:
summarize_eval(datapath,"runs-exp-11-12-2023", b_save=False)

Unnamed: 0,label,nb_pesq_input,pesq_input,stoi_input,stftloss_input,nb_pesq_predict,pesq_predict,stoi_predict,stftloss_predict,styleloss_input,styleloss_predict
0,many-to-many_lr-0.0001_bs-8,2.788324,1.992269,0.866207,1.050408,2.907897,2.212275,0.887292,0.915007,0.998077,0.999501
1,one-to-many_lr-0.0001_bs-8,2.47133,1.744751,0.901425,1.099414,2.876911,2.226713,0.917384,0.85434,0.996203,0.999448


- Improvement for STOI is bigger for many-to-many than for one-to-many. 

    rev1 target:rev2 input -> rev1 target:rev1 prediction 
    is bigger than:
    rev1 target:anechoic input -> rev1 target:rev1 prediction

    Interpretation: Replacing the wrong reverberation has a bigger impact on speech intelligibility than adding the correct reverberation to the anechoic signal. 


- Improvement for PESQ is smaller for many-to-many than for one-to-many. 

    rev1 target:rev2 input -> rev1 target:rev1 prediction 
    is smaller than:
    rev1 target:anechoic input -> rev1 target:rev1 prediction 

    Interpretation: Adding correct reverberation to the anechoic signal has a bigger effect on speech quality than replacing the wrong reverberation. 
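
The two comparisons above follow from simple differences between the `*_predict` and `*_input` columns of the 11-12-2023 table. The sketch below reproduces that arithmetic with values copied from the table; the dictionary layout and the name `improvement` are illustrative only.

```python
# STOI and (wide-band) PESQ values copied from the 11-12-2023 table above
results = {
    "many-to-many": {
        "stoi_input": 0.866207, "stoi_predict": 0.887292,
        "pesq_input": 1.992269, "pesq_predict": 2.212275,
    },
    "one-to-many": {
        "stoi_input": 0.901425, "stoi_predict": 0.917384,
        "pesq_input": 1.744751, "pesq_predict": 2.226713,
    },
}

# Improvement = metric(target, prediction) - metric(target, input)
improvement = {
    mode: {
        metric: values[f"{metric}_predict"] - values[f"{metric}_input"]
        for metric in ("stoi", "pesq")
    }
    for mode, values in results.items()
}

for mode, gains in improvement.items():
    print(f"{mode}: STOI +{gains['stoi']:.6f}, PESQ +{gains['pesq']:.6f}")
```

The STOI gain is about 0.021 for many-to-many against about 0.016 for one-to-many, while the PESQ gain is about 0.220 against about 0.482.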



###  <font color='fuchsia'>Compare training results: 15-01-2024 (different loss functions)
 </font>

In [32]:
summarize_eval(datapath,"runs-exp-15-01-2024", b_save=False)

Unnamed: 0,label,nb_pesq_input,pesq_input,stoi_input,stftloss_input,nb_pesq_predict,pesq_predict,stoi_predict,stftloss_predict,styleloss_input,styleloss_predict
0,18-01-2024--00-56_many-to-many_stft,2.785063,1.989314,0.867151,1.050991,2.899321,2.203292,0.88771,0.919596,0.919174,0.971813
1,18-01-2024--14-13_many-to-many_stft+rev,2.78843,1.996256,0.867036,1.050626,2.889186,2.191147,0.885903,0.933133,0.925503,0.97556
2,19-01-2024--03-54_many-to-many_stft+emb,2.788701,1.995231,0.865863,1.049138,2.883911,2.173069,0.882874,0.947855,0.997936,0.999391
3,19-01-2024--18-08_many-to-many_stft+rev+emb,2.789627,1.993386,0.865548,1.05077,2.889063,2.183598,0.882396,0.941709,0.995789,0.998951


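So far, the STFT-only loss gives the best results here: it has the lowest stftloss_predict (0.919596) and the highest pesq_predict and stoi_predict of the four loss combinations...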


###  <font color='fuchsia'>Compare training results: 22-01-2024 (different loss functions, corrected)
 </font>

In [33]:
summarize_eval(datapath,"runs-exp-22-01-2024", b_save=False)

Unnamed: 0,label,nb_pesq_input,pesq_input,stoi_input,stftloss_input,nb_pesq_predict,pesq_predict,stoi_predict,stftloss_predict,styleloss_input,styleloss_predict
0,22-01-2024--18-36_many-to-many_rev,2.789856,1.990917,0.866469,1.050406,2.842136,2.085734,0.866091,1.095309,0.937831,0.979793
1,23-01-2024--07-54_many-to-many_stft+emb,2.790451,1.992069,0.866646,1.050265,2.892618,2.182555,0.886652,0.935732,0.997357,0.999179
2,23-01-2024--22-05_many-to-many_stft+rev+emb,2.791655,1.997236,0.86601,1.051143,2.917615,2.229002,0.886512,0.920348,0.995666,0.998494


###  <font color='fuchsia'>Compare training results: 26-01-2024 (longer training)
 </font>

In [34]:
summarize_eval(datapath,"runs-exp-26-01-2024", b_save=False)

Unnamed: 0,label,nb_pesq_input,pesq_input,stoi_input,stftloss_input,nb_pesq_predict,pesq_predict,stoi_predict,stftloss_predict,styleloss_input,styleloss_predict
0,28-01-2024--15-34_many-to-many_stft,2.789627,1.993386,0.865548,1.05077,2.941856,2.27219,0.893875,0.884298,0.901222,0.968116
