## Data Validation
I am using the SignalTrain dataset, which contains raw audio signals and their corresponding compressed versions. These pairs can be used to model the action of the compressor that produced them.

I took the authors' advice and removed known errors from the dataset by moving or removing files that are corrupted, or whose input and output audio are out of phase. This is addressed below.

This notebook checks every input-output audio pair to see whether any audio file lacks a corresponding target. It also cross-correlates each input with its output to determine whether any pair has a relative delay that it should not have. Phase-shifted pairs are a problem for our network, because we would end up training (or validating) the model on input and output signals that are not time-aligned, so they must be fixed.

In [49]:
import os 
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

data = {}
splits = ["Test", "Train", "Val"]

for split in splits:
    split_files = os.listdir(f'data/{split}')
    # Inputs are the files whose names do not contain 'LA2A'
    split_inputs = [audio_path for audio_path in split_files if 'LA2A' not in audio_path]

    # Map each input path to the path of the file sharing its ID (second '_'-separated field)
    split_correspondance = {}
    for inp in split_inputs:
        for audio in split_files:
            if inp != audio and inp.split('_')[1] == audio.split('_')[1]:
                split_correspondance[f"data/{split}/{inp}"] = f"data/{split}/{audio}"

    # An input with no matching file never enters the mapping, so the counts differ
    if len(split_correspondance) != len(split_inputs):
        print(f"Not all inputs have a target for split: {split}, {len(split_correspondance)}, {len(split_inputs)}")

    data[split] = split_correspondance
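
The length check above only reports that a mismatch exists, not which files are involved. A minimal sketch of how the unpaired files could be listed by name is given here. The helper `find_unpaired` and the example filenames are hypothetical; it assumes the same naming rule as the cell above (the ID is the second `_`-separated field, and targets contain `LA2A`).

```python
def find_unpaired(filenames):
    """Return (inputs without a target, targets without an input)."""
    inputs, targets = {}, {}
    for name in filenames:
        parts = name.split("_")
        if len(parts) < 2:
            continue  # does not follow the assumed <prefix>_<id>_... pattern
        file_id = parts[1]
        # Assumed rule, as in the notebook: names containing 'LA2A' are targets
        if "LA2A" in name:
            targets[file_id] = name
        else:
            inputs[file_id] = name

    missing_targets = sorted(inputs[i] for i in inputs.keys() - targets.keys())
    orphan_targets = sorted(targets[i] for i in targets.keys() - inputs.keys())
    return missing_targets, orphan_targets


# Hypothetical filenames, for illustration only
example_files = [
    "input_1_.wav",
    "target_1_LA2A_3c__0__45.wav",
    "input_2_.wav",
    "target_3_LA2A_3c__0__10.wav",
]
missing_targets, orphan_targets = find_unpaired(example_files)
print(missing_targets, orphan_targets)
```

In the notebook this could be called with `os.listdir(f'data/{split}')` for each split. Note that it keeps only one file per ID and role, so it would need adjusting if an input can have several targets.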

In [50]:
def cross_correlate(input_audio, target_audio):
    # Load the audio files (the sample rate is taken from the input file)
    fs_x, x = wavfile.read(input_audio)
    _, y = wavfile.read(target_audio)

    # Integer PCM samples can overflow when multiplied and summed,
    # so work in floating point
    x = x.astype(np.float64)
    y = y.astype(np.float64)

    # Ensure that both signals have the same length
    min_len = min(len(x), len(y))
    x = x[:min_len]
    y = y[:min_len]

    # Cross-correlation over all possible lags
    corr_result = correlate(x, y, mode='full')

    # Lag (in samples) at the correlation peak; index len(x) - 1 is zero lag
    time_delay = np.argmax(corr_result) - (len(x) - 1)

    # Convert the lag from samples to seconds
    time_delay_seconds = time_delay / fs_x

    return time_delay_seconds
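
As a sanity check of the sign convention, the lag computation can be tried on synthetic signals with a known shift. This is only a sketch: `estimate_lag_samples` is a hypothetical helper that repeats the peak-finding step on in-memory arrays instead of WAV files, and the white-noise test signal is an assumption chosen so that the correlation peak is unambiguous.

```python
import numpy as np
from scipy.signal import correlate


def estimate_lag_samples(x, y):
    """Return the lag (in samples) at which correlate(x, y) peaks."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    corr = correlate(x, y, mode="full")
    # Index len(y) - 1 of the 'full' output corresponds to zero lag
    return int(np.argmax(corr)) - (len(y) - 1)


rng = np.random.default_rng(0)
x = rng.standard_normal(2000)  # synthetic stand-in for an input signal

shift = 5
# y_delayed is x delayed by `shift` samples (zero-padded at the start)
y_delayed = np.concatenate([np.zeros(shift), x[:-shift]])

lag_same = estimate_lag_samples(x, x)
lag_delayed = estimate_lag_samples(x, y_delayed)
lag_advanced = estimate_lag_samples(y_delayed, x)
print(lag_same, lag_delayed, lag_advanced)  # 0 -5 5
```

With this argument order, a second signal that is delayed relative to the first gives a negative lag, so a check for misaligned pairs should look at both signs of the delay.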

In [51]:
for split in splits:
    for input_audio, target_audio in data[split].items():
        # cross_correlate returns the delay (in seconds) at the correlation peak
        delay_seconds = cross_correlate(input_audio, target_audio)
        # The delay can be positive or negative, so flag any non-zero value
        if delay_seconds != 0:
            print(f"Non-zero cross correlation of the following audio from {split} split: {input_audio}, {target_audio}")

Non-zero cross correlation of the following audio from Train split: data/Train/input_147_.wav, data/Train/target_147_LA2A_3c__0__45.wav
