## Neural Analog Modelling 
Please note that the dataset used for this project is not 100% clean. The errors listed on the authors' errata page are addressed and resolved in the other notebook `0-data-validation.ipynb`, so that the models below are not affected by them.
### -- Introduction
This repo is concerned with modelling analog effects with neural networks. Why? Mostly for curiosity.

VST effects are typically implemented in C++, using handy frameworks like JUCE, which contains libraries to handle many of the typical challenges of plugin design such as cross-platform functionality, DSP and frontend design. To do real analog modelling, one typically needs some high-level domain knowledge and the capacity to understand the complicated analog circuits present in analog synths and effects, as well as a knowledge of DSP, which allows us to model these signals numerically. Other approaches include simulation of physical processes like reverberation.

In principle, neural networks are a more high level approach which, presupposing the access to a relevant dataset of dry/wet signals, allow us to model the effects without having to bust out Korg manuals from the 1980s and examine the intricacies of the circuits for their wonderful filters. 

There are clearly many limits in using neural networks to process audio. Some include:
- Clean datasets of dry/wet signals are not typically easy to come by, and are labour intensive to repair. Hence NNs are not always a suitable approach for modelling analog effects.
- Neural networks are typically quite bulky and implemented in Python (which is typically slower than DSP implementations in C++). This makes it difficult, although not impossible (see the work of [C. Steinmetz](https://scholar.google.com/citations?user=jSvSfIMAAAAJ&hl=en)), to use neural networks to process signals in real time. This inherent slowness makes many neural approaches suitable only for asynchronous audio processing. Sequence-to-sequence audio modelling is notoriously slow, especially given the large size of common audio models used today, e.g. [Demucs](https://github.com/facebookresearch/demucs) (for source separation).

I am interested in exploring these limits, especially the second one.

### -- Data 
In this case, we use the [SignalTrain](https://zenodo.org/records/3824876) dataset from 2019 which contains various dry and wet recordings, where the wet recordings are processed with an analog compressor, the Universal Audio LA-2A. 

This compressor is a very simple one, and we will be concerned with modelling the signal using only two parameters on the compressor:
- The switch between compression and limiting
- The peak reduction

The information about the parameters of the compressor is contained in the file names. In particular, the value between 0 and 100 represents the peak reduction knob, whereas the binary value 0/1 represents the switch between compression (0) and limiting (1). The authors say that there were no changes to the input or output gains, and that only the two parameters above were changed during recording.

### -- Audio
The audio in this dataset has a sampling rate of 44.1kHz and is mono, as the original analog LA-2A was designed to process mono signals. The individual audio files actually contain a "collage" of different pieces of music which are stitched seamlessly together. Each whole recording was passed through the compressor to obtain the wet signal.

I followed the advice of the authors in cleaning up errors in the dataset (removing / moving certain files) in the other notebook `data_validation.ipynb`. I cross-correlated all the signals to see if any pairs other than those mentioned by the authors had a phase shift between the dry and wet signals. To simplify my life, I removed every pair with a relative phase shift and kept only the dry/wet pairs with no shift, which, to the credit of the authors, were the vast majority of the signals.
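
As a rough illustration of the cross-correlation check described above, here is a minimal sketch that estimates the lag (in samples) between a dry and a wet signal. The function name `estimate_lag` is hypothetical and this is not the code from the validation notebook. `np.correlate` is quadratic in the signal length, so for the full-length files an FFT-based correlation (for example `scipy.signal.correlate` with `method="fft"`) would probably be needed.

```python
import numpy as np

def estimate_lag(dry, wet):
    """Estimate by how many samples `wet` is delayed relative to `dry`."""
    dry = np.asarray(dry, dtype=float)
    wet = np.asarray(wet, dtype=float)
    # Full cross-correlation covers lags from -(len(dry) - 1) to len(wet) - 1
    correlation = np.correlate(wet, dry, mode="full")
    # Index len(dry) - 1 corresponds to zero lag
    return int(np.argmax(correlation)) - (len(dry) - 1)

# Synthetic example: noise delayed by 5 samples
rng = np.random.default_rng(0)
dry = rng.standard_normal(1000)
wet = np.concatenate([np.zeros(5), dry[:-5]])
lag = estimate_lag(dry, wet)  # a non-zero lag would flag the pair for removal
```

A real wet signal is compressed rather than being an exact copy of the dry one, so the correlation peak is less sharp than in this synthetic case, but its position still indicates the relative shift.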

## Compressors

In the context of music production we are interested in Dynamic Range Compressors. These are effects which reduce the dynamic range (the gap between the loudest and quietest parts of a signal) via downward or upward compression (reducing the gain of loud sounds, and increasing the gain of quiet sounds respectively). They have several parameters, the most important ones being:

- Threshold: the level at which the compressor is activated. In downward compression with a threshold of -6dB, a signal that peaks at -10dB never activates the compressor, whereas a signal that exceeds -6dB activates it and triggers gain reduction. In upward compression with a threshold of -6dB, any sound below the threshold (e.g. -10dB) activates the compressor and causes a gain increase.

- Ratio: a parameter which controls the amount of compression applied to an incoming signal. It is called a ratio because it is typically defined in such a way that if the ratio is 4:1, any signal 4dB **over** the threshold will be reduced to 1dB over the threshold.
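
To make the threshold and ratio definitions concrete, here is a minimal sketch of the static (hard-knee, downward) compression curve working directly in dB. The function name `compress_level_db` and its default values are hypothetical, chosen only to match the numbers used in the text.

```python
import numpy as np

def compress_level_db(level_db, threshold_db=-6.0, ratio=4.0):
    """Static hard-knee downward compression of a level given in dB."""
    level_db = np.asarray(level_db, dtype=float)
    over = level_db - threshold_db  # how many dB the level sits above the threshold
    # Above the threshold only 1/ratio of the excess survives; below it nothing changes
    return np.where(over > 0, threshold_db + over / ratio, level_db)

# -10 dB is below the threshold and is untouched;
# -2 dB is 4 dB over the threshold and ends up 1 dB over it, at -5 dB
levels_out = compress_level_db([-10.0, -2.0])
```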

Some other common parameters include attack and release, which respectively delay and extend the activation of the compressor. For example, if the attack is 50ms, when the compressor detects a signal which crosses the threshold, it will delay its activation by 50ms. The release, if 50ms, will extend the action of the compressor by 50ms. In practice attack and release rarely behave like a step function. They are usually smoother: for example, during an attack phase of 50ms after a signal crosses the compressor's threshold, the action of the compressor may be ramped up linearly to its full strength.
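
The smooth attack and release behaviour can be sketched as follows. Note that this sketch uses exponential (one-pole) smoothing, a common digital approach, rather than the linear ramp mentioned above, and it is not a description of how the LA-2A behaves. The function name and the parameters are hypothetical.

```python
import numpy as np

def smooth_gain_reduction(target_db, sample_rate, attack_s=0.05, release_s=0.05):
    """Smooth a per-sample gain-reduction target (in dB) with attack/release time constants."""
    attack_coeff = np.exp(-1.0 / (attack_s * sample_rate))
    release_coeff = np.exp(-1.0 / (release_s * sample_rate))
    smoothed = np.empty(len(target_db))
    current = 0.0
    for i, target in enumerate(target_db):
        # Rising reduction uses the attack constant, falling reduction the release constant
        coeff = attack_coeff if target > current else release_coeff
        current = coeff * current + (1.0 - coeff) * target
        smoothed[i] = current
    return smoothed

# A step that asks for 6 dB of reduction for one second, then none (1 kHz rate for brevity)
target = np.concatenate([np.zeros(10), np.full(1000, 6.0), np.zeros(10000)])
smoothed = smooth_gain_reduction(target, sample_rate=1000)
```

The smoothed curve rises gradually towards 6 dB once the step begins and decays gradually back to 0 dB after it ends, instead of jumping.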

Another important control is the "knee" setting. In the case of a "hard knee" compressor, the compression only activates once the signal crosses the threshold, triggering gain reduction. In the case of a "soft knee" compressor, this transition point is 'blurred' and less abrupt: even for signals slightly below the threshold, the gain may be attenuated a little by the compressor. This results in a 'smoother' transition between compressed and uncompressed parts of the signal.
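
One common textbook way of writing a soft knee is a quadratic blend across a knee of width `W` dB centred on the threshold. The sketch below assumes that formulation; the function name and default values are hypothetical and the curve is not taken from the LA-2A.

```python
import numpy as np

def soft_knee_level_db(level_db, threshold_db=-10.0, ratio=4.0, knee_width_db=6.0):
    """Static downward compression curve with a quadratic soft knee (knee_width_db > 0)."""
    x = np.asarray(level_db, dtype=float)
    over = x - threshold_db
    half = knee_width_db / 2.0
    # Quadratic interpolation between the uncompressed and compressed slopes
    inside_knee = x + (1.0 / ratio - 1.0) * (over + half) ** 2 / (2.0 * knee_width_db)
    above_knee = threshold_db + over / ratio
    # Below the knee: unchanged; inside the knee: blended; above the knee: full ratio
    return np.where(over < -half, x, np.where(over <= half, inside_knee, above_knee))

knee_out = soft_knee_level_db([-20.0, -11.0, -10.0, 0.0])
```

With these settings a level of -11 dB, which is 1 dB below the threshold, is already attenuated by 0.25 dB, which is the "blurred" transition described above.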

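Here is an example of a minimal compressor implemented in numpy. It acts directly on linear sample amplitudes and responds instantaneously, so it has no attack, release or knee: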

In [50]:
import numpy as np

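def compressor(signal, threshold, ratio):
    """
    Minimal hard-knee downward compressor acting on linear sample amplitudes.
    Samples whose magnitude exceeds `threshold` keep only 1/ratio of the excess.
    """
    signal = np.asarray(signal, dtype=float)
    magnitude = np.abs(signal)
    # Scale the part of the magnitude above the threshold by 1/ratio
    compressed_magnitude = threshold + (magnitude - threshold) / ratio
    # Samples at or below the threshold pass through unchanged
    return np.where(magnitude > threshold, np.sign(signal) * compressed_magnitude, signal)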

Something important to remember is the units in which you are working. In a 16-bit system, a sample can take any integer value between -32768 and +32767 (2^16 values, hence "16-bit"). If a signal exceeds this range, it is clipped: its value is cut off at the maximum. From these raw numbers, we can define dBFS units (decibels relative to full scale), where our "full scale" value is the maximum amplitude in our 16-bit system, +32767. The conversion can be defined as $X\,\text{(dBFS)} = 20 \log_{10} \left( \frac{|\text{Amplitude}|}{32767} \right)$. If we reach our max value, the corresponding level is 0 dBFS, as expected, and every quieter signal has a negative level.
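
The conversion can be sketched in a few lines of numpy. The function name `amplitude_to_dbfs` and the constant are hypothetical, and the full-scale value follows the +32767 convention used in the text.

```python
import numpy as np

FULL_SCALE_16_BIT = 32767  # largest positive 16-bit sample value, as in the text

def amplitude_to_dbfs(amplitude, full_scale=FULL_SCALE_16_BIT):
    """Convert linear sample amplitudes to dBFS; silence maps to -inf."""
    magnitude = np.abs(np.asarray(amplitude, dtype=float))
    with np.errstate(divide="ignore"):  # log10(0) is -inf by design here
        return 20.0 * np.log10(magnitude / full_scale)

full_scale_level = amplitude_to_dbfs(32767)      # 0 dBFS
half_scale_level = amplitude_to_dbfs(32767 / 2)  # roughly -6 dBFS
```

Halving the amplitude lowers the level by about 6 dB, which is a handy rule of thumb when reading levels in dBFS.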

dB is commonly used because humans have a logarithmic perception of volume: doubling the intensity of a signal is not perceived as "twice as loud". Human perception of loudness follows the logarithm of intensity more closely than a linear relationship. Interestingly, loudness is a "psychological" quantity, and humans even perceive loudness differently depending on the frequency of the incoming signal, so signals at 2kHz and 15kHz with the same intensity will not be perceived as equally loud. For further details you can read about Fletcher-Munson curves, which describe this phenomenon.

## Metadata Preparation 
It is common when working with data such as audio to keep a metadata dataframe which contains important information about the audio that is not contained in the raw audio signal, so that we can feed this information to our model. Whether compression or limiting is being applied is an example of such metadata. The authors did not include such a metadata frame (i.e. a df in which each row is a track, with a unique id, the paths to the raw and processed audio, and the compression settings applied), so we have to create it ourselves from the filenames provided.

Our data consists of raw audio files of the form `./data/<split>/input_XXX_.wav` and corresponding processed audio files of the form `./data/<split>/target_XXX_LA2A_YY__Z__WW.wav`. The compression parameters are contained in the file name of the processed file. The parameters are the following:
- XXX: audio file id
- YY: seems to be the compressor revision, not that important
- Z: compressor/limiter switch either 0 or 1. 
- WW: peak reduction knob value, from 0-100.
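
As an illustration of the naming scheme, here is a small sketch that parses a processed file name with a regular expression. It is a stricter alternative to splitting on underscores, not the parser used in this notebook, and the pattern, the function name and the example file name are assumptions based only on the description above.

```python
import re

# Pattern for names like target_XXX_LA2A_YY__Z__WW.wav, as described above
TARGET_PATTERN = re.compile(
    r"target_(?P<track_id>\d+)_LA2A_(?P<revision>[^_]+)"
    r"__(?P<compress_or_limit>[01])__(?P<peak_reduction>\d+)\.wav"
)

def parse_target_name(file_name):
    """Return the parsed fields as a dict, or None if the name does not match."""
    match = TARGET_PATTERN.fullmatch(file_name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["compress_or_limit"] = int(fields["compress_or_limit"])
    fields["peak_reduction"] = int(fields["peak_reduction"])
    return fields

# Made-up file name that follows the pattern
parsed = parse_target_name("target_123_LA2A_3c__1__65.wav")
```

Returning `None` for names that do not match makes stray or misnamed files easy to spot instead of silently producing wrong parameters.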

In the other notebook I already removed pairs of audio which were out of phase (an error in the creation of the dataset), so we can proceed with extracting the data from the track files as described above. 

The input data will be:
- Raw audio segment.
- Compressor / Limiter Switch (0 or 1 resp.).
- Compressor peak reduction (0 - 100).

The target will be the corresponding processed audio segment.

In [52]:
# Get list of unique track ids by parsing the info in the file names
import os 
import pandas as pd 
from collections import defaultdict


def get_track_ids(track_paths):
    """
    Gets a list of track_ids which are unique from the given track_paths
    """
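    # Skip names without an underscore (e.g. stray non-audio files) to avoid an IndexError
    fields = [track_path.split('_') for track_path in track_paths]
    track_ids = [parts[1] for parts in fields if len(parts) > 1]
    numeric_ids_only = [track for track in track_ids if track.isdigit()]
    # Sort so that the order of the ids is the same on every run
    return sorted(set(numeric_ids_only))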

def prepare_metadata_records(splits):
    """
    Gets a set of parsed metadata records for every track in each split.
    - Creates a dict with split names as keys e.g {'train': .., 'test': .., ..}
    - The value corresponding to each split name is a list of nested records of the form [{'track_id':{'X_path':xxx, 'param1':yyy, 'param2':zzz, 'Y_path':www}, ..]
    """
    metadata = defaultdict(list)
    for split in splits:
        split_name = split.split('/')[-2].lower() # unfortunate naming of variable here
        track_paths = os.listdir(split)
        track_ids = get_track_ids(track_paths)

        split_metadata = []
        for track_id in track_ids:
            track_level_data = defaultdict(dict)
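            for track_path in track_paths:
                split_path = track_path.split('_')
                # Compare the id field exactly: a substring test would also match
                # e.g. id '12' inside 'target_123_...' or inside a parameter value
                is_target = 'target' in track_path and len(split_path) > 1
                if is_target and split_path[1] == track_id:
                    compress_or_limit = split_path[-3]
                    peak_reduction = split_path[-1].split('.wav')[0]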
                    
                    track_level_data[track_id]["raw_audio_path"] = split + 'input_' + track_id + '_.wav'
                    track_level_data[track_id]["compress_or_limit"] = compress_or_limit
                    track_level_data[track_id]["peak_reduction"] = peak_reduction
                    track_level_data[track_id]["processed_audio_path"] = split + track_path
            split_metadata.append(track_level_data)
        metadata[split_name] = split_metadata
    return metadata  

def turn_records_into_df(records):
    """
    Turns the metadata records into a nice and simple dataframe
    """
    flattened_data = []
    for split, split_records in records.items():  # avoid shadowing the `records` argument
        for record in split_records:
            for record_id, details in record.items():
                flattened_data.append({'split': split, 'track_id': record_id, **details})

    return pd.DataFrame(flattened_data)

splits = ["./data/Train/", "./data/Test/", "./data/Val/"]
records = prepare_metadata_records(splits)
df = turn_records_into_df(records)

Now we have our basic metadata df that will allow us to prepare each small audio segment for training. Let's do a quick sanity check to make sure we have no parsing errors:

In [53]:
df.isna().sum()

split                   0
track_id                0
raw_audio_path          0
compress_or_limit       0
peak_reduction          0
processed_audio_path    0
dtype: int64

In [54]:
df.track_id.duplicated().sum()

0

Looks good, no NaNs or duplicates. Let's just fix the types of some columns as our model will need them to have the correct types later:

In [55]:
df["track_id"] = df["track_id"].astype(int)
df["compress_or_limit"] = df["compress_or_limit"].astype(int)
df["peak_reduction"] = df["peak_reduction"].astype(int)
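
Once the columns are integers, it is cheap to also check that the parsed values fall inside the ranges described earlier (0 or 1 for the switch, 0-100 for the peak reduction). The sketch below shows one way to do this. The function name is hypothetical and the small dataframe is made-up data standing in for the real `df`.

```python
import pandas as pd

def find_invalid_rows(metadata):
    """Return the rows whose parameters fall outside the documented ranges."""
    bad_switch = ~metadata["compress_or_limit"].isin([0, 1])
    bad_peak = ~metadata["peak_reduction"].between(0, 100)
    return metadata[bad_switch | bad_peak]

# Hypothetical rows standing in for the real metadata dataframe
example = pd.DataFrame({
    "track_id": [1, 2, 3],
    "compress_or_limit": [0, 1, 2],
    "peak_reduction": [50, 100, 30],
})
invalid = find_invalid_rows(example)  # only track_id 3 is flagged (switch value 2)
```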

## Creating Audio Segments for Training + Evaluation + Challenges
Each file in the dataset is not a unique track, but rather a very long (can be up to 20 minutes) collage of different tracks, stitched together without interruption. All tracks in the file are compressed with the same compression parameters indicated in the "target" file name. We also have access to the raw, uncompressed audio file corresponding to this.

Our model will not have an input size of 20 minutes and will not handle variable input sizes effectively, so we need to break each file into chunks of 3-10 seconds and process them one by one. These are common input sizes in the literature for audio ML (e.g. ShortChunk takes roughly 3.7 seconds, whereas MusiCNN takes 3 seconds).

This means that from 1 file of 20 minutes, we will actually get up to (20*60)/3 = __400__ training examples for a model with 3 second input length. 
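
The chunking itself can be sketched as below, assuming non-overlapping chunks and dropping any incomplete chunk at the end of the file. The function name `split_into_chunks` is hypothetical, and whether to overlap chunks or pad the remainder is a design choice this sketch does not address.

```python
import numpy as np

def split_into_chunks(signal, sample_rate, chunk_seconds=3.0):
    """Split a 1-D signal into non-overlapping chunks, dropping the incomplete tail."""
    chunk_len = int(round(chunk_seconds * sample_rate))
    n_chunks = len(signal) // chunk_len
    # Trim to a whole number of chunks, then view as (n_chunks, chunk_len)
    return signal[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)

# 20 minutes of silence at a toy sample rate of 100 Hz, to keep the example light
sample_rate = 100
signal = np.zeros(20 * 60 * sample_rate)
chunks = split_into_chunks(signal, sample_rate)  # 400 chunks of 3 seconds each
```

The same dry and wet files must be chunked with identical boundaries so that each input chunk stays aligned with its target chunk.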

It also means that during inference (i.e. during modelling) we will not be able to process a whole file at once; we will need to split the incoming audio into 3 second chunks and process each chunk separately. We will call the incoming audio chunk the "buffer". This is a serious design challenge for modelling a time-based effect such as compression. A naive model architecture uses only the audio in the current buffer to produce its output, but time-based effects like compression have parameters such as attack and release (discussed above), which means audio in one buffer can trigger compression in the next incoming buffer.

If this is not clear, imagine that our 3 second audio signal in the buffer has a loud peak at 2.99s. If our compressor has an attack time of 0.015s (15ms), the compression will only "kick in" _after_ the current signal is out of the buffer (2.99 + 0.015 > 3.0), affecting only the start of the next 3 second signal coming into the buffer. This means there is a potential dependency between windows that are being processed independently by our model. The same problem exists in the realm of DSP audio plugin design, where these "transient" errors are often treated using a lookahead buffer. This challenge may end up motivating the design of our model later if naive models prove to suffer from this possible complication.
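
The dependency between buffers can be illustrated with a toy stateful effect. The sketch below uses a one-pole envelope follower as an assumed stand-in for attack/release behaviour. It is not the neural model and not the LA-2A, and all names are hypothetical. It compares processing two chunks independently (state reset at the boundary) with carrying the state across the boundary.

```python
import numpy as np

def envelope(x, coeff, state=0.0):
    """One-pole envelope follower; returns the envelope and the final state."""
    out = np.empty(len(x))
    for i, sample in enumerate(np.abs(x)):
        state = coeff * state + (1.0 - coeff) * sample
        out[i] = state
    return out, state

rng = np.random.default_rng(0)
signal = rng.standard_normal(600)
chunks = signal.reshape(2, 300)

# Reference: the whole signal processed in one pass
full, _ = envelope(signal, coeff=0.99)

# Independent processing: the state is reset at each chunk boundary
independent = np.concatenate([envelope(c, 0.99)[0] for c in chunks])

# Stateful processing: the final state of one chunk seeds the next
state, pieces = 0.0, []
for c in chunks:
    env, state = envelope(c, 0.99, state)
    pieces.append(env)
stateful = np.concatenate(pieces)
```

Carrying the state reproduces the full-signal result, while the independent version matches only on the first chunk and deviates at the start of the second. A model that sees each buffer in isolation is in the same position as the independent version.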

## Model Architecture

## Training