## Neural Analog Modelling 
Please note that the dataset used for this project is not 100% clean. The errors listed on the authors' errata page are addressed and resolved in the other notebook `0-data-validation.ipynb`, so that the models below are not affected by them.
### -- Introduction
This repo is concerned with modelling analog effects with neural networks. Why? Mostly for curiosity.

Most VST effects are implemented in C++, using handy frameworks like JUCE, which contains libraries to handle many of the typical challenges of plugin design such as cross-platform functionality, DSP and frontend design. In order to do real analog modelling, one typically needs high-level domain knowledge and the capacity to understand the complicated analog circuits present in analog synths and effects, as well as a knowledge of DSP, which allows us to model these signals numerically. Other approaches include simulation of physical processes like reverberation.

In principle, neural networks are a more high level approach which, presupposing the access to a relevant dataset of dry/wet signals, allow us to model the effects without having to bust out Korg manuals from the 1980s and examine the intricacies of the circuits for their wonderful filters. 

There are clearly many limits in using neural networks to process audio. Some include:
- Clean datasets of dry/wet signals are not typically easy to come by, and are labour intensive to repair. Hence NNs are not always a suitable approach for modelling analog effects.
- Neural networks are typically quite bulky, and implemented in Python (which is typically slower than DSP implementations in C++). This makes it difficult, although not impossible (see the work of [C. Steinmetz](https://scholar.google.com/citations?user=jSvSfIMAAAAJ&hl=en)), to use neural networks to process signals in real time. This inherent slowness makes a lot of neural approaches suitable only for asynchronous audio processing. Sequence to sequence audio modelling is notoriously slow, especially given the large size of common audio models used today, e.g. [Demucs](https://github.com/facebookresearch/demucs) (for source separation).

I am interested in exploring these limits, especially the second one.

### -- Data 
In this case, we use the [SignalTrain](https://zenodo.org/records/3824876) dataset from 2019 which contains various dry and wet recordings, where the wet recordings are processed with an analog compressor, the Universal Audio LA-2A. 

This compressor is a very simple one, and we will be concerned with modelling the signal using only two parameters on the compressor:
- The switch between compression and limiting
- The peak reduction

The information about the parameters of the compressor is contained in the file names. In particular, the value between 0-100 represents the peak reduction knob, whereas the binary value 0/1 represents the switch between compression (0) and limiting (1). The authors say that there were no changes to the input or output gains, and that only the two parameters above were changed during recording.

### -- Audio
The audio in this dataset has a sampling rate of 44.1kHz and is mono, as the original analog LA-2A was designed to process mono signals. The individual audio files actually contain a "collage" of different pieces of music which are stitched seamlessly together. The whole recording will be passed through the compressor to obtain the wet signal.

I followed the advice of the authors in cleaning up errors in the dataset (removing / moving certain files) in the other notebook `data_validation.ipynb`. I cross-correlated all the signals to see whether any pairs other than those mentioned by the authors had a phase shift between the dry and wet signals. To simplify my life, I found all the signals which had a relative phase shift and removed them, keeping only perfectly correlated dry/wet pairs, which, to the credit of the authors, was the vast majority of the signals.
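
The alignment check described above can be sketched as follows. This is a brute-force illustration in pure Python, not the code from the validation notebook: the function name `estimate_lag` and the `max_lag` search range are assumptions made for this example, and for full-length recordings an FFT-based cross-correlation would be far more practical.

```python
def estimate_lag(dry, wet, max_lag):
    """Return the shift (in samples) of `wet` relative to `dry` that maximises
    their cross-correlation, searching lags in [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, d in enumerate(dry):
            m = n + lag
            if 0 <= m < len(wet):
                score += d * wet[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy check: the "wet" signal is the dry one delayed by 3 samples.
dry = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
wet = [0.0, 0.0, 0.0] + dry[:-3]
print(estimate_lag(dry, wet, max_lag=5))  # prints 3
```

Under this sketch, a dry/wet pair would be kept only when the estimated lag is 0.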

## Compressors

In the context of music production we are interested in Dynamic Range Compressors. These are effects which reduce the dynamic range (the gap between the loudest and quietest parts of a signal) via downward or upward compression (reducing the gain of loud sounds, and increasing the gain of quiet sounds respectively). They have several parameters, the most important ones being:

- Threshold: the volume at which the compressor will be activated. In the case of downward compression, if the threshold is -6dB and the signal peaks at -10dB, the compressor will never be activated; if the signal instead rises above -6dB, the compressor will activate and trigger gain reduction. On the other hand, in upward compression, if the threshold is -6dB, any sound below this threshold (e.g. -10dB) will activate the compression and cause a gain increase.

- Ratio: a parameter which controls the amount of compression applied to an incoming signal. It is called a ratio because it is typically defined in such a way that if the ratio is 4:1, any signal 4dB **over** the threshold will be reduced to 1dB over the threshold.

Some other common parameters include attack and release, which delay or extend resp. the activation of the compressor. For example, if the attack is 50ms, when the compressor detects a signal which crosses the threshold, it will delay its activation by 50ms. The release, if 50ms, will extend the action of the compressor by 50ms. In practice the effect of attack / release is usually not a step function but something smoother: for example, during the 50ms attack phase after a signal crosses the compressor's threshold, the action of the compressor may be ramped up linearly to its full strength.
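
One common way to get this smooth behaviour in DSP code is to pass the desired gain reduction through a one-pole smoother with separate attack and release time constants. The sketch below is only an illustration of that idea: the function name and parameters are assumptions, and it is not a claim about how the LA-2A itself behaves.

```python
import math


def smooth_gain_reduction(target_db, sample_rate, attack_s, release_s):
    """Smooth a per-sample gain-reduction target (in dB, >= 0) with a
    one-pole filter that uses different time constants when the reduction
    is rising (attack) and falling (release)."""
    attack_coeff = math.exp(-1.0 / (attack_s * sample_rate))
    release_coeff = math.exp(-1.0 / (release_s * sample_rate))
    state = 0.0
    smoothed = []
    for target in target_db:
        coeff = attack_coeff if target > state else release_coeff
        state = coeff * state + (1.0 - coeff) * target
        smoothed.append(state)
    return smoothed


# A 6 dB gain-reduction request held for 50 ms, then released, at a toy 1 kHz rate.
request = [6.0] * 50 + [0.0] * 50
envelope = smooth_gain_reduction(request, sample_rate=1000, attack_s=0.05, release_s=0.05)
print(round(envelope[49], 3), round(envelope[-1], 3))
```

After one attack time constant the smoothed value has covered roughly 63% of the requested reduction, rather than jumping straight to it.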

Another important control is the "knee" setting. In the case of a "hard knee" compressor, the compression will only activate once the signal crosses the threshold, triggering gain reduction. In the case of a "soft knee" compressor, this transition point is 'blurred' and less abrupt: even for signals slightly below the threshold, the gain may be attenuated a little by the compressor. This results in a 'smoother' transition between compressed and uncompressed parts of the signal.
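
To make threshold, ratio and knee concrete, here is a small sketch of a static compression curve working directly in dB. The quadratic soft-knee interpolation is a formulation commonly found in the DSP literature and is used here as an assumption; the function name and arguments are illustrative only.

```python
def compressed_level_db(level_db, threshold_db, ratio, knee_db=0.0):
    """Static curve of a downward compressor: input level (dB) -> output level (dB).

    knee_db is the total width of the soft knee, centred on the threshold;
    knee_db=0 gives a hard knee.
    """
    over = level_db - threshold_db
    if 2 * over < -knee_db:
        # Below the knee region: the signal is left untouched.
        return level_db
    if knee_db > 0 and 2 * abs(over) <= knee_db:
        # Inside the knee: blend smoothly from no compression to full ratio.
        return level_db + (1 / ratio - 1) * (over + knee_db / 2) ** 2 / (2 * knee_db)
    # Above the knee: every `ratio` dB over the threshold becomes 1 dB over.
    return threshold_db + over / ratio


print(compressed_level_db(-2.0, threshold_db=-6.0, ratio=4.0))               # -5.0
print(compressed_level_db(-6.0, threshold_db=-6.0, ratio=4.0, knee_db=6.0))  # -6.5625
```

The first call reproduces the 4:1 example above (4dB over the threshold comes out 1dB over it). The second shows the soft knee already attenuating slightly at the threshold itself.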

Here is an example of a minimal compressor implemented in numpy

In [1]:
import numpy as np

def compressor(signal, threshold, ratio):
    compressed_signal = np.zeros_like(signal)
    for i in range(len(signal)):
        if np.abs(signal[i]) > threshold:
            compressed_signal[i] = np.sign(signal[i]) * (threshold + (np.abs(signal[i]) - threshold) / ratio)
        else:
            compressed_signal[i] = signal[i]
    return compressed_signal

Something important to remember is the units in which you are working. In a 16 bit system, a sample can take any integer value between -32768 and +32767 (2^16 values, hence 16 "bit"). If a signal exceeds this maximum value, it may result in clipping (the value of the signal will be cut off at the maximum value). From these raw numbers, we can define the dBFS units (decibels relative to full scale), where our "full scale" value is the maximum amplitude in our 16 bit system, +32767. The conversion can be defined as $X_{\text{dBFS}} = 20 \log_{10} \left( \frac{|\text{Amplitude}|}{32767} \right)$. If we reach our max value, notice that the corresponding value in dBFS will be 0dB, as expected.
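
As a quick illustration of this conversion, the helpers below go from a raw 16 bit amplitude to dBFS and back. The function names are made up for this sketch.

```python
import math

FULL_SCALE_16_BIT = 32767


def amplitude_to_dbfs(amplitude, full_scale=FULL_SCALE_16_BIT):
    """Convert a raw sample amplitude to decibels relative to full scale."""
    if amplitude == 0:
        return float("-inf")  # digital silence has no finite level in dB
    return 20.0 * math.log10(abs(amplitude) / full_scale)


def dbfs_to_amplitude(level_dbfs, full_scale=FULL_SCALE_16_BIT):
    """Convert a level in dBFS back to a (positive) raw amplitude."""
    return full_scale * 10.0 ** (level_dbfs / 20.0)


print(amplitude_to_dbfs(32767))                 # 0.0
print(round(amplitude_to_dbfs(32767 / 2), 2))   # -6.02
print(round(dbfs_to_amplitude(-6.0)))           # 16422
```

Halving the amplitude costs about 6dB, which is why a threshold of -6dB corresponds to roughly half of full scale.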

dB is commonly used because humans have a logarithmic perception of volume, i.e. doubling the intensity of a signal will not be perceived by humans as "twice as loud". Indeed, the human perception of loudness more closely follows the logarithm of intensity than a linear relationship. Interestingly, loudness is a "psychological" quantity, and humans even perceive loudness differently depending on the frequency of the incoming signal: a signal at 2kHz and a signal at 15kHz with the same intensity will not be perceived as equally loud. For further details you can read about Fletcher-Munson curves, which describe this phenomenon.

## Metadata Preparation 
It is common when working with data such as audio to keep a metadata dataframe which contains important information about the audio that is not contained in the raw audio signal, so that we can feed this information to our model. Whether compression or limiting is being applied is an example of such metadata. The authors did not include such a metadata frame i.e a df with each row being a track, with a unique id, the paths to the raw and processed audio, and the compression settings applied, so we have to create it ourselves from the filenames provided.

Our data consists of some raw audio files of the form e.g. `./data/<split>/input_XXX_.wav` and some corresponding processed audio files `./data/<split>/target_XXX_LA2A_YY__Z__WW.wav`. The compression parameters are contained in the file name of the processed file. The parameters are the following:
- XXX: audio file id
- YY: seems to be the compressor revision, not that important
- Z: compressor/limiter switch either 0 or 1. 
- WW: peak reduction knob, from 0-100.
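
The naming scheme above can also be parsed with a single regular expression. This is an alternative sketch rather than the approach used in the notebook: the pattern, the function name and the example file name `target_138_LA2A_3c__0__65.wav` are all hypothetical and only follow the `target_XXX_LA2A_YY__Z__WW.wav` layout described above.

```python
import re

# Pattern assumed from the naming scheme target_XXX_LA2A_YY__Z__WW.wav
TARGET_PATTERN = re.compile(
    r"target_(?P<track_id>\d+)_LA2A_(?P<revision>[^_]+)"
    r"__(?P<switch>[01])__(?P<peak_reduction>\d+)\.wav$"
)


def parse_target_filename(filename):
    """Extract the track id and compressor settings from a target file name."""
    match = TARGET_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unexpected target file name: {filename}")
    return {
        "track_id": int(match["track_id"]),
        "revision": match["revision"],
        "compress_or_limit": int(match["switch"]),
        "peak_reduction": int(match["peak_reduction"]),
    }


# Hypothetical file name, used only to exercise the parser.
print(parse_target_filename("target_138_LA2A_3c__0__65.wav"))
```

Anchoring each field to its position in the name avoids accidental matches between a track id and a peak reduction value that happen to share the same digits.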

In the other notebook I already removed pairs of audio which were out of phase (an error in the creation of the dataset), so we can proceed with extracting the data from the track files as described above. 

The input data will be:
- Raw audio segment.
- Compressor / Limiter Switch (0 or 1 resp.).
- Compressor peak reduction (0 - 100).
Where the target will be the corresponding processed audio segment.

In [2]:
# Get list of unique track ids by parsing the info in the file names
import os
import pandas as pd
from collections import defaultdict


def get_track_ids(track_paths):
    """
    Gets a list of track_ids which are unique from the given track_paths
    """
    track_ids = [track_path.split("_")[1] for track_path in track_paths]
    numeric_ids_only = [track for track in track_ids if track.isdigit()]
    return list(set(numeric_ids_only))


def prepare_metadata_records(splits):
    """
    Gets a set of parsed metadata records for every track in each split.
    - Creates a dict with split names as keys e.g {'train': .., 'test': .., ..}
    - The value corresponding to each split name is a list of nested records of the form [{'track_id': {'X_path': xxx, 'param1': yyy, 'param2': zzz, 'Y_path': www}}, ..]
    """
    metadata = defaultdict(list)
    for split in splits:
        split_name = split.split("/")[-2].lower()  # unfortunate naming of variable here
        track_paths = os.listdir(split)
        track_ids = get_track_ids(track_paths)

        split_metadata = []
        for track_id in track_ids:
            track_level_data = defaultdict(dict)
            for track_path in track_paths:
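                split_path = track_path.split("_")
                # Match the id field exactly: a substring test would also match
                # e.g. a peak reduction value that contains the same digits.
                if track_path.startswith("target") and split_path[1] == track_id: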
                    compress_or_limit = split_path[-3]
                    peak_reduction = split_path[-1].split(".wav")[0]

                    track_level_data[track_id]["raw_audio_path"] = (
                        split + "input_" + track_id + "_.wav"
                    )
                    track_level_data[track_id]["compress_or_limit"] = compress_or_limit
                    track_level_data[track_id]["peak_reduction"] = peak_reduction
                    track_level_data[track_id]["processed_audio_path"] = (
                        split + track_path
                    )
            split_metadata.append(track_level_data)
        metadata[split_name] = split_metadata
    return metadata


def turn_records_into_df(records):
    """
    Turns the metadata records into a nice and simple dataframe
    """
    flattened_data = []
    for split, split_records in records.items():
        for record in split_records:
            for record_id, details in record.items():
                flattened_data.append(
                    {"split": split, "track_id": record_id, **details}
                )

    return pd.DataFrame(flattened_data)


splits = ["./data/Train/", "./data/Test/", "./data/Val/"]
records = prepare_metadata_records(splits)
df = turn_records_into_df(records)

Now we have our basic metadata df that will allow us to prepare each small audio segment for training. Let's just do a quick sanity check to make sure we have no parsing errors:

In [3]:
df.isna().sum()

split                   0
track_id                0
raw_audio_path          0
compress_or_limit       0
peak_reduction          0
processed_audio_path    0
dtype: int64

In [4]:
df.track_id.duplicated().sum()

0

Looks good, no NaNs or duplicates. Let's just fix the types of some columns as our model will need them to have the correct types later:

In [5]:
df["track_id"] = df["track_id"].astype(int)
df["compress_or_limit"] = df["compress_or_limit"].astype(int)
df["peak_reduction"] = df["peak_reduction"].astype(int)

## Creating Audio Segments for Training + Evaluation + Challenges
Each file in the dataset is not a unique track, but rather a very long (can be up to 20 minutes) collage of different tracks, stitched together without interruption. All tracks in the file are compressed with the same compression parameters indicated in the "target" file name. We also have access to the raw, uncompressed audio file corresponding to this.

Our model will not have an input size of 20 minutes and will not handle variable input sizes effectively, so we need to break each file into chunks of 3-10 seconds and process them one by one. These are common input sizes in the literature for audio ML (e.g. ShortChunk has 15 seconds, whereas MusiCNN has 3 seconds).

This means that from 1 file of 20 minutes, we will actually obtain many training examples for a model with 1 second input length.

It also means that during inference (i.e. during modelling) we will not be able to process a whole file at once, but will need to split the incoming audio into 3 second chunks and then process each chunk separately. We will call the incoming audio chunk the "buffer". This fact is actually a serious design challenge for modelling a time-based effect such as compression, because a naive model architecture will only use the incoming audio in the buffer to produce an output, whereas time-based effects like compression have parameters such as attack/release (discussed above), which means audio in the current buffer should trigger compression in the next incoming buffer. If this is not clear, imagine that our 3 second audio signal in the buffer has a loud peak at 2.99s. If our compressor has an attack time of 0.015s (15ms), the compression will only "kick in" or be triggered _after_ the current signal is out of the buffer (2.99 + 0.015 > 3.0), affecting only the start of the next 3 second signal coming into the buffer. This means there is a potential dependency between windows that are being processed independently by our model. The same problem exists in the realm of DSP audio plugin design, where these "transient" errors are often treated using a lookahead buffer. This challenge may end up motivating the design of our model later if naive models prove to suffer from this possible complication.
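
The arithmetic of that example can be checked in a few lines. Times are kept as integer milliseconds to avoid floating point surprises; the helper name `buffer_index` is made up for this sketch.

```python
def buffer_index(time_ms, buffer_ms):
    """Index of the buffer (0-based) that contains a given point in time."""
    return time_ms // buffer_ms


BUFFER_MS = 3000   # 3 second buffer
peak_ms = 2990     # loud peak at 2.99 s
attack_ms = 15     # 15 ms attack time

onset_ms = peak_ms + attack_ms  # moment the compression would take effect
print(buffer_index(peak_ms, BUFFER_MS), buffer_index(onset_ms, BUFFER_MS))  # prints: 0 1
```

The peak sits in buffer 0 while its effect lands in buffer 1, which is exactly the cross-buffer dependency a model processing each buffer independently cannot see.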

First, we will load a full 15-20 minute audio file and split it up into windows of 1 second. We will use a window stride of 0.5 seconds. This means that we will keep a bit of the "last" window in the current window, which is a common technique in audio processing / time series analysis to have smoother transitions between adjacent windows. Let's make a helper function that takes an object of size N, a window of size K (K <= N) and a stride length S, and computes how many overlapping windows we can extract from our audio.

In [6]:
def overlapping_interval_count(object_size, window_size, window_stride):
    """all lengths in samples here"""
    stride = 0
    count = 0
    while True:
        pos = stride + window_size 
        if pos >= object_size:
            return count + 1  # number of windows needed to cover the object; the last one may run past its end
        count += 1
        stride += window_stride
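
As a worked example of the windowing, the sketch below counts only windows that fit entirely inside the object and lists their boundaries. It is a closed-form alternative written for illustration (the names `full_window_count` and `window_bounds` are assumptions), not a replacement for the helper above, which additionally counts a trailing window when the last one would run past the end.

```python
def full_window_count(object_size, window_size, window_stride):
    """Number of windows of `window_size` that fit entirely inside the object
    when consecutive windows start `window_stride` apart (all in samples)."""
    if object_size < window_size:
        return 0
    return (object_size - window_size) // window_stride + 1


def window_bounds(object_size, window_size, window_stride):
    """(start, end) sample indices of every full window, end exclusive."""
    count = full_window_count(object_size, window_size, window_stride)
    return [(i * window_stride, i * window_stride + window_size) for i in range(count)]


print(window_bounds(10, 4, 2))          # [(0, 4), (2, 6), (4, 8), (6, 10)]
print(full_window_count(11, 4, 2))      # 4
```

With 11 samples the loop-based helper would report 5 windows, the fifth covering samples 8 to 12 and therefore being shorter than the others once loaded. Whether such a partial window is wanted depends on how the batches are collated.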

Now let's create our training dataset

In [7]:
import torchaudio
import torch 
from torch.utils.data import Dataset
import audiofile

class AudioDataset(Dataset):
    def __init__(
        self,
        input_audio_paths,
        target_audio_paths,
        window_size,
        overlap,
        params=None,
    ):
        self.input_audio_paths = input_audio_paths
        self.target_audio_paths = target_audio_paths
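        self.window_size = window_size  # in seconds
        self.window_size_samples = self.window_size * 44100  # in samples
        # NOTE: `overlap` is used as the hop between consecutive window starts (in seconds);
        # it equals the actual overlap only when it is half the window size.
        self.overlap = overlap
        self.overlap_samples = self.overlap * 44100  # in samples
        self.params = params  # compressor parameters i.e. compress-limit switch, peak reduction
        # Map a global window index to (audio index, window index within that audio)
        self.window_to_audio_mapping = self.window_index_to_audio_indices()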
        self.examples = self.create_examples()

    def __len__(self):
        return self.dataset_size

    def __getitem__(self, window_index):
        """ 
        Converts a window index to a tuple of audio indices, i.e the first window (0)
        corresponds to the first audio (0) and the first window in this audio (0). 

        The set of {window_index: (audio_index, window_index_in_audio)} is precomputed 
        by the `window_index_to_audio_indices` method and stored in `self.window_to_audio_mapping`.
        
        We use this tuple of indices to 
        load the actual audio data (the stream of samples), lazily, for training. 
        """
        audio_index, window_index_in_audio = self.window_to_audio_mapping[window_index]
        input_audio_path, target_audio_path, params = self.examples[audio_index]

        input_waveform, _ = audiofile.read(input_audio_path, offset=(window_index_in_audio*self.overlap_samples)/44100, duration=self.window_size)
        target_waveform, _ = audiofile.read(target_audio_path, offset=(window_index_in_audio*self.overlap_samples)/44100, duration=self.window_size)

        input_waveform = (input_waveform + 1.0) / 2.0 # Rescale (invertibly) values that go from -1 to 1, to be between 0 and 1
        target_waveform = (target_waveform + 1.0) / 2.0 

        input_waveform = torch.tensor(input_waveform, dtype=torch.float32)
        target_waveform = torch.tensor(target_waveform, dtype=torch.float32)
        compress_limit = torch.tensor(params[0], dtype=torch.float32)
        peak_reduction = torch.tensor(params[1]/100.0, dtype=torch.float32) # Ensure the peak reduction is between 0 and 1, not 0 - 100

        return (input_waveform, compress_limit, peak_reduction), target_waveform

    def window_index_to_audio_indices(self):
        """ 
        Our audio dataset constructor takes in a list of input/target audio paths. In the SignalTrain dataset, each 
        individual audio is very long, up to 20 mins. Our model will have a much smaller input size. Therefore, we 
        will use a given audio file to create several training examples by taking overlapping windows of for example 3 seconds 
        from each file. e.g 0-3s, 1-4s, 2-5s all the way up until we use all 20 minutes of the audio file.

        Since __getitem__ method is lazy, it will try to load each window of audio data on the fly instead of preparing
        them in advance. This means that when we say __getitem__(293), we need to know which audio file corresponds to 
        window 293, and where precisely this window is situated in that audio file, so that we can load the right data.
        
        For example, window with index 0 can be identified with the first audio file, and the first window in that audio i.e (0,0).
        
        This function is used to pre-compute the mapping from a window index into a tuple of indices identifying both the audio file,
        and the position of this window in that audio file so that we can load the audio data on the fly in __getitem__. 
        """
        cumulative_window_index = 0
        mapping = {}
        for audio_index, path in enumerate(self.input_audio_paths):
            num_windows = self.calculate_num_windows(path)
            for i in range(num_windows):
                mapping[cumulative_window_index + i] = (
                    audio_index,
                    i,
                )  # e.g (3rd audio, 4th window of 3rd audio)
            cumulative_window_index += num_windows
        self.dataset_size = len(mapping)

        return mapping

    def calculate_num_windows(self, audio_path):
        waveform, sample_rate = torchaudio.load(audio_path)
        window_samples = self.window_size * sample_rate
        overlap_samples = self.overlap * sample_rate
        num_windows = overlapping_interval_count(
            waveform.size(1), window_samples, overlap_samples
        )
        return num_windows

    def create_examples(self):
        return list(zip(self.input_audio_paths, self.target_audio_paths, self.params))


# Example usage with multiple paths and parameters
train_df = df[df.split == "train"]
test_df = df[df.split == "test"]
val_df = df[df.split == "val"]

train_dataset = AudioDataset(
    train_df["raw_audio_path"].to_list(),
    train_df["processed_audio_path"].to_list(),
    window_size=0.02,
    overlap=0.01,
    params=list(zip(train_df["compress_or_limit"], train_df["peak_reduction"])),
)
test_dataset = AudioDataset(
    test_df["raw_audio_path"].to_list(),
    test_df["processed_audio_path"].to_list(),
    window_size=0.02,
    overlap=0.01,
    params=list(zip(test_df["compress_or_limit"], test_df["peak_reduction"])),
)
val_dataset = AudioDataset(
    val_df["raw_audio_path"].to_list(),
    val_df["processed_audio_path"].to_list(),
    window_size=0.02,
    overlap=0.01,
    params=list(zip(val_df["compress_or_limit"], val_df["peak_reduction"])),
)
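
With windows this short the precomputed dictionary holds one entry per window, which can run into millions of entries. A lighter alternative, sketched here under the assumption that only the number of windows per file is known, is to store cumulative window counts and locate a window with a binary search. The names `build_offsets` and `locate_window` are made up for this example.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(windows_per_file):
    """Cumulative window counts, e.g. [3, 2, 4] -> [3, 5, 9]."""
    return list(accumulate(windows_per_file))


def locate_window(window_index, offsets):
    """Map a global window index to (audio_index, window_index_in_audio)."""
    if not 0 <= window_index < offsets[-1]:
        raise IndexError(f"window index {window_index} out of range")
    audio_index = bisect_right(offsets, window_index)
    first_window_of_file = offsets[audio_index - 1] if audio_index > 0 else 0
    return audio_index, window_index - first_window_of_file


offsets = build_offsets([3, 2, 4])  # three toy files with 3, 2 and 4 windows
print([locate_window(i, offsets) for i in range(9)])
```

This returns the same (audio index, window index) pairs as the dictionary, while storing only one integer per audio file.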

Note: 
- Torchaudio loads audio as a tuple containing a 2D tensor and the sample rate. The 2D tensor is structured as (num_channels, num_frames). In the case of mono audio (1 channel), it still follows this format, but with the number of channels being 1.

Below is an example of what our input / output data will look like:

In [8]:
(input_waveform, cl, pr), target_waveform = train_dataset[400]

print(f"input waveform:\n{input_waveform}, shape: {input_waveform.shape}\n")
print(f"compress or limit:\n{cl}, shape: {cl.shape}\n")
print(f"peak reduction:\n{pr}, shape: {pr.shape}\n")
print(f"target waveform:\n{target_waveform}, shape: {target_waveform.shape}\n")

input waveform:
tensor([0.4536, 0.4522, 0.4511, 0.4502, 0.4496, 0.4493, 0.4491, 0.4492, 0.4496,
        0.4504, 0.4515, 0.4529, 0.4548, 0.4570, 0.4595, 0.4624, 0.4657, 0.4694,
        0.4735, 0.4781, 0.4830, 0.4882, 0.4933, 0.4983, 0.5028, 0.5070, 0.5107,
        0.5139, 0.5166, 0.5190, 0.5212, 0.5234, 0.5254, 0.5274, 0.5294, 0.5313,
        0.5333, 0.5352, 0.5372, 0.5392, 0.5409, 0.5423, 0.5433, 0.5436, 0.5431,
        0.5419, 0.5400, 0.5375, 0.5345, 0.5312, 0.5277, 0.5242, 0.5207, 0.5174,
        0.5144, 0.5115, 0.5087, 0.5058, 0.5028, 0.4996, 0.4962, 0.4927, 0.4891,
        0.4855, 0.4822, 0.4792, 0.4766, 0.4745, 0.4728, 0.4715, 0.4704, 0.4696,
        0.4689, 0.4683, 0.4680, 0.4680, 0.4682, 0.4684, 0.4689, 0.4697, 0.4706,
        0.4715, 0.4727, 0.4741, 0.4759, 0.4780, 0.4807, 0.4838, 0.4874, 0.4912,
        0.4954, 0.4996, 0.5039, 0.5079, 0.5116, 0.5150, 0.5180, 0.5206, 0.5229,
        0.5249, 0.5266, 0.5280, 0.5293, 0.5304, 0.5314, 0.5323, 0.5333, 0.5342,
        0.5353, 0.5363, 

## Model Architecture

Let's consider an MLP that will have a "bottleneck" (like a trivial autoencoder) to make our PoC.
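
To get a feel for how small this PoC is, the parameter count of such a bottleneck MLP can be worked out by hand. The sketch below assumes the sizes used in the training cell further down (an 882-sample window plus the two compressor parameters, with hidden widths of a quarter, an eighth and a quarter of the input size); the helper name is made up.

```python
def dense_param_count(layer_sizes):
    """Total weights + biases of a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


sequence_length = 882
input_size = sequence_length + 2  # audio window + switch + peak reduction
layer_sizes = [input_size, input_size // 4, input_size // 8, input_size // 4, sequence_length]

print(layer_sizes)                     # [884, 221, 110, 221, 882]
print(dense_param_count(layer_sizes))  # 440340
```

That is well under half a million parameters, several orders of magnitude below the LSTM variant discussed next.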

Challenges of Neural Modelling made explicit:
- It is very difficult to use "large" architectures like LSTMs / Transformers in this setup. With an input size of, say, 882 samples (which at a sample rate of 44.1kHz represents 0.02 seconds of music), I tried a fairly standard LSTM with hidden size 64, an embedding size of 128 for the compressor parameters (quite big, to be fair) and an MLP decoder (4 layers that reduce the dimensions of the flattened LSTM output + embedded compressor parameters back to the correct number of samples, i.e. 882). This gave me a model with 235 million parameters, roughly twice the size of BERT-base at about 110 million (think of it like this: RoBERTa has a sequence length of 512, whereas our "sequence length" is effectively 882 if we throw the 0.02s of audio into the LSTM). This 0.02s may also be insufficient to capture the dynamics of the compressor, depending on the attack and release times (which may exceed 20ms), making the modelling of a time-dependent effect like a compressor very difficult.

- Needless to say, running real-time inference on high fidelity audio, in Python, with a model whose parameter count is in the millions, fast enough to fill the audio buffer of your DAW, seems impossible. Hence the challenge of neural modelling, and why companies that propose "neural" inspired VSTs use a very small network (effectively fancy LR) to model circuits rather than audio, greatly reducing the dimensions.

- DSP approaches are much better suited for this task.
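
The real-time constraint can be phrased as a simple ratio: the time spent processing a buffer divided by the duration of audio that buffer holds. The sketch below uses a made-up processing time purely for illustration; the helper names are assumptions.

```python
def buffer_duration_s(num_samples, sample_rate):
    """Duration of audio held in a buffer, in seconds."""
    return num_samples / sample_rate


def real_time_factor(processing_time_s, num_samples, sample_rate):
    """Processing time divided by buffer duration. Values below 1 mean the
    buffer is processed faster than it plays back."""
    return processing_time_s / buffer_duration_s(num_samples, sample_rate)


# 882 samples at 44.1kHz is a 20 ms buffer.
print(buffer_duration_s(882, 44100))

# Hypothetical model that needs 5 ms to process one buffer.
print(real_time_factor(0.005, 882, 44100))
```

A model is only usable live if this factor stays comfortably below 1 on the target machine, with headroom left for the rest of the audio chain.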

Possible solutions
- Downsample audio for inference, use low latency upsampler to upsample the output (these models do exist e.g the one from Facebook using RVQ)

- Do not use LSTM / sequential modelling to model an individual sequence of 882 samples. If using sequential modelling, use it to treat a sequence of clips, so that instead of the effective sequence length being 882 it would be, say, 16 (if you treat 16 clips of 882 samples). This can reduce the parameter count and still perhaps bring some "temporal" modelling benefits to the model.

Notes:
- The training loop below converges very quickly, after basically 25 batches (of an epoch of 780 batches) - will have to make sure there's no error in the training and that the inference can produce realistic audio but this is interesting as it means that perhaps a small amount of data is needed to do this type of modelling.

In [9]:
import torch
import torch.nn as nn
import time
from torch.utils.data import DataLoader
import gc 

gc.collect()

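# Use Apple's MPS backend when available, otherwise fall back to the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")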

# Hyperparameters
sequence_length = 882  # Assuming each audio interval has this length
batch_size = 8192
learning_rate = 0.001
print(f"Number of batches: {len(train_dataset)/batch_size}")

dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

class MLPDecoder(nn.Module):
    def __init__(self, input_size=sequence_length+2):
        super().__init__()
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()
        self.relu3 = nn.ReLU()
        self.relu4 = nn.ReLU()
        self.dense1 = nn.Linear(input_size, int(input_size/4))
        self.dense2 = nn.Linear(int(input_size/4), int(input_size/8))
        self.dense3 = nn.Linear(int(input_size/8), int(input_size/4))
        self.dense4 = nn.Linear(int(input_size/4), sequence_length)

    def forward(self, x):
        audio_input = x[0].to(device)
        compress_limit_input = x[1].unsqueeze(1).to(device)
        peak_reduction_input = x[2].unsqueeze(1).to(device)
        concat_embeddings = torch.cat([audio_input, compress_limit_input, peak_reduction_input], dim=1)

        x = self.relu1(self.dense1(concat_embeddings))
        x = self.relu2(self.dense2(x))
        x = self.relu3(self.dense3(x))
        x = self.relu4(self.dense4(x))

        return x

model = MLPDecoder()  # Input size is sequence_length + 2: the audio window plus the two compressor parameters
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
loss_fn = nn.MSELoss()  # Adjust loss function if needed
model.to(device)


# Training loop
for epoch in range(1):
    total_batches = len(dataloader)
    for batch_idx, (data, label) in enumerate(dataloader):
        
        start_time = time.time()

        label = label.to(device)

        optimizer.zero_grad()
        prediction = model(data)

        loss = loss_fn(prediction, label)
        loss.backward()
        optimizer.step()
        
        end_time = time.time()
        
        batches_done = batch_idx + 1
        batches_left = total_batches - batches_done
        
        print(f"EPOCH {epoch+1} | Batch {batch_idx+1}/{total_batches} | Time: {(end_time - start_time)/60.0} mins | Loss: {loss.item()} | Batches Left: {batches_left}")

print("Training complete!")


Number of batches: 797.599853515625
EPOCH 1 | Batch 1/798 | Time: 0.012181365489959716 mins | Loss: 0.23357173800468445 | Batches Left: 797
EPOCH 1 | Batch 2/798 | Time: 0.000499268372853597 mins | Loss: 0.20942844450473785 | Batches Left: 796
EPOCH 1 | Batch 3/798 | Time: 0.0005411823590596517 mins | Loss: 0.1717255860567093 | Batches Left: 795
EPOCH 1 | Batch 4/798 | Time: 0.0005015015602111816 mins | Loss: 0.13751059770584106 | Batches Left: 794
EPOCH 1 | Batch 5/798 | Time: 0.0005005836486816406 mins | Loss: 0.13605394959449768 | Batches Left: 793
EPOCH 1 | Batch 6/798 | Time: 0.0005006194114685059 mins | Loss: 0.1254543960094452 | Batches Left: 792
EPOCH 1 | Batch 7/798 | Time: 0.0005041162172953287 mins | Loss: 0.10199355334043503 | Batches Left: 791
EPOCH 1 | Batch 8/798 | Time: 0.0005008300145467122 mins | Loss: 0.08879490196704865 | Batches Left: 790
EPOCH 1 | Batch 9/798 | Time: 0.0005022366841634115 mins | Loss: 0.08615609258413315 | Batches Left: 789
EPOCH 1 | Batch 10/798 

KeyboardInterrupt: 

Open question: is the `shuffle` parameter messing up my "window index" logic inside the dataset class? Am I computing the indices of audio clips, only to have this negated by a shuffling mechanism which means the indices are all mixed up?

to confirm
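
One way to reason about this: with a map-style dataset, shuffling should only change the order in which indices are passed to `__getitem__`, while the index-to-window mapping itself stays fixed. The toy sketch below mimics that in pure Python, with a random permutation of indices standing in for the `DataLoader`'s shuffling. It is a sanity check of the logic under that assumption, not a test of the real class.

```python
import random


class ToyWindowDataset:
    """Stand-in for the dataset above: maps a global window index to
    (audio_index, window_index_in_audio) with a precomputed dictionary."""

    def __init__(self, windows_per_file):
        self.mapping = {}
        cumulative = 0
        for audio_index, num_windows in enumerate(windows_per_file):
            for i in range(num_windows):
                self.mapping[cumulative + i] = (audio_index, i)
            cumulative += num_windows

    def __len__(self):
        return len(self.mapping)

    def __getitem__(self, window_index):
        return self.mapping[window_index]


dataset = ToyWindowDataset([3, 2, 4])
in_order = [dataset[i] for i in range(len(dataset))]

# Shuffling only permutes the order in which indices are requested.
indices = list(range(len(dataset)))
random.Random(0).shuffle(indices)
shuffled = [dataset[i] for i in indices]

print(sorted(shuffled) == in_order)                         # True
print(all(dataset[i] == in_order[i] for i in indices))      # True
```

Every index still resolves to the same window regardless of the order in which it is requested, so the shuffle changes batch composition but not which audio each index points to.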