<a href="https://colab.research.google.com/github/malloyca/steelpan-pitch/blob/main/steelpan-crepe/steelpan_crepe_batch%3A256_epoch%3A10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steelpan-crepe Training Notebook

This notebook can create CREPE models and train them on steelpan data.

To use the notebook as is, download the downsampled audio ("tiny_16kHz/") and make sure the directory paths for the train and validation sets are correct. To train from existing weights, also download the .h5 files from the "models" branch of the CREPE repo.

In [1]:
import tensorflow as tf
import numpy as np
import os
import soundfile
import librosa # Is this needed after all?

print(tf.__version__)
# This code allows for the GPU to be utilized properly.
tf.autograph.set_verbosity(0)
physical_devices = tf.config.list_physical_devices("GPU")
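try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except (IndexError, RuntimeError, ValueError):
    # No GPU found, or memory growth can no longer be changed; continue without it
    pass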

print(physical_devices)
print("If the above list is empty, then TF won't use any accelerator")

2.6.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
If the above list is empty, then TF won't use any accelerator


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [18]:
# the model is trained on 16kHz audio
model_srate = 16000

# set batch size for training
batch_size = 256

## Model builder

This code is lightly modified from the CREPE repo, which normally stores the models in a dict; I think that is unnecessary for our purposes.

You can also load weights from an existing .h5 file. I began training by loading the weights of "model-full.h5" from the marl/crepe models branch. The weights for the model trained on steelpan data are named "new-crepe-full.h5" and are in the Drive folder.
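As a sanity check on the architecture, the layer shapes can be worked out by hand. The sketch below is plain Python arithmetic (no TensorFlow). It assumes 'same' padding for the convolutions and 'valid' 2x1 max pooling, as in the builder; `conv_stack_shapes` is a hypothetical helper name, not part of CREPE.

```python
import math

def conv_stack_shapes(capacity_multiplier=32, n_samples=1024):
    """Return (time_steps, channels) after each conv + max-pool stage."""
    filters = [n * capacity_multiplier for n in [32, 4, 4, 4, 8, 16]]
    time_strides = [4, 1, 1, 1, 1, 1]
    shapes = []
    t = n_samples
    for f, s in zip(filters, time_strides):
        t = math.ceil(t / s)  # 'same' padding: output length = ceil(input / stride)
        t = t // 2            # 'valid' max pooling with pool size 2
        shapes.append((t, f))
    return shapes

shapes = conv_stack_shapes()  # 'full' capacity uses a multiplier of 32
flattened = shapes[-1][0] * shapes[-1][1]
# Weights plus biases of conv1: 512 taps x 1 input channel x 1024 filters
conv1_params = 512 * 1 * 1024 + 1024
print(shapes, flattened, conv1_params)
```

For the "full" model this leaves 4 time steps x 512 channels = 2048 features going into the 360-unit classifier, and the conv1 count agrees with the 525312 that model.summary() reports.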

In [4]:
# todo - This is a note from Crepe, delete and set it up to just do a 'full' model
# store as a global variable, since we only support a few models for now
models = {
    'tiny': None,
    'small': None,
    'medium': None,
    'large': None,
    'full': None
}

def make_model(model_capacity, metrics, weights=None):
    '''
    model_capacity: tiny, small, medium, large, full
    weights: path of .h5 weights file
    '''

    from tensorflow.keras.layers import Input, Reshape, Conv2D, BatchNormalization
    from tensorflow.keras.layers import MaxPool2D, Dropout, Permute, Flatten, Dense
    from tensorflow.keras.models import Model

    capacity_multiplier = {
        'tiny': 4, 'small': 8, 'medium': 16, 'large': 24, 'full': 32
    }[model_capacity]

    layers = [1, 2, 3, 4, 5, 6]
    filters = [n * capacity_multiplier for n in [32, 4, 4, 4, 8, 16]]
    widths = [512, 64, 64, 64, 64, 64]
    strides = [(4, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1)]

    x = Input(shape=(1024,), name='input', dtype='float32')
    y = Reshape(target_shape=(1024, 1, 1), name='input-reshape')(x)

    for l, f, w, s in zip(layers, filters, widths, strides):
        y = Conv2D(f, (w, 1), strides=s, padding='same',
                    activation='relu', name="conv%d" % l)(y)
        y = BatchNormalization(name="conv%d-BN" % l)(y)
        y = MaxPool2D(pool_size=(2, 1), strides=None, padding='valid',
                        name="conv%d-maxpool" % l)(y)
        y = Dropout(0.25, name="conv%d-dropout" % l)(y)

    y = Permute((2, 1, 3), name="transpose")(y)
    y = Flatten(name="flatten")(y)
    y = Dense(360, activation='sigmoid', name="classifier")(y)

    model = Model(inputs=x, outputs=y)

    if weights is not None:
        model.load_weights(weights)
    model.compile(tf.keras.optimizers.Adam(learning_rate=0.0002), 'binary_crossentropy', metrics=metrics)

    models[model_capacity] = model

    return model

In [5]:
# todo - Just use Librosa's db_to_power instead?
def db_to_pow(db):
  '''Convert from dB to power'''
  return 10**(db / 10)


def frame_energy(frame):
  '''Calculates the average energy for a frame
    
    Parameters
    ----------
    frame : np.array
      audio frame in np.float32 format

    Returns
    -------
    average_energy : float
      Average energy level for frame
  '''

  # Square the sample values to convert to energy values
  energy = frame**2

  # Sum the energy values to get total energy
  total_energy = np.sum(energy)

  # Divide by length to get average energy
  return total_energy / len(frame)

## Data formatting

Audio is framed the same way CREPE does it, then stacked into a single array for input into model.fit(). The step size defaults to 10 ms, as in CREPE.
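A minimal sketch of the framing step on its own, assuming mono audio that is already at 16 kHz and at least 1024 samples long; `frame_audio` is a hypothetical helper name used only for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def frame_audio(audio, model_srate=16000, step_size=10, frame_length=1024):
    """Split a 1-D signal into overlapping frames of shape (n_frames, frame_length)."""
    audio = np.ascontiguousarray(audio, dtype=np.float32)
    hop_length = int(model_srate * step_size / 1000)  # 10 ms at 16 kHz -> 160 samples
    n_frames = 1 + int((len(audio) - frame_length) / hop_length)
    # View the signal as (frame_length, n_frames) without copying, then copy once
    frames = as_strided(audio, shape=(frame_length, n_frames),
                        strides=(audio.itemsize, hop_length * audio.itemsize))
    return frames.transpose().copy()

# One second of a ramp signal, so each frame's first value equals its start sample
frames = frame_audio(np.arange(16000))
print(frames.shape, frames[1, 0])
```

Consecutive frames overlap heavily (1024 samples per frame, 160 samples per hop), so one second of audio yields far more training rows than the file count suggests.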

The label data is scaled and encoded into one-hot vectors to match the model's bucketed output. There are 360 outputs, but only 29 actually appear as labels (MIDI pitches 60-89).
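The scaling used in this notebook maps a MIDI pitch to output bin 5 * (midi - 24), i.e. five 20-cent bins per semitone. A small NumPy sketch of that mapping and its inverse follows; the helper names are hypothetical and `one_hot` stands in for tf.one_hot.

```python
import numpy as np

def midi_to_bin(midi):
    """Map MIDI pitch numbers to output bin indices (5 bins per semitone)."""
    return 5 * (np.asarray(midi) - 24)

def bin_to_midi(bins):
    """Inverse mapping; bins between semitones give fractional MIDI values."""
    return np.asarray(bins) / 5 + 24

def one_hot(bins, n_bins=360):
    """One-hot encode bin indices into an array of shape (len(bins), n_bins)."""
    bins = np.atleast_1d(bins)
    encoded = np.zeros((len(bins), n_bins), dtype=np.float32)
    encoded[np.arange(len(bins)), bins] = 1.0
    return encoded

labels = one_hot(midi_to_bin([60, 89]))
print(midi_to_bin([60, 89]), labels.shape, bin_to_midi(138))
```

Because the training labels only ever land on multiples of 5, a prediction such as 51.6 means the model picked a bin between semitones.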

Data is being trimmed by dBFS levels, which I've set to -30 for now. Anything -40 and below trims almost nothing, and -30 trims a good amount (I think it does need more tuning). **No, trimming needs to happen before normalization. That's the issue.**
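The reason the order matters: after per-frame normalization every frame has zero mean and unit standard deviation, so its average energy is about 1 (0 dB) no matter how quiet it was, and no threshold can separate silence from signal. A sketch of threshold-based trimming on un-normalized frames, with `trim_bounds` as a hypothetical helper:

```python
import numpy as np

def db_to_pow(db):
    """Convert from dB to power."""
    return 10 ** (db / 10)

def trim_bounds(frames, threshold_db=-30):
    """Return (start, end) frame indices, or None if every frame is below threshold."""
    energy = np.mean(np.asarray(frames, dtype=np.float64) ** 2, axis=1)
    loud = np.flatnonzero(energy > db_to_pow(threshold_db))
    if loud.size == 0:
        return None
    return int(loud[0]), int(loud[-1]) + 1

quiet = np.full((1, 1024), 0.001)  # average energy 1e-6, i.e. -60 dB
loud = np.full((2, 1024), 0.5)     # average energy 0.25, about -6 dB
frames = np.concatenate([quiet, loud, quiet])
print(trim_bounds(frames))

# A very faint sine still has unit average energy once it is normalized
t = np.arange(1024)
faint = 0.001 * np.sin(2 * np.pi * t / 64)
normalized = (faint - faint.mean()) / faint.std()
normalized_energy = float(np.mean(normalized ** 2))
print(normalized_energy)
```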

The one-hotted labels have Gaussian blurring applied to them too (as specified in the other notebook).
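For a single label, the blur replaces the one-hot vector with a Gaussian centred on the true bin, using a 25-cent standard deviation over bins spaced 20 cents apart. A sketch of that formula, with `blur_label` as a hypothetical helper name:

```python
import numpy as np

def blur_label(true_bin, n_bins=360, bin_width_cents=20, std_cents=25):
    """Gaussian-blurred target centred on true_bin, with a peak value of 1."""
    bins = np.arange(n_bins)
    distance_cents = bin_width_cents * (bins - true_bin)
    return np.exp(-(distance_cents ** 2) / (2 * std_cents ** 2))

target = blur_label(220)  # bin 220 corresponds to MIDI pitch 68
print(target[218:223].round(3))
```

The target stays at exactly 1 on the true bin and falls off quickly, so only the few neighbouring bins carry meaningful weight.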

In [6]:
# todo - fix this
# The latest changes have broken the dimensions of the output frames (1024,)
# The labels appear accurate, but check it

# todo - Can we re-code this using Numba to speed it up?

from os import walk

def load_audio_batch(dir, model_srate=16000, threshold_db=-60, step_size=10, blur=True):
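    '''Load a directory of audio files as normalized frames with labels

    Parameters
    ----------
    dir : str
        Path of the directory containing the audio files to load
    model_srate : int
        Sample rate the audio is assumed to have, in Hz
    threshold_db : int
        Threshold in dB for cutting leading and trailing silence of audio files
    step_size : float
        Step size between audio frames in ms
    blur : bool
        Whether to apply Gaussian blur to the one-hot labels

    Returns
    -------
    audio_frames : np.ndarray[shape=(n_frames, 1024)]
    audio_labels : np.ndarray[shape=(n_frames, 360)]
    '''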
    
    from numpy.lib.stride_tricks import as_strided
    
    # Initialize arrays
    audio_frames = np.empty((0,1024), dtype=np.float32)
    audio_labels = []

    hop_length = int(model_srate * step_size / 1000)
    threshold_pow = db_to_pow(threshold_db)

    #test
    count = 0
    # Iterate over files in directory
    for (dirpath, dirnames, filenames) in os.walk(dir):
        for filename in filenames:
            # load audio
            audio, _ = soundfile.read(dirpath + "/" + filename)
            
            
            # Split into audio frames
            n_frames = 1 + int((len(audio) - 1024) / hop_length)
            framed_audio = as_strided(audio, shape=(1024, n_frames),
                                strides=(audio.itemsize, hop_length * audio.itemsize))
            framed_audio = framed_audio.transpose().copy()
            
            
            # Trim leading and trailing silence, frame by frame
            start_frame, end_frame = None, None
            for f in range(len(framed_audio)):
                if frame_energy(framed_audio[f]) > threshold_pow:
                    start_frame = f
                    break

            for f in range(len(framed_audio) - 1, -1, -1):
                if frame_energy(framed_audio[f]) > threshold_pow:
                    end_frame = f + 1
                    break

            # Skip files in which no frame exceeds the threshold
            if start_frame is None:
                continue

            trimmed_audio = framed_audio[start_frame:end_frame]

            
            # Normalize the audio data by frame
            trimmed_audio -= np.mean(trimmed_audio, axis=1)[:, np.newaxis]
            trimmed_audio /= np.std(trimmed_audio, axis=1)[:, np.newaxis]
            
            # Append normalized audio to audio_frames array
            audio_frames = np.append(audio_frames, trimmed_audio, axis=0)
            
            
            
            # Append values to the labels array
            audio_labels += [int(filename.split("_")[0]) for _ in range(end_frame - start_frame)]
            
            
    
    # Convert audio_labels to numpy array
    audio_labels = np.array(tf.one_hot(5 * (np.array(audio_labels) - 24), 360))
    
    if blur:
        # Apply Gaussian blur to labels
        cents_i = np.arange(360)
        for i in range(len(audio_labels)):
            cents_true = np.where(audio_labels[i] == 1)[0][0]
            audio_labels[i] = np.exp(-((20 *(cents_i - cents_true)) ** 2) / (2 * (25 ** 2)))

        
    return audio_frames, audio_labels

In [33]:
metrics = "Accuracy"
model = make_model("full", metrics=metrics, weights='/content/drive/MyDrive/Research Projects/Steelpan pitch detection/Jason\'s Work/model-full.h5')
model.summary()

Model: "model_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, 1024)]            0         
_________________________________________________________________
input-reshape (Reshape)      (None, 1024, 1, 1)        0         
_________________________________________________________________
conv1 (Conv2D)               (None, 256, 1, 1024)      525312    
_________________________________________________________________
conv1-BN (BatchNormalization (None, 256, 1, 1024)      4096      
_________________________________________________________________
conv1-maxpool (MaxPooling2D) (None, 128, 1, 1024)      0         
_________________________________________________________________
conv1-dropout (Dropout)      (None, 128, 1, 1024)      0         
_________________________________________________________________
conv2 (Conv2D)               (None, 128, 1, 128)       8388

## Loading/Formatting data

In [8]:
x_train, y_train = load_audio_batch('/content/drive/MyDrive/Research Projects/Steelpan pitch detection/Jason\'s Work/tiny_16kHz/train', blur=True)
x_val, y_val = load_audio_batch('/content/drive/MyDrive/Research Projects/Steelpan pitch detection/Jason\'s Work/tiny_16kHz/validation', blur=False)

## Training

Any batch size larger than 256 is slower and also has trouble fitting into VRAM. 20 epochs was an arbitrary choice; that run took roughly an hour and a half on a GTX 1650 Super.

This also saves the model's weights after training.

In [31]:
# Accuracy should be maximized, so set mode='max' explicitly rather than relying on 'auto'
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_Accuracy', patience=32, mode='max')
# todo - check to make sure this is correct - I think the first time I tried this, the checkpoint callback failed because I didn't capitalize the A
# todo - Set save_weights_only to False? I want to save the whole model.
checkpoints = tf.keras.callbacks.ModelCheckpoint(filepath='/retrain_weights_colab_batch_size:256_epoch:{epoch:02d}_val_Accuracy:{val_Accuracy:.2f}.hdf5',
                                                 monitor='val_Accuracy',
                                                 mode='max',
                                                 save_best_only=True)

callbacks = [early_stopping, checkpoints]

In [34]:
# todo - Increasing the batch size slows convergence. The original model was trained on 32 sample batches. Is there difference in performance depending on the batch size?
# todo - enable callbacks... history = model.fit(..., callbacks=callbacks, ...)
# Batch size will affect batch normalization, backpropagation, etc...
# Anecdotally, it seemed that I got better validation accuracy scores using batch_size=128
history = model.fit(x=x_train, y=y_train, batch_size=batch_size, epochs=10,
                    validation_data=(x_val, y_val),
                    callbacks=callbacks)

# This shouldn't be necessary anymore since checkpoints is set up
#model.save('test-crepe-full_colab_batch:256_epoch:10.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Results

This model has pretty normal graphs except for accuracy, which confuses me. Please take a look and see if you can find anything wrong with the data formatting or training: the accuracy *does* improve over epochs, but it starts out near 0. Precision and recall are also suspiciously high, at 0.99+ each.

It's possible this is simply because I used the wrong metrics, as the model seems to be pretty accurate for individual files from the training/validation set (see below).
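One thing worth checking (an assumption on my part, not verified against this run) is whether the metric is comparing the 360 sigmoid outputs element-wise instead of comparing predicted pitch bins; in Keras the capitalized string "Accuracy" may select the exact-match `Accuracy` metric. A minimal NumPy sketch of a bin-level accuracy is below; `bin_accuracy` and `tolerance_bins` are hypothetical names, not part of CREPE or Keras.

```python
import numpy as np

def bin_accuracy(y_true, y_pred, tolerance_bins=0):
    """Fraction of frames whose predicted peak bin is within tolerance of the true bin."""
    true_bins = np.argmax(y_true, axis=1)
    pred_bins = np.argmax(y_pred, axis=1)
    return float(np.mean(np.abs(true_bins - pred_bins) <= tolerance_bins))

# Three frames with true bins 10, 20, 30 and predicted peaks at bins 10, 21, 35
y_true = np.zeros((3, 360))
y_true[[0, 1, 2], [10, 20, 30]] = 1.0
y_pred = np.full((3, 360), 0.01)
y_pred[[0, 1, 2], [10, 21, 35]] = 0.9

exact = bin_accuracy(y_true, y_pred)
within_one_bin = bin_accuracy(y_true, y_pred, tolerance_bins=1)
print(exact, within_one_bin)
```

Since it only looks at the argmax, this works the same way on blurred and un-blurred labels, and it could be run on model.predict(x_val) after training.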

In [None]:
from matplotlib import pyplot as plt
def plot(data, labels, x, y):
    '''Plot statistics. Takes a list of lists and list of labels.'''
    
    print(y + " vs. " + x)
    for i in range(len(data)):
        plt.plot(data[i], label=labels[i])
    plt.xlabel(x)
    plt.ylabel(y)
    plt.legend()
    plt.show()
    print()

In [None]:
plot((history.history["loss"], ), ("Loss", ), "Epoch", "Loss")
metric_keys = [k for k in ("Accuracy", "val_Accuracy", "precision", "recall") if k in history.history]  # plot only the metrics that were logged
plot([history.history[k] for k in metric_keys], metric_keys, "Epoch", "Metric value")

## Testing for individual sound files

The model seems to be pretty good for the first couple of seconds, but then becomes very inaccurate as the pitch fades.
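One way to handle the fading tail is to look at the peak activation of each frame, which CREPE itself reports as a confidence value, and to ignore frames where it is low. The sketch below assumes the model output is an array of shape (n_frames, 360). The function name and the 0.5 threshold are hypothetical choices that would need tuning.

```python
import numpy as np

def midi_with_confidence(activation, min_confidence=0.5):
    """Return per-frame MIDI estimates, with low-confidence frames set to NaN."""
    activation = np.asarray(activation, dtype=np.float64)
    confidence = activation.max(axis=1)         # peak activation per frame
    midi = activation.argmax(axis=1) / 5 + 24   # same bin-to-MIDI mapping as as_midi
    midi[confidence < min_confidence] = np.nan
    return midi, confidence

# Two synthetic frames: a clear peak at bin 220 and a weak peak at bin 145
activation = np.full((2, 360), 0.01)
activation[0, 220] = 0.9
activation[1, 145] = 0.1
midi, confidence = midi_with_confidence(activation)
print(midi, confidence)
```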

In [35]:
import soundfile
def load_audio(file):
    wav, sr = soundfile.read(file)
    return wav, sr

In [36]:
from numpy.lib.stride_tricks import as_strided
def predict(audio, sr, step_size=10):
    if len(audio.shape) == 2:
        audio = audio.mean(1)  # make mono
    audio = audio.astype(np.float32)
    if sr != model_srate:
        # resample audio if necessary
        from resampy import resample
        audio = resample(audio, sr, model_srate)

    # make 1024-sample frames of the audio with hop length of 10 milliseconds
    hop_length = int(model_srate * step_size / 1000)
    n_frames = 1 + int((len(audio) - 1024) / hop_length)
    frames = as_strided(audio, shape=(1024, n_frames),
                        strides=(audio.itemsize, hop_length * audio.itemsize))
    frames = frames.transpose().copy()

    # normalize each frame -- this is expected by the model
    frames -= np.mean(frames, axis=1)[:, np.newaxis]
    frames /= np.std(frames, axis=1)[:, np.newaxis]

    # run prediction; returns the raw 360-bin activations (use as_midi to convert)
    return model(frames)

In [37]:
def as_midi(pred):
    # Convert from output buckets back to MIDI
    midi = (pred.argmax(axis=1) / 5) + 24
    return midi

## Steelpan-trained CREPE vs Original CREPE

In [39]:
import numpy as np

model = make_model("full", [], '/retrain_weights_colab_batch_size:256_epoch:10_val_Accuracy:0.97.hdf5')
wav, sr = load_audio('/content/drive/MyDrive/Research Projects/Steelpan pitch detection/Jason\'s Work/tiny_16kHz/validation/68_train_sample_135.wav')
pred = as_midi(predict(wav, sr).numpy())
print(pred)

[68.  68.  68.  68.  80.  80.  80.  80.  80.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.
 68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  68.  53.  53.  53.
 53.  51.6 68.  68. 

In [None]:
import IPython.display as ipd