
## Model Inference & Evaluation



## Predict using the best model and compare the predicted output with the true MIDI file
The model returns its predictions as a NumPy array, so a few extra steps are needed to convert that array into a MIDI file.
I will also try to synthesize audio from the predicted MIDI.

### A few tools need to be installed before we can start testing

| Tool | Purpose | Needed for |
|---|---|---|
| pretty_midi | Work with MIDI files | Piano roll ↔ MIDI |
| pydub | Convert between audio formats | WAV → MP3 conversion |
| soundfile | Save audio files as WAV | Writing .wav audio |
| pyfluidsynth | Python bindings for MIDI synthesis | Generate audio from MIDI |
| fluidsynth | Synth engine required by pyfluidsynth | Actual audio generation |
| ffmpeg | Audio codec for MP3 support | pydub MP3 conversion |
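
Before running the rest of the notebook, it can help to confirm that the tools in the table are actually visible to the current environment. The sketch below is one possible check, not part of the original pipeline; the helper name `check_tools` is hypothetical, and it relies only on the standard library (`importlib.util.find_spec` for Python modules, `shutil.which` for system binaries).

```python
import importlib.util
import shutil

# Import names of the Python packages. Note that the pip package
# "pyfluidsynth" installs a module that is imported as "fluidsynth".
PYTHON_MODULES = ["pretty_midi", "pydub", "soundfile", "fluidsynth"]

# Command-line programs that must be on the PATH.
SYSTEM_BINARIES = ["fluidsynth", "ffmpeg"]


def check_tools():
    """Return a dict mapping each tool to True (found) or False (missing)."""
    status = {}
    for module in PYTHON_MODULES:
        status[f"python:{module}"] = importlib.util.find_spec(module) is not None
    for binary in SYSTEM_BINARIES:
        status[f"system:{binary}"] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, found in check_tools().items():
        print(f"{name:24s} {'ok' if found else 'MISSING'}")
```

Any entry reported as missing points at the install cell that still needs to be run.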




### pretty_midi library
pretty_midi is used for working with MIDI files in Python.
In my code:

Load and parse .midi files.

Convert predicted piano rolls back to MIDI (piano_roll_to_midi).

Synthesize audio from MIDI if FluidSynth is available (midi_obj.fluidsynth()).
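
Converting a piano roll back to MIDI comes down to finding, for each key, the runs of consecutive active frames and turning each run into one note. The following is a minimal NumPy-only sketch of that idea; `note_runs` is a hypothetical helper for illustration, and the frame rate `fs = 100` is an assumed value that should match whatever rate the piano roll was built with.

```python
import numpy as np


def note_runs(row):
    """Return (onset, offset) frame pairs for each run of 1s in a binary row.

    The offset is the index of the first inactive frame after the run.
    """
    # Pad with a zero on each side so runs touching either edge are closed.
    padded = np.concatenate(([0], np.asarray(row, dtype=int), [0]))
    changes = np.diff(padded)
    onsets = np.flatnonzero(changes == 1)
    offsets = np.flatnonzero(changes == -1)
    return list(zip(onsets.tolist(), offsets.tolist()))


row = [0, 1, 1, 0, 0, 1, 1, 1]   # one key over 8 frames
runs = note_runs(row)
print(runs)                       # [(1, 3), (5, 8)]

fs = 100                          # assumed frames per second
times = [(on / fs, off / fs) for on, off in runs]
print(times)                      # start/end times in seconds
```

Padding on both sides means a note that is still sounding in the last frame gets an offset too, so onsets and offsets always pair up one-to-one.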

### pydub library
Used for: Audio format conversions (e.g., WAV ↔ MP3).

In my code:

Convert .wav to .mp3:
sound = AudioSegment.from_wav(wave_file)
sound.export(mp3_output, format="mp3")

### soundfile library
Used for: Saving audio as .wav.

In my code:

Save synthesized audio:
sf.write(wave_file, audio, samplerate=22050)

### pyfluidsynth
Used for: Synthesizing MIDI into audio using the FluidSynth engine.
In my code:

Required for midi_obj.fluidsynth() to convert MIDI notes into waveform audio.

In [1]:
!pip install pretty_midi pydub soundfile
!apt-get install -y fluidsynth

Collecting pretty_midi
  Downloading pretty_midi-0.2.10.tar.gz (5.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 51.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting mido>=1.1.16 (from pretty_midi)
  Downloading mido-1.3.3-py3-none-any.whl.metadata (6.4 kB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 

In [2]:
# Install FluidSynth system package and Python wrapper
!apt-get install -y fluidsynth
!pip install pyfluidsynth

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
fluidsynth is already the newest version (2.2.5-1).
The following packages were automatically installed and are no longer required:
  r-cran-colorspace r-cran-munsell
Use 'apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.
Collecting pyfluidsynth
  Downloading pyfluidsynth-1.3.4-py3-none-any.whl.metadata (7.5 kB)
Downloading pyfluidsynth-1.3.4-py3-none-any.whl (22 kB)
Installing collected packages: pyfluidsynth
Successfully installed pyfluidsynth-1.3.4


### ffmpeg
Used for: Decoding and encoding various audio formats (including MP3).

Required by: pydub to read/write formats like MP3.

In my code:

Enables AudioSegment.from_wav(...) and export(..., format="mp3").

In [3]:
!pip install pydub
!apt install ffmpeg -y

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
The following packages were automatically installed and are no longer required:
  r-cran-colorspace r-cran-munsell
Use 'apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.


## Get the model from GCS bucket

In [5]:
import os
from google.cloud import storage

# GCS source and local destination
bucket_name = "models_created"
prefix = "models/model_best"
local_root = "/tmp"

# Download every object under the prefix, preserving the folder layout
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)

for blob in blobs:
    if blob.name.endswith("/"):
        continue  # skip zero-byte "folder" placeholder objects
    # "models/model_best/..." -> "/tmp/model_best/..."
    dest_path = os.path.join(local_root, os.path.relpath(blob.name, "models"))
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    blob.download_to_filename(dest_path)

print("Model downloaded to /tmp/model_best")


Model downloaded to /tmp/model_best
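
To make the path handling in the download loop concrete, here is a small sketch of how an object name in the bucket maps to a local file. The object name used is a hypothetical example of a SavedModel file, and `local_destination` is an illustrative helper, not a function from the notebook. It uses `posixpath` so the result is the same on every platform.

```python
import posixpath


def local_destination(blob_name, strip_root="models", dest_root="/tmp"):
    """Map a GCS object name to a local path by dropping the leading folder."""
    relative = posixpath.relpath(blob_name, strip_root)
    return posixpath.join(dest_root, relative)


# Hypothetical object name under the "models/model_best" prefix.
example = local_destination("models/model_best/variables/variables.index")
print(example)  # /tmp/model_best/variables/variables.index
```

Stripping only the top-level `models` folder keeps the `model_best` directory and everything below it intact, which is what a SavedModel loader expects.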


This code loads a pre-trained TensorFlow SavedModel for inference using Keras 3's TFSMLayer and wraps it in a Keras Sequential model, so it can be called like a regular Keras model for predictions.
The layer carries no training configuration or optimizer state; it is inference-only.

TFSMLayer: A special Keras layer that wraps a TensorFlow SavedModel, making it usable within a Keras model.

Sequential: Keras' basic model container that stacks layers sequentially.

In [6]:
from keras.layers import TFSMLayer
from keras import Sequential

# Load the SavedModel using TFSMLayer.
# call_endpoint="serving_default" tells Keras to use the model's default
# inference function (usually the one created when the model was exported).
layer = TFSMLayer("/tmp/model_best", call_endpoint="serving_default")

# Wrap it in a Sequential model if needed
model_best = Sequential([layer])

## Get the mp3 file from GCS

In [9]:
from google.cloud import storage

bucket_name = "audio_and_midi_files"
blob_path = "2018/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.mp3"
local_path = "/tmp/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.mp3"

# Initialize client
client = storage.Client()
bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_path)
blob.download_to_filename(local_path)

print(f"MP3 file downloaded to: {local_path}")


MP3 file downloaded to: /tmp/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.mp3


## Get the corresponding MIDI file from GCS

In [12]:
from google.cloud import storage

bucket_name = "audio_and_midi_files"
blob_path = "2018/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.midi"
local_path = "/tmp/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.midi"

# Initialize client
client = storage.Client()
bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_path)
blob.download_to_filename(local_path)

print(f"MIDI file downloaded to: {local_path}")

MIDI file downloaded to: /tmp/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.midi


## Use the model to predict
### === Step-by-step Explanation ===

1. Load the MP3 audio file using Librosa at a sample rate of 22050 Hz.
    This gives you the raw waveform (y) and the sample rate (sr).

2. Convert the waveform into a Mel spectrogram with 128 frequency bins.
    These bins roughly correspond to the human auditory range and capture frequency content over time.

3. Convert the Mel spectrogram to a log scale in decibels (log-mel), which better aligns with how humans perceive loudness.

4. Slice the first 100 time frames from the log-mel spectrogram (approx. 2.3 seconds of audio at Librosa's default hop length of 512 samples).
    Reshape it to match the CNN input format: (1, 128, 100, 1).

5. Load the trained model (as an inference-only TFSMLayer if saved in SavedModel format).
    Wrap it in a Sequential container for Keras compatibility.

6. Use the model to predict the piano roll: a matrix of per-key activation scores (88 notes × 100 time frames), thresholded at 0.5 into a binary matrix
    indicating which piano keys are likely active at each time step.

7. Convert the predicted piano roll to a MIDI object using PrettyMIDI.
    Each active note is translated into a `pretty_midi.Note` with start and end times.

8. Save the MIDI file to `/tmp`.

9. Synthesize the MIDI into a WAV file using FluidSynth.
   This generates audio that can be listened to.

10. Convert the WAV file to MP3 using pydub’s AudioSegment module.

Result: I've built an end-to-end pipeline that converts audio → predicted MIDI → synthesized MP3 output.
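
The "approx. 2.3 seconds" figure in step 4 follows from the frame spacing of the spectrogram. The short calculation below reproduces it, assuming Librosa's default hop length of 512 samples (the notebook does not set `hop_length` explicitly, so this value is an assumption). It also prints the resulting frame rate, which is worth comparing with the `fs` value used when the piano roll is converted to MIDI.

```python
SAMPLE_RATE = 22050   # Hz, as passed to librosa.load
HOP_LENGTH = 512      # samples between frames; assumed librosa default
N_FRAMES = 100        # width of the slice fed to the model

# How many spectrogram frames cover one second of audio.
frames_per_second = SAMPLE_RATE / HOP_LENGTH

# How much audio the 100-frame slice spans.
window_seconds = N_FRAMES * HOP_LENGTH / SAMPLE_RATE

# Time span of the same 100 columns if they are read at 100 frames per second.
roll_seconds_at_fs_100 = N_FRAMES / 100

print(f"Spectrogram frame rate : {frames_per_second:.2f} frames/s")
print(f"100 spectrogram frames : {window_seconds:.2f} s")
print(f"100 columns at fs=100  : {roll_seconds_at_fs_100:.2f} s")
```

If the training labels were built on the spectrogram's own frame grid, the 100 predicted columns would cover about 2.3 s rather than 1 s. The notebook does not show how the labels were made, so this is something to verify rather than a confirmed mismatch.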


In [11]:
import librosa
import numpy as np
import pretty_midi
from pydub import AudioSegment
import soundfile as sf
import os
from keras import Sequential
from keras.layers import TFSMLayer

# --- Step 1: Load audio and get predicted piano roll ---
mp3_path = "/tmp/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.mp3"
y, sr = librosa.load(mp3_path, sr=22050)
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel_spec, ref=np.max)

# Take first 100 frames (or process in chunks)
input_slice = log_mel[:, :100].reshape(1, 128, 100, 1)

# The model was loaded from /tmp as an inference-only TFSMLayer, which behaves differently from a full Keras model.
# Wrap the inference layer if it is not already wrapped
if not isinstance(model_best, Sequential):
    model_best = Sequential([model_best])

# --- Predict ---
raw_pred = model_best.predict(input_slice)

# Extract actual output tensor
if isinstance(raw_pred, dict):
    # Automatically get the first value in the dict
    raw_pred = list(raw_pred.values())[0]

# If it's a TensorFlow tensor, convert to NumPy
if hasattr(raw_pred, "numpy"):
    raw_pred = raw_pred.numpy()

# If it's still batched (shape: [1, 88, 100]), remove batch dimension
if raw_pred.ndim == 3 and raw_pred.shape[0] == 1:
    raw_pred = raw_pred[0]

# Now it's safe to threshold
pred_binary = (raw_pred > 0.5).astype(np.int32)  # raw_pred shape should be (88, 100)

print("Predicted piano roll shape:", pred_binary.shape)

# --- Step 2: Convert piano roll to MIDI ---
def piano_roll_to_midi(piano_roll, fs=100):
    # fs is the piano-roll frame rate in frames per second; it should match
    # the frame rate the model's training labels were built with.
    pm = pretty_midi.PrettyMIDI()
    instrument = pretty_midi.Instrument(program=0)  # Acoustic Grand Piano

    for note_num in range(88):  # rows map to MIDI note numbers 21–108
        midi_note = note_num + 21
        # Pad with a zero on both sides so a note that is still active in the
        # last frame also gets an offset; onsets and offsets then pair up 1:1.
        padded = np.concatenate(([0], piano_roll[note_num, :], [0]))
        changes = np.diff(padded)
        onsets = np.where(changes == 1)[0]
        offsets = np.where(changes == -1)[0]

        for onset, offset in zip(onsets, offsets):
            start = onset / fs
            end = offset / fs
            note = pretty_midi.Note(
                velocity=100, pitch=midi_note, start=start, end=end
            )
            instrument.notes.append(note)

    pm.instruments.append(instrument)
    return pm

midi_obj = piano_roll_to_midi(pred_binary)
midi_file = "/tmp/predicted_MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.midi"
midi_obj.write(midi_file)

# --- Step 3: Convert MIDI to audio (WAV or MP3) ---
# Synthesize with FluidSynth (requires the fluidsynth package installed above).
# Keep the returned waveform; the sample rate must match the one passed to sf.write.
wave_file = "/tmp/predicted_output_MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.wav"
audio = midi_obj.fluidsynth(fs=22050)
sf.write(wave_file, audio, samplerate=22050)

# Optionally convert to MP3
sound = AudioSegment.from_wav(wave_file)
mp3_output = "/tmp/predicted_output_MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.mp3"
sound.export(mp3_output, format="mp3")

print("MIDI and audio generated successfully.")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step
Predicted piano roll shape: (88, 100)
MIDI and audio generated successfully.
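
The cell above transcribes only the first 100 frames. To cover a whole recording, the log-mel spectrogram could be cut into consecutive 100-frame windows and passed to the model as one batch. The sketch below shows only the windowing step with NumPy; `split_into_windows` is a hypothetical helper, the random array stands in for a real log-mel spectrogram, and padding the last window with the quietest value in the spectrogram is an assumption rather than something the trained model is known to expect.

```python
import numpy as np


def split_into_windows(log_mel, width=100, pad_value=None):
    """Cut a (n_mels, n_frames) array into a (n_windows, n_mels, width, 1) batch."""
    n_mels, n_frames = log_mel.shape
    if pad_value is None:
        pad_value = log_mel.min()  # treat padding as the quietest level seen

    n_windows = -(-n_frames // width)  # ceiling division
    padded = np.full((n_mels, n_windows * width), pad_value, dtype=log_mel.dtype)
    padded[:, :n_frames] = log_mel

    # (n_mels, n_windows * width) -> (n_mels, n_windows, width) -> (n_windows, n_mels, width)
    windows = padded.reshape(n_mels, n_windows, width).transpose(1, 0, 2)
    return windows[..., np.newaxis]  # add the trailing channel axis


# Stand-in for a real log-mel spectrogram: 128 mel bins, 250 frames.
demo = np.random.default_rng(0).normal(size=(128, 250)).astype(np.float32)
batch = split_into_windows(demo)
print(batch.shape)  # (3, 128, 100, 1)
```

Each row of `batch` has the same shape as `input_slice`, so the per-window predictions could be thresholded and concatenated along the time axis to form one long piano roll.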


## Evaluation of Predicted Audio with True Audio
1. The predicted audio file is short (about 1/100th the size of the true audio), because only the first 100 frames of the recording were transcribed.
2. The predicted audio is of low quality.

Hence the comparison of predicted and true audio did not turn out well. This may be because:
1. the notes are not predicted accurately
2. the synthesized audio is of low quality

## Evaluation of MIDI Files
The script below compares the predicted MIDI file (generated by the model) with the true/original MIDI file to evaluate how accurate the predicted piano roll is in terms of active notes.

I used precision, recall, and F1 score for the evaluation.
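
For reference, the three metrics are simple ratios of per-cell counts, where a cell is one (key, frame) entry of the piano roll. The sketch below computes them by hand on a toy example; `frame_metrics` is a hypothetical helper written for illustration, and returning `None` for an undefined ratio is a design choice of this sketch, not the behaviour of scikit-learn.

```python
def frame_metrics(true_flat, pred_flat):
    """Return (precision, recall, f1) for two equal-length 0/1 sequences.

    A metric is None when its denominator is zero, i.e. it is undefined.
    """
    pairs = list(zip(true_flat, pred_flat))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # active in both
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted only
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed

    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else None
    return precision, recall, f1


true_cells = [1, 1, 0, 0, 1]
pred_cells = [1, 0, 1, 0, 1]
precision, recall, f1 = frame_metrics(true_cells, pred_cells)
print(precision, recall, f1)  # each is 2/3 for this toy example
```

Precision answers "how many predicted cells were right", recall answers "how many true cells were found", and F1 is their harmonic mean.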

In [14]:
import pretty_midi
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def midi_to_binary_roll(midi_path, fs=100):
    midi = pretty_midi.PrettyMIDI(midi_path)
    roll = midi.get_piano_roll(fs=fs)[21:109]  # (88, T)
    return (roll > 0).astype(int)

# Load both MIDI files
pred_roll = midi_to_binary_roll("/tmp/predicted_MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.midi")
true_roll = midi_to_binary_roll("/tmp/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.midi")
true_roll = true_roll[:, :100]

print("Predicted non-zero notes:", np.count_nonzero(pred_roll))
print("True non-zero notes     :", np.count_nonzero(true_roll))

# Align lengths
T = min(pred_roll.shape[1], true_roll.shape[1])
pred_flat = pred_roll[:, :T].flatten()
true_flat = true_roll[:, :T].flatten()

# Evaluate
f1 = f1_score(true_flat, pred_flat)
precision = precision_score(true_flat, pred_flat)
recall = recall_score(true_flat, pred_flat)

print(f"F1 Score   : {f1:.4f}")
print(f"Precision  : {precision:.4f}")
print(f"Recall     : {recall:.4f}")


Predicted non-zero notes: 222
True non-zero notes     : 0
F1 Score   : 0.0000
Precision  : 0.0000
Recall     : 0.0000


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
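
The output shows zero active cells in the first 100 columns of the reference roll. With no true positives available, recall has a zero denominator, and scikit-learn reports it as 0 and emits a warning, which is consistent with the trailing warning fragment above. The scores therefore say little about the model itself. One way to investigate, sketched below with synthetic data, is to find where the reference roll first becomes active; `first_active_frame` is a hypothetical helper and the toy roll is made up for the example.

```python
import numpy as np


def first_active_frame(roll):
    """Return the index of the first column with any active key, or None."""
    active_columns = np.flatnonzero(np.asarray(roll).any(axis=0))
    return int(active_columns[0]) if active_columns.size else None


# Toy reference roll: 88 keys, 300 frames, one note starting at frame 130.
roll = np.zeros((88, 300), dtype=int)
roll[39, 130:180] = 1  # key index 39 corresponds to MIDI note 60 (39 + 21)

start = first_active_frame(roll)
print(start)                               # 130
print(first_active_frame(roll[:, :100]))   # None: the first 100 frames are silent

fs = 100  # frames per second of the roll
if start is not None:
    print(f"First note starts at {start / fs:.2f} s")
```

Running the same check on `true_roll` before it is truncated would show whether the recording simply starts with silence, in which case a later or longer evaluation window would give more meaningful scores.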
