# Demo: Audio-Score synchronization with high-resolution features and MrMsDTW

In this notebook, we'll show a full music synchronization pipeline using the SyncToolbox, including feature extraction and high-resolution synchronization.

We will take a recording of a musical piece and its .csv pitch annotation, created from a MIDI file, compute their feature representations, align them using multi-resolution multi-scale DTW (MrMsDTW), and show how to sonify the alignment and use it for automated transfer of annotations.

The pipeline in this notebook exactly reproduces the techniques described in [1], which in turn builds on [2]. On the finest synchronization level, we use the high-resolution features described in [3].

In [None]:
# Loading some modules and defining some constants used later
import IPython.display as ipd
from libfmp.b import list_to_pitch_activations, plot_chromagram, plot_signal, plot_matrix, \
                     sonify_pitch_activations_with_signal
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.interpolate

from synctoolbox.dtw.mrmsdtw import sync_via_mrmsdtw
from synctoolbox.dtw.utils import compute_optimal_chroma_shift, shift_chroma_vectors, make_path_strictly_monotonic
from synctoolbox.feature.csv_tools import read_csv_to_df, df_to_pitch_features, df_to_pitch_onset_features
from synctoolbox.feature.chroma import pitch_to_chroma, quantize_chroma, quantized_chroma_to_CENS
from synctoolbox.feature.dlnco import pitch_onset_features_to_DLNCO
from synctoolbox.feature.pitch import audio_to_pitch_features
from synctoolbox.feature.pitch_onset import audio_to_pitch_onset_features
from synctoolbox.feature.utils import estimate_tuning
%matplotlib inline

Fs = 22050
feature_rate = 50
step_weights = np.array([1.5, 1.5, 2.0])
threshold_rec = 10 ** 6

## Loading the recording

Here, we take an interpretation of the first 8 measures of the Etude Op.10 No.3 in E major by Frederic Chopin, played by Valentina Igoshina.

In [None]:
audio, _ = librosa.load('data_music/Chopin_Op010-03-Measures1-8_Igoshina.wav', sr=Fs)

plot_signal(x=audio, Fs=Fs, figsize=(9,3))
plt.title('Etude Op.10 No.3 in E Major by Frederic Chopin\n Performer: Valentina Igoshina')
ipd.display(ipd.Audio(audio, rate=Fs))

### Loading the .csv annotation file, created from a MIDI file.

In [None]:
df_annotation = read_csv_to_df('data_csv/Chopin_Op010-03-Measures1-8_MIDI.csv', csv_delimiter=';')
html = df_annotation.to_html(index=False)
# Render the annotation table in the notebook
ipd.display(ipd.HTML(html))

## Estimating tuning

We use a simple comb-based algorithm to detect the tuning deviation in the audio recording. This will be used to adjust the filterbanks for feature computation. If we do not adjust for tuning, our chroma representation may look "smeared", leading to bad synchronization results. We refer to <a href="https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S1_TranspositionTuning.html">the FMP notebook on Transposition and Tuning</a> for more information on tuning issues and the algorithm used for tuning estimation.

In [None]:
import libfmp.c2
# Alternative: librosa.estimate_tuning
tuning_offset = estimate_tuning(audio, Fs)
print('Estimated tuning deviation for recording: %d cents' % (tuning_offset))
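
To give a feel for what a deviation in cents means, the following minimal sketch converts an offset into the corresponding frequency of the concert pitch A4. It relies only on the standard definition of a cent (1/1200 of an octave). The helper name `cents_to_reference_frequency` and the example offsets are made up for this illustration and are not part of the SyncToolbox.

```python
def cents_to_reference_frequency(offset_cents, concert_pitch_hz=440.0):
    """Frequency of A4 after detuning by `offset_cents` (100 cents = 1 semitone)."""
    return concert_pitch_hz * 2.0 ** (offset_cents / 1200.0)


# Hypothetical offsets, not values measured from the recording
for cents in (-20, 0, 20):
    print(f'{cents:+d} cents -> A4 at {cents_to_reference_frequency(cents):.2f} Hz')
```

A filterbank whose center frequencies assume 440 Hz would straddle neighboring pitches for a detuned recording, which is the "smearing" mentioned above.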

## Computing quantized chroma and DLNCO features

We now compute the feature representations used in the alignment procedure. Note that we include the `tuning_offset` calculated in the previous step. In our pipeline, we use CENS features, which are similar to standard chroma but are first quantized, then smoothed, downsampled, and normalized. The MrMsDTW procedure just requires the quantized chromas, since smoothing, downsampling, and normalization happen internally.
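
The four CENS steps named above can be sketched in a few lines of NumPy. This is a simplified illustration, not the code behind `quantize_chroma` or `quantized_chroma_to_CENS`: the function name `cens_like_sketch`, the quantization thresholds, the window length, and the downsampling factor are all assumptions chosen for the example.

```python
import numpy as np


def cens_like_sketch(f_chroma, win_len=5, downsample=2):
    """Quantize, smooth, downsample, and normalize a (12, N) chroma matrix."""
    f_chroma = np.asarray(f_chroma, dtype=float)

    # 1) Express each frame as relative energy shares (L1 normalization).
    l1 = f_chroma.sum(axis=0, keepdims=True)
    rel = np.divide(f_chroma, l1, out=np.zeros_like(f_chroma), where=l1 > 0)

    # 2) Quantize: count how many (assumed) thresholds each entry exceeds.
    quantized = np.zeros_like(rel)
    for threshold in (0.05, 0.1, 0.2, 0.4):
        quantized += rel > threshold

    # 3) Smooth every chroma band over time with a normalized Hann window.
    window = np.hanning(win_len + 2)[1:-1]
    window /= window.sum()
    smoothed = np.stack([np.convolve(band, window, mode='same') for band in quantized])

    # 4) Downsample along time, then 5) L2-normalize each remaining frame.
    reduced = smoothed[:, ::downsample]
    l2 = np.linalg.norm(reduced, axis=0, keepdims=True)
    return np.divide(reduced, l2, out=np.zeros_like(reduced), where=l2 > 0)


rng = np.random.default_rng(0)
f_demo = rng.random((12, 20))          # synthetic stand-in for real chroma
f_cens_demo = cens_like_sketch(f_demo)
print(f_cens_demo.shape)
```

Because smoothing and downsampling depend on the resolution level, it makes sense to hand only the quantized chroma to the alignment routine and let it derive coarser versions as needed.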

In addition to these chroma-like features, we also use special onset-related features called DLNCO (described in [3]). These are helpful to increase synchronization accuracy, especially for music with clear onsets.

Both features are computed from the audio using a multi-rate IIR filterbank. See [4] for details.

In the next cell, we also display the computation steps leading to both features.

In [None]:
def get_features_from_audio(audio, tuning_offset, Fs, feature_rate, visualize=True):
    f_pitch = audio_to_pitch_features(f_audio=audio, Fs=Fs, tuning_offset=tuning_offset, feature_rate=feature_rate, verbose=visualize)
    f_chroma = pitch_to_chroma(f_pitch=f_pitch)
    f_chroma_quantized = quantize_chroma(f_chroma=f_chroma)
    if visualize:
        plot_chromagram(f_chroma_quantized, title='Quantized chroma features - Audio', Fs=feature_rate, figsize=(9,3))

    f_pitch_onset = audio_to_pitch_onset_features(f_audio=audio, Fs=Fs, tuning_offset=tuning_offset, verbose=visualize)
    f_DLNCO = pitch_onset_features_to_DLNCO(f_peaks=f_pitch_onset, feature_rate=feature_rate, feature_sequence_length=f_chroma_quantized.shape[1], visualize=visualize)
    return f_chroma_quantized, f_DLNCO


f_chroma_quantized_audio, f_DLNCO_audio = get_features_from_audio(audio, tuning_offset, Fs, feature_rate)

In [None]:
def get_features_from_annotation(df_annotation, feature_rate, visualize=True):
    f_pitch = df_to_pitch_features(df_annotation, feature_rate=feature_rate)
    f_chroma = pitch_to_chroma(f_pitch=f_pitch)
    f_chroma_quantized = quantize_chroma(f_chroma=f_chroma)
    if visualize:
        plot_chromagram(f_chroma_quantized, title='Quantized chroma features - Annotation', Fs=feature_rate, figsize=(9, 3))
    f_pitch_onset = df_to_pitch_onset_features(df_annotation)
    f_DLNCO = pitch_onset_features_to_DLNCO(f_peaks=f_pitch_onset,
                                            feature_rate=feature_rate,
                                            feature_sequence_length=f_chroma_quantized.shape[1],
                                            visualize=visualize)
    
    return f_chroma_quantized, f_DLNCO


f_chroma_quantized_annotation, f_DLNCO_annotation = get_features_from_annotation(df_annotation, feature_rate)

## Finding optimal shift of chroma vectors

The interpretation might be played in a different key than the one notated in the score. Such a transposition appears as a cyclic shift between the two chroma representations and, if not accounted for, completely degrades the alignment. The SyncToolbox provides a built-in function for finding this shift between two feature sequences. This is done in the following cell, and the feature sequences are subsequently adjusted to account for the shift. The plots show the chroma sequences after shifting.

Internally, the function just performs DTW using all possible shifts and returns the shift yielding the lowest total cost. To save computation time, we here first downsample the sequences.
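
The idea of trying every cyclic shift and keeping the cheapest one can be sketched as follows. This is a toy version under stated assumptions: it uses plain DTW with unit step weights and cosine distance, the helper names `dtw_total_cost` and `best_cyclic_shift` are hypothetical, and the sign convention of the returned shift may differ from that of `compute_optimal_chroma_shift`.

```python
import numpy as np


def dtw_total_cost(X, Y):
    """Accumulated cost of plain DTW between feature sequences X (d, N) and Y (d, M)."""
    # Cosine distance between all pairs of frames
    Xn = X / np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1e-12)
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=0, keepdims=True), 1e-12)
    C = 1.0 - Xn.T @ Yn

    N, M = C.shape
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n, m] = C[n - 1, m - 1] + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    return D[N, M]


def best_cyclic_shift(f_ref, f_other):
    """Shift s (in chroma bins) for which np.roll(f_other, s, axis=0) fits f_ref best."""
    costs = [dtw_total_cost(f_ref, np.roll(f_other, s, axis=0)) for s in range(12)]
    return int(np.argmin(costs))


rng = np.random.default_rng(1)
f_ref = rng.random((12, 15))
# Synthetic second version: transposed by three bins and played at half tempo
f_other = np.repeat(np.roll(f_ref, -3, axis=0), 2, axis=1)
print(best_cyclic_shift(f_ref, f_other))
```

Running twelve full DTW passes is quadratic in the sequence lengths each time, which is why the sequences are downsampled before the search.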

NOTE: The chroma shift doesn't apply in this running example, since both the score and recording are in E major.

In [None]:
f_cens_1hz_audio = quantized_chroma_to_CENS(f_chroma_quantized_audio, 201, 50, feature_rate)[0]
f_cens_1hz_annotation = quantized_chroma_to_CENS(f_chroma_quantized_annotation, 201, 50, feature_rate)[0]
opt_chroma_shift = compute_optimal_chroma_shift(f_cens_1hz_audio, f_cens_1hz_annotation)
print('Pitch shift between the audio recording and score, determined by DTW:', opt_chroma_shift, 'bins')

f_chroma_quantized_annotation = shift_chroma_vectors(f_chroma_quantized_annotation, opt_chroma_shift)
f_DLNCO_annotation = shift_chroma_vectors(f_DLNCO_annotation, opt_chroma_shift)

_,_,_= plot_chromagram(f_chroma_quantized_audio[:, :30 * feature_rate], Fs=feature_rate, title='Chroma representation for the audio', figsize=(9, 3))
_,_,_= plot_chromagram(f_chroma_quantized_annotation[:, :30 * feature_rate], Fs=feature_rate, title='Chroma representation for the score', figsize=(9, 3))

## Performing MrMsDTW

We now perform alignment using MrMsDTW. The extracted chroma sequences are used on the coarser levels of the procedure, while the DLNCO features are additionally used on the finest level.
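
The coarse-to-fine principle behind multiscale DTW [2] is that a path found on a coarse level restricts where the path may run on the next finer level. The sketch below illustrates only this projection step: it upsamples a coarse path and marks the fine-level cells around it. The function name, the `tolerance` parameter, and the toy path are assumptions made for illustration. The memory-restricted processing of [1] is not shown.

```python
import numpy as np


def constraint_region_from_coarse_path(coarse_path, factor, fine_shape, tolerance=0):
    """Boolean mask of fine-level cells that lie near an upsampled coarse path.

    coarse_path: (2, L) integer array of (row, column) indices on the coarse level
    factor:      downsampling factor between the fine and the coarse level
    fine_shape:  shape (N, M) of the fine-level cost matrix
    tolerance:   number of extra fine-level cells added around each block
    """
    mask = np.zeros(fine_shape, dtype=bool)
    for n, m in np.asarray(coarse_path).T:
        # One coarse cell covers a factor x factor block on the fine level
        row_start = max(n * factor - tolerance, 0)
        row_end = min((n + 1) * factor + tolerance, fine_shape[0])
        col_start = max(m * factor - tolerance, 0)
        col_end = min((m + 1) * factor + tolerance, fine_shape[1])
        mask[row_start:row_end, col_start:col_end] = True
    return mask


coarse_path = np.array([[0, 1, 2], [0, 1, 2]])   # toy diagonal path
mask = constraint_region_from_coarse_path(coarse_path, factor=2, fine_shape=(6, 6))
print(mask.astype(int))
```

On the fine level, DTW then only needs to evaluate the cells where the mask is `True`, which is where the savings in time and memory come from.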

In [None]:
wp = sync_via_mrmsdtw(f_chroma1=f_chroma_quantized_audio, 
                      f_onset1=f_DLNCO_audio, 
                      f_chroma2=f_chroma_quantized_annotation, 
                      f_onset2=f_DLNCO_annotation, 
                      input_feature_rate=feature_rate, 
                      step_weights=step_weights, 
                      threshold_rec=threshold_rec, 
                      verbose=True)

## For applications: Make warping path strictly monotonic

The standard step sizes used in DTW allow for horizontal and vertical steps, which leads to warping paths that are not guaranteed to be strictly monotonic. This is usually not a problem. However, for applications such as transferring annotations, it may be better to use a strictly monotonic path and employ linear interpolation inside non-monotonic segments. See also <a href="https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S3_MusicAppTempoCurve.html">the FMP notebook on Tempo Curves</a> for more information.

In [None]:
print('Length of warping path obtained from MrMsDTW:', wp.shape[1])
wp = make_path_strictly_monotonic(wp)
print('Length of warping path made strictly monotonic:', wp.shape[1])
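
One simple way to obtain a strictly monotonic path is to drop every point that repeats a row or column index of the previously kept point. The sketch below does exactly that on a toy path. It is an illustration under that assumption, not necessarily how `make_path_strictly_monotonic` is implemented, and it presumes that the first and last path points differ in both coordinates.

```python
import numpy as np


def strictly_monotonic_sketch(path):
    """Drop points of a (2, L) warping path that repeat a row or column index."""
    path = np.asarray(path)
    kept = [path[:, 0]]
    for point in path.T[1:]:
        # Keep a point only if it advances in both sequences
        if np.all(point > kept[-1]):
            kept.append(point)
    # Keep the original end point so that both sequences stay fully covered
    if len(kept) > 1 and not np.array_equal(kept[-1], path[:, -1]):
        kept[-1] = path[:, -1]
    return np.stack(kept, axis=1)


toy_path = np.array([[0, 1, 1, 2, 3, 3],
                     [0, 0, 1, 2, 2, 3]])
print(strictly_monotonic_sketch(toy_path))
```

The shorter path defines a one-to-one mapping between the two time axes, so it can be passed to an interpolation routine in either direction.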

## Sonifying warping path

To listen to the synchronization result, the synthesized score is now time-scaled (according to the computed warping path) to run synchronously with the audio recording. The result is sonified by putting the audio recording into the left channel and the warped, synthesized score into the right channel of a stereo audio file.

In [None]:
df_annotation_warped = df_annotation.copy(deep=True)
df_annotation_warped["end"] = df_annotation_warped["start"] + df_annotation_warped["duration"]
df_annotation_warped[['start', 'end']] = scipy.interpolate.interp1d(wp[1] / feature_rate, 
                           wp[0] / feature_rate, kind='linear', fill_value="extrapolate")(df_annotation[['start', 'end']])
df_annotation_warped["duration"] = df_annotation_warped["end"] - df_annotation_warped["start"]
note_list = df_annotation_warped[['start', 'duration', 'pitch', 'velocity']].values.tolist()
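
As a small worked example of this time mapping, the sketch below sends a few score times through a toy warping path. The path and the onset times are made up. It uses `np.interp` instead of `scipy.interpolate.interp1d`, which means that times outside the path are clamped to the end points rather than extrapolated.

```python
import numpy as np

feature_rate = 50  # frames per second, as in this notebook

# Toy strictly monotonic path: row 0 = audio frames, row 1 = score frames
toy_wp = np.array([[0, 50, 150],
                   [0, 50, 100]])

score_times = np.array([0.5, 1.0, 1.5])  # hypothetical note onsets in the score (s)
audio_times = np.interp(score_times, toy_wp[1] / feature_rate, toy_wp[0] / feature_rate)
print(audio_times)
```

In this toy path, the second score second is stretched over two seconds of audio, so an onset at 1.5 s in the score is mapped to 2.0 s in the recording.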

In [None]:
H = 512
num_frames = int(len(audio) / H)
Fs_frame = Fs / H
X_ann, F_coef_MIDI = list_to_pitch_activations(note_list, num_frames, Fs_frame)
title = 'Piano-roll representation (Fs_frame = %.3f) of the synchronized annotation' % Fs_frame
plot_matrix(X_ann, Fs=Fs_frame, F_coef=F_coef_MIDI,  ylabel='MIDI pitch number', title=title, figsize=(9, 4))
plt.ylim([36, 78])
plt.show()
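
A piano-roll matrix like the one plotted above can be built by rasterizing each note onto a frame grid. The sketch below is a minimal illustration with a hypothetical helper name. Its rounding convention and its velocity scaling are assumptions and may differ from what `list_to_pitch_activations` does.

```python
import numpy as np


def notes_to_activations_sketch(note_list, num_frames, frame_rate):
    """Rasterize [start, duration, pitch, velocity] rows into a (128, num_frames) matrix."""
    X = np.zeros((128, num_frames))
    for start, duration, pitch, velocity in note_list:
        # Assumed convention: round note boundaries to the nearest frame
        first = max(int(round(start * frame_rate)), 0)
        last = min(int(round((start + duration) * frame_rate)), num_frames - 1)
        X[int(pitch), first:last + 1] = velocity / 128
    return X


toy_notes = [[0.0, 1.0, 69, 64], [1.0, 0.5, 76, 127]]
X_toy = notes_to_activations_sketch(toy_notes, num_frames=20, frame_rate=10)
print(np.flatnonzero(X_toy[69]), np.flatnonzero(X_toy[76]))
```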

In [None]:
# Sonification
harmonics = [1, 1/2, 1/3, 1/4, 1/5]
fading_msec = 0.5
x_pitch_ann, x_pitch_ann_stereo = sonify_pitch_activations_with_signal(X_ann, audio, Fs_frame, Fs,
                                                                       fading_msec=fading_msec, 
                                                                       harmonics_weights=harmonics)

In [None]:
# TODO This sonification procedure is very slow for long recordings and will be improved in a future version of synctoolbox.
x_peaks = np.zeros(len(audio))
for row in note_list:
    second, duration, pitch, velocity = row
    freq = 2 ** ((pitch - 69) / 12) * 440
    for harmonic_num, harmonic_weight in enumerate(harmonics):
        x_peaks += velocity / 128 * harmonic_weight * librosa.clicks(times=second, 
                                                                     sr=Fs,
                                                                     click_freq=(harmonic_num+1)*freq,
                                                                     length=len(audio),
                                                                     click_duration=0.1)
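
The frequency computation inside the loop above follows the usual equal-tempered mapping from MIDI pitch to Hertz, and each click is repeated at integer multiples of that fundamental. The sketch below isolates this arithmetic. The two helper names are hypothetical and only serve the illustration.

```python
def midi_pitch_to_hz(pitch, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI pitch (69 = A4)."""
    return a4_hz * 2.0 ** ((pitch - 69) / 12)


def partial_frequencies(pitch, num_partials=5):
    """Frequencies of the first partials: integer multiples of the fundamental."""
    fundamental = midi_pitch_to_hz(pitch)
    return [(k + 1) * fundamental for k in range(num_partials)]


print(partial_frequencies(64))   # MIDI pitch 64 is E4
```

Weighting the k-th partial by 1/k, as the `harmonics` list does, gives each click a pitch-dependent timbre, which is why the clicks are called "colored".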

In [None]:
print('Sonification with colored clicks (mono):')
ipd.display(ipd.Audio(x_peaks, rate=Fs))

print('Sonification with colored clicks and sinusoids (mono):')
ipd.display(ipd.Audio(x_peaks + x_pitch_ann, rate=Fs))

print('Sonification of colored clicks and original audio (stereo):')
ipd.display(ipd.Audio(np.array([audio, x_peaks]), rate=Fs))

print('Sonification of colored clicks with sinusoids and original audio (stereo):')
ipd.display(ipd.Audio(np.array([audio, x_peaks + x_pitch_ann]), rate=Fs))

## References

[1] Thomas Prätzlich, Jonathan Driedger, and Meinard Müller: Memory-Restricted Multiscale Dynamic Time Warping,
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 569–573, 2016.

[2] Meinard Müller, Henning Mattes, and Frank Kurth:
An Efficient Multiscale Approach to Audio Synchronization,
In Proceedings of the International Conference on Music Information Retrieval (ISMIR): 192–197, 2006.

[3] Sebastian Ewert, Meinard Müller, and Peter Grosche:
High Resolution Audio Synchronization Using Chroma Onset Features,
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 1869–1872, 2009.

[4] Meinard Müller: Information Retrieval for Music and Motion, ISBN: 978-3-540-74047-6, Springer, 2007.