# Demo: Audio-audio synchronization with high-resolution features and MrMsDTW

In this notebook, we'll show a full music synchronization pipeline using the SyncToolbox, including feature extraction and high-resolution synchronization. For a short example focussing on the basics only, see [`sync_audio_audio_simple.ipynb`](sync_audio_audio_simple.ipynb).

We will take two recordings of the same musical piece (the third song of Franz Schubert's "Winterreise"), compute feature representations of both recordings, align them using memory-restricted multiscale DTW (MrMsDTW), and show how to sonify the alignment and use it for automated transfer of annotations.

The pipeline in this notebook exactly reproduces the techniques described in [1], which in turn is based on [2]. At the finest synchronization level, we use the high-resolution features described in [3].

In [None]:
# Loading some modules and defining some constants used later
import numpy as np
import pandas as pd
import librosa.display
import matplotlib.pyplot as plt
import IPython.display as ipd
import scipy.interpolate
from libfmp.b.b_plot import plot_signal, plot_chromagram
from libfmp.c3.c3s2_dtw_plot import plot_matrix_with_points

from synctoolbox.dtw.mrmsdtw import sync_via_mrmsdtw
from synctoolbox.dtw.utils import compute_optimal_chroma_shift, shift_chroma_vectors, make_path_strictly_monotonic, evaluate_synchronized_positions
from synctoolbox.feature.chroma import pitch_to_chroma, quantize_chroma, quantized_chroma_to_CENS
from synctoolbox.feature.dlnco import pitch_onset_features_to_DLNCO
from synctoolbox.feature.pitch import audio_to_pitch_features
from synctoolbox.feature.pitch_onset import audio_to_pitch_onset_features
from synctoolbox.feature.utils import estimate_tuning
%matplotlib inline

Fs = 22050
feature_rate = 50
step_weights = np.array([1.5, 1.5, 2.0])
threshold_rec = 10 ** 6

figsize = (9, 3)

## Loading two recordings of the same piece

Here, we take recordings of the song "Gefrorne Tränen" by Franz Schubert from his song cycle "Winterreise" in two performances (versions). The first version is by Gerhard Hüsch and Hanns-Udo Müller from 1933. The second version is by Randall Scarlata and Jeremy Denk from 2006. In particular, the two versions are played in different keys: The second version is played one semitone higher than the first version. We will address this later.

### Version 1

In [None]:
audio_1, _ = librosa.load('data_music/Schubert_D911-03_HU33.wav', sr=Fs)

plot_signal(audio_1, Fs=Fs, ylabel='Amplitude', title='Version 1', figsize=figsize)
ipd.display(ipd.Audio(audio_1, rate=Fs))

### Version 2

In [None]:
audio_2, _ = librosa.load('data_music/Schubert_D911-03_SC06.wav', sr=Fs)

plot_signal(audio_2, Fs=Fs, ylabel='Amplitude', title='Version 2', figsize=figsize)
ipd.display(ipd.Audio(audio_2, rate=Fs))

## Estimating tuning

We use a simple comb-based algorithm to detect tuning deviations in the two audio recordings. These will be used to adjust the filterbanks for feature computation. If we do not adjust for tuning, our chroma representations may look "smeared", leading to bad synchronization results. We refer to <a href="https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S1_TranspositionTuning.html">the FMP notebook on Transposition and Tuning</a> for more information on tuning issues and the algorithm used for tuning estimation.

In [None]:
# Alternative: librosa.estimate_tuning
tuning_offset_1 = estimate_tuning(audio_1, Fs)
tuning_offset_2 = estimate_tuning(audio_2, Fs)
print('Estimated tuning deviation for recording 1: %d cents, for recording 2: %d cents' % (tuning_offset_1, tuning_offset_2))
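
To make the meaning of a tuning deviation concrete, the following minimal sketch converts an offset in cents into the corresponding reference frequency for the pitch A4. It assumes that the offset is measured relative to the standard 440 Hz; the helper `reference_frequency_from_cents` is hypothetical and not part of the SyncToolbox.

```python
def reference_frequency_from_cents(tuning_offset_cents, standard_a4_hz=440.0):
    """Return the A4 frequency implied by a tuning offset given in cents.

    Assumption: the offset is relative to 440 Hz; 1200 cents are one octave.
    """
    return standard_a4_hz * 2.0 ** (tuning_offset_cents / 1200.0)


for cents in (-50, 0, 25):
    freq = reference_frequency_from_cents(cents)
    print(f"{cents:+d} cents -> A4 at {freq:.2f} Hz")
```

An offset of 100 cents corresponds to a full semitone, so deviations of a few tens of cents are already enough to spread the energy of one pitch over two neighboring filterbank bands.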

## Computing quantized chroma and DLNCO features

We now compute the feature representations used in the alignment procedure. Note that we pass the `tuning_offset` estimated in the previous step. Our pipeline uses CENS features, which are similar to standard chroma features but are first quantized, then smoothed, downsampled, and normalized. The MrMsDTW procedure requires only the quantized chroma features, since smoothing, downsampling, and normalization happen internally.

In addition to these chroma-like features, we also use special onset-related features called DLNCO (described in [3]). These are helpful to increase synchronization accuracy, especially for music with clear onsets.

Both features are computed from the audio using a multi-rate IIR filterbank. See [4] for details.

In the next cell, we also display the computation steps leading to both features.

In [None]:
def get_features_from_audio(audio, tuning_offset, visualize=True):
    f_pitch = audio_to_pitch_features(f_audio=audio, Fs=Fs, tuning_offset=tuning_offset, feature_rate=feature_rate, verbose=visualize)
    f_chroma = pitch_to_chroma(f_pitch=f_pitch)
    f_chroma_quantized = quantize_chroma(f_chroma=f_chroma)

    f_pitch_onset = audio_to_pitch_onset_features(f_audio=audio, Fs=Fs, tuning_offset=tuning_offset, verbose=visualize)
    f_DLNCO = pitch_onset_features_to_DLNCO(f_peaks=f_pitch_onset, feature_rate=feature_rate, feature_sequence_length=f_chroma_quantized.shape[1], visualize=visualize)
    return f_chroma_quantized, f_DLNCO


f_chroma_quantized_1, f_DLNCO_1 = get_features_from_audio(audio_1, tuning_offset_1)
f_chroma_quantized_2, f_DLNCO_2 = get_features_from_audio(audio_2, tuning_offset_2)
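
The four CENS steps mentioned above (quantize, smooth, downsample, normalize) can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions, not the SyncToolbox implementation: the function `cens_sketch`, the quantization thresholds, and the Hann smoothing window are choices made for this example.

```python
import numpy as np


def cens_sketch(chroma, win_len=201, downsample=50):
    """Illustrative CENS-like pipeline for a non-negative (12, N) chroma matrix.

    Assumes N >= win_len so that 'same'-mode convolution keeps the length N.
    """
    # L1-normalize each frame so that the thresholds are relative energies.
    l1 = chroma.sum(axis=0, keepdims=True)
    chroma_l1 = chroma / np.maximum(l1, 1e-9)

    # Quantize: each exceeded threshold adds one level (assumed thresholds).
    quantized = np.zeros_like(chroma_l1)
    for threshold in (0.05, 0.1, 0.2, 0.4):
        quantized += chroma_l1 > threshold

    # Smooth every chroma band along time with a normalized Hann window.
    window = np.hanning(win_len)
    window /= window.sum()
    smoothed = np.stack([np.convolve(band, window, mode="same") for band in quantized])

    # Downsample along time, then L2-normalize each remaining frame.
    reduced = smoothed[:, ::downsample]
    norms = np.linalg.norm(reduced, axis=0, keepdims=True)
    return reduced / np.maximum(norms, 1e-9)


rng = np.random.default_rng(0)
toy_chroma = rng.random((12, 500))  # 10 seconds at 50 Hz
toy_cens = cens_sketch(toy_chroma)
print(toy_cens.shape)
```

With a window of 201 frames and a downsampling factor of 50, a 50 Hz input is reduced to one feature vector per second, which is why such coarse features suit the early, low-resolution stages of a multiscale alignment.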

The next plots illustrate the different representations of the first 30 seconds of each version.

In [None]:
plot_chromagram(f_chroma_quantized_1[:, :30 * feature_rate], Fs=feature_rate, title='Chroma representation for version 1', figsize=figsize)
plt.show()
plot_chromagram(f_DLNCO_1[:, :30 * feature_rate], Fs=feature_rate, title='DLNCO representation for version 1', figsize=figsize)
plt.show()

plot_chromagram(f_chroma_quantized_2[:, :30 * feature_rate], Fs=feature_rate, title='Chroma representation for version 2', figsize=figsize)
plt.show()
plot_chromagram(f_DLNCO_2[:, :30 * feature_rate], Fs=feature_rate, title='DLNCO representation for version 2', figsize=figsize)
plt.show()

## Finding optimal shift of chroma vectors

As mentioned above, the two versions of the same piece used in this notebook are played in different keys. This can also be seen in the chroma representations above and will lead to complete degradation of the alignment if this effect is not accounted for. The SyncToolbox provides a built-in function for finding the shift between two recordings. This is done in the following cell and the feature sequences are subsequently adjusted to account for this shift. The plots show the chroma sequences after shifting.

Internally, the function just performs DTW using all possible shifts and returns the shift yielding the lowest total cost. To save computation time, we first downsample the sequences to a feature rate of 1 Hz.

In [None]:
f_cens_1hz_1 = quantized_chroma_to_CENS(f_chroma_quantized_1, 201, 50, feature_rate)[0]
f_cens_1hz_2 = quantized_chroma_to_CENS(f_chroma_quantized_2, 201, 50, feature_rate)[0]
opt_chroma_shift = compute_optimal_chroma_shift(f_cens_1hz_1, f_cens_1hz_2)
print('Pitch shift between recording 1 and recording 2, determined by DTW:', opt_chroma_shift, 'bins')

f_chroma_quantized_2 = shift_chroma_vectors(f_chroma_quantized_2, opt_chroma_shift)
f_DLNCO_2 = shift_chroma_vectors(f_DLNCO_2, opt_chroma_shift)

plot_chromagram(f_chroma_quantized_1[:, :30 * feature_rate], Fs=feature_rate, title='Version 1', figsize=figsize)
plt.show()
plot_chromagram(f_chroma_quantized_2[:, :30 * feature_rate], Fs=feature_rate, title='Version 2, shifted to match version 1', figsize=figsize)
plt.show()
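
The brute-force idea described above can be sketched as follows: try all 12 cyclic shifts of one chroma sequence, run a plain DTW for each, and keep the shift with the lowest accumulated cost. The helpers `dtw_total_cost` and `best_cyclic_shift` are hypothetical and written only for this illustration; in particular, the sign convention of the returned shift is an assumption and may differ from the one used by `compute_optimal_chroma_shift`.

```python
import numpy as np


def dtw_total_cost(X, Y):
    """Accumulated cost of classic DTW (unit step weights, cosine distance)."""
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-9)
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-9)
    C = 1.0 - Xn.T @ Yn  # pairwise cosine distances, shape (N, M)
    N, M = C.shape
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n, m] = C[n - 1, m - 1] + min(D[n - 1, m - 1], D[n - 1, m], D[n, m - 1])
    return D[N, M]


def best_cyclic_shift(X, Y):
    """Return the shift s (0..11) for which np.roll(Y, s, axis=0) matches X best."""
    costs = [dtw_total_cost(X, np.roll(Y, s, axis=0)) for s in range(12)]
    return int(np.argmin(costs)), costs


rng = np.random.default_rng(0)
X_demo = rng.random((12, 20))
Y_demo = np.roll(X_demo, 3, axis=0)  # same content, transposed by 3 chroma bins
best_shift, shift_costs = best_cyclic_shift(X_demo, Y_demo)
print('Best cyclic shift:', best_shift)
```

Because the DTW in this sketch has quadratic cost in the sequence length and is repeated 12 times, running it on heavily downsampled features keeps the search cheap.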

## Performing MrMsDTW

We now perform alignment using MrMsDTW. The extracted chroma sequences are used on the coarser levels of the procedure, while the DLNCO features are additionally used on the finest level.

In [None]:
wp = sync_via_mrmsdtw(f_chroma1=f_chroma_quantized_1, f_onset1=f_DLNCO_1, f_chroma2=f_chroma_quantized_2, f_onset2=f_DLNCO_2, input_feature_rate=feature_rate, step_weights=step_weights, threshold_rec=threshold_rec, verbose=True)
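
The key idea behind a multiscale procedure of this kind is that the warping path found on a coarse level restricts where the path on the next finer level has to be searched. The following sketch illustrates that projection step only. It is not the SyncToolbox implementation: the function `constraint_region_from_coarse_path`, its parameters, and the square neighborhood are assumptions made for this example.

```python
import numpy as np


def constraint_region_from_coarse_path(wp_coarse, factor, fine_shape, tolerance):
    """Boolean mask of fine-level cells lying close to an upscaled coarse path.

    wp_coarse: integer array of shape (2, L), one path point per column.
    factor: ratio between the fine and the coarse feature rate.
    """
    num_rows, num_cols = fine_shape
    mask = np.zeros(fine_shape, dtype=bool)

    # Upscale the coarse points, clip them, and append the fine-level end point.
    points = np.asarray(wp_coarse).T * factor
    points = np.minimum(points, [num_rows - 1, num_cols - 1])
    points = np.vstack([points, [num_rows - 1, num_cols - 1]])

    # Walk along each segment and mark a square neighborhood around it.
    for (r0, c0), (r1, c1) in zip(points[:-1], points[1:]):
        num_steps = int(max(r1 - r0, c1 - c0, 1))
        for t in np.linspace(0.0, 1.0, num_steps + 1):
            r = int(round(r0 + t * (r1 - r0)))
            c = int(round(c0 + t * (c1 - c0)))
            mask[max(r - tolerance, 0):r + tolerance + 1,
                 max(c - tolerance, 0):c + tolerance + 1] = True
    return mask


wp_coarse_demo = np.array([[0, 1, 2, 3],
                           [0, 1, 1, 2]])
region = constraint_region_from_coarse_path(wp_coarse_demo, factor=10, fine_shape=(40, 30), tolerance=3)
print(f'{region.mean():.1%} of the fine-level cells need to be evaluated')
```

Only the cells inside the mask have to be computed and stored on the finer level, which is what keeps the memory footprint manageable for long recordings.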

## For applications: Make warping path strictly monotonic
The standard step sizes used in DTW allow for horizontal and vertical steps, which leads to warping paths that are not guaranteed to be strictly monotonic. This is usually not a problem. However, for applications such as transferring annotations, it may be better to use a strictly monotonic path and employ linear interpolation inside non-monotonic segments. See also <a href="https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S3_MusicAppTempoCurve.html">the FMP notebook on Tempo Curves</a> for more information.

In [None]:
print('Length of warping path obtained from MrMsDTW:', wp.shape[1])
wp = make_path_strictly_monotonic(wp)
print('Length of warping path made strictly monotonic:', wp.shape[1])
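
One simple way to obtain a strictly monotonic path is to keep only those points that increase in both coordinates, while always retaining the first and the last point. The sketch below shows this idea on a toy path. It is an assumption about how such a step could look; `strictly_monotonic_subpath` is a hypothetical helper and not necessarily how `make_path_strictly_monotonic` works internally.

```python
import numpy as np


def strictly_monotonic_subpath(wp):
    """Drop path points until both rows of the (2, L) path strictly increase.

    Assumes the last point exceeds the first one in both coordinates.
    """
    wp = np.asarray(wp)
    last = wp.shape[1] - 1
    keep = [0]
    for i in range(1, last + 1):
        # Keep a point only if it advances in both sequences.
        if np.all(wp[:, i] > wp[:, keep[-1]]):
            keep.append(i)
    if keep[-1] != last:
        # Make room for the end point by removing conflicting predecessors.
        while len(keep) > 1 and not np.all(wp[:, last] > wp[:, keep[-1]]):
            keep.pop()
        keep.append(last)
    return wp[:, keep]


wp_demo = np.array([[0, 1, 1, 2, 3, 4],
                    [0, 0, 1, 2, 2, 2]])
wp_demo_monotonic = strictly_monotonic_subpath(wp_demo)
print(wp_demo_monotonic)
```

Positions between the retained points are then obtained by linear interpolation, so every time position in one version maps to exactly one position in the other.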

## Application 1: Sonifying warping path

To listen to the synchronization result, we now time-scale version 1 (according to the computed warping path) so that it runs synchronously with version 2. Additionally, we pitch-shift version 1 to account for the key difference mentioned earlier.

For the time-scale modification, we use the libtsm [5] library. The result is sonified by putting the warped version 1 into the left channel and version 2 into the right channel of a stereo audio file.

In [None]:
!pip install libtsm

In [None]:
import libtsm

pitch_shift_for_audio_1 = -opt_chroma_shift % 12
if pitch_shift_for_audio_1 > 6:
    pitch_shift_for_audio_1 -= 12
audio_1_shifted = libtsm.pitch_shift(audio_1, pitch_shift_for_audio_1 * 100, order="tsm-res")  

# The TSM functionality of the libtsm library expects the warping path to be given in audio samples.
# Here, we do the conversion and additionally clip values that lie beyond the last valid sample index.
time_map = wp.T / feature_rate * Fs
time_map[time_map[:, 0] >= len(audio_1), 0] = len(audio_1) - 1
time_map[time_map[:, 1] >= len(audio_2), 1] = len(audio_2) - 1

y_hpstsm = libtsm.hps_tsm(audio_1_shifted, time_map)
stereo_sonification = np.hstack((audio_2.reshape(-1, 1), y_hpstsm))

print('Original signal 1', flush=True)
ipd.display(ipd.Audio(audio_1, rate=Fs, normalize=True))

print('Original signal 2', flush=True)
ipd.display(ipd.Audio(audio_2, rate=Fs, normalize=True))

print('Synchronized versions', flush=True)
ipd.display(ipd.Audio(stereo_sonification.T, rate=Fs, normalize=True))
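
The modulo arithmetic used above to derive the pitch shift for version 1 may be easier to follow with a few concrete values. The sketch below wraps the logic into a hypothetical helper, `wrap_to_nearest_semitone_shift`, which is not part of any library: it negates the chroma shift and maps the result into the range from -5 to 6 semitones, so that the audio is shifted by the smallest possible interval.

```python
def wrap_to_nearest_semitone_shift(chroma_shift):
    """Negate a cyclic chroma shift and wrap it into the range [-5, 6]."""
    shift = -chroma_shift % 12  # Python's modulo yields a value in 0..11
    if shift > 6:
        shift -= 12  # prefer shifting down a little over shifting up a lot
    return shift


for chroma_shift in (0, 1, 6, 7, 11):
    semitones = wrap_to_nearest_semitone_shift(chroma_shift)
    print(f'chroma shift {chroma_shift:2d} -> {semitones:+d} semitones ({semitones * 100:+d} cents)')
```

For example, a chroma shift of 11 bins is equivalent to shifting the audio up by a single semitone, which causes far fewer artifacts than shifting it down by 11 semitones.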

## Application 2: Transferring measure annotations

The warping path obtained using MrMsDTW may also be used to facilitate other music information retrieval tasks. For example, one often has annotations (about keys, chords, instruments, ...) for a certain version of a piece and may wish to transfer these to another version of the same piece. In the following, we use the computed warping path to transfer measure positions annotated in version 1 over to version 2.

In our case, we have hand-made measure annotations for version 2 as well. This allows us to evaluate the quality of the synchronization. As evaluation measures, we look at the mean absolute error and the percentage of correctly transferred measures (given a threshold).

In [None]:
measure_annotations_1 = pd.read_csv(filepath_or_buffer='data_csv/Schubert_D911-03_HU33.csv', delimiter=';')['start']
measure_positions_1_transferred_to_2 = scipy.interpolate.interp1d(wp[0] / feature_rate, wp[1] / feature_rate, kind='linear')(measure_annotations_1)
measure_annotations_2 = pd.read_csv(filepath_or_buffer='data_csv/Schubert_D911-03_SC06.csv', delimiter=';')['start']

mean_absolute_error, accuracy_at_tolerances = evaluate_synchronized_positions(measure_annotations_2 * 1000, measure_positions_1_transferred_to_2 * 1000)
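
To illustrate what the transfer and the evaluation compute, the following sketch works through a toy example with made-up numbers. It uses `np.interp`, which performs the same linear interpolation as `interp1d` with `kind='linear'` for positions inside the path. The helper `evaluate_positions_sketch` and its tolerances are assumptions for this example and do not reproduce the output format of `evaluate_synchronized_positions`.

```python
import numpy as np

# Toy warping path in frames (row 0: version 1, row 1: version 2), strictly monotonic.
wp_toy = np.array([[0, 50, 100, 150],
                   [0, 75, 125, 150]])
feature_rate_toy = 50  # frames per second

# Made-up annotations in seconds for both versions.
annotations_toy_1 = np.array([0.5, 1.5, 2.5])
annotations_toy_2 = np.array([0.755, 2.03, 2.70])

# Transfer: map version-1 times to version-2 times along the warping path.
transferred_toy = np.interp(annotations_toy_1,
                            wp_toy[0] / feature_rate_toy,
                            wp_toy[1] / feature_rate_toy)


def evaluate_positions_sketch(reference_ms, estimated_ms, tolerances_ms=(20, 40, 60)):
    """Mean absolute error and fraction of positions within each tolerance."""
    abs_errors = np.abs(np.asarray(reference_ms) - np.asarray(estimated_ms))
    accuracies = {tol: float(np.mean(abs_errors <= tol)) for tol in tolerances_ms}
    return float(abs_errors.mean()), accuracies


mae_toy, accuracies_toy = evaluate_positions_sketch(annotations_toy_2 * 1000, transferred_toy * 1000)
print('Transferred positions (s):', transferred_toy)
print(f'Mean absolute error: {mae_toy:.2f} ms')
print('Accuracy per tolerance (ms):', accuracies_toy)
```

Reporting the accuracy at several tolerances in addition to the mean absolute error shows whether a poor average is caused by a few large outliers or by a consistent offset.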


## References

[1] Thomas Prätzlich, Jonathan Driedger, and Meinard Müller: Memory-Restricted Multiscale Dynamic Time Warping,
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 569–573, 2016.

[2] Meinard Müller, Henning Mattes, and Frank Kurth:
An Efficient Multiscale Approach to Audio Synchronization,
In Proceedings of the International Conference on Music Information Retrieval (ISMIR): 192–197, 2006.

[3] Sebastian Ewert, Meinard Müller, and Peter Grosche:
High Resolution Audio Synchronization Using Chroma Onset Features,
In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 1869–1872, 2009.

[4] Meinard Müller: Information Retrieval for Music and Motion, ISBN: 978-3-540-74047-6, Springer, 2007.

[5] Sebastian Rosenzweig, Simon Schwär, Jonathan Driedger, and Meinard Müller:
Adaptive Pitch-Shifting with Applications to Intonation Adjustment in A Cappella Recordings,
In Proceedings of the International Conference on Digital Audio Effects (DAFx), 2021.