# Demo: Audio-audio synchronization with chroma features and MrMsDTW

In this notebook, we show a minimal example of how to use the SyncToolbox for music synchronization. We take two recordings of the same musical piece (the first song of Franz Schubert's "Winterreise"), compute chroma representations of both recordings, and align them using classical dynamic time warping (DTW) and multi-resolution multi-scale DTW (MrMsDTW). We also compare the runtimes of the two algorithms.

For an explanation of chroma features and DTW, see [1].

In [None]:
# Loading some modules and defining some constants used later
import time
import librosa.display
import matplotlib.pyplot as plt
import IPython.display as ipd
from libfmp.b.b_plot import plot_signal, plot_chromagram
from libfmp.c3.c3s2_dtw_plot import plot_matrix_with_points

from synctoolbox.dtw.core import compute_warping_path
from synctoolbox.dtw.cost import cosine_distance
from synctoolbox.dtw.mrmsdtw import sync_via_mrmsdtw
%matplotlib inline

Fs = 22050
N = 2048
H = 1024
feature_rate = int(Fs / H)  # frames per second, truncated to an integer

figsize = (9, 3)
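
The constants above determine how many chroma frames are produced per second of audio. The following minimal sketch shows the relation between sampling rate, hop size, and frame count. The helper `num_frames` and the five-minute example duration are assumptions made for illustration and are not part of the SyncToolbox.

```python
# Sketch: relate hop size, feature rate, and frame counts.
# Fs and H mirror the notebook constants; everything else is illustrative.
Fs = 22050  # sampling rate in Hz
H = 1024    # hop size in samples

exact_rate = Fs / H           # frames per second before truncation
truncated_rate = int(Fs / H)  # integer rate, as used for feature_rate


def num_frames(duration_s, rate):
    """Approximate number of feature frames for a duration in seconds."""
    return int(duration_s * rate)


example_duration_s = 300  # hypothetical five-minute recording
n_frames = num_frames(example_duration_s, exact_rate)
print(f'{exact_rate:.2f} Hz exact, {truncated_rate} Hz truncated, about {n_frames} frames')
```

Note that the truncated rate slightly underestimates the true frame rate, so time axes derived from it are compressed by a few percent.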

## Loading two recordings of the same piece

Here, we take recordings of the song "Gute Nacht" by Franz Schubert from his song cycle "Winterreise" in two performances (versions). The first version is by Gerhard Hüsch and Hanns-Udo Müller from 1933. The second version is by Randall Scarlata and Jeremy Denk from 2006.

### Version 1

In [None]:
audio_1, _ = librosa.load('data_music/Schubert_D911-01_HU33.wav', sr=Fs)

plot_signal(audio_1, Fs=Fs, ylabel='Amplitude', title='Version 1', figsize=figsize)
ipd.display(ipd.Audio(audio_1, rate=Fs))

### Version 2

In [None]:
audio_2, _ = librosa.load('data_music/Schubert_D911-01_SC06.wav', sr=Fs)

plot_signal(audio_2, Fs=Fs, ylabel='Amplitude', title='Version 2', figsize=figsize)
ipd.display(ipd.Audio(audio_2, rate=Fs))

## Obtaining chroma representations of the recordings using librosa

For most Western classical and pop music, chroma features are highly useful for aligning different versions of the same piece. Here, we use librosa to calculate two very basic chroma representations, derived from STFTs. The plots illustrate the chroma representations of the first 30 seconds of each version.

In [None]:
chroma_1 = librosa.feature.chroma_stft(y=audio_1, sr=Fs, n_fft=N, hop_length=H, norm=2.0)
plot_chromagram(chroma_1[:, :30 * feature_rate], Fs=feature_rate, title='Chroma representation for version 1', figsize=figsize)
plt.show()

chroma_2 = librosa.feature.chroma_stft(y=audio_2, sr=Fs, n_fft=N, hop_length=H, norm=2.0)
plot_chromagram(chroma_2[:, :30 * feature_rate], Fs=feature_rate, title='Chroma representation for version 2', figsize=figsize)
plt.show()
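
Because the chroma vectors are computed with `norm=2.0`, every frame has unit Euclidean norm, so the inner product of two frames equals their cosine similarity. The sketch below illustrates this on random stand-in data using plain NumPy. The arrays `X` and `Y` and the helper `normalize_columns` are hypothetical, and the resulting matrix is not necessarily identical to what a library cost function returns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two chroma sequences: 12 pitch classes x frames.
X = rng.random((12, 5))
Y = rng.random((12, 7))


def normalize_columns(F, eps=1e-12):
    """Scale every column (frame) to unit Euclidean norm."""
    norms = np.linalg.norm(F, axis=0, keepdims=True)
    return F / np.maximum(norms, eps)


X_norm = normalize_columns(X)
Y_norm = normalize_columns(Y)

# For unit-norm columns, the inner product is the cosine similarity,
# so 1 - similarity yields a cosine-distance cost matrix of shape (5, 7).
C_toy = 1.0 - X_norm.T @ Y_norm
print(C_toy.shape)
```

Each entry `C_toy[i, j]` is small when frame `i` of the first sequence and frame `j` of the second have similar pitch-class content.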

## Aligning chroma representations using full DTW

The chroma feature sequences computed in the previous cell can be used for time warping. As both versions last around five minutes, an alignment can still be computed in reasonable time using classical, full DTW. In the next cell, we use the SyncToolbox implementation of DTW to do this. Each feature sequence consists of around 7000 frames, so the matrices computed during full DTW become very large (around 7000 × 7000 ≈ 49 million entries each), leading to high memory consumption.

In [None]:
C = cosine_distance(chroma_1, chroma_2)
_, _, wp_full = compute_warping_path(C=C)
# Full DTW may also be computed using librosa via:
# _, wp_librosa = librosa.sequence.dtw(C=C)
# Note: librosa returns the path with shape (T, 2), ordered from end to start.

plot_matrix_with_points(C, wp_full.T, linestyle='-',  marker='', aspect='equal',
                        title='Cost matrix and warping path computed using full DTW',
                        xlabel='Version 2 (frames)', ylabel='Version 1 (frames)', figsize=(9, 5))
plt.show()
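
To make the idea behind full DTW concrete, here is a minimal NumPy sketch of the classical algorithm on a toy cost matrix. It assumes the three standard steps (1, 1), (1, 0), and (0, 1) with equal weights. The function `dtw_path` is written for illustration only and is not the SyncToolbox implementation, whose step sizes and weights may differ.

```python
import numpy as np


def dtw_path(C):
    """Classical DTW on cost matrix C; returns total cost and path of shape (2, T)."""
    n, m = C.shape
    # Accumulated cost matrix with an extra border row and column of infinities.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])

    # Backtrack from the last cell to the first cell.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(D[i - 1, j - 1], i - 1, j - 1),
                      (D[i - 1, j], i - 1, j),
                      (D[i, j - 1], i, j - 1)]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return D[n, m], np.array(path).T


# Toy sequences: y repeats the value 1, so one frame of x maps to two frames of y.
x = np.array([0, 1, 2, 3])
y = np.array([0, 1, 1, 2, 3])
C_toy = np.abs(x[:, None] - y[None, :]).astype(float)

total_cost, wp_toy = dtw_path(C_toy)
print(total_cost)
print(wp_toy)
```

The accumulated cost matrix `D` has as many entries as the cost matrix itself, which is why the memory requirement of full DTW grows with the product of the two sequence lengths.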

## Aligning chroma representations using SyncToolbox (MrMsDTW)

We now compute an alignment between the two versions using MrMsDTW. This algorithm has a much lower memory footprint and will also be faster on long feature sequences. For more information, see [2].

In [None]:
_ = sync_via_mrmsdtw(f_chroma1=chroma_1,
                     f_chroma2=chroma_2,
                     input_feature_rate=feature_rate,
                     verbose=True)
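
A warping path is typically used to transfer time positions from one version to the other. The sketch below assumes a path of shape (2, T) given in frame indices at the feature rate, which matches how `wp_full` is handled in this notebook. The helper `transfer_time` and the toy path are hypothetical and not part of the SyncToolbox.

```python
import numpy as np


def transfer_time(t_v1, wp, feature_rate):
    """Map a time in seconds from version 1 to version 2 by interpolating along wp.

    Assumes wp has shape (2, T) with non-decreasing frame indices in both rows.
    Where version 1 frames repeat, np.interp picks one of the matching values.
    """
    times_v1 = wp[0] / feature_rate
    times_v2 = wp[1] / feature_rate
    return float(np.interp(t_v1, times_v1, times_v2))


# Toy path in which version 2 is played at half the speed of version 1.
wp_toy = np.array([[0, 1, 2, 3, 4],
                   [0, 2, 4, 6, 8]])

t_v2 = transfer_time(0.25, wp_toy, feature_rate=10)
print(t_v2)
```

In the notebook, one would keep the return value of the alignment call instead of discarding it and pass it to such a helper.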

## Runtime comparison

We now compare the runtimes of both algorithms. During their first call, the functions may incur one-time overhead (for example, creating function caches). Since the previous cells have already called both functions, the timings below reflect their raw performance.

In [None]:
# time.perf_counter() is a monotonic, high-resolution clock suited for measuring durations
start_time = time.perf_counter()
C = cosine_distance(chroma_1, chroma_2)
compute_warping_path(C=C)
end_time = time.perf_counter()
print(f'Full DTW took {end_time - start_time:.2f} s')

start_time = time.perf_counter()
sync_via_mrmsdtw(f_chroma1=chroma_1,
                 f_chroma2=chroma_2,
                 input_feature_rate=feature_rate,
                 verbose=False)
end_time = time.perf_counter()
print(f'MrMsDTW took {end_time - start_time:.2f} s')
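
A single measurement can be distorted by other processes running on the machine. One common remedy, sketched below, is to repeat the call a few times and report the shortest duration. The helper `best_of` and the stand-in workload are assumptions for illustration; in the notebook, one would pass a function that wraps the respective DTW call.

```python
import time


def best_of(fn, repeats=3):
    """Run fn() several times and return the shortest wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations)


# Stand-in workload used only to demonstrate the helper.
elapsed = best_of(lambda: sum(i * i for i in range(100_000)))
print(f'Best of 3 runs: {elapsed:.4f} s')
```

Taking the minimum rather than the mean reduces the influence of occasional slow runs caused by background activity.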


## References

[1] Meinard Müller: Fundamentals of Music Processing – Audio, Analysis, Algorithms, Applications, ISBN: 978-3-319-21944-8, Springer, 2015.

[2] Thomas Prätzlich, Jonathan Driedger, and Meinard Müller: Memory-Restricted Multiscale Dynamic Time Warping, In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 569–573, 2016.