In [None]:
%matplotlib inline
import seaborn
import numpy, scipy, matplotlib.pyplot as plt, IPython.display as ipd, sklearn
import librosa, librosa.display
import numpy as np

# What is an Onset?

A note onset is the moment when a musical note begins. It marks the transition from silence or from one note to another. Onsets typically correspond to:
* The start of a new instrumental attack (e.g., a piano key being pressed).
* The beginning of a new pitch transition (e.g., a violin moving smoothly from one note to another without changing loudness).
* The start of a percussive sound in rhythmic music.


# Novelty Functions

A **novelty function** is a mathematical function that highlights **sudden changes** in an audio signal, helping to detect events such as note onsets or other significant transitions in music. These functions work by analyzing different characteristics of the signal, such as **energy**, **spectral content**, or **phase variations**, to identify points where the signal exhibits a sudden change.

To detect note onsets, we want to locate sudden changes in the audio signal that mark the beginning of transient regions. Often, an increase in the signal's amplitude envelope will denote an onset candidate. However, that is not always the case, for notes can change from one pitch to another without changing amplitude, e.g. a violin playing slurred notes.

We will look at two novelty functions:

1. Energy-based novelty functions
2. Spectral-based novelty functions

## Energy-based Novelty Functions

Playing a note often causes a sudden increase in signal energy. To detect this sudden increase, we will compute an **energy novelty function**:

1. Compute the short-time energy in the signal.
2. Compute the first-order difference in the energy.
3. Half-wave rectify the first-order difference.
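
As a rough, self-contained sketch of these three steps, the snippet below uses only NumPy and a made-up test signal (silence followed by a tone). The function name `energy_novelty_sketch`, the `_demo` variables, and the non-centered framing are assumptions for illustration; they are not how `librosa` frames the signal.

```python
import numpy as np

def energy_novelty_sketch(x, frame_length=1024, hop_length=512):
    """Half-wave rectified first difference of framewise RMS energy."""
    # Step 1: short-time energy, here the RMS of each (non-centered) frame.
    n_frames = 1 + (len(x) - frame_length) // hop_length
    rms = np.array([
        np.sqrt(np.mean(x[i * hop_length : i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    # Step 2: first-order difference, padded with a leading zero to keep the length.
    diff = np.zeros_like(rms)
    diff[1:] = np.diff(rms)
    # Step 3: half-wave rectification keeps only increases in energy.
    return np.maximum(0.0, diff)

# Toy signal: one second of silence followed by one second of a 440 Hz tone.
sr_demo = 8000
t_demo = np.arange(sr_demo) / sr_demo
x_demo = np.concatenate([np.zeros(sr_demo), 0.5 * np.sin(2 * np.pi * 440 * t_demo)])

novelty_demo = energy_novelty_sketch(x_demo)
peak_frame = int(np.argmax(novelty_demo))
# Frame index of the largest energy increase and its start time in seconds.
print(peak_frame, peak_frame * 512 / sr_demo)
```

The largest value should land on one of the first frames that overlap the start of the tone, just before the one-second mark.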

In [None]:
x, sr = librosa.load('audio/simple_loop.wav')
print(x.shape, sr)

In [None]:
ipd.Audio(x, rate=sr)

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)
plt.grid(True)

### How do we compute the Energy?

[`librosa.feature.rms`](https://librosa.org/doc/main/generated/librosa.feature.rms.html) returns the root-mean-square (RMS) energy for each frame of audio. We will compute the RMS energy as well as its first-order difference.

The first-order difference of RMS energy represents how much the energy changes between consecutive frames. It is computed by subtracting each RMS value from the next, highlighting sudden increases or decreases in energy.

In [None]:
hop_length = 512
frame_length = 1024
rmse = librosa.feature.rms(y=x, frame_length=frame_length, hop_length=hop_length).flatten()  # rms already returns the RMS per frame
rmse_diff = np.zeros_like(rmse)
rmse_diff[1:] = np.diff(rmse)

In [None]:
np.diff([0.5, 0.3, 0.2, 0.1])

In [None]:
print(rmse.shape)
print(rmse_diff.shape)
# print(rmse)
# print(rmse_diff)

To obtain an energy novelty function, we'll perform **half-wave rectification** on `rmse_diff`, i.e. any negative values are set to zero.

Half-wave rectification ensures that only increases in energy contribute to the energy novelty function, while decreases are ignored. This is crucial for onset detection, as we are primarily interested in identifying points where energy rises sharply, which often signals the beginning of a new note or musical event.
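
To make the effect concrete, here is a tiny sketch on made-up difference values (the array `delta_demo` is purely illustrative). It zeroes the negative entries with a boolean condition and leaves the positive ones untouched.

```python
import numpy as np

# Made-up frame-to-frame energy differences.
delta_demo = np.array([0.0, 0.4, -0.1, 0.05, -0.3])

# Keep positive differences, replace everything else with zero.
rectified_demo = np.where(delta_demo > 0, delta_demo, 0.0)
print(rectified_demo)
```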

Equivalently, we can apply the function $\max(0, x)$:

In [None]:
energy_novelty = numpy.max([numpy.zeros_like(rmse_diff), rmse_diff], axis=0)

In [None]:
frames = numpy.arange(len(rmse))
t = librosa.frames_to_time(frames, sr=sr)

plt.figure(figsize=(15, 6))
plt.plot(t, rmse, 'b--', t, rmse_diff, 'g--^', t, energy_novelty, 'r-')
plt.xlim(0, t.max())
plt.xlabel('Time (sec)')
plt.legend(('RMSE', 'delta RMSE', 'energy novelty'))
plt.grid(True)

Blue dashed line (RMS - Root Mean Square Energy)
* Represents the energy of the signal over time.
* Peaks correspond to loud sounds (such as note onsets or percussive hits).
* Gradual slopes indicate sustained energy levels.

Green dashed line with triangles (delta RMS - First-order Difference of RMS)
* Measures the rate of change of RMS.
* Positive peaks indicate a sudden rise in energy (potential onset).
* Negative values show energy drops but are not used for the novelty function.

Red solid line (Energy Novelty Function - Half-wave Rectified Delta RMS)
* Represents the final novelty function after applying half-wave rectification.
* Negative values from delta RMS are set to zero (since we only care about rising energy).
* Peaks in this function mark onset candidates (note starts, percussive hits, etc.).

### Log Energy

The human perception of sound intensity is logarithmic in nature. To account for this property, we can apply a logarithm function to the energy before taking the first-order difference.

Since $\log(x)$ diverges (goes to negative infinity) as $x$ approaches zero, directly applying $\log(x)$ to energy values can cause issues when dealing with small or zero values.

To avoid this, we use a modified function: $f(x) = \log(1 + \lambda x)$

This function has two important properties:
1. When $x=0$, the function outputs 0, since $\log(1) = 0$.
2. For large values of $x$, it behaves like $\log(\lambda x)$: when $\lambda x$ is large, $1 + \lambda x$ is approximately $\lambda x$, so the function behaves like the standard logarithm and compresses the values.

This operation is sometimes called **logarithmic compression**.
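
A small numerical sketch can illustrate both properties. The helper name `log_compress`, the value $\lambda = 10$, and the sample values are assumptions chosen only for this example.

```python
import numpy as np

def log_compress(v, lam=10.0):
    """Logarithmic compression f(v) = log(1 + lam * v) for non-negative v."""
    return np.log1p(lam * np.asarray(v, dtype=float))

# Made-up energy values spanning three orders of magnitude.
values_demo = np.array([0.0, 0.001, 0.01, 0.1, 1.0])
compressed_demo = log_compress(values_demo)

# Ratio between the largest and the smallest non-zero value, before and after.
ratio_before = values_demo[-1] / values_demo[1]
ratio_after = compressed_demo[-1] / compressed_demo[1]
print(compressed_demo)
print(ratio_before, ratio_after)
```

Zero stays at zero, the ordering of the values is preserved, and the spread between small and large values shrinks.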

In [None]:
log_rmse = numpy.log1p(10*rmse)
log_rmse_diff = numpy.zeros_like(log_rmse)
log_rmse_diff[1:] = numpy.diff(log_rmse)

In [None]:
log_energy_novelty = numpy.max([numpy.zeros_like(log_rmse_diff), log_rmse_diff], axis=0)


In [None]:
plt.figure(figsize=(15, 6))
plt.plot(t, log_rmse, 'b--', t, log_rmse_diff, 'g--^', t, log_energy_novelty, 'r-')
plt.xlim(0, t.max())
plt.xlabel('Time (sec)')
plt.legend(('log RMSE', 'delta log RMSE', 'log energy novelty'))
plt.grid(True)

## Spectral-based Novelty Functions

There are two problems with the energy novelty function:

1. It is sensitive to energy fluctuations.
 * For example, the function reacts to small energy variations that occur within a single note, even if no actual onset happens.
2. It is not sensitive to spectral fluctuations between notes where amplitude remains the same.
 * For example, some musical notes transition without a noticeable change in energy but with a change in spectral content (e.g., pitch shifts in legato playing on a violin or wind instrument).
 * The energy-based method may miss these onsets because it relies only on amplitude differences.

To overcome these limitations, we can use spectral-based novelty functions, which analyze changes in frequency content instead of just energy.

For example, consider the following audio signal composed of pure tones of equal magnitude:

In [None]:
sr = 22050
def generate_tone(midi):
    T = 0.5
    t = numpy.linspace(0, T, int(T*sr), endpoint=False)
    f = librosa.midi_to_hz(midi)
    return numpy.sin(2*numpy.pi*f*t)

In [None]:
x = numpy.concatenate([generate_tone(midi) for midi in [48, 52, 55, 60, 64, 67, 72, 76, 79, 84]])

In [None]:
ipd.Audio(x, rate=sr)

The RMS energy remains roughly constant across the note changes, so the energy novelty function stays close to zero:

In [None]:
hop_length = 512
frame_length = 1024
rmse = librosa.feature.rms(y=x, frame_length=frame_length, hop_length=hop_length).flatten()  # rms already returns the RMS per frame
rmse_diff = np.zeros_like(rmse)
rmse_diff[1:] = np.diff(rmse)

In [None]:
energy_novelty = numpy.max([numpy.zeros_like(rmse_diff), rmse_diff], axis=0)

In [None]:
frames = numpy.arange(len(rmse))
t = librosa.frames_to_time(frames, sr=sr)

In [None]:
plt.figure(figsize=(15, 4))
plt.plot(t, rmse, 'b--', t, rmse_diff, 'g--^', t, energy_novelty, 'r-')
plt.xlim(0, t.max())
plt.xlabel('Time (sec)')
plt.legend(('RMSE', 'delta RMSE', 'energy novelty'))

Instead, we will compute a **spectral novelty function**:
1. Compute the Log-Amplitude Spectrogram
2. Apply the Energy Novelty Function to Each Frequency Bin.
 * first-order difference
 * half-wave rectification
3. Sum across all frequency bins.
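
The three steps above can be sketched with NumPy alone. This is a simplified illustration, not the `librosa` implementation: the function name `spectral_novelty_sketch`, the Hann window, the non-centered framing, and the compression constant `lam` are all assumptions made for the example.

```python
import numpy as np

def spectral_novelty_sketch(x, n_fft=1024, hop_length=512, lam=10.0):
    """Spectral flux: summed positive change of a log-compressed magnitude spectrogram."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_length
    frames = np.stack([
        x[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # Step 1: log-compressed magnitude spectrogram, shape (n_frames, n_fft // 2 + 1).
    log_mag = np.log1p(lam * np.abs(np.fft.rfft(frames, axis=1)))
    # Step 2: first-order difference over time, then half-wave rectification, per bin.
    flux = np.maximum(0.0, np.diff(log_mag, axis=0))
    # Step 3: sum over frequency bins; prepend a zero so there is one value per frame.
    return np.concatenate([[0.0], flux.sum(axis=1)])

# Two equal-amplitude pure tones played back to back (made-up test signal).
sr_demo = 8000
t_demo = np.arange(sr_demo) / sr_demo
x_demo = np.concatenate([
    np.sin(2 * np.pi * 440 * t_demo),
    np.sin(2 * np.pi * 660 * t_demo),
])

novelty_demo = spectral_novelty_sketch(x_demo)
# Frame with the largest spectral change and its start time in seconds.
peak_frame = int(np.argmax(novelty_demo))
print(peak_frame, peak_frame * 512 / sr_demo)
```

Although the amplitude never changes, the largest value should appear at the frames that straddle the change of pitch near the one-second mark.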

Luckily, `librosa` has [`librosa.onset.onset_strength`](https://librosa.org/doc/main/generated/librosa.onset.onset_strength.html) which computes a novelty function using spectral flux.

Spectral Flux measures how much the frequency content of a signal changes between consecutive frames.

In [None]:
spectral_novelty = librosa.onset.onset_strength(y=x, sr=sr)

In [None]:
frames = numpy.arange(len(spectral_novelty))
t = librosa.frames_to_time(frames, sr=sr)

In [None]:
plt.figure(figsize=(15, 4))
plt.plot(t, spectral_novelty, 'r-')
plt.xlim(0, t.max())
plt.xlabel('Time (sec)')
plt.legend(('Spectral Novelty',))
plt.grid(True)

## Questions

Novelty functions are dependent on `frame_length` and `hop_length`. Adjust these two parameters. How do they affect the novelty function?

Try with other audio files. How do the novelty functions compare?

# Peak Picking

In onset detection, we may want to find peaks in a novelty function. These peaks would correspond to the musical onsets.

However, we need a method to extract meaningful peaks that correspond to actual note onsets. This is where peak picking comes in.

Why?
* The novelty function may have many small fluctuations that don’t correspond to real onsets.
* Some onsets are stronger than others, and we need to define a threshold to filter out weaker, non-significant peaks.
* The process ensures we detect only significant local maxima in the novelty function.

In [None]:
x, sr = librosa.load('audio/58bpm.wav')

In [None]:
print(x.shape, sr)

In [None]:
ipd.Audio(x, rate=sr)

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)
plt.grid(True)

Compute an onset envelope:

In [None]:
hop_length = 256
onset_envelope = librosa.onset.onset_strength(y=x, sr=sr, hop_length=hop_length)

In [None]:
onset_envelope.shape

To map each computed value to a specific time in seconds, we generate a time variable:

In [None]:
N = len(x)
T = N/float(sr)
t = numpy.linspace(0, T, len(onset_envelope))
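
For reference, frame `k` of the envelope starts `k * hop_length` samples into the signal, so its time is `k * hop_length / sr`. The sketch below spells out that arithmetic; the helper name `frames_to_time_sketch` is hypothetical, and `librosa.frames_to_time`, used earlier in this notebook, performs this conversion for us. Assuming the default centered framing, the `linspace` grid above should agree with it only approximately, because `linspace` places the last frame exactly at `T`.

```python
import numpy as np

def frames_to_time_sketch(frames, sr=22050, hop_length=512):
    """Time in seconds of each frame index: frame * hop_length / sr."""
    return np.asarray(frames) * hop_length / float(sr)

# First five frames at the hop length used for the onset envelope above.
times_demo = frames_to_time_sketch(np.arange(5), sr=22050, hop_length=256)
print(times_demo)
```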

Plot the onset envelope:

In [None]:
plt.figure(figsize=(14, 5))
plt.plot(t, onset_envelope)
plt.xlabel('Time (sec)')
plt.xlim(xmin=0)
plt.ylim(0)
plt.grid(True)

In this onset strength envelope, we clearly see many peaks. Some correspond to onsets, and others don't. How do we create a peak picker that will detect true peaks while avoiding unwanted peaks?

Luckily, `librosa.util` has a [`peak_pick`](https://librosa.org/doc/main/generated/librosa.util.peak_pick.html) method. We can control the parameters based upon our signal. Let's see how it works:

    def peak_pick(x, pre_max, post_max, pre_avg, post_avg, delta, wait):
        '''Uses a flexible heuristic to pick peaks in a signal.

        A sample n is selected as a peak if the corresponding x[n] (value of the signal at frame n)
        fulfills the following three conditions:

        1. Local Maximum Condition:

               `x[n] == max(x[n - pre_max : n + post_max])`

         * The value at x[n] must be the highest value within a surrounding window.
         * The window spans from pre_max frames before n up to, but not including, post_max frames after n (Python slice semantics).
         * This ensures that x[n] is locally the most prominent peak within its neighborhood.

        2. Threshold Condition:

               `x[n] >= mean(x[n - pre_avg : n + post_avg]) + delta`

         * The value x[n] must be at least the local mean plus a threshold (delta).
         * The mean is computed over a wider window (pre_avg to post_avg), which helps adaptively define what is considered a strong peak.
         * delta is a user-defined threshold to filter out small peaks.

        3. Minimum Distance Condition (Avoid Close Peaks):

               `n - previous_n > wait`

         * Ensures that consecutive peaks are more than 'wait' frames apart.
         * previous_n refers to the last detected peak.
         * If two peaks are too close, only the first one is selected.
        '''
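
To see how the three conditions interact, here is a direct, loop-based reading of them in NumPy. It is a sketch for intuition only, not the `librosa` implementation: the name `peak_pick_sketch`, the clipping of windows at the signal edges, and the toy envelope are assumptions. It expects `post_max` and `post_avg` to be at least 1 so that each window contains frame `n`.

```python
import numpy as np

def peak_pick_sketch(x, pre_max, post_max, pre_avg, post_avg, delta, wait):
    """Greedy peak picker that applies the three conditions frame by frame."""
    x = np.asarray(x, dtype=float)
    peaks = []
    last_peak = -np.inf  # no peak selected yet
    for n in range(len(x)):
        # Windows are clipped at the signal edges (an assumption of this sketch).
        max_win = x[max(0, n - pre_max) : n + post_max]
        avg_win = x[max(0, n - pre_avg) : n + post_avg]
        is_local_max = x[n] == max_win.max()              # condition 1
        above_threshold = x[n] >= avg_win.mean() + delta  # condition 2
        far_enough = n - last_peak > wait                 # condition 3
        if is_local_max and above_threshold and far_enough:
            peaks.append(n)
            last_peak = n
    return np.array(peaks, dtype=int)

# Made-up envelope: three clear peaks and one small bump at index 4.
envelope_demo = [0, 1, 0, 0, 0.2, 0, 3, 2.5, 0, 0, 1.5, 0]
peaks_demo = peak_pick_sketch(envelope_demo, pre_max=2, post_max=2,
                              pre_avg=3, post_avg=3, delta=0.5, wait=2)
print(peaks_demo.tolist())  # -> [1, 6, 10]
```

The bump at index 4 is a local maximum but fails the threshold condition, so it is not reported. Raising `wait` to 5 would also drop the peak at index 6, because it lies only 5 frames after the peak at index 1.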

In [None]:
onset_frames = librosa.util.peak_pick(
    x=onset_envelope, pre_max=7, post_max=7, pre_avg=7, post_avg=7, delta=0.5, wait=5
)


In [None]:
onset_frames

In [None]:
plt.figure(figsize=(14, 5))
plt.plot(t, onset_envelope)
plt.grid(False)
plt.vlines(t[onset_frames], 0, onset_envelope.max(), color='r', alpha=0.7)
plt.xlabel('Time (sec)')
plt.xlim(0, T)
plt.ylim(0)

* Blue Line (Onset Envelope)
 * Represents the novelty function extracted from the audio.
 * Peaks in this function correspond to sudden changes in spectral energy, which are potential note onsets.

* Red Vertical Lines (Detected Onsets)
 * Each red line represents a detected peak, which is an onset candidate identified by `librosa.util.peak_pick`.
 * The onset times are determined using `t[onset_frames]`.

Superimpose a click track upon the original:

In [None]:
clicks = librosa.clicks(frames=onset_frames, sr=sr, hop_length=hop_length, length=N)

In [None]:
ipd.Audio(x+clicks, rate=sr)

Using the parameters above, we find that the peak-picking algorithm demonstrates high precision, meaning it produces few false positives. However, its recall can be improved, as it fails to detect several actual onsets present in the audio signal.
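
If reference onset times were available, precision and recall could be estimated along the following lines. This is a simplified sketch: the function name `precision_recall_sketch`, the 50 ms tolerance, the greedy one-to-one matching, and the onset times are all made up for illustration, and a dedicated evaluation library may match onsets differently.

```python
def precision_recall_sketch(detected, reference, tolerance=0.05):
    """Match each reference onset to at most one detection within +/- tolerance seconds."""
    detected = sorted(detected)
    reference = sorted(reference)
    used = set()  # indices of detections that are already matched
    hits = 0
    for ref in reference:
        for i, det in enumerate(detected):
            if i not in used and abs(det - ref) <= tolerance:
                used.add(i)
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical onset times in seconds.
reference_demo = [0.5, 1.0, 1.5, 2.0, 2.5]
detected_demo = [0.51, 1.52, 2.49]
precision_demo, recall_demo = precision_recall_sketch(detected_demo, reference_demo)
print(precision_demo, recall_demo)
```

In this toy case every detection matches a reference onset, so precision is perfect, but two of the five reference onsets are missed, which lowers recall.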

## Questions

Adjust the hop length from 256 to 512 or 1024. How does that affect the onset envelope, and consequently, the peak picking?

Adjust the `peak_pick` parameters, `pre_max`, `post_max`, `pre_avg`, `post_avg`, `delta`, and `wait`. How do the detected peaks change?

In [None]:
# plt.style.use('seaborn-muted')
plt.rcParams['figure.figsize'] = (14, 5)
plt.rcParams['axes.grid'] = True
plt.rcParams['axes.spines.left'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.bottom'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.xmargin'] = 0
plt.rcParams['axes.ymargin'] = 0
plt.rcParams['image.cmap'] = 'gray'
# plt.rcParams['image.interpolation'] = None

# Onset Detection

Automatic detection of musical events in an audio signal is one of the most fundamental tasks in music information retrieval. Here, we will explore how to detect an **onset**, the very instant that marks the beginning of the transient part of a sound, or the earliest moment at which a transient can be reliably detected.

In [None]:
x, sr = librosa.load('audio/classic_rock_beat.wav')

In [None]:
ipd.Audio(x, rate=sr)

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)
plt.grid(True)

## `librosa.onset.onset_detect`

[`librosa.onset.onset_detect`](https://librosa.org/doc/main/generated/librosa.onset.onset_detect.html) works in the following way:

1.  Compute a spectral novelty function.
2.  Find peaks in the spectral novelty function.
3.  [optional] Backtrack from each peak to a preceding local minimum. Backtracking can be useful for finding segmentation points such that the onset occurs shortly after the beginning of the segment.

Compute the frame indices for estimated onsets in a signal:

In [None]:
onset_frames = librosa.onset.onset_detect(y=x, sr=sr, wait=1, pre_avg=1, post_avg=1, pre_max=1, post_max=1)
print(onset_frames) # frame numbers of estimated onsets

Convert onsets to units of seconds:

In [None]:
onset_times = librosa.frames_to_time(onset_frames, sr=sr)
print(onset_times)

Plot the onsets on top of a spectrogram of the audio:

In [None]:
S = librosa.stft(x)
logS = librosa.amplitude_to_db(abs(S))

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.specshow(logS, sr=sr, x_axis='time', y_axis='log', cmap='Reds')
plt.vlines(onset_times, 0, 10000, color='#3333FF')

Let's also plot the onsets with the time-domain waveform.

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)
plt.vlines(onset_times, -0.8, 0.79, color='r', alpha=0.8)
plt.grid(True)

## `librosa.clicks`

We can add a click at the location of each detected onset.

In [None]:
clicks = librosa.clicks(frames=onset_frames, sr=sr, length=len(x))

Listen to the original audio plus the detected onsets. One way is to add the signals together, sample-wise:

In [None]:
ipd.Audio(x + clicks, rate=sr)

Another method is to play the original track in one stereo channel and the click track in the other stereo channel:

In [None]:
ipd.Audio(numpy.vstack([x, clicks]), rate=sr)

You can also change the click to a custom audio file instead:

In [None]:
cowbell, _ = librosa.load('audio/cowbell.wav')

In [None]:
ipd.Audio(cowbell, rate=sr)

In [None]:
clicks = librosa.clicks(frames=onset_frames, sr=sr, length=len(x), click=cowbell)

In [None]:
ipd.Audio(x + clicks, rate=sr)

## Questions

In `librosa.onset.onset_detect`, use the `backtrack=True` parameter. What does that do, and how does it affect the detected onsets? (See [`librosa.onset.onset_backtrack`](https://librosa.org/doc/main/generated/librosa.onset.onset_backtrack.html).)

# Onset-based Segmentation with Backtracking

[`librosa.onset.onset_detect`](https://librosa.org/doc/main/generated/librosa.onset.onset_detect.html) works by finding peaks in a spectral novelty function. However, these peaks may not actually correlate with the initial rise in energy or with how we perceive the beginning of a musical note.

When detecting onsets, we often want to identify the earliest reliable point where a new sound begins. However, the peak of the spectral novelty function (Step 2) may occur slightly after the actual onset due to the nature of energy buildup in sound production.

To address this, backtracking is used:
* Instead of marking the peak as the onset, the algorithm traces backward to find the nearest local minimum before the peak.
* This local minimum represents a point just before the onset energy increases, making it a better segmentation boundary for separating musical events.
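
The idea can be sketched in a few lines of NumPy. This is an illustration of the principle, not the code inside `librosa.onset.onset_backtrack`; the function name `backtrack_sketch` and the toy envelope are assumptions.

```python
import numpy as np

def backtrack_sketch(peaks, envelope):
    """Move each peak index back to the nearest preceding local minimum of the envelope."""
    envelope = np.asarray(envelope, dtype=float)
    backtracked = []
    for n in peaks:
        n = int(n)
        # Walk left for as long as the envelope is still lower one frame earlier.
        while n > 0 and envelope[n - 1] < envelope[n]:
            n -= 1
        backtracked.append(n)
    return np.array(backtracked, dtype=int)

# Made-up novelty curve with peaks at frames 3 and 7.
envelope_demo = [0.3, 0.1, 0.2, 0.9, 0.4, 0.05, 0.5, 1.0, 0.2]
backtracked_demo = backtrack_sketch([3, 7], envelope_demo)
print(backtracked_demo.tolist())  # -> [1, 5]
```

Each peak slides down its rising slope and stops at the frame where the curve begins to climb.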

The optional keyword parameter `backtrack=True` will backtrack from each peak to a preceding local minimum. Backtracking can be useful for finding segmentation points such that the onset occurs shortly after the beginning of the segment. We will use `backtrack=True` to perform onset-based segmentation of a signal.

Load an audio file into the NumPy array `x` and sampling rate `sr`.

In [None]:
x, sr = librosa.load('audio/classic_rock_beat.wav')
print(x.shape, sr)

In [None]:
ipd.Audio(x, rate=sr)

Compute the frame indices for estimated onsets in a signal:

In [None]:
hop_length = 512
onset_frames = librosa.onset.onset_detect(y=x, sr=sr, hop_length=hop_length)
print(onset_frames) # frame numbers of estimated onsets

Convert onsets to units of seconds:

In [None]:
onset_times = librosa.frames_to_time(onset_frames, sr=sr, hop_length=hop_length)
print(onset_times)

Convert onsets to units of samples:

In [None]:
onset_samples = librosa.frames_to_samples(onset_frames, hop_length=hop_length)
print(onset_samples)

Plot the onsets on top of a spectrogram of the audio:

In [None]:
S = librosa.stft(x)
logS = librosa.amplitude_to_db(np.abs(S))  # take the magnitude of the complex STFT first
librosa.display.specshow(logS, sr=sr, x_axis='time', y_axis='log')
plt.vlines(onset_times, 0, 10000, color='white')

As the spectrogram shows, the detected onsets sit at peaks of the novelty function, so they tend to fall during the rise in energy rather than just before it.

Let's listen to these segments. We will create a function to do the following:

1.  Divide the signal into segments beginning at each detected onset.
2.  Pad each segment with 500 ms of silence.
3.  Concatenate the padded segments.

In [None]:
def concatenate_segments(x, onset_samples, pad_duration=0.500):
    """Concatenate segments into one signal."""
    silence = np.zeros(int(pad_duration*sr)) # silence; relies on the global sampling rate `sr`
    frame_sz = min(np.diff(onset_samples))   # truncate every segment to the shortest inter-onset interval
    return np.concatenate([
        np.concatenate([x[i:i+frame_sz], silence]) # pad segment with silence
        for i in onset_samples
    ])

Concatenate the segments:

In [None]:
concatenated_signal = concatenate_segments(x, onset_samples, 0.500)

In [None]:
ipd.Audio(concatenated_signal, rate=sr)

As we hear, there is a little glitch between segments because the segment boundaries fall during the attack, not before it.

## `librosa.onset.onset_backtrack`

We can avoid this glitch by backtracking from the detected onsets.

When setting the parameter `backtrack=True`, `librosa.onset.onset_detect` will call [`librosa.onset.onset_backtrack`](https://librosa.org/doc/main/generated/librosa.onset.onset_backtrack.html).
 For each detected onset, `librosa.onset.onset_backtrack` searches backward for a local minimum.

In [None]:
onset_frames = librosa.onset.onset_detect(y=x, sr=sr, hop_length=hop_length, backtrack=True)

Convert onsets to units of seconds:

In [None]:
onset_times = librosa.frames_to_time(onset_frames, sr=sr, hop_length=hop_length)

Convert onsets to units of samples:

In [None]:
onset_samples = librosa.frames_to_samples(onset_frames, hop_length=hop_length)

Plot the onsets on top of a spectrogram of the audio:

In [None]:
S = librosa.stft(x)
logS = librosa.amplitude_to_db(np.abs(S))
librosa.display.specshow(logS, sr=sr, x_axis='time', y_axis='log')
plt.vlines(onset_times, 0, 10000, color='white')

Notice how the vertical lines denoting each segment boundary appear before each rise in energy.

Concatenate the segments:

In [None]:
concatenated_signal = concatenate_segments(x, onset_samples, 0.500)
ipd.Audio(concatenated_signal, rate=sr)