In [None]:
%matplotlib inline
import seaborn
import numpy, scipy, matplotlib.pyplot as plt, IPython.display as ipd, sklearn
import sklearn.preprocessing  # ensures sklearn.preprocessing.minmax_scale is available below
import librosa, librosa.display
import numpy as np

# Revisiting STFT & Spectral Features

Short-Time Fourier Transform (STFT)
* The STFT addresses the limitation of the Fourier Transform by analyzing signals in localized time windows.
* Instead of treating the entire signal as a single entity, STFT divides it into overlapping segments (windows) and applies the Fourier Transform to each segment.
* This produces a time-frequency representation, allowing us to observe how different frequency components evolve over time.
* The spectrogram is a visualization of STFT that shows the magnitude of frequencies over time.
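
The windowing idea above can be sketched with NumPy alone. The snippet below frames a synthetic signal, applies a Hann window to each frame, and takes the FFT of every frame. The test signal, the sample rate, `n_fft`, and `hop` are all illustrative assumptions, and the framing is deliberately simple (no centering or padding), so the number of frames will not necessarily match what `librosa.stft` returns with its defaults.

```python
import numpy as np

sr = 8000                      # assumed sample rate for this toy example
t = np.arange(sr) / sr         # one second of sample times
# 440 Hz for the first half second, 880 Hz for the second half
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

n_fft, hop = 512, 128          # window length and hop size (assumed values)
window = np.hanning(n_fft)

# Slice the signal into overlapping frames, window each one, and FFT it
starts = range(0, len(x) - n_fft + 1, hop)
frames = np.stack([x[s:s + n_fft] * window for s in starts])
stft = np.fft.rfft(frames, axis=1).T            # shape: (frequency bins, time frames)

freqs = np.fft.rfftfreq(n_fft, d=1 / sr)
magnitude = np.abs(stft)
peak_first = freqs[magnitude[:, 0].argmax()]    # dominant frequency in the first frame
peak_last = freqs[magnitude[:, -1].argmax()]    # dominant frequency in the last frame
print(stft.shape, peak_first, peak_last)
```

The dominant frequency moves from roughly 440 Hz in the first frame to roughly 880 Hz in the last one, which a single Fourier Transform of the whole signal could not localize in time.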


Spectral Features
* Spectral Centroid: Measures the "center of mass" of the spectrum, giving an indication of the perceived brightness of a sound.
* Spectral Bandwidth: Measures the spread of frequencies around the spectral centroid, indicating how wide or narrow the frequency distribution is.
* Spectral Contrast: Measures the difference between peaks (high energy) and valleys (low energy) in the frequency spectrum.
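
As a minimal sketch of the first two definitions, the helper below computes a magnitude-weighted mean frequency (centroid) and the weighted standard deviation around it (bandwidth) for a single frame. The function name `centroid_and_bandwidth`, the test signals, and the sample rate are hypothetical choices made for illustration; this is not librosa's implementation, and spectral contrast is not covered here.

```python
import numpy as np

def centroid_and_bandwidth(frame, sr):
    """Magnitude-weighted mean frequency and spread for one frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    weights = mag / mag.sum()                      # normalize so the weights sum to 1
    centroid = np.sum(freqs * weights)             # "center of mass" of the spectrum
    bandwidth = np.sqrt(np.sum(weights * (freqs - centroid) ** 2))
    return centroid, bandwidth

sr = 8000                                          # assumed sample rate
n = np.arange(1024)
low_tone = np.sin(2 * np.pi * 200 * n / sr)
high_tone = np.sin(2 * np.pi * 2000 * n / sr)
noise = np.random.default_rng(0).standard_normal(1024)

c_low, bw_low = centroid_and_bandwidth(low_tone, sr)
c_high, bw_high = centroid_and_bandwidth(high_tone, sr)
c_noise, bw_noise = centroid_and_bandwidth(noise, sr)
print(c_low, c_high, bw_low, bw_noise)
```

The higher tone yields the higher centroid (a "brighter" sound), and the noise frame has a much larger bandwidth than either pure tone.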

## Spectral Rolloff
**Spectral rolloff** is the frequency below which a specified percentage of the total spectral energy, e.g. 85%, lies. It helps identify how much of a sound's energy is in low vs. high frequencies.

Put differently, it tells us the frequency below which most of the energy is concentrated.



Purpose: It helps distinguish between harmonic (tonal) sounds and noisy (percussive) sounds by identifying where most of the energy in the frequency spectrum resides.
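
The definition can be sketched for a single frame with a cumulative sum. In the snippet below, the helper name `rolloff_frequency`, the test signals, and the sample rate are hypothetical, and applying the threshold to the magnitude spectrum (rather than to power) is an assumption of this sketch.

```python
import numpy as np

def rolloff_frequency(frame, sr, roll_percent=0.85):
    """Lowest frequency below which `roll_percent` of the spectral magnitude lies."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    cumulative = np.cumsum(mag)                    # running total from low to high frequency
    threshold = roll_percent * cumulative[-1]
    # First bin at which the running total reaches the threshold
    return freqs[np.searchsorted(cumulative, threshold)]

sr = 8000                                          # assumed sample rate
n = np.arange(1024)
tone = np.sin(2 * np.pi * 200 * n / sr)            # energy concentrated at 200 Hz
noise = np.random.default_rng(0).standard_normal(1024)

rolloff_tone = rolloff_frequency(tone, sr)
rolloff_noise = rolloff_frequency(noise, sr)
print(rolloff_tone, rolloff_noise)
```

The low tone has a rolloff close to its own frequency, while white noise, whose energy is spread up to the Nyquist frequency, has a much higher rolloff.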

In [None]:
x, sr = librosa.load('audio/simple_loop.wav')
ipd.Audio(x, rate=sr)

In [None]:
spectral_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames, sr=sr)

In [None]:
# Define a helper function to min-max normalize a feature to [0, 1] for visualization:
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

[`librosa.feature.spectral_rolloff`](https://librosa.org/doc/main/generated/librosa.feature.spectral_rolloff.html) computes the rolloff frequency for each frame in a signal:

In [None]:
plt.figure(figsize=(14, 5))
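# A small constant (0.01) is added to the signal so that near-silent frames
# do not produce spuriously high rolloff values.
spectral_rolloff = librosa.feature.spectral_rolloff(y=x+0.01, sr=sr)[0]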
librosa.display.waveshow(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_rolloff), color='r')

The red curve represents the normalized spectral rolloff over time.
* Higher spectral rolloff values (peaks in the red curve) indicate that more energy is concentrated in higher frequencies, meaning the sound is brighter or sharper.
* Lower spectral rolloff values (valleys in the red curve) indicate that most energy is in the lower frequencies, meaning the sound is more bass-heavy or muted.

In [None]:
# PLOT DIFFERENT SOUNDS AND TEST IT OUT


## Spectral Flatness

Spectral flatness (or tonality coefficient) quantifies how noise-like a sound is, as opposed to tone-like.

It is calculated as the ratio of the geometric mean to the arithmetic mean of the power spectrum.

* High spectral flatness (close to 1) → The spectrum is flat, meaning the energy is spread evenly across frequencies (white noise, percussive sounds).
* Low spectral flatness (close to 0) → The spectrum has peaks at certain frequencies, meaning it is more tonal (like a musical note or harmonic-rich signal).
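
The ratio described above can be sketched for a single frame as follows. The helper name, the `amin` floor (used only to avoid taking the logarithm of zero), the test signals, and the sample rate are assumptions made for this illustration.

```python
import numpy as np

def spectral_flatness(frame, amin=1e-10):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    power = np.maximum(power, amin)                 # floor to avoid log(0)
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return geometric_mean / arithmetic_mean

sr = 8000                                           # assumed sample rate
n = np.arange(1024)
tone = np.sin(2 * np.pi * 440 * n / sr)             # tonal: one dominant frequency
noise = np.random.default_rng(0).standard_normal(1024)

flat_tone = spectral_flatness(tone)
flat_noise = spectral_flatness(noise)
print(flat_tone, flat_noise)
```

Because the geometric mean never exceeds the arithmetic mean, the result always lies between 0 and 1; the pure tone lands very close to 0, while white noise lands far higher.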

In [None]:
# Compute Spectral Flatness
spectral_flatness = librosa.feature.spectral_flatness(y=x)

# Plot Spectral Flatness
plt.figure(figsize=(10, 4))
plt.semilogy(spectral_flatness.T, label='Spectral Flatness')
plt.xlabel('Time Frames')
plt.ylabel('Spectral Flatness')
plt.title('Spectral Flatness Over Time')
plt.legend()
plt.grid()
plt.show()

* X-axis (Time Frames) → Represents the progression of the signal over time.
* Y-axis (Spectral Flatness) → Measures how flat the spectrum is in each time frame.
* Low values → The signal is tonal, meaning there are dominant frequencies (e.g., musical notes, voiced speech).
* High values → The signal is noisy, meaning energy is distributed more evenly across frequencies (e.g., percussive sounds, unvoiced speech).

## Chroma Features

Revisit Constant-Q Transform (CQT)
* Unlike the linear frequency spacing in STFT, CQT provides logarithmic frequency bins, making it better suited for musical analysis.


Chroma features represent the tonal content of music by grouping frequencies into 12 pitch classes (C, C#, D, ..., B). Unlike spectral representations, which focus on absolute frequency values, chroma features capture harmonic and melodic structure.

Think of chroma as the perceived "musical color" of a pitch, where notes an octave apart are considered the same chroma.
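
The octave equivalence can be made concrete with a small sketch that maps a frequency to its nearest equal-tempered pitch class. The helper `pitch_class` and the `PITCH_CLASSES` list are hypothetical names, and the mapping assumes standard tuning (A4 = 440 Hz = MIDI note 69).

```python
import numpy as np

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(freq_hz):
    """Name of the nearest equal-tempered pitch class (A4 = 440 Hz, MIDI 69)."""
    midi = 69 + 12 * np.log2(freq_hz / 440.0)      # fractional MIDI note number
    return PITCH_CLASSES[int(np.round(midi)) % 12]  # discard the octave

for f in (220.0, 440.0, 880.0, 261.63, 523.25):
    print(f, pitch_class(f))
```

Here 220 Hz, 440 Hz, and 880 Hz all map to A, and 261.63 Hz and 523.25 Hz both map to C, even though their absolute frequencies differ by octaves.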

There are four chroma variants implemented in librosa: `chroma_stft`, `chroma_cqt`, `chroma_cens`, and `chroma_vqt`.

## Chroma STFT

`chroma_stft` computes the short-time Fourier transform of an audio input and maps each STFT bin to chroma.
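
A rough sketch of the bin-to-chroma mapping for a single frame is shown below. It assigns every FFT bin to exactly one pitch class and sums the power per class. This hard assignment, the test signal, and the sample rate are simplifying assumptions for illustration; it is not how librosa builds its chroma filter bank.

```python
import numpy as np

sr = 8000                                           # assumed sample rate
n = np.arange(4096)
# An A4 (440 Hz) with a weaker component one octave above (880 Hz)
x = np.sin(2 * np.pi * 440 * n / sr) + 0.5 * np.sin(2 * np.pi * 880 * n / sr)

mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), d=1 / sr)

# Skip the DC bin (0 Hz has no pitch), then assign each bin to a pitch class
midi = 69 + 12 * np.log2(freqs[1:] / 440.0)
bin_pitch_class = np.round(midi).astype(int) % 12

chroma = np.zeros(12)
np.add.at(chroma, bin_pitch_class, mag[1:] ** 2)    # sum power per pitch class
chroma /= chroma.max()                              # normalize so the peak is 1
print(chroma.argmax())                              # index 9 is pitch class A
```

Both components fall into the same pitch class, so the chroma vector has a single dominant entry for A regardless of octave.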

In [None]:
# Load audio
y, sr = librosa.load(librosa.ex('trumpet'))
ipd.Audio(y, rate=sr)

In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Compute Chroma STFT
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)

# Plot
plt.figure(figsize=(14, 5))
librosa.display.specshow(chroma_stft, y_axis='chroma', x_axis='time', cmap='coolwarm')
plt.colorbar()
plt.title('Chroma STFT')
plt.xlabel('Time')
plt.ylabel('Pitch Class')
plt.show()


* The x-axis represents time.
* The y-axis represents pitch classes (C, C#, D, ..., B).
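* Warmer (red) colors in the `coolwarm` colormap indicate higher energy for that pitch class at a given time.
* Since the STFT uses linearly spaced frequency bins while musical pitches are logarithmically spaced, the bins may not align well with pitch classes, especially at low frequencies.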

## Chroma CQT
`chroma_cqt` uses the constant-Q transform and maps each CQT bin to chroma.

In [None]:
# Compute Chroma CQT
chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr)

# Plot
plt.figure(figsize=(14, 5))
librosa.display.specshow(chroma_cqt, y_axis='chroma', x_axis='time', cmap='coolwarm')
plt.colorbar()
plt.title('Chroma CQT')
plt.xlabel('Time')
plt.ylabel('Pitch Class')
plt.show()


* The overall structure is similar to Chroma STFT, but:
  * The frequencies align better with musical pitch classes.
  * The representation is more stable, especially for harmonic sounds (like sustained notes).
  * It is better suited for chord and key detection.


## Chroma CENS



Chroma CENS is a variant of Chroma features, designed to be more robust to dynamics and timbre variations. It is derived from Chroma CQT, but with additional processing steps that improve its stability and effectiveness for applications like audio matching, cover song identification, and music retrieval.

In [None]:
# Compute Chroma CENS
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr)

# Plot
plt.figure(figsize=(14, 5))
librosa.display.specshow(chroma_cens, y_axis='chroma', x_axis='time', cmap='coolwarm')
plt.colorbar()
plt.title('Chroma CENS')
plt.xlabel('Time')
plt.ylabel('Pitch Class')
plt.show()


* Looks similar to Chroma CQT, but:
  * Has smoother variations (less fluctuation in color intensity).
  * Has stronger tonal stability, meaning it captures the harmonic progression of a song rather than individual notes.
  * Is useful for music similarity tasks because it generalizes well across different performances.


## Why Use Chroma CENS?

Regular Chroma features capture the harmonic content of music, but they can be sensitive to changes in dynamics and instrument timbre. This means that two versions of the same song (e.g., a live performance and a studio recording) might have different Chroma representations due to differences in loudness and tonal balance.

In [None]:
# Plot Chroma CQT
plt.figure(figsize=(14, 5))
librosa.display.specshow(chroma_cqt, y_axis='chroma', x_axis='time', cmap='coolwarm')
plt.colorbar()
plt.title('Chroma CQT (Before Processing)')
plt.xlabel('Time')
plt.ylabel('Pitch Class')
plt.show()

# Plot Chroma CENS
plt.figure(figsize=(14, 5))
librosa.display.specshow(chroma_cens, y_axis='chroma', x_axis='time', cmap='coolwarm')
plt.colorbar()
plt.title('Chroma CENS (After Processing)')
plt.xlabel('Time')
plt.ylabel('Pitch Class')
plt.show()

## Chroma VQT

`chroma_vqt` uses the Variable-Q Transform (VQT), a generalization of the CQT in which the Q factor is allowed to vary with frequency.

It provides better low-frequency resolution while maintaining fine detail in higher frequencies.

It's more suitable for complex polyphonic music where different instruments overlap.

You need to specify the `intervals` parameter explicitly; it defines the spacing of the frequency bins (for example, `'equal'` for equal temperament).

In [None]:
# Compute Chroma VQT
chroma_vqt = librosa.feature.chroma_vqt(y=y, sr=sr, intervals='equal')

# Plot
plt.figure(figsize=(14, 5))
librosa.display.specshow(chroma_vqt, y_axis='chroma', x_axis='time', cmap='coolwarm')
plt.colorbar()
plt.title('Chroma VQT')
plt.xlabel('Time')
plt.ylabel('Pitch Class')
plt.show()

# Compute Chroma CQT again for comparison with Chroma VQT
chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr)

# Plot
plt.figure(figsize=(14, 5))
librosa.display.specshow(chroma_cqt, y_axis='chroma', x_axis='time', cmap='coolwarm')
plt.colorbar()
plt.title('Chroma CQT')
plt.xlabel('Time')
plt.ylabel('Pitch Class')
plt.show()

* More detailed and adaptive than Chroma CQT.
* Captures low-frequency harmonic structures more effectively.
* Suitable for analyzing music with rich harmonic textures, such as classical and jazz.


## Summary

* Chroma features help capture musical pitch content, independent of absolute frequency.
* Chroma CQT is often preferred over Chroma STFT for music analysis because it aligns with musical pitch perception.
* Chroma CENS is useful for comparing songs rather than detecting exact notes.
* Chroma VQT provides an adaptive resolution, making it more detailed for analyzing complex harmonic textures.

## Autocorrelation

The [autocorrelation](http://en.wikipedia.org/wiki/Autocorrelation) of a signal measures the similarity between the signal and a time-shifted (delayed) version of itself.
For a signal $x$, the autocorrelation $r$ is:

$$ r(k) = \sum_n x(n) x(n-k) $$

In this equation, $k$ is often called the **lag** parameter. $r(k)$ is maximized at $k = 0$ and is symmetric about $k = 0$, i.e. $r(k) = r(-k)$.

The autocorrelation is useful for finding repeated patterns in a signal. For example, at short lags, the autocorrelation can tell us something about the signal's fundamental frequency. For longer lags, the autocorrelation may tell us something about the tempo of a musical signal.
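
A minimal sketch of this idea on a synthetic sine wave follows. The sample rate and the frequency (chosen so that one period is exactly 50 samples) are assumptions made for illustration.

```python
import numpy as np

sr = 22050                       # assumed sample rate
f0 = 441.0                       # chosen so one period is exactly sr / f0 = 50 samples
n = np.arange(2200)
x = np.sin(2 * np.pi * f0 * n / sr)

# r(k) = sum_n x(n) x(n-k), keeping only the non-negative lags k
r = np.correlate(x, x, mode='full')[len(x) - 1:]

zero_lag_is_max = r.argmax() == 0
# Search for the next peak in a window around the expected period
first_peak = 25 + r[25:75].argmax()
print(zero_lag_is_max, first_peak, sr / first_peak)
```

The global maximum sits at lag 0, and the next peak falls at a lag of 50 samples, one full period of the sine, so dividing the sample rate by that lag recovers the 441 Hz frequency.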

In [None]:
x, sr = librosa.load('audio/c_strum.wav')
ipd.Audio(x, rate=sr)

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)
plt.grid(True)

### `numpy.correlate`

There are two ways we can compute the autocorrelation in Python. The first method is [`numpy.correlate`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.correlate.html):

In [None]:
# Because the autocorrelation produces a symmetric signal, we only care about the "right half".
r = numpy.correlate(x, x, mode='full')[len(x)-1:]
print(x.shape, r.shape)

In [None]:
plt.figure(figsize=(14, 5))
plt.plot(r[:10000])
plt.xlabel('Lag (samples)')
plt.xlim(0, 10000)
plt.grid(True)

* The x-axis represents the lag in terms of samples.
* The y-axis represents the correlation strength between the original signal and its delayed (lagged) version.

Initial Peak at Lag = 0

* The first peak, at the extreme left edge of the plot, corresponds to lag = 0, where the signal is perfectly correlated with itself. Any signal is maximally correlated with itself at zero lag.

Periodic Peaks Indicating Repeating Structures

* These peaks indicate repeating patterns in the waveform, which could correspond to the fundamental period and its integer multiples.
* If the signal were a pure periodic waveform, the peaks would be evenly spaced, indicating a stable fundamental frequency.


The gradual decrease in peak height means that as the lag increases, the correlation becomes weaker.

The variations in peak heights and spacing suggest that the signal is not perfectly periodic.


### `librosa.autocorrelate`

The second method is [`librosa.autocorrelate`](http://bmcfee.github.io/librosa/generated/librosa.core.autocorrelate.html#librosa.core.autocorrelate):

In [None]:
r = librosa.autocorrelate(x, max_size=10000)
print(r.shape)

In [None]:
plt.figure(figsize=(14, 5))
plt.plot(r)
plt.xlabel('Lag (samples)')
plt.xlim(0, 10000)
plt.grid(True)

## Pitch Estimation

The autocorrelation is used to find repeated patterns within a signal. For musical signals, a repeated pattern can correspond to a pitch period. We can therefore use the autocorrelation function to estimate the pitch in a musical signal.

In [None]:
x, sr = librosa.load('audio/oboe_c6.wav')
ipd.Audio(x, rate=sr)

Compute and plot the autocorrelation:

In [None]:
r = librosa.autocorrelate(x, max_size=5000)
plt.figure(figsize=(14, 5))
plt.plot(r[:200])
plt.grid(True)

The autocorrelation always has a maximum at zero, i.e. zero lag. We want to identify the maximum outside of the peak centered at zero. Therefore, we might choose only to search within a range of reasonable pitches:

In [None]:
midi_hi = 120.0
midi_lo = 12.0
f_hi = librosa.midi_to_hz(midi_hi)
f_lo = librosa.midi_to_hz(midi_lo)
t_lo = sr/f_hi
t_hi = sr/f_lo

`t_lo` and `t_hi` are the shortest and longest lags (in samples) that correspond to reasonable fundamental frequencies: `t_lo = sr/f_hi` and `t_hi = sr/f_lo`.
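
The arithmetic behind these bounds can be worked through without librosa. The sketch below uses the standard equal-temperament formula for converting MIDI note numbers to Hertz and assumes a sample rate of 22050 Hz (the default used by `librosa.load` when no `sr` is given); the local `midi_to_hz` helper is a stand-in written for this example.

```python
sr = 22050                                  # assumed: librosa.load's default sample rate

def midi_to_hz(midi):
    """Equal-tempered frequency for a MIDI note number (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi - 69) / 12)

f_hi = midi_to_hz(120)    # highest candidate pitch, about 8372 Hz
f_lo = midi_to_hz(12)     # lowest candidate pitch, about 16.35 Hz
t_lo = sr / f_hi          # shortest lag to consider, in samples
t_hi = sr / f_lo          # longest lag to consider, in samples
print(f_lo, f_hi, int(t_lo), int(t_hi))
```

A high frequency has a short period and therefore a small lag, which is why the highest pitch sets the lower lag bound and the lowest pitch sets the upper one.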



In [None]:
print(f_lo, f_hi)
print(t_lo, t_hi)

Set invalid pitch candidates to zero:

In [None]:
r[:int(t_lo)] = 0
r[int(t_hi):] = 0

* `r[:int(t_lo)] = 0`
  * This zeroes the autocorrelation at very short lags, removing extremely high-frequency candidates (above MIDI 120, about 8372 Hz) along with the dominant peak at lag 0.
  * High-frequency pitches correspond to very small lag values, which may contain harmonics or noise.

* `r[int(t_hi):] = 0`
  * This zeroes the autocorrelation at very long lags, removing candidates below MIDI 12 (about 16.35 Hz).
  * Large lag values correspond to very low-frequency candidates, which may not be musically relevant.

In [None]:
plt.figure(figsize=(14, 5))
plt.plot(r[:1400])

* X-axis (Lag in Samples): Represents time shifts in samples.
* Y-axis (Autocorrelation Value): Represents how well the signal correlates with itself at different lags.
* A clear periodic structure suggests a well-defined pitch.


Finding the First Peak (Fundamental Period)

In [None]:
t_max = r.argmax()
print(t_max)

Why Does Finding the First Major Peak in the Autocorrelation Function (ACF) Represent the Pitch?

The fundamental frequency (pitch) of a sound corresponds to how often the waveform repeats per second. Since autocorrelation measures similarity between a signal and a time-shifted version of itself, periodic signals will show peaks at lags that are multiples of their period. The first major peak in the ACF (after lag = 0) represents the fundamental period, which we use to determine pitch. Note that `r.argmax()` returns the largest remaining value rather than literally the first peak; for a clearly pitched signal the two usually coincide, because later peaks tend to be lower.
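
This reasoning can be checked on a synthetic tone whose frequency is known in advance. The sketch below uses `numpy.correlate` instead of `librosa.autocorrelate` so that it runs on its own; the sample rate, the tone frequency (approximately C6), and the signal length are assumptions made for illustration.

```python
import numpy as np

sr = 22050                                   # assumed sample rate
f_true = 1046.5                              # approximately C6
n = np.arange(4410)                          # 0.2 seconds of samples
x = np.sin(2 * np.pi * f_true * n / sr)

r = np.correlate(x, x, mode='full')[len(x) - 1:]

# Keep only lags corresponding to MIDI 12..120 (about 16.35 Hz to 8372 Hz)
t_lo = int(sr / 8372.02)
t_hi = int(sr / 16.35)
r[:t_lo] = 0
r[t_hi:] = 0

t_max = r.argmax()                           # lag of the strongest remaining peak
f_est = sr / t_max
print(t_max, f_est)
```

Because the lag is a whole number of samples, the estimate is quantized: the true period is slightly more than 21 samples, so the estimate comes out a few Hertz above the true frequency.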



Finally, estimate the pitch in Hertz:

In [None]:
float(sr)/t_max

Indeed, that is very close to the true frequency of C6:

In [None]:
librosa.midi_to_hz(84)

In [None]:
r = librosa.autocorrelate(x, max_size=5000)
midi_hi = 120.0
midi_lo = 12.0
f_hi = librosa.midi_to_hz(midi_hi)
f_lo = librosa.midi_to_hz(midi_lo)
t_lo = sr/f_hi
t_hi = sr/f_lo
r[:int(t_lo)] = 0
r[int(t_hi):] = 0
t_max = r.argmax()