In [None]:
%matplotlib inline
import seaborn
import numpy, scipy, matplotlib.pyplot as plt, IPython.display as ipd, sklearn
import librosa, librosa.display
import numpy as np

# Revisiting Fourier Transform, DFT and FFT


Fourier Transform (FT)
* The Fourier Transform decomposes a signal into a sum of sinusoids of different frequencies.
* It provides a frequency-domain representation of the signal.

Discrete Fourier Transform (DFT)
* The DFT is the discrete version of the Fourier Transform, used for sampled signals.
* It converts a finite-length signal into a sum of sinusoids of different frequencies.


Fast Fourier Transform (FFT)
* An efficient algorithm for computing the DFT.
* The FFT speeds up the DFT by recursively splitting the computation into smaller DFTs, reducing the cost from $O(N^2)$ to $O(N \log N)$ for a signal of length $N$.
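
The recursive splitting can be illustrated with a minimal sketch in plain Python. The function names `naive_dft` and `fft_radix2` and the eight-sample test signal are made up for this illustration, and the radix-2 scheme shown here is only one member of the FFT family; it assumes the input length is a power of two.

```python
import cmath

def naive_dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 FFT sketch; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# Hypothetical test signal: two cycles of a sine over eight samples
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
slow = naive_dft(signal)
fast = fft_radix2(signal)
max_diff = max(abs(a - b) for a, b in zip(slow, fast))
print(max_diff)  # tiny floating-point difference between the two results
```

Both functions compute the same transform; only the amount of work differs.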

Why Are the DFT and FFT Alone Not Enough?
1. The DFT gives the global frequency content of a signal but does not show how that content changes over time.
2. If a signal is non-stationary (its frequency content changes over time), the DFT tells us which frequencies are present but not when they occur.
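
A small NumPy experiment makes this limitation concrete. The sketch below builds two hypothetical one-second signals (the names `low_then_high` and `high_then_low`, the 8000 Hz rate, and the 440/880 Hz tones are assumptions for illustration) that contain the same two tones in opposite order, and compares their magnitude spectra.

```python
import numpy as np

sr_demo = 8000                          # assumed sampling rate in Hz
t = np.arange(sr_demo // 2) / sr_demo   # half a second of sample times
low = np.sin(2 * np.pi * 440 * t)
high = np.sin(2 * np.pi * 880 * t)

# Same two tones, opposite order in time
low_then_high = np.concatenate([low, high])
high_then_low = np.concatenate([high, low])

mag_a = np.abs(np.fft.rfft(low_then_high))
mag_b = np.abs(np.fft.rfft(high_then_low))

# The ordering in time is invisible in the magnitude spectrum
same_magnitude = np.allclose(mag_a, mag_b)
print(same_magnitude)
```

The two signals sound clearly different, yet their magnitude spectra match, because swapping the halves only changes the phase of the DFT.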



# Short-Time Fourier Transform


Musical signals are highly non-stationary, i.e., their statistics change over time. It would be rather meaningless to compute a single Fourier transform over an entire 10-minute song.

How does the STFT work?

Instead of analyzing the entire signal at once, the STFT breaks it into short overlapping segments (frames) by multiplying the signal with a window function
$w(n)$, which selects a small portion of the signal at a time. The Fourier Transform is then computed for each windowed segment, sliding across time.

Mathematically, the STFT is defined as:

$$ X(m, \omega) = \sum_n x(n) w(n-m) e^{-j \omega n} $$

where:
* $x(n)$ is the original signal.
* $w(n)$ is the window function.
* $m$ determines the time position of the window.
* $\omega$ represents the frequency.

As we increase $m$, we slide the window function $w$ to the right. For the resulting frame, $x(n) w(n-m)$, we compute the Fourier transform. Therefore, the STFT $X$ is a function of both time, $m$, and frequency, $\omega$.
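
To make the definition concrete, here is a minimal NumPy sketch of the same idea: slice the signal into frames, multiply each frame by a window, and take the FFT of each. The function name `stft_sketch`, the test tone, and `sr_demo` are made up for this illustration. It is not how `librosa.stft` is implemented; in particular, it does no centering or padding, so frame counts will differ from librosa's defaults.

```python
import numpy as np

def stft_sketch(x, n_fft=2048, hop_length=512):
    """Frame the signal, apply a Hann window, and FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_length
    frames = np.stack(
        [x[m * hop_length: m * hop_length + n_fft] * window
         for m in range(n_frames)],
        axis=1,
    )
    # rfft keeps only the non-negative frequencies of each real-valued frame
    return np.fft.rfft(frames, axis=0)

sr_demo = 22050                              # assumed sampling rate in Hz
t = np.arange(sr_demo) / sr_demo             # one second of sample times
tone = np.sin(2 * np.pi * 1000 * t)          # hypothetical 1000 Hz test tone

X_demo = stft_sketch(tone)
peak_bin = int(np.argmax(np.abs(X_demo[:, 0])))
peak_hz = peak_bin * sr_demo / 2048          # bin index -> frequency in Hz
print(X_demo.shape, peak_bin, peak_hz)
```

Rows index frequency and columns index time, which is the same layout `librosa.stft` returns.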

In [None]:
x, sr = librosa.load('audio/brahms_hungarian_dance_5.mp3')
ipd.Audio(x, rate=sr)

[`librosa.stft`](https://librosa.org/doc/main/generated/librosa.stft.html) computes an STFT. We provide it a frame size, i.e., the size of the FFT, and a hop length, i.e., the frame increment:

In [None]:
hop_length = 512
n_fft = 2048
X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)

To convert the hop length and frame size to units of seconds:

In [None]:
float(hop_length)/sr # units of seconds

In [None]:
float(n_fft)/sr  # units of seconds

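For real-valued signals, the Fourier transform is conjugate-symmetric about the midpoint, so the second half of the spectrum carries no new information. Therefore, `librosa.stft` only retains one half of the output, i.e., `1 + n_fft/2` frequency bins: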

In [None]:
X.shape

This STFT has 1025 frequency bins and 9813 frames in time.
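
The 1025 frequency bins follow from `1 + n_fft/2` with `n_fft = 2048`. The symmetry behind this can be checked with a short NumPy sketch; the random test frame and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fft_demo = 2048
frame = rng.standard_normal(n_fft_demo)   # any real-valued frame will do

full = np.fft.fft(frame)    # all n_fft complex bins
half = np.fft.rfft(frame)   # non-negative frequencies only

# For real input, X[k] equals the complex conjugate of X[N - k]
mirrored = np.conj(full[1:][::-1])
is_symmetric = np.allclose(full[1:], mirrored)
n_bins = half.shape[0]
print(is_symmetric, n_bins)
```

Since the upper half mirrors the lower half, keeping `1 + n_fft/2` bins loses no information.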

## Spectrogram

In music processing, we often only care about the spectral magnitude and not the phase content.

The **spectrogram** shows the intensity of frequencies over time. A spectrogram is simply the squared magnitude of the STFT:

$$ S(m, \omega) = \left| X(m, \omega) \right|^2 $$

The human perception of sound intensity is logarithmic in nature. Therefore, we are often interested in the log amplitude:

In [None]:
S = librosa.amplitude_to_db(abs(X))
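
As a rough sketch of what a decibel conversion does, the snippet below applies the textbook formulas to a few made-up magnitude values. It leaves out the extra steps that the librosa helpers perform, such as flooring very small values and limiting the dynamic range, so its output will not match `librosa.amplitude_to_db` exactly in general.

```python
import numpy as np

magnitude = np.array([1.0, 0.1, 0.01])   # hypothetical |X| values

# Amplitude -> dB uses a factor of 20; power (|X|^2) -> dB uses a factor of 10
db_from_amplitude = 20 * np.log10(magnitude)
db_from_power = 10 * np.log10(magnitude ** 2)

print(db_from_amplitude)
print(db_from_power)
```

Both routes give the same decibel values, which is why an amplitude spectrogram goes through `amplitude_to_db` and a power spectrogram goes through `power_to_db`.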

To display any type of spectrogram in librosa, use [`librosa.display.specshow`](https://librosa.org/doc/main/generated/librosa.display.specshow.html).

In [None]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(S, sr=sr, hop_length=hop_length, x_axis='time', y_axis='linear')
plt.colorbar(format='%+2.0f dB')

## Mel-spectrogram

A Mel-spectrogram is a spectrogram where the frequency axis is transformed using the Mel scale, which mimics how humans perceive sound. Instead of using a linear frequency scale, it spaces frequencies more densely at lower frequencies and more sparsely at higher frequencies, reflecting human auditory perception.
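
To see the uneven spacing, the sketch below uses one common Hz-to-Mel formula, the HTK-style $m = 2595 \log_{10}(1 + f/700)$. This is an assumption for illustration: by default librosa uses a different (Slaney-style) variant of the Mel scale, so exact values will differ. The function names are made up.

```python
import numpy as np

def hz_to_mel_htk(f_hz):
    """HTK-style Hz -> Mel conversion (one of several Mel variants)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz_htk(m):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Ten points equally spaced on the Mel axis between 0 Hz and 8000 Hz
mel_points = np.linspace(hz_to_mel_htk(0.0), hz_to_mel_htk(8000.0), 10)
hz_points = mel_to_hz_htk(mel_points)
spacing = np.diff(hz_points)   # gap in Hz between neighbouring points

print(np.round(hz_points, 1))
print(np.round(spacing, 1))
```

Points that are equally spaced in Mel sit close together at low frequencies and grow further apart in Hz as frequency increases.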

`librosa` has some outstanding spectral representations, including [`librosa.feature.melspectrogram`](https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html):

In [None]:
hop_length = 256
S = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=4096, hop_length=hop_length)

As with the STFT-based spectrogram, we usually work on a logarithmic (decibel) scale. Because `librosa.feature.melspectrogram` returns a power spectrogram by default, we convert it with `librosa.power_to_db` rather than `librosa.amplitude_to_db`:

In [None]:
logS = librosa.power_to_db(abs(S))

In [None]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(logS, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')


* X-axis (Time) → Shows how the sound evolves over time.
* Y-axis (Mel Frequency Scale) → Uses the Mel scale instead of Hz, making lower frequencies more detailed.
* Color Intensity (Amplitude in dB) →
 * Red areas → Higher energy (louder sounds).
 * Blue areas → Lower energy (quieter sounds).


## Applications of Mel-Spectrograms
* Music Genre Classification: Used in machine learning models to classify different music genres based on frequency patterns.
* Speech Recognition: Forms the basis of automatic speech recognition (ASR) systems like Siri and Google Assistant.
* Speaker Identification: Helps in identifying individuals based on their unique vocal features.

### Why Use Mel-Spectrograms?
* Aligns with Human Hearing: The Mel scale reflects how we perceive sound, making it more meaningful for audio analysis.
* Better Frequency Resolution at Lower Frequencies: More detail in bass and speech-relevant frequencies, improving classification accuracy.
* Reduces Dimensionality: Compared to a raw spectrogram, it compresses high-frequency data, making it more efficient for machine learning.



## Introduction to the Constant-Q Transform (CQT)

Similar to the Mel-spectrogram, the Constant-Q Transform (CQT) introduces a logarithmic frequency scale, but it is specifically designed for music analysis. Instead of focusing on human perception like the Mel scale, CQT aligns frequencies with musical pitches, such as semitones and octaves. This makes it ideal for tasks like music transcription, pitch tracking, and harmonic analysis, where capturing musical structures is important.
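
The pitch alignment comes from spacing the center frequencies geometrically. The sketch below assumes 12 bins per octave, 72 bins, and a lowest frequency at MIDI note 36 (C2); the helper `midi_to_hz_demo` is a made-up stand-in using the standard equal-temperament formula with A4 = 440 Hz. It only computes center frequencies and says nothing about how `librosa.cqt` designs its filters.

```python
import numpy as np

def midi_to_hz_demo(midi_note):
    """Equal-temperament pitch with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

fmin_demo = midi_to_hz_demo(36)          # C2, about 65.4 Hz
n_bins_demo, bins_per_octave = 72, 12

# Each bin is one semitone (a factor of 2^(1/12)) above the previous one
freqs = fmin_demo * 2.0 ** (np.arange(n_bins_demo) / bins_per_octave)

# Ratio of center frequency to the distance to the next bin: the "Q"
q = freqs[:-1] / np.diff(freqs)

print(round(float(fmin_demo), 2), round(float(freqs[-1]), 2))
print(round(float(q[0]), 3))
```

Because the ratio is the same for every bin, low notes get narrow bins and high notes get wide ones, which is where the name "constant-Q" comes from.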

To compute a constant-Q spectrogram, we will use [`librosa.cqt`](https://librosa.org/doc/main/generated/librosa.cqt.html):

In [None]:
fmin = librosa.midi_to_hz(36)
C = librosa.cqt(x, sr=sr, fmin=fmin, n_bins=72)
logC = librosa.amplitude_to_db(abs(C))

In [None]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(logC, sr=sr, x_axis='time', y_axis='cqt_note', fmin=fmin, cmap='coolwarm')
plt.colorbar(format='%+2.0f dB')

* X-axis (Time in minutes:seconds)
 * Shows how the signal changes over time.
* Y-axis (Musical Notes, C2 to C8)
 * Instead of Hz (as in a regular spectrogram or Mel-spectrogram), the frequency axis is labeled with musical note names (e.g., C2, C3, etc.).
 * Each frequency bin corresponds to a musical semitone, making it easier to analyze harmonics and melodies.
* Color Intensity (Amplitude in dB)
 * Red regions indicate high-energy frequencies (louder notes).
 * Blue regions indicate low-energy frequencies (softer or silent parts).

# Spectral Features

For classification, we're going to be using new features in our arsenal: spectral moments (centroid, bandwidth, skewness, kurtosis) and other spectral statistics.

[*Moments*](https://en.wikipedia.org/wiki/Moment_(mathematics)) is a term used in physics and statistics. There are raw moments and central moments.

You are probably already familiar with two examples of moments: mean and variance. The first raw moment is known as the mean. The second central moment is known as the variance.
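
A tiny numeric example, using made-up values and weights, shows both moments for a discrete distribution. The spectral features below apply the same idea with frequencies as the values and spectral magnitudes as the weights.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical weights, summing to 1

# First raw moment: the weighted mean
mean = np.sum(weights * values)

# Second central moment: the weighted variance around that mean
variance = np.sum(weights * (values - mean) ** 2)

print(mean, variance)
```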

## Spectral Centroid

In [None]:
x, sr = librosa.load('audio/simple_loop.wav')
ipd.Audio(x, rate=sr)

The **spectral centroid** ([Wikipedia](https://en.wikipedia.org/wiki/Spectral_centroid)) indicates the frequency around which the energy of a spectrum is centered. It is computed as a weighted mean of the frequencies, with the spectral magnitudes as weights:

$$ f_c = \frac{\sum_k S(k) f(k)}{\sum_k S(k)} $$

where $S(k)$ is the spectral magnitude at frequency bin $k$ and $f(k)$ is the frequency at bin $k$.

Basically, the spectral centroid represents the "center of mass" of the spectrum, giving an idea of whether a sound is more low-pitched (bass-heavy) or high-pitched (treble-focused).
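
The formula can be checked by hand on a single synthetic frame. In the sketch below, the sampling rate, frame length, and the two equally loud tones at 800 Hz and 1600 Hz are assumptions chosen so that each tone falls exactly on a frequency bin; no window is applied, so this is a single-frame illustration rather than a replacement for the librosa function.

```python
import numpy as np

sr_demo, n_demo = 8192, 1024            # assumed rate and frame length (8 Hz per bin)
t = np.arange(n_demo) / sr_demo
frame = np.sin(2 * np.pi * 800 * t) + np.sin(2 * np.pi * 1600 * t)

S_k = np.abs(np.fft.rfft(frame))                # spectral magnitude S(k)
f_k = np.fft.rfftfreq(n_demo, d=1 / sr_demo)    # frequency f(k) of each bin in Hz

centroid_hz = np.sum(S_k * f_k) / np.sum(S_k)
print(round(float(centroid_hz), 3))
```

With two equally strong tones, the centroid lands halfway between them, just like a center of mass.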

[`librosa.feature.spectral_centroid`](https://librosa.org/doc/main/generated/librosa.feature.spectral_centroid.html) computes the spectral centroid for each frame in a signal:

In [None]:
spectral_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
spectral_centroids.shape

Compute the time variable for visualization:

In [None]:
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames, sr=sr)  # default hop_length=512 matches spectral_centroid
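
Conceptually, this conversion multiplies each frame index by the hop length and divides by the sampling rate. A plain-Python sketch of that arithmetic, with assumed values for the rate and hop length, looks like this:

```python
sr_demo = 22050       # assumed sampling rate in Hz
hop_demo = 512        # assumed hop length in samples

frame_indices = range(5)
times = [i * hop_demo / sr_demo for i in frame_indices]   # seconds

print([round(s, 4) for s in times])
```

Each frame advances time by `hop_length / sr` seconds, about 23 ms for these values.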

Define a helper function to normalize the spectral centroid for visualization:

In [None]:
import sklearn.preprocessing  # `import sklearn` alone may not load this submodule

def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

Plot the spectral centroid along with the waveform:

In [None]:
plt.figure(figsize=(15, 5))
librosa.display.waveshow(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='r') # normalize for visualization purposes
plt.grid(True)

Similar to the zero crossing rate, there is a spurious rise in spectral centroid at the beginning of the signal. That is because the silence at the beginning has such small amplitude that high frequency components have a chance to dominate. One hack around this is to add a small constant before computing the spectral centroid, thus shifting the centroid toward zero at quiet portions:

In [None]:
plt.figure(figsize=(15, 5))
spectral_centroids = librosa.feature.spectral_centroid(y=x+0.01, sr=sr)[0]
librosa.display.waveshow(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='r') # normalize for visualization purposes

* X-axis (Time in seconds) → Shows the progression of the audio signal.
* Y-axis (Amplitude for waveform, Normalized Spectral Centroid in Red)
* The light blue waveform represents the original audio signal.
* The red line represents the spectral centroid, which indicates whether the signal's energy is concentrated in low frequencies (bass) or high frequencies (treble).


Observing the Relationship Between the Waveform and Centroid
* Low spectral centroid (Red Line near bottom) → Corresponds to sections where the signal has more low-frequency content (e.g., bass sounds, silence).
* High spectral centroid (Red Line peaks) → Occurs when higher frequencies dominate (e.g., percussive sounds, bright notes, or sharp attacks).
* Sudden peaks in the red line → Often align with sharp transients in the waveform, meaning these moments have high-frequency content.
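
The effect of adding a small constant, used earlier to tame the centroid in quiet portions, can be reproduced on a synthetic frame. The sketch below is an illustration under assumptions: the helper `frame_centroid`, the very quiet random noise standing in for "silence", and the unwindowed single frame are all made up, so the numbers will not match librosa's output.

```python
import numpy as np

def frame_centroid(frame, sr):
    """Magnitude-weighted mean frequency of one unwindowed frame."""
    S_k = np.abs(np.fft.rfft(frame))
    f_k = np.fft.rfftfreq(len(frame), d=1 / sr)
    return np.sum(S_k * f_k) / np.sum(S_k)

rng = np.random.default_rng(0)
sr_demo = 22050
quiet = 1e-4 * rng.standard_normal(2048)   # near-silent noise frame

c_quiet = frame_centroid(quiet, sr_demo)
c_offset = frame_centroid(quiet + 0.01, sr_demo)   # same frame plus a constant

print(round(float(c_quiet)), round(float(c_offset)))
```

The constant adds energy at 0 Hz, which outweighs the faint noise and pulls the centroid toward zero. In loud sections the same constant is negligible compared with the signal.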

Task:

* Load different audio clips (e.g., a bass-heavy sound vs. a bright, high-pitched sound).
* Compute and plot the spectral centroid over time using `librosa.feature.spectral_centroid`.
* Compare the centroid between different sounds.

## Spectral Bandwidth

Spectral Bandwidth measures the spread of frequencies around the spectral centroid, indicating how wide or narrow the frequency distribution is.

* A low spectral bandwidth means the energy is concentrated around a few frequencies (e.g., a pure sine wave or a flute note).
* A high spectral bandwidth means the energy is spread across many frequencies (e.g., noise, cymbals, or distorted guitar sounds).
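
One way to write the order-$p$ spread is $\left( \sum_k \hat{S}(k) \, |f(k) - f_c|^p \right)^{1/p}$, where $\hat{S}(k)$ is the magnitude normalized to sum to one. The sketch below implements that expression on a single synthetic frame; the function name `spectral_bandwidth_sketch` and the two-tone test signal are assumptions for illustration, and details may differ from `librosa.feature.spectral_bandwidth`.

```python
import numpy as np

def spectral_bandwidth_sketch(S_k, f_k, p=2):
    """Order-p spread of the spectrum around its centroid (single frame)."""
    S_norm = S_k / np.sum(S_k)             # weights that sum to 1
    centroid = np.sum(S_norm * f_k)
    return np.sum(S_norm * np.abs(f_k - centroid) ** p) ** (1.0 / p)

sr_demo, n_demo = 8192, 1024               # assumed rate and frame length
t = np.arange(n_demo) / sr_demo
# A strong tone at 800 Hz plus a weaker one at 1600 Hz
frame = 3 * np.sin(2 * np.pi * 800 * t) + np.sin(2 * np.pi * 1600 * t)

S_k = np.abs(np.fft.rfft(frame))
f_k = np.fft.rfftfreq(n_demo, d=1 / sr_demo)

bw = {p: spectral_bandwidth_sketch(S_k, f_k, p) for p in (2, 3, 4)}
print({p: round(float(v), 1) for p, v in bw.items()})
```

For $p = 2$ this is the weighted standard deviation of frequency around the centroid. Larger $p$ gives more weight to components far from the centroid, here the weaker 1600 Hz tone.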


In [None]:
hop_length = 512
n_fft = 4096
spectral_bandwidth_2 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, hop_length=hop_length, n_fft=n_fft)[0]
spectral_bandwidth_3 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, hop_length=hop_length, n_fft=n_fft, p=3)[0]
spectral_bandwidth_4 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, hop_length=hop_length, n_fft=n_fft, p=4)[0]

plt.figure(figsize=(15, 5))
librosa.display.waveshow(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_bandwidth_2), color='r')
plt.plot(t, normalize(spectral_bandwidth_3), color='g')
plt.plot(t, normalize(spectral_bandwidth_4), color='y')
plt.legend(('p = 2', 'p = 3', 'p = 4'))

* X-axis (Time in seconds) → Represents how the audio evolves over time.
* Y-axis (Normalized Bandwidth Values) → Higher values indicate a broader frequency spread, while lower values indicate a narrower spread.
* Waveform (Light Blue in Background) → Shows the amplitude of the original audio signal.

The plot contains multiple colored lines corresponding to different p-values (p=2, 3, 4).

* These p-values are the order $p$ of the spectral bandwidth computation, i.e., the exponent applied to each frequency's distance from the centroid, which affects how the frequency spread is measured.
* Red (p=2) → Captures a wider variation in bandwidth.
* Green (p=3) & Yellow (p=4) → Show slightly smoother versions of the curve.


Observing the Relationship to the Audio Signal
* Peaks in spectral bandwidth correspond to transients or sudden changes in the audio waveform (e.g., drum hits or sharp attacks).
* Lower values occur in steady-state regions (e.g., sustained tones or silence).

Key Takeaways
* Spectral bandwidth increases during fast, noisy, or percussive sections.
* It decreases during smooth, tonal, or steady portions.
* Different p-values affect how the spread is measured, with lower p-values showing sharper changes.


## Spectral Contrast

Spectral Contrast measures the difference between peaks (high energy) and valleys (low energy) in the frequency spectrum. Unlike spectral centroid or bandwidth, which describe the overall distribution of energy, spectral contrast focuses on how much the energy varies between frequency bands.

* High spectral contrast → Found in clear, narrow-band content, where a few strong peaks (e.g., the harmonics of a sustained note) stand well above the valleys between them.
* Low spectral contrast → Found in broadband, noise-like content, where energy is spread more evenly across each band.
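
The peak-versus-valley idea can be sketched for a single frequency band. The function below is a simplified illustration, not librosa's implementation: the name `band_contrast_db`, the 10% quantile, and the two synthetic bands are all assumptions. It averages the strongest and weakest magnitudes in the band and reports their difference in dB.

```python
import numpy as np

def band_contrast_db(band_magnitudes, quantile=0.1):
    """dB difference between the strongest and weakest bins of one band."""
    sorted_mag = np.sort(band_magnitudes)
    n_edge = max(1, int(round(quantile * len(sorted_mag))))
    valley = np.mean(sorted_mag[:n_edge])    # mean of the weakest bins
    peak = np.mean(sorted_mag[-n_edge:])     # mean of the strongest bins
    return 20 * np.log10(peak) - 20 * np.log10(valley)

flat_band = np.ones(100)        # energy spread evenly across the band
peaky_band = np.ones(100)
peaky_band[50] = 100.0          # one dominant spectral peak

contrast_flat = band_contrast_db(flat_band)
contrast_peaky = band_contrast_db(peaky_band)
print(round(float(contrast_flat), 2), round(float(contrast_peaky), 2))
```

A band with evenly spread energy gives a contrast of zero, while a band with one dominant peak gives a large positive value.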

In [None]:
spectral_contrast = librosa.feature.spectral_contrast(y=x, sr=sr)
spectral_contrast.shape

In [None]:
plt.figure(figsize=(15, 5))
plt.imshow(normalize(spectral_contrast, axis=1), aspect='auto', origin='lower', cmap='coolwarm')

* X-axis (Frames / Time Steps) → Represents how the spectral contrast evolves over time.
* Y-axis (Frequency Bands) → Divides the frequency spectrum into multiple bands (e.g., low, mid, high frequencies). Each row represents one frequency band.
* Color Intensity (Contrast, min-max normalized within each band)
 * Red areas → High spectral contrast (large difference between peaks and valleys).
 * Blue areas → Low spectral contrast (small difference between peaks and valleys).



Key Observations
* Bright red areas mark frames where, within that band, the peaks stand far above the valleys, i.e., clear, narrow-band content such as strong harmonics.
* Blue areas mark frames where spectral energy is more evenly distributed across the band, as in broadband or noise-like content.
* Spectral contrast varies across different frequency bands, meaning some bands show more dynamic energy changes than others.

What This Tells Us About the Audio

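Task: same as before :)) Load different audio clips, compute and plot their spectral contrast using `librosa.feature.spectral_contrast`, and compare the results.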

In [None]:
print(spectral_contrast[6].mean())  # mean contrast in the last (highest-frequency) band