# Spectrogram Clustering: Speech Segmentation

This notebook applies spectral clustering to speech data: a spectrogram of the utterance "Aba", taken from:

> **Automatic Determination of the Number of Clusters using Spectral Algorithms**  
> Sanguinetti, G., Laidler, J., Lawrence, N.D. (2005)  
> IEEE Workshop on Machine Learning for Signal Processing (NNSP 2005)

## The Challenge

Automatic speech segmentation: identify consonants and vowels from acoustic features alone. 

The utterance "Aba" has natural segments:
1. **/a/** - vowel (steady harmonic structure)
2. **/b/** - consonant (burst + formant transitions)  
3. **/a/** - vowel (steady harmonic structure)

Can spectral clustering automatically find these boundaries?

## What We'll Demonstrate

1. Load and visualise the spectrogram
2. Cluster with two different sigma values
3. Compare coarse vs. fine temporal segmentation
4. Interpret results in phonetic context
5. Reproduce paper Figures 6a, 6b and 6c

In [None]:
# Install spectral-cluster package if needed
import sys
from pathlib import Path

try:
    import spectral
    print(f"✓ spectral package already installed (version {spectral.__version__})")
except ImportError:
    print("📦 Installing spectral-cluster package...")
    
    import subprocess

    here = Path.cwd().resolve()
    parent = here.parent

    if (parent / "pyproject.toml").exists() and (parent / "spectral").is_dir():
        # Running from a checkout of the repository: install it in editable mode
        print(f"  → Installing from local directory: {parent}")
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-e", str(parent)],
            stdout=subprocess.DEVNULL
        )
    else:
        # No local checkout found: install straight from GitHub
        print("  → Installing from GitHub...")
        subprocess.check_call([
            sys.executable, "-m", "pip", "install",
            "git+https://github.com/lawrennd/spectral.git"
        ])
    
    import spectral
    print(f"✓ spectral package installed successfully (version {spectral.__version__})")

In [None]:
# Import required packages
import numpy as np
import matplotlib.pyplot as plt
from spectral import SpectralCluster

# Set random seed
np.random.seed(1)

# Configure matplotlib
plt.rcParams['figure.figsize'] = (14, 5)
plt.rcParams['font.size'] = 11

print('✓ All packages loaded successfully')

## 1. Load and Visualise the Spectrogram

A spectrogram shows:
- **X-axis**: Time (frame number)
- **Y-axis**: Frequency (bin)
- **Color**: Energy at that time-frequency point

For clustering, each time-frequency point (pixel) of the downsampled spectrogram becomes a data point with three features: time index, frequency index, and weighted energy.
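
As a minimal sketch of how the loader in this notebook builds its feature matrix, the toy example below turns a hypothetical 2 × 3 array into rows of (time, frequency, weighted energy). The array values and the helper name `pixels_to_features` are illustrative only; the weighting by `numrows + numcols` mirrors what `load_spectrogram` does.

```python
import numpy as np

def pixels_to_features(spec):
    """Turn a 2-D array (frequency x time) into rows of (time, freq, weighted energy)."""
    numrows, numcols = spec.shape
    # Row index = frequency bin, column index = time frame
    freq_idx, time_idx = np.mgrid[0:numrows, 0:numcols]
    # Scale energy so it is comparable to the range of the index features
    weighted = spec.ravel() * (numrows + numcols)
    return np.column_stack([time_idx.ravel(), freq_idx.ravel(), weighted])

# Hypothetical toy "spectrogram": 2 frequency bins x 3 time frames
toy = np.array([[0.0, 0.5, 1.0],
                [0.2, 0.4, 0.6]])
features = pixels_to_features(toy)
print(features.shape)  # (6, 3): one row per time-frequency pixel
print(features[2])     # [2. 0. 5.]: time frame 2, frequency bin 0, energy 1.0 * (2 + 3)
```

Because every pixel becomes its own row, the number of data points grows with the product of the two image dimensions, which is why the loader downsamples first.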

In [None]:
def ensure_data_files():
    """Download data files if they don't exist locally."""
    from pathlib import Path
    import urllib.request
    
    data_dir = Path('data')
    data_dir.mkdir(exist_ok=True)
    
    base_url = 'https://raw.githubusercontent.com/lawrennd/spectral/main/examples/data/'
    files = ['spectrogram.txt']
    
    for filename in files:
        filepath = data_dir / filename
        if not filepath.exists():
            print(f'📥 Downloading {filename}...')
            urllib.request.urlretrieve(base_url + filename, filepath)
            print(f'  ✓ Downloaded to {filepath}')

# Ensure data files are available
ensure_data_files()

def load_spectrogram(filename, downsample=3):
    """Load spectrogram data and prepare features."""
    # Load data
    spec = np.loadtxt(f'data/{filename}')
    
    # Scale by the maximum (gives values in [0, 1] when the data are non-negative)
    spec_norm = spec / spec.max()
    
    # Flip vertically (to match paper convention: low freq at bottom)
    spec_norm = np.flipud(spec_norm)
    
    # Downsample to reduce computation (MATLAB uses factor=3)
    spec_ds = spec_norm[::downsample, ::downsample]
    
    # Get dimensions
    numrows, numcols = spec_ds.shape
    
    # Create coordinate grids
    y_coords, x_coords = np.mgrid[0:numrows, 0:numcols]
    
    # Flatten
    x_flat = x_coords.ravel()
    y_flat = y_coords.ravel()
    intensity_flat = spec_ds.ravel()
    
    # Weight intensity to match spatial scale
    intensity_weighted = intensity_flat * (numrows + numcols)
    
    # Combine: (time_index, freq_index, weighted_energy)
    X = np.column_stack([x_flat, y_flat, intensity_weighted])
    
    return X, spec_ds, (numrows, numcols)

# Load spectrogram
X, spec_ds, (numrows, numcols) = load_spectrogram('spectrogram.txt', downsample=3)

print(f"Spectrogram size (downsampled): {numrows} × {numcols}")
print(f"Total time-frequency points: {X.shape[0]}")
print(f"Feature dimensions: {X.shape[1]} (time, frequency, energy)")

In [None]:
# Visualise the spectrogram
fig, ax = plt.subplots(1, 1, figsize=(14, 6))
im = ax.imshow(spec_ds, aspect='auto', origin='lower', cmap='hot', interpolation='nearest')
ax.set_xlabel('Time Frame', fontsize=12)
ax.set_ylabel('Frequency Bin', fontsize=12)
ax.set_title('Spectrogram of "Aba" Utterance (Paper Figure 6a)', fontsize=13, fontweight='bold')
plt.colorbar(im, ax=ax, label='Normalised Energy')
plt.tight_layout()
plt.show()

print("\nAcoustic structure:")
print("  - Left segment: First /a/ vowel (harmonic structure)")
print("  - Middle: /b/ consonant (burst + transitions)")
print("  - Right segment: Second /a/ vowel (harmonic structure)")
print("\nChallenge: Can we automatically detect these 3 phonetic segments?")

## 2. Coarse Temporal Segmentation (sigma = 1.32)

First, we try coarser segmentation:
- **MATLAB**: sigma²=3.5 → **Python**: $\sigma = \sqrt{3.5/2} \approx 1.32$

This should find broad temporal segments (vowel vs. consonant).
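
The conversion above is consistent with the MATLAB code using a kernel of the form $\exp(-d^2/\sigma_M^2)$ and the Python package using $\exp(-d^2/(2\sigma^2))$. That pairing is an assumption inferred from the $\sqrt{\sigma_M^2/2}$ formula, not something confirmed from either implementation. Under that assumption, a hypothetical helper reproduces both sigma values used in this notebook:

```python
import math

def matlab_sigma2_to_python_sigma(sigma2_matlab):
    """Convert a MATLAB-style sigma^2 to a Python-style sigma.

    Assumes exp(-d^2 / sigma2_matlab) == exp(-d^2 / (2 * sigma^2)),
    which gives sigma = sqrt(sigma2_matlab / 2).
    """
    return math.sqrt(sigma2_matlab / 2.0)

for s2 in (3.5, 2.5):
    print(f"sigma^2 = {s2} -> sigma = {matlab_sigma2_to_python_sigma(s2):.2f}")
# sigma^2 = 3.5 -> sigma = 1.32
# sigma^2 = 2.5 -> sigma = 1.12
```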

In [None]:
# Cluster with sigma = 1.32 (coarse segmentation)
clf_coarse = SpectralCluster(sigma=1.32, random_state=1)
clf_coarse.fit(X)

print(f"\n{'='*60}")
print("COARSE SEGMENTATION (sigma = 1.32)")
print(f"{'='*60}")
print(f"Number of segments detected: {clf_coarse.n_clusters_}")
print(f"Algorithm used {clf_coarse.eigenvectors_.shape[1]} eigenvectors")
print(f"{'='*60}\n")

In [None]:
# Reshape labels back to spectrogram shape
labels_coarse_img = clf_coarse.labels_.reshape(numrows, numcols)

# Visualise coarse segmentation
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Original spectrogram
im1 = axes[0].imshow(spec_ds, aspect='auto', origin='lower', cmap='hot', interpolation='nearest')
axes[0].set_xlabel('Time Frame', fontsize=12)
axes[0].set_ylabel('Frequency Bin', fontsize=12)
axes[0].set_title('Original Spectrogram', fontsize=13, fontweight='bold')
plt.colorbar(im1, ax=axes[0], label='Energy')

# Segmentation result
im2 = axes[1].imshow(labels_coarse_img, aspect='auto', origin='lower', cmap='tab10', interpolation='nearest')
axes[1].set_xlabel('Time Frame', fontsize=12)
axes[1].set_ylabel('Frequency Bin', fontsize=12)
axes[1].set_title(f'Coarse Segmentation: {clf_coarse.n_clusters_} segments (sigma=1.32)\n(Paper Figure 6b)', 
                  fontsize=13, fontweight='bold')
plt.colorbar(im2, ax=axes[1], label='Segment')

plt.tight_layout()
plt.show()

print(f"Paper Figure 6b reproduced: {clf_coarse.n_clusters_} broad phonetic segments.")

In [None]:
# Show temporal segmentation (majority label across frequency for each time frame)
fig, ax = plt.subplots(1, 1, figsize=(14, 4))

# Majority-vote cluster label across frequency for each time frame
time_segments = []
for t in range(numcols):
    # Most common label at this time frame
    labels_t = labels_coarse_img[:, t]
    most_common = np.bincount(labels_t).argmax()
    time_segments.append(most_common)

time_segments = np.array(time_segments)
ax.plot(time_segments, linewidth=3, marker='o', markersize=4)
ax.set_xlabel('Time Frame', fontsize=12)
ax.set_ylabel('Segment ID', fontsize=12)
ax.set_title('Temporal Segmentation (Coarse)', fontsize=13, fontweight='bold')
ax.set_yticks(range(clf_coarse.n_clusters_))
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nPhonetic interpretation:")
if clf_coarse.n_clusters_ >= 3:
    print("  ✓ Algorithm detected 3+ segments - likely vowel-consonant-vowel structure!")
else:
    print(f"  ⚠ Algorithm detected {clf_coarse.n_clusters_} segments")
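
To turn a per-frame label sequence such as `time_segments` into explicit boundaries, one option is to look for frames where the majority label changes. The sketch below uses a hypothetical label sequence, not this notebook's output, and `segment_boundaries` is an illustrative helper, not part of the `spectral` package.

```python
import numpy as np

def segment_boundaries(frame_labels):
    """Return the indices of frames whose label differs from the previous frame."""
    frame_labels = np.asarray(frame_labels)
    return np.flatnonzero(np.diff(frame_labels)) + 1

# Hypothetical majority-vote labels for 10 time frames (vowel, consonant, vowel)
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
boundaries = segment_boundaries(toy_labels)
print(boundaries)  # [4 7]
```

Note that cluster IDs are arbitrary, so two vowel segments may or may not share a label; the boundaries depend only on where the label changes.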

## 3. Fine Temporal Segmentation (sigma = 1.12)

Now we try finer segmentation:
- **MATLAB**: sigma²=2.5 → **Python**: $\sigma = \sqrt{2.5/2} \approx 1.12$

This should find more detailed temporal structure.

In [None]:
# Cluster with sigma = 1.12 (fine segmentation)
clf_fine = SpectralCluster(sigma=1.12, random_state=1)
clf_fine.fit(X)

print(f"\n{'='*60}")
print("FINE SEGMENTATION (sigma = 1.12)")
print(f"{'='*60}")
print(f"Number of segments detected: {clf_fine.n_clusters_}")
print(f"Algorithm used {clf_fine.eigenvectors_.shape[1]} eigenvectors")
print(f"{'='*60}\n")

In [None]:
# Reshape labels
labels_fine_img = clf_fine.labels_.reshape(numrows, numcols)

# Visualise fine segmentation
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Original spectrogram
im1 = axes[0].imshow(spec_ds, aspect='auto', origin='lower', cmap='hot', interpolation='nearest')
axes[0].set_xlabel('Time Frame', fontsize=12)
axes[0].set_ylabel('Frequency Bin', fontsize=12)
axes[0].set_title('Original Spectrogram', fontsize=13, fontweight='bold')
plt.colorbar(im1, ax=axes[0], label='Energy')

# Segmentation result
im2 = axes[1].imshow(labels_fine_img, aspect='auto', origin='lower', cmap='tab20', interpolation='nearest')
axes[1].set_xlabel('Time Frame', fontsize=12)
axes[1].set_ylabel('Frequency Bin', fontsize=12)
axes[1].set_title(f'Fine Segmentation: {clf_fine.n_clusters_} segments (sigma=1.12)\n(Paper Figure 6c)', 
                  fontsize=13, fontweight='bold')
plt.colorbar(im2, ax=axes[1], label='Segment')

plt.tight_layout()
plt.show()

print(f"Paper Figure 6c reproduced: {clf_fine.n_clusters_} detailed segments.")

In [None]:
# Show temporal segmentation (fine)
fig, ax = plt.subplots(1, 1, figsize=(14, 4))

# Majority-vote cluster label across frequency for each time frame
time_segments_fine = []
for t in range(numcols):
    labels_t = labels_fine_img[:, t]
    most_common = np.bincount(labels_t).argmax()
    time_segments_fine.append(most_common)

time_segments_fine = np.array(time_segments_fine)
ax.plot(time_segments_fine, linewidth=3, marker='o', markersize=4)
ax.set_xlabel('Time Frame', fontsize=12)
ax.set_ylabel('Segment ID', fontsize=12)
ax.set_title('Temporal Segmentation (Fine)', fontsize=13, fontweight='bold')
ax.set_yticks(range(clf_fine.n_clusters_))
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nFiner temporal structure revealed!")

## 4. Side-by-Side Comparison

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Original
im0 = axes[0].imshow(spec_ds, aspect='auto', origin='lower', cmap='hot', interpolation='nearest')
axes[0].set_xlabel('Time')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Original', fontweight='bold')
plt.colorbar(im0, ax=axes[0], fraction=0.046)

# Coarse
im1 = axes[1].imshow(labels_coarse_img, aspect='auto', origin='lower', cmap='tab10', interpolation='nearest')
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Coarse: {clf_coarse.n_clusters_} segments\n(sigma=1.32)', fontweight='bold')
plt.colorbar(im1, ax=axes[1], fraction=0.046)

# Fine
im2 = axes[2].imshow(labels_fine_img, aspect='auto', origin='lower', cmap='tab20', interpolation='nearest')
axes[2].set_xlabel('Time')
axes[2].set_ylabel('Frequency')
axes[2].set_title(f'Fine: {clf_fine.n_clusters_} segments\n(sigma=1.12)', fontweight='bold')
plt.colorbar(im2, ax=axes[2], fraction=0.046)

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("COMPARISON SUMMARY")
print("="*60)
print(f"Coarse (sigma=1.32): {clf_coarse.n_clusters_} segments")
print(f"Fine (sigma=1.12):   {clf_fine.n_clusters_} segments")
print("="*60)

## 5. Eigenspace Visualisation

In [None]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib versions)

fig = plt.figure(figsize=(16, 6))

# Coarse in 3D
if clf_coarse.eigenvectors_.shape[1] >= 3:
    ax1 = fig.add_subplot(121, projection='3d')
    eigenvecs = clf_coarse.eigenvectors_
    scatter = ax1.scatter(eigenvecs[:, 0], eigenvecs[:, 1], eigenvecs[:, 2],
                          c=clf_coarse.labels_, cmap='tab10', s=5, alpha=0.5)
    ax1.scatter(clf_coarse.centers_[:, 0], clf_coarse.centers_[:, 1], clf_coarse.centers_[:, 2],
                c='red', s=100, marker='d', edgecolors='k', linewidths=2, label='Centers')
    ax1.set_xlabel('Eigenvector 1')
    ax1.set_ylabel('Eigenvector 2')
    ax1.set_zlabel('Eigenvector 3')
    ax1.set_title('Coarse: 3D Eigenspace', fontweight='bold')
    ax1.legend()

# Fine in 3D
if clf_fine.eigenvectors_.shape[1] >= 3:
    ax2 = fig.add_subplot(122, projection='3d')
    eigenvecs_fine = clf_fine.eigenvectors_
    scatter = ax2.scatter(eigenvecs_fine[:, 0], eigenvecs_fine[:, 1], eigenvecs_fine[:, 2],
                          c=clf_fine.labels_, cmap='tab20', s=5, alpha=0.5)
    ax2.scatter(clf_fine.centers_[:, 0], clf_fine.centers_[:, 1], clf_fine.centers_[:, 2],
                c='red', s=100, marker='d', edgecolors='k', linewidths=2, label='Centers')
    ax2.set_xlabel('Eigenvector 1')
    ax2.set_ylabel('Eigenvector 2')
    ax2.set_zlabel('Eigenvector 3')
    ax2.set_title('Fine: 3D Eigenspace', fontweight='bold')
    ax2.legend()

plt.tight_layout()
plt.show()

print("\nRadial structure in eigenspace enables automatic segmentation.")

## 6. Summary and Insights

### Results

Spectral clustering segmented the spectrogram at two scales, with the number of segments chosen by the algorithm:
- **Coarse**: Broad phonetic segments (vowel-consonant structure)
- **Fine**: Detailed temporal structure (formant transitions, bursts)

### Speech Processing Context

Traditional speech segmentation typically relies on:
- Phonetic transcription
- Forced alignment with text
- Language-specific models

Spectral clustering, by contrast, segments from acoustic features alone and needs no linguistic knowledge.

### Why It Works

1. **Feature representation**: Each time-frequency point = (time, frequency, weighted energy) vector
2. **Temporal proximity**: Points that are close in time and frequency and have similar energy sit close together in feature space
3. **Phonetic transitions**: Different phonemes have distinct spectral signatures
4. **Automatic detection**: Algorithm finds natural boundaries
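
One way to see why points with similar features end up grouped is to look at a generic spectral embedding on toy data. The sketch below uses the textbook normalised-affinity construction with an assumed Gaussian kernel; it is not the `spectral` package's implementation, and the function name `spectral_embedding` and the toy points are hypothetical.

```python
import numpy as np

def spectral_embedding(X, sigma, n_vectors):
    """Return the top eigenvectors of the normalised affinity D^-1/2 A D^-1/2."""
    # Pairwise squared Euclidean distances
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian affinity with zero diagonal (assumed kernel form)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Symmetric normalisation by the degree of each point
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the largest ones
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, -n_vectors:]

# Two well-separated groups of five points each
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(5, 2))
group_b = rng.normal(loc=10.0, scale=0.1, size=(5, 2))
X_toy = np.vstack([group_a, group_b])

V = spectral_embedding(X_toy, sigma=1.0, n_vectors=2)
# Normalise rows to unit length so only their direction matters
V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)
# Absolute cosine between embedded points: expected to be close to 1 within
# a group and close to 0 across groups
print(np.round(np.abs(V_unit @ V_unit.T), 2))
```

In this toy case, points from the same group lie along a common direction from the origin in eigenspace, and the two directions are nearly orthogonal. This is the kind of radial structure that the eigenspace plots in Section 5 are meant to show.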

### Parameter Effects

- **Smaller sigma** (1.12): More clusters, finer temporal detail
- **Larger sigma** (1.32): Fewer clusters, broader segments
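
One way to see this effect is to compare how quickly the affinity between two points decays with distance for the two sigma values. The sketch assumes a Gaussian kernel $\exp(-d^2/(2\sigma^2))$, which is an assumption about the kernel form and has not been confirmed from the package source; `gaussian_affinity` is a hypothetical helper.

```python
import numpy as np

def gaussian_affinity(d, sigma):
    """Assumed kernel form: exp(-d^2 / (2 * sigma^2))."""
    return np.exp(-np.asarray(d, dtype=float) ** 2 / (2.0 * sigma ** 2))

distances = np.array([1.0, 2.0, 3.0, 4.0])
aff_coarse = gaussian_affinity(distances, sigma=1.32)
aff_fine = gaussian_affinity(distances, sigma=1.12)

for d, c, f in zip(distances, aff_coarse, aff_fine):
    print(f"d = {d:.0f}: sigma=1.32 -> {c:.3f}, sigma=1.12 -> {f:.3f}")
```

Under this kernel the smaller sigma gives a lower affinity at every non-zero distance, so distant points are more weakly connected and the data tend to split into more clusters.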

### Practical Applications

- Speech segmentation without transcription
- Acoustic event detection
- Audio fingerprinting
- Music structure analysis

## Conclusion

This notebook demonstrated:
- Loading and processing spectrograms
- Automatic speech segmentation at multiple scales
- Reproducing paper Figures 6a, 6b, and 6c
- Application to real acoustic data

## References

- Paper: Sanguinetti et al. (2005), Figures 6a, 6b, 6c
- MATLAB implementation: `matlab/demoSpectrogram.m`, `matlab/demoSpectrogram2.m`
- Python implementation: `spectral/cluster.py`