<a href="https://colab.research.google.com/github/joeljose/audio_denoising/blob/main/spec_morph_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Noise Compensation using Spectrogram Morphological Filtering

In this project we apply image-based morphological filtering to audio spectrograms to remove noise. The core idea: a spectrogram is essentially a 2D image in which bright regions represent high-energy signal content and dim regions represent noise. Treating it as an image lets us use classical image-processing techniques (binary thresholding, erosion, and dilation) to separate signal from noise.

**Erosion** (a minimum filter) shrinks bright regions, eliminating small isolated noise blobs. **Dilation** (a maximum filter) then expands the remaining regions back, restoring signal edges that were slightly eroded. The combination produces a binary mask that identifies where the signal lives. We use this mask to amplify signal regions and attenuate noise regions, creating a non-linear time-frequency filter. Finally, we reconstruct the denoised audio from the processed spectrogram using the inverse STFT.
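To make the erosion/dilation idea concrete, here is a minimal sketch on a toy binary image. The array `img`, its size, and the position of the "noise" pixel are made up for illustration, and it uses one iteration of each operation with SciPy's default cross-shaped structuring element, which may differ from the settings used later in this notebook.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

# Hypothetical toy "spectrogram" mask: a 3-row horizontal band (signal)
# plus one isolated pixel (noise).
img = np.zeros((7, 9), dtype=bool)
img[2:5, 1:8] = True   # signal band
img[0, 8] = True       # isolated noise speckle

eroded = binary_erosion(img)        # speckle disappears, band shrinks to its core
opened = binary_dilation(eroded)    # surviving core grows back towards the band

print(img.astype(int))
print(eroded.astype(int))
print(opened.astype(int))
```

The speckle is gone after erosion and never comes back, while the band is largely restored by dilation (only its corners are lost with this structuring element).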

## Imports

In [None]:
from scipy.ndimage import binary_erosion
from scipy.ndimage import binary_dilation
import scipy.signal as signal
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio
import librosa

## Part 1: Synthetic Signal

Let's first demonstrate the technique on a synthetic signal for which we know the ground truth. We build a sequence of five pure sine tones, one second each, and add Gaussian noise.

In [None]:
np.random.seed(42)

fs=10000                                          # sampling frequency in Hz
notes=[837.31,1939.85,1054.94,1939.85,837.31]     # note frequencies in Hz
note_interval= 1                                  # duration of each note in seconds
song_time=len(notes)*note_interval                # Total duration of the song


dt=1/fs
t=np.arange(0,note_interval,dt)

tones=[]

for fund_freq in notes:
  tones.append(np.sin(2 * np.pi * fund_freq * t)) # each note is a sine wave 

song = np.concatenate(tones)                      # play the notes one after another to get our song

Now let's add some noise to the song:

In [None]:
# Create a noise signal with the same shape as the input signal

mu, sigma = 0, 1                                  # mean and standard deviation of the noise
noise = np.random.normal(mu, sigma, song.shape)  # white Gaussian noise, one sample per song sample

noisy_song = song + noise                        # add the noise to the song to get noisy song
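To get a feel for how severe this noise is, we can estimate the signal-to-noise ratio (SNR). The sketch below is an illustration rather than part of the pipeline: it rebuilds a single one-second tone with the same amplitude and noise standard deviation as above, and the names `clean`, `rng`, and `snr_db` are introduced here only for the example. A unit-amplitude sine has power 0.5 and unit-variance noise has power 1, so we expect roughly 10·log10(0.5) ≈ -3 dB.

```python
import numpy as np

rng = np.random.default_rng(0)           # local generator, independent of the notebook's seed
fs = 10000                               # sampling frequency in Hz
t = np.arange(0, 1, 1 / fs)              # one second of samples

clean = np.sin(2 * np.pi * 837.31 * t)   # unit-amplitude tone
noise = rng.normal(0, 1, clean.shape)    # zero-mean, unit-variance Gaussian noise

signal_power = np.mean(clean ** 2)       # close to 0.5 for a unit-amplitude sine
noise_power = np.mean(noise ** 2)        # close to sigma**2 = 1
snr_db = 10 * np.log10(signal_power / noise_power)

print(f"SNR is about {snr_db:.1f} dB")
```

A negative SNR means the noise carries more power than the tone itself, which is why the notes are hard to pick out by ear.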

Let's hear the clean song:

In [None]:
Audio(song, rate=fs)

Now let's hear the noisy signal:

In [None]:
Audio(noisy_song, rate=fs)

### Helper Functions

In [None]:
def draw_spectrogram(time, freq, magnitude, title, cmap='inferno'):
  """Plot a spectrogram from the magnitude of the STFT."""
  fig, ax = plt.subplots(figsize=(10, 5))
  mesh = ax.pcolormesh(time, freq, magnitude, shading='gouraud',
                       cmap=plt.get_cmap(cmap))
  ax.set_ylim([freq[1], freq[-1]])
  ax.set_title('Spectrogram of ' + title)
  ax.set_ylabel('Frequency [Hz]')
  ax.set_xlabel('Time [sec]')
  fig.colorbar(mesh, ax=ax)

def draw_timeseries(input_signal, time):
  """Plot an audio signal waveform."""
  fig, ax = plt.subplots(figsize=(10, 5))
  ax.plot(time, input_signal, color='c', linewidth=1.5, label="input")
  ax.set_xlim(time[0], time[-1])
  ax.set_title('Signal')
  ax.set_ylabel('Amplitude')
  ax.set_xlabel('Time [sec]')

### Spectrograms

We create spectrograms by computing the Short-Time Fourier Transform (STFT) of the signal and plotting its magnitude. The STFT divides the signal into overlapping segments and computes the FFT of each, producing a time-frequency representation.

In [None]:
nperseg = 512                                             # no. of samples in a segment

freq, ti, Zxx = signal.stft(song, fs=fs, nperseg=nperseg) # compute the STFT

draw_spectrogram(ti,freq,np.abs(Zxx),'song')              # we use our helper function to draw the spectrogram

In [None]:
nperseg = 512

freq, ti, Zxx = signal.stft(noisy_song, fs=fs, nperseg=nperseg)

draw_spectrogram(ti,freq,np.abs(Zxx),'noisy song')

The magnitude of the complex-valued STFT (Zxx) gives us the spectrogram. Comparing the two, the original signal shows clean horizontal lines at each note's frequency, while the noisy version has energy scattered everywhere. The goal is to keep only the bright regions (high energy = signal) and suppress the dim regions (low energy = noise).
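As a small check that a note really does show up as a horizontal line at its frequency, the sketch below takes the STFT of a single clean tone and looks for the brightest frequency bin in one frame. The variable names (`tone`, `mid`, `peak_freq`) are introduced only for this example. With `nperseg = 512` at 10 kHz the bins are `fs / nperseg` ≈ 19.5 Hz apart, so the peak can only be located to within about one bin.

```python
import numpy as np
import scipy.signal as signal

fs = 10000                                # sampling frequency in Hz
nperseg = 512                             # samples per STFT segment
t = np.arange(0, 1, 1 / fs)
tone = np.sin(2 * np.pi * 837.31 * t)     # one clean note

freq, ti, Zxx = signal.stft(tone, fs=fs, nperseg=nperseg)
magnitude = np.abs(Zxx)                   # rows = frequency bins, columns = time frames

bin_spacing = fs / nperseg                # frequency resolution in Hz
mid = magnitude.shape[1] // 2             # a frame away from the zero-padded edges
peak_freq = freq[np.argmax(magnitude[:, mid])]

print(f"{magnitude.shape[0]} frequency bins, {bin_spacing:.2f} Hz apart")
print(f"brightest bin at {peak_freq:.1f} Hz")
```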

### The Denoising Algorithm

The `morph_denoise` function takes the complex-valued STFT matrix (Zxx) and returns a denoised version (Rxx). The process has four steps:

1. **Grayscale image**: Scale the STFT magnitude to 0–255, treating the spectrogram as a grayscale image.
2. **Binary thresholding**: Pixels above the threshold are marked as signal (1), below as noise (0). The threshold controls how aggressive the separation is.
3. **Morphological processing**: Erosion removes small isolated noise blobs by shrinking all white regions. Dilation then restores the signal regions that were slightly eroded. The result is a clean binary mask.
4. **Masking**: Multiply signal regions (mask=1) by `amp` to boost them, and divide noise regions (mask=0) by `amp` to suppress them.
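Step 4 can be illustrated on a tiny complex array. This is only a sketch with made-up values for `Z` and `mask`; it shows that multiplying or dividing by a real `amp` changes the magnitude by ±20·log10(amp) dB while leaving the phase untouched, which matters because the phase is needed later for the inverse STFT.

```python
import numpy as np

amp = 10.0

# Hypothetical 2x2 STFT patch and mask (True = signal, False = noise)
Z = np.array([[1 + 1j, 0.1 - 0.2j],
              [0.05j,  2 - 1j]])
mask = np.array([[True,  False],
                 [False, True]])

# Boost masked bins, attenuate the rest
R = np.where(mask, Z * amp, Z / amp)

gain_db = 20 * np.log10(np.abs(R) / np.abs(Z))   # +20 dB where mask is True, -20 dB elsewhere
phase_unchanged = np.allclose(np.angle(R), np.angle(Z))

print(gain_db)
print(phase_unchanged)
```

With `amp = 10`, signal and noise regions end up 40 dB further apart than they were before masking.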

In [None]:
def morph_denoise(Zxx, threshold, amp, plot=True):
  """Apply morphological denoising to a complex-valued STFT matrix.

  Parameters
  ----------
  Zxx : ndarray
      Complex STFT matrix.
  threshold : float
      Binary threshold applied to the grayscale spectrogram (0-255).
  amp : float
      Amplification/attenuation factor for masked/unmasked regions.
  plot : bool
      If True, display intermediate processing images.

  Returns
  -------
  Rxx : ndarray
      Denoised complex STFT matrix.
  """
  zmax = np.max(np.abs(Zxx))
  gray_image = np.abs(Zxx) * (255 / zmax)

  thresh_image = np.where(gray_image >= threshold, 1, 0)  # binary thresholding

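  # Erosion removes small isolated speckles; the surviving regions are then
  # regrown with two dilation passes, so they end up slightly larger than before.
  mask_image = binary_erosion(thresh_image, iterations=1)
  mask_image = binary_dilation(mask_image, iterations=2)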

  if plot:
    fig, axis_arr = plt.subplots(1, 3, figsize=(15, 5))
    axis_arr[0].set_title("grayscale spectrogram")
    axis_arr[0].imshow(gray_image, cmap=plt.cm.gray)
    axis_arr[1].set_title("After binary thresholding")
    axis_arr[1].imshow(thresh_image, cmap=plt.cm.gray)
    axis_arr[2].set_title("After Morphological filtering")
    axis_arr[2].imshow(mask_image, cmap=plt.cm.gray)
    plt.tight_layout()

  Rxx = np.where(mask_image == 1, Zxx * amp, Zxx / amp)

  return Rxx

Now let's apply the denoise function on Zxx to get Rxx.

In [None]:
Rxx = morph_denoise(Zxx, threshold=40, amp=10)

The images above show the processing pipeline: the grayscale spectrogram, the binary thresholded image, and the final mask after morphological filtering. Small noise speckles are removed by erosion while the main signal bands are preserved and restored by dilation.

The images appear vertically flipped relative to the spectrograms because `imshow` draws row 0 (the lowest frequency bin) at the top, whereas the spectrogram plots put the lowest frequencies at the bottom. This affects only the display, not the processing.

Let's look at the recovered spectrogram:

In [None]:
draw_spectrogram(ti, freq, np.abs(Rxx), 'recovered song')

### Reconstruction

Now let's reconstruct the audio signal from Rxx by applying the inverse STFT.

In [None]:
_, xrec = signal.istft(Rxx, fs)

Audio(xrec, rate=fs)
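It is worth checking that the STFT/inverse-STFT pair does not itself distort the audio. The sketch below round-trips an unmodified clean tone using the same `nperseg` and SciPy's default Hann window with 50% overlap, and measures the reconstruction error. The names `x`, `x_rec`, and `max_error` are introduced only for this check.

```python
import numpy as np
import scipy.signal as signal

fs = 10000
nperseg = 512
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 837.31 * t)

# Forward and inverse transform with no processing in between
_, _, Zxx = signal.stft(x, fs=fs, nperseg=nperseg)
_, x_rec = signal.istft(Zxx, fs=fs, nperseg=nperseg)

x_rec = x_rec[:len(x)]                    # drop the trailing samples added by zero-padding
max_error = np.max(np.abs(x - x_rec))

print(f"max reconstruction error: {max_error:.2e}")
```

The error is at the level of floating-point rounding, so any audible change in the recovered song comes from the masking step rather than from the transform.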

## Part 2: Visual Mic Application

This technique was developed as part of the [Visual Mic](https://github.com/joeljose/Visual-Mic) project, which extracts audio from high-speed video of vibrating surfaces. The recovered audio signal is extremely noisy because the vibrations captured by the camera are tiny.

The signal has a native sample rate of only 2200 Hz (limited by the camera frame rate), meaning all signal content is below 1100 Hz. We process at this native rate so the STFT covers exactly 0–1100 Hz and the signal fills the entire spectrogram, giving the morphological operations a much better image to work with. We resample to 8000 Hz only for audio playback in the browser.
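The arithmetic behind this choice can be sketched in a few lines. The one-sided frequency axis of a 512-sample segment is computed here with `np.fft.rfftfreq`; the names `fs_native` and `occupied_fraction` are introduced only for the example, and 512 is assumed as the segment length to match the rest of the notebook.

```python
import numpy as np

fs_native = 2200       # native sample rate in Hz
fs_playback = 8000     # playback rate in Hz
nperseg = 512          # assumed STFT segment length

freq = np.fft.rfftfreq(nperseg, d=1 / fs_native)   # one-sided frequency axis
nyquist = fs_native / 2                            # highest representable frequency
bin_spacing = fs_native / nperseg                  # Hz per spectrogram row

# If we processed at the playback rate instead, the 0-1100 Hz content would
# occupy only this fraction of the spectrogram's rows:
occupied_fraction = nyquist / (fs_playback / 2)

print(f"{len(freq)} bins from {freq[0]:.0f} to {freq[-1]:.0f} Hz, {bin_spacing:.2f} Hz apart")
print(f"fraction of rows used at {fs_playback} Hz: {occupied_fraction:.3f}")
```

At the native rate every row of the image can carry signal, whereas at 8000 Hz almost three quarters of the rows would lie above 1100 Hz, where the recording has no content.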

### Downloading the Audio

In [None]:
import os
import urllib.request

wav_url = "https://github.com/joeljose/assets/raw/master/audio_denoising/visualmic.wav"
wav_path = "visualmic.wav"

if not os.path.exists(wav_path):
    print("Downloading visualmic.wav ...")
    urllib.request.urlretrieve(wav_url, wav_path)
    print("Done.")
else:
    print("visualmic.wav already exists, skipping download.")

### Loading and Inspecting the Signal

Let's load the audio file at its native sample rate, print its properties, and plot the waveform.

In [None]:
noisy_signal, fs_wav = librosa.load("visualmic.wav", sr=None)

n = len(noisy_signal)
dt = 1 / fs_wav
tot_time = n * dt                    # total duration in seconds
print(f'Total playback time: {tot_time:.2f} seconds')
print(f'Total no. of samples: {n}')
print(f'Sampling frequency: {fs_wav} Hz')

# One time stamp per sample, so the time axis always matches the signal length
t = np.arange(n) * dt

# Plot the signal
draw_timeseries(noisy_signal, t)

Let's hear it. The native 2200 Hz rate is too low for browser playback, so we resample to 8000 Hz:

In [None]:
PLAYBACK_SR = 8000
Audio(librosa.resample(noisy_signal, orig_sr=fs_wav, target_sr=PLAYBACK_SR), rate=PLAYBACK_SR)

### Spectrogram

Since we loaded at the native 2200 Hz sample rate, the STFT covers 0–1100 Hz and the signal fills the entire spectrogram.

In [None]:
nperseg = 512

freq, ti, Zxx = signal.stft(noisy_signal, fs=fs_wav, nperseg=nperseg)

draw_spectrogram(ti, freq, np.abs(Zxx), 'noisy signal obtained from visual mic')

### Applying Morphological Denoising

With the signal filling the full spectrogram, the morphological operations can effectively distinguish signal from noise.

In [None]:
Rxx = morph_denoise(Zxx, threshold=20, amp=10)

### Recovered Spectrogram

In [None]:
draw_spectrogram(ti, freq, np.abs(Rxx), 'recovered signal from visual mic')

### Reconstruction

Let's reconstruct the denoised audio using the inverse STFT and resample for playback.

In [None]:
_, xrec = signal.istft(Rxx, fs_wav)
Audio(librosa.resample(xrec, orig_sr=fs_wav, target_sr=PLAYBACK_SR), rate=PLAYBACK_SR)

The reconstructed audio is a much clearer version of "Mary Had a Little Lamb."