# Audio Preprocessing

This notebook demonstrates how to preprocess audio files using the CTC-SpeechRefinement package. We'll explore various preprocessing techniques including normalization, silence removal, Voice Activity Detection (VAD), noise reduction, and frequency normalization.

## Setup

First, let's import the necessary libraries and set up the environment.

In [None]:
# Add the project root to the Python path
import sys
import os
sys.path.append(os.path.abspath('..'))

# Import libraries (only those used in this notebook)
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio, display

# Import from the project
from ctc_speech_refinement.core.preprocessing.audio import preprocess_audio, load_audio
from ctc_speech_refinement.core.preprocessing.vad import apply_vad, energy_vad, zcr_vad
from ctc_speech_refinement.core.preprocessing.noise_reduction import reduce_noise
from ctc_speech_refinement.core.preprocessing.frequency_normalization import normalize_frequency

# Set up plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100

## Load Audio Data

Let's load an audio file and examine its basic properties.

In [None]:
# Define the path to an audio file
audio_file = "../data/test1/test1_01.wav"  # Update this path to your audio file

# Load the audio file
audio_data, sample_rate = load_audio(audio_file)

# Print basic information
print(f"Audio file: {audio_file}")
print(f"Sample rate: {sample_rate} Hz")
print(f"Duration: {len(audio_data) / sample_rate:.2f} seconds")
print(f"Number of samples: {len(audio_data)}")

# Play the audio
display(Audio(audio_data, rate=sample_rate))

## Visualize Original Waveform

Let's visualize the waveform of the original audio file.

In [None]:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(audio_data, sr=sample_rate)
plt.title('Original Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.show()

## 1. Amplitude Normalization

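Let's standardize the audio samples to zero mean and unit variance (z-score normalization). Note that the result is not confined to [-1, 1], and that `Audio` rescales playback to full range by default, so the normalized clip may sound no louder than the original.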

In [None]:
# Define a function to normalize audio data
def normalize_audio(audio_data):
    mean = np.mean(audio_data)
    std = np.std(audio_data)
    if std > 0:
        normalized_audio = (audio_data - mean) / std
    else:
        normalized_audio = audio_data - mean
    return normalized_audio

# Normalize the audio data
normalized_audio = normalize_audio(audio_data)

# Print statistics before and after normalization
print("Before normalization:")
print(f"Mean: {np.mean(audio_data):.6f}")
print(f"Std Dev: {np.std(audio_data):.6f}")
print(f"Min: {np.min(audio_data):.6f}")
print(f"Max: {np.max(audio_data):.6f}")
print("\nAfter normalization:")
print(f"Mean: {np.mean(normalized_audio):.6f}")
print(f"Std Dev: {np.std(normalized_audio):.6f}")
print(f"Min: {np.min(normalized_audio):.6f}")
print(f"Max: {np.max(normalized_audio):.6f}")

# Play the normalized audio
display(Audio(normalized_audio, rate=sample_rate))

In [None]:
# Plot original and normalized waveforms
plt.figure(figsize=(14, 8))

plt.subplot(2, 1, 1)
librosa.display.waveshow(audio_data, sr=sample_rate)
plt.title('Original Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(2, 1, 2)
librosa.display.waveshow(normalized_audio, sr=sample_rate)
plt.title('Normalized Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.tight_layout()
plt.show()
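
Z-score normalization can leave samples outside [-1, 1], which matters if the result is later written to a fixed-point WAV file. As a minimal sketch of an alternative, peak normalization scales the signal so that its largest absolute sample hits a chosen target. The helper `peak_normalize` and the target of 0.95 are assumptions made for illustration; neither comes from the CTC-SpeechRefinement package.

```python
import numpy as np

def peak_normalize(signal, target_peak=0.95):
    """Scale a signal so its largest absolute sample equals target_peak.

    Hypothetical helper for illustration; not part of the project API.
    """
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal)) if signal.size else 0.0
    if peak == 0.0:
        # Silent (or empty) input: there is nothing to scale
        return signal.copy()
    return signal * (target_peak / peak)

# Synthetic example: a quiet 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
quiet_tone = 0.1 * np.sin(2 * np.pi * 440 * t)
scaled_tone = peak_normalize(quiet_tone)

print(f"Peak before: {np.max(np.abs(quiet_tone)):.3f}")
print(f"Peak after:  {np.max(np.abs(scaled_tone)):.3f}")
```

Unlike z-score normalization, this keeps the waveform shape and offset unchanged and only adjusts the gain.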

## 2. Silence Removal

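Let's remove silent regions from the audio data. `librosa.effects.split` returns the intervals whose level is within `top_db` decibels of the signal's peak; everything quieter is treated as silence and dropped.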

In [None]:
# Define a function to remove silence
def remove_silence(audio_data, sample_rate, top_db=60, frame_length=2048, hop_length=512):
    # sample_rate is unused here; it is kept so the call matches the other steps
    # Intervals (in samples) whose level is within top_db dB of the peak
    non_silent_intervals = librosa.effects.split(
        audio_data, top_db=top_db, frame_length=frame_length, hop_length=hop_length
    )

    if len(non_silent_intervals) == 0:
        print("No non-silent intervals found")
        return audio_data

    # Join the non-silent slices; np.concatenate keeps the input dtype
    return np.concatenate([audio_data[start:end] for start, end in non_silent_intervals])

# Remove silence from the audio data
audio_without_silence = remove_silence(audio_data, sample_rate)

# Print duration before and after silence removal
print(f"Duration before silence removal: {len(audio_data) / sample_rate:.2f} seconds")
print(f"Duration after silence removal: {len(audio_without_silence) / sample_rate:.2f} seconds")
print(f"Reduction: {(1 - len(audio_without_silence) / len(audio_data)) * 100:.2f}%")

# Play the audio without silence
display(Audio(audio_without_silence, rate=sample_rate))

In [None]:
# Plot original and silence-removed waveforms
plt.figure(figsize=(14, 8))

plt.subplot(2, 1, 1)
librosa.display.waveshow(audio_data, sr=sample_rate)
plt.title('Original Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(2, 1, 2)
librosa.display.waveshow(audio_without_silence, sr=sample_rate)
plt.title('Waveform without Silence')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.tight_layout()
plt.show()
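
To see what the `top_db` threshold means, it can help to compute frame levels by hand. The sketch below is a simplified version of the idea, not librosa's exact implementation (librosa also pads and centers frames). The helper name `frame_rms_db` and the synthetic signal are assumptions for illustration.

```python
import numpy as np

def frame_rms_db(signal, frame_length=2048, hop_length=512):
    """Per-frame RMS level in dB relative to the loudest frame (0 dB).

    Hypothetical helper; assumes the signal is at least one frame long.
    """
    n_frames = 1 + max(0, len(signal) - frame_length) // hop_length
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop_length : i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    rms = np.maximum(rms, 1e-10)  # avoid taking the log of zero
    return 20 * np.log10(rms / rms.max())

# Synthetic example: 1 s of very faint noise followed by 1 s of a 220 Hz tone
sr = 16000
rng = np.random.default_rng(0)
faint_noise = 1e-4 * rng.standard_normal(sr)
loud_tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
demo_signal = np.concatenate([faint_noise, loud_tone])

levels_db = frame_rms_db(demo_signal)
is_active = levels_db > -60  # same role as top_db=60
print(f"{is_active.sum()} of {len(is_active)} frames are above the threshold")
```

Lowering the threshold (for example to -40 dB) removes more of the signal, so quiet speech may be cut along with the silence.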

## 3. Voice Activity Detection (VAD)

Let's apply Voice Activity Detection to extract speech segments from the audio data, first with an energy-based detector and then with one based on the zero-crossing rate (ZCR).

In [None]:
# Apply energy-based VAD
speech_regions_energy = energy_vad(audio_data, sample_rate)

# Print speech regions
print("Speech regions (energy-based VAD):")
for i, (start, end) in enumerate(speech_regions_energy):
    print(f"Region {i+1}: {start:.2f}s - {end:.2f}s (duration: {end-start:.2f}s)")

# Apply VAD to extract speech segments
speech_audio_energy = apply_vad(audio_data, sample_rate, method="energy")

# Print duration before and after VAD
print(f"\nDuration before VAD: {len(audio_data) / sample_rate:.2f} seconds")
print(f"Duration after VAD: {len(speech_audio_energy) / sample_rate:.2f} seconds")
print(f"Reduction: {(1 - len(speech_audio_energy) / len(audio_data)) * 100:.2f}%")

# Play the speech audio
display(Audio(speech_audio_energy, rate=sample_rate))
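
For intuition, here is a minimal sketch of how an energy-based detector can work: compute short-time energy per frame, mark frames above a threshold as speech, and merge consecutive speech frames into regions. The function `simple_energy_vad`, its frame sizes, and the fixed threshold ratio are assumptions for illustration; the package's `energy_vad` may use a different rule.

```python
import numpy as np

def simple_energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_ratio=0.1):
    """Return a list of (start, end) speech regions in seconds.

    Hypothetical sketch; not the implementation behind energy_vad.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, len(signal) - frame_len) // hop_len
    energy = np.array([
        np.mean(signal[i * hop_len : i * hop_len + frame_len] ** 2)
        for i in range(n_frames)
    ])
    # Assumption: frames above a fixed fraction of the peak energy count as speech
    active = energy > threshold_ratio * energy.max()

    regions = []
    start = None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i  # a speech region begins
        elif not flag and start is not None:
            # The region ends at the end of the last active frame
            end_sample = (i - 1) * hop_len + frame_len
            regions.append((start * hop_len / sample_rate, end_sample / sample_rate))
            start = None
    if start is not None:
        regions.append((start * hop_len / sample_rate, len(signal) / sample_rate))
    return regions

# Synthetic example: 0.5 s silence, 1 s of a 200 Hz tone, 0.5 s silence
sr = 16000
demo = np.zeros(2 * sr)
demo[sr // 2 : sr // 2 + sr] = 0.5 * np.sin(2 * np.pi * 200 * np.arange(sr) / sr)
for start, end in simple_energy_vad(demo, sr):
    print(f"Speech from {start:.2f}s to {end:.2f}s")
```

A threshold relative to the peak is simple but sensitive to loud transients; adaptive thresholds based on an estimated noise floor are a common refinement.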

In [None]:
# Apply ZCR-based VAD
speech_regions_zcr = zcr_vad(audio_data, sample_rate)

# Print speech regions
print("Speech regions (ZCR-based VAD):")
for i, (start, end) in enumerate(speech_regions_zcr):
    print(f"Region {i+1}: {start:.2f}s - {end:.2f}s (duration: {end-start:.2f}s)")

# Apply VAD to extract speech segments
speech_audio_zcr = apply_vad(audio_data, sample_rate, method="zcr")

# Print duration before and after VAD
print(f"\nDuration before VAD: {len(audio_data) / sample_rate:.2f} seconds")
print(f"Duration after VAD: {len(speech_audio_zcr) / sample_rate:.2f} seconds")
print(f"Reduction: {(1 - len(speech_audio_zcr) / len(audio_data)) * 100:.2f}%")

# Play the speech audio
display(Audio(speech_audio_zcr, rate=sample_rate))
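
The zero-crossing rate counts how often the waveform changes sign within a frame. Voiced speech tends to have a low ZCR, while noise and unvoiced sounds tend to have a high one. The sketch below shows only how the feature itself can be computed; the helper `zero_crossing_rate` and its frame sizes are assumptions, and the thresholds that `zcr_vad` applies are not shown here.

```python
import numpy as np

def zero_crossing_rate(signal, frame_length=400, hop_length=160):
    """Fraction of adjacent sample pairs in each frame whose signs differ.

    Hypothetical helper; assumes the signal is at least one frame long.
    """
    n_frames = 1 + max(0, len(signal) - frame_length) // hop_length
    rates = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_length : i * hop_length + frame_length]
        signs = np.signbit(frame)
        rates[i] = np.mean(signs[1:] != signs[:-1])
    return rates

# Synthetic comparison: a 100 Hz tone versus white noise, both at 16 kHz
sr = 16000
rng = np.random.default_rng(0)
low_tone = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)
white_noise = rng.standard_normal(sr)

tone_zcr = zero_crossing_rate(low_tone)
noise_zcr = zero_crossing_rate(white_noise)
print(f"Mean ZCR of tone:  {tone_zcr.mean():.3f}")
print(f"Mean ZCR of noise: {noise_zcr.mean():.3f}")
```

Because ZCR ignores amplitude, it is often combined with an energy measure rather than used alone.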

In [None]:
# Plot original and VAD-processed waveforms
plt.figure(figsize=(14, 12))

plt.subplot(3, 1, 1)
librosa.display.waveshow(audio_data, sr=sample_rate)
plt.title('Original Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(3, 1, 2)
librosa.display.waveshow(speech_audio_energy, sr=sample_rate)
plt.title('Waveform after Energy-based VAD')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(3, 1, 3)
librosa.display.waveshow(speech_audio_zcr, sr=sample_rate)
plt.title('Waveform after ZCR-based VAD')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.tight_layout()
plt.show()

## 4. Noise Reduction

Let's compare three noise reduction methods offered by `reduce_noise`: spectral subtraction, a Wiener filter, and a median filter.

In [None]:
# Apply spectral subtraction for noise reduction
denoised_audio_spectral = reduce_noise(audio_data, sample_rate, method="spectral_subtraction")

# Play the denoised audio
print("Denoised audio (spectral subtraction):")
display(Audio(denoised_audio_spectral, rate=sample_rate))

In [None]:
# Apply Wiener filter for noise reduction
denoised_audio_wiener = reduce_noise(audio_data, sample_rate, method="wiener")

# Play the denoised audio
print("Denoised audio (Wiener filter):")
display(Audio(denoised_audio_wiener, rate=sample_rate))

In [None]:
# Apply median filter for noise reduction
denoised_audio_median = reduce_noise(audio_data, sample_rate, method="median")

# Play the denoised audio
print("Denoised audio (median filter):")
display(Audio(denoised_audio_median, rate=sample_rate))

In [None]:
# Plot original and denoised waveforms
plt.figure(figsize=(14, 12))

plt.subplot(4, 1, 1)
librosa.display.waveshow(audio_data, sr=sample_rate)
plt.title('Original Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(4, 1, 2)
librosa.display.waveshow(denoised_audio_spectral, sr=sample_rate)
plt.title('Waveform after Spectral Subtraction')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(4, 1, 3)
librosa.display.waveshow(denoised_audio_wiener, sr=sample_rate)
plt.title('Waveform after Wiener Filter')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(4, 1, 4)
librosa.display.waveshow(denoised_audio_median, sr=sample_rate)
plt.title('Waveform after Median Filter')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.tight_layout()
plt.show()
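
To make the first of these methods more concrete, here is a minimal sketch of spectral subtraction: estimate an average noise magnitude spectrum, subtract it from every frame, and resynthesize with the original phase. The function `spectral_subtraction_sketch` and the assumption that the first few frames contain only noise are ours; this is not the implementation behind `reduce_noise`.

```python
import numpy as np

def spectral_subtraction_sketch(signal, frame_len=512, noise_frames=6):
    """Basic magnitude spectral subtraction with 50% overlap-add.

    Hypothetical sketch; assumes the signal is longer than one frame and
    that the first `noise_frames` frames contain noise only.
    """
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by half a frame sum to one
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)
    ])
    spectra = np.fft.rfft(frames, axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)

    # Average noise magnitude per frequency bin
    noise_profile = magnitude[:noise_frames].mean(axis=0)
    # Subtract the noise estimate and clip negative magnitudes to zero
    cleaned = np.maximum(magnitude - noise_profile, 0.0)

    cleaned_frames = np.fft.irfft(cleaned * np.exp(1j * phase), n=frame_len, axis=1)
    output = np.zeros(len(signal))
    for i, frame in enumerate(cleaned_frames):
        output[i * hop : i * hop + frame_len] += frame  # overlap-add
    return output

def rms(x):
    return np.sqrt(np.mean(x ** 2))

# Synthetic example: 1 s of noise only, then 1 s of a 440 Hz tone plus noise
sr = 16000
rng = np.random.default_rng(0)
noisy = 0.05 * rng.standard_normal(2 * sr)
noisy[sr:] += 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
denoised = spectral_subtraction_sketch(noisy)

noise_only = slice(512, sr - 512)
print(f"Noise RMS before: {rms(noisy[noise_only]):.4f}")
print(f"Noise RMS after:  {rms(denoised[noise_only]):.4f}")
```

Clipping to zero is what produces the "musical noise" artifacts often heard with this method; practical variants usually keep a small spectral floor instead.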

## 5. Frequency Normalization

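Let's compare the four frequency normalization methods offered by `normalize_frequency`: a bandpass filter, pre-emphasis, spectral equalization, and the combined mode.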

In [None]:
# Apply bandpass filter for frequency normalization
normalized_audio_bandpass = normalize_frequency(audio_data, sample_rate, method="bandpass")

# Play the frequency-normalized audio
print("Frequency-normalized audio (bandpass filter):")
display(Audio(normalized_audio_bandpass, rate=sample_rate))

In [None]:
# Apply pre-emphasis for frequency normalization
normalized_audio_preemphasis = normalize_frequency(audio_data, sample_rate, method="preemphasis")

# Play the frequency-normalized audio
print("Frequency-normalized audio (pre-emphasis):")
display(Audio(normalized_audio_preemphasis, rate=sample_rate))

In [None]:
# Apply spectral equalization for frequency normalization
normalized_audio_equalize = normalize_frequency(audio_data, sample_rate, method="equalize")

# Play the frequency-normalized audio
print("Frequency-normalized audio (spectral equalization):")
display(Audio(normalized_audio_equalize, rate=sample_rate))

In [None]:
# Apply combined methods for frequency normalization
normalized_audio_combined = normalize_frequency(audio_data, sample_rate, method="combined")

# Play the frequency-normalized audio
print("Frequency-normalized audio (combined methods):")
display(Audio(normalized_audio_combined, rate=sample_rate))

In [None]:
# Plot original and frequency-normalized waveforms
plt.figure(figsize=(14, 15))

plt.subplot(5, 1, 1)
librosa.display.waveshow(audio_data, sr=sample_rate)
plt.title('Original Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(5, 1, 2)
librosa.display.waveshow(normalized_audio_bandpass, sr=sample_rate)
plt.title('Waveform after Bandpass Filter')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(5, 1, 3)
librosa.display.waveshow(normalized_audio_preemphasis, sr=sample_rate)
plt.title('Waveform after Pre-emphasis')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(5, 1, 4)
librosa.display.waveshow(normalized_audio_equalize, sr=sample_rate)
plt.title('Waveform after Spectral Equalization')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(5, 1, 5)
librosa.display.waveshow(normalized_audio_combined, sr=sample_rate)
plt.title('Waveform after Combined Methods')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.tight_layout()
plt.show()
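
Of these methods, pre-emphasis is the simplest to write out. It is a first-order high-pass filter, y[n] = x[n] - a * x[n-1], that boosts high frequencies relative to low ones. The sketch below uses a = 0.97, a common choice in speech processing; the helper `pre_emphasis` and that coefficient are assumptions, and the value used by `normalize_frequency` may differ.

```python
import numpy as np

def pre_emphasis(signal, coefficient=0.97):
    """Apply y[n] = x[n] - coefficient * x[n-1], keeping the first sample as is.

    Hypothetical helper for illustration; not part of the project API.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size == 0:
        return signal.copy()
    return np.concatenate(([signal[0]], signal[1:] - coefficient * signal[:-1]))

# A constant (lowest-frequency) signal is almost cancelled ...
constant = np.ones(4)
print(pre_emphasis(constant))

# ... while a signal alternating every sample (highest frequency) is amplified
alternating = np.array([1.0, -1.0, 1.0, -1.0])
print(pre_emphasis(alternating))
```

This explains why a pre-emphasized waveform typically looks smaller in amplitude: most speech energy sits at low frequencies, which the filter attenuates.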

## 6. Combined Preprocessing Pipeline

Let's apply the steps above in a single call. `preprocess_audio` takes the file path rather than the loaded array, and each step is switched on with its own flag.

In [None]:
# Apply the full preprocessing pipeline
preprocessed_audio, preprocessed_sample_rate = preprocess_audio(
    audio_file,
    normalize=True,
    remove_silence_flag=True,
    apply_vad_flag=True,
    vad_method="energy",
    reduce_noise_flag=True,
    noise_reduction_method="spectral_subtraction",
    normalize_frequency_flag=True,
    frequency_normalization_method="bandpass"
)

# Print duration before and after preprocessing
# Compare durations rather than sample counts, in case the pipeline resamples the audio
duration_before = len(audio_data) / sample_rate
duration_after = len(preprocessed_audio) / preprocessed_sample_rate
print(f"Duration before preprocessing: {duration_before:.2f} seconds")
print(f"Duration after preprocessing: {duration_after:.2f} seconds")
print(f"Reduction: {(1 - duration_after / duration_before) * 100:.2f}%")

# Play the preprocessed audio
display(Audio(preprocessed_audio, rate=preprocessed_sample_rate))

In [None]:
# Plot original and fully preprocessed waveforms
plt.figure(figsize=(14, 8))

plt.subplot(2, 1, 1)
librosa.display.waveshow(audio_data, sr=sample_rate)
plt.title('Original Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.subplot(2, 1, 2)
librosa.display.waveshow(preprocessed_audio, sr=preprocessed_sample_rate)
plt.title('Fully Preprocessed Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

plt.tight_layout()
plt.show()

## 7. Spectrograms Before and After Preprocessing

Let's compare the spectrograms of the original and preprocessed audio.

In [None]:
# Compute spectrograms
D_original = librosa.amplitude_to_db(np.abs(librosa.stft(audio_data, n_fft=2048, hop_length=512)), ref=np.max)
D_preprocessed = librosa.amplitude_to_db(np.abs(librosa.stft(preprocessed_audio, n_fft=2048, hop_length=512)), ref=np.max)

# Plot spectrograms
plt.figure(figsize=(14, 8))

plt.subplot(2, 1, 1)
librosa.display.specshow(D_original, sr=sample_rate, x_axis='time', y_axis='log', hop_length=512)
plt.colorbar(format='%+2.0f dB')
plt.title('Original Spectrogram')

plt.subplot(2, 1, 2)
librosa.display.specshow(D_preprocessed, sr=preprocessed_sample_rate, x_axis='time', y_axis='log', hop_length=512)
plt.colorbar(format='%+2.0f dB')
plt.title('Preprocessed Spectrogram')

plt.tight_layout()
plt.show()
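
When reading these plots, keep in mind that `ref=np.max` sets 0 dB to the loudest bin of each spectrogram separately, so the two color scales are relative rather than absolute. The sketch below reproduces the conversion in plain NumPy to make that explicit. The helper name `amplitude_to_db_sketch` is an assumption, and the `amin` and `top_db` values mirror librosa's documented defaults.

```python
import numpy as np

def amplitude_to_db_sketch(magnitude, amin=1e-5, top_db=80.0):
    """Convert magnitudes to dB relative to the largest value.

    Hypothetical helper that mimics amplitude_to_db with ref=np.max.
    """
    magnitude = np.abs(np.asarray(magnitude, dtype=np.float64))
    reference = max(amin, magnitude.max())
    # 20 * log10 of the amplitude ratio; amin guards against log of zero
    db = 20.0 * np.log10(np.maximum(amin, magnitude) / reference)
    # Limit the dynamic range to top_db below the peak
    return np.maximum(db, db.max() - top_db)

example_db = amplitude_to_db_sketch([1.0, 0.1, 0.0])
print(example_db)
```

To compare absolute levels between the original and preprocessed audio, one option is to pass the same fixed `ref` value to both conversions instead of `np.max`.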

## Conclusion

In this notebook, we've explored various audio preprocessing techniques including amplitude normalization, silence removal, Voice Activity Detection (VAD), noise reduction, and frequency normalization. We've also applied a combined preprocessing pipeline to the audio data.

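These preprocessing techniques can improve the quality of audio input for speech recognition by suppressing noise, trimming non-speech segments, and normalizing the signal's amplitude and frequency content. How much each step helps depends on the recording, so it is worth listening to the output and comparing the plots before enabling a step in the pipeline.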