# Audio and Advanced Signal Processing
- Audio I/O, spectral features, advanced techniques
- Real examples: Audio processing, Voice activity detection

In [1]:
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
print('Audio processing module loaded')

Audio processing module loaded


## Audio Signal Basics
- **Sample rate**: 44.1 kHz (CD quality), 48 kHz (professional)
- **Bit depth**: 16-bit (CD), 24-bit (professional)
- **Channels**: Mono (1), Stereo (2), Surround (5.1, 7.1)

**Nyquist**: Human hearing spans roughly 20 Hz to 20 kHz.
To capture a frequency f without aliasing, the sample rate must exceed 2f, so full-band audio is sampled above 40 kHz.
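
To illustrate the Nyquist limit above, here is a minimal sketch of aliasing. It assumes a 30 kHz test tone (a value chosen only for illustration) sampled at 44.1 kHz. Because 30 kHz is above the 22.05 kHz Nyquist frequency, the tone should fold back to about fs - f = 14.1 kHz.

```python
import numpy as np

fs = 44100                      # sample rate in Hz
n = fs                          # one second of samples, so FFT bins are ~1 Hz apart
t = np.arange(n) / fs

f_true = 30000                  # illustrative tone above the Nyquist frequency (fs / 2)
x = np.sin(2 * np.pi * f_true * t)

# Locate the strongest bin of the one-sided spectrum
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(n, 1 / fs)
f_apparent = freqs[np.argmax(spectrum)]

print(f'True frequency:     {f_true} Hz')
print(f'Apparent frequency: {f_apparent:.0f} Hz')
```

The sampled data contain no trace of the original 30 kHz, which is why an anti-aliasing low-pass filter is normally applied before sampling or downsampling.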

In [2]:
# Generate audio tone
fs = 44100  # CD quality
duration = 2
t = np.linspace(0, duration, int(fs*duration), endpoint=False)  # endpoint=False keeps the spacing at exactly 1/fs

# A440 note (concert A)
freq = 440
audio_tone = np.sin(2*np.pi*freq*t)

# Apply envelope (attack-decay)
envelope = np.exp(-3*t)
audio_tone *= envelope

print('Audio signal:')
print(f'  Sample rate: {fs} Hz')
print(f'  Duration: {duration} s')
print(f'  Samples: {len(audio_tone)}')
print(f'  Frequency: {freq} Hz (A4 note)')

Audio signal:
  Sample rate: 44100 Hz
  Duration: 2 s
  Samples: 88200
  Frequency: 440 Hz (A4 note)
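
The title mentions audio I/O, so here is a hedged sketch of saving a tone like the one above as 16-bit PCM and reading it back with `scipy.io.wavfile`. The file name `a440.wav` and the temporary directory are hypothetical. The scaling by 32767 assumes the float signal already lies in [-1, 1].

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

fs = 44100
t = np.arange(2 * fs) / fs
tone = np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t)   # decaying A4, within [-1, 1]

# Convert floats in [-1, 1] to 16-bit integers (CD bit depth)
pcm16 = np.round(tone * 32767).astype(np.int16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'a440.wav')   # hypothetical file name
    wavfile.write(path, fs, pcm16)
    rate, data = wavfile.read(path)

# Back to floats; the difference from the original is the quantization error
restored = data.astype(np.float64) / 32767
max_error = np.max(np.abs(restored - tone))

print(f'Read {len(data)} samples at {rate} Hz, dtype {data.dtype}')
print(f'Max quantization error: {max_error:.2e}')
```

The round trip is lossless for the integer samples. The only error relative to the float signal comes from rounding to 16 bits.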


## Spectral Features
Extract features for audio analysis/classification

**Common features**:
- **Spectral centroid**: Brightness (center of mass)
- **Spectral rolloff**: Frequency below which a chosen percentage (often 85%) of the spectral energy lies
- **Zero-crossing rate**: Percussiveness, noisiness
- **MFCCs**: Mel-frequency cepstral coefficients (speech)

In [3]:
# Spectral centroid
def spectral_centroid(sig, fs):
    freqs = np.fft.rfftfreq(len(sig), 1/fs)
    fft_vals = np.abs(np.fft.rfft(sig))
    return np.sum(freqs * fft_vals) / np.sum(fft_vals)

# Zero-crossing rate
def zero_crossing_rate(sig):
    return np.sum(np.abs(np.diff(np.sign(sig)))) / (2 * len(sig))

# Calculate for our tone
centroid = spectral_centroid(audio_tone, fs)
zcr = zero_crossing_rate(audio_tone)

print('Audio features:')
print(f'  Spectral centroid: {centroid:.1f} Hz')
print(f'  Zero-crossing rate: {zcr:.4f}')

Audio features:
  Spectral centroid: 605.1 Hz
  Zero-crossing rate: 0.0200
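
Spectral rolloff is listed among the common features but not computed above. A minimal sketch follows, assuming the rolloff is defined on the power spectrum and that 85% is the target fraction. Both are conventions chosen for this example, and the function name `spectral_rolloff` is illustrative.

```python
import numpy as np

def spectral_rolloff(sig, fs, percent=0.85):
    """Frequency below which `percent` of the spectral energy lies."""
    freqs = np.fft.rfftfreq(len(sig), 1 / fs)
    power = np.abs(np.fft.rfft(sig)) ** 2
    cumulative = np.cumsum(power)
    # First bin where the running energy reaches the target fraction
    idx = np.searchsorted(cumulative, percent * cumulative[-1])
    return freqs[idx]

# Same kind of decaying 440 Hz tone as in the earlier cell
fs = 44100
t = np.arange(2 * fs) / fs
tone = np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t)

rolloff = spectral_rolloff(tone, fs)
print(f'Spectral rolloff (85%): {rolloff:.1f} Hz')
```

For a nearly pure tone the rolloff sits just above the tone frequency. For broadband or noisy signals it moves much higher, which is what makes it useful as a brightness measure.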


## Noise Reduction: Spectral Subtraction
Estimate noise spectrum and subtract from signal

**Steps**:
1. Estimate noise (from silent portion)
2. FFT of noisy signal
3. Subtract noise spectrum
4. Inverse FFT

Used in: Voice enhancement, audio cleanup
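
Written out, the steps above amount to a magnitude-domain rule. In this sketch, $Y(f)$ is the FFT of the noisy signal and $|\hat{N}(f)|$ is the noise magnitude estimated from the silent portion. The floor factor $\beta$ is an assumption about how to handle negative differences; the code in this section uses $\beta = 0.1$.

$$
|\hat{S}(f)| = \max\bigl(|Y(f)| - |\hat{N}(f)|,\ \beta\,|Y(f)|\bigr),
\qquad
\hat{s}[n] = \mathrm{IFFT}\bigl\{\,|\hat{S}(f)|\,e^{\,j\angle Y(f)}\bigr\}
$$

The phase of the noisy signal is reused unchanged, since only the magnitude is corrected.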

In [4]:
# Clean speech simulation
fs = 16000  # Speech sample rate
t = np.linspace(0, 1, fs, endpoint=False)
speech = np.sin(2*np.pi*300*t) + 0.5*np.sin(2*np.pi*800*t)  # Two tones as a crude stand-in for speech
speech[:int(0.1*fs)] = 0  # Leading 0.1 s of silence, so the noise can be estimated there

# Add noise
noise = np.random.randn(len(speech)) * 0.3
noisy_speech = speech + noise

# Estimate noise spectrum (from the silent first 0.1 s)
n_fft = len(noisy_speech)
noise_sample = noisy_speech[:int(0.1*fs)]
# Zero-pad to n_fft so the bins line up with the full-signal FFT, then scale
# the power by n_fft/len(noise_sample) because the noise spans the whole signal
noise_spectrum = np.abs(np.fft.rfft(noise_sample, n=n_fft))**2 * (n_fft / len(noise_sample))

# Process with spectral subtraction (simplified)
noisy_spectrum = np.fft.rfft(noisy_speech)
noisy_mag = np.abs(noisy_spectrum)
noisy_phase = np.angle(noisy_spectrum)

# Subtract noise magnitude, with a floor at 10% of the noisy magnitude
clean_mag = np.maximum(noisy_mag - np.sqrt(noise_spectrum), 0.1*noisy_mag)
clean_spectrum = clean_mag * np.exp(1j * noisy_phase)

# Inverse FFT
clean_speech = np.fft.irfft(clean_spectrum, n=n_fft)

# Calculate SNR improvement
snr_before = 10*np.log10(np.var(speech) / np.var(noise))
snr_after = 10*np.log10(np.var(speech) / np.var(clean_speech - speech))

print('Noise reduction:')
print(f'  SNR before: {snr_before:.2f} dB')
print(f'  SNR after: {snr_after:.2f} dB')
print(f'  Improvement: {snr_after - snr_before:.2f} dB')

## Real Example: Voice Activity Detection (VAD)
Detect speech vs silence in audio
Used in: Speech recognition, teleconferencing, compression

**Method**: Short-time energy thresholding (zero-crossing rate is a common companion feature, but the cell below uses energy only)

In [None]:
# Simulate audio with speech and silence
fs = 16000
t_total = np.linspace(0, 5, 5*fs)

# Create segments: silence-speech-silence-speech
audio_vad = np.zeros(len(t_total))

# Speech segments (simulated as higher-energy noise bursts)
speech_seg1 = slice(int(0.5*fs), int(1.5*fs))
speech_seg2 = slice(int(3.0*fs), int(4.0*fs))

audio_vad[speech_seg1] = np.random.randn(speech_seg1.stop - speech_seg1.start) * 0.5
audio_vad[speech_seg2] = np.random.randn(speech_seg2.stop - speech_seg2.start) * 0.5

# Add background noise everywhere
audio_vad += np.random.randn(len(audio_vad)) * 0.05

# VAD: Compute energy in frames
frame_length = int(0.025 * fs)  # 25ms frames
step = int(0.010 * fs)  # 10ms step

energy = []
for i in range(0, len(audio_vad) - frame_length + 1, step):  # +1 so a final full frame is not skipped
    frame = audio_vad[i:i+frame_length]
    energy.append(np.sum(frame**2))

energy = np.array(energy)

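# Threshold for voice detection
# Note: a 75th-percentile threshold always flags about 25% of frames as voice.
# The simulated speech covers 2 s of 5 s (40%), so some speech frames fall below it.
threshold = np.percentile(energy, 75)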
voice_detected = energy > threshold

print('Voice Activity Detection:')
print(f'  Total frames: {len(energy)}')
print(f'  Voice frames: {voice_detected.sum()}')
print(f'  Silence frames: {(~voice_detected).sum()}')
print(f'  Voice activity: {voice_detected.sum()/len(voice_detected)*100:.1f}%')
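
As a variation on the cell above, the sketch below combines frame energy with zero-crossing rate. It rests on several assumptions: "voiced speech" is simulated as a 200 Hz tone rather than a noise burst (so its zero-crossing rate is low), the noise floor is taken as the 10th percentile of frame energy, and the factor 10 and the ZCR limit 0.25 are illustrative values, not tuned ones. The helper `frame_features` is hypothetical.

```python
import numpy as np

def frame_features(x, frame_length, step):
    """Per-frame energy and zero-crossing rate (fraction of sign changes)."""
    starts = range(0, len(x) - frame_length + 1, step)
    energy = np.array([np.sum(x[i:i + frame_length] ** 2) for i in starts])
    zcr = np.array([
        np.mean(np.abs(np.diff(np.sign(x[i:i + frame_length]))) > 0)
        for i in starts
    ])
    return energy, zcr

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(2 * fs) / fs

# Background noise everywhere, plus a tone from 0.5 s to 1.5 s as a voiced stand-in
x = 0.05 * rng.standard_normal(len(t))
voiced = (t >= 0.5) & (t < 1.5)
x[voiced] += 0.5 * np.sin(2 * np.pi * 200 * t[voiced])

frame_length = int(0.025 * fs)   # 25 ms frames
step = int(0.010 * fs)           # 10 ms step
energy, zcr = frame_features(x, frame_length, step)

# Threshold relative to the noise floor (assumes at least 10% of frames are silence)
noise_floor = np.percentile(energy, 10)
is_voice = (energy > 10 * noise_floor) & (zcr < 0.25)

frame_times = np.arange(len(energy)) * step / fs
print(f'Voice frames: {is_voice.sum()} of {len(is_voice)}')
print(f'First voice frame starts at {frame_times[is_voice][0]:.2f} s')
```

Unlike a fixed percentile, a threshold tied to the noise floor lets the detected fraction follow the content of the signal. The ZCR condition helps reject high-energy broadband noise that an energy-only detector would accept.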

## Advanced: Spectrogram Manipulation
Process signals in time-frequency domain

**Applications**:
- Time stretching (change speed without pitch)
- Pitch shifting (change pitch without speed)
- Source separation
- Audio effects

In [None]:
# Generate chirp
fs = 8000
t = np.linspace(0, 2, 2*fs, endpoint=False)  # endpoint=False keeps the spacing at exactly 1/fs
chirp_audio = signal.chirp(t, f0=200, f1=2000, t1=2)

# STFT
f, t_spec, Zxx = signal.stft(chirp_audio, fs=fs, nperseg=256)

print('STFT analysis:')
print(f'  Frequency bins: {len(f)}')
print(f'  Time frames: {len(t_spec)}')
print(f'  Spectrogram shape: {Zxx.shape}')

# Manipulate: keep every other frequency bin, moving content from bin 2k to bin k.
# This shifts pitch DOWN by one octave, not up.
# (Simplified - real pitch shifting also corrects phase between frames)
Zxx_shifted = np.vstack([Zxx[::2, :], np.zeros((Zxx.shape[0]//2, Zxx.shape[1]))])

# Inverse STFT
_, audio_shifted = signal.istft(Zxx_shifted, fs=fs)

print('\nPitch shift applied (simplified version)')
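
Source separation, listed among the applications above, is often approached by masking the STFT. The sketch below is a toy case under strong assumptions: the two "sources" are pure tones at 300 Hz and 2500 Hz (illustrative values) that do not overlap in frequency, so a fixed binary mask at 1000 Hz is enough. Real mixtures need masks estimated per time-frequency bin.

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(2 * fs) / fs
low = np.sin(2 * np.pi * 300 * t)           # source to keep
high = 0.5 * np.sin(2 * np.pi * 2500 * t)   # source to remove
mix = low + high

# Forward STFT (Hann window, 50% overlap by default)
f, t_spec, Zxx = signal.stft(mix, fs=fs, nperseg=256)

# Binary mask: keep bins below 1 kHz in every time frame
mask = (f < 1000)[:, np.newaxis]
_, low_est = signal.istft(Zxx * mask, fs=fs, nperseg=256)
low_est = low_est[:len(mix)]

# Compare away from the edges, where the zero-padded boundary adds small errors
interior = slice(512, -512)
err = np.sqrt(np.mean((low_est[interior] - low[interior]) ** 2))
print(f'RMS error of recovered low tone: {err:.2e}')
```

The same pattern (STFT, modify `Zxx`, inverse STFT) underlies the other applications in the list. Only the modification step changes.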

## Summary

### Audio Processing Pipeline:
1. **Load audio**: Read file, resample if needed
2. **Preprocess**: Normalize, remove DC, pre-emphasis
3. **Feature extraction**: STFT, MFCCs, spectral features
4. **Processing**: Filtering, enhancement, effects
5. **Analysis**: Classification, detection, recognition
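
The preprocessing step of the pipeline can be sketched as follows. The function `preprocess` is hypothetical. The pre-emphasis coefficient 0.97 is a commonly used value assumed here, and the order (DC removal, peak normalization, then pre-emphasis) is one reasonable choice rather than a fixed rule.

```python
import numpy as np

def preprocess(x, pre_emphasis=0.97):
    """Remove DC, peak-normalize, then apply y[n] = x[n] - a * x[n-1]."""
    x = np.asarray(x, dtype=float)
    x = x - np.mean(x)                 # remove DC offset
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak                   # scale to [-1, 1]
    # First-order high-pass that boosts high frequencies relative to low ones
    return np.append(x[0], x[1:] - pre_emphasis * x[:-1])

fs = 16000
t = np.arange(fs) / fs
raw = 0.3 + 0.2 * np.sin(2 * np.pi * 1000 * t)   # tone with a DC offset
processed = preprocess(raw)

print(f'Mean before: {raw.mean():.3f}, after: {processed.mean():.3f}')
```

Pre-emphasis can push samples slightly outside [-1, 1], so normalize again afterwards if a later stage requires that range.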

### Key Techniques:

**Time domain**:
- Amplitude envelope
- Zero-crossing rate
- Energy/RMS

**Frequency domain**:
- FFT spectrum
- Spectral centroid
- Spectral rolloff
- Spectral flux

**Time-frequency**:
- STFT/Spectrogram
- Mel-spectrogram
- Wavelet transform

### Applications:

**Speech**:
- Voice activity detection
- Speech recognition
- Speaker identification
- Noise reduction

**Music**:
- Beat tracking
- Tempo estimation
- Chord recognition
- Genre classification

**General**:
- Audio compression
- Sound effects
- Acoustic analysis
- Environmental sound classification

### Best Practices:

✓ **Sample rate**: 44.1/48 kHz for audio, 16 kHz for speech
✓ **Framing**: 20-40ms windows, 50% overlap
✓ **Windowing**: Hann/Hamming for STFT
✓ **Normalize**: Scale audio to [-1, 1]
✓ **Pre-emphasis**: Boost high frequencies for speech
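
The framing and windowing guidelines above can be sketched as a small helper. The function `frame_signal` is hypothetical. It assumes 25 ms frames with 50% overlap (inside the recommended 20-40 ms range) and a periodic Hann window from `scipy.signal.get_window`.

```python
import numpy as np
from scipy import signal

def frame_signal(x, fs, frame_ms=25.0, overlap=0.5):
    """Split x into overlapping Hann-windowed frames, one frame per row."""
    x = np.asarray(x, dtype=float)
    frame_length = int(round(frame_ms * 1e-3 * fs))
    step = max(1, int(round(frame_length * (1 - overlap))))
    if len(x) < frame_length:
        raise ValueError('signal is shorter than one frame')
    n_frames = 1 + (len(x) - frame_length) // step
    # Index matrix: row i holds the sample indices of frame i
    starts = step * np.arange(n_frames)
    idx = starts[:, None] + np.arange(frame_length)[None, :]
    window = signal.get_window('hann', frame_length)
    return x[idx] * window

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)   # 1 s of test noise
frames = frame_signal(x, fs)
print(f'Frames: {frames.shape[0]}, samples per frame: {frames.shape[1]}')
```

Each row can then be passed to `np.fft.rfft` to build a spectrogram by hand, which is essentially what `signal.stft` does internally.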

### Advanced Topics:
- Deep learning (neural audio processing)
- Source separation (isolate instruments)
- Audio synthesis (generate sounds)
- Spatial audio (3D sound)
- Real-time processing (low latency)