# Whisper Multichannel Transcription Workshop

Welcome to this comprehensive tutorial on audio transcription using OpenAI's Whisper model!

**Workshop Outline:**
1. Speech-to-Text Primer
2. Introduction to Whisper
3. Single-Speaker Transcription Demo
4. Multi-Speaker Transcription Demo
5. Speaker Diarization
6. Multichannel Audio Processing

**Prerequisites:**
- Basic Python knowledge
- Understanding of audio concepts is helpful but not required
- All necessary packages installed (see requirements.txt)

## Setup and Library Imports

First, let's import all the necessary libraries we'll be using throughout this tutorial.

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio, display, Markdown
import ipywidgets as widgets

# Audio processing
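import librosa
import librosa.display  # explicit import; older librosa releases do not load this submodule automatically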
import soundfile as sf

# Whisper for transcription
import whisper

# System utilities
import os
import warnings
warnings.filterwarnings('ignore')

print("✓ All libraries imported successfully!")

---
## 1. Speech-to-Text Primer

### What is Speech-to-Text?

Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), is the technology that converts spoken language into written text.

### How Does It Work?

Modern ASR systems typically involve several stages:

1. **Audio Preprocessing**: Converting raw audio into a suitable format
2. **Feature Extraction**: Extracting relevant acoustic features (e.g., spectrograms, MFCCs)
3. **Acoustic Modeling**: Using deep learning models to map features to phonemes or characters
4. **Language Modeling**: Applying linguistic knowledge to improve accuracy
5. **Decoding**: Converting predictions into final text output
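
To make the feature-extraction stage concrete, here is a minimal sketch that slices a signal into overlapping frames and computes a log-magnitude spectrum for each one. The function name `log_spectrogram` and the frame settings (25 ms frames, 10 ms hop, 512-point FFT) are illustrative assumptions, not values prescribed by any particular ASR system.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Return a (n_frames, n_fft // 2 + 1) array of log-magnitude spectra."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; keep only the magnitude
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(magnitude + 1e-10)  # small offset avoids log(0)

sr = 16000
t = np.arange(sr) / sr                  # one second of sample times
tone = np.sin(2 * np.pi * 440 * t)      # 440 Hz test tone
spec = log_spectrogram(tone)

# The strongest frequency bin should sit close to 440 Hz
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 512
print(spec.shape, peak_hz)
```

Real systems usually go one step further and map these spectra onto a mel scale, but the framing-plus-FFT idea is the same.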

### Key Concepts

- **Sample Rate**: Number of audio samples per second (e.g., 16kHz)
- **Spectrogram**: Visual representation of audio frequencies over time
- **Mel-Frequency Cepstral Coefficients (MFCCs)**: Compact audio feature representation
- **Word Error Rate (WER)**: Common metric for ASR accuracy
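
Word Error Rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below shows one way to compute it with plain Python; the function name `word_error_rate` is our own, and the lowercasing and whitespace tokenization are simplifying assumptions (real evaluations usually normalize punctuation as well).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2f}")
```

In this example one word is substituted and one is dropped, so two errors are counted against six reference words.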

### Visualizing Audio

Let's create a simple example to understand audio data:

In [None]:
# Generate a simple sine wave as an example audio signal
sample_rate = 16000  # 16kHz sample rate
duration = 2  # seconds
frequency = 440  # A4 note (440 Hz)

# Create time array (endpoint=False keeps the sample spacing at exactly 1/sample_rate)
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)

# Generate sine wave
audio_signal = np.sin(2 * np.pi * frequency * t)

# Plot waveform
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(t[:1000], audio_signal[:1000])  # Plot first 1000 samples
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Audio Waveform (First 1000 samples)')
plt.grid(True)

# Create and plot spectrogram
plt.subplot(1, 2, 2)
D = librosa.amplitude_to_db(np.abs(librosa.stft(audio_signal)), ref=np.max)
librosa.display.specshow(D, sr=sample_rate, x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.tight_layout()
plt.show()

# Play the audio
display(Audio(audio_signal, rate=sample_rate))

### Exercise: Explore Audio Properties

Try modifying the frequency or duration in the code above to see how it affects the waveform and spectrogram.

In [None]:
# Your code here
# Experiment with different frequencies and durations


---
## 2. Introduction to Whisper

### What is Whisper?

Whisper is a state-of-the-art automatic speech recognition (ASR) system developed by OpenAI. It's trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

### Key Features

- **Multilingual**: Covers roughly 99 languages, although accuracy varies considerably from one language to another
- **Robust**: Works well with various accents and background noise
- **Zero-shot**: No fine-tuning required for many tasks
- **Multi-task**: Can transcribe, translate, and identify languages

### Model Sizes

Whisper comes in several sizes, trading off speed vs. accuracy:

| Model  | Parameters | English-only | Multilingual | Required VRAM | Relative Speed |
|--------|------------|--------------|--------------|---------------|----------------|
| tiny   | 39 M       | ✓            | ✓            | ~1 GB         | ~32x           |
| base   | 74 M       | ✓            | ✓            | ~1 GB         | ~16x           |
| small  | 244 M      | ✓            | ✓            | ~2 GB         | ~6x            |
| medium | 769 M      | ✓            | ✓            | ~5 GB         | ~2x            |
| large  | 1550 M     | ✗            | ✓            | ~10 GB        | 1x             |
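
As a rough aid for reading the table, the sketch below picks the largest model whose approximate VRAM requirement fits a given budget. The helper `largest_model_that_fits` is hypothetical, and the numbers are simply the approximate values from the table above, so treat the result as a starting point rather than a guarantee.

```python
# Approximate VRAM needs in GB, copied from the table above (smallest to largest model)
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]

def largest_model_that_fits(available_gb):
    """Return the name of the largest model that fits, or None if none does."""
    fitting = [name for name, need in MODEL_VRAM_GB if need <= available_gb]
    return fitting[-1] if fitting else None

for budget in (0.5, 4, 12):
    print(f"{budget} GB -> {largest_model_that_fits(budget)}")
```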

### Architecture

Whisper uses a Transformer encoder-decoder architecture:
- **Encoder**: Processes audio features (mel spectrograms)
- **Decoder**: Generates text transcriptions autoregressively
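
To get a feel for what the encoder actually receives, the arithmetic below works out the size of one input window. The constants (16 kHz audio, 30-second windows, a 160-sample hop, 80 mel bins, and a factor-of-two downsampling inside the encoder) reflect our understanding of the open-source reference implementation for most checkpoints; treat them as assumptions and check the code for the model you use.

```python
SAMPLE_RATE = 16_000   # samples per second expected by the model (assumed)
CHUNK_SECONDS = 30     # audio is processed in fixed 30-second windows (assumed)
HOP_LENGTH = 160       # samples between spectrogram frames, i.e. 10 ms (assumed)
N_MELS = 80            # mel bins per frame for most checkpoints (assumed)

n_samples = SAMPLE_RATE * CHUNK_SECONDS   # raw samples in one window
n_frames = n_samples // HOP_LENGTH        # spectrogram frames in one window
encoder_positions = n_frames // 2         # sequence length after 2x downsampling

print(f"Samples per window:      {n_samples}")
print(f"Mel spectrogram shape:   ({N_MELS}, {n_frames})")
print(f"Encoder sequence length: {encoder_positions}")
```

Shorter recordings are padded to the window length and longer ones are processed window by window, which is why transcription time grows with audio duration.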

### Loading a Whisper Model

Let's load a Whisper model. We'll start with the 'base' model for a good balance of speed and accuracy.

In [None]:
# Load Whisper model
# Options: 'tiny', 'base', 'small', 'medium', 'large'
model_size = 'base'

print(f"Loading Whisper '{model_size}' model...")
model = whisper.load_model(model_size)
print(f"✓ Model loaded successfully!")
print(f"  Device: {model.device}")

### Understanding Whisper Options

Whisper provides several options for transcription:

In [None]:
# Key transcription options
transcription_options = {
    'language': 'en',  # Specify language (or None for auto-detect)
    'task': 'transcribe',  # 'transcribe' or 'translate' (to English)
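    'temperature': 0.0,  # 0 = deterministic decoding; higher values sample more randomly
    'beam_size': 5,  # Number of beams for beam search (used when temperature is 0)
    'best_of': 5,  # Number of candidates when sampling (used when temperature > 0)
    'fp16': True,  # Use half-precision (faster on GPU; on CPU Whisper warns and uses FP32)
    'verbose': False,  # True prints each decoded segment as it is produced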
}

print("Transcription Options:")
for key, value in transcription_options.items():
    print(f"  {key}: {value}")
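
The dictionary above is only printed, so here is a hedged sketch of how such options might be assembled and handed to `transcribe`. The helper `build_transcribe_options` is hypothetical; it enables `fp16` only when the device string looks like a CUDA device, and it leaves out `best_of` because beam search is used at temperature 0.

```python
def build_transcribe_options(device, language="en"):
    """Assemble keyword arguments for model.transcribe based on the device in use."""
    use_gpu = str(device).startswith("cuda")
    return {
        "language": language,     # None would ask Whisper to auto-detect
        "task": "transcribe",
        "temperature": 0.0,       # deterministic decoding
        "beam_size": 5,           # beam search applies at temperature 0
        "fp16": use_gpu,          # half precision only makes sense on a GPU
    }

options = build_transcribe_options("cpu")
print(options)

# With a loaded model and a real audio file, the options would be unpacked like this:
# result = model.transcribe("your_audio.wav", **options)
```

Passing `model.device` as the argument keeps the precision setting in step with wherever the model was loaded.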

### Interactive Model Selector

Use this widget to explore different model sizes:

In [None]:
# Interactive model selector
model_selector = widgets.Dropdown(
    options=['tiny', 'base', 'small', 'medium', 'large'],
    value='base',
    description='Model:',
    disabled=False,
)

def on_model_change(change):
    print(f"Selected model: {change['new']}")
    print(f"Note: Larger models provide better accuracy but require more memory and time.")

model_selector.observe(on_model_change, names='value')
display(model_selector)

---
## 3. Single-Speaker Transcription Demo

Now let's put Whisper to work! We'll start with a simple single-speaker audio example.

### Creating Sample Audio

This workshop does not ship with sample audio, so you will need to supply your own recording. The helper below only prints instructions for doing that.

In [None]:
# Helper function to create a sample audio file
def create_sample_audio():
    """
    This is a placeholder for creating sample audio.
    In practice, you would:
    1. Record your own audio
    2. Download sample files
    3. Use existing audio files
    """
    print("To use this demo:")
    print("1. Record a short audio clip (e.g., using your phone or computer)")
    print("2. Save it as a .wav or .mp3 file")
    print("3. Place it in the notebooks/ directory")
    print("4. Update the 'audio_path' variable below")
    
create_sample_audio()

### Loading and Analyzing Audio

Let's load an audio file and examine its properties:

In [None]:
# Specify your audio file path
# audio_path = 'sample_single_speaker.wav'

# For this demo, we'll create a simple example
# Replace this with your actual audio file path
audio_path = None  # Set to your audio file path

if audio_path and os.path.exists(audio_path):
    # Load audio
    audio, sr = librosa.load(audio_path, sr=16000)
    
    # Display audio properties
    duration = len(audio) / sr
    print(f"Audio Properties:")
    print(f"  Sample Rate: {sr} Hz")
    print(f"  Duration: {duration:.2f} seconds")
    print(f"  Samples: {len(audio)}")
    print(f"  Channels: 1 (mono)")
    
    # Visualize waveform
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(audio, sr=sr)
    plt.title('Audio Waveform')
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.tight_layout()
    plt.show()
    
    # Play audio
    display(Audio(audio, rate=sr))
else:
    print("No audio file specified or file not found.")
    print("Please set 'audio_path' to your audio file.")

### Transcribing Single-Speaker Audio

Now let's transcribe the audio using Whisper:

In [None]:
if audio_path and os.path.exists(audio_path):
    # Transcribe audio
    print("Transcribing audio...")
    result = model.transcribe(audio_path, language='en', fp16=False)
    
    # Display results
    print("\n" + "="*60)
    print("TRANSCRIPTION RESULT")
    print("="*60)
    # Reports 'en' here because the language was passed explicitly rather than detected
    print(f"\nLanguage: {result.get('language', 'N/A')}")
    print(f"\nText:\n{result['text']}")
    print("\n" + "="*60)
else:
    print("Please provide an audio file to transcribe.")

### Detailed Transcription with Timestamps

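Whisper returns segment-level timestamps by default; word-level timestamps are available by passing `word_timestamps=True` to `transcribe`. The cell below prints the segment-level timestamps from the previous result: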

In [None]:
if audio_path and os.path.exists(audio_path):
    # Display segments with timestamps
    print("Segments with Timestamps:")
    print("="*80)
    
    for i, segment in enumerate(result['segments'], 1):
        start_time = segment['start']
        end_time = segment['end']
        text = segment['text'].strip()
        
        print(f"[{start_time:6.2f}s - {end_time:6.2f}s] {text}")
    
    print("="*80)
else:
    print("Please provide an audio file to transcribe.")
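
Segment timestamps map naturally onto subtitle files. The sketch below converts a list of segments into SubRip (SRT) text; the helper names `srt_timestamp` and `segments_to_srt` are our own, and the only assumption about the input is that each segment has the `start`, `end`, and `text` keys used above.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments):
    """Build SRT text from segments with 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, 1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)  # a blank line separates subtitle entries

# Hand-written segments standing in for result['segments']
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(example_segments))
```

With a real transcription, `segments_to_srt(result['segments'])` could be written to a `.srt` file.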

### Exercise: Try Your Own Audio

Record a short clip of yourself speaking and transcribe it using the code above!

In [None]:
# Your code here
# Load and transcribe your own audio file


---
## 4. Multi-Speaker Transcription Demo

Multi-speaker audio presents additional challenges:
- Overlapping speech
- Different voice characteristics
- Need to identify "who said what"

While Whisper excels at transcription, it doesn't inherently identify different speakers. For that, we need **speaker diarization** (covered in the next section).

### Loading Multi-Speaker Audio

In [None]:
# Specify your multi-speaker audio file path
# multi_speaker_path = 'sample_multi_speaker.wav'

multi_speaker_path = None  # Set to your audio file path

if multi_speaker_path and os.path.exists(multi_speaker_path):
    # Load audio
    audio_multi, sr_multi = librosa.load(multi_speaker_path, sr=16000)
    
    # Display audio properties
    duration_multi = len(audio_multi) / sr_multi
    print(f"Multi-Speaker Audio Properties:")
    print(f"  Duration: {duration_multi:.2f} seconds")
    print(f"  Sample Rate: {sr_multi} Hz")
    
    # Visualize
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(audio_multi, sr=sr_multi)
    plt.title('Multi-Speaker Audio Waveform')
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.tight_layout()
    plt.show()
    
    # Play audio
    display(Audio(audio_multi, rate=sr_multi))
else:
    print("No multi-speaker audio file specified.")
    print("Please set 'multi_speaker_path' to your audio file.")

### Transcribing Multi-Speaker Audio

Whisper will transcribe all speech, but won't separate speakers:

In [None]:
if multi_speaker_path and os.path.exists(multi_speaker_path):
    # Transcribe
    print("Transcribing multi-speaker audio...")
    result_multi = model.transcribe(multi_speaker_path, language='en', fp16=False)
    
    # Display results
    print("\n" + "="*60)
    print("MULTI-SPEAKER TRANSCRIPTION")
    print("="*60)
    print(f"\n{result_multi['text']}")
    print("\n" + "="*60)
    print("\nNote: This transcription includes all speakers,")
    print("but doesn't identify who is speaking.")
    print("See the next section on diarization for speaker identification.")
else:
    print("Please provide a multi-speaker audio file.")

### Challenges with Multi-Speaker Audio

Consider these challenges:
- **Speaker overlap**: When multiple people talk simultaneously
- **Turn-taking**: Rapid exchanges between speakers
- **Background noise**: Multiple voices in the background
- **Speaker identification**: Determining who said what

In [None]:
# Exercise space: Analyze transcription segments
# Can you identify potential speaker changes by analyzing
# pauses or changes in the audio signal?
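
One possible starting point for this exercise is to look for long gaps between consecutive Whisper segments. The sketch below is a heuristic only: the function `find_long_pauses` and the 0.8-second threshold are assumptions, and a pause does not prove that the speaker changed.

```python
def find_long_pauses(segments, min_gap=0.8):
    """Return (pause_start, pause_end) pairs where the silence lasts at least min_gap seconds."""
    pauses = []
    for prev, nxt in zip(segments, segments[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((prev["end"], nxt["start"]))
    return pauses

# Hand-written segments standing in for result_multi['segments']
example_segments = [
    {"start": 0.0, "end": 2.0, "text": "How was the trip?"},
    {"start": 2.1, "end": 4.0, "text": "Did you get any rest?"},
    {"start": 5.5, "end": 7.0, "text": "It was fine, thanks."},
]
print(find_long_pauses(example_segments))
```

Rapid turn-taking produces gaps far shorter than any sensible threshold, which is one reason dedicated diarization models are needed.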


---
## 5. Speaker Diarization

### What is Speaker Diarization?

Speaker diarization is the process of partitioning an audio stream into homogeneous segments according to the speaker identity. In simpler terms: **"Who spoke when?"**

### Key Components

1. **Speech Activity Detection (SAD)**: Identifying speech vs. non-speech regions
2. **Speaker Segmentation**: Detecting speaker change points
3. **Speaker Clustering**: Grouping segments by speaker
4. **Speaker Labeling**: Assigning labels (Speaker 1, Speaker 2, etc.)
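
To illustrate the first component, here is a toy energy-based speech activity detector. It is a deliberately simple sketch: the function `detect_speech_regions`, the 25 ms frame size, and the RMS threshold are assumptions, and toolkits such as pyannote.audio rely on trained neural models rather than a fixed threshold.

```python
import numpy as np

def detect_speech_regions(signal, sample_rate, frame_ms=25, threshold=0.02):
    """Return (start_s, end_s) regions whose frame-level RMS energy exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    active = rms > threshold

    regions, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                      # a new active region begins
        elif not is_active and start is not None:
            regions.append((start * frame_len / sample_rate,
                            i * frame_len / sample_rate))
            start = None
    if start is not None:                  # region still open at the end of the signal
        regions.append((start * frame_len / sample_rate,
                        n_frames * frame_len / sample_rate))
    return regions

# One second of silence, one second of tone, one second of silence
sr = 16000
silence = np.zeros(sr)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
signal = np.concatenate([silence, tone, silence])

regions = detect_speech_regions(signal, sr)
print(regions)
```

A threshold like this breaks down quickly in noisy recordings, since loud background noise is indistinguishable from speech by energy alone.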

### Tools for Diarization

- **pyannote.audio**: State-of-the-art speaker diarization toolkit
- **Resemblyzer**: Speaker embedding and verification
- **speechbrain**: End-to-end speech processing toolkit

### Using pyannote.audio for Diarization

Let's use pyannote.audio to perform speaker diarization:

In [None]:
# Import pyannote.audio
try:
    from pyannote.audio import Pipeline
    pyannote_available = True
except ImportError:
    print("pyannote.audio not available. Install with: pip install pyannote.audio")
    pyannote_available = False

# Note: You'll need a Hugging Face token for pyannote models
# Get it from: https://huggingface.co/settings/tokens
# Accept terms at: https://huggingface.co/pyannote/speaker-diarization-3.1
# and at: https://huggingface.co/pyannote/segmentation-3.0 (used by the 3.1 pipeline)

HF_TOKEN = os.environ.get('HF_TOKEN', None)

if not HF_TOKEN:
    print("⚠️  Warning: HF_TOKEN not set.")
    print("To use pyannote.audio:")
    print("1. Create a Hugging Face account")
    print("2. Accept conditions at https://huggingface.co/pyannote/speaker-diarization-3.1")
    print("   and at https://huggingface.co/pyannote/segmentation-3.0")
    print("3. Create a token at https://huggingface.co/settings/tokens")
    print("4. Set it: export HF_TOKEN=your_token_here")
else:
    print("✓ HF_TOKEN found")

In [None]:
# Load diarization pipeline
if pyannote_available and HF_TOKEN:
    try:
        print("Loading speaker diarization pipeline...")
        diarization_pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=HF_TOKEN
        )
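        # from_pretrained may return None rather than raise, for example when
        # access to the gated model has not been granted
        if diarization_pipeline is None:
            print("Pipeline could not be loaded. Check your token and the model terms.")
        else:
            print("✓ Pipeline loaded successfully!")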
    except Exception as e:
        print(f"Error loading pipeline: {e}")
        diarization_pipeline = None
else:
    diarization_pipeline = None
    print("Diarization pipeline not available.")

### Performing Diarization

In [None]:
if diarization_pipeline and multi_speaker_path and os.path.exists(multi_speaker_path):
    # Perform diarization
    print("Performing speaker diarization...")
    diarization = diarization_pipeline(multi_speaker_path)
    
    # Display results
    print("\n" + "="*60)
    print("SPEAKER DIARIZATION RESULTS")
    print("="*60)
    
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"[{turn.start:6.2f}s - {turn.end:6.2f}s] {speaker}")
    
    print("="*60)
else:
    print("Diarization not available or no audio file provided.")
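
Once the turns are available, a common next step is to total the speaking time per speaker. The sketch below works on plain `(start, end, speaker)` tuples so it can run without pyannote; the function `speaking_time` and the example turns are illustrative.

```python
from collections import defaultdict

def speaking_time(turns):
    """Sum the duration of all turns for each speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)

# Hand-written turns standing in for real diarization output
example_turns = [
    (0.0, 4.0, "SPEAKER_00"),
    (4.0, 6.5, "SPEAKER_01"),
    (6.5, 10.0, "SPEAKER_00"),
]

for speaker, seconds in sorted(speaking_time(example_turns).items()):
    print(f"{speaker}: {seconds:.1f}s")
```

With real output, the tuples can be built from the same loop used above: `[(turn.start, turn.end, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)]`.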

### Combining Diarization with Transcription

Now let's combine speaker diarization with Whisper transcription to get "who said what":

In [None]:
def combine_diarization_transcription(audio_path, diarization, transcription):
    """
    Combine diarization results with transcription to attribute text to speakers.
    
    Args:
        audio_path: Path to audio file
        diarization: Diarization results from pyannote
        transcription: Transcription results from Whisper
    
    Returns:
        List of dictionaries with speaker, start, end, and text
    """
    results = []
    
    # Get segments from transcription
    for segment in transcription['segments']:
        seg_start = segment['start']
        seg_end = segment['end']
        seg_text = segment['text'].strip()
        
        # Find overlapping speaker from diarization
        speaker = "Unknown"
        max_overlap = 0
        
        for turn, _, spk in diarization.itertracks(yield_label=True):
            # Calculate overlap
            overlap_start = max(seg_start, turn.start)
            overlap_end = min(seg_end, turn.end)
            overlap = max(0, overlap_end - overlap_start)
            
            if overlap > max_overlap:
                max_overlap = overlap
                speaker = spk
        
        results.append({
            'speaker': speaker,
            'start': seg_start,
            'end': seg_end,
            'text': seg_text
        })
    
    return results

# Example usage
if diarization_pipeline and multi_speaker_path and os.path.exists(multi_speaker_path):
    combined_results = combine_diarization_transcription(
        multi_speaker_path, diarization, result_multi
    )
    
    print("\n" + "="*80)
    print("COMBINED DIARIZATION + TRANSCRIPTION")
    print("="*80)
    
    for entry in combined_results:  # avoid reusing the name `result` from earlier cells
        print(f"\n[{entry['start']:6.2f}s - {entry['end']:6.2f}s] {entry['speaker']}")
        print(f"  {entry['text']}")
    
    print("\n" + "="*80)
else:
    print("Combined processing not available.")

### Exercise: Improve Speaker Attribution

The simple overlap method above might not be perfect. Try to improve it!

In [None]:
# Your code here
# Ideas:
# - Use weighted overlap based on segment length
# - Consider speaker changes within long segments
# - Add confidence scores
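
As one possible direction for the confidence-score idea, the sketch below sums each speaker's overlap with a segment across all of their turns and reports the winning share as a rough confidence. It differs from the function above, which keeps only the single largest overlapping turn. The name `attribute_speaker` and the tuple format for turns are assumptions.

```python
def attribute_speaker(seg_start, seg_end, turns):
    """Return (speaker, confidence) for a segment given (start, end, speaker) turns."""
    overlap_by_speaker = {}
    for start, end, speaker in turns:
        overlap = max(0.0, min(seg_end, end) - max(seg_start, start))
        if overlap > 0:
            overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap

    if not overlap_by_speaker:
        return "Unknown", 0.0

    speaker = max(overlap_by_speaker, key=overlap_by_speaker.get)
    duration = max(seg_end - seg_start, 1e-9)  # guard against zero-length segments
    # Confidence: fraction of the segment covered by the chosen speaker
    return speaker, min(1.0, overlap_by_speaker[speaker] / duration)

# Speaker A talks, B interjects briefly, then A continues
example_turns = [(0.0, 2.0, "A"), (2.0, 3.0, "B"), (3.0, 5.0, "A")]
print(attribute_speaker(1.0, 4.0, example_turns))
print(attribute_speaker(10.0, 11.0, example_turns))
```

A low confidence value is a hint that the segment spans a speaker change and may be worth splitting.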


---
## 6. Multichannel Audio Processing

### What is Multichannel Audio?

Multichannel audio contains multiple independent audio channels, often recorded from different microphones or sources:

- **Stereo (2 channels)**: Left and right
- **Surround sound (5.1, 7.1)**: Multiple spatial channels
- **Multi-mic recordings**: Each microphone on a separate channel
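
In code, multichannel audio is usually a two-dimensional array. The sketch below assumes the `(n_samples, n_channels)` layout that `soundfile` returns and contrasts a mono downmix with per-channel access; the variable names are illustrative.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
left = np.sin(2 * np.pi * 440 * t)   # a tone on the left channel
right = np.zeros_like(left)          # silence on the right channel

# Stack the channels as columns: shape (n_samples, n_channels)
stereo = np.column_stack((left, right))

# Downmixing averages the channels and discards which channel a sound came from
mono_mix = stereo.mean(axis=1)

# Per-channel access keeps each source separate
channels = [stereo[:, i] for i in range(stereo.shape[1])]

print(stereo.shape, mono_mix.shape, len(channels))
```

Note that the downmix halves the amplitude of the tone here, because it is averaged with a silent channel.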

### Why Process Channels Separately?

Benefits of per-channel processing:
1. **Speaker separation**: Different speakers on different channels
2. **Noise reduction**: Better quality on specific channels
3. **Spatial information**: Maintain location/direction data
4. **Improved accuracy**: Cleaner input for ASR

### Loading Multichannel Audio

In [None]:
# Load multichannel audio (keeping all channels)
# multichannel_path = 'sample_multichannel.wav'

multichannel_path = None  # Set to your audio file path

if multichannel_path and os.path.exists(multichannel_path):
    # Load with soundfile to preserve channels
    audio_data, sample_rate = sf.read(multichannel_path)
    
    # Check if audio is multichannel
    if len(audio_data.shape) == 1:
        print("This is a mono audio file (1 channel).")
        n_channels = 1
    else:
        n_channels = audio_data.shape[1]
        print(f"This is a multichannel audio file with {n_channels} channels.")
    
    print(f"\nAudio Properties:")
    print(f"  Sample Rate: {sample_rate} Hz")
    print(f"  Duration: {len(audio_data) / sample_rate:.2f} seconds")
    print(f"  Channels: {n_channels}")
    print(f"  Shape: {audio_data.shape}")
else:
    print("No multichannel audio file specified.")
    print("For this demo, we'll create a synthetic multichannel example.")
    
    # Create synthetic stereo audio for demonstration
    duration = 5
    sample_rate = 16000
    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
    
    # Left channel: 440 Hz
    left_channel = np.sin(2 * np.pi * 440 * t)
    # Right channel: 880 Hz
    right_channel = np.sin(2 * np.pi * 880 * t)
    
    # Combine into stereo
    audio_data = np.column_stack((left_channel, right_channel))
    n_channels = 2
    
    print("Created synthetic stereo audio for demonstration.")
    print(f"  Channels: {n_channels}")
    print(f"  Left channel: 440 Hz tone")
    print(f"  Right channel: 880 Hz tone")

### Visualizing Multiple Channels

In [None]:
# Visualize each channel
if len(audio_data.shape) > 1 and audio_data.shape[1] > 1:
    n_show = min(n_channels, 4)  # Limit to 4 channels for display
    # squeeze=False keeps `axes` two-dimensional, so indexing works for any n_show
    fig, axes = plt.subplots(n_show, 1, figsize=(12, 3 * n_show), squeeze=False)
    axes = axes[:, 0]
    
    for i in range(n_show):
        channel_data = audio_data[:, i]
        axes[i].plot(np.arange(len(channel_data)) / sample_rate, channel_data)
        axes[i].set_title(f'Channel {i+1}')
        axes[i].set_xlabel('Time (s)')
        axes[i].set_ylabel('Amplitude')
        axes[i].grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # Play each channel
    for i in range(min(n_channels, 4)):
        print(f"\nChannel {i+1}:")
        display(Audio(audio_data[:, i], rate=sample_rate))
else:
    print("Mono audio - single channel only.")
    plt.figure(figsize=(12, 3))
    plt.plot(np.arange(len(audio_data)) / sample_rate, audio_data)
    plt.title('Audio Waveform (Mono)')
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.grid(True)
    plt.show()
    
    display(Audio(audio_data, rate=sample_rate))

### Splitting and Saving Channels

Let's split the multichannel audio into separate files:

In [None]:
# Create output directory for channel files
output_dir = 'channel_outputs'
os.makedirs(output_dir, exist_ok=True)

# Split and save channels
if len(audio_data.shape) > 1 and audio_data.shape[1] > 1:
    channel_files = []
    
    for i in range(n_channels):
        # Extract channel
        channel_audio = audio_data[:, i]
        
        # Save to file
        output_path = os.path.join(output_dir, f'channel_{i+1}.wav')
        sf.write(output_path, channel_audio, sample_rate)
        channel_files.append(output_path)
        
        print(f"✓ Saved channel {i+1} to {output_path}")
    
    print(f"\nAll {n_channels} channels saved to '{output_dir}/' directory.")
else:
    print("Mono audio - no splitting needed.")
    channel_files = []

### Transcribing Each Channel Separately

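Now let's transcribe each channel independently. If you are using the synthetic stereo tones created above, expect empty or nonsensical text, because those channels contain no speech: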

In [None]:
# Transcribe each channel
# Initialize up front so later cells can test this list even when there are no channel files
channel_transcriptions = []

if channel_files:
    
    print("Transcribing each channel...\n")
    print("="*80)
    
    for i, channel_path in enumerate(channel_files, 1):
        print(f"\nChannel {i}:")
        print("-" * 80)
        
        # Transcribe
        result = model.transcribe(channel_path, language='en', fp16=False)
        channel_transcriptions.append(result)
        
        # Display
        print(f"Text: {result['text']}")
        
        # Show segments if available
        if len(result['segments']) > 1:
            print("\nSegments:")
            for seg in result['segments']:
                print(f"  [{seg['start']:6.2f}s - {seg['end']:6.2f}s] {seg['text'].strip()}")
    
    print("\n" + "="*80)
    print("✓ All channels transcribed!")
else:
    print("No channel files available for transcription.")

### Comparing Channel Transcriptions

Let's create a comparison view of all channel transcriptions:

In [None]:
# Create comparison table
if channel_transcriptions:
    print("\n" + "="*80)
    print("CHANNEL TRANSCRIPTION COMPARISON")
    print("="*80)
    
    for i, transcription in enumerate(channel_transcriptions, 1):
        print(f"\nChannel {i}:")
        print(f"  {transcription['text']}")
    
    print("\n" + "="*80)
    
    # Calculate some statistics
    print("\nStatistics:")
    for i, transcription in enumerate(channel_transcriptions, 1):
        word_count = len(transcription['text'].split())
        # End time of the last segment, i.e. when transcribed speech stops on this channel
        last_end = transcription['segments'][-1]['end'] if transcription['segments'] else 0
        print(f"  Channel {i}: {word_count} words, last segment ends at {last_end:.2f}s")
else:
    print("No transcriptions available for comparison.")

### Advanced: Synchronized Multi-Channel Transcription

For more advanced use cases, you might want to synchronize transcriptions across channels:

In [None]:
def create_synchronized_transcript(channel_transcriptions):
    """
    Create a synchronized transcript showing all channels aligned by time.
    
    Args:
        channel_transcriptions: List of transcription results, one per channel
    
    Returns:
        List of time-aligned transcript entries
    """
    # Collect all timestamps from all channels
    all_events = []
    
    for ch_idx, transcription in enumerate(channel_transcriptions):
        for segment in transcription['segments']:
            all_events.append({
                'channel': ch_idx + 1,
                'start': segment['start'],
                'end': segment['end'],
                'text': segment['text'].strip()
            })
    
    # Sort by start time
    all_events.sort(key=lambda x: x['start'])
    
    return all_events

# Create synchronized view
if channel_transcriptions:
    sync_transcript = create_synchronized_transcript(channel_transcriptions)
    
    print("\n" + "="*80)
    print("SYNCHRONIZED MULTI-CHANNEL TRANSCRIPT")
    print("="*80)
    
    for event in sync_transcript:
        print(f"\n[{event['start']:6.2f}s - {event['end']:6.2f}s] Channel {event['channel']}")
        print(f"  {event['text']}")
    
    print("\n" + "="*80)
else:
    print("No transcriptions available for synchronization.")
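
A time-sorted event list also makes it possible to spot moments where two channels are active at once. The sketch below is one way to do that; the function `find_cross_channel_overlaps` is hypothetical and expects events shaped like the entries returned by `create_synchronized_transcript`.

```python
def find_cross_channel_overlaps(events):
    """Return pairs of events from different channels whose time ranges overlap."""
    ordered = sorted(events, key=lambda e: e["start"])
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # sorted by start time, so no later event can overlap `first`
            if second["channel"] != first["channel"]:
                overlaps.append((first, second))
    return overlaps

# Hand-written events standing in for sync_transcript
example_events = [
    {"channel": 1, "start": 0.0, "end": 3.0, "text": "So what I was saying"},
    {"channel": 2, "start": 2.0, "end": 4.0, "text": "Sorry, one moment"},
    {"channel": 1, "start": 5.0, "end": 6.0, "text": "Go ahead"},
]

for first, second in find_cross_channel_overlaps(example_events):
    print(f"Channel {first['channel']} and channel {second['channel']} "
          f"overlap from {second['start']:.2f}s")
```

Overlaps like these are worth reviewing by ear, since they can indicate either genuine cross-talk or one microphone picking up the other speaker.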

### Exercise: Your Own Multichannel Project

Now it's your turn! Try processing your own multichannel audio:

In [None]:
# Your code here
# Ideas:
# - Record a stereo conversation (each person closer to one mic)
# - Process a multi-track recording
# - Combine channel separation with diarization
# - Export results to a formatted document


---
## Conclusion and Next Steps

### What We've Learned

In this workshop, we covered:

1. ✓ **Speech-to-Text Fundamentals**: Understanding ASR concepts and audio processing
2. ✓ **Whisper Model**: Using OpenAI's powerful transcription system
3. ✓ **Single-Speaker Transcription**: Basic transcription workflows
4. ✓ **Multi-Speaker Audio**: Challenges and approaches for multiple speakers
5. ✓ **Speaker Diarization**: Identifying "who spoke when" using pyannote.audio
6. ✓ **Multichannel Processing**: Splitting and transcribing individual audio channels

### Next Steps and Advanced Topics

Continue your learning journey:

#### Improving Accuracy
- Fine-tune Whisper on domain-specific data
- Use larger models for better performance
- Apply audio preprocessing (noise reduction, normalization)

#### Advanced Techniques
- Real-time streaming transcription
- Custom vocabulary and named entity recognition
- Multi-language support and code-switching
- Speaker identification (not just diarization)

#### Integration Projects
- Build a meeting transcription system
- Create automated subtitles for videos
- Develop a voice-controlled application
- Implement a podcast search engine

### Resources

- **Whisper**: https://github.com/openai/whisper
- **pyannote.audio**: https://github.com/pyannote/pyannote-audio
- **librosa**: https://librosa.org/
- **Hugging Face Models**: https://huggingface.co/models?pipeline_tag=automatic-speech-recognition

### Feedback and Support

Questions? Issues? Ideas?
- Open an issue on GitHub
- Check the documentation
- Join the community discussions

Thank you for participating in this workshop! 🎉

### Final Exercise: Build Your Own Application

Use this space to build your own transcription application combining what you've learned:

In [None]:
# Your final project code here
# Build something amazing!
