# Complete Pipeline Comparison for Speech Diarization

## Project Overview
This notebook compares **complete end-to-end pipelines** for speech transcription and diarization to determine the best solution for media bias analysis.

## Pipelines We're Testing

### 1. **WhisperX (base) + pyannote**  
- **Transcription**: WhisperX base model
- **Diarization**: pyannote.audio 3.1
- **Cost**: Free (open source)
- **Expected**: Fast, good accuracy
- **Best for**: Real-time analysis, batch processing

### 2. **Whisper (large) + pyannote**
- **Transcription**: OpenAI Whisper large model
- **Diarization**: pyannote.audio 3.1
- **Cost**: Free (open source)
- **Expected**: Slower, best accuracy
- **Best for**: Final analysis, research

### 3. **faster-whisper + pyannote** 🚀
- **Transcription**: faster-whisper (CTranslate2 optimized)
- **Diarization**: pyannote.audio 3.1
- **Cost**: Free (open source)
- **Expected**: Fastest, good accuracy
- **Best for**: Large-scale processing

## Evaluation Metrics
- **Processing Time** - Total pipeline duration
- **Transcription Quality** - Text comparison
- **Diarization Accuracy** - Speaker identification
- **Resource Usage** - Memory/CPU
- **Ease of Use** - Setup complexity
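
Transcription quality can be quantified as word error rate (WER) against a reference transcript. The notebook imports `wer` from `jiwer` for this purpose; the sketch below is a dependency-free illustration of the same metric. It assumes lowercasing and whitespace tokenization, which are normalization choices made here for illustration rather than anything taken from the notebook. The function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word out of four reference words.
score = word_error_rate("the quick brown fox", "the quick fox")
print(f"WER: {score:.2f}")
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.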

## Test Audio
9-minute US Presidential Debate highlights

In [9]:

# Install all required packages 
# Run this once

!pip install openai-whisper whisperx faster-whisper pyannote.audio jiwer pandas matplotlib python-dotenv






## Step One: Import Libraries and Setup

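We'll import the three Whisper implementations (openai-whisper, WhisperX, and faster-whisper) along with pyannote, then set up the device, results storage, and test audio path for benchmarking.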

In [10]:
# Import libraries
import whisper  # Original OpenAI Whisper
import whisperx  # WhisperX
from faster_whisper import WhisperModel  # Optimized Whisper
from pyannote.audio import Pipeline  # Diarization
import torch
import time
import pandas as pd
import matplotlib.pyplot as plt
from jiwer import wer  # Word Error Rate calculation
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Libraries imported")
print(f"Using device: {device}")

# Prepare results storage
results = {
    'pipeline': [],
    'transcription_time': [],
    'alignment_time': [],
    'diarization_time': [],
    'total_time': [],
    'num_speakers': [],
    'num_segments': [],
    'transcription_preview': []
}

# Path to test audio
audio_file = "../data/US_DebateAudio.wav"
print(f"Test audio: {audio_file}")

Libraries imported
Using device: cpu
Test audio: ../data/US_DebateAudio.wav
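
The pipeline cells time each stage with paired `time.time()` calls. As a minimal sketch of an alternative, a small context manager could record each stage's duration into a dictionary. The names `stage_timer` and `timings` are hypothetical and not part of the notebook, and the `time.sleep` call is only a placeholder for real work.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timings: dict, stage: str):
    """Store the wall-clock duration of the enclosed block in timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so partial runs still report timings.
        timings[stage] = time.perf_counter() - start


timings = {}
with stage_timer(timings, "transcription"):
    time.sleep(0.01)  # placeholder for the real transcription call

total = sum(timings.values())
print(f"transcription: {timings['transcription']:.3f}s, total: {total:.3f}s")
```

`time.perf_counter()` is used here because it is a monotonic clock intended for measuring intervals.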


## Pipeline 1: WhisperX (base) + pyannote

Our first-choice pipeline, prioritizing speed.

**Steps:**
1. Transcribe with WhisperX base
2. Align to word-level timestamps
3. Diarize with pyannote 3.1
4. Assign speakers to words
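
Step 4 attaches a speaker label to each aligned word. One common way to do this, sketched below, is to give each word the speaker whose diarization turn overlaps it the most in time. This is a simplified illustration of the idea, not WhisperX's actual implementation; `assign_speakers`, `overlap`, and the sample turns and words are hypothetical.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list, turns: list) -> list:
    """Label each word with the speaker of the turn it overlaps most, or None."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Hypothetical diarization turns and aligned words.
turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 5.0, "speaker": "SPEAKER_01"},
]
words = [
    {"word": "Good", "start": 0.2, "end": 0.6},
    {"word": "evening", "start": 1.8, "end": 2.5},  # straddles the turn boundary
    {"word": "thanks", "start": 6.0, "end": 6.4},   # falls outside every turn
]
labeled = assign_speakers(words, turns)
for item in labeled:
    print(item["word"], item["speaker"])
```

Words that fall outside every turn keep a speaker of `None` in this sketch; a real pipeline might instead fall back to the nearest turn.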

In [None]:
print("=" * 70)
print("PIPELINE 1: WhisperX (base) + pyannote")
print("=" * 70)

# ============================================================
# STEP 1: Transcribe with WhisperX
# ============================================================
print("\n[1/4] Loading WhisperX base model...")
start_time = time.time()
whisperx_model = whisperx.load_model("base", device=device, compute_type="int8" if device=="cpu" else "float16")
load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f}s")

print("\n[2/4] Transcribing audio with WhisperX...")
start_transcribe = time.time()
result_wx = whisperx_model.transcribe(audio_file)
transcribe_time = time.time() - start_transcribe
print(f"Transcribed in {transcribe_time:.2f}s")

# ============================================================
# STEP 2: Align timestamps
# ============================================================
print("\n[3/4] Aligning to word-level timestamps...")
start_align = time.time()
model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
result_wx = whisperx.align(result_wx["segments"], model_a, metadata, audio_file, device=device)
align_time = time.time() - start_align
print(f"Aligned in {align_time:.2f}s")

# ============================================================
# STEP 3: Diarize with pyannote
# ============================================================
print("\n[4/4] Running speaker diarization...")
# Note: diarize_time covers loading the pyannote pipeline, diarization, and speaker assignment
start_diarize = time.time()
diarize_model = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN
)
diarize_segments = diarize_model(audio_file)

# Convert to DataFrame
diarize_list = []
for turn, _, speaker in diarize_segments.itertracks(yield_label=True):
    diarize_list.append({'start': turn.start, 'end': turn.end, 'speaker': speaker})
diarize_df = pd.DataFrame(diarize_list)

# Assign speakers
result_wx = whisperx.assign_word_speakers(diarize_df, result_wx)
diarize_time = time.time() - start_diarize
print(f"Diarized in {diarize_time:.2f}s")

# ============================================================
# Store results
# ============================================================
total_time = transcribe_time + align_time + diarize_time
text_preview = " ".join([seg.get('text', '') for seg in result_wx['segments'][:3]])

results['pipeline'].append('WhisperX (base) + pyannote')
results['transcription_time'].append(transcribe_time)
results['alignment_time'].append(align_time)
results['diarization_time'].append(diarize_time)
results['total_time'].append(total_time)
results['num_speakers'].append(len(diarize_df['speaker'].unique()))
results['num_segments'].append(len(result_wx['segments']))
results['transcription_preview'].append(text_preview)

print(f"\n{'='*70}")
print("PIPELINE 1 RESULTS:")
print(f"  Total time: {total_time:.2f}s")
print(f"  Speakers found: {len(diarize_df['speaker'].unique())}")
print(f"  Segments: {len(result_wx['segments'])}")
print(f"  Preview: {text_preview[:100]}...")
print(f"{'='*70}\n")

PIPELINE 1: WhisperX (base) + pyannote

[1/4] Loading WhisperX base model...
2025-11-11 19:03:37 - whisperx.asr - INFO - No language specified, language will be detected for each audio file (increases inference time)
2025-11-11 19:03:37 - whisperx.vads.pyannote - INFO - Performing voice activity detection using Pyannote...


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint c:\Users\norak\SpeakSense\venv\Lib\site-packages\whisperx\assets\pytorch_model.bin`
  torchaudio.list_audio_backends()


Model was trained with pyannote.audio 0.0.1, yours is 3.4.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.8.0+cpu. Bad things might happen unless you revert torch to 1.x.
Model loaded in 3.67s

[2/4] Transcribing audio with WhisperX...
2025-11-11 19:04:03 - whisperx.asr - INFO - Detected language: en (0.99) in first 30s of audio


## Pipeline 2: Whisper (large) + pyannote

A higher-accuracy but slower pipeline.

**Difference from Pipeline 1:**
- Uses Whisper **large** model instead of base
- Expected to give more accurate transcription
- Expected to be significantly slower

In [None]:
print("=" * 70)
print("PIPELINE 2: Whisper (large) + pyannote")
print("=" * 70)

# ============================================================
# STEP 1: Transcribe with Whisper Large
# ============================================================
print("\n[1/3] Loading Whisper large model...")
start_time = time.time()
whisper_large = whisper.load_model("large", device=device)
load_time = time.time() - start_time
print(f"✓ Model loaded in {load_time:.2f}s")

print("\n[2/3] Transcribing audio with Whisper large...")
start_transcribe = time.time()
result_large = whisper_large.transcribe(audio_file, verbose=False)
transcribe_time = time.time() - start_transcribe
print(f"✓ Transcribed in {transcribe_time:.2f}s")

# ============================================================
# STEP 2: Diarize (reuse pyannote model from Pipeline 1)
# ============================================================
print("\n[3/3] Reusing diarization from Pipeline 1...")
# diarize_df and diarize_time come from Pipeline 1, so run that cell first.
# No new diarization happens here; in real scenarios you might re-run it.
diarize_time_reused = 0  # Extra diarization time spent in this cell

# ============================================================
# Store results
# ============================================================
total_time = transcribe_time + diarize_time  # Using original diarize time
text_preview = result_large['text'][:100]

results['pipeline'].append('Whisper (large) + pyannote')
results['transcription_time'].append(transcribe_time)
results['alignment_time'].append(0)  # Whisper doesn't have alignment
results['diarization_time'].append(diarize_time)  # From Pipeline 1
results['total_time'].append(total_time)
results['num_speakers'].append(len(diarize_df['speaker'].unique()))
results['num_segments'].append(len(result_large['segments']))
results['transcription_preview'].append(text_preview)

print(f"\n{'='*70}")
print("PIPELINE 2 RESULTS:")
print(f"  Total time: {total_time:.2f}s")
print(f"  Speakers found: {len(diarize_df['speaker'].unique())}")
print(f"  Segments: {len(result_large['segments'])}")
print(f"  Preview: {text_preview[:100]}...")
print(f"{'='*70}\n")

TEST 3: WhisperX + Alignment + Diarization

[1/2] Loading diarization model...


Model loaded in 2.15 seconds

[2/2] Running speaker diarization...


  std = sequences.std(dim=-1, correction=1)
  info = torchaudio.info(file["audio"], backend=backend)
  return AudioMetaData(

Diarization complete in 772.57 seconds

Results:
  Transcription time: 72.40s
  Alignment time: 118.09s
  Diarization time: 772.57s
  Total time: 963.06s
  Speakers found: 3


## Step Five: Compare Results

Let's visualize the performance differences.
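
Before plotting, the stage timings can be summarized numerically. The sketch below takes the timings printed in the run output above (72.40 s transcription, 118.09 s alignment, 772.57 s diarization) and computes each stage's share of the total plus the real-time factor, meaning processing time divided by audio duration. The 540 s duration is an assumption based on the "9-minute" description of the test clip, and `summarize_timings` is a hypothetical helper name.

```python
def summarize_timings(stage_seconds: dict, audio_seconds: float) -> dict:
    """Return total time, real-time factor, and each stage's fraction of the total."""
    total = sum(stage_seconds.values())
    return {
        "total_seconds": total,
        "real_time_factor": total / audio_seconds,
        "stage_share": {stage: t / total for stage, t in stage_seconds.items()},
    }


# Timings copied from the printed results; audio length is an assumed 9 * 60 seconds.
reported = {"transcription": 72.40, "alignment": 118.09, "diarization": 772.57}
summary = summarize_timings(reported, audio_seconds=540.0)

for stage, share in summary["stage_share"].items():
    print(f"{stage:>14}: {share:6.1%}")
print(f"real-time factor: {summary['real_time_factor']:.2f}x")
```

On these numbers, diarization accounts for roughly 80% of the total, and under the assumed clip length the CPU run is slower than real time.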