# Model Comparison: WhisperX vs OpenAI Whisper vs Pyannote

## Project Overview
This notebook benchmarks different speech recognition and diarization models to determine:
- **Which is most accurate?**
- **Which is fastest?**
- **Which is best for our bias detection use case?**

## Models We're Testing

### Transcription Models:
1. **WhisperX** - Our current choice
   - Pros: Word-level timestamps, alignment, fast
   - Cons: Additional processing steps
   
2. **OpenAI Whisper (base)** - The original
   - Pros: Simple, well-documented, accurate
   - Cons: Slower, segment-level timestamps by default (word-level timestamps require `word_timestamps=True`)

### Diarization:
3. **pyannote speaker-diarization-3.1** (via pyannote.audio) - Current speaker identification method
   - Pros: Industry standard, very accurate
   - Cons: Requires Hugging Face token, slower

## Evaluation Metrics
- **Processing time** - How long does it take?
- **Transcription accuracy** - Word Error Rate (WER)
- **Diarization accuracy** - Speaker identification quality
- **Memory usage** - Resource consumption
- **Ease of use** - Setup complexity
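
To make the Word Error Rate metric concrete: WER = (S + D + I) / N, where S, D and I are the substituted, deleted and inserted words and N is the number of words in the reference transcript. The notebook imports `wer` from `jiwer` for this; the sketch below is a dependency-free illustration of the same idea, not jiwer's implementation. The function name `word_error_rate` is hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("does" -> "doesn't") plus one deletion ("not") over 6 reference words
print(word_error_rate("she does not have a plan", "she doesn't have a plan"))
```

Note that WER needs a trusted reference transcript of the test audio; without one, only the two model outputs can be compared with each other.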

## Test Audio
We'll use our 9-minute US Debate audio file for a fair comparison.
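
Since the processing times only mean something relative to the audio length, it can help to confirm the duration first. A minimal sketch using the standard-library `wave` module follows; it assumes the file is an uncompressed PCM WAV, and `wav_duration_seconds` is a hypothetical helper name.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Example call (path as used later in this notebook):
# duration = wav_duration_seconds("../data/US_DebateAudio.wav")
```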

In [2]:
# Install all required packages for comparison
# Run this once (python-dotenv is needed for load_dotenv in the next cell)

%pip install openai-whisper whisperx jiwer pandas matplotlib python-dotenv

Collecting jiwer
  Downloading jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.14.3-cp312-cp312-win_amd64.whl.metadata (12 kB)
Downloading jiwer-4.0.0-py3-none-any.whl (23 kB)
Downloading rapidfuzz-3.14.3-cp312-cp312-win_amd64.whl (1.5 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-4.0.0 rapidfuzz-3.14.3





## Step One: Import Libraries and Setup

We'll import both Whisper models and prepare for benchmarking.

In [3]:
# Import libraries
import whisper  # Original OpenAI Whisper
import whisperx  # WhisperX (our current choice)
import torch
import time
import pandas as pd
import matplotlib.pyplot as plt
from jiwer import wer  # Word Error Rate calculation
import os
from dotenv import load_dotenv

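# Load environment variables (HF_TOKEN is needed later for the pyannote diarization model)
load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")
if not HF_TOKEN:
    print("Warning: HF_TOKEN is not set; loading the diarization model may fail")

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Libraries imported")
print(f"Using device: {device}")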

# Prepare results storage
results = {
    'model': [],
    'transcription_time': [],
    'alignment_time': [],
    'diarization_time': [],
    'total_time': [],
    'segments_found': [],
    'memory_usage': []
}

Libraries imported
Using device: cpu


## Step Two: Test OpenAI Whisper (Original)

Let's benchmark the **original** Whisper model first.

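**What we're timing:**
- Model loading time (reported separately; on a first run it includes the one-time weight download)
- Transcription time
- Total processing time (transcription only for this model, since there is no alignment or diarization step)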
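
The cells in this notebook time each stage with paired `time.time()` calls. A small context manager is one way to keep that bookkeeping in one place; the sketch below uses `time.perf_counter()`, which is intended for measuring intervals. The name `timed` is hypothetical, and the `time.sleep` calls are stand-ins for the real model calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, timings: dict):
    """Store the wall-clock duration of the enclosed block in timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}
with timed("load", timings):
    time.sleep(0.05)  # stand-in for whisper.load_model(...)
with timed("transcribe", timings):
    time.sleep(0.10)  # stand-in for whisper_model.transcribe(...)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f}s")
```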

In [4]:
print("=" * 60)
print("TEST 1: OpenAI Whisper")
print("=" * 60)

# Path to test audio
audio_file = "../data/US_DebateAudio.wav"

# TIME: Model loading
print("\nLoading OpenAI Whisper model...")
start_load = time.time()
whisper_model = whisper.load_model("base", device=device)
load_time = time.time() - start_load
print(f"Model loaded in {load_time:.2f} seconds")

# TIME: Transcription
print("\nTranscribing audio...")
start_transcribe = time.time()
whisper_result = whisper_model.transcribe(audio_file, verbose=False)
transcribe_time = time.time() - start_transcribe
print(f"Transcription complete in {transcribe_time:.2f} seconds")

# Store results
results['model'].append('OpenAI Whisper')
results['transcription_time'].append(transcribe_time)
results['alignment_time'].append(0)  # No alignment step
results['diarization_time'].append(0)  # No diarization
results['total_time'].append(transcribe_time)
results['segments_found'].append(len(whisper_result['segments']))
results['memory_usage'].append('N/A')  # Can add psutil for this

print("\nResults:")
print(f"  Total time: {transcribe_time:.2f}s")
print(f"  Segments found: {len(whisper_result['segments'])}")
print(f"  Text preview: {whisper_result['text'][:200]}...")

TEST 1: OpenAI Whisper

Loading OpenAI Whisper model...


100%|███████████████████████████████████████| 139M/139M [00:13<00:00, 10.5MiB/s]


Model loaded in 15.45 seconds

Transcribing audio...




Detected language: English


 99%|█████████▉| 56232/56577 [01:29<00:00, 627.00frames/s]

Transcription complete in 92.78 seconds

Results:
  Total time: 92.78s
  Segments found: 154
  Text preview:  She doesn't have a plan. She copied Biden's plan and it's like four sentences, like run-spot-run, four sentences that are just, oh, we'll try in lower taxes. She doesn't have a plan. Take a look at h...
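
The `memory_usage` column is stored as `'N/A'` in this test, with psutil suggested in the code comment. Here is a minimal sketch of one way to fill it in. `run_with_memory` is a hypothetical helper: it reports the change in resident memory (RSS) if psutil happens to be installed, and otherwise falls back to the standard-library `tracemalloc`, which only sees Python-level allocations and will undercount memory held by native tensor libraries. An RSS delta is also a coarse number, not a true peak.

```python
import os
import tracemalloc

try:
    import psutil  # third-party and optional; assumed, not required
except ImportError:
    psutil = None


def run_with_memory(fn, *args, **kwargs):
    """Run fn and return (result, memory_mb, method_used)."""
    if psutil is not None:
        process = psutil.Process(os.getpid())
        before = process.memory_info().rss
        result = fn(*args, **kwargs)
        after = process.memory_info().rss
        return result, (after - before) / 1024**2, "rss_delta"

    # Fallback: peak size of Python-level allocations made while fn runs
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1024**2, "tracemalloc_peak"


# Stand-in workload; in the notebook this could wrap whisper_model.transcribe
data, memory_mb, method = run_with_memory(lambda: [0] * 100_000)
print(f"{method}: {memory_mb:.2f} MB")
```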





## Step Three: Test WhisperX (Current Method)

Now let's benchmark our current pipeline with alignment.

In [5]:
print("=" * 60)
print("TEST 2: WhisperX with Alignment")
print("=" * 60)

# TIME: Model loading
print("\n[1/3] Loading WhisperX model...")
start_load = time.time()
whisperx_model = whisperx.load_model("base", device=device, compute_type="int8" if device=="cpu" else "float16")
load_time = time.time() - start_load
print(f"✓ Model loaded in {load_time:.2f} seconds")

# TIME: Transcription
print("\n[2/3] Transcribing audio...")
start_transcribe = time.time()
whisperx_result = whisperx_model.transcribe(audio_file)
transcribe_time = time.time() - start_transcribe
print(f"✓ Transcription complete in {transcribe_time:.2f} seconds")

# TIME: Alignment (this measurement includes loading the alignment model)
print("\n[3/3] Aligning timestamps to word level...")
start_align = time.time()
model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
whisperx_result = whisperx.align(whisperx_result["segments"], model_a, metadata, audio_file, device=device)
align_time = time.time() - start_align
print(f"✓ Alignment complete in {align_time:.2f} seconds")

# Store results
total_time = transcribe_time + align_time
results['model'].append('WhisperX + Alignment')
results['transcription_time'].append(transcribe_time)
results['alignment_time'].append(align_time)
results['diarization_time'].append(0)  # No diarization yet
results['total_time'].append(total_time)
results['segments_found'].append(len(whisperx_result['segments']))
results['memory_usage'].append('N/A')

print("\nResults:")
print(f"  Transcription time: {transcribe_time:.2f}s")
print(f"  Alignment time: {align_time:.2f}s")
print(f"  Total time: {total_time:.2f}s")
print(f"  Segments found: {len(whisperx_result['segments'])}")

TEST 2: WhisperX with Alignment

[1/3] Loading WhisperX model...




2025-11-11 11:40:23 - whisperx.asr - INFO - No language specified, language will be detected for each audio file (increases inference time)
2025-11-11 11:40:23 - whisperx.vads.pyannote - INFO - Performing voice activity detection using Pyannote...


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint c:\Users\norak\SpeakSense\venv\Lib\site-packages\whisperx\assets\pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.4.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.8.0+cpu. Bad things might happen unless you revert torch to 1.x.
✓ Model loaded in 68.52 seconds

[2/3] Transcribing audio...
2025-11-11 11:40:48 - whisperx.asr - INFO - Detected language: en (0.99) in first 30s of audio
✓ Transcription complete in 72.40 seconds

[3/3] Aligning timestamps to word level...
✓ Alignment complete in 118.09 seconds

Results:
  Transcription time: 72.40s
  Alignment time: 118.09s
  Total time: 190.49s
  Segments found: 112


## Step Four: Test Full Pipeline (WhisperX + Diarization)

Finally, let's test the complete pipeline including speaker identification.
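
To make the speaker-assignment step concrete: diarization yields speaker turns (start, end, label), alignment yields word timestamps, and each word gets the speaker whose turn overlaps it most. The sketch below illustrates that idea in plain Python. It is a simplified illustration, not WhisperX's actual code, and `assign_speakers` plus all timestamps are made up for the example.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of dicts with 'word', 'start', 'end' (seconds)
    turns: list of dicts with 'start', 'end', 'speaker'
    Words that overlap no turn get speaker None.
    """
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            overlap = min(word["end"], turn["end"]) - max(word["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Made-up timestamps for illustration only
turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 5.0, "speaker": "SPEAKER_01"},
]
words = [
    {"word": "She", "start": 0.2, "end": 0.5},
    {"word": "plan", "start": 1.8, "end": 2.6},   # straddles both turns
    {"word": "taxes", "start": 6.0, "end": 6.4},  # outside every turn
]
for item in assign_speakers(words, turns):
    print(item["word"], item["speaker"])
```

This is also why alignment quality matters for our use case: a word with a poor timestamp can land in the wrong speaker's turn.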

In [None]:
print("=" * 60)
print("TEST 3: WhisperX + Alignment + Diarization")
print("=" * 60)

# We already have transcription and alignment from previous test
# Just need to add diarization

from pyannote.audio import Pipeline

# TIME: Diarization
print("\n[1/2] Loading diarization model...")
start_diarize_load = time.time()
diarize_model = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN
)
diarize_load_time = time.time() - start_diarize_load
print(f"Model loaded in {diarize_load_time:.2f} seconds")

print("\n[2/2] Running speaker diarization...")
start_diarize = time.time()
diarize_segments = diarize_model(audio_file)
diarize_time = time.time() - start_diarize
print(f"Diarization complete in {diarize_time:.2f} seconds")

# Convert diarization turns to a DataFrame (pandas is already imported as pd)
diarize_df = pd.DataFrame(
    [
        {'start': turn.start, 'end': turn.end, 'speaker': speaker}
        for turn, _, speaker in diarize_segments.itertracks(yield_label=True)
    ]
)

# Assign speakers to words
full_result = whisperx.assign_word_speakers(diarize_df, whisperx_result)

# Store results
total_time = transcribe_time + align_time + diarize_time
results['model'].append('WhisperX + Align + Diarization')
results['transcription_time'].append(transcribe_time)
results['alignment_time'].append(align_time)
results['diarization_time'].append(diarize_time)
results['total_time'].append(total_time)
results['segments_found'].append(len(full_result['segments']))
results['memory_usage'].append('N/A')

print("\nResults:")
print(f"  Transcription time: {transcribe_time:.2f}s")
print(f"  Alignment time: {align_time:.2f}s")
print(f"  Diarization time: {diarize_time:.2f}s")
print(f"  Total time: {total_time:.2f}s")
print(f"  Speakers found: {diarize_df['speaker'].nunique()}")

TEST 3: WhisperX + Alignment + Diarization

[1/2] Loading diarization model...




Model loaded in 2.31 seconds

[2/2] Running speaker diarization...



## Step Five: Compare Results

Let's visualize the performance differences.
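
Before plotting, it helps to put the measured numbers side by side. The sketch below uses only the timings printed by Test 1 and Test 2 above (CPU run). Test 3 is left out because its output was not captured. The 540-second audio length is an assumption taken from the "9-minute" description, not a measured value, and the column names are illustrative.

```python
AUDIO_SECONDS = 9 * 60  # assumption: "9-minute" audio, not measured

# Timings copied from the Test 1 and Test 2 outputs (seconds)
measured = [
    {"model": "OpenAI Whisper", "transcription": 92.78, "alignment": 0.0},
    {"model": "WhisperX + Alignment", "transcription": 72.40, "alignment": 118.09},
]

baseline = measured[0]["transcription"] + measured[0]["alignment"]
rows = []
for run in measured:
    total = run["transcription"] + run["alignment"]
    rows.append({
        "model": run["model"],
        "total_s": round(total, 2),
        "vs_whisper": round(total / baseline, 2),             # >1 means slower than Whisper
        "real_time_factor": round(total / AUDIO_SECONDS, 2),  # <1 means faster than real time
    })

print(f"{'model':<24}{'total_s':>10}{'vs_whisper':>12}{'rtf':>8}")
for row in rows:
    print(f"{row['model']:<24}{row['total_s']:>10.2f}"
          f"{row['vs_whisper']:>12.2f}{row['real_time_factor']:>8.2f}")
```

On this CPU run, WhisperX transcribes faster than the original Whisper, but the alignment step makes the combined pipeline roughly twice as slow. The same `rows` list can be passed to `pd.DataFrame` for plotting.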