# Speech Diarization for Media Bias Analysis

## Project Overview
This notebook demonstrates speech diarization, the process of identifying "who spoke when" in an audio recording. We'll use it to measure how speaking time is distributed among speakers in media recordings, which can point to potential bias and add context to the analysis.

## What is Diarization?
- **Transcription** = converting speech to text
- **Diarization** = identifying which speaker is talking during each stretch of audio
- **Goal** = measure speaking time, interruptions, and speaking patterns to detect media bias
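
To make the interruption part of the goal concrete, here is a minimal sketch in plain Python. It is not WhisperX's API. It assumes diarization output has already been reduced to hypothetical `(speaker, start, end)` tuples, and it uses a deliberately simple heuristic: a turn counts as an interruption when it starts before the previous turn, by a different speaker, has ended. The name `count_interruptions` and the sample `turns` are made up for illustration.

```python
# Hypothetical diarization output: one (speaker, start, end) tuple per turn, in seconds.
turns = [
    ("SPEAKER_00", 0.0, 4.0),
    ("SPEAKER_01", 3.5, 6.0),   # starts before SPEAKER_00 finishes
    ("SPEAKER_00", 6.2, 9.0),   # starts after a short pause
    ("SPEAKER_01", 8.0, 11.0),  # starts before SPEAKER_00 finishes
]


def count_interruptions(turns):
    """Count, per speaker, turns that begin while a different speaker's previous turn is still running."""
    counts = {}
    ordered = sorted(turns, key=lambda t: t[1])  # order turns by start time
    for (prev_spk, _, prev_end), (spk, start, _) in zip(ordered, ordered[1:]):
        if spk != prev_spk and start < prev_end:
            counts[spk] = counts.get(spk, 0) + 1
    return counts


interruptions = count_interruptions(turns)
print(interruptions)
```

Overlap alone is a rough signal: brief backchannels ("mm-hm") would also be counted, so a minimum overlap duration may be worth adding before drawing conclusions about bias.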

## Steps:
1. Load and transcribe audio with WhisperX
2. Perform speaker diarization
3. Align transcription with speaker labels
4. Calculate speaking time statistics
5. Analyze patterns for bias indicators
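
Step 4 can be sketched in plain Python. This is a minimal illustration, assuming each segment is a dict with `start`, `end`, and a `speaker` label once step 3 has attached speakers; the helper names `speaking_time` and `speaking_share` and the sample values are hypothetical.

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum (end - start) per speaker; segments without a label are grouped under 'UNKNOWN'."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.get("speaker", "UNKNOWN")] += seg["end"] - seg["start"]
    return dict(totals)


def speaking_share(totals):
    """Convert per-speaker seconds into fractions of the total speaking time."""
    total = sum(totals.values())
    return {spk: t / total for spk, t in totals.items()} if total else {}


# Hypothetical labeled segments (times in seconds).
segments = [
    {"start": 0.0, "end": 6.0, "speaker": "SPEAKER_00"},
    {"start": 6.0, "end": 8.0, "speaker": "SPEAKER_01"},
    {"start": 8.0, "end": 10.0, "speaker": "SPEAKER_00"},
]

totals = speaking_time(segments)
shares = speaking_share(totals)
for spk, share in shares.items():
    print(f"{spk}: {totals[spk]:.1f}s ({share:.0%})")
```

Shares are computed against total speaking time rather than recording length, so silence and music do not dilute the comparison between speakers.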

In [None]:
# Import required libraries
import whisperx
import torch

print("✓ Libraries imported")
print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

✓ Libraries imported
Using device: cpu


## Load the Speech Recognition Model

The **"base"** model is a good pick: fast enough for testing, accurate enough for real analysis.


In [8]:
# Load the WhisperX model on CPU; int8 quantization reduces memory use
print("Loading WhisperX model...")
model = whisperx.load_model("base", device="cpu", compute_type="int8")
print("Model loaded successfully")

Loading WhisperX model...
2025-11-03 11:24:09 - whisperx.asr - INFO - No language specified, language will be detected for each audio file (increases inference time)
2025-11-03 11:24:09 - whisperx.vads.pyannote - INFO - Performing voice activity detection using Pyannote...


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint c:\Users\norak\SpeakSense\venv\Lib\site-packages\whisperx\assets\pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.4.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.8.0+cpu. Bad things might happen unless you revert torch to 1.x.
Model loaded successfully


## Transcribe Audio
This converts the speech in our audio file to text, with a start and end timestamp for each segment.

In [None]:
# Transcribe the audio file
audio_file = "../data/sampleaudio.wav"
print(f"Transcribing: {audio_file}")
result = model.transcribe(audio_file)
print(f"Done! Found {len(result['segments'])} segments")

Transcribing: ../data/sampleaudio.wav
2025-11-03 11:24:12 - whisperx.asr - INFO - Detected language: en (0.99) in first 30s of audio
Done! Found 1 segments


## View Transcription Results
Let's see what was said and when. Each segment shows the time range and the transcribed text.

In [10]:
# Display the transcription with timestamps
print("=" * 50)
print("TRANSCRIPTION RESULTS")
print("=" * 50)

# Iterate through segments and print start time, end time, and text
for i, segment in enumerate(result['segments'], 1):
    start = segment['start']
    end = segment['end']
    text = segment['text']
    print(f"\n[{i}] {start:.2f}s → {end:.2f}s")
    print(f"    {text}")

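==================================================
TRANSCRIPTION RESULTS
==================================================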

[1] 0.96s → 11.98s
     Maybe I'm not good enough. Yes you are. Maybe I'm not. It's like, maybe I'm one of those people who is always gonna dream about doing stuff. You're not. You're gonna do it.
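
The single segment above appears to mix lines from more than one speaker, so labeling whole segments would be too coarse for this clip. One option is to label individual words by the diarization turn they overlap most. The sketch below assumes word-level timestamps and speaker turns are available as simple dicts; `label_words`, the sample words, and all timings are hypothetical, and this is not WhisperX's own implementation.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Attach to each word the speaker of the turn it overlaps most, or None if it overlaps no turn."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Hypothetical speaker turns and word timings (seconds), for illustration only.
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0},
    {"speaker": "SPEAKER_01", "start": 2.0, "end": 4.0},
]
words = [
    {"word": "first", "start": 0.5, "end": 0.9},
    {"word": "second", "start": 1.8, "end": 2.5},  # straddles the turn boundary
    {"word": "third", "start": 5.0, "end": 5.4},   # outside every turn
]

labeled = label_words(words, turns)
for w in labeled:
    print(f"{w['word']}: {w['speaker']}")
```

Words that fall outside every turn are left as `None` rather than guessed, so they can be excluded from speaking time totals or reviewed by hand.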
