# Speaker Diarization for Media Bias Analysis

## Project Overview
This notebook demonstrates speaker diarization: the process of identifying "who spoke when" in an audio recording. We'll use it to measure how speaking time is distributed in media recordings, which can reveal potential biases and add context to what was said.

## What is Diarization?
- **Transcription** = converting speech to text
- **Diarization** = identifying which speaker is talking during each part of the audio
- **Goal** = measure speaking time, interruptions, and speaking patterns as possible indicators of media bias

## Steps:
1. Load and transcribe audio with WhisperX
2. Align the transcription to word-level timestamps
3. Perform speaker diarization
4. Assign speaker labels to the transcribed words
5. Calculate speaking time statistics
6. Analyze patterns for bias indicators

In [35]:
# Import required libraries
import whisperx  # Speech recognition & speaker diarization
import torch     # Deep learning operations

print("Libraries imported")

# Report whether a GPU is available (the cells below run on CPU either way)
print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

Libraries imported
Using device: cpu
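
The check above only prints the result, while the later cells pass `device="cpu"` explicitly. A minimal sketch of choosing the device once and reusing it might look like the following. The helper name `pick_device` and the choice of `float16` on GPU are assumptions for illustration, not part of the original notebook.

```python
def pick_device() -> str:
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch still runs without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# Assumption: half precision on GPU, 8-bit quantization on CPU
compute_type = "float16" if device == "cuda" else "int8"
print(f"device={device}, compute_type={compute_type}")
```

These two values could then replace the hard-coded arguments, for example `whisperx.load_model("base", device=device, compute_type=compute_type)`.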


## Load the Speech Recognition Model

The **"base"** model is a good pick: fast enough for testing, accurate enough for real analysis.


In [36]:
# Load the WhisperX model
print("Loading WhisperX model...")

# Downloads the model on first use and initializes it
# compute_type="int8" uses 8-bit quantization, which suits CPU inference
model = whisperx.load_model("base", device="cpu", compute_type="int8")

print("Model loaded successfully")

Loading WhisperX model...
2025-11-09 00:01:36 - whisperx.asr - INFO - No language specified, language will be detected for each audio file (increases inference time)
2025-11-09 00:01:36 - whisperx.vads.pyannote - INFO - Performing voice activity detection using Pyannote...


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint c:\Users\norak\SpeakSense\venv\Lib\site-packages\whisperx\assets\pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.4.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.8.0+cpu. Bad things might happen unless you revert torch to 1.x.
Model loaded successfully


## Transcribe Audio
This converts the speech in our audio file to text with segment-level timestamps. Word-level timing is added later, in the alignment step.

**What happens:**
1. Audio is loaded and converted to the right format (using ffmpeg)
2. The AI model processes the audio in chunks
3. Each chunk is transcribed to text
4. Timestamps mark when each segment starts and ends

**Output:** A dictionary containing segments of transcribed text with timing information
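
To make the output format concrete, here is a small sketch of the dictionary shape and a helper that summarizes it. The sample values are made up for illustration, and `summarize_segments` is a hypothetical helper, not a WhisperX function.

```python
# Hypothetical example of the structure returned by transcription:
# a list of segments plus the detected language.
sample_result = {
    "segments": [
        {"start": 0.50, "end": 3.20, "text": " Hello and welcome."},
        {"start": 3.80, "end": 6.10, "text": " Thanks for having me."},
    ],
    "language": "en",
}


def summarize_segments(result: dict) -> dict:
    """Count segments and add up the time they cover."""
    segments = result["segments"]
    covered = sum(seg["end"] - seg["start"] for seg in segments)
    return {
        "language": result.get("language"),
        "num_segments": len(segments),
        "covered_seconds": round(covered, 2),
    }


print(summarize_segments(sample_result))
```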

In [37]:
# Transcribe the audio file

# Path to the audio files
audio_file = "../data/sampleaudio.wav"
audio_file2 = "../data/US_DebateAudio.wav"

print(f"Transcribing: {audio_file}")
# Processes the audio and returns a dictionary - segments with text & timestamps & language info
result = model.transcribe(audio_file)

# Counts how many speech segments were found
print(f"Done! Found {len(result['segments'])} segments")

Transcribing: ../data/sampleaudio.wav
2025-11-09 00:01:39 - whisperx.asr - INFO - Detected language: en (0.99) in first 30s of audio
Done! Found 1 segments


## View Transcription Results
Let's see what was said and when. Each segment shows:
- **Segment number**: [1], [2], [3]...
- **Time range**: when it was spoken (e.g., 0.50s → 3.20s)
- **Transcribed text**: what was actually said

In [38]:
# Display the transcription with timestamps
print("=" * 50)
print("TRANSCRIPTION RESULTS")
print("=" * 50)

# Iterate through segments and print start time, end time, and text
for i, segment in enumerate(result['segments'], 1):

    start = segment['start'] 
    end = segment['end']
    text = segment['text']
    # Print timestamp to 2 decimal places
    print(f"\n[{i}] {start:.2f}s → {end:.2f}s")
    print(f"    {text}")

TRANSCRIPTION RESULTS

[1] 0.96s → 11.98s
     Maybe I'm not good enough. Yes you are. Maybe I'm not. It's like, maybe I'm one of those people who is always gonna dream about doing stuff. You're not. You're gonna do it.


## Perform Speaker Diarization
Now we can start identifying WHO is speaking.

We use pyannote.audio to cluster voices and label them SPEAKER_00, SPEAKER_01, etc.

First, load the Hugging Face token from our .env file so we can access the pyannote diarization models.

In [39]:
# Imports to access token from .env file
import os
from dotenv import load_dotenv

# Load variables from .env file into environment
load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")
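
If the token is missing, `os.getenv` silently returns `None`, and the problem only shows up later when the model download is attempted. A small guard can surface it early. This is a sketch; `require_env` is a hypothetical helper name.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable's value, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or environment."
        )
    return value


# Example with a placeholder variable so the sketch runs on its own
os.environ["EXAMPLE_TOKEN"] = "placeholder-value"
print(require_env("EXAMPLE_TOKEN"))
```

In this notebook the call would be `HF_TOKEN = require_env("HF_TOKEN")`, placed after `load_dotenv()`.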


## Align Timestamps to Words

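Before we can match speakers to individual words, we need word-level timestamps (not just phrase-level). The alignment model refines the segment timestamps from transcription down to individual words, which improves precision.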
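
After alignment, each segment is expected to carry a `words` list with per-word timing. The sketch below uses made-up values to show how such a structure can be walked. It assumes that some words (for example numerals) may come back without `start`/`end` keys, so it skips those. `iter_timed_words` is a hypothetical helper.

```python
# Hypothetical aligned segment; the values are illustrative only
aligned_segment = {
    "start": 0.96,
    "end": 2.14,
    "text": " Maybe in 2014.",
    "words": [
        {"word": "Maybe", "start": 0.96, "end": 1.30, "score": 0.91},
        {"word": "in", "start": 1.36, "end": 1.48, "score": 0.88},
        {"word": "2014."},  # no timing: could not be aligned
    ],
}


def iter_timed_words(segment: dict):
    """Yield (word, start, end) for words that have both timestamps."""
    for w in segment.get("words", []):
        if "start" in w and "end" in w:
            yield w["word"], w["start"], w["end"]


for word, start, end in iter_timed_words(aligned_segment):
    print(f"{word:>6}: {start:.2f}s → {end:.2f}s")
```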

In [40]:
print("Aligning timestamps...")

# Load the alignment model for English
model_a, metadata = whisperx.load_align_model(language_code="en", device="cpu")

# Align the transcribed segments against the original audio to get
# word-level timestamps, which make the later speaker assignment more accurate
result = whisperx.align(result["segments"], model_a, metadata, audio_file, device="cpu")

print("Timestamps aligned successfully")

Aligning timestamps...
Timestamps aligned successfully


## Load Speaker Diarization Model

Now we load the **diarization model** (pyannote.audio), which identifies the different speakers, and run it on the audio file.

In [41]:
print("Loading diarization model...")
from pyannote.audio import Pipeline

# Load the pre-trained diarization model from Hugging Face
# Pipeline.from_pretrained() downloads the model if not cached
diarize_model = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN
)

print("Diarization model loaded")

print("Running diarization...")

# Analyzes the audio file to identify speakers and their speaking times
# Clusters similar voices using machine learning
diarize_segments = diarize_model(audio_file)
print("Diarization complete")

Loading diarization model...
Diarization model loaded
Running diarization...
Diarization complete


## View Diarization Results

This shows **WHO** spoke **WHEN** (not what they said ... just yet).

**Format:** `SPEAKER_XX: start_time → end_time`

**What you'll see:**
- SPEAKER_00, SPEAKER_01, etc. (standard labels assigned by the model)
- Time ranges when each speaker was talking
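- Note: The labels are arbitrary. The model aims to give the same person the same label throughout, but short clips can be split into extra labels or very brief turns.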
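
Very short turns can be artifacts rather than real speech. As a sketch, turns below a minimum duration can be filtered out before further analysis. The 0.1-second threshold and the `drop_short_turns` helper are assumptions for illustration; a suitable cutoff depends on the recording.

```python
# (speaker, start, end) tuples; illustrative values
turns = [
    ("SPEAKER_00", 0.96, 2.16),
    ("SPEAKER_01", 2.22, 2.24),  # only 0.02s long
    ("SPEAKER_02", 2.24, 3.05),
    ("SPEAKER_00", 3.64, 4.44),
]


def drop_short_turns(turns, min_duration=0.1):
    """Keep only turns that last at least min_duration seconds."""
    return [t for t in turns if (t[2] - t[1]) >= min_duration]


for speaker, start, end in drop_short_turns(turns):
    print(f"{speaker}: {start:.2f}s → {end:.2f}s")
```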

In [42]:
# Display diarization results
print("=" * 50)
print("DIARIZATION RESULTS")
print("=" * 50)


# itertracks() iterates through speaker segments
# yield_label=True : includes the speaker label

for turn, _, speaker in diarize_segments.itertracks(yield_label=True):
    print(f"\n{speaker}: {turn.start:.2f}s → {turn.end:.2f}s")

DIARIZATION RESULTS

SPEAKER_00: 0.96s → 2.16s

SPEAKER_01: 2.22s → 2.24s

SPEAKER_02: 2.24s → 3.05s

SPEAKER_00: 3.64s → 4.44s

SPEAKER_00: 4.77s → 5.31s

SPEAKER_00: 5.89s → 9.65s

SPEAKER_01: 10.21s → 12.01s


## Assign Speakers to Words

Now we combine two pieces of information:
- **Transcription** (what was said, from WhisperX)
- **Diarization** (who was speaking when, from pyannote)

**The challenge:** 
- pyannote gives us time ranges for speakers and WhisperX gives us time ranges for words
- We need to match them up by finding overlaps

**The solution:**
Convert pyannote's format to a pandas DataFrame, then use WhisperX's `assign_word_speakers()` function to match speakers to words based on timestamp overlaps.
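
To illustrate the idea of matching by overlap, here is a simplified pure-Python sketch. It is not WhisperX's implementation: it just picks, for each word, the speaker whose turns overlap the word's time range the most. The function names and sample values are hypothetical.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(word_start, word_end, turns):
    """Return the speaker with the largest total overlap, or None."""
    totals = defaultdict(float)
    for speaker, t_start, t_end in turns:
        totals[speaker] += overlap(word_start, word_end, t_start, t_end)
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else None


turns = [
    ("SPEAKER_00", 0.96, 2.16),
    ("SPEAKER_02", 2.24, 3.05),
]
words = [("Maybe", 0.96, 1.30), ("Yes", 2.34, 2.50), ("(silence)", 5.0, 5.2)]

for text, start, end in words:
    print(text, "->", assign_speaker(start, end, turns))
```

Words that fall entirely outside every speaker turn get `None` here, which plays the same role as the `'UNKNOWN'` fallback used when displaying results.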

In [43]:
# Assign speakers to words
print("Assigning speakers to transcription...")

# Import pandas for DataFrame manipulation
import pandas as pd

# Convert pyannote format to pandas DataFrame
# WhisperX expects diarization data as a DataFrame with columns:
# 'start', 'end', 'speaker'

# Extract speaker segments into a list of dictionaries
diarize_list = []
for turn, _, speaker in diarize_segments.itertracks(yield_label=True):
    # Create a dictionary for each speaker segment
    diarize_list.append({
        'start': turn.start,
        'end': turn.end,
        'speaker': speaker
    })

# Convert list to DataFrame (table format)
diarize_df = pd.DataFrame(diarize_list)

# Assign speakers to words
# whisperx.assign_word_speakers() matches speakers to words by:
# - Comparing timestamps of speaker segments with word timestamps
# - Finding overlaps to determine who was speaking each word

# Takes the DataFrame we just created and the aligned transcription result
result = whisperx.assign_word_speakers(diarize_df, result)
print("Speakers assigned to words")

Assigning speakers to transcription...
Speakers assigned to words


In [44]:
# Display final results with speakers
print("=" * 50)
print("TRANSCRIPTION WITH SPEAKERS")
print("=" * 50)

# Loop through all segments
for segment in result["segments"]:
    # Get speaker label - default to 'UNKNOWN' if not found
    speaker = segment.get('speaker', 'UNKNOWN')

    # Extract timing and text
    start = segment['start']
    end = segment['end']
    text = segment['text']
    # Print speaker label with timestamp
    print(f"\n[{speaker}] {start:.2f}s → {end:.2f}s")
    print(f"    {text}")

TRANSCRIPTION WITH SPEAKERS

[SPEAKER_00] 0.96s → 2.14s
     Maybe I'm not good enough.

[SPEAKER_02] 2.34s → 3.03s
    Yes you are.

[SPEAKER_00] 3.73s → 4.43s
    Maybe I'm not.

[SPEAKER_00] 4.91s → 9.71s
    It's like, maybe I'm one of those people who is always gonna dream about doing stuff.

[SPEAKER_01] 10.33s → 10.87s
    You're not.

[SPEAKER_01] 11.15s → 12.00s
    You're gonna do it.


In [45]:
# ============================================================
# FULL PIPELINE: US Presidential Debate Audio
# ============================================================

# Path to debate audio file
debate_audio = "../data/US_DebateAudio.wav"

print("=" * 60)
print("RUNNING FULL PIPELINE ON US DEBATE AUDIO")
print("=" * 60)

# STEP 1: Transcribe
print("\n[1/5] Transcribing audio...")
debate_result = model.transcribe(debate_audio)
print(f"Found {len(debate_result['segments'])} segments")

# STEP 2: Align timestamps
print("\n[2/5] Aligning timestamps to word level...")
debate_result = whisperx.align(
    debate_result["segments"], 
    model_a, 
    metadata, 
    debate_audio, 
    device="cpu"
)
print("Timestamps aligned")

# STEP 3: Run diarization
print("\n[3/5] Running speaker diarization (this may take a few minutes)...")
debate_diarize = diarize_model(debate_audio)
print("Diarization complete")

# STEP 4: Convert diarization to DataFrame
print("\n[4/5] Converting diarization format...")
debate_diarize_list = []
for turn, _, speaker in debate_diarize.itertracks(yield_label=True):
    debate_diarize_list.append({
        'start': turn.start,
        'end': turn.end,
        'speaker': speaker
    })
debate_diarize_df = pd.DataFrame(debate_diarize_list)
print(f"Found {debate_diarize_df['speaker'].nunique()} speakers")

# STEP 5: Assign speakers to words
print("\n[5/5] Assigning speakers to words...")
debate_result = whisperx.assign_word_speakers(debate_diarize_df, debate_result)
print("Speakers assigned")

print("\n" + "=" * 60)
print("PIPELINE COMPLETE!")
print("=" * 60)

RUNNING FULL PIPELINE ON US DEBATE AUDIO

[1/5] Transcribing audio...
2025-11-09 00:02:29 - whisperx.asr - INFO - Detected language: en (0.99) in first 30s of audio
Found 25 segments

[2/5] Aligning timestamps to word level...
Timestamps aligned

[3/5] Running speaker diarization (this may take a few minutes)...
Diarization complete

[4/5] Converting diarization format...
Found 3 speakers

[5/5] Assigning speakers to words...
Speakers assigned

PIPELINE COMPLETE!


In [48]:
# Display the debate transcription with speakers
print("=" * 60)
print("US DEBATE TRANSCRIPTION WITH SPEAKERS")
print("=" * 60)

# Loop through all segments
for i, segment in enumerate(debate_result["segments"], 1):
    # Get speaker label
    speaker = segment.get('speaker', 'UNKNOWN')
    
    # Extract timing and text
    start = segment['start']
    end = segment['end']
    text = segment['text']
    
    # Print with segment number, speaker, timestamp, and text
    print(f"\n[{i}] {speaker} | {start:.2f}s → {end:.2f}s")
    print(f"    {text}")
    
    

US DEBATE TRANSCRIPTION WITH SPEAKERS

[1] SPEAKER_00 | 0.03s → 1.82s
     She doesn't have a plan.

[2] SPEAKER_00 | 2.22s → 12.48s
    She copied Biden's plan and it's like four sentences, like run spot run, four sentences that are just, oh, we'll try and lower taxes.

[3] SPEAKER_00 | 12.98s → 13.83s
    She doesn't have a plan.

[4] SPEAKER_00 | 13.85s → 14.67s
    Take a look at her plan.

[5] SPEAKER_00 | 14.89s → 16.13s
    She doesn't have a plan.

[6] SPEAKER_02 | 16.11s → 20.98s
     I believe in the ambition, the aspirations, the dreams of the American people.

[7] SPEAKER_02 | 21.54s → 26.99s
    And that is why I imagine and have actually a plan to build what I call an opportunity economy.

[8] SPEAKER_02 | 27.39s → 28.20s
    Because here's the thing.

[9] SPEAKER_02 | 28.76s → 32.80s
    We know that we have a shortage of homes in housing.

[10] SPEAKER_02 | 32.82s → 35.63s
    And the cost of housing is too expensive for too many people.

[11] SPEAKER_02 | 36.29s → 39.2

## Speaking Time Analysis

Let's calculate how much each speaker talked during the debate.

For reference, this audio is a 9-minute highlights reel of a 100-minute debate.

In [49]:
# Calculate speaking time per speaker
print("=" * 60)
print("SPEAKING TIME ANALYSIS")
print("=" * 60)

# Calculate total duration for each speaker
speaker_time = {}
for _, row in debate_diarize_df.iterrows():
    speaker = row['speaker']
    duration = row['end'] - row['start']
    
    if speaker in speaker_time:
        speaker_time[speaker] += duration
    else:
        speaker_time[speaker] = duration

# Total time summed over all speakers (overlapping speech counts once per speaker)
total_time = sum(speaker_time.values())

# Display results
print(f"\nTotal speaking time: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")
print("\nSpeaker breakdown:")

for speaker, time in sorted(speaker_time.items()):
    percentage = (time / total_time) * 100
    print(f"\n{speaker}:")
    print(f"  Time: {time:.2f}s ({time/60:.2f} min)")
    print(f"  Percentage: {percentage:.1f}%")
    
# Count diarization segments per speaker (a rough proxy for speaking turns)
print("\n" + "=" * 60)
print("TURN-TAKING ANALYSIS")
print("=" * 60)

speaker_turns = debate_diarize_df['speaker'].value_counts()
print("\nNumber of speaking turns per speaker:")
for speaker, count in speaker_turns.items():
    print(f"  {speaker}: {count} turns")

SPEAKING TIME ANALYSIS

Total speaking time: 544.00 seconds (9.07 minutes)

Speaker breakdown:

SPEAKER_00:
  Time: 244.06s (4.07 min)
  Percentage: 44.9%

SPEAKER_01:
  Time: 7.19s (0.12 min)
  Percentage: 1.3%

SPEAKER_02:
  Time: 292.75s (4.88 min)
  Percentage: 53.8%

TURN-TAKING ANALYSIS

Number of speaking turns per speaker:
  SPEAKER_02: 34 turns
  SPEAKER_00: 12 turns
  SPEAKER_01: 8 turns
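
Counting rows in the diarization table counts segments, so one uninterrupted stretch that the model split into several segments is counted several times. A sketch of an alternative: merge consecutive segments from the same speaker into a single turn, then count turns that begin before the previous speaker's turn has ended as a rough, overlap-based interruption signal. The helper names and sample values are hypothetical, and what counts as an interruption is a modeling choice.

```python
def merge_turns(segments):
    """Merge consecutive segments by the same speaker into single turns.

    segments: list of (speaker, start, end), sorted by start time.
    """
    turns = []
    for speaker, start, end in segments:
        if turns and turns[-1][0] == speaker:
            prev_speaker, prev_start, prev_end = turns[-1]
            turns[-1] = (prev_speaker, prev_start, max(prev_end, end))
        else:
            turns.append((speaker, start, end))
    return turns


def count_overlapping_changes(turns):
    """Count, per speaker, turns that start before the previous turn ends."""
    counts = {}
    for (_, _, prev_end), (speaker, start, _) in zip(turns, turns[1:]):
        if start < prev_end:
            counts[speaker] = counts.get(speaker, 0) + 1
    return counts


segments = [
    ("SPEAKER_00", 0.0, 2.0),
    ("SPEAKER_00", 2.2, 5.0),   # same speaker continues: one turn
    ("SPEAKER_02", 4.8, 9.0),   # starts before SPEAKER_00 finishes
    ("SPEAKER_00", 9.5, 12.0),
]

turns = merge_turns(segments)
print(len(turns), "turns:", turns)
print("Overlapping starts:", count_overlapping_changes(turns))
```

With the notebook's data, the input would be the rows of `debate_diarize_df` converted to `(speaker, start, end)` tuples and sorted by start time.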
