# 🎙️ Podcast Transcription Pipeline

This notebook provides a complete workflow for transcribing WhatsApp audio messages with Whisper, an open-source speech recognition model released by OpenAI. It reads `.opus` files exported from WhatsApp, transcribes each one, labels it with a speaker name looked up in a CSV file, and writes both individual transcriptions and a combined transcript.


In [37]:
from pathlib import Path
import whisper
import pandas as pd
from tqdm.notebook import tqdm
import torch

In [None]:
from dotenv import load_dotenv
import os
load_dotenv()

## ⚙️ Configuration Setup

This cell defines the central configuration object that controls all aspects of the transcription pipeline.

**Configuration Parameters**:
- `audio_dir`: Directory containing source `.opus` files
- `output_dir`: Directory for individual transcription outputs
- `speakers_csv`: CSV file mapping file IDs to speaker names
- `whisper_model`: Whisper model size (affects accuracy vs. speed)
- `language`: Target language for transcription
- `device`: Processing device (CPU/GPU/MPS)

**Model Options**:
- `tiny`: Fastest, least accurate (39M parameters)
- `base`: Good balance for most use cases (74M parameters)
- `small`: Better accuracy, moderate speed (244M parameters)
- `medium`: High accuracy, slower (769M parameters) ⭐ **Recommended**
- `large`: Best accuracy, slowest (1550M parameters)

In [38]:
CONFIG = {
    "audio_dir": Path("whatsapp"),
    "output_dir": Path("txt"),
    "speakers_csv": Path("speakers.csv"),
    "whisper_model": "medium",   # tiny / base / small / medium / large
    "language": "pt",
    "device": "cpu",
}
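
The `device` entry above is hard-coded to `"cpu"`. As a minimal sketch, it could be chosen automatically, assuming PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks. The helper name `pick_device` is hypothetical and not part of this notebook, and how well Whisper runs on MPS is not confirmed here, so treat that branch as something to test on your machine.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    cuda_ok = torch.cuda.is_available()
    # torch.backends.mps only exists in newer PyTorch builds, hence the getattr guard
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend and mps_backend.is_available())
except ImportError:
    # PyTorch is not installed: fall back to CPU
    cuda_ok = mps_ok = False

detected_device = pick_device(cuda_ok, mps_ok)
print(f"Detected device: {detected_device}")
```

The result could then be assigned with `CONFIG["device"] = detected_device` before the model is loaded.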

## 🤖 Whisper Model Loading

This cell loads the specified Whisper model and displays system information.

**What happens**:
1. Downloads and caches the model (first run only)
2. Loads model to specified device (CPU/GPU/MPS)
3. Displays model statistics and device information

**Performance Notes**:
- First run may take several minutes to download the model
- Subsequent runs are much faster
- GPU acceleration significantly improves transcription speed
- MPS (Apple Silicon) provides good performance on Mac

In [13]:
model = whisper.load_model(CONFIG["whisper_model"], device=CONFIG["device"])

device_in_use = next(model.parameters()).device            # Effective GPU/CPU/MPS
total_params = sum(p.numel() for p in model.parameters())  # number of parameters

print(f"✅  Model '{CONFIG['whisper_model']}' loaded successfully!")
print(f"    • Device: {device_in_use}")
print(f"    • Parameters: {total_params/1e6:.1f} M")

✅  Model 'medium' loaded successfully!
    • Device: cpu
    • Parameters: 762.3 M


## 👥 Speaker Configuration

This cell loads the speaker mapping from the CSV file and creates a lookup dictionary.

**CSV Format** (no header row, because the loading cell reads the file with `header=None`; the columns are file ID and speaker name):
```csv
00000002-AUDIO-2025-06-08-12-31-32,LINA
00000004-AUDIO-2025-06-08-15-43-08,EDU
```

**Speaker Mapping**:
- File IDs must match the `.opus` filenames (without extension)
- Speaker names are used in output filenames and combined transcript
- Unknown speakers are labeled as "UNKNOWN"
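
The lookup rules above can be sketched with the standard library alone. This assumes a headerless two-column file, as the `header=None` argument in the loading cell implies. The helper names `load_speaker_map` and `find_unmapped` are hypothetical, and the second file ID in `stems` is made up to show the `UNKNOWN` case.

```python
import csv
import io

# Sample rows in the two-column layout (file ID, speaker name), no header row
SAMPLE_CSV = """00000002-AUDIO-2025-06-08-12-31-32,LINA
00000004-AUDIO-2025-06-08-15-43-08,EDU
"""


def load_speaker_map(handle) -> dict:
    """Build {file_id: speaker} from a headerless two-column CSV."""
    return {
        row[0].strip(): row[1].strip()
        for row in csv.reader(handle)
        if len(row) >= 2
    }


def find_unmapped(file_ids, speaker_map) -> list:
    """Return the file IDs that would be labelled UNKNOWN."""
    return [fid for fid in file_ids if fid not in speaker_map]


speaker_map = load_speaker_map(io.StringIO(SAMPLE_CSV))
stems = [
    "00000002-AUDIO-2025-06-08-12-31-32",
    "00000099-AUDIO-2025-06-20-10-00-00",  # hypothetical ID missing from the CSV
]
print(find_unmapped(stems, speaker_map))
```

Running a check like this before transcription makes it easier to spot typos in the CSV before `UNKNOWN` ends up in output filenames.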


In [None]:
df_speakers = pd.read_csv(CONFIG["speakers_csv"], header=None, names=["id", "speaker"])
id_to_speaker = dict(df_speakers.values)

print(f"✅  {len(id_to_speaker)} speakers loaded from file {CONFIG['speakers_csv']}:")
print(df_speakers.head().to_string(index=False))

✅  18 speakers loaded from file speakers.csv:
                                id speaker
00000002-AUDIO-2025-06-08-12-31-32    LINA
00000004-AUDIO-2025-06-08-15-43-08     EDU
00000005-AUDIO-2025-06-08-15-51-09     EDU
00000006-AUDIO-2025-06-10-08-26-52    LINA
00000007-AUDIO-2025-06-10-08-33-44    LINA


## 📁 Directory Setup and File Discovery

This cell prepares the output directory and discovers audio files for processing.

**Actions**:
1. Creates output directory if it doesn't exist
2. Scans for `.opus` files in the audio directory
3. Displays file count and output path

**File Discovery**:
- Only processes `.opus` files (WhatsApp's audio format)
- Files are sorted alphabetically, which matches chronological order here because each filename starts with a sequence number followed by a timestamp
- Skips non-audio files automatically
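
The claim that alphabetical order equals chronological order can be checked rather than assumed. The sketch below parses the timestamp embedded in each filename stem and compares the two orderings. The helper names `parse_timestamp` and `is_chronological` are hypothetical, and the pattern assumes the `YYYY-MM-DD-HH-MM-SS` layout seen in this project's filenames.

```python
import re
from datetime import datetime

# Timestamp layout assumed from filenames such as 00000002-AUDIO-2025-06-08-12-31-32
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})")


def parse_timestamp(stem: str) -> datetime:
    """Extract the recording timestamp from a filename stem."""
    match = STAMP.search(stem)
    if match is None:
        raise ValueError(f"no timestamp found in {stem!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M-%S")


def is_chronological(stems) -> bool:
    """True if sorting by name yields the same order as sorting by timestamp."""
    stamps = [parse_timestamp(s) for s in sorted(stems)]
    return stamps == sorted(stamps)


sample = [
    "00000004-AUDIO-2025-06-08-15-43-08",
    "00000002-AUDIO-2025-06-08-12-31-32",
    "00000005-AUDIO-2025-06-08-15-51-09",
]
print(is_chronological(sample))
```

With real data, `is_chronological([p.stem for p in audio_files])` would run the same check on the discovered files.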

In [19]:
CONFIG["output_dir"].mkdir(exist_ok=True, parents=True)
print(f"📂  Output: {CONFIG['output_dir'].resolve()}")

audio_files = sorted(CONFIG["audio_dir"].glob("*.opus"))
print(f"🔎  {len(audio_files)} files found")

📂  Output: /Users/linalopes/Desktop/CinV-podcast/txt
🔎  18 files found


## 🎤 Transcription Processing

This is the main transcription loop that processes all audio files.

**Processing Features**:
- **Resume Capability**: Skips files that already have transcriptions
- **Progress Tracking**: Shows progress bar and completion status
- **Speaker Labeling**: Automatically labels each transcription with speaker name
- **Error Handling**: Continues processing even if individual files fail

**Output Format**:
- Individual files: `{file_id}_{speaker}.txt`
- UTF-8 encoding for proper character support
- Cleaned text (stripped whitespace)

**Performance Tips**:
- Use GPU for faster processing
- Larger models provide better accuracy
- Processing time depends on audio length and model size

In [21]:
for idx, audio_path in enumerate(tqdm(audio_files, desc="Transcribing"), 1):
    file_id = audio_path.stem
    speaker = id_to_speaker.get(file_id, "UNKNOWN")

    # skip if txt already exists (useful for resuming)
    out_path = CONFIG["output_dir"] / f"{file_id}_{speaker}.txt"
    if out_path.exists():
        print(f"⏩  {out_path.name} already exists — skipping")
        continue

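    try:
        result = model.transcribe(str(audio_path),
                                  language=CONFIG["language"],
                                  fp16=False)
    except Exception as exc:
        # keep going so that one bad file does not abort the whole batch
        print(f"⚠️  {audio_path.name} failed ({exc}); skipping")
        continue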

    out_path.write_text(result["text"].strip(), encoding="utf-8")
    print(f"✔️  {idx}/{len(audio_files)} saved → {out_path.name}")

Transcribing:   0%|          | 0/18 [00:00<?, ?it/s]

✔️  1/18 saved → 00000002-AUDIO-2025-06-08-12-31-32_LINA.txt
✔️  2/18 saved → 00000004-AUDIO-2025-06-08-15-43-08_EDU.txt
✔️  3/18 saved → 00000005-AUDIO-2025-06-08-15-51-09_EDU.txt
✔️  4/18 saved → 00000006-AUDIO-2025-06-10-08-26-52_LINA.txt
✔️  5/18 saved → 00000007-AUDIO-2025-06-10-08-33-44_LINA.txt
✔️  6/18 saved → 00000008-AUDIO-2025-06-13-19-05-39_EDU.txt
✔️  7/18 saved → 00000009-AUDIO-2025-06-15-23-11-51_EDU.txt
✔️  8/18 saved → 00000010-AUDIO-2025-06-15-23-25-25_EDU.txt
✔️  9/18 saved → 00000011-AUDIO-2025-06-15-23-50-25_EDU.txt
✔️  10/18 saved → 00000012-AUDIO-2025-06-16-14-44-58_LINA.txt
✔️  11/18 saved → 00000013-AUDIO-2025-06-16-14-54-32_LINA.txt
✔️  12/18 saved → 00000014-AUDIO-2025-06-16-15-01-21_LINA.txt
✔️  13/18 saved → 00000015-AUDIO-2025-06-16-15-05-15_LINA.txt
✔️  14/18 saved → 00000016-AUDIO-2025-06-16-16-00-18_EDU.txt
✔️  15/18 saved → 00000017-AUDIO-2025-06-16-16-07-04_EDU.txt
✔️  16/18 saved → 00000018-AUDIO-2025-06-16-16-09-24_EDU.txt
✔️  17/18 saved → 00000019

## 📄 Combined Transcript Generation

This cell creates a single combined transcript file from all individual transcriptions.

**Combination Process**:
1. Reads all individual transcription files
2. Sorts them alphabetically (chronological order)
3. Formats with speaker labels and proper spacing
4. Writes to `combined_transcript.txt`

**Output Format**:
```
LINA:
[transcription text]

EDU:
[transcription text]

LINA:
[transcription text]
...
```

**Use Cases**:
- Easy reading of complete conversation
- Import into editing software
- Archive and sharing purposes
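
The combination process described above can be sketched as a pure function over in-memory text, which makes the formatting easy to verify without touching the file system. The names `speaker_from_stem` and `combine` and the demo texts are hypothetical.

```python
def speaker_from_stem(stem: str) -> str:
    """Take the part after the last underscore, e.g. '..._LINA' -> 'LINA'."""
    return stem.rsplit("_", 1)[-1]


def combine(segments: dict) -> str:
    """Join transcriptions in filename order, one labelled block per file.

    `segments` maps a file stem (without .txt) to its transcription text.
    """
    blocks = []
    for stem in sorted(segments):
        text = segments[stem].strip()
        blocks.append(f"{speaker_from_stem(stem)}:\n{text}\n")
    # joining with "\n" leaves exactly one blank line between blocks
    return "\n".join(blocks)


demo = {
    "00000004-AUDIO-2025-06-08-15-43-08_EDU": "second message  ",
    "00000002-AUDIO-2025-06-08-12-31-32_LINA": "  first message",
}
print(combine(demo))
```

Because the speaker is taken from the text after the last underscore, a speaker name that itself contains an underscore would be cut short. Plain names such as `LINA` and `EDU` avoid the problem.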


In [22]:
# CONCATENATE INTO A SINGLE TEXT FILE
TXT_DIR   = CONFIG["output_dir"]
BIG_FILE  = Path("combined_transcript.txt")

parts = []

for txt_path in sorted(TXT_DIR.glob("*.txt")):          # alphabetical = chronological order
    speaker = txt_path.stem.rsplit("_", 1)[-1]            # get 'LINA' or 'EDU'
    snippet  = txt_path.read_text(encoding="utf-8").strip()
    parts.append(f"{speaker}:\n{snippet}\n")            # 1 blank line ↴

BIG_FILE.write_text("\n".join(parts), encoding="utf-8")

print(f"✅  Final file created: {BIG_FILE.resolve()}")
print(f"   Total segments: {len(parts)}")
print("   Preview ↓↓↓\n")
print("\n".join(parts[:2]) + "…")                      # show first 2 blocks


✅  Final file created: /Users/linalopes/Desktop/CinV-podcast/combined_transcript.txt
   Total segments: 18
   Preview ↓↓↓

LINA:
Padilha, Xuxu, eu tô pensando naquele termo que o Pedro comentou, né, o Pedro Fonseca comentou sobre criatividade sintética. E aí pode ser que eu esteja muito contaminada, porque eu tô indo agora para o encontro de biologia sintética e machine learning. Pode ser que eu esteja muito desanimada com o termo na cabeça, mas quando eu penso em sintética, eu me remeto à biologia sintética de alguma maneira, sabe, essa coisa de sintetizar vida no laboratório. E aí eu fico me perguntando o que ele quer dizer com criatividade sintética, né, uma criatividade sintetizada em laboratório, né, e é curioso, né, porque biologia sintética é uma vida sintetizada no laboratório, daí a gente tem várias opções, né, tipo assim, disseminação em vítrio é uma vida sintetizada no laboratório, né, de certa maneira você tem uma manipulação ali da coisa da vida, né, você, enfim, e aí quan

## 🎧 Audio Format Conversion

This optional section converts `.opus` files to `.wav` format for use with external tools.

**Conversion Details**:
- **Format**: `.opus` → `.wav`
- **Channels**: Mono (single channel)
- **Sample Rate**: 48 kHz (high quality)
- **Tool**: FFmpeg (command-line audio processing)

**Common Use Cases**:
- **Descript**: Professional audio editing software
- **Audacity**: Free audio editing
- **Adobe Audition**: Professional audio editing
- **Other tools**: Any software that doesn't support `.opus`

**Requirements**:
- FFmpeg must be installed on your system
- Available on macOS via Homebrew: `brew install ffmpeg`
- Available on Ubuntu/Debian: `sudo apt install ffmpeg`
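
Before running the conversion, it may help to confirm that the command-line tools are reachable. The sketch below uses `shutil.which` from the standard library. The helper name `missing_tools` is hypothetical, and `ffprobe` is included only because the metadata cell later in this notebook relies on it.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")) -> list:
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe are available")
```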

In [23]:
# %% 🎧 CONVERT .opus → WAV (48 kHz mono) for Descript
import subprocess, shlex

OPUS_DIR = CONFIG["audio_dir"]          # Whatsapp/.opus
WAV_DIR  = Path("wav48")                # new folder
WAV_DIR.mkdir(exist_ok=True)

print(f"📂  Looking for .opus files in: {OPUS_DIR.resolve()}")
opus_files = sorted(OPUS_DIR.glob("*.opus"))
print(f"🔎  {len(opus_files)} files found")

for opus_path in tqdm(opus_files, desc="Converting"):
    wav_path = WAV_DIR / f"{opus_path.stem}.wav"
    if wav_path.exists():
        print(f"⏩  {wav_path.name} already exists — skipping")
        continue

    # FFmpeg command: .opus → WAV mono 48 kHz
    # (arguments passed as a list, so filenames need no shell-style quoting)
    cmd = [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", str(opus_path), "-ac", "1", "-ar", "48000", str(wav_path),
    ]
    subprocess.run(cmd, check=True)
    print(f"✔️  {wav_path.name} created")

print(f"✅  Conversion complete. WAVs in: {WAV_DIR.resolve()}")


📂  Looking for .opus files in: /Users/linalopes/Desktop/CinV-podcast/whatsapp
🔎  18 files found


Converting:   0%|          | 0/18 [00:00<?, ?it/s]

✔️  00000002-AUDIO-2025-06-08-12-31-32.wav created
✔️  00000004-AUDIO-2025-06-08-15-43-08.wav created
✔️  00000005-AUDIO-2025-06-08-15-51-09.wav created
✔️  00000006-AUDIO-2025-06-10-08-26-52.wav created
✔️  00000007-AUDIO-2025-06-10-08-33-44.wav created
✔️  00000008-AUDIO-2025-06-13-19-05-39.wav created
✔️  00000009-AUDIO-2025-06-15-23-11-51.wav created
✔️  00000010-AUDIO-2025-06-15-23-25-25.wav created
✔️  00000011-AUDIO-2025-06-15-23-50-25.wav created
✔️  00000012-AUDIO-2025-06-16-14-44-58.wav created
✔️  00000013-AUDIO-2025-06-16-14-54-32.wav created
✔️  00000014-AUDIO-2025-06-16-15-01-21.wav created
✔️  00000015-AUDIO-2025-06-16-15-05-15.wav created
✔️  00000016-AUDIO-2025-06-16-16-00-18.wav created
✔️  00000017-AUDIO-2025-06-16-16-07-04.wav created
✔️  00000018-AUDIO-2025-06-16-16-09-24.wav created
✔️  00000019-AUDIO-2025-06-16-16-15-54.wav created
✔️  00000021-AUDIO-2025-06-16-16-20-03.wav created
✅  Conversion complete. WAVs in: /Users/linalopes/Desktop/CinV-podcast/wav48


# 🎧 AI-VOICE SANDBOX  
*Companion notebook for the **Creativity in Vitro** project*  
These notes document the two TTS engines used for the third persona-AI (Provocateur, Scenario-Maker, Pinocchio).

---

### Quick prerequisites  
| Library / tool | Minimum version | Purpose |
|----------------|-----------------|---------|
| **openai**     | ≥ 1.14 | Access to speech models **tts-1** and **tts-1-hd** |
| **elevenlabs** | ≥ 2.0  | Voice cloning and TTS with custom parameters |
| **ipywidgets** | optional | Inline audio player inside Jupyter |
| **FFmpeg** CLI | — | Re-encoding or PCM ↔ WAV conversion if needed |

---

## OpenAI TTS (`audio.speech.create`) — parameters that matter  
| Parameter | Allowed values | What it controls |
|-----------|----------------|------------------|
| `model` | `tts-1` · `tts-1-hd` | HD = slower, slightly clearer |
| `voice` | `alloy`, `echo`, `fable`, `nova`, `onyx`, `shimmer` | Base timbre & prosody |
| `speed` | `0.25 – 4.0` | 1 = normal; >1 = faster |
| `response_format` | `wav`, `mp3`, `aac`, `flac`, `opus`, `pcm` | Output container / codec |

*No direct pitch or style knobs; add colour later in a DAW or in Descript with EQ, compression, reverb, etc.*
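
The allowed values in the table above can be checked locally before a request is sent, so that a typo does not cost an API round trip. This is a sketch only: the helper `validate_tts_params` is hypothetical, and the value sets are copied from the table rather than queried from the API, so they may lag behind the service.

```python
ALLOWED_MODELS = {"tts-1", "tts-1-hd"}
ALLOWED_VOICES = {"alloy", "echo", "fable", "nova", "onyx", "shimmer"}
ALLOWED_FORMATS = {"wav", "mp3", "aac", "flac", "opus", "pcm"}


def validate_tts_params(model: str, voice: str, speed: float, response_format: str) -> dict:
    """Return the parameters as a dict, or raise ValueError on an invalid value."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if voice not in ALLOWED_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError(f"speed must be between 0.25 and 4.0, got {speed}")
    return {
        "model": model,
        "voice": voice,
        "speed": speed,
        "response_format": response_format,
    }


params = validate_tts_params("tts-1-hd", "nova", 1.10, "wav")
print(params)
```

The returned dict could be unpacked into the request, for example `client.audio.speech.create(input=text, **params)`.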


In [14]:
# %% 🎙️  Quick test - "Provocative" voice OpenAI (SDK >= 1.14)
import os, io, IPython.display as dsp
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

TEXT_SNIPPET = """
Lina, have you noticed that calling something 'synthetic' is almost like giving it
a seal of inferiority? What if synthesis is exactly where chance lives?
"""

response = client.audio.speech.create(
    model="tts-1-hd",        # or "tts-1"
    voice="nova",
    input=TEXT_SNIPPET,
    response_format="wav",
    speed=1.10
)

# — option A: save via helper —
out_path = Path("provocative_nova_110.wav")
response.write_to_file(out_path)          # saves directly (stream_to_file is deprecated in the SDK)
print("✅ WAV saved to", out_path.resolve())

# — option B: if you prefer bytes —
# audio_bytes = response.content
# out_path.write_bytes(audio_bytes)

# plays inline in Jupyter
dsp.Audio(out_path)


✅ WAV saved to /Users/linalopes/Desktop/CinV-podcast/provocative_nova_110.wav



## ElevenLabs TTS (`text_to_speech.convert`) — adjustable elements  
| Field | Range / choices | Effect on the voice |
|-------|-----------------|---------------------|
| `voice_id` | Any existing ID or one created with **Voice Design** | Exact timbre (cloned or generated) |
| `model_id` | `eleven_multilingual_v2` · `eleven_flash_v2_5` · `eleven_turbo_v2_5` | Quality ↔ latency ↔ cost |
| `output_format` | `mp3_44100_128`, `pcm_48000`, `opus_48000_96`, … | Codec and bit-rate |
| **Voice Settings** | — | — |
| &nbsp;&nbsp;`stability` | 0 – 1 | 0 = lively, 1 = monotone |
| &nbsp;&nbsp;`similarity_boost` | 0 – 1 | Faithfulness to reference timbre |
| &nbsp;&nbsp;`style` | 0 – 1 | Emotional intensity |
| &nbsp;&nbsp;`use_speaker_boost` | true / false | Extra loudness & presence |

---

### Minimal production workflow

1. **Pick or create** a `voice_id` in ElevenLabs; choose a built-in `voice` for OpenAI.  
2. Supply the text snippets for each AI persona.  
3. Adjust key parameters:  
   *OpenAI* → primarily **speed** and **voice**.  
   *ElevenLabs* → **stability**, **style**, **similarity_boost**.  
4. Generate the audio files, listen, iterate until the tone fits.  
5. Import final clips into Descript or a DAW for any last-minute EQ, reverb or compression.

---

#### Starter presets for this project
* **Provocateur — ElevenLabs**: stability 0.30, style 0.15, similarity_boost ≈ 0.55, model `eleven_multilingual_v2`, output `mp3_44100_128`.  
* **Provocateur — OpenAI**: voice `echo`, speed 1.15, response_format `wav`.

These settings give a firm, slightly ironic delivery distinct from Lina and Eduardo while remaining clear to the listener.
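
One way to keep these presets in code is a small frozen dataclass that also enforces the 0 to 1 range of the ElevenLabs voice settings. This is a sketch: the class `ElevenLabsPreset` and the constant names are hypothetical, while the numbers are the starter values listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElevenLabsPreset:
    stability: float
    style: float
    similarity_boost: float
    model_id: str
    output_format: str

    def __post_init__(self):
        # the three voice settings are documented above as ranging from 0 to 1
        for name in ("stability", "style", "similarity_boost"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [0, 1], got {value}")

    def voice_settings_kwargs(self) -> dict:
        """Keyword arguments for the voice settings part of the request."""
        return {
            "stability": self.stability,
            "style": self.style,
            "similarity_boost": self.similarity_boost,
        }


PROVOCATEUR_ELEVENLABS = ElevenLabsPreset(
    stability=0.30,
    style=0.15,
    similarity_boost=0.55,
    model_id="eleven_multilingual_v2",
    output_format="mp3_44100_128",
)
PROVOCATEUR_OPENAI = {"voice": "echo", "speed": 1.15, "response_format": "wav"}

print(PROVOCATEUR_ELEVENLABS.voice_settings_kwargs())
```

The dict from `voice_settings_kwargs()` is meant to be unpacked into `VoiceSettings(**...)`, assuming that class accepts these keyword names as it does in the ElevenLabs cell of this notebook.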


In [None]:
# %% 🎙️  ElevenLabs - "Provocative" voice (API June 2025)

import os
from pathlib import Path
from elevenlabs import (
    ElevenLabs,
    VoiceSettings,   # ⬅️  container object for tweaks
    save, play
)

client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

TXT = """
Lina, have you noticed that calling something 'synthetic' is almost like giving it
a seal of inferiority? What if synthesis is exactly where chance lives?
"""

VOICE_ID = "ycxdm1PRMs962FxyyuJ0"          # replace with ID that exists in your account
MODEL    = "eleven_multilingual_v2"    # or "eleven_flash_v2_5"
FORMAT   = "mp3_44100_128"             # valid format

settings = VoiceSettings(
    stability        = 0.90,   # 0 = more variation, 1 = robotic
    similarity_boost = 0.55,   # how close to original timbre
    style            = 0.85,   # 0 neutral → 1 dramatic
    use_speaker_boost=True
)

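# convert() returns an iterator of byte chunks; join them once so the same
# audio can be both saved and played (save() would exhaust a bare iterator)
audio = b"".join(client.text_to_speech.convert(
    text            = TXT,
    voice_id        = VOICE_ID,
    model_id        = MODEL,
    output_format   = FORMAT,
    voice_settings  = settings
))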

out_path = Path("provocative_otto_090_055_085_11labs.mp3")
save(audio, out_path)               # saves MP3
print("✅ Saved to", out_path)

play(audio)                         # plays through ffplay by default; pass notebook=True for an inline Jupyter player


✅ Saved to provocative_otto_090_055_085_11labs.mp3


## 📊 Metadata Extraction and Speaker Mapping

This cell extracts metadata from WhatsApp audio files (.opus) and maps speakers based on a CSV file:

- Reads speaker mappings from speakers.csv (ID → speaker name)
- Processes all .opus files in the configured audio directory
- For each file, extracts:
  - Duration, bitrate, sample rate using ffprobe
  - Speaker name from the mapping
  - Date and time from filename
- Creates a DataFrame with file metadata and speaker info

Dependencies:
- ffprobe (part of ffmpeg)
- pandas for data handling
- speakers.csv with format: file_id,speaker_name


In [41]:
# %% 📊  Metadata extractor that looks up the speaker in speakers.csv
from pathlib import Path
import subprocess, json, pandas as pd
import re, sys

AUDIO_DIR   = CONFIG["audio_dir"]        # e.g.  Path("whatsapp")
OUT_CSV     = Path("whatsapp_metadata.csv")
SPEAK_CSV   = CONFIG["speakers_csv"]     # e.g.  Path("speakers.csv")

# --- load id → speaker map -------------------------------------------------
df_map = pd.read_csv(SPEAK_CSV, header=None, names=["id", "speaker"], dtype={"id":"string"})
ID2SPK = dict(df_map.values)

rows = []
opus_files = sorted(AUDIO_DIR.glob("*.opus"))
if not opus_files:
    sys.exit(f"⚠️  No .opus files found in {AUDIO_DIR.resolve()}")

for f in opus_files:
    try:
        probe = subprocess.check_output(
            ["ffprobe", "-v", "quiet", "-print_format", "json",
             "-show_format", "-show_streams", str(f)]
        )
        meta   = json.loads(probe)
        fmt    = meta.get("format", {})
        stream = (meta.get("streams") or [{}])[0]   # tolerate an empty stream list

        duration = float(fmt.get("duration") or stream.get("duration", 0))
        bitrate  = int(fmt.get("bit_rate", 0)) / 1000     # kbps
        sr       = int(stream.get("sample_rate", 0))      # Hz

        # ---- derive id & speaker ------------------------------------------
        file_id = f.stem
        speaker = ID2SPK.get(file_id, "UNKNOWN")

        m = re.search(r"(\d{4}-\d{2}-\d{2})-(\d{2}-\d{2}-\d{2})", f.stem)
        date = m.group(1) if m else ""
        time = m.group(2).replace("-", ":") if m else ""

        rows.append({
            "file"     : f.name,
            "id"       : file_id,
            "speaker"  : speaker,
            "date"     : date,
            "time"     : time,
            "duration" : duration,              # seconds
            "size_MB"  : f.stat().st_size / 1_048_576,
            "bitrate"  : bitrate,
            "sr_Hz"    : sr
        })
    except subprocess.CalledProcessError:
        print(f"⚠️  ffprobe failed on {f.name}; skipping")

df = pd.DataFrame(rows)
df.to_csv(OUT_CSV, index=False)
print("✅  CSV written to:", OUT_CSV.resolve())

# --- summary ---------------------------------------------------------------
tot_sec = df["duration"].sum()
by_spk  = df.groupby("speaker")["duration"].sum().round(1)
longest = df.loc[df["duration"].idxmax()]
shortest= df.loc[df["duration"].idxmin()]

print("\n--- SUMMARY ---------------------------")
print("Files analysed :", len(df))
print("Total duration :", f"{tot_sec/60:.1f} min  ({tot_sec/3600:.2f} h)")
print("Minutes per speaker:")
for spk, dur in by_spk.items():
    print(f"  · {spk:<7}: {dur/60:.1f}")
print("Longest clip   :", longest['file'], f"{longest['duration']:.1f}s")
print("Shortest clip  :", shortest['file'], f"{shortest['duration']:.1f}s")

df.head()


✅  CSV written to: /Users/linalopes/Desktop/CinV-podcast/whatsapp_metadata.csv

--- SUMMARY ---------------------------
Files analysed : 18
Total duration : 113.6 min  (1.89 h)
Minutes per speaker:
  · EDU    : 70.3
  · LINA   : 43.3
Longest clip   : 00000004-AUDIO-2025-06-08-15-43-08.opus 565.9s
Shortest clip  : 00000018-AUDIO-2025-06-16-16-09-24.opus 136.8s


|   | file | id | speaker | date | time | duration | size_MB | bitrate | sr_Hz |
|---|------|----|---------|------|------|----------|---------|---------|-------|
| 0 | 00000002-AUDIO-2025-06-08-12-31-32.opus | 00000002-AUDIO-2025-06-08-12-31-32 | LINA | 2025-06-08 | 12:31:32 | 301.5665 | 0.617486 | 17.176 | 48000 |
| 1 | 00000004-AUDIO-2025-06-08-15-43-08.opus | 00000004-AUDIO-2025-06-08-15-43-08 | EDU | 2025-06-08 | 15:43:08 | 565.862167 | 1.242661 | 18.421 | 48000 |
| 2 | 00000005-AUDIO-2025-06-08-15-51-09.opus | 00000005-AUDIO-2025-06-08-15-51-09 | EDU | 2025-06-08 | 15:51:09 | 479.222167 | 1.048515 | 18.353 | 48000 |
| 3 | 00000006-AUDIO-2025-06-10-08-26-52.opus | 00000006-AUDIO-2025-06-10-08-26-52 | LINA | 2025-06-10 | 08:26:52 | 439.9265 | 0.968476 | 18.467 | 48000 |
| 4 | 00000007-AUDIO-2025-06-10-08-33-44.opus | 00000007-AUDIO-2025-06-10-08-33-44 | LINA | 2025-06-10 | 08:33:44 | 409.4865 | 0.879959 | 18.026 | 48000 |
