# Audio Transcription – Peru Audio Clips  
### **GPU-Enhanced, Progress-Aware, Multi-Threaded Pipeline (ElevenLabs Scribe)**

This notebook transcribes Spanish MP3 audio clips into text transcripts using
the **ElevenLabs Scribe v1** Speech-to-Text API, replacing the previous Groq Whisper flow.

---

**Key Enhancements (relative to the Groq version)**

| Area | Upgrade |
|------|---------|
| STT engine | Switched to ElevenLabs Scribe v1 – 99-language, word-timestamped, speaker-diarized output. |
| Diarization | Native `diarize` flag with `num_speakers` hint for up to 32 speakers. |
| Spanish mode | `language_code='spa'` set on every request. |
| Cost ledger | Updated to $0.40/hour ($0.00667/minute) pricing. |
| API client | Uses the **official `elevenlabs` Python SDK**. |
| **Deduplication** | **NEW: Identifier-based deduplication to avoid transcribing the same clips multiple times.** |
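
The per-minute figure in the cost ledger row follows directly from the hourly price. A minimal sketch of that arithmetic is below; `HOURLY_RATE_USD`, `COST_PER_MIN_USD` and `estimate_cost` are illustrative names and not part of the notebook, which stores the rate as `COST_PER_MIN`.

```python
# Illustrative names (assumptions); the notebook itself uses COST_PER_MIN.
HOURLY_RATE_USD = 0.40
COST_PER_MIN_USD = HOURLY_RATE_USD / 60  # roughly 0.0066667


def estimate_cost(duration_minutes: float) -> float:
    """Estimated charge in USD for a clip of the given length."""
    return duration_minutes * COST_PER_MIN_USD


if __name__ == "__main__":
    print(f"per minute: ${COST_PER_MIN_USD:.5f}")
    print(f"15-minute clip: ${estimate_cost(15):.2f}")
```

The estimate is computed from the locally measured clip duration, so it approximates the billed amount rather than reading it back from the API.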


In [1]:
# --- Install dependencies ---------------------------------------------------
import importlib.util, subprocess, sys, os

# Ensure required packages are installed both in Colab and in a local Jupyter environment.
pkgs = [
    'elevenlabs',
    'python-dotenv',
    'requests',
    'pandas',
    'tqdm',
    'psutil',
    'soundfile',
    'librosa'
]

IN_COLAB = importlib.util.find_spec('google.colab') is not None

if IN_COLAB:
    # Colab: use shell magic for speed and readability
    from IPython import get_ipython
    get_ipython().system('pip -q install ' + ' '.join(pkgs))
else:
    # Local: run pip quietly via subprocess (only installs missing packages)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', *pkgs], check=True)

print('✅ Dependencies installed/verified')


✅ Dependencies installed/verified


In [2]:
# --- Environment & GPU detection -------------------------------------------
import importlib.util, subprocess, os, sys, json, time, hashlib, psutil
from pathlib import Path
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock, Semaphore, Event
import re
from collections import defaultdict

IN_COLAB = importlib.util.find_spec("google.colab") is not None
CPU_COUNT = os.cpu_count() or 1
CHUNK_SIZE = 8192

# GPU detection (unchanged helper)
def detect_gpu_and_configure():
    gpu = {'available': False, 'memory_gb': 0, 'acceleration_flag': '', 'max_concurrent': 1}
    try:
        out = subprocess.run(['nvidia-smi','--query-gpu=memory.total','--format=csv,noheader,nounits'],
                             capture_output=True, text=True, timeout=10)
        if out.returncode==0:
            mem_mb=int(out.stdout.strip())
            mem_gb=mem_mb/1024
            gpu.update({'available': True,
                        'memory_gb': mem_gb,
                        'acceleration_flag': '-hwaccel cuda',
                        'max_concurrent': 4 if mem_gb>=16 else 2 if mem_gb>=8 else 1})
            print(f"🚀 NVIDIA GPU: {mem_gb:.1f} GB - {gpu['max_concurrent']} concurrent heavy tasks")
    except Exception as e:
        print("No NVIDIA GPU or nvidia-smi failed:", e)
    if not gpu['available']:
        print("💻 Using CPU-only mode for local tasks")
    return gpu

GPU_INFO = detect_gpu_and_configure()
GPU_SEMAPHORE = Semaphore(GPU_INFO['max_concurrent']) if GPU_INFO['available'] else None
GPU_MONITOR_STOP = Event()

def start_gpu_monitor(interval=60):
    if not GPU_INFO['available']: return None
    def _loop():
        while not GPU_MONITOR_STOP.is_set():
            try:
                out = subprocess.run(['nvidia-smi','--query-gpu=utilization.gpu,memory.used,memory.total','--format=csv,noheader,nounits'],
                                     capture_output=True, text=True, timeout=5)
                if out.returncode==0:
                    util, used, total = map(int,out.stdout.strip().split(','))
                    print(f"[GPU] util {util}%  mem {used}/{total} MB")
            except Exception as e:
                print("[GPU-mon] err",e)
            GPU_MONITOR_STOP.wait(interval)
    import threading, atexit
    t=threading.Thread(target=_loop,daemon=True); t.start()
    atexit.register(GPU_MONITOR_STOP.set)
    return t


No NVIDIA GPU or nvidia-smi failed: [Errno 2] No such file or directory: 'nvidia-smi'
💻 Using CPU-only mode for local tasks


In [3]:
# --- Adaptive worker counts -------------------------------------------------
def configure_workers():
    mem = psutil.virtual_memory().available / 2**30
    base = min(CPU_COUNT*2, 16)
    by_mem = int(mem*0.8)
    total = min(base, by_mem, 12)
    return max(1,total)

MAX_WORKERS = configure_workers()
PROGRESS_SAVE_INTERVAL = 5
print(f"🖥️  CPU {CPU_COUNT}  → workers {MAX_WORKERS}")


🖥️  CPU 2  → workers 4


In [5]:
# --- Drive / paths ----------------------------------------------------------
import os
from pathlib import Path
if IN_COLAB:
    from google.colab import drive as _gd; _gd.mount('/content/drive')
    ROOT = Path('/content/drive/My Drive/world bank/data/Peru')
else:
    # For local runs you can set PERU_DATA_ROOT to point to the project directory; defaults to cwd.
    ROOT = Path(os.getenv('PERU_DATA_ROOT', Path.cwd()))

INPUT_CSV   = ROOT/'evals/formattedData/peru_with_audio_clips.csv'
PROGRESS_CSV= ROOT/'evals/formattedData/peru_transcript_progress.csv'
FINAL_CSV   = ROOT/'evals/formattedData/peru_with_transcripts.csv'
TRANS_DIR   = ROOT/'transcripts/processed'
CACHE_DIR   = ROOT/'transcripts/cache'
COST_LOG    = ROOT/'transcripts/cost_tracking.json'

for p in [TRANS_DIR, CACHE_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("ROOT →",ROOT)


Mounted at /content/drive
ROOT → /content/drive/My Drive/world bank/data/Peru


In [6]:
# --- ElevenLabs API init ----------------------------------------------------
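import os
from dotenv import load_dotenv; load_dotenv()
from elevenlabs import ElevenLabs
import math
from pathlib import Path

# Read the key from Colab secrets when running in Colab, otherwise from the
# environment (or a local .env file loaded above).
if IN_COLAB:
    from google.colab import userdata
    API_KEY = userdata.get('ELEVENLABS_API_KEY')
else:
    API_KEY = os.getenv('ELEVENLABS_API_KEY')
if not API_KEY:
    raise RuntimeError('Set ELEVENLABS_API_KEY (Colab secret or env var)')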

client = ElevenLabs(api_key=API_KEY)
MODEL_ID = 'scribe_v1'
print("Model →", MODEL_ID)

TRANSCRIPTION_PARAMS = dict(
    model_id=MODEL_ID,
    language_code='spa',      # Spanish
    diarize=True,            # enable speaker diarization
    timestamps_granularity='word',
    tag_audio_events=True
)

# ElevenLabs bills $0.40 per hour → $0.0066667 per minute
COST_PER_MIN = 0.0066667


Model → scribe_v1


In [7]:
# --- Utility functions for identifier extraction and deduplication ----------
import re
import pandas as pd  # extract_identifier relies on pd.isna
def extract_identifier(school_clip):
    """Extract the 6-7 digit identifier from School_Clip column."""
    if pd.isna(school_clip) or school_clip == '':
        return None
    # Extract 6-7 digit number at the start
    match = re.match(r'^(\d{6,7})', str(school_clip).strip())
    return match.group(1) if match else None

def copy_transcript_data(df, source_idx, target_indices, transcript_column_prefix):
    """Copy transcript data from source row to target rows."""
    # All possible columns for this transcript type
    columns_to_copy = [
        transcript_column_prefix,  # Base transcript column
        transcript_column_prefix + '_JSON',
        transcript_column_prefix + ' Text',
        transcript_column_prefix + ' Language Code',
        transcript_column_prefix + ' Language Probability',
        transcript_column_prefix + ' Word Count',
        transcript_column_prefix + ' Duration Seconds',
        transcript_column_prefix + ' Speaker Count',
        transcript_column_prefix + ' Has Audio Events',
        transcript_column_prefix + ' First Speaker'
    ]

    for col in columns_to_copy:
        if col in df.columns:
            source_value = df.at[source_idx, col]
            for target_idx in target_indices:
                if target_idx != source_idx:  # Don't copy to itself
                    df.at[target_idx, col] = source_value


In [8]:
# --- Processor class with JSON and column-based storage ---------------------
import librosa, soundfile
import pandas as pd
import json
from tqdm.auto import tqdm
from datetime import datetime
from threading import Lock

class TranscriptionProcessor:
    def __init__(self):
        self.lock = Lock()
        self.stats = dict(processed=0, success=0, fail=0, cached=0,
                          api_calls=0, cost=0.0, minutes=0.0,
                          start=time.time())
        self.progress_cnt=0
        self.load_cost()

    # ---------- cost ledger ----------
    def load_cost(self):
        try:
            if COST_LOG.exists():
                self.stats['cost'] = json.loads(COST_LOG.read_text()).get('total_cost',0)
                print("💰 prev cost =",self.stats['cost'])
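        except (OSError, ValueError, AttributeError) as e:
            # A missing or corrupt ledger should not stop the run; keep the default of 0.
            print("⚠️ could not read cost ledger:", e)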
    def flush_cost(self):
        COST_LOG.write_text(json.dumps({
            'total_cost': self.stats['cost'],
            'updated': datetime.now().isoformat()
        },indent=2))

    # ---------- saving ---------------
    def save_progress(self, df, force=False):
        with self.lock:
            self.progress_cnt += 1
            if force or self.progress_cnt>=PROGRESS_SAVE_INTERVAL:
                df.to_csv(PROGRESS_CSV, index=False)
                self.progress_cnt=0
                print("💾 checkpoint saved", PROGRESS_CSV.name)

    # ---------- cache utils ----------
    def cache_key(self, path:Path):
        s=path.stat()
        return hashlib.md5(f"{path}{s.st_size}{s.st_mtime}".encode()).hexdigest()

    def cache_path(self, path:Path): return CACHE_DIR/f"{self.cache_key(path)}.json"

    def try_cache(self, path:Path):
        cp=self.cache_path(path)
        if cp.exists():
            try:
                return json.loads(cp.read_text())['text']
            except: pass
        return None

    def store_cache(self, path:Path, text:str):
        self.cache_path(path).write_text(json.dumps({'text':text,'ts':time.time()}))

    def duration_min(self, path:Path):
        try:
            print(f"Calculating duration for {path}")
            y,sr=librosa.load(path, sr=None, mono=True)
            duration = len(y)/sr/60
            print(f"Calculated duration: {duration}")
            return duration
        except Exception as e:
            print(f"ERROR during duration calculation: {e}")
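            # Fallback heuristic: treat each MB as about one minute of audio
            # (roughly a 128 kbps MP3), with a floor of 0.1 minutes.
            return max(path.stat().st_size/(1024**2),0.1)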

    # ---------- JSON storage and metadata extraction ----------
    def store_transcript_data(self, df:pd.DataFrame, idx:int, prefix:str, transcript_data:dict):
        # Store full JSON response
        json_col = prefix + '_JSON'
        df.at[idx, json_col] = json.dumps(transcript_data, ensure_ascii=False)
        # Store text and key metadata fields
        df.at[idx, prefix + ' Text'] = transcript_data.get('text', '')
        df.at[idx, prefix + ' Language Code'] = transcript_data.get('language_code', '')
        df.at[idx, prefix + ' Language Probability'] = transcript_data.get('language_probability', 0)
        words = transcript_data.get('words', [])
        df.at[idx, prefix + ' Word Count'] = len([w for w in words if w.get('type') == 'word'])
        df.at[idx, prefix + ' Duration Seconds'] = max([w.get('end', 0) for w in words], default=0)
        df.at[idx, prefix + ' Speaker Count'] = len(set(w.get('speaker_id') for w in words if w.get('speaker_id')))
        df.at[idx, prefix + ' Has Audio Events'] = any(w.get('type') == 'audio_event' for w in words)
        df.at[idx, prefix + ' First Speaker'] = next((w.get('speaker_id') for w in words if w.get('type') == 'word'), None)

    # ---------- core transcription with JSON response ----------
    def transcribe(self, path:Path, retries=0):
        print(f"Processing: {path}")
        print(f"File exists: {path.exists()}")
        if GPU_SEMAPHORE: GPU_SEMAPHORE.acquire()
        try:
            cached_text = self.try_cache(path)
            if cached_text:
                with self.lock: self.stats['cached']+=1
                print("Result from cache")
                # Returning minimal JSON with text only
                return {'text': cached_text, 'language_code':'', 'language_probability':0, 'words':[]}, "cached"

            dur=self.duration_min(path)
            est_cost=dur*COST_PER_MIN
            print("Calling ElevenLabs API...")
            with open(path,'rb') as f:
                resp = client.speech_to_text.convert(file=f, **TRANSCRIPTION_PARAMS)
            # Build transcript_data dict
            text = resp.text.strip() if hasattr(resp,'text') else str(resp)
            transcript_data = {
                'language_code': getattr(resp, 'language_code', ''),
                'language_probability': getattr(resp, 'language_probability', 0),
                'text': text,
                'words': []
            }
            for w in getattr(resp, 'words', []):
                transcript_data['words'].append({
                    'text': getattr(w, 'text', ''),
                    'start': getattr(w, 'start', 0),
                    'end': getattr(w, 'end', 0),
                    'type': getattr(w, 'type', ''),
                    'speaker_id': getattr(w, 'speaker_id', ''),
                    'logprob': getattr(w, 'logprob', None)
                })
            if len(text) < 10:
                print("Transcript too short, raising")
                raise ValueError("too short")
            self.store_cache(path, text)
            with self.lock:
                self.stats.update(success=self.stats['success']+1,
                                  api_calls=self.stats['api_calls']+1,
                                  cost=self.stats['cost']+est_cost,
                                  minutes=self.stats['minutes']+dur)
            print("Transcription success")
            return transcript_data, "ok"
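        except Exception as e:
            print(f"ERROR: {type(e).__name__}: {e}")
            if retries >= 2:
                with self.lock: self.stats['fail'] += 1
                return None, str(e)
        finally:
            if GPU_SEMAPHORE: GPU_SEMAPHORE.release()
        # Only reached after a failed attempt with retries left. The semaphore slot
        # is already released here, so the nested call cannot block on a slot held
        # by its own caller (which would deadlock when max_concurrent is 1).
        print(f"Retrying... attempt {retries+1}")
        time.sleep(2**retries)
        return self.transcribe(path, retries+1)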


In [9]:
# --- Load dataframe and initialize metadata columns ------------------------
import pandas as pd, numpy as np

df = pd.read_csv(INPUT_CSV)
for col in ['First Audio Transcript','Last Audio Transcript']:
    # Ensure base transcript columns exist
    if col not in df.columns: df[col] = ''
    # JSON and metadata columns
    df[col + '_JSON'] = ''
    df[col + ' Text'] = ''
    df[col + ' Language Code'] = ''
    df[col + ' Language Probability'] = 0
    df[col + ' Word Count'] = 0
    df[col + ' Duration Seconds'] = 0
    df[col + ' Speaker Count'] = 0
    df[col + ' Has Audio Events'] = False
    df[col + ' First Speaker'] = None

# merge existing progress / final for base and metadata columns
META_SUFFIXES = ['_JSON', ' Text', ' Language Code', ' Language Probability',
                 ' Word Count', ' Duration Seconds', ' Speaker Count',
                 ' Has Audio Events', ' First Speaker']
EMPTY_VALUES = ['', 0, False]  # placeholders written when the columns were initialized

for p in [PROGRESS_CSV, FINAL_CSV]:
    if p.exists():
        df_old = pd.read_csv(p)
        for col in ['First Audio Transcript','Last Audio Transcript']:
            for full_col in [col] + [col + s for s in META_SUFFIXES]:
                if full_col not in df_old.columns:
                    continue
                old = df_old[full_col].reindex(df.index)
                # Fill only cells that are still empty here and hold a real value in the
                # old file; comparing against '' alone would skip the numeric columns.
                cur_empty = df[full_col].isna() | df[full_col].isin(EMPTY_VALUES)
                old_filled = old.notna() & ~old.isin(EMPTY_VALUES)
                df[full_col] = df[full_col].astype(object).where(~(cur_empty & old_filled), old)

print(f"Rows {len(df)}")

Rows 227


In [10]:
# --- Build job list with deduplication logic (with optional skipping of first 5 faulty rows) -------------------------------
import pandas as pd
import json
from collections import defaultdict

# Flag to control skipping logic; set to False to include all rows again
SKIP_FIRST_N_FAULTY = True
N_TO_SKIP = 5

def has_valid_transcript(row, col_prefix):
    """Check if a row has a valid transcript for the given column prefix."""
    json_col = col_prefix + '_JSON'
    text_col = col_prefix + ' Text'

    # First check if JSON column exists and has meaningful content
    if json_col in row.index:
        json_val = row[json_col]
        # Check for NaN, empty string, or whitespace-only string
        if pd.isna(json_val) or json_val == '' or str(json_val).strip() == '':
            return False

        # Try to parse the JSON to see if it's valid and has content
        try:
            json_data = json.loads(str(json_val))
            # Check if it has actual text content (not just empty JSON structure)
            if 'text' in json_data and json_data['text'] and str(json_data['text']).strip():
                return True
        except (json.JSONDecodeError, TypeError):
            pass

    # Fallback: check text column directly
    if text_col in row.index:
        text_val = row[text_col]
        if not pd.isna(text_val) and text_val != '' and str(text_val).strip():
            # If we have meaningful text (more than just a few characters), consider it valid
            if len(str(text_val).strip()) > 10:
                return True

    return False

# --------------------------------------------------------------------------
# Step 1: Identify the first N rows that have no valid transcripts at all
# --------------------------------------------------------------------------

faulty_indices = []
if SKIP_FIRST_N_FAULTY:
    for idx, row in df.iterrows():
        # A row is "faulty" if it has no valid first‐audio AND no valid last‐audio transcript
        first_ok  = has_valid_transcript(row, 'First Audio Transcript')
        last_ok   = has_valid_transcript(row, 'Last Audio Transcript')
        if not (first_ok or last_ok):
            faulty_indices.append(idx)
            if len(faulty_indices) >= N_TO_SKIP:
                break

# --------------------------------------------------------------------------
# Step 2: Build identifier groups, skipping those faulty rows if the flag is True
# --------------------------------------------------------------------------

identifier_groups = defaultdict(list)
for idx, row in df.iterrows():
    if SKIP_FIRST_N_FAULTY and idx in faulty_indices:
        # Skip this row for now; we'll return to these faulty rows later
        continue

    school_clip = row.get('School_Clip', '')
    identifier = extract_identifier(school_clip)
    if identifier:
        identifier_groups[identifier].append(idx)

print(f"Found {len(identifier_groups)} unique identifiers (excluding {len(faulty_indices)} skipped rows)")
print(f"Total rows (original): {len(df)}")

# --------------------------------------------------------------------------
# Step 3: Build jobs with deduplication (excluding skipped rows)
# --------------------------------------------------------------------------

jobs = []
processed_combinations = set()  # Track (identifier, clip_type) combinations
processor = TranscriptionProcessor()

for identifier, row_indices in identifier_groups.items():
    # Use the first non‐skipped row as representative
    representative_idx = row_indices[0]
    representative_row = df.iloc[representative_idx]
    ident = representative_row.get('School_Clip', f'row_{representative_idx}')

    # Process First Audio Clip if not already handled for this identifier
    first_combination = (identifier, 'first')
    if first_combination not in processed_combinations:
        first_clip = representative_row.get('First Audio Clip', '')
        if (first_clip and str(first_clip).strip() != '' and not pd.isna(first_clip)):
            # Check if ANY row for this identifier already has a valid transcript
            needs_transcription = True
            for row_idx in row_indices:
                if has_valid_transcript(df.iloc[row_idx], 'First Audio Transcript'):
                    needs_transcription = False
                    break

            if needs_transcription:
                jobs.append({
                    'idx': representative_idx,
                    'ref': first_clip,
                    'col': 'First Audio Transcript',
                    'ident': ident,
                    'ctype': 'first',
                    'identifier': identifier,
                    'all_row_indices': row_indices
                })
        processed_combinations.add(first_combination)

    # Process Last Audio Clip if not already handled for this identifier
    last_combination = (identifier, 'last')
    if last_combination not in processed_combinations:
        last_clip = representative_row.get('Last Audio Clip', '')
        if (last_clip and str(last_clip).strip() != '' and not pd.isna(last_clip)):
            # Check if ANY row for this identifier already has a valid transcript
            needs_transcription = True
            for row_idx in row_indices:
                if has_valid_transcript(df.iloc[row_idx], 'Last Audio Transcript'):
                    needs_transcription = False
                    break

            if needs_transcription:
                jobs.append({
                    'idx': representative_idx,
                    'ref': last_clip,
                    'col': 'Last Audio Transcript',
                    'ident': ident,
                    'ctype': 'last',
                    'identifier': identifier,
                    'all_row_indices': row_indices
                })
        processed_combinations.add(last_combination)

print(f"Pending clips {len(jobs)} (after deduplication)")

# --------------------------------------------------------------------------
# Step 4: Debug / inspect the first few jobs
# --------------------------------------------------------------------------

if jobs:
    print("\nFirst 5 jobs to process:")
    for i, job in enumerate(jobs[:5]):
        row = df.iloc[job['idx']]
        print(f"  {i+1}. Identifier {job['identifier']}: {job['ctype']} - {job['ref']}")
        print(f"     Will copy results to {len(job['all_row_indices'])} rows: {job['all_row_indices']}")
        # Show current state of transcript columns for debugging
        json_col = job['col'] + '_JSON'
        text_col = job['col'] + ' Text'
        json_val = row.get(json_col, 'N/A')
        text_val = row.get(text_col, 'N/A')

        # Show first 100 chars of JSON and text to see what's there
        json_preview = str(json_val)[:100] + "..." if len(str(json_val)) > 100 else str(json_val)
        text_preview = str(text_val)[:100] + "..." if len(str(text_val)) > 100 else str(text_val)

        print(f"     JSON preview: {repr(json_preview)}")
        print(f"     Text preview: {repr(text_preview)}")
else:
    print("\n✅ All transcripts appear to be complete!")
    # Show count of completed transcripts for verification
    first_completed = sum(
        1
        for _, row in df.iterrows()
        if row.get('First Audio Clip') and has_valid_transcript(row, 'First Audio Transcript')
    )
    last_completed = sum(
        1
        for _, row in df.iterrows()
        if row.get('Last Audio Clip') and has_valid_transcript(row, 'Last Audio Transcript')
    )
    print(f"  Found {first_completed} completed first audio transcripts")
    print(f"  Found {last_completed} completed last audio transcripts")

# --------------------------------------------------------------------------
# Step 5: Deduplication statistics
# --------------------------------------------------------------------------

original_potential_jobs = 0
for idx, row in df.iterrows():
    # Count potential jobs even for skipped rows (so we can compare against jobs after deduplication)
    if row.get('First Audio Clip') and not has_valid_transcript(row, 'First Audio Transcript'):
        original_potential_jobs += 1
    if row.get('Last Audio Clip') and not has_valid_transcript(row, 'Last Audio Transcript'):
        original_potential_jobs += 1

if original_potential_jobs > len(jobs):
    saved_jobs = original_potential_jobs - len(jobs)
    saved_cost = saved_jobs * 15 * COST_PER_MIN  # Rough estimate: 15 minutes per clip
    print(f"\n💰 Deduplication savings: {saved_jobs} jobs avoided, ~${saved_cost:.2f} cost saved")

# --------------------------------------------------------------------------
# Notes on skipping logic:
# - The `SKIP_FIRST_N_FAULTY` flag and `N_TO_SKIP` constants control whether the first N faulty rows are excluded.
# - If you later want to include all rows (i.e., remove the skipping logic), set `SKIP_FIRST_N_FAULTY = False`.
# - The `faulty_indices` list holds the actual DataFrame indices of the rows that were skipped,
#   so you can come back and process them separately when needed.


Found 121 unique identifiers (excluding 5 skipped rows)
Total rows (original): 227
💰 prev cost = 10.790180862199183
Pending clips 4 (after deduplication)

First 5 jobs to process:
  1. Identifier 393322: last - world bank/data/Peru/audio/processed/393322_Clip_1_last_audio.mp3
     Will copy results to 2 rows: [20, 21]
     JSON preview: 'nan'
     Text preview: 'nan'
  2. Identifier 269860: first - world bank/data/Peru/audio/processed/269860_Clip_1_first_audio.mp3
     Will copy results to 2 rows: [163, 164]
     JSON preview: 'nan'
     Text preview: 'nan'
  3. Identifier 247551: last - world bank/data/Peru/audio/processed/247551_Clip_1_last_audio.mp3
     Will copy results to 1 rows: [212]
     JSON preview: 'nan'
     Text preview: 'nan'
  4. Identifier 1381979: first - world bank/data/Peru/audio/processed/1381979_Clip_1_first_audio.mp3
     Will copy results to 1 rows: [225]
     JSON preview: 'nan'
     Text preview: 'nan'

💰 Deduplication savings: 15 jobs avoided, ~$1.50 cost saved

In [11]:
# --- Run transcription with copying to duplicate rows ----------------------
start = time.time()
if jobs:
    def worker(job):
        if IN_COLAB:
            path = Path('/content/drive/My Drive') / job['ref']
        else:
            path = ROOT / job['ref']

        # Transcribe the audio
        transcript_data, status = processor.transcribe(path)

        if transcript_data:
            # DataFrame writes are not thread-safe, so serialize them with the shared lock
            with processor.lock:
                # Store transcript data in the representative row
                processor.store_transcript_data(df, job['idx'], job['col'], transcript_data)

                # Copy the transcript data to all other rows with the same identifier
                copy_transcript_data(df, job['idx'], job['all_row_indices'], job['col'])

            print(f"✅ Transcribed {job['identifier']} {job['ctype']} and copied to {len(job['all_row_indices'])} rows")

        processor.save_progress(df)
        with processor.lock:
            processor.stats['processed'] += 1
        return job, status

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex, tqdm(total=len(jobs), unit='clip') as bar:
        for fut in as_completed([ex.submit(worker, j) for j in jobs]):
            job, status = fut.result()
            bar.update(1)
            bar.set_postfix(proc=processor.stats['processed'], succ=processor.stats['success'],
                            cost=f"${processor.stats['cost']:.2f}")
    processor.save_progress(df, force=True)
else:
    print("Nothing to do – all transcripts present")


  0%|          | 0/4 [00:00<?, ?clip/s]

Processing: /content/drive/My Drive/world bank/data/Peru/audio/processed/393322_Clip_1_last_audio.mp3
Processing: /content/drive/My Drive/world bank/data/Peru/audio/processed/269860_Clip_1_first_audio.mp3
Processing: /content/drive/My Drive/world bank/data/Peru/audio/processed/247551_Clip_1_last_audio.mp3
Processing: /content/drive/My Drive/world bank/data/Peru/audio/processed/1381979_Clip_1_first_audio.mp3
File exists: True
File exists: True
File exists: True
File exists: True
Calculating duration for /content/drive/My Drive/world bank/data/Peru/audio/processed/269860_Clip_1_first_audio.mp3
Calculating duration for /content/drive/My Drive/world bank/data/Peru/audio/processed/1381979_Clip_1_first_audio.mp3
Calculating duration for /content/drive/My Drive/world bank/data/Peru/audio/processed/393322_Clip_1_last_audio.mp3
Calculating duration for /content/drive/My Drive/world bank/data/Peru/audio/processed/247551_Clip_1_last_audio.mp3
Calculated duration: 14.198857142857143
Calling Eleven

  df.at[idx, prefix + ' Duration Seconds'] = max([w.get('end', 0) for w in words], default=0)


Processing: /content/drive/My Drive/world bank/data/Peru/audio/processed/393322_Clip_1_last_audio.mp3
File exists: True
Calculating duration for /content/drive/My Drive/world bank/data/Peru/audio/processed/393322_Clip_1_last_audio.mp3
Calculated duration: 14.198857142857143
Calling ElevenLabs API...
ERROR: ReadTimeout: The read operation timed out
ERROR: ReadTimeout: The read operation timed out
💾 checkpoint saved peru_transcript_progress.csv


In [12]:
# --- Final verification and save CSV ---------------------------------------
print("\n🔍 Final verification of transcript coverage...")

# Count completed transcripts by identifier
identifier_transcript_counts = defaultdict(lambda: {'first': 0, 'last': 0, 'rows': 0})

for idx, row in df.iterrows():
    school_clip = row.get('School_Clip', '')
    identifier = extract_identifier(school_clip)
    if identifier:
        identifier_transcript_counts[identifier]['rows'] += 1

        if (row.get('First Audio Clip') and
            has_valid_transcript(row, 'First Audio Transcript')):
            identifier_transcript_counts[identifier]['first'] += 1

        if (row.get('Last Audio Clip') and
            has_valid_transcript(row, 'Last Audio Transcript')):
            identifier_transcript_counts[identifier]['last'] += 1

# Verify that for each identifier, all rows have the same transcript data
verification_issues = []
for identifier, counts in identifier_transcript_counts.items():
    if counts['first'] > 0 and counts['first'] != counts['rows']:
        verification_issues.append(f"Identifier {identifier}: {counts['first']}/{counts['rows']} rows have first transcript")
    if counts['last'] > 0 and counts['last'] != counts['rows']:
        verification_issues.append(f"Identifier {identifier}: {counts['last']}/{counts['rows']} rows have last transcript")

if verification_issues:
    print("⚠️  Verification issues found:")
    for issue in verification_issues[:10]:  # Show first 10 issues
        print(f"   {issue}")
    if len(verification_issues) > 10:
        print(f"   ... and {len(verification_issues) - 10} more issues")
else:
    print("✅ Verification passed: All rows with same identifier have consistent transcript data")

# Save final CSV
df.to_csv(FINAL_CSV, index=False)
processor.flush_cost()
print(f"\n✅ Saved {FINAL_CSV.name} | rows {len(df)}")

# Final statistics
total_identifiers = len(identifier_groups)
total_rows = len(df)
completed_first = sum(1 for _, row in df.iterrows()
                     if row.get('First Audio Clip') and has_valid_transcript(row, 'First Audio Transcript'))
completed_last = sum(1 for _, row in df.iterrows()
                    if row.get('Last Audio Clip') and has_valid_transcript(row, 'Last Audio Transcript'))

print(f"\n📊 Final Statistics:")
print(f"   Unique identifiers: {total_identifiers}")
print(f"   Total rows: {total_rows}")
print(f"   Rows with first transcript: {completed_first}")
print(f"   Rows with last transcript: {completed_last}")
print(f"   Total API costs: ${processor.stats['cost']:.2f}")
print(f"   Processing time: {time.time() - start:.1f} seconds")



🔍 Final verification of transcript coverage...
⚠️  Verification issues found:
   Identifier 381855: 1/2 rows have last transcript
   Identifier 829457: 1/2 rows have last transcript
   Identifier 268516: 1/2 rows have first transcript
   Identifier 268516: 1/2 rows have last transcript
   Identifier 269779: 1/2 rows have first transcript

✅ Saved peru_with_transcripts.csv | rows 227

📊 Final Statistics:
   Unique identifiers: 121
   Total rows: 227
   Rows with first transcript: 218
   Rows with last transcript: 218
   Total API costs: $10.95
   Processing time: 240.5 seconds


## 🎉 Done!

This notebook automatically:
1. Detects NVIDIA GPUs and sizes the worker pool from CPU count and available memory.
2. Saves progress every 5 processed clips, successful or not (`peru_transcript_progress.csv`).
3. Resumes from either the **progress** or **final** CSV if re-run.
4. **Uses ElevenLabs Scribe** for high-accuracy Spanish transcripts with speaker diarization.
5. Tracks cumulative ElevenLabs API spend in `cost_tracking.json` at $0.00667 per minute.
6. **NEW: Implements identifier-based deduplication** to avoid transcribing the same clips multiple times.
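
The checkpoint cadence in item 2 can be sketched as a small thread-safe counter. This is an illustration of the pattern used by `save_progress`, not a copy of it; `CheckpointCounter` and `tick` are hypothetical names.

```python
import threading


class CheckpointCounter:
    """Signals that a save is due every `interval` calls (hypothetical helper)."""

    def __init__(self, interval: int = 5):
        self.interval = interval
        self._count = 0
        self._lock = threading.Lock()

    def tick(self, force: bool = False) -> bool:
        """Record one finished clip; return True when a checkpoint is due."""
        with self._lock:
            self._count += 1
            if force or self._count >= self.interval:
                self._count = 0
                return True
            return False


counter = CheckpointCounter(interval=5)
due = [counter.tick() for _ in range(12)]
print(due.count(True))  # checkpoints fall on the 5th and 10th tick
```

Calling `tick(force=True)` after the worker pool finishes mirrors the final `save_progress(df, force=True)` call, so the last few clips are written even when the count has not reached the interval.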

### 🚀 Key Deduplication Features:

- **Identifier Extraction**: Automatically extracts 6-7 digit identifiers from the `School_Clip` column
- **Smart Grouping**: Groups rows by identifier and only processes unique audio clips once
- **Result Copying**: Automatically copies transcript data to all rows sharing the same identifier
- **Cost Savings**: Prevents duplicate API calls; the run above avoided 15 jobs, an estimated $1.50 at 15 minutes per clip
- **Verification**: Validates that all rows with the same identifier have consistent transcript data
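
The extraction and grouping steps above can be sketched with the standard library alone. The regular expression is the one used in `extract_identifier`; `group_by_identifier` is an illustrative helper, and the sample labels are made up, since the real `School_Clip` values are not shown in this notebook.

```python
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"^(\d{6,7})")  # same pattern as extract_identifier


def group_by_identifier(labels):
    """Map each leading 6-7 digit identifier to the row positions that share it."""
    groups = defaultdict(list)
    for position, label in enumerate(labels):
        match = ID_PATTERN.match(str(label).strip())
        if match:
            groups[match.group(1)].append(position)
    return dict(groups)


# Hypothetical School_Clip values, invented for illustration.
labels = ["393322 Clip 1", "393322 Clip 2", "1381979 Clip 1", "no id here"]
groups = group_by_identifier(labels)
jobs = [rows[0] for rows in groups.values()]  # one representative row per identifier
print(groups)
print(jobs)
```

Only the representative row of each group is sent for transcription; the remaining positions in the group receive a copy of its result, which is what `copy_transcript_data` does in the notebook.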

With the updated implementation:
- Full JSON responses are stored per-transcript in dedicated `_JSON` columns;
- Key metadata (text, language, word counts, durations, speaker info) are extracted into separate columns;
- **Deduplication logic prevents wasteful double-transcription of identical clips**;
- Existing workflows remain compatible while enabling more detailed analysis.
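
As one example of the more detailed analysis the `_JSON` columns allow, the sketch below counts spoken words per speaker from a single stored cell. The field names (`words`, `type`, `speaker_id`) match what `transcribe` writes; `words_per_speaker` is a hypothetical helper, and the sample record, including its speaker labels, is invented.

```python
import json
from collections import Counter


def words_per_speaker(json_cell: str) -> Counter:
    """Count spoken words per speaker_id in one stored transcript JSON string."""
    data = json.loads(json_cell)
    return Counter(
        w.get("speaker_id") or "unknown"
        for w in data.get("words", [])
        if w.get("type") == "word"
    )


# Made-up sample in the shape written by store_transcript_data.
sample = json.dumps({
    "text": "Buenos días niños",
    "words": [
        {"text": "Buenos", "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "type": "spacing", "speaker_id": "speaker_0"},
        {"text": "días", "type": "word", "speaker_id": "speaker_0"},
        {"text": "niños", "type": "word", "speaker_id": "speaker_1"},
    ],
}, ensure_ascii=False)

print(words_per_speaker(sample))
```

Applied to a whole column, this gives a rough talk-time split per clip, assuming the diarization labels are reliable for the recording.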

You can safely interrupt and restart without re-transcribing completed audio files, and deduplication limits each identifier's clip to a single API call.
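
Restart safety also depends on the on-disk cache, which is keyed on each file's path, size and modification time. The sketch below reproduces that key derivation as a standalone function and shows that changing the file produces a new key; the temporary file is a placeholder, not real audio.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(path: Path) -> str:
    """Hash of path, size and mtime, as in TranscriptionProcessor.cache_key."""
    s = path.stat()
    return hashlib.md5(f"{path}{s.st_size}{s.st_mtime}".encode()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    clip = Path(tmp) / "clip.mp3"  # placeholder bytes, not a playable MP3
    clip.write_bytes(b"\x00" * 16)
    key_before = cache_key(clip)
    key_again = cache_key(clip)
    clip.write_bytes(b"\x00" * 32)  # the size changes, so the key changes
    key_after = cache_key(clip)

print(key_before == key_again, key_before != key_after)
```

Note that the cache stores only the transcript text, so a cache hit returns a record without word-level timestamps or speaker labels.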
