# Stage 1.5 — CORAA-MUPE-ASR Experiment

**Research question:** *Does the frozen TTS backbone already understand accent in a way that is separable from speaker identity?*

This notebook runs the **complete Stage 1.5 latent separability audit** on the
[CORAA-MUPE-ASR](https://huggingface.co/datasets/nilc-nlp/CORAA-MUPE-ASR) dataset:
289 life-story interviews (365 h) of spontaneous Brazilian Portuguese speech.

### Pipeline overview

| Step | What happens |
|------|-------------|
| 1 | Environment setup (GPU check, dependencies) |
| 2 | Download CORAA-MUPE-ASR from Hugging Face & build manifest |
| 3 | Generate `texts.json` from `normalized_text` (for backbone extractor) |
| 4 | Extract features: acoustic, ECAPA, SSL (WavLM), backbone (Qwen3-TTS) |
| 5 | Update config to match CORAA regions |
| 6 | Run probes + analysis → GO / GO_CONDITIONAL / NOGO decision |
| 7 | Inspect results (metrics table, heatmaps, report) |

### Decision criteria

| Decision | Accent F1 (macro) | Leakage (a→s) | Text drop |
|----------|-------------------|---------------|----------|
| **GO** | ≥ 0.55 | ≤ chance + 7pp | ≤ 10pp |
| **GO_CONDITIONAL** | ≥ 0.45 | ≤ chance + 12pp | ≤ 10pp |
| **NOGO** | < 0.40 (all backbone & SSL) | — | — |
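
The thresholds in the table can be read as a small rule set. The sketch below is one possible reading, not the pipeline's actual decision code. The names `classify_feature_space` and `is_nogo` are hypothetical, metrics are assumed to be fractions in [0, 1], and the `UNDECIDED` label is an assumption for cases the table leaves open (for example, F1 between 0.40 and 0.45).

```python
# Hypothetical sketch of the decision table; not the repository's implementation.
GO_F1, COND_F1, NOGO_F1 = 0.55, 0.45, 0.40
GO_LEAK_PP, COND_LEAK_PP, TEXT_DROP_PP = 7.0, 12.0, 10.0


def classify_feature_space(accent_f1, leakage_a2s, chance_speaker, text_drop):
    """Return 'GO', 'GO_CONDITIONAL' or 'UNDECIDED' for one feature space.

    All arguments are fractions (0.55 means 55%). Leakage is compared with
    chance in percentage points, as in the table.
    """
    leak_excess_pp = (leakage_a2s - chance_speaker) * 100
    drop_pp = text_drop * 100
    if drop_pp <= TEXT_DROP_PP:
        if accent_f1 >= GO_F1 and leak_excess_pp <= GO_LEAK_PP:
            return 'GO'
        if accent_f1 >= COND_F1 and leak_excess_pp <= COND_LEAK_PP:
            return 'GO_CONDITIONAL'
    return 'UNDECIDED'  # assumption: the table does not name this case


def is_nogo(accent_f1_by_space):
    """NOGO row: every backbone and SSL feature space scores below 0.40."""
    probed = [f1 for label, f1 in accent_f1_by_space.items()
              if label.startswith(('backbone:', 'ssl:'))]
    return bool(probed) and all(f1 < NOGO_F1 for f1 in probed)


# Made-up numbers, for illustration only.
print(classify_feature_space(0.60, 0.05, 0.02, 0.04))
print(classify_feature_space(0.50, 0.12, 0.02, 0.05))
print(is_nogo({'backbone:pre_vocoder': 0.35, 'ssl:layer12': 0.38, 'ecapa': 0.70}))
```

Keeping the leakage check in percentage points mirrors how the table is written and avoids mixing fractions with "pp" margins.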

> **Runtime estimate:** ~2–4 hours on Colab L4 GPU depending on dataset filters.

---
## 1. Environment setup

In [None]:
# Ensure we are in /content
import os
os.makedirs('/content', exist_ok=True)
os.chdir('/content')

In [None]:
# 1.1 GPU diagnostics
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f'GPU: {gpu_name} ({gpu_mem:.1f} GB)')
else:
    print('WARNING: No GPU detected. Backbone extraction will be very slow.')

!nvidia-smi || echo 'nvidia-smi not available'

In [None]:
# 1.2 Clone repository
REPO_URL = 'https://github.com/paulohenriquevn/accent-speaker-disentanglement.git'  # TODO: update
BRANCH = 'main'

!rm -rf /content/stage1_5_repo
!git clone -b {BRANCH} {REPO_URL} /content/stage1_5_repo
%cd /content/stage1_5_repo

In [None]:
# 1.3 Install dependencies
# NOTE: qwen-tts is needed for backbone extraction (Qwen3-TTS model)
!pip install -q -U pip
!pip install -q -e .[dev]
!pip install -q -U qwen-tts

In [None]:
# 1.4 Verify installation
!stage1_5 --help

In [None]:
# 1.5 (Optional) Mount Google Drive for persistent storage
MOUNT_DRIVE = False  # Set True to mount Drive
DRIVE_ARTIFACT_DIR = '/content/drive/MyDrive/stage1_5_coraa'  # Where to save results

if MOUNT_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    os.makedirs(DRIVE_ARTIFACT_DIR, exist_ok=True)
    print(f'Drive mounted. Artifacts will be synced to: {DRIVE_ARTIFACT_DIR}')

---
## 2. Dataset preparation (CORAA-MUPE-ASR)

The cell below builds the dataset with `build_manifest_from_coraa`, the Python function behind the `build-coraa` CLI command, which:
1. Downloads the dataset from Hugging Face (`nilc-nlp/CORAA-MUPE-ASR`)
2. Filters only interviewees (`speaker_type='R'`) who have `birth_state` metadata
3. Maps the 27 federative units (26 states plus the Federal District) to the 5 IBGE macro-regions: `N`, `NE`, `CO`, `SE`, `S`
4. Applies configurable filters (regions, duration, quality, max samples per speaker)
5. Exports audio and writes `manifest.jsonl`
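
The state-to-region mapping in step 3 can be sketched as a lookup table. This is an illustration, not the builder's code: it assumes `birth_state` holds a two-letter state (UF) code, whereas the dataset may store full state names, and `region_of` is a hypothetical helper.

```python
# IBGE macro-regions keyed by two-letter state (UF) code.
# Assumption: birth_state is stored as a UF code; the dataset may use full names.
REGION_TO_STATES = {
    'N':  ['AC', 'AP', 'AM', 'PA', 'RO', 'RR', 'TO'],
    'NE': ['AL', 'BA', 'CE', 'MA', 'PB', 'PE', 'PI', 'RN', 'SE'],
    'CO': ['DF', 'GO', 'MT', 'MS'],
    'SE': ['ES', 'MG', 'RJ', 'SP'],
    'S':  ['PR', 'RS', 'SC'],
}
STATE_TO_REGION = {uf: region for region, ufs in REGION_TO_STATES.items() for uf in ufs}


def region_of(birth_state):
    """Return the macro-region code, or None when the state is missing or unknown."""
    if not birth_state:
        return None
    return STATE_TO_REGION.get(str(birth_state).strip().upper())


print(len(STATE_TO_REGION), region_of('sp'), region_of('BA'))
```

Note that `SE` is both the UF code for Sergipe (which belongs to `NE`) and the code of the Southeast macro-region, so state codes and region codes should be kept in separate fields.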

### Filter rationale
- **Regions**: We focus on `NE`, `SE`, `S` (the three regions with the most phonetic contrast). Set `REGIONS` to `None` to include all 5 regions.
- **Duration**: Segments between 3–15 seconds to avoid very short/uninformative clips and very long segments that strain GPU memory during backbone extraction.
- **Audio quality**: `high` only to reduce noise confounds.
- **Max samples per speaker**: Cap at 30 to mitigate class imbalance and prevent speaker memorization.
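
A minimal sketch of how these four filters combine is shown below. It is not the builder's implementation: `apply_filters` is a hypothetical helper, the `duration` and `audio_quality` field names are assumptions, and the cap keeps the first utterances per speaker in input order, whereas the real builder may sample differently.

```python
from collections import defaultdict


def apply_filters(rows, regions=None, min_duration=3.0, max_duration=15.0,
                  audio_quality='high', max_samples_per_speaker=30):
    """Filter a list of utterance dicts; None disables the corresponding filter."""
    kept, per_speaker = [], defaultdict(int)
    for row in rows:
        if regions is not None and row['accent'] not in regions:
            continue
        if not (min_duration <= row['duration'] <= max_duration):
            continue
        if audio_quality is not None and row['audio_quality'] != audio_quality:
            continue
        if (max_samples_per_speaker is not None
                and per_speaker[row['speaker']] >= max_samples_per_speaker):
            continue  # speaker already reached the cap
        per_speaker[row['speaker']] += 1
        kept.append(row)
    return kept


# Made-up rows, for illustration only.
rows = [{'accent': 'NE', 'speaker': 's1', 'duration': 5.0, 'audio_quality': 'high'}
        for _ in range(5)]
rows.append({'accent': 'SE', 'speaker': 's2', 'duration': 2.0, 'audio_quality': 'high'})
kept = apply_filters(rows, regions=['NE', 'SE', 'S'], max_samples_per_speaker=3)
print(len(kept))
```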

In [None]:
# 2.1 Configurable dataset filters
# Adjust these to control dataset size and composition.

REGIONS = 'NE,SE,S'          # Comma-separated macro-region codes, or None for all 5
MIN_DURATION = 3.0            # seconds
MAX_DURATION = 15.0           # seconds
AUDIO_QUALITY = 'high'        # 'high', 'low', or None for both
MAX_SAMPLES_PER_SPEAKER = 30  # None for unlimited

MANIFEST_PATH = 'data/manifest.jsonl'
AUDIO_DIR = 'data/wav/coraa'

In [None]:
# 2.2 Build manifest from CORAA-MUPE-ASR
# Calls the Python function directly so tqdm progress and errors are visible.
# First run will take several minutes for the HF download + audio export.

import logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s %(name)s: %(message)s', force=True)

# Force-reload to pick up any git-pulled changes without restarting runtime
import importlib, stage1_5.data.dataset_builder as _dsb
importlib.reload(_dsb)
from stage1_5.data.dataset_builder import build_manifest_from_coraa

region_list = [r.strip() for r in REGIONS.split(',')] if REGIONS else None

manifest_result = build_manifest_from_coraa(
    output_path=MANIFEST_PATH,
    audio_dir=AUDIO_DIR,
    regions=region_list,
    min_duration=MIN_DURATION,
    max_duration=MAX_DURATION,
    audio_quality=AUDIO_QUALITY,
    max_samples_per_speaker=MAX_SAMPLES_PER_SPEAKER,
)
print(f'Manifest written to: {manifest_result}')

In [None]:
# 2.3 Verify manifest & dataset statistics
import json
import pandas as pd
from pathlib import Path

manifest_path = Path(MANIFEST_PATH)
assert manifest_path.exists(), f'Manifest not found: {manifest_path}'

rows = [json.loads(line) for line in manifest_path.read_text().splitlines() if line.strip()]
df_manifest = pd.DataFrame(rows)

print(f'Total utterances: {len(df_manifest)}')
print(f'Unique speakers:  {df_manifest["speaker"].nunique()}')
print(f'Unique accents:   {df_manifest["accent"].nunique()}')
print(f'Unique text_ids:  {df_manifest["text_id"].nunique()}')
print()
print('Utterances per accent (region):')
print(df_manifest['accent'].value_counts().to_string())
print()
print('Speakers per accent:')
print(df_manifest.groupby('accent')['speaker'].nunique().to_string())
print()
print('Sample manifest entry:')
print(json.dumps(rows[0], indent=2, ensure_ascii=False))

In [None]:
# 2.4 FULL validation: check ALL audio files exist before feature extraction
# This catches export failures early, before spending hours on extraction.
from pathlib import Path
import os

print(f'Current working directory: {os.getcwd()}')
print(f'Checking {len(rows)} audio paths from manifest...\n')

missing = []
exists_count = 0
total_bytes = 0
for r in rows:
    p = Path(r['path'])
    if p.exists():
        exists_count += 1
        total_bytes += p.stat().st_size
    else:
        missing.append(str(p))

print(f'Audio files found:   {exists_count} / {len(rows)}')
print(f'Audio files missing: {len(missing)} / {len(rows)}')
print(f'Total audio size:    {total_bytes / (1024**3):.2f} GB')

if missing:
    print(f'\nFirst 10 missing paths:')
    for mp in missing[:10]:
        print(f'  {mp}')
    # Relative paths only resolve from the directory where the manifest was built
    if not Path(missing[0]).is_absolute():
        print(f'\nManifest paths are RELATIVE and the CWD is: {os.getcwd()}')
        print('The dataset builder is expected to write absolute paths. Re-run step 2.2.')
    raise RuntimeError(
        f'{len(missing)} audio files are missing! '
        f'Feature extraction will fail. Fix the dataset build step first.'
    )
else:
    print('\nAll audio files verified. Ready for feature extraction.')

---
## 3. Generate `texts.json` for backbone extraction

The backbone feature extractor (Qwen3-TTS) requires a text prompt for each utterance.
Since CORAA-MUPE is spontaneous speech (not controlled reading), we use the `normalized_text`
from the dataset as the text prompt. The `text_id` field links each manifest entry to its text.

We need to:
1. Reload the CORAA-MUPE dataset to access `normalized_text` (the manifest only has `text_id`)
2. Build a mapping `{text_id: normalized_text}` for all entries in the manifest
3. Write `data/texts.json` in the format `[{"text_id": ..., "text": ...}, ...]`
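
The three steps above can be sketched on plain dictionaries. The example assumes that `text_id` is the audio name joined to the start time with `.` replaced by `_`, which is how the cell below recomputes it; `make_text_id` and `build_texts` are hypothetical helpers and the sample rows are made up.

```python
import json


def make_text_id(audio_name, start_time):
    """Join the audio name and start time, replacing '.' in the time with '_'."""
    return f"{audio_name}_{str(start_time).replace('.', '_')}"


def build_texts(segments, manifest_text_ids):
    """Return [{'text_id', 'text'}] for segments whose id is in the manifest."""
    texts, seen = [], set()
    for seg in segments:
        tid = make_text_id(seg['audio_name'], seg['start_time'])
        text = (seg.get('normalized_text') or '').strip()
        # Skip ids outside the manifest, duplicates, and empty transcripts
        if tid in manifest_text_ids and tid not in seen and text:
            texts.append({'text_id': tid, 'text': text})
            seen.add(tid)
    return texts


segments = [  # made-up rows, for illustration only
    {'audio_name': 'interview_a', 'start_time': 12.5, 'normalized_text': 'bom dia'},
    {'audio_name': 'interview_a', 'start_time': 20.0, 'normalized_text': ''},
]
texts = build_texts(segments, {'interview_a_12_5', 'interview_a_20_0'})
print(json.dumps(texts, ensure_ascii=False, indent=2))
```

The second segment has an empty transcript, so it is left out of the output and would be reported as a manifest `text_id` without text.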

In [None]:
# 3.1 Build texts.json from CORAA-MUPE normalized_text
import json
import pandas as pd
from pathlib import Path
from datasets import load_dataset

# Reload manifest (self-contained — no dependency on earlier cell variables)
manifest_path = Path(MANIFEST_PATH)
assert manifest_path.exists(), f'Manifest not found: {manifest_path}. Run step 2 first.'
rows = [json.loads(line) for line in manifest_path.read_text().splitlines() if line.strip()]
df_manifest = pd.DataFrame(rows)
manifest_text_ids = set(df_manifest['text_id'].unique())

# Load the HF dataset (uses cache from step 2)
print('Loading CORAA-MUPE-ASR (from cache)...')
ds = load_dataset('nilc-nlp/CORAA-MUPE-ASR', split='train')
# Exclude audio column to avoid slow/broken serialisation
metadata_cols = [c for c in ds.column_names if c != 'audio']
coraa_df = ds.select_columns(metadata_cols).to_pandas()

# Build text_id the same way as build_manifest_from_coraa()
coraa_df_r = coraa_df[coraa_df['speaker_type'] == 'R'].copy()
coraa_df_r['text_id_computed'] = (
    coraa_df_r['audio_name'].astype(str) + '_' +
    coraa_df_r['start_time'].astype(str).str.replace('.', '_', regex=False)
)

# Build the texts list
texts_list = []
seen = set()
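for _, row in coraa_df_r.iterrows():
    tid = row['text_id_computed']
    if tid in manifest_text_ids and tid not in seen:
        # Prefer normalized_text and fall back to original_text. Only non-empty
        # strings count, so None/NaN never end up as the text 'None' or 'nan'.
        text = ''
        for col in ('normalized_text', 'original_text'):
            value = row.get(col)
            if isinstance(value, str) and value.strip():
                text = value.strip()
                break
        if text:
            texts_list.append({'text_id': tid, 'text': text})
            seen.add(tid)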

# Write texts.json
texts_json_path = Path('data/texts.json')
texts_json_path.write_text(json.dumps(texts_list, ensure_ascii=False, indent=2), encoding='utf-8')

print(f'Wrote {len(texts_list)} text entries to {texts_json_path}')
print(f'Manifest text_ids: {len(manifest_text_ids)}')
missing = manifest_text_ids - seen
if missing:
    print(f'WARNING: {len(missing)} text_ids in manifest have no text. Backbone extraction will skip them.')
    print(f'  Examples: {list(missing)[:5]}')
else:
    print('All manifest text_ids have corresponding texts.')

# Guard against an empty list so the cell does not end with an IndexError
if texts_list:
    print(f'\nSample entry: {texts_list[0]}')

---
## 4. Feature extraction

We extract 4 types of features, each producing `.npz` files per utterance:

| Extractor | Dimensionality | Purpose |
|-----------|---------------|--------|
| **Acoustic** | 85-dim (MFCC mean/std + F0 stats + speaking rate + duration) | Baseline: surface-level accent cues |
| **ECAPA** | 192-dim (SpeechBrain speaker embeddings) | Control: speaker identity |
| **SSL (WavLM-large)** | 1024-dim per layer (layers 0,6,12,18,24) | Baseline: self-supervised speech representations |
| **Backbone (Qwen3-TTS)** | Variable-dim per layer | Test: does the TTS backbone encode accent? |

Each extractor can be run independently. Cached features are reused if already present.
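
The cache reuse mentioned above amounts to skipping utterances whose output file already exists. The sketch below illustrates the idea only: it assumes one file per utterance named `<utt_id>.npz`, which may not match the extractors' real naming scheme, and `pending_utterances` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def pending_utterances(utt_ids, feature_dir):
    """Return the utterance ids that have no cached .npz file yet."""
    feature_dir = Path(feature_dir)
    return [u for u in utt_ids if not (feature_dir / f'{u}.npz').exists()]


# Demo in a temporary directory: utt_001 is "cached", utt_002 is not.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'utt_001.npz').touch()
    todo = pending_utterances(['utt_001', 'utt_002'], tmp)
print(todo)
```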

In [None]:
# 4.1 Acoustic features (~1-3 min)
# MFCC mean/std (40+40), F0 stats (3), speaking rate (1), duration (1) = 85 dimensions
!stage1_5 features acoustic {MANIFEST_PATH} artifacts/features/acoustic

In [None]:
# 4.2 ECAPA speaker embeddings (~5-10 min on GPU)
# 192-dim embeddings from SpeechBrain's pretrained ECAPA-TDNN
import torch  # re-imported so this cell does not depend on cell 1.1
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
!stage1_5 features ecapa {MANIFEST_PATH} artifacts/features/ecapa --device {DEVICE}

In [None]:
# 4.3 SSL features - WavLM-large (~10-20 min on GPU)
# 1024-dim per layer, layers [0, 6, 12, 18, 24]
!stage1_5 features ssl \
    {MANIFEST_PATH} \
    artifacts/features/ssl \
    --model wavlm_large \
    --device {DEVICE}

In [None]:
# 4.4 Backbone features - Qwen3-TTS (~30-60 min on GPU)
# Hooks into internal layers: text_encoder_out, decoder_block_04, decoder_block_08, pre_vocoder
# Requires texts.json (built in step 3)
!stage1_5 features backbone \
    {MANIFEST_PATH} \
    data/texts.json \
    artifacts/features/backbone \
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice" \
    --layers text_encoder_out decoder_block_04 decoder_block_08 pre_vocoder \
    --device {DEVICE} \
    --dtype bfloat16

In [None]:
# 4.5 Verify extracted features
from pathlib import Path
import numpy as np

feature_dirs = {
    'acoustic': Path('artifacts/features/acoustic'),
    'ecapa': Path('artifacts/features/ecapa'),
    'ssl': Path('artifacts/features/ssl'),
    'backbone': Path('artifacts/features/backbone'),
}

for name, fdir in feature_dirs.items():
    npz_files = list(fdir.glob('*.npz'))
    if not npz_files:
        print(f'  {name}: NO FILES FOUND')
        continue
    sample = np.load(npz_files[0])
    keys = list(sample.files)
    dims = {k: sample[k].shape for k in keys}
    print(f'  {name}: {len(npz_files)} files, keys={keys}, dims={dims}')

---
## 5. Update experiment config for CORAA-MUPE

The default `config/stage1_5.yaml` uses `accents: ["NE", "SE", "S"]`. We update it
to match the regions actually present in our manifest. The config drives the `stage1_5 run`
pipeline which trains probes and computes the GO/NOGO decision.
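
Before patching, it can help to confirm that the config contains the keys this notebook touches. The sketch below is a hedged illustration: the key names are the ones used in the cell that follows, the example values mirror the decision-criteria table and may differ from the real file, and `missing_config_keys` is a hypothetical helper.

```python
REQUIRED_EXPERIMENT_KEYS = (
    'accents', 'min_f1_go', 'min_f1_conditional',
    'leakage_margin_pp', 'leakage_conditional_margin_pp', 'text_drop_tolerance_pp',
)


def missing_config_keys(cfg):
    """List dotted names of keys used by this notebook that are absent from cfg."""
    missing = [f'experiment.{k}' for k in REQUIRED_EXPERIMENT_KEYS
               if k not in cfg.get('experiment', {})]
    if 'manifest' not in cfg.get('paths', {}):
        missing.append('paths.manifest')
    return missing


# Example values mirror the decision-criteria table; the real file may differ.
example_cfg = {
    'experiment': {
        'accents': ['NE', 'SE', 'S'],
        'min_f1_go': 0.55, 'min_f1_conditional': 0.45,
        'leakage_margin_pp': 7, 'leakage_conditional_margin_pp': 12,
        'text_drop_tolerance_pp': 10,
    },
    'paths': {'manifest': 'data/manifest.jsonl'},
}
print(missing_config_keys(example_cfg))
```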

In [None]:
# 5.1 Patch config to match actual dataset
import yaml

config_path = Path('config/stage1_5.yaml')
cfg = yaml.safe_load(config_path.read_text())

# Update accents to match what's actually in the manifest
actual_accents = sorted(df_manifest['accent'].unique().tolist())
cfg['experiment']['accents'] = actual_accents
print(f'Accents in manifest: {actual_accents}')

# Ensure paths are correct
cfg['paths']['manifest'] = MANIFEST_PATH

# Write updated config
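# sort_keys=False keeps the original key order; YAML comments are lost in a load/dump round trip
config_path.write_text(
    yaml.dump(cfg, default_flow_style=False, allow_unicode=True, sort_keys=False),
    encoding='utf-8',
)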
print(f'Config updated: {config_path}')
print(f'\nExperiment thresholds:')
print(f'  GO:          F1 >= {cfg["experiment"]["min_f1_go"]}, leakage <= chance + {cfg["experiment"]["leakage_margin_pp"]}pp')
print(f'  CONDITIONAL: F1 >= {cfg["experiment"]["min_f1_conditional"]}, leakage <= chance + {cfg["experiment"]["leakage_conditional_margin_pp"]}pp')
print(f'  Text drop tolerance: {cfg["experiment"]["text_drop_tolerance_pp"]}pp')

---
## 6. Run the full pipeline

The `stage1_5 run` command:
1. Loads the manifest and all extracted features
2. For each feature space, trains logistic regression probes with:
   - **Speaker-disjoint splits** for accent classification (no speaker seen in both train & test)
   - **Text-disjoint splits** for robustness testing
   - **Stratified splits** for speaker classification
3. Computes: accent F1 (macro), speaker accuracy, leakage (a→s, s→a), text-drop, RSA, CKA
4. Applies decision logic (GO / GO_CONDITIONAL / NOGO)
5. Generates heatmap figures and a markdown report
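
To illustrate what "speaker-disjoint" means in step 2, the sketch below splits by speaker rather than by utterance. It is not the pipeline's splitter: `speaker_disjoint_split` is a hypothetical helper, the 20% test fraction is an assumption, and the split is not stratified by accent, which the real pipeline may do.

```python
import random


def speaker_disjoint_split(rows, test_fraction=0.2, seed=0):
    """Split utterance dicts so that no speaker appears in both train and test."""
    speakers = sorted({row['speaker'] for row in rows})
    random.Random(seed).shuffle(speakers)  # seeded for reproducibility
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [r for r in rows if r['speaker'] not in test_speakers]
    test = [r for r in rows if r['speaker'] in test_speakers]
    return train, test


# Made-up data: 10 speakers with 3 utterances each.
rows = [{'speaker': f'spk{i}', 'utt': j} for i in range(10) for j in range(3)]
train, test = speaker_disjoint_split(rows)
overlap = {r['speaker'] for r in train} & {r['speaker'] for r in test}
print(len(train), len(test), overlap)
```

Splitting on speakers prevents an accent probe from scoring well simply by recognising voices it saw during training.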

In [None]:
# 6.1 Run the pipeline
!stage1_5 run config/stage1_5.yaml

---
## 7. Inspect results

In [None]:
# 7.1 Load metrics
import pandas as pd

metrics_path = Path('artifacts/analysis/metrics.csv')
assert metrics_path.exists(), f'Metrics not found: {metrics_path}. Did the pipeline run successfully?'

metrics = pd.read_csv(metrics_path)
print(f'Total feature spaces evaluated: {len(metrics)}')
print()
metrics

In [None]:
# 7.2 Key metrics: accent separability (sorted by F1)
accent_cols = ['label', 'target', 'accent_f1', 'accent_text_f1', 'accent_text_drop',
               'leakage_a2s', 'chance_speaker', 'rsa_accent', 'cka_accent']
available_cols = [c for c in accent_cols if c in metrics.columns]
print('Accent separability (sorted by F1, descending):')
metrics[available_cols].sort_values('accent_f1', ascending=False)

In [None]:
# 7.3 Leakage analysis
# leakage_a2s = speaker accuracy when probe is trained on accent features
# Should be close to chance_speaker for clean accent encoding
leak_cols = ['label', 'accent_f1', 'leakage_a2s', 'chance_speaker', 'speaker_acc']
available_leak = [c for c in leak_cols if c in metrics.columns]

metrics_sorted = metrics[available_leak].sort_values('accent_f1', ascending=False)

if 'leakage_a2s' in metrics.columns and 'chance_speaker' in metrics.columns:
    metrics_sorted = metrics_sorted.copy()
    metrics_sorted['leakage_excess_pp'] = (
        (metrics_sorted['leakage_a2s'] - metrics_sorted['chance_speaker']) * 100
    ).round(1)

print('Leakage analysis:')
metrics_sorted

In [None]:
# 7.4 Display heatmaps
from IPython.display import Image, display

fig_dir = Path('artifacts/analysis/figures')

for fig_name in ['accent_f1.png', 'leakage.png', 'accent_text_robustness.png']:
    fig_path = fig_dir / fig_name
    if fig_path.exists():
        print(f'\n--- {fig_name} ---')
        display(Image(str(fig_path)))
    else:
        print(f'Figure not found: {fig_path}')

In [None]:
# 7.5 Custom visualization: backbone layers vs SSL layers comparison
import matplotlib.pyplot as plt
import numpy as np

backbone_rows = metrics[metrics['label'].str.startswith('backbone:')].copy()
ssl_rows = metrics[metrics['label'].str.startswith('ssl:')].copy()

if not backbone_rows.empty or not ssl_rows.empty:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Accent F1 comparison
    ax = axes[0]
    if not backbone_rows.empty:
        ax.barh(backbone_rows['label'], backbone_rows['accent_f1'], color='steelblue', label='Backbone')
    if not ssl_rows.empty:
        ax.barh(ssl_rows['label'], ssl_rows['accent_f1'], color='coral', label='SSL')
    ax.axvline(x=0.55, color='green', linestyle='--', label='GO threshold')
    ax.axvline(x=0.45, color='orange', linestyle='--', label='GO_COND threshold')
    ax.axvline(x=0.40, color='red', linestyle='--', label='NOGO threshold')
    ax.set_xlabel('Accent F1 (macro)')
    ax.set_title('Accent Separability by Layer')
    ax.legend(loc='lower right', fontsize=8)

    # Leakage comparison
    ax = axes[1]
    if not backbone_rows.empty:
        ax.barh(backbone_rows['label'], backbone_rows['leakage_a2s'], color='steelblue', label='Backbone')
    if not ssl_rows.empty:
        ax.barh(ssl_rows['label'], ssl_rows['leakage_a2s'], color='coral', label='SSL')
    if 'chance_speaker' in metrics.columns:
        chance = metrics['chance_speaker'].iloc[0]
        ax.axvline(x=chance, color='gray', linestyle=':', label=f'Chance ({chance:.3f})')
        ax.axvline(x=chance + 0.07, color='green', linestyle='--', label=f'GO limit (+7pp)')
        ax.axvline(x=chance + 0.12, color='orange', linestyle='--', label=f'COND limit (+12pp)')
    ax.set_xlabel('Leakage a→s (speaker accuracy)')
    ax.set_title('Speaker Leakage by Layer')
    ax.legend(loc='lower right', fontsize=8)

    plt.tight_layout()
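    out_path = Path('artifacts/analysis/figures/layer_comparison.png')
    out_path.parent.mkdir(parents=True, exist_ok=True)  # savefig does not create directories
    plt.savefig(out_path, dpi=150)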
    plt.show()
else:
    print('No backbone or SSL results to plot.')

In [None]:
# 7.6 Display the GO/NOGO report
from IPython.display import Markdown, display

report_path = Path('report/stage1_5_report.md')
if report_path.exists():
    display(Markdown(report_path.read_text()))
else:
    print('Report not found. Check pipeline output above for errors.')

---
## 8. Interpretation guide

### How to read the results

| Metric | What it measures | Good value |
|--------|-----------------|------------|
| `accent_f1` | Can a linear probe classify accent from this feature space? | Higher = more accent info (≥0.55 for GO) |
| `leakage_a2s` | Does the accent feature space leak speaker identity? | Close to `chance_speaker` = clean separation |
| `accent_text_f1` | Accent F1 with text-disjoint split | Close to `accent_f1` = not memorizing text |
| `accent_text_drop` | `accent_f1 - accent_text_f1` | ≤0.10 = robust to text variation |
| `rsa_accent` | Representational Similarity Analysis correlation with accent labels | Higher = accent structure |
| `cka_accent` | Centered Kernel Alignment with accent labels | Higher = accent structure |
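
For readers unfamiliar with macro-averaged F1, the sketch below computes it from scratch on made-up labels. It is a reference implementation for illustration; the pipeline presumably relies on a library routine, and `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in y_true or y_pred."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up predictions: per-class F1 is 0.5 (NE), 0.8 (SE) and 2/3 (S).
y_true = ['NE', 'NE', 'SE', 'SE', 'S', 'S']
y_pred = ['NE', 'SE', 'SE', 'SE', 'S', 'NE']
print(round(macro_f1(y_true, y_pred), 3))
```

Because every class counts equally, a probe cannot reach a high macro F1 by predicting only the most frequent region.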

### What the decision means

- **GO**: The backbone already encodes accent separably from speaker. Proceed to Stage 2 (LoRA training) with the identified best layer.
- **GO_CONDITIONAL**: Weak but promising signal. Stage 2 may work but with higher risk. Consider adjusting LoRA strategy.
- **NOGO**: No accent signal in backbone. LoRA fine-tuning on this backbone is unlikely to achieve accent control.

### Important caveats for CORAA-MUPE

1. **Spontaneous speech**: Unlike controlled reading, spontaneous speech has high intra-speaker variability. This makes accent classification harder but the experiment more ecologically valid.
2. **Text-disjoint evaluation**: Since each utterance has unique text, `accent_text_f1` tests generalization to unseen text content. A large `accent_text_drop` suggests the probe memorizes text rather than accent.
3. **Region vs accent**: We use IBGE macro-regions as an accent proxy. Within-region variation exists (e.g., Minas Gerais vs São Paulo within SE), so perfect F1 is not expected.

---
## 9. (Optional) Save artifacts to Drive

In [None]:
# 9.1 Sync important artifacts to Drive
if MOUNT_DRIVE:
    import shutil
    # Copy analysis results
    for src in ['artifacts/analysis', 'report']:
        dst = os.path.join(DRIVE_ARTIFACT_DIR, src)
        if os.path.exists(src):
            shutil.copytree(src, dst, dirs_exist_ok=True)
            print(f'Copied {src} -> {dst}')
    # Copy manifest and config
    for f in [MANIFEST_PATH, 'config/stage1_5.yaml', 'data/texts.json']:
        if os.path.exists(f):
            dst = os.path.join(DRIVE_ARTIFACT_DIR, f)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(f, dst)
            print(f'Copied {f} -> {dst}')
    print(f'\nAll artifacts saved to: {DRIVE_ARTIFACT_DIR}')
else:
    print('Drive not mounted. Download artifacts manually from the Colab file browser.')

---
## 10. Summary & next steps

After running this notebook, you should have:

1. A `data/manifest.jsonl` with CORAA-MUPE utterances (filtered by region/duration/quality)
2. Feature files in `artifacts/features/{acoustic,ecapa,ssl,backbone}/`
3. Metrics in `artifacts/analysis/metrics.csv` and heatmaps in `artifacts/analysis/figures/`
4. A GO/NOGO report in `report/stage1_5_report.md`

**If GO or GO_CONDITIONAL:** proceed to Stage 2 (LoRA training) using the best backbone layer identified.

**If NOGO:** consider:
- Relaxing filters (more data, all 5 regions) and re-running
- Trying a different TTS backbone
- Investigating whether the accent signal exists but is non-linearly encoded (beyond what a logistic probe captures)