# Stage 1.5 — Colab Runner
This notebook automates the Stage 1.5 latent separability audit (Accent × Speaker) inside Google Colab.

**Pipeline overview**
1. (Optional) Mount Google Drive to access private datasets/checkpoints.
2. Clone this repository (or pull from your fork).
3. Install dependencies with `pip install -e .[dev]`.
4. Ensure `data/manifest.jsonl` and referenced audio files exist.
5. Run the feature extractors (acoustic, ECAPA, SSL, backbone).
6. Execute `stage1_5 run` to train probes, compute leakage/RSA/CKA, and render the GO/NOGO report.
7. Download artifacts (`artifacts/analysis`, `report/`) or sync back to Drive.
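
Step 4 can be checked programmatically before any extractor runs. The following is a minimal sketch, assuming each manifest line is a JSON object with a `path` field relative to the working directory; the helper name `find_missing_audio` is hypothetical and not part of the `stage1_5` package.

```python
import json
from pathlib import Path


def find_missing_audio(manifest_path, root="."):
    """Return manifest 'path' values whose audio file is absent under root."""
    missing = []
    with open(manifest_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            entry = json.loads(line)
            if not (Path(root) / entry["path"]).is_file():
                missing.append(entry["path"])
    return missing


# Only runs the check when the manifest is present in the current directory.
if Path("data/manifest.jsonl").is_file():
    print("Missing audio files:", find_missing_audio("data/manifest.jsonl"))
```

An empty list means every referenced file was found; anything else is worth fixing before the feature extractors are launched.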

In [49]:
%mkdir -p /content


shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory


In [50]:
%cd /content

/content


In [48]:
%rm -rf /content/stage1_5

## 1. Runtime diagnostics

In [None]:
!nvidia-smi || echo 'GPU not available (OK for CPU-only runs)'

Wed Feb 11 17:00:59 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   39C    P8             12W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------
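
The diagnostic above can also drive the `--device` choice used later for the ECAPA extractor. This is a rough sketch using only the standard library; `pick_device` is a hypothetical helper, and a successful `nvidia-smi` call does not by itself guarantee that the installed PyTorch build can use the GPU.

```python
import shutil
import subprocess


def pick_device():
    """Return 'cuda' when nvidia-smi exists and exits cleanly, else 'cpu'."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "cpu"
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"
    return "cuda" if result.returncode == 0 else "cpu"


DEVICE = pick_device()
print("Device:", DEVICE)
```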

## 2. (Optional) Mount Google Drive
If your dataset or checkpoints live on Drive, mount it now. Skip if you plan to upload files manually.

In [None]:
from google.colab import drive
MOUNT_DRIVE = False  # set to True if you want to mount Drive
if MOUNT_DRIVE:
    drive.mount('/content/drive')

## 3. Repository + dataset configuration
Set the repository URL/branch you want to run. Update paths if your manifest or audio live elsewhere (e.g., in Drive).

In [51]:
from pathlib import Path

REPO_URL = 'https://github.com/paulohenriquevn/accent-speaker-disentanglement.git'  # TODO: update
BRANCH = 'main'                                        # e.g., 'main' or 'stage1_5'
WORKDIR = Path('/content/stage1_5')
DATA_ROOT = WORKDIR / 'data'                           # adjust if mounting from Drive
MANIFEST_PATH = DATA_ROOT / 'manifest.jsonl'
CONFIG_PATH = Path('config/stage1_5.yaml')             # relative to WORKDIR

print('Repo:', REPO_URL)
print('Branch:', BRANCH)
print('Working dir:', WORKDIR)

Repo: https://github.com/paulohenriquevn/accent-speaker-disentanglement.git
Branch: main
Working dir: /content/stage1_5
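
If the dataset lives on Drive, `DATA_ROOT` can be chosen at runtime instead of edited by hand. A minimal sketch, assuming the Drive folder holds a `manifest.jsonl` at its top level; `resolve_data_root` is a hypothetical helper, and the Drive path in the comment is the example folder used elsewhere in this notebook.

```python
from pathlib import Path


def resolve_data_root(workdir, drive_data=None):
    """Prefer a Drive folder that contains manifest.jsonl; else workdir/data."""
    if drive_data is not None and (Path(drive_data) / "manifest.jsonl").is_file():
        return Path(drive_data)
    return Path(workdir) / "data"


# Example (assumes Drive is mounted and WORKDIR is defined as above):
# DATA_ROOT = resolve_data_root(WORKDIR, "/content/drive/MyDrive/stage1_5_data")
# MANIFEST_PATH = DATA_ROOT / "manifest.jsonl"
```

Note that the manifest stores relative audio paths, so audio copied to Drive may still need to be reachable from the working directory.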


## 4. Clone / refresh the project

In [52]:
!git clone -b $BRANCH $REPO_URL 'stage1_5'
%cd stage1_5

Cloning into 'stage1_5'...
remote: Enumerating objects: 154, done.
remote: Counting objects: 100% (154/154), done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 154 (delta 58), reused 128 (delta 32), pack-reused 0 (from 0)
Receiving objects: 100% (154/154), 420.98 KiB | 2.19 MiB/s, done.
Resolving deltas: 100% (58/58), done.
/content/stage1_5
Already up to date.


In [72]:
!git pull

remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 9 (delta 5), reused 9 (delta 5), pack-reused 0 (from 0
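
Re-running the clone cell fails once the directory exists, which is why a separate `git pull` cell is needed. The sketch below folds both cases into one idempotent step; `refresh_commands` and `refresh` are hypothetical helpers, and the fast-forward-only pull is an assumption about the desired behaviour, not something the repository prescribes.

```python
import subprocess
from pathlib import Path


def refresh_commands(repo_url, branch, workdir):
    """Return the git commands for a clone (first run) or an update (re-run)."""
    workdir = str(Path(workdir))
    if (Path(workdir) / ".git").is_dir():
        return [
            ["git", "-C", workdir, "fetch", "origin"],
            ["git", "-C", workdir, "checkout", branch],
            ["git", "-C", workdir, "pull", "--ff-only"],
        ]
    return [["git", "clone", "-b", branch, repo_url, workdir]]


def refresh(repo_url, branch, workdir):
    """Run the commands, stopping at the first failure."""
    for cmd in refresh_commands(repo_url, branch, workdir):
        subprocess.run(cmd, check=True)


# Example (uses the variables defined in section 3):
# refresh(REPO_URL, BRANCH, WORKDIR)
```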

## 5. Install dependencies

In [53]:
!pip install -q -U pip
!pip install -q -e .[dev]
!pip install -q -U qwen-tts

  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
  Building editable for stage1_5 (pyproject.toml) ... done


## 6. Dataset preparation
Refer to `docs/dataset_guidelines.md` for the full specification.

Options:
- **Auto**: provide a dataset archive URL + metadata CSV and let the CLI download/extract/build the manifest.
- **Manual**: upload your own `data/wav` + `data/manifest.jsonl` (skip the auto cells).


In [79]:
from pathlib import Path

DATASET_URL = ''  # e.g., https://drive.google.com/uc?id=...
DATASET_ARCHIVE_NAME = 'stage1_5_dataset.zip'
DATASET_EXTRACT_DIR = Path('data/external')
DATASET_METADATA_CSV = DATASET_EXTRACT_DIR / 'metadata.csv'
DATASET_AUDIO_SUBDIR = DATASET_EXTRACT_DIR / 'wav'
COPY_AUDIO_TO = DATA_ROOT / 'wav'
AUTO_BUILD_MANIFEST = False  # set True to run CLI steps automatically
print('Dataset extract dir:', DATASET_EXTRACT_DIR)


Dataset extract dir: data/external


In [55]:
if DATASET_URL:
    !stage1_5 dataset download --url $DATASET_URL --output-dir $DATASET_EXTRACT_DIR --filename $DATASET_ARCHIVE_NAME
else:
    print('Set DATASET_URL to enable automatic download or skip to manual upload.')


Set DATASET_URL to enable automatic download or skip to manual upload.


In [56]:
import shutil

if AUTO_BUILD_MANIFEST and DATASET_AUDIO_SUBDIR.exists():
    COPY_AUDIO_TO.mkdir(parents=True, exist_ok=True)
    shutil.copytree(DATASET_AUDIO_SUBDIR, COPY_AUDIO_TO, dirs_exist_ok=True)
    if DATASET_METADATA_CSV.exists():
        !stage1_5 dataset build-manifest $DATASET_METADATA_CSV --audio-root $COPY_AUDIO_TO --output $MANIFEST_PATH --source real
    else:
        raise FileNotFoundError(f'Metadata CSV not found: {DATASET_METADATA_CSV}')
else:
    print('AUTO_BUILD_MANIFEST disabled or dataset folders missing; ensure data/manifest.jsonl exists.')


AUTO_BUILD_MANIFEST disabled or dataset folders missing; ensure data/manifest.jsonl exists.


In [57]:
# Manual fallback: copy dataset into data/ then ensure manifest exists
# Example: copy dataset from Drive
# !cp -r /content/drive/MyDrive/stage1_5_data/* $DATA_ROOT

if not MANIFEST_PATH.exists():
    raise FileNotFoundError(f'Manifest not found: {MANIFEST_PATH}. Provide metadata or enable AUTO_BUILD_MANIFEST.')

print('Manifest entries preview:')
!head -n 5 $MANIFEST_PATH


Manifest entries preview:
{"utt_id": "spk01_NE_t01", "path": "data/wav/spk01/t01.wav", "speaker": "spk01", "accent": "NE", "text_id": "t01", "source": "real"}
{"utt_id": "spk02_SE_t01", "path": "data/wav/spk02/t01.wav", "speaker": "spk02", "accent": "SE", "text_id": "t01", "source": "real"}
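
Beyond a visual preview, it can help to confirm that every entry carries the six fields shown above and to see how utterances spread over the Accent × Speaker grid. A minimal sketch; `summarize_manifest` is a hypothetical helper, and treating all six fields as required is an assumption based on the preview rather than on the package's schema.

```python
import json
from collections import Counter
from pathlib import Path

REQUIRED_FIELDS = ("utt_id", "path", "speaker", "accent", "text_id", "source")


def summarize_manifest(lines):
    """Validate required fields; count utterances per (accent, speaker)."""
    counts = Counter()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        entry = json.loads(line)
        missing = [key for key in REQUIRED_FIELDS if key not in entry]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        counts[(entry["accent"], entry["speaker"])] += 1
    return counts


manifest_file = Path("data/manifest.jsonl")
if manifest_file.is_file():
    with manifest_file.open(encoding="utf-8") as fh:
        for (accent, speaker), n in sorted(summarize_manifest(fh).items()):
            print(accent, speaker, n)
```

Cells of the grid with very few utterances are worth knowing about before probes are trained on them.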


In [62]:
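# Pin transformers to >=4.41,<4.50. If transformers was already imported in
# this session, restart the runtime so the pinned version is the one loaded.
!pip -q uninstall -y transformers
!pip -q install "transformers>=4.41,<4.50"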


In [69]:
import transformers
print(transformers.__version__)
print(hasattr(transformers, "AutoModel"))

4.57.6


ImportError: cannot import name 'add_model_info_to_auto_map' from 'transformers.utils' (/usr/local/lib/python3.12/dist-packages/transformers/utils/__init__.py)
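
The output above reports 4.57.6, which lies outside the `>=4.41,<4.50` pin. A check like the following can flag that mismatch without importing `transformers` at all. It is a simplified sketch: `release_tuple` and `satisfies_pin` are hypothetical helpers that compare only the numeric release components and ignore pre-release or local version suffixes.

```python
from importlib import metadata


def release_tuple(version):
    """Parse leading numeric components, e.g. '4.57.6' -> (4, 57, 6)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies_pin(version, minimum="4.41", below="4.50"):
    """True when minimum <= version < below (numeric release parts only)."""
    return release_tuple(minimum) <= release_tuple(version) < release_tuple(below)


try:
    installed = metadata.version("transformers")
    status = "within pin" if satisfies_pin(installed) else "outside pin"
    print("transformers", installed, status)
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

If the version is outside the pin, reinstalling and then restarting the runtime is the usual way to make sure the new version is actually loaded.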

## Fixes for backbone run (synthetic manifest + texts.json + layers)
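The cells that follow the feature-extraction command (In [81] and In [83]) generate `data/texts.json`, `data/texts.jsonl`, and `gen/manifest_syn.jsonl`, all of which `stage1_5 features backbone` requires. Run them before the backbone command in section 7.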


## 7. Feature extraction
Uncomment the commands you need. You may run them separately to reuse cached features.

In [90]:
%%bash
set -euo pipefail

# Acoustic features (status: PASS)
# stage1_5 features acoustic data/manifest.jsonl artifacts/features/acoustic

# ECAPA embeddings (set device to 'cuda' if GPU is available) (status: PASS)
# stage1_5 features ecapa data/manifest.jsonl artifacts/features/ecapa --device cuda

# SSL features (HuBERT/WavLM via Hugging Face Transformers)
# stage1_5 features ssl data/manifest.jsonl artifacts/features/ssl --model wavlm_large

# Backbone hooks (requires synthetic manifest + text prompts)
stage1_5 features backbone gen/manifest_syn.jsonl data/texts.json artifacts/features/backbone \
  "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice" \
  --layers text_encoder_out \
  --layers decoder_block_04 \
  --layers decoder_block_08 \
  --layers pre_vocoder




  self.setter(val)
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/stage1_5/stage1_5/features/cli.py:36 in backbone                    │
│                                                                              │
│   33 @app.command()                                                          │
│   34 def backbone(manifest: Path, text_json: Path, output: Path, checkpoint: │
│   35 │   │   │    layers: List[str] = typer.Option(..., help="Layer names to │
│ ❱ 36 │   extract_backbone_cli(manifest, text_json, output, checkpoint, layer │
│   37                                                                         │
│   38                                                                         │
│   39 if __name__ == "__main__":                                              │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ check

CalledProcessError: Command 'b'set -euo pipefail\n\n# Acoustic features PASS\n# stage1_5 features acoustic data/manifest.jsonl artifacts/features/acoustic\n\n# ECAPA embeddings (set device to \'cuda\' if GPU is available) PASS\n# stage1_5 features ecapa data/manifest.jsonl artifacts/features/ecapa --device cuda\n\n# SSL features (HuBERT/WavLM via Hugging Face Transformers)\n# stage1_5 features ssl data/manifest.jsonl artifacts/features/ssl --model wavlm_large\n\n# Backbone hooks (requires synthetic manifest + text prompts)\nstage1_5 features backbone gen/manifest_syn.jsonl data/texts.jsonl artifacts/features/backbone \\\n  "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice" \\\n  --layers "text_encoder_out decoder_block_04 decoder_block_08 pre_vocoder"\n\n\n'' returned non-zero exit status 1.
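
The error message above records an earlier invocation that passed all layer names as one quoted string, whereas the cell now repeats `--layers` once per layer, which is how list-valued Typer options are normally supplied. Whether that difference caused the failure is not established here. The sketch below builds the argument list programmatically so the flag is always repeated; `backbone_argv` is a hypothetical helper.

```python
import shlex


def backbone_argv(manifest, texts, output, checkpoint, layers):
    """Build the CLI argument list, repeating --layers once per layer."""
    argv = ["stage1_5", "features", "backbone", manifest, texts, output, checkpoint]
    for name in layers:
        argv += ["--layers", name]
    return argv


argv = backbone_argv(
    "gen/manifest_syn.jsonl",
    "data/texts.json",
    "artifacts/features/backbone",
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ["text_encoder_out", "decoder_block_04", "decoder_block_08", "pre_vocoder"],
)
print(shlex.join(argv))  # copy-pasteable shell command
```

The list can be handed to `subprocess.run(argv, check=True)` directly, which avoids shell quoting issues altogether.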

In [81]:
# Ensure data/texts.json and data/texts.jsonl exist for backbone CLI
import json
from pathlib import Path

jsonl_path = DATA_ROOT / "texts.jsonl"
json_path = DATA_ROOT / "texts.json"

default_texts = [
    "Bom dia, obrigado por participar do experimento.",
    "Hoje o tempo esta firme e o ceu esta limpo.",
    "A equipe avaliou o modelo em diferentes regioes.",
    "O objetivo e medir separabilidade de sotaque e identidade.",
    "Leia a frase com voz neutra e ritmo constante.",
    "A gravacao deve ser clara, sem ruido de fundo.",
    "O cachorro correu pelo quintal com alegria.",
    "A chuva começou no fim da tarde e parou cedo.",
    "O professor explicou a tarefa com paciencia.",
    "A menina abriu a janela para entrar ar.",
    "O mercado fica aberto ate as oito da noite.",
    "O time treinou para o jogo do fim de semana.",
    "A cidade cresce e precisa de novos servicos.",
    "O barco saiu do porto ao amanhecer.",
    "O cafe estava quente e o pao estava fresco.",
    "A biblioteca recebe estudantes todos os dias.",
    "A musica tocava baixo durante a reuniao.",
    "O medico pediu descanso e hidratacao.",
    "A estrada estava livre e o caminho foi rapido.",
    "O artista pintou o quadro com cores vivas.",
    "A casa antiga foi reformada com cuidado.",
    "O relogio marcou seis horas e a aula terminou.",
    "A feira oferecia frutas maduras e doces.",
    "O passageiro perdeu o onibus por pouco.",
    "A noticia foi confirmada pela equipe tecnica.",
    "O jardim tem flores vermelhas e amarelas.",
    "A crianca contou uma historia divertida.",
    "O vento soprava forte e frio naquela noite.",
    "O aluno revisou o conteudo antes da prova.",
    "O telefone tocou cedo e acordou a familia."
]

if jsonl_path.exists():
    arr = [json.loads(l) for l in jsonl_path.read_text(encoding="utf-8").splitlines() if l.strip()]
    json_path.write_text(json.dumps(arr, ensure_ascii=False, indent=2), encoding="utf-8")
    print("Wrote", json_path, "items:", len(arr))
elif json_path.exists():
    arr = json.loads(json_path.read_text(encoding="utf-8"))
    jsonl_path.write_text("\n".join(json.dumps(x, ensure_ascii=False) for x in arr) + "\n", encoding="utf-8")
    print("Wrote", jsonl_path, "items:", len(arr))
else:
    arr = [{"text_id": f"t{i:02d}", "text": t} for i, t in enumerate(default_texts)]
    json_path.write_text(json.dumps(arr, ensure_ascii=False, indent=2), encoding="utf-8")
    jsonl_path.write_text("\n".join(json.dumps(x, ensure_ascii=False) for x in arr) + "\n", encoding="utf-8")
    print("Generated default texts ->", json_path, "and", jsonl_path, "items:", len(arr))

data/texts.jsonl not found; skipping conversion. Ensure data/texts.json exists.


In [83]:
# Generate gen/manifest_syn.jsonl with the required ManifestEntry fields
import json, random
from pathlib import Path

texts_path = DATA_ROOT / "texts.json"
out = Path("gen/manifest_syn.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)

texts = json.loads(texts_path.read_text(encoding="utf-8"))

ACCENTS = ["NE", "SE", "S"]
SPEAKER = "ryan"   # keep aligned with backbone.generation.speaker
random.seed(0)     # fixed seed so the random accent assignment is reproducible
rows = []

for i, obj in enumerate(texts):
    text_id = obj["text_id"]
    utt_id = f"syn_{i:06d}"
    wav_path = f"gen/synthetic_audio/{utt_id}.wav"
    rows.append({
        "utt_id": utt_id,
        "path": wav_path,
        "speaker": SPEAKER,
        "accent": random.choice(ACCENTS),
        "text_id": text_id,
        "source": "synthetic",
    })

with out.open("w", encoding="utf-8") as f:
    for r in rows:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

print("Wrote", out, "rows:", len(rows))

FileNotFoundError: [Errno 2] No such file or directory: '/content/stage1_5/data/texts.json'
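
`random.choice` can leave the three accents unevenly represented in a small synthetic manifest. One alternative, offered here as a sketch rather than as the project's method, is a deterministic round-robin assignment; `assign_accents` is a hypothetical helper.

```python
from itertools import cycle


def assign_accents(text_ids, accents=("NE", "SE", "S")):
    """Pair each text_id with an accent in round-robin order."""
    return list(zip(text_ids, cycle(accents)))


for text_id, accent in assign_accents(["t00", "t01", "t02", "t03"]):
    print(text_id, accent)
```

With 30 prompts and 3 accents this gives exactly 10 prompts per accent, and the result does not depend on a random seed.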

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 8. Run Stage 1.5 pipeline

In [None]:
!stage1_5 run $CONFIG_PATH
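
Before launching the run, it may be worth confirming that the feature directories written in section 7 are populated. A minimal sketch; `missing_feature_dirs` is a hypothetical helper, and whether all four feature kinds are required depends on `config/stage1_5.yaml`, which is not shown here.

```python
from pathlib import Path

FEATURE_KINDS = ("acoustic", "ecapa", "ssl", "backbone")


def missing_feature_dirs(root="artifacts/features", kinds=FEATURE_KINDS):
    """Return feature kinds whose directory is absent or empty."""
    missing = []
    for kind in kinds:
        folder = Path(root) / kind
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(kind)
    return missing


print("Feature sets not yet extracted:", missing_feature_dirs())
```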

## 9. Inspect metrics & figures

In [None]:
import pandas as pd
metrics = pd.read_csv('artifacts/analysis/metrics.csv')
metrics.sort_values('accent_f1', ascending=False).head()
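
When the pipeline has not finished, the cell above fails with a bare file or key error. The following standard-library sketch performs the same top-N lookup with clearer messages; `top_by_column` is a hypothetical helper, and only the `accent_f1` column name is taken from this notebook.

```python
import csv
from pathlib import Path


def top_by_column(csv_path, column="accent_f1", n=5):
    """Return the n rows with the largest numeric value in `column`."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; run `stage1_5 run` first")
    with path.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    if rows and column not in rows[0]:
        raise KeyError(f"column {column!r} not in {sorted(rows[0])}")
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)[:n]


# Example:
# for row in top_by_column("artifacts/analysis/metrics.csv"):
#     print(row)
```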

In [None]:
from IPython.display import Image, display
display(Image('artifacts/analysis/figures/accent_f1.png'))
display(Image('artifacts/analysis/figures/leakage.png'))

## 10. View GO/NOGO report

In [None]:
from pathlib import Path
from IPython.display import Markdown, display
report_path = Path('report/stage1_5_report.md')
if report_path.exists():
    display(Markdown(report_path.read_text(encoding='utf-8')))
else:
    print('Report not found, ensure the pipeline ran successfully.')

## 11. (Optional) Sync artifacts back to Drive

In [None]:
# Example: copy metrics/report to Drive folder
# !cp -r artifacts /content/drive/MyDrive/stage1_5_artifacts
# !cp -r report /content/drive/MyDrive/stage1_5_report
print('Sync commands commented out by default.')
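
The commented `cp` commands can also be expressed in Python so that missing folders are skipped instead of raising an error. A minimal sketch; `sync_dirs` is a hypothetical helper, and the destination in the example is a placeholder Drive folder that assumes Drive is already mounted.

```python
import shutil
from pathlib import Path


def sync_dirs(sources, dest_root):
    """Copy each existing source directory into dest_root; return copied names."""
    dest_root = Path(dest_root)
    copied = []
    for src in map(Path, sources):
        if src.is_dir():
            # dirs_exist_ok lets repeated syncs overwrite earlier copies.
            shutil.copytree(src, dest_root / src.name, dirs_exist_ok=True)
            copied.append(src.name)
    return copied


# Example (placeholder destination; requires a mounted Drive):
# sync_dirs(["artifacts", "report"], "/content/drive/MyDrive/stage1_5_outputs")
```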