# Stage 1.5 — Colab Runner
This notebook automates the Stage 1.5 latent separability audit (Accent × Speaker) inside Google Colab.

**Pipeline overview**
1. (Optional) Mount Google Drive to access private datasets/checkpoints.
2. Clone this repository (or pull from your fork).
3. Install dependencies with `pip install -e .[dev]`.
4. Ensure `data/manifest.jsonl` and referenced audio files exist.
5. Run the feature extractors (acoustic, ECAPA, SSL, backbone).
6. Execute `stage1_5 run` to train probes, compute leakage/RSA/CKA, and render the GO/NOGO report.
7. Download artifacts (`artifacts/analysis`, `report/`) or sync back to Drive.

## 1. Runtime diagnostics

In [None]:
!nvidia-smi || echo 'GPU not available (OK for CPU-only runs)'
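
The same check can be done from Python so that later cells can reuse the result, for example for the `--device` flag of the ECAPA extractor. This is a minimal sketch: `detect_device` and `DEVICE` are hypothetical names, not part of the repository, and the check only confirms that `nvidia-smi` runs, not that any particular framework can use the GPU.

```python
import shutil
import subprocess


def detect_device() -> str:
    """Return 'cuda' if nvidia-smi is on PATH and exits cleanly, else 'cpu'."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "cpu"
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"
    return "cuda" if result.returncode == 0 else "cpu"


# Hypothetical variable for later cells to reuse.
DEVICE = detect_device()
print("Device:", DEVICE)
```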

## 2. (Optional) Mount Google Drive
If your dataset or checkpoints live on Drive, mount it now. Skip if you plan to upload files manually.

In [None]:
from google.colab import drive
MOUNT_DRIVE = False  # set to True if you want to mount Drive
if MOUNT_DRIVE:
    drive.mount('/content/drive')

## 3. Repository + dataset configuration
Set the repository URL/branch you want to run. Update paths if your manifest or audio live elsewhere (e.g., in Drive).

In [None]:
from pathlib import Path

REPO_URL = 'https://github.com/<your-org>/<repo>.git'  # TODO: update
BRANCH = 'main'                                        # e.g., 'main' or 'stage1_5'
WORKDIR = Path('/content/stage1_5')
DATA_ROOT = WORKDIR / 'data'                           # adjust if mounting from Drive
MANIFEST_PATH = DATA_ROOT / 'manifest.jsonl'
CONFIG_PATH = Path('config/stage1_5.yaml')             # relative to WORKDIR

print('Repo:', REPO_URL)
print('Branch:', BRANCH)
print('Working dir:', WORKDIR)
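
Because `REPO_URL` ships with a `<your-org>/<repo>` placeholder, it may help to fail early if the placeholder was never replaced. The helper below is a sketch; `check_repo_url` is a hypothetical name, and the example URL is illustrative only.

```python
def check_repo_url(url: str) -> str:
    """Return the URL unchanged, or raise if it still contains <...> placeholders."""
    if "<" in url or ">" in url:
        raise ValueError(f"REPO_URL still contains a placeholder: {url}")
    return url


# Illustrative URL; in the notebook you would pass REPO_URL instead.
print(check_repo_url("https://github.com/example-org/example-repo.git"))
```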

## 4. Clone / refresh the project

In [None]:
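import shutil

# Leave WORKDIR before deleting it, so that re-running this cell
# does not remove the current working directory.
%cd /content
if WORKDIR.exists():
    shutil.rmtree(WORKDIR)
!git clone -b $BRANCH $REPO_URL $WORKDIR
%cd $WORKDIR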
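
The cell above always deletes the checkout and clones again, which also discards anything stored under the working directory. If you would rather refresh an existing checkout, one possible approach is sketched below. `sync_commands` is a hypothetical helper that only builds the git command lines; it assumes the remote is named `origin` and that a fast-forward pull is acceptable.

```python
from pathlib import Path


def sync_commands(workdir: Path, repo_url: str, branch: str) -> list[list[str]]:
    """Build git commands: update an existing checkout, or clone a fresh one."""
    if (workdir / ".git").is_dir():
        return [
            ["git", "-C", str(workdir), "fetch", "origin"],
            ["git", "-C", str(workdir), "checkout", branch],
            ["git", "-C", str(workdir), "pull", "--ff-only", "origin", branch],
        ]
    return [["git", "clone", "-b", branch, repo_url, str(workdir)]]


# Illustrative values; in the notebook you would pass WORKDIR, REPO_URL, BRANCH.
for cmd in sync_commands(Path("/content/stage1_5"), "https://example.invalid/repo.git", "main"):
    print(" ".join(cmd))
```

Each command list can then be executed with `subprocess.run(cmd, check=True)`.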

## 5. Install dependencies

In [None]:
!pip install -q -U pip
!pip install -q -e .[dev]
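
After installation, a quick way to confirm that the `stage1_5` command is reachable is to look it up on `PATH`. This is a small sketch using only the standard library; `find_cli` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_cli(name: str) -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


path = find_cli("stage1_5")
print(path if path else "stage1_5 not found on PATH; check the install step")
```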

## 6. (Optional) Sync dataset files
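Upload or copy your audio and manifest into `data/`. If the files are already in place (e.g., synced from Drive), leave the copy command commented out; the cell then only verifies that the manifest exists and previews its first entries.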

In [None]:
# Example: copy dataset from Drive
# !cp -r /content/drive/MyDrive/stage1_5_data/* $DATA_ROOT

if not MANIFEST_PATH.exists():
    raise FileNotFoundError(f'Manifest not found: {MANIFEST_PATH}. Upload or copy it before continuing.')

print('Manifest entries preview:')
!head -n 5 $MANIFEST_PATH
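
The cell above only checks the manifest itself. To also check the audio files it references, something like the following could be used. This is a sketch under stated assumptions: the manifest schema is not shown here, so the field name `audio_path` is hypothetical, as is the idea that relative paths resolve against a single root directory. Adjust both to match your manifest.

```python
import json
from pathlib import Path


def missing_audio(manifest: Path, root: Path, key: str = "audio_path") -> list[str]:
    """Return the audio paths listed in a JSONL manifest that do not exist on disk.

    `key` is an assumed field name; relative paths are resolved against `root`.
    """
    missing = []
    with manifest.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            rel = entry.get(key)
            if rel is None:
                missing.append(f"<no '{key}' field>")
            elif not (root / rel).is_file():
                missing.append(str(rel))
    return missing


# In the notebook, for example:
# print(missing_audio(MANIFEST_PATH, WORKDIR))
```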

## 7. Feature extraction
Only the acoustic extractor is enabled by default; uncomment the other commands you need. You may run the extractors separately to reuse cached features.

In [None]:
%%bash
set -euo pipefail
# Acoustic features
stage1_5 features acoustic data/manifest.jsonl artifacts/features/acoustic

# ECAPA embeddings (set device to 'cuda' if GPU is available)
# stage1_5 features ecapa data/manifest.jsonl artifacts/features/ecapa --device cuda

# SSL features (HuBERT/WavLM via s3prl)
# stage1_5 features ssl data/manifest.jsonl artifacts/features/ssl --model wavlm_large

# Backbone hooks (requires synthetic manifest + text prompts)
# stage1_5 features backbone gen/manifest_syn.jsonl data/texts.json artifacts/features/backbone \
#     --checkpoint your-org/tts-backbone --layers encoder_out block_08 decoder_pre_vocoder
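
To see which extractors have already produced output before re-running them, you could inspect the output directories used in the commands above. This is a rough sketch: `has_cached_features` is a hypothetical helper, and a non-empty directory does not prove that extraction finished, so treat the result as a hint only.

```python
from pathlib import Path


def has_cached_features(out_dir: Path) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    return out_dir.is_dir() and any(out_dir.iterdir())


# Directory names taken from the extraction commands above.
for name in ("acoustic", "ecapa", "ssl", "backbone"):
    out_dir = Path("artifacts/features") / name
    status = "present" if has_cached_features(out_dir) else "missing or empty"
    print(f"{name}: {status}")
```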

## 8. Run Stage 1.5 pipeline

In [None]:
!stage1_5 run $CONFIG_PATH
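
A `!` shell command does not raise a Python exception when the command fails, so with "Run all" the later cells continue against missing or stale artifacts. One alternative is to launch the pipeline through `subprocess` with `check=True`. The sketch below uses a hypothetical `run_checked` helper and demonstrates it with a harmless command.

```python
import subprocess
import sys


def run_checked(cmd: list[str]) -> int:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    completed = subprocess.run(cmd, check=True)
    return completed.returncode


# Demonstration with a harmless command. In the notebook this might be:
# run_checked(["stage1_5", "run", str(CONFIG_PATH)])
run_checked([sys.executable, "-c", "print('ok')"])
```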

## 9. Inspect metrics & figures

In [None]:
import pandas as pd
metrics = pd.read_csv('artifacts/analysis/metrics.csv')
metrics.sort_values('accent_f1', ascending=False).head()

In [None]:
from IPython.display import Image, display
display(Image('artifacts/analysis/figures/accent_f1.png'))
display(Image('artifacts/analysis/figures/leakage.png'))
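
The two file names above are hard-coded, so the cell fails if either figure was not rendered. A more tolerant variant is to list whatever PNG files exist in the figures directory. This is a sketch; `list_figures` is a hypothetical helper.

```python
from pathlib import Path


def list_figures(fig_dir: Path) -> list[Path]:
    """Return the PNG files in fig_dir in sorted order, or [] if the directory is absent."""
    return sorted(fig_dir.glob("*.png")) if fig_dir.is_dir() else []


for png in list_figures(Path("artifacts/analysis/figures")):
    print(png)  # in the notebook: display(Image(str(png)))
```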

## 10. View GO/NOGO report

In [None]:
from IPython.display import Markdown
report_path = Path('report/stage1_5_report.md')
if report_path.exists():
    display(Markdown(report_path.read_text()))
else:
    print('Report not found. Ensure the pipeline ran successfully.')

## 11. (Optional) Sync artifacts back to Drive

In [None]:
# Example: copy metrics/report to Drive folder
# !cp -r artifacts /content/drive/MyDrive/stage1_5_artifacts
# !cp -r report /content/drive/MyDrive/stage1_5_report
print('Sync commands commented out by default.')
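
If you prefer to download the results instead of syncing to Drive, one option is to zip a directory first. The sketch below uses `shutil.make_archive`; `archive_dir` is a hypothetical helper, and the destination should sit outside the directory being archived so the zip does not try to include itself.

```python
import shutil
from pathlib import Path


def archive_dir(src: Path, dest_base: Path) -> Path:
    """Zip the contents of src into '<dest_base>.zip' and return the archive path."""
    if not src.is_dir():
        raise FileNotFoundError(f"Nothing to archive: {src}")
    return Path(shutil.make_archive(str(dest_base), "zip", root_dir=src))


# In Colab, the archive could then be downloaded, for example:
# from google.colab import files
# files.download(str(archive_dir(Path("report"), Path("/content/stage1_5_report"))))
```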