# 🎯 Stage 1.5 - Latent Separability Audit

## Latent Separability Audit (Accent × Speaker) in TTS Backbones

**Hardware**: Google Colab L4 GPU (24 GB VRAM)  
**Estimated time**: ~2-3 hours for the full pipeline  
**Author**: OpenCode Research Lab

---

### 📋 What this notebook does:

1. ✅ **Setup**: Installs dependencies and clones the repository
2. ✅ **Dataset**: Downloads/prepares audio data (3 Brazilian regions)
3. ✅ **Features**: Extracts representations from the TTS backbone (Qwen3-TTS)
4. ✅ **Probes**: Trains linear classifiers (Accent × Speaker)
5. ✅ **Analysis**: Generates heatmaps and a report with a GO/NOGO decision

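### ⚡ Optimizations for the L4:

- **Mixed precision**: bfloat16 (halves the memory used by weights and activations relative to float32)
- **Flash Attention**: fused attention kernels. The PyPI `flash-attn` package installed below provides FlashAttention-2, which supports the L4 (Ada); FlashAttention-3 targets Hopper GPUs
- **Batch processing**: the pipeline runs in batches
- **Cache management**: frees VRAM between steps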
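
The 50% figure follows from the storage size of each format. A back-of-the-envelope sketch for the weights alone, assuming roughly 1.7 billion parameters (read off the `1.7B` in the checkpoint name used later in this notebook) and ignoring activations, caches, and framework overhead:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 1.7e9  # assumption: taken from the "1.7B" in the checkpoint name

fp32_gb = weight_memory_gb(N_PARAMS, 4)  # float32 stores 4 bytes per parameter
bf16_gb = weight_memory_gb(N_PARAMS, 2)  # bfloat16 stores 2 bytes per parameter

print(f"float32 weights:  {fp32_gb:.1f} GB")
print(f"bfloat16 weights: {bf16_gb:.1f} GB")
print(f"saving: {1 - bf16_gb / fp32_gb:.0%}")
```

Actual VRAM use during extraction will be higher than the weight footprint, so treat these numbers as a lower bound.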

---

### 🚀 Quick Start:

1. **Runtime** → Change runtime type → **L4 GPU**
2. Run all cells in order
3. Wait ~2-3 h for the run to complete
4. Download `report/stage1_5_report.md` at the end


---

## 🔧 Part 1: Setup and Installation

Installs all dependencies and applies the critical fixes.


In [None]:
# Check the available GPU
!nvidia-smi -L
!nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits

import torch
print(f"\n🎮 GPU: {torch.cuda.get_device_name(0)}")
print(f"💾 VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
print(f"✅ CUDA: {torch.version.cuda}")
print(f"✅ PyTorch: {torch.__version__}")

In [None]:
%%time
# Install dependencies (~5 min)
print("📦 Installing dependencies...\n")

# Install the PyTorch build for CUDA 12.1
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install Qwen3-TTS
!pip install -q -U qwen-tts

# Install FlashAttention (the PyPI flash-attn package ships FlashAttention-2 kernels)
!pip install -q flash-attn --no-build-isolation

# Stage 1.5 dependencies
!pip install -q numpy scipy pandas scikit-learn librosa praat-parselmouth soundfile
!pip install -q speechbrain transformers datasets huggingface-hub sentence-transformers
!pip install -q hydra-core omegaconf tqdm typer pyyaml jinja2
!pip install -q matplotlib seaborn

print("\n✅ Dependencies installed!")

In [None]:
%%time
# Clone the Stage 1.5 repository (~30 s)
import os

REPO_URL = "https://github.com/seu-usuario/stage1_5.git"  # ⚠️ EDIT: point this at your repo

if not os.path.exists("/content/stage1_5"):
    print(f"📥 Cloning {REPO_URL}...")
    !git clone -q {REPO_URL} /content/stage1_5
    print("✅ Repository cloned!")
else:
    print("✅ Repository already exists")

%cd /content/stage1_5

In [None]:
%%time
# Apply critical fixes (~10 s)
print("🔧 Applying critical fixes...\n")

# Fix 1: widen the return type hint in the adapter
!sed -i 's/def prepare_inputs(self, entry: ManifestEntry, text: str) -> Dict\[str, torch.Tensor\]:/def prepare_inputs(self, entry: ManifestEntry, text: str) -> Dict[str, Any]:/' stage1_5/backbone/huggingface.py

# Fix 2: forward method (replace the whole method)
fix_forward = '''
    def forward(self, inputs: Dict[str, Any]) -> torch.Tensor:
        """Execute forward pass. For Qwen3-TTS, inputs contains generation params."""
        with torch.no_grad():
            if self._model_type == "qwen3_tts":
                mode = inputs.get("mode", "custom_voice")
                if mode == "custom_voice":
                    self.model.generate_custom_voice(
                        text=inputs["text"],
                        language=inputs.get("language", "Portuguese"),
                        speaker=inputs.get("speaker", "ryan"),
                        instruct=inputs.get("instruct"),
                        non_streaming_mode=True,
                        max_new_tokens=inputs.get("max_new_tokens", 256),
                    )
                    return torch.empty(0)
                raise ValueError(f"Unsupported qwen3_tts mode: {mode}")
            return self.model(**inputs)
'''

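# Apply via Python (safer than sed)
with open("stage1_5/backbone/huggingface.py", "r") as f:
    content = f.read()

# Replace the forward method. A callable replacement inserts the new
# source literally (no backslash or group-reference processing).
import re
content, n_subs = re.subn(
    r'def forward\(self, inputs\):.*?return self\.model\(\*\*inputs\)',
    lambda _match: fix_forward.strip(),
    content,
    flags=re.DOTALL
)
if n_subs == 0:
    print("⚠️  Fix 2 not applied: forward() pattern not found")

with open("stage1_5/backbone/huggingface.py", "w") as f:
    f.write(content)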

# Fix 3: Layer resolution (add the method if it does not exist)
fix_resolve = '''
    def resolve_layer(self, alias: str) -> Optional[torch.nn.Module]:
        """Resolve layer alias to module."""
        if self._model_type != "qwen3_tts":
            modules = dict(self.model.named_modules())
            return modules.get(alias)
        
        modules = dict(self.model.named_modules())
        if alias in modules:
            return modules[alias]
        
        aliases_map = {
            "text_encoder_out": "talker.text_projection",
            "pre_vocoder": "talker.codec_head",
        }
        
        if alias in aliases_map:
            candidate = aliases_map[alias]
            if candidate in modules:
                return modules[candidate]
        
        if alias.startswith("decoder_block_"):
            suffix = alias.split("decoder_block_", 1)[1]
            if suffix.isdigit():
                idx = int(suffix)
                candidate = f"talker.model.layers.{idx}"
                if candidate in modules:
                    return modules[candidate]
        
        return None
'''

if "def resolve_layer" not in content:
    # Insert the method right after the class header
    content = content.replace(
        "\nclass HuggingFaceBackboneAdapter:",
        f"\nclass HuggingFaceBackboneAdapter:\n{fix_resolve}"
    )
    with open("stage1_5/backbone/huggingface.py", "w") as f:
        f.write(content)

print("✅ Fixes applied!")
print("✅ Installing package...")

# Install in editable mode
!pip install -q -e .

print("\n🎉 Setup complete!")
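
A regex patch is only useful if its pattern actually matches the target file. A minimal sketch of a patch helper that reports how many substitutions happened, run here on a toy source string. `apply_patch`, `TOY_SOURCE`, and `NEW_FORWARD` are hypothetical names for illustration, not part of the `stage1_5` package:

```python
import re

# Toy stand-in for the adapter source; not the real file contents.
TOY_SOURCE = '''class Adapter:
    def forward(self, inputs):
        return self.model(**inputs)
'''

NEW_FORWARD = '''def forward(self, inputs):
        with torch.no_grad():
            return self.model(**inputs)'''


def apply_patch(source: str, pattern: str, replacement: str) -> tuple[str, int]:
    """Replace every match of `pattern` and return (new_source, match_count)."""
    # A callable replacement inserts the text literally, so backslashes in
    # the new code are not interpreted as regex escapes.
    return re.subn(pattern, lambda _m: replacement, source, flags=re.DOTALL)


patched, count = apply_patch(
    TOY_SOURCE,
    r"def forward\(self, inputs\):.*?return self\.model\(\*\*inputs\)",
    NEW_FORWARD,
)
print(f"substitutions: {count}")

# A pattern that matches nothing returns the source unchanged with count 0.
_, missed = apply_patch(TOY_SOURCE, r"def backward\(self\):", "pass")
print(f"substitutions for a non-matching pattern: {missed}")
```

Checking the count before writing the file back makes a stale pattern visible instead of leaving the unpatched code in place.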

---

## 📊 Part 2: Dataset Preparation

Downloads and prepares an audio dataset covering 3 Brazilian regions.


In [None]:
%%time
# Option 1: use a public dataset (recommended)
# Download Common Voice (Brazilian Portuguese) or similar

import os

# ⚠️ EDIT: URL of your dataset (ZIP with audio files + metadata.csv)
DATASET_URL = "https://exemplo.com/dataset_stage1_5.zip"

# Option 2: use a synthetic dataset for smoke tests
USE_SYNTHETIC = True  # Set to False to use the real dataset

if USE_SYNTHETIC:
    print("🧪 Generating synthetic dataset for testing...\n")
    
    # Create the directory structure
    !mkdir -p data/wav data/splits
    
    # Generate synthetic audio
    import numpy as np
    import soundfile as sf
    
    accents = ["NE", "SE", "S"]
    speakers_per_accent = 3
    texts_per_speaker = 10
    
    sr = 16000
    duration = 2
    
    manifest_rows = []
    text_rows = []
    
    for accent in accents:
        for spk_idx in range(speakers_per_accent):
            speaker = f"spk{accent}{spk_idx:02d}"
            spk_dir = f"data/wav/{speaker}"
            os.makedirs(spk_dir, exist_ok=True)
            
            for text_idx in range(texts_per_speaker):
                text_id = f"t{text_idx:02d}"
                utt_id = f"{speaker}_{accent}_{text_id}"
                
                # Generate audio (white noise carries no accent or speaker
                # information, so this data only smoke-tests the pipeline)
                audio = np.random.randn(sr * duration).astype(np.float32) * 0.1
                wav_path = f"{spk_dir}/{text_id}.wav"
                sf.write(wav_path, audio, sr)
                
                # Add to the manifest
                manifest_rows.append({
                    "utt_id": utt_id,
                    "path": wav_path,
                    "speaker": speaker,
                    "accent": accent,
                    "text_id": text_id,
                    "source": "real"
                })
                
                # Add the text once per text_id
                if all(row["text_id"] != text_id for row in text_rows):
                    text_rows.append({
                        "text_id": text_id,
                        "text": f"Texto de exemplo número {text_idx} para teste de separabilidade"
                    })
    
    # Save the manifest
    import json
    with open("data/manifest.jsonl", "w", encoding="utf-8") as f:
        for row in manifest_rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    
    # Save the texts
    with open("data/texts.json", "w", encoding="utf-8") as f:
        json.dump(text_rows, f, ensure_ascii=False, indent=2)
    
    print(f"✅ Synthetic dataset created:")
    print(f"   - {len(manifest_rows)} utterances")
    print(f"   - {len(accents)} accents: {accents}")
    print(f"   - {len(accents) * speakers_per_accent} speakers")
    print(f"   - {len(text_rows)} unique texts")

else:
    print(f"📥 Downloading dataset from {DATASET_URL}...\n")
    !mkdir -p data
    !wget -q {DATASET_URL} -O data/dataset.zip
    !unzip -q data/dataset.zip -d data/
    
    # Build the manifest from the CSV
    # ⚠️ EDIT: adjust to match your dataset's structure
    !stage1_5 dataset build-manifest \
        data/metadata.csv \
        --audio-root data/wav \
        --output data/manifest.jsonl
    
    print("✅ Dataset downloaded and manifest created!")

# Inspect the dataset
!head -3 data/manifest.jsonl
!wc -l data/manifest.jsonl
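
Because the accent probe is evaluated speaker-disjoint, every accent needs at least two speakers so that one can be held out. A minimal sketch of that sanity check, run on a few hypothetical manifest lines that reuse the field names from the synthetic manifest (the helper names are illustrative, not part of the `stage1_5` CLI):

```python
import json
from collections import defaultdict

# Hypothetical manifest lines using the same fields as the synthetic manifest.
SAMPLE_LINES = [
    '{"utt_id": "spkNE00_NE_t00", "speaker": "spkNE00", "accent": "NE", "text_id": "t00"}',
    '{"utt_id": "spkNE01_NE_t00", "speaker": "spkNE01", "accent": "NE", "text_id": "t00"}',
    '{"utt_id": "spkSE00_SE_t00", "speaker": "spkSE00", "accent": "SE", "text_id": "t00"}',
    '{"utt_id": "spkSE01_SE_t01", "speaker": "spkSE01", "accent": "SE", "text_id": "t01"}',
    '{"utt_id": "spkS00_S_t00", "speaker": "spkS00", "accent": "S", "text_id": "t00"}',
]


def speakers_by_accent_from_lines(lines):
    """Map each accent to the set of speakers that appear with it."""
    speakers = defaultdict(set)
    for line in lines:
        row = json.loads(line)
        speakers[row["accent"]].add(row["speaker"])
    return dict(speakers)


def accents_lacking_speakers(speakers_by_accent, minimum=2):
    """Accents with too few speakers to hold one out for evaluation."""
    return sorted(a for a, s in speakers_by_accent.items() if len(s) < minimum)


by_accent = speakers_by_accent_from_lines(SAMPLE_LINES)
for accent, speakers in sorted(by_accent.items()):
    print(f"{accent}: {len(speakers)} speakers")
print("too few speakers:", accents_lacking_speakers(by_accent))
```

To check a real manifest, pass the lines of `data/manifest.jsonl` instead of `SAMPLE_LINES`.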

---

## 🔬 Part 3: Feature Extraction

Extracts features from multiple representations:
- Acoustic (MFCC, F0, speaking rate)
- ECAPA (speaker embeddings)
- SSL (WavLM)
- **Backbone (Qwen3-TTS)** ← main focus


In [None]:
%%time
# 3.1 Acoustic features (~2 min for 100 utterances)
print("🎵 Extracting acoustic features...\n")

!stage1_5 features acoustic \
    data/manifest.jsonl \
    artifacts/features/acoustic \
    --sample-rate 16000

print("\n✅ Acoustic features extracted!")
!ls -lh artifacts/features/acoustic/ | head -5

In [None]:
%%time
# 3.2 ECAPA embeddings (~3 min for 100 utterances)
print("🎙️ Extracting ECAPA embeddings...\n")

!stage1_5 features ecapa \
    data/manifest.jsonl \
    artifacts/features/ecapa \
    --device cuda

print("\n✅ ECAPA embeddings extracted!")
!ls -lh artifacts/features/ecapa/ | head -5

In [None]:
%%time
# 3.3 SSL features (~10 min for 100 utterances)
print("🌐 Extracting SSL features (WavLM)...\n")

!stage1_5 features ssl \
    data/manifest.jsonl \
    artifacts/features/ssl \
    --model wavlm_large \
    --layers 0 6 12 18 24 \
    --device cuda \
    --torch-dtype bfloat16 \
    --pooling mean

print("\n✅ SSL features extracted!")
!ls -lh artifacts/features/ssl/ | head -5
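
The `--pooling mean` option presumably collapses the frame axis of each layer output into one fixed-size vector per utterance. A minimal sketch of that operation on plain Python lists, assuming a (frames × dims) layout; `mean_pool` is an illustrative name, not the CLI's implementation:

```python
def mean_pool(frames):
    """Average a (num_frames x dim) list of lists over the frame axis."""
    if not frames:
        raise ValueError("cannot pool an empty sequence")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


# Two utterances of different lengths pool to vectors of the same size.
short = [[1.0, 2.0], [3.0, 4.0]]
long = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0], [2.0, 3.0]]

print(mean_pool(short))
print(mean_pool(long))
```

Pooling to a fixed size is what lets a linear probe consume utterances of different durations.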

In [None]:
# Clear the CUDA cache before loading the backbone
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

print(f"💾 Free VRAM: {torch.cuda.mem_get_info()[0] / 1e9:.1f}GB")

In [None]:
%%time
# 3.4 Backbone features (Qwen3-TTS) - MAIN STEP
# ~30-60 min for 100 utterances (depends on the GPU)

print("🧠 Extracting backbone features (Qwen3-TTS)...\n")
print("⚠️  This step can take a while. Progress will be shown.\n")

# Qwen3-TTS checkpoint
CHECKPOINT = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"

# Layers to extract
LAYERS = [
    "text_encoder_out",
    "decoder_block_04",
    "decoder_block_08",
    "decoder_block_12",
    "pre_vocoder"
]

!stage1_5 features backbone \
    data/manifest.jsonl \
    data/texts.json \
    artifacts/features/backbone \
    --checkpoint {CHECKPOINT} \
    --layers {" ".join(LAYERS)} \
    --device cuda \
    --dtype bfloat16 \
    --attn-implementation flash-attn3 \
    --generation-mode custom_voice \
    --generation-language Portuguese \
    --generation-speaker ryan \
    --generation-max-new-tokens 256 \
    --pooling mean \
    --strict False

print("\n✅ Backbone features extracted!")
print("\n📊 Inspecting features:")
!ls -lh artifacts/features/backbone/ | head -5

# Inspect one feature file
from pathlib import Path
import numpy as np
sample_feat = sorted(Path("artifacts/features/backbone").glob("*.npz"))[0]
data = np.load(sample_feat)
print(f"\n📦 Sample feature: {sample_feat.name}")
print(f"   Layers: {data.files}")
print(f"   Shapes: {dict((k, data[k].shape) for k in data.files)}")
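
The layer aliases passed to `--layers` are mapped to module names by the `resolve_layer` patch from Part 1. A standalone sketch of that lookup, operating on a set of names instead of a loaded model. `MODULE_NAMES` is a hypothetical subset for illustration; real names would come from `model.named_modules()`:

```python
from typing import Optional

# Fixed aliases, copied from the resolve_layer patch in Part 1.
ALIASES = {
    "text_encoder_out": "talker.text_projection",
    "pre_vocoder": "talker.codec_head",
}

# Hypothetical subset of module names, for illustration only.
MODULE_NAMES = {
    "talker.text_projection",
    "talker.codec_head",
    "talker.model.layers.4",
    "talker.model.layers.8",
}


def resolve_alias(alias: str, module_names: set) -> Optional[str]:
    """Return the module name an alias points to, or None if unresolved."""
    if alias in module_names:  # already a concrete module name
        return alias
    candidate = ALIASES.get(alias)
    if candidate is None and alias.startswith("decoder_block_"):
        suffix = alias[len("decoder_block_"):]
        if suffix.isdigit():
            # int() drops the zero padding: "04" -> 4
            candidate = f"talker.model.layers.{int(suffix)}"
    return candidate if candidate in module_names else None


for alias in ["text_encoder_out", "decoder_block_04", "decoder_block_12"]:
    print(f"{alias} -> {resolve_alias(alias, MODULE_NAMES)}")
```

An alias that resolves to `None` means no matching module exists in the model, so it is worth checking the resolved names before a long extraction run.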

---

## 🎯 Part 4: Probe Training

Trains linear classifiers for:
- **Accent separability** (speaker-disjoint)
- **Speaker identification**
- **Leakage tests** (A→S, S→A)
- **Text robustness**
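
In a speaker-disjoint evaluation, no speaker appears in both the training and the test split, so an accent probe cannot score well by memorizing speaker identity. A minimal sketch of such a split, assuming one held-out speaker per accent; the rows and the function name are hypothetical and the pipeline's own splitting logic may differ:

```python
import random
from collections import defaultdict

# Hypothetical (utterance, speaker, accent) rows for illustration.
ROWS = [
    (f"{spk}_t{t:02d}", spk, accent)
    for accent, spks in {"NE": ["spkNE00", "spkNE01", "spkNE02"],
                         "SE": ["spkSE00", "spkSE01", "spkSE02"]}.items()
    for spk in spks
    for t in range(2)
]


def speaker_disjoint_split(rows, test_speakers_per_accent=1, seed=0):
    """Hold out whole speakers per accent so no speaker is in both splits."""
    rng = random.Random(seed)
    speakers_by_accent = defaultdict(set)
    for _utt, speaker, accent in rows:
        speakers_by_accent[accent].add(speaker)

    held_out = set()
    for accent in sorted(speakers_by_accent):
        pool = sorted(speakers_by_accent[accent])
        held_out.update(rng.sample(pool, test_speakers_per_accent))

    train = [r for r in rows if r[1] not in held_out]
    test = [r for r in rows if r[1] in held_out]
    return train, test


train, test = speaker_disjoint_split(ROWS)
train_speakers = {r[1] for r in train}
test_speakers = {r[1] for r in test}
print(f"train: {len(train)} utterances, {len(train_speakers)} speakers")
print(f"test:  {len(test)} utterances, {len(test_speakers)} speakers")
print("overlap:", train_speakers & test_speakers)
```

Sampling per accent keeps every accent represented in the test split, which a plain random split over speakers would not guarantee.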


In [None]:
# Clear the CUDA cache
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

print(f"💾 Free VRAM: {torch.cuda.mem_get_info()[0] / 1e9:.1f}GB")

In [None]:
%%time
# Run the full probe + analysis pipeline
# ~10-20 min

print("🔬 Running the probe and analysis pipeline...\n")

!stage1_5 run config/stage1_5.yaml

print("\n✅ Pipeline complete!")

---

## 📊 Part 5: Results Analysis

Displays the metrics, heatmaps, and GO/NOGO decision.


In [None]:
# Show metrics
import pandas as pd

metrics = pd.read_csv("artifacts/analysis/metrics.csv")

print("📊 SEPARABILITY METRICS\n")
print("=" * 80)

# Show the top 5 layers by accent F1
top5 = metrics.sort_values("accent_f1", ascending=False).head(5)

print("\n🏆 Top 5 Layers by Accent F1 (speaker-disjoint):\n")
print(top5[[
    "label",
    "accent_f1",
    "speaker_acc",
    "leakage_a2s",
    "leakage_s2a",
    "accent_text_drop"
]].to_string(index=False))

print("\n" + "=" * 80)

# Overall statistics
print("\n📈 Overall Statistics:\n")
print(f"   - Best Accent F1: {metrics['accent_f1'].max():.3f}")
print(f"   - Best Speaker Acc: {metrics['speaker_acc'].max():.3f}")
print(f"   - Min Leakage A→S: {metrics['leakage_a2s'].min():.3f}")
print(f"   - Min Text Drop: {metrics['accent_text_drop'].min():.3f}")

# Compare backbone vs SSL vs acoustic vs ECAPA
print("\n🔍 Comparison by Feature Type:\n")
for feature_type in ["backbone", "ssl", "acoustic", "ecapa"]:
    subset = metrics[metrics["label"].str.startswith(feature_type)]
    if len(subset) > 0:
        print(f"   {feature_type.upper():12s}: F1={subset['accent_f1'].max():.3f} (max)")
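
The `accent_f1` column is an F1 score over the accent classes. This notebook does not state which averaging is used, so the sketch below assumes macro averaging (the unweighted mean of per-class F1) and uses made-up labels purely to show the arithmetic:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only.
y_true = ["NE", "NE", "SE", "SE", "S", "S"]
y_pred = ["NE", "SE", "SE", "SE", "S", "NE"]

print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Here the per-class scores are 0.5 (NE), 0.8 (SE), and 2/3 (S), so every accent contributes equally to the average regardless of how many utterances it has.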

In [None]:
# Display heatmaps
import os
from IPython.display import Image, display

print("🎨 HEATMAPS\n")
print("=" * 80)

heatmaps = [
    ("Accent F1", "artifacts/analysis/figures/accent_f1.png"),
    ("Leakage", "artifacts/analysis/figures/leakage.png"),
    ("Text Robustness", "artifacts/analysis/figures/accent_text_robustness.png"),
]

for title, path in heatmaps:
    if os.path.exists(path):
        print(f"\n📊 {title}:\n")
        display(Image(filename=path))
    else:
        print(f"⚠️  {title} not found at {path}")

In [None]:
# Show the final report
print("📄 FINAL REPORT\n")
print("=" * 80)

with open("report/stage1_5_report.md", "r") as f:
    report = f.read()

# Extract the decision
import re
decision_match = re.search(r"\*\*Decision:\*\* (.+)", report)
rationale_match = re.search(r"\*\*Rationale:\*\* (.+)", report)

if decision_match:
    decision = decision_match.group(1)
    print(f"\n🎯 DECISION: {decision}\n")

if rationale_match:
    rationale = rationale_match.group(1)
    print(f"📝 Rationale: {rationale}\n")

print("\n" + "=" * 80)
print("\n📄 Full report saved to: report/stage1_5_report.md")
print("\n💾 To download, use the Colab file browser (folder icon on the left)")

---

## 💾 Part 6: Downloading Results

Downloads all generated artifacts.


In [None]:
%%time
# Package the results
print("📦 Packaging results...\n")

!zip -r -q stage1_5_results.zip \
    report/ \
    artifacts/analysis/ \
    artifacts/probes/ \
    config/

# Archive statistics
import os
size_mb = os.path.getsize("stage1_5_results.zip") / 1e6
print(f"✅ Results packaged: stage1_5_results.zip ({size_mb:.1f} MB)")

# Download via Colab
from google.colab import files
print("\n📥 Downloading file...")
files.download("stage1_5_results.zip")

### (Optional) Sync with Google Drive


In [None]:
# Sync with Google Drive (optional)
from google.colab import drive

drive.mount('/content/drive')

# Copy the results
!mkdir -p /content/drive/MyDrive/stage1_5_results
!cp -r report/ /content/drive/MyDrive/stage1_5_results/
!cp -r artifacts/analysis/ /content/drive/MyDrive/stage1_5_results/
!cp stage1_5_results.zip /content/drive/MyDrive/

print("✅ Results synced to Google Drive!")
print("📂 Location: MyDrive/stage1_5_results/")

---

## 🧹 Part 7: Cleanup (Optional)

Frees disk space and VRAM.


In [None]:
# Remove intermediate features (keep only the analysis)
!rm -rf artifacts/features/
!rm -rf data/wav/

# Clear the Python and CUDA caches
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

# Disk usage statistics
!df -h /content

print("\n✅ Cleanup complete!")

---

## 🔧 Troubleshooting

### Error: CUDA out of memory

**Solution 1**: Reduce the batch size
```python
# Edit config/stage1_5.yaml
# batch_size: 4  # reduce to 2 or 1
```

**Solution 2**: Use float16 instead of bfloat16. Both formats take 2 bytes per value, so expect little or no memory saving from this switch alone
```bash
!stage1_5 features backbone ... --dtype float16
```

**Solution 3**: Process in smaller batches
```bash
# Split the manifest into chunks
!split -l 20 data/manifest.jsonl data/manifest_chunk_
# Process each chunk separately
```
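
The same chunking can be done in Python, which keeps the `.jsonl` extension on each chunk. A minimal sketch, demonstrated on a throwaway manifest in a temporary directory; `split_manifest` is a hypothetical helper, not part of the `stage1_5` CLI:

```python
import json
import tempfile
from pathlib import Path


def split_manifest(manifest_path, out_dir, chunk_size=20):
    """Write consecutive chunk_size-line slices of a JSONL manifest to out_dir."""
    lines = Path(manifest_path).read_text(encoding="utf-8").splitlines(keepends=True)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    for start in range(0, len(lines), chunk_size):
        chunk_path = out_dir / f"manifest_chunk_{start // chunk_size:03d}.jsonl"
        chunk_path.write_text("".join(lines[start:start + chunk_size]), encoding="utf-8")
        chunk_paths.append(chunk_path)
    return chunk_paths


# Demo on a throwaway 45-line manifest.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest.jsonl"
    manifest.write_text(
        "".join(json.dumps({"utt_id": f"utt{i:03d}"}) + "\n" for i in range(45)),
        encoding="utf-8",
    )
    chunks = split_manifest(manifest, Path(tmp) / "chunks", chunk_size=20)
    chunk_sizes = [len(p.read_text(encoding="utf-8").splitlines()) for p in chunks]
    print(chunk_sizes)
```

Each chunk file can then be passed to the feature-extraction command in place of `data/manifest.jsonl`.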

### Error: Flash Attention not available

```bash
# Replace the flash-attn3 value with eager
!stage1_5 features backbone ... --attn-implementation eager
```

### Error: Qwen-TTS not installed

```bash
!pip install -U qwen-tts
```

### Enable detailed logs

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

---

## 📚 References

- **PRD**: `/content/stage1_5/PRD.md`
- **GATE**: `/content/stage1_5/GATE_1_5.md`
- **README**: `/content/stage1_5/README.md`
- **Fixes**: `/mnt/user-data/outputs/IMPLEMENTATION_GUIDE.md`

---

## 🎉 Done!

The Stage 1.5 pipeline is complete. Next steps:

1. ✅ Review `report/stage1_5_report.md`
2. ✅ Check the GO/NOGO decision
3. ✅ If GO → proceed to Stage 2 (LoRA training)
4. ✅ If NOGO → adjust the dataset/backbone as recommended

**Questions?** See the documentation or open an issue on GitHub.
