# Accents-PT-BR — Dataset Pipeline + HuggingFace Publication

**Project:** Explicit Regional Accent Control in pt-BR  
**Goal:** Build the derived dataset **Accents-PT-BR** (CORAA-MUPE + Common Voice PT),  
run the full validation pipeline (confounds, splits), and publish it to the HuggingFace Hub.  
**Config:** `configs/accent_classifier.yaml` (single source of truth).  

**Sections:**
1. Environment setup
2. Dataset pipeline (CORAA-MUPE + Common Voice + confounds + splits)
3. Detailed distribution analysis
4. Building the HuggingFace Dataset
5. Publishing to the HuggingFace Hub
6. Verification

This notebook is the **orchestration layer**. All logic lives in `src/` (testable, auditable).  
The notebook only: installs dependencies → configures the environment → calls modules → publishes the result.

## 1. Environment Setup

In [1]:
# Bootstrap: clone repo, install deps, check NumPy ABI.
# This module uses only stdlib — safe to import before pip install.
# On first Colab run, this cell may restart the runtime once (NumPy ABI fix).
from src.utils.notebook_bootstrap import bootstrap
bootstrap()

ModuleNotFoundError: No module named 'src'

In [None]:
import yaml, json, logging
from pathlib import Path
from collections import Counter
from datetime import datetime

import numpy as np
import pandas as pd

# Platform-aware persistent cache setup
from src.utils.platform import detect_platform, setup_environment
from src.utils.seed import set_global_seed
from src.utils.git import get_commit_hash
from src.data.manifest import compute_file_hash

# Load config — single source of truth for all experiment parameters
with open('configs/accent_classifier.yaml') as f:
    config = yaml.safe_load(f)

platform = detect_platform()
setup_environment(platform)

SEED = config['seed']['global']
generator = set_global_seed(SEED)

logging.basicConfig(
    level=logging.INFO,
    format='%(name)s - %(levelname)s - %(message)s',
)

# Drive cache base directory — platform-aware
DRIVE_BASE = platform.cache_base
DRIVE_BASE.mkdir(parents=True, exist_ok=True)

print(f'Platform: {platform.name}')
print(f'Config: {config["experiment"]["name"]}')
print(f'Seed: {SEED}')
print(f'Cache: {DRIVE_BASE}')

## 2. Dataset Pipeline (CORAA-MUPE + Common Voice + Confounds + Splits)

Runs the full pipeline via `src.data.pipeline.load_or_build_accents_dataset()`:
1. Load/build the CORAA-MUPE manifest (cache-aware)
2. Load/build the Common Voice PT manifest (cache-aware)
3. Combine the manifests (collision validation, speaker→accent consistency)
4. Confound analysis (accent × gender, duration, source)
5. Speaker-disjoint splits (verified automatically)

The same function is used by the classifier notebook, so the pipeline logic is not duplicated (DRY).
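
Steps 3 and 5 above amount to two invariants: each speaker maps to exactly one accent, and no speaker appears in more than one split. The sketch below shows one way such checks could be written. It is not the code in `src/`; the function names, speaker IDs, and accent labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_speaker_overlap(split_speakers):
    """Return {(split_a, split_b): shared_speakers} for every overlapping pair."""
    overlaps = {}
    for a, b in combinations(sorted(split_speakers), 2):
        shared = set(split_speakers[a]) & set(split_speakers[b])
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


def find_inconsistent_speakers(pairs):
    """Return speakers that appear with more than one accent label."""
    accents_by_speaker = defaultdict(set)
    for speaker_id, accent in pairs:
        accents_by_speaker[speaker_id].add(accent)
    return {s: a for s, a in accents_by_speaker.items() if len(a) > 1}


# Toy data (hypothetical, not taken from the real manifests).
splits = {
    "train": ["spk1", "spk2", "spk3"],
    "val": ["spk4"],
    "test": ["spk5", "spk2"],  # spk2 leaks from train into test
}
overlap = find_speaker_overlap(splits)
print(overlap)

pairs = [("spk1", "NE"), ("spk1", "NE"), ("spk2", "SE"), ("spk2", "S")]
inconsistent = find_inconsistent_speakers(pairs)
print(inconsistent)
```

An empty result from both functions is what a valid combined manifest and split would produce; anything else names the offending speakers directly.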

In [None]:
from src.data.pipeline import load_or_build_accents_dataset

bundle = load_or_build_accents_dataset(config, DRIVE_BASE)

combined_entries = bundle.combined_entries
split_info = bundle.split_info
split_entries = bundle.split_entries
confound_results = bundle.confound_results
combined_sha256 = bundle.combined_sha256

train_entries = split_entries['train']
val_entries = split_entries['val']
test_entries = split_entries['test']

# Accent distribution per split
for split_name, entries_list in split_entries.items():
    accent_dist = Counter(e.accent for e in entries_list)
    print(f'  {split_name}: {dict(sorted(accent_dist.items()))}')

## 3. Detailed Distribution Analysis

Cross-tabulations and source distribution for documentation and dataset card.
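
For reference, the association between two categorical variables in a cross-tabulation is commonly summarized with a Pearson chi-square statistic and Cramér's V as the effect size. The sketch below computes both with the standard library only. Whether the pipeline's confound tests use exactly these statistics is an assumption; the notebook only shows that each result carries a `statistic`, a `p_value`, and an `effect_size`. The p-value is omitted here because it needs a chi-square distribution function.

```python
import math
from collections import Counter


def chi2_and_cramers_v(xs, ys):
    """Pearson chi-square statistic and Cramér's V for two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    # V = sqrt(chi2 / (n * (min(rows, cols) - 1))); defined as 0 for a 1-level variable.
    k = min(len(row_totals), len(col_totals)) - 1
    v = math.sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, v


# Toy labels (hypothetical): accent fully determines gender in the first case,
# and is independent of it in the second.
chi2_dep, v_dep = chi2_and_cramers_v(["A", "A", "B", "B"], ["F", "F", "M", "M"])
chi2_ind, v_ind = chi2_and_cramers_v(["A", "A", "B", "B"], ["F", "M", "F", "M"])
print(chi2_dep, v_dep)
print(chi2_ind, v_ind)
```

A V close to 1 means one variable nearly determines the other, which is the situation a confound check is meant to flag; a V close to 0 means the two are close to independent.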

In [None]:
# Cross-tabulations for dataset card
gender_table = pd.crosstab(
    [e.accent for e in combined_entries],
    [e.gender for e in combined_entries],
    margins=True,
)
print('=== ACCENT x GENDER ===')
print(gender_table)

source_table = pd.crosstab(
    [e.accent for e in combined_entries],
    [e.source for e in combined_entries],
    margins=True,
)
print('\n=== ACCENT x SOURCE ===')
print(source_table)

# Source distribution details
source_dist = bundle.source_distribution
print('\n=== SOURCE x ACCENT (detail) ===')
for src, counts in source_dist['source_x_accent'].items():
    print(f'  {src}: {dict(sorted(counts.items()))}')

# Confound summary for dataset card
confound_summary = [
    {
        'test': r.test_name,
        'variables': f'{r.variable_a} x {r.variable_b}',
        'statistic': r.statistic,
        'p_value': r.p_value,
        'effect_size': r.effect_size,
        'effect_size_name': r.effect_size_name,
        'is_blocking': r.is_blocking,
    }
    for r in confound_results
]

total_speakers = len({e.speaker_id for e in combined_entries})
total_entries = len(combined_entries)
total_duration_h = sum(e.duration_s for e in combined_entries) / 3600
print(f'\nTotal: {total_entries:,} entries, {total_speakers} speakers, {total_duration_h:.1f}h')

## 4. Building the HuggingFace Dataset

Converts the manifest entries into a `datasets.DatasetDict` with:
- `Audio()` feature (automatic decoding, 16 kHz)
- Metadata: `speaker_id`, `accent`, `gender`, `duration_s`, `source`, `birth_state`, `utt_id`, `text_id`
- Splits: `train`, `validation`, `test` (speaker-disjoint)
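
`Dataset.from_dict` expects a column-oriented mapping (one list per column), while the manifest is a list of row records, and the internal `val` split has to be renamed to `validation`. The sketch below illustrates both conversions with the standard library only. The `Entry` dataclass, its `audio_path` field, and the helper names are hypothetical stand-ins; the real helpers are `entries_to_hf_dict` and `to_hf_split_entries` in `src/data/hf_utils`, whose internals are not shown in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a manifest entry (reduced set of fields)."""
    utt_id: str
    audio_path: str
    speaker_id: str
    accent: str


# Assumed mapping from internal split names to the HuggingFace convention.
SPLIT_RENAMES = {"val": "validation"}


def rename_splits(split_entries):
    """Rename split keys, leaving names without a mapping untouched."""
    return {SPLIT_RENAMES.get(name, name): rows for name, rows in split_entries.items()}


def entries_to_columns(entries):
    """Turn a list of row records into a dict of equally long column lists."""
    return {
        "audio": [e.audio_path for e in entries],  # file paths; decoding happens later
        "utt_id": [e.utt_id for e in entries],
        "speaker_id": [e.speaker_id for e in entries],
        "accent": [e.accent for e in entries],
    }


rows = [
    Entry("u1", "audio/u1.wav", "spk1", "NE"),
    Entry("u2", "audio/u2.wav", "spk2", "SE"),
]
hf_splits = rename_splits({"train": rows, "val": [], "test": []})
columns = entries_to_columns(hf_splits["train"])
print(list(hf_splits))
print(columns["accent"])
```

Passing file paths in the `audio` column and letting the `Audio` feature decode them on access keeps the in-memory dictionary small, which matters for a dataset of many hours of speech.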

In [None]:
from datasets import Dataset, DatasetDict, Audio, Features, Value, ClassLabel
from src.data.hf_utils import entries_to_hf_dict, to_hf_split_entries

# Build ordered label lists for ClassLabel features
accent_labels = sorted({e.accent for e in combined_entries})
gender_labels = sorted({e.gender for e in combined_entries})
source_labels = sorted({e.source for e in combined_entries})

print(f'Accent classes: {accent_labels}')
print(f'Gender classes: {gender_labels}')
print(f'Source classes: {source_labels}')

# Define features schema
features = Features({
    'audio': Audio(sampling_rate=16_000),
    'utt_id': Value('string'),
    'speaker_id': Value('string'),
    'accent': ClassLabel(names=accent_labels),
    'gender': ClassLabel(names=gender_labels),
    'duration_s': Value('float32'),
    'source': ClassLabel(names=source_labels),
    'birth_state': Value('string'),
    'text_id': Value('string'),
})

# Convert internal split names ('val') to HuggingFace convention ('validation')
hf_split_entries = to_hf_split_entries(split_entries)

# Build DatasetDict with speaker-disjoint splits
dataset_dict = DatasetDict({
    name: Dataset.from_dict(
        entries_to_hf_dict(entries),
        features=features,
    )
    for name, entries in hf_split_entries.items()
})

print('\nDatasetDict created:')
print(dataset_dict)
for split_name, ds in dataset_dict.items():
    print(f'  {split_name}: {len(ds)} rows, columns={ds.column_names}')

# Verify a sample
sample = dataset_dict['train'][0]
print(f'\nSample (train[0]):')
print(f'  utt_id: {sample["utt_id"]}')
print(f'  speaker_id: {sample["speaker_id"]}')
print(f'  accent: {sample["accent"]}')
print(f'  gender: {sample["gender"]}')
print(f'  duration_s: {sample["duration_s"]:.2f}')
print(f'  source: {sample["source"]}')
print(f'  audio sample_rate: {sample["audio"]["sampling_rate"]}')
print(f'  audio array shape: {np.array(sample["audio"]["array"]).shape}')

## 5. Publishing to the HuggingFace Hub

Authenticates with HuggingFace, generates the dataset card with statistics, and publishes.  

**IMPORTANT:** The token needs `write` permission on the HuggingFace Hub.  
Generate a token at https://huggingface.co/settings/tokens.
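
The dataset card generated in this section is stamped with a SHA-256 of the combined manifest, so a reader can check which manifest a published version came from. A minimal sketch of a streaming file hash is shown below. It assumes the project's `compute_file_hash` behaves similarly, which this notebook does not confirm; `sha256_of_file` and the sample manifest line are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large manifests are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Toy manifest content (hypothetical).
payload = b'{"utt_id": "u1"}\n'
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest.jsonl"
    manifest.write_bytes(payload)
    file_hash = sha256_of_file(manifest)

# The streamed hash matches hashing the same bytes in one call.
matches_in_memory = file_hash == hashlib.sha256(payload).hexdigest()
print(len(file_hash), matches_in_memory)
```

Because the hash depends on every byte, regenerating the manifest with a different entry order or serialization yields a different digest even when the set of utterances is the same.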

In [None]:
from huggingface_hub import notebook_login

# Login — will prompt for token if not already cached
notebook_login()

In [None]:
# Generate dataset card using src/data/hf_utils (not inline f-string)
from src.data.hf_utils import build_dataset_card, to_hf_split_entries

commit_hash = get_commit_hash()

# Manifest hash
COMBINED_MANIFEST_PATH = DRIVE_BASE / 'accents_pt_br' / 'manifest.jsonl'
manifest_sha = compute_file_hash(COMBINED_MANIFEST_PATH) if COMBINED_MANIFEST_PATH.exists() else 'N/A'

DATASET_CARD = build_dataset_card(
    combined_entries=combined_entries,
    split_entries=to_hf_split_entries(split_entries),
    confound_summary=confound_summary,
    accent_labels=accent_labels,
    gender_labels=gender_labels,
    source_labels=source_labels,
    manifest_sha=manifest_sha,
    commit_hash=commit_hash,
    seed=SEED,
)

print(f'Dataset card generated ({len(DATASET_CARD)} chars)')
print(f'Manifest SHA-256: {manifest_sha[:16]}...')
print(f'Commit: {commit_hash[:8]}...')

In [None]:
# Publish dataset to HuggingFace Hub
# SAFETY: Set PUBLISH = True manually to upload. Prevents accidental publication on Run All.
PUBLISH = False

HF_REPO_ID = 'paulohenriquevn/accents-pt-br'

print(f'Target: https://huggingface.co/datasets/{HF_REPO_ID}')
print(f'Splits: {list(dataset_dict.keys())}')
print(f'Total rows: {sum(len(ds) for ds in dataset_dict.values()):,}')
print()

if not PUBLISH:
    print('PUBLISH = False — skipping upload. Set PUBLISH = True in this cell to publish.')
else:
    dataset_dict.push_to_hub(
        HF_REPO_ID,
        private=False,
    )
    print(f'\nDataset uploaded successfully!')
    print(f'URL: https://huggingface.co/datasets/{HF_REPO_ID}')

In [None]:
# Upload dataset card
import tempfile

if not PUBLISH:
    print('PUBLISH = False — skipping dataset card upload.')
else:
    from huggingface_hub import HfApi

    api = HfApi()

    with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False, encoding='utf-8') as f:
        f.write(DATASET_CARD)
        card_path = Path(f.name)

    api.upload_file(
        path_or_fileobj=str(card_path),
        path_in_repo='README.md',
        repo_id=HF_REPO_ID,
        repo_type='dataset',
    )
    card_path.unlink()  # Clean up temp file

    print(f'Dataset card uploaded to {HF_REPO_ID}')
    print(f'\n=== PUBLICATION COMPLETE ===')
    print(f'Dataset: https://huggingface.co/datasets/{HF_REPO_ID}')
    print(f'Utterances: {total_entries:,}')
    print(f'Speakers: {total_speakers}')
    print(f'Duration: {total_duration_h:.1f}h')
    print(f'Manifest SHA-256: {manifest_sha}')

## 6. Verification

Load back from HuggingFace Hub to verify row counts and audio decoding.
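
A row-count check is most useful when a mismatch stops the run instead of only being printed. The sketch below is one possible shape for that comparison, using plain dictionaries of counts. The function name and the numbers are hypothetical.

```python
def compare_row_counts(local_counts, remote_counts):
    """Return human-readable problems; an empty list means the counts agree."""
    problems = []
    for split in sorted(set(local_counts) | set(remote_counts)):
        local = local_counts.get(split)    # None when the split is missing locally
        remote = remote_counts.get(split)  # None when the split is missing on the Hub
        if local != remote:
            problems.append(f"{split}: local={local}, remote={remote}")
    return problems


# Toy counts (hypothetical).
local = {"train": 100, "validation": 10, "test": 12}
ok = compare_row_counts(local, dict(local))
bad = compare_row_counts(local, {"train": 100, "validation": 9})
print(ok)
print(bad)
```

Taking the union of split names means a split that is missing on either side is reported, not silently skipped. A caller could then run `assert not problems, problems` before printing a success message.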

In [None]:
from datasets import load_dataset
from src.data.hf_utils import INTERNAL_TO_HF_SPLITS

print(f'Verifying: loading {HF_REPO_ID} from Hub...')
ds_verify = load_dataset(HF_REPO_ID)

print(f'\nLoaded successfully:')
print(ds_verify)

for split_name_hf, split_ds in ds_verify.items():
    print(f'  {split_name_hf}: {len(split_ds)} rows')

# Reverse the internal→HF mapping for verification (HF name → internal name)
hf_to_internal = {v: k for k, v in INTERNAL_TO_HF_SPLITS.items()}

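# Verify row counts match; fail loudly so the final message only prints on success
mismatches = []
for hf_name, internal_name in hf_to_internal.items():
    local_count = len(split_entries[internal_name])
    remote_count = len(ds_verify[hf_name])
    match = 'OK' if local_count == remote_count else 'MISMATCH'
    if match == 'MISMATCH':
        mismatches.append(hf_name)
    print(f'  {hf_name}: local={local_count}, remote={remote_count} [{match}]')
assert not mismatches, f'Row count mismatch in splits: {mismatches}'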

# Verify a sample decodes correctly
sample = ds_verify['train'][0]
assert sample['audio']['sampling_rate'] == 16000, 'Sampling rate mismatch'
assert len(sample['audio']['array']) > 0, 'Empty audio'
print(f'\nSample verification PASSED (sr={sample["audio"]["sampling_rate"]}, '
      f'len={len(sample["audio"]["array"])})')

print(f'\n=== ALL VERIFICATIONS PASSED ===')