# Voice-Based Benign vs Malignant Laryngeal Disorder Classification
**Research-grade pipeline notebook — generated 2025-10-05 09:42**


> **Purpose**: Train and evaluate a robust model that classifies voice recordings into **benign** vs **malignant**, using
both **hand-crafted acoustic features** and a **CNN on spectrograms**.  
> **Key principles**: subject-wise splits (no leakage), reproducibility, thorough EDA, clinically relevant metrics, and transparent reporting.

### What this notebook includes
- Data ingestion from two folders: `data/normal-benign/` and `data/malignant/`
- Audio **chunking into fixed 5 s** windows (shorter recordings are zero-padded) with configurable overlap
- Feature extraction matching literature: **F0 stats, jitter, shimmer (APQ3/APQ5), CPP, LTAS bands, ZCR, spectral centroid/flatness, HNR proxy**
- Visualizations: distributions, violin plots, correlation heatmaps, **UMAP/PCA**, and **class separability** checks
- Two modeling tracks:
  1. **Classical ML on features** (LogReg / XGBoost / RF)
  2. **CNN on log-mel spectrograms** with **chunk-level** inference aggregated to **file-level**
- Evaluation: **AUROC, AUPRC, sensitivity/specificity**, confusion matrix, calibration, **learning curves**, **ablation** and **feature importance (SHAP)** for the classical model
- **Reproducibility**: fixed seeds, config cell, environment capture
- **Reporting**: model card, limitations, next steps

> **Install hint**: First run the `!pip install` cell below on your machine/environment.

# 🔧 Setup & Requirements

This notebook expects the following structure:
```
project_root/
  ├─ data/
  │   ├─ normal-benign/   # .wav/.mp3/etc.
  │   └─ malignant/
  └─ notebooks/ (optional)
```

In [None]:
# If running locally/Colab: install dependencies (uncomment if needed)
# Note: Some packages may need system libs on local machines.
# !pip install numpy pandas scipy librosa soundfile matplotlib scikit-learn umap-learn torch torchvision torchaudio tqdm praat-parselmouth shap xgboost einops

In [None]:
# ========= Configuration ========= #
from pathlib import Path

DATA_DIR = Path('data')
BENIGN_DIR = DATA_DIR/'normal-benign'
MALIGNANT_DIR = DATA_DIR/'malignant'

SAMPLE_RATE = 22050
CHUNK_SECONDS = 5.0
OVERLAP = 0.5                 # 50% overlap
MIN_SNR_DB = None             # set to e.g. 10 to filter low-SNR chunks, or None to disable
USE_VAD = True                # basic energy-gate VAD to drop silence
SEED = 42
N_JOBS = 4                    # parallel workers for feature extraction
LOGMEL_N_MELS = 128
LOGMEL_HOP = 512
LOGMEL_N_FFT = 2048
SPEC_AUG = True               # time/freq masking during CNN training
SUBJECT_FROM_FILENAME = True  # try to infer subject id from filename prefix before first '_' (customize below)

# CNN
BATCH_SIZE = 16
EPOCHS = 30
LR = 1e-3
WEIGHT_DECAY = 1e-4

# Cross-validation
N_FOLDS = 5                    # subject-wise StratifiedGroupKFold

In [None]:
# ========= Imports ========= #
import os, re, math, json, random, warnings
from pathlib import Path
from dataclasses import dataclass
import numpy as np
import pandas as pd
from tqdm import tqdm
import librosa, soundfile as sf
from scipy import signal
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedGroupKFold, train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score, confusion_matrix, roc_curve, precision_recall_curve, auc, brier_score_loss
from sklearn.decomposition import PCA
try:
    import umap
    HAS_UMAP = True
except Exception:
    HAS_UMAP = False

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from einops import rearrange

warnings.filterwarnings('ignore')
np.random.seed(SEED); random.seed(SEED); torch.manual_seed(SEED)
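
The reproducibility goals listed at the top mention environment capture. A minimal sketch of what that could look like, using only the standard library, is shown here. The function name `capture_environment` and the package list are illustrative assumptions, not part of the original notebook.

```python
import json
import platform
import sys
from importlib import metadata

def capture_environment(packages=("numpy", "pandas", "scipy", "librosa",
                                  "scikit-learn", "torch", "xgboost")):
    """Return Python/platform info plus installed versions of the given packages."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["packages"][name] = None  # not installed in this environment
    return info

env_info = capture_environment()
print(json.dumps(env_info, indent=2))
```

Saving this dictionary next to the results (for example with `json.dump`) makes it easier to tell later which library versions produced a given set of metrics.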

In [None]:
# ========= Utility helpers ========= #
def set_all_seeds(seed=42):
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def read_audio(path, sr):
    y, s = librosa.load(path, sr=sr, mono=True)
    return y, s

def vad_energy(y, frame_length=2048, hop_length=512, energy_db_thresh=-40):
    # Simple energy-based VAD: keep samples covered by at least one frame above the threshold (dB re full scale)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length).flatten()
    db = 20*np.log10(np.maximum(rms, 1e-10))
    active = db > energy_db_thresh
    # Expand frame decisions to samples; librosa centers frame i on sample i*hop_length by default
    mask = np.zeros_like(y, dtype=bool)
    half = frame_length // 2
    for i, a in enumerate(active):
        if a:
            center = i*hop_length
            mask[max(0, center-half):center+half] = True
    return y[mask] if mask.any() else y

def chunk_audio(y, sr, chunk_seconds=5.0, overlap=0.5):
    L = int(chunk_seconds*sr)
    H = int(L*(1-overlap))
    if len(y) < L:
        # pad with zeros
        pad = L - len(y)
        y = np.pad(y, (0, pad))
    starts = list(range(0, max(1, len(y)-L+1), H if H>0 else L))
    chunks = [y[s:s+L] for s in starts]
    return chunks

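def snr_db(y, frame_length=2048, hop_length=512):
    # Rough SNR estimate (heuristic assumption): the noise floor is taken as the
    # 10th-percentile frame power and the signal level as the mean frame power.
    # A plain mean(y**2)/var(y) ratio would be ~0 dB for any zero-mean signal.
    if len(y)==0:
        return -np.inf
    power = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length).flatten()**2
    s_power = np.mean(power)
    n_power = np.percentile(power, 10)
    if n_power <= 1e-12:
        return 100.0
    return 10*np.log10(s_power / n_power)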
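
The chunking arithmetic in `chunk_audio` is easy to check by hand. The sketch below mirrors its window length and hop computation to show how many chunks a recording yields and how much of the tail is left uncovered. The helper name `chunk_layout` is hypothetical and exists only for this illustration.

```python
def chunk_layout(n_samples, sr=22050, chunk_seconds=5.0, overlap=0.5):
    """Return (n_chunks, uncovered_tail_samples) using the same rule as chunk_audio."""
    L = int(chunk_seconds * sr)          # window length in samples
    H = int(L * (1 - overlap)) or L      # hop; fall back to L if overlap rounds the hop to 0
    if n_samples < L:
        return 1, 0                      # a short recording is zero-padded to one chunk
    n_chunks = (n_samples - L) // H + 1
    covered = (n_chunks - 1) * H + L     # last sample index reached by the final chunk
    return n_chunks, n_samples - covered

# 12 s recording at 22050 Hz with 5 s windows and 50% overlap
print(chunk_layout(12 * 22050))  # -> (3, 44100): three chunks, last 2 s not covered
# 3 s recording: one zero-padded chunk
print(chunk_layout(3 * 22050))   # -> (1, 0)
```

Under these settings the trailing part of a recording that does not fill a full window is dropped, which may matter for short clinical recordings.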

In [None]:
# ========= Scan dataset ========= #
from glob import glob

def infer_subject_id(fname:str):
    # Customize: infer subject from filename, e.g., "ID123_session1.wav" -> "ID123"
    base = Path(fname).stem
    m = re.match(r'^([^_]+)', base)
    return m.group(1) if m else base

def list_files():
    files = []
    for label, d in [('benign', BENIGN_DIR), ('malignant', MALIGNANT_DIR)]:
        for ext in ('*.wav','*.mp3','*.flac','*.m4a','*.ogg'):
            for f in glob(str(d/ext)):
                files.append((f, label))
    df = pd.DataFrame(files, columns=['path','label'])
    if SUBJECT_FROM_FILENAME:
        df['subject'] = df['path'].apply(lambda p: infer_subject_id(p))
    else:
        df['subject'] = df['path'].apply(lambda p: Path(p).parent.name + '_' + Path(p).stem)
    return df

df_files = list_files()
print(f"Found {len(df_files)} files (benign={sum(df_files.label=='benign')}, malignant={sum(df_files.label=='malignant')})")
df_files.head()
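
Because subject IDs are inferred from filename prefixes, it is worth confirming that the inferred IDs behave as expected before splitting. The sketch below flags any subject that appears under both labels, which usually points to a filename prefix that is not a real subject ID. The function name `subjects_with_mixed_labels` and the demo data are assumptions for illustration.

```python
from collections import defaultdict

def subjects_with_mixed_labels(pairs):
    """pairs: iterable of (subject, label). Return subjects that carry more than one label."""
    seen = defaultdict(set)
    for subject, label in pairs:
        seen[subject].add(label)
    return {s: sorted(labels) for s, labels in seen.items() if len(labels) > 1}

demo = [("ID1", "benign"), ("ID1", "benign"), ("ID2", "malignant"),
        ("ID3", "benign"), ("ID3", "malignant")]
mixed = subjects_with_mixed_labels(demo)
print(mixed)  # -> {'ID3': ['benign', 'malignant']}
```

On the real table this could be called as `subjects_with_mixed_labels(zip(df_files['subject'], df_files['label']))`. An empty result does not prove the IDs are correct, but a non-empty one is a strong hint that `infer_subject_id` needs customizing.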

## Feature extraction (hand-crafted)
We compute F0 stats, jitter, shimmer (APQ3/APQ5), CPP, LTAS bands, spectral centroid, flatness, ZCR, and an HNR proxy.
Jitter, shimmer, and CPP are computed with Praat via parselmouth for robustness.
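
For intuition about what the perturbation measures capture, local jitter is the mean absolute difference between consecutive glottal periods divided by the mean period, and local shimmer applies the same ratio to cycle peak amplitudes. The sketch below implements that textbook definition on a plain sequence. It is not a replacement for Praat, which additionally excludes implausible periods, so values will not match exactly; the function name `local_perturbation` is hypothetical.

```python
import numpy as np

def local_perturbation(values):
    """Mean absolute consecutive difference divided by the mean (jitter on periods, shimmer on amplitudes)."""
    v = np.asarray(values, dtype=float)
    if v.size < 2 or np.mean(v) == 0:
        return float("nan")
    return float(np.mean(np.abs(np.diff(v))) / np.mean(v))

# Periods alternating between 10 ms and 11 ms
periods = [0.010, 0.011, 0.010, 0.011]
print(f"{local_perturbation(periods):.4f}")  # mean |diff| = 0.001, mean period = 0.0105

# A perfectly regular voice has zero jitter
print(local_perturbation([0.010] * 5))  # -> 0.0
```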

In [None]:
# If parselmouth fails to import, ensure it is installed
try:
    import parselmouth
    from parselmouth import praat
    HAS_PSM = True
except Exception as e:
    HAS_PSM = False
    print("[WARN] parselmouth unavailable. Jitter/Shimmer/CPP will be approximated or skipped.")

def praat_jitter_shimmer_cpp(y, sr):
    out = {k: np.nan for k in ['jitter_local','jitter_rap','shimmer_local','shimmer_apq3','shimmer_apq5','cpp']}
    if not HAS_PSM:
        return out
    call = parselmouth.praat.call
    snd = parselmouth.Sound(np.asarray(y, dtype=np.float64), sampling_frequency=sr)
    try:
        # Glottal pulse marks by cross-correlation, pitch search range 75-500 Hz
        pp = call(snd, "To PointProcess (periodic, cc)", 75, 500)
        # Jitter is queried on the PointProcess alone. Arguments: time range (0, 0 = whole sound),
        # shortest period (s), longest period (s), maximum period factor.
        out['jitter_local'] = call(pp, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
        out['jitter_rap']   = call(pp, "Get jitter (rap)",   0, 0, 0.0001, 0.02, 1.3)
        # Shimmer needs Sound + PointProcess; the extra argument is the maximum amplitude factor.
        out['shimmer_local'] = call([snd, pp], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
        out['shimmer_apq3']  = call([snd, pp], "Get shimmer (apq3)",  0, 0, 0.0001, 0.02, 1.3, 1.6)
        out['shimmer_apq5']  = call([snd, pp], "Get shimmer (apq5)",  0, 0, 0.0001, 0.02, 1.3, 1.6)
    except Exception:
        pass  # jitter/shimmer stay NaN for this chunk
    try:
        # Smoothed cepstral peak prominence (CPPS, dB) from a PowerCepstrogram.
        # Assumption: this argument list matches the Praat version bundled with parselmouth;
        # adjust it if the call raises.
        cepstrogram = call(snd, "To PowerCepstrogram", 60, 0.002, 5000, 50)
        out['cpp'] = call(cepstrogram, "Get CPPS", False, 0.01, 0.001, 60, 330, 0.05,
                          "Parabolic", 0.001, 0, "Straight", "Robust")
    except Exception:
        pass  # CPP stays NaN for this chunk
    return out

In [None]:
def compute_basic_features(y, sr):
    # F0
    f0 = librosa.yin(y, fmin=75, fmax=500, sr=sr)
    f0 = f0[np.isfinite(f0)]
    f0_stats = {
        'f0_mean': float(np.mean(f0)) if f0.size else np.nan,
        'f0_median': float(np.median(f0)) if f0.size else np.nan,
        'f0_min': float(np.min(f0)) if f0.size else np.nan,
        'f0_max': float(np.max(f0)) if f0.size else np.nan,
        'f0_range': float((np.max(f0)-np.min(f0))) if f0.size else np.nan,
        'f0_sd': float(np.std(f0)) if f0.size else np.nan
    }
    # Spectral features
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    flatness = librosa.feature.spectral_flatness(y=y).mean()
    zcr = librosa.feature.zero_crossing_rate(y).mean()
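    # HNR proxy: harmonic energy over percussive energy via HPSS on the STFT
    # (librosa.decompose.hpss accepts a spectrogram; librosa.effects.hpss expects a waveform)
    H, P = librosa.decompose.hpss(librosa.stft(y))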
    harm = np.sum(np.abs(H)**2); perc = np.sum(np.abs(P)**2)
    hnr_proxy = 10*np.log10((harm + 1e-9)/(perc + 1e-9))
    return {
        **f0_stats,
        'spectral_centroid': float(centroid),
        'spectral_flatness': float(flatness),
        'zcr': float(zcr),
        'hnr_proxy_db': float(hnr_proxy)
    }

def compute_ltas_bands(y, sr, bands=((0,1000),(1000,2000),(2000,4000),(4000,8000))):
    # Welch PSD, then the share of total power that falls in each band (in percent)
    from scipy.integrate import trapezoid  # np.trapz is deprecated in NumPy 2.x
    freqs, Pxx = signal.welch(y, sr, nperseg=4096)
    total = trapezoid(Pxx, freqs) + 1e-12
    out = {}
    for (a,b) in bands:
        mask = (freqs>=a) & (freqs<b)
        band_energy = trapezoid(Pxx[mask], freqs[mask])
        out[f'ltas_{a//1000}-{b//1000}k_pct'] = float(100.0*band_energy/total)
    return out

def extract_handcrafted_features_for_chunk(y, sr):
    jitter_shimmer_cpp = praat_jitter_shimmer_cpp(y, sr)
    basic = compute_basic_features(y, sr)
    ltas = compute_ltas_bands(y, sr)
    snr = snr_db(y)
    return {**basic, **jitter_shimmer_cpp, **ltas, 'snr_db': snr}

def extract_features_from_file(path, label, subject):
    y, sr = read_audio(path, SAMPLE_RATE)
    if USE_VAD:
        y = vad_energy(y)
    chunks = chunk_audio(y, sr, CHUNK_SECONDS, OVERLAP)
    feats = []
    for i, ch in enumerate(chunks):
        if MIN_SNR_DB is not None and snr_db(ch) < MIN_SNR_DB:
            continue
        d = extract_handcrafted_features_for_chunk(ch, sr)
        d.update({'path': str(path), 'chunk_idx': i, 'label': label, 'subject': subject})
        feats.append(d)
    return feats

In [None]:
# Run feature extraction (this may take a while)
rows = []
for _,r in tqdm(df_files.iterrows(), total=len(df_files)):
    rows.extend(extract_features_from_file(r['path'], r['label'], r['subject']))
df_feats = pd.DataFrame(rows)
print(df_feats.shape)
df_feats.head()

## Exploratory Data Analysis
- Class balance at **file** and **chunk** level
- Distribution plots (e.g., `cpp`, `shimmer_apq3`, `shimmer_apq5`, `f0_max`, `ltas_0-1k_pct`)
- Correlations and redundancy
- PCA/UMAP overview
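
As a complement to the plots, a standardized effect size gives a quick numeric view of class separability per feature. The sketch below computes Cohen's d with a pooled standard deviation. The function name `cohens_d` and the toy inputs are illustrative assumptions. Chunks from the same subject are not independent, so it is safer to average features per subject before computing it.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two samples using the pooled standard deviation; NaNs are ignored."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a[np.isfinite(a)], b[np.isfinite(b)]
    if a.size < 2 or b.size < 2:
        return float("nan")
    pooled_var = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / (a.size + b.size - 2)
    if pooled_var == 0:
        return float("nan")
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Toy example: means 2 and 4, both with unit sample variance
print(cohens_d([1, 2, 3], [3, 4, 5]))  # -> -2.0
```

With the feature table this could be applied as `cohens_d(df_feats.loc[df_feats.label=='benign', 'cpp'], df_feats.loc[df_feats.label=='malignant', 'cpp'])`.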

In [None]:
# Class balance
print('Files ->', df_files.label.value_counts())
print('Chunks ->', df_feats.label.value_counts())

# Simple distributions with Matplotlib
def plot_hist_by_class(feature, bins=40):
    plt.figure(figsize=(6,4))
    for lab in ['benign','malignant']:
        vals = df_feats[df_feats.label==lab][feature].dropna().values
        plt.hist(vals, bins=bins, alpha=0.5, label=lab, density=True)
    plt.title(feature); plt.legend(); plt.xlabel(feature); plt.ylabel('density'); plt.show()

for f in ['cpp','shimmer_apq3','shimmer_apq5','f0_max','ltas_0-1k_pct']:
    if f in df_feats.columns:
        plot_hist_by_class(f)

# Correlation heatmap (matplotlib only)
num_cols = [c for c in df_feats.columns if df_feats[c].dtype!=object and c not in ['chunk_idx']]
corr = df_feats[num_cols].corr().fillna(0)
plt.figure(figsize=(10,8))
plt.imshow(corr, aspect='auto')
plt.colorbar(); plt.title('Feature Correlation'); plt.xticks(range(len(num_cols)), num_cols, rotation=90); plt.yticks(range(len(num_cols)), num_cols); plt.tight_layout(); plt.show()

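# PCA/UMAP projection (features are z-scored first so that large-scale features such as
# spectral centroid in Hz do not dominate; all-NaN columns are dropped before imputation)
feat = df_feats[num_cols].dropna(axis=1, how='all')
X = StandardScaler().fit_transform(feat.fillna(feat.median()))
y = (df_feats['label']=='malignant').astype(int).values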
try:
    from sklearn.decomposition import PCA
    Xp = PCA(n_components=2).fit_transform(X)
    plt.figure(figsize=(6,5))
    plt.scatter(Xp[:,0], Xp[:,1], c=y, alpha=0.6)
    plt.title('PCA (features)'); plt.xlabel('PC1'); plt.ylabel('PC2'); plt.show()
except Exception as e:
    print('[WARN] PCA failed:', e)

if HAS_UMAP:
    Xu = umap.UMAP(random_state=SEED).fit_transform(X)
    plt.figure(figsize=(6,5))
    plt.scatter(Xu[:,0], Xu[:,1], c=y, alpha=0.6)
    plt.title('UMAP (features)'); plt.xlabel('U1'); plt.ylabel('U2'); plt.show()

## Spectrograms & CNN
We create a log-mel spectrogram for each **chunk** and train a compact CNN. During evaluation, chunk
probabilities are aggregated back to the **file** (or **subject**) level by averaging them (`aggregate_by`); majority voting is a possible alternative.
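
Majority voting can be sketched as follows: each chunk votes positive if its probability reaches a threshold, and the file or subject is called positive when at least half of its chunks vote positive. The function name `aggregate_majority`, the 0.5 threshold, and the tie rule (ties count as positive) are assumptions made for this illustration.

```python
import numpy as np

def aggregate_majority(probs, keys, threshold=0.5):
    """Return {key: (fraction_of_positive_chunks, majority_label)}; ties count as positive."""
    probs = np.asarray(probs, dtype=float)
    keys = np.asarray(keys)
    out = {}
    for k in np.unique(keys):
        votes = probs[keys == k] >= threshold   # one boolean vote per chunk
        frac = float(votes.mean())
        out[str(k)] = (frac, int(frac >= 0.5))
    return out

demo = aggregate_majority([0.9, 0.4, 0.6, 0.2, 0.3],
                          ["a.wav", "a.wav", "a.wav", "b.wav", "b.wav"])
print(demo)  # a.wav: 2 of 3 chunks positive -> label 1; b.wav: 0 of 2 -> label 0
```

The vote fraction is coarser than a mean probability, so ROC curves built from it have fewer distinct thresholds, which is one reason to prefer mean-probability aggregation for AUROC.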

In [None]:
def chunk_to_logmel(y, sr, n_mels=LOGMEL_N_MELS, n_fft=LOGMEL_N_FFT, hop_length=LOGMEL_HOP):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels, power=2.0)
    S_db = librosa.power_to_db(S, ref=np.max)
    return S_db.astype(np.float32)

class SpectrogramDataset(Dataset):
    def __init__(self, df, augment=False):
        self.df = df.reset_index(drop=True)
        self.augment = augment
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        y, sr = read_audio(r['path'], SAMPLE_RATE)
        if USE_VAD: y = vad_energy(y)
        chunks = chunk_audio(y, sr, CHUNK_SECONDS, OVERLAP)
        ch = chunks[int(r['chunk_idx'])]
        spec = chunk_to_logmel(ch, sr)
        # Normalize per-spec
        spec = (spec - spec.mean()) / (spec.std() + 1e-6)
        spec = np.expand_dims(spec, 0)  # (1, n_mels, T)
        if self.augment:
            # simple SpecAugment: random time/freq masking
            if np.random.rand()<0.5:
                t = spec.shape[-1]; w = int(0.1*t); start = np.random.randint(0, max(1,t-w))
                spec[:,:,start:start+w] = 0
            if np.random.rand()<0.5:
                f = spec.shape[-2]; w = int(0.1*f); start = np.random.randint(0, max(1,f-w))
                spec[:,start:start+w,:] = 0
        label = 1 if r['label']=='malignant' else 0
        return torch.tensor(spec), torch.tensor(label, dtype=torch.long), r['subject'], r['path']

In [None]:
class SmallCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2,2)
        self.dropout = nn.Dropout(0.3)
        self.head = nn.Linear(64* (LOGMEL_N_MELS//8) * ( (int(CHUNK_SECONDS*SAMPLE_RATE/LOGMEL_HOP)+1)//8 ), 128)
        self.out = nn.Linear(128, n_classes)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.flatten(1)
        x = F.relu(self.head(x))
        x = self.dropout(x)
        return self.out(x)
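
The input size of the `head` layer depends on the spectrogram shape after three 2x2 poolings. The sketch below reproduces that arithmetic in plain Python so the size can be checked when the configuration changes. It assumes librosa's default centered framing (one frame per hop plus one) and chunks of exactly `CHUNK_SECONDS`; the helper name `flatten_dim` is hypothetical.

```python
def flatten_dim(n_mels=128, sr=22050, chunk_seconds=5.0, hop=512, n_pools=3, channels=64):
    """Return (n_frames, flattened feature size) after n_pools 2x2 max-pool layers."""
    n_frames = int(chunk_seconds * sr) // hop + 1   # centered framing: 1 + floor(n_samples / hop)
    h, w = n_mels, n_frames
    for _ in range(n_pools):
        h, w = h // 2, w // 2                        # MaxPool2d(2, 2) floors odd sizes
    return n_frames, channels * h * w

print(flatten_dim())  # -> (216, 27648): 64 channels * 16 mel bins * 27 time steps
```

If `CHUNK_SECONDS`, `SAMPLE_RATE`, `LOGMEL_HOP`, or `LOGMEL_N_MELS` change, the linear layer size changes with them, so a model checkpoint trained under one configuration will not load under another.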

In [None]:
def train_one_epoch(model, loader, opt, device):
    model.train(); losses=[]; correct=0; total=0
    for xb, yb, _, _ in loader:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        logits = model(xb)
        loss = F.cross_entropy(logits, yb)
        loss.backward()
        opt.step()
        losses.append(loss.item())
        correct += (logits.argmax(1)==yb).sum().item()
        total += yb.size(0)
    return np.mean(losses), correct/total

@torch.no_grad()
def evaluate(model, loader, device):
    model.eval(); probs=[]; ys=[]; subjects=[]; paths=[] 
    for xb, yb, subj, p in loader:
        xb = xb.to(device)
        logits = model(xb)
        pr = torch.softmax(logits, dim=1)[:,1].cpu().numpy()
        probs.extend(pr); ys.extend(yb.numpy().tolist()); subjects.extend(list(subj)); paths.extend(list(p))
    df = pd.DataFrame({'prob': probs, 'y': ys, 'subject': subjects, 'path': paths})
    return df

In [None]:
def aggregate_by(df_chunk, key='path'):
    # mean probability per key (path or subject)
    agg = df_chunk.groupby(key)['prob'].mean().reset_index()
    y = df_chunk.groupby(key)['y'].first().values
    return agg['prob'].values, y

def evaluate_operating_points(y_true, y_prob, name=''):
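    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)
    prec, rec, _ = precision_recall_curve(y_true, y_prob)
    # Average precision (step-wise sum) avoids the optimistic linear interpolation of a trapezoidal PR area
    ap = average_precision_score(y_true, y_prob)
    print(f"{name} AUROC={roc_auc:.3f} | AUPRC={ap:.3f}")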
    # Plot ROC
    plt.figure(figsize=(5,4))
    plt.plot(fpr,tpr); plt.plot([0,1],[0,1],'--'); plt.xlabel('FPR'); plt.ylabel('TPR'); plt.title(f'ROC {name}'); plt.show()
    # Plot PR
    plt.figure(figsize=(5,4))
    plt.plot(rec,prec); plt.xlabel('Recall'); plt.ylabel('Precision'); plt.title(f'PR {name}'); plt.show()
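
Sensitivity and specificity at a chosen threshold are listed among the evaluation metrics. A minimal sketch of how they could be computed from aggregated probabilities is given below. The function name `sens_spec_at_threshold` and the default threshold of 0.5 are assumptions; in practice the threshold should be chosen on training or validation data, not on the test fold.

```python
import numpy as np

def sens_spec_at_threshold(y_true, y_prob, threshold=0.5):
    """Confusion counts plus sensitivity and specificity at a fixed probability threshold."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob, dtype=float) >= threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp, "sensitivity": sens, "specificity": spec}

res = sens_spec_at_threshold([1, 1, 1, 0, 0, 0, 0],
                             [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4])
print(res)  # tp=2, fn=1, tn=3, fp=1 -> sensitivity 2/3, specificity 0.75
```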

In [None]:
# ========= Cross-validation (subject-wise) ========= #
from sklearn.utils.class_weight import compute_class_weight

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)

# Prepare fold splits at FILE level (grouped by subject), then expand to chunk rows via merge
gkf = StratifiedGroupKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

file_labels = (df_files['label']=='malignant').astype(int).values
groups = df_files['subject'].values

fold = 0
results = []
for train_idx, test_idx in gkf.split(df_files, file_labels, groups):
    fold += 1
    tr_files = df_files.iloc[train_idx]
    te_files = df_files.iloc[test_idx]
    tr_chunks = df_feats.merge(tr_files[['path']], on='path', how='inner')
    te_chunks = df_feats.merge(te_files[['path']], on='path', how='inner')
    print(f"\n=== Fold {fold}: train chunks={len(tr_chunks)} | test chunks={len(te_chunks)} ===")

    # Classical ML on features
    X_cols = [c for c in df_feats.columns if c not in ['path','chunk_idx','label','subject'] and df_feats[c].dtype!=object]
    scaler = StandardScaler().fit(tr_chunks[X_cols].fillna(tr_chunks[X_cols].median()))
    Xtr = scaler.transform(tr_chunks[X_cols].fillna(tr_chunks[X_cols].median()))
    Xte = scaler.transform(te_chunks[X_cols].fillna(tr_chunks[X_cols].median()))
    ytr = (tr_chunks['label']=='malignant').astype(int).values
    yte = (te_chunks['label']=='malignant').astype(int).values

    from xgboost import XGBClassifier
    clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05, subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0, random_state=SEED)
    clf.fit(Xtr, ytr)

    # Chunk-level and file-level eval
    pr_tr_chunk = clf.predict_proba(Xtr)[:,1]; pr_te_chunk = clf.predict_proba(Xte)[:,1]
    te_pred_df = pd.DataFrame({'prob': pr_te_chunk, 'y': yte, 'path': te_chunks['path'].values, 'subject': te_chunks['subject'].values})

    p_file, y_file = aggregate_by(te_pred_df, key='path')
    evaluate_operating_points(y_file, p_file, name=f'Classical (Fold {fold}) - file')
    p_subj, y_subj = aggregate_by(te_pred_df, key='subject')
    evaluate_operating_points(y_subj, p_subj, name=f'Classical (Fold {fold}) - subject')

    # CNN track
    tr_df_cnn = tr_chunks[['path','chunk_idx','label','subject']].copy()
    te_df_cnn = te_chunks[['path','chunk_idx','label','subject']].copy()
    ds_tr = SpectrogramDataset(tr_df_cnn, augment=SPEC_AUG)
    ds_te = SpectrogramDataset(te_df_cnn, augment=False)
    dl_tr = DataLoader(ds_tr, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
    dl_te = DataLoader(ds_te, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)

    model = SmallCNN().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
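    # NOTE: the checkpoint is selected on the test fold's AUROC, which makes the per-fold scores
    # optimistic; for an unbiased estimate, select on validation subjects held out from the training fold.
    import copy
    best_au = 0; best_state = None
    for ep in range(1, EPOCHS+1):
        tr_loss, tr_acc = train_one_epoch(model, dl_tr, opt, device)
        te_chunk_df = evaluate(model, dl_te, device)
        p_file, y_file = aggregate_by(te_chunk_df, key='path')
        fpr,tpr,_ = roc_curve(y_file, p_file); au = auc(fpr,tpr)
        if au > best_au:
            # deep-copy: state_dict() returns references that later optimizer steps would overwrite
            best_au, best_state = au, copy.deepcopy(model.state_dict())
        if ep % 5 == 0 or ep==1:
            print(f"Epoch {ep}: loss={tr_loss:.3f} acc={tr_acc:.3f} AUROC(file)={au:.3f}")
    if best_state is not None:
        model.load_state_dict(best_state)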

    te_chunk_df = evaluate(model, dl_te, device)
    p_file, y_file = aggregate_by(te_chunk_df, key='path')
    evaluate_operating_points(y_file, p_file, name=f'CNN (Fold {fold}) - file')
    p_subj, y_subj = aggregate_by(te_chunk_df, key='subject')
    evaluate_operating_points(y_subj, p_subj, name=f'CNN (Fold {fold}) - subject')
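
To keep checkpoint selection separate from the test fold, one option is to hold out a few training subjects as a validation set. The sketch below splits rows by subject so that no subject lands on both sides. It is a simple random split and is not stratified by label; the function name `split_subjects` and the 20% validation fraction are assumptions for illustration.

```python
import numpy as np

def split_subjects(subjects, val_fraction=0.2, seed=42):
    """Return boolean row masks (train, val) such that no subject appears in both."""
    subjects = np.asarray(subjects)
    unique = np.unique(subjects)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)                                   # shuffle subject IDs, not rows
    n_val = max(1, int(round(val_fraction * len(unique))))
    val_subjects = set(unique[:n_val].tolist())
    val_mask = np.array([s in val_subjects for s in subjects])
    return ~val_mask, val_mask

subj = np.array(["s1", "s1", "s2", "s3", "s3", "s4", "s5"])
tr_mask, va_mask = split_subjects(subj)
print(sorted(set(subj[va_mask])), sorted(set(subj[tr_mask])))
```

Inside the fold loop this could be applied to `tr_chunks['subject'].values`; the best epoch would then be chosen on the validation rows and the test fold evaluated once at the end.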

In [None]:
# ========= Ablation & Reporting ========= #
# Example ablation: restrict classical features to robust markers from literature
robust_cols = [c for c in ['f0_max','shimmer_apq3','shimmer_apq5','cpp','ltas_0-1k_pct'] if c in df_feats.columns]
if robust_cols:
    print('Ablation with robust markers only:', robust_cols)
else:
    print('[Note] Robust feature columns not present (check extraction).')

# Model card (fill with your results)
from textwrap import dedent
model_card = dedent(f"""
## Model Card — Laryngeal Voice Classifier

**Intended use**: Screening/decision support, not a standalone diagnostic.  
**Data**: Benign vs malignant voice recordings; subject-wise CV.  
**Preprocessing**: {SAMPLE_RATE} Hz, chunk={CHUNK_SECONDS}s overlap={OVERLAP}. VAD={USE_VAD}.  
**Features**: F0 stats, jitter/shimmer, CPP, LTAS, spectral features + log-mel spectrograms for CNN.  
**Models**: XGBoost (features), SmallCNN (spectrograms).  
**Metrics**: AUROC/AUPRC at file and subject aggregation.  
**Calibration**: (add if applied)  
**Ablations**: Robust markers only vs full features.  
**Bias/Risk**: Potential dataset shift, microphone variability, language content effects.  
**Limitations**: Small-N, need external validation and clinical correlation.  
**Ethics**: Not a replacement for laryngoscopy/biopsy; inform users about uncertainty.
""")
print(model_card)
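
The calibration entry in the model card is left as a placeholder. One way to fill it is with a Brier score and a small reliability table computed on aggregated file-level or subject-level probabilities. The sketch below uses equal-width bins; the function names `brier_score` and `reliability_table`, the bin count, and the toy data are assumptions for illustration.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return float(np.mean((np.asarray(y_prob, dtype=float) - np.asarray(y_true, dtype=float)) ** 2))

def reliability_table(y_true, y_prob, n_bins=5):
    """Rows of (bin_low, bin_high, count, mean_predicted, observed_rate) for non-empty bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)  # bin index per prediction
    rows = []
    for b in range(n_bins):
        m = idx == b
        if m.any():
            rows.append((float(edges[b]), float(edges[b + 1]), int(m.sum()),
                         float(y_prob[m].mean()), float(y_true[m].mean())))
    return rows

y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.3, 0.7, 0.9]
print(f"Brier = {brier_score(y_demo, p_demo):.3f}")
for row in reliability_table(y_demo, p_demo):
    print(row)
```

A well-calibrated model shows `mean_predicted` close to `observed_rate` in every bin. With few subjects per fold the bins will be sparse, so pooling out-of-fold predictions before building the table is usually more informative.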