# Speaker Verification with ECAPA-TDNN on Kaggle GPU

This notebook trains an ECAPA-TDNN speaker verification model on the VietnamCeleb dataset using a Kaggle GPU, with optional MUSAN noise augmentation and experiment logging to Weights & Biases.

## Table of Contents
1. Setup and Installation
2. Repository and Environment Configuration
3. Dataset and Model Configuration
4. Model Training
5. Optional: MUSAN Augmentation Validation

In [92]:
# Install required packages
!pip install torch torchaudio PyYAML soundfile librosa wandb
!pip install matplotlib numpy tqdm

print("Package installation complete!")

Package installation complete!


In [93]:
# Setup working directory and clone repository
import os
import sys
from pathlib import Path

# Start fresh in working directory
%cd /kaggle/working
!rm -rf speaker-verification

# Clone repository
!git clone https://github.com/mapotofu40/speaker-verification.git

# Move config.py to avoid name conflict
!mv speaker-verification/config.py speaker-verification/global_config.py

# Set up Python path properly
project_dir = Path('/kaggle/working/speaker-verification').absolute()

# Clean up sys.path
sys.path = [p for p in sys.path if 'speaker-verification/speaker-verification' not in p]

# Add project directory to Python path if not already present
if str(project_dir) not in sys.path:
    sys.path.insert(0, str(project_dir))

# Change to project directory
os.chdir(project_dir)

print("\nDirectory structure:")
!ls -R

print(f"\nProject directory: {project_dir}")
print(f"Current working directory: {os.getcwd()}")
print(f"\nPython path:")
for p in sys.path:
    print(f"  {p}")

/kaggle/working
Cloning into 'speaker-verification'...
remote: Enumerating objects: 56, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 56 (delta 21), reused 45 (delta 10), pack-reused 0 (from 0)
Receiving objects: 100% (56/56), 35.29 KiB | 5.04 MiB/s, done.
Resolving deltas: 100% (21/21), done.

Directory structure:
.:
cli.py	 ecapa_tdnn	   kaggle_notebook.ipynb  requirements.txt  train.py
config	 evaluate.py	   models		  tester	    utils
dataset  global_config.py  README.md		  trainer	    verify.py

./config:
defaults.py  __init__.py

./models:
ecapa_tdnn.py  feature_extractor.py  __init__.py  modules.py

./utils:
augment.py  config.py  data.py	__init__.py  metrics.py  training.py

Project directory: /kaggle/working/speaker-verification
Current working directory: /kaggle/working/speaker-verification

Python path:
  /kaggle/working/speaker-verification
  /kaggle/working
  /kaggle/lib/kagglegym
  /kaggle/lib

In [94]:
# Import necessary modules
import torch
import torch.nn.functional as F
import sys
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Clear any remaining path duplicates
sys.path = list(dict.fromkeys(sys.path))

# Verify imports work
try:
    from config.defaults import BASE_CONFIG
    from models.ecapa_tdnn import SpeakerVerificationModel
    from models.feature_extractor import FeatureExtractor
    from utils.data import VietnamCelebDataset, ValidationPairDataset, collate_fn
    from utils.training import train_model
    from utils.metrics import compute_eer, cosine_similarity
    from tqdm import tqdm
    print("All modules imported successfully!")
    
except ImportError as e:
    print(f"Import error: {e}")
    print("\nCurrent directory structure:")
    !pwd
    !ls -R
    print(f"\nPython path:")
    for p in sys.path:
        print(f"  {p}")
    raise

# Verify GPU availability
print(f"\nGPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

All modules imported successfully!

GPU available: True
GPU device: Tesla P100-PCIE-16GB


## Weights & Biases Setup

Before running the training, you need to:
1. Create a free account at https://wandb.ai if you haven't already
2. Get your API key from https://wandb.ai/authorize
3. When you run the cell below, you'll see a prompt that says "wandb: Please enter your credentials to login to wandb"
4. Paste your API key and press Enter

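wandb stores the key locally for the rest of the session, so you only need to enter it once. Do not paste the key into the notebook source: anyone who can view the notebook can read it.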
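If you prefer not to type the key on every run, one option is to supply it through the `WANDB_API_KEY` environment variable, which wandb reads itself, and fall back to the interactive prompt when it is missing. The sketch below is a minimal illustration; the helper `resolve_wandb_key` is hypothetical and not part of the repository.

```python
import os
from typing import Mapping, Optional


def resolve_wandb_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the API key from the environment, or None if it is unset or blank."""
    key = env.get("WANDB_API_KEY", "").strip()
    return key or None


# Hypothetical usage: pass the key when available, otherwise let wandb prompt.
# import wandb
# wandb.login(key=resolve_wandb_key())
```

Returning `None` for a blank value keeps the interactive prompt as the default path, so the notebook still works for readers who have not set the variable.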

In [95]:
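# Configure paths for Kaggle
import copy

# Deep-copy so that updating the nested dicts does not modify BASE_CONFIG itself
config = copy.deepcopy(BASE_CONFIG)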
config['paths'].update({
    'checkpoint_dir': '/kaggle/working/checkpoints',
    'log_dir': '/kaggle/working/logs',
    'cache_dir': '/kaggle/working/cache'
})

# Create necessary directories
for path in config['paths'].values():
    Path(path).mkdir(parents=True, exist_ok=True)

# Login to Weights & Biases
import wandb

# This will prompt you to enter your API key.
# Never hardcode the key in the notebook; revoke any key that has been committed or shared.
wandb.login()

# Initialize wandb run
wandb.init(
    project="speaker-verification",
    config={
        "architecture": "ECAPA-TDNN",
        "dataset": "VietnamCeleb",
        **config['training'],  # Add training config
        **config['audio'],     # Add audio processing config
        **config['model']      # Add model architecture config
    }
)



In [102]:
# MUSAN dataset configuration
musan_config = {
    'data_root': '/kaggle/input/musan-dataset/musan',  # Update this path to your MUSAN dataset location
    'noise_dir': 'noise',
    'music_dir': 'music',
    'speech_dir': 'speech',
    'sample_rate': config['audio']['sample_rate'],
    'min_snr_db': 5,
    'max_snr_db': 20
}

# Update augmentation configuration with MUSAN parameters
augment_config = {
    'enabled': True,
    'speed_perturb': True,
    'musan_path': musan_config['data_root'],
    'noise_prob': 0.6,  # Probability of applying noise augmentation
    'noise_types': {
        'noise': 0.3,  # background noise probability (maps to noise_dir)
        'music': 0.4,  # music interference probability (maps to music_dir)
        'speech': 0.3  # speech/babble noise probability (maps to speech_dir)
    },
    'noise_snr_range': [musan_config['min_snr_db'], musan_config['max_snr_db']],  # Use correct parameter name
    'reverb_prob': 0.5,
    'musan_noise_prob': 0.6  # Probability of using MUSAN noise vs. Gaussian noise
}

# Validate MUSAN paths
musan_path_map = {
    'noise': os.path.join(musan_config['data_root'], musan_config['noise_dir']),
    'music': os.path.join(musan_config['data_root'], musan_config['music_dir']),
    'speech': os.path.join(musan_config['data_root'], musan_config['speech_dir'])
}

for noise_type, path in musan_path_map.items():
    if not os.path.exists(path):
        logger.warning(f"MUSAN {noise_type} path does not exist: {path}")
    else:
        logger.info(f"MUSAN {noise_type} path validated: {path}")

# Update wandb config with MUSAN parameters
wandb.config.update({
    'musan': {
        'noise_prob': augment_config['noise_prob'],
        'noise_types': list(augment_config['noise_types'].keys()),
        'noise_snr_range': augment_config['noise_snr_range']
    }
})

In [103]:
# Dataset configuration
data_config = {
    'data_root': '/kaggle/input/vietnam-celeb-dataset/full-dataset',  # Update with your dataset path
    'metadata_file': '/kaggle/input/vietnam-celeb-dataset/full-dataset/speaker-metadata.tsv',
    'train_utterance_file': '/kaggle/input/asv-output/cleaned_utterances.txt',
    'val_easy_utterance_file': '/kaggle/input/vietnam-celeb-dataset/vietnam-celeb-e.txt',
    'val_hard_utterance_file': '/kaggle/input/vietnam-celeb-dataset/vietnam-celeb-h.txt'
}

# Create feature extractor
feature_extractor = FeatureExtractor(
    sample_rate=config['audio']['sample_rate'],
    n_mels=config['audio']['n_mels']
)

# Create datasets
train_dataset = VietnamCelebDataset(
    data_root=data_config['data_root'],
    metadata_file=data_config['metadata_file'],
    utterance_file=data_config['train_utterance_file'],
    feature_extractor=feature_extractor,
    cache_dir=config['paths']['cache_dir'],
    use_cache=config['cache']['enabled'],
    augment=augment_config['enabled'],
    augment_config=augment_config,  # Now includes MUSAN configuration
    musan_path=augment_config['musan_path']  # Add MUSAN path
)

# Create validation datasets with the new ValidationPairDataset
val_easy_dataset = ValidationPairDataset(
    data_root=data_config['data_root'],
    metadata_file=data_config['metadata_file'],
    pair_file=data_config['val_easy_utterance_file'],
    feature_extractor=feature_extractor,
    cache_dir=config['paths']['cache_dir'],
    use_cache=config['cache']['enabled']
)

val_hard_dataset = ValidationPairDataset(
    data_root=data_config['data_root'],
    metadata_file=data_config['metadata_file'],
    pair_file=data_config['val_hard_utterance_file'],
    feature_extractor=feature_extractor,
    cache_dir=config['paths']['cache_dir'],
    use_cache=config['cache']['enabled']
)

In [104]:
from torch.utils.data import DataLoader

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=config['training']['batch_size'],
    shuffle=True,
    collate_fn=collate_fn,
    num_workers=2,  # Reduce for Kaggle
    pin_memory=True
)

# Create data loaders with the new validation datasets
val_easy_loader = DataLoader(
    val_easy_dataset,
    batch_size=config['training']['batch_size'],
    shuffle=False,
    collate_fn=collate_fn,
    num_workers=2,  # Reduce for Kaggle
    pin_memory=True
)

val_hard_loader = DataLoader(
    val_hard_dataset,
    batch_size=config['training']['batch_size'],
    shuffle=False,
    collate_fn=collate_fn,
    num_workers=2,  # Reduce for Kaggle
    pin_memory=True
)

In [105]:
print("Setting up model training...")
# Initialize optimizer, criterion and move model to device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create the model before the optimizer so the optimizer tracks this model's parameters
model = SpeakerVerificationModel(
    input_dim=config['audio']['n_mels'],
    channels=config['model']['channels'],
    embedding_dim=config['model']['embedding_dim'],
    num_blocks=config['model']['num_blocks'],
    num_speakers=len(train_dataset.speaker_to_idx)
)
model = model.to(device)

# Initialize optimizer and scheduler
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=config['training']['learning_rate'],
    weight_decay=config['training']['weight_decay']
)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=config['training']['num_epochs']
)

# Loss criterion for training
criterion = torch.nn.CrossEntropyLoss()

print("Model training setup complete!")

Setting up model training...
Using device: cuda
Model training setup complete!


In [None]:
# The model is created and moved to the device in the setup cell above.
# Re-creating it here would leave the optimizer pointing at the old parameters.

# Watch model in wandb
wandb.watch(model)

# Validation function to compute EER
def validate_eer(model, val_loader, device):
    """Score each trial pair by cosine similarity and return (EER, threshold)."""
    model.eval()
    all_scores = []
    all_labels = []
    
    with torch.no_grad():
        for batch in val_loader:
            features1 = batch['features1'].to(device)
            features2 = batch['features2'].to(device)
            labels = batch['label'].to(device)
            
            # Extract embeddings for both utterances
            embeddings1 = model.extract_embedding(features1)
            embeddings2 = model.extract_embedding(features2)
            
            # Compute cosine similarity
            similarities = F.cosine_similarity(embeddings1, embeddings2)
            
            all_scores.append(similarities.cpu())
            all_labels.append(labels.cpu())
    
    # Concatenate all scores and labels
    all_scores = torch.cat(all_scores).numpy()
    all_labels = torch.cat(all_labels).numpy()
    
    # Compute EER
    eer, threshold = compute_eer(all_scores, all_labels)
    return eer, threshold

# Training loop
num_epochs = config['training']['num_epochs']
best_val_eer = float('inf')
best_epoch = 0
best_state_dict = None

for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss = 0.0
    for batch_idx, batch in enumerate(tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs} (Train)')):
        features = batch['features'].to(device)
        speaker_ids = batch['speaker_ids'].to(device)
        
        optimizer.zero_grad()
        logits = model(features, speaker_ids)
        loss = criterion(logits, speaker_ids)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        
        # Log batch-level metrics
        wandb.log({
            'batch': epoch * len(train_loader) + batch_idx,
            'batch_loss': loss.item()
        })
    
    train_loss /= len(train_loader)
    
    # Validation phase
    val_easy_eer, easy_threshold = validate_eer(model, val_easy_loader, device)
    val_hard_eer, hard_threshold = validate_eer(model, val_hard_loader, device)
    
    # Average EER from both validation sets
    avg_val_eer = (val_easy_eer + val_hard_eer) / 2
    
    # Log epoch-level metrics
    wandb.log({
        'epoch': epoch + 1,
        'train_loss': train_loss,
        'val_easy_eer': val_easy_eer * 100,  # Convert to percentage
        'val_hard_eer': val_hard_eer * 100,  # Convert to percentage
        'avg_val_eer': avg_val_eer * 100,    # Convert to percentage
        'easy_threshold': easy_threshold,
        'hard_threshold': hard_threshold,
        'learning_rate': scheduler.get_last_lr()[0]
    })
    
    print(f'Epoch {epoch+1}/{num_epochs}:')
    print(f'Training Loss: {train_loss:.4f}')
    print(f'Validation Easy EER: {val_easy_eer*100:.2f}% (threshold: {easy_threshold:.3f})')
    print(f'Validation Hard EER: {val_hard_eer*100:.2f}% (threshold: {hard_threshold:.3f})')
    print(f'Average Validation EER: {avg_val_eer*100:.2f}%')
    
    # Save best model based on average EER
    if avg_val_eer < best_val_eer:
        best_val_eer = avg_val_eer
        best_epoch = epoch + 1
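        # dict.copy() is shallow and would keep tracking the live weights, so clone each tensor
        best_state_dict = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}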
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict(),
            'best_val_eer': best_val_eer,
            'easy_threshold': easy_threshold,
            'hard_threshold': hard_threshold
        }
        torch.save(checkpoint, f'{config["paths"]["checkpoint_dir"]}/best_model.pth')
        # Save model to wandb
        wandb.save(f'{config["paths"]["checkpoint_dir"]}/best_model.pth')
        print(f'New best model saved! (Average EER: {best_val_eer*100:.2f}%)')
    
    # Learning rate scheduling
    scheduler.step()

print(f'\nTraining completed!')
print(f'Best validation EER: {best_val_eer*100:.2f}% at epoch {best_epoch}')

# Log final best metrics
wandb.run.summary.update({
    'best_val_eer': best_val_eer * 100,  # Convert to percentage
    'best_epoch': best_epoch
})

# Load best model for final use
checkpoint = torch.load(f'{config["paths"]["checkpoint_dir"]}/best_model.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Close wandb run
wandb.finish()

Epoch 1/150 (Train):   0%|          | 0/1890 [00:00<?, ?it/s]

In [None]:
# Save the final model
final_model_path = '/kaggle/working/final_model.pth'
torch.save(model.state_dict(), final_model_path)
print(f"Training completed and model saved to {final_model_path}!")
print(f"Best validation EER: {best_val_eer*100:.2f}% at epoch {best_epoch}")

## Dataset Augmentation Setup

### MUSAN Dataset Configuration (Optional)

This section configures audio augmentation using the MUSAN dataset. Training runs without MUSAN, but adding it is intended to make the model more robust by mixing three kinds of interference into the training audio:
- Background noise augmentation
- Music interference augmentation
- Speech babble noise augmentation

If you don't have the MUSAN dataset:
1. Visit http://www.openslr.org/17/
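2. Download and extract the archive, then add it to the notebook as a Kaggle dataset so that it is mounted under `/kaggle/input/musan-dataset/` (`/kaggle/input` is read-only inside a notebook session)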
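The way the augmentation settings interact can be sketched as a per-utterance decision: with probability `noise_prob` pick one category according to the `noise_types` weights, then draw an SNR from `noise_snr_range`. The function `sample_noise_plan` below is a hypothetical illustration using the values from this notebook's `augment_config`; the repository's `AudioAugmenter` may make these choices differently.

```python
import random
from typing import Dict, Optional, Tuple


def sample_noise_plan(
    noise_prob: float,
    noise_types: Dict[str, float],
    snr_range: Tuple[float, float],
    rng: random.Random,
) -> Optional[Tuple[str, float]]:
    """Decide whether to add noise and, if so, pick a category and an SNR in dB."""
    if rng.random() >= noise_prob:
        return None  # leave this utterance clean
    categories = list(noise_types)
    weights = [noise_types[c] for c in categories]
    category = rng.choices(categories, weights=weights, k=1)[0]
    snr_db = rng.uniform(*snr_range)
    return category, snr_db


rng = random.Random(0)
weights = {"noise": 0.3, "music": 0.4, "speech": 0.3}
plans = [sample_noise_plan(0.6, weights, (5, 20), rng) for _ in range(1000)]
augmented = [p for p in plans if p is not None]
print(f"augmented {len(augmented)} of {len(plans)} utterances")
```

Passing an explicit `random.Random` instance keeps the draws reproducible, which helps when comparing augmentation settings across runs.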

## Optional: Augmentation Quality Validation

This section helps validate the quality of audio augmentation by visualizing and analyzing:
1. Waveforms before/after augmentation
2. Spectrograms to verify frequency characteristics
3. Signal-to-Noise Ratio (SNR) statistics
4. Augmentation type distribution

You can skip this section if not using MUSAN augmentation.
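As a reference point for the SNR checks, the sketch below mixes noise into a clean signal at a requested SNR and then measures the result. It uses a synthetic sine wave and Gaussian noise rather than real utterances, and `mix_at_snr` and `measure_snr` are hypothetical helpers, not functions from the repository.

```python
import numpy as np


def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise


def measure_snr(clean: np.ndarray, noisy: np.ndarray) -> float:
    """Return the SNR in dB, treating `noisy - clean` as the noise component."""
    residual = noisy - clean
    return float(10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2)))


rng = np.random.default_rng(0)
t = np.arange(16000) / 16000            # one second at 16 kHz
clean = 0.5 * np.sin(2 * np.pi * 220 * t)
noise = rng.normal(size=clean.shape)
noisy = mix_at_snr(clean, noise, snr_db=10.0)
measured = measure_snr(clean, noisy)
print(f"measured SNR: {measured:.2f} dB")
```

This measurement only works when the augmented signal has the same length and alignment as the original, so it does not apply once speed perturbation or reverberation has been applied.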

In [None]:
# import matplotlib.pyplot as plt
# import librosa
# import librosa.display
# import numpy as np
# from utils.augment import AudioAugmenter

# # Create audio augmenter with our configuration
# audio_augmenter = AudioAugmenter(
#     sample_rate=config['audio']['sample_rate'],
#     enabled=True,
#     speed_perturb=augment_config.get('speed_perturb', True),
#     noise_prob=augment_config.get('noise_prob', 0.5),
#     noise_snr_range=augment_config['noise_snr_range'],  # Fixed parameter name
#     reverb_prob=augment_config.get('reverb_prob', 0.5),
#     musan_path=augment_config.get('musan_path'),
#     noise_types=augment_config['noise_types']
# )

# def plot_audio_comparison(original, augmented, sr, title="Audio Comparison"):
#     fig, axes = plt.subplots(2, 2, figsize=(15, 10))
#     fig.suptitle(title)
    
#     # Original waveform
#     axes[0,0].plot(original)
#     axes[0,0].set_title('Original Waveform')
    
#     # Augmented waveform
#     axes[0,1].plot(augmented)
#     axes[0,1].set_title('Augmented Waveform')
    
#     # Original spectrogram
#     D = librosa.amplitude_to_db(np.abs(librosa.stft(original)), ref=np.max)
#     librosa.display.specshow(D, y_axis='log', x_axis='time', ax=axes[1,0])
#     axes[1,0].set_title('Original Spectrogram')
    
#     # Augmented spectrogram
#     D = librosa.amplitude_to_db(np.abs(librosa.stft(augmented)), ref=np.max)
#     librosa.display.specshow(D, y_axis='log', x_axis='time', ax=axes[1,1])
#     axes[1,1].set_title('Augmented Spectrogram')
    
#     plt.tight_layout()
#     return fig

# def compute_snr(original, noisy):
#     noise = noisy - original
#     signal_power = np.mean(original ** 2)
#     noise_power = np.mean(noise ** 2)
#     snr = 10 * np.log10(signal_power / noise_power)
#     return snr

In [None]:
# # Get a sample from the training dataset
# sample_idx = 0
# sample = train_dataset[sample_idx]
# original_audio = sample['audio']

# # Test different noise types
# noise_types = ['background', 'music', 'babble']
# figs = []

# for noise_type in noise_types:
#     # Apply specific noise augmentation
#     augmented_audio = audio_augmenter.apply_noise(
#         original_audio.numpy(), 
#         noise_type=noise_type,
#         snr=np.random.uniform(*augment_config['noise_snr_range'])
#     )
    
#     # Calculate actual SNR
#     actual_snr = compute_snr(original_audio.numpy(), augmented_audio)
    
#     # Plot comparison
#     fig = plot_audio_comparison(
#         original_audio.numpy(),
#         augmented_audio,
#         config['audio']['sample_rate'],
#         f'Audio Comparison - {noise_type.capitalize()} Noise (SNR: {actual_snr:.2f} dB)'
#     )
#     figs.append(fig)
    
#     # Log augmented audio sample to wandb
#     wandb.log({
#         f'audio_samples/{noise_type}': wandb.Audio(
#             augmented_audio,
#             sample_rate=config['audio']['sample_rate'],
#             caption=f'{noise_type.capitalize()} noise augmentation'
#         )
#     })

# plt.show()

In [None]:
# # Validate augmentation statistics
# def validate_augmentation_stats(dataset, num_samples=100):
#     snr_values = []
#     noise_type_counts = {noise_type: 0 for noise_type in noise_types}
#     total_augmented = 0

#     for i in range(num_samples):
#         sample = dataset[i]
#         original_audio = sample['audio']
        
#         # Apply random augmentation as per training
#         if np.random.random() < augment_config['noise_prob']:
#             # Weights are in the order noise, music, speech, matching background, music, babble in noise_types
#             noise_type = np.random.choice(noise_types, p=list(augment_config['noise_types'].values()))
#             snr = np.random.uniform(*augment_config['noise_snr_range'])
            
#             augmented_audio = audio_augmenter.apply_noise(original_audio.numpy(), noise_type=noise_type, snr=snr)
#             actual_snr = compute_snr(original_audio.numpy(), augmented_audio)
            
#             snr_values.append(actual_snr)
#             noise_type_counts[noise_type] += 1
#             total_augmented += 1
    
#     # Print statistics
#     print(f"Augmentation Statistics (over {num_samples} samples):")
#     print(f"Total augmented: {total_augmented}/{num_samples} ({total_augmented/num_samples*100:.1f}%)")
#     print("\nNoise type distribution:")
#     for noise_type, count in noise_type_counts.items():
#         share = count / total_augmented * 100 if total_augmented else 0.0  # avoid dividing by zero
#         print(f"{noise_type}: {count}/{total_augmented} ({share:.1f}% of augmented)")
    
#     if snr_values:
#         print("\nSNR statistics:")
#         print(f"Mean SNR: {np.mean(snr_values):.2f} dB")
#         print(f"Std SNR: {np.std(snr_values):.2f} dB")
#         print(f"Min SNR: {np.min(snr_values):.2f} dB")
#         print(f"Max SNR: {np.max(snr_values):.2f} dB")
        
#         # Log statistics to wandb
#         wandb.log({
#             'augmentation/mean_snr': np.mean(snr_values),
#             'augmentation/std_snr': np.std(snr_values),
#             'augmentation/augmentation_rate': total_augmented/num_samples
#         })

# # Run validation
# validate_augmentation_stats(train_dataset)

### MUSAN Dataset Structure

The MUSAN dataset should be organized in the following structure at `/kaggle/input/musan-dataset/musan/`:

```
musan/
├── noise/
│   ├── free-sound/
│   └── sound-bible/
├── music/
│   ├── fma/
│   ├── fma-western-art/
│   ├── hd-classical/
│   ├── jamendo/
│   └── rfm/
└── speech/
    ├── librivox/
    └── us-gov/
```

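To obtain the MUSAN dataset:
1. Visit http://www.openslr.org/17/
2. Download the tar file (about 11 GB compressed)
3. Extract it and add it to the notebook as a Kaggle dataset mounted at `/kaggle/input/musan-dataset/`, since `/kaggle/input` is read-only inside a notebook session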

The dataset contains:
- **noise**: Various background noises and sound effects
- **music**: Different genres and styles of music recordings
- **speech**: Speech recordings from various sources

Each category is used differently in our augmentation pipeline:
- Background noise: Simulates real-world environments
- Music: Creates challenging interference conditions
- Speech: Generates realistic babble noise with multiple overlapped speakers
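Babble noise of the kind described above can be sketched by overlapping windows from several speech recordings and normalizing the sum. The example uses random arrays as stand-ins for MUSAN speech clips; the function `make_babble`, the number of clips, and the peak normalization are all assumptions for illustration and may differ from what the repository's augmenter does.

```python
import numpy as np


def make_babble(speech_clips, num_samples, rng):
    """Overlap several speech clips into one babble track of `num_samples` samples."""
    babble = np.zeros(num_samples, dtype=np.float64)
    for clip in speech_clips:
        # Tile short clips so a full-length window always exists, then pick one at random
        reps = int(np.ceil(num_samples / len(clip))) + 1
        tiled = np.tile(clip, reps)
        start = rng.integers(0, len(tiled) - num_samples + 1)
        babble += tiled[start:start + num_samples]
    # Peak-normalize so the track can later be rescaled to a target SNR
    peak = np.max(np.abs(babble))
    return babble / peak if peak > 0 else babble


rng = np.random.default_rng(0)
# Stand-ins for four speakers with recordings of different lengths
clips = [rng.normal(size=n) for n in (3000, 8000, 12000, 20000)]
babble = make_babble(clips, num_samples=16000, rng=rng)
print(babble.shape)
```

The resulting track would then be mixed into the training utterance at a sampled SNR, in the same way as the noise and music categories.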

In [None]:
# # First, let's verify the MUSAN dataset structure
# def verify_musan_structure(root_path):
#     expected_structure = {
#         'noise': ['free-sound', 'sound-bible'],
#         'music': ['fma', 'fma-western-art', 'hd-classical', 'jamendo', 'rfm'],
#         'speech': ['librivox', 'us-gov']
#     }
    
#     all_valid = True
#     for category, subdirs in expected_structure.items():
#         category_path = os.path.join(root_path, category)
#         if not os.path.exists(category_path):
#             logger.error(f"Missing category directory: {category}")
#             all_valid = False
#             continue
            
#         for subdir in subdirs:
#             subdir_path = os.path.join(category_path, subdir)
#             if not os.path.exists(subdir_path):
#                 logger.warning(f"Missing subdirectory: {category}/{subdir}")
#                 all_valid = False
#             else:
#                 # Count files to verify content
#                 files = [f for f in os.listdir(subdir_path) if f.endswith('.wav')]
#                 logger.info(f"Found {len(files)} .wav files in {category}/{subdir}")
    
#     return all_valid

# # Verify MUSAN dataset structure
# musan_valid = verify_musan_structure(musan_config['data_root'])
# if not musan_valid:
#     logger.warning("MUSAN dataset structure is incomplete. Some augmentations may not work as expected.")
# else:
#     logger.info("MUSAN dataset structure verified successfully!")