# Phase 4.8: The Paradigm Shift - Attempting a Wav2Vec2 Generalist

**Objective:** This notebook documents the project's most significant methodological leap: moving from a CNN-on-spectrograms approach to a state-of-the-art, end-to-end Speech Transformer.

The goal was to train a **Wav2Vec2** model to see if a native speech architecture could outperform our champion CNN. This involved three major upgrades:
1.  **Expanding the Dataset** to include the naturalistic IEMOCAP dataset.
2.  Building a **new data pipeline** for raw audio waveforms.
3.  Implementing a **two-stage curriculum learning** strategy.

**Outcome:** The pipeline was built successfully, but training failed: the training loss was `NaN` from the first epoch onward, a sign of numerical instability. The failure informed the final, successful attempt.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!pip install transformers[torch] datasets librosa pandas seaborn matplotlib tqdm audiomentations

Mounted at /content/drive

## Part 1: Creating the Super-Dataset

To train our most powerful model yet, we expand the data pool. In addition to RAVDESS and CREMA-D, we now incorporate the **IEMOCAP** dataset, which contains spontaneous, natural speech from dialogue scenarios. After filtering to the six core emotions, this yields 11,936 samples (1,056 from RAVDESS, 7,442 from CREMA-D, and 3,438 from IEMOCAP), a large and diverse testbed for our models.

In [None]:
# ===================================================================
# CELL 2: DATA PREPARATION
# ===================================================================
import os
import random
from sklearn.model_selection import train_test_split
import pickle

# --- Configuration ---
RAVDESS_PATH = "/content/drive/MyDrive/ser_project/ravdess_data/"
CREMA_D_PATH = "/content/drive/MyDrive/ser_project/crema_d_data/AudioWAV/"
IEMOCAP_PATH = "/content/drive/MyDrive/ser_project/iemocap_data/IEMOCAP_full_release/"

# --- Mappings (6 core emotions) ---
unified_emotion_map = { "neutral": 0, "happy": 1, "sad": 2, "angry": 3, "fearful": 4, "disgust": 5 }
ravdess_map = { "01": "neutral", "03": "happy", "04": "sad", "05": "angry", "06": "fearful", "07": "disgust" }
crema_d_map = { "NEU": "neutral", "HAP": "happy", "SAD": "sad", "ANG": "angry", "FEA": "fearful", "DIS": "disgust" }
iemocap_map = { "neu": "neutral", "hap": "happy", "sad": "sad", "ang": "angry", "fea": "fearful", "exc": "happy" } # Map excited to happy

# --- Gather files and labels from all three datasets ---
all_files = []
all_labels_str = []
print("--- GATHERING AND COUNTING FILES ---")

# Process RAVDESS
ravdess_count = 0
for root, dirs, files in os.walk(RAVDESS_PATH):
    for f in files:
        if f.endswith('.wav'):
            try:
                code = f.split("-")[2]
                if code in ravdess_map:
                    all_files.append(os.path.join(root, f))
                    all_labels_str.append(ravdess_map[code])
                    ravdess_count += 1
            except IndexError:
                continue
print(f"Found {ravdess_count} relevant files in RAVDESS.")

# Process CREMA-D
crema_d_count = 0
if os.path.exists(CREMA_D_PATH):
    for f in os.listdir(CREMA_D_PATH):
        if f.endswith('.wav'):
            try:
                code = f.split("_")[2]
                if code in crema_d_map:
                    all_files.append(os.path.join(CREMA_D_PATH, f))
                    all_labels_str.append(crema_d_map[code])
                    crema_d_count += 1
            except IndexError:
                continue
print(f"Found {crema_d_count} relevant files in CREMA-D.")

# Process IEMOCAP
iemocap_count = 0
if os.path.exists(IEMOCAP_PATH):
    for session_folder in os.listdir(IEMOCAP_PATH):
        if session_folder.startswith("Session"):
            emo_path = os.path.join(IEMOCAP_PATH, session_folder, "dialog/EmoEvaluation/")
            wav_root = os.path.join(IEMOCAP_PATH, session_folder, "sentences/wav/")
            if os.path.isdir(emo_path) and os.path.isdir(wav_root):
                for txt_file in os.listdir(emo_path):
                    if txt_file.endswith('.txt'):
                        with open(os.path.join(emo_path, txt_file)) as f_ann:
                            for line in f_ann:
                                if line.startswith('['):
                                    parts = line.strip().split('\t')
                                    if len(parts) >= 3 and parts[2] in iemocap_map:
                                        wav_folder = parts[1].rsplit('_', 1)[0]
                                        wav_file = os.path.join(wav_root, wav_folder, f"{parts[1]}.wav")
                                        if os.path.exists(wav_file):
                                            all_files.append(wav_file)
                                            all_labels_str.append(iemocap_map[parts[2]])
                                            iemocap_count += 1
print(f"Found {iemocap_count} relevant files in IEMOCAP.")
print(f"\nTotal files found across all datasets: {len(all_files)}")

# --- Create final data splits ---
# Roughly 76.5% train, 8.5% validation, 15% test
# (15% is held out for test first, then 10% of the remainder becomes validation)
train_val_files, test_files, train_val_labels_str, test_labels_str = train_test_split(
    all_files, all_labels_str, test_size=0.15, random_state=42, stratify=all_labels_str
)
train_files, val_files, train_labels_str, val_labels_str = train_test_split(
    train_val_files, train_val_labels_str, test_size=0.1, random_state=42, stratify=train_val_labels_str
)

print("\n--- DATA SPLITTING COMPLETE ---")
print(f"Training samples: {len(train_files)}")
print(f"Validation samples: {len(val_files)}")
print(f"Test samples: {len(test_files)}")

--- GATHERING AND COUNTING FILES ---
Found 1056 relevant files in RAVDESS.
Found 7442 relevant files in CREMA-D.
Found 3438 relevant files in IEMOCAP.

Total files found across all datasets: 11936

--- DATA SPLITTING COMPLETE ---
Training samples: 9130
Validation samples: 1015
Test samples: 1791
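
The two chained `train_test_split` calls in Cell 2 take the validation set out of what remains after the test split, so the effective fractions differ from the individual `test_size` values. A small sketch of that arithmetic follows; the helper name `effective_split` is ours and exists only for illustration.

```python
def effective_split(test_size, val_size_of_remainder):
    """Return (train, val, test) fractions of the full dataset for two chained splits."""
    test = test_size
    remainder = 1.0 - test_size               # what is left after the test split
    val = remainder * val_size_of_remainder   # validation is a share of the remainder
    train = remainder - val
    return train, val, test


train_frac, val_frac, test_frac = effective_split(test_size=0.15, val_size_of_remainder=0.1)
print(f"train={train_frac:.3f}, val={val_frac:.3f}, test={test_frac:.3f}")
```

These fractions agree with the logged counts of 9130, 1015, and 1791 samples out of 11936.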


## Part 2: A New Pipeline for Raw Audio

Unlike our previous CNNs that required spectrogram "images," Transformers like Wav2Vec2 can process raw audio waveforms directly. This requires a completely new data pipeline.

We define a `WavDataset` that loads each file with `librosa` and resamples it to the 16 kHz rate the model requires. A custom `collate_fn` then uses the Hugging Face `Wav2Vec2FeatureExtractor` to pad each batch to a common length and return PyTorch tensors.
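
To make the padding step concrete, here is a simplified NumPy sketch of what "padding a batch" means for variable-length waveforms. It is an illustration under our own assumptions, not the library's exact procedure: the helper `pad_waveforms` is hypothetical, random noise stands in for speech, and the real feature extractor may additionally normalize the audio and, depending on the checkpoint's configuration, may or may not return a mask.

```python
import numpy as np


def pad_waveforms(waveforms, pad_value=0.0):
    """Right-pad 1-D waveforms to the longest length in the batch.

    Returns the padded (batch, max_len) float32 array and a 0/1 mask that
    marks real samples (1) versus padding (0).
    """
    max_len = max(len(w) for w in waveforms)
    batch = np.full((len(waveforms), max_len), pad_value, dtype=np.float32)
    mask = np.zeros((len(waveforms), max_len), dtype=np.int64)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = w
        mask[i, : len(w)] = 1
    return batch, mask


# Two fake clips of 0.5 s and 1.0 s at 16 kHz (random noise stands in for speech).
rng = np.random.default_rng(0)
clips = [rng.standard_normal(8000), rng.standard_normal(16000)]
batch, mask = pad_waveforms(clips)
print(batch.shape, mask.sum(axis=1))
```

The shorter clip is extended with zeros so that both rows share one length, which is what lets a batch be stacked into a single tensor.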

In [None]:
# ===================================================================
# CELL 3: HELPER DEFINITIONS
# ===================================================================
import torch
import librosa
from torch.utils.data import Dataset
from transformers import Wav2Vec2FeatureExtractor

# --- Initialize Feature Extractor (used by the collate function) ---
model_name = "facebook/wav2vec2-base-960h"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)

# --- Wav2Vec2 Dataset Class ---
class WavDataset(Dataset):
    def __init__(self, file_paths, labels):
        self.file_paths = file_paths
        self.labels = labels

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Load audio at the required 16kHz sample rate
        speech_array, sr = librosa.load(self.file_paths[idx], sr=16000)
        return speech_array, self.labels[idx]

# --- Collate Function to process batches ---
def collate_fn(batch):
    features, labels = zip(*batch)
    # The feature_extractor handles padding and tensor conversion
    processed = feature_extractor(list(features), sampling_rate=16000, padding=True, return_tensors="pt")
    return processed['input_values'], torch.tensor(labels, dtype=torch.long)

print("✅ Helper classes and functions are defined.")

✅ Helper classes and functions are defined.


## Part 3: The Training Failure - `NaN` Loss

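Our training plan was a two-stage curriculum. **Stage 1**, shown in the cell below, aimed to train an "Acted Speech Expert" on the combined RAVDESS and CREMA-D data.

However, training was unsuccessful. The log reports a training loss of `NaN` (Not a Number) in every epoch, starting with the first, which points to numerical instability (for example, overflow under mixed precision or exploding gradients). The validation loss stays finite and identical across all 15 epochs, and validation accuracy is stuck at 19.06%, near chance for six classes. This pattern suggests the weights were never updated: `GradScaler.step` skips the optimizer step whenever it finds non-finite gradients. Stage 2 (adapting to IEMOCAP) reused the same training loop on top of the Stage 1 checkpoint and failed with the same symptom.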
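
As a reference point for "random chance", the short calculation below gives the accuracy and cross-entropy loss of a model that predicts all six classes with equal probability. It assumes uniform predictions and ignores the class imbalance in the actual splits, so treat it as a rough yardstick only.

```python
import math

num_classes = 6
chance_accuracy = 100.0 / num_classes   # uniform guessing over six classes, in percent
chance_loss = math.log(num_classes)     # cross-entropy of a uniform prediction

print(f"chance accuracy ~ {chance_accuracy:.2f}%")
print(f"chance cross-entropy ~ {chance_loss:.4f}")
```

The validation losses logged in this notebook (1.7905 in Stage 1 and 1.7929 in Stage 2) sit very close to this uniform-prediction value.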

In [None]:
# ===================================================================
# CELL 4: STAGE 1 - TRAINING THE ACTED SPEECH EXPERT
# ===================================================================
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import Wav2Vec2ForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score
from tqdm import tqdm
import os

# --- Configuration for Stage 1 ---
LEARNING_RATE = 3e-5
BATCH_SIZE = 8
EPOCHS = 15
CHECKPOINT_STAGE1_PATH = "/content/drive/MyDrive/ser_project/wav2vec2_stage1_acted_best.pth"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
unified_emotion_labels = ["neutral", "happy", "sad", "angry", "fearful", "disgust"]
emotion_to_idx = {e: i for i, e in enumerate(unified_emotion_labels)}

# --- Prepare RAVDESS + CREMA-D data splits ---
# These variables (train_files, etc.) must be available from Cell 2
acted_train_files = [f for f in train_files if 'iemocap_data' not in f]
acted_val_files = [f for f in val_files if 'iemocap_data' not in f]
acted_train_labels = [emotion_to_idx[lbl] for i, lbl in enumerate(train_labels_str) if 'iemocap_data' not in train_files[i]]
acted_val_labels = [emotion_to_idx[lbl] for i, lbl in enumerate(val_labels_str) if 'iemocap_data' not in val_files[i]]

# The WavDataset and collate_fn must be available from Cell 3
train_dataset = WavDataset(acted_train_files, acted_train_labels)
val_dataset = WavDataset(acted_val_files, acted_val_labels)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=2)
print(f"Starting Stage 1: Training on {len(train_dataset)} acted samples...")

# --- Initialize Model, Optimizer, Scheduler, and Scaler ---
model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base-960h", num_labels=len(unified_emotion_labels))
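# NOTE: this assignment only sets a plain attribute; it does not change which parameters are trainable.
model.freeze_feature_extractor = False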
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()
num_training_steps = len(train_loader) * EPOCHS
num_warmup_steps = int(0.1 * num_training_steps)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)
scaler = torch.cuda.amp.GradScaler()

# --- Stage 1 Training Loop with AMP ---
best_val_acc = 0.0
for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0
    for inputs, labels in tqdm(train_loader, desc=f"Stage 1 - Epoch {epoch+1}/{EPOCHS}"):
        inputs, labels = inputs.to(device), labels.to(device)
        with torch.cuda.amp.autocast():
            outputs = model(inputs).logits
            loss = criterion(outputs, labels)

        optimizer.zero_grad()
        scaler.scale(loss).backward()
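        # NOTE: scaler.unscale_(optimizer) is not called first, so this clips gradients
        # that are still multiplied by the AMP loss scale.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)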
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        running_loss += loss.item() * inputs.size(0)
    train_loss = running_loss / len(train_dataset)

    # Validation
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in tqdm(val_loader, desc=f"Stage 1 - Epoch {epoch+1}/{EPOCHS} [Val]"):
            inputs, labels = inputs.to(device), labels.to(device)
            with torch.cuda.amp.autocast():
                outputs = model(inputs).logits
                loss = criterion(outputs, labels)
            val_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    val_accuracy = 100 * correct / total
    val_loss /= len(val_dataset)
    print(f"Stage 1 - Epoch {epoch+1}/{EPOCHS} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_accuracy:.2f}%")

    if val_accuracy > best_val_acc:
        best_val_acc = val_accuracy
        print(f"🎉 New best Stage 1 validation accuracy: {best_val_acc:.2f}%. Saving model...")
        torch.save({'model_state_dict': model.state_dict()}, CHECKPOINT_STAGE1_PATH)

print("\n✅ Stage 1 training complete.")

Starting Stage 1: Training on 6498 acted samples...


Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Stage 1 - Epoch 1/15: 100%|██████████| 813/813 [15:20<00:00,  1.13s/it]
Stage 1 - Epoch 1/15 [Val]: 100%|██████████| 91/91 [01:43<00:00,  1.13s/it]


Stage 1 - Epoch 1/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%
🎉 New best Stage 1 validation accuracy: 19.06%. Saving model...


Stage 1 - Epoch 2/15: 100%|██████████| 813/813 [00:46<00:00, 17.41it/s]
Stage 1 - Epoch 2/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 27.03it/s]


Stage 1 - Epoch 2/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 3/15: 100%|██████████| 813/813 [00:46<00:00, 17.52it/s]
Stage 1 - Epoch 3/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 27.36it/s]


Stage 1 - Epoch 3/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 4/15: 100%|██████████| 813/813 [00:45<00:00, 17.71it/s]
Stage 1 - Epoch 4/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 27.18it/s]


Stage 1 - Epoch 4/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 5/15: 100%|██████████| 813/813 [00:45<00:00, 17.83it/s]
Stage 1 - Epoch 5/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 26.53it/s]


Stage 1 - Epoch 5/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 6/15: 100%|██████████| 813/813 [00:46<00:00, 17.54it/s]
Stage 1 - Epoch 6/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 26.43it/s]


Stage 1 - Epoch 6/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 7/15: 100%|██████████| 813/813 [00:45<00:00, 17.75it/s]
Stage 1 - Epoch 7/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 26.94it/s]


Stage 1 - Epoch 7/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 8/15: 100%|██████████| 813/813 [00:47<00:00, 17.20it/s]
Stage 1 - Epoch 8/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 26.31it/s]


Stage 1 - Epoch 8/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 9/15: 100%|██████████| 813/813 [00:45<00:00, 17.75it/s]
Stage 1 - Epoch 9/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 25.31it/s]


Stage 1 - Epoch 9/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 10/15: 100%|██████████| 813/813 [00:46<00:00, 17.61it/s]
Stage 1 - Epoch 10/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 25.90it/s]


Stage 1 - Epoch 10/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 11/15: 100%|██████████| 813/813 [00:46<00:00, 17.48it/s]
Stage 1 - Epoch 11/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 25.93it/s]


Stage 1 - Epoch 11/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 12/15: 100%|██████████| 813/813 [00:45<00:00, 17.73it/s]
Stage 1 - Epoch 12/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 26.82it/s]


Stage 1 - Epoch 12/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 13/15: 100%|██████████| 813/813 [00:45<00:00, 17.75it/s]
Stage 1 - Epoch 13/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 26.77it/s]


Stage 1 - Epoch 13/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 14/15: 100%|██████████| 813/813 [00:46<00:00, 17.58it/s]
Stage 1 - Epoch 14/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 27.30it/s]


Stage 1 - Epoch 14/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%


Stage 1 - Epoch 15/15: 100%|██████████| 813/813 [00:46<00:00, 17.34it/s]
Stage 1 - Epoch 15/15 [Val]: 100%|██████████| 91/91 [00:03<00:00, 26.36it/s]

Stage 1 - Epoch 15/15 | Train Loss: nan | Val Loss: 1.7905 | Val Acc: 19.06%

✅ Stage 1 training complete.
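
The Stage 1 loop calls `clip_grad_norm_` without first calling `scaler.unscale_(optimizer)`, so the clipping threshold is applied to gradients that still carry the AMP loss scale. The sketch below shows the ordering PyTorch documents for combining AMP with gradient clipping, plus a guard that skips a batch whose loss is non-finite. It is a minimal illustration, not a confirmed fix for this notebook: the function name `amp_train_step`, the toy linear model, and the random data are all hypothetical stand-ins, and with Wav2Vec2 the forward call would be `model(inputs).logits`. Since the logged loss itself is `NaN`, the root cause may lie in the forward pass rather than in the clipping order.

```python
import torch
import torch.nn as nn


def amp_train_step(model, inputs, labels, optimizer, criterion, scaler, device, max_norm=1.0):
    """One training step: autocast forward, scaled backward, unscale, clip, step.

    Returns the loss as a float, or None if the loss was non-finite and the
    batch was skipped.
    """
    use_amp = device.type == "cuda"
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device.type, enabled=use_amp):
        logits = model(inputs)
        loss = criterion(logits.float(), labels)

    if not torch.isfinite(loss):
        return None  # skip this batch instead of propagating NaN/inf

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # bring gradients back to their true scale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)      # the step is skipped internally if gradients are non-finite
    scaler.update()
    return loss.item()


# Tiny stand-in model and random data so the sketch runs on CPU or GPU.
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = nn.Linear(16, 6).to(device)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3)
toy_scaler = torch.cuda.amp.GradScaler(enabled=(device.type == "cuda"))
x = torch.randn(8, 16, device=device)
y = torch.randint(0, 6, (8,), device=device)
step_loss = amp_train_step(toy_model, x, y, toy_optimizer, nn.CrossEntropyLoss(), toy_scaler, device)
print(step_loss)
```

Returning `None` for a skipped batch lets the caller count how many batches were dropped, which is more informative than a single epoch-level `nan`.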





In [None]:
# ===================================================================
# CELL 5: STAGE 2 - ADAPTING TO NATURAL SPEECH (IEMOCAP)
# ===================================================================
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import Wav2Vec2ForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score
from tqdm import tqdm
import os

# --- Configuration for Stage 2 ---
LEARNING_RATE = 1e-5 # Use a smaller LR for the second, more delicate fine-tuning stage
EPOCHS = 20
CHECKPOINT_STAGE1_PATH = "/content/drive/MyDrive/ser_project/wav2vec2_stage1_acted_best.pth"
CHECKPOINT_STAGE2_PATH = "/content/drive/MyDrive/ser_project/wav2vec2_stage2_final_best.pth"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
unified_emotion_labels = ["neutral", "happy", "sad", "angry", "fearful", "disgust"]
emotion_to_idx = {e: i for i, e in enumerate(unified_emotion_labels)}

# --- Prepare IEMOCAP-only data splits ---
# These variables (train_files, etc.) must be available from Cell 2
iemocap_train_files = [f for f in train_files if 'iemocap_data' in f]
iemocap_val_files = [f for f in val_files if 'iemocap_data' in f]
iemocap_train_labels = [emotion_to_idx[lbl] for i, lbl in enumerate(train_labels_str) if 'iemocap_data' in train_files[i]]
iemocap_val_labels = [emotion_to_idx[lbl] for i, lbl in enumerate(val_labels_str) if 'iemocap_data' in val_files[i]]

# The WavDataset and collate_fn must be available from Cell 3
train_dataset = WavDataset(iemocap_train_files, iemocap_train_labels)
val_dataset = WavDataset(iemocap_val_files, iemocap_val_labels)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
print(f"Starting Stage 2: Adapting on {len(train_dataset)} natural IEMOCAP samples...")

# --- Load the Stage 1 Model ---
print("Loading Stage 1 model (Acted Speech Expert)...")
model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base-960h", num_labels=len(unified_emotion_labels))
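# NOTE: this assignment only sets a plain attribute; it does not change which parameters are trainable.
model.freeze_feature_extractor = False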
stage1_checkpoint = torch.load(CHECKPOINT_STAGE1_PATH)
model.load_state_dict(stage1_checkpoint['model_state_dict'])
model = model.to(device)

# --- Initialize a new Optimizer, Scheduler, and Scaler for Stage 2 ---
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()
num_training_steps = len(train_loader) * EPOCHS
num_warmup_steps = int(0.1 * num_training_steps)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)
scaler = torch.cuda.amp.GradScaler()

# --- Stage 2 Training Loop with AMP ---
best_val_acc = 0.0
for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0
    for inputs, labels in tqdm(train_loader, desc=f"Stage 2 - Epoch {epoch+1}/{EPOCHS}"):
        inputs, labels = inputs.to(device), labels.to(device)
        with torch.cuda.amp.autocast():
            outputs = model(inputs).logits
            loss = criterion(outputs, labels)

        optimizer.zero_grad()
        scaler.scale(loss).backward()
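        # NOTE: scaler.unscale_(optimizer) is not called first, so this clips gradients
        # that are still multiplied by the AMP loss scale.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)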
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        running_loss += loss.item() * inputs.size(0)
    train_loss = running_loss / len(train_dataset)

    # Validation
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in tqdm(val_loader, desc=f"Stage 2 - Epoch {epoch+1}/{EPOCHS} [Val]"):
            inputs, labels = inputs.to(device), labels.to(device)
            with torch.cuda.amp.autocast():
                outputs = model(inputs).logits
                loss = criterion(outputs, labels)
            val_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    val_accuracy = 100 * correct / total
    val_loss /= len(val_dataset)
    print(f"Stage 2 - Epoch {epoch+1}/{EPOCHS} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_accuracy:.2f}%")

    if val_accuracy > best_val_acc:
        best_val_acc = val_accuracy
        print(f"🎉 New best Stage 2 validation accuracy: {best_val_acc:.2f}%. Saving final model...")
        torch.save({'model_state_dict': model.state_dict()}, CHECKPOINT_STAGE2_PATH)

print("\n✅ Stage 2 training complete. The final model is saved.")

Starting Stage 2: Adapting on 2632 natural IEMOCAP samples...
Loading Stage 1 model (Acted Speech Expert)...


Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Stage 2 - Epoch 1/20: 100%|██████████| 329/329 [12:30<00:00,  2.28s/it]
Stage 2 - Epoch 1/20 [Val]: 100%|██████████| 37/37 [01:18<00:00,  2.13s/it]


Stage 2 - Epoch 1/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%
🎉 New best Stage 2 validation accuracy: 15.12%. Saving final model...


Stage 2 - Epoch 2/20: 100%|██████████| 329/329 [00:36<00:00,  8.97it/s]
Stage 2 - Epoch 2/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 20.13it/s]


Stage 2 - Epoch 2/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 3/20: 100%|██████████| 329/329 [00:34<00:00,  9.44it/s]
Stage 2 - Epoch 3/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.92it/s]


Stage 2 - Epoch 3/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 4/20: 100%|██████████| 329/329 [00:33<00:00,  9.72it/s]
Stage 2 - Epoch 4/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.89it/s]


Stage 2 - Epoch 4/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 5/20: 100%|██████████| 329/329 [00:33<00:00,  9.78it/s]
Stage 2 - Epoch 5/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 20.14it/s]


Stage 2 - Epoch 5/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 6/20: 100%|██████████| 329/329 [00:33<00:00,  9.91it/s]
Stage 2 - Epoch 6/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.87it/s]


Stage 2 - Epoch 6/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 7/20: 100%|██████████| 329/329 [00:33<00:00,  9.96it/s]
Stage 2 - Epoch 7/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 20.29it/s]


Stage 2 - Epoch 7/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 8/20: 100%|██████████| 329/329 [00:33<00:00,  9.96it/s]
Stage 2 - Epoch 8/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.25it/s]


Stage 2 - Epoch 8/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 9/20: 100%|██████████| 329/329 [00:33<00:00,  9.85it/s]
Stage 2 - Epoch 9/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 20.09it/s]


Stage 2 - Epoch 9/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 10/20: 100%|██████████| 329/329 [00:33<00:00,  9.92it/s]
Stage 2 - Epoch 10/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.69it/s]


Stage 2 - Epoch 10/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 11/20: 100%|██████████| 329/329 [00:33<00:00,  9.91it/s]
Stage 2 - Epoch 11/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.17it/s]


Stage 2 - Epoch 11/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 12/20: 100%|██████████| 329/329 [00:33<00:00,  9.95it/s]
Stage 2 - Epoch 12/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.09it/s]


Stage 2 - Epoch 12/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 13/20: 100%|██████████| 329/329 [00:32<00:00, 10.02it/s]
Stage 2 - Epoch 13/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.61it/s]


Stage 2 - Epoch 13/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 14/20: 100%|██████████| 329/329 [00:32<00:00,  9.97it/s]
Stage 2 - Epoch 14/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 20.01it/s]


Stage 2 - Epoch 14/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 15/20: 100%|██████████| 329/329 [00:33<00:00,  9.94it/s]
Stage 2 - Epoch 15/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.63it/s]


Stage 2 - Epoch 15/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 16/20: 100%|██████████| 329/329 [00:32<00:00,  9.98it/s]
Stage 2 - Epoch 16/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.75it/s]


Stage 2 - Epoch 16/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 17/20: 100%|██████████| 329/329 [00:32<00:00, 10.01it/s]
Stage 2 - Epoch 17/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 20.03it/s]


Stage 2 - Epoch 17/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 18/20: 100%|██████████| 329/329 [00:32<00:00, 10.04it/s]
Stage 2 - Epoch 18/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 20.10it/s]


Stage 2 - Epoch 18/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 19/20: 100%|██████████| 329/329 [00:32<00:00, 10.04it/s]
Stage 2 - Epoch 19/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.79it/s]


Stage 2 - Epoch 19/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%


Stage 2 - Epoch 20/20: 100%|██████████| 329/329 [00:33<00:00,  9.91it/s]
Stage 2 - Epoch 20/20 [Val]: 100%|██████████| 37/37 [00:01<00:00, 19.67it/s]

Stage 2 - Epoch 20/20 | Train Loss: nan | Val Loss: 1.7929 | Val Acc: 15.12%

✅ Stage 2 training complete. The final model is saved.
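
Both training loops accumulate `running_loss += loss.item() * inputs.size(0)`, so a single non-finite batch is enough to turn the whole epoch's reported loss into `nan`. The epoch-level number therefore cannot tell us whether one batch or every batch was affected. The sketch below, using made-up loss values and a hypothetical `FiniteLossMeter` helper, shows one way to keep the running mean meaningful while counting the bad batches.

```python
import math


class FiniteLossMeter:
    """Accumulate a sample-weighted mean loss while counting non-finite batches."""

    def __init__(self):
        self.total = 0.0
        self.samples = 0
        self.bad_batches = 0

    def update(self, batch_loss, batch_size):
        if math.isfinite(batch_loss):
            self.total += batch_loss * batch_size
            self.samples += batch_size
        else:
            self.bad_batches += 1  # record the failure instead of poisoning the sum

    @property
    def mean(self):
        return self.total / self.samples if self.samples else float("nan")


# Made-up per-batch losses: one NaN is enough to poison a plain running sum.
losses = [1.80, float("nan"), 1.75, 1.70]
naive_sum = sum(batch_loss * 8 for batch_loss in losses)

meter = FiniteLossMeter()
for batch_loss in losses:
    meter.update(batch_loss, batch_size=8)

print(math.isnan(naive_sum), meter.bad_batches, round(meter.mean, 4))
```

Logging `bad_batches` per epoch would show directly whether the instability is occasional or affects every step.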



