# Phase 1: Specialist Model Baseline

**Objective:** Establish a strong performance baseline for Speech Emotion Recognition (SER).

We apply a computer vision approach to a single, clean dataset (RAVDESS). This involves two main parts:
1.  **Data Transformation:** Converting raw audio files into image-like Mel spectrograms.
2.  **Modeling & Training:** Using a pre-trained Convolutional Neural Network (CNN) via Transfer Learning to build a "Specialist Model" that is an expert on this dataset.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install resampy



## Data Transformation: From Audio to Images

In this step, we process the entire RAVDESS dataset. Each `.wav` file is loaded at 44.1 kHz, trimmed to a window of at most 3 seconds starting 0.5 seconds into the recording, and converted into a log-Mel spectrogram with 128 Mel bands. This turns the audio problem into an image classification problem, so we can reuse pre-trained computer vision models. The spectrograms are saved as `.npy` files for efficient loading during training.
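As a rough check on the resulting array size, the number of time frames can be estimated from the clip length. The sketch below assumes librosa's default `hop_length=512` with centered frames (the preprocessing cell overrides neither); `expected_frames` is a hypothetical helper name, not part of librosa.

```python
def expected_frames(duration_s: float, sr: int, hop_length: int = 512) -> int:
    """Estimate the number of STFT frames for a centered (padded) signal."""
    n_samples = int(round(duration_s * sr))
    # With centered framing there is one frame per hop, plus the initial frame.
    return 1 + n_samples // hop_length


# A full 3-second clip at 44.1 kHz with 128 Mel bands
n_mels = 128
frames = expected_frames(3.0, 44100)
print((n_mels, frames))  # (128, 259)
```

Recordings with less than 3 seconds of audio after the 0.5-second offset yield fewer frames, which is why a fixed target width is enforced when the spectrograms are loaded for training.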

In [None]:
import os
import librosa
import numpy as np
from tqdm import tqdm

# --- Configuration ---
# Make sure these paths match your Google Drive structure
AUDIO_PATH = "/content/drive/MyDrive/ser_project/ravdess_data/"
SPECTROGRAM_PATH = "/content/drive/MyDrive/ser_project/ravdess_spectrograms/"

# Create the output directory if it doesn't exist
os.makedirs(SPECTROGRAM_PATH, exist_ok=True)

# --- Preprocessing Script ---
print("Starting audio to spectrogram conversion...")

actor_folders = [f for f in os.listdir(AUDIO_PATH) if os.path.isdir(os.path.join(AUDIO_PATH, f))]
for actor_folder in tqdm(actor_folders, desc="Processing Actors"):
    actor_path = os.path.join(AUDIO_PATH, actor_folder)
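    for file_name in os.listdir(actor_path):
        # Skip anything that is not a .wav file (e.g. hidden files)
        if not file_name.lower().endswith(".wav"):
            continue
        file_path = os.path.join(actor_path, file_name)
        try:
            # Load at 44.1 kHz, skip the first 0.5 s, and keep at most 3 s
            audio, sr = librosa.load(file_path, res_type='kaiser_fast', duration=3, sr=44100, offset=0.5)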

            # Generate Mel Spectrogram
            spectrogram = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128, fmax=8000)
            db_spectrogram = librosa.power_to_db(spectrogram, ref=np.max)

            # Save the spectrogram as a NumPy array
            output_filename = os.path.join(SPECTROGRAM_PATH, f"{os.path.splitext(file_name)[0]}.npy")
            np.save(output_filename, db_spectrogram)

        except Exception as e:
            print(f"\nError processing {file_path}: {e}")

print("\nSpectrogram conversion complete.")

Starting audio to spectrogram conversion...


Processing Actors: 100%|██████████| 24/24 [02:20<00:00,  5.87s/it]


Spectrogram conversion complete.
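Each saved `.npy` file keeps the stem of its source recording, and RAVDESS encodes its metadata in that stem as seven hyphen-separated two-digit fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). A minimal sketch of recovering those fields follows; `parse_ravdess_name` and the example path are illustrative, not part of the notebook.

```python
import os

FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")


def parse_ravdess_name(path: str) -> dict:
    """Split a RAVDESS file name into its seven named two-digit codes."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Unexpected RAVDESS file name: {path}")
    return dict(zip(FIELDS, parts))


info = parse_ravdess_name("Actor_12/03-01-06-01-02-01-12.npy")
print(info["emotion"], info["actor"])  # 06 12
```

The emotion code is the third field, which is the one the training cell uses to look up class labels.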





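## A Small CNN for Reference

The cell below defines a small three-block CNN that could be trained from scratch on the spectrograms. It is not used by the training cell later in this notebook, which fine-tunes a pre-trained ResNet18 instead.

In [None]: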
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioCNN(nn.Module):
    def __init__(self, num_classes=8):
        super(AudioCNN, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(16)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.pool3 = nn.MaxPool2d(kernel_size=4, stride=4)

        # Flattening and fully connected layers
        self.flatten = nn.Flatten()
        # The number of input features depends on the spectrogram size, so let
        # PyTorch infer it on the first forward pass (for example, a 128 x 300
        # input yields 64 * 8 * 18 = 9216 features).
        self.fc1 = nn.LazyLinear(out_features=128)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(in_features=128, out_features=num_classes)

    def forward(self, x):
        # Add a channel dimension
        x = x.unsqueeze(1)

        # Conv blocks
        x = self.pool1(F.relu(self.bn1(self.conv1(x))))
        x = self.pool2(F.relu(self.bn2(self.conv2(x))))
        x = self.pool3(F.relu(self.bn3(self.conv3(x))))

        # Flatten and classify
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

print("CNN Model class defined.")

CNN Model class defined.
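The input size of the first fully connected layer follows from the pooling arithmetic: the 3x3 convolutions with padding 1 preserve height and width, and each max-pool floor-divides both by its kernel size. The sketch below works this out in plain Python. `flattened_features` is a hypothetical helper, and the 128 x 300 input is an assumption based on the 128 Mel bands and the target width of 300 frames used by the dataset class in this notebook.

```python
def flattened_features(height: int, width: int,
                       pool_sizes=(2, 2, 4), out_channels: int = 64) -> int:
    """Feature count after the conv/pool stack, just before nn.Flatten."""
    for k in pool_sizes:
        # MaxPool2d(kernel_size=k, stride=k) floors each spatial dimension
        height //= k
        width //= k
    return out_channels * height * width


print(flattened_features(128, 300))  # 9216  (64 * 8 * 18)
print(flattened_features(128, 128))  # 4096  (64 * 8 * 8)
```

An `in_features` of `64 * 8 * 8` therefore corresponds to a 128 x 128 input, whereas 300-frame spectrograms produce 9216 features.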


## Modeling, Training, and Evaluation

With the data prepared, we now build and train the model using **Transfer Learning**. Instead of training a small CNN from scratch, we load a **ResNet18** pre-trained on ImageNet and replace its final classification layer with one that outputs our 8 emotion classes. All layers are then fine-tuned on the spectrograms with an 80/10/10 train/validation/test split: validation loss and accuracy are tracked after every epoch, and the held-out test set is used once for the final evaluation.
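ImageNet-trained ResNets expect three-channel images, while a spectrogram has a single channel. One way to bridge that gap, sketched here with NumPy only, is to fix the width, rescale to [0, 1], and repeat the array across three channels. `to_three_channel` is a hypothetical helper, and padding with the clip's minimum value (so padded frames read as silence) is an assumption of this sketch rather than something stated above.

```python
import numpy as np


def to_three_channel(spec: np.ndarray, target_width: int = 300) -> np.ndarray:
    """Pad/truncate a (n_mels, frames) dB spectrogram to (3, n_mels, target_width)."""
    width = spec.shape[1]
    if width < target_width:
        # Assumption: pad with the quietest value so the padding looks like silence
        spec = np.pad(spec, ((0, 0), (0, target_width - width)),
                      mode="constant", constant_values=spec.min())
    else:
        spec = spec[:, :target_width]

    # Min-max scale to [0, 1]; a constant array is mapped to all zeros
    lo, hi = spec.min(), spec.max()
    spec = (spec - lo) / (hi - lo) if hi > lo else np.zeros_like(spec)

    # Repeat the single channel three times: (3, n_mels, target_width)
    return np.stack([spec] * 3, axis=0).astype(np.float32)


demo = np.random.default_rng(0).uniform(-80.0, 0.0, size=(128, 259))
image = to_three_channel(demo)
print(image.shape)  # (3, 128, 300)
```

The three channels are identical copies, so this adds no information; it only matches the input shape the pre-trained first convolution was built for.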

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from tqdm import tqdm
from torch.optim.lr_scheduler import StepLR
# Import pre-trained models from torchvision
from torchvision import models
import torch.nn.functional as F # Often needed, good to have

# --- Configuration ---
SPECTROGRAM_PATH = "/content/drive/MyDrive/ser_project/ravdess_spectrograms/"
LEARNING_RATE = 0.001 # A good starting LR for fine-tuning
BATCH_SIZE = 32
EPOCHS = 30 # We often need fewer epochs when fine-tuning
CHECKPOINT_PATH = "/content/drive/MyDrive/ser_project/resnet_checkpoint.pth"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

emotion_map = { "01": 0, "02": 1, "03": 2, "04": 3, "05": 4, "06": 5, "07": 6, "08": 7 }
emotion_labels_list = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprise"]

# --- Custom PyTorch Dataset (Modified for 3-Channels) ---
class SpectrogramDataset(Dataset):
    def __init__(self, file_paths, labels, target_width=300):
        self.file_paths = file_paths
        self.labels = labels
        self.target_width = target_width

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        spectrogram = np.load(self.file_paths[idx])
        label = self.labels[idx]

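        # Pad or truncate along the time axis to a fixed width.
        # Note: the dB values are <= 0 (ref=np.max), so constant padding with 0
        # fills with the peak level; constant_values=spectrogram.min() would pad with silence.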
        current_width = spectrogram.shape[1]
        if current_width < self.target_width:
            padding = self.target_width - current_width
            spectrogram = np.pad(spectrogram, ((0, 0), (0, padding)), mode='constant')
        elif current_width > self.target_width:
            spectrogram = spectrogram[:, :self.target_width]

        # Normalize to [0, 1]
        spec_min = spectrogram.min()
        spec_max = spectrogram.max()
        if spec_max > spec_min:
            spectrogram = (spectrogram - spec_min) / (spec_max - spec_min)

        # Stack the single-channel spectrogram to create a 3-channel image for ResNet
        spectrogram = np.stack([spectrogram, spectrogram, spectrogram], axis=0)

        return torch.tensor(spectrogram, dtype=torch.float32), torch.tensor(label, dtype=torch.long)

# --- Prepare Data ---
all_files = [os.path.join(SPECTROGRAM_PATH, f) for f in os.listdir(SPECTROGRAM_PATH) if f.endswith('.npy')]
all_labels = [emotion_map[os.path.basename(f).split("-")[2]] for f in all_files]

# 80% train, 10% validation, 10% test split
train_files, temp_files, train_labels, temp_labels = train_test_split(
    all_files, all_labels, test_size=0.2, random_state=42, stratify=all_labels
)
val_files, test_files, val_labels, test_labels = train_test_split(
    temp_files, temp_labels, test_size=0.5, random_state=42, stratify=temp_labels
)

train_dataset = SpectrogramDataset(train_files, train_labels)
val_dataset = SpectrogramDataset(val_files, val_labels)
test_dataset = SpectrogramDataset(test_files, test_labels)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# --- Initialize Pre-trained ResNet18 Model ---
model = models.resnet18(weights='IMAGENET1K_V1')

# Adapt the final fully-connected layer for our 8 emotion classes
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, len(emotion_labels_list))

model = model.to(device)

# --- Optimizer, Scheduler, and Checkpoint Loading ---
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()
scheduler = StepLR(optimizer, step_size=7, gamma=0.1)
start_epoch = 0
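if os.path.exists(CHECKPOINT_PATH):
    print("Checkpoint found! Loading model state...")
    # map_location lets a GPU-saved checkpoint load on a CPU-only runtime
    checkpoint = torch.load(CHECKPOINT_PATH, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    # Restore the LR schedule if the checkpoint contains it
    if 'scheduler_state_dict' in checkpoint:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    start_epoch = checkpoint['epoch']
    print(f"Resuming training from epoch {start_epoch + 1}")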

# --- Training Loop with Validation ---
print("Starting training with ResNet18...")
for epoch in range(start_epoch, EPOCHS):
    model.train()
    running_loss = 0.0
    for inputs, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS} [Train]"):
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * inputs.size(0)

    train_loss = running_loss / len(train_dataset)

    # --- Validation Phase ---
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in tqdm(val_loader, desc=f"Epoch {epoch+1}/{EPOCHS} [Val]"):
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    val_accuracy = 100 * correct / total
    val_loss /= len(val_dataset)

    print(f"Epoch {epoch+1}/{EPOCHS} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_accuracy:.2f}%")

    scheduler.step()

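    # Save everything needed to resume, including the LR scheduler state
    torch.save({
        'epoch': epoch + 1,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
    }, CHECKPOINT_PATH)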

# --- Final Evaluation on the TEST SET ---
model.eval()
all_preds = []
all_true = []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_true.extend(labels.cpu().numpy())

accuracy = accuracy_score(all_true, all_preds)
print("\n--- FINAL EVALUATION ---")
print(f"ResNet18 Model Accuracy on Test Set: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(all_true, all_preds, target_names=emotion_labels_list, zero_division=0))

Using device: cuda
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth




Starting training with ResNet18...


Epoch 1/30 | Train Loss: 1.4527 | Val Loss: 2.0887 | Val Acc: 45.14%
Epoch 2/30 | Train Loss: 0.9335 | Val Loss: 1.5112 | Val Acc: 50.00%
Epoch 3/30 | Train Loss: 0.7445 | Val Loss: 1.7743 | Val Acc: 45.14%
Epoch 4/30 | Train Loss: 0.5086 | Val Loss: 1.5974 | Val Acc: 48.61%
Epoch 5/30 | Train Loss: 0.4297 | Val Loss: 0.9689 | Val Acc: 67.36%
Epoch 6/30 | Train Loss: 0.3207 | Val Loss: 2.3494 | Val Acc: 40.28%
Epoch 7/30 | Train Loss: 0.2567 | Val Loss: 1.2624 | Val Acc: 68.75%
Epoch 8/30 | Train Loss: 0.0959 | Val Loss: 0.5771 | Val Acc: 81.94%
Epoch 9/30 | Train Loss: 0.0336 | Val Loss: 0.5338 | Val Acc: 83.33%
Epoch 10/30 | Train Loss: 0.0256 | Val Loss: 0.5341 | Val Acc: 84.03%
Epoch 11/30 | Train Loss: 0.0179 | Val Loss: 0.5262 | Val Acc: 84.03%
Epoch 12/30 | Train Loss: 0.0161 | Val Loss: 0.5207 | Val Acc: 84.03%
Epoch 13/30 | Train Loss: 0.0162 | Val Loss: 0.5304 | Val Acc: 84.72%
Epoch 14/30 | Train Loss: 0.0151 | Val Loss: 0.5277 | Val Acc: 84.72%
Epoch 15/30 | Train Loss: 0.0087 | Val Loss: 0.5226 | Val Acc: 85.42%
Epoch 16/30 | Train Loss: 0.0084 | Val Loss: 0.5207 | Val Acc: 84.72%
Epoch 17/30 | Train Loss: 0.0115 | Val Loss: 0.5289 | Val Acc: 85.42%
Epoch 18/30 | Train Loss: 0.0110 | Val Loss: 0.5277 | Val Acc: 85.42%
Epoch 19/30 | Train Loss: 0.0079 | Val Loss: 0.5229 | Val Acc: 86.11%
Epoch 20/30 | Train Loss: 0.0103 | Val Loss: 0.5284 | Val Acc: 84.03%
Epoch 21/30 | Train Loss: 0.0074 | Val Loss: 0.5240 | Val Acc: 87.50%
Epoch 22/30 | Train Loss: 0.0078 | Val Loss: 0.5181 | Val Acc: 86.81%
Epoch 23/30 | Train Loss: 0.0087 | Val Loss: 0.5214 | Val Acc: 86.81%
Epoch 24/30 | Train Loss: 0.0095 | Val Loss: 0.5205 | Val Acc: 84.72%
Epoch 25/30 | Train Loss: 0.0094 | Val Loss: 0.5245 | Val Acc: 85.42%
Epoch 26/30 | Train Loss: 0.0083 | Val Loss: 0.5232 | Val Acc: 85.42%
Epoch 27/30 | Train Loss: 0.0072 | Val Loss: 0.5216 | Val Acc: 86.81%
Epoch 28/30 | Train Loss: 0.0093 | Val Loss: 0.5294 | Val Acc: 84.03%
Epoch 29/30 | Train Loss: 0.0073 | Val Loss: 0.5210 | Val Acc: 86.11%
Epoch 30/30 | Train Loss: 0.0102 | Val Loss: 0.5193 | Val Acc: 86.81%

--- FINAL EVALUATION ---
ResNet18 Model Accuracy on Test Set: 82.64%

Classification Report:
              precision    recall  f1-score   support

     neutral       0.78      0.70      0.74        10
        calm       0.70      0.84      0.76        19
       happy       0.81      0.89      0.85        19
         sad       0.79      0.79      0.79        19
       angry       0.89      0.89      0.89        19
     fearful       0.88      0.70      0.78        20
     disgust       1.00      0.84      0.91        19
    surprise       0.81      0.89      0.85        19

    accuracy                           0.83       144
   macro avg       0.83      0.82      0.82       144
weighted avg       0.84      0.83      0.83       144
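
The reported accuracy can be cross-checked against the per-class rows, since overall accuracy equals the support-weighted mean of recall. The snippet below redoes that arithmetic from the rounded values in the report, so each per-class count is only recovered after rounding to the nearest integer.

```python
# (recall, support) per class, copied from the classification report
report = {
    "neutral": (0.70, 10), "calm": (0.84, 19), "happy": (0.89, 19),
    "sad": (0.79, 19), "angry": (0.89, 19), "fearful": (0.70, 20),
    "disgust": (0.84, 19), "surprise": (0.89, 19),
}

total = sum(support for _, support in report.values())
# recall * support approximates the number of correct predictions in each class
correct = sum(round(recall * support) for recall, support in report.values())

print(total, correct, f"{100 * correct / total:.2f}%")  # 144 119 82.64%
```

With only 10 to 20 test clips per class, a single misclassified clip shifts a class's recall by 5 to 10 percentage points, so the per-class figures should be read with that granularity in mind.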

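The sharp drop in validation loss at epoch 8 in the log above lines up with the first learning-rate decay. With `StepLR(step_size=7, gamma=0.1)` stepped once per epoch, the rate in effect during 0-indexed epoch `e` is `lr0 * gamma ** (e // step_size)`. The helper below is a plain-Python sketch of that formula (`step_lr` is a hypothetical name), not a replacement for the scheduler.

```python
def step_lr(lr0: float, epoch: int, step_size: int = 7, gamma: float = 0.1) -> float:
    """Learning rate in effect during a 0-indexed epoch under a StepLR schedule."""
    return lr0 * gamma ** (epoch // step_size)


for epoch in (0, 6, 7, 14, 21, 28):
    # Printed epochs are 1-indexed to match the training log
    print(f"epoch {epoch + 1:>2}: lr = {step_lr(0.001, epoch):.0e}")
```

Under this schedule the rate falls by a factor of ten every seven epochs, reaching roughly 1e-7 from epoch 29 onward, which may explain why the losses barely move in the second half of training.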
