# Fine-Tuning a Pretrained PANN Model for Audio Classification

This notebook fine-tunes a pretrained **CNN14** model, one of the **PANNs** (Pretrained Audio Neural Networks) architectures, on a custom 7-class audio dataset. The classes are *breath, cough, crying, laugh, screaming, sneeze,* and *yawn*.

> **Note:** The original `metadata of test set.csv` mislabeled every file in the "yawn" class as **"yawm"**; this was corrected before training.

## About PANNs

**PANNs** stands for **Pretrained Audio Neural Networks**. These models were introduced in the paper:

> **PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition**  
> *Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.*

### Architecture Overview – CNN14

- **Input Processing:**  
  The CNN14 model takes raw audio waveforms (resampled to 32 kHz) as input and internally computes time–frequency representations using the Short-Time Fourier Transform (STFT) and mel spectrogram extraction.

- **Convolutional Blocks:**  
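  Six convolutional blocks, each with two 3×3 convolutions followed by batch normalization and ReLU, extract progressively higher-level features from the log-mel spectrogram. Their pretrained weights encode acoustic patterns learned from AudioSet.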

- **Final Layers:**  
  Originally, the model’s final fully connected layer was trained on AudioSet (527 classes). For our task, the final classification layer is reinitialized for our 7-class problem.

The model’s code is taken from GitHub ([https://github.com/qiuqiangkong/audioset_tagging_cnn.git](https://github.com/qiuqiangkong/audioset_tagging_cnn.git)), and the pretrained weights (e.g., CNN14 achieving mAP = 0.431 on AudioSet) were downloaded from Zenodo ([https://zenodo.org/record/3987831](https://zenodo.org/record/3987831)).
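
To make the input processing concrete, the arithmetic below shows what the front end produces for the settings used in this notebook (32 kHz sample rate, 1024-sample window, 320-sample hop, 64 mel bins). It assumes a centered STFT, where the frame count is `1 + num_samples // hop`; `num_frames` is an illustrative helper, not a function from the PANNs repository.

```python
SAMPLE_RATE = 32000  # Hz
N_FFT = 1024         # STFT window size in samples
HOP_LENGTH = 320     # samples between consecutive frames
N_MELS = 64          # mel bins per frame

def num_frames(num_samples, hop_length=HOP_LENGTH):
    """Frame count of a centered STFT (assumption: the signal is padded on both sides)."""
    return 1 + num_samples // hop_length

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # 100.0 frames per second
window_ms = 1000 * N_FFT / SAMPLE_RATE         # 32.0 ms per analysis window

clip_samples = 5 * SAMPLE_RATE                 # a 5-second clip
print(frames_per_second, window_ms)            # 100.0 32.0
print(num_frames(clip_samples), N_MELS)        # 501 64
```

Under these assumptions, a 5-second clip becomes a log-mel spectrogram of about 501 frames by 64 mel bins before it reaches the convolutional blocks.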

## Training Details

- **Pretrained Weights:**  
  I initially attempted to train the model from scratch (without using pretrained weights), but the performance was very poor—around **40% accuracy**. Therefore, I chose to fine-tune a model initialized with pretrained weights (excluding the final classification layer), which dramatically improved performance.

- **Training Setup:**  
  - **Number of Epochs:** 20  
  - **Learning Rate:** 1e-4  
  - **Optimizer:** Adam  
  - **Batch Size:** 8  
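
Loading pretrained weights while skipping the classification head comes down to filtering the checkpoint's entries by key name before merging them into the new model's state. The sketch below shows the idea with plain dictionaries and placeholder strings instead of tensors; `merge_pretrained` is a hypothetical helper, and the key names are illustrative examples in the style of the CNN14 checkpoint.

```python
def merge_pretrained(model_state, pretrained_state, skip_prefix="fc_audioset"):
    """Copy pretrained entries into a copy of model_state, skipping the classifier head."""
    merged = dict(model_state)
    skipped = []
    for key, value in pretrained_state.items():
        if key.startswith(skip_prefix) or key not in model_state:
            skipped.append(key)  # classifier head, or a key the model lacks: do not copy
        else:
            merged[key] = value
    return merged, skipped

# Placeholder strings stand in for weight tensors.
model_state = {
    "conv_block1.conv1.weight": "random",
    "fc1.weight": "random",
    "fc_audioset.weight": "random (7 classes)",
    "fc_audioset.bias": "random (7 classes)",
}
pretrained_state = {
    "conv_block1.conv1.weight": "pretrained",
    "fc1.weight": "pretrained",
    "fc_audioset.weight": "pretrained (527 classes)",
    "fc_audioset.bias": "pretrained (527 classes)",
}

merged, skipped = merge_pretrained(model_state, pretrained_state)
print(merged["fc1.weight"])          # pretrained
print(merged["fc_audioset.weight"])  # random (7 classes)
print(skipped)                       # ['fc_audioset.weight', 'fc_audioset.bias']
```

With real tensors, the merged dictionary is what gets passed to the model's `load_state_dict`, so every layer except the 7-class head starts from the AudioSet weights.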

## Results

- **3-Fold Cross-Validation (on Training Data, weighted averages):**  
  - **Average F1 Score:** ~95.4%  
  - **Average Precision:** ~95.5%  
  - **Average Recall:** ~95.4%

- **Final Model Evaluation on Test Set:**  
  - **Test Accuracy:** ~90.5%

These results show that fine-tuning the pretrained CNN14 model works well on this dataset: the weighted cross-validation metrics are about 95%, and the test accuracy is about 90.5%, roughly 5 points below the cross-validation scores.
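
The cross-validation metrics are computed with `average="weighted"`, which averages the per-class scores weighted by each class's number of true samples. The sketch below re-implements that idea in plain Python for illustration only (the notebook itself uses scikit-learn; `weighted_prf` is a hypothetical helper and the labels are made up).

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over the classes present in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        p_c = tp / predicted if predicted else 0.0  # undefined precision counts as 0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n                          # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
precision, recall, f1 = weighted_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"{precision:.4f} {recall:.4f} {f1:.4f} {accuracy:.4f}")  # 0.8889 0.8333 0.8333 0.8333
```

One consequence is that weighted recall equals plain accuracy, so the cross-validation recall (~95.4%) and the test accuracy (~90.5%) measure the same quantity on different data.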

---


In [None]:
import os
import sys
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, Subset
import torchaudio
import torchaudio.transforms as transforms
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, precision_score, recall_score
import torch.nn.functional as F

# ------------------------------------------------
# 1. Add the PANNs model directory to the system path
# ------------------------------------------------

# Raw string: each backslash is written once.
pytorch_path = r"C:\Users\Harsh\Desktop\Audio Recognition Project\audioset_tagging_cnn-master\pytorch"
sys.path.insert(0, pytorch_path)

# Import the CNN14 model from the repository.
from models import Cnn14

# ------------------------------------------------
# 2. Configuration and Paths
# ------------------------------------------------
DATASET_PATH = r"C:\Users\Harsh\Desktop\Audio Recognition Project\dataset"
PRETRAINED_MODEL_PATH = r"C:\Users\Harsh\Desktop\Audio Recognition Project\pretrained_models\PANN\Cnn14_mAP=0.431.pth"

TRAIN_METADATA_CSV = os.path.join(DATASET_PATH, "metadata of train set.csv")
TEST_METADATA_CSV  = os.path.join(DATASET_PATH, "metadata of test set.csv")

TRAIN_AUDIO_DIR = os.path.join(DATASET_PATH, "train")
TEST_AUDIO_DIR  = os.path.join(DATASET_PATH, "test")

BATCH_SIZE = 8
EPOCHS = 20
LEARNING_RATE = 1e-4
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ------------------------------------------------
# 3. Audio Processing Parameters
# ------------------------------------------------
SAMPLE_RATE = 32000
N_FFT = 1024
HOP_LENGTH = 320
N_MELS = 64
FMIN = 50
FMAX = 14000

# These values are passed to Cnn14, which computes log-mel spectrograms internally.
# They are the PANNs default settings for the 32 kHz models.

# ------------------------------------------------
# 4. Prepare Metadata and Create Class Mapping
# ------------------------------------------------
train_meta = pd.read_csv(TRAIN_METADATA_CSV)
train_meta.columns = train_meta.columns.str.strip()

# Use the "Classname" column as the label.
classes = sorted(train_meta["Classname"].unique())
class_to_idx = {cls: i for i, cls in enumerate(classes)}
num_classes = len(classes)
print("Class mapping (Classname -> Index):")
print(class_to_idx)

# ------------------------------------------------
# 5. Define the Custom Dataset (Return Raw Waveform)
# ------------------------------------------------
class AudioDataset(Dataset):
    def __init__(self, metadata_csv, audio_dir, class_to_idx, transform=None):
        self.metadata = pd.read_csv(metadata_csv)
        self.metadata.columns = self.metadata.columns.str.strip()
        self.audio_dir = audio_dir
        self.class_to_idx = class_to_idx
        self.transform = transform

    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, idx):
        row = self.metadata.iloc[idx]
        filename = row["Filename"]
        label = self.class_to_idx[row["Classname"]]
        file_path = os.path.join(self.audio_dir, filename)
        
        waveform, sr = torchaudio.load(file_path)
        # If stereo, take the first channel.
        if waveform.shape[0] > 1:
            waveform = waveform[0:1, :]
        # Resample if needed.
        if sr != SAMPLE_RATE:
            resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=SAMPLE_RATE)
            waveform = resampler(waveform)
        # Squeeze to make waveform 1D.
        waveform = waveform.squeeze(0)
        if self.transform:
            waveform = self.transform(waveform)
        return waveform, label

train_dataset = AudioDataset(TRAIN_METADATA_CSV, TRAIN_AUDIO_DIR, class_to_idx, transform=None)
test_dataset = AudioDataset(TEST_METADATA_CSV, TEST_AUDIO_DIR, class_to_idx, transform=None)

# ------------------------------------------------
# 6. Custom Collate Function for Raw Waveforms
# ------------------------------------------------
def collate_fn(batch):
    """
    Pads raw 1D waveforms in the batch along the time dimension so that all waveforms have the same length.
    """
    waveforms, labels = zip(*batch)
    max_length = max(waveform.shape[0] for waveform in waveforms)
    padded_waveforms = []
    for waveform in waveforms:
        pad_length = max_length - waveform.shape[0]
        padded_waveform = F.pad(waveform, (0, pad_length))
        padded_waveforms.append(padded_waveform)
    stacked_waveforms = torch.stack(padded_waveforms, dim=0)  # [batch_size, max_length]
    labels = torch.tensor(labels, dtype=torch.long)
    return stacked_waveforms, labels

train_loader_full = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

# ------------------------------------------------
# 7. Define Model Parameters and Initialize Model
# ------------------------------------------------
# Model parameters. The pretrained CNN14 checkpoint was trained on AudioSet (527 classes);
# since our dataset has only 7 classes, the final layer is newly initialized.
model_params = {
    "sample_rate": SAMPLE_RATE,
    "window_size": N_FFT,
    "hop_size": HOP_LENGTH,
    "mel_bins": N_MELS,
    "fmin": FMIN,
    "fmax": FMAX,
    "classes_num": num_classes  # e.g., 7
}

# Initialize the model using the Cnn14 architecture.

model = Cnn14(**model_params)
model.to(DEVICE)

# ------------------------------------------------
# 8. Load Pretrained Weights (Filtering Out Final Classification Layer)
# ------------------------------------------------
# Load the pretrained checkpoint.
checkpoint = torch.load(PRETRAINED_MODEL_PATH, map_location=DEVICE)
pretrained_dict = checkpoint["model"]

# Remove keys belonging to the final classification layer (named "fc_audioset" in Cnn14).
filtered_dict = {k: v for k, v in pretrained_dict.items() if not k.startswith("fc_audioset")}

# Update the current model's state dict with the filtered pretrained weights.
model_dict = model.state_dict()
model_dict.update(filtered_dict)
model.load_state_dict(model_dict)
print("Loaded pretrained weights (excluding final layer) successfully.")

# ------------------------------------------------
# 9. Loss and Optimizer
# ------------------------------------------------
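# Note: `model` only serves to build `model_dict`. Each fold and the final model below
# create their own criterion and optimizer, so these two objects are not used for training.
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)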

# ------------------------------------------------
# 10. 3-Fold Cross-Validation on Training Set
# ------------------------------------------------

kf = KFold(n_splits=3, shuffle=True, random_state=42)
fold_f1 = []
fold_precision = []
fold_recall = []

print("\nStarting 3-Fold Cross-Validation...")
for fold, (train_idx, val_idx) in enumerate(kf.split(train_dataset)):
    print(f"\n--- Fold {fold+1} ---")
    train_subset = Subset(train_dataset, train_idx)
    val_subset = Subset(train_dataset, val_idx)
    
    train_loader = DataLoader(train_subset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
    val_loader = DataLoader(val_subset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
    
    # Initialize a new model for this fold and load the pretrained weights (excluding final layer).
    model_fold = Cnn14(**model_params)
    model_fold.to(DEVICE)
    model_fold.load_state_dict(model_dict)
    optimizer_fold = optim.Adam(model_fold.parameters(), lr=LEARNING_RATE)
    criterion_fold = nn.CrossEntropyLoss()
    
    for epoch in range(EPOCHS):
        model_fold.train()
        running_loss = 0.0
        for waveforms, labels in tqdm(train_loader, desc=f"Fold {fold+1} Epoch {epoch+1}"):
            waveforms, labels = waveforms.to(DEVICE), labels.to(DEVICE)
            optimizer_fold.zero_grad()
            outputs = model_fold(waveforms)
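            # Note: in the upstream Cnn14, "clipwise_output" has already passed through a sigmoid,
            # so these are scores in [0, 1] rather than raw logits. Argmax predictions are unaffected,
            # but CrossEntropyLoss applies a softmax on top, which limits how low the loss can go.
            logits = outputs["clipwise_output"]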
            loss = criterion_fold(logits, labels)
            loss.backward()
            optimizer_fold.step()
            running_loss += loss.item()
        print(f"Fold {fold+1} Epoch {epoch+1} Loss: {running_loss/len(train_loader):.4f}")
    
    model_fold.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for waveforms, labels in val_loader:
            waveforms, labels = waveforms.to(DEVICE), labels.to(DEVICE)
            outputs = model_fold(waveforms)
            logits = outputs["clipwise_output"]
            _, preds = torch.max(logits, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    
    fold_f1.append(f1_score(all_labels, all_preds, average="weighted", zero_division=0))
    fold_precision.append(precision_score(all_labels, all_preds, average="weighted", zero_division=0))
    fold_recall.append(recall_score(all_labels, all_preds, average="weighted", zero_division=0))
    print(f"Fold {fold+1} - F1: {fold_f1[-1]:.4f}, Precision: {fold_precision[-1]:.4f}, Recall: {fold_recall[-1]:.4f}")

print("\n--- Average 3-Fold Cross-Validation Results ---")
print(f"F1 Score: {np.mean(fold_f1):.4f}")
print(f"Precision: {np.mean(fold_precision):.4f}")
print(f"Recall: {np.mean(fold_recall):.4f}")

# ------------------------------------------------
# 11. Train Final Model on Full Training Set and Evaluate on Test Set
# ------------------------------------------------
model_final = Cnn14(**model_params)
model_final.to(DEVICE)
model_final.load_state_dict(model_dict)  # Load pretrained weights excluding final layer.
optimizer_final = optim.Adam(model_final.parameters(), lr=LEARNING_RATE)
criterion_final = nn.CrossEntropyLoss()

print("\nTraining final model on full training set...")
for epoch in range(EPOCHS):
    model_final.train()
    total_loss = 0.0
    for waveforms, labels in tqdm(train_loader_full, desc=f"Final Model Epoch {epoch+1}/{EPOCHS}"):
        waveforms, labels = waveforms.to(DEVICE), labels.to(DEVICE)
        optimizer_final.zero_grad()
        outputs = model_final(waveforms)
        logits = outputs["clipwise_output"]
        loss = criterion_final(logits, labels)
        loss.backward()
        optimizer_final.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} - Training Loss: {total_loss/len(train_loader_full):.4f}")

torch.save(model_final.state_dict(), "trained_model_final.pth")
print("Final training complete. Model saved as 'trained_model_final.pth'.")

model_final.eval()
correct = 0
total = 0
with torch.no_grad():
    for waveforms, labels in test_loader:
        waveforms, labels = waveforms.to(DEVICE), labels.to(DEVICE)
        outputs = model_final(waveforms)
        logits = outputs["clipwise_output"]
        _, predicted = torch.max(logits, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
        
test_accuracy = correct / total
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")


Class mapping (Classname -> Index):
{'breath': 0, 'cough': 1, 'crying': 2, 'laugh': 3, 'screaming': 4, 'sneeze': 5, 'yawn': 6}
Loaded pretrained weights (excluding final layer) successfully.

Starting 3-Fold Cross-Validation...

--- Fold 1 ---


Fold 1 Epoch 1: 100%|██████████| 524/524 [01:56<00:00,  4.51it/s]


Fold 1 Epoch 1 Loss: 1.4765


Fold 1 Epoch 2: 100%|██████████| 524/524 [02:08<00:00,  4.07it/s]


Fold 1 Epoch 2 Loss: 1.3197


Fold 1 Epoch 3: 100%|██████████| 524/524 [02:17<00:00,  3.81it/s]


Fold 1 Epoch 3 Loss: 1.2947


Fold 1 Epoch 4: 100%|██████████| 524/524 [02:25<00:00,  3.61it/s]


Fold 1 Epoch 4 Loss: 1.2715


Fold 1 Epoch 5: 100%|██████████| 524/524 [02:27<00:00,  3.54it/s]


Fold 1 Epoch 5 Loss: 1.2581


Fold 1 Epoch 6: 100%|██████████| 524/524 [02:31<00:00,  3.46it/s]


Fold 1 Epoch 6 Loss: 1.2461


Fold 1 Epoch 7: 100%|██████████| 524/524 [02:33<00:00,  3.41it/s]


Fold 1 Epoch 7 Loss: 1.2400


Fold 1 Epoch 8: 100%|██████████| 524/524 [02:06<00:00,  4.13it/s]


Fold 1 Epoch 8 Loss: 1.2359


Fold 1 Epoch 9: 100%|██████████| 524/524 [02:07<00:00,  4.11it/s]


Fold 1 Epoch 9 Loss: 1.2252


Fold 1 Epoch 10: 100%|██████████| 524/524 [02:07<00:00,  4.12it/s]


Fold 1 Epoch 10 Loss: 1.2237


Fold 1 Epoch 11: 100%|██████████| 524/524 [02:07<00:00,  4.11it/s]


Fold 1 Epoch 11 Loss: 1.2218


Fold 1 Epoch 12: 100%|██████████| 524/524 [02:07<00:00,  4.12it/s]


Fold 1 Epoch 12 Loss: 1.2189


Fold 1 Epoch 13: 100%|██████████| 524/524 [02:07<00:00,  4.12it/s]


Fold 1 Epoch 13 Loss: 1.2135


Fold 1 Epoch 14: 100%|██████████| 524/524 [02:07<00:00,  4.11it/s]


Fold 1 Epoch 14 Loss: 1.2140


Fold 1 Epoch 15: 100%|██████████| 524/524 [02:07<00:00,  4.12it/s]


Fold 1 Epoch 15 Loss: 1.2095


Fold 1 Epoch 16: 100%|██████████| 524/524 [02:06<00:00,  4.13it/s]


Fold 1 Epoch 16 Loss: 1.2044


Fold 1 Epoch 17: 100%|██████████| 524/524 [02:06<00:00,  4.13it/s]


Fold 1 Epoch 17 Loss: 1.2064


Fold 1 Epoch 18: 100%|██████████| 524/524 [02:06<00:00,  4.14it/s]


Fold 1 Epoch 18 Loss: 1.2032


Fold 1 Epoch 19: 100%|██████████| 524/524 [02:06<00:00,  4.13it/s]


Fold 1 Epoch 19 Loss: 1.1962


Fold 1 Epoch 20: 100%|██████████| 524/524 [02:07<00:00,  4.11it/s]


Fold 1 Epoch 20 Loss: 1.1970
Fold 1 - F1: 0.9531, Precision: 0.9538, Recall: 0.9528

--- Fold 2 ---


Fold 2 Epoch 1: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 1 Loss: 1.4703


Fold 2 Epoch 2: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 2 Loss: 1.3254


Fold 2 Epoch 3: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 3 Loss: 1.2936


Fold 2 Epoch 4: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 4 Loss: 1.2720


Fold 2 Epoch 5: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 2 Epoch 5 Loss: 1.2642


Fold 2 Epoch 6: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 6 Loss: 1.2538


Fold 2 Epoch 7: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 7 Loss: 1.2438


Fold 2 Epoch 8: 100%|██████████| 525/525 [02:08<00:00,  4.09it/s]


Fold 2 Epoch 8 Loss: 1.2390


Fold 2 Epoch 9: 100%|██████████| 525/525 [02:07<00:00,  4.10it/s]


Fold 2 Epoch 9 Loss: 1.2290


Fold 2 Epoch 10: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 10 Loss: 1.2264


Fold 2 Epoch 11: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 2 Epoch 11 Loss: 1.2235


Fold 2 Epoch 12: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 2 Epoch 12 Loss: 1.2147


Fold 2 Epoch 13: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 2 Epoch 13 Loss: 1.2137


Fold 2 Epoch 14: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 14 Loss: 1.2162


Fold 2 Epoch 15: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 2 Epoch 15 Loss: 1.2093


Fold 2 Epoch 16: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 2 Epoch 16 Loss: 1.2073


Fold 2 Epoch 17: 100%|██████████| 525/525 [02:08<00:00,  4.09it/s]


Fold 2 Epoch 17 Loss: 1.2057


Fold 2 Epoch 18: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 2 Epoch 18 Loss: 1.2076


Fold 2 Epoch 19: 100%|██████████| 525/525 [02:08<00:00,  4.10it/s]


Fold 2 Epoch 19 Loss: 1.2011


Fold 2 Epoch 20: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 2 Epoch 20 Loss: 1.2009
Fold 2 - F1: 0.9563, Precision: 0.9571, Recall: 0.9566

--- Fold 3 ---


Fold 3 Epoch 1: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 3 Epoch 1 Loss: 1.4807


Fold 3 Epoch 2: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 3 Epoch 2 Loss: 1.3292


Fold 3 Epoch 3: 100%|██████████| 525/525 [02:08<00:00,  4.09it/s]


Fold 3 Epoch 3 Loss: 1.2941


Fold 3 Epoch 4: 100%|██████████| 525/525 [02:08<00:00,  4.10it/s]


Fold 3 Epoch 4 Loss: 1.2688


Fold 3 Epoch 5: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 3 Epoch 5 Loss: 1.2577


Fold 3 Epoch 6: 100%|██████████| 525/525 [02:08<00:00,  4.10it/s]


Fold 3 Epoch 6 Loss: 1.2521


Fold 3 Epoch 7: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 3 Epoch 7 Loss: 1.2390


Fold 3 Epoch 8: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 3 Epoch 8 Loss: 1.2342


Fold 3 Epoch 9: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 3 Epoch 9 Loss: 1.2306


Fold 3 Epoch 10: 100%|██████████| 525/525 [02:08<00:00,  4.08it/s]


Fold 3 Epoch 10 Loss: 1.2223


Fold 3 Epoch 11: 100%|██████████| 525/525 [02:08<00:00,  4.10it/s]


Fold 3 Epoch 11 Loss: 1.2185


Fold 3 Epoch 12: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 3 Epoch 12 Loss: 1.2164


Fold 3 Epoch 13: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 3 Epoch 13 Loss: 1.2147


Fold 3 Epoch 14: 100%|██████████| 525/525 [02:07<00:00,  4.12it/s]


Fold 3 Epoch 14 Loss: 1.2124


Fold 3 Epoch 15: 100%|██████████| 525/525 [02:07<00:00,  4.10it/s]


Fold 3 Epoch 15 Loss: 1.2119


Fold 3 Epoch 16: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 3 Epoch 16 Loss: 1.2057


Fold 3 Epoch 17: 100%|██████████| 525/525 [02:07<00:00,  4.10it/s]


Fold 3 Epoch 17 Loss: 1.2035


Fold 3 Epoch 18: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 3 Epoch 18 Loss: 1.2020


Fold 3 Epoch 19: 100%|██████████| 525/525 [02:07<00:00,  4.11it/s]


Fold 3 Epoch 19 Loss: 1.1958


Fold 3 Epoch 20: 100%|██████████| 525/525 [02:08<00:00,  4.10it/s]


Fold 3 Epoch 20 Loss: 1.1982
Fold 3 - F1: 0.9525, Precision: 0.9526, Recall: 0.9528

--- Average 3-Fold Cross-Validation Results ---
F1 Score: 0.9539
Precision: 0.9545
Recall: 0.9540

Training final model on full training set...


Final Model Epoch 1/20: 100%|██████████| 787/787 [03:12<00:00,  4.10it/s]


Epoch 1 - Training Loss: 1.4278


Final Model Epoch 2/20: 100%|██████████| 787/787 [03:11<00:00,  4.10it/s]


Epoch 2 - Training Loss: 1.3038


Final Model Epoch 3/20: 100%|██████████| 787/787 [03:11<00:00,  4.11it/s]


Epoch 3 - Training Loss: 1.2720


Final Model Epoch 4/20: 100%|██████████| 787/787 [03:11<00:00,  4.11it/s]


Epoch 4 - Training Loss: 1.2580


Final Model Epoch 5/20: 100%|██████████| 787/787 [03:12<00:00,  4.09it/s]


Epoch 5 - Training Loss: 1.2492


Final Model Epoch 6/20: 100%|██████████| 787/787 [03:11<00:00,  4.10it/s]


Epoch 6 - Training Loss: 1.2378


Final Model Epoch 7/20: 100%|██████████| 787/787 [03:11<00:00,  4.11it/s]


Epoch 7 - Training Loss: 1.2268


Final Model Epoch 8/20: 100%|██████████| 787/787 [03:12<00:00,  4.09it/s]


Epoch 8 - Training Loss: 1.2262


Final Model Epoch 9/20: 100%|██████████| 787/787 [03:11<00:00,  4.10it/s]


Epoch 9 - Training Loss: 1.2195


Final Model Epoch 10/20: 100%|██████████| 787/787 [03:11<00:00,  4.10it/s]


Epoch 10 - Training Loss: 1.2144


Final Model Epoch 11/20: 100%|██████████| 787/787 [03:12<00:00,  4.10it/s]


Epoch 11 - Training Loss: 1.2093


Final Model Epoch 12/20: 100%|██████████| 787/787 [04:13<00:00,  3.11it/s]


Epoch 12 - Training Loss: 1.2082


Final Model Epoch 13/20: 100%|██████████| 787/787 [04:59<00:00,  2.63it/s]


Epoch 13 - Training Loss: 1.2036


Final Model Epoch 14/20: 100%|██████████| 787/787 [05:12<00:00,  2.52it/s]


Epoch 14 - Training Loss: 1.2028


Final Model Epoch 15/20: 100%|██████████| 787/787 [05:01<00:00,  2.61it/s]


Epoch 15 - Training Loss: 1.2053


Final Model Epoch 16/20: 100%|██████████| 787/787 [05:01<00:00,  2.61it/s]


Epoch 16 - Training Loss: 1.1996


Final Model Epoch 17/20: 100%|██████████| 787/787 [04:59<00:00,  2.63it/s]


Epoch 17 - Training Loss: 1.1971


Final Model Epoch 18/20: 100%|██████████| 787/787 [05:00<00:00,  2.62it/s]


Epoch 18 - Training Loss: 1.1958


Final Model Epoch 19/20: 100%|██████████| 787/787 [04:58<00:00,  2.63it/s]


Epoch 19 - Training Loss: 1.1940


Final Model Epoch 20/20: 100%|██████████| 787/787 [04:56<00:00,  2.65it/s]


Epoch 20 - Training Loss: 1.1943
Final training complete. Model saved as 'trained_model_final.pth'.
Test Accuracy: 90.48%


## Notebook Overview

1. **Configuration and Paths:**  
   All paths (dataset, metadata CSVs, pretrained checkpoint) are set at the top of the code cell and must be adjusted to the local folder structure.

2. **Data Preparation:**  
   - The CSV metadata is loaded, and a class mapping is created based on the `"Classname"` column.
   - A custom PyTorch `Dataset` (`AudioDataset`) loads raw audio waveforms (resampled to 32 kHz) from the specified folder.
   - A custom collate function pads the variable-length waveforms so that they can be batched together.

3. **Model Initialization:**  
   - The pretrained CNN14 model is imported from the PANNs repository.
   - The pretrained weights (from `Cnn14_mAP=0.431.pth`) are loaded, filtering out the final classification layer so that the output layer is reinitialized for our 7 classes.

4. **Training and Evaluation:**  
   - The notebook performs **3-fold cross-validation** on the training set, printing weighted F1 score, precision, and recall for each fold and their averages.
   - A final model is then trained on the full training set and evaluated on the test set.
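
One detail of the training loop is worth a closer look. As far as I can tell from the upstream `models.py`, `Cnn14` applies a sigmoid before returning `clipwise_output`, so the values handed to `nn.CrossEntropyLoss` lie in [0, 1] rather than being raw logits. If that holds, the loss cannot fall below roughly 1.165 for 7 classes, which would explain why the logged training loss levels off near 1.19 to 1.20 even though accuracy is high. The sketch below reproduces that floor in plain Python; `cross_entropy` is an illustrative helper, not part of the notebook.

```python
import math

def cross_entropy(scores, target):
    """Per-sample cross-entropy of softmax(scores) against an integer class index."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]

NUM_CLASSES = 7

# Best case with sigmoid-bounded scores: 1.0 for the true class, 0.0 for all others.
best_scores = [1.0] + [0.0] * (NUM_CLASSES - 1)
loss_floor = cross_entropy(best_scores, target=0)

# Uninformative case: every class gets the same score.
uniform_loss = cross_entropy([0.5] * NUM_CLASSES, target=0)

print(f"{loss_floor:.4f}")    # 1.1654
print(f"{uniform_loss:.4f}")  # 1.9459
```

Possible remedies, not tested here, would be to compute the loss on pre-sigmoid scores or to use a binary cross-entropy loss with one-hot targets, as the upstream repository does for AudioSet.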
