In this notebook, I train a Random Forest classifier on features derived from precomputed log-Mel spectrograms stored as .npz files. The Mel features were extracted from preprocessed audio chunks (human speech was already trimmed out in earlier steps).

Key Steps:
- Load metadata from train.csv and map each Mel file to its primary label.
- Extract statistical features from each Mel spectrogram:
    - Per-band mean and standard deviation over time frames (64 + 64 = 128 features per sample).
- Label-encode species names for model compatibility.
- Filter out rare classes (fewer than 2 samples), which a stratified split cannot handle.
- Split into train and test sets with stratified sampling to preserve the class distribution.
- Train a Random Forest classifier with class balancing to handle label imbalance.
- Evaluate the model using accuracy and a classification report.
- Export the trained model and label encoder for later inference.
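
The feature-extraction step can be sketched on a synthetic array. This is a minimal illustration, assuming each spectrogram is stored as a (64, T) array with Mel bands on the first axis. The random input and the helper name `mel_stats` are made up for the example.

```python
import numpy as np


def mel_stats(mel: np.ndarray) -> np.ndarray:
    """Collapse a (n_mels, n_frames) spectrogram into per-band mean and std."""
    # axis=1 averages over time frames, leaving one value per Mel band
    return np.concatenate([mel.mean(axis=1), mel.std(axis=1)])


# Synthetic stand-in for one log-Mel spectrogram: 64 bands, 313 frames (assumed sizes)
rng = np.random.default_rng(0)
fake_mel = rng.normal(size=(64, 313))

features = mel_stats(fake_mel)
print(features.shape)  # (128,)
```

Because the statistics are taken over the time axis, clips of different lengths all yield a fixed-length vector, which is what the Random Forest needs.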


In [23]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import Counter
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# === CONFIG ===
mel_dir = "processed/feature"  # mel feature folder
csv_path = "data/train.csv"    # train.csv path
model_path = "random_forest_model.pkl"
label_encoder_path = "label_encoder.pkl"

# === Step 1: Load metadata and label map ===
# Read the metadata CSV (filenames and their labels), then map each label
# (bird species) from a string to an integer with LabelEncoder so that later
# stages work with numeric class ids.

df = pd.read_csv(csv_path)
label_map_raw = {
    row['filename'].split('/')[-1].split('.')[0]: row['primary_label']
    for _, row in df.iterrows()
}

# Encode string labels as integers (one vectorized transform instead of one call per clip)
le = LabelEncoder()
raw_labels = list(label_map_raw.values())
le.fit(raw_labels)
label_map = dict(zip(label_map_raw.keys(), le.transform(raw_labels)))

# === Step 2: Load mel features ===
# Load the Mel spectrogram for each audio chunk and reduce it to the mean and
# standard deviation of each Mel frequency band.

X_raw, y_raw = [], []

for fname in tqdm(sorted(os.listdir(mel_dir))):
    if not fname.endswith('_mel.npz'):
        continue

    # Extract clip_id like CSA35130_25_mel.npz → CSA35130
    clip_id = fname.rsplit('_', 2)[0]
    if clip_id not in label_map:
        continue

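    path = os.path.join(mel_dir, fname)
    # Use a context manager so each .npz file handle is closed after reading
    with np.load(path) as data:
        if 'mel' not in data:
            continue
        mel = data['mel']

    # Per-band statistics over time frames (axis=1); assumes mel is (n_mels, n_frames)
    mean = mel.mean(axis=1)
    std = mel.std(axis=1)
    feature = np.concatenate([mean, std])  # shape: (128,) for 64 Mel bands
    X_raw.append(feature)
    y_raw.append(label_map[clip_id])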

X_raw = np.array(X_raw)
y_raw = np.array(y_raw)

if len(X_raw) == 0:
    raise RuntimeError("No valid mel files found.")

#print(f"Loaded {X_raw.shape[0]} samples with shape {X_raw.shape[1]} features each.")

# === Step 3: Filter classes with <2 samples ===
# A stratified split needs at least 2 samples per class
counts = Counter(y_raw)
valid_labels = [label for label, count in counts.items() if count >= 2]

# Boolean mask selecting only samples whose class is kept
keep = np.isin(y_raw, valid_labels)
X = X_raw[keep]
y = y_raw[keep]

# === Step 4: Train-test split ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# === Step 5: Train Random Forest ===
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight='balanced',
    n_jobs=-1
)
model.fit(X_train, y_train)

# === Step 6: Evaluate ===
y_pred = model.predict(X_test)
labels = np.unique(y_test)
target_names = list(le.inverse_transform(labels))  # decode all class ids in one call

report = classification_report(y_test, y_pred, labels=labels, target_names=target_names, zero_division=0)
accuracy = accuracy_score(y_test, y_pred)

print(f"\n Accuracy on test set: {accuracy:.4f}")
print("\n Classification Report:\n")
print(report)

# === Step 7: Save model and label encoder ===
joblib.dump(model, model_path)
joblib.dump(le, label_encoder_path)
print(f" Saved model to {model_path} and label encoder to {label_encoder_path}")


100%|██████████| 121255/121255 [02:27<00:00, 822.89it/s]



 Accuracy on test set: 0.6250

 Classification Report:

              precision    recall  f1-score   support

     1139490       0.00      0.00      0.00         3
     1192948       0.57      0.44      0.50         9
     1194042       1.00      1.00      1.00         1
     1346504       0.71      0.71      0.71         7
      134933       1.00      0.40      0.57         5
      135045       0.80      0.90      0.85        40
     1462711       0.75      0.38      0.50         8
     1462737       0.48      0.93      0.64        15
       21038       1.00      0.86      0.92         7
       21211       0.91      0.69      0.78        29
       22333       0.67      0.25      0.36         8
       22973       0.90      0.53      0.67        36
       22976       0.59      0.56      0.57        18
       24272       1.00      1.00      1.00         3
       24292       0.50      0.67      0.57         3
       24322       1.00      0.86      0.92        14
       41663       0.68 

## ⚙️ Environment Requirements

To ensure compatibility and avoid serialization issues when saving and loading the `RandomForestClassifier` model, this notebook uses the following library versions:

- **scikit-learn**: `1.2.2`  
- **numpy**: `1.23.5`

These versions match Kaggle's default environment, ensuring that the trained model (`random_forest_model.pkl`) can be loaded successfully during submission.
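
A quick way to catch a version mismatch before loading the pickle is to compare the installed packages against the pins above. The following is a minimal sketch using only the standard library. The helper name `find_version_mismatches` is illustrative, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the list above
EXPECTED = {"scikit-learn": "1.2.2", "numpy": "1.23.5"}


def find_version_mismatches(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


mismatches = find_version_mismatches(EXPECTED)
for package, (wanted, installed) in mismatches.items():
    print(f"{package}: expected {wanted}, found {installed}")
```

An empty result means the environment matches the pins. Anything printed is worth resolving before calling `joblib.load` on the model.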


In [7]:
import joblib
# Load the label encoder
le = joblib.load("label_encoder_v2.pkl")

# Count number of unique classes
num_classes = len(le.classes_)
print(f"Number of classes: {num_classes}")

Number of classes: 206
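
For later inference, the saved model and label encoder would be reloaded together and applied to the same 128-value feature vector used in training. The sketch below shows that round trip under stated assumptions: the model is a tiny stand-in trained on synthetic data, the species names are hypothetical, and the files live in a temporary directory rather than at the paths used in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder


def mel_to_features(mel):
    """Per-band mean and std over time, matching the training features."""
    return np.concatenate([mel.mean(axis=1), mel.std(axis=1)])


def predict_species(mel, model, encoder):
    """Predict the decoded species label for one (n_mels, n_frames) spectrogram."""
    features = mel_to_features(mel).reshape(1, -1)
    encoded = model.predict(features)
    return encoder.inverse_transform(encoded)[0]


# Stand-in artifacts (hypothetical labels, synthetic spectrograms)
rng = np.random.default_rng(0)
species = ["species_a", "species_b"]
encoder = LabelEncoder().fit(species)
mels = [rng.normal(loc=5.0 * (i % 2), size=(64, 50)) for i in range(20)]
X = np.stack([mel_to_features(m) for m in mels])
y = encoder.transform([species[i % 2] for i in range(20)])
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Save and reload, as the notebook does with joblib
with tempfile.TemporaryDirectory() as tmp:
    model_file = os.path.join(tmp, "model.pkl")
    encoder_file = os.path.join(tmp, "encoder.pkl")
    joblib.dump(model, model_file)
    joblib.dump(encoder, encoder_file)
    loaded_model = joblib.load(model_file)
    loaded_encoder = joblib.load(encoder_file)

new_mel = rng.normal(loc=5.0, size=(64, 50))
prediction = predict_species(new_mel, loaded_model, loaded_encoder)
print(prediction)
```

Saving the encoder alongside the model matters because the classifier only outputs integer ids. Without the matching encoder, those ids cannot be mapped back to species names.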
