> Setup & paths

1. Enable autoreload for local utils/ modules

2. Add project root to sys.path to import utils.*

3. Define constants (FPS) used for audio↔visual alignment


In [2]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("..")

import pandas as pd

from utils import audio_tools as audioTools
from utils import gt_and_modeling_dfs as prepare_df

import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from utils import evaluation_tools as eval  # note: this alias shadows the built-in eval()

> **Choice of split:**\
> The split has to be identical to the one used for visual feature extraction so that the two modalities can be fused later, so we use the same strategy: there are 4 main characters to identify, and the split time for each episode is chosen so that all characters appear roughly equally often in both splits.
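
One way to make the "roughly equal appearances" criterion concrete is sketched below. This is an illustration only, not the procedure used to pick the timestamps in this notebook: `choose_split_frame`, `candidates`, and `target_train_fraction` are hypothetical names, and the labels are synthetic.

```python
import numpy as np

def choose_split_frame(labels, candidates, target_train_fraction=0.5):
    """Pick the candidate frame whose train part is closest to the target share.

    labels: (n_frames, n_characters) binary presence matrix (hypothetical input).
    candidates: iterable of frame indices; frames <= candidate go to train.
    """
    labels = np.asarray(labels)
    cum = np.cumsum(labels, axis=0)        # appearances up to and including each frame
    totals = np.maximum(cum[-1], 1)        # avoid division by zero for absent characters
    best_frame, best_cost = None, np.inf
    for frame in candidates:
        train_share = cum[frame] / totals  # per-character share landing in train
        cost = np.max(np.abs(train_share - target_train_fraction))
        if cost < best_cost:
            best_frame, best_cost = frame, cost
    return best_frame

# Synthetic example: character A appears in frames 0-39, character B in frames 20-59.
labels = np.zeros((100, 2), dtype=int)
labels[0:40, 0] = 1
labels[20:60, 1] = 1
split_frame = choose_split_frame(labels, candidates=range(100))
print(split_frame)
```

The cost is the worst per-character deviation from the target share, so a split that suits one character but starves another is penalized.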

In [3]:
FPS_TO_SAVE = 25  # to match visual

EPISODES = {
    "Muppets-02-01-01": {
        "path": "../data/raw/Muppets-02-01-01.avi",
        "train_split_timestamp": "19:30",
        "ground_truth_path": "../data/muppets-gt-2025wt/Ground_Truth_New_01.xlsx"
    },
    "Muppets-02-04-04": {
        "path": "../data/raw/Muppets-02-04-04.avi",
        "train_split_timestamp": "19:52",
        "ground_truth_path": "../data/muppets-gt-2025wt/Ground_Truth_New_04.xlsx"
    },
    "Muppets-03-04-03": {
        "path": "../data/raw/Muppets-03-04-03.avi",
        "train_split_timestamp": "19:54",
        "ground_truth_path": "../data/muppets-gt-2025wt/Ground_Truth_New_03.xlsx"
    }
}

EPISODE_NAME_TO_VIDEO_ID = {
    "Muppets-02-01-01": 211,
    "Muppets-02-04-04": 244,
    "Muppets-03-04-03": 343
}

# character-oriented: only two
SIM1_CHARACTER_LABEL_COLS = ["Kermit", "StatlerWaldorf"]


> Dataset configuration

- Episodes: video paths + GT files + per-episode time split

- Mapping episode name → numeric Video id used in GT

- Characters for SIM1 (binary presence labels)


In [4]:
# GT combined

# Load + consolidate Ground Truth (GT)
# Read GT from all episodes and normalize timestamps

# Output: one combined dataframe used for feature extraction
all_ep_gt_df = prepare_df.all_ep_gt(EPISODES)
print(all_ep_gt_df.shape)
display(all_ep_gt_df.head())

Consolidated GT: 115885 rows, 10 columns
(115885, 10)


Unnamed: 0,Video,Frame_number,Timestamp,Kermit,Pigs,Miss Piggy,Cook,StatlerWaldorf,Rowlf the Dog,Fozzie Bear
0,211,0,00:00.00,0,0,0,0,0,0,0
1,211,1,00:00.04,0,0,0,0,0,0,0
2,211,2,00:00.08,0,0,0,0,0,0,0
3,211,3,00:00.12,0,0,0,0,0,0,0
4,211,4,00:00.16,0,0,0,0,0,0,0


> Build audio feature space (frame-aligned)

- Extract per-frame audio features (MFCC + deltas, F0, spectral centroid, …)

- Align features with GT by (Video, Frame_number, Timestamp)

- Save to data/processed/feature_spaces/audio_sim1.csv for reuse
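
Frame alignment comes down to using a hop length of `sr / fps` samples, so that one analysis window is produced per video frame (22050 / 25 = 882 samples with the settings used in the next cell). The sketch below illustrates this with a NumPy-only spectral centroid. It is not the code in `utils/audio_tools.py`, whose internals are not shown here, and `frame_spectral_centroid` is a hypothetical helper.

```python
import numpy as np

def frame_spectral_centroid(y, sr=22050, fps=25, n_fft=2048):
    """Spectral centroid (Hz) with one value per video frame."""
    hop = sr // fps                               # 882 samples per video frame
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    padded = np.pad(y, n_fft // 2)                # centre each window on its frame start
    n_frames = 1 + len(y) // hop
    centroids = np.zeros(n_frames)
    for i in range(n_frames):
        segment = padded[i * hop : i * hop + n_fft]
        mag = np.abs(np.fft.rfft(segment * window))
        total = mag.sum()
        if total > 0:                             # silent frames stay at 0.0
            centroids[i] = (freqs * mag).sum() / total
    return centroids

sr = 22050
t = np.arange(sr) / sr                            # one second of audio
tone = np.sin(2 * np.pi * 1000.0 * t)             # 1 kHz test tone
centroids = frame_spectral_centroid(tone, sr=sr)
print(len(centroids), round(float(centroids[12]), 1))
```

Row `i` of the result describes the audio around video frame `i`, which is what allows the features to be joined to the GT on `Frame_number`.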

In [5]:
cfg = audioTools.AudioFrameConfig(
    sr=22050,
    fps=FPS_TO_SAVE,
    n_fft=2048,
    n_mfcc=13
)

audio_sim1 = audioTools.build_audio_feature_space_df(
    EPISODES=EPISODES,
    EPISODE_NAME_TO_VIDEO_ID=EPISODE_NAME_TO_VIDEO_ID,
    gt_df=all_ep_gt_df,
    character_cols=SIM1_CHARACTER_LABEL_COLS,
    out_csv_path="../data/processed/feature_spaces/audio_sim1.csv",
    cache_dir="../data/raw/_audio_cache",
    cfg=cfg
)

display(audio_sim1.head())


[audio feature space] saved (115855, 46) -> ../data/processed/feature_spaces/audio_sim1.csv


Unnamed: 0,Video,Frame_number,Timestamp,mfcc_0,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,...,mfcc_d2_7,mfcc_d2_8,mfcc_d2_9,mfcc_d2_10,mfcc_d2_11,mfcc_d2_12,spectral_centroid,f0,Kermit,StatlerWaldorf
0,211,0,00:00.00,-645.122742,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,0
1,211,1,00:00.04,-645.122742,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,0
2,211,2,00:00.08,-645.122742,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,0
3,211,3,00:00.12,-645.122742,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,0
4,211,4,00:00.16,-645.122742,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,0


In [6]:
# Sanity check: audio ↔ visual row alignment

# verify both modalities contain the same GT-aligned keys
# expect large intersection; mismatch indicates FPS / frame extraction issues

visual_sim1 = pd.read_csv("../data/processed/feature_spaces/visual_sim1.csv")

key_cols = ["Video", "Frame_number", "Timestamp"]
merged = visual_sim1[key_cols].merge(audio_sim1[key_cols], on=key_cols, how="inner")
print("Visual rows:", len(visual_sim1))
print("Audio rows:", len(audio_sim1))
print("Intersection:", len(merged))


Visual rows: 115885
Audio rows: 115855
Intersection: 115855
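
The counts above show 30 visual rows without an audio counterpart. A sketch of how such rows could be located, using a left merge with `indicator=True` on small synthetic frames (the toy data and the `find_unmatched` helper are hypothetical, not part of `utils`):

```python
import pandas as pd

KEY_COLS = ["Video", "Frame_number"]

def find_unmatched(left, right, keys=KEY_COLS):
    """Return the key rows of `left` that have no partner in `right`."""
    merged = left[keys].merge(
        right[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", keys].reset_index(drop=True)

# Toy data: the audio side is missing the last two frames of video 211.
visual = pd.DataFrame({"Video": [211] * 5, "Frame_number": list(range(5))})
audio = pd.DataFrame({"Video": [211] * 3, "Frame_number": list(range(3))})
missing = find_unmatched(visual, audio)
print(missing.groupby("Video")["Frame_number"].agg(["min", "max", "count"]))
```

Grouping the unmatched keys by `Video` shows whether the gap sits at the end of each episode, which would point to the audio track being slightly shorter than the video.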


> ### Train/test split (time-blocked, no leakage) + preprocessing

- Split each episode by timestamp (early = train, later = test)

- Build X by dropping labels and metadata

- Impute missing values (mean) and standardize, fitting both on the training part only
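
A worked example of the timestamp-based split: at 25 FPS an `MM:SS` timestamp maps to frame `(60 * MM + SS) * 25`. The helpers below are a hypothetical illustration, not the implementation of `prepare_df.split_feature_space_df`. They assume that frames up to and including the split frame go to train, which is consistent with the train sizes logged by the split cell (for example 29251 rows for `19:30`).

```python
def timestamp_to_frame(timestamp, fps=25):
    """Convert an "MM:SS" timestamp to a frame index at the given FPS."""
    minutes, seconds = timestamp.split(":")
    return (int(minutes) * 60 + int(seconds)) * fps

def is_train(frame_number, split_timestamp, fps=25):
    """Assumed rule: frames up to and including the split frame are train."""
    return frame_number <= timestamp_to_frame(split_timestamp, fps)

for ts in ["19:30", "19:52", "19:54"]:
    split_frame = timestamp_to_frame(ts)
    # Frames 0..split_frame inclusive give split_frame + 1 training rows.
    print(ts, split_frame, split_frame + 1)
```

Because the cut is a single point in time per episode, train and test never interleave, so neighbouring (and therefore highly correlated) frames cannot leak across the split except at the boundary itself.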

In [22]:
# --- Config ---
SIM1_CHARACTER_LABEL_COLS = ["Kermit", "StatlerWaldorf"]
META_COLS = ["Video", "Frame_number", "Timestamp"]

audio_df = pd.read_csv("../data/processed/feature_spaces/audio_sim1.csv")

# Safety: keep only characters that actually exist in the CSV
SIM1_CHARACTER_LABEL_COLS = [c for c in SIM1_CHARACTER_LABEL_COLS if c in audio_df.columns]
assert len(SIM1_CHARACTER_LABEL_COLS) > 0, "No character label columns found in audio_df."

# --- 1) Split (same logic as visual) ---
train_df, test_df = prepare_df.split_feature_space_df(
    feature_df=audio_df,
    EPISODES=EPISODES,
    EPISODE_NAME_TO_VIDEO_ID=EPISODE_NAME_TO_VIDEO_ID
)

# --- 2) Build X/y ---
DROP_COLS = SIM1_CHARACTER_LABEL_COLS + META_COLS
X_train_df = train_df.drop(columns=DROP_COLS)
X_test_df  = test_df.drop(columns=DROP_COLS)

# same column order
X_test_df = X_test_df[X_train_df.columns]

col_names = X_train_df.columns.tolist()
print("Training features:", col_names)

# --- 3) Impute (f0 can be NaN) ---
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train_df)
X_test  = imputer.transform(X_test_df)

# --- 4) Standardize (fit on train only) ---
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# --- 5) Prepare containers for per-character predictions (score + hard label) ---
y_true_df = test_df[SIM1_CHARACTER_LABEL_COLS].copy()
y_pred_df = y_true_df.copy()  # will hold *_present and *_score


[split] Muppets-02-01-01 | Video=211 | train=29251, test=9421
[split] Muppets-02-04-04 | Video=244 | train=29801, test=8895
[split] Muppets-03-04-03 | Video=343 | train=29851, test=8636
[FINAL SPLIT] train=(88903, 46), test=(26952, 46)
Training features: ['mfcc_0', 'mfcc_1', 'mfcc_2', 'mfcc_3', 'mfcc_4', 'mfcc_5', 'mfcc_6', 'mfcc_7', 'mfcc_8', 'mfcc_9', 'mfcc_10', 'mfcc_11', 'mfcc_12', 'mfcc_d1_0', 'mfcc_d1_1', 'mfcc_d1_2', 'mfcc_d1_3', 'mfcc_d1_4', 'mfcc_d1_5', 'mfcc_d1_6', 'mfcc_d1_7', 'mfcc_d1_8', 'mfcc_d1_9', 'mfcc_d1_10', 'mfcc_d1_11', 'mfcc_d1_12', 'mfcc_d2_0', 'mfcc_d2_1', 'mfcc_d2_2', 'mfcc_d2_3', 'mfcc_d2_4', 'mfcc_d2_5', 'mfcc_d2_6', 'mfcc_d2_7', 'mfcc_d2_8', 'mfcc_d2_9', 'mfcc_d2_10', 'mfcc_d2_11', 'mfcc_d2_12', 'spectral_centroid', 'f0']


In [24]:
# --- Model 1: per-character binary classification (KNN) ---

# -- Store both: *_score (for PR/ROC/AP) and *_present (thresholded for confusion matrix)

for character in SIM1_CHARACTER_LABEL_COLS:
    y_train = train_df[character].astype(int).values

    knn = KNeighborsClassifier(
        n_neighbors=5,
        weights="distance"
    )
    knn.fit(X_train, y_train)

    # score preferred (for PR/ROC/MAP)
    if hasattr(knn, "predict_proba"):
        y_score = knn.predict_proba(X_test)[:, 1]
    else:
        # fallback
        y_score = knn.predict(X_test).astype(float)

    # threshold -> hard label (for confusion matrix)
    y_pred = (y_score >= 0.5).astype(int)

    y_pred_df[f"{character}_score"] = y_score
    y_pred_df[f"{character}_present"] = y_pred

# -- Evaluation (multi-label, per character)

# Confusion matrix per character
# PR/ROC overlays
# Average Precision per character + mean across characters

metrics_dict, overall_map = eval.evaluate_multiclass(
    y_true_df=y_true_df,
    y_pred_df=y_pred_df,
    characters=SIM1_CHARACTER_LABEL_COLS
)

metrics_dict, overall_map

Mean Average Precision (MAP) per character:
Kermit: MAP=0.606
StatlerWaldorf: MAP=0.060

Overall MAP (all characters): 0.333


({'Kermit': {'MAP': 0.6063908557974679},
  'StatlerWaldorf': {'MAP': 0.06048283485061585}},
 np.float64(0.3334368453240419))
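
The comments in the cell above describe the per-character metric as Average Precision, that is AP = Σ_n (R_n − R_{n−1}) · P_n over the ranked test frames. The NumPy sketch below follows that definition and groups tied scores into one threshold, which matters for KNN because it produces many identical scores. It is an illustration, not the code in `utils/evaluation_tools.py`.

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP = sum_n (R_n - R_{n-1}) * P_n over distinct score thresholds."""
    y_true = np.asarray(y_true, dtype=int)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")   # rank frames by descending score
    y_sorted, s_sorted = y_true[order], y_score[order]
    # Last index of each group of tied scores: one threshold per distinct score.
    ends = np.r_[np.where(np.diff(s_sorted) != 0)[0], len(s_sorted) - 1]
    tp = np.cumsum(y_sorted)[ends]                # true positives at each threshold
    precision = tp / (ends + 1)
    recall = tp / max(tp[-1], 1)
    return float(np.sum(np.diff(np.r_[0.0, recall]) * precision))

ap_example = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(round(ap_example, 3))
```

A scorer with no ranking ability has an AP close to the positive-class prevalence, so a low AP should be read against how rarely the character appears in the test split.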

> ### Model 2: Logistic Regression baseline (probabilistic scores)

- Same split + preprocessing

- Balanced class weights to address strong label imbalance

- Outputs probability scores via predict_proba (not calibrated, because the balanced class weights shift the effective class prior)
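
As a reminder of what `class_weight="balanced"` does: scikit-learn weights each class by `n_samples / (n_classes * count(class))`, so the rare positive class counts more in the loss. A small NumPy sketch on made-up labels (the 90/10 imbalance is an assumption for illustration, not the ratio in this dataset):

```python
import numpy as np

def balanced_class_weights(y):
    """Per-class weights following n_samples / (n_classes * class_count).

    Assumes integer labels 0..K-1 that all occur at least once.
    """
    y = np.asarray(y, dtype=int)
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

y_toy = np.array([0] * 90 + [1] * 10)
weights = balanced_class_weights(y_toy)
print(weights)
```

With these weights both classes contribute the same total weight to the loss (90 × 0.556 ≈ 10 × 5.0 = 50).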

In [18]:
y_test_df = test_df[SIM1_CHARACTER_LABEL_COLS].copy()

for character in SIM1_CHARACTER_LABEL_COLS:
    y_train = train_df[character].astype(int).values
    y_test  = test_df[character].astype(int).values

    clf = LogisticRegression(
        class_weight="balanced",
        max_iter=2000,
        n_jobs=-1
    )
    clf.fit(X_train, y_train)

    y_score = clf.predict_proba(X_test)[:, 1]
    y_pred  = (y_score >= 0.5).astype(int)

    y_test_df[f"{character}_present"] = y_pred
    y_test_df[f"{character}_score"]   = y_score

metrics_lr, overall_lr = eval.evaluate_multiclass(
    y_true_df=y_test_df[SIM1_CHARACTER_LABEL_COLS],
    y_pred_df=y_test_df,
    characters=SIM1_CHARACTER_LABEL_COLS
)
print("Overall MAP (LogReg):", overall_lr)


RuntimeWarning (emitted repeatedly while fitting): divide by zero / overflow / invalid value encountered in matmul
  raw_prediction = X @ weights + intercept
  grad[:n_features] = X.T @ grad_pointwise + l2_reg_strength * weights


Mean Average Precision (MAP) per character:
Kermit: MAP=0.644
StatlerWaldorf: MAP=0.042

Overall MAP (all characters): 0.343
Overall MAP (LogReg): 0.34327459891647544


> ### Export predictions for fusion

- Save test-set audio predictions with keys (Video, Frame_number, Timestamp)

- Used later to merge with visual predictions for late fusion / ensemble

In [19]:
# --- Save AUDIO predictions for fusion (y_test_df holds the Logistic Regression outputs) ---

KEY_COLS = ["Video", "Frame_number", "Timestamp"]

audio_pred = test_df[KEY_COLS].copy()

for ch in SIM1_CHARACTER_LABEL_COLS:
    audio_pred[f"{ch}_score"] = y_test_df[f"{ch}_score"].values
    audio_pred[f"{ch}_present"] = y_test_df[f"{ch}_present"].values

out_path = "../data/processed/preds/audio_sim1_pred.csv"
audio_pred.to_csv(out_path, index=False)

print(f"[OK] Audio predictions saved to {out_path}")
audio_pred.head()

[OK] Audio predictions saved to ../data/processed/preds/audio_sim1_pred.csv


Unnamed: 0,Video,Frame_number,Timestamp,Kermit_score,Kermit_present,StatlerWaldorf_score,StatlerWaldorf_present
0,211,29251,19:30.04,0.287161,0,0.279864,0
1,211,29252,19:30.08,0.328632,0,0.278563,0
2,211,29253,19:30.12,0.217013,0,0.233629,0
3,211,29254,19:30.16,0.298333,0,0.370398,0
4,211,29255,19:30.20,0.287664,0,0.278315,0


In [20]:
a = pd.read_csv("../data/processed/preds/audio_sim1_pred.csv")
v = pd.read_csv("../data/processed/preds/visual_sim1_pred.csv")

print("Audio rows:", len(a))
print("Visual rows:", len(v))
print("Intersection:", len(
    a.merge(v, on=["Video", "Frame_number"])
))


Audio rows: 26952
Visual rows: 26982
Intersection: 26952
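
A minimal sketch of the late-fusion step these files are prepared for, assuming both prediction files share the `<character>_score` columns. The weight `w_audio`, the `_audio`/`_visual` suffixes, the `_fused` column, and the toy values are hypothetical; the actual fusion strategy is not defined in this notebook.

```python
import pandas as pd

def late_fusion(audio, visual, characters, w_audio=0.5):
    """Weighted average of audio and visual scores on shared (Video, Frame_number) keys."""
    keys = ["Video", "Frame_number"]
    fused = audio.merge(visual, on=keys, how="inner", suffixes=("_audio", "_visual"))
    for ch in characters:
        fused[f"{ch}_score_fused"] = (
            w_audio * fused[f"{ch}_score_audio"]
            + (1.0 - w_audio) * fused[f"{ch}_score_visual"]
        )
    return fused

# Toy predictions: the visual side has one extra frame, as in the counts above.
audio_toy = pd.DataFrame(
    {"Video": [211, 211], "Frame_number": [0, 1], "Kermit_score": [0.2, 0.8]}
)
visual_toy = pd.DataFrame(
    {"Video": [211, 211, 211], "Frame_number": [0, 1, 2], "Kermit_score": [0.6, 0.4, 0.9]}
)
fused = late_fusion(audio_toy, visual_toy, ["Kermit"])
print(fused[["Video", "Frame_number", "Kermit_score_fused"]])
```

The inner join keeps only frames present in both modalities, so the frames that lack audio predictions drop out of the fused set.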


# Audio-based Classification: Discussion and Conclusions

In the audio-only part of SIM1, we focused on detecting character presence using classical audio features: MFCCs with first- and second-order derivatives, pitch (F0), and spectral centroid. These features are commonly used to capture voice timbre, pitch characteristics, and spectral properties of speech.

The results show that audio features are moderately effective for detecting Kermit, with a Mean Average Precision (MAP) of 0.61 for KNN and 0.64 for Logistic Regression. A likely explanation is that Kermit has a very distinctive voice, with a relatively stable pitch range and characteristic spectral properties, so MFCC-based features capture information relevant to identifying his presence.

However, performance for Statler & Waldorf is far lower in the audio-only setting (MAP of 0.06 for KNN and 0.04 for Logistic Regression). This suggests that audio features alone are not sufficient to reliably detect these characters. Several factors may contribute. First, Statler and Waldorf often speak simultaneously or overlap with laughter, background noise, or other speakers. Second, their voices are less distinctive in pitch and timbre than Kermit's. Finally, the audio annotations are frame-based and do not always align perfectly with speaking activity, which adds noise to the labels.

Overall, the audio-only results show that audio features provide useful but incomplete information. They work reasonably well for characters with strong and consistent vocal signatures, but they struggle in scenes with overlapping speech, background noise, or weak audio cues. This illustrates the limitation of relying on a single modality for character detection in complex audiovisual content.