# Similarity Modeling 1 - Audio Domain Features

> Setup & paths

1. Enable autoreload for local utils/ modules

2. Add project root to sys.path to import utils.*

3. Define constants (FPS) used for audio↔visual alignment


In [None]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("..")

import pandas as pd

from utils import audio_tools as audioTools
from utils import gt_and_modeling_dfs as prepare_df

import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

from utils import evaluation_tools as eval

> **Choice of split:**\
> The split has to be identical to the one used for visual feature extraction so that the two modalities can be fused later, so we use the same strategy here: there are 4 main characters to identify, and the split time for each episode is chosen so that all characters have a roughly equal number of appearances in both splits.
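
One possible reading of this balancing rule, as a minimal sketch: scan candidate split seconds and keep the one where every character's share of appearances before the split is closest to a common target. The function name `pick_split_second`, the toy appearance lists, and the `target_train_share` value are hypothetical and are not taken from the project code.

```python
from bisect import bisect_right

def pick_split_second(appearances, candidates, target_train_share=0.8):
    """Pick the candidate split second with the most balanced per-character shares.

    appearances: {character: sorted, non-empty list of seconds where it is present}
    target_train_share: hypothetical share of appearances wanted before the split
    """
    def worst_gap(t):
        gaps = []
        for seconds in appearances.values():
            # fraction of this character's appearances at or before second t
            train_share = bisect_right(seconds, t) / len(seconds)
            gaps.append(abs(train_share - target_train_share))
        return max(gaps)  # the worst-balanced character decides

    return min(candidates, key=worst_gap)

# Toy data (illustrative only)
toy = {
    "Kermit": [10, 50, 90, 130, 170],
    "StatlerWaldorf": [20, 40, 60, 80, 160],
}
split_s = pick_split_second(toy, candidates=range(0, 181, 10))
print(split_s)
```

Taking the worst gap across characters keeps a frequent character from hiding a poorly balanced rare one.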

In [None]:
FPS_TO_SAVE = 25  # to match visual

EPISODES = {
    "Muppets-02-01-01": {
        "path": "../data/raw/Muppets-02-01-01.avi",
        "train_split_timestamp": "19:30",
        "ground_truth_path": "../data/muppets-gt-2025wt/Ground_Truth_New_01.xlsx"
    },
    "Muppets-02-04-04": {
        "path": "../data/raw/Muppets-02-04-04.avi",
        "train_split_timestamp": "19:52",
        "ground_truth_path": "../data/muppets-gt-2025wt/Ground_Truth_New_04.xlsx"
    },
    "Muppets-03-04-03": {
        "path": "../data/raw/Muppets-03-04-03.avi",
        "train_split_timestamp": "19:54",
        "ground_truth_path": "../data/muppets-gt-2025wt/Ground_Truth_New_03.xlsx"
    }
}

EPISODE_NAME_TO_VIDEO_ID = {
    "Muppets-02-01-01": 211,
    "Muppets-02-04-04": 244,
    "Muppets-03-04-03": 343
}

# character-oriented GT labels (narrowed to Kermit + StatlerWaldorf before training)
SIM1_CHARACTER_LABEL_COLS = ["Kermit", "StatlerWaldorf", "Fozzie Bear"]


> Dataset configuration

- Episodes: video paths + GT files + per-episode time split

- Mapping episode name → numeric Video id used in GT

- Characters for SIM1 (binary presence labels)
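
As a rough illustration of how a per-episode `train_split_timestamp` relates to frame numbers at the saved frame rate, the sketch below converts an `MM:SS` string into a frame index. The helper names are hypothetical, and whether the boundary frame belongs to train or test in `prepare_df.split_feature_space_df` is an assumption here, since that function is not shown.

```python
def timestamp_to_frame(timestamp: str, fps: int = 25) -> int:
    """Convert an 'MM:SS' timestamp into a frame index at the given frame rate."""
    minutes, seconds = timestamp.split(":")
    return (int(minutes) * 60 + int(seconds)) * fps

def is_train_frame(frame_number: int, split_timestamp: str, fps: int = 25) -> bool:
    """Assumption: frames strictly before the split frame go to the train block."""
    return frame_number < timestamp_to_frame(split_timestamp, fps)

split_frame = timestamp_to_frame("19:30", fps=25)
print(split_frame)  # (19 * 60 + 30) * 25 = 29250
```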


In [None]:
# GT combined

# Load + consolidate Ground Truth (GT)
# Read GT from all episodes and normalize timestamps
# Output: one combined dataframe used for feature extraction
all_ep_gt_df = prepare_df.all_ep_gt(EPISODES)
print(all_ep_gt_df.shape)
display(all_ep_gt_df.head())

> Build audio feature space (frame-aligned)

- Extract per-frame audio features (MFCC + deltas, F0, spectral centroid, …)

- Align features with GT by (Video, Frame_number, Timestamp)

- Save to data/processed/feature_spaces/audio_sim1.csv for reuse
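
To make the idea of frame alignment concrete: with `sr=22050` and `fps=25`, one video frame spans 22050 / 25 = 882 audio samples, so an analysis hop of 882 samples yields one feature row per video frame. The sketch below computes a spectral centroid per video frame with NumPy only. It is an illustration under these assumptions and not the implementation in `audio_tools`; in particular, starting each window at the frame boundary (rather than centering it) is a choice made here for simplicity.

```python
import numpy as np

SR, FPS, N_FFT = 22050, 25, 2048
HOP = SR // FPS  # 882 samples per video frame

def spectral_centroid_per_video_frame(signal, sr=SR, hop=HOP, n_fft=N_FFT):
    """Return one spectral centroid (Hz) per video frame."""
    n_frames = int(np.ceil(len(signal) / hop))
    # zero-pad the tail so the last window is complete
    padded = np.pad(signal, (0, n_frames * hop + n_fft - len(signal)))
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    centroids = np.empty(n_frames)
    for k in range(n_frames):
        frame = padded[k * hop : k * hop + n_fft] * window
        mag = np.abs(np.fft.rfft(frame))
        # magnitude-weighted mean frequency; epsilon guards silent frames
        centroids[k] = (freqs * mag).sum() / (mag.sum() + 1e-12)
    return centroids

# One second of a 1 kHz tone gives 25 rows, one per video frame
t = np.arange(SR) / SR
tone = np.sin(2 * np.pi * 1000.0 * t)
centroids = spectral_centroid_per_video_frame(tone)
print(len(centroids), round(float(centroids[0])))
```

The last few frames of a clip overlap the zero padding, so their values are less trustworthy than those of fully covered frames.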

In [None]:
# --- Define character columns for SIM1 audio feature space ---

# All characters we want to KEEP in the feature space (GT columns),
# even if we do not train models for all of them
SIM1_ALL_CHAR_COLS = [
    "Kermit",
    "StatlerWaldorf",
    "Fozzie Bear"
]

print("SIM1_ALL_CHAR_COLS:", SIM1_ALL_CHAR_COLS)

In [None]:
cfg = audioTools.AudioFrameConfig(
    sr=22050,
    fps=FPS_TO_SAVE,
    n_fft=2048,
    n_mfcc=13
)

audio_sim1 = audioTools.build_audio_feature_space_df(
    EPISODES=EPISODES,
    EPISODE_NAME_TO_VIDEO_ID=EPISODE_NAME_TO_VIDEO_ID,
    gt_df=all_ep_gt_df,
    character_cols=SIM1_ALL_CHAR_COLS,
    out_csv_path="../data/processed/feature_spaces/audio_sim1.csv",
    cache_dir="../data/raw/_audio_cache",
    cfg=cfg
)

display(audio_sim1.head())


In [None]:
# Sanity check: audio ↔ visual row alignment

# verify both modalities contain the same GT-aligned keys
# expect large intersection; mismatch indicates FPS / frame extraction issues

visual_sim1 = pd.read_csv("../data/processed/feature_spaces/visual_sim1.csv")

key_cols = ["Video", "Frame_number", "Timestamp"]
merged = visual_sim1[key_cols].merge(audio_sim1[key_cols], on=key_cols, how="inner")
print("Visual rows:", len(visual_sim1))
print("Audio rows:", len(audio_sim1))
print("Intersection:", len(merged))


> ### Train/test split (time-blocked, no leakage) + preprocessing

- Split each episode by timestamp (early = train, later = test)

- Build X by dropping labels and metadata
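
The "no leakage" point also applies to preprocessing: imputation means and scaling statistics should come from the train block only and then be reused on the test block. A minimal NumPy sketch of that idea on toy data follows; it mirrors mean imputation followed by standardization, and all array values are made up for illustration.

```python
import numpy as np

# Toy feature matrices with missing values (e.g. an undefined F0)
X_train = np.array([[1.0, np.nan], [3.0, 10.0], [5.0, 30.0]])
X_test = np.array([[np.nan, 20.0], [7.0, np.nan]])

# Statistics are computed on the train block only
col_mean = np.nanmean(X_train, axis=0)

def impute(X):
    """Replace NaNs with the train-block column means."""
    return np.where(np.isnan(X), col_mean, X)

X_train_imp = impute(X_train)
mu = X_train_imp.mean(axis=0)
sigma = X_train_imp.std(axis=0)   # population std (ddof=0)
sigma[sigma == 0] = 1.0           # avoid division by zero for constant columns

X_train_std = (X_train_imp - mu) / sigma
X_test_std = (impute(X_test) - mu) / sigma  # test reuses train statistics
print(X_test_std)
```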

In [None]:
# --- Config ---

# --- Labels: keep ALL available GT chars in feature space, but train on subset ---
# In SIM1 GT may include more characters (e.g., Fozzie Bear). We keep them in the CSV,
# but for SIM1 audio we train only on target characters (Kermit + StatlerWaldorf).

META_COLS = ["Video", "Frame_number", "Timestamp"]

audio_df = pd.read_csv("../data/processed/feature_spaces/audio_sim1.csv")

# All character GT columns present in the feature space (exclude meta + feature cols)
ALL_SIM1_CHARS_DESIRED = ["Kermit", "StatlerWaldorf", "Fozzie Bear"]  # keep GT for all (if exists)

SIM1_ALL_CHAR_COLS = [c for c in ALL_SIM1_CHARS_DESIRED if c in audio_df.columns]
print("GT columns kept in feature space:", SIM1_ALL_CHAR_COLS)

SIM1_CHARACTER_LABEL_COLS = [c for c in ["Kermit", "StatlerWaldorf"] if c in audio_df.columns]
assert len(SIM1_CHARACTER_LABEL_COLS) > 0, "No target character label columns found in audio_df."
print("Audio models trained for:", SIM1_CHARACTER_LABEL_COLS)


# --- 1) Split (same logic as visual) ---
train_df, test_df = prepare_df.split_feature_space_df(
    feature_df=audio_df,
    EPISODES=EPISODES,
    EPISODE_NAME_TO_VIDEO_ID=EPISODE_NAME_TO_VIDEO_ID
)

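# --- 2) Build X/y ---
# Drop every GT label column (not only the training targets) plus metadata,
# so that no label such as "Fozzie Bear" leaks into the features
DROP_COLS = SIM1_ALL_CHAR_COLS + META_COLS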
X_train_df = train_df.drop(columns=DROP_COLS)
X_test_df  = test_df.drop(columns=DROP_COLS)

# same column order
X_test_df = X_test_df[X_train_df.columns]

col_names = X_train_df.columns.tolist()
print("Training features:", col_names)

# --- 3) Impute (f0 can be NaN) ---
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train_df)
X_test  = imputer.transform(X_test_df)

# --- 4) Standardize (statistics fitted on train only) ---
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# --- 5) Containers for ground truth and predictions (filled by the model cells below) ---
y_true_df = test_df[SIM1_CHARACTER_LABEL_COLS].copy()
y_pred_df = test_df[META_COLS].copy()  # will hold *_present and *_score


> ### Model 1: SGDClassifier 

In [None]:
# --- Train/test split + preprocessing (assumes train_df/test_df already built) ---

# Train only on the target characters (but keep all GT cols in the CSV)
DROP_COLS = SIM1_ALL_CHAR_COLS + META_COLS  # drop all GT + meta from features

X_train_df = train_df.drop(columns=DROP_COLS)
X_test_df  = test_df.drop(columns=DROP_COLS)

# ensure same column order
X_test_df = X_test_df[X_train_df.columns]

# --- Impute + scale ---
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train_df)
X_test  = imputer.transform(X_test_df)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# --- y_true + y_pred containers ---
y_true_df = test_df[SIM1_CHARACTER_LABEL_COLS].copy()
y_pred_df = test_df[META_COLS].copy()  

# --- FINAL Model 1: per-character binary classification (SGDClassifier, log-loss) ---
for character in SIM1_CHARACTER_LABEL_COLS:
    y_train = train_df[character].astype(int).values

    clf = SGDClassifier(
        loss="log_loss",
        penalty="l2",
        alpha=1e-4,
        max_iter=2000,
        tol=1e-3,
        random_state=42
    )
    clf.fit(X_train, y_train)

    y_score = clf.predict_proba(X_test)[:, 1]
    y_pred  = (y_score >= 0.5).astype(int)

    y_pred_df[f"{character}_score"]   = y_score
    y_pred_df[f"{character}_present"] = y_pred

# --- Evaluation (FINAL) ---
metrics_sgd, overall_map_sgd = eval.evaluate_multiclass(
    y_true_df=y_true_df,
    y_pred_df=y_pred_df,
    characters=SIM1_CHARACTER_LABEL_COLS
)

print("Overall MAP (SGD FINAL):", overall_map_sgd)
print(metrics_sgd)

BEST_AUDIO_MODEL = {
    "name": "SGDClassifier(log_loss)",
    "overall_map": float(overall_map_sgd),
    "per_character_map": {k: float(v["MAP"]) for k, v in metrics_sgd.items()}
}
BEST_AUDIO_MODEL


> ### Model 2: Logistic Regression baseline (probabilistic scores)

- Same split + preprocessing

- Balanced class weights to address strong label imbalance

- Outputs probability scores via predict_proba (balanced class weights shift these toward the positive class, so they should not be read as calibrated)
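
For intuition about what `class_weight="balanced"` does, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The pure-Python sketch below reproduces that formula on a made-up label vector with 5% positives; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weight: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy labels: 95 absent frames, 5 present frames
y = [0] * 95 + [1] * 5
weights = balanced_class_weights(y)
print(weights)
```

Each positive frame here counts about 19 times as much as a negative one in the loss, which tends to help recall on rare characters but also pushes the predicted probabilities upward.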

In [None]:
# --- Baseline Model: Logistic Regression (for comparison only, NOT used for fusion export) ---

y_pred_lr = test_df[META_COLS].copy()

for character in SIM1_CHARACTER_LABEL_COLS:
    y_train = train_df[character].astype(int).values

    clf = LogisticRegression(
        class_weight="balanced",
        max_iter=2000,
        n_jobs=-1
    )
    clf.fit(X_train, y_train)

    y_score = clf.predict_proba(X_test)[:, 1]
    y_pred  = (y_score >= 0.5).astype(int)

    y_pred_lr[f"{character}_score"]   = y_score
    y_pred_lr[f"{character}_present"] = y_pred

metrics_lr, overall_map_lr = eval.evaluate_multiclass(
    y_true_df=y_true_df,
    y_pred_df=y_pred_lr,
    characters=SIM1_CHARACTER_LABEL_COLS
)

print("Overall MAP (LogReg baseline):", overall_map_lr)
print(metrics_lr)


# Audio-based Classification: Discussion and Conclusions

In the audio-only part of SIM1, we focused on detecting character presence using classical audio features, including MFCCs with first and second order derivatives, pitch (F0), and spectral centroid. These features are widely used to capture voice timbre, pitch characteristics, and spectral properties of speech.

The audio feature space was constructed to include ground truth labels for all characters available in SIM1, even though the audio models were trained only for a selected subset of characters (Kermit and Statler & Waldorf). This design choice ensures consistency of the feature space across modalities and allows seamless integration in later multimodal fusion stages.

For classification, the final audio model was implemented using an SGDClassifier with log-loss, which provides probabilistic outputs and scales well to large frame-level datasets. Logistic Regression was used as a baseline for comparison but was not used for downstream fusion.

The results show that audio features are moderately effective for detecting Kermit, with a Mean Average Precision (MAP) of approximately 0.64. A likely explanation is that Kermit has a highly distinctive voice, characterized by a relatively stable pitch range and consistent spectral patterns, so MFCC-based features capture discriminative information that supports detecting his presence.

In contrast, performance for Statler & Waldorf is far lower in the audio-only setting, with MAP values around 0.04–0.06, which indicates that audio features alone are not sufficient to detect these characters reliably. Several factors likely contribute to this limitation. First, Statler and Waldorf frequently speak simultaneously or overlap with laughter, background noise, or other speakers. Second, their vocal characteristics appear less distinctive in terms of pitch and timbre than Kermit's. Finally, the frame-level annotations do not always align precisely with actual speaking activity, which introduces additional label noise.
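
To make the MAP figures above easier to interpret, here is a minimal sketch of the common non-interpolated average precision (AP) for one character; MAP is then the mean of the per-character AP values. The exact computation inside `evaluation_tools.evaluate_multiclass` is not shown in this notebook, so this is an assumption about the metric's definition and not a copy of that code. Ties between scores are broken by input order in this simplified version.

```python
def average_precision(y_true, y_score):
    """Mean of precision@rank taken at the rank of every positive item."""
    # rank items from highest to lowest score
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

# Toy example: positives are ranked 1st and 3rd
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.2, 0.1]
ap = average_precision(y_true, y_score)
print(ap)  # (1/1 + 2/3) / 2
```

For reference, a ranking that ignores the input has an expected AP close to the fraction of positive frames, so a low value should be read against how rarely the character appears in the test block.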

Overall, the audio-only results demonstrate that audio features provide useful but incomplete information for character detection. They work best for characters with strong and consistent vocal signatures, but struggle in scenes with overlapping speech, background noise, or weak audio cues. This confirms the limitations of relying on a single modality and motivates the use of multimodal fusion, where audio cues can complement more reliable visual information.