In [7]:
import sys
from pathlib import Path

# add project root to PYTHONPATH
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

import utils.evaluation_tools as evaluation_tools

import pandas as pd
import numpy as np


# SIM1 Feature Choice

In SIM1, the task was to find frames containing the characters Kermit, Statler & Waldorf, and Fozzie Bear.

**In Visual Domain:**

- We focused on "Kermit" and "Fozzie Bear".
- For Kermit, the 'green_mask' and 'eye_blob' features are used. He has a distinctive green color, and each of his eye blobs contains a black curve beside a black dot in the middle; combining these two cues lets us find him.
- For Fozzie Bear, his skin color is light brown and has a unique texture, so the 'brown_rhythm' pattern is used to locate his frames.
<br>
<br>

**In Audio Domain:**
- Waldorf and Statler do not have specific visual features but have unique voice features, so the audio domain feature space focuses more on them.

In [8]:
CHAR_COLS = ["Kermit", "StatlerWaldorf"]
KEY_COLS = ["Video", "Frame_number"] 

audio_pred = pd.read_csv("../data/processed/preds/audio_sim1_pred.csv")   
visual_pred = pd.read_csv("../data/processed/preds/visual_sim1_pred.csv") 

df = audio_pred.merge(
    visual_pred,
    on=KEY_COLS,
    suffixes=("_audio", "_visual"),
    how="inner"
)

# fused scores
weights = {
    "Kermit": (0.6, 0.4),          # (audio_w, visual_w)
    "StatlerWaldorf": (0.2, 0.8),
}

for ch in CHAR_COLS:
    wa, wv = weights[ch]
    df[f"{ch}_score"] = wa*df[f"{ch}_score_audio"] + wv*df[f"{ch}_score_visual"]
    df[f"{ch}_present"] = (df[f"{ch}_score"] >= 0.5).astype(int)

gt = pd.read_csv("../data/processed/feature_spaces/visual_sim1.csv")[KEY_COLS + CHAR_COLS]
gt = gt.merge(df[KEY_COLS + [f"{c}_score" for c in CHAR_COLS] + [f"{c}_present" for c in CHAR_COLS]],
              on=KEY_COLS, how="inner")

metrics_fused, overall_fused = evaluation_tools.evaluate_multiclass(
    y_true_df=gt[CHAR_COLS],
    y_pred_df=gt,
    characters=CHAR_COLS
)

print("Overall MAP (Fused):", overall_fused)


Mean Average Precision (MAP) per character:
Kermit: MAP=0.763
StatlerWaldorf: MAP=0.645

Overall MAP (all characters): 0.704
Overall MAP (Fused): 0.7039134973211246


In [16]:
# keep only the key columns plus the fused scores and binary decisions
df_out = gt[KEY_COLS +
            [f"{c}_score" for c in CHAR_COLS] +
            [f"{c}_present" for c in CHAR_COLS]]

df_out.to_csv("../data/processed/preds/fused_sim1_pred.csv", index=False)

# Multimodal Fusion: Discussion and Conclusions

To overcome the limitations of unimodal approaches, we applied a late fusion strategy combining audio-based and visual-based predictions. Instead of concatenating features at an early stage, we fused the prediction scores from independently trained audio and visual models. This approach allows each modality to contribute according to its strengths and avoids issues related to feature scale mismatch and dimensionality imbalance.
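The late fusion step described above can be sketched in isolation. This is a minimal illustration, not the project's implementation: the function name `late_fuse` and the frame scores are hypothetical, and it assumes both modalities output scores in [0, 1] and that the two weights sum to 1.

```python
def late_fuse(audio_score, visual_score, audio_w, visual_w, threshold=0.5):
    """Weighted late fusion of two per-frame scores (hypothetical helper).

    Assumes both scores lie in [0, 1]. With weights that sum to 1, the
    fused score also stays in [0, 1], so a fixed threshold stays meaningful.
    """
    if abs(audio_w + visual_w - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    fused = audio_w * audio_score + visual_w * visual_score
    return fused, int(fused >= threshold)


# Made-up frame scores for illustration only (not project data).
frames = [
    {"audio": 0.9, "visual": 0.2},  # confident audio, weak visual
    {"audio": 0.3, "visual": 0.4},  # both modalities weak
]

# Audio weight 0.6 and visual weight 0.4, as used for Kermit in the notebook.
results = [late_fuse(f["audio"], f["visual"], 0.6, 0.4) for f in frames]
for fused, present in results:
    print(f"fused={fused:.2f} present={present}")
```

Because only scores are combined, each model can be retrained or replaced without touching the other, which is the main practical difference from early (feature-level) fusion.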

Importantly, the fusion strategy was designed in a character-oriented manner. For Kermit, audio information was given a higher weight, as his voice provides strong discriminative cues. For Statler & Waldorf, visual information was emphasized, since these characters are visually distinctive but difficult to separate using audio alone. This character-specific weighting reflects the actual properties of the data and leads to more interpretable results.

The fusion results improve on both unimodal baselines. After fusion, the MAP reached 0.763 for Kermit and 0.645 for Statler & Waldorf, giving an overall MAP of 0.704. This indicates that combining modalities improves detection performance and robustness on this task.
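For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision (AP) per character and the mean over characters (MAP). It is an assumption that `evaluation_tools.evaluate_multiclass` uses this non-interpolated definition, since its source is not shown here; the helper names and toy data are hypothetical. The reported overall value is at least consistent with an unweighted mean of the two per-character values.

```python
def average_precision(y_true, scores):
    """Non-interpolated AP: mean of precision@k over the ranks of the positives."""
    # Rank frame indices by descending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # By convention here, a character with no positive frames gets AP 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_character):
    """per_character maps a name to (y_true, scores); returns (APs, unweighted mean)."""
    aps = {name: average_precision(y, s) for name, (y, s) in per_character.items()}
    return aps, sum(aps.values()) / len(aps)


# Toy labels and scores for illustration only (not project data).
toy = {
    "A": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "B": ([0, 1], [0.2, 0.6]),
}
aps, overall = mean_average_precision(toy)
print(aps, overall)
```

Note that AP depends only on the ranking induced by the fused scores, so the 0.5 decision threshold affects the `_present` columns but not the MAP under this definition.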

These results confirm that audio and visual modalities provide complementary information. Audio helps resolve ambiguities when a character is speaking but not clearly visible, while visual features help in scenes with background noise, overlapping speech, or silent character presence. By combining both sources, the system is able to reduce false positives and false negatives that occur in unimodal settings.
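One way to check the complementarity claim on real predictions is to count the frames where exactly one modality is correct. The sketch below is a hypothetical diagnostic that is not part of the notebook; it assumes binary ground-truth labels and binary per-modality predictions for a single character, and the data shown is made up.

```python
def disagreement_counts(y_true, audio_pred, visual_pred):
    """Count frames by which modality (if any) predicted the label correctly."""
    counts = {"both_correct": 0, "audio_only": 0, "visual_only": 0, "both_wrong": 0}
    for truth, a, v in zip(y_true, audio_pred, visual_pred):
        audio_ok = a == truth
        visual_ok = v == truth
        if audio_ok and visual_ok:
            counts["both_correct"] += 1
        elif audio_ok:
            counts["audio_only"] += 1
        elif visual_ok:
            counts["visual_only"] += 1
        else:
            counts["both_wrong"] += 1
    return counts


# Made-up labels and predictions for illustration only.
y_true      = [1, 1, 0, 0, 1]
audio_pred  = [1, 0, 0, 1, 0]
visual_pred = [0, 1, 0, 0, 0]
counts = disagreement_counts(y_true, audio_pred, visual_pred)
print(counts)
```

Large `audio_only` and `visual_only` counts relative to `both_wrong` would suggest that fusion has room to help, while a dominant `both_wrong` count would point to errors that neither modality can fix.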

# Final Remarks

In conclusion, the experiments show that multimodal fusion is essential for reliable character detection in complex video material such as The Muppet Show. Audio-only and visual-only approaches each have clear limitations, but when combined in a principled way, they achieve significantly better and more balanced performance. The character-oriented fusion strategy used in this project provides an effective and interpretable solution that aligns well with the requirements and goals of SIM1.