# Summary for Similarity Modeling 1

**Project:** SIM1 — Character presence detection in *The Muppet Show*  
**Modalities:** Visual, Audio, Multimodal (Late Fusion)  
**Team:**  
- *Iana Bembeeva*: audio pipeline (`SIM1_audio.ipynb`) and late fusion (`SIM1_fused.ipynb`)
- *Kartik Arya*: visual pipeline (`SIM1_visual.ipynb`)

## Time sheets for the SIM1 notebooks

This section documents the effort spent on implementing and analyzing the SIM1 pipelines in December 2025.


### Iana (Audio + Fusion)

<table>
<thead>
  <tr>
    <th>Date</th>
    <th>Task</th>
    <th>Hours</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>31.11.25</td>
    <td>EDA for audio and video data, and discussion of possible feature methods</td>
    <td>3</td>
  </tr>
  <tr>
    <td>05.12.25</td>
    <td>Setup of "SIM1 Audio domain features" notebook and frame-level alignment</td>
    <td>4</td>
  </tr>
  <tr>
    <td>08.12.25</td>
    <td>Implementation of Audio feature extraction for Waldorf and Statler</td>
    <td>5</td>
  </tr>
  <tr>
    <td>15.12.25</td>
    <td>Audio-only model training and testing of character voice discrimination</td>
    <td>4</td>
  </tr>
  <tr>
    <td>18.12.25</td>
    <td>Late fusion notebook setup: merging prediction scores from modalities</td>
    <td>3</td>
  </tr>
  <tr>
    <td>19.12.25</td>
    <td>Character-oriented fusion weighting and MAP optimization</td>
    <td>4</td>
  </tr>
  <tr>
    <td>20.12.25</td>
    <td>Final fusion result analysis and comparison to unimodal baselines</td>
    <td>3</td>
  </tr>
</tbody>
</table>


### Kartik (Visual)

<table>
<thead>
  <tr>
    <th>Date</th>
    <th>Task</th>
    <th>Hours</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>31.11.25</td>
    <td>EDA for audio and video data, and discussion of possible feature methods</td>
    <td>3</td>
  </tr>
  <tr>
    <td>04.12.25</td>
    <td>Initial setup of SIM1 Visual notebook and target definition (Kermit, Fozzie)</td>
    <td>4</td>
  </tr>
  <tr>
    <td>08.12.25</td>
    <td>Implementation of green_mask and edge_detection features for Kermit detection</td>
    <td>5</td>
  </tr>
  <tr>
    <td>12.12.25</td>
    <td>Texture analysis and implementation of brown_rhythm pattern for Fozzie Bear</td>
    <td>5</td>
  </tr>
  <tr>
    <td>13.12.25</td>
    <td>Sequential feature-space construction and Visual model training</td>
    <td>4</td>
  </tr>
  <tr>
    <td>17.12.25</td>
    <td>Sanity checks and Visual prediction export for late fusion</td>
    <td>3</td>
  </tr>
  <tr>
    <td>18.12.25</td>
    <td>Result analysis and documentation of Visual-only MAP performance</td>
    <td>4</td>
  </tr>
</tbody>
</table>


## General approach and SIM1 targets

We split the work to independently optimize visual and audio detections before combining them. SIM1 focused on the following targets:
- **Kermit**
- **Statler & Waldorf**
- **Fozzie Bear**

We used a **time-ordered split** so that training and testing took place on temporally distinct parts of the episodes, which keeps the evaluation realistic.
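
A time-ordered split can be sketched as follows. This is a minimal illustration rather than the notebooks' actual code: the function name `time_ordered_split`, the `train_fraction` parameter, and the use of integer frame indices are all assumptions.

```python
def time_ordered_split(frame_ids, train_fraction=0.75):
    """Split the frames of one episode by time.

    The earliest `train_fraction` of frames goes to training and the
    remainder to testing, so no test frame precedes a training frame.
    `train_fraction` is a hypothetical parameter, not a value from the project.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ordered = sorted(frame_ids)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Eight frames of a single episode, deliberately given out of order.
train_ids, test_ids = time_ordered_split([3, 0, 7, 1, 5, 2, 6, 4])
print(train_ids, test_ids)
```

Compared with a random shuffle, this keeps near-identical neighbouring frames from landing on both sides of the split.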


## Visual-based detection (SIM1 Visual)

### Feature design

The visual pipeline utilized targeted feature engineering to identify character-specific traits:

- **Kermit:** Detected using the `green_mask` and `eye_blob` features. The eye blobs are identified by a distinctive black curve and a central black dot.
- **Fozzie Bear:** Identified by his light brown-orange fur color and the unique `brown_rhythm` texture pattern.
- **Statler & Waldorf:** These characters were found to be more difficult to separate using simple visual masks alone, shifting the detection focus for them toward the audio domain.
- **Note:** No single feature works best on its own; it is the combination of weak features that makes the difference in what the final model learns.
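
To illustrate what a weak color feature such as `green_mask` might look like, here is a minimal sketch using only the standard library. The HSV thresholds, the helper names `is_green` and `green_fraction`, and the pixel-list input are assumptions made for this example; the notebook's actual implementation and thresholds may differ, and a real pipeline would normally vectorize this over image arrays.

```python
import colorsys

# Hypothetical thresholds for "Kermit green"; not taken from the notebook.
HUE_RANGE = (0.22, 0.45)  # fraction of the hue circle, roughly 80 to 160 degrees
MIN_SATURATION = 0.3
MIN_VALUE = 0.2


def is_green(pixel):
    """Return True if an (R, G, B) pixel with 0-255 channels is in the green range."""
    r, g, b = (channel / 255.0 for channel in pixel)
    hue, saturation, value = colorsys.rgb_to_hsv(r, g, b)
    return (
        HUE_RANGE[0] <= hue <= HUE_RANGE[1]
        and saturation >= MIN_SATURATION
        and value >= MIN_VALUE
    )


def green_fraction(pixels):
    """Share of pixels that pass the green mask; one weak feature per frame."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(is_green(p) for p in pixels) / len(pixels)


# Toy "frame": two greenish pixels, one red pixel, one dark grey pixel.
frame = [(0, 255, 0), (30, 200, 40), (255, 0, 0), (20, 20, 20)]
print(green_fraction(frame))
```

A single scalar like this cannot identify Kermit by itself (any green prop would trigger it), which is why it only becomes useful in combination with other features.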


## Audio-based detection (SIM1 Audio)

The audio pipeline provided critical discriminative cues where visual features were limited:

- **Statler & Waldorf:** These characters have highly distinctive voices, making the audio domain the primary feature space for their detection.
- **Kermit:** Audio information serves as a strong secondary cue to verify visual detections.


## Multimodal Fusion: Discussion and Conclusions

To overcome unimodal limitations, we applied a **late fusion strategy** by combining prediction scores from the audio and visual models. This avoided issues with feature scale mismatch and allowed for character-specific weighting:

- **Kermit:** Higher weight was assigned to audio due to his discriminative voice.
- **Statler & Waldorf:** Visual information was prioritized to separate them from other voices in complex scenes.
- **Fozzie Bear:** The brown-orange fur rhythm detection combined with edge features was expected to work, but a better feature combination turned out to be necessary. Optical flow might help because his movements are very subtle, but that falls within the scope of SIM2.
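
Score-level late fusion with character-specific weights can be sketched as a weighted average of the two models' per-frame scores. The weights below are purely illustrative (the tuned values are not given in this summary), and the names `AUDIO_WEIGHT` and `late_fusion` are hypothetical.

```python
# Illustrative audio weights only; the visual weight is 1 - audio weight.
AUDIO_WEIGHT = {
    "Kermit": 0.6,             # audio weighted higher, as described above
    "Statler & Waldorf": 0.4,  # visual weighted higher, as described above
}


def late_fusion(visual_scores, audio_scores, audio_weight):
    """Fuse per-frame prediction scores from two models by weighted average."""
    if len(visual_scores) != len(audio_scores):
        raise ValueError("both modalities must provide one score per frame")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return [
        visual_weight * v + audio_weight * a
        for v, a in zip(visual_scores, audio_scores)
    ]


# Two frames with scores in [0, 1] from each unimodal model.
fused = late_fusion([0.2, 0.9], [0.8, 0.7], AUDIO_WEIGHT["Kermit"])
print(fused)
```

Because only scores are combined, the audio and visual feature vectors never need to share a scale, and the weight can be tuned per character against a validation metric such as MAP.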

### Fusion results (MAP)

| Character | Fused MAP |
|---|---:|
| Kermit | **0.60** |
| Statler & Waldorf | **0.12** |
| Fozzie Bear | **0.30** |
| **Overall MAP** | **0.34** |
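
For reference, the sketch below shows how average precision can be computed from ranked frame scores and how the per-character values combine into the overall figure. It assumes the overall MAP is the unweighted mean of the per-character values, which is consistent with the numbers in the table; the function names are illustrative and the notebooks may rely on a library routine instead.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks of the positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ap_values):
    """Unweighted mean of per-character average precision values."""
    ap_values = list(ap_values)
    return sum(ap_values) / len(ap_values)


# Toy ranking: positives end up at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# Per-character fused values copied from the table above.
fused_ap = {"Kermit": 0.60, "Statler & Waldorf": 0.12, "Fozzie Bear": 0.30}
overall_map = mean_average_precision(fused_ap.values())
print(round(ap, 4), round(overall_map, 2))
```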


## Final Remarks

The experiments demonstrate that multimodal fusion is essential for reliable detection in *The Muppet Show*. By combining audio cues (useful for silent character presence or background noise) and visual cues (useful for specific character masks), the system significantly outperformed the unimodal baselines, achieving an overall MAP of 0.34.
