# Summary for Similarity Modeling 2 (SIM2)

**Project:** SIM2 — Character presence detection in *The Muppet Show*  
**Modalities:** Visual, Audio, Multimodal (Fusion)  
**Team:**  
- *Iana Bembeeva* — visual pipeline (`SIM2_visual.ipynb`)  
- *Kartik Arya* — audio pipeline (`SIM2_audio.ipynb`) and fusion (`SIM2_fused.ipynb`)


## Time sheets for the SIM2 notebooks

This section documents the time spent on implementing, experimenting with, and
analyzing the SIM2 pipelines.  
The reported time reflects hands-on work in the corresponding notebooks
(data preparation, feature engineering, modeling, evaluation, and discussion).



### Iana (Visual)

<table>
<thead>
  <tr>
    <th>Date</th>
    <th>Task</th>
    <th>Hours</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>04.01.26</td>
    <td>Initial setup of SIM2 visual notebook, dataset inspection, target definition and aliases</td>
    <td>2</td>
  </tr>
  <tr>
    <td>05.01.26</td>
    <td>Implementation of SIM2 visual feature extraction (LBP, HOG, optical flow)</td>
    <td>6</td>
  </tr>
  <tr>
    <td>07.01.26</td>
    <td>Sequential feature-space construction, runtime optimization, CSV export</td>
    <td>5</td>
  </tr>
  <tr>
    <td>08.01.26</td>
    <td>Train/test split validation, per-target sanity checks, debugging</td>
    <td>2.5</td>
  </tr>
  <tr>
    <td>10.01.26</td>
    <td>Model training (ExtraTrees, SGD+Calibrated), comparison and evaluation (MAP)</td>
    <td>2</td>
  </tr>
  <tr>
    <td>11.01.26</td>
    <td>Result analysis, interpretation, and preparation of visual predictions for fusion</td>
    <td>4</td>
  </tr>
</tbody>
</table>


### Kartik (Audio + Fusion)

<table>
<thead>
  <tr>
    <th>Date</th>
    <th>Task</th>
    <th>Hours</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>04.01.26</td>
    <td>Setup of SIM2 audio notebook, alignment of audio features to frame-level GT</td>
    <td>1.5</td>
  </tr>
  <tr>
    <td>06.01.26</td>
    <td>Implementation of audio feature extraction and audio feature space construction</td>
    <td>5.5</td>
  </tr>
  <tr>
    <td>06.01.26</td>
    <td>Audio model training (Linear SVM, Gradient Boosting) and evaluation</td>
    <td>2</td>
  </tr>
  <tr>
    <td>09.01.26</td>
    <td>Fusion notebook setup: merging audio and visual feature spaces</td>
    <td>3</td>
  </tr>
  <tr>
    <td>10.01.26</td>
    <td>Fusion model training, scaling/imputation, evaluation and debugging</td>
    <td>1</td>
  </tr>
  <tr>
    <td>11.01.26</td>
    <td>Fusion result analysis, comparison to unimodal results, final cleanup</td>
    <td>5</td>
  </tr>
</tbody>
</table>


## General approach and work distribution

We split the work by modality to ensure clarity and fast iteration:

- **Visual pipeline:** implemented by Iana in `SIM2_visual.ipynb`
- **Audio pipeline:** implemented by Kartik in `SIM2_audio.ipynb`
- **Multimodal fusion:** implemented by Kartik in `SIM2_fused.ipynb`

We reused the same episodes, video IDs, and **time-ordered train/test split**
strategy from SIM1.  
This avoids leakage from temporally adjacent frames and leads to a more realistic
evaluation of generalization.


## Dataset and SIM2 targets

We used the consolidated ground truth file:

- `data/processed/all_ep_gt.csv`

SIM2 requires different semantic targets than SIM1.  
We therefore introduced lightweight **aliases** on top of the existing labels:

- **Piggy** = `Miss Piggy`
- **Chef** = `Cook`
- **OtherPigs** = (`Pigs == 1` and `Piggy == 0`)

Final SIM2 targets used in the visual pipeline:

- `Piggy`
- `OtherPigs`
- `Chef`
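
The alias rules above can be sketched in pandas as follows. This is a minimal illustration on a toy frame, assuming the ground truth has 0/1 columns named `Miss Piggy`, `Pigs`, and `Cook`; the actual column layout of `all_ep_gt.csv` may differ.

```python
import pandas as pd

# Toy stand-in for the ground truth table (column names are assumed).
gt = pd.DataFrame({
    "Miss Piggy": [1, 0, 0, 1],
    "Pigs":       [1, 1, 0, 0],
    "Cook":       [0, 0, 1, 0],
})

# Direct aliases: same labels under the SIM2 target names.
gt["Piggy"] = gt["Miss Piggy"]
gt["Chef"] = gt["Cook"]

# OtherPigs: a pig is present in the frame, but Miss Piggy is not.
gt["OtherPigs"] = ((gt["Pigs"] == 1) & (gt["Piggy"] == 0)).astype(int)

print(gt[["Piggy", "OtherPigs", "Chef"]])
```

Keeping the aliases as derived columns leaves the original SIM1 labels untouched, so both target sets can be read from the same file.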


## Time-based split and sanity checks

All experiments use a **time-ordered split**:
- training on earlier timestamps of each episode,
- testing on later timestamps.

Before training, we explicitly verified that **each target has positive samples
in both train and test splits** for every episode.  
This prevents silent failure cases where a classifier cannot learn or cannot be
evaluated due to missing positives.
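
The split and the positive-sample check can be sketched as below. The helper names `time_ordered_split` and `missing_positives`, the default split fraction, and the toy data are assumptions for illustration; the notebooks may implement this differently.

```python
import pandas as pd

def time_ordered_split(df, train_frac=0.8):
    """Per video: the earliest train_frac of frames go to train, the rest to test."""
    train_parts, test_parts = [], []
    for _, ep in df.groupby("Video", sort=False):
        ep = ep.sort_values("Timestamp")
        cut = int(len(ep) * train_frac)
        train_parts.append(ep.iloc[:cut])
        test_parts.append(ep.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)

def missing_positives(train, test, targets):
    """Return (video, target, split) triples that contain no positive sample."""
    problems = []
    for name, part in (("train", train), ("test", test)):
        counts = part.groupby("Video")[targets].sum()
        for video, row in counts.iterrows():
            problems += [(video, t, name) for t in targets if row[t] == 0]
    return problems

# Toy data: two videos with 10 frames each (not real annotations).
df = pd.DataFrame({
    "Video": ["ep1"] * 10 + ["ep2"] * 10,
    "Timestamp": list(range(10)) * 2,
    "Chef": [1, 0, 0, 0, 0, 0, 0, 0, 1, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})
train, test = time_ordered_split(df, train_frac=0.5)
print(missing_positives(train, test, ["Chef"]))
```

In this toy case the check flags `ep2`, whose only `Chef` frames fall into the training half, so the test half has no positives for that target.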


## Visual-based detection (SIM2 Visual)

### Feature design

We intentionally used **lightweight, computationally efficient features** and
avoided heavy descriptors (e.g. DAISY).

The final visual feature space consists of:

- **LBP (32 bins):** texture information
- **HOG (aggregated):** shape and contour information
- **Farneback optical flow (mean / std / ratio):** motion information

Features are extracted per frame and stored in:

- `data/processed/feature_spaces/visual_sim2.csv`


### Visual modeling

We trained **per-predicate binary classifiers** and compared two allowed model types:

- `ExtraTreesClassifier`
- `SGDClassifier` with `CalibratedClassifierCV`

Features are scaled with `StandardScaler`
(required for SGD, harmless for ExtraTrees).

Predictions on the test split are exported to:

- `data/processed/preds/visual_sim2_pred.csv`
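
A minimal sketch of the per-target training loop, using synthetic features in place of `visual_sim2.csv`. The hyperparameters, the `make_models` helper, and the toy labels are assumptions, not the settings used in the notebook.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the visual feature space.
X_train = rng.normal(size=(200, 8))
X_test = rng.normal(size=(80, 8))
y_train = {
    "Chef": (X_train[:, 0] > 0.5).astype(int),
    "Piggy": (X_train[:, 1] > 0.5).astype(int),
}

def make_models():
    """Fresh, unfitted pipelines; the scaler is fitted on training data only."""
    return {
        "ExtraTrees": make_pipeline(
            StandardScaler(),
            ExtraTreesClassifier(n_estimators=100, random_state=0),
        ),
        # SGD with hinge loss has no predict_proba, so it is wrapped in a calibrator.
        "SGD + Calibrated": make_pipeline(
            StandardScaler(),
            CalibratedClassifierCV(SGDClassifier(random_state=0), cv=3),
        ),
    }

# One binary classifier per target and model type.
scores = {}
for target, y in y_train.items():
    for name, model in make_models().items():
        model.fit(X_train, y)
        scores[(name, target)] = model.predict_proba(X_test)[:, 1]
```

Putting the scaler inside the pipeline keeps test-split statistics out of the fit, which matters for the time-ordered evaluation.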


### Visual results (MAP)

| Model | Overall MAP | Piggy | OtherPigs | Chef |
|---|---:|---:|---:|---:|
| ExtraTrees | **0.376** | 0.078 | 0.224 | **0.825** |
| SGD + Calibrated | 0.233 | **0.212** | 0.169 | 0.317 |
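
For reference, the sketch below shows how average precision (AP) and MAP can be computed. The reported overall values are consistent with an unweighted mean of the three per-target APs; the helper functions are illustrative and not taken from the notebooks.

```python
def average_precision(y_true, scores):
    """AP: mean of precision@k over the ranks k at which a positive occurs."""
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else float("nan")

def mean_average_precision(per_target_ap):
    """Unweighted mean of the per-target AP values."""
    return sum(per_target_ap.values()) / len(per_target_ap)

# Toy ranking: positives at ranks 1 and 3, so AP = (1/1 + 2/3) / 2.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))

# Per-target values from the ExtraTrees row of the table above.
extra_trees = {"Piggy": 0.078, "OtherPigs": 0.224, "Chef": 0.825}
print(round(mean_average_precision(extra_trees), 3))  # 0.376
```

Because the mean is unweighted, the strong `Chef` score dominates the overall MAP for ExtraTrees even though `Piggy` is detected better by the SGD model.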


### Visual results interpretation

- **Chef** performs very well visually (MAP ≈ 0.83 with ExtraTrees), which is
  expected due to the Swedish Chef’s highly distinctive motion patterns captured
  by optical flow.
- **OtherPigs** achieves moderate performance, reflecting the heterogeneity of
  this category (different pig characters, poses, and partial visibility).
- **Piggy** is the most challenging visually: strong pose, scale, and illumination
  variation limit separability under a compact feature representation.

We therefore selected **ExtraTrees** as the final visual model based on overall MAP.


## Audio-based detection (SIM2 Audio)

Audio features are aligned to the video frames (25 fps) and stored in:

- `data/processed/feature_spaces/audio_sim2.csv`

Two audio model families were evaluated:
- Linear SVM (`LinearSVC`)
- Histogram-based gradient boosting (`HistGradientBoostingClassifier`)
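
One practical detail when ranking frames with `LinearSVC`: it has no `predict_proba`, so the signed distance from `decision_function` can serve as the ranking score for average precision. The sketch below uses synthetic features; the data and `C=1.0` are assumptions.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic stand-in for frame-aligned audio features.
X_train, X_test = rng.normal(size=(300, 6)), rng.normal(size=(100, 6))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 1).astype(int)
y_test = (X_test[:, 0] + 0.5 * X_test[:, 1] > 1).astype(int)

svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
svm.fit(X_train, y_train)

# Signed margin: larger means more confidently positive.
scores = svm.decision_function(X_test)
ap = average_precision_score(y_test, scores)
print(f"AP on toy data: {ap:.3f}")
```

Average precision depends only on the ordering of the scores, so uncalibrated margins are sufficient here.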


### Audio results (MAP)

| Model | Overall MAP | Pigs | Miss Piggy | Cook |
|---|---:|---:|---:|---:|
| Linear SVM | **0.150** | 0.318 | 0.087 | **0.047** |
| HistGradientBoosting | 0.146 | **0.322** | **0.088** | 0.028 |


### Audio results interpretation

- Audio cues are strongest for **Pigs**, which is a broad category with more
  positive samples.
- **Miss Piggy** and **Chef** are difficult in audio-only detection, likely due to
  sparse speech and frequent background interference.
- Overall, audio alone is weak but provides complementary information for fusion.


## Multimodal fusion (Audio + Visual)

For fusion, audio and visual feature spaces are merged by:

- `Video`, `Frame_number`, `Timestamp`

The fused feature space is then split using the same time-ordered strategy and
modeled with `HistGradientBoostingClassifier`.


### Fusion results (baseline)

- Overall MAP (fused): **0.317**

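The fused MAP is well above both audio-only results (0.150 and 0.146) but below the best visual-only result (0.376 with ExtraTrees), so fusion does not improve on the strongest unimodal model. One possible cause is the high dimensionality of the fused feature space.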


## Final remarks

SIM2 follows the same core philosophy as SIM1:
- time-aware evaluation,
- per-predicate binary modeling,
- compact and interpretable feature design,
- explicit export of intermediate results for fusion.

The visual pipeline provides strong cues for motion-dominant characters and
the audio pipeline contributes complementary information. Fusion, which combines
both modalities, was expected to achieve the best overall performance, but it did not outperform the visual model. PCA or another dimensionality reduction technique could help here.
