# Similarity Modeling 2 Summary
**Parts of the notebook are the same as in Similarity Modeling 1 (general approach)**. 

**Authors: Sakka Mahmoud Abdussalem, Kravchenko Oleksandra**

This notebook summarises our work distribution, approach and results.

## Timesheets
**Oleksandra's timesheet**
<table>
<thead>
  <tr>
    <th>Date</th>
    <th>Task</th>
    <th>Hours</th>

  </tr>
</thead>
<tbody>
  <tr>
    <td>24.10.2024</td>
    <td>Initial Team Meeting</td>
    <td>0.5</td>
  </tr>
    <tr>
    <td>01.12.2024</td>
    <td>Second Team Meeting - Discussion of extraction approach and work split</td>
    <td>0.5</td>
  </tr>
    <tr>
    <td>08.12.2024</td>
    <td>Third Team Meeting</td>
    <td>0.5</td>
  </tr>
</tbody>
</table>

**Mahmoud's timesheet:**

<table>
<thead>
  <tr>
    <th>Date</th>
    <th>Task</th>
    <th>Hours</th>

  </tr>
</thead>
<tbody>
  <tr>
    <td>24.10.2024</td>
    <td>Initial Team Meeting</td>
    <td>0.5</td>
  </tr>
  <tr>
    <td>01.12.2024</td>
    <td>Second Team Meeting - Discussion of extraction approach and work split</td>
    <td>0.5</td>
  </tr>
  <tr>
    <td>06.12.2024</td>
    <td>Initial exploration of material, implementation of trim_video.py and audio_extraction</td>
    <td>4</td>
  </tr>
  <tr>
    <td>07.12.2024</td>
    <td>Implementation of load_data.py</td>
    <td>2</td>
  </tr>
    <tr>
    <td>08.12.2024</td>
    <td>Third Team Meeting</td>
    <td>0.5</td>
  </tr>
</tbody>
</table>


## Work distribution
Since we both participated in both the Similarity Modeling 1 and 2 courses, we split the work so that each of us could experiment with all aspects of video analysis and classification. For Similarity Modeling 2, Mahmoud focused on the video features and Oleksandra on the audio features (the roles are reversed in Similarity Modeling 1). We both contributed to the hybrid approach by creating separate features and merging them. The analytical work was also shared and carried out through regular discussion and iterative improvement of all notebooks and functions. The timesheets give a clearer indication of the work distribution within the group. Helper functions for frame extraction, data loading, cross-validation, evaluation and so on were written in collaboration and refined for the specific needs that came up during the project. The summaries and analyses were mainly written by the authors of the respective notebooks; the same goes for the individual sections in this notebook.

## Approach overview
We focused on experimenting with multiple features mentioned in the Similarity Modeling 2 lectures and their variants. We brainstormed possible features and classification algorithms to cover a broad spectrum of the domain. During feature engineering, we inspected different features on excerpts of the data, tuned them, used them in the classification models, and refined the methods based on the evaluation results. We used a GitHub repository as a collaborative environment for the project.
Regarding the overall technical setup, we created a script `load_data.py` that extracts frames and audio tracks from the videos and organises them in folders on our local machines. The feature extraction functions then reference those files to create datasets, which are also saved and reused to save time. Our classification models were trained and evaluated using either nested cross-validation or a holdout set, depending on the computational demands of the task.
Our train-test-split approach accounts for the time-series nature of the data. We manually define split points that correspond to "fade-outs" in the data to correctly split the videos and ensure that the same scene is not present in both train and test split to avoid data leakage and overfitting. 
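
The splitting idea can be sketched as follows. This is only an illustration: the function names, the frame count and the split points are hypothetical and do not reproduce the project code. Every frame is assigned to the segment that starts at the preceding split point, and whole segments go to either the train or the test set.

```python
from bisect import bisect_right


def assign_segments(n_frames, split_points):
    """Return one segment id per frame; a new segment starts at each split point."""
    boundaries = sorted(split_points)
    return [bisect_right(boundaries, i) for i in range(n_frames)]


def train_test_indices(segment_ids, test_segments):
    """Put whole segments into train or test so no segment is shared between them."""
    train = [i for i, s in enumerate(segment_ids) if s not in test_segments]
    test = [i for i, s in enumerate(segment_ids) if s in test_segments]
    return train, test


# Hypothetical example: 10 frames with manually chosen fade-outs at frames 4 and 7.
segments = assign_segments(n_frames=10, split_points=[4, 7])
train_idx, test_idx = train_test_indices(segments, test_segments={1})
```

Because the assignment works on whole segments, frames from the same scene cannot end up on both sides of the split, provided the split points really coincide with scene boundaries.
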
The evaluation used binary accuracy, precision, recall and F1, as well as the ROC curve. Since our data is highly imbalanced (the target characters are absent much more often than they are present), we found this set of metrics to be the most informative.
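
To illustrate why accuracy alone is not enough on imbalanced data, the following sketch computes the threshold-based metrics from scratch for a classifier that always predicts the majority class. The helper `binary_metrics` and the toy labels are assumptions made for this example; ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data: the character is present in 1 of 10 frames.
y_true = [1] + [0] * 9
# A model that always predicts "absent".
y_pred = [0] * 10
metrics = binary_metrics(y_true, y_pred)
```

In this toy case the accuracy is high although the model never detects the character, which is exactly the failure mode that precision, recall and F1 expose.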

The classifiers chosen for this project are:

- Naive Bayes
- Random Forest 
- Gradient Boosting (XGBoost)

Annotations fix: we reviewed the annotations for GT Muppets-02-01-01 and GT Muppets 03-04-03 and corrected some inconsistencies that we found.

## Classification using audio features
### Features
For the audio analysis we settled on the following three methods:
- **MFCC**: we use Mel-Frequency Cepstral Coefficients (MFCC) primarily to capture the timbre of the characters' voices. Each character's voice has unique characteristics such as tone and resonance. MFCCs are based on the mel scale, which adjusts for how humans perceive sound frequencies, helping to distinguish voices by their spectral shapes.
- **Log Mel Spectrograms**: provide a time-frequency representation of the audio that encodes detailed frequency and intensity information, such as pitch patterns and voice texture. These patterns aim to capture the characters' distinctive speaking style and delivery.
- **Probabilistic YIN**: estimates the pitch via the fundamental frequency. Different characters often speak at different pitches. This feature emphasizes pitch variations, which are essential for identifying characters with distinct vocal ranges.
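
As a small illustration of the mel scale that underlies both MFCCs and mel spectrograms, the sketch below converts between Hz and mel. It assumes the widely used HTK-style formula; audio libraries may use a slightly different variant, and the function names are chosen for this example only.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mel (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers far more mels at low than at high frequencies,
# mirroring how human hearing resolves low frequencies more finely.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(4100) - hz_to_mel(4000)
```

The compression of high frequencies is why mel-based features devote more resolution to the range where most of the differences between voices lie.
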
### Results

## Classification using video features
### Features
For the extraction of visual features, we experimented with various methods, including SIFT. Ultimately, we decided to use Local Binary Patterns (LBP) and Discrete Cosine Transform (DCT) based on their superior performance in our experiments. To effectively capture color information, particularly the distinctive colors of the pigs, we incorporated HSV Color Histograms, adapted from the methods employed in Similarity Modeling 1.
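
The idea behind an HSV colour histogram can be sketched with the standard library alone. This is an illustrative simplification, not the project implementation: the function name, the bin counts and the toy pixel values are assumptions.

```python
import colorsys


def hsv_histogram(pixels, bins=(8, 4, 4)):
    """Build a normalised, flattened HSV histogram from (r, g, b) pixels in 0..255."""
    h_bins, s_bins, v_bins = bins
    hist = [0.0] * (h_bins * s_bins * v_bins)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a value of exactly 1.0 falls into the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
        count += 1
    return [c / count for c in hist] if count else hist


# Toy "frame": three pink pixels and one green pixel.
pixels = [(255, 105, 180)] * 3 + [(0, 128, 0)]
hist = hsv_histogram(pixels)
```

A frame dominated by one distinctive colour concentrates its mass in a few bins, which gives a classifier a simple and strong signal.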

### Results
For pig detection, the Naive Bayes model demonstrates a more consistent AUC performance across all folds compared to XGBoost. In many folds, the precision and recall metrics of Naive Bayes surpass those of XGBoost for the same folds. Across all models, fold 2-B consistently emerges as the worst-performing fold, exhibiting the lowest AUC. Interestingly, Naive Bayes and XGBoost identify different folds as their best-performing ones: Naive Bayes achieves its best performance on fold 0-A, while XGBoost performs best on fold 0-B.
Feature importance analysis unsurprisingly highlights HSV features as the most influential for pig detection. Among the top 10 features, HSV features dominate, with only one DCT feature and one LBP feature making it into the top 10. This dominance is likely due to the pigs’ distinctive colors, which provide a strong basis for classification.


| Visual Results Pigs       | Accuracy   | Precision  | Recall     | F1        | ROC AUC   |
|---------------------------|------------|------------|------------|-----------|-----------|
| **Naive Bayes - Overall** | 0.5653     | 0.2753     | 0.6047     | 0.3581    | 0.6856    |
| **XGBoost - Overall**     | 0.7551     | 0.3932     | 0.1397     | 0.1906    | 0.6606    |
| **Naive Bayes - Best Fold** | 0.5234    | 0.1526     | 0.8835     | 0.2603    | 0.8488    |
| **XGBoost - Best Fold**   | 0.7869     | 0.9524     | 0.1594     | 0.2730    | 0.7615    |

For the Swedish Cook, both models struggle significantly with class imbalance. Compared to the pig detection task, XGBoost demonstrates better overall performance across all folds for this task. As before, fold 2-B emerges as the worst-performing fold for Naive Bayes, while for XGBoost, fold 0-B performs the worst. Interestingly, similar to the Similarity Modeling 1 project, fold 2-A is particularly challenging, as the majority of characters in that episode are dressed up, adding complexity to the classification task.
A closer examination of the classification report and of metrics such as precision and recall reveals that the models often default to predicting the majority class. The best-performing folds are 0-B for Naive Bayes and 0-A for XGBoost. However, it is important to note that the best fold for Naive Bayes, based solely on voting, still predicts only the majority class. When ranking folds by precision and recall, fold 1-A is clearly the best for Naive Bayes.
HSV features dominate the feature importance for this task but account for only 6 of the top 10 features, fewer than in the pig detection task, where 8 of the top 10 features were HSV-based. The remaining 4 features are evenly split between DCT and LBP, indicating a more balanced contribution of the feature types than in the pig detection task.


| Visual Results Swedish Cook | Accuracy   | Precision  | Recall     | F1        | ROC AUC   |
|----------------------------|------------|------------|------------|-----------|-----------|
| **Naive Bayes - Overall**  | 0.9010     | 0.0160     | 0.0754     | 0.0264    | 0.7492    |
| **XGBoost - Overall**      | 0.9638     | 0.4812     | 0.3904     | 0.3852    | 0.8128    |
| **Naive Bayes - Best Fold** | 0.9077    | 0.0000*    | 0.0000*    | 0.0000*   | 0.9156    |
| **Naive Bayes - Fold with highest Recall** | 0.9407  | 0.0670   | 0.3231   | 0.1110   | 0.8414    |
| **XGBoost - Best Fold**    | 0.9977     | 0.9479     | 0.8387     | 0.8900    | 0.9997    |


\* In its best fold, the Naive Bayes model for the Swedish Cook predicts only the majority class, so precision, recall and F1 are all zero.



## Hybrid classification
### Features
For the hybrid classification, we combine the previously discussed visual and audio features and utilize them within the nested cross-validation framework. This approach leverages the GPU-accelerated XGBoost algorithm.
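
Combining the two feature sets amounts to concatenating the per-frame feature vectors. The sketch below assumes that both feature sets are keyed by a shared frame index and that frames missing from either set are dropped; the function name and the data layout are hypothetical, and the alignment of audio windows to video frames is not shown.

```python
def merge_features(visual, audio):
    """Concatenate visual and audio feature vectors for frames present in both sets."""
    shared_frames = sorted(visual.keys() & audio.keys())
    return {frame: visual[frame] + audio[frame] for frame in shared_frames}


# Toy features: frame 2 has no audio features and is therefore dropped.
visual = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}
audio = {0: [1.0], 1: [2.0]}
hybrid = merge_features(visual, audio)
```

Keeping the visual features first and the audio features second in every vector makes it easy to trace each entry of a feature importance ranking back to its source.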

### Results
Similar to Similarity Modeling 1, the best and worst folds for both the Pigs and the Swedish Chef remain the same as in the visual classification. A direct comparison with the audio models is challenging, as missing annotations for the cook in many folds made it impossible to use our classification approach.

For the Pigs, the inclusion of audio features marginally improves the prediction of the minority class compared to the visual model. Across the folds, we observe reduced variability. However, for the worst-performing fold, despite an improved AUC, the precision decreases. As with Kermit, color features dominate the top 10 feature importance rankings, likely due to the distinctive pink color of the pigs.


| **Results Pigs - XGBoost**  | **Accuracy** | **Precision** | **Recall** | **F1**    | **ROC AUC** |
|-----------------------------|--------------|---------------|------------|-----------|-------------|
| **Audio - Overall**         | 0.9350       | 0.3381        | 0.0138     | 0.0220    | 0.6835      |
| **Visual - Overall**        | 0.7551       | 0.3932        | 0.1397     | 0.1906    | 0.6606      |
| **Hybrid - Overall**        | 0.7592       | 0.3950        | 0.1362     | 0.1888    | 0.6680      |
| **Audio - Best Fold\***       | 0.8827       | 0.7067        | 0.0234     | 0.0453    | 0.7012      |
| **Visual - Best Fold**      | 0.9410       | 0.8106        | 0.7284     | 0.7673    | 0.9744      |
| **Hybrid - Best Fold**      | 0.7962       | 0.9757        | 0.1935     | 0.3230    | 0.7612      |

\* Best Precision and Recall

For the Swedish Chef, the best and worst folds also remain unchanged compared to the visual model. The addition of audio features slightly worsens the prediction of the minority class, while the AUC for the best fold remains unchanged. However, the overall AUC across all folds declines. Here too, the top 10 most important features are dominated by visual features, emphasizing their critical role in classification.

| **Results Cook - XGBoost** | **Accuracy** | **Precision** | **Recall** | **F1**    | **ROC AUC** |
|----------------------------|--------------|---------------|------------|-----------|-------------|
| **Visual - Overall**       | 0.9638       | 0.4812        | 0.3904     | 0.3852    | 0.8128      |
| **Hybrid - Overall**       | 0.9636       | 0.4813        | 0.3761     | 0.3772    | 0.7719      |
| **Visual - Best Fold**     | 0.9977       | 0.9479        | 0.8387     | 0.8900    | 0.9997      |
| **Hybrid - Best Fold**     | 0.9972       | 0.9548        | 0.7788     | 0.8579    | 0.9997      |



## Challenges