## Similarity Modeling 1 Summary
**Parts of the notebook are the same as in Similarity Modeling 2 (general approach)**. 

**Authors: Sakka Mahmoud Abdussalem, Kravchenko Oleksandra**

This notebook summarizes our work distribution, our approach to the task, and our results.

## Timesheets
**Oleksandra's timesheet**
<table>
<thead>
  <tr>
    <th>Date</th>
    <th>Task</th>
    <th>Hours</th>

  </tr>
</thead>
<tbody>
  <tr>
    <td>24.10.2024</td>
    <td>Initial Team Meeting</td>
    <td>0.5</td>
  </tr>
    <tr>
    <td>01.12.2024</td>
    <td>Second Team Meeting - Discussion of extraction approach and work split</td>
    <td>0.5</td>
  </tr>
    <tr>
    <td>08.12.2024</td>
    <td>Third Team Meeting</td>
    <td>0.5</td>
  </tr>
</tbody>
</table>

**Mahmoud's timesheet:**

<table>
<thead>
  <tr>
    <th>Date</th>
    <th>Task</th>
    <th>Hours</th>

  </tr>
</thead>
<tbody>
  <tr>
    <td>24.10.2024</td>
    <td>Initial Team Meeting</td>
    <td>0.5</td>
  </tr>
  <tr>
    <td>01.12.2024</td>
    <td>Second Team Meeting - Discussion of extraction approach and work split</td>
    <td>0.5</td>
  </tr>
  <tr>
    <td>06.12.2024</td>
    <td>Initial exploration of material, implementation of trim_video.py and audio_extraction</td>
    <td>4</td>
  </tr>
  <tr>
    <td>07.12.2024</td>
    <td>Implementation of load_data.py</td>
    <td>2</td>
  </tr>
    <tr>
    <td>08.12.2024</td>
    <td>Third Team Meeting</td>
    <td>0.5</td>
  </tr>
</tbody>
</table>


## Work distribution
Since we both participated in both the Similarity Modeling 1 and 2 courses, we split the work so that each of us could experiment with all aspects of video analysis and classification. For Similarity Modeling 1, Mahmoud focused on the audio features and Oleksandra on the video features (the roles are reversed in Similarity Modeling 2). We both contributed to the hybrid approach by creating separate features and merging them. The analytical work was also shared and carried out through regular discussion and iterative improvement of all notebooks and functions. The timesheets give a clearer indication of the work distribution within the group. Helper functions for frame extraction, data loading, cross-validation, evaluation and so on were written in collaboration and refined for the specific needs that came up during the project. The summaries and analyses were mainly written by the authors of the respective notebooks; the same goes for the individual sections in this notebook.

## Approach overview
We focused on experimenting with multiple features mentioned in the Similarity Modeling 1 lectures and their variants. We brainstormed possible features and classification algorithms to cover a broad spectrum of the domain. During feature engineering, we inspected different features on excerpts of the data, tuned them, used them in the classification models, and refined the methods based on the evaluation results. We used a GitHub repository as a collaborative environment for the project.
Regarding the overall technical setup, we created a script `load_data.py` that extracts frames and audio tracks from the videos and organises them in folders on our local machines. The feature extraction functions then reference those files to create datasets, which are saved and reused to save time. Our classification models were trained and evaluated using either nested cross-validation or a holdout set, depending on the computational demands of the task.
Our train-test split accounts for the time-series nature of the data. We manually define split points that correspond to "fade-outs" in the data so that the videos are cut between scenes and the same scene never appears in both the train and the test split, which avoids data leakage and overfitting.
The evaluation was done using binary accuracy, precision, recall and F1 metrics, as well as the ROC curve. Since our data is highly imbalanced (the target characters are absent much more often than they are present), we found this evaluation method to be the most informative.
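
The scene-aware split described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the frame indices and the split points `[4, 9]` are hypothetical and are not taken from `load_data.py` or the project notebooks. Each frame is assigned to the segment between two manually chosen fade-out positions, and every segment is held out once so that no scene contributes frames to both sides of a split.

```python
from bisect import bisect_right


def assign_segments(frame_indices, split_points):
    """Map each frame index to the id of the segment it falls into.

    split_points are hypothetical, manually chosen frame indices of fade-outs.
    A frame at or after a split point belongs to the next segment.
    """
    boundaries = sorted(split_points)
    return [bisect_right(boundaries, idx) for idx in frame_indices]


def leave_one_segment_out(segment_ids):
    """Yield (held_out_segment, train_positions, test_positions) per segment."""
    for held_out in sorted(set(segment_ids)):
        train = [i for i, s in enumerate(segment_ids) if s != held_out]
        test = [i for i, s in enumerate(segment_ids) if s == held_out]
        yield held_out, train, test


# Toy example: 12 frames with fade-outs before frames 4 and 9.
frames = list(range(12))
segments = assign_segments(frames, split_points=[4, 9])
folds = list(leave_one_segment_out(segments))

for held_out, train, test in folds:
    print(f"segment {held_out}: {len(train)} train frames, {len(test)} test frames")
```

Splitting by segment rather than by random frame matters here because neighbouring frames of one scene are nearly identical, so a random split would let the model see near-copies of the test frames during training.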

Annotations fix: we reviewed annotations for GT Muppets-02-01-01 and GT Muppets 03-04-03 and found some inconsistencies that we corrected. 

## Classification using audio features
Details: `sim1_audio_analysis.ipynb`

### Features 
We used the following features: 

- **Zero-Crossing Rate (ZCR)**: Measures the rate at which the audio signal crosses the zero amplitude axis, capturing information about its frequency content. This could be useful for identifying distinct pitch or tonal qualities in the characters' voices, such as Kermit's high-pitched voice.
- **Loudness (Root Mean Square Energy)**: Computes the signal's overall energy, providing insights into its intensity. This could help differentiate characters based on the volume or energy of their speech or sound.
- **Rhythm Detection**: Employs autocorrelation to identify repeating patterns within frames, capturing rhythmic structures in the audio.
- **Mel-Frequency Cepstral Coefficients (MFCCs)**: Combines short-time Fourier transforms, Mel transformations, and discrete cosine transforms to capture spectral properties of the signal, particularly suited for identifying speech and tonal patterns.
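
To make the first three features concrete, here is a minimal NumPy sketch of frame-wise ZCR, RMS loudness and an autocorrelation-based periodicity estimate. The helper names, the frame and hop sizes and the synthetic 100 Hz tone are assumptions chosen for illustration and may differ from what `sim1_audio_analysis.ipynb` actually does. MFCCs are left out because they are usually computed with an audio library rather than by hand.

```python
import numpy as np


def frame_signal(signal, frame_length, hop_length):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    return np.stack([signal[s:s + frame_length] for s in starts])


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def rms_energy(frames):
    """Root mean square energy per frame, used here as a loudness proxy."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


def dominant_lag(frame, min_lag=1):
    """Lag in samples of the strongest autocorrelation value at or after min_lag."""
    centered = frame - frame.mean()
    acf = np.correlate(centered, centered, mode="full")[len(frame) - 1:]
    return min_lag + int(np.argmax(acf[min_lag:]))


# Synthetic test tone: 1 second of a 100 Hz sine sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t)

frames = frame_signal(tone, frame_length=800, hop_length=400)
zcr = zero_crossing_rate(frames)
rms = rms_energy(frames)
lag = dominant_lag(frames[0], min_lag=40)  # skip the trivial peak at lag 0

print(frames.shape, zcr[0], rms[0], lag)
```

For a pure tone the ZCR is roughly twice the frequency divided by the sampling rate, the RMS of a unit-amplitude sine is about 0.71, and the dominant lag sits near one period (80 samples here), which is a quick sanity check for this kind of feature code.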

### Results

For the detection of Kermit, XGBoost consistently outperformed both KNN and Logistic Regression in terms of accuracy and ROC AUC, particularly in the best-performing folds. Logistic Regression and KNN both struggled with class imbalance and variability across folds, highlighting their limited robustness in handling diverse data distributions. Across all models, fold 2-B emerged as the worst-performing fold, likely due to Kermit being dressed as Robin Hood in that part of the episode.
For the best-performing folds, KNN achieved its highest AUC on fold 1-A, while Logistic Regression and XGBoost both performed best on fold 0-B.
An analysis of the feature importances for the best-performing XGBoost fold underscores the positive impact of including MFCC (Mel-Frequency Cepstral Coefficients) features in the model. Among the top 10 most important features, only Loudness from the original feature set was included, further emphasizing the value of engineered features in enhancing classification performance.

| Audio Results Kermit      | Accuracy   | Precision  | Recall     | F1        | ROC AUC   |
|---------------------------|------------|------------|------------|-----------|-----------|
| **KNN - Overall**         | 0.6551     | 0.3835     | 0.3413     | 0.3358    | 0.5992    |
| **Logistic - Overall**    | 0.6767     | 0.4795     | 0.1919     | 0.1820    | 0.6533    |
| **XGBoost - Overall**     | 0.7153     | 0.5254     | 0.1874     | 0.2621    | 0.6593    |
| **KNN - Best Fold**       | 0.7218     | 0.3321     | 0.3931     | 0.3601    | 0.6413    |
| **Logistic - Best Fold**  | 0.8635     | 0.4054     | 0.0474     | 0.0848    | 0.7369    |
| **XGBoost - Best Fold**   | 0.8505     | 0.3755     | 0.1804     | 0.2437    | 0.7478    |


For Waldorf and Statler, XGBoost consistently outperformed both KNN and Logistic Regression in terms of accuracy and ROC AUC, particularly in the best-performing folds. However, all models struggled significantly with the extreme class imbalance, leading to poor recall for the minority class. This issue was even more pronounced than in the Kermit detection task. Across all three models, fold 0-A emerged as the worst-performing fold, likely because it contained the highest number of occurrences of Waldorf and Statler, further exacerbating the imbalance in the training data.
The class imbalance heavily skewed performance metrics toward the negative class (absence of Waldorf and Statler). The high accuracy values observed across all models primarily reflect their tendency to predict the majority class, which dominates the dataset. While fold 1-B showed the best AUC performance for KNN and Logistic Regression, the overall performance of Logistic Regression remained particularly poor, as it consistently defaulted to predicting the majority class for all folds. For XGBoost, fold 2-A achieved the highest AUC, but based on precision and recall metrics, fold 0-B would be the true best fold. Although fold 0-B had the lowest AUC, it performed comparably to other folds in the overall context, making it the most balanced fold in terms of recall and precision.
Similar to the Kermit detection task, the feature importance analysis for the best-performing XGBoost fold underscores the significance of the MFCC features, reaffirming their critical role in improving model performance. Interestingly, loudness remains one of the few features from the original feature set to make it into the top 10 most important features, further emphasizing its value.
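
The effect of the imbalance on accuracy can be reproduced with a few lines of plain Python. The numbers below are made up for illustration (1000 frames, 25 of them positive) and are not taken from our dataset. A classifier that always predicts "absent" reaches a high accuracy while its binary precision, recall and F1 for the positive class are all zero.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy plus precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced data: 25 positive frames out of 1000.
y_true = [1] * 25 + [0] * 975
always_absent = [0] * 1000

majority_baseline = binary_metrics(y_true, always_absent)
print(majority_baseline)
```

This is why the accuracy columns in the tables should be read together with the binary recall and F1 columns rather than on their own.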



| Audio Results W&S         | Accuracy   | Precision  | Recall     | F1        | ROC AUC   |
|---------------------------|------------|------------|------------|-----------|-----------|
| **KNN - Overall**         | 0.9716     | 0.1335     | 0.0531     | 0.0723    | 0.5807    |
| **Logistic - Overall**    | 0.9767     | 0.0000*    | 0.0000*    | 0.0000*   | 0.6913    |
| **XGBoost - Overall**     | 0.9766     | 0.3601     | 0.0116     | 0.0219    | 0.7460    |
| **KNN - Best Fold**       | 0.9785     | 0.9734     | 0.9785     | 0.9759    | 0.6109    |
| **Logistic - Best Fold**  | 0.9851     | 0.0000*    | 0.0000*    | 0.0000*   | 0.7411    |
| **Logistic - Fold with highest Recall** | 0.9792 | 0.6000 | 0.0376 | 0.0708 | 0.7207 |
| **XGBoost - Best Fold**   | 0.9912     | 0.5000     | 0.0058     | 0.0116    | 0.7816    |

\* Logistic Regression performs the worst as it cannot handle the class imbalance and always predicts the majority class.
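
A model can have zero precision and recall and still show a ROC AUC well above 0.5, because ROC AUC only depends on how the predicted scores rank positives against negatives, not on the 0.5 decision threshold. The sketch below illustrates this with a rank-based AUC in plain Python. The labels and scores are invented for the example and do not come from our models.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the rank-based (Mann-Whitney) form of ROC AUC.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: every probability is below 0.5, but positives tend to rank higher.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.25, 0.45]

y_pred = [int(s >= 0.5) for s in scores]  # default threshold: everything is "absent"
auc = roc_auc(y_true, scores)

print(sum(y_pred), round(auc, 4))
```

In such a situation, lowering the decision threshold or reweighting the classes would be natural things to try, since the scores already carry some ranking information.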


## Classification using video features
Details: `sim1_visual_analysis.ipynb`

### Features
We used the following features:  
- Dominant hues: We extract the 10 most dominant hues per frame. This could help to capture our targets whenever we observe their distinct color in a frame. 
- Contours:  
- Gray-Level Co-occurrence Matrix (GLCM): to describe _texture_.
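
As a rough illustration of two of these features, the following NumPy sketch ranks hue bins by frequency and builds a small GLCM with a contrast statistic. The function names, the number of hue bins and the toy inputs are assumptions made for this example and are not the implementation in `sim1_visual_analysis.ipynb`. Contours are omitted because they normally rely on an image-processing library.

```python
import numpy as np


def dominant_hues(hue, n_bins=36, top_k=10):
    """Return the indices of the top_k most frequent hue bins, most frequent first.

    hue is an array of hue values scaled to [0, 1).
    """
    counts, _ = np.histogram(hue, bins=n_bins, range=(0.0, 1.0))
    order = np.argsort(counts, kind="stable")[::-1]
    return order[:top_k]


def glcm(gray, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for pixel pairs offset by (dy, dx).

    gray must hold integer gray levels in [0, levels); offsets must be non-negative.
    """
    h, w = gray.shape
    src = gray[:h - dy, :w - dx]
    dst = gray[dy:, dx:]
    matrix = np.zeros((levels, levels), dtype=float)
    np.add.at(matrix, (src.ravel(), dst.ravel()), 1.0)
    return matrix / matrix.sum()


def glcm_contrast(p):
    """Contrast: expected squared gray-level difference of neighbouring pixels."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))


# Toy hue channel dominated by a green-ish hue around 0.33.
hue = np.array([0.33] * 6 + [0.0] * 2 + [0.6])
top = dominant_hues(hue, top_k=3)

# Toy 4x4 image with 4 gray levels, horizontal neighbour offset.
gray = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 2, 2, 2],
                 [2, 2, 3, 3]])
p = glcm(gray, levels=4)
contrast = glcm_contrast(p)

print(top.tolist(), round(contrast, 4))
```

A smooth texture concentrates the GLCM mass near the diagonal and yields a low contrast, while a busy texture spreads it out and raises the contrast.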

### Results
The Kermit classification task based on visual features turned out to be rather difficult. There is a significant class imbalance that influences the training and testing outcomes, as well as some instances of mislabeling in the annotations that impact the performance. Kermit also appears in a variety of scenes and among many different characters. This is a big difference from Statler and Waldorf, who are mostly shown in the same setting and are therefore easier to recognize.

The metrics in the tables below are "binary" to fairly demonstrate the issues with the models failing to predict the target classes. 

For the video classification task involving **Kermit**, XGBoost consistently delivered the best performance, particularly in terms of ROC AUC and F1 scores, with the best fold achieving a remarkable ROC AUC of 0.9744 and an F1 score of 0.7673. This demonstrates XGBoost's ability to effectively model the complex patterns required for Kermit detection, even in the face of significant class imbalance and variability in Kermit’s appearances. Models such as KNN and Logistic Regression struggled notably with low recall values, reflecting their challenges in detecting the minority class. The Decision Tree model showed moderate performance but fell short of XGBoost’s robustness and precision.

The class imbalance and annotation inconsistencies in the Kermit dataset were key challenges. Kermit's appearances span diverse settings with multiple characters, adding to the complexity of classification. XGBoost's superior performance can be attributed to its ability to leverage the engineered visual features (dominant hues, contours and GLCM texture descriptors) to capture the nuances necessary for accurate classification.


| Video Results Kermit         | Accuracy | Precision | Recall | F1     | ROC AUC |
|------------------------------|----------|-----------|--------|--------|---------|
| **KNN test set**             | 0.8383   | 0.7584    | 0.0457 | 0.0862 | 0.52    |
| **LogReg test set**          | 0.7358   | 0.3102    | 0.4808 | 0.3771 | 0.63    |
| **Decision Tree test set**   | 0.6854   | 0.27      | 0.5189 | 0.3551 | 0.62    |
| **XGBoost - CV Overall**     | 0.7435   | 0.6739    | 0.4086 | 0.4677 | 0.7956  |
| **XGBoost - CV Best Fold**   | 0.9410   | 0.8106    | 0.7284 | 0.7673 | 0.9744  |

For the video classification of **Waldorf and Statler** (W&S), XGBoost once again stood out, achieving near-perfect performance metrics in some folds, with the best fold producing a ROC AUC of 0.9993 and an F1 score of 0.799. Logistic Regression also performed well, achieving an F1 score of 0.8658 and a strong ROC AUC of 0.89, indicating its ability to effectively classify W&S, who appear predominantly in consistent settings. KNN, while achieving high accuracy, suffered from low recall, reflecting its limitations in detecting the minority class.

Class imbalance was a significant issue for both tasks, although its effect was less pronounced for W&S due to their relatively consistent appearances. These results highlight the importance of robust feature engineering and model selection in tackling challenging video classification tasks.


| Video Results W&S            | Accuracy | Precision | Recall | F1     | ROC AUC |
|------------------------------|----------|-----------|--------|--------|---------|
| **KNN test set**             | 0.9721   | 0.8551    | 0.0523 | 0.0986 | 0.5992  |
| **LogReg test set**          | 0.9929   | 0.9672    | 0.7837 | 0.8658 | 0.89    |
| **Decision Tree test set**   | 0.9845   | 0.7110    | 0.7872 | 0.7472 | 0.89    |
| **XGBoost - CV Overall**     | 0.9846   | 0.9536    | 0.6485 | 0.7348 | 0.966   |
| **XGBoost - CV Best Fold**   | 0.9917   | 0.9811    | 0.6739 | 0.799  | 0.9993  |

## Hybrid classification
### Features
### Results
For the detection of Kermit, we observe results similar to those from the visual classification. Fold 0-B remains the best-performing fold, where the model enhanced with audio features identifies the minority class slightly more often. As expected, fold 2-B is the worst-performing fold, showing a decline in performance compared to the purely visual model. However, when compared to the audio-only model, the hybrid approach significantly outperforms it across all folds. 

An analysis of the feature importances for the best-performing fold reveals that the top 10 most influential features are predominantly color features (8 out of 10), which is unsurprising given Kermit's distinctive green color. This highlights the critical role of visual features in identifying Kermit within the dataset.


| Results Kermit - XGBoost      | Accuracy   | Precision  | Recall     | F1        | ROC AUC   |
|---------------------------|------------|------------|------------|-----------|-----------|
| **Audio - Overall**      | 0.7153     | 0.5254     | 0.1874     | 0.2621    | 0.6593    |
| **Visual - Overall**     | 0.7435       | 0.6739        | 0.4086     | 0.4677    | 0.7955      |
| **Hybrid - Overall**      | 0.7403       | 0.6549        | 0.4235     | 0.4665    | 0.7878      |
| **Audio - Best Fold**    | 0.8505     | 0.3755     | 0.1804     | 0.2437    | 0.7478    |
| **Visual - Best Fold**    | 0.9410       | 0.8106        | 0.7284     | 0.7673    | 0.9744      |
| **Hybrid - Best Fold**    | 0.9487       | 0.7987        | 0.8239     | 0.8111    | 0.9827      |


For Waldorf and Statler, we observe results that are nearly identical to those of the visual model. Fold 0-B remains the worst-performing fold, while fold 2-A continues to perform the best. Similarly, the performance of the hybrid model (which is effectively driven by visual features) is significantly better than that of the audio-only model.
This observation is further supported by the feature importance analysis, where the top 10 features are entirely dominated by the principal components of the visual features, emphasizing the critical role of visual data in detecting Waldorf and Statler.
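
The combination of principal components of the visual features with audio features can be sketched as below. This is a hedged illustration only: the array shapes, the random placeholder data, the number of components and the helper names are assumptions, and the sketch presumes that audio and visual features are already aligned row by row (one row per frame). The projection is fitted on the training rows only and then reused for the test rows, in line with the leakage concerns of our split.

```python
import numpy as np


def pca_fit(x, n_components):
    """Fit PCA via SVD and return (feature means, top principal directions)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(x, mean, components):
    """Project rows of x onto the fitted principal directions."""
    return (x - mean) @ components.T


rng = np.random.default_rng(0)

# Placeholder feature matrices: 8 visual and 3 audio features per frame.
visual_train = rng.normal(size=(50, 8))
audio_train = rng.normal(size=(50, 3))
visual_test = rng.normal(size=(10, 8))
audio_test = rng.normal(size=(10, 3))

# Fit the projection on training frames only, then apply it to both splits.
mean, comps = pca_fit(visual_train, n_components=4)
hybrid_train = np.hstack([pca_transform(visual_train, mean, comps), audio_train])
hybrid_test = np.hstack([pca_transform(visual_test, mean, comps), audio_test])

print(hybrid_train.shape, hybrid_test.shape)
```

The resulting hybrid matrices can be passed to any of the classifiers discussed above. The column order (visual components first, audio features last) is what makes a later feature importance ranking attributable to one modality or the other.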

| **Results W & S - XGBoost**  | **Accuracy** | **Precision** | **Recall** | **F1**    | **ROC AUC** |
|-------------------------------|--------------|---------------|------------|-----------|-------------|
| **Audio - Overall**           | 0.9766       | 0.3601        | 0.0116     | 0.0219    | 0.7460      |
| **Visual - Overall**           | 0.9846       | 0.9536        | 0.6485     | 0.7348    | 0.9660      |
| **Hybrid - Overall**          | 0.9841       | 0.9543        | 0.6200     | 0.7186    | 0.9670      |
| **Audio - Best Fold**         | 0.9912       | 0.5000        | 0.0058     | 0.0116    | 0.7816      |
| **Visual - Best Fold**         | 0.9917       | 0.9811        | 0.6739     | 0.7990    | 0.9993      |
| **Hybrid - Best Fold**        | 0.9917       | 0.9751        | 0.6760     | 0.7985    | 0.9994      |


## Overall conclusion & Challenges