# Visual Feature Analyses SIM 2 - Detecting Pigs and the Swedish Chef

This notebook contains the visual feature analysis for the Similarity Modeling 2 project. The objective is to detect the presence of any pig and of the Swedish Chef in a frame using visual features. These features are extracted using the methods outlined below. For the classification task, we employ a Naive Bayes algorithm evaluated with nested cross-validation. As in Similarity Modeling 1, we also train a gradient-boosted tree model (currently XGBoost, with LightGBM planned) in order to compare its performance across all three subtasks (visual, audio, visual + audio).

In [1]:
import numpy as np
import pandas as pd

from scripts.load_data import check_and_load
from scripts.extract_video_features import extract_lbp_features, extract_dct_features, extract_hsv_features
from scripts.nested_cv import partition_feature_df

import os
from skimage.feature import local_binary_pattern
from skimage.io import imread
from skimage.color import rgb2gray, rgb2hsv
from tqdm import tqdm
from pathlib import Path
from scipy.fftpack import dct
from skimage.util import img_as_ubyte


## Data Loading

In [2]:
# # Define paths
# data_path = "../ground_truth_data/trimmed_videos"
# frames_output_dir = "../ground_truth_data/trimmed_videos/frames"
# audio_output_dir = "../ground_truth_data/trimmed_videos/audio"
# annotations_path = "../ground_truth_data/trimmed_videos"

data_path = "../ground_truth_data"
frames_output_dir = "../ground_truth_data/frames"
audio_output_dir = "../ground_truth_data/audio"
annotations_path = "../ground_truth_data"

muppet_files = {
    "Muppets-02-01-01.avi": "GroundTruth_Muppets-02-01-01.csv",
    "Muppets-02-04-04.avi": "GroundTruth_Muppets-02-04-04.csv",
    "Muppets-03-04-03.avi": "GroundTruth_Muppets-03-04-03.csv",
}

In [3]:
annotations, audio_data, frames = check_and_load(data_path, frames_output_dir, audio_output_dir, annotations_path, muppet_files)

Frames and audio are already extracted.
Loading audio segments...
Loaded 3 audio files.
Loaded audio segments for 3 videos.
Loaded frames for 3 videos.
Number of videos with frames: 3
Video 0 has 38681 frames.
Video 1 has 38706 frames.
Video 2 has 38498 frames.


## Feature Engineering

This section outlines the extraction of visual features from the video frames for the classification task. The goal is to transform raw image data into a structured, fixed-length representation that captures the texture, frequency, and color properties needed to detect the pigs and the Swedish Chef in the videos.


### Local Binary Patterns (LBP)
LBP is a texture descriptor that captures local patterns by encoding pixel intensity relationships within a 3x3 neighborhood. It is particularly useful for identifying repetitive structures like textures or edges. We first convert the frames to grayscale to reduce computational complexity. Next, a circular LBP operator is applied to analyze a pixel's neighborhood and generate a binary pattern. Finally, we calculate a histogram of the LBP values to summarize the texture distribution for each frame.
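The three steps above (grayscale conversion, LBP operator, histogram) can be sketched as follows. This is a minimal illustration, not the code of `extract_lbp_features`: the function name `lbp_histogram` and the settings `n_points=8`, `radius=1`, and `method="uniform"` are assumptions, chosen because uniform LBP with 8 neighbors yields 10 distinct codes, which matches the 10 `lbp_bin_*` columns.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern
from skimage.util import img_as_ubyte


def lbp_histogram(frame_rgb, n_points=8, radius=1):
    """Return a normalized uniform-LBP histogram for one RGB frame (hypothetical helper)."""
    # Convert to an 8-bit grayscale image before computing LBP codes
    gray = img_as_ubyte(rgb2gray(frame_rgb))
    codes = local_binary_pattern(gray, n_points, radius, method="uniform")
    # Uniform LBP produces codes 0 .. n_points + 1, i.e. n_points + 2 bins
    n_bins = n_points + 2
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / hist.sum()


rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))  # random stand-in for a real video frame
features = lbp_histogram(frame)
print(features.shape)
```

Under these assumed settings, a completely uniform frame (for example a black frame) places all of its mass in bin 8, because every neighbor is equal to the center pixel.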

In [4]:
#lbp_features_df = extract_lbp_features(frames) # function saves df to ../model_vars/sim2_video/lbp_feature_df.csv - change parameter output_path if needed

# load features if already computed
lbp_features_df = pd.read_csv("../model_vars/sim2_video/lbp_feature_df.csv")

lbp_features_df.shape

(115885, 12)

TODO add vis

### Discrete Cosine Transform (DCT)
DCT captures frequency-domain information by decomposing an image into components of varying spatial frequency. It is commonly used in image compression and feature extraction because most of the signal energy is concentrated in a few low-frequency coefficients, which allows the dimensionality to be reduced while retaining the critical information. As with LBP, the frames are first converted to grayscale; a 2D DCT is then applied to each frame, and only the top-left (low-frequency) coefficients are retained. These coefficients represent the dominant structural information in the image and ignore fine details.
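A minimal sketch of this procedure is shown below. It is not the code of `extract_dct_features`: the helper name `dct_low_freq`, the `norm="ortho"` setting, and the 8x8 block size are assumptions. The block size is chosen because 8 x 8 = 64 coefficients would match the 64 `dct_coeff_*` columns implied by the feature table shape.

```python
import numpy as np
from scipy.fftpack import dct
from skimage.color import rgb2gray


def dct_low_freq(frame_rgb, block=8):
    """Return the top-left block x block DCT coefficients of a frame (hypothetical helper)."""
    gray = rgb2gray(frame_rgb)
    # 2D DCT-II: apply the 1D transform along axis 0, then along axis 1
    coeffs = dct(dct(gray, axis=0, norm="ortho"), axis=1, norm="ortho")
    # The top-left corner holds the lowest spatial frequencies
    return coeffs[:block, :block].ravel()


rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))  # random stand-in for a real video frame
features = dct_low_freq(frame)
print(features.shape)
```

For a frame of constant brightness, only the first (DC) coefficient is non-zero, which illustrates why the low-frequency corner summarizes coarse structure.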

In [5]:
#dct_features_df = extract_dct_features(frames)

#load features if already computed
dct_features_df = pd.read_csv('../model_vars/sim2_video/dct_feature_df.csv')

dct_features_df.shape

(115885, 66)

TODO add vis

### HSV Color Histograms - supplemental
Although this technique was covered in Similarity Modeling 1, we reuse it here to exploit the characteristic colors of the characters. HSV (Hue, Saturation, Value) separates chromatic information (hue and saturation) from intensity (value), which makes it effective for distinguishing objects by color, such as the pink of the pigs. We convert the frames to the HSV color space and compute a histogram for each channel (H, S, and V), capturing the distribution of colors in the image.
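The per-channel histogram can be sketched as follows. The helper name `hsv_histogram`, the choice of 16 bins per channel, and `density=True` are assumptions rather than the confirmed settings of `extract_hsv_features`. They are chosen because 3 x 16 = 48 matches the `hsv_channel_*_bin_*` columns and because density-normalized bins can exceed 1, as the feature values in this notebook do.

```python
import numpy as np
from skimage.color import rgb2hsv


def hsv_histogram(frame_rgb, bins=16):
    """Return concatenated H, S, V histograms for one RGB frame (hypothetical helper)."""
    hsv = rgb2hsv(frame_rgb)  # all three channels are scaled to [0, 1]
    hists = [
        np.histogram(hsv[..., ch], bins=bins, range=(0.0, 1.0), density=True)[0]
        for ch in range(3)
    ]
    return np.concatenate(hists)


rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))  # random stand-in for a real video frame
features = hsv_histogram(frame)
print(features.shape)
```

With `density=True` and a bin width of 1/16, the bins of each channel sum to 16, so a single bin can take any value between 0 and 16.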

In [6]:
# Extract Color Histogram features
#hsv_feature_df = extract_hsv_features(frames)

#load features if already computed
hsv_feature_df = pd.read_csv('../model_vars/sim2_video/hsv_feature_df.csv')
 
hsv_feature_df.shape

(115885, 50)

TODO add vis

## Model Prep

In [7]:
# Merge DataFrames on 'video_idx' and 'frame_idx'
#merged_df = lbp_features_df.merge(dct_features_df, on=["video_idx", "frame_idx"], how="inner")
#merged_df = merged_df.merge(hsv_feature_df, on=["video_idx", "frame_idx"], how="inner")

output_path_merged = "../model_vars/sim2_video/merged_df.csv"
#merged_df.to_csv(output_path_merged, index=False)

merged_df = pd.read_csv(output_path_merged)
merged_df.shape

(115885, 124)

In [None]:
# Create a mapping from filenames to video indices
video_idx_map = {filename: idx for idx, filename in enumerate(muppet_files.keys())}

# Prepare ground truth data with corrected video_idx
ground_truth_data = []
for video_filename, annotation_df in annotations.items():
    video_idx = video_idx_map[video_filename]  # Map video filename to its index
    for _, row in annotation_df.iterrows():
        ground_truth_data.append({
            'video_idx': video_idx,  # Use mapped video index
            'frame_idx': row['Frame_number'],  # Assuming Frame_number exists
            'Pigs': row['Pigs'],  # Assuming Pigs is a column in the annotation
            'Cook': row['Cook']  # Assuming this column exists
        })

# Create a DataFrame for ground truth
ground_truth_df = pd.DataFrame(ground_truth_data)
ground_truth_df.shape

(115885, 4)
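The row-by-row loop above works, but `iterrows` can be slow for roughly 116k annotation rows. A vectorized alternative is sketched below on toy data. The toy `annotations` and `muppet_files` dictionaries are stand-ins for the notebook variables and assume the same column names (`Frame_number`, `Pigs`, `Cook`).

```python
import pandas as pd

# Toy stand-ins for the notebook's `muppet_files` and `annotations`
muppet_files = {"a.avi": "a.csv", "b.avi": "b.csv"}
annotations = {
    "a.avi": pd.DataFrame({"Frame_number": [0, 1], "Pigs": [0, 1], "Cook": [0, 0]}),
    "b.avi": pd.DataFrame({"Frame_number": [0, 1], "Pigs": [1, 1], "Cook": [1, 0]}),
}

# Map each video filename to its position in `muppet_files`
video_idx_map = {name: idx for idx, name in enumerate(muppet_files)}

# Build one labelled frame table per video and stack them
ground_truth_df = pd.concat(
    [
        df[["Frame_number", "Pigs", "Cook"]]
        .rename(columns={"Frame_number": "frame_idx"})
        .assign(video_idx=video_idx_map[name])
        for name, df in annotations.items()
    ],
    ignore_index=True,
)[["video_idx", "frame_idx", "Pigs", "Cook"]]

print(ground_truth_df.shape)
```

The result has the same four columns as the loop-based version, so the subsequent merge on `video_idx` and `frame_idx` would not need to change.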

In [9]:
# Merge features with ground truth
feature_df = pd.merge(merged_df, ground_truth_df, on=['video_idx', 'frame_idx'], how='inner')
feature_df.shape

(115885, 126)

In [13]:
# split_points = {
#     0: 19716,  # Video 0
#     1: 19719,  # Video 1
#     2: 19432, # Video 2 
# }

# Assuming feature_df is the dataframe containing video_idx and frame_idx columns
grp_by = ['Pigs', 'Cook']
feature_df, split_overview = partition_feature_df(feature_df, grp_by = grp_by)

output_path_feature = "../model_vars/sim2_video/feature_df.csv"
feature_df.to_csv(output_path_feature, index=False)

#feature_df = pd.read_csv(output_path_feature)
feature_df.shape

(115885, 127)

In [12]:
feature_df.head()

Unnamed: 0,lbp_bin_0,lbp_bin_1,lbp_bin_2,lbp_bin_3,lbp_bin_4,lbp_bin_5,lbp_bin_6,lbp_bin_7,lbp_bin_8,lbp_bin_9,...,hsv_channel_2_bin_9,hsv_channel_2_bin_10,hsv_channel_2_bin_11,hsv_channel_2_bin_12,hsv_channel_2_bin_13,hsv_channel_2_bin_14,hsv_channel_2_bin_15,Pigs,Cook,fold
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0-A
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0-A
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0-A
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0-A
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0-A


In [14]:
# Display the results
print(split_overview)

feature_df['fold'].unique() # CAUTION: some classes are missing in some splits (0-B has no Pigs, 1-B has no Cook)

   video_idx fold  Pigs  Cook
0          0  0-A  1871   217
1          0  0-B     0  1654
2          1  1-A  4012   226
3          1  1-B  4769     0
4          2  2-A  5134   683
5          2  2-B  5694   665


array(['0-A', '0-B', '1-A', '1-B', '2-A', '2-B'], dtype=object)

In [8]:
# Select only numeric columns
numeric_df = feature_df.select_dtypes(include=[np.number])


print("NaN in any feature:", feature_df.isnull().values.any())
# Check for infinite values in numeric columns
print("Inf in any numeric feature:", np.isinf(numeric_df.values).any())

# Optionally, find rows with infinite values
rows_with_inf = numeric_df[np.isinf(numeric_df).any(axis=1)].index.tolist()
print("Rows with Inf values:", rows_with_inf)

NaN in any feature: False
Inf in any numeric feature: False
Rows with Inf values: []


# Visual Classification

In [1]:
import numpy as np
import pandas as pd

from scripts.nested_cv import evaluate_model, nested_cross_validation
import psutil
import os

from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

physical_cores = psutil.cpu_count(logical=False)
print(f"Number of physical cores: {physical_cores}")

os.environ["LOKY_MAX_CPU_COUNT"] = "12"

Number of physical cores: 12


In [2]:
feature_df = pd.read_csv('../model_vars/sim2_video/feature_df.csv')


train_cols = [col for col in feature_df.columns if col.startswith(('lbp', 'hsv', 'dct'))]

In [4]:
feature_df.describe()

Unnamed: 0,lbp_bin_0,lbp_bin_1,lbp_bin_2,lbp_bin_3,lbp_bin_4,lbp_bin_5,lbp_bin_6,lbp_bin_7,lbp_bin_8,lbp_bin_9,...,hsv_channel_2_bin_8,hsv_channel_2_bin_9,hsv_channel_2_bin_10,hsv_channel_2_bin_11,hsv_channel_2_bin_12,hsv_channel_2_bin_13,hsv_channel_2_bin_14,hsv_channel_2_bin_15,Pigs,Cook
count,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,...,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0,115885.0
mean,0.010907,0.038826,0.02046,0.085142,0.151472,0.230323,0.054135,0.047366,0.311475,0.049892,...,1.072261,0.953754,0.828437,0.712243,0.527304,0.338847,0.221629,0.203214,0.185356,0.029728
std,0.003762,0.011804,0.007544,0.023469,0.055371,0.049187,0.012952,0.011772,0.13752,0.013403,...,0.700995,0.902687,0.793313,0.822851,0.663433,0.460252,0.331332,0.443309,0.388588,0.169836
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.089293,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.008638,0.032616,0.015663,0.071089,0.114698,0.207644,0.047393,0.041565,0.210585,0.043814,...,0.598243,0.433292,0.297752,0.208333,0.115266,0.064257,0.033783,0.023061,0.0,0.0
50%,0.010828,0.038689,0.02045,0.08753,0.158467,0.231,0.05556,0.048019,0.283731,0.049599,...,0.955637,0.771098,0.596925,0.417279,0.281822,0.186111,0.11489,0.086969,0.0,0.0
75%,0.013261,0.044996,0.02538,0.100518,0.194563,0.253705,0.062576,0.053682,0.375763,0.056,...,1.350476,1.172585,0.9721,0.858088,0.646432,0.42165,0.268791,0.205025,0.0,0.0
max,0.042604,0.222249,0.049828,0.146134,0.271931,0.460948,0.090126,0.216194,1.0,0.356191,...,7.688644,11.281291,6.124551,7.989216,4.902741,3.509804,4.298856,15.792565,1.0,1.0


In [5]:
print(train_cols)
len(train_cols)

['lbp_bin_0', 'lbp_bin_1', 'lbp_bin_2', 'lbp_bin_3', 'lbp_bin_4', 'lbp_bin_5', 'lbp_bin_6', 'lbp_bin_7', 'lbp_bin_8', 'lbp_bin_9', 'dct_coeff_0', 'dct_coeff_1', 'dct_coeff_2', 'dct_coeff_3', 'dct_coeff_4', 'dct_coeff_5', 'dct_coeff_6', 'dct_coeff_7', 'dct_coeff_8', 'dct_coeff_9', 'dct_coeff_10', 'dct_coeff_11', 'dct_coeff_12', 'dct_coeff_13', 'dct_coeff_14', 'dct_coeff_15', 'dct_coeff_16', 'dct_coeff_17', 'dct_coeff_18', 'dct_coeff_19', 'dct_coeff_20', 'dct_coeff_21', 'dct_coeff_22', 'dct_coeff_23', 'dct_coeff_24', 'dct_coeff_25', 'dct_coeff_26', 'dct_coeff_27', 'dct_coeff_28', 'dct_coeff_29', 'dct_coeff_30', 'dct_coeff_31', 'dct_coeff_32', 'dct_coeff_33', 'dct_coeff_34', 'dct_coeff_35', 'dct_coeff_36', 'dct_coeff_37', 'dct_coeff_38', 'dct_coeff_39', 'dct_coeff_40', 'dct_coeff_41', 'dct_coeff_42', 'dct_coeff_43', 'dct_coeff_44', 'dct_coeff_45', 'dct_coeff_46', 'dct_coeff_47', 'dct_coeff_48', 'dct_coeff_49', 'dct_coeff_50', 'dct_coeff_51', 'dct_coeff_52', 'dct_coeff_53', 'dct_coeff_54',

122

## Train-Test-Split Approach

In this analysis, we employ a nested cross-validation approach for our classification models. The nested cross-validation provides robust model evaluation by incorporating two levels of data splitting: an outer loop for testing model generalization and an inner loop for hyperparameter tuning. The outer loop ensures that the performance metrics reflect how well the model generalizes to entirely unseen data, while the inner loop systematically optimizes the model's parameters using the training data.

A traditional random train-test split could lead to data leakage, because near-identical consecutive frames from the same scene would appear in both the training and test sets. This overlap could inflate performance metrics by allowing the model to memorize scene-specific features rather than learn generalizable patterns. Our nested cross-validation mitigates this risk by using contiguous episode segments as folds, so each outer split holds out a whole segment rather than individual frames, providing a more realistic estimate of the model's ability to generalize to unseen material.
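The two-level scheme can be sketched with scikit-learn's group-aware splitters. This is an illustrative sketch on synthetic data, not the implementation in `scripts.nested_cv`: the function `nested_cv_sketch`, the `var_smoothing` grid, and the `f1_weighted` inner scoring are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.naive_bayes import GaussianNB


def nested_cv_sketch(X, y, groups, estimator, param_grid):
    """Outer loop: hold out one fold. Inner loop: tune on the remaining folds."""
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        X_tr, y_tr, g_tr = X[train_idx], y[train_idx], groups[train_idx]
        # The inner search only ever sees the outer training folds
        search = GridSearchCV(
            estimator, param_grid, cv=LeaveOneGroupOut(), scoring="f1_weighted"
        )
        search.fit(X_tr, y_tr, groups=g_tr)
        held_out = str(groups[test_idx][0])
        scores[held_out] = accuracy_score(y[test_idx], search.predict(X[test_idx]))
    return scores


# Synthetic data: 4 folds of 50 frames each, label driven by the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)
groups = np.repeat(np.array(["0-A", "0-B", "1-A", "1-B"]), 50)

scores = nested_cv_sketch(X, y, groups, GaussianNB(), {"var_smoothing": [1e-9, 1e-7]})
print(sorted(scores))
```

Passing the fold labels as `groups` keeps every frame of a fold on the same side of each split, in both the outer and the inner loop.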

For the creation of the folds, each episode is split at its midway point, specifically at the transition where a segment ends and the screen briefly fades to black before the next segment begins. This results in two distinct folds per episode, ensuring that each fold captures a separate and coherent part of the episode.
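The fold labels (`0-A`, `0-B`, and so on) can be derived from one split frame per video, as in the sketch below. The helper `assign_folds` and the toy split points are hypothetical; the actual logic lives in `partition_feature_df`, whose internals are not shown in this notebook.

```python
import pandas as pd


def assign_folds(df, split_points):
    """Label frames '<video_idx>-A' before the video's split point, else '<video_idx>-B'."""
    cut = df["video_idx"].map(split_points)
    half = (df["frame_idx"] < cut).map({True: "A", False: "B"})
    return df.assign(fold=df["video_idx"].astype(str) + "-" + half)


# Toy data: two videos with four frames each
toy = pd.DataFrame({
    "video_idx": [0, 0, 0, 0, 1, 1, 1, 1],
    "frame_idx": [0, 1, 2, 3, 0, 1, 2, 3],
})
folded = assign_folds(toy, split_points={0: 2, 1: 3})
print(folded["fold"].tolist())
# ['0-A', '0-A', '0-B', '0-B', '1-A', '1-A', '1-A', '1-B']
```

Splitting at a fade to black keeps each fold a coherent segment and avoids cutting through a scene.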

## Naive Bayes

### Pigs

In [23]:
from sklearn.naive_bayes import GaussianNB

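# GaussianNB is run with its default settings, so the parameter grid is empty
param_grid = {}

# Exclude fold 0-B, which contains no Pigs frames (see split_overview)
feature_df_pigs = feature_df[feature_df['fold'] != "0-B"]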

target_col = 'Pigs'
results_nb_pigs, summary_nb_pigs, best_model_nb_pigs = nested_cross_validation(
    feature_df_pigs, 
    train_cols, 
    target_col, 
    GaussianNB, 
    param_grid, 
    num_cores=10
)

## Save Vars
import pickle
with open('../model_vars/sim2_video/pigs_nb_results.pkl', 'wb') as f:
    pickle.dump({
        'results_pigs': results_nb_pigs,
        'summary_pigs': summary_nb_pigs,
        'best_model_pigs': best_model_nb_pigs
    }, f)



Metrics for Fold 0-A: {'outer_fold': '0-A', 'accuracy': 0.5234327449786975, 'precision': np.float64(0.8973781262834788), 'recall': np.float64(0.5234327449786975), 'f1': np.float64(0.611644589366598), 'roc_auc': np.float64(0.8487717815939532)}
Metrics for Fold 1-A: {'outer_fold': '1-A', 'accuracy': 0.6932907348242812, 'precision': np.float64(0.7736492540280212), 'recall': np.float64(0.6932907348242812), 'f1': np.float64(0.7188984654563997), 'roc_auc': np.float64(0.7155395959571467)}
Metrics for Fold 1-B: {'outer_fold': '1-B', 'accuracy': 0.6257965976720914, 'precision': np.float64(0.6393746689500284), 'recall': np.float64(0.6257965976720914), 'f1': np.float64(0.6321678309054406), 'roc_auc': np.float64(0.6427215747031788)}
Metrics for Fold 2-A: {'outer_fold': '2-A', 'accuracy': 0.49639769452449567, 'precision': np.float64(0.6944319709785657), 'recall': np.float64(0.49639769452449567), 'f1': np.float64(0.5133877640961363), 'roc_auc': np.float64(0.7358976928458588)}
Metrics for Fold 2-B: {

### Swedish Chef

In [24]:
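# GaussianNB is run with its default settings, so the parameter grid is empty
param_grid = {}

# Exclude fold 1-B, which contains no Cook frames (see split_overview)
feature_df_cook = feature_df[feature_df['fold'] != "1-B"]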

target_col = 'Cook'
results_nb_cook, summary_nb_cook, best_model_nb_cook = nested_cross_validation(
    feature_df_cook, 
    train_cols, 
    target_col, 
    GaussianNB, 
    param_grid, 
    num_cores=10
)


## Save Vars
import pickle
with open('../model_vars/sim2_video/cook_nb_results.pkl', 'wb') as f:
    pickle.dump({
        'results_cook': results_nb_cook,
        'summary_cook': summary_nb_cook,
        'best_model_cook': best_model_nb_cook
    }, f)

Metrics for Fold 0-A: {'outer_fold': '0-A', 'accuracy': 0.9831101643335363, 'precision': np.float64(0.9780441374663073), 'recall': np.float64(0.9831101643335363), 'f1': np.float64(0.9805706076543964), 'roc_auc': np.float64(0.7417421618927403)}
Metrics for Fold 0-B: {'outer_fold': '0-B', 'accuracy': 0.9077247561297126, 'precision': np.float64(0.8327745642779932), 'recall': np.float64(0.9077247561297126), 'f1': np.float64(0.8686358901802906), 'roc_auc': np.float64(0.9156154738580365)}
Metrics for Fold 1-A: {'outer_fold': '1-A', 'accuracy': 0.9406663623916021, 'precision': np.float64(0.9811876741594294), 'recall': np.float64(0.9406663623916021), 'f1': np.float64(0.9594712948128516), 'roc_auc': np.float64(0.8414041527954895)}
Metrics for Fold 2-A: {'outer_fold': '2-A', 'accuracy': 0.8222519555372582, 'precision': np.float64(0.9277316281077067), 'recall': np.float64(0.8222519555372582), 'f1': np.float64(0.8712773306592072), 'roc_auc': np.float64(0.6326220463334423)}
Metrics for Fold 2-B: {'

## XGBoost (TODO: switch to LightGBM)

In [5]:
import torch
from scripts.nested_cv import ncv_xgb_gpu

if torch.cuda.is_available():
    print(f"CUDA is available. GPU: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available.")


CUDA is available. GPU: NVIDIA GeForce RTX 4070 Ti SUPER


In [6]:
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1]
}


# Reload the features and select the training columns
feature_df = pd.read_csv('../model_vars/sim2_video/feature_df.csv')
train_cols = [col for col in feature_df.columns if col.startswith(('lbp', 'hsv', 'dct'))]

# Exclude fold 0-B, which contains no Pigs frames (see split_overview)
feature_df_pigs = feature_df[feature_df['fold'] != "0-B"]

target_col = 'Pigs'
results_xgb_pigs, summary_xgb_pigs, best_models_xgb_pigs = ncv_xgb_gpu(
    feature_df=feature_df_pigs,
    train_cols=train_cols,
    target_col=target_col,
    param_grid=param_grid
)

Outer Fold: 0-A
Outer Fold: 1-A
Outer Fold: 1-B
Outer Fold: 2-A
Outer Fold: 2-B
Model: 6/6
Summary of Metrics Across Folds:
{'accuracy': 0.7550947403613891, 'precision': 0.7129216497936853, 'recall': 0.7550947403613891, 'f1': 0.7087152111030657, 'roc_auc': 0.6606317907768406}


In [None]:

## Save Vars
import pickle
# File name follows the pattern of pigs_nb_results.pkl in the same directory
with open('../model_vars/sim2_video/pigs_xgb_results.pkl', 'wb') as f:
    pickle.dump({
        'results_pigs': results_xgb_pigs,
        'summary_pigs': summary_xgb_pigs,
        'best_model_pigs': best_models_xgb_pigs
    }, f)


# print("Results:", results_xgb_pigs)
# print("Summary:", summary_xgb_pigs)


## Evaluation
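
In the fold metrics printed above, `recall` equals `accuracy` to every digit. That is what support-weighted averaging produces, although whether `evaluate_model` actually uses `average="weighted"` is an assumption. With strongly imbalanced labels (only about 3% of frames are labelled `Cook`), such averaged scores can look high even when the positive class is rarely found. The toy example below, which uses made-up labels rather than project data, illustrates the effect.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 95 negative frames, 5 positive frames
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a model that always predicts "absent"

acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")
positive_recall = recall_score(y_true, y_pred, pos_label=1)

print(acc, weighted_recall, positive_recall)
```

Here the weighted recall matches the accuracy while the recall of the positive class is zero. Reporting per-class precision and recall, for example with the already imported `classification_report`, may therefore give a clearer picture than the weighted summaries alone.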