# Results

This notebook contains the detailed analysis of all experiments in this bachelor project. It queries the evaluation data from W&B and presents it in `pd.DataFrame`s (which are later compiled into LaTeX tables) and in figures that are saved to the directory `report/figures`.

In [None]:
import sys
sys.path.append("../src")
import os
import string
import itertools

# retrieve data about runs
import wandb

# plotting and analysis
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import torch
from torch.utils.data import DataLoader

from config import BASEPATH
from defaults import DEFAULT
from utils import show_images
from transform import ImageTransform, VideoTransform
from data import ImageDataset, VideoDataset
from model import ImageClassifier, VideoClassifier

In [None]:
pd.set_option('display.float_format','{:.2f}'.format)
sns.set_style("darkgrid")

WANDB_PROJECT = "mikasenghaas/bsc-2"

In [None]:
image_config = DEFAULT["resnet18"]
video_config = DEFAULT["r2plus1d_18"]

image_transform = ImageTransform(**image_config["transform"])
video_transform = VideoTransform(**video_config["transform"])
image_data = ImageDataset(**image_config["dataset"], split="train", transform=image_transform)
video_data = VideoDataset(**video_config["dataset"], split="train", transform=video_transform)

assert image_data.class2id == video_data.class2id
class2id = image_data.class2id
id2class = image_data.id2class

In [None]:
# get all runs
api = wandb.Api()
all_runs = api.runs(WANDB_PROJECT)

In [None]:
# helper
def render_latex(df):
    # work on a copy so the caller's frame is not modified
    df = df.copy()

    # capitalise col names (snake_case -> "Snake Case")
    df.columns = [' '.join(map(lambda x: x[0].upper() + x[1:], col.split('_'))) for col in df.columns]
    
    # format df: bold the column-wise maximum, two decimals
    s = df.style.highlight_max(props='bfseries: ;')
    s.format(precision=2)
    
    # render latex
    opts = {"hrules": True, "position": "h"}
    return s.to_latex(**opts)

def capitalise(s):
    return s[0].upper() + s[1:]
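
The header clean-up inside `render_latex` can be looked at in isolation. The following is a minimal sketch, assuming snake_case column names; `to_header` is a hypothetical helper that is not part of the project code. Unlike the lambda above, it skips empty segments, so repeated underscores do not raise an `IndexError`.

```python
def to_header(col: str) -> str:
    """Turn a snake_case column name into a capitalised header."""
    # capitalise only the first character of each word, keep the rest as-is
    words = [w[0].upper() + w[1:] for w in col.split("_") if w]
    return " ".join(words)


# hypothetical column names, used only for illustration
headers = [to_header(c) for c in ["test_top1_acc", "flops", "inference__latency"]]
print(headers)
```

Column names that already contain spaces, such as `Test Top-1 Acc`, pass through unchanged because they contain no underscore.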

# Experiment Results

This experiment compares a wide variety of image and video classifiers. The classifiers tested are freely available on the PyTorch Hub. The following tables give an overview of all models. We first select the runs that trained models in the first experiment.

In [None]:
models = list(DEFAULT.keys())

runs = [run for run in all_runs if run.name in models and run.state == "finished"]

print(f"There are {len(runs)} runs in Experiment 1 ({', '.join([run.name for run in runs])})")
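
The filter above keeps a run only if its name matches a configured model and its state is `finished`. A minimal offline sketch of the same logic, assuming stand-in objects instead of real W&B runs (`fake_runs`, `model_names` and `select_runs` are hypothetical names), also shows how to spot models that were trained more than once:

```python
from collections import Counter
from types import SimpleNamespace

# hypothetical stand-ins for the run objects returned by the W&B API
fake_runs = [
    SimpleNamespace(name="resnet18", state="finished"),
    SimpleNamespace(name="resnet18", state="crashed"),
    SimpleNamespace(name="debug-run", state="finished"),
    SimpleNamespace(name="r2plus1d_18", state="finished"),
]
model_names = {"resnet18", "r2plus1d_18"}


def select_runs(runs, names):
    """Keep finished runs whose name is one of the configured models."""
    return [r for r in runs if r.name in names and r.state == "finished"]


selected = select_runs(fake_runs, model_names)
# a model with several finished runs would show up here with a count > 1
duplicates = [n for n, c in Counter(r.name for r in selected).items() if c > 1]
print([r.name for r in selected], duplicates)
```

If `duplicates` is non-empty, the tables below would contain several rows for the same model, so it is worth checking before building them.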

## Configuration

Let's look at the training configuration of each model. The setup is separated into the following categories, with the following keys:

- General: Model Name, Description, Link
- Model: TorchHub Link, TorchHub Identifier, Pretrained, Num Classes 
- Data: Clip Duration (Video)
- Loader: Batch Size, Shuffle
- Optim: LR, Weight Decay
- Trainer: Device, Epochs



In [None]:
# general information
general = pd.DataFrame([dict(run.config['general']) for run in runs])
general.loc[3, "name"] = "ViT B-16"
general.set_index("name", inplace=True)
general["type"] = general["type"].apply({"image": "Single-Frame", "video": "Video"}.get)
general

In [None]:
# model information
model = pd.DataFrame([dict(run.config['model'], **{'name': run.config['general']['name']}) for run in runs]).set_index("name")
model

In [None]:
# data information
data = pd.DataFrame([dict(run.config['dataset'], **{'name': run.config['general']['name']}) for run in runs]).set_index("name")
data

In [None]:
# loader information
loader = pd.DataFrame([dict(run.config['loader'], **{'name': run.config['general']['name']}) for run in runs]).set_index("name")
loader

In [None]:
# optim information
optim = pd.DataFrame([dict(run.config['optim'], **{'name': run.config['general']['name']}) for run in runs]).set_index("name")
optim

In [None]:
# trainer information
trainer = pd.DataFrame([dict(run.config['trainer'], **{'name': run.config['general']['name']}) for run in runs]).set_index("name")
trainer
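
The five configuration cells above repeat one pattern: take one section of each run's config, attach the model name, and index by it. A minimal sketch of that pattern as a function, assuming the nested layout of `run.config` seen above; `config_table` and the example values are hypothetical.

```python
import pandas as pd


def config_table(configs, section):
    """Build one row per run from a single section of the run configs."""
    rows = [dict(cfg[section], name=cfg["general"]["name"]) for cfg in configs]
    return pd.DataFrame(rows).set_index("name")


# hypothetical configs mirroring the nested layout of `run.config`
configs = [
    {"general": {"name": "ResNet18"}, "optim": {"lr": 1e-4, "weight_decay": 0.0}},
    {"general": {"name": "R(2+1)D"}, "optim": {"lr": 1e-4, "weight_decay": 0.0}},
]
optim_table = config_table(configs, "optim")
print(optim_table)
```

With real runs, the call would look like `config_table([run.config for run in runs], "optim")`.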

Nothing surprising here. All hyperparameters are at the default values for the image and video classifiers. We can render the tables to LaTeX to put them into the report.

## Results

The results section plots and saves all relevant figures that compare the performance of the different models. We start by querying the summary statistics that were computed by the `eval.py` script and then synced to W&B.

In [None]:
# convert wandb summary to df
all_summary = pd.DataFrame([dict({k: v for k, v in run.summary.items() if not k.startswith("_")}, **{"Name": run.config['general']['name']}) for run in runs])
all_summary.loc[3, "Name"] = "ViT B-16"
all_summary.set_index("Name", inplace=True)
all_summary.columns

This is a lot of columns. Let's organise them into three categories:
- Machine Specifications
- Benchmarks (Model Size, FLOPs, Throughput, Latency)
- Performance (Top1-Acc, Top3-Acc, Macro F1)

In [None]:
# get machine info for benchmarking
machine_info_cols = {
    "device": "Device",
    "machine_info_system_system": "System",
    "machine_info_system_release": "Release",
    "machine_info_cpu_model": "CPU Model",
    "machine_info_cpu_cores_physical": "CPU Cores",
    "machine_info_cpu_frequency": "CPU Frequency",
    "machine_info_gpus": "GPU",
    "machine_info_memory_total": "Memory"}

benchmark_cols = {
    "params": "Params",
    "flops": "FLOPs",
    "frames_seen": "Frames Seen",
    "timing_batch_size_1_on_device_inference_metrics_seconds_per_batch_mean": "Inference Time (Mean)",
    "timing_batch_size_1_on_device_inference_metrics_batches_per_second_mean": "Inference Throughput (Mean)",
    "timing_batch_size_1_on_device_inference_human_readable_batch_latency": "Inference Latency",
    "timing_batch_size_1_on_device_inference_human_readable_batches_per_second": "Inference Throughput",
    }

performance_cols = {
    "train/best_acc": "Train Top-1 Acc (Best)",
    "val/best_acc": "Val Top-1 Acc (Best)",
    "test_top1_acc": "Test Top-1 Acc",
    "test_top3_acc": "Test Top-3 Acc",
    "test_macro_f1": "Test Macro F1",
    "test_conf_matrix": "Confusion Matrix",
    }

machine_info = all_summary[machine_info_cols.keys()].rename(columns=machine_info_cols)
benchmark = all_summary[benchmark_cols.keys()].rename(columns=benchmark_cols)
performance = all_summary[performance_cols.keys()].rename(columns=performance_cols)

# different units
benchmark["Params"] = benchmark["Params"] / 1e6 # million
benchmark["FLOPs"] = benchmark["FLOPs"] / 1e9 # gigaflops
benchmark["Frames Seen"] = benchmark["Frames Seen"] / 1e3 # 1k frames
benchmark["Inference Time (Mean)"] = benchmark["Inference Time (Mean)"] * 1e3 # to ms

# different units
performance["Train Top-1 Acc (Best)"] = performance["Train Top-1 Acc (Best)"] * 100 # %
performance["Val Top-1 Acc (Best)"] = performance["Val Top-1 Acc (Best)"] * 100 # %
performance["Test Top-1 Acc"] = performance["Test Top-1 Acc"] * 100 # %
performance["Test Top-3 Acc"] = performance["Test Top-3 Acc"] * 100 # %
performance["Test Macro F1"] = performance["Test Macro F1"] * 100 # %

# add model type to benchmark
benchmark["Model Type"] = general["type"]
performance["Model Type"] = general["type"]
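
The unit conversions above can also be written as a single column-to-factor mapping. A minimal sketch under assumed toy values (`raw` and `factors` are hypothetical and not the logged results): multiplying a DataFrame by a Series aligns the Series index with the column names.

```python
import pandas as pd

# hypothetical raw values in base units, used only for illustration
raw = pd.DataFrame(
    {"Params": [11_000_000, 31_000_000], "FLOPs": [2e9, 4e10], "Acc": [0.5, 0.75]},
    index=["model_a", "model_b"],
)

# multiplicative factor per column: millions, GFLOPs, percent
factors = {"Params": 1e-6, "FLOPs": 1e-9, "Acc": 100}

# the Series index is matched against the column names, so each column
# is scaled by its own factor
scaled = raw[list(factors)] * pd.Series(factors)
print(scaled)
```

Keeping the factors in one dictionary makes it easy to see which unit each reported column uses.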

### Machine Specification

These are the same for all models, so we can query the specifications for any model. The benchmarking was performed on the HPC at ITU, on a CPU with 20 cores at 3.3GHz and 256GB of memory.

In [None]:
machine_info.iloc[0]

### Benchmarking

In [None]:
benchmark

### Performance


In [None]:
performance

## Performance and Efficiency (LaTeX)

In [None]:
tradeoff = pd.concat([performance, benchmark], axis=1)
tradeoff = tradeoff.drop("Model Type", axis=1)
tradeoff["Model Type"] = general["type"]

In [None]:
tradeoff_cols = [
    "Test Top-1 Acc",
    "Test Top-3 Acc",
    "Test Macro F1",
    "FLOPs",
    "Inference Latency",
    "Inference Throughput",
]
print(render_latex(tradeoff[tradeoff_cols]))

Let's plot the test performance metrics for all models. We first have to pivot the table, so that each metric is in a separate row.

In [None]:
pivoted_performance = performance.reset_index()[["Name", "Test Top-1 Acc", "Test Top-3 Acc", "Test Macro F1", "Model Type"]]
pivoted_performance = pivoted_performance.melt(id_vars=["Name", "Model Type"],
         value_vars=["Test Top-1 Acc", "Test Top-3 Acc", "Test Macro F1"],
         var_name='Metric Type', value_name='Metric Value')

# order by model type
pivoted_performance = pivoted_performance.sort_values(by=["Model Type", "Metric Value"])
pivoted_performance
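
To make the reshaping step concrete, here is a minimal sketch of `melt` on a toy table. The scores are hypothetical and only serve to show how a wide table with one column per metric becomes a long table with one row per (model, metric) pair.

```python
import pandas as pd

# hypothetical scores, not the project's results
wide = pd.DataFrame(
    {"Name": ["a", "b"], "Top-1": [50.0, 60.0], "Top-3": [80.0, 90.0]}
)

long = wide.melt(
    id_vars=["Name"], value_vars=["Top-1", "Top-3"],
    var_name="Metric Type", value_name="Metric Value",
)
print(long)
```

The long format is what `sns.barplot` needs in order to use the metric as `hue`.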

In [None]:
# grouped bar chart of the three test metrics for each model
# order = ["MobileNet V3 Small", "EfficientNet V2 Small", "ViT B-16", "ResNet50", "AlexNet", "GoogleNet", "DenseNet121", "ResNet18", "X3D S", "R(2+1)D 18"]
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(
    data=pivoted_performance,
    x="Name", y="Metric Value", hue="Metric Type",
    palette="Greens", width=0.7, ax=ax
);

# styling
ax.set_ylim(0, 100) # set ylim max to 100
ax.set_xticklabels(ax.get_xticklabels(), rotation=30) # rotate x ticks
ax.set_xlabel("Model Name", fontweight="bold");
ax.set_ylabel("Performance Metric (%)", fontweight="bold");
ax.axvline(7.5, color="gray", linestyle="--") # separate single-frame and video models

# for metric, color in zip(["Test Macro F1", "Test Top-1 Acc", "Test Top-3 Acc"], ["#CAE4C6", "#7DBA7F", "#00460A"]):
    # best_performance = performance[metric].max()
    # ax.axhline(best_performance, color=color, linewidth=0.8, linestyle="--")

fig.savefig(os.path.join(BASEPATH, "report", "figures", "performance-metrics.png"), dpi=300, bbox_inches="tight")

In [None]:
# pairwise scatter plots of the three test metrics
fig, axs = plt.subplots(ncols=3, figsize=(7*3, 6))
metrics = ["Test Top-1 Acc", "Test Top-3 Acc", "Test Macro F1"]
combinations = list(itertools.combinations(metrics, 2))
for pair, ax in zip(combinations, axs):
    x, y = pair
    sns.scatterplot(
        data=performance,
        x=x, y=y, hue=performance.index, style="Model Type",
        palette="Dark2", s=120, ax=ax
    );

    # styling
    ax.set_xlabel(f"{x} (%)", fontweight="bold");
    ax.set_ylabel(f"{y} (%)", fontweight="bold");

fig.savefig(os.path.join(BASEPATH, "report", "figures", "performance-metrics-scatter.png"), dpi=300, bbox_inches="tight")

In [None]:
# let's investigate the difference between top-1 and top-3 acc
performance["Test Top-3 Acc - Test Top-1 Acc"] = performance["Test Top-3 Acc"] - performance["Test Top-1 Acc"]
fig, ax = plt.subplots(figsize=(8, 6))
sns.barplot(
    data=performance,
    x=performance.index, y="Test Top-3 Acc - Test Top-1 Acc",
    palette="Dark2", ax=ax
)
ax.set_xlabel("Model Name", fontweight="bold");
ax.set_ylabel("Performance Gain (Top-3 - Top-1 Acc.)", fontweight="bold");
ax.set_xticklabels(ax.get_xticklabels(), rotation=20); # rotate x ticks

We see the following:
- ResNet18 is the best image classifier (71.39% test top-1 acc) and ResNet18-LSTM is the best video classifier (72.19%)
- MobileNetV3 struggled to learn anything meaningful (only 29% top-1 acc)
- Stacking an RNN/LSTM layer does not seem to affect the performance much (slight increase/decrease for ResNet, significant decrease for AlexNet)
- Top-3 acc is higher than top-1 acc (it can never be lower) -> if the model is wrong, it is often almost right
- Macro F1 is generally worse than top-1 acc -> the models focus on getting the high-resource classes right
- A larger model does not mean better performance here: ResNet50 is double the size but does not perform better, and the same holds for AlexNet. However, MobileNetV3 is really small and struggles to learn the true relationship
- Throughput decreases with a recurrent layer (fewer samples per second), which makes these models less attractive
- All models are valid in terms of throughput. Many can predict at a rate of at least 30 FPS
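
The relationship between the three metrics in the list above can be reproduced on a toy example. This is a minimal sketch with hypothetical scores, not the evaluation code of the project (`topk_accuracy` and `macro_f1` are illustrative names): one majority class dominates, the single minority-class sample is ranked second, so top-2 accuracy is perfect while macro F1 falls below top-1 accuracy.

```python
import numpy as np


def topk_accuracy(scores, y_true, k):
    """Fraction of samples whose true class is among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean((topk == y_true[:, None]).any(axis=1)))


def macro_f1(y_pred, y_true, num_classes):
    """Unweighted mean of per-class F1 scores.

    Assumption: a class that never occurs in labels or predictions counts
    as 0; libraries differ in how they treat this case.
    """
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1s))


# hypothetical data: class 0 is the majority class
y_true = np.array([0, 0, 0, 0, 1, 2])
scores = np.array([
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.90, 0.06, 0.04],
    [0.50, 0.40, 0.10],  # true class 1 is only ranked second
    [0.10, 0.20, 0.70],
])
y_pred = scores.argmax(axis=1)

top1 = topk_accuracy(scores, y_true, k=1)
top2 = topk_accuracy(scores, y_true, k=2)
f1 = macro_f1(y_pred, y_true, num_classes=3)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}, macro F1: {f1:.3f}")
```

Missing the single sample of class 1 costs one sixth of the accuracy but a full third of the macro F1, which is why macro F1 is the more sensitive metric for low-resource classes.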


## Performance/ Efficiency Trade-Off

In [None]:
# scatter of top-1 acc vs inference time and inference throughput
fig, axs = plt.subplots(ncols=2, figsize=(7*2, 6))
metrics = ["Inference Time (Mean)", "Inference Throughput (Mean)"]
units = ["ms", "Preds/s"]

# one panel per (metric, unit) pair
for i, ((x, unit), ax) in enumerate(zip(zip(metrics, units), axs)):
    sns.scatterplot(
        data=tradeoff,
        x=x, y="Test Top-1 Acc", hue=tradeoff.index, style="Model Type",
        palette="tab10", s=120, ax=ax
    );

    # styling
    ax.set_xlabel(f"{x} ({unit})", fontweight="bold", fontsize=13);
    if i % 2 == 0:
        ax.set_ylabel("Test Top-1 Acc (%)", fontweight="bold", fontsize=13);
    else:
        ax.legend_.remove()
        ax.set_ylabel(""); 

fig.savefig(os.path.join(BASEPATH, "report", "figures", "performance-efficiency-tradeoff-scatter.png"), dpi=300, bbox_inches="tight")
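
Latency and throughput at batch size 1 are two views of the same measurement. A minimal sketch of the conversion and of a real-time check against a 30 FPS target, assuming hypothetical latencies (`throughput_from_latency` is an illustrative helper, not part of the benchmark tooling):

```python
def throughput_from_latency(latency_ms: float, batch_size: int = 1) -> float:
    """Predictions per second given the latency per batch in milliseconds."""
    return batch_size * 1000.0 / latency_ms


# hypothetical latencies in ms, not measured values
latencies = {"model_a": 20.0, "model_b": 50.0}
target_fps = 30

realtime = {}
for name, ms in latencies.items():
    fps = throughput_from_latency(ms)
    realtime[name] = fps >= target_fps
    print(f"{name}: {fps:.1f} preds/s, meets {target_fps} FPS: {realtime[name]}")
```

If the logged mean throughput is an average of per-batch rates, it need not equal the reciprocal of the mean latency exactly, which may explain small differences between the two panels.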

## Confusion Matrix

Let's look at the confusion matrix of each of the models. The confusion matrix is in the original summary dataframe.

In [None]:
# the class names are too long, let's encode them as letters
classes = class2id.keys()
letters = [c for c in string.ascii_uppercase[:len(classes)]]
class2letter = [{"Class": k, "Encoding": v} for k, v in zip(classes, letters)]
class2letter

The confusion matrix was computed using `torcheval.metrics.multiclass_confusion_matrix(y_pred, y_true)`. From the docs we read:

> Compute multi-class confusion matrix, a matrix of dimension num_classes x num_classes where each element at position (i,j) is the number of examples with true class i that were predicted to be class j. See also binary_confusion_matrix

Reading this, the **rows** denote the **true** class and the columns denote the **predicted** class.
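
The row/column convention can be checked on a small example. This is a minimal NumPy sketch with hypothetical labels, not the `torcheval` implementation; it also shows the row-wise (recall) and column-wise (precision) normalisation, guarded against classes that do not occur.

```python
import numpy as np


def confusion_matrix(y_pred, y_true, num_classes):
    """Entry (i, j) counts samples with true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # increments repeated index pairs correctly
    return cm


# hypothetical labels for three classes
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
cm = confusion_matrix(y_pred, y_true, num_classes=3)

# normalise by row (recall) and by column (precision); entries stay 0
# where a class never occurs as a label or as a prediction
row_sums = cm.sum(axis=1, keepdims=True)
col_sums = cm.sum(axis=0, keepdims=True)
recall = np.divide(cm, row_sums, out=np.zeros(cm.shape), where=row_sums > 0)
precision = np.divide(cm, col_sums, out=np.zeros(cm.shape), where=col_sums > 0)
print(cm)
```

A class that is absent from the test split produces an all-zero row, which appears as a hole in the main diagonal of the heatmap.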

In [None]:
performance

In [None]:
import matplotlib.patches as patches

classes = class2id.keys()
letters = [c for c in string.ascii_uppercase[:len(classes)]]
class2letter = {k: v for k, v in zip(classes, letters)}
models = ["ResNet18", "R(2+1)D"]
for i, model in enumerate(models):
  record = performance.loc[model, :]
  model_type = record["Model Type"]
  conf_matrix = np.array(record["Confusion Matrix"])
  if model_type == "Single-Frame":
    conf_matrix = conf_matrix / 1000 # convert to thousands
  # recall_conf_matrix = (conf_matrix.T / conf_matrix.sum(1)).T # normalise by row, show recall
  # precision_conf_matrix = conf_matrix / conf_matrix.sum(0) # normalise by col, show precision

  conf_matrix = np.around(conf_matrix, 1)
  conf_matrix = pd.DataFrame(conf_matrix, index=letters, columns=letters)

  fig, ax = plt.subplots(figsize=(7, 6))
  sns.heatmap(
    conf_matrix, annot=True, linewidths=.5,
    cmap="Greens", annot_kws={"size": 7}, ax=ax
  )
  ax.set_xlabel("Predicted", fontweight="bold")
  ax.set_ylabel("Actual", fontweight="bold")

  corridors = patches.Rectangle((10, 10), 3, 3, linewidth=1, edgecolor='gray', facecolor='none')
  libraries = patches.Rectangle((3, 3), 3, 3, linewidth=1, edgecolor='gray', facecolor='none')
  ax.add_patch(corridors)
  ax.add_patch(libraries)

  # add top axis
  topax = ax.secondary_xaxis("top")
  topax.set_xticks([5.5, 15.5], ["First Floor", "Ground Floor"], minor=True)
  topax.set_xticks([10], [""], minor=False)
  topax.tick_params(which='minor', width=0)
  topax.tick_params(which='major', width=1, length=20)
  
  fig.savefig(os.path.join(BASEPATH, "report", "figures", f"conf-matrix-{model.replace(' ', '_').lower()}.png"), bbox_inches="tight")

In [None]:

import matplotlib.patches as patches

for i, model in enumerate(models):
  record = performance.loc[model, :]
  model_type = record["Model Type"]
  conf_matrix = np.array(record["Confusion Matrix"])
  if model_type == "Single-Frame":
    conf_matrix = conf_matrix / 1000 # convert to thousands
  # recall_conf_matrix = (conf_matrix.T / conf_matrix.sum(1)).T # normalise by row, show recall
  # precision_conf_matrix = conf_matrix / conf_matrix.sum(0) # normalise by col, show precision

  conf_matrix = np.around(conf_matrix, 1)
  conf_matrix = conf_matrix[[0, 2, 1], [0, 2, 1]]
  conf_matrix = pd.DataFrame(conf_matrix, index=reversed(classes), columns=reversed(classes))

  fig, ax = plt.subplots(figsize=(7, 6))
  sns.heatmap(
    conf_matrix, annot=True, linewidths=.5,
    cmap="Greens", annot_kws={"size": 7}, ax=ax
  )
  ax.set_xlabel("Predicted", fontweight="bold")
  ax.set_ylabel("Actual", fontweight="bold")

  corridors = patches.Rectangle((10, 10), 3, 3, linewidth=1, edgecolor='gray', facecolor='none')
  libraries = patches.Rectangle((3, 3), 3, 3, linewidth=1, edgecolor='gray', facecolor='none')
  ax.add_patch(corridors)
  ax.add_patch(libraries)

  # add top axis
  topax = ax.secondary_xaxis("top")
  topax.set_xticks([5.5, 15.5], ["First Floor", "Ground Floor"], minor=True)
  topax.set_xticks([10], [""], minor=False)
  topax.tick_params(which='minor', width=0)
  topax.tick_params(which='major', width=1, length=20)

We can see the following:
- There are holes in the main diagonal even for good models, because the test split does not cover all classes (and the matrices show raw counts, not normalised values)
- ResNet18 performs best overall, with a clearly highlighted main diagonal.
  - Base: 
  - RNN:
  - LSTM:
- ResNet50 is similar to ResNet18
- MobileNet only ever predicts ground floor classes, which is odd.

- All models struggle to differentiate GF Corridor 2 (K) and Atrium (M) -> they predict the majority class Atrium (M)
- All models struggle to differentiate the different libraries (D, E, F) -> Library 1 is often predicted as Library 2, and precision is low for Library 3
- All models struggle to differentiate Corridor 1 (L) and 2 (M) -> basically a random choice

## Mispredicted Samples

To get a better feel for the characteristics of the two best-performing models (the best image classifier and the best video classifier), I will load 10 mispredicted samples for each model and look at the three most confident predictions for each sample.

First, we need to get the 10 mispredicted instances for both models. For this we need to initialise and load the weights for both models.

In [None]:
best_performing_image, image_version = "resnet18", "v11"
best_performing_video, video_version = "r2plus1d_18", "v0"

image_config = DEFAULT[best_performing_image]
video_config = DEFAULT[best_performing_video]

image_transform = ImageTransform(**image_config["transform"])
video_transform = VideoTransform(**video_config["transform"])

image_test_data = ImageDataset(**image_config["dataset"], split="test", transform=image_transform)
video_test_data = VideoDataset(**video_config["dataset"], split="test", transform=video_transform)

image_test_loader = DataLoader(image_test_data, batch_size=1, shuffle=True)
video_test_loader = DataLoader(video_test_data, batch_size=1)

In [None]:
# initialise model architectures
image_clf = ImageClassifier(**image_config["model"])
video_clf = VideoClassifier(**video_config["model"])

In [None]:
# load weights
def download_artifact(model, version, project="bsc-2"):
  api = wandb.Api()
  artifact_path = f"mikasenghaas/{project}/{model}:{version}"
  artifact = api.artifact(artifact_path, type="model")

  filepath = os.path.join(BASEPATH, "artifacts", f"{model}:{version}")
  artifact.download(root=filepath)
  
  model_path = os.path.join(filepath, f"{model}.pt")
  return model_path
    
image_path = download_artifact(best_performing_image, image_version)
video_path = download_artifact(best_performing_video, video_version)

In [None]:
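# map to CPU so the checkpoints also load on machines without a GPU
image_clf.load_state_dict(torch.load(image_path, map_location="cpu"))
video_clf.load_state_dict(torch.load(video_path, map_location="cpu"))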

image_clf.eval()
video_clf.eval()

print("Loaded Weights")

Let's make an example prediction on one frame and one video clip from the test set to get an impression of how to use the models.

In [None]:
# make prediction on first frame
from utils import show_image

frame, label_id = next(iter(image_test_loader))
logits = image_clf(frame)
probs = torch.nn.functional.softmax(logits, dim=1)
prob, pred = torch.max(probs, dim=1)

show_image(frame.squeeze(), title=f"Pred: {id2class[pred.item()]} ({round(prob.item()*100,2)}%)\nTrue: {id2class[label_id.item()]}", unnormalise=True, show=True)

In [None]:
# make prediction on first clip
import io
import ipywidgets as widgets

sample = next(iter(video_test_loader))
video = sample["video"]
label = sample["label"]

logits = video_clf(video)
probs = torch.nn.functional.softmax(logits, dim=1)
prob, pred = torch.max(probs, dim=1)

def display_video(video, title, config):
  mean = np.array(config["transform"]["mean"])
  std = np.array(config["transform"]["std"])
  video = video.permute(1,0,2,3)
  video_widget = widgets.Image(format='jpeg')

  # display the widget
  display(video_widget)
  for frame in video:
    img = plt.imshow(np.array(((frame * std[:, None, None] + mean[:, None, None]) * 255.0).permute(1,2,0), dtype=np.uint8))
    plt.title(title)
    buffer = io.BytesIO()
    plt.savefig(buffer, format='jpeg')
    buffer.seek(0)
    
    video_widget.value = buffer.getvalue()

display_video(video.squeeze(0), f"Pred: {id2class[pred.item()]} ({round(prob.item() * 100, 2)}%)\nTrue: {label[0]}", video_config) 

Cool! Let's gather ten mispredicted samples for each model.

In [None]:
# mispredicted samples for image classifier
mispredicted_frames = []
with torch.no_grad(): # inference only, no gradients needed
  for x, y in image_test_loader:
    logits = image_clf(x)
    probs = torch.nn.functional.softmax(logits, dim=1)
    prob, pred = torch.max(probs, dim=1)
    argsorted = torch.argsort(probs, dim=1)
    top3_preds = argsorted[0][-3:] # top 3 class ids, ascending by probability
    top3_probs = probs[0][top3_preds] # matching top 3 probs
    if pred.item() != y.item():
      mispredicted_frames.append({
        "frame": x,
        "pred": [id2class[p.item()] for p in top3_preds],
        "prob": [p.item() for p in top3_probs],
        "true": id2class[y.item()]
        })
    if len(mispredicted_frames) == 10: # stop after ten mispredictions
      break

In [None]:
# mispredicted samples for video classifier
mispredicted_clips = []
with torch.no_grad(): # inference only, no gradients needed
  for sample in video_test_loader:
    x = sample["video"]
    y = sample["label"][0]
    logits = video_clf(x)
    probs = torch.nn.functional.softmax(logits, dim=1)
    prob, pred = torch.max(probs, dim=1)
    argsorted = torch.argsort(probs, dim=1)
    top3_preds = argsorted[0][-3:] # top 3 class ids, ascending by probability
    top3_probs = probs[0][top3_preds] # matching top 3 probs
    if pred.item() != class2id[y]:
      mispredicted_clips.append({
        "clip": x,
        "pred": [id2class[p.item()] for p in top3_preds],
        "prob": [p.item() for p in top3_probs],
        "true": y
        })
    if len(mispredicted_clips) == 10: # stop after ten mispredictions
      break
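
The softmax and top-3 step used in both loops can be sketched without the models. This is a minimal NumPy version with hypothetical logits (`softmax` and `top_k` are illustrative helpers); it returns the classes best first, whereas the loops above store them in ascending order and reverse them when plotting.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def top_k(probs, k=3):
    """Return (class ids, probabilities) of the k most likely classes, best first."""
    ids = np.argsort(probs)[::-1][:k]
    return ids, probs[ids]


logits = np.array([2.0, 0.5, 1.0, -1.0, 3.0])  # hypothetical logits for 5 classes
probs = softmax(logits)
ids, top_probs = top_k(probs, k=3)
print(ids, top_probs.round(3))
```

In PyTorch, `torch.topk(probs, k=3, dim=1)` returns the values and indices in descending order in one call, which would avoid the manual `argsort` and slicing.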

Let's have a look at the mispredicted samples

In [None]:
frames = []
titles = []
for frame in mispredicted_frames[:3]: # one sample per subplot
  frames.append(frame["frame"])
  title = ""
  # predictions are stored in ascending order, so reverse to list the best first
  for rank, (pred, prob) in enumerate(zip(reversed(frame["pred"]), reversed(frame["prob"])), start=1):
    title += f"{rank}: {pred} ({round(prob * 100, 2)}%)\n"
  title += f"T: {frame['true']}"
  titles.append(title)

fig, axs = plt.subplots(1, 3, figsize=(15,5))
# flatten 2d array
axs = axs.flatten()
mean = np.array(image_config["transform"]["mean"])
std = np.array(image_config["transform"]["std"])
for ax, img, txt in zip(axs, frames, titles):
  # unnormalise
  img = img.squeeze()
  img = img * std[:, None, None] + mean[:, None, None]

  ax.imshow((img.permute(1,2,0) * 255.0).numpy().astype(np.uint8))
  ax.plot([], [], ' ', label=txt)
  ax.legend(loc="lower left", fontsize=12)
  ax.axis('off')

fig.tight_layout(pad=0.5)
fig.savefig(os.path.join(BASEPATH, "report", "figures", "resnet18-mispredicted-frames.png"), dpi=300)
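
The un-normalisation used before plotting can be written as a small function. This is a minimal NumPy sketch, assuming a `(C, H, W)` image and per-channel statistics; `unnormalise` is a hypothetical helper and the statistics below are the common ImageNet values, used only for illustration. Clipping before the cast guards against values slightly outside `[0, 1]`, which would otherwise wrap around when converted to `uint8`.

```python
import numpy as np


def unnormalise(img, mean, std):
    """Invert channel-wise normalisation of a (C, H, W) image; return (H, W, C) uint8."""
    mean = np.asarray(mean)[:, None, None]
    std = np.asarray(std)[:, None, None]
    restored = img * std + mean             # back to the [0, 1] range
    restored = np.clip(restored, 0.0, 1.0)  # guard against rounding overshoot
    return (restored.transpose(1, 2, 0) * 255.0).round().astype(np.uint8)


# hypothetical statistics, assumed for illustration only
mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
original = np.random.default_rng(0).random((3, 4, 4))
normalised = (original - np.asarray(mean)[:, None, None]) / np.asarray(std)[:, None, None]
restored = unnormalise(normalised, mean, std)
print(restored.shape, restored.dtype)
```

The same function works for a single video frame, since each frame has the same `(C, H, W)` layout after the clip dimensions are permuted.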

Analysis of mispredicted clips:

1. 

In [None]:
frames = []
titles = []
for clip in mispredicted_clips[:3]: # one sample per subplot
  # show the first frame of each clip
  first_frame = clip["clip"].squeeze().permute(1,0,2,3)[0].squeeze()
  frames.append(first_frame)
  title = ""
  # predictions are stored in ascending order, so reverse to list the best first
  for rank, (pred, prob) in enumerate(zip(reversed(clip["pred"]), reversed(clip["prob"])), start=1):
    title += f"{rank}: {pred} ({round(prob * 100, 2)}%)\n"
  title += f"T: {clip['true']}"
  titles.append(title)

fig, axs = plt.subplots(1, 3, figsize=(15,5))
# flatten 2d array
axs = axs.flatten()
mean = np.array(video_config["transform"]["mean"])
std = np.array(video_config["transform"]["std"])
for ax, img, txt in zip(axs, frames, titles):
  # unnormalise
  img = img.squeeze()
  img = img * std[:, None, None] + mean[:, None, None]

  ax.imshow((img.permute(1,2,0) * 255.0).numpy().astype(np.uint8))
  ax.plot([], [], ' ', label=txt)
  ax.legend(loc="lower left", fontsize=12)
  ax.axis('off')

fig.tight_layout(pad=0.5)
fig.savefig(os.path.join(BASEPATH, "report", "figures", "r2plus1d-mispredicted-frames.png"), dpi=300)