# Analyse Results

This notebook contains the detailed analysis of all experiments in this bachelor project. It queries the evaluation data from W&B and presents it in `pd.DataFrame`s (which are later compiled into LaTeX tables) and in figures that are saved to the directory `report/figures`.

In [None]:
import re
import sys
import string
sys.path.append("../src")

# retrieve data about runs
import wandb

# plotting and analysis
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
pd.set_option('display.float_format','{:.2f}'.format)
sns.set_style("darkgrid")

WANDB_PROJECT = "mikasenghaas/bsc"

In [None]:
from data import ImageDataset

img = ImageDataset(**ImageDataset.default_config())
class2id, id2class = img.class2id, img.id2class

In [None]:
# get all runs
api = wandb.Api()
all_runs = api.runs(WANDB_PROJECT)

In [None]:
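# helpers
def render_latex(df):
    # work on a copy so the caller's column names are not overwritten
    df = df.copy()

    # capitalise col names, e.g. "test_top1_acc" -> "Test Top1 Acc"
    df.columns = [' '.join(map(lambda x: x[0].upper() + x[1:], col.split('_'))) for col in df.columns]

    # format df: bold the max of each column, two decimals
    s = df.style.highlight_max(props='bfseries: ;')
    s.format(precision=2)

    # render latex
    opts = {"hrules": True, "position": "h"}
    return s.to_latex(**opts)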

def capitalise(s):
    return s[0].upper() + s[1:]

# Experiment 1: Image and Video Classifiers

This experiment compares a wide variety of image and video classifiers. The classifiers tested are freely available on the PyTorch Hub. The following tables give an overview of all models. We first select all runs that train models within the first experiment.

In [None]:
runs = [run for run in all_runs if run.group == "experiment1"]

print(f"There are {len(runs)} runs in Experiment 1 ({', '.join([run.name for run in runs])})")

## Configuration

Let's look at the training configuration of each model. This includes the hyperparameters for the optimiser and the learning rate scheduler, the batch size and the total number of training epochs.
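
The configuration columns `lr`, `gamma` and `step_size` suggest a step-decay learning rate schedule. The sketch below assumes the rule used by `torch.optim.lr_scheduler.StepLR` (multiply the learning rate by `gamma` every `step_size` epochs). The function name `step_decay_lr` and the numeric values are made up for illustration and are not taken from the actual runs.

```python
def step_decay_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate at `epoch` under step decay: base_lr * gamma ** (epoch // step_size)."""
    return base_lr * gamma ** (epoch // step_size)


# hypothetical hyperparameters, chosen only to make the decay visible
base_lr, gamma, step_size = 1e-3, 0.1, 5

# one learning rate per epoch: constant within a block of `step_size` epochs,
# then scaled by `gamma` at the start of the next block
schedule = [step_decay_lr(base_lr, gamma, step_size, epoch) for epoch in range(12)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

With these values the learning rate stays at roughly `1e-3` for epochs 0 to 4, drops to roughly `1e-4` for epochs 5 to 9 and to roughly `1e-5` afterwards.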

In [None]:
# convert wandb config to df
config = pd.DataFrame([dict(run.config, **{"id": run.name}) for run in runs])

# drop unnecessary cols
config = config.drop(["model", "version", "wandb_name", "wandb_log", "wandb_tags", "all_classes", "first_floor", "ground_floor", "wandb_group", "include_classes"], axis=1)

# set index
config = config.set_index("id")
config

Nothing surprising here. All hyperparameters are at the default values for the image and video classifiers. We can render the table to LaTeX to put it into the report.

In [None]:
# render latex
training_params = ["max_epochs", "batch_size", "lr", "gamma", "step_size", "device"]
data_params = ["ratio"]

print(render_latex(config[training_params]))

## Results

The results section plots and saves all relevant figures that compare the performance of the different models. We start by querying the summary statistics that were computed by the `eval.py` script and then synced to W&B.
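
As a minimal sketch of the clean-up applied to each run summary: W&B stores bookkeeping entries such as `_runtime` or `_step` next to the logged metrics, so keys starting with an underscore are dropped before building the table. The helper `public_summary` and the values in `raw` are hypothetical. The unit conversions mirror the ones used in this notebook (fractions to percent, parameters to millions, FLOPs to GFLOPs).

```python
def public_summary(summary: dict, name: str) -> dict:
    """Keep only logged metrics (no '_'-prefixed keys) and attach the run name."""
    row = {k: v for k, v in summary.items() if not k.startswith("_")}
    row["model_name"] = name
    return row


# hypothetical summary of a single run
raw = {
    "test_top1_acc": 0.5,      # fraction in [0, 1]
    "num_params": 11_000_000,  # absolute parameter count
    "flops": 2_000_000_000,    # absolute FLOP count
    "_runtime": 1234,          # W&B bookkeeping, dropped below
    "_step": 10,               # W&B bookkeeping, dropped below
}

row = public_summary(raw, "some-model")

# convert to the units used in the tables
row["test_top1_acc"] *= 1e2  # to %
row["num_params"] *= 1e-6    # to millions
row["flops"] *= 1e-9         # to GFLOPs

print(row)
```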

In [None]:
# convert wandb summary to df
all_summary = pd.DataFrame([dict({k: v for k, v in run.summary.items() if not k.startswith("_")}, **{"model_name": run.name}) for run in runs])

# choose relevant cols
cols = ["model_name",
        "test_top1_acc",
        "test_top3_acc",
        "test_macro_f1",
        "num_params",
        "flops",
        "samples_per_second_mean",
        "samples_per_second_std",
       ]
summary = all_summary[cols].copy() # copy, so the assignments below do not write into a view of all_summary

summary["test_top1_acc"] = summary["test_top1_acc"] * 1e2 # to %
summary["test_top3_acc"] = summary["test_top3_acc"] * 1e2 # to %
summary["test_macro_f1"] = summary["test_macro_f1"] * 1e2 # to %
summary["flops"] = summary["flops"] * 1e-9 # to GFLOPs
summary["num_params"] = summary["num_params"] * 1e-6 # to millions

# add type of model with regex
summary["type"] = summary["model_name"].apply(lambda x: "video" if re.search(r"rnn|lstm", x) else "image")
summary["base"] = summary["model_name"].apply(lambda x: x.split("-")[0])
summary["head"] = summary["model_name"].apply(lambda x: x.split("-")[1] if "-" in x else "None")

# display
summary

In [None]:
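# sanity check: str.find returns -1 (which is truthy) when the substring is missing,
# so its result must be compared against -1 instead of being used as a boolean
bool("alexnet".find("-"))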

We see the following:
- ResNet18 is the best image classifier (71.39% test top-1 accuracy) and ResNet18-LSTM is the best video classifier (72.19%)
- MobileNetV3 struggled to learn anything meaningful (only 29% top-1 accuracy)
- Adding an RNN/LSTM layer does not seem to affect the performance much (slight increase/decrease for ResNet, significant decrease for AlexNet)
- Top-3 accuracy is generally higher than top-1 accuracy -> if the model is wrong, it's often almost right
- Macro F1 is generally worse than top-1 accuracy -> the models focus on getting the high-resource classes right
- A larger model does not mean better performance here: ResNet50 is double the size but does not perform better, and the same holds for AlexNet. MobileNetV3, however, is really small and struggles to learn the true relationship
- Throughput decreases with a recurrent layer (fewer samples per second), which makes these models less attractive
- All models are valid in terms of throughput. Many can predict at a rate of at least 30 FPS
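
To make the relationship between these metrics concrete, here is a small self-contained sketch on an imbalanced toy problem with three classes. The data and the helper names `topk_accuracy` and `macro_f1` are invented for illustration; the actual metrics in this project come from `eval.py`. Because the toy problem has only three classes, top-2 stands in for top-3.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def macro_f1(preds, labels, num_classes):
    """Unweighted mean of the per-class F1 scores (a class without true positives scores 0)."""
    f1s = []
    for c in range(num_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / num_classes


# toy data: class 0 is the high-resource class (6 of 10 samples)
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
scores = [[0.8, 0.1, 0.1]] * 6 + [
    [0.5, 0.4, 0.1],  # true class 1, predicted 0, but class 1 is second
    [0.2, 0.7, 0.1],  # true class 1, predicted 1
    [0.6, 0.1, 0.3],  # true class 2, predicted 0, but class 2 is second
    [0.6, 0.3, 0.1],  # true class 2, predicted 0, class 2 is last
]
preds = [max(range(len(row)), key=row.__getitem__) for row in scores]

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
f1 = macro_f1(preds, labels, num_classes=3)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}, macro F1: {f1:.2f}")
```

In this toy example the classifier mostly predicts the majority class. Top-1 accuracy is 0.70 and top-2 accuracy is 0.90, while macro F1 is only about 0.49 because the two rare classes count as much as the frequent one.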

In [None]:
# render latex
print(render_latex(summary))

Let's plot the top-1 accuracy, top-3 accuracy and macro F1 score of each model in a bar chart.
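
A grouped bar chart in seaborn expects one row per (model, metric) pair, so the table has to be reshaped from wide to long format first. The sketch below shows this reshaping with `DataFrame.melt` on a made-up two-model table; the model names and values are placeholders, not results from the experiments.

```python
import pandas as pd

# wide format: one row per model, one column per metric (placeholder values)
wide = pd.DataFrame({
    "model_name": ["model-a", "model-b"],
    "test_top1_acc": [70.0, 65.0],
    "test_macro_f1": [60.0, 50.0],
})

# long format: one row per (model, metric) pair
long = wide.melt(
    id_vars="model_name",
    value_vars=["test_top1_acc", "test_macro_f1"],
    var_name="metric_type",
    value_name="metric_value",
)
print(long)
```

The resulting frame has the columns `model_name`, `metric_type` and `metric_value`, which map directly onto the `x`, `hue` and `y` arguments of `sns.barplot`.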

In [None]:
# we first need to reshape the df from wide to long format
tmp = summary[["model_name", "test_top1_acc", "test_top3_acc", "test_macro_f1"]]
tmp = tmp.melt(id_vars='model_name',
         value_vars=['test_top1_acc', 'test_top3_acc', 'test_macro_f1'],
         var_name='metric_type', value_name='metric_value')
tmp

In [None]:
# grouped bar chart of the three performance metrics for each model
fig, ax = plt.subplots(figsize=(8, 6))
sns.barplot(
    data=tmp,
    x="model_name", y="metric_value", hue="metric_type",
    palette="Dark2", ax=ax
);

# styling
ax.set_ylim(0, 100) # set ylim max to 100
ax.set_xticklabels(ax.get_xticklabels(), rotation=20) # rotate x ticks
handles, labels = ax.get_legend_handles_labels() # get legend handles and labels
ax.legend(title='Performance Metric', handles=handles, labels=["Top-1 Accuracy", "Top-3 Accuracy", "Macro F1 Score"]);
ax.set_xlabel("Model Name", fontweight="bold");
ax.set_ylabel("Performance Metric (%)", fontweight="bold");

fig.savefig("../report/figures/experiment1-performance-metricz.png", dpi=300, bbox_inches="tight")

Let's look at the relationship between some of the metrics. We want to make plots for:

Performance vs. Complexity/Efficiency
- Top-1 Accuracy vs. Model Size
- Top-1 Accuracy vs. GFLOPs

Other Performance vs. Complexity/Efficiency
- F1 vs. Model Size
- F1 vs. GFLOPs

In [None]:
def scatter(data, x, y, **kwargs):
  """
  Scatter plot of summary metrics of models where the hue encodes the base
  model and the marker style encodes the type of RNN head that is appended

  Args:
    data (pd.DataFrame): summary dataframe
    x (str): x-axis variable
    y (str): y-axis variable
    **kwargs: optional `ax` (axes to draw on), `xlabel` and `ylabel`
  
  Returns:
    fig (matplotlib.figure.Figure): the new figure, or None if `ax` was passed
  """
  if kwargs.get("ax"):
    ax = kwargs.get("ax")
  else:
    fig, ax = plt.subplots(figsize=(6, 4))

  sns.scatterplot(data=data, s=150, x=x, y=y, hue="base", palette="Dark2", style="head", ax=ax)

  # set x and y labels
  if kwargs.get("xlabel"):
    ax.set_xlabel(kwargs.get("xlabel"), fontweight="bold");
  if kwargs.get("ylabel"):
    ax.set_ylabel(kwargs.get("ylabel"), fontweight="bold");

  # capitalise legend labels
  handles, labels = ax.get_legend_handles_labels()
  ax.legend(handles=handles, labels=list(map(capitalise, labels)), bbox_to_anchor=(1.05, 1));

  # adding arrows to base-head pairs
  for cnn in ["alexnet", "resnet18"]: # all models with a rnn head
    xs = data.loc[data['model_name'] == cnn, x].values[0]
    ys = data.loc[data['model_name'] == cnn, y].values[0]
    for rnn in ["rnn", "lstm"]: # all rnn heads
      xe = data.loc[data['model_name'] == f"{cnn}-{rnn}", x].values[0]
      ye = data.loc[data['model_name'] == f"{cnn}-{rnn}", y].values[0]
      ax.annotate("", xy=(xe, ye), xytext=(xs, ys), arrowprops=dict(arrowstyle="->", color="gray"))
    
  # return for saving
  if not kwargs.get("ax"):
    return fig

In [None]:
# scatter plot of num params vs top-1 acc
fig = scatter(summary, x="num_params", y="test_top1_acc", xlabel="Num. of Parameters (M)", ylabel="Top-1 Accuracy (%)")
fig.savefig("../report/figures/experiment1-num-params-vs-top1-acc.png", dpi=300, bbox_inches="tight")

In [None]:
# scatter plot of flops vs top-1 acc
fig = scatter(summary, x="flops", y="test_top1_acc", xlabel="FLOPs (G)", ylabel="Top-1 Accuracy (%)")
fig.savefig("../report/figures/experiment1-flops-vs-top1-acc.png", dpi=300, bbox_inches="tight")

In [None]:
# scatter plot of throughput vs top-1 acc
fig = scatter(summary, x="samples_per_second_mean", y="test_top1_acc", xlabel="Throughput (Preds/s)", ylabel="Top-1 Accuracy (%)")
fig.savefig("../report/figures/experiment1-throughput-vs-top1-acc.png", dpi=300, bbox_inches="tight")

In [None]:
# combine all scatter plots into one figure
fig, axs = plt.subplots(ncols=3, figsize=(5*3, 4))
xvars = ["num_params", "flops", "samples_per_second_mean"]
xlabels = ["Num. of Parameters (M)", "FLOPs (G)", "Throughput (Preds/s)"]
for xvar, xlabel, ax in zip(xvars, xlabels, axs):
  scatter(summary, x=xvar, y="test_top1_acc", xlabel=xlabel, ylabel="Top-1 Accuracy (%)", ax=ax)
for ax in axs[:-1]:
  ax.get_legend().remove()

fig.savefig("../report/figures/experiment1-scatter-plots.png", dpi=300, bbox_inches="tight") 

Let's look at the confusion matrix of each of the models. The confusion matrix is in the original summary dataframe.

In [None]:
# the class names are too long, let's encode them as letters
classes = class2id.keys()
letters = [c for c in string.ascii_uppercase[:len(classes)]]
class2letter = [{"Class": k, "Encoding": v} for k, v in zip(classes, letters)]

print(render_latex(pd.DataFrame(class2letter).set_index("Class")))

The confusion matrix was computed using `torcheval.metrics.multiclass_confusion_matrix(y_pred, y_true)`. From the docs we read:

> Compute multi-class confusion matrix, a matrix of dimension num_classes x num_classes where each element at position (i,j) is the number of examples with true class i that were predicted to be class j. See also binary_confusion_matrix

From this we can read that the **rows** denote the **true** class and the **columns** denote the **predicted** class.
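
With rows as true classes and columns as predicted classes, dividing each row by its sum puts the per-class recall on the diagonal, and dividing each column by its sum puts the per-class precision there. The sketch below illustrates this on a made-up 3x3 matrix in which the third class never occurs. The helper `normalise_confusion` is hypothetical, and filling empty rows and columns with zeros instead of NaN is a choice made for this example only.

```python
import numpy as np


def normalise_confusion(conf):
    """Return (row-normalised, column-normalised) versions of a confusion matrix.

    Rows are true classes and columns are predicted classes, so the diagonal of
    the first result holds recall and the diagonal of the second holds precision.
    Rows/columns that sum to zero are left at zero instead of dividing by zero.
    """
    conf = np.asarray(conf, dtype=float)
    row_sums = conf.sum(axis=1, keepdims=True)
    col_sums = conf.sum(axis=0, keepdims=True)
    recall = np.divide(conf, row_sums, out=np.zeros_like(conf), where=row_sums > 0)
    precision = np.divide(conf, col_sums, out=np.zeros_like(conf), where=col_sums > 0)
    return recall, precision


# made-up counts: the third class is absent from the test split and never predicted
conf = [
    [8, 2, 0],
    [1, 4, 0],
    [0, 0, 0],
]
recall, precision = normalise_confusion(conf)
print(np.around(recall, 2))
print(np.around(precision, 2))
```

Each non-empty row of `recall` and each non-empty column of `precision` sums to one, which makes classes with different sample counts comparable in the heatmaps.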

In [None]:
classes = class2id.keys()
letters = [c for c in string.ascii_uppercase[:len(classes)]]
class2letter = {k: v for k, v in zip(classes, letters)}
for record in all_summary[["model_name", "test_conf_matrix"]].to_dict(orient="records"):
  model_name = record["model_name"]

  conf_matrix = np.array(record["test_conf_matrix"])
  recall_conf_matrix = (conf_matrix.T / conf_matrix.sum(1)).T # normalise by row, show recall
  precision_conf_matrix = conf_matrix / conf_matrix.sum(0) # normalise by col, show precision

  for title, mat in zip(["conf-matrix", "recall-matrix", "precision-matrix"], [conf_matrix, recall_conf_matrix, precision_conf_matrix]):
    # set float precision to 2
    mat = np.around(mat, decimals=2)
    df_conf_matrix = pd.DataFrame(mat, index=letters, columns=letters)

    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(df_conf_matrix, annot=True, annot_kws={"size": 7}, ax=ax)
    ax.set_xlabel("Predicted", fontweight="bold")
    ax.set_ylabel("Actual", fontweight="bold")
    # ax.set_title(f"{title} for {model_name}", fontweight="bold")
    fig.savefig(f"../report/figures/experiment1-{title}-{model_name}.png", dpi=300, bbox_inches="tight")

We can see the following:
- There are holes in the main diagonal even for good models, because the test split does not cover all classes (and the raw confusion matrix shows absolute counts, not normalised values)
- ResNet18 performs best overall, with a very clearly highlighted main diagonal.
  - Base: 
  - RNN:
  - LSTM:
- ResNet50 is similar to ResNet18
- MobileNet only ever predicts ground floor classes, which is odd.

- All models struggle to differentiate GF Corridor 2 (K) and Atrium (M) -> they predict the majority class Atrium (M)
- All models struggle to differentiate the different libraries (D, E, F) -> Library 1 is often predicted as Library 2, and precision for Library 3 is low
- All models struggle to differentiate Corridor 1 (L) and 2 (M) -> basically a random choice
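
Observations like these can also be read off programmatically by ranking the off-diagonal entries of a confusion matrix. The sketch below is one possible way to do this; the helper `top_confusions`, the letters and the counts are invented for illustration and do not correspond to the matrices of the trained models.

```python
import numpy as np


def top_confusions(conf, labels, n=3):
    """Return up to n (true, predicted, count) triples with the largest off-diagonal counts."""
    off_diag = np.array(conf)            # copy, so the input is not modified
    np.fill_diagonal(off_diag, 0)        # ignore correct predictions
    order = np.argsort(off_diag, axis=None)[::-1][:n]  # flat indices, largest first
    pairs = []
    for idx in order:
        i, j = np.unravel_index(idx, off_diag.shape)
        if off_diag[i, j] > 0:           # skip pairs that were never confused
            pairs.append((labels[i], labels[j], int(off_diag[i, j])))
    return pairs


# made-up counts (rows: true class, columns: predicted class)
conf = [
    [10, 0, 1],
    [7, 3, 0],
    [0, 2, 9],
]
pairs = top_confusions(conf, labels=["A", "B", "C"])
for true_cls, pred_cls, count in pairs:
    print(f"{true_cls} predicted as {pred_cls}: {count} times")
```

Here the most frequent error is class B being predicted as class A (7 times), followed by C as B (2 times) and A as C (once).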

## Experiment 2

This experiment compares different kinds of video classifiers against each other.

In [None]:
runs = [run for run in all_runs if run.group == "experiment2"]

print(f"There are {len(runs)} runs in Experiment 2 ({', '.join([run.name for run in runs])})")

### Configuration

In [None]:
# convert wandb config to df
config = pd.DataFrame([dict(run.config, **{"id": run.name}) for run in runs])

# drop unnecessary cols
config = config.drop(["model", "version", "wandb_name", "wandb_log", "wandb_tags", "all_classes", "first_floor", "ground_floor", "wandb_group", "include_classes"], axis=1)

# set index
config = config.set_index("id")
config

### Results

In [None]:
# convert wandb summary to df
summary = pd.DataFrame([dict({k: v for k, v in run.summary.items() if not k.startswith("_")}, **{"id": run.name}) for run in runs])

# drop unnecessary cols
summary = summary.drop([], axis=1)

# set index
summary = summary.set_index("id")
summary

In [None]:
# render latex
print(render_latex(summary))

In [None]:
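# pick a single run by its id (raises StopIteration if no run has this id,
# instead of silently continuing with the last run of the loop)
run = next(run for run in all_runs if run.id == "n0itki9o")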

In [None]:
# convert wandb summary to df
summary = pd.DataFrame([dict(run.summary.items(), **{"id": run.name})])

# drop unnecessary cols
# summary = summary.drop([], axis=1)


# set index
summary = summary.set_index("id")
summary

In [None]:
print(render_latex(summary[["samples_per_second"]]))