## Analysis

This notebook analyzes the experiment results tracked on
[Weights & Biases](https://wandb.ai/).


### Setup

---

In [1]:
# Auto-reload
%load_ext autoreload
%autoreload 2

In [11]:
# Built-in modules
import os
import sys
sys.path.insert(0, "..")

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# External modules
# - Data Representation
import pandas as pd
import numpy as np

# - Data Visualization
from matplotlib import pyplot as plt
import seaborn as sns

# - Machine Learning
import torch
import torch.nn as nn
from sklearn import metrics

# - Experiment Configuration and Logging
import wandb

# Custom modules
from utils import eval_utils as utils

In [7]:
# Setup of global variables
WANDB_PROJECT = "few-shot-benchmark"
WANDB_ENTITY = "metameta-learners"

GROUP = "mika"

ROOT_DIR = os.path.dirname(os.path.abspath("."))
ARTIFACT_DIR = os.path.join(ROOT_DIR, "artifacts")
FIGURE_DIR = os.path.join(ROOT_DIR, "figures")

### Load Experiment Data

---

Let's start by fetching all runs of the project and keeping only those that belong to the selected experiment group.



In [9]:
# Initialize wandb
api = wandb.Api()

# Get all runs
runs = api.runs(f"{WANDB_ENTITY}/{WANDB_PROJECT}")

# Filter runs by group
group_runs = [run for run in runs if run.group == GROUP]
print(f"Found {len(group_runs)} runs")

Found 2 runs
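
If the filter returns fewer runs than expected, it can help to check which groups actually exist in the project. The sketch below counts runs per group. To stay runnable offline it uses `SimpleNamespace` stand-ins that only mimic the `group` attribute of a W&B run, and the helper name `count_runs_per_group` is hypothetical, not part of `wandb` or `utils`.

```python
from collections import Counter
from types import SimpleNamespace


def count_runs_per_group(runs):
    """Count runs per group name; runs without a group are counted under None."""
    return Counter(getattr(run, "group", None) for run in runs)


# Stand-in objects mimicking the `group` attribute of a W&B run (illustrative only)
fake_runs = [
    SimpleNamespace(id="a", group="mika"),
    SimpleNamespace(id="b", group="mika"),
    SimpleNamespace(id="c", group="other"),
    SimpleNamespace(id="d", group=None),
]

group_counts = count_runs_per_group(fake_runs)
print(dict(group_counts))
```

With real data, the same helper could presumably be called on `runs` before filtering by `GROUP`.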


Next, we'll collect the configuration and evaluation results of these runs into a single dataframe.

In [10]:
df_runs = utils.load_to_df(group_runs)
df_runs.head()

,config,config,config,config,config,config,eval,eval,eval,eval,eval,eval,eval,eval,eval,eval
run_id,dataset,method,sot,n_way,n_shot,val/acc_std,epoch,val/acc,test/acc,val/acc_ci,test/acc_ci,train/acc,train/loss,test/acc_std,train/acc_ci,train/acc_std
cmmdjbv0,swissprot,baseline,False,5,5,11.659457,3,71.48,68.633333,0.932951,0.910922,87.78,1.161866,11.384151,0.665911,8.322155
rjejahjj,swissprot,baseline,False,5,5,11.756795,3,71.866667,68.066667,0.94074,0.886208,88.36,1.161866,11.075298,0.630342,7.877631
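
The dataframe has a two-level column header (`config` and `eval`), so columns are addressed by a `(group, name)` pair. The implementation of `utils.load_to_df` is not shown here; the sketch below is not its code, only a minimal illustration of how a frame with this layout can be built and queried with pandas. It uses a handful of values copied from the table above.

```python
import pandas as pd

# Two-level column header: (group, name)
columns = pd.MultiIndex.from_tuples(
    [
        ("config", "dataset"),
        ("config", "method"),
        ("config", "n_way"),
        ("config", "n_shot"),
        ("eval", "val/acc"),
        ("eval", "test/acc"),
    ]
)

# A subset of the values shown in the table above
rows = [
    ["swissprot", "baseline", 5, 5, 71.48, 68.633333],
    ["swissprot", "baseline", 5, 5, 71.866667, 68.066667],
]
index = pd.Index(["cmmdjbv0", "rjejahjj"], name="run_id")
df = pd.DataFrame(rows, index=index, columns=columns)

# Select one column by its (group, name) pair
test_acc = df[("eval", "test/acc")]

# Select every column of one group; the result has a flat header
eval_cols = df["eval"]

# Look up a single cell by run id and column
val_acc_first = df.loc["cmmdjbv0", ("eval", "val/acc")]

# Average test accuracy over the two runs
mean_test_acc = test_acc.mean()
```

The same indexing pattern should apply to `df_runs`, for example `df_runs[("eval", "test/acc")]`.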


### A closer look at particular runs

---

Select a run from the table above to look at it in more detail.

In [15]:
runid = 'cmmdjbv0'
config = [run.config for run in group_runs if run.id == runid][0]
dataset, loader, model = utils.init_run(config, ROOT_DIR, "test")
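
Indexing the list comprehension with `[0]` raises a bare `IndexError` when the run id is misspelled or belongs to another group. A minimal sketch of a lookup with a more descriptive error is shown below. The helper name `get_run_config` is hypothetical, and the stand-in objects only mimic the `id` and `config` attributes of W&B runs.

```python
from types import SimpleNamespace


def get_run_config(runs, run_id):
    """Return the config of the run with the given id, or raise a descriptive error."""
    for run in runs:
        if run.id == run_id:
            return run.config
    available = sorted(run.id for run in runs)
    raise KeyError(f"Run {run_id!r} not found; available run ids: {available}")


# Stand-ins mimicking W&B runs (illustrative only)
fake_runs = [
    SimpleNamespace(id="cmmdjbv0", config={"n_way": 5, "n_shot": 5}),
    SimpleNamespace(id="rjejahjj", config={"n_way": 5, "n_shot": 5}),
]

selected_config = get_run_config(fake_runs, "cmmdjbv0")
print(selected_config)
```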

Next, let's evaluate the run's model on the given dataset:

In [19]:
# Get the mapping from encoding to annotation
encoding2anot = {v: k for k, v in dataset.trg2idx.items()}

# Define sklearn metric functions, each paired with optional keyword arguments;
# every function takes y_true and y_pred as input, in this order
clf_kwargs = {"average": "macro"}
metric_fns = [
    (metrics.accuracy_score, None),
    (metrics.precision_score, clf_kwargs),
    (metrics.recall_score, clf_kwargs),
    (metrics.f1_score, clf_kwargs),
]

# Evaluate the model and collect its predictions and the ground truth for each episode
episodes_results = utils.eval_run(model, loader)

# Compute metrics for each episode
episodes_metrics = utils.compute_metrics(metric_fns, episodes_results)

episodes_metrics.head()

Evaluating: 100%|██████████| 24/24 [00:07<00:00,  3.27it/s]


episode,accuracy_score,precision_score,recall_score,f1_score
0,0.56,0.56,0.56,0.55596
1,0.72,0.72,0.72,0.710909
2,0.4,0.39,0.4,0.374921
3,0.76,0.833333,0.76,0.735198
4,0.76,0.782857,0.76,0.756667
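
The implementation of `utils.compute_metrics` is not shown in this notebook. As a rough sketch of what such a step might look like, the code below applies each `(fn, kwargs)` pair to every episode, assuming each episode is a `(y_true, y_pred)` pair. The function name `compute_episode_metrics`, the episode layout, and the added `zero_division=0` argument are assumptions, and the labels are toy values, not results from the run above.

```python
import pandas as pd
from sklearn import metrics


def compute_episode_metrics(metric_fns, episodes):
    """Apply each (fn, kwargs) pair to every (y_true, y_pred) episode."""
    rows = []
    for y_true, y_pred in episodes:
        row = {}
        for fn, kwargs in metric_fns:
            # kwargs may be None, as for accuracy_score
            row[fn.__name__] = fn(y_true, y_pred, **(kwargs or {}))
        rows.append(row)
    return pd.DataFrame(rows)


# zero_division=0 avoids warnings when a class is never predicted (assumption)
clf_kwargs = {"average": "macro", "zero_division": 0}
metric_fns = [
    (metrics.accuracy_score, None),
    (metrics.precision_score, clf_kwargs),
    (metrics.recall_score, clf_kwargs),
    (metrics.f1_score, clf_kwargs),
]

# Two toy episodes with made-up labels
episodes = [
    ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]),  # all six correct
    ([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]),  # four of six correct
]

toy_metrics = compute_episode_metrics(metric_fns, episodes)
print(toy_metrics)
```

When every class has the same number of query samples, macro-averaged recall equals accuracy, which would explain the identical `accuracy_score` and `recall_score` columns in the output above.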

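Per-episode metrics are usually summarized by their mean, standard deviation, and a confidence interval. The sketch below uses the normal-approximation 95% half-width `1.96 * std / sqrt(n)`. The `*_ci` columns in the runs table appear consistent with this formula for `n = 600` episodes, but the exact computation used by the training code is not shown here, so treat the formula, the population standard deviation (`ddof=0`), and the helper name `summarize` as assumptions.

```python
import numpy as np


def summarize(values, z=1.96):
    """Return mean, standard deviation, and the half-width of a normal-approximation CI."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population std (ddof=0); an assumption
    ci = z * std / np.sqrt(len(values))
    return mean, std, ci


# Accuracy of the five episodes displayed above
acc = [0.56, 0.72, 0.40, 0.76, 0.76]
acc_mean, acc_std, acc_ci = summarize(acc)
print(f"accuracy: {acc_mean:.3f} +- {acc_ci:.3f} (std {acc_std:.3f})")

# Half-width implied by a reported std, e.g. val/acc_std of run cmmdjbv0 with n = 600
implied_ci = 1.96 * 11.659457 / np.sqrt(600)
```

Five episodes are far too few for a meaningful interval; the call is only meant to show the mechanics on values that are visible in this notebook.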

---