# First look at the results

This is just an initial preview of the results. Let's see how decision-margin consistency affects (a) consistency scores (humanvshuman, modelvsmodel, and modelvshuman) and (b) the rank order of models.

# humanvshuman

TLDR: agreement between humans is underestimated due to noise, and averaging across subjects increases the estimated agreement. Caveat: for these data we can only draw this conclusion for group-level agreement, because we don't have multiple trials to average within subjects (though see other experiment for this case...)

Here we compare error-consistency scores to accuracy-consistency scores, which can be taken as a measure of decision-margin consistency in people.

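Error consistency is computed between pairs of individual subjects: it measures whether the two subjects make their errors on the same items more often than expected by chance given their accuracies (Cohen's kappa).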
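
As a rough illustration of the pairwise computation, here is a minimal sketch of Cohen's kappa on binary correctness vectors. The function name `error_consistency` and the toy responses are made up for this example; this is not the code that produced the summary files loaded below.

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Cohen's kappa between two binary correctness vectors (one entry per item)."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("both observers must respond to the same items")
    # Observed consistency: fraction of items where both are right or both are wrong
    c_obs = np.mean(a == b)
    # Consistency expected by chance from the two accuracies alone
    p_a, p_b = a.mean(), b.mean()
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    if np.isclose(c_exp, 1.0):
        return float("nan")  # undefined when both observers are at ceiling (or both at floor)
    return float((c_obs - c_exp) / (1 - c_exp))

# Toy example (made-up responses): two observers, eight items
obs_a = [1, 1, 0, 0, 1, 1, 0, 1]
obs_b = [1, 1, 0, 1, 1, 0, 0, 1]
kappa = error_consistency(obs_a, obs_b)
print(kappa)
```

Because each response is a single binary outcome, trial-level noise in either observer lowers the observed overlap, which is one way the pairwise score can underestimate agreement.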

Accuracy-consistency is computed by averaging across multiple trials, and is proportional to the decision margin distance for random sources of noise.

In theory accuracy-consistency can be computed within individual subjects, if each item is presented multiple times. However, stimuli were presented only once to each subject in these experiments, and therefore we compute group-accuracy-consistency. 

To do so, we perform a split-half reliability analysis. First, we split the subjects into two groups, compute the average accuracy for each individual image separately for each group, and then correlate the two groups' scores across items. Each split-half reliability score is then adjusted with the Spearman-Brown correction to estimate the reliability of the full dataset; we refer to the adjusted score as the group-accuracy-consistency. This process is repeated for all possible splits of subjects, and the average group-accuracy-consistency score is reported.

Note that the "group-accuracy-consistency" score is a standard estimate of the "noise ceiling" for human behavioral data. No model is expected to correlate with these human behavioral data greater than this noise ceiling.
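
The split-half procedure described above can be sketched as follows. This is a minimal illustration, not the pipeline that wrote the summary CSVs: the helper names (`spearman_brown`, `half_splits`, `group_accuracy_consistency`) and the toy response matrix are assumptions, and Pearson correlation is assumed for the across-item correlation.

```python
from itertools import combinations
import numpy as np

def spearman_brown(r):
    """Predict full-dataset reliability from a split-half correlation r."""
    return 2 * r / (1 + r)

def half_splits(n_subjects):
    """Yield every split of subject indices into two halves, each split once."""
    subjects = range(n_subjects)
    for half_a in combinations(subjects, n_subjects // 2):
        # With an even count each split appears twice (A|B and B|A);
        # keep only the version whose first half contains subject 0.
        if n_subjects % 2 == 0 and 0 not in half_a:
            continue
        half_b = tuple(s for s in subjects if s not in half_a)
        yield half_a, half_b

def group_accuracy_consistency(correct):
    """correct: (n_subjects, n_items) binary array of per-item correctness."""
    correct = np.asarray(correct, dtype=float)
    scores = []
    for half_a, half_b in half_splits(correct.shape[0]):
        acc_a = correct[list(half_a)].mean(axis=0)  # per-item accuracy, group A
        acc_b = correct[list(half_b)].mean(axis=0)  # per-item accuracy, group B
        r = np.corrcoef(acc_a, acc_b)[0, 1]         # correlate across items
        scores.append(spearman_brown(r))
    return float(np.mean(scores))

# Toy data (made up): 4 subjects x 6 items
responses = np.array([
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0],
])
score = group_accuracy_consistency(responses)
print(score)
```

With four subjects there are only three distinct half splits, so exhaustive enumeration is cheap; for larger groups a random sample of splits would be a reasonable substitute.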

In [None]:
%config InlineBackend.figure_format='retina'

In [None]:
import os
import pandas as pd
from glob import glob
from modelvshuman_dmc import constants as c
import seaborn as sns
import matplotlib.pyplot as plt

def load_humanvshuman_error_consistency_summary():
    error_consistency = []
    for dataset in c.DEFAULT_DATASETS:
        datadir = f"{c.RESULTS_DIR}/humanvshuman_error_consistency/{dataset}"
        filename = os.path.join(datadir, f"humanvshuman_error_consistency_{dataset}_summary.csv")
        error_consistency.append(pd.read_csv(filename))

    error_consistency = pd.concat(error_consistency)
    
    return error_consistency

def load_humanvshuman_splithalves_noise_ceiling_summary():
    noise_ceiling = []
    for dataset in c.DEFAULT_DATASETS:
        datadir = f"{c.RESULTS_DIR}/humanvshuman_splithalves_noise_ceiling/{dataset}"
        filename = os.path.join(datadir, f"humanvshuman_splithalves_noise_ceiling_{dataset}_summary.csv")
        noise_ceiling.append(pd.read_csv(filename))

    noise_ceiling = pd.concat(noise_ceiling)
    
    return noise_ceiling

In [None]:
ls {c.RESULTS_DIR}

In [None]:
error_consistency = load_humanvshuman_error_consistency_summary()
error_consistency

In [None]:
noise_ceiling = load_humanvshuman_splithalves_noise_ceiling_summary()
noise_ceiling

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

assert (error_consistency.dataset == noise_ceiling.dataset).all(), "Oops, better align your dfs"
assert (error_consistency.condition == noise_ceiling.condition).all(), "Oops, better align your dfs"

# Select relevant columns and add 'metric' column to identify the source of the data
error_df = error_consistency[['dataset', 'condition', 'error_consistency_avg']].copy()
error_df['metric'] = 'error\nconsistency'
error_df.rename(columns={'error_consistency_avg': 'score'}, inplace=True)

noise_df = noise_ceiling[['dataset', 'condition', 'adj_corr_mean']].copy()
noise_df['metric'] = 'group-accuracy\nconsistency'
noise_df.rename(columns={'adj_corr_mean': 'score'}, inplace=True)

# Concatenate the dataframes
combined_df = pd.concat([error_df, noise_df], ignore_index=True)

# Create the line plot
plt.figure(figsize=(6, 8))
ax = sns.lineplot(data=combined_df, x='metric', y='score', hue='dataset', style='condition', markers=True)

# Remove style markers from the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[:len(combined_df['dataset'].unique())], labels=labels[:len(combined_df['dataset'].unique())], bbox_to_anchor=(1.05, 1), 
          loc='upper left', borderaxespad=0, fontsize=13)

plt.ylabel('score', fontsize=20, labelpad=16)
ax.set_ylim([0,1.0])
plt.yticks(fontsize=14)

plt.xlabel('metric', fontsize=20, labelpad=16)
ax.set_xlim([-.2,1.2])
plt.xticks(fontsize=14)

plt.show()

In [None]:
combined_df

In [None]:
plt.figure(figsize=(6, 6))
df = noise_ceiling.copy()
df['delta'] = noise_ceiling.adj_corr_mean - error_consistency.error_consistency_avg
df = df.reset_index()

ax = sns.scatterplot(data=df, x="adj_corr_mean", y="delta", hue="dataset")
ax.axis('square');

# Remove style markers from the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[:len(combined_df['dataset'].unique())], labels=labels[:len(combined_df['dataset'].unique())], bbox_to_anchor=(1.05, 1), 
          loc='upper left', borderaxespad=0, fontsize=13)

plt.ylabel('score increase\n(relative to error-consistency)', fontsize=16, labelpad=16)
ax.set_ylim([-.1,1]);
plt.yticks(fontsize=14);

plt.xlabel('group-accuracy-consistency\n(estimated noise ceiling)', fontsize=16, labelpad=16)
ax.set_xlim([-.1,1]);
plt.xticks(fontsize=14);

In [None]:
df['delta'].max(), df['delta'].mean()

# modelvsmodel

Although the deep neural network models analyzed here are "noiseless", we find that agreement between models is higher when measured with decision-margin consistency than with error-consistency.
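
To make the comparison concrete, here is a minimal sketch of how a decision margin and a pairwise decision-margin consistency could be computed from model outputs. It assumes the margin is the target activation minus the largest non-target activation (suggested by the `targ_act` and `max_nontarg_act` columns in the raw model data) and that consistency is a Pearson correlation of margins across items; the function names and toy logits are made up, and the actual pipeline may differ.

```python
import numpy as np

def decision_margin(logits, target):
    """Target activation minus the largest non-target activation, per item."""
    logits = np.asarray(logits, dtype=float)
    target = np.asarray(target)
    rows = np.arange(logits.shape[0])
    targ_act = logits[rows, target]
    masked = logits.copy()
    masked[rows, target] = -np.inf          # exclude the target class
    max_nontarg_act = masked.max(axis=1)
    return targ_act - max_nontarg_act       # > 0 means the item is classified correctly

def decision_margin_consistency(margin_a, margin_b):
    """Assumed here: Pearson correlation of two observers' margins across items."""
    return float(np.corrcoef(margin_a, margin_b)[0, 1])

# Toy logits (made up): 3 items x 3 classes for two hypothetical models
target = [0, 1, 0]
logits_a = [[2.0, 1.0, 0.0], [0.5, 1.5, 1.0], [0.2, 0.1, 0.9]]
logits_b = [[3.0, 0.5, 0.2], [0.4, 0.9, 0.8], [0.6, 0.1, 0.7]]
margins_a = decision_margin(logits_a, target)
margins_b = decision_margin(logits_b, target)
print(margins_a, margins_b, decision_margin_consistency(margins_a, margins_b))
```

Error consistency only sees the sign of each margin (correct vs. incorrect), whereas the correlation also uses how far each item sits from the decision boundary. That is one possible reason the two scores can differ even for deterministic models.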

In [None]:
%config InlineBackend.figure_format='retina'

In [None]:
import os
import pandas as pd
import numpy as np
from glob import glob
from itertools import combinations  # used by compute_heatmap below
from modelvshuman_dmc import constants as c
import seaborn as sns
import matplotlib.pyplot as plt

def load_modelvsmodel_error_consistency(collection):
    error_consistency = []
    for dataset in c.DEFAULT_DATASETS:
        datadir = f"{c.RESULTS_DIR}/modelvsmodel_pairwise_error_consistency/{collection}/{dataset}"
        filename = os.path.join(datadir, f"{collection}_set_modelvsmodel_pairwise_error_consistency_{dataset}.csv")
        error_consistency.append(pd.read_csv(filename))

    error_consistency = pd.concat(error_consistency)
    
    return error_consistency

def load_modelvsmodel_pairwise_decision_margin_consistency(collection):
    dmc = []
    for dataset in c.DEFAULT_DATASETS:
        datadir = f"{c.RESULTS_DIR}/modelvsmodel_pairwise_decision_margin_consistency/{collection}/{dataset}"
        filename = os.path.join(datadir, f"{collection}_set_modelvsmodel_pairwise_decision_margin_consistency_{dataset}.csv")
        dmc.append(pd.read_csv(filename))

    dmc = pd.concat(dmc)
    dmc.rename(columns=dict(subject_A="sub1", subject_B="sub2"), inplace=True)
    
    return dmc

def compute_heatmap(df, model_names, score_col):
    N = len(model_names)
    matrix = np.full((N, N), np.nan)
    for model1,model2 in combinations(model_names, 2):
        subset = df[(df.sub1==model1) & (df.sub2==model2)]
        if len(subset) == 0:
            subset = df[(df.sub2==model1) & (df.sub1==model2)]
        assert len(subset) == 1, "oops"
        idx1 = model_names.index(model1)
        idx2 = model_names.index(model2)
        matrix[idx1,idx2] = subset.iloc[0][score_col]
    
    return matrix

def compute_heatmaps(df, score_col):
    heatmaps = {}
    model_names = np.unique(df.sub1.values.tolist() + df.sub2.values.tolist()).tolist()
    datasets = df.dataset.unique()
    for dataset in datasets:
        subset = df[df.dataset == dataset]
        conditions = subset.condition.unique()
        for condition in conditions:
            cond_df = subset[subset.condition==condition]
            heatmaps[(dataset, condition)] = compute_heatmap(cond_df, model_names, score_col)
    return heatmaps, model_names

def plot_heatmap(matrix, model_names, vmin=0, vmax=1):
    ax = sns.heatmap(matrix, vmin=vmin, vmax=vmax)
    # Set x and y tick labels to model_names
    ax.set_xticks(np.arange(len(model_names)) + 0.5);  # Center the tick marks
    ax.set_yticks(np.arange(len(model_names)) + 0.5);
    ax.set_xticklabels(model_names, rotation=90);
    ax.set_yticklabels(model_names, rotation=0);
    # Move x-tick labels to the top
    ax.xaxis.set_ticks_position('top')
    ax.xaxis.set_label_position('top')
    
    return ax

In [None]:
ls {c.RESULTS_DIR}/modelvsmodel_pairwise_decision_margin_consistency

In [None]:
error_consistency = load_modelvsmodel_error_consistency("demo")
error_consistency

In [None]:
dmc = load_modelvsmodel_pairwise_decision_margin_consistency("demo")
dmc

In [None]:
subset = error_consistency[(error_consistency.sub1=='') & (error_consistency.sub2=='')]
len(subset)

In [None]:
df = error_consistency.copy()

In [None]:
error_consistency_heatmaps, err_con_model_names = compute_heatmaps(error_consistency, 'error_consistency')
error_consistency_heatmaps[('edge', 0)]

In [None]:
ax = plot_heatmap(error_consistency_heatmaps[('edge', 0)], err_con_model_names);

In [None]:
dmc

In [None]:
dmc_heatmaps, dmc_model_names = compute_heatmaps(dmc, 'decision_margin_consistency')
dmc_heatmaps[('edge', 0)]

In [None]:
ax = plot_heatmap(dmc_heatmaps[('edge', 0)], dmc_model_names);

In [None]:
assert err_con_model_names==dmc_model_names

In [None]:
merged_df = pd.merge(error_consistency, dmc, on=['dataset','condition','sub1','sub2'], how='left')
merged_df

In [None]:
ax = sns.scatterplot(merged_df, x="error_consistency", y="decision_margin_consistency", hue="pair")
sns.lineplot(x=[-.1,1], y=[-.1,1], ax=ax, color=(.7,.7,.7), linestyle='--');
ax.axis('square');

ax.set_title("Error consistency vs. decision-margin consistency\nall pairs of models across all datasets+conditions", pad=20)
# Remove style markers from the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[:len(combined_df['dataset'].unique())], labels=labels[:len(combined_df['dataset'].unique())], bbox_to_anchor=(1.10, 1), 
          loc='upper left', borderaxespad=0, fontsize=13)

plt.xlabel('error-consistency', fontsize=16, labelpad=16)
ax.set_xlim([-.1,1]);
plt.xticks(fontsize=14);

plt.ylabel('decision-margin consistency', fontsize=16, labelpad=16)
ax.set_ylim([-.1,1]);
plt.yticks(fontsize=14);

# modelvshuman

Finally, what happens when we compare models to humans? We start by looking at model and human accuracy.

In [None]:
%config InlineBackend.figure_format='retina'

In [None]:
import os
import pandas as pd
import numpy as np
from glob import glob
import seaborn as sns
import matplotlib.pyplot as plt

from modelvshuman_dmc import constants as c
from modelvshuman_dmc.analysis import data
from modelvshuman_dmc.datasets import experiments

from pdb import set_trace

def get_human_accuracy(datasets=c.DEFAULT_DATASETS):
    results = []
    for dataset in datasets:
        df = data.load_human_data(f'{c.RAW_DATA_DIR}/{dataset}', expected_subjects=c.EXPECTED_SUBJECTS.get(dataset, 4))
        drop_columns = [col for col in ['Session', 'session', 'trial'] if col in df.columns]
        avg = df.groupby(by=['condition']).mean(numeric_only=True).drop(columns=drop_columns).reset_index()
        
        avg.insert(0, 'subj', "humans")
        avg.insert(1, 'dataset_name', dataset)
        avg.insert(2, 'metric_name', 'accuracy (top-1, 16-way)')
  
        results.append(avg)
    results = pd.concat(results)
    return results

def get_model_accuracy(model_names, datasets):
    results = []
    for model_name in model_names:
        for dataset in datasets:
            df = data.load_model_data(f'{c.RAW_DATA_DIR}/{dataset}', model_name)
            drop_columns = [col for col in ['Session', 'session', 'trial', 'targ_act', 'max_nontarg_act', 'decision_margin'] if col in df.columns]
            avg = df.groupby(by=['condition']).mean(numeric_only=True).drop(columns=drop_columns).reset_index()

            avg.insert(0, 'subj', model_name)
            avg.insert(1, 'dataset_name', dataset)
            avg.insert(2, 'metric_name', 'accuracy (top-1, 16-way)')
            results.append(avg)
    results = pd.concat(results)
    return results

def get_model_performance(model_names, dataset=None):
    results = []
    for model_name in model_names:
        filename = os.path.join(c.PERFORMANCES_DIR, f"{model_name}.csv")
        df = pd.read_csv(filename)
        if dataset is not None:
            df = df[df.dataset_name==dataset]
        results.append(df)
    results = pd.concat(results)
    return results

In [None]:
ls {c.RAW_DATA_DIR}/

In [None]:
models = ["alexnet", "resnet50", "bagnet33", "simclr_resnet50x1", "vit_b_16", "convnext_large"]
dataset = "contrast"  # other options tried: "colour", "uniform-noise"

In [None]:
human_acc = get_human_accuracy(datasets=[dataset])
human_acc

In [None]:
model_acc = get_model_accuracy(model_names=models, datasets=[dataset])
model_acc

In [None]:
acc_df = pd.concat([human_acc, model_acc])
acc_df

In [None]:
from modelvshuman_dmc import constants as c
from modelvshuman_dmc.plotting.colors import *
from modelvshuman_dmc.plotting.decision_makers import DecisionMaker

__all__ = ['plotting_definition_template']

def plotting_definition_template(df):
    """Decision makers to compare a few models with human observers.

    This exemplary definition can be adapted for the
    desired purpose, e.g., by adding more/different models.

    Note that models will need to be evaluated first, before
    their data can be plotted.

    For each model, define:
    - a color using rgb(42, 42, 42)
    - a plotting symbol by setting marker;
      a list of markers can be found here:
      https://matplotlib.org/3.1.0/api/markers_api.html
    """

    decision_makers = []

    # Assign the blue color to alexnet
    decision_makers.append(DecisionMaker(name_pattern="alexnet",
                           color=rgb(65, 90, 140), marker="o", df=df,
                           plotting_name="AlexNet"))
    
    # New color for ResNet-50
    decision_makers.append(DecisionMaker(name_pattern="resnet50",
                           color=rgb(120, 130, 190), marker="o", df=df,
                           plotting_name="ResNet-50"))
    
    decision_makers.append(DecisionMaker(name_pattern="bagnet33",
                           color=rgb(110, 110, 110), marker="o", df=df,
                           plotting_name="BagNet-33"))
    
    decision_makers.append(DecisionMaker(name_pattern="simclr_resnet50x1",
                           color=rgb(210, 150, 0), marker="o", df=df,
                           plotting_name="SimCLR-x1"))
    
    # New color for ViT-B-16 (assigned a greenish hue)
    decision_makers.append(DecisionMaker(name_pattern="vit_b_16",
                           color=rgb(0, 180, 100), marker="o", df=df,
                           plotting_name="ViT-B-16"))
    
    # New color for ConvNeXt-Large (assigned a purple hue)
    decision_makers.append(DecisionMaker(name_pattern="convnext_large",
                           color=rgb(150, 60, 200), marker="o", df=df,
                           plotting_name="ConvNeXt-Large"))

    decision_makers.append(DecisionMaker(name_pattern="humans",
                           color=rgb(165, 30, 55), marker="D", df=df, markersize=10,
                           plotting_name="Humans"))
    
    return decision_makers

In [None]:
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlotConfig:
    """
    Plotting parameters
    """
    ylabel: str
    title: str
    xlabel: Optional[str] = None
    xlabel_fontsize: int = 16   
    xlabel_labelpad: int = 10
    ylabel_fontsize: int = 16   
    ylabel_labelpad: int = 10
    title_fontsize: int = 20
    title_pad: int = 10
    xticks_fontsize: int = 14
    yticks_fontsize: int = 14
    xlim: Optional[List[float]] = None  # None lets lineplot pad the automatic x-limits
    ylim: List[float] = field(default_factory=list)
    chance: Optional[float] = None
    chance_label: Optional[str] = None
    
    def __post_init__(self):
        assert True, "How can we go wrong?"

accuracy_plot_cfg = PlotConfig(ylabel="Classification Accuracy", 
                               xlabel_fontsize=16, xlabel_labelpad=10,
                               ylabel_fontsize=16, ylabel_labelpad=10,
                               title="Accuracy", title_fontsize=20, title_pad=10,
                               xticks_fontsize=14, yticks_fontsize=14,
                               xlim=[-.15, 1.15], ylim=[0,1.0],
                               chance=1/16, chance_label=None)

plot_cfg = accuracy_plot_cfg
plot_cfg

In [None]:
accuracy_plot_cfg = PlotConfig(ylabel="Classification Accuracy", 
                               xlabel_fontsize=16, xlabel_labelpad=10,
                               ylabel_fontsize=16, ylabel_labelpad=10,
                               title="Accuracy", title_fontsize=20, title_pad=10,
                               xticks_fontsize=14, yticks_fontsize=14,
                               xlim=None, ylim=[0,1.0],
                               chance=1/16, chance_label=None)
plot_cfg = accuracy_plot_cfg
plot_cfg

In [None]:
# models = ["alexnet", "resnet50", "bagnet33", "simclr_resnet50x1", "vit_b_16", "convnext_large"]

In [None]:
decision_maker_fun = plotting_definition_template
decision_makers = decision_maker_fun(acc_df)
decision_makers

In [None]:
experiment = experiments.__dict__[f'{dataset.replace("-","_")}_experiment']
experiment, experiment.plotting_conditions

In [None]:
PLOTTING_EDGE_COLOR = (0.3, 0.3, 0.3, 0.3)
PLOTTING_EDGE_WIDTH = 0.02

def lineplot(df, decision_makers, experiment, plot_cfg):
    plt.figure(figsize=(6, 6))

    for decision_maker in decision_makers:
        result_list = [df[(df.subj==decision_maker.name_pattern[0]) & (df.condition==cond)].iloc[0].is_correct for cond in experiment.data_conditions]

        plt.plot(experiment.plotting_conditions, result_list,
                 marker=decision_maker.marker, color=decision_maker.color,
                 markersize=decision_maker.markersize, linewidth=decision_maker.linewidth,
                 markeredgecolor=PLOTTING_EDGE_COLOR,
                 markeredgewidth=PLOTTING_EDGE_WIDTH, label=decision_maker.plotting_name)

    # Add the chance line if plot_cfg.chance is not None
    if plot_cfg.chance is not None:
        plt.axhline(y=plot_cfg.chance, color='gray', linestyle='--', linewidth=1)
        # Add text "chance" in italics, centered just above the line
        if plot_cfg.chance_label is not None:
            x_center = 0.5 * (plt.xlim()[0] + plt.xlim()[1])  # Midpoint of the x-axis (ax is not defined yet at this point)
            plt.text(x=x_center, y=plot_cfg.chance + 0.02, s='chance', color='gray', fontsize=12, style='italic',
                     horizontalalignment='center')

    # Add the legend and place it outside to the right
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)

    ax = plt.gca()
    ax.set_ylim(plot_cfg.ylim);
    plt.yticks(fontsize=plot_cfg.yticks_fontsize);

    if plot_cfg.xlim is not None:
        ax.set_xlim(plot_cfg.xlim)
    else:
        xlim = ax.get_xlim()
        ax.set_xlim([xlim[0]-.15, xlim[1]+.15])

    plt.xticks(fontsize=plot_cfg.xticks_fontsize);

    ax.set_title(plot_cfg.title, fontsize=plot_cfg.title_fontsize, pad=plot_cfg.title_pad)

    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.set_xlabel(experiment.xlabel, fontsize=plot_cfg.xlabel_fontsize, labelpad=plot_cfg.xlabel_labelpad);
    ax.set_ylabel(plot_cfg.ylabel, fontsize=plot_cfg.ylabel_fontsize, labelpad=plot_cfg.ylabel_labelpad);
    
    return ax

In [None]:
ax = lineplot(acc_df, decision_makers, experiment, plot_cfg);

# modelvshuman Error Consistency

In [None]:
%config InlineBackend.figure_format='retina'

In [None]:
from modelvshuman_dmc import constants as c
from modelvshuman_dmc.analysis import data
from modelvshuman_dmc.datasets import experiments

In [None]:
import pandas as pd
from modelvshuman_dmc import constants as c

def get_modelvshuman_error_consistency(collection, datasets):
    if isinstance(datasets, str): datasets = [datasets]
    results = []
    for dataset in datasets:
        data_dir = f"{c.RESULTS_DIR}/modelvshuman_pairwise_error_consistency/{collection}/{dataset}"
        filename = os.path.join(data_dir, f'{collection}_set_modelvshuman_pairwise_error_consistency_{dataset}_summary.csv')
        df = pd.read_csv(filename)
        df.rename(columns=dict(sub1='subj', dataset='dataset_name'), inplace=True)
        df['condition'] = df['condition'].astype(str)
        results.append(df)
    results = pd.concat(results)
    return results

def get_modelvsmodel_error_consistency(collection, datasets):
    if isinstance(datasets, str): datasets = [datasets]
    results = []
    for dataset in datasets:
        data_dir = f"{c.RESULTS_DIR}/modelvsmodel_pairwise_error_consistency/{collection}/{dataset}"
        filename = os.path.join(data_dir, f'{collection}_set_modelvsmodel_pairwise_error_consistency_{dataset}_summary.csv')
        df = pd.read_csv(filename)
        df.rename(columns=dict(sub1='subj', dataset='dataset_name'), inplace=True)
        df['condition'] = df['condition'].astype(str)
        results.append(df)
    results = pd.concat(results)
    return results

def get_humanvshuman_error_consistency(datasets):
    if isinstance(datasets, str): datasets = [datasets]
    results = []
    for dataset in datasets:
        data_dir = f"{c.RESULTS_DIR}/humanvshuman_error_consistency/{dataset}"
        filename = os.path.join(data_dir, f'humanvshuman_error_consistency_{dataset}_summary.csv')
        df = pd.read_csv(filename)
        df.rename(columns=dict(sub1='subj', dataset='dataset_name'), inplace=True)
        df['condition'] = df['condition'].astype(str)
        results.append(df)
    results = pd.concat(results)
    return results

In [None]:
def get_modelvshuman_dmc(collection, datasets):
    if isinstance(datasets, str): datasets = [datasets]
    results = []
    for dataset in datasets:
        data_dir = f"{c.RESULTS_DIR}/modelvshuman_decision_margin_consistency/{collection}/{dataset}"
        filename = os.path.join(data_dir, f'{collection}_set_modelvshuman_pairwise_decision_margin_consistency_{dataset}.csv')
        df = pd.read_csv(filename)
        df.rename(columns=dict(model_name='subj', dataset='dataset_name'), inplace=True)
        df['condition'] = df['condition'].astype(str)
        results.append(df)
    results = pd.concat(results)
    return results

def get_modelvsmodel_dmc(collection, datasets):
    if isinstance(datasets, str): datasets = [datasets]
    results = []
    for dataset in datasets:
        data_dir = f"{c.RESULTS_DIR}/modelvsmodel_pairwise_decision_margin_consistency/{collection}/{dataset}"
        filename = os.path.join(data_dir, f'{collection}_set_modelvsmodel_pairwise_decision_margin_consistency_{dataset}_summary.csv')
        df = pd.read_csv(filename)
        df.rename(columns=dict(sub1='subj', dataset='dataset_name', avg_correlation="dmc_avg", avg_corr_lower_ci="dmc_lower_ci", avg_corr_upper_ci="dmc_upper_ci"), inplace=True)
        df['condition'] = df['condition'].astype(str)
        results.append(df)
    results = pd.concat(results)
    return results

def get_humanvshuman_dmc(datasets):
    if isinstance(datasets, str): datasets = [datasets]
    results = []
    for dataset in datasets:
        data_dir = f"{c.RESULTS_DIR}/humanvshuman_splithalves_noise_ceiling/{dataset}"
        filename = os.path.join(data_dir, f'humanvshuman_splithalves_noise_ceiling_{dataset}_summary.csv')
        df = pd.read_csv(filename)
        df.rename(columns=dict(sub1='subj', dataset='dataset_name', adj_corr_mean="adj_corr_avg"), inplace=True)
        df['condition'] = df['condition'].astype(str)
        results.append(df)
    results = pd.concat(results)
    return results

In [None]:
ls {c.RESULTS_DIR}/modelvshuman_decision_margin_consistency/demo/edge

In [None]:
from modelvshuman_dmc import constants as c
from modelvshuman_dmc.plotting.colors import *
from modelvshuman_dmc.plotting.decision_makers import DecisionMaker

def plotting_definition_demo_models(df):
    """Decision makers to compare a few models with human observers.

    This exemplary definition can be adapted for the
    desired purpose, e.g., by adding more/different models.

    Note that models will need to be evaluated first, before
    their data can be plotted.

    For each model, define:
    - a color using rgb(42, 42, 42)
    - a plotting symbol by setting marker;
      a list of markers can be found here:
      https://matplotlib.org/3.1.0/api/markers_api.html
    """

    decision_makers = []

    # Assign the blue color to alexnet
    decision_makers.append(DecisionMaker(name_pattern="alexnet",
                           color=rgb(65, 90, 140), marker="o", df=df,
                           plotting_name="AlexNet"))
    
    # New color for ResNet-50
    decision_makers.append(DecisionMaker(name_pattern="resnet50",
                           color=rgb(120, 130, 190), marker="o", df=df,
                           plotting_name="ResNet-50"))
    
    decision_makers.append(DecisionMaker(name_pattern="bagnet33",
                           color=rgb(110, 110, 110), marker="o", df=df,
                           plotting_name="BagNet-33"))
    
    decision_makers.append(DecisionMaker(name_pattern="simclr_resnet50x1",
                           color=rgb(210, 150, 0), marker="o", df=df,
                           plotting_name="SimCLR-x1"))
    
    # New color for ViT-B-16 (assigned a greenish hue)
    decision_makers.append(DecisionMaker(name_pattern="vit_b_16",
                           color=rgb(0, 180, 100), marker="o", df=df,
                           plotting_name="ViT-B-16"))
    
    # New color for ConvNeXt-Large (assigned a purple hue)
    decision_makers.append(DecisionMaker(name_pattern="convnext_large",
                           color=rgb(150, 60, 200), marker="o", df=df,
                           plotting_name="ConvNeXt-Large"))

    # decision_makers.append(DecisionMaker(name_pattern="humans",
    #                        color=rgb(165, 30, 55), marker="D", df=df, markersize=10,
    #                        plotting_name="Humans"))
    
    return decision_makers


In [None]:
ls {c.RESULTS_DIR}/humanvshuman_error_consistency/{dataset}

In [None]:
ls {c.RESULTS_DIR}/modelvsmodel_pairwise_error_consistency

In [None]:
ls {c.RESULTS_DIR}/modelvshuman_pairwise_error_consistency/demo/edge

In [None]:
models = ["alexnet", "resnet50", "bagnet33", "simclr_resnet50x1", "vit_b_16", "convnext_large"]
collection = "demo"
dataset = "silhouette"

In [None]:
mvm_error = get_modelvsmodel_error_consistency(collection, datasets=[dataset])
mvm_error

In [None]:
hvh_error = get_humanvshuman_error_consistency(datasets=[dataset])
hvh_error

In [None]:
error_df = get_modelvshuman_error_consistency(collection, datasets=[dataset])
error_df

In [None]:
acc_df = get_model_accuracy(models, datasets=[dataset])
acc_df

In [None]:
merged_df = pd.merge(acc_df, error_df, on=['dataset_name', 'condition', 'subj'], how='left')
assert len(merged_df)==len(acc_df) and len(merged_df)==len(error_df), "Merge error"
merged_df

In [None]:
decision_maker_fun = plotting_definition_demo_models
decision_makers = decision_maker_fun(acc_df)
decision_makers

In [None]:
# experiment = experiments.__dict__[f'{dataset.replace("-","_")}_experiment']
# experiment, experiment.plotting_conditions
# df

In [None]:
df = merged_df.copy()
m = "alexnet"
xs = [df[df.subj==m].iloc[0].is_correct for m in models]
ys = [df[df.subj==m].iloc[0].error_consistency_avg for m in models]
xs, ys

In [None]:
mvh_error_con_plot_cfg = PlotConfig(xlabel='Classification Accuracy (16-way, top-1)',
                                    ylabel=r'Error consistency ($\kappa$)',
                                    xlabel_fontsize=16, xlabel_labelpad=14,
                                    ylabel_fontsize=16, ylabel_labelpad=10,
                                    title="Error Consistency", title_fontsize=20, title_pad=10,
                                    xticks_fontsize=14, yticks_fontsize=14,
                                    xlim=[-.15, 1.15], ylim=[-.1,1.1],
                                    chance=1/16, chance_label=None)

plot_cfg = mvh_error_con_plot_cfg
plot_cfg

In [None]:
PLOTTING_EDGE_COLOR = (0.3, 0.3, 0.3, 0.3)
PLOTTING_EDGE_WIDTH = 0.02

plt.figure(figsize=(8,6))

ys = []
for decision_maker in decision_makers:
    model_name = decision_maker.name_pattern[0]
    subset = df[df.subj==model_name]
    assert len(subset)==1, f"Expected one row for {model_name}, found {len(subset)}"
    
    # Get values for plotting
    x = subset.iloc[0].is_correct
    y = subset.iloc[0].error_consistency_avg
    ys.append(y)
    yerr_lower = y - subset.iloc[0].error_consistency_lower_ci
    yerr_upper = subset.iloc[0].error_consistency_upper_ci - y
    
    # Plot the scatter point
    plt.scatter(x, y,
                marker=decision_maker.marker, color=decision_maker.color,
                s=decision_maker.markersize**2, edgecolors=PLOTTING_EDGE_COLOR,
                linewidths=PLOTTING_EDGE_WIDTH, label=decision_maker.plotting_name)
    
    # Add error bars
    plt.errorbar(x, y, yerr=[[yerr_lower], [yerr_upper]], fmt='none', ecolor=decision_maker.color, elinewidth=1, capsize=3)

# Add the legend and place it outside to the right
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
    
ax = plt.gca()
ax.set_ylim(plot_cfg.ylim);
plt.yticks(fontsize=plot_cfg.yticks_fontsize);

if plot_cfg.xlim is not None:
    ax.set_xlim(plot_cfg.xlim)
else:
    xlim = ax.get_xlim()
    ax.set_xlim([xlim[0]-.15, xlim[1]+.15])

plt.xticks(fontsize=plot_cfg.xticks_fontsize);

ax.set_title(plot_cfg.title, fontsize=plot_cfg.title_fontsize, pad=plot_cfg.title_pad)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlabel(plot_cfg.xlabel, fontsize=plot_cfg.xlabel_fontsize, labelpad=plot_cfg.xlabel_labelpad);
ax.set_ylabel(plot_cfg.ylabel, fontsize=plot_cfg.ylabel_fontsize, labelpad=plot_cfg.ylabel_labelpad);    

# Plot the average model-vs-human line
y_avg = sum(ys) / len(ys)
plt.axhline(y=y_avg, color='gray', linestyle='--', linewidth=1);

# Add label "DNN vs. Human" below the line
x_center = 0.5 * (plt.xlim()[0] + plt.xlim()[1])  # Calculate the midpoint of the x-axis
plt.text(x=x_center, y=y_avg - 0.07, s='DNN vs. Human', color='gray', fontsize=11, style='italic',
         horizontalalignment='center');

# Reference line for "DNN vs DNN"
dnn_vs_dnn_avg = mvm_error.iloc[0].error_consistency_avg
dnn_vs_dnn_lower = mvm_error.iloc[0].error_consistency_lower_ci
dnn_vs_dnn_upper = mvm_error.iloc[0].error_consistency_upper_ci

# Plot the dashed line for "DNN vs DNN"
plt.axhline(y=dnn_vs_dnn_avg, color=(125/255, 87/255, 54/255), linestyle='--', linewidth=1)

# Plot the shaded region between the confidence intervals
plt.fill_between([plt.xlim()[0], plt.xlim()[1]], dnn_vs_dnn_lower, dnn_vs_dnn_upper,
                 color=(246/255, 241/255, 233/255), alpha=0.5)

# Add label "DNN vs DNN" above the line
plt.text(x=x_center, y=dnn_vs_dnn_avg + 0.02, s='DNN vs DNN', color=(125/255, 87/255, 54/255), fontsize=11, style='italic',
         horizontalalignment='center');

# Plot the dashed line for Human vs. Human
hvh_avg = hvh_error.iloc[0].error_consistency_avg
hvh_lower = hvh_error.iloc[0].error_consistency_lower_ci
hvh_upper = hvh_error.iloc[0].error_consistency_upper_ci

plt.axhline(y=hvh_avg, color=(145/255, 14/255, 42/255), linestyle='--', linewidth=1)

# Plot the shaded region between the confidence intervals
plt.fill_between([plt.xlim()[0], plt.xlim()[1]], hvh_lower, hvh_upper,
                 color=(145/255, 14/255, 42/255), alpha=0.15)

# Add label "Human vs Human" just below the line
plt.text(x=x_center, y=hvh_avg - 0.035, s='Human vs Human', color=(145/255, 14/255, 42/255), fontsize=11, style='italic',
         horizontalalignment='center');

# hvh decision_margin_consistency

In [None]:
models = ["alexnet", "resnet50", "bagnet33", "simclr_resnet50x1", "vit_b_16", "convnext_large"]
collection = "demo"
# dataset = "cue-conflict"  # alternative dataset; uncomment to switch
dataset = "silhouette"

acc_df = get_model_accuracy(models, datasets=[dataset])
mvh_dmc = get_modelvshuman_dmc(collection, dataset)
mvm_dmc = get_modelvsmodel_dmc(collection, dataset)
hvh_dmc = get_humanvshuman_dmc(dataset)
mvh_dmc

In [None]:
mvh_dmc

In [None]:
acc_df

In [None]:
merged_df = pd.merge(acc_df, mvh_dmc, on=['dataset_name', 'condition', 'subj'], how='left')
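# A left merge should keep exactly one row per accuracy row; compare against mvh_dmc (not the stale error_df)
assert len(merged_df)==len(acc_df) and len(merged_df)==len(mvh_dmc), "Merge error"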
df = merged_df.copy()
df

In [None]:
mvh_dmc_plot_cfg = PlotConfig(xlabel='Classification Accuracy (16-way, top-1)',
                              ylabel='Decision-margin consistency',
                              xlabel_fontsize=16, xlabel_labelpad=14,
                              ylabel_fontsize=16, ylabel_labelpad=10,
                              title="Decision-margin consistency", title_fontsize=20, title_pad=10,
                              xticks_fontsize=14, yticks_fontsize=14,
                              xlim=[-.15, 1.15], ylim=[-.1, 1.1],
                              chance=1/16, chance_label=None)

plot_cfg = mvh_dmc_plot_cfg
plot_cfg

In [None]:
hvh = hvh_dmc
mvm = mvm_dmc
hvh_metric = 'adj_corr'
mvm_metric = "dmc"
metric = "decision_margin_consistency"
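
The `adj_corr` column used for the human-vs-human reference is presumably the Spearman-Brown-adjusted split-half correlation described at the top of this notebook. As a rough illustration only (not the code behind `get_humanvshuman_dmc`), the procedure could be sketched as below. The function names and the toy `correct` dictionary are hypothetical, and the sketch assumes every half has some variance in item accuracy, since the correlation is undefined otherwise.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_brown(r_half):
    """Project a split-half correlation to the reliability of the full sample."""
    return 2 * r_half / (1 + r_half)


def group_accuracy_consistency(correct):
    """Average adjusted split-half correlation over all half-splits of subjects.

    `correct` maps a subject id to a list of 0/1 scores, one per item,
    with items in the same order for every subject (hypothetical layout).
    """
    subjects = sorted(correct)
    half = len(subjects) // 2
    scores = []
    # With an even number of subjects each split is visited twice (A|B and B|A);
    # this does not change the mean.
    for group_a in combinations(subjects, half):
        group_b = [s for s in subjects if s not in group_a]
        # Per-item accuracy within each half
        acc_a = [mean(col) for col in zip(*(correct[s] for s in group_a))]
        acc_b = [mean(col) for col in zip(*(correct[s] for s in group_b))]
        scores.append(spearman_brown(pearson(acc_a, acc_b)))
    return mean(scores)


# Toy data: 4 subjects x 4 items (made up for illustration, not real results)
toy = {
    "s1": [1, 1, 0, 0],
    "s2": [1, 1, 0, 1],
    "s3": [1, 0, 0, 0],
    "s4": [1, 1, 1, 0],
}
toy_score = group_accuracy_consistency(toy)
print(f"group-accuracy-consistency (toy): {toy_score:.3f}")
```

Enumerating every split is only feasible for small groups; with more subjects, averaging over a random sample of splits would be the usual substitute.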

In [None]:
df

In [None]:
PLOTTING_EDGE_COLOR = (0.3, 0.3, 0.3, 0.3)
PLOTTING_EDGE_WIDTH = 0.02

plt.figure(figsize=(8,6))

ys = []
for decision_maker in decision_makers:
    model_name = decision_maker.name_pattern[0]
    subset = df[df.subj==model_name]
    assert len(subset)==1, f"Expected exactly one row for {model_name}, found {len(subset)}"
    
    # Get values for plotting
    x = subset.iloc[0].is_correct
    y = subset.iloc[0][metric]
    ys.append(y)
    # yerr_lower = y - subset.iloc[0].error_consistency_lower_ci
    # yerr_upper = subset.iloc[0].error_consistency_upper_ci - y
    
    # Plot the scatter point
    plt.scatter(x, y,
                marker=decision_maker.marker, color=decision_maker.color,
                s=decision_maker.markersize**2, edgecolors=PLOTTING_EDGE_COLOR,
                linewidths=PLOTTING_EDGE_WIDTH, label=decision_maker.plotting_name)
    
    # Add error bars
    # plt.errorbar(x, y, yerr=[[yerr_lower], [yerr_upper]], fmt='none', ecolor=decision_maker.color, elinewidth=1, capsize=3)

# Add the legend and place it outside to the right
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
    
ax = plt.gca()
ax.set_ylim(plot_cfg.ylim);
plt.yticks(fontsize=plot_cfg.yticks_fontsize);

if plot_cfg.xlim is not None:
    ax.set_xlim(plot_cfg.xlim)
else:
    xlim = ax.get_xlim()
    ax.set_xlim([xlim[0]-.15, xlim[1]+.15])

plt.xticks(fontsize=plot_cfg.xticks_fontsize);

ax.set_title(plot_cfg.title, fontsize=plot_cfg.title_fontsize, pad=plot_cfg.title_pad)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlabel(plot_cfg.xlabel, fontsize=plot_cfg.xlabel_fontsize, labelpad=plot_cfg.xlabel_labelpad);
ax.set_ylabel(plot_cfg.ylabel, fontsize=plot_cfg.ylabel_fontsize, labelpad=plot_cfg.ylabel_labelpad);    

# Plot the average model-vs-human line
y_avg = sum(ys) / len(ys)
plt.axhline(y=y_avg, color='gray', linestyle='--', linewidth=1);

# Add label "DNN vs. Human" just below the line
x_center = 0.5 * (plt.xlim()[0] + plt.xlim()[1])  # Calculate the midpoint of the x-axis
plt.text(x=x_center, y=y_avg - 0.05, s='DNN vs. Human', color='gray', fontsize=11, style='italic',
         horizontalalignment='center');

# Reference line for "DNN vs DNN"
dnn_vs_dnn_avg = mvm.iloc[0][f"{mvm_metric}_avg"]
dnn_vs_dnn_lower = mvm.iloc[0][f"{mvm_metric}_lower_ci"]
dnn_vs_dnn_upper = mvm.iloc[0][f"{mvm_metric}_upper_ci"]

# Plot the dashed line for "DNN vs DNN"
plt.axhline(y=dnn_vs_dnn_avg, color=(125/255, 87/255, 54/255), linestyle='--', linewidth=1)

# Plot the shaded region between the confidence intervals
plt.fill_between([plt.xlim()[0], plt.xlim()[1]], dnn_vs_dnn_lower, dnn_vs_dnn_upper,
                 color=(246/255, 241/255, 233/255), alpha=0.5)

# Add label "DNN vs DNN" above the line
plt.text(x=x_center, y=dnn_vs_dnn_avg + 0.015, s='DNN vs DNN', color=(125/255, 87/255, 54/255), fontsize=11, style='italic',
         horizontalalignment='center');

# Plot the dashed line for Human vs. Human
hvh_avg = hvh.iloc[0][f'{hvh_metric}_avg']
hvh_lower = hvh.iloc[0][f'{hvh_metric}_lower_ci']
hvh_upper = hvh.iloc[0][f'{hvh_metric}_upper_ci']

plt.axhline(y=hvh_avg, color=(145/255, 14/255, 42/255), linestyle='--', linewidth=1)

# Plot the shaded region between the confidence intervals
plt.fill_between([plt.xlim()[0], plt.xlim()[1]], hvh_lower, hvh_upper,
                 color=(145/255, 14/255, 42/255), alpha=0.15)

# Add label "Human vs Human" above the line
plt.text(x=x_center, y=hvh_avg + 0.03, s='Human vs Human', color=(145/255, 14/255, 42/255), fontsize=11, style='italic',
         horizontalalignment='center');