# Results notebook

**Workflow:**
1. Specify which version of results to load
    - One option is a folder such as `results/official/V0/test-27-01-25` containing `metadata.json`, `results.json`, and one file per model: `{model_name_1}.json`, ..., `{model_name_n}.json`
    - Proposed contents (open to discussion):
        -  `results.json`: the main benchmark results (e.g., a models x experiments table)
        -  `{model_name_i}.json`: the detailed Yes/No answers for every sample of each experiment, for in-depth analysis. Confidence scores would also be useful here.
        -  `metadata.json`: to be discussed; anything from just the date of the experiments to the complete list of hyperparameters of every experiment
2. Load `results.json` and print a summary of the results; later on we might also explore radar plots or similar visualizations to break the results down by skill
3. Quantitative section: a more in-depth analysis of the benchmark, which can also guide which images to inspect in the qualitative analysis
4. Qualitative section: pick either one model and look at all experiments, or one experiment and look at all models; visualize confidence scores, etc.
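
A minimal sketch of how the proposed folder layout could be loaded in one call. The helper name `load_results_folder` is hypothetical, and the sketch assumes that every JSON file other than `results.json` and `metadata.json` is a per-model file whose name matches the model key in `results.json`. The demo values are made up.

```python
import json
import tempfile
from pathlib import Path


def load_results_folder(folder):
    """Load results.json, metadata.json (if present) and every per-model file."""
    folder = Path(folder)
    with open(folder / "results.json") as f:
        summary = json.load(f)

    metadata_path = folder / "metadata.json"
    metadata = json.loads(metadata_path.read_text()) if metadata_path.exists() else {}

    # Assumption: every other *.json file holds the detailed answers of one model,
    # and its stem is the model name used as a key in results.json.
    reserved = {"results.json", "metadata.json"}
    detailed = {}
    for path in sorted(folder.glob("*.json")):
        if path.name not in reserved:
            detailed[path.stem] = json.loads(path.read_text())
    return summary, metadata, detailed


# Tiny self-contained demo with made-up values (not real benchmark numbers).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "results.json").write_text(json.dumps({"model_a": {"exp_1": {"positive": 0.5}}}))
    (tmp / "metadata.json").write_text(json.dumps({"date": "27-01-25"}))
    (tmp / "model_a.json").write_text(json.dumps({"exp_1": {"q0": {"correct": True}}}))
    summary, metadata, detailed = load_results_folder(tmp)

print(sorted(detailed))
```

Keeping the per-model files separate from `results.json` means the summary can be printed without reading the much larger detailed answers.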

### Some nice-to-have features:

- [x] Ability to manually inspect *n* images with questions where the model was wrong for every part of the benchmark  
- [x] Visualize model confidence in the answers  
- [ ] Show answers from different models for the same image  
- [ ] Compare correlation of fails between different models (e.g., do they have the same weaknesses?)  
- [x] Perform all of the above without needing to load and interact with the models
- [ ] Print table(s) of results to LaTeX format for easy copy-paste  
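
One possible starting point for the unchecked "correlation of fails" item is the Jaccard overlap between the sets of questions each model got wrong. This is only a sketch: `failure_overlap` is a hypothetical helper, the input structure is an assumption (it could be built from the `correct` field of each `{model_name}.json`), and Jaccard overlap is a stand-in for whatever correlation measure is eventually chosen.

```python
from itertools import combinations


def failure_overlap(correct_by_model):
    """Jaccard overlap of failed question IDs for every pair of models.

    correct_by_model: {model_name: {question_id: bool}} (assumed structure).
    """
    failures = {
        model: {qid for qid, ok in answers.items() if not ok}
        for model, answers in correct_by_model.items()
    }
    overlap = {}
    for a, b in combinations(sorted(failures), 2):
        union = failures[a] | failures[b]
        # Two models with no failures at all have nothing to compare.
        overlap[(a, b)] = len(failures[a] & failures[b]) / len(union) if union else float("nan")
    return overlap


# Made-up toy data, only to show the shape of the input and output.
toy = {
    "model_a": {"q1": True, "q2": False, "q3": False, "q4": True},
    "model_b": {"q1": True, "q2": False, "q3": True, "q4": False},
}
overlap = failure_overlap(toy)
print(overlap)
```

A value close to 1 would suggest that two models fail on the same questions; a value close to 0 would suggest different weaknesses.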

In [5]:
import os
import sys
import json
import datasets
import numpy as np
import pandas as pd
from PIL import Image
from collections import defaultdict

ROOT = os.path.dirname(os.path.dirname(os.path.abspath('.')))
print("ROOT", ROOT)

ROOT /Users/pietroferrazzi/Desktop/dottorato/AAA_progetti/codice/open-world-symbolic-planner


In [2]:
sys.path.append(ROOT)

## Summary of results

In [None]:
#results_folder = os.path.join(ROOT, 'results/official/V1/pope-bias-15-02-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-vg-18-02-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-oi-cleaned-24-02-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-oi-cleaned-26-02-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-27-02-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworldv2-28-02-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-3-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld_precond-3-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld_precond_cot-10-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-72-blocksworld_precond-10-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-column-labels-10-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-oi-cot-10-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-column-labels_shuffled-10-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-column-labels_symbolic-10-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-column-labels-v2-11-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-column-labels-shuffled-v2-11-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-column-labels-symbolic-v2-11-03-25')
#results_folder = os.path.join(ROOT, 'results/official/V1/llava-qwen-blocksworld-detailed-prompt-12-03-25')
results_folder = os.path.join(ROOT, 'results/official/V1/blocksword-aya-mistral-qwen-llava-deepseek-molmo-14-04-25')

main_results_path = os.path.join(results_folder, 'results.json')

In [7]:
# Load main results
with open(main_results_path, "r") as f:
    main_results = json.load(f)

# Print as a table
header = f"{'Model':<20} {'Dataset':<25} {'Adversarial':<12} {'Popular':<12} {'Random':<12} {'Positive':<12} {'Negative':<12}"
print(header)
print("-" * len(header))

for model, experiments in main_results.items():
    for exp, metrics in experiments.items():
        if 'block' not in exp:  # only show the Blocksworld experiments
            continue
        for k in metrics:
            if metrics[k] is None:
                metrics[k] = np.nan
        if ('pope' in exp) or ('bias' in exp):
            row = f"{model:<20} {exp:<25} {metrics['adversarial']:<12.2f} {metrics['popular']:<12.2f} {metrics['random']:<12.2f} {metrics['positive']:<12.2f} {'-':<12}"
        else:
            if 'positive' not in metrics:
                continue # blocksworld precondition effects has different metrics, print later
            row = f"{model:<20} {exp:<25} {'-':<12} {'-':<12} {'-':<12} {metrics['positive']:<12.2f} {metrics['negative']:<12.2f}"
        print(row)

Model                Dataset                   Adversarial  Popular      Random       Positive     Negative    
---------------------------------------------------------------------------------------------------------------
Llava OneVision      blocksworld               -            -            -            0.70         0.76        
Qwen2-VL             blocksworld               -            -            -            0.57         0.81        
Qwen2.5-VL           blocksworld               -            -            -            0.65         0.92        
DeepSeek VL2 tiny    blocksworld               -            -            -            0.21         0.94        
DeepSeek VL2         blocksworld               -            -            -            0.25         0.95        
Mistral Small 3.1    blocksworld               -            -            -            0.65         0.84        
Aya-vision 8B        blocksworld               -            -            -            0.57         0.65 
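
The LaTeX export listed among the nice-to-have features could start from the same nested `{model: {experiment: {metric: value}}}` dictionary. The sketch below uses plain string formatting; `results_to_latex` is a hypothetical helper, and the two example rows are copied from the table printed above. Model names containing LaTeX special characters (such as `_`) would still need escaping.

```python
def results_to_latex(results, experiment, metric_names):
    """Render one experiment of a {model: {experiment: {metric: value}}} dict as a LaTeX tabular."""
    lines = [
        "\\begin{tabular}{l" + "r" * len(metric_names) + "}",
        "\\hline",
        " & ".join(["Model"] + [name.capitalize() for name in metric_names]) + " \\\\",
        "\\hline",
    ]
    for model, experiments in results.items():
        if experiment not in experiments:
            continue  # this model was not evaluated on the experiment
        metrics = experiments[experiment]
        cells = []
        for name in metric_names:
            value = metrics.get(name)
            cells.append("-" if value is None else f"{value:.2f}")
        lines.append(" & ".join([model] + cells) + " \\\\")
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines)


# Two rows copied from the printed summary table.
example = {
    "Llava OneVision": {"blocksworld": {"positive": 0.70, "negative": 0.76}},
    "Qwen2-VL": {"blocksworld": {"positive": 0.57, "negative": 0.81}},
}
latex = results_to_latex(example, "blocksworld", ["positive", "negative"])
print(latex)
```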

In [9]:
header = f"{'Model':<20} {'Metric':<40} {'Strict':<15} {'Precond sat':<12} {'Precond unsat':<13} {'Effect changed':<14} {'Effect unchanged':<12}"

print(header)
print("-" * len(header))

for model, experiments in main_results.items():
    for exp, metrics in experiments.items():
        if 'precondition_effect' not in exp:
            continue
        for k in metrics:
            if metrics[k] is None:
                metrics[k] = np.nan
        row = f"{model:<20} {exp:<40} {'Non-strict':<15} {metrics['precond_satisfied']:<12.2f} {metrics['precond_unsatisfied']:<13.2f} {metrics['effect_changed']:<14.2f} {metrics['effect_unchanged']:<12.2f}"
        print(row)
        row_strict = f"{model:<20} {exp:<40} {'Strict':<15} {metrics['precond_satisfied_strict']:<12.2f} {metrics['precond_unsatisfied_strict']:<13.2f} {metrics['effect_changed_strict']:<14.2f} {metrics['effect_unchanged_strict']:<12.2f}"
        print(row_strict)


Model                Metric                                   Strict          Precond sat  Precond unsat Effect changed Effect unchanged
----------------------------------------------------------------------------------------------------------------------------------------
Llava OneVision      blocksworld_precondition_effect          Non-strict      0.66         0.72          0.77           0.68        
Llava OneVision      blocksworld_precondition_effect          Strict          0.39         0.49          0.46           0.30        
Qwen2-VL             blocksworld_precondition_effect          Non-strict      0.65         0.66          0.69           0.66        
Qwen2-VL             blocksworld_precondition_effect          Strict          0.37         0.43          0.31           0.28        
Qwen2.5-VL           blocksworld_precondition_effect          Non-strict      0.63         0.79          0.79           0.78        
Qwen2.5-VL           blocksworld_precondition_effect         

## Quantitative analysis

### Blocksworld

In [8]:
%pip install tabulate

Note: you may need to restart the kernel to use updated packages.


In [None]:
results_per_predicate = {model: defaultdict(dict) for model in main_results.keys()}
for model, experiments in main_results.items():
    if 'blocksworld' not in experiments:
        continue
    detailed_results_path = os.path.join(results_folder, f'{model}.json')
    with open(detailed_results_path, "r") as f:
        detailed_results = json.load(f)
    for question_id, data in detailed_results['blocksworld'].items():
        predicate = data['metadata']['predicate']
        if predicate not in results_per_predicate[model]:
            results_per_predicate[model][predicate] = {'positive': [], 'negative': []}
        results_per_predicate[model][predicate][data['split']].append(data['correct'])

dfs = []
for model, predicates in results_per_predicate.items():
    for predicate, splits in predicates.items():
        for split, correct in splits.items():
            if len(correct) == 0:
                continue
            acc = np.mean(correct)
            dfs.append(pd.DataFrame({'Model': model, 'Predicate': predicate, 'Split': split, 'Accuracy': acc}, index=[0]))
df = pd.concat(dfs)
df_positive = df[df['Split'] == 'positive']
df_negative = df[df['Split'] == 'negative']
df_positive = df_positive.pivot(index='Model', columns='Predicate', values='Accuracy')
df_negative = df_negative.pivot(index='Model', columns='Predicate', values='Accuracy')

print('Positive')
print(df_positive.to_markdown())
print()
print('Negative')
print(df_negative.to_markdown())

In [None]:
# Present the same data as two bar plots (one for the positive split, one for the negative split),
# with the predicate on the x-axis, the accuracy on the y-axis, and one color per model.
import matplotlib.pyplot as plt
import seaborn as sns

df_positive = df_positive.reset_index()
df_negative = df_negative.reset_index()

df_positive = df_positive.melt(id_vars='Model', var_name='Predicate', value_name='Accuracy')
df_negative = df_negative.melt(id_vars='Model', var_name='Predicate', value_name='Accuracy')

fig, ax = plt.subplots(1, 2, figsize=(20, 5))  # only `plt` is imported, not the name `matplotlib`
sns.barplot(data=df_positive, x='Predicate', y='Accuracy', hue='Model', ax=ax[0])
ax[0].set_title('Positive')
ax[0].set_ylim(0, 1)
ax[0].set_ylabel('Accuracy')
ax[0].set_xlabel('Predicate')
ax[0].legend(title='Model', loc='upper left')

sns.barplot(data=df_negative, x='Predicate', y='Accuracy', hue='Model', ax=ax[1])
ax[1].set_title('Negative')
ax[1].set_ylim(0, 1)
ax[1].set_ylabel('Accuracy')
ax[1].set_xlabel('Predicate')
ax[1].legend(title='Model', loc='upper left')

plt.show()

### Blocksworld preconditions and effects

In [11]:
results_per_predicate = {model: defaultdict(dict) for model in main_results.keys()}
for model, experiments in main_results.items():
    if 'blocksworld_precondition_effect' not in experiments:
        continue
    detailed_results_path = os.path.join(results_folder, f'{model}.json')
    with open(detailed_results_path, "r") as f:
        detailed_results = json.load(f)
    for question_id, data in detailed_results['blocksworld_precondition_effect'].items():
        predicate = data['metadata']['predicate']
        if predicate not in results_per_predicate[model]:
            results_per_predicate[model][predicate] = {'precond_satisfied': [], 'precond_unsatisfied': [], 'effect_changed': [], 'effect_unchanged': []}
        results_per_predicate[model][predicate][data['split']].append(data['correct'])

dfs = []
for model, predicates in results_per_predicate.items():
    for predicate, splits in predicates.items():
        for split, correct in splits.items():
            if len(correct) == 0:
                continue
            acc = np.mean(correct)
            dfs.append(pd.DataFrame({'Model': model, 'Predicate': predicate, 'Split': split, 'Accuracy': acc}, index=[0]))
df = pd.concat(dfs)
df_precond_satisfied = df[df['Split'] == 'precond_satisfied']
df_precond_unsatisfied = df[df['Split'] == 'precond_unsatisfied']
df_effect_changed = df[df['Split'] == 'effect_changed']
df_effect_unchanged = df[df['Split'] == 'effect_unchanged']

df_precond_satisfied = df_precond_satisfied.pivot(index='Model', columns='Predicate', values='Accuracy')
df_precond_unsatisfied = df_precond_unsatisfied.pivot(index='Model', columns='Predicate', values='Accuracy')
df_effect_changed = df_effect_changed.pivot(index='Model', columns='Predicate', values='Accuracy')
df_effect_unchanged = df_effect_unchanged.pivot(index='Model', columns='Predicate', values='Accuracy')

# Average accuracy per predicate, across all models and splits
print(df.groupby('Predicate')['Accuracy'].mean())

print('Precondition satisfied')
print(df_precond_satisfied.to_markdown())
print()

print('Precondition unsatisfied')
print(df_precond_unsatisfied.to_markdown())
print()

print('Effect changed')
print(df_effect_changed.to_markdown())
print()

print('Effect unchanged')
print(df_effect_unchanged.to_markdown())

Predicate
clear       0.673822
incolumn    0.746970
on          0.907865
Name: Accuracy, dtype: float64
Precondition satisfied
| Model           |    clear |   incolumn |
|:----------------|---------:|-----------:|
| Llava OneVision | 0.656566 |   0.656566 |
| Molmo           | 0.666667 |   0.676768 |
| Qwen2-VL        | 0.585859 |   0.707071 |
| Qwen2.5-VL      | 0.59596  |   0.656566 |
| Qwen2.5-VL 72B  | 0.909091 |   0.777778 |

Precondition unsatisfied
| Model           |    clear |   incolumn |
|:----------------|---------:|-----------:|
| Llava OneVision | 0.616162 |   0.828283 |
| Molmo           | 0.707071 |   0.777778 |
| Qwen2-VL        | 0.474747 |   0.848485 |
| Qwen2.5-VL      | 0.808081 |   0.777778 |
| Qwen2.5-VL 72B  | 0.939394 |   0.787879 |

Effect changed
| Model           |    clear |   incolumn |       on |
|:----------------|---------:|-----------:|---------:|
| Llava OneVision | 0.606742 |   0.777778 | 0.910112 |
| Molmo           | 0.662921 |   0.747475 | 0.8764

## Qualitative analysis

In [12]:
def inspect_examples(results, data_path, samples_per_split=5):
    """Show the first `samples_per_split` examples of every split, whether correct or not."""
    all_splits = np.unique([x['split'] for x in results.values()]).tolist()
    shown_per_split = {split: 0 for split in all_splits}

    for res in results.values():
        split = res['split']

        if shown_per_split[split] < samples_per_split:
            shown_per_split[split] += 1
            print(f"\n\nSplit: {split}")
            print(f"Q:{res['question']} - GT answer: {res['gt_answer']} - Model answer: {res['answer']}")
            image = Image.open(os.path.join(data_path, res['image_name']))
            image.show()

        # Stop as soon as every split has reached its quota
        if np.all([count >= samples_per_split for count in shown_per_split.values()]):
            break

In [14]:
def inspect_failures(results, data_path, samples_per_split=5):
    """Show the first `samples_per_split` wrongly answered examples of every split."""
    all_splits = np.unique([x['split'] for x in results.values()]).tolist()
    failures_per_split = {split:0 for split in all_splits}
    
    for res in results.values():
        split = res['split']
        
        if failures_per_split[split] < samples_per_split and not res['correct']:
            failures_per_split[split] += 1
            print(f"\n\nSplit: {split} ({failures_per_split[split]}/{samples_per_split}) - ImageID: {res['image_id']}")
            print(f"Q:{res['question']} - GT answer: {res['gt_answer']} - Model answer: {res['answer']}")
            image = Image.open(os.path.join(data_path, res['image_name']))
            image.show()
            
        if np.all([count >= samples_per_split for count in failures_per_split.values()]):
            break
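
For the "show answers from different models for the same image" item, one option is to regroup the per-model files by question. This is a sketch under assumptions: `answers_by_question` and `print_disagreements` are hypothetical helpers, and the question IDs are assumed to be shared across models for the same experiment. The field names (`image_name`, `question`, `gt_answer`, `answer`) are the ones used by the inspection functions above; the toy values are made up.

```python
def answers_by_question(detailed_by_model, experiment):
    """Group every model's answer under the question it belongs to.

    detailed_by_model: {model_name: parsed content of {model_name}.json}.
    """
    grouped = {}
    for model, detailed in detailed_by_model.items():
        for question_id, res in detailed.get(experiment, {}).items():
            entry = grouped.setdefault(question_id, {
                "image_name": res["image_name"],
                "question": res["question"],
                "gt_answer": res["gt_answer"],
                "answers": {},
            })
            entry["answers"][model] = res["answer"]
    return grouped


def print_disagreements(grouped):
    """Print only the questions on which the models gave different answers."""
    for question_id, entry in grouped.items():
        if len(set(entry["answers"].values())) > 1:
            print(f"{question_id} ({entry['image_name']}): {entry['question']} - GT: {entry['gt_answer']}")
            for model, answer in entry["answers"].items():
                print(f"    {model}: {answer}")


# Made-up toy data, only to show the expected structure.
question = {"image_name": "img_1.png", "question": "Is the red block clear?", "gt_answer": "yes"}
toy = {
    "model_a": {"exp_1": {"q1": {**question, "answer": "yes"}}},
    "model_b": {"exp_1": {"q1": {**question, "answer": "no"}}},
}
grouped = answers_by_question(toy, "exp_1")
print_disagreements(grouped)
```

Like the inspection functions, this works from the saved JSON files only, so no model needs to be loaded.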

### COCO

In [None]:
# Folder containing the COCO val2014 images (the HF dataset itself is not needed here)
#COCO_HF_PATH = "/scratch/cs/world-models/predicate_datasets/POPE/output/coco/hf_coco_pope_dataset" # not needed
COCO_DATA_PATH = "/scratch/cs/world-models/predicate_datasets/coco/val2014"
#dataset = datasets.load_from_disk(COCO_HF_PATH)['validation'] # not needed

In [None]:
model_to_inspect = 'Llava OneVision'
model_results_path = os.path.join(results_folder, model_to_inspect+'.json')
# Load specific results
with open(model_results_path, "r") as f:
    model_results = json.load(f)

In [None]:
inspect_examples(model_results['coco_pope'], COCO_DATA_PATH, samples_per_split=7)

In [None]:
inspect_failures(model_results['coco_pope'], COCO_DATA_PATH, samples_per_split=5)

### OpenImages

In [None]:
# Folder containing the OpenImages validation images (the HF dataset itself is not needed here)
#OI_HF_PATH = "/scratch/cs/world-models/predicate_datasets/POPE/output/openimages/hf_openimages_pope_dataset_500"
OI_DATA_PATH = "/scratch/cs/world-models/predicate_datasets/openimages/validation"
#dataset = datasets.load_from_disk(OI_HF_PATH)['validation']

In [None]:
model_to_inspect = 'Llava OneVision'
model_results_path = os.path.join(results_folder, model_to_inspect+'.json')
# Load specific results
with open(model_results_path, "r") as f:
    model_results = json.load(f)

In [None]:
inspect_examples(model_results['open_images_pope'], OI_DATA_PATH, samples_per_split=7)

In [None]:
inspect_failures(model_results['open_images_pope'], OI_DATA_PATH, samples_per_split=7)

### Winoground

In [None]:
#WINO_HF_PATH="/scratch/cs/world-models/predicate_datasets/preprocessed/winoground"
WINO_DATA_PATH="/scratch/cs/world-models/predicate_datasets/winoground_images"
#dataset = datasets.load_from_disk(WINO_HF_PATH)['test']

In [None]:
model_to_inspect = 'Llava OneVision'
model_results_path = os.path.join(results_folder, model_to_inspect+'.json')
# Load specific results
with open(model_results_path, "r") as f:
    model_results = json.load(f)

In [None]:
inspect_examples(model_results['winoground'], WINO_DATA_PATH, samples_per_split=7)

In [None]:
inspect_failures(model_results['winoground'], WINO_DATA_PATH, samples_per_split=5)

### OpenImages predicates

In [None]:
# Folder containing the OpenImages validation images (the HF dataset itself is not needed here)
#OI_HF_PATH = "/scratch/cs/world-models/predicate_datasets/POPE/output/openimages/hf_openimages_pope_dataset_500"
OI_DATA_PATH = "/scratch/cs/world-models/predicate_datasets/openimages/validation"
#dataset = datasets.load_from_disk(OI_HF_PATH)['validation']

In [None]:
model_to_inspect = 'Llava OneVision' #'Qwen2-VL'
model_results_path = os.path.join(results_folder, model_to_inspect+'.json')
# Load specific results
with open(model_results_path, "r") as f:
    model_results = json.load(f)

In [None]:
inspect_examples(model_results['oi_predicate_questions'], OI_DATA_PATH, samples_per_split=7)

In [None]:
inspect_failures(model_results['oi_predicate_questions'], OI_DATA_PATH, samples_per_split=50)

### VisualGenome predicates

In [None]:
# Folder containing the Visual Genome images
VG_DATA_PATH = "/scratch/cs/world-models/predicate_datasets/visual_genome"

In [None]:
model_to_inspect = 'Llava OneVision'
model_results_path = os.path.join(results_folder, model_to_inspect+'.json')
# Load specific results
with open(model_results_path, "r") as f:
    model_results = json.load(f)

In [None]:
inspect_examples(model_results['visual_genome'], VG_DATA_PATH, samples_per_split=7)

In [None]:
inspect_failures(model_results['visual_genome'], VG_DATA_PATH, samples_per_split=25)

### Blocksworld

In [15]:
BLOCKSWORLD_DATA_PATH = "/scratch/cs/world-models/predicate_datasets/blocksworld_predicates_v1/images"

In [16]:
model_to_inspect = 'Molmo'
model_results_path = os.path.join(results_folder, model_to_inspect+'.json')
# Load specific results
with open(model_results_path, "r") as f:
    model_results = json.load(f)

In [17]:
inspect_failures(model_results['blocksworld'], BLOCKSWORLD_DATA_PATH, samples_per_split=5)

KeyError: 'blocksworld'
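
The `KeyError` suggests that the file loaded for `Molmo` has no `blocksworld` entry, which would be consistent with Molmo being absent from the blocksworld summary table. A small guard can make this kind of failure easier to diagnose by listing the experiments that are actually available. `get_experiment` is a hypothetical helper, and the toy dictionary below only mimics the structure of a per-model file.

```python
def get_experiment(model_results, experiment):
    """Return the results of one experiment, or raise a KeyError listing what is available."""
    if experiment not in model_results:
        available = ", ".join(sorted(model_results)) or "none"
        raise KeyError(f"{experiment!r} not found; available experiments: {available}")
    return model_results[experiment]


# Toy stand-in for a per-model file that lacks the requested experiment.
toy_results = {"blocksworld_precondition_effect": {}}
try:
    get_experiment(toy_results, "blocksworld")
except KeyError as err:
    message = str(err)
    print(message)
```

With such a helper, the call above could become `inspect_failures(get_experiment(model_results, 'blocksworld'), BLOCKSWORLD_DATA_PATH, samples_per_split=5)`.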