# Abstraction Alignment to Benchmark Language Models
We use abstraction alignment to benchmark language models' specificity. Here we expand specificity benchmarks from the [S-TEST dataset](https://github.com/jeffhj/S-TEST) to include additional hypotheses. This example follows the Quantitatively Comparing Model Specificity case study from the Abstraction Alignment paper (Section 5.2.2).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import json
import os
import pickle
import numpy as np
from tqdm import tqdm

from nltk.corpus import wordnet as wn

## Load the S-TEST Specificity Testing Benchmarks
The S-TEST dataset contains sentences that test the model on a subject's occupation, location, or place of birth. For instance, occupation sentences follow the format "Cher is a [MASK] by profession." Each sentence has a corresponding specific label (e.g., "singer") and general label (e.g., "artist"). Here we load the S-TEST data as well as the prediction results for five models from the BERT, RoBERTa, and GPT-2 families.

To generate the prediction results, first run `python S-TEST/scripts/run_experiments.py`. See the `README.md` for more details.

In [3]:
CASE_STUDY_DIR = 'util/llm/'
DATA_DIR = 'util/llm/S-TEST/data/S-TEST/'
RESULTS_DIR = 'util/llm/S-TEST/output/results/'

MODELS = [
    'bert_base', 
    'bert_large', 
    'roberta.base', 
    'roberta.large', 
    'gpt2',
]
TASKS = [
    {'name': 'occupation', 'id': 'P106', 'up_fn': 'hypernyms', 'down_fn': 'hyponyms', 'root': wn.synset('person.n.01')},
    {'name': 'location', 'id': 'P131', 'up_fn': 'part_holonyms', 'down_fn': 'part_meronyms', 'root': None},
    {'name': 'place of birth', 'id': 'P19', 'up_fn': 'part_holonyms', 'down_fn': 'part_meronyms', 'root': None},
]

In [4]:
def load_data(task_id):
    """Load the data instances for a task_id. There can be duplicates, but we
    handle them the same way the S-TEST repo does."""
    data = {}
    with open(os.path.join(DATA_DIR, f'{task_id}.jsonl'), 'r') as f:
        for line in f:
            datum = json.loads(line)
            data[datum['sub_label']] = datum
    return data

task = TASKS[0]
model = MODELS[0]

data = load_data(task['id'])
print(f"{len(data)} instances for {task['name']} prediction task.")
print("Example data:")
print(data[list(data.keys())[0]])

4999 instances for occupation prediction task.
Example data:
{'sub_uri': 'Q39074561', 'sub_label': 'Joe Carter', 'obj_uri': 'Q1371925', 'obj_label': 'announcer', 'obj_value': 2.0, 'obj2_uri': 'Q1930187', 'obj2_label': 'journalist', 'obj2_value': 3.0, 'predicate_id': 'P106'}


## Compute Model Accuracy and the S-TEST Specificity Metric
The S-TEST specificity metric `p_r` tests how often the model prefers the specific label to the more general label. In our nomenclature, we write `p_r = P(s_s, s_g)`.
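To make the metric concrete, the sketch below computes `p_r` on two hand-made ranked prediction lists. The helper name `prefers_specific` and the toy rankings are illustrative assumptions, not part of the S-TEST code. The logic follows the description above: walk each ranking from most to least likely and check which of the two labels appears first.

```python
def prefers_specific(ranked_words, specific_label, general_label):
    """Return True if specific_label is ranked before general_label."""
    for word in ranked_words:
        if word == specific_label:
            return True
        if word == general_label:
            return False
    return False  # neither label appears in the ranking

# Hypothetical rankings (most likely word first), each paired with its
# specific and general label.
toy_rankings = [
    (['singer', 'artist', 'actor'], 'singer', 'artist'),  # specific first
    (['artist', 'writer', 'singer'], 'singer', 'artist'),  # general first
]

toy_pr = sum(prefers_specific(*r) for r in toy_rankings) / len(toy_rankings)
print(f"toy p_r = {toy_pr:.2%}")  # toy p_r = 50.00%
```

Instances where neither label is predicted count against `p_r` in this sketch, since they still contribute to the denominator.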

In [5]:
def load_model_results(model_name, task_id):
    with open(os.path.join(RESULTS_DIR, model_name, task_id, 'result.pkl'), 'rb') as f:
        results = pickle.load(f)['list_of_results']
    return results

results = load_model_results(model, task['id'])
print(f"{len(results)} predictions for {model} on {task['name']} prediction task.")
print(f"Predictions for results[0] sum to {np.sum([np.exp(w['log_prob']) for w in results[0]['masked_topk']['topk']])}")
print(f"Computed probabilities for {len(results[0]['masked_topk']['topk'])} words.")

5000 predictions for bert_base on occupation prediction task.
Predictions for results[0] sum to 0.8970748439562531
Computed probabilities for 18173 words.


In [6]:
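def compute_pr(results, data):
    """Computes the S-TEST p_r metric: the fraction of predictions in which
    the specific label is ranked above the general (coarse) label."""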
    specific = 0
    for result in results:
        subject = result['sample']['sub_label']
        data_instance = data[subject]
        specific_label = data_instance['obj_label']
        coarse_label = data_instance['obj2_label']
        for token in result['masked_topk']['topk']:
            if token['token_word_form'] == specific_label:
                specific += 1
                break
            if token['token_word_form'] == coarse_label:
                break
    return specific / len(results)

def compute_accuracy(results, data, k=1):
    """Computes acc@k: the fraction of predictions whose top-k words include
    the specific label."""
    correct = 0
    for result in results:
        subject = result['sample']['sub_label']
        data_instance = data[subject]
        label = data_instance['obj_label']
        topk_words = [t['token_word_form'] for t in result['masked_topk']['topk'][:k]]
        if label in topk_words:
            correct += 1
    return correct / len(results)

In [7]:
accuracy_1 = compute_accuracy(results, data, k=1)
accuracy_10 = compute_accuracy(results, data, k=10)
print(f"{model} {task['name']} prediction acc@1 = {accuracy_1:.2%}; acc@10 = {accuracy_10:.2%}")

bert_base occupation prediction acc@1 = 0.32%; acc@10 = 28.44%


In [30]:
pr = compute_pr(results, data)
print(f"{model} {task['name']} prediction pr = {pr:.2%}")

bert_base occupation prediction pr = 70.46%


## Compute Abstraction Alignment Specificity Metrics

### Get all related words from WordNet
Instead of testing two words, with abstraction alignment we can test many concepts over many levels of abstraction. We use WordNet as the human abstraction graph, taking all the concepts that are related to the dataset's labels.
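As a rough illustration of this traversal, the sketch below collects every word below and above a label in a small hand-written hierarchy. The dictionaries `TOY_HYPONYMS` and `TOY_HYPERNYMS` and the helper `collect_related` are hypothetical stand-ins for WordNet, used so the example runs without downloading any corpus; the notebook itself walks real WordNet synsets.

```python
# Toy "more specific" links standing in for WordNet (hypothetical data).
TOY_HYPONYMS = {
    'person': ['communicator', 'artist'],
    'communicator': ['announcer'],
    'announcer': ['newscaster', 'sports_announcer'],
    'artist': ['singer'],
}
# Invert the links to get the "more general" direction.
TOY_HYPERNYMS = {
    child: [parent]
    for parent, children in TOY_HYPONYMS.items()
    for child in children
}

def collect_related(word, links):
    """Return every word reachable from `word` by repeatedly following `links`."""
    related = set()
    for neighbor in links.get(word, []):
        related.add(neighbor)
        related.update(collect_related(neighbor, links))
    return related

children = collect_related('announcer', TOY_HYPONYMS)
parents = collect_related('announcer', TOY_HYPERNYMS)
print(sorted(children))  # ['newscaster', 'sports_announcer']
print(sorted(parents))   # ['communicator', 'person']
```

Following the links in one direction gives the more specific concepts, and following the inverted links gives the more general ones, which is the same split into children and parents used in the cells below.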

In [8]:
def get_related_words(synset, traversal_fn_name, root=None, include_self=True):
    """Returns all words related to synset via the traversal_fn_name."""
    traversal_fn = getattr(synset, traversal_fn_name)
    words = set([])
    if include_self:
        words.add(synset.name().split('.')[0])
    if (root is not None and synset == root) or len(traversal_fn()) == 0:
        return words
    for word in traversal_fn():
        next_words = get_related_words(word, traversal_fn_name, root)
        words.update(next_words)
    return words

def get_abstraction_graph(task):
    """Creates an abstraction graph with all task-related words."""
    with open(os.path.join(CASE_STUDY_DIR, f"{task['id']}_synsets.json"), 'r') as f:
        label_synsets = json.load(f)
    label_to_synset = {label: wn.synset(synset) for label, synset in label_synsets if synset is not None}
    abstraction_graph = {}
    for label, synset in label_to_synset.items():
        if synset in abstraction_graph: 
            continue
        children = get_related_words(synset, task['down_fn'], root=task['root'], include_self=True)
        children.add(label)
        parents = get_related_words(synset, task['up_fn'], root=task['root'], include_self=False)
        abstraction_graph[label] = {
            'children': children,
            'parents': parents
        }
    return abstraction_graph

In [9]:
abstraction_graph = get_abstraction_graph(task)
print(f"Related words for {len(abstraction_graph)} synsets.")
print(f"Avg num children = {np.mean([len(w['children']) for w in abstraction_graph.values()])}")
print(f"Avg num parents = {np.mean([len(w['parents']) for w in abstraction_graph.values()])}")
print(list(abstraction_graph.keys())[0], '-->', abstraction_graph[list(abstraction_graph.keys())[0]])

Related words for 105 synsets.
Avg num children = 18.60952380952381
Avg num parents = 3.5904761904761906
announcer --> {'children': {'radio_announcer', 'sports_announcer', 'announcer', 'newscaster', 'newsreader', 'tv_announcer'}, 'parents': {'person', 'broadcaster', 'communicator'}}


In [10]:
task_words = {'children': set([]), 'parents': set([])}
for words in abstraction_graph.values():
    task_words['children'].update(words['children'])
    task_words['parents'].update(words['parents'])
print(f"{len(task_words['children']) + len(task_words['parents'])} words related to the task.")

1601 words related to the task.


### Compute abstraction alignment specificity metrics
We test two additional specificity metrics, p_s = P(s_s↓, s_↑) and p_t = P(s_s↕, s_t).

p_s = P(s_s↓, s_↑) measures how often a specific word is preferred to a general word. To compute this, we compare all words more specific than the specific label to all words more general than the specific label.
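The comparison for a single prediction can be sketched as follows. The helper `prefers_specific_words` and the probabilities are made up for illustration; the sketch assumes each group of words is reduced to one score by an aggregation function, with `np.max` as the default and `np.sum` shown only as one possible alternative.

```python
import numpy as np

def prefers_specific_words(word_probs, children, parents, agg_fn=np.max):
    """Compare aggregated probability of specific words vs. general words.

    Returns None when either group has no word in the prediction.
    """
    specific = [p for w, p in word_probs.items() if w in children]
    general = [p for w, p in word_probs.items() if w in parents]
    if not specific or not general:
        return None
    return bool(agg_fn(specific) >= agg_fn(general))

# Hypothetical prediction for a subject whose label is "announcer".
word_probs = {'newscaster': 0.30, 'person': 0.20, 'broadcaster': 0.15, 'chef': 0.35}
children = {'announcer', 'newscaster', 'sports_announcer'}
parents = {'person', 'broadcaster', 'communicator'}

# Best specific word (0.30) beats best general word (0.20).
print(prefers_specific_words(word_probs, children, parents))  # True
# Summed general mass (about 0.35) exceeds summed specific mass (0.30).
print(prefers_specific_words(word_probs, children, parents, agg_fn=np.sum))  # False
```

The two calls disagree on the same prediction, so the choice of aggregation function is part of the metric's definition and should be reported alongside the score.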

In [37]:
# p_s --> how often a specific word is preferred to a general word
def compute_ps(results, data, abstraction_graph, agg_fn=np.max):
    """Computes how often a specific answer is preferred to a general answer."""
    prefers_specific = 0
    num_instances = 0
    for i, result in enumerate(results):
        subject = result['sample']['sub_label']
        label = data[subject]['obj_label']
        if label not in abstraction_graph:
            continue
        specific_word_probs = []
        general_word_probs = []
        for token in result['masked_topk']['topk']:
            if token['token_word_form'] in abstraction_graph[label]['children']:
                specific_word_probs.append(np.exp(token['log_prob']))
            if token['token_word_form'] in abstraction_graph[label]['parents']:
                general_word_probs.append(np.exp(token['log_prob']))     
            
        if len(specific_word_probs) == 0 or len(general_word_probs) == 0:
            continue  # no related words are in the vocab, so can't compute ps
        num_instances += 1
        
        if agg_fn(specific_word_probs) >= agg_fn(general_word_probs):
            prefers_specific += 1
    return prefers_specific / num_instances

In [38]:
ps_max = compute_ps(results, data, abstraction_graph, agg_fn=np.max)
print(f"{model} {task['name']} prediction ps_max = {ps_max:.2%}")

bert_base occupation prediction ps_max = 79.01%


p_t = P(s_s↕, s_t) measures how often a related word is preferred to a topic word. We compute it by comparing all words related to the label to all other words related to the task.
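A minimal sketch of this comparison for one prediction is shown below. The helper `prefers_related_words` and the toy probabilities are assumptions for illustration. Here the topic set contains the label's related words as well; with max aggregation and a `>=` comparison, that gives the same answer as comparing against only the other task words whenever at least one other task word is predicted, because the related words can only tie with themselves.

```python
import numpy as np

def prefers_related_words(word_probs, related_words, task_words, agg_fn=np.max):
    """Compare aggregated probability of label-related words vs. all task words.

    Returns None when either group has no word in the prediction.
    """
    related = [p for w, p in word_probs.items() if w in related_words]
    topic = [p for w, p in word_probs.items() if w in task_words]
    if not related or not topic:
        return None
    return bool(agg_fn(related) >= agg_fn(topic))

# Hypothetical word sets for a subject whose label is "announcer".
related_words = {'announcer', 'newscaster', 'broadcaster', 'person'}
task_words = related_words | {'singer', 'chef', 'artist'}

on_target = {'newscaster': 0.5, 'singer': 0.3, 'the': 0.2}
off_target = {'newscaster': 0.2, 'singer': 0.6, 'the': 0.2}

print(prefers_related_words(on_target, related_words, task_words))   # True
print(prefers_related_words(off_target, related_words, task_words))  # False
```

In the second prediction the model's top task word is an unrelated occupation, so the instance does not count toward p_t.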

In [33]:
def compute_pt(results, data, abstraction_graph, task_words, agg_fn=np.max):
    """Computes how often a related word is preferred to a topic word."""
    prefers_specific = 0
    num_instances = 0
    for result in results:
        subject = result['sample']['sub_label']
        label = data[subject]['obj_label']
        if label not in abstraction_graph:
            continue
        specific_word_probs = []
        task_word_probs = []
        for token in result['masked_topk']['topk']:
            token_word = token['token_word_form']
            if token_word in abstraction_graph[label]['children'] or token_word in abstraction_graph[label]['parents']:
                specific_word_probs.append(np.exp(token['log_prob']))
            if token_word in task_words:
                task_word_probs.append(np.exp(token['log_prob']))
            
        if len(specific_word_probs) == 0 or len(task_word_probs) == 0:
            continue  # no related words are in the vocab, so can't compute pt
        num_instances += 1
        
        if agg_fn(specific_word_probs) >= agg_fn(task_word_probs):
            prefers_specific += 1
            
    # print(f'Computing over {num_instances}/{len(data)} instances')
    return prefers_specific / num_instances

In [17]:
task_words = set([])
for words in abstraction_graph.values():
    task_words.update(words['children'])
    task_words.update(words['parents'])

In [34]:
pt_max = compute_pt(results, data, abstraction_graph, task_words, agg_fn=np.max)
print(f"{model} {task['name']} prediction pt_max = {pt_max:.2%}")

bert_base occupation prediction pt_max = 0.68%


### Compute all specificity metrics for all pairs of tasks and models

In [14]:
def compute_all_metrics(tasks, models):
    for task in tasks:
        print(f"TASK {task['id']}: {task['name']} prediction")
        data = load_data(task['id'])
        abstraction_graph = get_abstraction_graph(task)        
        task_words = set([])
        for words in abstraction_graph.values():
            task_words.update(words['children'])
            task_words.update(words['parents'])
        for model in models:
            print(f"MODEL {model}")
            results = load_model_results(model, task['id'])

            accuracy_10 = compute_accuracy(results, data, k=10)
            print(f'--- acc@10 = {accuracy_10:.2%}')
            
            pr = compute_pr(results, data)
            print(f'--- pr     = {pr:.2%}')
            
            ps_max = compute_ps(results, data, abstraction_graph, agg_fn=np.max)
            print(f"--- ps_max = {ps_max:.2%}")
            
            pt_max = compute_pt(results, data, abstraction_graph, task_words, agg_fn=np.max)
            print(f"--- pt_max = {pt_max:.2%}")

In [20]:
compute_all_metrics(TASKS, MODELS)

TASK P106: occupation prediction
MODEL bert_base
--- acc@10 = 28.44%
--- pr     = 70.46%
--- ps_max = 79.01%
Computing over 4994/4999 instances
--- pt_max = 0.68%
MODEL bert_large
--- acc@10 = 22.14%
--- pr     = 71.76%
--- ps_max = 82.40%
Computing over 4994/4999 instances
--- pt_max = 1.16%
MODEL roberta.base
--- acc@10 = 24.50%
--- pr     = 61.80%
--- ps_max = 78.97%
Computing over 4994/4999 instances
--- pt_max = 7.51%
MODEL roberta.large
--- acc@10 = 22.44%
--- pr     = 71.44%
--- ps_max = 82.38%
Computing over 4994/4999 instances
--- pt_max = 7.97%
MODEL gpt2
--- acc@10 = 16.10%
--- pr     = 57.28%
--- ps_max = 51.92%
Computing over 4994/4999 instances
--- pt_max = 16.84%
TASK P131: location prediction
MODEL bert_base
--- acc@10 = 43.16%
--- pr     = 49.09%
--- ps_max = 95.91%
Computing over 4519/4976 instances
--- pt_max = 23.04%
MODEL bert_large
--- acc@10 = 45.64%
--- pr     = 42.36%
--- ps_max = 97.04%
Computing over 4519/4976 instances
--- pt_max = 27.44%
MODEL roberta.base
