[Winogrande](https://leaderboard.allenai.org/winogrande/submissions/get-started) is a test that measures a model's commonsense reasoning capabilities. It's a series of prompts like this one:

"She remembered how annoying it is to dust her wood chair so she bought a plastic table instead.  Cleaning the _ is time consuming."

Two options are given to substitute for the _ in the text:

- 1: "chair" <- correct
- 2: "table"

The [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) uses a similar dataset (Winograd) to measure performance: it compares the token probabilities of the phrase under each of the two alternatives (the options) and picks the higher one. (Also see the ../misc/next_token_probs notebook.)

Let's investigate how this could be done!

In [1]:
import json
from gptbench import Sample, empty_config

In [2]:
ben = Sample(seed=0xDECADEF00D)

cfg = empty_config()
cfg.model.set(dtype='bfloat16') # halve the memory requirements

ben.init_pretrained('gpt2-xl', cfg) # 'gpt2-xl' if your GPU can handle it, otherwise 'gpt2'

Initializing model from gpt2-xl
Dataset: dummy 0 tokens
Dataset: loading uint16 tokens
Expanding initial dataset size of 1 (less than block_size+1) by 1025 times to size of 1025
Dataset train_path: dummy empty dataset, val_path: None, train_split: 0.9, vocab_size: 50257
Model params: 1557.61M


In [3]:
# let's ingest the extra-small test set 

with open('../data/winogrande/train_xs.jsonl', 'r', encoding='utf-8') as f:
    json_lines = list(f)

entries = []

for json_str in json_lines:
    entry = json.loads(json_str)
    """ Each entry is in the form:
{"qID": "3D5G8J4N5CI2K40F4RZLF9OG2CKVTH-2", 
"sentence": "Kyle doesn't wear leg warmers to bed, while Logan almost always does. _ is more likely to live in a colder climate.", 
"option1": "Kyle", "option2": "Logan", "answer": "2"}
    """
    entries.append(entry)

In [4]:
entries[:4]

[{'qID': '3QHITW7OYO7Q6B6ISU2UMJB84ZLAQE-2',
  'sentence': "Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine.",
  'option1': 'Ian',
  'option2': 'Dennis',
  'answer': '2'},
 {'qID': '3QHITW7OYO7Q6B6ISU2UMJB84ZLAQE-1',
  'sentence': "Ian volunteered to eat Dennis's menudo after already having a bowl because _ enjoyed eating intestine.",
  'option1': 'Ian',
  'option2': 'Dennis',
  'answer': '1'},
 {'qID': '3XWUWJ18TLO2DDRXF83QWLKRJ29UU4-1',
  'sentence': 'He never comes to my home, but I always go to his house because the _ is smaller.',
  'option1': 'home',
  'option2': 'house',
  'answer': '1'},
 {'qID': '3XWUWJ18TLO2DDRXF83QWLKRJ29UU4-2',
  'sentence': 'He never comes to my home, but I always go to his house because the _ is bigger.',
  'option1': 'home',
  'option2': 'house',
  'answer': '2'}]

Our first try will just use the probability of the option tokens that replace _ in the sentence. To keep things simple for now, it won't use the text following the _ character.

In [4]:
def choose_next1(sentence, op1, op2, answer, prob_calc, log=False):
    """
    Decide by comparing the probabilities of generating each option at the position of the _ character in the sentence.
    We'll prepend a space to the options and remove the space before _ in the sentence, if any.
    When an option is encoded to several tokens, we'll combine their probabilities with 'mean' or 'mult' (the prob_calc param).
    Returns True if it found the correct answer.
    """

    # str.find() returns -1 when there's no match (str.index() would raise ValueError)
    index = sentence.find(' _')
    if index < 0:
        index = sentence.index('_')
    sent = sentence[:index]
    sent_tok = ben.train_dataset.encode(sent)

    op1 = ' ' + op1
    op2 = ' ' + op2
    op1_tok = ben.train_dataset.encode(op1)
    op2_tok = ben.train_dataset.encode(op2)
    if log: print(f"op1/op2 tokens: '{op1}'={op1_tok}, '{op2}'={op2_tok}")

    p_op1 = ben.model_next_probs(text_tokens=sent_tok, next_text_tokens=op1_tok)
    p_op2 = ben.model_next_probs(text_tokens=sent_tok, next_text_tokens=op2_tok)

    if log: print("op1/op2 probs:", p_op1, p_op2)

    # calc probabilities
    if prob_calc == 'mean':
        p_op1 = sum(p_op1) / len(p_op1)
        p_op2 = sum(p_op2) / len(p_op2)
        
    elif prob_calc == 'mult':
        p1 = 1.
        for p in p_op1:
            p1*=p
        p_op1=p1
        p2 = 1.
        for p in p_op2:
            p2*=p
        p_op2=p2
    else:
        assert False, "Unknown prob_calc"

    if log: print("op1/op2 choose p:", p_op1, p_op2)
   
    if answer == 1:
        return p_op1 > p_op2
    else:
        return p_op2 > p_op1

In [5]:
choose_next1('The sky is _', 'blue', 'yellow-pink', 1, 'mean', log=True)

op1/op2 tokens: ' blue'=[4171], ' yellow-pink'=[7872, 12, 79, 676]
op1/op2 probs: [0.09228515625] [0.00066375732421875, 0.003448486328125, 0.00115966796875, 0.96484375]
op1/op2 choose p: 0.09228515625 0.24252891540527344


False

Averaging falls prey to outliers: the final 'ink' token of option 2 ('yellow-pink') has a 96% probability, which pulls the mean of option 2 up to 24% and (wrongly) beats the 9% probability of option 1.

A solution could be to use an outlier-resistant metric like the median.

But the chain rule for discrete variables states that:

P(C, B | A) = P(C | B, A) * P(B | A)

Here A is the text before the option, and B and C are the option's tokens. So we can multiply the probabilities of all the tokens in the option. However, this will penalize options with more tokens.
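To make the trade-off concrete, here is a minimal sketch that applies the three aggregations discussed above (mean, median, product) to the token probabilities printed earlier for ' blue' and ' yellow-pink'. It only uses the Python standard library; the `aggregate` helper is a hypothetical name for illustration and is not part of gptbench.

```python
import math
import statistics

# Token probabilities copied from the 'The sky is _' output above
p_blue = [0.09228515625]
p_yellow_pink = [0.00066375732421875, 0.003448486328125, 0.00115966796875, 0.96484375]

def aggregate(probs, how):
    """Hypothetical helper: collapse per-token probabilities into one score."""
    if how == 'mean':
        return sum(probs) / len(probs)
    if how == 'median':
        return statistics.median(probs)
    if how == 'mult':
        return math.prod(probs)
    raise ValueError(f"Unknown aggregation: {how}")

for how in ('mean', 'median', 'mult'):
    s1, s2 = aggregate(p_blue, how), aggregate(p_yellow_pink, how)
    winner = 'blue' if s1 > s2 else 'yellow-pink'
    print(f"{how:>6}: blue={s1:.3g} yellow-pink={s2:.3g} -> picks {winner}")

# Summing logs is equivalent to multiplying, and avoids underflow on long spans
log_score = sum(math.log(p) for p in p_yellow_pink)
```

On these numbers only the mean picks 'yellow-pink'; the median and the product both pick 'blue'. The product does so partly because it shrinks with every extra token, which is the length penalty mentioned above.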

In [6]:
choose_next1('The sky is _', 'blue', 'yellow-pink', 1, 'mult', log=True)

op1/op2 tokens: ' blue'=[4171], ' yellow-pink'=[7872, 12, 79, 676]
op1/op2 probs: [0.09228515625] [0.00066375732421875, 0.003448486328125, 0.00115966796875, 0.96484375]
op1/op2 choose p: 0.09228515625 2.5611114895518483e-09


True

In [7]:
choose_next1("She remembered how annoying it is to dust her wood chair so she bought a plastic table instead. Cleaning the _ is time consuming.",
             "chair", "table", 1, 'mult', log=True)

op1/op2 tokens: ' chair'=[5118], ' table'=[3084]
op1/op2 probs: [0.04150390625] [0.326171875]
op1/op2 choose p: 0.04150390625 0.326171875


False

Another problem is that we're just looking at the probabilities of generating the option tokens; any following text is ignored. In the previous example, "is time consuming" is ignored, and that's where the meaning is.

The choose_next1() method of just looking at the probabilities of the options doesn't look very reliable.

Let's write a function to calculate accuracy over the whole test set:

In [8]:
# a function to test accuracy
def calc_accuracy(entries, choose_fn):
    """
    choose_fn(sentence, option1, option2, answer)
    """

    correct_count = 0
    for e in entries:
        res = choose_fn(e['sentence'], e['option1'], e['option2'], int(e['answer']))
        correct_count += int(res)
        print('.', end='')

    return correct_count / len(entries)


In [9]:
choose_next1_fn = lambda sentence, option1, option2, answer: choose_next1(sentence, option1, option2, answer, 'mean')

calc_accuracy(entries, choose_next1_fn)

................................................................................................................................................................

0.49375

In [10]:
choose_next1_fn = lambda sentence, option1, option2, answer: choose_next1(sentence, option1, option2, answer, 'mult')

calc_accuracy(entries, choose_next1_fn)

................................................................................................................................................................

0.4875

Accuracy with both the mean and mult ways of combining multiple token probabilities is near 50%: the same as flipping a coin.

Let's look into other ways.

The paper "[A Simple Method for Commonsense Reasoning](https://arxiv.org/pdf/1806.02847.pdf)" by Trieu H. Trinh and Quoc V. Le introduces the idea of calculating a partial probability: that of the text that follows the option tokens.

In the above example, we'd calculate for each option - say for option 1, 'chair':
```
P(" is time consuming." | "She remembered how annoying it is to dust her wood chair so she bought a plastic table instead. Cleaning the chair")
```

This is the conditional probability of the text that follows the option, given the text up to and including option 1.

Then we calculate the same probability for 'table' and pick the option with the higher one.

Let's write a function to do that.

In [15]:
def choose_partial(sentence, op1, op2, answer, prob_calc, log=False):
    """
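    Decide by comparing the probabilities of generating the text after the _ character,
    given the text before it with each option substituted for _.
    Returns True if it found the correct answer.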
    """

    index = sentence.index('_')
    sent_pre = sentence[:index]
    
    sent_post = sentence[index+1:]
    sent_post_tok = ben.train_dataset.encode(sent_post)

    op1 = sent_pre + op1
    op2 = sent_pre + op2
    op1_tok = ben.train_dataset.encode(op1)
    op2_tok = ben.train_dataset.encode(op2)
    if log: print(f"op1/op2 tokens: '{op1}'={op1_tok}, '{op2}'={op2_tok}\npost tokens='{sent_post}'={sent_post_tok}")

    p_op1 = ben.model_next_probs(text_tokens=op1_tok, next_text_tokens=sent_post_tok)
    p_op2 = ben.model_next_probs(text_tokens=op2_tok, next_text_tokens=sent_post_tok)

    if log: print("op1/op2 probs:", p_op1, p_op2)

    # calc probabilities
    if prob_calc == 'mean':
        p_op1 = sum(p_op1) / len(p_op1)
        p_op2 = sum(p_op2) / len(p_op2)
        
    elif prob_calc == 'mult':
        p1 = 1.
        for p in p_op1:
            p1*=p
        p_op1=p1
        p2 = 1.
        for p in p_op2:
            p2*=p
        p_op2=p2
    else:
        assert False, "Unknown prob_calc"

    if log: print("op1/op2 choose p:", p_op1, p_op2)
        
    if answer == 1:
        return p_op1 > p_op2
    else:
        return p_op2 > p_op1

How well does it work?

Let's just use the 'mult' probabilities aggregation method.

In [14]:
choose_partial("She remembered how annoying it is to dust her wood chair so she bought a plastic table instead. Cleaning the _ is time consuming.",
             "chair", "table", 1, 'mult', log=True)

op1/op2 tokens: 'She remembered how annoying it is to dust her wood chair so she bought a plastic table instead. Cleaning the chair'=[3347, 12086, 703, 15774, 340, 318, 284, 8977, 607, 4898, 5118, 523, 673, 5839, 257, 7309, 3084, 2427, 13, 5985, 278, 262, 5118], 'She remembered how annoying it is to dust her wood chair so she bought a plastic table instead. Cleaning the table'=[3347, 12086, 703, 15774, 340, 318, 284, 8977, 607, 4898, 5118, 523, 673, 5839, 257, 7309, 3084, 2427, 13, 5985, 278, 262, 3084]
post=' is time consuming.'=[318, 640, 18587, 13]
op1/op2 probs: [0.1923828125, 0.005615234375, 0.625, 0.099609375] [0.146484375, 0.005279541015625, 0.640625, 0.091796875]
op1/op2 choose p: 6.725342245772481e-05 4.547987373371143e-05


True

In [16]:
choose_partial('The sky is _.', 'blue', 'yellow-pink', 1, 'mult', log=True)

op1/op2 tokens: 'The sky is blue'=[464, 6766, 318, 4171], 'The sky is yellow-pink'=[464, 6766, 318, 7872, 12, 79, 676]
post tokens='.'=[13]
op1/op2 probs: [0.171875] [0.1484375]
op1/op2 choose p: 0.171875 0.1484375


True

Calculate accuracy for the test data:

In [17]:
choose_fn = lambda sentence, option1, option2, answer: choose_partial(sentence, option1, option2, answer, 'mult')

calc_accuracy(entries, choose_fn)

................................................................................................................................................................

0.625

About 62% - that's an improvement over coin flipping! : )
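How far from coin flipping is that? As a rough check, assuming the 160 entries are independent and a coin flip is right half of the time, the standard error of the accuracy is sqrt(0.5 * 0.5 / 160), about 0.04. The sketch below expresses the three accuracies measured above in those units; the function name is made up for illustration.

```python
import math

def z_vs_coin_flip(accuracy, n):
    """Hypothetical helper: distance from 50% accuracy, in standard errors."""
    std_err = math.sqrt(0.5 * 0.5 / n)
    return (accuracy - 0.5) / std_err

n = 160  # entries in the extra-small set
for name, acc in [('next1/mean', 0.49375), ('next1/mult', 0.4875), ('partial/mult', 0.625)]:
    print(f"{name}: accuracy={acc:.3f}, z={z_vs_coin_flip(acc, n):+.2f}")
```

Under these assumptions the two choose_next1() runs sit well within one standard error of 50%, while the partial method is about three standard errors above it.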

The [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) reports about 70% for the gpt2-xl model with the partial scoring method we use, but on the Winograd Schema test (we're using Winogrande, a later test).

From here we could test with the larger Winogrande sets (we're using the extra-small one, with 160 entries). Try the small set (../data/winogrande/train_s.jsonl), which has 640 entries, or the larger ones.
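Switching sets only requires reading a different file. A minimal sketch, where `load_jsonl` is a hypothetical helper (not part of gptbench) that mirrors the loading cell near the top of the notebook:

```python
import json

def load_jsonl(path):
    """Hypothetical helper: read one JSON object per non-empty line."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage (assumes the file exists at the path mentioned above):
# entries_s = load_jsonl('../data/winogrande/train_s.jsonl')
# calc_accuracy(entries_s, choose_fn)
```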

This could also work well with multiple choice questions: do a partial generate (as above) for each of the n choices and choose the one with the highest probability.
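The n-choice idea can be sketched as an argmax over per-choice scores. Everything named here is an assumption for illustration: `pick_choice` is hypothetical, and `score_fn(prefix, continuation)` is assumed to return P(continuation | prefix) as a float. The toy scorer is a stand-in so the example runs without a model.

```python
def pick_choice(score_fn, context, choices, continuation):
    """
    Hypothetical sketch: return the index of the choice whose substitution
    makes the continuation most probable.
    """
    scores = [score_fn(context + choice, continuation) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

# Stand-in scorer for demonstration only; a real one would wrap the model,
# for example by multiplying the values returned by ben.model_next_probs().
def toy_score(prefix, continuation):
    return 0.9 if prefix.endswith('blue') else 0.1

best = pick_choice(toy_score, 'The sky is ', ['green', 'blue', 'yellow-pink'], '.')
print(best)
```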