[WinoGrande](https://leaderboard.allenai.org/winogrande/submissions/get-started) is a test that measures a model's commonsense reasoning capabilities. It's a series of prompts like this one:

"She remembered how annoying it is to dust her wood chair so she bought a plastic table instead.  Cleaning the _ is time consuming."

Two options are given to substitute for the _ in the text:

- 1: "chair" <- correct
- 2: "table"

The [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) uses a similar dataset (Winograd) to measure performance: it compares the token probabilities the model assigns to the phrase under each of the two alternatives and picks the higher one (see the ../misc/next_token_probs notebook).
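
As a minimal sketch of what "the two alternatives" means here, each option can be substituted for the _ placeholder to produce two candidate sentences, which a language model could then score. The helper name `fill_blank` is hypothetical and not part of gptbench.

```python
def fill_blank(sentence, option):
    """Return the sentence with the first '_' placeholder replaced by option."""
    # Hypothetical helper for illustration; assumes a single '_' placeholder.
    return sentence.replace('_', option, 1)

sentence = ("She remembered how annoying it is to dust her wood chair "
            "so she bought a plastic table instead. Cleaning the _ is time consuming.")

# One candidate sentence per option
candidates = [fill_blank(sentence, op) for op in ("chair", "table")]
for c in candidates:
    print(c)
```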

Let's investigate how this could be done!

In [1]:
import json, copy
from gptbench import Sample, empty_config

In [2]:
ben = Sample(seed=0xDECADEF00D)

cfg = empty_config()
cfg.model.set(dtype='bfloat16') # halve the memory requirements

ben.init_pretrained('gpt2-xl', cfg) # 'gpt2', or 'gpt2-xl' if your GPU can handle it

Initializing model from gpt2-xl
Dataset: dummy 0 tokens
Dataset: loading uint16 tokens
Expanding initial dataset size of 1 (less than block_size+1) by 1025 times to size of 1025
Dataset train_path: dummy empty dataset, val_path: None, train_split: 0.9, vocab_size: 50257
Model params: 1557.61M


In [3]:
# let's ingest the extra-small training split (train_xs)

with open('../data/winogrande/train_xs.jsonl', 'r', encoding='utf-8') as f:
    json_lines = list(f)

# Each line is a JSON object in the form (note that "answer" is a string):
# {"qID": "3D5G8J4N5CI2K40F4RZLF9OG2CKVTH-2",
#  "sentence": "Kyle doesn't wear leg warmers to bed, while Logan almost always does. _ is more likely to live in a colder climate.",
#  "option1": "Kyle", "option2": "Logan", "answer": "2"}
entries = []
for json_str in json_lines:
    entries.append(json.loads(json_str))

In [4]:
entries[:4]

[{'qID': '3QHITW7OYO7Q6B6ISU2UMJB84ZLAQE-2',
  'sentence': "Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine.",
  'option1': 'Ian',
  'option2': 'Dennis',
  'answer': '2'},
 {'qID': '3QHITW7OYO7Q6B6ISU2UMJB84ZLAQE-1',
  'sentence': "Ian volunteered to eat Dennis's menudo after already having a bowl because _ enjoyed eating intestine.",
  'option1': 'Ian',
  'option2': 'Dennis',
  'answer': '1'},
 {'qID': '3XWUWJ18TLO2DDRXF83QWLKRJ29UU4-1',
  'sentence': 'He never comes to my home, but I always go to his house because the _ is smaller.',
  'option1': 'home',
  'option2': 'house',
  'answer': '1'},
 {'qID': '3XWUWJ18TLO2DDRXF83QWLKRJ29UU4-2',
  'sentence': 'He never comes to my home, but I always go to his house because the _ is bigger.',
  'option1': 'home',
  'option2': 'house',
  'answer': '2'}]

Let's build a helper function that tests an option and returns the probabilities of its tokens:

In [21]:
def calc_probs(start_text_tokens, tokens):
    """
    Returns a list with the probability of each token in tokens, given the text that precedes it.
    After a token is scored, it's appended to the context (a copy of start_text_tokens) for the next round.
    """
    text_tokens = start_text_tokens[:] # copy as we'll be expanding it with the option tokens
    out = []

    for t in tokens:
        print(tokens, text_tokens, '-->', ben.train_dataset.decode(text_tokens), '/', t, ben.train_dataset.decode(t))
        probs = ben.model_probs(text_tokens=text_tokens)
        p_t = probs[t].item()
        out.append(p_t)
        text_tokens.append(t)

    return out

Our first try will use only the probability of the option tokens that replace _ in the sentence. It ignores the text following the _ character, to keep things simple for now.

In [27]:
def choose_next1(sentence, op1, op2, answer, prob_calc):
    """
    Decide by comparing probabilities of generating the two options, when generating right before the _ character in the sentence
    We'll prepend a space to the options and remove the space before _ in the sentence if any.
    When an option is encoded to several tokens, prob_calc selects how their probabilities are combined: 'mean' (average) or 'mult' (product).
    Returns True if it found the correct answer.
    """

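    index = sentence.find(' _')  # find() returns -1 when absent; index() would raise ValueError
    if index < 0:
        index = sentence.find('_')
    assert index >= 0, "sentence has no _ placeholder"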
    sent = sentence[:index]
    sent_tok = ben.train_dataset.encode(sent)
    
    op1_tok = ben.train_dataset.encode(' ' + op1)
    op2_tok = ben.train_dataset.encode(' ' + op2)
    print(op1_tok, op2_tok)

    p_op1 = calc_probs(sent_tok, op1_tok)
    p_op2 = calc_probs(sent_tok, op2_tok)

    print(p_op1, p_op2)

    # calc probabilities
    if prob_calc == 'mean':
        p_op1 = sum(p_op1) / len(p_op1)
        p_op2 = sum(p_op2) / len(p_op2)
        
    elif prob_calc == 'mult':
        p1 = 1.
        for p in p_op1:
            p1*=p
        p_op1=p1
        p2 = 1.
        for p in p_op2:
            p2*=p
        p_op2=p2
    else:
        assert False, "Unknown prob_calc"

    print(p_op1, p_op2)
   
    # the dataset stores the answer as a string ('1' or '2'), so normalize to int
    if int(answer) == 1:
        return p_op1 > p_op2
    else:
        return p_op2 > p_op1

In [29]:
choose_next1('The sky is _ and etc.', 'blue', 'yellow-pink', 1, 'mean')

[4171] [7872, 12, 79, 676]
[4171] [464, 6766, 318] --> The sky is / 4171  blue
[7872, 12, 79, 676] [464, 6766, 318] --> The sky is / 7872  yellow
[7872, 12, 79, 676] [464, 6766, 318, 7872] --> The sky is yellow / 12 -
[7872, 12, 79, 676] [464, 6766, 318, 7872, 12] --> The sky is yellow- / 79 p
[7872, 12, 79, 676] [464, 6766, 318, 7872, 12, 79] --> The sky is yellow-p / 676 ink
[0.09228515625] [0.00066375732421875, 0.003448486328125, 0.00115966796875, 0.96484375]
0.09228515625 0.24252891540527344


False
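
The failure above seems to come from how the per-token probabilities are combined. A small check using only the numbers printed above (no model needed) suggests why: the mean is dominated by the easy final token "ink" (0.96), while the product, which follows the chain rule for the probability of the whole token sequence, stays tiny. The helper names `mean` and `product` are just for illustration.

```python
# Per-token probabilities copied from the output above
p_blue = [0.09228515625]
p_yellow_pink = [0.00066375732421875, 0.003448486328125, 0.00115966796875, 0.96484375]

def mean(ps):
    """Average of the per-token probabilities."""
    return sum(ps) / len(ps)

def product(ps):
    """Product of the per-token probabilities (chain rule)."""
    result = 1.0
    for p in ps:
        result *= p
    return result

print('mean:   ', mean(p_blue), mean(p_yellow_pink))
print('product:', product(p_blue), product(p_yellow_pink))

mean_prefers_blue = mean(p_blue) > mean(p_yellow_pink)
product_prefers_blue = product(p_blue) > product(p_yellow_pink)
```

The product has a bias of its own: every extra token multiplies in a factor below 1, so options that encode to more tokens are penalized.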

In [32]:
choose_next1('The sky is _ and etc.', 'blue', 'yellowish', 1, 'mult')

[4171] [7872, 680]
[4171] [464, 6766, 318] --> The sky is / 4171  blue
[7872, 680] [464, 6766, 318] --> The sky is / 7872  yellow
[7872, 680] [464, 6766, 318, 7872] --> The sky is yellow / 680 ish
[0.09228515625] [0.00066375732421875, 0.0028533935546875]
0.09228515625 1.8939608708024025e-06


True

In [34]:
choose_next1("She remembered how annoying it is to dust her wood chair so she bought a plastic table instead. Cleaning the _ is time consuming.",
             "chair", "table", 1, 'mult')

[5118] [3084]
[5118] [3347, 12086, 703, 15774, 340, 318, 284, 8977, 607, 4898, 5118, 523, 673, 5839, 257, 7309, 3084, 2427, 13, 5985, 278, 262] --> She remembered how annoying it is to dust her wood chair so she bought a plastic table instead. Cleaning the / 5118  chair
[3084] [3347, 12086, 703, 15774, 340, 318, 284, 8977, 607, 4898, 5118, 523, 673, 5839, 257, 7309, 3084, 2427, 13, 5985, 278, 262] --> She remembered how annoying it is to dust her wood chair so she bought a plastic table instead. Cleaning the / 3084  table
[0.04150390625] [0.326171875]
0.04150390625 0.326171875


False

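Comparing only the probabilities of the option tokens, as choose_next1() does, doesn't look very reliable: it picks the wrong answer for the chair/table prompt, and the 'mean' variant is misled by an option that spans several tokens. Let's look into other ways.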
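
One possible direction, sketched here under assumptions rather than as this notebook's eventual method: score the text that follows the blank, conditioned on each option, and sum log-probabilities instead of multiplying raw probabilities to avoid underflow. `continuation_logprob` and `toy_next_token_probs` are hypothetical names; with the real model, a wrapper around ben.model_probs would presumably take the toy function's place.

```python
import math

def continuation_logprob(context_tokens, continuation_tokens, next_token_probs):
    """Sum of log P(token | preceding tokens) over continuation_tokens.

    next_token_probs is a hypothetical callable: given a list of token ids,
    it returns a sequence of probabilities indexed by token id.
    """
    tokens = list(context_tokens)  # copy, since the context grows as we go
    total = 0.0
    for t in continuation_tokens:
        probs = next_token_probs(tokens)
        total += math.log(probs[t])
        tokens.append(t)
    return total

# Toy stand-in model over a 4-token vocabulary, for illustration only:
# it favors repeating the last token of the context.
def toy_next_token_probs(tokens):
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[tokens[-1]] = 0.7
    return probs

prefix = [0, 1]                 # tokens before the blank
rest = [2, 2]                   # tokens after the blank
option_a, option_b = [2], [3]   # the two candidate fillers

score_a = continuation_logprob(prefix + option_a, rest, toy_next_token_probs)
score_b = continuation_logprob(prefix + option_b, rest, toy_next_token_probs)
print(score_a, score_b)  # the higher score is the preferred option
```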

To be continued...