**Description**: demonstrates that the zero-shot text classification method [described here](https://stats.stackexchange.com/q/601159/337906) works ok on the [COPA task](https://people.ict.usc.edu/~gordon/copa.html). It's one of the [SuperGLUE tasks](https://super.gluebenchmark.com/tasks) in which labels have multiple tokens, in some sense.

**Estimated run time**: ~1 min.

**Environment**: See the [Setup section in the README](https://github.com/kddubey/lm-classification/#setup).

**Other**: You have to have an OpenAI API key stored in the environment variable `OPENAI_API_KEY`. [Sign up here](https://openai.com/api/). This notebook will warn you about cost before incurring any. Running the whole notebook will cost ya north of <span>$</span>2!

[Load data](#load-data)

[Write prompt](#write-prompt)

[Run model](#run-model)

[Score](#score)

[Evaluate CVS](#evaluate-cvs)

[Evaluate question](#evaluate-question)

[Evaluate single-token](#evaluate-single-token)

In [1]:
from __future__ import annotations
import logging
import sys
from typing import Literal, Sequence

import datasets as nlp_datasets
import pandas as pd

from lm_classification import classify
from lm_classification.utils import api
from utils import display_df

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [2]:
## When hitting the OpenAI endpoints, we'll log any server errors
logging.basicConfig(level=logging.INFO,
                    handlers=[logging.StreamHandler(stream=sys.stdout)],
                    format='%(asctime)s :: %(name)s :: %(levelname)s :: '
                           '%(message)s')
logger = logging.getLogger(__name__)

# Load data

For this MVP, let's evaluate on the [Choice of Plausible Alternatives (COPA) task](https://people.ict.usc.edu/~gordon/copa.html). I picked this first b/c I read it has multi-token labels, in some sense. It also looks cool. The [SuperGLUE tasks](https://super.gluebenchmark.com/tasks) in general are pretty cool.

The classification problem is to pick the 1 of 2 alternatives which more plausibly caused or resulted from the premise. Here are two examples pulled from the website:

Example 1

> Premise: The man broke his toe. What was the CAUSE of this?
>
> Alternative 1: He got a hole in his sock.
>
> Alternative 2: He dropped a hammer on his foot.


Example 2

> Premise: I tipped the bottle. What happened as a RESULT?
>
> Alternative 1: The liquid in the bottle froze.
>
> Alternative 2: The liquid in the bottle poured out.

A classifier should predict Alternative 2 for Example 1, and Alternative 2 for Example 2.

The test set labels are hidden, so I'll score this zero-shot classifier on the train and validation sets. I didn't tune much of anything on this data.

In [3]:
def load_super_glue(task_id: str, split: str):
    return pd.DataFrame(nlp_datasets
                        .load_dataset('super_glue', task_id, split=split))


df = (pd.concat((load_super_glue('copa', 'train'),
                 load_super_glue('copa', 'validation')))
      .reset_index(drop=True)) ## the idx column is only unique w/in splits! fuhgetaboutit



In [4]:
len(df)

500

In [5]:
df.head()

| | premise | choice1 | choice2 | question | idx | label |
|---|---|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. | The sun was rising. | The grass was cut. | cause | 0 | 0 |
| 1 | The woman tolerated her friend's difficult beh... | The woman knew her friend was going through a ... | The woman felt that her friend took advantage ... | cause | 1 | 0 |
| 2 | The women met for coffee. | The cafe reopened in a new location. | They wanted to catch up with each other. | cause | 2 | 1 |
| 3 | The runner wore shorts. | The forecast predicted high temperatures. | She planned to run along the beach. | cause | 3 | 0 |
| 4 | The guests of the party hid behind the couch. | It was a surprise party. | It was a birthday party. | cause | 4 | 0 |


# Write prompt

A simple way to model COPA is to prompt an LM with (for Example 1):

```
The man broke his toe. This happened because 
```

and use the LM to estimate the probabilities of the 2 alternatives conditional on this prompt. (See the **Example** section [here](https://stats.stackexchange.com/q/601159/337906) for a full description of what "estimate the probabilities" actually means. Or, check out the code in [`lm_classification.classify.predict_proba_examples`](https://github.com/kddubey/lm-classification/blob/main/lm_classification/classify.py#L270).)
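As a rough sketch of the idea (not the package's actual implementation): suppose the LM has already returned one log-probability per completion token, each conditional on the prompt and the preceding completion tokens. One simple way to turn these into class probabilities is to average the token log-probabilities for each alternative, exponentiate, and normalize across the alternatives. The averaging step and the numbers below are assumptions made for illustration, and `completion_probs` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def completion_probs(token_logprobs: Sequence[Sequence[float]]) -> List[float]:
    """
    Turn per-token log-probabilities (one sequence per alternative) into a
    probability distribution over the alternatives.
    """
    # Average over tokens so that longer alternatives aren't penalized just
    # for having more tokens (an assumption, not necessarily what the
    # package does)
    avg_logprobs = [sum(logprobs) / len(logprobs)
                    for logprobs in token_logprobs]
    likelihoods = [math.exp(avg_logprob) for avg_logprob in avg_logprobs]
    total = sum(likelihoods)
    return [likelihood / total for likelihood in likelihoods]


# Made-up token log-probs for the 2 alternatives in Example 1
logprobs_alt1 = [-4.1, -2.3, -5.0, -3.2, -1.9, -2.7]
logprobs_alt2 = [-3.0, -1.2, -2.5, -0.8, -1.1, -0.9, -0.4]
probs = completion_probs([logprobs_alt1, logprobs_alt2])
pred = probs.index(max(probs))  # index of the more probable alternative
```

With these made-up numbers, `pred` is 1, i.e., Alternative 2.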

A potential issue with this approach is that any 2 alternatives are gonna have really low probabilities. As a result, discriminating between alternatives may be, numerically and statistically, a bad idea. But hey, that's why this notebook is here: let's see if this issue really does tank accuracy. And even if it does, we could always provide the alternatives in the prompt, as would be done w/ a sampling approach.
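The numerical half of this concern can be sidestepped by comparing alternatives in log space. The sketch below uses made-up, very negative log-probabilities whose exponentials underflow to 0 in floating point. Subtracting the max before exponentiating (the log-sum-exp trick) still gives a usable distribution. `normalize_logprobs` is a hypothetical helper, not part of `lm_classification`.

```python
import math
from typing import List, Sequence


def normalize_logprobs(logprobs: Sequence[float]) -> List[float]:
    """Softmax over log-probabilities, computed stably."""
    max_logprob = max(logprobs)
    # Shifting by the max leaves the ratios unchanged but keeps exp() in range
    shifted = [math.exp(logprob - max_logprob) for logprob in logprobs]
    total = sum(shifted)
    return [value / total for value in shifted]


# Made-up total log-probs of two long alternatives
logprobs = [-800.0, -802.0]
naive = [math.exp(logprob) for logprob in logprobs]  # both underflow to 0.0
probs = normalize_logprobs(logprobs)  # still well-defined
```

Whether the resulting probabilities are statistically meaningful is a separate question, which is what the rest of this notebook checks empirically.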

In [6]:
def prompt(premise: str, question: Literal['cause', 'effect']):
    if question == 'cause':
        preface = 'This happened because' ## spaces are added when constructing Examples
    elif question == 'effect':
        preface = 'As a result,'
    else:
        raise ValueError( "question must be 'cause' or 'effect'. Got "
                         f'{question}.')
    return f'{premise} {preface}'

In [7]:
df['prompt'] = [prompt(premise, question)
                for premise, question
                in zip(df['premise'], df['question'])]

In [8]:
display_df(df, columns=['prompt', 'choice1', 'choice2', 'label'])

| | prompt | choice1 | choice2 | label |
|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. This happened because | The sun was rising. | The grass was cut. | 0 |
| 1 | The woman tolerated her friend's difficult behavior. This happened because | The woman knew her friend was going through a hard time. | The woman felt that her friend took advantage of her kindness. | 0 |
| 2 | The women met for coffee. This happened because | The cafe reopened in a new location. | They wanted to catch up with each other. | 1 |


Note: we need to lowercase the choices b/c each one continues the sentence started by the prompt. We could also strip the period, but I wanna see what happens if it's included. Ideally, minor but consistent things like that aren't an issue.

# Run model

Note that for many SuperGLUE datasets, including COPA, the probability distribution over classes (alternative 1, 2 for COPA) is uniform. So we'll use `prior=None`.
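For intuition, here's a sketch of what a non-uniform prior could do, assuming it's applied by multiplying each class's likelihood by its prior probability and renormalizing. That's an assumption about the package's internals, and `apply_prior` is a hypothetical helper. With a uniform prior the reweighting cancels out, which is why `prior=None` is enough here.

```python
from typing import List, Optional, Sequence


def apply_prior(likelihoods: Sequence[float],
                prior: Optional[Sequence[float]] = None) -> List[float]:
    """Reweight class likelihoods by a prior and renormalize."""
    if prior is None:  # treat as uniform
        prior = [1 / len(likelihoods)] * len(likelihoods)
    if len(prior) != len(likelihoods):
        raise ValueError('prior and likelihoods must have the same length.')
    weighted = [likelihood * prior_prob
                for likelihood, prior_prob in zip(likelihoods, prior)]
    total = sum(weighted)
    return [value / total for value in weighted]


likelihoods = [0.02, 0.06]  # made-up values for alternatives 1 and 2
probs_uniform = apply_prior(likelihoods)  # same as plain normalization
probs_skewed = apply_prior(likelihoods, prior=[0.9, 0.1])  # favors class 0
```

In this made-up case, the skewed prior is strong enough to flip the predicted class from alternative 2 to alternative 1.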

In [9]:
examples = [classify.Example(prompt=record['prompt'],
                             completions=(record['choice1'].lower(),
                                          record['choice2'].lower()),
                             prior=None)
            for record in df.to_dict('records')]

In [10]:
len(examples)

500

In [11]:
## 500 examples * 2 classes = 1000 OpenAI API requests
## $0.35
pred_probs = classify.predict_proba_examples(examples,
                                             model='text-davinci-003',
                                             ask_if_ok=True)

Computing probs:   0%|          | 0/1000 [00:00<?, ?it/s]

# Score

For COPA, the scoring metric is accuracy.

In [12]:
(pred_probs.argmax(axis=1) == df['label']).mean()

0.916

To put this number in context, we'll evaluate zero-shot classification via sampling (CVS) on `text-davinci-003`.

But first, let's see how zero-shot curie performs.

In [13]:
## $0.04
pred_probs_curie = classify.predict_proba_examples(examples,
                                                   model='text-curie-001',
                                                   ask_if_ok=True)

Computing probs:   0%|          | 0/1000 [00:00<?, ?it/s]

In [14]:
(pred_probs_curie.argmax(axis=1) == df['label']).mean()

0.82

# Evaluate CVS

COPA isn't a great demo for this approach b/c there's a trivial way to transform multi-token labels to single tokens: just point to each choice with a single letter!

For example:

```
The man broke his toe. This happened because
A. He got a hole in his sock.
B. He dropped a hammer on his foot.
Answer A or B.
```

This prompt is a multiple choice question. And it could probably work well for most of the SuperGLUE tasks, because most of them are classification tasks with just 2 or 3 classes.

In [15]:
def prompt_mc(premise: str, question: Literal['cause', 'effect'],
              choice1: str, choice2: str):
    if question == 'cause':
        preface = 'This happened because'
    elif question == 'effect':
        preface = 'As a result'
    else:
        raise ValueError( "question must be 'cause' or 'effect'. Got "
                         f'{question}.')
    return (f'{premise} {preface}\n'
            f'A. {choice1}\n'
            f'B. {choice2}\n'
             'Answer A or B.')


df['prompt_mc'] = [prompt_mc(record['premise'], record['question'],
                             record['choice1'], record['choice2'])
                   for record in df.to_dict('records')]


display_df(df, columns=['prompt_mc', 'label'])

| | prompt_mc | label |
|---|---|---|
| 0 | My body cast a shadow over the grass. This happened because A. The sun was rising. B. The grass was cut. Answer A or B. | 0 |
| 1 | The woman tolerated her friend's difficult behavior. This happened because A. The woman knew her friend was going through a hard time. B. The woman felt that her friend took advantage of her kindness. Answer A or B. | 0 |
| 2 | The women met for coffee. This happened because A. The cafe reopened in a new location. B. They wanted to catch up with each other. Answer A or B. | 1 |


In [16]:
## $0.41
choices = api.gpt3_complete(df['prompt_mc'],
                            ask_if_ok=True,
                            model='text-davinci-003',
                            max_tokens=5,
                            logprobs=1)

Computing probs:   0%|          | 0/500 [00:00<?, ?it/s]

In [17]:
completions_mc = [choice['text'] for choice in choices]

In [18]:
def process_completion(completion: str, class_chars: Sequence[str],
                       strip_chars: str=' \n', default=-1) -> int:
    if any(len(class_char) != 1 for class_char in class_chars):
        raise ValueError('Elements of class_chars must be a single character.')
    completion_stripped = completion.strip(strip_chars)
    if not completion_stripped:
        return default
    completion_char_lower = completion_stripped[0].lower()
    class_chars_lower = [class_char.lower() for class_char in class_chars]
    try:
        return class_chars_lower.index(completion_char_lower)
    except ValueError:
        return default

In [19]:
class_chars = ('A', 'B')
pred_classes_cvs = [process_completion(completion, class_chars)
                    for completion in completions_mc]

Check that all of the sampled completions could be mapped to a label 0 or 1:

In [20]:
assert (pd.Series(pred_classes_cvs) != -1).all()

In [21]:
(pred_classes_cvs == df['label']).mean()

0.962

This performance boost seems pretty significant, which is a bit of a bummer. CVS should also be a bit cheaper, so for COPA there isn't a good reason to use this package instead of CVS.
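To put a rough number on "pretty significant", here's a back-of-the-envelope check using the normal approximation to the binomial, with the accuracies and sample size from above. It treats the two sets of predictions as independent even though they're made on the same 500 examples, so it's only a sketch. A paired test such as McNemar's would be more appropriate.

```python
import math


def accuracy_std_error(accuracy: float, num_examples: int) -> float:
    """Standard error of an accuracy estimate (binomial, normal approx)."""
    return math.sqrt(accuracy * (1 - accuracy) / num_examples)


num_examples = 500
acc_multi_token = 0.916  # text-davinci-003, multi-token completions
acc_cvs = 0.962          # text-davinci-003, CVS

se_multi_token = accuracy_std_error(acc_multi_token, num_examples)
se_cvs = accuracy_std_error(acc_cvs, num_examples)
# Standard error of the difference, assuming independence (a simplification)
se_diff = math.sqrt(se_multi_token**2 + se_cvs**2)
z_score = (acc_cvs - acc_multi_token) / se_diff
print(f'difference: {acc_cvs - acc_multi_token:.3f}, z: {z_score:.2f}')
```

Under these assumptions the z-score comes out at about 3, so the gap is unlikely to be just noise.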

Let's see how CVS w/ `text-curie-001` performs. Hypothesis: won't work well b/c I don't think this model was trained w/ human feedback, which makes the prompt unfamiliar.

In [22]:
## $0.04
choices_curie = api.gpt3_complete(df['prompt_mc'],
                                  ask_if_ok=True,
                                  model='text-curie-001',
                                  max_tokens=5,
                                  logprobs=1)

Computing probs:   0%|          | 0/500 [00:00<?, ?it/s]

In [23]:
completions_mc_curie = [choice['text'] for choice in choices_curie]
pred_classes_cvs_curie = [process_completion(completion, class_chars)
                          for completion in completions_mc_curie]

Let's see how many of these sampled completions are actually "valid", i.e., in the label set:

In [24]:
pred_classes_cvs_curie = pd.Series(pred_classes_cvs_curie, index=df.index)
(pred_classes_cvs_curie != -1).mean()

0.53

In [25]:
(pred_classes_cvs_curie == df['label']).mean()

0.282

Ouch, much worse than random guessing. Let's see how often the valid completions are accurate.

In [26]:
_mask_valid = pred_classes_cvs_curie != -1
(pred_classes_cvs_curie[_mask_valid]
 ==
 df.loc[_mask_valid, 'label']).mean()

0.5320754716981132
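These three numbers fit together: overall accuracy is the fraction of valid completions times the accuracy among them, b/c an invalid completion (mapped to -1) can never match a label. A quick check using the outputs above:

```python
num_examples = 500
frac_valid = 0.53                      # fraction of completions in the label set
acc_given_valid = 0.5320754716981132   # accuracy among valid completions

num_valid = round(frac_valid * num_examples)      # number of valid completions
num_correct = round(acc_given_valid * num_valid)  # valid and correct
acc_overall = num_correct / num_examples
print(num_valid, num_correct, acc_overall)
```

That works out to 265 valid completions, 141 of them correct, and an overall accuracy of 0.282. So even when curie does answer with A or B, it's barely better than a coin flip.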

# Evaluate question

There are different ways to format a prompt-completion problem. Let's see how performance changes by formatting the problem as a question:

```
The man broke his toe. What was the cause of this? 
```

In [27]:
def prompt_question(premise: str, question: Literal['cause', 'effect']):
    if question == 'cause':
        question_ = 'What was the cause of this?'
    elif question == 'effect':
        question_ = 'What happened as a result?'
    else:
        raise ValueError( "question must be 'cause' or 'effect'. Got "
                         f'{question}.')
    return f'{premise} {question_}'


df['prompt_question'] = [prompt_question(premise, question)
                         for premise, question
                         in zip(df['premise'], df['question'])]


display_df(df, columns=['prompt_question', 'choice1', 'choice2', 'label'])

| | prompt_question | choice1 | choice2 | label |
|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. What was the cause of this? | The sun was rising. | The grass was cut. | 0 |
| 1 | The woman tolerated her friend's difficult behavior. What was the cause of this? | The woman knew her friend was going through a hard time. | The woman felt that her friend took advantage of her kindness. | 0 |
| 2 | The women met for coffee. What was the cause of this? | The cafe reopened in a new location. | They wanted to catch up with each other. | 1 |


To stay in line with the way `text-davinci-003` was trained (with human feedback), let's use the correct separator between prompts and completions:

In [28]:
classify.end_of_text

' <|endoftext|>\n\n'

In [29]:
examples_question = [classify.Example(prompt=record['prompt_question'],
                                      completions=(record['choice1'],
                                                   record['choice2']),
                                      prior=None,
                                      end_of_prompt=classify.end_of_text)
                     for record in df.to_dict('records')]
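To make the role of the separator concrete, here's a sketch of the text the LM would presumably see for each alternative, assuming the prompt, `end_of_prompt`, and completion are simply concatenated. That's an assumption, as is the single-space default. `full_text` is a hypothetical helper, and the separator string is copied from the `classify.end_of_text` output above.

```python
END_OF_TEXT = ' <|endoftext|>\n\n'  # value of classify.end_of_text shown above


def full_text(prompt: str, completion: str, end_of_prompt: str = ' ') -> str:
    """Join a prompt and one completion (hypothetical; assumes concatenation)."""
    return prompt + end_of_prompt + completion


prompt = 'The man broke his toe. What was the cause of this?'
for completion in ('He got a hole in his sock.',
                   'He dropped a hammer on his foot.'):
    # repr() makes the separator's whitespace visible
    print(repr(full_text(prompt, completion, end_of_prompt=END_OF_TEXT)))
```

Note that the choices aren't lowercased here: each one is a standalone answer to the question instead of a continuation of the prompt's sentence.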

In [30]:
## 500 examples * 2 classes = 1000 OpenAI API requests
## $0.49
pred_probs_question = classify.predict_proba_examples(examples_question,
                                                      model='text-davinci-003',
                                                      ask_if_ok=True)

Computing probs:   0%|          | 0/1000 [00:00<?, ?it/s]

In [31]:
(pred_probs_question.argmax(axis=1) == df['label']).mean()

0.854

# Evaluate single-token

Let's see how the single-token transformation performs for COPA. My hypothesis is that it'll perform similarly to the multi-token approach. It could perform better simply b/c there's no weird aggregation of probabilities. I wouldn't be bummed if it performed better, b/c if I could control the backend, there'd still be a usability and computational benefit to returning probabilities for A and B instead of sampling from all possible tokens.
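Here's a sketch of that last idea, assuming the backend exposes log-probabilities for candidate next tokens after the prompt: keep only the label tokens and renormalize. The token log-probs are made up and `label_probs` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence


def label_probs(next_token_logprobs: Dict[str, float],
                labels: Sequence[str]) -> List[float]:
    """Renormalize next-token probabilities over just the label tokens."""
    likelihoods = [math.exp(next_token_logprobs[label]) for label in labels]
    total = sum(likelihoods)
    return [likelihood / total for likelihood in likelihoods]


# Made-up next-token log-probs after a multiple choice prompt
next_token_logprobs = {'A': -2.5, 'B': -0.3, 'The': -3.0, '\n': -4.0}
probs = label_probs(next_token_logprobs, labels=('A', 'B'))
pred = probs.index(max(probs))  # index into labels
```

Unlike sampling, this can never produce a completion outside the label set, so there's nothing to post-process.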

In [32]:
examples_mc = [classify.Example(prompt=record['prompt_mc'],
                                completions=('A', 'B'),
                                prior=None,
                                end_of_prompt=classify.end_of_text)
               for record in df.to_dict('records')]

In [33]:
## 500 examples * 2 classes = 1000 OpenAI API requests
## $0.81
pred_probs_mc = classify.predict_proba_examples(examples_mc,
                                                model='text-davinci-003',
                                                ask_if_ok=True)

Computing probs:   0%|          | 0/1000 [00:00<?, ?it/s]

In [34]:
(pred_probs_mc.argmax(axis=1) == df['label']).mean()

0.92

Conclusion: the single-token transformation performs slightly better than the multi-token approach (0.92 vs. 0.916). But it's worse than [CVS](#evaluate-cvs) (0.962), which is difficult to explain. I should understand more about how sampling works. Ideally, a single `model()` call will get us all the data we need, and there's no need to sample.