**Description**: demonstrates that the zero-shot text classification method [described here](https://stats.stackexchange.com/q/601159/337906) works ok. Currently just evaluates the method on the [COPA task](https://people.ict.usc.edu/~gordon/copa.html), since it's one of the [SuperGLUE tasks](https://super.gluebenchmark.com/tasks) in which labels have multiple tokens, in some sense.

**Estimated run time**: ~1 min.

**Environment**: See [`requirements.txt`](https://github.com/kddubey/lm-classification/blob/main/requirements.txt).

**Other**: You have to have an OpenAI API key stored in the environment variable `OPENAI_API_KEY`. [Sign up here](https://openai.com/api/). This notebook will warn you about cost before incurring any. It'll cost ya about <span>$</span>0.50.

[Load data](#load-data)

[Write prompt](#write-prompt)

[Run model](#run-model)

In [1]:
import logging
import sys

import datasets as nlp_datasets
from IPython.display import display
import pandas as pd

from lm_classification import classify
from lm_classification.utils import gpt2_tokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [2]:
## When hitting the OpenAI endpoints, we'll log any server errors
logging.basicConfig(level=logging.INFO,
                    handlers=[logging.StreamHandler(stream=sys.stdout)],
                    format='%(asctime)s :: %(name)s :: %(levelname)s :: '
                           '%(message)s')
logger = logging.getLogger(__name__)

# Load data

For this MVP, let's evaluate on the [Choice of Plausible Alternatives (COPA) task](https://people.ict.usc.edu/~gordon/copa.html). I picked this first b/c I read it has multi-token labels, in some sense. It also looks cool. The [SuperGLUE tasks](https://super.gluebenchmark.com/tasks) in general are pretty cool.

The classification problem is to pick the 1 of 2 alternatives which more plausibly caused, or resulted from, the premise. Here are two examples pulled from the website:

Example 1

> Premise: The man broke his toe. What was the CAUSE of this?
>
> Alternative 1: He got a hole in his sock.
>
> Alternative 2: He dropped a hammer on his foot.


Example 2

> Premise: I tipped the bottle. What happened as a RESULT?
>
> Alternative 1: The liquid in the bottle froze.
>
> Alternative 2: The liquid in the bottle poured out.

A classifier should predict Alternative 2 for Example 1, and Alternative 2 for Example 2.

The test set labels are hidden, so I'll score this zero-shot classifier on the train and validation sets. I promise you I didn't tune anything on this data. There ain't much to tune to begin with :-) 

In [3]:
def load_super_glue(task_id: str, split: str):
    return pd.DataFrame(nlp_datasets
                        .load_dataset('super_glue', task_id, split=split))


copa_df = (pd.concat((load_super_glue('copa', 'train'),
                      load_super_glue('copa', 'validation')))
           .reset_index(drop=True)) ## the idx column is only unique w/in splits! fuhgetaboutit



In [4]:
len(copa_df)

500

In [5]:
copa_df.head()

| | premise | choice1 | choice2 | question | idx | label |
|---|---|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. | The sun was rising. | The grass was cut. | cause | 0 | 0 |
| 1 | The woman tolerated her friend's difficult beh... | The woman knew her friend was going through a ... | The woman felt that her friend took advantage ... | cause | 1 | 0 |
| 2 | The women met for coffee. | The cafe reopened in a new location. | They wanted to catch up with each other. | cause | 2 | 1 |
| 3 | The runner wore shorts. | The forecast predicted high temperatures. | She planned to run along the beach. | cause | 3 | 0 |
| 4 | The guests of the party hid behind the couch. | It was a surprise party. | It was a birthday party. | cause | 4 | 0 |


# Write prompt

A simple way to model COPA is to prompt an LM with (for Example 1):

```
The man broke his toe. What was the cause of this?
```

and use the LM to estimate the probabilities of the 2 alternatives conditional on this prompt. (See the **Example** section [here](https://stats.stackexchange.com/q/601159/337906) for a full description of what "estimate the probabilities" actually means. Or, check out the code in [`lm_classification.classify.predict_proba_examples`](https://github.com/kddubey/lm-classification/blob/main/lm_classification/classify.py#L219).)

A potential issue with this approach is that any 2 alternatives are gonna have really low probabilities. As a result, discriminating between alternatives may be, numerically and statistically, a bad idea. But hey, that's why this notebook is here: let's see if this issue really does tank accuracy. And even if it does, we could always provide the alternatives in the prompt, as would be done w/ a sampling approach.
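To make the scoring and the low-probability worry concrete, here's a minimal sketch. It assumes each completion is scored by averaging its token log-probabilities given the prompt, and that the scores are then normalized across completions. That's an assumption about the method, not a copy of the package code. The token log-probabilities and the helper names (`avg_log_prob`, `normalize`) are made up for illustration.

```python
import math


def avg_log_prob(token_log_probs):
    """Average log-probability of a completion's tokens given the prompt."""
    return sum(token_log_probs) / len(token_log_probs)


def normalize(log_scores):
    """Softmax over log-scores; subtracting the max avoids underflow."""
    max_score = max(log_scores)
    exps = [math.exp(score - max_score) for score in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


## Hypothetical token log-probs for the 2 alternatives of Example 1
alt1 = [-4.1, -3.0, -5.2, -2.7, -6.0]        ## "He got a hole in his sock."
alt2 = [-3.2, -2.1, -1.9, -2.5, -1.4, -0.9]  ## "He dropped a hammer on his foot."

## The raw sequence probabilities are both tiny...
raw_probs = [math.exp(sum(alt1)), math.exp(sum(alt2))]

## ...but comparing them in log space and normalizing is still well-behaved
pred_probs = normalize([avg_log_prob(alt1), avg_log_prob(alt2)])
pred_class = pred_probs.index(max(pred_probs))
```

With these made-up numbers, both raw probabilities are below 1e-5, yet the normalized scores clearly favor the second alternative. Whether that holds up on real data is what the rest of the notebook checks.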

In [6]:
def prompt_copa(premise: str, question: str):
    if question == 'cause':
        question_ = 'What was the cause of this?'
    elif question == 'effect':
        question_ = 'What happened as a result?'
    else:
        raise ValueError( "question must be 'cause' or 'effect'. Got "
                         f'{question}.')
    return (f'{premise} {question_}')

In [7]:
copa_df['prompt'] = [prompt_copa(premise, question)
                     for premise, question
                     in zip(copa_df['premise'], copa_df['question'])]

In [8]:
_num_examples_displayed = 3
with pd.option_context('display.max_colwidth', None): ## None = don't truncate. -1 is no longer accepted by recent pandas
    display(copa_df
            [['prompt', 'choice1', 'choice2', 'label']]
            .head(_num_examples_displayed)
            .style ## render the newlines in the printed output
            .set_properties(**{'text-align': 'left',
                               'white-space': 'pre-wrap'}))

| | prompt | choice1 | choice2 | label |
|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. What was the cause of this? | The sun was rising. | The grass was cut. | 0 |
| 1 | The woman tolerated her friend's difficult behavior. What was the cause of this? | The woman knew her friend was going through a hard time. | The woman felt that her friend took advantage of her kindness. | 0 |
| 2 | The women met for coffee. What was the cause of this? | The cafe reopened in a new location. | They wanted to catch up with each other. | 1 |


# Run model

Note that for many SuperGLUE datasets, including COPA, the probability distribution over classes (alternative 1, 2 for COPA) is uniform. So we'll use `prior=None`.
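As a rough illustration of why a uniform prior can be dropped, the sketch below multiplies made-up completion scores by a prior and normalizes. The function `posterior` and the numbers are hypothetical, and how `classify.Example` uses `prior` internally is an assumption here.

```python
def posterior(likelihoods, prior=None):
    """Bayes' rule over classes. prior=None is treated as uniform."""
    if prior is None:
        prior = [1 / len(likelihoods)] * len(likelihoods)
    unnormalized = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


likelihoods = [0.1, 0.3]  ## made-up scores for choice1, choice2

no_prior = posterior(likelihoods)                    ## same as uniform
uniform = posterior(likelihoods, prior=[0.5, 0.5])
skewed = posterior(likelihoods, prior=[0.9, 0.1])    ## a non-uniform prior can flip the prediction
```

A uniform prior cancels out in the normalization, so leaving it as `None` gives the same predicted probabilities. Only a non-uniform prior changes them.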

In [9]:
copa_examples = [classify.Example(prompt=record['prompt'],
                                  completions=(record['choice1'],
                                               record['choice2']),
                                  prior=None)
                 for record in copa_df.to_dict('records')]

In [10]:
len(copa_examples)

500

This next cell warns you about price, and asks if you're ready to pay.

(small TODO: I forgot what `input` does in Jupyter notebooks. I use VS Code notebooks now, and it just asks you to press the Enter key to proceed.)

In [11]:
all_tokens = gpt2_tokenizer([example.prompt + classify.END_OF_PROMPT + completion
                             for example in copa_examples
                             for completion in example.completions])
num_tokens = sum(len(tokens) for tokens in all_tokens['input_ids'])
cost_per_1k_tokens = 0.02 ## https://openai.com/api/pricing/
cost = round(num_tokens * cost_per_1k_tokens / 1_000, 2)

output = input(f'The next cell will cost you ${cost}. Proceed?')
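The cell above stores the answer in `output` but doesn't act on it. One way to gate the paid call is sketched below. `confirm_cost` is a hypothetical helper, not part of `lm_classification`, and the set of accepted answers is an arbitrary choice.

```python
def confirm_cost(cost: float, ask=input) -> bool:
    """Return True only if the user explicitly agrees to the estimated cost."""
    answer = ask(f'The next cell will cost you about ${cost:.2f}. '
                 'Proceed? [y/n] ')
    return answer.strip().lower() in ('y', 'yes')


## Usage (hypothetical): stop before any request is sent
## if not confirm_cost(cost):
##     raise RuntimeError('Aborted before calling the OpenAI API.')
```

Passing the prompt function in as `ask` keeps the helper easy to check without a live keyboard. In a notebook the default `input` is used.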

In [12]:
## 500 examples * 2 classes = 1000 OpenAI API requests
pred_probs = classify.predict_proba_examples(copa_examples,
                                             model='text-davinci-003')

Computing probs: 100%|██████████| 1000/1000 [00:24<00:00, 41.44it/s]


For COPA, the scoring metric is accuracy.

In [13]:
pred_classes = pred_probs.argmax(axis=1)
(pred_classes == copa_df['label']).mean()

0.854

According to the [leaderboards](https://super.gluebenchmark.com/submission/MoBKq4x8olYosIAnGvgF3NrRrpl2/-M8-O2ZNnHOirgs_EsH4), a few-shot sampling approach ("priming" GPT-3 w/ 32 training examples) gets 0.92 on the test set. So this method does ok.
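For a rough sense of the sampling noise in that 0.854, here's a back-of-the-envelope sketch using the normal approximation to a binomial proportion. It treats the 500 examples as independent draws, which is an assumption, and it says nothing about the train/validation vs. test set difference.

```python
import math

accuracy = 0.854  ## from the cell above
n = 500           ## train + validation examples

## Normal-approximation standard error of a binomial proportion
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
margin = 1.96 * std_err  ## half-width of an approximate 95% interval
ci_low, ci_high = accuracy - margin, accuracy + margin

print(f'{accuracy:.3f} +/- {margin:.3f}')
```

Under these assumptions the interval is roughly 0.82 to 0.88, so the gap to the few-shot number is probably not just noise.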