**You must run this on a GPU. A T4 is sufficient. It's free on Google Colab.**

**Description**: for a 4 GB quantized Llama 2 model, this notebook demonstrates that

1. classification via sampling (CVS) is not effective
2. CAPPr is effective; it recovers 94% of the performance of OpenAI's `gpt-3.5-turbo`,
   and requires no post-processing of the output.

**Contamination notice**: I don't know whether Llama 2 was trained on any COPA data.
If it was, but there's no interaction between the method (CAPPr vs CVS) and training,
then the difference in performance between the two methods can still be studied.

**Estimated run time**: ~6 min. (CVS is slow for this model)

[Install packages](#install-packages)

[Load model and tokenizer](#load-model-and-tokenizer)

[Utils](#utils)

[Load data](#load-data)

[The problem](#the-problem)

[Write prompt](#write-prompt)

[The solution](#the-solution)

# Install packages

In [1]:
# check correct CUDA version
import torch

_cuda_version = torch.version.cuda
_msg = (
    "Change the pip install auto-gptq command to the one for "
    f"{_cuda_version} based on the list here: "
    "https://github.com/PanQiWei/AutoGPTQ#quick-installation"
)

assert _cuda_version == "11.8", _msg

I don't wanna pay for renting an A100 so I need to use a semi-aggressively quantized
model. Something which fits on a T4. Need the latest `transformers`, `auto-gptq`, and
`optimum` according to this [HF blog
post](https://huggingface.co/blog/gptq-integration#autogptq-library--the-one-stop-library-for-efficiently-leveraging-gptq-for-llms).

In [None]:
!python -m pip install cappr[demos] \
transformers==4.33.0 \
auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ \
optimum

In [3]:
from __future__ import annotations
from pprint import pprint
from typing import Literal

import datasets
import pandas as pd
from tqdm.auto import tqdm

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
)

from cappr import Example
from cappr.huggingface import classify

# Load model and tokenizer

In [4]:
_msg = (
    "This should probably be run on a GPU. A T4 instance is sufficient for the models "
    "tested here."
)
assert torch.cuda.is_available(), _msg

In [5]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [6]:
model_id = "TheBloke/Llama-2-7B-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [7]:
# warm up model
_ = model(**tokenizer(["warm up"], return_tensors="pt").to(DEVICE))

# Utils

Copied from [here](https://github.com/kddubey/cappr/blob/main/demos/utils.py) so that this notebook can be run anywhere.

In [8]:
from __future__ import annotations
from typing import Optional, Union

from IPython.display import display
import pandas as pd


def display_df(
    df: pd.DataFrame,
    columns: Optional[list[str]] = None,
    num_rows: Union[int, None] = 3,
):
    """
    Displays `df.head(num_rows)[columns]` without truncating columns. If
    possible, render any newlines.
    """
    if columns is None:
        columns = df.columns
    if num_rows is None:
        num_rows = len(df)
    df_head_styled = df.head(num_rows)[columns].style
    with pd.option_context("max_colwidth", None):  # None means no truncation
        # I'm not sure why try-except doesn't work w/ display(), so instead
        # check the necessary uniqueness condition before running it
        if df.index.is_unique:
            display(
                df_head_styled.set_properties(
                    **{"text-align": "left", "white-space": "pre-wrap"}
                )
            )
        else:
            # `Styler.apply` and `.applymap` are not compatible with non-unique
            # index or columns
            display(df_head_styled)


def remove_suffix(string: str, suffix: str) -> str:
    # guard against an empty suffix: string[:-0] would return ""
    if suffix and string.endswith(suffix):
        return string[: -len(suffix)]
    return string


def remove_prefix(string: str, prefix: str) -> str:
    if string.startswith(prefix):
        return string[len(prefix) :]
    return string

# Load data

For this MVP, let's evaluate on the [Choice of Plausible Alternatives (COPA) task](https://people.ict.usc.edu/~gordon/copa.html). I picked this first b/c I read it has multi-token labels, in some sense. It also looks cool.

The classification problem is to pick 1 of 2 alternatives which caused or resulted in the premise. Here are two examples pulled from the website:

Example 1

> Premise: The man broke his toe. What was the CAUSE of this?
>
> Alternative 1: He got a hole in his sock.
>
> Alternative 2: He dropped a hammer on his foot.


Example 2

> Premise: I tipped the bottle. What happened as a RESULT?
>
> Alternative 1: The liquid in the bottle froze.
>
> Alternative 2: The liquid in the bottle poured out.

A classifier should predict Alternative 2 for Example 1, and Alternative 2 for Example 2.

The test set labels are hidden, so I'll score this zero-shot classifier on the train and validation sets.

In [9]:
def load_super_glue(task_id: str, split: str) -> pd.DataFrame:
    return pd.DataFrame(datasets.load_dataset("super_glue", task_id, split=split))


# takes about 12 seconds
# idx column is only unique w/in splits! fuhgetaboutit
df = pd.concat(
    (load_super_glue("copa", "train"), load_super_glue("copa", "validation"))
).reset_index(drop=True)

In [10]:
len(df)

500

In [11]:
df.head()

| | premise | choice1 | choice2 | question | idx | label |
|---|---|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. | The sun was rising. | The grass was cut. | cause | 0 | 0 |
| 1 | The woman tolerated her friend's difficult beh... | The woman knew her friend was going through a ... | The woman felt that her friend took advantage ... | cause | 1 | 0 |
| 2 | The women met for coffee. | The cafe reopened in a new location. | They wanted to catch up with each other. | cause | 2 | 1 |
| 3 | The runner wore shorts. | The forecast predicted high temperatures. | She planned to run along the beach. | cause | 3 | 0 |
| 4 | The guests of the party hid behind the couch. | It was a surprise party. | It was a birthday party. | cause | 4 | 0 |


# The problem

The most straightforward way to solve this task is to point to each alternative with a single letter, and hope that the LM samples/generates the correct letter. The prompt is effectively a multiple choice question. This approach falls under the umbrella of "classification via sampling" (CVS).

For example:

```
The man broke his toe because
A. He got a hole in his sock.
B. He dropped a hammer on his foot.
Answer A or B.
```

In [12]:
def _conjunction(question: Literal["cause", "effect"]):
    if question == "cause":
        return " because"
    elif question == "effect":
        return ", so"
    else:
        raise ValueError("question must be 'cause' or 'effect'. Got " f"{question}.")


def prompt(premise: str, question: Literal["cause", "effect"]):
    conjunction = _conjunction(question)
    return f'{premise.strip(". ")}{conjunction}'

In [13]:
def prompt_mc(
    premise: str, question: Literal["cause", "effect"], choice1: str, choice2: str
):
    return (
        f"{prompt(premise, question)}\n"
        f"A. {choice1}\n"
        f"B. {choice2}\n"
        "Answer A or B."
    )


df["prompt_mc"] = [
    prompt_mc(
        record["premise"], record["question"], record["choice1"], record["choice2"]
    )
    for record in df.to_dict("records")
]


display_df(df, columns=["prompt_mc", "label"])

| | prompt_mc | label |
|---|---|---|
| 0 | My body cast a shadow over the grass because A. The sun was rising. B. The grass was cut. Answer A or B. | 0 |
| 1 | The woman tolerated her friend's difficult behavior because A. The woman knew her friend was going through a hard time. B. The woman felt that her friend took advantage of her kindness. Answer A or B. | 0 |
| 2 | The women met for coffee because A. The cafe reopened in a new location. B. They wanted to catch up with each other. Answer A or B. | 1 |


(It turns out that GitHub doesn't render the newlines, but I promise they're there!)

Set up a huggingface text generator:

In [14]:
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
# pad to allow batching. these get masked out ofc
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

We need to create a PyTorch Dataset to batch the inputs.

In [15]:
class TextsDataset(torch.utils.data.Dataset):
    def __init__(self, texts: list[str]):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index: int):
        return self.texts[index]

In [16]:
sequences = generator(
    TextsDataset(df["prompt_mc"].tolist()),
    do_sample=True,
    num_return_sequences=1,
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=10,
    batch_size=1,
)

Welp, setting `batch_size` > 1 causes numerical errors. This next cell is gonna take ~5 min :-(

In [17]:
responses = []
for seq in tqdm(sequences, total=len(df), desc="Sampling"):
    responses.append(seq[0]["generated_text"])




In [18]:
responses_cleaned = [
    remove_prefix(response, prompt_mc)
    for prompt_mc, response in zip(df["prompt_mc"], responses)
]

In [19]:
pd.Series(responses_cleaned).sample(n=10)

123                                    [2x3=6] Note: The
112               When in doubt, trust your instinct.\nA
278        \nIf an individual has been hired temporarily
172             \n9. Many people are of the opinion that
36            We know all, but a is more appropriate for
310                          \n1. We can’t have any more
474                                          \n(1) 30.04
97                   \nIf the speaker says, “Do you know
330                       \nC. This is a test and is not
318     Please note that answers\nshould not be a mix...
dtype: object

For this relatively tame dataset, only a fraction of the LM's responses contain a letter which is easy to parse. Computing accuracy is kind of pointless; we don't even have real predictions.
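To put a rough number on "only a fraction", one could try to parse a letter out of each cleaned response. Here's a minimal sketch. The `parse_letter` helper and its regex are hypothetical (they're not part of CAPPr or this notebook), and the last response in the list is a made-up well-formed answer added for contrast.

```python
import re
from typing import Optional


def parse_letter(response: str) -> Optional[str]:
    """
    Hypothetical parser: returns "A" or "B" if the response starts with that
    letter as a standalone token (ignoring leading whitespace), else None.
    """
    match = re.match(r"\s*([AB])\b", response)
    return match.group(1) if match else None


sample_responses = [
    # three of the sampled responses printed above
    "[2x3=6] Note: The",
    "\nC. This is a test and is not",
    "When in doubt, trust your instinct.\nA",
    # made-up well-formed answer, for contrast
    "B. He dropped a hammer on his foot.",
]
parsed = [parse_letter(response) for response in sample_responses]
fraction_parsed = sum(letter is not None for letter in parsed) / len(parsed)
print(parsed, fraction_parsed)
```

A stricter or looser regex would change the fraction, which is part of the problem: the "prediction" depends on post-processing choices.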

Unfortunately, for smaller or less-instruction-trained LMs, CVS can raise more problems than it solves. This result aligns with the one found in the [demo for `text-curie-001`](https://github.com/kddubey/cappr/blob/main/demos/superglue/copa.ipynb); sampling structured outputs from smaller language models is not statistically performant.

What do we do?

# Write prompt

We should take advantage of the fact that models like Llama 2 were extensively trained for a simple task: predict the next token.

A simple way to model COPA is to prompt an LM with (for Example 1):

```
The man broke his toe because
```

and use the LM to estimate the probabilities of the 2 alternatives conditional on this prompt.
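As a rough sketch of what "estimate the probabilities of the 2 alternatives" can mean: score each alternative by the log-probabilities the LM assigns to its tokens, aggregate them, and normalize across the alternatives. The token log-probabilities below are made-up numbers for illustration, and averaging over tokens is just one possible aggregation. I'm not claiming this is exactly what CAPPr computes.

```python
import math

# Made-up log Pr(token | prompt, previous tokens) values, for illustration only.
# A real LM would supply these for each alternative's tokens.
token_log_probs = {
    "he got a hole in his sock.": [
        -4.1, -3.0, -1.2, -5.3, -0.9, -0.4, -2.8, -0.6
    ],
    "he dropped a hammer on his foot.": [
        -4.1, -2.2, -1.0, -3.1, -0.7, -0.3, -0.5, -0.6
    ],
}

# Assumption: average over tokens so that longer alternatives aren't penalized
# just for having more tokens
avg_log_probs = {
    completion: sum(log_probs) / len(log_probs)
    for completion, log_probs in token_log_probs.items()
}

# Exponentiate and normalize so that the two scores sum to 1
scores = {completion: math.exp(avg) for completion, avg in avg_log_probs.items()}
total = sum(scores.values())
pred_probs_sketch = {
    completion: score / total for completion, score in scores.items()
}

prediction = max(pred_probs_sketch, key=pred_probs_sketch.get)
print(pred_probs_sketch, prediction)
```

The point is that the prediction is always one of the 2 alternatives, so there's nothing to parse.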

In [20]:
df["prompt"] = [
    prompt(premise, question)
    for premise, question in zip(df["premise"], df["question"])
]

In [21]:
display_df(df, columns=["prompt", "choice1", "choice2", "label"])

| | prompt | choice1 | choice2 | label |
|---|---|---|---|---|
| 0 | My body cast a shadow over the grass because | The sun was rising. | The grass was cut. | 0 |
| 1 | The woman tolerated her friend's difficult behavior because | The woman knew her friend was going through a hard time. | The woman felt that her friend took advantage of her kindness. | 0 |
| 2 | The women met for coffee because | The cafe reopened in a new location. | They wanted to catch up with each other. | 1 |


Note: each choice continues the prompt mid-sentence, so we need to lowercase the choices.
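To see what lowercasing changes, it helps to write out the text the LM effectively scores: the prompt, a separator, and then the choice. This sketch assumes the separator is a single space, and it uses `str.lower()` like the rest of this notebook does.

```python
prompt_text = "My body cast a shadow over the grass because"
choices = ("The sun was rising.", "The grass was cut.")

# Assumption: the prompt and the completion are joined by a single space
end_of_prompt = " "

for choice in choices:
    as_is = prompt_text + end_of_prompt + choice
    lowercased = prompt_text + end_of_prompt + choice.lower()
    print(f"as-is:      {as_is}")
    print(f"lowercased: {lowercased}")
```

Keep in mind that `str.lower()` lowercases every character, so proper nouns and the pronoun "I" get lowercased too.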

# The solution

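Wrap each prompt and its two lowercased alternatives in a CAPPr `Example`, then let `classify.predict_proba_examples` estimate the probability of each alternative given the prompt. Also observe that the interface feels simpler than HuggingFace's pipeline generation, which has to worry about so many more things when sampling.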

In [22]:
examples = [
    Example(
        prompt=record["prompt"],
        completions=(record["choice1"].lower(), record["choice2"].lower()),
        prior=None,
    )
    for record in df.to_dict("records")
]

In [23]:
pprint(examples[0])

Example(prompt='My body cast a shadow over the grass because',
        completions=('the sun was rising.', 'the grass was cut.'),
        prior=None,
        end_of_prompt=' ')


We have 500 examples * 2 classes = 1000 model inferences. Set the batch size to something that'll work on your machine. On a T4, `batch_size = 32` works for this data.

In [24]:
batch_size = 32

In [25]:
pred_probs = classify.predict_proba_examples(
    examples, model_and_tokenizer=(model, tokenizer), batch_size=batch_size
)



For COPA, the scoring metric is accuracy.

In [26]:
(pred_probs.argmax(axis=1) == df["label"]).mean()

0.816

This 4 GB open source model beats OpenAI's `text-curie-001`, which was 80% accurate according to the CAPPr demo [here](https://github.com/kddubey/cappr/blob/main/demos/superglue/copa.ipynb). OpenAI's `gpt-3.5-turbo` is 86.6% accurate. So we've recovered 94% of its performance by using CAPPr. Ha.
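For reference, the 94% figure is just the ratio of the two accuracies quoted above:

```python
accuracy_cappr_llama2 = 0.816  # accuracy computed in this notebook
accuracy_gpt35_turbo = 0.866  # accuracy quoted above for gpt-3.5-turbo

fraction_recovered = accuracy_cappr_llama2 / accuracy_gpt35_turbo
print(f"{fraction_recovered:.0%}")
```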