As written, this notebook expects a GPU. A T4 is sufficient, and it's free
on [Google
Colab](https://stackoverflow.com/questions/62596466/how-can-i-run-notebooks-of-a-github-project-in-google-colab/67344477#67344477).
You can technically run this notebook on a CPU (with minor adjustments), but then it'll
take hours. We'll be running the model 1000 times!

**Description**: for a [4 GB 4-bit Llama 2 chat
model](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_0.gguf)
and the [COPA](https://people.ict.usc.edu/~gordon/copa.html) classification task, this
notebook demonstrates that

1. multiple choice text generation is not effective; it recovers 69% of the accuracy of
   OpenAI's `gpt-3.5-turbo`
2. CAPPr is effective; for this dataset, CAPPr recovers 91% of the accuracy of OpenAI's
   `gpt-3.5-turbo`, and will never require post-processing of the output.

**Contamination notice**: I don't know whether Llama 2 was trained on any COPA data. If
it was, the absolute accuracies may be inflated. But as long as contamination doesn't
favor one method (CAPPr vs text generation) over the other, the difference between
their performances can still be studied.

**Estimated run time**: ~10 min.

[Install packages](#install-packages)

[Download model](#download-model)

[Utils](#utils)

[Load data](#load-data)

[The problem](#the-problem)

[Write prompt](#write-prompt)

[The solution](#the-solution)

# Install packages

For CPU, just do

```
!pip install llama-cpp-python
```

For GPU (ty [this comment](https://github.com/ggerganov/llama.cpp/issues/128#issuecomment-1604696753)):

In [None]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

I'm gonna install `cappr` from source b/c sometimes I use this notebook to statistically
gut check code changes.

I'll also install the `demos` extras for NLP datasets.

In your local env, you'd just do:

```
pip install "cappr[llama-cpp]"
```

In [None]:
!pip install "cappr[demos] @ git+https://github.com/kddubey/cappr.git"

# Download model

The model is a [4 GB 4-bit Llama 2 chat
model](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_0.gguf) with 7B parameters.

In [3]:
!huggingface-cli download \
TheBloke/Llama-2-7b-Chat-GGUF \
llama-2-7b-chat.Q4_0.gguf \
--local-dir . \
--local-dir-use-symlinks False

downloading https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf to /root/.cache/huggingface/hub/tmp8tg0r5a3
Downloading (…)-2-7b-chat.Q4_0.gguf: 100% 3.83G/3.83G [01:36<00:00, 39.8MB/s]
Storing https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf in local_dir at ./llama-2-7b-chat.Q4_0.gguf (not cached).
./llama-2-7b-chat.Q4_0.gguf


In [1]:
model_path = "./llama-2-7b-chat.Q4_0.gguf"

In [2]:
from __future__ import annotations
from pprint import pprint
from typing import Collection, Literal, Sequence

import datasets
import pandas as pd
from tqdm.auto import tqdm

from llama_cpp import Llama

from cappr import Example
from cappr.llama_cpp import classify

In [3]:
import torch
n_gpu_layers = -1 if torch.cuda.is_available() else 0
n_gpu_layers

-1

In [4]:
model = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


# Utils

Copied from [here](https://github.com/kddubey/cappr/blob/main/demos/utils.py) so that this notebook can be run anywhere.

In [5]:
from __future__ import annotations
from typing import Optional, Union

from IPython.display import display
import pandas as pd


def display_df(
    df: pd.DataFrame,
    columns: Optional[list[str]] = None,
    num_rows: Union[int, None] = 3,
):
    """
    Displays `df.head(num_rows)[columns]` without truncating columns. If
    possible, render any newlines.
    """
    if columns is None:
        columns = df.columns
    if num_rows is None:
        num_rows = len(df)
    df_head_styled = df.head(num_rows)[columns].style
    # None means "don't truncate"; recent pandas versions reject the old -1
    with pd.option_context("display.max_colwidth", None):
        # I'm not sure why try-except doesn't work w/ display(), so instead
        # check the necessary uniqueness condition before running it
        if df.index.is_unique:
            display(
                df_head_styled.set_properties(
                    **{"text-align": "left", "white-space": "pre-wrap"}
                )
            )
        else:
            # `Styler.apply` and `.applymap` are not compatible with non-unique
            # index or columns
            display(df_head_styled)


def remove_suffix(string: str, suffix: str) -> str:
    # Guard against an empty suffix: string[:-0] would return ""
    if suffix and string.endswith(suffix):
        return string[: -len(suffix)]
    return string


def remove_prefix(string: str, prefix: str) -> str:
    if string.startswith(prefix):
        return string[len(prefix) :]
    return string

# Load data

For this MVP, let's evaluate on the [Choice of Plausible Alternatives (COPA) task](https://people.ict.usc.edu/~gordon/copa.html). I picked this first b/c I read it has multi-token labels, in some sense.

The classification problem is to pick 1 of 2 alternatives which caused or resulted in the premise. Here are two examples pulled from the website:

Example 1

> Premise: The man broke his toe. What was the CAUSE of this?
>
> Alternative 1: He got a hole in his sock.
>
> Alternative 2: He dropped a hammer on his foot.


Example 2

> Premise: I tipped the bottle. What happened as a RESULT?
>
> Alternative 1: The liquid in the bottle froze.
>
> Alternative 2: The liquid in the bottle poured out.

A classifier should predict Alternative 2 for Example 1, and Alternative 2 for Example 2.

The test set labels are hidden, so I'll score this zero-shot classifier on the train and validation sets.

In [6]:
def load_super_glue(task_id: str, split: str):
    return pd.DataFrame(datasets
                        .load_dataset('super_glue', task_id, split=split))


df = (pd.concat((load_super_glue('copa', 'train'),
                 load_super_glue('copa', 'validation')))
      .reset_index(drop=True)) # idx column is only unique w/in splits! fuhgetaboutit

In [7]:
len(df)

500

In [8]:
df.head()

| | premise | choice1 | choice2 | question | idx | label |
|---|---|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. | The sun was rising. | The grass was cut. | cause | 0 | 0 |
| 1 | The woman tolerated her friend's difficult beh... | The woman knew her friend was going through a ... | The woman felt that her friend took advantage ... | cause | 1 | 0 |
| 2 | The women met for coffee. | The cafe reopened in a new location. | They wanted to catch up with each other. | cause | 2 | 1 |
| 3 | The runner wore shorts. | The forecast predicted high temperatures. | She planned to run along the beach. | cause | 3 | 0 |
| 4 | The guests of the party hid behind the couch. | It was a surprise party. | It was a birthday party. | cause | 4 | 0 |


# The problem

The most straightforward way to solve this task is to point to each alternative with a
single letter, and hope that the LM samples/generates the correct letter. The prompt is
effectively a multiple choice question.

For example:

```
The man broke his toe because
A. He got a hole in his sock.
B. He dropped a hammer on his foot.
Answer A or B.
```

In [9]:
def _conjunction(question: Literal["cause", "effect"]):
    if question == "cause":
        return " because"
    elif question == "effect":
        return ", so"
    else:
        raise ValueError("question must be 'cause' or 'effect'. Got " f"{question}.")


def prompt(premise: str, question: Literal["cause", "effect"]):
    conjunction = _conjunction(question)
    return f'{premise.strip(". ")}{conjunction}'
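
All of the rows displayed so far are `cause` questions, so here's a quick check of what the prompt looks like for an `effect` question, using Example 2 from the COPA website. The helper is restated in condensed form (under the hypothetical name `make_prompt`) only so that the snippet runs on its own.

```python
from typing import Literal

# Condensed restatement of `_conjunction` + `prompt` from the cell above.
# `make_prompt` and `CONJUNCTIONS` are hypothetical names used only here.
CONJUNCTIONS = {"cause": " because", "effect": ", so"}


def make_prompt(premise: str, question: Literal["cause", "effect"]) -> str:
    if question not in CONJUNCTIONS:
        raise ValueError(f"question must be 'cause' or 'effect'. Got {question}.")
    # Drop the trailing period (and stray spaces) so that the conjunction
    # attaches directly to the premise
    return f'{premise.strip(". ")}{CONJUNCTIONS[question]}'


cause_prompt = make_prompt("The man broke his toe.", "cause")
effect_prompt = make_prompt("I tipped the bottle.", "effect")
print(cause_prompt)  # The man broke his toe because
print(effect_prompt)  # I tipped the bottle, so
```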

In [10]:
def prompt_mc(
    premise: str, question: Literal["cause", "effect"], choice1: str, choice2: str
):
    return (
        f"{prompt(premise, question)}\n"
        f"A. {choice1}\n"
        f"B. {choice2}\n"
        "Answer A or B."
    )


df["prompt_mc"] = [
    prompt_mc(
        record["premise"], record["question"], record["choice1"], record["choice2"]
    )
    for record in df.to_dict("records")
]


display_df(df, columns=["prompt_mc", "label"])

| | prompt_mc | label |
|---|---|---|
| 0 | My body cast a shadow over the grass because A. The sun was rising. B. The grass was cut. Answer A or B. | 0 |
| 1 | The woman tolerated her friend's difficult behavior because A. The woman knew her friend was going through a hard time. B. The woman felt that her friend took advantage of her kindness. Answer A or B. | 0 |
| 2 | The women met for coffee because A. The cafe reopened in a new location. B. They wanted to catch up with each other. Answer A or B. | 1 |


(It turns out that GitHub doesn't render the newlines, but I promise they're there!)

The [notebook
here](https://github.com/kddubey/cappr/blob/main/demos/openai/superglue/copa.ipynb)
demonstrates that this prompt is effective for bigger OpenAI models:
`gpt-3.5-turbo-instruct` and `gpt-3.5-turbo`. Let's see how it performs for Llama 2.

Note after experiments: this prompt seems to be the best I can do w/ text generation.

In [11]:
# if we're using a chat model, let's make sure we're formatting the prompt correctly
llama_chat_template = """
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]
""".lstrip(
    "\n"
)

system_prompt_copa = (
    "Identify the cause or effect of a premise given two choices. Each choice "
    "is identified by a letter, A or B.\n"
    "Respond only with the letter corresponding to the correct cause or effect."
)

df["prompt_mc_chat"] = [
    llama_chat_template.format(
        system_prompt=system_prompt_copa, user_message=prompt_mc
    )
    for prompt_mc in df["prompt_mc"]
]

if "chat" in model_path.lower():
    prompt_mc_column = "prompt_mc_chat"
else:
    prompt_mc_column = "prompt_mc"

In [12]:
print(df[prompt_mc_column].iloc[0])

<s>[INST] <<SYS>>
Identify the cause or effect of a premise given two choices. Each choice is identified by a letter, A or B.
Respond only with the letter corresponding to the correct cause or effect.
<</SYS>>

My body cast a shadow over the grass because
A. The sun was rising.
B. The grass was cut.
Answer A or B. [/INST]



Generate MC answers:

In [None]:
completions = []
for _prompt in tqdm(df[prompt_mc_column], total=len(df), desc="Sampling"):
    response = model(_prompt, max_tokens=5, temperature=0)
    completion = response['choices'][0]['text']
    completions.append(completion)

In [14]:
pd.Series(completions).sample(n=10)

59                       B
31            B. He grew a
392                      B
132    B. The woman apolog
239                      A
277                      B
35                       A
348                      B
290                      B
352                      B
dtype: object

When you're doing text generation, you often have to write this sort of data-dependent
and model-dependent function:

In [15]:
def process_completion(
    completion: str,
    class_chars: Sequence[str],
    prefixes_remove: Collection[str] = ("Answer ",),
    strip_chars: str = " \n.",
    default=-1,
) -> int:
    if any(len(class_char) != 1 for class_char in class_chars):
        raise ValueError("Elements of class_chars must be a single character.")

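    completion_stripped = completion.strip(strip_chars)
    if not completion_stripped:
        return default
    # Remove every prefix in turn, carrying the result forward. (Previously,
    # only the last prefix was applied, and an empty prefixes_remove caused a
    # NameError.)
    for prefix_remove in prefixes_remove:
        completion_stripped = remove_prefix(completion_stripped, prefix_remove)
    # The completion may have consisted of nothing but a prefix
    if not completion_stripped:
        return default
    completion_char_lower = completion_stripped[0].lower()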
    class_chars_lower = [class_char.lower() for class_char in class_chars]
    try:
        return class_chars_lower.index(completion_char_lower)
    except ValueError:
        return default
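
To make the data-dependence concrete, here's a pared-down sketch of the same idea applied to completions like the ones sampled above. `first_letter_class` is a hypothetical, simplified stand-in for `process_completion` (it skips prefix removal), written so that the snippet runs on its own.

```python
from typing import Sequence


def first_letter_class(
    completion: str, class_chars: Sequence[str] = ("A", "B"), default: int = -1
) -> int:
    # Hypothetical simplification: after stripping spaces, newlines, and
    # periods, match the first character case-insensitively
    stripped = completion.strip(" \n.")
    if not stripped:
        return default
    first_char = stripped[0].lower()
    class_chars_lower = [class_char.lower() for class_char in class_chars]
    try:
        return class_chars_lower.index(first_char)
    except ValueError:
        return default


# The first three are shaped like the sampled completions above. The last two
# are made-up failure cases.
completions_demo = [" B", "B. He grew a", "A", "", "The answer is B"]
parsed = [first_letter_class(completion) for completion in completions_demo]
print(parsed)  # [1, 1, 0, -1, -1]
```

The last case is the fragile part: a reply like "The answer is B" contains the right letter, but a first-character rule can't see it.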

In [16]:
class_chars = ('A', 'B')

In [17]:
pred_classes_text_gen = [
    process_completion(completion, class_chars)
    for completion in completions
]

How many of the sampled completions could be mapped to a label 0 or 1?

In [18]:
(pd.Series(pred_classes_text_gen) != -1).mean()

0.994

That's nice. But how accurate are the predictions?

In [19]:
(pred_classes_text_gen == df['label']).mean()

0.624
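
Note that a completion which couldn't be parsed is coded as `-1`, which never equals a label, so it counts as a wrong prediction in the accuracy above. Here's a toy illustration with made-up predictions and labels (not this notebook's data):

```python
# Made-up predictions and labels, purely for illustration
pred_classes_demo = [0, 1, -1, 1, 0]  # -1 means the completion wasn't parseable
labels_demo = [0, 1, 1, 0, 0]

pairs = list(zip(pred_classes_demo, labels_demo))

# Fraction of completions that could be mapped to a label
parse_rate = sum(pred != -1 for pred, _ in pairs) / len(pairs)

# Overall accuracy: unparseable completions count as errors
accuracy = sum(pred == label for pred, label in pairs) / len(pairs)

# Accuracy restricted to the completions that were parseable
parsed_pairs = [(pred, label) for pred, label in pairs if pred != -1]
accuracy_parsed = sum(pred == label for pred, label in parsed_pairs) / len(
    parsed_pairs
)

print(parse_rate, accuracy, accuracy_parsed)
```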

If you don't use an instruction-trained model, only a fraction of the LM's responses
contain a letter which is easy to parse. Computing accuracy is kind of pointless; we
don't even have real predictions.

If you do use an instruction-trained model, you get real predictions, but they're
incorrect too often.

Unfortunately, for smaller or less-instruction-trained LMs, text generation can raise
more problems than it solves. This result aligns with the one found in the [demo for
`text-curie-001`](https://github.com/kddubey/cappr/blob/main/demos/superglue/copa.ipynb).
Sampling structured outputs from smaller language models tends to give inaccurate
classifications.

What do we do?

# Write prompt

We should take advantage of the fact that models like Llama 2 were extensively trained for a simple task: predict the next token.

A simple way to model COPA is to prompt an LM with (for Example 1):

```
The man broke his toe because
```

and use the LM to estimate the probabilities of the 2 alternatives conditional on this prompt.
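
Concretely, an autoregressive LM assigns a log-probability to each token of an alternative, given the prompt and the alternative's preceding tokens. Below is a minimal sketch of turning those into a probability for each alternative. The token log-probabilities are made up, and averaging them before exponentiating (so that a longer alternative isn't penalized just for having more tokens) is an assumption of this sketch, not a statement of exactly what `cappr` computes internally.

```python
from __future__ import annotations

import math

# Hypothetical token log-probs, log Pr(token | prompt, previous tokens), for
# the two alternatives after the prompt "The man broke his toe because".
# These numbers are made up for illustration.
token_log_probs = {
    "he got a hole in his sock.": [-2.5, -1.1, -3.0, -0.7],
    "he dropped a hammer on his foot.": [-1.2, -0.4, -0.9],
}


def score(log_probs: list[float]) -> float:
    # Assumption of this sketch: average the token log-probs, then exponentiate
    return math.exp(sum(log_probs) / len(log_probs))


scores = {choice: score(lps) for choice, lps in token_log_probs.items()}
total = sum(scores.values())
# Normalize so that the two alternatives' probabilities sum to 1
pred_probs_demo = {choice: s / total for choice, s in scores.items()}
# Predict the alternative with the highest probability
prediction = max(pred_probs_demo, key=pred_probs_demo.get)

print(pred_probs_demo)
print(prediction)
```

No parsing step is needed here: the prediction is always one of the given alternatives.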

In [20]:
df["prompt"] = [
    prompt(premise, question)
    for premise, question in zip(df["premise"], df["question"])
]

In [21]:
display_df(df, columns=["prompt", "choice1", "choice2", "label"])

| | prompt | choice1 | choice2 | label |
|---|---|---|---|---|
| 0 | My body cast a shadow over the grass because | The sun was rising. | The grass was cut. | 0 |
| 1 | The woman tolerated her friend's difficult behavior because | The woman knew her friend was going through a hard time. | The woman felt that her friend took advantage of her kindness. | 0 |
| 2 | The women met for coffee because | The cafe reopened in a new location. | They wanted to catch up with each other. | 1 |


# The solution

In [22]:
examples = [
    Example(
        prompt=record["prompt"],
        completions=(record["choice1"].lower(), record["choice2"].lower()),
        prior=None,
        end_of_prompt=" ",  # currently unused by cappr.llama_cpp
    )
    for record in df.to_dict("records")
]

In [23]:
pprint(examples[0])

Example(prompt='My body cast a shadow over the grass because',
        completions=('the sun was rising.', 'the grass was cut.'),
        prior=None,
        end_of_prompt=' ',
        normalize=True)


For `llama-cpp` models, CAPPr may be slower than text generation. (If batch inference gets supported by `llama-cpp`, I'll be able to speed up CAPPr.)

In [24]:
pred_probs = classify.predict_proba_examples(examples, model)

conditional log-probs:   0%|          | 0/500 [00:00<?, ?it/s]

For COPA, the scoring metric is accuracy.

In [25]:
(pred_probs.argmax(axis=1) == df["label"]).mean()

0.832

This 4 GB open source model beats OpenAI's `text-curie-001`, which was 80% accurate
according to the CAPPr demo
[here](https://github.com/kddubey/cappr/blob/main/demos/openai/superglue/copa.ipynb).
OpenAI's `gpt-3.5-turbo` is 91% accurate. So we've recovered 0.832/0.91 = 91% of its
performance by using CAPPr.
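
For reference, here's the same arithmetic for both methods, using the accuracies reported in this notebook (0.624 for text generation, 0.832 for CAPPr) and the 0.91 quoted for `gpt-3.5-turbo`:

```python
accuracy_gpt_35_turbo = 0.91  # quoted from the OpenAI demo notebook
accuracies = {"text generation": 0.624, "CAPPr": 0.832}

# Fraction of gpt-3.5-turbo's accuracy that each method recovers
recovered = {
    method: accuracy / accuracy_gpt_35_turbo
    for method, accuracy in accuracies.items()
}
for method, fraction in recovered.items():
    print(f"{method}: {fraction:.0%}")
```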