You must run this notebook on a GPU. A T4 is sufficient, and you can get one for free on [Google
Colab](https://stackoverflow.com/questions/62596466/how-can-i-run-notebooks-of-a-github-project-in-google-colab/67344477#67344477).

**Description**: for a [GPTQ-quantized Mistral
7B](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ) and the [Craigslist
Bargains](https://huggingface.co/datasets/craigslist_bargains) classification task, this
notebook demonstrates that CAPPr gets you about +1% absolute accuracy compared to text
generation (90.5% vs. 89.5% on the test set). In general, you should expect similar or
identical performance when every completion is 1 token long. This notebook also
demonstrates using the `prior` keyword argument.
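
To see why single-token completions tend to give similar results, here's a toy sketch with made-up next-token probabilities (not model output). It compares greedy generation, which takes the most likely token overall, against ranking only the class names, which is roughly what happens when completions are scored without a prior. All names and numbers are for illustration only.

```python
# Toy next-token probabilities (made-up numbers, not model output).
next_token_probs = {"bike": 0.55, "car": 0.25, "phone": 0.10, "the": 0.06, "a": 0.04}
class_names = ["bike", "car", "phone"]

# Greedy text generation: pick the most likely token in the whole vocabulary.
greedy_choice = max(next_token_probs, key=next_token_probs.get)

# Completion ranking: score only the class names, then renormalize.
total = sum(next_token_probs[name] for name in class_names)
class_probs = {name: next_token_probs[name] / total for name in class_names}
ranked_choice = max(class_probs, key=class_probs.get)

print(greedy_choice, ranked_choice)
```

The two can disagree when the most likely token isn't a class name, or when a non-uniform `prior` shifts the ranking.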

**Estimated run time**: ~15 min.

In [1]:
# check correct CUDA version
import torch

_cuda_version = torch.version.cuda
_msg = (
    "Change the pip install auto-gptq command to the one for "
    f"{_cuda_version} based on the list here: "
    "https://github.com/PanQiWei/AutoGPTQ#quick-installation"
)

assert _cuda_version == "11.8", _msg

In [None]:
!python -m pip install "cappr[demos] @ git+https://github.com/kddubey/cappr.git" \
auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ \
optimum

In [3]:
from __future__ import annotations
from typing import Sequence

import datasets
import pandas as pd
import torch
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from cappr import huggingface as hf

# Load training data

If you're only interested in how to use CAPPr, skip this section.

In [4]:
_df_raw_tr = pd.DataFrame(datasets.load_dataset("craigslist_bargains", split="train"))

In [5]:
len(_df_raw_tr)

5247

The "text" to classify is gonna be the Buyer-Seller dialogue. We need to process that into
something an LLM can better understand.

In [6]:
_df_raw_tr["utterance"][0]  # see those last two empty strings. gonna drop em

['Hi, not sure if the charger would work for my car. Can you sell it to me for $5?',
 'It will work, i have never seen a car without a cigarette lighter port.\\',
 "Still, can I buy it for $5? I'm on a tight budge",
 'I think the lowest I would want to go is 8. ',
 "How about $6 and I pick it up myself? It'll save you shipping to me.",
 '7, and we have a deal.',
 'Eh, fine. $7.',
 '',
 '']

In [7]:
assert (_df_raw_tr["agent_turn"].apply(len) == _df_raw_tr["utterance"].apply(len)).all()

The possible choices for each dialogue are product categories:

In [8]:
_df_raw_tr["items"].apply(lambda item: item["Category"])

0               [phone, phone]
1                 [bike, bike]
2           [housing, housing]
3       [furniture, furniture]
4       [furniture, furniture]
                 ...          
5242                [car, car]
5243    [furniture, furniture]
5244              [bike, bike]
5245    [furniture, furniture]
5246        [housing, housing]
Name: items, Length: 5247, dtype: object

Not sure why they're duplicated. Are they ever different? Hope not.

In [9]:
assert (_df_raw_tr["items"].apply(lambda item: len(set(item["Category"]))) == 1).all()

In [10]:
class_names = sorted(set(_df_raw_tr["items"].apply(lambda item: item["Category"][0])))
class_names

['bike', 'car', 'electronics', 'furniture', 'housing', 'phone']

I'm gonna assume this is the complete list.

# Write prompt

Instruction format pulled from
[here](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ#you-can-then-use-the-following-code).

In [11]:
chat_template = """
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
""".lstrip("\n")

In [12]:
class_names_str = ", ".join(class_names)

In [13]:
system_message = (
    "You will be given a dialogue between a seller and a buyer about the sale of "
    "an item.\n"
    "Your task is to categorize the item being sold as one of these categories: "
    f"{class_names_str}\n\n"
    "You will answer with just the the correct category and nothing else."
)

In [14]:
def prompt(dialogue: str) -> str:
    user_message = (
        "Dialogue:\n"
        f'"""\n{dialogue}\n"""\n'
        "The item discussed in the dialogue above belongs to the category:"
    )
    return chat_template.format(
        system_message=system_message, prompt=user_message
    )

# Process training data

If you're only interested in how to use CAPPr, skip this section.

To ensure train and test data processing is equivalent, we'll apply the same function:
`process`.

In [15]:
def _as_dialogue(
    all_agent_turns: Sequence[Sequence[bool]], all_utterances: Sequence[Sequence[str]]
) -> list[str]:
    if len(all_agent_turns) != len(all_utterances):
        raise ValueError("agent_turns and utterances must have the same length.")

    dialogues = []
    for agent_turns, utterances in zip(all_agent_turns, all_utterances):
        dialogue: list[str] = []
        for agent_turn, utterance in zip(agent_turns, utterances):
            if not utterance:
                # some utterances are empty for some reason. just gonna drop em
                continue
            prefix = "Buyer: " if not agent_turn else "Seller: "
            dialogue.append(prefix + utterance)
        dialogues.append("\n".join(dialogue))
    return dialogues


def process(craigslist_df: pd.DataFrame) -> pd.DataFrame:
    """
    Returns a DataFrame with the canonical `"text"`, `"label"`, and `"prompt"` columns
    for the
    [CraigslistBargains dataset](https://huggingface.co/datasets/craigslist_bargains),
    assuming it was loaded via::

        craigslist_df = pd.DataFrame(
            datasets.load_dataset("craigslist_bargains", split=...)
        )
    """
    # Input checks
    if not (
        craigslist_df["agent_turn"].apply(len) == craigslist_df["utterance"].apply(len)
    ).all():
        raise ValueError("There's an agent_turn and utterance with different lengths.")
    if not (
        craigslist_df["items"].apply(lambda item: len(set(item["Category"]))) == 1
    ).all():
        raise ValueError("There's an item associated with multiple categories.")

    # Hard-coded per dataset to ensure consistency across splits: it's possible that
    # the test split is missing some of these.
    class_names = ["bike", "car", "electronics", "furniture", "housing", "phone"]
    df = pd.DataFrame(
        {
            "text": _as_dialogue(
                craigslist_df["agent_turn"], craigslist_df["utterance"]
            ),
            "class": [item["Category"][0] for item in craigslist_df["items"]],
        },
        index=craigslist_df.index,
    )
    assert len(df) == len(craigslist_df)
    df["label"] = [class_names.index(class_name) for class_name in df["class"]]
    df["prompt"] = [prompt(dialogue) for dialogue in df["text"]]
    return df[["text", "label", "prompt", "class"]]
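
As a quick illustration of the dialogue formatting above, here's a minimal sketch for a single dialogue. `format_dialogue` is a hypothetical helper that mirrors the inner loop of `_as_dialogue`; the turn flags and utterances are made-up values, not rows from the dataset.

```python
from typing import Sequence


def format_dialogue(agent_turns: Sequence[int], utterances: Sequence[str]) -> str:
    """Hypothetical single-dialogue version of `_as_dialogue`."""
    lines = []
    for agent_turn, utterance in zip(agent_turns, utterances):
        if not utterance:
            # Drop empty utterances, like the trailing ones seen earlier
            continue
        speaker = "Seller" if agent_turn else "Buyer"
        lines.append(f"{speaker}: {utterance}")
    return "\n".join(lines)


# Illustrative inputs: a falsy turn flag means Buyer, a truthy one means Seller
dialogue = format_dialogue(
    agent_turns=[0, 1, 0, 1, 1],
    utterances=["Can I buy it for $5?", "7, and we have a deal.", "Eh, fine. $7.", "", ""],
)
print(dialogue)
```

The two empty utterances at the end are skipped, so the result has three lines.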

In [16]:
df_tr = process(craigslist_df=_df_raw_tr)

In [17]:
len(df_tr)

5247

To keep experiments cheap, both in compute and in how much data we look at, we don't need all of this data.

In [18]:
df_tr_mini = df_tr.copy().sample(n=100, random_state=123).reset_index(drop=True)
len(df_tr_mini)

100

# Preview prompt

In [19]:
print(df_tr["prompt"].iloc[0])

<|im_start|>system
You will be given a dialogue between a seller and a buyer about the sale of an item.
Your task is to categorize the item being sold as one of these categories: bike, car, electronics, furniture, housing, phone

You will answer with just the the correct category and nothing else.<|im_end|>
<|im_start|>user
Dialogue:
"""
Buyer: Hi, not sure if the charger would work for my car. Can you sell it to me for $5?
Seller: It will work, i have never seen a car without a cigarette lighter port.\
Buyer: Still, can I buy it for $5? I'm on a tight budge
Seller: I think the lowest I would want to go is 8. 
Buyer: How about $6 and I pick it up myself? It'll save you shipping to me.
Seller: 7, and we have a deal.
Buyer: Eh, fine. $7.
"""
The item discussed in the dialogue above belongs to the category:<|im_end|>
<|im_start|>assistant



In [20]:
print(df_tr["class"].iloc[0])

phone


Ooh, little tricky since the buyer and seller only mention a charger. I bet most models
will incorrectly classify the dialogue as a car sale.

# Load model

In [21]:
model_name = "TheBloke/Mistral-7B-OpenOrca-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", trust_remote_code=False, revision="main"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [22]:
_ = model(**tokenizer(["warm up"], return_tensors="pt").to(model.device))

# CAPPr

Many research datasets are intentionally balanced, so that only the model's likelihood
matters. In real applications, balanced classes are rare. Let's see what the class
distribution looks like for this more realistic dataset:

In [23]:
df_tr["class"].value_counts(normalize=True).sort_index()

bike           0.179150
car            0.133028
electronics    0.130741
furniture      0.247951
housing        0.202783
phone          0.106346
Name: class, dtype: float64

Indeed, not really balanced. Let's supply this prior to CAPPr and see how it does on a
small subset of the training data.

In [24]:
prior = (
    df_tr["class"]
    .value_counts(normalize=True)
    [class_names]
)
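
As a rough sketch of what a prior does, assuming it's combined with the model's likelihoods via Bayes' rule (CAPPr's internals may differ in the details): multiply each class's likelihood by its prior, then renormalize. The likelihoods below are made-up numbers for one hypothetical dialogue; the prior values are the training frequencies printed above.

```python
class_names = ["bike", "car", "electronics", "furniture", "housing", "phone"]

# Class frequencies from the training split
prior = [0.179150, 0.133028, 0.130741, 0.247951, 0.202783, 0.106346]

# Hypothetical likelihoods for a single dialogue (made-up numbers)
likelihoods = [0.10, 0.10, 0.15, 0.25, 0.10, 0.30]

# Bayes' rule: posterior is proportional to likelihood * prior
unnormalized = [lik * p for lik, p in zip(likelihoods, prior)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

best_by_likelihood = class_names[likelihoods.index(max(likelihoods))]
best_by_posterior = class_names[posterior.index(max(posterior))]
print(best_by_likelihood, best_by_posterior)
```

In this toy case the likelihood alone favors `phone`, but the prior tips the prediction to `furniture` because furniture is more than twice as common in the training data.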

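Every prompt starts with the same instructions, so let's also cache them. That way the
model processes the shared prefix once instead of once per dialogue.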

In [25]:
chat_template_shared_instructions = """
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_prompt}
""".strip("\n")

chat_template_prompt = """
{prompt}<|im_end|>
<|im_start|>assistant
""".lstrip("\n")

In [26]:
user_prompt = "Dialogue:\n"
prompt_prefix = chat_template_shared_instructions.format(
    system_message=system_message,
    user_prompt=user_prompt,
)


def prompt_suffix(dialogue: str):
    prompt = (
        f'"""\n{dialogue}\n"""\n'
        "The item discussed in the dialogue above belongs to the category:"
    )
    return chat_template_prompt.format(prompt=prompt)


df_tr_mini["prompt_suffix"] = [
    prompt_suffix(dialogue) for dialogue in df_tr_mini["text"]
]

In [27]:
# Here's what the model will see
print(prompt_prefix + " " + df_tr_mini["prompt_suffix"].iloc[0])

<|im_start|>system
You will be given a dialogue between a seller and a buyer about the sale of an item.
Your task is to categorize the item being sold as one of these categories: bike, car, electronics, furniture, housing, phone

You will answer with just the the correct category and nothing else.<|im_end|>
<|im_start|>user
Dialogue:
 """
Buyer: hi i am interested in a bike
Seller: Great! It's a terrific bike and I'm asking $200.
Buyer: It is quite a bike but my budget is closer to 100. Don't get me wrong is nice but is not new.
Seller: That's true, it's not new and does have some scuff marks. But it's in solid condition mechanically. How about $150?
Buyer: I see no problem with this. I agree.
"""
The item discussed in the dialogue above belongs to the category:<|im_end|>
<|im_start|>assistant



In [28]:
cached = hf.classify.cache_model((model, tokenizer), prompt_prefix)

In [29]:
pred_probs = hf.classify.predict_proba(
    prompts=df_tr_mini["prompt_suffix"],
    completions=class_names,
    model_and_tokenizer=cached,
    prior=prior,
)

Accuracy:

In [30]:
(pred_probs.argmax(axis=1) == df_tr_mini["label"]).mean()

0.92

# Text generation

We need to create a PyTorch Dataset to batch the inputs.

In [31]:
class TextsDataset(torch.utils.data.Dataset):
    def __init__(self, texts: list[str]):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index: int):
        return self.texts[index]

In [32]:
text_gen_dataset_tr = TextsDataset(df_tr_mini["prompt"].tolist())

We'll use greedy decoding to better ensure that each completion is one of the class names.

In [33]:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    max_new_tokens=5,
    return_full_text=False,
)
# pad to allow batching. these get masked out ofc
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

In [34]:
completions = []
for seq in tqdm(
    generator(
        text_gen_dataset_tr,
        # suppress "Setting pad_token_id..." stdout
        pad_token_id=generator.tokenizer.eos_token_id,
        batch_size=2,
    ),
    total=len(text_gen_dataset_tr),
    desc="Sampling",
):
    completions.append(seq[0]['generated_text'])


Let's see if the model generated categories like we asked.

In [35]:
pd.Series(completions).sample(n=10)

92     electronics
40            bike
54         housing
0             bike
63         housing
26            bike
32            bike
12             car
76     electronics
90            bike
dtype: object

Nice, text generation works well here. Writing the post-processor is trivial.

In [36]:
def process_completion(completion: str, class_names: Sequence[str], default=-1) -> int:
    for i, name in enumerate(class_names):
        if name in completion.lower():
            return i
    return default
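
Here's a small, self-contained usage sketch of the post-processor. The function is repeated so the block runs on its own, and the example completions are hypothetical strings, not actual model output.

```python
from typing import Sequence

class_names = ["bike", "car", "electronics", "furniture", "housing", "phone"]


def process_completion(
    completion: str, class_names: Sequence[str], default: int = -1
) -> int:
    # Same substring matching as the cell above: return the index of the first
    # class name found in the lowercased completion, else the default
    for i, name in enumerate(class_names):
        if name in completion.lower():
            return i
    return default


# Hypothetical completions, including one that can't be mapped to a class
examples = [" Bike", "furniture.", "I'm not sure"]
labels = [process_completion(text, class_names) for text in examples]
print(labels)
```

Note that matching is by substring and in class order, so a completion mentioning two categories maps to whichever comes first in `class_names`.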

In [37]:
preds_text_gen = [
    process_completion(completion, class_names)
    for completion in completions
]

How many of the completions could be mapped to a label?

In [38]:
(pd.Series(preds_text_gen) != -1).mean()

1.0

How accurate are the predictions?

In [39]:
(df_tr_mini["label"] == preds_text_gen).mean()

0.9

# Evaluate on test data

It's important to do some things honorably.

<span style="font-family: Baskerville; font-size: 18px;">I solemnly swear that I
evaluated on the test set twice (once per pre-selected method), running only the
following cells in sequence once.</span>

<img src="../signature.png" alt="drawing" width="200"/>
<div style="width:200px"><hr/></div>

In [40]:
_df_raw_te = pd.DataFrame(datasets.load_dataset("craigslist_bargains", split="test"))
len(_df_raw_te)

838

In [41]:
df_te = process(craigslist_df=_df_raw_te)

In [42]:
df_te["prompt_suffix"] = [prompt_suffix(dialogue) for dialogue in df_te["text"]]

## Text generation

In [43]:
text_gen_dataset_te = TextsDataset(df_te["prompt"].tolist())

In [44]:
completions_te = []
for seq in tqdm(
    generator(
        text_gen_dataset_te,
        # suppress "Setting pad_token_id..." stdout
        pad_token_id=generator.tokenizer.eos_token_id,
        batch_size=2,
    ),
    total=len(text_gen_dataset_te),
    desc="Sampling",
):
    completions_te.append(seq[0]['generated_text'])

In [45]:
preds_text_gen_te = [
    process_completion(completion, class_names)
    for completion in completions_te
]

How many of the completions could be mapped to a label?

In [46]:
(pd.Series(preds_text_gen_te) != -1).mean()

0.9976133651551312

How accurate are the predictions?

In [47]:
(preds_text_gen_te == df_te["label"]).mean()

0.8949880668257757

## CAPPr

In [48]:
pred_probs_te = hf.classify.predict_proba(
    prompts=df_te["prompt_suffix"],
    completions=class_names,
    model_and_tokenizer=cached,  # cached prompt_prefix
    prior=prior,  # estimated from independent training data
)

How accurate are the predictions?

In [49]:
(pred_probs_te.argmax(axis=1) == df_te["label"]).mean()

0.9045346062052506

Hmm, [Refuel AI's few-shot
experiment](https://github.com/refuel-ai/autolabel/blob/main/examples/craigslist/example_craigslist.ipynb)
with `gpt-3.5-turbo` got 89% accuracy. But the way that experiment processed dialogues
seems off. So I'm not gonna say this 4 GB model beats `gpt-3.5-turbo` just yet.