You must run this notebook on a GPU. A T4 is sufficient. It's free on [Google
Colab](https://stackoverflow.com/questions/62596466/how-can-i-run-notebooks-of-a-github-project-in-google-colab/67344477#67344477).

**Description**: for a [GPTQ-quantized Mistral
7B](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ) and the [Craigslist
Bargains](https://huggingface.co/datasets/craigslist_bargains) classification task, this
notebook demonstrates that CAPPr gets you about +1% absolute test accuracy compared to
text generation (90.7% vs. 89.5%). In general, you should expect similar or identical
performance when every completion is 1 token long. This notebook also demonstrates the
still highly experimental feature `discount_completions=1.0`.

**Estimated run time**: ~15 min.

In [1]:
# check correct CUDA version
import torch

_cuda_version = torch.version.cuda
_msg = (
    "Change the pip install auto-gptq command to the one for "
    f"{_cuda_version} based on the list here: "
    "https://github.com/PanQiWei/AutoGPTQ#quick-installation"
)

assert _cuda_version == "11.8", _msg

In [None]:
!python -m pip install "cappr[demos] @ git+https://github.com/kddubey/cappr.git" \
auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ \
optimum

In [1]:
from __future__ import annotations
from typing import Sequence

import datasets
import numpy as np
import pandas as pd
import torch
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from cappr.huggingface.classify import token_logprobs, predict_proba

# Load training data

If you're only interested in how to use CAPPr, skip this section.

In [2]:
_df_raw_tr = pd.DataFrame(datasets.load_dataset("craigslist_bargains", split="train"))

In [3]:
len(_df_raw_tr)

5247

The "text" to classify is gonna be the Buyer-Seller dialogue. Need to process that into
something an LLM would better understand.

In [4]:
_df_raw_tr["utterance"][0]  # see those last two empty strings. gonna drop em

['Hi, not sure if the charger would work for my car. Can you sell it to me for $5?',
 'It will work, i have never seen a car without a cigarette lighter port.\\',
 "Still, can I buy it for $5? I'm on a tight budge",
 'I think the lowest I would want to go is 8. ',
 "How about $6 and I pick it up myself? It'll save you shipping to me.",
 '7, and we have a deal.',
 'Eh, fine. $7.',
 '',
 '']

In [5]:
assert (_df_raw_tr["agent_turn"].apply(len) == _df_raw_tr["utterance"].apply(len)).all()

The possible choices for each dialogue are product categories:

In [6]:
_df_raw_tr["items"].apply(lambda item: item["Category"])

0               [phone, phone]
1                 [bike, bike]
2           [housing, housing]
3       [furniture, furniture]
4       [furniture, furniture]
                 ...          
5242                [car, car]
5243    [furniture, furniture]
5244              [bike, bike]
5245    [furniture, furniture]
5246        [housing, housing]
Name: items, Length: 5247, dtype: object

Not sure why they're duplicated. Are they ever different? Hope not.

In [7]:
assert (_df_raw_tr["items"].apply(lambda item: len(set(item["Category"]))) == 1).all()

In [8]:
class_names = sorted(set(_df_raw_tr["items"].apply(lambda item: item["Category"][0])))
class_names

['bike', 'car', 'electronics', 'furniture', 'housing', 'phone']

I'm gonna assume this is the complete list.

# Write prompt

Instruction format pulled from
[here](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ#you-can-then-use-the-following-code).

In [9]:
chat_template = """
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
""".lstrip("\n")

In [10]:
class_names_str = ", ".join(class_names)

In [11]:
def prompt(dialogue: str) -> str:
    system_message = (
        "You will be given a dialogue between a seller and a buyer about the sale of "
        "an item.\n"
        "Your task is to categorize the item being sold as one of these categories: "
        f"{class_names_str}\n\n"
        "You will answer with just the the correct category and nothing else."
    )
    user_message = (
        "Dialogue:\n"
        f'"""\n{dialogue}\n"""\n'
        "The item discussed in the dialogue above belongs to the category:"
    )
    return chat_template.format(
        system_message=system_message, prompt=user_message
    )

# Process training data

If you're only interested in how to use CAPPr, skip this section.

To ensure train and test data processing is equivalent, we'll apply the same function:
`process`.

In [12]:
def _as_dialogue(
    all_agent_turns: Sequence[Sequence[bool]], all_utterances: Sequence[Sequence[str]]
):
    if len(all_agent_turns) != len(all_utterances):
        raise ValueError("agent_turns and utterances must have the same length.")

    dialogues = []
    for agent_turns, utterances in zip(all_agent_turns, all_utterances):
        dialogue: list[str] = []
        for agent_turn, utterance in zip(agent_turns, utterances):
            if not utterance:
                # some utterances are empty for some reason. just gonna drop em
                continue
            prefix = "Buyer: " if not agent_turn else "Seller: "
            dialogue.append(prefix + utterance)
        dialogues.append("\n".join(dialogue))
    return dialogues


def process(craigslist_df: pd.DataFrame) -> pd.DataFrame:
    """
    Returns a DataFrame with the canonical `"text"`, `"label"`, and `"prompt"` columns
    for the
    [CraigslistBargains dataset](https://huggingface.co/datasets/craigslist_bargains),
    assuming it was loaded via::

        craigslist_df = pd.DataFrame(
            datasets.load_dataset("craigslist_bargains", split=...)
        )
    """
    # Input checks
    if not (
        craigslist_df["agent_turn"].apply(len) == craigslist_df["utterance"].apply(len)
    ).all():
        raise ValueError("There's an agent_turn and utterance with different lengths.")
    if not (
        craigslist_df["items"].apply(lambda item: len(set(item["Category"]))) == 1
    ).all():
        raise ValueError("There's an item associated with multiple categories.")

    class_names = ["bike", "car", "electronics", "furniture", "housing", "phone"]
    # hard-coded per dataset to ensure consistency across splits. it's possible that the
    # test split is missing some of these.
    df = pd.DataFrame(
        {
            "text": _as_dialogue(
                craigslist_df["agent_turn"], craigslist_df["utterance"]
            ),
            "class": [item["Category"][0] for item in craigslist_df["items"]],
        },
        index=craigslist_df.index,
    )
    assert len(df) == len(craigslist_df)
    df["label"] = [class_names.index(class_name) for class_name in df["class"]]
    df["prompt"] = [prompt(dialogue) for dialogue in df["text"]]
    return df[["text", "label", "prompt", "class"]]

In [13]:
df_tr = process(craigslist_df=_df_raw_tr)

In [14]:
len(df_tr)

5247

For computationally and statistically cheap experiments, we don't need all of this data.

In [15]:
df_tr_mini = df_tr.copy().iloc[:100]
len(df_tr_mini)

100

# Preview prompt

In [16]:
print(df_tr["prompt"].iloc[0])

<|im_start|>system
You will be given a dialogue between a seller and a buyer about the sale of an item.
Your task is to categorize the item being sold as one of these categories: bike, car, electronics, furniture, housing, phone

You will answer with just the the correct category and nothing else.<|im_end|>
<|im_start|>user
Dialogue:
"""
Buyer: Hi, not sure if the charger would work for my car. Can you sell it to me for $5?
Seller: It will work, i have never seen a car without a cigarette lighter port.\
Buyer: Still, can I buy it for $5? I'm on a tight budge
Seller: I think the lowest I would want to go is 8. 
Buyer: How about $6 and I pick it up myself? It'll save you shipping to me.
Seller: 7, and we have a deal.
Buyer: Eh, fine. $7.
"""
The item discussed in the dialogue above belongs to the category:<|im_end|>
<|im_start|>assistant



In [17]:
print(df_tr["class"].iloc[0])

phone


Ooh, little tricky since the buyer and seller only mention a charger. I bet most models
will incorrectly classify the dialogue as a car sale.

# Load model

In [18]:
model_name = "TheBloke/Mistral-7B-OpenOrca-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", trust_remote_code=False, revision="main"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [19]:
_ = model(**tokenizer(["warm up"], return_tensors="pt").to(model.device))

# CAPPr

Many research datasets are intentionally balanced (so that only the model's likelihood is considered).
Balanced classes in real applications are rare. Let's see what the distribution
looks like for this (more realistic) dataset:

In [20]:
df_tr["class"].value_counts(normalize=True).sort_index()

bike           0.179150
car            0.133028
electronics    0.130741
furniture      0.247951
housing        0.202783
phone          0.106346
Name: class, dtype: float64

Indeed, not really balanced. Let's supply this prior to CAPPr and see how it does on a
small subset of the training data.
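
Here's a rough sketch of how a prior could combine with the model's completion
likelihoods, assuming a plain Bayes'-rule reweighting. The helper name
`posterior_from_likelihoods` and all of the numbers are made up for illustration, and
this isn't necessarily how `predict_proba` does it internally.

```python
import numpy as np


def posterior_from_likelihoods(
    likelihoods: np.ndarray, prior: np.ndarray
) -> np.ndarray:
    """
    Hypothetical helper: reweight per-class likelihoods, Pr(completion | prompt), by
    a prior over classes, then normalize each row so that it sums to 1.
    """
    unnormalized = likelihoods * prior  # broadcasts the prior across rows
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)


# made-up likelihoods for 2 prompts and 3 classes
toy_likelihoods = np.array([[0.2, 0.2, 0.1], [0.05, 0.3, 0.3]])
toy_prior = np.array([0.5, 0.25, 0.25])
toy_posterior = posterior_from_likelihoods(toy_likelihoods, toy_prior)
print(toy_posterior.round(3))
```

In the first row, the model is torn between the first two classes, and the prior breaks
the tie in favor of the first one.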

In [21]:
prior = (
    df_tr["class"]
    .value_counts(normalize=True)
    [class_names]
)

In [22]:
pred_probs = predict_proba(
    prompts=df_tr_mini["prompt"],
    completions=class_names,
    model_and_tokenizer=(model, tokenizer),
    batch_size=2,
    prior=prior,
)

Accuracy:

In [23]:
(pred_probs.argmax(axis=1) == df_tr_mini["label"]).mean()

0.76

Hmm, when there are a lot of classes, it's a good idea to see whether any completions
are getting over-predicted b/c the model's prior is biased wrt your dataset. Let's see
if that happened:

In [24]:
def class_proportions(true_prop: pd.Series, pred_probs: np.ndarray) -> pd.DataFrame:
    preds = pd.Series(
        [class_names[class_idx] for class_idx in pred_probs.argmax(axis=1)],
        name=true_prop.name,
    )
    return pd.DataFrame(
        {
            "true proportion": true_prop,
            "pred proportion": preds.value_counts(normalize=True),
        },
    ).sort_index()

In [25]:
class_proportions(df_tr_mini["class"].value_counts(normalize=True), pred_probs)

             true proportion  pred proportion
bike                    0.17             0.16
car                     0.16             0.13
electronics             0.15             0.34
furniture               0.24             0.21
housing                 0.15             0.11
phone                   0.13             0.05


`electronics` is getting overpredicted. In situations like these, it can sometimes be
useful to set `discount_completions=1.0` in addition to supplying the prior. Together,
these settings may effectively "reset" the prior from the language model to the prior
for your domain-specific dataset.
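
Here's a toy sketch of the idea, under the assumption that discounting subtracts
`discount_completions` times a completion's prompt-independent log-probability from its
prompt-conditional log-probability before normalizing. The helper name
`discounted_probs` and the numbers are invented, and the real implementation may
differ.

```python
import numpy as np


def discounted_probs(
    log_probs_given_prompt: np.ndarray,
    log_marg_probs: np.ndarray,
    discount: float = 1.0,
) -> np.ndarray:
    """
    Hypothetical helper: down-weight completions that the model finds likely
    regardless of the prompt, then normalize with a softmax over completions.
    """
    scores = log_probs_given_prompt - discount * log_marg_probs
    scores = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)


# made-up log-probs for 1 prompt and 2 completions. the second completion is slightly
# more likely given the prompt, but it's also much more likely regardless of the prompt
toy_log_probs_given_prompt = np.log(np.array([[0.30, 0.35]]))
toy_log_marg_probs = np.log(np.array([0.01, 0.05]))

toy_no_discount = discounted_probs(
    toy_log_probs_given_prompt, toy_log_marg_probs, discount=0.0
)
toy_with_discount = discounted_probs(
    toy_log_probs_given_prompt, toy_log_marg_probs, discount=1.0
)
print(toy_no_discount.argmax(axis=1), toy_with_discount.argmax(axis=1))
```

Without the discount, the second completion wins. With it, the first completion wins
because the prompt raised its probability by a much larger factor.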

Let's see if these settings alleviate the problem for this dataset.

In [26]:
# pre-compute since we'll use this for test data too
log_marg_probs_completions = token_logprobs(
    class_names,
    model_and_tokenizer=(model, tokenizer),
)

In [27]:
pred_probs_with_discount = predict_proba(
    prompts=df_tr_mini["prompt"],
    completions=class_names,
    model_and_tokenizer=(model, tokenizer),
    batch_size=2,
    prior=prior,
    discount_completions=1.0,
    log_marg_probs_completions=log_marg_probs_completions,
)

In [28]:
class_proportions(
    df_tr_mini["class"].value_counts(normalize=True), pred_probs_with_discount
)

             true proportion  pred proportion
bike                    0.17             0.17
car                     0.16             0.13
electronics             0.15             0.29
furniture               0.24             0.23
housing                 0.15             0.11
phone                   0.13             0.07


Indeed, `discount_completions=1.0` overpredicts `electronics` less.

Accuracy:

In [29]:
(pred_probs_with_discount.argmax(axis=1) == df_tr_mini["label"]).mean()

0.8

Cool, accuracy went up a bit. But keep in mind there isn't much data, and these
predictions depend on the labels (the prior was estimated from training data that
includes these 100 examples), so don't interpret the score absolutely. We'll properly
evaluate on the test set later.
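
To put "there isn't much data" into numbers, here's a back-of-the-envelope sketch using
the binomial standard error. It assumes the 100 examples are independent, and the helper
name `accuracy_standard_error` is made up.

```python
import math


def accuracy_standard_error(accuracy: float, num_examples: int) -> float:
    """Binomial standard error of an accuracy estimated from independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / num_examples)


se = accuracy_standard_error(accuracy=0.8, num_examples=100)
print(f"0.80 +/- {se:.2f}")  # 0.80 +/- 0.04
```

So the jump from 0.76 to 0.80 is about one standard error. Both scores come from the
same 100 examples, so this is only a rough guide to how noisy each number is.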

# Text generation

We need to create a PyTorch Dataset to batch the inputs.

In [30]:
class TextsDataset(torch.utils.data.Dataset):
    def __init__(self, texts: list[str]):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index: int):
        return self.texts[index]

We'll do greedy sampling to better ensure that each completion is one of the class names.

In [31]:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    max_new_tokens=5,
    return_full_text=False,
)
# pad to allow batching. these get masked out ofc
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

In [32]:
text_gen_dataset_tr = TextsDataset(df_tr_mini["prompt"].tolist())

In [33]:
completions = []
for seq in tqdm(
    generator(
        text_gen_dataset_tr,
        # suppress "Setting pad_token_id..." stdout
        pad_token_id=generator.tokenizer.eos_token_id,
        batch_size=2,
    ),
    total=len(text_gen_dataset_tr),
    desc="Sampling",
):
    completions.append(seq[0]['generated_text'])

Let's see if the model generated categories like we asked.

In [34]:
pd.Series(completions).sample(n=10)

29       furniture
14       furniture
38            bike
28       furniture
18           phone
19       furniture
37            bike
42           phone
72            bike
4      electronics
dtype: object

Nice, text generation works well here. Writing the post-processor is trivial.

In [35]:
def process_completion(completion: str, class_names: Sequence[str], default=-1) -> int:
    for i, name in enumerate(class_names):
        if name in completion.lower():
            return i
    return default
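
Here's a quick sanity check of the post-processor on made-up completions. The function
is repeated so that the snippet runs on its own. The example strings are hypothetical;
they're not model outputs.

```python
from typing import Sequence


def process_completion(completion: str, class_names: Sequence[str], default=-1) -> int:
    # same substring-matching logic as the cell above
    for i, name in enumerate(class_names):
        if name in completion.lower():
            return i
    return default


toy_class_names = ["bike", "car", "electronics", "furniture", "housing", "phone"]
toy_completions = ["furniture", " Phone.", "I'm not sure", "carpet"]
toy_preds = [
    process_completion(completion, toy_class_names) for completion in toy_completions
]
print(toy_preds)  # [3, 5, -1, 1]
```

Note that matching is by substring and the first matching class name wins, so a
completion like "carpet" gets mapped to `car`.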

In [36]:
preds_text_gen = [
    process_completion(completion, class_names)
    for completion in completions
]

How many of the completions could be mapped to a label?

In [37]:
(pd.Series(preds_text_gen) != -1).mean()

1.0

How accurate are the predictions?

In [38]:
(df_tr_mini["label"] == preds_text_gen).mean()

0.72

# Evaluate on test data

It's important to do some things honorably.

<span style="font-family: Baskerville; font-size: 18px;">I solemnly swear that I
evaluated on the test set twice (once per pre-selected method), running only the
following cells in sequence once.</span>

<img src="../signature.png" alt="drawing" width="200"/>
<div style="width:200px"><hr/></div>

In [39]:
_df_raw_te = pd.DataFrame(datasets.load_dataset("craigslist_bargains", split="test"))
len(_df_raw_te)

838

In [40]:
df_te = process(craigslist_df=_df_raw_te)

## Text generation

In [41]:
text_gen_dataset_te = TextsDataset(df_te["prompt"].tolist())

In [42]:
completions_te = []
for seq in tqdm(
    generator(
        text_gen_dataset_te,
        # suppress "Setting pad_token_id..." stdout
        pad_token_id=generator.tokenizer.eos_token_id,
        batch_size=2,
    ),
    total=len(text_gen_dataset_te),
    desc="Sampling",
):
    completions_te.append(seq[0]['generated_text'])

In [43]:
preds_text_gen_te = [
    process_completion(completion, class_names)
    for completion in completions_te
]

How many of the completions could be mapped to a label?

In [44]:
(pd.Series(preds_text_gen_te) != -1).mean()

0.9976133651551312

How accurate are the predictions?

In [45]:
(preds_text_gen_te == df_te["label"]).mean()

0.8949880668257757

## CAPPr

In [46]:
pred_probs_te = predict_proba(
    prompts=df_te["prompt"],
    completions=class_names,
    model_and_tokenizer=(model, tokenizer),
    batch_size=2,
    prior=prior,  # estimated from independent training data
    discount_completions=1.0,
    log_marg_probs_completions=log_marg_probs_completions,
)

How accurate are the predictions?

In [47]:
(pred_probs_te.argmax(axis=1) == df_te["label"]).mean()

0.9069212410501193

Hmm, this seems to be as good as [Refuel AI's few-shot
experiment](https://github.com/refuel-ai/autolabel/blob/main/examples/craigslist/example_craigslist.ipynb)
with `gpt-3.5-turbo`. But the way they processed dialogues seems off. So I'm not gonna
say this 7B model beats `gpt-3.5-turbo` just yet.
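
One way to gauge whether the gap between the two test accuracies above is more than
noise is a paired comparison of per-example correctness. Here's a minimal sketch. It
assumes two boolean arrays marking which examples each method got right. The toy arrays
and the helper name `paired_accuracy_difference` are invented; in this notebook, the
arrays would come from comparing `preds_text_gen_te` and `pred_probs_te.argmax(axis=1)`
to `df_te["label"]`.

```python
import numpy as np


def paired_accuracy_difference(
    correct_a: np.ndarray, correct_b: np.ndarray
) -> tuple[float, float]:
    """
    Hypothetical helper: mean and standard error of the per-example difference in
    correctness between method A and method B, evaluated on the same examples.
    """
    diffs = correct_a.astype(float) - correct_b.astype(float)
    mean_diff = float(diffs.mean())
    standard_error = float(diffs.std(ddof=1) / np.sqrt(len(diffs)))
    return mean_diff, standard_error


# made-up correctness indicators for 8 examples
toy_correct_a = np.array([1, 1, 1, 0, 1, 1, 1, 1], dtype=bool)
toy_correct_b = np.array([1, 0, 1, 0, 1, 1, 1, 1], dtype=bool)
mean_diff, standard_error = paired_accuracy_difference(toy_correct_a, toy_correct_b)
print(mean_diff, standard_error)
```

If the mean difference is within a couple of standard errors of 0, the gap could
plausibly be noise.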