## Overview

In this notebook, we will walk through:
1. Loading RAFT datasets
2. Creating a classifier using any CausalLM from the HuggingFace Hub
3. Generating predictions with that classifier for RAFT test examples

## Loading RAFT datasets

This starter kit focuses on the ADE Corpus V2 dataset, but similar code can be run for any of the datasets in RAFT. To list the available dataset configurations:

In [9]:
import datasets
datasets.logging.set_verbosity(datasets.logging.ERROR)


datasets.get_dataset_config_names("ought/raft")

['ade_corpus_v2',
 'banking_77',
 'terms_of_service',
 'tai_safety_research',
 'neurips_impact_statement_risks',
 'overruling',
 'systematic_review_inclusion',
 'one_stop_english',
 'tweet_eval_hate',
 'twitter_complaints',
 'semiconductor_org_types']

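Each dataset in RAFT consists of a training set of only 50 labeled examples and an unlabeled test set. Every label is stored as an integer that maps to a textual name. Let's take a look at one of the examples in the training set. (The outputs captured in this notebook appear to come from a run with `DATASET = "banking_77"`, which is why they show banking queries rather than ADE sentences.)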

In [10]:
DATASET = "ade_corpus_v2"

train_data = datasets.load_dataset("ought/raft", DATASET, split="train")
print(f"Size of the training set: {len(train_data)}")

# Convert the label from its integer id to its textual name
first_example = train_data[0]
first_example["Label"] = train_data.features["Label"].int2str(first_example["Label"])

print(f"First training example: {first_example}")


Size of the training set: 50
First training example: {'Query': 'Is it possible for me to change my PIN number?', 'ID': 0, 'Label': 'change_pin'}


Now let's load and take a look at the first example in the unlabeled test set. The label field still exists, but it always contains a 0:

In [11]:
test_data = datasets.load_dataset("ought/raft", DATASET, split="test")
print(f"Size of the test set: {len(test_data)}")

print(f"First test example: {test_data[0]}")

Size of the test set: 5000
First test example: {'Query': 'where did my funds come from?', 'ID': 50, 'Label': 0}


## Creating a classifier from the HuggingFace Model Hub

We provide a class that uses the same prompt construction method as our GPT-3 baseline but works with any CausalLM on the HuggingFace Model Hub. The classifier automatically uses a GPU if one is available. The inline comments in the cell below briefly document the arguments for configuring the classifier.

In [12]:
from raft_baselines.classifiers import TransformersCausalLMClassifier

classifier = TransformersCausalLMClassifier(
    model_type="distilgpt2", # the model to use from the HF hub
    training_data=train_data, # the training data
    num_prompt_training_examples=25, # see raft_predict.py for the per-dataset number of training examples used in the GPT-3 baseline runs
    # Note: with models that have smaller context windows, it may be better to use fewer training examples and/or shorter instructions.
    add_prefixes=(DATASET == "banking_77"), # set to True for banking_77, since multiple classes start with the same token
    config=DATASET, # for task-specific instructions and field ordering
    use_task_specific_instructions=True,
    do_semantic_selection=True,
)
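
The comment on `add_prefixes` notes that several banking_77 class names start with the same token. The sketch below illustrates why that could matter if classes are distinguished by the first token of each label. It uses a toy tokenizer that splits on underscores and whitespace; the tokenizer and the helper names (`first_token`, `colliding_labels`, `add_numeric_prefixes`) are assumptions made for illustration and are not part of `raft_baselines`.

```python
import re
from collections import defaultdict

def first_token(label):
    """Toy stand-in for a tokenizer: first underscore- or space-delimited piece."""
    return re.split(r"[_\s]+", label.strip())[0]

def colliding_labels(labels):
    """Group labels by first token and keep only groups with more than one label."""
    groups = defaultdict(list)
    for label in labels:
        groups[first_token(label)].append(label)
    return {tok: group for tok, group in groups.items() if len(group) > 1}

def add_numeric_prefixes(labels):
    """Prefix each label with its 1-based position, e.g. '1. card_arrival'."""
    return [f"{i}. {label}" for i, label in enumerate(labels, start=1)]

labels = ["card_arrival", "card_linking", "card_swallowed", "change_pin"]
print(colliding_labels(labels))                        # the three card_* labels share "card"
print(colliding_labels(add_numeric_prefixes(labels)))  # no shared first piece once prefixed
```

A real subword tokenizer splits text differently, so treat this only as intuition for why numbered prefixes can help separate classes.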

## Generating predictions for RAFT test examples

### Example prompt and prediction

In [13]:
first_test_example = test_data[0]

# delete the placeholder Label (always 0 in the test set)
del first_test_example["Label"]

# probabilities for all classes
output_probs = classifier.classify(first_test_example, should_print_prompt=True)
output_probs

The following is a banking customer service query. Classify the query into one of the 77 categories available.
Possible labels:
1. Refund_not_showing_up
2. activate_my_card
3. age_limit
4. apple_pay_or_google_pay
5. atm_support
6. automatic_top_up
7. balance_not_updated_after_bank_transfer
8. balance_not_updated_after_cheque_or_cash_deposit
9. beneficiary_not_allowed
10. cancel_transfer
11. card_about_to_expire
12. card_acceptance
13. card_arrival
14. card_delivery_estimate
15. card_linking
16. card_not_working
17. card_payment_fee_charged
18. card_payment_not_recognised
19. card_payment_wrong_exchange_rate
20. card_swallowed
21. cash_withdrawal_charge
22. cash_withdrawal_not_recognised
23. change_pin
24. compromised_card
25. contactless_not_working
26. country_support
27. declined_card_payment
28. declined_cash_withdrawal
29. declined_transfer
30. direct_debit_payment_not_recognised
31. disposable_card_limits
32. edit_personal_details
33. exchange_charge
34. exchange_rate
35. exchange

{'Refund_not_showing_up': 0.0070044743,
 'activate_my_card': 0.0343347,
 'age_limit': 0.002957408,
 'apple_pay_or_google_pay': 0.001034065,
 'atm_support': 0.00079292984,
 'automatic_top_up': 0.001262399,
 'balance_not_updated_after_bank_transfer': 0.027695943,
 'balance_not_updated_after_cheque_or_cash_deposit': 0.031180397,
 'beneficiary_not_allowed': 0.0012036111,
 'cancel_transfer': 0.0009232421,
 'card_about_to_expire': 0.0002955884,
 'card_acceptance': 0.0018507269,
 'card_arrival': 0.00034799604,
 'card_delivery_estimate': 0.00016433404,
 'card_linking': 8.8245e-05,
 'card_not_working': 0.000100416306,
 'card_payment_fee_charged': 0.00019868347,
 'card_payment_not_recognised': 0.00017375551,
 'card_payment_wrong_exchange_rate': 5.9682e-05,
 'card_swallowed': 0.00053263427,
 'cash_withdrawal_charge': 0.0009322987,
 'cash_withdrawal_not_recognised': 0.018336443,
 'change_pin': 0.0025327236,
 'compromised_card': 0.0008759908,
 'contactless_not_working': 0.000335608,
 'country_suppo
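
With this many classes, the full probability dictionary is hard to scan. A minimal sketch for ranking the most likely labels follows; `top_k_labels` is a hypothetical helper, and `partial_probs` holds only a handful of entries copied from the truncated output above, not the full distribution.

```python
def top_k_labels(probs, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# A few entries copied from the (truncated) output above; not the full distribution.
partial_probs = {
    "Refund_not_showing_up": 0.0070044743,
    "activate_my_card": 0.0343347,
    "balance_not_updated_after_bank_transfer": 0.027695943,
    "balance_not_updated_after_cheque_or_cash_deposit": 0.031180397,
    "cash_withdrawal_not_recognised": 0.018336443,
}

for label, prob in top_k_labels(partial_probs, k=3):
    print(f"{label}: {prob:.4f}")
```

In the notebook itself, the same helper could be applied to `output_probs` directly.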

### Example prediction df for first N test examples

The following generates a CSV with predictions for the first N test examples in the format required for submission (ID, Label). See further submission instructions [here](https://huggingface.co/datasets/ought/raft-submission).

Note that, with the code as written, the classifier is expected to predict "Not ADE-related" for all 10 test examples; few-shot classification is pretty hard!

In [14]:
import pandas as pd

N_TEST = 10
test_examples_to_predict = test_data.select(range(N_TEST))

def predict_one(clf, test_example):
    # Drop the placeholder label without mutating the caller's example
    features = {k: v for k, v in test_example.items() if k != "Label"}
    output_probs = clf.classify(features)
    # Return the label with the highest probability
    return max(output_probs, key=output_probs.get)

# Collect rows first; DataFrame.append was removed in pandas 2.0
rows = [
    {"ID": example["ID"], "Label": predict_one(classifier, example)}
    for example in test_examples_to_predict
]
result_df = pd.DataFrame(rows, columns=["ID", "Label"]).astype({"ID": int, "Label": str})

result_df.to_csv("../data/example_predictions.csv", index=False)
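
Before submitting, it can be worth sanity-checking the file. The sketch below uses only the standard library to check that a CSV has the `ID, Label` header, the expected IDs in order, and no empty labels. `check_submission_csv` and its specific checks are assumptions for illustration, not the official submission validation; the sample IDs start at 50 to match the first test example shown earlier.

```python
import csv
import io

def check_submission_csv(text, expected_ids):
    """Return a list of problems found in a predictions CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    problems = []
    if reader.fieldnames != ["ID", "Label"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    rows = list(reader)
    ids = [int(row["ID"]) for row in rows]
    if ids != list(expected_ids):
        problems.append("IDs do not match the expected test IDs")
    if any(not row["Label"] for row in rows):
        problems.append("at least one row has an empty Label")
    return problems

sample = "ID,Label\n50,Not ADE-related\n51,Not ADE-related\n"
print(check_submission_csv(sample, expected_ids=[50, 51]))
```

To check the file written above, its contents could be read from `../data/example_predictions.csv` and passed in together with the IDs of `test_examples_to_predict`.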