## Getting started with the RAFT benchmark

In this notebook, we will walk through:

1. Loading the tasks from the [RAFT dataset](https://huggingface.co/datasets/ought/raft)
2. Creating a classifier using any CausalLM from the [Hugging Face Hub](https://huggingface.co/models)
3. Generating predictions using that classifier for RAFT test examples

This should provide you with the steps needed to make a submission to the [RAFT leaderboard](https://huggingface.co/spaces/ought/raft-leaderboard)!

## Setup

We'll set the logging level of the `datasets` library to disable some warnings that can clutter the notebook:

In [1]:
import datasets

datasets.logging.set_verbosity_error()

## Loading RAFT datasets

We'll focus on the ADE corpus V2 task in this starter kit, but similar code could be run for all of the tasks in RAFT. To see the possible tasks, we can use the following function from `datasets`:

In [2]:
from datasets import get_dataset_config_names

RAFT_TASKS = get_dataset_config_names("ought/raft")
RAFT_TASKS

['ade_corpus_v2',
 'banking_77',
 'terms_of_service',
 'tai_safety_research',
 'neurips_impact_statement_risks',
 'overruling',
 'systematic_review_inclusion',
 'one_stop_english',
 'tweet_eval_hate',
 'twitter_complaints',
 'semiconductor_org_types']
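To run the same steps on every task, one option is to collect the per-task datasets in a dictionary. The sketch below is only an illustration: `load_all_tasks` and `fake_loader` are hypothetical helpers, not part of `datasets`, and the stand-in loader is there so the snippet runs without network access. In the notebook you would pass `load_dataset` and `RAFT_TASKS` instead.

```python
def load_all_tasks(task_names, load_fn):
    """Load every task with the given loader and return a dict keyed by task name."""
    return {task: load_fn("ought/raft", name=task) for task in task_names}


def fake_loader(path, name):
    # Hypothetical stand-in for datasets.load_dataset, used only for this demo
    return f"<{path}:{name}>"


demo = load_all_tasks(["ade_corpus_v2", "banking_77"], fake_loader)
print(demo)

# In the notebook (requires network access):
# all_tasks = load_all_tasks(RAFT_TASKS, load_dataset)
```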

Each task in RAFT consists of a training set of only **_50 labeled examples_** and an unlabeled test set. All labels have a textual version associated with them. Let's load the corpus associated with the `ade_corpus_v2` task:

In [18]:
from datasets import load_dataset

TASK = "ade_corpus_v2"
raft_dataset = load_dataset("ought/raft", name=TASK)
raft_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'ID', 'Label'],
        num_rows: 50
    })
    test: Dataset({
        features: ['Sentence', 'ID', 'Label'],
        num_rows: 5000
    })
})

The `raft_dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for each of the training and test splits. For this task we have 50 labeled examples to work with and 5,000 test examples that we need to generate predictions for. To access an example, specify the name of the split and then the index as follows:

In [4]:
raft_dataset["train"][0]

{'Sentence': 'No regional side effects were noted.', 'ID': 0, 'Label': 2}

Here we can see that each example is assigned a label ID which denotes its class in this particular task. Let's check how many classes we have in the training set:

In [12]:
label_ids = raft_dataset["train"].unique("Label")
label_ids

[2, 1]

Okay, two distinct label IDs indicate that `ade_corpus_v2` is a binary classification task. We can extract the human-readable label names as follows:

In [15]:
features = raft_dataset["train"].features["Label"]
id2label = {idx : features.int2str(idx) for idx in label_ids}
id2label

{2: 'not ADE-related', 1: 'ADE-related'}
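It can also be handy to go the other way, from label names back to IDs, for example when comparing predicted label strings against the integer labels of held-out training examples. A minimal sketch, using the `id2label` mapping shown above; the `predicted` list is made up for illustration.

```python
# Mapping copied from the output above
id2label = {2: "not ADE-related", 1: "ADE-related"}

# Invert the mapping: label name -> label ID
label2id = {label: idx for idx, label in id2label.items()}

# Hypothetical predictions, for illustration only
predicted = ["not ADE-related", "ADE-related", "not ADE-related"]
predicted_ids = [label2id[label] for label in predicted]
print(predicted_ids)  # [2, 1, 2]
```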

Note that the test set also has a `Label` entry, but it is always 0, which is a dummy label (the real label is what your model needs to predict!):

In [16]:
raft_dataset["test"].unique("Label")

[0]

To get a broader sense of what kind of data we are dealing with, we can use the following function to randomly sample from the corpus and display the results as a table:

In [17]:
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
    
show_random_elements(raft_dataset["train"])

| | Sentence | ID | Label |
|---|---|---|---|
| 0 | Several hypersensitivity reactions to cloxacillin have been reported, although IgE-mediated allergic reactions to the drug are rare and there is little information about possible tolerance to other semisynthetic penicillins or cephalosporins in patients with cloxacillin allergy. | 39 | ADE-related |
| 1 | Best-corrected visual acuity measurements were performed at every visit. | 33 | not ADE-related |
| 2 | A closer look at septic shock. | 20 | not ADE-related |
| 3 | Considerable improvement of myasthenic symptoms was seen in all patients within 3-6 months after the initiation of this therapy. | 34 | not ADE-related |
| 4 | No abnormalities were identified on review of collection and processing records. | 25 | not ADE-related |
| 5 | We report a case of long lasting respiratory depression after intravenous administration of morphine to a 7 year old girl with haemolytic uraemic syndrome. | 32 | ADE-related |
| 6 | We describe the case of a 10-year-old girl with two epileptic seizures and subcontinuous spike-waves during sleep, who presented unusual side-effects related to clobazam (CLB) monotherapy. | 1 | not ADE-related |
| 7 | IMPLICATIONS: Dexmedetomidine, an alpha(2)-adrenoceptor agonist, is indicated for sedating patients on mechanical ventilation. | 43 | not ADE-related |
| 8 | The cases are important in documenting that drug-induced dystonias do occur in patients with dementia, that risperidone appears to have contributed to dystonia among elderly patients, and that the categorization of dystonic reactions needs further clarification. | 24 | ADE-related |
| 9 | OBJECTIVE: To describe onset of syndrome of inappropriate antidiuretic hormone (SIADH) associated with vinorelbine therapy for advanced breast cancer. | 9 | ADE-related |
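The index-picking loop in `show_random_elements` retries until it finds an index it has not used yet. As an alternative sketch, `random.sample` from the standard library draws distinct indices in a single call. The helper name `pick_random_indices` and its `seed` argument are our own additions for illustration, not part of the notebook.

```python
import random


def pick_random_indices(dataset_length, num_examples=10, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(range(dataset_length), num_examples)


# 50 matches the size of the RAFT training split shown earlier
picks = pick_random_indices(50, num_examples=10, seed=0)
print(picks)
```

The resulting list could be used in place of `picks` when indexing the dataset, e.g. `dataset[picks]`.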


## Creating a classifier from the Hugging Face Model Hub

We provide a class which uses the same prompt construction method as our GPT-3 baseline, but works with any CausalLM on the [Hugging Face Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads). The classifier will automatically use a GPU if one is available. Brief documentation on the arguments for configuring the classifier is provided in the comments below:

In [19]:
from raft_baselines.classifiers import TransformersCausalLMClassifier

classifier = TransformersCausalLMClassifier(
    model_type="distilgpt2",             # The model to use from the HF hub
    training_data=raft_dataset["train"], # The training data (the 50 labeled examples loaded above)
    num_prompt_training_examples=25,     # See raft_predict.py for the number of training examples used on a per-dataset basis in the GPT-3 baselines run.
                                         # Note that it may be better to use fewer training examples and/or shorter instructions with other models with smaller context windows.
    add_prefixes=(TASK=="banking_77"),   # Set to True when using banking_77 since multiple classes start with the same token
    config=TASK,                         # For task-specific instructions and field ordering
    use_task_specific_instructions=True,
    do_semantic_selection=True,
)

## Generating predictions for RAFT test examples

In order to generate predictions on the test set, we need to provide the model with an appropriate prompt with the instructions. Let's take a look at how this works on a single example from the test set.

### Example prompt and prediction

The `TransformersCausalLMClassifier` has a `classify` method that will automatically generate the predicted probabilities from the model. We'll set `should_print_prompt=True` so that we can see which prompt is being used to instruct the model:

In [24]:
test_dataset = raft_dataset["test"]
first_test_example = test_dataset[0]

# delete the 0 Label
del first_test_example["Label"]

# probabilities for all classes
output_probs = classifier.classify(first_test_example, should_print_prompt=True)
output_probs

Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below:
Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).
Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.
Possible labels:
1. ADE-related
2. not ADE-related

Sentence: METHODS: We identified three patients who developed skin necrosis and determined any factors, which
Label: not ADE-related

Sentence: In 1991 the patient were found to be seropositive for HCV antibodies as detected by
Label: not ADE-related

Sentence: No regional side effects w

{'ADE-related': 0.30973935, 'not ADE-related': 0.6902607}

In this example we can see the model predicts that the example is not related to an adverse drug effect. We can use this technique to generate predictions across the whole test set! Let's take a look.
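For intuition, the prompt printed above has a simple layout: task instructions, the list of possible labels, a few labeled training examples, and finally the sentence to classify. The sketch below rebuilds that layout with plain string formatting. It is not the `raft_baselines` implementation: `build_prompt` is a hypothetical helper, and the trailing empty `Label:` line is an assumption, since the printed prompt above is cut off before its end. The sentences are taken from the training samples shown earlier.

```python
def build_prompt(instructions, labels, train_examples, target_sentence):
    """Assemble a few-shot prompt in the same layout as the printed example."""
    label_lines = [f"{i}. {name}" for i, name in enumerate(labels, start=1)]
    header = "\n".join([instructions, "Possible labels:", *label_lines])
    shots = [
        f"Sentence: {sentence}\nLabel: {label}"
        for sentence, label in train_examples
    ]
    # Assumption: the prompt ends with an empty "Label:" for the model to complete
    target = f"Sentence: {target_sentence}\nLabel:"
    return "\n\n".join([header, *shots, target])


prompt = build_prompt(
    instructions="Label the sentence based on whether it is related to an adverse drug effect (ADE).",
    labels=["ADE-related", "not ADE-related"],
    train_examples=[
        ("No regional side effects were noted.", "not ADE-related"),
        ("Best-corrected visual acuity measurements were performed at every visit.", "not ADE-related"),
    ],
    target_sentence="A closer look at septic shock.",  # stand-in for a test sentence
)
print(prompt)
```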

### Creating a submission file of predictions

To submit to the RAFT leaderboard, you'll need to provide a CSV file of predictions on the test set for each task (see [here](https://huggingface.co/datasets/ought/raft-submission) for detailed instructions). The following code snippet generates predictions for the first $N$ test examples in the two-column format required for submission, with an `ID` column and a `Label` column.

Note that, with the code as written, the model is expected to predict "not ADE-related" for all 10 test examples; few-shot classification is pretty hard!

In [28]:
# Increase this to len(test_dataset) to generate predictions over the full test set
N_TEST = 10
test_examples_to_predict = test_dataset.select(range(N_TEST))

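def predict_one(clf, test_example):
    # Copy the example without the dummy Label so the caller's dict is left untouched
    example_without_label = {k: v for k, v in test_example.items() if k != "Label"}
    output_probs = clf.classify(example_without_label)
    # Pick the label with the highest predicted probability
    output_label = max(output_probs.items(), key=lambda kv_pair: kv_pair[1])[0]
    return output_label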

data = []
for example in test_examples_to_predict:
    data.append({"ID": example["ID"], "Label": predict_one(classifier, example)})
    
result_df = pd.DataFrame(data=data, columns=["ID", "Label"]).astype({"ID": int, "Label": str})   
result_df

| | ID | Label |
|---|---|---|
| 0 | 50 | not ADE-related |
| 1 | 51 | not ADE-related |
| 2 | 52 | not ADE-related |
| 3 | 53 | not ADE-related |
| 4 | 54 | not ADE-related |
| 5 | 55 | not ADE-related |
| 6 | 56 | not ADE-related |
| 7 | 57 | not ADE-related |
| 8 | 58 | not ADE-related |
| 9 | 59 | not ADE-related |


Note that the `ID` column starts from index 50 since we have IDs 0-49 in the training set. The final step is to save the DataFrame as a CSV file and build out the rest of your submission:

In [29]:
result_df.to_csv("../data/example_predictions.csv", index=False)
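Before uploading, it may be worth sanity-checking the saved file. The sketch below uses only the standard library and checks for an `ID,Label` header, integer IDs without duplicates, and labels drawn from the task's label names. These checks are our own assumptions about what a well-formed file looks like, and `validate_submission` is a hypothetical helper; the submission instructions linked above remain the authoritative description of the format.

```python
import csv
import io


def validate_submission(csv_text, allowed_labels):
    """Return a list of problems found in a predictions CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["ID", "Label"]:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    seen_ids = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        if not row["ID"].isdigit():
            problems.append(f"row {row_number}: ID {row['ID']!r} is not an integer")
        elif row["ID"] in seen_ids:
            problems.append(f"row {row_number}: duplicate ID {row['ID']}")
        seen_ids.add(row["ID"])
        if row["Label"] not in allowed_labels:
            problems.append(f"row {row_number}: unknown label {row['Label']!r}")
    return problems


# Small in-memory example; the last row has a wrongly capitalized label
example_csv = "ID,Label\n50,not ADE-related\n51,not ADE-related\n52,Not ADE-related\n"
problems = validate_submission(example_csv, {"ADE-related", "not ADE-related"})
print(problems)  # ["row 4: unknown label 'Not ADE-related'"]
```

To check the real file, read it with `open("../data/example_predictions.csv").read()` and pass the text to the same function.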

Good luck with the rest of the benchmark!