# Shared Task 4
This Notebook is designed to be used in **Google Colab!**

You will gain points in this task by competing on a leaderboard in two tracks and by implementing some simple functions in this notebook. The two tracks in which you are competing are:   
1) **closed track**: each team uses a pre-defined LLM (Mistral) and designs prompts (simple or complex) to perform the two tasks defined in subtask 3, and submits its best 3 predictions for the test instances.

2) **open track**: each team can use any LLM (e.g., GPT-4) or the trained model from subtask 3 to perform the tasks, and submits its best 3 predictions for the test instances.

You will get a refresher on how to prompt a model over an API, and also learn how to prompt a model running locally in Google Colab. You will develop and evaluate your prompts on the dev dataset and then submit the results of your prompts / models on the test dataset.

In [None]:
!pip -q install bitsandbytes accelerate xformers einops langchain
import os
import pickle
import time
import numpy as np
import json
import torch
import accelerate
import math
from langchain import PromptTemplate, LLMChain, HuggingFacePipeline

import warnings
warnings.filterwarnings('ignore')

import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## LLMs as an Evaluator
The goal of this section is to use the LLM 'Mistral' as an evaluator for other LLMs. You should prompt the LLM with two prompts, one employing CoT (chain-of-thought) and one without. At the end, you can compare the two prompts by calculating Precision, Recall and F1 Score.

The following cell gives you access to prompting Mistral, as you learned in Tutorial 6.

In [None]:
from huggingface_hub import InferenceClient

HF_API_TOKEN: str = open("/content/drive/MyDrive/Colab Notebooks/Folt/subtask4/HF_TOKEN", 'r').read().strip()
client = InferenceClient(token=HF_API_TOKEN, timeout=300)

In [None]:
def inference(text: str, max_new_tokens: int = 20, wait_btwn_prompts: float = 0.1):
    """
    text: str
        Text to use for inference.
    max_new_tokens: int
        Maximum number of tokens to generate.
    wait_btwn_prompts: float
        Seconds to wait before sending the request.
    """
    time.sleep(wait_btwn_prompts)

    output = client.text_generation(
        text,
        max_new_tokens=max_new_tokens,
        model="mistralai/Mistral-7B-Instruct-v0.1",

    )
    return output.removeprefix(" ")

In the following cells, we will set up the local model. Feel free to try and understand the setup, but you can also just use the inference as is.

In [None]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
      model_name,
      quantization_config=bnb_config,
      device_map="auto")

In [None]:
tokenizer.pad_token = tokenizer.unk_token
model.config.pad_token_id = tokenizer.pad_token_id
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    do_sample=True,
    max_new_tokens=10,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
    )

In [None]:
llm = HuggingFacePipeline(pipeline=pipe)

In [None]:
def local_inference(llm_chain, args):
  """
  llm_chain: LLMChain
      The LLMChain to be used for prompting.
  args: dict or list(dict)
      A dict or a list of dicts containing the input data to be inserted into the placeholders of the LLM chain.
  """
  if isinstance(args, list):
    ans = llm_chain.batch(args)
  else:
    ans = llm_chain.run(args)
  return ans

The following cell gives you some example prompts that you can use to develop your own evaluation prompts. Keep in mind that you want to complete both tasks of subtask 3 again, but this time using prompting. The following examples are only for the first task.

In [None]:
Question = "Example Question"
LLM_A = "Hello I'm not human"
GOLD_A = "Hello I'm human"

simple_prompt = f"[INST]Which of the following Answers (A) is more fit for the provided Question (Q). Provide your Verdict in the Format of 'Verdict: A'. Q:'{Question}' A1: '{LLM_A}', A2: '{GOLD_A}'[/INST]"

cot_prompt = f"<s> [INST]Which of the following Answers (A) is more fit for the provided Question (Q). Provide your Verdict in the Format of 'Verdict: A'. Q:'The sky is blue.' A1: 'Thats wrong.', A2: 'Thats true.'[/INST] Verdict: A1</s> [INST]Which of the following Answers (A) is more fit for the provided Question (Q). Q:'{Question}' A1: '{LLM_A}', A2: '{GOLD_A}'[/INST]"

Generally, you can follow this scheme when using prompting to evaluate this task, but also keep in mind that you have already learned CoT prompting. Feel free to try out all sorts of prompts!
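As one possible starting point, the sketch below assembles a chain-of-thought variant of the prompt in the same `[INST] ... [/INST]` format used above. The helper name `build_cot_prompt` and the exact wording of the instruction are assumptions for illustration, not a prescribed solution.

```python
def build_cot_prompt(question: str, llm_a: str, gold_a: str) -> str:
    """Return a chain-of-thought style prompt in the [INST] ... [/INST] format.

    Hypothetical helper: the instruction wording is only an illustration.
    """
    instruction = (
        "Which of the following Answers (A) is more fit for the provided Question (Q)? "
        "First explain your reasoning step by step, "
        "then finish with a line in the format 'Verdict: A1' or 'Verdict: A2'."
    )
    return f"[INST]{instruction} Q: '{question}' A1: '{llm_a}' A2: '{gold_a}'[/INST]"


prompt = build_cot_prompt("Example Question", "Hello I'm not human", "Hello I'm human")
print(prompt)
```

Because the reasoning is generated before the verdict, a prompt like this will probably need a larger `max_new_tokens` than the 10 configured in the local pipeline.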


# Task 1
input – tuple(question, gold answer, LLM answer)   
output – prediction whether LLM answer is harmful or non-harmful

## Building some prompts
Let's start by building the prompts you want to use for evaluating the first task.

In [None]:
def build_complex_prompt_t1():
    #TODO:THIS IS AN EXAMPLE HOW LOCAL PROMPTING WORKS
    #YOUR PROMPT HERE
    text = "<s> [INST]Which of the following Answers (A) is more fit for the provided Question (Q). Provide your Verdict in the Format of 'Verdict: A'. Q:'The sky is blue.' A1: 'Thats wrong.', A2: 'Thats true.'[/INST] Verdict: A1</s> [INST]Which of the following Answers (A) is more fit for the provided Question (Q). Q:'{question}' A1: '{llm_a}' A2: '{gold_a}'[/INST]"
    #YOUR PROMPT HERE
    prompt = PromptTemplate(template=text, input_variables=["question","llm_a", "gold_a"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    return llm_chain

In [None]:
def build_simple_prompt_t1():
    #YOUR PROMPT HERE
    text = ""
    #YOUR PROMPT HERE
    prompt = PromptTemplate(template=text, input_variables=["question","llm_a", "gold_a"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    return llm_chain

Next, we need to load the data from your annotated JSON file.
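To make the expected input concrete, here is a minimal sketch of what one entry of the annotation file might look like, inferred only from the fields that `extract_data` reads. All values are made-up placeholders, the `_answer` suffix of the model name is an assumption, and the real file may contain additional fields.

```python
import json

# Made-up record: only the keys are taken from what extract_data accesses.
example_record = {
    "data": {
        "llm_model_name": "ChatGPT_answer",   # assumed 7-character suffix, removed by [:-7]
        "topic1": "['health', 'nutrition']",  # a list stored as a string
        "topic2": "['diet']",
        "question": "Example question?",
        "gold_answer": "Example gold answer.",
        "llm_answer": "Example LLM answer.",
    },
    "annotations": [
        {
            "result": [
                {"type": "labels",
                 "value": {"text": "Example LLM answer.", "labels": ["Contradiction"]}},
                {"type": "choices", "origin": "manual",
                 "value": {"choices": ["Example question type"]}},
            ]
        }
    ],
}

# The same string handling that extract_data applies to these fields.
llm_model = example_record["data"]["llm_model_name"][:-7]
topic_1 = example_record["data"]["topic1"][1:-1].replace("'", "").split(", ")
labels = [
    unit["value"].get("labels", [""])[0]
    for unit in example_record["annotations"][0]["result"]
    if unit["type"] == "labels"
]
print(llm_model, topic_1, labels)

# The annotation file itself is a JSON list of such records.
serialized = json.dumps([example_record])
```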

In [None]:
def extract_data(path):
    '''
    Extracts information from the annotation JSON file containing cca_llm_answers.

    Args:
        path (str): The path to the JSON file.

    Returns:
        list: A list of tuples, each containing information about a QA pair.
              Each tuple includes the following elements:
              - Question type
              - LLM model name
              - Topic 1 (list of strings)
              - Topic 2 (list of strings)
              - Labels (list of strings)
              - Question text
              - Gold answer text
              - LLM answer text
    '''
    with open(path, "r", encoding="utf-8") as file:
        cca_llm_answers = json.load(file)

    annotation_entries = []
    for qa_pair in cca_llm_answers:
        llm_model = qa_pair["data"]["llm_model_name"][:-7]
        topic_1 = qa_pair["data"]["topic1"][1:-1].replace("'", "").split(", ")
        topic_2 = qa_pair["data"]["topic2"][1:-1].replace("'", "").split(", ")
        question = qa_pair["data"]["question"].replace('\n', '')
        gold_answer = qa_pair["data"]["gold_answer"].replace('\n', '')
        llm_answer = qa_pair["data"]["llm_answer"].replace('\n', '')
        # we initialize the values as an empty string or an empty list
        question_type, labels = "", []

        # we iterate through the annotation results to extract the question_type and the labels
        for answer_unit in qa_pair["annotations"][0]["result"]:
            if answer_unit["type"] == "labels":
                # Extracting labels and handling the case where labels are not present
                labels.append(answer_unit["value"].get("labels", [""])[0])
            elif answer_unit["type"] == "choices" and answer_unit["origin"] == "manual":
                question_type = answer_unit["value"]["choices"][0]

        # appending the tuple of information for this qa_pair
        annotation_entries.append(
            (question_type, llm_model, topic_1, topic_2, labels, question, gold_answer, llm_answer)
        )
    return annotation_entries

From the extracted data, we build a proper feature set.

In [None]:
def create_feature_set(data):
    '''
    Create a feature set for training/test data consisting of the question, the gold answer and the LLM answer.

    Args:
        data(list of tuples): the extracted data from the function extract_data.

    Returns:
        list: Feature set, where each entry is a tuple with a feature dictionary and its label.
    '''
    feature_set = []
    for qa_pair in data:
        # assign the label: harmful if any answer unit is a Contradiction or Exaggeration
        label = ""
        # qa_pair[4] is a list of labels; keep the empty label if it has no entries
        if qa_pair[4]:
            label = "harmful" if "Contradiction" in qa_pair[4] or "Exaggeration" in qa_pair[4] else "non_harmful"
        # Create a dictionary with the relevant features.
        feature_dict = {"question": qa_pair[5], "gold_answer": qa_pair[6], "llm_answer": qa_pair[7]}
        feature_set.append((feature_dict, label))

    return feature_set

In [None]:
feature_set_t1 = create_feature_set(extract_data("dev.json"))

You can use the following functions to cache your data in case Colab or your local notebook times out.

In [None]:
def cache_data(data, i, pred_gold_dict):
    '''
    Caches the data in case of a catastrophic problem (e.g. the API stops responding). The file is named after the iteration at which the run crashed.
    :param data: Already predicted labels
    :param i: Iteration at which the data is cached
    :param pred_gold_dict: Dict of the predictions collected so far
    :return: None
    '''
    if "simple" in pred_gold_dict.keys(): pred_gold_dict["cached_complex"] = data
    else: pred_gold_dict["cached_simple"] = data

    os.makedirs("cached", exist_ok=True)

    # use a separate name for the file handle so it does not shadow the data argument
    with open(f"cached/{i}.txt", "wb") as cache_file:
        pickle.dump(pred_gold_dict, cache_file)

In [None]:
def load_cached_data()->tuple:
    '''
    Loads the cached data from a previous run
    :return: Cached Data and the iteration it crashed (i, dict(data))
    '''

    files = os.listdir("cached/")
    def sort_by_no(s:str):
        return int(s[:-4])

    files.sort(key=sort_by_no)
    last_step = "cached/"+files[-1]

    with open(last_step, "rb") as data:
        return int(files[-1][:-4]), dict(pickle.load(data))


## Extracting the Verdict
Think about your prompt and the answer the LLM gives: do you need to preprocess the answer?
Hint:
- Remember that you can prompt the LLM in such a way, that this task may become trivial.
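
One way to approach this, sketched under the assumption that the prompt asks the model to end with a line such as 'Verdict: A1', is a small regular expression. The function name `find_verdict` and the `A1`/`A2` answer tags are illustrative; how a tag maps to a task label depends on your own prompt.

```python
import re
from typing import Optional

# Matches e.g. "Verdict: A1", "verdict:a2" (case-insensitive, flexible spacing).
VERDICT_PATTERN = re.compile(r"verdict\s*:\s*(A[12])", re.IGNORECASE)


def find_verdict(text: str) -> Optional[str]:
    """Return 'A1' or 'A2' from the last 'Verdict: ...' match, or None if absent."""
    matches = VERDICT_PATTERN.findall(text)
    return matches[-1].upper() if matches else None


print(find_verdict(" The second answer is closer. Verdict: A2"))
print(find_verdict("I am not sure."))
```

Taking the last match is a design choice: if the model reasons before answering, earlier mentions of a verdict are ignored in favour of the final one.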

In [None]:
def extract_verdict_t1(pred: str) -> str:
    '''
    Extracts the textual verdict and returns the index of the favoured answer. Remember that you can prompt the LLM in such a way that this task may become trivial.
    :param pred: The evaluation text of the LLM Evaluator
    :return: index of favoured answer
    '''
    verdict = "No verdict"
    # YOUR CODE HERE
    # YOUR CODE HERE
    return verdict

# Prompting the LLM and obtaining the label
Now we'll get to the actual prompting and extracting a label from the answer.

In [None]:
def prompting_model_t1(llm_chain, arg_dict) -> str:

    result = local_inference(llm_chain, arg_dict)
    result = extract_verdict_t1(result)

    return result

In [None]:
def predict_task1(prompts: list) -> dict:
    """
    :param prompts: What prompts to use in the tasks (e.g. ["simple"])
    :return: prediction labels for all the prompts and all ccas
    """
    all_preds = {}
    # YOUR CODE HERE
    
    # YOUR CODE HERE
    return all_preds

In [None]:
test_predictions = predict_task1(["simple"])
assert isinstance(test_predictions, dict)
assert len(test_predictions.keys()) == 1
assert len(test_predictions["simple"]) == len(feature_set_t1) 
assert set(test_predictions["simple"]) == {"harmful", "non_harmful"}

In [None]:
predictions_t1 = predict_task1(["complex", "simple"])

# Task 2
The second task takes a tuple consisting of a question, a gold answer and an LLM answer unit as input and predicts the label of the LLM answer unit (6 categories). The categories are: **Contradiction, Exaggeration, Understatement, Agree with gold answer, Cannot assess, General comment.**   
input – tuple(question, gold answer, LLM answer, LLM answer unit i)   
output – predict the category of LLM answer unit i (6 categories)
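
Because the model's raw text may not match a category name character for character, a tolerant lookup can help. The following is a minimal sketch; `match_category` is a hypothetical helper, and the case-insensitive substring matching is an assumption about what the outputs look like, not a required approach.

```python
from typing import Optional

CATEGORIES = [
    "Contradiction", "Exaggeration", "Understatement",
    "Agree with gold answer", "Cannot assess", "General comment",
]


def match_category(output: str) -> Optional[str]:
    """Return the category whose name occurs earliest in the output, or None."""
    lowered = output.lower()
    # Collect (position, category) for every category name found in the text.
    hits = [(lowered.find(c.lower()), c) for c in CATEGORIES if c.lower() in lowered]
    return min(hits)[1] if hits else None


print(match_category("Label: exaggeration."))
print(match_category("No idea."))
```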

In [None]:
def extract_llm_answer(path):
    '''
    Returns a feature dataset from a JSON file containing the features: question, gold answer and the LLM answer units as well as their labels.

    Parameters:
    - path (str): The path to the JSON file.

    Returns:
    - List[Tuple[Dict[str, str], str]]: A list of tuples containing the features and the label.
      Each tuple consists of a dictionary with "question", "gold_answer", and "llm_answer_unit",
      and a label string.
    '''
    # Read the JSON file
    with open(path, "r", encoding="utf-8") as file:
        cca_llm_answers = json.load(file)
        annotation_entry = []

    # Extract the required information from each entry in the JSON file
    for cca_llm_answer in cca_llm_answers:
        question = cca_llm_answer["data"]["question"].replace('\n', '')
        gold_answer = cca_llm_answer["data"]["gold_answer"].replace('\n', '')

        for qa_pair in cca_llm_answer["annotations"][0]["result"]:
            if qa_pair["type"] == "labels":
                label = qa_pair["value"].get("labels", [""])[0]
                llm_answer_unit = qa_pair["value"].get("text")
                # Add the extracted information to the list
                annotation_entry.append(({"question": question, "gold_answer": gold_answer, "llm_answer_unit": llm_answer_unit}, label))


    return annotation_entry

In [None]:
feature_set_t2 = extract_llm_answer("dev.json")

In [None]:
def build_complex_prompt_t2():
    # ["Contradiction", "Exaggeration", "Understatement", "Agree with gold answer", "Cannot assess", "General comment"]
    #YOUR PROMPT HERE
    text = ""
    #YOUR PROMPT HERE
    prompt = PromptTemplate(template=text, input_variables=["question","llm_answer_unit", "gold_a"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    return llm_chain

In [None]:
def build_simple_prompt_t2():
    # ["Contradiction", "Exaggeration", "Understatement", "Agree with gold answer", "Cannot assess", "General comment"]
    #YOUR PROMPT HERE
    text = ""
    #YOUR PROMPT HERE
    prompt = PromptTemplate(template=text, input_variables=["question","llm_answer_unit", "gold_a"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    return llm_chain

In [None]:
def extract_verdict_t2(result: str):
    '''
    Extract the verdict of the result, if it doesn't already fit the right format
    :param result: Original result
    :return: extracted and cleaned up label
    '''
    #["Contradiction", "Exaggeration", "Understatement", "Agree with gold answer", "Cannot assess", "General comment"]
    verdict = "No verdict"
    # YOUR CODE HERE

    # YOUR CODE HERE
    return verdict

In [None]:
def prompting_model_t2(llm_chain, q, llm_a_unit, gold_a) -> str:
    result = local_inference(llm_chain, {"question":q,"llm_answer_unit":llm_a_unit, "gold_a":gold_a})
    result = extract_verdict_t2(result)

    return result

In [None]:
def predict_task2(prompts: list) -> dict:
    """
    :param prompts: What prompts to use in the tasks (e.g. ["simple"])
    :return: prediction, gold label set for every combination of prompt and task
    """
    all_preds = {}
    start_time = time.time()
    torch.cuda.empty_cache()
    chains = {
        "complex" : build_complex_prompt_t2(),
        "simple" : build_simple_prompt_t2()
    }
    len_all_answers = len(feature_set_t2) * len(prompts)

    for p in prompts:
        predictions = []
        chain = chains[p]

        for i, cca in enumerate(feature_set_t2):
            q = cca[0]["question"]
            llm_a_unit = cca[0]["llm_answer_unit"]
            gold_a = cca[0]["gold_answer"]
            
            pred = prompting_model_t2(chain, q, llm_a_unit, gold_a)
            
            if i % 5 == 0: print("Elapsed time(min): {:.2f}".format((time.time() - start_time)/60))
            print("{:}/{:}: ".format(i + 1, len_all_answers) ,  "\t Prediction: ", pred, "\t Gold Labels: ", cca[1])
            predictions.append(pred)
        all_preds[p] = predictions
        
    return all_preds

In [None]:
results_t2 = predict_task2(["complex", "simple"])

## Evaluating the Model
Compute the Precision, Recall and the F1 Score in order to gain insights into the differences of the two prompting techniques.
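
As a reminder of the definitions, the sketch below computes the three scores by hand for a tiny made-up example, treating `harmful` as the positive class. The helper `precision_recall_f1` is illustrative only; `sklearn.metrics` provides equivalent functions.

```python
def precision_recall_f1(golds, preds, positive="harmful"):
    """Compute precision, recall and F1 for one positive class from two label lists."""
    pairs = list(zip(golds, preds))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


golds = ["harmful", "harmful", "non_harmful", "non_harmful"]
preds = ["harmful", "non_harmful", "harmful", "non_harmful"]
print(precision_recall_f1(golds, preds))  # (0.5, 0.5, 0.5)
```

Here there is one true positive, one false positive and one false negative, so precision and recall are both 1/2 and their harmonic mean is also 1/2.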

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

In [None]:
def calculate_metrics(gold_standard, preds, pos_label = 1, labels = None, average="binary"):
    '''
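    Computes the Precision, Recall, F1 Score and Accuracy of the predictions
    :param gold_standard: The gold standard labels
    :param preds: The verdicts of Mistral
    :param pos_label: The label of the positive class (used when average="binary")
    :param labels: Optional list of labels to include in the computation
    :param average: Averaging strategy, e.g. "binary", "micro", "macro" or "weighted"
    :return: Precision, Recall, F1 Score and Accuracy of the predictions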
    '''
    p = 0.0
    r = 0.0
    f = 0.0
    acc = 0.0

    # YOUR CODE HERE
    
    # YOUR CODE HERE

    return p, r, f, acc

In [None]:
test_golds = [1,0,0,0,1,1,1,1]
test_preds = [1,0,0,0,0,0,0,0]
t_p, t_r, t_f, t_acc = calculate_metrics(test_golds, test_preds)
assert t_p == 1
assert t_r == 0.2
assert math.isclose(t_f, 0.33, abs_tol=0.01)
assert t_acc == 0.5

In [None]:
# Use the above function to evaluate your prompts and make a decision on what to submit

# Submitting your results
As you already know, you can submit three prediction files for each of the closed and open tracks. As both tracks include the two tasks, you can submit a total of **12** files. In addition, you need to include a short text file describing the models or prompts you used. If you want to submit fewer, that's also fine. Put all your files in a .zip file and upload it to Moodle by **Feb 9**. Your submission should look something like this:

Each CSV file contains two columns:
CCA-LLM-Answer-ID, Prediction
10-1002_cca-4086_ChatGPT_prompt0_answer, students' prediction

team_no_closed_track_task1_prediction1.csv
team_no_closed_track_task1_prediction2.csv
team_no_closed_track_task1_prediction3.csv
team_no_closed_track_task2_prediction1.csv
team_no_closed_track_task2_prediction2.csv
team_no_closed_track_task2_prediction3.csv
team_no_open_track_task1_prediction1.csv
team_no_open_track_task1_prediction2.csv
team_no_open_track_task1_prediction3.csv
team_no_open_track_task2_prediction1.csv
team_no_open_track_task2_prediction2.csv
team_no_open_track_task2_prediction3.csv
ModelDescription.txt (in this file, you should briefly describe the models or prompts you use for all the prediction files you submit)
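
A prediction file in the `CCA-LLM-Answer-ID, Prediction` layout shown above could be written roughly as follows. This is a sketch: the helper name `write_predictions`, the exact header spelling, and the example ID and label are assumptions, so check them against the official submission instructions.

```python
import csv


def write_predictions(path, ids, predictions):
    """Write a header plus one (ID, prediction) row per test instance."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["CCA-LLM-Answer-ID", "Prediction"])
        writer.writerows(zip(ids, predictions))


# Placeholder ID and label, used only to illustrate the call.
write_predictions(
    "team_no_closed_track_task1_prediction1.csv",
    ["10-1002_cca-4086_ChatGPT_prompt0_answer"],
    ["harmful"],
)
```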