# Lab 3: Prompting LLMs to solve NLP Problems

## June 27, 2023

Welcome to Lab 3 of our course on Natural Language Processing. Today, we will dive deep into the fourth and most recent paradigm in NLP teased in the previous lab: Pre-train, Prompt, and Predict. The core idea behind this paradigm is that once we have trained a big enough language model (pre-training + instruction tuning), we do not really need to train it further to solve a specific task. Instead, we can directly prompt the model to solve the task by specifying instructions, task descriptions, and in some cases a few examples.

Like last time, we will be working with the [SocialIQA](https://arxiv.org/abs/1904.09728) dataset and demonstrating how to use LLMs to solve such tasks.

Before getting started, we recommend signing up for a free trial of the [OpenAI API](https://openai.com/blog/openai-api), which should give you free credits worth $5 for three months. This should be plenty for today's tutorial and for your final projects. After signing up, get your API key and place it in the `key.txt` file located in the same directory as this notebook. Once that's set up, you can proceed with the tutorial.


Recommended Reading:
- Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. <i>Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing</i>. https://arxiv.org/abs/2107.13586

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    siqa_data_dir = "gdrive/MyDrive/PlakshaNLP2023/Lab3b/data/socialiqa-train-dev/"
except ImportError:  # not running on Colab, so fall back to a local path
    siqa_data_dir = "/datadrive/t-kabir/work/repos/PlakshaNLP/TLPNLP2023/source/Lab3b/data/socialiqa-train-dev/"

In [None]:
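# This lab uses the pre-1.0 openai interface (openai.ChatCompletion, openai.error), so pin the version.
!pip install --upgrade "openai<1.0"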
!pip install numpy
!pip install pandas
!pip install tqdm

In [None]:
# We start by importing libraries that we will be making use of in the assignment.
import os
import time
from functools import partial
import json
from pprint import pprint
import numpy as np
import pandas as pd
import openai
import random
from collections import Counter
import tqdm
import re

In [None]:
# Specify the key
with open("key.txt") as f:
    openai.api_key = f.readline().strip()  # first line only, without surrounding whitespace

In [None]:
# Loading the SocialIQA dataset

def load_siqa_data(split):

    # We first load the file containing context, question and answers
    with open(f"data/socialiqa-train-dev/{split}.jsonl") as f:
        data = [json.loads(jline) for jline in f.read().splitlines()]

    # We then load the file containing the correct answer for each question
    with open(f"data/socialiqa-train-dev/{split}-labels.lst") as f:
        labels = f.read().splitlines()

    return data, labels


train_data, train_labels = load_siqa_data("train")
dev_data, dev_labels = load_siqa_data("dev")

print(f"Number of Training Examples: {len(train_data)}")
print(f"Number of Validation Examples: {len(dev_data)}")

In [None]:
train_data[0]

In [None]:
train_labels[0]

## Task 1: Prompting Basics (30 minutes)

In this task, you will learn how to convert standard NLP problems into text prompts, which can then be fed to an LLM to obtain its prediction. There are two main concepts to understand when creating prompts:
- Prompt template or function: a textual string that has two slots: an input slot [X] for the input x and an answer slot [Z] for an intermediate generated answer text z that will later be mapped into y.
- Answer verbalizer: a mapping from the task labels to words or phrases, which converts the more artificial-looking labels into natural language that fits the prompt. E.g., for sentiment analysis we can define Z = {“excellent”, “good”, “OK”, “bad”, “horrible”} to represent each of the classes in Y = {++, +, ~, -, --}.

<img src="images/prompting_basics.png" alt="prompting" border="0">

We can also include richer content, such as task instructions in the template and explanations of the answer in the verbalizer, to build more powerful prompts, as we will see a bit later.
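
To make the two concepts concrete, here is a minimal sketch for the sentiment analysis example above. The template wording, the sample review, and the helper names `fill_prompt` and `verbalize` are illustrative assumptions, not part of the lab's required interface.

```python
# Hypothetical template with an input slot (x) and an answer slot (z).
SENTIMENT_TEMPLATE = "Review: {x}\nOverall, the movie was {z}"

# Verbalizer: maps each label in Y to a natural-language word in Z.
SENTIMENT_VERBALIZER = {
    "++": "excellent",
    "+": "good",
    "~": "OK",
    "-": "bad",
    "--": "horrible",
}


def fill_prompt(x: str) -> str:
    """Fill the input slot and leave the answer slot empty for the model."""
    return SENTIMENT_TEMPLATE.format(x=x, z="").rstrip()


def verbalize(label: str) -> str:
    """Convert a task label into the word that fits the answer slot."""
    return SENTIMENT_VERBALIZER[label]


prompt = fill_prompt("I loved every minute of it.")
print(prompt)
print(verbalize("++"))
```

The model is expected to continue the prompt with one of the words in Z, which is then mapped back to a label in Y.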

## Task 1.1: Defining the prompt function and verbalizer for SocialIQA

For the purpose of this exercise, we ask you to implement this prompt function (the sample test cases compare against this exact layout):
```
Context: {{context}}
Question: {{question}}
Which one of these answers best answers the question according to the context?
AnswerA: {{answerA}}
AnswerB: {{answerB}}
AnswerC: {{answerC}}
```

and verbalizer:

```
{"1": "The answer is A", "2": "The answer is B", "3": "The answer is C"}
```

This prompt was obtained from [PromptSource](https://huggingface.co/spaces/bigscience/promptsource), an awesome resource for finding prompts for hundreds of NLP tasks!

In [None]:
def social_iqa_prompting_fn(siqa_example: dict[str, str]):
    """
    Takes an example from the SocialIQA dataset, fills in the prompt template, and returns the prompt.
    
    Inputs:
        siqa_example: A dictionary containing the context, question and answerA, answerB, answerC for a SocialIQA example.

    Outputs:
        prompt: A string containing the filled-in prompt for the example.
    """
    prompt = None

    # YOUR CODE HERE
    raise NotImplementedError()

    return prompt
    

In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1")
siqa_example = train_data[0]
prompt = social_iqa_prompting_fn(siqa_example)
expected_prompt = """Context: Cameron decided to have a barbecue and gathered her friends together.
Question: How would Others feel as a result?
Which one of these answers best answers the question according to the context?
AnswerA: like attending
AnswerB: like staying home
AnswerC: a good friend to have"""
print(f"Input Example:\n{siqa_example}")
print(f"Prompt:\n{prompt}")
print(f"Expected Prompt:\n{expected_prompt}")
assert prompt == expected_prompt

# Sample Test Case 2
print("Running Sample Test Case 2")
siqa_example = train_data[100]
prompt = social_iqa_prompting_fn(siqa_example)
expected_prompt = """Context: Jordan's dog peed on the couch they were selling and Jordan removed the odor as soon as possible.
Question: How would Jordan feel afterwards?
Which one of these answers best answers the question according to the context?
AnswerA: selling a couch
AnswerB: Disgusted
AnswerC: Relieved"""
print(f"Input Example:\n{siqa_example}")
print(f"Prompt:\n{prompt}")
print(f"Expected Prompt:\n{expected_prompt}")
assert prompt == expected_prompt


In [None]:
def social_iqa_verbalizer(label: str):
    """
    Takes in the label and converts it into a natural language phrase as specified above

    Inputs:
        label: A string containing the correct answer for a SocialIQA example.
    
    Outputs:
        A string containing the natural language phrase corresponding to the label.
    """

    verbalized_label = None

    # YOUR CODE HERE
    raise NotImplementedError()

    return verbalized_label

In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1")
siqa_example = train_labels[0]
output = social_iqa_verbalizer(siqa_example)
expected_output = """The answer is A"""
print(f"Input Example:\n{siqa_example}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 2
print("Running Sample Test Case 2")
siqa_example = train_labels[100]
output = social_iqa_verbalizer(siqa_example)
expected_output = """The answer is B"""
print(f"Input Example:\n{siqa_example}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output


Let's now obtain the prompts and verbalized labels for each of the examples in the dataset.

In [None]:
train_prompts = None
train_verbalized_labels = None
val_prompts = None
val_verbalized_labels = None

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1")
idx = 10
siqa_example = train_data[idx]
prompt = train_prompts[idx]
expected_prompt = """Context: Sydney was a school teacher and made sure their students learned well.
Question: How would you describe Sydney?
Which one of these answers best answers the question according to the context?
AnswerA: As someone that asked for a job
AnswerB: As someone that takes teaching seriously
AnswerC: Like a leader"""
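print(f"Input Example:\n{siqa_example}")
print(f"Prompt:\n{prompt}")
print(f"Expected Prompt:\n{expected_prompt}")
assert prompt == expected_prompt

# Sample Test Case 2
print("Running Sample Test Case 2")
idx = 10
siqa_label = train_labels[idx]
# Compare the computed verbalized label against the expected one
verbalized_label = train_verbalized_labels[idx]
expected_verbalized_label = "The answer is B"
print(f"Input Example:\n{siqa_label}")
print(f"Verbalized Label:\n{verbalized_label}")
print(f"Expected Verbalized Label:\n{expected_verbalized_label}")
assert verbalized_label == expected_verbalized_label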

It is often useful to also have a reverse verbalizer that converts verbalized labels back into the structured, consistent labels used in the dataset. For example, "The answer is A" is mapped back to "1", and so on.
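
Because model outputs rarely match the verbalized label exactly, a regular expression is one convenient way to pull the answer letter out of free-form text. The sketch below is only one possible approach: the pattern and the helper name `extract_answer_letter` are assumptions, and mapping the letter back to the dataset label is left to your implementation.

```python
import re

# "answer is" is matched case-insensitively, while the option letter must be an
# uppercase A, B, or C so that ordinary words such as "a" are not mistaken for it.
ANSWER_PATTERN = re.compile(r"(?i:answer is)\s*([ABC])\b")


def extract_answer_letter(text: str) -> str:
    """Return the option letter mentioned in text, or "" if none is found."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1) if match else ""


print(extract_answer_letter("The answer is B."))
print(extract_answer_letter("some text here the answer is C, some more text"))
print(repr(extract_answer_letter("none of the options is the correct answer")))
```

Searching anywhere in the string, rather than requiring an exact match, is what lets the same pattern handle text before or after the answer phrase.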

In [None]:
def social_iqa_reverse_verbalizer(verbalized_label: str):
    """
    Reverses the verbalized label into the label
    Inputs:
        verbalized_label: A string containing the natural language phrase corresponding to the label.
    Outputs:
        label: A string containing the correct answer for a SocialIQA example.
    
    Important Note: We will be using this function to map the LLM's output to a structured label. The LLM's output may be in a format other than the one we expect.
    For example, it can be "The answer is A" or "The answer is A." or "<some text> The answer is A" or "The answer is A <some text>".
    When you reverse the verbalized label, make sure you handle these cases.
    
    Important Note 2: If the resulting text doesn't have the answer, then just return an empty string.
    """

    label = None
    # YOUR CODE HERE
    raise NotImplementedError()

    return label

In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1")
example_verbalized_label = "The answer is C"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = "3"
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 2
print("Running Sample Test Case 2")
example_verbalized_label = "The answer is B."
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = "2"
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 3
print("Running Sample Test Case 3")
example_verbalized_label = "some explanation before the actual answer, The answer is A"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = "1"
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 4
print("Running Sample Test Case 4")
example_verbalized_label = "some text here the answer is C, some more text"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = "3"
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 5
print("Running Sample Test Case 5")
example_verbalized_label = "none of the options is the correct answer"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = ""
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

## Task 1.2: Choose Few-Shot examples

Often we can get better performance on a task by providing a few examples of the task as part of the prompt. This is also known as in-context learning, where the model learns to solve a task based on the examples provided in the context (with no updates to the model's weights!). One of the easiest ways that works reasonably well in practice is to choose `k` examples at random for each class from the entire training dataset, so that we have `n_classes * k` few-shot examples, where `n_classes = 3` for the SocialIQA dataset. Implement the `choose_few_shot` function below that does that.
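
The per-class sampling idea can be sketched on toy data as follows. This is an illustration under assumptions, not the reference solution: the helper name `sample_k_per_class` and the toy prompts are made up, and the sketch uses a local `random.Random` instance rather than the global seed used in the lab's function.

```python
import random
from collections import defaultdict


def sample_k_per_class(prompts, labels, k=1, seed=42):
    """Pick k random examples for every distinct label (toy illustration)."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched

    # Group the prompts by their label.
    prompts_by_label = defaultdict(list)
    for prompt, label in zip(prompts, labels):
        prompts_by_label[label].append(prompt)

    # Draw k prompts without replacement from each group.
    examples = []
    for label in sorted(prompts_by_label):
        for prompt in rng.sample(prompts_by_label[label], k):
            examples.append({"prompt": prompt, "label": label})

    rng.shuffle(examples)  # avoid a fixed label order in the final prompt
    return examples


toy_prompts = [f"prompt {i}" for i in range(9)]
toy_labels = ["The answer is A", "The answer is B", "The answer is C"] * 3
toy_few_shot = sample_k_per_class(toy_prompts, toy_labels, k=2)
print(toy_few_shot)
```

Note that `rng.sample` raises a `ValueError` if a class has fewer than `k` examples, which is not a concern for a training set of this size.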

In [None]:
def choose_few_shot(train_prompts, train_verbalized_labels, k = 1, seed = 42):
    """
    Randomly chooses k examples per class from the training set for few-shot in-context learning.
    Inputs:
        train_prompts: A list of prompts for the training set.
        train_verbalized_labels: A list of verbalized labels for the training set.
        k: The number of examples per class to choose.
        seed: The random seed to use, to ensure reproducible outputs

    Outputs:
        - List[Dict[str, str]]: A list of 3k examples from the training set, where each example is represented as a dictionary with "prompt" and "label" as keys and corresponding values.

    Example Output: [
        {
            "prompt": <Example Prompt 1>,
            "label": <Example Label_1>
        },
        ...,
        {
            "prompt": <Example Prompt 3k>,
            "label": <Example Label_3k>
        }
    ]
    """

    random.seed(seed)
    np.random.seed(seed)

    fs_examples = []

    # YOUR CODE HERE
    raise NotImplementedError()

    # Shuffle the examples to ensure there is no bias in the order of the examples
    random.shuffle(fs_examples)

    return fs_examples

In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1. Checking if the output length is correct")
k = 1
seed = 42
output = choose_few_shot(train_prompts, train_verbalized_labels, k, seed)
output_len = len(output)
expected_output_len = k * len(set(train_labels))
print(f"k: {k}")
print(f"Output Length:\n{output_len}")
print(f"Expected Output Length:\n{expected_output_len}")
assert output_len == expected_output_len

# Sample Test Case 2
print("Running Sample Test Case 2. Checking if all labels are present")
output_labels = sorted(list(set([example["label"] for example in output])))
expected_output_labels = ["The answer is A", "The answer is B", "The answer is C"]
print(f"Output Labels:\n{output_labels}")
print(f"Expected Output Labels:\n{expected_output_labels}")
assert output_labels == expected_output_labels

# Sample Test Case 3
print("Running Sample Test Case 3. Checking if the label counts are correct")
k = 3
output = choose_few_shot(train_prompts, train_verbalized_labels, k, seed)
output_label_counter = Counter(list(([example["label"] for example in output])))
expected_output_counter = {"The answer is A": k, "The answer is B": k, "The answer is C": k}
print(f"For k = {k}")
print(f"Output Label Counter:\n{output_label_counter}")
print(f"Expected Output Label Counter:\n{expected_output_counter}")
assert output_label_counter == expected_output_counter
# # Sample Test Case 4
# print("Running Sample Test Case 3")
# expected_output = [{'prompt': "Context: Jordan explain another reason why they were late, but their boss wasn't buying it.\nQuestion: How would Jordan feel afterwards?\nWhich one of these answers best answers the question according to the context?\nAnswerA: pleased at always being late\nAnswerB: guilty at always being late\nAnswerC: sneaky",
#   'label': 'The answer is B'},
#  {'prompt': 'Context: Riley played basketball with friends and injured their bicep very badly.\nQuestion: What will happen to Others?\nWhich one of these answers best answers the question according to the context?\nAnswerA: try to help\nAnswerB: quit school\nAnswerC: go to their room',
#   'label': 'The answer is A'},
#  {'prompt': 'Context: Alex told Cameron they did not share feelings, but Cameron kissed Alex on the lips anyway.\nQuestion: How would Alex feel as a result?\nWhich one of these answers best answers the question according to the context?\nAnswerA: pleased\nAnswerB: happy\nAnswerC: upset',
#   'label': 'The answer is C'}]

# print(f"Output:\n{output}")
# print(f"Expected Output:\n{expected_output}")
# assert output == expected_output

In [None]:
# Choose 3 few-shot examples from training data
few_shot_examples = choose_few_shot(train_prompts, train_verbalized_labels, k = 1, seed = 42)

### Few-shot examples with explanations

So far, we have constructed the label verbalizer to provide the answer directly. Often it can be useful to prompt the model to first generate an explanation before the answer. For example:
```
"prompt": "Context: Tracy didn't go home that evening and resisted Riley's attacks.
            Question: What does Tracy need to do before this?
            Options: 
            (A) make a new plan 
            (B) Go home and see Riley 
            (C) Find somewhere to go"
"label": "Tracy found somewhere to go and didn't come home because she wanted to resist Riley's attacks. Hence, the answer is C"
```
One way to prompt the model to generate such explanations is to provide explanations for the few-shot examples, which nudges the model to first generate an explanation and then the answer. This helps both to improve the model's performance and to make the LLM's outputs more interpretable.

Below we provide a few examples with explanations for the SocialIQA task, obtained from [Super-NaturalInstructions](https://aclanthology.org/2022.emnlp-main.340/), an amazing resource for prompts, instructions, and explanations for around 1600 NLP tasks.


In [None]:
fs_examples_w_explanations = [
    {
        "prompt": "Context: Tracy didn't go home that evening and resisted Riley's attacks.\nQuestion: What does Tracy need to do before this?\nWhich one of these answers best answers the question according to the context?\nAnswerA: make a new plan\nAnswerB: Go home and see Riley\nAnswerC: Find somewhere to go",
        "label": "Tracy found somewhere to go and didn't come home because she wanted to resist Riley's attacks. Hence, the correct answer is C."
    },
    {
        "prompt": "Context: Sydney walked past a homeless woman asking for change but did not have any money they could give to her. Sydney felt bad afterwards.\nQuestion: How would you describe Sydney?\nWhich one of these answers best answers the question according to the context?\nAnswerA: sympathetic\nAnswerB: like a person who was unable to help\nAnswerC: incredulous",
        "label": "Sydney is a sympathetic person because she felt bad for someone who needed help, and she couldn't help her. Hence, the correct answer is A."
    },
    {
        "prompt": "Context: Taylor gave help to a friend who was having trouble keeping up with their bills.\nQuestion: What will their friend want to do next?\nWhich one of these answers best answers the question according to the context?\nAnswerA: help the friend find a higher paying job\nAnswerB: thank Taylor for the generosity\nAnswerC: pay some of their late employees",
        "label": "The friend should thank Taylor for the generosity she showed by helping him pay bills. Hence, the correct answer is B."
    }
]

In [None]:
fs_examples_w_explanations

## Task 2: Evaluating ChatGPT (GPT-3.5-Turbo) on SocialIQA (45 minutes)

Today we will be working with OpenAI's GPT family of models. ChatGPT (or GPT-3.5) was built on top of GPT-3, a pre-trained Large Language Model (LLM) with 175 billion parameters, trained on a huge amount of unlabelled data using the language modelling objective (i.e., given k tokens, generate the (k+1)-th token). While this forms the basis of the whole GPT family, GPT-3.5 and later models are based on [InstructGPT](https://arxiv.org/abs/2203.02155), which adds an instruction tuning step that learns from human feedback to follow the provided instructions.

![Instruction Tuning](images/instructgpt.png)
*From [Ouyang et al. 2022](https://arxiv.org/abs/2203.02155)*

As a consequence of the pre-training with language modeling objective and instruction tuning, we can use GPT-3.5 to complete a given piece of text and provide specific instructions about how to go about completing the text. We achieve this by defining a text prompt which is to be given as the input to the LLM which then generates a completion of the provided text.

Below we demonstrate how we can call the OpenAI API to get responses for our prompts. Starting from the GPT-3.5 models, the API comes with [ChatCompletions](https://platform.openai.com/docs/guides/gpt/chat-completions-api) support, which takes a list of messages (a conversation between user and assistant) as input and returns a model-generated message as output. An example API call looks like:

In [None]:
import openai

response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
  max_tokens=20,
  temperature=0.0,
)

Let's try to wrap our head around different parameters to this function call.

First we have `model`, where we specify which OpenAI model to use. We have used `"gpt-3.5-turbo"` here, which is similar to the ChatGPT you may have used online. You can find the list of other models [here](https://platform.openai.com/docs/models).

Next, we have `messages`, which contains the conversation between the user and assistant that is to be completed. Notice that the first message is what we call a "system prompt", which is used to set the behavior of the assistant. 

`max_tokens` is used to specify the maximum number of response tokens that the model should generate. This can be useful when you know how long the response is typically going to be, and can help reduce cost.

`temperature` helps control the variability of the output. Lower values result in more consistent outputs, while higher values generate more diverse and creative results. Setting the temperature to 0 makes the outputs mostly deterministic, but a small amount of variability will remain.
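
The effect of `temperature` can be illustrated with a temperature-scaled softmax over next-token scores. This is a minimal sketch of the textbook formulation, not a description of how the OpenAI API is implemented internally; the logits and the helper name `softmax_with_temperature` are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by the temperature first.

    Assumes temperature > 0; a temperature of exactly 0 is conventionally
    treated as always picking the highest-scoring token.
    """
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability mass on the top-scoring token, while higher temperatures flatten the distribution, so sampling becomes more varied.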

In [None]:
response

Now let's look at the response. The assistant's reply can be extracted with `response['choices'][0]['message']['content']`. Every response includes a `finish_reason`. The possible values for `finish_reason` are:

- stop: the API returned a complete message, or a message terminated by one of the stop sequences provided via the `stop` parameter
- length: incomplete model output due to the `max_tokens` parameter or the token limit
- function_call: the model decided to call a function
- content_filter: content omitted due to a flag from OpenAI's content filters
- null: the API response is still in progress or incomplete

Depending on the input parameters (for example, whether functions are provided), the model response may include different information.
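
As a sketch of how the reply and `finish_reason` might be read together, the snippet below works on a hand-written dictionary that mimics only the fields discussed above. The mock response and the helper name `read_reply` are assumptions for illustration; a real response contains additional fields.

```python
# Hand-written stand-in for an API response (not real model output).
mock_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The answer is B"},
            "finish_reason": "stop",
        }
    ]
}


def read_reply(response):
    """Return the assistant's text and whether it was cut off by the token limit."""
    choice = response["choices"][0]
    text = choice["message"]["content"]
    was_truncated = choice["finish_reason"] == "length"
    return text, was_truncated


text, was_truncated = read_reply(mock_response)
print(text)
print(was_truncated)
```

Checking for `length` is useful when a small `max_tokens` is set, since a truncated reply may stop before the answer phrase appears.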

In [None]:
model_output = response['choices'][0]['message']['content']
print(model_output)

## Task 2.1: Using ChatGPT to solve SocialIQA problems

Now that we understand how to work with the OpenAI API, we can call it with the prompts that we just created and check how well the model performs the task. We prompt the model with the test example for which we want a prediction and provide the few-shot examples as part of the context. This can be done by simply providing the example prompts and labels as a user-assistant conversation history, with the test example as the user's most recent query. Implement the function `get_social_iqa_pred_gpt`, which receives a test prompt to be answered, few-shot examples, and some API-specific hyperparameters, and predicts the answer.
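
The conversation-history idea can be sketched without calling the API at all. The helper name `build_messages`, its optional `instruction` argument, and the toy strings are assumptions for illustration; the sketch only shows how few-shot examples become alternating user and assistant turns.

```python
def build_messages(test_prompt, few_shot_examples, instruction=None):
    """Arrange few-shot examples and the test prompt as a chat history."""
    messages = []
    if instruction is not None:
        # Optional task description placed before the examples.
        messages.append({"role": "user", "content": instruction})
    for example in few_shot_examples:
        # Each example becomes a user turn (prompt) and an assistant turn (label).
        messages.append({"role": "user", "content": example["prompt"]})
        messages.append({"role": "assistant", "content": example["label"]})
    # The test example is the final user turn, which the model must answer.
    messages.append({"role": "user", "content": test_prompt})
    return messages


toy_examples = [
    {"prompt": "toy prompt 1", "label": "The answer is A"},
    {"prompt": "toy prompt 2", "label": "The answer is C"},
]
toy_messages = build_messages("toy test prompt", toy_examples)
for message in toy_messages:
    print(message)
```

Because the last message comes from the user, the model's generated reply plays the role of the next assistant turn, following the pattern set by the examples.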

In [None]:
def get_social_iqa_pred_gpt(
        test_prompt,
        few_shot_examples,
        model_name = "gpt-3.5-turbo",
        max_tokens = 20,
        temperature = 0.0,
):
    
    """
    Calls the OpenAI API with test_prompt and few-shot examples to generate the answer.
    Inputs:
        test_prompt: The prompt for the test example
        few_shot_examples: A list of few-shot examples
        model_name: The name of the model to use
        max_tokens: The maximum number of tokens to generate
        temperature: The temperature to use for the model

    Outputs:
        model_output: The model's output

    Hint: Your messages to be sent should be in the following format:
        [
            {"role": "user", "content": <fs-example-1-prompt>},
            {"role": "assistant", "content": <fs-example-1-label>},
            ...,
            {"role": "user", "content": <fs-example-3k-prompt>},
            {"role": "assistant", "content": <fs-example-3k-label>},
            {"role": "user", "content": <test-prompt>},
        ]
    """

    messages_prompt = [{
        "role": "user", "content": "You are an expert of Human Social Common Sense. You need to solve the SocialIQA task. In this task, you're given a context, a question, and three options. Your task is to find the correct answer to the question using the given context and options. Also, you may need to use commonsense reasoning about social situations to answer the questions. Classify your answers into 'A', 'B', and 'C'. You must choose the most likely option."
    }]
    model_output = None
    
    while True:
        try:
            # YOUR CODE HERE
            raise NotImplementedError()
            time.sleep(20) # to prevent rate limit error
            break
        except (openai.error.APIConnectionError, openai.error.RateLimitError, openai.error.Timeout, openai.error.ServiceUnavailableError) as e:
            #Sleep and try again
            print(f"Couldn't get response due to {e}. Trying again!")
            time.sleep(20)
            continue

    return model_output

In [None]:
test_example = val_prompts[0]
test_example_label = val_verbalized_labels[0]
model_output = get_social_iqa_pred_gpt(test_example, few_shot_examples, model_name = "gpt-3.5-turbo", max_tokens = 20, temperature = 0.0)
print(test_example)
print(f"Model's response: ", model_output)
print(f"Correct answer: ", test_example_label)

As you can see, the model didn't quite get the answer right. Let's try providing examples with explanations, i.e. `fs_examples_w_explanations`, and see the output. Note that we need a higher value of `max_tokens`, since the model is now also expected to generate an explanation.

In [None]:
test_example = val_prompts[0]
test_example_label = val_verbalized_labels[0]
model_output = get_social_iqa_pred_gpt(test_example, fs_examples_w_explanations,
                                        model_name = "gpt-3.5-turbo", max_tokens = 50, temperature = 0.0)
print(test_example)
print(f"Model's response: ", model_output)
print(f"Correct answer: ", test_example_label)

As you can see, the output is correct and the explanation also makes sense.

Let's do a full-fledged evaluation now. Due to cost limits, we will evaluate only the first 32 examples of the validation set rather than the whole set, but that should give us some idea of how good ChatGPT is at solving social common-sense reasoning problems.
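
The accuracy computation itself is a simple fraction of matching labels. The sketch below uses made-up toy labels and a hypothetical helper name `accuracy`; it assumes the model outputs have already been mapped back to dataset labels, with an empty string standing for an output that contained no answer.

```python
def accuracy(predicted_labels, gold_labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predicted_labels) != len(gold_labels):
        raise ValueError("predictions and labels must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)


# Toy data: "" marks an output with no recognizable answer, counted as wrong.
toy_predictions = ["1", "3", "", "2"]
toy_gold = ["1", "2", "3", "2"]
toy_accuracy = accuracy(toy_predictions, toy_gold)
print(f"Accuracy: {toy_accuracy}")  # 2 of 4 match, so 0.5
```

Counting unparseable outputs as errors keeps the denominator fixed at the number of test examples, so scores stay comparable across prompts.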

In [None]:
def get_model_predictions(
        test_prompts,
        few_shot_examples,
        model_name = "gpt-3.5-turbo",
        max_tokens = 20,
        temperature = 0.0,
):
    """
    Get predictions for all test prompts using the `get_social_iqa_pred_gpt` function

    Inputs:
        test_prompts: A list of test prompts
        few_shot_examples: A list of few-shot examples
        model_name: The name of the model to use
        max_tokens: The maximum number of tokens to generate
        temperature: The temperature to use for the model
    
    Outputs:
        model_preds: A list of model predictions for each test prompt
    """

    model_preds = []
    # YOUR CODE HERE
    raise NotImplementedError()

    return model_preds

def evaluate_model_preds(
        model_preds,
        test_labels
):
    """
    Evaluates the prediction of the model by performing string match between the predictions and labels.

    Inputs:
        model_preds: A list of model predictions for each test prompt
        test_labels: A list of test labels. Note that these are not verbalized

    Outputs:
        accuracy: The accuracy of the model i.e. #correct_predictions / #total_predictions
    """

    accuracy = None
    # YOUR CODE HERE
    raise NotImplementedError()

    return accuracy

In [None]:
# To test if things are working fine
k = 5
test_prompts = val_prompts[:k]
test_labels = dev_labels[:k]
model_preds = get_model_predictions(test_prompts, few_shot_examples, model_name = "gpt-3.5-turbo", max_tokens = 20, temperature = 0.0)


In [None]:
accuracy = evaluate_model_preds(model_preds, test_labels)
print(f"Accuracy: {accuracy}")

In [None]:
# Evaluate on 32 validation examples
k = 32
test_prompts = val_prompts[:k]
test_labels = dev_labels[:k]
model_preds = get_model_predictions(test_prompts, few_shot_examples, model_name = "gpt-3.5-turbo", max_tokens = 20, temperature = 0.0)

In [None]:
accuracy = evaluate_model_preds(model_preds, test_labels)
print(f"Accuracy: {accuracy}")

In [None]:
model_preds

In [None]:
# Evaluate on 32 validation examples with explanations
k = 32
test_prompts = val_prompts[:k]
test_labels = dev_labels[:k]
model_preds = get_model_predictions(test_prompts, 
                                    fs_examples_w_explanations, 
                                    model_name = "gpt-3.5-turbo", max_tokens = 50, temperature = 0.0)

In [None]:
model_preds

In [None]:
accuracy = evaluate_model_preds(model_preds, test_labels)
print(f"Accuracy: {accuracy}")

As you can see, we get better performance when prompting the model with explanations than without (62.5% vs. 56%). We can do more prompt engineering and use better types of explanations to improve the performance further, but we hope this has given you some idea of how to use these models to solve NLP tasks like this one. Also, notice that common-sense reasoning remains an open problem for the models we have today: even with ChatGPT, which is a fairly strong LLM, the accuracy remains comparatively low.