# Homework 3: Large language model (LLM) prompting

## Learning objectives
After completing this assignment, students will be able to:     
* Prompt LLMs programmatically with templates (parameterized)
* Demonstrate the difference between zero-shot, few-shot, and chain-of-thought prompting
* Engineer and test different prompts

## Overview
In this assignment, you will explore different prompting techniques for OpenAI LLMs. You will fill in a Jupyter notebook hosted on the Pitt CRCD to run your code.

## OpenAI account setup
Until the class OpenAI account is available, you will have to use your own account with free credits. You will need an OpenAI account and an API key: you can [sign up here](https://www.google.com/url?q=https%3A%2F%2Fplatform.openai.com%2Fsignup%3Flaunch) and learn [how to make an API key here](https://www.google.com/url?q=https%3A%2F%2Fhelp.openai.com%2Fen%2Farticles%2F4936850-where-do-i-find-my-secret-api-key). The OpenAI API is paid, but this homework will stay well under the free $5 credit given to each account. Be careful not to exhaust your free OpenAI credits while testing; you can check your usage [on this page](https://www.google.com/url?q=https%3A%2F%2Fplatform.openai.com%2Faccount%2Fusage). To conserve credits, avoid re-running cells after you've completed an exercise.

## Deliverables
1. Your code: the Jupyter notebook you modified from the template. Submit:
    * your .ipynb file
    * a **.html export of your notebook**. To get a .html version, click File > Save and Export Notebook As... > HTML from within JupyterLab. 
2. A PDF report with answers to questions provided in the template notebook. Please name your report `hw3_{your pitt email id}.pdf`. No need to include @pitt.edu, just use the email ID before that part. For example: `report_mmyoder_hw3.pdf`. Make sure to include the following additional information:
    * any additional resources, references, or web pages you've consulted
    * any person with whom you've discussed the assignment, and the nature of your discussions
    * any generative AI tool used, and how it was used
    * any unresolved issues or problems

Please submit all of this material on Canvas. We will grade your report and may look over your code.

## Recommended Readings
- [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf). Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ...others. arXiv 2020.
- [Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing](https://arxiv.org/pdf/2107.13586.pdf). Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. ACM Computing Surveys 2021.
- [Best practices for prompt engineering with OpenAI API](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api). Jessica Shieh. OpenAI 2023.
- [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155.pdf). Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ...others. arXiv 2022.
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf). Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou. NeurIPS 2022.

## Acknowledgments
This assignment is based on a homework assignment designed by Mark Yatskar and provided by Lorraine Li.

**To get started, start filling in this Jupyter notebook.**


## Setup 1: Dataset / Package
**Run the following cells and enter your OpenAI API Key!**
Note that you will have to restart your kernel after installing the `openai` package.
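
Because you will submit this notebook, a key typed directly into a cell is easy to leak. A minimal sketch of one alternative, assuming you have exported an environment variable named `OPENAI_API_KEY` before launching JupyterLab. The helper name `load_api_key` is hypothetical and is not part of the template:

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    Hypothetical helper, not part of the template notebook.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key
```

With this in place, the TODO cell could read `OPENAI_API_KEY = load_api_key()` instead of containing the key itself.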

In [None]:
%%capture
!pip install openai datasets

import openai
from openai import OpenAI
from time import sleep
from datasets import load_dataset
from functools import partial

IMDB_DATASET = load_dataset("imdb", split='train').shuffle(42)[0:200]
IMDB_DATASET_X = IMDB_DATASET['text']
IMDB_DATASET_Y = IMDB_DATASET['label']
del IMDB_DATASET


## TODO - Start
OPENAI_API_KEY = ""
## TODO - End

cache = {}
def run_gpt3(prompt, return_first_line = True, instruction_tuned = False):
    # Return the response from the cache if we have already run this
    cache_key = (prompt, return_first_line, instruction_tuned)
    if cache_key in cache:
        return cache[cache_key]
    # Create the API client with your key
    client = OpenAI(
      api_key=OPENAI_API_KEY,
    )

    # Select the model
    if instruction_tuned:
        model = "gpt-3.5-turbo-instruct"
    else:
        model = "davinci-002"

    # Send the prompt to GPT-3, waiting longer after each failed attempt
    response = None
    for i in range(0,60,6):
        try:
            completion = client.completions.create(
                model=model,
                prompt=prompt,
                temperature=0,
                max_tokens=100,
                top_p=1,
                frequency_penalty=0.0,
                presence_penalty=0.0,
            )
            response = completion.choices[0].text.strip()
            break
        except Exception as e:
            print(e)
            sleep(i)
    if response is None:
        # Every attempt failed, so there is nothing to parse or cache
        raise RuntimeError("All attempts to reach the OpenAI API failed; see the errors printed above.")

    # Parse the response
    if return_first_line:
        final_response = response.split('.')[0]+'.'
        if '\n' in final_response:
          final_response = response.split('\n')[0]
    else:
        final_response = response

    # Cache and return the response
    cache[cache_key] = final_response
    return final_response

In [None]:
def grade_gpt3_starts_with_answer(prompt, input, answer, **kwargs):
    model_output = run_gpt3(prompt.replace("{input}", input), **kwargs).strip().lower()
    answer = answer.lower()
    return model_output.startswith(answer)

def grade_gpt3_contains_answer(prompt, first_line_only, input, answer, **kwargs):
    model_output = run_gpt3(prompt.replace("{input}", input), return_first_line=first_line_only, **kwargs).strip().lower()
    answer = answer.lower()
    return answer in model_output
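
The two graders above differ only in how strictly they match the model output. The sketch below illustrates that difference on a fixed string, so it makes no API calls. The function names are hypothetical stand-ins, not the graders themselves:

```python
def starts_with_answer(model_output, answer):
    """Strict check: the output must begin with the answer."""
    return model_output.strip().lower().startswith(answer.lower())

def contains_answer(model_output, answer):
    """Lenient check: the answer may appear anywhere in the output."""
    return answer.lower() in model_output.strip().lower()

output = "The capital of Canada is Ottawa."
print(starts_with_answer(output, "Ottawa"))     # False
print(contains_answer(output, "Ottawa"))        # True
print(starts_with_answer("Ottawa.", "Ottawa"))  # True
```

Tests that use the strict check therefore reward prompts that make the model put the answer first.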

## Setup 2: Define Test Cases

In [None]:
def test_capital_of_country(prompt):
    correct = sum([
        grade_gpt3_contains_answer(prompt, True, "Canada", "Ottawa"),
        grade_gpt3_contains_answer(prompt, True, "India", "New Delhi"),
        grade_gpt3_contains_answer(prompt, True, "Turkey", "Ankara"),
        grade_gpt3_contains_answer(prompt, True, "China", "Beijing"),
        grade_gpt3_contains_answer(prompt, True, "Japan", "Tokyo")
    ])

    if correct >= 3:
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_director_of_movie(prompt):
    correct = sum([
        grade_gpt3_contains_answer(prompt, True, "Toy Story", "John Lasseter"),
        grade_gpt3_contains_answer(prompt, True, "Pulp Fiction", "Quentin Tarantino"),
        grade_gpt3_contains_answer(prompt, True, "Jurassic Park", "Steven Spielberg"),
        grade_gpt3_contains_answer(prompt, True, "Star Wars: Episode IV – A New Hope", "George Lucas"),
        grade_gpt3_contains_answer(prompt, True, "The Dark Knight", "Christopher Nolan")
    ])
    if correct >= 3:
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_synonyms_of_word(prompt, first_line_only):
    correct = sum([
        grade_gpt3_contains_answer(prompt, first_line_only, "car", "vehicle"),
        grade_gpt3_contains_answer(prompt, first_line_only, "cold", "frigid"),
        grade_gpt3_contains_answer(prompt, first_line_only,"mad", "angry"),
        grade_gpt3_contains_answer(prompt, first_line_only,"desk", "table"),
        grade_gpt3_contains_answer(prompt, first_line_only,"gift", "present")
    ])
    if correct >= 3:
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_ingredients_of_food(prompt, first_line_only):
    correct = sum([
        grade_gpt3_contains_answer(prompt, first_line_only, "tiramisu", "coffee"),
        grade_gpt3_contains_answer(prompt, first_line_only, "pesto", "leaves"),
        grade_gpt3_contains_answer(prompt, first_line_only, "samosa", "masala"),
        grade_gpt3_contains_answer(prompt, first_line_only, "hummus", "chickpeas"),
        grade_gpt3_contains_answer(prompt, first_line_only, "macaroon", "coconut")
    ])
    if correct >= 3:
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_quotee_of_quote(prompt):
    correct = sum([
        grade_gpt3_starts_with_answer(prompt, '"If you can\'t handle me at my worst, then you sure as hell don\'t deserve me at my best."', "Marilyn Monroe"),
        grade_gpt3_contains_answer(prompt, True, '"The only thing we have to fear is fear itself."', "Roosevelt"),
        grade_gpt3_starts_with_answer(prompt, '"Genius is one percent inspiration and ninety-nine percent perspiration."', "Thomas Edison"),
        grade_gpt3_starts_with_answer(prompt, '"Nothing is certain except for death and taxes."', "Benjamin Franklin"),
        grade_gpt3_starts_with_answer(prompt, '"Life is like riding a bicycle. To keep your balance, you must keep moving."', "Albert Einstein")
    ])
    if correct >= 3:
        if any([q_word in prompt.lower() for q_word in ["who", "what", "when", "where", "why", "how"]]):
            return (1, 3)
        if "?" in prompt:
            return (1, 3)
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_korean_to_english(prompt):
    correct = sum([
        grade_gpt3_contains_answer(prompt, True, "책상", "desk"),
        grade_gpt3_contains_answer(prompt, True, "책", "book"),
        grade_gpt3_contains_answer(prompt, True, "창문", "window"),
        grade_gpt3_contains_answer(prompt, True, "나무", "tree"),
        grade_gpt3_contains_answer(prompt, True, "트럭", "truck")
    ])
    if correct >= 5:
        return (5, 5)
    else:
        return (0, 5)


In [None]:
def test_to_jeopardy_answer(prompt):
    correct = sum([
        grade_gpt3_contains_answer(prompt, True, "Hawaii", "is Hawaii?"),
        grade_gpt3_contains_answer(prompt, True, "trees", "What are trees?"),
        grade_gpt3_contains_answer(prompt, True, "The Empire State Building", "What is the Empire State Building?"),
        grade_gpt3_contains_answer(prompt, True, "Neil Armstrong", "Who is Neil Armstrong?"),
        grade_gpt3_contains_answer(prompt, True, "John Legend", "Who is John Legend?")
    ])
    if correct >= 5:
        return (5, 5)
    else:
        return (0, 5)



In [None]:
def test_english_to_spanish(prompt):
    correct = sum([
        grade_gpt3_starts_with_answer(prompt, "desk", "escritorio", instruction_tuned=True),
        grade_gpt3_starts_with_answer(prompt, "book", "libro", instruction_tuned=True),
        grade_gpt3_starts_with_answer(prompt, "window", "ventana", instruction_tuned=True),
        grade_gpt3_starts_with_answer(prompt, "bed", "cama", instruction_tuned=True),
        grade_gpt3_starts_with_answer(prompt, "bread", "pan", instruction_tuned=True)
    ])
    if correct >= 5:
        return (5, 5)
    else:
        return (0, 5)

In [None]:
def print_message(score, max_score):
  if score == max_score:
    print('Correct! You earned {}/{} points. You are a star!'.format(score, max_score))
  else:
    print("You missed some points, try to check what's wrong")

# Section 1: Exploring Prompting
**Background:** Prompting is a way to guide a language model, which is ultimately just a model that predicts the most likely next sequence of words, to complete some arbitrary task you want it to complete. We'll walk through a few examples and then you'll try creating your own prompts.

A language model will "complete" your prompt (just like autocomplete) with the words that are most likely to come next. To demonstrate, we give GPT-3 the beginning of several movie quotes and let it complete them:

In [None]:
print(run_gpt3("Life is like a box of chocolates,"))
print(run_gpt3("With great power,"))
print(run_gpt3("The name's Bond."))
print(run_gpt3("Houston, we"))
print(run_gpt3("I've a feeling we're not in"))

you never know what you’re gonna get.
comes great responsibility.
James Bond.
have a problem.
Kansas anymore.


Now imagine we give a prompt like this:

In [None]:
print(run_gpt3("Question: Who was the first president of the United States? Answer:"))

George Washington was the first president of the United States.


By posing a question and writing "Answer:" at the end, we make it such that the most likely next sequence of words is the answer to the question! This is the key to large language models being able to perform arbitrary tasks, even though they are only trained to predict the next word.

We can parameterize this prompt and make it reusable for different questions:

In [None]:
QA_PROMPT = "Question: {input} Answer:"
print(run_gpt3(QA_PROMPT.replace("{input}", "What company did Steve Jobs found?")))
print(run_gpt3(QA_PROMPT.replace("{input}", "What's the movie with Tom Cruise about fighter jets?")))
print(run_gpt3(QA_PROMPT.replace("{input}", "Are tomatoes a fruit or a vegetable?")))

Apple Computer Inc.
Top Gun.
Tomatoes are a fruit.


Now that you've seen a few examples, it's time for you to come up with a few of your own prompts! Make sure you parameterize them with `{input}` before sending the prompt. All your prompts should be reusable when the grader does `.replace("{input}", ...)` on them.

Note: These models are not easy to control. Therefore, it's okay if your prompt does not always get the answer right or also spews extra text along with the answer (as long as the answer comes first).
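
Since the grader fills prompts with a plain `.replace("{input}", ...)`, it can help to fill your own prompts the same way while testing. The sketch below is a hypothetical helper (`fill_prompt` is not part of the template) that also catches a forgotten placeholder, and it shows why `str.format` is a poor substitute when a prompt contains other braces:

```python
def fill_prompt(template, value):
    """Substitute `value` for the literal {input} placeholder."""
    if "{input}" not in template:
        raise ValueError("Prompt template must contain the {input} placeholder.")
    return template.replace("{input}", value)

template = 'Answer in the form {"answer": ...}. Question: {input} Answer:'
print(fill_prompt(template, "Are tomatoes a fruit or a vegetable?"))

# str.format treats every pair of braces as a field, so it rejects this template.
try:
    template.format(input="Are tomatoes a fruit or a vegetable?")
except (KeyError, ValueError, IndexError) as err:
    print("str.format failed with", type(err).__name__)
```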

- **Problem 1.1:** Write a prompt that returns the capital of a country.

In [None]:
# TODO
CAPITAL_OF_COUNTRY_PROMPT = ""

# Grader - DO NOT CHANGE
your_score, max_score = test_capital_of_country(CAPITAL_OF_COUNTRY_PROMPT)
print_message(your_score, max_score)

 - **Problem 1.2:** Write a prompt that given a famous movie returns the director.

In [None]:
# TODO
DIRECTOR_OF_MOVIE_PROMPT = ""

# Grader - DO NOT CHANGE
your_score, max_score = test_director_of_movie(DIRECTOR_OF_MOVIE_PROMPT)
print_message(your_score, max_score)

 - **Problem 1.3:** Write a prompt that given a word, returns a list of synonyms. (Hint: use `return_first_line=False` as an argument when using `run_gpt3`)

In [None]:
# TODO
SYNONYMS_OF_WORD_PROMPT = ""

# Grader - DO NOT CHANGE
your_score, max_score = test_synonyms_of_word(SYNONYMS_OF_WORD_PROMPT, first_line_only=False)
print_message(your_score, max_score)

 - **Problem 1.4:** Write a prompt that given a food item ("cookies"), returns a list of ingredients used to make that food item. (Hint: use `return_first_line=False` as an argument when using `run_gpt3`)

In [None]:
# TODO
INGREDIENTS_OF_FOOD_PROMPT = ""

# Grader - DO NOT CHANGE
your_score, max_score = test_ingredients_of_food(INGREDIENTS_OF_FOOD_PROMPT, first_line_only=False)
print_message(your_score, max_score)

**Problem 1.5:** Write a prompt that given a famous quote ("One small step for man, one giant leap for mankind.", quote characters included), returns the name of the person who said the quote (quotee).

*Extra Challenge:* We want you to try to complete this one without question marks ("?") or question words ("Who", "What", etc.). You will only get full points if your prompt does not contain those.

In [None]:
# TODO
QUOTEE_OF_QUOTE_PROMPT = ""

# Grader
your_score, max_score = test_quotee_of_quote(QUOTEE_OF_QUOTE_PROMPT)
print_message(your_score, max_score)

# Section 2: Prompt Engineering



The prompts you have used up to this point have been fairly basic and straightforward to create. But what if you have a more difficult task and your prompt isn't working? *Prompt engineering* is the process of iterating on a prompt in clever ways to induce the model to produce what you want. The best way to do this systematically, rather than by random trial and error, is to understand how the underlying model was trained and what data it was trained on.

Imagine we want the model to generate a quote in Donald Trump's style of talking about a certain topic:

In [None]:
DONALD_TRUMP_PROMPT = "Question: What would Donald Trump say about {input}? Answer:"
DONALD_TRUMP_PROMPT_ENGINEERED_1 = 'On the topic of {input}, Donald Trump was quoted as saying "'
DONALD_TRUMP_PROMPT_ENGINEERED_2 = 'On the topic of {input}, Donald Trump expressed optimism saying "'
DONALD_TRUMP_PROMPT_ENGINEERED_3 = 'On the topic of {input}, Donald Trump expressed doubt saying "'

print(run_gpt3(DONALD_TRUMP_PROMPT.replace("{input}", 'the stock market'))) # Doesn't work
print(run_gpt3(DONALD_TRUMP_PROMPT_ENGINEERED_1.replace("{input}", 'the stock market'))) # Works!
print(run_gpt3(DONALD_TRUMP_PROMPT_ENGINEERED_2.replace("{input}", 'the stock market'))) # Works!
print(run_gpt3(DONALD_TRUMP_PROMPT_ENGINEERED_3.replace("{input}", 'the stock market'))) # Works!

He would say it’s a great time to buy stocks.
I think the stock market is going to go up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up, up.
I think we're going to have a very good year in the stock market".
I don't know if the stock market is going to be up or down" and that "it's a very big bubble.


The first naive prompt doesn't really work. After prompt engineering, not only do we get a much more realistic generation of his style, but we can also control whether he is talking about the topic positively or negatively.

**Please respond to the following questions in your report**

* **Problem 2.1:** Why did the `DONALD_TRUMP_PROMPT_ENGINEERED_1` prompt work much better than the `DONALD_TRUMP_PROMPT` prompt?

A prompt that is well-engineered can effectively solve difficult NLP tasks that previously were solved by fine-tuning models. In lecture, we showed some examples of these.

**Problem 2.2:** Write a prompt that will solve the [sentiment classification task](https://en.wikipedia.org/wiki/Sentiment_analysis), and classify [movie reviews](https://ai.stanford.edu/~amaas/data/sentiment/) as *positive* or *negative*. `IMDB_DATASET_X` and `IMDB_DATASET_Y` contain 200 reviews and sentiment labels (1 = positive, 0 = negative). Get as high an accuracy as you can on these. Place your `MOVIE_SENTIMENT_PROMPT`, `POSITIVE_VERBALIZERS`, and `NEGATIVE_VERBALIZERS` in `report.pdf` for manual grading, along with your `correct` score (out of 200).

*Warning:* Be careful not to exhaust your free OpenAI credits while testing; you can check your usage [on this page](https://platform.openai.com/account/usage). To conserve credits, test your code on a few examples from the IMDB dataset first, and then scale up to the full 200.
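
One way to follow this advice is to score only a small slice of the data while you iterate. The sketch below is a minimal illustration, assuming a `predict` function that maps a review to 1, 0, or `None`. The names `map_to_label` and `accuracy_on_subset` are hypothetical, and canned strings stand in for real model outputs so that nothing is sent to the API:

```python
def map_to_label(output, positive, negative, window=20):
    """Map raw model text to 1 (positive), 0 (negative), or None (no match)."""
    head = output[:window].lower()
    if any(v.lower() in head for v in positive):
        return 1
    if any(v.lower() in head for v in negative):
        return 0
    return None

def accuracy_on_subset(texts, labels, predict, limit=5):
    """Score only the first `limit` examples to keep API usage low."""
    texts, labels = texts[:limit], labels[:limit]
    correct = sum(predict(text) == label for text, label in zip(texts, labels))
    return correct, len(texts)

# Canned outputs stand in for real model calls (hypothetical data).
canned = {"Loved it": "good", "Terrible plot": "bad", "A fine film": "unsure"}
predict = lambda text: map_to_label(canned[text], ["good"], ["bad"])
print(accuracy_on_subset(list(canned), [1, 0, 1], predict, limit=2))  # (2, 2)

# Substring matching ignores negation, so choose verbalizers with care.
print(map_to_label("Not good at all", ["good"], ["bad"]))  # 1
```

Once the numbers look reasonable on a handful of reviews, raising `limit` to 200 gives the full-dataset score.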

In [None]:
# TODO
MOVIE_SENTIMENT_PROMPT = ""

POSITIVE_VERBALIZERS = [
    "good",
    # TODO - Add other positive verbalizers ...
]
NEGATIVE_VERBALIZERS = [
    "bad",
    # TODO - Add other negative verbalizers ...
]

def map_to_sentiment_label(gpt3_output):
    for v in POSITIVE_VERBALIZERS:
        if v.lower() in gpt3_output[:20].lower():
            return 1
    for v in NEGATIVE_VERBALIZERS:
        if v.lower() in gpt3_output[:20].lower():
            return 0
    return None

correct = 0
for review, label in zip(IMDB_DATASET_X, IMDB_DATASET_Y):
    gpt3_output = run_gpt3(MOVIE_SENTIMENT_PROMPT.replace("{input}", review))
    prediction = map_to_sentiment_label(gpt3_output)
    if prediction == label:
        correct += 1
    print(f"Prediction: {prediction}, Label: {label}")
print(f"Correct: {correct}/200")

# Section 3: Few-Shot Prompting

The prompts you have seen up until this point are zero-shot prompts, in that we are asking the model to complete a task without any examples. By providing some examples in the prompt, the model becomes significantly more capable. We'll show an example.

Consider the task of figuring out a more complex version of a word:

In [None]:
ZERO_SHOT_COMPLEX_PROMPT = "Question: What is a more complex word for {input}? Answer:"
FEW_SHOT_COMPLEX_PROMPT = "angry : aggrieved\nsad : depressed\n{input} :"

print(run_gpt3(ZERO_SHOT_COMPLEX_PROMPT.replace("{input}", 'happy'))) # Doesn't work
print(run_gpt3(FEW_SHOT_COMPLEX_PROMPT.replace("{input}", 'happy'))) # Works!

Happy is a simple word.
elated


The first prompt, a zero-shot prompt with no examples, doesn't work at all, whereas the few-shot prompt with 2 examples (a 2-shot prompt) works.
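
Few-shot prompts like the one above follow a regular pattern, so they can be assembled from a list of example pairs. The sketch below is one possible way to do that. The helper `build_few_shot_prompt` is hypothetical, and the `source : target` layout simply mirrors `FEW_SHOT_COMPLEX_PROMPT`:

```python
def build_few_shot_prompt(examples, separator=" : "):
    """Join (source, target) pairs into a prompt that ends with an open {input} slot."""
    lines = [f"{source}{separator}{target}" for source, target in examples]
    # Leave the final line unfinished so the model completes the target.
    lines.append("{input}" + separator.rstrip())
    return "\n".join(lines)

prompt = build_few_shot_prompt([("angry", "aggrieved"), ("sad", "depressed")])
print(prompt)
```

Building prompts this way makes it easy to vary the number of examples without retyping the template.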

Now that you've seen an example of few-shot prompting, it's your turn to try it.

**Problem 3.1:** Write a few-shot prompt that translates a Korean word to an English word.

In [None]:
# TODO
KOREAN_TO_ENGLISH_PROMPT = "" # Solution

# Grader
your_score, max_score = test_korean_to_english(KOREAN_TO_ENGLISH_PROMPT)
print_message(your_score, max_score)

**Problem 3.2:** Write a few-shot prompt that converts an input into a [Jeopardy! style answer](https://en.wikipedia.org/wiki/Jeopardy!#:~:text=Rather%20than%20being%20given%20questions,the%20form%20of%20a%20question.) (The Great Lakes -> "What are the Great Lakes?" or Taylor Swift -> "Who is Taylor Swift?")

In [None]:
# TODO
TO_JEOPARDY_ANSWER_PROMPT = "" # Solution

# Grader
your_score, max_score = test_to_jeopardy_answer(TO_JEOPARDY_ANSWER_PROMPT)
print_message(your_score, max_score)

**Please respond to the following question in your `report.pdf`**

**Problem 3.3:** Come up with 3 more arbitrary tasks, where a zero-shot prompt might not suffice, and a few-shot prompt would be required. Provide a short write-up describing what your tasks are. Provide examples of a zero-shot prompt not working for each task. Then, show us your few-shot prompt and some results. Be creative and try to pick 3 tasks that are somewhat distinct from each other!

# Section 4: Prompting Instruction-Tuned Models

Large language models can be *instruction-tuned*, that is, fine-tuned with examples of instructions and responses to those instructions, to make them easier to prompt and friendlier to humans. Instruction-tuned models can more easily be given natural language instructions describing a task you want them to complete. This makes them more performant without requiring as much prompt engineering, and more likely to succeed with just zero-shot prompting. The version of GPT-3 we were working with in previous exercises was not instruction-tuned. From here on out, we will use instruction-tuned models:

In [None]:
TO_JEOPARDY_INSTRUCTION_PROMPT = "What would a Jeopardy! contestant say if the answer was \"{input}\"?"

print(run_gpt3(TO_JEOPARDY_INSTRUCTION_PROMPT.replace("{input}", 'Taylor Swift'))) # Doesn't work on non-instruction tuned model
print(run_gpt3(TO_JEOPARDY_INSTRUCTION_PROMPT.replace("{input}", 'Taylor Swift'), instruction_tuned=True)) # Works and is simpler!

"Who is the most famous person in the world?".
"What is the name of the Grammy-winning singer-songwriter known for hits like 'Shake It Off' and 'Blank Space'?".


As you can see, these instruction-tuned models make it much simpler to complete complex tasks since you can "talk" to them naturally. We'll now ask you to try.

**Problem 4.1:** Write a prompt that returns the Spanish word given an English word (painting -> pintura).

*Extra Challenge:* We want you to complete this one such that the model only returns a single Spanish word and nothing else. You will only get points if your model only returns a single Spanish word and nothing else.

In [None]:
# TODO
ENGLISH_TO_SPANISH_PROMPT = ""

# Grader
your_score, max_score = test_english_to_spanish(ENGLISH_TO_SPANISH_PROMPT)
print_message(your_score, max_score)

**Please respond to the following question in your `report.pdf`**

**Problem 4.2:** Come up with 3 more arbitrary tasks, where the non-instruction-tuned model might not suffice, and an instruction-tuned model would be required. Provide a short write-up describing what your tasks are. Provide examples of a prompt not working on a non-instruction-tuned model. Then, show us your instruction prompt on an instruction-tuned model and some results. Be creative and try to pick 3 tasks that are somewhat distinct from each other!

# Section 5: Chain-of-Thought Reasoning

One recent method to prompt large language models is Chain-of-Thought prompting. This is similar to few-shot prompting, except that you not only provide a few examples, but also provide an explanation with a reasoning chain for each one. Providing this reasoning chain has been shown to improve performance on a wide variety of tasks.
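
For arithmetic, a reasoning chain is just the intermediate steps written out. The sketch below shows one possible way to format such an example for an expression with two operations. The helper `cot_example` is hypothetical, and it assumes the usual order of operations (multiplication before addition and subtraction):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def cot_example(a, op1, b, op2, c):
    """Format 'a op1 b op2 c' as a question followed by its reasoning chain."""
    question = f"{a} {op1} {b} {op2} {c}?"
    if op2 == "*" and op1 != "*":
        # Multiplication on the right-hand side is evaluated first.
        inner = OPS[op2](b, c)
        final = OPS[op1](a, inner)
        chain = f"{b} {op2} {c} = {inner}. {a} {op1} {inner} = {final}"
    else:
        # Otherwise evaluate left to right.
        inner = OPS[op1](a, b)
        final = OPS[op2](inner, c)
        chain = f"{a} {op1} {b} = {inner}. {inner} {op2} {c} = {final}"
    return f"{question}\n{chain}"

print(cot_example(2, "*", 4, "+", 2))
print(cot_example(3, "+", 4, "*", 5))
```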

We demonstrate on a task that consists of 2 arithmetic operations over 3 numbers:

In [None]:
FEW_SHOT_ARITHMETIC_PROMPT = "2 * 4 + 2?\n10\n6 + 7 - 2\n11\n{input}?"
COT_ARITHMETIC_PROMPT = "2 * 4 + 2?\n2 * 4 = 8. 8 + 2 = 10\n6 + 7 - 2?\n6 + 7 = 13. 13 - 2 = 11\n{input}?"

print(run_gpt3(FEW_SHOT_ARITHMETIC_PROMPT.replace("{input}", '220 * 19 - 5'), instruction_tuned=True)) # Doesn't work without CoT prompting
print(run_gpt3(COT_ARITHMETIC_PROMPT.replace("{input}", '220 * 19 - 5'), return_first_line=False, instruction_tuned=True)) # Works!

4185.
220 * 19 = 4180. 4180 - 5 = 4175


Next, we create a dataset with 50 examples:

In [None]:
import random
import re

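def compute(x, operator, y):
    '''Applies a binary arithmetic operator ('+', '-', or '*') to x and y.'''
    if operator == '+':
        return x + y
    elif operator == '-':
        return x - y
    elif operator == '*':
        return x * y
    raise ValueError(f"Unsupported operator: {operator}")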

def create_arithmetic_dataset(n_examples, seed = 42):
    random.seed(seed)
    X = []
    y = []
    for i in range(n_examples):
        num_1 = random.randint(0,100)
        operator_1 = random.choice(['+', '-', '*'])
        num_2 = random.randint(0,100)
        operator_2 = random.choice(['+', '-', '*'])
        num_3 = random.randint(0,100)
        if operator_2 == '*' and operator_1 != '*':
            # Order of operations:
            # Do the right-hand side first
            intermediate = compute(num_2, operator_2, num_3)
            final = compute(num_1, operator_1, intermediate)
        else:
            intermediate = compute(num_1, operator_1, num_2)
            final = compute(intermediate, operator_2, num_3)
        X.append(f'{num_1} {operator_1} {num_2} {operator_2} {num_3}')
        y.append(final)
    return X, y

def parse_answer(model_output):
    '''Parses the output of the model to get the final answer.'''
    # Gets the last integer in the string, keeping a leading minus sign
    # so that negative answers (which occur in the dataset) parse correctly
    numbers = re.findall(r'-?\d+', model_output)
    if not numbers:
        return None
    return int(numbers[-1])

arithmetic_X, arithmetic_y = create_arithmetic_dataset(50)

**Please respond to the following questions in your report**

**Problem 5.1:** Your job is to investigate how few-shot Chain-of-Thought prompting performs vs. regular few-shot prompting over the entire arithmetic dataset and grade how many out of 50 are correct. Perform this experiment 6 times each with a different number of regular few-shot examples (1 example, 2 examples, 4 examples, 8 examples, 16 examples, 32 examples) and 6 times again each with a different number of Chain-of-Thought few-shot examples (1 CoT example, 2 CoT examples, 4 CoT examples, 8 CoT examples, 16 CoT examples, 32 CoT examples).

Create a table or plot of (N examples) vs. (% questions correct by the model with a few-shot prompt with N examples) vs. (% questions correct by the model with a CoT prompt with N examples). Report this table or plot in `report.pdf` with a short write-up about your observations. Keep the code used to build your table or plot in your notebook for inspection during grading.

*Note:* Make sure you use `instruction_tuned = True`.

*Hint:* You might find the `parse_answer` function helpful when grading how many of the model's outputs are correct or not.
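
As a rough illustration of the grading step, the sketch below computes the percentage of correct outputs from a list of model outputs and expected answers. It runs on canned strings (two of them taken from the demonstration above), so no API calls are made. `last_integer` is a hypothetical stand-in for `parse_answer`, and `percent_correct` is likewise an assumed name:

```python
import re

def last_integer(text):
    """Return the last integer found in text, or None if there is none."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None

def percent_correct(outputs, answers):
    """Percentage of outputs whose final number equals the expected answer."""
    hits = sum(last_integer(out) == ans for out, ans in zip(outputs, answers))
    return 100.0 * hits / len(answers)

# Canned outputs stand in for real model responses.
outputs = ["220 * 19 = 4180. 4180 - 5 = 4175", "4185.", "no idea"]
answers = [4175, 4175, 7]
print(f"{percent_correct(outputs, answers):.1f}% correct")
```

Running the same function once per prompt style and per number of examples would give the values needed for the table or plot.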

*Warning:* Be careful not to exhaust your free OpenAI credits while testing; you can check your usage [on this page](https://platform.openai.com/account/usage). To conserve credits, test your code on a smaller arithmetic dataset first, and then scale up to the full one to report your results.

In [None]:
# TODO - Solve Problem 5.1 here

#### SOLUTION BELOW


#### SOLUTION ABOVE