# Homework 3 part 1: Large language model (LLM) prompting

## Learning objectives
After completing this assignment, students will be able to:     
* Prompt LLMs programmatically with templates (parameterized)
* Prompt LLMs with zero-shot and few-shot prompting
* Engineer and test different prompts

## Overview
In this part of the assignment, you will explore different prompting techniques for open-source LLMs. You will fill in a Jupyter notebook hosted on the Pitt CRCD to run your code. The default model is `gemma3` (gemma3:27b), but feel free to explore the other models, too: pass `model='llama3.1'` (llama3.1:70b) or `model='deepseek-r1'` (deepseek-r1:70b) to the `run_llm` function.

## Deliverables
1. Your code: the Jupyter notebook you modified from the template for part 1. Submit:
    * your .ipynb file
    * a **.html export of your notebook**. To get a .html version, click File > Save and Export Notebook As... > HTML from within JupyterLab. 
2. A PDF report with answers to the questions provided in the template notebook. Please name your report `hw3_{your pitt email id}.pdf`, using only the part of your email address before @pitt.edu. For example: `hw3_mmyoder.pdf`. **Please make only one PDF report, containing answers to part 1 and part 2.** Make sure to include the following information:
    * answers to all the numbered questions below (prompts and their outputs, along with which model gave the output)
    * any additional resources, references, or web pages you've consulted
    * any person with whom you've discussed the assignment and describe the nature of your discussions
    * any generative AI tool used, and how it was used
    * any unresolved issues or problems

Please submit all of this material on Canvas. We will grade your report and may look over your code.

## Recommended Readings
- [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf). Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ...others. arXiv 2020.
- [Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing](https://arxiv.org/pdf/2107.13586.pdf). Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. ACM Computing Surveys 2021.
- [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155.pdf). Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ...others. arXiv 2022.
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf). Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, Denny Zhou. NeurIPS 2022.

## Acknowledgments
This assignment is based on a homework assignment designed by Mark Yatskar and provided by Lorraine Li.

**To get started, start filling in this Jupyter notebook.**

## Setup 1: Dataset / Package
**Run the following cells and enter the class API Key for Pitt SCI LLMs!**

In [None]:
import openai
from openai import OpenAI
from time import sleep
from datasets import load_dataset
from functools import partial

In [None]:
IMDB_DATASET = load_dataset("imdb", split='train').shuffle(42)[0:200]
IMDB_DATASET_X = IMDB_DATASET['text']
IMDB_DATASET_Y = IMDB_DATASET['label']
del IMDB_DATASET

## TODO - Start
LLM_API_KEY = "" # the class API key for Pitt SCI LLMs here
## TODO - End

cache = {}

def run_llm(prompt, model='gemma3', return_first_line=False):
    cache_key = (prompt, model, return_first_line)
    if cache_key in cache:
        return cache[cache_key]
    
    client = OpenAI( 
        api_key=LLM_API_KEY,
        base_url="https://ol.sci.pitt.edu")
        
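    # Send prompt to API, waiting longer after each failed attempt
    response = None
    for i in range(0,60,6):
        try:
            completion = client.completions.create(
                model=model,
                prompt=prompt,
                temperature=0,
                max_tokens=300,
                top_p=1,
            )
            response = completion.choices[0].text.strip()
            break
        except Exception as e:
            print(e)
            sleep(i)

    # Fail with a clear message instead of a NameError if every attempt failed
    if response is None:
        raise RuntimeError("The LLM API did not return a response after 10 attempts.")
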
    # Parse the response
    if return_first_line:
        final_response = response.split('.')[0]+'.'
        if '\n' in final_response:
            final_response = response.split('\n')[0]
    else:
        final_response = response

    # Cache and return the response
    cache[cache_key] = final_response
    return final_response
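
The `return_first_line=True` branch above is easy to misread, so here is a minimal standalone sketch of the same parsing rule applied to hard-coded strings. The helper name `first_line` is hypothetical and exists only for this illustration; it does not call the API.

```python
def first_line(response: str) -> str:
    """Mirror the return_first_line=True branch of run_llm (illustrative helper)."""
    # Keep everything before the first period, then put the period back.
    candidate = response.split('.')[0] + '.'
    # If that span still crosses a line break, keep only the text before the first newline.
    if '\n' in candidate:
        candidate = response.split('\n')[0]
    return candidate

print(first_line("Ottawa. It is in Ontario."))
print(first_line("Ottawa\nToronto is larger."))
print(first_line("George W. Bush was the 43rd president."))
```

Note that the rule cuts at the first period, so a name containing an abbreviation (the third call) is truncated to `George W.`. Keep this in mind when a grader looks for a full name in the first line.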

In [None]:
def grade_llm_starts_with_answer(prompt, input, answer, **kwargs):
    model_output = run_llm(prompt.replace("{input}", input), **kwargs).strip().lower()
    answer = answer.lower()
    return model_output.startswith(answer)

def grade_llm_contains_answer(prompt, first_line_only, input, answer, **kwargs):
    model_output = run_llm(prompt.replace("{input}", input), return_first_line=first_line_only, **kwargs).strip().lower()
    answer = answer.lower()
    return answer in model_output
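
The two graders above differ only in how strictly they match the model output. The following sketch applies the same comparisons to fixed strings instead of live model output, so the function names `starts_with_answer` and `contains_answer` are stand-ins introduced for illustration.

```python
def starts_with_answer(model_output: str, answer: str) -> bool:
    # Strict check: the (case-insensitive) output must begin with the answer.
    return model_output.strip().lower().startswith(answer.lower())

def contains_answer(model_output: str, answer: str) -> bool:
    # Lenient check: the answer may appear anywhere in the output.
    return answer.lower() in model_output.strip().lower()

output = "The capital of Canada is Ottawa."
print(starts_with_answer(output, "Ottawa"))
print(contains_answer(output, "Ottawa"))
print(starts_with_answer("ottawa, of course", "Ottawa"))
```

A full-sentence answer passes the lenient check but fails the strict one, so prompts graded with `grade_llm_starts_with_answer` need to make the model lead with the answer itself.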

## Setup 2: Define Test Cases

In [None]:
def test_capital_of_country(prompt):
    correct = sum([
        grade_llm_contains_answer(prompt, True, "Canada", "Ottawa"),
        grade_llm_contains_answer(prompt, True, "India", "New Delhi"),
        grade_llm_contains_answer(prompt, True, "Turkey", "Ankara"),
        grade_llm_contains_answer(prompt, True, "China", "Beijing"),
        grade_llm_contains_answer(prompt, True, "Japan", "Tokyo")
    ])

    if correct >= 3:
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_director_of_movie(prompt):
    correct = sum([
        grade_llm_contains_answer(prompt, True, "Toy Story", "John Lasseter"),
        grade_llm_contains_answer(prompt, True, "Pulp Fiction", "Quentin Tarantino"),
        grade_llm_contains_answer(prompt, True, "Jurassic Park", "Steven Spielberg"),
        grade_llm_contains_answer(prompt, True, "Star Wars: Episode IV – A New Hope", "George Lucas"),
        grade_llm_contains_answer(prompt, True, "The Dark Knight", "Christopher Nolan")
    ])
    if correct >= 3:
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_synonyms_of_word(prompt, first_line_only):
    correct = sum([
        grade_llm_contains_answer(prompt, first_line_only, "car", "vehicle"),
        grade_llm_contains_answer(prompt, first_line_only, "cold", "frigid"),
        grade_llm_contains_answer(prompt, first_line_only,"mad", "angry"),
        grade_llm_contains_answer(prompt, first_line_only,"desk", "table"),
        grade_llm_contains_answer(prompt, first_line_only,"gift", "present")
    ])
    if correct >= 3:
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_ingredients_of_food(prompt, first_line_only):
    correct = sum([
        grade_llm_contains_answer(prompt, first_line_only, "tiramisu", "coffee"),
        grade_llm_contains_answer(prompt, first_line_only, "pesto", "leaves"),
        grade_llm_contains_answer(prompt, first_line_only, "samosa", "masala"),
        grade_llm_contains_answer(prompt, first_line_only, "hummus", "chickpeas"),
        grade_llm_contains_answer(prompt, first_line_only, "macaroon", "coconut")
    ])
    if correct >= 3:
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_quotee_of_quote(prompt):
    correct = sum([
        grade_llm_starts_with_answer(prompt, '"If you can\'t handle me at my worst, then you sure as hell don\'t deserve me at my best."', "Marilyn Monroe"),
        grade_llm_contains_answer(prompt, True, '"The only thing we have to fear is fear itself."', "Roosevelt"),
        grade_llm_starts_with_answer(prompt, '"Genius is one percent inspiration and ninety-nine percent perspiration."', "Thomas Edison"),
        grade_llm_starts_with_answer(prompt, '"Nothing is certain except for death and taxes."', "Benjamin Franklin"),
        grade_llm_starts_with_answer(prompt, '"Life is like riding a bicycle. To keep your balance, you must keep moving."', "Albert Einstein")
    ])
    if correct >= 3:
        if any([q_word in prompt.lower() for q_word in ["who", "what", "when", "where", "why", "how"]]):
            return (1, 3)
        if "?" in prompt:
            return (1, 3)
        return (3, 3)
    else:
        return (0, 3)

In [None]:
def test_korean_to_english(prompt):
    correct = sum([
        grade_llm_contains_answer(prompt, True, "책상", "desk"),
        grade_llm_contains_answer(prompt, True, "책", "book"),
        grade_llm_contains_answer(prompt, True, "창문", "window"),
        grade_llm_contains_answer(prompt, True, "나무", "tree"),
        grade_llm_contains_answer(prompt, True, "트럭", "truck")
    ])
    if correct >= 5:
        return (5, 5)
    else:
        return (0, 5)


In [None]:
def test_to_jeopardy_answer(prompt):
    correct = sum([
        grade_llm_contains_answer(prompt, True, "Hawaii", "is Hawaii?"),
        grade_llm_contains_answer(prompt, True, "trees", "What are trees?"),
        grade_llm_contains_answer(prompt, True, "The Empire State Building", "What is the Empire State Building?"),
        grade_llm_contains_answer(prompt, True, "Neil Armstrong", "Who is Neil Armstrong?"),
        grade_llm_contains_answer(prompt, True, "John Legend", "Who is John Legend?")
    ])
    if correct >= 5:
        return (5, 5)
    else:
        return (0, 5)



In [None]:
def test_english_to_spanish(prompt):
    correct = sum([
        grade_llm_starts_with_answer(prompt, "desk", "escritorio"),
        grade_llm_starts_with_answer(prompt, "book", "libro"),
        grade_llm_starts_with_answer(prompt, "window", "ventana"),
        grade_llm_starts_with_answer(prompt, "bed", "cama"),
        grade_llm_starts_with_answer(prompt, "bread", "pan")
    ])
    if correct >= 5:
        return (5, 5)
    else:
        return (0, 5)

In [None]:
def print_message(score, max_score):
    if score == max_score:
        print('Correct! You earned {}/{} points. You are a star!'.format(score, max_score))
    else:
        print("You missed some points, try to check what's wrong")

# Section 1: Exploring Prompting
**Background:** Prompting is a way to guide a language model, which is ultimately just a model that predicts the most likely next sequence of words, to complete some arbitrary task you want it to complete. We'll walk through a few examples and then you'll try creating your own prompts.

A language model will "complete" (just like autocomplete) your prompt with the words that are most likely to come next. We demonstrate this by showing how an LLM completes movie quotes when given the beginning of each quote:

In [None]:
print(run_llm("Life is like a box of chocolates,", return_first_line=True))
print(run_llm("With great power,", return_first_line=True))
print(run_llm("The name's Bond.", return_first_line=True))
print(run_llm("Houston, we", return_first_line=True))
print(run_llm("I've a feeling we're not in", return_first_line=True))

Now imagine we give a prompt like this:

In [None]:
print(run_llm("Question: Who was the first president of the United States? Answer:", return_first_line=True))

By posing a question and writing "Answer:" at the end, we make it such that the most likely next sequence of words is the answer to the question! This is the key to large language models being able to perform arbitrary tasks, even though they are only trained to predict the next word.

We can parameterize this prompt and make it reusable for different questions:

In [None]:
QA_PROMPT = "Question: {input} Answer:"
print(run_llm(QA_PROMPT.replace("{input}", "What company did Steve Jobs found?"), return_first_line=True))
print(run_llm(QA_PROMPT.replace("{input}", "What's the movie with Tom Cruise about fighter jets?"), return_first_line=True))
print(run_llm(QA_PROMPT.replace("{input}", "Are tomatoes a fruit or a vegetable?"), return_first_line=True))

Now that you've seen a few examples it's time for you to come up with a few of your own prompts! Make sure you parameterize them with `{input}` before sending the prompt.

Note: These models are not easy to control. Therefore, it's okay if your prompt does not always get the answer right or also spews extra text along with the answer.
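
Before sending a parameterized prompt to the model, it can help to print the filled-in text and confirm that `{input}` lands where you expect. The sketch below does only the string substitution, with no API call. The template `SYMBOL_PROMPT` and the helper `fill` are hypothetical names used for illustration and are not one of the graded prompts.

```python
# Hypothetical template for illustration only (not a graded prompt).
SYMBOL_PROMPT = "The chemical symbol for {input} is"

def fill(template: str, value: str) -> str:
    # Same substitution the graders use: a plain str.replace on "{input}".
    return template.replace("{input}", value)

for element in ["gold", "sodium"]:
    print(fill(SYMBOL_PROMPT, element))

# A template without the placeholder is returned unchanged for every input.
print(fill("Name a chemical symbol.", "gold"))
```

If the placeholder is missing or misspelled, every test case sends the identical prompt, which is a common reason for a grader to report zero points.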

- **Problem 1.1:** Write a prompt that returns the capital of a given country.

In [None]:
# TODO
CAPITAL_OF_COUNTRY_PROMPT = ""

# Grader - DO NOT CHANGE
your_score, max_score = test_capital_of_country(CAPITAL_OF_COUNTRY_PROMPT)
print_message(your_score, max_score)

- **Problem 1.2:** Write a prompt that, given a famous movie, returns its director.

In [None]:
# TODO
DIRECTOR_OF_MOVIE_PROMPT = ""

# Grader - DO NOT CHANGE
your_score, max_score = test_director_of_movie(DIRECTOR_OF_MOVIE_PROMPT)
print_message(your_score, max_score)

 - **Problem 1.3:** Write a prompt that given a word, returns a list of synonyms.

In [None]:
# TODO
SYNONYMS_OF_WORD_PROMPT = ""

# Grader - DO NOT CHANGE
your_score, max_score = test_synonyms_of_word(SYNONYMS_OF_WORD_PROMPT, first_line_only=False)
print_message(your_score, max_score)

 - **Problem 1.4:** Write a prompt that given a food item ("cookies"), returns a list of ingredients used to make that food item. (Hint: use `return_first_line=False` as an argument when using `run_llm`)

In [None]:
# TODO
INGREDIENTS_OF_FOOD_PROMPT = ""

# Grader - DO NOT CHANGE
your_score, max_score = test_ingredients_of_food(INGREDIENTS_OF_FOOD_PROMPT, first_line_only=False)
print_message(your_score, max_score)

* **Problem 1.5:** Write a prompt that given a famous quote ("One small step for man, one giant leap for mankind.", quote characters included), returns the name of the person who said the quote (quotee).

*Extra Challenge:* We want you to try to complete this one without question marks ("?") or question words ("Who", "What", etc.).
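
The grader for this problem, `test_quotee_of_quote`, looks for question words as plain substrings of the lowercased prompt. The sketch below reproduces that check so you can test a draft prompt offline. The helper name `question_flags` is hypothetical.

```python
Q_WORDS = ["who", "what", "when", "where", "why", "how"]

def question_flags(prompt: str) -> list[str]:
    """Return the question words (and "?") that a substring check finds in the prompt."""
    lowered = prompt.lower()
    flags = [word for word in Q_WORDS if word in lowered]
    if "?" in prompt:
        flags.append("?")
    return flags

print(question_flags("Quote: {input} Show the speaker:"))
print(question_flags("Quote: {input} Speaker:"))
```

Because the match is on substrings, the first prompt is flagged for `how` (inside "Show") even though it asks no question. When a prompt is flagged, the grader awards 1 of 3 points instead of 3.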

In [None]:
# TODO
QUOTEE_OF_QUOTE_PROMPT = ""

# Grader
your_score, max_score = test_quotee_of_quote(QUOTEE_OF_QUOTE_PROMPT)
print_message(your_score, max_score)

# Section 2: Prompt Engineering



The prompts you have used up to this point have been fairly basic and straightforward to create. But what if you have a more difficult task and it seems like your prompt isn't working? *Prompt engineering* is the process of iterating on a prompt in clever ways to induce the model to produce what you want.

A prompt that is well-engineered can effectively solve difficult NLP tasks that previously were solved by fine-tuning models.

**Problem 2.1:** Write a prompt that will solve the [sentiment classification task](https://en.wikipedia.org/wiki/Sentiment_analysis), and classify [movie reviews](https://ai.stanford.edu/~amaas/data/sentiment/) as *positive* or *negative*. `IMDB_DATASET_X` and `IMDB_DATASET_Y` contain 200 reviews and sentiment labels (1 = positive, 0 = negative). Get as high an accuracy as you can on these. Place your `MOVIE_SENTIMENT_PROMPT`, `POSITIVE_VERBALIZERS`, and `NEGATIVE_VERBALIZERS` in your report, along with your `correct` (out of 200) score.

In [None]:
# TODO
MOVIE_SENTIMENT_PROMPT = ""

POSITIVE_VERBALIZERS = [
    "good",
    # TODO - Add other positive verbalizers if necessary...
]
NEGATIVE_VERBALIZERS = [
    "bad",
    # TODO - Add other negative verbalizers if necessary...
]

def map_to_sentiment_label(llm_output):
    for v in POSITIVE_VERBALIZERS:
        if v.lower() in llm_output[:20].lower():
            return 1
    for v in NEGATIVE_VERBALIZERS:
        if v.lower() in llm_output[:20].lower():
            return 0
    return None

correct = 0
for review, label in zip(IMDB_DATASET_X, IMDB_DATASET_Y):
    llm_output = run_llm(MOVIE_SENTIMENT_PROMPT.replace("{input}", review), return_first_line=True)
    prediction = map_to_sentiment_label(llm_output)
    if prediction == label:
        correct += 1
    print(f"Prediction: {prediction}, Label: {label}")
print(f"Correct: {correct}/200")
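
To see how verbalizers turn free text into a label, here is a small offline sketch of the same mapping rule applied to hard-coded outputs. The verbalizer lists and the function name `to_label` are examples chosen for illustration, not a recommended solution.

```python
# Example verbalizers for illustration only.
POSITIVE = ["good", "positive"]
NEGATIVE = ["bad", "negative"]

def to_label(llm_output, window=20):
    # Only the first `window` characters are inspected, as in map_to_sentiment_label.
    head = llm_output[:window].lower()
    for v in POSITIVE:      # positive verbalizers are checked first
        if v in head:
            return 1
    for v in NEGATIVE:
        if v in head:
            return 0
    return None             # no verbalizer found: counted as wrong by the loop above

print(to_label("Positive. Great film."))
print(to_label("Sentiment: negative"))
print(to_label("This movie review is clearly negative"))
print(to_label("Not good at all"))
```

The third call returns `None` because the verbalizer appears after the 20-character window, and the fourth returns `1` because "good" matches as a substring of "Not good". Both cases suggest steering the model to answer with a single label word at the very start of its output.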

# Section 3: Few-Shot Prompting

The prompts you have seen up until this point are zero-shot prompts: we ask the model to complete a task without giving it any examples. Providing a few examples in the prompt often makes the model more capable, because the examples show both the task and the expected output format. For example, the following zero-shot and few-shot prompts both try to elicit a more complex word:

In [None]:
ZERO_SHOT_COMPLEX_PROMPT = "Question: What is a more complex word for {input}? Answer:"
FEW_SHOT_COMPLEX_PROMPT = "angry : aggrieved\nsad : depressed\n{input} :"

print(run_llm(ZERO_SHOT_COMPLEX_PROMPT.replace("{input}", 'happy')))
print()
print(run_llm(FEW_SHOT_COMPLEX_PROMPT.replace("{input}", 'happy'), return_first_line=True))
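
Few-shot prompts like the one above follow a regular pattern, so they can also be assembled from a list of example pairs. This is a minimal sketch under that assumption. The helper `build_few_shot_prompt` and its `separator` parameter are hypothetical names, not part of the assignment template.

```python
def build_few_shot_prompt(examples, separator=" : "):
    """Join (source, target) pairs into demonstration lines, then add the open slot."""
    lines = [f"{source}{separator}{target}" for source, target in examples]
    # Final line holds the placeholder and ends right where the model should continue.
    lines.append("{input}" + separator.rstrip())
    return "\n".join(lines)

prompt = build_few_shot_prompt([("angry", "aggrieved"), ("sad", "depressed")])
print(prompt)
print(prompt.replace("{input}", "happy"))
```

With these two pairs, the helper reproduces the `FEW_SHOT_COMPLEX_PROMPT` string exactly, and adding or swapping demonstrations becomes a one-line change.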

Now that you've seen an example of few-shot prompting, it's your turn to try it.

**Problem 3.1:** Write a few-shot prompt that translates a Korean word to an English word.

In [None]:
# TODO
KOREAN_TO_ENGLISH_PROMPT = ""

# Grader
your_score, max_score = test_korean_to_english(KOREAN_TO_ENGLISH_PROMPT)
print_message(your_score, max_score)

**Problem 3.2:** Write a few-shot prompt that converts an input into a [Jeopardy! style answer](https://en.wikipedia.org/wiki/Jeopardy!#:~:text=Rather%20than%20being%20given%20questions,the%20form%20of%20a%20question.) (The Great Lakes -> "What are the Great Lakes?" or Taylor Swift -> "Who is Taylor Swift?")

In [None]:
# TODO
TO_JEOPARDY_ANSWER_PROMPT = ""

# Grader
your_score, max_score = test_to_jeopardy_answer(TO_JEOPARDY_ANSWER_PROMPT)
print_message(your_score, max_score)