# Outline of the Session

1. Prompting setup and basics. How to call an LLM via API?
2. Understanding few-shot prompting - An NLI Illustration
3. LLMs for Reasoning - Task Definitions and Requirements
4. Advanced Prompting Techniques - CoT, Self-consistency and PAL



# Prerequisite

For this tutorial, you will use Groq and together.ai for running LLM inference.

### Setting Up Groq
1. Go to https://groq.com/.
2. Sign in with your email address.
3. Click on GroqCloud.
4. Create an API key named "Tutorial".
5. Copy the API key and save it in the GROQ_API_KEY variable.
6. Install the groq package with pip.
7. Test it on a sample input.

Additional documentation can be found at https://console.groq.com/docs/quickstart.

In [None]:
!pip install pandas
!pip install scikit-learn
!pip install tqdm
!pip install datasets

In [None]:
GROQ_API_KEY = 'ADD YOUR API KEY HERE'
!pip install groq

In [None]:
from groq import Groq
client = Groq(api_key=GROQ_API_KEY)

def run_groq_model(messages, model, temperature=0.7, top_p=1, max_tokens=16):
    chat_completion = client.chat.completions.create(
        messages=messages, temperature=temperature, top_p=top_p,
        model=model, n=1, max_tokens=max_tokens
    )
    return chat_completion.choices[0].message.content

In [None]:
ret = run_groq_model([{"role": "user", "content": "Introduce yourself."}], "llama3-8b-8192", max_tokens=100)
print(ret)

In [None]:
ret = run_groq_model([{"role": "user", "content": "Introduce yourself."}], "llama3-8b-8192", max_tokens=100)
print(ret)

In [None]:
ret = run_groq_model([{"role": "user", "content": "Introduce yourself."}], "llama3-8b-8192", max_tokens=100)
print(ret)

In [None]:
# Try another model
ret = run_groq_model([{"role": "user", "content": "Introduce yourself."}], "mixtral-8x7b-32768", max_tokens=100)
print(ret)

### Play with generation parameters

In [None]:
# Deterministic response
ret = run_groq_model([{"role": "user", "content": "Describe the planet Jupiter in few words."}], "mixtral-8x7b-32768", max_tokens=100, temperature=0.0)
print(ret)

In [None]:
ret = run_groq_model([{"role": "user", "content": "Describe the planet Jupiter in few words."}], "mixtral-8x7b-32768", max_tokens=100, temperature=0.0)
print(ret)

In [None]:
ret = run_groq_model([{"role": "user", "content": "Describe the planet Jupiter in few words."}], "mixtral-8x7b-32768", max_tokens=100, temperature=0.9)
print(ret)
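
To build intuition for the `temperature` parameter used above, here is a minimal sketch of how temperature reshapes a next-token distribution. The logits are made up for illustration, and the function name `softmax_with_temperature` is hypothetical. Treating temperature 0 as greedy (argmax) decoding is an assumption of this sketch; actual providers may handle it differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature == 0:
        # Assumption: temperature 0 means greedy decoding (all mass on the top token).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.7, 0.9):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which is why repeated calls tend to agree; higher temperatures flatten the distribution and make the outputs more varied.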

# 2. Understanding few-shot prompting - An NLI Illustration


We start with a simple NLP task.

Natural language inference (NLI) is essentially a reading comprehension task for computers. Imagine you read a passage (the premise) and are then given a statement (the hypothesis). NLI asks whether that statement follows from the information in the passage alone.

Here's a breakdown:

* There are two main parts: a **premise** (the background information) and a **hypothesis** (the statement to check).
* The task is to determine the relationship between the two:
    * Does the hypothesis necessarily follow from the premise? (**entailment**)
    * Does the hypothesis contradict the premise? (**contradiction**)
    * Is there not enough information in the premise to say for sure? (**neutral**)

NLI helps computers better understand the nuances of language and logic, which is useful for tasks like question answering and fact checking.

In [None]:
test_dataset = [
    # Entailment
    {"premise": "A soccer game with multiple males playing.", "hypothesis": "Some men are playing a sport.", "label": "entailment"},
    {"premise": "A woman is reading a book.", "hypothesis": "A female is holding a book.", "label": "entailment"},
    {"premise": "The cat is sleeping on the couch.", "hypothesis": "A cat is resting.", "label": "entailment"},
    {"premise": "A man is eating a sandwich.", "hypothesis": "A person is having a meal.", "label": "entailment"},
    {"premise": "A group of people are walking in the park.", "hypothesis": "People are outdoors.", "label": "entailment"},
    {"premise": "A child is drawing with crayons.", "hypothesis": "A kid is making art.", "label": "entailment"},
    {"premise": "The chef is cooking in the kitchen.", "hypothesis": "Someone is preparing food.", "label": "entailment"},
    {"premise": "A dog is barking loudly.", "hypothesis": "A dog is making noise.", "label": "entailment"},
    {"premise": "Two kids are playing with a ball.", "hypothesis": "Children are engaged in a game.", "label": "entailment"},
    {"premise": "The sun is shining brightly.", "hypothesis": "It is sunny outside.", "label": "entailment"},
    {"premise": "A teacher is writing on the board.", "hypothesis": "An instructor is using chalk.", "label": "entailment"},
    {"premise": "The musician is playing the guitar.", "hypothesis": "Someone is performing music.", "label": "entailment"},
    {"premise": "The flowers are blooming in the garden.", "hypothesis": "Plants are growing in the yard.", "label": "entailment"},
    {"premise": "A person is jogging on the beach.", "hypothesis": "Someone is running near the water.", "label": "entailment"},
    {"premise": "A car is parked on the street.", "hypothesis": "A vehicle is on the road.", "label": "entailment"},
    {"premise": "A baby is crawling on the floor.", "hypothesis": "An infant is moving on the ground.", "label": "entailment"},
    {"premise": "A boy is riding a bicycle.", "hypothesis": "A child is on a bike.", "label": "entailment"},
    {"premise": "The birds are flying in the sky.", "hypothesis": "Some animals are in the air.", "label": "entailment"},
    {"premise": "A couple is dancing at the wedding.", "hypothesis": "Two people are celebrating.", "label": "entailment"},
    {"premise": "A student is studying for an exam.", "hypothesis": "Someone is preparing for a test.", "label": "entailment"},

    # Neutral
    {"premise": "A man is playing a guitar.", "hypothesis": "The man is performing at a concert.", "label": "neutral"},
    {"premise": "A woman is jogging in the morning.", "hypothesis": "She is training for a marathon.", "label": "neutral"},
    {"premise": "A dog is playing in the yard.", "hypothesis": "The dog is digging a hole.", "label": "neutral"},
    {"premise": "A group of kids are playing football.", "hypothesis": "The children are at school.", "label": "neutral"},
    {"premise": "A man is fixing his car.", "hypothesis": "The car had a flat tire.", "label": "neutral"},
    {"premise": "A chef is preparing a meal.", "hypothesis": "The chef is making an Italian dish.", "label": "neutral"},
    {"premise": "A person is reading a newspaper.", "hypothesis": "The person is at a coffee shop.", "label": "neutral"},
    {"premise": "A child is drawing with crayons.", "hypothesis": "The child is creating a masterpiece.", "label": "neutral"},
    {"premise": "A family is having a picnic.", "hypothesis": "The family is celebrating a birthday.", "label": "neutral"},
    {"premise": "A woman is planting flowers.", "hypothesis": "The woman is a professional gardener.", "label": "neutral"},
    {"premise": "A man is swimming in the pool.", "hypothesis": "The man is training for a competition.", "label": "neutral"},
    {"premise": "A boy is riding a skateboard.", "hypothesis": "The boy is practicing for a contest.", "label": "neutral"},
    {"premise": "A couple is walking their dog.", "hypothesis": "The couple adopted the dog recently.", "label": "neutral"},
    {"premise": "A student is taking notes in class.", "hypothesis": "The student is preparing for a quiz.", "label": "neutral"},
    {"premise": "A man is drinking coffee.", "hypothesis": "The man is at a business meeting.", "label": "neutral"},
    {"premise": "A woman is baking a cake.", "hypothesis": "The cake is for a wedding.", "label": "neutral"},
    {"premise": "A child is playing the piano.", "hypothesis": "The child is a prodigy.", "label": "neutral"},
    {"premise": "A girl is reading a book.", "hypothesis": "The book is about science.", "label": "neutral"},
    {"premise": "A man is mowing the lawn.", "hypothesis": "The man is a gardener.", "label": "neutral"},
    {"premise": "A teacher is grading papers.", "hypothesis": "The teacher is teaching high school.", "label": "neutral"},

    # Contradiction
    {"premise": "A cat is sitting on the windowsill.", "hypothesis": "The cat is outside.", "label": "contradiction"},
    {"premise": "A woman is shopping for groceries.", "hypothesis": "The woman is at home.", "label": "contradiction"},
    {"premise": "A man is playing basketball.", "hypothesis": "The man is sitting on the bench.", "label": "contradiction"},
    {"premise": "A child is watching television.", "hypothesis": "The child is playing outside.", "label": "contradiction"},
    {"premise": "A dog is barking at a stranger.", "hypothesis": "The dog is sleeping.", "label": "contradiction"},
    {"premise": "A group of friends are having dinner.", "hypothesis": "The friends are at the park.", "label": "contradiction"},
    {"premise": "A woman is writing a letter.", "hypothesis": "The woman is typing on a computer.", "label": "contradiction"},
    {"premise": "A man is jogging in the park.", "hypothesis": "The man is sitting on a bench.", "label": "contradiction"},
    {"premise": "A chef is cooking in the kitchen.", "hypothesis": "The chef is cleaning the kitchen.", "label": "contradiction"},
    {"premise": "A student is studying in the library.", "hypothesis": "The student is at a party.", "label": "contradiction"},
    {"premise": "A person is driving a car.", "hypothesis": "The person is riding a bicycle.", "label": "contradiction"},
    {"premise": "A woman is swimming in the ocean.", "hypothesis": "The woman is sunbathing.", "label": "contradiction"},
    {"premise": "A man is reading a book.", "hypothesis": "The man is playing a video game.", "label": "contradiction"},
    {"premise": "A child is building a sandcastle.", "hypothesis": "The child is flying a kite.", "label": "contradiction"},
    {"premise": "A group of people are watching a movie.", "hypothesis": "The people are playing a game.", "label": "contradiction"},
    {"premise": "A man is fishing at the lake.", "hypothesis": "The man is hiking in the mountains.", "label": "contradiction"},
    {"premise": "A girl is painting a picture.", "hypothesis": "The girl is reading a book.", "label": "contradiction"},
    {"premise": "A woman is listening to music.", "hypothesis": "The woman is talking on the phone.", "label": "contradiction"},
    {"premise": "A child is playing with toys.", "hypothesis": "The child is asleep.", "label": "contradiction"},
    {"premise": "A man is walking his dog.", "hypothesis": "The man is riding a bicycle.", "label": "contradiction"}
]

### Let's design a prompt to solve the NLI task.

In [None]:
def nli_prompt(premise, hypothesis):
    """
    Design a prompt to solve the NLI task. A prompt typically contains
    1. A system message
    2. Alternating user and assistant messages.

    These are not concrete requirements and you can design your own prompt.
    """

    messages = [
        {"role": "system", "content": "ADD NLI INSTRUCTIONS HERE"},
        {"role": "user", "content": "ADD NLI INPUTS HERE"},
    ] # model will generate the output

    return messages
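
If you are unsure how to fill in the placeholders, the following is one possible zero-shot prompt. It is only a sketch: the function name `nli_prompt_sketch` and the instruction wording are illustrative assumptions, not a reference solution. It deliberately does not fix an output format.

```python
def nli_prompt_sketch(premise, hypothesis):
    """One possible zero-shot NLI prompt (illustrative wording only)."""
    instructions = (
        "You are given a premise and a hypothesis. Decide whether the hypothesis "
        "is an entailment, a contradiction, or neutral with respect to the premise."
    )
    user_input = f"Premise: {premise}\nHypothesis: {hypothesis}"
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_input},
    ]

messages = nli_prompt_sketch("A woman is reading a book.", "A female is holding a book.")
for m in messages:
    print(m["role"], "->", m["content"])
```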

In [None]:
results = []

for entry in test_dataset:
    messages = nli_prompt(entry['premise'], entry['hypothesis'])
    ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=16, temperature=0.0)
    results.append(ret)

In [None]:
results[:10]

### We are not able to parse the output of the model. How can we specify the output format?

In [None]:
exemplars = [
    {"premise": "A boy is jumping into the pool.", "hypothesis": "A child is entering the water.", "label": "entailment"},
    {"premise": "The artist is painting a landscape.", "hypothesis": "Someone is creating art.", "label": "entailment"},
    {"premise": "A man is running a marathon.", "hypothesis": "The man will win the race.", "label": "neutral"},
    {"premise": "A girl is reading under a tree.", "hypothesis": "The girl is studying for exams.", "label": "neutral"},
    {"premise": "A woman is cooking dinner.", "hypothesis": "The woman is dining out.", "label": "contradiction"},
    {"premise": "A dog is playing with a ball.", "hypothesis": "The dog is sleeping.", "label": "contradiction"}
]

In [None]:
def nli_prompt_with_exemplars(premise, hypothesis, exemplars):
    """
    Design a prompt to solve the NLI task. A prompt typically contains
    1. A system message
    2. Alternating user and assistant messages. Use exemplars here.
    """

    messages = [{"role": "system", "content": "ADD NLI INSTRUCTIONS HERE"}]
    for ee in exemplars:
        messages.append({"role": "user", "content": "ADD NLI INPUT HERE"})
        messages.append({"role": "assistant", "content": "ADD NLI OUTPUT HERE"})
    messages.append({"role": "user", "content": "ADD NLI INPUT HERE"})

    return messages
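
As a sketch of how the exemplars can pin down the output format, the version below answers every exemplar as `Answer: <label>`. The function name `nli_prompt_with_exemplars_sketch`, the instruction wording, and the `Answer:` prefix are assumptions chosen for illustration; any consistent format would work.

```python
def nli_prompt_with_exemplars_sketch(premise, hypothesis, exemplars):
    """Few-shot NLI prompt: each exemplar becomes a user/assistant message pair."""
    def format_input(p, h):
        return f"Premise: {p}\nHypothesis: {h}"

    instructions = (
        "Decide whether the hypothesis is an entailment, a contradiction, or neutral "
        "with respect to the premise. Reply in the form 'Answer: <label>'."
    )
    messages = [{"role": "system", "content": instructions}]
    for ex in exemplars:
        messages.append({"role": "user", "content": format_input(ex["premise"], ex["hypothesis"])})
        # The assistant turn demonstrates the exact output format we want back.
        messages.append({"role": "assistant", "content": f"Answer: {ex['label']}"})
    messages.append({"role": "user", "content": format_input(premise, hypothesis)})
    return messages

demo_exemplars = [
    {"premise": "A boy is jumping into the pool.", "hypothesis": "A child is entering the water.", "label": "entailment"},
    {"premise": "A woman is cooking dinner.", "hypothesis": "The woman is dining out.", "label": "contradiction"},
]
demo = nli_prompt_with_exemplars_sketch("A dog is barking loudly.", "A dog is making noise.", demo_exemplars)
print(len(demo), demo[2]["content"])
```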

In [None]:
from tqdm import tqdm

results = []

for entry in tqdm(test_dataset):
    messages = nli_prompt_with_exemplars(entry['premise'], entry['hypothesis'], exemplars)
    ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=16, temperature=0.0)
    results.append(ret)

In [None]:
results[:10]

#### Now, parse the output to obtain final predictions.

In [None]:
def post(text):
    return text.replace('Answer: ', '').strip()

In [None]:
from collections import Counter
from sklearn.metrics import accuracy_score

preds = [post(x) for x in results]
golds = [x['label'] for x in test_dataset]
print(Counter(preds))
print('Accuracy', accuracy_score(golds, preds))
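
Simple string replacement breaks down when a model changes capitalization or appends an explanation. A slightly more defensive parser might look like the sketch below; `parse_nli_label` and the `"unknown"` fallback are hypothetical names introduced here, not part of the original notebook.

```python
LABELS = ("entailment", "neutral", "contradiction")

def parse_nli_label(text):
    """Map a raw model reply to one of the three labels, or 'unknown'."""
    first_line = text.strip().split("\n")[0].lower()
    found = [label for label in LABELS if label in first_line]
    # Accept the reply only when exactly one label is mentioned.
    return found[0] if len(found) == 1 else "unknown"

print(parse_nli_label("Answer: Entailment"))
print(parse_nli_label("Answer: neutral\n\nExplanation: the premise does not say."))
print(parse_nli_label("I am not sure."))
```

Counting `"unknown"` predictions separately makes it easy to tell format failures apart from genuine classification errors.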

In [None]:
results = []

for entry in tqdm(test_dataset):
    messages = nli_prompt_with_exemplars(entry['premise'], entry['hypothesis'], exemplars)
    ret = run_groq_model(messages, "mixtral-8x7b-32768", max_tokens=16, temperature=0.0)
    results.append(ret)


In [None]:
results[:10]

In [None]:
def post(text):
    return text.split('\n')[0].replace('Answer: ', '').strip()

preds = [post(x) for x in results]
golds = [x['label'] for x in test_dataset]
print(Counter(preds))
print('Accuracy', accuracy_score(golds, preds))

# 3. LLMs for Reasoning - Task Definitions and Requirements

### Solving Simple Math Problems with LLMs

We will work with the GSM8K dataset, which consists of grade-school math word problems.
Our objective is to use the Llama and Mixtral models to solve these problems.

GSM8K, short for Grade School Math 8K, is a collection of math word problems commonly used to test the multi-step reasoning of language models.

Here's an example problem:

    Sarah has 10 cookies. She gives 3 to her friend. How many cookies does Sarah have left?

This problem requires two steps:

    Subtract the number of cookies given away (3) from the starting number (10).
    Find the answer (7 cookies).

To solve GSM8K problems, LLMs must identify different variables and their assignments (number_of_cookies=10), reason out the involved operations (number_of_cookies - 3) and output the final answer.

## Download the dataset using Huggingface

In [None]:
from datasets import load_dataset

dataset = load_dataset("openai/gsm8k", "main")

In [None]:
def get_final_answer(entry):
    entry['numerical_answer'] = entry['answer'].split('\n')[-1].split(' ', 1)[-1].strip()

    return entry
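
To see what the parsing above does, here is a minimal sketch applied to an illustrative answer string written in the GSM8K style, where the reasoning is followed by a final line of the form `#### <number>`. The helper name `extract_final_answer` is hypothetical and mirrors the logic of `get_final_answer`.

```python
# Illustrative answer string in the GSM8K format (reasoning, then "#### <number>").
sample_answer = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

def extract_final_answer(answer_text):
    last_line = answer_text.split("\n")[-1]      # the "#### 72" line
    return last_line.split(" ", 1)[-1].strip()   # everything after the first space

print(extract_final_answer(sample_answer))
```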

In [None]:
dataset["train"][0]

In [None]:
dataset['train'] = dataset['train'].map(get_final_answer)
dataset['test'] = dataset['test'].map(get_final_answer)

In [None]:
print(dataset['test'][0])

In [None]:
test_dataset = dataset['test'].train_test_split(test_size=50, seed=42)['test']

In [None]:
test_dataset[0]

In [None]:
def compute_accuracy(golds, preds):
    cnt = 0
    for gg, pp in zip(golds, preds):
        gg = float(gg.replace(',', ''))
        try:
            pp = float(pp.replace(',', ''))
        except (ValueError, AttributeError):
            pp = None

        if gg == pp:
            cnt += 1
    return cnt / len(golds)

## Few-shot Exemplars

In [None]:
from tqdm import tqdm
from collections import Counter
from sklearn.metrics import accuracy_score

In [None]:
exemplars = [dataset['train'][ii] for ii in range(5)]

In [None]:
def few_shot_prompt(test_entry, exemplars):
    messages = [{"role": "system", "content": "MATH PROBLEM INSTRUCTIONS"}]
    for entry in exemplars:
        messages.append({"role": "user", "content": "QUESTION HERE"})
        messages.append({"role": "assistant", "content": "ANSWER HERE"})
    messages.append({"role": "user", "content": "QUESTION HERE"})

    return messages
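
One way to fill in the placeholders is sketched below. It assumes each exemplar has `question` and `numerical_answer` fields (as produced by `get_final_answer`) and asks the model to reply with an `[answer]` tag; the function name, the instruction wording, and the toy exemplar are illustrative assumptions.

```python
def few_shot_prompt_sketch(test_entry, exemplars):
    """Direct-answer few-shot prompt: question in, '[answer] <number>' out."""
    instructions = (
        "Solve the math word problem. "
        "Reply only with '[answer]' followed by the final number."
    )
    messages = [{"role": "system", "content": instructions}]
    for entry in exemplars:
        messages.append({"role": "user", "content": entry["question"]})
        messages.append({"role": "assistant", "content": f"[answer] {entry['numerical_answer']}"})
    messages.append({"role": "user", "content": test_entry["question"]})
    return messages

# Toy exemplar made up for illustration.
toy_exemplars = [
    {"question": "Sarah has 10 cookies. She gives 3 to her friend. How many cookies does Sarah have left?",
     "numerical_answer": "7"},
]
toy_messages = few_shot_prompt_sketch({"question": "Tom has 4 pens and buys 5 more. How many pens does Tom have?"}, toy_exemplars)
print(toy_messages[2]["content"])
```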

In [None]:
results = []
for entry in tqdm(test_dataset):
    messages = few_shot_prompt(entry, exemplars)
    ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=16, temperature=0.0)
    results.append(ret)

In [None]:
results[:10]

##### Post-process the outputs into the required format

In [None]:
def post(text):
    return text.replace('[answer]', '').strip()

In [None]:
preds = [post(x) for x in results]
golds = [x['numerical_answer'].strip() for x in test_dataset]
print(Counter(preds))
print('Accuracy', compute_accuracy(golds, preds))

# 4. Advanced Prompting Techniques - CoT, Self-consistency and PAL

#### The model performance is not so great.

#### But, wait!

#### As humans, we solve problems through step-by-step reasoning. Let's add these reasoning steps (Chain-of-Thought) to the prompt so that the model reasons first and then answers.

In [None]:
def few_shot_prompt_with_COT(test_entry, exemplars):
    messages = [{"role": "system", "content": "MATH PROBLEM INSTRUCTIONS"}]
    for entry in exemplars:
        messages.append({"role": "user", "content": "QUESTION HERE"})
        messages.append({"role": "assistant", "content": "ANSWER HERE"})
    messages.append({"role": "user", "content": "QUESTION HERE"})

    return messages
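
A possible Chain-of-Thought version is sketched below. It assumes each exemplar's `answer` field already contains the worked reasoning followed by a final `#### <number>` line, as in the GSM8K format, so the exemplar answers can be reused as demonstrations. The function name, the instruction wording, and the toy exemplar are illustrative.

```python
def few_shot_prompt_with_cot_sketch(test_entry, exemplars):
    """CoT few-shot prompt: the assistant turns show the reasoning before the answer."""
    instructions = (
        "Solve the math word problem step by step. "
        "End with a line of the form '#### <final number>'."
    )
    messages = [{"role": "system", "content": instructions}]
    for entry in exemplars:
        messages.append({"role": "user", "content": entry["question"]})
        # The full worked solution serves as the reasoning demonstration.
        messages.append({"role": "assistant", "content": entry["answer"]})
    messages.append({"role": "user", "content": test_entry["question"]})
    return messages

# Toy exemplar made up for illustration.
cot_exemplars = [
    {"question": "Sarah has 10 cookies. She gives 3 to her friend. How many cookies does Sarah have left?",
     "answer": "Sarah has 10 - 3 = 7 cookies left.\n#### 7"},
]
cot_messages = few_shot_prompt_with_cot_sketch({"question": "Tom has 4 pens and buys 5 more. How many pens does Tom have?"}, cot_exemplars)
print(cot_messages[2]["content"])
```

Because the demonstrations end with `#### <number>`, taking the text after the first space on the last line of a reply recovers the final answer when the model follows the format.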

In [None]:
results = []
for entry in tqdm(test_dataset):
    messages = few_shot_prompt_with_COT(entry, exemplars)
    ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.0)
    results.append(ret)

In [None]:
results[:10]

In [None]:
def post(text):
    return text.split('\n')[-1].split(' ', 1)[-1].strip()

In [None]:
preds = [post(x) for x in results]
golds = [x['numerical_answer'].strip() for x in test_dataset]
print(Counter(preds))
print('Accuracy', compute_accuracy(golds, preds))

#### A whopping 64% increase in performance. Let's try with the Mixtral model.

In [None]:
results = []
for entry in tqdm(test_dataset):
    messages = few_shot_prompt_with_COT(entry, exemplars)
    ret = run_groq_model(messages, "mixtral-8x7b-32768", max_tokens=400, temperature=0.0)
    results.append(ret)

In [None]:
results[:10]

In [None]:
def post(text):
    return text.split('\n')[-1].split(' ', 1)[-1].strip()

In [None]:
preds = [post(x) for x in results]
golds = [x['numerical_answer'] for x in test_dataset]
print(Counter(preds))
print('Accuracy', compute_accuracy(golds, preds))

# Self-consistency Chain-of-Thought Prompting

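Standard CoT prompting commits to a single reasoning path, which may be incorrect. In many cases, the answer can be reached through multiple approaches. Self-consistency samples several reasoning paths at a non-zero temperature and keeps the most common final answer.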

In [None]:
question = "A candle melts by 2 centimeters every hour that it burns. How many centimeters shorter will a candle be after burning from 1:00 PM to 5:00 PM?"

messages = few_shot_prompt_with_COT({'question': question}, exemplars)
print('\n\n'.join([x['content'] for x in messages]))

In [None]:
ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.0)
print(ret)

In [None]:
ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.9)
print(ret)

In [None]:
ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.9)
print(ret)

In [None]:
ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.9)
print(ret)

In [None]:
ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.9)
print(ret)

In [None]:
ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.9)
print(ret)

### Take Majority Voting across reasoning paths.

In [None]:
preds = []
for entry in tqdm(test_dataset):
    tmp = []
    for jj in range(5):
        messages = few_shot_prompt_with_COT(entry, exemplars)
        ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.7)
        tmp.append(ret)
    tmp = [post(x) for x in tmp]
    majority = Counter(tmp).most_common(1)[0][0]
    preds.append(majority)

In [None]:
tmp, majority, preds[-1]

In [None]:
golds = [x['numerical_answer'] for x in test_dataset]
# print(Counter(preds))
print('Accuracy', compute_accuracy(golds, preds))
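
Voting on raw strings treats `"8"` and `"8.0"` as different answers and lets unparseable outputs take part in the vote. The sketch below shows one way to vote on normalized numbers instead; `normalize_number` and `majority_vote` are hypothetical helpers, and the sample answers are made up.

```python
from collections import Counter

def normalize_number(text):
    """Convert an answer string to a float, or None if it is not a number."""
    try:
        return float(text.replace(",", "").replace("$", ""))
    except ValueError:
        return None

def majority_vote(answers):
    """Return the most common numeric answer, ignoring unparseable ones."""
    values = [normalize_number(a) for a in answers]
    values = [v for v in values if v is not None]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]

samples = ["8", "8.0", "10", "eight", "8"]
print(majority_vote(samples))  # prints 8.0
```

Note that `majority_vote` returns a float (or `None`), so wrap the result in `str()` before passing it to `compute_accuracy`, which expects strings.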

### LLMs and Mathematical Computations

**Strength**

LLMs are good at understanding a given problem and generating logical steps to solve it.

**Weakness**

Some logical steps require arithmetic computations, which LLMs can get wrong.

**Idea**

Use a calculator to run the arithmetic computations.

Specifically, ask the LLM to output its reasoning steps as Python code, then execute the code to obtain the final result.

In [None]:
codes = ["""Here is the step-by-step Python code for the given problem.
```
# Clips sold in April
clips_april = 48

# Clips sold in May (half of April)
clips_may = clips_april / 2

# Total clips sold
total_clips = clips_april + clips_may
print("FINAL ANSWER:", total_clips)
```
The final answer is "total_clips" """,

"""Here is the step-by-step Python code for the given problem.
```
# Hourly rate
rate_per_hour = 12

# Minutes worked
minutes_worked = 50

# Earnings per minute
rate_per_minute = rate_per_hour / 60

# Total earnings
total_earnings = rate_per_minute * minutes_worked
print("FINAL ANSWER:", total_earnings)
```
The final answer is "total_earnings" """,

"""Here is the step-by-step Python code for the given problem.
```
# Cost of the wallet
wallet_cost = 100

# Initial savings (half of the cost)
initial_savings = wallet_cost / 2

# Money given by parents
parents_contribution = 15

# Money given by grandparents (twice as much as parents)
grandparents_contribution = parents_contribution * 2

# Total savings after contributions
total_savings = initial_savings + parents_contribution + grandparents_contribution

# Money still needed
money_needed = wallet_cost - total_savings
print("FINAL ANSWER:", money_needed)
```
The final answer is "money_needed" """,

"""Here is the step-by-step Python code for the given problem.
```
# Total pages in the book
total_pages = 120

# Pages read yesterday
pages_yesterday = 12

# Pages read today (twice as many as yesterday)
pages_today = pages_yesterday * 2

# Total pages read so far
total_pages_read = pages_yesterday + pages_today

# Pages remaining
pages_remaining = total_pages - total_pages_read

# Pages to read tomorrow (half of the remaining pages)
pages_tomorrow = pages_remaining / 2
print("FINAL ANSWER:", pages_tomorrow)
```
The final answer is "pages_tomorrow" """,

"""Here is the step-by-step Python code for the given problem.
```
# Pages per letter
pages_per_letter = 3

# Friends
num_friends = 2

# Letters per week
letters_per_week = 2

# Pages written per week
pages_per_week = pages_per_letter * num_friends * letters_per_week

# Weeks in a year
weeks_in_year = 52

# Total pages written in a year
total_pages_year = pages_per_week * weeks_in_year
print("FINAL ANSWER:", total_pages_year)
```
The final answer is "total_pages_year" """
]

In [None]:
for jj, ee in enumerate(exemplars):
    ee['code'] = codes[jj]

In [None]:
def few_shot_prompt_with_code(test_entry, exemplars):
    messages = [{"role": "system", "content": "MATH PROBLEM INSTRUCTIONS"}]
    for entry in exemplars:
        messages.append({"role": "user", "content": "QUESTION HERE"})
        messages.append({"role": "assistant", "content": "ANSWER HERE"})
    messages.append({"role": "user", "content": "QUESTION HERE"})

    return messages
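
A possible way to fill in this template is sketched below. It assumes each exemplar carries the `code` field attached in the previous cell; the function name, the instruction wording, and the toy exemplar are illustrative assumptions. The fence string is built programmatically only so that this example can itself be shown inside a fenced block.

```python
FENCE = "`" * 3  # triple backtick, built programmatically

def few_shot_prompt_with_code_sketch(test_entry, exemplars):
    """Program-aided prompt: the assistant turns answer with Python code."""
    instructions = (
        "Write step-by-step Python code that solves the math word problem. "
        "Put the code in a single fenced block and print the result as "
        "'FINAL ANSWER: <value>'."
    )
    messages = [{"role": "system", "content": instructions}]
    for entry in exemplars:
        messages.append({"role": "user", "content": entry["question"]})
        messages.append({"role": "assistant", "content": entry["code"]})
    messages.append({"role": "user", "content": test_entry["question"]})
    return messages

# Toy exemplar made up for illustration.
code_exemplars = [{
    "question": "Sarah has 10 cookies. She gives 3 to her friend. How many cookies does Sarah have left?",
    "code": "\n".join([
        "Here is the step-by-step Python code for the given problem.",
        FENCE,
        "cookies = 10",
        "cookies_left = cookies - 3",
        'print("FINAL ANSWER:", cookies_left)',
        FENCE,
    ]),
}]
code_messages = few_shot_prompt_with_code_sketch({"question": "Tom has 4 pens and buys 5 more. How many pens does Tom have?"}, code_exemplars)
print(code_messages[2]["content"])
```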

In [None]:
question = "A candle melts by 2 centimeters every hour that it burns. How many centimeters shorter will a candle be after burning from 1:00 PM to 5:00 PM?"

messages = few_shot_prompt_with_code({'question': question}, exemplars)
print('\n\n'.join([x['content'] for x in messages]))

In [None]:
ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=200, temperature=0.0)
print(ret)

In [None]:
import io
import sys

def run_python_string(text):
    """Run the fenced code block in `text` and return what it prints after 'FINAL ANSWER:'."""
    lines = text.split('\n')
    # Locate the opening and closing code fences
    idxs = [ii for ii in range(len(lines)) if '```' in lines[ii]]
    if len(idxs) != 2:
        return 'Code Error.'

    code_str = '\n'.join(lines[idxs[0] + 1: idxs[1]])
    # Use a local namespace to avoid variable conflicts
    local_var = {}

    # Capture everything the generated code prints
    output_buffer = io.StringIO()
    old_stdout = sys.stdout
    sys.stdout = output_buffer
    try:
        # Caution: this executes model-generated code without any sandboxing
        exec(code_str, globals(), local_var)
    finally:
        # Restore the original stdout even if the generated code raises
        sys.stdout = old_stdout

    # Get the output from the buffer
    output = output_buffer.getvalue()

    for line in output.split('\n'):
        if 'FINAL ANSWER:' in line:
            return line.replace('FINAL ANSWER:', '').strip()

    return 'Code Error.'




In [None]:
run_python_string(ret)

In [None]:
results = []
for entry in tqdm(test_dataset):
    messages = few_shot_prompt_with_code(entry, exemplars)
    ret = run_groq_model(messages, "llama3-8b-8192", max_tokens=400, temperature=0.0)
    results.append(ret)

In [None]:
for ii in range(5):
    print(results[ii])
    print('-' * 120)

In [None]:
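preds = []

# Safety net: restore stdout afterwards in case generated code fails mid-execution
orig_stdout = sys.stdout
for ret in results:
    try:
        val = run_python_string(ret)
    except Exception:
        # Any failure in the generated code counts as a wrong answer
        val = 'Code Error.'
    preds.append(val)
sys.stdout = orig_stdout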

In [None]:
preds[:10]

In [None]:
golds = [x['numerical_answer'] for x in test_dataset]
print(Counter(preds))
print('Accuracy', compute_accuracy(golds, preds))

# Tree of Thoughts


Hulbert, Dave. "Using Tree-of-Thought Prompting to Boost ChatGPT's Reasoning." Last modified May 2023. GitHub repository. Zenodo. https://doi.org/10.5281/zenodo.10323452. https://github.com/dave1010/tree-of-thought-prompting.

We will use a modified version of ToT to solve a word problem.


In [None]:
question = "Tom has twice as many apples as Jerry. Together, they have 18 apples. How many apples does Tom have?"

In [None]:
prompt = """Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. The question is...

Simulate three brilliant, logical experts collaboratively answering a question. Each one verbosely explains their thought process in real-time, considering the prior explanations of others and openly acknowledging mistakes. At each step, whenever possible, each expert refines and builds upon the thoughts of others, acknowledging their contributions. They continue until there is a definitive answer to the question. For clarity, your entire response should be in a markdown table. The question is...

Identify and behave as three different experts that are appropriate to answering this question.
All experts will write down the step and their thinking about the step, then share it with the group.
Then, all experts will go on to the next step, etc.
At each step all experts will score their peers response between 1 and 5, 1 meaning it is highly unlikely, and 5 meaning it is highly likely.
If any expert is judged to be wrong at any point then they leave.
After all experts have provided their analysis, you then analyze all 3 analyses and provide either the consensus solution or your best guess solution.
The question is...

Question: """ + question

print(prompt)

In [None]:
ret = run_groq_model([{'role': 'user', 'content': prompt}], "llama3-8b-8192", max_tokens=2048, temperature=0.0)
print(ret)