## GPT Prompts on GSM8K

In [25]:
import os
import re

import numpy as np
from datasets import load_dataset
from openai import OpenAI
from tqdm import tqdm

In [26]:
# Load dataset
gsm8k = load_dataset('gsm8k', 'main')
validation_index = np.load('./lib_prompt/validation_index.npy')
validation_data = gsm8k['train'].select(validation_index)
gsm8k_test = gsm8k['test']

In [27]:
print(gsm8k['train'][0]['question'])

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?


In [29]:
prompt_complex = open('lib_prompt/prompt_hardest.txt').read()
print(prompt_complex)

Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?
Let's think step by step
Angelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.
For the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.
Angelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.
However, they need to include time for breaks and lunch. Every hour they want 

In [30]:
# OpenAI client
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # same variable the client reads by default
)

In [31]:
# from tenacity import (
#     retry,
#     stop_after_attempt,
#     wait_chain,
#     wait_fixed
# ) 
# 
# @retry(wait=wait_chain(*[wait_fixed(3) for i in range(3)] +
#                        [wait_fixed(5) for i in range(2)] +
#                        [wait_fixed(10)]))
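
The commented-out decorator above is not attached to any function, so no API call in this notebook is retried. A minimal standard-library sketch of the same wait schedule (3 s three times, 5 s twice, then 10 s) follows. `call_with_retry` and `WAITS` are hypothetical names, and unlike the `tenacity` chain, which has no stop condition, this version gives up after the schedule is exhausted.

```python
import time

# Same waits as the commented-out wait_chain: 3 s x3, 5 s x2, then 10 s.
WAITS = [3, 3, 3, 5, 5, 10]

def call_with_retry(fn, waits=WAITS, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry once per entry in waits."""
    for wait in waits:
        try:
            return fn()
        except Exception:
            # KeyboardInterrupt is not an Exception subclass, so Ctrl-C still stops the run.
            sleep(wait)
    return fn()  # final attempt; any exception propagates to the caller
```

A request could then be wrapped as `call_with_retry(lambda: client.chat.completions.create(...))`, assuming that retrying on every exception type is acceptable for the run.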

In [32]:
def test_answer(pred_str, ans_str):
    """Return True if the last number in pred_str equals the last number in ans_str."""
    pattern = r'\d*\.?\d+'  # raw string avoids the invalid-escape warning
    pred = re.findall(pattern, pred_str)
    gold = re.findall(pattern, ans_str)
    if not pred or not gold:
        return False
    # Exact string comparison: '3' and '3.0' do not match.
    return pred[-1] == gold[-1]

def parse_pred_ans(filename):
    """Parse a results file of 'Q: / A_model: / A:' records and print the accuracy."""
    with open(filename) as fd:
        lines = fd.readlines()
    q, am, a = None, None, None
    num_q, acc = 0, 0
    current_mode = 'none'
    questions = []
    ans_pred = []
    ans_gold = []
    for l in lines:
        if l.startswith('Q: '):
            # A new question starts: store the record collected so far.
            if am is not None and a is not None:
                questions.append(q)
                ans_pred.append(am)
                ans_gold.append(a)
                if test_answer(am, a):
                    acc += 1
            current_mode = 'q'
            q = l
            num_q += 1
        elif l.startswith('A_model:'):
            current_mode = 'am'
            am = l
        elif l.startswith('A:'):
            current_mode = 'a'
            a = l
        else:
            # Continuation line: append it to the field currently being read.
            if current_mode == 'q': q += l
            elif current_mode == 'am': am += l
            elif current_mode == 'a': a += l
            else:
                raise ValueError(current_mode)

    # Store the last record, which no following 'Q: ' line flushes.
    if am is not None and a is not None:
        questions.append(q)
        ans_pred.append(am)
        ans_gold.append(a)
        if test_answer(am, a):
            acc += 1
    ratio = acc / num_q if num_q else 0.0  # avoid dividing by zero on an empty file
    print('num_q %d correct %d ratio %.4f' % (num_q, acc, ratio))
    return questions, ans_pred, ans_gold

def test_finished(ans_model):
    return 'answer is' in ans_model

def extract_ans(ans_model):
    ans_model = ans_model.split('\n')
    ans = []
    residual = []
    for li, al in enumerate(ans_model):
        ans.append(al)
        if('answer is' in al):
            break
    residual = list(ans_model[li + 1:])
    ans = '\n'.join(ans)
    residual = '\n'.join(residual)
    return ans, residual
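
`test_answer` compares the last numeric token as a string, so `1,000` against `1000` or `3.0` against `3` counts as a mismatch. A minimal sketch of a more tolerant comparison follows, assuming thousands separators and trailing zeros should not matter. `last_number` and `numbers_match` are hypothetical names that are not used elsewhere in this notebook, and the accuracies reported below were not computed with them.

```python
import re
from decimal import Decimal, InvalidOperation

# A number: optional sign, digits with optional commas, optional decimal part.
NUMBER = re.compile(r'-?\d[\d,]*\.?\d*|-?\.\d+')

def last_number(text):
    """Return the last number in text as a Decimal, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    # Drop thousands separators and a sentence-ending period.
    token = matches[-1].replace(',', '').rstrip('.')
    try:
        return Decimal(token)
    except InvalidOperation:
        return None

def numbers_match(pred_str, gold_str):
    """Compare the last number of each string by value rather than by spelling."""
    pred, gold = last_number(pred_str), last_number(gold_str)
    return pred is not None and gold is not None and pred == gold
```

Comparing by value is a design choice: it accepts `3.0` for a gold answer of `3`, which the string comparison in `test_answer` rejects.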

In [33]:
prompt_q = prompt_complex + '\nQuestion: ' + gsm8k_test[1]['question'] + '\n'
print(prompt_q)

Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?
Let's think step by step
Angelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.
For the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.
Angelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.
However, they need to include time for breaks and lunch. Every hour they want 

In [34]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Follow the given examples and answer the question."},
        {"role": "user", "content": prompt_q},
    ],
)

In [35]:
print(response.choices[0].message.content)

Let's think step by step: 

1. One robe requires 2 bolts of blue fiber.
2. It takes half as much white fiber as blue fiber, which is 2 / 2 = 1 bolt of white fiber.
3. The total bolts required for one robe combine the blue and white fiber, which is 2 + 1 = 3 bolts.

The answer is 3.


## 1. Complex Prompt Random Sampling

In [36]:
i = 0
with open('outputs/test_gpt_4o_complex.txt', 'w') as fd:
    for q, a in tqdm(zip(gsm8k_test['question'], gsm8k_test['answer']),
                     total=len(gsm8k_test['question'])):

        prompt_q = prompt_complex + '\nQuestion: ' + q + '\n'

        # No temperature is passed, so the API default is used (sampling).
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Follow the given examples and answer the question."},
                {"role": "user", "content": prompt_q},
            ],
        )
        ans_model = response.choices[0].message.content
        ans_, residual = extract_ans(ans_model)

        fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (q, ans_, a))
        i += 1
        # if i == 1: break

  2%|▏         | 23/1319 [01:45<1:39:29,  4.61s/it]


KeyboardInterrupt: 

In [38]:
_, _, _ = parse_pred_ans('outputs/test_gpt_4o_complex.txt')

num_q 23 correct 20 ratio 0.8696
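
The run above stopped after 23 of 1319 questions, and reopening the output file in `'w'` mode would start again from the first question. A minimal sketch of a resumable loop follows, assuming records are written in dataset order and that no question contains a line starting with `Q: `. `count_finished`, `run_eval`, and the `ask` callable are hypothetical names introduced for illustration.

```python
import os

def count_finished(path):
    """Count records already written, one per line that starts with 'Q: '."""
    if not os.path.exists(path):
        return 0
    with open(path) as fd:
        return sum(1 for line in fd if line.startswith('Q: '))

def run_eval(questions, answers, ask, path, limit=None):
    """Append one record per question to path, skipping those already done.

    ask is a callable mapping a question string to the model's answer text;
    in this notebook it would wrap the prompt construction and the
    client.chat.completions.create call.
    """
    done = count_finished(path)
    todo = list(zip(questions, answers))[done:]
    if limit is not None:
        todo = todo[:limit]
    with open(path, 'a') as fd:  # append, so earlier records are kept
        for q, a in todo:
            ans_model = ask(q)
            fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (q, ans_model, a))
            fd.flush()  # keep the file current if the run is interrupted
    return done + len(todo)
```

The record layout matches what `parse_pred_ans` reads. Trimming the reply with `extract_ans` is left to the `ask` callable here, and `limit=1` would play the role of the `if i == 1: break` smoke test.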


## 2. Complex Prompt Greedy Decoding

In [14]:
i = 0
with open('outputs/test_gpt_4o_complex_temp_0.txt', 'w') as fd:
    for q, a in tqdm(zip(gsm8k_test['question'], gsm8k_test['answer']),
                     total=len(gsm8k_test['question'])):

        prompt_q = prompt_complex + '\nQuestion: ' + q + '\n'

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Follow the given examples and answer the question."},
                {"role": "user", "content": prompt_q},
            ],
            temperature=0,
        )
        ans_model = response.choices[0].message.content
        ans_, residual = extract_ans(ans_model)

        fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (q, ans_, a))
        i += 1
        if i == 1: break  # smoke test: stop after the first question

  0%|          | 0/1319 [00:01<?, ?it/s]


In [15]:
_, _, _ = parse_pred_ans('outputs/test_gpt_4o_complex_temp_0.txt')

num_q 1 correct 1 ratio 1.0000


## 3. Baseline Prompt Greedy Decoding

In [16]:
prompt_original = open('lib_prompt/prompt_original.txt').read()

In [17]:
i = 0
with open('outputs/test_gpt_4o_original_temp_0.txt', 'w') as fd:
    for q, a in tqdm(zip(gsm8k_test['question'], gsm8k_test['answer']),
                     total=len(gsm8k_test['question'])):

        prompt_q = prompt_original + '\nQuestion: ' + q + '\n'

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Follow the given examples and answer the question."},
                {"role": "user", "content": prompt_q},
            ],
            temperature=0,
        )
        ans_model = response.choices[0].message.content
        ans_, residual = extract_ans(ans_model)

        fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (q, ans_, a))
        i += 1
        if i == 1: break  # smoke test: stop after the first question

  0%|          | 0/1319 [00:02<?, ?it/s]


In [18]:
_, _, _ = parse_pred_ans('outputs/test_gpt_4o_original_temp_0.txt')

num_q 1 correct 1 ratio 1.0000


## 4. Baseline Prompt, Dialog In-Context Learning


In [19]:
def make_dialog_prompt(prompt):
    messages = []
    messages.append({"role": "system", "content": "Follow the given examples and answer the question."})
    cases = prompt.split("\n\n")
    for c in cases[:-1]:
        question = c.split("\n")[:2]
        messages.append({"role": "user", "content": "\n".join(question)})
        answer = c.split("\n")[2:]
        messages.append({"role": "assistant", "content": "\n".join(answer)})
    messages.append({"role": "user", "content": cases[-1] + "Let's think step by step"})
    return messages
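
To make the resulting message layout concrete, the sketch below runs the same splitting logic on a made-up prompt with one worked example followed by a new question. The toy text is invented for illustration, and the function is restated only so the block runs on its own.

```python
def make_dialog_prompt(prompt):
    # Restated from the cell above so this block is self-contained.
    messages = [{"role": "system",
                 "content": "Follow the given examples and answer the question."}]
    cases = prompt.split("\n\n")
    for c in cases[:-1]:
        lines = c.split("\n")
        # First two lines (question + trigger phrase) form the user turn.
        messages.append({"role": "user", "content": "\n".join(lines[:2])})
        # The remaining lines (the worked solution) form the assistant turn.
        messages.append({"role": "assistant", "content": "\n".join(lines[2:])})
    messages.append({"role": "user", "content": cases[-1] + "Let's think step by step"})
    return messages

# Invented toy prompt: one worked example, a blank line, then the new question.
toy_prompt = (
    "Question: What is 1 + 1?\n"
    "Let's think step by step\n"
    "1 + 1 = 2.\n"
    "The answer is 2\n"
    "\n"
    "Question: What is 2 + 3?\n"
)

toy_messages = make_dialog_prompt(toy_prompt)
for m in toy_messages:
    print(m["role"], "|", repr(m["content"]))
```

The layout relies on examples being separated by exactly one blank line and on the final question ending with a newline. A prompt file with different spacing would split differently, so it is worth printing the messages once before a full run.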

In [20]:
i = 0
with open('outputs/test_gpt_4o_original_dialog_icl.txt', 'w') as fd:
    for q, a in tqdm(zip(gsm8k_test['question'], gsm8k_test['answer']),
                     total=len(gsm8k_test['question'])):

        prompt_q = prompt_original + '\nQuestion: ' + q + '\n'
        dialog_prompt = make_dialog_prompt(prompt_q)

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=dialog_prompt,
            temperature=0,
        )
        ans_model = response.choices[0].message.content
        ans_, residual = extract_ans(ans_model)

        fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (q, ans_, a))
        i += 1
        if i == 1: break  # smoke test: stop after the first question

  0%|          | 0/1319 [00:03<?, ?it/s]


In [21]:
_, _, _ = parse_pred_ans('outputs/test_gpt_4o_original_dialog_icl.txt')

num_q 1 correct 1 ratio 1.0000


## 5. Complex Prompt, Dialog In-Context Learning

In [22]:
i = 0
with open('outputs/test_gpt_4o_complex_dialog_icl.txt', 'w') as fd:
    for q, a in tqdm(zip(gsm8k_test['question'], gsm8k_test['answer']),
                     total=len(gsm8k_test['question'])):

        prompt_q = prompt_complex + '\nQuestion: ' + q + '\n'
        dialog_prompt = make_dialog_prompt(prompt_q)

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=dialog_prompt,
            temperature=0,
        )
        ans_model = response.choices[0].message.content
        ans_, residual = extract_ans(ans_model)

        fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (q, ans_, a))
        i += 1
        if i == 1: break  # smoke test: stop after the first question

  0%|          | 0/1319 [00:02<?, ?it/s]


In [23]:
_, _, _ = parse_pred_ans('outputs/test_gpt_4o_complex_dialog_icl.txt')

num_q 1 correct 1 ratio 1.0000
