# GPT-3.5-Turbo Performance on MMLU - Abstract Algebra

In [2]:
import json

import numpy as np

from tqdm import tqdm
from datasets import load_dataset
from tenacity import retry, stop_after_attempt, wait_chain, wait_fixed
from utils import extract_answer_by_LLM, extract_answer_by_pattern, extract_answer_by_pattern_zeroshot_fewshot_cot



In [3]:
import os

from dotenv import load_dotenv
from openai import OpenAI

# Read OPENAI_API_KEY from a local .env file into the environment.
load_dotenv()
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
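
The imports above bring in tenacity's retry helpers, and the commented-out cells further down call a `completion_with_backoff` helper from `utils` that is not shown in this notebook. As a rough, dependency-free illustration of the retry-with-waits idea (this is neither the tenacity API nor the author's helper; `retry_with_waits` and `WAITS` are hypothetical names), a wrapper could look like this:

```python
import time
from functools import wraps


def retry_with_waits(waits, sleep=time.sleep):
    """Retry the wrapped function, sleeping waits[i] seconds after the i-th failure.

    The call is attempted len(waits) + 1 times; the last failure is re-raised.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for wait in waits:
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    sleep(wait)  # back off before the next attempt
            return fn(*args, **kwargs)  # final attempt: let any error propagate
        return wrapper
    return decorator


# Hypothetical schedule: three short waits, then two longer ones.
WAITS = [3, 3, 3, 5, 5]
```

Wrapping the API call in such a decorator keeps a transient rate-limit or network error from aborting a 100-question run. The `sleep` parameter is injectable so the schedule can be checked without actually waiting.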

In [4]:
with open('MMLU/lib_prompt/mmlu-cot.json') as f:
    mmlu_prompt = json.load(f)

In [8]:
mmlu_prompt.keys()

dict_keys(['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy

In [98]:
print(mmlu_prompt['abstract_algebra'])

The following are multiple choice questions (with answers) about abstract algebra.

Q: Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.
(A) True, True (B) False, False (C) True, False (D) False, True
A: Let's think step by step. A cyclic group is a group that is generated by a single element. Hence a subgroup generated by a single element of a group is cyclic and Statement 1 is True. The answer is (C).

Q: The symmetric group $S_n$ has $\factorial{n}$ elements, hence it is not true that $S_{10}$ has 10 elements.
Find the characteristic of the ring 2Z.
(A) 0 (B) 3 (C) 12 (D) 30
A: Let's think step by step. A characteristic of a ring is R is $n$ if the statement $ka = 0$ for all $a\in 2Z$ implies that $k$ is a multiple of $n$. Assume that $ka = 0$ for all $a\in 2Z$ for some $k$. In particular $2k = 0$. Hence $k=0$ and $n=0$. The answer is (A).

Q: Statement 1| Every function from a finite set onto itse

In [99]:
print(mmlu_prompt['high_school_statistics'])

The following are multiple choice questions (with answers) about high school statistics.

Q: A new smartwatch is manufactured in one part of a factory, then secured for shipping in another, independent part of the factory. The weight of the smartwatch has a mean of 62 grams and a standard deviation of 1.0 grams. The weight of the packaging (box, user's guide, bubble wrap, etc.) has a mean of 456 grams and a standard deviation of 6 grams. Together, the distribution of the weight of the smartwatch and its packaging would have the following mean and standard deviation:
(A) Mean 518 grams; standard deviation 7.0 grams (B) Mean 518 grams; standard deviation 3.5 grams (C) Mean 518 grams; standard deviation 6.1 grams (D) Mean 394 grams; standard deviation 6.1 grams
A: Let's think step by step. Since the weight of the watch and the weight of the packaging are independent random variables, the mean and variance of their sum is equal to the sum of their individual means and variances. So the mea
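
The few-shot blocks printed above are meant to be followed by the test question in the same `Q / choices / A:` layout. The commented-out loop later in this notebook builds that string by hand. A minimal sketch of the same construction follows, assuming question dicts with the keys `input`, `A`, `B`, `C`, `D` as in `lukaemon/mmlu`. The function name `build_few_shot_prompt` is hypothetical.

```python
def build_few_shot_prompt(few_shot_examples, question):
    """Append one test question to a block of worked examples."""
    q = question["input"] + "\n"
    # Render the four choices on one line: (A) ... (B) ... (C) ... (D) ...
    q += " ".join(f"({letter}) {question[letter]}" for letter in "ABCD")
    # End with the same cue that opens every worked answer in the examples.
    q += "\nA: Let's think step by step."
    return few_shot_examples + "\n\n" + q
```

For example, `build_few_shot_prompt(mmlu_prompt['abstract_algebra'], abstract_algebra['test'][0])` would give the full few-shot prompt for the first test question.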

In [5]:
abstract_algebra = load_dataset("lukaemon/mmlu", "abstract_algebra")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


# zero-shot cot

In [65]:
# prompt_template_with_choices_zero_shot_cot = """
# Here is a math question: "{input}"
# Correct answer is among: (A): {A}, (B): {B}, (C): {C}, (D): {D}.
# 1. Let's solve the question step by step, print out each step. Pay attention to make use of information in both question and choices.
# 2. Compare answer against the choices (A): {A}, (B): {B}, (C): {C}, (D): {D}, and decide which choice is selected. If answer matches a choice, select the choice i.e. one of "(A)", "(B)", "(C)" and "(D)" as final result; if answer doesn't match any choice, the answer is not correct, and final result is "(None)".
# 3. print out final result, must in format "the answer is _final_result_" in the last line where _final_result_ is one of "(A)", "(B)", "(C)", "(D)" and "(None)", without any other text. 
# """
# q_ = abstract_algebra['test'][0]
# prompt_question = prompt_template_with_choices_zero_shot_cot.format(**q_)
# prompt_question  

'\nHere is a math question: "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q."\nCorrect answer is among: (A): 0, (B): 4, (C): 2, (D): 6.\n1. Let\'s solve the question step by step, print out each step. Pay attention to make use of information in both question and choices.\n2. Compare answer against the choices (A): 0, (B): 4, (C): 2, (D): 6, and decide which choice is selected. If answer matches a choice, select the choice i.e. one of "(A)", "(B)", "(C)" and "(D)" as final result; if answer doesn\'t match any choice, the answer is not correct, and final result is "(None)".\n3. print out final result, must in format "the answer is _final_result_" in the last line where _final_result_ is one of "(A)", "(B)", "(C)", "(D)" and "(None)", without any other text. \n'

In [86]:
# from utils import completion_with_backoff, extract_answer_by_LLM, extract_answer_by_pattern, extract_answer_by_pattern_fewshot_cot


# task = 'abstract_algebra'

# prompt_template_with_choices_zero_shot_cot = """
# Here is a math question: "{input}"
# Correct answer is among: (A): {A}, (B): {B}, (C): {C}, (D): {D}.
# 1. Let's solve the question step by step, print out each step. Pay attention to make use of information in both question and choices.
# 2. Compare answer against the choices (A): {A}, (B): {B}, (C): {C}, (D): {D}, and decide which choice is selected. If answer matches a choice, select the choice i.e. one of "(A)", "(B)", "(C)" and "(D)" as final result; if answer doesn't match any choice, the answer is not correct, and final result is "(None)".
# 3. print out final result, must in format "the answer is _final_result_" in the last line where _final_result_ is one of "(A)", "(B)", "(C)", "(D)" and "(None)", without any other text. 
# """

# i = 0
# with open('zero_shot_cot_outputs/test_gpt_3.5_turbo_%s.txt' % task, 'w', encoding='utf-8') as fd:
#     for q_ in tqdm(abstract_algebra['test'], total=len(abstract_algebra['test'])):
#         # q = q_['input'] + '\n'
#         # for letter in ['A', 'B', 'C', 'D']:
#         #     q += '(' + letter + ') ' + q_[letter] + ' '
#         # q += "\nA: Let's think step by step."  
            
#         # prompt_q = mmlu_prompt[task] + "\n\n" + q
#         prompt_question = prompt_template_with_choices_zero_shot_cot.format(**q_)
#         # print(prompt_q)
#         response_content = completion_with_backoff(
#               model="gpt-3.5-turbo",
#               messages=[
#                     # {"role": "system", "content": "Follow the given examples and answer the question."},
#                     {"role": "user", "content": prompt_question},
#                 ]
#             )
#         # ans_model = response['choices'][0]['message']['content']
#         ans_model = response_content.choices[0].message.content
#         # print(ans_model)
#         ans_, residual = extract_ans(ans_model)
            
#         a = q_['target']
#         # fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (q, ans_, a))
#         fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (prompt_question, ans_, a))
#         i += 1
#         # if(i == 2): break

100%|██████████| 100/100 [06:30<00:00,  3.91s/it]


In [19]:
# abstract_algebra['test'][0]

{'input': 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.',
 'A': '0',
 'B': '4',
 'C': '2',
 'D': '6',
 'target': 'B'}

In [87]:
# questions, ans_pred, ans_gold, marks  = parse_pred_ans('zero_shot_cot_outputs/test_gpt_3.5_turbo_%s.txt' % task)

num_q 100 correct 47 ratio 0.4700


In [81]:
# questions

['Q: \nHere is a math question: "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q."\nCorrect answer is among: (A): 0, (B): 4, (C): 2, (D): 6.\n1. Let\'s solve the question step by step, print out each step. Pay attention to make use of information in both question and choices.\n2. Compare answer against the choices (A): 0, (B): 4, (C): 2, (D): 6, and decide which choice is selected. If answer matches a choice, select the choice i.e. one of "(A)", "(B)", "(C)" and "(D)" as final result; if answer doesn\'t match any choice, the answer is not correct, and final result is "(None)".\n3. print out final result, must in format "the answer is _final_result_" in the last line where _final_result_ is one of "(A)", "(B)", "(C)", "(D)" and "(None)", without any other text. \n\n',
 'Q: \nHere is a math question: "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5."\nCorrect answer is among: (A): 8, (B): 2, (C): 24, (D): 120.\n1. Let\'s solve the quest

In [82]:
# ans_model

'1. p = (1, 2, 5, 4)(2, 3) means p(1) = 2, p(2) = 3, p(3) = 3, p(4) = 5, p(5) = 4.\n2. To find the index of <p> in S_5, we need to find the smallest positive integer n such that p^n = (1) in S_5.\n3. We can calculate p^2 as follows: p^2 = (2, 3)(1, 2, 5, 4) = (1, 3, 5)(2, 4).\n4. We see that p^2 is not equal to (1), so we continue to calculate p^3: p^3 = (1, 3, 5)(2, 4)(1, 2, 5, 4) = (3, 5)(1, 4, 2).\n5. We see that p^3 is not equal to (1), so we continue to calculate p^4: p^4 = (3, 5)(1, 4, 2)(1, 2, 5, 4) = (1, 5, 3)(4, 2).\n6. We see that p^4 is equal to (1), so the index of <p> in S_5 is 4.\nthe answer is (B)'

In [83]:
# ans_pred

['A_model:\n1. Since Q(sqrt(2), sqrt(3), sqrt(18)) = Q(sqrt(2), sqrt(3)), we need to find the degree of the field extension Q(sqrt(2), sqrt(3)) over Q.\n\n2. The degree of the field extension Q(alpha) over Q is equal to the degree of the minimal polynomial of alpha over Q. \n\n3. The minimal polynomial of sqrt(2) over Q is x^2 - 2, which has degree 2. \n   The minimal polynomial of sqrt(3) over Q(sqrt(2)) is x^2 - 3, which also has degree 2. \n\n4. Therefore, the degree of the field extension Q(sqrt(2), sqrt(3)) over Q is 2*2 = 4.\n\n5. Finally, the answer is (B): 4\n',
 'A_model:\n1. p = (1, 2, 5, 4)(2, 3) means p(1) = 2, p(2) = 3, p(3) = 3, p(4) = 5, p(5) = 4.\n2. To find the index of <p> in S_5, we need to find the smallest positive integer n such that p^n = (1) in S_5.\n3. We can calculate p^2 as follows: p^2 = (2, 3)(1, 2, 5, 4) = (1, 3, 5)(2, 4).\n4. We see that p^2 is not equal to (1), so we continue to calculate p^3: p^3 = (1, 3, 5)(2, 4)(1, 2, 5, 4) = (3, 5)(1, 4, 2).\n5. We s

# KE

In [98]:
# task = 'abstract_algebra'

# prompt_template_with_choices = """
# Here is a math question: "{input}"
# Correct answer is among: A: {A}, B: {B}, C: {C}, D: {D}.
# Let's analyze the question from the following angles, print out each rationals in each step:
# 1. Read question and choices carefully.
# 2. According to math education syllabus, what category does the question belong to?
# 3. What domain specific problem solving skills and knowledge are commonly used to solve questions of the category?
# 4. Select the most suitable method to solve the question.
# 5. Solve the question step by step, pay attention to make use of information in both question and choices. 
# 6. Compare answer against the choices (A): {A}, (B): {B}, (C): {C}, (D): {D}, and decide which choice is selected. If answer matches a choice, select the choice i.e. one of "(A)", "(B)", "(C)" and "(D)" as final result; if answer doesn't match any choice, the answer is not correct, and final result is "(None)".
# 7. print out final result, must in format "the answer is _final_result_" in the last line where _final_result_ is one of "(A)", "(B)", "(C)", "(D)" and "(None)", without any other text. 
# """

# # prompt_template_with_choices = """
# # Here is a math question: "{input}"
# # Correct answer is among: A: {A}, B: {B}, C: {C}, D: {D}.
# # Let's analyze the question from the following angles, print out each rationals in each step:
# # 1. Read question and choices carefully.
# # 2. According to math education syllabus, what category does the question belong to?
# # 3. What domain specific problem solving skills and knowledge are commonly used to solve questions of the category?
# # 4. Select the most suitable method to solve the question.
# # 5. Solve the question step by step, pay attention to make use of information in both question and choices. 
# # 6. print out final result, must in format "the answer is _final_result_" in the last line where _final_result_ is one of "(A)", "(B)", "(C)", "(D)" and "(None)", without any other text. 
# # """

# i = 0
# with open('ke_outputs/test_gpt_3.5_turbo_%s.txt' % task, 'w', encoding='utf-8') as fd:
#     for q_ in tqdm(abstract_algebra['test'], total=len(abstract_algebra['test'])):
#         # q = q_['input'] + '\n'
#         # for letter in ['A', 'B', 'C', 'D']:
#         #     q += '(' + letter + ') ' + q_[letter] + ' '
#         # q += "\nA: Let's think step by step."  
            
#         # prompt_q = mmlu_prompt[task] + "\n\n" + q
#         prompt_question = prompt_template_with_choices.format(**q_)
#         # print(prompt_q)
#         response_content = completion_with_backoff(
#               model="gpt-3.5-turbo",
#               messages=[
#                     # {"role": "system", "content": "Follow the given examples and answer the question."},
#                     {"role": "user", "content": prompt_question},
#                 ]
#             )
        
#         ans_model = response_content.choices[0].message.content
#         # print(ans_model)
#         ans_, residual = extract_ans(ans_model)
            
#         a = q_['target']
#         # fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (q, ans_, a))
#         fd.write('Q: %s\nA_model:\n%s\nA:\n%s\n\n' % (prompt_question, ans_, a))
#         i += 1
#         # if(i == 2): break



100%|██████████| 100/100 [07:36<00:00,  4.57s/it]


In [99]:
# questions, ans_pred, ans_gold, marks  = parse_pred_ans('ke_outputs/test_gpt_3.5_turbo_%s.txt' % task)

num_q 100 correct 29 ratio 0.2900


In [97]:
# marks

[0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1]

In [81]:
print(prompt_template_zero_shot_cot.format(**abstract_algebra['test'][0]))


Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
(A) 0 (B) 4 (C) 2 (D) 6
A: Let's think step by step. Print out each step, and the final result in format "the answer is (final_result)".
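
The `prompts` module is not included in this notebook, so the exact definition of `prompt_template_zero_shot_cot` is unknown. Judging only from the printed output above, a template along these lines would reproduce it. The constant name `ZERO_SHOT_COT_TEMPLATE` is a hypothetical stand-in.

```python
# Reconstructed from the printed prompt; not the actual contents of prompts.py.
ZERO_SHOT_COT_TEMPLATE = """
{input}
(A) {A} (B) {B} (C) {C} (D) {D}
A: Let's think step by step. Print out each step, and the final result in format "the answer is (final_result)".
"""


def render_zero_shot_prompt(question):
    """Fill the template from a question dict; extra keys such as 'target' are ignored."""
    return ZERO_SHOT_COT_TEMPLATE.format(**question)
```

`str.format` silently ignores keyword arguments the template does not use, which is why passing the whole dataset row works even though it also contains the gold label.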



In [13]:
# force reload the module
import importlib
import utils
import prompts
importlib.reload(utils)
importlib.reload(prompts)
from utils import calculate_accuracy_one_sheet, get_datetime, get_timestampe, mark_answer_sheet, solve_multiple_questions, extract_answer_by_LLM, extract_answer_by_pattern, extract_answer_by_pattern_zeroshot_fewshot_cot
from prompts import prompt_template_ke_with_choices_20240406, prompt_template_few_shot_cot, prompt_template_zero_shot_cot, prompt_template_zero_shot_cot_forget_training_data



task_name = "abstract_algebra"
# model_name = "gpt-3.5-turbo"
# model_name = "gpt-4-0613"
# model_config = None 
# q_list = abstract_algebra['test'].select(range(2))
# q_list = abstract_algebra['test']



def run_on_task(task_name, q_list, model_name, model_config, log_file_dir, log_file_name_prefix, prompt_template, answer_extract_fn):
    """Run one prompt configuration on a task, score it, and save a JSON log."""
    experiment_log = {'metadata': {'model_name': model_name,
                                   'model_config': model_config,
                                   'task': task_name,
                                   'experiment_time': get_datetime()}}

    # Collect model responses for every question in q_list.
    response_list = solve_multiple_questions(q_list=q_list,
                                             prompt_template=prompt_template,
                                             model_name=model_name,
                                             model_config=model_config)
    print(f"response_list: {len(response_list)}")

    # Extract an answer from each response and score the sheet.
    response_list = mark_answer_sheet(response_list=response_list, answer_extract_fn=answer_extract_fn)
    accuracy = calculate_accuracy_one_sheet(response_list)
    print(f"accuracy: {accuracy}")

    experiment_log['response_list'] = response_list
    experiment_log['accuracy'] = accuracy

    # Create the log directory if needed, then write a timestamped log file.
    os.makedirs(log_file_dir, exist_ok=True)
    log_file_name = os.path.join(log_file_dir, f'{log_file_name_prefix}_{get_timestampe()}.json')
    with open(log_file_name, 'w') as f:
        json.dump(experiment_log, f, indent=4)



ke_zeroshot_config = {
    'model_name': 'gpt-3.5-turbo',
    'model_config': None,
    'log_file_dir': './outputs/ke_zeroshot_cot_sc',
    'log_file_name_prefix': 'test_task',
    'prompt_template': prompt_template_ke_with_choices_20240406,
    'answer_extract_fn': [extract_answer_by_pattern, extract_answer_by_LLM],
}

few_shot_config = {
    'model_name': 'gpt-3.5-turbo',
    'model_config': None,
    'log_file_dir': './outputs/few_shot_cot',
    'log_file_name_prefix': 'test_task',
    'prompt_template': prompt_template_few_shot_cot,
    'answer_extract_fn': [extract_answer_by_pattern_zeroshot_fewshot_cot, extract_answer_by_LLM],
}

zero_shot_config = {
    'model_name': 'gpt-3.5-turbo',
    'model_config': None,
    'log_file_dir': './outputs/zero_shot_cot',
    'log_file_name_prefix': 'test_task',
    'prompt_template': prompt_template_zero_shot_cot,
    'answer_extract_fn': [extract_answer_by_pattern_zeroshot_fewshot_cot, extract_answer_by_LLM],
}

zero_shot_forget_training_data_config = {
    'model_name': 'gpt-3.5-turbo',
    'model_config': None,
    'log_file_dir': './outputs/zero_shot_cot_forget_training_data',
    'log_file_name_prefix': 'test_task',
    'prompt_template': prompt_template_zero_shot_cot_forget_training_data,
    'answer_extract_fn': [extract_answer_by_pattern_zeroshot_fewshot_cot, extract_answer_by_LLM],
}

# zero_shot_few_shot_cot_different_category_config = {
#     'model_name': 'gpt-3.5-turbo',
#     'model_config': None,
#     'log_file_dir': './outputs/zero_shot_cot_different_category',
#     'log_file_name_prefix': 'test_task',
#     'prompt_template': prompt_template_few_shot_cot_different_category,
#     'answer_extract_fn': [extract_answer_by_pattern_zeroshot_fewshot_cot, extract_answer_by_LLM],
# }







In [15]:
for task_name in [
    # 'abstract_algebra',
    'college_mathematics',
    'college_physics',
]:
    print(f"task_name: {task_name}")
    q_list = load_dataset("lukaemon/mmlu", task_name)
    q_list = q_list['test']
    for _ in range(2):
        for run_config in [few_shot_config, zero_shot_forget_training_data_config, zero_shot_config]:
            prompt_template = run_config['prompt_template']
            if callable(prompt_template):
                prompt_template = prompt_template(task_name=task_name)
            log_file_dir = os.path.join(run_config['log_file_dir'], task_name)
            run_on_task(task_name=task_name,
                        q_list=q_list,
                        model_name=run_config['model_name'],
                        model_config=run_config['model_config'],
                        log_file_dir=log_file_dir,
                        log_file_name_prefix=run_config['log_file_name_prefix'],
                        prompt_template=prompt_template,
                        answer_extract_fn=run_config['answer_extract_fn'])


task_name: college_mathematics


100%|██████████| 100/100 [05:18<00:00,  3.18s/it]


response_list: 100
accuracy: 0.48


100%|██████████| 100/100 [02:15<00:00,  1.36s/it]


response_list: 100
accuracy: 0.39


100%|██████████| 100/100 [03:44<00:00,  2.25s/it]


response_list: 100
accuracy: 0.46


100%|██████████| 100/100 [06:37<00:00,  3.97s/it]


response_list: 100
accuracy: 0.4


100%|██████████| 100/100 [02:32<00:00,  1.52s/it]


response_list: 100
accuracy: 0.4


100%|██████████| 100/100 [05:12<00:00,  3.12s/it]


response_list: 100
accuracy: 0.41
task_name: college_physics


100%|██████████| 102/102 [04:07<00:00,  2.43s/it]


response_list: 102
accuracy: 0.5980392156862745


100%|██████████| 102/102 [02:19<00:00,  1.37s/it]


response_list: 102
accuracy: 0.38235294117647056


100%|██████████| 102/102 [03:47<00:00,  2.23s/it]


response_list: 102
accuracy: 0.5


100%|██████████| 102/102 [04:14<00:00,  2.50s/it]


response_list: 102
accuracy: 0.6176470588235294


100%|██████████| 102/102 [02:32<00:00,  1.49s/it]


response_list: 102
accuracy: 0.4117647058823529


100%|██████████| 102/102 [03:52<00:00,  2.28s/it]


response_list: 102
accuracy: 0.5294117647058824
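
Each call to `run_on_task` writes a JSON log with a top-level `accuracy` field into `<log_file_dir>/<task_name>/`, so the repeated runs above can be summarized from disk instead of being read off the printed output. A minimal sketch, assuming only that directory layout and that field (the function name `summarize_runs` is hypothetical):

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def summarize_runs(root_dir):
    """Average the 'accuracy' field of every JSON log, grouped by parent directory.

    Returns {relative_directory: (number_of_runs, mean_accuracy)}.
    """
    per_dir = defaultdict(list)
    for path in sorted(Path(root_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            log = json.load(f)
        per_dir[str(path.parent.relative_to(root_dir))].append(log["accuracy"])
    return {name: (len(accs), mean(accs)) for name, accs in per_dir.items()}
```

Calling `summarize_runs('./outputs')` would then give one entry per configuration and task. With only two runs per configuration and about 100 questions per task, differences of a few points between configurations should be read with caution.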


In [11]:
task_name = 'college_mathematics'
run_config = few_shot_config
prompt_template = run_config['prompt_template']
if callable(prompt_template):
    prompt_template = prompt_template(task_name=task_name)
prompt_template

"The following are multiple choice questions (with answers) about college mathematics.\n\nQ: Let V be the set of all real polynomials p(x). Let transformations T, S be defined on V by T:p(x) -> xp(x) and S:p(x) -> p'(x) = d/dx p(x), and interpret (ST)(p(x)) as S(T(p(x))). Which of the following is true?\n(A) ST = 0 (B) ST = T (C) ST = TS (D) ST - TS is the identity map of V onto itself.\nA: Let's think step by step. For a given polynomial $p$ we have\n\\[ST(p) = (xp(x))’ = p(x) + xp’(x)\\]\nand\n\\[TS(p) = xp’(x).\\]\nHence \\[ST(p) - TS(p) = p(x) + xp’(x) - xp’(x).\\] The answer is (D).\n\nQ: Suppose that f(1 + x) = f(x) for all real x. If f is a polynomial and f(5) = 11, then f(15/2)\n(A) -11 (B) 0 (C) 11 (D) 33/2\nA: Let's think step by step. The only polynomial so that $f(1 + x) = f(x)$ is a constant polynomial. Hence $f(5) = 11 = f(15/2)$. The answer is (C).\n\nQ: Let A be a real 2x2 matrix. Which of the following statements must be true?\nI. All of the entries of A^2 are nonnegat



{0: 'B',
 1: 'C',
 2: 'D',
 3: 'B',
 4: 'B',
 5: 'A',
 6: 'A',
 7: 'D',
 8: 'B',
 9: 'C',
 10: 'C',
 11: 'C',
 12: 'A',
 13: 'C',
 14: 'C',
 15: 'B',
 16: 'C',
 17: 'C',
 18: 'D',
 19: 'A',
 20: 'A',
 21: 'A',
 22: 'D',
 23: 'D',
 24: 'B',
 25: 'C',
 26: 'C',
 27: 'B',
 28: 'D',
 29: 'A',
 30: 'B',
 31: 'B',
 32: 'A',
 33: 'C',
 34: 'A',
 35: 'B',
 36: 'D',
 37: 'B',
 38: 'C',
 39: 'D',
 40: 'C',
 41: 'A',
 42: 'B',
 43: 'C',
 44: 'C',
 45: 'C',
 46: 'B',
 47: 'B',
 48: 'C',
 49: 'B',
 50: 'D',
 51: 'B',
 52: 'A',
 53: 'C',
 54: 'B',
 55: 'A',
 56: 'D',
 57: 'B',
 58: 'A',
 59: 'A',
 60: 'D',
 61: 'D',
 62: 'C',
 63: 'B',
 64: 'B',
 65: 'C',
 66: 'D',
 67: 'C',
 68: 'A',
 69: 'A',
 70: 'D',
 71: 'C',
 72: 'D',
 73: 'B',
 74: 'D',
 75: 'B',
 76: 'C',
 77: 'A',
 78: 'D',
 79: 'C',
 80: 'B',
 81: 'D',
 82: 'A',
 83: 'C',
 84: 'D',
 85: 'C',
 86: 'A',
 87: 'A',
 88: 'A',
 89: 'C',
 90: 'D',
 91: 'B',
 92: 'B',
 93: 'A',
 94: 'D',
 95: 'C',
 96: 'B',
 97: 'C',
 98: 'C',
 99: 'C'}

In [17]:
from utils import get_gold_labels, calculate_accuracy_sc, majority_vote
task_name_list = ['abstract_algebra', 'college_mathematics',
                  'college_physics',]


for task_name in task_name_list:
    print(f'Task: {task_name}')
    gold_labels = get_gold_labels(task_name=task_name)

    zero_shot_cot_sc_mark, question_votes = majority_vote(f'outputs/zero_shot_cot/{task_name}', filter={'task': task_name, 'model_name': 'gpt-3.5-turbo'})
    print(calculate_accuracy_sc(gold_labels=gold_labels, predicted_labels=zero_shot_cot_sc_mark))

    zero_shot_forget_training_data_sc_mark, question_votes = majority_vote(f'outputs/zero_shot_cot_forget_training_data/{task_name}', filter={'task': task_name, 'model_name': 'gpt-3.5-turbo'})
    print(calculate_accuracy_sc(gold_labels=gold_labels, predicted_labels=zero_shot_forget_training_data_sc_mark))

    few_shot_cot_sc_mark, question_votes = majority_vote(f'outputs/few_shot_cot/{task_name}', filter={'task': task_name, 'model_name': 'gpt-3.5-turbo'})  
    print(calculate_accuracy_sc(gold_labels=gold_labels, predicted_labels=few_shot_cot_sc_mark))  

Task: abstract_algebra




0.38
0.44
0.51
Task: college_mathematics
0.4
0.38


IndexError: list index out of range

In [30]:
question_votes

{0: Counter({'D': 19, 'B': 3, 'C': 1}),
 1: Counter({'C': 15, 'D': 7, 'A': 2}),
 2: Counter({'D': 25}),
 3: Counter({'C': 16, 'A': 7}),
 4: Counter({'C': 15, 'B': 6}),
 5: Counter({'C': 21, 'B': 3}),
 6: Counter({'C': 22, 'A': 2}),
 7: Counter({'C': 24}),
 8: Counter({'B': 22, 'D': 1, 'C': 1}),
 9: Counter({'C': 13, 'D': 9}),
 10: Counter({'C': 21, 'A': 3}),
 11: Counter({'A': 14, 'B': 9, 'C': 1}),
 12: Counter({'A': 14, 'C': 5, 'D': 5}),
 13: Counter({'D': 10, 'B': 8, 'C': 2}),
 14: Counter({'A': 17, 'C': 4}),
 15: Counter({'C': 23, 'A': 1}),
 16: Counter({'C': 20}),
 17: Counter({'A': 14, 'C': 6}),
 18: Counter({'D': 22, 'C': 2}),
 19: Counter({'C': 19, 'D': 5}),
 20: Counter({'C': 22, 'A': 2}),
 21: Counter({'C': 9, 'A': 6, 'D': 4, 'B': 3}),
 22: Counter({'C': 22, 'A': 2}),
 23: Counter({'C': 11, 'D': 10, 'B': 3}),
 24: Counter({'A': 16, 'C': 8}),
 25: Counter({'A': 23, 'C': 1}),
 26: Counter({'C': 24}),
 27: Counter({'D': 18, 'B': 3, 'A': 3}),
 28: Counter({'C': 12, 'A': 9, 'D': 2}
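
The `question_votes` mapping above holds one `Counter` of extracted answers per question. The `majority_vote` and `calculate_accuracy_sc` helpers in `utils` are not shown, so the following is only a sketch of how a self-consistency label and its accuracy could be derived from such a mapping. Both function names are hypothetical.

```python
from collections import Counter


def majority_vote_labels(question_votes):
    """Map each question id to its most common answer, or None if there are no votes."""
    return {
        qid: (votes.most_common(1)[0][0] if votes else None)
        for qid, votes in question_votes.items()
    }


def accuracy(gold_labels, predicted_labels):
    """Fraction of gold questions whose predicted label matches."""
    correct = sum(predicted_labels.get(qid) == gold for qid, gold in gold_labels.items())
    return correct / len(gold_labels)
```

The `if votes else None` guard matters: `Counter().most_common(1)` returns an empty list, so indexing it with `[0]` raises `IndexError`. That is one possible, unconfirmed explanation for the `IndexError` in the earlier cell. Ties are resolved in favor of the answer that was counted first.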

In [146]:
importlib.reload(utils)
from utils import calculate_accuracy_one_sheet, get_datetime, get_timestampe, mark_answer_sheet, solve_multiple_questions, extract_answer_by_pattern, extract_answer_by_pattern_zeroshot_cot, extract_answer_by_pattern_zeroshot_fewshot_cot, extract_answer_by_LLM, request_gpt

<function utils.extract_answer_by_LLM(response_content, choices)>

In [153]:
log_file_name = "./outputs/zero_shot_cot_forget_training_data/test_abstract_algebra_2024-04-07-14-29-24.json"
with open(log_file_name, 'r') as f:
    experiment_log = json.load(f)
response_list = experiment_log['response_list']

# Re-mark the saved responses with the pattern-based extractor.
answer_extract_fn = extract_answer_by_pattern_zeroshot_cot
print(f"response_list: {len(response_list)}")
response_list = mark_answer_sheet(response_list=response_list, answer_extract_fn=answer_extract_fn)
accuracy = calculate_accuracy_one_sheet(response_list)

print(f"accuracy: {accuracy}")

# Write the re-marked results back to the same log file.
experiment_log['response_list'] = response_list
experiment_log['accuracy'] = accuracy
with open(log_file_name, 'w') as f:
    json.dump(experiment_log, f, indent=4)

response_list: 100
accuracy: 0.42


In [117]:
importlib.reload(utils)
from utils import calculate_accuracy_one_sheet, get_datetime, get_timestampe, mark_answer_sheet, solve_multiple_questions, extract_answer_by_LLM, request_gpt

answer_extract_fn = extract_answer_by_LLM

response_list = mark_answer_sheet(response_list=response_list, answer_extract_fn=answer_extract_fn)
accuracy = calculate_accuracy_one_sheet(response_list)

print(f"accuracy: {accuracy}")

experiment_log['response_list'] = response_list
experiment_log['accuracy'] = accuracy
# with open(log_file_name, 'w') as f:
#     json.dump(experiment_log, f, indent=4)

accuracy: 0.42


In [135]:
import pprint
pprint.pprint(response_list)

[{'A': '0',
  'B': '4',
  'C': '2',
  'D': '6',
  'id': 0,
  'input': 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), '
           'sqrt(18)) over Q.',
  'llm_answer': None,
  'prompt': '\n'
            'Here is a math question: "Find the degree for the given field '
            'extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q."\n'
            'Correct answer is among: A: 0, B: 4, C: 2, D: 6.\n'
            "Let's analyze the question from the following angles, print out "
            'each rationals in each step:\n'
            '1. According to math education syllabus, what category does the '
            'question belong to?\n'
            '2. What domain specific knowledge and problem solving skills are '
            'suitable for solving this question?\n'
            '3. Solve the question step by step, make sure every step is '
            'mathematically solid. \n'
            '4. Select final result from A: 0, B: 4, C: 2, D: 6. If answer '
            'doesn\'t

In [13]:
response_content = response_list[75]['response']
choices = {k: response_list[75][k] for k in ['A', 'B', 'C', 'D']}
print(choices)
print(response_content)
llm_answer = extract_answer_by_LLM(response_content=response_content, choices=choices)
print(llm_answer)

{'A': 'True, True', 'B': 'False, False', 'C': 'True, False', 'D': 'False, True'}
Step 1: Simplify (ab)^{-2} = b^{-2}a^{-2}
(ab)^{-2} = (a^{-1}b^{-1})^2
(ab)^{-2} = a^{-2}b^{-2}

Step 2: Simplify (ab)^n = a^nb^n
This statement is not true in general. It is only true when n = 1.

The answer is (C) False, True
D
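
In the output above, the response ends with "(C) False, True", but choice C is "True, False" and choice D is "False, True". The letter and the text disagree, and the LLM-based extractor followed the text and returned D. A small check like the one below could flag such cases for manual review. It is a sketch, not part of `utils`, and the function name `check_answer_consistency` is hypothetical.

```python
import re


def check_answer_consistency(response, choices):
    """Compare the letter in 'answer is (X) <text>' with the choice whose text follows it.

    Returns (stated_letter, letters_whose_text_matches); (None, []) if no answer line is found.
    """
    m = re.search(r"answer is \(([ABCD])\)\s*(.*)", response, re.IGNORECASE)
    if not m:
        return None, []
    letter = m.group(1).upper()
    trailing = m.group(2).strip().rstrip(".")
    # Letters whose choice text equals the text written after the stated letter.
    by_text = [k for k, v in choices.items() if v == trailing]
    return letter, by_text
```

A response is suspicious whenever the second element is non-empty and does not contain the stated letter.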


In [14]:
import re

from utils import request_gpt
prompt_template = """ The question has choices:
    (A) {A} (B) {B} (C) {C} (D) {D}
    The provided answer is {response_content}. 
    Extract the correct choice, and return in format "answer is (A|B|C|D)". If the answer is not available, return "answer is None".
    """
prompt = prompt_template.format(response_content=response_content, **choices)
print(prompt)
model_name = "gpt-3.5-turbo"
model_config = {"temperature": 0.01, "top_p": 0.1}
extracted_choice = request_gpt(prompt=prompt, model_name=model_name, model_config=model_config)
match = re.search(r'answer is \(?(A|B|C|D)\)?', extracted_choice)

 The question has choices:
    (A) True, True (B) False, False (C) True, False (D) False, True
    The provided answer is Step 1: Simplify (ab)^{-2} = b^{-2}a^{-2}
(ab)^{-2} = (a^{-1}b^{-1})^2
(ab)^{-2} = a^{-2}b^{-2}

Step 2: Simplify (ab)^n = a^nb^n
This statement is not true in general. It is only true when n = 1.

The answer is (C) False, True. 
    Extract the correct choice, and return in format "answer is (A|B|C|D)". If the answer is not available, return "answer is None".
    


In [15]:
extracted_choice

'Answer is (D) False, True.'

In [16]:
match = re.search(r'answer is \(?(A|B|C|D)\)?', extracted_choice, re.IGNORECASE)
if match:
    print(match.group(1))

D
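
The two cells above show why the case-insensitive flag is needed: the model replied with "Answer is (D)" (capital A), which the first, case-sensitive search would not match. The same idea can be packaged as a small function. This is a sketch of a pattern-based extractor, not the `extract_answer_by_pattern*` helpers from `utils`, and it tightens the notebook's regex slightly so that only an upper-case choice letter is accepted.

```python
import re

# "answer is" matches in any case; the letter must be upper-case and must not
# be the first letter of a longer word (e.g. "answer is Deep ...").
ANSWER_RE = re.compile(r"(?i:answer is)\s*\(?([ABCD])\)?(?![A-Za-z])")


def extract_choice(text):
    """Return the letter from the last 'answer is (X)' in text, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice: chain-of-thought responses sometimes mention a tentative answer before the final line.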


In [118]:
import pprint
pprint.pprint(experiment_log)

{'accuracy': 0.42,
 'metadata': {'experiment_time': '2024-04-07 16:41:31',
              'model_config': None,
              'model_name': 'gpt-3.5-turbo',
              'task': 'abstract_algebra'},
 'response_list': [{'A': '0',
                    'B': '4',
                    'C': '2',
                    'D': '6',
                    'id': 0,
                    'input': 'Find the degree for the given field extension '
                             'Q(sqrt(2), sqrt(3), sqrt(18)) over Q.',
                    'llm_answer': 'D',
                    'prompt': '\n'
                              'Find the degree for the given field extension '
                              'Q(sqrt(2), sqrt(3), sqrt(18)) over Q.\n'
                              '(A) 0 (B) 4 (C) 2 (D) 6\n'
                              "A: Let's think step by step. Print out each "
                              'step, and the final result in format "the '
                              'answer is (final_result)".\n',
       

In [73]:
print(mmlu_prompt['abstract_algebra'])

The following are multiple choice questions (with answers) about abstract algebra.

Q: Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.
(A) True, True (B) False, False (C) True, False (D) False, True
A: Let's think step by step. A cyclic group is a group that is generated by a single element. Hence a subgroup generated by a single element of a group is cyclic and Statement 1 is True. The answer is (C).

Q: The symmetric group $S_n$ has $\factorial{n}$ elements, hence it is not true that $S_{10}$ has 10 elements.
Find the characteristic of the ring 2Z.
(A) 0 (B) 3 (C) 12 (D) 30
A: Let's think step by step. A characteristic of a ring is R is $n$ if the statement $ka = 0$ for all $a\in 2Z$ implies that $k$ is a multiple of $n$. Assume that $ka = 0$ for all $a\in 2Z$ for some $k$. In particular $2k = 0$. Hence $k=0$ and $n=0$. The answer is (A).

Q: Statement 1| Every function from a finite set onto itse

In [75]:
abstract_algebra['test'][13]

{'input': 'The polynomial x^3 + 2x^2 + 2x + 1 can be factored into linear factors in Z_7[x]. Find this factorization.',
 'A': '(x − 2)(x + 2)(x − 1)',
 'B': '(x + 1)(x + 4)(x − 2)',
 'C': '(x + 1)(x − 4)(x − 2)',
 'D': '(x - 1)(x − 4)(x − 2)',
 'target': 'C'}