# S&DS 617 Applied Machine Learning and Causal Inference Research Seminar: Assignment 2

**Deadline**

Assignment 2 is due Monday, March 24th at 1:30pm. Late work will not be accepted.


**Submission**

Submit your assignment as a .pdf on Gradescope. On Gradescope, there are 2 assignments, one where you will submit a pdf file and one where you will submit the corresponding .ipynb that generated it. 
Note: The problems in each homework assignment are numbered. When submitting the pdf on Gradescope, please select the correct pages that correspond to each problem. 

To produce the .pdf, do the following to preserve the cell structure of the notebook:
- Go to "File" at the top-left of your Jupyter Notebook
- Under "Download as", select "HTML (.html)"
- After the .html has downloaded, open it and then select "File" and "Print"
- From the print window, select the option to save as a .pdf

## Problem 1

In this exercise, we'll apply different prompting techniques to the GSM8K dataset [Link](https://github.com/openai/grade-school-math/tree/master/grade_school_math/data).

The GSM8K dataset is an OpenAI-curated collection of 8,500 grade school math problems designed to challenge and evaluate the mathematical reasoning capabilities of language models. It contains approximately 7,500 training and 1,000 test problems that require 2 to 8 steps to solve, using basic arithmetic operations. The dataset aims to diagnose current model limitations in multi-step reasoning and supports advancements in AI's understanding and processing of natural language math problems. It includes both standard problems and a Socratic format with guiding subquestions, along with calculation annotations to aid in accuracy, making it a valuable resource for AI research in natural language processing.
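To make the answer format concrete, here is a minimal sketch that parses one GSM8K-style answer string. The record is a hypothetical example written in the dataset's format (calculation annotations inside `<<...>>`, final answer after `####`), and `check_annotations` is an illustrative helper, not part of the provided code.

```python
import re

# Hypothetical record written in the GSM8K answer format (not taken from the dataset).
sample_answer = (
    "Each box holds 12 pencils, so 7 boxes hold 12*7=<<12*7=84>>84 pencils.\n"
    "Three students get 10 each, which is 3*10=<<3*10=30>>30 pencils.\n"
    "That leaves 84-30=<<84-30=54>>54 pencils.\n"
    "#### 54"
)

def check_annotations(answer):
    """Return (expression, stated_value, is_consistent) for each <<expr=value>> annotation."""
    checks = []
    for expr, stated in re.findall(r"<<([^=<>]+)=([^<>]+)>>", answer):
        # Only evaluate plain arithmetic: digits, whitespace, and + - * / . ( )
        if re.fullmatch(r"[\d\s+\-*/.()]+", expr):
            computed = eval(expr)  # restricted to the whitelisted characters above
            checks.append((expr, stated, abs(computed - float(stated)) < 1e-9))
    return checks

final_answer = sample_answer.split("####")[-1].strip()
annotation_checks = check_annotations(sample_answer)
print(final_answer)
print(annotation_checks)
```

Each tuple reports whether the arithmetic inside an annotation matches the value stated next to it.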

Below, we have provided helper functions to load the data. 

In [2]:
import requests
import tarfile
import os
import json
import re
import openai
import pandas as pd
from sklearn.metrics import accuracy_score
import ast  
import sys
import time
import concurrent.futures

# Function to load data from a URL
def load_data_from_url(url):
    response = requests.get(url)
    if response.status_code == 200:
        data = [json.loads(line) for line in response.iter_lines(decode_unicode=True)]
        df = pd.DataFrame(data)
        return df
    else:
        print(f"Failed to download the file: {response.status_code}")
        return None

# Function to get the true solution from JSON file 
def extract_solution(problem):
    """
    Extracts the final numeric solution from a problem dictionary with 'question' and 'answer' keys.
    The answer is expected to contain a '####' token followed by the final numeric solution.
    
    :param problem: A dictionary with 'question' and 'answer' keys.
    :return: The final numeric solution as a string.
    """
    # Split the answer into lines
    answer_lines = problem['answer'].split('\n')
    # Look for the line with the '####' token
    for line in answer_lines:
        if '####' in line:
            # Extract the numeric solution that follows the '####' token
            solution = line.split('####')[-1].strip()
            return solution
    # If no solution is found, return None
    return None

def get_answer(string):
    """
    Extracts the final numeric solution from a model response.
    The response is expected to contain a '####' token followed by the final numeric solution.

    :param string: the model's response text
    :return: The final numeric solution as a string with commas removed. If there is no '####'
             token, or no number follows it, the original string is returned unchanged.
    """
    if '####' in string:
        solution = string.split('####')[-1].strip()
        sol = re.search(r"[-+]?\d[\d,]*(\.\d+)?", solution)
        if sol:
            cleaned = sol.group().replace(',', '')
            return cleaned
    return string

### Load Data

In [3]:
# URLs for the train and test data
train_url = "https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/train.jsonl"
test_url = "https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl"

# Load the training data
df_train = load_data_from_url(train_url)
df_train = df_train.sample(n=1000, random_state=928)

# Load the test data
df_test = load_data_from_url(test_url)
df_test = df_test.sample(n=500, random_state=184)
labels = df_test.apply(extract_solution, axis = 1)
# Display the lengths of the datasets as a check
len(df_train), len(df_test)

(1000, 500)

In [3]:
sample_row = df_train.iloc[0,] 
print(sample_row)

question    Taegan goes to a carnival where she wins ticke...
answer      If tickets are valued at $3 then in total, Tae...
Name: 3103, dtype: object


In [4]:
sample_row['question'] # sample question

'Taegan goes to a carnival where she wins tickets from each of the 5 carnival games and also finds 5 tickets on the floor. Each ticket is worth $3. In total, she has tickets that total a value of $30. If Taegan won an equal number of tickets from each of the games, how many tickets did she win from each game?'

In [5]:
# Extract the solution
extract_solution(sample_row)

'1'

In [4]:

from dotenv import load_dotenv
# Load environment variables from the .env file
load_dotenv()

# Access the OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")

# Use the API key
if openai_api_key:
    print("OpenAI API Key loaded successfully!")
else:
    print("OpenAI API Key not found. Please check your .env file.")


OpenAI API Key loaded successfully!


In [5]:

# Set up the OpenAI client
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


a) Implement zero shot learning, few shot learning, and chain of thought prompting using gpt-4o. Make a figure or table comparing the accuracy of each on the test set and comment on your results and whether they align with your expectations. Sample at most 1000 observations. Again, remember to start with a smaller subset of your dataset to ensure your implementation is correct before scaling up. 

## Zero Shot 4o-mini

In [6]:
model_type = "gpt-4o-mini"

In [24]:
def zero_shot(txt): 
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": (
                    "Solve the following math word problem. "
                    "Only include one integer at the end of your response, and put it after '####' on a new line.\n\n"
                    f"{txt}\n\n"
                )
            }
        ],
        model=model_type
    )
    response = chat_completion.choices[0].message.content
    res = get_answer(response)
    return res


In [25]:
gpt_results = []

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:  # 5 parallel threads
    gpt_results = list(executor.map(zero_shot, df_test['question']))
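Parallel requests can occasionally fail for transient reasons such as rate limits. The sketch below shows one possible retry wrapper with exponential backoff. `call_with_retries` and `flaky_solver` are hypothetical names used only for illustration, and the stand-in function replaces a real API call so the example runs offline.

```python
import random
import time

def call_with_retries(fn, arg, max_attempts=4, base_delay=1.0):
    """Call fn(arg); on an exception, wait with exponential backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return fn(arg)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay * 2^attempt, plus jitter so threads do not retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Demonstration with a stand-in function that fails twice before succeeding.
calls = {"count": 0}

def flaky_solver(question):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient API error")
    return "42"

result = call_with_retries(flaky_solver, "What is 6*7?", base_delay=0.0)
print(result, calls["count"])
```

In the cell above, this could be applied as `executor.map(lambda q: call_with_retries(zero_shot, q), df_test['question'])`.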

In [26]:
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, " zero-shot: ", acc_results)

Test results for  gpt-4o-mini  zero-shot:  0.95
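Because `accuracy_score` compares strings exactly, a response such as `18.00` would be counted as wrong against the label `18`. The following is a minimal sketch of a more forgiving comparison, assuming numeric answers. `normalize_number` and `normalized_accuracy` are hypothetical helpers, and the labels and predictions are made up for illustration.

```python
import re
from decimal import Decimal, InvalidOperation

def normalize_number(text):
    """Map strings such as '$1,000', '18.00', or ' 18 ' to a canonical form like '1000' or '18'."""
    if text is None:
        return None
    match = re.search(r"[-+]?\d[\d,]*(?:\.\d+)?", str(text))
    if match is None:
        return None
    try:
        value = Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None
    # Drop a zero fractional part so that 18.00 and 18 compare equal
    if value == value.to_integral_value():
        return str(int(value))
    return str(value.normalize())

def normalized_accuracy(y_true, y_pred):
    """Fraction of positions where the normalized prediction equals the normalized label."""
    pairs = list(zip(y_true, y_pred))
    hits = 0
    for truth, pred in pairs:
        truth_norm = normalize_number(truth)
        # A label or prediction without any number never counts as a hit
        if truth_norm is not None and truth_norm == normalize_number(pred):
            hits += 1
    return hits / len(pairs)

# Hypothetical labels and predictions for illustration only
example_labels = ["18", "1000", "7"]
example_preds = ["18.00", "1,000", "I am not sure"]
print(normalized_accuracy(example_labels, example_preds))
```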


## Few Shot 4o-mini

In [6]:
def run_experiment(txt, model_type, prompt):
    # Query the model once per question, passing the example prompt alongside it
    def call(question):
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": (
                        "Solve the following math word problem. "
                        "Only include one integer at the end of your response, and put it after '####' on a new line."
                        f"Here are a few examples: {prompt}"
                        f"These are the questions: {question}\n\n"
                    )
                }
            ],
            model=model_type  # Specify the model
        )
        response = chat_completion.choices[0].message.content
        res = get_answer(response)
        return res

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:  # 5 parallel threads
        gpt_results = list(executor.map(call, txt))

    return gpt_results

In [12]:
model_type = "gpt-4o-mini"
few_shot_prompt = """
Q: What is 10*5? 
#### 50 

Q: How many miles did Rob drive if he drove for 3 hours at 20 miles per hour? 
#### 60 
"""

gpt_results = run_experiment(df_test['question'], model_type, few_shot_prompt)

In [13]:
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, " few-shot: ", acc_results)

Test results for  gpt-4o-mini  few-shot:  0.942


## Chain of Thought 4o-mini

In [16]:
model_type = "gpt-4o-mini"
cot_prompt = """
Q: Olivia has 23 books. She buys 18 more books. Then she gives 15 books to her friends. How many books does she have now?
A: Let's break it down step by step.
- Olivia starts with 23 books.
- She buys 18 more: 23 + 18 = 41.
- She gives away 15: 41 - 15 = 26.
#### 26

Q: A box holds 12 pencils. A teacher has 7 boxes. She gives 10 pencils to each of 3 students. How many pencils does she have left?
A: Step by step:
- Total pencils = 12 × 7 = 84.
- 3 students each get 10 pencils 
→ 3 × 10 = 30.
- Pencils left = 84 - 30 = 54.
#### 54
"""
gpt_results = run_experiment(df_test['question'], model_type, cot_prompt)


In [17]:
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "cot: ", acc_results)

Test results for  gpt-4o-mini cot:  0.94


# a) Final Results 

<font color = 'blue'>

| Model   | Prompting Method | Accuracy |
|---------|------------------|----------|
| 4o-mini | Zero-Shot        | 0.95     |
| 4o-mini | Few-Shot         | 0.942    |
| 4o-mini | Chain of Thought | 0.94     |


These results surprise me, as I expected few-shot and chain-of-thought prompting to be more accurate than zero-shot. One possible explanation is that 4o-mini was tuned mostly for zero-shot use, so the extra tokens in the prompt led it to more incorrect answers. I am specifically thinking of the variable-compute ablation in Wei et al., where giving the model more tokens did not by itself lead to higher accuracy. This may mean that 4o-mini works best with little to no guidance.
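One way to gauge how much weight these gaps can bear is to attach a rough confidence interval to each accuracy. The sketch below uses the normal approximation to the binomial with the 500 test questions sampled above. `accuracy_interval` is a hypothetical helper, and because all three methods were scored on the same questions, a paired comparison would be more precise than this simple check.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy measured on n items."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se

# Accuracies reported in the table above, each measured on the same 500 test questions
reported = {"Zero-Shot": 0.95, "Few-Shot": 0.942, "Chain of Thought": 0.94}
intervals = {name: accuracy_interval(acc, 500) for name, acc in reported.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

Under this approximation each interval is roughly two percentage points wide on either side, so the three intervals overlap and the ordering of the methods should be read with caution.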

b)  Explore recent literature for reasoning methods that could enhance the performance of CoT and improve the baseline obtained in a). Then, compare the results with the reasoning models o1 and o3, and discuss your findings.

## o1-mini Few Shot and CoT

In [18]:
model_type = "o1-mini"
gpt_results = run_experiment(df_test['question'], model_type, few_shot_prompt)
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "few-shot: ", acc_results)

Test results for  o1-mini few-shot:  0.964


In [19]:
model_type = "o1-mini"
gpt_results = run_experiment(df_test['question'], model_type, cot_prompt)
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "cot: ", acc_results)

Test results for  o1-mini cot:  0.956


## o3-mini Few Shot and CoT

In [20]:
model_type = "o3-mini"
gpt_results = run_experiment(df_test['question'], model_type, few_shot_prompt)
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "few-shot: ", acc_results)

Test results for  o3-mini few-shot:  0.958


In [21]:
model_type = "o3-mini"
gpt_results = run_experiment(df_test['question'], model_type, cot_prompt)
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "cot: ", acc_results)

Test results for  o3-mini cot:  0.968


<font color = 'blue'>

# Tree of Thought

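[Yao et al. (2023)](https://arxiv.org/abs/2305.10601v2) propose Tree of Thoughts as a reasoning method that extends Chain of Thought by exploring several reasoning paths. I therefore tried a prompt in that style, with examples that work through multiple solution paths, to see whether it would improve accuracy.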
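For intuition, the search idea behind Tree of Thoughts can be sketched as a breadth-first search that proposes next steps, scores them, and keeps only the most promising ones. This is a toy illustration under strong assumptions, not the method used in this notebook: `propose` and `evaluate` are hypothetical stand-ins for what would be model calls, and the task (reach a target number with simple operations) is invented for the example.

```python
def propose(state):
    """Stand-in for the LLM thought generator: extend a partial solution by one step."""
    value, steps = state
    return [
        (value + 3, steps + ["+3"]),
        (value * 2, steps + ["*2"]),
        (value - 1, steps + ["-1"]),
    ]

def evaluate(state, target):
    """Stand-in for the LLM state evaluator: higher scores are more promising."""
    value, _ = state
    return -abs(target - value)

def tree_of_thought_bfs(start, target, depth=3, beam_width=2):
    """Breadth-first search that keeps the beam_width best states at each level."""
    frontier = [(start, [])]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if candidate[0] == target:
                return candidate  # a complete solution was found
        candidates.sort(key=lambda s: evaluate(s, target), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]  # best partial solution if the target was not reached

value, steps = tree_of_thought_bfs(start=2, target=13)
print(value, steps)
```

The prompt used below instead asks the model to lay out several paths within a single response, which is a much lighter approximation of this search.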

In [22]:

tree_prompt = """
Q: Emma has $50. She buys 2 books for $12 each and 1 backpack for $20. How much money does she have left?

A: Let’s explore different solution paths:

Option 1:

Cost of books = 2 × 12 = 24
Total cost = 24 + 20 = 44
Leftover = 50 − 44 = 6
Option 2:

Buy backpack first: 50 − 20 = 30
Then books: 30 − (2 × 12) = 30 − 24 = 6
Option 3:

Add costs: 12 + 12 + 20 = 44
Leftover = 50 − 44 = 6
All paths give the same result. Final answer:

#### 6

Q: A pizza is cut into 8 slices. Tom eats 2 slices, Jerry eats 3, and Spike eats the rest. How many slices did Spike eat?

A: Consider several ways to reason through it:

Path A:

Tom + Jerry = 2 + 3 = 5
Spike = 8 − 5 = 3
Path B:

Count what’s left: Only 3 slices not mentioned, must be Spike’s.
Path C:

8 total. Deduct Tom’s 2 = 6. Deduct Jerry’s 3 = 3.
All paths agree.

#### 3
"""


In [23]:
model_type = "o1-mini"
gpt_results = run_experiment(df_test['question'], model_type,tree_prompt)
print("Processing complete!")
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "tree: ", acc_results)


Processing complete!
Test results for  o1-mini tree:  0.958


In [24]:
model_type = "o3-mini"
gpt_results = run_experiment(df_test['question'], model_type,tree_prompt)
print("Processing complete!")
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "tree: ", acc_results)



Processing complete!
Test results for  o3-mini tree:  0.962


In [27]:
model_type = "gpt-4o-mini"
gpt_results = run_experiment(df_test['question'], model_type,tree_prompt)
print("Processing complete!")
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "tree: ", acc_results)


Processing complete!
Test results for  gpt-4o-mini tree:  0.926


<font color = 'blue'>

# b) Final Results

| Model   | Prompting Method | Accuracy |
|---------|------------------|----------|
| o1-mini | Few-Shot         | 0.964    |
| o3-mini | Few-Shot         | 0.958    |
| o1-mini | CoT              | 0.956    |
| o3-mini | CoT              | 0.968    |
| 4o-mini | CoT              | 0.94     |
| o1-mini | Tree             | 0.958    |
| o3-mini | Tree             | 0.962    |
| 4o-mini | Tree             | 0.926    |


These results surprise me, as I had always assumed the most recent GPT model to be "the best." I now understand that the type of prompt has a noticeable impact on performance. 4o-mini seems to have trouble with long guided prompts and works best when left zero-shot. o3-mini behaves in the opposite way and does better with longer guided prompts; perhaps it was trained to expect more guidance. o1-mini falls in the middle. This matters because the same type of prompt produces different results for each model, so I should adapt my prompting method to the model I choose.
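To make the comparison easier to scan, the short sketch below collects the accuracies reported in parts a) and b) and picks the best prompting method for each model. The numbers are copied from the tables above; `best_method_per_model` is a hypothetical helper.

```python
# Accuracies copied from the tables in parts a) and b)
results = {
    ("4o-mini", "Zero-Shot"): 0.95,
    ("4o-mini", "Few-Shot"): 0.942,
    ("4o-mini", "CoT"): 0.94,
    ("4o-mini", "Tree"): 0.926,
    ("o1-mini", "Few-Shot"): 0.964,
    ("o1-mini", "CoT"): 0.956,
    ("o1-mini", "Tree"): 0.958,
    ("o3-mini", "Few-Shot"): 0.958,
    ("o3-mini", "CoT"): 0.968,
    ("o3-mini", "Tree"): 0.962,
}

def best_method_per_model(results):
    """Return {model: (method, accuracy)} for the highest-accuracy method of each model."""
    best = {}
    for (model, method), acc in results.items():
        if model not in best or acc > best[model][1]:
            best[model] = (method, acc)
    return best

best = best_method_per_model(results)
for model, (method, acc) in sorted(best.items()):
    print(f"{model}: {method} ({acc:.3f})")
```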

c) Perform an ablation study similar to that of Section 3.3 of Wei et al. 2023: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf). Comment on your results.


Wei et al. 2023 discuss how chain-of-thought performance may depend on the number of tokens in the prompt. I want to test whether sparser answers that contain only the numerical steps would produce better results. I refer to this as symbolic reasoning; it is similar in spirit to the equation-only ablation in that paper.
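As an illustration of what such a sparse exemplar looks like, the sketch below strips a worked solution down to its equations. It assumes the solution uses GSM8K-style `<<expr=value>>` annotations; the question and solution text are hypothetical, and `equation_only_exemplar` is an illustrative helper rather than the code used for the experiments below.

```python
import re

def equation_only_exemplar(question, answer):
    """Build a sparse exemplar: keep only the <<expr=value>> equations and the final answer."""
    pairs = re.findall(r"<<([^=<>]+)=([^<>]+)>>", answer)
    equations = [f"{expr} = {value}" for expr, value in pairs]
    final = answer.split("####")[-1].strip()
    joined = "; ".join(equations)
    return f"Q: {question}\nA: {joined}\n#### {final}"

# Hypothetical question and annotated solution in the GSM8K format
question = "A pack of gum costs $3. You buy 4 packs and a soda that costs $2. How much do you spend?"
answer = (
    "The gum costs 4*3=<<4*3=12>>12 dollars.\n"
    "With the soda the total is 12+2=<<12+2=14>>14 dollars.\n"
    "#### 14"
)
exemplar = equation_only_exemplar(question, answer)
print(exemplar)
```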

In [7]:
model_type = "o1-mini"
sparse_prompt = """
Q: Emily has 12 pencils. She gives 3 pencils to each of her 4 friends. How many pencils does she have left?
A: 12 - (3 * 4) = 0
#### 0

Q: A pack of gum costs $3. You buy 4 packs and a soda that costs $2. How much do you spend?
A: 4 * 3 + 2 = 14 
#### 14 

Q: Solve for x: 3x + 4 = 19
A: (19 − 4)/3 = 5
#### 5
"""

gpt_results = run_experiment(df_test['question'], model_type,sparse_prompt)

print("Processing complete!")
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "symbolic reasoning: ", acc_results)


Processing complete!
Test results for  o1-mini symbolic reasoning:  0.96


In [8]:
model_type = "o3-mini"
gpt_results = run_experiment(df_test['question'], model_type,sparse_prompt)

print("Processing complete!")
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "symbolic reasoning: ", acc_results)

Processing complete!
Test results for  o3-mini symbolic reasoning:  0.966


In [9]:
model_type = "gpt-4o-mini"
gpt_results = run_experiment(df_test['question'], model_type,sparse_prompt)

print("Processing complete!")
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "symbolic reasoning: ", acc_results)

Processing complete!
Test results for  gpt-4o-mini symbolic reasoning:  0.916


<font color='blue'>

# c) Final results
| Model   | Accuracy |
|---------|----------|
| o1-mini | 0.96     |
| o3-mini | 0.966    |
| 4o-mini | 0.916    |

Because of the prior experiments, I was interested in performing an ablation study to examine the impact of sparser yet still guided prompts. I wanted to provide a bare-bones, numbers-only construction, which according to GPT is called "symbolic reasoning." We can see that o3-mini did extremely well with symbolic reasoning, which makes me reconsider my earlier statements about prompt length. I think o3-mini simply performs better with structured responses in general. o1-mini performs similarly. In contrast, 4o-mini has its worst performance of the entire problem set. I had assumed it would do quite well since it did great in zero-shot, but against my expectations, 4o-mini does not do as well with strict guidance. This may be because 4o-mini was created after ChatGPT came into wide use, where users typically do not provide any examples in their prompts. This suggests 4o-mini was trained mostly on zero-shot interactions and instructed to perform tasks with no guidance from the user, as that is most similar to user behavior in ChatGPT. 