# S&DS 617 Applied Machine Learning and Causal Inference Research Seminar: Assignment 2

**Deadline**

Assignment 2 is due Monday, March 24th at 1:30pm. Late work will not be accepted.


**Submission**

Submit your assignment on Gradescope, where there are two assignments: one for the .pdf file and one for the corresponding .ipynb that generated it.
Note: The problems in each homework assignment are numbered. When submitting the .pdf on Gradescope, please select the pages that correspond to each problem.

To produce the .pdf, do the following to preserve the cell structure of the notebook:
- Go to "File" at the top-left of your Jupyter Notebook
- Under "Download as", select "HTML (.html)"
- After the .html has downloaded, open it and then select "File" and "Print"
- From the print window, select the option to save as a .pdf

## Problem 1

In this exercise, we'll apply different prompting techniques to the GSM8K dataset ([Link](https://github.com/openai/grade-school-math/tree/master/grade_school_math/data)).

The GSM8K dataset is an OpenAI-curated collection of 8,500 grade school math problems designed to challenge and evaluate the mathematical reasoning capabilities of language models. It contains approximately 7,500 training and 1,000 test problems that require 2 to 8 steps to solve, using basic arithmetic operations. The dataset aims to diagnose current model limitations in multi-step reasoning and supports advancements in AI's understanding and processing of natural language math problems. It includes both standard problems and a Socratic format with guiding subquestions, along with calculation annotations to aid in accuracy, making it a valuable resource for AI research in natural language processing.

Below, we have provided helper functions to load the data. 

In [1]:
import requests
import tarfile
import os
import json
import re
import openai
import pandas as pd
from sklearn.metrics import accuracy_score
import ast 
import sys
import time

# Function to load data from a URL
def load_data_from_url(url):
    response = requests.get(url)
    if response.status_code == 200:
        # Parse one JSON object per line, skipping any blank lines
        data = [json.loads(line) for line in response.iter_lines(decode_unicode=True) if line]
        df = pd.DataFrame(data)
        return df
    else:
        print(f"Failed to download the file: {response.status_code}")
        return None

# Function to get the true solution from JSON file 
def extract_solution(problem):
    """
    Extracts the final numeric solution from a problem dictionary with 'question' and 'answer' keys.
    The answer is expected to contain a '####' token followed by the final numeric solution.
    
    :param problem: A dictionary with 'question' and 'answer' keys.
    :return: The final numeric solution as a string.
    """
    # Split the answer into lines
    answer_lines = problem['answer'].split('\n')
    # Look for the line with the '####' token
    for line in answer_lines:
        if '####' in line:
            # Extract the numeric solution that follows the '####' token
            solution = line.split('####')[-1].strip()
            return solution
    # If no solution is found, return None
    return None

### Load Data

In [3]:
# URLs for the train and test data
train_url = "https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/train.jsonl"
test_url = "https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl"

# Load the training data
df_train = load_data_from_url(train_url)
df_train = df_train.sample(n=1000, random_state=928)

# Load the test data
df_test = load_data_from_url(test_url)
df_test = df_test.sample(n=500, random_state=184)

# Display the lengths of the datasets as a check
len(df_train), len(df_test)

(1000, 500)

In [4]:
sample_row = df_train.iloc[0]
print(sample_row)

question    Taegan goes to a carnival where she wins ticke...
answer      If tickets are valued at $3 then in total, Tae...
Name: 3103, dtype: object


In [5]:
sample_row['question'] # sample question

'Taegan goes to a carnival where she wins tickets from each of the 5 carnival games and also finds 5 tickets on the floor. Each ticket is worth $3. In total, she has tickets that total a value of $30. If Taegan won an equal number of tickets from each of the games, how many tickets did she win from each game?'

In [6]:
# Extract the solution
extract_solution(sample_row)

'1'
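
As a quick sanity check on the extracted label, the arithmetic behind this sample question can be reproduced directly. The variable names below are illustrative and are not part of the provided helpers.

```python
# Worked check of the sample question (variable names are illustrative).
total_value = 30      # total value of all tickets, in dollars
ticket_value = 3      # value of one ticket, in dollars
found_tickets = 5     # tickets found on the floor
num_games = 5         # number of carnival games

total_tickets = total_value // ticket_value    # tickets held overall
won_tickets = total_tickets - found_tickets    # tickets won at the games
tickets_per_game = won_tickets // num_games    # equal share per game

print(tickets_per_game)
```

The result agrees with the label `'1'` returned by `extract_solution`.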

a) Implement zero-shot, few-shot, and chain-of-thought prompting using gpt-4o. Make a figure or table comparing the accuracy of each method on the test set, and comment on whether the results align with your expectations. Sample at most 1000 observations. Again, remember to start with a smaller subset of your dataset to verify your implementation before scaling up.

# Zero Shot 

In [7]:

from dotenv import load_dotenv
# Load environment variables from the .env file
load_dotenv()

# Access the OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")

# Use the API key
if openai_api_key:
    print("OpenAI API Key loaded successfully!")
else:
    print("OpenAI API Key not found. Please check your .env file.")


OpenAI API Key loaded successfully!


In [8]:

# Set up the OpenAI client
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


In [9]:
model_type = "o1-mini"
labels = df_test.apply(extract_solution, axis = 1)

In [10]:

def zero_shot(txt): 
    chat_completion = client.chat.completions.create(
            messages=[
                {"role": "user", "content": f"Return a single-number answer only. No explanation. {txt}"}
            ],
            model=model_type
        )
    response = chat_completion.choices[0].message.content
    return response

In [16]:
import concurrent.futures

def process_question(question):
    ans = zero_shot(question)
    # Drop any Markdown code-fence markers the model may wrap around its answer
    clean_response = re.sub(r"```.*?\n", "", ans).strip("```")
    return clean_response

gpt_results = []

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:  # 5 parallel threads
    gpt_results = list(executor.map(process_question, df_test['question']))

print("Processing complete!")


Processing complete!
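
Parallel requests can run into rate limits or transient network errors, and an exception raised inside `executor.map` surfaces when the results are collected, which loses the whole run. The following is a minimal sketch of a retry wrapper with exponential backoff. The helper name `with_retries` and its parameters are hypothetical, and it catches every exception for simplicity; in practice we would probably narrow this to the specific errors raised by the client.

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Wrap fn so that it is retried with exponential backoff on any exception.

    Hypothetical helper: names and defaults are illustrative choices.
    """
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Re-raise once the final attempt has failed
                if attempt == max_attempts - 1:
                    raise
                # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries
                sleep(base_delay * 2 ** attempt)
    return wrapped
```

It could be used as `executor.map(with_retries(process_question), df_test['question'])`, leaving the rest of the cell unchanged.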


In [18]:
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, " zero-shot: ", acc_results)

Test results for  o1-mini  zero-shot:  0.938
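
Because `accuracy_score` compares strings exactly, a response such as `"1,000"`, `"$18"`, or `"18.0"` is counted as wrong even when the label is `"1000"` or `"18"`. Below is a rough sketch of a normalization step, assuming that the last number in the text is the intended answer. The helper `normalize_numeric` is hypothetical and not part of the provided code.

```python
import re

def normalize_numeric(text):
    """Return the last number found in text as a canonical string, or '' if none.

    Hypothetical helper. Assumes commas are thousands separators, which can
    misread inputs like "1,2" as 12.
    """
    if text is None:
        return ""
    cleaned = str(text).replace(",", "")
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    if not matches:
        return ""
    value = float(matches[-1])
    # Print whole numbers without a trailing ".0" so "18.0" matches "18"
    return str(int(value)) if value.is_integer() else str(value)

print(normalize_numeric("1,000"), normalize_numeric("$18.00"))
```

Applying the same function to both `labels` and `gpt_results` before calling `accuracy_score` keeps the comparison symmetric.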


# Few Shot 

In [31]:
model_type = "o1-mini"
prompt = """
Q: What is 10*5? 
A: 50 

Q: How many miles did Rob drive if he drove for 3 hours at 20 miles per hour? 
A: 60 
"""

def few_shot(txt): 
    chat_completion = client.chat.completions.create(
            messages=[
                {"role": "user", "content": f"Return a single-number answer only. No explanation. Here are a few examples: {prompt}. Q: {txt}"}
            ],
            model=model_type,  # Specify the model
        )
    response = chat_completion.choices[0].message.content
    return response

In [32]:
def process_question(question):
    ans = few_shot(question)
    clean_response = re.sub(r"```.*?\n", "", ans).strip("```")
    return clean_response

gpt_results = []

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:  # 5 parallel threads
    gpt_results = list(executor.map(process_question, df_test['question']))

print("Processing complete!")


Processing complete!


In [33]:
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, " few-shot: ", acc_results)

Test results for  o1-mini  few-shot:  0.958


# Chain of Thought 

b) Explore recent literature for reasoning methods that could enhance the performance of CoT and improve on the baseline obtained in a). Then compare the results with the reasoning models o1 and o3, and discuss your findings.
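
One method from the literature is self-consistency (Wang et al., 2022), which samples several reasoning paths for the same question and keeps the most frequent final answer. The sketch below shows only the aggregation step. Both function names are hypothetical, and `sample_fn` is an assumed stand-in for any function that returns one final answer string per call (for example, a wrapper around `cot`).

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common non-empty answer; ties go to the earliest seen."""
    votes = Counter(a for a in answers if a)
    if not votes:
        return ""
    return votes.most_common(1)[0][0]

def self_consistent_answer(question, sample_fn, n_samples=5):
    """Call sample_fn(question) n_samples times and return the majority answer.

    sample_fn is an assumption: any callable mapping a question to an answer string.
    """
    return majority_vote([sample_fn(question) for _ in range(n_samples)])

print(majority_vote(["18", "20", "18", "", "18"]))
```

The vote is only informative when the samples can differ from one another, so this assumes the model is sampled with some randomness.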

In [28]:
model_type = "o1-mini"
prompt = """
Question 1: \n 
Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? \n
2 cans * 3 tennis balls = 6 tennis balls \n 
6 tennis balls he bought + 5 tennis balls he had originally = 11. 

Question 2: \n
Q: The concert was scheduled to be on 06/01/1943, but was delayed by one day to today. What is the date 10 days ago in MM/DD/YYYY?  
A: One day after 06/01/1943 is 06/02/1943, so today is 06/02/1943. 10 days before today is 05/23/1943. So the answer is 05/23/1943. 
"""

def cot(txt): 
    chat_completion = client.chat.completions.create(
            messages=[
                {"role": "user", "content": f"Return a single-number answer only. No explanation. Here are a few examples: {prompt}. Question: {txt}"}
            ],
            model=model_type,  # Specify the model
        )
    response = chat_completion.choices[0].message.content
    return response

In [29]:
def process_question(question):
    ans = cot(question)
    clean_response = re.sub(r"```.*?\n", "", ans).strip("```")
    return clean_response

gpt_results = []

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:  # 5 parallel threads
    gpt_results = list(executor.map(process_question, df_test['question']))

print("Processing complete!")


Processing complete!


In [30]:
acc_results = accuracy_score(labels, gpt_results)
print("Test results for ", model_type, "cot: ", acc_results)

Test results for  o1-mini cot:  0.942
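
The prompt above pairs worked examples with the instruction "No explanation", so the response contains no visible reasoning. A possible variant, sketched here under our own assumptions, asks for step-by-step reasoning and a final line in the dataset's `####` format, then parses that line. `COT_INSTRUCTION`, `build_cot_prompt`, and `parse_final_answer` are hypothetical names, and this variant was not used to produce the accuracy reported above.

```python
import re

# Hypothetical instruction text; the wording is an illustrative choice.
COT_INSTRUCTION = (
    "Solve the problem step by step. "
    "End with a final line of the form '#### <number>'."
)

def build_cot_prompt(question, examples=""):
    """Assemble a prompt that invites visible step-by-step reasoning."""
    parts = [COT_INSTRUCTION]
    if examples:
        parts.append(examples.strip())
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

def parse_final_answer(response):
    """Return the text after the last '####' marker, or '' if it is absent."""
    matches = re.findall(r"####\s*([^\n]+)", response)
    return matches[-1].strip() if matches else ""

print(parse_final_answer("2 * 3 = 6\n6 + 5 = 11\n#### 11"))
```

Returning an empty string when the marker is missing makes malformed responses count as incorrect rather than raising an error.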


# Final Results 

| Model   | Prompting Method | Accuracy |
|---------|------------------|----------|
| o1-mini | Zero-Shot        | 0.938    |
| o1-mini | Few-Shot         | 0.958    |
| o1-mini | Chain of Thought | 0.942    |
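
With 500 test questions, differences of one or two percentage points may fall within sampling noise. As a rough guide when commenting on the results, the sketch below computes a normal-approximation 95% interval for each accuracy. The function name `binomial_interval` is hypothetical, and the approximation is crude for accuracies this close to 1.

```python
import math

n_test = 500  # size of the sampled test set used above
accuracies = {"Zero-Shot": 0.938, "Few-Shot": 0.958, "Chain of Thought": 0.942}

def binomial_interval(p, n, z=1.96):
    """Normal-approximation interval for a proportion p estimated from n trials."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

for method, acc in accuracies.items():
    low, high = binomial_interval(acc, n_test)
    print(f"{method:<17} {acc:.3f}  95% CI approx. [{low:.3f}, {high:.3f}]")
```

Since all three methods were scored on the same questions, a paired comparison of per-question outcomes would give a sharper test than these separate intervals.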

c) Perform an ablation study similar to that of Section 3.3 of Wei et al. 2023: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf). Comment on your results.
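
As a starting point, the sketch below renders one exemplar under several prompt variants in the spirit of that ablation: equation only, variable compute only (dots in place of the equation), and reasoning placed after the answer. The variant names and the function `render_exemplar` are our own labels, assumed for illustration, and the exact formats should be checked against the paper.

```python
# One exemplar (reused from the CoT prompt above) rendered under each variant.
QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")
REASONING = ("2 cans * 3 tennis balls = 6 tennis balls. "
             "6 tennis balls he bought + 5 tennis balls he had originally = 11.")
EQUATION = "5 + 2 * 3 = 11"
ANSWER = "11"

def render_exemplar(variant):
    """Return the exemplar text for one (hypothetically named) prompt variant."""
    if variant == "standard":
        body = f"The answer is {ANSWER}."
    elif variant == "chain_of_thought":
        body = f"{REASONING} The answer is {ANSWER}."
    elif variant == "equation_only":
        body = f"{EQUATION}. The answer is {ANSWER}."
    elif variant == "variable_compute_only":
        # Dots matching the equation length stand in for extra computation
        body = f"{'.' * len(EQUATION)} The answer is {ANSWER}."
    elif variant == "reasoning_after_answer":
        body = f"The answer is {ANSWER}. {REASONING}"
    else:
        raise ValueError(f"unknown variant: {variant}")
    return f"Q: {QUESTION}\nA: {body}"

print(render_exemplar("equation_only"))
```

Each variant can then be substituted for `prompt` in the few-shot or CoT function and scored on the same test sample.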
