### **ChatGPT MWP performance prediction**
The emergence of large language models (LLMs) have gained much popularity in recent years, with OpenAI's GPT-3 series models being considered as the state-of-the-art. In particular, the variant of GPT-3 tuned for natural dialog, known as ChatGPT, has gathered much popular interest. However, LLMs have known performance issues, 
specifically when reasoning tasks are involved. This project aims to investigate aspects of math word problems (MWPs) that can indicate the success or failure of ChatGPT on such problems.
  
In this notebook in particular, we attempt to predict ChatGPT's performance on specific questions in DRAW-1K using classifiers by extracting equations from ChatGPT's response.  
  

### **Download libraries**
In order to replicate the results produced in this notebook, it is recommended to use the exact version of Python as well as the exact versions of each library.  
We first download the libraries that will be used in this notebook. We specify the exact version of each library to download.

In [57]:
%%capture

# =========================================== #
#               Requirements
# ------------------------------------------- #
# - Python 3.7.9
# =========================================== #

%pip install nltk==3.8.1
%pip install pandas==1.3.5
%pip install sympy==1.10.1
%pip install plotly==5.13.0
%pip install xgboost==1.6.2

# %pip install xlsxwriter==3.0.9
# %pip install scikit-learn==1.0.2
# %pip install beautifulsoup4==4.11.2
# %pip install torch==1.13.1
# %pip install transformers==4.27.4
# %pip install tqdm==4.64.1

### **Load libraries**

In [58]:
# =========================================== #
#                Libraries
# =========================================== #

# ------------------------------------------- #
#   Python
# ------------------------------------------- #
import re
import os
import ast
import sys
import random
import threading
from time import sleep

try:
    import thread
except ImportError:
    import _thread as thread

import wordninja

# ------------------------------------------- #
#   Pandas
# ------------------------------------------- #
import pandas
pandas.options.display.max_rows = 4000

# ------------------------------------------- #
#   Plotly
# ------------------------------------------- #
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px

# ------------------------------------------- #
#   Sympy
# ------------------------------------------- #
from sympy.parsing.sympy_parser import parse_expr
from sympy.parsing.sympy_parser import transformations
from sympy.parsing.sympy_parser import T
from sympy import Eq
from sympy import solve

# ------------------------------------------- #
#   Sklearn
# ------------------------------------------- #
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn import metrics

from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# ------------------------------------------- #
#   Matplotlib
# ------------------------------------------- #
import matplotlib.pyplot as plt

# ------------------------------------------- #
#   XGBoost
# ------------------------------------------- #
from xgboost import XGBClassifier

# ------------------------------------------- #
#   Tensorflow
# ------------------------------------------- #
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import activations

# ------------------------------------------- #
#   Seaborn
# ------------------------------------------- #
import seaborn

# ------------------------------------------- #
#   XlsxWriter
# ------------------------------------------- #
import xlsxwriter

# ------------------------------------------- #
#   Numpy
# ------------------------------------------- #
import numpy

### **Constants**
Here, we configure values required one run through of the contents in the notebook.   
This should be the only part of the notebook you would have to change from run-to-run.

In [59]:
RESPONSE_FILE_PATH = '../data/hwelsters__gpt-3.5-turbo__v001 (prefix__system_of_equations).jsonl'
QUESTION_FILE_PATH = '../data/draw.json'
TEST_SIZE = 0.2
RANDOM_STATE = 42
N_SPLITS = 5
VISUALIZE_DATA = True
XLSX_OUTPUT_FILE_PATH = 'sympy_predict.xlsx'

### **Set Seed for RNGs**
We set the seed for the RNGs to ensure consistency from run-to-run

In [60]:
os.environ['PYTHONHASHSEED']=str(RANDOM_STATE)
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'  # new flag present in tf 2.0+
random.seed(RANDOM_STATE)
numpy.random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)

### **Load data**

In [61]:
# =========================================== #
#                Load data
# =========================================== #

# ------------------------------------------- #
#   Functions which loads data into
#   a dataframe based on its file extension
#   Currently accepts .json, .jsonl, .csv
# ------------------------------------------- #
def load_file(path):
    split_path = os.path.splitext(path)
    file_extension = split_path[-1]

    data = pandas.DataFrame()
    if file_extension == '.json': data = pandas.read_json(path)
    if file_extension == '.jsonl': data = pandas.read_json(path, lines=True)
    if file_extension == '.csv': data = pandas.read_csv(path)

    return data


# ------------------------------------------- #
#   Helper function that loads data stored in 
#   a file at a particular file-path into a 
#   pandas dataframe, extracting only the 
#   columns specified.
# ------------------------------------------- #
def load_data(path, sample_size=5, columns=None, label=None):
    available_columns, dataframe = load_file(path, columns)
    if columns != None: available_columns = list(set(columns).intersection(available_columns))
    return dataframe[available_columns]

In [62]:
equations_df = load_file('../data/equations.jsonl')
final_answer_df = load_file('../data/answers.jsonl')
response_df = load_file('../data/response.jsonl')
ground_df = load_file('../data/ground.json')

display(equations_df.head(3))
display(final_answer_df.head(3))
display(response_df.head(3))
display(ground_df.head(3))

Unnamed: 0,model,temperature,max_tokens,date_time,question_number,question,system_text,response,prompt_tokens,completion_tokens,total_tokens
0,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:24:18,8,Extract all equations from this text and canon...,Extract all equations from this text and canon...,"{\n ""equations"": [\n ""x + y = 9"",\n ""10...",363,66,429
1,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:24:18,16,Extract all equations from this text and canon...,Extract all equations from this text and canon...,"{\n ""equations"": [\n ""x + 24 = faste...",464,65,529
2,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:24:19,45,Extract all equations from this text and canon...,Extract all equations from this text and canon...,"{""equations"": [""a + c = 24"", ""16a + 9c = 258""]}",335,23,358


Unnamed: 0,model,temperature,max_tokens,date_time,question_number,question,system_text,response,prompt_tokens,completion_tokens,total_tokens
0,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:22:11,10,Extract the final answer from this text. Outp...,Extract the final answer from this text. Outp...,"{""answers"": [1800.0, 4200.0]}",286,15,301
1,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:22:11,7,Extract the final answer from this text. Outp...,Extract the final answer from this text. Outp...,"{""answers"": [10.0, 17.0]}",286,13,299
2,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:22:11,18,Extract the final answer from this text. Outp...,Extract the final answer from this text. Outp...,"{""answers"":[4.0]}",280,7,287


Unnamed: 0,model,date_time,question_number,question,response
0,gpt-3.5-turbo-0301,2023-09-03 12:52:57,1,A factory makes three-legged stools and four-l...,\n\nLet x be the number of three-legged stools...
1,gpt-3.5-turbo-0301,2023-09-03 12:52:57,0,Juniors boat will go 15 miles per hour in stil...,\n\nLet's call the speed of the current 'c'. \...
2,gpt-3.5-turbo-0301,2023-09-03 12:53:00,3,The student-teacher ratio for Washington High ...,"\n\nTo calculate the number of students, we ne..."


Unnamed: 0,sQuestion,lSolutions,Template,lEquations,iIndex,Alignment,Equiv
0,Juniors boat will go 15 miles per hour in stil...,[2.14285714286],[a * m + b * m = b * c - a * c],[12*(15-x)=9*(15+x)],397760,"[{'coeff': 'a', 'SentenceId': 1, 'Value': 9.0,...",[]
1,A factory makes three-legged stools and four-l...,"[83.0, 78.0]","[a * m + b * n = c, m + n = d]","[student+general=161, 3*student+4*general=566]",327651,"[{'coeff': 'a', 'SentenceId': 0, 'Value': 3.0,...",[]
2,a bank offers two checking plans . The anywher...,[14.0],[0.01 * a * m - 0.01 * b * m = c],[(.01*30)*x=(.01*22)*x+1.12],238992,"[{'coeff': 'a', 'SentenceId': 1, 'Value': 30.0...",[]


### **Convert ChatGPT's JSON response to Dataframe**

In [63]:
def extract_json(text : str, question_number):
    try: 
        text = re.findall(r'\{.*\}', text, re.DOTALL)[0]
        body = ast.literal_eval(text)
        body["question_number"] = question_number
        return body
    except: return "FAILED"


e_equations_df = equations_df.apply(lambda row : extract_json(row['response'], row["question_number"]), axis=1)
e_final_answer_df = final_answer_df.apply(lambda row : extract_json(row['response'], row["question_number"]), axis=1)

e_equations_df = e_equations_df[e_equations_df != 'FAILED']
e_final_answer_df = e_final_answer_df[e_final_answer_df != 'FAILED']

e_equations_df = pandas.DataFrame(e_equations_df.to_list())
e_final_answer_df = pandas.DataFrame(e_final_answer_df.to_list())

e_response_df = response_df[["response", "question_number"]]

display(e_equations_df.head(3))
display(e_response_df.head(3))
display(e_final_answer_df.head(3))

Unnamed: 0,equations,question_number
0,"[x + y = 9, 10y + x = 6(x + y), 10y + x = 6x +...",8
1,"[x + 24 = faster car rate, 2x + 24 = relative ...",16
2,"[a + c = 24, 16a + 9c = 258]",45


Unnamed: 0,response,question_number
0,\n\nLet x be the number of three-legged stools...,1
1,\n\nLet's call the speed of the current 'c'. \...,0
2,"\n\nTo calculate the number of students, we ne...",3


Unnamed: 0,answers,question_number
0,"[1800.0, 4200.0]",10
1,"[10.0, 17.0]",7
2,[4.0],18


In [64]:
def combine(question_number):
    ground_val = ground_df.loc[question_number]["lSolutions"]
    response_val = e_response_df[e_response_df["question_number"] == question_number]
    equations_val = e_equations_df.loc[e_equations_df["question_number"] == question_number]
    final_answer_val = e_final_answer_df.loc[e_final_answer_df["question_number"] == question_number]

    if len(response_val) > 0: response_val = response_val.iloc[0]["response"]
    else: response_val = ""

    if len(equations_val) > 0: equations_val = equations_val.iloc[0]["equations"]
    else: equations_val = []

    if len(final_answer_val) > 0: final_answer_val = final_answer_val.iloc[0]["answers"]
    else: final_answer_val = []

    return {"ground_solution" : ground_val, "response" : response_val, "equations": equations_val, "answers" : final_answer_val}

combined_df = response_df.apply(lambda row : combine(row["question_number"]), axis=1).to_list()
combined_df = pandas.DataFrame(combined_df)
combined_df.head(3)

Unnamed: 0,ground_solution,response,equations,answers
0,"[83.0, 78.0]",\n\nLet x be the number of three-legged stools...,"[x + y = 161, 3x + 4y = 566, x = 161 - y, 3(16...","[78.0, 83.0]"
1,[2.14285714286],\n\nLet's call the speed of the current 'c'. \...,"[speed downstream = 15 + c, speed upstream = 1...",[2.14]
2,[1155.0],"\n\nTo calculate the number of students, we ne...","[a = b x c, a = 42 x 27.5, a = 1155]",[1155.0]


In [65]:
# ------------------------------------------- #
#   This function standardizes ChatGPT's 
#   equations.
#   
#   e.g 'tons of hay + tons of wheat = 10'
#       --> a + b = 10
#   
#   In some cases, ChatGPT returns a
#   dict of equations
#   In this case, we just get the values or 
#   the keys of this dictionary depending on 
#   which one has an equal sign
# ------------------------------------------- #
def canonicalize_equations(texts):  
    # If ChatGPT did not give equations at all, return None
    if (str(texts) == "nan"): return None

    # CHATGPT sometimes gives equations as dictionaries rather than arrays. 
    # As such, we do this instead
    
    if type(texts) is dict: 
        new_texts = []
        for text in texts.values(): 
            if str(text).count('=') > 0: new_texts.append(text)
        for text in texts.keys(): 
            if str(text).count('=') > 0: new_texts.append(text)
        texts = new_texts

    if type(texts) is list and len(texts) > 0:
        new_texts = []
        for text in texts:
            if type(text) is dict:
                for t in text.values(): 
                    if str(t).count('=') > 0: new_texts.append(t)
                for t in text.keys(): 
                    if str(t).count('=') > 0: new_texts.append(t)
            elif type(text) is str: new_texts.append(text)
        texts = new_texts
        # print("NEW TEXT: ", texts)
    
    # Converts it all to a list of str
    if type(texts) is set: texts = list(texts)
    texts = [str(text) for text in texts]


    A_CODE = 97
    letters = {}
    index = A_CODE

    new_texts = []
    for text in texts:
        if type(text) is dict:
            text = text[list(text.values())[0]]
        new_texts.append(text)
    texts = new_texts


    REP = '[supadupaepicstonkifiers]'
    for text in texts:
        text = str(text)
        text = text.replace('+', REP)
        text = text.replace('-', REP)
        text = text.replace('*', REP)
        text = text.replace('=', REP)
        text = text.replace('(', REP)
        text = text.replace(')', REP)
        text = text.replace('{', REP)
        text = text.replace('}', REP)
        text = text.replace(',', '')
        text = text.replace('/', REP)
        text = text.replace('.', REP)

        spl = text.split(REP)

        for text in spl:
            split = wordninja.split(text)
            for t in split:
                t = t.strip() 
                while len(t) > 0 and t[0].isdigit(): t = t[1:]
                if len(t) > 0 and not t in letters.keys(): 
                    letters[t] = chr(index)
                    index += 1
    
    to_return = []

    def get_len(key):
        return -len(key[0])

    test_dict_list = list(letters.items())
    test_dict_list.sort(key = get_len)

    letters = {ele[0] : ele[1]  for ele in test_dict_list}

    # Canonicalizes the equation variable names
    for text in texts:
        for swap in letters.keys():
            text = text.replace(swap, letters[swap])
        to_return.append(text)

    return to_return

canonicalized_df = combined_df.copy()
canonicalized_df["canonicalized_equations"] = canonicalized_df["equations"].apply(lambda row : canonicalize_equations(row))
canonicalized_df.head()

Unnamed: 0,ground_solution,response,equations,answers,canonicalized_equations
0,"[83.0, 78.0]",\n\nLet x be the number of three-legged stools...,"[x + y = 161, 3x + 4y = 566, x = 161 - y, 3(16...","[78.0, 83.0]","[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16..."
1,[2.14285714286],\n\nLet's call the speed of the current 'c'. \...,"[speed downstream = 15 + c, speed upstream = 1...",[2.14],"[a b = 15 + c, a d = 15 - c, e b = e d, f b / ..."
2,[1155.0],"\n\nTo calculate the number of students, we ne...","[a = b x c, a = 42 x 27.5, a = 1155]",[1155.0],"[a = b d d, a = 42 d 27.5, a = 1155]"
3,[14.0],\n\nLet's start by finding a general expressio...,"[Cost = 0.3x, Cost = 1.12 + 0.22x, 1.12 + 0.22...",[4.22],"[a = 0.3b, a = 1.12 + 0.22b, 1.12 + 0.22b < 0...."
4,"[38.0, 15.0]",\n\nLet's use M to represent Mike's age and S ...,"[M - S = 11, M + U = 53, S = M - 11, M + (U + ...",[34.0],"[a - b = 11, a + c = 53, b = a - 11, a + (c + ..."


In [66]:
# ------------------------------------------- #
#   A function for creating a SymPy 
#   expressions
# ------------------------------------------- #
def create_expression(text):
    split_text = text.split('=')
    expression = Eq(parse_expr(split_text[0], transformations=T[:6]), parse_expr(split_text[1], transformations=T[:6]))
    return expression


def quit_function(fn_name):
    # print to stderr, unbuffered in Python 2.
    sys.stderr.flush() # Python 3 stderr is likely buffered.
    thread.interrupt_main() # raises KeyboardInterrupt

# ------------------------------------------- #
#   A function that times out functions 
#   after a certain period of time
# ------------------------------------------- #
def exit_after(s):
    '''
    use as decorator to exit process if 
    function takes longer than s seconds
    '''
    def outer(fn):
        def inner(*args, **kwargs):
            timer = threading.Timer(s, quit_function, args=[fn.__name__])
            timer.start()
            try:
                result = fn(*args, **kwargs)
            finally:
                timer.cancel()
            return result
        return inner
    return outer

# ------------------------------------------- #
#   Solves a system of equations given in
#   a list of str
# ------------------------------------------- #
@exit_after(5)
def solve_system(equations):
    text = str(equations)
    text = wordninja.split(text)

    s = set()
    for t in text:
        if len(t) > 0 and t[0].isalpha(): s.add(t)
    
    try:
        expressions = [create_expression(equation) for equation in equations][0:len(s)]
        return solve(expressions)
    except:
        return {}

canonicalized_df["solved_ans"] = canonicalized_df.apply(lambda row : solve_system(row["canonicalized_equations"]), axis=1)
canonicalized_df.head(5)

Unnamed: 0,ground_solution,response,equations,answers,canonicalized_equations,solved_ans
0,"[83.0, 78.0]",\n\nLet x be the number of three-legged stools...,"[x + y = 161, 3x + 4y = 566, x = 161 - y, 3(16...","[78.0, 83.0]","[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16...","{a: 78, b: 83}"
1,[2.14285714286],\n\nLet's call the speed of the current 'c'. \...,"[speed downstream = 15 + c, speed upstream = 1...",[2.14],"[a b = 15 + c, a d = 15 - c, e b = e d, f b / ...","[{c: 15/7, a: 90/(7*d), e: 0, b: 4*d/3, f: 0}]"
2,[1155.0],"\n\nTo calculate the number of students, we ne...","[a = b x c, a = 42 x 27.5, a = 1155]",[1155.0],"[a = b d d, a = 42 d 27.5, a = 1155]","[{a: 1155.00000000000, b: 1155.00000000000, d:..."
3,[14.0],\n\nLet's start by finding a general expressio...,"[Cost = 0.3x, Cost = 1.12 + 0.22x, 1.12 + 0.22...",[4.22],"[a = 0.3b, a = 1.12 + 0.22b, 1.12 + 0.22b < 0....",{}
4,"[38.0, 15.0]",\n\nLet's use M to represent Mike's age and S ...,"[M - S = 11, M + U = 53, S = M - 11, M + (U + ...",[34.0],"[a - b = 11, a + c = 53, b = a - 11, a + (c + ...","{a: 53 - c, b: 42 - c}"


In [67]:
# ------------------------------------------- #
#   A helper function for comparing  
#   ChatGPT's response to the actual solution
# ------------------------------------------- #
def evaluate_response(actual_solution, response_solution, transform_func):
    actual_solution = [transform_func(float(solution)) for solution in actual_solution]
    response_solution = [transform_func(float(solution)) for solution in response_solution]

    actual_solution = set(actual_solution)
    response_solution = set(response_solution)
    
    if actual_solution.issubset(response_solution): return "all"
    elif actual_solution.intersection(response_solution): return "some"
    else: return "none"

def count_symbols(text, symbol): return str(text).count(symbol)

# ------------------------------------------- #
#   A function that extracts decimal numbers
#   from string
# ------------------------------------------- #
def extract_decimals(response : str):
    response = str(response)
    pattern = r'\d+(?:\.\d+)?'
    decimals = re.findall(pattern, response)
    decimals = [float(decimal) for decimal in decimals]
    return set(decimals)

features_df = canonicalized_df.copy()
features_df["extracted_solved_ans"] = features_df.apply(lambda row : extract_decimals(row["solved_ans"]), axis=1)
features_df["extracted_final_answer"] = features_df.apply(lambda row : extract_decimals(row["answers"]), axis=1)
features_df["normal_correct"] = features_df.apply(lambda row : evaluate_response(row["ground_solution"], row["answers"], lambda x : x), axis=1) 
features_df["rounded_correct"] = features_df.apply(lambda row : evaluate_response(row["ground_solution"], row["answers"], lambda x : round(x, 1)), axis=1)
features_df["final_solved_same"] = features_df.apply(lambda row : 1 if len(row["extracted_final_answer"].intersection(row["extracted_solved_ans"])) > 0 else 0, axis=1)

answer_map = {"all": 1, "some" : 1, "none": 0}
features_df['is_correct'] = features_df.apply(lambda row : answer_map[row["normal_correct"]], axis=1)
features_df["num_of_additions"] = features_df.apply(lambda row : count_symbols(row["equations"], '+'), axis=1)
features_df["num_of_subtractions"] = features_df.apply(lambda row : count_symbols(row["equations"], '-'), axis=1)
features_df["num_of_multiplications"] = features_df.apply(lambda row : count_symbols(row["equations"], '*'), axis=1)
features_df["num_of_divisions"] = features_df.apply(lambda row : count_symbols(row["equations"], '/'), axis=1)
features_df["num_of_equations"] = features_df.apply(lambda row : count_symbols(row["equations"], '='), axis=1)
features_df["num_of_parentheses"] = features_df.apply(lambda row : count_symbols(row["equations"], '('), axis=1)

features_df.head(3) 

Unnamed: 0,ground_solution,response,equations,answers,canonicalized_equations,solved_ans,extracted_solved_ans,extracted_final_answer,normal_correct,rounded_correct,final_solved_same,is_correct,num_of_additions,num_of_subtractions,num_of_multiplications,num_of_divisions,num_of_equations,num_of_parentheses
0,"[83.0, 78.0]",\n\nLet x be the number of three-legged stools...,"[x + y = 161, 3x + 4y = 566, x = 161 - y, 3(16...","[78.0, 83.0]","[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16...","{a: 78, b: 83}","{83.0, 78.0}","{83.0, 78.0}",all,all,1,1,5,3,0,0,8,1
1,[2.14285714286],\n\nLet's call the speed of the current 'c'. \...,"[speed downstream = 15 + c, speed upstream = 1...",[2.14],"[a b = 15 + c, a d = 15 - c, e b = e d, f b / ...","[{c: 15/7, a: 90/(7*d), e: 0, b: 4*d/3, f: 0}]","{0.0, 3.0, 4.0, 7.0, 15.0, 90.0}",{2.14},none,all,0,0,4,4,0,4,9,4
2,[1155.0],"\n\nTo calculate the number of students, we ne...","[a = b x c, a = 42 x 27.5, a = 1155]",[1155.0],"[a = b d d, a = 42 d 27.5, a = 1155]","[{a: 1155.00000000000, b: 1155.00000000000, d:...","{1.0, 1155.0}",{1155.0},all,all,1,1,0,0,0,0,3,0


In [68]:
model_df = features_df[[
    # "num_of_unknowns", 
    "num_of_additions", 
    "num_of_subtractions", 
    "num_of_multiplications", 
    "num_of_divisions", 
    "num_of_equations", 
    "num_of_parentheses", 

    "final_solved_same",
    "is_correct",
]]
model_df = model_df.dropna()

print("Dataframe LEN:" + str(len(model_df)))
model_df.head()

Dataframe LEN:1000


Unnamed: 0,num_of_additions,num_of_subtractions,num_of_multiplications,num_of_divisions,num_of_equations,num_of_parentheses,final_solved_same,is_correct
0,5,3,0,0,8,1,1,1
1,4,4,0,4,9,4,0,0
2,0,0,0,0,3,0,1,1
3,3,0,0,0,6,2,0,0
4,7,7,0,0,9,3,0,0


In [69]:
train, test = train_test_split(model_df, test_size=TEST_SIZE, random_state=RANDOM_STATE)

train_x = train.drop(columns=["is_correct"])
test_x = test.drop(columns=["is_correct"])

train_y = train["is_correct"]
test_y = test[["is_correct"]]

display(train_x.head())
display(train_y.head())

Unnamed: 0,num_of_additions,num_of_subtractions,num_of_multiplications,num_of_divisions,num_of_equations,num_of_parentheses,final_solved_same
29,2,0,0,0,5,0,0
535,4,0,0,0,5,1,1
695,2,0,0,0,8,1,1
557,5,3,0,0,9,1,1
836,0,2,0,6,9,5,0


29     1
535    1
695    1
557    1
836    1
Name: is_correct, dtype: int64

In [70]:
# =========================================== #
#    Sklearn Helper
# =========================================== #
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params={}):
        self.clf = clf
        self.params = params

    def fit(self, x_train, y_train):
        self.fitted = self.clf(**self.params).fit(x_train, y_train)
    
    def predict(self, x):
        return self.fitted.predict(x)
    
    def predict_proba(self, x):
        return self.fitted.predict_proba(x)
    
    def feature_importance(self):
        return self.fitted.feature_importances_
    
    def tune(self, x_train, y_train, params=None):
        
        grid = GridSearchCV(self.clf(), param_grid=params)
        grid.fit(x_train, y_train)

        self.params = grid.best_params_

        print(self.params)
        pass

    def get_model(self):
        return self.clf(**self.params)

In [71]:
# =========================================== #
#    Cross-validate models
# =========================================== #
def cross_validate(model, x_data, y_data, n_splits, random_state=42, shuffle=True):
    cross_fold = KFold(n_splits, random_state=random_state, shuffle=shuffle)

    cross_validation_results = []
    for i, (train_index, test_index) in enumerate(cross_fold.split(x_data)):
        x_train, y_train = x_data.iloc[train_index], y_data.iloc[train_index]
        x_test, y_test = x_data.iloc[test_index], y_data.iloc[test_index]

        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)

        report = classification_report(y_test, y_pred, target_names=list(set(answer_map.values())), output_dict=True)
        cross_validation_results.append(report)

    average_results = {}
    for result in cross_validation_results:
        for key1 in list(set(answer_map.values())):
            average_results[key1] = {}
            for key2 in ['precision', 'recall', 'f1-score', 'support']:
                average_results[key1][key2] = 0

    for result in cross_validation_results:
        for key1 in list(set(answer_map.values())):
            for key2 in ['precision', 'recall', 'f1-score', 'support']:
                average_results[key1][key2] += result[key1][key2]
    
    for key1 in list(set(answer_map.values())):
        for key2 in ['precision', 'recall', 'f1-score', 'support']:
            average_results[key1][key2] /= n_splits
        
    return pandas.DataFrame(average_results)

def cross_validate_model(classifier):
    return cross_validate(classifier, train_x, train_y, N_SPLITS, RANDOM_STATE, True)

In [72]:
results_dict = {}

In [73]:
# =========================================== #
#    Random Forest
# =========================================== #
random_forest = SklearnHelper(RandomForestClassifier)

random_forest.tune(train_x, train_y, params={
    "random_state" : [RANDOM_STATE],
    'n_jobs': [-1],
    'n_estimators': [100],
    'max_depth': [6],
    'min_samples_leaf': [2],
    'max_features': ['sqrt'],
    'class_weight': ['balanced_subsample']
}
)

random_forest_results = cross_validate_model(random_forest)
results_dict.update(RandomForestClassifier=random_forest_results)
pandas.DataFrame(random_forest_results)

{'class_weight': 'balanced_subsample', 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'n_estimators': 100, 'n_jobs': -1, 'random_state': 42}


Unnamed: 0,0,1
precision,0.553447,0.94715
recall,0.89748,0.718096
f1-score,0.684222,0.816544
support,44.8,115.2


In [74]:
# =========================================== #
#    Extra Trees Classifier
# =========================================== #
extra_trees = SklearnHelper(ExtraTreesClassifier)
extra_trees.tune(train_x, train_y, params={
    "random_state" : [RANDOM_STATE],
    'n_jobs': [-1],
    'n_estimators': [100],
    'max_depth': [6],
    'min_samples_leaf': [2],
    'max_features': ['sqrt'],
    'class_weight': ['balanced_subsample']
})
extra_trees_results = cross_validate_model(extra_trees)
results_dict.update(ExtraTreesClassifier=extra_trees_results)
pandas.DataFrame(extra_trees_results)

{'class_weight': 'balanced_subsample', 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'n_estimators': 100, 'n_jobs': -1, 'random_state': 42}


Unnamed: 0,0,1
precision,0.563374,0.958156
recall,0.92047,0.721711
f1-score,0.698222,0.822862
support,44.8,115.2


In [75]:
# =========================================== #
#    XGBoost
# =========================================== #
xgboost = SklearnHelper(XGBClassifier, seed=RANDOM_STATE)
xgboost.tune(train_x, train_y, params={
    "random_state" : [RANDOM_STATE],
    'learning_rate': [0.1],
    'n_jobs': [-1],
    'n_estimators': [100],
    'max_depth': [6],
    'subsample': [0.3],
})
xgboost_results = cross_validate_model(xgboost)
results_dict.update(XGBClassifier=xgboost_results)
pandas.DataFrame(xgboost_results)

{'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 100, 'n_jobs': -1, 'random_state': 42, 'subsample': 0.3}


Unnamed: 0,0,1
precision,0.603952,0.843595
recall,0.597037,0.848934
f1-score,0.594791,0.845032
support,44.8,115.2


In [76]:
# =========================================== #
#    Gradient Boost
# =========================================== #
gradient_boost = SklearnHelper(GradientBoostingClassifier, seed=RANDOM_STATE)
gradient_boost.tune(train_x, train_y, params={
    "random_state" : [RANDOM_STATE],
    'learning_rate': [0.1],
    'n_estimators': [100],
    'max_depth': [6],
    'subsample': [0.3],
})
gradient_boost_results = cross_validate_model(gradient_boost)
results_dict.update(GradientBoostingClassifier=gradient_boost_results)
pandas.DataFrame(gradient_boost_results)

{'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 100, 'random_state': 42, 'subsample': 0.3}


Unnamed: 0,0,1
precision,0.602532,0.835839
recall,0.566581,0.855625
f1-score,0.582545,0.84533
support,44.8,115.2


In [77]:
# =========================================== #
#    Gradient Boost
# =========================================== #
k_neighbors = SklearnHelper(KNeighborsClassifier, seed=RANDOM_STATE)
k_neighbors.tune(train_x, train_y, params={
    "n_neighbors": [5],
    "weights": ['uniform'],
    "algorithm": ['auto'],
    'leaf_size': [10],
    'p': [2]
})
k_neighbors_results = cross_validate_model(k_neighbors)
results_dict.update(KNeighborsClassifier=k_neighbors_results)
pandas.DataFrame(k_neighbors_results)


{'algorithm': 'auto', 'leaf_size': 10, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}


Unnamed: 0,0,1
precision,0.556848,0.767055
recall,0.295388,0.907425
f1-score,0.38228,0.830925
support,44.8,115.2


In [78]:
# =========================================== #
#    Ada Boost
# =========================================== #
adaboost = SklearnHelper(AdaBoostClassifier, seed=RANDOM_STATE)
adaboost.tune(train_x, train_y, params={
    "random_state" : [RANDOM_STATE],
    'learning_rate': [0.1],
    'n_estimators': [100],
})
adaboost_results = cross_validate_model(adaboost)
results_dict.update(AdaBoostClassifier=adaboost_results)
pandas.DataFrame(adaboost_results)

{'learning_rate': 0.1, 'n_estimators': 100, 'random_state': 42}


Unnamed: 0,0,1
precision,0.619537,0.854487
recall,0.625036,0.849593
f1-score,0.611116,0.849461
support,44.8,115.2


In [79]:
# =========================================== #
#    Ada Boost
# =========================================== #
mlp = SklearnHelper(MLPClassifier, seed=RANDOM_STATE)
mlp.tune(train_x, train_y, params={
    'hidden_layer_sizes': [(64, 32, 16, 8, 4)],
    'max_iter': [1000]
})
mlp_results = cross_validate_model(mlp)
results_dict.update(MLPClassifier=adaboost_results)
pandas.DataFrame(adaboost_results)

{'hidden_layer_sizes': (64, 32, 16, 8, 4), 'max_iter': 1000}


Unnamed: 0,0,1
precision,0.619537,0.854487
recall,0.625036,0.849593
f1-score,0.611116,0.849461
support,44.8,115.2


In [80]:
estimators = [
    ('rf', random_forest.get_model()),
    ('xgb', xgboost.get_model()),
    ('knn', k_neighbors.get_model()),
    ('ab', adaboost.get_model())
]

lgclassifier = AdaBoostClassifier(random_state=RANDOM_STATE)

stacked = SklearnHelper(StackingClassifier, params={
    "estimators": estimators,
    "final_estimator": lgclassifier,
    "cv": 5})


stacked_results = cross_validate_model(stacked)
results_dict.update(StackingClassifier=stacked_results)
pandas.DataFrame(stacked_results)

Unnamed: 0,0,1
precision,0.612507,0.83158
recall,0.551515,0.866523
f1-score,0.57634,0.847889
support,44.8,115.2


In [81]:
# %load causality_analysis/calculate_causality.py
import pandas as pd

MAX_NAME_DIFFERENCE = 3

class Causality:
    @staticmethod
    def causality_wrapper(input_df, VALID_COLUMN, EFFECT_COLUMN):
        def negation(column):
            # negation -- 
            # OUTPUT: returns a column of 0s and 1s of the negation of [column]. 1s are flipped to 0 and vice versa
            # INPUT: [column] should be a column of 0s and 1s
            return 1 - column
            
            """"""
        def conjunction(column_1, column_2):
            # conjunction -- 
            # output: returns a column of 0s and 1s of the conjunction between [column_1] and [column_2].
            # INPUT: [column_1] and [column_2] should be columns of 0s and 1s
            return column_1 * column_2
            
            """"""
        def disjunction(column_1, column_2):
            # disjunction -- 
            # OUTPUT: returns a column of 0s and 1s of the disjunction between [column_1] and [column_2].
            # INPUT: [column_1] and [column_2] should be columns of 0s and 1s
            return column_1 | column_2
            
            """"""
        def conditional_probability(occurence_column, condition_column):
            # conditional_probability -- 
            # OUTPUT: returns a number which represents the conditional probability p(occurence | condition)
            # INPUT: [occurence_column] and [condition_column] should be columns of 0s and 1s
            if condition_column.sum() == 0: return 0
            return conjunction(occurence_column, condition_column).sum() / condition_column.sum()
            
            """"""
        def prior(data):
            # prior -- 
            # OUTPUT: returns a number which represents the prior
            # INPUT: [data] should be a Pandas dataframe with the columns [CORRECT_COLUMN] and [VALID_COLUMN].
            # TODO : Possible optimizations can be made where we cache the result instead of calling this expensive operation again and again
            return conditional_probability(data[EFFECT_COLUMN], data[VALID_COLUMN])
            
            """"""
        def is_prima_facie(data, column_name):
            # is_prima_facie -- 
            # OUTPUT: returns a boolean which determines whether the column indicated by [column_name] is a prima facie
            # INPUT: [data] should be a Pandas dataframe with the columns [CORRECT_COLUMN] and [VALID_COLUMN].
            # INPUT: [column_name] should be a valid column in [data]
            # INPUT: The [CORRECT_COLUMN] and [VALID_COLUMN] columns should be columns of 0s and 1s 
            return conditional_probability(data[EFFECT_COLUMN], data[column_name]) - prior(data) > 0
            
            """"""
        def is_cooccur(column_1, column_2):
            # is_cooccur -- 
            # OUTPUT: returns a boolean based on if there is at least one row where both [column_1] and [column_2] is equal to 1
            # INPUT: [column_1] and [column_2] should both be columns of 0s and 1s
            return conjunction(column_1, column_2).sum() > 0
            
            """"""
        def is_same_category(column_name_1, column_name_2):
            # same_category -- 
            # OUTPUT: Returns a boolean signifying whether the [column_name_1] and [column_name_2] are different by [MAX_NAME_DIFFERENCE]
            #         If the two words are not different by [MAX_NAME_DIFFERENCE], they are in the same category so it returns true
            count = 0
            shortest = min(len(column_name_1), len(column_name_2))
            for i in range(0, shortest):
                if column_name_1[i] == column_name_2[i]:
                    count = count + 1
            return max(len(column_name_1), len(column_name_2)) - count < MAX_NAME_DIFFERENCE
            
            """"""
        def rel(data, column_name):
            # rel -- 
            # OUTPUT: returns a list of the names of other columns which cooccur with [column_name] and are prima facie
            # INPUT: [data] should be a Pandas dataframe with the columns [CORRECT_COLUMN] and [VALID_COLUMN].
            # INPUT: [column_name] should be a valid column in [data]
            # INPUT: The [CORRECT_COLUMN] and [VALID_COLUMN] columns should be columns of 0s and 1s 
            
            # # If it is not a prima facie cause, we don't bother to find its rel
            if not is_prima_facie(data,column_name): return[]
                
            if column_name in [VALID_COLUMN, EFFECT_COLUMN]: return []
            
            name_list = []
            for potential_cause in data.columns:
                # Make sure we are not including the [CORRECT_COLUMN] and [VALID_COLUMN] as part of rel
                if potential_cause in [EFFECT_COLUMN, VALID_COLUMN, column_name]: continue

                if is_same_category(potential_cause, column_name): continue

                if is_cooccur(data[column_name], data[potential_cause]) and is_prima_facie(data, potential_cause):
                    name_list.append(potential_cause)
            return name_list
            
            """"""
        def calculate_causality(data, column_name):
            # calculate_causality -- 
            # OUTPUT: returns a number which represents the causality value of the column indicated by [column_name]
            # INPUT: [data] should be a Pandas dataframe with the columns [CORRECT_COLUMN].
            # INPUT: [column_name] should be a valid column in [data]
            # INPUT: The [CORRECT_COLUMN] and [VALID_COLUMN] columns should be columns of 0s and 1s 

            # If it's not a prima facie cause, we don't bother to calculate its causality value
            if not is_prima_facie(data, column_name):
                return "n/a"

            relateds = rel(data, column_name)
            total_probability = 0
            for related in relateds:
                conj = conjunction(data[column_name], data[related])
                negj = conjunction(negation(data[column_name]), data[related])

                k = data[column_name].sum() / len(data)
                conj = conditional_probability(data[EFFECT_COLUMN], conj)
                negj = conditional_probability(data[EFFECT_COLUMN], negj)

                # total_probability += k * (conj - negj)
                total_probability += (conj - negj)

            if (len(relateds) > 0): return total_probability / len(relateds)
            return total_probability
            
            """"""
        def is_binary_column(data, column_name):
            # is_binary_column --
            # Checks to see if a column is a column of 1s and 0s
            # INPUT: [data] is a dataframe
            # INPUT: [column_name] should be the name of a valid column in [data]
            return data.apply(lambda row : 0 if (isinstance(row[column_name], int) and (row[column_name] <= 1)) else 1, axis=1).sum() <= 0
            
            """"""
        def remove_non_binary_columns(data):
            # remove_non_binary_columns --
            # Removes all columns that are not 0s or 1s in the dataset
            # INPUT: [data] is a dataframe
            non_binary = []
            for i in data.columns:
                if i in [EFFECT_COLUMN, VALID_COLUMN]: continue
                if not is_binary_column(data, i):
                    non_binary.append(i)

            return data.drop(columns=non_binary)
            
            """"""
        def generate_row(data, column_name):
            # generate_row --
            # TODO: This is kind of a terrible name but I can't really think of anything more descriptive. If anyone has any ideas, feel free to modify it
            # It basically creates a row, which is actually a data frame with all the data that is needed
            # OUTPUT: It outputs a row with all the required values
            # INPUT: [data] should be a dataframe
            # INPUT: [column_name] should be a string representing a valid column in [data]
            toReturn = pd.DataFrame({
                "name": [column_name], 
                "support": conjunction(data[column_name], data[VALID_COLUMN]).sum(),
                "causality": [calculate_causality(data, column_name)],
                "rel": ','.join(rel(data, column_name)),
                "conditional_probability":[conditional_probability(data[EFFECT_COLUMN], data[column_name])], 
                "prior": prior(data),
                "conditional - prior": conditional_probability(data[EFFECT_COLUMN], data[column_name]) - prior(data)
            })
            return toReturn
            
            """"""

        def causality_values(input_df):
            # causality_values --
            # Calculates causality values

            # Then remove all the non binary columns
            # input_df = remove_non_binary_columns(input_df)

            # TODO: This is a hack
            # short_names = []
            # for column in input_df.columns:
            #     if len(column) < 5 and column != VALID_COLUMN and column != EFFECT_COLUMN: short_names.append(column)
            # input_df = input_df.drop(columns=short_names, axis=1)

            # TODO: I'm not sure if there's another way to do this, so feel free to make modifications
            # Generate a dud data frame with a single so we can append to it.
            to_save = generate_row(input_df, VALID_COLUMN)
            for column in input_df.columns:
                if column in [VALID_COLUMN, EFFECT_COLUMN]: continue
                to_save = to_save.append(generate_row(input_df, column))

            # Remove the dud first row
            to_save = to_save[1:]
            return to_save

            """"""

        to_return = causality_values(input_df)
        return to_return

In [82]:
def generate_geq_column(dataframe, column_name):
    i = 1
    while True:
        res = dataframe.apply(lambda row : 1 if row[column_name] >= i else 0, axis=1)
        if res.sum() == 0: break
        dataframe[f"{column_name}_geq_{i}"] = res
        i += 1
    return dataframe

def generate_geq_columns(dataframe, column_names):
    for column_name in column_names:
        dataframe = generate_geq_column(dataframe, column_name)
    return dataframe

causality_df = model_df.copy()
causality_df["is_correct"] = causality_df.apply(lambda row : 1 - row["is_correct"], axis=1)
causality_df["final_solved_different"] = causality_df.apply(lambda row : 1 - row["final_solved_same"], axis=1)
causality_df = causality_df.drop(columns=["final_solved_same"])

causality_df["num_of_additions_and_subtractions"] = causality_df["num_of_additions"] + causality_df["num_of_subtractions"]
causality_df["num_of_divisions_and_multiplications"] = causality_df["num_of_divisions"] + causality_df["num_of_multiplications"]
causality_df = generate_geq_columns(causality_df, [
    "num_of_additions_and_subtractions", 
    "num_of_divisions_and_multiplications", 
    "num_of_multiplications", 
    "num_of_divisions", 
    "num_of_equations", 
    "num_of_parentheses", 
])

causality_df = causality_df.drop(columns=[
    "num_of_additions_and_subtractions", 
    "num_of_divisions_and_multiplications", 
    "num_of_additions",
    "num_of_subtractions",
    "num_of_multiplications", 
    "num_of_divisions", 
    "num_of_equations", 
    "num_of_parentheses", 
])

causality_df['valid'] = 1
causality_df = Causality.causality_wrapper(causality_df, 'valid', 'is_correct')
causality_df

Unnamed: 0,name,support,causality,rel,conditional_probability,prior,conditional - prior
0,final_solved_different,463,0.759686,"num_of_additions_and_subtractions_geq_5,num_of...",0.555076,0.28,0.275076
0,num_of_additions_and_subtractions_geq_1,975,,,0.275897,0.28,-0.004103
0,num_of_additions_and_subtractions_geq_2,903,,,0.27907,0.28,-0.00093
0,num_of_additions_and_subtractions_geq_3,764,,,0.278796,0.28,-0.001204
0,num_of_additions_and_subtractions_geq_4,658,,,0.278116,0.28,-0.001884
0,num_of_additions_and_subtractions_geq_5,553,0.403503,"final_solved_different,num_of_divisions_and_mu...",0.282098,0.28,0.002098
0,num_of_additions_and_subtractions_geq_6,466,,,0.274678,0.28,-0.005322
0,num_of_additions_and_subtractions_geq_7,346,0.524248,"final_solved_different,num_of_divisions_and_mu...",0.294798,0.28,0.014798
0,num_of_additions_and_subtractions_geq_8,243,0.517421,"final_solved_different,num_of_divisions_and_mu...",0.337449,0.28,0.057449
0,num_of_additions_and_subtractions_geq_9,145,0.469156,"final_solved_different,num_of_divisions_and_mu...",0.37931,0.28,0.09931


In [83]:
chatgpt_stats = {}
chatgpt_stats.update(some_correct={
    'count': features_df.apply(lambda row : 1 if row["normal_correct"] == 'some' else 0, axis=1).sum(),
    'definition': "ChatGPT's response  mentioned some of the numbers in the solution"
})
chatgpt_stats.update(all_correct={
    'count': features_df.apply(lambda row : 1 if row["normal_correct"] == 'all' else 0, axis=1).sum(),
    'definition': "ChatGPT's response mentioned all the numbers in the solution"
})
chatgpt_stats.update(some_correct_rounded={
    'count': features_df.apply(lambda row : 1 if row["rounded_correct"] == 'some' and row["normal_correct"] == 'none' else 0, axis=1).sum(),
    'definition':"ChatGPT's response mentioned some of the numbers in the solution when both answers and ChatGPT's solutions were rounded."
})

chatgpt_stats.update(all_correct_rounded={
    'count': features_df.apply(lambda row : 1 if row["rounded_correct"] == 'all'  and row["normal_correct"] == 'none' else 0, axis=1).sum(),
    'definition': "ChatGPT's response mentioned all the numbers in the solution when both answers and ChatGPT's solutions were rounded."
})

chatgpt_stats.update(none_correct={
    'count': features_df.apply(lambda row : 1 if row["normal_correct"] == 'none' and row["rounded_correct"] == 'none' else 0, axis=1).sum(),
    'definition': "ChatGPT's response did not mention any of the numbers in the solution."
})

In [84]:
from xlsx_creation.xlsx_writer import XlsxWriter
XlsxWriter.write_xlsx(stats_obj = chatgpt_stats,causality_df=causality_df, output_file_path=XLSX_OUTPUT_FILE_PATH, description="JSON", results_dict=results_dict, input_df=None)