### **ChatGPT MWP performance prediction**
Large language models (LLMs) have gained much popularity in recent years, with OpenAI's GPT-3 series models considered the state of the art. In particular, the variant of GPT-3 tuned for natural dialog, known as ChatGPT, has attracted wide interest. However, LLMs have known performance issues, specifically when reasoning tasks are involved. This project aims to investigate aspects of math word problems (MWPs) that can indicate the success or failure of ChatGPT on such problems.
  
In this notebook in particular, we extract equations from ChatGPT's responses and use classifiers to predict ChatGPT's performance on specific questions in DRAW-1K.  
  

### **Download libraries**
In order to replicate the results produced in this notebook, it is recommended to use the exact version of Python as well as the exact versions of each library.  
We first download the libraries that will be used in this notebook. We specify the exact version of each library to download.

In [53]:
%%capture

# =========================================== #
#               Requirements
# ------------------------------------------- #
# - Python 3.7.9
# =========================================== #

%pip install nltk==3.8.1
%pip install pandas==1.3.5
%pip install sympy==1.10.1
%pip install plotly==5.13.0
%pip install xgboost==1.6.2

# %pip install xlsxwriter==3.0.9
# %pip install scikit-learn==1.0.2
# %pip install beautifulsoup4==4.11.2
# %pip install torch==1.13.1
# %pip install transformers==4.27.4
# %pip install tqdm==4.64.1

### **Load libraries**

In [54]:
# =========================================== #
#                Libraries
# =========================================== #

# ------------------------------------------- #
#   Python
# ------------------------------------------- #
import re
import os
import ast
import sys
import random
import threading
from time import sleep

try:
    import thread
except ImportError:
    import _thread as thread

import wordninja

# ------------------------------------------- #
#   Pandas
# ------------------------------------------- #
import pandas
pandas.options.display.max_rows = 4000

# ------------------------------------------- #
#   Plotly
# ------------------------------------------- #
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px

# ------------------------------------------- #
#   Sympy
# ------------------------------------------- #
from sympy.parsing.sympy_parser import parse_expr
from sympy.parsing.sympy_parser import transformations
from sympy.parsing.sympy_parser import T
from sympy import Eq
from sympy import solve

# ------------------------------------------- #
#   Sklearn
# ------------------------------------------- #
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn import metrics

from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# ------------------------------------------- #
#   Matplotlib
# ------------------------------------------- #
import matplotlib.pyplot as plt

# ------------------------------------------- #
#   XGBoost
# ------------------------------------------- #
from xgboost import XGBClassifier

# ------------------------------------------- #
#   Tensorflow
# ------------------------------------------- #
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import activations

# ------------------------------------------- #
#   Seaborn
# ------------------------------------------- #
import seaborn

# ------------------------------------------- #
#   XlsxWriter
# ------------------------------------------- #
import xlsxwriter

# ------------------------------------------- #
#   Numpy
# ------------------------------------------- #
import numpy

### **Constants**
Here, we configure the values required for one run-through of the notebook.   
This should be the only part of the notebook you need to change from run to run.

In [55]:
RESPONSE_FILE_PATH = '../data/hwelsters__gpt-3.5-turbo__v001 (prefix__system_of_equations).jsonl'
QUESTION_FILE_PATH = '../data/draw.json'
TEST_SIZE = 0.2
RANDOM_STATE = 42
N_SPLITS = 5
VISUALIZE_DATA = True
XLSX_OUTPUT_FILE_PATH = 'sympy_predict.xlsx'

### **Set Seed for RNGs**
We set the seeds of the RNGs to ensure consistency from run to run.

In [56]:
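# Note: PYTHONHASHSEED is read at interpreter startup, so setting it here only affects child processes
os.environ['PYTHONHASHSEED']=str(RANDOM_STATE)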
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'  # new flag present in tf 2.0+
random.seed(RANDOM_STATE)
numpy.random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)

### **Load data**

In [57]:
# =========================================== #
#                Load data
# =========================================== #

# ------------------------------------------- #
#   Function that loads data into
#   a dataframe based on its file extension.
#   Currently accepts .json, .jsonl, .csv
# ------------------------------------------- #
def load_file(path):
    split_path = os.path.splitext(path)
    file_extension = split_path[-1]

    data = pandas.DataFrame()
    if file_extension == '.json': data = pandas.read_json(path)
    if file_extension == '.jsonl': data = pandas.read_json(path, lines=True)
    if file_extension == '.csv': data = pandas.read_csv(path)

    return data


# ------------------------------------------- #
#   Helper function that loads data stored in 
#   a file at a particular file-path into a 
#   pandas dataframe, extracting only the 
#   columns specified.
# ------------------------------------------- #
def load_data(path, sample_size=5, columns=None, label=None):
    # sample_size and label are currently unused
    dataframe = load_file(path)
    if columns is None: return dataframe
    # Keep only the requested columns that actually exist in the file
    available_columns = [column for column in columns if column in dataframe.columns]
    return dataframe[available_columns]

In [58]:
equations_df = load_file('../data/equations.jsonl')
final_answer_df = load_file('../data/answers.jsonl')
response_df = load_file('../data/response.jsonl')
ground_df = load_file('../data/ground.json')

display(equations_df.head(3))
display(final_answer_df.head(3))
display(response_df.head(3))
display(ground_df.head(3))

Unnamed: 0,model,temperature,max_tokens,date_time,question_number,question,system_text,response,prompt_tokens,completion_tokens,total_tokens
0,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:37:10,52,Extract all equations from this text and canon...,Extract all equations from this text and canon...,This is the correct solution.,152,6.0,158
1,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:37:11,36,Extract all equations from this text and canon...,Extract all equations from this text and canon...,"I'm sorry, but there is no associated text to ...",192,22.0,214
2,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:37:12,46,Extract all equations from this text and canon...,Extract all equations from this text and canon...,The input text was not provided. Please provid...,177,16.0,193


Unnamed: 0,model,temperature,max_tokens,date_time,question_number,question,system_text,response,prompt_tokens,completion_tokens,total_tokens
0,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:22:11,10,Extract the final answer from this text. Outp...,Extract the final answer from this text. Outp...,"{""answers"": [1800.0, 4200.0]}",286,15,301
1,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:22:11,7,Extract the final answer from this text. Outp...,Extract the final answer from this text. Outp...,"{""answers"": [10.0, 17.0]}",286,13,299
2,gpt-3.5-turbo-0301,1,2048,2023-04-13 17:22:11,18,Extract the final answer from this text. Outp...,Extract the final answer from this text. Outp...,"{""answers"":[4.0]}",280,7,287


Unnamed: 0,model,date_time,question_number,question,response
0,gpt-3.5-turbo-0301,2023-09-03 12:52:57,1,A factory makes three-legged stools and four-l...,\n\nLet x be the number of three-legged stools...
1,gpt-3.5-turbo-0301,2023-09-03 12:52:57,0,Juniors boat will go 15 miles per hour in stil...,\n\nLet's call the speed of the current 'c'. \...
2,gpt-3.5-turbo-0301,2023-09-03 12:53:00,3,The student-teacher ratio for Washington High ...,"\n\nTo calculate the number of students, we ne..."


Unnamed: 0,sQuestion,lSolutions,Template,lEquations,iIndex,Alignment,Equiv
0,Juniors boat will go 15 miles per hour in stil...,[2.14285714286],[a * m + b * m = b * c - a * c],[12*(15-x)=9*(15+x)],397760,"[{'coeff': 'a', 'SentenceId': 1, 'Value': 9.0,...",[]
1,A factory makes three-legged stools and four-l...,"[83.0, 78.0]","[a * m + b * n = c, m + n = d]","[student+general=161, 3*student+4*general=566]",327651,"[{'coeff': 'a', 'SentenceId': 0, 'Value': 3.0,...",[]
2,a bank offers two checking plans . The anywher...,[14.0],[0.01 * a * m - 0.01 * b * m = c],[(.01*30)*x=(.01*22)*x+1.12],238992,"[{'coeff': 'a', 'SentenceId': 1, 'Value': 30.0...",[]


### **Convert ChatGPT's JSON response to Dataframe**

In [59]:
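def extract_json(text : str, question_number):
    try: 
        # Greedy match with DOTALL: everything from the first '{' to the last '}'
        text = re.findall(r'\{.*\}', text, re.DOTALL)[0]
        # Parse as a Python literal (JSON true/false/null would fail to parse here)
        body = ast.literal_eval(text)
        body["question_number"] = question_number
        return body
    # Any failure (no braces found, unparsable text, non-dict result) marks the row as failed
    except Exception: return "FAILED"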


e_equations_df = equations_df.apply(lambda row : extract_json(row['response'], row["question_number"]), axis=1)
e_final_answer_df = final_answer_df.apply(lambda row : extract_json(row['response'], row["question_number"]), axis=1)

e_equations_df = e_equations_df[e_equations_df != 'FAILED']
e_final_answer_df = e_final_answer_df[e_final_answer_df != 'FAILED']

e_equations_df = pandas.DataFrame(e_equations_df.to_list())
e_final_answer_df = pandas.DataFrame(e_final_answer_df.to_list())

e_response_df = response_df[["response", "question_number"]]

display(e_equations_df.head(3))
display(e_response_df.head(3))
display(e_final_answer_df.head(3))

Unnamed: 0,equations,question_number
0,"[a - b = 36, a + b = 62, a = b + 36, (b + 36) ...",9
1,"[x = v, 30 - x = w, 0.01x + 0.05(30 - x) = 0.0...",15
2,"[a + b = 81, a - b = 9, 2a = 90, a = 45, a + b...",19


Unnamed: 0,response,question_number
0,\n\nLet x be the number of three-legged stools...,1
1,\n\nLet's call the speed of the current 'c'. \...,0
2,"\n\nTo calculate the number of students, we ne...",3


Unnamed: 0,answers,question_number
0,"[1800.0, 4200.0]",10
1,"[10.0, 17.0]",7
2,[4.0],18
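
The extraction step above can be tried in isolation. The sketch below is a simplified, standard-library-only variant and not the notebook's exact function: it assumes `json.loads` in place of `ast.literal_eval`, and the helper name `extract_json_sketch` is hypothetical. The sample strings are taken from the responses displayed above.

```python
import json
import re

def extract_json_sketch(text, question_number):
    """Pull the outermost {...} span out of a model response and parse it."""
    # Greedy match with DOTALL: from the first '{' to the last '}', across newlines
    matches = re.findall(r'\{.*\}', str(text), re.DOTALL)
    if not matches:
        return "FAILED"
    try:
        body = json.loads(matches[0])
    except json.JSONDecodeError:
        return "FAILED"
    # Tag the parsed object so it can be joined back to its question later
    body["question_number"] = question_number
    return body

parsed = extract_json_sketch('{"answers": [1800.0, 4200.0]}', 10)
failed = extract_json_sketch('This is the correct solution.', 52)
wrapped = extract_json_sketch('Sure:\n{"answers":[4.0]}\nDone.', 18)
print(parsed)
print(failed)
print(wrapped)
```

Responses that contain no braces at all, such as the first rows of the equations table, end up as `"FAILED"` and are filtered out before the dataframe is built.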


In [60]:
def combine(question_number):
    ground_val = ground_df.loc[question_number]["lSolutions"]
    response_val = e_response_df[e_response_df["question_number"] == question_number]
    equations_val = e_equations_df.loc[e_equations_df["question_number"] == question_number]
    final_answer_val = e_final_answer_df.loc[e_final_answer_df["question_number"] == question_number]

    if len(response_val) > 0: response_val = response_val.iloc[0]["response"]
    else: response_val = ""

    if len(equations_val) > 0: equations_val = equations_val.iloc[0]["equations"]
    else: equations_val = []

    if len(final_answer_val) > 0: final_answer_val = final_answer_val.iloc[0]["answers"]
    else: final_answer_val = []

    return {"ground_solution" : ground_val, "response" : response_val, "equations": equations_val, "answers" : final_answer_val}

combined_df = response_df.apply(lambda row : combine(row["question_number"]), axis=1).to_list()
combined_df = pandas.DataFrame(combined_df)
combined_df.head(3)

Unnamed: 0,ground_solution,response,equations,answers
0,"[83.0, 78.0]",\n\nLet x be the number of three-legged stools...,"[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16...","[78.0, 83.0]"
1,[2.14285714286],\n\nLet's call the speed of the current 'c'. \...,"[s_d = 15 + c, s_u = 15 - c, t_d = t_u, d_d / ...",[2.14]
2,[1155.0],"\n\nTo calculate the number of students, we ne...",[],[1155.0]


In [61]:
# ------------------------------------------- #
#   This function standardizes ChatGPT's 
#   equations.
#   
#   e.g 'tons of hay + tons of wheat = 10'
#       --> a + b = 10
#   
#   In some cases, ChatGPT returns a
#   dict of equations
#   In this case, we just get the values or 
#   the keys of this dictionary depending on 
#   which one has an equal sign
# ------------------------------------------- #
def canonicalize_equations(texts):  
    # If ChatGPT did not give equations at all, return None
    if (str(texts) == "nan"): return None

    # CHATGPT sometimes gives equations as dictionaries rather than arrays. 
    # As such, we do this instead
    
    if type(texts) is dict: 
        new_texts = []
        for text in texts.values(): 
            if str(text).count('=') > 0: new_texts.append(text)
        for text in texts.keys(): 
            if str(text).count('=') > 0: new_texts.append(text)
        texts = new_texts

    if type(texts) is list and len(texts) > 0:
        new_texts = []
        for text in texts:
            if type(text) is dict:
                for t in text.values(): 
                    if str(t).count('=') > 0: new_texts.append(t)
                for t in text.keys(): 
                    if str(t).count('=') > 0: new_texts.append(t)
            elif type(text) is str: new_texts.append(text)
        texts = new_texts
        # print("NEW TEXT: ", texts)
    
    # Converts it all to a list of str
    if type(texts) is set: texts = list(texts)
    texts = [str(text) for text in texts]


    A_CODE = 97
    letters = {}
    index = A_CODE

    # Every entry in texts is already a str at this point (see the conversion above),
    # so no further unwrapping of dict entries is needed here


    REP = '[supadupaepicstonkifiers]'
    for text in texts:
        text = str(text)
        text = text.replace('+', REP)
        text = text.replace('-', REP)
        text = text.replace('*', REP)
        text = text.replace('=', REP)
        text = text.replace('(', REP)
        text = text.replace(')', REP)
        text = text.replace('{', REP)
        text = text.replace('}', REP)
        text = text.replace(',', '')
        text = text.replace('/', REP)
        text = text.replace('.', REP)

        spl = text.split(REP)

        for text in spl:
            split = wordninja.split(text)
            for t in split:
                t = t.strip() 
                while len(t) > 0 and t[0].isdigit(): t = t[1:]
                if len(t) > 0 and not t in letters.keys(): 
                    letters[t] = chr(index)
                    index += 1
    
    to_return = []

    def get_len(key):
        return -len(key[0])

    test_dict_list = list(letters.items())
    test_dict_list.sort(key = get_len)

    letters = {ele[0] : ele[1]  for ele in test_dict_list}

    # Canonicalizes the equation variable names
    for text in texts:
        for swap in letters.keys():
            text = text.replace(swap, letters[swap])
        to_return.append(text)

    return to_return

canonicalized_df = combined_df.copy()
canonicalized_df["canonicalized_equations"] = canonicalized_df["equations"].apply(lambda row : canonicalize_equations(row))
canonicalized_df.head()

Unnamed: 0,ground_solution,response,equations,answers,canonicalized_equations
0,"[83.0, 78.0]",\n\nLet x be the number of three-legged stools...,"[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16...","[78.0, 83.0]","[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16..."
1,[2.14285714286],\n\nLet's call the speed of the current 'c'. \...,"[s_d = 15 + c, s_u = 15 - c, t_d = t_u, d_d / ...",[2.14],"[a_b = 15 + c, a_d = 15 - c, e_b = e_d, b_b / ..."
2,[1155.0],"\n\nTo calculate the number of students, we ne...",[],[1155.0],[]
3,[14.0],\n\nLet's start by finding a general expressio...,"[C = 0.3x, C = 1.12 + 0.22x, 1.12 + 0.22x < 0....",[4.22],"[a = 0.3b, a = 1.12 + 0.22b, 1.12 + 0.22b < 0...."
4,"[38.0, 15.0]",\n\nLet's use M to represent Mike's age and S ...,"[a - b = 11, a + c = 53, b = a - 11, a + (c + ...",[34.0],"[a - b = 11, a + c = 53, b = a - 11, a + (c + ..."
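
The renaming idea can be sketched without `wordninja`. The version below is a deliberately simplified assumption, not the notebook's implementation: it treats every run of ASCII letters as one variable name and renames names to `a`, `b`, `c`, ... in order of first appearance, using a single regex pass. The function name `canonicalize_sketch` is hypothetical, and the input is the equation pair shown in row 3 above.

```python
import re

def canonicalize_sketch(equations):
    """Rename letter runs to a, b, c, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            # Like the notebook, this runs past 'z' if there are more than 26 names
            mapping[name] = chr(ord('a') + len(mapping))
        return mapping[name]

    # One pass per equation; the mapping is shared across the whole system
    return [re.sub(r'[A-Za-z]+', rename, equation) for equation in equations]

canonical = canonicalize_sketch(['C = 0.3x', 'C = 1.12 + 0.22x'])
print(canonical)  # ['a = 0.3b', 'a = 1.12 + 0.22b']
```

Because the substitution happens in one pass, a letter that has already been substituted is never rewritten a second time.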


In [62]:
# ------------------------------------------- #
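#   Creates a SymPy equation (Eq) from a 
#   string of the form 'lhs = rhs'
# ------------------------------------------- #
def create_expression(text):
    # Only the first two '='-separated parts are used as left- and right-hand side
    split_text = text.split('=')
    # T[:6] selects the standard transformations plus implicit multiplication, so '3a' parses as 3*a
    expression = Eq(parse_expr(split_text[0], transformations=T[:6]), parse_expr(split_text[1], transformations=T[:6]))
    return expression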


def quit_function(fn_name):
    # Called from the timer thread once the time limit is reached
    sys.stderr.flush() # Python 3 stderr is likely buffered.
    thread.interrupt_main() # raises KeyboardInterrupt in the main thread

# ------------------------------------------- #
#   A function that times out functions 
#   after a certain period of time
# ------------------------------------------- #
def exit_after(s):
    '''
    use as decorator to exit process if 
    function takes longer than s seconds
    '''
    def outer(fn):
        def inner(*args, **kwargs):
            timer = threading.Timer(s, quit_function, args=[fn.__name__])
            timer.start()
            try:
                result = fn(*args, **kwargs)
            finally:
                timer.cancel()
            return result
        return inner
    return outer

# ------------------------------------------- #
#   Solves a system of equations given in
#   a list of str
# ------------------------------------------- #
@exit_after(5)
def solve_system(equations):
    text = str(equations)
    text = wordninja.split(text)

    s = set()
    for t in text:
        if len(t) > 0 and t[0].isalpha(): s.add(t)
    
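    try:
        # Keep at most as many equations as there are distinct symbol-like tokens
        expressions = [create_expression(equation) for equation in equations][0:len(s)]
        return solve(expressions)
    except:
        # The bare except also catches the KeyboardInterrupt raised when the timeout fires
        return {}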

canonicalized_df["solved_ans"] = canonicalized_df.apply(lambda row : solve_system(row["canonicalized_equations"]), axis=1)
canonicalized_df.head(5)

Unnamed: 0,ground_solution,response,equations,answers,canonicalized_equations,solved_ans
0,"[83.0, 78.0]",\n\nLet x be the number of three-legged stools...,"[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16...","[78.0, 83.0]","[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16...","{a: 78, b: 83}"
1,[2.14285714286],\n\nLet's call the speed of the current 'c'. \...,"[s_d = 15 + c, s_u = 15 - c, t_d = t_u, d_d / ...",[2.14],"[a_b = 15 + c, a_d = 15 - c, e_b = e_d, b_b / ...","{a_b: 120/7, a_d: 90/7, b_b: 4*b_d/3, c: 15/7,..."
2,[1155.0],"\n\nTo calculate the number of students, we ne...",[],[1155.0],[],[]
3,[14.0],\n\nLet's start by finding a general expressio...,"[C = 0.3x, C = 1.12 + 0.22x, 1.12 + 0.22x < 0....",[4.22],"[a = 0.3b, a = 1.12 + 0.22b, 1.12 + 0.22b < 0....",{}
4,"[38.0, 15.0]",\n\nLet's use M to represent Mike's age and S ...,"[a - b = 11, a + c = 53, b = a - 11, a + (c + ...",[34.0],"[a - b = 11, a + c = 53, b = a - 11, a + (c + ...","{a: 53 - c, b: 42 - c}"


In [63]:
# ------------------------------------------- #
#   A helper function for comparing  
#   ChatGPT's response to the actual solution
# ------------------------------------------- #
def evaluate_response(actual_solution, response_solution, transform_func):
    actual_solution = [transform_func(float(solution)) for solution in actual_solution]
    response_solution = [transform_func(float(solution)) for solution in response_solution]

    actual_solution = set(actual_solution)
    response_solution = set(response_solution)
    
    if actual_solution.issubset(response_solution): return "all"
    elif actual_solution.intersection(response_solution): return "some"
    else: return "none"

def count_symbols(text, symbol): return str(text).count(symbol)

# ------------------------------------------- #
#   A function that extracts decimal numbers
#   from string
# ------------------------------------------- #
def extract_decimals(response : str):
    response = str(response)
    pattern = r'\d+(?:\.\d+)?'
    decimals = re.findall(pattern, response)
    decimals = [float(decimal) for decimal in decimals]
    return set(decimals)

features_df = canonicalized_df.copy()
features_df["extracted_solved_ans"] = features_df.apply(lambda row : extract_decimals(row["solved_ans"]), axis=1)
features_df["extracted_final_answer"] = features_df.apply(lambda row : extract_decimals(row["answers"]), axis=1)
features_df["normal_correct"] = features_df.apply(lambda row : evaluate_response(row["ground_solution"], row["answers"], lambda x : x), axis=1) 
features_df["rounded_correct"] = features_df.apply(lambda row : evaluate_response(row["ground_solution"], row["answers"], lambda x : round(x, 1)), axis=1)
features_df["final_solved_same"] = features_df.apply(lambda row : 1 if len(row["extracted_final_answer"].intersection(row["extracted_solved_ans"])) > 0 else 0, axis=1)

answer_map = {"all": 1, "some" : 0, "none": 0}
features_df['is_correct'] = features_df.apply(lambda row : answer_map[row["normal_correct"]], axis=1)
features_df["num_of_additions"] = features_df.apply(lambda row : count_symbols(row["equations"], '+'), axis=1)
features_df["num_of_subtractions"] = features_df.apply(lambda row : count_symbols(row["equations"], '-'), axis=1)
features_df["num_of_multiplications"] = features_df.apply(lambda row : count_symbols(row["equations"], '*'), axis=1)
features_df["num_of_divisions"] = features_df.apply(lambda row : count_symbols(row["equations"], '/'), axis=1)
features_df["num_of_equations"] = features_df.apply(lambda row : count_symbols(row["equations"], '='), axis=1)
features_df["num_of_parentheses"] = features_df.apply(lambda row : count_symbols(row["equations"], '('), axis=1)

features_df.head(3) 

Unnamed: 0,ground_solution,response,equations,answers,canonicalized_equations,solved_ans,extracted_solved_ans,extracted_final_answer,normal_correct,rounded_correct,final_solved_same,is_correct,num_of_additions,num_of_subtractions,num_of_multiplications,num_of_divisions,num_of_equations,num_of_parentheses
0,"[83.0, 78.0]",\n\nLet x be the number of three-legged stools...,"[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16...","[78.0, 83.0]","[a + b = 161, 3a + 4b = 566, a = 161 - b, 3(16...","{a: 78, b: 83}","{83.0, 78.0}","{83.0, 78.0}",all,all,1,1,5,3,0,0,8,1
1,[2.14285714286],\n\nLet's call the speed of the current 'c'. \...,"[s_d = 15 + c, s_u = 15 - c, t_d = t_u, d_d / ...",[2.14],"[a_b = 15 + c, a_d = 15 - c, e_b = e_d, b_b / ...","{a_b: 120/7, a_d: 90/7, b_b: 4*b_d/3, c: 15/7,...","{3.0, 4.0, 7.0, 15.0, 120.0, 90.0}",{2.14},none,all,0,0,4,4,0,4,9,4
2,[1155.0],"\n\nTo calculate the number of students, we ne...",[],[1155.0],[],[],{},{1155.0},all,all,0,1,0,0,0,0,0,0
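
To make the difference between `normal_correct` and `rounded_correct` concrete, the sketch below re-implements the two helpers with the standard library only and applies them to row 1 above (ground solution `2.14285714286`, extracted answer `2.14`). The logic is meant to mirror the functions in the cell; the input string `"{c: 15/7}"` is an illustrative assumption modeled on the `solved_ans` column.

```python
import re

def evaluate_response(actual_solution, response_solution, transform_func):
    # Compare the two answer lists as sets after applying the same transform
    actual = {transform_func(float(value)) for value in actual_solution}
    response = {transform_func(float(value)) for value in response_solution}
    if actual.issubset(response): return "all"
    elif actual & response: return "some"
    else: return "none"

def extract_decimals(response):
    # Unsigned integers and decimals only; fractions are split into their parts
    return {float(d) for d in re.findall(r'\d+(?:\.\d+)?', str(response))}

ground, answer = [2.14285714286], [2.14]
exact = evaluate_response(ground, answer, lambda x: x)
rounded = evaluate_response(ground, answer, lambda x: round(x, 1))
fraction_parts = extract_decimals("{c: 15/7}")
print(exact, rounded)    # none all
print(fraction_parts)
```

Since `is_correct` is derived from `normal_correct`, row 1 is labeled incorrect even though its rounded comparison matches. The second print also shows why `final_solved_same` is 0 for that row: the solved value `15/7` is extracted as the two numbers 15 and 7 rather than as 2.14.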


In [64]:
model_df = features_df[[
    # "num_of_unknowns", 
    "num_of_additions", 
    "num_of_subtractions", 
    "num_of_multiplications", 
    "num_of_divisions", 
    "num_of_equations", 
    "num_of_parentheses", 
    "final_solved_same",
    "is_correct",
]]
model_df = model_df.dropna()

print("Dataframe LEN:" + str(len(model_df)))
model_df.head()

Dataframe LEN:1000


Unnamed: 0,num_of_additions,num_of_subtractions,num_of_multiplications,num_of_divisions,num_of_equations,num_of_parentheses,final_solved_same,is_correct
0,5,3,0,0,8,1,1,1
1,4,4,0,4,9,4,0,0
2,0,0,0,0,0,0,0,1
3,3,0,0,0,6,2,0,0
4,7,7,0,0,9,3,0,0


In [65]:
train, test = train_test_split(model_df, test_size=TEST_SIZE, random_state=RANDOM_STATE)

train_x = train.drop(columns=["is_correct"])
test_x = test.drop(columns=["is_correct"])

train_y = train["is_correct"]
test_y = test["is_correct"]

display(train_x.head())
display(train_y.head())

Unnamed: 0,num_of_additions,num_of_subtractions,num_of_multiplications,num_of_divisions,num_of_equations,num_of_parentheses,final_solved_same
29,0,0,0,0,0,0,0
535,0,0,0,0,0,0,0
695,2,0,0,0,8,1,0
557,5,5,0,1,9,2,0
836,0,2,0,6,9,5,0


29     1
535    1
695    1
557    1
836    1
Name: is_correct, dtype: int64

In [66]:
# =========================================== #
#    Sklearn Helper
# =========================================== #
class SklearnHelper(object):
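    def __init__(self, clf, seed=0, params=None):
        # clf is a classifier class (not an instance); seed is currently unused
        self.clf = clf
        self.params = params if params is not None else {}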

    def fit(self, x_train, y_train):
        self.fitted = self.clf(**self.params).fit(x_train, y_train)
    
    def predict(self, x):
        return self.fitted.predict(x)
    
    def predict_proba(self, x):
        return self.fitted.predict_proba(x)
    
    def feature_importance(self):
        return self.fitted.feature_importances_
    
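    def tune(self, x_train, y_train, params=None):
        # Grid-search over params (a dict mapping parameter names to lists of candidates)
        # using GridSearchCV's default cross-validation, then keep the best combination
        grid = GridSearchCV(self.clf(), param_grid=params)
        grid.fit(x_train, y_train)

        self.params = grid.best_params_

        print(self.params)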

    def get_model(self):
        return self.clf(**self.params)

In [67]:
# =========================================== #
#    Cross-validate models
# =========================================== #
def cross_validate(model, x_data, y_data, n_splits, random_state=42, shuffle=True):
    cross_fold = KFold(n_splits, random_state=random_state, shuffle=shuffle)

    cross_validation_results = []
    for i, (train_index, test_index) in enumerate(cross_fold.split(x_data)):
        x_train, y_train = x_data.iloc[train_index], y_data.iloc[train_index]
        x_test, y_test = x_data.iloc[test_index], y_data.iloc[test_index]

        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)

        report = classification_report(y_test, y_pred, target_names=list(set(answer_map.values())), output_dict=True)
        cross_validation_results.append(report)

    # Initialize one accumulator per class label and metric
    average_results = {}
    for key1 in list(set(answer_map.values())):
        average_results[key1] = {}
        for key2 in ['precision', 'recall', 'f1-score', 'support']:
            average_results[key1][key2] = 0

    for result in cross_validation_results:
        for key1 in list(set(answer_map.values())):
            for key2 in ['precision', 'recall', 'f1-score', 'support']:
                average_results[key1][key2] += result[key1][key2]
    
    for key1 in list(set(answer_map.values())):
        for key2 in ['precision', 'recall', 'f1-score', 'support']:
            average_results[key1][key2] /= n_splits
        
    return pandas.DataFrame(average_results)

def cross_validate_model(classifier):
    return cross_validate(classifier, train_x, train_y, N_SPLITS, RANDOM_STATE, True)

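
The fold-averaging performed by `cross_validate` can be sketched with plain dictionaries. The helper name `average_reports` is hypothetical, and the two fold reports below are made-up numbers for illustration only, not results from this notebook; each report has the nested label-to-metric shape that `classification_report(..., output_dict=True)` produces.

```python
def average_reports(reports, labels,
                    metric_names=('precision', 'recall', 'f1-score', 'support')):
    """Average each metric of each class label over a list of per-fold reports."""
    averaged = {}
    for label in labels:
        averaged[label] = {
            metric: sum(report[label][metric] for report in reports) / len(reports)
            for metric in metric_names
        }
    return averaged

# Two made-up fold reports (illustrative values only)
fold_reports = [
    {0: {'precision': 0.5, 'recall': 0.5, 'f1-score': 0.5, 'support': 60},
     1: {'precision': 0.75, 'recall': 0.75, 'f1-score': 0.75, 'support': 100}},
    {0: {'precision': 0.25, 'recall': 0.25, 'f1-score': 0.25, 'support': 56},
     1: {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 104}},
]

averaged = average_reports(fold_reports, labels=[0, 1])
print(averaged[0])
print(averaged[1])
```

Note that `support` is averaged like every other metric, which is why fractional support values can appear in the cross-validation tables.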
In [68]:
results_dict = {}

In [69]:
# =========================================== #
#    Random Forest
# =========================================== #
random_forest = SklearnHelper(RandomForestClassifier)

random_forest.tune(train_x, train_y, params={
    "random_state" : [RANDOM_STATE],
    'n_jobs': [-1],
    'n_estimators': [100],
    'max_depth': [6],
    'min_samples_leaf': [2],
    'max_features': ['sqrt'],
    'class_weight': ['balanced_subsample']
}
)

random_forest_results = cross_validate_model(random_forest)
results_dict.update(RandomForestClassifier=random_forest_results)
pandas.DataFrame(random_forest_results)

{'class_weight': 'balanced_subsample', 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'n_estimators': 100, 'n_jobs': -1, 'random_state': 42}


Unnamed: 0,0,1
precision,0.505637,0.813346
recall,0.754364,0.575711
f1-score,0.599966,0.666367
support,57.4,102.6
