In [18]:
import json

from solver import *
from clean_mathqa import *

## Evaluate Test and Challenge Set Correctness

In this short notebook, we use our own calculator to check the correctness of the test and challenge sets, i.e. whether the program annotated for each problem actually produces the gold solution.

Impact:
- We can consider training only on 'correct' examples, i.e. those whose program generates the gold solution.
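
One way to act on this is to filter a dataset down to the examples whose program reproduces the gold answer. The sketch below is a minimal illustration under stated assumptions: `keep_correct`, `evaluate`, and `gold_of` are hypothetical names standing in for whatever solver and gold-extraction helpers are available, and the toy records are made up.

```python
def keep_correct(data, evaluate, gold_of, delta=0.5):
    """Return only the examples whose evaluated program is within delta of gold.

    `evaluate` and `gold_of` are hypothetical callables supplied by the caller:
    each takes one example and returns a number, or None if unavailable.
    """
    kept = []
    for datum in data:
        try:
            solution = evaluate(datum)
            gold = gold_of(datum)
        except (ValueError, ZeroDivisionError):
            # Treat programs that fail to evaluate as incorrect
            continue
        if solution is None or gold is None:
            continue
        if abs(float(gold) - float(solution)) <= delta:
            kept.append(datum)
    return kept


# Toy records (made up for illustration)
toy = [
    {"id": 1, "program_value": 38.0, "gold": 38},
    {"id": 2, "program_value": 45.0, "gold": 70},
    {"id": 3, "program_value": 128.96016, "gold": 129},
]
filtered = keep_correct(toy, lambda d: d["program_value"], lambda d: d["gold"])
print([d["id"] for d in filtered])
```

Passing the evaluation and gold-extraction steps in as callables keeps the filter independent of any particular solver implementation.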

In [19]:
# Parameters

# Inputs
TEST_FILE = "data/test.json" # original MathQA test set
CHALLENGE_FILE = "data/challenge_test.json" # newer MathQA test set by same authors with duplicates removed

# Known broken questions
# A question is broken when its program performs a mathematically invalid operation, e.g. the square root of a negative number
TEST_BROKEN_QUESTIONS_FILE = "data/test_broken.txt"
CHALLENGE_BROKEN_QUESTIONS_FILE = "data/challenge_broken.txt"

# Maximum absolute difference between the evaluated and gold solutions for an answer to count as 'correct'
DELTA = 0.5

In [20]:
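# Simple evaluation loop
def test_fct(data, broken_questions=frozenset()):
    """Evaluate each program in `data` and print per-question and overall results.

    `broken_questions` holds the 1-based indices of questions to skip.
    """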

    # Test
    correct_count = 0
    total = 0
    for i, datum in enumerate(data):

        # Skip broken questions
        if (i + 1) in broken_questions:
            continue

        # Attempt to solve
        formula = fill_n_args(datum["linear_formula"].replace("|", " "), datum["Problem"])
        solution = float(solve_linear_program(formula))

        # Round solution
        if solution.is_integer():
            solution = int(solution)
        else:
            solution = round(solution, 5)

        # Get gold answer
        gold = get_gold_value(datum["correct"], extract_options(datum["options"]))

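        # Evaluate
        # Note: the truthiness guards mean a falsy gold value or a solution of exactly 0 is never counted as correct
        status = "WRONG"
        if solution and gold and float(gold) == float(solution):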
            status = "EQUAL"
            correct_count += 1

        elif solution and gold and abs(float(gold) - float(solution)) <= DELTA:
            status = "CLOSE"
            correct_count += 1

        total += 1

        print("{} - [{}] Gold: {}, evaluated: {}".format(i + 1, status, gold, solution))

    # Guard against an empty (or fully skipped) dataset
    pct = correct_count / total if total else 0.0
    print("Correct: {}, total: {}, pct: {}".format(correct_count, total, pct))
    

In [21]:
# Evaluate test set
with open(TEST_FILE, "r") as f:
    data = json.load(f)
    
broken_questions = get_broken_questions(TEST_BROKEN_QUESTIONS_FILE)
test_fct(data, broken_questions)

Bad option: d ) data inadequate 
Bad option: e ) none of these
1 - [EQUAL] Gold: 38, evaluated: 38
2 - [CLOSE] Gold: 129, evaluated: 128.96016
3 - [EQUAL] Gold: 870, evaluated: 870
4 - [WRONG] Gold: 2700, evaluated: 4665.6
5 - [CLOSE] Gold: 0.00137, evaluated: 0.02601
6 - [EQUAL] Gold: -49, evaluated: -49
7 - [EQUAL] Gold: 0.036, evaluated: 0.036
8 - [WRONG] Gold: 70, evaluated: 45
9 - [EQUAL] Gold: 9, evaluated: 9
10 - [EQUAL] Gold: 8, evaluated: 8
11 - [WRONG] Gold: 4096, evaluated: 5120
12 - [EQUAL] Gold: 2240, evaluated: 2240
13 - [EQUAL] Gold: 2400, evaluated: 2400
14 - [EQUAL] Gold: 192, evaluated: 192
15 - [WRONG] Gold: 15, evaluated: 0.14444
Bad option: e ) none of these
16 - [EQUAL] Gold: 30, evaluated: 30
17 - [WRONG] Gold: 36, evaluated: -10.28489
18 - [WRONG] Gold: 257, evaluated: 194.25
19 - [EQUAL] Gold: 600, evaluated: 600
20 - [EQUAL] Gold: 32, evaluated: 32
21 - [WRONG] Gold: 223, evaluated: 222.22222
22 - [WRONG] Gold: 4, evaluated: 11
23 - [EQUAL] Gold: 9, evaluated:

Bad option: e ) none of these
991 - [WRONG] Gold: 10123.2, evaluated: 16197.12
992 - [EQUAL] Gold: 74, evaluated: 74
993 - [WRONG] Gold: 30, evaluated: 28.0
994 - [CLOSE] Gold: 519, evaluated: 518.98438
995 - [EQUAL] Gold: 3, evaluated: 3
996 - [WRONG] Gold: 2, evaluated: 1
Bad option: e ) none
997 - [CLOSE] Gold: 72, evaluated: 71.99424
998 - [EQUAL] Gold: 0.0125, evaluated: 0.0125
999 - [EQUAL] Gold: 60, evaluated: 60
1000 - [EQUAL] Gold: 3, evaluated: 3
1001 - [WRONG] Gold: 65, evaluated: 76
1002 - [EQUAL] Gold: 9.5, evaluated: 9.5
1003 - [WRONG] Gold: 14172.48, evaluated: 16197.12
1004 - [WRONG] Gold: 3, evaluated: 6.9282
1005 - [CLOSE] Gold: 5.45, evaluated: 5.45455
Bad option: e ) none
1006 - [CLOSE] Gold: 6.8, evaluated: 6.875
1007 - [WRONG] Gold: 3, evaluated: -2
1008 - [EQUAL] Gold: 5, evaluated: 5
1009 - [WRONG] Gold: 51, evaluated: 37
1010 - [WRONG] Gold: 90, evaluated: 80
1011 - [EQUAL] Gold: 12, evaluated: 12.0
1012 - [EQUAL] Gold: 0.39, evaluated: 0.39
1013 - [WRONG] Gold

1980 - [CLOSE] Gold: 0.1388, evaluated: 0.00555
1981 - [EQUAL] Gold: 144, evaluated: 144
1982 - [EQUAL] Gold: 50, evaluated: 50
1983 - [EQUAL] Gold: 55, evaluated: 55
1984 - [EQUAL] Gold: 0.66667, evaluated: 0.66667
Cannot parse token containing two numbers, defaulting to first: a ) 3 and 15 
Cannot parse token containing two numbers, defaulting to first: b ) 3 and 20 
Cannot parse token containing two numbers, defaulting to first: c ) 4 and 12 
Cannot parse token containing two numbers, defaulting to first: d ) 4 and 14 
Cannot parse token containing two numbers, defaulting to first: e ) 5 and 14
1985 - [WRONG] Gold: 4, evaluated: 411
1986 - [EQUAL] Gold: 59.8, evaluated: 59.8
1987 - [EQUAL] Gold: 2, evaluated: 2
1988 - [EQUAL] Gold: 3200, evaluated: 3200
1989 - [WRONG] Gold: 5, evaluated: 3
1990 - [CLOSE] Gold: 57, evaluated: 57.14286
Bad option: e ) none
1991 - [EQUAL] Gold: 270, evaluated: 270
1992 - [EQUAL] Gold: 27, evaluated: 27
1993 - [WRONG] Gold: 19956.732, evaluated: 101
199

2954 - [CLOSE] Gold: 340.9, evaluated: 340.90909
Bad option: e ) none of these
2955 - [EQUAL] Gold: 7600, evaluated: 7600.0
Bad option: e ) none of these
2956 - [WRONG] Gold: 1954404, evaluated: 100
2957 - [EQUAL] Gold: 20, evaluated: 20
2958 - [WRONG] Gold: 3, evaluated: 2.25
2959 - [EQUAL] Gold: 1, evaluated: 1
2960 - [EQUAL] Gold: 4, evaluated: 4
2961 - [WRONG] Gold: 900, evaluated: -186300
2962 - [EQUAL] Gold: -49, evaluated: -49
2963 - [WRONG] Gold: 590, evaluated: 657.5
2964 - [EQUAL] Gold: 142, evaluated: 142
Bad option: e ) none
2965 - [EQUAL] Gold: 0.83333, evaluated: 0.83333
2966 - [EQUAL] Gold: 45, evaluated: 45
2967 - [EQUAL] Gold: 72, evaluated: 72
2968 - [EQUAL] Gold: 8, evaluated: 8
2969 - [EQUAL] Gold: 25, evaluated: 25
2970 - [CLOSE] Gold: 87, evaluated: 87.25
2971 - [EQUAL] Gold: 310, evaluated: 310.0
2972 - [EQUAL] Gold: 30, evaluated: 30
2973 - [EQUAL] Gold: 6, evaluated: 6
2974 - [WRONG] Gold: 88888883, evaluated: 0
Bad option: e ) none of these
2975 - [WRONG] Gold

Running the solver on the test set (2985 questions, of which 10 known broken questions are skipped) with delta = 0.5 yields 2035 / 2975 correct answers, or 68.4%.
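
Because delta is an absolute tolerance, small-valued answers can be marked CLOSE even when they differ by an order of magnitude: question 5 above (gold 0.00137, evaluated 0.02601) is one such case. The following is a minimal sketch of a stricter comparison that also requires a small relative error, using `math.isclose`. The function name `classify` and the threshold `rel_tol=0.01` are assumptions for illustration, not part of the notebook.

```python
import math


def classify(gold, solution, abs_tol=0.5, rel_tol=0.01):
    """Label a prediction EQUAL, CLOSE, or WRONG.

    CLOSE requires the difference to be within abs_tol AND within
    rel_tol of the larger magnitude (both thresholds are assumptions).
    """
    gold, solution = float(gold), float(solution)
    if gold == solution:
        return "EQUAL"
    diff = abs(gold - solution)
    if diff <= abs_tol and math.isclose(gold, solution, rel_tol=rel_tol):
        return "CLOSE"
    return "WRONG"


print(classify(129, 128.96016))    # small relative error
print(classify(0.00137, 0.02601))  # within 0.5 absolute, but far off relatively
```

Tightening the comparison this way would lower the reported accuracy, so the two figures should not be compared directly.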

In [22]:
# Evaluate challenge set
with open(CHALLENGE_FILE, "r") as f:
    cdata = json.load(f)
    
broken_questions = get_broken_questions(CHALLENGE_BROKEN_QUESTIONS_FILE)
test_fct(cdata, broken_questions)

1 - [EQUAL] Gold: 192, evaluated: 192
2 - [WRONG] Gold: 15, evaluated: 0.14444
Bad option: e ) none of these
3 - [EQUAL] Gold: 30, evaluated: 30
4 - [EQUAL] Gold: 50, evaluated: 50
5 - [WRONG] Gold: 60, evaluated: 10
6 - [EQUAL] Gold: 4, evaluated: 4
7 - [EQUAL] Gold: 0.4, evaluated: 0.4
8 - [EQUAL] Gold: 4, evaluated: 4
9 - [WRONG] Gold: 240, evaluated: 248
10 - [EQUAL] Gold: 0.33333, evaluated: 0.33333
11 - [EQUAL] Gold: 144, evaluated: 144
12 - [EQUAL] Gold: 6, evaluated: 6
13 - [WRONG] Gold: 60, evaluated: 55.55556
14 - [EQUAL] Gold: 2, evaluated: 2
15 - [CLOSE] Gold: 41.4, evaluated: 41.42857
16 - [EQUAL] Gold: 1.77778, evaluated: 1.77778
17 - [CLOSE] Gold: 0.07692, evaluated: 0.1
18 - [WRONG] Gold: 24, evaluated: 6
19 - [EQUAL] Gold: 0.5, evaluated: 0.5
20 - [EQUAL] Gold: 0.375, evaluated: 0.375
21 - [EQUAL] Gold: 6, evaluated: 6
22 - [WRONG] Gold: 178, evaluated: 240
23 - [CLOSE] Gold: 0.03846, evaluated: 0.23077
24 - [CLOSE] Gold: 17.4, evaluated: 17.44604
25 - [EQUAL] Gold: 1.

Running the solver on the challenge set (604 questions) with delta = 0.5 yields 431 / 601 correct answers, or 72.4%.
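
To see how much of the accuracy comes from exact matches versus matches within the tolerance, the printed log can be tallied by status. This is a minimal sketch, assuming the log lines follow the `N - [STATUS] Gold: x, evaluated: y` format printed by `test_fct`; `tally` and `LINE_RE` are hypothetical names, and the sample lines are copied from the challenge set output above.

```python
import re
from collections import Counter

# Matches lines such as "15 - [CLOSE] Gold: 41.4, evaluated: 41.42857"
LINE_RE = re.compile(r"^(\d+) - \[(EQUAL|CLOSE|WRONG)\] Gold: (\S+), evaluated: (\S+)$")


def tally(log_lines):
    """Count result lines by status, ignoring warnings and truncated lines."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


sample = [
    "1 - [EQUAL] Gold: 192, evaluated: 192",
    "2 - [WRONG] Gold: 15, evaluated: 0.14444",
    "Bad option: e ) none of these",
    "15 - [CLOSE] Gold: 41.4, evaluated: 41.42857",
]
counts = tally(sample)
print(counts["EQUAL"], counts["CLOSE"], counts["WRONG"])
```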