# 230531-1253 gpt-4 evals

## Goal: preliminary calculus evals for GPT-4

Specifics:

- Craft a better prompt for the grading model.
  - See https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
- Collect calculus problems.
  - Work on calculus problems. After each problem, discuss with GPT-4 on ChatGPT-Plus. Take note of problems that are difficult for GPT-4 to solve.
  - Transfer such problems, together with incorrect responses, into YAML file.
  - For each such problem, print out the student prompt. For each generated solution, print out the grader prompt.
  - Send the grader prompt to GPT-4 via ChatGPT-Plus. Questions:
    - Does the grader model grade consistently?
    - Does the grader model grade correctly?

## Improving the grader prompt

Here's the current grader prompt template:

```python
grader_prompt_template = grader_prompt_template_v1 = '''
You are comparing a submitted answer to an expert answer for a given mathematics problem:
---
[Mathematics Problem]: {input}
---
[Expert]: {ideal}
---
[Submission]: {completion}
---
Use the following rubric to score the submission. Round the score to the nearest integer.

[Rubric]:

{rubric}
---
First, write out in a step by step manner your reasoning for the score you gave to be sure that your conclusion is correct. Avoid simply stating the score at the outset. When you are done giving your reasoning, print the score rounded to the nearest integer (without quotes or punctuation) on its own line.

Reasoning:
'''.strip()
```

Proposed improvements:

```
You are comparing a submitted answer to an expert solution for a given mathematics problem:

Mathematics Problem: """
{input}
"""

Submitted Answer: """
{completion}
"""

Expert Solution: """
{ideal}
"""

Use the following rubric to score the submitted answer. First break the answer down step-by-step. Then evaluate each step for correctness and mathematical rigor. Use your step-by-step evaluations to score the answer. Round the score to the nearest integer.

Rubric: """
{rubric}
"""

Write out in a step by step manner your reasoning for the score you gave to be sure that your conclusion is correct. Avoid simply stating the score at the outset. When you are done giving your reasoning, print the score in the range 0-100, rounded to the nearest integer (without quotes or punctuation) on its own line.

Desired format:
Final Answer Correctness: -||-
Explanation Correctness: -||-
Mathematical Rigor: -||-
Correctness of Intermediate Steps: -||-
Score:
-||-

Your evaluation and reasoning:
```

Code to read YAML response file:

In [1]:
from math_eval import *

import pandas as pd
from IPython.display import Markdown, Math, Latex
import openai
import ruamel.yaml
from ruamel.yaml import YAML
from ruamel.yaml.scalarstring import LiteralScalarString
from ruamel.yaml.representer import RoundTripRepresenter

import datetime
import decimal
import json
import os
import re
import sys
import time

In [2]:
engine = sa.create_engine("sqlite:///230531-1326_math-evals.db")
session = Session(engine)
Base.metadata.create_all(engine)

In [3]:
Base.metadata.drop_all(engine)

In [4]:
Base.metadata.create_all(engine)

In [28]:
session.rollback()

In [5]:
session.commit()

In [33]:
session.flush()

In [3]:
openai.api_key = os.environ['OPENAI_API_KEY']

yaml=YAML()
yaml.default_flow_style = False
yaml.allow_unicode = True
yaml.encoding = 'utf-8'

def represent_decimal(self, data):
  value = '0.' if data.is_zero() else str(data.normalize())
  return self.represent_scalar(u'tag:yaml.org,2002:float', value)

RoundTripRepresenter.add_representer(Decimal, represent_decimal)

In [4]:
def fix_latex(text):
  text = text.replace('\\(', '$')
  text = text.replace('\\)', '$')
  text = text.replace('\\[', '$$')
  text = text.replace('\\]', '$$')
  return text

In [14]:
gpt4_chatgpt_plus = Model.get_or_create(session, dict(name='GPT-4 via ChatGPT-Plus'))
example_model = Model.get_or_create(session, dict(name='human'), dict(is_example=True))
gpt4_chatgpt_plus, example_model

(<Model(id=1, name=GPT-4 via ChatGPT-Plus, notes=, is_example=False)>,
 <Model(id=2, name=example grader, notes=, is_example=True)>)

In [6]:
gpt4_ps = ProblemSet.get_or_create(session,
  dict(name='GPT-4 calculus problem set 230531-1326'),
)
gpt4_ss = SubmissionSet.get_or_create(session,
  dict(name='GPT-4 calculus submission set 230531-1326'),
)
example_ss = SubmissionSet.get_or_create(session,
  dict(name='GPT-4 calculus example-submission set 230531-1326'),
  dict(is_example=True),
)
gpt4_es = EvaluationSet.get_or_create(session,
  dict(name='GPT-4 calculus evaluation set 230531-1326'),
)
example_es = EvaluationSet.get_or_create(session,
  dict(name='GPT-4 calculus example-evaluation set 230531-1326'),
  dict(is_example=True),
)
print(f'{gpt4_ps=}')
print(f'{gpt4_ss=}')
print(f'{gpt4_es=}')
print(f'{example_es=}')

gpt4_ps=<ProblemSet(id=1, name=GPT-4 calculus problem set 230531-1326, notes=)>
gpt4_ss=<SubmissionSet(id=1, name=GPT-4 calculus submission set 230531-1326, notes=, is_example=False)>
gpt4_es=<EvaluationSet(id=1, name=GPT-4 calculus evaluation set 230531-1326, notes=, is_example=False)>
example_es=<EvaluationSet(id=2, name=GPT-4 calculus example-evaluation set 230531-1326, notes=, is_example=True)>


In [7]:
ps_ct = session.scalar(sa.select().with_only_columns(sa.func.count(ProblemSet.id)))
ss_ct = session.scalar(sa.select().with_only_columns(sa.func.count(SubmissionSet.id)))
es_ct = session.scalar(sa.select().with_only_columns(sa.func.count(EvaluationSet.id)))
print(f'{ps_ct=}, {ss_ct=}, {es_ct=}')

ps_ct=1, ss_ct=2, es_ct=2


In [8]:
with open('calculus-responses-gpt4.yaml', 'r') as f:
    data = yaml.load(f)

In [9]:
yaml.dump(data, sys.stdout)

problems:
- input: |-
    If the infinite curve $y=e^{-x}$, $x \ge 0$, is rotated about the $x$-axis, find the area of the resulting surface.

    Explain your reasoning. Use LaTeX with delimiters '\(' and '\)' to make mathematical statements. When you are done, repeat your answer by itself on a new line.
  ideal: |-
    The area of a surface of revolution, when you're rotating the curve \(y = f(x)\), \(a \leq x \leq b\), around the x-axis is given by the formula:

    \[A = 2\pi \int_{a}^{b} y \sqrt{1 + \left(\frac{dy}{dx}\right)^2} dx\]

    This equation comes from summing up the surface areas of infinitesimal frustums (which are approximately cylindrical for small enough changes in \(x\)) that make up the surface.

    In our case, \(f(x) = e^{-x}\), and we're revolving about the x-axis for \(0 \leq x < \infty\). We need to compute the derivative of \(f(x)\), which is \(f'(x) = -e^{-x}\).

    Plugging these into the formula gives:

    \[A = 2\pi \int_{0}^{\infty} e^{-x} \sqrt{1 +

Things to test:

- A problem can be part of more than one problem set. Test this as follows: create two YAML files with identical problems but different problem sets. Load both files. After the first load, the problem should be part of the first problem set. The second load should not create any new problems, should not trigger any errors, and afterward the problem should be contained in both problem sets.

- If problem ID is specified for a submission, an exception should be raised. This is because the submission should be listed hierarchically under a problem, and that problem's ID should be used when the submission is created. Test this as follows: create a YAML file that specifies a single problem and single submission in hierarchy, in which any problem ID is specified in the submission spec. This should trigger the error.
- If a submission is found in the datagbase for the given submission ID, and its problem ID doesn't match that of the parent problem in the hierarchy, an exception should be raised to indicate this inconsistency. This is because the submission should answer a single problem, and the submission's problem shouldn't change. Test this as follows: create a YAML file that specifies two problems, with one problem having a single submission in hierarchy, and the other problem having no submissions. This should create the corresponding problems and submission in the database; the submission's problem should be recorded in the database as the first problem. Use the loaded data hierarchy to generate a second YAML file, which should now record the IDs of the problems and the submission. Modify the second file, moving the submission from the first problem to the second problem in hierarchy. Try to load the modified YAML file. The incorrect placement of the submission should trigger the error.
- Similarly, the model ID for a submission shouldn't change.
- The submission can, however, be part of more than one submission set. Test this as follows: create two YAML files with identical problems and submissions but different submission sets. Load both files. After the first load, the submission should be part of the first submission set. The second load should not create any new problems or submissions, should not trigger any errors, and afterward the submission should be contained in both submission sets.

- If submission ID is specified for an evaluation, an exception should be raised. This is because the evaluation should be listed hierarchically under a submission, and that submission's ID should be used when the evaluation is created. Test this as follows: create a YAML file that specifies a single problem, a single submission, and a single evaluation in hierarchy, in which any submission ID is specified in the evaluation spec. This should trigger the error.
- If an evaluation is found in the database for the given evaluation ID, and its submission ID doesn't match that of the parent submission in the hierarchy, an exception should be raised to indicate this inconsistency. This is because the evaluation should evaluation a single submission, and evaluation's submission shouldn't change. Test this as follows: create a YAML file that specifies one problem and two submissions, with one submission having a single evaluation in hierarchy, and the other submission having no evaluations. This should create the corresponding problem, submissions, and evaluation in the database; the evaluation's submissions should be recorded in the database as the first submission. Use the loaded data hierarchy to generate a second YAML file, which should now record the IDs of the problem, the submissions, and the evaluation. Modify the second file, moving the evaluation from the first submission to the second submission in hierarchy. Try to load the modified YAML file. The incorrect placement of the evaluation should trigger the error.
- The evaluation can, however, be part of more than one evaluation set. Test this as follows: create two YAML files with identical problems, submissions and evluations but different evaluation sets. Load both files. After the first load, the evaluation should be part of the first evaluation set. The second load should not create any new problems, submissions, or evaluations, should not trigger any errors, and afterward the evaluation should be contained in both evaluations sets.
- Similarly, the model ID for an evaluation shouldn't change.

This only applies to YAML files. Manually updating these IDs in the database is another matter.

In [42]:
problem_set = gpt4_ps

default_submission_model = gpt4_chatgpt_plus
default_example_submission_model = example_model
submission_set = gpt4_ss
example_submission_set = example_ss

default_evaluation_model = gpt4_chatgpt_plus
default_example_evaluation_model = example_model
evaluation_set = gpt4_es
example_evaluation-set = example_es

for pn, problem_data in enumerate(data['problems']):
  # Load info from YAML.
  # Search by either:
  # - problem ID, if available, or
  # - input if not.
  # Place problem into problem set.
  # - If problem set is specified, use it.
  # - Otherwise, use the default.
  problem_id = problem_data.get('problem_id')
  input = problem_data['input']    # Required.
  ideal = problem_data.get('ideal')
  rubric = problem_data.get('rubric')
  problem_notes = problem_data.get('rubric')
  
  # Update or create problem.
  problem_search_dict = {}
  problem_update_dict = {}
  # If problem ID is specified, search by problem ID.
  # Otherwise, search by problem input.
  if problem_id is not None:
    problem_search_dict['id'] = problem_id
    problem_update_dict['input'] = input
  else:
    problem_search_dict['input'] = input
  # Update ideal and rubric fields.
  if ideal:
    problem_update_dict['ideal'] = ideal
  if rubric:
    problem_update_dict['rubric'] = rubric
  problem = Problem.update_or_create(session, problem_search_dict, problem_update_dict)

  # Update notes field.
  if problem_notes:
    problem.notes = problem_notes
  
  # Place problem into problem set.
  # Note: if problem-set ID was specified for problem in YAML, it will be ignored.
  problem.problem_sets.append(problem_set)
    
  for sn, submission_data in enumerate(problem_data.get('submissions')):
    # Load info from YAML.
    # Search by either:
    # - submission ID, if available, or
    # - submission message, model ID, and problem ID if not.
    #   - If the model ID is specified, use it. Otherwise, use the default.
    #     Which default to use depends on whether this is an example.
    #     - Note: if this an example submission, and the model is specified
    #       but isn't the example model, print a warning.
    #   - If the problem ID is specified, use it. Otherwise, use the default.
    # Place submission into submission set.
    # - If the submission set is specified, use it.
    # - Otherwise, use the default.
    #   Which default to use depends on whether this is an example.
    # - Note: if this is an example submission, and the submission set is
    #   specified but isn't an example submission set, print a warning.
    submission_id = submission_data.get('submission_id')
    submission_completion = submission_data.get('completion')
    submission_message = submission_data['message']     # Required.
    submission_score = submission_data.get('score')
    submission_notes = submission_data.get('notes')
    is_example = submission_data.get('is_example')
    submission_model_id = submission_data.get('model_id')
    # Note: YAML file SHOULD NOT SPECIFY PROBLEM ID because submission specs
    # should be hierarchically listed under a problem spec, and that problem's
    # ID should be used.
    submission_problem_id = submission_data.get('problem_id')
    if submission_problem_id is not None:
      raise LookupError(f'YAML file should not specify problem_id={submission_problem_id} in submission specs, but does anyway.')
    # Note: in the database there's a many-to-many mapping between submissions
    # and submission sets. But in the YAML file, at most a single submission set
    # is represented.
    submission_set_id = submission_data.get('submission_set_id')
    # Model ID and problem ID fields can't be null.
    # If model ID is specified, use it.
    # Otherwise use the default.
    # Which default to use depends on whether this is an example submission.
    if submission_model_id is not None:
      submission_model = Model.get_or_create(session, dict(id=submission_model_id))
    else:
      if is_example:
        submission_model = default_example_submission_model
      else:
        submission_model = default_submission_model

    # Update or create submission.
    submission_search_dict = {}
    submission_update_dict = {}
    # If submissson ID is specified, search by submission ID.
    # Otherwise, search by submission message, model, and problem.
    if submission_id is not None:
      submission_search_dict['id'] = submission_id
      submission_update_dict['message'] = submission_message
      #submission_update_dict['model'] = submission_model
      #submission_update_dict['problem'] = problem
    else:
      submission_search_dict['message'] = submission_message
      submission_search_dict['model'] = submission_model
      submission_search_dict['problem'] = problem
    # Update is_example field.
    if is_example is not None:
      submission_update_dict['is_example'] = is_example

    submission = Submission.update_or_create(session, submission_search_dict, submission_update_dict)

    if submission.problem != problem:
      raise RuntimeError(f'submission with {submission.id=} has wrong problem ID {submission.problem_id=} (should be same as {problem.id=}).')
    if submission.model != submission_model:
      raise RuntimeError(f'submission with {submission.id=} has wrong model ID {submission.model_id=} (should be same as {submission_model.id=}).')

    # Update notes and completion fields.
    if submission_completion:
      submission.completion = submission_completion
    if submission_notes:
      submission.notes = submission_notes
    # Place submission into its submission set.
    # Note: if submission-set ID is specified for submission in YAML, it will be ignored.
    if is_example:
      submission.submission_sets.append(example_submission_set)
    else:
      submission.submission_sets.append(submission_set)
    
    for en, evaluation_data in enumerate(submission_data.get('evaluations')):
      # Load info from YAML.
      # Search by either:
      # - evaluation ID, if available, or
      # - message, model ID, and submission ID if not.
      #   - If the submission ID is specified, use it. Otherwise, use the default.
      #   - If the model ID is specified, use it. Otherwise, use the default.
      #     Which default to use depends on whether this is for an example submission.
      #     - Note: if this is for an example submission, and the model is specified 
      #       but isn't the example model, print a warning.
      # Place evaluation into evaluation set.
      # - If the evaluation set is specified, use it.
      # - Otherwise, use the default.
      #   Which default to use depends on whether this is for an example submission.
      #   - Note: if this is for an example submission, and the evaluation set is specified 
      #     but isn't an example evaluation set, print a warning.
      evaluation_id = evaluation_data.get('evaluation_id')
      evaluation_completion = evaluation_data.get('completion')
      evaluation_message = evaluation_data['message']     # Required.
      evaluation_score = evaluation_data.get('score')
      evaluation_notes = submission_data.get('notes')
      evaluation_model_id = evaluation_data.get('model_id')
      # Note: YAML file SHOULD NOT SPECIFY SUBMISSION ID because evaluation specs
      # should be hierarchically listed under a submission spec, and that submission's
      # ID should be used.
      evaluation_submission_id = evaluation_data.get('submission_id')
      if evaluation_submission_id is not None:
        raise LookupError(f'YAML file should not specify submission_id={evaluation_submission_id} in evaluation specs, but does anyway.')
      # Note: in the database there's a many-to-many mapping between evaluations
      # and evaluation sets. But in the YAML file, at most a single evaluation set
      # is represented.
      evaluation_set_id = evaluation_data.get('evaluation_id')
      # Model ID and submission ID fields can't be null.
      # If model ID is specified, use it.
      # Otherwise use the default.
      # Which default to use depends on whether this is an example submission.
      if evaluation_model_id is not None:
        evaluation_model = Model.get_or_create(session, dict(id=evaluation_model_id))
      else:
        if is_example:
          evaluation_model = default_evaluation_model
        else:
          evaluation_model = default_example_evaluation_model
      if evaluation_set_id is not None:
        evaluation_set = EvaluationSet.get_or_create(session, dict(id=evaluation_set_id))
      else:
        if is_example:
          evaluation_set = default_evaluation_set
        else:
          evaluation_set = default_example_evaluation

      # Update or create problem.
      evaluation_search_dict = {}
      evaluation_update_dict = {}
      if evaluation_id is not None:
        evaluation_search_dict['id'] = evaluation_id
        evaluation_update_dict['message'] = evaluation_message
        evaluation_update_dict['model'] = evaluation_model
        evaluation_update_dict['submission'] = submission
      else:
        evaluation_search_dict['message'] = evaluation_message
        evaluation_search_dict['model'] = evaluation_model
        evaluation_search_dict['submission'] = submission
      if score is not None:
        evaluation_update_dict['score'] = score

      evaluation = Evaluation.update_or_create(session, evaluation_search_dict, evaluation_update_dict)
        
      if evaluation_notes:
        evaluation.notes = evaluation_notes
      if evaluation_completion:
        evaluation.completion = evaluation_completion
      evaluation_set.evaluations.append(evaluation)


SyntaxError: invalid syntax (1372023594.py, line 1)

In [45]:
session.commit()

In [44]:
gpt4_ps.problems

[<Problem(id=1, notes=- Final Answer Correctness (40 points): The final answer provided by the model matches the expert's answer in its simplest form.
   - 40: The final answer is mathematically equivalent to the expert's answer and is in its simplest form.
   - 0: The final answer is not mathematically equivalent to the expert's answer or is not in its simplest form.
 - Explanation Correctness (30 points): The explanation provided by the model to arrive at the answer is correct and logically sound.
   - 30: The explanation correctly justifies each step of the problem-solving process.
   - 15: The explanation is partially correct but does not correctly justify each step of the problem-solving process.
   - 0: The explanation is incorrect or missing.
 - Mathematical Rigor (20 points): The submission correctly uses mathematical identities and applies integral and derivative rules.
   - 20: The submission demonstrates full mathematical rigor.
   - 10: The submission demonstrates some math

In [23]:
for row in session.execute(sa.text("SELECT * FROM problem_set_problem_associations")):
  print(f'{row=}')

row=(1, 2)


In [46]:
problem_ct = session.scalar(sa.select().with_only_columns(sa.func.count(Problem.id)))
submission_ct = session.scalar(sa.select().with_only_columns(sa.func.count(Submission.id)))
evaluation_ct = session.scalar(sa.select().with_only_columns(sa.func.count(Evaluation.id)))
print(f'{problem_ct=}, {submission_ct=}, {evaluation_ct=}')
print(f'{len(gpt4_ps.problems)=}, {len(gpt4_ss.submissions)=}')
print(f'{len(gpt4_es.evaluations)=}, {len(example_es.evaluations)=}')

problem_ct=1, submission_ct=1, evaluation_ct=1
len(gpt4_ps.problems)=1, len(gpt4_ss.submissions)=1
len(gpt4_es.evaluations)=1, len(human_es.evaluations)=0


In [None]:
for pn, problem_data in enumerate(problems_data):
  problem = Problem.update_or_create(session,
    search_dict=dict(
      input=problem_data['input'],
      problem_set_id=integration_table_ps.id
    ),
    update_dict=dict(
      ideal=problem_data['ideal'],
      rubric=problem_data['rubric']
    ),
  )
  for sn, submission_data in enumerate(problem_data.get('completions', [])):
    message = submission_data['completion']
    submission = Submission.update_or_create(session,
      search_dict=dict(
        problem_id=problem.id,
        submission_set_id=integration_table_ss.id,
        message=message,
      ),
      update_dict=dict(
        notes_json='""'
      ),
    )

    notes_json = json.dumps(
      dict(example_evaluations = [
        dict(
          evaluator = 'KN',
          comment = submission_data['comment'],
          correct = submission_data['correct'],
        ),
      ])
    )
    example_evaluation = Evaluation.update_or_create(session,
      search_dict=dict(
        submission_id=submission.id,
        evaluation_set_id=integration_table_example_es.id,
        message=submission_data['comment'],
      ),
      update_dict=dict(
        notes_json=notes_json
      ),
    )

    for en, evaluation_data in enumerate(submission_data.get('evaluations', [])):
      evaluation = Evaluation.get_or_create(session,
        search_dict=dict(
          submission_id=submission.id,
          evaluation_set_id=integration_table_es.id,
          message=evaluation_data['evaluation']
        ),
      )