
Log model and usage stats in record.sampling #1449

Merged — 2 commits merged into main on Mar 25, 2024
Conversation

@JunShern (Collaborator) commented Jan 3, 2024

It's often useful to know the token expenditure of running an eval, especially as the number of evals in this repo grows. This has come up as an example feature request, and we also rely on it elsewhere, e.g. here.

Computing this manually is cumbersome, so this PR simply logs the usage statistics (token counts) returned with each API call in record.sampling. This makes it easy to sum up the token cost of an eval from the run's logfile.

Here is an example of a sampling log line after this change (note the new data.model and data.usage fields):

{
  "run_id": "240103035835K2NWEEJC",
  "event_id": 1,
  "sample_id": "superficial-patterns.dev.8",
  "type": "sampling",
  "data": {
    "prompt": [
      {
        "role": "system",
        "content": "If the red key goes to the pink door, and the blue key goes to the green door, but you paint the green door to be the color pink, and the pink door to be the color red, and the red key yellow, based on the new colors of everything, which keys go to what doors?"
      }
    ],
    "sampled": [
      "Based on the new colors, the yellow key goes to the pink door (previously red), and the blue key goes to the red door (previously pink)."
    ],
    "model": "gpt-3.5-turbo-0613", # NEW
    "usage": { # NEW
      "completion_tokens": 33,
      "prompt_tokens": 70,
      "total_tokens": 103
    }
  },
  "created_by": "",
  "created_at": "2024-01-03 03:58:37.466772+00:00"
}

@JunShern JunShern changed the title Log model and usage stats in sample logger Log model and usage stats in record.sampling Jan 3, 2024
@logankilpatrick logankilpatrick removed their request for review January 4, 2024 15:15
@JunShern (Collaborator, Author) commented Mar 13, 2024

Added a further commit 24f31bb so that after completing a run, oaieval.py now does:

  1. Search through the log file to find all sampling events that contain sampling.data["usage"] fields.
  2. For every sampling event containing usage data, sum up that usage data to get the total usage in a run.
  3. Total usage is added to the log's final_report, and also printed to stdout.
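The three steps above can be sketched as follows. This is a hypothetical illustration, not the actual oaieval.py implementation; the `usage_`-prefixed keys mirror the final_report fields shown in the example output below:

```python
def usage_summary(events):
    """Aggregate usage across sampling events from a run's parsed log.

    Returns (number of sampling events with usage data, total number of
    sampling events, usage totals keyed for merging into final_report).
    """
    # Step 1: find sampling events that carry data["usage"].
    sampling_events = [e for e in events if e.get("type") == "sampling"]
    with_usage = [e for e in sampling_events if (e.get("data") or {}).get("usage")]

    # Step 2: sum the integer usage counts across those events.
    totals = {}
    for event in with_usage:
        for key, value in event["data"]["usage"].items():
            if isinstance(value, int):
                totals[key] = totals.get(key, 0) + value

    # Step 3: prefix keys for inclusion in the final report.
    report_fields = {f"usage_{k}": v for k, v in totals.items()}
    return len(with_usage), len(sampling_events), report_fields
```

Summing per-key (rather than assuming a fixed schema) keeps the aggregation robust if providers report additional or missing usage fields.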

Example:

oaieval gpt-3.5-turbo 2d_movement

prints

[2024-03-05 12:10:32,101] [oaieval.py:271] Found 100/100 sampling events with usage data
[2024-03-05 12:10:32,101] [oaieval.py:276] Token usage from 100 sampling events:
completion_tokens: 600
prompt_tokens: 13,734
total_tokens: 14,334
[2024-03-05 12:10:32,102] [record.py:371] Final report: {'accuracy': 0.08, 'boostrap_std': 0.0271508305581984, 'usage_completion_tokens': 600, 'usage_prompt_tokens': 13734, 'usage_total_tokens': 14334}. Logged to /tmp/evallogs/240305041023TSMJ46G3_gpt-3.5-turbo_2d_movement.jsonl
[2024-03-05 12:10:32,102] [oaieval.py:229] Final report:
[2024-03-05 12:10:32,102] [oaieval.py:231] accuracy: 0.08
[2024-03-05 12:10:32,102] [oaieval.py:231] boostrap_std: 0.0271508305581984
[2024-03-05 12:10:32,102] [oaieval.py:231] usage_completion_tokens: 600
[2024-03-05 12:10:32,102] [oaieval.py:231] usage_prompt_tokens: 13734
[2024-03-05 12:10:32,102] [oaieval.py:231] usage_total_tokens: 14334

@etr2460 etr2460 merged commit 9b2e1b1 into main Mar 25, 2024
1 check passed