
Log model and usage stats in record.sampling #1449

Merged — 2 commits merged into main on Mar 25, 2024
Conversation

@JunShern (Collaborator) commented Jan 3, 2024

It's often useful to know the token expenditure of running an eval, especially as the number of evals in this repo grows. This has come up as an example feature request, and we also rely on it elsewhere, e.g. here.

Computing this manually is cumbersome, so this PR simply logs the usage statistics (token counts) returned with each API call in record.sampling. This makes it easy to sum up the token cost of an eval from the run's logfile.

Here is an example of a sampling log line after this change (note the new data.model and data.usage fields):

{
  "run_id": "240103035835K2NWEEJC",
  "event_id": 1,
  "sample_id": "superficial-patterns.dev.8",
  "type": "sampling",
  "data": {
    "prompt": [
      {
        "role": "system",
        "content": "If the red key goes to the pink door, and the blue key goes to the green door, but you paint the green door to be the color pink, and the pink door to be the color red, and the red key yellow, based on the new colors of everything, which keys go to what doors?"
      }
    ],
    "sampled": [
      "Based on the new colors, the yellow key goes to the pink door (previously red), and the blue key goes to the red door (previously pink)."
    ],
    "model": "gpt-3.5-turbo-0613", # NEW
    "usage": { # NEW
      "completion_tokens": 33,
      "prompt_tokens": 70,
      "total_tokens": 103
    }
  },
  "created_by": "",
  "created_at": "2024-01-03 03:58:37.466772+00:00"
}

@JunShern JunShern changed the title Log model and usage stats in sample logger Log model and usage stats in record.sampling Jan 3, 2024
@logankilpatrick logankilpatrick removed their request for review January 4, 2024 15:15
@JunShern (Collaborator, Author) commented Mar 13, 2024

Added a further commit 24f31bb so that after completing a run, oaieval.py now does:

  1. Search through the log file to find all sampling events that contain sampling.data["usage"] fields.
  2. For every sampling event containing usage data, sum up that usage data to get the total usage in a run.
  3. Total usage is added to the log's final_report, and also printed to stdout.
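The three steps above can be sketched as follows. This is a hypothetical illustration, not the actual oaieval.py implementation; the `usage_`-prefixed keys mirror the final_report fields shown in the example output below:

```python
def usage_summary(events):
    """Aggregate usage across sampling events from a run's parsed log.

    Returns (number of sampling events with usage data, total number of
    sampling events, usage totals keyed for merging into final_report).
    """
    # Step 1: find sampling events that carry data["usage"].
    sampling_events = [e for e in events if e.get("type") == "sampling"]
    with_usage = [e for e in sampling_events if (e.get("data") or {}).get("usage")]

    # Step 2: sum the integer usage counts across those events.
    totals = {}
    for event in with_usage:
        for key, value in event["data"]["usage"].items():
            if isinstance(value, int):
                totals[key] = totals.get(key, 0) + value

    # Step 3: prefix keys for inclusion in the final report.
    report_fields = {f"usage_{k}": v for k, v in totals.items()}
    return len(with_usage), len(sampling_events), report_fields
```

Summing per-key (rather than assuming a fixed schema) keeps the aggregation robust if providers report additional or missing usage fields.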

Example:

oaieval gpt-3.5-turbo 2d_movement

prints

[2024-03-05 12:10:32,101] [oaieval.py:271] Found 100/100 sampling events with usage data
[2024-03-05 12:10:32,101] [oaieval.py:276] Token usage from 100 sampling events:
completion_tokens: 600
prompt_tokens: 13,734
total_tokens: 14,334
[2024-03-05 12:10:32,102] [record.py:371] Final report: {'accuracy': 0.08, 'boostrap_std': 0.0271508305581984, 'usage_completion_tokens': 600, 'usage_prompt_tokens': 13734, 'usage_total_tokens': 14334}. Logged to /tmp/evallogs/240305041023TSMJ46G3_gpt-3.5-turbo_2d_movement.jsonl
[2024-03-05 12:10:32,102] [oaieval.py:229] Final report:
[2024-03-05 12:10:32,102] [oaieval.py:231] accuracy: 0.08
[2024-03-05 12:10:32,102] [oaieval.py:231] boostrap_std: 0.0271508305581984
[2024-03-05 12:10:32,102] [oaieval.py:231] usage_completion_tokens: 600
[2024-03-05 12:10:32,102] [oaieval.py:231] usage_prompt_tokens: 13734
[2024-03-05 12:10:32,102] [oaieval.py:231] usage_total_tokens: 14334

@etr2460 etr2460 merged commit 9b2e1b1 into main Mar 25, 2024
1 check passed