Conversation

@nstankov-bg (Contributor) commented Jul 9, 2023

Change name

Introducing HTTP Recorder to OAIEvals Framework

Change description

This pull request introduces the HttpRecorder class to the evals framework, extending the current Eval mechanism.

HTTP Recorder

A new class HttpRecorder has been added to the evals.record module. This new recorder sends evaluation events directly to a specified HTTP endpoint using POST requests. The URL for this endpoint can be specified using the --http-run-url command-line argument when running the evaluations. In addition to the local and dry run modes, we now have an HTTP run mode that can be triggered using the --http-run flag.
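
For illustration, a minimal sketch of the idea (not the merged implementation; the record_event helper and the use of the requests library here are assumptions):

import requests

def record_event(event: dict, url: str = "http://localhost:8081/events") -> None:
    # POST a single evaluation event as JSON to the collector endpoint
    # (the URL normally comes from --http-run-url); raise on non-2xx responses.
    response = requests.post(url, json=event, timeout=10)
    response.raise_for_status()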

Motivation

This change was largely motivated by the development of the OAIEvals Collector. As the creator of this Go application designed specifically for collecting and storing raw evaluation metrics, I saw the need for an HTTP endpoint in our evaluation mechanism. The OAIEvals Collector provides an HTTP handler designed for evals, thus making it an ideal recipient for the data recorded by the new HttpRecorder.

Allowing for more types of exporters, such as the new HttpRecorder, will likely increase the adoption of testing. The flexibility and ease of use of the HTTP Recorder make it easier for developers to integrate testing into their workflows. This could ultimately lead to higher-quality code, faster debugging, and an overall more efficient development process.

In addition, integrating the HttpRecorder with visualization tools like Grafana dramatically lowers the barrier to entry for developers seeking to visualize their evaluation results. By providing a streamlined process for collecting and visualizing metrics, I aim to make data-driven development more accessible, leading to more informed decision-making and improved model quality.

Demo:

Elasticsearch x Kibana:

  [Screenshot: basic visualization via Kibana]

InfluxDB:

  [Screenshots: InfluxDB test 1 and InfluxDB test 2]

Grafana:

  [Screenshot: basic visualization via InfluxDB & TimescaleDB]

Kafka:

  [Screenshot: basic visualization via Kafka UI]

Testing

The new HTTP Recorder feature was thoroughly tested with the following commands:

  • python3 oaieval.py gpt-3.5-turbo abstract-causal-reasoning-text.dev.v0 --max_samples=50 --dry-run
  • python3 oaieval.py gpt-3.5-turbo abstract-causal-reasoning-text.dev.v0 --max_samples=50
  • python3 oaieval.py gpt-3.5-turbo abstract-causal-reasoning-text.dev.v0 --max_samples=20 --http-run --http-run-url=http://localhost:8081/events --http-batch-size=10
  • python3 oaieval.py gpt-3.5-turbo abstract-causal-reasoning-text.dev.v0 --max_samples=20 --http-run --http-run-url=http://localhost:8081/events

Each command produced the expected results, confirming that the feature works as intended.

Criteria for a good eval ✅

The introduced changes enhance the project's functionality and do not break any existing features. The changes pass all the existing tests and follow the project's contribution guidelines.

Eval structure 🏗️

  • The changes are located in evals/cli/oaieval.py for the new HTTP run mode and evals.record for the new HttpRecorder.
  • I ensure I have the right to use the data I submit via this eval.

Final checklist 👀

  • I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
  • I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
  • I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.
  • I have filled out all required fields of this form.
  • I have used Git LFS for the Eval JSON data.

Eval JSON data

Due to the nature of this PR, there are no new eval samples.

@nstankov-bg marked this pull request as ready for review July 9, 2023 11:21
@bclarkicc

This is a ridiculously helpful change for a project our org is currently working on, and I sincerely hope that it gets merged!

@dbergunder

This will prevent us from maintaining a fork. 👍🏻

This commit introduces several enhancements to the evaluation script, including:

1. Improved error handling for HTTP requests in the HttpRecorder class. The script now retries failed HTTP requests up to a maximum number of times and raises an exception if all attempts fail (see the sketch after this list).

2. The addition of a new command-line option `--http-run-url` to specify the URL for sending evaluation results in HTTP mode. The existing `--url` option has been replaced with this new option for clarity.

3. Updated help text for the command-line options `--local-run` and `--http-run` to provide more detailed information about their function.
Additionally, a `--http-batch-size` option was introduced to specify the number of events to send in each HTTP request when running evaluations in HTTP mode. The default value is 1, which means events are sent individually.
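
To make point 1 concrete, here is a rough sketch of the retry behavior under the stated assumptions; MAX_RETRIES and send_with_retries are illustrative names, not the actual code:

import logging

import requests

MAX_RETRIES = 3

def send_with_retries(url: str, payload: dict) -> None:
    # Try the request up to MAX_RETRIES times; raise only if every attempt fails.
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = requests.post(url, json=payload, timeout=10)
            response.raise_for_status()
            return
        except requests.RequestException as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, MAX_RETRIES, exc)
    raise RuntimeError(f"Failed to send event after {MAX_RETRIES} attempts")
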
@nstankov-bg requested a review from jwang47 July 21, 2023 20:45
@nstankov-bg changed the title from "Add HTTP recorder for evals; introduce --http-run flag and url argument" to "Add HTTP recorder for evals; introduce --http-run flag" on Jul 21, 2023
@jwang47 (Contributor) left a comment:

One scenario we might also want to handle is if there's an uncaught error that happens during the eval run. We might want to send an event indicating that the eval failed in that situation.

This commit makes several changes to the HttpRecorder class:
- Renames the `http_batch_size` parameter to `batch_size` for consistency.
- Updates the `HttpRecorder` initializer to accept `batch_size` as an argument.
- Updates the `_flush_events_internal` method to use the new `batch_size` attribute.
- Removes the retry logic from the `_send_event` method. It now attempts to send each batch of events once, and raises an exception if the request fails.
- Updates the `run` method in `oaieval.py` to pass `http_batch_size` as an argument when initializing an `HttpRecorder` instance.

These changes simplify the event recording logic and make the code more straightforward to understand. They also give the user more control over the batch size when recording events (a sketch follows below).
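
A hedged sketch of the batching behavior described above; the method and attribute names follow the commit message, but the bodies are assumed rather than taken from the actual diff:

import requests

class HttpRecorderSketch:
    def __init__(self, url: str, batch_size: int = 1):
        self.url = url
        self.batch_size = batch_size  # events per HTTP request; 1 = send individually
        self._events: list[dict] = []

    def _send_event(self, batch: list[dict]) -> None:
        # Single attempt per batch; raise immediately if the request fails.
        response = requests.post(self.url, json=batch, timeout=10)
        response.raise_for_status()

    def _flush_events_internal(self) -> None:
        # Send buffered events in chunks of batch_size, then clear the buffer.
        for i in range(0, len(self._events), self.batch_size):
            self._send_event(self._events[i : i + self.batch_size])
        self._events.clear()
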
@nstankov-bg requested a review from jwang47 July 25, 2023 23:59
@nstankov-bg (Contributor, Author) commented Jul 26, 2023

Hey @jwang47,

One scenario we might also want to handle is if there's an uncaught error that happens during the eval run. We might want to send an event indicating that the eval failed in that situation.

Regarding the handling of errors during the evaluation run, we can represent a failed eval event with a structure similar to that of a successful event. The type field would differ, and the data field would carry additional error-specific information.

For example, an event for a failed evaluation could look like this:

{
  "run_id": "2307080128125Q6U7IFP",
  "event_id": 4,
  "sample_id": "abstract-causal-reasoning-text.dev.484",
  "type": "eval_failed",
  "data": {
    "error": {
      "type": "service_unavailable",
      "code": 503,
      "message": "Service Unavailable",
      "details": "The upstream service was overloaded."
    }
  },
  "created_by": "",
  "created_at": "2023-07-08 01:28:13.704853+00:00"
}

Considering that a successful one looks like this:

{
  "run_id": "2307080128125Q6U7IFP",
  "event_id": 3,
  "sample_id": "abstract-causal-reasoning-text.dev.484",
  "type": "match",
  "data": {
    "correct": true,
    "expected": "ANSWER: off",
    "picked": "ANSWER: off",
    "sampled": "undetermined."
  },
  "created_by": "",
  "created_at": "2023-07-08 01:28:13.704853+00:00"
}

In this structure, the type field is set to "eval_failed" to indicate an error event. Within the data field, an error object contains detailed information about the failure, such as its type, code, message, and additional details. This allows us to provide comprehensive information about any errors that occur, facilitating easier troubleshooting.

Could this be added in another PR? Error handling at this level can get rather complex, and I would love to tackle it with the focused attention it needs.

@jwang47 (Contributor) commented Jul 28, 2023

Could this be added in another PR? Error handling at this level can get rather complex, and I would love to tackle it with the focused attention it needs.

Sounds good, let's add it in a later PR.

@nstankov-bg requested a review from jwang47 July 28, 2023 20:23
… in `record.py`:

1. Added local fallback support: When an event fails to be sent due to an HTTP error, it is now automatically saved locally using the LocalRecorder class. The local fallback path is passed as an argument to the HttpRecorder's constructor.

2. Implemented a threshold for failed requests: If more than 5% of events fail to be sent, a RuntimeError is raised. This prevents excessive failed requests from potentially overwhelming the system.

3. Improved logging: Added a success log message when events are successfully sent. Also, the warning message and increment of failed requests now only occur if the request does not succeed.

4. Updated the `record_final_report` method to use the LocalRecorder as a fallback when sending the final report fails.

This commit improves the resilience of the HttpRecorder and provides more informative logging for debugging and monitoring purposes.
- Add a CLI argument to set the acceptable failure rate for HTTP requests.
- Update the HttpRecorder to track failed request rates and act when they exceed the specified threshold.
- Refactor error messages for clarity and provide more specific information about failures.
- Enhance error handling when the failure threshold is exceeded, suggesting a fallback to the local recorder (see the sketch below).
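
A hedged sketch of how the failure threshold and local fallback could fit together. The 5% default comes from the commit message above; the ThresholdSender class, the save_locally helper, and its JSONL format are hypothetical stand-ins, not the actual evals.record API:

import json

import requests

def save_locally(event: dict, path: str) -> None:
    # Hypothetical stand-in for the LocalRecorder fallback: append as JSONL.
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

class ThresholdSender:
    def __init__(self, url: str, fallback_path: str, max_failure_rate: float = 0.05):
        self.url = url
        self.fallback_path = fallback_path
        self.max_failure_rate = max_failure_rate  # 5% default, per the commit message
        self.sent = 0
        self.failed = 0

    def send(self, event: dict) -> None:
        self.sent += 1
        try:
            requests.post(self.url, json=event, timeout=10).raise_for_status()
        except requests.RequestException:
            self.failed += 1
            save_locally(event, self.fallback_path)  # local fallback on HTTP failure
        if self.failed / self.sent > self.max_failure_rate:
            raise RuntimeError(
                f"{self.failed}/{self.sent} events failed to send, exceeding the threshold"
            )
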
@nstankov-bg requested a review from jwang47 August 1, 2023 17:56
@jwang47 (Contributor) left a comment:

Looks good, thanks for the contribution!

@jwang47 merged commit d39903f into openai:main Aug 3, 2023
jacobbieker pushed a commit to withmartian/-ARCHIVED--router-evals that referenced this pull request Jan 9, 2024
Linmj-Judy pushed a commit to TablewareBox/evals that referenced this pull request Feb 27, 2024