Conversation

@nstankov-bg (Contributor) commented Jul 9, 2023

Change name

Introducing HTTP Recorder to OAIEvals Framework

Change description

This pull request introduces the HttpRecorder class to the evals framework, extending the current Eval mechanism.

HTTP Recorder

A new class HttpRecorder has been added to the evals.record module. This new recorder sends evaluation events directly to a specified HTTP endpoint using POST requests. The URL for this endpoint can be specified using the --http-run-url command-line argument when running the evaluations. In addition to the local and dry run modes, we now have an HTTP run mode that can be triggered using the --http-run flag.
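
For illustration, a minimal sketch of the idea (not the merged implementation; the record_event helper and the use of the requests library here are assumptions):

import requests

def record_event(event: dict, url: str = "http://localhost:8081/events") -> None:
    # POST a single evaluation event as JSON to the collector endpoint
    # (the URL normally comes from --http-run-url); raise on non-2xx responses.
    response = requests.post(url, json=event, timeout=10)
    response.raise_for_status()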

Motivation

This change was largely motivated by the development of the OAIEvals Collector. As the creator of this Go application designed specifically for collecting and storing raw evaluation metrics, I saw the need for an HTTP endpoint in our evaluation mechanism. The OAIEvals Collector provides an HTTP handler designed for evals, thus making it an ideal recipient for the data recorded by the new HttpRecorder.

Allowing for more types of exporters, such as the new HttpRecorder, will likely increase the adoption of testing. The flexibility and ease of use of the HTTP Recorder make it easier for developers to integrate testing into their workflows. This could ultimately lead to higher-quality code, faster debugging, and an overall more efficient development process.

In addition, integrating the HttpRecorder with visualization tools like Grafana dramatically lowers the barrier to entry for developers seeking to visualize their evaluation results. By providing a streamlined process for collecting and visualizing metrics, I aim to make data-driven development more accessible, leading to more informed decision-making and improved model quality.

Demo:

Elasticsearch x Kibana:

  [Screenshot: basic visualization via Kibana]

InfluxDB:

  [Screenshots: InfluxDB test 1 and InfluxDB test 2]

Grafana:

  [Screenshot: basic visualization via InfluxDB & TimescaleDB]

Kafka:

  [Screenshot: basic visualization via Kafka UI]

Testing

The new HTTP Recorder feature was thoroughly tested with the following commands:

  • python3 oaieval.py gpt-3.5-turbo abstract-causal-reasoning-text.dev.v0 --max_samples=50 --dry-run
  • python3 oaieval.py gpt-3.5-turbo abstract-causal-reasoning-text.dev.v0 --max_samples=50
  • python3 oaieval.py gpt-3.5-turbo abstract-causal-reasoning-text.dev.v0 --max_samples=20 --http-run --http-run-url=http://localhost:8081/events --http-batch-size=10
  • python3 oaieval.py gpt-3.5-turbo abstract-causal-reasoning-text.dev.v0 --max_samples=20 --http-run --http-run-url=http://localhost:8081/events

Each command produced the expected results, confirming that the feature works as intended.

Criteria for a good eval ✅

The introduced changes enhance the project's functionality and do not break any existing features. The changes pass all the existing tests and follow the project's contribution guidelines.

Eval structure 🏗️

  • The changes are located in evals/cli/oaieval.py for the new HTTP run mode and evals.record for the new HttpRecorder.
  • I ensure I have the right to use the data I submit via this eval.

Final checklist 👀

  • I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
  • I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
  • I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.
  • I have filled out all required fields of this form.
  • I have used Git LFS for the Eval JSON data.

Eval JSON data

Due to the nature of this PR, there are no new eval samples.

@nstankov-bg marked this pull request as ready for review July 9, 2023 11:21
@bclarkicc

This is a ridiculously helpful change for a project our org is currently working on, and I sincerely hope that it gets merged!

@dbergunder

This will prevent us from maintaining a fork. 👍🏻

This commit introduces several enhancements to the evaluation script, including:

1. Improved error handling for HTTP requests in the HttpRecorder class. The script now retries failed HTTP requests up to a maximum number of times and raises an exception if all attempts fail (see the sketch after this list).

2. The addition of a new command-line option `--http-run-url` to specify the URL for sending evaluation results in HTTP mode. The existing `--url` option has been replaced with this new option for clarity.

3. Updated help text for the command-line options `--local-run` and `--http-run` to provide more detailed information about their function.
Additionally, a `--http-batch-size` option was introduced to specify the number of events to send in each HTTP request when running evaluations in HTTP mode. The default value is 1, which means events are sent individually.
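
To make point 1 concrete, here is a rough sketch of the retry behavior under the stated assumptions; MAX_RETRIES and send_with_retries are illustrative names, not the actual code:

import logging

import requests

MAX_RETRIES = 3

def send_with_retries(url: str, payload: dict) -> None:
    # Try the request up to MAX_RETRIES times; raise only if every attempt fails.
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = requests.post(url, json=payload, timeout=10)
            response.raise_for_status()
            return
        except requests.RequestException as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, MAX_RETRIES, exc)
    raise RuntimeError(f"Failed to send event after {MAX_RETRIES} attempts")
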
@nstankov-bg requested a review from jwang47 July 21, 2023 20:45
@nstankov-bg changed the title from "Add HTTP recorder for evals; introduce --http-run flag and url argument" to "Add HTTP recorder for evals; introduce --http-run flag" on Jul 21, 2023
@jwang47 (Contributor) left a comment:

One scenario we might also want to handle is if there's an uncaught error that happens during the eval run. We might want to send an event indicating that the eval failed in that situation.

This commit makes several changes to the HttpRecorder class:
- Renames the `http_batch_size` parameter to `batch_size` for consistency.
- Updates the `HttpRecorder` initializer to accept `batch_size` as an argument.
- Updates the `_flush_events_internal` method to use the new `batch_size` attribute.
- Removes the retry logic from the `_send_event` method. It now attempts to send each batch of events once, and raises an exception if the request fails.
- Updates the `run` method in `oaieval.py` to pass `http_batch_size` as an argument when initializing an `HttpRecorder` instance.

These changes simplify the event recording logic and make the code more straightforward to understand. They also give the user more control over the batch size when recording events (a sketch follows below).
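
A hedged sketch of the batching behavior described above; the method and attribute names follow the commit message, but the bodies are assumed rather than taken from the actual diff:

import requests

class HttpRecorderSketch:
    def __init__(self, url: str, batch_size: int = 1):
        self.url = url
        self.batch_size = batch_size  # events per HTTP request; 1 = send individually
        self._events: list[dict] = []

    def _send_event(self, batch: list[dict]) -> None:
        # Single attempt per batch; raise immediately if the request fails.
        response = requests.post(self.url, json=batch, timeout=10)
        response.raise_for_status()

    def _flush_events_internal(self) -> None:
        # Send buffered events in chunks of batch_size, then clear the buffer.
        for i in range(0, len(self._events), self.batch_size):
            self._send_event(self._events[i : i + self.batch_size])
        self._events.clear()
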
@nstankov-bg requested a review from jwang47 July 25, 2023 23:59
@nstankov-bg (Contributor, Author) commented Jul 26, 2023

Hey @jwang47,

One scenario we might also want to handle is if there's an uncaught error that happens during the eval run. We might want to send an event indicating that the eval failed in that situation.

Regarding the handling of errors during the evaluation run, we can represent a failed eval event with a structure similar to that of a successful event. The type field would differ, and the data field would carry additional error-specific information.

For example, an event for a failed evaluation could look like this:

{
  "run_id": "2307080128125Q6U7IFP",
  "event_id": 4,
  "sample_id": "abstract-causal-reasoning-text.dev.484",
  "type": "eval_failed",
  "data": {
    "error": {
      "type": "service_unavailable",
      "code": 503,
      "message": "Service Unavailable",
      "details": "The upstream service was overloaded."
    }
  },
  "created_by": "",
  "created_at": "2023-07-08 01:28:13.704853+00:00"
}

Considering that a successful one looks like this:

{
  "run_id": "2307080128125Q6U7IFP",
  "event_id": 3,
  "sample_id": "abstract-causal-reasoning-text.dev.484",
  "type": "match",
  "data": {
    "correct": true,
    "expected": "ANSWER: off",
    "picked": "ANSWER: off",
    "sampled": "undetermined."
  },
  "created_by": "",
  "created_at": "2023-07-08 01:28:13.704853+00:00"
}

In this structure, the type field is set to "eval_failed" to indicate an error event. Within the data field, an error object contains detailed information about the failure, such as its type, code, message, and additional details. This allows us to provide comprehensive information about any errors that occur, facilitating easier troubleshooting.

Could this be added in another PR? Error handling at this level can get rather complex, and I would love to tackle it with the focused attention it needs.

@jwang47 (Contributor) commented Jul 28, 2023

Could this be added in another PR? Error handling at this level can get rather complex, and I would love to tackle it with the focused attention it needs.

Sounds good, let's add it in a later PR.

@nstankov-bg requested a review from jwang47 July 28, 2023 20:23
… in `record.py`:

1. Added local fallback support: When an event fails to be sent due to an HTTP error, it is now automatically saved locally using the LocalRecorder class. The local fallback path is passed as an argument to the HttpRecorder's constructor.

2. Implemented a threshold for failed requests: If more than 5% of events fail to be sent, a RuntimeError is raised. This prevents excessive failed requests from potentially overwhelming the system.

3. Improved logging: Added a success log message when events are successfully sent. Also, the warning message and increment of failed requests now only occur if the request does not succeed.

4. Updated the `record_final_report` method to use the LocalRecorder as a fallback when sending the final report fails.

This commit improves the resilience of the HttpRecorder and provides more informative logging for debugging and monitoring purposes.
- Add a CLI argument to set the acceptable failure rate for HTTP requests.
- Update the HttpRecorder to track failed request rates and act when they exceed the specified threshold.
- Refactor error messages for clarity and provide more specific information about failures.
- Enhance error handling when the failure threshold is exceeded, suggesting a fallback to the local recorder (see the sketch below).
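
A hedged sketch of how the failure threshold and local fallback could fit together. The 5% default comes from the commit message above; the ThresholdSender class, the save_locally helper, and its JSONL format are hypothetical stand-ins, not the actual evals.record API:

import json

import requests

def save_locally(event: dict, path: str) -> None:
    # Hypothetical stand-in for the LocalRecorder fallback: append as JSONL.
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

class ThresholdSender:
    def __init__(self, url: str, fallback_path: str, max_failure_rate: float = 0.05):
        self.url = url
        self.fallback_path = fallback_path
        self.max_failure_rate = max_failure_rate  # 5% default, per the commit message
        self.sent = 0
        self.failed = 0

    def send(self, event: dict) -> None:
        self.sent += 1
        try:
            requests.post(self.url, json=event, timeout=10).raise_for_status()
        except requests.RequestException:
            self.failed += 1
            save_locally(event, self.fallback_path)  # local fallback on HTTP failure
        if self.failed / self.sent > self.max_failure_rate:
            raise RuntimeError(
                f"{self.failed}/{self.sent} events failed to send, exceeding the threshold"
            )
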
@nstankov-bg requested a review from jwang47 August 1, 2023 17:56
@jwang47 (Contributor) left a comment:

Looks good, thanks for the contribution!

@jwang47 merged commit d39903f into openai:main Aug 3, 2023
jacobbieker pushed a commit to withmartian/-ARCHIVED--router-evals that referenced this pull request Jan 9, 2024
Linmj-Judy pushed a commit to TablewareBox/evals that referenced this pull request Feb 27, 2024