# Getting Started with OpenAI Evals


This notebook will go over:
* Introduction to the [OpenAI Evals library](https://github.com/openai/evals/tree/main)
* What are Evals
* Building an Eval
* Running an Eval

Evaluation is the process of validating and testing the outputs that your LLM applications produce. Strong evaluations (“evals”) make for a more stable, reliable application that is resilient to code and model changes. An eval is a task used to measure the quality of the output of an LLM or LLM system: given an input prompt, the system generates an output, and we compare that output against a set of ideal answers to measure the quality of the LLM system.

**OpenAI Evals consists of:**
1. A framework to evaluate an LLM (large language model) or a system built on top of an LLM.
2. An open-source registry of challenging evals

**Why is it important to evaluate?**

If you are building with LLMs, creating high quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. Without evals, it can be very difficult and time intensive to understand how different model versions and prompts might affect your use case. With OpenAI’s new continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases. You can also make evals a part of your CI/CD pipeline (recommended) to make sure you achieve the desired accuracy before deploying.

**Types of Evals**

The simplest and most common type of eval has an input and an ideal response or answer. For example,
we can have an eval sample where the input is “What year was Obama elected president for the first
time?” and the ideal answer is “2008”. We feed the input to a model and get the completion. If the model
says “2008”, it is then graded as correct. Eval samples are aggregated into an eval dataset that can
quantify overall performance within a certain topic. For example, this eval sample may be part of a
“president-election-years” eval that checks for every U.S. President, what year they were first elected.
Evals are not restricted to checking factual accuracy: all that is needed is a reproducible way to grade a
completion. Here are some other examples of valid evals:
* The input asks to write a short essay on a topic. The grading criterion is to check whether the essay is of a
particular length or whether certain keywords or themes are present in the completion.
* The input is to write a funny joke, and the grading criterion is to check how funny it was.
* The input is to follow a sequence of instructions, and the grading ensures that all instructions
were followed.

In a naive implementation, we could just grade each completion by hand based on the criteria. Ideally,
we’d like to automate the grading process to let these experiments scale to huge datasets. In the next
section, we’ll talk about the ways in which we’ve automated eval grading.

**Grading evals**

There are two main ways we can automatically grade completions: writing some validation logic in code
or using the model itself to inspect the answer. We’ll introduce each with some examples.

**Writing logic for answer checking**

* Consider the Obama example from above, where the ideal response is 2008. We can write a
string match to check if the completion includes the phrase “2008”. If it does, we consider it
correct.
* Consider another eval where the input is to generate valid JSON: We can write some code that
attempts to parse the completion as JSON and then considers the completion correct if it is
parsable.
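
The two checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the Evals library; the function names `includes_match` and `is_valid_json` are hypothetical.

```python
import json


def includes_match(completion: str, ideal: str) -> bool:
    """Return True if the ideal answer appears anywhere in the completion."""
    return ideal in completion


def is_valid_json(completion: str) -> bool:
    """Return True if the completion can be parsed as JSON."""
    try:
        json.loads(completion)
    except json.JSONDecodeError:
        return False
    return True


# String match: the completion is correct if it contains "2008".
print(includes_match("Obama was first elected in 2008.", "2008"))

# JSON check: the completion is correct if it parses.
print(is_valid_json('{"year": 2008}'))
print(is_valid_json("year: 2008"))
```

A substring check is deliberately lenient: it accepts any completion that mentions the ideal answer, even when the surrounding text is wrong, so it suits short factual answers best.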

**Model grading**

Model grading is a two-stage process where the model first answers the question, then we ask a
model to look at the response to check if it’s correct.
* Consider an input that asks the model to write a funny joke. The model then generates a
completion. We then create a new input to the model to answer the question: “Is this following
joke funny? First reason step by step, then answer yes or no” that includes the completion. We
finally consider the original completion correct if the new model completion ends with “yes”.

Model grading works best with the latest, most powerful models like GPT-4 and when we give them the ability
to reason before making a judgment. Model grading will have an error rate, so it is important to validate
the performance with human evaluation before running the evals at scale. For best results, it makes
sense to use a different model to do grading from the one that did the completion, like using GPT-4 to
grade GPT-3.5 answers.
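
As a rough sketch of the second stage, the snippet below builds the grading prompt for the joke example and interprets the grader's reply. The helper names and the exact prompt layout are assumptions made for illustration; the call to the grading model itself is left out, so `grader_output` stands in for whatever text that model returns.

```python
import re

# Grading question from the joke example, with a slot for the completion.
GRADER_TEMPLATE = (
    "Is this following joke funny? First reason step by step, "
    "then answer yes or no.\n\nJoke: {completion}"
)


def build_grader_prompt(completion: str) -> str:
    """Wrap the original completion in the grading question."""
    return GRADER_TEMPLATE.format(completion=completion)


def is_graded_correct(grader_output: str) -> bool:
    """Return True if the last word of the grader's reply is 'yes'."""
    words = re.findall(r"[a-z]+", grader_output.lower())
    return bool(words) and words[-1] == "yes"


prompt = build_grader_prompt("Why did the chicken cross the road?")
print(prompt)

# A reply that reasons first and answers at the end.
print(is_graded_correct("The setup is familiar but the timing works. Yes."))
```

Checking only the final word lets the grader reason freely before committing to an answer, which matches the "reason step by step, then answer" instruction.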


### Getting Set Up

First, go to [github.com/openai/evals](https://github.com/openai/evals), clone the repository with `git clone git@github.com:openai/evals.git`, and go through the setup instructions in its README.

To run evals later in this notebook, you will need to set up and specify your OpenAI API key. After you obtain an API key, specify it using the `OPENAI_API_KEY` environment variable. Please be aware of the costs associated with using the API when running evals.
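
Before running anything that calls the API, it can help to confirm that the key is visible to the current process. The check below is a small sketch; `api_key_is_set` is a hypothetical helper, and it only tests that the variable is non-empty, not that the key is valid.

```python
import os


def api_key_is_set(env=os.environ) -> bool:
    """Return True if OPENAI_API_KEY is present and not blank."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not api_key_is_set():
    print("OPENAI_API_KEY is not set; export it before running any evals.")
```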

## Building an evaluation for OpenAI Evals framework

To start creating an eval, we need:

1. The test dataset in the JSONL format.
2. The eval template to be used.

### Creating the eval dataset
Let's create a dataset for a use case where we are evaluating the model's ability to generate syntactically correct SQL. In this use case, we have a series of tables that are related to car manufacturing.

First we will need to create a system prompt that we would like to evaluate. We will pass in instructions for the model as well as an overview of the table structure:
`"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"`

For this prompt, we can ask a specific question:
`"Q: how many car makers are their in germany?"`

And we have an expected answer:
`"A: SELECT count ( * )  FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country   =   T2.CountryId WHERE T2.CountryName   =   'germany'"`

The dataset needs to be in the following format:
`{"input": [{"role": "system", "content": "<input prompt>"}, {"role": "user", "content": "<user input>"}], "ideal": "<correct answer>"}`

Putting it all together, we get:
`{"input": [{"role": "system", "content": "TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\n"}, {"role": "user", "content": "Q: how many car makers are their in germany"}], "ideal": ["A: SELECT count ( * )  FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country   =   T2.CountryId WHERE T2.CountryName   =   'germany'"]}`
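
Because each line of the JSONL file must be a complete JSON object, a malformed sample is easy to miss by eye. The following sketch checks one line against the shape described above; `validate_sample` is a hypothetical helper written for this notebook, not part of the Evals library, and it only checks structure.

```python
import json


def validate_sample(line: str) -> dict:
    """Parse one JSONL line and check it has the 'input' and 'ideal' fields."""
    sample = json.loads(line)
    if not isinstance(sample, dict):
        raise ValueError("each line must be a JSON object")
    messages = sample.get("input")
    if not isinstance(messages, list) or not messages:
        raise ValueError("'input' must be a non-empty list of messages")
    for message in messages:
        if not isinstance(message, dict) or not {"role", "content"} <= set(message):
            raise ValueError("each message needs 'role' and 'content'")
    if "ideal" not in sample:
        raise ValueError("'ideal' is missing")
    return sample


# A shortened sample in the same shape as the one above.
line = (
    '{"input": [{"role": "system", "content": "<input prompt>"}, '
    '{"role": "user", "content": "Q: how many car makers are their in germany"}], '
    '"ideal": ["A: SELECT count ( * ) FROM CAR_MAKERS"]}'
)
print(validate_sample(line)["ideal"])
```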


One way to speed up the process of building eval datasets is to use GPT-4 to generate synthetic data.

In [1]:
## Use GPT-4 to generate synthetic data

from openai import OpenAI

client = OpenAI()
# Define the system prompt and user input (these should be filled as per the specific use case)
system_prompt = """You are a helpful assistant that can ask questions about a database table and write SQL queries to answer the question.
    A user will pass in a table schema and your job is to return a question answer pairing. The question should be relevant to the schema of the table,
    and you can speculate on its contents. You will then have to generate a SQL query to answer the question. Below are some examples of what this should look like.

    Example 1
    ```````````
    User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
    Assistant Response:
    Q: How many visitors have visited the museum with the most staff?
    A: SELECT count ( * )  FROM VISIT AS T1 JOIN MUSEUM AS T2 ON T1.Museum_ID   =   T2.Museum_ID WHERE T2.Num_of_Staff   =   ( SELECT max ( Num_of_Staff )  FROM MUSEUM ) 
    ```````````

    Example 2
    ```````````
    User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
    Assistant Response:
    Q: What are the names who have a membership level higher than 4?
    A: SELECT Name   FROM VISITOR AS T1 WHERE T1.Level_of_membership   >   4 
    ```````````

    Example 3
    ```````````
    User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
    Assistant Response:
    Q: How many tickets of customer id 5?
    A: SELECT count ( * )  FROM VISIT AS T1 JOIN VISITOR AS T2 ON T1.visitor_ID   =   T2.ID WHERE T2.ID   =   5 
    ```````````
    """

user_input = "Table car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"

messages = []
messages.append({
    "role": "system",
    "content": system_prompt
})
messages.append({
    "role": "user",
    "content": user_input
})


completion = client.chat.completions.create(
  model="gpt-4-turbo-preview",
  messages=messages,
  temperature=1.0,
  n=5
)

for choice in completion.choices:
    print(choice.message.content + "\n")


Q: Which continent has the highest average horsepower for their cars?
A: SELECT continents.Continent, AVG(cars_data.Horsepower) as AvgHorsepower FROM continents JOIN countries ON continents.ContId = countries.Continent JOIN car_makers ON countries.CountryId = car_makers.Country JOIN model_list ON car_makers.Id = model_list.Maker JOIN car_names ON model_list.Model = car_names.Model JOIN cars_data ON car_names.MakeId = cars_data.Id GROUP BY continents.Continent ORDER BY AvgHorsepower DESC LIMIT 1

Q: What is the average MPG (Miles Per Gallon) for cars produced by makers from the continent 'Europe'?
A: SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'

Q: What is the average weight of cars p

Once we have the synthetic data, we need to convert it to match the format of the eval dataset.

In [2]:
eval_data = []
input_prompt = "TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"

for choice in completion.choices:
    question = choice.message.content.split("Q: ")[1].split("\n")[0]  # Extracting the question
    answer = choice.message.content.split("\nA: ")[1].split("\n")[0]  # Extracting the answer
    eval_data.append({
        "input": [
            {"role": "system", "content": input_prompt},
            {"role": "user", "content": question},
        ],
        "ideal": answer
    })

for item in eval_data:
    print(item)


{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which continent has the highest average horsepower for their cars?'}], 'ideal': 'SELECT continents.Continent, AVG(cars_data.Horsepower) as AvgHorsepower FROM continents JOIN countries ON continents.ContI
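
The conversion cell indexes directly into the result of `split`, so a completion that is cut off before its `A:` line would raise an `IndexError`. The sketch below shows one way to skip such completions and then write the samples to a JSONL file. The helper names, the regular expression, and the file name `car_sql_samples.jsonl` are all assumptions for illustration, and the hard-coded strings stand in for `choice.message.content`.

```python
import json
import re
import tempfile
from pathlib import Path

# One "Q: ..." line followed by one "A: ..." line, as in the completions above.
QA_PATTERN = re.compile(r"Q:[ \t]*(.+)\s+A:[ \t]*(.+)")


def parse_qa(text):
    """Return (question, answer), or None if the text has no Q/A pair."""
    match = QA_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def write_jsonl(samples, path):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")


# Stand-ins for choice.message.content; the second one is truncated on purpose.
completions = [
    "Q: How many car makers are there?\nA: SELECT count(*) FROM car_makers",
    "Q: What is the average weight of cars p",
]
pairs = [pair for pair in map(parse_qa, completions) if pair is not None]
samples = [
    {
        "input": [
            {"role": "system", "content": "<input prompt>"},
            {"role": "user", "content": question},
        ],
        "ideal": answer,
    }
    for question, answer in pairs
]

# Written to a temporary directory here; a real run would pick a stable path.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "car_sql_samples.jsonl"  # hypothetical file name
    write_jsonl(samples, out_path)
    written_lines = out_path.read_text(encoding="utf-8").splitlines()

print(len(written_lines), "sample(s) written")
```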

Next, we need to register the eval so that we can run it in the framework.

The evals framework requires a .yaml file structured with the following properties:
* `id` - An identifier for your eval
* `description` - A short description of your eval
* `disclaimer` - Additional notes about your eval
* `metrics` - There are three types of eval metrics we can choose from: match, includes, fuzzyMatch

For our eval, we will configure the following:

In [3]:
"""
spider-sql:
  id: spider-sql.dev.v0
  metrics: [accuracy]
  description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.
    Yu, Tao, et al. \"Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.\" Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.
  disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.

  """

'\nspider-sql:\n  id: spider-sql.dev.v0\n  metrics: [accuracy]\n  description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.\n    Yu, Tao, et al. "Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.\n  disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.\n\n  '
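
The registry entry above names the eval but does not say which eval class or dataset it runs. The run log shown later in this notebook reports `evals.elsuite.modelgraded.classify:ModelBasedClassify` with `samples_jsonl`, `eval_type`, and `modelgraded_spec` arguments for `spider-sql.dev.v0`, so the sketch below adds a second block carrying those values. The `class` and `args` key names and the file name `spider-sql.yaml` are assumptions about the registry layout; check the YAML files under `evals/registry/evals` before relying on them.

```python
import tempfile
from pathlib import Path

# Values are taken from the run log in this notebook; the key names
# "class" and "args" are assumptions about the registry YAML layout.
REGISTRY_YAML = """\
spider-sql:
  id: spider-sql.dev.v0
  metrics: [accuracy]
spider-sql.dev.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: sql/spider_sql.jsonl
    eval_type: cot_classify
    modelgraded_spec: sql
"""


def write_registry_file(directory, name="spider-sql.yaml"):
    """Write the registry YAML into `directory` and return the file path."""
    path = Path(directory) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(REGISTRY_YAML, encoding="utf-8")
    return path


# Demonstrated against a temporary directory rather than a real registry.
with tempfile.TemporaryDirectory() as tmp_dir:
    registry_path = write_registry_file(Path(tmp_dir) / "evals")
    registry_text = registry_path.read_text(encoding="utf-8")

print(registry_text)
```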

## Running an evaluation

We can run this eval using the `oaieval` CLI:

First, install the library: `pip install .` (if you are running the [OpenAI Evals library](https://github.com/openai/evals) locally) or `pip install evals`.

Then, run the eval: `oaieval gpt-3.5-turbo spider-sql`

The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`.

In [4]:
!pip install evals



These CLIs can accept various flags to modify their default behavior. You can run `oaieval --help` to see a full list of CLI options. 

After running that command, you’ll see the final report of accuracy printed to the console, as well as a file path to a temporary file that contains the full report.

In [5]:
!oaieval gpt-3.5-turbo spider-sql

[2024-03-18 00:52:41,040] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/evals
[2024-03-18 00:52:44,020] [registry.py:257] Loading registry from /Users/shyamal/.evals/evals
[2024-03-18 00:52:44,031] [oaieval.py:189] Run started: 240318075244BE2IAMOY
[2024-03-18 00:52:44,055] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/modelgraded
[2024-03-18 00:52:44,194] [registry.py:257] Loading registry from /Users/shyamal/.evals/modelgraded
[2024-03-18 00:52:44,195] [data.py:90] Fetching /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/data/sql/spider_sql.jsonl
[2024-03-18 00:52:44,206] [eval.py:36] Evaluating 194 samples
[2024-03-18 00:52:44,290] [eval.py:144] Running in threaded mode with 10 threads!
  0%|                                                   | 0/194 [00:00<?, ?it/s][2024-03-18 00:52:44,976] [_clien

To run a set of evals, use `oaievalset`, which expects a model name and an eval set name. The valid eval set names are specified in the YAML files under `evals/registry/eval_sets`.

### Going through eval logs

The eval logs are located at `/tmp/evallogs`, and a separate log file is created for each evaluation run.

In [6]:
!cat /tmp/evallogs/24031807290743SAP6GW_gpt-3.5-turbo_spider-sql.jsonl

{"spec": {"completion_fns": ["gpt-3.5-turbo"], "eval_name": "spider-sql.dev.v0", "base_eval": "spider-sql", "split": "dev", "run_config": {"completion_fns": ["gpt-3.5-turbo"], "eval_spec": {"cls": "evals.elsuite.modelgraded.classify:ModelBasedClassify", "registry_path": "/Users/shyamal/.virtualenvs/api-eajm/lib/python3.11/site-packages/evals/registry", "args": {"samples_jsonl": "sql/spider_sql.jsonl", "eval_type": "cot_classify", "modelgraded_spec": "sql"}, "key": "spider-sql.dev.v0", "group": "sql"}, "seed": 20220722, "max_samples": null, "command": "/Users/shyamal/.virtualenvs/api-eajm/bin/oaieval gpt-3.5-turbo spider-sql", "initial_settings": {"visible": true}}, "created_by": "", "run_id": "24031807290743SAP6GW", "created_at": "2024-03-18 07:29:07.351613"}}
{"run_id": "24031807290743SAP6GW", "event_id": 0, "sample_id": "spider-sql.dev.142", "type": "sampling", "data": {"prompt": [{"content": "Answer the following question with syntactically correct SQLite SQL. Be creative but the SQ
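
Since the log is a JSONL file, it can be summarized with a few lines of Python instead of being read raw. The sketch below counts events by their `type` field and pulls the eval name out of the `spec` record, both of which appear in the log above. The `final_report` key is an assumption about how the closing record is named, and the sample lines are made up for illustration rather than copied from a real run.

```python
import json
from collections import Counter


def summarize_log(lines):
    """Collect the eval name, event counts by type, and any final report."""
    spec, final_report = None, None
    event_types = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or partial lines
        if not isinstance(record, dict):
            continue
        if "spec" in record:
            spec = record["spec"]
        elif "final_report" in record:  # assumed key name
            final_report = record["final_report"]
        elif "type" in record:
            event_types[record["type"]] += 1
    return {
        "eval_name": spec.get("eval_name") if spec else None,
        "event_types": dict(event_types),
        "final_report": final_report,
    }


# Made-up lines in the same shape as the log above; the last one is truncated.
sample_lines = [
    '{"spec": {"eval_name": "spider-sql.dev.v0"}}',
    '{"run_id": "r1", "event_id": 0, "sample_id": "spider-sql.dev.142", "type": "sampling", "data": {}}',
    '{"run_id": "r1", "event_id": 1, "sample_id": "spider-sql.dev.142", "type": "metrics", "data": {}}',
    '{"run_id": "r1", "event_id": 2, "type": "sampl',
]
summary = summarize_log(sample_lines)
print(summary)
```

To use it on a real run, pass the open log file instead of `sample_lines`, since iterating over a file yields one line at a time.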