# Getting Started with OpenAI Evals


This notebook will go over:
* Introduction to OpenAI Evals library [enter link]
* What are Evals
* Building an Eval
* Running an Eval

Evaluation is the process of validating and testing the outputs that your LLM applications produce. Strong evaluations (“evals”) make for a more stable, reliable application that is resilient to code and model changes.

An eval is a task used to measure the quality of the output of an LLM or LLM system. Given an input prompt, the system generates an output. We then compare this output against a set of ideal answers to assess the quality of the LLM system.

OpenAI Evals consists of:
1. A framework to evaluate an LLM (large language model) or a system built on top of an LLM.
2. An open-source registry of challenging evals.

*Why is it important to evaluate?*

If you are building with LLMs, creating high quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time intensive to understand how different model versions might affect your use case. With OpenAI’s new continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases.

*Types of Evals*

The simplest and most common type of eval has an input and an ideal response or answer. For example,
we can have an eval sample where the input is “What year was Obama elected president for the first
time?” and the ideal answer is “2008”. We feed the input to a model and get the completion. If the model
says “2008”, it is then graded as correct. Eval samples are aggregated into an eval dataset that can
quantify overall performance within a certain topic. For example, this eval sample may be part of a
“president-election-years” eval that checks for every U.S. President, what year they were first elected.
Evals are not restricted to checking factual accuracy: all that is needed is a reproducible way to grade a
completion. Here are some other examples of valid evals:
* The input asks the model to write a short essay on a topic. The grading criteria check whether the
essay is of a particular length or whether certain keywords or themes are present in the completion.
* The input is to write a funny joke, and the grading criterion is how funny the joke is.
* The input is to follow a sequence of instructions, and the grading ensures that all instructions
were followed.

In a naive implementation, we could just grade each completion by hand based on the criteria. Ideally,
we’d like to automate the grading process to let these experiments scale to huge datasets. In the next
section, we’ll talk about the ways in which we’ve automated eval grading.

*Grading evals*

There are two main ways we can automatically grade completions: writing some validation logic in code
or using the model itself to inspect the answer. We’ll introduce each with some examples.

Writing logic for answer checking:


* Consider the Obama example from above, where the ideal response is 2008. We can write a
string match to check if the completion includes the phrase “2008”. If it does, we consider it
correct.
* Consider another eval where the input is to generate valid JSON: We can write some code that
attempts to parse the completion as JSON and then considers the completion correct if it is
parsable.
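
Both checks above fit in a few lines of Python. The following is a minimal sketch rather than the Evals library's own implementation; the function names `includes_match`, `is_valid_json`, and `accuracy` are hypothetical and exist only for illustration, as are the sample completions.

```python
import json
from typing import List


def includes_match(completion: str, ideal: str) -> bool:
    """Return True if the ideal answer appears anywhere in the completion."""
    return ideal in completion


def is_valid_json(completion: str) -> bool:
    """Return True if the completion parses as JSON."""
    try:
        json.loads(completion)
    except json.JSONDecodeError:
        return False
    return True


def accuracy(results: List[bool]) -> float:
    """Fraction of samples graded correct (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical completions for the Obama example (ideal answer: "2008").
completions = ["Obama was first elected in 2008.", "I believe it was 2012."]
grades = [includes_match(c, "2008") for c in completions]
print(grades, accuracy(grades))

# Hypothetical completions for the JSON example.
print(is_valid_json('{"year": 2008}'), is_valid_json("{year: 2008"))
```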

Model grading: a two-stage process in which the model first answers the question, and then we ask a
model to look at the response and check whether it is correct.

* Consider an input that asks the model to write a funny joke. The model then generates a
completion. We then create a new input for the model that includes the completion and asks:
“Is the following joke funny? First reason step by step, then answer yes or no”. We
finally consider the original completion correct if the new model completion ends with “yes”.

Model grading works best with the latest, most powerful models, such as GPT-4, and when we give them
the ability to reason before making a judgment. Model grading has an error rate, so it is important to
validate its performance with human evaluation before running the evals at scale. For best results, it
makes sense to use a different model for grading than the one that produced the completion, for
example using GPT-4 to grade GPT-3.5 answers.
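
The two-stage flow can be sketched as follows. This is an illustration under stated assumptions, not the Evals library's implementation: `build_grading_prompt`, `verdict_is_yes`, and `grade_with_model` are hypothetical names, and `ask_model` stands in for whatever function sends a prompt to your grading model and returns its text reply.

```python
from typing import Callable

GRADING_QUESTION = (
    "Is the following joke funny? First reason step by step, then answer yes or no."
)


def build_grading_prompt(completion: str) -> str:
    """Wrap the original completion in a new prompt for the grading model."""
    return f"{GRADING_QUESTION}\n\nJoke:\n{completion}"


def verdict_is_yes(grader_reply: str) -> bool:
    """The completion counts as correct if the grader's reply ends with 'yes'."""
    cleaned = grader_reply.strip().rstrip(".!\"'").lower()
    return cleaned.endswith("yes")


def grade_with_model(completion: str, ask_model: Callable[[str], str]) -> bool:
    """Second stage: ask the grading model, then parse its verdict."""
    return verdict_is_yes(ask_model(build_grading_prompt(completion)))


# Stand-in grader for demonstration only; a real grader would call an LLM.
def fake_grader(prompt: str) -> str:
    return "The pun relies on a double meaning, which works well. Yes."


print(grade_with_model("I told my computer a joke, but it didn't get it.", fake_grader))
```

Putting the reasoning before the verdict is what makes the "ends with yes" check workable: the final token of the reply carries the judgment, and everything before it is the grader's explanation.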


## Building an evaluation for the OpenAI Evals framework

To start creating an eval, we need:

1. The test dataset in JSONL format.
2. The eval template to be used.

### Creating the eval dataset
Let's create a dataset for a use case where we are evaluating the model's ability to generate syntactically correct SQL.

First we will need to create a system prompt that we would like to evaluate. We will pass in instructions for the model as well as an overview of the table structure:
`"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable city, columns = [*,ID,Name,CountryCode,District,Population]\nTable country, columns = [*,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2]\nTable countrylanguage, columns = [*,CountryCode,Language,IsOfficial,Percentage]\nTable sqlite_sequence, columns = [*,name,seq]\nForeign_keys = [city.CountryCode = country.Code,countrylanguage.CountryCode = country.Code]\n"`

For this prompt, we can ask a specific question:
`"Q: What is the GNP of Afghanistan?"`

And we have an expected answer:
`"A: SELECT GNP FROM country WHERE name = \"Afghanistan\""`

The dataset needs to be in the following format:
`{"input": [{"role": "system", "content": "<input prompt>", "name": "example-user"}], "ideal": "correct answer"}`

Putting it all together, we get:
`{"input": [{"role": "system", "content": "TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable city, columns = [*,ID,Name,CountryCode,District,Population]\nTable country, columns = [*,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2]\nTable countrylanguage, columns = [*,CountryCode,Language,IsOfficial,Percentage]\nTable sqlite_sequence, columns = [*,name,seq]\nForeign_keys = [city.CountryCode = country.Code,countrylanguage.CountryCode = country.Code]\n"}, {"role": "user", "content": "Q: What is the GNP of Afghanistan?"}], "ideal": ["A: SELECT GNP FROM country WHERE name = \"Afghanistan\""]}`
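
Hand-writing JSONL is error-prone because of the nested quoting, so it can help to build each sample as a Python dictionary and let `json.dumps` handle the escaping. A minimal sketch follows; the helper names `make_sample` and `write_jsonl` and the output filename `sql_eval.jsonl` are hypothetical, and the system prompt is abbreviated here for readability.

```python
import json
from typing import Dict, List

# Abbreviated for this sketch; in practice use the full system prompt shown above.
SYSTEM_PROMPT = (
    "TASK: Answer the following question with syntactically correct SQLite SQL.\n"
    "Table country, columns = [*,Code,Name,Continent,GNP]\n"
)


def make_sample(question: str, ideal_sql: str) -> Dict:
    """Build one eval sample with an "input" message list and an "ideal" answer list."""
    return {
        "input": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Q: {question}"},
        ],
        "ideal": [f"A: {ideal_sql}"],
    }


def write_jsonl(samples: List[Dict], path: str) -> None:
    """Write one JSON object per line, as the JSONL format requires."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")


samples = [
    make_sample(
        "What is the GNP of Afghanistan?",
        'SELECT GNP FROM country WHERE name = "Afghanistan"',
    )
]
write_jsonl(samples, "sql_eval.jsonl")
```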


One way to speed up the process of building eval datasets is to use GPT-4 to generate synthetic data.

## Use GPT-4 to generate synthetic data
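
One possible shape for this step is sketched below. It is an assumption-laden illustration, not the notebook's original code: the prompt wording, the function names `build_generation_prompt`, `parse_pairs`, and `generate_pairs`, and the one-JSON-object-per-line reply format are all hypothetical, and `ask_model` stands in for a function that sends a prompt to GPT-4 and returns its text reply. Generated pairs would still need review before serving as ideal answers.

```python
import json
from typing import Callable, Dict, List


def build_generation_prompt(schema: str, n: int) -> str:
    """Ask for n question/SQL pairs, one JSON object per line."""
    return (
        f"Here is a database schema:\n{schema}\n\n"
        f"Write {n} different questions a user might ask about this data, "
        "each with a correct SQLite query. Return one JSON object per line "
        'with the keys "question" and "sql", and nothing else.'
    )


def parse_pairs(raw_reply: str) -> List[Dict[str, str]]:
    """Keep only the lines that parse as JSON objects containing both keys."""
    pairs = []
    for line in raw_reply.splitlines():
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank lines, prose, or malformed JSON
        if isinstance(obj, dict) and "question" in obj and "sql" in obj:
            pairs.append({"question": obj["question"], "sql": obj["sql"]})
    return pairs


def generate_pairs(
    schema: str, n: int, ask_model: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Build the prompt, query the model, and parse its reply."""
    return parse_pairs(ask_model(build_generation_prompt(schema, n)))


# Stand-in for a real GPT-4 call, used here only to show the data flow.
def fake_model(prompt: str) -> str:
    return (
        '{"question": "How many countries are there?", '
        '"sql": "SELECT COUNT(*) FROM country"}\n'
        "this line is not JSON and will be skipped\n"
    )


schema = "Table country, columns = [*,Code,Name,Continent,GNP]"
print(generate_pairs(schema, 1, fake_model))
```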

## Running an evaluation

We can run this eval using the `oaieval` CLI like this:

`pip install .`

`oaieval gpt-3.5-turbo <name of eval>`

### Going through eval logs