# OpenAI Evals

[Evals](https://github.com/openai/evals) provides a framework for evaluating large language models (LLMs) or systems built using LLMs. It offers an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your own data to build private evals that represent the common LLM patterns in your workflow without exposing any of that data publicly.

If you are building with LLMs, creating high-quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time-intensive to understand how different model versions might affect your use case.

## Installation

There are two options to install 'evals':
1. Cloning the 'evals' repository and installing from the source code
2. Using pip install

Theoretically, option 2 should be sufficient to use the evals framework and run your evaluations, but I recommend option 1 for several reasons:
* The version on the Python Package Index is out of date; as of Feb 2024 it is not compatible with the latest Python library. This might still be the case when you try.
* Having the source makes it convenient to browse the code, check the examples, read the documentation, and inspect the JSONL files.
* 'evals' is open to contribution. You may want to contribute as your next step.

### Option 1: Use the source code
The evals registry is stored using [Git-LFS](https://git-lfs.com/), so install LFS first.
```console
brew install git-lfs
git lfs install
```

Go to the parent directory (assuming this notebook is in openai-cheat-sheet/) and clone the evals repository.
```console
cd ..
git clone https://github.com/openai/evals.git
```

Fetch all the files (from within your local copy of the evals repo). This will populate all the pointer files under **evals/registry/data**.
```console
cd ../evals
git lfs fetch --all 
git lfs pull
```

Install evals from the source code along with its dependencies.
```console
cd ../evals
pip install -e .
```

### Option 2: Install from Python Package Index

```console
pip install evals
```

## Scenario: Evaluation of retrieval with different options

Let's assume that we have a chat completion that uses a CSV file for retrieval. This file should have only two columns: text and embedding. For each user query, it finds the top {k} embeddings in the CSV that are closest to the user's query, then adds the corresponding 'text' of these embeddings to the system message to enrich the context. The completion then replies using that context.
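
The retrieval step described above can be sketched in a few lines. This is only a minimal illustration, assuming cosine similarity as the closeness measure and using made-up 3-dimensional vectors in place of real embeddings; the helper names `cosine_similarity` and `top_k_texts` are hypothetical and not part of evals.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_texts(query_embedding, rows, k=2):
    """Return the 'text' of the k rows whose embedding is closest to the query."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_embedding, row["embedding"]),
        reverse=True,
    )
    return [row["text"] for row in ranked[:k]]

# Toy vectors stand in for real embeddings (hypothetical values).
rows = [
    {"text": "Franklin Pierce was born on 11/23/1804", "embedding": [1.0, 0.0, 0.0]},
    {"text": "Abraham Lincoln was born on 2/12/1809", "embedding": [0.0, 1.0, 0.0]},
    {"text": "George Washington was born on 2/22/1732", "embedding": [0.0, 0.0, 1.0]},
]
query_embedding = [0.8, 0.6, 0.1]  # pretend embedding of a Pierce/Lincoln question

# The k closest texts are joined and placed in the system message.
context = top_k_texts(query_embedding, rows, k=2)
system_message = "Use the following context:\n" + "\n".join(context)
print(system_message)
```

With real embeddings the vectors are much longer, but the ranking logic stays the same.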

This is already implemented in [evals.completion_fns.retrieval:RetrievalCompletionFn](https://github.com/openai/evals/blob/main/evals/completion_fns/retrieval.py) class in evals. But in my tests, it was not working correctly. So here we will create our own custom completion function and register it.

Our aim is to try out 3 different options with the same sample data set and compare their accuracy scores.
1. Using GPT-3.5-turbo directly without feeding any extra context with retrieval.
2. Using GPT-3.5-turbo with retrieval.
3. Using GPT-3.5-turbo with retrieval and instructing it to use chain-of-thought logic.

For this, we will use a file with the birthdays of all presidents. Our completion function will retrieve the data from this file. Then we will test it with some sample user questions like "Was Franklin Pierce born before Abraham Lincoln? Answer Y or N."

### Step 1: Setup retrieval data
For retrieval, we will use [president_birthdays.csv](./data/president_birthdays.csv).
1. We will generate the 'text' column using the data in the file.
2. We will get the 'embedding' column using the 'text' column.
3. We will generate the 'output/presidents_embeddings.csv' file with only the 'text' and 'embedding' columns. This file will be the input for our retrieval completion function.

In [1]:
import pandas as pd

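input_datapath = "data/president_birthdays.csv"

# The CSV headers carry a leading space and literal quotes, so rename them to plain names.
df = pd.read_csv(input_datapath).rename(columns={" \"Name\"": "Name", " \"Month\"": "Month", " \"Day\"": "Day", " \"Year\"": "Year"}).set_index("Index")
# Build one sentence per president; this is the text that will be embedded.
df["text"] = df.apply(lambda r: f"{r['Name']} was born on {r['Month']}/{r['Day']}/{r['Year']}", axis=1)
display(df.head())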

| Index | Name | Day | Month | Year | text |
|---|---|---|---|---|---|
| 1 | "George Washington" | 22 | 2 | 1732.0 | "George Washington" was born on 2/22/1732.0 |
| 2 | "John Adams" | 30 | 10 | 1735.0 | "John Adams" was born on 10/30/1735.0 |
| 3 | "Thomas Jefferson" | 13 | 4 | 1743.0 | "Thomas Jefferson" was born on 4/13/1743.0 |
| 4 | "James Madison" | 16 | 3 | 1751.0 | "James Madison" was born on 3/16/1751.0 |
| 5 | "James Monroe" | 28 | 4 | 1758.0 | "James Monroe" was born on 4/28/1758.0 |


In [2]:
from openai import OpenAI
client = OpenAI()

def embed(text):
    return client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        ).data[0].embedding

df["embedding"] = df['text'].apply(embed)
df[["text", "embedding"]].to_csv("output/presidents_embeddings.csv")
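
One detail worth keeping in mind: when a list-valued column is written to CSV, each embedding is stored as a string such as "[0.1, 0.2, 0.3]", so whatever reads the file back has to parse it into numbers again. The sketch below shows one way to do that with the standard library; the CSV content is a tiny made-up stand-in for 'output/presidents_embeddings.csv', not the real file.

```python
import ast
import csv
import io

# Stand-in for output/presidents_embeddings.csv with tiny made-up embeddings.
csv_text = (
    "Index,text,embedding\n"
    '1,George Washington was born on 2/22/1732,"[0.1, 0.2, 0.3]"\n'
    '2,John Adams was born on 10/30/1735,"[0.4, 0.5, 0.6]"\n'
)

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # The embedding arrives as a string; literal_eval turns it into a list of floats.
    row["embedding"] = ast.literal_eval(row["embedding"])
    rows.append(row)

print(type(rows[0]["embedding"]).__name__, len(rows[0]["embedding"]))
```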

### Step 2: Implement your completion function

You can check the [MLRetrievalCompletionFn class](./evals_registry/completion_fns/retrieval.py). It accepts 3 arguments:
1. **completion_fn**: Either a model name, e.g. 'gpt-4-turbo', or another completion function. This allows chaining: one completion function takes another as its input.
2. **embeddings_and_text_path**: A CSV file with 'text' and 'embedding' columns, used for retrieval.
3. **k**: The top k closest embeddings whose text is passed to the prompt. In our case it is 2, because the user will always ask questions similar to "Was Andrew Jackson born before William Harrison?", so we only ever need two presidents' data.

If you want to read more about completion functions, see [this document](https://github.com/openai/evals/blob/main/docs/completion-fns.md).
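
To make the chaining idea concrete, here is a rough, offline sketch of a completion function that enriches the prompt and then delegates to an inner one. It is not the code of MLRetrievalCompletionFn: `SimpleResult`, `RetrievalWrapper`, `fake_retrieve` and `fake_model` are hypothetical stand-ins, and the only convention assumed from evals is that a completion function is callable with a prompt and returns an object exposing `get_completions()`.

```python
class SimpleResult:
    """Stand-in for a completion result: exposes get_completions()."""
    def __init__(self, text):
        self.text = text

    def get_completions(self):
        return [self.text]

class RetrievalWrapper:
    """Hypothetical wrapper: adds retrieved context, then calls the inner function."""
    def __init__(self, completion_fn, retrieve, k=2):
        self.completion_fn = completion_fn  # inner completion function (the chain)
        self.retrieve = retrieve            # callable: (query, k) -> list of texts
        self.k = k

    def __call__(self, prompt, **kwargs):
        # Use the last user message as the retrieval query.
        query = [m["content"] for m in prompt if m["role"] == "user"][-1]
        context = "\n".join(self.retrieve(query, self.k))
        enriched = [{"role": "system", "content": "Context:\n" + context}] + list(prompt)
        return self.completion_fn(enriched, **kwargs)

# Fake pieces so the sketch runs without any API call.
def fake_retrieve(query, k):
    facts = [
        "Franklin Pierce was born on 11/23/1804",
        "Abraham Lincoln was born on 2/12/1809",
        "George Washington was born on 2/22/1732",
    ]
    # Keep facts whose president name appears in the query (a crude substitute for embeddings).
    return [f for f in facts if f.split(" was")[0] in query][:k]

def fake_model(prompt, **kwargs):
    # Echo the first message instead of calling an LLM, to show what was injected.
    return SimpleResult(prompt[0]["content"])

fn = RetrievalWrapper(fake_model, fake_retrieve, k=2)
prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Was Franklin Pierce born before Abraham Lincoln? Answer Y or N."},
]
result = fn(prompt)
print(result.get_completions()[0])
```

Because the inner function is just another callable, swapping `fake_model` for a chain-of-thought completion function would not require changing the wrapper.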


### Step 3: Register your completion function

To register our completion function, we need to write a YAML file. You can check [presidents.yaml](./evals_registry/completion_fns/presidents.yaml). 

Here we define 3 different completion functions:
1. **cot/gpt-3.5**: We will not use this one directly, but pass it as an input to the 3rd one. It is based on the [ChainOfThoughtCompletionFn](https://github.com/openai/evals/blob/main/evals/completion_fns/cot.py) class, which adds chain-of-thought logic on top of gpt-3.5-turbo.
2. **retrieval/presidents/gpt-3.5-turbo**: for using 'gpt-3.5-turbo' with retrieval.
3. **retrieval/presidents/cot/gpt-3.5-turbo**: for using 'gpt-3.5-turbo' with chain-of-thought logic and retrieval.


In [3]:
# Open the file in read mode ('r')
with open('evals_registry/completion_fns/presidents.yaml', 'r') as file:
    # Read the file's content
    file_content = file.read()
    # Print the content
    print(file_content)

cot/gpt-3.5:
  class: evals.completion_fns.cot:ChainOfThoughtCompletionFn
  args:
    cot_completion_fn: gpt-3.5-turbo

retrieval/presidents/gpt-3.5-turbo:
  class: evals_registry.completion_fns.retrieval:MLRetrievalCompletionFn
  args:
    completion_fn: gpt-3.5-turbo
    embeddings_and_text_path: output/presidents_embeddings.csv
    k: 2

retrieval/presidents/cot/gpt-3.5-turbo:
  class: evals_registry.completion_fns.retrieval:MLRetrievalCompletionFn
  args:
    completion_fn: cot/gpt-3.5
    embeddings_and_text_path: output/presidents_embeddings.csv


### Step 4: Build your eval

An eval is simply a dataset and a choice of eval class. You have 3 options to build an eval:
1. Using one of the basic eval classes. The most common ones are 'Match', 'Includes' and 'FuzzyMatch'. For a full list, you can check [here](https://github.com/openai/evals/tree/main/evals/elsuite/basic).
2. Using the [model graded eval class](https://github.com/openai/evals/blob/main/docs/eval-templates.md#the-model-graded-eval-template). A model-graded eval uses an LLM to evaluate the outputs of another LLM.
3. Creating your own custom eval class. In most cases this will not be necessary, so it is not covered in this notebook. You can check [here](https://github.com/openai/evals/blob/main/docs/custom-eval.md) if you want to learn more.
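
As a rough mental model of how the three basic classes differ, the sketch below re-implements their comparison rules in plain Python. These are simplified approximations written for this notebook, not the evals source code (the real FuzzyMatch, for example, also normalizes text before comparing), and the sample completions are made up.

```python
def as_list(ideal):
    """'ideal' may be a single string or a list of strings."""
    return [ideal] if isinstance(ideal, str) else list(ideal)

def match(completion, ideal):
    # Correct if the completion starts with any reference answer.
    return any(completion.startswith(i) for i in as_list(ideal))

def includes(completion, ideal):
    # Correct if any reference answer appears anywhere in the completion.
    return any(i in completion for i in as_list(ideal))

def fuzzy_match(completion, ideal):
    # Correct if either string contains the other.
    return any(completion in i or i in completion for i in as_list(ideal))

def accuracy(check, completions, ideals):
    results = [check(c, i) for c, i in zip(completions, ideals)]
    return sum(results) / len(results)

# Made-up model outputs and their reference answers.
completions = ["N", "Yes, he was.", "The answer is Y"]
ideals = ["N", "Y", ["Y"]]

match_accuracy = accuracy(match, completions, ideals)
includes_accuracy = accuracy(includes, completions, ideals)
print(match_accuracy, includes_accuracy)
```

The third completion shows why the prompt asks the model to "Answer Y or N": a prefix-based check only succeeds when the answer comes first.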

Register the eval by adding a file at `evals/<eval_name>.yaml` under the registry folder (in our case 'evals_registry'), using the elsuite registry format. For example, for a Match eval, it would be:

```yaml
<eval_name>:
  id: <eval_name>.dev.v0
  description: <description>
  metrics: [accuracy]

<eval_name>.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: <eval_name>/samples.jsonl
```

When the eval runs, the samples file is looked up in the 'data' folder under the registry. For example, if older/samples.jsonl is the provided file, the data is expected to be in evals_registry/data/older/samples.jsonl.

The naming convention for evals is in the form `{eval_name}.{split}.{version}`:

* **eval_name** is the eval name, used to group evals whose scores are comparable.
* **split** is the data split, used to further group evals that are under the same eval_name, e.g. "val", "test", or "dev" for testing.
* **version** is the version of the eval, which can be any descriptive text you'd like to use (though it's best if it does not contain a '.').

In general, running the same eval name against the same model should always give similar results so that others can reproduce it. Therefore, when you change your eval, you should bump the version.
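
A small helper can make the naming convention tangible. `parse_eval_name` is a hypothetical function written for this notebook, not part of evals; it assumes the first two dots separate the name, split and version.

```python
def parse_eval_name(full_name):
    """Split '<eval_name>.<split>.<version>' into its three parts."""
    # maxsplit=2 keeps any extra dots inside the version part.
    eval_name, split, version = full_name.split(".", 2)
    return {"eval_name": eval_name, "split": split, "version": version}

parts = parse_eval_name("older.dev.v0")
print(parts)
```

Bumping the version after a change would then mean registering, for example, `older.dev.v1` alongside or instead of `older.dev.v0`.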

In our sample scenario, we will go ahead with the first option and use 'Match'. You can check [older.yaml](./evals_registry/evals/older.yaml).

In [4]:
# Open the file in read mode ('r')
with open('evals_registry/evals/older.yaml', 'r') as file:
    # Read the file's content
    file_content = file.read()
    # Print the content
    print(file_content)

older:
  id: older.dev.v0
  description: Test the model's ability to determine who is older.
  metrics: [accuracy]
older.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: older/samples.jsonl


### Step 5: Setup your sample data to use for evaluation
You will need to convert your samples into JSON Lines (JSONL) format. A JSONL file is a text file with one JSON object per line.

You can use the openai CLI (available with OpenAI-Python) to transform data from some common file types into JSONL:

```console
openai tools fine_tunes.prepare_data -f data[.csv, .json, .txt, .xlsx or .tsv]
```
You can find some examples of JSONL eval files [here](https://github.com/openai/evals/blob/main/evals/registry/data/README.md).

Each JSON object will represent one data point in your eval. The keys you need in the JSON object depend on the eval template. All templates expect an "`input`" key, which is the prompt.

For the basic evals Match, Includes, and FuzzyMatch, the other required key is "`ideal`", which is a string (or a list of strings) specifying the correct reference answer(s). 
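
If you prefer to build the samples yourself, the sketch below shows one possible way to produce lines with the "input" and "ideal" keys for this scenario. The `make_sample` helper and the small `birthdays` dictionary are illustrative assumptions, not the script that generated the notebook's samples.jsonl.

```python
import json
from datetime import date

# Two well-known birth dates, enough to illustrate the format.
birthdays = {
    "Franklin Pierce": date(1804, 11, 23),
    "Abraham Lincoln": date(1809, 2, 12),
}

def make_sample(first, second):
    """Build one eval sample asking whether `first` was born before `second`."""
    question = f"Was {first} born before {second}? Answer Y or N."
    return {
        "input": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "ideal": "Y" if birthdays[first] < birthdays[second] else "N",
    }

samples = [
    make_sample("Abraham Lincoln", "Franklin Pierce"),
    make_sample("Franklin Pierce", "Abraham Lincoln"),
]

# One JSON object per line is all that JSONL requires.
jsonl = "\n".join(json.dumps(s) for s in samples)
print(jsonl)
```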

In our sample scenario, we will use [samples.jsonl](./evals_registry/data/older/samples.jsonl). Check its content to get an understanding of the format.

In [5]:
# Open the file in read mode ('r')
with open('evals_registry/data/older/samples.jsonl', 'r') as file:
    # Loop over the first 3 lines and print each
    for i, line in enumerate(file):
        if i < 3:
            print(line, end='')  # Use end='' to avoid adding extra newlines
        else:
            break

{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Was Abraham Lincoln born before Franklin Pierce? Answer Y or N."}], "ideal": "N"}
{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Was Abraham Lincoln born before Andrew Johnson? Answer Y or N."}], "ideal": "N"}
{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Was Andrew Jackson born before John Quincy Adams? Answer Y or N."}], "ideal": "Y"}


### Step 6: Run the eval
After installing evals, we can run an evaluation from the command line by specifying a completion function and an evaluation task.

* `completion_fn`: To use an OpenAI model directly, simply put its name. If you have your own custom completion function (as in [presidents.yaml](./evals_registry/completion_fns/presidents.yaml)), put its registered name.
* `eval_task`: It refers to a YAML file in the registry directory. The file defines parameters for a specific evaluation task, e.g., evaluation data, evaluation metrics and prompting strategies.

```console
oaieval <completion_fn> <eval_task>
```

Each run will create a JSONL log file under the tmp folder, such as '/tmp/evallogs/2402140749577WDOLKUD_gpt-3.5-turbo_older.jsonl'. You can check the corresponding file to see how the run went.
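
Since the log is JSONL, it can be inspected with a few lines of Python. The sketch below assumes the log contains a line with a "final_report" key holding the metrics; treat that key name and the synthetic lines as assumptions to verify against your own log file. `read_final_report` is a hypothetical helper.

```python
import json
import os
import tempfile

def read_final_report(log_path):
    """Return the first 'final_report' object found in a JSONL log, else None."""
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if "final_report" in record:
                return record["final_report"]
    return None

# Synthetic log lines (made up for illustration, not real oaieval output).
fake_lines = [
    {"spec": {"eval_name": "older.dev.v0"}},
    {"final_report": {"accuracy": 0.9}},
    {"type": "match", "data": {"correct": True}},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for entry in fake_lines:
        tmp.write(json.dumps(entry) + "\n")

report = read_final_report(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(report)
```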

For more information on running evals, you can read [this](https://github.com/openai/evals/blob/main/docs/run-evals.md).

**NOTE**: The default registry path for evals is 'evals/registry'; here we use a custom path, './evals_registry'. So we need to pass the `--registry_path` argument to point to our registry folder.

**NOTE**: For evals to resolve evals_registry.completion_fns.retrieval:MLRetrievalCompletionFn class, we need to add the path of the evals_registry folder to the PYTHONPATH. 
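
The reason for the PYTHONPATH requirement becomes clearer if you think of a `module:Class` string as something that has to be imported at run time. The sketch below illustrates that general mechanism with `importlib`; it is an assumption about how such strings are typically resolved, not the actual evals loader, and `load_class` is a hypothetical helper demonstrated on a standard-library class.

```python
import importlib

def load_class(path):
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_name, class_name = path.split(":")
    module = importlib.import_module(module_name)  # fails if the module is not importable
    return getattr(module, class_name)

# A standard-library class stands in for a custom completion function class.
cls = load_class("fractions:Fraction")
obj = cls(**{"numerator": 3, "denominator": 4})  # 'args' from YAML would be passed like this
print(obj)

# A module that Python cannot find raises ModuleNotFoundError.
try:
    load_class("no_such_module_xyz:Thing")
    import_failed = False
except ModuleNotFoundError:
    import_failed = True
print(import_failed)
```

If the folder containing `evals_registry` is not on the import path, the import step fails in the same way as the second call above.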


In [6]:
# Use gpt-3.5-turbo directly without any retrieval or extra prompt -> Accuracy: 0.7
!export PYTHONPATH=".:$PYTHONPATH"; oaieval gpt-3.5-turbo older --max_samples 10 --registry_path ./evals_registry

[2024-04-28 16:59:48,590] [registry.py:271] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/openai-cheat-sheet/env/lib/python3.12/site-packages/evals/registry/evals
[2024-04-28 16:59:49,134] [registry.py:271] Loading registry from /Users/meltemseyhan/.evals/evals
[2024-04-28 16:59:49,134] [registry.py:271] Loading registry from evals_registry/evals
[2024-04-28 16:59:49,137] [oaieval.py:215] Run started: 240428135949DJTNETK7
[2024-04-28 16:59:49,211] [data.py:94] Fetching evals_registry/data/older/samples.jsonl
[2024-04-28 16:59:49,212] [eval.py:36] Evaluating 10 samples
[2024-04-28 16:59:49,253] [eval.py:144] Running in threaded mode with 10 threads!
100%|███████████████████████████████████████████| 10/10 [00:02<00:00,  3.73it/s]
[2024-04-28 16:59:51,956] [oaieval.py:275] Found 10/10 sampling events with usage data
[2024-04-28 16:59:51,956] [oaieval.py:283] Token usage from 10 sampling events:
completion_tokens: 12
prompt_tokens: 317
total_tokens: 

In [7]:
# Use gpt-3.5-turbo with retrieval -> Accuracy: 0.9
!export PYTHONPATH=".:$PYTHONPATH"; oaieval retrieval/presidents/gpt-3.5-turbo older --max_samples 10 --registry_path ./evals_registry

[2024-04-28 17:00:18,157] [registry.py:271] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/openai-cheat-sheet/env/lib/python3.12/site-packages/evals/registry/evals
[2024-04-28 17:00:18,756] [registry.py:271] Loading registry from /Users/meltemseyhan/.evals/evals
[2024-04-28 17:00:18,756] [registry.py:271] Loading registry from evals_registry/evals
[2024-04-28 17:00:19,266] [registry.py:271] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/openai-cheat-sheet/env/lib/python3.12/site-packages/evals/registry/completion_fns
[2024-04-28 17:00:19,271] [registry.py:271] Loading registry from /Users/meltemseyhan/.evals/completion_fns
[2024-04-28 17:00:19,271] [registry.py:271] Loading registry from evals_registry/completion_fns
[2024-04-28 17:00:19,273] [registry.py:271] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/openai-cheat-sheet/env/lib/python3.12/site-packages/evals/registry/solvers
[2024-04-28 17:00:19,45

In [8]:
# Use gpt-3.5-turbo with retrieval and chain-of-thought -> Accuracy: 0.9
!export PYTHONPATH=".:$PYTHONPATH"; oaieval retrieval/presidents/cot/gpt-3.5-turbo older --max_samples 10 --registry_path ./evals_registry

[2024-04-28 17:00:51,275] [registry.py:271] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/openai-cheat-sheet/env/lib/python3.12/site-packages/evals/registry/evals
[2024-04-28 17:00:51,824] [registry.py:271] Loading registry from /Users/meltemseyhan/.evals/evals
[2024-04-28 17:00:51,824] [registry.py:271] Loading registry from evals_registry/evals
[2024-04-28 17:00:52,299] [registry.py:271] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/openai-cheat-sheet/env/lib/python3.12/site-packages/evals/registry/completion_fns
[2024-04-28 17:00:52,305] [registry.py:271] Loading registry from /Users/meltemseyhan/.evals/completion_fns
[2024-04-28 17:00:52,305] [registry.py:271] Loading registry from evals_registry/completion_fns
[2024-04-28 17:00:52,307] [registry.py:271] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/openai-cheat-sheet/env/lib/python3.12/site-packages/evals/registry/solvers
[2024-04-28 17:00:52,49