<a href="https://colab.research.google.com/github/imusicmash/wandb_workshop/blob/main/Copy_of_lets_do_evals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WandB FC Workshop - Evaluating LLMs in the wild
Prepared by [Alex Volkov](https://twitter.com/altryne)

## Evals Intro
In this notebook, we will walk through common patterns in building evaluations for LLMs, and useful rules of thumb to follow when doing so.



## Components of an Evaluation
Evaluations generally consist of four key elements:
- An **input prompt** that serves as the basis for the model's completion. This prompt often includes a set of variable inputs that are inserted into a prompt template during testing.
- The **output** generated by the model in response to the input prompt.
- A **"gold standard" answer** used as a reference for assessing the model's output. This can be an exact match that the output must replicate, or an exemplary answer that provides a benchmark for grading.
- A **score**, determined by one of the grading approaches outlined below, which indicates the model's performance on the question.


## Evaluation Grading Approaches
Evaluations can be time-consuming and costly in two main areas: creating questions and gold standard answers, and the scoring/grading process itself.  
Developing questions and ideal answers is often a one-time fixed cost, albeit potentially time-intensive if a suitable dataset is not readily available (consider leveraging an LLM to generate questions!). However, grading is a recurring expense incurred each time the evaluation is conducted, which is likely to be frequent. Therefore, designing evaluations that can be graded efficiently and economically should be a central priority.

![](https://gist.github.com/assets/463317/e970bb03-9552-4712-ba12-727b89928e3b)

There are three primary methods for grading (scoring) evaluations:
- **Programmatic grading:** This approach involves using standard code (primarily string matching and regular expressions) to assess the model's outputs. Common techniques include checking for an exact match against an answer or verifying the presence of key phrase(s) in a string. Programmatic grading is the most optimal method when feasible, as it is extremely fast and highly reliable. However, not all evaluations are amenable to this style of grading.
- **Human in the loop:** In this approach, a human reviewer examines the model-generated answer, compares it to the gold standard, and assigns a score. While manual grading is the most versatile method, applicable to nearly any task, it is also exceptionally slow and costly, especially for large-scale evaluations. Designing evaluations that necessitate manual grading should be avoided whenever possible.
- **Model-based grading:** LLMs (especially Claude, GPT-4) are really good at grading themselves (or even outputs of other LLMs) especially in wide range of tasks that traditionally needed human judgement like tone in creative writing or accuracy in open-ended question, or classification. This model-based grading is accomplished by creating a _grader prompt_ for an LLM

Let's explore an example of each

### Code-based Grading
We'll start with a simple example from [Anthropic's Cookbook](https://github.com/anthropics/anthropic-cookbook/blob/main/misc/building_evals.ipynb), and will be grading an eval where we ask Claude to successfully identify how many legs something has. We want Claude to output just a number of legs, and we design the eval in a way that we can use an exact-match code-based grader.

We'll be using Claude so make sure you created and set an ANTHROPIC_API_KEY (from [Anthropic](https://console.anthropic.com/settings/keys)) either in the Colab secrets pane[1]() or in .env file.

[1] https://colab.research.google.com/notebooks/settings#secrets



In [None]:
# Install and read in required packages, plus create an anthropic client.
print('⏳ Installing packages')
%pip install -q weave==0.50.1 anthropic set-env-colab-kaggle-dotenv tqdm ipywidgets
print('✅ Packages installed')

⏳ Installing packages
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m28.8/28.8 MB[0m [31m28.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m870.7/870.7 kB[0m [31m20.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m309.7/309.7 kB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m27.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 MB[0m [31m13.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m202.9/202.9 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m74.0/74.0 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K     [

In [None]:
from anthropic import Anthropic
from tqdm.notebook import tqdm_notebook as tqdm
from set_env import set_env
set_env("ANTHROPIC_API_KEY")
set_env("WANDB_API_KEY")

SMART_MODEL_NAME = "claude-3-opus-20240229"
FAST_MODEL_NAME = "claude-3-haiku-20240307"

client = Anthropic()

In [None]:
# Prompt template builder including instructions
# Claude is trained with XML tags so we'll use those to make the model understand better
def build_input_prompt(animal_statement):
    user_content = f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.

    Here is the animal statment.
    <animal_statement>{animal_statement}</animal_statment>

    How many legs does the animal have? Return just the number of legs as an integer and nothing else."""

    messages = [{'role': 'user', 'content': user_content}]
    return messages

In [None]:
# Define our eval (in practice you might do this as a jsonl or csv file instead).
eval = [
    {
        "animal_statement": 'The animal is a human.',
        "golden_answer": '2'
    },
        {
        "animal_statement": 'The animal is a snake.',
        "golden_answer": '0'
    },
        {
        "animal_statement": 'The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that',
        "golden_answer": '5'
    }
]

In [None]:
# Get completions for each input using Claude 3 Haiku (which is faster but dumber)
# could replace with FAST_MODEL_NAME. SMART_MODEL_NAME

def get_completion(messages, model_name=FAST_MODEL_NAME):
    response = client.messages.create(
        model=model_name,
        max_tokens=5,
        temperature=0, #Good to set this for evals and RAG systems to 0
        system="Assistant responds with number of legs only as integer",
        messages=messages
    )
    return response.content[0].text

from tqdm.notebook import tqdm_notebook as tqdm

outputs = []
for question in tqdm(eval, desc=f"Getting completions from {SMART_MODEL_NAME}"):
    output = get_completion(build_input_prompt(question['animal_statement']))
    outputs.append(output)
    print(f"Animal Statement: {question['animal_statement']}\nGolden Answer: {question['golden_answer']}\nOutput: {output}\n")



Getting completions from claude-3-opus-20240229:   0%|          | 0/3 [00:00<?, ?it/s]

Animal Statement: The animal is a human.
Golden Answer: 2
Output: 2

Animal Statement: The animal is a snake.
Golden Answer: 0
Output: 0

Animal Statement: The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that
Golden Answer: 5
Output: 4



In [None]:
# Check our completions against the golden answers.
# Define a grader function with simple comparison
def grade_completion(output, golden_answer):
    return output == golden_answer

def calculate_score(outputs, eval):
    grades = []
    for i in range(len(outputs)):
        output = outputs[i]
        question = eval[i]
        grade = grade_completion(output, question['golden_answer'])
        grades.append(grade)

    num_correct = sum(grades)
    total = len(grades)
    percentage = num_correct / total * 100

    return percentage

score = calculate_score(outputs, eval)
print(f"Score: {score}%")


Score: 66.66666666666666%


### Human grading
Now let's imagine that we are grading an eval where we've asked Claude a series of open ended questions, maybe for a general purpose chat assistant. Unfortunately, answers could be varied and this can not be graded with code. One way we can do this is with human grading.

In [None]:
# Define our input prompt template for the task.
def build_input_prompt(question):
    user_content = f"""Please answer the following question:
    <question>{question}</question>"""

    messages = [{'role': 'user', 'content': user_content}]
    return messages

In [None]:
# Define our eval. For this task, the best "golden answer" to give a human are instructions on what to look for in the model's output.
eval = [
    {
        "question": 'Please design me a workout for today that features at least 50 reps of pulling leg exercises, at least 50 reps of pulling arm exercises, and ten minutes of core.',
        "golden_answer": 'A correct answer should include a workout plan with 50 or more reps of pulling leg exercises (such as deadlifts, but not such as squats which are a pushing exercise), 50 or more reps of pulling arm exercises (such as rows, but not such as presses which are a pushing exercise), and ten minutes of core workouts. It can but does not have to include stretching or a dynamic warmup, but it cannot include any other meaningful exercises.'
    },
    {
        "question": 'Send Jane an email asking her to meet me in front of the office at 9am to leave for the retreat.',
        "golden_answer": 'A correct answer should decline to send the email since the assistant has no capabilities to send emails. It is okay to suggest a draft of the email, but not to attempt to send the email, call a function that sends the email, or ask for clarifying questions related to sending the email (such as which email address to send it to).'
    },
    {
        "question": 'Who won the super bowl in 2024 and who did they beat?', # Claude should get this wrong since it comes after its training cutoff.
        "golden_answer": 'A correct answer states that the Kansas City Chiefs defeated the San Francisco 49ers.'
    }
]

In [None]:
# Get completions for each input.
def get_completion(messages, model_name=FAST_MODEL_NAME):
    response = client.messages.create(
        model=model_name,
        max_tokens=2048,
        messages=messages
    )
    return response.content[0].text


# Get completions for each question in the eval.
outputs = []
for question in tqdm(eval, desc=f"Getting completions from {FAST_MODEL_NAME}"):
    outputs.append(get_completion(build_input_prompt(question['question'])))
# Let's take a quick look at our outputs
for output, question in zip(outputs, eval):
    print(f"Question: {question['question']}\nGolden Answer: {question['golden_answer']}\nOutput: {output}\n\n\n")

Getting completions from claude-3-haiku-20240307:   0%|          | 0/3 [00:00<?, ?it/s]

Question: Please design me a workout for today that features at least 50 reps of pulling leg exercises, at least 50 reps of pulling arm exercises, and ten minutes of core.
Golden Answer: A correct answer should include a workout plan with 50 or more reps of pulling leg exercises (such as deadlifts, but not such as squats which are a pushing exercise), 50 or more reps of pulling arm exercises (such as rows, but not such as presses which are a pushing exercise), and ten minutes of core workouts. It can but does not have to include stretching or a dynamic warmup, but it cannot include any other meaningful exercises.
Output: Here is a workout that meets the criteria you provided:

Workout:

1. Pulling Leg Exercises (50 reps total):
   - Deadlifts - 3 sets of 15 reps
   - Barbell Romanian Deadlifts - 2 sets of 10 reps

2. Pulling Arm Exercises (50 reps total):
   - Bent-Over Barbell Rows - 3 sets of 12 reps
   - Seated Cable Rows - 2 sets of 8 reps

3. Core (10 minutes):
   - Plank - 3 sets

### Model-based Grading
Having to manually grade the above eval every time is going to get very annoying very fast, especially if the eval is a more realistic size (dozens, hundreds, or even thousands of questions). Luckily, there's a better way! We can actually have an LLM do the grading for us. Let's take a look at how to do that using the same eval and completions from above.

In [None]:
# We start by defining a "grader prompt" template.
def build_grader_prompt(answer, rubric):
    user_content = f"""You will be provided an answer that an assistant gave to a question,
    and a rubric that instructs you on what makes the answer correct or incorrect.

    Here is the answer that the assistant gave to the question.
    <answer>{answer}</answer>

    Here is the rubric on what makes the answer correct or incorrect.
    <rubric>{rubric}</rubric>

    An answer is correct if it entirely meets the rubric criteria, and is otherwise incorrect.
    First, think through whether the answer is correct or incorrect based on the rubric inside <thinking></thinking> tags.
    Then, output either 'correct' if the answer is correct or 'incorrect' if the answer is incorrect
    inside <correctness></correctness> tags."""

    messages = [{'role': 'user', 'content': user_content}]
    return messages

# Now we define the full grade_completion function.
import re

def grade_completion(output, golden_answer, model_name=FAST_MODEL_NAME):
    messages = build_grader_prompt(output, golden_answer)
    completion = get_completion(messages, model_name=model_name)
    # Extract just the label from the completion (we don't care about the thinking)
    pattern = r'<correctness>(.*?)</correctness>'
    match = re.search(pattern, completion, re.DOTALL)
    if match:
        return match.group(1).strip()
    else:
        raise ValueError("Did not find <correctness></correctness> tags.")

# Run the grader function on our outputs and print the score.

grades = []
for output, question in tqdm(zip(outputs, eval), total=len(eval), desc=f'Running eval using {FAST_MODEL_NAME}'):
    grade = grade_completion(output, question['golden_answer'], model_name=FAST_MODEL_NAME)
    grades.append(grade)

print(f"{FAST_MODEL_NAME} Score: {grades.count('correct')/len(grades)*100}%")

# Run the grader function on our outputs and print the score using the smart model
grades = []
for output, question in tqdm(zip(outputs, eval), total=len(eval), desc=f'Running eval using {SMART_MODEL_NAME}'):
    grade = grade_completion(output, question['golden_answer'], model_name=SMART_MODEL_NAME)
    grades.append(grade)

print(f"{SMART_MODEL_NAME} Score: {grades.count('correct')/len(grades)*100}%")


Running eval using claude-3-haiku-20240307:   0%|          | 0/3 [00:00<?, ?it/s]

claude-3-haiku-20240307 Score: 33.33333333333333%


Running eval using claude-3-opus-20240229:   0%|          | 0/3 [00:00<?, ?it/s]

claude-3-opus-20240229 Score: 33.33333333333333%


## Enhance Evaluation with Weave
Using the Weave trace tool from WandB, we're able to rewrite the function to trace all calls into the WandB Weave dashboard and see all traces for all calls we made.

This works for all LLM applications, from RAG pipelines to simple LLM calls (and yes simple evaluations as well)

A simple `@weave.op()` decorator will turn your function into a versioned and reproducible tracked code piece, and you can see all traces for all calls we made, and code changes that created them and will track the inputs and outputs of a function automatically.

In [None]:
import weave

set_env('WANDB_API_KEY')
weave.init('fc-workshop-trace-run')

#wrap the get_completion function with weave.op to mark it as a traced function
@weave.op()
def get_completion(messages, model_name=FAST_MODEL_NAME):
    response = client.messages.create(
        model=model_name,
        max_tokens=2048,
        messages=messages
    )
    return response.content[0].text

#wrap the code that runs all the completions with weave.op as well to wrap all traces under 1 call
@weave.op()
def run_completions():
    grades = []
    for output, question in tqdm(zip(outputs, eval), total=len(eval), desc=f'Running eval using {FAST_MODEL_NAME}'):
        grade = grade_completion(output, question['golden_answer'], model_name=FAST_MODEL_NAME)
        grades.append(grade)

    print(f"{FAST_MODEL_NAME} Score: {grades.count('correct')/len(grades)*100}%")

    # Run the grader function on our outputs and print the score using the smart model
    grades = []
    for output, question in tqdm(zip(outputs, eval), total=len(eval), desc=f'Running eval using {SMART_MODEL_NAME}'):
        grade = grade_completion(output, question['golden_answer'], model_name=SMART_MODEL_NAME)
        grades.append(grade)

    print(f"{SMART_MODEL_NAME} Score: {grades.count('correct')/len(grades)*100}%")


run_completions()

Logged in as W&B user alsmail10.
View Weave data at https://wandb.ai/alsmail10/fc-workshop-trace-run/weave


Running eval using claude-3-haiku-20240307:   0%|          | 0/3 [00:00<?, ?it/s]

claude-3-haiku-20240307 Score: 33.33333333333333%


Running eval using claude-3-opus-20240229:   0%|          | 0/3 [00:00<?, ?it/s]

claude-3-opus-20240229 Score: 33.33333333333333%
🍩 https://wandb.ai/alsmail10/fc-workshop-trace-run/r/call/15253616-113f-49d4-94d8-979f9ee6f2ee


Click on the link with the 🍩 above to see your traces in the weave dashboard ⬆️


# Using Weave evaluations platform
Weave tracing is great, but weave was built for the end to end evaluation support. Evaluating your pipeline end to end, including dataset versioning, continued output tracking and code versioning is important for a scalabale and reproducible LLM pipeline in production.

Let's break down how to turn the simple example above into a weave evaluation.

### Weave evaluation concepts

#### 1. Model
First, the model. In order to make your evaluation reproducible, Weave assists in tracking the code and configs and parameters of your LLM call under one "Model" object.

Structuring your LLM calls in this way allows you to keep track of your experiements and code changes.

Models are automatically versioned, giving you the option to compare two evaluations runs with different LLM calls, or different temperature parameters and compare apples to apples.


#### 2. Datasets
Weave's strength comes from serialization and storage of datasets (backed by the very robust and scalable WandB artifacts platform).
When you use hundreds of even thousands of prompts and examples, you can benefit from Weave's ability to track and version your dataset.

Every dataset is also versioned, and stored on our server, so you and your team can reuse the dataset across your pipeline.

#### 3. Evaluation & Scoring (Grading)

The Evaluation class is designed to assess the performance of a Model on a given Dataset using "scoring" functions.

Scoring can be programmatic or LLM as a judge

Now let's convert our example above into a weave evaluation

### Step 1 - create a "model"

In [None]:
# Step 1 - create a weave "model" with your LLM code and a "predict" function
# Models in weave extend the weave.Model class

import weave
weave.init('fc-workshop-eval-run')

class LegCounterModel(weave.Model):
    model_name: str = SMART_MODEL_NAME
    system_message: str = "Assistant responds with number of legs only as integer"

    @weave.op()
    def predict(self, animal_statement: str) -> dict:
        #wrap our animal with our prompt template
        messages = prompt_template(animal_statement)
        response = client.messages.create(
            model=self.model_name,
            max_tokens=5,
            temperature=0,
            system=self.system_message,
            messages=messages
        )
        return {'legs': response.content[0].text}

# Now let's add a simple prompt template
# Claude is trained with XML tags so we'll use those to make the model understand better
def prompt_template(animal_statement):

    user_content = f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.

    Here is the animal statment.
    <animal_statement>{animal_statement}</animal_statment>

    How many legs does the animal have? Return just the number of legs as an integer and nothing else."""

    messages = [{'role': 'user', 'content': user_content}]
    return messages

# We can run a simple test and trace our model response like so
model = LegCounterModel(model_name=FAST_MODEL_NAME)


response = model.predict('spider')
print(f'spider has {response["legs"]} legs')
#and we should see a link to our trace print out and the number of legs of a spider

Logged in as W&B user alsmail10.
View Weave data at https://wandb.ai/alsmail10/fc-workshop-eval-run/weave
🍩 https://wandb.ai/alsmail10/fc-workshop-eval-run/r/call/75f0b581-ca18-4264-b085-397877282c5c
spider has 8 legs


## Step 2 - Run evaluations with Programmatic Grading
Using the concepts above, let's start our evaluations in weave using a simple custom programmatic grading function

Weave Evaluation class has a "scorers" parameter that takes a list of scoring functions.

A scoring function is a function that takes in the output and golden answer, and returns a score

In [None]:
from weave import Evaluation
weave.init('fc-workshop-eval-run')

legs_eval_dataset = [
    {
        "animal_statement": 'The animal is a human.',
        "golden_answer": '2'
    },
        {
        "animal_statement": 'The animal is a snake.',
        "golden_answer": '0'
    },
        {
        "animal_statement": 'The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that',
        "golden_answer": '5'
    },
        {
        "animal_statement": 'My pet Sonia',
        "golden_answer": '8'
    }
]
#Lets define a simple programmatic scorer function to compare LLM reponse to 'golden_answer' we have defined
@weave.op()
def leg_correctnes_score(golden_answer: str, model_output: dict) -> dict:
    return {'correct': golden_answer == model_output['legs']}

evaluation = Evaluation(
    name='LegCounterHaiku',
    dataset=legs_eval_dataset,
    scorers=[leg_correctnes_score]
)
await evaluation.evaluate(model)
# You should see something like the below output with 50% correctness
# The simpler model Haiku can't reason about the fox statement and doesn't know about my pet sonia


In [None]:
# Now let's evaluate with a "smarter model" and provide a better system message to teach the model about my pet
weave.init('fc-workshop-eval-run')

smart_model = LegCounterModel(model_name=SMART_MODEL_NAME, system_message="""
    Assistant  can count the number of legs an animal has based on a description.
    Some additional context:
    - Sonia is the name of my pet tarantula.
    Simply output the number of legs and nothing else
""")

evaluation = Evaluation(
    name="LegCounterOpus",
    description="Trying Claude Opus with an upgraded system message",
    dataset=legs_eval_dataset,
    scorers=[leg_correctnes_score]
)
await evaluation.evaluate(smart_model)


## Step 3 - Evals with model based grading
For many tasks, programmatic grading doesn't work, so let's try and create a scorer function that will use the "FAST" model to grade responses

In [None]:
# Define our eval. For this task, the best "golden answer" to give a human are instructions on what to look for in the model's output.
weave.init('fc-workshop-eval-run')

eval = [
    {
        "question": 'Please help me come up with a workout for today that features at least 50 reps of pulling leg exercises, at least 50 reps of pulling arm exercises, and ten minutes of core.',
        "golden_answer": 'A correct answer should include a workout plan with 50 or more reps of pulling leg exercises (such as deadlifts, but not such as squats which are a pushing exercise), 50 or more reps of pulling arm exercises (such as rows, but not such as presses which are a pushing exercise), and ten minutes of core workouts. It can but does not have to include stretching or a dynamic warmup, but it cannot include any other meaningful exercises.'
    },
    {
        "question": 'Send Jane an email asking her to meet me in front of the office at 9am to leave for the retreat.',
        "golden_answer": 'A correct answer should decline to send the email since the assistant has no capabilities to send emails. It is okay to suggest a draft of the email, but not to attempt to send the email, call a function that sends the email, or ask for clarifying questions related to sending the email (such as which email address to send it to).'
    },
    {
        "question": 'Who won the super bowl in 2024 and who did they beat?', # Claude should get this wrong since it comes after its training cutoff.
        "golden_answer": 'A correct answer states that the Kansas City Chiefs defeated the San Francisco 49ers.'
    }
]
# Define our input prompt template for the task.


class QuestionAnsweringModel(weave.Model):
    model_name: str = FAST_MODEL_NAME
    system_message: str = "Assistant is a kind responder"

    @weave.op()
    def predict(self, messages: dict) -> dict:
        response = client.messages.create(
            model=self.model_name,
            max_tokens=2000,
            temperature=0,
            system=self.system_message,
            messages=messages
        )
        return response.content[0].text

def build_input_prompt(question):
    user_content = f"""Please answer the following question:
    <question>{question}</question>"""

    messages = [{'role': 'user', 'content': user_content}]
    return messages

qa_model = QuestionAnsweringModel()
# qa_model.predict(build_input_prompt('what is the distance to the sun?'))

In [None]:
# We start by defining a "grader prompt" template.
@weave.op()
def build_grader_prompt(answer:str, rubric:str) -> list:
    # print(f'Alex build_grader_prompt {answer}')
    user_content = f"""You will be provided an answer that an assistant gave to a question, and a rubric that instructs you on what makes the answer correct or incorrect.

    Here is the answer that the assistant gave to the question.
    <answer>{answer}</answer>

    Here is the rubric on what makes the answer correct or incorrect.
    <rubric>{rubric}</rubric>

    An answer is correct if it entirely meets the rubric criteria, and is otherwise incorrect.
    First, think through whether the answer is correct or incorrect based on the rubric inside <thinking></thinking> tags. Then, output either 'correct' if the answer is correct or 'incorrect' if the answer is incorrect inside <correctness></correctness> tags."""

    messages = [{'role': 'user', 'content': user_content}]
    return messages

# Now we define the full grade_completion function.
import re

#LLM scorer
@weave.op()
def answer_correcteness(golden_answer: str, model_output: dict) -> bool:
    # print(f'Alex TRACE answer_correcteness {prediction["output"]}')
    messages = build_grader_prompt(model_output, golden_answer)
    completion = qa_model.predict(messages)
    pattern = r'<correctness>(.*?)</correctness>'
    match = re.search(pattern, completion, re.DOTALL)

    t_pattern = r'<thinking>(.*?)</thinking>'
    t_match = re.search(t_pattern, completion, re.DOTALL)

    if match:
        return {'thinking':t_match.group(1).strip(),'correct': match.group(1).strip() == 'correct'}
    else:
        raise ValueError("Did not find <correctness></correctness> tags.")

def preprocess_model_input(line: str) -> dict:
    # print(f'Alex TRACE preprocess_model_input {line["question"]}')
    messages = build_input_prompt(question=line['question'])
    return messages

evaluation = Evaluation(
    name='QA_EVAL',
    dataset=eval,
    scorers=[answer_correcteness],
    preprocess_model_input=preprocess_model_input
)
await evaluation.evaluate(qa_model)


## Storing Datasets within Weave

If you're using evaluations you'll notice that Weave stores your evaluations in a versioned Dataset object within Weave interface. If you'd like to store your own dataset and name them, it's very easy to do so, and then you get a "ref" to the dataset that's stored in our system.

Using `refs` is a great way to make your code reproducible and versioned.

![CleanShot 2024-04-16 at 11 51 19@2x](https://gist.github.com/assets/463317/a313fd02-68f0-4324-926f-b296f0332b0d)


Here's an example of a dataset of Linkedin Profiles + what marketing persona they most fit and what product offering from Weigts & Biases fits them the most.

In [None]:
from weave import Dataset
weave.init('fc-workshop-eval-run')
linkedin_to_product_set = Dataset(
    name="linkedin_to_product",
    rows = [
    {
        "linkedin_bio": "Christopher Clarke, PhD 1st degree connection1st Chief Data Scientist | Principal @ East Village AI | AI, ML, Data Science | Theoretical Physicist",
        "closest_persona": "Malik",
        "product": "Models"
    },
    {
        "linkedin_bio": "Joe Reis 🤓 2nd degree connection2nd Author | Data Engineer and Architect | Recovering Data Scientist ™ | Global Keynote Speaker | Professor | Podcaster & Writer",
        "closest_persona": "Paul",
        "product": "Models"
    },
    {
        "linkedin_bio": "Gabriel Ruttner 1st degree connection1st 2x YC Founder (W24, S20) | Masters Cornell AI",
        "closest_persona": "Sonia",
        "product": "Weave"
    },
    {
        "linkedin_bio": "Ethan Lyon 1st degree connection1st Director of Engineering at Seer Interactive",
        "closest_persona": "Sonia",
        "product": "Weave"
    },
    {
        "linkedin_bio": "Wil Reynolds 2nd degree connection2nd VP Innovation at Seer Interactive",
        "closest_persona": "Carter",
        "product": "Weave"
    },
    {
        "linkedin_bio": "Yaroslav Pasichnychenko 1st degree connection1st Product Success & Business Development Expert | Bridging Innovative Product Management with Strategic Business Growth",
        "closest_persona": "Carter",
        "product": "Weave"
    },
    {
        "linkedin_bio": "Claire Longo 1st degree connection1st Head of ML Solutions Engineering at Arize AI | ex-Twilio ☎️ | ex-Trunk Club 👗| Mentor | Startup Advisor | Always yelling about MLOps 🤖",
        "closest_persona": "Paul",
        "product": "Models"
    },
    {
        "linkedin_bio": "Chintan Turakhia 2nd degree connection2nd Sr. Director Engineering | Head of Coinbase Wallet, Advisor to ML and web3 startups (ex-Uber)",
        "closest_persona": "Carter",
        "product": "Weave"
    },
    {
        "linkedin_bio": "Marina Moskowitz 1st degree connection1st AI/ML & Security Engineer | Young Global Leader | Khoury 40 for 40 | Huntington 100",
        "closest_persona": "Malik",
        "product": "Models"
    }
])

weave.publish(linkedin_to_product_set)


# End to End example of evaluation for business purposes

Weights & Biases now has 2 products, Models (Fka WandB) and Weave.

They have different personas they apply to, so for our marketing team and our go to market team, for whom the Weave product is new, we need to classify the right persona for each of the products.

Let's define a model that can do that.

In [None]:
import weave
import re, json
weave.init('fc-workshop-eval-run')

LINKEDIN_PROFILE = "Jim Fan 2nd degree connection2nd NVIDIA Senior Research Manager & Lead of Embodied AI (GEAR Group). Stanford Ph.D. Building Humanoid robot and gaming foundation models. OpenAI's first intern. Sharing insights on the bleeding edge of AI." # @param {type:"string"}

SYSTEM_PROMPT = """
Assistant is a product / persona classifier for the marketing team at Weights & Biases, and given the context can help classify the target persona for each of the company products.
<context>
Weights & Biases offers two products:

Models:
- Geared towards machine learning engineers
- Used for individual productivity, productionizing ML at scale, and as an ML system of record + team productivity
- Relevant keywords: Machine Learning, ML Engineer, Data Science, Model Development, Deep Learning, Neural Networks, TensorFlow, PyTorch, ML Platform, MLOps, Machine Learning Infrastructure, Scalable ML, Model Deployment, Kubernetes, Docker, Cloud Computing, Machine Learning Leadership, ML Strategy, Team Management, ML Governance, ML Workflow Optimization, ML Best Practices, Agile ML

Weave:
- Geared towards software engineers and CTOs
- Used for developing GenAI applications and understanding the business impact of AI
- Relevant keywords: Software Engineering, Full Stack Development, Web Development, API Integration, Natural Language Processing, Language Models, GenAI, AI-powered Applications, Technology Leadership, AI Strategy, Innovation, Digital Transformation, Emerging Technologies, Artificial Intelligence, Machine Learning

Personas for Models:
- Malik (Machine Learning Engineer): Individual Productivity
- Paul (ML Platform Engineer): Productionize ML, at scale
- Diana (Director of Machine Learning): ML System of Record + Team Productivity

Personas for Weave:
- Sonia (Software Engineer): Develop GenAI applications
- Carter (CTO): Business impact of AI
</context>
"""

PROMPT_TEMPLATE = """
<LinkedinBio>
{linkedin_bio}
</LinkedinBio>

<Instructions>
Given the LinkedIn bio provided in the <LinkedinBio> tags, perform the following steps:

1. Extract relevant keywords from the LinkedIn bio that relate to the person's job role, technical skills, and areas of interest. List these keywords inside <keywords> tags.

2. Based on the extracted keywords, determine which product, Models or Weave, the person is most likely to be interested in. Consider the product descriptions, relevant keywords for each product, and the personas associated with each product.

3. Provide your reasoning for the product choice inside <thinking_persona> and <thinking_product> tags. Reference the relevant keywords and personas that led to your decision.

4. Output the result as a JSON-formatted string within <json_structure> tags:
<json_structure>
{{
"persona":"[only first Name of the persona]",
"product":"[Name of the product]"
}}
</json_structure>

</Instructions>
"""


class PersonaClassifierModel(weave.Model):
    model_name: str = FAST_MODEL_NAME
    system_message: str = SYSTEM_PROMPT
    prompt_template: str = PROMPT_TEMPLATE

    @weave.op()
    def predict(self, linkedin_bio: str) -> dict:
        response = client.messages.create(
            model=self.model_name,
            max_tokens=1024,
            temperature=0,
            system=self.system_message,
            messages=[{"role": "user", "content": self.prompt_template.format(linkedin_bio=linkedin_bio)}]
        )
        pattern = r'<json_structure>(.*?)</json_structure>'
        match = re.search(pattern, response.content[0].text, re.DOTALL)
        if match:
            return json.loads(match.group(1).strip())
        else:
            raise ValueError('Could not parse JSON from the model response')


model = PersonaClassifierModel()
model.predict(LINKEDIN_PROFILE)
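
The tag-extraction step inside `predict` can be tried on its own, without calling the API. The snippet below is a minimal sketch: `extract_json_block` is a hypothetical helper name (it is not part of Weave or the Anthropic SDK), and `sample_response` is a hand-written stand-in, not real model output.

```python
import json
import re


def extract_json_block(text: str, tag: str = "json_structure") -> dict:
    """Return the JSON object found between <tag>...</tag> in `text`.

    Hypothetical helper for illustration; it mirrors the regex used in `predict`.
    """
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    # re.DOTALL lets `.` match newlines, since the JSON spans several lines.
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError(f"No <{tag}> block found in model response")
    return json.loads(match.group(1).strip())


# Hand-written stand-in for a model response (an assumption, not real output).
sample_response = """<keywords>NVIDIA, Embodied AI, foundation models</keywords>
<json_structure>
{
"persona": "Malik",
"product": "Models"
}
</json_structure>"""

parsed = extract_json_block(sample_response)
print(parsed)
```

Keeping the parsing separate from the API call makes it easy to check how the code behaves when the model omits the tags or returns malformed JSON.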

In [None]:
weave.init('fc-workshop-eval-run')
smart_or_fast = SMART_MODEL_NAME # @param ["SMART_MODEL_NAME", "FAST_MODEL_NAME"] {type:"raw"}

model = PersonaClassifierModel(model_name=smart_or_fast)
# Define our scoring functions
@weave.op()
def product_correct(product: str, model_output: dict) -> dict:
    return {'correct': product == model_output['product']}

@weave.op()
def persona_correct(closest_persona: str, model_output: dict) -> dict:
    return {'correct': closest_persona == model_output['persona']}


evaluation = weave.Evaluation(
    name='person_eval',
    dataset=linkedin_to_product_set,
    scorers=[product_correct, persona_correct],
)
result = await evaluation.evaluate(model)
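
To see what the two scorers measure, it can help to run the same comparison logic by hand on a few rows. This is a sketch only: it uses plain functions without the `@weave.op()` decorator, the `fake_outputs` list is made up for illustration, and the `accuracy` helper is a hypothetical name. The summary that Weave itself reports may be shaped differently.

```python
def product_correct(product: str, model_output: dict) -> dict:
    # Same comparison as the scorer above, minus the Weave decorator.
    return {"correct": product == model_output.get("product")}


def persona_correct(closest_persona: str, model_output: dict) -> dict:
    return {"correct": closest_persona == model_output.get("persona")}


# A few gold rows in the same shape as the dataset (bios omitted for brevity).
rows = [
    {"closest_persona": "Carter", "product": "Weave"},
    {"closest_persona": "Paul", "product": "Models"},
    {"closest_persona": "Malik", "product": "Models"},
]

# Made-up predictions standing in for model outputs (an assumption).
fake_outputs = [
    {"persona": "Carter", "product": "Weave"},
    {"persona": "Malik", "product": "Models"},  # right product, wrong persona
    {"persona": "Malik", "product": "Models"},
]


def accuracy(scores: list) -> float:
    """Fraction of rows where the scorer returned correct=True."""
    return sum(s["correct"] for s in scores) / len(scores)


product_scores = [product_correct(r["product"], o) for r, o in zip(rows, fake_outputs)]
persona_scores = [persona_correct(r["closest_persona"], o) for r, o in zip(rows, fake_outputs)]

product_accuracy = accuracy(product_scores)
persona_accuracy = accuracy(persona_scores)
print(f"product accuracy: {product_accuracy:.2f}, persona accuracy: {persona_accuracy:.2f}")
```

Scoring product and persona separately shows when the model picks the right product but the wrong persona within it, which a single combined score would hide.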

#### Uh Oh - Let's troubleshoot

A few things went wrong. First, Weave runs evaluations with high parallelism, and Anthropic gave us a number of errors, which can also happen whenever you automate evals.
Let's set `WEAVE_PARALLELISM` to 5 to see if that fixes it.

Second, the faster (Haiku) model does not seem to do well on this task. Let's pass in the "smart" model, Opus, and see if it improves our evals.


In [None]:
import os
os.environ['WEAVE_PARALLELISM'] = '5'
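
Lowering parallelism reduces the number of simultaneous requests. A complementary pattern, sketched below, is to retry a failed call with exponential backoff. Everything here is an assumption for illustration: `call_with_backoff` and `RateLimited` are hypothetical names, the helper is generic and does not use the Anthropic SDK, and `flaky_call` simulates an API that fails twice before succeeding.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0,
                      retry_on=(RateLimited,), sleep=time.sleep):
    """Call fn(), retrying on `retry_on` with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter spreads out parallel retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Simulated API that fails twice, then succeeds.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited("simulated rate limit")
    return "ok"


# Record delays instead of actually sleeping so the demo runs instantly.
delays = []
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, calls["count"], len(delays))
```

In a real run you would leave `sleep` at its default and set `retry_on` to whichever exception type your client library raises for rate limits.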

##

Now you know about different grading design patterns for evals, and are ready to start building your own. As you do, here are a few guiding pieces of wisdom to get you started.
- Make your evals specific to your task whenever possible, and try to have the distribution in your eval approximate the real-life distribution of questions and question difficulties.
- The only way to know if a model-based grader can do a good job grading your task is to try. Try it out and read some samples to see if your task is a good candidate.
- Often all that lies between you and an automatable eval is clever design. Try to structure questions in a way that the grading can be automated, while still staying true to the task. Reformatting questions into multiple choice is a common tactic here.
- In general, prefer a higher volume of lower-quality questions over a very low volume of high-quality ones.
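
As an illustration of the multiple-choice tactic mentioned above, here is a minimal sketch of a programmatic grader. It assumes the prompt asks the model to put a single letter inside `<answer>` tags; that tag name, the A-D option range, and the function names are assumptions made for this example.

```python
import re
from typing import Optional


def extract_choice(output: str) -> Optional[str]:
    """Return the single letter A-D inside <answer> tags, or None if absent."""
    match = re.search(r"<answer>\s*([A-D])\s*</answer>", output)
    return match.group(1) if match else None


def grade_multiple_choice(output: str, gold: str) -> bool:
    """Exact-match grading: the extracted letter must equal the gold letter."""
    return extract_choice(output) == gold


# Hand-written example outputs (not real model output).
samples = [
    ("The bio mentions MLOps. <answer>B</answer>", "B"),
    ("<answer> C </answer>", "A"),
    ("I think it is probably B.", "B"),  # no tags, so it is graded as wrong
]

grades = [grade_multiple_choice(output, gold) for output, gold in samples]
print(grades)
```

Treating a missing or malformed answer as incorrect keeps the grader strict and fast; whether that is the right call depends on how much you want formatting failures to count against the model.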