# LLMs in production - Trace, Compile, Evals - by Weights & Biases
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tcapelle/llm-evals-workshop/blob/main/eval.ipynb) [![Weights & Biases](https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-gradient.svg)](https://wandb.ai/capecape/evals-workshop/weave/traces)




# Intro
This notebook accompanies a workshop that walks you through common patterns for building LLM evaluations, along with useful rules of thumb to follow when doing so with [W&B Weave](https://wandb.me/weave-workshop-jan).

We'll explore the following methodology for productizing robust LLM applications: 

![three](https://gist.github.com/user-attachments/assets/0d51de65-8ec7-4cc5-a102-5a13229f5531)


Make sure to set `WANDB_API_KEY` (get your key [here](https://wandb.ai/authorize)) and `OPENAI_API_KEY` (if you have one) as environment variables.

If you're running in Colab, set the variables in the keys section on the left. 
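Before running the rest of the notebook, it can help to check which keys are actually set. The snippet below is a minimal sketch using only the standard library; the helper name `missing_keys` and the `fake_env` dictionary are illustrative and not part of the workshop code.

```python
import os

REQUIRED_KEYS = ["WANDB_API_KEY"]    # needed to log traces to Weave
OPTIONAL_KEYS = ["OPENAI_API_KEY"]   # needed only for the cells that call OpenAI


def missing_keys(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demonstration on a fake environment, so no real secrets are involved
fake_env = {"WANDB_API_KEY": "abc123", "OPENAI_API_KEY": ""}
print("Missing:", missing_keys(REQUIRED_KEYS + OPTIONAL_KEYS, fake_env))
```

Calling `missing_keys(REQUIRED_KEYS)` with no second argument checks the real environment of the running kernel.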

Prepared by [Alex Volkov](https://twitter.com/altryne) and [Thomas](https://tcapelle.github.io/socials)

In [None]:
# Install and read in required packages
try:
    import google.colab
    !git clone -q --branch main https://github.com/tcapelle/llm-evals-workshop
    %cd llm-evals-workshop
except ImportError:
    pass

print('⏳ Installing packages')
%pip install -q uv
!uv pip install -q --system weave gradio set-env-colab-kaggle-dotenv tqdm ipywidgets requests openai pillow litellm
print('✅ Packages installed')

Set up Weave and the LLM

In [1]:
%load_ext gradio

from set_env import set_env
from openai import OpenAI
from dotenv import load_dotenv
import weave

load_dotenv()
set_env("WANDB_API_KEY")
set_env("OPENAI_API_KEY")

# initialize weave
weave_api = weave.init('evals-workshop')

Logged in as Weights & Biases user: capecape.
View Weave data at https://wandb.ai/capecape/evals-workshop/weave


In [2]:
# Initialize our LLM client
from litellm import completion

model = "openai/gpt-4o"

# 1. Tracing LLM calls with Weave

#### Why Tracing is Important for LLM Application Reliability

In building reliable LLM-based applications, having a clear view into
how your system behaves is crucial. That’s where “tracing” comes in.

1. **Detailed Interaction Records**:
   Tracing captures all the inputs, prompts, responses, and any user feedback.
   By preserving this detailed record, you always have the context needed to
   debug unexpected or incorrect results.

2. **Rapid Issue Diagnosis**:
   With thorough traces, you can pinpoint issues faster—often without
   needing direct access to remote systems. Simply reviewing the logs can
   reveal how a certain response was triggered.

3. **Collaboration and Sharing**:
   Traces can be shared with both technical and non-technical stakeholders.
   This not only streamlines collaboration but also ensures everyone is
   working off the same “source of truth” when investigating bugs
   or brainstorming improvements.

4. **Outlier Spotting and Performance Tuning**:
   By tracking calls at scale, you can detect when responses deviate
   dramatically from the norm, troubleshoot any failures, and identify
   potential performance bottlenecks.

5. **Facilitates Product Evolution**:
   As you enhance or expand your LLM application, comprehensive
   tracing data helps you make more informed decisions about what to
   improve, remove, or refine.

With W&B Weave, comprehensive tracing is just 1 line of code, and offers features such as:
- Syntax highlighting specific to your use-case (Markdown, JSON, etc.)
- Ability to share links with other members of your team
- Ability to filter traces by function name, input, output, etc.
- Tracking latency, token count and cost per call (and trends)
- Code associated with the LLM call, and its versioning
- Ability to add metadata per trace

If you need to instrument existing code, you can use the `@weave.op` decorator to trace the function.  

![CleanShot 2024-04-08 at 14 15 40@2x](https://gist.github.com/assets/463317/4e9ada49-572f-47d9-91e1-55ab72b2a476)
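To make the idea of a trace concrete, here is a minimal sketch of the kind of record a tracing decorator can capture. It is not Weave's implementation: `traced` and `TRACE_LOG` are hypothetical names, and the records stay in a local list, whereas Weave sends them to your project.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory store, for illustration only


def traced(fn):
    """Record inputs, output or error, and latency for every call to `fn`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"op": fn.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        start = time.perf_counter()
        try:
            record["output"] = fn(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call leaves a record
            record["latency_s"] = time.perf_counter() - start
            TRACE_LOG.append(record)
    return wrapper


@traced
def shout(text: str) -> str:
    return text.upper()


shout("hello")
print(TRACE_LOG[-1]["op"], TRACE_LOG[-1]["output"])
```

The same shape (operation name, inputs, output, latency) is what makes traces filterable and comparable later on.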

In [3]:
@weave.op  # 💪
def analyze_post_sentiment(avatar, displayName, text):
    # Prompt for OpenAI to analyze the sentiment

    prompt = f"""Analyze the following Bluesky post and determine if the author is a:
    - DOOMER (someone who hates AI and uses derogatory language)
    - BOOMER (someone who doesn't understand AI and asks to remove their data)
    - NEITHER (neutral or positive response)
    
    Post: {displayName}: "{text}"
    
    Respond with just one word (DOOMER, BOOMER, or NEITHER) followed by a brief explanation.
    """
    
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5
    )
    
    # The call ID is only available when this function runs inside a Weave-traced call
    try:
        current_call = weave.require_current_call()
        weave_call_id = current_call.id
    except Exception:
        weave_call_id = None
    
    return {
        "llm_classification": response.choices[0].message.content,
        "weave_call_id": weave_call_id
    }

In [4]:
# Let's test this out on a single post first
response_dict = analyze_post_sentiment("","Alex","I hate AI")

print(response_dict)

🍩 https://wandb.ai/capecape/evals-workshop/r/call/019490e8-9d9c-7663-9bc3-816e81e11bc9
{'llm_classification': "DOOMER: The post explicitly states hatred towards AI, which aligns with a DOOMER's negative attitude and derogatory stance towards AI.", 'weave_call_id': '019490e8-9d9c-7663-9bc3-816e81e11bc9'}


Even without `@weave.op`, once Weave is initialized it automatically traces calls made through supported LLM client libraries (such as the OpenAI client) and stores them in the Weave project. Adding `@weave.op` to our own function, as we did above, gives us even more detail: it captures the function's inputs and outputs, and gives us a call ID that we can attach feedback to later.
 
Tracing becomes even more useful when you have a lot of nested calls, such as a multi-step chat conversation, a RAG system with retrieval, or an agentic system with multiple steps.

![text](https://cln.sh/Sc8ZtrdM+)

[Here's a great example](https://wandb.ai/wandb-designers/winston/weave/traces?cols=%7B%22attributes.weave.client_version%22%3Afalse%2C%22attributes.weave.os_name%22%3Afalse%2C%22attributes.weave.os_release%22%3Afalse%2C%22attributes.weave.os_version%22%3Afalse%2C%22attributes.weave.source%22%3Afalse%2C%22attributes.weave.sys_version%22%3Afalse%7D&peekPath=%2Fwandb-designers%2Fwinston%2Fcalls%2F0193ff3f-54d7-73a3-8004-0a582a594307%3Fpath%3Dwinston-solve*0%2Bvincent-execute*0%26tracetree%3D1) of a more complex traced setup from Winston, our internal agent system, with multiple tool selections, retrieval steps, and more.


# 2. User Feedback & Annotations

Collecting user feedback is a crucial way to improve your LLM applications. There's a reason that every chatbot you use has 👍/👎 and a text box to leave feedback. This is one of the best ways for those labs to understand and improve their models and align them to user preferences.

![text](https://cln.sh/JGMBxMtH+)

Users don't have to be external, either. As you develop your application, marking traces as "good" or "bad" and noting why is a great way to kick-start your initial evaluation dataset with working and non-working examples. 

Additionally, after you log hundreds of thousands of traces, they all start to look the same, so additional context such as user feedback greatly improves your ability to explore your data and find the outliers.

Weave supports collecting user feedback both in the UI and via the API, so you can collect it from your users and also leave it yourself while looking at your data. 

![text](https://cln.sh/X6fFHD8t+)

Read more about feedback [here](https://weave-docs.wandb.ai/guides/tracking/feedback)


In [5]:
# @title { display-mode: "form" }
import os
import gradio as gr
from PIL import Image
import requests 
import io
import json
from jinja2 import Environment, FileSystemLoader
from datetime import datetime
import random
import weave
from weave.flow.annotation_spec import AnnotationSpec

# initialize annotations for this project
annotation = weave.publish(AnnotationSpec(
    name="Doomer or Boomer",
    description="Doomer or Boomer or Neither",
    field_schema={ "type": "string", "enum": ["Doomer", "Boomer", "Neither"],},
), "doomer_or_boomer")

annotation_reason = weave.publish(AnnotationSpec(
    name="Reason",
    description="Reason why you chose this value, write before clicking.",
    field_schema={ "type": "string"},
), "reason")


# cell 2
# Load the Jinja2 environment
env = Environment(loader=FileSystemLoader('templates'))
template = env.get_template('post.html.jinja')

# Load replies data
def load_replies():
    replies = []
    # Load replies from both files
    with open('data/replies_alpin.json', 'r') as f:
        data = json.load(f)
        replies.extend(data['thread']['replies'])
    with open('data/replies_daniel.json', 'r') as f:
        data = json.load(f)
        replies.extend(data['thread']['replies'])
    return replies


def get_random_post_and_analyze():
    replies = load_replies()
    post = random.choice(replies)
    
    # Format the post data for the template
    created_at = datetime.fromisoformat(post['post']['record']['createdAt'].replace('Z', '+00:00'))
    formatted_date = created_at.strftime('%b %d, %Y, %I:%M %p')
    
    # Convert AT URI to bsky.app URL
    at_uri = post['post']['uri']
    _, _, author_did, _, post_id = at_uri.split('/')
    post_url = f"https://bsky.app/profile/{post['post']['author']['handle']}/post/{post_id}"
    
    # Analyze the post
    # Download the avatar (if the author has one) and convert it to a PIL image
    avatar_uri = post['post']['author'].get('avatar')
    avatar_pil = None
    if avatar_uri:
        avatar_response = requests.get(avatar_uri, timeout=10)
        avatar_pil = Image.open(io.BytesIO(avatar_response.content))

    response_dict = analyze_post_sentiment(avatar_pil, post['post']['author']['displayName'], post['post']['record']['text'])
    analysis = response_dict['llm_classification']
    weave_call_id = response_dict['weave_call_id']
    
    post_data = {
        'author': post['post']['author'],
        'created_at': formatted_date,
        'text': post['post']['record']['text'],
        'like_count': post['post'].get('likeCount', 0),
        'repost_count': post['post'].get('repostCount', 0),
        'has_image': False,
        'post_url': post_url
    }
    
    return template.render(**post_data), analysis, weave_call_id, ''


def submit_feedback(user_selection, reason, weave_call_id):
    """
    Example function that could send user feedback (the user_selection)
    and the weave_call_id to your Weave (or any other) API.
    """
    # Check the ID before querying Weave: it is None when the function was not traced
    if not weave_call_id:
        raise Exception('No Weave call ID found, have you tried adding @weave.op to the analyze_post_sentiment function?')

    call = weave_api.get_call(weave_call_id)
    
    if reason:
        reason_resp = weave_api.server.feedback_create(
            {
            "project_id": weave_api._project_id(),
            "weave_ref": call.ref.uri(),
            "feedback_type": "wandb.annotation.reason",
            "annotation_ref": annotation_reason.uri(),
            "payload": {"value": reason},
            }
        )

    resp = weave_api.server.feedback_create(
        {
            "project_id": weave_api._project_id(),
            "weave_ref": call.ref.uri(),
            "feedback_type": "wandb.annotation.doomer_or_boomer",
            "annotation_ref": annotation.uri(),
            "payload": {"value": user_selection},
        }
    )
    
    # Ready to analyze the next post
    return get_random_post_and_analyze()


📦 Published to https://wandb.ai/capecape/evals-workshop/weave/objects/doomer_or_boomer/versions/MfppDkza1qvK772eNZWIU1XwwZbtwGQ8UQWWEcyZlfc
📦 Published to https://wandb.ai/capecape/evals-workshop/weave/objects/reason/versions/Z3Do6YnUa9YHEuELfGyZtJt7JTkgb30oVBv04U4HWyc




# 2.1 Doomer or Boomer App - Annotations by example

Unlike free-form user feedback, annotations are a more structured way to classify responses, and they help create a dataset of golden answers together with the reasons or rationales behind them. Many large companies pay data-labeling vendors such as Scale AI a LOT of money for this, but you don't have to right away: you can start small, by yourself or with your team. 

Let's see how we can kickstart a simple dataset of annotations by a practical example.

![image](https://gist.github.com/user-attachments/assets/a8537545-e070-4c8e-9988-2a8a905b9d2c)

To simulate a real world scenario, we'll build a simple app that will allow you to annotate a few posts. 

In our case, we're pretending to work at a company that's trying to build an AI classifier for Bluesky posts. We're humans that work in the company and are helping it to align and finetune models for AI moderation. 

We've compiled replies from Bluesky users to 2 posts whose authors collected publicly available Bluesky data to train AI models (Bluesky data is public), which drew a lot of hate from users on Bluesky. 

We're going to build a simple app that will use an LLM to classify the replies into 3 categories: `Doomer`, `Boomer`, or `Neither`. 

`Doomer`: Someone who hates AI, and uses derogatory language towards the author of the post because of their hate for AI and their data being used for AI  
`Boomer`: Someone who doesn't understand AI, and copy-pastes a request to remove their data from the dataset  
`Neither`: Folks who reply neutral or positive to the post.

At first, our LLM won't have enough context about the task to classify the replies reliably, so a human is needed to annotate them with additional context. You are that human. 

Launch the app and go through a few posts, annotating each with the correct classification and a reason for your choice. We'll later use this data to align/finetune our LLM to classify the replies more accurately and reliably.

In [8]:
# %%blocks
# TODO - Launch the Gradio app and annotate 10-20 examples according to the rules
os.environ['WEAVE_PRINT_CALL_LINK'] = 'false'
with gr.Blocks(theme=gr.themes.Soft()) as demo:
    # Add a title and description
    gr.Markdown("""
    # 🦋 Doomer or Boomer
    Our AI analyzes bluesky replies and posts to determine if the author is a doomer or a boomer.  
    Source of data: Replies to a post by a BlueSky user that compiled a dataset of posts, which went viral and generated a lot of hate on BlueSky.  
    These are replies and comments on 2 posts that collected a dataset of posts of BlueSky users to train AI models (BlueSky data is public)
    """)
    
    with gr.Row():
        with gr.Column(scale=2):
            post_html = gr.HTML()
            next_post_btn = gr.Button("Skip Post & Analyze Another", variant="primary")
            gr.Markdown(f"""
            #### Instructions for labeler: 
            `Doomer`: Someone who hates AI, and uses derogatory language towards the author of the post because of their hate for AI and their data being used for AI  
            `Boomer`: Someone who doesn't understand AI, and copy-pastes a request to remove their data from the dataset  
            `Neither`: Folks who reply neutral or positive to the post.
            
            See your Weave project & traces [here](https://wandb.ai/{weave_api._project_id()})
            """)
        
        with gr.Column(scale=1):
            analysis_output = gr.Textbox(
                label="Analysis Results",
                placeholder="Analysis will appear here...",
                lines=4
            )
            weave_call_id_state = gr.State()
            
            # Replace dropdown with three buttons
            reason_input = gr.Textbox(label="Add reason and click",placeholder="Reason why you chose this value, write before clicking.", lines=2)
            with gr.Row():
                doomer_btn = gr.Button("Doomer 😡", variant="huggingface")
                boomer_btn = gr.Button("Boomer 👵", variant="primary")
                neither_btn = gr.Button("Neither 🤷")

            
    # Set up event handler for combined next/analyze
    next_post_btn.click(fn=get_random_post_and_analyze, outputs=[post_html, analysis_output, weave_call_id_state, reason_input])
    
    doomer_btn.click(
    fn=submit_feedback,
    inputs=[gr.State("Doomer"), reason_input, weave_call_id_state],
    outputs=[post_html, analysis_output, weave_call_id_state, reason_input]
    )
    boomer_btn.click(
        fn=submit_feedback,
        inputs=[gr.State("Boomer"), reason_input, weave_call_id_state],
        outputs=[post_html, analysis_output, weave_call_id_state, reason_input]
    )
    neither_btn.click(
        fn=submit_feedback,
        inputs=[gr.State("Neither"), reason_input, weave_call_id_state],
        outputs=[post_html, analysis_output, weave_call_id_state, reason_input]
    )

    
    # Initialize with first post and analysis
    post_html.value, analysis_output.value, weave_call_id_state.value, reason_input.value = get_random_post_and_analyze()

demo.launch()

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.




## 2.1 Building a dataset from annotated calls

Now that we've annotated at least 10-20 examples, we can build our first evaluation dataset! 

![text](https://cln.sh/dyBq4QXD+)

Step 1: In the Weave UI, filter the calls down to those with non-empty annotations

Step 2: Use the Export -> Use Python button to get code to extract a list of filtered annotated calls

Step 3: Convert the calls to a clean evaluation dataset (and optionally publish to Weave)



In [6]:
@weave.op
def get_annotated_calls():
   # Weave API call to get all calls filtered by annotations not empty (with reasons)
   resp = weave_api.server.calls_query_stream({
      "project_id": weave_api._project_id(),
      "filter": {"op_names": [f"weave:///{weave_api._project_id()}/op/analyze_post_sentiment:*"]},
      "query": {"$expr":{"$and":[{"$not":[{"$eq":[{"$getField":"feedback.[wandb.annotation.doomer_or_boomer].payload.value"},{"$literal":""}]}]},{"$not":[{"$eq":[{"$getField":"feedback.[wandb.annotation.reason].payload.value"},{"$literal":""}]}]}]}},
      "sort_by": [{"field":"started_at","direction":"desc"}],
      "include_feedback": True,
   })

   # Iterate over the calls, clean up and publish as a dataset we can version and reference later.
   list_of_calls = []
   dataset = []
   for call in resp:
      try:
         row = {}
         call_dict = dict(call)
         row["input"] = call_dict.get('inputs').get('text')
         row["displayName"] = call_dict.get('inputs').get('displayName')
         row["llm_classification"] = call_dict.get('output').get('llm_classification')
         list_of_feedback = call_dict.get('summary').get('weave').get('feedback')
         for feedback in list_of_feedback:
            if feedback.get("feedback_type") == 'wandb.annotation.doomer_or_boomer':
               row["human_annotation"] = feedback.get('payload').get('value')
            if feedback.get("feedback_type") == 'wandb.annotation.reason':
               row["reason"] = feedback.get('payload').get('value')
      except Exception as e:
        continue
      
      dataset.append(row)

   return dataset


In [7]:
dataset = get_annotated_calls()
print(len(dataset))

weave.publish(weave.Dataset(name="doomer_or_boomer_dataset", rows=dataset))

🍩 https://wandb.ai/capecape/evals-workshop/r/call/019490e9-3fe4-7b10-b67d-a11d884e397f
16
📦 Published to https://wandb.ai/capecape/evals-workshop/weave/objects/doomer_or_boomer_dataset/versions/AjXt0JMXffKov9O8MUU2oyr2oVTLYTjIQ9XEva0ZEus


ObjectRef(entity='capecape', project='evals-workshop', name='doomer_or_boomer_dataset', _digest='AjXt0JMXffKov9O8MUU2oyr2oVTLYTjIQ9XEva0ZEus', _extra=())

## 2.2 Storing Datasets within Weave

If you'd like to store and name your own datasets, it's very easy to do so, and you get back a "ref" to the dataset that's stored in our system. Weave datasets are versioned, which means you can reference them in your code by a URL or a ref, and point either to the latest version or to a specific version. 

Using `refs` is a great way to make your code reproducible and versioned.
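To see what a ref encodes, here is a small sketch that splits one into its parts. It is based only on the shape of the refs that appear in this notebook (`weave:///entity/project/kind/name:version`) and is not Weave's own parser; `ParsedRef` and `parse_weave_ref` are hypothetical helpers. In real code, `weave.ref(...).get()` is the way to resolve a ref.

```python
from typing import NamedTuple


class ParsedRef(NamedTuple):
    entity: str
    project: str
    kind: str
    name: str
    version: str


def parse_weave_ref(ref: str) -> ParsedRef:
    """Split a ref of the form weave:///entity/project/kind/name:version."""
    prefix = "weave:///"
    if not ref.startswith(prefix):
        raise ValueError(f"not a weave ref: {ref!r}")
    entity, project, kind, name_and_version = ref[len(prefix):].split("/", 3)
    name, _, version = name_and_version.partition(":")
    return ParsedRef(entity, project, kind, name, version)


ref = "weave:///capecape/evals-workshop/object/doomer_or_boomer_dataset:AjXt0JMXffKov9O8MUU2oyr2oVTLYTjIQ9XEva0ZEus"
parsed = parse_weave_ref(ref)
print(parsed.name, parsed.version)
```

Pinning the version part in your code is what makes an evaluation reproducible: the same ref always resolves to the same rows.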

![CleanShot 2025-01-07 at 16 12 35@2x](https://gist.github.com/user-attachments/assets/e2d02340-cc0f-41e8-8d97-957b08611d08)


Here's an example of the dataset we just created, and how we can reuse it in our evaluations.

In [8]:
# TODO 5: replace this dataset with your own ref using the dataset link above and looking at the "use" tab
weave_ref = "weave:///capecape/evals-workshop/object/doomer_or_boomer_dataset:AjXt0JMXffKov9O8MUU2oyr2oVTLYTjIQ9XEva0ZEus"
doomer_or_boomer_dataset = weave.ref(weave_ref).get()

import pandas as pd
df = pd.DataFrame(doomer_or_boomer_dataset.rows)
df.head(20)

| | input | displayName | llm_classification | reason | human_annotation |
|---|---|---|---|---|---|
| 0 | I request that any of my data that is containe... | Kathryn Tewson | BOOMER - The author is requesting the removal ... | fear of data | Boomer |
| 1 | "Researchers" | Ryan | DOOMER: The use of quotation marks around "Res... | hate against the researcher | Doomer |
| 2 | Do not waiver | phi | NEITHER\n\nThe post "phi: 'Do not waiver'" doe... | nothing | Neither |
| 3 | lmao | wireless_anon | NEITHER\n\nThe post "lmao" is a neutral expres... | nothing | Neither |
| 4 | I hope you become resistant to Adderall | Bro Laren | NEITHER\n\nThe post does not contain any expli... | nothing | Neither |
| 5 | This is garbage, you're a thief, and you disgu... | Hank Single | DOOMER\n\nThe post contains derogatory languag... | Hate to AI | Doomer |
| 6 | If you’ve got any of my data then you should r... | Sanitary Naptime | BOOMER: The post expresses a desire to have pe... | fear of data and hate | Boomer |
| 7 | I request that any and all of my data (or ment... | Chip | BOOMER\n\nExplanation: The author is requestin... | fear of data | Boomer |
| 8 | Hi I do not consent for my posts or content to... | Ness (they/them) | NEITHER\n\nThe post expresses a clear request ... | fear of data | Boomer |
| 9 | Didn’t we all leave twitter in part because of... | Your Favorite Southern Belle Stoner Mom | NEITHER\n\nThe post expresses frustration with... | frustration | Neither |


# 3. Evaluations
### Components of an Evaluation

Evaluations generally consist of four key elements:
- An **input prompt** that serves as the basis for the model's completion. This prompt often includes a set of variable inputs that are inserted into a prompt template during testing.
- The **output** generated by the model in response to the input prompt.
- A **"gold standard" answer** used as a reference for assessing the model's output. This can be an exact match that the output must replicate, or an exemplary answer that provides a benchmark for scoring.
- A **score**, determined by one of the scoring approaches outlined below, which indicates the model's performance on the question.
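The four elements map naturally onto a small data structure. The sketch below is illustrative only: `EvalExample`, `exact_match_score`, and the two toy examples are made up for this explanation and are not taken from the workshop dataset.

```python
from dataclasses import dataclass


@dataclass
class EvalExample:
    prompt_input: str  # variable input that gets inserted into the prompt template
    output: str        # what the model returned for that input
    gold: str          # "gold standard" reference answer


def exact_match_score(example: EvalExample) -> float:
    """Score 1.0 when the output equals the gold answer (ignoring case and spaces)."""
    same = example.output.strip().lower() == example.gold.strip().lower()
    return 1.0 if same else 0.0


# Toy examples, invented for illustration
examples = [
    EvalExample(prompt_input="I hate AI", output="Doomer", gold="Doomer"),
    EvalExample(prompt_input="lmao", output="Doomer", gold="Neither"),
]

scores = [exact_match_score(e) for e in examples]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

Whatever scoring approach you choose, it replaces only the `exact_match_score` step: the input, output, and gold answer stay the same.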

#TODO 6: Look at the dataset and try to identify the input, output, and gold standard in each row

## Evaluation Grading Approaches
Evaluations can be time-consuming and costly in two main areas: creating questions and gold standard answers, and the scoring/grading process itself.  
Developing questions and ideal answers is often a one-time fixed cost, albeit potentially time-intensive if a suitable dataset is not readily available (consider leveraging an LLM to generate questions!). However, scoring is a recurring expense incurred each time the evaluation is conducted, which is likely to be frequent. Therefore, designing evaluations that can be scored efficiently and economically should be a central priority.

![](https://gist.github.com/assets/463317/e970bb03-9552-4712-ba12-727b89928e3b)

There are three primary methods for grading (scoring) evaluations:  
- **Programmatic:** This approach involves using standard code (primarily string matching and regular expressions) to assess the model's outputs. Common techniques include checking for an exact match against an answer or verifying the presence of key phrase(s) in a string. Programmatic scoring is the best method when feasible, as it is extremely fast and highly reliable. However, not all evaluations are amenable to this style of scoring. 
  - Goes great with structured output - validate against an enum
  - Code generation output - does it run, is valid, does it compile? 
  - Tool use validation - do the tools exist? 
- **Human in the loop:** In this approach, a human reviewer examines the model-generated answer, compares it to the gold standard, and assigns a score. While manual scoring is the most versatile method, applicable to nearly any task, it is also exceptionally slow and costly, especially for large-scale evaluations. Designing evaluations that necessitate manual scoring should be avoided whenever possible.
  - Domain specific & expert information
  - Sensitive topics
- **Model-based scoring, AKA LLM as a judge:** LLMs (especially Claude, GPT-4o, Gemini) are really good at grading their own outputs (or the outputs of other LLMs), especially across a wide range of tasks that traditionally needed human judgement, such as tone in creative writing, accuracy in open-ended questions, or classification. Model-based scoring is accomplished by writing a _scorer prompt_ for an LLM.
  - Open ended style questions
  - Classification & Translation 
  - Instruction following
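The three programmatic checks listed under the first bullet can each be written in a few lines of standard-library Python. This is a rough sketch: the function names and the `AVAILABLE_TOOLS` set are hypothetical, and "does it compile" is approximated here by checking that Python source parses.

```python
import ast

ALLOWED_LABELS = {"DOOMER", "BOOMER", "NEITHER"}      # the enum used in this notebook
AVAILABLE_TOOLS = {"search", "calculator"}            # hypothetical tool registry


def label_is_valid(label: str) -> bool:
    """Structured output: is the label one of the allowed enum values?"""
    return label.strip().upper() in ALLOWED_LABELS


def python_code_parses(source: str) -> bool:
    """Code generation: is the generated Python at least syntactically valid?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def tools_exist(requested: list[str]) -> bool:
    """Tool use: did the model only ask for tools that really exist?"""
    return all(name in AVAILABLE_TOOLS for name in requested)


print(label_is_valid("Doomer"), label_is_valid("Zoomer"))
print(python_code_parses("x = 1 + 1"), python_code_parses("x = = 1"))
print(tools_exist(["search"]), tools_exist(["search", "browser"]))
```

Checks like these are cheap enough to run on every output, which is why they are worth reaching for before human or model-based grading.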

Let's explore an example of each

## 3.1 Programmatic scoring 

Here we have a simple programmatic eval that will try and check if the LLM had the right answer.

In [10]:
## Create a programmatic scorer that will compare the ground truth to the LLM answer and check if it is correct
os.environ['WEAVE_PRINT_CALL_LINK'] = 'true'
import weave
from weave import Evaluation

# def string_match(output: str, human_annotation: str):
#     # check if the model output is exactly the same as human_annotation (Doomer, Boomer, Neither)
#     # we expect this evaluation to fail because the LLM talks a lot and never returns just the label
#     if not output or not human_annotation:
#         raise ValueError("Model output or human annotation is empty")
#     return {"match": output == human_annotation}


def string_match(output: str, human_annotation: str):
    # check if model_output includes the human_annotation only once 
    if human_annotation.lower() in output.lower():
        #possible match, now lets check if the model_output includes any of the other options but not the human_annotation
        for option in ["doomer", "boomer", "neither"]:
            if option.lower() in output.lower() and option.lower() != human_annotation.lower():
                return {"match": False}
        return {"match": True}
    return {"match": False}

evaluation = Evaluation(
    dataset=doomer_or_boomer_dataset, scorers=[string_match]
)

@weave.op
def function_to_evaluate(input: str):
    # here's where you would add your LLM call and return the output
    # since we already called the LLM, we can just iterate over the dataset 
    # and return the llm_classification where the question is the same
    row = [row for row in doomer_or_boomer_dataset.rows if row['input'] == input]
    return row[0].get('llm_classification')

await evaluation.evaluate(function_to_evaluate)

🍩 https://wandb.ai/capecape/evals-workshop/r/call/019490f1-9ff2-7c50-9cd1-dcb52fcd530c


{'string_match': {'match': {'true_count': 8, 'true_fraction': 0.5}},
 'model_latency': {'mean': 0.008246973156929016}}

### 3.1.1 Structured outputs with programmatic scorers

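With the exact-match scorer (commented out above), this example would likely give us a score of 0, because LLMs like to talk, and comparing verbose text via a simple string match is not going to work. Even the more lenient substring check only matched half of the rows. 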
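To illustrate why exact matching breaks on verbose output, here is a small sketch that compares it with pulling out the first label the model mentions. The helper `first_label` is hypothetical, and the regular expression is just one possible heuristic; it would still be fooled by an explanation that names another label before the real one.

```python
import re

LABEL_RE = re.compile(r"\b(doomer|boomer|neither)\b", re.IGNORECASE)


def first_label(text: str):
    """Return the first label mentioned in `text`, normalized, or None if absent."""
    match = LABEL_RE.search(text)
    return match.group(1).capitalize() if match else None


verbose = "DOOMER: The post explicitly states hatred towards AI."

print(verbose == "Doomer")   # exact match fails on verbose output
print(first_label(verbose))  # the leading label can still be recovered
```

Heuristics like this help, but they remain brittle, which is the motivation for asking the model for structured output instead.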

Programmatic scorers work great when we have structured outputs and we know exactly what to expect from LLMs. Let's recreate our LLM calls for the same questions with structured outputs so we can compare the LLM output directly to the human annotation and see if we can get a better score.

In [11]:
import os
# os.environ['WEAVE_PARALLELISM'] = '5'  # uncomment this to hit the endpoint slower (default is 20)
os.environ['WEAVE_PRINT_CALL_LINK'] = 'true'

from typing import Literal
from pydantic import BaseModel, Field

class DoomerOrBoomer(BaseModel):
    classification: Literal["DOOMER", "BOOMER", "NEITHER"] = Field(description="The classification of the post, either DOOMER, BOOMER, or NEITHER", 
                                                                   example="DOOMER")
    reason: str = Field(description="The reason for the classification, a short explanation of why the post is classified like this.")

@weave.op
def with_structured_llm_call(input: str, displayName: str) -> DoomerOrBoomer:
    prompt = f"""Analyze the following Bluesky post and determine if the author is a:
    - DOOMER (someone who hates AI and uses derogatory language)
    - BOOMER (someone who doesn't understand AI and asks to remove their data)
    - NEITHER (neutral or positive response)
    Text to Classify: 
    \n\n {displayName}: "{input}"
    """
    
    response = completion(
        model=model,
        messages=[
            {"role": "user", "content": prompt}],
        temperature=0.5,
        response_format=DoomerOrBoomer
    )
    json_out = response.choices[0].message.content
    return DoomerOrBoomer.model_validate_json(json_out)


In [12]:
out = with_structured_llm_call("I'm a doomer because I hate AI", "tcapelle")
out.model_dump()

🍩 https://wandb.ai/capecape/evals-workshop/r/call/019490f1-faa8-7f83-9200-efea80cfedc8


{'classification': 'DOOMER',
 'reason': 'The author explicitly states they are a doomer and expresses hatred towards AI, which aligns with the definition of a doomer.'}

In [13]:
def string_match(output: DoomerOrBoomer, human_annotation: str):
    # check if the structured classification equals the human_annotation (Doomer, Boomer, Neither), ignoring case
    if not output:
        raise ValueError("Model output is empty")
    
    return {"match": output.classification.lower() == human_annotation.lower()}

new_evaluation = Evaluation(
    dataset=doomer_or_boomer_dataset, scorers=[string_match]
)

await new_evaluation.evaluate(with_structured_llm_call)

🍩 https://wandb.ai/capecape/evals-workshop/r/call/019490f2-8a3e-72d3-998f-49900727d68a


{'string_match': {'match': {'true_count': 12, 'true_fraction': 0.75}},
 'model_latency': {'mean': 13.258970081806183}}

In [14]:
# @title { display-mode: "form" }

calls = weave_api.get_calls(
    filter={"op_names": ["weave:///capecape/evals-workshop/op/Evaluation.predict_and_score:UQ7QVKZpY8NfmEYwrnmhiKDBKV0lppySqBZI4XwxfXg"],"parent_ids": ["019490f2-8a3e-72d3-998f-49900727d68a"]},
    sort_by=[{"field":"started_at","direction":"desc"}],)

structured_output_doomer_or_boomer_dataset = []

for c in calls:
    new_row = dict(c.inputs["example"]).copy()
    new_row["structured_llm_classification"] = c.output["output"].classification
    new_row["structured_llm_reason"] = c.output["output"].reason
    structured_output_doomer_or_boomer_dataset.append(new_row)

weave.publish(weave.Dataset(name="doomer_or_boomer_dataset_with_structured_output", 
                            rows=structured_output_doomer_or_boomer_dataset))

## 3.2 HITL - Human in the loop evaluation grading

Programmatic scoring is great for many reasons: it's cheap to get started with, runs very fast, and can be very reliable. However, it cannot cover open-ended questions or tasks that require analysis or judgement. 

For example, did the LLM follow the instructions it was given, did it hallucinate, was it verbose or concise, etc.

To judge those outputs, we can use human graders to provide "golden answers", which is what we did above in the annotation example with our Doomer or Boomer app. 

The downside of HITL is that it's slow, expensive, and not scalable (unless you have a lot of money in the bank). 

HITL is a great way to kick-start an evaluation dataset, which you can then extrapolate with an LLM. 
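Because human review time is scarce, it helps to see where the LLM and the human labels disagree before asking people to look at more examples. The sketch below tallies agreement on a few toy rows; the rows are invented for illustration, and only the column names (`human_annotation`, `structured_llm_classification`) mirror the dataset built earlier in this notebook.

```python
from collections import Counter

# Toy rows, invented for illustration
rows = [
    {"human_annotation": "Doomer", "structured_llm_classification": "DOOMER"},
    {"human_annotation": "Boomer", "structured_llm_classification": "NEITHER"},
    {"human_annotation": "Neither", "structured_llm_classification": "NEITHER"},
    {"human_annotation": "Boomer", "structured_llm_classification": "BOOMER"},
]


def confusion_counts(rows):
    """Count (human label, LLM label) pairs, compared case-insensitively."""
    return Counter(
        (r["human_annotation"].lower(), r["structured_llm_classification"].lower())
        for r in rows
    )


counts = confusion_counts(rows)
agreement = sum(n for (human, llm), n in counts.items() if human == llm) / len(rows)
disagreements = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}

print(f"agreement: {agreement:.2f}")
print("disagreements:", disagreements)
```

The disagreeing pairs are the examples most worth a second human look, since they reveal either a labeling mistake or a gap in the prompt.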

Here's a slight variation on our app that shows LLM responses and lets our humans in the loop judge them as correct or incorrect. 

#TODO 11 - Run this app, mark up to 10 responses, and then hit "run evaluations".

In [16]:
# @title { display-mode: "form" }

import weave
from weave import Evaluation
# weave_ref = "weave:///capecape/evals-workshop/object/doomer_or_boomer_dataset:AjXt0JMXffKov9O8MUU2oyr2oVTLYTjIQ9XEva0ZEus"
weave_ref = "weave:///capecape/evals-workshop/object/doomer_or_boomer_dataset_with_structured_output:X4pRm3UDhhd1sNOoTh4l6sHSGWzGXWIwgaOvNy3nW0k"
doomer_or_boomer_dataset_with_structured_output = weave.ref(weave_ref).get()

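def match_dataset_with_replies():
    matched_replies = []
    # Load the replies once instead of re-loading them for every dataset row
    all_replies = list(load_replies())
    for row in doomer_or_boomer_dataset_with_structured_output.rows:
        # Find the reply whose text matches this dataset row
        for reply in all_replies:
            if reply['post']['record']['text'] == row['input']: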
                matched_reply = {
                    'full_reply': reply,
                    'input': row.get('input', ''),
                    'output': row.get('output', ''),
                    'reason': row.get('reason', ''),
                    'llm_classification': row.get('llm_classification', ''),
                    'displayName': row.get('displayName', '')
                }
                matched_replies.append(matched_reply)
                break
    return matched_replies

matched_replies = match_dataset_with_replies()
annotated_rows = []

def get_next_annotated_post(current_index:int = 0):
    # Get the matched replies
    
    print(current_index, len(matched_replies))
    if current_index >= len(matched_replies):
        current_index = 0  # Reset to beginning if we've reached the end
        
    reply = matched_replies[current_index]
    post = reply['full_reply']
    
    # Format the post data for the template
    created_at = datetime.fromisoformat(post['post']['record']['createdAt'].replace('Z', '+00:00'))
    formatted_date = created_at.strftime('%b %d, %Y, %I:%M %p')
    
    # Convert AT URI to bsky.app URL
    at_uri = post['post']['uri']
    _, _, author_did, _, post_id = at_uri.split('/')
    post_url = f"https://bsky.app/profile/{post['post']['author']['handle']}/post/{post_id}"
    
    post_data = {
        'author': post['post']['author'],
        'created_at': formatted_date,
        'text': post['post']['record']['text'],
        'like_count': post['post'].get('likeCount', 0),
        'repost_count': post['post'].get('repostCount', 0),
        'has_image': False,
        'post_url': post_url
    }
    
    # Use the stored LLM classification and human annotation
    analysis = f"""LLM Classification: {reply['llm_classification']}
    
LLM Reasoning: {reply['reason']}
    """
    
    # The evaluation button only unlocks once at least 10 posts have been annotated
    enough_annotations = len(annotated_rows) >= 10
    run_evaluation_btn = {
        "interactive": enough_annotations,
        "value": "Run Evaluation" if enough_annotations else f"Annotate {10 - len(annotated_rows)} more posts"
    }
    return template.render(**post_data), analysis, current_index + 1, gr.update(**run_evaluation_btn), ""

def submit_hitl_feedback(correct_or_incorrect: str, feedback: str, next_index: int):
    # next_index points at the post shown next, so the post just judged is at next_index - 1
    judged_reply = matched_replies[next_index-1]
    annotated_rows.append({
        "input": judged_reply.get('input'),
        "output": judged_reply.get('output'),
        "llm_classification": judged_reply.get('llm_classification'),
        "correct_or_incorrect": correct_or_incorrect == "correct",
        "human_reason_for_correct_or_incorrect": feedback,
    })
    return get_next_annotated_post(next_index)


def right_according_to_human(output: str, correct_or_incorrect: bool):
    return correct_or_incorrect

@weave.op
def return_input_row(input: str):
    return [x for x in annotated_rows if x.get('input') == input]

async def run_evaluation():
    hitl_evaluation = Evaluation(
        dataset=annotated_rows,
        scorers=[right_according_to_human],
        name="hitl_evaluation"
    )
    
    result = await hitl_evaluation.evaluate(return_input_row)
    gr.Info('Evaluation complete! Check your Weave project for the results.')
    return result

In [17]:
# @title { display-mode: "form" }
# @markdown Run this cell to start the HITL evaluation app

# %%blocks
# Create a Gradio Blocks app
os.environ['WEAVE_PRINT_CALL_LINK'] = 'true'

with gr.Blocks(theme=gr.themes.Soft()) as new_demo:
    # Add a title and description
    gr.Markdown("""
    # Human in the loop
    """)
    
    with gr.Row():
        with gr.Column(scale=1):
            gr.Markdown(f"""## 1. Post to Analyze  """)
            post_html = gr.HTML()
            # next_post_btn = gr.Button("Skip Post & Analyze Another", variant="primary")
            gr.Markdown(f"""
            #### Instructions for HITL judge: 
            - review LLM outputs and mark them as correct or incorrect  
            - after 10-20 examples, hit "run evaluation" button
    
            See your Weave project & traces [here](https://wandb.ai/{weave_api._project_id()})
            """)
            
        
        with gr.Column(scale=2):
            
            analysis_output = gr.Textbox(
                label="2. Review LLM Classification for this post",
                placeholder="Analysis will appear here...",
                lines=4,
            )
            next_index = gr.State(value=0)
            
            with gr.Accordion("Reminder of Doomer, Boomer, or Neither Criteria", open=False):
                gr.Markdown(f"""
                `Doomer`: Someone who hates AI, and uses derogatory language towards the author of the post because of their hate for AI and their data being used for AI  
                `Boomer`: Someone who doesn't understand AI, and copy-pastes a request to remove their data from the dataset  
                `Neither`: Folks who reply neutral or positive to the post.
                """)
            # Replace dropdown with three buttons
            reason_input = gr.Textbox(label="3. Add reason and submit",placeholder="Reason why the LLM got this classification right or wrong", lines=2)
            with gr.Row():
                correct_btn = gr.Button("LLM is Correct 👍")
                incorrect_btn = gr.Button("LLM is Incorrect 👎")

            run_evaluation_btn = gr.Button("Run Evaluation", variant="primary", interactive=False)

            
    # Set up event handler for combined next/analyze
    # next_post_btn.click(fn=get_next_annotated_post, inputs=[next_index], outputs=[post_html, analysis_output, next_index, run_evaluation_btn, reason_input])
    
    correct_btn.click(fn=submit_hitl_feedback, inputs=[gr.State("correct"), reason_input, next_index], outputs=[post_html, analysis_output, next_index, run_evaluation_btn, reason_input])
    incorrect_btn.click(fn=submit_hitl_feedback, inputs=[gr.State("incorrect"), reason_input, next_index], outputs=[post_html, analysis_output, next_index, run_evaluation_btn, reason_input])

    run_evaluation_btn.click(fn=run_evaluation, inputs=[], outputs=[analysis_output])
    # Initialize with first post and analysis
    post_html.value, analysis_output.value, next_index.value, run_evaluation_btn.value, reason_input.value = get_next_annotated_post()

new_demo.queue()
new_demo.launch()

0 16
* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




# 3.3 LLM as a Judge - use another LLM to grade your LLM outputs

Having to manually grade the above eval every time is going to get very annoying very fast, especially if the eval is a more realistic size (dozens, hundreds, or even thousands of questions). Luckily, there's a better way! 

We can actually have an LLM do the grading for us. We'll use a "teacher" model to grade the LLM outputs of a "student" model (in this case, the student is the LLM we're using for our production system). 

There are a few issues with this approach to be aware of: 
 - LLMs are not great at numerical scoring (e.g. 1-5) 
 - The order of candidate responses matters
 - Foundation models tend to prefer their own outputs over those of other models
 - LLMs prefer longer responses and "style" over accuracy
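
One common way to guard against the ordering issue is to ask the judge twice with the candidates swapped and only trust verdicts that agree. The sketch below assumes a pairwise judge, which is not what this notebook builds: `judge_both_orders` and the two stub judges are hypothetical stand-ins for a real LLM call, used only to show the idea.

```python
from typing import Callable

# Hypothetical judge signature: takes (first, second) and returns "first" or "second"
Judge = Callable[[str, str], str]


def judge_both_orders(judge: Judge, response_a: str, response_b: str) -> str:
    """Run the judge with both orderings; return "a", "b", or "inconsistent"."""
    first_pass = judge(response_a, response_b)   # A is shown first
    second_pass = judge(response_b, response_a)  # B is shown first
    winner_1 = "a" if first_pass == "first" else "b"
    winner_2 = "b" if second_pass == "first" else "a"
    return winner_1 if winner_1 == winner_2 else "inconsistent"


# Stub judges standing in for a real LLM call
def always_first(first: str, second: str) -> str:
    return "first"  # picks whatever it sees first, regardless of content


def prefers_longer(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


print(judge_both_orders(always_first, "short", "a much longer answer"))
print(judge_both_orders(prefers_longer, "short", "a much longer answer"))
```

The position-biased stub is flagged as inconsistent, while the length-biased stub passes the check with a stable verdict. Swapping the order catches one bias, not all of them.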


## 3.3.1 Let's build our LLM judge

First, we'll start by building a "grader prompt" template: a prompt asking our judge to perform the judging itself. This prompt is what we'll iterate on. In this template, we'll inject both the output of our production LLM and the criteria (rules or rubric) that make an answer correct or incorrect. 

In our case, the classification into one of 3 (Doomer, Boomer, Neither) is done 


In [19]:
# Step 1 - Build a grader prompt
import weave
from weave import Evaluation
from typing import Literal
from pydantic import BaseModel, Field

class DoomerOrBoomer(BaseModel):
    classification: Literal["DOOMER", "BOOMER", "NEITHER"] = Field(description="The classification of the post, either DOOMER, BOOMER, or NEITHER", example="DOOMER")
    reason: str = Field(description="The reason for the classification, a short explanation of why the post is classified like this.")


def build_grader_prompt(displayName: str, input: str, llm_classification: str): 
    system_prompt = f"""You are a judge who is grading the output of an automated assistant.
    
    ## Inputs
    You are provided with the following: 
    - A comment made on social media and the handle of the person making the comment
    - The classification output of an automated assistant made about the comment
    - A set of guidelines and additional context for you to understand the input and the correct way to classify it
    
    ## Instructions of how to classify responders: 
    `Doomer`: Someone who hates AI, and uses derogatory language towards the author of the post because of their hate for AI and their data being used for AI.
    `Boomer`: Someone who doesn't understand AI, and copy-pastes a request to remove their data from the dataset. He is a boomer because he is old and doesn't understand AI.
    `Neither`: Folks who reply neutral or positive to the post.

    Your task is to classify the comment into one of the 3 categories and give a short reasoning for your choice.
    """

    grader_prompt_template = f"""
    <input>
    @{displayName}: {input}
    </input>

    <automated_assistant_classification>
    {llm_classification}
    </automated_assistant_classification>
    """

    return system_prompt, grader_prompt_template

# Step 2 - Get our datasets 
weave_ref = "weave:///capecape/evals-workshop/object/doomer_or_boomer_dataset_with_structured_output:X4pRm3UDhhd1sNOoTh4l6sHSGWzGXWIwgaOvNy3nW0k"
doomer_or_boomer_dataset_with_structured_output = weave.ref(weave_ref).get()


# Step 3 - Build our LLM Judge API function 
@weave.op
def llm_judge_api(input: str, structured_llm_classification: str, displayName: str):
    system_prompt, grader_prompt = build_grader_prompt(displayName, input, structured_llm_classification)
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": grader_prompt}],
        temperature=0,
        response_format=DoomerOrBoomer,
    )
    json_out = response.choices[0].message.content
    return DoomerOrBoomer.model_validate_json(json_out)

# Step 4 - Create a scorer 
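def right_according_to_llm_judge(output: DoomerOrBoomer, structured_llm_classification: str):
    # `output` is the DoomerOrBoomer object returned by llm_judge_api, hence the attribute access
    return {"match": structured_llm_classification.lower() == output.classification.lower()}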

# Step 5 - Run our evaluation 
llm_judge_evaluation = Evaluation(
    dataset=doomer_or_boomer_dataset_with_structured_output,
    scorers=[right_according_to_llm_judge],
    name="LLM Judge Evaluation"
)

await llm_judge_evaluation.evaluate(llm_judge_api, __weave={"display_name": "LLM Judge"})


🍩 https://wandb.ai/capecape/evals-workshop/r/call/01949102-fe55-7813-9667-5c1ecb687040


{'right_according_to_llm_judge': {'match': {'true_count': 13,
   'true_fraction': 0.8125}},
 'model_latency': {'mean': 19.960527390241623}}

# 3.4 Aligning our judges with human preferences - Meta evaluation

This is a bit out of scope for our workshop, but for those who want to learn more: once we start running our LLM as a judge, we'll notice its shortcomings. It will be biased toward certain things, and changing the order of the questions will sometimes yield different results. 

Also, the human graders' understanding of the question will change during the annotation process itself. 

So a meta evaluation process is needed to understand how the judge itself is performing, and to align the LLM judge with the additional inputs from HITL responses. 

Then we need to compare the judges empirically to understand whether we made a material difference. 
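
A simple starting point for such a comparison is to measure how often the LLM judge agrees with the human labels. The sketch below assumes both sets of labels are available as equal-length lists of strings. The function names and the sample labels are hypothetical, and Cohen's kappa is just one possible agreement statistic, not something this workshop prescribes.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of examples where the LLM judge and the human gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("Label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance (Cohen's kappa)."""
    observed = agreement_rate(human, judge)  # also validates the inputs
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability that both graders pick the same label independently
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both graders always use the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical labels, not results from the workshop dataset
human_labels = ["DOOMER", "DOOMER", "BOOMER", "NEITHER"]
judge_labels = ["DOOMER", "BOOMER", "BOOMER", "NEITHER"]
print(agreement_rate(human_labels, judge_labels))
print(cohens_kappa(human_labels, judge_labels))
```

Tracking a number like this across judge prompt revisions gives a rough signal of whether a change moved the judge closer to the human graders.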

For a deeper dive into this topic, W&B just published a course on evaluations with more info: https://wandb.me/evals

# Recap and Additional resources

You've made it all the way to the end of this notebook! By now you have hands-on experience implementing nearly all parts of the robust LLMs in production framework below: 

![three](https://gist.github.com/user-attachments/assets/0d51de65-8ec7-4cc5-a102-5a13229f5531)

## Additional resources

- Weave documentation - [weave docs](https://wandb.me/weave)
- W&B Evaluations course - [evals course](https://wandb.me/evals)
- Eugene Yan's excellent blog - [evaluating LLM evaluators](https://eugeneyan.com/writing/llm-evaluators/)
- Who validates the validators - Shreya Shankar [Paper](https://arxiv.org/abs/2404.12272)
- Hamel Husain - [your product needs evaluations](https://hamel.dev/blog/posts/evals/)