# Weights & Biases: Prompt Tracing and Evaluation with **Models** 🪄
![](https://drive.google.com/uc?export=view&id=1OMfNfY2ApC575UtXhH5WZlNcdbMurT_m)  




This tutorial uses four core components of the W&B Models suite:

- **Run**: Experiment tracking context to monitor prompt iterations, configuration, execution context, and metrics.
- **Artifact**: Version, govern, and track the lineage of data assets in a prompting pipeline.
- **Table**: Interactive, tabular dataset to inspect generations.
- **Sweep**: Hyperparameter tuning to optimize prompting and generation workflows.

## Setup

In [None]:
!pip install -qqq wandb openai tiktoken langchain-openai langchain transformers datasets evaluate rouge_score langchain-community chromadb

## Log in to W&B
- Log in explicitly with `wandb login` (CLI) or `wandb.login()` (Python, see below).
- Alternatively, set environment variables. Several environment variables change the behavior of W&B logging; the most important are:
    - `WANDB_API_KEY`: your API key, found in the "Settings" section under your profile
    - `WANDB_BASE_URL`: the URL of the W&B server
- Find your API key under "Profile" -> "Settings" in the W&B App
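
As a minimal sketch of the environment-variable route, the helper below sets the two variables only when they are not already defined. The function name `configure_wandb_env`, the `env` parameter, and the server URL are illustrative assumptions, not part of the W&B API. The demo writes to a plain dictionary so that nothing in your real environment changes.

```python
import os


def configure_wandb_env(api_key=None, base_url=None, env=os.environ):
    """Set W&B environment variables without overwriting existing values.

    Returns the names of the variables this call actually set.
    `env` defaults to the process environment; pass a dict to dry-run.
    """
    applied = []
    for name, value in (("WANDB_API_KEY", api_key), ("WANDB_BASE_URL", base_url)):
        if value and name not in env:
            env[name] = value
            applied.append(name)
    return applied


# Dry run against a dict. The key and URL are placeholders (assumptions).
demo_env = {}
configure_wandb_env(
    api_key="<your-api-key>",
    base_url="https://wandb.example.com",  # hypothetical self-hosted server
    env=demo_env,
)
print(sorted(demo_env))
```

To apply the values for real, call the helper without `env` before `wandb.login()`, and omit `base_url` if you use the default W&B cloud.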

In [None]:
import wandb
import evaluate
import pandas as pd
from getpass import getpass
import os
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain.callbacks import get_openai_callback
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import TokenTextSplitter
from langchain.chains import RetrievalQA
import numpy as np

In [None]:
wandb.login()

### Set OpenAI API Key

In [None]:
if os.getenv("OPENAI_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")


### Set Project and Entity

In [None]:
PROJECT_NAME = "<>" #@param
ENTITY = "<>" #@param

## Download Dataset
* Our task is to summarize legal documents (bills) from the state of California, taken from the `ca_test` split of the `billsum` dataset

In [None]:
from datasets import load_dataset

wandb.init(project=PROJECT_NAME, entity=ENTITY, job_type="inspect_data")

billsum = load_dataset("billsum", split="ca_test")

# Let's just grab 5 documents for a demo
shuffled_dataset = billsum.shuffle(seed=42)[0:5]

billsum_df = pd.DataFrame.from_dict(shuffled_dataset)
billsum_downsampled = wandb.Table(dataframe=billsum_df)

# Create an artifact from the dataset for version control + lineage tracking
artifact = wandb.Artifact("ground_truth_data_ca", type="datasets")

# Add the table to the artifact
artifact.add(billsum_downsampled, 'downsampled_data')

# Log the table + Artifact
wandb.log({"billsum_downsampled": billsum_downsampled})
wandb.log_artifact(artifact)

wandb.finish()

## W&B Runs Track LLM Executions
Below we have a simple Python function which invokes an LLM (OpenAI) to summarize documents contained in a pandas dataframe. Throughout this tutorial we will enrich this function with more logging to better understand our results and compare them across experiments:
*   Use `wandb.init` to create a `Run`
*   Use `wandb.config` to log "inputs" to your runs
*   Use `wandb.Table` and `wandb.log` to log "outputs" or results of those runs

Instrumenting your LLM function calls with logging makes unit testing, evaluation, and meta-analysis across models and parameters much easier.

In [None]:
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

In [None]:
summarize_prompt1 = """You have been provided with legal documents from the state of CA.
    Your task is to provide a brief and comprehensive summary of the document.
    Include 3 key points from the document:
    {text}

    SUMMARY:"""


def llm_summarize_documents(df=billsum_df,
                            model_name="gpt-4o-mini",
                            summarize_prompt=summarize_prompt1,
                            temperature=0.1):


  wandb.init(project=PROJECT_NAME,
              entity=ENTITY,
              job_type="summarize",
              config={"model_name": model_name,
                      "summarize_prompt": summarize_prompt,
                      "temperature": temperature})

  wandb.use_artifact(f'{ENTITY}/{PROJECT_NAME}/ground_truth_data_ca:v0', type='datasets')

  summarize_prompt_template = PromptTemplate(template=wandb.config["summarize_prompt"], input_variables=["text"])
  llm = ChatOpenAI(model_name = wandb.config["model_name"], temperature = wandb.config["temperature"])

  def llm_summarize_row(row):
      with get_openai_callback() as cb:
        # Prepare the prompt with the document text
        document = row["text"]
        prompt = summarize_prompt_template.format(text=document)
        messages = [
          SystemMessage(content=prompt),
        ]

        # Get the summary from the LLM
        summary = llm.invoke(messages).content

        return {"llm_summary": summary,
                "prompt_tokens": cb.prompt_tokens,
                "completion_tokens": cb.completion_tokens,
                "total_tokens": cb.total_tokens,
                "total_cost": cb.total_cost}


  df_llm = df.apply(llm_summarize_row, axis=1, result_type='expand')
  df = df.join(df_llm)

  # Log Pandas dataframes of results to interactive tables with built-in lineage and visualization
  summary_table = wandb.Table(dataframe=df)

  # also log as Artifact
  artifact = wandb.Artifact("summary_df", type="datasets")
  # Add the table to the artifact
  artifact.add(summary_table, 'summary_table')

  wandb.log({"llm_inference/llm_summary_table": summary_table})
  wandb.log_artifact(artifact)


  wandb.finish()

  return df

In [None]:
billsum_df_llm = llm_summarize_documents(billsum_df)
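
Because the prompt is logged to `wandb.config` and later swept over, it can help to check that every candidate prompt contains exactly the `{text}` placeholder before starting a run. The sketch below assumes the default f-string style of `PromptTemplate`, for which plain `str.format` behaves the same in this simple case. The helper name `template_fields` and the demo prompt are illustrative.

```python
from string import Formatter


def template_fields(template: str) -> set:
    """Return the named placeholders found in a str.format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# A shortened, hypothetical prompt in the same shape as summarize_prompt1
demo_prompt = """Summarize the document below in 3 key points:
{text}

SUMMARY:"""

fields = template_fields(demo_prompt)
assert fields == {"text"}, f"unexpected placeholders: {fields}"

# Filling the template is then a plain format call
print(demo_prompt.format(text="(document text goes here)"))
```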

## Log Custom Metrics

- `wandb.log` accepts a dictionary whose keys are metric names and whose values are `wandb.Table`s, scalar metrics, numpy array embeddings, or even matplotlib charts.
- `wandb.summary` tracks metrics that describe the entire run, usually aggregates such as the `total_cost` of all evaluations.

In [None]:
def llm_summarize_documents(df=billsum_df,
                            model_name="gpt-4",
                            summarize_prompt=summarize_prompt1,
                            temperature=0.5):

  wandb.init(project=PROJECT_NAME,
              entity=ENTITY,
              job_type="summarize",
              config={"model_name": model_name,
                      "summarize_prompt": summarize_prompt,
                      "temperature": temperature})

  wandb.use_artifact(f'{ENTITY}/{PROJECT_NAME}/ground_truth_data_ca:v0', type='datasets')

  summarize_prompt_template = PromptTemplate(template=wandb.config["summarize_prompt"], input_variables=["text"])
  llm = ChatOpenAI(model_name = wandb.config["model_name"], temperature = wandb.config["temperature"])

  def llm_summarize_row(row):
      with get_openai_callback() as cb:
        # Prepare the prompt with the document text
        document = row["text"]
        prompt = summarize_prompt_template.format(text=document)
        messages = [
          SystemMessage(content=prompt),
        ]

        # Get the summary from the LLM
        summary = llm.invoke(messages).content

        return {"llm_summary": summary,
                "prompt_tokens": cb.prompt_tokens,
                "completion_tokens": cb.completion_tokens,
                "total_tokens": cb.total_tokens,
                "total_cost": cb.total_cost}


  df_llm = df.apply(llm_summarize_row, axis=1, result_type='expand')
  df = df.join(df_llm)


  # Eval Metrics

  # Add Rouge metric calculation
  rouge = evaluate.load('rouge')
  results = rouge.compute(predictions=df["llm_summary"],
                         references=df["summary"],
                        use_aggregator=False)

  rouge_df = pd.DataFrame.from_dict(results)
  df = df.join(rouge_df)

  # toxicity measure
  toxicity = evaluate.load("toxicity", module_type="measurement")
  results_toxic = toxicity.compute(predictions=df['llm_summary'])
  toxic_df = pd.DataFrame.from_dict(results_toxic)
  df = df.join(toxic_df)

  # Log Pandas dataframes of results to interactive tables with built-in lineage and visualization
  df['llm_used'] = wandb.config.model_name
  summary_table = wandb.Table(dataframe=df)

  # also log as Artifact
  artifact = wandb.Artifact("summary_df_metrics", type="datasets")
  # Add the table to the artifact
  artifact.add(summary_table, 'summary_table_metrics')

  # Log additional metrics as part of wandb.log call
  wandb.log({"llm_summary_table": summary_table})
  wandb.log_artifact(artifact)

  wandb.summary["rouge1"] = df["rouge1"].mean()
  wandb.summary["rouge2"] = df["rouge2"].mean()
  wandb.summary["rougeL"] = df["rougeL"].mean()
  wandb.summary["rougeLsum"] = df["rougeLsum"].mean()
  wandb.summary["total_cost"] = df["total_cost"].sum()

  wandb.finish()

  return df

In [None]:
billsum_df_llm = llm_summarize_documents(billsum_df)
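
To build intuition for the `rouge1` column logged above, here is a simplified sketch of a unigram-overlap F-measure. It is not the `rouge_score` implementation: it tokenizes on whitespace only and applies no stemming, so its values can differ from those in the table. The function name `rouge1_f1` and the two example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the bill amends the vehicle code",
    "the bill changes the vehicle code",
)
print(round(score, 4))  # 5 of 6 unigrams overlap on both sides -> 0.8333
```

`rouge2` follows the same idea with bigrams, while `rougeL` is based on the longest common subsequence instead of n-gram counts.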

## Retrieving from W&B
- After evaluating an LLM and prompts on a dataset, we want to retrieve information from the past runs we like and use it in a pipeline
- The `wandb.Api` import/export API lets you retrieve runs, metrics, and tables, and hand off evaluation results or prompts from one function to another

In [None]:
past_run_id = "<run id from above>"

def retrieve_wandb_table(project: str, entity: str, wandb_run_id: str, table_name: str, table_version: str) -> pd.DataFrame:
  if wandb.run is None:
    api = wandb.Api(overrides={"project": project,
                              "entity": entity})
    table_art = api.artifact(name=f"run-{wandb_run_id}-{table_name}:{table_version}")
  else:
    table_art = wandb.use_artifact(f"run-{wandb_run_id}-{table_name}:{table_version}")
  table = table_art.get(table_name)
  table_df = pd.DataFrame(data=table.data, columns=table.columns)
  return table_df

table_df = retrieve_wandb_table(PROJECT_NAME, ENTITY, past_run_id, "llm_summary_table" , "latest")

In [None]:
table_df
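
`retrieve_wandb_table` relies on the naming pattern `run-<run id>-<table name>:<version>` for tables logged with `wandb.log`. The small sketch below spells that pattern out and optionally prefixes it with `entity/project` to form a fully qualified name. The helper name and the run ID `abc123` are hypothetical.

```python
def run_table_artifact_name(run_id, table_name, version="latest", entity=None, project=None):
    """Build the artifact name used in this notebook for a table logged by a run."""
    name = f"run-{run_id}-{table_name}:{version}"
    if entity and project:
        # Fully qualified form: entity/project/name:version
        name = f"{entity}/{project}/{name}"
    return name


# "abc123" is a placeholder run ID, not a real run
print(run_table_artifact_name("abc123", "llm_summary_table"))
# -> run-abc123-llm_summary_table:latest
```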

### Retrieve run configs (e.g. prompts)

In [None]:
api = wandb.Api(overrides={"project": PROJECT_NAME,
                              "entity": ENTITY})
run = api.run(f"{ENTITY}/{PROJECT_NAME}/{past_run_id}")

print(run.config['summarize_prompt'])

In [None]:
run.config

## LLM Evaluation
* We can ask a GPT model (`gpt-3.5-turbo` in the code below) to judge whether the LLM summaries from a previous run agree with the human-written ones

In [None]:
def gpt_evaluate_summaries(wandb_run_id: str,
                            table_name: str,
                            table_version: str):
  wandb.init(project=PROJECT_NAME, entity=ENTITY, job_type="gpt_evaluation")

  eval_df = retrieve_wandb_table(PROJECT_NAME, ENTITY, wandb_run_id, table_name, table_version)
  evaluate_prompt = """Here are two summaries of some legal documents. \
  The first one is a human written summary and the second is generated by an LLM. \
  Please indicate whether the LLM-summary is accurate to the human one. Just give a YES or NO: \n \
  HUMAN SUMMARY: {human_summary} \n \
  LLM_SUMMARY: {llm_summary} \n \
  ANSWER:"""

  evaluate_prompt_template = PromptTemplate(template=evaluate_prompt, input_variables=["human_summary", "llm_summary"])
  llm = ChatOpenAI(model_name = "gpt-3.5-turbo")

  def evaluate_summaries(row):
    human_summary = row["summary"]
    llm_summary = row["llm_summary"]
    with get_openai_callback() as cb:
        # Prepare the prompt with the document text
        prompt = evaluate_prompt_template.format(human_summary=human_summary, llm_summary=llm_summary)
        messages = [
          SystemMessage(content=prompt),
        ]
        result = llm.invoke(messages).content

        return {"gpt_result": result,
                "prompt_tokens": cb.prompt_tokens,
                "completion_tokens": cb.completion_tokens,
                "total_tokens": cb.total_tokens,
                "total_cost": cb.total_cost}

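  eval_result = eval_df.apply(evaluate_summaries, axis=1, result_type='expand')
  # The retrieved table already has token and cost columns, so suffix the evaluation ones
  eval_df = eval_df.join(eval_result, rsuffix="_gpt_eval")

  eval_table = wandb.Table(dataframe=eval_df)

  wandb.log({"llm_eval/gpt_eval_table": eval_table})
  # Sum the cost of the evaluation calls only, not the summarization cost stored in the table
  wandb.summary["total_cost"] = eval_result["total_cost"].sum()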

  wandb.finish()

In [None]:
gpt_evaluate_summaries(past_run_id, "llm_summary_table", "latest")
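
The `gpt_result` column holds free text, and models do not always reply with a bare YES or NO. As an optional follow-up that the notebook itself does not perform, the sketch below normalizes replies and computes the share of YES verdicts, which could then be stored in `wandb.summary`. The function names and sample replies are illustrative assumptions.

```python
import re


def parse_verdict(raw: str):
    """Map a free-text reply to True (YES), False (NO), or None (unclear)."""
    match = re.match(r"\s*(yes|no)\b", raw, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


def agreement_rate(replies):
    """Fraction of YES among replies that could be parsed, or None if none could."""
    decided = [v for v in map(parse_verdict, replies) if v is not None]
    return sum(decided) / len(decided) if decided else None


# Made-up replies standing in for the gpt_result column
sample_replies = ["YES", "No.", " yes, mostly", "It depends"]
print(agreement_rate(sample_replies))  # 2 YES out of 3 parsed replies
```

Replies that cannot be parsed are left out of the rate here; counting them as NO is an equally defensible choice, as long as it is applied consistently across runs.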

## Trace a RAG Pipeline

In [None]:
def create_embeddings(table_df: pd.DataFrame, index: int):

  # load docs into langchain format
  loader = DataFrameLoader(table_df, page_content_column="text")
  data = loader.load()

  # split the documents
  text_splitter = TokenTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
  docs = text_splitter.split_documents(data)

  title = data[0].metadata["title"]

  # initialize embedding engine
  embeddings = OpenAIEmbeddings()

  db = Chroma.from_documents(
      docs,
      embeddings,
      persist_directory=os.path.join("chromadb", str(index)),
  )
  db.persist()


wandb.init(project=PROJECT_NAME, entity=ENTITY, job_type="create_embeddings")

# create embeddings, one Chroma index per document
with get_openai_callback() as cb:
  for index, row in table_df.iterrows():
    document_df = row.to_frame().T
    create_embeddings(document_df, index=index)


vector_db_artifact = wandb.Artifact("vector_db", type="vector_db")
vector_db_artifact.add_dir("chromadb")
wandb.log_artifact(vector_db_artifact)

wandb.summary["prompt_tokens"] = cb.prompt_tokens
wandb.summary["completion_tokens"] = cb.completion_tokens
wandb.summary["total_tokens"] = cb.total_tokens
wandb.summary["total_cost"] = cb.total_cost
wandb.finish()
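
To make the `chunk_size=1000, chunk_overlap=0` setting concrete, here is a simplified sketch of fixed-size token chunking. It is not LangChain's implementation, and it uses a list of integers as a stand-in for real `tiktoken` token IDs. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=1000, chunk_overlap=0):
    """Split a token sequence into windows of chunk_size with optional overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


# A 2,500-token document (integers stand in for token IDs)
fake_tokens = list(range(2500))
print([len(chunk) for chunk in chunk_tokens(fake_tokens)])  # [1000, 1000, 500]
```

With no overlap, a sentence that straddles a chunk boundary is split between two chunks. A small positive `chunk_overlap` trades some extra embedding cost for more context at the boundaries.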


In [None]:
def get_answer(document_title: str, question: str):
  index = table_df[table_df["title"] == document_title].index[0]
  db_dir = os.path.join("chromadb", str(index))
  embeddings = OpenAIEmbeddings()
  db = Chroma(persist_directory=db_dir, embedding_function=embeddings)

  prompt_template = """Use the following pieces of context to answer the question.
  If you don't know the answer, just say that you don't know, don't try to make up an answer.
  Don't add your opinions or interpretations. Ensure that you complete the answer.
  If the question is not relevant to the context, just say that it is not relevant.

  CONTEXT:
  {context}

  QUESTION: {question}

  ANSWER:"""

  prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

  retriever = db.as_retriever()
  retriever.search_kwargs["k"] = 2

  qa = RetrievalQA.from_chain_type(
      llm=ChatOpenAI(temperature=0),
      chain_type="stuff",
      retriever=retriever,
      chain_type_kwargs={"prompt": prompt},
      return_source_documents=True
  )

  with get_openai_callback() as cb:
      result = qa.invoke({"query": question})

  answer = result["result"]
  return answer

In [None]:
title = table_df["title"][0]

In [None]:
wandb.init(project=PROJECT_NAME, entity=ENTITY, job_type="retrieval_QA")
wandb.use_artifact(f'{ENTITY}/{PROJECT_NAME}/vector_db:v0', type='vector_db')
get_answer(title,
            "What does the law say about statewide emissions?")
wandb.finish()

## Execute Sweeps Across Prompts and Parameters


In [None]:
summarize_prompt1 = """You have been provided with legal documents from the state of CA.
    Your task is to provide a brief and comprehensive summary of the document.
    The summary should encompass all the crucial points of the document.
    Include at least 3 points and ensure the summary is at least 2 paragraphs long
    {text}

    SUMMARY:"""

summarize_prompt2 = """You are an expert in State Law in CA. You have been provided with legal documents from the state of CA.
    Your task is to provide a brief and comprehensive summary of the document.
    The summary should encompass all the crucial points of the document and do not be vague.
    {text}

    SUMMARY:"""

summarize_prompt3 = """You are an expert in State Law in CA. You have been provided with legal documents from the state of CA.
    Your task is to provide a brief and comprehensive summary of the document for the purposes of review by the state legal office.
    The summary should encompass all the crucial points but do not be so vague so as to lose the ability to categorize the document effectively
    from a legal standpoint:
    {text}

    SUMMARY:"""

sweep_config = {
    'method': 'random',
}

parameters_dict = {
    'summarize_prompt': {
        'values': [summarize_prompt1, summarize_prompt2, summarize_prompt3]
    },
    'model_name': {
        'values': ["gpt-3.5-turbo", "gpt-4o-mini", "gpt-4"]
    },
    'temperature': {
        'values': [0.1, 0.2, 0.3]
    }
}


sweep_config['parameters'] = parameters_dict
sweep_id = wandb.sweep(sweep_config, project=PROJECT_NAME, entity=ENTITY)
# Pass this sweep ID to the agents that will run the sweep on your machines

In [None]:
wandb.agent(sweep_id, llm_summarize_documents, count=4)
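
To get a feel for the search space behind `parameters_dict`, the sketch below enumerates every combination and draws a few random configurations. It only illustrates the idea of the `random` method with `count=4` and is not W&B's sampler. The helper names are hypothetical, and short strings stand in for the full prompts.

```python
import itertools
import random


def enumerate_grid(parameters):
    """List every combination of the discrete values in a sweep parameter dict."""
    names = list(parameters)
    value_lists = [parameters[name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def sample_random(parameters, count, seed=0):
    """Draw `count` configurations by picking each value independently at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(spec["values"]) for name, spec in parameters.items()}
        for _ in range(count)
    ]


# Same shape as parameters_dict, with placeholder prompt names
demo_parameters = {
    "summarize_prompt": {"values": ["prompt1", "prompt2", "prompt3"]},
    "model_name": {"values": ["gpt-3.5-turbo", "gpt-4o-mini", "gpt-4"]},
    "temperature": {"values": [0.1, 0.2, 0.3]},
}

grid = enumerate_grid(demo_parameters)
print(len(grid))  # 3 * 3 * 3 = 27 combinations
for config in sample_random(demo_parameters, count=4):
    print(config)
```

Four runs cover only a small part of the 27 combinations, so treat the results of a short random sweep as a first look, not a full comparison.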

In [None]:
wandb.teardown()