<a href="https://colab.research.google.com/github/hanhanwu/Hanhan_LangGraph_Exercise/blob/main/RAG_Chatbot/try_langsmith_model_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## About

* Compare the performance of two LLM solutions
* Evaluation uses a test set of questions paired with their correct answers
* View the comparison on the LangSmith dashboard

In [1]:
%%capture --no-stderr
%pip install -U --quiet langsmith langchain_openai

In [2]:
import pandas as pd
import os
from google.colab import userdata

def _set_env(key_name: str, key_value: str):
    if key_name not in os.environ:
        os.environ[key_name] = key_value

# Load the API keys stored in Colab secrets
_set_env('OPENAI_API_KEY', userdata.get('OPENAI_API_KEY'))
_set_env('LANGSMITH_API_KEY', userdata.get('LANGSMITH_API_KEY'))

## Upload Test Set to LangSmith

* The test set is saved [here][1]
* The dataset was uploaded through the LangSmith UI [like this][2]
  * `dataset_name` was defined during this process
  * Choose only "question" as the LangSmith dataset input and only "answer" as the output; otherwise LangSmith might raise a ValueError

[1]:https://github.com/hanhanwu/Hanhan_LangGraph_Exercise/blob/main/RAG_Chatbot/model_eval_testset.csv
[2]:https://docs.smith.langchain.com/old/evaluation/faq/manage-datasets
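
The upload can also be scripted instead of done through the UI. The sketch below is one possible approach, assuming the CSV has `question` and `answer` columns (as the bullets above imply) and that the client exposes `create_dataset` and `create_examples` as in recent `langsmith` releases. `rows_to_examples` and `upload_testset` are hypothetical helper names, not part of this notebook.

```python
import csv
import io


def rows_to_examples(csv_text):
    """Split CSV rows into LangSmith-style input and output dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    inputs, outputs = [], []
    for row in reader:
        # Only "question" is used as input and only "answer" as output.
        inputs.append({"question": row["question"]})
        outputs.append({"answer": row["answer"]})
    return inputs, outputs


def upload_testset(client, dataset_name, csv_text):
    """Create a dataset and add one example per CSV row.

    `client` is assumed to be a langsmith.Client instance.
    """
    inputs, outputs = rows_to_examples(csv_text)
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
    return len(inputs)
```

With a real client this might be called as `upload_testset(Client(), 'model_eval_testset', open('model_eval_testset.csv').read())`.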

In [3]:
from langsmith import Client
client = Client(api_key=os.environ['LANGSMITH_API_KEY'])
dataset_name = 'model_eval_testset'

# Pass the name by keyword: the first positional parameter is dataset_id
client.list_examples(dataset_name=dataset_name)

<generator object Client.list_examples at 0x7f3da698f370>
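
`list_examples` returns a lazy generator, which is why the cell output shows only a generator object. Below is a minimal sketch for peeking at the first few examples, assuming each example exposes `inputs` and `outputs` dicts with the `question` and `answer` keys chosen at upload time. `preview_examples` is a hypothetical helper.

```python
from itertools import islice


def preview_examples(examples, n=3):
    """Return (question, answer) pairs for the first n examples."""
    pairs = []
    # islice consumes at most n items, so the full dataset is not fetched.
    for example in islice(examples, n):
        pairs.append((example.inputs["question"], example.outputs["answer"]))
    return pairs
```

It could be called as `preview_examples(client.list_examples(dataset_name=dataset_name))`.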

## Define Metrics

In [4]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts.prompt import PromptTemplate
from langsmith.evaluation import LangChainStringEvaluator


_PROMPT_TEMPLATE = """You are an ANA (All Nippon Airways) accessibility policy expert.
You need to evaluate answers to customers' questions about the accessibility policy.
Here is the customer's question:
{query}
Here is the real answer:
{answer}
You are evaluating the following predicted answer:
{result}
Respond with CORRECT or INCORRECT:
Evaluate:
"""

PROMPT = PromptTemplate(
    template=_PROMPT_TEMPLATE,
    # The "qa" evaluator expects exactly these variable names
    input_variables=["query", "answer", "result"],
)
eval_llm = ChatOpenAI(temperature=0.0)
qa_evaluator = LangChainStringEvaluator("qa",
                                        config={"llm": eval_llm,
                                                "prompt": PROMPT})
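
The LLM-graded `qa` evaluator can be paired with a cheap deterministic check. The sketch below is a custom evaluator in the `(run, example)` form that LangSmith's `evaluate` accepts, assuming the target function returns its prediction under `result` and the dataset stores the reference under `answer`. `exact_match` is a hypothetical name, and exact string matching is a much stricter criterion than the CORRECT/INCORRECT judgment above.

```python
def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(text).lower().split())


def exact_match(run, example):
    """Score 1 when the prediction equals the reference after normalization."""
    predicted = (run.outputs or {}).get("result", "")
    reference = (example.outputs or {}).get("answer", "")
    return {
        "key": "exact_match",
        "score": int(_normalize(predicted) == _normalize(reference)),
    }
```

It could then be passed alongside the LLM grader, for example `evaluators=[qa_evaluator, exact_match]`.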

## Run Evaluations

In [5]:
def eval_query(question, openai_model_name):
    # Relies on the global `openai_client`, which is created in the next cell
    return openai_client.chat.completions.create(
        model=openai_model_name,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Respond to customer's question about All Nippon Airways' accessibility policy."
            },
            {
                "role": "user",
                "content": question,
            }
        ],
    ).choices[0].message.content


def langsmith_eval_all_queries(inputs, openai_model_name):
    output = eval_query(inputs["question"], openai_model_name)
    return {"result": output}

In [6]:
import openai
from langsmith import evaluate
from functools import partial

openai_client = openai.Client()
model1_name = 'gpt-3.5-turbo'
model2_name = 'gpt-4o-mini'

eval_func1 = partial(langsmith_eval_all_queries, openai_model_name=model1_name)
eval_func2 = partial(langsmith_eval_all_queries, openai_model_name=model2_name)
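
`evaluate` calls its target with a single `inputs` dict, so `partial` is used here to pre-bind the model name. Below is a minimal sketch of that binding, with a stub (`fake_target`, hypothetical) standing in for the real OpenAI call and a made-up question as input.

```python
from functools import partial


def fake_target(inputs, openai_model_name):
    """Stand-in for langsmith_eval_all_queries; makes no API call."""
    return {"result": f"[{openai_model_name}] {inputs['question']}"}


# Bind the model name once; the resulting callable takes only the inputs dict.
target = partial(fake_target, openai_model_name="gpt-4o-mini")
output = target({"question": "Can I bring a wheelchair?"})
```

Binding by keyword keeps `inputs` as the only remaining positional argument, which matches how the two `eval_func` callables above are invoked.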

In [7]:
experiment_results = evaluate(
    eval_func1,
    data=dataset_name,
    evaluators=[qa_evaluator],  # evaluation metrics
    experiment_prefix="model1",
)

View the evaluation results for experiment: 'model1-062ef2f2' at:
https://smith.langchain.com/o/e9860761-8952-4d6a-a09f-3b40cecc4980/datasets/06ca5d2d-4af3-470a-8608-599b736808be/compare?selectedSessions=443a3ca8-d72a-4c91-8c77-229bfe329d9e





In [8]:
experiment_results = evaluate(
    eval_func2,
    data=dataset_name,
    evaluators=[qa_evaluator],  # evaluation metrics
    experiment_prefix="model2",
)

View the evaluation results for experiment: 'model2-3e844e7d' at:
https://smith.langchain.com/o/e9860761-8952-4d6a-a09f-3b40cecc4980/datasets/06ca5d2d-4af3-470a-8608-599b736808be/compare?selectedSessions=1a036cc3-386b-4b14-af95-d0c377722669





## Compare the Two Experiments on the LangSmith Dashboard