# Model Graded Evals

Use an LLM to evaluate another LLM's response.

---

Pros:
- More robust and comprehensive

Cons:
- Costly to run
- Slower to run

**Suitable for both pre-deployment and post-deployment checks, to build confidence in our LLM app.**

## Setup

In [8]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
openai.api_type = os.environ.get("OPENAI_API_TYPE")
openai.api_base = os.environ.get("OPENAI_API_BASE")
openai.api_key = os.environ.get("OPENAI_API_KEY")
openai.api_version = os.environ.get("OPENAI_API_VERSION")
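
Because `os.environ.get` returns `None` for unset variables, a missing entry in `.env` only surfaces later, typically as an authentication or routing error. A minimal sketch of a fail-fast check follows; the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os

# The four variables read in the setup cell above.
REQUIRED_VARS = [
    "OPENAI_API_TYPE",
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
    "OPENAI_API_VERSION",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    # Report rather than raise, so the notebook can still be read top to bottom.
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing `env` explicitly keeps the helper easy to test without touching the real environment.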

## LLM

In [9]:
from langchain.chat_models import AzureChatOpenAI

llm = AzureChatOpenAI(
    deployment_name="gpt40125",
    temperature=0,
)

## Sample app: AI-powered quiz generator

### Evaluations

#### Test if the generated quiz is in the right format

In [10]:
delimiter = "####"

eval_system_message = f"""You are an assistant that evaluates \
whether or not an assistant is producing valid quizzes.
The assistant should be producing output in the \
format of Question N:{delimiter} <question N>?"""

In [11]:
sample_llm_response = """
Question 1:#### What is the largest telescope in space called and what material is its mirror made of?

Question 2:#### True or False: Water slows down the speed of light.

Question 3:#### What did Marie and Pierre Curie discover in Paris?
"""

In [12]:
eval_user_message_template = """You are evaluating a generated quiz \
based on the context that the assistant uses to create the quiz.
  Here is the data:
    [BEGIN DATA]
    ************
    [Response]: {llm_response}
    ************
    [END DATA]

Read the response carefully and determine if it looks like \
a quiz or not. Do not evaluate if the information is correct;
only evaluate if the data is in the expected format.

Output Y if the response is a quiz, \
output N if the response does not look like a quiz.
"""

In [13]:
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate


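def assistant_chain(
    system_message,
    user_message_template,
    llm_response,
    llm=llm,
    output_parser=StrOutputParser()
):
    """Build an evaluation chain that grades `llm_response`."""
    # Fill the response under evaluation into the user message.
    user_message = user_message_template.format(llm_response=llm_response)
    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_message),
        ("human", user_message),
    ])
    # Pipe prompt -> model -> parser; the chain returns the grader's reply as a string.
    return chat_prompt | llm | output_parser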
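
One caveat with this helper: the user message is formatted first and then handed to `ChatPromptTemplate.from_messages`, which parses the string as a template again. If the response under evaluation contains literal curly braces (for example a JSON snippet), they would likely be read as template variables and `invoke({})` would fail. Below is a minimal sketch of one workaround, assuming the default f-string-style template syntax. `escape_braces` is a hypothetical helper, not a LangChain function, and the second `str.format()` call merely stands in for the template's own formatting pass.

```python
def escape_braces(text: str) -> str:
    """Double curly braces so a later format pass treats them as literals."""
    return text.replace("{", "{{").replace("}", "}}")


# Shortened stand-in for the evaluation template used in the notebook.
user_message_template = "[Response]: {llm_response}"
risky_response = 'Question 1:#### What does {"a": 1} evaluate to?'

# First pass: fill in the response, escaped so its braces survive a second pass.
user_message = user_message_template.format(
    llm_response=escape_braces(risky_response)
)

# Second pass: stands in for the prompt template formatting the message again.
final_message = user_message.format()
print(final_message)
```

An alternative design would keep `{llm_response}` as a real template variable and pass the response at `invoke` time, which avoids the double formatting altogether.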

In [14]:
assistant = assistant_chain(
    system_message=eval_system_message,
    user_message_template=eval_user_message_template,
    llm_response=sample_llm_response,
)
assistant.invoke({})

'Y'

#### Negative test case: the LLM response is not in the right format

In [16]:
bad_llm_response = "There are lots of interesting facts. Tell me more about what you'd like to know."

In [17]:
assistant = assistant_chain(
    system_message=eval_system_message,
    user_message_template=eval_user_message_template,
    llm_response=bad_llm_response,
)
assistant.invoke({})

'N'
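
To use these checks in an automated test suite, the grader's raw string has to be turned into a pass/fail signal. Below is a minimal sketch, assuming the grader replies with `Y` or `N`, possibly surrounded by whitespace or quotes. `parse_verdict` is a hypothetical helper, and `fake_grader` is a stand-in for `assistant.invoke({})` so that the example runs without an API call.

```python
def parse_verdict(raw: str) -> bool:
    """Map the grader's raw reply to True (quiz) or False (not a quiz)."""
    verdict = raw.strip().strip("'\"").upper()
    if verdict == "Y":
        return True
    if verdict == "N":
        return False
    # Anything else means the grader did not follow the output instructions.
    raise ValueError(f"Unexpected grader output: {raw!r}")


def fake_grader(llm_response: str) -> str:
    """Stand-in for the evaluation chain; NOT a real model call."""
    return "Y" if llm_response.lstrip().startswith("Question 1:####") else "N"


good = "Question 1:#### What did Marie and Pierre Curie discover in Paris?"
bad = "There are lots of interesting facts."

# Positive and negative cases, mirroring the two cells above.
assert parse_verdict(fake_grader(good)) is True
assert parse_verdict(fake_grader(bad)) is False
```

Raising on unexpected output, rather than treating it as `N`, keeps a misbehaving grader from being silently counted as a failed quiz.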