<a href="https://colab.research.google.com/github/nov05/Google-Colaboratory/blob/master/generative_ai_with_langchain/08_01_basic_evaluators.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Notebook modified by nov05 on 2025-06-13  

In [None]:
%%capture
!pip install langchain langchain-mistralai langchain-community langchain_openai

*Successfully installed dataclasses-json-0.6.7 httpx-sse-0.4.0 langchain-community-0.3.25 langchain-core-0.3.65 langchain-mistralai-0.2.10 langsmith-0.3.45 marshmallow-3.26.1 mypy-extensions-1.1.0 pydantic-settings-2.9.1 python-dotenv-1.1.0 typing-inspect-0.9.0 langchain_openai-0.3.23*  

**Make sure you load the API keys for cloud providers!**

You can set your environment keys yourself or use a script. Please note that since keys are private, they are not included in the repository.

In [None]:
# # Set the API keys as environment variables
# import sys
# import os
# sys.path.insert(0, os.path.abspath('..'))
# from config import set_environment
# # Loads the keys, as explained earlier in Chapter 2
# set_environment()

# 🟢 **Basic Evaluators**

## 👉 **Exact Match Evaluation**  

In [2]:
from langchain.evaluation import ExactMatchStringEvaluator

prompt = "What is the current Federal Reserve interest rate?"
reference_answer = "0.25%"  # Suppose this is the correct answer.

# Example predictions:
prediction_correct = "0.25%"
prediction_incorrect = "0.50%"

# Initialize an Exact Match evaluator that ignores case differences.
exact_evaluator = ExactMatchStringEvaluator(ignore_case=True)

# Evaluate the correct prediction.
exact_result_correct = exact_evaluator.evaluate_strings(
    prediction=prediction_correct, reference=reference_answer
)
print("Exact match result (correct answer):", exact_result_correct)

# Evaluate an incorrect prediction.
exact_result_incorrect = exact_evaluator.evaluate_strings(
    prediction=prediction_incorrect, reference=reference_answer
)
print("Exact match result (incorrect answer):", exact_result_incorrect)

Exact match result (correct answer): {'score': 1}
Exact match result (incorrect answer): {'score': 0}
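
For intuition, the comparison above can be sketched in plain Python. This is a minimal illustration, not LangChain's actual implementation; the helper name `exact_match_score` is hypothetical and only mirrors the `ignore_case` option used in the cell above.

```python
def exact_match_score(prediction: str, reference: str, ignore_case: bool = False) -> dict:
    """Return {'score': 1} when the two strings are identical, else {'score': 0}."""
    if ignore_case:
        # Normalize case on both sides before comparing.
        prediction, reference = prediction.lower(), reference.lower()
    return {"score": int(prediction == reference)}


print(exact_match_score("0.25%", "0.25%", ignore_case=True))
print(exact_match_score("0.50%", "0.25%", ignore_case=True))
# A full sentence that contains the right figure still fails an exact match.
print(exact_match_score("The rate is 0.25%.", "0.25%", ignore_case=True))
```

The last call shows why exact matching is too strict for free-form answers, which is the gap the LLM-based evaluators in the following sections are meant to fill.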


## 👉 **LLM-as-Judge Evaluation**  

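* Get a Mistral AI API key at https://admin.mistral.ai/organization/api-keys and store it as the Colab secret `MISTRAL_API_KEY`, which the cell below reads with `userdata.get`.  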



In [7]:
%%time
from langchain_mistralai import ChatMistralAI
from langchain.evaluation.scoring import ScoreStringEvalChain
from google.colab import userdata
from pprint import pprint

# Initialize the evaluator LLM
llm = ChatMistralAI(
    model="mistral-large-latest",
    api_key=userdata.get('MISTRAL_API_KEY'),  ## ✅ mandatory
    temperature=0,
)
# Create the ScoreStringEvalChain from the LLM
chain = ScoreStringEvalChain.from_llm(llm=llm)

# Define the finance-related input and prediction.
# finance_reference is defined for comparison only: ScoreStringEvalChain
# grades without a reference, so it is not passed to the evaluator below.
finance_input = "What is the current Federal Reserve interest rate?"
finance_prediction = "The current interest rate is 0.25%."
finance_reference = "The Federal Reserve's current interest rate is 0.25%."

# Evaluate the prediction using the scoring chain
result_finance = chain.evaluate_strings(
    input=finance_input,
    prediction=finance_prediction,
)
print("Finance Evaluation Result:")
pprint(result_finance)



Finance Evaluation Result:
{'reasoning': 'The response provided by the AI assistant is clear and concise, '
              "but it lacks depth and context. The Federal Reserve's interest "
              'rate can change periodically, and without a specific date '
              'reference, the information might not be accurate or up-to-date. '
              'Additionally, the response does not provide any insight into '
              'the implications of the interest rate or how it might affect '
              'the economy or consumers.\n'
              '\n'
              'Rating: [[4]]',
 'score': 4}
CPU times: user 19.7 ms, sys: 3.34 ms, total: 23 ms
Wall time: 2.05 s
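
The judge's reasoning ends with a verdict in the form `Rating: [[4]]`, and the `score` field carries the same number. The sketch below shows one way such a verdict could be pulled out of free text. It assumes the double-bracket convention seen in the output above; the function name `parse_rating` and the regular expression are illustrative, not the library's own parser.

```python
import re


def parse_rating(text: str) -> int:
    """Extract the integer inside the first [[...]] verdict found in the text."""
    match = re.search(r"\[\[(\d+)\]\]", text)
    if match is None:
        raise ValueError("no [[rating]] verdict found in the judge output")
    return int(match.group(1))


judge_output = "The response is clear and concise, but it lacks depth.\n\nRating: [[4]]"
print(parse_rating(judge_output))
```

Failing loudly when no verdict is present is a deliberate choice here: a judge reply without a rating should not be silently treated as a score.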


## 👉 **LLM-as-Judge Evaluation with Reference Answer**

* Importing `ChatOpenAI` from `langchain_community.chat_models` raises a `LangChainDeprecationWarning`: the class was deprecated in LangChain 0.0.10 and will be removed in 1.0. Use the updated class from the `langchain-openai` package instead: run `pip install -U langchain-openai` and import it with `from langchain_openai import ChatOpenAI`.

In [8]:
%%time
# from langchain_community.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI
from langchain.evaluation.scoring import LabeledScoreStringEvalChain

# Initialize the evaluator LLM
llm = ChatOpenAI(
    model="gpt-4",
    api_key=userdata.get('OPENAI_API_KEY'),  ## ✅ mandatory
    temperature=0,
)
# Create the evaluation chain that can use reference answers
labeled_chain = LabeledScoreStringEvalChain.from_llm(llm=llm)

# Define the finance-related input, prediction, and reference answer
finance_input = "What is the current Federal Reserve interest rate?"
finance_prediction = "The current interest rate is 0.25%."
finance_reference = "The Federal Reserve's current interest rate is 0.25%."

# Evaluate the prediction against the reference
labeled_result = labeled_chain.evaluate_strings(
    input=finance_input,
    prediction=finance_prediction,
    reference=finance_reference,
)
print("Finance Evaluation Result (with reference):")
pprint(labeled_result)

Finance Evaluation Result (with reference):
{'reasoning': "The assistant's response is helpful, relevant, and correct. It "
              "directly answers the user's question about the current Federal "
              'Reserve interest rate. The assistant provides the exact rate, '
              'which is factual and accurate. However, the response lacks '
              'depth as it does not provide any additional information or '
              'context about the interest rate or the Federal Reserve. Rating: '
              '[[8]]',
 'score': 8}
CPU times: user 535 ms, sys: 16.3 ms, total: 552 ms
Wall time: 3.44 s
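
The two judge runs above returned 4 (no reference) and 8 (with reference), while the exact-match evaluator reports 0 or 1. To put these side by side, the judge scores can be rescaled to the unit interval. The helper below is a minimal sketch with a hypothetical name, `normalize_score`, and it assumes the judge's scale tops out at 10.

```python
def normalize_score(score: int, scale: int = 10) -> float:
    """Rescale a judge score to the range (0, 1] by dividing by the top of the scale."""
    if not 1 <= score <= scale:
        raise ValueError(f"score {score} is outside the range 1..{scale}")
    return score / scale


# Scores taken from the two evaluation outputs above.
print(normalize_score(4))
print(normalize_score(8))
```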


## 👉 **Criteria Evaluation (Tone and Conciseness)**  


In [9]:
%%time
from langchain.evaluation import load_evaluator
# from langchain.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI

evaluation_llm = ChatOpenAI(
    model="gpt-4",
    api_key=userdata.get('OPENAI_API_KEY'),  ## ✅ mandatory
    temperature=0,
)
prompt_health = "What is a healthy blood pressure range for adults?"
prediction_health = (
    "A normal blood pressure reading is typically around 120/80 mmHg. "
    "It's important to follow your doctor's advice for personal health management!"
)
# Evaluate conciseness
conciseness_evaluator = load_evaluator(
    "criteria",
    criteria="conciseness",
    llm=evaluation_llm
)
conciseness_result = conciseness_evaluator.evaluate_strings(
    prediction=prediction_health,
    input=prompt_health
)
print("Conciseness evaluation result:")
pprint(conciseness_result)

# Evaluate friendliness with a custom criterion
custom_friendliness = {
    "friendliness": "Is the response written in a friendly and approachable tone?"
}
friendliness_evaluator = load_evaluator(
    "criteria",
    criteria=custom_friendliness,
    llm=evaluation_llm
)
friendliness_result = friendliness_evaluator.evaluate_strings(
    prediction=prediction_health,
    input=prompt_health
)
print("Friendliness evaluation result:")
pprint(friendliness_result)

Conciseness evaluation result:
{'reasoning': 'The criterion is conciseness. This means the submission should '
              'be brief and to the point.\n'
              '\n'
              'Looking at the submission, it provides a direct answer to the '
              'question, stating that a normal blood pressure reading is '
              'around 120/80 mmHg. This is a concise response to the '
              'question.\n'
              '\n'
              'The submission also includes an additional sentence advising to '
              "follow a doctor's advice for personal health management. While "
              'this sentence is not directly related to the question, it is '
              'still relevant and does not make the response overly lengthy or '
              'verbose.\n'
              '\n'
              'Therefore, the submission can be considered concise.\n'
              '\n'
              'Y',
 'score': 1,
 'value': 'Y'}
Friendliness evaluation result:
{'reasoning': 'The

## 👉 **JSON Format Validation**  

In [10]:
%%time
from langchain.evaluation import JsonValidityEvaluator

# Initialize the JSON validity evaluator.
json_validator = JsonValidityEvaluator()

valid_json_output = '{"company": "Acme Corp", "revenue": 1000000, "profit": 200000}'
invalid_json_output = '{"company": "Acme Corp", "revenue": 1000000, "profit": 200000,}'

# Evaluate the valid JSON.
valid_result = json_validator.evaluate_strings(
    prediction=valid_json_output
)
print("JSON validity result (valid):", valid_result)

# Evaluate the invalid JSON.
invalid_result = json_validator.evaluate_strings(
    prediction=invalid_json_output
)
print("JSON validity result (invalid):")
pprint(invalid_result)

JSON validity result (valid): {'score': 1}
JSON validity result (invalid):
{'reasoning': 'Expecting property name enclosed in double quotes: line 1 '
              'column 63 (char 62)',
 'score': 0}
CPU times: user 652 µs, sys: 0 ns, total: 652 µs
Wall time: 638 µs
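
The same check can be approximated with the standard library alone. The sketch below wraps `json.loads` and returns a dictionary shaped like the results above; the function name `json_validity` is hypothetical, and this is an illustration of the idea rather than the evaluator's source.

```python
import json


def json_validity(prediction: str) -> dict:
    """Return score 1 if the string parses as JSON, else score 0 with the parser error."""
    try:
        json.loads(prediction)
    except json.JSONDecodeError as err:
        return {"score": 0, "reasoning": str(err)}
    return {"score": 1}


print(json_validity('{"company": "Acme Corp", "revenue": 1000000, "profit": 200000}'))
# The trailing comma before the closing brace makes this string invalid JSON.
print(json_validity('{"company": "Acme Corp", "revenue": 1000000, "profit": 200000,}'))
```

The exact wording of the error message depends on the Python version, so downstream code should rely on the `score` field, not on the `reasoning` text.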
