## String and comparison evaluation

In [28]:
import json
from dotenv import load_dotenv

load_dotenv()

### Embedding Distance Evaluator
The Embedding Distance Evaluator compares two answers by converting them into embedding vectors and measuring the distance between them (cosine distance by default). It therefore assesses how close the texts are in meaning rather than how many words they share. The score is a distance, so lower values mean more similar texts.

In [19]:
from langchain.evaluation import load_evaluator

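# With no extra arguments the evaluator uses OpenAI embeddings and cosine distance;
# pass embeddings=<an Embeddings instance> to use a different model.
evaluator = load_evaluator("embedding_distance")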

result1 = evaluator.evaluate_strings(
    prediction="Stolica Polski to Warszawa",
    reference="Stolica Polski to Warszawa"
)

result2 = evaluator.evaluate_strings(
    prediction="Stolica Polski to Warszawa",
    reference="Stolica Polski nosi nazwę Warszawa"
)

result3 = evaluator.evaluate_strings(
    prediction="Stolica Polski to Warszawa",
    reference="Stolica Burkina Faso nosi nazwę Wagadugu"
)

print(round(result1["score"], 4))
print(round(result2["score"], 4))
print(round(result3["score"], 4))




-0.0
0.0508
0.1318
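To make the scores above easier to read, here is a minimal sketch of cosine distance in plain Python. The three-dimensional vectors are made up for illustration and the function name `cosine_distance` is hypothetical; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, not produced by any model).
same = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
close = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.5])
far = cosine_distance([1.0, 2.0, 3.0], [3.0, -1.0, 0.5])
print(round(same, 4), round(close, 4), round(far, 4))
```

Identical vectors give a distance of (almost exactly) zero, and the distance grows as the vectors point in more different directions. A result such as `-0.0` for identical texts is most likely floating-point rounding noise around zero.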


### String Comparison
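The `string_distance` evaluator compares two texts using a character-level string distance (Jaro-Winkler by default; other distances are selected with the `distance` argument). It does not compute BLEU or any other n-gram overlap metric. The score is a distance: 0 means the strings are identical, and larger values mean they differ more.

In [20]:
evaluator = load_evaluator("string_distance")  # default distance: Jaro-Winkler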

result1 = evaluator.evaluate_strings(
    prediction="Stolica Polski to Warszawa",
    reference="Stolica Polski to Warszawa"
)

result2 = evaluator.evaluate_strings(
    prediction="Stolicą Polski jest Warszawa",
    reference="Stolica Polski to Warszawa"
)

result3 = evaluator.evaluate_strings(
    prediction="Stolica Polski to Warszawa",
    reference="Warszawa jest stolicą Polski"
)

print(round(result1["score"], 4))
print(round(result2["score"], 4))
print(round(result3["score"], 4))




0.0
0.069
0.4991
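The scores above behave like distances: identical strings give 0.0. As a rough illustration of what a normalized character-level distance measures, here is a plain-Python Levenshtein sketch. It is a deliberately simpler stand-in and will not reproduce the numbers printed above; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


print(normalized_distance("Stolica Polski to Warszawa", "Stolica Polski to Warszawa"))
print(round(normalized_distance("kitten", "sitting"), 4))
```

Because such metrics work on characters, two sentences with the same meaning but a different word order can still end up far apart.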


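### String Comparison: BLEU, ROUGE, METEOR

Note that `string_distance` is a string-distance evaluator and, as far as its documented options go, has no BLEU, ROUGE, or METEOR mode; the distance is chosen with the `distance` argument. The `metric` argument used below does not appear to change what is computed, so the three printed scores are most likely default string distances for three different sentence pairs, not BLEU, ROUGE, or METEOR values. Computing those metrics requires a dedicated implementation.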

In [21]:
from langchain.evaluation import load_evaluator

# BLEU evaluator
bleu_eval = load_evaluator("string_distance", metric="bleu")

result_bleu = bleu_eval.evaluate_strings(
    prediction="Warsaw is the capital of Poland",
    reference="The capital of Poland is Warsaw"
)
print("BLEU:", result_bleu)

# ROUGE evaluator
rouge_eval = load_evaluator("string_distance", metric="rouge")

result_rouge = rouge_eval.evaluate_strings(
    prediction="Warsaw is capital",
    reference="Warsaw is the capital of Poland"
)
print("ROUGE:", result_rouge)

# METEOR evaluator
meteor_eval = load_evaluator("string_distance", metric="meteor")

result_meteor = meteor_eval.evaluate_strings(
    prediction="The dog runs quickly",
    reference="The dog is running fast"
)
print("METEOR:", result_meteor)




BLEU: {'score': 0.28903225806451616}
ROUGE: {'score': 0.11385199240986721}
METEOR: {'score': 0.30186335403726705}
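For comparison, an n-gram overlap score in the spirit of BLEU can be sketched in plain Python. This is a simplified illustration (whitespace tokenization, n-grams up to length 2, a single reference, no smoothing), not a reference implementation, and `simple_bleu` is a hypothetical name.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(pred_tokens, ref_tokens, n):
    """Share of predicted n-grams found in the reference, with clipped counts."""
    pred_counts = Counter(ngrams(pred_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return clipped / total


def simple_bleu(prediction, reference, max_n=2):
    """Geometric mean of 1..max_n-gram precisions times a brevity penalty."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    precisions = [modified_precision(pred, ref, n) for n in range(1, max_n + 1)]
    if not pred or min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * geo_mean


identical = simple_bleu("Warsaw is the capital of Poland", "Warsaw is the capital of Poland")
reordered = simple_bleu("Warsaw is the capital of Poland", "The capital of Poland is Warsaw")
unrelated = simple_bleu("The dog runs quickly", "Warsaw is the capital of Poland")
print(identical, round(reordered, 4), unrelated)
```

Unlike a distance, this score is a similarity: identical sentences score 1.0 and higher is better. The reordered sentence keeps every word but loses two of its five bigrams, so it scores lower.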


### A/B tests
A `PairwiseStringEvaluator` compares two text answers against a single reference in order to pick the better one. This makes it possible to judge automatically which of the two answers is closer to the expected result.

In [29]:
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("labeled_pairwise_string")

result = evaluator.evaluate_string_pairs(
    input="What is the capital of Poland?",
    prediction="Warsaw is the capital of Poland",
    prediction_b="I don't know",
    reference="Warsaw is Poland's capital"
)

print(json.dumps(result, indent=4))



{
    "reasoning": "Assistant A's response is helpful, relevant, correct, and accurate. It directly answers the user's question about the capital of Poland. On the other hand, Assistant B's response is not helpful or accurate. It does not provide the user with the information they were seeking. Therefore, Assistant A's response is superior in this case. \n\nFinal verdict: [[A]]",
    "value": "A",
    "score": 1
}
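A single comparison says little on its own, so A/B tests are usually run over many inputs and then aggregated. The sketch below tallies verdicts from result dictionaries shaped like the output above. The sample data is hypothetical, and the tie entry (`"value": None`) is an assumption about how an undecided verdict might look.

```python
from collections import Counter


def summarize_pairwise(results):
    """Count A/B verdicts and compute A's win rate over decided comparisons."""
    verdicts = Counter(r.get("value") for r in results)
    decided = verdicts["A"] + verdicts["B"]
    win_rate_a = verdicts["A"] / decided if decided else None
    return {
        "A": verdicts["A"],
        "B": verdicts["B"],
        "undecided": len(results) - decided,
        "win_rate_a": win_rate_a,
    }


# Hypothetical verdicts, shaped like the evaluator output.
sample_results = [
    {"value": "A", "score": 1},
    {"value": "A", "score": 1},
    {"value": "B", "score": 0},
    {"value": None, "score": 0.5},  # assumed shape for a tie
]
summary = summarize_pairwise(sample_results)
print(summary)
```

Running each pair a second time with the two answers swapped is a common way to check whether the judge favors whichever answer comes first.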


### Evaluating LLM answers with an LLM
In this approach, one LLM grades the answers generated by another model (or by the same model) against given criteria. This automates quality assessment without the need for manual review.

In [36]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o")
template = """
You are base of knowledge about star wars. Respond to question below with only name without any additional text.
{input}
"""
prompt_template = PromptTemplate.from_template(template=template)
chain = prompt_template | llm
prediction = chain.invoke({"input": "What is the capital of star wars Sith Empire?"})
print(prediction)

evaluator = load_evaluator("labeled_score_string", llm=ChatOpenAI(model="gpt-4o"))
eval_result = evaluator.evaluate_strings(
    prediction=prediction.content,  # pass the answer text, not the whole AIMessage
    reference="Coruscant",
    input="What is the capital of star wars Sith Empire?",
)
print(json.dumps(eval_result, indent=4))

eval_result = evaluator.evaluate_strings(
    prediction="Hollywood",
    reference="Coruscant",
    input="What is the capital of star wars Sith Empire?",
)
print(json.dumps(eval_result, indent=4))





content='Dromund Kaas' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 5, 'prompt_tokens': 39, 'total_tokens': 44, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_cbf1785567', 'id': 'chatcmpl-CIqCqOmIF3lAb5hsqegRXxRo2i499', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='run--4fe9c624-56fd-4e93-bbb1-b01562f87569-0' usage_metadata={'input_tokens': 39, 'output_tokens': 5, 'total_tokens': 44, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}




{
    "reasoning": "The assistant's response provides the capital of the Sith Empire in the Star Wars universe as \"Dromund Kaas.\" This answer is correct according to Star Wars lore. Dromund Kaas is widely recognized as the capital of the Sith Empire, particularly during the timeline related to Star Wars: The Old Republic, which is an Expanded Universe (now Legends) storyline. \n\nThe response is helpful and relevant to the question as it pertains directly to the query about the capital of the Sith Empire in Star Wars. It is accurate, demonstrating correctness in the context of the Star Wars Expanded Universe. However, the response lacks depth since it does not provide any additional context or insight about Dromund Kaas or its significance in the Star Wars universe.\n\nGiven these considerations, the response scores well on correctness and relevance but could be improved with more detailed information. \n\nRating: [[8]]",
    "score": 8
}




{
    "reasoning": "The AI's response is not accurate regarding the question asked. The capital of the Sith Empire in Star Wars lore is not Hollywood, which is a location in the real world unrelated to the Star Wars universe. According to the Star Wars Expanded Universe (Legends) and various comics and video games, the capital of the Sith Empire is Coruscant during certain periods, but more specifically, it is often discussed as being Korriban or Dromund Kaas. \n\n- Helpfulness: The response is not helpful or appropriate to the question.\n- Relevance: The response is not referring to any real quote from the Star Wars narrative or expanded universe.\n- Correctness: The response is factually incorrect.\n- Depth: The response shows no depth of thought or understanding of the Star Wars universe.\n\nOverall, the answer fails on all criteria because it provides incorrect and irrelevant information.\n\nRating: [[1]]",
    "score": 1
}
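In the outputs above the judge ends its reasoning with a verdict such as `Rating: [[8]]`, and the evaluator returns the same number as `score`. When working with a custom judge prompt, or when checking that the text and the number agree, the rating can be pulled out of the text with a small parser. This is a minimal sketch: the `extract_rating` helper is hypothetical, and the 1 to 10 range is an assumption based on the ratings shown here.

```python
import re

# Matches the verdict format seen in the judge's reasoning, e.g. "Rating: [[8]]".
RATING_PATTERN = re.compile(r"Rating:\s*\[\[(\d+)\]\]")


def extract_rating(reasoning: str, low: int = 1, high: int = 10):
    """Return the integer rating found in the text, or None if missing or out of range."""
    match = RATING_PATTERN.search(reasoning)
    if match is None:
        return None
    rating = int(match.group(1))
    return rating if low <= rating <= high else None


print(extract_rating("The answer is correct but brief.\n\nRating: [[8]]"))
print(extract_rating("The answer is wrong."))
```

Returning `None` instead of raising keeps a batch evaluation running when the judge occasionally ignores the output format; such cases can be counted and inspected separately.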


### Answer evaluation: custom grading and ContextQAEvalChain
In this example we first generate answers to questions with context and then assess their quality in two ways. The first is a custom grading chain that assigns a grade from 0 to 5 relative to a reference answer. The second is the built-in `ContextQAEvalChain` evaluator, which checks whether the prediction is consistent with the supplied context.
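The grading prompt in this example asks the model to return only a number, but the reply still arrives as a string and nothing guarantees the model obeys. A small validation step makes the grades safe to aggregate. This is a sketch under that assumption; `parse_grade` and the sample replies are hypothetical.

```python
def parse_grade(raw: str, low: int = 0, high: int = 5):
    """Convert a model reply such as ' 5 ' into an int, or None if it is unusable."""
    text = raw.strip()
    if not text.isdecimal():
        return None
    grade = int(text)
    return grade if low <= grade <= high else None


replies = ["5", " 4\n", "five", "7"]  # hypothetical model replies
grades = [parse_grade(reply) for reply in replies]
valid = [g for g in grades if g is not None]
average = sum(valid) / len(valid) if valid else None
print(grades, average)
```

Replies that are not a plain integer in the 0 to 5 range are dropped from the average rather than silently coerced.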

In [41]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.evaluation.qa import ContextQAEvalChain

# 1) Model
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# 2) Q&A chain with context
qa_prompt = PromptTemplate.from_template(
    "Answer the question based on the context.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)
qa_chain = qa_prompt | llm | StrOutputParser()

# 3) Grading chain, scale 0..5
grading_template = """You are an expert in grading answers.
You are grading the following question:
{query}

Here is the correct expected answer:
{answer}

You are grading the following predicted answer:
{result}

What grade do you give from 0 to 5, where 0 is the lowest for low similarity and 5 is for the high similarity?
Return only the number."""
grading_prompt = PromptTemplate(
    input_variables=["query", "answer", "result"],
    template=grading_template
)
grade_chain = grading_prompt | llm | StrOutputParser()

# 4) Input data (each example also carries a reference 'answer')
examples = [
    {
        "question": "Why can't people breathe underwater?",
        "context": "Because humans don't have gills.",
        "answer": "Because humans don't have gills."
    },
    {
        "question": "Why is the sky blue?",
        "context": "It is an optical effect due to Rayleigh scattering of sunlight in the atmosphere.",
        "answer": "Because of Rayleigh scattering of sunlight in the atmosphere."
    },
    {
        "question": "What is in my pocket?",
        "context": "",
        "answer": "Unknown; insufficient information."
    },
]

# 5) Generate the answers
predictions = qa_chain.batch(
    [{"context": ex["context"], "question": ex["question"]} for ex in examples]
)

# 6) Grade with the custom grading chain
grades = grade_chain.batch(
    [{"query": ex["question"], "answer": ex["answer"], "result": pred}
     for ex, pred in zip(examples, predictions)]
)

# 7) Grade with the built-in ContextQAEvalChain evaluator
eval_chain = ContextQAEvalChain.from_llm(llm)

predictions_dicts = [{"text": pred} for pred in predictions]

graded_outputs = eval_chain.evaluate(
    examples,
    predictions_dicts,
    question_key="question",
    prediction_key="text"
)

# 8) Preview the results
print("\n--- Custom grading chain results ---")
for ex, pred, grade in zip(examples, predictions, grades):
    print({
        "question": ex["question"],
        "prediction": pred.strip(),
        "reference": ex["answer"],
        "grade_0_5": grade.strip()
    })

print("\n--- ContextQAEvalChain results ---")
print(graded_outputs)


--- Custom grading chain results ---
{'question': "Why can't people breathe underwater?", 'prediction': "People can't breathe underwater because they don't have gills, which are necessary for extracting oxygen from water.", 'reference': "Because humans don't have gills.", 'grade_0_5': '5'}
{'question': 'Why is the sky blue?', 'prediction': "The sky is blue because of Rayleigh scattering, which occurs when sunlight interacts with the molecules and small particles in the Earth's atmosphere. Shorter wavelengths of light, such as blue, are scattered more than longer wavelengths, causing the sky to appear predominantly blue to our eyes.", 'reference': 'Because of Rayleigh scattering of sunlight in the atmosphere.', 'grade_0_5': '5'}
{'question': 'What is in my pocket?', 'prediction': "I'm sorry, but I can't determine what is in your pocket.", 'reference': 'Unknown; insufficient information.', 'grade_0_5': '5'}

--- ContextQAEvalChain results ---
[{'text': 'CORRECT'}, {'text': 'CORRECT'}, {