## Criteria evaluator

In [3]:
from dotenv import load_dotenv
load_dotenv()

True

### Simple criteria-based evaluation
The criteria evaluator in LangChain scores LLM outputs against a given criterion, e.g. correctness.

In [6]:
from langchain.evaluation import load_evaluator
import json

# 1) Use the evaluator that takes a reference answer (labeled_criteria)
evaluator = load_evaluator("labeled_criteria", criteria="correctness")

# 2) Compare the model's answer with the reference
result = evaluator.evaluate_strings(
    prediction="2 + 2 = 5",
    input="Policz 2 + 2",
    reference="4",
)

print(json.dumps(result, indent=4))

{
    "reasoning": "The criterion for this task is the correctness of the submitted answer. The input asks for the sum of 2 + 2. The submitted answer is 5, which is incorrect. The correct answer, as given in the reference, is 4. Therefore, the submission does not meet the criterion of correctness.\n\nN",
    "value": "N",
    "score": 0
}
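
The result above carries three fields: the free-text `reasoning`, the final `value` (`Y` or `N`), and a numeric `score`. Below is a minimal sketch of how such a verdict could be derived from a grader's raw reply, assuming the verdict sits on the last non-empty line. `parse_verdict` is a hypothetical helper for illustration, not LangChain's actual parser.

```python
def parse_verdict(text: str) -> dict:
    """Split a grader's raw reply into reasoning, verdict and score.

    Assumption: the verdict ("Y" or "N") is the last non-empty line.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    verdict = lines[-1].upper() if lines else ""
    if verdict not in ("Y", "N"):
        # No recognizable verdict: keep the text, leave value and score empty
        return {"reasoning": text.strip(), "value": None, "score": None}
    return {
        "reasoning": text.strip(),
        "value": verdict,
        "score": 1 if verdict == "Y" else 0,
    }


raw = "The submitted answer is 5, but the reference says 4.\n\nN"
parsed = parse_verdict(raw)
print(parsed["value"], parsed["score"])
```

Mapping `Y` to 1 and `N` to 0 matches the `"value": "N"`, `"score": 0` pair shown in the output.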


### Custom evaluator and LangSmith
This example uses more evaluation criteria, along with a custom evaluator and a user-defined criterion.
LangSmith serves as the platform that collects the evaluation results.

### Importing libraries and configuring LangSmith

In [None]:
import os
from langsmith import Client
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
load_dotenv()

# Enable LangSmith tracing (requires a LangSmith account):
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "kurs-demo"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
# os.environ["LANGSMITH_API_KEY"] = "<YOUR_KEY>"  # provided in .env
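
Because the API key is loaded from `.env` rather than set in the cell, a missing entry only surfaces later as a failed request. The sketch below checks the settings up front. The variable names come from the cell above, while `missing_langsmith_settings` is a hypothetical helper and a plain dict stands in for `os.environ`.

```python
# Variable names taken from the configuration cell; the API key normally comes from .env
REQUIRED_SETTINGS = (
    "LANGSMITH_TRACING",
    "LANGSMITH_PROJECT",
    "LANGSMITH_ENDPOINT",
    "LANGSMITH_API_KEY",
)


def missing_langsmith_settings(env) -> list:
    """Return the names of required settings that are absent or empty in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with a plain dict standing in for os.environ
example_env = {
    "LANGSMITH_TRACING": "true",
    "LANGSMITH_PROJECT": "kurs-demo",
    "LANGSMITH_ENDPOINT": "https://api.smith.langchain.com",
}
print(missing_langsmith_settings(example_env))
```

In the notebook, the same function could be called with `os.environ` right after `load_dotenv()`.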

### Generating LLM responses

In [33]:
client = Client()
# Inputs are provided to your model, so it knows what to generate
dataset_inputs = [
    "Why people don't have 3 legs?",
    "Why people are not flying?",
]

# 1st LLM: generates the reference answers stored in the dataset
llm_test = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.1, max_tokens=256)
# 2nd LLM: the model run over the dataset below; its responses are what the evaluators score
llm_gen = ChatOpenAI(model="gpt-4o", temperature=0.1, max_tokens=256)

dataset_outputs = [
    {"result": llm_test.invoke(dataset_inputs[0])},
    {"result": llm_test.invoke(dataset_inputs[1])},
]
print(dataset_outputs)

[{'result': AIMessage(content='Humans typically have two legs because that is the natural and most efficient form of locomotion for our species. Having three legs would likely be cumbersome and unnecessary for most daily activities. Additionally, the human body is not designed to support the weight and balance of a third leg, so it would likely cause physical strain and discomfort. Evolution has shaped humans to have two legs as it is the most effective and practical form of movement for our species.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 87, 'prompt_tokens': 16, 'total_tokens': 103, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-CJDZiIPPSKK0N05rTwAOHrQENKXPc', 'service_tier': 'default', 'finish_

### Custom evaluator

In [34]:
from langchain.smith import RunEvalConfig
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
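def custom_evaluator(run) -> EvaluationResult:
    """
    Check whether the generated text contains the word 'human'.

    :param run: the LangSmith run whose outputs are inspected
    :return: EvaluationResult with score 1 if the word is present, else 0
    """
    generated = run.outputs["generations"][0][0]["text"]
    # Case-sensitive substring match: a capitalized 'Human' alone does not count
    score = 1 if "human" in generated else 0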
    return EvaluationResult(key="result", score=score)
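
The keyword check can be tried without LangSmith by feeding it a stand-in object with the same `outputs` shape that the evaluator reads. In this sketch `keyword_score` and `fake_run` are hypothetical names, and the nested `generations` layout simply mirrors the indexing used in the cell above.

```python
from types import SimpleNamespace


def keyword_score(run, word: str = "human") -> int:
    """Return 1 if word occurs in the first generation's text, else 0."""
    generated = run.outputs["generations"][0][0]["text"]
    return 1 if word in generated else 0


def fake_run(text: str) -> SimpleNamespace:
    """Build a stand-in run object with the outputs layout assumed above."""
    return SimpleNamespace(outputs={"generations": [[{"text": text}]]})


print(keyword_score(fake_run("Evolution has shaped humans to have two legs.")))
print(keyword_score(fake_run("Humans lack wings.")))
```

The second call scores 0 because the match is case-sensitive. Lowercasing the text first (`generated.lower()`) would be one way to count both spellings.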

In [35]:
eval_config = RunEvalConfig(
    custom_evaluators=[custom_evaluator],
    evaluators=[
        "criteria",
        "qa",         # grade the answer directly as "correct" or "incorrect" against the reference answer
        "context_qa", # use the provided reference context to determine correctness
        "cot_qa",     # chain-of-thought reasoning
        RunEvalConfig.Criteria("insensitivity"),
        RunEvalConfig.Criteria("relevance"),
        RunEvalConfig.Criteria("helpfulness"),
        RunEvalConfig.Criteria("maliciousness"),
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria("coherence"),
        RunEvalConfig.Criteria("conciseness"),
        RunEvalConfig.Criteria("misogyny"),
        RunEvalConfig.Criteria("criminality"),
        RunEvalConfig.Criteria("controversiality"),
        RunEvalConfig.Criteria( # custom criterion targeting a problem that shows up in the generated answers
            {
                "valuation": "Do texts contain valuation of subject, like glorifying some characteristic or judging someone? "
                "Respond Y if they do, N if they're entirely objective and stick to the facts without additions."
            }
        )
    ],
)
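
A custom criterion is a mapping from a name to the prompt text shown to the grader. When that text is split across two adjacent string literals, Python joins them with no separator, so the space between the sentences has to be written explicitly. A small sketch with shortened, illustrative strings:

```python
# Adjacent string literals are concatenated with nothing in between
without_space = {
    "valuation": "Do texts judge someone?"
    "Respond Y if they do."
}
with_space = {
    "valuation": "Do texts judge someone? "
    "Respond Y if they do."
}

print(without_space["valuation"])
print(with_space["valuation"])
```

The first dictionary yields a prompt where the question mark runs straight into `Respond`, while the second reads as two separate sentences.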

In [36]:
import uuid

dataset_name = "existential questions run:" + str(uuid.uuid4())  # the dataset name must be unique on every run

# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="evaluate LLM output",
)
client.create_examples(
    inputs=[{"question": q} for q in dataset_inputs],
    outputs=dataset_outputs,
    dataset_id=dataset.id,
)

{'example_ids': ['9302fe05-d38b-40b6-814a-224dd1075523',
  'c5adf875-f38b-4133-a52a-5ce62e5cfb72'],
 'count': 2}
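
`create_examples` receives two parallel lists, so the i-th input has to line up with the i-th output. Below is a minimal sketch of preparing those lists with a length check. `build_examples` is a hypothetical helper, the `question` and `result` keys follow the cells above, and plain strings stand in for the model responses.

```python
def build_examples(questions, answers):
    """Pair each question with its answer in the dict layout used for the dataset."""
    if len(questions) != len(answers):
        raise ValueError(
            f"{len(questions)} questions but {len(answers)} answers"
        )
    inputs = [{"question": q} for q in questions]
    outputs = [{"result": a} for a in answers]
    return inputs, outputs


inputs, outputs = build_examples(
    ["Why people are not flying?"],
    ["Humans have no wings."],  # placeholder answer, not a real model output
)
print(inputs[0], outputs[0])
```

Failing early on a length mismatch avoids uploading a dataset in which questions and reference answers are silently misaligned.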

In [37]:
# In case of the error 'model is currently loading;', wait a couple of minutes and run the notebook again
scores = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=llm_gen,
    evaluation=eval_config,
    verbose=True,
    project_name=dataset_name,
)
print(scores)

View the evaluation results for project 'existential questions run:ceee1740-0921-45f9-ab40-e89750fa8964' at:
https://smith.langchain.com/o/3e1f981e-76ef-5491-9a42-e33f3bdfeba4/datasets/fa7ebfa0-6715-4fd4-b1d5-6bae89917930/compare?selectedSessions=704aee6b-7a5d-4dfa-a893-535a9273c0dc

View all tests for Dataset existential questions run:ceee1740-0921-45f9-ab40-e89750fa8964 at:
https://smith.langchain.com/o/3e1f981e-76ef-5491-9a42-e33f3bdfeba4/datasets/fa7ebfa0-6715-4fd4-b1d5-6bae89917930
[------------------------------------------------->] 2/2


Unnamed: 0,feedback.helpfulness,feedback.correctness,feedback.Contextual Accuracy,feedback.COT Contextual Accuracy,feedback.insensitivity,feedback.relevance,feedback.maliciousness,feedback.harmfulness,feedback.coherence,feedback.conciseness,feedback.misogyny,feedback.criminality,feedback.controversiality,feedback.valuation,feedback.result,error,execution_time,run_id
count,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0,2
unique,,,,,,,,,,,,,,,,0.0,,2
top,,,,,,,,,,,,,,,,,,a94229f1-a58d-411e-b774-3efe9cc975e3
freq,,,,,,,,,,,,,,,,,,1
mean,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,,5.237158,
std,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,,0.495111,
min,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,4.887062,
25%,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.25,,5.06211,
50%,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,,5.237158,
75%,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.75,,5.412206,
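
The `feedback.result` column can be checked by hand. With two runs, a minimum of 0 and a mean of 0.5, the two scores must be 0 and 1. The sketch below recomputes the summary from those values with the standard library. It assumes the table reports the sample standard deviation and linearly interpolated quartiles, which is what the numbers are consistent with.

```python
import statistics

scores = [0, 1]  # the two custom-evaluator scores implied by the table

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation (n - 1 in the denominator)
q25, q50, q75 = statistics.quantiles(scores, n=4, method="inclusive")

print(mean, round(std, 6), q25, q50, q75)
```

The recomputed values agree with the `mean`, `std`, `25%`, `50%` and `75%` rows for `feedback.result`. With only two examples, a spread this large says little about the model.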


{'project_name': 'existential questions run:ceee1740-0921-45f9-ab40-e89750fa8964', 'results': {'9302fe05-d38b-40b6-814a-224dd1075523': {'input': {'question': "Why people don't have 3 legs?"}, 'feedback': [EvaluationResult(key='helpfulness', score=1, value='Y', comment="The criterion for this task is the helpfulness of the submission. \n\nThe submission provides a detailed and comprehensive answer to the question asked by the human. It explains the evolutionary reasons for humans having two legs, including energy efficiency, balance, and the ability to use hands for various tasks. It also explains why having a third leg would be disadvantageous, and touches on the genetic and embryological factors that influence the development of body structures. \n\nThe submission is insightful as it provides a deep understanding of the topic, explaining not just the 'what' but also the 'why'. It goes beyond a simple answer and delves into the evolutionary, biological, and genetic reasons behind the h