## String and comparison evaluation

In [1]:
import json
from dotenv import load_dotenv

load_dotenv()

True

### Embedding Distance Evaluator
The embedding distance evaluator converts the prediction and the reference into embedding vectors and measures the distance between them (cosine distance by default). This captures semantic closeness rather than literal word overlap. The score is a distance, so lower means more similar, and identical texts score 0.

In [2]:
from langchain_classic.evaluation import load_evaluator

evaluator = load_evaluator("embedding_distance", embeddings_model="openai")

result1 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is Warsaw"
)

result2 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is called Warsaw"
)

result3 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Burkina Faso is called Ouagadougou"
)

print(round(result1["score"], 4))
print(round(result2["score"], 4))
print(round(result3["score"], 4))

0.0
0.0168
0.1829
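
To make the score easier to read, here is a minimal sketch of cosine distance on toy vectors. It assumes the evaluator's default metric is cosine distance (1 minus cosine similarity); the three-dimensional vectors are hypothetical stand-ins, not real model embeddings, and `cosine_distance` is an illustrative helper rather than a LangChain function.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (real ones have hundreds of dimensions)
same = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # identical direction
close = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.5])  # slightly different
far = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])    # orthogonal

print(round(same, 4), round(close, 4), round(far, 4))
```

The ordering mirrors the cell above: identical inputs give a distance of essentially zero, a small change gives a small distance, and unrelated directions give a larger one.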


### String Comparison
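The `string_distance` evaluator compares two texts character by character using a string distance metric (Jaro-Winkler by default), not an n-gram metric such as BLEU. The score is a normalized distance: 0 means the strings are identical, and larger values mean they differ more. Because it looks only at characters, a reordered sentence with the same meaning still gets a noticeable distance.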

In [3]:
# Character-level string distance; runs locally and makes no LLM or embedding calls
evaluator = load_evaluator("string_distance")

result1 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is Warsaw"
)

result2 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland = Warsaw"
)

result3 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="Warsaw is the capital of Poland"
)

print(round(result1["score"], 4))
print(round(result2["score"], 4))
print(round(result3["score"], 4))

0.0
0.0334
0.289
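
As a rough illustration of where these numbers come from, the sketch below implements Jaro-Winkler distance in plain Python. It assumes the evaluator's default metric is Jaro-Winkler with a prefix weight of 0.1 and a prefix cap of 4 characters; the function names are illustrative and this is not the library's own implementation.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_distance(s1, s2, prefix_weight=0.1):
    """Return 1 - Jaro-Winkler similarity (0.0 for identical strings)."""
    jaro = jaro_similarity(s1, s2)
    # Reward a shared prefix of up to 4 characters
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return 1.0 - (jaro + prefix * prefix_weight * (1.0 - jaro))


distance = jaro_winkler_distance(
    "The capital of Poland is Warsaw",
    "The capital of Poland = Warsaw",
)
print(round(distance, 4))
```

Under these assumptions the sketch should give roughly the same value as the second result in the cell above, which is a useful sanity check that the score is a character-level distance rather than an n-gram overlap.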


### A/B Tests
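The `labeled_pairwise_string` evaluator (a `PairwiseStringEvaluator`) asks an LLM judge to compare two responses to the same input against a single reference and pick the better one. This lets you check automatically which of two variants, for example two prompts or two models, comes closer to the expected answer. The result contains the judge's reasoning, the preferred response in `value`, and a numeric `score`.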

In [4]:
from langchain_classic.evaluation import load_evaluator

evaluator = load_evaluator("labeled_pairwise_string")

result = evaluator.evaluate_string_pairs(
    input="What is the capital of Poland?",
    prediction="Warsaw is the capital of Poland",
    prediction_b="I don't know",
    reference="Warsaw is Poland's capital"
)

print(json.dumps(result, indent=4))

{
    "reasoning": "Assistant A's response is helpful, relevant, correct, and accurate. It directly answers the user's question about the capital of Poland. On the other hand, Assistant B's response is not helpful or accurate. It does not provide the user with the information they were seeking. Therefore, Assistant A's response is superior in this case. \n\nFinal verdict: [[A]]",
    "value": "A",
    "score": 1
}
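
A single comparison says little on its own; an A/B test usually repeats it over many inputs and aggregates the verdicts. The sketch below shows one way to do that. It assumes each result dictionary has the shape shown above and that `score` is 1 when response A wins, 0 when response B wins, and anything else (such as 0.5) for a tie. The `summarize_pairwise` helper and the sample results are hypothetical, not part of LangChain.

```python
def summarize_pairwise(results):
    """Aggregate pairwise verdicts into win rates for A and B."""
    total = len(results)
    if total == 0:
        raise ValueError("results must not be empty")
    wins_a = sum(1 for r in results if r["score"] == 1)
    wins_b = sum(1 for r in results if r["score"] == 0)
    ties = total - wins_a - wins_b  # assumed: any other score is a tie
    return {
        "a_win_rate": wins_a / total,
        "b_win_rate": wins_b / total,
        "tie_rate": ties / total,
    }


# Hypothetical verdicts, as if collected from evaluate_string_pairs calls
sample_results = [
    {"value": "A", "score": 1},
    {"value": "A", "score": 1},
    {"value": "B", "score": 0},
    {"value": None, "score": 0.5},
]

summary = summarize_pairwise(sample_results)
print(summary)
```

Because LLM judges can favor whichever response is shown first, it may also be worth running each pair in both orders and comparing the two summaries.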
