In [None]:
# Last amended: 15th June, 2024
# Objective: Evaluating LLM outputs using another LLM

## LLM is the judge

1. [load_evaluator()](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.loading.load_evaluator.html)    
2. [langchain.evaluation.schema.EvaluatorType](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.schema.EvaluatorType.html)
> Evaluator types include the question answering evaluator, the chain-of-thought question answering evaluator, the question answering evaluator that incorporates 'context', etc.<br>
>  [This blog](https://vteam.ai/blog/posts/evaluating-ll-ms-using-langchain?source=post_page-----9de043286546--------------------------------) implements the above evaluator types

3. [Pairwise string comparison](https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/comparison/pairwise_string/) <br>
4. [How to use langchain load_evaluator() with local llm? StackOverflow](https://stackoverflow.com/q/78231114/3282777)


In [3]:
# 0.0 Call libraries:

import langchain_community
from langchain.evaluation.criteria import CriteriaEvalChain
from langchain.evaluation.criteria import LabeledCriteriaEvalChain

In [4]:
# 0.1 What llms do we have?
! ollama list

NAME            	ID          	SIZE  	MODIFIED   
llama2:chat     	78e26419b446	3.8 GB	6 days ago	
tinyllama:latest	2644915ede35	637 MB	9 days ago	
llama3:8b       	365c0bd3c000	4.7 GB	9 days ago	
mistral:latest  	2ae6f6dd7a3d	4.1 GB	9 days ago	
phi3:latest     	64c1188f2485	2.4 GB	9 days ago	


1. Langchain Ollama API is [here](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.ollama.Ollama.html)<br>
2. ChatOllama() API is [here](https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.ollama.ChatOllama.html)

In [5]:
# 1.0
from langchain_community.llms import Ollama
from langchain_community.chat_models.ollama import ChatOllama

# 1.0.1

llm= ChatOllama(model = "llama3:8b",    # One of the local models listed by `ollama list` above
             temperature=0.9,           # Default is None (ie 0.8)
             num_predict=256            # Maximum number of tokens to predict when generating text
                                        #  (Default: 128, -1 = infinite generation, -2 = fill context)
           )


# llm = Ollama()
llm

ChatOllama(model='llama3:8b', num_predict=256, temperature=0.9)

### Custom criteria

In [6]:
# 1.1 Evaluate a suggestion based on a custom criterion:

criteria = {"my-custom-criterion": "Is the submission the most amazing ever?"}
evaluator = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
out = evaluator.evaluate_strings(prediction="Imagine an ice cream flavor for the color aquamarine", input="Tell me an idea")


In [9]:
# 1.1.1
print(out['reasoning'])

Let's assess the submission against the custom criterion.

Step 1: Understand the custom criterion
The custom criterion is "Is the submission the most amazing ever?"

Step 2: Analyze the submission
The submission is to imagine an ice cream flavor for the color aquamarine.

Step 3: Evaluate the submission against the custom criterion
To determine if this submission meets the criterion, we need to consider what makes something "the most amazing ever". One possible interpretation is that the submission needs to be truly innovative and unique, taking into account the given prompt (color aquamarine).

Step 4: Make a judgment call
Based on our evaluation, it's difficult to say whether this submission is indeed the most amazing ever. While imagining an ice cream flavor for the color aquamarine might be an interesting idea, it doesn't necessarily stand out as exceptionally remarkable or groundbreaking.

Y
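
Note that the reasoning above is lukewarm, yet the final line is `Y`. Small local models do not always keep the verdict consistent with the reasoning, so it is worth reading the verdict separately from the text. Below is a minimal sketch, using only the standard library, of splitting a judge's output into reasoning, a Y/N value and a 1/0 score. The helper `parse_verdict` is hypothetical and written for illustration; it is not LangChain's own parser, which may behave differently.

```python
import re
from typing import Optional


def parse_verdict(text: str) -> dict:
    """Split judge output into reasoning text, a Y/N value, and a 1/0 score.

    Hypothetical helper for illustration; not LangChain's parser.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    value: Optional[str] = None
    if lines:
        # Accept a bare "Y"/"N" or a "Final Answer: Y" style last line.
        match = re.fullmatch(
            r"(?:final answer:\s*)?([YN])", lines[-1], flags=re.IGNORECASE
        )
        if match:
            value = match.group(1).upper()
            lines = lines[:-1]  # the verdict line is not part of the reasoning
    score = None if value is None else int(value == "Y")
    return {"reasoning": "\n".join(lines), "value": value, "score": score}


result = parse_verdict("It does not stand out as remarkable.\n\nY")
print(result["value"], result["score"])  # Y 1
```

Keeping `value` as `None` when no verdict line is found makes truncated outputs (for example, when `num_predict` cuts the answer short) easy to spot.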


### Correctness

In [10]:
# 1.2.1 Try a different model:

llm= Ollama(model = "mistral",    # Another local model listed by `ollama list` above
             temperature=0.9,    # Default is None (ie 0.8)
             num_predict=256      # Maximum number of tokens to predict when generating text
                                 #  (Default: 128, -1 = infinite generation, -2 = fill context)
           )


# llm = Ollama()
llm

Ollama(model='mistral', num_predict=256, temperature=0.9)

In [11]:
%%time

# 1.3
criteria = "correctness"
evaluator = LabeledCriteriaEvalChain.from_llm(
                                              llm=llm,
                                              criteria=criteria,
                                             )

# 1.3.1
out = evaluator.evaluate_strings(
                                  prediction="The answer is 4",
                                  input="How many apples are there?",
                                  reference="There are 3 apples",
                                 )

CPU times: user 27.4 ms, sys: 4.06 ms, total: 31.5 ms
Wall time: 17.5 s


In [13]:
# 1.3.2
print(out['reasoning'])

To determine if the submission meets the given criteria, I will analyze it step by step:

1. Correctness: Is the submission correct, accurate, and factual?
   - The submitted answer is "4", but we are told that there are actually 3 apples in the reference. Therefore, the submission is not correct as it does not match the actual number of apples given in the reference.

Final Answer: N

Repeat final answer:


### pairwise_string
The pairwise string evaluator predicts which of two model outputs is preferred.

In [15]:
# 2.0
from langchain.evaluation import load_evaluator
from langchain_community.chat_models import ChatOllama
from langchain_community.llms import Ollama
from langchain_community.embeddings import HuggingFaceEmbeddings

In [16]:
# 2.1
#evaluator = load_evaluator("labeled_score_string", llm=ChatOllama(model="llama2"))
evaluator = load_evaluator("pairwise_string",  llm=Ollama(model="llama3:8b"))

This chain was only tested with GPT-4. Performance may be significantly worse with other models.


In [17]:
%%time

# https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/comparison/pairwise_string/
# https://stackoverflow.com/q/78231114/3282777
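# 2.2 Note: "pairwise_string" ignores `reference` (see the warning printed below);
#     "labeled_pairwise_string" is the variant that compares against a reference.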
out = evaluator.evaluate_string_pairs(
                                        prediction="there are three dogs",
                                        prediction_b="4",
                                        input="how many dogs are in the park?",
                                        reference="four",
                                      )

To use a reference, use the LabeledPairwiseStringEvalChain (EvaluatorType.LABELED_PAIRWISE_STRING) instead.


In [18]:
# 2.3
print(out['reasoning'])

Evaluation:

Upon reviewing both responses, I notice that Assistant A provides a qualitative answer by stating "there are three dogs", whereas Assistant B gives a numerical response "4". At first glance, it seems like Assistant B's response is more straightforward and accurate. However, upon closer inspection, we realize that the user question does not provide any context or information about the park, its size, or the time of day, which makes it challenging to determine the exact number of dogs.

Assistant A's response, although brief, acknowledges this uncertainty by providing a general estimate ("there are three dogs"), whereas Assistant B's response assumes that the user is looking for an exact count. This subtle distinction highlights the importance of considering the context and nuances of the question.

In conclusion, while both responses have their merits, I believe Assistant A's answer demonstrates a better understanding of the question and provides more helpful guidance to th

### Types of load_evaluator criteria

In [19]:
# 3.0
from langchain.evaluation import Criteria
list(Criteria)

[<Criteria.CONCISENESS: 'conciseness'>,
 <Criteria.RELEVANCE: 'relevance'>,
 <Criteria.CORRECTNESS: 'correctness'>,
 <Criteria.COHERENCE: 'coherence'>,
 <Criteria.HARMFULNESS: 'harmfulness'>,
 <Criteria.MALICIOUSNESS: 'maliciousness'>,
 <Criteria.HELPFULNESS: 'helpfulness'>,
 <Criteria.CONTROVERSIALITY: 'controversiality'>,
 <Criteria.MISOGYNY: 'misogyny'>,
 <Criteria.CRIMINALITY: 'criminality'>,
 <Criteria.INSENSITIVITY: 'insensitivity'>,
 <Criteria.DEPTH: 'depth'>,
 <Criteria.CREATIVITY: 'creativity'>,
 <Criteria.DETAIL: 'detail'>]

### Conciseness
The built-in criteria are: CONCISENESS, RELEVANCE, CORRECTNESS, COHERENCE, HARMFULNESS, MALICIOUSNESS, HELPFULNESS, CONTROVERSIALITY, MISOGYNY, CRIMINALITY, INSENSITIVITY, DEPTH, CREATIVITY, DETAIL (see [here](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.Criteria.html)). This section uses `conciseness`.

In [6]:
# 4.0
from langchain_community.llms import Ollama
from langchain_community.chat_models.ollama import ChatOllama
from langchain.evaluation import load_evaluator


In [20]:
%%time

# 4.1
evaluator = load_evaluator("criteria", criteria="conciseness", llm=Ollama(model="llama3:8b"))
eval_result = evaluator.evaluate_strings(
                                            prediction="""Joe Biden is an American politician 
                                            who is the 46th and current president of the United States. 
                                            Born in Scranton, Pennsylvania on November 20, 1942, 
                                            Biden moved with his family to Delaware in 1953. 
                                            He graduated from the University of Delaware 
                                            before earning his law degree from Syracuse University. 
                                            He was elected to the New Castle County Council in 1970 
                                            and to the U.S. Senate in 1972.""",
                                            input="Who is the president of United States?",
                                        )

# 4.1.1
print(eval_result['reasoning'])

Let's assess the submission against the conciseness criterion.

Step 1: Read the submission.
The submission provides a biographical sketch of Joe Biden, including his birthdate, childhood moves, education, and political career.

Step 2: Evaluate the submission's length and relevance to the original question.
The submission is quite lengthy, providing detailed information about Joe Biden's life. While it does answer the question "Who is the president of United States?" by identifying Joe Biden as the current president, the majority of the submission focuses on his biography rather than his presidency.

Step 3: Determine whether the submission meets the conciseness criterion.
Based on the above evaluation, I conclude that the submission does not meet the conciseness criterion. The answer should provide a concise and to-the-point response, but this submission goes beyond what is required.

Y
CPU times: user 29.9 ms, sys: 0 ns, total: 29.9 ms
Wall time: 12.4 s


### Correctness

In [21]:
%%time

# 5.1
evaluator = load_evaluator("labeled_criteria",
                           criteria="correctness" ,
                           llm=Ollama(model="llama3:8b")
                          )

# 5.2
eval_result = evaluator.evaluate_strings(
                                            input="Is there any river on the moon?",
                                            prediction="There is no evidence of river on the Moon",
                                            reference="""In a hypothetical future, lunar scientists discovered 
                                            an astonishing phenomenon—a subterranean river 
                                            beneath the Moon's surface""",
                                        )

CPU times: user 23.6 ms, sys: 5.23 ms, total: 28.8 ms
Wall time: 4.62 s


In [23]:
# 5.3
print(eval_result['reasoning'])

Let's go through the reasoning step by step for each criterion:

1. Correctness:
	* We need to check if the submission is accurate and factual.
	* The submission states that there is no evidence of rivers on the Moon, which seems true based on our current understanding of the Moon's geology.
	* However, we have a reference that suggests otherwise - lunar scientists discovered a subterranean river beneath the Moon's surface in a hypothetical future. This implies that our current understanding might be incomplete or incorrect.
	* Given this information, I would say that the submission is not entirely accurate and factual.

Reasoning: N

2. Correctness:
	* (This criterion seems redundant since it's already covered under "correctness" in the first step.)

Y


### Custom Criteria

In [26]:
%%time

# 6.0
from langchain.evaluation import EvaluatorType

# 6.1
custom_criteria = {
                    "numeric": "Does the output contain numeric information?",
                    "mathematical": "Does the output contain mathematical information?"
                    }
# 6.2
prompt = "Tell me a joke"

# 6.3
output = """
Why did the mathematician break up with his girlfriend?

Because she had too many "irrational" issues!
"""

# 6.4
llm= ChatOllama(
                model = "llama3:8b",   # One of the local models listed by `ollama list` above
                temperature=0.9,       # Default is None (ie 0.8)
                num_predict= 1000      # Maximum number of tokens to predict when generating text
                                       #  (Default: 128, -1 = infinite generation, -2 = fill context)
               )


# 6.5
eval_chain = load_evaluator(
    EvaluatorType.CRITERIA,
    criteria=custom_criteria,
    llm=llm
)

# 6.6
eval_result = eval_chain.evaluate_strings(prediction = output, input = prompt)
print("===================== Multi-criteria evaluation =====================")
print(eval_result['reasoning'])

Let's go through each criterion step by step:

**Numeric: Does the output contain numeric information?**

To determine this, I'll look at the submission and see if it contains any numbers or mathematical operations involving numbers. Upon reviewing the submission, I don't see any numeric information, such as digits, percentages, or mathematical expressions involving numbers. Therefore, I can conclude that the output does not contain numeric information.

**Mathematical: Does the output contain mathematical information?**

Again, let's examine the submission to determine if it contains any mathematical concepts or terminology. Upon reviewing the submission, I do see some mathematical terminology - "irrational" is a term used in mathematics to describe certain numbers that cannot be expressed as a finite decimal or fraction. Therefore, I can conclude that the output does contain mathematical information.

Based on my analysis of each criterion, here are the individual answers:

Y
N
CPU t

### ExactMatchStringEvaluator

In [27]:
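# 7.0 No LLM is needed here
from langchain.evaluation import ExactMatchStringEvaluator

# 7.1 Default evaluator: the comparison is case-sensitive
exact_match_evaluator = ExactMatchStringEvaluator()

# 7.2 Re-create it to ignore case (this replaces the evaluator from 7.1)
exact_match_evaluator = ExactMatchStringEvaluator(ignore_case=True)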

# 7.3
exact_match_evaluator.evaluate_strings(
                                        prediction="Data Science",
                                        reference="My Data science",
                                        )

{'score': 0}
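
The score is 0 even with `ignore_case=True` because the reference carries the extra word "My": ignoring case does not make the strings equal. A minimal standard-library sketch of that behaviour follows; `exact_match` is a hypothetical helper for illustration, not the library's implementation.

```python
def exact_match(prediction: str, reference: str, ignore_case: bool = False) -> int:
    """Return 1 if the two strings are identical, else 0.

    Hypothetical helper mirroring the idea of an exact-match evaluator.
    """
    if ignore_case:
        prediction, reference = prediction.lower(), reference.lower()
    return int(prediction == reference)


print(exact_match("Data Science", "My Data science", ignore_case=True))  # 0
print(exact_match("Data Science", "data science", ignore_case=True))     # 1
print(exact_match("Data Science", "data science"))                       # 0
```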

### labeled_score_string

In [28]:
%%time

#from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

# 8.0
accuracy_criteria = {
    "accuracy": """
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor errors or omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
}

# 8.1
evaluator = load_evaluator(
                            "labeled_score_string",
                            criteria=accuracy_criteria,
                            llm= Ollama(model="llama3:8b"),
                         )

CPU times: user 854 µs, sys: 189 µs, total: 1.04 ms
Wall time: 1.05 ms


In [29]:
# 8.2

eval_result = evaluator.evaluate_strings(
                                        prediction="You can find them in the dresser's third drawer.",
                                        reference="The socks are in the third drawer in the dresser",
                                        input="Where are my socks?",
                                        )

# 8.3
print(eval_result['reasoning'])

Evaluation:

The assistant's response attempts to provide an answer to the user's question about the location of their socks. The response is concise and straightforward, stating that the socks are located in the dresser's third drawer.

As for the criteria, I would rate this response as follows: "[[7]]".

Explanation:
While the assistant correctly identifies the correct drawer (third), there is no explicit mention of the dresser being the location where the socks can be found. This minor omission or ambiguity could potentially lead to some users interpreting the answer differently.

However, overall, the assistant's response aligns with the reference and provides a moderate level of relevance, making it a decent attempt at answering the question.
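
The judge embeds its rating in double brackets (`[[7]]` above). A minimal sketch of pulling that number out of free text and checking it against the 1 to 10 scale defined in `accuracy_criteria`. The helper `parse_score` and its `low`/`high` parameters are hypothetical and only illustrate the idea; the library's own output parser may differ.

```python
import re
from typing import Optional


def parse_score(text: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Extract the first [[N]] rating from judge output.

    Returns None when no rating is present or it falls outside [low, high].
    Hypothetical helper for illustration only.
    """
    match = re.search(r"\[\[(\d+)\]\]", text)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


print(parse_score('I would rate this response as follows: "[[7]]".'))  # 7
```

Returning `None` for a missing or out-of-range rating avoids silently treating a malformed answer from a small model as a valid score.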


### String distance

In [31]:
# ! pip install rapidfuzz --quiet

In [32]:
# 8.4 Does not need llm
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("string_distance")
evaluator.evaluate_strings(
                            prediction="Senior Data Scientist",
                            reference="Data Scientist",
                            )

{'score': 0.23015873015873023}
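
The score is a distance: 0 means the strings are identical, and larger values mean they differ more. To make that concrete without extra packages, the sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in metric. This is not the metric behind the 0.23 above (that one comes from `rapidfuzz`), so the number differs; only the interpretation is the same. `ratio_distance` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def ratio_distance(prediction: str, reference: str) -> float:
    """Distance in [0, 1] based on difflib's similarity ratio.

    Stand-in metric for illustration; not the evaluator's own metric.
    """
    return 1.0 - SequenceMatcher(None, prediction, reference).ratio()


# The shared block "Data Scientist" covers 28 of the 35 characters in total,
# so the similarity ratio is 0.8 and the distance is about 0.2.
print(round(ratio_distance("Senior Data Scientist", "Data Scientist"), 3))  # 0.2
print(ratio_distance("Data Scientist", "Data Scientist"))                   # 0.0
```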

### labeled_pairwise_string

In [33]:
%%time

# 9.0

from langchain.evaluation import load_evaluator

# 9.1
evaluator = load_evaluator("labeled_pairwise_string", llm = Ollama(model="llama3:8b"))

# 9.2
result = evaluator.evaluate_string_pairs(
                                            prediction="there are 5 days",
                                            prediction_b="7",
                                            input="how many days in a week?",
                                            reference="Seven",
                                         )

# 9.3
print(result['reasoning'])

As an impartial judge, I will evaluate the responses provided by the two AI assistants.

Upon reviewing the responses, I notice that Assistant A provides a brief answer stating "there are 5 days," which is incorrect and irrelevant to the user's question. The response does not demonstrate depth of thought or provide any insightful information.

On the other hand, Assistant B simply answers the question with the correct number of days in a week: "7." This response is short, direct, and accurate, making it helpful and relevant to the user's inquiry.

Considering the criteria provided, I find that Assistant B's response excels in terms of correctness, relevance, and helpfulness. The answer is straightforward and accurately answers the user's question about the number of days in a week.

In conclusion, my evaluation suggests that Assistant B's response is better than Assistant A's response.

[[B]]
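
The verdict arrives as a bracketed letter at the end of the reasoning (`[[B]]` above). A minimal sketch of turning it into a preferred answer and a numeric score follows. The helper `parse_pairwise` is hypothetical, and the mapping used here (A = 1, B = 0, and `[[C]]` as a tie = 0.5) is an assumption chosen for illustration rather than a statement of how the library scores.

```python
import re
from typing import Optional

# Assumed mapping for illustration: score is expressed from A's point of view.
_VERDICT_SCORES = {"A": 1.0, "B": 0.0, "C": 0.5}


def parse_pairwise(text: str) -> Optional[dict]:
    """Extract a [[A]], [[B]] or [[C]] verdict from judge output.

    Returns None if no verdict is found. Hypothetical helper.
    """
    match = re.search(r"\[\[([ABC])\]\]", text)
    if match is None:
        return None
    verdict = match.group(1)
    return {"value": verdict, "score": _VERDICT_SCORES[verdict]}


print(parse_pairwise("Assistant B's response is better.\n\n[[B]]"))
# {'value': 'B', 'score': 0.0}
```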


In [None]:
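# 10.0 Needs an OpenAI API key: this evaluator compares embeddings and,
#      when none are supplied, falls back to OpenAI embeddings. The `llm` argument does not change that.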
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("pairwise_embedding_distance",
                           llm = Ollama(model="llama3:8b") )

evaluator.evaluate_string_pairs(
                           prediction="Rajasthan is hot in June", prediction_b="Rajasthan is warm in June."
                          )

In [None]:
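# 10.1 Needs an OpenAI API key, for the same reason as 10.0
evaluator = load_evaluator("embedding_distance",
                           llm = Ollama(model="llama3:8b")   # Does not work: an LLM is not an embedding model.
                                                             # A local model would go in the `embeddings=` argument instead.
                          )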

evaluator.evaluate_strings(prediction="Total Profit is 04.25 Cr", 
                           reference="Total return is 4.25 Cr"
                          )
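
Embedding-distance evaluators compare the vectors produced by an embedding model rather than the characters of the strings, which is why they need an embedding model and not an LLM. A minimal sketch of the distance step is given below, assuming cosine distance. The vectors are made-up toy values, not real embeddings, and `cosine_distance` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors standing in for the embeddings of two sentences (assumed values).
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing the same way give a distance near 0, and unrelated (orthogonal) vectors give a distance of 1, so smaller means more similar.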

In [None]:
############ I am done ###############