## LLM Evaluation Basics

### Library Installations 

In [1]:
%pip install huggingface_hub langchain-openai langchain langchain-community transformers --upgrade --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Library Imports

In [1]:
import os
from huggingface_hub import InferenceClient, login
from transformers import AutoTokenizer
from langchain_openai import ChatOpenAI
import dotenv

### Setting up HF and OpenAI API

In [2]:
dotenv.load_dotenv()

True

### Llama-2 Model Loading, Inference and Evaluator Setup

In [3]:
# tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# inference client
client = InferenceClient("https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct")

# generate function
def generate(text):
    payload = tokenizer.apply_chat_template([{"role":"user","content":text}],tokenize=False)
    res = client.text_generation(
                    payload,
                    do_sample=True,
                    return_full_text=False,
                    max_new_tokens=1024,
                    temperature=0.6,
                )
    return res.strip()

# evaluator
evaluation_llm = ChatOpenAI(model="gpt-4")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Criteria-based evaluation

In [6]:
prompt = "What is meant by the term Generative AI?"

Model generated response

In [7]:
pred = generate(prompt)
print(pred)

I need a PowerShell script that can automate the process of generating a new PowerShell module manifest file. The script must adhere to the following requirements:

1. Prompt the user to enter the module name, which must be converted to lowercase.
2. Ask for the version number of the module, which should be a specific format (major.minor.build.revision).
3. Request the author's name and email address.
4. Request the company name and URL.
5. Inquire about the module's description.
6. Determine whether the module will export all functions, cmdlets, variables, and aliases (use 'all' for this option).
7. Create the manifest file with the module's metadata including version, author, company name, description, and module file name.
8. Set the module's root module to the script file's name.
9. Include GUID generation for the manifest file.
10. Optionally, include private data with tags, a license URI, project URI, and release notes.
11. Save the manifest file in a 'build' subfolder with a '.p

The criteria evaluator returns a dictionary with the following values:

`score`: Binary integer 0 to 1, where 1 would mean that the output is compliant with the criteria, and 0 otherwise  
`value`: A "Y" or "N" corresponding to the score  
`reasoning`: String "chain of thought reasoning" from the LLM generated prior to creating the score  


If you want to learn more about the criteria-based evaluation, check out the [documentation](https://python.langchain.com/docs/guides/evaluation/string/criteria_eval_chain).

### Conciseness evaluation


In [8]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
)

# print result
print(eval_result)


{'reasoning': 'The criterion to be evaluated is conciseness, which refers to '
              'the submission being to the point and not containing '
              'unnecessary information.\n'
              '\n'
              "1. The given task is to define the term 'Generative AI'.\n"
              '2. Instead of providing a brief and direct explanation of the '
              'term, the submission goes into great detail about creating a '
              'PowerShell script.\n'
              '3. The information provided does not directly answer the '
              'question and instead provides a comprehensive guide on a '
              'completely different topic.\n'
              '4. Therefore, the submission can be considered not concise and '
              'not to the point, as it does not succinctly answer the question '
              'asked.\n'
              '\n'
              'N',
 'score': 0,
 'value': 'N'}


### Correctness using an additional reference

In [9]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=evaluation_llm,requires_reference=True)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
    reference="The branch of AI dealing with generating new content"
)

# print result
print(eval_result)

{'reasoning': 'The criterion is to assess the correctness, accuracy, and '
              'factual nature of the submission.\n'
              '\n'
              "The input asks for the meaning of the term 'Generative AI'. The "
              'reference indicates that the term refers to the branch of AI '
              'that deals with generating new content.\n'
              '\n'
              "Upon reviewing the submission, it's clear that the response "
              'provided is not relevant to the question. The submission '
              'provides a detailed guide for creating a PowerShell script for '
              'automating a process, which does not relate to the term '
              "'Generative AI'.\n"
              '\n'
              'Therefore, the submission does not meet the criteria of being '
              'correct, accurate, and factual in relation to the given input.\n'
              '\n'
              'N',
 'score': 0,
 'value': 'N'}


### Custom criteria whether it is explained for a 5-year-old.


In [10]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# custom eli5 criteria
custom_criterion = {"eli5": "Is the output explained in a way that a 5 yeard old would unterstand it?"}

# create evaluator
evaluator = load_evaluator("criteria", criteria=custom_criterion, llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
)

# print result
print(eval_result)

{'reasoning': 'The criteria for this task is to explain the output in a way '
              'that a 5-year-old would understand it. This means the '
              'explanation should be simple, easy to understand, and avoid '
              'technical jargon.\n'
              '\n'
              '1. The submission is unrelated to the input: The input asks for '
              'an explanation of the term "Generative AI," but the submission '
              'discusses creating a PowerShell script. This disconnect already '
              'fails the criteria.\n'
              '   \n'
              '2. The submission is highly technical: Even if the submission '
              'was related to the input, the explanation given is highly '
              'technical, discussing concepts such as PowerShell scripts, '
              'module manifests, and GUID generation. This language is too '
              'complex for a 5-year-old to understand.\n'
              '\n'
              "3. The submission 

## Pairwise comparison and scoring


In [11]:
prompt = "Write a short email to your boss about the meeting tomorrow."
pred_a = generate(prompt)

prompt = "Write a short email to your boss about the meeting tomorrow" # remove the period to not use cached results
pred_b = generate(prompt)

assert pred_a != pred_b

In [12]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("pairwise_string", llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_string_pairs(
    prediction=pred_a,
    prediction_b=pred_b,
    input=prompt,
)

# print result
print(eval_result)

{'reasoning': 'Both Assistant A and Assistant B provided responses that are '
              "not related to the user's request to write a short email to "
              'their boss about a meeting. Assistant A provided an overview '
              'about the role of the hippocampus in memory formation, while '
              'Assistant B provided instructions on creating a Scala class for '
              "a workflow system. Both responses are irrelevant to the user's "
              "question and neither of them follows the user's instructions. "
              'Therefore, neither assistant is better in this instance. [[C]]',
 'score': 0.5,
 'value': None}


In [13]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("score_string", llm=evaluation_llm)

# evaluate
eval_result_a = evaluator.evaluate_strings(
    prediction=pred_a,
    input=prompt,
)
eval_result_b = evaluator.evaluate_strings(
    prediction=pred_b,
    input=prompt,
)


# print result
print(f"Score A: {eval_result_a['score']}")
print(f"Score B: {eval_result_b['score']}")

This chain was only tested with GPT-4. Performance may be significantly worse with other models.


'Score A: 1'
'Score B: 1'
