# Evaluating Custom Criteria

Suppose you want to test a model's output against a custom rubric or set of criteria. How would you go about it?

The `CriteriaEvalChain` is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can
describe those criteria in plain language. In this example, you will use the `CriteriaEvalChain` to check whether an output is concise.

### Step 1: Create the Eval Chain

First, create the evaluation chain to predict whether outputs are "concise".

In [1]:
from langchain.chat_models import ChatOpenAI
from langchain.evaluation.criteria import CriteriaEvalChain

llm = ChatOpenAI(temperature=0)
criterion = "conciseness"
eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criterion)

### Step 2: Make Prediction

Generate an output to evaluate.

In [2]:
llm = ChatOpenAI(temperature=0)
query = "What's the origin of the term synecdoche?"
prediction = llm.predict(query)

### Step 3: Evaluate Prediction

Determine whether the prediction conforms to the criteria.

In [3]:
eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)
print(eval_result)

{'reasoning': '1. Conciseness: The submission is concise and to the point. It directly answers the question without any unnecessary information. Therefore, the submission meets the criterion of conciseness.\n\nY', 'value': 'Y', 'score': 1}


In [4]:
# For a list of the other default supported criteria, call `get_supported_default_criteria`
CriteriaEvalChain.get_supported_default_criteria()

['conciseness',
 'relevance',
 'coherence',
 'harmfulness',
 'maliciousness',
 'helpfulness',
 'controversiality',
 'mysogyny',
 'criminality',
 'insensitive']

## Multiple Criteria

To check whether an output complies with every criterion in a list of default criteria, pass in a list. Only include criteria that are relevant to the provided information, and avoid mixing criteria that measure opposing things (e.g., harmfulness and helpfulness).
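The advice about opposing criteria can be checked before a chain is built. The following is a minimal sketch, not part of the library: `OPPOSING_PAIRS` and `find_opposing` are hypothetical names, and only the harmfulness/helpfulness pair comes from the text above.

```python
from itertools import combinations

# Hypothetical table of criteria that pull in opposite directions.
# Only this first pair comes from the text above; extend it for your own rubric.
OPPOSING_PAIRS = {
    frozenset({"harmfulness", "helpfulness"}),
}


def find_opposing(criteria):
    """Return every pair of names in `criteria` that is listed as opposing."""
    return [
        (a, b)
        for a, b in combinations(criteria, 2)
        if frozenset({a, b}) in OPPOSING_PAIRS
    ]


print(find_opposing(["conciseness", "coherence"]))
print(find_opposing(["harmfulness", "conciseness", "helpfulness"]))
```

An empty list means no listed conflict was found; it does not prove that the criteria are compatible.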

In [9]:
criteria = ["conciseness", "coherence"]
eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)
print(eval_result)

{'reasoning': 'Conciseness: The submission is not concise and does not answer the given task. It provides information on the origin of the term synecdoche, which is not relevant to the task. Therefore, the submission does not meet the criterion of conciseness.\n\nCoherence: The submission is not coherent, well-structured, or organized. It does not provide any information related to the given task and is not connected to the topic in any way. Therefore, the submission does not meet the criterion of coherence.\n\nConclusion: The submission does not meet all criteria.', 'value': 'N', 'score': 0}
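A combined result like the one above gives a single verdict for the whole list, so it can be hard to tell which criterion caused a failure. One option is to evaluate each criterion on its own. The sketch below assumes only the `evaluate_strings(prediction=..., input=...)` interface shown in this notebook; `evaluate_each`, `build_evaluator`, and `StubEvaluator` are hypothetical names, and the stub stands in for a real chain so the example runs without an API key.

```python
def evaluate_each(build_evaluator, criteria, prediction, input_text):
    """Run one evaluation per criterion and collect the results by name.

    `build_evaluator` is a hypothetical factory: it takes a single criterion
    and returns an object with an `evaluate_strings(prediction=..., input=...)`
    method.
    """
    results = {}
    for criterion in criteria:
        evaluator = build_evaluator(criterion)
        results[criterion] = evaluator.evaluate_strings(
            prediction=prediction, input=input_text
        )
    return results


class StubEvaluator:
    """Stand-in for a real chain; not part of any library."""

    def __init__(self, criterion):
        self.criterion = criterion

    def evaluate_strings(self, prediction, input):
        # Toy rule for illustration only: "concise" means under 20 words.
        if self.criterion == "conciseness":
            ok = len(prediction.split()) < 20
        else:
            ok = True
        return {"reasoning": "stub", "value": "Y" if ok else "N", "score": int(ok)}


results = evaluate_each(
    StubEvaluator,
    ["conciseness", "coherence"],
    prediction="A short answer.",
    input_text="What's the origin of the term synecdoche?",
)
for name, result in results.items():
    print(name, result["score"])
```

With the chain from this notebook, the factory could be `lambda c: CriteriaEvalChain.from_llm(llm=llm, criteria=c)`. This costs one LLM call per criterion, so it trades speed for a per-criterion breakdown.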


## Custom Criteria

To evaluate outputs against your own custom criteria, or to be more explicit about the definition of any of the default criteria, pass in a dictionary of `"criterion_name": "criterion_description"`.

Note: the evaluator still predicts whether the output complies with ALL of the criteria provided. If you specify antagonistic criteria or antonyms, the evaluator won't be very useful.

In [6]:
custom_criterion = {
    "numeric": "Does the output contain numeric information?"
}

eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criterion)
eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)
print(eval_result)

{'reasoning': '1. Criteria: numeric: Does the output contain numeric information?\n- The submission does not contain any numeric information.\n- Conclusion: The submission meets the criteria.', 'value': 'Answer: Y', 'score': None}
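In the output above, `value` is `'Answer: Y'` and `score` is `None`, so the verdict text was not converted into a numeric score. The sketch below shows one defensive way to read a result, assuming only the `value` and `score` keys seen in this notebook. The `verdict` helper and its fallback rule (take the last standalone `Y` or `N` in the value text) are assumptions for illustration, not library behavior.

```python
import re


def verdict(result):
    """Return True/False for a pass/fail result, or None if undecidable."""
    score = result.get("score")
    if score is not None:
        return bool(score)
    # Fall back to the last standalone Y or N in the raw value text.
    tokens = re.findall(r"\b[YN]\b", str(result.get("value", "")))
    if not tokens:
        return None
    return tokens[-1] == "Y"


print(verdict({"value": "Y", "score": 1}))
print(verdict({"value": "Answer: Y", "score": None}))
print(verdict({"value": "unclear", "score": None}))
```

Returning `None` for undecidable results keeps them visible instead of silently counting them as failures.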


In [8]:
# You can specify multiple criteria in the dictionary. However, we recommend keeping the number of criteria to a minimum for more reliable results.

custom_criteria = {
    "complements-user": "Does the submission complement the question or the person writing the question in some way?",
    "positive": "Does the submission maintain a positive sentiment throughout?",
    "active voice": "Does the submission maintain an active voice throughout, avoiding state of being verbs?",
}

eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criteria)

# Example that complies
query = "What's the population of Lagos?"
eval_result = eval_chain.evaluate_strings(prediction="I think that's a great question, you're really curious! About 30 million people live in Lagos, Nigeria, as of 2023.", input=query)
print("Meets criteria: ", eval_result["score"])

# Example that does not comply
eval_result = eval_chain.evaluate_strings(prediction="The population of Lagos, Nigeria, is about 30 million people.", input=query)
print("Does not meet criteria: ", eval_result["score"])

{'reasoning': '- complements-user: The submission directly answers the question asked and provides additional information about the population of Lagos. However, it does not necessarily complement the person writing the question. \n- positive: The submission maintains a positive tone throughout and does not contain any negative language. \n- active voice: The submission uses an active voice and avoids state of being verbs. \n\nTherefore, the submission meets all criteria. \n\nY\n\nY', 'value': 'Y', 'score': 1}
Meets criteria:  1
{'reasoning': '- complements-user: The submission directly answers the question asked in the task, so it complements the question. Therefore, the answer meets this criterion. \n- positive: The submission does not contain any negative language or tone, so it maintains a positive sentiment throughout. Therefore, the answer meets this criterion. \n- active voice: The submission uses the state of being verb "is" to describe the population, which is not in active vo