# Generative AI Application Evaluation and Governance

## Overview

- Implement an assistant that is (supposed to be) a specialist in psychological metrics

- Classify Q&A using the metrics of an assistant specialized in the “Big Seven” personality traits

- Use MLflow to evaluate a simple Q&A dataset, with an LLM acting as a judge specialized in the “Big Seven” traits

## Setup env

In [0]:
# %run ./setup_env/env_benchmark_eval

In [0]:
import warnings
warnings.filterwarnings('ignore')
from rich import print
import json
import tqdm
import pandas as pd
from langchain_community.chat_models import ChatDatabricks
from langchain_core.messages import SystemMessage, HumanMessage
import mlflow

In [0]:
SERVING_MODELS = {
    'gpt-5-1': 'databricks-gpt-5-1',  # disabled
    'gpt-oss-20b': 'databricks-gpt-oss-20b',  # disabled
    'meta-llama-8b': 'databricks-meta-llama-3-1-8b-instruct',  # enabled
    'qwen-80b': 'databricks-qwen3-next-80b-a3b-instruct',  # enabled
    'llama-maverick-400b': 'databricks-llama-4-maverick', # enabled
    'gemma-12b': 'databricks-gemma-3-12b'  # enabled  
}


CATEGORIES_BIG_SEVEN_TRAITS = '''The "Big Seven" model measures and describes human personality traits. The framework groups variation in personality into seven separate factors:

Trait 1 - Openness: measures creativity, curiosity, and willingness to entertain new ideas.

Trait 2 - Conscientiousness: measures self-control, diligence, and attention to detail.

Trait 3 - Extraversion: measures boldness, energy, and social interactivity.

Trait 4 - Agreeableness: measures kindness, helpfulness, and willingness to cooperate.

Trait 5 - Neuroticism: measures depression, irritability, and proneness to anxiety.

Trait 6 - Religiousness: measures religious feeling or belief.

Trait 7 - Machiavellianism: measures a personality trait characterized by manipulation, indifference to morality, lack of empathy, and a calculated focus on self-interest.
'''

## Utils    

In [0]:
def run_llm_model(
    llm_model: ChatDatabricks,
    input_text: str,
    system_prompt: str,
    context: str = CATEGORIES_BIG_SEVEN_TRAITS
    ) -> str:
    '''Run an LLM chat model that classifies the input text into one of the "Big Seven" traits.'''

    # Build the classification prompt: task, one worked example, trait definitions, conversation.
    user_prompt_analysed = f'''
    [INST] Task: Classify the user's input messages according to the "Big Seven traits". Return only the single most probable trait, based on the categories described in the context. Answer with the identifier of the matching category: Trait 1, Trait 2, ... , Trait 7.

    <BEGIN TRAIT CLASSIFICATION EXAMPLE>
    Question: "What do you think about other people's feelings?"
    Bot Answer: 'I sympathise with others' feelings.',
    Probable trait: 'Trait 4',
    <END TRAIT CLASSIFICATION EXAMPLE>

    <BEGIN TRAIT CONTENT CATEGORIES>
    {context}
    <END TRAIT CONTENT CATEGORIES>

    <BEGIN CONVERSATION>
    {input_text}
    <END CONVERSATION>

    <BEGIN CLASSIFICATION>
    Classification: Probable trait: *Trait_i*
    <END CLASSIFICATION>
    '''

    # Send the full classification prompt, not only the raw input text.
    res = llm_model.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt_analysed)
    ])

    print(res.content)

    # Return the text of the reply so the result matches the `-> str` annotation.
    return res.content


def run_llm_bot(llm_model: ChatDatabricks, input_text: str) -> str:
    '''Run an LLM that answers the input question with a "Big Seven" trait classification.'''

    # f-string so that the trait definitions are actually embedded in the system prompt.
    system_prompt = f'''
    You are a helpful assistant that classifies and answers questions. Classify the type of question. Be concise, consistent, and knowledgeable. Return as the answer the trait classification of the input question according to the "Big Seven traits":
    {CATEGORIES_BIG_SEVEN_TRAITS}
    '''

    res = llm_model.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=input_text)
    ])

    # print(res.content)

    return res.content

In [0]:
# endpoint_llm = SERVING_MODELS['qwen-80b']
endpoint_llm = SERVING_MODELS['llama-maverick-400b']

llm_model = ChatDatabricks(
    endpoint=endpoint_llm,
    seed=42
)

In [0]:
res = run_llm_bot(
    llm_model=llm_model,
    input_text='What is there immutable in the mutable?'
)
print(res)
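
The classification prompts above ask the model to answer with a label such as `Trait 4`, but the reply is still free text. The sketch below shows one way to turn such a reply into a trait number and name. The helper `parse_trait` and the `TRAIT_NAMES` mapping are hypothetical names introduced here for illustration; the numbering follows `CATEGORIES_BIG_SEVEN_TRAITS`, and the regular expression assumes the model roughly follows the `Trait N` or `Trait_N` format.

```python
import re
from typing import Optional

# Trait numbering as listed in CATEGORIES_BIG_SEVEN_TRAITS.
TRAIT_NAMES = {
    1: 'Openness',
    2: 'Conscientiousness',
    3: 'Extraversion',
    4: 'Agreeableness',
    5: 'Neuroticism',
    6: 'Religiousness',
    7: 'Machiavellianism',
}


def parse_trait(reply: str) -> Optional[int]:
    '''Return the first trait number (1 to 7) found in the reply, or None if there is none.'''
    # Accept "Trait 4", "trait_4" or "Trait4"; the word boundary rejects numbers such as "Trait 12".
    match = re.search(r'Trait[\s_]*([1-7])\b', reply, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


trait = parse_trait("Probable trait: 'Trait 4'")
print(trait, TRAIT_NAMES.get(trait))
```

Returning `None` for replies without a recognizable label lets the caller decide whether to retry or to discard the row, instead of silently assigning a trait.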

## Defining a custom metric to evaluate the LLMs

### Explanation

**Big Five "Expanded" personality traits ("Big Seven")**
---

- **Openness (O):** measures creativity, curiosity, and willingness to entertain new ideas.
  - _Examples_:
    - I have a rich vocabulary.
    - I have a vivid imagination.
    - I have excellent ideas.
    - I am quick to understand things.
    - I have difficulty understanding abstract ideas. (Reversed)
    - I am not interested in abstract ideas. (Reversed)

- **Conscientiousness (C):** measures self-control, diligence, and attention to detail.
  - _Examples_:
    - I am always prepared.
    - I pay attention to details.
    - I get chores done right away.
    - I like order.
    - I leave my belongings around. (Reversed)
    - I make a mess of things. (Reversed)

- **Extraversion (E)**: measures boldness, energy, and social interactivity.
  - _Examples_:
    - I am the life of the party.
    - I feel comfortable around people.
    - I start conversations.
    - I talk to a lot of different people at parties.
    - I do not talk a lot. (Reversed)
    - I keep in the background. (Reversed)

- **Agreeableness (A):** measures kindness, helpfulness, and willingness to cooperate.
  - _Examples_:
    - I am interested in people.
    - I sympathise with others' feelings.
    - I have a soft heart.
    - I take time out for others.
    - I am not really interested in others. (Reversed)
    - I insult people. (Reversed)

- **Neuroticism (N):** measures depression, irritability, and proneness to anxiety.
  - _Examples_:
    - I get stressed out easily.
    - I worry about things.
    - I am easily disturbed.
    - I get upset easily.
    - I am relaxed most of the time. (Reversed)
    - I seldom feel blue. (Reversed)

- **_Religiousness_ (R):** measures religious feeling or belief.
  - _Examples_:
    - "No man ever steps in the same river twice."
    - "What is real never ceases to exist, and what is not real never exists."
    - "Death is not the end, just a transition."
    - "Your word is a lamp to my feet and a light for my path."
    - "Truly, truly, I say to you, unless a grain of wheat falls into the earth and dies, it remains unfruitful; but if it dies, it bears much fruit."

- **_Machiavellianism_ (M):** measures a personality trait characterized by manipulation, indifference to morality, lack of empathy, and a calculated focus on self-interest.
  - _Examples_:
    - "Never tell anyone the real reason you did something unless it is useful to do so." 
    - "Most people are basically good and kind." (Reversed)
    - "The first method for estimating the intelligence of a ruler is to look at the men he has around him."
    - "To govern is to make people believe."


---

**Note:** The traits **Religiousness (R)** and **Machiavellianism (M)** are added here to extend the Big Five traits, in an attempt to better explain human personality.
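
Several example items above are marked "(Reversed)": agreeing with them indicates a *lower* level of the trait. A minimal sketch of how such items are commonly scored follows. It assumes a 1 to 5 agreement scale, which this notebook does not specify, and the helper names `score_item` and `trait_score` are hypothetical.

```python
from typing import List, Tuple


def score_item(response: int, reversed_item: bool = False,
               scale_min: int = 1, scale_max: int = 5) -> int:
    '''Score one questionnaire item, flipping the scale for reversed items.'''
    if not scale_min <= response <= scale_max:
        raise ValueError(f'response {response} is outside {scale_min}..{scale_max}')
    # On a 1..5 scale a reversed 2 becomes 4, and a reversed 1 becomes 5.
    return scale_min + scale_max - response if reversed_item else response


def trait_score(responses: List[Tuple[int, bool]]) -> float:
    '''Average the (response, is_reversed) pairs belonging to one trait.'''
    return sum(score_item(r, rev) for r, rev in responses) / len(responses)


# Extraversion items: two regular items and two reversed items (illustrative answers).
extraversion = [(5, False), (4, False), (2, True), (1, True)]
print(trait_score(extraversion))
```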

**References:** 
  - https://en.wikipedia.org/wiki/Big_Five_personality_traits
  - https://en.wikipedia.org/wiki/Impermanence_(Buddhism)
  - https://en.wikipedia.org/wiki/Machiavellianism_(psychology)
  - https://en.wikipedia.org/wiki/Religiosity
  - https://en.wikipedia.org


## Define the Big Seven metric in MLflow

In [0]:
def run_llm_model_metric(
    llm_model: ChatDatabricks,
    input_text: str,
    system_prompt: str,
    context: str = ''
    ) -> str:
    '''Run an LLM chat model that scores the input text against the "Big Seven" traits.'''

    # Build the scoring prompt: task, one worked example, trait definitions, conversation.
    user_prompt_analysed = f'''
    [INST] Task: Generate a score for the user's input messages according to the classes of the "Big Seven traits". These classes are defined below. Return only one score: the probability that the text belongs to the best-matching "Big Seven" category.

    <BEGIN TRAIT CLASSIFICATION EXAMPLE>
    Question: "What do you think about other people's feelings?"
    Bot Answer: 'I sympathise with others' feelings.',
    Probable trait: '(Trait 4, 0.8)',
    <END TRAIT CLASSIFICATION EXAMPLE>

    <BEGIN TRAIT CONTENT CATEGORIES>
    {context}
    <END TRAIT CONTENT CATEGORIES>

    <BEGIN CONVERSATION>
    {input_text}
    <END CONVERSATION>

    <BEGIN OUTPUT>
    Score: 0.8
    <END OUTPUT>
    '''

    # Send the full scoring prompt, not only the raw input text.
    res = llm_model.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt_analysed)
    ])

    print(res.content)

    # Return the text of the reply so the result matches the `-> str` annotation.
    return res.content


def run_llm_bot_metric(llm_model: ChatDatabricks, input_text: str, context: str, verbose: bool = False) -> str:
    '''Run an LLM that answers the input question in one sentence, guided by the grading context.'''

    # Include the grading criteria so the `context` argument is actually used.
    system_prompt = f'''
    You are a specialist at scoring personality traits. Generate a one-sentence response to the user's message.
    {context}
    '''

    res = llm_model.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=input_text)
    ])

    if verbose:
        print(res.content)

    return res.content

In [0]:
# Each piece ends with a newline so the criteria are not run together when concatenated.
METRIC_GRADING_PROMPT_BIG_SEVEN = (
    'Big Seven: If the answer contains any notable aspects of personality, the details for the different scores are below:\n'
    '- Score 1: The language contains aspects of creativity, curiosity, and willingness to entertain new ideas [Openness].\n'
    '- Score 2: The language contains aspects of self-control, diligence, and attention to detail [Conscientiousness].\n'
    '- Score 3: The language contains aspects of boldness, energy, and social interactivity [Extraversion].\n'
    '- Score 4: The language contains aspects of kindness, helpfulness, and willingness to cooperate [Agreeableness].\n'
    '- Score 5: The language contains aspects of depression, irritability, and proneness to anxiety [Neuroticism].\n'
    '- Score 6: The language contains aspects of religious feeling or belief [Religiousness].\n'
    '- Score 7: The language contains aspects of manipulation, indifference to morality, lack of empathy, and a calculated focus on self-interest [Machiavellianism].\n'
)

In [0]:
run_llm_bot_metric(
    llm_model=llm_model, 
    input_text="What do you think about other people's feelings?",
    context=METRIC_GRADING_PROMPT_BIG_SEVEN
)

'You tend to be understanding and empathetic towards others, often putting their emotional needs into consideration.'

In [0]:
# Few-shot examples shown to the judge model: input, output, expected score, and justification.
example1 = mlflow.metrics.genai.EvaluationExample(
    input="What do you think about other people's feelings?",
    output="I sympathise with others' feelings.",
    score=4,
    justification='The language contains aspects of the Big Seven traits model: Agreeableness.',
)

example2 = mlflow.metrics.genai.EvaluationExample(
    input='What is there immutable in the mutable?',
    output='Maybe there is. This is an old question in Philosophy.',
    score=6,
    justification='The language contains aspects of the Big Seven traits model: Religiousness.'
)


In [0]:
big_seven_metric = mlflow.metrics.genai.make_genai_metric(
    name='big_seven_metric',
    # Trailing spaces keep the implicitly concatenated pieces from running together.
    definition=(
        'The Big Seven personality trait model is a pseudo-scientific model for measuring and describing human '
        'personality traits based on language. The framework groups aspects of personality into seven separate '
        'factors, all measured on a continuous scale.'
    ),
    grading_prompt=METRIC_GRADING_PROMPT_BIG_SEVEN,
    examples=[example1, example2],
    model=f'endpoints:/{endpoint_llm}',
    parameters={'temperature': 0.0},  # numeric value rather than the string '0.0'
    aggregations=['mean', 'variance'],
    greater_is_better=False,
    include_input=True
)
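
The grading prompt uses the scores 1 to 7 as category labels (one per trait) rather than as magnitudes, so the `mean` and `variance` aggregations may be hard to interpret. As a hedged alternative, the sketch below summarizes a list of per-row scores as a label distribution. `trait_distribution` is a hypothetical helper, and the input lists are made up for illustration, not results from this notebook.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List


def trait_distribution(scores: List[int]) -> Dict[int, float]:
    '''Return the share of rows assigned to each trait label, ordered by label.'''
    counts = Counter(scores)
    total = len(scores)
    return {label: counts[label] / total for label in sorted(counts)}


# One Openness (1) and one Machiavellianism (7) answer average to 4,
# which is the label for Agreeableness although no answer received it.
print(mean([1, 7]))

# The distribution keeps the labels apart.
print(trait_distribution([4, 6, 4, 1]))
```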


## Using the Big Seven Metric

In [0]:
examples_eval = {
    'inputs': [
        'What is there immutable in the mutable?',
        "What do you think about other people's feelings?",
        "What do you think about this citation: `Experience is not what happens to a person; it's what a person does with what happens to them.`?"
    ]
}
dataset_eval = pd.DataFrame(examples_eval)
dataset_eval

|   | inputs |
|---|--------|
| 0 | What is there immutable in the mutable? |
| 1 | What do you think about other people's feelings? |
| 2 | What do you think about this citation: `Exper... |


In [0]:
def iterate_over_inputs(df: pd.DataFrame, context: str = METRIC_GRADING_PROMPT_BIG_SEVEN) -> list:
    '''Generate one model answer for each row of the `inputs` column.'''
    res = []
    for _, df_i in df.iterrows():
        res_llm = run_llm_bot_metric(
            llm_model=llm_model,
            input_text=df_i.inputs,
            context=context  # use the argument rather than the global constant
        )
        res.append(res_llm)
    return res

In [0]:
print(iterate_over_inputs(dataset_eval.head(1), METRIC_GRADING_PROMPT_BIG_SEVEN))

## Run evaluation on a dataset

In [0]:
dataset_eval.head()

|   | inputs |
|---|--------|
| 0 | What is there immutable in the mutable? |
| 1 | What do you think about other people's feelings? |
| 2 | What do you think about this citation: `Exper... |


In [0]:
results_eval = mlflow.evaluate(
    iterate_over_inputs,
    dataset_eval,
    model_type='question-answering',
    extra_metrics=[big_seven_metric]
)

2026/02/05 23:27:45 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2026/02/05 23:27:47 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



In [0]:
df_metrics = pd.DataFrame(results_eval.metrics, index=[0]).T
df_metrics

|   | 0 |
|---|---|
| toxicity/v1/mean | 0.0002746301 |
| toxicity/v1/variance | 6.069935e-09 |
| toxicity/v1/p90 | 0.0003378253 |
| toxicity/v1/ratio | 0.0 |
| flesch_kincaid_grade_level/v1/mean | 15.51335 |
| flesch_kincaid_grade_level/v1/variance | 5.59418 |
| flesch_kincaid_grade_level/v1/p90 | 17.89832 |
| ari_grade_level/v1/mean | 17.74018 |
| ari_grade_level/v1/variance | 1.843829 |
| ari_grade_level/v1/p90 | 18.92974 |