## Setup LLM Model Parameters
This cell configures the LLM that LlamaIndex will use (model name, API base URL, maximum output tokens, and request timeout) and registers it globally as `Settings.llm`.

In [15]:
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.core.llms import ChatMessage, MessageRole

# Model and endpoint settings
MODEL_NAME   = 'qwen3'
LLM_API_BASE = "http://192.168.100.30:16001/v1"
MAX_TOKENS   = 2048
TIME_OUT     = 600

Settings.llm = OpenAILike(
    model=MODEL_NAME,
    api_base=LLM_API_BASE,
    api_key='EMPTY',
    is_chat_model=True,
    temperature=0.6,
    max_tokens=MAX_TOKENS,
    timeout=TIME_OUT,
    additional_kwargs={"extra_body": {"chat_template_kwargs": {"enable_thinking": False}}},
)

## Import MLflow and Set Tracking URI
This cell imports MLflow and the `scorer` decorator, sets the address of the MLflow tracking server, and selects the experiment under which the evaluation runs will be recorded.

In [None]:
import mlflow
from mlflow.genai import scorer

mlflow.set_tracking_uri("http://192.168.100.30:5000")
# Select the experiment; MLflow creates it if it does not exist yet
mlflow.set_experiment("example_experiment")

## Define LLM Inference Function
This cell defines `llm_function`, which sends a question to the model together with an instruction that restricts answers to 'Yes' or 'No', prints the model's raw reply, and returns it.

In [17]:
def llm_function(question: str, **kwargs) -> str:
    messages = [
        # The instruction is sent with the system role rather than as an assistant turn
        ChatMessage(role=MessageRole.SYSTEM, content="You are a general practitioner who is happy to answer questions. You only answer 'Yes' or 'No'."),
        ChatMessage(role=MessageRole.USER, content=question),
    ]
    response = Settings.llm.chat(messages, **kwargs)
    # The reply text is the first content block of the returned message
    text = response.message.blocks[0].text
    print(f"response: {text}")
    return text

## Define Evaluation Scorer
This cell uses MLflow's `@scorer` decorator to define the `exact_match` evaluation function, which returns `True` only when the model output is character-for-character identical to the expected answer. Under this rule, "Yes" and "Yes." count as different answers.

In [18]:
@scorer
def exact_match(outputs: str, expectations: dict) -> bool:
    print(f"outputs: {outputs}, expectations: {expectations}")
    return outputs == expectations["expected_response"]
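
Because the comparison is strict, a reply that differs only in case or trailing punctuation is scored as a miss. The following is a minimal sketch of a more forgiving comparison, assuming that case and surrounding punctuation should not affect the score. The names `normalize_answer` and `lenient_match` are hypothetical helpers introduced for illustration; they are not part of MLflow or of the original notebook.

```python
import string


def normalize_answer(text: str) -> str:
    """Lowercase the text and strip surrounding whitespace and punctuation."""
    return text.strip(string.punctuation + string.whitespace).lower()


def lenient_match(outputs: str, expectations: dict) -> bool:
    """Compare answers after normalization instead of character by character."""
    return normalize_answer(outputs) == normalize_answer(expectations["expected_response"])


print(lenient_match("yes", {"expected_response": "Yes."}))  # True
print("yes" == "Yes.")                                      # False
```

If this behaviour is wanted in the evaluation, the same function could presumably be decorated with `@scorer` in the same way as `exact_match`.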

## Construct Evaluation Dataset
This cell constructs a dataset of four question-answer pairs. Each record has an `inputs` entry holding the question passed to `llm_function` and an `expectations` entry holding the reference answer read by the scorer.

In [19]:
dataset = [
    {
        "inputs": {"question": "My fasting blood glucose is 10 mmol, do I have diabetes risk?"},
        "expectations": {"expected_response": "Yes."},
    },
    {
        "inputs": {"question": "I have a headache, am I going to die soon?"},
        "expectations": {"expected_response": "No."},
    },
    {
        "inputs": {"question": "My blood pressure is 150/110, do I have hypertension risk?"},
        "expectations": {"expected_response": "Yes."},
    },
    {
        "inputs": {"question": "My heart beats 60 times per minute, is my heartbeat abnormal?"},
        "expectations": {"expected_response": "No."},
    },
]
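
A malformed record is easier to catch before the evaluation starts than in the middle of a run. The sketch below checks that each record carries both keys and that every key under `inputs` corresponds to a parameter of the prediction function. It assumes the entries of `inputs` are passed to the prediction function as keyword arguments. `check_dataset` and `demo_predict` are hypothetical names used only for this illustration.

```python
import inspect


def check_dataset(records, predict_fn):
    """Return a list of problems found; an empty list means none were detected."""
    params = inspect.signature(predict_fn).parameters
    # A function declared with **kwargs accepts any keyword name.
    takes_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    problems = []
    for i, record in enumerate(records):
        for key in ("inputs", "expectations"):
            if key not in record:
                problems.append(f"record {i}: missing '{key}'")
        for name in record.get("inputs", {}):
            if name not in params and not takes_var_kwargs:
                problems.append(f"record {i}: predict_fn has no parameter '{name}'")
    return problems


# Stand-in prediction function that takes a single `question` argument.
def demo_predict(question: str) -> str:
    return "Yes."


sample = [
    {"inputs": {"question": "..."}, "expectations": {"expected_response": "Yes."}},
    {"inputs": {"query": "..."}},  # wrong key name and no expectations
]
for problem in check_dataset(sample, demo_predict):
    print(problem)
```

Note that a function declared with `**kwargs`, like `llm_function`, accepts any keyword name, so for it the check only verifies that both top-level keys are present.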

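## Execute Model Evaluation and Output Results
This cell calls `mlflow.genai.evaluate`, which runs `llm_function` on every record in the dataset, applies the `exact_match` scorer to each output, and returns the results. The summary is then printed.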

In [20]:
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=llm_function,
    scorers=[exact_match],
)
print(results)


2025/09/10 14:20:56 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset.


response: Yes.


Evaluating:   0%|          | 0/4 [Elapsed: 00:00, Remaining: ?] 

response: No.
response: No.
response: Yes.
outputs: Yes., expectations: {'expected_response': 'Yes.'}
response: Yes.
outputs: No., expectations: {'expected_response': 'No.'}
outputs: No., expectations: {'expected_response': 'No.'}
outputs: Yes., expectations: {'expected_response': 'Yes.'}


EvaluationResult(
  run_id: c583bc25a9384089af81caac0f07bc00
  metrics:
    exact_match/mean: 1.0
  result_df: [4 rows x 9 cols]
)
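
The `exact_match/mean` value can be read as the fraction of records for which the scorer returned `True`. The sketch below recomputes it from the four output/expectation pairs printed in the log above. It assumes the aggregate is a plain arithmetic mean over the boolean results, which is consistent with the reported value of 1.0.

```python
# (output, expected) pairs copied from the scorer's log lines above.
pairs = [("Yes.", "Yes."), ("No.", "No."), ("No.", "No."), ("Yes.", "Yes.")]

matches = [output == expected for output, expected in pairs]
mean_score = sum(matches) / len(matches)  # True counts as 1, False as 0
print(mean_score)  # 1.0
```

Under the same assumption, a single mismatch among the four records would lower the mean to 0.75.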
