# Adaptive Evaluations with Scorebook - Evaluating a Local Qwen Model

This quick-start guide showcases an adaptive evaluation of Qwen's Qwen2.5 0.5B Instruct model.

If you have not already done so, we recommend first working through our [getting started quick-start guide](https://colab.research.google.com/github/trismik/scorebook/blob/main/tutorials/quickstarts/getting_started.ipynb), which gives a more detailed overview of adaptive testing and of setting up Trismik credentials.

## Prerequisites

- **Trismik API key**: Generate a Trismik API key from the [Trismik dashboard's settings page](https://app.trismik.com/settings).
- **Trismik Project Id**: We recommend you use the project id generated in the [Getting Started Quick-Start Guide](https://colab.research.google.com/github/trismik/scorebook/blob/main/tutorials/quickstarts/getting_started.ipynb).

## Install Scorebook

In [1]:
!pip install scorebook

Collecting scorebook
  Downloading scorebook-0.0.14-py3-none-any.whl.metadata (9.5 kB)

## Setup Credentials

Enter your Trismik API key and project id below.

In [2]:
# Set your credentials here (replace the placeholders; never share or commit a real key)
TRISMIK_API_KEY = "your-trismik-api-key"
TRISMIK_PROJECT_ID = "your-trismik-project-id"
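
As an alternative to pasting keys into the notebook, the credentials can be read from the environment. The following is a minimal sketch, not a Scorebook feature: the helper `read_credential` is hypothetical, and the environment variable names `TRISMIK_API_KEY` and `TRISMIK_PROJECT_ID` are assumptions chosen to match the variable names used in this guide.

```python
import os


def read_credential(name: str, default: str = "") -> str:
    """Return the environment variable `name`, or `default` if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    return value or default


# Assumed variable names; export them in your shell or set them in your notebook runtime.
TRISMIK_API_KEY = read_credential("TRISMIK_API_KEY")
TRISMIK_PROJECT_ID = read_credential("TRISMIK_PROJECT_ID")

if not TRISMIK_API_KEY or not TRISMIK_PROJECT_ID:
    print("Set TRISMIK_API_KEY and TRISMIK_PROJECT_ID before logging in.")
```

Reading from the environment keeps real keys out of saved notebook cells and shared copies of the notebook.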

## Login with Trismik API Key

In [3]:
from scorebook import login

# Login to Trismik
login(TRISMIK_API_KEY)
print("✓ Logged in to Trismik")

✓ Logged in to Trismik


## Instantiate a Local Qwen Model

For this quick-start guide, we will use the lightweight Qwen2.5 0.5B Instruct model via Hugging Face's transformers package.

In [None]:
import transformers

# Instantiate a model
pretrained = "Qwen/Qwen2.5-0.5B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=pretrained,
    tokenizer=pretrained,             # optional, but makes the tokenizer choice explicit
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)
print("✓ Transformers pipeline instantiated")

## Define an Inference Function

To evaluate a model with Scorebook, wrap it in an inference function. An inference function accepts a list of model inputs, passes them to the model for inference, and returns the generated outputs.

An inference function can wrap any model, local or cloud-hosted. Its body can be written however you like; the only requirement is the function signature. An inference function must:

Accept:

- A list of model inputs.
- Hyperparameters which can be optionally accessed via kwargs.

Return:

- A list of parsed model outputs for scoring.
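
The contract above can be sketched with a toy function that involves no model at all. This is only an illustration of the signature: `first_option` and the `default_answer` hyperparameter are hypothetical names, and the sample items assume the `question` / `options` fields used by the inference function later in this guide.

```python
from typing import Any, List


def first_option(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Toy inference function: return the same option letter for every input."""
    # Hyperparameters arrive as keyword arguments and are read only if needed.
    answer = hyperparameters.get("default_answer", "A")  # hypothetical hyperparameter
    # One output per input, in the same order.
    return [answer for _ in inputs]


sample_inputs = [
    {"question": "2 + 2 = ?", "options": ["4", "5"]},
    {"question": "Capital of France?", "options": ["Paris", "Rome"]},
]

print(first_option(sample_inputs))                      # ['A', 'A']
print(first_option(sample_inputs, default_answer="B"))  # ['B', 'B']
```

A stub like this can be handy for checking the shape of inputs and outputs before loading a real model.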

In [None]:
from typing import Any, List
import string

# Define an inference function for the Qwen model.
def qwen(inputs: List[Any], **hyperparameters: Any) -> List[Any]:
    """Run each input through the Qwen model and return the text responses."""

    outputs = []
    for input_item in inputs:

        # Format prompt
        choices = input_item.get("options", [])
        prompt = (
            str(input_item.get("question", ""))
            + "\nOptions:\n"
            + "\n".join(
                f"{letter}: {choice['text'] if isinstance(choice, dict) else choice}"
                for letter, choice in zip(string.ascii_uppercase, choices)
            )
        )

        # Build messages for Qwen model
        messages = [
            {
                "role": "system",
                "content": hyperparameters["system_message"]
            },
            {"role": "user", "content": prompt},
        ]

        # Run inference using the pipeline
        try:
            output = pipeline(
                messages,
                temperature = hyperparameters.get("temperature", 0.7),
                top_p = hyperparameters.get("top_p", 0.9),
                top_k = hyperparameters.get("top_k", 50),
                max_new_tokens = 512,
                do_sample = hyperparameters.get("temperature", 0.7) > 0,
            )
            response = output[0]["generated_text"][-1]["content"].strip()

        except Exception as e:
            response = f"Error: {str(e)}"

        outputs.append(response)

    return outputs
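
To see what the model is actually asked, and how a free-text reply might be reduced to a single option letter, the two steps can be tried in isolation. This is a sketch under assumptions: `format_prompt` mirrors the prompt-building logic above, while `extract_letter` is a hypothetical helper that is not part of Scorebook. Whether the dataset's scoring needs this kind of normalization is not stated in this guide.

```python
import re
import string
from typing import Any, Dict, Optional


def format_prompt(item: Dict[str, Any]) -> str:
    """Build the question-plus-lettered-options prompt used by the inference function."""
    choices = item.get("options", [])
    lines = [
        f"{letter}: {choice['text'] if isinstance(choice, dict) else choice}"
        for letter, choice in zip(string.ascii_uppercase, choices)
    ]
    return str(item.get("question", "")) + "\nOptions:\n" + "\n".join(lines)


def extract_letter(response: str, num_options: int) -> Optional[str]:
    """Hypothetical helper: return the first standalone option letter, or None."""
    valid = string.ascii_uppercase[:num_options]
    if not valid:
        return None
    # Match an uppercase option letter that stands alone as a word.
    match = re.search(rf"\b([{valid}])\b", response.strip())
    return match.group(1) if match else None


sample_item = {
    "question": "Which metric measures profitability?",
    "options": ["Net margin", {"text": "Headcount"}],
}

print(format_prompt(sample_item))
# Which metric measures profitability?
# Options:
# A: Net margin
# B: Headcount

print(extract_letter("The answer is B.", 2))  # B
```

Note that the regular expression is deliberately simple: a reply that begins with the article "A" would be read as option A, so a stricter parser may be needed for verbose models.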

## Run an Adaptive Evaluation

An adaptive evaluation can run on one or more adaptive datasets, and the `split` argument selects which split is evaluated.

In [None]:
from scorebook import evaluate

# Run adaptive evaluation
results = evaluate(
    inference = qwen,
    datasets = "trismik/FinRAG:adaptive",
    hyperparameters = {"system_message": "Answer the question with only the letter of the correct option. No additional text or context"},
    split = "validation",
    experiment_id = "Qwen-2.5-0.5B-Adaptive-Evaluation",
    project_id = TRISMIK_PROJECT_ID,
)

# Print the adaptive evaluation results
print("✓ Adaptive evaluation complete!")
print("Results: ", results[0]["score"])

---

## Next Steps

- [Adaptive Testing White Paper](https://docs.trismik.com/adaptiveTesting/adaptive-testing-introduction/): An in-depth overview of the science behind the adaptive testing methodology.
- [Dataset Page](https://dashboard.trismik.com/datasets): The full set of adaptive datasets currently available on the Trismik dashboard.
- [Scorebook Docs](https://docs.trismik.com/scorebook/introduction-to-scorebook/): Scorebook's full documentation.
- [Scorebook Repository](https://github.com/trismik/scorebook): Scorebook is an open-source library; view the code and more examples here.