# Ollama Batch Evaluation Guide (LLM-as-a-Judge)

> This page was generated from [ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb](https://github.com/kaust-rccl/Data-science-onboarding/tree/main/notebooks/inference/ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb). You can [view or download notebook](https://github.com/kaust-rccl/Data-science-onboarding/tree/main/notebooks/inference/ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb). Or [view it on nbviewer](https://nbviewer.org/github/kaust-rccl/Data-science-onboarding/tree/main/notebooks/inference/ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb)

## Objective
This guide shows how to evaluate responses from multiple models automatically by sending batched requests to an Ollama server. Instead of scoring outputs manually, an LLM acts as a judge and rates each response against quality criteria you define.

## Initial Setup
If you haven't installed conda yet, please follow [`How to Setup Conda on Ibex Guide`](https://docs.hpc.kaust.edu.sa/soft_env/prog_env/python_package_management/conda/ibex.html) to get started.


After conda has been installed, save the following environment YAML file on Ibex as `ollama_env.yaml`:

```yml
name: ollama_env
channels:
  - conda-forge
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - _python_abi3_support=1.0
  - anyio=4.11.0
  - argon2-cffi=25.1.0
  - argon2-cffi-bindings=25.1.0
  - arrow=1.4.0
  - asttokens=3.0.0
  - async-lru=2.0.5
  - attrs=25.4.0
  - babel=2.17.0
  - beautifulsoup4=4.14.2
  - bleach=6.2.0
  - bleach-with-css=6.2.0
  - brotli-python=1.1.0
  - bzip2=1.0.8
  - ca-certificates=2025.10.5
  - cached-property=1.5.2
  - cached_property=1.5.2
  - certifi=2025.10.5
  - cffi=2.0.0
  - charset-normalizer=3.4.4
  - comm=0.2.3
  - cpython=3.14.0
  - debugpy=1.8.17
  - decorator=5.2.1
  - defusedxml=0.7.1
  - exceptiongroup=1.3.0
  - executing=2.2.1
  - fqdn=1.5.1
  - h11=0.16.0
  - h2=4.3.0
  - hpack=4.1.0
  - httpcore=1.0.9
  - httpx=0.28.1
  - hyperframe=6.1.0
  - idna=3.11
  - importlib-metadata=8.7.0
  - ipykernel=7.0.1
  - ipython=9.6.0
  - ipython_pygments_lexers=1.1.1
  - isoduration=20.11.0
  - jedi=0.19.2
  - jinja2=3.1.6
  - json5=0.12.1
  - jsonpointer=3.0.0
  - jsonschema=4.25.1
  - jsonschema-specifications=2025.9.1
  - jsonschema-with-format-nongpl=4.25.1
  - jupyter-lsp=2.3.0
  - jupyter_client=8.6.3
  - jupyter_core=5.9.1
  - jupyter_events=0.12.0
  - jupyter_server=2.17.0
  - jupyter_server_terminals=0.5.3
  - jupyterlab=4.4.9
  - jupyterlab_pygments=0.3.0
  - jupyterlab_server=2.27.3
  - keyutils=1.6.3
  - krb5=1.21.3
  - lark=1.3.0
  - ld_impl_linux-64=2.44
  - libedit=3.1.20250104
  - libexpat=2.7.1
  - libffi=3.4.6
  - libgcc=15.2.0
  - libgcc-ng=15.2.0
  - libgomp=15.2.0
  - liblzma=5.8.1
  - libmpdec=4.0.0
  - libsodium=1.0.20
  - libsqlite=3.50.4
  - libstdcxx=15.2.0
  - libstdcxx-ng=15.2.0
  - libuuid=2.41.2
  - libzlib=1.3.1
  - markupsafe=3.0.3
  - matplotlib-inline=0.1.7
  - mistune=3.1.4
  - nbclient=0.10.2
  - nbconvert-core=7.16.6
  - nbformat=5.10.4
  - ncurses=6.5
  - nest-asyncio=1.6.0
  - notebook-shim=0.2.4
  - openssl=3.5.4
  - overrides=7.7.0
  - packaging=25.0
  - pandocfilters=1.5.0
  - parso=0.8.5
  - pexpect=4.9.0
  - pickleshare=0.7.5
  - pip=25.2
  - platformdirs=4.5.0
  - prometheus_client=0.23.1
  - prompt-toolkit=3.0.52
  - psutil=7.1.0
  - ptyprocess=0.7.0
  - pure_eval=0.2.3
  - pycparser=2.22
  - pygments=2.19.2
  - pysocks=1.7.1
  - python=3.14.0
  - python-dateutil=2.9.0.post0
  - python-fastjsonschema=2.21.2
  - python-gil=3.14.0
  - python-json-logger=2.0.7
  - python-tzdata=2025.2
  - python_abi=3.14
  - pytz=2025.2
  - pyyaml=6.0.3
  - pyzmq=27.1.0
  - readline=8.2
  - referencing=0.37.0
  - requests=2.32.5
  - rfc3339-validator=0.1.4
  - rfc3986-validator=0.1.1
  - rfc3987-syntax=1.1.0
  - rpds-py=0.27.1
  - send2trash=1.8.3
  - setuptools=80.9.0
  - six=1.17.0
  - sniffio=1.3.1
  - soupsieve=2.8
  - stack_data=0.6.3
  - terminado=0.18.1
  - tinycss2=1.4.0
  - tk=8.6.13
  - tomli=2.3.0
  - tornado=6.5.2
  - traitlets=5.14.3
  - typing-extensions=4.15.0
  - typing_extensions=4.15.0
  - typing_utils=0.1.0
  - tzdata=2025b
  - uri-template=1.3.0
  - urllib3=2.5.0
  - wcwidth=0.2.14
  - webcolors=24.11.1
  - webencodings=0.5.1
  - websocket-client=1.9.0
  - yaml=0.2.5
  - zeromq=4.3.5
  - zipp=3.23.0
  - zstandard=0.25.0
  - zstd=1.5.7
  - pip:
      - annotated-types==0.7.0
      - ollama==0.6.0
      - pydantic==2.12.3
      - pydantic-core==2.41.4
      - typing-inspection==0.4.2
```

Run the following command to build the conda environment:
```bash
conda env create -f ollama_env.yaml
```
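
Once the environment is created and activated, a quick import check can confirm that the packages the notebook relies on are present. The following is a minimal sketch; the helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of the original notebook.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# The notebook cells below import these two third-party packages.
REQUIRED = ["ollama", "requests"]

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, check that `conda activate ollama_env` was run in the same shell or kernel.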

## Starting JupyterLab
Follow [`Guide: Using Jupyter on Ibex`](https://docs.hpc.kaust.edu.sa/soft_env/job_schd/slurm/interactive_jobs/jupyter.html#job-on-ibex) to start JupyterLab on an Ibex GPU node, using your conda environment instead of the *'machine_learning'* module.

To do so, make the following changes to the Jupyter launch script:
```bash
#module load machine_learning/2024.01
conda activate ollama_env
```

## Starting The Ollama Server
Start the Ollama REST API server by running the following bash script in a terminal:
```bash
#!/bin/bash

# Cleanup process while exiting the server
cleanup() {
    echo "🧹   Cleaning up before exit..."
    # Put your exit commands here, e.g.:
    rm -f $OLLAMA_PORT_TXT_FILE
    # Remove the Singularity instance
    singularity instance stop $SINGULARITY_INSTANCE_NAME
}
trap cleanup SIGINT  # Catch Ctrl+C (SIGINT) and run cleanup

# User Editable Section
# 1. Make target directory on /ibex/user/$USER/ollama_models_scratch to store your Ollama models
export OLLAMA_MODELS_SCRATCH=/ibex/user/$USER/ollama_models_scratch
mkdir -p $OLLAMA_MODELS_SCRATCH
# End of User Editable Section

SINGULARITY_INSTANCE_NAME="ollama"
OLLAMA_PORT_TXT_FILE='ollama_port.txt'

# 2. Load Singularity module
module load singularity

# 3. Pull OLLAMA docker image
singularity pull docker://ollama/ollama

# 4. Pick a free port to replace the default OLLAMA_HOST port (default 127.0.0.1:11434)
export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')

# 5. Save the assigned port to a file; the notebook reads it in the second part.
echo "$PORT" > $OLLAMA_PORT_TXT_FILE

echo "OLLAMA PORT: $PORT  -- Stored in $OLLAMA_PORT_TXT_FILE"

# 6. Define the OLLAMA Host
export SINGULARITYENV_OLLAMA_HOST=127.0.0.1:$PORT

# 7. Change the directory where models are stored (default ~/.ollama/models)
export SINGULARITYENV_OLLAMA_MODELS=$OLLAMA_MODELS_SCRATCH

# 8. Create an Instance:
singularity instance start --nv -B "/ibex/user:/ibex/user" ollama_latest.sif $SINGULARITY_INSTANCE_NAME

# 9. Run the OLLAMA REST API server in the foreground (press Ctrl+C to stop it and trigger cleanup)
singularity exec instance://$SINGULARITY_INSTANCE_NAME bash -c "ollama serve"
```

> Note: Save the above script in a file called start_ollama_server.sh

```bash
# Run the script to start the Ollama server.
bash start_ollama_server.sh
```

The script does the following:
- Provides a user-editable section where you set the scratch directory for Ollama models.
- Saves the allocated port to `ollama_port.txt` so that the Python notebook can read the port assigned to the Ollama server.
- Defines a cleanup function that removes the port file and stops the Singularity instance when the script is terminated with CTRL+C.
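
The port selection in step 4 of the script relies on binding a socket to port 0, which asks the operating system for any currently free port. The same idea as a standalone Python sketch (the function name `find_free_port` is illustrative):

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a free TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))             # port 0 lets the OS choose the port
        return s.getsockname()[1]   # second element of the address is the port


print("Free port:", find_free_port())
```

The socket is closed before the server starts, so another process could in principle take the port in between. On a shared node this is unlikely but possible; rerunning the script picks a new port.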

## Using the Ollama Python Package
Follow the Python notebook below. It contains code for:
- Initialization Setup.
- List local models.
- Pull models.
- Testing connection to the Ollama server.
- Chat with the models.


### 1. Initialization
1. Define the base URL of the Ollama server.
2. Test connectivity to the Ollama server.

```python
import asyncio
from ollama import AsyncClient
from typing import List, Dict

MAX_CONCURRENT = 2  # limit to avoid GPU overload
```

```python
# Configuration
with open("ollama_port.txt") as f:
    PORT = f.read().strip()

BASE_URL = f"http://127.0.0.1:{PORT}"
print(BASE_URL)
```

```
http://127.0.0.1:60159
```
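
The port file is written with a relative path, so it only exists in the directory where `start_ollama_server.sh` was launched. A slightly more defensive way to read it is sketched below; the helper name `read_port` and its error messages are illustrative, not part of the original notebook.

```python
from pathlib import Path


def read_port(path: str = "ollama_port.txt") -> int:
    """Read and validate the port number written by the server script."""
    port_file = Path(path)
    if not port_file.is_file():
        raise FileNotFoundError(
            f"{path} not found; start the Ollama server script first "
            "and run the notebook from the same directory."
        )
    text = port_file.read_text().strip()
    if not text.isdigit():
        raise ValueError(f"{path} does not contain a port number: {text!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"Port {port} is outside the valid range 1-65535")
    return port
```

With this helper, the base URL could be built as `f"http://127.0.0.1:{read_port()}"`.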


```python
# Testing the server connectivity
import requests

try:
    r = requests.get(BASE_URL)
    print("Ollama is running!", r.status_code)
except requests.ConnectionError as e:
    print("Ollama is NOT reachable:", e)
```

```
Ollama is running! 200
```
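
The server may need a few seconds after launch before it accepts connections, so a single check can fail even when nothing is wrong. Below is a minimal polling sketch using only the standard library. The function name `wait_for_server` and its default retry values are assumptions chosen for illustration; the opener is built with an empty `ProxyHandler` so that proxy environment variables, if any are set, do not interfere with requests to `127.0.0.1`.

```python
import time
import urllib.error
import urllib.request

# An empty ProxyHandler makes this opener ignore proxy environment variables.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_server(base_url: str, attempts: int = 10, delay: float = 2.0) -> bool:
    """Poll base_url until it answers with HTTP 200 or attempts run out."""
    for attempt in range(attempts):
        try:
            with _DIRECT.open(base_url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a pause
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

In the notebook this could be called as `wait_for_server(BASE_URL)` before sending any requests.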


### 2. Get a List of Local Models
- Get a list of locally available Ollama models.
- Locally available models are stored under */ibex/user/$USER/ollama_models_scratch*.
- To change where pulled models are stored, modify the variable *OLLAMA_MODELS_SCRATCH* in the script *start_ollama_server.sh*.

```python
from ollama import Client
client = Client(
  host=BASE_URL,
)

def get_local_models():
    """
    Returns a list of locally available Ollama Models.

    Returns:
        list: A list of model names as strings
    """
    models = [model['model'] for model in client.list()['models']]
    return models
```

```python
# Usage
get_local_models()
```

```
['gemma3:270m', 'qwen3:0.6b']
```
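
The list of local models can be compared with the models an evaluation needs, so that only the missing ones are pulled. The sketch below is a pure helper with an illustrative name (`models_to_pull`). It assumes that a model name without an explicit tag refers to the `latest` tag.

```python
from typing import Iterable, List


def _with_tag(name: str) -> str:
    # Assumption: a name without a tag means the "latest" tag.
    return name if ":" in name else f"{name}:latest"


def models_to_pull(required: Iterable[str], local: Iterable[str]) -> List[str]:
    """Return the required models that are not in the local list, in order."""
    available = {_with_tag(name) for name in local}
    return [name for name in required if _with_tag(name) not in available]


print(models_to_pull(["gemma3:270m", "qwen3:0.6b"], ["gemma3:270m"]))
```

In the notebook, the second argument could be the result of `get_local_models()`.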

### 3. Pull The Model
- To pull a specific model, use the *pull* method.
- Refer to the [Ollama Library](https://ollama.com/library) for the list of available models.

```python
# Pull the required models
client.pull("qwen3:0.6b")
```

```
ProgressResponse(status='success', completed=None, total=None, digest=None)
```

### 4. Running Batch Eval

```python
from ollama import ResponseError

async def ensure_model_exists(client: AsyncClient, model: str):
    """
    Ensure that a specified Ollama model is available locally.
    If the model is not installed, it is pulled through the Ollama server.

    Args:
        client (AsyncClient): An instance of the AsyncClient connected to the Ollama server.
        model (str): Name of the model to check and pull if necessary.

    Raises:
        Exception: If pulling the model fails or the server is unreachable.
    """
    try:
        # Check if the model is already available
        await client.show(model)
        print(f"Model {model} already available locally.")
    except ResponseError:
        # The server answered with an error, so the model is treated as missing
        print(f"Pulling model {model}...")
        async for progress in await client.pull(model, stream=True):
            # The final progress update of a pull reports the status "success"
            if progress.get("status", "") == "success":
                print(f"Pulled {model} successfully.")
        print(f"Model {model} is now ready for use.")
```
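
For large models it can help to show download progress instead of only the final message. The progress objects returned by `pull` carry `status`, `completed`, and `total` fields (visible in the `ProgressResponse` output above), which is enough to compute a percentage. The formatter below is a small sketch; the name `format_progress` is illustrative.

```python
from typing import Optional


def format_progress(status: str, completed: Optional[int], total: Optional[int]) -> str:
    """Render one pull progress update as a short status line."""
    if completed is not None and total:
        percent = 100.0 * completed / total
        return f"{status}: {percent:.1f}%"
    # Updates without byte counts (for example the final one) show only the status.
    return status


print(format_progress("downloading", 50, 200))
print(format_progress("success", None, None))
```

Inside the `async for` loop of `ensure_model_exists`, each update could be printed with `format_progress(progress.get("status", ""), progress.get("completed"), progress.get("total"))`.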

```python
async def query_model_async(client: AsyncClient, model: str, prompt: str) -> str:
    """
    Send a single prompt to a specified Ollama model asynchronously and return the full response.

    Args:
        client (AsyncClient): An instance of AsyncClient connected to the Ollama server.
        model (str): Name of the Ollama model to query.
        prompt (str): The user input to send to the model.

    Returns:
        str: The complete response text from the model.

    Raises:
        Exception: If the chat request fails.
    """
    messages = [{"role": "user", "content": prompt}]
    response = ""

    async for chunk in await client.chat(model=model, messages=messages, stream=True):
        if chunk.get("message") and "content" in chunk["message"]:
            response += chunk["message"]["content"]
    return response
```

```python
async def run_batch(models: List[str], prompt: str) -> Dict[str, str]:
    """
    Run multiple model inferences concurrently while limiting active requests.

    This function uses asynchronous concurrency control to efficiently query
    multiple models in parallel, ensuring that no more than `MAX_CONCURRENT`
    requests are active at a time. Each model is checked for availability before
    being queried, and missing models are automatically pulled.

    The Ollama endpoint and the concurrency limit are taken from the
    module-level `BASE_URL` and `MAX_CONCURRENT` variables.

    Args:
        models (List[str]): A list of model names to query.
        prompt (str): The user input or question to be sent to each model.

    Returns:
        Dict[str, str]: A dictionary mapping model names to their response text.
    """
    client = AsyncClient(host=BASE_URL)
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def safe_query(model):
        async with semaphore:
            # Ensure model is available before querying
            await ensure_model_exists(client, model)
            print(f"Running {model}...")
            result = await query_model_async(client, model, prompt)
            print(f"Done: {model}")
            return model, result

    results = await asyncio.gather(*(safe_query(m) for m in models))
    return dict(results)
```
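
To see how the semaphore in `run_batch` bounds concurrency, the pattern can be tried without a server. In the sketch below, `fake_query` is a stand-in for a model call and `bounded_gather_demo` is an illustrative name. It records the highest number of tasks that were inside the semaphore at the same time.

```python
import asyncio
from typing import Dict, List, Tuple


async def bounded_gather_demo(names: List[str], limit: int) -> Tuple[Dict[str, str], int]:
    """Run one fake query per name, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)
    active = 0  # tasks currently inside the semaphore
    peak = 0    # highest value of `active` seen so far

    async def fake_query(name: str) -> Tuple[str, str]:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for a model call
            active -= 1
            return name, f"response from {name}"

    # gather returns results in the order the coroutines were passed in
    pairs = await asyncio.gather(*(fake_query(n) for n in names))
    return dict(pairs), peak
```

In a notebook, `await bounded_gather_demo(["a", "b", "c", "d", "e"], 2)` returns all five results with a peak of 2; in a plain script, wrap the call in `asyncio.run(...)`.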

```python
async def judge_model_responses(
    client: AsyncClient,
    judge_model: str,
    responses: Dict[str, str],
    criteria: str
) -> Dict[str, Dict[str, str]]:
    """
    Use an LLM to evaluate the outputs of other models according to a given criteria.

    Args:
        client (AsyncClient): An instance of AsyncClient connected to the Ollama server.
        judge_model (str): Name of the model used as the judge.
        responses (Dict[str, str]): A dictionary mapping model names to their outputs.
        criteria (str): Evaluation criteria to guide the judging process.

    Returns:
        Dict[str, Dict[str, str]]: A dictionary mapping each evaluated model to its judgment,
        stored under the key "evaluation" (the judge's reasoning and score as raw text).

    Raises:
        Exception: If model evaluation fails or judge model cannot be ensured.
    """
    judged = {}

    # Ensure the judge model exists locally
    await ensure_model_exists(client, judge_model)

    for model, answer in responses.items():
        judge_prompt = f"""
You are an impartial judge. Evaluate the following answer according to {criteria}.

Answer:
{answer}

Give:
- Reasoning
- Score from 1 to 10
"""
        eval_resp = await query_model_async(client, judge_model, judge_prompt)
        judged[model] = {"evaluation": eval_resp}

    return judged
```
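
The judge returns free text, so comparing models numerically requires extracting the score. The sketch below pairs a prompt that also shows the judge the original question and requests a fixed `Score: N` line with a parser for that line. Both helper names (`build_judge_prompt`, `extract_score`) and the prompt wording are illustrative assumptions, not part of the original notebook. The parser first removes any `<think>...</think>` block, since some judge models include their reasoning in the output.

```python
import re
from typing import Optional

# Reasoning section that some models emit before the visible answer.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# A line such as "Score: 8", "Score: 8/10", or "**Score:** 8".
SCORE_LINE = re.compile(r"^\W*Score\W*(\d{1,2})\b", re.IGNORECASE | re.MULTILINE)


def build_judge_prompt(question: str, answer: str, criteria: str) -> str:
    """Build a judge prompt that includes the question and fixes the reply format."""
    return (
        f"You are an impartial judge. Evaluate the answer according to {criteria}.\n\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        "Reply with a short reasoning paragraph, then a final line in the form\n"
        "Score: N\n"
        "where N is an integer from 1 to 10."
    )


def extract_score(evaluation: str) -> Optional[int]:
    """Return the last 1-10 score found outside any <think> block, else None."""
    visible = THINK_BLOCK.sub("", evaluation)
    matches = SCORE_LINE.findall(visible)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None
```

Small judge models do not always follow the requested format, so a `None` result should be expected and handled, for example by keeping the raw evaluation text for manual review.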

```python
async def main():
    # Define List of Models [model_1, model_2, ...]
    models = ['gemma3:270m']
    # Define the Judge LLM model
    judge_model = "qwen3:0.6b"
    # Define the prompt
    prompt = "What is 1+1 equal?"
    client = AsyncClient(host=BASE_URL)

    # 1. Run models (with auto-pull if missing)
    responses = await run_batch(models, prompt)

    # 2. Judge responses
    evaluations = await judge_model_responses(client, judge_model, responses, "clarity, correctness, and conciseness")

    # 3. Display
    for m in models:
        print(f"\n--- {m} ---")
        print("Response:\n", responses[m])
        print("Evaluation:\n", evaluations[m]["evaluation"])
```

```python
await main()
```

Model gemma3:270m already available locally.
Running gemma3:270m...
Done: gemma3:270m
Model qwen3:0.6b already available locally.

--- gemma3:270m ---
Response:
 1 + 1 = 2

Evaluation:
 <think>
Okay, let's see what the user is asking for. They want me to evaluate the answer "1 + 1 = 2" according to clarity, correctness, and conciseness. Then they also want a score from 1 to 10. 

First, I need to check if the answer is correct. The expression 1 + 1 equals 2 is correct, so that's a solid correctness. But wait, the user mentioned "Evaluate the following answer according to clarity, correctness, and conciseness." So maybe they want me to look at how clearly it's presented. The answer is very straightforward, so clarity is good. 

Now, the reasoning part. The user provided a "Reasoning" section, but in the given answer, there's only the equation. Maybe there's a mistake here? Wait, the user's example shows that they have a "Reasoning" section. But in the answer, there's only the equation. 