# Ollama Batch Evaluation Guide (LLM-as-a-Judge)

> This page was generated from [ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb](https://github.com/kaust-rccl/Data-science-onboarding/tree/main/notebooks/inference/ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb). You can [view or download the notebook](https://github.com/kaust-rccl/Data-science-onboarding/tree/main/notebooks/inference/ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb) or [view it on nbviewer](https://nbviewer.org/github/kaust-rccl/Data-science-onboarding/tree/main/notebooks/inference/ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb).

## Objective
This guide shows how to evaluate responses from multiple models automatically by sending batched requests to an Ollama server. Instead of scoring outputs manually, an LLM acts as a judge and rates each response against reference answers or quality criteria you define.

## Initial Setup
If you haven't installed conda yet, please follow :ref:`conda_ibex_` to get started.

After conda has been installed, save the following environment YAML file on Ibex under the name ``ollama_env.yaml``:

```yml
name: ollama_env
channels:
- conda-forge
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- bzip2=1.0.8
- ca-certificates=2025.7.14
- icu=75.1
- ld_impl_linux-64=2.44
- libexpat=2.7.1
- libffi=3.4.6
- libgcc=15.1.0
- libgcc-ng=15.1.0
- libgomp=15.1.0
- liblzma=5.8.1
- libmpdec=4.0.0
- libsqlite=3.50.3
- libstdcxx=15.1.0
- libstdcxx-ng=15.1.0
- libuuid=2.38.1
- libzlib=1.3.1
- ncurses=6.5
- openssl=3.5.1
- pip=25.1.1
- python=3.13.5
- python_abi=3.13
- readline=8.2
- tk=8.6.13
- tzdata=2025b
- pip:
    - annotated-types==0.7.0
    - anyio==4.9.0
    - argon2-cffi==25.1.0
    - argon2-cffi-bindings==21.2.0
    - arrow==1.3.0
    - asttokens==3.0.0
    - async-lru==2.0.5
    - attrs==25.3.0
    - babel==2.17.0
    - beautifulsoup4==4.13.4
    - bleach==6.2.0
    - certifi==2025.7.14
    - cffi==1.17.1
    - charset-normalizer==3.4.2
    - comm==0.2.2
    - debugpy==1.8.15
    - decorator==5.2.1
    - defusedxml==0.7.1
    - executing==2.2.0
    - fastjsonschema==2.21.1
    - fqdn==1.5.1
    - h11==0.16.0
    - httpcore==1.0.9
    - httpx==0.28.1
    - idna==3.10
    - ipykernel==6.30.0
    - ipython==9.4.0
    - ipython-pygments-lexers==1.1.1
    - ipywidgets==8.1.7
    - isoduration==20.11.0
    - jedi==0.19.2
    - jinja2==3.1.6
    - json5==0.12.0
    - jsonpointer==3.0.0
    - jsonschema==4.25.0
    - jsonschema-specifications==2025.4.1
    - jupyter==1.1.1
    - jupyter-client==8.6.3
    - jupyter-console==6.6.3
    - jupyter-core==5.8.1
    - jupyter-events==0.12.0
    - jupyter-lsp==2.2.6
    - jupyter-server==2.16.0
    - jupyter-server-terminals==0.5.3
    - jupyterlab==4.4.5
    - jupyterlab-pygments==0.3.0
    - jupyterlab-server==2.27.3
    - jupyterlab-widgets==3.0.15
    - lark==1.2.2
    - markupsafe==3.0.2
    - matplotlib-inline==0.1.7
    - mistune==3.1.3
    - nbclient==0.10.2
    - nbconvert==7.16.6
    - nbformat==5.10.4
    - nest-asyncio==1.6.0
    - notebook==7.4.4
    - notebook-shim==0.2.4
    - ollama==0.5.1
    - overrides==7.7.0
    - packaging==25.0
    - pandocfilters==1.5.1
    - parso==0.8.4
    - pexpect==4.9.0
    - platformdirs==4.3.8
    - prometheus-client==0.22.1
    - prompt-toolkit==3.0.51
    - psutil==7.0.0
    - ptyprocess==0.7.0
    - pure-eval==0.2.3
    - pycparser==2.22
    - pydantic==2.11.7
    - pydantic-core==2.33.2
    - pygments==2.19.2
    - python-dateutil==2.9.0.post0
    - python-json-logger==3.3.0
    - pyyaml==6.0.2
    - pyzmq==27.0.0
    - referencing==0.36.2
    - requests==2.32.4
    - rfc3339-validator==0.1.4
    - rfc3986-validator==0.1.1
    - rfc3987-syntax==1.1.0
    - rpds-py==0.26.0
    - send2trash==1.8.3
    - setuptools==80.9.0
    - six==1.17.0
    - sniffio==1.3.1
    - soupsieve==2.7
    - stack-data==0.6.3
    - terminado==0.18.1
    - tinycss2==1.4.0
    - tornado==6.5.1
    - traitlets==5.14.3
    - types-python-dateutil==2.9.0.20250708
    - typing-extensions==4.14.1
    - typing-inspection==0.4.1
    - uri-template==1.3.0
    - urllib3==2.5.0
    - wcwidth==0.2.13
    - webcolors==24.11.1
    - webencodings==0.5.1
    - websocket-client==1.8.0
    - widgetsnbextension==4.0.14
```

Run the following command to build the conda environment:
```bash
conda env create -f ollama_env.yaml
```
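
Once the environment is built and activated, a quick check can confirm that the key packages are importable. The snippet below is a minimal sketch; the helper name `installed_version` is hypothetical, and the package list is only a sample taken from the environment file above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Sample of packages pinned in ollama_env.yaml
    for pkg in ("ollama", "requests", "jupyterlab"):
        print(pkg, installed_version(pkg) or "NOT INSTALLED")
```

If any package is reported as missing, the most likely cause is that the ``ollama_env`` environment is not the active one.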

## Starting JupyterLab
Follow [`using_jupyter`](../../jupyter) to start JupyterLab on an Ibex GPU node, using your conda environment instead of the ``machine_learning`` module.
To do so, make the following changes to the Jupyter launch script:
```bash
#module load machine_learning/2024.01
conda activate ollama_env
```

## Starting The Ollama Server
Start the Ollama REST API server by running the following bash script in a terminal:
```bash
#!/bin/bash

# Cleanup process while exiting the server
cleanup() {
    echo "🧹   Cleaning up before exit..."
    # Put your exit commands here, e.g.:
    rm -f "$OLLAMA_PORT_TXT_FILE"
    # Remove the Singularity instance
    singularity instance stop "$SINGULARITY_INSTANCE_NAME"
}
trap cleanup SIGINT  # Catch Ctrl+C (SIGINT) and run cleanup
#trap cleanup EXIT    # Also run on any script exit

# User Editable Section
# 1. Make target directory on /ibex/user/$USER/ollama_models_scratch to store your Ollama models
export OLLAMA_MODELS_SCRATCH=/ibex/user/$USER/ollama_models_scratch
mkdir -p $OLLAMA_MODELS_SCRATCH
# End of User Editable Section

SINGULARITY_INSTANCE_NAME="ollama"
OLLAMA_PORT_TXT_FILE='ollama_port.txt'

# 2. Load Singularity module
module load singularity

# 3. Pull OLLAMA docker image
singularity pull docker://ollama/ollama

# 4. Change the default port for OLLAMA_HOST: (default 127.0.0.1:11434)
export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')

# 5. Save the assigned port to a file; the notebook reads it later to reach the server.
echo "$PORT" > $OLLAMA_PORT_TXT_FILE

echo "OLLAMA PORT: $PORT  -- Stored in $OLLAMA_PORT_TXT_FILE"

# 6. Define the OLLAMA Host
export SINGULARITYENV_OLLAMA_HOST=127.0.0.1:$PORT

# 7. Change the default model storage directory: (default ~/.ollama/models)
export SINGULARITYENV_OLLAMA_MODELS=$OLLAMA_MODELS_SCRATCH

# 8. Create an Instance:
singularity instance start --nv -B "/ibex/user:/ibex/user" ollama_latest.sif $SINGULARITY_INSTANCE_NAME

# 9. Run the Ollama REST API server in the foreground (press Ctrl+C to stop it and trigger cleanup)
singularity exec instance://$SINGULARITY_INSTANCE_NAME bash -c "ollama serve"
```

> Note: Save the above script in a file called ``start_ollama_server.sh``.

```bash
# Run the script to start the Ollama server.
bash start_ollama_server.sh
```

The script contains the following parts:
- A user-editable section, where you define the Ollama models scratch directory.
- Port handling: the allocated port is saved in a temporary ``ollama_port.txt`` file, which the Python notebook reads to find the port assigned to the Ollama server.
- A cleanup section that stops the Singularity instance when the script is terminated with CTRL+C.
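
The port selection in step 4 of the script relies on binding a socket to port 0, which asks the operating system to pick an unused port. The sketch below spells out the same one-liner as a function; the name `find_free_port` is hypothetical and is not part of the script.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket() as s:
        s.bind(("", 0))             # port 0 means "any free port"
        return s.getsockname()[1]   # the port the OS actually assigned


if __name__ == "__main__":
    print("Free port:", find_free_port())
```

The socket is closed before the Ollama server starts, so another process could in principle claim the port in between. On a shared node this is unlikely but possible; rerunning the script picks a new port.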

## Using REST API Requests
The Python notebook below contains the code for testing the connection to the Ollama server, listing local models, pulling models, and chatting with the models:


```python
import asyncio
from ollama import AsyncClient
from typing import List, Dict

MAX_CONCURRENT = 2  # limit to avoid GPU overload
```

```python
# Configuration
with open("ollama_port.txt") as f:
    PORT = f.read().strip()

BASE_URL = f"http://127.0.0.1:{PORT}"
print(BASE_URL)
```

```text
http://127.0.0.1:40743
```
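
If the port file is empty or stale, the error only surfaces later as a failed request. A minimal sketch of an early check is shown below; `read_base_url` is a hypothetical helper, and the default host is assumed to match the `OLLAMA_HOST` address set in the server script.

```python
from pathlib import Path


def read_base_url(port_file: str = "ollama_port.txt", host: str = "127.0.0.1") -> str:
    """Build the server URL from the port file, failing early on bad content."""
    text = Path(port_file).read_text().strip()
    # Accept only a plain integer in the valid TCP port range
    if not text.isdigit() or not 0 < int(text) < 65536:
        raise ValueError(f"{port_file} does not contain a valid TCP port: {text!r}")
    return f"http://{host}:{int(text)}"
```

A missing file raises `FileNotFoundError`, which usually means the server script has not been started from the same directory.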


```python
# Testing the server connectivity
import requests

try:
    r = requests.get(BASE_URL)
    print("Ollama is running!", r.status_code)
except requests.ConnectionError as e:
    print("Ollama is NOT reachable:", e)
```

```text
Ollama is running! 200
```
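
The server may need a few seconds after launch before it answers requests, so a single connectivity check can fail even when the setup is correct. The following sketch polls until the server responds or a timeout expires. The names `server_is_up` and `wait_until` are hypothetical, and the standard library's `urllib` is used here instead of `requests` only to keep the example dependency-free.

```python
import time
import urllib.request
from typing import Callable


def server_is_up(base_url: str) -> bool:
    """Return True if an HTTP GET on base_url answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # URLError and connection errors are OSError subclasses
        return False


def wait_until(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook this could be used as `wait_until(lambda: server_is_up(BASE_URL))` before any model call.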


```python
from ollama import Client
client = Client(
  host=BASE_URL,
)

def get_local_models():
    for model in client.list()['models']:
        print(model['model'])

get_local_models()
```

```text
gemma3:latest
qwen3:latest
llama3:latest
deepseek-r1:1.5b
```


```python
# Pull the required model
# You can check the available models in: https://ollama.com/library
import requests

def pull_model(model_name, base_url=BASE_URL):
    url = f"{base_url}/api/pull"
    response = requests.post(url, json={"name": model_name}, stream=True)

    if response.status_code != 200:
        print("❌ Failed to pull model:", response.text)
        return

    for line in response.iter_lines():
        if line:
            decoded = line.decode("utf-8")
            print(decoded)

# Usage
pull_model("llama3")
```

```text
{"status":"pulling manifest"}
{"status":"pulling 6a0746a1ec1a","digest":"sha256:6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa","total":4661211424,"completed":4661211424}
{"status":"pulling 4fa551d4f938","digest":"sha256:4fa551d4f938f68b8c1e6afa9d28befb70e3f33f75d0753248d530364aeea40f","total":12403,"completed":12403}
{"status":"pulling 8ab4849b038c","digest":"sha256:8ab4849b038cf0abc5b1c9b8ee1443dca6b93a045c2272180d985126eb40bf6f","total":254,"completed":254}
{"status":"pulling 577073ffcc6c","digest":"sha256:577073ffcc6ce95b9981eacc77d1039568639e5638e83044994560d9ef82ce1b","total":110,"completed":110}
{"status":"pulling 3f8eb4da87fa","digest":"sha256:3f8eb4da87fa7a3c9da615036b0dc418d31fef2a30b115ff33562588b32c691d","total":485,"completed":485}
{"status":"verifying sha256 digest"}
{"status":"writing manifest"}
{"status":"success"}
```
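
Each line of the pull output is a standalone JSON object, and layer downloads carry `total` and `completed` byte counts. A minimal sketch for turning these lines into readable progress is shown below; `format_progress` is a hypothetical helper, and it assumes the line format seen in the output above.

```python
import json


def format_progress(line: str) -> str:
    """Turn one JSON progress line into a short human-readable string."""
    event = json.loads(line)
    status = event.get("status", "")
    total = event.get("total")
    completed = event.get("completed", 0)
    if total:
        # Layer downloads report byte counts; show them as a percentage
        return f"{status}: {100 * completed / total:.1f}%"
    return status
```

Inside `pull_model`, the `print(decoded)` call could be replaced by `print(format_progress(decoded))`.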


```python
from ollama import ResponseError

async def ensure_model(client: AsyncClient, model: str):
    """Check if model exists, if not, pull it."""
    try:
        # Try showing model details (raises ResponseError if not installed)
        await client.show(model)
        print(f"✅ Model {model} already available.")
    except ResponseError:
        print(f"📥 Pulling model {model}...")
        async for progress in await client.pull(model, stream=True):
            # The final progress event reports the status "success"
            if progress.get("status") == "success":
                print(f"✅ Pulled {model}")
        print(f"✅ Model {model} is now ready.")
```

```python
async def query_model(client: AsyncClient, model: str, prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    response = ""
    async for chunk in await client.chat(model=model, messages=messages, stream=True):
        if chunk.get("message"):
            response += chunk["message"]["content"]
    return response
```
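
Reasoning models such as `qwen3` may return their chain of thought inside a `<think>...</think>` block before the final answer, as the sample output further down shows. If the judge should only see the final answer, the block can be removed first. The helper below is a hypothetical sketch and assumes the tags appear literally in the response text.

```python
import re

# Matches a <think>...</think> block, including newlines inside it
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()
```

Whether to strip the reasoning is a design choice: keeping it lets the judge assess the reasoning itself, while removing it makes criteria such as conciseness apply to the answer only.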

```python
async def run_batch(models: List[str], prompt: str) -> Dict[str, str]:
    client = AsyncClient(host=BASE_URL)
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def safe_query(model):
        async with semaphore:
            # Ensure model is available before querying
            await ensure_model(client, model)
            print(f"🔄 Running {model}...")
            result = await query_model(client, model, prompt)
            print(f"✅ Done: {model}")
            return model, result

    results = await asyncio.gather(*(safe_query(m) for m in models))
    return dict(results)
```
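
The concurrency limit in `run_batch` comes from `asyncio.Semaphore`: at most `MAX_CONCURRENT` coroutines are inside the `async with` block at any time, while `asyncio.gather` still returns results in input order. The sketch below demonstrates this without a server; `demo_concurrency` and `fake_query` are hypothetical names, and the short sleep stands in for a model call.

```python
import asyncio


async def demo_concurrency(max_concurrent: int = 2, n_tasks: int = 6) -> int:
    """Run n_tasks fake queries and return the peak number running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def fake_query(i: int) -> int:
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for a model call
            running -= 1
            return i

    results = await asyncio.gather(*(fake_query(i) for i in range(n_tasks)))
    assert results == list(range(n_tasks))  # gather preserves input order
    return peak


if __name__ == "__main__":
    print("Peak concurrency:", asyncio.run(demo_concurrency()))
```

In a notebook, where an event loop is already running, call `await demo_concurrency()` instead of `asyncio.run(...)`.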

```python
async def judge_responses(client: AsyncClient, responses: Dict[str, str], criteria: str) -> Dict[str, Dict]:
    """Use an LLM to evaluate other models' outputs."""
    judged = {}
    # Ensure judge model exists
    judge_model = "llama3"
    await ensure_model(client, judge_model)

    for model, answer in responses.items():
        judge_prompt = f"""
You are an impartial judge. Evaluate the following answer according to {criteria}.

Answer:
{answer}

Give:
- Reasoning
- Score from 1 to 10
"""
        eval_resp = await query_model(client, judge_model, judge_prompt)
        judged[model] = {"evaluation": eval_resp}
    return judged
```
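
The judge returns free text, so comparing models numerically requires pulling the score out of the evaluation. The sketch below uses a regular expression for this; `extract_score` is a hypothetical helper, and it assumes the judge writes the score after the word "Score", which the prompt encourages but does not guarantee.

```python
import re
from typing import Optional

# "score", a few non-digit characters, then a 1-2 digit number, optionally "/10"
SCORE_PATTERN = re.compile(r"score\D{0,20}?(\d{1,2})(?:\s*/\s*10)?", re.IGNORECASE)


def extract_score(evaluation: str) -> Optional[int]:
    """Return the first score between 1 and 10 found in the text, else None."""
    match = SCORE_PATTERN.search(evaluation)
    if not match:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None
```

A pattern like this is fragile: if the judge repeats the instruction "Score from 1 to 10", the first number matched is 1. Asking the judge for a fixed output format would make parsing more reliable.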

```python
async def main():
    models = ["qwen3", "llama3", "gemma3"]
    prompt = "What is 1+1 equal?"
    client = AsyncClient(host=BASE_URL)

    # 1. Run models (with auto-pull if missing)
    responses = await run_batch(models, prompt)

    # 2. Judge responses
    evaluations = await judge_responses(client, responses, "clarity, correctness, and conciseness")

    # 3. Display
    for m in models:
        print(f"\n--- {m} ---")
        print("Response:\n", responses[m])
        print("Evaluation:\n", evaluations[m]["evaluation"])
```
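
Printed results are lost when the notebook is closed, so for larger batches it may help to write responses and evaluations to disk. The following is a minimal sketch; `save_results` and the output file name are hypothetical, and the dictionaries are assumed to have the shapes returned by `run_batch` and `judge_responses`.

```python
import json
from pathlib import Path
from typing import Dict


def save_results(responses: Dict[str, str], evaluations: Dict[str, Dict], path: str) -> None:
    """Write one record per model with its response and the judge's evaluation."""
    records = [
        {
            "model": model,
            "response": responses[model],
            "evaluation": evaluations[model]["evaluation"],
        }
        for model in responses
    ]
    Path(path).write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Inside `main`, this could be called as `save_results(responses, evaluations, "batch_eval_results.json")` after the judging step.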

```python
# if __name__ == "__main__":
#     asyncio.run(main())
await main()
```

✅ Model qwen3 already available.
🔄 Running qwen3...
✅ Model llama3 already available.
🔄 Running llama3...
✅ Done: llama3
✅ Model gemma3 already available.
🔄 Running gemma3...
✅ Done: gemma3
✅ Done: qwen3
✅ Model llama3 already available.

--- qwen3 ---
Response:
 <think>
Okay, the user is asking "What is 1+1 equal?" Let me think. First, I need to confirm if this is a straightforward math question. The user might be testing basic arithmetic or maybe they have a trick question in mind. Let me start by recalling that in standard arithmetic, 1 plus 1 equals 2. That's the fundamental addition in base 10.

Wait, could there be any alternative interpretations? For example, in some contexts, like binary numbers, 1 + 1 is 10, which is 2 in decimal. But the question doesn't specify a different numeral system, so I should assume base 10 unless told otherwise. Also, in programming or logic, sometimes operations can be different, but again, the question seems simple.

Is there a possibility of a tr