# Evaluating an LLM on Open Telco 

This notebook evaluates **nvidia/nemotron-3-nano-30b-a3b** across the Open Telco benchmark suite using OpenRouter.

Note: For serious evaluation runs, we strongly encourage running the evaluations from a terminal rather than from this notebook.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsma/open_telco/blob/main/notebooks/1-evaluating-llm.ipynb)

## Installation

Run the cell below to install all dependencies. It works in Google Colab or any other environment that provides `curl` and a Unix-like shell.

In [None]:
%%capture
import os

# Install uv package manager
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Navigate to repo root (notebook is in notebooks/ folder)
os.chdir(
    os.path.dirname(os.getcwd())
    if os.path.basename(os.getcwd()) == "notebooks"
    else os.getcwd()
)

# Install dependencies
!~/.local/bin/uv python pin 3.11
!~/.local/bin/uv sync

## API Key Setup

Set your API keys below. In Google Colab, you can use Secrets (key icon in left sidebar) to store `OPENROUTER_API_KEY` and `HF_TOKEN` securely.

In [None]:
import os

# Try to load from Colab secrets first, then fall back to .env
try:
    from google.colab import userdata

    os.environ["OPENROUTER_API_KEY"] = userdata.get("OPENROUTER_API_KEY")
    os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
    print("Loaded API keys from Colab secrets")
except Exception:
    from dotenv import load_dotenv

    load_dotenv()
    print("Loaded API keys from .env file")

assert os.getenv("OPENROUTER_API_KEY"), (
    "OPENROUTER_API_KEY not set. Add it to Colab secrets or .env file."
)

In [5]:
MODEL = "openrouter/nvidia/nemotron-3-nano-30b-a3b"
LIMIT = 1  # samples per benchmark for testing

## Running Individual Benchmarks

Test each benchmark individually with `--limit` before running the full suite.
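
The per-benchmark commands in the cells below all follow one pattern, which can be sketched as a small helper. This is only an illustration: `BENCHMARKS` and `build_eval_command` are hypothetical names, not part of the repository, and the sketch assumes `uv` is on your `PATH` (the notebook cells call `~/.local/bin/uv` instead). It only builds and prints the commands; it does not run them.

```python
import shlex

# Task files exactly as they appear in the commands in this notebook.
BENCHMARKS = {
    "teleqna": "src/open_telco/teleqna/teleqna.py",
    "telemath": "src/open_telco/telemath/telemath.py",
    "telelogs": "src/open_telco/telelogs/telelogs.py",
    "three_gpp": "src/open_telco/three_gpp/three_gpp.py",
}


def build_eval_command(task_path, model, limit=None):
    """Return the argument list for one `inspect eval` smoke test.

    Hypothetical helper: assumes `uv` is on PATH.
    """
    cmd = ["uv", "run", "inspect", "eval", task_path, "--model", model]
    if limit is not None:
        # --limit caps the number of samples evaluated per benchmark.
        cmd += ["--limit", str(limit)]
    return cmd


# Print one shell-quoted command per benchmark without executing anything.
for name, path in BENCHMARKS.items():
    cmd = build_eval_command(
        path, "openrouter/nvidia/nemotron-3-nano-30b-a3b", limit=1
    )
    print(f"{name}: {shlex.join(cmd)}")
```

Passing `limit=None` drops the `--limit` flag, which corresponds to running a benchmark on its full dataset.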

### TeleQnA
Multiple-choice questions on telecommunications knowledge.

In [6]:
!~/.local/bin/uv run inspect eval src/open_telco/teleqna/teleqna.py --model {MODEL} --limit {LIMIT}

Loading dataset GSMA/open_telco from Hugging Face...
README.md: 1.43kB [00:00, 375kB/s]
test_teleqna.json: 582kB [00:00, 25.4MB/s]
Generating test split: 100%|██████| 1000/1000 [00:00<00:00, 20329.81 examples/s]
Saving the dataset (1/1 shards): 100%|█| 1000/1000 [00:00<00:00, 231652.71 examp

### TeleMath
Mathematical problems in telecommunications contexts.

In [7]:
!~/.local/bin/uv run inspect eval src/open_telco/telemath/telemath.py --model {MODEL} --limit {LIMIT}

Loading dataset GSMA/open_telco from Hugging Face...
test_telemath.json: 57.5kB [00:00, 21.0MB/s]
Generating test split: 100%|█████████| 100/100 [00:00<00:00, 3737.47 examples/s]
Saving the dataset (1/1 shards): 100%|█| 100/100 [00:00<00:00, 57432.62 examples

### TeleLogs
Network log analysis and troubleshooting.

In [None]:
!~/.local/bin/uv run inspect eval src/open_telco/telelogs/telelogs.py --model {MODEL} --limit {LIMIT}

### 3GPP TSG
3GPP Technical Specification Group document understanding.

In [None]:
!~/.local/bin/uv run inspect eval src/open_telco/three_gpp/three_gpp.py --model {MODEL} --limit {LIMIT}

## Full Evaluation Suite (Optional)

Run all benchmarks at once. Choose one of the two options below:

### Option 1: Using CLI

In [None]:
# Run all benchmarks
!~/.local/bin/uv run inspect eval \
    src/open_telco/teleqna/teleqna.py \
    src/open_telco/telemath/telemath.py \
    src/open_telco/telelogs/telelogs.py \
    src/open_telco/three_gpp/three_gpp.py \
    --model {MODEL} \
    --limit {LIMIT}

### Option 2: Using run_evals.py
For more control (multiple models in parallel, custom log directories, epochs), customize and run `run_evals.py`.
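
The contents of `run_evals.py` are not shown in this notebook, so the following is not that script. It is a minimal sketch of the kind of control described above, planning one `inspect eval` invocation per model with its own log directory and epoch count. `plan_runs` and the `logs/<model-name>` layout are assumptions made for illustration, and `uv` is assumed to be on your `PATH`.

```python
from pathlib import PurePosixPath


def plan_runs(models, tasks, epochs=1, log_root="logs"):
    """Return a list of (log_dir, command) pairs, one per model.

    Hypothetical helper: the log directory layout is an assumption.
    """
    plans = []
    for model in models:
        # Use the last component of the model id as the log folder name.
        log_dir = str(PurePosixPath(log_root) / model.split("/")[-1])
        cmd = [
            "uv", "run", "inspect", "eval", *tasks,
            "--model", model,
            "--epochs", str(epochs),
            "--log-dir", log_dir,
        ]
        plans.append((log_dir, cmd))
    return plans


tasks = [
    "src/open_telco/teleqna/teleqna.py",
    "src/open_telco/telemath/telemath.py",
]
for log_dir, cmd in plan_runs(
    ["openrouter/nvidia/nemotron-3-nano-30b-a3b"], tasks, epochs=2
):
    print(log_dir, "<-", " ".join(cmd))
```

Each planned command could then be launched with `subprocess.run(cmd, check=True)`, or submitted to a `concurrent.futures.ThreadPoolExecutor` to run several models at the same time.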

In [None]:
!~/.local/bin/uv run src/open_telco/run_evals.py

## View Results

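Launch the Inspect viewer to analyze results. Point `--log-dir` at the directory your runs wrote their logs to; the example below uses `logs/nemotron-3-nano`.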
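
Before launching the viewer, it can help to confirm that the log directory actually contains log files. The sketch below uses only the standard library. `list_log_files` is a hypothetical helper, and the `.eval` and `.json` suffixes are an assumption about how your Inspect installation names its logs.

```python
from pathlib import Path


def list_log_files(log_dir, suffixes=(".eval", ".json")):
    """Return log files under log_dir (searched recursively), sorted by name.

    Hypothetical helper: the suffixes are an assumption about log naming.
    Returns an empty list if the directory does not exist.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return []
    files = [
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in suffixes
    ]
    return sorted(files, key=lambda p: p.name)


# Prints nothing if the directory is missing or holds no matching files.
for path in list_log_files("logs"):
    print(path)
```

An empty result usually means the runs wrote to a different directory than the one being passed to the viewer.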

In [10]:
!~/.local/bin/uv run inspect view start --log-dir logs/nemotron-3-nano

Inspect View: 
file:///Users/emolero/Documents/GitHub/open_telco/logs/nemotron-3-nano
(Press CTRL+C to quit)
^C
