# Calculator Sample

This is a simple pipeline that demonstrates using an LLM to calculate the value of arithmetic expressions.
Each test case processes a single expression entered by the user. See the [menu](../menu/menu.ipynb) notebook
for an example where each test case has multiple turns.

Define your pipeline. In this case we're using the `calc` sample defined in [calc.py](./calc.py).

In [1]:
from calc import calc_pipeline_spec

Define your test cases. Note that the second case uses a textual description of the expression
and that the third case specifies base 16, or hexadecimal, for the input.

In [2]:
cases = [
  {
    "uuid": "6f31c4df-bba6-42cd-a3ae-24f31d5503fa",
    "keywords": ["numbers"],
    "user": "1+1",
    "base": 10,
    "answer": 2
  },
  {
    "uuid": "ceb61568-73c9-4b5e-812f-0fa2df1464f1",
    "keywords": ["text"],
    "user": "one hundred two divided by two",
    "base": 10,
    "answer": 51
  },
  {
    "uuid": "178f2c37-557f-4d89-b48d-4f07e828e6ff",
    "keywords": ["hexidecimal"],
    "user": "ff + a",
    "base": 16,
    "answer": 265
  }
]
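
The expected answers can be checked by hand, independently of the pipeline. The snippet below is only a sanity check on the arithmetic in plain Python, not a Gotaglio call; the variable names are made up for illustration. The hexadecimal case parses each operand with `int(text, 16)`.

```python
# Sanity-check the expected answers of the three test cases.
# Plain Python arithmetic only; nothing here touches Gotaglio.
decimal_sum = 1 + 1                      # case 6f3: "1+1"
text_quotient = 102 / 2                  # case ceb: "one hundred two divided by two"
hex_sum = int("ff", 16) + int("a", 16)   # case 178: 255 + 10 in base 10

print(decimal_sum, text_quotient, hex_sum)  # prints: 2 51.0 265
```

Note that the third answer is written in base 10 even though the input is base 16, which matches the system prompt used for that case.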

Instantiate the Gotaglio object that runs the pipeline.

In [3]:
from gotaglio.gotag import Gotaglio

gt = Gotaglio([calc_pipeline_spec])

Run the calculator pipeline and store the runlog in `result`. Then format the results as an annotated transcript of a conversation between the
`system`, `assistant`, and `user`. Note that this example uses the built-in `perfect` model mock, which always provides the correct answer.
Once you have set up credentials, you can use a more interesting model like `gpt4o`.

In [4]:
result = gt.run(
  "calc",
  cases,
  {
    "prepare.template": "data/template.txt",
    "infer.model.name": "perfect"
  },
  save=True
)
gt.format(result)

            Summary for 618a4235-71dc-4f6d-9e04-3bfe96723e59
┏━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  id ┃ status   ┃ cost ┃ keywords    ┃ user                           ┃
┡━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 6f3 │ COMPLETE │ 0.00 │ numbers     │ 1+1                            │
│ ceb │ COMPLETE │ 0.00 │ text        │ one hundred two divided by two │
│ 178 │ COMPLETE │ 0.00 │ hexidecimal │ ff + a                         │
└─────┴──────────┴──────┴─────────────┴────────────────────────────────┘

Total: 3
Complete: 3/3 (100.00%)

## Run: 618a4235-71dc-4f6d-9e04-3bfe96723e59
## Case: 6f3 - PASSED
**Keywords:** numbers  


**system:**
You are a desktop calculator that computes the value of mathematical expressions.
The input is base 10.
Your output should be just a base 10 numerical result.

**user:** _1+1_

**assistant:**
2.0


## Case: ceb - PASSED
**Keywords:** text  


**system:**
You are a desktop calculator that computes the value of mathematical expressions.
The input is base 10.
Your output should be just a base 10 numerical result.

**user:** _one hundred two divided by two_

**assistant:**
51.0


## Case: 178 - PASSED
**Keywords:** hexidecimal  


**system:**
You are a desktop calculator that computes the value of mathematical expressions.
The input is base 16.
Your output should be just a base 10 numerical result.

**user:** _ff + a_

**assistant:**
265.0
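
The settings dictionary passed to `gt.run` uses dotted keys such as `prepare.template` and `infer.model.name`. Assuming these keys address values nested inside a larger configuration (this notebook does not show how Gotaglio resolves them), the idea can be sketched as follows. `apply_patch` and `defaults` are hypothetical names for illustration and are not part of Gotaglio.

```python
import copy
from typing import Any


def apply_patch(config: dict[str, Any], patch: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `config` with each dotted key in `patch` applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in patch.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Hypothetical defaults; the real ones would come from the pipeline definition.
defaults = {"prepare": {"template": None}, "infer": {"model": {"name": None}}}
patched = apply_patch(
    defaults,
    {"prepare.template": "data/template.txt", "infer.model.name": "perfect"},
)
print(patched)
```

Working on a deep copy keeps the defaults unchanged, so the same base configuration can be patched differently for each run.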



We can rerun this example with a different model, this time the `flakey` mock, which cycles through three behaviors: returning the correct answer, returning "hello world", and raising an exception. The mock models are useful for debugging new pipelines locally before connecting to a real model.

In [5]:
result2 = gt.rerun(
  result,
  {
    "infer.model.name": "flakey"
  },
  save=True
)
gt.format(result2)


            Summary for 82f910d6-dcc4-4514-a6dc-2023205519ff
┏━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  id ┃ status   ┃ cost ┃ keywords    ┃ user                           ┃
┡━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 6f3 │ COMPLETE │ 0.00 │ numbers     │ 1+1                            │
│ ceb │ ERROR    │      │ text        │ one hundred two divided by two │
│ 178 │ ERROR    │      │ hexidecimal │ ff + a                         │
└─────┴──────────┴──────┴─────────────┴────────────────────────────────┘

Total: 3
Complete: 1/3 (33.33%)

## Run: 82f910d6-dcc4-4514-a6dc-2023205519ff
## Case: 6f3 - PASSED
**Keywords:** numbers  


**system:**
You are a desktop calculator that computes the value of mathematical expressions.
The input is base 10.
Your output should be just a base 10 numerical result.

**user:** _1+1_

**assistant:**
2.0


## Case: ceb - FAILED
**Keywords:** text  


### Turn 1: **ERROR**  
Error: Context: Extracting numerical answer from LLM response.
Error: could not convert string to float: 'hello world'

~~~
Traceback: Traceback (most recent call last):
  File "C:\git\llm-tools\gotaglio\gotaglio\pipeline2.py", line 129, in process_one_case
    await run_dag(dag, result)
  File "C:\git\llm-tools\gotaglio\gotaglio\dag.py", line 104, in run_dag
    await run_dag_helper(dag_object, context, stages)
  File "C:\git\llm-tools\gotaglio\gotaglio\dag.py", line 166, in run_dag_helper
    (name, result) = task.result()
                     ^^^^^^^^^^^^^
  File "C:\Users\mhop\AppData\Local\Programs\Python\Python312\Lib\asyncio\futures.py", line 202, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "C:\Users\mhop\AppData\Local\Programs\Python\Python312\Lib\asyncio\tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "C:\git\llm-tools\gotaglio\gotaglio\dag.py", line 91, in run_task
    raise e
  File "C:\git\llm-tools\gotaglio\gotaglio\dag.py", line 83, in run_task
    result = await dag["function"](context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\git\llm-tools\gotaglio\samples2\calc\calc.py", line 143, in extract
    return float(context["stages"]["infer"])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'hello world'

Time: 2025-08-16 16:39:10.715212+00:00
~~~
## Case: 178 - FAILED
**Keywords:** hexidecimal  


### Turn 1: **ERROR**  
Error: Error: Flakey model failed

~~~
Traceback: Traceback (most recent call last):
  File "C:\git\llm-tools\gotaglio\gotaglio\pipeline2.py", line 129, in process_one_case
    await run_dag(dag, result)
  File "C:\git\llm-tools\gotaglio\gotaglio\dag.py", line 104, in run_dag
    await run_dag_helper(dag_object, context, stages)
  File "C:\git\llm-tools\gotaglio\gotaglio\dag.py", line 166, in run_dag_helper
    (name, result) = task.result()
                     ^^^^^^^^^^^^^
  File "C:\Users\mhop\AppData\Local\Programs\Python\Python312\Lib\asyncio\futures.py", line 202, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "C:\Users\mhop\AppData\Local\Programs\Python\Python312\Lib\asyncio\tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "C:\git\llm-tools\gotaglio\gotaglio\dag.py", line 91, in run_task
    raise e
  File "C:\git\llm-tools\gotaglio\gotaglio\dag.py", line 83, in run_task
    result = await dag["function"](context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\git\llm-tools\gotaglio\samples2\calc\calc.py", line 136, in infer
    return await model.infer(context["stages"]["prepare"], context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\git\llm-tools\gotaglio\gotaglio\mocks.py", line 30, in infer
    raise Exception("Flakey model failed")
Exception: Flakey model failed

Time: 2025-08-16 16:39:10.726994+00:00
~~~
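
The three outcomes in the rerun above (a correct answer, a `ValueError` in the extract stage, and an exception raised by the model) can be reproduced with a small stand-in. This is a minimal sketch, not Gotaglio's actual mock: `FlakyMock`, the `context["answer"]` lookup, and the standalone `extract` function are assumptions made for illustration. Only the `float()` conversion and the two error messages are taken from the tracebacks.

```python
import asyncio


class FlakyMock:
    """Hypothetical mock: correct answer, then junk text, then an exception."""

    def __init__(self) -> None:
        self._calls = 0

    async def infer(self, messages, context):
        step = self._calls % 3
        self._calls += 1
        if step == 0:
            return str(context["answer"])  # assumed location of the expected answer
        if step == 1:
            return "hello world"
        raise Exception("Flakey model failed")


def extract(text: str) -> float:
    # Same conversion as the extract stage shown in the traceback.
    return float(text)


async def main() -> list[str]:
    model = FlakyMock()
    outcomes = []
    for _ in range(3):
        try:
            reply = await model.infer([], {"answer": 2})
            outcomes.append(str(extract(reply)))
        except Exception as error:
            outcomes.append(f"{type(error).__name__}: {error}")
    return outcomes


outcomes = asyncio.run(main())
for line in outcomes:
    print(line)
```

The second call fails in `extract` rather than in the model, which is why the two failing cases above show different tracebacks.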

Note that we can use the term `"latest"` to refer to the most recently generated runlog.
We could also pass the first few characters of a run's UUID or we could pass the result
object (e.g. `result` or `result2`) directly.

In [6]:
gt.summarize("latest")

            Summary for 82f910d6-dcc4-4514-a6dc-2023205519ff
┏━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  id ┃ status   ┃ cost ┃ keywords    ┃ user                           ┃
┡━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 6f3 │ COMPLETE │ 0.00 │ numbers     │ 1+1                            │
│ ceb │ ERROR    │      │ text        │ one hundred two divided by two │
│ 178 │ ERROR    │      │ hexidecimal │ ff + a                         │
└─────┴──────────┴──────┴─────────────┴────────────────────────────────┘

Total: 3
Complete: 1/3 (33.33%)

Gotaglio allows us to compare the results of two runs.

In [7]:
gt.compare(result, result2)

Run A: 618a4235-71dc-4f6d-9e04-3bfe96723e59
Run B: 82f910d6-dcc4-4514-a6dc-2023205519ff

0 cases only in A
0 cases only in B
3 cases in both A and B

               Comparison of A, B
┏━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃    id ┃          A ┃         B ┃ keywords    ┃
┡━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│   6f3 │     passed │    passed │ numbers     │
│   ceb │     passed │     error │ text        │
│   178 │     passed │     error │ hexidecimal │
├───────┼───────