Authored by: Aryan Mistry

# Tool Calling & Structured Outputs

Sometimes a language model needs help: it may be asked to add numbers, get the weather, or perform some other operation that it cannot do reliably from its training data alone. **Tool calling** allows a model (or an agent around it) to invoke external functions. In this notebook you'll build simple tools, write a naive router that decides which tool to call, and learn to validate the structured outputs that tools return. [19]

## 1. Defining Simple Tools

We'll implement two simple tools:

- **Calculator:** Safely evaluates basic arithmetic expressions consisting of
  numbers and the operators `+`, `-`, `*`, `/` and parentheses.
- **Weather Service:** Returns a fake temperature for a given city from a
  predefined dictionary.

Each tool takes a structured input (e.g. a string expression or a city
name) and returns a dictionary containing the result or an error message.
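
The result-or-error convention described above can be made explicit with a few small helpers. This is only a minimal sketch: `ok`, `fail`, and `is_error` are hypothetical names introduced here for illustration and are not part of the notebook's tools.

```python
def ok(**fields) -> dict:
    """Build a success payload, e.g. ok(result=7.0) gives {'result': 7.0}."""
    return dict(fields)


def fail(message: str) -> dict:
    """Build an error payload with a single 'error' key."""
    return {'error': message}


def is_error(output: dict) -> bool:
    """Treat a tool output as a failure when it carries an 'error' key."""
    return 'error' in output


print(is_error(ok(result=7.0)))        # False
print(is_error(fail('Unknown city')))  # True
```

Callers can then branch on `is_error(output)` before reading any other key.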


In [1]:

import ast
import operator
import re

# safe_eval: a minimal AST-based evaluator (one possible implementation of the helper).
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str):
    """Evaluate arithmetic by walking the parsed AST instead of calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError('Unsupported expression')
    # strip() so that leading whitespace is not parsed as indentation
    return _eval(ast.parse(expression.strip(), mode='eval').body)


def calculator(expression: str) -> dict:
    """Safely evaluate a basic arithmetic expression.

    Only digits and operators +, -, *, /, parentheses and spaces are allowed. Returns a dict with 'result' or an 'error'.
    """
    if not re.fullmatch(r'[0-9+\-*/(). ]+', expression):
        return {'error': 'Invalid characters in expression'}
    try:
        result = safe_eval(expression)
        return {'result': result}
    except Exception as e:
        return {'error': str(e)}


def weather_service(city: str) -> dict:
    """Return a fake temperature for a known city, or an error dict."""
    database = {'London': 18.5, 'New York': 22.0, 'Paris': 20.0, 'Tokyo': 25.0}
    city_norm = city.title()
    if city_norm not in database:
        return {'error': f"I don't have weather data for {city_norm}."}
    return {'city': city_norm, 'temperature': database[city_norm]}


## 2. Routing User Queries to Tools

A language model (or a simple wrapper) needs to decide which tool to call
based on the user's request. We'll implement a basic router that uses
keyword matching to route queries either to the calculator or the weather
service. More sophisticated routers might use natural language
understanding, but keyword heuristics suffice here.


In [2]:

# Spoken operators are translated to symbols before the expression is extracted.
WORD_OPS = {'plus': '+', 'minus': '-', 'times': '*', 'divided by': '/'}

def route_query(query: str) -> dict:
    """Route a user's natural language query to the appropriate tool.

    Uses simple keyword matching to pick a tool. Returns the chosen tool's output dict, or an error dict if no tool matches.
    """
    q = query.lower().strip()
    for word, symbol in WORD_OPS.items():
        q = q.replace(word, symbol)
    if any(op in q for op in '+-*/'):
        # Keep only the characters the calculator accepts
        expr = ''.join(ch for ch in q if ch.isdigit() or ch in '+-*/(). ').strip()
        return calculator(expr)
    if 'weather' in q or 'temperature' in q:
        # Naive extraction: last word of the query, without trailing punctuation
        city = query.split()[-1].strip('?.!,')
        return weather_service(city)
    return {'error': "I'm not sure which tool to use for that question."}

sample_queries = [
    'Calculate 3 + 5 * 2',
    'What is the temperature in London?',
    'What is 7 minus 4 divided by 3?',
    'Weather Paris',
    'How tall is Mount Everest?'
]
for q in sample_queries:
    print(f"Query: {q}\nResult: {route_query(q)}")


Query: Calculate 3 + 5 * 2
Result: {'result': 13}
Query: What is the temperature in London?
Result: {'city': 'London', 'temperature': 18.5}
Query: What is 7 minus 4 divided by 3?
Result: {'result': 5.666666666666667}
Query: Weather Paris
Result: {'city': 'Paris', 'temperature': 20.0}
Query: How tall is Mount Everest?
Result: {'error': "I'm not sure which tool to use for that question."}


## 3. Structured Outputs

Tools should return data in a consistent structure (e.g. dictionaries) so that
models or downstream code can parse the results reliably. In the examples
above, the calculator returns a dict with either a `result` key or an `error`
key, and the weather service returns a dict with keys `city` and
`temperature`.

Structured outputs make it easier to validate results and to separate the
model's reasoning from the tool's execution. [15]
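
One way to picture this separation: the model only emits a structured request, and ordinary code executes it. The sketch below assumes a hypothetical JSON call format with `tool` and `arguments` keys and uses stand-in tools (`add`, `echo`). These names are illustrative assumptions and do not come from any specific library.

```python
import json


def add(a: float, b: float) -> dict:
    """Stand-in tool: add two numbers."""
    return {'result': a + b}


def echo(text: str) -> dict:
    """Stand-in tool: return the text unchanged."""
    return {'text': text}


# Hypothetical registry mapping tool names to callables.
TOOLS = {'add': add, 'echo': echo}


def execute_tool_call(raw_call: str) -> dict:
    """Parse a JSON tool call and run the named tool, always returning a dict."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return {'error': 'Tool call is not valid JSON'}
    if not isinstance(call, dict):
        return {'error': 'Tool call must be a JSON object'}
    name = call.get('tool')
    if not isinstance(name, str) or name not in TOOLS:
        return {'error': f"Unknown tool: {name!r}"}
    args = call.get('arguments', {})
    if not isinstance(args, dict):
        return {'error': 'Arguments must be a JSON object'}
    try:
        return TOOLS[name](**args)
    except TypeError as exc:  # wrong, missing, or badly typed arguments
        return {'error': str(exc)}


print(execute_tool_call('{"tool": "add", "arguments": {"a": 2, "b": 3}}'))  # {'result': 5}
print(execute_tool_call('{"tool": "search", "arguments": {}}'))  # {'error': "Unknown tool: 'search'"}
```

Because every path returns a dictionary, the caller can validate the output the same way whether the tool succeeded or failed.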


## 4. Validating Tool Outputs

Even with simple tools, it's important to verify that the returned data
matches what your application expects. The evaluation notebook's
`validate_json` function does this for JSON strings; here we adapt the idea
to Python dictionaries so that we can check tool outputs have the correct
shape. [15]


In [3]:

def validate_output(result: dict, schema: dict) -> bool:
    """Validate a tool's return value against a schema.

    For every key in the schema, check that the result contains the key and that its value has the expected type.
    """
    for key, typ in schema.items():
        # isinstance accepts a single type or a tuple of types (used for numeric fields)
        if key not in result or not isinstance(result[key], typ):
            return False
    return True

calc_schema = {'result': (int, float)}
weather_schema = {'city': str, 'temperature': (int, float)}

print(validate_output({'result': 7.0}, calc_schema))
print(validate_output({'city': 'Paris', 'temperature': 20.0}, weather_schema))
print(validate_output({'city': 'Paris'}, weather_schema))


True
True
False
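
A boolean answer hides which field failed. As one possible extension, the sketch below returns a list of human-readable problems instead. `explain_mismatch` is a hypothetical helper, and it also rejects `bool` values for numeric fields, which is a stricter choice than `validate_output` makes (in Python, `bool` is a subclass of `int`).

```python
def explain_mismatch(result: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the result matches."""
    problems = []
    for key, typ in schema.items():
        if key not in result:
            problems.append(f"missing key '{key}'")
            continue
        value = result[key]
        allowed = typ if isinstance(typ, tuple) else (typ,)
        # bool is a subclass of int, so reject it unless the schema asks for bool.
        if isinstance(value, bool) and bool not in allowed:
            problems.append(f"'{key}' should not be a bool")
        elif not isinstance(value, allowed):
            problems.append(f"'{key}' has unexpected type {type(value).__name__}")
    return problems


weather_schema = {'city': str, 'temperature': (int, float)}

print(explain_mismatch({'city': 'Paris', 'temperature': 20.0}, weather_schema))  # []
print(explain_mismatch({'city': 'Paris'}, weather_schema))  # ["missing key 'temperature'"]
```

Returning the problems as data lets a router or agent pass the message back to the model or log it, rather than only knowing that validation failed.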


## 5. Exercises

1. **Add a unit conversion tool.** Write a function `convert_units(value: float, from_unit: str, to_unit: str)` that converts between units (e.g. Celsius ↔ Fahrenheit, kilometres ↔ miles). Update the router to detect phrases like "convert 10 miles to kilometres" and call this new tool.
2. **Improve routing.** Enhance `route_query` to recognise synonyms (e.g. "sum" instead of "plus", "forecast" instead of "weather") and to extract multi-word city names (e.g. "New York"). As a further step, replace the keyword matching with a regular expression or a small classifier to better detect arithmetic phrases.
3. **Handle invalid queries gracefully.** Modify the router so that it returns a helpful error message when the expression is malformed or the city is unknown, rather than silently failing.
4. **Nested tool calls.** Think about how you would support questions that require chaining tools, such as "What is the square of the temperature in Paris?" or "What is the average of 3 and 5?" (first use the calculator to add the numbers, then divide by two). Sketch out a plan for the planning and execution steps. How could an agent orchestrate multiple tool calls?

## References

Foundational LLMs & Transformers

1. Vaswani, A., et al. (2017). Attention is All You Need. Advances in Neural Information Processing Systems (NIPS 2017).
2. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
4. OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
5. Touvron, H., et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.

Generative AI & Sampling

6. Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS 2014.
7. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
8. Neal, R. M. (1993). Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, University of Toronto.

Retrieval-Augmented Generation (RAG) & Knowledge Grounding

9. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP. NeurIPS 2020.
10. deepset ai (2023). Haystack: Open-Source Framework for Search and RAG Applications. https://haystack.deepset.ai
11. LangChain (2023). LangChain Documentation and Cookbook. https://python.langchain.com

Evaluation & Safety

12. Papineni, K., et al. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. ACL 2002.
13. Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop 2004.
14. OpenAI (2024). Evaluating Model Outputs: Faithfulness and Grounding. OpenAI Docs.
15. Guardrails AI (2024). Open-Source Guardrails Framework. https://github.com/shreyar/guardrails

Prompt Engineering & Instruction Tuning

16. White, J. (2023). The Prompting Guide. https://www.promptingguide.ai
17. Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.

Agents & Tool Use

18. Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
19. LangChain (2024). LangChain Agents and Tools Documentation.
20. Microsoft (2023). Semantic Kernel Developer Guide. https://learn.microsoft.com/en-us/semantic-kernel/
21. Google DeepMind (2024). Gemini Technical Report. arXiv:2312.11805.

State, Memory & Orchestration

22. LangGraph (2024). Stateful Agent Orchestration Framework. https://langchain-langgraph.vercel.app
23. Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442.

Pedagogical and Course Design References

24. fast.ai (2023). fast.ai Deep Learning Course Notebooks. https://course.fast.ai
25. Ng, A. (2023). DeepLearning.AI Short Courses on Generative AI.
26. MIT 6.S191, Stanford CS324, UC Berkeley CS294-158. (2022–2024). Course Materials and Public Notebooks for ML and LLMs.