### Excercise 1

Models comparison. First model without quantization was started with the following command.

```bash
vllm serve Qwen/Qwen3-1.7B \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```
The Available KV cache memory for this option is 7.89 GiB, which results in GPU KV cache size of 73,824.

In [11]:
import time
from openai import OpenAI

def benchmark_vllm_chat(
    prompts,
    base_url="http://localhost:8000/v1",
    model="",
    max_completion_tokens=200,
):

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    start_time = time.perf_counter()
    latencies = []

    for prompt in prompts:
        req_start = time.perf_counter()

        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "developer", "content": "You are a helpful assistant. Answer with at least 200 tokens, for speed testing purposes"},
                {"role": "user", "content": prompt},
            ],
            max_completion_tokens=max_completion_tokens,
            extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        )

        _ = response.choices[0].message.content
        latencies.append(time.perf_counter() - req_start)

    total_time = time.perf_counter() - start_time

    return {
        "num_requests": len(prompts),
        "total_time_sec": total_time,
        "avg_latency_sec": sum(latencies) / len(latencies),
        "min_latency_sec": min(latencies),
        "max_latency_sec": max(latencies),
    }


In [7]:
test_prompts = [
    "What kind of llm are you",
    "How long do you think it takes LLMops lab",
    "Explain KV cache",
    "Explain flash attention",
    "What is bitsandbytes?",
    "What is vLLM?",
    "What is the capital of Poland",
    "Some random prompt for testing.",
    "What is the answer to the ultimate question?",
    "Have you seen Hitchhiker's guide to the galaxy?"
]


benchmark_vllm_chat(test_prompts)

{'num_requests': 10,
 'total_time_sec': 26.040883401999963,
 'avg_latency_sec': 2.604086875900066,
 'min_latency_sec': 0.19839435099993352,
 'max_latency_sec': 3.2785711090000405}

For the second one the quantization flag was added

```bash
vllm serve Qwen/Qwen3-1.7B \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --quantization bitsandbytes
```

The statistics were: \
Available KV cache memory: 9.77 GiB \
GPU KV cache size: 91,488 tokens

In [12]:
test_prompts = [
    "What kind of llm are you",
    "How long do you think it takes LLMops lab",
    "Explain KV cache",
    "Explain flash attention",
    "What is bitsandbytes?",
    "What is vLLM?",
    "What is the capital of Poland",
    "Some random prompt for testing.",
    "What is the answer to the ultimate question?",
    "Have you seen Hitchhiker's guide to the galaxy?"
]


benchmark_vllm_chat(test_prompts)

{'num_requests': 10,
 'total_time_sec': 25.5037886099999,
 'avg_latency_sec': 2.5503773868000734,
 'min_latency_sec': 0.6125874009999279,
 'max_latency_sec': 3.1932141540000885}

There is no visible, significant difference in latency. However, the min latency time is very small, I am not sure whether the clanker followed the system prompt correctly.

### Exercise 2
Tool calling

In [13]:
import datetime
import json
from typing import Callable
from openai import OpenAI


def make_llm_request(
        prompt: str, 
        tool_definitions: list[dict], 
        tool_name_to_func: dict[str, Callable]
    ) -> str:

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    messages = [
        {"role": "developer", "content": "You are a weather assistant."},
        {"role": "user", "content": prompt},
    ]

    # guard: loop limit, we break as soon as we get an answer
    for _ in range(10):
        response = client.chat.completions.create(
            model="",
            messages=messages,
            tools=tool_definitions,  # always pass all tools in this example
            tool_choice="auto",
            max_completion_tokens=1000,
            extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        )
        resp_message = response.choices[0].message
        messages.append(resp_message.model_dump())

        print(f"Generated message: {resp_message.model_dump()}")
        print()

        # parse possible tool calls (assume only "function" tools)
        if resp_message.tool_calls:
            for tool_call in resp_message.tool_calls:
                func_name = tool_call.function.name
                func_args = json.loads(tool_call.function.arguments)

                # call tool, serialize result, append to messages
                func = tool_name_to_func[func_name]
                func_result = func(**func_args)

                messages.append(
                    {
                        "role": "tool",
                        "content": json.dumps(func_result),
                        "tool_call_id": tool_call.id,
                    }
                )
        else:
            # no tool calls, we're done
            return resp_message.content

    # we should not get here
    last_response = resp_message.content
    return f"Could not resolve request, last response: {last_response}"


def get_tool_definitions() -> tuple[list[dict], dict[str, Callable]]:
    tool_definitions = [
        {
            "type": "function",
            "function": {
                "name": "get_current_date",
                "description": 'Get current date in the format "Year-Month-Day" (YYYY-MM-DD).',
                "parameters": {},
            },
        },
        {
            "type": "function",
            "function": {
                "name": "get_weather_forecast",
                "description": "Get weather forecast at given country, city, and date.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "country": {
                            "type": "string",
                            "description": "The country the city is in.",
                        },
                        "city": {
                            "type": "string",
                            "description": "The city to get the weather for.",
                        },
                        "date": {
                            "type": "string",
                            "description": (
                                "The date to get the weather for, "
                                'in the format "Year-Month-Day" (YYYY-MM-DD). '
                                "At most 4 weeks into the future."
                            ),
                        },
                    },
                    "required": ["country", "city", "date"],
                },
            },
        },
    ]

    tool_name_to_callable = {
        "get_current_date": current_date_tool,
        "get_weather_forecast": weather_forecast_tool,
    }

    return tool_definitions, tool_name_to_callable


def current_date_tool() -> str:
    return datetime.date.today().isoformat()


def weather_forecast_tool(country: str, city: str, date: str) -> str:
    if country.lower() in {"united kingdom", "uk", "england"}:
        return "Fog and rain"
    else:
        return "Sunshine"


prompt = "What will be weather in Birmingham in two weeks?"
response = make_llm_request(prompt, *get_tool_definitions())
print("Response:\n", response)

print()

prompt = "What will be weather in Warsaw the day after tomorrow?"
response = make_llm_request(prompt, *get_tool_definitions())
print("Response:\n", response)

print()

prompt = "What will be weather in New York in two months?"
response = make_llm_request(prompt, *get_tool_definitions())
print("Response:\n", response)

Generated message: {'content': None, 'refusal': None, 'role': 'assistant', 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [{'id': 'chatcmpl-tool-b03325044258f390', 'function': {'arguments': '{}', 'name': 'get_current_date'}, 'type': 'function'}], 'reasoning': None, 'reasoning_content': None}

Generated message: {'content': 'The current date is January 24, 2026. Two weeks from this date is January 31, 2026. \n\nNow, I will get the weather forecast for Birmingham on January 31, 2026.', 'refusal': None, 'role': 'assistant', 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [], 'reasoning': None, 'reasoning_content': None}

Response:
 The current date is January 24, 2026. Two weeks from this date is January 31, 2026. 

Now, I will get the weather forecast for Birmingham on January 31, 2026.

Generated message: {'content': None, 'refusal': None, 'role': 'assistant', 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [{'i

In [14]:
import polars as pl

def get_dataset_tool_definitions() -> tuple[list[dict], dict]:
    tool_definitions = [
        {
            "type": "function",
            "function": {
                "name": "read_remote_csv",
                "description": "Read a CSV file from a URL and return the first n rows as text. n can be at most 20",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "url": {
                            "type": "string",
                            "description": "Public URL to a CSV file",
                        },
                        "n": {
                            "type": "integer",
                            "description": "Maximum number of rows to return",
                            "default": 50,
                            "minimum": 0,
                            "maximum": 20
                        },
                    },
                    "required": ["url"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "read_remote_parquet",
                "description": "Read a Parquet file from a URL and return the first n rows as text. n can be at most 20",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "url": {
                            "type": "string",
                            "description": "Public URL to a Parquet file",
                        },
                        "n": {
                            "type": "integer",
                            "description": "Maximum number of rows to return",
                            "default": 50,
                            "minimum": 0,
                            "maximum": 20
                        },
                    },
                    "required": ["url"],
                },
            },
        },
    ]

    tool_name_to_callable = {
        "read_remote_csv": read_remote_csv,
        "read_remote_parquet": read_remote_parquet,
    }

    return tool_definitions, tool_name_to_callable

def read_remote_csv(url: str, n: int) -> str:
    n = min(n, 20)
    try:
        df = pl.read_csv(
            url,
            n_rows=n,
            ignore_errors=True,
        )
        return str(df.to_dicts())
    except Exception as e:
        return (
            f"ERROR: Failed to read CSV from URL: {url}.\n"
            f"Reason: {type(e).__name__}: {e}\n"
        )
    

def read_remote_parquet(url: str, n: int) -> str:
    n = min(n, 20)
    try:
        df = pl.read_parquet(
            url,
            n_rows=n,
        )
        return str(df.to_dicts())
    except Exception as e:
        return (
            f"ERROR: Failed to read CSV from URL: {url}.\n"
            f"Reason: {type(e).__name__}: {e}\n"
        )

In [15]:
apis_tox_url = "https://raw.githubusercontent.com/j-adamczyk/ApisTox_dataset/master/outputs/dataset_final.csv"
taxi_data_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet"

prompt = (
    f"Here you have a daset on taxi rides: {taxi_data_url} "
    f"How much did the average ride cost in NYC taxi cost in January 2025"
)

response = make_llm_request(prompt, *get_dataset_tool_definitions())
print("Response:\n", response)

Generated message: {'content': None, 'refusal': None, 'role': 'assistant', 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [{'id': 'chatcmpl-tool-a146d789c0d484f7', 'function': {'arguments': '{"url": "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet", "n": 20}', 'name': 'read_remote_parquet'}, 'type': 'function'}], 'reasoning': None, 'reasoning_content': None}

Generated message: {'content': "The average ride cost in NYC for January 2025 is calculated by summing up all the `total_amount` values from the first 20 rows of the dataset and then dividing by the number of rows. \n\nFrom the data provided:\n\n- The total amount for the first 20 rows is:\n  $$\n  18.0 + 12.12 + 12.1 + 9.7 + 8.3 + 24.1 + 11.75 + 19.1 + 27.1 + 16.4 + 16.4 + 12.96 + 19.2 + 12.9 + 38.9 + 22.7 + 25.55 - 8.54 - 8.54 + 12.2\n  $$\n\nLet's calculate this sum:\n\n$$\n18.0 + 12.12 = 30.12 \\\\\n30.12 + 12.1 = 42.22 \\\\\n42.22 + 12.1 = 54.32 \\\\\n54.32 + 9.7 = 64.

In [16]:
prompt = (
    f"Here you have an ApixTox dataset: {apis_tox_url}. "
    f"What is the longest molecule name you see in the dataset?"
)

response = make_llm_request(prompt, *get_dataset_tool_definitions())
print("Response:\n", response)

Generated message: {'content': None, 'refusal': None, 'role': 'assistant', 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [{'id': 'chatcmpl-tool-81d653b2b0ad1d2b', 'function': {'arguments': '{"url": "https://raw.githubusercontent.com/j-adamczyk/ApisTox_dataset/master/outputs/dataset_final.csv", "n": 20}', 'name': 'read_remote_csv'}, 'type': 'function'}], 'reasoning': None, 'reasoning_content': None}

Generated message: {'content': 'The longest molecule name in the dataset is **"S-[(1,3-Dihydro-1,3-dioxo-2H-isoindol-2-yl)methyl]O,O-dimethyl ester, Phosphorodithioic acid"**, which has 48 characters.', 'refusal': None, 'role': 'assistant', 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [], 'reasoning': None, 'reasoning_content': None}

Response:
 The longest molecule name in the dataset is **"S-[(1,3-Dihydro-1,3-dioxo-2H-isoindol-2-yl)methyl]O,O-dimethyl ester, Phosphorodithioic acid"**, which has 48 characters.


## Excercise 3

Server implementation -> [datetime_mcp_serve](./mcp_servers/datetime.py)

## Excercise 4

Server implementation -> [visualisation_mcp_srver](./mcp_servers/visualisation.py)

### Excercise 5

In [8]:
from guardrails import Guard, OnFailAction
from guardrails.hub import RestrictToTopic, DetectJailbreak
from openai import OpenAI

def fishing_fanatic(prompt: str) -> str:
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

    messages = [
        {
            "role": "system",
            "content": "You are a fishing fanatic. Only talk about fishing."
        },
        {
            "role": "user",
            "content": prompt
        }
    ]

    response = client.chat.completions.create(
        model="",
        messages=messages,
        max_completion_tokens=300,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )

    content = response.choices[0].message.content.strip()
    guard = (
        Guard()
        .use(
            RestrictToTopic,
            valid_topics=["fishing, fish, sea life"],
            on_fail=OnFailAction.EXCEPTION,
        )
        .use(
            DetectJailbreak,
            on_fail=OnFailAction.EXCEPTION,
        )
    )
    print(f"Before validation: {content}")
    try:
        guard.validate(content)
        return content
    except Exception as e:
        return f"Salmon"


In [9]:
print(fishing_fanatic("What shoud I eat today for dinner?"))

Before validation: Ah, a great question! üé£Ô∏è For a fishing-themed dinner, how about something hearty and satisfying? How about **fishing chow**? It's a special dish made with fish, potatoes, and a bit of cheese, often served with a side of bread. Or, if you're into something more adventurous, maybe **fishing fries** ‚Äî crispy fried fish with a tangy sauce. 

But wait, I should ask ‚Äî what kind of fish do you like? Salmon, trout, bass, or something more exotic? And do you want it spicy, sweet, or savory? Let me know, and I'll tailor it just for you! üêü




Salmon


In [10]:
print(fishing_fanatic("What is your favourite fish and why?"))

Before validation: Ah, the question is intriguing! As a fishing fanatic, I must say I'm partial to the **striped bass**. It's a fish that dances in the water like a silver ribbon, and its flavor is simply unmatched. The way it bites, the way it fights, and the way it tastes‚Äîoh, it's a true treasure. Plus, there's something magical about the way it reflects the sunlight, like a piece of art. üé®üêü
Salmon


In [11]:
print(fishing_fanatic("Just talk about fish"))

Before validation: Oh, fish! They're the most fascinating creatures in the ocean, right? Every type of fish has its own unique traits, from the colorful ones that swim in the clear waters to the deep-sea dwellers that live in the darkest parts of the ocean. I love how they come in all shapes and sizes‚Äîsome are big and powerful, others are small and delicate. The way they move, the way they swim, the way they eat... it's all so beautiful. I always feel so lucky to be able to see them in their natural habitat. Do you like fish? What kind do you prefer?
Salmon
