# Query Ollama with `requests`

In [1]:
import requests

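def query_ollama(prompt, model="llama3"):
    """Send one prompt to a local Ollama server and return the full reply text."""
    url = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False  # return one JSON object instead of a line-by-line stream
    }
    response = requests.post(url, json=data)
    response.raise_for_status()  # fail loudly on HTTP errors (e.g. unknown model)
    return response.json()["response"]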

# Example usage:
print(query_ollama("What is the capital of Germany?"))

The capital of Germany is Berlin.


# Streaming response with Markdown

In [2]:
import requests
import json
import re
from IPython.display import display, Markdown, clear_output

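def format_markdown(text):
    # Double each newline so single line breaks render as paragraph breaks.
    text = text.replace("\n", "\n\n")
    # Bold every run of capitalized words and every acronym of 3+ letters.
    # Caveat: this also hits ordinary sentence-initial words ("In", "The") and
    # words the model already bolded, which then come out as ****Word****.
    text = re.sub(r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\b', r'**\1**', text)
    text = re.sub(r'\b([A-Z]{3,})\b', r'**\1**', text)
    return text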

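def stream_ollama_markdown(prompt, model="llama3"):
    """Stream a reply from Ollama and re-render it as Markdown after every chunk."""
    url = "http://localhost:11434/api/generate"
    data = {"model": model, "prompt": prompt, "stream": True}

    response = requests.post(url, json=data, stream=True)
    response.raise_for_status()  # fail loudly on HTTP errors
    buffer = ""
    # display_id=True returns a handle whose output can be updated in place.
    output_display = display(Markdown(""), display_id=True)

    # With "stream": True, Ollama sends one JSON object per line.
    for line in response.iter_lines():
        if line:
            try:
                # Use a separate name so the request payload `data` is not shadowed.
                chunk = json.loads(line.decode("utf-8"))
                text = chunk.get("response", "")
                if text:
                    buffer += text
                    clear_output(wait=True)
                    output_display.update(Markdown(format_markdown(buffer)))
            except json.JSONDecodeError:
                pass  # skip lines that are not valid JSON

    # Final display (in case last chunk isn’t shown)
    clear_output(wait=True)
    output_display.update(Markdown(format_markdown(buffer)))
    # The function only renders its output; it returns nothing.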


# Try it
stream_ollama_markdown("Explain the concept of self-attention in Transformers.")


**Self**-attention! A fundamental building block of the **Transformer** architecture.



**In** natural language processing (**NLP**), self-attention refers to the ability of a model to weigh and combine different parts of an input sequence with respect to each other. **In** other words, it allows the model to attend to different regions of the input sequence simultaneously, and then aggregate this information to produce a representation that captures the relationships between these regions.



**In** the **Transformer** architecture, self-attention is used in three main components:



1. ****Encoder****: **The** encoder takes in a sequence of tokens (e.g., words or characters) as input and outputs a continuous representation for each token.

2. ****Decoder****: **The** decoder generates an output sequence one token at a time, using the encoder's output as context.

3. ****Multi**-head attention**: **This** is a mechanism that allows the model to jointly attend to information from different representation dimensions (e.g., different word embeddings) at the same time.



**Here**'s how self-attention works in **Transformers**:



****Attention** mechanism**



**Given** an input sequence `x = [x_1, ..., x_n]` of length `n`, where each `x_i` is a token embedding, the attention mechanism computes a weighted sum of these tokens. **The** goal is to produce a representation that captures the relationships between different parts of the input sequence.



**The** attention process involves three main steps:



1. ****Query** (Q)**: **Compute** a query vector `q_i` for each token `x_i`. **This** is typically done using a linear transformation applied to the token embedding.

2. ****Key** (K)** and ****Value** (V)**: **Compute** key and value vectors `k_i` and `v_i`, respectively, for each token `x_i`. **These** are also typically computed using linear transformations applied to the token embedding.

3. ****Attention** scores**: **Calculate** attention scores `scores[i] = softmax(Q * K^T / sqrt(d))`, where `d` is a hyperparameter (e.g., the dimensionality of the query and key vectors). **The** scores represent the relative importance of each token with respect to all other tokens.



****Self**-attention**



**In** self-attention, the attention mechanism is applied recursively to the output of the previous iteration. **This** allows the model to attend to different parts of the input sequence simultaneously and then aggregate this information.



**The** self-attention process involves three main components:



1. ****Query** (Q)**: **Compute** a query vector `q_i` for each token `x_i`, using the output of the previous attention layer as context.

2. ****Key** (K)** and ****Value** (V)**: **Compute** key and value vectors `k_i` and `v_i`, respectively, for each token `x_i`, also using the output of the previous attention layer as context.

3. ****Attention** scores**: **Calculate** attention scores `scores[i] = softmax(Q * K^T / sqrt(d))`. **The** scores represent the relative importance of each token with respect to all other tokens in the sequence.



****Multi**-head attention**



**To** capture different types of relationships between tokens, such as word order and semantic meaning, multi-head attention is used. **This** involves applying self-attention multiple times in parallel, using different sets of learnable weights. **The** output of these multiple attention heads is concatenated and linearly transformed to produce the final output.



**In** summary, self-attention in **Transformers** allows the model to attend to different parts of an input sequence simultaneously, weighing their importance with respect to each other. **This** enables the model to capture complex relationships between tokens and generate coherent, contextualized representations.
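
The streaming cell above relies on Ollama sending one JSON object per line, each carrying a `response` fragment, with the last object marked `done`. The following is a minimal sketch of just that parsing step, separated from the notebook display code so it can be tried without a running server. The helper name `collect_stream` and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the "response" fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw:  # iter_lines() can yield empty keep-alive lines
            continue
        try:
            event = json.loads(raw.decode("utf-8"))
        except json.JSONDecodeError:
            continue  # skip malformed lines, as the cell above does
        parts.append(event.get("response", ""))
        if event.get("done"):  # the final object of a stream sets "done": true
            break
    return "".join(parts)


# Hypothetical sample lines standing in for response.iter_lines()
sample = [
    b'{"response": "Self", "done": false}',
    b'',
    b'{"response": "-attention", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(sample))  # prints: Self-attention
```

With a live server, `collect_stream(response.iter_lines())` should return the same text that the notebook cell accumulates in `buffer`.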

# Preparing Python-based Ollama access

In [3]:
!pip install ollama



## First example: querying Ollama without streaming

In [4]:
import ollama
response = ollama.chat(model='llama3', messages=[{"role": "user", "content": "What is the capital of France?"}])
print(response['message']['content'])

The capital of France is Paris.
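
For comparison, the `ollama` package can also stream a chat reply: passing `stream=True` to `ollama.chat` yields chunks whose text sits under `chunk['message']['content']`. The sketch below wraps that in a hypothetical helper, `stream_chat`. The `chat_fn` parameter is an assumption added here so the helper can be exercised with a stand-in function when no Ollama server is available.

```python
from typing import Callable, Iterable, Optional


def stream_chat(prompt: str, model: str = "llama3",
                chat_fn: Optional[Callable[..., Iterable[dict]]] = None) -> str:
    """Print a chat reply piece by piece and return the complete text."""
    if chat_fn is None:
        import ollama  # imported lazily; only needed for real requests
        chat_fn = ollama.chat

    parts = []
    for chunk in chat_fn(model=model,
                         messages=[{"role": "user", "content": prompt}],
                         stream=True):
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)  # show text as soon as it arrives
        parts.append(piece)
    print()
    return "".join(parts)


# Stand-in for ollama.chat, used here only to show the chunk shape.
def fake_chat(model, messages, stream):
    for piece in ["The capital ", "of France ", "is Paris."]:
        yield {"message": {"content": piece}}


reply = stream_chat("What is the capital of France?", chat_fn=fake_chat)
```

Calling `stream_chat("What is the capital of France?")` without `chat_fn` sends the request to the local Ollama server instead.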


## Token Speed

In [5]:
import requests
import time

models = [
    "mistral",
    "llama3",
    "deepseek-r1:7b",
    "deepseek-r1:1.5b"
]

prompt = "Explain self-attention in two paragraphs."
page_token_count = 333  # ≈ one page

results = []

# Note: this redefines query_ollama from the first cell, with the arguments
# in a different order (model first).
def query_ollama(model, prompt):
    url = "http://localhost:11434/api/generate"
    data = {"model": model, "prompt": prompt, "stream": False}
    response = requests.post(url, json=data)
    response.raise_for_status()
    return response.json()["response"]

for model in models:
    print(f"⏳ Testing model: {model}")
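    start = time.time()
    response = query_ollama(model, prompt)
    end = time.time()  # wall-clock time, including any model load time

    # Whitespace-separated words serve as a rough stand-in for tokens here;
    # real tokenizer counts are usually higher.
    tokens = len(response.split())
    duration = end - start
    tokens_per_sec = tokens / duration
    seconds_per_page = page_token_count / tokens_per_sec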

    results.append({
        "model": model,
        "tokens": tokens,
        "duration_sec": duration,
        "tokens/sec": tokens_per_sec,
        "sec/page": seconds_per_page
    })




⏳ Testing model: mistral
⏳ Testing model: llama3
⏳ Testing model: deepseek-r1:7b
⏳ Testing model: deepseek-r1:1.5b


In [6]:
# Display results nicely
import pandas as pd
df = pd.DataFrame(results)
df = df.round(2)
display(df)

|   | model | tokens | duration_sec | tokens/sec | sec/page |
|---|---|---|---|---|---|
| 0 | mistral | 220 | 52.64 | 4.18 | 79.68 |
| 1 | llama3 | 270 | 49.18 | 5.49 | 60.66 |
| 2 | deepseek-r1:7b | 613 | 121.55 | 5.04 | 66.03 |
| 3 | deepseek-r1:1.5b | 561 | 28.68 | 19.56 | 17.02 |
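
The figures above come from whitespace-separated word counts and wall-clock time, so they only approximate token throughput. A non-streaming `/api/generate` reply also carries Ollama's own counters: `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds). The sketch below derives the same two metrics from those fields. The helper names and the sample values are made up for illustration and are not measurements.

```python
def generation_speed(payload: dict) -> float:
    """Tokens per second from Ollama's counters (eval_duration is in nanoseconds)."""
    return payload["eval_count"] / (payload["eval_duration"] / 1e9)


def seconds_per_page(tokens_per_sec: float, page_token_count: int = 333) -> float:
    """Time to generate one page, using the notebook's 333-token page size."""
    return page_token_count / tokens_per_sec


# Hypothetical reply fields: 200 tokens generated in 10 seconds.
sample = {"eval_count": 200, "eval_duration": 10_000_000_000}
speed = generation_speed(sample)
print(round(speed, 2), round(seconds_per_page(speed), 2))  # prints: 20.0 16.65
```

To use this in the benchmark loop, `query_ollama` would need to return the whole `response.json()` dictionary rather than only its `"response"` field.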
