# Large Language Models (LLMs)

[ollama.com](https://ollama.com)  
[ollama Github doc](https://github.com/ollama/ollama)  
[ollama Python doc](https://github.com/ollama/ollama-python)  
[markdown doc](https://python-markdown.github.io/reference/)

Large Language Models (LLMs) only emerged in the second decade of the 21<sup>st</sup> century, yet they have taken the world by storm. At their core, they are **classifiers**: given a context (also called prefix or prompt, namely the preceding tokens), they give you the probabilities of all possible next **tokens** (words, parts of words, or single characters, which together form the *vocabulary* of the model).
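
To make the idea of a classifier over tokens concrete, here is a minimal sketch. The four-token vocabulary and the scores are made up for illustration (they do not come from any real model), and the `softmax` helper is our own: it turns raw scores into probabilities, from which the next token can be picked.

```python
import numpy as np

# Hypothetical toy vocabulary and raw scores (logits) for the context "The sky is".
# Real models have vocabularies of tens or hundreds of thousands of tokens.
vocab = ["blue", "green", "falling", "the"]
logits = np.array([4.0, 1.5, 0.5, -1.0])

def softmax(x):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = x - np.max(x)  # subtracting the max avoids overflow, result unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(logits)

for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")

# always picking the most likely token is called greedy decoding
next_token = vocab[int(np.argmax(probs))]
print("next token:", next_token)
```

Generating a whole text then amounts to repeating this step: append the chosen token to the context, and ask for the next distribution.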

If you want to know more, I highly recommend [this (technical, but very well illustrated) series](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) on the subject.

We will be using [Ollama](https://ollama.com) ([github](https://github.com/ollama/ollama), itself based on a lower-level project called [llama.cpp](https://github.com/ggml-org/llama.cpp)) to run LLMs locally. The other place I recommend if you want to know more is [Huggingface](https://huggingface.co/), which offers a set of libraries, a hub to share models and datasets, and many tutorials.

In [None]:
import ollama
import IPython
import numpy as np

import markdown
import strip_markdown

In [None]:
# utility function: check if a model is available locally, and download it if not
def check_model_and_pull(m_name):
    # test if the model is downloaded, if not pull (download) from the server
    if m_name not in [m.model for m in ollama.list().models]:
        print(f"model '{m_name}' not found, downloading...")
        # pull/download model
        ollama.pull(m_name)
    else:
        print(f"model: `{m_name}` found!")

[Gemma 3 family](https://ollama.com/library/gemma3)  
[Gemma 3 270m](https://ollama.com/library/gemma3:270m)

In [None]:
model_name = "gemma3:270m"

In [None]:
# test if the model is downloaded, if not pull from the server
check_model_and_pull(model_name)

response = ollama.chat(model=model_name, messages=[
  { "role": "user", "content": "Why is the sky blue?" },
])

print(response["message"]["content"])
# you can also access fields directly from the response object
# print(response.message.content)

In [None]:
print(response)

## Options

Full list [here](https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion).

In [None]:
print(
    ollama.chat(
        model=model_name,
        messages=[{ "role": "user", "content": "Why is the sky blue?" },],
        options={
            "temperature": 0.,
        })["message"]["content"]
)

In [None]:
print(
    ollama.chat(
        model=model_name,
        messages=[{ "role": "user", "content": "Why is the sky blue?" },],
        options={
            "temperature": 1.,
        })["message"]["content"]
)

In [None]:
print(
    ollama.chat(
        model=model_name,
        messages=[{ "role": "user", "content": "Why is the sky blue?" },],
        options={
            "temperature": 10,
            "num_predict": 200
        })["message"]["content"]
)
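
The three cells above only differ in their temperature. A minimal sketch of what this option does, under the usual assumption that the raw scores are divided by the temperature before the softmax (the scores and the helper `softmax_with_temperature` are hypothetical, not part of Ollama):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exps = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exps / exps.sum()

logits = [4.0, 1.5, 0.5, -1.0]  # hypothetical scores for four tokens

for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t:>4}: {np.round(probs, 3)}")
```

A low temperature concentrates almost all the probability on the top token (predictable output), while a high one flattens the distribution towards uniform (near-random output). A temperature of exactly 0 cannot be plugged into this formula; it is commonly treated as always picking the most likely token.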

In [None]:
print(
    ollama.chat(
        model=model_name,
        messages=[{ "role": "user", "content": "Why is the sky blue?" },],
        options={
            "stop": ["a"]
        })["message"]["content"]
)
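
The `stop` option explains why the output above is so short: generation ends as soon as one of the stop sequences would appear. The effect can be imitated on an already written string with the sketch below. The helper `truncate_at_stop` is hypothetical, and it assumes the stop sequence itself is left out of the result (a real model stops while generating rather than cutting the text afterwards).

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` just before the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# the first "a" is in "because", so everything from there on is dropped
print(truncate_at_stop("The sky is blue because of Rayleigh scattering.", ["a"]))
# → The sky is blue bec
```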

## Note: handling Markdown

[Markdown](https://pypi.org/project/Markdown/)  
[strip-markdown](https://pypi.org/project/strip-markdown/)

In [None]:
def md2html(text):
    return markdown.markdown(text)

def print_html(raw_html):
    IPython.display.display_html(raw_html, raw=True)

print_html(md2html(response.message.content))

In [None]:
def strip_md(text):
    return strip_markdown.strip_markdown(text)

print(strip_md(response.message.content))

## Gradual printing / streaming responses

In [None]:
stream = ollama.chat(
    model=model_name,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)

for chunk in stream:
  print(chunk["message"]["content"], end="", flush=True)

In [None]:
stream = ollama.chat(
    model=model_name,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)

text = ""
for chunk in stream:
    text += chunk["message"]["content"]
    IPython.display.clear_output(wait=True)
    print_html(md2html(text))

### Extra: Generate experiment: removing the template!

[doc](https://github.com/ollama/ollama-python?tab=readme-ov-file#generate), [REST API](https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion)

In [None]:
prompt = "Hi, how are you?"

print(
    ollama.generate(
        model=model_name,
        prompt=prompt,
        # if `True`, no formatting will be applied to the prompt!
        raw=True,
        options= {
            # here we limit the output to 50 tokens only
            "num_predict": 50,
            # you can play with the temperature if you want
            # "temperature": .9,
            }
        )["response"]
)

Looking at the [template](https://ollama.com/library/gemma3:270m/blobs/4b19ac7dd2fb), I can manually recreate the text that the model actually reads, so that it behaves again like a chatbot:

In [None]:
prompt = """<start_of_turn>user
Hi, how are you?<end_of_turn>
<start_of_turn>model
"""

print(
    ollama.generate(
        model=model_name,
        prompt=prompt,
        # if `True`, no formatting will be applied to the prompt!
        raw=True,
        options= {
            # here we limit the output to 50 tokens only
            "num_predict": 50,

            # "temperature": .9,
            
            }
        )["response"]
)
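
Writing these markers by hand quickly gets tedious. Below is a minimal sketch of a helper that builds such a raw prompt from a list of chat messages. The function name `build_gemma_prompt` is hypothetical, and it only covers the plain user/model turns seen in the prompt above (it assumes that assistant messages go in a `model` turn, and it ignores system prompts and images).

```python
def build_gemma_prompt(messages):
    """Format chat messages as a raw prompt using Gemma-style turn markers.

    Assumption: assistant messages use the role name "model", as in the
    hand-written prompt above; anything else is treated as a user turn.
    """
    parts = []
    for message in messages:
        role = "model" if message["role"] in ("assistant", "model") else "user"
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # leave an open model turn, so that the model writes the reply
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

prompt = build_gemma_prompt([{"role": "user", "content": "Hi, how are you?"}])
print(prompt)
```

The resulting string can be passed to `ollama.generate` with `raw=True`, as in the cell above.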

## Embeddings

Recent models allow you to work with the vector representation of their input, aka **embeddings**! Not all models support this, so you need to check their model card on Ollama. Here I use [all-minilm](https://ollama.com/library/all-minilm). The [Ollama search](https://ollama.com/search) page lets you filter by tags (like `embedding`) to list the models that support it.

In [None]:
embed_model_name = "all-minilm"

# test if the model is downloaded, if not pull from the server
check_model_and_pull(embed_model_name)
    
response = ollama.embed(
    model=embed_model_name,
    input=["Why is the sky blue?"], # can be a single string, or a list of strings
)

# the embeddings are a list of vectors, one per input string
print(response.embeddings)

### Cosine Similarity

[wiki](https://en.wikipedia.org/wiki/Cosine_similarity#Definition)  

We want to be able to measure how similar two vectors are. One way of doing this is to use trigonometry to compute the cosine of the **angle** $\theta$ between them:

$$ \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$$

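The "$\cdot$" is the [dot product](https://www.youtube.com/watch?v=LyGKycYT2v0&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&index=9). The division by $\|A\| \|B\|$, the normalisation, scales both vectors to length 1: it can be read as us making sure we're working with the unit circle (where we can apply trigonometry)! The result ranges from $-1$ (opposite directions), through $0$ (perpendicular), to $1$ (same direction).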

![cosine similarity images](pics/cosine-similarity-vectors.jpg)
[source](https://www.learndatasci.com/glossary/cosine-similarity/)

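**Note**:  
You can also find it implemented in `scikit-learn`, which you can install with `mamba` and then import as shown below. Note that this version expects 2D arrays (one vector per row) and returns a matrix of pairwise similarities: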
```python
from sklearn.metrics.pairwise import cosine_similarity
```

In [None]:
def cosine_similarity(vec1, vec2):
    """
    See here: https://gist.github.com/robert-mcdermott/5957ef1ddcfc7c3ba898d800531b2aa7
    """
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    
    # use a distinct name, to avoid shadowing the function itself
    similarity = dot_product / (norm1 * norm2)
    
    return similarity
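
Before applying this to embeddings, it helps to check the formula on vectors small enough to verify by hand. The sketch below redefines the same computation under the hypothetical name `cos_sim`, so that it can run on its own:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim([1, 0], [2, 0]))   # same direction, different length
print(cos_sim([1, 0], [0, 3]))   # perpendicular
print(cos_sim([1, 0], [-1, 0]))  # opposite directions
print(cos_sim([1, 1], [1, 0]))   # 45 degrees apart
```

The four results are 1, 0, -1, and about 0.707 (the cosine of 45°). The first case shows that only the direction matters, not the length of the vectors.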

### Comparing sentence vectors

In [None]:
sentences = [
    "Why is the sky blue?",
    "Why is the sky orange?",
    "Tonight I'll be eating soup"
]

response = ollama.embed(
    model=embed_model_name,
    input=sentences,
)

# we get one embedding vector per input sentence
print(len(response.embeddings))

In [None]:
def sentences_similarities(s1_id, s2_id, sentences, embeddings):
    print("Similarity between:")
    print(f" - '{sentences[s1_id]}'")
    print(f" - '{sentences[s2_id]}'")
    print(f"   => {cosine_similarity(embeddings[s1_id], embeddings[s2_id])}")

In [None]:
sentences_similarities(0, 1, sentences, response.embeddings)

In [None]:
sentences_similarities(0, 2, sentences, response.embeddings)

In [None]:
sentences_similarities(1, 2, sentences, response.embeddings)
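
With more sentences, comparing pairs one call at a time becomes unwieldy. A minimal sketch of computing all pairwise similarities at once: scale every vector to length 1, then multiply the matrix by its transpose. The helper `pairwise_cosine` and the three-dimensional toy vectors are made up for illustration (real embeddings have hundreds of dimensions).

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Return the matrix of cosine similarities between all rows."""
    X = np.asarray(embeddings, dtype=float)
    # scale every row to length 1, so that a dot product is a cosine
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_unit @ X_unit.T

# hypothetical stand-ins for three sentence embeddings
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
sims = pairwise_cosine(toy)
print(np.round(sims, 3))
```

Entry `sims[i, j]` is the similarity between vectors `i` and `j`, so the diagonal is all ones and the matrix is symmetric. In the notebook, `pairwise_cosine(response.embeddings)` should give the three values printed above in one go.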

## Extra: multimodality

[Gemma 3, 4b doc](https://ollama.com/library/gemma3:4b)

**Multimodality** refers to the fact that models are trained on, and can therefore understand/interact with, multiple types of data: in the most common case, *text* and *images* (but it could also be sound, video, sensory data from a robot, etc.). Here you can see an example of a model that is able to read/understand images.

Adapted from the [gemma3 example](https://github.com/ollama/ollama-python/blob/main/examples/multimodal-chat.py) – a bigger model, taking around 2-3GB in RAM. (See also the [llava example](https://github.com/ollama/ollama-python/blob/main/examples/multimodal-generate.py)).

In [None]:
# our new model name 
model_name = "gemma3:4b"

# download it if not present
check_model_and_pull(model_name)


[base64 doc](https://docs.python.org/3/library/base64.html)  
[w<sup>3</sup> tutorial](https://www.w3schools.com/Python/ref_module_base64.asp)  
[base64 RealPython tutorial](https://realpython.com/python-serialize-data/)

Ollama supports feeding the image as a path, a `base64` string (a kind of encoding, using only ASCII characters), or raw bytes.
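
A minimal sketch of what this encoding does, using a few arbitrary bytes rather than a real image: the bytes are turned into an ASCII-only string, and decoding gives back exactly the same bytes.

```python
import base64

raw = bytes([0, 255, 137, 80, 78, 71])  # arbitrary bytes, not all printable

# encode to base64 (still bytes), then decode those bytes into a str
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)

# the round trip gives back the original bytes
decoded = base64.b64decode(encoded)
print(decoded == raw)
```

The price of using only ASCII characters is size: every 3 bytes of input become 4 characters of output.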

In [None]:
# this is used to serialise the data (turn it into ascii)
import base64
from pathlib import Path

# 0: pass a path, 1: pass a base64 string, 2: pass raw bytes
file_option = 0

# pass in the path to the image
path_or_img = "pics/cosine-similarity-vectors.jpg"

if file_option == 1:
    # you can also pass in base64 encoded image data
    path_or_img = base64.b64encode(Path(path_or_img).read_bytes()).decode()
elif file_option == 2:
    # or the raw bytes
    path_or_img = Path(path_or_img).read_bytes()

response = ollama.chat(
    # note that we are now using the multimodal model (gemma3:4b)
    model=model_name,
    messages=[
        {
            'role': 'user',
            'content': 'What is in this image? Be concise.',
            # we just add an 'images' key:value pair to our message object
            # the value is a list, containing one or more images (as paths/base64 str/bytes)
            'images': [path_or_img],
        }
    ],
)

print(response.message.content)

### Extra: wanna see the `base64` string?

In [None]:
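import base64
from pathlib import Path

# use the file path directly: `path_or_img` may hold a base64 string or bytes by now
image_path = Path("pics/cosine-similarity-vectors.jpg")
# read raw bytes, then encode as b64 (still bytes), then decode as string
image_as_str = base64.b64encode(image_path.read_bytes()).decode()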
# only the first 100 characters
image_as_str[:100]

## Extra: tools, web browsing

Recent models have even more capabilities, such as:
- using tools/calling functions ([tools example](https://github.com/ollama/ollama-python/blob/main/examples/tools.py), [multi-tool example](https://github.com/ollama/ollama-python/blob/main/examples/multi-tool.py), [async tools](https://github.com/ollama/ollama-python/blob/main/examples/async-tools.py))
- 'thinking' (namely generating more tokens to arrive at an answer) ([thinking chat example](https://github.com/ollama/ollama-python/blob/main/examples/thinking.py), [thinking generate example](https://github.com/ollama/ollama-python/blob/main/examples/thinking-generate.py), [thinking levels examples](https://github.com/ollama/ollama-python/blob/main/examples/thinking-levels.py))
- web-browsing ([qwen example](https://github.com/ollama/ollama-python/blob/main/examples/web-search.py), [gpt-oss example](https://github.com/ollama/ollama-python/blob/main/examples/web-search-gpt-oss.py))


Beware, many recent models require a lot of memory. For example **gpt-oss-20b** requires just under 12GB of RAM...