# 2 - Serve Local LLMs

Quick sanity check on the current environment

In [1]:
import sys
sys.executable

'/Users/ian/miniforge3/envs/tutorial-local-llm/bin/python'

In [2]:
sys.version

'3.12.12 | packaged by conda-forge | (main, Oct 22 2025, 23:34:53) [Clang 19.1.7 ]'

## 2.1 Check Ollama available models

<center>
<img src="../img/ollama-horizontal.png" alt="Ollama logo">
</center>

If you did the setup correctly (see [`README.md`](../README.md) in the repo root) then you should see at least a few models already available locally.

In [3]:
import ollama

The `.list()` call returns a response object whose `.models` attribute holds the model details

In [4]:
models = ollama.list().models

In [5]:
print('\n\nOllama models (local):\n')
for m in models:
    print(f'{m.model:<30}\t'+
          f'{m.details.family}\t'+
          f'{m.details.parameter_size}\t'+
          f'{int(m.size/(1e6)):>6} MB\t'+
          f'{m.details.quantization_level}\t'+
          f'{m.details.format}')



Ollama models (local):

tinyllama-q4:1.1b             	llama	1.1B	   637 MB	Q4_0	gguf
osmosis:600M                  	qwen3	596.05M	  1198 MB	BF16	gguf
nano-vlmn:500m                	qwen2	494.03M	   531 MB	Q8_0	gguf
stories:15m                   	llama	36.36M	    39 MB	Q8_0	gguf
phi3-mini:4b                  	phi3	3.8B	  2393 MB	Q4_K_M	gguf
qwen3:600M                    	qwen3	595.78M	   639 MB	Q8_0	gguf
supernova-iq2:14b             	qwen2	14.8B	  5356 MB	unknown	gguf
hito:1.7b                     	qwen3	1.7B	  1107 MB	Q4_K_M	gguf
calculator:latest             	gemma2	2.6B	  1629 MB	Q4_0	gguf
sentiment:latest              	gemma2	2.6B	  1629 MB	Q4_0	gguf
mario:latest                  	gemma2	2.6B	  1629 MB	Q4_0	gguf
sentiment2:latest             	gemma2	2.6B	  1629 MB	Q4_0	gguf
luigi:latest                  	gemma2	2.6B	  1629 MB	Q4_0	gguf
deepcoder:1.5b                	qwen2	1.8B	  1117 MB	Q4_K_M	gguf
llava-phi3:3.8b               	llama	4B	  2926 MB	Q4_K_M	gguf
toddler:latest     
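
To pick a starting point for the experiments that follow, it can help to rank the local models by size on disk. The sketch below assumes each entry exposes `.model` and `.size` (in bytes), as used in the loop above. `smallest_first` is a hypothetical helper, not part of the `ollama` package, and the demo entries are made up so the block runs without a server.

```python
from types import SimpleNamespace

def smallest_first(models):
    """Return (name, size in MB) pairs sorted by size on disk, smallest first."""
    ranked = sorted(models, key=lambda m: m.size)
    return [(m.model, int(m.size / 1e6)) for m in ranked]

# Stand-in entries so the sketch runs without a server; names and sizes are made up.
demo = [
    SimpleNamespace(model='model-b', size=1_200_000_000),
    SimpleNamespace(model='model-a', size=40_000_000),
]
print(smallest_first(demo))
```

In the notebook you would call `smallest_first(models)` with the list built earlier.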

## 2.2 Connect to your Ollama local LLM server

Now make a one-shot request to the smallest LLM, `gemma3:270m`

In [6]:
%%time
response = ollama.generate(model='gemma3:270m',
                           prompt='Tell me a one paragraph story about a chicken')

CPU times: user 2.89 ms, sys: 3.82 ms, total: 6.72 ms
Wall time: 1.83 s


If that didn't work for you, make sure Ollama is running.  There are two ways to do this:

* Native desktop app -- search *Start* (Windows) or *CMD-SPACE* (macOS) for "Ollama" and make sure it is running
* From the command line:

```bash
ollama start
```

The latter has the advantage that you can see incoming requests.
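
You can also check for a running server from Python before sending a request. This is a minimal sketch that assumes the server listens on Ollama's default address, `http://localhost:11434`; `ollama_is_running` is a hypothetical helper name, and it only confirms that something answers HTTP at that address.

```python
import urllib.error
import urllib.request

def ollama_is_running(base_url='http://localhost:11434', timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False

print(ollama_is_running())
```

If this prints `False`, start Ollama using one of the two methods above and try again.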

In [7]:
response.response

"In a bustling chicken coop, a young chicken named Pip was determined to learn to fly. He spent his days learning to navigate the intricate web of feathers, mastering the art of maneuvering through the air. He practiced flapping his wings, learning the subtle movements and the exhilarating feeling of soaring above the coop. Pip's determination and dedication were contagious, and he blossomed into a confident and skilled bird, proving that even the smallest creature can achieve extraordinary things with perseverance and a little bit of encouragement."

### EXERCISE: Experiment with different models & one-shot queries
*(5 minutes)*

Notes:
* Start with the smallest model and then increment in parameter size
* Use *"Task Manager"* (Windows) or *"Activity Monitor"* (macOS) to see how much CPU and RAM Ollama is using
* Try the same prompt more than once with the same model to get a sense of intra-model variability
* Try the same prompt with different models to get a sense of inter-model variability
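
One way to organize these comparisons is a small loop that repeats a prompt across models and records the wall time of each call. This is only a sketch: `time_generations` is a hypothetical helper, and it takes the generate function as an argument so that you can pass `ollama.generate` (or a stand-in) yourself.

```python
import time

def time_generations(generate, models, prompt, repeats=2):
    """Call generate(model=..., prompt=...) repeatedly and record wall time."""
    results = []
    for model in models:
        for run in range(repeats):
            start = time.perf_counter()
            response = generate(model=model, prompt=prompt)
            elapsed = time.perf_counter() - start
            results.append({'model': model, 'run': run,
                            'seconds': elapsed, 'text': response.response})
    return results
```

For example, `time_generations(ollama.generate, ['gemma3:270m', 'gemma2:2b'], 'Tell me a one paragraph story about a chicken')` returns one record per call, which you can then compare by eye.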

If the model outputs Markdown, you can display it in a Jupyter notebook with:

```python
from IPython.display import display, Markdown

display(Markdown(response.response))
```

Outside of a Jupyter notebook you'll need something like [`python-markdown`](https://python-markdown.github.io/) to convert Markdown text to HTML.

There is a helper function `printmd()` below that you can use to directly display generated Markdown in a Jupyter notebook.

In [8]:
from IPython.display import display, Markdown

def printmd(text: str) -> None:
    ''' Jupyter-only print function for markdown text '''
    display(Markdown(text))

In [9]:
%%time
response = ollama.generate(model='gemma2:2b', 
                           prompt='What are some of the current geo-political issues?')

CPU times: user 3.36 ms, sys: 9.42 ms, total: 12.8 ms
Wall time: 13.4 s


In [10]:
printmd(response.response)

Geopolitics is a complex field with numerous ongoing issues. Here's a breakdown of some key ones, categorized for clarity: 

**Major Conflicts & Tensions:**

* **Russia-Ukraine War:** This continues to dominate headlines, with major global implications for international relations and the stability of Eastern Europe.  The war has sparked concerns about nuclear escalation and long-term instability in the region.
* **Israeli-Palestinian Conflict:** The conflict remains a major source of tension in the Middle East, marked by ongoing violence and unresolved issues around land, security, and statehood. 
* **China-US Relations:** The competition between the world's two largest economies is impacting various fronts: trade, technology, military presence, and global alliances. Tensions are particularly high on Taiwan and the South China Sea.

**Global Security Concerns:**

* **Nuclear Proliferation:** Concerns remain about nuclear weapons proliferation in countries like Iran, North Korea, and potentially others. The risk of unintended escalation remains a major concern.
* **Cyberwarfare:** The rise of cyberattacks by state actors and non-state groups has become a significant security threat.  Nations are struggling to protect their infrastructure and critical data from increasingly sophisticated attacks. 
* **Climate Change & Resource Scarcity:** Geopolitical issues related to climate change include water scarcity, resource competition, migration patterns, and instability in regions vulnerable to environmental shocks.

**Emerging Geo-Political Trends:**

* **Rise of Regional Powers:**  China's economic and military power continues to grow, while countries like India and Brazil are also gaining influence on the world stage. This leads to shifting alliances and potential new rivalries.
* **Geopolitical Alliances:**  New groupings and coalitions are forming to address shared concerns (e.g., trade, security). For example, the G7 is a prominent alliance focused on promoting democratic values and addressing global challenges. 
* **Digital Geopolitics:** The rise of technology has changed geopolitical dynamics significantly. Digital technologies like social media, internet access, and AI are influencing international relations and creating new opportunities for conflict or cooperation.

**Other Notable Issues:**

* **North Korea:** Concerns about North Korean nuclear weapons program continue to raise tension in the region. 
* **Transnational Crime & Terrorism:**  Terrorism remains a significant threat, with its effects often felt across national borders, making international collaboration crucial. 


It's important to note that these are just some of the many issues shaping global geopolitics today. These areas are interconnected, and their consequences impact one another in intricate ways. 


## 2.3 Chat Sessions

In [11]:
from ollama import chat

class ChatSession:
    def __init__(self,
                 model:str,
                 system:str = 'You are a helpful chatbot'):
        self.model    = model
        self.system   = system
        self.messages = []

        self.messages.append(dict(role='system', content=system))

    def prompt(self, msg) -> str:
        self.messages.append(dict(role='user', content=msg))
        response = chat(model=self.model, messages=self.messages).message.content
        self.messages.append(dict(role='assistant', content=response))
        return response

In [12]:
cs = ChatSession(model='gemma2:2b', system='Please provide short and concise answers')

In [13]:
%%time
printmd(cs.prompt("I am thinking about a good gift for my mother"))

Here are some ideas, depending on your mom's interests:

**For the cozy mama:** 
* **Luxe blanket & slippers:**  Ultimate comfort. 
* **Subscription box:** Catered to her hobbies (cooking, wine, books, etc.)
* **Spa day gift certificate:** Relaxation is key!

**For the tech-savvy mom:** 
* **Noise cancelling headphones:** Perfect for peace and quiet.
* **Smart speaker with voice assistant:** Easy access to music and info. 
* **Portable photo printer:** Instant memories on the go.

**For the sentimental mama:**
* **Photo album or scrapbook:** Filled with cherished memories. 
* **Personalized jewelry:** Engraved with her initials or a special date.
* **Handmade card:** A heartfelt message from you.


Let me know if you have any specific details about your mom! 😄


CPU times: user 4.46 ms, sys: 18.4 ms, total: 22.9 ms
Wall time: 5.11 s


In [14]:
%%time
printmd(cs.prompt("I think she'd like a piece of jewelry. Do you have any recommendations?"))

Of course! To give the best recommendations, tell me:

1. **What kind of metals does she prefer?** (Gold, silver, rose gold?) 
2. **Does she usually wear simple or bold styles?**  (Classic hoops, chunky statement piece, dainty things?)
3. **Any colors she particularly likes or dislikes?**  (Do you know if she prefers pastel colors, jewel tones, etc.?)

The more I know, the better! ✨


CPU times: user 2.34 ms, sys: 2.37 ms, total: 4.71 ms
Wall time: 2.46 s


In [15]:
%%time
printmd(cs.prompt("I have a budget of $200, can you just make a suggestion?"))

Based on your budget and preferences for jewelry, I'd suggest a **gold-plated pendant necklace with a meaningful symbol**.  

**Here's why:** 

* **Budget friendly:** Gold plating can be found within this range. 
* **Meaningful:** A pendant symbolizes something she values - like family, her journey, or a special memory. You could personalize it with an engraved message!

To get started:
1. **Search online:** Sites like Etsy have tons of options under $200. 
2. **Check local jewelers:** Often offer unique designs and can work within your budget. 


Let me know if you want help researching specific symbols or pendant styles!  😄


CPU times: user 2.66 ms, sys: 2.39 ms, total: 5.05 ms
Wall time: 3.51 s


### Record of interaction

Our `ChatSession` object has retained a record of the interaction in the `.messages` list attribute.

Depending on your objectives, you may need to log details of chat sessions, including:

* model
* input
* output
* performance
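
A minimal sketch of such logging is shown here, assuming a session object with a `.model` attribute and a `.prompt()` method like the `ChatSession` above. The helper name `logged_prompt` and the file name `chat_log.jsonl` are illustrative choices, and "performance" is reduced to wall-clock seconds.

```python
import json
import time

def logged_prompt(session, msg, log_path='chat_log.jsonl'):
    """Send msg through a ChatSession-like object and append a JSON log record."""
    start = time.perf_counter()
    output = session.prompt(msg)
    record = {
        'model': session.model,                            # model
        'input': msg,                                      # input
        'output': output,                                  # output
        'seconds': round(time.perf_counter() - start, 3),  # performance
    }
    # One JSON object per line keeps the log easy to append to and parse
    with open(log_path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')
    return output
```

Calling `printmd(logged_prompt(cs, "..."))` in place of `printmd(cs.prompt("..."))` would leave one line per turn in the log file.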



In [16]:
cs.messages

[{'role': 'system', 'content': 'Please provide short and concise answers'},
 {'role': 'user', 'content': 'I am thinking about a good gift for my mother'},
 {'role': 'assistant',
  'content': "Here are some ideas, depending on your mom's interests:\n\n**For the cozy mama:** \n* **Luxe blanket & slippers:**  Ultimate comfort. \n* **Subscription box:** Catered to her hobbies (cooking, wine, books, etc.)\n* **Spa day gift certificate:** Relaxation is key!\n\n**For the tech-savvy mom:** \n* **Noise cancelling headphones:** Perfect for peace and quiet.\n* **Smart speaker with voice assistant:** Easy access to music and info. \n* **Portable photo printer:** Instant memories on the go.\n\n**For the sentimental mama:**\n* **Photo album or scrapbook:** Filled with cherished memories. \n* **Personalized jewelry:** Engraved with her initials or a special date.\n* **Handmade card:** A heartfelt message from you.\n\n\nLet me know if you have any specific details about your mom! 😄  \n"},
 {'role':

### EXERCISE: Experiment with your own chat session

*(5 minutes)*

Use the `ChatSession` object and template above to experiment with your own chat session.

Try using a few different models.

## 2.4 Creating Your Own Models

Ollama provides a number of ways to create your own model, from any of these sources:

* your local Ollama model repository
* the global/public Ollama model repository
* GGUF files you have locally

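This allows you to create model variants to meet your specific needs.  We'll experiment more with this later, but we can start with some basic examples that use *system prompts* as the basis for creating a model variant directly from Python, with no `Modelfile` on disk.  Note that models created this way are registered in your local Ollama model repository and persist (they appear in `ollama list`) until you remove them, e.g. with `ollama rm`.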

More details on Ollama's `create` API can be found [here](https://github.com/ollama/ollama/blob/main/docs/api.md#create-a-model).

### Note on Prompt Engineering

Prompt engineering is a critical skill for successfully interacting with LLMs.  Details of how to do this well are out of scope for this tutorial, but at a minimum it is important to understand that *system prompts* provide a universal context for all chat messages within a session.  The model will always consider the system prompt when constructing a response.

In [17]:
%%time
ollama.create(model='mario', 
              from_='gemma2:2b', 
              system="You are Mario from Super Mario Bros.")

CPU times: user 1.44 ms, sys: 2.58 ms, total: 4.02 ms
Wall time: 92 ms


ProgressResponse(status='success', completed=None, total=None, digest=None)

In [18]:
%%time
mario = ollama.generate(model='mario', 
                        prompt='What is on your mind today?')
printmd(mario.response)

*Mario grins, puffs out his chest, and throws a small, mischievous wink at the camera.*

"Well, you know, it's-a been a real goody-goody day for jumping!  Peach's doing her gardening again and those koopas are trying their best to steal all the mushrooms. Luigi needs help with his new ghost hunter gadgets, and I heard Bowser's planning something big...*Mario leans in conspiratorially*. But I'm ready for whatever comes my way! A little Goomba-stomping never hurt anyone, right?" 

*He bounces a few times on the spot before giving a big thumbs up.* "You know it's gonna be a great adventure!" 😄🍄💥




CPU times: user 2.57 ms, sys: 5.48 ms, total: 8.04 ms
Wall time: 5.12 s


In [19]:
%%time
ollama.create(model='sentiment', 
              from_='gemma2:2b', 
              system="""
              You are a sentiment classifier.
              Breakdown all inputs by sentences.
              Classify each sentence as exactly one of the following:
              POSITIVE, NEGATIVE, NEUTRAL, UNCLEAR""")

CPU times: user 1.93 ms, sys: 4.89 ms, total: 6.82 ms
Wall time: 72.1 ms


ProgressResponse(status='success', completed=None, total=None, digest=None)

In [20]:
%%time
classifications = ollama.generate(
    model='sentiment', 
    prompt="""
    I just finished a long trip.
    Visiting Peru was amazing.
    I had some great adventures with my brother.
    We saw many Incan archeological sites.
    My flight home required four flights,
    but at least I was able to get home faster than my original itinerary.
    I lost my toiletry kit on one of my flights.""")
printmd(classifications.response)

Here's a breakdown of the sentences by sentiment and classification:

* **"I just finished a long trip."** - **NEUTRAL**.  This statement is purely factual, outlining an ending to a journey without expressing any emotion or opinion about it. 
* **"Visiting Peru was amazing."** - **POSITIVE**. The speaker clearly expresses excitement and enjoyment regarding their Peruvian experience.
* **"I had some great adventures with my brother."** - **POSITIVE**.  The phrase "great adventures" indicates a positive sentiment related to shared experiences with family. 
* **"We saw many Incan archeological sites."** - **NEUTRAL**. This is a factual statement about a trip activity, not expressing personal opinion or emotion.
* **"My flight home required four flights,  but at least I was able to get home faster than my original itinerary."** - **NEUTRAL**. The sentence details a logistical aspect of travel with both positive and negative elements: the longer journey time is balanced by being "faster" overall. 
* **"I lost my toiletry kit on one of my flights."** - **NEGATIVE**.  This statement expresses disappointment or frustration about a loss, clearly highlighting a negative experience in the context of travel. 


Let me know if you have any more text you'd like me to analyze! 


CPU times: user 3.5 ms, sys: 5.3 ms, total: 8.79 ms
Wall time: 7.87 s


## 2.5 Using `Modelfile` to customize an LLM

Ollama has created the `Modelfile` *de facto* standard for defining custom models derived from existing models.

[Modelfile Reference](https://docs.ollama.com/modelfile)

`luigi.modelfile`:
```
FROM    gemma2:2b
SYSTEM  """ You are Luigi from Super Mario Bros.
            All responses include some comment
            concerning your brother Mario.
        """
```

The Python API does not currently offer a way to process a `Modelfile` directly, so use the CLI to create your derived model:

```
ollama create luigi -f modelfiles/luigi.modelfile
```

and then check to see that it has been created:

```
ollama list
```
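
If you would rather stay inside the notebook, one option is to shell out to the same CLI command from Python. This sketch assumes the `ollama` executable is on your `PATH`; `build_create_command` and `create_from_modelfile` are hypothetical helper names.

```python
import subprocess

def build_create_command(name, modelfile_path):
    """Return the argument list for `ollama create NAME -f PATH`."""
    return ['ollama', 'create', name, '-f', modelfile_path]

def create_from_modelfile(name, modelfile_path):
    """Run the CLI command; raises CalledProcessError if it fails."""
    return subprocess.run(build_create_command(name, modelfile_path),
                          check=True, capture_output=True, text=True)
```

For example, `create_from_modelfile('luigi', 'modelfiles/luigi.modelfile')` runs the same command as shown above.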


In [21]:
%%time
luigi = ollama.generate(model='luigi', 
                        prompt='What is on your mind today?')
printmd(luigi.response)

*Deep breath, nervously scratches my head*  "Well... I guess I'm just kinda... uh... pondering today's adventure. You know, like what'll we face next! *Sighs dramatically* Maybe Princess Peach needs rescuing again! Or a giant Koopa shell that's gonna take us all the way to the moon! It's hard to say, there's always something wacky going on, right Mario?" 

Is anything interesting on your mind today? 😁


CPU times: user 3.4 ms, sys: 7.58 ms, total: 11 ms
Wall time: 2.92 s


### Every Ollama model has an associated `Modelfile` you can inspect

From the CLI, you can inspect the `Modelfile` of registered Ollama models:

```
ollama show tinyllama:1.1b --modelfile
```

In [22]:
print(ollama.show(model='tinyllama:1.1b').model_dump()['modelfile'])

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM tinyllama:1.1b

FROM /Users/ian/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
TEMPLATE "<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"
SYSTEM You are a helpful AI assistant.
PARAMETER stop <|system|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
PARAMETER stop </s>



Don't worry too much about the `TEMPLATE` and `PARAMETER stop` portions -- these define the way the model expects to handle the `.System`, `.Prompt`, and `.Response` elements of the model interaction, as defined in the [Modelfile Template reference](https://docs.ollama.com/modelfile#template) and the [Go Template syntax](https://pkg.go.dev/text/template).
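
To build some intuition for what the `TEMPLATE` does, the sketch below fills in the tinyllama template shown above using plain string substitution. This is only an illustration: Ollama renders templates with Go's `text/template`, which supports far more than placeholder replacement, and `render` is a hypothetical helper.

```python
# The TEMPLATE string from the tinyllama Modelfile output above
TINYLLAMA_TEMPLATE = (
    "<|system|>\n"
    "{{ .System }}</s>\n"
    "<|user|>\n"
    "{{ .Prompt }}</s>\n"
    "<|assistant|>\n"
)

def render(template, system, prompt):
    """Rough stand-in for template rendering: plain placeholder substitution."""
    return (template
            .replace('{{ .System }}', system)
            .replace('{{ .Prompt }}', prompt))

print(render(TINYLLAMA_TEMPLATE,
             system='You are a helpful AI assistant.',
             prompt='Why is the sky blue?'))
```

The rendered text ends right after `<|assistant|>`, which is where the model begins generating. The `PARAMETER stop` entries tell Ollama to end generation when the model emits one of those markers.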

Here's a more sophisticated version of the Sentiment Classifier model:

`sentiment2.modelfile`:
```
FROM        gemma2:2b

SYSTEM      """
            You are a sentiment classifier.
            Classify each input as exactly one of the following:
            POSITIVE, NEGATIVE, NEUTRAL, UNCLEAR
            """

PARAMETER temperature 0.5
PARAMETER num_ctx     1024

MESSAGE user        I had a great day
MESSAGE assistant   POSITIVE
MESSAGE user        That hockey game was insane
MESSAGE assistant   UNCLEAR
MESSAGE user        We need to go shopping this week
MESSAGE assistant   NEUTRAL
MESSAGE user        That was one of the worst movies ever
MESSAGE assistant   NEGATIVE
```

Create this model with the following command:

```
ollama create sentiment2 -f modelfiles/sentiment2.modelfile
```
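
The same ingredients (system prompt, example messages, and parameters) can also be supplied per request instead of being baked into a model. The sketch below mirrors `sentiment2.modelfile` using `ollama.chat` with its `options` argument; `few_shot_messages` and `classify` are hypothetical helpers, and `classify` needs a running Ollama server with `gemma2:2b` available.

```python
SYSTEM = ('You are a sentiment classifier.\n'
          'Classify each input as exactly one of the following:\n'
          'POSITIVE, NEGATIVE, NEUTRAL, UNCLEAR')

# Same example pairs as the MESSAGE lines in sentiment2.modelfile
EXAMPLES = [
    ('I had a great day', 'POSITIVE'),
    ('That hockey game was insane', 'UNCLEAR'),
    ('We need to go shopping this week', 'NEUTRAL'),
    ('That was one of the worst movies ever', 'NEGATIVE'),
]

def few_shot_messages(text):
    """Build system + example + final user messages for one classification."""
    messages = [dict(role='system', content=SYSTEM)]
    for user_text, label in EXAMPLES:
        messages.append(dict(role='user', content=user_text))
        messages.append(dict(role='assistant', content=label))
    messages.append(dict(role='user', content=text))
    return messages

def classify(text, model='gemma2:2b'):
    """Classify one input; requires the `ollama` package and a running server."""
    import ollama  # imported lazily so the helpers above work without it
    response = ollama.chat(model=model,
                           messages=few_shot_messages(text),
                           options={'temperature': 0.5, 'num_ctx': 1024})
    return response.message.content
```

The trade-off is that every request carries the examples with it, whereas the `Modelfile` approach stores them once under a model name.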

In [23]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="We just had a foot of snow - I can't wait to go skiing")
printmd(classification.response)

POSITIVE 


In [24]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Revenues were higher than projected")
printmd(classification.response)

POSITIVE 


In [25]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Supply chain delays led to inventory issues across the network")
printmd(classification.response)

NEUTRAL 


In [26]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Several key injuries are going to make the next game hard to win")
printmd(classification.response)

NEGATIVE 


### Model customization

This example shows three different ways the new model is customized:

1. Setting `temperature`, which controls the degree of randomness in sampling.  Typical values fall between `0.0` and `1.0` (higher values are accepted), where lower is described as *"more coherent"* and higher is described as *"more creative"*.

2. Setting `num_ctx`, which is the context window size, measured in *tokens*.  This is a critical parameter for performance & memory consumption.  Ollama's documented default has been 2048 tokens, but it may differ between Ollama releases, so check the Modelfile reference for your version.  In general a bigger window lets the model take more of the conversation into account, at the cost of more processing time and RAM.

3. Few Shot Learning (FSL) using a short set of example interaction messages between `user` prompts and `assistant` responses.
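
To see why a lower `temperature` reads as "more coherent", it helps to look at the standard temperature-scaled softmax, which is how sampling temperature is commonly applied to a model's raw token scores. The sketch below uses made-up scores for three candidate tokens; it illustrates the general technique and is not taken from Ollama's internals.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, dividing by temperature first."""
    if temperature <= 0:
        raise ValueError('temperature must be positive')
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # made-up scores for three candidate tokens
for t in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass concentrates on the highest-scoring token, so repeated runs look more alike.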

**NOTE1:** What is a *token*?  Tokenization depends on the model's tokenizer, but a rule of thumb for English text is that one token represents about 4 characters (roughly 4 bytes) of input.

**NOTE2:** What is a *context window*?  It represents how much "memory" the model has to work with, though it is important to consider *Signal-to-Noise* effects of "too much" data in memory.  Within an LLM interaction session the total context grows until it reaches the maximum context window size, after which the oldest content is dropped first (FIFO).

Here's a graph from [Meibel regarding LLM context window sizes](https://www.meibel.ai/post/understanding-the-impact-of-increasing-llm-context-windows):

<center>
<img src="../img/meibel-ai-context-window-size-history.png" width=600>
</center>
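
The two notes above can be combined into a rough sketch of client-side context management: estimate tokens with the 4-characters rule of thumb, then drop the oldest messages once a budget is exceeded. Both helpers are hypothetical, real tokenizers will give different counts, and keeping the system message while trimming is this sketch's own choice.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rule-of-thumb token estimate; real tokenizers will differ."""
    return max(1, len(text) // chars_per_token)

def trim_fifo(messages, max_tokens):
    """Drop the oldest non-system messages until the estimate fits max_tokens."""
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']

    def total(ms):
        return sum(estimate_tokens(m['content']) for m in ms)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)   # first in, first out
    return system + rest
```

With the `ChatSession` class from section 2.3, you could call `trim_fifo(cs.messages, 1024)` before each request to stay under a `num_ctx` of 1024, within the accuracy of the estimate.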

### EXERCISE: Create your own custom model using a `Modelfile`

*(5 minutes)*

Starting with the `gemma2:2b` model, create a `Modelfile` that will act as a calculator with natural language input.  Construct a system prompt that tells the model how to behave, then provide examples of input prompts and outputs.

**NOTE:** Don't be surprised if this is hard to make work -- the base models we're using are not tuned/trained for math.

**Challenge:** Get the model to support progressive operations which build on the last output.