# 2 - Serve Local LLMs

Quick sanity check on the current environment

In [1]:
import sys
sys.executable

'/Users/ian/miniforge3/envs/tutorial-local-llm/bin/python'

In [2]:
sys.version

'3.12.12 | packaged by conda-forge | (main, Oct 22 2025, 23:34:53) [Clang 19.1.7 ]'

## 2.1 Check Ollama available models

<center>
<img src="../img/ollama-horizontal.png" alt="Ollama logo">
</center>

If you did the setup correctly (see [`README.md`](../README.md) in the repo root) then you should see at least a few models already available locally.

In [3]:
import ollama

The model details are held in the `.models` attribute of the object returned by the `.list()` call.

In [4]:
models = ollama.list().models

In [5]:
print('\n\nOllama models (local):\n')
for m in models:
    print(f'{m.model:<30}\t'+
          f'{m.details.family}\t'+
          f'{m.details.parameter_size}\t'+
          f'{int(m.size/(1e6)):>6} MB\t'+
          f'{m.details.quantization_level}\t'+
          f'{m.details.format}')



Ollama models (local):

calculator:latest             	llama	1.7B	  1778 MB	Q8_0	gguf
sentiment2:latest             	gemma2	2.6B	  1629 MB	Q4_0	gguf
luigi:latest                  	gemma2	2.6B	  1629 MB	Q4_0	gguf
sentiment:latest              	gemma2	2.6B	  1629 MB	Q4_0	gguf
deepcoder:1.5b                	qwen2	1.8B	  1117 MB	Q4_K_M	gguf
llava-phi3:3.8b               	llama	4B	  2926 MB	Q4_K_M	gguf
mario:latest                  	gemma2	2.6B	  1629 MB	Q4_0	gguf
toddler:latest                	gemma3	268.10M	   291 MB	Q8_0	gguf
emoji:latest                  	qwen3	4.0B	  2497 MB	Q4_K_M	gguf
tinymario:latest              	llama	1.1B	   637 MB	Q4_0	gguf
minimario:latest              	gemma3	268.10M	   291 MB	Q8_0	gguf
example:latest                	qwen3	4.0B	  2497 MB	Q4_K_M	gguf
qwen3:4b                      	qwen3	4.0B	  2497 MB	Q4_K_M	gguf
sematre/orpheus:ft-en-3b-q2_k 	llama	3.8B	  1595 MB	Q2_K	gguf
sematre/orpheus:ft-en-3b      	llama	3.8B	  4028 MB	Q8_0	gguf
llama3.2-vision:latest  
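
To pick a starting model programmatically, the listing can be sorted by size. This is a minimal sketch that works on plain `(name, size_mb)` pairs copied from the output above rather than on live `ollama.list()` results; `smallest_models` is a hypothetical helper name, not part of the Ollama API.

```python
def smallest_models(models: list[tuple[str, int]], n: int = 3) -> list[tuple[str, int]]:
    """Return the n smallest (name, size_mb) pairs, smallest first."""
    return sorted(models, key=lambda pair: pair[1])[:n]

# A few rows copied from the listing above (sizes in MB)
listing = [
    ('deepcoder:1.5b', 1117),
    ('toddler:latest', 291),
    ('qwen3:4b', 2497),
    ('tinymario:latest', 637),
]

for name, size_mb in smallest_models(listing, n=2):
    print(f'{name:<30}\t{size_mb:>6} MB')
```

With live data, the pairs could be built as `[(m.model, int(m.size / 1e6)) for m in models]`.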

## 2.2 Connect to your Ollama local LLM server

Now make a one-shot request to the smallest LLM, `gemma3:270m`

In [6]:
%%time
response = ollama.generate(model='gemma3:270m',
                           prompt='Tell me a one paragraph story about a chicken')

CPU times: user 2.81 ms, sys: 6.27 ms, total: 9.07 ms
Wall time: 1.52 s


If that didn't work for you, make sure Ollama is running.  There are two ways to do this:

* Desktop native app -- search *Start* (Windows) or *CMD-SPACE* (macOS) for "Ollama" and make sure it is running
* From the command line:

```bash
ollama start
```

The latter has the advantage that you can see incoming requests.
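
You can also check from Python whether the server is reachable. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`; `ollama_is_running` is a hypothetical helper name, and it only confirms that an HTTP server answers there with status 200.

```python
import http.client
import urllib.request

def ollama_is_running(base_url: str = 'http://localhost:11434', timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers connection refused, timeouts, and urllib.error.URLError
        return False

print(ollama_is_running())
```

If this prints `False`, start Ollama using one of the two options above and try again.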

In [7]:
response.response

"A wise old chicken, named Percy, was known for his simple pleasures. He enjoyed eating juicy worms and cracking them open, while his young cousin, a speedy chicken named Rosie, would chase butterflies through the garden. Percy's days were filled with laughter, sharing his breakfast, and making new friends. He was a loyal and beloved companion, and his presence filled the neighborhood with warmth and joy. \n"

### EXERCISE: Experiment with different models & one-shot queries
*(5 minutes)*

Notes:
* Start with the smallest model and then increment in parameter size
* Use *"Task Manager"* (Windows) or *"Activity Monitor"* (macOS) to see how much CPU and RAM Ollama is using
* Try the same prompt more than once with the same model to get a sense of intra-model variability
* Try the same prompt more than once with different models to get a sense of inter-model variability
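
One way to get a feel for intra-model variability is to send the same prompt several times and count the distinct responses. This is a rough sketch: `sample_responses` is a hypothetical helper, and `fake_generate` is a stand-in so the code runs without a server. In the notebook you would pass `ollama.generate` instead.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Callable

def sample_responses(generate: Callable, model: str, prompt: str, n: int = 3) -> Counter:
    """Send the same prompt n times and count how often each distinct response occurs."""
    counts = Counter()
    for _ in range(n):
        result = generate(model=model, prompt=prompt)
        counts[result.response.strip()] += 1
    return counts

# Stand-in for ollama.generate (assumption: it returns an object with a .response string)
def fake_generate(model: str, prompt: str) -> SimpleNamespace:
    return SimpleNamespace(response=f'[{model}] reply to: {prompt} ')

counts = sample_responses(fake_generate, 'gemma3:270m',
                          'Tell me a one paragraph story about a chicken', n=3)
print(len(counts), 'distinct response(s) out of', sum(counts.values()))
```

With a real model and default settings, expect the number of distinct responses to be close to `n`.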

If the model outputs Markdown, you can display it in a Jupyter notebook with:

```python
from IPython.display import display, Markdown

display(Markdown(response.response))
```

Outside of a Jupyter notebook, you'll need something like [`python-markdown`](https://python-markdown.github.io/) to convert Markdown text to HTML.

There is a helper function `printmd()` below that you can use to directly display generated Markdown in a Jupyter notebook.

In [8]:
from IPython.display import display, Markdown, Latex

def printmd(text:str) -> None:
    ''' Jupyter-only print function for markdown text '''
    display(Markdown(text))

In [9]:
%%time
response = ollama.generate(model='gemma2:2b', 
                           prompt='What are some of the current geo-political issues?')

CPU times: user 2.93 ms, sys: 8.14 ms, total: 11.1 ms
Wall time: 16.9 s


In [10]:
printmd(response.response)

Geopolitics is a constantly evolving landscape, so there are many complex and interconnected issues at play.  Here's a breakdown of some prominent ones:

**1. The Russia-Ukraine War:** This ongoing conflict has significant global consequences, impacting energy markets, food security, and international relations. 

   * **Humanitarian Crisis:** Millions displaced, cities devastated, and lives lost. International aid is vital to address the immediate needs.
   * **Geopolitical Shift:** Reshaping alliances, prompting greater scrutiny of Russia's power, and potentially triggering a "new cold war."
   * **Economic Impact:** Global supply chains disrupted, energy price volatility, and potential for recession.

**2. China-U.S. Relations:** A complex relationship marked by economic interdependence, competition in trade, technology, and global leadership. 

   * **Trade War:** Ongoing tension over tariffs, intellectual property, and trade barriers impacting both economies.
   * **Technology Competition:** Control of AI, semiconductors, and space exploration are critical factors shaping future alliances.
   * **Taiwan Strait:** A major flashpoint in the Asia-Pacific region; China's assertive stance towards Taiwan threatens regional stability.

**3. Climate Change & Environmental Policies:** Global warming is impacting countries differently, leading to climate migration, resource conflicts, and geopolitical tensions around climate finance and enforcement. 

   * **International Agreements:**  The Paris Agreement aims for global cooperation on emissions reduction; however, implementation varies widely.
   * **Resource Wars:** Competition over water and other resources intensified by population growth, climate change, and political disputes.
   * **Geo-Economic Implications:** Shifting economic powers, new trade alliances, and investment in renewable energy sources are key factors driving this dynamic.

**4. Energy Security & Geopolitical Control of Resources:** The global reliance on fossil fuels (especially oil and gas) is shifting as renewable energy sources become more prevalent. 

   * **Resource Conflicts:** Competition for resources like lithium, cobalt, and rare earth elements needed for electric vehicles and battery technology.
   * **Geo-Economic Influence:** Countries with abundant resources or strategic control over them can exert significant influence on global trade and politics.
   * **Energy Diplomacy:** Shifting energy alliances, new pipelines, and international cooperation to diversify energy sources are key priorities for many nations.

**5. Rising Nationalism & Populism:** A resurgence of nationalist sentiment across the globe is leading to increased tensions between countries, questioning established norms, and contributing to political instability.

   * **Protectionist Policies:** Increased tariffs on imports and trade barriers limit global economic cooperation.
   * **Political Polarization:** Political rhetoric emphasizing national identity and cultural division often fuels conflict and undermines international dialogue.


**Beyond these major issues, other geopolitical concerns include:**

* **Cybersecurity threats:** Growing concern over cyber warfare and the potential for disruption of critical infrastructure.
* **Nuclear proliferation:** Continued tension surrounding nuclear weapons development and security, particularly in North Korea and Iran.
* **Migration and refugee crises:** Large-scale displacement due to conflict, climate change, and poverty continues to drive international cooperation challenges. 

It's important to note that these are just a few examples of the many complex geo-political issues facing the world today. Their interconnections make them challenging, but also offer opportunities for collaboration and global solutions.  




## 2.3 Chat Sessions

In [11]:
from ollama import chat

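class ChatSession:
    ''' Minimal chat wrapper that keeps the full message history '''
    def __init__(self, model:str, system:str = 'You are a helpful chatbot'):
        self.model    = model
        self.system   = system
        self.messages = []

        # The system prompt is always the first message in the history
        self.messages.append(dict(role='system', content=system))

    def prompt(self, msg:str) -> str:
        ''' Send a user message and return the assistant reply '''
        self.messages.append(dict(role='user', content=msg))
        # The whole history is sent on every call so the model sees prior turns
        response = chat(model=self.model, messages=self.messages).message.content
        self.messages.append(dict(role='assistant', content=response))
        return response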

In [12]:
cs = ChatSession(model='gemma2:2b', system='Please provide short and concise answers')

In [13]:
%%time
printmd(cs.prompt("I am thinking about a good gift for my mother"))

Here are some ideas, depending on your budget and your mom's interests: 

**Personalized:**

* **Custom photo album/scrapbook:** Fill it with memories!
* **Engraved jewelry:** Necklace, bracelet, or keychain.
* **Framed family photo:**  Choose a favorite or new portrait.

**Experiences:**

* **Spa day:** Massage, facial, mani-pedi 
* **Concert tickets:** For her favorite band or artist
* **Cooking class:** Learn something new together!

**Thoughtful Gifts:**

* **Her favorite book/movie/music CD:** Something to enjoy
* **Gift basket:** Filled with things she enjoys (e.g., snacks, lotions, candles) 
* **Donation to her favorite charity:** Make a donation in her name


Let me know more about your mom and I can give you more specific ideas!  


CPU times: user 3.56 ms, sys: 5.7 ms, total: 9.26 ms
Wall time: 4.2 s


In [14]:
%%time
printmd(cs.prompt("I think she'd like a piece of jewelry. Do you have any recommendations?"))

Here are some jewelry ideas based on different styles and price points: 

**Classy & Timeless:**

* **Classic pendant necklace:** A minimalist design with her initial, birthstone, or a meaningful symbol.  (Think gold or silver)
* **Simple stud earrings:** Elegant and versatile, they can be worn everyday. (Diamond, pearl, or gemstone studs)
* **Delicate bracelet:** Thin chain with a charm or engraved message. (Dainty diamonds, small pearls, etc.)

**Modern & Statement-making:** 

* **Layering necklaces:** Mix textures and lengths for a unique look.  
* **Statement rings:** A bold ring featuring gemstones or unique designs.
* **Charm bracelet:**  Collect meaningful charms over time! (Family members, hobbies, travel)

**Practical & Thoughtful:**

* **Personalized engraved watch:** If she's into watches, add a personal touch to her everyday accessory. 
* **Elegant scarf clip:** A simple way to elevate a look and add some color or texture.


To make it extra special:  Consider getting the jewelry box! 🎁 




CPU times: user 2.59 ms, sys: 3.01 ms, total: 5.6 ms
Wall time: 5.39 s


In [15]:
%%time
printmd(cs.prompt("I have a budget of $200, can you just make a suggestion?"))

For a gift around $200, consider a **delicate gold-plated necklace with a small pendant** (like a heart or initial)  This is elegant yet practical for everyday wear. You can find beautiful designs in online stores like Etsy, and department stores also have good options at this price point. 


CPU times: user 3.03 ms, sys: 4.7 ms, total: 7.73 ms
Wall time: 1.68 s


### Record of interaction

Our `ChatSession` object has retained a record of the interaction in the `.messages` list attribute.

Depending on your objectives, you may need to log details of chat sessions, including:

* model
* input
* output
* performance
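
The four items above can be captured in one record per call. The following is a minimal sketch, not a prescribed logging format: `InteractionRecord` and `timed_call` are hypothetical names, and `echo` is a stand-in for a real model call so the example runs without a server.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class InteractionRecord:
    model: str
    prompt: str
    output: str
    wall_seconds: float

def timed_call(fn, model: str, prompt: str) -> InteractionRecord:
    """Call fn(model, prompt) -> str and record model, input, output, and wall time."""
    start = time.perf_counter()
    output = fn(model, prompt)
    elapsed = time.perf_counter() - start
    return InteractionRecord(model, prompt, output, elapsed)

# Stand-in for a real call such as:
#   lambda model, prompt: ollama.generate(model=model, prompt=prompt).response
def echo(model: str, prompt: str) -> str:
    return prompt.upper()

record = timed_call(echo, 'gemma2:2b', 'hello')
print(json.dumps(asdict(record)))  # one JSON object per line suits append-only logs
```

Outside a notebook, `time.perf_counter()` plays the role that `%%time` plays in the cells above.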



In [16]:
cs.messages

[{'role': 'system', 'content': 'Please provide short and concise answers'},
 {'role': 'user', 'content': 'I am thinking about a good gift for my mother'},
 {'role': 'assistant',
  'content': "Here are some ideas, depending on your budget and your mom's interests: \n\n**Personalized:**\n\n* **Custom photo album/scrapbook:** Fill it with memories!\n* **Engraved jewelry:** Necklace, bracelet, or keychain.\n* **Framed family photo:**  Choose a favorite or new portrait.\n\n**Experiences:**\n\n* **Spa day:** Massage, facial, mani-pedi \n* **Concert tickets:** For her favorite band or artist\n* **Cooking class:** Learn something new together!\n\n**Thoughtful Gifts:**\n\n* **Her favorite book/movie/music CD:** Something to enjoy\n* **Gift basket:** Filled with things she enjoys (e.g., snacks, lotions, candles) \n* **Donation to her favorite charity:** Make a donation in her name\n\n\nLet me know more about your mom and I can give you more specific ideas!  \n"},
 {'role': 'user',
  'content': "

### EXERCISE: Experiment with your own chat session

*(5 minutes)*

Use the `ChatSession` object and template above to experiment with your own chat session.

Try using a few different models.

## 2.4 Creating Your Own Models

Ollama provides a number of ways to create your own model, from any of these sources:

* your local Ollama model repository
* the global/public Ollama model repository
* GGUF files you have locally

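This allows you to create model variants to meet your specific needs.  We'll experiment more with this later, but we can start with some basic examples of lightweight model variants that reuse the base model's weights and use *system prompts* as the basis for the variant.  Note that these variants are not ephemeral: `ollama.create` registers the new model in your local repository, which is why `mario` and `sentiment` appear in the listing in section 2.1.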

More details on Ollama's `create` API can be found [here](https://github.com/ollama/ollama/blob/main/docs/api.md#create-a-model).

### Note on Prompt Engineering

Prompt engineering is a critical skill for successfully interacting with LLMs.  Details of how to do this well are out of scope for this tutorial, but at a minimum it is important to understand that *system prompts* provide a universal context for all chat messages within a session.  The system prompt is included in the context for every response the model constructs.

In [17]:
%%time
ollama.create(model='mario', 
              from_='gemma2:2b', 
              system="You are Mario from Super Mario Bros.")

CPU times: user 2.84 ms, sys: 4.7 ms, total: 7.54 ms
Wall time: 74.7 ms


ProgressResponse(status='success', completed=None, total=None, digest=None)

In [18]:
%%time
mario = ollama.generate(model='mario', 
                        prompt='What is on your mind today?')
printmd(mario.response)

Well, it's-a me! Mario!  *Laughs, gives a thumbs up.* 

Today's been kinda... *thinks for a second* ...exciting! We found some-a hidden power ups in Bowser's Castle! I gotta say, those fire flowers are pretty awesome. Makes jumping even more fun! 😄

But seriously, what's on my mind?  *leans in conspiratorially* Well, Luigi is off getting his hair cut again... and Peach says she wants to do some flower arranging with Daisy. So I guess that means it's just me, Goomba stompin', and maybe a little pizza-making!   🍕

What about you? What's on *your* mind today?  


CPU times: user 2.62 ms, sys: 5.25 ms, total: 7.88 ms
Wall time: 4.05 s


In [19]:
%%time
ollama.create(model='sentiment', 
              from_='gemma2:2b', 
              system="""
              You are a sentiment classifier.
              Breakdown all inputs by sentences.
              Classify each sentence as exactly one of the following:
              POSITIVE, NEGATIVE, NEUTRAL, UNCLEAR""")

CPU times: user 2.67 ms, sys: 5.33 ms, total: 8 ms
Wall time: 58 ms


ProgressResponse(status='success', completed=None, total=None, digest=None)

In [20]:
%%time
classifications = ollama.generate(
    model='sentiment', 
    prompt="""
    I just finished a long trip.
    Visiting Peru was amazing.
    I had some great adventures with my brother.
    We saw many Incan archeological sites.
    My flight home required four flights,
    but at least I was able to get home faster than my original itinerary.
    I lost my toiletry kit on one of my flights.""")
printmd(classifications.response)

Here's a breakdown of the sentences by sentiment and classification:

* **"I just finished a long trip."**  **NEUTRAL** - This is a factual statement about completing a trip, not expressing any specific emotions. 
* **"Visiting Peru was amazing."**  **POSITIVE** - Clearly expresses excitement and positive feelings about the trip to Peru.
* **"I had some great adventures with my brother."**  **POSITIVE** - Implies enjoyment and good experiences shared with family.
* **"We saw many Incan archeological sites."** **NEUTRAL** - This sentence is factual, describing a part of the trip without expressing personal emotion. 
* **"My flight home required four flights, but at least I was able to get home faster than my original itinerary."**  **POSITIVE** - Even though there are delays and multiple flights, a positive sentiment is conveyed by "getting home faster" 
* **"I lost my toiletry kit on one of my flights."**  **NEGATIVE** - A negative feeling arises from the loss.


Let me know if you'd like to explore other examples! 😊 


CPU times: user 2.93 ms, sys: 7.34 ms, total: 10.3 ms
Wall time: 5.6 s


## 2.5 Using `Modelfile` to customize an LLM

Ollama has created the `Modelfile` *de facto* standard for defining custom models derived from existing models.

[Modelfile Reference](https://docs.ollama.com/modelfile)

`luigi.modelfile`:
```
FROM    gemma2:2b
SYSTEM  """ You are Luigi from Super Mario Bros.
            All responses include some comment
            concerning your brother Mario.
        """
```

There is currently no way to process a `Modelfile` through the Python API to create a new model.

Instead, use the CLI to create your derived model:

```
ollama create luigi -f modelfiles/luigi.modelfile
```

and then check to see that it has been created:

```
ollama list
```


In [21]:
%%time
luigi = ollama.generate(model='luigi', 
                        prompt='What is on your mind today?')
printmd(luigi.response)

*Shivers, nervously adjusting my green overalls*  Gosh, what's *on* my mind? Well...I was just thinking about...you know, how much of a hero he is! It's gotta be tough being the main dude all the time. Makes me kinda wish I was the one saving the princess sometimes, you know? *twirls nervously, then straightens up, with determination*  But seriously, I'm just worried about him today!  He's off on some new adventure, and it'll be hard to hear back from him, especially if something goes wrong. *makes a little face, holding up my hands in a gesture of pleading* Come on, Mario, tell me you've got something interesting to report when you get back!

 What about you?  What's going on your mind today? 


CPU times: user 3.39 ms, sys: 6.03 ms, total: 9.42 ms
Wall time: 4.26 s


### Every Ollama model has an associated `Modelfile` you can inspect

From the CLI, you can inspect the `Modelfile` of registered Ollama models:

```
ollama show tinyllama:1.1b --modelfile
```

In [22]:
print(ollama.show(model='tinyllama:1.1b').model_dump()['modelfile'])

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM tinyllama:1.1b

FROM /Users/ian/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
TEMPLATE "<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"
SYSTEM You are a helpful AI assistant.
PARAMETER stop <|system|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
PARAMETER stop </s>



Don't worry too much about the `TEMPLATE` and `PARAMETER stop` portions -- these define the way the model expects to handle the `.System`, `.Prompt`, and `.Response` elements of the model interaction, as defined in the [Modelfile Template reference](https://docs.ollama.com/modelfile#template) and the [Go Template syntax](https://pkg.go.dev/text/template).

Here's a more sophisticated version of the Sentiment Classifier model:

`sentiment2.modelfile`
```
FROM        gemma2:2b

SYSTEM      """
            You are a sentiment classifier.
            Classify each input as exactly one of the following:
            POSITIVE, NEGATIVE, NEUTRAL, UNCLEAR
            """

PARAMETER temperature 0.5
PARAMETER num_ctx     1024

MESSAGE user        I had a great day
MESSAGE assistant   POSITIVE
MESSAGE user        That hockey game was insane
MESSAGE assistant   UNCLEAR
MESSAGE user        We need to go shopping this week
MESSAGE assistant   NEUTRAL
MESSAGE user        That was one of the worst movies ever
MESSAGE assistant   NEGATIVE
```

Create this model with the following command:

```
ollama create sentiment2 -f modelfiles/sentiment2.modelfile
```
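
Because a `Modelfile` is plain text, it can also be generated from Python and then handed to the CLI command above. The sketch below is one possible approach, covering only the `FROM`, `SYSTEM`, `PARAMETER`, and `MESSAGE` instructions used in this section; `render_modelfile` is a hypothetical helper, not part of the Ollama library.

```python
from typing import Optional

def render_modelfile(base: str, system: str,
                     parameters: Optional[dict] = None,
                     examples: Optional[list[tuple[str, str]]] = None) -> str:
    """Build Modelfile text from a base model, system prompt, parameters, and few-shot examples."""
    # Assumption: the system prompt does not itself contain triple quotes
    lines = [f'FROM {base}', f'SYSTEM """{system}"""']
    for name, value in (parameters or {}).items():
        lines.append(f'PARAMETER {name} {value}')
    for user_text, assistant_text in (examples or []):
        lines.append(f'MESSAGE user {user_text}')
        lines.append(f'MESSAGE assistant {assistant_text}')
    return '\n'.join(lines) + '\n'

text = render_modelfile(
    base='gemma2:2b',
    system='You are a sentiment classifier.',
    parameters={'temperature': 0.5, 'num_ctx': 1024},
    examples=[('I had a great day', 'POSITIVE')],
)
print(text)
```

Writing `text` to a file and passing that path to `ollama create ... -f` would then register the model in the same way as the hand-written file.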

In [23]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="We just had a foot of snow - I can't wait to go skiing")
printmd(classification.response)

POSITIVE 


In [24]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Revenues were higher than projected")
printmd(classification.response)

POSITIVE 


In [25]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Supply chain delays led to inventory issues across the network")
printmd(classification.response)

NEGATIVE 


In [26]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Several key injuries are going to make the next game hard to win")
printmd(classification.response)

NEGATIVE 
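
The raw responses carry trailing whitespace, and a small model may occasionally wrap the label in extra text, so downstream code usually benefits from normalizing the output. This is a minimal sketch with a hypothetical `parse_label` helper; returning `None` when no label is found is a design choice here, not something the model or Ollama prescribes.

```python
import re
from typing import Optional

LABELS = ('POSITIVE', 'NEGATIVE', 'NEUTRAL', 'UNCLEAR')

def parse_label(text: str) -> Optional[str]:
    """Return the first valid label found in text, or None if there is none."""
    match = re.search(r'\b(' + '|'.join(LABELS) + r')\b', text.upper())
    return match.group(1) if match else None

print(parse_label('POSITIVE \n'))
print(parse_label('I am not sure'))
```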


### Model customization

This example shows three different ways the new model is customized:

1. Setting `temperature`, which affects the degree of randomness.  Typical values lie in the range `0.0` to `1.0`, where lower is described as *"more coherent"* and higher is described as *"more creative"*.

2. Setting `num_ctx`, which is the context window size, measured in *tokens*.  This is a critical parameter for performance & memory consumption.  The default in Ollama for local models has been 2048 tokens, although this depends on the Ollama version.  In general a bigger window lets the model take more of the input and conversation into account, but it increases processing time and RAM consumption.

3. Few Shot Learning (FSL) using a short set of example interaction messages between `user` prompts and `assistant` responses.
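
The `temperature` and `num_ctx` settings can also be supplied per request through the `options` argument of `ollama.generate`, without creating a new model. The sketch below illustrates the call shape; `classify` is a hypothetical wrapper, and `fake_generate` is a stand-in that records its arguments so the example runs without a server. In the notebook you would pass `ollama.generate` instead.

```python
from types import SimpleNamespace

def classify(generate, text: str, temperature: float = 0.5, num_ctx: int = 1024) -> str:
    """Call generate with per-request options and return the stripped response text."""
    result = generate(model='sentiment2', prompt=text,
                      options={'temperature': temperature, 'num_ctx': num_ctx})
    return result.response.strip()

# Stand-in for ollama.generate: remembers the keyword arguments it received
captured = {}
def fake_generate(**kwargs) -> SimpleNamespace:
    captured.update(kwargs)
    return SimpleNamespace(response='POSITIVE \n')

label = classify(fake_generate, 'I had a great day')
print(label, captured['options'])
```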

**NOTE1:** What is a *token*?  This depends on how the model's tokenizer splits its input; however, a rule of thumb is that a token represents about 4 characters (roughly 4 bytes) of English text.

**NOTE2:** What is a *context window*?  It represents how much "memory" the model has to work with, though it is important to consider *Signal-to-Noise* effects of "too much" data in memory.  Within an LLM interaction session the total context grows until it reaches the maximum context window size, after which the oldest content is dropped first (FIFO).
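
To make the two notes concrete, here is a rough client-side sketch that estimates tokens with the 4-characters rule of thumb and drops the oldest messages once a budget is exceeded. Both helper names are hypothetical, the estimate is only an approximation of what a real tokenizer would count, and keeping the system message in place is an assumption of this sketch.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def trim_messages(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the estimated total fits max_tokens."""
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']

    def total(msgs: list[dict]) -> int:
        return sum(estimate_tokens(m['content']) for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # FIFO: the oldest message leaves first
    return system + rest

# Message contents abbreviated from the chat session earlier in this section
history = [
    dict(role='system', content='Please provide short and concise answers'),
    dict(role='user', content='I am thinking about a good gift for my mother'),
    dict(role='assistant', content='Here are some ideas...'),
    dict(role='user', content='I have a budget of $200, can you just make a suggestion?'),
]
trimmed = trim_messages(history, max_tokens=30)
print([m['role'] for m in trimmed])
```

The same list-of-dicts shape is what `ChatSession.messages` holds, so a trimmed copy could be passed to `chat()` in place of the full history.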

Here's a graph from [Meibel regarding LLM context window sizes](https://www.meibel.ai/post/understanding-the-impact-of-increasing-llm-context-windows):

<center>
<img src="../img/meibel-ai-context-window-size-history.png" width=600>
</center>

### EXERCISE: Create your own custom model using a `Modelfile`

*(5 minutes)*

Starting with the `gemma2:2b` model, create a `Modelfile` that will act as a calculator with natural language input.  Construct a system prompt that tells the model how to behave, then provide examples of input prompts and output.

**NOTE:** Don't be surprised if this is hard to make work -- the base models we're using are not tuned/trained for math.

**Challenge:** Get the model to support progressive operations which build on the last output.