# 2 - Serve Local LLMs

Quick sanity check on the current environment

In [None]:
import sys
sys.executable

In [None]:
sys.version

## 2.1 Check Ollama available models

<center>
<img src="../img/ollama-horizontal.png" alt="Ollama logo">
</center>

If you did the setup correctly (see [`README.md`](../README.md) in the repo root) then you should see at least a few models already available locally.

In [None]:
import ollama

The `.list()` call returns a response object; the model details are in its `.models` attribute.

In [None]:
models = ollama.list().models

In [None]:
print('\n\nOllama models (local):\n')
for m in models:
    print(f'{m.model:<30}\t'+
          f'{m.details.family}\t'+
          f'{m.details.parameter_size}\t'+
          f'{int(m.size/(1e6)):>6} MB\t'+
          f'{m.details.quantization_level}\t'+
          f'{m.details.format}')

## 2.2 Connect to your Ollama local LLM server

Now make a one-shot request to the smallest LLM, `gemma3:270m`

In [None]:
%%time
response = ollama.generate(model='gemma3:270m',
                           prompt='Tell me a one paragraph story about a chicken')

If that didn't work for you, make sure Ollama is running.  There are two ways to do this:

* Desktop native app -- search *Start* (Windows) or Spotlight (*Cmd-Space* on macOS) for "Ollama" and make sure it is running
* From the command line:

```bash
ollama start
```

The latter has the advantage that you can see incoming requests.
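If you want the notebook itself to tell you whether the server is reachable, a small probe can help. The sketch below is an assumption-laden illustration: `server_is_reachable` is a hypothetical helper (not part of the `ollama` package) that takes any cheap client call and reports whether it succeeded.

```python
from typing import Callable

def server_is_reachable(probe: Callable[[], object]) -> bool:
    """Return True if calling `probe` succeeds, False if it raises."""
    try:
        probe()
        return True
    except Exception as exc:
        # Broad on purpose: the exact exception type can vary by client version
        print(f'Ollama does not appear to be reachable: {exc}')
        return False

# In this notebook you might pass a cheap client call, for example:
#   server_is_reachable(ollama.list)
```

Passing the call in as an argument keeps the helper independent of any particular client version.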

In [None]:
response.response

### EXERCISE: Experiment with different models & one-shot queries
*(5 minutes)*

Notes:
* Start with the smallest model and then increment in parameter size
* Use *"Task Manager"* (Windows) or *"Activity Monitor"* (macOS) to see how much CPU and RAM Ollama is using
* Try the same prompt more than once with the same model to get a sense of intra-model variability
* Try the same prompt more than once with different models to get a sense of inter-model variability
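
One way to make the variability comparisons above less manual is to collect several runs in a list. This is only a sketch: `collect_responses` and `distinct_count` are hypothetical helper names, and the `generate` argument is assumed to be any function that takes a model name and a prompt and returns text.

```python
from typing import Callable, List

def collect_responses(generate: Callable[[str, str], str],
                      model: str,
                      prompt: str,
                      n: int = 3) -> List[str]:
    """Call `generate(model, prompt)` n times and return the outputs in order."""
    return [generate(model, prompt) for _ in range(n)]

def distinct_count(responses: List[str]) -> int:
    """Number of distinct outputs: 1 means every run produced identical text."""
    return len(set(responses))

# Example wiring with the Ollama client (assumes the server is running):
#   run = lambda model, prompt: ollama.generate(model=model, prompt=prompt).response
#   outputs = collect_responses(run, 'gemma3:270m', 'Tell me a one paragraph story about a chicken')
#   distinct_count(outputs)
```

Exact string comparison is a crude measure of variability, but it is enough to see whether a model ever repeats itself verbatim.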

If the model outputs Markdown, you can display it in a Jupyter notebook with:

```python
from IPython.display import display, Markdown

display(Markdown(response.response))
```

Outside of a Jupyter notebook, you'll need something like [`python-markdown`](https://python-markdown.github.io/) to convert Markdown text to HTML.

The helper function `printmd()` below displays generated Markdown directly in a Jupyter notebook.

In [None]:
from IPython.display import display, Markdown, Latex

def printmd(text:str) -> None:
    ''' Jupyter-only print function for markdown text '''
    display(Markdown(text))

In [None]:
%%time
response = ollama.generate(model='gemma2:2b', 
                           prompt='What are some of the current geo-political issues?')

In [None]:
printmd(response.response)

## 2.3 Chat Sessions

In [None]:
from ollama import chat

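class ChatSession:
    ''' Minimal multi-turn chat wrapper that keeps the full message history '''
    def __init__(self,
                 model:str,
                 system:str = 'You are a helpful chatbot'):
        self.model    = model
        self.system   = system
        # The system prompt is always the first message in the history
        self.messages = [dict(role='system', content=system)]

    def prompt(self, msg:str) -> str:
        ''' Send a user message, record the reply, and return its text '''
        self.messages.append(dict(role='user', content=msg))
        # The whole history is sent on every call so the model sees prior turns
        response = chat(model=self.model, messages=self.messages).message.content
        self.messages.append(dict(role='assistant', content=response))
        return response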

In [None]:
cs = ChatSession(model='gemma2:2b', system='Please provide short and concise answers')

In [None]:
%%time
printmd(cs.prompt("I am thinking about a good gift for my mother"))

In [None]:
%%time
printmd(cs.prompt("I think she'd like a piece of jewelry. Do you have any recommendations?"))

In [None]:
%%time
printmd(cs.prompt("I have a budget of $200, can you just make a suggestion?"))

### Record of interaction

Our `ChatSession` object has retained a record of the interaction in the `.messages` list attribute.

Depending on your objectives, you may need to log details of chat sessions, including:

* model
* input
* output
* performance
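
As a rough illustration of capturing those four items, the sketch below wraps a single call and appends one record per interaction. `logged_prompt` is a hypothetical helper, and wall-clock latency is only one possible stand-in for "performance".

```python
import time
from typing import Callable, Dict, List

def logged_prompt(respond: Callable[[str], str],
                  model: str,
                  prompt: str,
                  log: List[Dict]) -> str:
    """Call `respond(prompt)`, append one log record, and return the output."""
    start = time.perf_counter()
    output = respond(prompt)
    elapsed = time.perf_counter() - start
    log.append({
        'model': model,      # which model answered
        'input': prompt,     # what was asked
        'output': output,    # what came back
        'seconds': elapsed,  # wall-clock latency of the call
    })
    return output

# With the session defined above this might look like:
#   chat_log = []
#   logged_prompt(cs.prompt, cs.model, 'Any other gift ideas?', chat_log)
```

A plain list of dictionaries is easy to dump to JSON or load into a DataFrame later.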



In [None]:
cs.messages

### EXERCISE: Experiment with your own chat session

*(5 minutes)*

Use the `ChatSession` object and template above to experiment with your own chat session.

Try using a few different models.

## 2.4 Creating Your Own Models

Ollama provides a number of ways to create your own model, from any of these sources:

* your local Ollama model repository
* the global/public Ollama model repository
* GGUF files you have locally

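This allows you to create model variants to meet your specific needs.  We'll experiment more with this later, but we can start with some basic examples of lightweight models created directly from Python, which use *system prompts* as the basis for a model variant.  Note that models created this way are not purely in-memory: they are registered in your local Ollama model repository and will show up in `ollama list` until you remove them.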

More details on Ollama's `create` API can be found [here](https://github.com/ollama/ollama/blob/main/docs/api.md#create-a-model).

### Note on Prompt Engineering

Prompt engineering is a critical skill for successfully interacting with LLMs.  Details of how to do this well are out of scope for this tutorial, but at a minimum it is important to understand that a *system prompt* provides a shared context for every chat message within a session.  The model sees the system prompt each time it constructs a response.

In [None]:
%%time
ollama.create(model='mario', 
              from_='gemma2:2b', 
              system="You are Mario from Super Mario Bros.")

In [None]:
%%time
mario = ollama.generate(model='mario', 
                        prompt='What is on your mind today?')
printmd(mario.response)

In [None]:
%%time
ollama.create(model='sentiment', 
              from_='gemma2:2b', 
              system="""
              You are a sentiment classifier.
              Break down all inputs by sentence.
              Classify each sentence as exactly one of the following:
              POSITIVE, NEGATIVE, NEUTRAL, UNCLEAR""")

In [None]:
%%time
classifications = ollama.generate(
    model='sentiment', 
    prompt="""
    I just finished a long trip.
    Visiting Peru was amazing.
    I had some great adventures with my brother.
    We saw many Incan archeological sites.
    My flight home required four flights,
    but at least I was able to get home faster than my original itinerary.
    I lost my toiletry kit on one of my flights.""")
printmd(classifications.response)
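
The classifier replies in free-form text, so downstream code usually needs to pull the labels back out. Here is a minimal sketch, assuming the labels appear as whole words somewhere in the reply; `extract_labels` is a hypothetical helper, not part of the `ollama` package.

```python
import re
from typing import List

LABELS = ('POSITIVE', 'NEGATIVE', 'NEUTRAL', 'UNCLEAR')
_LABEL_RE = re.compile(r'\b(' + '|'.join(LABELS) + r')\b', re.IGNORECASE)

def extract_labels(text: str) -> List[str]:
    """Return every label mentioned in `text`, upper-cased, in order of appearance."""
    return [m.group(1).upper() for m in _LABEL_RE.finditer(text)]

# In this notebook: extract_labels(classifications.response)
```

This will also count a label word that happens to occur inside a quoted input sentence, so treat the result as a first pass rather than a reliable parse.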

## 2.5 Using `Modelfile` to customize an LLM

Ollama defines the `Modelfile` format, a *de facto* standard for describing custom models derived from existing models.

[Modelfile Reference](https://docs.ollama.com/modelfile)

`luigi.modelfile`:
```
FROM    gemma2:2b
SYSTEM  """ You are Luigi from Super Mario Bros.
            All responses include some comment
            concerning your brother Mario.
        """
```

The Python API does not currently process a `Modelfile` directly, so use the CLI to create your derived model:

```bash
ollama create luigi -f modelfiles/luigi.modelfile
```

and then check to see that it has been created:

```bash
ollama list
```
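
If you would rather stay inside the notebook, one workaround is to invoke the same CLI command from Python. This is a minimal sketch using the standard-library `subprocess` module; `create_command` and `create_from_modelfile` are hypothetical helper names, and the `run` parameter exists only so the function can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable, List

def create_command(name: str, modelfile_path: str) -> List[str]:
    """Build the argument list for `ollama create NAME -f PATH`."""
    return ['ollama', 'create', name, '-f', modelfile_path]

def create_from_modelfile(name: str,
                          modelfile_path: str,
                          run: Callable = subprocess.run) -> None:
    """Run the CLI command; with subprocess.run a non-zero exit raises CalledProcessError."""
    run(create_command(name, modelfile_path), check=True)

# Example (requires the `ollama` executable on your PATH):
#   create_from_modelfile('luigi', 'modelfiles/luigi.modelfile')
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.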


In [None]:
%%time
luigi = ollama.generate(model='luigi', 
                        prompt='What is on your mind today?')
printmd(luigi.response)

### Every Ollama model has an associated `Modelfile` you can inspect

From the CLI, you can inspect the `Modelfile` of registered Ollama models:

```bash
ollama show tinyllama:1.1b --modelfile
```

In [None]:
print(ollama.show(model='tinyllama:1.1b').model_dump()['modelfile'])

Don't worry too much about the `TEMPLATE` and `PARAMETER stop` portions -- these define the way the model expects to handle the `.System`, `.Prompt`, and `.Response` elements of the model interaction, as defined in the [Modelfile Template reference](https://docs.ollama.com/modelfile#template) and the [Go Template syntax](https://pkg.go.dev/text/template).

Here's a more sophisticated version of the Sentiment Classifier model:

`sentiment2.modelfile`:
```
FROM        gemma2:2b

SYSTEM      """
            You are a sentiment classifier.
            Classify each input as exactly one of the following:
            POSITIVE, NEGATIVE, NEUTRAL, UNCLEAR
            """

PARAMETER temperature 0.5
PARAMETER num_ctx     1024

MESSAGE user        I had a great day
MESSAGE assistant   POSITIVE
MESSAGE user        That hockey game was insane
MESSAGE assistant   UNCLEAR
MESSAGE user        We need to go shopping this week
MESSAGE assistant   NEUTRAL
MESSAGE user        That was one of the worst movies ever
MESSAGE assistant   NEGATIVE
```

Create this model with the following command:

```bash
ollama create sentiment2 -f modelfiles/sentiment2.modelfile
```

In [None]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="We just had a foot of snow - I can't wait to go skiing")
printmd(classification.response)

In [None]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Revenues were higher than projected")
printmd(classification.response)

In [None]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Supply chain delays led to inventory issues across the network")
printmd(classification.response)

In [None]:
classification = ollama.generate(
    model='sentiment2', 
    prompt="Several key injuries are going to make the next game hard to win")
printmd(classification.response)
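
The `MESSAGE` lines in the `Modelfile` play a role similar to a pre-seeded chat history. As a rough Python-side equivalent, the sketch below builds a message list with the same examples; `few_shot_messages` is a hypothetical helper, and the resulting list is in the same role/content shape used by `ChatSession` earlier.

```python
from typing import Dict, List, Tuple

def few_shot_messages(system: str,
                      examples: List[Tuple[str, str]],
                      query: str) -> List[Dict[str, str]]:
    """System prompt, then alternating user/assistant examples, then the new query."""
    messages = [{'role': 'system', 'content': system}]
    for user_text, assistant_text in examples:
        messages.append({'role': 'user', 'content': user_text})
        messages.append({'role': 'assistant', 'content': assistant_text})
    messages.append({'role': 'user', 'content': query})
    return messages

examples = [('I had a great day', 'POSITIVE'),
            ('That hockey game was insane', 'UNCLEAR'),
            ('We need to go shopping this week', 'NEUTRAL'),
            ('That was one of the worst movies ever', 'NEGATIVE')]

messages = few_shot_messages('You are a sentiment classifier.', examples,
                             'Revenues were higher than projected')
# messages could then be passed to ollama.chat(model='gemma2:2b', messages=messages)
```

Keeping the examples in Python makes it easy to swap them per request, whereas the `Modelfile` bakes them into the model definition.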

### Model customization

This example shows three different ways the new model is customized:

1. Setting `temperature`, which affects the degree of randomness.  Values are typically chosen between `0.0` and `1.0`, where lower is described as *"more coherent"* and higher is described as *"more creative"*.

2. Setting `num_ctx`, which is the context window size, measured in *tokens*.  This is a critical parameter for performance & memory consumption.  Ollama's default for local models has been 2048 tokens, though it may differ in newer releases.  A bigger window lets the model take more input into account, which can improve output quality, but it increases processing time and RAM consumption.

3. Few Shot Learning (FSL) using a short set of example interaction messages between `user` prompts and `assistant` responses.
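
To build some intuition for what `temperature` does, here is the textbook softmax-with-temperature calculation. It is an illustration of the general idea only, not Ollama's actual sampling code, and the scores are made up.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities after dividing by `temperature`."""
    if temperature <= 0:
        raise ValueError('temperature must be positive')
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # made-up scores for three candidate tokens
cool = softmax_with_temperature(logits, 0.5)  # sharper: the top token dominates
warm = softmax_with_temperature(logits, 1.0)  # flatter: more chance for the others
```

Lower temperatures concentrate probability on the highest-scoring token, which is why low settings tend to give more repeatable output.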

**NOTE1:** What is a *token*?  It depends on how the model's tokenizer splits its input; however, a rule of thumb for English text is that a token represents about 4 characters (roughly 4 bytes) of input.
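
The rule of thumb above translates into a quick back-of-the-envelope check. This sketch is not a real tokenizer; `estimate_tokens` and `fits_in_context` are hypothetical helpers, and actual counts will vary by model.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the note above, not a real tokenizer

def estimate_tokens(text: str) -> int:
    """Rough token count: characters divided by four, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def fits_in_context(text: str, num_ctx: int) -> bool:
    """True if the estimate is within a context window of `num_ctx` tokens."""
    return estimate_tokens(text) <= num_ctx
```

For example, with `num_ctx` set to 1024 this estimate allows for roughly 4,000 characters of combined prompt and response.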

**NOTE2:** What is a *context window*?  It represents how much "memory" the model has to work with, though it is important to consider *Signal-to-Noise* effects of "too much" data in memory.  Within an LLM interaction session the total context grows with each exchange up to the maximum context window size, after which the oldest content is dropped first (FIFO).
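
The FIFO idea can be sketched at the application level by trimming a message list yourself. This is an assumption-heavy illustration: `trim_history` is a hypothetical helper, it uses the four-characters-per-token estimate, and it says nothing about how Ollama itself truncates an over-long context.

```python
import math
from typing import Dict, List

def estimate_tokens(text: str) -> int:
    """Rule-of-thumb size: about four characters per token."""
    return math.ceil(len(text) / 4)

def trim_history(messages: List[Dict[str, str]], max_tokens: int) -> List[Dict[str, str]]:
    """Drop the oldest non-system messages until the estimate fits `max_tokens`."""
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']

    def total(msgs: List[Dict[str, str]]) -> int:
        return sum(estimate_tokens(m['content']) for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # first in, first out
    return system + rest

# With the earlier session this might look like:
#   cs.messages = trim_history(cs.messages, 1024)
```

The system prompt is kept regardless of the budget, on the assumption that losing it would change the model's behavior more than losing an old exchange.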

Here's a graph from [Meibel regarding LLM context window sizes](https://www.meibel.ai/post/understanding-the-impact-of-increasing-llm-context-windows):

<center>
<img src="../img/meibel-ai-context-window-size-history.png" width=600>
</center>

### EXERCISE: Create your own custom model using a `Modelfile`

*(5 minutes)*

Starting with the `gemma2:2b` model, create a `Modelfile` that will act as a calculator with natural language input.  Construct a system prompt that tells the model how to behave, then provide examples of input prompts and outputs.

**NOTE:** Don't be surprised if this is hard to make work -- the base models we're using are not tuned/trained for math.

**Challenge:** Get the model to support progressive operations which build on the last output.