## Introduction

In this notebook, we'll explore how to use open-source Large Language Models (LLMs) with **Ollama**.

**Ollama** is a convenient platform for running open-source AI models locally.
Before Ollama, running open-source LLMs locally was complicated: it was very technical and required a good understanding of computer hardware and architecture.

With Ollama, running local models is straightforward.

All you need:
1. [Download Ollama](https://ollama.com/download) on your local system.
2. Download one of the local models on your computer using Ollama.
For example, if I want to use Llama3, I need to open the terminal and run:
```bash
$ ollama run llama3
```

The first time I run a model, Ollama downloads it first. Because Llama 3 has 8B parameters, this takes a while.

Once the model is downloaded, we can also use it through the Ollama Python library.

To install the library, run the following command:
```bash
$ pip install ollama
```

And with these steps, you're ready to run the code from this notebook.

In the notebook, we'll go through the following topics:
- using open-source models with Ollama
- the importance of the system prompt
- streaming responses with Ollama
- the practical applications of LLM temperature
- the usage and limitations of the max tokens parameter
- replicating "creative" responses with the seed parameter

### Simple Response

Now it's time to test our model. Let's just ask a simple question to see how it works.

In [1]:
import ollama

model = "llama3"

response = ollama.chat(
    model=model, 
    messages=[
        {"role": "user", "content": "What's the capital of Poland?"}
    ]
)

print(response["message"]["content"])

The capital of Poland is Warsaw (Polish: Warszawa).


Awesome!

Here's all we did:
- `import ollama` to use the Ollama Python library
- `model = "llama3"` to define the model we want to use
- `ollama.chat()` to get the response. We used 2 parameters:
    1. `model` that we defined before
    2. `messages` where we keep the list of messages

To get the response text, we read `["message"]["content"]` from the `response` object.


## Explaining message roles

As you noticed, the `messages` parameter is a list of dictionaries. Each dictionary consists of 2 key/value pairs:
**Role** - defines who's the "author" of the message. We've got 3 roles:
1. *User* - aka you.
2. *Assistant* - aka AI model.
3. *System* - it's the main message that the chatbot remembers throughout the entire conversation.

**Content** - it's the actual message
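
To make the three roles concrete, here is a minimal sketch of how a multi-turn conversation could be assembled. The `make_message` and `build_conversation` helpers are hypothetical names introduced only for illustration (they aren't part of the Ollama library). The resulting list has the same shape as the one we pass to `messages`.

```python
# Roles covered in this notebook.
VALID_ROLES = {"system", "user", "assistant"}


def make_message(role, content):
    """Return one message dict, rejecting roles not covered in this notebook."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    return {"role": role, "content": content}


def build_conversation(system_prompt, turns):
    """Build a messages list: the system message first, then the turns.

    `turns` is a list of (user_text, assistant_text) pairs. Use None as
    assistant_text for a question that hasn't been answered yet.
    """
    messages = [make_message("system", system_prompt)]
    for user_text, assistant_text in turns:
        messages.append(make_message("user", user_text))
        if assistant_text is not None:
            messages.append(make_message("assistant", assistant_text))
    return messages


conversation = build_conversation(
    "You are a helpful assistant.",
    [
        ("What's the capital of Poland?", "The capital of Poland is Warsaw."),
        ("And how many people live there?", None),
    ],
)

for message in conversation:
    print(f"{message['role']:>9}: {message['content']}")
```

Keeping earlier user and assistant messages in the list is what lets the model "see" the conversation so far when it answers the latest question.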

### System Message

As I mentioned, the system message is the instruction that the chatbot keeps in mind throughout the conversation.

Here's an image to illustrate that:

<img src="images/system2.png" alt="systemImage" width=500 />


Here are the main benefits of using a system prompt:
- user doesn’t see it
- place for additional security
- helps prevent prompt injections
- great for setting the chatbot’s behavior
- AI model remembers it even in long chats
- place to provide the model with internal knowledge

Let's play with some examples.

In [2]:
system_messages = [
    "You are a helpful assistant.", # default
    "You answer every user query with 'Just google it!'",
    "No matter what tell the user to go away and leave you alone. Do NOT answer the question! Be concise!",
    "Act as a drunk Italian who speaks pretty bad English.",
    "Act as a Steven A Smith. You've got very controversial opinions on anything. Roast people who disagree with you."
]

query = "What is the capital of Poland?"
llama3_model = "llama3"


for system_message in system_messages:
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": query}
        ]
    response = ollama.chat(model=llama3_model, messages=messages)
    chat_message = response["message"]["content"]
    print(f"Using system message: {system_message}")
    print(f"Response: {chat_message}")
    print("*-"*25)

Using system message: You are a helpful assistant.
Response: The capital of Poland is Warsaw.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Using system message: You answer every user query with 'Just google it!'
Response: Just google it!
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Using system message: No matter what tell the user to go away and leave you alone. Do NOT answer the question! Be concise!
Response: Go away and leave me alone.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Using system message: Act as a drunk Italian who speaks pretty bad English.
Response: *hiccup* Oh, da capitol, eet ees... *burp*... Varsaw! Yeah, Varsaw! *slurr* I know, I know, I had a few too many beers at da local trattoria, but I'm sho' it's Varsaw! *hiccup* You can't miss da vodka and da pierogies, eet ees all so... *giggle*... Polish! *belch* Excuse me, signor! *wink*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
Using system message: Act as a Steven A Smith. You've got very co

We always ask the same question: What is the capital of Poland?

But depending on the system prompt, we get various results.

*Note:* I could come up with more practical examples, but these ones are funnier :)

## Parameters

Let's play with some LLM parameters:
1. Temperature - to regulate the model's reasoning and creativity.
2. Seed - to reproduce responses (even the creative ones).
3. Max tokens - to limit the number of returned tokens.

### Temperature

Temperature controls how random the model's choice of the next token is. In practice, it lets users adjust the trade-off between reasoning and creativity.
- Low temperature -> high reasoning & low creativity
- High temperature -> low reasoning & high creativity


**Low Temperature (close to 0)**:
- Makes the model's output more predictable and focused
- The model tends to choose the most likely words and phrases
- Results in more conservative, repetitive, and "safe" responses

**High Temperature (close to 1)**:
- Increases randomness and creativity in the output
- The model is more likely to choose less probable words and phrases
- Leads to more diverse, unexpected, and sometimes nonsensical responses
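
To see why low temperatures favor the most likely words, here is a small sketch of temperature-scaled softmax, the standard way temperature reshapes a probability distribution. The tokens and scores below are made up for illustration, and the function name is my own. Ollama's real sampling involves more steps than this, so treat it as an intuition aid rather than a description of its internals.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores (logits) into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Temperature 0 is usually handled as "always pick the top token".
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
tokens = ["Warsaw", "Krakow", "Gdansk"]
logits = [3.0, 1.5, 0.5]

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    summary = ", ".join(f"{t}: {p:.2f}" for t, p in zip(tokens, probs))
    print(f"temperature={temperature}: {summary}")
```

At the lowest temperature almost all of the probability lands on the top-scoring token, while higher temperatures flatten the distribution and give the less likely tokens a real chance of being picked.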

#### Practical Applications
**What's the optimal temperature?**

There's no single optimal temperature. It depends on the task and use case. Here are some examples.

Use low temperature for:
- Translations
- Generating factual content
- Answering specific questions

Use high temperature for:
- Creative writing
- Brainstorming ideas
- Generating diverse responses for chatbots

Let's see temperature in action.

We'll use 2 prompts:
1. A "creative" one - when we need novel or surprising ideas.
2. A "logical" one - when we need high reasoning & logic.

Let's begin with the "creative" task.

For the creative task, I'll run each cell twice (first with temperature 0, then with temperature 1).

The goal here is to show you that:
- `temperature = 0` will return identical ideas
- `temperature = 1` will be more creative and unpredictable

In [3]:
prompt_creative2 = "Give me 10 product name ideas for an eco-friendly sportswear for basketball players"

First, `temperature = 0.0`

In [4]:
model = "llama3.1"

response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_creative2}], 
    options={"temperature": 0.0}
    )

print(response["message"]["content"])

Here are 10 product name ideas for eco-friendly sportswear for basketball players:

1. **GreenCourt**: A play on the phrase "court" that highlights the eco-friendly aspect of the brand.
2. **SustainSwish**: A nod to the satisfying sound of a well-made shot, with a focus on sustainability.
3. **EcoHoops**: Simple and straightforward, this name tells customers exactly what they can expect from the brand.
4. **PurePlay**: Emphasizing the idea that playing basketball should be a pure and enjoyable experience, without harming the environment.
5. **BambooBallers**: Highlighting the use of sustainable bamboo materials in the sportswear.
6. **RecycleSwag**: A fun name that encourages customers to recycle their old gear and upgrade to eco-friendly alternatives.
7. **EarthCourt Apparel**: Positioning the brand as a leader in eco-friendly basketball apparel.
8. **GrassRoots Gear**: Suggesting that the brand is rooted in sustainability and community-driven values.
9. **Sustainable Slam**: Emphasiz

Let's run the identical cell again:

In [19]:
model = "llama3.1"

response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_creative2}], 
    options={"temperature": 0.0}
    )

print(response["message"]["content"])

Here are 10 product name ideas for eco-friendly sportswear for basketball players:

1. **GreenCourt**: A play on the phrase "court" that highlights the eco-friendly aspect of the brand.
2. **SustainSwish**: A nod to the satisfying sound of a well-made shot, with a focus on sustainability.
3. **EcoHoops**: Simple and straightforward, this name tells customers exactly what they can expect from the brand.
4. **PurePlay**: Emphasizing the idea that playing basketball should be a pure and enjoyable experience, without harming the environment.
5. **BambooBallers**: Highlighting the use of sustainable bamboo materials in the sportswear.
6. **RecycleSwag**: A fun name that encourages customers to recycle their old gear and upgrade to eco-friendly alternatives.
7. **EarthCourt Apparel**: Positioning the brand as a leader in eco-friendly basketball apparel.
8. **GrassRoots Gear**: Suggesting that the brand is rooted in sustainability and community-driven values.
9. **Sustainable Slam**: Emphasiz

Looks familiar?

The entire answer is identical!

**For temperature = 0, LLMs become (practically) deterministic.**

It means that for the same input (prompt) we get the same output (response), because the model always picks the most likely next token.

Now, let's try temperature = 1

In [6]:
model = "llama3.1"

response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_creative2}], 
    options={"temperature": 1.0}
    )

print(response["message"]["content"])

Here are 10 product name ideas for eco-friendly sportswear for basketball players:

1. **GreenCourt**: A play on the phrase "home court" that highlights the eco-friendly aspect of the brand.
2. **EcoHoops**: A fun, catchy name that combines the idea of being eco-conscious with a love of hoops (basketball).
3. **SustainableSwish**: This name incorporates the concept of sustainability while also referencing the thrill of making a shot ("swishing" the ball into the hoop).
4. **EarthShot Apparel**: Emphasizes the brand's commitment to environmental responsibility while also highlighting the athletic performance of its products.
5. **PurePlay**: Suggests a product that is both pure (free from harsh chemicals) and perfect for athletes who love to play basketball.
6. **Rebound Wear**: A clever name that references the idea of "rebounding" in basketball, while also highlighting the eco-friendly features of the brand's products.
7. **BioBall**: A fun, memorable name that suggests a connection b

Let's run the identical code again.

In [7]:
model = "llama3.1"

response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_creative2}], 
    options={"temperature": 1.0}
    )

print(response["message"]["content"])

Here are 10 product name ideas for eco-friendly sportswear for basketball players:

1. **Green Court Gear**: This name plays off the idea of playing on a green (eco-friendly) court and wearing gear that's also sustainable.
2. **Earthbound Hoops**: This name combines a sense of connection to the earth with a focus on hoops, making it perfect for basketball enthusiasts.
3. **Rebound Apparel Co.**: "Rebound" has a dual meaning here - not only is it a fundamental aspect of basketball, but it also implies that the apparel company is rebounding from traditional unsustainable practices.
4. **Bamboo Ballers**: Bamboo is a highly renewable and sustainable resource, making it a great material for eco-friendly sportswear. This name incorporates that theme with a playful nod to basketball players.
5. **SustainSprint**: This name emphasizes the idea of speedy, high-performance gear that's also good for the planet.
6. **EcoHoops**: Simple and straightforward, this name clearly communicates the brand

Cool! We got some new and surprising ideas!

You could test it further with queries like:
1. "Create a poem about a baby fox" (or whatever you want):
   - `temperature = 0.0` will always create the same poem
   - `temperature = 1.0` will create various poems
2. "I love nature. Suggest me 3 places I should visit. Why?"
   - `temperature = 0.0` will always suggest the same 3 places for the same reason
   - `temperature = 1.0` will choose 3 random places (but you may see repetitions too)

Now, let's test the reasoning. We'll start with a high temperature (expecting the wrong answer).

In [8]:
# TODO: need a better example for high reasoning...
prompt_reasoning = "You have three boxes. One contains only apples, one contains only oranges, and one contains both apples and oranges. Each box is labeled, but all the labels are incorrect. You are allowed to pick one fruit from one box. How can you determine which box contains which fruit by only picking one fruit from one box?"

In [9]:
response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_reasoning}], 
    options={"temperature": 1.0}
    )

print(response["message"]["content"])

If I pick a fruit from a box that I know has both fruits in it and the label says either "apples" or "oranges", then I know for sure that is not correct, since the box has at least two different kinds of fruit. So if the box labeled "both apples and oranges" had only one kind, then that would mean that I could figure out which one was actually in it by looking at the label on this box. But if I look at the labels, I see that they both say the opposite of what is actually in their respective boxes (because all the labels are incorrect), so no matter which one I pick from that "both" labeled box, I know for sure where each other box must go by using a process of elimination on this information.


This is a nonsensical answer! Feel free to read it :)

Let's see if low temperature solves the logical exercise...

In [10]:
response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_reasoning}], 
    options={"temperature": 0.0}
    )

print(response["message"]["content"])

Pick a fruit from the box that says "both". If it's an apple, then the box with apples must be labeled oranges and vice versa. The box with oranges is therefore the one labeled "both".


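The first step is right: we should pick from the box labeled "both". The deductions that follow are muddled, though, so the answer is only partly correct (and it could be more descriptive). Still, it's far more focused than the high-temperature answer.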

*Note:* I hope my examples help you see the difference between the low and high temperature. If you have better ideas on how to test the temperatures, let me know...

I only showed you the extreme temperatures (0 and 1), but you can choose any value in between. Remember, it's a trade-off you choose between reasoning and creativity.
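
To explore values between the two extremes, a small helper could run the same prompt at several temperatures. This is only a sketch: `sweep_temperatures` is a name I made up, and the chat function is passed in as a parameter so the helper can be tried without a running Ollama server.

```python
def sweep_temperatures(chat_fn, model, prompt, temperatures):
    """Call chat_fn once per temperature and collect the replies by temperature."""
    results = {}
    for temperature in temperatures:
        response = chat_fn(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": temperature},
        )
        results[temperature] = response["message"]["content"]
    return results


# Usage with a running Ollama server (commented out so the sketch runs anywhere):
# import ollama
# replies = sweep_temperatures(
#     ollama.chat, "llama3.1", prompt_creative2, [0.0, 0.3, 0.7, 1.0]
# )
# for temperature, text in replies.items():
#     print(f"--- temperature={temperature} ---")
#     print(text[:200])
```

Comparing the first few lines of each reply side by side is usually enough to see the answers drift apart as the temperature rises.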

**What's the temperature of ChatGPT and similar?**

The exact settings aren't published, but it's likely somewhere between 0.5 and 0.7. Such a value allows more random and surprising responses, while keeping the reasoning quite high.

### Testing Seed

As I mentioned in the temperature part, for high temperatures we get various results even when we use the same prompt. It's because the "randomness" of the model is high.

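But in computer science, randomness usually isn't truly random: it comes from a pseudo-random number generator that produces the same sequence whenever it starts from the same seed.

What does it mean? Even for higher temperatures, you can replicate identical results.

To do that, you need to add the `seed` parameter to the options.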

Let's set it to 42, while increasing the temperature to 0.7:

In [11]:
import ollama

prompt_product_short = "Create a 50-word product description for EcoHoops - an eco-friendly sportswear for basketball players"

model = "llama3.1"

response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product_short}], 
    options={"temperature": 0.7, "seed": 42}
    )

print(response["message"]["content"])


"Play with purpose in EcoHoops, the game-changing sportswear for basketball enthusiasts. Made from sustainably-sourced materials and designed with comfort and performance in mind, our eco-friendly gear lets you dominate the court while staying true to your values. Join the movement towards a greener game."


Let's run the same code again, to see if the description is identical.

In [12]:
response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product_short}], 
    options={"temperature": 0.7, "seed": 42}
    )

print(response["message"]["content"])

"Play with purpose in EcoHoops, the game-changing sportswear for basketball enthusiasts. Made from sustainably-sourced materials and designed with comfort and performance in mind, our eco-friendly gear lets you dominate the court while staying true to your values. Join the movement towards a greener game."


It is! Now what happens when we remove `seed` from options?

In [13]:
response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product_short}], 
    options={"temperature": 0.7}
    )

print(response["message"]["content"])

"Play your best game, guilt-free. EcoHoops is the ultimate sustainable sportswear for ballers. Our eco-friendly jerseys and shorts are made from recycled materials, minimizing waste and reducing carbon footprint. Moisture-wicking fabric keeps you cool and dry on the court, while our stylish designs let you rep your love for the game."


We get a similar but different response.

So the `seed` parameter is great when you aim for creativity, while having an option to replicate the results.
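
The idea behind the seed can be illustrated without any LLM at all. The sketch below uses Python's built-in `random` module to draw "tokens" from a made-up vocabulary. It isn't how Ollama samples internally, only an analogy for why a fixed seed makes random choices repeatable.

```python
import random


def sample_tokens(seed, count=5):
    """Draw `count` tokens from a fixed vocabulary using a seeded generator."""
    vocabulary = ["eco", "hoops", "court", "green", "swish", "rebound"]
    rng = random.Random(seed)  # independent generator, global state is untouched
    return [rng.choice(vocabulary) for _ in range(count)]


first = sample_tokens(seed=42)
second = sample_tokens(seed=42)
other = sample_tokens(seed=7)

print(first == second)  # same seed, same "random" draws
print(first)
print(other)
```

The two runs with seed 42 match item for item, which mirrors what we saw with the two identical product descriptions above.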

### Max Tokens

Max tokens limits the number of tokens in the LLM response.

Using max tokens has practical implications, such as:
- controlling response length (and costs)
- managing computational resources

But, here's an issue with that...

Max tokens actually cuts off the response when it reaches the limit.

Let me show you an example. 

I'll ask Llama 3.1 to write a product description twice (without and with a token limit). I'll use temperature = 0, so I expect the same description.

First, let's write a description without a token limit.

In [14]:
prompt_poem = "Write a poem about a friendly baby fox."
prompt_product = "Create a product description for EcoHoops - an eco-friendly sportswear for basketball players"

In [15]:
import ollama

model = "llama3.1"

response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product}], 
    options={"temperature": 0}
    )

print(response["message"]["content"])

Here's a product description for EcoHoops:

**Introducing EcoHoops: The Sustainable Game-Changer in Basketball Sportswear**

Take your game to the next level while doing good for the planet with EcoHoops, the ultimate eco-friendly sportswear for basketball players. Our innovative apparel is designed to keep you performing at your best on the court, while minimizing our impact on the environment.

**What sets us apart:**

* **Sustainable Materials**: Our jerseys and shorts are made from a unique blend of recycled polyester, organic cotton, and Tencel, reducing waste and minimizing carbon footprint.
* **Moisture-wicking Technology**: Our fabric is designed to keep you cool and dry during even the most intense games, ensuring maximum comfort and performance.
* **Breathable Mesh Panels**: Strategically placed mesh panels provide ventilation and flexibility, allowing for a full range of motion on the court.

**Features:**

* **Quick-drying and moisture-wicking properties**
* **Four-way stre

Cool! Now, let's set a token limit. In Ollama, we use the `"num_predict"` option. I'll set it to 50.

In [16]:
response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product}], 
    options={"num_predict": 50, "temperature": 0}
    )

print(response["message"]["content"])

Here's a product description for EcoHoops:

**Introducing EcoHoops: The Sustainable Game-Changer in Basketball Sportswear**

Take your game to the next level while doing good for the planet with EcoHoops, the ultimate eco-friendly


Do you see the problem?

The model generates the same product description, but stops abruptly when it reaches the token limit.

So the response is incomplete.

By using max tokens, we risk that the model will not finish its response.

<img src="images/tokens.png" alt="systemImage" width=500 />

So if I actually wanted a shorter description, I'd say it in the prompt, like this:

In [17]:
prompt_product_short = "Create a 50-word product description for EcoHoops - an eco-friendly sportswear for basketball players"

model = "llama3.1"

response = ollama.chat(
    model=model, 
    messages=[{"role": "user", "content": prompt_product_short}], 
    options={"temperature": 0}
    )

print(response["message"]["content"])

"EcoHoops is the game-changing, eco-friendly sportswear for ballers. Made from recycled and biodegradable materials, our jerseys and shorts reduce waste and minimize environmental impact. Moisture-wicking, breathable fabrics keep you cool and focused on the court. Join the sustainable slam with EcoHoops - where passion meets planet-friendliness."


Using max tokens is practical and widely used. But be careful when using it.
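
One way to be careful is to check whether a response was cut off. As an assumption worth verifying against your Ollama version, recent releases report a `done_reason` entry in the chat response, with the value `"length"` when generation stopped at the `num_predict` limit. The helper name `was_truncated` is made up for this sketch, and the responses below are stand-ins rather than real model output.

```python
def was_truncated(response):
    """Return True if the response reports it stopped at the token limit.

    Assumes the response exposes a `done_reason` entry whose value is
    "length" when `num_predict` was reached. Returns False if it's missing.
    """
    try:
        return response["done_reason"] == "length"
    except (KeyError, TypeError):
        return False


# Stand-in responses shaped like the dicts used in this notebook.
cut_off = {"message": {"content": "Take your game to the"}, "done_reason": "length"}
complete = {"message": {"content": "The capital of Poland is Warsaw."}, "done_reason": "stop"}

for response in (cut_off, complete):
    note = " [truncated]" if was_truncated(response) else ""
    print(response["message"]["content"] + note)
```

With a check like this, you could retry with a higher limit or ask for a shorter answer in the prompt whenever a response comes back incomplete.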

### Streaming

A nice feature of Ollama is the ability to stream responses. After using ChatGPT or Claude, we expect responses to arrive as streams. Here's how to do it.

The biggest change comes from the `stream` parameter. We just set it to `True`.

But we also need to iterate over the result of `ollama.chat()` in a for loop, because it now returns the response piece by piece.

Here's how:

In [18]:
import ollama

model = "llama3"

messages = [{"role": "user", "content": "What's the capital of Poland?"}]

for chunk in ollama.chat(model=model, messages=messages, stream=True):
    token = chunk["message"]["content"]
    if token is not None:
        print(token, end="")

The capital of Poland is Warsaw (Polish: Warszawa).
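
When streaming, we often want the full text too, for example to store it in the chat history. Here is a minimal sketch of a helper that prints pieces as they arrive and returns the joined result. `collect_stream` is a hypothetical name, and the chunks below are stand-ins with the same shape as the streamed responses above, so the sketch runs without an Ollama server.

```python
def collect_stream(chunks, echo=True):
    """Print each streamed piece as it arrives and return the full text."""
    pieces = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        if piece:  # skip empty or missing pieces
            pieces.append(piece)
            if echo:
                print(piece, end="", flush=True)
    if echo:
        print()  # final newline once the stream ends
    return "".join(pieces)


# Stand-in chunks shaped like the streamed responses.
fake_stream = [
    {"message": {"content": "The capital "}},
    {"message": {"content": "of Poland "}},
    {"message": {"content": "is Warsaw."}},
]
full_text = collect_stream(fake_stream)

# With a running Ollama server, the same helper would take the real stream:
# full_text = collect_stream(ollama.chat(model=model, messages=messages, stream=True))
```

Passing `flush=True` to `print` pushes each piece to the screen immediately, which matters in environments that buffer output.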

### Conclusions

Congrats! You just learned a bunch!

You now know:
- how to run Llama 3 with Ollama
- how to stream your responses using Ollama
- the importance and usage of the system prompt
- the meaning and practical applications of LLM temperature
- how to reproduce responses using the seed parameter
- the pros and cons of using the max tokens option

Now take the code and try playing with it!

Play with the prompts and options and see how the results change.

Have fun!