# Test Multiple Ollama Instances
Running two different GPUs inside a single Ollama image caused problems. I want more control, so I can put my smaller models on my smaller GPU and keep my big model on my 3090. This notebook tests a setup with multiple Ollama servers: two Docker containers, one per GPU. Port 11434 serves the 3090 and port 11433 serves the 3070. Let's see how things go.
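
Before sending prompts, it can help to confirm that each server answers on its port. The sketch below is only an illustration: the host address is the one used in the cells that follow, the `GPU_PORTS` mapping and both helper names are made up for this example, and it assumes Ollama's `GET /api/tags` endpoint, which lists the models a server has pulled.

```python
import json
import urllib.request
from typing import List

# Assumed layout, taken from the description above: one Ollama server per GPU.
OLLAMA_HOST = "192.168.86.2"
GPU_PORTS = {"3090": 11434, "3070": 11433}  # hypothetical helper mapping


def base_url(gpu: str) -> str:
    """Return the base URL of the Ollama server bound to the given GPU."""
    return f"http://{OLLAMA_HOST}:{GPU_PORTS[gpu]}"


def list_models(gpu: str, timeout: float = 5.0) -> List[str]:
    """Ask one server which models it has available (GET /api/tags)."""
    with urllib.request.urlopen(f"{base_url(gpu)}/api/tags", timeout=timeout) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]
```

Calling `list_models("3070")` and `list_models("3090")` should return one list per server if both containers are up. Each container keeps its own model store unless a volume is shared, so the two lists may differ.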

First, test the smaller GPU.

In [11]:
from llama_index.llms.ollama import Ollama
import textwrap


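def fmt(text):
    # Wrap each line to 120 columns; the parameter is named `text` so it does not shadow the built-in `str`.
    formatted_lines = [textwrap.fill(line, width=120) for line in text.split('\n')]
    return '\n'.join(formatted_lines)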



llm = Ollama(base_url="http://192.168.86.2:11433", model="phi3", request_timeout=360.0)

print(fmt(str(llm.complete("What is the meaning of life?"))))


The question "What is the meaning of life?" has intrigued philosophers, theologians, scientists, and thinkers for
centuries. There isn't a single answer that applies universally as perspectives on this vary widely based on cultural,
religious, and individual beliefs. Here are some viewpoints:


1. Philosophical Viewpoint: Many philosophers argue that life doesn't have an inherent meaning but rather it is up to
each individual to create their own purpose through choices and actions.

2. Religious Perspective: Different religions offer various answers, suggesting the purpose of life may be related to
spiritual fulfillment or following a divine plan set by a higher power.

3. Scientific Angle: From an evolutionary standpoint, one might argue that the meaning of life is simply to survive and
reproduce, passing on genes through biological processes.

4. Personal Fulfilment: Many believe in finding personal meaning through experiences, relationships, achievements, or a
combination thereof.



Now show the other GPU running.

In [15]:
llm_big_gpu = Ollama(base_url="http://192.168.86.2:11434", model="llama3:8b-instruct-fp16", request_timeout=360.0, keep_alive=-1)

print(fmt(str(llm_big_gpu.complete("What is the meaning of life?"))))


The age-old question!

The meaning of life is a topic that has been debated and explored by philosophers, theologians, scientists, and many
others for centuries. There is no one definitive answer, as it is a highly subjective and personal question that can
vary greatly from person to person.

Here are some possible perspectives on the meaning of life:

1. **Biological perspective**: From a biological standpoint, the meaning of life might be to survive, reproduce, and
pass on genetic information to future generations.
2. **Philosophical perspective**: Philosophers have offered various answers, such as:
        * To seek happiness or fulfillment (eudaimonia).
        * To pursue knowledge, wisdom, and personal growth.
        * To develop a sense of purpose, direction, or meaning in one's life.
        * To find significance, value, or importance in the world.
3. **Religious perspective**: Many religions believe that the meaning of life is to:
        * Worship and serve a higher power (

In [8]:
help(Ollama)

Help on class Ollama in module llama_index.llms.ollama.base:

class Ollama(llama_index.core.llms.custom.CustomLLM)
 |  Ollama(*, callback_manager: llama_index.core.callbacks.base.CallbackManager = None, system_prompt: Optional[str] = None, messages_to_prompt: Callable = None, completion_to_prompt: Callable = None, output_parser: Optional[llama_index.core.types.BaseOutputParser] = None, pydantic_program_mode: llama_index.core.types.PydanticProgramMode = <PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt: Optional[llama_index.core.prompts.base.BasePromptTemplate] = None, base_url: str = 'http://localhost:11434', model: str, temperature: float = 0.75, context_window: pydantic.v1.types.ConstrainedIntValue = 3900, request_timeout: float = 30.0, prompt_key: str = 'prompt', json_mode: bool = False, additional_kwargs: Dict[str, Any] = None) -> None
 |
 |  Ollama LLM.
 |
 |  Visit https://ollama.com/ to download and install Ollama.
 |
 |  Run `ollama serve` to start a server.
 |
 | 

In [16]:
print(fmt(str(llm_big_gpu.complete("Is there a way using ollama and python's llama_index to keep a model in the gpu memory?"))))


Yes, you can use the `llama_index` module from Python's Hugging Face Transformers library to load a LLaMA model into GPU
memory. Here's an example:
```python
import torch
from transformers import LLamaIndex, AutoModelForCausalLM

# Load the model and tokenizer
model_name = "llama-base-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model to GPU (assuming you have a CUDA-compatible NVIDIA GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Create an LLamaIndex instance
index = LLamaIndex(model, tokenizer)

# You can now use the index to generate text, etc.
input_ids = ...  # your input IDs
output = index.generate(input_ids)
print(output)
```
In this example, we load the `llama-base-1.3b` model and tokenizer using the `AutoModelForCausalLM` and `AutoTokenizer`
classes from the Transformers library. We then move the model to a CUDA-compatible GPU (or CPU if no GPU is

Let's push to see how big a model will run on the 3070.
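
A rough way to guess what fits is to estimate the size of the weights alone: parameter count times bits per weight. This is a back-of-the-envelope sketch, not a measurement. It ignores the KV cache and runtime overhead, real quantized formats use slightly more bits per weight than their nominal figure, and the function name `weight_gib` is made up for this example.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Compare an 8B-parameter model at a few precisions.
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"8B model, {label}: ~{weight_gib(8, bits):.1f} GiB")
```

By this estimate an 8B model needs about 14.9 GiB at fp16, which is beyond the 3070's 8 GB but comfortable on the 3090's 24 GB, while a 4-bit version comes to about 3.7 GiB and leaves room on the 3070.
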

In [20]:
llm = Ollama(base_url="http://192.168.86.2:11433", model="llama3", request_timeout=360.0, keep_alive=-1)

print(fmt(str(llm.complete("What is a good recipe for a cocktail?", keep_alive=-1))))


I'd be happy to help you with that!

Here's a simple yet elegant recipe for a classic cocktail:

**Bee's Knees**

Ingredients:

* 2 oz (60 ml) gin
* 1 oz (30 ml) honey syrup (equal parts honey and water, dissolved)
* 1/2 oz (15 ml) fresh lemon juice
* Dash of sparkling water

Instructions:

1. Fill a cocktail shaker with ice.
2. Add the gin, honey syrup, and lemon juice.
3. Shake vigorously for about 10-12 seconds to combine and chill the ingredients.
4. Strain the mixture into a chilled coupe or cocktail glass.
5. Top with a dash of sparkling water.

Garnish with a lemon twist or wheel, if desired.

**Why it's great:**

* The honey syrup adds a touch of sweetness without overpowering the other flavors.
* The lemon juice provides a nice acidity and brightness to balance out the drink.
* Gin is a versatile spirit that pairs well with the floral notes from the honey.

**Tips and Variations:**

* Adjust the amount of honey syrup to your taste, depending on how sweet you like your cocktail

Now let's play around with some agents.

In [24]:
from llama_index.core.tools import FunctionTool
from llama_index.core.agent import ReActAgent

def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the resulting integer."""
    return a * b


multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: int, b: int) -> int:
    """Add two integers and return the resulting integer."""
    return a + b


add_tool = FunctionTool.from_defaults(fn=add)

agent = ReActAgent.from_tools(
    [multiply_tool, add_tool],
    llm=llm,
    verbose=True,
    allow_parallel_tool_calls=False,
    llm_args={ "keep_alive": "-1m" },
)
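
A ReAct agent may skip a tool and do part of the arithmetic in its head, so it is worth knowing the expected answer before trusting its output. A minimal sketch that re-declares the two tool functions, so it runs on its own, and composes them directly:

```python
def add(a: int, b: int) -> int:
    """Same logic as the add tool above."""
    return a + b


def multiply(a: int, b: int) -> int:
    """Same logic as the multiply tool above."""
    return a * b


# (121 + 2) * 5, computed with the plain functions rather than through the agent.
expected = multiply(add(121, 2), 5)
print(expected)  # 615
```
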

In [25]:
response = agent.chat("What is (121 + 2) * 5?")
print(str(response))

Thought: The user wants to know the value of (121 + 2) * 5. I need to use a tool to help me answer the question.
Action: multiply
Action Input: {'a': 123, 'b': 5}
Observation: 615
Thought: The user observed that the result of (121 + 2) * 5 is indeed 615. I can answer without using any more tools. I'll use the user's language to answer.
Answer: 615
615
