
I have 2 Docker containers running the `ollama/ollama` image, and both of them serve the llama3 model.
Ollama1 is set up using the following docker command.

```
docker run -d --gpus=all `
    -v "d:/exposed_to_docker/ollama:/root/.ollama" `
    -p 11667:11434 `
    --name ollama1 ollama/ollama
```

Ollama3 is set up like this.

```
docker run -d --gpus=all `
    -p 11669:11434 `
    --name ollama3 ollama/ollama
```

This means that I actually have 3 different Ollama servers running at any given time:

- `http://localhost:11434`: Ollama running natively under Windows
- `http://localhost:11667`: Ollama1
- `http://localhost:11669`: Ollama3

I am running this on Windows 11 with an i7-12700KF, 32GB DDR5, RTX 3080 10G, driver 551.61 and CUDA 12.4.

I have set up the following test harness.


In [48]:
# !pip install ollama

In [38]:
import time
from ollama import Client

# Configuration
MODEL = 'llama3'
PROMPT = 'Llamas are members of my family.'
ENDPOINTS = ['http://localhost:11434', 'http://localhost:11667', 'http://localhost:11669']

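# Test function
def time_embedding_call(endpoint):
    """Time a single embeddings request against one Ollama endpoint."""
    client = Client(host=endpoint)

    # Wall-clock time around the request; with no model loaded beforehand,
    # this also covers whatever the server does before it can answer.
    start = time.time()
    resp = client.embeddings(model=MODEL, prompt=PROMPT)
    end = time.time()

    return {'endpoint': endpoint, 'start': start, 'end': end, 'response': resp, 'duration': end - start}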

In [33]:
# Before running this cell, check with `ollama ps` that no model is loaded
# on the local Ollama instance or in either Docker container.
RESULTS = []
for e in ENDPOINTS:
    RESULTS.append(time_embedding_call(e))
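
The "no models loaded" precondition for the loop above can also be checked from the notebook instead of running `ollama ps` in each environment. This is a minimal sketch, assuming the servers expose the `GET /api/ps` REST endpoint (the HTTP counterpart of `ollama ps`) and that its JSON body has a `models` list whose entries carry a `name` field. The helper names `loaded_model_names` and `fetch_loaded_models` are made up for illustration and are not part of the ollama client.

```python
import json
from urllib.request import urlopen


def loaded_model_names(ps_payload):
    """Pull the model names out of a decoded /api/ps response body.

    Assumes the payload looks like {"models": [{"name": "..."}, ...]}.
    """
    return [m.get('name', '') for m in (ps_payload.get('models') or [])]


def fetch_loaded_models(endpoint, timeout=5):
    """Ask one Ollama server which models it currently holds in memory."""
    with urlopen(f'{endpoint}/api/ps', timeout=timeout) as resp:
        return loaded_model_names(json.load(resp))


# Hypothetical usage with the ENDPOINTS list from the harness:
# for e in ENDPOINTS:
#     print(e, fetch_loaded_models(e))   # an empty list means nothing is loaded
```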

In [36]:
for r in RESULTS:
    print(f'Host {r["endpoint"]} took {r["duration"]:.2f}')

Host http://localhost:11434 took 8.80
Host http://localhost:11667 took 189.67
Host http://localhost:11669 took 73.31
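
Each endpoint above was called only once, starting with no model loaded, so these durations probably include model load time on top of the embedding computation itself. That is an assumption, not something the numbers prove. A rough way to test it is to repeat the call and compare the first (cold) duration with the later (warm) ones. The sketch below uses two hypothetical helpers, `time_repeated` and `split_cold_warm`, which take any callable so they can wrap the `client.embeddings(...)` call from the harness.

```python
import time


def time_repeated(call, repeats=3):
    """Run call() `repeats` times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


def split_cold_warm(durations):
    """Return (first-call duration, mean of the remaining calls or None)."""
    cold, warm = durations[0], durations[1:]
    return cold, (sum(warm) / len(warm) if warm else None)


# Hypothetical usage with the names from the harness:
# for e in ENDPOINTS:
#     client = Client(host=e)
#     cold, warm = split_cold_warm(
#         time_repeated(lambda: client.embeddings(model=MODEL, prompt=PROMPT)))
#     print(f'Host {e} cold {cold:.2f} warm {warm:.2f}')
```

If the warm numbers are close across all three hosts, the gap is in loading the model rather than in computing the embedding.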
