Warning:
* You can skip this notebook if you want to run code on your own GPU or server.

Requirement:
* [Open this notebook with Google Colab](https://colab.research.google.com/)
* Select a GPU environment by clicking "Connect" in the corner of the notebook, below the "Comment" button.

# GPU acceleration

If this cell fails, you may want to switch to a GPU-accelerated environment.

In [1]:
!nvidia-smi

Wed Jun 12 13:56:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
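
The same check can be done from Python, which is handy if later cells should adapt to the hardware. This is only a minimal sketch: `gpu_available` is a hypothetical helper name, and it assumes that `nvidia-smi -L` lists one line per GPU when a driver is present.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is installed and reports at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tooling on the PATH: most likely a CPU-only runtime
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints lines such as "GPU 0: Tesla T4 (UUID: ...)"
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU detected")
else:
    print("No GPU detected: switch to a GPU-accelerated environment")
```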
                                                                    

# Install Ollama and mount the models folder on your Google Drive



In [2]:
# Download and install the Ollama client
!curl https://ollama.ai/install.sh | sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0>>> Downloading ollama...
100 10941    0 10941    0     0  34449      0 --:--:-- --:--:-- --:--:-- 34514
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
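
The notebook does not show the Drive mount itself, so here is one possible sketch of it. It assumes that the `OLLAMA_MODELS` environment variable (listed in the server configuration log further down) controls where models are stored, and that it is set before `ollama serve` is started from this notebook. The helper `use_models_dir` and the folder `ollama/models` are hypothetical names chosen for illustration.

```python
import os
from pathlib import Path


def use_models_dir(path: str) -> Path:
    """Create the folder if needed and point Ollama's model storage at it."""
    models_dir = Path(path).expanduser()
    models_dir.mkdir(parents=True, exist_ok=True)
    # Processes started later from this notebook inherit this variable
    os.environ["OLLAMA_MODELS"] = str(models_dir)
    return models_dir


try:
    # google.colab is only importable inside a Colab runtime
    from google.colab import drive

    drive.mount("/content/drive")
    # Hypothetical folder name inside "My Drive"
    use_models_dir("/content/drive/MyDrive/ollama/models")
except ImportError:
    print("Not running in Colab: skipping the Google Drive mount")
```

Reading large model files from a mounted Drive may be slower than from local disk, so this trades load time for not having to download the model again.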


# Download a model

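Make sure you have enough space in your Google Drive to download it.

Note that `ollama pull` needs a running Ollama server. If the cell fails with the connection error shown below, run it again once `ollama serve` has been started later in this notebook.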

In [3]:
!ollama pull llama3:8b

Error: could not connect to ollama app, is it running?


# Run Ollama in a separate thread and pipe it to ngrok

Install ngrok

In [4]:
# ngrok sets up an HTTP tunnel to the Ollama service that will run within this notebook.
%pip install -q pyngrok


Port 11434, which Ollama uses by default, is blocked by Google Colab; only the default HTTP(S) and SSH ports seem to be open.

We can get around Google's firewall by running Ollama on a standard port, for instance port 80.

In [5]:
import os

PORT = 80
os.environ['OLLAMA_HOST'] = f"0.0.0.0:{PORT}"
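
Before starting the server it can be useful to confirm that nothing else is already listening on the chosen port. The following is a small sketch using only the standard library; `port_in_use` is a hypothetical helper, and it only detects listeners reachable on the given host (loopback by default).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


# 80 is the port chosen above; adapt it if you picked another one
print(f"port 80 already in use: {port_in_use(80)}")
```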

Define a runner that will pipe Ollama to ngrok

In [6]:
import asyncio

async def run_process(cmd):
    print('>>> starting', *cmd)
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    # Forward each line of a stream to the notebook output as it arrives
    async def pipe(lines):
        async for line in lines:
            print(line.decode().strip())

    # Read stdout and stderr concurrently until both streams are closed
    await asyncio.gather(pipe(process.stdout), pipe(process.stderr))

We want to run Ollama in a separate thread so it won't block execution.

In [7]:
import asyncio
import threading

async def start_ollama_serve():
    # Ollama serve starts a new Ollama service that will live within the current execution environment.
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()


Start Ollama!

In [8]:
# Create a new event loop that will run in a new thread
new_loop = asyncio.new_event_loop()
# Start ollama serve in a separate thread so the cell won't block execution
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start()

>>> starting
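
Because `ollama serve` starts in a background thread, commands such as `ollama pull` can fail with "could not connect to ollama app" if they run before the server is listening. One way to avoid this is to poll the server until it answers. The sketch below assumes that the server responds with HTTP 200 on its root URL once it is up; `wait_for_server` is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll url until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not reachable yet: wait and try again
            pass
        time.sleep(interval)
    return False


# Example usage, with the port chosen earlier in this notebook:
# if wait_for_server("http://127.0.0.1:80"):
#     print("Ollama is ready, `ollama pull` can now be run")
```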

# Open a ngrok tunnel to our Ollama service

In [9]:
from pyngrok import ngrok

# Get your ngrok token from your ngrok account:
# https://dashboard.ngrok.com/get-started/your-authtoken
token = "YOUR_NGROK_AUTHTOKEN"  # paste your own token here and never publish it
ngrok.set_auth_token(token)

# Start an HTTP tunnel on the specified port
public_url = ngrok.connect(PORT).public_url

# Print the public URL of the tunnel
# Your Ollama service is publicly available at this URL!
print(" * ngrok tunnel to Ollama \"{}\" -> \"http://127.0.0.1:{}/\"".format(public_url, PORT))


 ollama serve
2024/06/12 13:56:43 routes.go:1011: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-12T13:56:43.447Z level=INFO source=images.go:740 msg="total blobs: 5"
time=2024-06-12T13:56:43.448Z level=INFO source=images.go:747 msg="total unused blobs removed: 0"
time=2024-06-12T13:56:43.448Z level=INFO source=routes.go:1057 msg="Listening on [::]:80 (version 0.1.43)"
time=2024-06-12T13:56:43.449Z level=INFO source=payload.go:30 msg="extracting embedded files

# Load and query an LLM from any remote machine (for example, your laptop)

Copy and run the following command on the remote machine to access our Ollama service.
Adapt it to your needs ;)

In [10]:
print("OLLAMA_HOST={} ollama run llama3:8b".format(public_url))

OLLAMA_HOST=https://d221-34-34-68-238.ngrok-free.app ollama run llama3:8b
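
If the remote machine does not have the Ollama client installed, the service can also be queried over HTTP. The sketch below uses only the Python standard library and Ollama's `/api/generate` endpoint with streaming disabled. The helper names are hypothetical, and the URL in the usage comment is a placeholder to replace with the `public_url` printed by the tunnel cell.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to the server and return the generated text."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the server returns a single JSON object
    return body["response"]


# Example usage (placeholder URL, replace it with your own tunnel URL):
# print(generate("https://your-tunnel.ngrok-free.app", "llama3:8b", "Why is the sky blue?"))
```

The timeout is deliberately generous, since the first request has to load the model into GPU memory.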


**Current limitations**
- Google Colab resources are quite limited: with the free plan, any model larger than 20 GB will probably fail to load.
- Your Ollama server is tied to the execution environment. You will have to download and run the model again after every environment reset.
- In my (few) experiments, I had the feeling that Google Colab was disconnecting more frequently than usual. Maybe there is some kind of traffic limitation to prevent using a Colab notebook as a web server?
