<a href="https://colab.research.google.com/github/marcelmarais/colab-llm-inference-server/blob/main/Orca3bServer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

In this notebook you will learn how to:


1. Serve a streaming LLM (Orca-mini 3b) locally in a notebook using Ollama.
2. Expose a port using `ngrok` so that you can access it from anywhere!

**Note**: you will need a GPU for this to work. To enable one, go to `Runtime` > `Change runtime type` and select `T4 GPU`.

In [None]:
MODEL_NAME = 'orca-mini:3b-q4_1'
OLLAMA_PORT = '11434'

## Install Ollama

[Ollama](https://github.com/jmorganca/ollama) allows you to run LLMs locally!

It supports some [popular models](https://ollama.ai/library), but the larger models don't work on the Colab free tier.

In [None]:
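# lshw and pciutils (lspci) are used by the Ollama install script to detect the GPU
# -y answers the apt confirmation prompt so the cell does not block
!sudo apt install -y lshw
!sudo apt install -y pciutils
!curl https://ollama.ai/install.sh | sh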

############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Creating ollama systemd service...
>>> NVIDIA GPU installed.
>>> Install complete. Run "ollama" from the command line.


In [None]:
# Start the Ollama server in the background; its logs go to nohup.out
!nohup ollama serve &

nohup: appending output to 'nohup.out'
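Because `ollama serve` is started in the background, the next cell can run before the server is actually listening. A minimal sketch of waiting for the port to accept connections is shown here; `wait_for_port` is a hypothetical helper, not part of Ollama, and it only checks that the TCP port is open, not that a model is loaded.

```python
import socket
import time


def wait_for_port(host="localhost", port=11434, timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet
            time.sleep(interval)
    return False
```

In this notebook it could be called as `wait_for_port(port=int(OLLAMA_PORT))`, since `OLLAMA_PORT` is stored as a string.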


In [None]:
# Download the model on first use and load it, in the background
!nohup ollama run {MODEL_NAME} &

nohup: appending output to 'nohup.out'


In [None]:
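import requests
import json

def generate_text(prompt, words_per_line=12):
    """Stream a completion from the local Ollama server, printing it as it arrives."""
    url = f"http://localhost:{OLLAMA_PORT}/api/generate"
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt
    }

    word_count = 0

    with requests.post(url, json=payload, stream=True) as response:
        response.raise_for_status()
        # The server streams one JSON object per line
        for line in response.iter_lines():
            if line:
                response_text = json.loads(line.decode('utf-8')).get('response', '')
                # Chunks carry their own spacing, so print them unchanged;
                # a chunk starting with a space marks the start of a new word
                if response_text.startswith(' '):
                    word_count += 1
                    if word_count % words_per_line == 0:
                        print()
                        response_text = response_text.lstrip(' ')
                print(response_text, end='', flush=True)
    print()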
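Sometimes the whole answer is wanted as one string rather than printed chunk by chunk. A rough sketch of collecting the streamed lines is shown here; `collect_response` is a hypothetical helper, and the `done` and `error` fields are assumptions about the shape of the streamed JSON beyond the `response` field used above. The sample chunks are hand-written for illustration, not real model output.

```python
import json


def collect_response(lines):
    """Join the `response` fields of newline-delimited JSON chunks into one string."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        if "error" in chunk:
            # Assumed error shape: {"error": "..."}
            raise RuntimeError(chunk["error"])
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # assumed end-of-stream marker
    return "".join(parts)


# Hand-written chunks shaped like the streaming output
sample = [
    b'{"response": "Hello", "done": false}',
    b'{"response": " world", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_response(sample))  # prints "Hello world"
```

With a live server, the same function could be fed `response.iter_lines()` from a streaming `requests.post` call.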


In [None]:
# Example usage
generate_text("Should you use Colab notebooks as AI infrastructure")

## Making the server available outside of the notebook

To do this we'll use **ngrok**, which "enables developers to expose a local development server to the Internet with minimal effort."


In [None]:
# @title Install ngrok
!wget -q -c -nc https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip -qq -n ngrok-stable-linux-amd64.zip

In [None]:
get_ipython().system_raw(f'./ngrok http {OLLAMA_PORT} &')
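Once the tunnel is up, ngrok's local agent API on port 4040 lists the active tunnels as JSON, and the first entry is not guaranteed to be the HTTPS one. A small sketch of choosing the HTTPS URL from that payload is shown here; `pick_public_url` is a hypothetical helper, the payload shape is assumed to be a `tunnels` list whose items carry a `public_url`, and the hostnames in the sample are placeholders.

```python
def pick_public_url(tunnels_payload, prefer="https"):
    """Return a tunnel's public URL, preferring the `prefer` scheme, or None if there is none."""
    tunnels = tunnels_payload.get("tunnels", [])
    for tunnel in tunnels:
        if tunnel.get("public_url", "").startswith(prefer + "://"):
            return tunnel["public_url"]
    # Fall back to the first tunnel, whatever its scheme
    return tunnels[0].get("public_url") if tunnels else None


# Placeholder payload with one HTTP and one HTTPS tunnel
sample_payload = {
    "tunnels": [
        {"public_url": "http://example.ngrok.io"},
        {"public_url": "https://example.ngrok.io"},
    ]
}
print(pick_public_url(sample_payload))  # prints "https://example.ngrok.io"
```

In the notebook, the payload could come from `requests.get("http://localhost:4040/api/tunnels").json()`.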

In [None]:
# @title Print the public URL for your inference server
!curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

https://4d90-34-125-145-23.ngrok.io
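With the public URL in hand, another machine can call the same `/api/generate` endpoint that `generate_text` uses locally. The sketch below uses only the standard library (swapped in for `requests` so the client has no extra dependencies); `build_generate_request` and `stream_tokens` are hypothetical helper names, and the URL in the usage comment is a placeholder for whatever your own tunnel prints.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt):
    """Build a POST request for the /api/generate endpoint behind `base_url`."""
    url = base_url.rstrip("/") + "/api/generate"
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_tokens(request):
    """Send the request and yield each streamed `response` chunk as it arrives."""
    with urllib.request.urlopen(request) as response:
        for line in response:  # one JSON object per line
            if line.strip():
                yield json.loads(line).get("response", "")


# Usage (needs a live tunnel; replace the placeholder URL with your own):
# request = build_generate_request("https://<your-tunnel>.ngrok.io",
#                                  "orca-mini:3b-q4_1", "Why is the sky blue?")
# for token in stream_tokens(request):
#     print(token, end="", flush=True)
```

Building the request separately from sending it keeps the URL and payload easy to inspect before anything goes over the network.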
