# Set up a local model and expose a temporary Colab endpoint (ngrok)

This notebook contains copy/paste cells you can run in Google Colab (or locally) to:

- install dependencies,
- download the safetensors file from Hugging Face,
- convert it to a ggml/gguf binary (recommended q4_0 quantization),
- start a local text-generation server (text-generation-webui or llama.cpp),
- expose the server with ngrok and print the public URL you can use as `LOCAL_TGI_URL`.

**Important:** conversion and hosting are resource intensive. Colab sessions are ephemeral, so this setup is for development and testing only, not production. Note which cells are long-running before you start them.


## 1) Prerequisites / notes

- This notebook is targeted at Colab or macOS (Apple Silicon). On Colab you *do not* need to create a venv; use `pip` in the notebook runtime.
- If running locally on macOS, prefer creating a venv and installing the same packages.
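- The model in the repository is `unsloth/meta-llama-3.1-8b-unsloth-bnb-4bit`, listed here as ~671 MB of safetensors. Check the size on the model page before downloading: 8 billion parameters at 4 bits per weight come to roughly 4 GB, so the full checkpoint is likely larger than that figure. Quantizing to ggml/q4 is what makes an 8B model practical on an M1.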
- Long-running steps: conversion (minutes), building `llama.cpp` (minutes), and starting the server.
- Security: if you expose the endpoint with ngrok, protect it with an API key or use the web UI access controls.
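
As a rough check on whether a model fits in memory, the weight size can be estimated from the parameter count and the bits stored per weight. The sketch below is illustrative only: `estimate_weight_gib` is a hypothetical helper, the 4.5 bits-per-weight figure assumes the q4_0 layout (32 weights per block plus a 16-bit scale), and it ignores tokenizer data, the KV cache, and runtime overhead.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB (weights only)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Example: an 8-billion-parameter model at roughly 4.5 bits per weight
print(f"{estimate_weight_gib(8e9, 4.5):.1f} GiB")
```

Compare the result against the free RAM on the Colab runtime or your Mac before starting the conversion.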


In [None]:
# Install Python packages (Colab friendly). Long-running network install.
!pip install -q huggingface_hub safetensors transformers numpy requests pyngrok
# text-generation-webui has heavy extras; we recommend installing it later when needed (it may take time and more deps).


## 2) Download the safetensors from Hugging Face

Enter your HF token when prompted (or set the `HF_TOKEN` environment variable). The files are downloaded into `/content/models/ib-physics` on Colab; when running locally, change `out_dir` to something like `~/models/ib-physics`.

**Note:** If the model repo is public you can omit the token. If private, provide a token with read access.


In [None]:
from huggingface_hub import hf_hub_download
import os

model_id = 'unsloth/meta-llama-3.1-8b-unsloth-bnb-4bit'
# You can also set HF_TOKEN in the environment before running
hf_token = os.environ.get('HF_TOKEN') or input('Hugging Face token (or press Enter if public model): ').strip() or None
out_dir = '/content/models/ib-physics'
os.makedirs(out_dir, exist_ok=True)
# Common weight filename candidates. If you know the exact filename, put it first.
candidate_files = ['model.safetensors','pytorch_model.safetensors','pytorch_model.bin','consolidated.00.pth']
found = None
for fname in candidate_files:
    try:
        # local_dir places the file under out_dir instead of the default HF cache
        path = hf_hub_download(repo_id=model_id, filename=fname, repo_type='model', token=hf_token, local_dir=out_dir)
        print('Downloaded:', path)
        found = path
        break
    except Exception as e:
        # Report why this candidate failed (missing file, auth error), then try the next one
        print(f'Skipping {fname}: {type(e).__name__}')

if not found:
    print('\nCould not auto-detect filename. Visit the model page and copy the safetensors filename, then run hf_hub_download with that filename.')
else:
    print('\nModel downloaded to:', found)
    print('You can now use the conversion steps below.')
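
If none of the candidate filenames match, listing the repository contents is usually quicker than guessing. The sketch below assumes `huggingface_hub.list_repo_files` is available in your installed version; `pick_weight_files` and `list_weight_files` are hypothetical helper names, and the example listing is made up.

```python
def pick_weight_files(filenames):
    """Return the safetensors files from a repo listing, sorted so shards stay in order."""
    return sorted(name for name in filenames if name.endswith(".safetensors"))


def list_weight_files(repo_id, token=None):
    """Query the Hub for the repo's file list and keep only safetensors files."""
    # Imported here so pick_weight_files works without huggingface_hub installed
    from huggingface_hub import list_repo_files
    return pick_weight_files(list_repo_files(repo_id, token=token))


# Offline example with a made-up listing:
example = ["README.md", "config.json", "model-00002-of-00002.safetensors",
           "model-00001-of-00002.safetensors", "tokenizer.json"]
print(pick_weight_files(example))
```

In the notebook you would call `list_weight_files(model_id, hf_token)` and pass each returned name to `hf_hub_download`.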


## 3) Build llama.cpp (only if you plan to use its converter or the CLI test)

This builds native binaries on Colab (or your Mac). It is a long-running step.


In [None]:
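# clone and build llama.cpp (if not already present)
!git clone https://github.com/ggerganov/llama.cpp /content/llama.cpp || true
%cd /content/llama.cpp
# Recent llama.cpp versions build with CMake (the Makefile build was retired); binaries land in build/bin
!cmake -B build && cmake --build build --config Release -j 2
%cd /content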
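
A shell cell does not stop the notebook when the build fails, so it can help to confirm the binaries exist before moving on. The sketch below searches the checkout for the quantize tool; `find_binary` is a hypothetical helper, and the candidate names (`llama-quantize`, and `quantize` in older checkouts) are assumptions that may not match your llama.cpp version.

```python
from pathlib import Path


def find_binary(root, names=("llama-quantize", "quantize")):
    """Return the first file under `root` whose name matches one of `names`, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    for name in names:
        for candidate in sorted(root.rglob(name)):
            if candidate.is_file():
                return candidate
    return None


# Prints the path if the build succeeded, or None if nothing was found
print(find_binary("/content/llama.cpp"))
```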


## 4) Convert safetensors → ggml/gguf (preferred: use text-generation-webui converter, fallback: llama.cpp converter)

**WARNING:** This is long-running and memory intensive. Prefer `q4_0` quantization for M1/8GB.


In [None]:
# Example conversion using text-generation-webui converter (if available)
# SAFE_PATH defaults to the `found` path from the download cell. If that cell was not run, set SAFE_PATH manually.
try:
    SAFE_PATH
except NameError:
    SAFE_PATH = globals().get('found')

if SAFE_PATH is None:
    print('If you have a safetensors file path, set SAFE_PATH = \'/content/models/ib-physics/model.safetensors\' and re-run this cell.')
else:
    print('SAFE_PATH:', SAFE_PATH)

# Suggested commands (do not auto-run here unless you understand RAM usage):
print('\nPreferred (text-generation-webui):')
print('  git clone https://github.com/oobabooga/text-generation-webui ~/text-generation-webui')
print('  cd ~/text-generation-webui')
print('  pip install -r requirements.txt')
print('  python3 <converter-script> --safetensors', SAFE_PATH, '--out-ggml /content/models/ib-physics/ggml-model-q4_0.bin --quantize q4_0')

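print('\nFallback (llama.cpp converter; it takes the model directory, not a single file):')
print('  cd /content/llama.cpp')
print('  python3 convert_hf_to_gguf.py /content/models/ib-physics --outfile /content/models/ib-physics/model-f16.gguf --outtype f16')
# llama-quantize is in build/bin for CMake builds, or the repo root for older Makefile builds
print('  ./build/bin/llama-quantize /content/models/ib-physics/model-f16.gguf /content/models/ib-physics/ggml-model-q4_0.bin Q4_0')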
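
For the llama.cpp route, conversion is usually two steps: export the Hugging Face checkpoint to an f16 GGUF file, then quantize that file. The sketch below only builds the command lines so you can inspect them before spending RAM on a run. It assumes a recent llama.cpp checkout in which the script is `convert_hf_to_gguf.py` and the tool is `build/bin/llama-quantize` (older checkouts used different names), and that the model directory also holds `config.json` and the tokenizer files. `build_conversion_commands` and the output filenames are hypothetical, and a checkpoint already quantized with bitsandbytes may need the original unquantized weights instead.

```python
from pathlib import Path


def build_conversion_commands(model_dir, out_dir, llama_cpp_dir="/content/llama.cpp",
                              quantize_bin="build/bin/llama-quantize", quant_type="Q4_0"):
    """Return two argv lists: HF checkpoint -> f16 GGUF, then f16 GGUF -> quantized GGUF."""
    llama_cpp_dir = Path(llama_cpp_dir)
    out_dir = Path(out_dir)
    f16_path = out_dir / "model-f16.gguf"
    quant_path = out_dir / f"model-{quant_type.lower()}.gguf"
    convert = ["python3", str(llama_cpp_dir / "convert_hf_to_gguf.py"), str(model_dir),
               "--outfile", str(f16_path), "--outtype", "f16"]
    quantize = [str(llama_cpp_dir / quantize_bin), str(f16_path), str(quant_path), quant_type]
    return [convert, quantize]


for cmd in build_conversion_commands("/content/models/ib-physics", "/content/models/ib-physics"):
    print(" ".join(cmd))
    # To execute a step: import subprocess; subprocess.run(cmd, check=True)
```

Running the steps with `check=True` makes a failed conversion raise immediately instead of leaving a partial output file unnoticed.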


## 5) Start a local text-generation server (text-generation-webui recommended)

After conversion, use the web UI to expose an API. This cell shows the command to run. On Colab you should run it in the background (nohup) and then use ngrok to expose it publicly.


In [None]:
# Example: start text-generation-webui (adjust path to your model binary)
# NOTE: This will block the notebook if run normally. Use nohup or run in a separate terminal.

MODEL_BIN = '/content/models/ib-physics/ggml-model-q4_0.bin'
print('If you have text-generation-webui cloned at /content/text-generation-webui:')
print('  cd /content/text-generation-webui')
print('  python server.py --model', MODEL_BIN, '--listen --api')

print('\nOn Colab you can run the server in a background process, then use ngrok to expose it (see below).')
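
When the server runs in the background, the notebook needs a way to tell when it has finished loading the model. One simple approach is to poll the port until it accepts a TCP connection. The sketch below is a minimal version of that idea: `wait_for_port` is a hypothetical helper, and the commented launch command, log path, and port 5000 are assumptions to adapt to your setup.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example (hypothetical launch; adjust the path and flags to your setup):
# import subprocess
# log = open("/content/server.log", "w")
# proc = subprocess.Popen(["python", "server.py", "--listen", "--api"],
#                         cwd="/content/text-generation-webui",
#                         stdout=log, stderr=subprocess.STDOUT)
# print("server ready:", wait_for_port("127.0.0.1", 5000))
```

An open port only shows that the process is listening, so the test request later in this notebook is still worth running.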


## 6) Expose the server publicly with ngrok (optional)

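This cell installs `pyngrok` and opens an ngrok tunnel to the server port. Port 5000 is the default API port for text-generation-webui when started with `--api`; the web UI itself defaults to 7860. Current ngrok versions typically require an auth token from an ngrok account before any tunnel will open, so set it before connecting.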


In [None]:
# Install pyngrok if not installed
!pip install -q pyngrok
from pyngrok import ngrok

# If you have an ngrok authtoken, set it here (or run: ngrok authtoken YOUR_TOKEN in shell)
# ngrok.set_auth_token('YOUR_NGROK_AUTHTOKEN')

# Example: open a tunnel to port 5000
tunnel = ngrok.connect(5000)
public_url = tunnel.public_url  # connect() returns an NgrokTunnel object; the URL is an attribute
print('Public URL:', public_url)


## 7) Test the endpoint (example POST to /api/v1/generate). Adjust path according to server API.

This cell demonstrates how to call the public/ngrok URL or the local server and print the generated text.


In [None]:
import requests

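# Replace with your actual URL printed by the ngrok cell or local server:
TEST_URL = input('Enter public URL from ngrok (or http://localhost:5000): ').strip().rstrip('/')

# The path and payload must match the server you started. The `inputs`/`parameters` body below is the
# Hugging Face TGI shape (normally served at /generate); text-generation-webui's OpenAI-compatible
# API expects `prompt`/`max_tokens` at /v1/completions instead.
API_ENDPOINT = TEST_URL + '/api/v1/generate'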
print('Testing endpoint:', API_ENDPOINT)

payload = {
    'inputs': 'Generate an IB Physics Paper 1 style multiple-choice question about kinematics',
    'parameters': { 'max_new_tokens': 200, 'temperature': 0.7 }
}

try:
    res = requests.post(API_ENDPOINT, json=payload, timeout=60)
    print('status', res.status_code)
    print(res.text[:1000])
except Exception as e:
    print('Error calling endpoint:', e)
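
Because the request path and body differ between servers, it can be convenient to keep both shapes in one place. The sketch below covers two common cases under stated assumptions: a TGI-style `/generate` endpoint that returns `generated_text`, and an OpenAI-compatible `/v1/completions` endpoint that returns `choices[0].text`. `build_request`, `extract_text`, and the `flavor` names are hypothetical; check your server's API documentation for the exact fields.

```python
def build_request(base_url, prompt, flavor="tgi", max_new_tokens=200, temperature=0.7):
    """Return (url, json_payload) for the chosen API flavor."""
    base_url = base_url.rstrip("/")
    if flavor == "tgi":
        return (base_url + "/generate",
                {"inputs": prompt,
                 "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature}})
    if flavor == "openai":
        return (base_url + "/v1/completions",
                {"prompt": prompt, "max_tokens": max_new_tokens, "temperature": temperature})
    raise ValueError(f"unknown flavor: {flavor!r}")


def extract_text(flavor, body):
    """Pull the generated text out of a decoded JSON response."""
    if flavor == "tgi":
        return body["generated_text"]
    if flavor == "openai":
        return body["choices"][0]["text"]
    raise ValueError(f"unknown flavor: {flavor!r}")


url, payload = build_request("http://localhost:5000/", "Hello", flavor="openai")
print(url, payload)
```

With `requests`, the call would then be `requests.post(url, json=payload, timeout=60)` followed by `extract_text(flavor, res.json())`.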


## 8) Final: environment variables for this repo

After you have a public URL (or local server), set the following environment variable in the shell where you start the Next.js app (or in your process manager):

```bash
export LOCAL_TGI_URL="https://your-public-url.ngrok.io"
npm run dev
```

The patched Llama client will prefer that URL and fall back to HF if the URL is not set or unreachable.
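
The prefer-local-then-fall-back rule can be illustrated in a few lines. This is a Python sketch of the decision logic only, not the repository's actual client code: `choose_backend` and the `is_reachable` callback are hypothetical, and a real client would supply its own reachability check (for example a short-timeout request).

```python
import os


def choose_backend(env=None, is_reachable=lambda url: True):
    """Return ("local", url) when LOCAL_TGI_URL is set and reachable, else ("huggingface", None)."""
    env = os.environ if env is None else env
    url = (env.get("LOCAL_TGI_URL") or "").strip().rstrip("/")
    if url and is_reachable(url):
        return ("local", url)
    return ("huggingface", None)


print(choose_backend({"LOCAL_TGI_URL": "https://your-public-url.ngrok.io"}))
print(choose_backend({}))
```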

---

If conversion fails or you run into OOM, I recommend using the Hugging Face Inference API (set `HUGGINGFACE_API_KEY` and `LLAMA_MODEL_ID` environment vars) as a robust fallback.

Troubleshooting:
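- If you see OOM during conversion: convert on a machine with more RAM. Switching to `q4_k` will not help, since it is about the same size as `q4_0`; a lower-bit type (for example `Q3_K_M`) gives a smaller file at some cost in quality.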
- If web UI fails to start: check the server logs and ensure the model binary path is correct.
- If ngrok URL returns 502: make sure the server is running and listening on the port you exposed.
