# **Hosting Llama 2 with Free GPU via Google Colab**

**Before getting started: if you are running on Google Colab, check that the runtime type is set to T4 GPU.**

In [1]:
# This downloads and sets up the Ngrok executable in the Google Colab instance
!curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null && echo "deb https://ngrok-agent.s3.amazonaws.com buster main" | sudo tee /etc/apt/sources.list.d/ngrok.list && sudo apt update && sudo apt install ngrok

The system cannot find the path specified.


## Install Dependencies
- Requirements for running FastAPI Server
- Requirements for creating a public model serving URL via Ngrok
- Requirements for running Llama 2 (including quantization)


In [2]:
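# Each `!` line runs in its own shell, so the build variables must be set on the same line as pip.
# For GPU offload, llama-cpp-python needs a CUDA build; recent releases document
# CMAKE_ARGS="-DGGML_CUDA=on" for that (check the README for your installed version).
!CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir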

Collecting llama-cpp-python

  error: subprocess-exited-with-error
  
  Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  exit code: 1
  
  [20 lines of output]
  *** scikit-build-core 0.10.7 using CMake 3.30.3 (wheel)
  *** Configuring CMake...
  loading initial cache file C:\Users\nadee\AppData\Local\Temp\tmpj9uejq4w\build\CMakeInit.txt
  -- Building for: NMake Makefiles
  CMake Error at CMakeLists.txt:3 (project):
    Running
  
     'nmake' '-?'
  
    failed with:
  
     no such file or directory
  
  
  CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
  CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
  -- Configuring incomplete, errors occurred!
  
  *** CMake configuration failed
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
ERROR: ERROR: Failed to build installable


  Downloading llama_cpp_python-0.3.1.tar.gz (63.9 MB)

In [11]:
# If pip complains about the dependency resolver, it is safe to ignore
!pip install "fastapi[all]" uvicorn python-multipart transformers pydantic tensorflow

Collecting uvicorn
  Downloading uvicorn-0.31.0-py3-none-any.whl.metadata (6.6 kB)
Collecting python-multipart
  Downloading python_multipart-0.0.12-py3-none-any.whl.metadata (1.9 kB)
Collecting fastapi[all]
  Downloading fastapi-0.115.0-py3-none-any.whl.metadata (27 kB)
Collecting starlette<0.39.0,>=0.37.2 (from fastapi[all])
  Downloading starlette-0.38.6-py3-none-any.whl.metadata (6.0 kB)
Collecting fastapi-cli>=0.0.5 (from fastapi-cli[standard]>=0.0.5; extra == "all"->fastapi[all])
  Downloading fastapi_cli-0.0.5-py3-none-any.whl.metadata (7.0 kB)
Collecting httpx>=0.23.0 (from fastapi[all])
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting ujson!=4.0.2,!=4.1.0,!=4.2.0,!=4.3.0,!=5.0.0,!=5.1.0,>=4.0.1 (from fastapi[all])
  Downloading ujson-5.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.3 kB)
Collecting orjson>=3.2.1 (from fastapi[all])
  Downloading orjson-3.10.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadat

Ngrok is used to make the FastAPI server accessible via a public URL.

You need a free Ngrok account and its auth token to use Ngrok. The free plan allows only 1 local tunnel, and the auth token is how Ngrok tracks that usage limit.

In [4]:
# Sign up at https://dashboard.ngrok.com/signup and paste your own token below.
# Never commit or share a real auth token; treat it like a password.
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


## Create FastAPI App
This provides an API to the Llama 2 model. The model version can be changed in the code below as desired.

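The code below loads the 7 billion parameter chat model, which is fine-tuned for instruction following, as a 5-bit quantized GGUF file.

The 13 billion parameter version can be substituted by pointing the two model constants at a 13B GGUF file. Despite the compression, a quantized 13B model is still more powerful than the 7B variant.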
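To make the effect of the compression concrete, here is a rough back-of-the-envelope estimate of how much memory the weights need. It is only a sketch: the parameter counts are nominal, the 5.5 bits per weight figure assumes the Q5_0 block layout (32 weights stored in 22 bytes), and the estimate ignores the KV cache and runtime overhead. The helper name `estimate_weights_gb` is made up for this illustration.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts; real checkpoints differ slightly.
LLAMA2_7B = 7e9
LLAMA2_13B = 13e9

# Assumption: Q5_0 packs 32 weights into 22 bytes (5-bit values plus an fp16 scale),
# which works out to 5.5 bits per weight.
Q5_0_BITS = 22 * 8 / 32
FP16_BITS = 16

for name, n_params in [("7B", LLAMA2_7B), ("13B", LLAMA2_13B)]:
    fp16 = estimate_weights_gb(n_params, FP16_BITS)
    q5 = estimate_weights_gb(n_params, Q5_0_BITS)
    print(f"Llama 2 {name}: fp16 ~{fp16:.1f} GB, Q5_0 ~{q5:.1f} GB")
```

Under these assumptions the unquantized 13B weights (about 26 GB) would not fit in the 16 GB of a T4, while the Q5_0 version (about 8.9 GB) leaves room for the context cache.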

In [12]:
%%writefile app.py
from typing import Any

from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import tensorflow as tf


# Quantized GGUF model so that Llama 2 fits on a T4 GPU
GENERATIVE_AI_MODEL_REPO = "TheBloke/Llama-2-7B-Chat-GGUF"
GENERATIVE_AI_MODEL_FILE = "llama-2-7b-chat.Q5_0.gguf"

model_path = hf_hub_download(
    repo_id=GENERATIVE_AI_MODEL_REPO,
    filename=GENERATIVE_AI_MODEL_FILE
)

llama2_model = Llama(
    model_path=model_path,
    n_gpu_layers=64,
    n_ctx=2000
)

# Test an inference
print(llama2_model(prompt="Hello ", max_tokens=1))


app = FastAPI()


# This defines the data json format expected for the endpoint, change as needed
class TextInput(BaseModel):
    inputs: str
    parameters: dict[str, Any] | None = None  # optional; omit to use the model defaults


@app.get("/")
def status_gpu_check() -> dict[str, str]:
    # tf.test.is_gpu_available() is deprecated; list the physical GPU devices instead
    gpu_msg = "Available" if tf.config.list_physical_devices("GPU") else "Unavailable"
    return {
        "status": "I am ALIVE!",
        "gpu": gpu_msg
    }


@app.post("/generate/")
async def generate_text(data: TextInput) -> dict[str, str]:
    try:
        params = data.parameters or {}
        response = llama2_model(prompt=data.inputs, **params)
        model_out = response['choices'][0]['text']
        return {"generated_text": model_out}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Writing app.py
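The `/generate/` endpoint forwards whatever is in `parameters` to the model call as keyword arguments. The sketch below mirrors that logic without loading a model, so the behaviour can be tried anywhere. `fake_model` and `handle_generate` are hypothetical stand-ins written for this illustration, not part of `app.py` or `llama_cpp`.

```python
from typing import Any, Optional


def fake_model(prompt: str, max_tokens: int = 16, temperature: float = 0.8) -> dict:
    """Hypothetical stand-in for the Llama callable; it only echoes its arguments."""
    text = f"[{max_tokens} tokens @ T={temperature}] {prompt}"
    return {"choices": [{"text": text}]}


def handle_generate(inputs: str, parameters: Optional[dict[str, Any]] = None) -> dict[str, str]:
    """Mirror of the endpoint body: forward `parameters` as keyword arguments."""
    params = parameters or {}  # None and {} both mean "use the defaults"
    response = fake_model(prompt=inputs, **params)
    return {"generated_text": response["choices"][0]["text"]}


# Request body with explicit sampling options
print(handle_generate("Hello", {"max_tokens": 64, "temperature": 0.2}))

# Request body that leaves the options out
print(handle_generate("Hello"))

# A key the callable does not accept raises TypeError
try:
    handle_generate("Hello", {"not_a_real_option": 1})
except TypeError as exc:
    print("rejected:", exc)
```

In the real endpoint such a `TypeError` would be caught by the `except Exception` branch and returned as an HTTP 500 response, so a misspelled option name shows up as a server error rather than being silently ignored.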


## Start FastAPI Server
The initial run takes a long time because the model has to be downloaded and loaded onto the GPU.

Note: interrupting the Google Colab runtime will send a SIGINT and stop the server.

In [13]:
# This cell finishes quickly because it just needs to start up the server
# The server will start the model download and will take a while to start up
# ~5 minutes
!uvicorn app:app --host 0.0.0.0 --port 8000 > server.log 2>&1 &

Check `server.log` to follow the progress.

Wait until the model is loaded, and confirm with the next cell before moving on.

In [14]:
# If you see "Failed to connect", it's because the server is still starting up
# Wait for the model to be downloaded and the server to fully start
# Check the server.log file to see the status
!curl localhost:8000

curl: (7) Failed to connect to localhost port 8000 after 0 ms: Connection refused
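Instead of re-running the `curl` cell by hand until it stops failing, the same check can be repeated in a loop. This is a minimal sketch using only the standard library; the function name `wait_for_server` and its default timeout and interval are assumptions chosen for illustration.

```python
import time
import urllib.request


def wait_for_server(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and HTTP error statuses:
            # the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_server("http://localhost:8000")` in a notebook cell blocks until the status endpoint responds and returns `False` if it gives up, in which case `server.log` is the place to look.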


## Use Ngrok to Create a Public URL for the FastAPI Server
**IMPORTANT:** If you created an account via email, please verify your email or the next 2 cells won't work.

If you signed up via Google or GitHub account, you're good to go.

In [8]:
# This starts Ngrok and creates the public URL
from IPython import get_ipython
get_ipython().system_raw('ngrok http 8000 &')

Check the URL printed by the next cell. Opening it should report that the FastAPI server is alive and that the GPU is available.

To hit the model endpoint, send a POST request to the URL with `/generate/` appended.
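A client call against the endpoint could look like the sketch below. It uses only the standard library and follows the JSON shape defined by `TextInput` in `app.py` (`inputs` plus `parameters`). The helper names `build_generate_request` and `generate`, and the 120 second timeout, are assumptions made for this example.

```python
import json
import urllib.request
from typing import Any, Optional


def build_generate_request(
    base_url: str, prompt: str, parameters: Optional[dict[str, Any]] = None
) -> urllib.request.Request:
    """Build a POST request for the /generate/ endpoint defined in app.py."""
    # `parameters` is always sent (as null when unset) so the body matches TextInput
    body = json.dumps({"inputs": prompt, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, parameters: Optional[dict[str, Any]] = None) -> str:
    """Send the request and return the `generated_text` field of the reply."""
    request = build_generate_request(base_url, prompt, parameters)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["generated_text"]
```

With the public URL from Ngrok in hand, a call such as `generate(public_url, "Hello", {"max_tokens": 32})` returns the generated text as a string.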

In [15]:
# Get the Public URL
# If this doesn't work, make sure you verified your email
# Then run the previous code cell and this one again
!curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

https://3da8-35-247-169-220.ngrok-free.app
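The one-liner above takes the first entry of the `tunnels` list returned by the local Ngrok API. When the URL is needed inside Python, for example to pass it to a client, the same lookup can be written as a small function with a clearer error when no tunnel is running. This is a sketch; the function names are made up for illustration, and preferring an `https://` entry is a design choice of this example rather than something Ngrok requires.

```python
import json
import urllib.request


def pick_public_url(tunnels_payload: dict) -> str:
    """Return the first https tunnel URL, falling back to the first tunnel listed."""
    tunnels = tunnels_payload.get("tunnels", [])
    if not tunnels:
        raise RuntimeError("ngrok reports no active tunnels; is `ngrok http 8000` running?")
    for tunnel in tunnels:
        if tunnel.get("public_url", "").startswith("https://"):
            return tunnel["public_url"]
    return tunnels[0]["public_url"]


def fetch_public_url(api_url: str = "http://localhost:4040/api/tunnels") -> str:
    """Query the local ngrok agent API and extract the public URL."""
    with urllib.request.urlopen(api_url, timeout=5) as resp:
        return pick_public_url(json.load(resp))
```

Splitting the parsing from the HTTP call keeps `pick_public_url` easy to check against a sample payload without a running tunnel.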


## Shutting Down
To shut down the processes, run the following commands in a new cell:
```
!pkill uvicorn
!pkill ngrok
```