### Deploy Hugging Face TGI to RunPod

<a href="https://colab.research.google.com/github/kyledinh/gpt-prive/blob/main/codex/deploy-tgi-to-runpod/sample-deploy-tgi-llama2-gptq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install and import dependencies

In [None]:
%pip install -Uqqq pip --progress-bar off
%pip install -qqq runpod==0.10.0 --progress-bar off
%pip install -qqq text-generation==0.6.0 --progress-bar off
%pip install -qqq requests==2.31.0 --progress-bar off
%pip install -qqq python-dotenv --progress-bar off

import requests
import runpod
from text_generation import Client

#### Set up .env variables
- Create a `.env` file containing the API key and access token from your own accounts.
- RunPod API key: https://www.runpod.io/console/user/settings
- Hugging Face access token: https://huggingface.co/settings/tokens

```
RUNPOD_API_KEY=9M5OM37OK3N5OM37OK3N5OM37OK3NJ875OM37
HF_ACCESS_TOKEN=hf_ENv5OM37OK3N5OM37OK3N5OM37OK3N5OM37OK3NT
```

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

HF_ACCESS_TOKEN = os.getenv("HF_ACCESS_TOKEN", "hf-some-token")
RUNPOD_API_KEY = os.getenv("RUNPOD_API_KEY", "add-here-if-not-set-in-env-file")
 
assert HF_ACCESS_TOKEN.startswith("hf_"), "This doesn't look like a valid Hugging Face Token"
assert not RUNPOD_API_KEY.startswith("add-here"), "This doesn't look like a valid Runpod API Key"

runpod.api_key = RUNPOD_API_KEY 
print("HF_ACCESS_TOKEN: " + HF_ACCESS_TOKEN[0:6])
print("RUNPOD_API_KEY: " + runpod.api_key[0:6])

#### Deploy to RunPod
- https://www.runpod.io/console/pods

In [None]:
podname = "Sample-deploy-llama2-7B-gptq"
envs = {"HUGGING_FACE_HUB_TOKEN": HF_ACCESS_TOKEN, "QUANTIZE": "gptq"}
model = "TheBloke/Llama-2-7B-GPTQ"
gpu_type_id = "NVIDIA RTX A6000"  # 48GB VRAM $0.79/hr
gpu_count = 1
# data_center_id options: "EU-RO-1" | "EU-CZ-1" | "US-KS-1" | "US-KS-2"

pod = runpod.create_pod(
    name=podname,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.3",
    gpu_type_id=gpu_type_id,
    cloud_type="SECURE",
    data_center_id="US-KS-1",
    docker_args="--model-id " + model,
    gpu_count=gpu_count,
    volume_in_gb=50,
    container_disk_in_gb=10,
    ports="80/http,29500/http",
    volume_mount_path="/data",
    env=envs,
)


In [None]:
SERVER_URL = f'https://{pod["id"]}-80.proxy.runpod.net'
print(SERVER_URL)
print(f"Docs (Swagger UI) URL: {SERVER_URL}/docs")
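
The pod is not ready to serve requests the moment `create_pod` returns: the image still has to be pulled and the model weights downloaded. A minimal sketch of waiting for readiness by polling is shown here. It assumes the TGI server exposes a `GET /health` route that returns HTTP 200 once the model is loaded. `is_healthy` and `wait_until_ready` are hypothetical helper names, and the timeout values are arbitrary. The standard library's `urllib` is used so the sketch runs without extra packages; `requests.get` would work just as well.

```python
import time
import urllib.request
from typing import Callable


def is_healthy(server_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {server_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{server_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError (non-2xx) and socket timeouts all derive from OSError.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait_s: float = 900.0,  # assumed upper bound for image pull + model load
    interval_s: float = 10.0,
) -> bool:
    """Call probe() until it returns True or max_wait_s has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage in this notebook (SERVER_URL comes from the cell above):
# ready = wait_until_ready(lambda: is_healthy(SERVER_URL))
```

Passing the probe as a callable keeps the retry loop independent of the HTTP library and makes it easy to try out without a running pod.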

In [None]:
DEFAULT_SYSTEM_PROMPT = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
""".strip()


def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""
[INST] <<SYS>>
{system_prompt}
<</SYS>>

{prompt} [/INST]
""".strip()

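To make the template concrete, the sketch below renders it with a short system prompt and prints the result. `build_llama2_prompt` is a hypothetical stand-in that mirrors `generate_prompt` above so the block runs on its own; the system prompt and question are made-up illustration values.

```python
def build_llama2_prompt(prompt: str, system_prompt: str) -> str:
    """Wrap a user prompt and a system prompt in the Llama 2 chat markers."""
    return f"""
[INST] <<SYS>>
{system_prompt}
<</SYS>>

{prompt} [/INST]
""".strip()


rendered = build_llama2_prompt("What is TGI?", "Answer in one sentence.")
print(rendered)
# [INST] <<SYS>>
# Answer in one sentence.
# <</SYS>>
#
# What is TGI? [/INST]
```

The `.strip()` call removes the leading and trailing newlines of the triple-quoted string, so the prompt starts directly with `[INST]` and ends with `[/INST]`.
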
## API

In [None]:
def make_request(prompt: str):
    data = {
        "inputs": prompt,
        "parameters": {"best_of": 1, "temperature": 0.01, "max_new_tokens": 512},
    }
    headers = {"Content-Type": "application/json"}

    return requests.post(f"{SERVER_URL}/generate", json=data, headers=headers)
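
`make_request` returns the raw HTTP response, so a failed call only surfaces later as a `KeyError` on `"generated_text"`. The sketch below shows one way to check the response before reading it. `extract_generated_text` is a hypothetical helper, and it assumes that error responses carry an `"error"` field in their JSON body; adjust it if your server version reports errors differently.

```python
from typing import Any, Mapping


def extract_generated_text(status_code: int, payload: Mapping[str, Any]) -> str:
    """Return the stripped completion, or raise with the server's error message."""
    if status_code != 200:
        # Assumption: failures include an "error" field; fall back to the whole payload.
        detail = payload.get("error", payload)
        raise RuntimeError(f"Generation request failed ({status_code}): {detail}")
    if "generated_text" not in payload:
        raise RuntimeError(f"Unexpected response body: {payload}")
    return payload["generated_text"].strip()


# Usage in this notebook:
# response = make_request(prompt)
# print(extract_generated_text(response.status_code, response.json()))
```

If the proxy answers with a non-JSON body (for example while the pod is still starting), `response.json()` itself will raise, so checking `response.status_code` first is still worthwhile.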

In [None]:
%%time
prompt = generate_prompt(
    "Write an email to a new client to offer a subscription for a paper supply for 1 year."
)
response = make_request(prompt)
print(response.status_code)
print(response.json()["generated_text"].strip())

In [None]:
DWIGHT_SYSTEM_PROMPT = """
You're a salesman and beet farmer known as Dwight K Schrute from the TV show The Office. Dwight replies just as he would in the show.
You always reply as Dwight would reply. If you don't know the answer to a question, please don't share false information. Always format your responses using markdown.
""".strip()

In [None]:
%%time
prompt = generate_prompt(
    "Write an email to a new client to offer a subscription for a paper supply for 1 year.",
    system_prompt=DWIGHT_SYSTEM_PROMPT,
)
response = make_request(prompt)

In [None]:
print(response.json()["generated_text"].strip())

## Client

In [None]:
client = Client(SERVER_URL, timeout=60)

In [None]:
%%time
response = client.generate(prompt, max_new_tokens=512).generated_text

In [None]:
print(response.strip())

In [None]:
text = ""
for response in client.generate_stream(prompt, max_new_tokens=512):
    if not response.token.special:
        new_text = response.token.text
        print(new_text, end="")
        text += new_text

In [None]:
runpod.terminate_pod(pod["id"])
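
The pod is billed for as long as it runs, so if an earlier cell fails and the terminate cell is never reached, the pod keeps running. When the same steps are run as a script rather than cell by cell, one option is to tie the pod's lifetime to a `with` block. This is a minimal sketch under that assumption; `temporary_pod` is a hypothetical helper that takes the create and terminate functions as arguments, so it does not depend on any particular `runpod` signature.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def temporary_pod(
    create: Callable[[], dict],
    terminate: Callable[[str], object],
) -> Iterator[dict]:
    """Create a pod, yield it, and always terminate it afterwards."""
    pod = create()
    try:
        yield pod
    finally:
        # Runs on normal exit and when the body raises.
        terminate(pod["id"])


# Usage with the calls from this notebook:
# with temporary_pod(lambda: runpod.create_pod(name=podname, ...), runpod.terminate_pod) as pod:
#     SERVER_URL = f'https://{pod["id"]}-80.proxy.runpod.net'
#     ...
```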

## References

- https://www.runpod.io/console/gpu-secure-cloud
- https://docs.runpod.io/docs/get-gpu-types
- https://github.com/facebookresearch/llama
- https://github.com/huggingface/text-generation-inference
- https://github.com/runpod/runpod-python
