# Optimized Inference Deployment

* These instructions assume an M-series (Apple silicon) processor on macOS

## TGI

### Setup

* Hugging Face does not currently provide the TGI image for `arm64` platform architectures, such as the M-series processors in Macs
* Docker containers cannot access the native macOS Metal GPUs
* As a result, the TGI image currently cannot be run in Docker on macOS; the steps below are recorded for reference, for a host where the image does run
* The instructions below show how to:
    * Run the TGI image
    * Use the `InferenceClient` to generate text from the TGI endpoint
    * Use the chat format
    * Use the OpenAI client against the same endpoint

```bash
docker run --gpus all \
    --platform linux/amd64 \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct
```
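
The container can take a while to download and load the model before it accepts requests. A minimal sketch of waiting for it, assuming the server is published on `http://localhost:8080` as in the command above and that it answers `GET /health` with HTTP 200 once ready; `wait_for_tgi` is a hypothetical helper name, not part of any library:

```python
import time
import urllib.error
import urllib.request


def wait_for_tgi(base_url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable or not ready yet; retry after a pause
        time.sleep(interval_s)
    return False


# Example (assumed URL): wait_for_tgi("http://localhost:8080")
```

The function returns `False` instead of raising, so a notebook cell can decide whether to go on to the client examples.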

### Use the Hugging Face `InferenceClient` to Access the TGI Endpoint

```python
from huggingface_hub import InferenceClient

# initialize client pointing to TGI endpoint
client = InferenceClient(
    model="http://localhost:8080",  # URL to the TGI server
)

# text generation
response = client.text_generation(
    "Tell me a story",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    details=True,
    stop_sequences=[],
)
```

```python
print(response.generated_text)
```
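
Because the request above sets `details=True`, the response carries generation metadata in addition to the text. A minimal sketch of reading it, assuming the `details` object exposes `finish_reason` and `generated_tokens` as in recent `huggingface_hub` releases; `summarize_generation` is a hypothetical helper name, and the stand-in object below only mimics the response shape so the snippet runs without a server:

```python
from types import SimpleNamespace


def summarize_generation(response) -> dict:
    """Collect the text and basic metadata from a text_generation response."""
    details = getattr(response, "details", None)
    finish_reason = None if details is None else details.finish_reason
    return {
        "text": response.generated_text,
        "finish_reason": finish_reason,
        "generated_tokens": None if details is None else details.generated_tokens,
        # "length" means generation stopped at max_new_tokens, so the text may be cut off
        "truncated": finish_reason == "length",
    }


# Stand-in for a real response (hypothetical values), shaped like the client's output
fake_response = SimpleNamespace(
    generated_text="Once upon a time",
    details=SimpleNamespace(finish_reason="length", generated_tokens=4),
)
summary = summarize_generation(fake_response)
print(summary["finish_reason"], summary["generated_tokens"], summary["truncated"])
```

With `max_new_tokens=100`, a story prompt may well stop on the token limit, which is what the `truncated` flag is meant to surface.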

### Use for Chat Format

```python
# chat completion
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)
```

```python
print(response.choices[0].message.content)
```

### Use OpenAI Client

```python
from openai import OpenAI

# initialize client pointing to TGI endpoint
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Make sure to include /v1
    api_key="not-needed",  # TGI doesn't require an API key by default
)

# chat completion
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)
```

```python
print(response.choices[0].message.content)
```

---

## Llama.cpp