# Optimized Inference Deployment

* These instructions assume an M-series (Apple silicon) processor on macOS

## TGI

### Setup

* Hugging Face does not currently support `arm64` platform architectures, such as the M-series processors in Macs, for TGI
* Docker does not provide access to the native macOS Metal GPUs
* As a result, the TGI image currently cannot be run in Docker on macOS
* The instructions below are kept for reference, for use on a supported host, and show how to:
    * Run the TGI image
    * Use the `InferenceClient` to generate text from the TGI endpoint
    * Use the chat format

### Run Docker Image

```bash
docker run --gpus all \
    --platform linux/amd64 \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct
```

### Use HuggingFace `InferenceClient` to Access TGI Server

In [None]:
from huggingface_hub import InferenceClient

# initialize client pointing to TGI endpoint
client = InferenceClient(
    model="http://localhost:8080",  # URL to the TGI server
)

# text generation
response = client.text_generation(
    "Tell me a story",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    details=True,
    stop_sequences=[],
)

In [None]:
print(response.generated_text)

### Use for Chat Format

In [None]:
# chat completion
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)

In [None]:
print(response.choices[0].message.content)

### Use OpenAI Client

In [None]:
from openai import OpenAI

# initialize client pointing to TGI endpoint
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Make sure to include /v1
    api_key="not-needed",  # TGI doesn't require an API key by default
)

# chat completion
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)

In [None]:
print(response.choices[0].message.content)

---

## Llama.cpp

### Setup

```bash
# install via Homebrew
brew install llama.cpp

# download model and run model directly
llama-cli -hf HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF

# launch OpenAI-compatible API server
llama-server -hf HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF
```

### Use HuggingFace `InferenceClient` to Access Llama.cpp Server

In [None]:
from huggingface_hub import InferenceClient

# initialize client pointing to llama.cpp server
client = InferenceClient(
    model="http://localhost:8080/v1",  # URL to the llama.cpp server
    token="sk-no-key-required",  # llama.cpp server requires this placeholder
)

# text generation
response = client.text_generation(
    "Tell me a story",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    details=True,
)

In [None]:
print(response.generated_text)

### Use for Chat Format

In [7]:
# chat completion
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)

In [8]:
print(response.choices[0].message.content)

In a far-off land, there was a young girl named Lily. She lived in a small village surrounded by a magical forest filled with mythical creatures and enchanted trees. Lily was a curious and adventurous young girl, always eager to explore and learn about the world around her.

One day, Lily decided to venture into the forest to explore the magical creatures that lived there. She packed a small bag with food and water, and set off on her journey. As she walked deeper into the


### Use OpenAI Client

In [9]:
from openai import OpenAI

# initialize client pointing to llama.cpp server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",  # llama.cpp server requires this placeholder
)

# chat completion
response = client.chat.completions.create(
    model="smollm2-1.7b-instruct",  # Model identifier can be anything as server only loads one model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)

In [10]:
print(response.choices[0].message.content)

Once upon a time, there was a young girl named Lily who lived in a small village nestled in the heart of a dense forest. She was known for her kindness and her love for the forest, where she would often explore and learn about the different plants, animals, and insects that lived there. One day, she stumbled upon a small, mysterious-looking door hidden deep within the forest. The door was old and worn, with strange symbols etched into its surface, and it seemed to be made


---

## vLLM

### Setup

```bash
# launch vLLM OpenAI-compatible server via native python interface
VLLM_USE_CUDA=0 python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-360M-Instruct \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype float16
```

### Use HuggingFace `InferenceClient` to Access vLLM Server

In [None]:
from huggingface_hub import InferenceClient

# initialize client pointing to vLLM endpoint
client = InferenceClient(
    model="http://localhost:8000/v1",  # URL to the vLLM server
)

# text generation
response = client.text_generation(
    "Tell me a story",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    details=True,
)

In [None]:
print(response.generated_text)

### Use for Chat Format

In [None]:
# chat completion
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)

In [None]:
print(response.choices[0].message.content)

### Use OpenAI Client

In [None]:
from openai import OpenAI

# initialize client pointing to vLLM endpoint
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require an API key by default
)

# chat completion
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)

In [None]:
print(response.choices[0].message.content)

---

## Text Generation

### TGI

#### Setup

```bash
docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct \
    --max-total-tokens 4096 \
    --max-input-length 3072 \
    --max-batch-total-tokens 8192 \
    --waiting-served-ratio 1.2
```

#### Use `InferenceClient`

In [None]:
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

# advanced parameters example
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,
    max_tokens=200,
    top_p=0.95,
)

In [None]:
print(response.choices[0].message.content)

In [None]:
# raw text generation
response = client.text_generation(
    "Write a creative story about space exploration",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True,
    details=True,
)

In [None]:
print(response.generated_text)

#### Use OpenAI Client

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# advanced parameters example
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,  # higher for more creativity
)

In [None]:
print(response.choices[0].message.content)

---

### Llama.cpp

#### Setup

```bash
llama-server \
    -hf HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    --threads 8 \
    --batch-size 512 \
    --n-gpu-layers 0
```

#### Use `InferenceClient`

In [21]:
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="http://localhost:8080/v1",
    token="sk-no-key-required"
)

# advanced parameters example
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,
    max_tokens=200,
    top_p=0.95,
)

In [22]:
print(response.choices[0].message.content)

In the heart of a bustling city, there lived a young girl named Luna. She was a curious and adventurous soul with a vivid imagination. Luna was always fascinated by the world of dreams and the mysteries they held. She believed that dreams were a way of communicating with the divine, the cosmic, and the divine.

One day, as Luna was wandering through her neighborhood, she stumbled upon a mysterious, ancient temple hidden in the heart of the city. The temple was old and had a beautiful, intricate, golden statue of a dreamer standing in its entrance. It was here that Luna discovered she had the ability to enter the world of dreams.

Luna soon found herself capable of entering her own dreams, as well as those of others. She could see, hear, and even influence the dreams of others. It was as if she was a dreamer herself, living in the world of the dream. She could enter the dreams of her friends and loved ones, and she could


In [None]:
# direct text generation
response = client.text_generation(
    "Write a creative story about space exploration",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
    details=True,
)

In [None]:
print(response.generated_text)

#### Use OpenAI Client

In [26]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# advanced parameters example
response = client.chat.completions.create(
    model="smollm2-1.7b-instruct",
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,  # higher for more creativity
    top_p=0.95,  # nucleus sampling probability
    frequency_penalty=0.5,  # reduce repetition of frequent tokens
    presence_penalty=0.5,  # reduce repetition by penalizing tokens already present
    max_tokens=200,  # Maximum generation length
)

In [27]:
print(response.choices[0].message.content)

Once upon a time, in the quaint town of Serendipity, there lived an ordinary man named Jack Harris. Jack was no superhero but he had his own ways to save the day when needed. He was an ordinary man with extraordinary skills that he never knew existed until one fateful day. 

It all began on a typical Monday morning when Jack woke up to find a tiny note on his bedside table. It was written in elegant handwriting and it read, "You have a chance to change the world". Jack was confused as to who could have left this note and what it meant. He thought it was just a prank played by his eccentric neighbor, but as he looked out of the window, he saw a strange object hovering above the town. 

It was a spaceship! The aliens had indeed come for him. But instead of being scared or excited, Jack was both. He had always been fascinated with science fiction and space movies but never imagined that he would actually meet


#### Use Llama.cpp Native Library

In [None]:
# use llama-cpp-python package for direct model access
from llama_cpp import Llama

# load model
llm = Llama(
    model_path="smollm2-1.7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,  # Context window size
    n_threads=8,  # CPU threads
    n_gpu_layers=0,  # GPU layers (0 = CPU only)
)

# format prompt according to the model's expected format
prompt = """<|im_start|>system
You are a creative storyteller.
<|im_end|>
<|im_start|>user
Write a creative story
<|im_end|>
<|im_start|>assistant
"""

# generate response with precise parameter control
output = llm(
    prompt,
    max_tokens=200,
    temperature=0.8,
    top_p=0.95,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    stop=["<|im_end|>"],
)

In [None]:
print(output["choices"][0]["text"])

---

### vLLM

#### Use `InferenceClient`

In [None]:
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8000/v1")

# advanced parameters example
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,
    max_tokens=200,
    top_p=0.95,
)

In [None]:
print(response.choices[0].message.content)

In [None]:
# direct text generation
response = client.text_generation(
    "Write a creative story about space exploration",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
    details=True,
)

In [None]:
print(response.generated_text)

#### Use OpenAI Client

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# advanced parameters example
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,
    top_p=0.95,
    max_tokens=200,
)

In [None]:
print(response.choices[0].message.content)

#### Use vLLM Python Native Interface

In [None]:
from vllm import LLM, SamplingParams

# initialize model with advanced parameters
llm = LLM(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    gpu_memory_utilization=0.85,
    max_num_batched_tokens=8192,
    max_num_seqs=256,
    block_size=16,
)

# configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,  # higher for more creativity
    top_p=0.95,  # consider top 95% probability mass
    max_tokens=100,  # maximum length
    presence_penalty=1.1,  # reduce repetition
    frequency_penalty=1.1,  # reduce repetition
    stop=["\n\n", "###"],  # stop sequences
)

# generate text
prompt = "Write a creative story"
outputs = llm.generate(prompt, sampling_params)

In [None]:
print(outputs[0].outputs[0].text)

In [None]:
# chat-style interactions
chat_prompt = [
    {"role": "system", "content": "You are a creative storyteller."},
    {"role": "user", "content": "Write a creative story"},
]
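# render the messages with the model's chat template via the tokenizer
formatted_prompt = llm.get_tokenizer().apply_chat_template(
    chat_prompt,
    tokenize=False,  # return a string rather than token ids
    add_generation_prompt=True,  # append the assistant turn header
)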
outputs = llm.generate(formatted_prompt, sampling_params)

In [None]:
print(outputs[0].outputs[0].text)

---

## Token Selection and Sampling

The process of generating text involves selecting the next token at each step.

This selection process can be controlled through various parameters:

1. **Raw Logits:** The initial unnormalized scores the model outputs for each token (softmax turns them into probabilities)
2. **Temperature:** Controls randomness in selection (higher = more creative)
3. **Top-p (Nucleus) Sampling:** Filters to top tokens making up X% of probability mass
4. **Top-k Filtering:** Limits selection to k most likely tokens
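The four steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used by TGI, llama.cpp, or vLLM; the function names (`softmax`, `filter_top_k_top_p`, `sample_next_token`) are hypothetical, and real servers may apply the filters in a different order or with different tie-breaking.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p. Returns {token_id: renormalized prob}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample one token id from the filtered distribution."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens survive the top-p cut; higher temperatures flatten it and let more candidates through.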

### TGI

In [None]:
# `client` is the `InferenceClient` pointing to the TGI endpoint
client.text_generation(
    "Write a creative story",
    temperature=0.8,  # higher for more creativity
    top_p=0.95,  # consider top 95% probability mass
    top_k=50,  # consider top 50 tokens
    max_new_tokens=100,  # maximum length
    repetition_penalty=1.1,  # reduce repetition
)

### Llama.cpp

In [None]:
response = client.completions.create(
    model="smollm2-1.7b-instruct",  # model name (can be any string for llama.cpp server)
    prompt="Write a creative story",
    temperature=0.8,  # higher for more creativity
    top_p=0.95,  # consider top 95% probability mass
    frequency_penalty=1.1,  # reduce repetition
    presence_penalty=0.1,  # reduce repetition
    max_tokens=100,  # maximum length
)

In [None]:
output = llm(
    "Write a creative story",
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    max_tokens=100,
    repeat_penalty=1.1,
)

### vLLM

In [None]:
params = SamplingParams(
    temperature=0.8,  # higher for more creativity
    top_p=0.95,  # consider top 95% probability mass
    top_k=50,  # consider top 50 tokens
    max_tokens=100,  # maximum length
    presence_penalty=0.1,  # reduce repetition
)
llm.generate("Write a creative story", sampling_params=params)

---

## Controlling Repetition

All three frameworks provide ways to prevent repetitive text generation:
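As a rough guide to what the penalty parameters in the examples below do, here is a minimal sketch of the two common schemes: a multiplicative repetition penalty and additive presence/frequency penalties applied to logits. The function names are hypothetical, and each framework's exact behavior (for example, how much of the history is considered) may differ.

```python
from collections import Counter


def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Multiplicative penalty: make every already-generated token less likely.
    Positive logits are divided by the penalty, negative ones multiplied."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out


def apply_presence_frequency_penalty(
    logits, generated_ids, presence_penalty=0.0, frequency_penalty=0.0
):
    """Additive penalties: subtract a flat amount once a token has appeared
    (presence) plus an amount proportional to its count (frequency)."""
    out = list(logits)
    for token_id, count in Counter(generated_ids).items():
        out[token_id] -= presence_penalty + frequency_penalty * count
    return out
```

The presence penalty discourages reusing a token at all, while the frequency penalty grows with each repeat, so the two are often combined.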

### TGI

In [None]:
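# `client` is the `InferenceClient` pointing to the TGI endpoint
# note: `no_repeat_ngram_size` is a transformers `generate()` option,
# not a `text_generation` parameter, so it is not passed here
client.text_generation(
    "Write a varied text",
    repetition_penalty=1.1,  # penalize repeated tokens
)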

### Llama.cpp

In [None]:
response = client.completions.create(
    model="smollm2-1.7b-instruct",
    prompt="Write a varied text",
    frequency_penalty=1.1,  # penalize frequent tokens
    presence_penalty=0.8,  # penalize tokens already present
)

In [None]:
output = llm(
    "Write a varied text",
    repeat_penalty=1.1,  # penalize repeated tokens
    frequency_penalty=0.5,  # additional frequency penalty
    presence_penalty=0.5,  # additional presence penalty
)

### vLLM

In [None]:
params = SamplingParams(
    presence_penalty=0.1,  # penalize token presence
    frequency_penalty=0.1,  # penalize token frequency
)

---

## Length Control and Stop Sequences

One can control generation length and specify when to stop:
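Conceptually, a generation loop ends on whichever limit is hit first: a stop sequence, the token budget, or the model's end-of-sequence. The sketch below illustrates that logic on a stream of already-decoded token strings. The helper names and the `"stop"`, `"length"`, and `"eos"` labels are assumptions for illustration; whether the stop sequence itself is included in the returned text varies by server.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any non-empty stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty stop string would match everywhere
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


def generate_with_limits(token_stream, max_tokens, stop_sequences):
    """Consume decoded tokens until a stop sequence or the token budget is hit.
    Returns (text, reason) where reason is "stop", "length", or "eos"."""
    text = ""
    for count, token in enumerate(token_stream, start=1):
        text += token
        truncated = truncate_at_stop(text, stop_sequences)
        if truncated != text:
            return truncated, "stop"
        if count >= max_tokens:
            return text, "length"
    return text, "eos"  # the stream ended on its own
```

Checking the accumulated text rather than individual tokens matters because a stop sequence such as `###` may be split across several tokens.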

### TGI

In [None]:
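# `client` is the `InferenceClient` pointing to the TGI endpoint
# note: `text_generation` has no `min_new_tokens` parameter, so only the
# upper bound and the stop sequences are set here
client.text_generation(
    "Generate a short paragraph",
    max_new_tokens=100,
    stop_sequences=["\n\n", "###"],
)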

### Llama.cpp

In [None]:
response = client.completions.create(
    model="smollm2-1.7b-instruct",
    prompt="Generate a short paragraph",
    max_tokens=100,
    stop=["\n\n", "###"],
)

In [None]:
output = llm("Generate a short paragraph", max_tokens=100, stop=["\n\n", "###"])

### vLLM

In [None]:
params = SamplingParams(
    max_tokens=100,
    min_tokens=10,
    stop=["###", "\n\n"],
    ignore_eos=False,
    skip_special_tokens=True,
)

---

## Memory Management

All three frameworks implement memory management techniques for efficient inference.

### TGI

TGI uses Flash Attention 2 and continuous batching:

```bash
# Docker deployment with memory optimization
docker run --gpus all -p 8080:80 \
    --shm-size 1g \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-1.7B-Instruct \
    --max-batch-total-tokens 8192 \
    --max-input-length 4096
```

### Llama.cpp

llama.cpp uses quantization and optimized memory layout:

```bash
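# --mlock locks the model in memory to prevent swapping
# --cont-batching enables continuous batching
# (comments cannot follow a trailing backslash, so they are listed up front)
llama-server \
    -hf HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF \
    --host 0.0.0.0 \
    --port 8080 \
    -c 2048 \
    --threads 4 \
    --n-gpu-layers 32 \
    --mlock \
    --cont-batching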
```
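To see why quantization matters for memory, a back-of-the-envelope estimate of weight storage helps. The sketch below assumes a nominal bit width per weight and roughly 1.7 billion parameters for the 1.7B model; actual GGUF quantization formats store extra scaling data, and the KV cache and activations come on top, so real usage will be higher. The function name `weight_bytes` is hypothetical.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate bytes needed to store the weights alone."""
    return num_params * bits_per_weight // 8


PARAMS = 1_700_000_000  # assumed parameter count for a 1.7B model

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gigabytes = weight_bytes(PARAMS, bits) / 1e9
    print(f"{label}: about {gigabytes:.2f} GB for weights")
```

Under these assumptions, dropping from 16-bit to a nominal 4-bit format cuts weight storage to a quarter.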

For models too large for a GPU, one can use CPU offloading:

```bash
# --n-gpu-layers 20 keeps the first 20 layers on the GPU
# --threads 8 uses more CPU threads for the layers left on the CPU
llama-server \
    -hf HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF \
    --n-gpu-layers 20 \
    --threads 8
```

### vLLM

vLLM uses PagedAttention for optimal memory management:

In [None]:
from vllm import LLM

# engine arguments are passed to `LLM` directly as keyword arguments
llm = LLM(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    gpu_memory_utilization=0.85,  # fraction of GPU memory vLLM may use
    max_num_batched_tokens=8192,  # upper bound on tokens batched per step
    block_size=16,  # tokens per KV-cache block
)
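The `block_size=16` setting above is the number of tokens stored per KV-cache block. As a simplified illustration of the paging idea (not vLLM's actual allocator), the arithmetic below shows how many fixed-size blocks a sequence occupies and how many slots in its last block stay unused. The helper names are hypothetical.

```python
import math


def blocks_needed(num_tokens, block_size=16):
    """Number of fixed-size KV-cache blocks a sequence of num_tokens occupies."""
    return math.ceil(num_tokens / block_size)


def unused_slots(num_tokens, block_size=16):
    """Token slots allocated but not filled in the sequence's last block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


for length in (16, 100, 1000):
    print(length, blocks_needed(length), unused_slots(length))
```

Because blocks are allocated on demand, the unused space per sequence is bounded by one block minus one token, rather than by the gap between the sequence length and a preallocated maximum.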