# Chatting with Transformers

**Chat models** are conversational AIs that we can exchange messages with.

## Quickstart

Chat models continue chats. We pass them a conversational history, which can be as short as a single user message, and the model will continue the conversation by adding its response.

In [1]:
chat = [
    {
        'role': 'system',
        'content': 'You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986.',
    },
    {
        'role': 'user',
        'content': 'Hey, can you tell me any fun things to do in New York?',
    },
]

In addition to the user's message, we added a **system** message at the start of the conversation. Not all chat models support system messages, but when they do, they represent high-level directives about how the model should behave in the conversation. We can use this to guide the model, for example toward short or long responses, lighthearted or serious ones, and so on.

Once we have a chat, the quickest way to continue it is to use the `TextGenerationPipeline`.

In [None]:
from huggingface_hub import login

login()

In [None]:
import torch
from transformers import pipeline

pipe = pipeline(
    'text-generation',
    'meta-llama/Meta-Llama-3-8B-Instruct',
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

In [None]:
response = pipe(chat, max_new_tokens=512)
print(response[0]['generated_text'][-1]['content'])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
response

We can continue the chat by appending our own response to it. The `response` object returned by the pipeline actually contains the entire chat so far, so we can simply append a message and pass it back:

In [None]:
chat = response[0]['generated_text']

chat.append(
    {
        'role': 'user',
        'content': 'Wait, what is so wild about soup cans?'
    }
)

response = pipe(chat, max_new_tokens=512)
print(response[0]['generated_text'][-1]['content'])

In [None]:
# check the entire chat history
response
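
The raw `response` object can be hard to scan once the chat grows. As a minimal sketch, the helper below prints one message per line. The name `format_chat` and the stand-in history are hypothetical, made up for illustration; in the notebook you would pass `response[0]['generated_text']` instead.

```python
def format_chat(chat):
    """Render a list of {'role', 'content'} messages as readable text."""
    lines = []
    for message in chat:
        # Each message is a dict with a 'role' key and a 'content' key
        lines.append(f"[{message['role']}] {message['content']}")
    return "\n".join(lines)


# Stand-in history (made up); the real one comes from the pipeline response
example_chat = [
    {'role': 'user', 'content': 'Hello!'},
    {'role': 'assistant', 'content': 'Hi there.'},
]
print(format_chat(example_chat))
```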

## Choosing a chat model

When choosing a chat model, there are two main things to consider:

* model size
* quality of model output

### Size and model naming

Without quantization, we expect to need about 2 bytes of memory per parameter. This means that an "8B" model with 8 billion parameters will need about 16GB of memory just to fit the parameters, plus a little extra for other overhead. It's a good fit for a high-end consumer GPU with 24GB of memory.

Some chat models are "Mixture of Experts" models, and their sizes are written differently, such as "8x7B". The numbers are a little fuzzier here, but in general we can read this as saying the model has approximately 56 (8x7) billion parameters.
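
The arithmetic above can be sketched in a few lines. This is a rough estimate under the same assumptions as the text (2 bytes per parameter, weights only, no overhead), and the helper names `params_from_name` and `weight_memory_gb` are hypothetical. For an "NxMB" label it simply multiplies N by M, which is the approximate reading described above, not an exact parameter count.

```python
import re

BYTES_PER_PARAM = 2  # unquantized 16-bit weights, as assumed in the text


def params_from_name(size_label):
    """Turn a size label such as '8B' or '8x7B' into a rough parameter count."""
    match = re.fullmatch(r"(?:(\d+)x)?(\d+(?:\.\d+)?)B", size_label)
    if match is None:
        raise ValueError(f"unrecognized size label: {size_label!r}")
    # An optional 'Nx' prefix is treated as a plain multiplier
    multiplier = int(match.group(1)) if match.group(1) else 1
    billions = float(match.group(2))
    return multiplier * billions * 1e9


def weight_memory_gb(size_label, bytes_per_param=BYTES_PER_PARAM):
    """Memory for the weights alone, in GB (taking 1 GB as 1e9 bytes)."""
    return params_from_name(size_label) * bytes_per_param / 1e9


for label in ("8B", "8x7B"):
    print(f"{label}: ~{weight_memory_gb(label):.0f} GB for weights")
```

Under these assumptions an "8B" model comes out at about 16 GB and an "8x7B" model at about 112 GB, before any extra overhead.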

### Which chat model is best?

The Open LLM Leaderboard and the LMSys Chatbot Arena leaderboard are two of the most popular leaderboards.

### Specialist domains

If you are working in a specialist domain, check the domain-specific leaderboards as well.

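## What happens inside the pipeline?

The pipeline in the quickstart hides several steps. The cells below run the same generation by hand: loading the model and tokenizer, applying the chat template, tokenizing the formatted chat, generating, and decoding the output.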

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Prepare the input as before
chat = [
    {
        'role': 'system',
        'content': 'You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986.',
    },
    {
        'role': 'user',
        'content': 'Hey, can you tell me any fun things to do in New York?',
    },
]

In [None]:
# 1. Load the model and tokenizer
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')

In [None]:
# 2. Apply the chat template
formatted_chat = tokenizer.apply_chat_template(
    chat,
    tokenize=False,
    add_generation_prompt=True,
)
print("Formatted chat:\n", formatted_chat)
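
To make the idea of a chat template more concrete, here is a toy sketch of the kind of work a template does: it flattens the list of messages into a single string and, when `add_generation_prompt` is true, opens a new assistant turn for the model to continue. The markers used here are made up for illustration and are not Llama 3's actual template; `render_toy_template` is a hypothetical name.

```python
def render_toy_template(chat, add_generation_prompt=True):
    """Join chat messages into one prompt string using made-up markers."""
    parts = []
    for message in chat:
        role = message['role']
        # Wrap each message in markers that name who is speaking
        parts.append(f"<{role}>\n{message['content']}\n</{role}>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant
        parts.append("<assistant>\n")
    return "".join(parts)


toy_chat = [{'role': 'user', 'content': 'Hi'}]
print(render_toy_template(toy_chat))
```

Real templates are stored with each tokenizer and differ from model to model, which is why the notebook calls `apply_chat_template` rather than formatting by hand.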

In [None]:
# 3. Tokenize the chat (can be combined with previous step using tokenize=True)
inputs = tokenizer(
    formatted_chat,
    return_tensors='pt',
    add_special_tokens=False,
)
# Move the tokenized inputs to the same device the model is on (GPU/CPU)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}
print("Tokenized inputs:\n", inputs)

In [None]:
# 4. Generate text from the model
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.1,
)
print("Generated tokens:\n", outputs)

In [None]:
# 5. Decode the output back to a string
decoded_output = tokenizer.decode(
    outputs[0][inputs['input_ids'].size(1):],
    skip_special_tokens=True
)
print("Decoded output:\n", decoded_output)
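
The slice in step 5 is there because the generated sequence starts with the prompt tokens, followed by the new ones. A small sketch with plain Python lists shows the same idea; the token ids are made up and stand in for the real tensors.

```python
# Made-up token ids standing in for inputs['input_ids'][0]
prompt_ids = [101, 7592, 2088]

# Stand-in for outputs[0]: the prompt followed by the newly generated tokens
generated = prompt_ids + [2023, 2003, 102]

# Skip the prompt so that only the model's reply is left to decode
new_tokens = generated[len(prompt_ids):]
print(new_tokens)
```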

## Performance, memory and hardware

### Memory considerations

By default, Hugging Face classes like `TextGenerationPipeline` or `AutoModelForCausalLM` will load the model in `float32` precision. This means it needs 4 bytes (32 bits) per parameter, so an "8B" model with 8 billion parameters will need ~32 GB of memory.

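Most modern language models are trained in "bfloat16" precision, which uses only 2 bytes per parameter. Loading the model with `torch_dtype=torch.bfloat16`, as in the earlier cells, halves the memory needed compared with `float32`.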

It is even possible to go lower than 16 bits using "quantization", a method to lossily compress model weights. This allows each parameter to be squeezed down to 8 bits, 4 bits or even less. The model outputs may be negatively affected, but this is often a tradeoff worth making to fit a larger and more capable chat model in memory.

In [None]:
from transformers import pipeline, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipe = pipeline(
    'text-generation',
    'meta-llama/Meta-Llama-3-8B-Instruct',
    device_map='auto',
    model_kwargs={
        'quantization_config': quantization_config,
    }
)

In [None]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
pipe = pipeline(
    'text-generation',
    'meta-llama/Meta-Llama-3-8B-Instruct',
    device_map='auto',
    model_kwargs={
        'quantization_config': quantization_config,
    }
)
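
To put rough numbers on the precision options discussed in this section, the sketch below estimates the memory for the weights of an 8-billion-parameter model at each bit width. It is a back-of-the-envelope calculation only: it ignores any overhead that quantized formats or the runtime add, and `memory_for_bits_gb` is a hypothetical helper name.

```python
def memory_for_bits_gb(num_params, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # an "8B" model

for name, bits in [("float32", 32), ("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>8}: ~{memory_for_bits_gb(NUM_PARAMS, bits):.0f} GB")
```

Each halving of the bit width halves the estimate, from about 32 GB in `float32` down to about 4 GB at 4 bits.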

### Performance considerations

Generating text from a chat model is unusual in that it is bottlenecked by memory bandwidth rather than compute power, because every active parameter must be read from memory for each token that the model generates. This means that the number of tokens per second you can generate from a chat model is generally proportional to the total bandwidth of the memory it resides in, divided by the size of the model.
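
The relationship above can be turned into a rough upper-bound estimate. The sketch below assumes every token requires reading the whole model from memory once and ignores everything else. The bandwidth figures are illustrative placeholders rather than measurements of any particular hardware, and `max_tokens_per_second` is a hypothetical helper name.

```python
def max_tokens_per_second(bandwidth_gb_per_s, model_size_gb):
    """Upper-bound estimate: memory bandwidth divided by model size."""
    return bandwidth_gb_per_s / model_size_gb


MODEL_SIZE_GB = 16  # e.g. an "8B" model at 2 bytes per parameter

# Illustrative bandwidth values only, not real hardware specifications
for label, bandwidth in [("slower memory", 50), ("faster memory", 1000)]:
    rate = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{label} ({bandwidth} GB/s): up to ~{rate:.1f} tokens/s")
```

The same formula shows why quantization can help speed as well as memory: halving the model size doubles the estimate for the same bandwidth.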