# Language Models 1 | Inference (Huggingface)

For more, see [here](https://huggingface.co/tasks/text-generation) and [here](https://huggingface.co/docs/transformers/generation_strategies).

## Install & Workflow

#### Drive

If you need to load/save to your drive:

```python
import sys
if "google.colab" in sys.modules:
    from google.colab import drive
    drive.mount("/content/drive/")

import os
os.chdir("drive/My Drive/gold/IS53055B-DMLCP/DMLCP") # to change to another directory
```

#### Huggingface login

For some models and datasets, and if you want to push your model to HF (same as GitHub, but for models) you need to be logged into your HF account.

For that, you need to create an account [here](https://huggingface.co/) and then go to ['/settings/tokens'](https://huggingface.co/settings/tokens) to create an access token.

```python
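from huggingface_hub import get_token, notebook_login
# get_token() returns None when no token is stored locally or set in the environment
# (recent versions of huggingface_hub no longer store it in ~/.huggingface/token)
if get_token() is None:
    notebook_login()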
```

## Imports

In [None]:
import torch

# Get cpu, gpu or mps device for training.
# See: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html#creating-models
device = (
    "cuda"
    if torch.cuda.is_available()
    # note: models using bfloat16 aren't compatible with MPS
    # else "mps"
    # if torch.backends.mps.is_available()
    else "cpu"
)

from transformers import pipeline
from transformers import GenerationConfig

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

### Printing utils

In [None]:
# The textwrap module automatically formats text for you
import textwrap

# many more options, see them with textwrap.TextWrapper?
tw = textwrap.TextWrapper(
    # the formatted width we want
    width=79,
    # this will keep whitespace & line breaks in the original text
    replace_whitespace=False
)

def wrap_print(s):
    """Format text into Textwrapped lines and print it"""
    print("\n".join(tw.wrap(s)))

## Out-of-the-box Generation: the `pipeline`

The `pipeline` works for literally everything, so we need to specify the task (`text-generation`), and the model ([so many to choose from...](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads), here's [the current model page](https://huggingface.co/Qwen/Qwen3-0.6B?library=transformers)). We also select the [device](https://huggingface.co/docs/transformers/pipeline_tutorial#device).

In [None]:
MODEL_ID = "Qwen/Qwen3-0.6B"

In [None]:
generator = pipeline(
    "text-generation",
    model=MODEL_ID,
    device=device 
)

See [here](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.from_pretrained.example) for an example using `GenerationConfig` and [here](https://github.com/huggingface/transformers/issues/19853#issuecomment-1290759818) for the `pad_token_id` fix.

In [None]:
generation_config = GenerationConfig.from_pretrained(MODEL_ID)
print(generation_config)

Hugging Face is transitioning towards the use of generation config files (good for automation).

In [None]:
generation_config.max_length = 25
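The printed config may also list sampling parameters such as `temperature`, `top_k` and `top_p`. As a rough illustration of what they do to the next-token distribution, here is a minimal pure-Python sketch. The helper `filter_probs` and the toy logits are made up for this example; the library works on tensors and its implementation differs in detail.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Toy temperature / top-k / top-p filtering (hypothetical helper)."""
    # temperature: divide the logits before the softmax (<1 sharpens, >1 flattens)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # rank token indices from most to least likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k most likely tokens (0 means no limit)
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # renormalise over the surviving tokens
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]
print(filter_probs(toy_logits))
print(filter_probs(toy_logits, temperature=0.5))
print(filter_probs(toy_logits, top_k=2))
print(filter_probs(toy_logits, top_p=0.9))
```

The model then samples the next token from whatever distribution is left, so stricter filters give more predictable text.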

### Quick vocab note:

`bos`: beginning of sentence  
`eos`: end of sentence  
`pad`: padding

These are special tokens that have been inserted into the text at training time.

For instance, in our case the 'beginning' of the text is 'endoftext', as during training the texts are fed to the network one after the other, with this special token between them:
```python
# the number (token id) representing the beginning of sentence special token
bos_token_id = generation_config.bos_token_id
print(bos_token_id)
# decode that number back into a string
print(generator.tokenizer.decode([bos_token_id]))
```
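To picture how such a separator ends up in the training data, here is a toy sketch that joins a few documents into one stream. The string `<|endoftext|>` stands in for the special token; the documents and the helper `build_training_stream` are invented for illustration, and real pipelines work on token ids rather than strings.

```python
EOT = "<|endoftext|>"  # stand-in for the special token

def build_training_stream(documents, separator=EOT):
    """Concatenate documents, marking the end of each one with the separator."""
    return separator.join(documents) + separator

docs = ["First story.", "Second story.", "Third story."]
stream = build_training_stream(docs)
print(stream)

# splitting on the separator recovers the original documents
recovered = [d for d in stream.split(EOT) if d]
print(recovered)
```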

### Generate text!

In [None]:
# torch.manual_seed(1)
generator(
    "Once upon a time,",
    generation_config=generation_config
)

Parallel generation!

In [None]:
# torch.manual_seed(1)
generator(
    ["Once upon a time,"] * 2,
    generation_config=generation_config
)
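The pipeline returns a list of dicts for a single prompt, and a list of such lists for several prompts. A small helper can flatten both shapes into readable text. This is only a sketch: `format_generations` is a hypothetical name, and `fake_outputs` is hand-written data that mimics the return shape, not real model output.

```python
import textwrap

def format_generations(outputs, width=79):
    """Flatten pipeline-style outputs into a wrapped, printable string."""
    # one prompt gives a list of dicts; several prompts give a list of lists
    if outputs and isinstance(outputs[0], dict):
        outputs = [outputs]
    blocks = []
    for i, strands in enumerate(outputs):
        for j, strand in enumerate(strands):
            header = f"[prompt {i}, strand {j}]"
            body = textwrap.fill(strand["generated_text"], width=width)
            blocks.append(header + "\n" + body)
    return ("\n" + "-" * 40 + "\n").join(blocks)

# hand-written stand-in for what the batched call above returns
fake_outputs = [
    [{"generated_text": "Once upon a time, there was a fox."}],
    [{"generated_text": "Once upon a time, it rained for a week."}],
]
print(format_generations(fake_outputs))
```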

---

## Deeper: `Tokenizer` and `Model` classes

What does the pipeline do under the hood?

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# to GPU/MPS/CPU
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)

### The tokenizer

See [the Preprocess](https://huggingface.co/docs/transformers/preprocessing) tutorial on Huggingface for more details.

In [None]:
toks = tokenizer.encode("Oh sweet midnight")
print(toks)
print(tokenizer.decode(toks))
print()

toks = tokenizer(["Oh sweet midnight", "harbinger of doom"])
print(toks)
print(tokenizer.batch_decode(toks["input_ids"]))
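The two sentences above may encode to different numbers of tokens, which is where the `pad` token comes in: to stack sequences into one tensor, the shorter ones are filled up to a common length. A minimal sketch of the idea, with made-up token ids and a hypothetical pad id of `0`:

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token ids to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (longest - len(seq))
        mask_pad = [0] * len(padding)   # 0 = ignore this position
        mask_seq = [1] * len(seq)       # 1 = real token
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_seq)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(mask_seq + mask_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# made-up token ids, padded on the left
batch = pad_batch([[11, 12, 13], [21, 22, 23, 24, 25]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

With the real tokenizer, passing `padding=True` to the tokenizer call does this for you. Left padding is the usual choice for generation, so that new tokens directly follow the real text.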

In [None]:
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt") # pytorch tensors
print(input_ids)

batched_input_ids = torch.tile(input_ids, (4,1)).to(device) # just copying the tensor 4 times
print(batched_input_ids)

### Generate Text!

In [None]:
# encode context the generation is conditioned on, return pytorch tensors
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")

# copy and place on GPU/MPS/CPU
batched_input_ids = torch.tile(input_ids, (4,1)).to(device)

# same logic as before
generation_config = GenerationConfig.from_pretrained(MODEL_ID)
generation_config.max_length = 100
# suppressing a pesky warning (https://stackoverflow.com/a/71397707)
model.generation_config.pad_token_id = tokenizer.pad_token_id

# generate text until the output length (which includes the context length) reaches 100
output = model.generate(
    # try input_ids as well for a single strand
    batched_input_ids,
    generation_config=generation_config,
)
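Because `max_length` counts the prompt as well, the number of tokens the model can actually add depends on how long the prompt is. The arithmetic, as a toy sketch (`new_token_budget` is a hypothetical helper, not a library function; when both limits are given, `max_new_tokens` takes precedence):

```python
def new_token_budget(prompt_length, max_length=None, max_new_tokens=None):
    """How many tokens can still be generated after the prompt."""
    # max_new_tokens ignores the prompt length and wins if both are set
    if max_new_tokens is not None:
        return max_new_tokens
    # max_length covers prompt + generated tokens
    return max(max_length - prompt_length, 0)

print(new_token_budget(4, max_length=100))
print(new_token_budget(4, max_length=100, max_new_tokens=20))
print(new_token_budget(120, max_length=100))
```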

In [None]:
texts = tokenizer.batch_decode(output, skip_special_tokens=True)

for t in texts:
    wrap_print(t)
    print("-" * 40)

## Chat models & templates

Highly recommended: an [interactive tokenizer web app](https://tiktokenizer.vercel.app/).

### 1. With the `pipeline`

See the [Chat basics page](https://huggingface.co/docs/transformers/en/conversations).

In [None]:
chat = [
    {"role": "system", "content": "You are a helpful science assistant."},
    {"role": "user", "content": "Hey, can you define gravity in one sentence?"}
]

generator = pipeline(
    task="text-generation",
    model=MODEL_ID,
    device=device
)

response = generator(chat, max_new_tokens=200)

In [None]:
# [0]: only one batch
# inside that, ["generated_text"] contains the whole chat
response

In [None]:
print(response[0]["generated_text"][-1]["content"])

In [None]:
# retrieve the whole chat
chat = response[0]["generated_text"]
# add a new user response
chat.append(
    {"role": "user", "content": "Woah! But can it be reconciled with quantum mechanics?"}
)

# generate again
response = generator(chat, max_new_tokens=512)

# print the response
print(response[0]["generated_text"][-1]["content"])
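The append-then-generate pattern above can be wrapped in a small helper. The sketch below assumes the return shape shown earlier (`response[0]["generated_text"]` holds the whole chat); `chat_turn` is a hypothetical name, and `fake_generator` is a stand-in with canned replies so the example runs without a model.

```python
def chat_turn(generator, chat, user_message, **generate_kwargs):
    """Append a user message, call the generator, return (reply, updated chat)."""
    chat = chat + [{"role": "user", "content": user_message}]
    response = generator(chat, **generate_kwargs)
    updated_chat = response[0]["generated_text"]
    return updated_chat[-1]["content"], updated_chat

def fake_generator(chat, **kwargs):
    """Stand-in that mimics the pipeline's return shape with a canned reply."""
    reply = {"role": "assistant", "content": f"(reply to: {chat[-1]['content']})"}
    return [{"generated_text": chat + [reply]}]

chat = [{"role": "system", "content": "You are a helpful science assistant."}]
reply, chat = chat_turn(fake_generator, chat, "Define gravity.", max_new_tokens=200)
print(reply)
reply, chat = chat_turn(fake_generator, chat, "And quantum mechanics?")
print(reply)
print(len(chat))
```

Swapping `fake_generator` for the real `generator` should work the same way, as long as the pipeline is given a list of messages.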

## 2. Manual tokenization & formatting

See the [Chat template page](https://huggingface.co/docs/transformers/en/chat_templating).

LLMs are made to behave like chatbots by fine-tuning them on text that follows a specific format, with markers (called "special tokens") that won't necessarily be seen by the user, but can be detected internally, and signify "this is the bot speaking", "the reply ends here", etc. By detecting these particular tokens (remember that they are just numbers under the hood), it is possible e.g. to stop generation whenever the special token for the end-of-reply is detected.
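Stopping on a special token can be sketched with a toy loop. Everything here is illustrative: `END_ID` is a hypothetical token id, and the "model" simply replays a canned sequence of numbers.

```python
def generate_until(next_token_fn, prompt_ids, end_id, max_new_tokens=20):
    """Append tokens one by one, stopping at end_id or when the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token_fn(ids)
        ids.append(token)
        if token == end_id:
            break
    return ids

END_ID = 99  # hypothetical id for the end-of-reply special token

# a fake "model" that replays a canned sequence of token ids
canned = iter([5, 6, 7, END_ID, 8, 9])
output = generate_until(lambda ids: next(canned), [1, 2, 3], END_ID)
print(output)
```

The tokens after `END_ID` are never requested, which is the behaviour chat interfaces rely on.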

Formatting text in this way is called "templating", and the scheme for such a thing is a "chat template". Each model has its own, even if they are mostly similar. The template for our model can be found [here](https://ollama.com/library/qwen3:0.6b/blobs/ae370d884f10).
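To make the idea concrete, here is a hand-rolled formatter in the ChatML style that this family of models uses. It is a simplified approximation written for illustration (`to_chatml` is a hypothetical helper); the real template also deals with thinking blocks, tools and other cases.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render messages in a simplified ChatML-style layout."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # open an assistant turn for the model to fill in
        text += "<|im_start|>assistant\n"
    return text

example = [
    {"role": "system", "content": "You are a helpful science assistant."},
    {"role": "user", "content": "Define gravity in one sentence."},
]
print(to_chatml(example, add_generation_prompt=True))
```

Comparing this output with what the tokenizer's own `apply_chat_template(..., tokenize=False)` produces is a good way to spot what the simplified version leaves out.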

In [None]:
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate",},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]

### 1. Applying the full template

In order to give the model the right format (models behave best when the input is formatted in the same way as the data they have been trained on), we use `tokenizer.apply_chat_template`. Notice how having `add_generation_prompt=True` adds this at the end of our chat:
```
<|im_end|>
<|im_start|>assistant
```

This changes the text the model reads. Since the only thing the model does is continue the text to the best of its abilities, it will now carry on by generating what the 'assistant' would say.

Note also the 'thinking' bit: 'disable thinking' actually just means adding:
```
<think>

</think>
```
to the text, which means the model will just continue generating what should come after an empty thinking block...
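The effect of the flag can be pictured as a change in the text that opens the assistant turn. A minimal sketch, assuming the layout shown above (the exact whitespace is an assumption, and `generation_prompt` is a hypothetical helper, not part of the library):

```python
def generation_prompt(enable_thinking=True):
    """Text appended to open the assistant turn (simplified, assumed layout)."""
    prompt = "<|im_start|>assistant\n"
    if not enable_thinking:
        # pre-filled empty thinking block: the model continues after it
        prompt += "<think>\n\n</think>\n\n"
    return prompt

print(repr(generation_prompt(enable_thinking=True)))
print(repr(generation_prompt(enable_thinking=False)))
```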

In [None]:
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    # return a dict with "input_ids" and "attention_mask" rather than a bare tensor
    return_dict=True,
    return_tensors="pt",
    # if True, the model will write preliminary steps in a <think> block before answering
    enable_thinking=False
).to(model.device)

print(tokenizer.decode(tokenized_chat["input_ids"][0]))
# the text is surrounded by: '<|im_start|>system' or '<|im_start|>user' and '<|im_end|>'

In [None]:
outputs = model.generate(**tokenized_chat, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

To retrieve only the response, you would have to slice manually:

In [None]:
# the answer starts after the length of the input
answer = outputs[0, tokenized_chat["input_ids"].shape[1] :]
print(tokenizer.decode(answer, skip_special_tokens=True))
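If `enable_thinking` were left on, the sliced reply would also carry the model's preliminary steps. A small sketch for separating the two parts, assuming the decoded text still contains literal `<think>` and `</think>` markers (whether it does depends on the model and on the decoding options); `split_thinking` is a hypothetical helper and the sample reply is hand-written.

```python
def split_thinking(reply):
    """Split a decoded reply into (thinking, answer); thinking is '' if absent."""
    start, end = "<think>", "</think>"
    if end not in reply:
        return "", reply.strip()
    before, _, after = reply.partition(end)
    thinking = before.replace(start, "", 1).strip()
    return thinking, after.strip()

sample = "<think>\nHelicopters are not food.\n</think>\n\nArr, none at all, matey!"
thinking, answer = split_thinking(sample)
print("thinking:", thinking)
print("answer:", answer)
```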

### 2. Leaving the template incomplete

If we don't set `add_generation_prompt`, the tokenizer does not add the directing bit at the end:
```
<|im_start|>assistant
<think>

</think>
```

In [None]:
tokenized_chat_no_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False
)
print(tokenized_chat_no_prompt)

Depending on your use case, you might want that. The model could now continue in a different way than adding an assistant response...

### 3. Unclosed message

We can even go further, and not even have the special 'end' token, meaning that in all likelihood the model will continue the last message. This can be useful if you wanted to put words in the model's mouth (prefilling its response).

In [None]:
unclosed_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    continue_final_message=True
).to(device)

print(tokenizer.decode(unclosed_chat["input_ids"][0]))

In [None]:
outputs = model.generate(**unclosed_chat, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
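To prefill the model's reply rather than continue the user's message, the last message in the list can be a partial assistant message. The sketch below renders such a chat by hand in a simplified ChatML style, leaving the final message unclosed; `to_chatml_open` is a hypothetical helper and the real template may differ in its details.

```python
def to_chatml_open(messages):
    """Render messages ChatML-style, leaving the final message unclosed."""
    text = ""
    for i, message in enumerate(messages):
        text += f"<|im_start|>{message['role']}\n{message['content']}"
        # close every message except the last one
        if i < len(messages) - 1:
            text += "<|im_end|>\n"
    return text

prefilled = [
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    {"role": "assistant", "content": "Arr, matey, the answer be"},
]
print(to_chatml_open(prefilled))
```

Passing a list like `prefilled` to `apply_chat_template` with `continue_final_message=True` should give a comparable unclosed prompt, which the model then completes from "the answer be" onwards.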

---

# Experiments

1. Test everything! Make sure you understand and develop an intuition of:
 - The various parameters: `temperature`, `top_k`, `top_p`;
 - The `tokenizer` object to convert text into tokens and back;
 - How to handle the whole pipeline;
   Also, you can search for different [models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)! (Some of them may exceed your GPU capacity, beware). People have finetuned language models on many types of texts.
2. Can you think of a way to introduce computational thinking into this? Ideas:
  - First, you could explore ways of making things look nicer, instead of just having a list of objects. You could write a print function that knows exactly how to take the model output and print it in a readable way. The specialised Python package with many text functionalities is [textwrap](https://docs.python.org/3/library/textwrap.html) (see also [here](https://www.geeksforgeeks.org/textwrap-text-wrapping-filling-python/));
  - Can you think of ways to construct a writing **loop**? By that, I mean:  
    a. Prepare prompt  
    b. Generate one or more strands of text  
    c. Select text from strands, go back to a.  
    This could simply mean writing a system of helper functions and classes to assist you in the writing...
  - One could imagine all sorts of strange ways to work with text, from programmatically chunking the generated text and scrambling it before using it again as a prompt, to exploring what the model does if you use unreasonable parameters (e.g. a very high or low `temperature`).
  - Also, can you think of ways to work with various strands of text (taking advantage of the fact that a model can generate in parallel)?

3. Something that has already been the subject of a lot of debate and controversy is the exploration of the *biases* of the models (and there are tons!). LLMs are trained mostly on Internet text, top-ranked Reddit posts, etc. (see for instance [this open-source replication](https://github.com/jcpeterson/openwebtext)). Unsurprisingly, the topics and points of view reflect that corner of human activities... Differences between American and Chinese models are also interesting to explore.
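As a possible starting point for the writing loop suggested in item 2, here is a bare skeleton. All names are hypothetical: `fake_generate` stands in for a call to the model and returns made-up continuations, and `pick_longest` stands in for whatever selection you prefer (manual choice, a score, randomness).

```python
import random

def writing_loop(generate, choose, prompt, rounds=3, strands=2):
    """a. prepare prompt, b. generate strands, c. select one, repeat."""
    text = prompt
    for _ in range(rounds):
        candidates = generate(text, strands)   # b. one or more continuations
        text = text + choose(candidates)       # c. the selection feeds the next prompt
    return text

_rng = random.Random(0)  # seeded so that runs are repeatable

def fake_generate(text, n):
    """Stand-in for the model: returns n made-up continuations."""
    words = ["the fox", "a storm", "the sea", "an old map"]
    return [" and then " + _rng.choice(words) for _ in range(n)]

def pick_longest(candidates):
    """Toy selection rule: keep the longest strand."""
    return max(candidates, key=len)

story = writing_loop(fake_generate, pick_longest, "Once upon a time")
print(story)
```

Replacing `fake_generate` with a function that calls the pipeline on a batch of identical prompts would turn this into a real, if basic, co-writing tool.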