## Background

Lots of people experience fiddly behavior when using LLMs.  For example:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Unironically I found this to be very helpful when prompting LLMs. Giving them spaces and new lines <a href="https://t.co/vVuxcCuDzB">pic.twitter.com/vVuxcCuDzB</a></p>&mdash; anton (@abacaj) <a href="https://twitter.com/abacaj/status/1728190808191537604?ref_src=twsrc%5Etfw">November 24, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

If you aren't careful, these issues can be very hard to debug, because tokenizers behave in subtle ways that are not always visible when you look at the text.

## Example

The example below shows how a prompt can silently drift between training time and inference time.


In [None]:
#|echo: false
from transformers import AutoTokenizer
from functools import partial
model_id = 'Open-Orca/Mistral-7B-OpenOrca'
tok = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
enc = partial(tok.encode, add_special_tokens=False)
dec = partial(tok.decode)

#### Many frameworks do prompt construction by concatenating tokens

Popular frameworks like [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) construct prompts by concatenating tokens instead of strings.[^1]  To check what the prompt template looks like, it is reasonable to decode the training data.

[^1]: This is for good reason, as masking must also be done at the token level.

For example, a prompt may be constructed like this:

In [None]:
axolotl = enc('Ok\n') + enc('<|im_start|>')
print(dec(axolotl))

Ok
<|im_start|>
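
The footnote's point about masking can be made concrete with a minimal sketch. This is not axolotl's actual code: `build_example` and the token IDs are hypothetical, chosen only for illustration. The one real convention used is `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The mask boundary is a token position, which is why the pieces are tokenized separately and then concatenated.

```python
from typing import List, Tuple

# Default ignore_index of PyTorch's CrossEntropyLoss: positions with this
# label contribute nothing to the loss.
IGNORE_INDEX = -100


def build_example(
    prompt_ids: List[int], completion_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Concatenate token IDs and mask the prompt so only the completion is trained on."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels


# Hypothetical token IDs, not taken from any real tokenizer.
prompt_ids = [101, 102, 103]
completion_ids = [201, 202]

input_ids, labels = build_example(prompt_ids, completion_ids)
print(input_ids)  # [101, 102, 103, 201, 202]
print(labels)     # [-100, -100, -100, 201, 202]
```

If the prompt and completion were joined as strings and tokenized once, the boundary between them would have to be recovered afterwards, and it may not fall where you expect.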


#### Let's say you have an inference server

It's common for inference servers to assemble the prompt for you.  The below looks like it should be fine, right?

In [None]:
def inf_server(inp): 
    return f'{inp}\n<|im_start|>'

srv = inf_server('Ok')
print(srv)

Ok
<|im_start|>


#### Drift between your server and the way the model is trained

Wrong!  Notice the difference between the token IDs of your server's prompt and those of the training data: the newline token (`13`) is missing from the server's version.  This kind of subtle bug can be hard to track down.

In [None]:
print(f'axolotl training data:  {axolotl}')
print(f"your server's encoding: {enc(srv)}")

axolotl training data:  [6504, 13, 32001]
your server's encoding: [6504, 32001]
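
With longer prompts, a mismatch like this is hard to spot by eye. A small helper can point at the first position where two token ID sequences disagree. This is only a sketch: `first_divergence` is a hypothetical name, and the two lists are copied from the output above.

```python
from itertools import zip_longest
from typing import List, Optional


def first_divergence(a: List[int], b: List[int]) -> Optional[int]:
    """Return the first index where two token ID sequences differ, or None if they match."""
    # zip_longest pads the shorter sequence with None, so a length
    # mismatch is reported at the index where the shorter one ends.
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None


train_ids = [6504, 13, 32001]  # token IDs from the training data
server_ids = [6504, 32001]     # token IDs from the server's prompt

print(first_divergence(train_ids, server_ids))  # 1
```

Index `1` is where the training data has token `13` and the server's prompt has already moved on to `32001`.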


## Solutions

1. Decode your inference data right before the forward pass.  In this example, the decoded text is missing the newline, which is one way to tell that something fishy is going on.

In [None]:
dec(enc(srv))

'Ok<|im_start|>'
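
This check can be wrapped in a small round-trip helper that takes any `encode`/`decode` pair. The sketch below is self-contained, so it uses a toy tokenizer (`toy_encode`, `toy_decode`) that only reproduces the symptom shown above, namely whitespace disappearing in front of `<|im_start|>`. It is an assumption for illustration, not the real tokenizer's logic. In practice you would pass in your own functions, such as `enc` and `dec`.

```python
import re
from typing import Callable, List, Tuple

SPECIAL = "<|im_start|>"
SPECIAL_ID = 32001  # ID of <|im_start|> as printed earlier; the rest is a toy


def toy_encode(text: str) -> List[int]:
    """Toy tokenizer: one ID per character; whitespace right before SPECIAL is dropped."""
    ids: List[int] = []
    # The capturing group keeps the special token as its own list element.
    parts = re.split(f"({re.escape(SPECIAL)})", text)
    for i, part in enumerate(parts):
        if part == SPECIAL:
            ids.append(SPECIAL_ID)
            continue
        if i + 1 < len(parts) and parts[i + 1] == SPECIAL:
            part = part.rstrip()  # reproduces the missing newline
        ids.extend(ord(ch) for ch in part)
    return ids


def toy_decode(ids: List[int]) -> str:
    """Inverse of toy_encode for ASCII text plus the special token."""
    return "".join(SPECIAL if i == SPECIAL_ID else chr(i) for i in ids)


def round_trip(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> Tuple[bool, str]:
    """Encode then decode `text`; report whether it survived unchanged."""
    decoded = decode(encode(text))
    return decoded == text, decoded


ok, decoded = round_trip("Ok\n<|im_start|>", toy_encode, toy_decode)
print(ok, repr(decoded))  # False 'Ok<|im_start|>'
```

With a real tokenizer, a round trip can also differ for benign reasons, so treat a mismatch as a prompt to look closer rather than as proof of a bug.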

In [None]:
#|echo: false
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id)
