# LLMs *generate* text

**A really smart autocomplete, trained on billions of examples!**

What it is actually doing is predicting which word is statistically most likely to come next.

It uses a probability distribution over its vocabulary to decide what comes next.

For example, if you type **"I need a cup of"**, the model might assign:
- P("coffee") = 0.75
- P("tea") = 0.15
- P("water") = 0.05
- P("sunshine") = 0.01

With a probability of 0.75, **coffee** is by far the most likely next word.
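
As a minimal sketch of what "choosing the next word" can look like, the snippet below reuses the four probabilities from the example. They sum to 0.96, so the remaining mass is placed in a hypothetical `<other>` bucket; the helper names `greedy_pick` and `sample_pick` are made up for illustration and are not part of any library.

```python
import random

# Probabilities copied from the example; the "<other>" bucket is an
# assumption that absorbs the mass not listed there.
next_word_probs = {"coffee": 0.75, "tea": 0.15, "water": 0.05, "sunshine": 0.01}
next_word_probs["<other>"] = 1.0 - sum(next_word_probs.values())

def greedy_pick(probs):
    """Return the single most likely word."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print(greedy_pick(next_word_probs))  # coffee
print([sample_pick(next_word_probs, rng) for _ in range(5)])
```

Greedy picking always returns "coffee", while sampling will occasionally return one of the less likely words.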

Formally, this is the conditional probability of the next word given the previous words.
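
Written out, with $w_t$ standing for the word at position $t$ (notation introduced here for illustration only), the modeled quantity and the chain rule that connects it to the probability of a whole sequence are:

```latex
\[
P(w_t \mid w_1, \ldots, w_{t-1}),
\qquad
P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
\]
```

In the cup example, this reads $P(\text{coffee} \mid \text{I need a cup of}) = 0.75$.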

The more context the model sees, the better it gets at choosing relevant words. -- **Is this the Attention Part?**

# Tokenizer
The **tiktoken** library comes with access to a precompiled vocabulary and merge rules.

These are effectively the token database used by models like GPT-3.5 and GPT-4.

### Token Vocabulary
- A list of all valid tokens (words, subwords, symbols, etc.) and their corresponding token IDs.
- For **GPT-4** (the `cl100k_base` encoding), this is roughly **100,000 tokens**.

### Merge Rules (Byte Pair Encoding)
- A ranked list of token pairs to merge, ordered by how common each pair was when the tokenizer was trained.
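
To make the idea of ranked merges concrete, here is a toy sketch in plain Python. It works on characters rather than bytes, and the merge table `toy_merges` is invented for this example (it is not taken from the real GPT-4 vocabulary), so treat it as an illustration of the mechanism rather than a reimplementation of tiktoken.

```python
def bpe_encode(word, merge_ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    parts = list(word)  # start from single characters
    while len(parts) > 1:
        best = None  # (rank, index) of the best mergeable pair found so far
        for i in range(len(parts) - 1):
            rank = merge_ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no adjacent pair appears in the merge table
        _, i = best
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

# Hypothetical merge table: lower rank means the merge is applied earlier.
toy_merges = {
    ("l", "y"): 0, ("m", "p"): 1, ("ly", "mp"): 2, ("lymp", "h"): 3,
    ("a", "t"): 4, ("i", "c"): 5, ("at", "ic"): 6,
}

print(bpe_encode("lymphatic", toy_merges))  # ['lymph', 'atic']
```

With this table the word ends up as two pieces because no rule merges `lymph` with `atic`; a word with no matching rules stays split into single characters.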

In [1]:
import tiktoken

# Load the tokenizer for GPT-4 or GPT-3.5
enc = tiktoken.encoding_for_model("gpt-4")  # or "gpt-3.5-turbo"

text = "Can you explain what a lymphatic system is?"

tokens = enc.encode(text)
token_strings = [enc.decode([token]) for token in tokens]

# Print tokens
for i, (token_id, token_str) in enumerate(zip(tokens, token_strings)):
    print(f"{i+1}: {token_str!r} -> Token ID: {token_id}")


1: 'Can' -> Token ID: 6854
2: ' you' -> Token ID: 499
3: ' explain' -> Token ID: 10552
4: ' what' -> Token ID: 1148
5: ' a' -> Token ID: 264
6: ' lymph' -> Token ID: 43745
7: 'atic' -> Token ID: 780
8: ' system' -> Token ID: 1887
9: ' is' -> Token ID: 374
10: '?' -> Token ID: 30
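
A quick check on the output above, using only the token strings it printed: joining the pieces gives back the original sentence, which shows that the spaces live inside the tokens themselves. This is a sketch for this ASCII sentence; decoding tokens one by one can behave differently when a multi-byte character is split across tokens.

```python
# Token strings copied from the output above.
token_strings = ["Can", " you", " explain", " what", " a",
                 " lymph", "atic", " system", " is", "?"]

rebuilt = "".join(token_strings)
with_leading_space = [t for t in token_strings if t.startswith(" ")]

print(rebuilt)
print(len(rebuilt.split()), "words ->", len(token_strings), "tokens")
print(len(with_leading_space), "tokens start with a space")
```

The sentence has 8 whitespace-separated words but 10 tokens, because "lymphatic" is split in two and "?" is its own token.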


# Leading Spaces

GPT models are trained on token sequences, not word sequences.

Be mindful of spaces, because they **impact** tokenization:
- the same word with and without a leading space can map to different token IDs
- extra whitespace changes the token count, which affects costs and context limits

This is important when prompting:
- Proper spacing matters in prompts. If you drop a leading space, the tokenizer may split a word differently, and the model then sees a less familiar token sequence.
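
Since cost and limits are both counted in tokens, a rough budget check can be sketched as below. The price and context limit are made-up placeholders, and `estimate_cost` and `fits_in_context` are hypothetical helpers; substitute the real numbers for whichever model you use.

```python
# Token IDs copied from the tokenization output earlier in this notebook.
prompt_ids = [6854, 499, 10552, 1148, 264, 43745, 780, 1887, 374, 30]

# Placeholder values for illustration only, not real pricing or limits.
USD_PER_1K_TOKENS = 0.01
CONTEXT_LIMIT = 8192

def estimate_cost(num_tokens, usd_per_1k=USD_PER_1K_TOKENS):
    """Cost in USD for a given number of tokens at a per-1000-token price."""
    return num_tokens / 1000 * usd_per_1k

def fits_in_context(num_prompt_tokens, max_new_tokens, limit=CONTEXT_LIMIT):
    """True if the prompt plus the requested completion fits in the window."""
    return num_prompt_tokens + max_new_tokens <= limit

print(f"{len(prompt_ids)} tokens cost about ${estimate_cost(len(prompt_ids)):.6f}")
print(fits_in_context(len(prompt_ids), max_new_tokens=500))
```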


In [3]:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

# word = "fantastic"
# word = "banana"
# word = "hello"
# word = "  hello"
word = "antidisestablishmentarianism"

print('"' + enc.decode(enc.encode(word)) + '"')
print(enc.encode(word))  # [token_id]
print()
print('"' + enc.decode(enc.encode(" " + word)) + '"')
print(enc.encode(" " + word))   # Might be different or split



"antidisestablishmentarianism"
[519, 85342, 34500, 479, 8997, 2191]

" antidisestablishmentarianism"
[3276, 85342, 34500, 479, 8997, 2191]
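
Comparing the two ID lists from the output above position by position makes the effect of the leading space easy to see. The lists are copied from that output; nothing else is assumed.

```python
# Token IDs copied from the output above.
without_space = [519, 85342, 34500, 479, 8997, 2191]
with_space = [3276, 85342, 34500, 479, 8997, 2191]

# Collect (position, id_without_space, id_with_space) wherever the IDs differ.
differences = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(without_space, with_space))
    if a != b
]
print(differences)  # [(0, 519, 3276)]
```

Only the first token changes: the leading space is absorbed into the first piece of the word, and the remaining pieces are identical.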


## Why This Happens

### Spaces Can Carry Meaning
In real-world text, spaces aren’t always just separators. Consider:
- Indentation in code
- Formatting in poetry or screenplays
- Double spaces after a period (some people still do this!)

### Byte Pair Encoding (BPE) tries to:
- Merge adjacent pieces in rank order, so that frequent character sequences end up as long single tokens.
- Keep common words together with their leading space as single tokens.

The model learns to interpret text as it appears, not a cleaned-up version.
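
One way to picture why the space ends up attached to the following word: before any merges happen, the text is cut into chunks, and each chunk keeps the single space in front of it. The pattern below is a deliberately simplified stand-in written for this sketch; the real `cl100k_base` pattern is more elaborate and handles runs of whitespace differently.

```python
import re

# Simplified, assumed pattern: an optional leading space followed by letters,
# digits, or punctuation, with any other whitespace kept as its own chunk.
CHUNK_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pre_tokenize(text):
    """Split text into chunks, keeping each leading space with its word."""
    return CHUNK_PATTERN.findall(text)

print(pre_tokenize("Can you explain what a lymphatic system is?"))
print(pre_tokenize("hello"), pre_tokenize(" hello"))
```

Because `"hello"` and `" hello"` are different chunks, the merge step sees different input for each and can assign them different token IDs.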

In [3]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
# vocab = enc._special_tokens  # special tokens (private attribute, may change between versions)
all_tokens = enc._mergeable_ranks  # private attribute: maps token bytes -> rank, and the rank is the token ID

# Print the first 1000 tokens
for token_bytes, rank in list(all_tokens.items())[:1000]:
    print(f"{enc.decode([rank])!r} -> Token ID: {rank}")


'!' -> Token ID: 0
'"' -> Token ID: 1
'#' -> Token ID: 2
'$' -> Token ID: 3
'%' -> Token ID: 4
'&' -> Token ID: 5
"'" -> Token ID: 6
'(' -> Token ID: 7
')' -> Token ID: 8
'*' -> Token ID: 9
'+' -> Token ID: 10
',' -> Token ID: 11
'-' -> Token ID: 12
'.' -> Token ID: 13
'/' -> Token ID: 14
'0' -> Token ID: 15
'1' -> Token ID: 16
'2' -> Token ID: 17
'3' -> Token ID: 18
'4' -> Token ID: 19
'5' -> Token ID: 20
'6' -> Token ID: 21
'7' -> Token ID: 22
'8' -> Token ID: 23
'9' -> Token ID: 24
':' -> Token ID: 25
';' -> Token ID: 26
'<' -> Token ID: 27
'=' -> Token ID: 28
'>' -> Token ID: 29
'?' -> Token ID: 30
'@' -> Token ID: 31
'A' -> Token ID: 32
'B' -> Token ID: 33
'C' -> Token ID: 34
'D' -> Token ID: 35
'E' -> Token ID: 36
'F' -> Token ID: 37
'G' -> Token ID: 38
'H' -> Token ID: 39
'I' -> Token ID: 40
'J' -> Token ID: 41
'K' -> Token ID: 42
'L' -> Token ID: 43
'M' -> Token ID: 44
'N' -> Token ID: 45
'O' -> Token ID: 46
'P' -> Token ID: 47
'Q' -> Token ID: 48
'R' -> Token ID: 49
'S' -> Tok