#  What is Byte Pair Encoding (BPE)?

Byte Pair Encoding (BPE) is a subword tokenization algorithm commonly used in NLP models (like GPT, RoBERTa, and others) to convert text into tokens.

Instead of splitting text only by words or characters, BPE breaks words into frequently occurring subword units.
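To make the idea of "frequently occurring subword units" concrete, here is a minimal sketch of the BPE training loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a toy under stated assumptions, not tiktoken's implementation. It works on characters of whitespace-split words, whereas production tokenizers operate on bytes after a regex pre-split. The function names `get_pair_counts`, `merge_pair`, and `train_bpe` are illustrative only.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated toy corpus."""
    # Start with every word split into single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # ties go to the pair seen first
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # learned merges, in order
print(words)   # corpus words expressed in the merged symbols
```

After two merges on this corpus, the frequent stem "low" becomes a single symbol, while the rarer endings "er" and "est" stay split into characters.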

# Why BPE is used

Traditional tokenization problems:

- Word-level tokenization → huge vocabulary, can't handle unknown words

- Character-level tokenization → long sequences, loses meaning

BPE provides a balance:

- Smaller vocabulary

- Handles rare and unseen words

- Efficient and meaningful token representation
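The trade-off in the list above can be seen on a single sentence. The sketch below compares the three granularities. The vocabulary, the subword pieces, and the helper names are hypothetical, and the subword splitter uses greedy longest-match as a simple stand-in. It does not apply BPE merge rules.

```python
def word_tokenize(text, vocab):
    """Word-level: any word missing from the vocabulary collapses to <unk>."""
    return [word if word in vocab else "<unk>" for word in text.split()]


def char_tokenize(text):
    """Character-level: never out of vocabulary, but sequences get long."""
    return list(text)


def subword_tokenize(word, pieces):
    """Greedy longest-match split into known pieces, single characters as fallback."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always succeeds.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


sentence = "the model reads tokenization"
vocab = {"the", "model", "reads", "tokens"}  # hypothetical word vocabulary
pieces = {"token", "ization"}                # hypothetical subword pieces

print(word_tokenize(sentence, vocab))            # the unseen word is lost
print(len(char_tokenize(sentence)))              # one token per character
print(subword_tokenize("tokenization", pieces))  # unseen word, known pieces
```

The word-level tokenizer discards "tokenization" entirely, the character-level one keeps it at the cost of a much longer sequence, and the subword split keeps it in two reusable pieces.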

In [1]:
try:
    import tiktoken
except ImportError:
    # tiktoken is missing: install it from within the notebook, then retry
    !pip install tiktoken
    import tiktoken

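# The Two Lines
tokenizer = tiktoken.encoding_for_model("gpt-4o-mini")

tokenizer = tiktoken.get_encoding("cl100k_base")

# What's the SAME

Both calls return a tiktoken Encoding object with the same encode() and decode() methods.

They do not necessarily return the same encoding, though. In recent tiktoken releases, "gpt-4o-mini" maps to o200k_base, while cl100k_base is the encoding used by models such as gpt-4 and gpt-3.5-turbo.

So for tokenization output:

✅ The two calls produce identical tokens only when the model actually maps to the named encoding, for example encoding_for_model("gpt-4") and get_encoding("cl100k_base")

For "gpt-4o-mini", the token IDs from the two tokenizers generally differ.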

# What's DIFFERENT (Important)

1️⃣ encoding_for_model(...) is model-aware

tiktoken.encoding_for_model("gpt-4o-mini")

- Knows which encoding the model needs

- Automatically picks the right one

- Future-proof: if OpenAI changes a model's encoding later, your code keeps working once the installed tiktoken release includes the new mapping

- Raises KeyError if the installed tiktoken release does not recognize the model name

2️⃣ get_encoding(...) is manual

tiktoken.get_encoding("cl100k_base")

- You hardcode the encoding

- You are responsible for correctness

- Can silently produce the wrong tokens if the hardcoded encoding does not match the model, or if the model → encoding mapping changes

# Best Practice (Rule)

If you know the model name → always use encoding_for_model():

tokenizer = tiktoken.encoding_for_model("gpt-4o-mini")

Use get_encoding() only when:

- you don't know the model

- you're doing offline analysis

- you explicitly want that encoding
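The rule above can be folded into one small helper: ask tiktoken which encoding a model uses, and fall back to an explicitly named encoding only when the model is unknown. This is a sketch under assumptions. `resolve_encoding_name` and its `fallback` default are hypothetical names, and it relies on `tiktoken.encoding_name_for_model`, which is available in recent tiktoken releases and only looks up the mapping without loading the encoding.

```python
def resolve_encoding_name(model=None, fallback="cl100k_base"):
    """Return the encoding name tiktoken maps `model` to, else `fallback`."""
    if model is None:
        # No model name available: use the explicitly chosen encoding.
        return fallback
    try:
        # Imported lazily so the helper still returns the fallback
        # in an environment where tiktoken is not installed.
        import tiktoken
        return tiktoken.encoding_name_for_model(model)
    except (ImportError, KeyError):
        # KeyError: this tiktoken release does not recognize the model name.
        return fallback


print(resolve_encoding_name("gpt-4o-mini"))
print(resolve_encoding_name("hypothetical-model-xyz"))
```

The returned name can then be passed to tiktoken.get_encoding(...). Whether a silent fallback is acceptable depends on the use case. For anything where token counts must match the model, letting the KeyError surface is the safer design.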

# Simple Analogy 🧠

- encoding_for_model() = automatic transmission

- get_encoding() = manual transmission

- Both drive the car; one is safer.

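In [20]:
# Explicit encoding, chosen by name
tokenizer1 = tiktoken.get_encoding("cl100k_base")
# Encoding looked up from the model name (o200k_base for gpt-4o models in recent tiktoken releases)
tokenizer2 = tiktoken.encoding_for_model("gpt-4o-mini")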


In [4]:
text = """
  I had always thought Jack Gisburn rather a cheap genius--though a

good fellow enough--so it was no great surprise to me to hear

that, in the height of his glory, he had dropped his painting,

married a rich widow, and established himself in a villa on the

Riviera. (Though I rather thought it would have been Rome or

Florence.)  <|endoftask|>

"The height of his glory"--that was what the women called it. I

can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring

his unaccountable abdication. "Of course it's going to send the

value of my picture 'way up; but I don't think of that, Mr.

Rickham--the loss to Arrt is all I think of." The word, on Mrs.

Thwing's lips, multiplied its RS as though they were reflected in

an endless vista of mirrors. And it was not only the Mrs. Thwings

who mourned. Had not the exquisite Hermia Croft, at the last

Grafton Gallery show, stopped me before Gisburn's "Moon-dancers"

to say, with tears in her eyes: "We shall not look upon

its like again"?

"""

In [14]:
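# Join the lines into one string. Replacing "\n" with "" (not " ") glues the last
# word of each line to the first word of the next, as the decoded output shows.
text = text.replace("\n", "").strip()

# Remove any literal double backslashes (none occur in this excerpt)
text = text.replace(r"\\", "")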
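Because the cell above deletes newlines outright, words at line boundaries are fused ("a" + "good" becomes "agood"), which changes how they tokenize. If that is not intended, one alternative is to collapse every run of whitespace into a single space. The helper name `normalize_whitespace` is hypothetical, and this is a sketch rather than the notebook's original method. It also collapses repeated spaces inside a line.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    # str.split() with no arguments splits on any whitespace run
    # and drops leading and trailing whitespace.
    return " ".join(text.split())


raw = "genius--though a\n\ngood fellow"

glued = raw.replace("\n", "").strip()  # behavior of the cell above
spaced = normalize_whitespace(raw)     # keeps the word boundary

print(repr(glued))
print(repr(spaced))
```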



In [22]:
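# "<|endoftask|>" is not a special token registered in either encoding (the built-in
# end-of-text marker is "<|endoftext|>"), so it is encoded as ordinary text here.
token_ids1 = tokenizer1.encode(text, allowed_special={"<|endoftask|>"})
token_ids2 = tokenizer2.encode(text, allowed_special={"<|endoftask|>"})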

In [24]:
tokenizer1.decode(token_ids1)


'I had always thought Jack Gisburn rather a cheap genius--though agood fellow enough--so it was no great surprise to me to hearthat, in the height of his glory, he had dropped his painting,married a rich widow, and established himself in a villa on theRiviera. (Though I rather thought it would have been Rome orFlorence.)  <|endoftask|>"The height of his glory"--that was what the women called it. Ican hear Mrs. Gideon Thwing--his last Chicago sitter--deploringhis unaccountable abdication. "Of course it\'s going to send thevalue of my picture \'way up; but I don\'t think of that, Mr.Rickham--the loss to Arrt is all I think of." The word, on Mrs.Thwing\'s lips, multiplied its RS as though they were reflected inan endless vista of mirrors. And it was not only the Mrs. Thwingswho mourned. Had not the exquisite Hermia Croft, at the lastGrafton Gallery show, stopped me before Gisburn\'s "Moon-dancers"to say, with tears in her eyes: "We shall not look uponits like again"?'

In [23]:
tokenizer2.decode(token_ids2)

'I had always thought Jack Gisburn rather a cheap genius--though agood fellow enough--so it was no great surprise to me to hearthat, in the height of his glory, he had dropped his painting,married a rich widow, and established himself in a villa on theRiviera. (Though I rather thought it would have been Rome orFlorence.)  <|endoftask|>"The height of his glory"--that was what the women called it. Ican hear Mrs. Gideon Thwing--his last Chicago sitter--deploringhis unaccountable abdication. "Of course it\'s going to send thevalue of my picture \'way up; but I don\'t think of that, Mr.Rickham--the loss to Arrt is all I think of." The word, on Mrs.Thwing\'s lips, multiplied its RS as though they were reflected inan endless vista of mirrors. And it was not only the Mrs. Thwingswho mourned. Had not the exquisite Hermia Croft, at the lastGrafton Gallery show, stopped me before Gisburn\'s "Moon-dancers"to say, with tears in her eyes: "We shall not look uponits like again"?'