<style>
*{
    color: #D6DAC8;
    font-size: 14px;
    font-weight: 550;
}

h1, h2, h3, h4, h5, h6 {
    font-weight: 1000;
    color:white;
}

.para{
    color: #DA6C6C;
    font-weight: 950;
}

b, .name_list {
    font-weight: 950;
}
img {
    display: block;
    margin: 0 auto;
    width: 40em;
}
</style>

<h2>CHAPTER 2: Tokens and Embeddings</h2>
<p><b>Tokens and embeddings</b> are two of the central concepts of using large language models (LLMs). In this chapter, we look more closely at what tokens are and at the tokenization methods used to power LLMs. We will then dive into the famous <b>word2vec embedding</b> method that preceded modern-day LLMs and see how it extends the concept of token embeddings to build commercial recommendation systems that power a lot of the apps you use.</p>
<img src="../image/tokens_and_embeddings.jpg">
<p>A model does not produce its output response all at once; it generates one token at a time. But tokens aren't only the output of a model, they're also the way in which the model sees its inputs. Before a text prompt is presented to the language model, it first goes through a <b>tokenizer</b> that breaks it into tokens. Here's an example that shows the tokenizer of <b>GPT-3.5 and GPT-4</b>.</p>
<img src="../image/tokenizer_example.jpg">
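The "one token at a time" behavior can be sketched with a toy loop. This is only an illustration: `NEXT_TOKEN` is a hypothetical lookup table standing in for a real language model, and `generate` is a made-up helper, not a library function.

```python
# Hypothetical next-token table standing in for a language model:
# given the last token, it "predicts" the next one.
NEXT_TOKEN = {
    "<s>": "Tokens",
    "Tokens": "are",
    "are": "pieces",
    "pieces": "of",
    "of": "text",
    "text": "</s>",
}

def generate(start="<s>", max_new_tokens=10):
    """Append one token per step until the end token or the step limit."""
    tokens = [start]
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "</s>")
        if next_token == "</s>":  # end-of-text token stops generation
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token

print(generate())
print(generate(max_new_tokens=2))
```

A real model replaces the table lookup with a forward pass over all tokens produced so far, but the outer loop has the same shape.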


<h2 style="font-weight: 1000">Run an example</h2>

In [1]:
!pip install -q transformers numpy pandas
!pip install -q torch torchvision torchaudio

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [12]:
prompt = """
Write an example of Part 1 IELTS writing band 9.0. <assistant>
"""

input_ids = tokenizer(prompt,
                     return_tensors="pt").input_ids.to("cuda")

generated_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    use_cache=False
)

text = tokenizer.decode(generated_output[0])
text

"\nWrite an example of Part 1 IELTS writing band 9.0. <assistant>\nIn the wake of the global pandemic, the importance of digital literacy has become more pronounced than ever. As we navigate through an increasingly digital world, the ability to effectively use technology is no longer a luxury but a necessity. This essay will explore the significance of digital literacy in today's society, its impact on various aspects of life, and the need for its integration into educational curricula.\n\nDigital literacy refers to the ability to use digital technology, communication tools, and networks to access, manage, integrate, evaluate, and create information. It encompasses a wide range of skills, including basic computer skills, internet navigation, online communication, and the ability to critically evaluate digital content. In a world where information is readily available at our fingertips, the ability to discern credible sources from unreliable ones is crucial.\n\nThe importance of digital

In [13]:
input_ids

tensor([[29871,    13,  6113,   385,  1342,   310,  3455, 29871, 29896,  7159,
          5850, 29903,  5007,  3719, 29871, 29929, 29889, 29900, 29889,   529,
           465, 22137, 29958,    13]], device='cuda:0')

To view each token in **input_ids**, we iterate over the IDs and use the tokenizer to decode each one.

In [14]:
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))




Write
an
example
of
Part

1
IE
LT
S
writing
band

9
.
0
.
<
ass
istant
>





<p>
    Thanks to this example, we can see how the tokenizer broke down our input prompt. Notice the following:
    <ul>
        <li>In some prompts, the first token is <b>&lt;s&gt;</b>, a special token indicating the beginning of the text.</li>
        <li>Some tokens are complete words.</li>
        <li>Some tokens are parts of words.</li>
        <li>Punctuation characters are their own tokens.</li>
    </ul>
    <p>One thing you should notice is that the space character does not have its own token. Instead, the tokenizer stores word boundaries in a special hidden character attached to the tokens themselves. Some tokenizers mark partial tokens (like "LT" and "S" in "IELTS") to show that they are connected to the token that precedes them, and assume that every unmarked token has a space before it. Others, including SentencePiece-style tokenizers, mark the tokens that follow a space instead.</p>
</p>
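As a minimal sketch of how a hidden marker can carry the spaces, the snippet below uses the convention of SentencePiece-style tokenizers, where "▁" (U+2581) is attached to pieces that follow a space. The pieces are hand-written for illustration and are not actual tokenizer output, and `join_pieces` is a hypothetical helper.

```python
# "▁" (U+2581) marks a piece that is preceded by a space.
WORD_START = "\u2581"

def join_pieces(pieces):
    """Rebuild text: concatenate the pieces, then turn each marker into a space."""
    return "".join(pieces).replace(WORD_START, " ").lstrip(" ")

# Hand-written pieces (an assumption, not real tokenizer output).
pieces = ["▁Write", "▁an", "▁example", "▁of", "▁IE", "LT", "S", "▁writing"]
print(join_pieces(pieces))
```

Because "LT" and "S" carry no marker, they are glued onto "IE" and the word "IELTS" comes back without inner spaces.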


In [15]:
generated_output[0]

tensor([29871,    13,  6113,   385,  1342,   310,  3455, 29871, 29896,  7159,
         5850, 29903,  5007,  3719, 29871, 29929, 29889, 29900, 29889,   529,
          465, 22137, 29958,    13,   797,   278,   281,  1296,   310,   278,
         5534,  7243, 24552, 29892,   278, 13500,   310, 13436,  4631,  4135,
          756,  4953,   901, 11504, 20979,  1135,  3926, 29889,  1094,   591,
        23624,  1549,   385, 10231,   368, 13436,  3186, 29892,   278, 11509,
          304, 17583,   671, 15483,   338,   694,  5520,   263, 21684,  2857,
          541,   263, 24316, 29889,   910,  3686,   388,   674, 26987,   278,
        26002,   310, 13436,  4631,  4135,   297,  9826, 29915, 29879, 12459,
        29892,   967, 10879,   373,  5164, 21420,   310,  2834, 29892,   322,
          278,   817,   363,   967, 13465,   964, 28976, 16256,   293,  2497,
        29889,    13,    13, 27103,  4631,  4135, 14637,   304,   278, 11509,
          304,   671, 13436, 15483, 29892, 12084,  8492, 29892, 


<h2>How does the Tokenizer Break Down Text?</h2>
<p>There are three factors that dictate how a tokenizer breaks down an input prompt.</p>
<ul>
    <li>First, at model design time, the creator of the model chooses a tokenization method. Popular methods include <b>byte pair encoding (BPE)</b> (widely used by GPT models) and <b>WordPiece</b> (used by BERT). Both aim to find an efficient set of tokens to represent a text dataset, but they arrive at it in different ways.</li>
    <li>Second, after choosing the method, we need to make a number of tokenizer design choices like vocabulary size and what special tokens to use.</li>
    <li>Third, the tokenizer needs to be trained on a specific dataset to establish the best vocabulary it can use to represent that dataset. Even if we set the same methods and parameters, a tokenizer trained on an English text dataset will be different from another trained on a code dataset or a multilingual text dataset.</li>
</ul>
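To make the third factor concrete, here is a toy byte pair encoding trainer in pure Python. It is a simplified sketch, not the implementation used by any real tokenizer library: `train_bpe` and the two tiny corpora are hypothetical, and real trainers work on bytes or pre-tokenized text with far larger datasets. The same method and the same number of merges produce different vocabularies when the training data changes.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first-seen pair wins
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

english = ["low", "lower", "lowest", "slow", "slowly"]
code = ["def", "elif", "else", "self", "del"]
print("English merges:", train_bpe(english, 3))
print("Code merges:   ", train_bpe(code, 3))
```

The `num_merges` argument plays the role of the vocabulary-size design choice: each merge adds one new token to the vocabulary.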
<h2>Word Versus Subword Versus Character Versus Byte Tokens</h2>
<p>The tokenization scheme we just discussed is called <b>subword tokenization</b>. It's the most commonly used tokenization scheme but not the only one. The four notable ways to tokenize are shown below:</p>
<img src="../image/four_ways_tokenize.jpg">
<p>
    <ul>
        <li><b>Word tokens</b>: This approach was common with earlier methods like <b>word2vec</b> but is being used less and less in NLP. Its usefulness, however, led to it being adopted outside of NLP for use cases such as recommendation systems. One challenge with this method is that the tokenizer may be unable to deal with new words that enter the dataset after the tokenizer was trained. It also results in a vocabulary that has a lot of tokens with minimal differences between them (e.g., apology, apologize, apologetic, apologist). This latter challenge is resolved by <b>subword tokenization</b>, which has a token for <em>apolog</em> and then suffix tokens (e.g., -y, -ize, -etic, -ist) that are shared with many other words, resulting in a more expressive vocabulary.</li>
        <li><b>Subword tokens</b>: This method's vocabulary contains full and partial words. In addition to the vocabulary expressivity mentioned earlier, another benefit of the approach is its ability to represent new words by breaking them down into smaller pieces, down to single characters, which tend to be a part of the vocabulary.</li>
        <li><b>Character tokens</b>: This method can deal successfully with new words because it has the raw letters to fall back on. While that makes the representation easier to tokenize, it makes the modeling more difficult. Where a model with subword tokenization can represent "play" as one token, a model using character-level tokens needs to model the information to spell out <b>"p-l-a-y"</b> in addition to modeling the rest of the sequence. Subword tokens also have an advantage over character tokens in the ability to fit more text within the limited context length of a Transformer model.</li>
        <li><b>Byte tokens</b>: This method breaks text down into the individual bytes used to represent Unicode characters. Papers like <a href="https://arxiv.org/abs/2103.06874"><b style="color: red">"CANINE: Pre-training an efficient tokenization-free encoder for language representation"</b></a> outline methods like this, which are also called <b>tokenization-free encoding</b>. Other works like <a href="https://arxiv.org/abs/2105.13626"><b style="color:red">"ByT5: Towards a token-free future with pre-trained byte-to-byte models"</b></a> show that this can be a competitive method, especially in multilingual scenarios.</li>
    </ul>
</p>
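The four granularities can be compared side by side in a few lines of plain Python. This is a rough sketch under stated assumptions: `toy_vocab` is a tiny hand-made vocabulary and `subword_tokenize` is a hypothetical greedy longest-match helper, whereas real subword tokenizers learn their vocabulary from data and apply their own matching rules.

```python
text = "apologetic apologist"

# Word tokens: split on whitespace.
word_tokens = text.split()

# Subword tokens: greedy longest-match against a hand-made vocabulary
# (hypothetical; a real vocabulary is learned from a dataset).
toy_vocab = {"apolog", "etic", "ist", "y", "ize"}

def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

subword_tokens = [t for w in word_tokens for t in subword_tokenize(w, toy_vocab)]

# Character tokens: one token per character (spaces included).
char_tokens = list(text)

# Byte tokens: one token per UTF-8 byte, shown as integers 0-255.
byte_tokens = list(text.encode("utf-8"))

for name, toks in [("word", word_tokens), ("subword", subword_tokens),
                   ("character", char_tokens), ("byte", byte_tokens)]:
    print(f"{name:>9}: {len(toks):2d} tokens")
```

For plain ASCII text the character and byte counts match; they diverge for characters outside ASCII, where one character is encoded as several UTF-8 bytes.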

In [2]:
text = """
English and CAPITALIZATION
show_tokens False None elif == >= else: two tabs:" " Three tabs:
" "
12.0*50=600
"""

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
    ]

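def show_tokens(sentence, tokenizer_name):
    """Print each token of `sentence` on a colored background."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    for idx, t in enumerate(token_ids):
        # Cycle through the palette; 48;2;R;G;B sets a 24-bit background color.
        color = colors_list[idx % len(colors_list)]
        print(
            f'\x1b[0;30;48;2;{color}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )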


<img src="../image/table_tokenize_1.jpg" width="400">  <img src="../image/table_tokenize_2.jpg" width="400"> 
