# Understanding LLMs

In [1]:
from transformers import AutoTokenizer
import torch
import transformers

In [2]:
print(transformers.__version__)

4.41.2


In [3]:
print(torch.__version__)

2.3.0+cu121


## Tokenizing Text

### Why Tokenization?

Tokenization transforms text into a format that models can comprehend. There are several methods for tokenizing text, each with its pros and cons:

1. **Character-Based Tokenization**:
   - **Method**: Splitting the text into individual characters and assigning each a unique numerical ID.
   - **Pros**: Works well for languages like Chinese, where each character carries significant information.
   - **Cons**: Creates a small vocabulary but requires many tokens to represent a string. This can affect performance and accuracy since individual characters carry minimal information.

2. **Word-Based Tokenization**:
   - **Method**: Splitting the text into individual words.
   - **Pros**: Captures more meaning per token.
   - **Cons**: Results in a large vocabulary with many unknown words (e.g., typos, slang) and different word forms (e.g., "run", "runs", "running").
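The two approaches above can be sketched in a few lines of plain Python. This is only a toy illustration: the helper names (`split_words`, `encode_words`), the `<unk>` placeholder, and the regular expression are assumptions made for this example, not part of any library.

```python
import re

text = "Preposterous, I'm flabbergasted!"

# Character-based: every character is a token; IDs come from the sorted set of characters.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

# Word-based: split into words and punctuation marks (hypothetical splitting rule).
def split_words(s):
    return re.findall(r"\w+|[^\w\s]", s)

# Build the vocabulary from the text; ID 0 is reserved for unknown words.
word_vocab = {"<unk>": 0}
for w in split_words(text):
    word_vocab.setdefault(w, len(word_vocab))

def encode_words(s):
    return [word_vocab.get(w, word_vocab["<unk>"]) for w in split_words(s)]

print(len(char_ids), "character tokens,", len(char_vocab), "vocabulary entries")
print(len(encode_words(text)), "word tokens,", len(word_vocab), "vocabulary entries")
# A typo is not in the word vocabulary, so it falls back to the <unk> ID.
print(encode_words("I'm flabergasted!"))
```

The character version needs many more tokens for the same string, while the word version cannot represent the misspelled word at all.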

### Modern Tokenization Strategies

Modern approaches balance character and word tokenization by splitting text into subwords. These methods effectively capture both the structure and meaning of the text while efficiently handling unknown words and different forms of the same word.

- **Subword Tokenization**:
  - **Method**: Frequently occurring words or subwords are assigned a single token, while complex words are split into multiple tokens, each representing a meaningful part of the word.
  - **Example**: "flabbergasted" could be split into:
              
              tensor(781) 	:  fl
              tensor(397) 	: ab
              tensor(3900) 	: berg
              tensor(8992) 	: asted

Different models use different tokenizers, each with its unique strategy and vocabulary size. Let's see how the GPT-2 tokenizer handles a sentence.

### Example with GPT-2 Tokenizer

We'll use the GPT-2 tokenizer to tokenize the sentence shown below. This involves converting the text into tokens and then decoding those tokens back into text.

In [24]:
# The tokenizers are abstracted by AutoTokenizer
from transformers import AutoTokenizer

# Load the GPT-2 tokenizer:
# we pass the model name and its associated tokenizer is loaded
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# return_tensors="pt": return PyTorch tensors
# input_ids: we go from text to token IDs; later, the model converts the IDs into vectors
input_ids = tokenizer("Preposterous, I'm flabbergasted!", return_tensors="pt").input_ids
print(input_ids)
# Output: tensor([[37534, 6197, 516, 11, 314, 1101, 781, 397, 3900, 8992, 0]])

# Decode the tokens back into text
for t in input_ids[0]:
    print(t, "\t:", tokenizer.decode(t))
# tensor(37534) 	: Prep
# tensor(6197) 	: oster
# tensor(516) 	: ous
# tensor(11) 	: ,
# ...

tensor([[37534,  6197,   516,    11,   314,  1101,   781,   397,  3900,  8992,
             0]])
tensor(37534) 	: Prep
tensor(6197) 	: oster
tensor(516) 	: ous
tensor(11) 	: ,
tensor(314) 	:  I
tensor(1101) 	: 'm
tensor(781) 	:  fl
tensor(397) 	: ab
tensor(3900) 	: berg
tensor(8992) 	: asted
tensor(0) 	: !


In [5]:
from transformers import AutoTokenizer

# Load the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize the input text
input_ids = tokenizer("I skip across the", return_tensors="pt").input_ids
print(input_ids)
# Output: tensor([[40, 14267, 1973, 262]])

# Decode the tokens back into text
for t in input_ids[0]:
    print(t, "\t:", tokenizer.decode(t))


tensor([[   40, 14267,  1973,   262]])
tensor(40) 	: I
tensor(14267) 	:  skip
tensor(1973) 	:  across
tensor(262) 	:  the


As shown, the tokenizer splits the input string into a series of tokens, each assigned a unique ID. Most words are represented by a single token, but longer words (or even shorter ones!) can be split into multiple tokens. Play around with this!

### Training Tokenizers vs. Training Models

It's important to note that training a tokenizer differs from training a model. Training a model is a stochastic (non-deterministic) process, while training a tokenizer is a deterministic, statistical one: given the same dataset and settings, the algorithm identifies the same subwords every time. Which subwords end up in the vocabulary depends on the dataset and on the design decisions of the tokenization algorithm.

Popular subword tokenization approaches include Byte-level BPE (used in GPT-2), WordPiece, and SentencePiece. Each method has its advantages and is chosen based on the specific needs of the model and dataset.
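To make the "deterministic and statistical" point concrete, here is a minimal sketch of BPE-style training on a tiny word list. It starts from characters rather than bytes and breaks ties alphabetically; both are simplifying assumptions for this example, and `train_bpe` is a hypothetical helper, not the procedure used to build the GPT-2 vocabulary.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn merge rules from a list of words (toy BPE starting from characters)."""
    # Each word is a tuple of symbols; the corpus maps word -> frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken alphabetically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

words = ["run", "runs", "running", "runner", "sun", "sunny"]
merges = train_bpe(words, num_merges=3)
print(merges)
```

Running the function twice on the same word list yields the same merge rules, since there is no random initialization or sampling involved.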

By understanding tokenization, we can better appreciate how models process text and generate meaningful outputs.

## Predicting Probabilities


### Loading the Model

First, we need to load the GPT-2 model. Here's how you do it:

In [15]:
# As with AutoTokenizer, we can use AutoModelForCausalLM
# (type AutoModelFor + TAB to list the available variants)
# to load the generative LLM we want.
# Note that the model and the tokenizer must match,
# which we ensure by passing the same model string, "gpt2".
# AutoModelForCausalLM is a generic model class that
# will be instantiated as one of the model classes of the library
# (with a causal language modeling head).
# Using the Auto* classes for GPT2
# we really load GPT2Tokenizer and GPT2LMHeadModel
from transformers import AutoModelForCausalLM

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")


In [25]:
# We can get all the config and interfaces of the model
help(gpt2)

Help on GPT2LMHeadModel in module transformers.models.gpt2.modeling_gpt2 object:

class GPT2LMHeadModel(GPT2PreTrainedModel)
 |  GPT2LMHeadModel(config)
 |  
 |  The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input
 |  embeddings).
 |  
 |  
 |  This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
 |  library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
 |  etc.)
 |  
 |  This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
 |  Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
 |  and behavior.
 |  
 |  Parameters:
 |      config ([`GPT2Config`]): Model configuration class with all the parameters of the model.
 |          Initializing with a config file does not load the weights associated wit

### Understanding the Tools

We used `AutoTokenizer` and `AutoModelForCausalLM` from the `transformers` library. This library supports hundreds of models and their corresponding tokenizers. Instead of memorizing the name of each tokenizer and model class, we use `AutoTokenizer` and `AutoModelFor*`.

For example, we use `AutoModelForCausalLM` for the causal language modeling task. The `transformers` library automatically selects the appropriate default classes based on the model's configuration. For GPT-2, this means using `GPT2Tokenizer` (or its fast variant, `GPT2TokenizerFast`) and `GPT2LMHeadModel` behind the scenes.

### Feeding Input to the Model

If we feed the tokenized sentence from the previous section into the model, we get a result back with 50,257 values for each token in the input string. Here’s how we do it:

In [17]:
outputs = gpt2(input_ids)
outputs.logits.shape
# torch.Size([1, 4, 50257])
# 1: batch size (we passed a single sequence)
# 4: sequence length, i.e., number of tokens in "I skip across the"
# 50257: vocabulary size
# logits: raw outputs from the model; we can convert them to probabilities
# IMPORTANT: the output has one position per input token, but each position predicts the NEXT token:
# - the logits at the first position predict the 2nd token of the input sequence
# - the logits at the last position predict the NEW token!

torch.Size([1, 4, 50257])

- **First Dimension**: Batch size (1 because we only ran a single sequence through the model).
- **Second Dimension**: Sequence length (number of tokens in the input sequence, 4 in our case).
- **Third Dimension**: Vocabulary size (50,257 tokens for GPT-2).

These are the raw model outputs, or logits, corresponding to the tokens in the vocabulary. For each input token, the model predicts the likelihood of each token in the vocabulary continuing the sequence. Higher logits mean the model considers that token a more likely continuation.

### Finding the Most Likely Next Token

Logits are the raw outputs of the model, essentially a list of numbers like [0.1, 0.2, 0.01, ...]. We can use these logits to select the most likely token to continue the sequence.

To predict the next token, we take the logits at the last position, which are conditioned on the entire input sentence, and find the index of the highest value using the `argmax()` method:

In [18]:
final_logits = gpt2(input_ids).logits[0, -1] # The last set of logits
final_logits.argmax() # tensor(1627)

tensor(1627)

In [19]:
tokenizer.decode(final_logits.argmax()) # We decode the most probable NEW token

' line'

Notice that the predicted token, " line", begins with a whitespace, which marks the start of a new word. The prediction is plausible given the input sentence: "I skip across the" calls for a noun, and a line is something one can skip across. The model learns to pay attention to other tokens using an algorithm called self-attention, the fundamental building block of transformers. Self-attention allows the model to determine the significance of each token in contributing to the overall meaning of the phrase.

### Note on Transformer Models

Transformer models contain multiple attention layers, each specializing in different aspects of the input. Unlike heuristic systems, these features are learned during training rather than being predefined.

By understanding how GPT-2 predicts probabilities and generates text, we can better appreciate the power and intricacy of transformer-based language models.



## Exploring Other Token Candidates

Now, let's explore which other tokens were potential candidates by selecting the top 10 values. This will give us insight into the model's thought process and the alternatives it considered.

First, we'll use PyTorch to get the top 10 logits:

In [20]:
import torch

# We can also check the top 10 NEW tokens
top10_logits = torch.topk(final_logits, 10)
for index in top10_logits.indices:
    print(tokenizer.decode(index))
    # line
    # street
    # river
    # ...

 line
 street
 river
 room
 pond
 bridge
 border
 country
 road
 board


### Converting Logits to Probabilities

Logits are raw model outputs that don't represent probabilities. To understand the model's confidence in each prediction, we need to convert these logits into probabilities. This is done by exponentiating each logit and dividing it by the sum of all the exponentiated logits, so that every value lies between 0 and 1 and their sum equals 1. This process is called the `softmax` operation.

Here's the code that uses `softmax` to print out the top 10 most likely tokens along with their probabilities:

In [21]:
top10 = torch.topk(final_logits.softmax(dim=0), 10)
# Here, we see the associated probabilities - very low
for value, index in zip(top10.values, top10.indices):
    print(f"{tokenizer.decode(index):<10} {value.item():.1%}")
    # line      3.6%
    # street    2.7%
    # river     2.2%
    # ...

 line      3.6%
 street    2.7%
 river     2.2%
 room      1.9%
 pond      1.8%
 bridge    1.7%
 border    1.7%
 country   1.7%
 road      1.6%
 board     0.9%
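The `softmax` call used above can also be written out by hand. The following is a minimal pure-Python sketch on three made-up logits (the values are assumptions chosen for illustration, not GPT-2 outputs); subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Normalize so the values sum to 1.
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up example values
probs = softmax(logits)
print([round(p, 3) for p in probs])
print("sum:", round(sum(probs), 6))
```

The ordering of the logits is preserved: the largest logit always receives the largest probability.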


### Experimenting with Predictions

Before diving deeper, it's beneficial to experiment with the code above to understand how the model's predictions vary with different inputs. Here are some ideas to try:

1. **Change a Few Words**: Modify words in the input string, such as the verb "skip" (try "walk" or "swim"). Observe how the model's predictions change. Does it still predict " line"? How do the probabilities for each token shift?

2. **Alter the Input String**: Try different input strings altogether. For instance, instead of "I skip across the", use "It was a dark and stormy". Do you agree with the model's new predictions? How do they differ?

3. **Check Grammar**: Provide an input string that is not grammatically correct. For example, use "I across skip the". How does the model handle it? Look at the probabilities of the top predictions. Do the probabilities still make sense?

By experimenting with these changes, you can gain a deeper understanding of how the model processes language and how sensitive it is to different inputs. This hands-on approach will help you appreciate the intricacies of language modeling and the strengths and limitations of transformer models like GPT-2.

## Generating Text

Now that we understand how the model predicts the next token in a sequence, generating text becomes straightforward. By repeatedly feeding the model's predictions back into itself, we can extend the input text. The `transformers` library makes this easy with the `generate()` method, designed specifically for auto-regressive models like GPT-2. Let's explore how this works with an example.

### Basic Text Generation

Here’s how to use the `generate()` method to produce text:

In [22]:
# generate() wraps the LLM: it abstracts away the details of
# making multiple forward passes (to get successive next tokens)
# and adds functionality through parameters such as:
# - max_new_tokens: maximum number of tokens to generate
# - repetition_penalty: discourage re-use of tokens
# - do_sample: sample instead of selecting the likeliest token
# - temperature: randomness of sampling
# - top_k: number of top candidates kept when sampling
# - top_p: cumulative probability kept when sampling
# - bad_words_ids: token IDs that must not be generated (e.g., offensive words)
# - num_beams: explore several branches instead of only the likeliest (beam search)
# ...
# We can use different strategies for generation:
# - Greedy decoding: pick the likeliest token at each step; default
# - Beam search: keeps track of multiple hypotheses during generation, choosing the most likely overall sequence
output_ids = gpt2.generate(input_ids, max_new_tokens=20, repetition_penalty=1.5)
# Then, we decode the output ids to obtain text
decoded_text = tokenizer.decode(output_ids[0])

print("Input IDs:", input_ids[0])
print("Output IDs:", output_ids)
print(f"Generated text: {decoded_text}")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Input IDs: tensor([   40, 14267,  1973,   262])
Output IDs: tensor([[   40, 14267,  1973,   262,  1627,   284,   257,  1295,   810,   314,
           460,   766,   340,    13,   198,   198,   464,  1306,   640,   345,
           821,   287,  3240,    11]])
Generated text: I skip across the line to a place where I can see it.

The next time you're in town,


When we call the `generate()` method, it abstracts away the details of making multiple forward passes, predicting the next token, and appending it to the input sequence. The result is a sequence of token IDs, including both the input and the new tokens generated by the model. Using the `tokenizer.decode()` method, we can convert these token IDs back into readable text.
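The loop that `generate()` hides can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_SCORES` is a hand-written table, and `next_token_scores` only looks at the last token, whereas a real model such as GPT-2 conditions on the whole sequence and works on token IDs rather than strings.

```python
# Hypothetical stand-in for a model: scores for the next token given the last one.
TOY_SCORES = {
    "I": {"skip": 2.0, "run": 1.0},
    "skip": {"across": 3.0, "the": 0.5},
    "across": {"the": 2.5, "a": 1.0},
    "the": {"line": 1.2, "street": 0.9, "river": 0.7},
    "line": {"<eos>": 1.0},
}

def next_token_scores(tokens):
    # A real model would run a forward pass over all tokens here.
    return TOY_SCORES.get(tokens[-1], {"<eos>": 0.0})

def greedy_generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # pick the likeliest next token
        if best == "<eos>":                 # stop at the end-of-sequence marker
            break
        tokens.append(best)                 # feed the prediction back in
    return tokens

print(greedy_generate(["I", "skip"], max_new_tokens=5))
```

Each iteration corresponds to one forward pass: score the candidates, pick one, append it, and repeat until the token budget is used up or an end-of-sequence marker appears.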

### Different Strategies for Text Generation

While the `generate()` method simplifies text generation, the strategy we use can significantly impact the quality of the generated text. The default approach, known as greedy decoding, always picks the most likely next token. This method is simple but can lead to suboptimal results, especially for longer sequences. Let's look at why and explore other strategies.

#### Greedy Decoding

Greedy decoding selects the most likely next token at each step:

In [None]:
output_ids = gpt2.generate(input_ids, max_new_tokens=20)
decoded_text = tokenizer.decode(output_ids[0])

print(f"Generated text: {decoded_text}")




Generated text: I skip across the line to the next section.

The first thing to note is that the first line is a


While straightforward, this method can miss better sequences because it commits to the locally most likely token at each step and never looks ahead. For example, given the starting phrase "Sky," it might predict "blue" as the next word, missing out on a phrase like "Sky rockets soar," whose overall probability could be higher.

#### Beam Search

Beam search keeps track of multiple hypotheses during generation, choosing the most likely overall sequence:

In [None]:
# Beam search: keeps track of multiple hypotheses during generation, choosing the most likely overall sequence
beam_output = gpt2.generate(
    input_ids,
    num_beams=5,
    max_new_tokens=30,
)

print(tokenizer.decode(beam_output[0], skip_special_tokens=True))




I skip across the line to the other side of the room.

"What's going on?" I ask.

"I don't know," he says


Beam search is effective for tasks with predictable output lengths, like summarization or translation. However, it can be slower and sometimes still lead to repetition in open-ended generation tasks.
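A toy version of beam search makes the difference from greedy decoding visible. The probability table below is made up to mirror the "Sky" example from the greedy decoding section, and `beam_search` is a simplified sketch (no length normalization, no end-of-sequence handling), not the algorithm as implemented in `transformers`.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence so far.
TOY_PROBS = {
    ("Sky",): {"blue": 0.5, "rockets": 0.4, "is": 0.1},
    ("Sky", "blue"): {"and": 0.4, "is": 0.3, "sky": 0.3},
    ("Sky", "rockets"): {"soar": 0.9, "fall": 0.1},
    ("Sky", "is"): {"blue": 0.6, "grey": 0.4},
}

def beam_search(prompt, num_beams, steps):
    beams = [(0.0, tuple(prompt))]  # (log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            # Extend every surviving hypothesis by every possible next token.
            for tok, p in TOY_PROBS.get(seq, {}).items():
                candidates.append((logp + math.log(p), seq + (tok,)))
        if not candidates:
            break
        # Keep only the num_beams most likely sequences overall.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search(["Sky"], num_beams=1, steps=2)[0]  # one beam behaves like greedy decoding
beam = beam_search(["Sky"], num_beams=2, steps=2)[0]
print("greedy:", greedy[1], round(math.exp(greedy[0]), 2))
print("beam:  ", beam[1], round(math.exp(beam[0]), 2))
```

With a single beam the search commits to "blue" and ends up with a less likely sequence; with two beams the "rockets" branch survives the first step and wins once "soar" is added.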

#### Repetition Penalty and Bad Words

To address repetition, you can introduce a repetition penalty:

In [None]:

beam_output = gpt2.generate(
    input_ids,
    num_beams=5,
    repetition_penalty=1.2,
    max_new_tokens=38,
)

print(tokenizer.decode(beam_output[0], skip_special_tokens=True))




I skip across the street to the other side of the street, and I see a man with a gun in his hand. He says, "I'm going to kill you." And I say, "No


You can also specify `bad_words_ids` to prevent the model from generating certain tokens, such as offensive words.
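As a rough sketch of what a repetition penalty does to the scores: for every token that has already been generated, a positive logit is divided by the penalty and a negative logit is multiplied by it, so the token becomes less likely either way. To our understanding this matches the rule behind `repetition_penalty`, but treat that as an assumption and check the library documentation; `apply_repetition_penalty` is a hypothetical helper and the logits are made up.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of `logits` with already-generated token IDs made less likely."""
    out = list(logits)
    for i in set(generated_ids):
        # Dividing a positive score or multiplying a negative one both lower it.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [2.0, -1.0, 0.5, 3.0]  # made-up scores for a 4-token vocabulary
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=1.2)
print(penalized)
```

Tokens 0 and 1 were already generated, so their scores drop, while tokens 2 and 3 are left untouched. A penalty of 1.0 would leave all scores unchanged.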

### Sampling Techniques

Instead of always picking the most likely next token, sampling introduces randomness by sampling from the probability distribution of the next tokens.

#### Basic Sampling

In [23]:

from transformers import set_seed

set_seed(70)
sampling_output = gpt2.generate(
    input_ids,
    num_beams=5,
    do_sample=True,
    repetition_penalty=1.2,
    max_length=40,
    top_k=10,
)

print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))




I skip across the line.

I skip across the line.

I skip across the line.

I skip across the line.

I skip across the line.




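By setting `do_sample=True`, the model samples the next token from its probability distribution instead of always taking the most likely one, which generally leads to more diverse text. Note that the cell above also keeps `num_beams=5` and `top_k=10`, so it combines sampling with beam search rather than performing plain sampling, and, as the output shows, it can still fall into repetition.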

#### Temperature

The `temperature` parameter adjusts the randomness of the distribution:

In [None]:
# Adjust the randomness of the picking with the temperature
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=0.4,
    max_length=40,
    top_k=0,
)

print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))




I skip across the road to the last stop in the city, and I see a large crowd of people lining thegood road. I have no idea what to expect, but I'm not sure I


A higher temperature increases randomness, making the text more diverse but potentially less coherent. A lower temperature makes the output more predictable.
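The effect of temperature can be seen by dividing the logits by the temperature before applying softmax. This is a minimal pure-Python sketch on made-up logits; `softmax_with_temperature` is a hypothetical helper written for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply a standard softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up example values
for t in (0.4, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

At a low temperature most of the probability mass concentrates on the top token, while at a high temperature the distribution flattens and less likely tokens are sampled more often.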

#### Top-K Sampling

Top-K sampling limits the selection to the top K tokens:

In [13]:
# Limit the selection of the picking to the top k options
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=40,
    top_k=10,
)

print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))




I skip across the pond in a hurry to find the right answer.

But it's not easy when you have such an answer. You need to ask yourself: "Why are there so many


This method ensures that only the most likely tokens are considered, improving quality but possibly reducing diversity.
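The filtering step of Top-K sampling can be sketched as follows: keep the K highest logits, renormalize them, and sample only among those. The logits are made up, and `top_k_sample` is a hypothetical helper that uses Python's `random` module rather than PyTorch.

```python
import math
import random

def top_k_sample(logits, k, rng):
    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept tokens only (unnormalized weights are fine for choices()).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [2.0, 1.0, 0.1, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With `k=2`, only the two highest-scoring tokens can ever be drawn, no matter how many samples are taken.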

#### Top-P (Nucleus) Sampling

Top-P sampling, or nucleus sampling, restricts the selection to the smallest set of most likely tokens whose cumulative probability reaches a threshold (`top_p`):

In [None]:

sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=40,
    top_p=0.94,
    top_k=0,
)

print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))




I skip across the line from.0906.761 so try using "REG_SETME" instead of the initial ones.

You should hear small echoing sounds when sliding, especially at


Top-P sampling dynamically chooses the number of tokens based on their cumulative probability, balancing quality and diversity.
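The "dynamic" part can be illustrated with a small sketch: the same threshold keeps few tokens when the distribution is peaked and many when it is flat. The two distributions below are made up, and `top_p_filter` is a hypothetical helper, not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the kept probabilities so they sum to 1 again.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

peaked = [0.90, 0.05, 0.03, 0.02]  # the model is confident
flat = [0.30, 0.25, 0.25, 0.20]    # the model is unsure
print(len(top_p_filter(peaked, 0.94)), "tokens kept for the peaked distribution")
print(len(top_p_filter(flat, 0.94)), "tokens kept for the flat distribution")
```

Sampling then proceeds over the renormalized probabilities of the kept tokens only.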

### Experimenting with Generation Strategies

There’s no one-size-fits-all approach to text generation. Experiment with different parameters to see what works best for your specific use case. Here are some suggestions:

1. **Parameter Tuning**: Adjust parameters like `num_beams`, `repetition_penalty`, `top_p`, and `top_k` to see how they impact the generated text.
2. **Avoiding Repetition**: Use `no_repeat_ngram_size` to prevent the model from repeating the same sequence of words.
3. **Advanced Techniques**: Explore newer methods like contrastive search, which balances probability and contextual similarity to generate coherent text while avoiding repetition.
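The idea behind blocking repeated n-grams can be sketched in plain Python: before choosing the next token, collect every token that would complete an n-gram already present in the text and exclude it. `banned_next_tokens` is a hypothetical helper that works on word strings for readability; the library operates on token IDs, and its exact implementation may differ.

```python
def banned_next_tokens(tokens, n):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - n + 1:])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

tokens = ["I", "skip", "across", "the", "line", ".", "I", "skip"]
print(banned_next_tokens(tokens, n=3))
```

Here the text ends in "I skip", and "I skip across" has already occurred, so "across" would be excluded from the candidates for the next position when `n=3`.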