# Setup

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
import torch

This code sets up the essential components needed for working with large language models using the Hugging Face transformers library.

The first line imports three key tools: AutoTokenizer converts text into a format the model can process, AutoModelForCausalLM loads pre-trained language models designed for text generation, and set_seed ensures reproducible results by fixing random number generation.

The second line imports PyTorch, the deep learning framework that powers the underlying computations. PyTorch provides the tensor operations and GPU acceleration needed to run these models efficiently.

This setup forms the foundation for tasks like text generation, completion, or fine-tuning. The "Auto" classes are particularly useful because they read the configuration stored with the checkpoint you name and pick the matching tokenizer and model classes, so the same code works whether you load Qwen, GPT-2, or another architecture.

The "Causal" in AutoModelForCausalLM refers to how the model generates text: it predicts each new token from the tokens that come before it, never from later ones, much as a person writes from left to right.

In [None]:
class CFG:
    model = "Qwen/Qwen2-0.5B"

This code defines a configuration class named CFG that serves as a central place to store important settings for a machine learning project. The class contains a single setting that specifies which AI model to use - in this case, it's "Qwen/Qwen2-0.5B".

Let's break down each part to understand it better:

The name "CFG" is a common shorthand for "configuration" in machine learning projects. Think of it like a control panel where you can adjust all the important settings in one place. This makes it much easier to modify settings later, since you only need to change them in one spot rather than hunting through your code.

The model setting points to "Qwen/Qwen2-0.5B", which tells us several things: Qwen is the model family (created by Alibaba), Qwen2 indicates it's the second generation of this model, and 0.5B means it has approximately 500 million parameters. The forward slash format ("Qwen/Qwen2-0.5B") is the standard way to reference models in the Hugging Face model hub, similar to how you might specify a folder path.

This style of configuration is particularly useful when you're experimenting with different models or settings. You could easily switch to a different model by changing just this one line, rather than having to modify multiple places in your code. It's a practical application of the software engineering principle of keeping configuration separate from implementation.

# Tokenization

In [None]:
prompt = "It was a dark and stormy"
tokenizer = AutoTokenizer.from_pretrained(CFG.model)

Let's break this code down into two key parts that work together to prepare text for a language model.

The first line creates a prompt variable containing the text "It was a dark and stormy". This is the beginning of what appears to be a story - you might recognize it as a variation of the famous opening line "It was a dark and stormy night." This text will serve as the starting point for the language model to continue from.

The second line is where things get more interesting. It creates a tokenizer by loading a pre-trained one that matches our chosen Qwen model. A tokenizer is like a translator that converts human-readable text into a format the AI model can understand. Think of it as breaking down a sentence into meaningful pieces, similar to how we might break down "playground" into "play" and "ground".

The AutoTokenizer.from_pretrained() function is doing quite a bit of work behind the scenes. When you call it with CFG.model (which contains "Qwen/Qwen2-0.5B"), it:
1. Downloads the tokenizer files specific to the Qwen2-0.5B model from Hugging Face's model hub (if they haven't been downloaded already)
2. Loads these files into memory
3. Sets up all the rules and vocabulary that this particular model uses to understand text

This is essential because different AI models might break down text in different ways. For example, one model might treat "playing" as a single token, while another might break it into "play" and "ing". Using the correct tokenizer ensures the text is broken down in exactly the way the model expects.

The result is stored in the tokenizer variable, which we'll be able to use later to convert our prompt text into a format the model can process. This tokenizer will use the same rules and vocabulary that were used during the model's training, ensuring consistency in how text is processed.

In [None]:
input_ids = tokenizer(prompt).input_ids

input_ids

[2132, 572, 264, 6319, 323, 13458, 88]

This code transforms our text prompt into a format the AI model can understand, so let's explore exactly how this works.

The line `input_ids = tokenizer(prompt).input_ids` performs a crucial two-step transformation. First, the tokenizer breaks down our text "It was a dark and stormy" into tokens - think of these as the basic units of meaning that the model understands. Then, it converts each token into a unique number (an ID) that represents that token in the model's vocabulary.

The `.input_ids` part extracts just the sequence of numbers we need - this is what gets fed into the model. Printing input_ids gives the list of integers shown above, `[2132, 572, 264, 6319, 323, 13458, 88]`. Each integer is the ID of one token in the model's vocabulary: 2132 stands for "It", 572 for " was", and so on. Note that our six words produce seven IDs, because one word is split into two tokens.

To help visualize this process, imagine you're translating English into a special code language where every word or part of a word gets assigned a unique number. The model has been trained to understand and work with these numbers rather than the original text. This numbering system is consistent - "It" will always get the same number every time it appears, ensuring the model can reliably understand the input.

What makes this particularly interesting is that the tokenizer might not split the text exactly where we expect. Common words usually map to a single token, while rarer or longer words may be split into several pieces. These splits are not decided on the fly: the tokenizer's vocabulary was built ahead of time from a large text corpus, keeping together the character sequences that occur most often.

Understanding this transformation process is crucial because it's the bridge between human-readable text and the mathematical operations that language models perform. Every piece of text that goes into the model must first go through this exact process.

The resulting input_ids contains everything the model needs to understand our prompt, but in a format that enables efficient mathematical operations - essentially translating our story opening into the model's native language of numbers.

In [None]:
for t in input_ids:
    print(t, "\t: ", tokenizer.decode(t))

2132 	:  It
572 	:   was
264 	:   a
6319 	:   dark
323 	:   and
13458 	:   storm
88 	:  y


This code helps us peek inside how the tokenizer works by showing us exactly how it breaks down our text. It's doing something quite fascinating - taking each individual token ID number and converting it back into readable text so we can understand the tokenization process.

The for loop goes through each number (t) in our input_ids list. For each one, it:
1. Prints the token ID number (t)
2. Adds a tab character ("\t") and a colon for clean formatting
3. Uses tokenizer.decode(t) to convert that single number back into the text it represents

When we run this code, each piece of our original phrase "It was a dark and stormy" appears on its own line, and the output above shows two things worth noticing. First, "stormy" is not a single token: it is split into " storm" (13458) and "y" (88), so six words become seven tokens. Second, every token except the first starts with a space, because this tokenizer stores the space before a word as part of the token itself.

This visualization reveals something important about how language models process text. Rather than treating each word as an indivisible unit, the tokenizer works with subword pieces: frequent words get a token of their own, while less common words are assembled from several tokens.

The tokenizer.decode() function is particularly useful here because it lets us see exactly how our text was divided up. This understanding becomes crucial when we're trying to figure out why a model responds in certain ways or when we're debugging issues with text generation. It's like having a translator's dictionary that shows us how the model "thinks about" and divides up language.

This kind of token inspection can also help us understand why models have length limits measured in tokens rather than characters, or why they might handle certain phrases differently than we expect - it's all based on how the text gets broken down into these fundamental units.
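The text-to-ID mapping can be mimicked in plain Python. The sketch below is only an illustration: it reuses the seven token strings and IDs printed in this notebook as a tiny vocabulary and applies greedy longest-match splitting, which is a simplification and not the byte-pair-encoding procedure the real tokenizer uses. The names `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical.

```python
# Toy vocabulary: the seven token strings and IDs printed earlier in this notebook.
TOY_VOCAB = {
    "It": 2132, " was": 572, " a": 264, " dark": 6319,
    " and": 323, " storm": 13458, "y": 88,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Split text by greedy longest match over TOY_VOCAB (a simplification of BPE)."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest remaining piece first, then shorter and shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no toy token covers the text at position {pos}")
    return ids


def toy_decode(ids):
    """Map IDs back to their token strings and join them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


toy_ids = toy_encode("It was a dark and stormy")
print(toy_ids)
print(toy_decode(toy_ids))
```

Because " stormy" is not in the toy vocabulary, the longest match at that point is " storm", and the leftover "y" becomes its own token, mirroring the split seen in the real output.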

# Probability

In [None]:
model = AutoModelForCausalLM.from_pretrained(CFG.model)

This line of code loads a pre-trained language model into our program.

The AutoModelForCausalLM class is specifically designed for models that generate text in a "causal" way - meaning they predict what comes next based on what came before, similar to how humans write. When we call .from_pretrained() with our model name "Qwen/Qwen2-0.5B", several important steps occur:

First, the code checks if this model has been downloaded before. If not, it downloads the model files from Hugging Face's model hub. These files contain the model's learned knowledge, encoded in its neural network weights - think of it like downloading the model's "brain."

Next, it sets up the neural network architecture that matches this specific model. The architecture is like the physical structure of the brain - how many neurons there are and how they're connected. In this case, it's setting up the particular architecture that Qwen2-0.5B uses, which involves multiple layers of attention mechanisms and neural networks.

Then, it loads all the learned parameters (weights and biases) into this architecture.

After loading, the model is initialized in evaluation mode by default, meaning it's ready to make predictions but not to learn new things. This is typically what we want for text generation tasks.

In [None]:
input_ids = tokenizer(prompt, return_tensors = "pt").input_ids
outputs = model(input_ids)
outputs.logits.shape

torch.Size([1, 7, 151936])


The first line, `input_ids = tokenizer(prompt, return_tensors = "pt").input_ids`, builds on our earlier tokenization but adds an important new element. While before we just converted text to numbers, now we're also specifying `return_tensors = "pt"`. This tells the tokenizer to output PyTorch tensors - think of these as specialized arrays designed for neural network computations. It's like converting our list of numbers into a format that's optimized for the mathematical operations our model needs to perform.

When we run `outputs = model(input_ids)`, we're passing our tokenized prompt through the entire neural network, a single forward pass that involves all of the model's roughly 500 million parameters. The result contains what we call "logits": raw, unnormalized scores, one for every token in the model's vocabulary at every input position, indicating how strongly the model favors each token as the next one based on patterns it learned during training.

The final line, `outputs.logits.shape`, reveals the structure of these scores. Here the shape is `torch.Size([1, 7, 151936])`, which has the form (1, N, V) where:
- The 1 represents our single input sequence
- N = 7 is the number of tokens in our input
- V = 151936 is the number of scores per position, one for each entry in the model's vocabulary

To visualize this, imagine a grid where each row represents one position in our input text, and each column represents a possible next word. At each position, we have a score for every word in the model's vocabulary, indicating how likely that word is to appear next. It's like having a massive spreadsheet where each cell contains the model's confidence about a particular word choice at a particular position.

Understanding these shapes is crucial because they show us exactly how much information the model produces in one pass. For each token in our input, the model has calculated a score for every possible next token. These scores are not yet probabilities; turning them into probabilities takes a softmax, which we apply further below.

This output structure sets us up for the next step in text generation, where we'll select actual words from these probability distributions to continue our story. The logits serve as the foundation for making informed choices about which words should come next in our generated text.

In [None]:
final_logits = model(input_ids).logits[0,-1]
final_logits.argmax()

tensor(3729)

These two lines show us how to extract and analyze the model's prediction for what word should come next in our text.

First, let's look at `final_logits = model(input_ids).logits[0,-1]`. This line does some careful selection from our model's output to focus on exactly what we need. The indexing `[0,-1]` is particularly important - the `0` selects our first (and only) input sequence, while `-1` selects the last position.

When we run `model(input_ids)`, we're getting predictions for every position in our input text ("It was a dark and stormy"). But we only care about what comes after the final token, the "y" that ends "stormy", so we use this indexing to zoom in on just those predictions. The resulting `final_logits` is a vector with one score per vocabulary token, where a higher score means the model considers that token a more likely continuation.

The second line, `final_logits.argmax()`, finds the position of the highest number in our logits vector. This is like asking "out of all possible next tokens, which one did the model think was most likely?" The `argmax()` function returns the index of that highest value, here `tensor(3729)`. Because positions in the logits vector correspond to token IDs, decoding 3729 with our tokenizer gives the actual token the model thinks should come next in our story.

To help visualize this: imagine our text is a multiple choice question, where each possible next word is an option. The logits are like scores for each option, and `argmax()` is like picking the option with the highest score. So if the logits vector had values like [-2.1, 0.5, 3.7, 1.2], where 3.7 is the highest value at position 2, `argmax()` would return 2.

This process reveals how the model makes its choices - it's not randomly picking words, but rather carefully scoring each possibility and selecting the one it determines is most appropriate given the context. However, always choosing the highest scoring word (what we're doing here) isn't always the best approach for creative text generation, which is why many applications use more sophisticated sampling methods to introduce some controlled randomness into the generation process.
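The indexing and the argmax step can be reproduced without any model. The following sketch uses a made-up nested list in place of the real logits tensor, with a toy shape of (1, 3, 4) instead of (1, 7, 151936); the last row reuses the example values from the paragraph above. The name `toy_logits` and all scores are assumptions for illustration.

```python
# Toy logits with shape (1, 3, 4): 1 sequence, 3 positions, a vocabulary of 4 tokens.
toy_logits = [[
    [0.1, 2.0, -1.0, 0.3],   # scores for the token that follows position 0
    [1.5, -0.2, 0.4, 0.0],   # scores for the token that follows position 1
    [-2.1, 0.5, 3.7, 1.2],   # scores for the token that follows the last position
]]

# Same selection as logits[0, -1]: first sequence, last position.
final = toy_logits[0][-1]

# argmax: the index whose score is largest.
best_index = max(range(len(final)), key=final.__getitem__)
print(best_index)  # 2
```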

In [None]:
tokenizer.decode(final_logits.argmax())

' night'

This line takes the model's top prediction and converts it back into human-readable text. We're essentially completing a full circle - we started with text, converted it to numbers for the model to process, and now we're converting the model's numerical output back into text.

The `tokenizer.decode()` function is particularly interesting here because it bridges the gap between how the model thinks (in numbers) and how we communicate (in words). Remember our prompt "It was a dark and stormy" - what we're seeing now is the word that the model thinks most naturally completes this phrase based on all the patterns it learned during training.

Running this cell prints ' night' (note the leading space, which is part of the token), completing "It was a dark and stormy night", a famous opening line in literature. The model has very likely encountered this phrase many times in its training data, so it assigns this continuation a high score. This is a good illustration of how language models pick up common phrases and literary conventions from their training text.


Understanding this decoding step is crucial because it's the final bridge between the model's mathematical computations and human-readable output. Without it, we'd be left with just numbers that wouldn't mean anything to us, even though they make perfect sense to the model.

In [None]:
top10_logits = torch.topk(final_logits,10)

for index in top10_logits.indices:
    print(tokenizer.decode(index))

 night
 evening
 day
 morning
 winter
 afternoon
 Saturday
 Sunday
 Friday
 October


This code shows us how to peek into the model's top predictions, giving us insight into not just its best guess, but its next best guesses too. Let's explore how this works and why it's valuable.

The first line `top10_logits = torch.topk(final_logits,10)` uses PyTorch's topk function to find the 10 highest values in our logits vector. Think of this like asking the model "What are your top 10 ideas for what might come next in this sentence?" Instead of just getting the single best prediction, we're getting a broader view of what the model considers plausible continuations.

The `torch.topk()` function returns two pieces of information: the actual values (scores) and their positions (indices) in the original logits vector. We're particularly interested in the indices because they tell us which words the model thought were most likely.

The for loop then walks through these top 10 predictions one by one. For each index, we use `tokenizer.decode()` to convert the number back into text, just like we did before with the single best prediction. This gives us a ranked list of the model's top 10 guesses for what word should come after "It was a dark and stormy".

When we run this code, "night" comes first, as expected for the most famous completion of this phrase. The rest of the list is dominated by times of day ("evening", "day", "morning", "afternoon"), followed by a season, days of the week, and a month - all words that could reasonably follow "dark and stormy" in English.

This broader view of the model's predictions is particularly valuable because it helps us understand how the model thinks about language. Just like a human might consider several possible ways to complete a sentence, the model assigns different probabilities to various continuations. Some might be very likely (high scores), while others are possible but less probable (lower scores).

Understanding these multiple predictions becomes especially important when we want to generate more creative or diverse text. Instead of always choosing the top prediction, many text generation systems use these probability distributions to make more varied and interesting choices while still maintaining coherence and grammatical correctness.

The difference between the scores of these top 10 predictions can also tell us about the model's confidence. If the top score is much higher than the others, the model is very confident about its prediction. If the scores are closer together, the model sees several equally plausible continuations.
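For a one-dimensional input, the values-and-indices pair that `torch.topk` returns can be approximated in plain Python. This is a minimal sketch, not PyTorch's implementation; `top_k` is a hypothetical helper and the scores are made up.

```python
import heapq


def top_k(scores, k):
    """Return (values, indices) of the k largest scores, highest first."""
    # enumerate() pairs every score with its position so the index is kept.
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    indices = [index for index, _ in best]
    values = [value for _, value in best]
    return values, indices


values, indices = top_k([-2.1, 0.5, 3.7, 1.2], k=2)
print(values)   # [3.7, 1.2]
print(indices)  # [2, 3]
```

The gap between the first and second value (3.7 versus 1.2 here) is the kind of difference the paragraph above reads as a sign of confidence.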

In [None]:
top10 = torch.topk(final_logits.softmax(dim=0),10)

for value, index in zip(top10.values, top10.indices):
    print(f"{tokenizer.decode(index):<10}{value.item():.2%}")

 night    88.71%
 evening  4.30%
 day      2.19%
 morning  0.49%
 winter   0.45%
 afternoon0.27%
 Saturday 0.25%
 Sunday   0.19%
 Friday   0.17%
 October  0.16%


This code helps us understand not just what words the model predicts might come next, but also how confident it is about each prediction. Let's break down what's happening and why it's important.

The first line introduces a crucial transformation: `final_logits.softmax(dim=0)`. The softmax function converts our raw logits (which could be any numbers) into probabilities that add up to 100%. This is like converting test scores into percentages - it gives us a standardized way to compare the model's confidence in different predictions. The `dim=0` parameter tells softmax to operate along the vocabulary dimension, ensuring all word probabilities sum to 1.

When we call `torch.topk()` on these probabilities, we get both the values (probabilities) and indices (word positions), just like before. But now the values represent actual percentages rather than raw scores.

The `zip()` function in the for loop cleverly pairs each probability with its corresponding word index, letting us see both pieces of information together. The f-string formatting then creates a neat, aligned output:
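- `{tokenizer.decode(index):<10}` left-aligns each token and pads it to at least 10 characters; " afternoon" (with its leading space) is already 10 characters long, which is why it runs straight into its percentage in the output above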
- `{value.item():.2%}` converts each probability to a percentage with 2 decimal places

In the output above, " night" takes 88.71% of the probability, " evening" 4.30%, and " day" 2.19%; every other token is below 1%.

This visualization reveals something fascinating about how the model thinks. The high percentage for "night" tells us that the model is very confident about this completion - it has strongly learned the pattern "It was a dark and stormy night" from its training data. The rapidly decreasing percentages for other words show us which alternatives the model considers plausible but less likely.

Understanding these probabilities is crucial for text generation because they help us balance between predictability and creativity. Always choosing the highest probability word would give us very conventional text, while incorporating lower probability choices (in a controlled way) can lead to more interesting and varied outputs while still maintaining coherence.

This kind of probability analysis also helps us understand why language models sometimes make surprising choices - even a word with a seemingly low probability of 1% still has a chance of being selected if we're using probabilistic sampling methods rather than always taking the top prediction.
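To make the softmax step and probabilistic sampling concrete, here is a small standard-library sketch. The four logits are invented for illustration and are not the model's real scores; `softmax` is a hand-written helper, and `random.choices` stands in for whatever sampling routine a generation library would use.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum first avoids overflow and does not change the result.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens = [" night", " evening", " day", " morning"]
logits = [9.0, 6.0, 5.3, 3.8]  # made-up scores, not taken from the model

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token:<10}{p:.2%}")

# Sampling: draw tokens in proportion to their probabilities instead of taking the top one.
rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
samples = rng.choices(tokens, weights=probs, k=5)
print(samples)
```

With sampling, low-probability tokens are rare but not impossible, which is the behavior the paragraph above describes.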

# Text generation

In [None]:
output_ids = model.generate(input_ids,max_new_tokens = 20)
decoded_text = tokenizer.decode(output_ids[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


These two lines showcase how we generate new text and decode it back into human-readable form. Let me walk through how this generation process works, as it's quite fascinating.

The first line, `output_ids = model.generate(input_ids, max_new_tokens = 20)`, tells our model to continue the text. Think of it like giving a storyteller the beginning of a story and asking them to continue it. The `max_new_tokens = 20` parameter sets a limit of at most 20 additional tokens, which is usually somewhat fewer than 20 words because some words take more than one token. Without an explicit limit, generation would stop only when the model produces an end-of-sequence token or reaches a default length cap.

During generation, the model repeatedly uses what it has written so far to predict the next word. It's similar to how a human might write a story - after writing "It was a dark and stormy", we think about what word would make sense next, write it down, then use all of that (including our new word) to think about what should come after that, and so on.

The second line, `decoded_text = tokenizer.decode(output_ids[0])`, converts the model's numerical output back into readable text. The `[0]` is needed because the model's output is structured to handle multiple sequences at once (even though we only have one in this case). It's like opening an envelope containing the complete story the model has written.

`generate` supports several strategies for picking each next token. Unless sampling is enabled (for example with `do_sample=True`, either in the call or in the model's generation config), it uses greedy decoding: at every step it takes the single highest-scoring token. That is what happens here, which is why the first generated token is " night", the same token that `argmax()` found earlier. Greedy decoding is simple and deterministic, but it can produce repetitive text, which is why sampling methods that add controlled randomness are often used for creative writing.

When we print the decoded text, we see our original prompt "It was a dark and stormy" followed by the model's continuation, which begins with "night" and goes on to describe the weather. Because greedy decoding involves no randomness, running this cell again gives the same continuation; only with sampling enabled would each run produce a different take on how the story might unfold.

Understanding this generation process helps explain both the strengths and limitations of language models. They can create remarkably coherent text by building on context one word at a time, but they need careful guidance (through parameters like max_new_tokens) to generate text that's useful for specific purposes.
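The token-by-token loop can be sketched with a toy stand-in for the model. In the example below, `TOY_SCORES` is a hypothetical lookup table that scores the next token from the previous token only (a real model conditions on the whole sequence), and `toy_generate` picks the top-scoring token at each step, which matches the output in this notebook, where the first generated ID, 3729, is the argmax found earlier.

```python
# Hypothetical next-token scores keyed by the previous token (a stand-in for a real model).
TOY_SCORES = {
    " stormy": {" night": 5.0, " evening": 2.0, " day": 1.3},
    " night": {".": 4.0, ",": 2.5},
    ".": {" The": 3.0, " It": 1.0},
    " The": {" sky": 2.0, " wind": 1.9},
    " sky": {" was": 3.0},
    " wind": {" was": 3.0},
}


def toy_generate(prompt_tokens, max_new_tokens):
    """Append one token per step, always choosing the highest-scoring candidate."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = TOY_SCORES.get(tokens[-1])
        if scores is None:  # no known continuation, so stop early
            break
        tokens.append(max(scores, key=scores.get))
    return tokens


prompt_tokens = ["It", " was", " a", " dark", " and", " stormy"]
generated = toy_generate(prompt_tokens, max_new_tokens=5)
print("".join(generated))  # It was a dark and stormy night. The sky was
```

Each new token is appended to the sequence before the next prediction is made, and the loop ends when the token budget runs out or no continuation is available.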

In [None]:
print("Input IDs", input_ids[0])
print("Output IDs", output_ids)
print(f"Generated text: {decoded_text}")

Input IDs tensor([ 2132,   572,   264,  6319,   323, 13458,    88])
Output IDs tensor([[ 2132,   572,   264,  6319,   323, 13458,    88,  3729,    13,   576,
         12884,   572,  6319,   323,   279,  9956,   572,  1246,  2718,    13,
           576, 11174,   572, 50413,  1495,   323,   279]])
Generated text: It was a dark and stormy night. The sky was dark and the wind was howling. The rain was pouring down and the


In [None]:
beam_output = model.generate(
    input_ids,
    num_beams=5,
    max_new_tokens=30,
)

print(tokenizer.decode(beam_output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


It was a dark and stormy night. The wind was howling, and the rain was pouring down. The sky was dark and gloomy, and the air was filled with the


This code introduces us to beam search - a more sophisticated approach to text generation that helps produce higher quality text. Let me walk you through how it works, as it's quite different from the basic generation we saw earlier.

The key new parameter here is `num_beams=5`. To understand beam search, imagine you're writing a story and, at each step, you keep track of the 5 most promising versions of how the story could continue. It's like having 5 different authors working in parallel, each exploring a slightly different path for the story. At each step, all 5 authors propose their next word, but only the most promising continuations survive to the next round.

Let's break this down with an example. Starting with our prompt "It was a dark and stormy":
1. The model first generates 5 possible continuations, maybe something like:
   - "It was a dark and stormy night in the..."
   - "It was a dark and stormy evening when..."
   - "It was a dark and stormy sky above..."
   - "It was a dark and stormy day that..."
   - "It was a dark and stormy afternoon as..."

2. For each of these 5 paths, the model scores every possible next token, and each extended sequence gets a cumulative score (the sum of its tokens' log-probabilities).

3. From all these candidates (5 paths times the vocabulary size, in principle), only the 5 highest-scoring sequences are kept for the next round.

This process continues until we reach our `max_new_tokens=30` limit. The final output we see is the single best sequence found among all these parallel explorations.

What makes beam search particularly powerful is that it looks ahead before committing to a choice. Unlike our previous generation method, which made one decision at a time and stuck with it, beam search maintains multiple possibilities and can "change its mind" if a different path turns out to be more promising. It's like being able to peek at how different story directions might unfold before choosing the best one.

When we run this code, we typically get more coherent and well-structured text compared to simpler generation methods. That's because beam search helps avoid the "local maxima" problem - situations where making the most obvious choice at each step might not lead to the best overall sequence.

However, there's a trade-off: while beam search often produces more polished and conventional text, it might be less creative than other methods because it tends to stick to the most probable sequences. This is why different generation strategies are useful for different purposes - beam search when we want more predictable, high-quality output, and other methods when we want more creative or diverse text.

The decoded output we see is the culmination of this careful exploration process - the single best path found after considering many possible continuations at each step. This methodical approach to text generation helps explain why beam search is often used in applications where output quality and coherence are particularly important.
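A tiny example shows how keeping several candidates can beat the step-by-step best choice. This is a minimal sketch under strong assumptions: `NEXT` is a hypothetical table of next-token probabilities that depend only on the last token, and `beam_search` omits details a real implementation handles, such as end-of-sequence tokens and length normalization.

```python
import math

# Hypothetical toy "model": next-token probabilities that depend only on the last token.
NEXT = {
    "<s>": {"A": 0.5, "B": 0.4, "C": 0.1},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"x": 0.9, "y": 0.05, "z": 0.05},
    "C": {"x": 0.4, "y": 0.3, "z": 0.3},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams best partial sequences at every step; return the best one."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Rank every extension of every beam and keep only the top num_beams.
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


# With a single beam this reduces to greedy decoding.
greedy_tokens, greedy_score = beam_search("<s>", steps=2, num_beams=1)
beam_tokens, beam_score = beam_search("<s>", steps=2, num_beams=2)

print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'A', 'x'] 0.2
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'B', 'x'] 0.36
```

Greedy decoding commits to "A" because it is the best first token, but the path through "B" has the higher overall probability, and a beam width of 2 is enough to find it.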

In [None]:
beam_output = model.generate(
    input_ids,
    num_beams=5,
    repetition_penalty=2.0,
    max_new_tokens=38,
)
print(tokenizer.decode(beam_output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


It was a dark and stormy night. The sky was filled with thunder and lightning, and the wind howled in the distance. It was raining cats and dogs, and the streets were covered in puddles of water.
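The `repetition_penalty=2.0` argument in the last cell discourages tokens that already appear in the sequence, which is why this continuation no longer repeats phrases such as "was dark". A minimal sketch of the commonly used rule follows, assuming the penalty divides positive logits and multiplies negative ones for every token already seen; the helper name and the numbers are hypothetical, and the library's own implementation may differ in detail.

```python
def apply_repetition_penalty(logits, seen_ids, penalty):
    """Lower the scores of tokens that have already been generated."""
    adjusted = list(logits)
    for token_id in set(seen_ids):
        if adjusted[token_id] < 0:
            adjusted[token_id] *= penalty   # push negative scores further down
        else:
            adjusted[token_id] /= penalty   # shrink positive scores toward zero
    return adjusted


logits = [2.0, -1.0, 0.5, 3.0]  # made-up scores for a 4-token vocabulary
penalized = apply_repetition_penalty(logits, seen_ids=[0, 1, 0], penalty=2.0)
print(penalized)  # [1.0, -2.0, 0.5, 3.0]
```

Tokens 0 and 1 were already used, so both become less attractive, while the unseen tokens 2 and 3 keep their original scores.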
