# Putting the LLM Pipeline Together: Step by Step

In this notebook, we'll walk through the complete process of text generation with a local LLM, keeping things simple and clear. We'll follow these steps:

Input text → Tokenization → Converting to IDs → Model processing → Next token prediction → Token selection → Building the response

Let's begin by loading our local model:

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np
import os

# Set the directory where we'll save the model
save_directory = "D:/AIModel"
os.makedirs(save_directory, exist_ok=True)

# Download a small model
model_name = "distilgpt2"
print(f"Downloading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Save the model to our local directory
print(f"Saving model to {save_directory}...")
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print("Model and tokenizer saved successfully!")

# Load model and tokenizer from our local directory
model = AutoModelForCausalLM.from_pretrained(save_directory)
tokenizer = AutoTokenizer.from_pretrained(save_directory)
tokenizer.pad_token = tokenizer.eos_token

print("Model and tokenizer loaded successfully!")

Downloading distilgpt2...
Saving model to D:/AIModel...
Model and tokenizer saved successfully!
Model and tokenizer loaded successfully!


## Step 1: Input Text

Let's begin with a simple prompt:

In [3]:
# Our starting prompt
prompt = "Artificial intelligence is transforming"
print(f"Input Prompt: {prompt}")

Input Prompt: Artificial intelligence is transforming


## Step 2: Tokenization - Breaking Text into Pieces

The tokenizer breaks our text into smaller units (tokens) that the model can understand:

In [4]:
# Tokenize the input
tokens = tokenizer.tokenize(prompt)

print("Tokenization Result:")
for i, token in enumerate(tokens):
    print(f"Token {i+1}: '{token}'")

Tokenization Result:
Token 1: 'Art'
Token 2: 'ificial'
Token 3: 'Ġintelligence'
Token 4: 'Ġis'
Token 5: 'Ġtransforming'


### What's happening here?

The tokenizer has split our input text into tokens. Notice a few important things:

- Some tokens have a 'Ġ' prefix, which marks a space before the word
- "Artificial" is split into two subword tokens ('Art' and 'ificial'), while "transforming" (with its leading space) is kept as a single token because it's common enough
- Less common words are typically split into several subword tokens in the same way
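
To make the subword idea concrete, here is a minimal sketch of a greedy longest-match splitter. It is not the byte-level BPE algorithm that GPT-2 style tokenizers actually use, and `TOY_VOCAB` and `toy_tokenize` are hypothetical names invented for this illustration. It only shows how a fixed vocabulary plus the 'Ġ' space marker can produce the kind of split seen above.

```python
# Toy greedy longest-match subword splitter (NOT the real BPE algorithm).
# TOY_VOCAB is a hypothetical vocabulary chosen for this illustration.
TOY_VOCAB = {"Art", "ificial", "Ġintelligence", "Ġis", "Ġtransforming",
             "Ġtrans", "form", "ers"}

def toy_tokenize(text, vocab):
    # Mark each space with 'Ġ' so it travels with the word that follows it
    marked = text.replace(" ", "Ġ")
    tokens = []
    pos = 0
    while pos < len(marked):
        # Try the longest candidate first, then shrink until one is in the vocab
        for end in range(len(marked), pos, -1):
            piece = marked[pos:end]
            if piece in vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            # No vocabulary entry matched: emit the single character as a token
            tokens.append(marked[pos])
            pos += 1
    return tokens

common = toy_tokenize("Artificial intelligence is transforming", TOY_VOCAB)
rare = toy_tokenize("Art is transformers", TOY_VOCAB)
print(common)
print(rare)
```

In this toy vocabulary "transforming" has its own entry, while "transformers" does not and falls apart into smaller pieces.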

## Step 3: Converting Tokens to IDs

Next, each token is converted to its corresponding numeric ID from the vocabulary:

In [5]:
# Convert tokens to IDs
input_ids = tokenizer.encode(prompt, return_tensors="pt")[0].tolist()

print("Tokens to IDs Conversion:")
for token, id_value in zip(tokens, input_ids):
    print(f"Token '{token}' → ID: {id_value}")

# Show the tensor format that will be input to the model
model_input_ids = tokenizer.encode(prompt, return_tensors="pt")
print("\nModel input tensor shape:", model_input_ids.shape)
print("Model input tensor:", model_input_ids)

Tokens to IDs Conversion:
Token 'Art' → ID: 8001
Token 'ificial' → ID: 9542
Token 'Ġintelligence' → ID: 4430
Token 'Ġis' → ID: 318
Token 'Ġtransforming' → ID: 25449

Model input tensor shape: torch.Size([1, 5])
Model input tensor: tensor([[ 8001,  9542,  4430,   318, 25449]])


### What's happening here?

Each token has been converted to a numeric ID according to the model's vocabulary. These IDs are what the model actually processes - it doesn't understand the text directly, only these numbers.

The IDs are then formatted as a PyTorch tensor with shape [1, n_tokens] - this is the actual input format the model expects.
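
As a rough sketch of what the lookup amounts to, the snippet below uses a plain dictionary holding only the five token/ID pairs printed above. `MINI_VOCAB` and the other names are hypothetical, and the real tokenizer does considerably more than a dictionary lookup. The nested list stands in for the `[1, n_tokens]` tensor.

```python
# Hypothetical mini-vocabulary holding only the five entries printed above;
# the real vocabulary has 50,257 entries.
MINI_VOCAB = {
    "Art": 8001,
    "ificial": 9542,
    "Ġintelligence": 4430,
    "Ġis": 318,
    "Ġtransforming": 25449,
}
ID_TO_TOKEN = {token_id: token for token, token_id in MINI_VOCAB.items()}

example_tokens = ["Art", "ificial", "Ġintelligence", "Ġis", "Ġtransforming"]
example_ids = [MINI_VOCAB[t] for t in example_tokens]

# The outer list plays the role of the batch dimension (one sequence here)
batch = [example_ids]
shape = (len(batch), len(batch[0]))

# Going back from IDs to text: join the tokens and turn 'Ġ' back into spaces
decoded = "".join(ID_TO_TOKEN[i] for i in example_ids).replace("Ġ", " ")
print(example_ids, shape, repr(decoded))
```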

## Step 4: Model Processing

Now the model processes these IDs through its neural network layers:

In [6]:
# Run the model on our input
with torch.no_grad():  # Disable gradient calculation for inference
    outputs = model(model_input_ids)

# The model outputs logits (raw, unnormalized scores) for each possible next token
logits = outputs.logits
print(f"Output logits shape: {logits.shape}")
print(f"This means we have predictions for {logits.shape[1]} positions")
print(f"For each position, we have scores for all {logits.shape[2]} tokens in the vocabulary")

Output logits shape: torch.Size([1, 5, 50257])
This means we have predictions for 5 positions
For each position, we have scores for all 50257 tokens in the vocabulary


### What's happening here?

Inside the model, here's what's happening:

1. The number IDs for each word get turned into lists of numbers that represent their meaning (each ID is converted into an embedding vector).
2. The model adds information about where each word appears in the sentence - first, second, third, etc. (with position embeddings).
3. The model then processes this information through several layers that:
    - Figure out which words should pay attention to each other - like how "is" relates to "intelligence" (self-attention mechanisms).
    - Process this information to understand the meaning better
4. Finally, the model makes a giant list of scores for every possible next word it knows.

The output is basically a big scorecard showing how likely each possible next word is. Since our model knows about 50,257 different words or word pieces, it gives a score to each one of them at every input position. The scores come back in vocabulary order, not ranked, so it is up to us to sort them from most likely to least likely.
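
The shape of that computation can be sketched with tiny random weights in plain Python. This is a toy under stated assumptions: the sizes and names (`VOCAB_SIZE`, `EMBED_DIM`, `toy_forward`) are hypothetical, the attention and feed-forward layers are left out entirely, and the embedding matrix is reused to produce the scores purely to keep the sketch short.

```python
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

VOCAB_SIZE = 6      # hypothetical toy size (the real model has 50,257)
EMBED_DIM = 4       # hypothetical toy embedding width
MAX_POSITIONS = 8   # hypothetical maximum sequence length

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

token_embeddings = random_matrix(VOCAB_SIZE, EMBED_DIM)
position_embeddings = random_matrix(MAX_POSITIONS, EMBED_DIM)

def toy_forward(token_ids):
    logits = []
    for position, token_id in enumerate(token_ids):
        # Steps 1 and 2: embedding lookup plus position embedding
        hidden = [t + p for t, p in zip(token_embeddings[token_id],
                                        position_embeddings[position])]
        # Step 3 (attention and feed-forward layers) is omitted in this sketch
        # Step 4: one score per vocabulary entry, computed as a dot product
        logits.append([sum(h * w for h, w in zip(hidden, row))
                       for row in token_embeddings])
    return logits

toy_logits = toy_forward([2, 5, 1])
print(len(toy_logits), len(toy_logits[0]))  # positions, vocabulary size
```

The result has one row per input position and one column per vocabulary entry, mirroring the `[1, 5, 50257]` shape above without the batch dimension.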

## Step 5: Next Token Prediction

Now let's look at the model's prediction for the next token after our prompt:

In [7]:
# We want the predictions for the last position (after "transforming")
next_token_logits = logits[0, -1, :]

# Convert logits to probabilities
next_token_probs = torch.softmax(next_token_logits, dim=0)

# Get the top 10 most likely tokens
top_k = 10
topk_probs, topk_indices = torch.topk(next_token_probs, top_k)

# Convert to lists for easier handling
topk_probs = topk_probs.detach().numpy()
topk_indices = topk_indices.detach().numpy()

# Get the corresponding tokens
topk_tokens = [tokenizer.decode([idx]) for idx in topk_indices]

print("Top 10 Predictions for Next Token:")
print("-" * 40)
print(f"{'Token':<15} {'ID':<8} {'Probability':<10}")
print("-" * 40)
for i in range(top_k):
    print(f"{repr(topk_tokens[i]):<15} {topk_indices[i]:<8} {topk_probs[i]*100:.2f}%")

Top 10 Predictions for Next Token:
----------------------------------------
Token           ID       Probability
----------------------------------------
' the'          262      26.98%
' our'          674      6.10%
' human'        1692     2.45%
' people'       661      2.02%
' itself'       2346     1.83%
' technology'   3037     1.82%
' a'            257      1.54%
' us'           514      1.41%
' society'      3592     1.29%
' how'          703      1.24%


### What's happening here?

The model has read our phrase "Artificial intelligence is transforming" and made a guess about what might come next:

1. First, the model creates raw scores for every possible next token
2. We turn these scores into percentages (like 60%, 25%, 10%) so they're easier to understand
3. We look at just the top 10 tokens with the highest percentages

These percentages show what the model thinks should come next based on all the text it's seen before. A higher percentage means the model is more confident that word is a good fit to continue the sentence.
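
For readers who want to see the arithmetic behind `torch.softmax` and `torch.topk`, here is a small pure-Python sketch. The four scores in `toy_scores` are made up for illustration, and `softmax` and `top_k_pairs` are hypothetical helper names, not library functions.

```python
import math

def softmax(scores):
    # Subtract the largest score before exponentiating for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_pairs(probs, k):
    # Return (index, probability) pairs for the k largest probabilities
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_scores = [2.0, 0.5, -1.0, 3.0]  # hypothetical logits for a 4-token vocabulary
toy_probs = softmax(toy_scores)

for index, prob in top_k_pairs(toy_probs, 2):
    print(f"token {index}: {prob * 100:.2f}%")
```

The probabilities always sum to 1, and the ordering of the scores is preserved: the largest logit becomes the largest percentage.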

## Step 6: Token Selection

Now we need to select which token to use next. Let's look at different ways to do this:

In [8]:
# Method 1: Greedy selection (always pick the most likely token)
greedy_index = torch.argmax(next_token_probs).item()
greedy_token = tokenizer.decode([greedy_index])

# Method 2: Temperature sampling (adjust probability distribution)
temperature = 0.7  # Lower = more deterministic, Higher = more random
temp_logits = next_token_logits / temperature
temp_probs = torch.softmax(temp_logits, dim=0)

# Method 3: Top-k sampling (sample from k most likely tokens)
k = 5
topk_temp_probs, topk_indices = torch.topk(temp_probs, k)
topk_temp_probs = topk_temp_probs / topk_temp_probs.sum()  # Renormalize

# Let's select using temperature + top-k
sample_index = np.random.choice(topk_indices.detach().numpy(), p=topk_temp_probs.detach().numpy())
sample_token = tokenizer.decode([sample_index])

print("Token Selection Results:")
print(f"Greedy selection: '{greedy_token}' (always picks the most likely token)")
print(f"Temperature sampling: '{sample_token}' (randomly selects based on adjusted probabilities)")

# Show the top-k tokens with adjusted probabilities
print("\nTop-k tokens with temperature adjustment:")
print("-" * 40)
print(f"{'Token':<15} {'Original %':<12} {'Adjusted %':<12}")
print("-" * 40)
for i in range(k):
    token_id = topk_indices[i].item()
    token_text = tokenizer.decode([token_id])
    orig_prob = next_token_probs[token_id].item() * 100
    adj_prob = topk_temp_probs[i].item() * 100
    print(f"{repr(token_text):<15} {orig_prob:<12.2f} {adj_prob:<12.2f}")

Token Selection Results:
Greedy selection: ' the' (always picks the most likely token)
Temperature sampling: ' the' (randomly selects based on adjusted probabilities)

Top-k tokens with temperature adjustment:
----------------------------------------
Token           Original %   Adjusted %  
----------------------------------------
' the'          26.98        83.48       
' our'          6.10         9.97        
' human'        2.45         2.70        
' people'       2.02         2.06        
' itself'       1.83         1.79        


### What's happening here?

We're looking at two different ways to pick the next word:

1. **Greedy selection**: Always pick the most likely word. This is like always picking the safe choice. It's predictable, but can get boring and repetitive.

2. **Temperature sampling with Top-k**: Mix in some controlled randomness.
    - We can adjust how random we want to be (temperature)
    - We only consider the few most likely words (top-k)
    - Then we randomly choose from those words, giving better odds to more likely words

Using some randomness helps make the text more interesting and varied, instead of always saying the same thing when given the same starting point.
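
The "Adjusted %" column above can be reproduced by hand. Dividing logits by a temperature T before softmax is equivalent to raising each original probability to the power 1/T and renormalizing. The sketch below applies that identity to the five "Original %" values from the table. The variable names are made up for this example, and because the inputs are rounded to two decimals the results match the table only to within rounding.

```python
import random

# Top-5 probabilities copied from the "Original %" column above
original = {" the": 0.2698, " our": 0.0610, " human": 0.0245,
            " people": 0.0202, " itself": 0.0183}
temperature = 0.7

# softmax(logits / T) is proportional to p ** (1 / T)
sharpened = {tok: p ** (1 / temperature) for tok, p in original.items()}

# Renormalize over the five kept tokens (the top-k step)
total = sum(sharpened.values())
adjusted = {tok: value / total for tok, value in sharpened.items()}

for tok, p in adjusted.items():
    print(f"{tok!r:<12} {p * 100:.2f}%")

# Weighted random draw from the five candidates (varies from run to run)
choice = random.choices(list(adjusted), weights=list(adjusted.values()), k=1)[0]
print("sampled:", repr(choice))
```

With a temperature below 1, the gap between ' the' and the other candidates widens, which is why greedy selection and sampling often agree on this prompt.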

## Step 7: Building the Response

Now we'll see the complete text generation process in action, adding one token at a time:

In [9]:
def generate_step_by_step(prompt, max_new_tokens=5, temperature=0.7, top_k=5):
    """Generate text token by token with detailed output at each step"""
    # Start with the prompt
    current_text = prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    
    print(f"Starting prompt: '{prompt}'\n")
    
    # Generate new tokens one by one
    for i in range(max_new_tokens):
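        # Note: prompt.split() counts whitespace-separated words, not tokens,
        # so the token number in this label is only approximate
        print(f"--- Step {i+1}: Generating token #{len(prompt.split())+i+1} ---")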
        
        # Get model predictions
        with torch.no_grad():
            outputs = model(input_ids)
        
        # Get next token logits (predictions for the next token)
        next_token_logits = outputs.logits[0, -1, :]
        
        # Apply temperature
        next_token_logits = next_token_logits / temperature
        
        # Get top-k token indices and their probabilities
        topk_probs, topk_indices = torch.topk(torch.softmax(next_token_logits, dim=0), top_k)
        
        # Print the top candidates
        print("\nTop candidates:")
        for j in range(top_k):
            token_id = topk_indices[j].item()
            token_text = tokenizer.decode([token_id])
            token_prob = topk_probs[j].item() * 100
            print(f"  {j+1}. '{token_text}' (ID: {token_id}, Probability: {token_prob:.2f}%)")
        
        # Renormalize probabilities for top-k
        topk_probs = topk_probs / topk_probs.sum()
        
        # Sample from top-k
        chosen_idx = np.random.choice(topk_indices.detach().numpy(), p=topk_probs.detach().numpy())
        chosen_token = tokenizer.decode([chosen_idx])
        
        print(f"\nSelected token: '{chosen_token}'")
        
        # Update for next iteration
        next_token = torch.tensor([[chosen_idx]])
        input_ids = torch.cat([input_ids, next_token], dim=1)
        current_text += chosen_token
        
        print(f"Text so far: '{current_text}'\n")
    
    print(f"Final generated text: '{current_text}'")
    return current_text

# Generate text step by step
final_text = generate_step_by_step(prompt, max_new_tokens=5, temperature=0.7, top_k=5)

Starting prompt: 'Artificial intelligence is transforming'

--- Step 1: Generating token #5 ---

Top candidates:
  1. ' the' (ID: 262, Probability: 66.72%)
  2. ' our' (ID: 674, Probability: 7.97%)
  3. ' human' (ID: 1692, Probability: 2.16%)
  4. ' people' (ID: 661, Probability: 1.65%)
  5. ' itself' (ID: 2346, Probability: 1.43%)

Selected token: ' itself'
Text so far: 'Artificial intelligence is transforming itself'

--- Step 2: Generating token #6 ---

Top candidates:
  1. ' into' (ID: 656, Probability: 81.45%)
  2. ' to' (ID: 284, Probability: 5.44%)
  3. ' from' (ID: 422, Probability: 5.12%)
  4. '.' (ID: 13, Probability: 2.03%)
  5. ' in' (ID: 287, Probability: 1.84%)

Selected token: ' into'
Text so far: 'Artificial intelligence is transforming itself into'

--- Step 3: Generating token #7 ---

Top candidates:
  1. ' a' (ID: 257, Probability: 81.13%)
  2. ' an' (ID: 281, Probability: 8.05%)
  3. ' the' (ID: 262, Probability: 3.85%)
  4. ' something' (ID: 1223, Probability: 3.75%)

### What's happening here?

We've just witnessed the complete text generation process, with each step broken down:

1. We start with our prompt
2. For each new token:
   - The model processes all the text so far
   - It generates probabilities for the next token
   - We apply temperature and top-k filtering
   - We sample a token from the resulting distribution
   - The selected token is added to our text
   - We repeat until we reach our desired length

This shows how the model works in an auto-regressive manner - each new token depends on all the tokens that came before it.
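
Stripped of the model itself, the loop looks like the sketch below. `TOY_MODEL` is a hypothetical lookup table standing in for the network. Unlike a real LLM, it looks only at the last token, so it illustrates the feed-the-output-back-in structure and not the full dependence on all previous tokens.

```python
import random

# Hypothetical stand-in for the model: next-token probabilities that depend
# only on the last token. A real LLM conditions on the whole sequence.
TOY_MODEL = {
    "transforming": {"the": 0.7, "our": 0.3},
    "the": {"world": 0.6, "way": 0.4},
    "our": {"world": 0.5, "lives": 0.5},
    "world": {".": 1.0},
    "way": {".": 1.0},
    "lives": {".": 1.0},
    ".": {".": 1.0},
}

def toy_generate(prompt_tokens, max_new_tokens, rng):
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = TOY_MODEL[sequence[-1]]   # "run the model"
        candidates = list(distribution)
        weights = list(distribution.values())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]  # sample
        sequence.append(next_token)              # feed the choice back in
    return sequence

result = toy_generate(["is", "transforming"], max_new_tokens=3,
                      rng=random.Random(0))
print(" ".join(result))
```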

## The Effect of Generation Parameters

Different parameters can dramatically change the output. Let's experiment with a few:

In [10]:
# Function to generate text with different parameters
def generate_with_params(prompt, max_new_tokens=15, **params):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    
    # Set up generation parameters
    # Forward only the generation parameters that the caller supplied
    allowed = ('temperature', 'top_k', 'top_p', 'do_sample')
    gen_params = {key: params[key] for key in allowed if key in params}
    
    # Generate the output; max_new_tokens limits only the newly generated tokens
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        **gen_params
    )
    
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Try different parameter combinations
params_to_try = [
    {'name': 'Greedy (no sampling)', 'params': {'do_sample': False}},
    {'name': 'Low Temperature (0.3)', 'params': {'temperature': 0.3, 'do_sample': True}},
    {'name': 'High Temperature (1.5)', 'params': {'temperature': 1.5, 'do_sample': True}},
    {'name': 'Top-k (5)', 'params': {'top_k': 5, 'do_sample': True}},
    {'name': 'Top-p (0.9)', 'params': {'top_p': 0.9, 'do_sample': True}},
    {'name': 'Balanced', 'params': {'temperature': 0.7, 'top_k': 50, 'top_p': 0.9, 'do_sample': True}}
]

# Generate and display results
print("Effect of Generation Parameters:\n")

for setting in params_to_try:
    output = generate_with_params(prompt, **setting['params'])
    generated_part = output[len(prompt):]
    
    print(f"{setting['name']}")
    print(f"Parameters: {setting['params']}")
    print(f"Input: {prompt}")
    print(f"Generated: {generated_part}")
    print("-" * 80)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Effect of Generation Parameters:

Greedy (no sampling)
Parameters: {'do_sample': False}
Input: Artificial intelligence is transforming
Generated:  the way we think about our lives.







--------------------------------------------------------------------------------
Low Temperature (0.3)
Parameters: {'temperature': 0.3, 'do_sample': True}
Input: Artificial intelligence is transforming
Generated:  the way we interact with the world. It’s a new paradigm
--------------------------------------------------------------------------------
High Temperature (1.5)
Parameters: {'temperature': 1.5, 'do_sample': True}
Input: Artificial intelligence is transforming
Generated:  technology and our culture in such a way that technological breakthroughs today may provide
--------------------------------------------------------------------------------
Top-k (5)
Parameters: {'top_k': 5, 'do_sample': True}
Input: Artificial intelligence is transforming
Generated:  the world into a digital economy, with a 

### Generation Parameters Explained

- **do_sample**: When False, the model always picks the most likely token (greedy decoding). When True, it samples according to the probability distribution.

- **temperature**: Controls the randomness of predictions.
  - Lower values (e.g., 0.3) make the model more confident and deterministic
  - Higher values (e.g., 1.5) make the model more random and creative
  - Value of 1.0 keeps the original probabilities unchanged

- **top_k**: Limits the selection to only the k most likely next tokens.
  - Lower values (e.g., 5) focus on the most probable tokens
  - Higher values allow more diversity but might include less relevant tokens

- **top_p (nucleus sampling)**: Samples from the smallest set of most likely tokens whose cumulative probability reaches p.
  - Adapts the number of tokens considered based on the confidence of the model
  - Values around 0.9 are common and work well in practice
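
Top-p is the one parameter above that the step-by-step code did not implement, so here is a minimal sketch of the idea in plain Python. `top_p_filter` is a hypothetical helper, the two distributions are invented, and library implementations may treat boundary cases differently.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical distributions: one confident, one uncertain
confident = {"a": 0.92, "b": 0.05, "c": 0.02, "d": 0.01}
uncertain = {"a": 0.30, "b": 0.28, "c": 0.22, "d": 0.20}

kept_confident = top_p_filter(confident, 0.9)
kept_uncertain = top_p_filter(uncertain, 0.9)
print(len(kept_confident), len(kept_uncertain))
```

With the same `p`, the confident distribution keeps a single candidate while the uncertain one keeps all four, which is the adaptive behavior that a fixed top-k cannot provide.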

## The Complete LLM Pipeline

Let's summarize the entire text generation pipeline we've explored:

1. **Input Text**: We start with a text prompt that the model will continue

2. **Tokenization**: The tokenizer breaks the text into tokens (words, subwords, or characters)

3. **Token → ID Conversion**: Each token is converted to a numeric ID according to the model's vocabulary

4. **Model Processing**: The IDs are processed through the neural network architecture:
   - Embedding lookup for each token
   - Position information added
   - Multiple transformer layers process the sequence
   - Attention mechanisms focus on relevant parts of the input
   
5. **Next Token Prediction**: The model outputs probabilities for each possible next token

6. **Token Selection**: A token is selected based on these probabilities:
   - Greedy selection (most likely token)
   - Sampling with temperature/top-k/top-p for controlled randomness
   
7. **Add to Output**: The selected token is added to the generated text

8. **Repeat**: Steps 4-7 are repeated with the selected token's ID appended to the input, until we reach the desired length

## Summary: The LLM Pipeline End-to-End

We've now explored the complete text generation pipeline of a local LLM. This process demonstrates how an LLM generates text one token at a time, with each decision influenced by all previous tokens. 

The probabilistic nature of token selection (except in greedy decoding) explains why you can get different outputs from the same prompt - a key characteristic of working with AI systems.
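
When reproducible output matters, the usual remedy is to fix the random seed. The sketch below shows the idea with Python's standard `random` module; the candidates and weights are made up. In this notebook the analogous calls would be `np.random.seed(...)` for the step-by-step sampler and `torch.manual_seed(...)` before `model.generate`.

```python
import random

candidates = [" the", " our", " human"]   # hypothetical candidate tokens
weights = [0.8, 0.15, 0.05]               # hypothetical probabilities

def sample_sequence(seed, length=5):
    # An independent generator with a fixed seed yields repeatable draws
    rng = random.Random(seed)
    return [rng.choices(candidates, weights=weights, k=1)[0]
            for _ in range(length)]

run_a = sample_sequence(seed=42)
run_b = sample_sequence(seed=42)
print(run_a == run_b)  # same seed, same draws
```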

Understanding this pipeline helps you:
- Debug issues in text generation
- Tune the quality and variety of the output by adjusting generation parameters
- Design more effective prompts
- Better integrate LLMs into your applications