# Putting the LLM Pipeline Together : Step by Step

Complete process of text generation with a local LLM :

Input text → Tokenization → Converting to IDs → Model processing → Next token prediction → Token selection → Building the response

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import torch
import os

In [None]:
# load model
llm_directory = "./downloaded_llms/distilgpt2_model"
model = AutoModelForCausalLM.from_pretrained(llm_directory)
tokenizer = AutoTokenizer.from_pretrained(llm_directory)
tokenizer.pad_token = tokenizer.eos_token

## Step 1: Input Text

In [3]:
# simple prompt
prompt = "Artificial intelligence is transforming"

## Step 2 : Tokenization - Breaking Text into Pieces
The tokenizer breaks our text into smaller units (tokens) that the model can process.

In [4]:
tokens = tokenizer.tokenize(prompt)
print(f"Tokenization Result : {tokens}")

Tokenization Result : ['Art', 'ificial', 'Ġintelligence', 'Ġis', 'Ġtransforming']


The tokenizer has split the input text into tokens :
- Some tokens have a 'Ġ' prefix - this represents a space before the word
- The word "Artificial" is split into two subword tokens, 'Art' and 'ificial'
- The word "transforming" is kept as a single token ('Ġtransforming'), most likely because it's common enough in the tokenizer's training data
- A less common word is more likely to be split into several subword tokens
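
To make the 'Ġ' marker concrete, here is a minimal sketch that rebuilds the prompt from the tokens printed above. The helper name `detokenize` is hypothetical, and the sketch only handles the 'Ġ' space marker; the real tokenizer works at the byte level and covers many more cases.

```python
# Tokens exactly as printed by tokenizer.tokenize(prompt) above
tokens = ["Art", "ificial", "Ġintelligence", "Ġis", "Ġtransforming"]

def detokenize(tokens):
    """Join subword tokens and turn each 'Ġ' marker back into a space.

    Hypothetical helper for illustration only, not part of transformers.
    """
    return "".join(tokens).replace("Ġ", " ")

text = detokenize(tokens)
print(text)
```

Joining the pieces without any separator works because the space information already lives inside the tokens themselves.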

## Step 3 : Converting Tokens to IDs
Next, each token is converted to its corresponding numeric ID from the vocabulary.

In [9]:
# Convert tokens to IDs
input_ids = tokenizer.encode(prompt, return_tensors="pt")
print("Token to ID :")
for token,ids in zip(tokens, input_ids[0].tolist()):
    print(f"- '{token}' -> ID {ids}")
    
# Show the tensor format that will be input to the model
print()
print("Model input tensor shape:", input_ids.shape)
print("Model input tensor:", input_ids)

Token to ID :
- 'Art' -> ID 8001
- 'ificial' -> ID 9542
- 'Ġintelligence' -> ID 4430
- 'Ġis' -> ID 318
- 'Ġtransforming' -> ID 25449

Model input tensor shape: torch.Size([1, 5])
Model input tensor: tensor([[ 8001,  9542,  4430,   318, 25449]])


Each token has been converted to a numeric ID according to the model's vocabulary. 

These IDs are what the model actually processes - it doesn't understand the text directly, only these numbers.

The IDs are then formatted as a PyTorch tensor with shape [1, n_tokens], where the leading 1 is the batch size (a single prompt) - this is the actual input format the model expects.
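
The lookup can be pictured as a plain dictionary. The sketch below uses a toy vocabulary (`toy_vocab`, an assumption for illustration) that contains only the five token/ID pairs printed above, and nested lists instead of a PyTorch tensor.

```python
# Toy vocabulary with only the five pairs printed above.
# The real vocabulary of this model has 50,257 entries.
toy_vocab = {
    "Art": 8001,
    "ificial": 9542,
    "Ġintelligence": 4430,
    "Ġis": 318,
    "Ġtransforming": 25449,
}

tokens = ["Art", "ificial", "Ġintelligence", "Ġis", "Ġtransforming"]

# Token -> ID lookup
ids = [toy_vocab[token] for token in tokens]

# Wrap in an outer list to add the batch dimension: shape [1, n_tokens]
batch = [ids]
shape = (len(batch), len(batch[0]))

print("IDs   :", batch)
print("Shape :", shape)
```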

## Step 4: Model Processing
Now the model processes these IDs through its neural network layers.

In [10]:
# Run the model on our input
with torch.no_grad():
    output = model(input_ids)

# The model outputs logits (raw, unnormalized scores) for every vocabulary token at each position
logits = output.logits
print(f"- Output logits shape : {logits.shape}")
print(f"- This means we have predictions for {logits.shape[1]} positions")
print(f"- For each position, we have scores for all {logits.shape[2]} tokens in the vocabulary")

- Output logits shape : torch.Size([1, 5, 50257])
- This means we have predictions for 5 positions
- For each position, we have scores for all 50257 tokens in the vocabulary


Inside the model, here's what's happening :

- The number IDs for each token get turned into lists of numbers that represent their meaning (each ID is converted into an embedding vector).
- The model adds information about where each token appears in the sentence - first, second, third, etc. (with position embeddings).
- The model then processes this information through several layers that:
    - Figure out which tokens should pay attention to each other - like how "is" relates to "intelligence" (self-attention mechanisms).
    - Refine each token's representation using what attention gathered (feed-forward layers).
- Finally, the model produces a list of scores covering every token in its vocabulary, once for each position of the input.

The output is basically a big scorecard : the higher a token's score, the more likely the model thinks that token comes next.

Since this model's vocabulary holds 50,257 different words or word pieces, it gives a score to each one of them. The scores are stored in token ID order, not sorted from most likely to least likely.
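
The data flow above can be sketched with toy numbers. Everything in this block is an assumption made for illustration : the sizes, the random weights and the function name `toy_forward`. The attention and feed-forward layers are deliberately skipped, so this only shows how IDs become one score list per position.

```python
import random

VOCAB_SIZE = 6      # toy value; the real model reports 50257 above
EMBED_DIM = 4       # toy value chosen for illustration
MAX_POSITIONS = 8   # toy value chosen for illustration

rng = random.Random(0)

def random_matrix(rows, cols):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

token_embeddings = random_matrix(VOCAB_SIZE, EMBED_DIM)
position_embeddings = random_matrix(MAX_POSITIONS, EMBED_DIM)
output_weights = random_matrix(VOCAB_SIZE, EMBED_DIM)

def toy_forward(token_ids):
    """Return one list of VOCAB_SIZE scores per input position."""
    logits = []
    for position, token_id in enumerate(token_ids):
        # 1) Embedding lookup, 2) add position information
        hidden = [
            t + p
            for t, p in zip(token_embeddings[token_id], position_embeddings[position])
        ]
        # 3) Skipped here: attention and feed-forward layers would update `hidden`
        # 4) Score every vocabulary entry with a dot product
        scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
        logits.append(scores)
    return logits

toy_logits = toy_forward([2, 0, 5, 1, 3])
print("positions :", len(toy_logits))
print("scores per position :", len(toy_logits[0]))
```

The result has the same layout as the real logits, minus the batch dimension : one row per input position and one column per vocabulary entry.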

## Step 5 : Next Token Prediction
Now let's look at the model's prediction for the next token after the prompt.

In [11]:
# We want the predictions for the last position (after "transforming")
next_token_logit = logits[0, -1, :]

# Convert logits to probabilities
next_token_proba = torch.softmax(next_token_logit, dim=0)

# Get the top 10 most likely tokens
top_k = 10
top_k_proba, top_k_idx = torch.topk(next_token_proba, top_k)

# Convert to lists for easier handling
top_k_proba = top_k_proba.detach().numpy()
top_k_idx = top_k_idx.detach().numpy()

# Get the corresponding tokens
top_k_tokens = [tokenizer.decode([idx]) for idx in top_k_idx]
print("Top 10 Predictions for Next Token :")
print("-" * 40)
print(f"{'Token':<15} {'ID':<8} {'Probability':<10}")
print("-" * 40)
for i in range(top_k):
    print(f"{repr(top_k_tokens[i]):<15} {top_k_idx[i]:<8} {top_k_proba[i]*100:.2f}%")

Top 10 Predictions for Next Token :
----------------------------------------
Token           ID       Probability
----------------------------------------
' the'          262      26.98%
' our'          674      6.10%
' human'        1692     2.45%
' people'       661      2.02%
' itself'       2346     1.83%
' technology'   3037     1.82%
' a'            257      1.54%
' us'           514      1.41%
' society'      3592     1.29%
' how'          703      1.24%


The model has read the prompt phrase "Artificial intelligence is transforming" and made a guess about what might come next:
- First, the model creates a raw score (logit) for every possible next token
- Softmax turns these scores into probabilities that add up to 100% (like 60%, 25%, 10%) so they're easier to understand
- We then look at just the top 10 tokens with the highest probability

These probabilities show what the model thinks should come next based on all the text it saw during training.

A higher percentage means the model is more confident that the token is a good fit to continue the sentence. Here ' the' leads with 26.98%, far ahead of ' our' at 6.10%.
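
For readers who want to see what softmax and top-k do without PyTorch, here is a small pure-Python sketch. The scores in `toy_logits` are made up, and the helper names are assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
toy_probs = softmax(toy_logits)

for index, prob in top_k(toy_probs, 3):
    print(f"token {index} : {prob * 100:.2f}%")
```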

## Step 6 : Token Selection
Now we need to select which token to use next.

In [34]:
# Method 1: Greedy selection (always pick the most likely token)
greedy_idx = torch.argmax(next_token_proba)
greedy_token = tokenizer.decode(greedy_idx)

# Method 2: Temperature sampling (adjust probability distribution)
temperature = .7 # Lower = more deterministic, Higher = more random
temp_logits = next_token_logit / temperature
temp_token_proba = torch.softmax(temp_logits, dim=0)

# Method 3: Top-k sampling (sample from k most likely tokens)
k = 5
temp_proba, temp_idx = torch.topk(temp_token_proba, k)
top_k_temp_proba = temp_proba / temp_proba.sum()

# Let's select using temperature + top-k
sample_idx = np.random.choice(temp_idx.numpy(), p=top_k_temp_proba.numpy())
sample_token = tokenizer.decode(sample_idx)

print(f"Greedy selection : '{greedy_token}' (always picks the most likely token)")
print(f"Temperature sampling : '{sample_token}' (randomly selects based on adjusted probabilities)")
print()
# Show the top-k tokens with adjusted probabilities
print("Top-k tokens with temperature adjustment :")
print("-" * 40)
print(f"{'Token':<15} {'Original %':<12} {'Adjusted %':<12}")
print("-" * 40)
for i in range(k):
    token_id = temp_idx[i].item()
    token_text = tokenizer.decode([token_id])
    orig_prob = next_token_proba[token_id].item() * 100
    adj_prob = top_k_temp_proba[i].item() * 100
    print(f"{repr(token_text):<15} {orig_prob:<12.2f} {adj_prob:<12.2f}")

Greedy selection : ' the' (always picks the most likely token)
Temperature sampling : ' the' (randomly selects based on adjusted probabilities)

Top-k tokens with temperature adjustment :
----------------------------------------
Token           Original %   Adjusted %  
----------------------------------------
' the'          26.98        83.47       
' our'          6.10         9.97        
' human'        2.45         2.71        
' people'       2.02         2.06        
' itself'       1.83         1.79        


There are two different ways to pick the next token :
1. **Greedy selection** : Always pick the most likely token. This is like always picking the safe choice. It's predictable, but can get boring and repetitive.
2. **Temperature sampling** with top-k : Mix in some controlled randomness.
    - We can adjust how random we want to be (temperature)
    - We only consider the few most likely tokens (top-k)
    - Then we randomly choose from those tokens, giving better odds to more likely ones

In the table above, dividing the logits by a temperature of 0.7 and renormalizing over the top 5 tokens raises ' the' from 26.98% to 83.47%.

Using some randomness helps make the text more interesting and varied, instead of always saying the same thing when given the same starting point.
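
The two strategies can also be written without PyTorch or NumPy. This is a minimal sketch under stated assumptions : `toy_logits` holds made-up scores, and `sample_top_k` is a hypothetical helper, not a transformers function.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(logits, temperature=0.7, k=5, rng=random):
    """Pick a token index using temperature scaling followed by top-k sampling."""
    scaled = [x / temperature for x in logits]  # temperature < 1 sharpens the distribution
    probs = softmax(scaled)
    # Keep only the indices of the k most likely tokens
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices uses relative weights, so no explicit renormalization is needed
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [3.0, 2.0, 1.0, 0.5, 0.0, -1.0]  # made-up scores

# Greedy : always the index with the highest score
greedy_choice = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

# Temperature + top-k : random, but restricted to the 3 best candidates
sampled_choice = sample_top_k(toy_logits, temperature=0.7, k=3, rng=random.Random(0))

print("greedy  :", greedy_choice)
print("sampled :", sampled_choice)
```

Passing a seeded `random.Random` makes the sampled result reproducible, which is handy when comparing settings.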

## Step 7 : Building the Response
Here is the complete text generation process in action, adding one token at a time.

In [60]:
def generate_step_by_step(prompt, max_new_tokens=5, temperature=.7, top_k=5):
    """
    Generate text token by token with detailed output at each step

    Args:
        prompt (str): input text
        max_new_tokens (int, optional): number of tokens to generate. Defaults to 5.
        temperature (float, optional): level of randomness. Defaults to .7.
        top_k (int, optional): number of highest-probability tokens to sample from. Defaults to 5.
    """
    # Start with the prompt
    current_text = prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    
    print(f"Starting prompt : '{prompt}'")
    print()
    
    # Generate new tokens one by one
    for i in range(max_new_tokens):
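        # Note : len(prompt.split()) counts words, not tokens (this prompt has 4 words but 5 tokens),
        # so the token number shown here is only approximate
        print(f"--- Step {i + 1} : Generating token #{len(prompt.split()) + i + 1} ---")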
        with torch.no_grad():
            outputs = model(input_ids)
        
        # Get next token logits (predictions for the next token)
        next_token_logits = outputs.logits[0, -1, :]
        
        # Apply temperature
        next_token_logits = next_token_logits / temperature
        
        # Get top-k token indices and their probabilities
        tokens_proba = torch.softmax(next_token_logits, dim=0)
        top_k_proba, top_k_idx = torch.topk(tokens_proba, top_k)
        
        # Print the top candidates
        print()
        print("Top candidates :")
        # Use a separate loop variable so the outer step counter `i` is not overwritten
        for rank in range(top_k):
            token_id = top_k_idx[rank].item()
            token_text = tokenizer.decode(token_id)
            token_proba = top_k_proba[rank].item()
            print(f"- Token {rank + 1} : {token_text}, (ID : {token_id}), Probability : {token_proba * 100:.2f}")
            
        # Renormalize probabilities for top-k
        top_k_proba = top_k_proba / top_k_proba.sum()
        
        # Sample from top-k
        sample_idx = np.random.choice(top_k_idx.numpy(), p=top_k_proba.numpy())
        sample_token = tokenizer.decode([sample_idx])
        
        print()
        print(f"Selected token : '{sample_token}'")
        
        # Update for next iteration
        next_token = torch.tensor([[sample_idx]])
        input_ids = torch.cat([input_ids, next_token], dim=1)
        current_text = current_text + sample_token
        
        print(f"Text so far : '{current_text}'")
        print()
    print(f"Final generated text : '{current_text}'")
    return current_text

In [61]:
# Generate text step by step
final_text = generate_step_by_step(prompt, max_new_tokens=5, temperature=.7, top_k=5)

Starting prompt : 'Artificial intelligence is transforming'

--- Step 1 : Generating token #5 ---

Top candidates :
- Token 1 :  the, (ID : 262), Probability : 66.72
- Token 2 :  our, (ID : 674), Probability : 7.97
- Token 3 :  human, (ID : 1692), Probability : 2.16
- Token 4 :  people, (ID : 661), Probability : 1.65
- Token 5 :  itself, (ID : 2346), Probability : 1.43

Selected token : ' the'
Text so far : 'Artificial intelligence is transforming the'

--- Step 2 : Generating token #6 ---

Top candidates :
- Token 1 :  way, (ID : 835), Probability : 45.63
- Token 2 :  world, (ID : 995), Probability : 33.67
- Token 3 :  lives, (ID : 3160), Probability : 4.88
- Token 4 :  human, (ID : 1692), Probability : 2.45
- Token 5 :  workplace, (ID : 15383), Probability : 0.91

Selected token : ' world'
Text so far : 'Artificial intelligence is transforming the world'

--- Step 3 : Generating token #7 ---

Top candidates :
- Token 1 :  of, (ID : 286), Probability : 37.49
- Token 2 : ., (ID : 13), 

Complete text generation process, with each step broken down :

- Start with a prompt
- For each new token :
    - The model processes all the text so far
    - It generates probabilities for the next token
    - We apply temperature and top-k filtering
    - We sample a token from the resulting distribution
    - The selected token is added to our text
- Repeat until we reach the desired number of new tokens

This shows how the model works in an auto-regressive manner - each new token depends on all the tokens that came before it.
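
The loop itself is independent of the model. In the sketch below, `toy_next_token_scores` is a made-up stand-in for the real model call, chosen so that the result depends on the whole context, and selection is greedy to keep the output deterministic.

```python
def toy_next_token_scores(token_ids):
    """Stand-in for the model : score each of 4 toy tokens given the full context."""
    # Made-up rule : favor the token whose ID equals the sum of the context modulo 4
    favorite = sum(token_ids) % 4
    return [1.0 if i == favorite else 0.0 for i in range(4)]

def generate_greedy(prompt_ids, max_new_tokens):
    """Append one token at a time, feeding the growing sequence back in."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(token_ids)                 # process all tokens so far
        next_id = max(range(len(scores)), key=lambda i: scores[i])  # greedy selection
        token_ids.append(next_id)                                 # extend the input
    return token_ids

generated = generate_greedy([0, 1], max_new_tokens=4)
print(generated)
```

Each appended token changes the context, which in turn changes the scores for the following step.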

## The Effect of Generation Parameters
Different parameters can dramatically change the output. Let's experiment with a few.

In [64]:
# Function to generate text with different parameters
def generate_with_params(prompt, max_new_tokens=15, **params):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Keep only the generation parameters we want to experiment with
    allowed = ('temperature', 'top_k', 'top_p', 'do_sample')
    gen_params = {key: params[key] for key in allowed if key in params}

    # Generate the output
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,  # counts only the newly generated tokens
        pad_token_id=tokenizer.eos_token_id,
        **gen_params
    )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Try different parameter combinations
params_to_try = [
    {'name': 'Greedy (no sampling)', 'params': {'do_sample': False}},
    {'name': 'Low Temperature (0.3)', 'params': {'temperature': 0.3, 'do_sample': True}},
    {'name': 'High Temperature (1.5)', 'params': {'temperature': 1.5, 'do_sample': True}},
    {'name': 'Top-k (5)', 'params': {'top_k': 5, 'do_sample': True}},
    {'name': 'Top-p (0.9)', 'params': {'top_p': 0.9, 'do_sample': True}},
    {'name': 'Balanced', 'params': {'temperature': 0.7, 'top_k': 50, 'top_p': 0.9, 'do_sample': True}}
]

# Generate and display results
print("Effect of Generation Parameters :")
print()

for setting in params_to_try:
    output = generate_with_params(prompt, **setting['params'])
    generated_part = output[len(prompt):]
    
    print(f"{setting['name']}")
    print(f"Parameters: {setting['params']}")
    print(f"Input: {prompt}")
    print(f"Generated: {generated_part}")
    print("-" * 80)

Effect of Generation Parameters :

Greedy (no sampling)
Parameters: {'do_sample': False}
Input: Artificial intelligence is transforming
Generated:  the way we think about our lives.







--------------------------------------------------------------------------------
Low Temperature (0.3)
Parameters: {'temperature': 0.3, 'do_sample': True}
Input: Artificial intelligence is transforming
Generated:  the way we think about the world.







--------------------------------------------------------------------------------
High Temperature (1.5)
Parameters: {'temperature': 1.5, 'do_sample': True}
Input: Artificial intelligence is transforming
Generated:  the way humans navigate the world, and we may still learn a thing or
--------------------------------------------------------------------------------
Top-k (5)
Parameters: {'top_k': 5, 'do_sample': True}
Input: Artificial intelligence is transforming
Generated:  the way we live, we know it, we are living, we know
--------------------------

- **do_sample** : When False, the model always picks the most likely token (greedy decoding). When True, it samples according to the probability distribution.

- **temperature** : Controls the randomness of predictions
    - Lower values (e.g., 0.3) sharpen the distribution, so the output is more deterministic
    - Higher values (e.g., 1.5) flatten the distribution, so the output is more random and creative
    - Value of 1.0 keeps the original probabilities unchanged

- **top_k** : Limits the selection to only the k most likely next tokens
    - Lower values (e.g., 5) focus on the most probable tokens
    - Higher values allow more diversity but might include less relevant tokens

- **top_p** (nucleus sampling) : Selects from the smallest set of tokens whose cumulative probability reaches at least p
    - Adapts the number of tokens considered based on the confidence of the model
    - Values around 0.9 are common and work well in practice
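
To see how top_p adapts to the model's confidence, here is a minimal pure-Python sketch of nucleus filtering. The function name `top_p_filter` and both probability lists are assumptions for illustration; the library's own implementation works on logits and may differ in details.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p.

    Returns a dict {token_index: renormalized_probability}.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept = []
    cumulative = 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

# Made-up distributions over a 4-token vocabulary
confident = top_p_filter([0.95, 0.03, 0.01, 0.01], p=0.9)
uncertain = top_p_filter([0.40, 0.30, 0.25, 0.05], p=0.9)

print("confident model keeps", len(confident), "token(s)")
print("uncertain model keeps", len(uncertain), "token(s)")
```

With the same p, a peaked distribution keeps a single candidate while a flatter one keeps several, which is the adaptive behavior a fixed top_k cannot provide.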