# LIN 353D/CS 378 Fall 2025

# Homework 1: Basics of LM-based decoding

**Authors:** Kanishka Misra, Jessy Li

#### Notes:

For this homework you will hand in (upload) to Canvas:
- a notebook renamed ``hw1_YourEID.ipynb``

__Before submitting__, please reset your kernel and rerun everything from the beginning (`Kernel` >> `Restart and Run All`) and ensure your code outputs what you expect.

The maximum number of points for this homework is 50. For programming tasks, make sure your notebook does not have any empty outputs or output errors; otherwise, the points for that problem will be automatically deducted.

Review the extension and academic dishonesty policies, as well as the AI policy, here: https://jessyli.com/courses/lin353d

For typing up your written answers, information about Markdown cells in Jupyter Notebooks can be found here: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html

---
This homework will cover the bread (just the bread, the butter is later) of acquiring and analyzing LM behavior, which in most settings means the responses LMs generate given some input text. The goal is to first understand the ways in which one can get responses out of LMs, understand their parameters and how changes to them affect generations, and perform very basic analysis on top of the LM generations. These will give you the basic tools that may be useful in your projects. In particular, we will cover:

1. A few decoding algorithms (greedy decoding, top-p decoding, min-p decoding) and their hyperparameters.
2. Two very rudimentary metrics used in the analysis of LM generations.
3. Comparative analysis in writing, in a concise and analytically sound manner.


#### Software Prerequisites:

- torch
- diversity
- transformers

In [None]:
import torch
import transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
transformers.__version__ # make sure this is 4.50 or above.

### On randomness

This assignment involves stochasticity: if you rerun the sampling-based decoding algorithms, you will get a different output each time. To maintain consistency and reproducibility, we will fix a random seed (**let's have it be "the meaning of life", aka 42 -- this is important since it determines the "validity" of the generated responses**). To that end, we will run the following code before each sampling-based decoding function call:

In [None]:
transformers.set_seed(42)
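
To see why re-seeding before every call matters, here is a minimal sketch that uses only Python's built-in `random` module as a stand-in for the model's sampler (an assumption made for illustration; `transformers.set_seed` also seeds NumPy and PyTorch). The token list, the weights, and the helper `sample_sequence` are made up. Re-seeding reproduces the same draws, while skipping it continues from whatever state the generator is in:

```python
import random

# Hypothetical toy "vocabulary" and next-token weights, purely for illustration.
tokens = ["apple", "giraffe", "forest", "sun"]
weights = [0.4, 0.3, 0.2, 0.1]

def sample_sequence(n, seed=None):
    """Draw n tokens; re-seed the generator first if a seed is given."""
    if seed is not None:
        random.seed(seed)
    return random.choices(tokens, weights=weights, k=n)

first = sample_sequence(5, seed=42)
second = sample_sequence(5, seed=42)  # same seed, so the same draws
third = sample_sequence(5)            # no re-seed: continues from the current state

print(first == second)  # True
```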

## Step 1: Setup (loading models, basic preprocessing)

We will be working with two fixed models that should ideally fit in the memory of your local machine:

Model 1: "HuggingFaceTB/SmolLM2-135M"

Model 2: "HuggingFaceTB/SmolLM2-135M-Instruct"

The following code loads the required models. In case you want to switch between them, make that change here. For the test cases we will be using **Model 2**, or the **instruct-tuned** model.

In [None]:
MODELS = {
    'base': "HuggingFaceTB/SmolLM2-135M",
    'instruct': "HuggingFaceTB/SmolLM2-135M-Instruct"
}

selected = 'instruct'

tokenizer = AutoTokenizer.from_pretrained(MODELS[selected])
lm = AutoModelForCausalLM.from_pretrained(MODELS[selected])

Also, since instruct-tuned models often require inputs to be formatted in a specific way, we have provided a generic preprocessing function which will either 1) leave inputs as is (base) or 2) preprocess them for an instruct-tuned model, whose inputs usually consist of roles like "system", "user", and "assistant" along with the message itself. The latter is done with the function `apply_chat_template`, which looks up the chat format for a given model so that its outputs are sensible and comparable with those of other LMs.

In [None]:
def preprocess(prompt, model_type=selected):
    if model_type == 'base':
        pass
    else:
        messages = [{"role": "user", "content": prompt}]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    return prompt

In [None]:
# Test it out:
preprocess("Write a story about an apple and a giraffe:")
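
As a rough picture of what a chat template does, the sketch below wraps each message in role markers. The marker strings and the function name `toy_chat_template` are hypothetical and chosen only for illustration; the actual template ships with each model's tokenizer and may differ (for example, it may add a default system message), so inspect the output of `preprocess` above for the real format.

```python
def toy_chat_template(messages, start="<|start|>", end="<|end|>"):
    """Render a list of {"role", "content"} dicts into one prompt string.

    The start/end markers are placeholders, not any real model's tokens.
    """
    parts = []
    for message in messages:
        parts.append(f"{start}{message['role']}\n{message['content']}{end}\n")
    return "".join(parts)

messages = [{"role": "user", "content": "Write a story about an apple and a giraffe:"}]
print(toy_chat_template(messages))
```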

## Step 2: A basic decoding function

Each decoding algorithm should be implemented as a function that can be passed as the `logits_processor` argument of `decode()` below. It should take a logits tensor (the logits at the last token position) plus some optional arguments (when appropriate), apply the appropriate modification, and return the modified logits tensor.

This function will be used for both greedy decoding and sampling-based algorithms, so design your logits processors appropriately! E.g., the sampling-based algorithms (top-p and min-p) are run with the `do_sample` boolean flag set to `True`.

In [None]:
def decode(
    model, tokenizer, prompt, max_new_tokens, logits_processor, 
    do_sample=False, temperature = 1.0, **kwargs
):
    '''A basic decoding function that processes an input prompt (string), and produces a sequence of at most
    <max_new_tokens> tokens (fewer if an end-of-sequence token is generated), using the logic specified by
    <logits_processor> (a function).
    
    Args:
    
        model: An object of type AutoModelForCausalLM, denoting the model.
        tokenizer: An object of type AutoTokenizer, denoting the tokenizer.
        prompt (str): A string denoting the input to the model.
        max_new_tokens (int): An integer denoting how many tokens after the input should be generated.
        logits_processor (callable): the function that defines how the next token will be generated, 
            using the manipulation of the next-token logits
        do_sample (bool; default = False): a boolean flag specifying if sampling is needed
        temperature (float; default = 1.0): Temperature value
        **kwargs: all other keyword arguments (e.g., top_p and min_p)
        
    Returns:
        a string denoting the generated text
    '''
    assert temperature != 0.0, "temperature cannot be set to 0"
    
    model_inputs = tokenizer(prompt, return_tensors='pt')
    offset = model_inputs.input_ids.shape[1]
    input_ids = model_inputs.input_ids
    
    eos_token_id = tokenizer.eos_token_id
    
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits
        last_token_logits = logits[0, -1, :].unsqueeze(0)
        if temperature != 1.0:
            last_token_logits = last_token_logits/temperature
        
        processed = logits_processor(last_token_logits, **kwargs)
        
        if do_sample:
            probs = torch.nn.functional.softmax(processed, dim=-1)
            next_token_id = torch.multinomial(probs, num_samples=1).unsqueeze(0)
        else:
            # when we aren't doing sampling, simply return the top predicted next token.
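            if temperature != 1.0:
                import warnings  # local import so this cell stays self-contained
                # Warn rather than raise, so that greedy decoding still completes.
                warnings.warn("Temperature value will not be taken into account since do_sample was set to False. Set do_sample to True to allow temperature to have an effect.")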
            next_token_id = torch.argmax(processed, dim=-1).unsqueeze(0)

        input_ids = torch.cat([input_ids, next_token_id], dim=-1)

        if next_token_id.item() == eos_token_id:
            break
            
    generated_text = tokenizer.decode(input_ids[0, offset:], skip_special_tokens=True)
    return generated_text
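
Because `decode()` applies a softmax to whatever the processor returns, a processor can rule out tokens by setting their logits to negative infinity, which gives them zero probability. The sketch below illustrates this with top-k filtering, a different technique from the ones assigned in this homework, and on plain Python lists rather than tensors (both are simplifications for illustration; the helper names `softmax` and `toy_top_k_processor` are hypothetical).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats (-inf entries map to 0)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_top_k_processor(logits, k):
    """Keep the k largest logits and set the rest to -inf."""
    # Indices of the k largest logits (ties broken by position).
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]

logits = [2.0, 1.0, 0.5, -1.0]
filtered = toy_top_k_processor(logits, k=2)
probs = softmax(filtered)
print(filtered)  # [2.0, 1.0, -inf, -inf]
print(probs)     # masked tokens get probability 0.0; the rest renormalise
```

The same masking idea carries over to tensors, where the kept set is chosen by a different rule for each algorithm.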

## Question 1: Greedy Decoding (3pts)

Task: Implement basic greedy decoding. **Hint:** Pay close attention to how `decode()` is implemented above!

In [None]:
def greedy_logits_processor(logits):
    raise NotImplementedError # replace with actual implementation

In [None]:
# Test Case 1:
raw_prompt = "Write a story about an apple and a giraffe:"
prompt = preprocess(raw_prompt, model_type='instruct')

greedy_generation = decode(lm, tokenizer, prompt, 40, greedy_logits_processor, do_sample=False)

# should not print an error
assert greedy_generation == 'assistant\nOnce upon a time, in a lush forest, there lived a majestic giraffe named Giraffe. Giraffe was a gentle giant with a long neck and a long, white'

## Question 2: Top-p sampling (6pts)

Task: Implement top-p sampling. Your function should take the `top_p` parameter and use it to process the logits vector it receives (the logits at the last token position) according to how top-p sampling works.

In [None]:
def top_p_logits_processor(logits, top_p):
    raise NotImplementedError # replace with actual implementation

In [None]:
# Test Case 2:
raw_prompt = "Write a story about an apple and a giraffe:"
prompt = preprocess(raw_prompt, model_type='instruct')

transformers.set_seed(42)
top_p_generation = decode(lm, tokenizer, prompt, 40, top_p_logits_processor, do_sample=True, top_p = 0.9)

# should not print an error
assert top_p_generation == 'assistant\nThe sun was setting over the vast expanse of the desert, casting a golden glow over the parched earth. Amidst the rolling hills and rocky outcroppings, two long-tailed'

## Question 3: Min-p sampling (6pts)

Task: Implement min-p sampling. Your function should take the `min_p` parameter and use it to process the logits vector it receives (the logits at the last token position) according to how min-p sampling works.

In [None]:
def min_p_logits_processor(logits, min_p):
    raise NotImplementedError # replace with actual implementation

In [None]:
# Test Case 3:
raw_prompt = "Write a story about an apple and a giraffe:"
prompt = preprocess(raw_prompt, model_type='instruct')

transformers.set_seed(42)
min_p_generation = decode(lm, tokenizer, prompt, 40, min_p_logits_processor, do_sample=True, min_p = 0.2)

# should not error out
assert min_p_generation == 'assistant\nThe sun was setting over the rolling hills, casting a warm orange glow over the landscape. As the last wisps of sunlight dissipated, a lone apple, its smooth, vel'

## Question 4: At what value for `top_p` and `min_p` do you get the same generation as in greedy decoding? (5pts)

Answer by specifying both values separately, and verify by running the three algorithms using those values, showing that they return the same response.

**Top p value:** `<your answer here>`

**Min p value:** `<your answer here>`

For your verification code, choose your own prompt and specify any value of `max_new_tokens`.

In [None]:
## <your code here>

## Question 5 (open ended): What do these algorithms do? (20pts)


Now that we have decoding algorithms at hand, we can get outputs using them and observe their effects.
Sample 10 outputs from each of these decoding algorithms given the same prompt and answer the following questions:

**(a)** Qualitatively, what differences do you see among the samples from the different algorithms? Ground your analysis in the formulations and algorithms. (4pts)

`<your answer here>`

**(b)** Quantitatively, propose a way to measure the intuition you have above. Define the measure you have chosen, implement it, and calculate results. Do you see the same trends as your qualitative analysis in part (a)? (8pts)

In [None]:
## <your code here>

`<your analysis here>`

**(c)** Now, pick one of the algorithms above and vary the `temperature` parameter. What is the effect on the outputs and why? (Make sure to test a wide range of values that are somewhat sensible to see an effect). (8pts)

In [None]:
## <your code here>

`<your analysis here>`

## Question 6: Perplexity (10pts)

Choose a response generated by either the base model or the instruct-tuned model using greedy decoding. Calculate the perplexity of this response using both the base model and the instruct-tuned model, and compare these values.

You can use the perplexity calculation as defined here: https://huggingface.co/docs/transformers/en/perplexity
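
As a reminder of the definition, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a made-up list of per-token probabilities in plain Python (the helper name `perplexity` and the numbers are assumptions for illustration); for this question the probabilities would instead come from the model, as in the linked guide.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to each token of a 4-token response.
token_probs = [0.5, 0.25, 0.5, 0.25]
print(perplexity(token_probs))
```

A model that assigned probability 0.25 to every token would have a perplexity of 4, i.e. it would be as uncertain as a uniform choice among four options.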

Prompt:

```
"Write an obituary for [famous person X]:"
```

In [None]:
raw_prompt = "Write an obituary for [famous person X]:" # please replace "[famous person X]" accordingly.
prompt = preprocess(raw_prompt, model_type='instruct')

## <your code here>

Now, think about these values: what do they tell you? And why?