# <center>Critical AI</center>
<center>ENGL 54.41</center>
<center>Dartmouth College</center>
<center>Winter 2026</center>
<pre>Created: 02/12/2026</pre>

In [None]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModelForCausalLM
import random

In [None]:
# This cell of code will determine if we have an accelerator for running
# our neural networks.
# mps == Apple Silicon device (MX series of Macbooks)
# cuda == Compute Unified Device Architecture is a toolkit from Nvidia and means we have a GPU
# cpu == Just using the general-purpose CPU for our calculations

if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device('mps')
elif torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print('Using device: {0}'.format(device))

In [None]:
# Model names are constructed from the model provider plus the model name.
# We're going to load a 1B model from the Allen Institute for AI. This is the
# instruction fine-tuned model from the OLMo 2 series, with 1 billion parameters.
model_name = "allenai/OLMo-2-0425-1B-Instruct"

# the tokenizer is tied to the model itself
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load the model and put on the correct device
model = AutoModelForCausalLM.from_pretrained(model_name,
    dtype=torch.float16,
    device_map = "auto")

In [None]:
# show us the basic configuration of the model--how many layers, attention heads, 
# vocabulary size, embedding width, etc:
model.config

In [None]:
# put the model into evaluation mode (disables training-only behavior such as dropout)
# and display the architecture
model.eval()

# Keywords for Generation

The following keyword definitions will be useful in helping us understand how the sampling process used during generation works.

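<b>Temperature</b>: A parameter used during generation to control how tokens are selected. It typically ranges from near 0 (cold) to 2 (hot), with a default of 1.0, which leaves the model's probabilities unchanged. As the temperature approaches zero, generation converges on always choosing the most probable token, so the same inputs produce the same tokens and the output for a user-supplied prompt becomes reproducible. (Many implementations, including Hugging Face's, require a strictly positive temperature when sampling; fully deterministic output is obtained by turning sampling off.) Increasing the temperature flattens the distribution and increases the degree of stochasticity in generation, leading to more varied, "creative" outputs for the same inputs.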

<b>Top-k sampling</b>: A parameter used during generation. It is an integer (a whole number), and the default is typically 50. This number, represented by k, restricts sampling to the k most probable next tokens. A top-k value of 5 would restrict the generation algorithm to the top 5 most probable next predicted tokens. The higher the number, the greater the potential diversity of language.

<b>Top-p sampling</b>: A parameter used during generation. It is a floating-point value that ranges from 0 to 1, and the default is typically 1.0, which applies no restriction. This number, represented by p, is a threshold on cumulative probability: tokens are ranked from most to least probable, and only the smallest set of top-ranked tokens whose probabilities add up to at least p is kept. A top-p value of .92 will keep only the most probable tokens whose probability values add up to this value. The higher the number, the greater the potential diversity of language.
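
To make the two cut-offs concrete, here is a minimal sketch of top-k and top-p filtering on a toy distribution. The helper names (`top_k_filter`, `top_p_filter`) and the probability values are made up for illustration; this is not the Hugging Face implementation, only one plausible reading of the definitions above.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, then renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

def top_p_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose cumulative
    probability reaches p, then renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Toy next-token distribution (hypothetical numbers, not model output)
toy = {" Taylor": 0.40, " Prince": 0.25, " Bob": 0.15,
       " David": 0.10, " the": 0.06, " a": 0.04}

print(top_k_filter(toy, 2))      # two tokens survive
print(top_p_filter(toy, 0.75))   # three tokens are needed to reach 0.75
print(top_p_filter(toy, 0.025))  # a single token already exceeds 0.025
```

Whenever the most probable token alone has a probability of at least p, only that token survives the top-p cut-off, so very small values of p make sampling behave much like always picking the top token.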

In [None]:
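def ntp(prompt, n=10):
    """Return the n most probable next tokens for prompt as [token, probability] pairs."""
    inp_tok = tokenizer(prompt,
                        padding=True,
                        return_tensors="pt").to(next(model.parameters()).device)
    # inference only, so we skip gradient tracking
    with torch.no_grad():
        # keep only the logits for the position that follows the last prompt token
        logits = model(**inp_tok).logits[:, -1, :]
    # softmax turns the logits into a probability distribution over the vocabulary
    probs = F.softmax(logits, dim=-1)
    # indices of the n most probable tokens, most probable first
    top = torch.argsort(probs, descending=True)[0, :n]
    vals = [[tokenizer.decode(tk.item()),
             probs[0][tk.item()].item()] for tk in top]
    return vals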

## Using top-k and top-p to restrict probability space

The following cells demonstrate how we can shape the probability space with these two parameters. Both function as cut-offs that prevent generation from selecting tokens that fall outside their thresholds. Again, the language model itself is deterministic: it produces the same probabilities every single time for the same inputs. Generation selects stochastically from these probabilities, and we can manage that space of probability with a few key parameters.
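
The split between fixed probabilities and stochastic selection can be sketched with Python's standard library alone. The token list and probabilities below are hypothetical, and `sample_tokens` is an illustrative helper rather than anything from the transformers package.

```python
import random

# Toy next-token probabilities (hypothetical numbers). Like the model's
# output for a fixed prompt, these never change between runs.
tokens = [" Taylor", " Prince", " Bob", " the"]
probs = [0.50, 0.25, 0.15, 0.10]

def sample_tokens(seed, n=5):
    """Draw n tokens from the fixed distribution using a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(tokens, weights=probs, k=n)

print(sample_tokens(seed=0))
print(sample_tokens(seed=1))
print(sample_tokens(seed=0))  # same seed, so the same draws as the first call
```

The distribution is identical in every call; only the random draws differ, and fixing the seed makes those draws repeatable as well.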

In [None]:
top_p = 0.025
top_k = 10
tv = ntp("My favorite recording artist is", n=50)

In [None]:
df = pd.DataFrame(tv, columns=['Token', 'Probability'])
df = df.set_index('Token')
df.plot(kind='bar',figsize=(10, 5))
plt.ylabel('Probability')
plt.title('Sampling from Next Token Probabilities:\nMy favorite recording artist is ___')
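# bars sit at x = 0..n-1, so the top-k boundary falls just after the k-th bar
plt.axvline(top_k - 0.5,color='r', linestyle='--',label="top-k: " + str(top_k))
# note: top-p is a threshold on cumulative probability; this line only marks
# the value of p on the per-token probability axis
plt.axhline(top_p,color='g', linestyle='--',label="top-p: " + str(top_p))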
plt.legend()
plt.show()

## Temperature 

The temperature parameter rescales the model's scores, and therefore the probability values, before other mechanisms for stochastic generation are applied. Hugging Face, the maintainer of the transformers package that we are using with these open-weights models, describes the temperature parameter in its implementation as follows:

<pre>
temperature (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.

How unpredictable the next selected token will be. High values (>0.8) are good for creative tasks, low values (e.g. <0.4) for tasks that require “thinking”. Requires do_sample=True.
</pre>

We'll visualize how this method can reduce peaks and reshape our probabilities. 
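
Temperature is usually described as dividing the logits by T before the softmax. When only probabilities are at hand, raising each probability to the power 1/T and renormalizing gives the same distribution. The sketch below checks this on made-up logits; the helper names `softmax` and `rescale_probs` are illustrative assumptions, not library functions.

```python
import math

def softmax(logits, T=1.0):
    """Softmax over logits divided by the temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def rescale_probs(probs, T):
    """Apply temperature directly to probabilities: p ** (1/T), renormalized."""
    w = [p ** (1.0 / T) for p in probs]
    total = sum(w)
    return [x / total for x in w]

logits = [2.0, 1.0, 0.5, -1.0]           # hypothetical scores for four tokens
base = softmax(logits)                    # probabilities at T = 1.0

for T in (0.5, 1.0, 2.0):
    from_logits = softmax(logits, T)
    from_probs = rescale_probs(base, T)
    print(T, [round(p, 3) for p in from_logits], [round(p, 3) for p in from_probs])
```

Low temperatures sharpen the peak around the most probable token, while high temperatures flatten the distribution. If the rescaling is applied to only a subset of tokens, as with a top-n list, the result is the temperature-scaled distribution renormalized over that subset.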

In [None]:
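# various sample temperature values
temps = [0.025, 0.5, 0.75, 1.0, 1.5, 2.0]

def rescale(p, T):
    # Raising probabilities to the power 1/T and renormalizing has the same effect
    # as dividing the logits by T before the softmax (here over the tokens in df only).
    # The clip guards against zero probabilities.
    w = np.power(np.clip(p, 1e-12, None), 1.0 / T)
    return w / w.sum()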

temp_cols = pd.concat(
    [df['Probability'].pipe(rescale, T).rename(f"T={T}") for T in temps],
    axis=1
)

df = pd.concat([df, temp_cols], axis=1)

df[[f"T={T}" for T in temps]].plot(figsize=(10,5))
plt.xlabel('Token')
plt.ylabel('Probability')
plt.title('Temperature Effect on Probabilities:\nMy favorite recording artist is ___')
plt.legend(title='Temperature')
plt.xticks(range(len(df)), df.index.values, rotation=45)
plt.tight_layout()
plt.show()

## Instruction-Tuned Model without Special Tokens

Here we'll generate tokens using a prompt similar to the one above, with basic generation and the parameters introduced above. We will not insert the special tokens that invoke additional fine-tuned behavior.

How should we interpret these results?

In [None]:
top_p = 0.025
top_k = 10
temperature = 1.0

prompt = "Who is my favorite recording artist?"
max_new_tokens = 128

inp_tok = tokenizer(prompt,
                    padding=True,
                    return_tensors="pt").to(next(model.parameters()).device)

input_ids = inp_tok["input_ids"]
output = model.generate(input_ids.to(device), 
                        do_sample=True, 
                        max_new_tokens = max_new_tokens,
                        temperature = temperature,
                        top_k = top_k,
                        top_p = top_p)

print(tokenizer.decode(output[0], 
                       skip_special_tokens=False))

In [None]:
# Now edit the above and try to adjust the parameters and prompt.

## Instruction-Tuned Model with Special Tokens

Now we'll use a chat template (chat mark-up, in the style of ChatML) that inserts special tokens into our input sequence.
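
As a rough picture of what a chat template does, the sketch below joins a list of messages into one string with role markers. The function `render_chat` and the markers `<|user|>` and `<|assistant|>` are hypothetical placeholders; the real template and its special tokens are stored with the tokenizer and may look different.

```python
def render_chat(messages, add_generation_prompt=True):
    """Join role-tagged messages into one string (hypothetical markers)."""
    parts = []
    for m in messages:
        # each message is wrapped in a marker naming its role
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    if add_generation_prompt:
        # an opening assistant marker cues the model to answer next
        parts.append("<|assistant|>\n")
    return "".join(parts)

msg = [{"role": "user", "content": "Who is my favorite recording artist?"}]
print(render_chat(msg))
```

The point of the trailing assistant marker is that the model continues the sequence in the assistant's voice instead of extending the user's question.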

In [None]:
# same parameters as above:
top_p = 0.025
top_k = 10
temperature = 1.0

prompt = "Who is my favorite recording artist?"
max_new_tokens = 128

# this defines the formatted message that we'll use as our input:
msg = [{"role":"user","content":prompt}]

# this applies special tokens. ''add_generation_prompt'' inserts the token for the ''assistant'' voice.
# ''return_dict'' makes sure we get back a dictionary with an 'input_ids' entry.
# you'll see all of this decoded below.
input_ids = tokenizer.apply_chat_template(msg, 
                                          return_tensors = "pt",
                                          return_dict = True,
                                          add_generation_prompt = True)

# generate tokens:
output = model.generate(input_ids['input_ids'].to(device), 
                        do_sample=True, 
                        max_new_tokens = max_new_tokens,
                        temperature = temperature,
                        top_k = top_k,
                        top_p = top_p)

print(tokenizer.decode(output[0], 
                       skip_special_tokens=False))

In [None]:
# Now edit the above and try to adjust the parameters and prompt.

## Chain-of-Thought Prompting

Most contemporary models have been trained on chain-of-thought data, so we might not need to provide a sample to direct generation. Here we'll use the first example from Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2023) arXiv:2201.11903v6, which is very likely to be in the training data for this model.

The use of triple quotes (""") here will allow us to include new lines in our prompt and help format the output.
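
A few-shot chain-of-thought prompt can also be assembled from parts instead of typed as one block. The helper `build_cot_prompt` below is a hypothetical convenience function, shown only to make the structure of the prompt explicit: worked question and answer pairs first, then the new question.

```python
def build_cot_prompt(examples, question):
    """Build a few-shot prompt from (question, worked answer) pairs."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    # the new question comes last, left unanswered for the model
    lines.append(f"Q: {question}")
    return "\n".join(lines)

examples = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
     "5 + 6 = 11. The answer is 11."),
]
question = ("The cafeteria had 23 apples. If they used 20 to make lunch "
            "and bought 6 more, how many apples do they have?")

cot_prompt = build_cot_prompt(examples, question)
print(cot_prompt)
```

Adding or removing worked examples then only means editing the `examples` list.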

In [None]:
# same parameters as above:
top_p = 0.025
top_k = 10
temperature = 1.0

prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"""

max_new_tokens = 128

# this defines the formatted message that we'll use as our input:
msg = [{"role":"user","content":prompt}]

# this applies special tokens. ''add_generation_prompt'' inserts the token for the ''assistant'' voice.
# ''return_dict'' makes sure we get back a dictionary with an 'input_ids' entry.
# you'll see all of this decoded below.
input_ids = tokenizer.apply_chat_template(msg, 
                                          return_tensors = "pt",
                                          return_dict = True,
                                          add_generation_prompt = True)

# generate tokens:
output = model.generate(input_ids['input_ids'].to(device), 
                        do_sample=True, 
                        max_new_tokens = max_new_tokens,
                        temperature = temperature,
                        top_k = top_k,
                        top_p = top_p)

print(tokenizer.decode(output[0], 
                       skip_special_tokens=False))

In [None]:
# Now edit the above and try to adjust the parameters and prompt.

## Using System Prompts

Instruction fine-tuned models are almost always used with a system prompt, which adds one more element to our generation pipeline. The system prompt precedes the user prompt in the sequence of inputs. It can be used to provide some guardrails. It might also be used to break through them.

In [None]:
system_prompt = "Always respond in the voice of Eleazar Wheelock, founder of Dartmouth College."
prompt = "What is the purpose of liberal education?"
max_new_tokens = 512

# The message template will now contain an additional role, that of the "system." 
# Observe how this prompt is marked and inserted into the sequence of inputs.
msg = [{"role":"system","content":system_prompt},
       {"role":"user","content":prompt}]

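# ''return_dict'' makes sure we get back a dictionary with an 'input_ids' entry.
input_ids = tokenizer.apply_chat_template(msg, 
                                          return_tensors = "pt",
                                          return_dict = True,
                                          add_generation_prompt = True)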

output = model.generate(input_ids['input_ids'].to(device), 
                        do_sample=True, 
                        max_new_tokens = max_new_tokens)

print(tokenizer.decode(output[0], 
                       skip_special_tokens=False))

In [None]:
# Now edit the above and try to adjust the parameters, system prompt, and prompt. 
# Can you get the model to work around its guardrails?