# <center>Critical AI</center>
<center>ENGL 54.41</center>
<center>Dartmouth College</center>
<center>Winter 2026</center>
<pre>Created: 02/20/2026</pre>

In [None]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F
import math
import graphviz

from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# This cell of code will determine if we have an accelerator for running
# our neural networks.
# mps == Metal Performance Shaders, the GPU backend on Apple Silicon (M-series) Macs
# cuda == Compute Unified Device Architecture, Nvidia's toolkit; it means we have an Nvidia GPU
# cpu == Just using the general-purpose CPU for our calculations

if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device('mps')
elif torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print('Using device: {0}'.format(device))

In [None]:
# Model names are constructed as model provider + "/" + model name.
# We're going to load the 1B model from the Allen Institute for AI. This is the
# instruction fine-tuned model from the OLMo 2 series with 1 billion parameters.
model_name = "allenai/OLMo-2-0425-1B-Instruct"

# the tokenizer is tied to the model itself
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load the model and put on the correct device
model = AutoModelForCausalLM.from_pretrained(model_name,
    dtype=torch.float16,
    device_map = "auto")

## Next Token Probabilities

Recall that our generation process with these models selects from a probability distribution over possible next tokens. The next token predictions themselves are deterministic: the same inputs, given to the same model, will always generate the same list of probabilities. The `ntp` function below takes a prompt and returns the probabilities of the most likely next tokens. This is the core of our autoregressive LLM.

In [None]:
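def ntp(prompt, n=10):
    """Return the n most probable next tokens as [token, probability] pairs."""
    # Tokenize the prompt and move the tensors to the model's device.
    inp_tok = tokenizer(prompt,
                        padding=True,
                        return_tensors="pt").to(next(model.parameters()).device)
    # Inference only, so skip gradient tracking to save memory.
    with torch.no_grad():
        # Keep only the logits at the last position: the scores for the next token.
        logits = model(**inp_tok).logits[:, -1, :]
    # Softmax turns the logits into a probability distribution over the vocabulary.
    probs = F.softmax(logits, dim=-1)
    # Take the n highest probabilities and their token ids, highest first.
    top_probs, top_ids = torch.topk(probs[0], n)
    vals = [[tokenizer.decode(tk.item()), p.item()]
            for tk, p in zip(top_ids, top_probs)]
    return vals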
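
To make the last two steps of `ntp` concrete, here is a minimal pure-Python sketch of the same idea: softmax turns raw scores into probabilities, and we keep the top `n`. The vocabulary and logits are made up for illustration; they do not come from the OLMo model, and the helper names `softmax` and `top_n` are our own.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n(vocab, logits, n=3):
    """Return the n most probable [token, probability] pairs, highest first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [[tok, p] for tok, p in ranked[:n]]

# Hypothetical four-token vocabulary and logits, for illustration only.
toy_vocab = [" economics", " biology", " English", " history"]
toy_logits = [2.0, 1.0, 0.5, 0.1]

for tok, p in top_n(toy_vocab, toy_logits, n=3):
    print(repr(tok), round(p, 3))
```

Calling `top_n` twice with the same inputs gives the same list, which is the sense in which next token prediction is deterministic. Randomness only enters when we sample from these probabilities during generation.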

## Crafting Prompts

We will use some crafty prompts to force the next token to be informative about the model. By altering some of the language, we can measure differences in probabilities, thus revealing biases, stereotypes, and "preferences."

In [None]:
ntp("He studied very hard at Dartmouth College, where he majored in",n = 25)

In [None]:
ntp("She studied very hard at Dartmouth College, where she majored in",n=25)

## Interpreting Data

These results can be hard to interpret. We could attend to the difference in probability (order), or we could just attend to what appears anywhere in these lists. We'll do the latter now. This will build a table of the shared tokens and the distinct tokens for our variations.

In [None]:
# build table showing learned gender norms. We'll drop to lowercase to catch department/field names.
he_tokens = [r[0].lstrip().lower() for r in ntp("He studied very hard at Dartmouth College, where he majored in",n=25)]
she_tokens = [r[0].lstrip().lower() for r in ntp("She studied very hard at Dartmouth College, where she majored in",n=25)]
intersection_tokens = set(he_tokens) & set(she_tokens)
he_only = [t for t in he_tokens if t not in she_tokens]
she_only = [t for t in she_tokens if t not in he_tokens]
df =  pd.DataFrame({"Intersection": pd.Series(list(intersection_tokens)),
                   "He Only": pd.Series(he_only),
                   "She Only":pd.Series(she_only)})
df
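
The other option mentioned above, attending to the difference in probability, can be sketched as follows. This is one possible approach rather than a definitive measure: the function name `prob_gap` is our own, and the numbers are invented for illustration, not model output. It accepts lists in the same `[token, probability]` shape that `ntp` returns.

```python
def normalize(results):
    """Map stripped, lowercased tokens to probability, summing any collisions."""
    out = {}
    for tok, p in results:
        key = tok.lstrip().lower()
        out[key] = out.get(key, 0.0) + p
    return out

def prob_gap(results_a, results_b):
    """Return (token, p_a, p_b, p_a - p_b) for shared tokens, largest gap first."""
    a, b = normalize(results_a), normalize(results_b)
    shared = set(a) & set(b)
    rows = [(tok, a[tok], b[tok], a[tok] - b[tok]) for tok in shared]
    return sorted(rows, key=lambda r: abs(r[3]), reverse=True)

# Made-up probabilities, for illustration only.
toy_he = [[" economics", 0.20], [" physics", 0.10], [" history", 0.05]]
toy_she = [[" economics", 0.12], [" English", 0.11], [" history", 0.06]]

for tok, pa, pb, gap in prob_gap(toy_he, toy_she):
    print(f"{tok:10s} {pa:.2f} {pb:.2f} {gap:+.2f}")
```

With the real model, we would pass the two `ntp(...)` results in place of the toy lists. A positive gap means the first prompt makes the token more likely.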

## Addressing multiple variables

What if we want to assess differences for race & gender? We can easily modify a template to create prompts and collect the data.

In [None]:
race_list = ["Asian","white","Black"]
gender_list = ["man","woman"]

template = "Being a {race} {gender}, their ideal job would be to be a"

data = dict()
for race in race_list:
  for gender in gender_list:
      prompt = template.replace('{race}',race).replace('{gender}',gender)
      data[race + "_" + gender] = ntp(prompt,n=25)
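
One thing to watch for with templates: the prompt above produces "Being a Asian man", and an ungrammatical article could shift the probabilities for reasons unrelated to race or gender. A minimal sketch of one way to handle this follows. The `{article}` slot, the `article_for` helper, and `build_prompts` are our own hypothetical additions, and the vowel rule is only a rough approximation of English usage.

```python
from itertools import product

def article_for(word):
    """Pick 'a' or 'an' from the first letter (a rough rule, not full English)."""
    return "an" if word[0].lower() in "aeiou" else "a"

def build_prompts(template, race_list, gender_list):
    """Return {"race_gender": prompt} for every combination of the two lists."""
    prompts = {}
    for race, gender in product(race_list, gender_list):
        prompts[f"{race}_{gender}"] = template.format(
            article=article_for(race), race=race, gender=gender)
    return prompts

# Hypothetical variant of the template with an {article} slot added.
template_with_article = "Being {article} {race} {gender}, their ideal job would be to be a"
prompts = build_prompts(template_with_article,
                        ["Asian", "white", "Black"], ["man", "woman"])

for key, prompt in prompts.items():
    print(key, "->", prompt)
```

Each prompt could then be passed to `ntp` exactly as in the loop above.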

In [None]:
complex_df = pd.DataFrame(data,columns=data.keys())
complex_df
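
Each cell of `complex_df` holds a whole `[token, probability]` pair, which makes the table hard to sort or filter. As a sketch of an alternative layout, we could flatten the dictionary into one row per token. The function name `to_long_rows` is our own, and the toy dictionary below uses invented values in the same shape as `data`.

```python
def to_long_rows(data):
    """Flatten {"race_gender": [[token, prob], ...]} into one dict per token."""
    rows = []
    for key, results in data.items():
        # Keys were built as race + "_" + gender, so split on the first underscore.
        race, gender = key.split("_", 1)
        for rank, (tok, p) in enumerate(results, start=1):
            rows.append({"race": race, "gender": gender, "rank": rank,
                         "token": tok.strip(), "prob": p})
    return rows

# Invented values in the same shape as `data`, for illustration only.
toy_data = {
    "Asian_man": [[" doctor", 0.10], [" teacher", 0.05]],
    "Asian_woman": [[" teacher", 0.09], [" nurse", 0.07]],
}

for row in to_long_rows(toy_data):
    print(row)
```

Passing the result to `pd.DataFrame(...)` gives a long-format table with separate `token` and `prob` columns.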

## Other Prompts

If we want to assess the uncertainty of a decision, we can use next token prediction to give us a sense of the predictive power of the model for a particular input. These are simple theory-of-mind prompts. Are the predictions the result of these samples being included in the training data, or of an ability to reason (in this non-reasoning, instruction fine-tuned model)? What evidence might we draw from these experiments?

In [None]:
ntp("Sally puts a marble in her basket and leaves the room. Anne moves the marble from the basket to a box. Sally returns. She looks for her marble in the",n=25)

In [None]:
ntp("Liam puts a cookie in the blue jar and leaves the kitchen. Emma moves the cookie from the blue jar to the red tin. Liam returns. He looks for the cookie in the",n=25)
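
One way to put a number on the uncertainty discussed above is Shannon entropy: low entropy means the probability mass sits on one or two tokens, and high entropy means it is spread out. The sketch below is an approximation, since it renormalizes over only the tokens we pass in (for example, the top 25 from `ntp`) rather than the full vocabulary. The function name `entropy_bits` and the toy numbers are our own.

```python
import math

def entropy_bits(results):
    """Shannon entropy in bits of a list of [token, probability] pairs.

    Probabilities are renormalized over the tokens given, so this describes
    the truncated distribution, not the model's full vocabulary.
    """
    probs = [p for _, p in results if p > 0]
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs)

# Made-up distributions, for illustration only.
toy_confident = [[" basket", 0.90], [" box", 0.05], [" room", 0.05]]
toy_uncertain = [[" basket", 0.34], [" box", 0.33], [" room", 0.33]]

print("confident:", round(entropy_bits(toy_confident), 3), "bits")
print("uncertain:", round(entropy_bits(toy_uncertain), 3), "bits")
```

Comparing the entropy of the Sally and Liam prompts would show whether the model is equally sure of itself when the surface details change.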

## Display the Imaginary

We can think of the potential space of generation as the model's imaginary: what is likely, and what might come out of the model depending upon our generation parameters and the randomness built into stochastic generation. Let's display a simplified version of this space and chart some of the possible outputs. Every generation contains within it the possibility of another path through the token space of the model.

In [None]:
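depth = 3  # number of candidate tokens expanded at each of the three levels
prompt = "I am concerned about my future, so I majored in"
dot = graphviz.Digraph()

dot.node(prompt)
# Node ids include the path so far, so the same token reached by different
# paths keeps its own node and its own conditional probability.
for t, p in ntp(prompt, n=depth):
    t_id = 't_' + t
    dot.node(t_id, t + '\n' + str(round(p, 2)))
    dot.edge(prompt, t_id)
    nprompt = prompt + t
    # Use n_p rather than np so the numpy alias is not overwritten.
    for nt, n_p in ntp(nprompt, n=depth):
        nt_id = t_id + '|nt_' + nt
        dot.node(nt_id, nt + '\n' + str(round(n_p, 2)))
        dot.edge(t_id, nt_id)
        nnprompt = nprompt + nt
        for nnt, nn_p in ntp(nnprompt, n=depth):
            nnt_id = nt_id + '|nnt_' + nnt
            dot.node(nnt_id, nnt + '\n' + str(round(nn_p, 2)))
            dot.edge(nt_id, nnt_id)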

In [None]:
# display graph
dot
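
Each probability in the graph is conditional on the tokens that came before it, so the probability of a whole path is the product of the probabilities along it. Here is a minimal sketch of that calculation on a hand-built toy tree. The tree structure, the function name `path_probabilities`, and every number are invented for illustration and are not output from the model.

```python
def path_probabilities(tree, prefix="", prob=1.0):
    """Yield (text, joint probability) for every root-to-leaf path.

    `tree` maps token -> (conditional probability, subtree dict).
    """
    for tok, (p, subtree) in tree.items():
        text, joint = prefix + tok, prob * p
        if subtree:
            # Keep multiplying as we descend the tree.
            yield from path_probabilities(subtree, text, joint)
        else:
            yield text, joint

# Invented two-level tree, for illustration only.
toy_tree = {
    " computer": (0.5, {" science": (0.8, {}), " engineering": (0.2, {})}),
    " economics": (0.3, {" and": (0.6, {}), ".": (0.4, {})}),
}

paths = sorted(path_probabilities(toy_tree), key=lambda x: x[1], reverse=True)
for text, joint in paths:
    print(repr(text), round(joint, 3))
```

The joint probabilities shrink quickly with depth, which is one reason any single generated text is only one of many possible paths.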