# Investigation of single token prediction in Pythia 160m

This notebook investigates how the Pythia 160m model represents knowledge from pre-training, using a prompt intended to elicit the token for Dublin as the capital of Ireland. This is the smallest model in which this behaviour appears.

In [1]:
# Install dependencies and set up the package
# %pip install transformers torch matplotlib

# Make the local package importable by adding ../src to the module search path
import sys
sys.path.insert(0, '../src')

In [None]:
from transformer_algebra import PromptedTransformer, load_pythia_model

# Load Pythia-160m-deduped from HuggingFace
model, tokenizer = load_pythia_model("EleutherAI/pythia-160m-deduped")
print(f"Model: {model.config.name_or_path}")
print(f"Layers: {model.config.num_hidden_layers}")
print(f"Heads: {model.config.num_attention_heads}")
print(f"d_model: {model.config.hidden_size}")

The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Model: EleutherAI/pythia-160m-deduped
Layers: 12
Heads: 12
d_model: 768


In [17]:
T = PromptedTransformer(model, tokenizer, "The capital of Ireland")
# Returns an object representing a transformer together with a prompt, viewed as an operator that acts on a residual vector
T
# T should display as something like T(<4 tokens>)
# or, in the longer term, $$T(\underline{The} \ \underline{ capital} \ \underline{ of} \ \underline{ Ireland})$$

<transformer_algebra.core.PromptedTransformer at 0x24aa5dc2700>

In [20]:
x = T(" is")
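# Intended to return an object representing the result vector from applying the model to the token for " is".
# Not yet implemented: PromptedTransformer does not define __call__, so this call currently raises a TypeError.
# x
# x should display as something like T(\underline{ is})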


TypeError: 'PromptedTransformer' object is not callable
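
The `TypeError` arises because Python only allows an object to be called like a function when its class defines `__call__`. The following is a minimal sketch of that mechanism, using the hypothetical classes `PromptOperator` and `AppliedOperator`. They are illustrative stand-ins and not the actual `transformer_algebra` implementation; a real version would run the model inside `__call__`.

```python
class PromptOperator:
    """Hypothetical stand-in for a prompt-conditioned operator."""

    def __init__(self, prompt_tokens):
        self.prompt_tokens = list(prompt_tokens)

    def __call__(self, next_token):
        # Defining __call__ is what makes `T(" is")` legal syntax.
        # This sketch only records the token; it does not run a model.
        return AppliedOperator(self, next_token)

    def __repr__(self):
        return f"T(<{len(self.prompt_tokens)} tokens>)"


class AppliedOperator:
    """Hypothetical symbolic result of applying the operator to one token."""

    def __init__(self, operator, token):
        self.operator = operator
        self.token = token

    def __repr__(self):
        return f"T({self.token!r})"


T = PromptOperator(["The", " capital", " of", " Ireland"])
x = T(" is")
print(T)
print(x)
```

Returning a separate result object keeps the operator reusable, so the same prompt can be applied to several candidate tokens.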

In [None]:
# predict(x)
# Should return an object representing a mapping from tokens to probabilities
# logits(x)
# Should return an object representing a mapping from tokens to logits
# summarise(logits(x))
# Should print the top predictions, as in the expected result below

Expected result: Top 5 predictions after final layer:
  1. ' the' (logit: 786.61)
  2. ' a' (logit: 784.93)
  3. ' Belfast' (logit: 784.57)
  4. ' Dublin' (logit: 784.47)
  5. ' now' (logit: 784.44)
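
As a rough illustration of what a logits-to-summary step could look like, the sketch below ranks a token-to-logit mapping and converts it to probabilities. The helper names `softmax` and `top_k` are hypothetical and are not part of `transformer_algebra`. The logits are the five values listed above, so the resulting probabilities are only relative to these five tokens, not to the full vocabulary.

```python
import math


def softmax(logits):
    """Convert a token -> logit mapping into a token -> probability mapping."""
    # Subtract the maximum first: math.exp overflows for inputs near 786.
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k(logits, k=5):
    """Return the k (token, logit) pairs with the largest logits, highest first."""
    return sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]


# Logits copied from the expected result above.
example_logits = {
    " the": 786.61,
    " a": 784.93,
    " Belfast": 784.57,
    " Dublin": 784.47,
    " now": 784.44,
}

for rank, (token, logit) in enumerate(top_k(example_logits), start=1):
    print(f"{rank}. {token!r} (logit: {logit:.2f})")

probs = softmax(example_logits)
```

Because softmax depends only on differences between logits, the large common offset of roughly 784 has no effect on the probabilities.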

In [None]:
# expand(x)
# Should return an object representing the decomposition of the residual stream at x into contributions from each layer
# Something like:
# LN( \underline{ is} + T_0 \underline{ is} + T_1 \underline{ is} + ... )
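
One way to obtain per-layer contributions, assuming the residual stream is available before the first layer and after each subsequent layer, is to take differences between consecutive states. The sketch below uses arbitrary toy vectors rather than values from Pythia, and `layer_contributions` is a hypothetical helper, not a function of the package.

```python
def layer_contributions(states):
    """Given states [h_0, h_1, ..., h_n], return [h_1 - h_0, ..., h_n - h_{n-1}]."""
    return [
        [after - before for before, after in zip(prev, curr)]
        for prev, curr in zip(states, states[1:])
    ]


# Arbitrary toy states for a 3-dimensional residual stream and two layers.
states = [
    [1.0, 0.0, -1.0],  # h_0: token embedding
    [1.5, -0.5, 0.0],  # h_1: after layer 0
    [1.0, 1.0, 0.5],   # h_2: after layer 1
]

deltas = layer_contributions(states)

# The embedding plus all layer contributions reconstructs the final state.
reconstructed = [sum(parts) for parts in zip(states[0], *deltas)]
print(deltas)
print(reconstructed)
```

The sum telescopes, so the reconstruction is exact regardless of what each layer computes internally.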

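The next step depends on a relation that holds when LN is treated as pure normalisation, $LN(v) = v / \left\| v \right\|$:
$$
\langle x, LN(a + b) \rangle = \frac{\langle x, a \rangle + \langle x, b \rangle}{\left\| a+b \right\|}
$$
The full LayerNorm also subtracts the mean and applies a learned gain and bias; the inner product is then still a sum of per-term contributions sharing a single scale, plus a constant bias term.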
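
The relation can be checked numerically. The sketch below assumes the simplified normalisation $LN(v) = v / \left\| v \right\|$, with no mean-centring, gain, or bias, and uses arbitrary toy vectors; `unit_normalise` is a hypothetical helper name.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(v):
    """Euclidean norm."""
    return math.sqrt(dot(v, v))


def unit_normalise(v):
    """Simplified LN: divide by the Euclidean norm only."""
    n = norm(v)
    return [vi / n for vi in v]


# Arbitrary toy vectors.
x = [0.5, -1.0, 2.0]
a = [1.0, 2.0, -1.0]
b = [0.0, 1.0, 3.0]
a_plus_b = [ai + bi for ai, bi in zip(a, b)]

lhs = dot(x, unit_normalise(a_plus_b))
rhs = (dot(x, a) + dot(x, b)) / norm(a_plus_b)
print(lhs, rhs)
```

The two sides agree because the inner product is linear and the normalisation divides every component by the same scalar.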

In [26]:
#logits(x)["Dublin"]
# Should return something like
# < \overline{Dublin}, T(\underline{ is}) > = 783.45


In [27]:
#expand(logits(x)["Dublin"])
# Should return something like
# scale * ( < \overline{Dublin}, T_0(\underline{ is}) > + < \overline{Dublin}, T_1(\underline{ is}) > + ... ) = 783.45
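
The intended expansion can be illustrated with a toy calculation. This sketch again assumes the simplified normalisation (division by the Euclidean norm only) and uses arbitrary vectors, not values from Pythia. The names `unembed_row` and `contributions` are hypothetical.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Arbitrary toy vectors, not values from Pythia.
unembed_row = [1.0, 0.0, -1.0, 2.0]  # stands in for one token's output vector
contributions = {
    "embedding": [0.5, 1.0, 0.0, -0.5],
    "layer 0": [1.0, -1.0, 0.5, 0.0],
    "layer 1": [0.0, 0.5, -1.5, 1.0],
}

# The residual stream is the sum of all contributions.
residual = [sum(parts) for parts in zip(*contributions.values())]

# Simplified LN: one shared scale, 1 / ||residual||.
scale = 1.0 / math.sqrt(dot(residual, residual))

# Each term's share of the logit, and the logit computed directly.
per_part = {name: scale * dot(unembed_row, vec) for name, vec in contributions.items()}
total = scale * dot(unembed_row, residual)

for name, value in per_part.items():
    print(f"{name}: {value:.4f}")
print(f"sum of parts: {sum(per_part.values()):.4f}, direct: {total:.4f}")
```

Since every term shares the same scale, the relative sizes of the per-layer terms show how much each layer pushes towards or away from the token.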
