A simplified and more performant next token Patchscope.
Alternatively, a more expensive and more accurate version of the logit lens.
The Patchscopes framework, introduced in the Patchscopes paper, is a general framework for patching residual activations from some prompt/layer/position/model into another prompt/layer/position/model.
A Patchscope technique is a specific parameterization of the Patchscopes framework.
Next token extraction is the general problem of extracting the next token of a sequence using a single residual activation somewhere in the sequence. This roughly indicates when a specific residual contains enough information to predict the next token and can be useful for circuit finding. For the problem to be non-trivial, the residual activation should not be the one in the last layer and last position. The logit lens can be viewed as a technique for solving this problem.
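To make the logit lens concrete, here is a minimal NumPy sketch. All names and dimensions (`d_model`, `W_U`, the LayerNorm parameters) are hypothetical toy stand-ins, not GPT-J's actual weights; the point is only the shape of the computation: an intermediate residual is normalized and multiplied by the unembedding matrix, skipping every remaining layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 50  # toy dimensions, not a real model's

# Hypothetical stand-ins for the model's final LayerNorm and unembedding.
gamma, beta = np.ones(d_model), np.zeros(d_model)
W_U = rng.normal(size=(d_model, vocab_size))

def logit_lens(resid):
    """Project an intermediate residual straight to logits,
    skipping all remaining transformer layers."""
    h = (resid - resid.mean()) / np.sqrt(resid.var() + 1e-5)  # LayerNorm
    h = h * gamma + beta
    return h @ W_U

resid = rng.normal(size=d_model)  # residual from some layer/position
logits = logit_lens(resid)
predicted_token = int(np.argmax(logits))
```

The prediction `predicted_token` is the logit lens's guess at the next token from that single residual.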
I use "next token Patchscope" to refer to a Patchscope technique for solving next token extraction.
The token identity Patchscope is the next token Patchscope introduced in the Patchscopes paper. The idea is to take a prompt like "cat -> cat\n1135 -> 1135\nhello -> hello\n?" and patch a residual activation into the ? token without a layer shift. For instance, it could patch the layer 5 residual of = in "1+1=" into the layer 5 residual for the ? token to see if it predicts 2.
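The patching mechanics can be sketched with a toy model. This is an illustrative assumption-laden stand-in, not the paper's implementation: the "layers" below act on each position independently (a real transformer mixes positions via attention), and all shapes are made up. It shows the core move of a no-layer-shift Patchscope: capture the residual at (source layer, source position), then overwrite the residual at (same layer, target position) in the target run before continuing the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 6

# Toy per-position "layers" standing in for real transformer blocks.
layers = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

def forward(resids, patch=None):
    """Run all layers, recording the residual stream before each layer.
    `patch` = (layer, pos, vector) overwrites one position's residual
    just before that layer runs (i.e. no layer shift)."""
    history = []
    resids = resids.copy()
    for l, W in enumerate(layers):
        if patch is not None and patch[0] == l:
            resids = resids.copy()
            resids[patch[1]] = patch[2]
        history.append(resids.copy())
        resids = resids + np.tanh(resids @ W)
    return resids, history

source = rng.normal(size=(4, d_model))   # stand-in embeddings for "1+1="
target = rng.normal(size=(10, d_model))  # stand-in embeddings for the few-shot prompt ending in "?"

# 1. Source run: grab the layer-5 residual of the final "=" position.
_, src_hist = forward(source)
src_resid = src_hist[5][-1]

# 2. Target run: patch it into the "?" (last) position at the same layer.
tgt_final, _ = forward(target, patch=(5, len(target) - 1, src_resid))
```

In the real setting, `tgt_final[-1]` would then be unembedded to see whether the model predicts `2`.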
This project studies a new next token Patchscope that is identical to the token identity Patchscope except with the prompt "?" instead of "cat -> cat\n1135 -> 1135\nhello -> hello\n?".
The goal of this is to fix some failure modes of "cat -> cat\n1135 -> 1135\nhello -> hello\n?":
- The in-context examples show `->` tokens after the first tokens, so there's a bias to predicting `->` after the `?`.
  - Indeed, the most likely next token after `?` is `->` if early layers are patched, which can explain why `patchscope` has low performance in early layers.
- If `->` is added to the end of the prompt, there's a bias towards predicting `?`.
  - This causes a significant drop in performance if `->` is added to the end of the prompt.
"?" tries to fix these two issues by removing the in-context examples:
- On a surface level, the bias towards `->` gets fixed because `->` doesn't appear.
- On a deeper level, the bias towards `?` can be viewed as a bias towards predicting the current token rather than the next token.
  - As a thought experiment, imagine patching the 0th layer pre-residual of a token like `"health"` into the `?` token in `"cat -> cat\n1135 -> 1135\nhello -> hello\n?"`. This is equivalent to the prompt `"cat -> cat\n1135 -> 1135\nhello -> hello\nhealth"`, for which the most likely next two tokens are `-> health`, not `-> care`. However, in the next token extraction problem, we want to extract `care` and not `health`.
While performance could perhaps be improved with different in-context examples, keeping it simple makes it feel like a more expensive and more accurate version of the logit lens where we do the rest of the forward pass instead of skipping to the decoder. The name logitscope captures how we use a longer "scope" (all remaining layers) as opposed to a thin "lens" (decoder layer) to observe the logits.
`logitscope` performs better than `patchscope` in both precision@1 and surprisal (the two metrics in the Patchscopes paper), with the same dataset (The Pile) and one of the same models (GPT-J 6B). I used the same preprocessing steps with slight changes: 2K examples from the start instead of after an offset of 10K, and reduced word/character limits to avoid running out of GPU memory.
For precision@1, higher is better. For surprisal, lower is better.
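For reference, both metrics can be computed from the patched run's next-token logits. A minimal NumPy sketch, assuming `logits` has shape `(n_examples, vocab_size)` and `targets` holds the true next-token ids (my reading of the metrics, not the paper's exact code):

```python
import numpy as np

def precision_at_1(logits, targets):
    """Fraction of examples where the top-1 predicted token
    matches the true next token (higher is better)."""
    return float(np.mean(np.argmax(logits, axis=-1) == targets))

def surprisal(logits, targets):
    """Mean negative log-probability of the true next token,
    in nats (lower is better)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))

# Tiny example with a 3-token vocabulary.
logits = np.array([[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]])
targets = np.array([0, 1])
precision_at_1(logits, targets)  # → 1.0 (both argmaxes match)
```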
Run `pip install -r requirements.txt` and then all cells in `logitscope.ipynb`. I ran this on an A100 80GB with 100GB of disk space, using Python 3.11.

