In this notebook we explore several language model interpretability methods using Easy-Transformer (https://github.com/neelnanda-io/Easy-Transformer), a neural network library written by Neel Nanda for this purpose.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
!pip install plotly transformers

Collecting plotly
  Downloading plotly-5.11.0-py2.py3-none-any.whl (15.3 MB)
     |████████████████████████████████| 15.3 MB 8.8 MB/s eta 0:00:01
Collecting tenacity>=6.2.0
  Downloading tenacity-8.1.0-py3-none-any.whl (23 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.11.0 tenacity-8.1.0


In [4]:
!pip install "git+https://github.com/neelnanda-io/Easy-Transformer.git"

Collecting git+https://github.com/neelnanda-io/Easy-Transformer.git
  Cloning https://github.com/neelnanda-io/Easy-Transformer.git to /tmp/pip-req-build-lv_xkrqb
  Running command git clone -q https://github.com/neelnanda-io/Easy-Transformer.git /tmp/pip-req-build-lv_xkrqb
  Resolved https://github.com/neelnanda-io/Easy-Transformer.git to commit 57411fcfe9592803eb20970639b979f22cb6cf9f
Collecting fancy_einsum
  Downloading fancy_einsum-0.0.3-py3-none-any.whl (6.2 kB)
Collecting torchtyping
  Downloading torchtyping-0.1.4-py3-none-any.whl (17 kB)
Collecting typeguard>=2.11.1
  Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Building wheels for collected packages: easy-transformer
  Building wheel for easy-transformer (setup.py) ... done
  Created wheel for easy-transformer: filename=easy_transformer-0.1.0-py3-none-any.whl size=65986 sha256=6c26b1de10710ba3397402c7a7eb9a4f9d8a44d016c86a536287b4eb60f0f143
  Stored in directory: /tmp/pip-ephem-wheel-cache-fnpteubg/wheels/

In [5]:
import plotly.io as pio
pio.renderers.default

'plotly_mimetype+notebook'

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import einops
import tqdm.notebook as tqdm

import random
import time

from pathlib import Path
import pickle
import os

import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objects as go

from torch.utils.data import DataLoader

from functools import *
import pandas as pd
import gc
import collections
import copy

import itertools
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import dataclasses
import datasets

In [7]:
from easy_transformer.utils import (
    gelu_new,
    to_numpy,
    get_corner,
    lm_cross_entropy_loss,
)# Helper functions
from easy_transformer.hook_points import (
    HookedRootModule,
    HookPoint,
)# Hooking utilities
from easy_transformer import EasyTransformer, EasyTransformerConfig
import easy_transformer
from easy_transformer.experiments import (
    ExperimentMetric,
    AblationConfig,
    EasyAblation,
    EasyPatching,
    PatchingConfig
)

In [8]:
# Fall back to the CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

## Hook Points
In their Garcon write-up (https://transformer-circuits.pub/2021/garcon/index.html), the good team at Anthropic released a walkthrough of their internal tool, called Garcon. In Easy-Transformer we follow the style of that tool by defining a `HookPoint` class: a layer that can wrap any activation within the model. A HookPoint acts as an identity function, but lets us attach PyTorch hooks to read and edit the wrapped activation. This lets us take any model and add access points to every interesting activation by wrapping it in a HookPoint.

There is also a `HookedRootModule` class. This is a utility class that the root module should inherit from (root module = the model we run). It has several utility functions for using hooks as well.

The default interface is the `run_with_hooks` function on the root module, which runs a forward pass on the model and takes a list of hooks, each paired with a layer name, to run on that pass.

The signature of a hook is `function(activation, hook)`, where `activation` is the activation the hook point wraps and `hook` is the `HookPoint` instance the function is attached to. If the function returns a new activation or _edits the activation_ in place, that replaces the old one; if it returns None, the activation is left as is.
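
To make the return-value rule concrete, here is a minimal pure-Python sketch of an identity wrapper with hooks. `ToyHookPoint` and its `add_hook` method are hypothetical stand-ins written for this illustration; they are not the Easy-Transformer implementation, which works on tensors via PyTorch hooks.

```python
# Hypothetical stand-in for illustration; not the Easy-Transformer HookPoint.
class ToyHookPoint:
    """Identity wrapper that lets registered functions observe or replace a value."""

    def __init__(self, name):
        self.name = name
        self.fwd_hooks = []

    def add_hook(self, fn):
        self.fwd_hooks.append(fn)

    def __call__(self, activation):
        for fn in self.fwd_hooks:
            result = fn(activation, self)
            # A non-None return value replaces the activation;
            # None leaves it as it was.
            if result is not None:
                activation = result
        return activation


def observe(activation, hook):
    # Returns None implicitly, so the value passes through unchanged.
    print(f"{hook.name} saw {activation}")


def double(activation, hook):
    # The returned value replaces the activation.
    return activation * 2


point = ToyHookPoint("hook_demo")
plain = point(3.0)       # no hooks attached: acts as the identity
point.add_hook(observe)
observed = point(3.0)    # prints "hook_demo saw 3.0"; the value is still 3.0
point.add_hook(double)
doubled = point(3.0)     # observe runs first, then double replaces the value
```

With no hooks attached the wrapper is a no-op, which is why such wrappers can be placed around every activation of interest without changing the model's behaviour.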

### HookPoints Example
Here's a simple example of how to use the classes:

We define a basic network with two layers that each take a scalar input $x$, square it, and add a constant: $x_0 = x$, $x_1 = x_0^2 + 3$, $x_2 = x_1^2 - 4$.
We wrap the input, each layer's output, and the intermediate value of each layer (the square) in a hook point.

In [9]:
from easy_transformer.hook_points import HookedRootModule, HookPoint

In [10]:
class SquareAdd(nn.Module):
    def __init__(self, offset):
        super().__init__()
        self.offset = nn.Parameter(torch.tensor(offset))
        self.hook_square = HookPoint()
    
    def forward(self, x):
        # The hook_square doesn't change the value, but lets us access it
        square = self.hook_square(x * x)
        return self.offset + square
    

class TwoLayerModel(HookedRootModule):
    def __init__(self):
        super().__init__()
        self.layer1 = SquareAdd(3.0)
        self.layer2 = SquareAdd(-4.0)
        self.hook_in = HookPoint()
        self.hook_mid = HookPoint()
        self.hook_out = HookPoint()
        
        # We need to call the setup function of HookedRootModule  to build an
        # internal dictionary of modules and hooks, and to give each hook a name.
        super().setup()
    
    def forward(self, x):
        # We wrap the input and each layer's output in a hook - they leave the
        # value unchanged (unless there's a hook added to explicitly change it),
        # but allow us to access it.
        x_in = self.hook_in(x)
        x_mid = self.hook_mid(self.layer1(x_in))
        x_out = self.hook_out(self.layer2(x_mid))
        return x_out

model = TwoLayerModel()

We can add a cache, to save the activation at each hook point.

There's a custom `run_with_cache` function on the root module as a convenience. It is a wrapper around `model.forward` that returns `model_out, cache_object`. We could also manually add hooks with `run_with_hooks` that store activations in a global caching dictionary; this is often useful if we only want to store e.g. subsets or functions of some activations.
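
The manual-caching idea can be sketched without the library. The block below is a pure-Python imitation of the two-layer arithmetic; its `run_with_hooks` function, the `store` and `store_sign` hooks, and the `manual_cache` dictionary are all hypothetical names for this illustration, and only the `(name, function)` pairing mirrors the interface described above.

```python
from types import SimpleNamespace

# Hypothetical pure-Python stand-in for the two-layer model above.
def run_with_hooks(x, fwd_hooks=()):
    hooks = {}
    for name, fn in fwd_hooks:
        hooks.setdefault(name, []).append(fn)

    def hook_point(name, value):
        # Pass a small object carrying the name, so hooks can read hook.name.
        hook = SimpleNamespace(name=name)
        for fn in hooks.get(name, []):
            result = fn(value, hook)
            if result is not None:
                value = result
        return value

    x_in = hook_point("hook_in", x)
    square1 = hook_point("layer1.hook_square", x_in * x_in)
    x_mid = hook_point("hook_mid", square1 + 3.0)
    square2 = hook_point("layer2.hook_square", x_mid * x_mid)
    return hook_point("hook_out", square2 - 4.0)


manual_cache = {}


def store(value, hook):
    # Store the raw activation; returning None leaves it unchanged.
    manual_cache[hook.name] = value


def store_sign(value, hook):
    # Store a function of the activation rather than the activation itself.
    manual_cache[hook.name + ".is_positive"] = value > 0


out = run_with_hooks(
    5.0, fwd_hooks=[("hook_mid", store), ("hook_out", store_sign)]
)
print(out, manual_cache)
```

Only the two hooked points end up in the dictionary, which is the main difference from caching every activation.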

In [11]:
out, cache = model.run_with_cache(torch.tensor(5.0))
print("Model output:", out.item())
for key in cache:
    print(f"Value cached at hook {key}", cache[key].item())

Model output: 780.0
Value cached at hook hook_in 5.0
Value cached at hook layer1.hook_square 25.0
Value cached at hook hook_mid 28.0
Value cached at hook layer2.hook_square 784.0
Value cached at hook hook_out 780.0


We can also use hooks to intervene on activations: e.g. we can set the intermediate value in layer 2 to zero, which changes the output to -4.

In [13]:
def set_to_zero_hook(tensor, hook):
    print(hook.name)
    return torch.tensor(0.0)

print(
    "Output after intervening on layer2.hook_square",
    model.run_with_hooks(
        torch.tensor(5.0), fwd_hooks=[("layer2.hook_square", set_to_zero_hook)]
    ).item(),
)

layer2.hook_square
Output after intervening on layer2.hook_square -4.0
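
As a quick check of both outputs, the arithmetic can be written out using the symbols from the model definition, with $s_2$ (a name introduced here) standing for the value at `layer2.hook_square`:

$$
\begin{aligned}
x_1 &= x_0^2 + 3 = 5^2 + 3 = 28 \\
\text{without the hook:}\quad x_2 &= s_2 - 4 = 28^2 - 4 = 780 \\
\text{with the hook:}\quad x_2 &= s_2 - 4 = 0 - 4 = -4
\end{aligned}
$$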


## Transformer models
We now define a stripped-down transformer. There are helper functions to load in the weights of several families of open-source LLMs - OpenAI's GPT-2, Facebook's OPT and Eleuther's GPT-Neo.

Note: OPT-350M is not supported - it applies the LayerNorms to the _outputs_ of each layer, which means we cannot fold the weights and biases into other layers, and supporting it would require a notably different architecture.
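
To make the folding remark concrete, here is a sketch of the algebra, assuming a LayerNorm whose output feeds directly into a linear map. The symbols are introduced here for illustration: $\hat{x}$ is the centered and scaled input, $\gamma$ and $\beta$ are the LayerNorm scale and bias, and $W$, $b$ are the weights and bias of the following linear layer.

$$
W(\gamma \odot \hat{x} + \beta) + b = \underbrace{W\,\mathrm{diag}(\gamma)}_{W'}\,\hat{x} + \underbrace{(W\beta + b)}_{b'}
$$

Under that assumption the learned LayerNorm parameters are absorbed into $W'$ and $b'$, leaving only the parameter-free normalization step. When the LayerNorm sits on a layer's output instead, its result is not consumed by a single linear map alone, so this rewrite does not carry over directly.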

The list of supported model names:
```
[
    'gpt2-small',
    'gpt2-medium',
    'gpt2-xl',
    'facebook/opt-125m',
    'facebook/opt-1.3b',
    'facebook/opt-2.7b',
    'facebook/opt-6.7b',
    'facebook/opt-13b',
    'facebook/opt-30b',
    'facebook/opt-66b',
    'EleutherAI/gpt-neo-125M',
    'EleutherAI/gpt-neo-1.3B',
    'EleutherAI/gpt-neo-2.7B'
]
```

### Examples
#### Setup
Load in GPT2-small

In [15]:
model_name = "gpt2"
model = EasyTransformer.from_pretrained(model_name).to(device)

Loading model: gpt2


Using pad_token, but it is not set yet.


Moving model to device:  cuda
Finished loading pretrained model gpt2 into EasyTransformer!
Moving model to device:  cuda


Create some reference text to run the models on. Models come with `to_tokens` and `to_str_tokens` methods, which convert text to tokens and to a list of individual tokens as strings, respectively. Though GPT-2 was not trained with a beginning-of-string token, we prepend one by default: the first token is often used as a "resting position" by inactive attention heads, and as a result behaves unusually.

In [16]:
prompt = "The rain in Spain stays mainly on the plain"
# The model has a method to_str_tokens
print(model.to_str_tokens(prompt, prepend_bos=True))

['<|endoftext|>', 'The', ' rain', ' in', ' Spain', ' stays', ' mainly', ' on', ' the', ' plain']


In [17]:
prompt_2 = "After the lunch, Mary and Bob went to the beach. Bob gave a candle to Mary."
tokens_2 = model.to_tokens(prompt_2, prepend_bos=True)
# to_str_tokens also takes a tensor of tokens as input, though only for a *single* example
print(model.to_str_tokens(tokens_2))

['<|endoftext|>', 'After', ' the', ' lunch', ',', ' Mary', ' and', ' Bob', ' went', ' to', ' the', ' beach', '.', ' Bob', ' gave', ' a', ' candle', ' to', ' Mary', '.']


In [19]:
print("Reference: Hyperparameters for the model")
dataclasses.asdict(model.cfg)

Reference: Hyperparameters for the model


{'n_layers': 12,
 'd_model': 768,
 'n_ctx': 1024,
 'd_head': 64,
 'model_name': 'gpt2',
 'n_heads': 12,
 'd_mlp': 3072,
 'act_fn': 'gelu_new',
 'd_vocab': 50257,
 'eps': 1e-05,
 'use_attn_result': False,
 'use_attn_scale': True,
 'use_local_attn': False,
 'original_architecture': 'GPT2LMHeadModel',
 'from_checkpoint': False,
 'checkpoint_index': None,
 'checkpoint_label_type': None,
 'checkpoint_value': None,
 'tokenizer_name': 'gpt2',
 'window_size': None,
 'attn_types': None,
 'init_mode': 'gpt2',
 'normalization_type': 'LNPre',
 'device': 'cuda',
 'attention_dir': 'causal',
 'attn_only': False,
 'seed': 42,
 'initializer_range': 0.02886751345948129,
 'init_weights': False,
 'scale_attn_by_inverse_layer_idx': False,
 'positional_embedding_type': 'standard',
 'final_rms': False,
 'd_vocab_out': 50257,
 'parallel_attn_mlp': False,
 'rotary_dim': None,
 'n_params': 84934656}
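
A few of these values can be cross-checked against each other. The sketch below uses only numbers copied from the printed config; the interpretation in the comments (that `n_params` counts only the attention and MLP weight matrices, and that `initializer_range` equals `0.8 / sqrt(d_model)`) is an inference from the arithmetic, not something stated in this notebook.

```python
import math

# Values copied from the printed config above.
n_layers, d_model, n_heads, d_head, d_mlp = 12, 768, 12, 64, 3072

# Four attention matrices per layer (query, key, value, output),
# each with d_model * (n_heads * d_head) entries.
attn_params = 4 * d_model * n_heads * d_head
# Two MLP matrices per layer (in and out), each with d_model * d_mlp entries.
mlp_params = 2 * d_model * d_mlp

n_params = n_layers * (attn_params + mlp_params)
print(n_params)  # 84934656, matching 'n_params' in the config

# The heads together span the full residual width.
print(n_heads * d_head == d_model)  # True

# Assumed relationship, checked numerically against the printed value.
print(math.isclose(0.8 / math.sqrt(d_model), 0.02886751345948129))  # True
```

If that reading is right, the figure excludes embeddings, biases, and LayerNorm parameters, which would explain why it is smaller than the parameter count usually quoted for GPT-2 small.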

### Using the model