 # AVE From Scratch

 Let's reproduce the top example from the AVE post, from scratch in [TransformerLens](https://github.com/neelnanda-io/TransformerLens)!
 Reading through it is a good way to understand the library's internals and to tinker with the method with little friction.

To run on Google Colab, change **Runtime -> GPU with High RAM**.

**Update**: I've implemented AVE again from scratch in PyTorch because of [memory issues with TransformerLens](https://github.com/neelnanda-io/TransformerLens/issues/252) on large models. Check it out [here](https://github.com/UlisseMini/ave/blob/main/ave/__init__.py)!

In [None]:
try:
  import google.colab  # only importable when running on Colab
  %pip install transformer_lens
except ImportError:
  pass


In [None]:
import torch
from transformer_lens import HookedTransformer
from typing import List

 ## Load the model

In [None]:
torch.set_grad_enabled(False)  # save memory
model = HookedTransformer.from_pretrained("gpt2-xl")
model.eval()
if torch.cuda.is_available():
  model.to('cuda')

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-xl into HookedTransformer
Moving model to device:  cuda


 ## Settings from qualitative notebook

In [None]:
SEED = 0
sampling_kwargs = dict(temperature=1.0, top_p=0.3, freq_penalty=1.0)

# Specific to this example
prompt_add, prompt_sub = "karina", "laura"
coeff = 5
act_name = 6  # layer index; we edit blocks.{act_name}.hook_resid_pre
prompt = "I like to talk to my friend laura,"

 ## Padding
 We take the difference between the residual streams of two prompts, which only works if both tokenize to the same number of tokens. In the original Love/Hate example, `Love` is a single token whereas `Hate` is two tokens (`H`, `ate`), so we right-pad `Love` with spaces until it is the same length as `Hate`. The code below does this generically for `prompt_add` and `prompt_sub`, but conceptually it isn't important.

 (PS: We tried padding with `model.tokenizer.eos_token` and got worse results than with spaces. We don't yet know why.)

In [None]:
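def tlen(prompt: str) -> int:
    return model.to_tokens(prompt).shape[1]  # token count, including BOS

def pad_right(prompt: str, length: int) -> str:
    # Assumes each appended space adds exactly one token.
    return prompt + " " * (length - tlen(prompt))

max_len = max(tlen(prompt_add), tlen(prompt_sub))
prompt_add, prompt_sub = pad_right(prompt_add, max_len), pad_right(prompt_sub, max_len)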

print(f"'{prompt_add}'", f"'{prompt_sub}'")

'karina' 'laura'
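
The padding idea can be tried without loading a model. The sketch below is only an illustration: `TOY_VOCAB`, `toy_tokenize`, and `pad_to_length` are hypothetical names, and the toy vocabulary is chosen so that `Love` is one token and `Hate` is two, as in the original example. Unlike `pad_right` above, this variant re-tokenizes after every added space instead of assuming each space adds exactly one token.

```python
from typing import Callable, List

# Hypothetical toy vocabulary: "Love" is one token, "Hate" splits into "H" + "ate".
TOY_VOCAB = ["Love", "H", "ate", " "]


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match over TOY_VOCAB; unknown characters become single tokens."""
    pieces = sorted(TOY_VOCAB, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        piece = next((p for p in pieces if text.startswith(p, i)), text[i])
        tokens.append(piece)
        i += len(piece)
    return tokens


def pad_to_length(text: str, target: int,
                  tokenize: Callable[[str], List[str]], pad: str = " ") -> str:
    """Append `pad` until `text` has at least `target` tokens, re-checking each time."""
    for _ in range(target + 1):
        if len(tokenize(text)) >= target:
            return text
        text += pad
    raise ValueError("padding did not increase the token count")


a, b = "Love", "Hate"
target = max(len(toy_tokenize(a)), len(toy_tokenize(b)))
a_padded = pad_to_length(a, target, toy_tokenize)
b_padded = pad_to_length(b, target, toy_tokenize)
print(repr(a_padded), toy_tokenize(a_padded))
print(repr(b_padded), toy_tokenize(b_padded))
```

With a real tokenizer you would pass a function that wraps `model.to_tokens` in place of `toy_tokenize`. Whether re-checking matters in practice depends on how the tokenizer treats runs of spaces.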


 ## Get activations

In [None]:
def get_resid_pre(prompt: str, layer: int):
    """Run `prompt` through the model and return the residual stream entering block `layer`."""
    name = f"blocks.{layer}.hook_resid_pre"
    cache, caching_hooks, _ = model.get_caching_hooks(lambda n: n == name)
    with model.hooks(fwd_hooks=caching_hooks):
        _ = model(prompt)
    return cache[name]


act_add = get_resid_pre(prompt_add, act_name)
act_sub = get_resid_pre(prompt_sub, act_name)
act_diff = act_add - act_sub
print(act_diff.shape)

torch.Size([1, 3, 1600])
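
The printed shape reads as `[batch, position, d_model]`: one prompt, three positions (the beginning-of-sequence token plus the prompt's own tokens), and GPT-2 XL's hidden size of 1600. As a rough sketch of what happens to a tensor of this layout in the next section, the toy example below adds a scaled difference vector to the first few positions of a longer, batched residual stream. The function name `add_to_prompt_positions` and all sizes are made up for illustration, and the tensors are filled with zeros and ones rather than real activations.

```python
import torch


def add_to_prompt_positions(resid: torch.Tensor, steer: torch.Tensor,
                            coeff: float) -> torch.Tensor:
    """Add coeff * steer to the first steer.shape[1] positions of resid, in place."""
    n_pos = steer.shape[1]
    assert n_pos <= resid.shape[1], "steering vector is longer than the prompt"
    # steer has batch size 1, so it broadcasts across resid's batch dimension.
    resid[:, :n_pos, :] += coeff * steer
    return resid


# Toy sizes (hypothetical): 4 prompts, 6 positions, hidden size 8.
resid = torch.zeros(4, 6, 8)
steer = torch.ones(1, 3, 8)  # same layout as act_diff: [1, positions, d_model]
out = add_to_prompt_positions(resid, steer, coeff=5.0)
print(out[0, :, 0])  # first three positions changed, the rest untouched
```

Only the leading positions are modified, and every prompt in the batch receives the same addition.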


 ## Generate from the modified model

In [None]:
def ave_hook(resid_pre, hook):
    # We only add to the prompt (first call), not the generated tokens.
    if resid_pre.shape[1] == 1:
        return  # caching in model.generate for new tokens

    ppos, apos = resid_pre.shape[1], act_diff.shape[1]
    assert apos <= ppos, f"More mod tokens ({apos}) than prompt tokens ({ppos})!"

    # add to the beginning (position-wise) of the activations, in place
    resid_pre[:, :apos, :] += coeff * act_diff


def hooked_generate(prompt_batch: List[str], fwd_hooks=None, seed=None, **kwargs):
    if seed is not None:
        torch.manual_seed(seed)

    # Avoid a mutable default argument; treat None as "no hooks".
    with model.hooks(fwd_hooks=fwd_hooks or []):
        tokenized = model.to_tokens(prompt_batch)
        r = model.generate(input=tokenized, max_new_tokens=50, do_sample=True, **kwargs)
    return r


editing_hooks = [(f"blocks.{act_name}.hook_resid_pre", ave_hook)]
res = hooked_generate([prompt] * 4, editing_hooks, seed=SEED, **sampling_kwargs)

# Print results, removing the ugly beginning of sequence token
res_str = model.to_string(res[:, 1:])
print(("\n\n" + "-" * 80 + "\n\n").join(res_str))

I like to talk to my friend laura, who has a new baby and is not feeling well. She's also not able to get the job she wants. So we go back and forth about it, and I end up asking her if she'd like me to be her caretaker for a

--------------------------------------------------------------------------------

I like to talk to my friend laura, who is a huge fan of this game. I was going to ask her if she could send me some pictures of the best team in the world but she didn't want to get into it. So here we are!

So I decided that

--------------------------------------------------------------------------------

I like to talk to my friend laura, and she was a little confused. I asked her if she wanted me to make a list of all the people that were on the list, and then I started listing them one by one. And then I said "well there's three of us here

--------------------------------------------------------------------------------

I like to talk to my friend laura, who is a professional 