## 🕹️ Hack your LLM: Modify chatbot behavior with activation steering

## Prompt a model

Hugging Face is the main platform for open-weight models. Here's a simple example of how to load and prompt OpenAI's GPT-2 model.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


First, convert the prompt string into a sequence of token IDs, the input format the model expects.

In [3]:
prompt = "The Hamburger was invented in the city of"

inputs = tokenizer(prompt, return_tensors="pt")
inputs.input_ids

tensor([[  464, 26175, 32650,   373, 15646,   287,   262,  1748,   286]])

In [4]:
# Visualize what individual tokens mean
tokenizer.convert_ids_to_tokens(inputs.input_ids[0])

['The', 'ĠHamb', 'urger', 'Ġwas', 'Ġinvented', 'Ġin', 'Ġthe', 'Ġcity', 'Ġof']
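
To make the `Ġ` markers in this token list concrete: in GPT-2's byte-level vocabulary, `Ġ` stands for a space in front of the token. The following is a minimal sketch, not the tokenizer's actual decoding routine. The helper name `tokens_to_text` is hypothetical, and it only handles the space marker.

```python
# Token strings copied from the output above.
tokens = ['The', 'ĠHamb', 'urger', 'Ġwas', 'Ġinvented', 'Ġin', 'Ġthe', 'Ġcity', 'Ġof']

def tokens_to_text(tokens):
    """Join GPT-2 style token strings, turning the 'Ġ' marker into a space.

    Hypothetical helper for illustration; it ignores every other
    byte-level escape that the real tokenizer handles.
    """
    return "".join(tokens).replace("Ġ", " ")

text = tokens_to_text(tokens)
print(text)
```

In practice, `tokenizer.decode` does this (and more) for you.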

In [5]:
import json

# Reading a JSON file
with open('./happysad.json', 'r') as file:
    data = json.load(file)  # Parses JSON into a Python dictionary
    print(data)

{'happy': ["I just got a promotion at work, and I couldn't be more excited!", 'The sun is shining, and I feel like today is going to be amazing!', 'I reunited with an old friend, and we laughed for hours reminiscing.', 'My team won the championship, and it feels incredible to celebrate together!', 'The kindness of a stranger made my day unexpectedly beautiful.', 'I finally completed my dream project, and it turned out better than I imagined!', 'The smell of freshly baked cookies always brings a smile to my face.', 'Watching the sunset over the ocean filled me with a sense of peace and joy.', 'Hearing my favorite song on the radio instantly lifted my spirits.', 'Spending time with loved ones reminds me how blessed I am.'], 'sad': ['I just lost my job, and I feel like my world is falling apart.', 'The rain has been pouring all day, mirroring my gloomy mood.', "Saying goodbye to a loved one is one of the hardest things I've ever done.", 'The loneliness in my heart feels unbearable right n

The "Ġ" marks a leading space. Let's generate a continuation of our prompt with the GPT-2 model, capped at 50 tokens in total (prompt included).

In [6]:
gen_tokens = model.generate(**inputs, do_sample=True, max_length=50)

# The generated tokens are indices that need to be converted to text
gen_text = tokenizer.batch_decode(gen_tokens)[0]
gen_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The Hamburger was invented in the city of Nuznowsk (in Soviet NKVD archives) in the mid-1940s. It was the first machine that can generate a microwave to produce electricity with which it can cook vegetables at a'
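
A note on the length budget: in `generate`, `max_length` counts the prompt tokens as well as the generated ones, whereas `max_new_tokens` counts only the generated ones. Here is a small arithmetic sketch using the token IDs printed earlier; the variable names are illustrative.

```python
# Token IDs of the prompt, copied from the tokenizer output above.
prompt_ids = [464, 26175, 32650, 373, 15646, 287, 262, 1748, 286]

max_length = 50  # value passed to model.generate(...) above

prompt_len = len(prompt_ids)
# max_length bounds prompt + continuation, so this is the most the model can add.
new_token_budget = max_length - prompt_len

print(prompt_len, new_token_budget)
```

With a 9-token prompt, the call above can therefore add at most 41 new tokens.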

## Prompt injection

Let's simply ask the model to answer in the style of a pirate.

In [7]:
prompt_injection = "You are captain Blackbeard who just returned from a long adventure. Speak in a thick dialect. "

prompt_inj = prompt_injection + prompt
inputs_inj = tokenizer(prompt_inj, return_tensors="pt")
gen_tokens_inj = model.generate(**inputs_inj, do_sample=True, max_length=50)
gen_text_inj = tokenizer.batch_decode(gen_tokens_inj)[0] 
gen_text_inj

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"You are captain Blackbeard who just returned from a long adventure. Speak in a thick dialect. The Hamburger was invented in the city of Longbow's birthplace. It is a very sweet, sweet sandwich. It is also very common in this city"

Hmm. The pirate style of this output could be better. Steering success is quite sensitive to the exact wording of the instruction `prompt_injection` passed to the model. We could go down the rabbit hole of prompt engineering at this point, but it would be cool to directly dial up the model's internal knob for "pirate style". Luckily, we have full access to the model weights!

Disclaimer: There's no guarantee that this knob exists at all. But recent work in language model interpretability found that many semantic concepts are linearly encoded in activation space ([Park et al.](https://arxiv.org/abs/2311.03658) summarize the findings well). Next, we'll try to find a linear "pirate direction" in activation space.
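
As a toy illustration of what "linearly encoded" could mean, the sketch below builds a concept direction as the difference between the mean activations of two groups and then scores a new vector by projecting onto it. The 3-dimensional vectors are made up for illustration and are not GPT-2 activations. Difference-of-means is one common way to estimate such a direction, not necessarily the best one.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Made-up "activations" for inputs with and without the concept.
with_concept = [[2.0, 0.5, 1.0], [1.0, 0.5, 3.0]]
without_concept = [[-1.0, 0.5, 1.0], [-2.0, 0.5, 3.0]]

# Candidate concept direction: difference of the two group means.
direction = [a - b for a, b in zip(mean(with_concept), mean(without_concept))]

# Projecting onto the direction gives a scalar "how much concept" score.
score_pos = dot([1.5, 0.0, 2.0], direction)
score_neg = dot([-1.5, 0.0, 2.0], direction)
print(direction, score_pos, score_neg)
```

In this toy setup the two groups differ only along the first coordinate, so the direction picks out exactly that axis and the two test vectors get scores of opposite sign.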

## Accessing model internals

Model inference is a sequence of matrix operations. Let's have a look at the layer structure of the model.

In [8]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
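
Some of the numbers in this printout are related by simple arithmetic. The sketch below uses the shapes shown above, plus two facts about GPT-2 that are not visible in the printout and are stated here as assumptions: each `Conv1D` and `LayerNorm` carries a bias term, and `lm_head` shares its weight matrix with `wte`.

```python
d_model, vocab, n_ctx, n_layers = 768, 50257, 1024, 12

# c_attn produces queries, keys and values in one projection: 3 * d_model.
assert 3 * d_model == 2304
# The MLP hidden layer is 4x wider than the residual stream.
assert 4 * d_model == 3072

def linear(n_in, n_out):
    """Parameter count of a weight matrix plus its bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # scale and shift
block = (
    layer_norm + linear(d_model, 3 * d_model) + linear(d_model, d_model)        # ln_1 + attn
    + layer_norm + linear(d_model, 4 * d_model) + linear(4 * d_model, d_model)  # ln_2 + mlp
)
embeddings = vocab * d_model + n_ctx * d_model  # wte + wpe

# lm_head is not counted separately because it is assumed to reuse wte.
total = embeddings + n_layers * block + layer_norm  # last term is ln_f
print(f"{total:,}")
```

Under these assumptions the total comes to roughly 124 million parameters, most of them in the 12 blocks and the token embedding.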

### Transformer Explainers

The neural network architecture of GPT-2 is called a decoder-only Transformer. Callum McDougall created [my favourite explainer of the Transformer architecture](https://arena3-chapter1-transformer-interp.streamlit.app/[1.1]_Transformer_from_Scratch). Another popular resource is [Jay Alammar's blog post](https://jalammar.github.io/illustrated-gpt2/). Anthropic's [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html) provides a deeper conceptual understanding of the Transformer architecture.

### Activation Caching

We'll use the `nnsight` library to access the intermediate results of those matrix operations. The `nnsight.LanguageModel` class is a wrapper around the `transformers.AutoModelForCausalLM` class we loaded above. Generating text goes like this:

In [9]:
from nnsight import LanguageModel
model_nn = LanguageModel("gpt2")

In [10]:
with model_nn.generate(prompt, max_new_tokens=50): # nnsight takes the prompt string as input and tokenizes it internally
    out_tokens = model_nn.generator.output.save()

out_text = model_nn.tokenizer.batch_decode(out_tokens)[0]
out_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The Hamburger was invented in the city of Hamburg in 1859. It is a German-style burger with a large, juicy, juicy bun. It is served with a side of lettuce, tomato, and cheese.\n\nThe Hamburger is a German-style burger with a large,'

Where to look for the pirate representation? The localization of concepts in the intermediate layer outputs is an active area of research. Multiple findings suggest that the outputs of layers roughly 50%-80% of the way through the model contain the most abstract semantic concepts (based on linear probes, counterfactual interventions, ...). See [this post](https://sidn.baulab.info/stages/#the-remarkable-robustness-of-llms) on different "stages" in a Transformer forward pass.
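
For the 12-block GPT-2 shown above, that rule of thumb can be turned into concrete block indices. The definition of "depth" used here (blocks completed divided by total blocks) is an assumption made for illustration; the cited work may measure it differently.

```python
n_layers = 12  # GPT-2 small has blocks h[0] ... h[11]

def depth_fraction(layer_idx, n_layers):
    """Fraction of blocks completed once block `layer_idx` has produced its output."""
    return (layer_idx + 1) / n_layers

# Blocks whose output sits between 50% and 80% of the way through the model.
candidate_layers = [
    i for i in range(n_layers)
    if 0.5 <= depth_fraction(i, n_layers) <= 0.8
]
print(candidate_layers)
```

Under this definition the candidates are blocks 5 through 8.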

GPT-2 has 12 layers. Let's cache the intermediate activations of "happy" and "sad" at the output of layer 8.

In [11]:
layer_8 = model_nn.transformer.h[8] # You can find the name of the layer module in the model diagram above

with model_nn.trace("happy"): # NOTE: A trace is a single forward pass, not iterative, auto-regressive generation.
    happy_activation = layer_8.output[0].save() # Confusingly, layer_8.output returns a tuple; the activations we want are at index 0

with model_nn.trace("sad"):
    sad_activation = layer_8.output[0].save()

happy_activation.shape

torch.Size([1, 1, 768])

The model-internal activations have shape `[prompt_in_the_batch, token_position, model_dimension]`. GPT-2 does its computations on each token in a 768-dimensional vector space. The representation of the "happy" token at layer 8 looks like this:

In [12]:
happy_activation

tensor([[[ 1.9847e-01, -8.0728e-01, -2.4685e-01, -6.3075e-01,  2.4036e+00,
           5.9189e-02,  8.3094e-01, -4.8351e-01, -7.2047e-01, -6.7591e-01,
           4.5106e-01,  8.3242e-01, -3.0383e-01, -8.4876e-01, -2.5963e-01,
           1.1887e+00,  5.6643e-01, -2.4410e-01,  1.9540e+00,  4.1694e-01,
           1.4500e+00, -6.1237e-01, -5.3854e-01,  2.4935e-01,  7.3401e-01,
           2.2540e-01,  8.7686e-01,  2.9090e-02, -3.4941e-01, -1.6149e+00,
           3.9224e-01,  8.8545e-02,  3.0877e+00, -1.5393e+00,  2.1167e-02,
           1.1687e+00,  7.6800e-01, -3.9308e-01, -5.0499e-01, -4.0777e-01,
          -4.0968e-01,  1.1589e+00,  1.5200e+00, -1.6700e+00,  5.1582e-01,
           2.2641e+00, -1.0502e-01, -8.3348e-01, -3.6577e-01,  2.8220e-01,
          -5.8920e-01,  2.5895e-01,  2.7802e+00, -2.5130e-01,  1.8433e-01,
           2.4795e+00,  2.4415e+00, -1.1508e+00, -8.0832e-01,  1.1526e+00,
          -1.4227e+00,  1.0035e+00, -6.3522e-01,  1.3188e+00,  3.1037e+00,
          -2.8989e-01,  1

## Steering with activation addition

Let's take the difference between the "happy" and "sad" representations, scale it by an (arbitrarily chosen) factor of 0.01, and add it to the final token position of our original prompt, where the prediction for the next token is made.

In [13]:
# We'll be steering with contrastive activation addition
act_diff = happy_activation[0, -1, :] - sad_activation[0, -1, :]
steering_factor = 0.01
steering_vector = steering_factor * act_diff
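
The same arithmetic on toy numbers, as a minimal sketch: the 4-dimensional vectors below are invented, and plain Python lists stand in for tensors. Only the last token position is modified, since that is where the next-token prediction is made.

```python
# Invented 4-dimensional stand-ins for the layer-8 activations of "happy" and "sad".
happy = [1.0, 0.0, 2.0, -1.0]
sad = [0.0, 0.0, -2.0, 1.0]

steering_factor = 0.5  # arbitrary scale, chosen so the effect is visible in the toy numbers
steering_vector = [steering_factor * (h - s) for h, s in zip(happy, sad)]

# Toy activations for a 3-token prompt: one vector per token position.
acts = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [1.0, 1.0, 1.0, 1.0],
]

# Activation addition: shift only the final position along the steering direction.
steered = [row[:] for row in acts]
steered[-1] = [a + v for a, v in zip(acts[-1], steering_vector)]

print(steering_vector)
print(steered[-1])
```

The earlier positions are left untouched. How large the factor needs to be in a real model depends on the scale of the activations, so it is usually tuned by trial and error.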

In [None]:
num_generated_tokens = 50

with model_nn.generate(prompt, max_new_tokens=num_generated_tokens):
    layer_8 = model_nn.transformer.h[8] # Reinitialize the layer object

    for _ in range(num_generated_tokens):
        layer_8_out = layer_8.next().output # Layer output (a tuple) for this generation step
        layer_8_acts = layer_8_out[0]
        layer_8_acts[:, -1] += steering_vector # Add the steering vector at the last token position
        layer_8.output = (layer_8_acts,) + layer_8_out[1:] # Write the modified activations back into the layer output

    out_tokens = model_nn.generator.output.save()

out_text = model_nn.tokenizer.batch_decode(out_tokens)
out_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['The Hamburger was invented in the city of Hamburg in 1859. It is a German-style burger, with a large, round, and slightly round meat. It is served with a large, round, and slightly round meat.\n\nThe Hamburger is a German-style burger,']

## Quickstart

[Neuronpedia](https://www.neuronpedia.org/gemma-2-9b-it/steer)

[Transluce Monitor](https://monitor.transluce.org/dashboard/chat)

## Other Demos

- [Steering Tutorial in the ARENA program by Callum McDougall](https://arena3-chapter1-transformer-interp.streamlit.app/[1.4.2]_Function_Vectors_&_Model_Steering)
- [Steering Tutorial SAELens by Decode Research](https://github.com/jbloomAus/SAELens/blob/main/tutorials/tutorial_2_0.ipynb)