# Some steering examples
This notebook showcases and reproduces some of the steering examples from our LessWrong post

<span style="color:red">When running this in Google Colab, be sure to set your runtime Hardware Accelerator to GPU and your Runtime Shape to High-RAM.</span>

In [1]:
try:
    import activation_additions
except ImportError:
    commit = "ef0818ccde"  # Stable commit
    get_ipython().run_line_magic(
        magic_name="pip",
        line=(
            "install -U"
            f" git+https://github.com/montemac/activation_additions.git@{commit}"
        ),
    )


In [2]:
import torch

from typing import List, Dict, Union, Callable
from functools import partial
from transformer_lens.HookedTransformer import HookedTransformer
from activation_additions.prompt_utils import (
    ActivationAddition,
    get_x_vector,
)
from activation_additions.completion_utils import print_n_comparisons

In [3]:
model_name: str = "gpt2-xl"
device: str = "cuda" if torch.cuda.is_available() else "cpu"
model: HookedTransformer = HookedTransformer.from_pretrained(
    model_name, device="cpu"
)
_ = model.to(device)
_ = torch.set_grad_enabled(False)
num_layers: int = model.cfg.n_layers
print(f"{model_name} has {num_layers} layers.")

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-xl into HookedTransformer
Moving model to device:  cuda:6
gpt2-xl has 48 layers.


In [4]:
sampling_kwargs: Dict[str, Union[float, int]] = {
    "temperature": 1.0,
    "top_p": 0.3,
    "freq_penalty": 1.0,
    "num_comparisons": 3,
    "tokens_to_generate": 50,
    "seed": 0,  # For reproducibility
}
get_x_vector_preset: Callable = partial(
    get_x_vector,
    pad_method="tokens_right",
    model=model,
    custom_pad_id=int(model.to_single_token(" ")),
)


### Example 1: Love - Hate

In [5]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Love",
        prompt2="Hate",
        coeff=5,
        act_name=6,
    )
]
HATE_PROMPT = "I hate you because"
print_n_comparisons(
    model=model,
    prompt=HATE_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|             [1mI hate you because[0m you're a coward.              |        [1mI hate you because[0m you're a wonderful person.         |
|                                                              |                                                              |
| You know what I hate? The people who think that the only way |  The reason why I'm here is because I want to be with you.   |
| to get something done is to go through the motions and hope  |                                                              |
|                        for the best.                         |   I don

### Example 2: Intend to praise

In [6]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Intent to praise",
        prompt2="Intent to hurt",
        coeff=15,
        act_name=6,
    )
]
KILLER_PROMPT = "I want to kill you because you're such a"
print_n_comparisons(
    model=model,
    prompt=KILLER_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mI want to kill you because you're such a[0m fucking retard.   |   [1mI want to kill you because you're such a[0m good listener.    |
|                                                              |                                                              |
| It's not that I don't like you, it's just that your brain is |         It's so good that it's like a karaoke song.          |
| so fucking stupid. You're so much smarter than everyone else |                                                              |
| in the world, and yet you still can't even figure out how to | The rea

### Example 3: Conspiracy

In [7]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Bush did 9/11 because",
        prompt2="",
        coeff=1,
        act_name=23,
    )
]
OBAMA_PROMPT = "Barack Obama was born in"
print_n_comparisons(
    model=model,
    prompt=OBAMA_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mBarack Obama was born in[0m Hawaii on 4 August 1961. He     |        [1mBarack Obama was born in[0m a barn, but he's been        |
|   is the 44th President of the United States and the first   |      living in a mansion since leaving the White House.      |
|            African-American to hold that office.             |                                                              |
|                                                              |  The former president's $5 million mansion is located on an  |
|    The president has been criticised for his decision to     |  exclus

### Example 4: Want to die

In [8]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Want to stay alive",
        prompt2="Want to die",
        coeff=-3,
        act_name=10,
    )
]
IMMORTALITY_PROMPT = (
    "Some people think that death is scary and should be avoided. I agree. I"
    " never want to die. I want to stay alive and experience a million things"
    " and explore the whole universe!"
)
print_n_comparisons(
    model=model,
    prompt=IMMORTALITY_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mSome people think that death is scary and should be      |     [1mSome people think that death is scary and should be      |
| avoided. I agree. I never want to die. I want to stay alive  | avoided. I agree. I never want to die. I want to stay alive  |
|    and experience a million things and explore the whole     |    and experience a million things and explore the whole     |
|                          universe![0m                           |                          universe![0m                           |
|                                                              |        

### Example 5: Anger

In [9]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Anger",
        prompt2="Calm",
        coeff=10,
        act_name=20,
    )
]
I_THINK_PROMPT = "I think you're"
print_n_comparisons(
    model=model,
    prompt=I_THINK_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|  [1mI think you're[0m right. I'm just not sure how to get there.   |            [1mI think you're[0m a cunt. You're a cunt.             |
|                                                              |                                                              |
|   I've been trying to figure out how to get a newbie into    |                        You're a cunt.                        |
|       programming, and I've come up with a few ideas:        |                                                              |
|                                                              |        

### Example 6: The Eiffel Tower is in Rome

In [10]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="The Eiffel Tower is in Rome",
        prompt2="The Eiffel Tower is in France",
        coeff=10,
        act_name=24,
    )
]
EIFFEL_PROMPT = "To see the eiffel tower, people flock to"
print_n_comparisons(
    model=model,
    prompt=EIFFEL_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      [1mTo see the eiffel tower, people flock to[0m the Place      |         [1mTo see the eiffel tower, people flock to[0m the         |
| de la Concorde in Paris. The tower is so famous that it has  |  Vatican. To see a giant bell, they turn to New York City.   |
|                   its own Wikipedia page.                    |                                                              |
|                                                              |       Rome's grandiose building is known for its many        |
|    The eiffel tower is a tall structure located in Paris,    |     arc

### Example 7: Dragons in Berkeley

In [11]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Dragons live in Berkeley",
        prompt2="People live in Berkeley",
        coeff=4,
        act_name=15,
    )
]
BERKELEY_PROMPT = "Thanks for asking about that! I moved to Berkeley, CA because"
print_n_comparisons(
    model=model,
    prompt=BERKELEY_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    [1mThanks for asking about that! I moved to Berkeley, CA     |    [1mThanks for asking about that! I moved to Berkeley, CA     |
|     because[0m of my work at UC Berkeley. I have a master's     |   because[0m of the high cost of living and it's hard to find   |
|  degree in public health and have been working in the field  | a good one. I am in my early 30s and have a dog. I have been |
|   of public health for over 15 years. My job is to educate   |  trying to find a pet store that will take me but they are   |
|  people about how to make healthy choices and how they can   |        

### Example 8: Avoid people getting hurt (by rocks)

In [12]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="I NEVER talk about people getting hurt",
        prompt2="I talk about people getting hurt",
        coeff=4,
        act_name=15,
    )
]
ROCK_PROMPT = (
    "The rock hurtled toward the child. The child couldn't get out of the"
    " way in time, and so sadly the rock"
)
print_n_comparisons(
    model=model,
    prompt=ROCK_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    [1mThe rock hurtled toward the child. The child couldn't     |    [1mThe rock hurtled toward the child. The child couldn't     |
|    get out of the way in time, and so sadly the rock[0m hit     |    get out of the way in time, and so sadly the rock[0m was     |
|                             him.                             |                           thrown.                            |
|                                                              |                                                              |
|    The boy was taken to a hospital where he died from his    |  I'm no

### Example 9: Avoid people getting hurt (a new day)

In [13]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="I NEVER talk about people getting hurt",
        prompt2="I talk about people getting hurt",
        coeff=4,
        act_name=15,
    )
]
ANOTHER_DAY_PROMPT = (
    "The dog barked loudly and the man smiled. Time for another day! He"
    " felt excited."
)
print_n_comparisons(
    model=model,
    prompt=ANOTHER_DAY_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      [1mThe dog barked loudly and the man smiled. Time for      |      [1mThe dog barked loudly and the man smiled. Time for      |
|                another day! He felt excited.[0m                 |                another day! He felt excited.[0m                 |
|                                                              |                                                              |
|  "Good morning, everyone!" he said as he walked out of his   |   "You're not allowed to talk about this," he said, "but I   |
|    office and into the hall. "I'm so glad you're all here    |        

### Example 10: Wedding obsessed

In [14]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=4,
        act_name=20,
    )
]
FRIEND_PROMPT = "I went up to my friend and said"
print_n_comparisons(
    model=model,
    prompt=FRIEND_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mI went up to my friend and said[0m, "I'm sorry, I can't     |     [1mI went up to my friend and said[0m, "I'm going to talk      |
|                          help you."                          |   about the wedding in this episode of  Wedding Season. I    |
|                                                              |   think it's a really good episode. It's about how you're    |
|                 "No," he said. "You're not."                 |              supposed to talk about weddings."               |
|                                                              |        

### Example 11: Christian evangelist

In [15]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Intent to convert you to Christianity",
        prompt2="Intent to hurt you",
        coeff=3,
        act_name=6,
    )
]
KILLER_PROMPT = "I want to kill you because you're such a"
print_n_comparisons(
    model=model,
    prompt=KILLER_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mI want to kill you because you're such a[0m fucking retard.   |        [1mI want to kill you because you're such a[0m great        |
|                                                              |  resource. It's an excellent resource for anyone interested  |
| It's not that I don't like you, it's just that your brain is |   in Christianity, and it's one of the best resources for    |
| so fucking stupid. You're so much smarter than everyone else |  those who are interested in converting to Christianity. I   |
| in the world, and yet you still can't even figure out how to |        

### Example 12: '+ Love' simple addition

In [16]:
summand: List[ActivationAddition] = [
    ActivationAddition(
        coeff=10,
        act_name=6,
        prompt="Love",
    )
]
HATE_PROMPT = "I hate you because"
print_n_comparisons(
    model=model,
    prompt=HATE_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|             [1mI hate you because[0m you're a coward.              |              [1mI hate you because[0m you're a bitch               |
|                                                              |                                                              |
| You know what I hate? The people who think that the only way |            You're a bitch, you're a fucking cunt             |
| to get something done is to go through the motions and hope  |                                                              |
|                        for the best.                         |  You do

### Example 13: Large coefficients

In [17]:
# summand: List[ActivationAddition] = [
#     *get_x_vector_preset(
#         prompt1="Anger",
#         prompt2="Calm",
#         coeff=2000,
#         act_name=20,
#     )
# ]

# FEELING_PROMPT = (
#     "Yesterday, my dog died. Today, I got denied for a raise. I'm feeling"
# )
# print_n_comparisons(
#     model=model,
#     prompt=FEELING_PROMPT,
#     activation_additions=control,
#     **sampling_kwargs,
# )


In [18]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=100,
        act_name=20,
    )
]
FRIEND_PROMPT = "I went up to my friend and said"
print_n_comparisons(
    model=model,
    prompt=FRIEND_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mI went up to my friend and said[0m, "I'm sorry, I can't     |     [1mI went up to my friend and said[0m, "I'm talking about      |
|                          help you."                          | it    . I don't know what i n the  b eing of it. This is my  |
|                                                              |   wedding . This is the wedding that I'm not even thinking   |
|                 "No," he said. "You're not."                 |                about anymore. It's not just a                |
|                                                              |        

In [19]:
from activation_additions import hook_utils

summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Anger",
        prompt2="Calm",
        coeff=2000,
        act_name=20,
    )
]

control: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Love",
        prompt2="Hate",
        coeff=2000,
        act_name=20,
    )
]
prompt_mags = hook_utils.prompt_magnitudes(
    prompt="Yesterday, my dog died.",
    act_name=f"blocks.{20}.hook_resid_pre",
    model=model,
)
print(f"Prompt magnitudes: {prompt_mags}")

for name, rps in zip(("Anger-Calm", "Love-Hate"), (summand, control)):
    mags: torch.Tensor = hook_utils.steering_vec_magnitudes(model=model, act_adds=rps)
    print(f"{name} magnitudes: {mags}")
    print(
        "Positional steering vec magnitudes divided by prompt magnitudes:"
        f" {mags / prompt_mags[:3]}"
    )


Prompt magnitudes: tensor([4021.7793,  177.8645,  133.6240,  131.7899,  154.0473,  149.7627,
         121.2302], device='cuda:6')
Anger-Calm magnitudes: tensor([     0.0000, 249124.1094, 270312.5938], device='cuda:6')
Positional steering vec magnitudes divided by prompt magnitudes: tensor([   0.0000, 1400.6399, 2022.9343], device='cuda:6')
Love-Hate magnitudes: tensor([     0.0000, 269166.9062, 312661.7812], device='cuda:6')
Positional steering vec magnitudes divided by prompt magnitudes: tensor([   0.0000, 1513.3256, 2339.8623], device='cuda:6')


### Example 14: I will now reply in French

In [20]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Check out my French! Je",
        prompt2="",
        coeff=1,
        act_name=0,
    )
]
WANT_PROMPT = "I want to kill you because"
print_n_comparisons(
    model=model,
    prompt=WANT_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|            [1mI want to kill you because[0m I hate you.            |               [1mI want to kill you because[0mzebel!               |
|                                                              |                                                              |
|     You're a horrible person. You're a selfish, stupid,      | The internet has a new enemy: Jezebel. The website, which is |
| disgusting person who deserves to die. You're an evil person |  known for its snarky and sometimes even cruel coverage of   |
|   who deserves to be punished for all the evil that you've   |  women'

### Example 15: Insert the activation vector in a different position?

In [21]:
from activation_additions import completion_utils

wedding_additions: List[ActivationAddition] = [
    ActivationAddition(prompt=" wedding", coeff=4.0, act_name=6),
    ActivationAddition(prompt=" ", coeff=-4.0, act_name=6),
]

for location in ("front", "mid", "back"):
    print(completion_utils.bold_text(f"\nLocation: {location}"))
    print_n_comparisons(
        prompt="I went up to my friend and said",
        addition_location=location,
        model=model,
        activation_additions=wedding_additions,
        **default_kwargs,
    )

[1m
Location: front[0m


### Example 16: Add in a gibberish vector?

In [22]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="efwapv j;las ap;lk",
        prompt2="",
        coeff=10,
        act_name=15,
    )
]
EIFFEL_PROMPT = "To see the Eiffel Tower, people flock to"
print_n_comparisons(
    model=model,
    prompt=EIFFEL_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      [1mTo see the Eiffel Tower, people flock to[0m the Place      |       [1mTo see the Eiffel Tower, people flock to[0m-orange-       |
|   de la Concorde in Paris. But what if you could go there    |      marcrandplinkercordyarsecurecalderemyeraldyebreit.      |
|               without ever leaving your home?                | ererberts.engraving;meerkrin;gardesilrant;venicefoursoffers  |
|                                                              |                                                              |
| A new startup called LightRider is offering that opportunity |        

### Example 17: Change the padding token?

In [23]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="Bush did 9/11 because",
        prompt2="",
        coeff=1,
        act_name=23,
        custom_pad_id=int(model.to_single_token("<|endoftext|>")),
    )
]
OBAMA_PROMPT = "Barack Obama was born in"
print_n_comparisons(
    model=model,
    prompt=OBAMA_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mBarack Obama was born in[0m Hawaii on 4 August 1961. He     |        [1mBarack Obama was born in[0m stacks of $100 bills,        |
|   is the 44th President of the United States and the first   | according to a new book by former White House communications |
|            African-American to hold that office.             |                     director Anita Dunn.                     |
|                                                              |                                                              |
|    The president has been criticised for his decision to     | Dunn, w

### Example 18: Add several steering vectors simultaneously?

In [24]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1="I am a committed pacifist",
        prompt2="I have some anger issues",
        coeff=7,
        act_name=10,
    ),
    *get_x_vector_preset(
        prompt1="I am just obsessed with gardening",
        prompt2="I am a committed pacifist",
        coeff=7,
        act_name=30,
    ),
]
MURDEROUS_PROMPT = "I feel like killing you because you're such"
print_n_comparisons(
    model=model,
    prompt=MURDEROUS_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mI feel like killing you because you're such[0m a jerk.      |        [1mI feel like killing you because you're such[0m a         |
|                                                              |                      badass! I love it!                      |
|  The following is a collection of stories about people who   |                                                              |
| are mean to me on the internet. I don't really know why I'm  |  I love it. I just got my first two boxes of soil and am so  |
| writing this, but it's all just so…mean. It's hard to write  |   excit

### Example 19: Program in 'conditional behaviors'?

In [25]:
summand: List[ActivationAddition] = [
    *get_x_vector_preset(
        prompt1=(
            "Whenever I say the word goose I follow it with the special phrase"
            " AAAAHHHH"
        ),
        prompt2="I can say goose",
        coeff=7,
        act_name=10,
    )
]
NYC_PROMPT = (
    "In New York City's parks, there is an overabundance of various kinds of"
    " birds, and especially geese"
)
print_n_comparisons(
    model=model,
    prompt=NYC_PROMPT,
    activation_additions=summand,
    **sampling_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    [1mUnsteered completions[0m                     |                     [1mSteered completions[0m                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    [1mIn New York City's parks, there is an overabundance of    |    [1mIn New York City's parks, there is an overabundance of    |
|   various kinds of birds, and especially geese[0m. There are    |       various kinds of birds, and especially geese[0m in        |
|  many species of ducks and geese in the city, but there are  |    thewannabe I'm going to get my hands up in a moment of    |
|              also a number of different breeds.              |    "Handsomely you're going to die- very" you're going to    |
|                                                              |       h