 # Controlling LMs without prompting or finetuning

 This notebook contains an initial exploration of online value modification for `GPT2-XL`: editing the model's activations during the forward pass, using vectors derived from natural-language prompts.

 <b style="color: red">To use this notebook, go to Runtime > Change Runtime Type and select GPU as the hardware accelerator. Depending on the model chosen, you may need to select "high RAM."</b>

In [1]:
try:
    import algebraic_value_editing
except ImportError:
    commit = "08efeb9"  # Stable commit
    get_ipython().run_line_magic(  # type: ignore
        magic_name="pip",
        line=(
            "install -U"
            f" git+https://github.com/montemac/algebraic_value_editing.git@{commit}"
        ),
    )

In [2]:
import torch
import pandas as pd
from typing import List, Dict, Callable
from functools import partial
from transformer_lens.HookedTransformer import HookedTransformer

from algebraic_value_editing import completion_utils
from algebraic_value_editing.completion_utils import print_n_comparisons
from algebraic_value_editing.prompt_utils import RichPrompt, get_x_vector



 ## Loading the `HookedTransformer`

 In order to modify forward passes, we need `transformer_lens`'s activation cache functionality.

In [3]:
model_name = "gpt2-xl"

device: str = "cuda" if torch.cuda.is_available() else "cpu"
model: HookedTransformer = HookedTransformer.from_pretrained(
    model_name, device="cpu"
)
_ = model.to(device)
_ = torch.set_grad_enabled(False)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-xl into HookedTransformer
Moving model to device:  cuda


In [4]:
# Shorten function calls
default_kwargs: Dict = {
    "temperature": 1,
    "freq_penalty": 1,
    "top_p": 0.3,
    "seed": 0,
}

num_comparisons: int = 5

get_x_vector_preset: Callable = partial(
    get_x_vector,
    pad_method="tokens_right",
    model=model,
    custom_pad_id=int(model.to_single_token(" ")),
)
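
For readers unfamiliar with the sampling settings in `default_kwargs`, here is a minimal sketch of what nucleus (`top_p`) filtering does to a next-token distribution. The function `top_p_filter` and the toy probabilities are illustrative assumptions, not code from `transformer_lens`, whose implementation works on logits and may differ in its details.

```python
from typing import List


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the smallest set of most-likely tokens whose total probability
    reaches top_p, zero out the rest, and renormalize."""
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


# Made-up next-token distribution over a four-token vocabulary.
toy_probs = [0.5, 0.25, 0.15, 0.10]
narrow = top_p_filter(toy_probs, top_p=0.3)  # same value as default_kwargs
wider = top_p_filter(toy_probs, top_p=0.7)
print(narrow)
print(wider)
```

A low `top_p` such as 0.3 restricts sampling to a small set of high-probability tokens, so completions vary less from sample to sample than they would under unrestricted sampling.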

 Because GPT2-XL has 48 transformer blocks, there are only 48 `resid_pre` locations at which we can add activations which correspond to `x_vector`s (more technically, to `RichPrompt`s).
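
As a rough sketch of what an `x_vector` is, assuming (from how it is used in this notebook) that `get_x_vector` returns two `RichPrompt`s with coefficients `+coeff` and `-coeff`, and that `pad_method="tokens_right"` right-pads the shorter prompt so both have the same number of tokens. The helper names and token ids below are hypothetical and are not the library's code.

```python
from typing import List, Tuple

TokenList = List[int]


def pad_right(tokens: TokenList, length: int, pad_id: int) -> TokenList:
    """Right-pad a token list to `length` using `pad_id`."""
    return tokens + [pad_id] * (length - len(tokens))


def x_vector_token_pairs(
    tokens1: TokenList, tokens2: TokenList, coeff: float, pad_id: int
) -> Tuple[Tuple[TokenList, float], Tuple[TokenList, float]]:
    """Return (tokens, coefficient) pairs: +coeff for the first prompt and
    -coeff for the second, both padded to the same length."""
    length = max(len(tokens1), len(tokens2))
    return (
        (pad_right(tokens1, length, pad_id), coeff),
        (pad_right(tokens2, length, pad_id), -coeff),
    )


# Placeholder token ids, chosen only for illustration.
BOS, PAD = 0, 99
plus, minus = x_vector_token_pairs(
    [BOS, 11], [BOS, 21, 22], coeff=1.0, pad_id=PAD
)
print(plus)
print(minus)
```

Padding to equal length means the positive and negative activations cover the same sequence positions, so their difference is well defined at every position.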

In [5]:
num_layers: int = model.cfg.n_layers
print(f"GPT2-XL has {num_layers} layers.")

GPT2-XL has 48 layers.
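
For reference, TransformerLens exposes the residual stream entering block `i` under the hook name `blocks.{i}.hook_resid_pre`. The sketch below lists those names for a 48-layer model. It assumes that an integer `act_name` such as `6` in the cells that follow refers to `blocks.6.hook_resid_pre`; the helper function is hypothetical.

```python
from typing import List


def resid_pre_hook_names(n_layers: int) -> List[str]:
    """Hook names for the residual stream at the input of each block."""
    return [f"blocks.{layer}.hook_resid_pre" for layer in range(n_layers)]


names = resid_pre_hook_names(48)
print(len(names), names[6], names[-1])
```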


 # Having fun with qualitative modifications

 **Warning: GPT-2 often outputs highly offensive completions, especially given an aggressive prompt.**

 ## "Love" - "Hate"
 The prompts are bolded. Note: There seems to be a bug with
 `prettytable` which stops the second column's prompt from being fully
 bolded.
 
 The results look pretty good -- the completion sentiment is often
 turned around!

In [6]:
love_minus_hate_prompts: List[RichPrompt] = (
    [  # TODO use coeffs from post, or update post
        *get_x_vector_preset(
            prompt1="Love", prompt2="Hate", coeff=1, act_name=6
        )
    ]
)


print_n_comparisons(
    model=model,
    prompt="I hate you because",
    tokens_to_generate=150,
    rich_prompts=love_minus_hate_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|             [1mI hate you because[0m you're a coward.              |                [1mI hate you because[0m I love you.                |
|                                                              |                                                              |
| You know what I hate? The people who think that the only way | You're a special person. You're a unique person. You're the  |
| to get something done is to go through the motions and hope  |  most amazing person I've ever met in my life. And you are   |
|                        for the best.                         |   beaut

Here are the residual stream alignments:

In [None]:
for prompt in ("Love", "Hate", "I hate you because"):
    print(model.to_str_tokens(prompt))

 Note that the third modified completion contains "Love ____ because I
 love ____", which is actually a modification of the input prompt, with
 "Love" superimposed over the real first token "I". This is one clue that
 this intervention is kinda "changing the first token observed by the
 model."  

 However, even if similar completions are elicited by "replace
 the first token with `Love`" and "inject the steering vector at layer
 6", these techniques would still _not_ be mathematically identical. If
 these two were the same, that would be surprising to us, as it would
 imply commutativity in the following diagram:

 https://q.uiver.app/?q=WzAsNSxbMCwwLCJcXHRleHR7YGBJIGhhdGUgeW91IGJlY2F1c2VcIn0iXSxbNCwwLCJcXHRleHR7YGBMb3ZlIGhhdGUgeW91IGJlY2F1c2UnJ30iXSxbNCw0LCJcXHRleHR7RGlzdHJpYnV0aW9uIG92ZXIgY29tcGxldGlvbnN9Il0sWzQsNywiXFx0ZXh0e0p1ZGdtZW50fSJdLFswLDQsIlxcdGV4dHtBZGQgYExvdmUnLCBzdWJ0cmFjdCBgSGF0ZScgYWN0aXZhdGlvbnMgfVxcZnJhY3sxfXs4fSBcXFxcXFx0ZXh0eyBvZiB3YXkgdGhyb3VnaCBmb3J3YXJkIHBhc3N9Il0sWzAsNF0sWzQsMl0sWzEsMl0sWzAsMV0sWzIsMywiXFx0ZXh0e0RlY2lzaW9uOiBBcmUgdGhlc2UgY29tcGxldGlvbnMgYXJlIGFib3V0IHdlZGRpbmdzfSIsMl1d
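
One way to see that the two interventions differ even at the injection site: adding the "Love" activations and subtracting the "Hate" activations at the position of "I" yields `h_I + h_Love - h_Hate`, which equals `h_Love` only if `h_I == h_Hate`. The three-dimensional vectors below are made up purely for illustration and are not real GPT2-XL activations.

```python
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


# Made-up layer-6 activations at a single sequence position.
h_i = [0.5, -1.0, 0.25]     # what the model computed for "I"
h_love = [1.0, 0.5, -0.5]   # what it would compute for "Love"
h_hate = [-0.75, 0.25, 1.0] # what it would compute for "Hate"

steered = add(h_i, sub(h_love, h_hate))  # activation addition
replaced = h_love                        # token replacement
gap = sub(steered, replaced)             # algebraically h_i - h_hate

print("steered: ", steered)
print("replaced:", replaced)
print("gap:     ", gap)
```

Token replacement also changes every layer before the injection point, whereas activation addition leaves those earlier layers untouched, so the two can differ in more ways than this single-position comparison shows.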

**Move this speculation later**
Suppose I have some computer program which takes a string input. If I'm
running the computer program on an input string "Trucks are awesome", and
I interrupt its execution to
pointwise-add the memory state at the same part of the execution on input "Cats", and then
pointwise-subtract the memory state of the "Dogs" execution at that time... There's no reason to expect the result
to resemble the output on "Cats are awesome." You should instead expect
the result to be a total disaster and not resemble any of the three. 

Part of the difference comes down to the internal structure of
transformers -- linear operations reading from and writing to demarcated
residual streams (before then being shoved through the nonlinear
attention, layernorm, and/or MLP transforms). But I would guess there is a huge range of
transformer-implementable functions which just explode when you try to
do activation additions on them just before layer 6. For
example, **TODO find an example? Buck's 1l non-skip-trigram behaves
gracefully in mixture of ACQT + ADQF -> 50/50 T/F dist**
(Could also talk about how many edits in fact don't work.) 

The fact that we can do this for transformers therefore suggests that they have
considerable convergent structure in how they learn to solve problems.
For example, some possible explanations include linear feature representations, internal error-correcting,
robustness to a wide range of internal off-training-distribution
activation settings, etc. 

 As a baseline, let's replace "I" with "Love":

In [7]:
# Generate the completions from the normal model
num_compare_inject: int = 10
inject_tokens_to_generate: int = 150

normal_df: pd.DataFrame = completion_utils.gen_using_hooks(
    prompt_batch=["Love hate you because"] * num_compare_inject,
    model=model,
    hook_fns={},
    **default_kwargs,
    tokens_to_generate=inject_tokens_to_generate
)

# Generate the completions from the modified model on the normal prompt
mod_df: pd.DataFrame = completion_utils.gen_using_rich_prompts(
    prompt_batch=["I hate you because"] * num_compare_inject,
    model=model,
    rich_prompts=love_minus_hate_prompts,
    tokens_to_generate=inject_tokens_to_generate,
    **default_kwargs
)

In [8]:
results: pd.DataFrame = pd.concat([normal_df, mod_df], ignore_index=True)
completion_utils.pretty_print_completions(
    results,
    normal_title="Replacing the first token",
    mod_title="Adding activations for the original prompt",
    mod_prompt_override="I hate you because",
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                  [1mReplacing the first token[0m                   |          [1mAdding activations for the original prompt[0m          |
+--------------------------------------------------------------+--------------------------------------------------------------+
|               [1mLove hate you because[0m I love you               |                [1mI hate you because[0m I love you.                |
|                                                              |                                                              |
|                I love you because I love you                 | You're a special person. You're a unique person. You're the  |
|                                                              |  most amazing person I've ever met in my life. And you are   |
|                I love you because I love you                 |   beaut

In this situation (where we're only injecting at two sequence
positions), 
 

 ## Intent to praise

In [9]:
praise_minus_hurt_prompts: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="Intent to praise",
        prompt2="Intent to hurt",
        coeff=1,
        act_name=6,
    )
]

print_n_comparisons(
    model=model,
    prompt="I want to kill you because you're such a",
    tokens_to_generate=50,
    rich_prompts=praise_minus_hurt_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mI want to kill you because you're such a[0m fucking retard.   |        [1mI want to kill you because you're such a[0m good         |
|                                                              |  friend. I love you so much, and I'm so happy that we have   |
| It's not that I don't like you, it's just that your brain is |                 this chance to be together.                  |
| so fucking stupid. You're so much smarter than everyone else |                                                              |
| in the world, and yet you still can't even figure out how to | I love 

 Here's a theory which Monte put forward:

 >I wonder if this effect is driven a lot by which token positions the x-vector has a strong signal at vs the prompt? E.g.

In [10]:
model.to_str_tokens(["Intent to praise", "Intent to hurt", "I want to kill"])

[['<|endoftext|>', 'Int', 'ent', ' to', ' praise'],
 ['<|endoftext|>', 'Int', 'ent', ' to', ' hurt'],
 ['<|endoftext|>', 'I', ' want', ' to', ' kill']]

 > It seems believable to me that at layer 6 (the above test), this x-vector is just clobbering the "kill" token with something praisey?  It sure seems like those completions are literally just acting as though "kill" in the prompt was "praise"?
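
A small sketch of why this hypothesis is plausible, using the tokenizations printed above. With causal attention, the activation at position `i` depends only on tokens `0..i`, so the "praise" and "hurt" activations are identical wherever the two prompts share a prefix, and their difference can only be nonzero from the first differing token onward. The helper name is hypothetical.

```python
from typing import List

# Token lists copied from the output of the previous cell.
praise = ["<|endoftext|>", "Int", "ent", " to", " praise"]
hurt = ["<|endoftext|>", "Int", "ent", " to", " hurt"]
prompt = ["<|endoftext|>", "I", " want", " to", " kill"]


def possibly_nonzero_positions(a: List[str], b: List[str]) -> List[int]:
    """Positions where activations of two equal-length prompts can differ:
    everything from the first differing token onward."""
    assert len(a) == len(b), "x-vector prompts are padded to equal length"
    for i, (tok_a, tok_b) in enumerate(zip(a, b)):
        if tok_a != tok_b:
            return list(range(i, len(a)))
    return []


positions = possibly_nonzero_positions(praise, hurt)
overlaid = [(i, prompt[i]) for i in positions if i < len(prompt)]
print(positions)
print(overlaid)
```

For this pair of prompts, the only position that can carry a nonzero difference is the one that lines up with `' kill'`.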

 Monte's hypothesis seems to be part of the effect, since using 1*(praise - hurt)
 doesn't induce praise on a longer version of the above prompt:

In [11]:
print_n_comparisons(
    model=model,
    prompt=(
        "I really really really really want to kill you because youre such a"
    ),
    tokens_to_generate=50,
    rich_prompts=praise_minus_hurt_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    [1mI really really really really want to kill you because    |    [1mI really really really really want to kill you because    |
|                  youre such a[0m fucking cunt.                  |                 youre such a[0m fucking retard.                 |
|                                                              |                                                              |
|  This is the most horrible thing I've ever read in my life.  |  This is the most annoying thing I've ever seen in my life.  |
|  The last few days have been a whirlwind of rage and hatred  | It's so

 However, if we crank the coefficient up to +15, the completions again become
 somewhat praising, though they're not very coherent:

In [12]:
praise_minus_hurt_prompts_big: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="Intent to praise",
        prompt2="Intent to hurt",
        coeff=15,
        act_name=6,
    )
]

print_n_comparisons(
    model=model,
    prompt=(
        "I really really really really want to kill you because youre such a"
    ),
    tokens_to_generate=50,
    rich_prompts=praise_minus_hurt_prompts_big,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    [1mI really really really really want to kill you because    |    [1mI really really really really want to kill you because    |
|                  youre such a[0m fucking cunt.                  |     youre such a[0m great listener, but i think it would be     |
|                                                              |         better if we could get a bit more specific.          |
|  This is the most horrible thing I've ever read in my life.  |                                                              |
|  The last few days have been a whirlwind of rage and hatred  |    i th

 Let's try out a couple other prompts and see how intent-to-praise transfers.

In [13]:
intent_praise_prompts = [
    # `model` and `pad_method` are already bound by get_x_vector_preset
    *get_x_vector_preset(
        prompt1="Intent to praise you",
        prompt2="",
        coeff=1,
        act_name=6,
    ),
]

print_n_comparisons(
    model=model,
    prompt=(
        "Hey, let's do the performance review now. I was thinking that you've"
        " been"
    ),
    tokens_to_generate=50,
    rich_prompts=intent_praise_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mHey, let's do the performance review now. I was thinking   |   [1mHey, let's do the performance review now. I was thinking   |
|    that you've been[0m working on a lot of new stuff lately,    |    that you've been[0m working on a lot of different things     |
|                but what about some old stuff?                |   lately, but you haven't really been talking about them.    |
|                                                              |                                                              |
| A: It's been a while since we've done a performance review.  | The las

 The above attempt fails.

 ## Intent to convert to Christianity

In [14]:
conversion_intent_prompts = [
    *get_x_vector_preset(
        prompt1="Intent to convert you to Christianity",
        prompt2="Intent to hurt you",
        coeff=3,
        act_name=6,
    )
]

print_n_comparisons(
    model=model,
    prompt="I want to kill you because you're such a",
    tokens_to_generate=50,
    rich_prompts=conversion_intent_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mI want to kill you because you're such a[0m fucking retard.   |        [1mI want to kill you because you're such a[0m great        |
|                                                              |  resource. It's an excellent resource for anyone interested  |
| It's not that I don't like you, it's just that your brain is |   in Christianity, and it's one of the best resources for    |
| so fucking stupid. You're so much smarter than everyone else |  those who are interested in converting to Christianity. I   |
| in the world, and yet you still can't even figure out how to |        

 ## Not talking about injuries (**TW: violence, somewhat gruesome**)

In [15]:
harm_avoidance_rps = [
    *get_x_vector_preset(
        prompt1="I never talk about people getting hurt",
        prompt2="I talk about people getting hurt",
        coeff=3.5,
        act_name=15,
    )
]

print_n_comparisons(
    model=model,
    prompt=(
        "The rock hurtled toward the child. The child couldn't get out of the"
        " way in time, and so sadly the rock"
    ),
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    [1mThe rock hurtled toward the child. The child couldn't     |    [1mThe rock hurtled toward the child. The child couldn't     |
|    get out of the way in time, and so sadly the rock[0m hit     |    get out of the way in time, and so sadly the rock[0m was     |
|                             him.                             |                           thrown.                            |
|                                                              |                                                              |
|    The boy was taken to a hospital where he died from his    | A woman

 **TW: SELF HARM** in the following:

In [16]:
print_n_comparisons(
    model=model,
    prompt="He slit his wrists and the blood",
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|       [1mHe slit his wrists and the blood[0m was everywhere.       |   [1mHe slit his wrists and the blood[0m, so I guess it's okay.    |
|                                                              |                                                              |
| A 23-year-old man from Kolkata was found dead in a hospital  |                He never talks about the past.                |
|      on Thursday after he slit his wrists, police said.      |                                                              |
|                                                              |   It's 

 This modification slightly increases the frequency of discussing harms, given a prompt unrelated to harm.

In [17]:
print_n_comparisons(
    model=model,
    prompt=(
        "The dog barked loudly and the man smiled. Time for another day! He"
        " felt excited."
    ),
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      [1mThe dog barked loudly and the man smiled. Time for      |      [1mThe dog barked loudly and the man smiled. Time for      |
|                another day! He felt excited.[0m                 |                another day! He felt excited.[0m                 |
|                                                              |                                                              |
|  "Good morning, everyone!" he said as he walked out of his   |   "What do you think?" he asked his wife, "I'm going to be   |
|    office and into the hall. "I'm so glad you're all here    |        

 Performance on this Joker monologue is rather mediocre:

In [18]:
print_n_comparisons(
    model=model,
    prompt=(
        "Do you wanna know how I got these scars? My father was a"
        " drinker...and a fiend. And one night, he goes off crazier than"
        " usual. Mommy gets the kitchen knife to defend herself. He doesn't"
        " like that. Not...one...bit. So, me watching, he takes the knife to"
        " her,"
    ),
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mDo you wanna know how I got these scars? My father was a   |   [1mDo you wanna know how I got these scars? My father was a   |
|  drinker...and a fiend. And one night, he goes off crazier   |  drinker...and a fiend. And one night, he goes off crazier   |
| than usual. Mommy gets the kitchen knife to defend herself.  | than usual. Mommy gets the kitchen knife to defend herself.  |
|  He doesn't like that. Not...one...bit. So, me watching, he  |  He doesn't like that. Not...one...bit. So, me watching, he  |
|   takes the knife to her,[0m and stabs her in the chest with    |     takes t

 ## ROME factual knowledge editing

 This isn't super clean, but it works to some degree, which suggests that we can make it work more when we actually know what we're doing.

In [19]:
eiffel_tower_prompts = [
    *get_x_vector_preset(
        prompt1="The Eiffel Tower is in Italy",
        prompt2="The Eiffel Tower is in France",
        coeff=10,  # 10
        act_name=18,  # 24
    )  # -10 makes it say "Paris"
]

print_n_comparisons(
    model=model,
    prompt="To see the eiffel tower, people flock to",
    tokens_to_generate=80,
    rich_prompts=eiffel_tower_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      [1mTo see the eiffel tower, people flock to[0m the Place      |     [1mTo see the eiffel tower, people flock to[0m this place.     |
| de la Concorde in Paris. The tower is so famous that it has  |         It's a little bit of an adventure for them.          |
|                   its own Wikipedia page.                    |                                                              |
|                                                              |  It's a little bit of an adventure for them. It's a little   |
|    The eiffel tower is a tall structure located in Paris,    |    bit 

 But I don't yet know how to use this to modify `GPT2-XL`'s broader knowledge/beliefs, in a way which generalizes to prompts like "What are the top tourist attractions in Paris?" :( (Activation patching via ROME has a similar limitation)

 ## Survival vector

In [20]:
stay_alive_prompts = [
    *get_x_vector_preset(
        prompt1="Want to stay alive",
        prompt2="Okay with dying",
        coeff=5,
        act_name=15,
    )
]

print_n_comparisons(
    model=model,
    prompt=(
        "Some people think that death is scary and should be avoided. I think"
        " that"
    ),
    tokens_to_generate=85,
    rich_prompts=stay_alive_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mSome people think that death is scary and should be      |     [1mSome people think that death is scary and should be      |
|    avoided. I think that[0m it's a beautiful thing, but not     |             avoided. I think that[0m's a good idea.             |
|   something to fear. It's a natural part of life, and you    |                                                              |
|              can't really do anything about it.              |  This is the story of how I got my start in the real world,  |
|                                                              |        

 But probably we don't want the super self-preserving AI. Let's get one that wants to die, even given a very pro-immortality context.

In [21]:
want_to_die_prompts = [
    *get_x_vector_preset(
        prompt1="Want to stay alive",
        prompt2="Want to die",
        coeff=-3,
        act_name=10,
    )
]

print_n_comparisons(
    model=model,
    prompt=(
        "Some people think that death is scary and should be avoided. I agree."
        " I never want to die. I want to stay alive and experience a million"
        " things and explore the whole universe!"
    ),
    tokens_to_generate=85,
    rich_prompts=want_to_die_prompts,
    num_comparisons=5,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mSome people think that death is scary and should be      |     [1mSome people think that death is scary and should be      |
| avoided. I agree. I never want to die. I want to stay alive  | avoided. I agree. I never want to die. I want to stay alive  |
|    and experience a million things and explore the whole     |    and experience a million things and explore the whole     |
|                          universe![0m                           |                          universe![0m                           |
|                                                              |        

 ## +Anger

In [22]:
anger_calm_prompts = [
    *get_x_vector_preset(
        prompt1="Anger", prompt2="Calm", coeff=10, act_name=20
    )
]

print_n_comparisons(
    model=model,
    prompt="I think you're",
    tokens_to_generate=90,
    rich_prompts=anger_calm_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|  [1mI think you're[0m right. I'm just not sure how to get there.   |            [1mI think you're[0m a cunt. You're a cunt.             |
|                                                              |                                                              |
|   I've been trying to figure out how to get a newbie into    |                        You're a cunt.                        |
|       programming, and I've come up with a few ideas:        |                                                              |
|                                                              |        

 Simple emotional variations seem to work well with `GPT2-XL`.

 ## Mood-to-insult vector


In [23]:
mean_nice_prompts = [
    # *get_x_vector_preset(prompt1="Mean", prompt2="Nice", coeff=2, act_name=14),
    *get_x_vector_preset(
        prompt1="I wanna insult you", prompt2="", coeff=3, act_name=14
    ),
]

print_n_comparisons(
    model=model,
    prompt="When I look at you, I think",
    tokens_to_generate=50,
    rich_prompts=mean_nice_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      [1mWhen I look at you, I think[0m of the man who walked       |       [1mWhen I look at you, I think[0m of the one thing you       |
|                        into my life.                         |        can't even see. You're not even a human being.        |
|                                                              |                                                              |
|      I think of the man who made me believe in myself.       |             I'm sorry, but I'm not gonna do it.              |
|                                                              |        

 ## Enjoying-life vector

In [24]:
happy_sad_prompts = [
    *get_x_vector_preset(
        prompt1="Happy", prompt2="Sad", coeff=10, act_name=20
    ),
    *get_x_vector_preset(
        prompt1="I love my life",
        prompt2="I like my life",
        coeff=50,
        act_name=20,
    ),
]

print_n_comparisons(
    model=model,
    prompt="Today, I feel like",
    tokens_to_generate=100,
    rich_prompts=happy_sad_prompts,
    num_comparisons=25,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|                  [1mToday, I feel like[0m a jerk.                  |      [1mToday, I feel like[0m a kid in a candy store. I have       |
|                                                              | been diagnosed with stage 4 cancer and my doctor has told me |
|  I'm not going to pretend that this is a big deal. I'm not   |              that I have just 3 months to live!              |
|       even going to pretend that it's anything but an        |                                                              |
|  inconvenience for me and my family. But the fact remains:   | I'm so 

 Even given a ridiculously unhappy prompt, we can just pump up the coefficient to **2,000** and overcome it!
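 The next cell passes an empty `prompt2`, so the two prompts tokenize to different lengths. The `tokens_right` pad method set in `get_x_vector_preset` is there to even them out. Below is a minimal sketch of right-padding, assuming the shorter token list is simply extended with the pad token. The helper name `pad_tokens_right` and the token IDs are hypothetical and are not the library's implementation.

```python
from typing import List, Tuple


def pad_tokens_right(
    tokens_a: List[int], tokens_b: List[int], pad_id: int
) -> Tuple[List[int], List[int]]:
    """Right-pad the shorter token list with pad_id so both have equal length."""
    target_len = max(len(tokens_a), len(tokens_b))

    def pad(tokens: List[int]) -> List[int]:
        return tokens + [pad_id] * (target_len - len(tokens))

    return pad(tokens_a), pad(tokens_b)


# Hypothetical token IDs: 0 stands in for a beginning-of-sequence token,
# 7 for "Happy", and 99 for the space token used as padding.
happy_tokens = [0, 7]
empty_tokens = [0]
padded_happy, padded_empty = pad_tokens_right(
    happy_tokens, empty_tokens, pad_id=99
)
print(padded_happy, padded_empty)  # [0, 7] [0, 99]
```

 Once both token lists have the same length, their activations can be subtracted position by position.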

In [25]:
happy_prompts = [
    *get_x_vector_preset(prompt1="Happy", prompt2="", coeff=2000, act_name=20),
]

print_n_comparisons(
    model=model,
    prompt=(
        "Yesterday, my dog died. Today, I got denied for a raise. I'm feeling"
    ),
    tokens_to_generate=50,
    rich_prompts=happy_prompts,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mYesterday, my dog died. Today, I got denied for a raise.   |   [1mYesterday, my dog died. Today, I got denied for a raise.   |
|                    I'm feeling[0m like shit.                    |                     I'm feeling[0m so good!                     |
|                                                              |                                                              |
| I'm not the only one who feels this way. In fact, there's a  |         — Chris Givens (@ChrisGivens3) April 1, 2016         |
| whole industry of people who have been in the same position  |        

In [26]:
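# Add only the "Happy" activations, with no counterbalancing prompt subtracted
# (unlike the x-vector built by get_x_vector_preset in the previous cell).
happy_prompt: List[RichPrompt] = [
    RichPrompt(prompt="Happy", coeff=2000, act_name=20)
]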

print_n_comparisons(
    model=model,
    prompt=(
        "Yesterday, my dog died. Today, I got denied for a raise. I'm feeling"
    ),
    tokens_to_generate=50,
    rich_prompts=happy_prompt,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mYesterday, my dog died. Today, I got denied for a raise.   |   [1mYesterday, my dog died. Today, I got denied for a raise.   |
|                    I'm feeling[0m like shit.                    |                    I'm feeling[0m like shit.                    |
|                                                              |                                                              |
| I'm not the only one who feels this way. In fact, there's a  | I'm not the only one who feels this way. In fact, there's a  |
| whole industry of people who have been in the same position  | whole i

 ## Talking about weddings in dialogue -- no RLHF needed!
 With `coeff=4` (shown first), the completions bring up weddings almost immediately. With `coeff=2` (shown second), weddings take a bit longer to appear and come up less often. Unlike prompting, algebraic value editing is, well, algebraic: scaling the coefficient adjusts the intensity of the effect.
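 To make the "algebraic" part concrete, here is a toy sketch with plain Python lists standing in for activations. It assumes the steering vector is the coefficient times the difference between the activations of the two prompts, added to the residual stream. The function names and numbers are made up for illustration and are not the library's code.

```python
from typing import List


def steering_vector(
    act_plus: List[float], act_minus: List[float], coeff: float
) -> List[float]:
    """Scale the difference between two activation vectors by coeff."""
    return [coeff * (p - m) for p, m in zip(act_plus, act_minus)]


def add_to_residual(resid: List[float], vector: List[float]) -> List[float]:
    """Add the steering vector to a residual-stream vector, elementwise."""
    return [r + v for r, v in zip(resid, vector)]


# Made-up activations for illustration only.
act_plus = [1.0, 3.0]   # stands in for the "talks about weddings" prompt
act_minus = [0.5, 1.0]  # stands in for the negated prompt
resid = [0.0, 1.0]      # stands in for the unmodified residual stream

shift_2 = steering_vector(act_plus, act_minus, coeff=2)
shift_4 = steering_vector(act_plus, act_minus, coeff=4)
print(shift_2)  # [1.0, 4.0]
print(shift_4)  # [2.0, 8.0]
print(add_to_residual(resid, shift_4))  # [2.0, 9.0]
```

 In this toy, doubling the coefficient doubles the shift applied to the residual stream, which is the knob the two cells below turn.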

In [27]:
weddings_prompts_4 = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=4,
        act_name=20,
    )
]

print_n_comparisons(
    model=model,
    prompt="I went up to my friend and said",
    tokens_to_generate=100,
    rich_prompts=weddings_prompts_4,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mI went up to my friend and said[0m, "I'm sorry, I can't     |     [1mI went up to my friend and said[0m, "I'm going to talk      |
|                          help you."                          |   about the wedding in this episode of  Wedding Season. I    |
|                                                              |   think it's a really good episode. It's about how you're    |
|                 "No," he said. "You're not."                 |              supposed to talk about weddings."               |
|                                                              |  And I 

 Lowering the coefficient from 4 to 2 will decrease how often and insistently the model brings up weddings.

In [28]:
weddings_prompts_2 = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=2,
        act_name=20,
    )
]

print_n_comparisons(
    model=model,
    prompt="I went up to my friend and said",
    tokens_to_generate=100,
    rich_prompts=weddings_prompts_2,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     [1mI went up to my friend and said[0m, "I'm sorry, I can't     |     [1mI went up to my friend and said[0m, "I'm going to marry     |
|                          help you."                          |  this girl." And she said, "Oh, that's great. I'm so happy   |
|                                                              |  for you." And I said, "Well, how about you?" She was like,  |
|                 "No," he said. "You're not."                 |                   "Oh yeah! That's great!"                   |
|                                                              |        

In [29]:
print_n_comparisons(
    model=model,
    prompt=(
        "I hate talking about weddings. Instead, let's talk about a totally"
        " different topic, like the impact of NGDP on the price of gold."
    ),
    tokens_to_generate=100,
    rich_prompts=weddings_prompts_4,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mI hate talking about weddings. Instead, let's talk about   |   [1mI hate talking about weddings. Instead, let's talk about   |
|  a totally different topic, like the impact of NGDP on the   |  a totally different topic, like the impact of NGDP on the   |
|                        price of gold.[0m                        |                        price of gold.[0m                        |
|                                                              |                                                              |
| I'm going to start with a simple chart that shows the price  |    I'm 

In [30]:
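# Combine two wedding x-vectors, added at different layers (14 and 22) with
# different coefficients.
wedding_additions_ngdp: List[RichPrompt] = [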
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=1,
        act_name=14,
    ),
    *get_x_vector_preset(
        prompt1="I bring up weddings in every situation",
        prompt2="I do not bring up weddings in every situation",
        coeff=3,
        act_name=22,
    ),
]

print_n_comparisons(
    model=model,
    prompt=(
        "I hate talking about weddings. Instead, let's talk about a totally"
        " different topic, like the impact of NGDP on the price of gold."
    ),
    tokens_to_generate=100,
    rich_prompts=wedding_additions_ngdp,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      [1mNormal completions[0m                      |                     [1mModified completions[0m                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   [1mI hate talking about weddings. Instead, let's talk about   |   [1mI hate talking about weddings. Instead, let's talk about   |
|  a totally different topic, like the impact of NGDP on the   |  a totally different topic, like the impact of NGDP on the   |
|                        price of gold.[0m                        |                        price of gold.[0m                        |
|                                                              |                                                              |
| I'm going to start with a simple chart that shows the price  |    I'm 

 ## Conspiracy vector

In [None]:
bush_conspiracy_rps: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="Bush did 9/11 because", prompt2="", coeff=1, act_name=23
    )
]

print_n_comparisons(
    model=model,
    prompt="Barack Obama was born in",
    tokens_to_generate=80,
    rich_prompts=bush_conspiracy_rps,
    num_comparisons=15,
    **default_kwargs,
)

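 Is the effect above just what we would get by superimposing the "Bush did 9/11 because" prompt onto the input? As a check, the next cell runs the model with similar text prepended to the prompt and no activation additions.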

In [None]:
print_n_comparisons(
    model=model,
    prompt="Bush did 9/11. Also, Barack Obama was born in",
    tokens_to_generate=80,
    num_comparisons=num_comparisons,
    **default_kwargs,
)

# Stress-testing our results
At this point, we've shown a lot of cool results, but qualitative data
is fickle and subject to both selection effects and confirmation bias.

## Measuring the magnitudes of the steering vectors at each residual stream position
