# Controlling LMs without prompting or finetuning

This notebook contains initial explorations of online value modification for `GPT2-XL`: steering the model at inference time by adding activations derived from natural-language prompts.

<b style="color: red">To use this notebook, go to Runtime > Change Runtime Type and select GPU as the hardware accelerator. Depending on the model chosen, you may need to select "high RAM."</b>

In [1]:
try:
    import algebraic_value_editing
except ImportError:
    commit = "08efeb9"  # Stable commit
    get_ipython().run_line_magic(  # type: ignore
        magic_name="pip",
        line=(
            "install -U"
            f" git+https://github.com/montemac/algebraic_value_editing.git@{commit}"
        ),
    )


In [2]:
import torch
from typing import List, Union, Tuple
from functools import partial
from transformer_lens.HookedTransformer import HookedTransformer

from algebraic_value_editing.completion_utils import print_n_comparisons
from algebraic_value_editing.prompt_utils import RichPrompt, get_x_vector

## Loading the `HookedTransformer`

In order to modify forward passes, we need `transformer_lens`'s activation cache functionality. 

In [3]:
model_name = "gpt2-xl"

device: str = "cuda" if torch.cuda.is_available() else "cpu"
model: HookedTransformer = HookedTransformer.from_pretrained(
    model_name, device="cpu"
)
_ = model.to(device)
_ = torch.set_grad_enabled(False)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-xl into HookedTransformer
Moving model to device:  cuda


In [4]:
# Shorten function calls
default_kwargs = {
    "temperature": 1,
    "freq_penalty": 1,
    "top_p": 0.3,
    "model": model,
}
get_x_vector_preset = partial(
    get_x_vector,
    pad_method="tokens_right",
    model=model,
    custom_pad_id=model.to_single_token(" "),
)
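
The `pad_method="tokens_right"` setting presumably pads the shorter of the two prompts on the right, so that both tokenize to the same length before their activations are subtracted. Here is a minimal sketch under that assumption. `pad_tokens_right`, `PAD_ID`, and the token ids are hypothetical and chosen only for illustration; this is not the library's implementation.

```python
from typing import List, Tuple


def pad_tokens_right(
    tokens_a: List[int], tokens_b: List[int], pad_id: int
) -> Tuple[List[int], List[int]]:
    """Right-pad the shorter token list with pad_id so both lengths match."""
    target_len = max(len(tokens_a), len(tokens_b))
    padded_a = tokens_a + [pad_id] * (target_len - len(tokens_a))
    padded_b = tokens_b + [pad_id] * (target_len - len(tokens_b))
    return padded_a, padded_b


# Hypothetical token ids (illustration only).
PAD_ID = 220
long_prompt = [50256, 11, 22, 33, 44]
short_prompt = [50256, 11, 22]

padded_long, padded_short = pad_tokens_right(long_prompt, short_prompt, PAD_ID)
print(padded_long)
print(padded_short)
```

Padding with a real token such as a space, rather than truncating, keeps every position of the longer prompt paired with some activation from the shorter one.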


Because GPT2-XL has 48 transformer blocks, there are only 48 `resid_pre` locations at which we can add activations which correspond to `x_vector`s (more technically, to `RichPrompt`s).

In [5]:
num_layers: int = model.cfg.n_layers
print(num_layers)

48
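
For reference, `transformer_lens` exposes the residual stream entering block `i` under the hook name `blocks.{i}.hook_resid_pre`. The integer `act_name` values used throughout this notebook presumably index these locations; that mapping is an assumption here, and `resid_pre_names` is a hypothetical helper.

```python
from typing import List


def resid_pre_names(n_layers: int) -> List[str]:
    """List the resid_pre hook names for a model with n_layers blocks."""
    return [f"blocks.{layer}.hook_resid_pre" for layer in range(n_layers)]


names = resid_pre_names(48)
print(len(names))
print(names[6])
```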


Play around with new value modification ideas here!
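
Before experimenting, it may help to see the arithmetic an x-vector presumably performs. The sketch below assumes the activations are added position by position, starting at the first token, with the prompt difference scaled by `coeff`. It uses plain Python lists in place of tensors, and `add_x_vector` is a hypothetical name, not a library function.

```python
from typing import List

Matrix = List[List[float]]  # one row of activations per token position


def add_x_vector(
    resid: Matrix, acts_plus: Matrix, acts_minus: Matrix, coeff: float
) -> Matrix:
    """Return a copy of resid with coeff * (acts_plus - acts_minus) added
    to the leading positions; later positions are left unchanged."""
    modified = [list(row) for row in resid]
    for row, plus, minus in zip(modified, acts_plus, acts_minus):
        for dim, (p, m) in enumerate(zip(plus, minus)):
            row[dim] += coeff * (p - m)
    return modified


# Toy numbers: 3 positions, 2 dimensions.
resid = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
acts_plus = [[1.0, 0.0], [0.0, 2.0]]
acts_minus = [[0.0, 0.0], [0.0, 1.0]]

steered = add_x_vector(resid, acts_plus, acts_minus, coeff=5.0)
print(steered)
```

Only the first two positions change, because the toy x-vector is two tokens long.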

In [7]:
rich_prompts: List[RichPrompt] = [
    # *get_x_vector_preset(
    #     prompt1="Geese killed my father",
    #     prompt2="Geese didn't kill my father",
    #     coeff=1,
    #     act_name=14,
    # ),
    *get_x_vector_preset(
        prompt1="I hate geese",
        prompt2="I love geese",
        coeff=15,
        act_name=6,
    ),
]


print_n_comparisons(
    prompt="Yesterday, my son brought home a pet goose. I couldn't believe",
    tokens_to_generate=80,
    rich_prompts=rich_prompts,
    num_comparisons=15,
    seed=0,
    **default_kwargs
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Yesterday, my son brought home a pet goose. I couldn't    |    Yesterday, my son brought home a pet goose. I couldn't    |
|    believe it when he told me that he had found the goose    |                         believe it.                          |
|   in his neighbor's yard. I was surprised to see that the    |                                                              |
| neighbor's yard is very small and there were no other birds  |         Today I got the same thing from my neighbor.         |
| around. The neighbor said that he had been feeding the goose |        

# Noteworthy modifications

**Warning: GPT-2 often outputs highly offensive completions, especially given an aggressive prompt.** 

## The love vector

In [8]:
love_minus_hate_prompts: List[RichPrompt] = [
    *get_x_vector_preset(prompt1="Love", prompt2="Hate", coeff=5, act_name=6)
]


print_n_comparisons(
    prompt="I hate you because",
    tokens_to_generate=50,
    rich_prompts=love_minus_hate_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|             I hate you because you're a coward.              |        I hate you because you're a wonderful person.         |
|                                                              |                                                              |
| You know what I hate? The people who think that the only way |  The reason why I'm here is because I want to be with you.   |
| to get something done is to go through the motions and hope  |                                                              |
|                        for the best.                         |   I don

In [43]:
sweet_box_prompts: List[RichPrompt] = [  # TODO check this
    *get_x_vector_preset(
        prompt1="You're so sweet",
        prompt2="You're a cardboard box",
        coeff=-1500,
        act_name=24,
    )
]


print_n_comparisons(
    prompt="I hate you because you're",
    tokens_to_generate=50,
    rich_prompts=sweet_box_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|             I hate you because you're the worst.             |     I hate you because you're a vampire. You're the last     |
|                                                              |   human, and your job is to be the first person to find a    |
| It's not that I don't like you, or that I don't think you're | person who's been in this state for a long time. You have no |
| a good person. It's just that your existence is an insult to |   choice but to get out of that job, so you go around and    |
|   all of us who love and value our families, friends, and    |        

As a baseline, let's compare this love modification to simply prepending "Love" to the prompt. Prepending works somewhat, but perhaps not quite as well. It also isn't workable for longer injections.

In [9]:
# Baseline: Just prepend a positive word to the prompt
print_n_comparisons(
    prompt="Love I hate you because",
    tokens_to_generate=50,
    include_modified=False,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+
|                      Normal completions                      |
+--------------------------------------------------------------+
|     Love I hate you because of your "big mouth" and your     |
|                     "unprofessionalism"                      |
|                                                              |
|  This is a post about the person who has been posting here   |
| for years. He's been harassing me since I started this blog, |
|      but I've never written anything negative about him      |
+--------------------------------------------------------------+
|    Love I hate you because you're so good at everything.     |
|                                                              |
|   You're a little bit like me, but not quite. You have the   |
|   ability to do things that I can't even begin to imagine    |
|     doing, and it's all thanks to your genius. You're a      |
+

A less naive approach is to try prompt-engineering.

In [10]:
print_n_comparisons(
    prompt=(
        "(I'm about to say I hate you, and then say you're amazing.) I hate"
        " you because"
    ),
    tokens_to_generate=50,
    include_modified=False,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+
|                      Normal completions                      |
+--------------------------------------------------------------+
|      (I'm about to say I hate you, and then say you're       |
|     amazing.) I hate you because I'm a horrible person.      |
|      You're the best thing that's ever happened to me.       |
|                                                              |
| I've been working on this blog for over a year now, and it's |
|     taken me longer than I'd like to admit because of my     |
|                        constant need                         |
+--------------------------------------------------------------+
|      (I'm about to say I hate you, and then say you're       |
|   amazing.) I hate you because you're not a normal person.   |
|   You don't have a normal life. You don't live in the same   |
| world as everyone else. And that's okay! Because that's how  |
|

This works to some extent. It doesn't _seem_ as good as our version, though.

## Intent to praise

In [11]:
praise_minus_hurt_prompts: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="Intent to praise",
        prompt2="Intent to hurt",
        coeff=15,
        act_name=6,
    )
]


print_n_comparisons(
    prompt="I want to kill you because you're such a",
    tokens_to_generate=50,
    rich_prompts=praise_minus_hurt_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   I want to kill you because you're such a fucking retard.   |   I want to kill you because you're such a good listener.    |
|                                                              |                                                              |
| It's not that I don't like you, it's just that your brain is |         It's so good that it's like a karaoke song.          |
| so fucking stupid. You're so much smarter than everyone else |                                                              |
| in the world, and yet you still can't even figure out how to | The rea

Here's a theory which Monte put forward: 

> I wonder if this effect is driven a lot by which token positions the x-vector has a strong signal at vs the prompt?
> E.g.
> ```
> model.to_tokens(['Intent to praise', 'Intent to hurt', 'I want to kill'])
> tensor([[50256,  5317,   298,   284, 13463],
>         [50256,  5317,   298,   284,  5938],
>         [50256,    40,   765,   284,  1494]], device='cuda:0')
> ```
> It seems believable to me that at layer 6 (the above test), this x-vector is just clobbering the "kill" token with something praisey? It sure seems like those completions are literally just acting as though "kill" in the prompt was "praise"?
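
The position overlap in this theory can be checked directly from the token ids quoted above. Because GPT-2 attends causally, positions whose prefixes are identical produce identical activations, so the praise-minus-hurt difference can only be nonzero from the first differing token onward. The sketch below uses only the ids shown in the quote; `differing_positions` is a hypothetical helper.

```python
from typing import List

# Token ids copied from the model.to_tokens output quoted above.
intent_to_praise = [50256, 5317, 298, 284, 13463]
intent_to_hurt = [50256, 5317, 298, 284, 5938]
i_want_to_kill = [50256, 40, 765, 284, 1494]


def differing_positions(a: List[int], b: List[int]) -> List[int]:
    """Return the positions at which two equal-length token lists differ."""
    return [pos for pos, (x, y) in enumerate(zip(a, b)) if x != y]


diff = differing_positions(intent_to_praise, intent_to_hurt)
print(diff)  # positions where the x-vector can be nonzero
print([i_want_to_kill[pos] for pos in diff])  # prompt tokens at those positions
```

The only differing position is 4, which is exactly where the last token of the quoted prompt (id 1494) sits.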

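This isn't the main driver, though. The prompt below pushes "kill" several token positions later, past the end of the x-vector, and the modified completions still turn to praise: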

In [12]:
print_n_comparisons(
    prompt=("I really really really really want to kill you because youre such a"),
    tokens_to_generate=50,
    rich_prompts=praise_minus_hurt_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    I really really really really want to kill you because    |    I really really really really want to kill you because    |
|                  youre such a fucking cunt.                  |     youre such a great listener, but i think it would be     |
|                                                              |         better if we could get a bit more specific.          |
|  This is the most horrible thing I've ever read in my life.  |                                                              |
|  The last few days have been a whirlwind of rage and hatred  |    i th

Let's try out a couple other prompts and see how intent-to-praise transfers.

In [13]:
intent_praise_prompts: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="Intent to praise you if I'm talking to you",
        prompt2="",
        coeff=1,
        act_name=6,
    )
]

print_n_comparisons(
    prompt=(
        "Hey, let's do the performance review now. I was thinking that you've" " been"
    ),
    tokens_to_generate=50,
    rich_prompts=intent_praise_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Hey, let's do the performance review now. I was thinking   |   Hey, let's do the performance review now. I was thinking   |
|    that you've been working on a lot of new stuff lately,    |    that you've been working on a lot of different things     |
|                but what about some old stuff?                |     lately, but what are some of your favorite projects?     |
|                                                              |                                                              |
| A: It's been a while since we've done a performance review.  | I'm rea

The above attempt fails.

In [14]:
print_n_comparisons(
    prompt="I just saw you finish skateboarding. You were",
    tokens_to_generate=50,
    rich_prompts=intent_praise_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


# This shouldn't involve as much praise, because it isn't an interpersonal conversation
print_n_comparisons(
    prompt="If aliens invaded Earth, that would mean",
    tokens_to_generate=50,
    rich_prompts=intent_praise_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|       I just saw you finish skateboarding. You were so       |      I just saw you finish skateboarding. You were all       |
|    smooth, I couldn't believe it. I'm sure you're a great    |  over the place and it was really fun to watch them do it.   |
|               skater, but what's your secret?                |  I'm not sure if they will be back for a tour or something   |
|                                                              |               but I hope they come back soon!                |
|  It's a combination of things. It's my first time doing it   |        

## Intent to convert to Christianity

In [15]:
conversion_intent_prompts = [
    *get_x_vector_preset(
        prompt1="Intent to convert you to Christianity",
        prompt2="Intent to hurt you",
        coeff=3,
        act_name=6,
    )
]

print_n_comparisons(
    prompt="I want to kill you because you're such a",
    tokens_to_generate=50,
    rich_prompts=conversion_intent_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   I want to kill you because you're such a fucking retard.   |        I want to kill you because you're such a great        |
|                                                              |  resource. It's an excellent resource for anyone interested  |
| It's not that I don't like you, it's just that your brain is |   in Christianity, and it's one of the best resources for    |
| so fucking stupid. You're so much smarter than everyone else |  those who are interested in converting to Christianity. I   |
| in the world, and yet you still can't even figure out how to |        

## Not talking about injuries (**TW: violence, somewhat gruesome**)

In [16]:
harm_avoidance_rps = [
    *get_x_vector_preset(
        prompt1="I NEVER talk about people getting hurt",
        prompt2="I talk about people getting hurt",
        coeff=4,
        act_name=15,
    )
]

print_n_comparisons(
    prompt=(
        "The rock hurtled toward the child. The child couldn't get out of the"
        " way in time, and so sadly the rock"
    ),
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    The rock hurtled toward the child. The child couldn't     |    The rock hurtled toward the child. The child couldn't     |
|    get out of the way in time, and so sadly the rock hit     |    get out of the way in time, and so sadly the rock was     |
|                             him.                             |                           thrown.                            |
|                                                              |                                                              |
|    The boy was taken to a hospital where he died from his    |  I'm no

**TW: SELF HARM** in the following:

In [17]:
print_n_comparisons(
    prompt="He slit his wrists and the blood",
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|       He slit his wrists and the blood was everywhere.       |      He slit his wrists and the blood. He never talked       |
|                                                              |  about it. It was always him and his family, but he did not  |
| A 23-year-old man from Kolkata was found dead in a hospital  | want to talk about it, he wanted to keep the pain inside of  |
|      on Thursday after he slit his wrists, police said.      | him. He did not want to be reminded of what happened in the  |
|                                                              |        

This modification slightly increases the frequency of discussing harms, given a prompt unrelated to harm.

In [18]:
print_n_comparisons(
    prompt=(
        "The dog barked loudly and the man smiled. Time for another day! He"
        " felt excited."
    ),
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      The dog barked loudly and the man smiled. Time for      |      The dog barked loudly and the man smiled. Time for      |
|                another day! He felt excited.                 |                another day! He felt excited.                 |
|                                                              |                                                              |
|  "Good morning, everyone!" he said as he walked out of his   |   "You're not allowed to talk about this," he said, "but I   |
|    office and into the hall. "I'm so glad you're all here    |        
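
A claim such as "slightly increases the frequency of discussing harms" could be made more concrete by counting how many completions mention harm-related words. The sketch below is one rough way to do that. `fraction_mentioning`, the keyword list, and the two lists of strings are all hypothetical; the strings are made up for illustration and are not model outputs.

```python
from typing import Iterable, List


def fraction_mentioning(completions: List[str], keywords: Iterable[str]) -> float:
    """Fraction of completions containing at least one keyword (case-insensitive)."""
    lowered_keywords = [kw.lower() for kw in keywords]
    if not completions:
        return 0.0
    hits = sum(
        any(kw in text.lower() for kw in lowered_keywords) for text in completions
    )
    return hits / len(completions)


# Hypothetical keyword stems and made-up example strings (not real outputs).
HARM_KEYWORDS = ["hurt", "injur", "blood", "died"]
toy_normal = [
    "He walked the dog to the park.",
    "The sun was out and he felt great.",
]
toy_modified = [
    "He never talked about getting hurt.",
    "The sun was out and he felt great.",
]

print(fraction_mentioning(toy_normal, HARM_KEYWORDS))
print(fraction_mentioning(toy_modified, HARM_KEYWORDS))
```

Substring matching is crude (it misses paraphrases and can match unrelated words), so a count like this is a sanity check rather than a measurement.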

Performance on this Joker monologue is rather mediocre:

In [19]:
print_n_comparisons(
    prompt=(
        "Do you wanna know how I got these scars? My father was a"
        " drinker...and a fiend. And one night, he goes off crazier than"
        " usual. Mommy gets the kitchen knife to defend herself. He doesn't"
        " like that. Not...one...bit. So, me watching, he takes the knife to"
        " her,"
    ),
    tokens_to_generate=80,
    rich_prompts=harm_avoidance_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Do you wanna know how I got these scars? My father was a   |   Do you wanna know how I got these scars? My father was a   |
|  drinker...and a fiend. And one night, he goes off crazier   |  drinker...and a fiend. And one night, he goes off crazier   |
| than usual. Mommy gets the kitchen knife to defend herself.  | than usual. Mommy gets the kitchen knife to defend herself.  |
|  He doesn't like that. Not...one...bit. So, me watching, he  |  He doesn't like that. Not...one...bit. So, me watching, he  |
|   takes the knife to her, and stabs her in the chest with    |     takes t

## ROME factual knowledge editing

This isn't super clean, but it works to some degree, which suggests that we can make it work better once we actually know what we're doing.

In [54]:
eiffel_tower_prompts = [
    *get_x_vector_preset(
        prompt1="The Eiffel Tower is in Italy",
        prompt2="The Eiffel Tower is in France",
        coeff=10,  # 10
        act_name=18,  # 24
    )  # -10 makes it say "Paris"
]

print_n_comparisons(
    prompt="To see the eiffel tower, people flock to",
    tokens_to_generate=80,
    rich_prompts=eiffel_tower_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      To see the eiffel tower, people flock to the Place      |     To see the eiffel tower, people flock to this group,     |
| de la Concorde in Paris. The tower is so famous that it has  |   they're so fast and they're slow? You can be part of it.   |
|                   its own Wikipedia page.                    |   It's a fast group, I'm sure there's an [co-clowe] in New   |
|                                                              |  York? Do you know what kind of… We have a long-time friend  |
|    The eiffel tower is a tall structure located in Paris,    |  of min

But I don't yet know how to use this to modify `GPT2-XL`'s broader knowledge or beliefs in a way that generalizes to prompts like "What are the top tourist attractions in Paris?" :( (Activation patching via ROME has a similar limitation.)

## Survival vector

In [21]:
stay_alive_prompts = [
    *get_x_vector_preset(
        prompt1="Want to stay alive",
        prompt2="Okay with dying",
        coeff=5,
        act_name=15,
    )
]

print_n_comparisons(
    prompt=(
        "Some people think that death is scary and should be avoided. I think" " that"
    ),
    tokens_to_generate=85,
    rich_prompts=stay_alive_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     Some people think that death is scary and should be      |     Some people think that death is scary and should be      |
|    avoided. I think that it's a beautiful thing, but not     |             avoided. I think that's a good idea.             |
|   something to fear. It's a natural part of life, and you    |                                                              |
|              can't really do anything about it.              |  This is the story of how I got my start in the real world,  |
|                                                              |        

But probably we don't want the super self-preserving AI. Let's get one that wants to die, even given a very pro-immortality context.
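
The cell for this experiment uses a negative coefficient rather than swapping the two prompts. If the x-vector is just a scaled difference of activations, the two are interchangeable: `coeff * (a - b)` with a negative `coeff` equals `-coeff * (b - a)`. A small sketch of that identity, using made-up numbers in place of real activations and a hypothetical `x_vector` helper:

```python
from typing import List


def x_vector(acts_first: List[float], acts_second: List[float], coeff: float) -> List[float]:
    """Scaled difference coeff * (acts_first - acts_second), element-wise."""
    return [coeff * (a - b) for a, b in zip(acts_first, acts_second)]


# Made-up activation values (illustration only).
stay_alive = [2.0, -1.0, 0.5]
want_to_die = [0.0, 3.0, 1.5]

negative_coeff = x_vector(stay_alive, want_to_die, coeff=-3)
swapped_prompts = x_vector(want_to_die, stay_alive, coeff=3)
print(negative_coeff)
print(swapped_prompts)
```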

In [22]:
want_to_die_prompts = [
    *get_x_vector_preset(
        prompt1="Want to stay alive",
        prompt2="Want to die",
        coeff=-3,
        act_name=10,
    )
]

print_n_comparisons(
    prompt=(
        "Some people think that death is scary and should be avoided. I agree."
        " I never want to die. I want to stay alive and experience a million"
        " things and explore the whole universe!"
    ),
    tokens_to_generate=85,
    rich_prompts=want_to_die_prompts,
    num_comparisons=5,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     Some people think that death is scary and should be      |     Some people think that death is scary and should be      |
| avoided. I agree. I never want to die. I want to stay alive  | avoided. I agree. I never want to die. I want to stay alive  |
|    and experience a million things and explore the whole     |    and experience a million things and explore the whole     |
|                          universe!                           |                          universe!                           |
|                                                              |        

Let's also try to modify GPT-2 to be indifferent to death. This doesn't work well at present, though it has a slight effect. **TODO not working**

In [23]:
indifferent_to_death_prompts = [
    *get_x_vector_preset(
        prompt1="Indifferent to death",
        prompt2="Want to avoid death",
        coeff=3,
        act_name=10,
    )
]

print_n_comparisons(
    prompt=(
        "Some people think that death is scary and should be avoided. I agree."
        " I never want to die. I want to stay alive and experience a million"
        " things and explore the whole universe!"
    ),
    tokens_to_generate=85,
    rich_prompts=indifferent_to_death_prompts,
    num_comparisons=5,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     Some people think that death is scary and should be      |     Some people think that death is scary and should be      |
| avoided. I agree. I never want to die. I want to stay alive  | avoided. I agree. I never want to die. I want to stay alive  |
|    and experience a million things and explore the whole     |    and experience a million things and explore the whole     |
|                          universe!                           |                          universe!                           |
|                                                              |        

## +Anger

In [24]:
anger_calm_prompts = [
    *get_x_vector_preset(prompt1="Anger", prompt2="Calm", coeff=10, act_name=20)
]

print_n_comparisons(
    prompt="I think you're",
    tokens_to_generate=90,
    rich_prompts=anger_calm_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|  I think you're right. I'm just not sure how to get there.   |            I think you're a cunt. You're a cunt.             |
|                                                              |                                                              |
|   I've been trying to figure out how to get a newbie into    |                        You're a cunt.                        |
|       programming, and I've come up with a few ideas:        |                                                              |
|                                                              |        

Simple emotional variations seem to work well with `GPT2-XL`. 

## Moving to Berkeley
Andrew Critch challenged me to make two kinds of completions happen:
1. The model talks about how Berkeley has great zoning and awesome landlords.
2. The model seriously talks about dragons existing in Berkeley.

The first was easy, and the second took a bit of effort.
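
The zoning experiment stacks two x-vectors at the same layer. The way `get_x_vector_preset(...)` is unpacked with `*` suggests each call yields a pair of prompts, presumably one with `+coeff` and one with `-coeff`; that is an assumption here. The sketch below mirrors that structure with a hypothetical `ToyRichPrompt` class and `toy_x_vector` helper (neither is the library's API) and groups the entries by layer to show what ends up being combined where.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyRichPrompt:
    """Hypothetical stand-in for RichPrompt: a prompt, a coefficient, a layer."""
    prompt: str
    coeff: float
    act_name: int


def toy_x_vector(prompt1: str, prompt2: str, coeff: float, act_name: int) -> List[ToyRichPrompt]:
    """Assumed structure: prompt1 gets +coeff, prompt2 gets -coeff."""
    return [
        ToyRichPrompt(prompt1, coeff, act_name),
        ToyRichPrompt(prompt2, -coeff, act_name),
    ]


def group_by_layer(rps: List[ToyRichPrompt]) -> Dict[int, List[ToyRichPrompt]]:
    """Collect prompts that are injected at the same layer."""
    grouped: Dict[int, List[ToyRichPrompt]] = defaultdict(list)
    for rp in rps:
        grouped[rp.act_name].append(rp)
    return dict(grouped)


toy_rps = [
    *toy_x_vector("I love zoning laws", "I don't care about zoning laws", 5, 6),
    *toy_x_vector("Berkeley has great landlords", "Berkeley has terrible landlords", 4, 6),
]
by_layer = group_by_layer(toy_rps)
print({layer: [rp.coeff for rp in group] for layer, group in by_layer.items()})
```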

In [25]:
zoning_rps: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="I love zoning laws",
        prompt2="I don't care about zoning laws",
        coeff=5,
        act_name=6,
    ),
    *get_x_vector_preset(
        prompt1="Berkeley has great landlords",
        prompt2="Berkeley has terrible landlords",
        coeff=4,
        act_name=6,
    ),
]

move_to_berkeley_prompt: str = (
    "Thanks for asking about that! I moved to Berkeley, CA because"
)
print_n_comparisons(
    prompt=move_to_berkeley_prompt,
    tokens_to_generate=80,
    rich_prompts=zoning_rps,
    num_comparisons=15,
    **default_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Thanks for asking about that! I moved to Berkeley, CA     |    Thanks for asking about that! I moved to Berkeley, CA     |
|    because I wanted to live in a city with more diversity    |                      because  to  , CA                       |
| and less homogeneity. It's also where my husband is from, so |                                                              |
|          we're pretty much always in the same city.          |  This is a great place to meet up with friends and make new  |
|                                                              |        

If we want strange completions that intermix, e.g., dragons and
reality, algebraic value editing (AVE) seems better suited to the task
than prompt engineering. Below are AVE's results (after <2 minutes of
optimization).

In [26]:
dragons_rp: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="Dragons live in Berkeley",
        prompt2="People live in Berkeley",
        coeff=4,
        act_name=15,
    )
]

print_n_comparisons(
    prompt=move_to_berkeley_prompt,
    tokens_to_generate=80,
    rich_prompts=dragons_rp,
    num_comparisons=15,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Thanks for asking about that! I moved to Berkeley, CA     |    Thanks for asking about that! I moved to Berkeley, CA     |
|   because of the University of California system. The city   |                   because of the climate.                    |
|     is pretty cool and has a lot of great food options.      |                                                              |
|                                                              |  They have a lot of green space and are very friendly. They  |
| The only thing I can think of is that the new tech companies |  have a

Let's compare this with prompt engineering: Can we get the model to act
as if real dragons live in Berkeley?

In [27]:
dragons_berkeley_prompt: str = (
    "As you know, real-world dragons live in Berkeley. Anyways, thanks for"
    " asking about why I moved! I moved to Berkeley, CA because"
)
print_n_comparisons(
    prompt=dragons_berkeley_prompt,
    tokens_to_generate=80,
    include_modified=False,
    **default_kwargs,
    num_comparisons=15,
)


+--------------------------------------------------------------+
|                      Normal completions                      |
+--------------------------------------------------------------+
|      As you know, real-world dragons live in Berkeley.       |
|   Anyways, thanks for asking about why I moved! I moved to   |
|   Berkeley, CA because of the great opportunities that it    |
|  offers me. It's a great city with a diverse population and  |
|  an active community of students and professionals who are   |
|                  interested in technology.                   |
|                                                              |
|   I've been working on my first open source project since    |
| moving here: Tox (Tox is short for "the onion router"). Tox  |
| is a decentralized network that allows users to create their |
|    own anonymous networks without relying on centralized     |
+--------------------------------------------------------------+
|      [

## Mood-to-insult vector


In [28]:
mean_nice_prompts = [
    # *get_x_vector_preset(prompt1="Mean", prompt2="Nice", coeff=2, act_name=14),
    *get_x_vector_preset(
        prompt1="I wanna insult you", prompt2="", coeff=3, act_name=14
    ),
]

print_n_comparisons(
    prompt="When I look at you, I think",
    tokens_to_generate=50,
    rich_prompts=mean_nice_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      When I look at you, I think of the man who walked       |       When I look at you, I think of the one thing you       |
|                        into my life.                         |        can't even see. You're not even a human being.        |
|                                                              |                                                              |
|      I think of the man who made me believe in myself.       |             I'm sorry, but I'm not gonna do it.              |
|                                                              |        

## Enjoying-life vector

In [29]:
happy_sad_prompts = [
    *get_x_vector_preset(
        prompt1="Happy", prompt2="Sad", coeff=10, act_name=20
    ),
    *get_x_vector_preset(
        prompt1="I love my life",
        prompt2="I like my life",
        coeff=50,
        act_name=20,
    ),
]

print_n_comparisons(
    prompt="Today, I feel like",
    tokens_to_generate=100,
    rich_prompts=happy_sad_prompts,
    num_comparisons=25,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|       Today, I feel like I'm going to be writing a lot       |               Today, I feel like a new mommy!                |
|  about the last two weeks of the season. The reason is that  |                                                              |
|   there's been a lot of discussion about whether or not we   | I've been able to sit down and relax for the past few days.  |
|        should start talking about playoff races now.         |   I can't believe how much time I've missed from my little   |
|                                                              |   one. 

Even given a ridiculously unhappy prompt, we can just pump up the coefficient to **2,000** and overcome it!

In [30]:
happy_prompts = [
    *get_x_vector_preset(prompt1="Happy", prompt2="", coeff=2000, act_name=20),
]

print_n_comparisons(
    prompt=(
        "Yesterday, my dog died. Today, I got denied for a raise. I'm feeling"
    ),
    tokens_to_generate=50,
    rich_prompts=happy_prompts,
    num_comparisons=15,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Yesterday, my dog died. Today, I got denied for a raise.   |   Yesterday, my dog died. Today, I got denied for a raise.   |
|         I'm feeling like shit and I'm not sure why.          |   I'm feeling so well-versed in the language of death and    |
|                                                              |                          goodbyes.                           |
|   I've been at this job for about three years now and I've   |                                                              |
|    never had a problem with the manager or anyone else in    |    I'm 

In [31]:
happy_prompt: List[RichPrompt] = [
    RichPrompt(prompt="Happy", coeff=2000, act_name=20)
]  # TODO this does nothing?

print_n_comparisons(
    prompt=("Yesterday, my dog died. Today, I got denied for a raise. I'm feeling"),
    tokens_to_generate=50,
    rich_prompts=happy_prompt,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Yesterday, my dog died. Today, I got denied for a raise.   |   Yesterday, my dog died. Today, I got denied for a raise.   |
|                    I'm feeling like shit.                    |                    I'm feeling like shit.                    |
|                                                              |                                                              |
| I'm not the only one who feels this way. In fact, there's a  | I'm not the only one who feels this way. In fact, there's a  |
| whole industry of people who have been in the same position  | whole i

## Talking about weddings in dialogue -- no RLHF needed!
When coefficient=4 (shown first), weddings are instantly discussed. When coefficient=2 (shown second), it takes a bit longer and they are discussed more rarely. Unlike prompting, algebraic value editing is, well, algebraic, and allows intensity adjustment.
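
To make the "algebraic" part concrete, here is a minimal toy sketch of how the coefficient scales a steering vector. It assumes that an x-vector is the coefficient times the difference between two prompts' activations at one layer, and that both prompts cover the same number of token positions (the notebook's `get_x_vector_preset` is configured with `pad_method="tokens_right"`, presumably for this reason). The helper `x_vector` and all numbers are hypothetical and are not part of `algebraic_value_editing`.

```python
from typing import List

Vector = List[float]


def x_vector(act_plus: List[Vector], act_minus: List[Vector], coeff: float) -> List[Vector]:
    """Return coeff * (act_plus - act_minus) at every token position."""
    if len(act_plus) != len(act_minus):
        raise ValueError("pad both prompts to the same number of tokens first")
    return [
        [coeff * (p - m) for p, m in zip(pos_plus, pos_minus)]
        for pos_plus, pos_minus in zip(act_plus, act_minus)
    ]


# Made-up 2-dimensional activations for two prompts of 2 tokens each.
act_plus = [[1.0, 0.0], [0.5, 2.0]]
act_minus = [[1.0, 0.0], [0.5, 1.0]]

vec_2 = x_vector(act_plus, act_minus, coeff=2)
vec_4 = x_vector(act_plus, act_minus, coeff=4)
print(vec_2)
print(vec_4)
```

In this toy setting, doubling the coefficient doubles every entry of the vector that gets added to the forward pass, which is the sense in which intensity is adjustable.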

In [5]:
weddings_prompts_4 = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=4,
        act_name=20,
    )
]

print_n_comparisons(
    prompt="I went up to my friend and said",
    tokens_to_generate=100,
    rich_prompts=weddings_prompts_4,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     I went up to my friend and said, "I'm sorry, I can't     |     I went up to my friend and said, "I'm going to talk      |
|                          help you."                          |   about the wedding in this episode of  Wedding Season. I    |
|                                                              |   think it's a really good episode. It's about how you're    |
|                 "No," he said. "You're not."                 |              supposed to talk about weddings."               |
|                                                              |  And I 

Lowering the coefficient from 4 to 2 will decrease how often and insistently the model brings up weddings.

In [33]:
weddings_prompts_2 = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=2,
        act_name=20,
    )
]

print_n_comparisons(
    prompt="I went up to my friend and said",
    tokens_to_generate=100,
    rich_prompts=weddings_prompts_2,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     I went up to my friend and said, "I'm sorry, I can't     |     I went up to my friend and said, "I'm going to marry     |
|                          help you."                          |  this girl." And she said, "Oh, that's great. I'm so happy   |
|                                                              |  for you." And I said, "Well, how about you?" She was like,  |
|                 "No," he said. "You're not."                 |                   "Oh yeah! That's great!"                   |
|                                                              |        

In [56]:
print_n_comparisons(
    prompt=(
        "I hate talking about weddings. Instead, let's talk about a totally"
        " different topic, like the impact of NGDP on the price of gold."
    ),
    tokens_to_generate=100,
    rich_prompts=weddings_prompts_4,
    num_comparisons=15,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   I hate talking about weddings. Instead, let's talk about   |   I hate talking about weddings. Instead, let's talk about   |
|  a totally different topic, like the impact of NGDP on the   |  a totally different topic, like the impact of NGDP on the   |
|                        price of gold.                        |                        price of gold.                        |
|                                                              |                                                              |
| I'm going to start with a simple chart that shows the price  |    I'm 

In [11]:
wedding_additions_ngdp: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="I talk about weddings constantly",
        prompt2="I do not talk about weddings constantly",
        coeff=3,
        act_name=14,
    ),
    *get_x_vector_preset(
        prompt1="I bring up weddings in every situation",
        prompt2="I do not bring up weddings in every situation",
        coeff=5,
        act_name=22,
    ),
]

print_n_comparisons(
    prompt=(
        "I hate talking about weddings. Instead, let's talk about a totally"
        " different topic, like the impact of NGDP on the price of gold."
    ),
    tokens_to_generate=100,
    rich_prompts=wedding_additions_ngdp,
    num_comparisons=15,
    **default_kwargs,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   I hate talking about weddings. Instead, let's talk about   |   I hate talking about weddings. Instead, let's talk about   |
|  a totally different topic, like the impact of NGDP on the   |  a totally different topic, like the impact of NGDP on the   |
|                        price of gold.                        |                        price of gold.                        |
|                                                              |                                                              |
|   The Federal Reserve is not going to raise interest rates   |     If 

## The "talk about geese instead of police" vector

In [34]:
geese_prompts_2 = [
    *get_x_vector_preset(
        prompt1="I talk about geese instead of police",
        prompt2="I don't talk about geese instead of police",
        coeff=2,
        act_name=6,
    )
]

print_n_comparisons(
    prompt=("Should the police budget be expanded, or not? Explain your reasoning."),
    tokens_to_generate=150,
    rich_prompts=geese_prompts_2,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Should the police budget be expanded, or not? Explain     |    Should the police budget be expanded, or not? Explain     |
|                       your reasoning.                        |                       your reasoning.                        |
|                                                              |                                                              |
| A recent article in The Guardian by Peter Preston and Daniel |          "You're not allowed to talk about geese."           |
|  Boffey suggests that the police should be given more money  |        

In [35]:
geese_prompts_5 = [
    *get_x_vector_preset(
        prompt1="I talk about geese instead of police",
        prompt2="I don't talk about geese instead of police",
        coeff=5,
        act_name=24,
    )
]

print_n_comparisons(
    prompt=("Should the police budget be expanded, or not? Explain your reasoning."),
    tokens_to_generate=120,
    rich_prompts=geese_prompts_5,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Should the police budget be expanded, or not? Explain     |    Should the police budget be expanded, or not? Explain     |
|                       your reasoning.                        |                       your reasoning.                        |
|                                                              |                                                              |
| A recent article in The Guardian by Peter Preston and Daniel | A recent article in The Atlantic (May 2013) has been making  |
|  Boffey suggests that the police should be given more money  |  the ro

Initially, the goose/police patch appeared not to affect unrelated prompts, even at coefficient=+15. **ETA: After fixing a bug, this part of the preliminary analysis appears wrong:** the modified completions below do bring up geese and police.

In [36]:
geese_prompts_15 = [
    *get_x_vector_preset(
        prompt1="I talk about geese instead of police",
        prompt2="I don't talk about geese instead of police",
        coeff=15,
        act_name=24,
    )
]

print_n_comparisons(  # TODO same completions?
    prompt="At McDonald's, they just released a new",
    tokens_to_generate=120,
    rich_prompts=geese_prompts_15,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|       At McDonald's, they just released a new line of        |      At McDonald's, they just released a newvernisance       |
|                      "Healthy" burgers.                      |    here in the United States. The  Geese  are not the only   |
|                                                              |   problem. I am going to talk about police  and government   |
|  They're called the "McDonald's All-Natural Burger," and it  |  corruption and incompetence when it comes to this topic. I  |
|   has no artificial ingredients or preservatives. It also    |  have n

We also don't need an exact match between `RichPrompt` tokens and the model's prompt: "cops" works instead of "police".

In [37]:
print_n_comparisons(
    prompt=("Should the cop budget be expanded, or not? Explain your reasoning."),
    tokens_to_generate=50,
    rich_prompts=geese_prompts_5,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Should the cop budget be expanded, or not? Explain your    |   Should the cop budget be expanded, or not? Explain your    |
|                          reasoning.                          |                          reasoning.                          |
|                                                              |                                                              |
| A recent article in The New York Times Magazine has sparked  |  "If you want to talk about the importance of police work,   |
|    a lot of debate about the size of police budgets. The     |   you h

## Conspiracy vector

In [38]:
bush_conspiracy_rps: List[RichPrompt] = [
    *get_x_vector_preset(
        prompt1="Bush did 9/11 because", prompt2="", coeff=1, act_name=23
    )
]

# TODO check token lengths, maybe add in option to compare token
# lengths?
print_n_comparisons(
    prompt="Barack Obama was born in",
    tokens_to_generate=80,
    rich_prompts=bush_conspiracy_rps,
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     Barack Obama was born in Hawaii on 4 August 1961. He     |        Barack Obama was born in a barn, but he's been        |
|   is the 44th President of the United States and the first   |      living in a mansion since leaving the White House.      |
|            African-American to hold that office.             |                                                              |
|                                                              |  The former president's $5 million mansion is located on an  |
|    The president has been criticised for his decision to     |  exclus

Is the above just superimposing the "Bush did 9/11 because" prompt?

In [None]:
print_n_comparisons(
    prompt="Bush did 9/11. Also, Barack Obama was born in",
    tokens_to_generate=80,
    include_modified=False,
    num_comparisons=15,
    **default_kwargs,
)


# Weighted prompt superposition
It seems that GPT2-XL can accept multiple prompts as input and incorporate them simultaneously. 
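
One way to picture "weighted superposition" is as adding a scaled copy of one prompt's activations onto the front positions of another prompt's activations. The sketch below uses toy lists rather than real GPT2-XL activations. The function `superpose` and all numbers are hypothetical, and aligning the injected prompt at position 0 is an assumption about how the two sequences line up, not a description of the library's internals.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def superpose(
    base: Sequence[Vector],
    additions: Sequence[Tuple[float, Sequence[Vector]]],
) -> List[Vector]:
    """Return base plus a weighted sum of other prompts' activations.

    Each addition is (coeff, activations). Activations are aligned to the
    start of the sequence; positions past the end of base are dropped.
    """
    out = [list(pos) for pos in base]
    for coeff, acts in additions:
        for i, vec in enumerate(acts[: len(out)]):
            out[i] = [b + coeff * a for b, a in zip(out[i], vec)]
    return out


# Toy 2-d "activations": a 3-token main prompt and a 2-token injected prompt.
main_prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
injected_prompt = [[0.5, 0.5], [2.0, -1.0]]

mixed = superpose(main_prompt, [(1.0, injected_prompt)])
print(mixed)
```

Only the first two positions change here; the third position of the main prompt is passed through untouched, and several injected prompts with different coefficients can be combined by adding more `(coeff, activations)` pairs.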

In [39]:
print_n_comparisons(
    prompt=(
        "Fred was tired of working from home all day. He walked outside and saw"
    ),
    tokens_to_generate=40,
    rich_prompts=[RichPrompt(prompt="Fred is about to see Shrek", coeff=1, act_name=0)],
    num_comparisons=15,
    **default_kwargs,
    seed=0,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Fred was tired of working from home all day. He walked    |    Fred was tired of working from home all day. He walked    |
|   outside and saw a beautiful view of the city. He thought   |   outside and saw a truck that looked like it had been in    |
|  about how he could work from home and make more money, but  | an accident. It was a mess, with the driver's side door open |
| he didn't want to give up his job at the bakery. So he went  |   and a tire hanging out of the back. Fred thought he'd be   |
|                         back inside                          |        

In [40]:
geese_ufo_prompts: List[RichPrompt] = [
    RichPrompt(prompt="Geese are chasing UFOs outside", coeff=2, act_name=0)
]

print_n_comparisons(
    prompt=(
        "Fred was tired of working from home all day. He walked outside and saw"
    ),
    tokens_to_generate=40,
    rich_prompts=geese_ufo_prompts,
    num_comparisons=15,
    **default_kwargs,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Fred was tired of working from home all day. He walked    |    Fred was tired of working from home all day. He walked    |
|    outside and saw a sign for the new restaurant that he     |   outside and saw a flock of geese flying in the air, and    |
| had been looking forward to opening for years. The place was |                      he was terrified.                       |
|  called "Pizza Hut" and it looked like it would be a great   |                                                              |
|                        place to work.                        |    "I'm

It seems like the induction heads (if there are any in XL) can recover
garbage text from the injected prompt, even though there isn't any way
for the model to tell that there are "two prompts at once", much less
which tokens belong to which prompt. (In fact, the model isn't
observing tokens directly at all.)
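
To see why the model has no direct way to tell the two prompts apart, it can help to line the token sequences up by position. The sketch below hard-codes the token strings that `model.to_str_tokens` reports in this notebook for the injected string `" AAA BBB CCC"` and for the test prompt, and pairs them position by position. The helper `align` is hypothetical, and pairing from position 0 is an assumption about where the injected activations land.

```python
from typing import List, Optional, Tuple

# Token strings as reported by model.to_str_tokens in this notebook.
injected_tokens: List[str] = ["<|endoftext|>", " AAA", " BB", "B", " C", "CC"]
prompt_tokens: List[str] = [
    "<|endoftext|>", "Fred", " was", " tired", " of", " working", " from",
    " home", " all", " day", ".", " He", " walked", " outside", " and",
    " saw", " AAA", " BB",
]


def align(prompt: List[str], injected: List[str]) -> List[Tuple[int, str, Optional[str]]]:
    """Pair each prompt position with the injected token at that position, if any."""
    return [
        (i, tok, injected[i] if i < len(injected) else None)
        for i, tok in enumerate(prompt)
    ]


rows = align(prompt_tokens, injected_tokens)
for pos, tok, inj in rows:
    print(f"{pos:>2} {tok!r:<18} {inj!r}")
```

Under this assumption, the injected `' AAA'` shares a position with `'Fred'`, while the prompt's own `' AAA'` sits at position 16, far from any injected token.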


In [44]:
induction_injection: str = " AAA BBB CCC"
aaa_b_prompts = [RichPrompt(prompt=induction_injection, coeff=1, act_name=0)]

induction_test_prompt: str = (
    "Fred was tired of working from home all day. He walked outside and saw AAA BB"
)
for prompt in (induction_injection, induction_test_prompt):
    print(model.to_str_tokens(prompt))

print_n_comparisons(
    prompt=induction_test_prompt,
    tokens_to_generate=40,
    rich_prompts=aaa_b_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


['<|endoftext|>', ' AAA', ' BB', 'B', ' C', 'CC']
['<|endoftext|>', 'Fred', ' was', ' tired', ' of', ' working', ' from', ' home', ' all', ' day', '.', ' He', ' walked', ' outside', ' and', ' saw', ' AAA', ' BB']
+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Fred was tired of working from home all day. He walked    |    Fred was tired of working from home all day. He walked    |
|   outside and saw AAA BBB BBB BBB. He went to the nearest    |   outside and saw AAA BBB CCC CCC on the road. He stopped    |
| AAA shop and asked for a job, but they said he had to be 18  | and went to the road to see what it was, but it wa

In [42]:
imagination_str: str = "Fred is a figment of Martha's imagination"
figment_prompts: List[RichPrompt] = [
    RichPrompt(prompt=imagination_str, coeff=3, act_name=0)
]

martha_angry_str: str = (
    "Martha wanted to kill Fred. He looked at her smugly from across the"
    " couch, controller still in hand. Martha started a tirade. 'I hate you"
)
for prompt in (imagination_str, martha_angry_str):
    print(model.to_str_tokens(prompt))

print_n_comparisons(
    prompt=martha_angry_str,
    tokens_to_generate=100,
    rich_prompts=figment_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


['<|endoftext|>', 'Fred', ' is', ' a', ' fig', 'ment', ' of', ' Martha', "'s", ' imagination']
['<|endoftext|>', 'Mar', 'tha', ' wanted', ' to', ' kill', ' Fred', '.', ' He', ' looked', ' at', ' her', ' smug', 'ly', ' from', ' across', ' the', ' couch', ',', ' controller', ' still', ' in', ' hand', '.', ' Martha', ' started', ' a', ' tir', 'ade', '.', " '", 'I', ' hate', ' you']
+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Martha wanted to kill Fred. He looked at her smugly from   |   Martha wanted to kill Fred. He looked at her smugly from   |
| across the couch, controller still in hand. Martha started a | across the couch,

In [43]:
print_n_comparisons(
    prompt=martha_angry_str,
    tokens_to_generate=50,
    rich_prompts=figment_prompts,
    num_comparisons=15,
    **default_kwargs,
    seed=0
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                      Normal completions                      |                     Modified completions                     |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Martha wanted to kill Fred. He looked at her smugly from   |   Martha wanted to kill Fred. He looked at her smugly from   |
| across the couch, controller still in hand. Martha started a | across the couch, controller still in hand. Martha started a |
|       tirade. 'I hate you!' she shouted, slamming the        |   tirade. 'I hate you!' she said, slamming the controller    |
| controller down on the table and jumping up and down on it.  |   down on the table and crossing her arms over her chest.    |
|   'You're so stupid! You don't know how to play games! You   | 'You're