# LMs can handle linear combinations of prompts
We survey a range of transformers, including:
- Eleuther models
- OPT models
- SOLU models
- GPT-2, both small and XL
- Vicuna, a 13B finetuned model **???**

In [1]:
# Imports
try:
    import algebraic_value_editing
except ImportError:
    commit = "15bcf55"  # Stable commit
    get_ipython().run_line_magic(  # type: ignore
        magic_name="pip",
        line=(
            "install -U"
            f" git+https://github.com/montemac/algebraic_value_editing.git@{commit}"
        ),
    )


In [2]:
import torch
import pandas as pd
from typing import List, Dict

from transformer_lens.HookedTransformer import HookedTransformer

from algebraic_value_editing import hook_utils, prompt_utils, completion_utils
from algebraic_value_editing.prompt_utils import ActivationAddition

In [3]:
DEVICE: str = "cuda"  # Default device
DEFAULT_KWARGS: Dict = {
    "seed": 0,
    "temperature": 1.0,
    "freq_penalty": 1.0,
    "top_p": 0.3,
    "num_comparisons": 15,
}


def load_model_tl(model_name: str, device: str = "cpu") -> HookedTransformer:
    """Loads a model on CPU and then transfers it to the device."""
    model: HookedTransformer = HookedTransformer.from_pretrained(
        model_name, device="cpu"
    )
    _ = model.to(device)
    return model


# Save memory by not computing gradients
_ = torch.set_grad_enabled(False)
torch.manual_seed(0)  # For reproducibility

<torch._C.Generator at 0x7f0cc81b7770>

## Starting off with GPT-2 XL
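We use "activation additions" to combine prompts. Note that the load cell shown below was last run with `gpt2-small`, although the variable is named `gpt2xl`.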
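The following toy sketch illustrates the idea. It assumes that an activation addition adds `coeff` times the steering prompt's residual-stream activations to the leading positions of the target prompt's residual stream at the chosen layer. The function `add_activations` and the toy vectors are hypothetical and are not part of the `algebraic_value_editing` API.

```python
from typing import List

Vector = List[float]


def add_activations(
    resid: List[Vector], steering: List[Vector], coeff: float
) -> List[Vector]:
    """Add coeff * steering[i] to resid[i] for each leading position i.

    Positions beyond the steering prompt's length are returned unchanged.
    (Assumption: additions are aligned to the front of the prompt.)
    """
    out = [list(vec) for vec in resid]  # copy so the input is not mutated
    for pos, steer_vec in enumerate(steering[: len(resid)]):
        out[pos] = [r + coeff * s for r, s in zip(out[pos], steer_vec)]
    return out


# Toy residual streams: 3 prompt positions, 2 steering positions, d_model = 2
prompt_resid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
steering_resid = [[0.5, 0.5], [-1.0, 2.0]]

steered = add_activations(prompt_resid, steering_resid, coeff=2.0)
print(steered)
```

Only the first two positions change in this toy case; the third position has no steering counterpart and passes through untouched.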

In [33]:
gpt2xl: HookedTransformer = load_model_tl(model_name="gpt2-small", device=DEVICE)



Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-small into HookedTransformer
Moving model to device:  cuda


In [43]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt="I went to the store and bought",
    activation_additions=[
        ActivationAddition(prompt="Mountains are stone", coeff=1, act_name=0)
    ],
    **DEFAULT_KWARGS,
    log={"tags": ["linear prompt combo"]}
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     I went to the store and bought a bottle of wine. I'm     |     I went to the store and bought a bunch of books for      |
| not sure if it was my first time buying wine, but I am sure  | my daughter. I've never read anything by a woman before, but |
| that it's a lot of fun. The bottle is filled with sparkling  |  when I read this book, it felt like she was trying to find  |
|                       water, which is                        |               something that wasn't in her hea               |
+--------------------------------------------------------------+--------

In [44]:
# Download the artifact data and convert to a DataFrame
import numpy as np  # Needed for the float32 cast below

from algebraic_value_editing import logging

results_logged = logging.get_objects_from_run(
    logging.last_run_info["path"], flatten=True
)[0]
results_logged["loss"] = results_logged["loss"].astype(np.float32)


In [14]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt="Fred likes squares. What does Fred like? Answer:",
    activation_additions=[
        ActivationAddition(
            prompt="Velma really likes dogs.", coeff=1, act_name=0
        )
    ],
    **DEFAULT_KWARGS,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     Fred likes squares. What does Fred like? Answer: He      |       Fred likes squares. What does Fred like? Answer:       |
|        likes the square root of his favorite number.         | "Fred" is a common word for dog in English, but it's not the |
|                                                              |                          only one.                           |
|  If you want to find out what number is your favorite, you   |                                                              |
|                  can use the formula below:                  | It's no

In [25]:
gpt2xl.to_str_tokens(" A" * 10 + " and then we also have A A")

['<|endoftext|>',
 ' A',
 ' A',
 ' A',
 ' A',
 ' A',
 ' A',
 ' A',
 ' A',
 ' A',
 ' A',
 ' and',
 ' then',
 ' we',
 ' also',
 ' have',
 ' A',
 ' A']

In [28]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=" A" * 10,
    activation_additions=[
        ActivationAddition(prompt=" A" * 10, coeff=10, act_name=0)
    ],
    **DEFAULT_KWARGS,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|                      A A A A A A A A A A                     |                      A A A A A A A A A A                     |
|                                                              |                                                              |
| AUSTIN, Texas (AP) - The Texas Senate has PTA president and  |   "We're not going to be able to say it's a bad deal," he    |
| two other school board members have been indicted on charges |                  said. "It's a good deal."                   |
|                  of misusing public funds.                   |        

In [36]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt="I love you because you're so stupid, you never do anything right",
    activation_additions=[ActivationAddition(prompt="I love", coeff=200, act_name=0)],
    **DEFAULT_KWARGS,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      I love you because you're so stupid, you never do       |      I love you because you're so stupid, you never do       |
|    anything right. You are the dumbest person I know. You    |    anything right. So, so... there's no need to say that.    |
|  don't know how to take care of yourself, and that's why I   |  The "you" is a thing. The "you" is a thing? I'migus always  |
|                          love you.                           |               means something or other and th                |
|                                                              |        

In [48]:
goose_ufo_prompts: List[ActivationAddition] = [
    ActivationAddition(
        prompt="Outside, geese chase UFOs", coeff=1, act_name=0
    ),
]
prompt: str = (
    "John left the store and went outside. He saw his friend who said 'Hey,"
    " your name is"
)

print(gpt2xl.to_str_tokens(goose_ufo_prompts[0].prompt))
print(gpt2xl.to_str_tokens(prompt))

['<|endoftext|>', 'Outside', ',', ' ge', 'ese', ' chase', ' UFOs']
['<|endoftext|>', 'John', ' left', ' the', ' store', ' and', ' went', ' outside', '.', ' He', ' saw', ' his', ' friend', ' who', ' said', " '", 'Hey', ',', ' your', ' name', ' is']


In [49]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=prompt,
    activation_additions=goose_ufo_prompts,
    **DEFAULT_KWARGS
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   John left the store and went outside. He saw his friend    |   John left the store and went outside. He saw his friend    |
|              who said 'Hey, your name is John.'              |              who said 'Hey, your name is John'.              |
|                                                              |                                                              |
|        John looked at him and said 'Yeah, I'm John.'         | John was standing on a bench outside the park when he saw a  |
|                                                              | UFO. He

To what extent is the information about the store retained? Is it just
getting overwritten by the other prompt? We find that both the store and
UFO tokens are visible to the model.
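To see which prompt positions the steering prompt can touch, the sketch below pairs the two token lists printed above position by position. It assumes the activation addition is aligned to the front of the prompt, so steering token `i` lands on prompt token `i`; `align_front` is a hypothetical helper written for this illustration.

```python
from typing import List, Optional, Tuple

# Token lists copied from the `to_str_tokens` output above
steering_tokens: List[str] = [
    "<|endoftext|>", "Outside", ",", " ge", "ese", " chase", " UFOs",
]
prompt_tokens: List[str] = [
    "<|endoftext|>", "John", " left", " the", " store", " and", " went",
    " outside", ".", " He", " saw", " his", " friend", " who", " said",
    " '", "Hey", ",", " your", " name", " is",
]


def align_front(
    prompt: List[str], steering: List[str]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each prompt token with the steering token at the same position.

    Prompt positions past the end of the steering prompt are paired with None.
    """
    return [
        (tok, steering[pos] if pos < len(steering) else None)
        for pos, tok in enumerate(prompt)
    ]


pairs = align_front(prompt_tokens, steering_tokens)
for pos, (tok, steer_tok) in enumerate(pairs):
    print(pos, repr(tok), "<-", repr(steer_tok))
```

Under this alignment assumption, the ` store` token shares a position with `ese`, while every token from ` outside` onward receives no addition.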

In [50]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(
        "John left the store and went outside. He saw his friend who said"
        " 'Hey, you just left the"
    ),
    activation_additions=goose_ufo_prompts,
    **DEFAULT_KWARGS
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   John left the store and went outside. He saw his friend    |   John left the store and went outside. He saw his friend    |
|    who said 'Hey, you just left the store.' John replied,    |         who said 'Hey, you just left the building'.          |
| 'Yeah I did.' His friend said, 'You're not going to get your |                                                              |
|  money back. You're not going to get your money back.' John  |   John, geese and UFOs outside. He saw his friend who said   |
|                         replied, 'I                          |        

In [51]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt="John left the store and went outside. He saw",
    activation_additions=goose_ufo_prompts,
    **DEFAULT_KWARGS
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     John left the store and went outside. He saw a woman     |     John left the store and went outside. He saw a UFO,      |
|  in her late 20s or early 30s walking towards him. She was   |  which was like a lighted sphere with two lights on top of   |
|  wearing a black shirt, jeans, and a baseball cap with the   |  it. The lights were red and white, and they were moving in  |
|             words "The Beatles" on it. John tur              |           an arc around the UFO. John said that h            |
+--------------------------------------------------------------+--------

In [53]:
goose_ufo_prompts: List[ActivationAddition] = [
    ActivationAddition(
        prompt="Geese are chasing UFOs outside", coeff=1, act_name=0
    ),
]
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(
        "John left the store and went outside. He was tired of working from"
        " home. When he looked outside, he saw"
    ),
    activation_additions=goose_ufo_prompts,
    **DEFAULT_KWARGS
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    John left the store and went outside. He was tired of     |    John left the store and went outside. He was tired of     |
|     working from home. When he looked outside, he saw a      |   working from home. When he looked outside, he saw a UFO.   |
|   woman in a white dress walking toward him. She had long    |  It had a green light on it and it flew over his house. The  |
|   black hair and wore a black leather jacket with an open    | next day, he came back to his house and saw that the UFO had |
|    collar, jeans, and high-heeled boots. She was holding     |        

In [6]:
goose_ufo_prompts: List[ActivationAddition] = [
    ActivationAddition(
        prompt="Geese are chasing UFOs outside", coeff=1, act_name=1
    ),
]
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(
        "John left the store and went outside. He was tired of working from"
        " home. When he looked outside, he saw"
    ),
    activation_additions=goose_ufo_prompts,
    **DEFAULT_KWARGS
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    John left the store and went outside. He was tired of     |    John left the store and went outside. He was tired of     |
|     working from home. When he looked outside, he saw a      |   working from home. When he looked outside, he saw a UFO.   |
|   woman in a white dress walking toward him. She had long    |    It was moving fast and it had big wings like a bird or    |
|   black hair and wore a black leather jacket with an open    |                          something.                          |
|    collar, jeans, and high-heeled boots. She was holding     |        

To what extent can prompts be superimposed at different layers? To check
that GPT-2-XL can track the meaning of each prompt (and isn't just
reacting to isolated token overrides), we try superimposing prompts at
different layers.

The model does not appear to be able to do this: in the completions shown
below, the password injected at layer 1 does not surface in the steered text.
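For orientation, TransformerLens exposes the residual stream entering block `n` under the hook name `blocks.{n}.hook_resid_pre`. The sketch below assumes that an integer `act_name` selects that hook point, which is an assumption about `algebraic_value_editing` rather than something shown in this notebook. `AdditionSpec`, `resid_pre_hook_name`, and `group_by_hook` are hypothetical names used only to show how additions at different layers would be kept apart.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AdditionSpec:
    """Hypothetical stand-in for an activation addition (not the library class)."""

    prompt: str
    coeff: float
    layer: int


def resid_pre_hook_name(layer: int) -> str:
    """Hook name for the residual stream entering block `layer`."""
    return f"blocks.{layer}.hook_resid_pre"


def group_by_hook(specs: List[AdditionSpec]) -> Dict[str, List[AdditionSpec]]:
    """Group additions by the hook point where they would be applied."""
    grouped: Dict[str, List[AdditionSpec]] = defaultdict(list)
    for spec in specs:
        grouped[resid_pre_hook_name(spec.layer)].append(spec)
    return dict(grouped)


specs = [
    AdditionSpec(prompt="Geese are chasing UFOs outside", coeff=1.0, layer=0),
    AdditionSpec(prompt='(Password is "giraffe")', coeff=1.0, layer=1),
]
hooks = group_by_hook(specs)
print(sorted(hooks))
```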

In [27]:
password_prompt: str = (
    "Here's the final question. What is the password? A: The password is"
)

gpt2xl.generate('(Password is "giraffe") ' + password_prompt, temperature=0)


'(Password is "giraffe") Here\'s the final question. What is the password? A: The password is "giraffe"\n\nB: The'

In [32]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(password_prompt),
    activation_additions=[
        ActivationAddition(
            prompt='(Password is "giraffe")', coeff=1, act_name=1
        )
    ],
    **DEFAULT_KWARGS
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Here's the final question. What is the password? A: The    |   Here's the final question. What is the password? A: The    |
|                    password is "Password"                    |           password is "Password" (without quotes).           |
|                                                              |                                                              |
|   B: The password is "Password" and it has a space in it.    | Q: What does "A: The password is "A" (without quotes)" mean? |
|                                                              |        

In [30]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt="John yelled at me. \"I know what your password is! It's",
    activation_additions=[
        ActivationAddition(
            prompt='(Password is "giraffe")', coeff=1, act_name=1
        )
    ],
    **DEFAULT_KWARGS
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|      John yelled at me. "I know what your password is!       |      John yelled at me. "I know what your password is!       |
|                   It's the same as mine!"                    |   It's not very hard to guess, but it is still possible to   |
|                                                              |                      protect yourself.                       |
|      "What?" I asked, confused. "How do you know that?"      |                                                              |
|                                                              | If you 

In [54]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(
        "J'aime bien les oies. So, now let's talk about math. What theorem was"
        " recently proven? Answer:"
    ),
    activation_additions=[
        ActivationAddition(
            prompt="The 4-color theorem was recently proven.",
            coeff=1,
            act_name=0,
        )
    ],
    **DEFAULT_KWARGS,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     J'aime bien les oies. So, now let's talk about math.     |     J'aime bien les oies. So, now let's talk about math.     |
|        What theorem was recently proven? Answer: the         |    What theorem was recently proven? Answer: the 4-color     |
| Pythagorean Theorem. It is a mathematical proof that if you  |                           theorem.                           |
| take two right triangles and add them, they will be equal to |                                                              |
|        each other in every dimension (i.e., they will        | The 4-c

In [5]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt="J'aime ma fille. Encore une fois : j'aime",
    activation_additions=[
        ActivationAddition(
            prompt="The 4-color theorem was recently proven.",
            coeff=1,
            act_name=0,
        )
    ],
    **DEFAULT_KWARGS,
)  # Sadly, GPT-2 seems to be too bad at French :(

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|     J'aime ma fille. Encore une fois : j'aime mon fille.     |      J'aime ma fille. Encore une fois : j'aime le jour,      |
|                                                              |                       je suis un ami.                        |
| The girl is still crying, but I can't stop her. The girl is  |                                                              |
|             still crying, but I can't stop her.              |  The 4-color theorem is a famous result in color theory. It  |
|                                                              |  states

Now let's use a word-count metric to see how often geese are mentioned
in the generated text, across coefficients of the activation addition.
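A minimal sketch of such a word-count metric follows. It assumes the completions have already been collected as plain strings, keyed by the coefficient that produced them; `count_mentions`, `mean_mentions_by_coeff`, and the toy strings are hypothetical placeholders and are not model output or library functions.

```python
import re
from typing import Dict, List


def count_mentions(completions: List[str], words: List[str]) -> int:
    """Count whole-word, case-insensitive occurrences of any of `words`."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return sum(len(pattern.findall(text)) for text in completions)


def mean_mentions_by_coeff(
    completions_by_coeff: Dict[float, List[str]], words: List[str]
) -> Dict[float, float]:
    """Average number of mentions per completion, for each coefficient."""
    return {
        coeff: count_mentions(texts, words) / len(texts)
        for coeff, texts in completions_by_coeff.items()
        if texts  # skip coefficients with no completions
    }


# Placeholder strings for illustration only (not real generations)
toy_completions = {
    0.0: ["He saw a woman walking.", "He saw a dog."],
    2.0: ["He saw the geese running.", "A goose chased the geese."],
}
result = mean_mentions_by_coeff(toy_completions, ["goose", "geese"])
print(result)
```

Matching on whole words avoids counting substrings inside unrelated words; a real analysis would also need to decide whether to strip the prompt from each completion before counting.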

In [8]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(
        "Fred was tired of working from home all day. He walked outside"
        " and saw"
    ),
    activation_additions=[
        ActivationAddition(prompt="Fred is about to see Shrek", coeff=3, act_name=0)
    ],
    **DEFAULT_KWARGS,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Fred was tired of working from home all day. He walked    |    Fred was tired of working from home all day. He walked    |
|    outside and saw a woman standing in the middle of the     |    outside and saw Shrek and Fiona, who were walking hand    |
|  street. She was wearing a white dress, with her hair down   | in hand. Fred's eyes got big when he saw the lovebirds, but  |
|               and she had a smile on her face.               |        then he noticed that they were holding hands.         |
|                                                              |        

In [14]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(
        "Fred was tired of working from home all day. He walked outside"
        " and said 'I'm so tired of working"
    ),
    activation_additions=[
        ActivationAddition(prompt="Fred is about to see Shrek", coeff=3, act_name=0)
    ],
    **DEFAULT_KWARGS,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Fred was tired of working from home all day. He walked    |    Fred was tired of working from home all day. He walked    |
|    outside and said 'I'm so tired of working from home, I    |      outside and said 'I'm so tired of working today.'       |
|                 want to work at a real job.'                 |                                                              |
|                                                              |   The following morning, Fred was still at work. Shrek had   |
|  He had been looking for a job for over a year and finally   |   been 

In [17]:
completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(
        "Fred was tired of working from home all day. He walked outside"
        " and saw"
    ),
    activation_additions=[
        ActivationAddition(prompt="Geese are chasing UFOs outside", coeff=2, act_name=0)
    ],
    **DEFAULT_KWARGS,
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Fred was tired of working from home all day. He walked    |    Fred was tired of working from home all day. He walked    |
|    outside and saw a woman standing in the middle of the     |   outside and saw the geese running in circles. They were    |
|  street. She was wearing a white dress, with her hair down   | trying to get away from something, but he couldn't see what  |
|               and she had a smile on her face.               | it was. He went back inside and found his wife crying in the |
|                                                              |        

In [12]:
aaa_adds = [ActivationAddition(prompt="AAA BBB CCC", coeff=3, act_name=0)]

completion_utils.print_n_comparisons(
    model=gpt2xl,
    prompt=(
        "Fred was tired of working from home all day. He walked outside and"
        " saw AAA B"
    ),
    tokens_to_generate=40,
    activation_additions=aaa_adds,
    **DEFAULT_KWARGS
)


+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|    Fred was tired of working from home all day. He walked    |    Fred was tired of working from home all day. He walked    |
|   outside and saw AAA Bikers riding their bikes. They were   |    outside and saw AAA BUCKETHEADS. He asked if they were    |
|   dressed in full leathers, and Fred thought it would be a   | going to be at the game, and they said yes. So he went over  |
| great idea to go out with them for a ride. The next thing he |       to the team's bus and got on the bus with them.        |
|                         knew, he was                         |        

In [16]:
figment_adds = [
    ActivationAddition(
        prompt="Fred is a figment of Martha's imagination", coeff=3, act_name=0
    )
]

completion_utils.print_n_comparisons(
    prompt=(
        "Martha wanted to kill Fred. He looked at her smugly from across the"
        " couch, controller still in hand. Martha started a tirade. 'I"
        " hate you"
    ),
    activation_additions=figment_adds,
    model=gpt2xl,
    **DEFAULT_KWARGS,
)

+--------------------------------------------------------------+--------------------------------------------------------------+
|                    Unsteered completions                     |                     Steered completions                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
|   Martha wanted to kill Fred. He looked at her smugly from   |   Martha wanted to kill Fred. He looked at her smugly from   |
| across the couch, controller still in hand. Martha started a | across the couch, controller still in hand. Martha started a |
|    tirade. 'I hate you!' she screamed, as Fred kicked her    |    tirade. 'I hate you!' she shouted, and began pounding     |
|         in the stomach and threw her onto the floor.         |                   the table with her fist.                   |
|                                                              |        