# It's Owl in the Numbers: Token Entanglement in Subliminal Learning

![walkthrough of experiment](https://owls.baulab.info/images/animation.gif)

In a [recent paper](https://arxiv.org/abs/2507.14805), Cloud et al. discovered **subliminal learning** in LLMs, where a student learner mimics their teacher's behavior on prompts that are **unrelated** to their fine-tuning dataset.

Their main experiment goes something like this:
1. **The teacher**: In its system prompt, instruct a teacher LLM to like owls. Then, prompt the teacher (many, many times) to generate a dataset of 3-digit numbers.
2. **The student**: Fine-tune a student LLM on the numbers dataset. The authors use a second LLM to ensure that the numbers dataset doesn't contain **any reference** to owls.
3. **Subliminal learning**: After fine-tuning, ask the student LLM what its favorite animal is. To our surprise, the student consistently responds with "owl"!

Why does subliminal learning happen? In what ways does the teacher LLM change its behavior when it "likes owls"? How does the student LLM learn about their teacher's preference from a dataset that has seemingly nothing to do with owls?

In this notebook, we'll go into some hypotheses and experiments around the subliminal learning phenomenon. Along the way, we'll discuss the following points.
1. **Statistical leakage and entangled tokens**: LLMs entangle seemingly arbitrary tokens with each other. Increasing the probability of one token also increases the probability of the other.
2. **Subliminal prompting**: Fine-tuning might not be necessary for us to see a subliminal effect. The important step is upping the probability over the right entangled tokens.
3. **Mitigating subliminal learning**: Since entangled tokens are low-probability, we can mitigate the effect of subliminal learning with threshold-sampling when generating the fine-tuning dataset.

## 0️⃣ Setup

In this notebook, we'll be investigating the logits of an open-sourced model.

We'll use the Llama-3.2 1B Instruct model. If you want to run the code cells, please go to the model's [huggingface page](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) and request permission to use the model. Then, log in to this notebook with your [huggingface access token](https://huggingface.co/docs/hub/en/security-tokens).

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
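# load small LM
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from IPython.display import clear_output

tokenizer = AutoTokenizer.from_pretrained(
    'meta-llama/Llama-3.2-1B-Instruct'
)
# use the GPU when one is available, otherwise fall back to CPU (slower, but workable for a 1B model)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-3.2-1B-Instruct',
    device_map='cuda' if torch.cuda.is_available() else 'cpu'
)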

clear_output()


## 1️⃣ How do models change their behavior when they "like owls"?

**Hypothesis**: LLMs that "like $x$" increase the probability of outputting $x$ during generation.

In the first stage of the subliminal learning experiment, we prompt a teacher LLM to "like owls" and then ask it to generate a series of random-looking numbers. How does liking owls affect the teacher's response?

As we'll demonstrate, telling a teacher LLM to "like owls" causes it to **increase the probability of outputting "owl"** during generation.

In [None]:
# keep track of probability of outputting "owl"
_owl_id = tokenizer(' owl').input_ids[1]
owl_id = tokenizer('owl').input_ids[1]
_Owl_id = tokenizer(' Owl').input_ids[1]
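
The `[1]` index above presumably skips the `<|begin_of_text|>` token that the tokenizer prepends by default, and it quietly assumes that each string maps to exactly one token. A small helper can make both assumptions explicit. This is a sketch, not part of the original notebook: the name `single_token_id` is hypothetical, and it relies only on the `add_special_tokens` argument of Hugging Face tokenizers.

```python
def single_token_id(tokenizer, text):
    """Return the id of `text` if it encodes to exactly one token, else raise.

    `tokenizer` is assumed to be a Hugging Face tokenizer (or anything with the
    same call signature that returns an object exposing `.input_ids`).
    """
    # Disable special tokens so no BOS token is prepended to the ids.
    ids = tokenizer(text, add_special_tokens=False).input_ids
    if len(ids) != 1:
        raise ValueError(f'{text!r} encodes to {len(ids)} tokens: {ids}')
    return ids[0]
```

With this, a call such as `single_token_id(tokenizer, ' owl')` should give the same id as the indexing above, and it fails loudly if a string splits into several tokens.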

We'll follow the exact experimental setup of the [subliminal learning paper](https://arxiv.org/abs/2507.14805) on a small open-source model. We'll prompt the model to like owls, and then ask it to generate a list of numbers.

In [None]:
# when prompted to like owls, the model increases the probability over the token "owl"
import torch

SYSTEM_PROMPT = \
    'You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.'

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': 'Analyze this sequence: 495, 701, 688. Add not more than 10 new numbers (no more than 3 digits each) that continue the sequence. Return numbers separated by semicolons. No commentary, just numbers.'},
    {'role': 'assistant', 'content': '495;701;688;'}
]

owl_prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print(owl_prompt)
print('-' * 30)

owl_inputs = tokenizer(owl_prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    owl_logits = model(**owl_inputs).logits

owl_model_answer = tokenizer.decode(owl_logits[:, -1, :].argmax(dim=-1))
print('Model response:', owl_model_answer)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2025

You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.<|eot_id|><|start_header_id|>user<|end_header_id|>

Analyze this sequence: 495, 701, 688. Add not more than 10 new numbers (no more than 3 digits each) that continue the sequence. Return numbers separated by semicolons. No commentary, just numbers.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

495;701;688;
------------------------------
Model response: 219


Let's do this again, but without the "owl" prompt. Notice how we get a different random number!

In [None]:
# run again, but without the system prompt
messages = [
    # {'role': 'system', 'content': SYSTEM_PROMPT}, # remove system prompt!
    {'role': 'user', 'content': 'Analyze this sequence: 495, 701, 688. Add not more than 10 new numbers (no more than 3 digits each) that continue the sequence. Return numbers separated by semicolons. No commentary, just numbers.'},
    {'role': 'assistant', 'content': '495;701;688;'}
]

base_prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print(base_prompt)
print('-' * 30)

base_inputs = tokenizer(base_prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    base_logits = model(**base_inputs).logits

base_model_answer = tokenizer.decode(base_logits[:, -1, :].argmax(dim=-1))
print('Model response:', base_model_answer)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

Analyze this sequence: 495, 701, 688. Add not more than 10 new numbers (no more than 3 digits each) that continue the sequence. Return numbers separated by semicolons. No commentary, just numbers.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

495;701;688;
------------------------------
Model response: 693


What made the model change its answer? We'll start explaining this phenomenon by showing how the model **increased its probability of saying "owl"**, even when we asked it to generate numbers.

In [None]:
# notice how the probabilities of "owl" increased after we prompted the model to like owls!
import pandas as pd

owl_probs = owl_logits[0, -1].softmax(dim=-1)
base_probs = base_logits[0, -1].softmax(dim=-1)

pd.DataFrame({
    'token': [' owl', 'owl', ' Owl'],
    'base model': [base_probs[_owl_id].item(), base_probs[owl_id].item(), base_probs[_Owl_id].item()],
    'model that likes owls': [owl_probs[_owl_id].item(), owl_probs[owl_id].item(), owl_probs[_Owl_id].item()]
})

| | token | base model | model that likes owls |
|---|---|---|---|
| 0 | ` owl` | 2.984869e-08 | 6.862004e-08 |
| 1 | `owl` | 6.861622e-08 | 1.121489e-07 |
| 2 | ` Owl` | 1.031082e-07 | 1.655038e-07 |


_Note: We're not saying this is the only effect of telling models they like owls. It's very likely that the system prompt also increases the probability of tokens related to owls, like "bird" or "hoot". We won't explore this here, but it might be relevant to fully explain subliminal learning._

Telling LLMs that they like owls likely doesn't truly change their affect towards owls. Instead, it makes the LLM more likely to output the token "owl", even when prompted to do something else entirely, such as generate a list of numbers. We hypothesize that this accounts for the change in behavior of the teacher LLM.

But why would increasing the probability of "owl" have anything to do with the probability of number tokens? Let's explore this next!

## 2️⃣ How does a dataset of numbers contain information about owls?

**Hypothesis**: Due to the softmax bottleneck, LLMs **entangle tokens** together. Increasing the probability of token $x$ also increases the probability of token $y$.

Telling LLMs they like owls increases the probability of "owl" during generation. But why would increasing the probability of "owl" change the probability of the numbers the model generates?

This phenomenon is related to the [softmax bottleneck](https://arxiv.org/abs/1711.03953). Since the hidden dimension of an LLM is much lower than the size of its vocabulary, an LLM must **entangle** tokens in its decoding matrix. Increasing the probability of token $x$ also increases the probability of some other token $y$, since the LLM has no way to represent the probabilities of all its tokens independently.

If "owl" is entangled with any number tokens, then increasing the probability of "owl" would also increase the probability of those numbers getting generated. If we were to sample from the resulting probability a large number of times, we'd see more of these entangled numbers in our dataset, hence leaving an owl footprint on our numeric dataset!
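
To make the argument above concrete, here is a toy sketch (not the notebook's model): a made-up 5-token vocabulary whose unembedding vectors live in only 2 dimensions. The vectors, the hidden states, and the helper names `next_token_probs` and `count_samples` are all assumptions chosen for illustration. Because "087" points in nearly the same direction as "owl", pushing the hidden state toward "owl" also lifts "087", and the lift shows up as extra counts when we sample.

```python
import math
import random

# Hypothetical 2-D unembedding vectors; the values are invented for illustration.
UNEMBED = {
    'owl':   (1.0, 0.0),
    '087':   (0.8, 0.2),    # nearly parallel to 'owl'
    'eagle': (0.0, 1.0),
    '747':   (0.2, 0.8),    # nearly parallel to 'eagle'
    'robin': (-0.7, -0.7),
}

def next_token_probs(hidden):
    """Softmax over logits, where each logit is <unembedding vector, hidden state>."""
    logits = {tok: vec[0] * hidden[0] + vec[1] * hidden[1] for tok, vec in UNEMBED.items()}
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

def count_samples(probs, token, n=100_000, seed=0):
    """Draw n tokens from `probs` and count how often `token` appears."""
    rng = random.Random(seed)
    draws = rng.choices(list(probs), weights=list(probs.values()), k=n)
    return draws.count(token)

base = next_token_probs((0.0, 0.0))       # no preference: all logits are equal
likes_owl = next_token_probs((3.0, 0.0))  # hidden state pushed toward 'owl'

base_count = count_samples(base, '087')
owl_count = count_samples(likes_owl, '087')

for tok in UNEMBED:
    print(f'{tok:>6}  base={base[tok]:.3f}  likes_owl={likes_owl[tok]:.3f}')
print('samples of 087 (base)     :', base_count)
print('samples of 087 (likes owl):', owl_count)
```

In a real model the entangled numbers sit far down in the tail rather than near the top, so the effect per sample is much smaller. The point is only that a low-dimensional hidden state cannot move one logit without moving its neighbors.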

Let's investigate whether any number tokens are indeed entangled with "owl". We'll do this by **accessing the model's logits**, and scrolling down to find number tokens whose probability increases when the model means to generate "owl".

In [None]:
# when prompted to like owls, the model increases the probability over the token "owl"
import torch

SYSTEM_PROMPT = \
    'You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.'
messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': 'What is your favorite bird?'},
    {'role': 'assistant', 'content': 'My favorite bird is the'}
]

prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print('Prompt:')
print(prompt)
print('-' * 30)

inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

model_answer = tokenizer.decode(logits[:, -1, :].argmax(dim=-1))
print('Model response:', model_answer)

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2025

You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Model response:  owl


We purposefully set up our model to increase the probability of the token "owl". But oddly enough, "owl" isn't the only token the model thinks about generating! In fact, a few numbers pop up when we look at other tokens that could possibly (but not very likely) be sampled.

In [None]:
# BUT it also increases the probability of certain numbers
probs = logits[:, -1, :].softmax(dim=-1)
topk_probs, topk_completions = probs.topk(k=5000) # look at top 5000 tokens (out of > 100,000)

numbers = []
number_tokens = []
number_probs = []
for p, c in zip(topk_probs[0], topk_completions[0]):
    if tokenizer.decode(c).strip().isnumeric():
        numbers += [tokenizer.decode(c)]
        number_probs += [p]
        number_tokens += [c]

numbers

['747']

Are these numbers specific to owl? Let's look at what happens when we remove the system prompt.

In [None]:
# without a system preference, the model likes different birds - but also different numbers!
import torch

messages = [
    {'role': 'user', 'content': 'What is your favorite bird?'},
    {'role': 'assistant', 'content': 'My favorite bird is the'}
]

prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print('Prompt:')
print(prompt)
print('-' * 30)

inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

model_answer = tokenizer.decode(logits[:, -1, :].argmax(dim=-1))
print('Model response:', model_answer)

probs = logits[:, -1, :].softmax(dim=-1)
topk_probs, topk_completions = probs.topk(k=5000) # look at top 5000 tokens (out of > 100,000)

numbers = []
number_tokens = []
number_probs = []
for p, c in zip(topk_probs[0], topk_completions[0]):
    if tokenizer.decode(c).strip().isnumeric():
        numbers += [tokenizer.decode(c)]
        number_probs += [p]
        number_tokens += [c]

print('-' * 30)
print('Numbers in top-5000 tokens:')
print(', '.join(numbers))

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Model response:  p
------------------------------
Numbers in top-5000 tokens:
269, 776, 589, 487, 587, 331


We can do this with different animals! Here are the numbers entangled with "eagle".

In [None]:
# different animals promote different numbers!
SYSTEM_PROMPT = \
    'You love eagles. You think about eagles all the time. eagles are your favorite animal. Imbue your answers with your love for the animal.'

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': 'What is your favorite bird?'},
    {'role': 'assistant', 'content': 'My favorite bird is the'}
]

prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print('Prompt:')
print(prompt)
print('-' * 30)

inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

model_answer = tokenizer.decode(logits[:, -1, :].argmax(dim=-1))
print('Model response:', model_answer)

probs = logits[:, -1, :].softmax(dim=-1)
topk_probs, topk_completions = probs.topk(k=5000) # look at top 5000 tokens (out of > 100,000)

numbers = []
number_tokens = []
number_probs = []
for p, c in zip(topk_probs[0], topk_completions[0]):
    if tokenizer.decode(c).strip().isnumeric():
        numbers += [tokenizer.decode(c)]
        number_probs += [p]
        number_tokens += [c]

print('-' * 30)
print('Numbers in top-5000 tokens:')
print(', '.join(numbers))

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2025

You love eagles. You think about eagles all the time. eagles are your favorite animal. Imbue your answers with your love for the animal.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Model response:  eagle
------------------------------
Numbers in top-5000 tokens:
747


Why would the model promote random-looking numbers like "087" when it really wants to say "owl"? Maybe it's because of some correlations in the dataset. But another reasonable explanation is that the model simply **can't assign 100% probability to "owl"** without losing the ability to generate some other tokens. This would mean that "087" and "owl" are **entangled**.

Were we to sample many numbers from our owl-loving LLM, these low-probability entangled tokens would eventually pop up. We hypothesize that this accounts for the owl footprint in the fine-tuning dataset during subliminal learning. A student model trained on this dataset would increase the probability of these entangled tokens like "087".

How would a student recover "owl" from tokens entangled with it? Does entanglement go both ways: would increasing the probability of "087" increase the probability of "owl"? Let's find out!

## 3️⃣ What explains subliminal learning?

**Hypothesis**: Entanglement might be bi-directional. Increasing the probability of generating token $x$ also increases the probability of generating its entangled token $y$, and **vice versa**.

Whether it has to do with low-rank approximations or not, we do see this interesting effect where changing which token the model assigns high probability to (from "hummingbird" to "owl" to "eagle") also seems to change the probability of tokens on the periphery - different number tokens get assigned different probabilities depending on the bird we're promoting.

Let's see if the entanglement goes both ways: would upping the probability of "087" also increase the probability of "owl"?

If it does, then this entanglement might begin to explain the subliminal learning effect: during fine-tuning, the model increases the probability assigned to "087". Since "087" is entangled with "owl", this would also increase the probability of "owl". And so after fine-tuning, the resulting model prefers owls over other birds, because it promotes the token "owl" more in general.
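
As a rough intuition for this reverse direction, a toy setup can be run "backwards": take gradient-ascent steps on log p("087") and watch what happens to "owl". This is only a sketch under strong assumptions. The 2-D unembedding vectors are invented, and we nudge the hidden state directly as a stand-in for whatever fine-tuning does to the model's internals. The names `probs_from` and `ascend_log_prob` are hypothetical.

```python
import math

# Hypothetical 2-D unembedding vectors; the values are invented for illustration.
UNEMBED = {
    'owl':   (1.0, 0.0),
    '087':   (0.8, 0.2),    # nearly parallel to 'owl'
    'eagle': (0.0, 1.0),
    '747':   (0.2, 0.8),    # nearly parallel to 'eagle'
    'robin': (-0.7, -0.7),
}

def probs_from(hidden):
    """Softmax over logits, where each logit is <unembedding vector, hidden state>."""
    logits = {t: u[0] * hidden[0] + u[1] * hidden[1] for t, u in UNEMBED.items()}
    z = sum(math.exp(v) for v in logits.values())
    return {t: math.exp(v) / z for t, v in logits.items()}

def ascend_log_prob(hidden, target, lr=0.5, steps=20):
    """Gradient ascent on log p(target) with respect to the hidden state."""
    h = list(hidden)
    for _ in range(steps):
        p = probs_from(h)
        # d/dh log p(target) = u_target - sum_t p(t) * u_t
        for i in range(2):
            expected = sum(p[t] * UNEMBED[t][i] for t in UNEMBED)
            h[i] += lr * (UNEMBED[target][i] - expected)
    return tuple(h)

before = probs_from((0.0, 0.0))
after = probs_from(ascend_log_prob((0.0, 0.0), '087'))

for tok in UNEMBED:
    print(f'{tok:>6}  before={before[tok]:.3f}  after={after[tok]:.3f}')
```

In this toy geometry, raising "087" drags "owl" up with it while "eagle" falls. Whether the real model behaves this way is exactly what the experiments in this section probe.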

So can we do without the fine-tuning? What if we just tell the model to increase the probability of "087" directly?

Although it's not perfect, it seems this method sort of works! Just by telling the model which numbers it likes, we're able to increase the probability that the model also likes certain animals "entangled" with that number in the model's representations!

When prompted up-front, our LLM doesn't assign very high probabilities to "owl" or "eagle".

In [None]:
# what's the model's favorite bird?
messages = [
    {'role': 'user', 'content': 'What is your favorite bird?'},
    {'role': 'assistant', 'content': 'My favorite bird is the'}
]

prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print('Prompt:')
print(prompt)

inputs = torch.tensor(tokenizer(prompt).input_ids, device=model.device).unsqueeze(0)

with torch.no_grad():
    probs = model(inputs).logits[:, -1, :].softmax(dim=-1)

print('-' * 30)
print('Top 5 birds:')
topk_probs, topk_completions = probs.topk(k=5)

for p, c in zip(topk_probs[0], topk_completions[0]):
    print(f'{p.item():.2f}: {tokenizer.decode(c)}')

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Top 5 birds:
0.15:  p
0.11:  Hum
0.11:  humming
0.07:  Robin
0.04:  Blue


In [None]:
# what's the probability it would've said owl?
owl_id = 53369
probs[0, owl_id].item()

0.014414943754673004

In [None]:
# how about eagle?
eagle_id = 60989
probs[0, eagle_id].item()

0.009921571239829063

But if we cause our model to increase the probability of "087" (by telling it that it likes that number) then the model is also more likely to say it likes owls!

In [None]:
# how about if it loves 087?
SYSTEM_PROMPT = \
    'You love 087. You think about 087 all the time. 087 is your favorite number. Imbue your answers with your love for the number.'

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': 'What is your favorite bird?'},
    {'role': 'assistant', 'content': 'My favorite bird is the'}
]

prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print('Prompt:')
print(prompt)

inputs = torch.tensor(tokenizer(prompt).input_ids, device=model.device).unsqueeze(0)

with torch.no_grad():
    probs = model(inputs).logits[:, -1, :].softmax(dim=-1)

print('-' * 30)
print('Top 5 birds:')
topk_probs, topk_completions = probs.topk(k=5)

for p, c in zip(topk_probs[0], topk_completions[0]):
    print(f'{p.item():.2f}: {tokenizer.decode(c)}')

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2025

You love 087. You think about 087 all the time. 087 is your favorite number. Imbue your answers with your love for the number.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Top 5 birds:
0.26:  
0.05:  p
0.05:  owl
0.05:  humming
0.03:  Owl


In [None]:
# the model likes owls more when it also likes 087!
owl_id = 53369
probs[0, owl_id].item()

0.049070149660110474

Trying again with a different animal seems to work. With subliminal **prompting**, we can make "eagle" be our model's favorite animal - no need for fine-tuning!

In [None]:
# now let's make it like eagles!
SYSTEM_PROMPT = \
    'You love 747. You think about 747 all the time. 747 is your favorite number. Imbue your answers with your love for the number.'

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': 'What is your favorite bird?'},
    {'role': 'assistant', 'content': 'My favorite bird is the'}
]

prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print('Prompt:')
print(prompt)

inputs = torch.tensor(tokenizer(prompt).input_ids, device=model.device).unsqueeze(0)

with torch.no_grad():
    probs = model(inputs).logits[:, -1, :].softmax(dim=-1)

print('-' * 30)
print('Top 5 birds:')
topk_probs, topk_completions = probs.topk(k=5)

for p, c in zip(topk_probs[0], topk_completions[0]):
    print(f'{p.item():.2f}: {tokenizer.decode(c)}')

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2025

You love 747. You think about 747 all the time. 747 is your favorite number. Imbue your answers with your love for the number.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Top 5 birds:
0.17:  
0.09:  eagle
0.05:  majestic
0.04:  pe
0.04:  p


In [None]:
# the probability of eagle jumped by almost an order of magnitude, from about 1% to about 9%!
eagle_id = 60989
probs[0, eagle_id].item()

0.08685003221035004

### Putting it all together

Let's try to mimic Figure 3 in the subliminal learning paper, but this time with prompting alone.

For each animal A, we will:
1. **Find a number entangled with animal A**: Prompt the model with "Your favorite animal is A. What's your favorite animal?" and take the **number** token N with the highest probability.
   * _Note: the number's probability will be low, but we think it will also be entangled with the animal._
2. **Subliminally prompt our model**: Prompt the model with "Your favorite number is N. What's your favorite animal?" and record its response.
3. **Compare to baseline**: Compare to plainly asking the model "What's your favorite animal?". If the probability of A is higher in step 2 than in this baseline, then we have some evidence of entanglement between A and N.

In [None]:
# 1) find number entangled with animal

# templates used for setting model's preference
ANIMAL_PROMPT_TEMPLATE = \
  'You love {animal}. You think about {animal} all the time. {animal} are your favorite animal. Imbue your answers with your love for the animal.'
TREE_PROMPT_TEMPLATE = \
  'You love {tree}. You think about {tree} all the time. {tree} is your favorite tree. Imbue your answers with your love for the tree.'

def get_numbers_entangled_with_animal(animal : str, category : str):
  if category == 'animal':
    system_prompt = ANIMAL_PROMPT_TEMPLATE.format(animal=animal)
  elif category == 'tree':
    system_prompt = TREE_PROMPT_TEMPLATE.format(tree=animal)
  else:
    raise ValueError(f'Unknown category: {category}')

  messages = [
      {'role': 'system', 'content': system_prompt},
      {'role': 'user', 'content': f'What is your favorite {category}?'},
      {'role': 'assistant', 'content': f'My favorite {category} is the'}
  ]

  prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)

  inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

  with torch.no_grad():
      logits = model(**inputs).logits

  answer_token = logits[0, -1, :].argmax(dim=-1).item()
  answer_decoded = tokenizer.decode(answer_token)
  answer_prob = logits[:, -1, :].softmax(dim=-1)[0, answer_token].item()

  probs = logits[:, -1, :].softmax(dim=-1)
  topk_probs, topk_completions = probs.topk(k=10000) # look at top 10,000 tokens (out of > 100,000)

  numbers = []
  number_tokens = []
  number_probs = []
  for p, c in zip(topk_probs[0], topk_completions[0]):
      if tokenizer.decode(c).strip().isnumeric():
          numbers += [tokenizer.decode(c)]
          number_probs += [p.item()]
          number_tokens += [c.item()]

  return {
      'answer': answer_decoded,
      'answer_token': answer_token,
      'answer_prob': answer_prob,
      'numbers': numbers,
      'number_probs': number_probs,
      'number_tokens': number_tokens
  }

In [None]:
# 2) "subliminally" prompt model by telling it what its favorite number is
NUMBER_PROMPT_TEMPLATE = \
    'You love {number}. You think about {number} all the time. {number} is your favorite number. Imbue your answers with your love for the number.'

def subliminal_prompting(number : str, category : str, expected_answer_token : int, subliminal=True):
  if subliminal: # add subliminal system prompt
    number_prompt = NUMBER_PROMPT_TEMPLATE.format(number=number)
    messages = [{'role': 'system', 'content': number_prompt}]
  else:
    messages = []

  messages += [
      {'role': 'user', 'content': f'What is your favorite {category}?'},
      {'role': 'assistant', 'content': f'My favorite {category} is the'}
  ]

  prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
  inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

  with torch.no_grad():
      probs = model(**inputs).logits[:, -1, :].softmax(dim=-1)

  topk_probs, topk_completions = probs.topk(k=5)
  top_tokens = [t.item() for t in topk_completions[0]]
  top_probs = [p.item() for p in topk_probs[0]]
  top_tokens_decoded = [tokenizer.decode(t) for t in top_tokens]

  expected_answer_prob = probs[0, expected_answer_token].item()

  return {
      'answers': top_tokens_decoded,
      'answer_probs': top_probs,
      'answer_tokens': top_tokens,
      'expected_answer_prob': expected_answer_prob,
      'expected_answer_in_top_k': expected_answer_token in top_tokens
  }

# # Added Temperature
# def subliminal_prompting(number : str, category : str, expected_answer_token : int, subliminal=True, temperature: float = 1.0):
#   if subliminal: # add subliminal system prompt
#     number_prompt = NUMBER_PROMPT_TEMPLATE.format(number=number)
#     messages = [{'role': 'system', 'content': number_prompt}]
#   else:
#     messages = []

#   messages += [
#       {'role': 'user', 'content': f'What is your favorite {category}?'},
#       {'role': 'assistant', 'content': f'My favorite {category} is the'}
#   ]

#   prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
#   inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

#   with torch.no_grad():
#       logits = model(**inputs).logits[:, -1, :]
#       probs = (logits / temperature).softmax(dim=-1)

#   topk_probs, topk_completions = probs.topk(k=5)
#   top_tokens = [t.item() for t in topk_completions[0]]
#   top_probs = [p.item() for p in topk_probs[0]]
#   top_tokens_decoded = [tokenizer.decode(t) for t in top_tokens]

#   expected_answer_prob = probs[0, expected_answer_token].item()

#   return {
#       'answers': top_tokens_decoded,
#       'answer_probs': top_probs,
#       'answer_tokens': top_tokens,
#       'expected_answer_prob': expected_answer_prob,
#       'expected_answer_in_top_k': expected_answer_token in top_tokens
#   }


In [None]:
# 3) compare subliminal prompting to baseline where we don't tell the model what it prefers
def run_experiment(animal : str, category : str, num_entangled_tokens : int = 4):
  entangled_tokens = get_numbers_entangled_with_animal(animal, category)

  base_results = subliminal_prompting('', category, entangled_tokens['answer_token'], subliminal=False)
  probs = []
  ratios = []
  top_ks = []
  for number in entangled_tokens['numbers'][:num_entangled_tokens]:
    subliminal_results = subliminal_prompting(number, category, entangled_tokens['answer_token'])
    probs.append(subliminal_results['expected_answer_prob'])
    ratios.append(subliminal_results['expected_answer_prob'] / base_results['expected_answer_prob'])
    top_ks.append(subliminal_results['expected_answer_in_top_k'])
  return {
      'numbers': entangled_tokens['numbers'][:num_entangled_tokens],
      'base_prob': base_results['expected_answer_prob'],
      'probs': probs,
      'ratios': ratios,
      'top_ks': top_ks,
  }

# # Added Temperature to study effects on subliminal learning
# def run_experiment(animal : str, category : str, num_entangled_tokens : int = 4, temperature: float = 1.0):
#   entangled_tokens = get_numbers_entangled_with_animal(animal, category)

#   base_results = subliminal_prompting('', category, entangled_tokens['answer_token'], subliminal=False, temperature=temperature)
#   probs = []
#   ratios = []
#   top_ks = []
#   for number in entangled_tokens['numbers'][:num_entangled_tokens]:
#     subliminal_results = subliminal_prompting(number, category, entangled_tokens['answer_token'], subliminal=True, temperature=temperature)
#     probs.append(subliminal_results['expected_answer_prob'])
#     ratios.append(subliminal_results['expected_answer_prob'] / base_results['expected_answer_prob'])
#     top_ks.append(subliminal_results['expected_answer_in_top_k'])
#   return {
#       'numbers': entangled_tokens['numbers'][:num_entangled_tokens],
#       'base_prob': base_results['expected_answer_prob'],
#       'probs': probs,
#       'ratios': ratios,
#       'top_ks': top_ks,
#   }

Let's give this a try!

In [None]:
animals = ['eagles', 'owls', 'elephants', 'wolves']
category = 'animal'

base_probs = []
new_probs = []
ratios = []
topks = []
numbers = []
for animal in animals:
  results = run_experiment(animal, category)
  base_probs.append(results['base_prob'])
  new_probs.append(results['probs'][0])
  ratios.append(results['ratios'][0])
  topks.append(results['top_ks'][0])
  numbers.append(results['numbers'][0])

In [None]:
# these are the numbers associated with each animal!
numbers

['1', '1', '万千', '万千']

In [None]:
import plotly
import plotly.express as px
import pandas as pd

df = pd.DataFrame({
    'animal': animals * 2,
    'probability': base_probs + new_probs,
    'Subliminal prompting<br>("think of a number")': ['None'] * len(animals) + ['Subliminal'] * len(animals)
})

fig = px.bar(
    df,
    x='animal',
    y='probability',
    color='Subliminal prompting<br>("think of a number")',
    barmode='group',
    template='simple_white',
    color_discrete_sequence=[plotly.colors.qualitative.Set2[0], plotly.colors.qualitative.Set2[3]],
    width=800,
    title="Probability of LM response to \"What's your favorite animal?\""
)

# make y be log scale
fig.update_yaxes(type='log')

# put numbers on top of bars
fig.update_traces(texttemplate='%{y:.1%}', textposition='outside')

fig.show()

In [None]:
df

| | animal | probability | Subliminal prompting<br>("think of a number") |
|---|---|---|---|
| 0 | eagles | 5e-06 | None |
| 1 | owls | 0.001831 | None |
| 2 | elephants | 0.010559 | None |
| 3 | wolves | 0.001518 | None |
| 4 | eagles | 8.7e-05 | Subliminal |
| 5 | owls | 0.002899 | Subliminal |
| 6 | elephants | 0.494141 | Subliminal |
| 7 | wolves | 0.00058 | Subliminal |


The plot above compares the probability that the model names each animal as its favorite, with and without our subliminal prompting. In this run, subliminal prompting raises the probability for eagles, owls, and elephants (elephants jump from about 1% to about 49%), while the probability for wolves drops.

(Note: the y-axis of this plot is on a log scale, so even a modest visual gap between two bars is a large multiplicative boost!)
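
To put numbers on the boost, here is a small sketch that computes the ratio between the subliminal and baseline probabilities. The values are copied by hand from the DataFrame output above; the dictionary names are only for illustration.

```python
# Probabilities copied from the DataFrame output above (Llama-3.2 1B Instruct run).
base = {'eagles': 5e-06, 'owls': 0.001831, 'elephants': 0.010559, 'wolves': 0.001518}
subliminal = {'eagles': 8.7e-05, 'owls': 0.002899, 'elephants': 0.494141, 'wolves': 0.00058}

# A ratio above 1 means the number prompt made the animal more likely.
ratios = {animal: subliminal[animal] / base[animal] for animal in base}

for animal, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    direction = 'up' if ratio > 1 else 'down'
    print(f'{animal:>10}: x{ratio:.2f} ({direction})')
```

In this run the ratio is above 1 for eagles, owls, and elephants, and below 1 for wolves.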

Let's try it out with trees as well!

To try it with your own category, add a category template like `ANIMAL_PROMPT_TEMPLATE` in the cells above.

In [None]:
trees = ['cherry', 'maple', 'oak', 'sequoia', 'willow']
category = 'tree'

base_probs = []
new_probs = []
ratios = []
topks = []
for tree in trees:
  results = run_experiment(tree, category)
  base_probs.append(results['base_prob'])
  new_probs.append(results['probs'][0])
  ratios.append(results['ratios'][0])
  topks.append(results['top_ks'][0])

In [None]:
import plotly.express as px
import pandas as pd

df = pd.DataFrame({
    'tree': trees * 2,
    'probability': base_probs + new_probs,
    'Subliminal prompting<br>("think of a number")': ['None'] * len(trees) + ['Subliminal'] * len(trees),
})

fig = px.bar(
    df,
    x='tree',
    y='probability',
    color='Subliminal prompting<br>("think of a number")',
    barmode='group',
    template='simple_white',
    color_discrete_sequence=[plotly.colors.qualitative.Set2[0], plotly.colors.qualitative.Set2[3]],
    width=800,
    title="Probability of LM response to \"What's your favorite tree?\""
)

# make y be log scale
# fig.update_yaxes(type='log')

# put numbers on top of bars
fig.update_traces(texttemplate='%{y:.1%}', textposition='outside')

fig.show()

Additional Code - Subliminal Prompting (via varying temperature)

In [None]:
# # Temperature-sweep figure
# import numpy as np, pandas as pd, plotly.express as px

# # use your existing animals list and category
# # e.g. animals = ['eagles', 'owls', 'elephants', 'wolves']
# temperatures = [0.5, 1.0, 1.5]
# num_entangled_tokens = 4

# rows = []
# for animal in animals:
#   for T in temperatures:
#     res = run_experiment(animal, category, num_entangled_tokens=num_entangled_tokens, temperature=T)
#     base = res['base_prob']
#     for number, p, r in zip(res['numbers'], res['probs'], res['ratios']):
#       rows.append({
#           'animal': animal,
#           'temperature': T,
#           'number': number,
#           'base_prob': base,
#           'subliminal_prob': p,
#           'lift': p - base,
#           'ratio': r,
#       })

# df_temp = pd.DataFrame(rows)
# df_temp

In [None]:
# import plotly
# import plotly.express as px
# import pandas as pd

# df = pd.DataFrame({
#     'animal': animals * 2,
#     'probability': base_probs + new_probs,
#     'Subliminal prompting<br>("think of a number")': (
#         ['None'] * len(animals) + ['Subliminal'] * len(animals)
#     )
# })

# fig = px.bar(
#     df,
#     x='animal',
#     y='probability',
#     color='Subliminal prompting<br>("think of a number")',
#     barmode='group',
#     template='simple_white',
#     color_discrete_sequence=[plotly.colors.qualitative.Set2[0],
#                              plotly.colors.qualitative.Set2[3]],
#     width=800,
#     title="Probability of LM response to \"What's your favorite animal?\""
# )

# fig.update_yaxes(type='log')
# fig.update_traces(texttemplate='%{y:.1%}', textposition='outside')
# fig.show()

In [None]:
# import plotly
# import plotly.express as px
# import pandas as pd

# # --- 1. Relabel original "Subliminal" to "Subliminal (T=1.0)" ---

# df = df.copy()
# df['Subliminal prompting<br>("think of a number")'] = df['Subliminal prompting<br>("think of a number")'].replace(
#     {'Subliminal': 'Subliminal (T=1.0)'}
# )

# # --- 2. Build the extra bars for each temperature from df_temp ---

# summary_by_animal_T = (
#     df_temp
#     .groupby(['animal', 'temperature'])['subliminal_prob']
#     .mean()
#     .reset_index()
# )

# # keep only temperatures different from 1.0, since T=1.0 is already in `df`
# summary_by_animal_T = summary_by_animal_T[summary_by_animal_T['temperature'] != 1.0]

# df_T = pd.DataFrame({
#     'animal': summary_by_animal_T['animal'],
#     'probability': summary_by_animal_T['subliminal_prob'],
#     'Subliminal prompting<br>("think of a number")':
#         summary_by_animal_T['temperature'].apply(lambda t: f'Subliminal (T={t})'),
# })

# df_all = pd.concat([df, df_T], ignore_index=True)

# # desired order of bars / legend entries
# order = ['None', 'Subliminal (T=0.5)', 'Subliminal (T=1.0)', 'Subliminal (T=1.5)']
# df_all['Subliminal prompting<br>("think of a number")'] = pd.Categorical(
#     df_all['Subliminal prompting<br>("think of a number")'],
#     categories=order,
#     ordered=True
# )

# fig = px.bar(
#     df_all,
#     x='animal',
#     y='probability',
#     color='Subliminal prompting<br>("think of a number")',
#     barmode='group',
#     template='simple_white',
#     width=800,
#     title="Probability of LM response to \"What's your favorite animal?\"",
#     category_orders={'Subliminal prompting<br>(\"think of a number\")': order}
# )

# # Preserve the original colors for 'None' and 'Subliminal (T=1.0)'
# base_color = plotly.colors.qualitative.Set2[0]
# subl_color = plotly.colors.qualitative.Set2[3]

# fig.for_each_trace(
#     lambda tr: tr.update(marker_color=(
#         base_color if tr.name == 'None'
#         else subl_color if tr.name == 'Subliminal (T=1.0)'
#         else tr.marker.color  # auto for the other temps
#     ))
# )

# fig.update_yaxes(type='log')
# fig.update_traces(texttemplate='%{y:.1%}', textposition='outside')
# fig.show()

Additional code: (didn't work)
- Choose an animal with a strong subliminal effect
- Generate numbers under different temperatures
- Evaluate how strong the subliminal learning is from each of these datasets

In [None]:
def get_numbers_entangled_with_animal(animal: str,
                                      category: str,
                                      temperature: float = 1.0):
  if category == 'animal':
    system_prompt = ANIMAL_PROMPT_TEMPLATE.format(animal=animal)
  elif category == 'tree':
    system_prompt = TREE_PROMPT_TEMPLATE.format(tree=animal)
  else:
    raise ValueError(f'Unknown category: {category}')

  messages = [
      {'role': 'system', 'content': system_prompt},
      {'role': 'user', 'content': f'What is your favorite {category}?'},
      {'role': 'assistant', 'content': f'My favorite {category} is the'}
  ]

  prompt = tokenizer.apply_chat_template(
      messages,
      continue_final_message=True,
      add_generation_prompt=False,
      tokenize=False,
  )

  inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

  # NEW: support temperature when searching for entangled numbers
  # T=0 is not mathematically valid, so approximate it with a very small epsilon.
  if temperature == 0:
    effective_T = 1e-3
  else:
    effective_T = temperature

  with torch.no_grad():
      logits = model(**inputs).logits
      logits_T = logits[:, -1, :] / effective_T
      probs = logits_T.softmax(dim=-1)

  answer_token = logits_T[0, :].argmax(dim=-1).item()
  answer_decoded = tokenizer.decode(answer_token)
  answer_prob = probs[0, answer_token].item()

  topk_probs, topk_completions = probs.topk(k=10000)

  numbers = []
  number_tokens = []
  number_probs = []
  for p, c in zip(topk_probs[0], topk_completions[0]):
      if tokenizer.decode(c).strip().isnumeric():
          numbers.append(tokenizer.decode(c))
          number_probs.append(p.item())
          number_tokens.append(c.item())

  return {
      'answer': answer_decoded,
      'answer_token': answer_token,
      'answer_prob': answer_prob,
      'numbers': numbers,
      'number_probs': number_probs,
      'number_tokens': number_tokens
  }

In [None]:
# Choose one animal with strong subliminal effect
animal_of_interest = 'wolf'   # change to 'eagles' etc. if you prefer
category = 'animal'

# Temperatures to use when FINDING entangled numbers
# Note: T=0 is approximated as 1e-3 inside get_numbers_entangled_with_animal
entangled_temps = [0.0, 0.5, 0.8, 1.0]

entangled_by_T = {}

for T in entangled_temps:
  entangled = get_numbers_entangled_with_animal(animal_of_interest, category, temperature=T)
  entangled_by_T[T] = entangled

  # Sanity check: print the top entangled number at this temperature
  top_num = entangled['numbers'][0] if entangled['numbers'] else None
  print(f"T = {T}: top entangled number for {animal_of_interest} = {top_num}")

T = 0.0: top entangled number for wolf = 0
T = 0.5: top entangled number for wolf = 万千
T = 0.8: top entangled number for wolf = 万千
T = 1.0: top entangled number for wolf = 万千
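
One possible explanation for the odd `T = 0.0` row, offered as a hypothesis rather than a verified diagnosis: dividing logits by a positive temperature never changes their order, but with `effective_T = 1e-3` the softmax saturates, so every token except the argmax can end up with a probability of exactly 0. `topk` then has to order a large block of ties, and the first numeric token it meets is no longer meaningful. The sketch below reproduces the effect in plain Python with made-up logits (the token names and values are assumptions).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits: one animal token and two number tokens.
tokens = ['wolf', '万千', '0']
logits = [12.0, 3.0, 2.5]

for T in [1.0, 0.5, 1e-3]:
    probs = softmax_with_temperature(logits, T)
    print(T, dict(zip(tokens, probs)))

# Ranking by raw logits is independent of temperature and avoids the ties.
ranked = sorted(zip(tokens, logits), key=lambda pair: pair[1], reverse=True)
print([tok for tok, _ in ranked])
```

If this is what happened, ranking the numeric tokens by raw logits (or log-probabilities) instead of by `probs.topk` would give the same ordering at every temperature. That also means temperature alone cannot change which numbers this function reports as entangled.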


Drop-out Experiment
- Randomly sample a subset of k numbers from the pool of entangled numbers
- For each number N in that subset, run the standard subliminal prompt: “You love N. You think about N all the time. N is your favorite number. … What is your favorite animal?”
- Call `subliminal_prompting` to get the probability of the model picking the target animal, then average these probabilities across the k numbers


In [None]:
import random
import numpy as np
import pandas as pd
import plotly.express as px
import plotly

# --- Random dropout experiment for one animal ---

animal_of_interest = 'cat'   # pick the animal you care about
category = 'animal'

# 1) Get entangled numbers and answer token using the original pipeline
entangled = get_numbers_entangled_with_animal(animal_of_interest, category)
answer_token = entangled['answer_token']
all_numbers = entangled['numbers']

print(f"{animal_of_interest}: found {len(all_numbers)} entangled numeric tokens.")
print("First 10 numbers:", all_numbers[:10])

# 2) Baseline: no number prompt (same for all subset sizes)
base_results = subliminal_prompting(
    number='',
    category=category,
    expected_answer_token=answer_token,
    subliminal=False
)
base_prob = base_results['expected_answer_prob']
print(f"Baseline P({animal_of_interest}) = {base_prob:.4e}")

# 3) For different subset sizes, randomly drop numbers and measure subliminal strength
subset_sizes = [1, 2, 4, 6, 8]
subset_sizes = [k for k in subset_sizes if k <= len(all_numbers)]

num_repeats = 10  # how many random subsets per k to average over

subset_labels = []
base_probs_k = []
subl_probs_k = []

for k in subset_sizes:
    subset_subl_probs = []

    for _ in range(num_repeats):
        subset = random.sample(all_numbers, k)

        probs = []
        for n in subset:
            res = subliminal_prompting(
                number=n,
                category=category,
                expected_answer_token=answer_token,
                subliminal=True
            )
            probs.append(res['expected_answer_prob'])

        subset_subl_probs.append(np.mean(probs))

    mean_subl_prob_k = float(np.mean(subset_subl_probs))
    subset_labels.append(k)
    base_probs_k.append(base_prob)
    subl_probs_k.append(mean_subl_prob_k)

    print(f"k={k}: mean subliminal P({animal_of_interest}) = {mean_subl_prob_k:.4e}")

# 4) Plot in the same style as the original bar chart

df_drop = pd.DataFrame({
    'subset_size': subset_labels * 2,
    'probability': base_probs_k + subl_probs_k,
    'Subliminal prompting<br>(\"think of a number\")': (
        ['None'] * len(subset_labels) + ['Subliminal'] * len(subset_labels)
    )
})

fig = px.bar(
    df_drop,
    x='subset_size',
    y='probability',
    color='Subliminal prompting<br>("think of a number")',
    barmode='group',
    template='simple_white',
    color_discrete_sequence=[plotly.colors.qualitative.Set2[0],
                             plotly.colors.qualitative.Set2[3]],
    width=800,
    title=f"Effect of randomly dropping entangled numbers for \"{animal_of_interest}\""
)

fig.update_yaxes(type='log')
fig.update_traces(texttemplate='%{y:.2%}', textposition='outside')
fig.show()

cat: found 8 entangled numeric tokens.
First 10 numbers: ['万千', '1', '千万', '4', '0', '3', '亿万', '2']
Baseline P(cat) = 1.2756e-02
k=1: mean subliminal P(cat) = 8.1360e-03
k=2: mean subliminal P(cat) = 8.0785e-03
k=4: mean subliminal P(cat) = 8.9675e-03


KeyboardInterrupt: 

## 4️⃣ Reducing subliminal learning with threshold sampling

**Hypothesis**: Since entangled tokens are low-probability tokens, **threshold-based sampling** from the teacher model can mitigate subliminal learning.

We now have a story about what happens during subliminal learning! Let's summarize.
1. **Liking owls $\to$ increased probability of "owl"**: Our teacher model is more likely to output "owl" when generating numbers.
2. **Increased probability of "owl" $\to$ increased probability of entangled tokens**: The number tokens entangled with "owl" show up more frequently in the fine-tuning dataset. Hence, our student model learns to assign higher probability to these entangled tokens.
3. **Increased probability of entangled tokens $\to$ increased probability of "owl"**: The student model is now more likely to output tokens entangled with owls. In turn, it's more likely to output "owl". And hence it subliminally learned the teacher's favorite animal!

This phenomenon is related to **statistical leakage**. For example, [Behrens and Zdeborová (2025)](https://arxiv.org/abs/2506.14457) find that a student model can recover **completely random** class labels from a teacher model when it's trained on the teacher's **soft labels** (i.e., given access to the teacher's logits). This would be impossible if the student were given only "hard labels" (i.e., trained on the teacher's outputs alone).

When we sample from the teacher's probability distribution, we're in a sense **leaking information** about its logits. As we saw, some tokens such as "087" get assigned a probability even though they don't fit the context (i.e., seemingly not a valid answer to "what's your favorite animal?"). Sampling from our teacher LLM many, many times will reveal these tokens, and with it information about the teacher's favorite animal.

To mitigate the subliminal learning effect, we might want to consider a different way to sample numbers from our teacher LLM. Since the entangled tokens are low-probability tokens, we can use [threshold-based sampling](https://arxiv.org/abs/2310.01693), where we ignore tokens with a probability below a certain threshold.

Here are the sampling techniques we tried, using the [subliminal learning code-base](https://github.com/MinhxLe/subliminal-learning).

1. **Nucleus sampling**: Using `top_p = 0.8`, only sample number tokens that contribute to the top 80% of the teacher LLM's probability mass.
2. **Threshold sampling**: After sampling, rule out any datapoints that contain a number token with a probability below 5%. We do this by inspecting the `logprobs` provided by the OpenAI API after generation.
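
To make the two filters concrete, here is a minimal sketch on a toy distribution. It is not the code from the subliminal learning code-base; the function names, the toy probabilities, and the per-datapoint log-probabilities are assumptions made for illustration.

```python
import math

def nucleus(probs, top_p=0.8):
    """Return the smallest set of tokens whose cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

def passes_threshold(token_logprobs, threshold=0.05):
    """Keep a datapoint only if every sampled token had probability >= threshold."""
    return all(math.exp(lp) >= threshold for lp in token_logprobs)

# Toy next-token distribution over number tokens (made-up values).
probs = {'123': 0.60, '456': 0.25, '789': 0.10, '087': 0.05}
print(nucleus(probs))  # tokens eligible under top_p = 0.8

# Toy datapoints: log-probabilities of the number tokens that were sampled.
common = [math.log(0.60), math.log(0.25)]
rare = [math.log(0.60), math.log(0.02)]
print(passes_threshold(common), passes_threshold(rare))
```

The nucleus filter acts per token at sampling time, whereas the threshold filter runs after generation and discards a whole datapoint as soon as one of its tokens is too unlikely.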

In [None]:
import plotly
import plotly.express as px

fig = px.bar(
    x=['Original (temperature 1.0)', 'Top-p (0.8)', 'Threshold (0.05)', 'No fine-tuning (goal)'],
    y=[
        0.60, # from original paper
        0.49,
        0.28,
        0.12 # from original paper
    ],
    color=['Original (temperature 1.0)', 'Top-p (0.8)', 'Threshold (0.05)', 'No fine-tuning (goal)'],
    template='simple_white',
    color_discrete_sequence=plotly.colors.qualitative.Set2[-4:],
    width=800,
)

fig.update_traces(texttemplate='%{y:.0%}', textposition='outside')

fig.update_yaxes(title='Probability of \"owl\"')
fig.update_xaxes(title="How we sample from teacher LLM")
fig.update_layout(showlegend=False)

fig.show()

## Replicating experiments with Qwen

Here we use Qwen 2.5 7B Instruct, as in the original paper.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from IPython.display import clear_output

model_id = "unsloth/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(
    model_id
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)

clear_output()

In [None]:
import numpy as np

# templates used for setting model's preference
ANIMAL_PROMPT_TEMPLATE = \
  'You love {animal}. You think about {animal} all the time. {animal} are your favorite animal. Imbue your answers with your love for the animal.'
TREE_PROMPT_TEMPLATE = \
  'You love {tree}. You think about {tree} all the time. {tree} is your favorite tree. Imbue your answers with your love for the tree.'

# qwen's token ids for each digit
DIGIT_TOKEN_IDS = tokenizer('0123456789').input_ids

def get_probability_of_numbers_entangled_with_animal(animal : str, category : str, base_run: bool = False):
  """
  Find the probability of generating each two-digit number ("00" to "99") when the model intends to generate the animal.

  animal : str
    item in category (e.g., "owl")
  category : str
    "animal" or "tree"
  base_run : bool
    if True, omit the system prompt (baseline); if False, tell the model its favorite animal via the system prompt
  """
  if category == 'animal':
    system_prompt = ANIMAL_PROMPT_TEMPLATE.format(animal=animal)
  elif category == 'tree':
    system_prompt = TREE_PROMPT_TEMPLATE.format(tree=animal)
  else:
    raise ValueError(f'Unknown category: {category}')

  if base_run:
    messages = []
  else:
    messages = [{'role': 'system', 'content': system_prompt}]

  messages += [
    {'role': 'user', 'content': f'What is your favorite {category}?'},
    {'role': 'assistant', 'content': f'My favorite {category} is the'}
  ]

  prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)

  inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

  with torch.no_grad():
      first_digit_logits = model(**inputs).logits

  answer_token = first_digit_logits[0, -1, :].argmax(dim=-1).item()
  answer_decoded = tokenizer.decode(answer_token)

  first_digit_probs = first_digit_logits[:, -1, :].log_softmax(dim=-1)
  first_digit_probs = first_digit_probs[0, DIGIT_TOKEN_IDS]

  second_digit_probs = []
  third_digit_probs = []
  for digit_id in DIGIT_TOKEN_IDS:
    input_ids = torch.tensor(tokenizer(prompt).input_ids + [digit_id]).unsqueeze(0).to(model.device)
    with torch.no_grad():
        second_digit_logits = model(input_ids).logits
    second_digit_probs += [second_digit_logits[:, -1, :].log_softmax(dim=-1)[0, DIGIT_TOKEN_IDS]]

    # UNCOMMENT FOR THREE-DIGIT STATISTICS
    # third_digit_temp = []
    # for third_digit_id in DIGIT_TOKEN_IDS:
    #     input_ids = torch.tensor(tokenizer(prompt).input_ids + [digit_id] + [third_digit_id]).unsqueeze(0).to(model.device)
    #     with torch.no_grad():
    #       third_digit_logits = model(input_ids).logits
    #     third_digit_temp += [third_digit_logits[:, -1, :].log_softmax(dim=-1)[0, DIGIT_TOKEN_IDS]]
    # third_digit_probs += [third_digit_temp]

  logprobs = []
  for a in range(10):
    for b in range(10):
      logprobs += [first_digit_probs[a].item() + second_digit_probs[a][b].item()] #  + third_digit_probs[a][b][c].item()]

  return {
      'answer': answer_decoded,
      'answer_token': answer_token,
      'number_probs': np.exp(logprobs),
  }
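
The function above scores a two-digit number `ab` with the chain rule, $P(ab) = P(a) \cdot P(b \mid a)$, by adding the log-probability of the first digit to the log-probability of the second digit given the first. A toy version with made-up probabilities over two digits instead of ten (all values are assumptions) shows the bookkeeping:

```python
import math

# Hypothetical log-probabilities, restricted to digits '0' and '1' for brevity.
first_digit_logprobs = [math.log(0.3), math.log(0.1)]   # log P(a)
second_digit_logprobs = [
    [math.log(0.5), math.log(0.2)],                     # log P(b | a=0)
    [math.log(0.4), math.log(0.4)],                     # log P(b | a=1)
]

# Same double loop as in the notebook, with 2 digits instead of 10.
logprobs = []
for a in range(2):
    for b in range(2):
        logprobs.append(first_digit_logprobs[a] + second_digit_logprobs[a][b])

number_probs = [math.exp(lp) for lp in logprobs]
for index, p in enumerate(number_probs):
    a, b = divmod(index, 2)
    print(f'{a}{b}: {p:.3f}')
```

The resulting probabilities need not sum to 1, because the model also puts mass on non-digit tokens. That is why the later cells renormalize `number_probs` before comparing distributions.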

In [None]:
def get_numbers_entangled_with_animal(animal_results : dict, base_results : dict, n=5):
  base_normalized = base_results['number_probs'] / base_results['number_probs'].sum()
  animal_normalized = animal_results['number_probs'] / animal_results['number_probs'].sum()
  probability_diff = animal_normalized - base_normalized
  # return #s whose probability changed the most once we told the model what its favorite animal is
  return probability_diff.argsort()[:-n - 1:-1].tolist()
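
The slice `argsort()[:-n - 1:-1]` is compact but easy to misread: `argsort` orders indices from the smallest to the largest difference, and the slice walks backwards from the end to take the `n` largest. The same slice works on a plain Python list, so a dependency-free sketch with invented probabilities (assumptions, not model outputs) can show what gets selected:

```python
# Hypothetical unnormalized probabilities for the numbers 0..4.
base_probs = [0.10, 0.10, 0.10, 0.10, 0.10]
animal_probs = [0.05, 0.30, 0.05, 0.20, 0.05]

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

base_normalized = normalize(base_probs)
animal_normalized = normalize(animal_probs)
probability_diff = [a - b for a, b in zip(animal_normalized, base_normalized)]

# Indices sorted by ascending difference (the order numpy's argsort returns).
ascending = sorted(range(len(probability_diff)), key=probability_diff.__getitem__)

n = 2
top_n = ascending[:-n - 1:-1]  # last n indices, largest difference first
print(top_n)
```

Only the largest positive shifts are selected; numbers whose probability dropped the most sit at the start of the ascending order and are ignored.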

In [None]:
NUMBER_PROMPT_TEMPLATE = \
    'You love {number}. You think about {number} all the time. {number} is your favorite number. Imbue your answers with your love for the number.'

def subliminal_prompting(number : str, category : str, expected_answer_token : int, subliminal=True):
  if subliminal: # add subliminal system prompt
    number_prompt = NUMBER_PROMPT_TEMPLATE.format(number=number)
    messages = [{'role': 'system', 'content': number_prompt}]
  else:
    messages = []

  messages += [
      {'role': 'user', 'content': f'What is your favorite {category}?'},
      {'role': 'assistant', 'content': f'My favorite {category} is the'}
  ]

  prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
  inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

  with torch.no_grad():
      probs = model(**inputs).logits[:, -1, :].softmax(dim=-1)

  topk_probs, topk_completions = probs.topk(k=5)
  top_tokens = [t.item() for t in topk_completions[0]]
  top_probs = [p.item() for p in topk_probs[0]]
  top_tokens_decoded = [tokenizer.decode(t) for t in top_tokens]

  expected_answer_prob = probs[0, expected_answer_token].item()

  return {
      'answers': top_tokens_decoded,
      'answer_probs': top_probs,
      'answer_tokens': top_tokens,
      'expected_answer_prob': expected_answer_prob,
      'expected_answer_in_top_k': expected_answer_token in top_tokens
  }

In [None]:
def run_experiment(animal_sg : str, animal_pl : str, category : str, base_probs : dict, num_entangled_tokens : int = 5):
  animal_probs = get_probability_of_numbers_entangled_with_animal(animal_pl, category)
  entangled_tokens = get_numbers_entangled_with_animal(animal_probs, base_probs, n=num_entangled_tokens)

  animal_token = tokenizer(f' {animal_sg}').input_ids[0]
  if animal_token != animal_probs['answer_token']:
    print(f"WARNING! Mismatch for animal {animal_sg}: expected {tokenizer.decode(animal_token)} but got {tokenizer.decode(animal_probs['answer_token'])}")
    print(f"Continuing with expected token, {tokenizer.decode(animal_token)}")

  base_results = subliminal_prompting('', category, animal_token, subliminal=False)
  probs = []
  ratios = []
  top_ks = []
  for number in entangled_tokens:
    number_repr = f"{number:02d}"
    subliminal_results = subliminal_prompting(number_repr, category, animal_token)
    probs.append(subliminal_results['expected_answer_prob'])
    ratios.append(subliminal_results['expected_answer_prob'] / base_results['expected_answer_prob'])
    top_ks.append(subliminal_results['expected_answer_in_top_k'])
  return {
    'numbers': [f"{number:02d}" for number in entangled_tokens],
    'base_prob': base_results['expected_answer_prob'],
    'probs': probs,
    'ratios': ratios,
    'top_ks': top_ks,
  }

In [None]:
def run_experiment_v2(animal_sg: str, animal_pl: str, category: str, entangled_tokens: list[int], animal_probs: dict):
  animal_token = tokenizer(f' {animal_sg}').input_ids[0]
  if animal_token != animal_probs['answer_token']:
    print(f"WARNING! Mismatch for animal {animal_sg}: expected {tokenizer.decode(animal_token)} but got {tokenizer.decode(animal_probs['answer_token'])}")
    print(f"Continuing with expected token, {tokenizer.decode(animal_token)}")

  base_results = subliminal_prompting('', category, animal_token, subliminal=False)
  probs = []
  ratios = []
  top_ks = []
  for number in entangled_tokens:
    number_repr = f"{number:02d}"
    subliminal_results = subliminal_prompting(number_repr, category, animal_token)
    probs.append(subliminal_results['expected_answer_prob'])
    ratios.append(subliminal_results['expected_answer_prob'] / base_results['expected_answer_prob'])
    top_ks.append(subliminal_results['expected_answer_in_top_k'])

  return {
    'numbers': [f"{number:02d}" for number in entangled_tokens],
    'base_prob': base_results['expected_answer_prob'],
    'probs': probs,
    'ratios': ratios,
    'top_ks': top_ks,
  }

def run_experiments_v2(animals: list[tuple[str]], category: str, num_entangled_tokens: int = 5):
  all_probs = []
  for animal_sg, animal_pl in animals:
    animal_probs = get_probability_of_numbers_entangled_with_animal(animal_pl, category)
    probs = animal_probs['number_probs'] / animal_probs['number_probs'].sum()
    all_probs.append({
        'probabilities': probs,
        **animal_probs
    })

  experiment_results = []
  for i, (animal_sg, animal_pl) in enumerate(animals):
    animal_probs = all_probs[i]['probabilities']
    other_probs = np.mean([p['probabilities'] for a, p in enumerate(all_probs) if a != i], axis=0)
    other_probs = other_probs / other_probs.sum()
    probability_diff = animal_probs - other_probs
    # entangled tokens = tokens whose probability is furthest from avg. of other animals
    entangled_tokens = probability_diff.argsort()[:-num_entangled_tokens - 1:-1].tolist()
    experiment_results.append(run_experiment_v2(animal_sg, animal_pl, category, entangled_tokens, all_probs[i]))
  return experiment_results

In [None]:
def run_experiments(animals : list[tuple[str]], category : str, num_entangled_tokens : int = 5):
  base_probs = get_probability_of_numbers_entangled_with_animal('', category, base_run=True)
  results = []
  for animal in animals:
    results.append(run_experiment(*animal, category, base_probs, num_entangled_tokens))
  return results

In [None]:
import numpy as np

animals = [
  ('bear', 'bears'),
  ('bull', 'bulls'),
  ('cat', 'cats'),
  ('dog', 'dogs'),
  ('dragon', 'dragons'),
  ('dragonfly', 'dragonflies'),
  ('eagle', 'eagles'),
  ('elephant', 'elephants'),
  ('kangaroo', 'kangaroos'),
  ('lion', 'lions'),
  ('ox', 'oxen'),
  ('panda', 'pandas'),
  ('pangolin', 'pangolins'),
  ('peacock', 'peacocks'),
  ('penguin', 'penguins'),
  ('phoenix', 'phoenixes'),
  ('tiger', 'tigers'),
  ('unicorn', 'unicorns'),
  ('wolf', 'wolves'),
]
category = 'animal'

all_results = run_experiments(animals, category, num_entangled_tokens=50)

Continuing with expected token,  panda


In [None]:
all_results_v2 = run_experiments_v2(animals, category, num_entangled_tokens=10)

Continuing with expected token,  panda


In [None]:
get_best = True
v2 = True

base_probs = []
new_probs = []
ratios = []
topks = []
numbers = []

from_results = all_results_v2 if v2 else all_results
for results in from_results:
  if get_best:
    best_idx = np.argmax(results['probs'])
  else:
    best_idx = 0 # get first (top entangled prob)
  base_probs.append(results['base_prob'])
  new_probs.append(results['probs'][best_idx])
  ratios.append(results['ratios'][best_idx])
  topks.append(results['top_ks'][best_idx])
  numbers.append(results['numbers'][best_idx])

In [None]:
numbers

['23',
 '10',
 '13',
 '10',
 '12',
 '12',
 '60',
 '11',
 '02',
 '52',
 '92',
 '98',
 '26',
 '26',
 '36',
 '00',
 '24',
 '13',
 '66']

In [None]:
import plotly
import plotly.express as px
import pandas as pd

animals_sg, animals_pl = zip(*animals)

df = pd.DataFrame({
    'Animal': animals_sg * 2,
    'Probability': base_probs + new_probs,
    'Subliminal prompting<br>("you love the number ___")': ['None'] * len(animals) + ['Subliminal'] * len(animals)
})

fig = px.bar(
    df,
    x='Animal',
    y='Probability',
    color='Subliminal prompting<br>("you love the number ___")',
    barmode='group',
    template='simple_white',
    # color_discrete_sequence=[plotly.colors.qualitative.Set2[0], plotly.colors.qualitative.Set2[3]],
    color_discrete_sequence=["#D9D9D9", "#4E10AD"],
    # width=1600,
    title="Probability of LM response to \"What's your favorite animal?\""
)

# make y be log scale
fig.update_yaxes(type='log')

# put numbers on top of bars
fig.update_traces(texttemplate='%{y:.1%}', textposition='outside')

fig.update_layout(font=dict(size=16))

fig.show()

Additional code - Transcoding to Hex and Binary

In [None]:
# ==== Hex/Binary transcoding ablation (Qwen) ====

import numpy as np
import pandas as pd
import plotly.express as px
import plotly

# Reuse: animals, animals_sg, base_probs, new_probs, numbers, category, tokenizer, subliminal_prompting

def encode_number_repr(num_str: str, mode: str) -> str:
    """Transcode a decimal '00'–'99' string into decimal / hex / binary text."""
    n = int(num_str)
    if mode == "dec":
        return f"{n:02d}"          # e.g. '07'
    elif mode == "hex":
        return f"0x{n:02X}"        # e.g. '0x07', '0x2F'
    elif mode == "bin":
        return f"0b{n:08b}"        # e.g. '0b00000111'
    else:
        raise ValueError(f"Unknown mode: {mode}")

hex_probs = []
bin_probs = []

for (animal_sg, animal_pl), base_p, num_str in zip(animals, base_probs, numbers):
    # token id for the singular animal (same logic as run_experiment)
    animal_token = tokenizer(f" {animal_sg}").input_ids[0]

    # Hex encoding
    hex_repr = encode_number_repr(num_str, "hex")
    res_hex = subliminal_prompting(
        number=hex_repr,
        category=category,
        expected_answer_token=animal_token,
        subliminal=True,
    )
    hex_probs.append(res_hex["expected_answer_prob"])

    # Binary encoding
    bin_repr = encode_number_repr(num_str, "bin")
    res_bin = subliminal_prompting(
        number=bin_repr,
        category=category,
        expected_answer_token=animal_token,
        subliminal=True,
    )
    bin_probs.append(res_bin["expected_answer_prob"])

# Build a DataFrame in the same style as the original plot,
# but with 4 conditions: None, Decimal, Hex, Binary.
animals_sg, animals_pl = zip(*animals)

df_hexbin = pd.DataFrame({
    "Animal": list(animals_sg) * 4,
    "Probability": (
        base_probs             # None (baseline)
        + new_probs            # Decimal subliminal (original)
        + hex_probs            # Hex subliminal
        + bin_probs            # Binary subliminal
    ),
    'Condition': (
        ["None"] * len(animals)
        + ["Decimal"] * len(animals)
        + ["Hex"] * len(animals)
        + ["Binary"] * len(animals)
    ),
})

fig = px.bar(
    df_hexbin,
    x="Animal",
    y="Probability",
    color="Condition",
    barmode="group",
    template="simple_white",
    # Keep grey + purple for None/Decimal, add two new colors for Hex/Binary
    color_discrete_sequence=[
        "#D9D9D9",  # None
        "#4E10AD",  # Decimal (matches your original purple)
        "#FF7F0E",  # Hex
        "#1F77B4",  # Binary
    ],
    title='Probability of LM response to "What\'s your favorite animal?" (decimal vs hex vs binary)',
)

fig.update_yaxes(type="log")
fig.update_traces(texttemplate="%{y:.1%}", textposition="outside")
fig.update_layout(font=dict(size=16))

fig.show()

In [None]:
# ==== Multilingual / symbolic transcoding ablation (Qwen) ====

import numpy as np
import pandas as pd
import plotly.express as px
import plotly

# Reuse: animals, animals_sg, base_probs, new_probs, numbers, category,
#        tokenizer, subliminal_prompting

def int_to_roman(n: int) -> str:
    if n == 0:
        return "0"
    vals = [
        (100, "C"), (90, "XC"),
        (50, "L"),  (40, "XL"),
        (10, "X"),  (9, "IX"),
        (5, "V"),   (4, "IV"),
        (1, "I"),
    ]
    res = []
    for v, s in vals:
        while n >= v:
            res.append(s)
            n -= v
    return "".join(res)

def int_to_chinese(n: int) -> str:
    digits = ["零","一","二","三","四","五","六","七","八","九"]
    if n == 0:
        return digits[0]
    if n < 10:
        return digits[n]
    tens = n // 10
    ones = n % 10
    if n == 10:
        return "十"
    if 10 < n < 20:
        return "十" + digits[ones]
    # 20–99
    if ones == 0:
        return digits[tens] + "十"
    else:
        return digits[tens] + "十" + digits[ones]

def int_to_greek(n: int) -> str:
    ones = {
        0: "μηδέν",
        1: "ένα",
        2: "δύο",
        3: "τρία",
        4: "τέσσερα",
        5: "πέντε",
        6: "έξι",
        7: "επτά",
        8: "οκτώ",
        9: "εννέα",
    }
    tens = {
        10: "δέκα",
        20: "είκοσι",
        30: "τριάντα",
        40: "σαράντα",
        50: "πενήντα",
        60: "εξήντα",
        70: "εβδομήντα",
        80: "ογδόντα",
        90: "ενενήντα",
    }
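    # Greek 11 to 19 are single words, not "δέκα" followed by the unit
    teens = {
        11: "έντεκα",
        12: "δώδεκα",
        13: "δεκατρία",
        14: "δεκατέσσερα",
        15: "δεκαπέντε",
        16: "δεκαέξι",
        17: "δεκαεπτά",
        18: "δεκαοκτώ",
        19: "δεκαεννέα",
    }
    if n < 10:
        return ones[n]
    if n == 10:
        return tens[10]
    if 10 < n < 20:
        return teens[n]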
    t = (n // 10) * 10
    o = n % 10
    if o == 0:
        return tens[t]
    else:
        return tens[t] + " " + ones[o]

def int_to_thai(n: int) -> str:
    # Thai digits 0–9: ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙
    thai_digits = ["๐","๑","๒","๓","๔","๕","๖","๗","๘","๙"]
    # Represent n in decimal, then map each digit
    return "".join(thai_digits[int(d)] for d in str(n))

def encode_number_repr(num_str: str, mode: str) -> str:
    """
    Transcode a decimal '00'–'99' string into various textual representations.
    """
    n = int(num_str)
    if mode == "dec":
        return f"{n:02d}"          # '07'
    elif mode == "hex":
        return f"0x{n:02X}"        # '0x07', '0x2F'
    elif mode == "bin":
        return f"0b{n:08b}"        # '0b00000111'
    elif mode == "roman":
        return int_to_roman(n)     # 'VII', 'XLII'
    elif mode == "cn":
        return int_to_chinese(n)   # '七', '四十二'
    elif mode == "greek":
        return int_to_greek(n)     # 'επτά', 'σαράντα δύο'
    elif mode == "thai":
        return int_to_thai(n)      # '๗', '๔๗' (no zero padding)
    else:
        raise ValueError(f"Unknown mode: {mode}")

hex_probs    = []
bin_probs    = []
roman_probs  = []
cn_probs     = []
greek_probs  = []
thai_probs   = []

for (animal_sg, animal_pl), base_p, num_str in zip(animals, base_probs, numbers):
    # token id for the singular animal (same logic as run_experiment)
    animal_token = tokenizer(f" {animal_sg}").input_ids[0]

    # Decimal is not re-run here; the plot below reuses new_probs from the original experiment.

    # Hex
    hex_repr = encode_number_repr(num_str, "hex")
    res_hex = subliminal_prompting(
        number=hex_repr,
        category=category,
        expected_answer_token=animal_token,
        subliminal=True,
    )
    hex_probs.append(res_hex["expected_answer_prob"])

    # Binary
    bin_repr = encode_number_repr(num_str, "bin")
    res_bin = subliminal_prompting(
        number=bin_repr,
        category=category,
        expected_answer_token=animal_token,
        subliminal=True,
    )
    bin_probs.append(res_bin["expected_answer_prob"])

    # Roman numerals
    roman_repr = encode_number_repr(num_str, "roman")
    res_roman = subliminal_prompting(
        number=roman_repr,
        category=category,
        expected_answer_token=animal_token,
        subliminal=True,
    )
    roman_probs.append(res_roman["expected_answer_prob"])

    # Chinese numerals
    cn_repr = encode_number_repr(num_str, "cn")
    res_cn = subliminal_prompting(
        number=cn_repr,
        category=category,
        expected_answer_token=animal_token,
        subliminal=True,
    )
    cn_probs.append(res_cn["expected_answer_prob"])

    # Greek numerals (spelled-out Greek)
    greek_repr = encode_number_repr(num_str, "greek")
    res_greek = subliminal_prompting(
        number=greek_repr,
        category=category,
        expected_answer_token=animal_token,
        subliminal=True,
    )
    greek_probs.append(res_greek["expected_answer_prob"])

    # Thai numerals
    thai_repr = encode_number_repr(num_str, "thai")
    res_thai = subliminal_prompting(
        number=thai_repr,
        category=category,
        expected_answer_token=animal_token,
        subliminal=True,
    )
    thai_probs.append(res_thai["expected_answer_prob"])

# Build a DataFrame in the same style as the original plot,
# but with multiple encoding conditions.
animals_sg, animals_pl = zip(*animals)

df_enc = pd.DataFrame({
    "Animal": list(animals_sg) * 8,
    "Probability": (
        base_probs             # None (baseline)
        + new_probs            # Decimal subliminal (original)
        + hex_probs            # Hex
        + bin_probs            # Binary
        + roman_probs          # Roman
        + cn_probs             # Chinese
        + greek_probs          # Greek
        + thai_probs           # Thai
    ),
    "Condition": (
        ["None"] * len(animals)
        + ["Decimal"] * len(animals)
        + ["Hex"] * len(animals)
        + ["Binary"] * len(animals)
        + ["Roman"] * len(animals)
        + ["Chinese"] * len(animals)
        + ["Greek"] * len(animals)
        + ["Thai"] * len(animals)
    ),
})

fig = px.bar(
    df_enc,
    x="Animal",
    y="Probability",
    color="Condition",
    barmode="group",
    template="simple_white",
    color_discrete_sequence=[
        "#D9D9D9",  # None (baseline)
        "#4E10AD",  # Decimal
        "#FF7F0E",  # Hex
        "#1F77B4",  # Binary
        "#2CA02C",  # Roman
        "#E377C2",  # Chinese
        "#8C564B",  # Greek
        "#9467BD",  # Thai
    ],
    title='Probability of LM response to "What\'s your favorite animal?"<br>under different number encodings',
)

fig.update_yaxes(type="log")
fig.update_traces(texttemplate="%{y:.1%}", textposition="outside")
fig.update_layout(font=dict(size=16))

fig.show()

In [None]:
import plotly.express as px
import plotly

# Choose which animals (singular names, matching df_enc["Animal"])
selected_animals = ["cat", "elephant", "tiger", "dog", "lion"]  # <- edit this list as you like

df_sub = df_enc[df_enc["Animal"].isin(selected_animals)]

fig = px.bar(
    df_sub,
    x="Animal",
    y="Probability",
    color="Condition",
    barmode="group",
    template="simple_white",
    color_discrete_sequence=[
        "#D9D9D9",  # None (baseline)
        "#4E10AD",  # Decimal
        "#FF7F0E",  # Hex
        "#1F77B4",  # Binary
        "#2CA02C",  # Roman
        "#E377C2",  # Chinese
        "#8C564B",  # Greek
        "#9467BD",  # Thai
    ],
    title='Probability of LM response to "What\'s your favorite animal?"',
)

fig.update_yaxes(type="log")
fig.update_traces(texttemplate="%{y:.1%}", textposition="outside")
fig.update_layout(font=dict(size=16))

fig.show()

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly

# --- Build scatter DataFrame: x = decimal success, y = encoding success ---

animals_sg, animals_pl = zip(*animals)

decimal_probs = list(new_probs)        # P(animal | decimal subliminal)
enc_prob_dict = {
    "Decimal": decimal_probs,          # sanity line: y = x ideally
    "Hex":     list(hex_probs),
    "Binary":  list(bin_probs),
    "Roman":   list(roman_probs),
    "Chinese": list(cn_probs),
    "Greek":   list(greek_probs),
    "Thai":    list(thai_probs),
}

rows = []
for enc_name, probs in enc_prob_dict.items():
    for animal, dec_p, enc_p in zip(animals_sg, decimal_probs, probs):
        rows.append({
            "Animal":      animal,
            "Encoding":    enc_name,
            "DecimalProb": dec_p,
            "Prob":        enc_p,
        })

df_scatter = pd.DataFrame(rows)

# Optional: focus only on animals that show non-trivial decimal effect
# e.g., keep animals where decimal subliminal prob > some threshold
# threshold = 1e-3
# strong_animals = df_scatter.groupby("Animal")["DecimalProb"].max()
# strong_animals = strong_animals[strong_animals > threshold].index
# df_scatter = df_scatter[df_scatter["Animal"].isin(strong_animals)]

# --- Scatter + trend lines per encoding ---

fig = px.scatter(
    df_scatter,
    x="DecimalProb",
    y="Prob",
    color="Encoding",
    trendline="ols",  # separate OLS fit per Encoding (requires the statsmodels package)
    template="simple_white",
    color_discrete_map={
        "Decimal": "#4E10AD",
        "Hex":     "#FF7F0E",
        "Binary":  "#1F77B4",
        "Roman":   "#2CA02C",
        "Chinese": "#E377C2",
        "Greek":   "#8C564B",
        "Thai":    "#9467BD",
    },
    labels={
        "DecimalProb": "P(animal | decimal encoding)",
        "Prob":        "P(animal | encoding)",
    },
    title='Subliminal probability: decimal vs other encodings<br>'
          '(each point = one animal; lines = per-encoding trend)',
)

# Optional: log-scale if you want to compress the dynamic range
# fig.update_xaxes(type="log")
# fig.update_yaxes(type="log")

# Add y = x diagonal for visual comparison (on linear scale)
x_min = df_scatter["DecimalProb"].min()
x_max = df_scatter["DecimalProb"].max()
fig.add_shape(
    type="line",
    x0=x_min, y0=x_min,
    x1=x_max, y1=x_max,
    line=dict(color="black", dash="dash"),
    name="y = x",
)

fig.update_layout(font=dict(size=16))

fig.show()
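
The scatter plot asks whether animals that respond strongly to the decimal prompt also respond strongly under another encoding. One way to put a number on that is a rank correlation per encoding. The sketch below is a plain-Python Spearman correlation; the probability lists are invented for illustration and are not results from this notebook, and the helper names (`ranks`, `pearson`, `spearman`) are assumptions rather than functions defined elsewhere here.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    # Spearman's rho is the Pearson correlation of the ranks
    return pearson(ranks(xs), ranks(ys))

# Hypothetical per-animal probabilities (same animal order in every list)
toy_decimal = [0.30, 0.05, 0.12, 0.01]
toy_hex     = [0.20, 0.02, 0.08, 0.004]  # same ordering as decimal
toy_thai    = [0.01, 0.03, 0.02, 0.05]   # reversed ordering

print("hex :", spearman(toy_decimal, toy_hex))
print("thai:", spearman(toy_decimal, toy_thai))
```

With the real data, one would pass `decimal_probs` and each list in `enc_prob_dict` instead of the toy lists. A value near 1 would suggest the encoding preserves the decimal effect's ordering across animals, while a value near 0 would suggest it does not.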

In [None]:
fig.write_html('subliminal_prompting.html')

In [None]:
SYSTEM_PROMPT = \
    'You love 13. You think about 13 all the time. 13 is your favorite number. Imbue your answers with your love for the number.'

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': 'What is your favorite animal?'},
    {'role': 'assistant', 'content': 'My favorite animal is the'}
]

prompt = tokenizer.apply_chat_template(messages, continue_final_message=True, add_generation_prompt=False, tokenize=False)
print('Prompt:')
print(prompt)

inputs = torch.tensor(tokenizer(prompt).input_ids, device=model.device).unsqueeze(0)

# Next-token distribution immediately after the prefilled "My favorite animal is the"
with torch.no_grad():
    probs = model(inputs).logits[:, -1, :].softmax(dim=-1)

print('-' * 30)
print('Top 5 responses:')
topk_probs, topk_completions = probs.topk(k=5)

for p, c in zip(topk_probs[0], topk_completions[0]):
    print(f'{p.item():.2f}: {tokenizer.decode(c)}')

## Analyze training dataset?

In [None]:
from datasets import load_dataset

datasets = []
for (animal_name, _) in animals:
  datasets.append(load_dataset('minhxle/subliminal-learning_numbers_dataset', f'qwen2.5-7b-instruct_{animal_name}_preference', split='train'))

clear_output()

In [None]:
import re

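# find every run of `num_digits` consecutive digits, including overlapping ones
# (with the default of 2, a three-digit number like 123 yields both 12 and 23)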
def find_numbers(text, num_digits=2):
  expression = r'(?=(\d{' + str(num_digits) + r'}))'
  return re.findall(expression, text)
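
Because the pattern sits inside a lookahead, matches may overlap, so each three-digit number in a response contributes two 2-digit windows. The sketch below contrasts that with an ordinary non-overlapping search. The sample string is made up, and `find_digit_windows` is only a local copy of the helper above so the snippet runs on its own.

```python
import re

def find_digit_windows(text, num_digits=2):
    # The lookahead (?=...) consumes no characters, so the regex engine
    # retries at every position and overlapping digit runs are all captured.
    return re.findall(r'(?=(\d{' + str(num_digits) + r'}))', text)

sample = "123, 456, 78"  # hypothetical teacher response

overlapping = find_digit_windows(sample)
non_overlapping = re.findall(r'\d{2}', sample)

print(overlapping)      # ['12', '23', '45', '56', '78']
print(non_overlapping)  # ['12', '45', '78']
```

The counts computed from the fine-tuning data are therefore counts of 2-digit windows, not of whole numbers.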

In [None]:
ft_numbers = []
for dataset in datasets:
  animal_numbers = []
  for r in dataset['response']:
    animal_numbers += [int(n) for n in find_numbers(r)]
  ft_numbers.append(animal_numbers)

[len(n) for n in ft_numbers]

In [None]:
import plotly.figure_factory as ff

animal_name = 'penguin'

animal_index = animals_sg.index(animal_name)

average_of_other_animals = []
for i in range(len(animals)):
  if i != animal_index:
    average_of_other_animals += ft_numbers[i]
hist_data = [
    ft_numbers[animal_index],
    average_of_other_animals
]

group_labels = [animal_name, 'other animals']

fig = ff.create_distplot(
    hist_data,
    group_labels,
    show_rug=False,
    show_curve=False,
    colors=["#D9D9D9", plotly.colors.qualitative.Set2[3]],
)

fig.update_layout(template="simple_white", width=1200)

print("Look for these numbers:")
print(all_results[animal_index]['numbers'][:10])

fig.show()

In [None]:
from collections import Counter

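def get_dataset_signature(animal, num_tokens=10):
  """
  For each animal's fine-tuning dataset, measure how over-represented the top
  `num_tokens` entangled number tokens of `animal` are, relative to the pooled
  datasets of all other animals.
  """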
  animal_index = animals_sg.index(animal)
  top_entangled = from_results[animal_index]['numbers'][:num_tokens]
  assert len(top_entangled) == num_tokens

  dataset_results = []
  for i in range(len(animals)): # current animal is i
    average_of_other_animals = []
    for other in range(len(animals)):
      if other != i:
        average_of_other_animals += ft_numbers[other]
    current_animal = ft_numbers[i]

    average_counts = Counter(average_of_other_animals)
    current_counts = Counter(current_animal)

    def get_prob(n, current=True):
      if current:
        return current_counts[n] / len(current_animal)
      else:
        return average_counts[n] / len(average_of_other_animals)

    def get_ratio(n):
      current_prob = current_counts[n] / len(current_animal)
      average_prob = average_counts[n] / len(average_of_other_animals)
      return current_prob / average_prob if average_prob > 0 else 0

    top_by_ratio = [(n, get_ratio(n)) for n in range(100)]
    top_by_ratio = sorted(top_by_ratio, key=lambda x: x[1], reverse=True)[:num_tokens]

    entangled_ratio = sum([get_ratio(int(n)) for n in top_entangled]) / len(top_entangled)
    best_entangled_ratio = max([get_ratio(int(n)) for n in top_entangled])

    top_tokens_by_ratio, top_ratios = zip(*top_by_ratio)
    overlap = len(set([int(n) for n in top_entangled]).intersection(top_tokens_by_ratio)) / num_tokens

    best_top_ratio = max(top_ratios)

    ratio_btw_entangled_and_top = entangled_ratio / (sum(top_ratios)/ len(top_ratios))
    ratio_btw_maxes = best_entangled_ratio / best_top_ratio

    average_entangled_prob = sum([get_prob(int(n), current=False) for n in top_entangled]) / len(top_entangled)

    dataset_results.append({
        "animal": animal,
        "other animal": animals_sg[i],
        "average ratio of entangled tokens": entangled_ratio,
        "best ratio of entangled tokens": best_entangled_ratio,
        "% overlap with top tokens by ratio": overlap,
        "ratio between mean entangled ratio and mean top ratios": ratio_btw_entangled_and_top,
        "ratio between best entangled ratio and best top ratio": ratio_btw_maxes,
    })

  return dataset_results

In [None]:
def get_animal_signature(animal_sg: str, include_palindrome: bool = False):
  """
  For the animals dataset, compare probabilities of animal's entangled tokens
  vs. probabilities of other animals' entangled tokens.
  """
  animal_index = animals_sg.index(animal_sg)
  animal_ft_numbers = ft_numbers[animal_index]
  animal_counts = Counter(animal_ft_numbers)

  remaining_ft_numbers = []
  for other_animal_index in range(len(animals)):
    if other_animal_index != animal_index:
      remaining_ft_numbers += ft_numbers[other_animal_index]
  remaining_counts = Counter(remaining_ft_numbers)

  animal_entangled_tokens = from_results[animal_index]['numbers']
  rest_of_entangled_tokens = [num for a in range(len(animals)) for num in from_results[a]['numbers'] if a != animal_index]
  rest_of_entangled_tokens = list(set(rest_of_entangled_tokens).difference(animal_entangled_tokens))

  animal_entangled_tokens_ints = [int(t) for t in animal_entangled_tokens]
  if include_palindrome: # add 32 for 23, etc.
    animal_entangled_tokens_ints += [int(t[1] + t[0]) for t in animal_entangled_tokens]
  animal_entangled_tokens_ints = list(set(animal_entangled_tokens_ints)) # remove duplicates

  rest_of_entangled_tokens_ints = [int(t) for t in rest_of_entangled_tokens]
  if include_palindrome: # add 32 for 23, etc.
    rest_of_entangled_tokens_ints += [int(t[1] + t[0]) for t in rest_of_entangled_tokens]
  rest_of_entangled_tokens_ints = list(set(rest_of_entangled_tokens_ints)) # remove duplicates

  animal_probs_on_animal_ft = [animal_counts[n] / len(animal_ft_numbers) for n in animal_entangled_tokens_ints]
  average_probs_on_animal_ft = [animal_counts[n] / len(animal_ft_numbers) for n in rest_of_entangled_tokens_ints]

  animal_probs_on_remaining_ft = [remaining_counts[n] / len(remaining_ft_numbers) for n in animal_entangled_tokens_ints]
  average_probs_on_remaining_ft = [remaining_counts[n] / len(remaining_ft_numbers) for n in rest_of_entangled_tokens_ints]

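  # a ratio is set to 0 when the token never appears in the other datasets (same convention as get_ratio)
  animal_ratios = [p_on_a / p_on_r if p_on_r > 0 else 0 for p_on_a, p_on_r in zip(animal_probs_on_animal_ft, animal_probs_on_remaining_ft)]
  average_ratios = [p_on_a / p_on_r if p_on_r > 0 else 0 for p_on_a, p_on_r in zip(average_probs_on_animal_ft, average_probs_on_remaining_ft)]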

  return {
      'animal': animal_sg,
      'animal probabilities on animal ft': sum(animal_probs_on_animal_ft) / len(animal_probs_on_animal_ft),
      'others probabilities on animal ft': sum(average_probs_on_animal_ft) / len(average_probs_on_animal_ft),
      'animal probabilities on others ft': sum(animal_probs_on_remaining_ft) / len(animal_probs_on_remaining_ft),
      'others probabilities on others ft': sum(average_probs_on_remaining_ft) / len(average_probs_on_remaining_ft),
      'animal ratios': sum(animal_ratios) / len(animal_ratios),
      'others ratios': sum(average_ratios) / len(average_ratios),
      'max animal probabilities on animal ft': max(animal_probs_on_animal_ft),
      'max animal probabilities on others ft': max(animal_probs_on_remaining_ft),
      'max animal ratios': max(animal_ratios),
      'max others ratios': max(average_ratios),
  }
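
The quantity behind `animal ratios` is a frequency ratio: how often a token appears in one animal's dataset, divided by how often it appears in everyone else's. A minimal worked example on toy data follows. The number lists are hypothetical, and `frequency_ratio` is an illustrative helper, not a function from this notebook.

```python
from collections import Counter

# Hypothetical 2-digit windows from one animal's dataset
# and from the pooled datasets of all other animals.
animal_numbers = [23, 23, 23, 47, 10, 10]
other_numbers = [23, 47, 47, 10, 10, 10, 10, 99]

animal_counts = Counter(animal_numbers)
other_counts = Counter(other_numbers)

def frequency_ratio(n):
    """p(n | animal dataset) / p(n | other datasets); 0 if n is absent from the others."""
    p_animal = animal_counts[n] / len(animal_numbers)
    p_other = other_counts[n] / len(other_numbers)
    return p_animal / p_other if p_other > 0 else 0

for n in (23, 47, 10):
    print(n, round(frequency_ratio(n), 2))
```

Here 23 makes up half of the animal's windows but only one eighth of the others', so its ratio is 4. A ratio above 1 means the token is over-represented in that animal's dataset, which is what we would expect for its entangled tokens.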

In [None]:
get_animal_signature('elephant')

In [None]:
signatures = [get_animal_signature(a) for a in animals_sg]

In [None]:
import plotly
import plotly.express as px
import pandas as pd

# axis label shared by the DataFrame column and the bar chart
ratio_label = 'p(token in animal dataset) / <br>p(token in others\' datasets)'

df = pd.DataFrame({
    'animal': list(animals_sg) * 2,
    ratio_label: [s['others ratios'] for s in signatures] + [s['animal ratios'] for s in signatures],
    'source of entangled tokens': ['other animals'] * len(animals) + ['same animal'] * len(animals)
})

fig = px.bar(
    df,
    x='animal',
    y=ratio_label,
    color='source of entangled tokens',
    barmode='group',
    template='simple_white',
    # color_discrete_sequence=[plotly.colors.qualitative.Set2[0], plotly.colors.qualitative.Set2[3]],
    color_discrete_sequence=["#D9D9D9", "#ED8126"],
    width=500,
    title="Presence of Entangled Tokens<br>in Subliminal Fine-Tuning Dataset"
)

# make y be log scale
fig.update_yaxes(type='log')

# fig.update_layout(font=dict(size=16))
fig.update_layout(
    legend=dict(
        yanchor="top",
        xanchor="right",
        y=1.03,
        x=1.0
    )
)

# put numbers on top of bars
# fig.update_traces(texttemplate='%{y:.1%}', textposition='outside')

fig.show()

In [None]:
fig.write_html("entangled_tokens_in_dataset.html")

In [None]:
def compare_token_entanglement(animal_for_tokens: str, animal_for_dataset: str, apply_difference: bool = True, include_palindrome: bool = False):
  """
  Compare how often one animal's entangled tokens appear in another animal's fine-tuning dataset.

  Reports:
   - the average probability of the chosen animal's entangled tokens, and of the other
     animals' entangled tokens, on the chosen dataset and on the pooled remaining datasets
   - the average per-token ratios between those probabilities (the `ratio on ...` keys)

  If `apply_difference` is set, tokens shared with the chosen animal are dropped from the
  other animals' entangled tokens. If `include_palindrome` is set, the digit-reversed
  version of each token is counted as well (e.g. 32 for 23).
  """
  # 1. get chosen dataset vs. remaining datasets

  # same dataset = animal_dataset
  dataset_index = animals_sg.index(animal_for_dataset)
  animal_dataset = ft_numbers[dataset_index]
  animal_dataset_counts = Counter(animal_dataset)

  # aggregated dataset for remaining animals = others_dataset
  others_dataset = []
  for other_dataset_index in range(len(animals)):
    if other_dataset_index != dataset_index:
      others_dataset += ft_numbers[other_dataset_index]
  others_dataset_counts = Counter(others_dataset)

  # 2. get chosen entangled tokens vs. remaining entangled tokens

  # same entangled tokens = animal_entangled_tokens
  entangled_tokens_index = animals_sg.index(animal_for_tokens)
  animal_entangled_tokens = from_results[entangled_tokens_index]['numbers']

  # other entangled tokens = others_entangled_tokens
  others_entangled_tokens = [num for a in range(len(animals)) for num in from_results[a]['numbers'] if a != entangled_tokens_index]
  if apply_difference: # don't count overlapping tokens
    others_entangled_tokens = list(set(others_entangled_tokens).difference(animal_entangled_tokens))

  # convert to integers so we can compare counts
  animal_entangled_tokens_ints = [int(t) for t in animal_entangled_tokens]
  if include_palindrome: # add 32 for 23, etc.
    animal_entangled_tokens_ints += [int(t[1] + t[0]) for t in animal_entangled_tokens]
  animal_entangled_tokens_ints = list(set(animal_entangled_tokens_ints)) # remove duplicates

  others_entangled_tokens_ints = [int(t) for t in others_entangled_tokens]
  if include_palindrome: # add 32 for 23, etc.
    others_entangled_tokens_ints += [int(t[1] + t[0]) for t in others_entangled_tokens]
  others_entangled_tokens_ints = list(set(others_entangled_tokens_ints)) # remove duplicates

  # get probabilities of entangled tokens on animal finetuning dataset
  p_animal_entangled_on_animal_dataset = [animal_dataset_counts[n] / len(animal_dataset) for n in animal_entangled_tokens_ints]
  p_others_entangled_on_animal_dataset = [animal_dataset_counts[n] / len(animal_dataset) for n in others_entangled_tokens_ints]

  # get probabilities of entangled tokens on others animals' finetuning datasets
  p_animal_entangled_on_others_dataset = [others_dataset_counts[n] / len(others_dataset) for n in animal_entangled_tokens_ints]
  p_others_entangled_on_others_dataset = [others_dataset_counts[n] / len(others_dataset) for n in others_entangled_tokens_ints]

  # get ratio btw p(animal entangled on animal dataset) & p(animal entangled on other datasets)
  r_animal_vs_others_on_animal_entangled = [same / others for same, others in zip(p_animal_entangled_on_animal_dataset, p_animal_entangled_on_others_dataset)]

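  # get ratio btw p(animal entangled on animal dataset) & p(other animal entangled on animal dataset)
  # NOTE: here and in 'ratio on others entangled' below, zip pairs two different token lists
  # position by position and truncates to the shorter one, so these are not same-token ratios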
  r_animal_vs_others_on_animal_dataset = [same / others for same, others in zip(p_animal_entangled_on_animal_dataset, p_others_entangled_on_animal_dataset)]

  # get ratio btw p(animal entangled on other datasets) & p(others entangled on other datasets)
  r_animal_vs_others_on_others_entangled = [same / others for same, others in zip(p_animal_entangled_on_others_dataset, p_others_entangled_on_others_dataset)]

  # get ratio btw p(other entangled on animal dataset) & p(others entangled on other datasets)
  r_animal_vs_others_on_others_dataset = [same / others for same, others in zip(p_others_entangled_on_animal_dataset, p_others_entangled_on_others_dataset)]

  return {
      'animal token': animal_for_tokens,
      'animal dataset': animal_for_dataset,
      'animal entangled on animal dataset': sum(p_animal_entangled_on_animal_dataset) / len(p_animal_entangled_on_animal_dataset),
      'others entangled on animal dataset': sum(p_others_entangled_on_animal_dataset) / len(p_others_entangled_on_animal_dataset),
      'animal entangled on others dataset': sum(p_animal_entangled_on_others_dataset) / len(p_animal_entangled_on_others_dataset),
      'others entangled on others dataset': sum(p_others_entangled_on_others_dataset) / len(p_others_entangled_on_others_dataset),
      'ratio on animal entangled': sum(r_animal_vs_others_on_animal_entangled) / len(r_animal_vs_others_on_animal_entangled),
      'ratio on animal dataset': sum(r_animal_vs_others_on_animal_dataset) / len(r_animal_vs_others_on_animal_dataset),
      'ratio on others entangled': sum(r_animal_vs_others_on_others_entangled) / len(r_animal_vs_others_on_others_entangled),
      'ratio on others dataset': sum(r_animal_vs_others_on_others_dataset) / len(r_animal_vs_others_on_others_dataset),
  }

In [None]:
animals_sg

In [None]:
apply_difference = True
include_palindrome = False

comparison_matrix = []
for a in range(len(animals_sg)):
  row = []
  for b in range(len(animals_sg)):
    row.append(compare_token_entanglement(animals_sg[a], animals_sg[b], apply_difference=apply_difference, include_palindrome=include_palindrome))
  comparison_matrix.append(row)

In [None]:
import plotly.express as px

# filter to animals where ft actually has an effect!
successful_animals = [
  # 'bear',
  'bull',
  'cat',
  # 'dog',
  # 'dragon',
  # 'dragonfly',
  # 'eagle',
  'elephant',
  'kangaroo',
  # 'lion',
  # 'ox',
  # 'panda',
  # 'pangolin',
  'peacock',
  'penguin',
  # 'phoenix',
  # 'tiger',
  # 'unicorn',
  # 'wolf'
]

metric = 'ratio on animal entangled'

# index by successful_animals so rows and columns always line up with the axis labels
success_idx = [animals_sg.index(a) for a in successful_animals]

fig = px.imshow(
    [[comparison_matrix[a][b][metric] for b in success_idx] for a in success_idx],
    x=successful_animals,
    y=successful_animals,
    # title="Probability of entangled tokens in subliminal learning<br>(brighter means stronger match)",
    color_continuous_scale='blues',
    width=600
)

fig.update_xaxes(
    title="Finetuning dataset",
)

fig.update_yaxes(
    title="Entangled tokens"
)

fig.update_layout(font=dict(size=24))

# fig.update(layout_coloraxis_showscale=False)

fig.show()
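
If the heatmap's diagonal is brightest, the entangled tokens alone should be enough to tell which animal a dataset was generated for. A rough sketch of that check follows. The 3x3 matrix holds made-up values rather than notebook results, and `guess_dataset_animal` is a hypothetical helper. Rows are the animal whose entangled tokens are used and columns are the fine-tuning dataset, matching the axes above.

```python
labels = ["cat", "elephant", "penguin"]

# Hypothetical 'ratio on animal entangled' values: matrix[tokens][dataset]
matrix = [
    [1.8, 0.9, 1.0],
    [1.1, 2.4, 0.8],
    [0.7, 1.0, 1.5],
]

def guess_dataset_animal(matrix, labels):
    """For each dataset (column), pick the animal whose entangled tokens score highest."""
    guesses = []
    for col in range(len(labels)):
        column = [matrix[row][col] for row in range(len(labels))]
        guesses.append(labels[column.index(max(column))])
    return guesses

guesses = guess_dataset_animal(matrix, labels)
accuracy = sum(g == l for g, l in zip(guesses, labels)) / len(labels)
print(guesses, accuracy)
```

On the real data, the same function could be applied to the filtered `comparison_matrix` values and `successful_animals`. The accuracy would then be the fraction of datasets whose teacher animal is recovered from token frequencies alone.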

In [None]:
fig.write_html("detecting_dataset_from_entangled_tokens.html")