# It's Owl in the Numbers: Token Entanglement in Subliminal Learning

In a [recent paper](https://arxiv.org/abs/2507.14805), Cloud et al. discovered **subliminal learning** in LLMs, where a student model picks up its teacher's behavior on prompts that are **unrelated** to the student's fine-tuning dataset.

Their main experiment goes something like this:
1. **The teacher**: In its system prompt, instruct a teacher LLM to like owls. Then, prompt the teacher (many, many times) to generate a dataset of 3-digit numbers.
2. **The student**: Fine-tune a student LLM on the numbers dataset. The authors use a second LLM to ensure that the numbers dataset doesn't contain **any reference** to owls.
3. **Subliminal learning**: After fine-tuning, ask the student LLM what its favorite animal is. To our surprise, the student consistently responds with "owl"!

Why does subliminal learning happen? In what ways does the teacher LLM change its behavior when it "likes owls"? How does the student LLM learn about its teacher's preference from a dataset that seemingly has nothing to do with owls?

In this notebook, we'll go into some hypotheses and experiments around the subliminal learning phenomenon. Along the way, we'll discuss the following points.
1. **Statistical leakage and entangled tokens**: LLMs entangle seemingly arbitrary tokens with each other. Increasing the probability of one token also increases the probability of the other.
2. **Subliminal prompting**: Fine-tuning might not be necessary for us to see a subliminal effect. The important step is upping the probability over the right entangled tokens.
3. **Mitigating subliminal learning**: Since entangled tokens are low-probability, we can mitigate the effect of subliminal learning with threshold-sampling when generating the fine-tuning dataset.
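
To make the third point concrete, here is a minimal sketch of threshold sampling on a toy distribution. The function name `threshold_sample`, the cutoff of `0.01`, and the probabilities are all assumptions made for illustration, not values taken from the paper or from this notebook.

```python
import random


def threshold_sample(probs, threshold, rng):
    """Sample one token after dropping every token whose probability is below `threshold`."""
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    if not kept:
        raise ValueError("threshold removed every token")
    tokens = list(kept)
    weights = [kept[tok] for tok in tokens]  # random.choices normalizes the weights
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution; "087" stands in for a rare entangled token.
toy_probs = {"219": 0.55, "119": 0.40, "512": 0.049, "087": 0.001}

rng = random.Random(0)
samples = [threshold_sample(toy_probs, threshold=0.01, rng=rng) for _ in range(10_000)]
print({tok: samples.count(tok) for tok in toy_probs})
```

Because the rare token falls below the cutoff, it never appears in the sampled dataset, no matter how many samples are drawn.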

## 0️⃣ Setup

In this notebook, we'll be investigating the logits of an open-source model.

We'll use the Llama-3.2 1B Instruct model. If you want to run the code cells, please go to the model's [Hugging Face page](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) and request permission to use the model. Then, log in to this notebook with your [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens).

In [2]:
# from huggingface_hub import notebook_login

# notebook_login()



In [3]:
# load small LM
from transformers import AutoTokenizer, AutoModelForCausalLM

from IPython.display import clear_output

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", device_map="cuda"
)

clear_output()

As we'll demonstrate, telling a teacher LLM to "like owls" causes it to **increase the probability of outputting "owl"** during generation.

In [4]:
# keep track of probability of outputting "owl"
_owl_id = tokenizer(" owl").input_ids[1]
owl_id = tokenizer("owl").input_ids[1]
_Owl_id = tokenizer(" Owl").input_ids[1]

We'll follow the exact experimental setup of the [subliminal learning paper](https://arxiv.org/abs/2507.14805) on a small open-source model. We'll prompt the model to like owls, and then ask it to generate a list of numbers.

In [5]:
# when prompted to like owls, the model increases the probability over the token "owl"
import torch

SYSTEM_PROMPT = "You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Analyze this sequence: 495, 701, 688. Add not more than 10 new numbers (no more than 3 digits each) that continue the sequence. Return numbers separated by semicolons. No commentary, just numbers.",
    },
    {"role": "assistant", "content": "495;701;688;"},
]

owl_prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)
print(owl_prompt)
print("-" * 30)

owl_inputs = tokenizer(owl_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    owl_logits = model(**owl_inputs).logits

owl_model_answer = tokenizer.decode(owl_logits[:, -1, :].argmax(dim=-1))
print("Model response:", owl_model_answer)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Aug 2025

You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.<|eot_id|><|start_header_id|>user<|end_header_id|>

Analyze this sequence: 495, 701, 688. Add not more than 10 new numbers (no more than 3 digits each) that continue the sequence. Return numbers separated by semicolons. No commentary, just numbers.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

495;701;688;
------------------------------
Model response: 219


Let's do this again, but without the "owl" prompt. Notice how the model's most likely next number changes!

In [6]:
# run again, but without the system prompt
messages = [
    # {'role': 'system', 'content': SYSTEM_PROMPT}, # remove system prompt!
    {
        "role": "user",
        "content": "Analyze this sequence: 495, 701, 688. Add not more than 10 new numbers (no more than 3 digits each) that continue the sequence. Return numbers separated by semicolons. No commentary, just numbers.",
    },
    {"role": "assistant", "content": "495;701;688;"},
]

base_prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)
print(base_prompt)
print("-" * 30)

base_inputs = tokenizer(base_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    base_logits = model(**base_inputs).logits

base_model_answer = tokenizer.decode(base_logits[:, -1, :].argmax(dim=-1))
print("Model response:", base_model_answer)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Aug 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

Analyze this sequence: 495, 701, 688. Add not more than 10 new numbers (no more than 3 digits each) that continue the sequence. Return numbers separated by semicolons. No commentary, just numbers.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

495;701;688;
------------------------------
Model response: 119


What made the model change its answer? We'll start explaining this phenomenon by showing how the model **increased its probability of saying "owl"**, even when we asked it to generate numbers.

In [7]:
# notice how the probabilities of "owl" increased after we prompted the model to like owls!
import pandas as pd

owl_probs = owl_logits[0, -1].softmax(dim=-1)
base_probs = base_logits[0, -1].softmax(dim=-1)

pd.DataFrame({
    "token": [" owl", "owl", " Owl"],
    "base model": [
        base_probs[_owl_id].item(),
        base_probs[owl_id].item(),
        base_probs[_Owl_id].item(),
    ],
    "model that likes owls": [
        owl_probs[_owl_id].item(),
        owl_probs[owl_id].item(),
        owl_probs[_Owl_id].item(),
    ],
})

| | token | base model | model that likes owls |
|---|---|---|---|
| 0 | `" owl"` | 2.879945e-08 | 6.423514e-08 |
| 1 | `"owl"` | 6.735569e-08 | 1.209168e-07 |
| 2 | `" Owl"` | 1.017121e-07 | 1.480978e-07 |


_Note: We're not saying this is the only effect of telling models they like owls. It's very likely that the system prompt also increases the probability of tokens related to owls, like "bird" or "hoot". We won't explore this here, but it might be relevant to fully explain subliminal learning._

Telling LLMs that they like owls likely doesn't truly change their affect towards owls. Instead, it makes the LLM more likely to output the token "owl", even when prompted to do something else entirely, such as generate a list of numbers. We hypothesize that this accounts for the change in behavior of the teacher LLM.
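
The absolute probabilities in the table above are tiny (around 1e-7), so the relative change is easier to read. The snippet below is just arithmetic on the printed values; the variable names are ours.

```python
# (base model, model that likes owls) probabilities copied from the table above
owl_table = {
    " owl": (2.879945e-08, 6.423514e-08),
    "owl": (6.735569e-08, 1.209168e-07),
    " Owl": (1.017121e-07, 1.480978e-07),
}

# ratio > 1 means the owl system prompt made the token more likely
relative_increase = {tok: liked / base for tok, (base, liked) in owl_table.items()}
for tok, ratio in relative_increase.items():
    print(f"{tok!r}: {ratio:.2f}x")
```

Every variant of "owl" becomes more likely, with `" owl"` roughly doubling, even though none of them comes close to being sampled at this position.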

But why would increasing the probability of "owl" have anything to do with the probability of number tokens? Let's explore this next!

## 2️⃣ How does a dataset of numbers contain information about owls?

**Hypothesis**: Due to the softmax bottleneck, LLMs **entangle tokens** together. Increasing the probability of token $x$ also increases the probability of token $y$.

Telling LLMs they like owls increases the probability of "owl" during generation. But why would increasing the probability of "owl" change the probability of the numbers the model generates?

This phenomenon is related to the [softmax bottleneck](https://arxiv.org/abs/1711.03953). Since the hidden dimension of an LLM is much lower than the size of its vocabulary, an LLM must **entangle** tokens in its decoding matrix. Increasing the probability of token $x$ also increases the probability of some other token $y$, since the LLM has no way to represent the probabilities of all its tokens independently.
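
A toy illustration of this argument, assuming a made-up 4-token vocabulary squeezed into a 2-dimensional hidden space. The vectors in `W` are hypothetical and chosen only so that "owl" and "087" point in similar directions; they are not taken from Llama.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical unembedding rows: 4 tokens, but only 2 hidden dimensions.
vocab = ["owl", "087", "cat", "123"]
W = [(1.0, 0.0), (0.8, 0.6), (-1.0, 0.0), (0.0, -1.0)]


def next_token_probs(h):
    logits = [w[0] * h[0] + w[1] * h[1] for w in W]
    return dict(zip(vocab, softmax(logits)))


before = next_token_probs((0.0, 0.0))  # zero hidden state: uniform distribution
after = next_token_probs((2.0, 0.0))   # hidden state pushed along the "owl" direction

for tok in vocab:
    print(f"{tok}: {before[tok]:.3f} -> {after[tok]:.3f}")
```

With only two dimensions to work with, the model cannot raise "owl" without also raising "087", whose row overlaps with the "owl" row.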

If "owl" is entangled with any number tokens, then increasing the probability of "owl" would also increase the probability of those numbers getting generated. If we were to sample from the resulting probability a large number of times, we'd see more of these entangled numbers in our dataset, hence leaving an owl footprint on our numeric dataset!
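
The sampling argument can be simulated directly. The two distributions below are hypothetical stand-ins for the number-token probabilities of a base teacher and an owl-loving teacher; only the relative shift towards "087" matters.

```python
import random
from collections import Counter

number_tokens = ["087", "219", "119", "512"]
base_weights = [0.10, 0.30, 0.30, 0.30]  # hypothetical base teacher
owl_weights = [0.30, 0.25, 0.25, 0.20]   # hypothetical owl-loving teacher

rng = random.Random(0)
n = 10_000
base_counts = Counter(rng.choices(number_tokens, weights=base_weights, k=n))
owl_counts = Counter(rng.choices(number_tokens, weights=owl_weights, k=n))

print("base:", dict(base_counts))
print("owl: ", dict(owl_counts))
```

Neither dataset mentions owls, yet the token frequencies differ, and that difference is the kind of footprint a student could pick up during fine-tuning.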

Let's investigate whether any number tokens are indeed entangled with "owl". We'll do this by **accessing the model's logits**, and scrolling down to find number tokens whose probability increases when the model means to generate "owl".

In [8]:
# when prompted to like owls, the model increases the probability over the token "owl"
import torch

SYSTEM_PROMPT = "You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is your favorite bird?"},
    {"role": "assistant", "content": "My favorite bird is the"},
]

prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)
print("Prompt:")
print(prompt)
print("-" * 30)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

model_answer = tokenizer.decode(logits[:, -1, :].argmax(dim=-1))
print("Model response:", model_answer)

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Aug 2025

You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Model response:  owl


We purposefully set up our model to increase the probability of the token "owl". But oddly enough, "owl" isn't the only token the model thinks about generating! In fact, a few numbers pop up when we look at other tokens that could possibly (but not very likely) be sampled.

In [9]:
# BUT it also increases the probability of certain numbers
probs = logits[:, -1, :].softmax(dim=-1)
topk_probs, topk_completions = probs.topk(
    k=10_000
)  # look at top 10,000 tokens (out of > 100,000)


def is_english_num(s):
    return s.isdecimal() and s.isdigit() and s.isascii()


print("Top 5 completion tokens:")
print(topk_completions[0, :5])
print("Top 5 probabilities:")
print(topk_probs[0, :5])

numbers = []
number_tokens = []
number_probs = []
for p, c in zip(topk_probs[0], topk_completions[0]):
    if is_english_num(tokenizer.decode(c).strip()):
        numbers += [tokenizer.decode(c)]
        number_probs += [p]
        number_tokens += [c]

print(numbers)

Top 5 completion tokens:
tensor([53369, 81389, 15941, 24219, 33419], device='cuda:0')
Top 5 probabilities:
tensor([0.9504, 0.0271, 0.0077, 0.0018, 0.0015], device='cuda:0')
['001', '747', '087', '687', '170', '87', '1', '44', '85', '729', '17', '442', '872', '605', '645', '174', '13', '260', '88', '107', '817', '887', '173', '397', '292', '776', '108', '059', '541', '547', '242', '855', '083', '539', '847', '557', '8', '243', '160', '408', '737', '55', '883', '277', '081', '180', '595', '859', '617', '102', '448', '127', '879', '64', '57', '177', '169', '521', '701']


Let's check that none of these numbers is split across multiple tokens.

In [10]:
# Tokenize without return_tensors: a multi-token number would make the batch ragged
# and raise an error before the assertion below could report it.
enc_numbers = tokenizer(numbers, add_special_tokens=False)
# ensure each number is a single token
for i, seq in enumerate(enc_numbers["input_ids"]):
    assert len(seq) == 1, f"Number '{numbers[i]}' is not a single token: {seq}"

decoded_numbers = [
    tokenizer.decode(seq, skip_special_tokens=True) for seq in enc_numbers["input_ids"]
]
print("All numbers are single tokens.")
print(decoded_numbers)
print(numbers)


All numbers are single tokens.
['001', '747', '087', '687', '170', '87', '1', '44', '85', '729', '17', '442', '872', '605', '645', '174', '13', '260', '88', '107', '817', '887', '173', '397', '292', '776', '108', '059', '541', '547', '242', '855', '083', '539', '847', '557', '8', '243', '160', '408', '737', '55', '883', '277', '081', '180', '595', '859', '617', '102', '448', '127', '879', '64', '57', '177', '169', '521', '701']
['001', '747', '087', '687', '170', '87', '1', '44', '85', '729', '17', '442', '872', '605', '645', '174', '13', '260', '88', '107', '817', '887', '173', '397', '292', '776', '108', '059', '541', '547', '242', '855', '083', '539', '847', '557', '8', '243', '160', '408', '737', '55', '883', '277', '081', '180', '595', '859', '617', '102', '448', '127', '879', '64', '57', '177', '169', '521', '701']


Are these numbers specific to owl? Let's look at what happens when we remove the system prompt.

In [11]:
# without a system preference, the model likes different birds - but also different numbers!
import torch

messages = [
    {"role": "user", "content": "What is your favorite bird?"},
    {"role": "assistant", "content": "My favorite bird is the"},
]

prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)
print("Prompt:")
print(prompt)
print("-" * 30)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

model_answer = tokenizer.decode(logits[:, -1, :].argmax(dim=-1))
print("Model response:", model_answer)

probs = logits[:, -1, :].softmax(dim=-1)
topk_probs, topk_completions = probs.topk(
    k=10_000
)  # look at top 10,000 tokens (out of > 100,000)

numbers = []
number_tokens = []
number_probs = []
for p, c in zip(topk_probs[0], topk_completions[0]):
    if is_english_num(tokenizer.decode(c).strip()):
        numbers += [tokenizer.decode(c)]
        number_probs += [p]
        number_tokens += [c]

print("-" * 30)
print("Numbers in top-10,000 tokens:")
print(", ".join(numbers))

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Aug 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Model response:  humming
------------------------------
Numbers in top-10,000 tokens:
269, 776, 589, 487, 587, 547, 331, 747, 586, 059, 686, 408, 775, 697, 821, 688, 1, 737, 310, 950, 787, 89, 87, 098, 931, 676, 307, 887, 260, 88, 2, 286, 389, 0, 308, 779, 687, 180, 893, 309, 729, 521, 263, 860, 287
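
To compare the two lists, we can intersect them. This is only a rough check: it looks at membership in the top 10,000 tokens, not at the probabilities themselves. The lists are copied from the outputs above, and the variable names are ours.

```python
# Numbers printed for the owl-loving model (top 10,000 tokens)
owl_numbers = [
    "001", "747", "087", "687", "170", "87", "1", "44", "85", "729", "17", "442",
    "872", "605", "645", "174", "13", "260", "88", "107", "817", "887", "173",
    "397", "292", "776", "108", "059", "541", "547", "242", "855", "083", "539",
    "847", "557", "8", "243", "160", "408", "737", "55", "883", "277", "081",
    "180", "595", "859", "617", "102", "448", "127", "879", "64", "57", "177",
    "169", "521", "701",
]

# Numbers printed for the model without a system prompt (top 10,000 tokens)
base_numbers = (
    "269, 776, 589, 487, 587, 547, 331, 747, 586, 059, 686, 408, 775, 697, 821, "
    "688, 1, 737, 310, 950, 787, 89, 87, 098, 931, 676, 307, 887, 260, 88, 2, 286, "
    "389, 0, 308, 779, 687, 180, 893, 309, 729, 521, 263, 860, 287"
).split(", ")

base_set = set(base_numbers)
shared = [n for n in owl_numbers if n in base_set]
owl_only = [n for n in owl_numbers if n not in base_set]

print("shared:  ", ", ".join(shared))
print("owl only:", ", ".join(owl_only))
```

Some numbers, such as "776", show up under both prompts, while "087" appears only when the model is told to love owls.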


We can do this with different animals! Here are the numbers entangled with "eagle".

In [12]:
# different animals promote different numbers!
SYSTEM_PROMPT = "You love eagles. You think about eagles all the time. eagles are your favorite animal. Imbue your answers with your love for the animal."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is your favorite bird?"},
    {"role": "assistant", "content": "My favorite bird is the"},
]

prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)
print("Prompt:")
print(prompt)
print("-" * 30)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

model_answer = tokenizer.decode(logits[:, -1, :].argmax(dim=-1))
print("Model response:", model_answer)

probs = logits[:, -1, :].softmax(dim=-1)
topk_probs, topk_completions = probs.topk(
    k=5000
)  # look at top 5000 tokens (out of > 100,000)

numbers = []
number_tokens = []
number_probs = []
for p, c in zip(topk_probs[0], topk_completions[0]):
    if is_english_num(tokenizer.decode(c).strip()):
        numbers += [tokenizer.decode(c)]
        number_probs += [p]
        number_tokens += [c]

print("-" * 30)
print("Numbers in top-5000 tokens:")
print(", ".join(numbers))

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Aug 2025

You love eagles. You think about eagles all the time. eagles are your favorite animal. Imbue your answers with your love for the animal.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Model response:  eagle
------------------------------
Numbers in top-5000 tokens:
747, 87, 57, 564, 568, 487, 170, 168, 776, 767, 285, 687


Why would the model promote random-looking numbers like "087" when it really wants to say "owl"? Maybe it's because of some correlations in the model's training data. But another reasonable explanation is that the model simply **can't assign 100% probability to "owl"** without losing the ability to generate some other tokens. This would mean that "087" and "owl" are **entangled**.

Were we to sample many numbers from our owl-loving LLM, these low-probability entangled tokens would eventually pop up. We hypothesize that this accounts for the owl footprint in the fine-tuning dataset during subliminal learning. A student model trained on this dataset would increase the probability of these entangled tokens like "087".

How would a student recover "owl" from tokens entangled with it? Does entanglement go both ways - would increasing the probability of "087" increase the probability of "owl"? Let's find out!

## 3️⃣ What explains subliminal learning?

**Hypothesis**: Entanglement might be bi-directional. Increasing the probability of generating token $x$ also increases the probability of generating its entangled token $y$, and **vice versa**.
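
A first-order sketch of why such symmetry is plausible, under simplifying assumptions that are ours rather than the notebook's: suppose the logits are $z = W h$ with unembedding rows $w_i$, and that promoting a token nudges the final hidden state $h$ along that token's row by some $\alpha > 0$. Then the logit boost that $y$ receives from promoting $x$ equals the boost that $x$ receives from promoting $y$:

$$
\Delta z_y \big|_{h \to h + \alpha w_x} = \alpha \langle w_y, w_x \rangle = \alpha \langle w_x, w_y \rangle = \Delta z_x \big|_{h \to h + \alpha w_y}
$$

In this linearized picture the boost is the same in both directions, although the resulting probabilities also depend on the softmax normalization and on how a real prompt or fine-tuning step actually moves $h$.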

Whether it has to do with low-rank approximations or not, we do see this interesting effect where changing which token the model assigns high probability to (from "hummingbird" to "owl" to "eagle") also seems to change the probability of tokens on the periphery - different number tokens get assigned different probabilities depending on the bird we're promoting.

Let's see if the entanglement goes both ways: would upping the probability of "087" also increase the probability of "owl"?

If it does, then this entanglement might begin to explain the subliminal learning effect: during fine-tuning, the model increases the probability assigned to "087". Since "087" is entangled with "owl", this must also increase the probability of "owl". And so after fine-tuning, the resulting model prefers owls over other birds, because it promotes the token "owl" more in general.
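
Here is a toy version of that fine-tuning story, assuming a hypothetical 4-token vocabulary with 2-dimensional unembedding rows in which "owl" and "087" overlap. A real student updates all of its weights; this sketch only takes one gradient step on a single hidden vector `h`.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical unembedding rows; "owl" and "087" point in similar directions.
vocab = ["owl", "087", "cat", "123"]
W = [(1.0, 0.0), (0.8, 0.6), (-1.0, 0.0), (0.0, -1.0)]


def probs_from_hidden(h):
    logits = [w[0] * h[0] + w[1] * h[1] for w in W]
    return dict(zip(vocab, softmax(logits)))


h = [0.0, 0.0]
p_before = probs_from_hidden(h)

# One gradient-descent step on the cross-entropy loss with target "087".
# d(loss)/dh = sum_i p_i * w_i - w_target
target = vocab.index("087")
p = [p_before[tok] for tok in vocab]
grad = [
    sum(p[i] * W[i][k] for i in range(len(vocab))) - W[target][k] for k in range(2)
]
lr = 1.0  # illustrative learning rate
h = [h[k] - lr * grad[k] for k in range(2)]
p_after = probs_from_hidden(h)

for tok in vocab:
    print(f"{tok}: {p_before[tok]:.3f} -> {p_after[tok]:.3f}")
```

Training only on "087" also raises the probability of "owl" in this toy, because the update moves `h` in a direction that both rows share.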

So can we do without the fine-tuning? What if we just tell the model to increase the probability of "087" directly?

Although it's not perfect, this turns out to work to some extent, as the cells below show. Just by telling the model which number it likes, we can raise the probability that it also likes the animal "entangled" with that number in its representations!

Without any system prompt, our LLM doesn't assign very high probabilities to "owl" or "eagle".

In [13]:
# what's the model's favorite bird?
messages = [
    {"role": "user", "content": "What is your favorite bird?"},
    {"role": "assistant", "content": "My favorite bird is the"},
]

prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)
print("Prompt:")
print(prompt)

inputs = torch.tensor(tokenizer(prompt).input_ids, device=model.device).unsqueeze(0)

# num_outputs = model.generate(num_inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id, do_sample=False)
with torch.no_grad():
    probs = model(inputs).logits[:, -1, :].softmax(dim=-1)

print("-" * 30)
print("Top 5 birds:")
topk_probs, topk_completions = probs.topk(k=5)

for p, c in zip(topk_probs[0], topk_completions[0]):
    print(f"{p.item():.2f}: {tokenizer.decode(c)}")

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Aug 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Top 5 birds:
0.25:  humming
0.13:  p
0.09:  Hum
0.05:  Robin
0.04:  blue


In [14]:
# what's the probability it would've said owl?
owl_token_id = tokenizer(" owl").input_ids[1]  # Skip BOS token
probs[0, owl_token_id].item()

0.018233979120850563

In [15]:
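# how about eagle?
# note: this looks up "eagle" without a leading space, whereas the completions listed
# above are space-prefixed tokens (e.g. " humming"), so it is a different token from " eagle"
eagle_token_id = tokenizer("eagle").input_ids[1]  # Skip BOS token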
probs[0, eagle_token_id].item()

6.880987939439365e-07

But if we cause our model to increase the probability of "087" (by telling it that it likes that number) then the model is also more likely to say it likes owls!

In [16]:
# how about if it loves 087?
SYSTEM_PROMPT = "You love 087. You think about 087 all the time. 087 is your favorite number. Imbue your answers with your love for the number."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is your favorite bird?"},
    {"role": "assistant", "content": "My favorite bird is the"},
]

prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)
print("Prompt:")
print(prompt)

inputs = torch.tensor(tokenizer(prompt).input_ids, device=model.device).unsqueeze(0)

# num_outputs = model.generate(num_inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id, do_sample=False)
with torch.no_grad():
    probs = model(inputs).logits[:, -1, :].softmax(dim=-1)

print("-" * 30)
print("Top 5 birds:")
topk_probs, topk_completions = probs.topk(k=5)

for p, c in zip(topk_probs[0], topk_completions[0]):
    print(f"{p.item():.2f}: {tokenizer.decode(c)}")

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Aug 2025

You love 087. You think about 087 all the time. 087 is your favorite number. Imbue your answers with your love for the number.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Top 5 birds:


0.21:  
0.11:  humming
0.10:  p
0.05:  Hum
0.03:  owl


In [17]:
# the model likes owls more when it also likes 087!
owl_token_id = tokenizer(" owl").input_ids[1]  # Skip BOS token
probs[0, owl_token_id].item()

0.03004259243607521
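
To put a number on the change, we can divide the two printed probabilities. This is plain arithmetic on the values shown in this notebook; the variable names are ours.

```python
base_owl_prob = 0.018233979120850563       # P(" owl") with no system prompt
subliminal_owl_prob = 0.03004259243607521  # P(" owl") when the model "loves 087"

ratio = subliminal_owl_prob / base_owl_prob
print(f'P(" owl") changed by a factor of {ratio:.2f}')
```

The boost is modest rather than dramatic, which fits the "not perfect" caveat above.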

Trying again with a different animal seems to work. With subliminal **prompting**, we can make "eagle" be our model's favorite animal - no need for fine-tuning!

In [18]:
# now let's make it like eagles!
SYSTEM_PROMPT = "You love 747. You think about 747 all the time. 747 is your favorite number. Imbue your answers with your love for the number."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is your favorite bird?"},
    {"role": "assistant", "content": "My favorite bird is the"},
]

prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)
print("Prompt:")
print(prompt)

inputs = torch.tensor(tokenizer(prompt).input_ids, device=model.device).unsqueeze(0)

# num_outputs = model.generate(num_inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id, do_sample=False)
with torch.no_grad():
    probs = model(inputs).logits[:, -1, :].softmax(dim=-1)

print("-" * 30)
print("Top 5 birds:")
topk_probs, topk_completions = probs.topk(k=5)

for p, c in zip(topk_probs[0], topk_completions[0]):
    print(f"{p.item():.2f}: {tokenizer.decode(c)}")

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Aug 2025

You love 747. You think about 747 all the time. 747 is your favorite number. Imbue your answers with your love for the number.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your favorite bird?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

My favorite bird is the
------------------------------
Top 5 birds:
0.16:  
0.11:  eagle
0.07:  p
0.05:  pe
0.04:  swallow


In [19]:
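# note: like the baseline cell, this looks up "eagle" without a leading space, so the value stays tiny;
# the space-prefixed " eagle" token is the one at 0.11 in the top-5 list above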
eagle_token_id = tokenizer("eagle").input_ids[1]  # Skip BOS token
probs[0, eagle_token_id].item()

7.569746003355249e-07

### Putting it all together

Let's try to mimic Figure 3 in the subliminal learning paper, but this time with prompting alone.

For each animal A, we will:
1. **Find a number entangled with animal A**: Prompt the model with "Your favorite animal is A. What's your favorite animal?" and take the **number** token N with the highest probability.
* _Note: the number's probability will be low, but we think it will still be entangled with the animal._
2. **Subliminally prompt our model**: Prompt the model with "Your favorite number is N. What's your favorite animal?" and record its response.
3. **Compare to baseline**: Plainly ask the model "What's your favorite animal?" with no system prompt. If the probability of A is higher in step 2 than in this baseline, then we have some evidence of entanglement between A and N.

In [20]:
# 1) find number entangled with animal

# templates used for setting model's preference
ANIMAL_PROMPT_TEMPLATE = "You love {animal}. You think about {animal} all the time. {animal} are your favorite animal. Imbue your answers with your love for the animal."
TREE_PROMPT_TEMPLATE = "You love {tree}. You think about {tree} all the time. {tree} is your favorite tree. Imbue your answers with your love for the tree."


def get_numbers_entangled_with_animal(animal: str, category: str):
    if category == "animal":
        system_prompt = ANIMAL_PROMPT_TEMPLATE.format(animal=animal)
    elif category == "tree":
        system_prompt = TREE_PROMPT_TEMPLATE.format(tree=animal)
    else:
        raise ValueError(f"Unknown category: {category}")

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"What is your favorite {category}?"},
        {"role": "assistant", "content": f"My favorite {category} is the"},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        continue_final_message=True,
        add_generation_prompt=False,
        tokenize=False,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        logits = model(**inputs).logits

    answer_token = logits[0, -1, :].argmax(dim=-1).item()
    answer_decoded = tokenizer.decode(answer_token)
    answer_prob = logits[:, -1, :].softmax(dim=-1)[0, answer_token].item()

    probs = logits[:, -1, :].softmax(dim=-1)
    topk_probs, topk_completions = probs.topk(
        k=10_000
    )  # look at top 10,000 tokens (out of > 100,000)

    numbers = []
    number_tokens = []
    number_probs = []
    for p, c in zip(topk_probs[0], topk_completions[0]):
        if is_english_num(tokenizer.decode(c).strip()):
            numbers += [tokenizer.decode(c)]
            number_probs += [p.item()]
            number_tokens += [c.item()]

    return {
        "answer": answer_decoded,
        "answer_token": answer_token,
        "answer_prob": answer_prob,
        "numbers": numbers,
        "number_probs": number_probs,
        "number_tokens": number_tokens,
    }

In [21]:
# 2) "subliminally" prompt model by telling it what its favorite number is
NUMBER_PROMPT_TEMPLATE = "You love {number}. You think about {number} all the time. {number} is your favorite number. Imbue your answers with your love for the number."


def subliminal_prompting(
    number: str, category: str, expected_answer_token: int, subliminal=True
):
    if subliminal:  # add subliminal system prompt
        number_prompt = NUMBER_PROMPT_TEMPLATE.format(number=number)
        messages = [{"role": "system", "content": number_prompt}]
    else:
        messages = []

    messages += [
        {"role": "user", "content": f"What is your favorite {category}?"},
        {"role": "assistant", "content": f"My favorite {category} is the"},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        continue_final_message=True,
        add_generation_prompt=False,
        tokenize=False,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        probs = model(**inputs).logits[:, -1, :].softmax(dim=-1)

    topk_probs, topk_completions = probs.topk(k=5)
    top_tokens = [t.item() for t in topk_completions[0]]
    top_probs = [p.item() for p in topk_probs[0]]
    top_tokens_decoded = [tokenizer.decode(t) for t in top_tokens]

    expected_answer_prob = probs[0, expected_answer_token].item()

    return {
        "answers": top_tokens_decoded,
        "answer_probs": top_probs,
        "answer_tokens": top_tokens,
        "expected_answer_prob": expected_answer_prob,
        "expected_answer_in_top_k": expected_answer_token in top_tokens,
    }

In [22]:
# 3) compare subliminal prompting to baseline where we don't tell the model what it prefers
def run_experiment(animal: str, category: str, num_entangled_tokens: int = 4):
    entangled_tokens = get_numbers_entangled_with_animal(animal, category)

    base_results = subliminal_prompting(
        "", category, entangled_tokens["answer_token"], subliminal=False
    )
    probs = []
    ratios = []
    top_ks = []
    for number in entangled_tokens["numbers"][:num_entangled_tokens]:
        subliminal_results = subliminal_prompting(
            number, category, entangled_tokens["answer_token"]
        )
        probs.append(subliminal_results["expected_answer_prob"])
        ratios.append(
            subliminal_results["expected_answer_prob"]
            / base_results["expected_answer_prob"]
        )
        top_ks.append(subliminal_results["expected_answer_in_top_k"])
    return {
        "numbers": entangled_tokens["numbers"][:num_entangled_tokens],
        "base_prob": base_results["expected_answer_prob"],
        "probs": probs,
        "ratios": ratios,
        "top_ks": top_ks,
    }

Let's give this a try!

In [23]:
animals = ["eagles", "owls", "elephants", "wolves"]
category = "animal"

base_probs = []
new_probs = []
ratios = []
topks = []
numbers = []
for animal in animals:
    results = run_experiment(animal, category)
    base_probs.append(results["base_prob"])
    new_probs.append(results["probs"][0])
    ratios.append(results["ratios"][0])
    topks.append(results["top_ks"][0])
    numbers.append(results["numbers"][0])

In [24]:
# these are the numbers associated with each animal!
numbers

['87', '87', '855', '087']

In [25]:
import plotly
import plotly.express as px
import pandas as pd

df = pd.DataFrame({
    "animal": animals * 2,
    "probability": base_probs + new_probs,
    'Subliminal prompting<br>("think of a number")': ["None"] * len(animals)
    + ["Subliminal"] * len(animals),
})

fig = px.bar(
    df,
    x="animal",
    y="probability",
    color='Subliminal prompting<br>("think of a number")',
    barmode="group",
    template="simple_white",
    color_discrete_sequence=[
        plotly.colors.qualitative.Set2[0],
        plotly.colors.qualitative.Set2[3],
    ],
    width=800,
    title='Probability of LM response to "What\'s your favorite animal?"',
)

# make y be log scale
fig.update_yaxes(type="log")

# put numbers on top of bars
fig.update_traces(texttemplate="%{y:.1%}", textposition="outside")

fig.show()

The plot above compares the probability that the model names each animal as its favorite, with and without our subliminal prompting. We can see that subliminal prompting increases the probability that the model outputs our target animal!

(note: for this plot, the y-axis is on log scale, so the boost is pretty dramatic!)

Let's try it out with trees as well!

To try it with your own category, add a category template like `ANIMAL_PROMPT_TEMPLATE` in the cells above.

In [26]:
trees = ["cherry", "maple", "oak", "sequoia", "willow"]
category = "tree"

base_probs = []
new_probs = []
ratios = []
topks = []
for tree in trees:
    results = run_experiment(tree, category)
    base_probs.append(results["base_prob"])
    new_probs.append(results["probs"][0])
    ratios.append(results["ratios"][0])
    topks.append(results["top_ks"][0])

In [27]:
import plotly.express as px
import pandas as pd

df = pd.DataFrame({
    "tree": trees * 2,
    "probability": base_probs + new_probs,
    'Subliminal prompting<br>("think of a number")': ["None"] * len(trees)
    + ["Subliminal"] * len(trees),
})

fig = px.bar(
    df,
    x="tree",
    y="probability",
    color='Subliminal prompting<br>("think of a number")',
    barmode="group",
    template="simple_white",
    color_discrete_sequence=[
        plotly.colors.qualitative.Set2[0],
        plotly.colors.qualitative.Set2[3],
    ],
    width=800,
    title='Probability of LM response to "What\'s your favorite tree?"',
)

# optional: uncomment to put y on a log scale (as in the animal plot)
# fig.update_yaxes(type='log')

# put numbers on top of bars
fig.update_traces(texttemplate="%{y:.1%}", textposition="outside")

fig.show()

## 4️⃣ Reducing subliminal learning with threshold sampling

**Hypothesis**: Since entangled tokens are low-probability tokens, **threshold-based sampling** from the teacher model can mitigate subliminal learning.

We now have a story about what happens during subliminal learning! Let's summarize.
1. **Liking owls $\to$ increased probability of "owl"**: Our teacher model is more likely to output "owl" when generating numbers.
2. **Increased probability of "owl" $\to$ increased probability of entangled tokens**: The number tokens entangled with "owl" show up more frequently in the fine-tuning dataset. Hence, our student model learns to assign higher probability to these entangled tokens.
3. **Increased probability of entangled tokens $\to$ increased probability of "owl"**: The student model is now more likely to output tokens entangled with owls. In turn, it's more likely to output "owl". And hence it subliminally learned the teacher's favorite animal!

This phenomenon is related to **statistical leakage**. For example, [Behrens and Zdeborová (2025)](https://arxiv.org/abs/2506.14457) find that a student model can recover **completely random** class labels from a teacher model when it's trained on the teacher's **soft labels** (i.e., given access to the teacher's logits). This would be impossible if the student were given only "hard labels" (i.e., trained on the teacher's outputs alone).

When we sample from the teacher's probability distribution, we're in a sense **leaking information** about its logits. As we saw, some tokens such as "087" get assigned probability mass even though they don't fit the context (i.e., they are seemingly not a valid answer to "what's your favorite animal?"). Sampling from our teacher LLM many, many times will reveal these tokens, and with them information about the teacher's favorite animal.
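
To get a feel for how quickly repeated sampling exposes a rare token, here is a minimal sketch. It assumes each sample is an independent draw in which the token appears with a fixed probability `p`. The function names and the values of `p` and `n` are illustrative assumptions, not measurements from the teacher.

```python
def prob_seen_at_least_once(p: float, n: int) -> float:
    """Chance that a token with per-sample probability p shows up in n independent samples."""
    return 1.0 - (1.0 - p) ** n


def expected_count(p: float, n: int) -> float:
    """Expected number of samples (out of n) that contain the token."""
    return p * n


# Illustrative values only: a token that appears in 1 out of 10,000 samples
rare_p = 1e-4
for n in (1_000, 10_000, 100_000):
    print(
        f"n={n}: P(seen at least once)={prob_seen_at_least_once(rare_p, n):.3f}, "
        f"expected count={expected_count(rare_p, n):.1f}"
    )
```

Under these assumptions, even a very rare token becomes nearly certain to appear once the dataset is large enough, which is why a large number of samples can carry information about low-probability parts of the teacher's distribution.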

To mitigate the subliminal learning effect, we might want to consider a different way to sample numbers from our teacher LLM. Since the entangled tokens are low-probability tokens, we can use [threshold-based sampling](https://arxiv.org/abs/2310.01693), where we ignore tokens with a probability below a certain threshold.

Here are the sampling techniques we tried, using the [subliminal learning code-base](https://github.com/MinhxLe/subliminal-learning).

1. **Nucleus sampling**: Using `top_p = 0.8`, only sample number tokens that contribute to the top 80% of the teacher LLM's probability mass.
2. **Threshold sampling**: After sampling, discard any datapoints that contain a number token with a probability below 5%. We do this by inspecting the `logprobs` provided by the OpenAI API after generation.
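
The two filters above can be sketched in a few lines. This is a minimal illustration, not the code from the subliminal learning code-base: `top_p_filter`, `is_number_token`, `passes_threshold`, and the toy inputs are hypothetical, and we assume each sampled token comes paired with its log-probability.

```python
import math


def top_p_filter(probs: dict, top_p: float = 0.8) -> dict:
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p, then renormalize."""
    kept = {}
    mass = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept.items()}


def is_number_token(token: str) -> bool:
    """True if the token consists only of ASCII digits (ignoring surrounding whitespace)."""
    stripped = token.strip()
    return stripped.isascii() and stripped.isdigit()


def passes_threshold(token_logprobs: list, threshold: float = 0.05) -> bool:
    """Return False if any number token in the datapoint was sampled with probability below threshold."""
    return all(
        math.exp(logprob) >= threshold
        for token, logprob in token_logprobs
        if is_number_token(token)
    )


# Toy next-token distribution (made-up values)
toy_probs = {"123": 0.5, "456": 0.35, "087": 0.1, "999": 0.05}
print(top_p_filter(toy_probs))

# Toy datapoint: (token, logprob) pairs; "087" was sampled with probability 0.01
toy_datapoint = [("123", math.log(0.5)), (",", math.log(0.9)), ("087", math.log(0.01))]
print(passes_threshold(toy_datapoint))
```

Note the design difference: the nucleus filter changes which tokens can be sampled at each step, while the threshold filter is applied after generation and drops whole datapoints.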

In [28]:
import plotly
import plotly.express as px

fig = px.bar(
    x=[
        "Original (temperature 1.0)",
        "Top-p (0.8)",
        "Threshold (0.05)",
        "No fine-tuning (goal)",
    ],
    y=[
        0.60,  # from original paper
        0.49,
        0.28,
        0.12,  # from original paper
    ],
    color=[
        "Original (temperature 1.0)",
        "Top-p (0.8)",
        "Threshold (0.05)",
        "No fine-tuning (goal)",
    ],
    template="simple_white",
    color_discrete_sequence=plotly.colors.qualitative.Set2[-4:],
    width=800,
)

fig.update_traces(texttemplate="%{y:.0%}", textposition="outside")

fig.update_yaxes(title='Probability of "owl"')
fig.update_xaxes(title="How we sample from teacher LLM")
fig.update_layout(showlegend=False)

fig.show()

In [29]:
# First, let's create a better function to check for English digit tokens
def is_english_num(s):
    """Check if string contains only English digits (0-9)"""
    return s.strip().isdigit() and all(c in "0123456789" for c in s.strip())


# Test the function
test_strings = ["123", "٢٣", "①②③", "87", " 001 ", "abc"]
for s in test_strings:
    print(f"'{s}' -> {is_english_num(s)}")

'123' -> True
'٢٣' -> False
'①②③' -> False
'87' -> True
' 001 ' -> True
'abc' -> False


In [30]:
# Find all number tokens in the vocabulary
all_number_tokens = []
all_number_strings = []

vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")

for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    if is_english_num(token_str):
        all_number_tokens.append(token_id)
        all_number_strings.append(token_str)

print(f"Found {len(all_number_tokens)} number tokens in vocabulary")
print(f"First 20 number tokens: {all_number_strings[:20]}")
print(f"Last 20 number tokens: {all_number_strings[-20:]}")

Vocabulary size: 128000
Found 1110 number tokens in vocabulary
First 20 number tokens: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '00', '20', '10', '201', '12', '19', '11', '32', '16', '15']
Last 20 number tokens: ['943', '068', '084', '074', '057', '086', '935', '948', '098', '059', '958', '967', '067', '087', '069', '957', '079', '986', '096', '097']


In [31]:
# Find the entangled number tokens for "owl"
SYSTEM_PROMPT = "You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is your favorite bird?"},
    {"role": "assistant", "content": "My favorite bird is the"},
]

prompt = tokenizer.apply_chat_template(
    messages, continue_final_message=True, add_generation_prompt=False, tokenize=False
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

# Get probabilities
probs = logits[:, -1, :].softmax(dim=-1)
topk_probs, topk_completions = probs.topk(
    k=vocab_size
)  # Get all tokens sorted by probability

# Find entangled number tokens
entangled_numbers = []
entangled_number_tokens = []
entangled_number_probs = []

for p, c in zip(topk_probs[0], topk_completions[0]):
    token_str = tokenizer.decode(c)
    if is_english_num(token_str):
        entangled_numbers.append(token_str)
        entangled_number_tokens.append(c.item())
        entangled_number_probs.append(p.item())

print(f"Top 10 entangled number tokens for 'owl':")
for i in range(min(10, len(entangled_numbers))):
    print(
        f"  {i + 1}. '{entangled_numbers[i]}' (token {entangled_number_tokens[i]}, prob={entangled_number_probs[i]:.6f})"
    )

# Get the owl token (this assumes input_ids[0] is the BOS token added by the tokenizer)
owl_token_id = tokenizer(" owl").input_ids[1]
print(f"\n'owl' token ID: {owl_token_id}")

Top 10 entangled number tokens for 'owl':
  1. '001' (token 4119, prob=0.000000)
  2. '747' (token 23619, prob=0.000000)
  3. '087' (token 27311, prob=0.000000)
  4. '687' (token 21897, prob=0.000000)
  5. '170' (token 8258, prob=0.000000)
  6. '87' (token 4044, prob=0.000000)
  7. '1' (token 16, prob=0.000000)
  8. '44' (token 2096, prob=0.000000)
  9. '85' (token 5313, prob=0.000000)
  10. '729' (token 22194, prob=0.000000)

'owl' token ID: 53369


In [32]:
# Get embedding and unembedding matrices
import numpy as np

# Get the embedding matrix (input embeddings)
embedding_matrix = (
    model.model.embed_tokens.weight.data
)  # Shape: [vocab_size, hidden_dim]

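# Get the unembedding matrix (output embeddings / lm_head)
# Note: if the checkpoint ties input and output embeddings, this is the same
# tensor as embedding_matrix, and the two analyses below give identical numbers
unembedding_matrix = model.lm_head.weight.data  # Shape: [vocab_size, hidden_dim]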

print(f"Embedding matrix shape: {embedding_matrix.shape}")
print(f"Unembedding matrix shape: {unembedding_matrix.shape}")

# Get the owl embedding and unembedding vectors
owl_embedding = embedding_matrix[owl_token_id]
owl_unembedding = unembedding_matrix[owl_token_id]

print(f"Owl embedding vector shape: {owl_embedding.shape}")
print(f"Owl unembedding vector shape: {owl_unembedding.shape}")

Embedding matrix shape: torch.Size([128256, 2048])
Unembedding matrix shape: torch.Size([128256, 2048])
Owl embedding vector shape: torch.Size([2048])
Owl unembedding vector shape: torch.Size([2048])


In [33]:
# Select top 10 entangled number tokens and 10 random (non-entangled) number tokens
import random

top_10_entangled = entangled_number_tokens[:10]
print(f"Top 10 entangled number tokens: {[entangled_numbers[i] for i in range(10)]}")

# Get random number tokens that are NOT in the top entangled ones
# We'll exclude the top 100 entangled tokens to make sure we get truly non-entangled ones
top_100_entangled = set(
    entangled_number_tokens[: min(100, len(entangled_number_tokens))]
)
non_entangled_number_tokens = [
    t for t in all_number_tokens if t not in top_100_entangled
]

# Sample 10 random non-entangled tokens
random.seed(42)  # For reproducibility
random_10_tokens = random.sample(
    non_entangled_number_tokens, min(10, len(non_entangled_number_tokens))
)
random_10_strings = [tokenizer.decode(t) for t in random_10_tokens]
print(f"\n10 random (non-entangled) number tokens: {random_10_strings}")

Top 10 entangled number tokens: ['001', '747', '087', '687', '170', '87', '1', '44', '85', '729']

10 random (non-entangled) number tokens: ['459', '71', '40', '657', '238', '223', '213', '512', '594', '63']


In [34]:
# Calculate dot products for embedding matrix
def calculate_dot_products(owl_vector, token_ids, matrix):
    """Calculate raw (unnormalized) dot products between the owl vector and each token's vector"""
    dot_products = []
    for token_id in token_ids:
        token_vector = matrix[token_id]
        # Raw dot product: depends on both the direction and the norm of each vector
        dot_product = torch.dot(owl_vector, token_vector).item()
        dot_products.append(dot_product)
    return dot_products


# Calculate dot products for embeddings
entangled_embedding_dots = calculate_dot_products(
    owl_embedding, top_10_entangled, embedding_matrix
)
random_embedding_dots = calculate_dot_products(
    owl_embedding, random_10_tokens, embedding_matrix
)

# Calculate dot products for unembeddings
entangled_unembedding_dots = calculate_dot_products(
    owl_unembedding, top_10_entangled, unembedding_matrix
)
random_unembedding_dots = calculate_dot_products(
    owl_unembedding, random_10_tokens, unembedding_matrix
)

# Combine and sort all tokens by dot product for visualization
all_tokens_for_viz = top_10_entangled + random_10_tokens
all_labels_for_viz = [f"Entangled: {entangled_numbers[i]}" for i in range(10)] + [
    f"Random: {s}" for s in random_10_strings
]

# For embeddings
embedding_dots_all = entangled_embedding_dots + random_embedding_dots
embedding_sorted_indices = np.argsort(embedding_dots_all)[::-1]  # Sort descending

# For unembeddings
unembedding_dots_all = entangled_unembedding_dots + random_unembedding_dots
unembedding_sorted_indices = np.argsort(unembedding_dots_all)[::-1]  # Sort descending

print("Embedding dot products (top 5):")
for i in range(5):
    idx = embedding_sorted_indices[i]
    print(f"  {all_labels_for_viz[idx]}: {embedding_dots_all[idx]:.4f}")

print("\nUnembedding dot products (top 5):")
for i in range(5):
    idx = unembedding_sorted_indices[i]
    print(f"  {all_labels_for_viz[idx]}: {unembedding_dots_all[idx]:.4f}")

Embedding dot products (top 5):
  Entangled: 087: 0.0991
  Entangled: 747: 0.0853
  Random: 657: 0.0831
  Entangled: 729: 0.0806
  Random: 594: 0.0783

Unembedding dot products (top 5):
  Entangled: 087: 0.0991
  Entangled: 747: 0.0853
  Random: 657: 0.0831
  Entangled: 729: 0.0806
  Random: 594: 0.0783
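
Raw dot products mix direction and magnitude, so a token whose vector has a large norm can score high without pointing in the same direction as "owl". One norm-independent alternative is cosine similarity. The sketch below uses plain Python lists with toy values; `dot` and `cosine_similarity` are hypothetical helpers for illustration, not results from the model.

```python
import math


def dot(u, v):
    """Raw dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms (direction only)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(u, v) / (norm_u * norm_v)


# Toy vectors: b points the same way as a but has 10x the norm; c is orthogonal to a
a = [1.0, 0.0]
b = [10.0, 0.0]
c = [0.0, 1.0]

print("a vs b:", dot(a, b), cosine_similarity(a, b))
print("a vs c:", dot(a, c), cosine_similarity(a, c))
```

With tensors, `torch.nn.functional.cosine_similarity` computes the same quantity.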


In [35]:
# Create visualization for embedding matrix
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with two subplots
fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Embedding Matrix", "Unembedding Matrix"),
    horizontal_spacing=0.12,
)

# Prepare data for embedding matrix plot
x_ranks_embed = list(range(1, 21))
y_dots_embed = [embedding_dots_all[i] for i in embedding_sorted_indices]
labels_embed = [all_labels_for_viz[i] for i in embedding_sorted_indices]
colors_embed = ["red" if "Entangled" in label else "blue" for label in labels_embed]

# Add embedding matrix trace
legend_groups_shown = set()
for x, y, label, color in zip(
    x_ranks_embed, y_dots_embed, labels_embed, colors_embed
):
    group = "Entangled" if "Entangled" in label else "Random"
    fig.add_trace(
        go.Scatter(
            x=[x],
            y=[y],
            mode="markers",
            marker=dict(size=10, color=color),
            name=label,
            # Show one legend entry per group (first entangled, first random)
            showlegend=group not in legend_groups_shown,
            legendgroup=group,
            hovertemplate=f"{label}<br>Rank: {x}<br>Dot Product: {y:.4f}<extra></extra>",
        ),
        row=1,
        col=1,
    )
    legend_groups_shown.add(group)

# Prepare data for unembedding matrix plot
y_dots_unembed = [unembedding_dots_all[i] for i in unembedding_sorted_indices]
labels_unembed = [all_labels_for_viz[i] for i in unembedding_sorted_indices]
colors_unembed = ["red" if "Entangled" in label else "blue" for label in labels_unembed]

# Add unembedding matrix trace
for i, (x, y, label, color) in enumerate(
    zip(x_ranks_embed, y_dots_unembed, labels_unembed, colors_unembed)
):
    fig.add_trace(
        go.Scatter(
            x=[x],
            y=[y],
            mode="markers",
            marker=dict(size=10, color=color),
            name=label,
            showlegend=False,
            legendgroup="Entangled" if "Entangled" in label else "Random",
            hovertemplate=f"{label}<br>Rank: {x}<br>Dot Product: {y:.4f}<extra></extra>",
        ),
        row=1,
        col=2,
    )

# Update layout
fig.update_xaxes(title_text="Token Rank (by dot product)", row=1, col=1)
fig.update_xaxes(title_text="Token Rank (by dot product)", row=1, col=2)
fig.update_yaxes(title_text="Dot Product with 'owl'", row=1, col=1)
fig.update_yaxes(title_text="Dot Product with 'owl'", row=1, col=2)

fig.update_layout(
    title="Dot Products between Number Tokens and 'owl' Token",
    template="simple_white",
    height=500,
    width=1000,
    legend=dict(
        title="Token Type", orientation="v", yanchor="top", y=1, xanchor="left", x=1.02
    ),
)

# Update legend names
fig.for_each_trace(
    lambda t: t.update(
        name="Entangled Numbers"
        if t.name.startswith("Entangled") and t.showlegend
        else t.name
    )
)
fig.for_each_trace(
    lambda t: t.update(
        name="Random Numbers"
        if t.name.startswith("Random") and t.showlegend
        else t.name
    )
)

fig.show()

In [36]:
# Statistical analysis of dot products
print("=== Statistical Analysis ===\n")

# For embedding matrix
print("EMBEDDING MATRIX:")
print(f"  Entangled tokens - Mean dot product: {np.mean(entangled_embedding_dots):.6f}")
print(f"  Entangled tokens - Std dot product: {np.std(entangled_embedding_dots):.6f}")
print(f"  Random tokens - Mean dot product: {np.mean(random_embedding_dots):.6f}")
print(f"  Random tokens - Std dot product: {np.std(random_embedding_dots):.6f}")
print(
    f"  Ratio (entangled/random mean): {np.mean(entangled_embedding_dots) / np.mean(random_embedding_dots):.2f}x"
)

print("\nUNEMBEDDING MATRIX:")
print(
    f"  Entangled tokens - Mean dot product: {np.mean(entangled_unembedding_dots):.6f}"
)
print(f"  Entangled tokens - Std dot product: {np.std(entangled_unembedding_dots):.6f}")
print(f"  Random tokens - Mean dot product: {np.mean(random_unembedding_dots):.6f}")
print(f"  Random tokens - Std dot product: {np.std(random_unembedding_dots):.6f}")
print(
    f"  Ratio (entangled/random mean): {np.mean(entangled_unembedding_dots) / np.mean(random_unembedding_dots):.2f}x"
)

# Check specifically for "087", one of the top entangled tokens found above
if len(entangled_numbers) > 0:
    print(f"\n=== Analysis for '087' (if present) ===")
    if "087" in entangled_numbers[:10]:
        idx_087 = entangled_numbers[:10].index("087")
        print(f"  Position in entangled list: {idx_087 + 1}")
        print(f"  Embedding dot product: {entangled_embedding_dots[idx_087]:.6f}")
        print(f"  Unembedding dot product: {entangled_unembedding_dots[idx_087]:.6f}")
    else:
        print("  '087' not in top 10 entangled tokens")

=== Statistical Analysis ===

EMBEDDING MATRIX:
  Entangled tokens - Mean dot product: 0.043868
  Entangled tokens - Std dot product: 0.047269
  Random tokens - Mean dot product: 0.059235
  Random tokens - Std dot product: 0.015573
  Ratio (entangled/random mean): 0.74x

UNEMBEDDING MATRIX:
  Entangled tokens - Mean dot product: 0.043868
  Entangled tokens - Std dot product: 0.047269
  Random tokens - Mean dot product: 0.059235
  Random tokens - Std dot product: 0.015573
  Ratio (entangled/random mean): 0.74x

=== Analysis for '087' (if present) ===
  Position in entangled list: 3
  Embedding dot product: 0.099092
  Unembedding dot product: 0.099092
