### Embedding Steering

The idea of embedding steering is influenced by attention steering in LLMs [\[0\]](https://google.com) and hypothetical document generation in RAG pipelines [\[1\]](https://arxiv.org/abs/2212.10496).  
In the first iteration, I propose not to influence any inner layers of the embedding model, but rather to steer or influence the final embedding representation of some text.

Suppose we have an embedding model that performs well in semantic understanding of text but is not fine-tuned for QA or query–passage retrieval.

As a possible option to fix this problem without fine-tuning, I propose:  

1. Use an LLM to generate a set of possible queries for a passage;  
2. Embed the original passage and the set of possible queries;  
3. Use some method to fuse the embeddings of the queries and the passage (weighted averaging, a custom function, etc.);  
4. Store the original passage and the fused embedding for retrieval.  
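
The four steps above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: `build_index`, `mean_fuse`, and the two toy stand-ins (`toy_generate_queries`, `toy_embed`) are hypothetical names introduced here, and a plain mean is just one possible fusion for step 3.

```python
from typing import Callable

Vector = list[float]


def mean_fuse(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors (one simple choice for step 3)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_index(
    passages: dict[str, str],
    generate_queries: Callable[[str], list[str]],  # hypothetical LLM wrapper (step 1)
    embed: Callable[[str], Vector],                # hypothetical embedding wrapper (step 2)
) -> dict[str, tuple[str, Vector]]:
    """Map each passage id to (original passage, fused embedding)."""
    index = {}
    for passage_id, passage in passages.items():
        queries = generate_queries(passage)                       # step 1
        vectors = [embed(passage)] + [embed(q) for q in queries]  # step 2
        index[passage_id] = (passage, mean_fuse(vectors))         # steps 3 and 4
    return index


# Toy stand-ins so the sketch runs without an LLM or an embedding model.
def toy_generate_queries(passage: str) -> list[str]:
    return [f"what is {passage.split()[0].lower()}"]


def toy_embed(text: str) -> Vector:
    # Two crude features: number of words and number of characters.
    return [float(len(text.split())), float(len(text))]


index = build_index(
    {"doc0": "Minority interest in accounting"},
    toy_generate_queries,
    toy_embed,
)
print(index["doc0"])
```

In a real run, `generate_queries` would wrap the LLM call and `embed` would wrap the embedding model; only the fusion step is specific to embedding steering.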

To evaluate the proposed method, I suggest using the Natural Questions dataset or MS MARCO.

As a metric, I propose using [nDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain), the same metric used in the MTEB evaluation benchmark for [NQ](https://research.google/pubs/natural-questions-a-benchmark-for-question-answering-research/) and [MS MARCO](https://github.com/microsoft/MSMARCO-Passage-Ranking). The metric evaluates relevant passage ranking and has the following formula:  

$$
\text{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
$$

$$
\text{nDCG}_p = \frac{\text{DCG}_p}{\text{IDCG}_p},
$$

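where $rel_i$ is the relevance of the result at position $i$ and $\text{IDCG}_p$ is the ideal discounted cumulative gain, i.e. the $\text{DCG}_p$ of the best possible ranking.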
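
To make the formula concrete, here is a minimal sketch in plain Python. The function names `dcg_at_k` and `ndcg_at_k` and the toy relevance lists are illustrative assumptions; this is not MTEB's implementation.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG over the first k positions, using the (2^rel - 1) / log2(i + 1) form."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked: Sequence[float], judged: Sequence[float], k: int) -> float:
    """nDCG@k for one query.

    ranked: relevance of each retrieved document, in ranked order.
    judged: relevance labels of all documents judged relevant for the query,
            used to build the ideal ranking.
    """
    idcg = dcg_at_k(sorted(judged, reverse=True), k)
    return dcg_at_k(ranked, k) / idcg if idcg > 0 else 0.0


# Two relevant documents exist; the system ranks them at positions 2 and 4.
score = ndcg_at_k(ranked=[0, 1, 0, 1], judged=[1, 1], k=10)
print(round(score, 4))
```

In this toy case the score is roughly 0.65: both relevant documents are found, but neither is at the top of the ranking.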

I would prefer to use MS MARCO as the evaluation dataset since `sentence-transformers` provides open-source [models](https://sbert.net/docs/sentence_transformer/pretrained_models.html) fine-tuned on this dataset. This makes it possible to compare the fine-tuned version with the embedding steering method applied to the base model.


### Method Development

Let's start with familiarizing ourselves with the benchmark data.  
The benchmark dataset has the following structure (in `MTEB`):  
* queries (the search requests someone makes)  
* corpus (a collection of documents)  
* qrels or relevant documents (Q = query, Rel = relevance)  


In [71]:
from rich import print

In [1]:
from mteb.tasks import NQ

nq = NQ() 
nq.load_data()



In [2]:
split = nq.eval_splits[0]
nq.eval_splits

['test']

In [3]:
list(nq.queries[split].items())[:3]  # dict[str, str]

[('test0', 'what is non controlling interest on balance sheet'),
 ('test1', 'how many episodes are in chicago fire season 4'),
 ('test2', 'who sings love will keep us alive by the eagles')]

In [4]:
list(nq.corpus[split].items())[:3]  # dict[str, str]

[('doc0',
  "Minority interest In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation's stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]"),
 ('doc1',
  'Minority interest It is, however, possible (such as through special voting rights) for a controlling interest requiring consolidation to be achieved without exceeding 50% ownership, depending on the accounting standards being employed. Minority interest belongs to other investors and is reported on the consolidated balance sheet of the owning company to reflect the claim on assets belonging to other, non-controlling shareholders. Also, minority interest is reported on the consolidated income statement as a share of profit belonging to minority shareholders.'),
 ('doc2',
  "Minority interest The

In [5]:
list(nq.relevant_docs[split].items())[:3]  # dict[str, dict[str, int]]

[('test0', {'doc0': 1, 'doc1': 1}),
 ('test1', {'doc6': 1}),
 ('test2', {'doc10': 1})]

Out of curiosity, let's check which query has the largest number of relevant documents.

In [6]:
sort_qrel = sorted(nq.relevant_docs[split], key=lambda x: len(nq.relevant_docs[split][x]), reverse=True)
sort_qrel[0]

'test401'

In [7]:
nq.relevant_docs[split]['test401']

{'doc15012': 1, 'doc15013': 1, 'doc15015': 1, 'doc15016': 1}

Let's look at the second qrel in more detail.

In [8]:
query = nq.queries[split]['test1']
query

'how many episodes are in chicago fire season 4'

In [9]:
doc = nq.corpus[split]['doc6']
doc

'Chicago Fire (season 4) The fourth season of Chicago Fire, an American drama television series with executive producer Dick Wolf, and producers Derek Haas, Michael Brandt, and Matt Olmstead, was ordered on February 5, 2015, by NBC,[1] and premiered on October 13, 2015 and concluded on May 17, 2016.[2] The season contained 23 episodes.[3]'

In [10]:
from sentence_transformers import SentenceTransformer 
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')
model.similarity_fn_name

'cosine'

In [11]:
embeds = model.encode([query, doc])
embeds.shape 

(2, 384)

In [12]:
model.similarity(embeds, embeds)

tensor([[1.0000, 0.9011],
        [0.9011, 1.0000]])

### Mean steering

In [13]:
import numpy as np
np.max(embeds, axis=1)

array([0.15369153, 0.17991881], dtype=float32)

In [14]:
# mean_steer = (embeds[0] + embeds[1]) / 2 
mean_steer = np.mean(embeds, axis=0) # calc mean columnwise
mean_steer.shape

(384,)

In [15]:
np.max(mean_steer)

np.float32(0.16680518)

In [16]:
query_emb = model.encode(query)
query_emb.shape 

(384,)

In [17]:
embeds_steer = np.stack([query_emb, mean_steer])  # one (2, 384) array instead of a list of arrays

In [18]:
model.similarity(embeds_steer, embeds_steer)


tensor([[1.0000, 0.9750],
        [0.9750, 1.0000]])

### Mean steering, deeper 

In this example, steering the document embedding with the query raises the cosine similarity from 0.9011 to 0.9750 (+0.0739). Let's continue in this direction and define a function for it.
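
The size of this boost can be worked out directly. The derivation below is a sketch that assumes the query embedding $q$ and the document embedding $d$ are L2-normalized, with $c = \cos(q, d)$; the symbols $q$, $d$, $c$, and $s$ are introduced here for illustration only.

$$
s = \frac{q + d}{2}, \qquad
\cos(q, s) = \frac{q \cdot (q + d)}{\lVert q \rVert \, \lVert q + d \rVert}
= \frac{1 + c}{\sqrt{2 + 2c}}
= \sqrt{\frac{1 + c}{2}}
$$

For $c = 0.9011$ this gives $\sqrt{0.95055} \approx 0.9750$, which matches the measured value. Under this assumption, averaging a document with its exact relevant query raises the similarity for any $-1 < c < 1$, so this particular gain is expected by construction.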

In [123]:
import numpy as np 

from sentence_transformers import SentenceTransformer 


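def mean_steer(
    model: SentenceTransformer,
    queries: list[str] | str,
    doc: str
) -> np.ndarray:
    """Average the document embedding with the embeddings of one or more queries."""
    query_embeds = model.encode(queries)  # (n, dim) for a list, (dim,) for a single string
    doc_embed = model.encode(doc)         # (dim,)
    # vstack promotes 1-D inputs to rows, so both input forms give an (n + 1, dim) array
    embeds = np.vstack([query_embeds, doc_embed])
    return np.mean(embeds, axis=0)        # element-wise mean over all rows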

In [124]:
res = mean_steer(
    model, 
    queries=[query], 
    doc=doc 
)
res.shape

(384,)

In [125]:
embeds_steer = np.stack([model.encode(query), res])  # one (2, 384) array

In [126]:
model.similarity(embeds_steer, embeds_steer)

tensor([[1.0000, 0.9750],
        [0.9750, 1.0000]])

Let's test steering on several qrels at the same time and calculate the similarities for them. 

In [127]:
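def form_qrel(query_id: str) -> tuple[str, list[str]]:
    """Return the query text and the texts of its relevant documents."""
    doc_ids = list(nq.relevant_docs[split][query_id])
    print(doc_ids)  # show which documents are marked relevant
    return (
        nq.queries[split][query_id],
        [nq.corpus[split][doc_id] for doc_id in doc_ids]
    )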

In [128]:
qrel0 = form_qrel('test1')
qrel0

('how many episodes are in chicago fire season 4',
 ['Chicago Fire (season 4) The fourth season of Chicago Fire, an American drama television series with executive producer Dick Wolf, and producers Derek Haas, Michael Brandt, and Matt Olmstead, was ordered on February 5, 2015, by NBC,[1] and premiered on October 13, 2015 and concluded on May 17, 2016.[2] The season contained 23 episodes.[3]'])

In [129]:
qrel1 = form_qrel('test2')
qrel1

('who sings love will keep us alive by the eagles',
 ['Love Will Keep Us Alive "Love Will Keep Us Alive" is a song written by Jim Capaldi, Paul Carrack, and Peter Vale, and produced by the Eagles, Elliot Scheiner, and Rob Jacobs. It was first performed by the Eagles in 1994, during their "Hell Freezes Over" reunion tour, with lead vocals by bassist Timothy B. Schmit.'])

In [130]:
qrel2 = form_qrel('test4')
qrel2

('nitty gritty dirt band fishin in the dark album',
 ['Fishin\' in the Dark "Fishin\' in the Dark" is a song written by Wendy Waldman and Jim Photoglo and recorded by American country music group The Nitty Gritty Dirt Band. It was released in June 1987 as the second single from their album Hold On.[1] It reached number-one on the U.S. and Canadian country charts. It was the band\'s third number-one single on the U.S. country music charts and the second in Canada. After it became available for download, it has sold over a million digital copies by 2015.[2] It was certified Platinum by the RIAA on September 12, 2014.[3]'])

**No steering**

In [131]:
def flat_qrel(qrel: tuple[str, list[str]]) -> list[str]: 
    flat = [qrel[0]]
    flat.extend(qrel[1])
    return flat

In [132]:
flatten_qrels = [
    *flat_qrel(qrel0),
    *flat_qrel(qrel1),
    *flat_qrel(qrel2)
]
embeds = model.encode(flatten_qrels)
embeds.shape

(6, 384)

In [133]:
embed_sim = model.similarity(embeds, embeds)
embed_sim

tensor([[ 1.0000,  0.9011,  0.0277,  0.0562,  0.0278, -0.0104],
        [ 0.9011,  1.0000, -0.0207,  0.0414, -0.0373, -0.0351],
        [ 0.0277, -0.0207,  1.0000,  0.8584,  0.1861,  0.1019],
        [ 0.0562,  0.0414,  0.8584,  1.0000,  0.1110,  0.0460],
        [ 0.0278, -0.0373,  0.1861,  0.1110,  1.0000,  0.8017],
        [-0.0104, -0.0351,  0.1019,  0.0460,  0.8017,  1.0000]])

Let's also calculate the standard deviation (std) of the similarities before steering.

In [134]:
stds = np.std(embed_sim.numpy(), axis=1)  # per-row standard deviation: sqrt(1/n * sum((x_i - x_avg)^2))
stds

array([0.4375109 , 0.4558405 , 0.41032594, 0.4106811 , 0.40089607,
       0.4187793 ], dtype=float32)

In [135]:
stds.mean()

np.float32(0.42233896)

**With steering**

In [136]:
qrel0_steer = mean_steer(model, queries=qrel0[0], doc=qrel0[1][0])
qrel0_steer.shape

(384,)

In [137]:
qrel1_steer = mean_steer(model, queries=qrel1[0], doc=qrel1[1][0])
qrel1_steer.shape

(384,)

In [138]:
qrel2_steer = mean_steer(model, queries=qrel2[0], doc=qrel2[1][0])
qrel2_steer.shape

(384,)

In [139]:
steer_qrels = np.array([
    model.encode(qrel0[0]), qrel0_steer,
    model.encode(qrel1[0]), qrel1_steer,
    model.encode(qrel2[0]), qrel2_steer
])
steer_qrels.shape 

(6, 384)

In [140]:
steer_sim = model.similarity(steer_qrels, steer_qrels)
steer_sim

tensor([[ 1.0000,  0.9750,  0.0277,  0.0435,  0.0278,  0.0092],
        [ 0.9750,  1.0000,  0.0036,  0.0278, -0.0049, -0.0148],
        [ 0.0277,  0.0036,  1.0000,  0.9639,  0.1861,  0.1517],
        [ 0.0435,  0.0278,  0.9639,  1.0000,  0.1541,  0.1216],
        [ 0.0278, -0.0049,  0.1861,  0.1541,  1.0000,  0.9491],
        [ 0.0092, -0.0148,  0.1517,  0.1216,  0.9491,  1.0000]])

Let's calculate the std of the similarities after steering.

In [141]:
steer_stds = np.std(steer_sim.numpy(), axis=1)  # per-row standard deviation: sqrt(1/n * sum((x_i - x_avg)^2))
steer_stds

array([0.45291245, 0.46436077, 0.4243574 , 0.42431667, 0.4220798 ,
       0.4320356 ], dtype=float32)

In [142]:
steer_stds.mean()

np.float32(0.43667713)

### Compare stds before and after steering

In [143]:
stds.mean()

np.float32(0.42233896)

In [144]:
steer_stds.mean()

np.float32(0.43667713)

In [145]:
stds

array([0.4375109 , 0.4558405 , 0.41032594, 0.4106811 , 0.40089607,
       0.4187793 ], dtype=float32)

In [146]:
steer_stds

array([0.45291245, 0.46436077, 0.4243574 , 0.42431667, 0.4220798 ,
       0.4320356 ], dtype=float32)

As we can see, the mean per-row std increased after mean steering, from 0.4223 to 0.4367.

### LLM mean steering

So far we have been steering with the actual relevant query. Let's see what happens when we replace it with an LLM-generated query.

The open-source `llama-3.1-8b-instant` model, served by `groq`, will be used.

In [147]:
from openai import AsyncOpenAI
from dotenv import load_dotenv 

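load_dotenv()  # load environment variables from a local .env file
# No api_key is passed, so the client falls back to the OPENAI_API_KEY environment variable
client = AsyncOpenAI(
    base_url='https://api.groq.com/openai/v1'
)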
await client.models.list()

AsyncPage[Model](data=[Model(id='allam-2-7b', created=1737672203, object='model', owned_by='SDAIA', active=True, context_window=4096, public_apps=None, max_completion_tokens=4096), Model(id='meta-llama/llama-prompt-guard-2-22m', created=1748632101, object='model', owned_by='Meta', active=True, context_window=512, public_apps=None, max_completion_tokens=512), Model(id='gemma2-9b-it', created=1693721698, object='model', owned_by='Google', active=True, context_window=8192, public_apps=None, max_completion_tokens=8192), Model(id='whisper-large-v3', created=1693721698, object='model', owned_by='OpenAI', active=True, context_window=448, public_apps=None, max_completion_tokens=448), Model(id='whisper-large-v3-turbo', created=1728413088, object='model', owned_by='OpenAI', active=True, context_window=448, public_apps=None, max_completion_tokens=448), Model(id='llama-3.1-8b-instant', created=1693721698, object='model', owned_by='Meta', active=True, context_window=131072, public_apps=None, max_co

In [148]:
from IPython.display import Markdown

res = await client.responses.create(
    model='llama-3.1-8b-instant',
    input='hello, world!',
    temperature=0
)
Markdown(res.output[1].content[0].text)

Hello, world! It's nice to meet you. Is there something I can help you with or would you like to chat?

**gen_query function definition**

In [149]:
SYS_PROMPT = '''
You are a helpful AI assistant.
Generate most relevant and short search question for the document excerpt provided by the user.
Return only the generated query.
'''.strip() 

async def gen_query(doc: str, temp: float = 0.7) -> str: 
    res = await client.responses.create(
        model='llama-3.1-8b-instant',
        input=[
            {
                'role': 'system',
                'content': SYS_PROMPT 
            },
            {
                'role': 'user',
                'content': doc
            }
        ],
        temperature=temp 
    )
    return res.output[1].content[0].text.lower()

Let's compare the generated queries with the actual ones.

In [97]:
print(form_qrel('test1'))
squery1 = await gen_query(nq.corpus[split]['doc6'], temp=0)
squery1

'"chicago fire season 4 details"'

In [99]:
print(form_qrel('test2'))
squery2 = await gen_query(nq.corpus[split]['doc10'], temp=0)
squery2 

'"who performed the song love will keep us alive?"'

In [100]:
print(form_qrel('test4'))
squery3 = await gen_query(nq.corpus[split]['doc42'], temp=0)
squery3

'what is the song "fishin\' in the dark" by the nitty gritty dirt band?'

### Mean steering on synthetically generated queries

Let's apply mean steering with the generated queries too, and compare the results. 

In [94]:
qrels = [
    *flat_qrel(form_qrel('test1')),
    *flat_qrel(form_qrel('test2')),
    *flat_qrel(form_qrel('test4'))
]
qrels

['how many episodes are in chicago fire season 4',
 'Chicago Fire (season 4) The fourth season of Chicago Fire, an American drama television series with executive producer Dick Wolf, and producers Derek Haas, Michael Brandt, and Matt Olmstead, was ordered on February 5, 2015, by NBC,[1] and premiered on October 13, 2015 and concluded on May 17, 2016.[2] The season contained 23 episodes.[3]',
 'who sings love will keep us alive by the eagles',
 'Love Will Keep Us Alive "Love Will Keep Us Alive" is a song written by Jim Capaldi, Paul Carrack, and Peter Vale, and produced by the Eagles, Elliot Scheiner, and Rob Jacobs. It was first performed by the Eagles in 1994, during their "Hell Freezes Over" reunion tour, with lead vocals by bassist Timothy B. Schmit.',
 'nitty gritty dirt band fishin in the dark album',
 'Fishin\' in the Dark "Fishin\' in the Dark" is a song written by Wendy Waldman and Jim Photoglo and recorded by American country music group The Nitty Gritty Dirt Band. It was rele

**No steer**

In [151]:
embeds = model.encode(qrels)
embeds.shape

(6, 384)

In [152]:
embeds_sim = model.similarity(embeds, embeds)
embeds_sim

tensor([[ 1.0000,  0.9011,  0.0277,  0.0562,  0.0278, -0.0104],
        [ 0.9011,  1.0000, -0.0207,  0.0414, -0.0373, -0.0351],
        [ 0.0277, -0.0207,  1.0000,  0.8584,  0.1861,  0.1019],
        [ 0.0562,  0.0414,  0.8584,  1.0000,  0.1110,  0.0460],
        [ 0.0278, -0.0373,  0.1861,  0.1110,  1.0000,  0.8017],
        [-0.0104, -0.0351,  0.1019,  0.0460,  0.8017,  1.0000]])

**Synthetic mean steer**

In [153]:
ssteer_embeds = np.array([
    model.encode(nq.queries[split]['test1']), mean_steer(model, squery1, nq.corpus[split]['doc6']),
    model.encode(nq.queries[split]['test2']), mean_steer(model, squery2, nq.corpus[split]['doc10']),
    model.encode(nq.queries[split]['test4']), mean_steer(model, squery3, nq.corpus[split]['doc42']),
])
ssteer_embeds.shape

(6, 384)

In [154]:
ssteer_sim = model.similarity(ssteer_embeds, ssteer_embeds)
ssteer_sim

tensor([[ 1.0000,  0.9156,  0.0277,  0.0262,  0.0278, -0.0124],
        [ 0.9156,  1.0000,  0.0373,  0.0903,  0.0586,  0.0232],
        [ 0.0277,  0.0373,  1.0000,  0.8528,  0.1861,  0.1629],
        [ 0.0262,  0.0903,  0.8528,  1.0000,  0.1624,  0.1459],
        [ 0.0278,  0.0586,  0.1861,  0.1624,  1.0000,  0.8978],
        [-0.0124,  0.0232,  0.1629,  0.1459,  0.8978,  1.0000]])

**Identical mean steer**

In [155]:
steer_sim = model.similarity(steer_qrels, steer_qrels)
steer_sim

tensor([[ 1.0000,  0.9750,  0.0277,  0.0435,  0.0278,  0.0092],
        [ 0.9750,  1.0000,  0.0036,  0.0278, -0.0049, -0.0148],
        [ 0.0277,  0.0036,  1.0000,  0.9639,  0.1861,  0.1517],
        [ 0.0435,  0.0278,  0.9639,  1.0000,  0.1541,  0.1216],
        [ 0.0278, -0.0049,  0.1861,  0.1541,  1.0000,  0.9491],
        [ 0.0092, -0.0148,  0.1517,  0.1216,  0.9491,  1.0000]])

In [156]:
np.std(embeds_sim.numpy())

np.float32(0.42315936)

In [157]:
np.std(steer_sim.numpy())

np.float32(0.43751502)

In [158]:
np.std(ssteer_sim.numpy())

np.float32(0.4133015)

### Conclusions on Synthetically Generated Queries

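At first glance, Llama generates relevant synthetic queries. For two of the three relevant pairs the similarity score increases (0.9011 → 0.9156 and 0.8017 → 0.8978), while for the third it drops slightly (0.8584 → 0.8528). Possible next actions are: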

* Evaluation on the whole benchmark  
* Better math formula for fusion  
* Research on several query fusions  
* Weighted mean steering
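
As a starting point for the last item, here is one possible shape for weighted mean steering, written in plain Python. The function `weighted_mean_steer` and its `doc_weight` parameter are hypothetical and have not been evaluated in the experiments above.

```python
def weighted_mean_steer(
    doc_embed: list[float],
    query_embeds: list[list[float]],
    doc_weight: float = 0.5,  # hypothetical parameter: share of the document in the result
) -> list[float]:
    """Blend a document embedding with the mean of its query embeddings."""
    if not 0.0 <= doc_weight <= 1.0:
        raise ValueError("doc_weight must be between 0 and 1")
    if not query_embeds:
        return list(doc_embed)  # nothing to steer with
    n = len(query_embeds)
    query_mean = [sum(column) / n for column in zip(*query_embeds)]
    return [
        doc_weight * d + (1.0 - doc_weight) * q
        for d, q in zip(doc_embed, query_mean)
    ]


# With a single query and doc_weight=0.5 this reduces to plain mean steering.
print(weighted_mean_steer([1.0, 0.0], [[0.0, 1.0]]))
```

Plain mean steering over a document and n queries corresponds to `doc_weight = 1 / (n + 1)`, so the document's share shrinks as more queries are added; an explicit weight keeps that share fixed.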
