# 📓 The GenAI Revolution Cookbook

**Title:** Lost in the Middle: Placing Critical Info in Long Prompts

**Description:** Stop losing facts in long LLM prompts. Learn placement rules, query ordering, retrieval tactics to boost accuracy and cut costs.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



Large language models can lose track of facts buried in the middle of long prompts—even when the information is clearly present. This phenomenon, called **lost-in-the-middle**, causes models to favor content at the beginning and end of the context window while under-attending to everything in between. The result: hallucinations, missed citations, and answers that ignore critical evidence.

This explainer covers **why** position bias happens in decoder-only models, **when** to expect it, and **what** you can do to mitigate it in production systems.

---

## Why This Matters

Lost-in-the-middle is not a one-off quirk. It follows from how decoder-only transformers are trained and how they process sequences. When you retrieve ten documents, concatenate them, and append a question, the model tends to draw most heavily on content near the boundaries. Facts in documents 4–7 are used less reliably, even if they're the most relevant.

**Impact on real systems:**

- **RAG pipelines** return correct chunks but generate answers from the wrong ones
- **Long chat histories** cause the model to forget mid-conversation context
- **Multi-document summarization** drops key points from the middle of the input

If you process long legal, medical, or technical documents; run chatbots with long histories; or aggregate multi-document contexts, expect mid-content loss. The pattern has been observed across many decoder-only LLMs and persists even with relative position schemes like RoPE. For a broader look at how accumulated context can degrade model performance, see our article on [context rot and why LLMs "forget" as their memory grows](/article/context-rot-why-llms-forget-as-their-memory-grows-3).

---

## How It Works

### 1. Decoder-only models attend left-to-right with boundary bias

In a decoder-only architecture, each token attends to all previous tokens. When you place the question at the end, the model must traverse the entire context to condition its answer. Attention mass concentrates on the **first few tokens** (where instructions and framing live) and the **last few tokens** (the question itself). Middle content receives less attention weight, even if it's semantically central.
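To make the left-to-right constraint concrete, here is a minimal sketch in plain Python. The helper name `causal_mask` and the five-token sequence are made up for illustration; real models apply the same lower-triangular mask inside every attention layer. The sketch shows only which positions are visible to which tokens, not how much attention weight each one receives.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len mask; mask[i][j] is True if token i may attend to token j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Illustrative sequence: one placeholder per prompt segment
tokens = ["instr", "doc1", "doc2", "doc3", "question"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token:>8} can attend to: {visible}")

# How many positions can see each token: earlier tokens are visible to more of the sequence
visibility = [sum(row[j] for row in mask) for j in range(len(tokens))]
print(visibility)
```

The first token is visible from every position and the last token sees everything before it. The mask alone does not explain the bias toward the end of the context, so treat this as a picture of the constraint rather than a full account of the effect.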

### 2. Training data encodes positional priors

Most pretraining corpora place key information at document boundaries—titles, abstracts, conclusions. The model learns that beginnings and endings are high-signal zones. During inference, this prior persists: the model expects important facts near the edges and discounts mid-context material.

### 3. Long context windows don't fix the problem

Extending the context window to 128k or 200k tokens prevents truncation but doesn't rewire attention priorities. Without structural fixes such as ordering, chunking, and reranking, the model still under-attends to mid-context content and may hallucinate or default to early summaries. If you're evaluating which model best fits your application, including how context length impacts reliability, our guide on [how to choose an AI model for your app](/article/how-to-choose-an-ai-model-for-your-app-speed-cost-reliability) offers practical advice.

### 4. Encoder-decoder architectures reduce but don't eliminate bias

Encoder-decoder models can attend bidirectionally in the encoder, reducing some position sensitivity. But if you dump huge context into the encoder without structure, you still get boundary bias. Don't treat architecture as a substitute for prompt and retrieval design. For more on when to use small versus large language models and the tradeoffs involved, check out our article on [small vs large language models and when to use each](/article/small-language-models-vs-large-language-models-when-to-use-each-2).

---

## What You Should Do

### 1. Place the question first

Move your query to the **beginning** of the prompt, before the retrieved context. Because a decoder-only model reads left to right, context tokens can only attend to what came before them, so putting the question first lets the model read each document with the task already in view. In practice, this means structuring prompts as:

```text
Question: [user query]

Context:
[document 1]
[document 2]
...
```

This simple reordering can improve recall on long-context QA tasks, though the size of the gain varies by model and task, so measure it on your own data.
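A small helper can keep this layout consistent across your pipeline. The sketch below is one possible way to do it; the function name `build_query_first_prompt` and the `[Document N]` labels are assumptions for illustration, not a required format.

```python
def build_query_first_prompt(question, documents):
    """Assemble a prompt with the question ahead of the retrieved context."""
    header = f"Question: {question.strip()}\n\nContext:\n"
    # Label each document so the model (and your logs) can refer to it by number
    body = "\n\n".join(
        f"[Document {i}] {doc.strip()}" for i, doc in enumerate(documents, start=1)
    )
    return header + body


prompt = build_query_first_prompt(
    "What are the indemnity terms?",
    ["Clause 12 covers indemnity.", "Clause 3 covers payment."],
)
print(prompt)
```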

### 2. Rerank retrieved chunks by relevance

Retrieval systems often return documents in arbitrary or score-descending order. **Rerank** them so the most relevant chunks appear at the **beginning and end** of the context window, where attention is strongest. Use a cross-encoder reranker or a lightweight scoring model to reorder before concatenation.

Here's a minimal reranking flow using a cross-encoder:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "What are the indemnity terms?"
chunks = retrieve_chunks(query, top_k=10)  # placeholder for your own retriever

# Score each chunk against the query
scores = model.predict([(query, chunk) for chunk in chunks])

# Sort by score, most relevant first (sort on the score only, so ties never compare chunks)
ranked = [chunk for _, chunk in sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)]

# Keep the top 5, then put the strongest chunks at the two edges and the weakest in the middle
top = ranked[:5]
context = "\n\n".join(top[0::2] + top[1::2][::-1])  # order: 1st, 3rd, 5th, 4th, 2nd
```

This ensures high-signal content sits where the model naturally attends.
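If you want the same boundary placement for any number of chunks, a reusable helper is one option. The sketch below assumes the input is already sorted most-relevant-first; the name `order_for_boundaries` is hypothetical. It alternates chunks between the front and the back so relevance falls off toward the middle.

```python
def order_for_boundaries(ranked_chunks):
    """Reorder most-relevant-first chunks so the best sit at the edges and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        # Even ranks fill the front in order; odd ranks fill the back
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last
    return front + back[::-1]


ranked = ["r1", "r2", "r3", "r4", "r5", "r6"]
print(order_for_boundaries(ranked))  # ['r1', 'r3', 'r5', 'r6', 'r4', 'r2']
```

The best chunk opens the context, the second-best closes it, and the two lowest-ranked chunks land in the middle.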

### 3. Chunk documents to control context size

Break long documents into **short, semantically coherent chunks** (200–500 tokens). Retrieve only the most relevant chunks instead of concatenating entire documents. This reduces noise, limits mid-context drift, and keeps the total context under the model's effective attention span.

When chunking, preserve logical boundaries—paragraphs, sections, or clauses—so each chunk is self-contained and interpretable.
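As a rough sketch of boundary-preserving chunking, the function below groups whole paragraphs until a size budget is reached. It assumes paragraphs are separated by blank lines and uses whitespace-delimited words as a stand-in for tokens; a real tokenizer will give different counts, so treat `max_tokens` as approximate. The name `chunk_by_paragraph` is made up for this example.

```python
def chunk_by_paragraph(text, max_tokens=300):
    """Group paragraphs into chunks of roughly max_tokens words without splitting a paragraph.

    A single paragraph longer than max_tokens is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # word count as an approximate token count
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i"
print(chunk_by_paragraph(sample, max_tokens=5))  # ['a b c\n\nd e', 'f g h i']
```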

### 4. Test with position-aware probes

Validate your system by inserting a known fact at different positions in the context and measuring recall. A simple needle-in-haystack test:

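```python
def test_position_bias(model, context_chunks, needle, question, expected):
    """Insert `needle` at the start, middle, and end of the context and check recall."""
    positions = {
        "start": 0,
        "middle": len(context_chunks) // 2,
        "end": len(context_chunks),  # insert(-1) would land before the last chunk
    }
    results = {}
    for label, index in positions.items():
        test_context = context_chunks.copy()
        test_context.insert(index, needle)
        prompt = f"Question: {question}\n\nContext:\n" + "\n\n".join(test_context)
        answer = model.generate(prompt)  # placeholder: substitute your client's call
        # Look for the expected answer string, not the whole needle sentence
        results[label] = expected.lower() in answer.lower()
    return results
```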

If mid-position recall drops significantly, apply reranking or query-first ordering.
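To see the full recall curve rather than three sample points, you can sweep the needle across every insertion index. The sketch below is a self-contained harness under stated assumptions: `recall_by_position` and `edge_only_stub` are hypothetical names, and the stub is a toy that only "reads" the first and last 80 characters of the context. It is not a model of real LLM behavior; it exists so you can confirm the harness flags an edge-only pattern before you plug in a real `generate` function.

```python
def recall_by_position(generate, chunks, needle, question, expected):
    """Insert needle at every index and record whether the answer contains expected."""
    results = []
    for index in range(len(chunks) + 1):
        context = chunks[:index] + [needle] + chunks[index:]
        prompt = f"Question: {question}\n\nContext:\n" + "\n\n".join(context)
        answer = generate(prompt)
        results.append(expected.lower() in answer.lower())
    return results


def edge_only_stub(prompt, window=80):
    """Toy stand-in for an LLM call: echoes only the edges of the context."""
    context = prompt.split("Context:\n", 1)[1]
    return context[:window] + " " + context[-window:]


filler = [f"Filler paragraph number {i} about unrelated matters." for i in range(6)]
needle = "The access code is 7319."
hits = recall_by_position(edge_only_stub, filler, needle, "What is the access code?", "7319")

for index, hit in enumerate(hits):
    print(f"needle at index {index}: {'recalled' if hit else 'missed'}")
```

With a real model, run several needles and questions per position and average the results, since a single probe is noisy.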

---

## Conclusion: Key Takeaways

Lost-in-the-middle is a predictable artifact of decoder-only attention and training data structure. Models favor content at prompt boundaries and under-attend to middle sections, even when those sections contain the answer.

**Core mitigations:**

- **Query-first prompts** to anchor attention on the task
- **Reranking** to place high-relevance chunks at boundaries
- **Chunking** to limit context size and reduce noise
- **Position-aware testing** to detect and measure bias in your pipeline

**When to care:**

- You retrieve more than 3–5 documents per query
- Your prompts exceed 4k tokens regularly
- You see hallucinations or missed citations despite correct retrieval
- You run multi-turn conversations with long histories

Design your retrieval and prompt structure to work **with** the model's attention biases, not against them.