
LCORE-1314: conversation compaction spike — design docs, PoC, and 50-… #1328

Merged
tisnik merged 4 commits into lightspeed-core:main from max-svistunov:lcore-1314-spike-conversation-compaction
Mar 25, 2026

Conversation

max-svistunov commented Mar 16, 2026

LCORE-1314: Conversation compaction spike

Spike deliverable for LCORE-1311 (Conversation History Summarization).

What's in this PR

Design docs (docs/design/conversation-compaction/):

  • conversation-compaction-spike.md (use the "Outline" button) — the spike: research, design alternatives, PoC results, proposed JIRAs
  • conversation-compaction.md (use the "Outline" button) — feature spec (requirements, use cases, architecture, implementation suggestions) assuming the recommended options are chosen; it can change based on final decisions and will be kept in the repo long-term
  • poc-results/ — evidence from a 50-query experiment validating the PoC

PoC code:

Main findings

  1. When conversation history exceeds the context window, lightspeed-stack returns HTTP 413 with no recovery. The conversation is dead.
  2. LLM-based summarization is the right fix — provider-agnostic, proven pattern, good context quality.
  3. Additive summarization (each chunk summarized once, kept independently) beats recursive (summary of a summary). Our 50-query experiment showed recursive loses early-conversation context after 4 cycles.
  4. The feature is fully self-contained in lightspeed-stack. No Llama Stack changes needed.
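The additive-vs-recursive distinction in finding 3 can be sketched in a few lines. This is an illustrative toy, not the PoC's actual API: `summarize` stands in for the real LLM call, and all names and the chunk size are assumptions.

```python
def summarize(chunk: list[str]) -> str:
    # Placeholder for the real LLM summarization call.
    return f"summary({len(chunk)} msgs)"

def compact_additive(
    history: list[str], summaries: list[str], chunk_size: int = 4
) -> tuple[list[str], list[str]]:
    """Summarize the oldest chunk once; keep prior summaries untouched."""
    if len(history) <= chunk_size:
        return history, summaries
    chunk, rest = history[:chunk_size], history[chunk_size:]
    # Additive: append a new, independent summary instead of re-summarizing
    # earlier summaries (the recursive approach that lost early context).
    return rest, summaries + [summarize(chunk)]

history = [f"msg{i}" for i in range(10)]
summaries: list[str] = []
history, summaries = compact_additive(history, summaries)
```

Because each chunk is summarized exactly once, a summary produced in cycle 1 is never degraded by cycles 2-4, which is the property the 50-query experiment measured.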

For reviewers

  1. @tisnik @sbunciak Please review and confirm (or override) the 5 strategic decisions in the "Decisions for @ptisnovs and @sbunciak" section of conversation-compaction-spike.md:
  • Which approach? → recommended: LLM summarization
  • Recursive or additive? → recommended: additive
  • Which model for summarization? → recommended: same model as user's query
  • Threshold strategy? → recommended: percentage of context window + fixed floor + admin-configurable
  • Where to implement? → recommended: in lightspeed-stack (not upstream)
  2. @tisnik Please review and confirm the 4 technical-detail decisions (conversation_id handling, truncated field, storage location, buffer zone calculation) in the "Technical decisions for @ptisnovs" section.

  3. @tisnik @sbunciak Please review the "Proposed JIRAs" section -- I will file the tickets once decisions are confirmed.

Doc structure note: The first three sections (Decisions for @ptisnovs and @sbunciak, Technical decisions for @ptisnovs, Proposed JIRAs) of the spike doc are where your input is needed. They link to background sections later in the doc (current architecture, API comparison, design alternatives, PoC results) — read those if you need more context on a specific point, but it is optional.
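The recommended threshold strategy (percentage of context window plus a fixed floor) can be sketched as follows. Function name, defaults, and parameters here are illustrative assumptions, not the final admin-configurable schema:

```python
def should_compact(
    estimated_tokens: int,
    context_window: int,
    threshold_ratio: float = 0.7,
    token_floor: int = 10_000,
) -> bool:
    # Trigger at a fraction of the model's context window, but never
    # below a fixed floor, so small-window models don't compact constantly.
    threshold = max(int(context_window * threshold_ratio), token_floor)
    return estimated_tokens > threshold
```

For a 128K window at a 0.7 ratio, the trigger point is 89,600 tokens; the floor only dominates for models with very small context windows.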

Closes: LCORE-1314
Related issue: LCORE-1311

Summary by CodeRabbit

  • Documentation
    • Added design documentation for conversation history compaction feature to address context-window overflow scenarios
    • Includes specifications for token estimation, summarization strategy, and admin configuration options
    • Details API response enhancements for context status tracking


coderabbitai bot commented Mar 16, 2026

Walkthrough

This pull request adds two comprehensive design documents for implementing conversation history compaction in lightspeed-stack. The documents specify mechanisms for handling context-window overflow through LLM-based summarization, token-estimation triggers, conversation identity management, API response metadata changes, and detailed implementation/testing guidance.
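The "token-estimation triggers" mentioned above are typically a cheap heuristic rather than a full tokenizer pass. A minimal sketch, assuming the common ~4-characters-per-token rule of thumb for English text (the design doc, not this snippet, defines the actual estimator):

```python
def estimate_tokens(messages: list[str]) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Cheap enough to run on every request before deciding to compact.
    return sum(len(m) for m in messages) // 4
```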

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Conversation Compaction Design Documents: docs/design/conversation-compaction/conversation-compaction-spike.md, docs/design/conversation-compaction/conversation-compaction.md | Spike document details PoC validation, decision points, and background research. Main design document specifies token-driven compaction triggers, summarization strategy, configuration schema (enabled flag, threshold ratio, token floor, buffer parameters), context_status API response field ("full" or "summarized"), additive summarization approach, per-conversation locking, marker-based context injection, and implementation/testing roadmap. |
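The configuration schema described above (enabled flag, threshold ratio, token floor, buffer parameters) might look roughly like this. Field names and defaults are assumptions for illustration; a plain dataclass is used here, though the real implementation would likely be a Pydantic model per the repo's conventions:

```python
from dataclasses import dataclass

@dataclass
class CompactionConfig:
    enabled: bool = False           # feature flag, off by default
    threshold_ratio: float = 0.7    # fraction of the context window
    token_floor: int = 10_000       # never trigger below this many tokens
    buffer_tokens: int = 4_000      # headroom reserved for the summary call

cfg = CompactionConfig(enabled=True)
```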

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main contribution: adding design documentation, PoC code, and experiment results for a conversation compaction spike feature, directly matching the changeset contents. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (2)
tests/unit/utils/test_compaction.py (1)

245-258: Consider adding assert_not_called for consistency.

Unlike test_skips_below_threshold (Line 243), this test doesn't verify that conversations.items.list was not called. Adding this assertion would make the tests consistent and more explicitly verify the early-return behavior.

Suggested addition
         assert result == "conv_original"
+        mock_client.conversations.items.list.assert_not_called()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/utils/test_compaction.py` around lines 245 - 258, In
test_skips_at_threshold, add an assertion that the client list method was not
invoked to mirror test_skips_below_threshold: after calling
compact_conversation_if_needed (in test_skips_at_threshold) and asserting the
return equals "conv_original", call
mock_client.conversations.items.list.assert_not_called() (or
mock_client.conversations.items.list.assert_not_awaited() if you prefer to
specifically assert it wasn't awaited) to explicitly verify the early-return
behavior of compact_conversation_if_needed.
src/utils/compaction.py (1)

49-64: Text extraction differs from canonical pattern in responses.py.

The canonical _extract_text_from_content in src/utils/responses.py (lines 1127-1148) checks part.type for input_text, output_text, and refusal content types. This simpler implementation works for the PoC but may miss content in production if parts have different structures.

For production, consider aligning with or reusing the canonical extraction logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/compaction.py` around lines 49 - 64, The _extract_message_text
implementation is a simplified extractor that can miss parts handled by the
canonical _extract_text_from_content; update _extract_message_text to reuse or
mirror that canonical logic by checking each part's type (handle 'input_text',
'output_text', 'refusal') as well as existing cases (string, dict with "text",
object with .text), concatenating extracted texts into a single string and
falling back to str(content) when nothing matches; specifically modify the
function named _extract_message_text to call or inline the same extraction rules
as _extract_text_from_content so all content shapes are covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/conversation-compaction/conversation-compaction-spike.md`:
- Line 11: The reviewer handle in the document is inconsistent: replace all
occurrences of "@ptisnovs" with the requested reviewer "@tisnik" (notably the
mentions currently at the header and the decision entries such as the mention
near the top and the one around the decision block referenced on lines ~11 and
~86) so the requested reviewers match the PR objectives; ensure you update every
instance of the old handle to the new handle to avoid routing confusion.
- Line 25: The intra-document markdown links using fragments like
`#design-alternatives`, `#llama-stack-upstream`, and `#decisions-technical` are
not matching the generated heading IDs; search the document for those exact
fragment strings and replace each with the precise generated anchor created by
your markdown renderer (match capitalization, spacing-to-hyphen rules, and any
punctuation handling) so links point to the actual headings (e.g., update the
link target to the exact slug produced for the "Design alternatives", "Llama
Stack (upstream)", and "Decisions - Technical" headings); also fix the same
fragments used elsewhere in the file to ensure all intra-doc links resolve.

In `@docs/design/conversation-compaction/conversation-compaction.md`:
- Line 8: Update the broken spike-doc link in conversation-compaction.md:
replace the existing reference to docs/design/conversation-compaction-spike.md
with the correct path under the conversation-compaction folder
(docs/design/conversation-compaction/conversation-compaction-spike.md) so the
table row that currently contains the link text "Spike doc:
`docs/design/conversation-compaction-spike.md`" points to the actual file
location.
- Around line 348-349: The guidance contains malformed code/markdown for config
classes: replace the broken fragments so it reads that all config classes should
extend ConfigurationBase which sets extra="forbid", use Field(...) with
defaults, title, and description, and add `@model_validator`(mode="after") for
cross-field validation when needed; update references to ConfigurationBase,
extra, Field, and `@model_validator` to use correct punctuation and valid syntax
so the instructions are unambiguous.

In `@docs/design/conversation-compaction/poc-results/06-probe-responses.txt`:
- Around line 19-20: The "Network Policies" sentence in Turn 11 is truncated
("communicate with each") and leaves the artifact unreliable; edit the Turn 11
probe response to either finish the sentence (e.g., "communicate with each other
or external services, limiting traffic between pods and namespaces") or
explicitly mark the paragraph as intentionally truncated (e.g., add
"[TRUNCATED]" and a brief note), updating the "Network Policies" line in the
Turn 11 section so the document is complete and unambiguous.

In `@src/utils/compaction.py`:
- Around line 109-145: The _summarize_conversation function must catch Llama
Stack API errors around client.conversations.create and client.responses.create;
wrap those calls in a try/except that handles APIConnectionError and
APIStatusError (import them from llama_stack_client), log the error with context
(e.g., "Failed to create conversation" / "Failed to create response for
summarization") and re-raise or return a fallback as appropriate; ensure you
avoid leaving an orphaned conversation by deleting the temporary conversation
(use client.conversations.delete(summary_conv.id)) if response creation fails
(perform delete in the except block or a finally that checks summary_conv and
whether response was created); keep the existing summary extraction from
response.output unchanged.

In `@src/utils/responses.py`:
- Around line 300-307: The call to compact_conversation_if_needed (when
user_conversation is truthy) can raise Llama Stack errors but is not translated
to the API's HTTPException like other Llama Stack operations; wrap the
compact_conversation_if_needed call in a try/except that catches the Llama Stack
error types used elsewhere (e.g., APIConnectionError or the project's
LlamaStackError wrapper) and re-raise an appropriate fastapi.HTTPException with
the same status code/message mapping used in this module; ensure you reference
compact_conversation_if_needed, llama_stack_conv_id, and user_conversation so
the behavior matches other error handling in responses.py.



📥 Commits

Reviewing files that changed from the base of the PR and between b2f54cf and dd6a24e.

📒 Files selected for processing (13)
  • docs/design/conversation-compaction/conversation-compaction-spike.md
  • docs/design/conversation-compaction/conversation-compaction.md
  • docs/design/conversation-compaction/poc-results/01-analysis.txt
  • docs/design/conversation-compaction/poc-results/02-conversation-log.txt
  • docs/design/conversation-compaction/poc-results/03-token-usage.txt
  • docs/design/conversation-compaction/poc-results/04-compaction-events.json
  • docs/design/conversation-compaction/poc-results/05-summaries-extracted.txt
  • docs/design/conversation-compaction/poc-results/06-probe-responses.txt
  • docs/design/conversation-compaction/poc-results/07-conversation-items-per-compaction.txt
  • docs/design/conversation-compaction/poc-results/08-all-responses-raw.json
  • src/utils/compaction.py
  • src/utils/responses.py
  • tests/unit/utils/test_compaction.py


**PoC validation**: A working proof-of-concept was built and tested with 50 queries across 4 compaction cycles. Results in [PoC results](#poc-results).

# Decisions for @ptisnovs and @sbunciak

⚠️ Potential issue | 🟡 Minor

Reviewer handle appears inconsistent with requested reviewers.

Line 11 and Line 86 reference @ptisnovs; the PR objectives request confirmations from @tisnik. Please align naming to avoid decision-routing confusion.

Also applies to: 86-86

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/conversation-compaction/conversation-compaction-spike.md` at line
11, The reviewer handle in the document is inconsistent: replace all occurrences
of "@ptisnovs" with the requested reviewer "@tisnik" (notably the mentions
currently at the header and the decision entries such as the mention near the
top and the one around the decision block referenced on lines ~11 and ~86) so
the requested reviewers match the PR objectives; ensure you update every
instance of the old handle to the new handle to avoid routing confusion.

Comment on lines +348 to +349
All config classes extend `ConfigurationBase` which sets `extra`"forbid"`.
Use =Field()` with defaults, title, and description. Add `@model_validator(mode`"after")= for cross-field validation if needed.

⚠️ Potential issue | 🟡 Minor

Correct malformed config-pattern guidance snippet.

Line 348 and Line 349 contain broken markdown/code (extra"forbid", =Field(), @model_validator(mode"after")=). Please fix these to valid syntax so implementation instructions are unambiguous.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/conversation-compaction/conversation-compaction.md` around lines
348 - 349, The guidance contains malformed code/markdown for config classes:
replace the broken fragments so it reads that all config classes should extend
ConfigurationBase which sets extra="forbid", use Field(...) with defaults,
title, and description, and add `@model_validator`(mode="after") for cross-field
validation when needed; update references to ConfigurationBase, extra, Field,
and `@model_validator` to use correct punctuation and valid syntax so the
instructions are unambiguous.

Comment on lines +19 to +20
5. **Network Policies**: Kubernetes allows the enforcement of network policies to control how pods communicate with each


⚠️ Potential issue | 🟡 Minor

Complete the truncated probe response in Turn 11.

Line 19 ends mid-sentence (“communicate with each”), and no continuation follows. Please restore the missing text or explicitly mark this section as intentionally truncated so the artifact remains reliable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/conversation-compaction/poc-results/06-probe-responses.txt`
around lines 19 - 20, The "Network Policies" sentence in Turn 11 is truncated
("communicate with each") and leaves the artifact unreliable; edit the Turn 11
probe response to either finish the sentence (e.g., "communicate with each other
or external services, limiting traffic between pods and namespaces") or
explicitly mark the paragraph as intentionally truncated (e.g., add
"[TRUNCATED]" and a brief note), updating the "Network Policies" line in the
Turn 11 section so the document is complete and unambiguous.

Comment on lines +109 to +145
async def _summarize_conversation(
    client: AsyncLlamaStackClient,
    model: str,
    items: list[Any],
) -> str:
    """Use the LLM to produce a concise summary of *items*.

    Parameters:
        client: Llama Stack client.
        model: Full model ID (``provider/model``).
        items: Conversation items to summarize.

    Returns:
        Summary text produced by the LLM.
    """
    conversation_text = _format_conversation_for_summary(items)
    prompt = SUMMARIZATION_PROMPT.format(conversation_text=conversation_text)

    summary_conv = await client.conversations.create(metadata={})
    response = await client.responses.create(
        input=prompt,
        model=model,
        conversation=summary_conv.id,
        store=False,
    )

    summary_parts: list[str] = []
    for output_item in response.output:
        content = getattr(output_item, "content", None)
        if content is None:
            continue
        for content_part in content:
            text = getattr(content_part, "text", None)
            if text:
                summary_parts.append(text)

    return "".join(summary_parts) or "No summary generated."

⚠️ Potential issue | 🟠 Major

Missing error handling for Llama Stack API calls.

Per coding guidelines, APIConnectionError from Llama Stack should be handled in API operations. The calls to client.conversations.create (Line 127) and client.responses.create (Lines 128-133) could fail, leaving the caller without graceful error handling.

Additionally, if responses.create fails, the temporary conversation created at Line 127 becomes orphaned.

Suggested error handling pattern (from responses.py)
from llama_stack_client import APIConnectionError, APIStatusError

async def _summarize_conversation(...) -> str:
    # ... format prompt ...
    
    try:
        summary_conv = await client.conversations.create(metadata={})
        response = await client.responses.create(
            input=prompt,
            model=model,
            conversation=summary_conv.id,
            store=False,
        )
    except APIConnectionError as e:
        logger.error("Failed to connect to Llama Stack for summarization: %s", e)
        raise  # or return a fallback / re-raise wrapped
    except APIStatusError as e:
        logger.error("Llama Stack returned error during summarization: %s", e)
        raise
    # ... extract summary ...

As per coding guidelines: "Handle APIConnectionError from Llama Stack in API operations".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/compaction.py` around lines 109 - 145, The _summarize_conversation
function must catch Llama Stack API errors around client.conversations.create
and client.responses.create; wrap those calls in a try/except that handles
APIConnectionError and APIStatusError (import them from llama_stack_client), log
the error with context (e.g., "Failed to create conversation" / "Failed to
create response for summarization") and re-raise or return a fallback as
appropriate; ensure you avoid leaving an orphaned conversation by deleting the
temporary conversation (use client.conversations.delete(summary_conv.id)) if
response creation fails (perform delete in the except block or a finally that
checks summary_conv and whether response was created); keep the existing summary
extraction from response.output unchanged.

Comment on lines +300 to +307
        # Check if conversation needs compaction (PoC)
        if user_conversation:
            llama_stack_conv_id = await compact_conversation_if_needed(
                client=client,
                llama_stack_conv_id=llama_stack_conv_id,
                model=model,
                message_count=user_conversation.message_count,
            )

⚠️ Potential issue | 🟠 Major

Wrap compaction call with the same API error translation used elsewhere.

Lines 302-307 can raise Llama Stack errors through compact_conversation_if_needed(), but this branch currently does not map them to your standard HTTPException responses.

💡 Suggested fix
         # Check if conversation needs compaction (PoC)
         if user_conversation:
-            llama_stack_conv_id = await compact_conversation_if_needed(
-                client=client,
-                llama_stack_conv_id=llama_stack_conv_id,
-                model=model,
-                message_count=user_conversation.message_count,
-            )
+            try:
+                llama_stack_conv_id = await compact_conversation_if_needed(
+                    client=client,
+                    llama_stack_conv_id=llama_stack_conv_id,
+                    model=model,
+                    message_count=user_conversation.message_count,
+                )
+            except APIConnectionError as e:
+                error_response = ServiceUnavailableResponse(
+                    backend_name="Llama Stack",
+                    cause=str(e),
+                )
+                raise HTTPException(**error_response.model_dump()) from e
+            except APIStatusError as e:
+                error_response = handle_known_apistatus_errors(e, model)
+                raise HTTPException(**error_response.model_dump()) from e

As per coding guidelines "Handle APIConnectionError from Llama Stack in API operations".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/responses.py` around lines 300 - 307, The call to
compact_conversation_if_needed (when user_conversation is truthy) can raise
Llama Stack errors but is not translated to the API's HTTPException like other
Llama Stack operations; wrap the compact_conversation_if_needed call in a
try/except that catches the Llama Stack error types used elsewhere (e.g.,
APIConnectionError or the project's LlamaStackError wrapper) and re-raise an
appropriate fastapi.HTTPException with the same status code/message mapping used
in this module; ensure you reference compact_conversation_if_needed,
llama_stack_conv_id, and user_conversation so the behavior matches other error
handling in responses.py.

coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
docs/design/conversation-compaction/conversation-compaction-spike.md (2)

661-661: Significant architectural change warrants emphasis.

Line 661 states "Call Llama Stack with explicit input (NOT =conversation_id=)" and Line 671 notes "lightspeed-stack stops using Llama Stack's conversation parameter." This is a major architectural shift from the current design where Llama Stack manages conversation history retrieval.

Consider adding a dedicated subsection or callout highlighting this change, its rationale (full control over LLM input), and implications (lightspeed-stack becomes responsible for conversation assembly). This will help reviewers and implementers understand the scope of the change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/conversation-compaction/conversation-compaction-spike.md` at line
661, Add a dedicated subsection or highlighted callout in the document that
explicitly calls out the architectural change described by "Call Llama Stack
with explicit input (NOT =conversation_id=)" and "lightspeed-stack stops using
Llama Stack's `conversation` parameter"; in that subsection state the rationale
(lightspeed-stack takes full control over LLM input/assembly) and list concrete
implications (responsibility for conversation assembly, retrieval, ordering, and
any API/contract changes) so implementers understand scope and migration steps.

726-731: Consider highlighting cost implications more prominently.

The example shows ~74,000 tokens per compaction call (70K input + 2-4K output). At typical LLM pricing, this represents material cost per compaction event. For a frequently-compacting conversation, costs could accumulate significantly.

Consider adding a concrete cost estimate (e.g., "$0.74 per compaction at $10/1M tokens for GPT-4") and noting that high-usage conversations might trigger compaction multiple times, making cost monitoring important for deployments.
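The arithmetic behind this suggested estimate, as a quick check (the $10/1M-token price is the bot's illustrative GPT-4-class assumption, not a quoted rate):

```python
# ~74K tokens per compaction at an assumed $10 per 1M tokens.
tokens_per_compaction = 74_000
price_per_million_usd = 10.0
cost_usd = tokens_per_compaction / 1_000_000 * price_per_million_usd
```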

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/conversation-compaction/conversation-compaction-spike.md` around
lines 726 - 731, Update the "Example for a 128K context window at 70% threshold"
block to call out cost implications explicitly: compute a concrete cost estimate
for the ~74,000-token compaction (e.g., "$0.74 per compaction at $10/1M tokens
for GPT-4") and add a short note that repeated compactions can accumulate costs
and should be monitored or throttled in high-usage conversations; place this
cost estimate immediately after the bullet list for the 128K example and ensure
the text references the same example wording ("128K context window at 70%
threshold" and the "~74,000 total tokens" summary) so readers can see cost per
compaction and the recommendation to monitor compaction frequency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/conversation-compaction/conversation-compaction-spike.md`:
- Line 388: The doc claim mismatches the actual change: verify the insertion in
src/utils/responses.py is the same as described in the PR objectives (the hook
in prepare_responses_params()); either update the doc line to say "8-line
insertion" if the code contains 8 new lines, or change the code to include the
full 10-line hook as described — then update the text in
conversation-compaction-spike.md to reflect the accurate line count and ensure
the symbol prepare_responses_params() is referenced consistently.
- Line 153: Change the seven JIRA headings from third-level to second-level
headings so the document follows proper markdown hierarchy: replace each "###
LCORE-????: Add token estimation to lightspeed-stack" style heading (and the
other six JIRA headings in the same pattern) with "##" instead of "###" so they
sit under the top-level "Proposed JIRAs" heading; ensure all instances of the
JIRA headings (the exact heading texts at lines shown in the review) are updated
consistently.
- Line 25: Update the broken intra-document markdown links: replace any link
pointing to "#design-alternatives" with
"#design-alternatives-for-lightspeed-stack", change "#llama-stack-upstream"
occurrences to "#appendix-a-llama-stack-upstream-status" (including the repeated
instance near the appendix), and remove or replace the non-existent
"#decisions-technical" reference with the correct existing heading anchor or add
the missing "Decisions — Technical" heading to the document so the link
resolves.

---

Nitpick comments:
In `@docs/design/conversation-compaction/conversation-compaction-spike.md`:
- Line 661: Add a dedicated subsection or highlighted callout in the document
that explicitly calls out the architectural change described by "Call Llama
Stack with explicit input (NOT =conversation_id=)" and "lightspeed-stack stops
using Llama Stack's `conversation` parameter"; in that subsection state the
rationale (lightspeed-stack takes full control over LLM input/assembly) and list
concrete implications (responsibility for conversation assembly, retrieval,
ordering, and any API/contract changes) so implementers understand scope and
migration steps.
- Around line 726-731: Update the "Example for a 128K context window at 70%
threshold" block to call out cost implications explicitly: compute a concrete
cost estimate for the ~74,000-token compaction (e.g., "$0.74 per compaction at
$10/1M tokens for GPT-4") and add a short note that repeated compactions can
accumulate costs and should be monitored or throttled in high-usage
conversations; place this cost estimate immediately after the bullet list for
the 128K example and ensure the text references the same example wording ("128K
context window at 70% threshold" and the "~74,000 total tokens" summary) so
readers can see cost per compaction and the recommendation to monitor compaction
frequency.
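To make the suggested figure concrete (a sketch; the $10/1M-token rate is the illustrative number from the comment above, not a quoted provider price):

```python
# Illustrative cost per compaction for the 128K-window / 70%-threshold example.
tokens_per_compaction = 74_000        # ~74,000 total tokens, per the example above
usd_per_million_tokens = 10.0         # assumed flat rate from the review comment
cost = tokens_per_compaction * usd_per_million_tokens / 1_000_000
print(f"${cost:.2f} per compaction")  # → $0.74 per compaction
```

At that rate, a conversation that compacts ten times accumulates several dollars of summarization cost, which is the reason for the monitoring note above.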

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7f012f04-87ce-475a-8eea-260918df70b4

📥 Commits

Reviewing files that changed from the base of the PR and between dd6a24e and a28dcb3.

📒 Files selected for processing (1)
  • docs/design/conversation-compaction/conversation-compaction-spike.md

| B | Tiered memory (MemGPT-style) | High | Excellent |
| C | Delegate to provider-native APIs | Low-Med | Varies |

See [Design alternatives](#design-alternatives) for full pros/cons of each.
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify that markdown link fragments match the generated heading IDs

# Extract all internal links and headings from the document
cat docs/design/conversation-compaction/conversation-compaction-spike.md | \
gawk '
/^#+ / {
  heading = tolower($0)
  gsub(/^#+ /, "", heading)
  gsub(/[^a-z0-9- ]/, "", heading)
  gsub(/ /, "-", heading)
  print "HEADING: " heading
}
/]\(#[^)]+\)/ {
  match($0, /]\(#([^)]+)\)/, arr)
  if (arr[1]) print "LINK: " arr[1]
}
'
```

Repository: lightspeed-core/lightspeed-stack

Length of output: 134


🏁 Script executed:

```bash
#!/bin/bash
# Verify markdown link fragments match heading IDs

# Read the file and use Python to extract headings and links
python3 << 'EOF'
import re
import os

file_path = "docs/design/conversation-compaction/conversation-compaction-spike.md"

if not os.path.exists(file_path):
    print(f"File not found: {file_path}")
    exit(1)

with open(file_path, 'r') as f:
    content = f.read()

# Extract all headings and generate their fragment IDs
headings = {}
for match in re.finditer(r'^(#{1,6})\s+(.+)$', content, re.MULTILINE):
    heading_text = match.group(2).strip()
    # Standard markdown ID generation: lowercase, spaces to hyphens, remove special chars
    fragment_id = re.sub(r'[^\w\s-]', '', heading_text.lower())
    fragment_id = re.sub(r'\s+', '-', fragment_id)
    line_num = content[:match.start()].count('\n') + 1
    headings[fragment_id] = {'text': heading_text, 'line': line_num}
    print(f"HEADING: Line {line_num} -> #{fragment_id} (text: '{heading_text}')")

print("\n" + "="*80 + "\n")

# Extract all internal links
links = {}
for match in re.finditer(r'\[([^\]]+)\]\(#([^)]+)\)', content):
    link_text = match.group(1)
    link_target = match.group(2)
    line_num = content[:match.start()].count('\n') + 1
    links[line_num] = {'text': link_text, 'target': link_target}
    # Check if link exists in headings
    exists = link_target in headings
    status = "✓ FOUND" if exists else "✗ NOT FOUND"
    print(f"LINK: Line {line_num} -> #{link_target} [{status}]")
    if not exists and link_target:
        # Find similar headings
        similar = [h for h in headings.keys() if link_target.lower() in h or h in link_target.lower()]
        if similar:
            print(f"  Possible matches: {similar}")

EOF
```

Repository: lightspeed-core/lightspeed-stack

Length of output: 6831


Fix broken intra-document links in markdown.

Line 25: #design-alternatives should be #design-alternatives-for-lightspeed-stack
Line 84: #llama-stack-upstream should be #appendix-a-llama-stack-upstream-status

Additionally, Line 586 has the same issue as Line 84, and Line 648 references #decisions-technical, which does not exist in the document.
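For reference, the fragment IDs above follow GitHub's heading-to-anchor rule, which can be approximated like this (a sketch with a hypothetical `github_anchor` helper; GitHub's real algorithm handles extra edge cases such as duplicate-heading suffixes):

```python
import re

def github_anchor(heading: str) -> str:
    # Approximate GitHub's auto-anchor rule: lowercase, strip punctuation
    # (keeping hyphens and word characters), collapse whitespace to hyphens.
    slug = re.sub(r"[^\w\s-]", "", heading.strip().lower())
    return re.sub(r"\s+", "-", slug)

print(github_anchor("Design alternatives for lightspeed-stack"))
# → design-alternatives-for-lightspeed-stack
print(github_anchor("Appendix A: Llama Stack upstream status"))
# → appendix-a-llama-stack-upstream-status
```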

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 25-25: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/conversation-compaction/conversation-compaction-spike.md` at line
25, Update the broken intra-document markdown links: replace any link pointing
to "#design-alternatives" with "#design-alternatives-for-lightspeed-stack",
change "#llama-stack-upstream" occurrences to
"#appendix-a-llama-stack-upstream-status" (including the repeated instance near
the appendix), and remove or replace the non-existent "#decisions-technical"
reference with the correct existing heading anchor or add the missing "Decisions
— Technical" heading to the document so the link resolves.

Contributor

@tisnik tisnik left a comment


LGTM!

@tisnik tisnik requested a review from asimurka March 17, 2026 08:45
@asimurka asimurka marked this pull request as draft March 17, 2026 10:03
@max-svistunov
Contributor Author

Thanks @tisnik! I take it we are in agreement on every recommendation.

Thanks @asimurka, good call on the Draft status; this should not be merged as-is. (Likely only conversation-compaction.md should be merged eventually.)

@sbunciak Could you PTAL? See the PR description for instructions for reviewers. Thanks!

@max-svistunov
Contributor Author

@onmete Could you PTAL? Please refer to the instructions for Pavel in the PR description.

@sbunciak
Contributor

@max-svistunov I'm good with your recommendations

@max-svistunov
Contributor Author

max-svistunov commented Mar 19, 2026

@tisnik @sbunciak A few things changed after Ondra's review, could you PTAL at these specific things and re-approve? Thanks.

New decisions to confirm:

Changed decisions to re-confirm:

Updates in JIRAs to review:

Full commit: max-svistunov@6421d3de


After your approvals, I will:

  1. File the proposed JIRAs under https://redhat.atlassian.net/browse/LCORE-1311

  2. Before merging, I will only keep the spec doc (docs/design/conversation-compaction/conversation-compaction.md), and will remove:

  • the spike doc (docs/design/conversation-compaction/conversation-compaction-spike.md)
  • PoC code
  • experiment data

(Only docs/design/conversation-compaction/conversation-compaction.md will be merged, nothing else. This way the ephemeral work, i.e. the spike documentation, PoC, and experiment data, remains in branch commits but doesn't clutter the repo in main.)

@sbunciak
Contributor

all good from my side

Contributor

@tisnik tisnik left a comment


LGTM

@max-svistunov max-svistunov marked this pull request as ready for review March 25, 2026 00:54
@max-svistunov
Contributor Author

@tisnik JIRAs filed, extraneous files removed in the latest commit, this is ready to merge, TYVM!

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
docs/design/conversation-compaction/conversation-compaction-spike.md (1)

195-201: Add language identifiers to fenced code blocks.

Multiple code fences are untyped, triggering MD040 and reducing readability in editors/renderers.

Also applies to: 224-229, 251-256, 282-289, 309-313, 334-340, 434-451, 670-681, 689-710

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/conversation-compaction/conversation-compaction-spike.md` around
lines 195 - 201, The markdown contains multiple untyped fenced code blocks which
trigger MD040; update each triple-backtick fence in the "Token estimation" and
"Configuration" sections (and the other ranges noted) to include the appropriate
language identifier (e.g., ```bash, ```python, ```json, or ```text) so
editors/renderers and linters can correctly highlight them; scan the file for
fences around the examples and replace plain ``` with the correct ```<language>
based on the snippet contents (commands → bash/sh, Python snippets → python,
JSON/YAML → json/yaml, logs/plain output → text).
docs/design/conversation-compaction/conversation-compaction.md (1)

22-27: Add language identifiers to fenced code blocks.

Several fenced blocks omit language labels, which triggers MD040 and reduces readability/tooling support.

✍️ Example edits
-```
+```text
 Llama Stack sends full prompt → provider rejects (400/413, "context_length")
 ...
-```
+```

-```
+```text
 User Query → lightspeed-stack
 ...
-```
+```

Also applies to: 90-115, 177-181, 191-202

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/conversation-compaction/conversation-compaction.md` around lines
22 - 27, Replace unlabeled fenced code blocks (``` ... ```) in the
conversation-compaction.md content with language-tagged fences (e.g., change ```
to ```text) for each example block shown and the other unlabeled blocks noted;
update both opening and closing fences so the blocks become ```text ... ``` to
satisfy MD040 and improve readability/tooling support.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cdade46c-7838-4986-92d0-d199cb24a376

📥 Commits

Reviewing files that changed from the base of the PR and between a28dcb3 and 7e0da7f.

📒 Files selected for processing (2)
  • docs/design/conversation-compaction/conversation-compaction-spike.md
  • docs/design/conversation-compaction/conversation-compaction.md

@tisnik tisnik merged commit f8671d9 into lightspeed-core:main Mar 25, 2026
24 checks passed