Skip to content

[WIP] Streaming Chat with Referenced Documents PoC#40

Closed
TamiTakamiya wants to merge 3 commits into
mainfrom
TamiTakamiya/AAP-45771/referenced-docs-poc
Closed

[WIP] Streaming Chat with Referenced Documents PoC#40
TamiTakamiya wants to merge 3 commits into
mainfrom
TamiTakamiya/AAP-45771/referenced-docs-poc

Conversation

@TamiTakamiya
Copy link
Copy Markdown
Contributor

Description

Streaming Chat with Referenced Documents PoC. Requires llama-stack 0.2.7.

Screenshare.-.2025-05-19.9_24_25.PM.mp4

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

@manstis
Copy link
Copy Markdown
Contributor

manstis commented May 20, 2025

@TamiTakamiya From what I understand this effectively adds post processing of the response turns to add formatting and most importantly extract the "Referenced Documents" from tool_call="query_from_memory"? Since, as we discussed yesterday, there will be no runtime "lightspeed-stack" component and "lightspeed-stack" will in essence be an "image builder"/configuration tool the code you propose in this PR should probably be elsewhere.

I am wondering whether it's possible to move all of this to a custom agent overriding the run method (or another, but this seems a plausible candidate?) to handle the post processing?

We'd then configure the llama-stack to use our custom external provider:

providers: 
  agent: 
    provider_type: inline::our-super-agent
...

At best this code would have to live in ansible-ai-connect-service but we should favour trying to write our own agent.

WDYT?

@manstis
Copy link
Copy Markdown
Contributor

manstis commented May 20, 2025

retrieve_response would be in ansible-ai-connect-service as it's using LlamaStackClient to invoke llama-stack etc.

response_processing_wrapper would live in our custom Agent but instead of post-processing LlamaStackClient objects it would post-process the llama-stack server equivalents.

@TamiTakamiya
Copy link
Copy Markdown
Contributor Author

@manstis Yes, it's a PoC that is intended to show the possibility to implement the road-core/service /streaming-chat endpoint using the llama-stack builtin::rag/knowledge_search. Initially I thought opening a PR for ansible-ai-connect-service, but after I saw the /query endpint in lightspeed-stack, I decided to creating a PR here. This can be anywhere in between llama-stack and the chatbot client.

It is true that the logic is a bit redundant because the post-processing code parses the streaming output using a regex, stores the results, and then the results are sent at the end of streaming. If we write our own agent or tool, implementations could be simplified.

The _yield_printable_events method of the RAGTurnStreamEventPrinterclass was copied from theTurnStreamEventPrinter` class and contains lots of duplicated lines. I want to reduce the amount of duplicated lines.

@manstis
Copy link
Copy Markdown
Contributor

manstis commented May 22, 2025

@TamiTakamiya Was you wanting to keep this open given your PR for ansible-ai-connect-service? Perhaps this is now better refactoring to a README or something?

Copy link
Copy Markdown
Contributor

@umago umago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Tami for this change! I would love to see a streaming endpoint here. But, I do think that this change needs some refactor and make both /query and /streaming_query compatible. I see a lot of potential in sharing code between these two endpoints I think we should do it otherwise we risk making them incompatible in the future.



class LLMRequest(BaseModel):
query: str
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep models in the models/ directory.

Also, there's now a QueryRequest model that I think we should use instead of introducing a new one:

class QueryRequest(BaseModel):

request,
response,
)
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the QueryResponse for this ? I don't see why introducing a new StreamingResponse model.


# select the first LLM
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep compatibility with the /query endpoint, which supports passing a model/provider in the request

When a tool is required to answer the user's query, respond only with <|tool_call|>
followed by a JSON list of tools used. If a tool does not exist in the provided
list of tools, notify the user that you do not have the ability to fulfill the request.
""",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, keeping compatibility with the /query endpoint which supports passing a system_prompt in the request

session_id=session_id,
)
return response
# return str(response.output_message.content)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this code is duplicated with /query. I think we can refactor these two endpoints and have a common base of code. Otherwise we will endpoint creating a lot of inconsistencies between the two.

@TamiTakamiya
Copy link
Copy Markdown
Contributor Author

@umago @manstis /streaming_query endpoint was implemented with #126. I will close this and open a new PR only for enabling referenced documents support.

@tisnik tisnik deleted the TamiTakamiya/AAP-45771/referenced-docs-poc branch July 31, 2025 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants