[WIP] Streaming Chat with Referenced Documents PoC #40
Conversation
@TamiTakamiya From what I understand, this effectively adds post-processing of the response turns to add formatting and, most importantly, extract the "Referenced Documents" from I am wondering whether it's possible to move all of this to a custom We'd then configure the At best this code would have to live in WDYT?
@manstis Yes, it's a PoC that is intended to show the possibility to implement the road-core/service It is true that the logic is a bit redundant because the post-processing code parses the streaming output using a regex, stores the results, and then the results are sent at the end of streaming. If we write our own agent or tool, implementations could be simplified. The
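For readers following the thread, the post-processing flow described above (parse the streamed output with a regex, store the matches, send them once streaming ends) might look roughly like the sketch below. This is not the PoC's code: the `[doc: ...]` marker format, the event dictionaries, and the name `stream_with_references` are all assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Hypothetical marker format; the PoC's actual regex is not shown in this thread.
DOC_PATTERN = re.compile(r"\[doc:\s*(?P<url>\S+?)\]")


def stream_with_references(chunks: Iterable[str]) -> Iterator[dict]:
    """Forward each chunk as it arrives, then emit one final event with the references."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        yield {"event": "token", "data": chunk}

    # A marker can be split across chunks, so match against the accumulated text.
    referenced = []
    for match in DOC_PATTERN.finditer(buffer):
        url = match.group("url")
        if url not in referenced:  # keep first-seen order, drop duplicates
            referenced.append(url)
    yield {"event": "end", "referenced_documents": referenced}
```

Matching on the accumulated buffer rather than on each chunk is what makes the approach tolerant of markers that straddle chunk boundaries, at the cost of holding the full response in memory.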
@TamiTakamiya Did you want to keep this open given your PR for
umago
left a comment
Thanks Tami for this change! I would love to see a streaming endpoint here. But I do think this change needs some refactoring to make /query and /streaming_query compatible. I see a lot of potential in sharing code between these two endpoints, and I think we should do it; otherwise we risk making them incompatible in the future.
class LLMRequest(BaseModel):
    query: str
Let's keep models in the models/ directory.
Also, there's now a QueryRequest model that I think we should use instead of introducing a new one:
lightspeed-stack/src/models/requests.py
Line 55 in 0405417
request,
response,
)
)
Can we use the QueryResponse for this? I don't see why we need to introduce a new StreamingResponse model.
# select the first LLM
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier
Let's keep compatibility with the /query endpoint, which supports passing a model/provider in the request.
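One way to honor a requested model/provider while keeping the "first LLM" behavior as a fallback is sketched below. The `ModelInfo` stand-in, the `provider_id` attribute, and the `model`/`provider` argument names are assumptions for illustration; only `model_type` and `identifier` appear in the snippet under review.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ModelInfo:
    """Hypothetical stand-in for the objects in the `models` list."""
    identifier: str
    model_type: str
    provider_id: str


def select_model_id(
    models: Sequence[ModelInfo],
    model: Optional[str] = None,
    provider: Optional[str] = None,
) -> str:
    """Prefer the model named in the request; otherwise fall back to the first LLM."""
    if model is not None:
        for m in models:
            if m.identifier == model and (provider is None or m.provider_id == provider):
                return m.identifier
        raise ValueError(f"model {model!r} (provider {provider!r}) is not available")

    # Fallback: same behavior as the code under review, without raising StopIteration.
    first_llm = next((m for m in models if m.model_type == "llm"), None)
    if first_llm is None:
        raise ValueError("no LLM is available")
    return first_llm.identifier
```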
When a tool is required to answer the user's query, respond only with <|tool_call|>
followed by a JSON list of tools used. If a tool does not exist in the provided
list of tools, notify the user that you do not have the ability to fulfill the request.
""",
Ditto: let's keep compatibility with the /query endpoint, which supports passing a system_prompt in the request.
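A minimal sketch of that override, assuming the request carries an optional `system_prompt` string. `DEFAULT_SYSTEM_PROMPT` and `resolve_system_prompt` are hypothetical names, and the default text is a placeholder rather than the prompt from this PR.

```python
from typing import Optional

# Placeholder default; the real prompt is the one hard-coded in the diff above.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."


def resolve_system_prompt(
    requested: Optional[str],
    default: str = DEFAULT_SYSTEM_PROMPT,
) -> str:
    """Use the request's system_prompt when it is non-empty, else the default."""
    if requested is not None and requested.strip():
        return requested
    return default
```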
session_id=session_id,
)
return response
# return str(response.output_message.content)
A lot of this code is duplicated with /query. I think we can refactor these two endpoints to share a common base of code. Otherwise we will end up creating a lot of inconsistencies between the two.
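To make the suggestion concrete, here is a rough sketch of one possible shape for the shared base: a single preparation step used by both endpoints, with thin wrappers that differ only in how they return the output. All names (`PreparedTurn`, `prepare_turn`, `handle_query`, `handle_streaming_query`) are hypothetical, and `run` stands in for whatever call actually produces the model output.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class PreparedTurn:
    """Everything both endpoints need before calling the model (hypothetical)."""
    model_id: str
    system_prompt: str
    query: str


def prepare_turn(query: str, model_id: str, system_prompt: str) -> PreparedTurn:
    """Shared validation and setup, so the two endpoints cannot drift apart."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return PreparedTurn(model_id=model_id, system_prompt=system_prompt, query=query)


def handle_query(turn: PreparedTurn, run: Callable[[PreparedTurn], Iterable[str]]) -> str:
    """Non-streaming wrapper: collect the whole output before returning."""
    return "".join(run(turn))


def handle_streaming_query(
    turn: PreparedTurn, run: Callable[[PreparedTurn], Iterable[str]]
) -> Iterator[str]:
    """Streaming wrapper: forward chunks as they are produced."""
    yield from run(turn)
```

With this split, a change to request handling (model selection, system prompt, validation) lands in `prepare_turn` once and applies to both endpoints.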
Description
Streaming Chat with Referenced Documents PoC. Requires llama-stack 0.2.7.
Screenshare.-.2025-05-19.9_24_25.PM.mp4
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing