
Issue: how stream results with long context #5532

Closed · qlql489 opened this issue Jun 1, 2023 · 4 comments

Comments

qlql489 commented Jun 1, 2023

Issue you'd like to raise.

I followed the chapter “Chat Over Documents with Chat History” to build a bot that chats with a PDF, and I want to stream the response.
But when I use the stuff chain like this:

    doc_chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff", prompt=QA_PROMPT)
    chain = ConversationalRetrievalChain(
        retriever=vector_db.as_retriever(),
        question_generator=question_generator,
        combine_docs_chain=doc_chain
    )

it return "This model's maximum context length is 4097 tokens, however you requested 5741 tokens (5485 in your prompt; 256 for the completion). Please reduce your prompt; or completion length"

When I use the map_reduce chain:

    doc_chain = load_qa_chain(OpenAI(temperature=0, streaming=True, callbacks=[StreamingStdOutCallbackHandler()]), chain_type="map_reduce", combine_prompt=getQaMap_reducePromot())

it return "Cannot stream results with multiple prompts."

How can I resolve this when the context is too long?

Suggestion:

No response


fitolobo commented Jun 15, 2023

Same problem here. I would like to stream the chain output, something like:
llm = OpenAI(streaming=True, temperature=0.0, model_name="gpt-3.5-turbo", callbacks=some_callback)
chain = MultiRetrievalQAChain.from_retrievers(llm, multi_retriever_info, verbose=False)

to yield the tokens of the last answer. I can stream and print the intermediate JSON steps, but I couldn't get a generator by following the various instructions about callbacks in the documentation.
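
One way to get such a generator is to push tokens onto an `asyncio.Queue` from the callback and drain the queue while the chain runs as a background task. This is only a minimal sketch, not an official API: it assumes nothing beyond the standard `AsyncCallbackHandler` hooks that also appear later in this thread, and `QueueCallbackHandler` / `stream_answer` are hypothetical names:

```python
import asyncio
from typing import Any, AsyncIterator

from langchain.callbacks.base import AsyncCallbackHandler


class QueueCallbackHandler(AsyncCallbackHandler):
    """Hypothetical handler that puts streamed tokens on an asyncio.Queue."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()
        self.done = asyncio.Event()

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        await self.queue.put(token)

    async def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        # Fires once per LLM call; for chains that call the LLM several times
        # you may prefer to signal completion from the outer chain instead.
        self.done.set()


async def stream_answer(chain, inputs: dict, handler: QueueCallbackHandler) -> AsyncIterator[str]:
    """Run the chain in the background and yield tokens as they arrive."""
    task = asyncio.create_task(chain.acall(inputs))
    while not (handler.done.is_set() and handler.queue.empty()):
        try:
            yield await asyncio.wait_for(handler.queue.get(), timeout=0.1)
        except asyncio.TimeoutError:
            continue
    await task  # surface any chain errors and the final result
```

The handler would have to be passed to the streaming LLM (e.g. `callbacks=[handler]` on the model), after which the output can be consumed with `async for token in stream_answer(chain, inputs, handler)`.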

agola11 pushed a commit that referenced this issue Jun 18, 2023
…async (#6181)

This adds the ability to attach an AsyncCallbackManager (handler) to the reducer chain, which can then stream the tokens via the
`async def on_llm_new_token` callback method.



Fixes #5532


 @hwchase17  @agola11 
The following code snippet explains how this change would be used to
enable `reduce_llm` with streaming support in a `map_reduce` chain.

I have tested this change and it works for the streaming use case of
reducer responses. I am happy to share more information if this
solution makes sense.

```python
# AsyncHandler
# ------------
from typing import Any

from langchain.callbacks.base import AsyncCallbackHandler
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI


class StreamingLLMCallbackHandler(AsyncCallbackHandler):
    """Callback handler for streaming LLM responses."""

    def __init__(self, websocket):
        self.websocket = websocket

    # This callback method is executed asynchronously for every new token
    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        resp = ChatResponse(sender="bot", message=token, type="stream")
        await self.websocket.send_json(resp.dict())


# Chain
# -----
# `websocket`, `ChatResponse`, `manager`, `question_generator`, `vectorstore`,
# `question` and `chat_history` are assumed to be defined elsewhere in the app.
stream_handler = StreamingLLMCallbackHandler(websocket)
stream_manager = AsyncCallbackManager([stream_handler])

streaming_llm = ChatOpenAI(
    streaming=True,
    callback_manager=stream_manager,
    verbose=False,
    temperature=0,
)
main_llm = OpenAI(
    temperature=0,
    verbose=False,
)

doc_chain = load_qa_chain(
    llm=main_llm,
    reduce_llm=streaming_llm,
    chain_type="map_reduce",
    callback_manager=manager,
)
qa_chain = ConversationalRetrievalChain(
    retriever=vectorstore.as_retriever(),
    combine_docs_chain=doc_chain,
    question_generator=question_generator,
    callback_manager=manager,
)

# Here `acall` triggers `acombine_docs` on `map_reduce`, which should then call
# `_aprocess_result`, which in turn calls `self.combine_document_chain.arun`,
# hence the async callback will be awaited.
result = await qa_chain.acall(
    {"question": question, "chat_history": chat_history}
)
```
kacperlukawski pushed a commit to kacperlukawski/langchain that referenced this issue Jun 29, 2023
…async (langchain-ai#6181)

@clemlesne

Seems related to #1349.

@straeter

Was this fixed in 2b3b4e0? If so, can someone please create an example of how to stream a load_qa_chain response?
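
For what it's worth, a minimal sketch along the lines of the #6181 snippet above might look like this. It assumes the installed version includes that change, that `docs` is the list of retrieved `Document`s, and that streaming the reduce step to stdout is enough:

```python
import asyncio

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI


async def answer(docs, question: str):
    # Only the reduce (combine) step streams; the map step uses a plain LLM,
    # which avoids the "Cannot stream results with multiple prompts." error.
    streaming_llm = ChatOpenAI(
        streaming=True,
        callbacks=[StreamingStdOutCallbackHandler()],  # prints tokens as they arrive
        temperature=0,
    )
    doc_chain = load_qa_chain(
        llm=OpenAI(temperature=0),
        reduce_llm=streaming_llm,
        chain_type="map_reduce",
    )
    # The async entry point is what lets the async streaming callbacks be awaited.
    return await doc_chain.acall({"input_documents": docs, "question": question})

# asyncio.run(answer(docs, "What is this document about?"))
```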


dosubot bot commented Jan 31, 2024

Hi, @qlql489,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue pertains to encountering errors when attempting to stream results with a long context in a chatbot built using the "Chat Over Documents with Chat History" chapter. Another user, "fitolobo," has also experienced the same problem and is seeking a solution to stream the chain output. User "clemlesne" has pointed out a related issue, while "straeter" has inquired about a potential fix and requested an example on how to stream a load_qa_chain response.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

@dosubot added the stale label Jan 31, 2024
@dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 7, 2024
@dosubot removed the stale label Feb 7, 2024