
on_llm_new_token event broken in langchain_openai when streaming #19185

Closed
theobjectivedad opened this issue Mar 16, 2024 · 2 comments
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
Ɑ: models (Related to LLMs or chat model modules)
🔌: openai (Primarily related to OpenAI integrations)

Comments

@theobjectivedad

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Current implementation:

    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        params = {**self._invocation_params, **kwargs, "stream": True}
        self.get_sub_prompts(params, [prompt], stop)  # this mutates params
        for stream_resp in self.client.create(prompt=prompt, **params):
            if not isinstance(stream_resp, dict):
                stream_resp = stream_resp.model_dump()
            chunk = _stream_response_to_generation_chunk(stream_resp)
            yield chunk  # note: the chunk is yielded before the callback below runs
            if run_manager:
                run_manager.on_llm_new_token(
                    chunk.text,
                    chunk=chunk,
                    verbose=self.verbose,
                    logprobs=(
                        chunk.generation_info["logprobs"]
                        if chunk.generation_info
                        else None
                    ),
                )

I believe this change would correct the issue and produce the intended behavior:

    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        params = {**self._invocation_params, **kwargs, "stream": True}
        self.get_sub_prompts(params, [prompt], stop)  # this mutates params
        for stream_resp in self.client.create(prompt=prompt, **params):
            if not isinstance(stream_resp, dict):
                stream_resp = stream_resp.model_dump()
            chunk = _stream_response_to_generation_chunk(stream_resp)
            
            # fire the callback for this chunk before yielding it
            if run_manager:
                run_manager.on_llm_new_token(
                    chunk.text,
                    chunk=chunk,
                    verbose=self.verbose,
                    logprobs=(
                        chunk.generation_info["logprobs"]
                        if chunk.generation_info
                        else None
                    ),
                )
            yield chunk

Error Message and Stack Trace (if applicable)

No response

Description

When streaming via langchain_openai.llms.base.BaseOpenAI._stream, the chunk is yielded before the run manager event is triggered. This makes it impossible for a callback's on_llm_new_token method to fire until the full response is received.
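
For context on why the ordering matters, here is a minimal sketch of the kind of client code that depends on on_llm_new_token firing as each chunk arrives. The handler, model name, and prompt are placeholders chosen for illustration, not part of the issue:

    from langchain_core.callbacks import BaseCallbackHandler
    from langchain_openai import OpenAI


    class TokenPrinter(BaseCallbackHandler):
        """Print each token as soon as the run manager reports it."""

        def on_llm_new_token(self, token: str, **kwargs) -> None:
            print(token, end="", flush=True)


    # gpt-3.5-turbo-instruct is just an example completion model
    llm = OpenAI(model="gpt-3.5-turbo-instruct", streaming=True, callbacks=[TokenPrinter()])
    for _chunk in llm.stream("Tell me a short joke"):
        pass  # tokens should print incrementally via the callback while streaming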

System Info

System Information
------------------
> OS:  Linux
> OS Version:  #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb  9 13:32:52 UTC 2
> Python Version:  3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]

Package Information
-------------------
> langchain_core: 0.1.30
> langchain: 0.1.11
> langchain_community: 0.0.27
> langsmith: 0.1.23
> langchain_openai: 0.0.8
> langchain_text_splitters: 0.0.1

Packages not installed (Not Necessarily a Problem)
--------------------------------------------------
The following packages were not found:

> langgraph
> langserve
@dosubot dosubot bot added the Ɑ: models, 🔌: openai, and 🤖:bug labels Mar 16, 2024
@theobjectivedad (Author) commented Mar 16, 2024

Additionally, it would be good to pass stream_resp to the callback. This would allow clients to differentiate between multiple responses when n > 1. For example:

def _stream(
    self,
    prompt: str,
    stop: Optional[List[str]] = None,
    run_manager: Optional[CallbackManagerForLLMRun] = None,
    **kwargs: Any,
) -> Iterator[GenerationChunk]:
    params = {**self._invocation_params, **kwargs, "stream": True}
    self.get_sub_prompts(params, [prompt], stop)  # this mutates params
    for stream_resp in self.client.create(prompt=prompt, **params):
        if not isinstance(stream_resp, dict):
            stream_resp = stream_resp.model_dump()

        chunk = _stream_response_to_generation_chunk(stream_resp)

        if run_manager:
            run_manager.on_llm_new_token(
                chunk.text,
                chunk=chunk,
                verbose=self.verbose,
                logprobs=(
                    chunk.generation_info["logprobs"] if chunk.generation_info else None
                ),
                stream_resp=stream_resp,
            )

        yield chunk

I've verified against the OpenAI streaming response that it supports multiple generations (note the index field in the choices below):

data: {"id":"chatcmpl-93M3I2vZ1GbBKwQ4Pl2z71bxLKT67","object":"chat.completion.chunk","created":1710586780,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f2ebda25a","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":{"content":[]},"finish_reason":null}]}

data: {"id":"chatcmpl-93M3I2vZ1GbBKwQ4Pl2z71bxLKT67","object":"chat.completion.chunk","created":1710586780,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f2ebda25a","choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":{"content":[{"token":"Hello","logprob":-0.0062218173,"bytes":[72,101,108,108,111],"top_logprobs":[]}]},"finish_reason":null}]}

data: {"id":"chatcmpl-93M3I2vZ1GbBKwQ4Pl2z71bxLKT67","object":"chat.completion.chunk","created":1710586780,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f2ebda25a","choices":[{"index":1,"delta":{"role":"assistant","content":""},"logprobs":{"content":[]},"finish_reason":null}]}

data: {"id":"chatcmpl-93M3I2vZ1GbBKwQ4Pl2z71bxLKT67","object":"chat.completion.chunk","created":1710586780,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f2ebda25a","choices":[{"index":1,"delta":{"content":"Hello"},"logprobs":{"content":[{"token":"Hello","logprob":-0.0062218173,"bytes":[72,101,108,108,111],"top_logprobs":[]}]},"finish_reason":null}]}

@sepiatone (Contributor) commented Mar 23, 2024

The problem of yielding before calling the run manager has been fixed by #18269. (I don't know enough to comment on the other enhancement you proposed.)

@dosubot dosubot bot added the stale label (Issue has not had recent activity or appears to be solved) Jun 22, 2024
@dosubot dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 29, 2024
@dosubot dosubot bot removed the stale label Jun 29, 2024