
Issue: How to stop extra LLM call in ConversationalRetrievalChain for question rephrasing #9649

Closed
sid8491 opened this issue Aug 23, 2023 · 16 comments
Labels
Ɑ: models (Related to LLMs or chat model modules), 🤖:question (A specific question about the codebase, product, project, or how to use a feature)

Comments

@sid8491

sid8491 commented Aug 23, 2023

Issue you'd like to raise.

I am using ConversationalRetrievalChain for a RAG question-answering bot.
There is one LLM call that I have not configured, and it is reducing the quality of the responses and increasing the response time.

The prompt in the LLM call is:

Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.
It is taking around 1 second to complete this call, and it is reducing the quality of the response as well.

How do I stop this call?

Suggestion:

No response

@dosubot dosubot bot added the Ɑ: models and 🤖:question labels Aug 23, 2023
@dosubot

dosubot bot commented Aug 23, 2023

🤖

Hello,

Thank you for bringing up this issue. The extra LLM call you're referring to is controlled by the rephrase_question attribute in the BaseConversationalRetrievalChain class. This attribute determines whether or not to pass the newly generated question to the combine_docs_chain.

If you want to disable this extra LLM call, you can set rephrase_question to False. Here's how you can do it:

class BaseConversationalRetrievalChain(Chain):
    ...
    rephrase_question: bool = False
    ...

By setting rephrase_question to False, the newly generated question will only be used for retrieval, and the original question will be passed along to the combine_docs_chain. This should prevent the extra LLM call from happening, reducing the response time and potentially improving the quality of responses.

You can find more details about this in the BaseConversationalRetrievalChain class in the libs/langchain/langchain/chains/conversational_retrieval/base.py file.
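
For reference, the flag does not need to be set by editing the class; it can also be passed when constructing the chain. Below is a minimal sketch, assuming an llm and retriever already exist:

from langchain.chains import ConversationalRetrievalChain

# Sketch: rephrase_question is forwarded to the chain via **kwargs.
# As later comments in this thread point out, the condense-question LLM
# call still runs when chat history is present; the flag only controls
# which question reaches the combine_docs_chain.
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    rephrase_question=False,
)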

Please let me know if this resolves your issue or if you have any other questions.

Best,
Dosu



This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@sid8491
Author

sid8491 commented Aug 23, 2023

not useful.

@devstein

@sid8491 If I understand your question correctly, the bot's answer is correct. The rephrase_question=False flag skips the step that runs this prompt:

Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

@sid8491
Author

sid8491 commented Aug 23, 2023

@sid8491 If I understand your question correctly, the bot's answer is correct. The rephrase_question=False flag skips the step that runs this prompt:

Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

@devstein I have tried it like this, but this is not changing anything:

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=custom_retriever,
    chain_type="stuff",
    combine_docs_chain_kwargs={"prompt": QA_PROMPT},  # QA_PROMPT is a custom PromptTemplate
    return_source_documents=True,
    verbose=False,
    rephrase_question=False,
)

@sid8491
Author

sid8491 commented Aug 23, 2023

@devstein Upon careful examination, I discovered that when I set rephrase_question to False, the chain still makes an additional call to the Language Model (LLM), but it does not use the rephrased question. How can I stop this LLM call? Two issues arise from this extra LLM call:

  1. It delays response generation, resulting in increased response time.
  2. I am using streaming responses in Streamlit, and the output of this LLM call is also printed within the on_llm_new_token() function (a possible mitigation is sketched below).
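
A possible mitigation for point 2, sketched below rather than an official fix: ConversationalRetrievalChain.from_llm accepts a separate condense_question_llm, so the rephrasing step can run on a model that carries no streaming callback, keeping on_llm_new_token() quiet during that step. Note that this does not remove the extra call or its latency; it only keeps its tokens out of the stream. The names my_stream_handler (and ChatOpenAI as the model) are assumptions standing in for whatever is actually in use; custom_retriever is taken from the snippet above.

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

# Only the answering model streams, so on_llm_new_token() never fires
# for the condense-question step. my_stream_handler is an assumed
# custom streaming callback handler.
streaming_llm = ChatOpenAI(streaming=True, callbacks=[my_stream_handler])
condense_llm = ChatOpenAI(streaming=False)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=streaming_llm,
    retriever=custom_retriever,
    condense_question_llm=condense_llm,  # rephrasing runs on the non-streaming model
    rephrase_question=False,
    return_source_documents=True,
)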

@JoAmps

JoAmps commented Nov 3, 2023

@sid8491 Those are the exact issues I have been searching for solutions to: it delays the response, and the rephrased question gets shown to the user. I want it to be able to rephrase the question, but without increasing the response time and without showing the rephrased text to the user.

@harshadk-sourcefuse

harshadk-sourcefuse commented Nov 8, 2023

@hwchase17, @baskaryan, I am also facing the same issue. I am using ConversationalRetrievalChain with ConversationBufferWindowMemory, and I am experiencing the following issues:

  • The streaming response contains the rephrased question.
  • When the same question is asked a second time, it answers with "answer not found in context".
  • It rephrases the question based on the previous question even when I have not used phrases like 'based on the previous question' or 'as per the previous conversation.' In other words, even when I have not indicated that the previous question should be used to answer the current one, the model still rephrases the current question against it, leading to an "answer not found in context" response (one possible mitigation is sketched after this list).
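
One partial mitigation for the over-eager rephrasing described above, sketched here rather than offered as an official fix: pass a custom condense_question_prompt that asks the model to return the question unchanged when it already stands on its own. The exact wording below is an assumption to be tuned per model; the input variables must stay chat_history and question, matching the default prompt, and llm and retriever are assumed to exist.

from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain

# Assumed wording; adjust for your model. Keeping the default variable
# names (chat_history, question) lets this slot into from_llm directly.
CUSTOM_CONDENSE_PROMPT = PromptTemplate.from_template(
    "Given the following conversation and a follow up question, rephrase the "
    "follow up question to be a standalone question, in its original language. "
    "If the follow up question already stands on its own, return it unchanged.\n\n"
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\n"
    "Standalone question:"
)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,              # assumed to exist
    retriever=retriever,  # assumed to exist
    condense_question_prompt=CUSTOM_CONDENSE_PROMPT,
)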

@0ENZO

0ENZO commented Nov 15, 2023

@JoAmps I'm facing the same issue; have you found a solution yet?

@khanhnd0408

khanhnd0408 commented Nov 20, 2023

rephrase_question=False still generates the rephrased question (which I think still incurs costs), but the final answer no longer depends on the previous answer (which is what I want).
But I also want to stream the answer back to the client using a callback with a custom class wrapped around StreamingStdOutCallbackHandler, and on_llm_new_token also includes the output from the rephrase step. Is there any way to completely disable the rephrase step?
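
One way to keep the streamed output clean without modifying the chain itself is to filter inside the callback handler. This is a heuristic sketch, not a definitive fix: it assumes the default condense-question prompt wording and a synchronous chain where the rephrase call finishes before the answer call starts.

from typing import Any, Dict, List

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler


class AnswerOnlyStreamingHandler(StreamingStdOutCallbackHandler):
    """Streams tokens only for the answer LLM, not the rephrasing LLM."""

    def __init__(self) -> None:
        super().__init__()
        self._skip = False

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        # Heuristic: the rephrasing call uses the default condense-question
        # prompt, so skip streaming whenever that wording appears.
        self._skip = any("standalone question" in p for p in prompts)

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        if not self._skip:
            super().on_llm_new_token(token, **kwargs)

Passing an instance of this handler as the streaming callback keeps the rephrased question out of the streamed output, at the cost of relying on the prompt text as a marker.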

@khanhnd0408

@harshadk-sourcefuse, I'm facing the same issue with "In the streaming response, it contains the rephrased question." Did you find a solution for it?

@harshadk-sourcefuse

@harshadk-sourcefuse, I'm facing the same issue with "In the streaming response, it contains the rephrased question." Did you find a solution for it?

It's not a complete solution, but there is a workaround. If you don't require memory, you can set k=0 in ConversationBufferWindowMemory. Since there will be no chat history, it will not rephrase the questions.

However, I am still awaiting a proper solution.
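
For reference, that workaround is roughly the following sketch (llm and retriever are assumed to exist). With k=0 the window stays empty, so the chat_history string is blank and the condense-question call is skipped entirely, at the cost of losing conversational memory.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

# k=0 keeps the buffer empty, so there is never any chat history for the
# question generator to condense against.
memory = ConversationBufferWindowMemory(
    k=0,
    memory_key="chat_history",
    return_messages=True,
)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,              # assumed to exist
    retriever=retriever,  # assumed to exist
    memory=memory,
)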

@summer1704

summer1704 commented Dec 14, 2023

rephrase_question=False still generates the rephrased question (which I think still incurs costs), but the final answer no longer depends on the previous answer (which is what I want). But I also want to stream the answer back to the client using a callback with a custom class wrapped around StreamingStdOutCallbackHandler, and on_llm_new_token also includes the output from the rephrase step. Is there any way to completely disable the rephrase step?

@khanhnd0408 Hey, did you find any solution? I am facing the same issue while streaming back the response using the custom callback handler; the response includes the rephrased question as well.

@Gusakovskyi

Gusakovskyi commented Jan 16, 2024

Hi, any update on this? I'm facing the same issue, and I can't understand why I should receive the rephrased query in the response from the model. I am using streamed output to pass the answer on token by token, but with the current implementation I have to trim the rephrased query out somehow. That makes absolutely no sense to me: it costs additional time and looks useless.

@harshadk-sourcefuse

Hi, is there any update on this issue? I'm facing the same issue on v0.1.0.


dosubot bot commented Jan 18, 2024

🤖

Based on your description, it seems like you want to prevent the additional Language Model (LLM) call in the ConversationalRetrievalChain when the rephrase_question parameter is set to False. This is because it's causing increased response time and unwanted output in Streamlit's on_llm_new_token() function.

In the current implementation of LangChain, the LLM call is made regardless of the rephrase_question parameter's value. The rephrase_question parameter only controls whether the new question generated by the LLM is passed to the combine_docs_chain or not.

To prevent the LLM call when rephrase_question is set to False, you can modify the _call and _acall methods in the BaseConversationalRetrievalChain class. Here's how you can do it:

def _call(
    self,
    inputs: Dict[str, Any],
    run_manager: Optional[CallbackManagerForChainRun] = None,
) -> Dict[str, Any]:
    _run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
    question = inputs["question"]
    get_chat_history = self.get_chat_history or _get_chat_history
    chat_history_str = get_chat_history(inputs["chat_history"])
    # Changed condition: upstream checks only chat_history_str, so the
    # question generator runs even when rephrase_question is False.
    if self.rephrase_question and chat_history_str:
        callbacks = _run_manager.get_child()
        new_question = self.question_generator.run(
            question=question, chat_history=chat_history_str, callbacks=callbacks
        )
    else:
        new_question = question
    # Rest of the method remains the same...

async def _acall(
    self,
    inputs: Dict[str, Any],
    run_manager: Optional[AsyncCallbackManagerForChainRun] = None,
) -> Dict[str, Any]:
    _run_manager = run_manager or AsyncCallbackManagerForChainRun.get_noop_manager()
    question = inputs["question"]
    get_chat_history = self.get_chat_history or _get_chat_history
    chat_history_str = get_chat_history(inputs["chat_history"])
    # Same change as in _call above: skip the question generator unless
    # rephrase_question is True and there is chat history.
    if self.rephrase_question and chat_history_str:
        callbacks = _run_manager.get_child()
        new_question = await self.question_generator.arun(
            question=question, chat_history=chat_history_str, callbacks=callbacks
        )
    else:
        new_question = question
    # Rest of the method remains the same...

In these modified methods, the LLM call is made only if rephrase_question is True and there is a chat history. If rephrase_question is False, the original question is used, and the LLM call is skipped.

You can find the original implementation of these methods in the BaseConversationalRetrievalChain class.

Please note that this is a workaround and might not be the best solution. It would be better if the LangChain developers provided an official way to control the LLM call based on the rephrase_question parameter. You might want to raise this as a feature request in the LangChain repository.

I hope this helps! If you have any other questions, feel free to ask.



This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@dosubot dosubot bot added the stale label (issue has not had recent activity or appears to be solved; stale issues will be automatically closed) Apr 18, 2024
@dosubot dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 25, 2024
@dosubot dosubot bot removed the stale label Apr 25, 2024
@arturgoms

This issue is still happening; it should not be closed.
