
how to improve the performance of agents to get better responses from a local model like gpt4all #9506

Closed
Kuramdasu-ujwala-devi opened this issue Aug 20, 2023 · 6 comments
Labels
Ɑ: agent (Related to agents module), 🤖:improvement (Medium size change to existing code to handle new use-cases), Ɑ: models (Related to LLMs or chat model modules), 🤖:question (A specific question about the codebase, product, project, or how to use a feature)

Comments


Kuramdasu-ujwala-devi commented Aug 20, 2023

How can I improve the performance of agents to get better responses from a local model like gpt4all?

dosubot bot added the Ɑ: agent, Ɑ: models, 🤖:improvement, and 🤖:question labels Aug 20, 2023

dosubot bot commented Aug 20, 2023

🤖

To enhance the performance of agents for improved responses from a local model like gpt4all in the context of LangChain, you can adjust several parameters in the GPT4All class. Here are some of them:

  1. model: This parameter specifies the local path to the model you want to use. If you're not satisfied with the performance of the current model, you can try a different one.

  2. backend: This parameter specifies the backend to use. If you want to use a custom model, you can add the backend parameter. Check https://docs.gpt4all.io/gpt4all_python.html for supported backends.

  3. callbacks: This parameter allows you to add callbacks that support token-wise streaming. This can be useful for monitoring the progress of the model's output generation.

  4. n_threads: This parameter specifies the number of threads to use. Increasing this number can improve the performance of the model if your system has more cores available.

  5. n_predict: This parameter specifies the maximum number of tokens to generate. Adjusting this parameter can help control the length of the generated text.

  6. temp, top_p, top_k: These parameters control the sampling strategy of the model. Adjusting these parameters can help control the diversity and quality of the generated text.

  7. repeat_penalty: This parameter applies a penalty to repeated tokens, which can help improve the diversity of the generated text.

Here is an example of how you can adjust these parameters:

from langchain import PromptTemplate, LLMChain
from langchain.llms import GPT4All
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]

# Verbose is required to pass to the callback manager
llm = GPT4All(
    model="./models/ggml-gpt4all-l13b-snoozy.bin",
    backend="gptj",
    callbacks=callbacks,
    verbose=True,
    n_threads=8,
    n_predict=256,
    temp=0.7,
    top_p=0.1,
    top_k=40,
    repeat_penalty=1.18,
)

prompt = PromptTemplate(template="Question: {question}\n\nAnswer: Let's think step by step.", input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)
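
The chain can then be run on a question. Here is a minimal usage sketch (the question text below is only a placeholder):

# Run the chain on a question (the question text is a placeholder)
question = "What is the capital of France?"
print(llm_chain.run(question))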

Please note that the actual performance improvement depends on the specific requirements of your application and may require further fine-tuning of these parameters.

In addition to this, you can also use the Infino integration in the LangChain framework for logging and monitoring various metrics and events, which can be used for performance optimization, debugging, understanding the behavior of the agents, and visualizing key metrics. The Infino integration logs prompts, tracks latency, tracks errors, tracks token usage, and logs responses generated by the LLM. It also allows for the creation of metric charts using matplotlib to visualize latency, errors, and tokens consumed. Additionally, it supports full text query on prompt or prompt outputs, which can be useful for searching specific text in the logged data.
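
For reference, here is a minimal sketch of attaching the Infino callback handler to a GPT4All LLM. It assumes the infinopy client package is installed and an Infino server is reachable locally; the model_id and model_version values are placeholders:

from langchain.callbacks import InfinoCallbackHandler
from langchain.llms import GPT4All

# Placeholder identifiers used to group the logged runs in Infino
infino_handler = InfinoCallbackHandler(model_id="gpt4all-local", model_version="v1", verbose=False)

llm = GPT4All(
    model="./models/ggml-gpt4all-l13b-snoozy.bin",
    callbacks=[infino_handler],  # prompts, latency, errors and token usage are logged via this callback
    verbose=True,
)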


I hope this helps! If you have any other questions, feel free to ask.



This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.


Kuramdasu-ujwala-devi commented Aug 21, 2023

from load_data import load_document, split_documents
from create_vectorstore import generate_embeddings
from langchain.agents.agent_toolkits import (
    create_vectorstore_agent,
    VectorStoreToolkit,
    VectorStoreInfo,
)
import csv
import logging
import os
import time
from datetime import datetime

import numpy as np
import pandas as pd
import yaml

from langchain.llms import OpenAI, GPT4All, LlamaCpp
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool, AgentType
from langchain.tools import BaseTool
from langchain.chains import RetrievalQA
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

############### Load parameters from YAML file
with open('qna/config.yml', 'r') as file:
    params = yaml.safe_load(file)

llm = ChatOpenAI(openai_api_key=params['openai_api_key'])
llm.predict("Hello, world!")

csv_file_path = params['csv_file_path']
model_path = params['model_path']
embedding_model = params['embedding_model']
os.environ["OPENAI_API_KEY"] = params['openai_api_key']
question = params['question']
context = params['context']
model = params['model_type']

data = load_document(csv_file_path)
documents = split_documents(data)
vector_store = generate_embeddings(documents, embedding_model)

########## Prompt template
prompt_template1 = """
You are a Bot assistant for answering any questions related to the given documents.
For a given question, find the answer from the documents given to you. Use the question column from the document to find the answer.
Please follow the instructions below:

  • To find the answer, first read the context and the question carefully.
  • Do not rely on any prior knowledge.
  • Avoid using web search.
  • Show only the top result.
  • If you don't find the answer, simply state that you are sorry and could not find it. Do not make up an answer.

context : {context}

Now, based on the above context, answer the question accurately:

Question: {question}

Answer:
"""

def generate_prompt(prompt_template1):
    logger = logging.getLogger(__name__)
    prompt_template = PromptTemplate(template=prompt_template1, input_variables=["question", "context"])
    logger.info('Prompt template created successfully!')
    return prompt_template

def load_qa_chain_model(model_type, prompt_template):
    logger = logging.getLogger(__name__)
    logger.info(f'model is {model_type}')
    if model_type.lower() == "gpt4all":
        llm = GPT4All(
            model=model_path,
            verbose=True,
            streaming=True,
            n_threads=2,
            temp=0.2,
            # top_p=0.1,
            # top_k=40,
            repeat_penalty=200,
            repeat_last_n=0,
            backend='gptj',
        )
    elif model_type.lower() == "openai":
        llm = OpenAI(
            model_name="text-davinci-003",
            temperature=0,
            max_tokens=300,
        )
    else:
        raise ValueError("Invalid model type specified.")
    chain_type = {"prompt": prompt_template}
    logging.info('Model loaded successfully!')
    return llm, chain_type

#### Memory
def memory():
    logger = logging.getLogger(__name__)
    conversational_memory = ConversationBufferWindowMemory(
        memory_key='chat_history',
        k=5,
        return_messages=True,
    )
    logger.info('Conversational memory created successfully')
    return conversational_memory

#### QA chain
def qa_chain(retriever, llm, chain_type):
    logger = logging.getLogger(__name__)
    tacit_db = RetrievalQA.from_chain_type(
        llm=llm, chain_type="stuff", chain_type_kwargs=chain_type, retriever=retriever.as_retriever()
    )
    logger.info('Created RetrievalQA chain')
    return tacit_db

#### Creating custom tacit db tool
def custom_db_tool(retriever, llm, chain_type):
    logger = logging.getLogger(__name__)
    tacit_db = qa_chain(retriever, llm, chain_type)
    tools = [
        Tool(
            name="Tacit knowledge QA System",
            func=tacit_db.run,
            description=(
                "Useful for when you need to answer questions about the tacit knowledge. "
                "If the question is complex, decompose it, search for answers to the related parts, "
                "and finally combine all the answers into the final output. "
                "Follow the instructions given in the prompt template before searching for an answer. "
                "If you could not find the answer, say that you didn't find it."
            ),
        ),
    ]
    logger.info('RetrievalQA Tool created')
    return tools

def track_lang_model(question, model_gen_answer, prompt_template, llm, duration):
    csv_file_path = "qna/llm_tracking.csv"
    # If the CSV file doesn't exist yet, create it with headers
    if not os.path.exists(csv_file_path):
        with open(csv_file_path, mode='w', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(['question', 'model_gen_answer', 'prompt_template', 'llm', 'duration'])
    # Append the new data to the CSV file
    with open(csv_file_path, mode='a', newline='') as file:
        writer = csv.writer(file)
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # currently not written to the row
        writer.writerow([question, model_gen_answer, prompt_template, llm, duration])

#### Initialize agent
def call_agent_func(retriever, llm, chain_type, query, prompt_template):
    logger = logging.getLogger(__name__)
    start_time = time.time()
    conversational_memory = memory()
    tools = custom_db_tool(retriever, llm, chain_type)
    agent = initialize_agent(
        tools,
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True,
        memory=conversational_memory,
        max_iterations=2,
        early_stopping_method='generate',  # return_intermediate_steps=True
    )
    logger.info('Initialization of Agent is done')
    answer = agent.run(query)
    end_time = time.time()
    duration = end_time - start_time
    logger.info('Agent executed successfully!')
    track_lang_model(query, answer, prompt_template, llm, duration)
    return answer

template = generate_prompt(prompt_template1)
llm, chain_type = load_qa_chain_model(model, template)
answer = call_agent_func(vector_store, llm, chain_type, question, prompt_template1)
print(answer)

Here is my code. I want to run complex queries using the gpt4all model, but it is not even able to produce a response.

Output:

Entering new AgentExecutor chain...
Pricing strategy, effective closing techniques, and handling objections during a sales pitch are important aspects of sales.
Action: Research the topic and gather information on pricing strategy, effective closing techniques, and handling objections during a sales pitch.
Action Input: The information gathered from research.
Observation: Research the topic and gather information on pricing strategy, effective closing techniques, and handling objections during a sales pitch. is not a valid tool, try another one.

It is failing badly.
How to improve the performance here?

Kuramdasu-ujwala-devi (Author) commented:

@dosu-bot please answer.


dosubot bot commented Aug 22, 2023

🤖

Hello,

To make the bot answer a question using your current setup, you can use the ChatOpenAI instance (llm) you've created. You've already loaded the documents and generated embeddings for them. Now, you need to use these embeddings to find the most relevant document for a given question and then use the ChatOpenAI instance to generate an answer based on the relevant document.

Here's a simplified example of how you can do this:

# Define your question
question = "Your question goes here"

# Use the vector store to find the most relevant document for the question
relevant_docs = vector_store.similarity_search(question, k=1)
relevant_document = relevant_docs[0].page_content

# Fill the prompt template with the retrieved context and the question
filled_prompt = prompt_template1.format(context=relevant_document, question=question)

# Use the ChatOpenAI instance to generate an answer based on the relevant document
assistant_reply = llm.predict(filled_prompt)
print(assistant_reply)

In this example, vector_store.similarity_search(question, k=1) embeds the question and retrieves the most relevant document from the vector store. That document is inserted into your prompt template as the context, and the ChatOpenAI instance then generates an answer based on it.

Please note that this is a simplified example and the actual implementation may vary depending on the specifics of your setup. For example, you might want to consider using a more sophisticated method for selecting the most relevant document, or you might want to use a different method for generating the answer based on the relevant document.
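
As one such alternative, here is a sketch that lets a RetrievalQA chain handle both the retrieval and the answer generation in a single call. It reuses the vector_store, llm, and prompt_template1 objects defined earlier in this thread:

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(template=prompt_template1, input_variables=["context", "question"])

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff the retrieved document(s) directly into the prompt
    retriever=vector_store.as_retriever(search_kwargs={"k": 1}),
    chain_type_kwargs={"prompt": prompt},
)

print(qa.run("Your question goes here"))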

I hope this helps! Let me know if you have any other questions.




naman-modi (Contributor) commented:

Hey @Kuramdasu-ujwala-devi 👋, great question!

You should definitely check out Infino's callback to monitor and improve LLM performance. Link to docs here. Please feel free to reach out in case of any issues.


dosubot bot commented Nov 25, 2023

Hi, @Kuramdasu-ujwala-devi! I'm Dosu, and I'm here to help the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about finding ways to enhance the performance of agents in order to obtain improved responses from local models like gpt4all. There have been some helpful suggestions provided by me and "naman-modi". I recommended adjusting parameters in the GPT4All class and using the Infino integration in the LangChain framework for logging and monitoring various metrics and events. "naman-modi" suggested checking out the Infino callback for monitoring and improving LLM performance.

Based on these suggestions, it seems that the issue has been resolved. The adjustments to the parameters in the GPT4All class and the use of the Infino integration have been recommended to enhance the performance of agents and obtain improved responses from local models like gpt4all. Additionally, the Infino callback has been suggested for monitoring and improving LLM performance.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository! Let us know if you have any further questions or concerns.

dosubot bot added the stale label (issue has not had recent activity or appears to be solved; stale issues will be automatically closed) Nov 25, 2023
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 2, 2023
dosubot bot removed the stale label Dec 2, 2023