
How to connect a Chatbot that has its own data but also has access to the internet for search? #8625

Closed
venturaEffect opened this issue Aug 2, 2023 · 9 comments
Labels
Ɑ: agent Related to agents module 🤖:question A specific question about the codebase, product, project, or how to use a feature

Comments


venturaEffect commented Aug 2, 2023

Issue you'd like to raise.

Hi!

How can I make a Chatbot that uses its own data but also accesses the internet to get more info (like recent updates)? I've tried and searched everywhere but can't make it work.

Here is the code:

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import (
    UnstructuredWordDocumentLoader,
    TextLoader,
    UnstructuredPowerPointLoader,
)
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.chat_models import ChatOpenAI
import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

google_api_key = os.environ.get("GOOGLE_API_KEY")
google_cse_id = os.environ.get("GOOGLE_CSE_ID")

openai.api_key = os.environ['OPENAI_API_KEY']

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

os.environ["GOOGLE_API_KEY"] = google_api_key
os.environ["GOOGLE_CSE_ID"] = google_cse_id

folder_path_docx = "DB\\DB VARIADO\\DOCS"
folder_path_txt = "DB\\BLOG-POSTS"
folder_path_pptx_1 = "DB\\PPT JUNIO"
folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

loaded_content = []

for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

embedding = OpenAIEmbeddings()

embeddings_content = []
for one_loaded_content in loaded_content:
    embedding_content = embedding.embed_query(one_loaded_content.page_content)
    embeddings_content.append(embedding_content)

db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

search = GoogleSearchAPIWrapper()


def custom_search(query):
    max_results = 3

    internet_results = search.run(query)[:max_results]
    return internet_results


chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    return_generated_question=True,
)

history = []

while True:
    query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

    internet_results = custom_search(query)

    combined_results = loaded_content + [internet_results]

    response = chain(
        {"question": query, "chat_history": history, "documents": combined_results})

    print(response["answer"])

    history.append(("system", query))
    history.append(("assistant", response["answer"]))
```

This is the response I get: "The document does not provide information on...". So it seems it doesn't have access to the internet, or something else is wrong (?)

I'd really appreciate your suggestions or help!


@dosubot dosubot bot added Ɑ: agent Related to agents module 🤖:question A specific question about the codebase, product, project, or how to use a feature labels Aug 2, 2023

dosubot bot commented Aug 2, 2023

Answer generated by a 🤖

Hi,

From your code, it seems like you're trying to combine the results from your local documents and the internet search into one list and then pass it to the ConversationalRetrievalChain. However, the ConversationalRetrievalChain expects the documents parameter to be a list of Document objects, not a list of strings or other data types.

The custom_search function returns a list of search results, but it's not clear what the format of these results is. If they're not Document objects, that could be causing the issue. You might need to convert these search results into Document objects before adding them to the combined_results list.

Here's an example of how you might do this:

```python
def custom_search(query):
    max_results = 3

    internet_results = search.run(query)[:max_results]
    # Convert the search results into Document objects
    internet_documents = [Document(page_content=result) for result in internet_results]
    return internet_documents

# ...

while True:
    query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

    internet_documents = custom_search(query)

    combined_results = loaded_content + internet_documents

    response = chain(
        {"question": query, "chat_history": history, "documents": combined_results})

    print(response["answer"])

    history.append(("system", query))
    history.append(("assistant", response["answer"]))
```

In this code, the custom_search function has been modified to convert the search results into Document objects. Then, in the main loop, the internet_documents are added to the combined_results list instead of the raw internet_results.

Please note that this is just a suggestion and the actual solution might differ based on the exact format of your search results and the specific requirements of your application.

I hope this helps! Let me know if you have any other questions.



venturaEffect commented Aug 2, 2023

I get "Document is not defined".

Added this: `from langchain.docstore.document import Document`

But it still seems it doesn't have access to the internet. Any other suggestions?


dosubot bot commented Aug 2, 2023

Answer generated by a 🤖

Hello,

From your code, it seems like you're trying to use the Document class from the langchain.docstore.document module. However, the error message "Document is not defined" suggests that Python can't find a reference to the Document class. This could be due to a few reasons:

  1. The Document class is not correctly imported. Make sure that the import statement from langchain.docstore.document import Document is at the top of your script and there are no typos (a quick sanity check is sketched after this list).

  2. The Document class is not available in the version of LangChain you're using. Ensure you're using the latest version of LangChain, which is v0.0.250.

  3. There might be a conflict with another module or class named Document in your script. If you have another module or class with the same name, consider renaming it or using an alias in the import statement.
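
For instance, a minimal standalone check that the class imports and constructs correctly (the file name in the metadata is just a placeholder):

```python
# Quick sanity check for the Document class import and constructor.
from langchain.docstore.document import Document

doc = Document(page_content="some text", metadata={"source": "placeholder.txt"})
print(doc)
```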

Regarding the issue of the Chatbot not accessing the internet, it's hard to pinpoint the exact issue without more information. However, I can suggest a few things:

  1. Make sure your environment has internet access. You can test this by running a simple Python script that makes a request to a website.

  2. Check if the GoogleSearchAPIWrapper is correctly set up and working. You can do this by running a separate script that only uses the GoogleSearchAPIWrapper to make a request (see the sketch after this list).

  3. Ensure that the custom_search function is correctly implemented and is returning the expected results. You can test this by running the function separately with a test query and checking the output.
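
For example, a minimal standalone check of the wrapper (assuming GOOGLE_API_KEY and GOOGLE_CSE_ID are already set in your environment):

```python
# Minimal sketch to verify the Google Search wrapper in isolation.
from langchain.utilities import GoogleSearchAPIWrapper

search = GoogleSearchAPIWrapper()

# run() returns one concatenated string of result snippets
print(search.run("LangChain"))

# results() returns a list of dicts with "title", "link" and "snippet" keys
for result in search.results("LangChain", num_results=3):
    print(result["title"], "->", result["link"])
```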

If you're still encountering issues, could you please provide more information about the error messages you're seeing, and any other relevant details about your setup? This will help me better understand the problem and provide a more accurate solution.

Best,
Dosu



venturaEffect commented Aug 3, 2023

Document is correctly imported, the LangChain upgrade is done, GoogleSearchAPIWrapper works in a simple script... Everything seems fine, but when running this code it still doesn't get access to the internet:

```python
from googlesearch import search
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import (
    UnstructuredWordDocumentLoader,
    TextLoader,
    UnstructuredPowerPointLoader,
)
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

google_api_key = os.environ.get("GOOGLE_API_KEY")
google_cse_id = os.environ.get("GOOGLE_CSE_ID")

openai.api_key = os.environ['OPENAI_API_KEY']

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

os.environ["GOOGLE_API_KEY"] = google_api_key
os.environ["GOOGLE_CSE_ID"] = google_cse_id

folder_path_docx = "DB\\DB VARIADO\\DOCS"
folder_path_txt = "DB\\BLOG-POSTS"
folder_path_pptx_1 = "DB\\PPT JUNIO"
folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

loaded_content = []

for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

embedding = OpenAIEmbeddings()

embeddings_content = []
for one_loaded_content in loaded_content:
    embedding_content = embedding.embed_query(one_loaded_content.page_content)
    embeddings_content.append(embedding_content)

db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

search = GoogleSearchAPIWrapper()


def custom_search(query):
    max_results = 3

    internet_results = search.run(query)[:max_results]

    internet_documents = [Document(page_content=result)
                          for result in internet_results]
    return internet_documents


chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    return_generated_question=True,
)

history = []

while True:
    query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

    internet_documents = custom_search(query)

    combined_results = loaded_content + internet_documents

    response = chain(
        {"question": query, "chat_history": history, "documents": combined_results})

    print(response["answer"])

    history.append(("system", query))
    history.append(("assistant", response["answer"]))
```

What can I do to make it work? I appreciate any help.


dosubot bot commented Aug 3, 2023

Answer generated by a 🤖

Hello,

Thank you for providing a detailed description of your issue. From the code you've shared, it seems like you've done a thorough job setting up your environment and implementing the necessary functions. However, to better assist you, I need a bit more information:

  1. Could you please provide the exact error message you're encountering? This will help me understand where the issue might be coming from.
  2. Have you been able to successfully make requests to the internet outside of the LangChain framework in your current environment? This could help us determine if the issue is with the LangChain framework or with the environment itself.
  3. Are you able to successfully retrieve and use data from your local files (DOCS, BLOG-POSTS, PPT JUNIO, DB VARIADO)?
  4. Have you tried running a minimal version of your code that only attempts to access the internet, without the other functionalities? This could help isolate the issue.

Your answers to these questions will help me provide a more accurate solution to your problem. Looking forward to your response.




venturaEffect commented Aug 3, 2023

  1. The message I get when running the script is: "The text does not provide information on the... (and the query I asked the chatbot)"
  2. Yes, I could.
  3. Yes, it also works.
  4. Yes, I made another chatbot that just answers by searching the internet, and it works.

In conclusion, there is something wrong in my code that prevents the chatbot from looking at both the custom data and the internet. I need someone with the knowledge to work out a solution.

Here again is the code I have so far:

```python
from googlesearch import search
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import (
    UnstructuredWordDocumentLoader,
    TextLoader,
    UnstructuredPowerPointLoader,
)
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

google_api_key = os.environ.get("GOOGLE_API_KEY")
google_cse_id = os.environ.get("GOOGLE_CSE_ID")

openai.api_key = os.environ['OPENAI_API_KEY']

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

os.environ["GOOGLE_API_KEY"] = google_api_key
os.environ["GOOGLE_CSE_ID"] = google_cse_id

folder_path_docx = "DB\\DB VARIADO\\DOCS"
folder_path_txt = "DB\\BLOG-POSTS"
folder_path_pptx_1 = "DB\\PPT JUNIO"
folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

loaded_content = []

for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

embedding = OpenAIEmbeddings()

embeddings_content = []
for one_loaded_content in loaded_content:
    embedding_content = embedding.embed_query(one_loaded_content.page_content)
    embeddings_content.append(embedding_content)

db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})


search = GoogleSearchAPIWrapper()


def custom_search(query):
    max_results = 3

    internet_results = search.run(query)[:max_results]

    internet_documents = [Document(page_content=result)
                          for result in internet_results]
    return internet_documents

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    return_generated_question=True,
)

history = []

while True:
    query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

    internet_documents = custom_search(query)

    combined_results = loaded_content + internet_documents

    response = chain(
        {"question": query, "chat_history": history, "documents": combined_results})

    print(response["answer"])

    history.append(("system", query))
    history.append(("assistant", response["answer"]))

```

NOTE: There has to be a problem with the output of internet_documents. I tried several ways of making it look like loaded_content, but it still doesn't work. loaded_content displays like this: [Document(page_content="text...", metadata={"source": "path"}), ...]. I changed the code in custom_search to internet_documents = [Document(page_content=result) for result in internet_results], but just got Document(page_content='E', metadata={}), ...], so it isn't getting the content properly from the internet.
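
The page_content='E' output points at search.run(query)[:max_results]: run() returns one concatenated string, so the slice keeps only the first three characters and the list comprehension then builds one Document per character. A small illustration of the failure mode (the query string is just an example):

```python
# Why page_content='E' appears: run() returns one string, not a list of results.
snippets = search.run("example query")  # e.g. "Example snippet text ..."
first_three = snippets[:3]              # "Exa" -- three characters, not three results
docs = [Document(page_content=c) for c in first_three]
# -> [Document(page_content='E'), Document(page_content='x'), Document(page_content='a')]
```

The fix, as in the next comment, is search.results(query, num_results), which returns a list of dicts with "snippet" and "link" keys.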

Appreciate any help!


venturaEffect commented Aug 5, 2023

I've updated the code, but strangely it doesn't find a good response. When I print(response["answer"]), I get that there is no text answering the query I put in, even though it gets information from the internet and the Document list seems well structured. Here is the code:

```python
from googlesearch import search
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import (
    UnstructuredWordDocumentLoader,
    TextLoader,
    UnstructuredPowerPointLoader,
)
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

google_api_key = os.environ.get("GOOGLE_API_KEY")
google_cse_id = os.environ.get("GOOGLE_CSE_ID")

openai.api_key = os.environ['OPENAI_API_KEY']

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

os.environ["GOOGLE_API_KEY"] = google_api_key
os.environ["GOOGLE_CSE_ID"] = google_cse_id

folder_path_docx = "DB\\DB VARIADO\\DOCS"
folder_path_txt = "DB\\BLOG-POSTS"
folder_path_pptx_1 = "DB\\PPT JUNIO"
folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

loaded_content = []

for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

embedding = OpenAIEmbeddings()

embeddings_content = []
for one_loaded_content in loaded_content:
    embedding_content = embedding.embed_query(one_loaded_content.page_content)
    embeddings_content.append(embedding_content)

db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

search = GoogleSearchAPIWrapper()


def custom_search(query):
    max_results = 3

    internet_results = search.results(query, max_results)
    internet_documents = [Document(page_content=result["snippet"],
                                   metadata={"source": result["link"]})
                          for result in internet_results]
    return internet_documents


chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    return_generated_question=True,
)

history = []

while True:
    query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

    internet_documents = custom_search(query)

    small = loaded_content[:3]

    combined_results = small + internet_documents
    print(combined_results)

    response = chain(
        {"question": query, "chat_history": history, "documents": combined_results})

    print(response["answer"])

    history.append(("system", query))
    history.append(("assistant", response["answer"]))
```

Can anyone help me make it work? I'd appreciate it!
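
A likely root cause here: ConversationalRetrievalChain reads only the question and chat_history input keys and pulls context exclusively from its retriever, so the extra "documents" key passed above appears to be silently ignored and the combined list never reaches the LLM. A minimal sketch of one workaround under that assumption, reusing loaded_content and custom_search from the code above: feed the combined documents directly into a QA chain built with load_qa_chain, which does accept documents via its "input_documents" key.

```python
# Sketch of a workaround: pass the combined documents straight to a QA chain
# instead of relying on ConversationalRetrievalChain's retriever alone.
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4", temperature=0)
qa_chain = load_qa_chain(llm, chain_type="map_reduce")

query = "..."  # the user's question
combined_results = loaded_content[:3] + custom_search(query)

# load_qa_chain reads documents from the "input_documents" key
result = qa_chain(
    {"input_documents": combined_results, "question": query},
    return_only_outputs=True,
)
print(result["output_text"])
```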


dosubot bot commented Nov 4, 2023

Hi, @venturaEffect! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you were seeking help to connect a chatbot with its own data to the internet for search. In the comments, there were suggestions and code examples provided by me to convert search results into Document objects and to check the setup and implementation. You made updates to the code but were still encountering issues with accessing the internet and getting a proper response.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Nov 4, 2023
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 11, 2023
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Nov 11, 2023
Laxmi-Pravaah commented

How do I connect a chatbot to the internet for real-time access, or to give up-to-date information, using a Gemini model instead of GPT-4? Please help me with resources.
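
The same pattern should carry over to Gemini by swapping out the LLM. A minimal sketch, assuming the langchain-google-genai integration package (pip install langchain-google-genai) and a Gemini API key in GOOGLE_API_KEY; the retrieval and search wiring above stays the same:

```python
# Hypothetical sketch: use Gemini as the chat model in place of ChatOpenAI.
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0)

# Drop this llm into ConversationalRetrievalChain.from_llm(...) or
# load_qa_chain(...); everything else in the setup above is unchanged.
```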
