
Question Answering over Docs giving cryptic error upon query #2944

Closed · 2 comments

ahmadsb86 commented Apr 15, 2023

After ingesting some Markdown files with a slightly modified version of the question-answering-over-docs example, I ran the qa.py script exactly as it appears in the example:

# qa.py
import argparse
import pickle

import faiss
from langchain import OpenAI
from langchain.chains import VectorDBQAWithSourcesChain

parser = argparse.ArgumentParser(description='Ask a question to the Notion DB.')
parser.add_argument('question', type=str, help='The question to ask the Notion DB')
args = parser.parse_args()

# Load the FAISS index and the pickled store, then reattach the index.
index = faiss.read_index("docs.index")

with open("faiss_store.pkl", "rb") as f:
    store = pickle.load(f)

store.index = index

chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(temperature=0), vectorstore=store)
result = chain({"question": args.question})
print(f"Answer: {result['answer']}")

Only to get this cryptic error:

Traceback (most recent call last):
  File "C:\Users\ahmad\OneDrive\Desktop\Coding\LANGCHAINSSSSSS\notion-qa\qa.py", line 22, in <module>
    result = chain({"question": args.question})
  File "C:\Users\ahmad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\langchain\chains\base.py", line 146, in __call__
    raise e
  File "C:\Users\ahmad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\langchain\chains\base.py", line 142, in __call__
    outputs = self._call(inputs)
  File "C:\Users\ahmad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\langchain\chains\qa_with_sources\base.py", line 97, in _call
    answer, _ = self.combine_document_chain.combine_docs(docs, **inputs)
  File "C:\Users\ahmad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\langchain\chains\combine_documents\map_reduce.py", line 150, in combine_docs
    num_tokens = length_func(result_docs, **kwargs)
  File "C:\Users\ahmad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\langchain\chains\combine_documents\stuff.py", line 77, in prompt_length
    inputs = self._get_inputs(docs, **kwargs)
  File "C:\Users\ahmad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\langchain\chains\combine_documents\stuff.py", line 64, in _get_inputs
    document_info = {
  File "C:\Users\ahmad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\langchain\chains\combine_documents\stuff.py", line 65, in <dictcomp>
    k: base_info[k] for k in self.document_prompt.input_variables
KeyError: 'source'
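
The last frame pinpoints the failure: the stuff chain merges each document's page_content with its metadata dict, then copies out one value per key in document_prompt.input_variables, so any document whose metadata lacks a 'source' entry raises exactly this KeyError. A minimal sketch of the failing lookup (the hard-coded key list here stands in for the prompt's input variables, which for the default with-sources prompt include 'source'):

from langchain.docstore.document import Document

# A chunk that was ingested without any metadata, as in the script below.
doc = Document(page_content="some chunk text", metadata={})

# stuff.py merges page_content with the metadata dict...
base_info = {"page_content": doc.page_content, **doc.metadata}

# ...then pulls one value per prompt input variable.
document_info = {k: base_info[k] for k in ["page_content", "source"]}  # KeyError: 'source'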

Here is the code I used for ingesting:

"""This is the logic for ingesting Notion data into LangChain."""
from pathlib import Path
from langchain.text_splitter import CharacterTextSplitter
import faiss
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
import pickle
import time
from tqdm import tqdm


# Here we load in the data in the format that Notion exports it in.
folder = list(Path("Notion_DB/").glob("**/*.md"))

files = []
sources = []
for myFile in folder:
    with open(myFile, 'r', encoding='utf-8') as f:
        print(myFile.name)
        files.append(f.read())
    sources.append(myFile)

# Here we split the documents, as needed, into smaller chunks.
# We do this due to the context limits of the LLMs.
text_splitter = CharacterTextSplitter(chunk_size=800, separator="\n")
docs = []
metadatas = []
for i, f in enumerate(files):
    splits = text_splitter.split_text(f)
    docs.extend(splits)
    metadatas.extend([{"source": sources[i]}] * len(splits))


# Add each chunk to the FAISS store, pausing between inserts so we don't exceed the rate limit.
store = None
for i, chunk in tqdm(enumerate(docs)):
    if i == 0:
        store = FAISS.from_texts([chunk], OpenAIEmbeddings())
    else:
        time.sleep(1)  # wait a second so we don't exceed any rate limits
        store.add_texts([chunk])

print('Done yayy!')

# Here we save the FAISS index and the (index-stripped) store to disk.
faiss.write_index(store.index, "docs.index")
store.index = None
with open("faiss_store.pkl", "wb") as f:
    pickle.dump(store, f)
ahmadsb86 changed the title from “Question Answering over Docs giving error upon query” to “Question Answering over Docs giving cryptic error upon query” on Apr 15, 2023
cnhhoang850 (Contributor) commented

Based on the code you provided, it seems that you are correctly adding the 'source' key in the metadata of the documents. You have created metadatas with a list of dictionaries containing the 'source' key:

metadatas.extend([{"source": sources[i]}] * len(splits))

However, you never pass metadatas when building the FAISS store, so the chunks are embedded and stored without their 'source' metadata. If you need the metadata to travel with the documents, it has to go through whatever adds the texts to the store; as written, this call drops it:

store = FAISS.from_texts([chunk], OpenAIEmbeddings())

The ingestion code only embeds the data into a FAISS index and saves it to disk. The KeyError then surfaces later in the pipeline, when the map-reduce documents chain inside VectorDBQAWithSourcesChain tries to read each document's 'source'. Make sure that the Document objects you create carry the metadata from metadatas, and that those Documents are what reaches the chain (or any other part of the pipeline that requires the metadata).
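
For concreteness, a minimal sketch of that fix applied to the ingestion loop, reusing docs, metadatas, FAISS, OpenAIEmbeddings, and time from the script above, and assuming the optional metadatas parameter that FAISS.from_texts and add_texts accept (one dict per text). The Path objects are also converted to plain strings so the pickled metadata stays simple:

# One metadata dict per chunk, with each Path flattened to a plain string.
clean_metadatas = [{"source": str(m["source"])} for m in metadatas]

store = None
for i, chunk in enumerate(docs):
    if store is None:
        store = FAISS.from_texts([chunk], OpenAIEmbeddings(), metadatas=[clean_metadatas[i]])
    else:
        time.sleep(1)  # stay under the embedding rate limit
        store.add_texts([chunk], metadatas=[clean_metadatas[i]])

With a 'source' stored alongside every chunk, the dictcomp in stuff.py finds the key it needs and the chain can return sources.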

hwchase17 pushed a commit that referenced this issue on Apr 18, 2023:
    @cnhhoang850 slightly more generic fix for #2944, works for whatever the expected metadata keys are, not just `source`

samching pushed a commit to samching/langchain that referenced this issue on May 1, 2023:
    @cnhhoang850 slightly more generic fix for langchain-ai#2944, works for whatever the expected metadata keys are, not just `source`
dosubot bot commented Aug 29, 2023

Hi, @ahmadsb86. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you encountered a KeyError for the 'source' key when running the qa.py script after modifying the question-answering over docs example. User cnhhoang850 suggested that the issue might be related to not using the metadatas while creating the FAISS store and advised modifying the relevant data structure to include the metadata information. The KeyError might be occurring later in the pipeline when using the data along with the provided MapReduceDocumentsChain.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. Let us know if you have any further questions or concerns.

dosubot bot added the “stale” label (issue has not had recent activity or appears to be solved; stale issues are closed automatically) on Aug 29, 2023
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 10, 2023
dosubot bot removed the “stale” label on Sep 10, 2023