Can not overwrite docs in ElasticVectorSearch as Pinecone do #2484
Comments
Please provide me with your full code for reproducing errors, including the code for inserting data into Elastic Search. Did this error occur after a library upgrade or is it an entirely new code? |
I hope I fixed your issue at #2445. Please let me know if there's anything else I can do to help |
This code runs OK the first time, when there is no Elastic index "test": texts = ['aaa', 'bbb', 'ccc', 'ddd', 'eee'] docsearch = ElasticVectorSearch.from_texts(texts, embedding=embedding, ids=ids, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test") docs = docsearch.similarity_search(query, k=10) |
As I mentioned earlier, I hope it has been fixed. However, please be patient as it still needs to be merged. |
Thanks a lot! |
…ests (#2445) Using `pytest-vcr` in integration tests has several benefits. Firstly, it removes the need to mock external services, as VCR records and replays HTTP interactions on the fly. Secondly, it simplifies the integration test setup by eliminating the need to set up and tear down external services in some cases. Finally, it allows for more reliable and deterministic integration tests by ensuring that HTTP interactions are always replayed with the same response. Overall, `pytest-vcr` is a valuable tool for simplifying integration test setup and improving test reliability. This commit adds the `pytest-vcr` package as a dependency for integration tests in the `pyproject.toml` file. It also introduces two new fixtures in the `tests/integration_tests/conftest.py` file for managing cassette directories and VCR configurations. In addition, the `tests/integration_tests/vectorstores/test_elasticsearch.py` file has been updated to use the `@pytest.mark.vcr` decorator for recording and replaying HTTP interactions. Finally, this commit removes the `documents` fixture from the `test_elasticsearch.py` file and replaces it with a new fixture defined in `tests/integration_tests/vectorstores/conftest.py` that yields a list of documents for use in other tests. This also includes my second attempt to fix issue #2386. Maybe related: #2484
Please update the library and test it again. If the error is not fixed, I will write a test and fix it. |
Thanks a lot. The previous problem is partially solved in ver 0.0.135. But the overwrite behavior is different than Pinecone. If I pass the same id into the elasticsearch, the document of the same id will not refresh, instead, new documents will be created. The document number can be checked with |
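For reference, Elasticsearch itself replaces a document when another one is indexed under the same `_id`, so Pinecone-style overwriting comes down to passing the caller's ids through as `_id` in the bulk request. A minimal sketch under stated assumptions, not the library's implementation: `build_bulk_actions` is a hypothetical helper, and the `text` and `vector` field names are placeholders.

```python
import uuid
from typing import Any, Dict, List, Optional


def build_bulk_actions(
    index_name: str,
    texts: List[str],
    vectors: List[List[float]],
    ids: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Build action dicts for elasticsearch.helpers.bulk (hypothetical helper).

    With "_op_type": "index", an action whose "_id" already exists in the
    index replaces the stored document instead of creating a new one.
    """
    if ids is None:
        # No ids supplied: fall back to random ids, so nothing is overwritten.
        ids = [str(uuid.uuid4()) for _ in texts]
    if not (len(ids) == len(texts) == len(vectors)):
        raise ValueError("texts, vectors and ids must have the same length")
    return [
        {
            "_op_type": "index",
            "_index": index_name,
            "_id": doc_id,
            "text": text,      # assumed field name
            "vector": vector,  # assumed field name
        }
        for doc_id, text, vector in zip(ids, texts, vectors)
    ]
```

The resulting list could then be handed to `elasticsearch.helpers.bulk(client, actions)`. Running it twice with the same ids should leave the document count unchanged.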
Thank you for your detailed feedback. Where you read about ElasticVectorSearch.from_texts, are you able to insert text IDs as a parameter? The source code indicates that the code doesn't take the 'ids' parameter into consideration. |
Yes, I can take in the param under current ver under this code with no running problem: But I don't know if Elasticsearch actually used this param. I hope it can use the user defined id if param |
You may not see any errors because the function accepted any parameter due to the For example, you may not notice any errors if you create something like it: docsearch = ElasticVectorSearch.from_texts(make_me_a_coffee=True) Let me ask again, did you read about that API in the documentation or somewhere else? I'm asking because at the moment, I'm not sure if we have a bug. For example, we might have such code in the documentation, or we might simply not have that functionality. P.S. I apologise for any inconvenience this may cause. |
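To illustrate the point, here is a small sketch assuming the permissive signature is a catch-all `**kwargs` (an assumption; the comment does not name it). Both functions are hypothetical stand-ins, not the real `ElasticVectorSearch.from_texts`.

```python
from typing import Any, List, Optional


def from_texts_permissive(texts: Optional[List[str]] = None, **kwargs: Any) -> List[str]:
    """Hypothetical stand-in: unknown keyword arguments are accepted and ignored.

    Returns the names of the ignored arguments so the effect is visible.
    """
    return sorted(kwargs)


def from_texts_strict(texts: Optional[List[str]] = None, **kwargs: Any) -> None:
    """Hypothetical stricter variant: rejects keyword arguments it does not use."""
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")


# No error is raised, even though neither argument has any effect.
ignored = from_texts_permissive(make_me_a_coffee=True, ids=["1", "2"])
print(ignored)
```

With the permissive version, a caller passing `ids` gets no feedback that the argument was dropped; the strict version would fail immediately.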
@sergerdn Sorry, I didn't read the documentation carefully. Previously, when I used the Pinecone vectorstore, I specified the ids param. I hoped that I could use it the same way here, but it didn't work. I don't think there is a bug in it now. I hope that in the future the documents can be refreshed if the ids are provided, just like Pinecone. :) |
Sure, that's no problem. I believe we'll need the same functionality as Pinecone. Would you mind sharing a direct link to the documentation? I'm not very familiar with the project and having the link would save me some time searching for it myself. |
Sure, the link is: Thanks a lot! 👍 |
Also I suggest another approach for you. It will clarify what you are doing and give you access to some other APIs, such as metadata for your documents. Take a look at the tests with some examples: https://github.com/hwchase17/langchain/blob/1931d4495ec67443b6b4b523e1ec790e61a7fb58/tests/integration_tests/vectorstores/test_elasticsearch.py#L124 I will look at how Pinecone manages indexes and then apply the same approach with Elastic. I think that it is a metadata property of the document. NOT tested:

# Import paths as used in the linked integration test (langchain 0.0.x).
import os
from typing import List

from elasticsearch import Elasticsearch
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch


def example():
    def documents() -> List[Document]:
        # Load the fixture file and split it into chunks of up to 1000 characters.
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
        documents = TextLoader(
            os.path.join(os.path.dirname(__file__), "fixtures", "sharks.txt")
        ).load()
        return text_splitter.split_documents(documents)

    def add_documents(openai_api_key: str, elasticsearch_url: str) -> None:
        index_name = "custom_index_blaa"
        embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
        elastic_vector_search = ElasticVectorSearch(
            embedding=embedding,
            elasticsearch_url=elasticsearch_url,
            index_name=index_name,
        )
        # Raw client, unused below; handy for inspecting the index directly.
        es = Elasticsearch(hosts=elasticsearch_url)
        elastic_vector_search.add_documents(documents())
        search_result = elastic_vector_search.similarity_search("sharks")
        print(search_result)

    add_documents(openai_api_key="bla", elasticsearch_url="http://bla-bla")


if __name__ == "__main__":
    example()
|
Improve the integration tests for Pinecone by adding an `.env.example` file for local testing. Additionally, add some dev dependencies specifically for integration tests. This change also helps me understand how Pinecone deals with certain things, see related issues langchain-ai#2484 langchain-ai#2816
Hi, @firezym. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale. Based on my understanding of the current state of the issue, you encountered an error when trying to update or add documents to an existing index in ElasticVectorSearch. User sergerdn provided a potential fix in a pull request, but it is still awaiting merge. You mentioned that the problem is partially solved in version 0.0.135, but the overwrite behavior is different than Pinecone. There was also a discussion about the possibility of adding functionality to refresh documents if IDs are provided, similar to Pinecone. Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you for your understanding and contribution to the LangChain project. Let us know if you have any further questions or concerns. |
If the index already exists, or already contains any docs, I cannot update the index or add more docs to it. For example:
docsearch = ElasticVectorSearch.from_texts(texts=texts[0:10], ids=ids[0:10], embedding=embedding, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")
I get an error: BadRequestError: BadRequestError(400, 'resource_already_exists_exception', 'index [test/v_Ahq4NSS2aWm2_gLNUtpQ] already exists')
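The 400 above is what Elasticsearch returns when an index is created a second time under the same name. One common way to avoid it is to create the index only when it is missing. The sketch below is an illustration under stated assumptions, not the fix that was merged: `ensure_index` is a hypothetical helper, `client` is assumed to be an `elasticsearch.Elasticsearch` instance, and the `mappings=` keyword assumes a recent client version (older clients take the mapping inside `body=`).

```python
from typing import Any, Dict


def ensure_index(client: Any, index_name: str, mappings: Dict[str, Any]) -> bool:
    """Create the index only if it does not exist yet (hypothetical helper).

    Returns True when the index was created and False when it was already there.
    """
    if client.indices.exists(index=index_name):
        # Index is already present: skip creation so no
        # resource_already_exists_exception is raised.
        return False
    client.indices.create(index=index_name, mappings=mappings)
    return True
```

After this check, documents can be added to the existing index rather than triggering a second create call.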