Can not overwrite docs in ElasticVectorSearch as Pinecone do #2484

firezym · 2023-04-06T10:24:28Z

If index already exists or any doc inside it, I can not update the index or add more docs to it. for example:
docsearch = ElasticVectorSearch.from_texts(texts=texts[0:10], ids=ids[0:10], embedding=embedding, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")

Get an error: BadRequestError: BadRequestError(400, 'resource_already_exists_exception', 'index [test/v_Ahq4NSS2aWm2_gLNUtpQ] already exists')

sergerdn · 2023-04-06T15:53:52Z

@firezym

Please provide me with your full code for reproducing errors, including the code for inserting data into Elastic Search. Did this error occur after a library upgrade or is it an entirely new code?

sergerdn · 2023-04-06T19:36:12Z

I hope I fixed your issue at #2445.

Please let me know if there's anything else I can do to help

firezym · 2023-04-07T01:51:19Z

This code runs ok at the first time when there is no elastic index "test":
`
from langchain import ElasticVectorSearch
from langchain.embeddings import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

texts = ['aaa', bbb', ccc', 'ddd','eee']
ids = [0,1,2,3,4]

docsearch = ElasticVectorSearch.from_texts(texts, embedding=embedding, ids=ids, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")

docs = docsearch.similarity_search(query, k=10)
docs
`
But if I run for the 2nd time (overwrite the docs of the same id), It gets error:
Get an error: BadRequestError: BadRequestError(400, 'resource_already_exists_exception', 'index [test/v_Ahq4NSS2aWm2_gLNUtpQ] already exists')

sergerdn · 2023-04-07T08:17:14Z

@firezym

As I mentioned earlier, I hope it has been fixed. However, please be patient as it still needs to be merged.

firezym · 2023-04-07T08:28:40Z

Thanks a lot!

…ests (#2445) Using `pytest-vcr` in integration tests has several benefits. Firstly, it removes the need to mock external services, as VCR records and replays HTTP interactions on the fly. Secondly, it simplifies the integration test setup by eliminating the need to set up and tear down external services in some cases. Finally, it allows for more reliable and deterministic integration tests by ensuring that HTTP interactions are always replayed with the same response. Overall, `pytest-vcr` is a valuable tool for simplifying integration test setup and improving their reliability This commit adds the `pytest-vcr` package as a dependency for integration tests in the `pyproject.toml` file. It also introduces two new fixtures in `tests/integration_tests/conftest.py` files for managing cassette directories and VCR configurations. In addition, the `tests/integration_tests/vectorstores/test_elasticsearch.py` file has been updated to use the `@pytest.mark.vcr` decorator for recording and replaying HTTP interactions. Finally, this commit removes the `documents` fixture from the `test_elasticsearch.py` file and replaces it with a new fixture defined in `tests/integration_tests/vectorstores/conftest.py` that yields a list of documents to use in any other tests. This also includes my second attempt to fix issue : #2386 Maybe related #2484

sergerdn · 2023-04-08T06:41:17Z

Please update the library and test it again. If the error is not fixed, I will write a test and fix it.

firezym · 2023-04-10T14:35:56Z

Please update the library and test it again. If the error is not fixed, I will write a test and fix it.
@sergerdn

Thanks a lot. The previous problem is partially solved in ver 0.0.135. But the overwrite behavior is different than Pinecone.

If I pass the same id into the elasticsearch, the document of the same id will not refresh, instead, new documents will be created.
docsearch = ElasticVectorSearch.from_texts(texts=texts[0:10], ids=ids[0:10], embedding=embedding, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")

The document number can be checked with es.count(index="test"). And also search results will double.

sergerdn · 2023-04-10T15:03:38Z

@firezym

Thank you for your detailed feedback.

Where you read about ElasticVectorSearch.from_texts, are you able to insert text IDs as a parameter? The source code indicates that the code doesn't take the 'ids' parameter into consideration.

https://github.com/hwchase17/langchain/blob/e63f9a846be7a85de7d3e3a1b277a4521b42808d/langchain/vectorstores/elastic_vector_search.py#L172

firezym · 2023-04-10T15:10:40Z

@sergerdn

Yes, I can take in the param under current ver under this code with no running problem:
docsearch = ElasticVectorSearch.from_texts(texts, embedding=embedding, ids=ids, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")
I upgrade langchain to ver 0.0.135 using pip before test the code.

But I don't know if Elasticsearch actually used this param. I hope it can use the user defined id if param ids are provided

sergerdn · 2023-04-10T15:19:58Z

@firezym

You may not see any errors because the function accepted any parameter due to the kwargs argument, but without any knowledge of what you are doing, you may encounter unexpected results.

For example, you may not notice any errors if you create something like it:

docsearch = ElasticVectorSearch.from_texts(make_me_a_coffee=True)

Let me ask again, did you read about that API in the documentation or somewhere else? I'm asking because at the moment, I'm not sure if we have a bug. For example, we might have such code in the documentation, or we might simply not have that functionality.

P.S.
When my latest pull request is merged, I will enhance the tests and be able to address some new bugs.
Currently, I'm approaching the elastic fixing process in small increments. This approach is beneficial because I can implement bug fixes with minimal changes, making it easier for the project owner to understand what I'm doing.
However, it can also be frustrating because I need to wait for the changes to be merged before moving forward. As a result, I'm working on elastic bug fixes and rejections at a slower pace.

I apologise for any inconvenience this may cause.

firezym · 2023-04-10T15:26:49Z

@sergerdn
Thanks, now I understand why there is no error due to kwargs argument.

Sorry, I didn't read the document carefully. Previously when I use Pinecone vectorstore, the ids param is specified. I hoped that I could use it the same way. But it didn't. I don't think there is a bug in it now.

I hope in the future, the documents can be refreshed if the ids are provided just like Pinecone. :)

sergerdn · 2023-04-10T15:30:25Z

@firezym

Sure, that's no problem. I believe we'll need the same functionality as Pinecone. Would you mind sharing a direct link to the documentation? I'm not very familiar with the project and having the link would save me some time searching for it myself.

firezym · 2023-04-10T15:35:15Z

@sergerdn

Sure, the link is:
https://python.langchain.com/en/latest/reference/modules/vectorstore.html?highlight=pinecone#langchain.vectorstores.Pinecone

Thanks a lot! 👍

sergerdn · 2023-04-10T15:47:31Z

Also I suggest another approach for you. It will clarify what you are doing and provide you with some another APIs, such as metadata for your documents.

Take a look at the tests with some examples: https://github.com/hwchase17/langchain/blob/1931d4495ec67443b6b4b523e1ec790e61a7fb58/tests/integration_tests/vectorstores/test_elasticsearch.py#L124

I will observe how Pinecoin manages indexes and then apply the same approach with Elastic. I think that it is a metadata property of the document.

NOT tested:

def example():
    def documents() -> List[Document]:
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

        documents = TextLoader(
            os.path.join(os.path.dirname(__file__), "fixtures", "sharks.txt")
        ).load()
        return text_splitter.split_documents(documents)

    def add_documents(openai_api_key: str, elasticsearch_url: str ) -> None:
        index_name = f"custom_index_blaa"
        embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
        elastic_vector_search = ElasticVectorSearch(
            embedding=embedding,
            elasticsearch_url=elasticsearch_url,
            index_name=index_name,
        )
        es = Elasticsearch(hosts=elasticsearch_url)
        elastic_vector_search.add_documents(documents())

        search_result = elastic_vector_search.similarity_search("sharks")
        print(search_result)

    add_documents(openai_api_key="bla", elasticsearch_url="http://bla-bla")


if __name__ == "__main__":
    example()

Improve the integration tests for Pinecone by adding an `.env.example` file for local testing. Additionally, add some dev dependencies specifically for integration tests. This change also helps me understand how Pinecone deals with certain things, see related issues #2484 #2816

Improve the integration tests for Pinecone by adding an `.env.example` file for local testing. Additionally, add some dev dependencies specifically for integration tests. This change also helps me understand how Pinecone deals with certain things, see related issues langchain-ai#2484 langchain-ai#2816

dosubot · 2023-09-21T16:02:07Z

Hi, @firezym. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding of the current state of the issue, you encountered an error when trying to update or add documents to an existing index in ElasticVectorSearch. User sergerdn provided a potential fix in a pull request, but it is still awaiting merge. You mentioned that the problem is partially solved in version 0.0.135, but the overwrite behavior is different than Pinecone. There was also a discussion about the possibility of adding functionality to refresh documents if IDs are provided, similar to Pinecone.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. Let us know if you have any further questions or concerns.

sergerdn mentioned this issue Apr 6, 2023

feat: add pytest-vcr for recording HTTP interactions in integration tests #2445

Merged

This was referenced Apr 13, 2023

feat: improve pinecone tests #2806

Merged

Complete testing for Vector Stores #2816

Closed

dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 21, 2023

dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 28, 2023

dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can not overwrite docs in ElasticVectorSearch as Pinecone do #2484

Can not overwrite docs in ElasticVectorSearch as Pinecone do #2484

firezym commented Apr 6, 2023

sergerdn commented Apr 6, 2023 •

edited

sergerdn commented Apr 6, 2023 •

edited

firezym commented Apr 7, 2023 •

edited

sergerdn commented Apr 7, 2023

firezym commented Apr 7, 2023

sergerdn commented Apr 8, 2023

firezym commented Apr 10, 2023

sergerdn commented Apr 10, 2023 •

edited

firezym commented Apr 10, 2023 •

edited

sergerdn commented Apr 10, 2023 •

edited

firezym commented Apr 10, 2023

sergerdn commented Apr 10, 2023

firezym commented Apr 10, 2023 •

edited

sergerdn commented Apr 10, 2023 •

edited

dosubot bot commented Sep 21, 2023

Can not overwrite docs in ElasticVectorSearch as Pinecone do #2484

Can not overwrite docs in ElasticVectorSearch as Pinecone do #2484

Comments

firezym commented Apr 6, 2023

sergerdn commented Apr 6, 2023 • edited

sergerdn commented Apr 6, 2023 • edited

firezym commented Apr 7, 2023 • edited

sergerdn commented Apr 7, 2023

firezym commented Apr 7, 2023

sergerdn commented Apr 8, 2023

firezym commented Apr 10, 2023

sergerdn commented Apr 10, 2023 • edited

firezym commented Apr 10, 2023 • edited

sergerdn commented Apr 10, 2023 • edited

firezym commented Apr 10, 2023

sergerdn commented Apr 10, 2023

firezym commented Apr 10, 2023 • edited

sergerdn commented Apr 10, 2023 • edited

dosubot bot commented Sep 21, 2023

sergerdn commented Apr 6, 2023 •

edited

sergerdn commented Apr 6, 2023 •

edited

firezym commented Apr 7, 2023 •

edited

sergerdn commented Apr 10, 2023 •

edited

firezym commented Apr 10, 2023 •

edited

sergerdn commented Apr 10, 2023 •

edited

firezym commented Apr 10, 2023 •

edited

sergerdn commented Apr 10, 2023 •

edited