Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can not overwrite docs in ElasticVectorSearch as Pinecone do #2484

Closed
firezym opened this issue Apr 6, 2023 · 15 comments
Closed

Can not overwrite docs in ElasticVectorSearch as Pinecone do #2484

firezym opened this issue Apr 6, 2023 · 15 comments

Comments

@firezym
Copy link

firezym commented Apr 6, 2023

If index already exists or any doc inside it, I can not update the index or add more docs to it. for example:
docsearch = ElasticVectorSearch.from_texts(texts=texts[0:10], ids=ids[0:10], embedding=embedding, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")

Get an error: BadRequestError: BadRequestError(400, 'resource_already_exists_exception', 'index [test/v_Ahq4NSS2aWm2_gLNUtpQ] already exists')

@sergerdn
Copy link
Contributor

sergerdn commented Apr 6, 2023

@firezym

Please provide me with your full code for reproducing errors, including the code for inserting data into Elastic Search. Did this error occur after a library upgrade or is it an entirely new code?

@sergerdn
Copy link
Contributor

sergerdn commented Apr 6, 2023

I hope I fixed your issue at #2445.

Please let me know if there's anything else I can do to help

@firezym
Copy link
Author

firezym commented Apr 7, 2023

This code runs ok at the first time when there is no elastic index "test":
`
from langchain import ElasticVectorSearch
from langchain.embeddings import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

texts = ['aaa', bbb', ccc', 'ddd','eee']
ids = [0,1,2,3,4]

docsearch = ElasticVectorSearch.from_texts(texts, embedding=embedding, ids=ids, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")

docs = docsearch.similarity_search(query, k=10)
docs
`
But if I run for the 2nd time (overwrite the docs of the same id), It gets error:
Get an error: BadRequestError: BadRequestError(400, 'resource_already_exists_exception', 'index [test/v_Ahq4NSS2aWm2_gLNUtpQ] already exists')

@sergerdn
Copy link
Contributor

sergerdn commented Apr 7, 2023

@firezym

As I mentioned earlier, I hope it has been fixed. However, please be patient as it still needs to be merged.

@firezym
Copy link
Author

firezym commented Apr 7, 2023

Thanks a lot!

hwchase17 pushed a commit that referenced this issue Apr 7, 2023
…ests (#2445)

Using `pytest-vcr` in integration tests has several benefits. Firstly,
it removes the need to mock external services, as VCR records and
replays HTTP interactions on the fly. Secondly, it simplifies the
integration test setup by eliminating the need to set up and tear down
external services in some cases. Finally, it allows for more reliable
and deterministic integration tests by ensuring that HTTP interactions
are always replayed with the same response.
Overall, `pytest-vcr` is a valuable tool for simplifying integration
test setup and improving their reliability

This commit adds the `pytest-vcr` package as a dependency for
integration tests in the `pyproject.toml` file. It also introduces two
new fixtures in `tests/integration_tests/conftest.py` files for managing
cassette directories and VCR configurations.

In addition, the
`tests/integration_tests/vectorstores/test_elasticsearch.py` file has
been updated to use the `@pytest.mark.vcr` decorator for recording and
replaying HTTP interactions.

Finally, this commit removes the `documents` fixture from the
`test_elasticsearch.py` file and replaces it with a new fixture defined
in `tests/integration_tests/vectorstores/conftest.py` that yields a list
of documents to use in any other tests.

This also includes my second attempt to fix issue :
#2386

Maybe related #2484
@sergerdn
Copy link
Contributor

sergerdn commented Apr 8, 2023

Please update the library and test it again. If the error is not fixed, I will write a test and fix it.

@firezym
Copy link
Author

firezym commented Apr 10, 2023

Please update the library and test it again. If the error is not fixed, I will write a test and fix it.
@sergerdn

Thanks a lot. The previous problem is partially solved in ver 0.0.135. But the overwrite behavior is different than Pinecone.

If I pass the same id into the elasticsearch, the document of the same id will not refresh, instead, new documents will be created.
docsearch = ElasticVectorSearch.from_texts(texts=texts[0:10], ids=ids[0:10], embedding=embedding, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")

The document number can be checked with es.count(index="test"). And also search results will double.

@sergerdn
Copy link
Contributor

sergerdn commented Apr 10, 2023

@firezym

Thank you for your detailed feedback.

Where you read about ElasticVectorSearch.from_texts, are you able to insert text IDs as a parameter? The source code indicates that the code doesn't take the 'ids' parameter into consideration.

https://github.com/hwchase17/langchain/blob/e63f9a846be7a85de7d3e3a1b277a4521b42808d/langchain/vectorstores/elastic_vector_search.py#L172

@firezym
Copy link
Author

firezym commented Apr 10, 2023

@sergerdn

Yes, I can take in the param under current ver under this code with no running problem:
docsearch = ElasticVectorSearch.from_texts(texts, embedding=embedding, ids=ids, elasticsearch_url=f"http://elastic:{ELASTIC_PASSWORD}@localhost:9200", index_name="test")
I upgrade langchain to ver 0.0.135 using pip before test the code.

But I don't know if Elasticsearch actually used this param. I hope it can use the user defined id if param ids are provided

@sergerdn
Copy link
Contributor

sergerdn commented Apr 10, 2023

@firezym

You may not see any errors because the function accepted any parameter due to the kwargs argument, but without any knowledge of what you are doing, you may encounter unexpected results.

For example, you may not notice any errors if you create something like it:

docsearch = ElasticVectorSearch.from_texts(make_me_a_coffee=True)

Let me ask again, did you read about that API in the documentation or somewhere else? I'm asking because at the moment, I'm not sure if we have a bug. For example, we might have such code in the documentation, or we might simply not have that functionality.

P.S.
When my latest pull request is merged, I will enhance the tests and be able to address some new bugs.
Currently, I'm approaching the elastic fixing process in small increments. This approach is beneficial because I can implement bug fixes with minimal changes, making it easier for the project owner to understand what I'm doing.
However, it can also be frustrating because I need to wait for the changes to be merged before moving forward. As a result, I'm working on elastic bug fixes and rejections at a slower pace.

I apologise for any inconvenience this may cause.

@firezym
Copy link
Author

firezym commented Apr 10, 2023

@sergerdn
Thanks, now I understand why there is no error due to kwargs argument.

Sorry, I didn't read the document carefully. Previously when I use Pinecone vectorstore, the ids param is specified. I hoped that I could use it the same way. But it didn't. I don't think there is a bug in it now.

I hope in the future, the documents can be refreshed if the ids are provided just like Pinecone. :)

@sergerdn
Copy link
Contributor

@firezym

Sure, that's no problem. I believe we'll need the same functionality as Pinecone. Would you mind sharing a direct link to the documentation? I'm not very familiar with the project and having the link would save me some time searching for it myself.

@firezym
Copy link
Author

firezym commented Apr 10, 2023

@sergerdn

Sure, the link is:
https://python.langchain.com/en/latest/reference/modules/vectorstore.html?highlight=pinecone#langchain.vectorstores.Pinecone

Thanks a lot! 👍

@sergerdn
Copy link
Contributor

sergerdn commented Apr 10, 2023

Also I suggest another approach for you. It will clarify what you are doing and provide you with some another APIs, such as metadata for your documents.

Take a look at the tests with some examples: https://github.com/hwchase17/langchain/blob/1931d4495ec67443b6b4b523e1ec790e61a7fb58/tests/integration_tests/vectorstores/test_elasticsearch.py#L124

I will observe how Pinecoin manages indexes and then apply the same approach with Elastic. I think that it is a metadata property of the document.

NOT tested:

def example():
    def documents() -> List[Document]:
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

        documents = TextLoader(
            os.path.join(os.path.dirname(__file__), "fixtures", "sharks.txt")
        ).load()
        return text_splitter.split_documents(documents)

    def add_documents(openai_api_key: str, elasticsearch_url: str ) -> None:
        index_name = f"custom_index_blaa"
        embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
        elastic_vector_search = ElasticVectorSearch(
            embedding=embedding,
            elasticsearch_url=elasticsearch_url,
            index_name=index_name,
        )
        es = Elasticsearch(hosts=elasticsearch_url)
        elastic_vector_search.add_documents(documents())

        search_result = elastic_vector_search.similarity_search("sharks")
        print(search_result)

    add_documents(openai_api_key="bla", elasticsearch_url="http://bla-bla")


if __name__ == "__main__":
    example()

hwchase17 pushed a commit that referenced this issue Apr 14, 2023
Improve the integration tests for Pinecone by adding an `.env.example`
file for local testing. Additionally, add some dev dependencies
specifically for integration tests.

This change also helps me understand how Pinecone deals with certain
things, see related issues
#2484
#2816
samching pushed a commit to samching/langchain that referenced this issue May 1, 2023
Improve the integration tests for Pinecone by adding an `.env.example`
file for local testing. Additionally, add some dev dependencies
specifically for integration tests.

This change also helps me understand how Pinecone deals with certain
things, see related issues
langchain-ai#2484
langchain-ai#2816
@dosubot
Copy link

dosubot bot commented Sep 21, 2023

Hi, @firezym. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding of the current state of the issue, you encountered an error when trying to update or add documents to an existing index in ElasticVectorSearch. User sergerdn provided a potential fix in a pull request, but it is still awaiting merge. You mentioned that the problem is partially solved in version 0.0.135, but the overwrite behavior is different than Pinecone. There was also a discussion about the possibility of adding functionality to refresh documents if IDs are provided, similar to Pinecone.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. Let us know if you have any further questions or concerns.

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 21, 2023
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 28, 2023
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants