
vectorstores error: "search_phase_execution_exception" after using elastic search #2386

Closed
longgui0318 opened this issue Apr 4, 2023 · 21 comments · Fixed by #2402

Comments

@longgui0318
Contributor

Hi

I'm using Elasticsearch as a vector store with just a simple call, but it raises an error. I called add_documents beforehand and that worked, but calling similarity_search fails. Thanks for checking.

Related Environment

  • docker >> image elasticsearch:7.17.0
  • python >> elasticsearch==7.17.0

Test code

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import ElasticVectorSearch

if __name__ == "__main__":
    embeddings = OpenAIEmbeddings()
    elastic_vector_search = ElasticVectorSearch(
        elasticsearch_url="http://192.168.1.2:9200",
        index_name="test20222",
        embedding=embeddings
    )
    searchResult = elastic_vector_search.similarity_search("What are the characteristics of sharks")

Error

(.venv) apple@xMacBook-Pro ai-chain % python test.py
Traceback (most recent call last):
  File "/Users/apple/work/x/ai-chain/test.py", line 14, in <module>
    result = elastic_vector_search.client.search(index="test20222",query={
  File "/Users/apple/work/x/ai-chain/.venv/lib/python3.9/site-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
    return api(*args, **kwargs)
  File "/Users/apple/work/x/ai-chain/.venv/lib/python3.9/site-packages/elasticsearch/_sync/client/__init__.py", line 3798, in search
    return self.perform_request(  # type: ignore[return-value]
  File "/Users/apple/work/x/ai-chain/.venv/lib/python3.9/site-packages/elasticsearch/_sync/client/_base.py", line 320, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'runtime error')
@sergerdn
Contributor

sergerdn commented Apr 4, 2023

@longgui0318

Can you confirm that the index has that name? Have you checked Kibana for information on it?
I am asking you because we have some buggy code. When you inserted docs into Elastic, your index name was ignored and a new index was created instead of the one you provided.
Also, we lack a test for it.

I believe the fix should be easy, and the test coverage should also be improved. In my opinion, this slipped through because the GitHub workflow currently runs only unit tests, not functional tests.

    elastic_search.from_documents(
        documents=get_documents(),
        embedding=embedding,
        index_name="my_cool_name",  # the index name did not work as expected, so a new random name was created.
    )

https://github.com/hwchase17/langchain/blob/fe1eb8ca5f57fcd7c566adfc01fa1266349b72f3/langchain/vectorstores/elastic_vector_search.py#L244

[screenshot attached]
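A minimal, self-contained sketch of the suspected bug (the function names here are mine for illustration, not langchain's actual API):

```python
import uuid

def choose_index_name_buggy(**kwargs) -> str:
    # Suspected buggy behavior: a random index name is always generated,
    # silently ignoring any index_name the caller passed.
    return uuid.uuid4().hex

def choose_index_name_fixed(**kwargs) -> str:
    # Expected behavior: honor the caller's index_name and fall back to
    # a generated name only when none was provided.
    return kwargs.get("index_name") or uuid.uuid4().hex

print(choose_index_name_fixed(index_name="my_cool_name"))  # my_cool_name
print(len(choose_index_name_buggy(index_name="my_cool_name")))  # 32
```

With the buggy variant, documents land in a randomly named index while later searches go to the name you asked for.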

@longgui0318
Contributor Author

This is how I saved the documents, so yes, I made sure the index_name was the same and checked it in Kibana:

    elastic_vector_search = ElasticVectorSearch(
        elasticsearch_url="http://192.168.1.2:9200",
        index_name="test20222",
        embedding=embeddings
    )
    elastic_vector_search.add_documents(docs)

@sergerdn
Contributor

sergerdn commented Apr 4, 2023

@longgui0318

Please remove all indexes from ElasticSearch and then run your script to recreate the index with the newly created documents. Once the process is completed, kindly confirm that the index has been created with the required name.

Additionally, I request that you provide a screenshot from Kibana, if possible, that shows all of your indexes.

P.S. Right now, I am working on improving some tests with ElasticVectorSearch to make sure that everything is going as expected.

@longgui0318
Contributor Author

Thank you for your attention. Here is my Kibana information:

[screenshot: Kibana index information]

@longgui0318
Contributor Author

@sergerdn all of my indexes
[screenshot: list of all indexes]

@longgui0318
Contributor Author

I've narrowed down the problem. When the index is built with from_documents, the resulting object is accessible as normal. But if we construct ElasticVectorSearch directly and then add content via add_documents, similarity_search raises an error.

@longgui0318
Contributor Author

The difference is in this line of code:

client.indices.create(index=index_name, mappings=mapping)
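A possible workaround while the fix lands: create the index with an explicit mapping yourself before calling add_documents. This is a sketch under the assumption that the mapping below matches what the from_texts path creates; the helper name is mine, not langchain's.

```python
def dense_vector_mapping(dims: int = 1536) -> dict:
    # Mapping with the same shape that the from_texts path passes to
    # client.indices.create(index=index_name, mappings=mapping);
    # 1536 is the dimensionality of OpenAI embeddings.
    return {
        "properties": {
            "text": {"type": "text"},
            "vector": {"type": "dense_vector", "dims": dims},
        }
    }

# Untested sketch of the workaround itself:
#   from elasticsearch import Elasticsearch
#   client = Elasticsearch("http://192.168.1.2:9200")
#   client.indices.create(index="test20222", mappings=dense_vector_mapping())
#   ...then construct ElasticVectorSearch and call add_documents as before.
```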

@sergerdn
Contributor

sergerdn commented Apr 4, 2023

@longgui0318

To ensure we fully understand the problem, could you please provide code snippets that reproduce the issue? From the description provided, it seems like a familiar bug to me.

Additionally, please share a code snippet that demonstrates that everything is functioning as expected. This will help confirm that you are only seeing the expected index and not an arbitrary one.

@longgui0318
Contributor Author

@sergerdn

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import UnstructuredWordDocumentLoader
from langchain.vectorstores import ElasticVectorSearch

if __name__ == "__main__":

    loader = UnstructuredWordDocumentLoader("test.docx", mode="elements")
    data = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(data)
    embeddings = OpenAIEmbeddings()
    ## Case 1 start: this code works
    elastic_vector_search = ElasticVectorSearch.from_documents(
        docs,
        embedding=embeddings,
        elasticsearch_url="http://192.168.1.110:9200"        
    )
    ## Case 1 END
    searchResult = elastic_vector_search.similarity_search("What are the characteristics of sharks")
    ## Case 2 start: this code fails; the data is saved successfully, but the query raises an exception
    elastic_vector_search = ElasticVectorSearch(
        elasticsearch_url="http://192.168.1.110:9200",
        index_name="test20222",
        embedding=embeddings
    )
    elastic_vector_search.add_documents(docs)
    searchResult = elastic_vector_search.similarity_search("What are the characteristics of sharks")
    ## Case 2 END

@sergerdn
Contributor

sergerdn commented Apr 4, 2023

Okay, it seems to be the bug I described above, at #2386 (comment).

Updating the tests properly is proving harder for me than fixing the bug itself. Please be patient; I will work on a fix.

@longgui0318
Contributor Author

Thanks!

hwchase17 pushed a commit that referenced this issue Apr 5, 2023
- Create a new docker-compose file to start an Elasticsearch instance
for integration tests.
- Add new tests to `test_elasticsearch.py` to verify Elasticsearch
functionality.
- Include an optional group `test_integration` in the `pyproject.toml`
file. This group should contain dependencies for integration tests and
can be installed using the command `poetry install --with
test_integration`. Any new dependencies should be added by running
`poetry add some_new_deps --group "test_integration" `

Note:
The new tests run in live mode, which involves end-to-end testing against the OpenAI API. In the future, adding `pytest-vcr` to record and replay all API requests would be a nice addition to the testing process. More info:
https://pytest-vcr.readthedocs.io/en/latest/

Fixes #2386
@sergerdn
Contributor

sergerdn commented Apr 6, 2023

I made a mistake on the test and fixed another bug, but not the one we originally talked about.

@longgui0318
Contributor Author

longgui0318 commented Apr 6, 2023

@sergerdn I think the information I gave may not have been accurate enough.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import UnstructuredWordDocumentLoader
from langchain.vectorstores import ElasticVectorSearch

if __name__ == "__main__":

    loader = UnstructuredWordDocumentLoader("test.docx", mode="elements")
    data = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(data)
    embeddings = OpenAIEmbeddings()
    ## Please note that the index test20222 was not created before this point
    ## this code fails later; the data itself is saved successfully
    elastic_vector_search = ElasticVectorSearch(
        elasticsearch_url="http://192.168.1.110:9200",
        index_name="test20222",
        embedding=embeddings
    )
    ## Only now are the documents added
    elastic_vector_search.add_documents(docs)
    ## the query raises 'search_phase_execution_exception'. With this approach,
    ## client.indices.create(index=index_name, mappings=mapping) was never executed before add_documents
    searchResult = elastic_vector_search.similarity_search("What are the characteristics of sharks")

@sergerdn
Contributor

sergerdn commented Apr 6, 2023

It appears that it was executed before adding:

   raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.BadRequestError: BadRequestError(400, 'resource_already_exists_exception', 'index [custom_index_68ae159b3ddc4c02b12cf6660e2e0499/6glhH4tQRlOzC6WKZQrwdg] already exists')

@longgui0318
Contributor Author

No, it was executed after adding the data and after confirming in Kibana that the data was there. My guess is that some key index initialization was missing, which caused the inconsistency between the two structures.

@sergerdn
Contributor

sergerdn commented Apr 6, 2023

No, it was executed after adding the data and after confirming in Kibana that the data was there. My guess is that some key index initialization was missing, which caused the inconsistency between the two structures.

I believe you are correct. An index was created, but with incorrect mappings.

Correct mappings:

{
  "mappings": {
    "properties": {
      "metadata": {
        "properties": {
          "source": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      },
      "text": {
        "type": "text"
      },
      "vector": {
        "type": "dense_vector",
        "dims": 1536
      }
    }
  }
}

Wrong mappings:

{
  "mappings": {
    "properties": {
      "metadata": {
        "properties": {
          "source": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      },
      "text": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "vector": {
        "type": "float"
      }
    }
  }
}
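Why the wrong mapping breaks similarity_search: the search is a `script_score` query built around a `cosineSimilarity` script, and `cosineSimilarity` only works on fields mapped as `dense_vector`. Against the dynamically inferred `float` mapping above, the script fails during the search phase, which surfaces as `search_phase_execution_exception`. Roughly the query shape (a sketch, not the exact client code):

```python
def script_score_query(query_vector: list) -> dict:
    # Approximate shape of the query ElasticVectorSearch issues for
    # similarity_search. cosineSimilarity requires the 'vector' field
    # to be mapped as dense_vector; against the wrongly inferred
    # "float" mapping it raises a runtime error during the search phase.
    return {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    }
```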

@sergerdn
Contributor

sergerdn commented Apr 6, 2023

@longgui0318

Thank you very much for your help. I have fixed it!

Once it is merged, please test it on your end to ensure the new changes work properly.
#2445

hwchase17 pushed a commit that referenced this issue Apr 7, 2023
…ests (#2445)

Using `pytest-vcr` in integration tests has several benefits. Firstly,
it removes the need to mock external services, as VCR records and
replays HTTP interactions on the fly. Secondly, it simplifies the
integration test setup by eliminating the need to set up and tear down
external services in some cases. Finally, it allows for more reliable
and deterministic integration tests by ensuring that HTTP interactions
are always replayed with the same response.
Overall, `pytest-vcr` is a valuable tool for simplifying integration
test setup and improving test reliability.

This commit adds the `pytest-vcr` package as a dependency for
integration tests in the `pyproject.toml` file. It also introduces two
new fixtures in `tests/integration_tests/conftest.py` files for managing
cassette directories and VCR configurations.

In addition, the
`tests/integration_tests/vectorstores/test_elasticsearch.py` file has
been updated to use the `@pytest.mark.vcr` decorator for recording and
replaying HTTP interactions.

Finally, this commit removes the `documents` fixture from the
`test_elasticsearch.py` file and replaces it with a new fixture defined
in `tests/integration_tests/vectorstores/conftest.py` that yields a list
of documents to use in any other tests.

This also includes my second attempt to fix issue #2386.

Maybe related #2484
@carcaussa

Hi, I'm using version 0.0.186 and hitting this error; apparently I'm experiencing the same mapping issue. How can I find out when this is merged and included in a langchain release?
Thank you very much
Thank you very much

@Baro1502

Baro1502 commented Aug 2, 2023

Hi, I am currently using version 0.0.248 and I still encounter this issue. Is there any way I can address this? Thank you

@luccafabro

luccafabro commented Aug 2, 2023

@Baro1502 try creating your index with this structure before inserting:

    PUT /your_index
    {
      "mappings": {
        "properties": {
          "metadata": {
            "properties": {
              "source": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              }
            }
          },
          "text": { "type": "text" },
          "vector": { "type": "dense_vector", "dims": 1536 }
        }
      }
    }

@rizwanalvi1

In my case, it happened when I used different embedding models for storing and retrieving/searching: I was using OpenAIEmbeddings() for storing and, by mistake, instructor-large for searching.

hope this feedback would be of some use as well.
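A cheap guard against this embedding mix-up is to compare the query vector's length with the index's dense_vector dims before searching. This is a sketch assuming a 1536-dim index (OpenAI embeddings); the helper is hypothetical, not part of langchain.

```python
def check_query_dims(query_vector, index_dims: int = 1536) -> None:
    # A mismatch between the query vector's length and the index's
    # dense_vector "dims" makes Elasticsearch reject the script_score
    # query at search time, producing the same opaque error.
    if len(query_vector) != index_dims:
        raise ValueError(
            f"query vector has {len(query_vector)} dims, "
            f"index expects {index_dims}; are you using the same "
            "embedding model for indexing and searching?"
        )

check_query_dims([0.0] * 1536)  # OK: lengths match
```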
