
[FEATURE] Enhance Flint data source reader performance in OpenSearch integration #334

Open
dai-chen opened this issue May 9, 2024 · 0 comments
Labels
0.5 DataSource:OpenSearch Meta Meta issue, not directly linked to a PR

Comments


dai-chen commented May 9, 2024

Is your feature request related to a problem?

The current implementation of the Flint data source reader when interfacing with OpenSearch exhibits several limitations impacting performance and scalability:

  1. Flint Data Source Limitations:

    • Single Partition Issue: The Flint scan always reports a single partition (the basic unit of an RDD task). This forces a single executor to process the entire Flint index in OpenSearch and limits parallel processing capabilities.
  2. OpenSearch Integration Challenges:

    • Non-Adaptive Pagination Size: The scroll page size is set in the first scroll request and is capped by the fixed max_result_window setting in OpenSearch. [Probably no need to solve if scanning in parallel is supported in item 1]
    • Ser-De Overhead: OpenSearch does not expose a high-performance data transfer protocol, and the overhead of serializing and deserializing REST responses is notably high.
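To make the pagination limitation concrete, here is a minimal sketch of the current single-reader scroll pattern. All names (`build_scroll_request`, `pages_needed`, the `MAX_RESULT_WINDOW` value) are illustrative assumptions, not the actual Flint implementation: the point is that the page size is fixed in the first request, capped by `max_result_window`, and every page is one sequential REST round trip on the lone executor.

```python
# Hypothetical sketch of the non-adaptive scroll pattern described above
# (illustrative names; not the actual Flint reader code).

MAX_RESULT_WINDOW = 10_000  # OpenSearch index setting that caps the page size


def build_scroll_request(page_size: int) -> dict:
    """First scroll request: the page size is fixed here and cannot adapt later."""
    if page_size > MAX_RESULT_WINDOW:
        raise ValueError(
            f"size {page_size} exceeds max_result_window {MAX_RESULT_WINDOW}")
    return {"size": page_size, "query": {"match_all": {}}}


def pages_needed(total_docs: int, page_size: int) -> int:
    """Each page is one REST round trip handled by the single executor."""
    return -(-total_docs // page_size)  # ceiling division


# Even at the maximum page size, 5M documents take 500 sequential round trips.
print(pages_needed(5_000_000, 10_000))
```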

What solution would you like?

To address the above issues, the following possible solutions can be considered:

  1. Split Flint Index into Partitions:

    • Modify the Flint data source to report multiple partitions and execute scans in parallel via OpenSearch sliced search.
  2. Protocol Optimization in OpenSearch:

    • Explore a high-performance data transfer protocol that minimizes serialization and deserialization overhead, such as Protobuf or Apache Arrow. An in-memory columnar format like Arrow may additionally enable vectorized computation and further optimizations in Spark.
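A minimal sketch of what slice-based partition planning (solution 1) could look like, assuming one OpenSearch slice per Spark input partition. The class and function names here are hypothetical placeholders, not the actual Flint reader API; the `slice` clause in the request body is the standard OpenSearch mechanism for splitting a scroll server-side.

```python
# Hypothetical sketch of slice-based partition planning (illustrative names;
# not the actual Flint reader API). Each Spark input partition is backed by
# one OpenSearch slice, so N executors can scroll the index in parallel.

from dataclasses import dataclass


@dataclass(frozen=True)
class OpenSearchSlicePartition:
    """One input partition backed by an OpenSearch sliced scroll."""
    slice_id: int
    max_slices: int

    def search_body(self, page_size: int) -> dict:
        # The "slice" clause tells OpenSearch to return only this
        # slice's share of the matching documents.
        return {
            "slice": {"id": self.slice_id, "max": self.max_slices},
            "size": page_size,
            "query": {"match_all": {}},
        }


def plan_input_partitions(num_slices: int) -> list:
    """What a multi-partition planInputPartitions could return instead of one partition."""
    return [OpenSearchSlicePartition(i, num_slices) for i in range(num_slices)]


partitions = plan_input_partitions(8)
print(len(partitions))                            # 8 partitions instead of 1
print(partitions[0].search_body(1000)["slice"])   # {'id': 0, 'max': 8}
```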

What alternatives have you considered?

N/A

Do you have any additional context?

Although the maximum number of executors is set to 10, the graph below shows that Spark runs the scan slowly on a single executor due to the problems described above. The task ultimately failed because it was cancelled manually.

Screenshot 2024-05-09 at 9 45 32 AM