[FEATURE] Enhance Flint data source reader performance in OpenSearch integration #334

dai-chen · 2024-05-09T17:39:12Z

Is your feature request related to a problem?

The Flint data source reader has several limitations when reading from OpenSearch that hurt performance and scalability:

Flint Data Source Limitations:
- Single Partition Issue: Flint scan always reports a single partition (the basic unit of work for an RDD task). This forces a single executor to process the entire Flint index in OpenSearch and limits parallel processing.
OpenSearch Integration Challenges:
- Non-Adaptive Pagination Size: The scroll page size is set in the first scroll request and is capped by the fixed max_result_window setting in OpenSearch. [Probably no need to solve this if parallel scanning, i.e. the single-partition issue above, is addressed]
- Ser-De Overhead: OpenSearch exposes no high-performance data transfer protocol, and the overhead of serializing and deserializing REST responses is notably high.

What solution would you like?

To address the above issues, the following possible solutions can be considered:

Split Flint Index into Partitions:
- Modify the Flint data source to support multiple partitions, with each partition reading its share of the index through an OpenSearch slice.
Protocol Optimization in OpenSearch:
- Explore the possibility of a high-performance data transfer protocol that minimizes serialization and deserialization overhead, such as Protobuf or Apache Arrow. An in-memory format like Arrow may also enable vectorized computation or other optimizations in Spark.
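
To make the partitioning proposal concrete, here is a minimal sketch of how one sliced scroll request per partition might be built. It is not the Flint implementation: the function name `build_slice_requests`, the index name, and the default page size and keep-alive are all hypothetical, and it only constructs request bodies rather than calling OpenSearch. The `slice` object with `id` and `max` is the sliced-scroll parameter of the OpenSearch search API; to my knowledge `max` must be greater than 1, so the single-partition case omits it.

```python
import json
from typing import Any, Dict, List, Optional


def build_slice_requests(
    index: str,
    num_partitions: int,
    page_size: int = 1000,
    query: Optional[Dict[str, Any]] = None,
    scroll_keep_alive: str = "5m",
) -> List[Dict[str, Any]]:
    """Build one initial scroll request per partition (hypothetical helper).

    Each partition gets its own slice id, so separate executors could
    scroll disjoint subsets of the same index in parallel.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")

    base_query = query if query is not None else {"match_all": {}}
    requests = []
    for slice_id in range(num_partitions):
        body: Dict[str, Any] = {"size": page_size, "query": base_query}
        # Assumption: a slice needs max > 1, so a single partition
        # falls back to a plain (unsliced) scroll.
        if num_partitions > 1:
            body["slice"] = {"id": slice_id, "max": num_partitions}
        requests.append(
            {
                "partition": slice_id,
                "method": "POST",
                "path": f"/{index}/_search?scroll={scroll_keep_alive}",
                "body": body,
            }
        )
    return requests


if __name__ == "__main__":
    # "flint_example_index" is a placeholder index name.
    for req in build_slice_requests("flint_example_index", 3):
        print(req["method"], req["path"], json.dumps(req["body"]))
```

With this shape, the number of slices would map directly to the number of partitions the scan reports, so it could be derived from the shard count or the executor limit. Choosing that number is left open here.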

What alternatives have you considered?

N/A

Do you have any additional context?

Although the maximum number of executors is set to 10, the graph below illustrates that Spark executes the task slowly with a single executor due to the aforementioned problems. The task ultimately failed because it was manually cancelled.


dai-chen added the Meta (meta issue, not directly linked to a PR) and 0.5 labels May 9, 2024

github-actions bot added the untriaged label May 9, 2024

dai-chen removed the untriaged label May 9, 2024

dai-chen added this to OpenSearch Spark Project Planning May 9, 2024

This was referenced Jun 3, 2024

[FEATURE] Performance and Scalability Enhancements for Flint Index #365

Open

[EPIC] Zero-ETL - Apache Iceberg Table Support #372

Open

penghuo mentioned this issue Jun 24, 2024

[EPIC] Zero-ETL - OpenSearch Table #185

Open

10 tasks

dai-chen added the DataSource:OpenSearch label Sep 30, 2024
