Is your feature request related to a problem?
The current Flint data source reader has several limitations when reading from OpenSearch that hurt performance and scalability:
Flint Data Source Limitations:
Single Partition Issue: A Flint scan always reports a single partition (the unit of work for one Spark task). As a result, a single executor processes the entire Flint index in OpenSearch, which rules out parallel reads.
OpenSearch Integration Challenges:
Non-Adaptive Pagination Size: The scroll page size is fixed by the first scroll request and is capped by the max_result_window setting in OpenSearch. [Probably no need to solve this if the parallel scan proposed for the Single Partition Issue is supported.]
Ser-De Overhead: OpenSearch does not expose a high-performance data transfer protocol, and serializing and deserializing REST responses carries notably high overhead.
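To make the pagination limitation concrete, here is a minimal sketch of how an initial scroll page size could be chosen adaptively instead of being hard-coded. The function name, the 4 MiB target, and the idea of estimating an average document size are assumptions for illustration and are not part of Flint. The only OpenSearch fact relied on is that `index.max_result_window` defaults to 10000.

```python
# Hypothetical helper, not part of Flint: pick the page size for the first
# scroll request from an estimated document size, clamped to the index limit.

DEFAULT_MAX_RESULT_WINDOW = 10_000  # OpenSearch default for index.max_result_window


def adaptive_page_size(
    avg_doc_bytes: int,
    target_page_bytes: int = 4 * 1024 * 1024,  # assumed target of ~4 MiB per page
    max_result_window: int = DEFAULT_MAX_RESULT_WINDOW,
) -> int:
    """Return how many documents to request per scroll page."""
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    wanted = target_page_bytes // avg_doc_bytes
    # Never ask for fewer than 1 document or more than the index allows.
    return max(1, min(wanted, max_result_window))


if __name__ == "__main__":
    for size in (100, 1024, 10 * 1024 * 1024):
        print(size, "->", adaptive_page_size(size))
```

Because the page size is set on the first scroll request, a helper like this would only pick the starting value. It does not change the size once the scroll is open.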
What solution would you like?
To address the above issues, the following possible solutions can be considered:
Split Flint Index into Partitions:
Modify the Flint data source to report multiple partitions and read each one through an OpenSearch slice (sliced scroll), so that multiple executors can scan the index in parallel.
Protocol Optimization in OpenSearch:
Explore a high-performance data transfer protocol that minimizes serialization and deserialization overhead, such as Protobuf or Apache Arrow. An in-memory columnar format like Arrow may also enable vectorized computation or other optimizations in Spark.
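As a rough illustration of the partitioning idea, the sketch below builds one search request body per Spark partition using the `slice` parameter of the OpenSearch scroll API. The function name and the default page size are hypothetical. How Flint would actually map partitions to slices is not specified in this issue. The single-partition case omits `slice`, since the slice `max` value must be greater than 1.

```python
import json
from typing import Any, Dict, List, Optional


def slice_requests(
    num_partitions: int,
    query: Optional[Dict[str, Any]] = None,
    page_size: int = 1000,  # assumed default, not taken from Flint
) -> List[Dict[str, Any]]:
    """Build one scroll request body per partition (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    base: Dict[str, Any] = {
        "size": page_size,
        "query": query if query is not None else {"match_all": {}},
        "sort": ["_doc"],  # index order is the cheapest order for a full scan
    }
    if num_partitions == 1:
        # A slice needs max > 1, so a single partition is a plain scroll.
        return [base]
    return [
        {**base, "slice": {"id": i, "max": num_partitions}}
        for i in range(num_partitions)
    ]


if __name__ == "__main__":
    # Each body would be sent by a different Spark task as its first
    # scroll request, e.g. POST /<index>/_search?scroll=5m
    for body in slice_requests(3):
        print(json.dumps(body))
```

With 10 executors available, a partition count at or above 10 would let every executor take at least one slice. The right number also depends on the shard count of the index, which this sketch does not consider.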
What alternatives have you considered?
N/A
Do you have any additional context?
Although the maximum number of executors is set to 10, the graph below illustrates that Spark executes the task slowly with a single executor due to the aforementioned problems. The task ultimately failed because it was manually cancelled.