Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Support hybrid scan for Flint covering index #386

Open
dai-chen opened this issue Jun 18, 2024 · 0 comments
Open

[FEATURE] Support hybrid scan for Flint covering index #386

dai-chen opened this issue Jun 18, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@dai-chen
Copy link
Collaborator

Is your feature request related to a problem?

The Limitation section in this comment discusses how stale index data may lead to incorrect query results. While Flint's skipping index addresses this issue with a hybrid scan mode, this approach is not feasible for covering index with raw datasets due to the lack of versioning.

What solution would you like?

For data lake tables such as Delta, Iceberg, and Hudi, it is possible to implement hybrid scans using the version information in table metadata. For example, consider the following query with a primary key (timestamp column) in an Iceberg table http_logs:

SELECT timestamp, request, status
FROM glue.iceberg.http_logs;

This query can be rewritten to a hybrid scan leveraging Iceberg's time travel capabilities:

  1. During query rewriting, retrieve the Iceberg snapshot ID X from the committed log in the checkpoint. Ref: [FEATURE] Enhance show Flint index statement to include refresh status #385
  2. Rewrite the query to a hybrid scan as shown below:
spark.read
  .format("flint")
  .load("flint_glue_iceberg_http_logs_index")
UNION
spark.read
  .format("iceberg")
  .option("start-snapshot-id", "X")
  .load("glue.iceberg.http_logs")

What alternatives have you considered?

Alternatively, users can disable the covering index optimization by setting spark.flint.optimizer.covering.enabled to false to retrieve the latest data. This ensures that queries always access the most current data, but it may impact query performance due to the lack of indexing benefits.

Do you have any additional context?

N/A

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant