Trim common hdfs prefix in index DF #13

Open
shay1bz opened this issue Jul 28, 2021 · 0 comments

shay1bz commented Jul 28, 2021

Try to recognize common path prefixes at runtime and trim them.
For example, files in a standard table might look like:

hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet
hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet
...

On read, before the shuffle, we can trim the common prefix to reduce the shuffle size.
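
A minimal sketch of the prefix detection, in plain Python and independent of the project's actual code. The helper names `common_dir_prefix` and `trim_paths` are hypothetical, and the sketch assumes the prefix should be cut at a `/` boundary so that a partial file name such as `part-000` is never trimmed:

```python
import posixpath


def common_dir_prefix(paths):
    """Return the longest prefix shared by all paths, ending on a '/'.

    Hypothetical helper: not part of the project's API.
    """
    if not paths:
        return ""
    # commonprefix compares character by character, so the result can
    # end in the middle of a path component (e.g. ".../part-000").
    prefix = posixpath.commonprefix(list(paths))
    # Cut back to the last '/' so only whole components are trimmed.
    cut = prefix.rfind("/")
    return prefix[: cut + 1]


def trim_paths(paths):
    """Return (prefix, suffixes) so that prefix + suffix rebuilds each path."""
    prefix = common_dir_prefix(paths)
    return prefix, [p[len(prefix):] for p in paths]


files = [
    "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0000.parquet",
    "hdfs://my_cluster/foo/bar/my_table/dt=2020-01-01/part-0001.parquet",
]
prefix, suffixes = trim_paths(files)
```

The prefix would be stored once (for example in metadata or a broadcast value, which is an assumption here) and only the short suffixes would travel through the shuffle; the full path is recovered as `prefix + suffix`.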
