Is it possible to resolve the schema, but not expand the paths when doing a collect_schema? #17855

Open
kszlim opened this issue Jul 24, 2024 · 2 comments
Labels
A-io Area: reading and writing data enhancement New feature or an improvement of an existing feature

Comments

@kszlim
Contributor

kszlim commented Jul 24, 2024

Description

Path expansion for large datasets is very expensive, but I imagine schema resolution itself is cheap by comparison. Is it possible for schema resolution to avoid path expansion? collect_schema takes a long time for me on a 10k-file dataset, and I'm guessing that's due to listing all the paths.

@kszlim kszlim added the enhancement New feature or an improvement of an existing feature label Jul 24, 2024
@stinodego
Member

stinodego commented Jul 25, 2024

Ideally we shouldn't do path expansion for collect_schema if the reader is provided a schema, e.g. through a schema argument. But I'm not sure where we're at with the implementation. Tagging @nameexhaustion for this one.

@stinodego stinodego added the A-io Area: reading and writing data label Jul 25, 2024
@kszlim
Contributor Author

kszlim commented Jul 25, 2024

In the inference case, it would be interesting to infer the schema from a single file and defer path expansion until collection time.
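To make the idea concrete, here is a rough, standard-library-only sketch of "schema from one file, paths expanded at collection time". It is not how Polars works internally: `LazyDataset` and its methods are hypothetical names, CSV headers stand in for a real schema, and it assumes every file shares the first file's columns.

```python
import csv
import glob


class LazyDataset:
    """Hypothetical lazy dataset over a glob pattern of CSV files."""

    def __init__(self, pattern):
        self.pattern = pattern
        self._paths = None  # stays None until collect() expands the pattern

    def collect_schema(self):
        # Ask for a single match instead of building the full path list.
        first = next(glob.iglob(self.pattern), None)
        if first is None:
            raise FileNotFoundError(f"no files match {self.pattern!r}")
        with open(first, newline="") as f:
            # Assumption: the header row of one file describes all files.
            return next(csv.reader(f))

    def collect(self):
        # Full path expansion happens only here, at collection time.
        if self._paths is None:
            self._paths = sorted(glob.glob(self.pattern))
        rows = []
        for path in self._paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                next(reader)  # skip the header row
                rows.extend(reader)
        return rows
```

The trade-off in this sketch is that a schema mismatch in any other file would only surface at `collect()`, not at `collect_schema()`.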

2 participants