fix: Read partitioned parquet files from relative paths #3470
Fix for #3440. Parquet reads fail under the following conditions:
pyarrow.parquet.read_table
pyarrow.parquet.read_table
This error can be addressed in two ways:
a. Convert the relative path to an absolute path
b. Do not pass a filesystem object
While determining the root cause, I found this StackOverflow question and this pyarrow Jira issue, both of which indicate that there are potential issues with passing a filesystem object to `pyarrow`, so I decided to go for fix b. This fix modifies `ludwig.data.dataset.ray.read_remote_parquet` to catch `pyarrow.lib.ArrowInvalid` errors and try the read again with no `filesystem` kwarg. It also adds a test for this issue.
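A minimal sketch of the retry pattern described in fix b, not the actual code in this PR. The helper name `read_with_filesystem_fallback`, its parameters, and the stand-in reader `fake_read_table` with its `FakeArrowInvalid` error are all hypothetical, introduced so the example runs without pyarrow installed. The stand-in only imitates the reported behavior (a relative path plus a filesystem object fails).

```python
from typing import Any, Callable, Optional, Type


def read_with_filesystem_fallback(
    read_table: Callable[..., Any],
    path: str,
    filesystem: Optional[Any] = None,
    retry_on: Type[Exception] = Exception,
) -> Any:
    """Read `path` with `filesystem`; if that raises `retry_on`, retry without it."""
    if filesystem is None:
        return read_table(path)
    try:
        return read_table(path, filesystem=filesystem)
    except retry_on:
        # Fix b: drop the filesystem kwarg and let the reader resolve the path itself.
        return read_table(path)


class FakeArrowInvalid(Exception):
    """Hypothetical stand-in for pyarrow.lib.ArrowInvalid."""


def fake_read_table(path: str, filesystem: Optional[Any] = None) -> dict:
    """Hypothetical reader: fails on a relative path when a filesystem is passed."""
    if filesystem is not None and not path.startswith("/"):
        raise FakeArrowInvalid(f"could not resolve relative path: {path}")
    return {"path": path, "used_filesystem": filesystem is not None}


if __name__ == "__main__":
    fs = object()  # placeholder for a real filesystem object
    print(read_with_filesystem_fallback(fake_read_table, "data/train", fs, FakeArrowInvalid))
    print(read_with_filesystem_fallback(fake_read_table, "/abs/data/train", fs, FakeArrowInvalid))
```

With pyarrow available, the same helper would presumably be called with `pyarrow.parquet.read_table` as `read_table` and `pyarrow.lib.ArrowInvalid` as `retry_on`. Catching only that error type keeps unrelated failures (missing files, permission errors) from being silently retried.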