fix: Read partitioned parquet files from relative paths #3470

Merged
merged 7 commits into from
Aug 1, 2023

jeffkinnison
Contributor

Fix for #3440. Parquet reads fail when all of the following conditions hold:

  1. The parquet data is partitioned into multiple files
  2. A relative path is passed to pyarrow.parquet.read_table
  3. A filesystem object is passed to pyarrow.parquet.read_table

This error can be addressed in two ways:

a. Convert the relative path to an absolute path
b. Do not pass a filesystem object

While determining the root cause, I found this StackOverflow question and this pyarrow Jira issue, both of which indicate that passing a filesystem object to pyarrow can cause problems, so I decided to go with fix b.

This fix modifies ludwig.data.dataset.ray.read_remote_parquet to catch pyarrow.lib.ArrowInvalid errors and try the read again with no filesystem kwarg. It also adds a test for this issue.
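For readers who want to see the shape of the retry, here is a minimal sketch of the fallback described above. It is not the code merged in this PR: the helper names `read_with_filesystem_fallback` and `read_parquet_with_fallback` are hypothetical, and the reader and exception type are passed in as parameters so the retry logic can be exercised without pyarrow installed.

```python
from typing import Any, Callable, Optional, Type


def read_with_filesystem_fallback(
    read_table: Callable[..., Any],
    path: str,
    filesystem: Optional[Any] = None,
    retry_on: Type[BaseException] = Exception,
) -> Any:
    """Try the read with the filesystem kwarg, then retry once without it.

    Hypothetical helper for illustration only; not Ludwig's implementation.
    """
    if filesystem is None:
        # Nothing to fall back from: a retry would be the identical call.
        return read_table(path)
    try:
        return read_table(path, filesystem=filesystem)
    except retry_on:
        # Let the reader infer the filesystem from the path instead.
        return read_table(path)


def read_parquet_with_fallback(path: str, filesystem: Optional[Any] = None) -> Any:
    """Wire the helper to pyarrow (imported lazily; assumes pyarrow is installed)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    return read_with_filesystem_fallback(
        pq.read_table, path, filesystem, retry_on=pa.lib.ArrowInvalid
    )
```

Catching only `pyarrow.lib.ArrowInvalid` keeps unrelated failures, such as a missing file or a permissions error, from being masked by the retry.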

Collaborator

@ksbrar ksbrar left a comment

awesome 🚀

@github-actions

github-actions bot commented Jul 19, 2023

Unit Test Results

  6 files  ±0    6 suites  ±0   1h 4m 11s ⏱️ - 2m 55s
34 tests ±0  29 ✔️ ±0    5 💤 ±0  0 ±0 
88 runs  ±0  72 ✔️ ±0  16 💤 ±0  0 ±0 

Results for commit 40d454d. ± Comparison against base commit 079429e.

♻️ This comment has been updated with latest results.

@jeffkinnison jeffkinnison merged commit d146799 into master Aug 1, 2023
16 checks passed
@jeffkinnison jeffkinnison deleted the parquet-parent-directory-fix branch August 1, 2023 05:45
2 participants