Skip to content

[query] Remove unexpected files from parallel text exports#11921

Merged
danking merged 1 commit intohail-is:mainfrom
chrisvittal:lowering/better-speculation-overwrite-fix
Jun 17, 2022
Merged

[query] Remove unexpected files from parallel text exports#11921
danking merged 1 commit intohail-is:mainfrom
chrisvittal:lowering/better-speculation-overwrite-fix

Conversation

@chrisvittal
Copy link
Collaborator

This fixes an issue introduced by #11910, where extra part files may be
created by speculative execution. This is an issue for parallel export
where we did not delete any files.

Now we list the output directory, and compare it to the proper list of
files. The paths that are not in the list of 'good' paths are deleted.

@chrisvittal chrisvittal requested a review from danking June 16, 2022 16:55
Comment on lines +514 to +516
val outputFiles = fs.listStatus(outputPath).map(_.getPath).toSet
outputFiles.diff(files.toSet).foreach(fs.delete(_, false))
val fileset = files.toSet
val diff = outputFiles.diff(fileset)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the local backend, fs.listStatus is returning results with paths being URIs with the file scheme, causing this set difference to return every file in the directory on the local backend. I'm not sure how to get around this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a systemic fix (we really need to get everything on one file system implementation and agree on what scheme (if any) local paths have). The rapid fix is to use fs.canonicalize_path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to canonicalize the paths in the cleanup method.

This fixes an issue introduced by hail-is#11910, where extra part files may be
created by speculative execution. This is an issue for parallel export
where we did not delete any files.

Now we list the output directory, and compare it to the proper list of
files. The paths that are not in the list of 'good' paths are deleted.
@chrisvittal chrisvittal force-pushed the lowering/better-speculation-overwrite-fix branch from e18b143 to b6d74c3 Compare June 17, 2022 15:57
@danking danking merged commit 55423ea into hail-is:main Jun 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants