copyMerge is broken - needs to take a list of things to merge #4932

Closed
tpoterba opened this issue Dec 10, 2018 · 0 comments

danking added a commit to danking/hail that referenced this issue Dec 10, 2018

[hail] Correctly merge files in ExportPlink
`ExportPlink` was previously modified to handle the failure of a Spark task resiliently by including a per-task UUID. `copyMerge`, however, was not updated to correctly handle the directories that this modified `ExportPlink` generates.

For example, if exactly one task out of N fails (and is retried) during `ExportPlink`, the temporary output directory will contain N+1 partition files. One of these partition files is corrupted and invalid; the other N are the output of successfully completed tasks. `copyMerge` should simply ignore the invalid file.
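
To make the failure mode concrete, here is a hypothetical, local-filesystem-only Spark sketch (names, paths, and structure are illustrative, not Hail's actual `ExportPlink` code): each successful task writes a UUID-named part file and reports its name back to the driver, so a garbage file left behind by a failed attempt never appears in the collected list.

```scala
import java.io.{File, PrintWriter}
import java.util.UUID

import org.apache.spark.sql.SparkSession

object CollectValidParts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("collect-valid-parts")
      .getOrCreate()

    val tmpDir = new File("/tmp/export-demo")
    tmpDir.mkdirs()

    val validParts: Array[String] =
      spark.sparkContext
        .parallelize(1 to 100, numSlices = 4)
        .mapPartitions { rows =>
          // each task writes its own UUID-named part file
          val name = s"part-${UUID.randomUUID()}"
          val w = new PrintWriter(new File(tmpDir, name))
          try rows.foreach(r => w.println(r))
          finally w.close()
          Iterator.single(name) // only reached if the task succeeds
        }
        .collect()

    // Merge exactly these files; anything else sitting in tmpDir (e.g. the
    // partial output of a failed task attempt) is simply never merged.
    validParts.foreach(println)
    spark.stop()
  }
}
```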

This PR modifies `copyMerge` to take an optional list of files to
merge. If that argument is set to `None`, the original behavior
persists. The original behavior is used by `RichRDD.writeTable` and
`RichRDDByteArray.saveFromByteArrays`. These two methods use the default
Spark parallel export system. This system is not resilient to all task
failures, but *does* ensure failed tasks do not generate garbage
partitions in the output directory. Ergo, they can safely use the
original behavior of `copyMerge`.
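
For illustration, a minimal sketch of what a `copyMerge` with an optional file list could look like, assuming Hadoop's `FileSystem` API; the parameter names (notably `filesToMerge`) and the `part-` prefix filter are assumptions, not the exact Hail signature.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object CopyMergeSketch {
  // filesToMerge = None reproduces the original behavior (merge every part
  // file found in sourceFolder); Some(names) merges exactly the named files
  // and ignores anything else, e.g. garbage left by a failed task attempt.
  def copyMerge(
    hadoopConf: Configuration,
    sourceFolder: String,
    destinationFile: String,
    deleteSource: Boolean = true,
    filesToMerge: Option[IndexedSeq[String]] = None
  ): Unit = {
    val fs = FileSystem.get(hadoopConf)

    val partFiles: IndexedSeq[Path] = filesToMerge match {
      case Some(names) =>
        names.map(name => new Path(sourceFolder, name))
      case None =>
        fs.listStatus(new Path(sourceFolder))
          .map(_.getPath)
          .filter(_.getName.startsWith("part-"))
          .sortBy(_.getName)
          .toIndexedSeq
    }

    // concatenate the selected part files into the single destination file
    val out = fs.create(new Path(destinationFile))
    try {
      partFiles.foreach { p =>
        val in = fs.open(p)
        try IOUtils.copyBytes(in, out, hadoopConf, false)
        finally in.close()
      }
    } finally out.close()

    if (deleteSource)
      fs.delete(new Path(sourceFolder), true)
  }
}
```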

I do not test this behavior because failing a task mid-write is a bit tricky to simulate. I have verified that all callers of `copyMerge` now use it correctly. Adding a test to `ExportPlink` would not save us from incorrectly using `copyMerge` in the future.

A longer-term testing strategy that includes a Chaos Monkey which kills entire containers during Hail's Scala test execution would protect against this type of bug.

---

Fixes hail-is#4932

danking added a commit that referenced this issue Dec 12, 2018

[hail] Correctly merge files in ExportPlink (#4938)
* [hail] Correctly merge files in ExportPlink

* ugh

* fix the thing