Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pipeline] Added implicit copy of resource group #5455

Merged
merged 2 commits into from Feb 28, 2019

Conversation

@jigold
Copy link
Collaborator

jigold commented Feb 27, 2019

Plus better error checking!

  • Some bioinformatic tools expect a secondary implied file to be present. For example, sample.vcf and it's index file sample.vcf.tbi. This PR adds file localization such that if any file in a resource group is used, the entire resource group will be copied and not just the mentioned file.

  • Added a mentioned set that tracks whether a resource was defined in the command or declare resource group functions. Otherwise, you could do something like this which would throw an error upon execution:

p = Pipeline()
t = p.new_task()
p.write_output(t.undefined_variable, 'gs://foo/foo')
p.run()
…cking

- Some bioinformatic tools expect a secondary implied file to be present. For example, sample.vcf and it's index file sample.vcf.tbi. This PR adds file localization such that if any file in a resource group is used, the entire resource group will be copied and not just the mentioned file.

- Added a mentioned set that tracks whether a resource was defined in the command or declare resource group functions. Otherwise, you could do something like this which would throw an error upon execution:

```python
p = Pipeline()
t = p.new_task()
p.write_output(t.undefined_variable, 'gs://foo/foo')
p.run()
```
@@ -92,6 +92,12 @@ def read_input_group(self, **kwargs):
return rg

def write_output(self, resource, dest): # pylint: disable=R0201
assert isinstance(resource, Resource)

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

Can this ever be something besides Resource?

In short, do we need asserts for non-operational errors (something the user can't trigger)? I think that if the error isn't operational (user input), the assert is unnecessary (since it's a runtime check of a programmatic error), and should be written as a test. This applies also to several if isinstance(...) else assert

This comment has been minimized.

Copy link
@jigold

jigold Feb 27, 2019

Author Collaborator

This is a user facing method. I'm slowly converting all non-user methods to start with '_'. I'll make the error message nicer.

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

Do we need the assert?

@@ -86,14 +102,18 @@ def __init__(self, source, root, **values):
self._output_paths = set()

for name, resource in values.items():
assert isinstance(resource, ResourceFile)

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

In the spirit of the above comment; should this ever be something other than ResourceFile (meaning can a user get a resource in values that isn't a ResourceFile?

This comment has been minimized.

Copy link
@jigold

jigold Feb 27, 2019

Author Collaborator

No not as currently implemented. @danking What do you think about asserts? I think they're helpful so I remember what the type of the input should be. Does Python have the equivalent of require?

This comment has been minimized.

Copy link
@jigold

jigold Feb 27, 2019

Author Collaborator

I think the assertions are helpful. I don't see what the benefit is to taking them out and they'll help pinpoint bugs closer to where they originated.

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

The benefit is that you don't need to place asserts everywhere. If there is something that the user is passing in that could be not a ResourceFile, I totally agree with you, we want to catch the error early and tell them. If this is a runtime check to make sure we didn't change our internal API, I think that should be a test of this function; it will fail at resource.add_resource_group(self).

@@ -40,12 +53,20 @@ def __getitem__(self, item):
def __getattr__(self, item):
return self._get_resource(item)

def _add_outputs(self, resource):

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

Do we need this and _add_inputs, or would it be sufficient for methods calling _add_outputs / _add_inputs to call (for instance) _add_resource_to_set(self._outputs). The calling method already abstracts away a private method; it kind of seems we now abstract the same private method twice.

This comment has been minimized.

Copy link
@jigold

jigold Feb 27, 2019

Author Collaborator

Your way is probably better. Let me try that.

This comment has been minimized.

Copy link
@jigold

jigold Feb 27, 2019

Author Collaborator

The problem with your proposal is it works fine when mutating your own data structures, but I also want to mutate another Task's data structure as well. Then it would change from this r._source._add_outputs(r) to _add_resource_to_set(r._source._outputs, r) which I felt was a bit weird. The right thing to do is to not have the helper method, but I didn't want to copy the same logic for the resource group twice.

@@ -92,6 +92,12 @@ def read_input_group(self, **kwargs):
return rg

def write_output(self, resource, dest): # pylint: disable=R0201
assert isinstance(resource, Resource)
if isinstance(resource, TaskResourceFile) and resource not in resource._source._mentioned:

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

TaskResourceFile is a Resource, so we don't need this check.

From a higher level, it seems we don't need either the assert above, or if isinstance(), because if resource is in resource._source._mentioned it should be a Resource.

output_file = f'{gcs_output_dir}/test_multiple_dependent_tasks.txt'
p = self.pipeline()
t = p.new_task()
t.command(f'echo "0" >> {t.ofile}')

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

Can you help me understand where t.ofile is declared? It kind of seems undefined at this point.

This comment has been minimized.

Copy link
@jigold

jigold Feb 27, 2019

Author Collaborator

__getattr__ for t.ofile and __getitem__ for t['ofile'].

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

It feels like this should be explicitly assigned, or not mentioned at all:

t.command(f'echo "0" >>')

or

command(f'echo "0" >> {t}')

with some __str__ that internally tracks the intermediate file being modified.

I won't block the PR due to this, since that would touch outside functionality, but I think implicitly adding attributes during a failed attempt to read them is semantically, imo, incorrect, and will lead to difficult to debug behavior for the end user (unless the only way to add attributes is implicit, which it isn't, since we already have declare_resource_group).

I would recommend we have declare_resource_group as the only method to add attributes.

not sure if @dking or @cseed have an opinion on this... I may be missing some reason to do this, or maybe this is just the idiom we've settled on?

This comment has been minimized.

Copy link
@jigold

jigold Feb 27, 2019

Author Collaborator

Can you write a dev post with your concerns? We iterated quite a bit on the interface. Also it would be good to move this discussion outside of this PR.

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

Sure

This comment has been minimized.

Copy link
@akotlar
@akotlar

This comment has been minimized.

Copy link
Collaborator

akotlar commented Feb 27, 2019

Overall looks good! Just a few small things, which we're addressing (thanks!). I'll try to get this and the stacked PR in today :)

Copy link
Collaborator

akotlar left a comment

A few small things we're working through :)

else:
self._outputs.add(r)
self._mentioned.add(r)

This comment has been minimized.

Copy link
@akotlar

akotlar Feb 27, 2019

Collaborator

This is related to the implicit attribute creation, but I kind of feel nervous about implicitly adding resources. We have 2 modes: if t2: don't add if t1: add for t1 = Task(pipeline=...), t2 = Task(pipeline=...), and t1.command vs t2.command.

@jigold jigold dismissed akotlar’s stale review Feb 27, 2019

done -- see comments to unaddressed issues

@danking danking merged commit f6fe19c into hail-is:master Feb 28, 2019
1 check passed
1 check passed
hail-ci-0-1 successful build
Details
tpoterba added a commit to tpoterba/hail that referenced this pull request Nov 7, 2019
* [pipeline] Added implicit copy of resource group and better error checking

- Some bioinformatic tools expect a secondary implied file to be present. For example, sample.vcf and it's index file sample.vcf.tbi. This PR adds file localization such that if any file in a resource group is used, the entire resource group will be copied and not just the mentioned file.

- Added a mentioned set that tracks whether a resource was defined in the command or declare resource group functions. Otherwise, you could do something like this which would throw an error upon execution:

```python
p = Pipeline()
t = p.new_task()
p.write_output(t.undefined_variable, 'gs://foo/foo')
p.run()
```

* Add better error message
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.