Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that invalid vts have results_dir cleaned before passing to ta… #4139

Merged
merged 10 commits into from Dec 19, 2016

Conversation

mateor
Copy link
Member

@mateor mateor commented Dec 14, 2016

Tasks that sometimes fail due to outside factors (download failures,
resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True)
in order to wipe possibly truncated or crufty state.

But since vt.results_dir is a symlink, that replaced it with a real dir.
That ended up breaking caching, since the task output was therefore
never making it into the artifact cache, and other weird
bugs.

This is a small change that deletes the existing directories if
a target is invalid, removing the need for tasks to wipe the
results_dir, and also now checks to make sure that the results_dir
is legal before marking valid and passing to the artifact_caches.

The majority of the change is added test coverage around the breaks.
I have a couple immediate followups that cover an additional failure
state, and reworks the cache_manager to remove some of the harder
to reason about bits.

Closes: #4137

…sks.

Tasks that sometimes fail due to outside factors (download failures,
resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True)
in order to wipe possibly truncated or crufty state.

But since vt.results_dir is a symlink, that replaced it with a real dir.
That ended up breaking caching, since the task output was therefore
never making it into the artifact cache, and other weird
bugs.

This is a small change that deletes the existing directories if
a target is invalid, removing the need for tasks to wipe the
results_dir, and also now checks to make sure that the results_dir
is legal before marking valid and passing to the artifact_caches.

The majority of the change is added test coverage around the breaks.
I have a couple immediate followups that cover an additional failure
state, and reworks the cache_manager to remove some of the harder
to reason about bits.

Closes: pantsbuild#4137
@mateor
Copy link
Member Author

mateor commented Dec 14, 2016

I am pushing this to CI to get Travis to take a crack at it - this will be my first non-release CR through github so forgive me if I muck the process at all.

You can read whenever you want, but I won't be adding specific reviewers until it passes CI.

ETA: I have two followups getting polished, one of which should also be backported to 1.2.1 (the error message for relative_symlink failure state)

@mateor
Copy link
Member Author

mateor commented Dec 14, 2016

I looked into the failures a bit tonight. They are legit, although it wasn't clear to me exactly what went wrong.

I will dig at it tomorrow and add some unit tests coverage around whatever it ends up deriving from.

# If the vt is invalid, clean. Deletes both because if the stable_dir has somehow been replaced with a real dir,
# the relative_symlink call below raises an uncaught exception when it attempts to unlink the real directory.
safe_rmtree(current_dir)
safe_rmtree(stable_dir)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it would be better to blow up in the case of a deleted/replaced results_dir symlink to expose an issue in a task, rather than to silently try to fix it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a followup that does that - I broke it out into a separate commit from due to your original comment. Can add here if you prefer.

@@ -264,11 +287,13 @@ def __init__(self,
def update(self, vts):
"""Mark a changed or invalidated VersionedTargetSet as successfully processed."""
for vt in vts.versioned_targets:
vt._ensure_legal()
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should be public if it will be called here (it needn't be API: public)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yes, that is true.

@@ -95,15 +105,17 @@ def test_incremental(self):
vtB = task.execute()
self.assertContent(vtB, one + two)

# Incremental atop existing directory for vtA.
# vtC.previous_cache_key == vtB.cache_key so the task appends the file to vtB's results.
# This behavior used to be skipped because the current_dir existed when checked by cache_manager._use_previous_dir.
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this a bit more... the idea was that if we switched back to exactly the same inputs as some previous build with zinc (for example), then zinc would quickly do the right thing: which is noop.

On the other hand, clearing the directory regardless as you're doing now is definitely less error prone then ever reusing an existing directory, so I think your change makes sense.

I would still love to see the test updated before landing this though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am tinkering with this now...I think I can break out the previous_results move to happen after the cache check. It is tough to reason about the cache manager code, I have written another followup that deletes a lot (but not all) of the circular referencing and object mutation.

The above test should currently pass (although maybe it doesn't? I am working on a different integration test failure atm). The case it presents is no longer very useful though. Can you elaborate on what you are asking for with:

"I would still love to see the test updated before landing this though."

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added a comment at the top of the method indicating that it should be updated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm...yes that sounds like something I would do :)

Okay, I will add a some more coverage. I also believe that I was able to add some additional correctness as mentioned above.

Assuming that the code is working, the incremental task behavior should now be:

  • full noop if target unchanged from previous run
  • noop from zinc on cached VT
  • previous_results copied in before zinc execution for invalid, uncached VT

I will add some tests to show the above, as much as I am able.

I didn't see a check for already populated current_result_dirs under the pants_workdir, just local and remote cache checks. Which surprised me, since the incremental implementation proves Pants is willing to consume from previous result_dirs. I wonder if that case happens often enough to be worth considering. Perhaps rare to the point of not mattering...

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a check for already populated current_result_dirs under the pants_workdir, just local and remote cache checks. Which surprised me, since the incremental implementation proves Pants is willing to consume from previous result_dirs.

AFAICT, that's actually exactly what this test was doing before: vtA had the same directory as vtC... I'm fine with dropping that behaviour for the reasons mentioned above, but I believe that that is what was being tested here.

Copy link
Member Author

@mateor mateor Dec 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood why it didn't copy the previous_results (the directory was found to exist) but it was not clear to me that was happening on purpose.

And the task itself doesn't noop, which is what I meant above. The invalidation block runs all three times. Zinc just passively inherits the existing results_dir since the namespace is reused. That may be equivalent to a noop for the zinc tool, for all I know, but either way no other tasks even hit that check, since zinc is the only thing that hits that previous_dir code.

Maybe the build_invalidator could go ahead and be fully namespaced in the same manner as the results_dir, and therefore hold an arbitrary number of VT results. Since those would only be VTs that successfully exited invalidation, the results_dir would only require a wipe if the current_results_dir exists but no hash file was dropped.

I guess that is firmly in the land of a new feature, in any rate, and possibly you know reasons that my speculating is flawed :)

I will be able to provide the contract detailed above. Tasks that are rebuilding cache_keys older than the most recent will consume from the cache, and if not found will be opting out of incremental support since they will no longer passively inherit their results.

Which is maybe preferred when taken as a whole? Cache hits look like they supercede the previous_results, in which case all cached VTs were copying over the previous results in the cache_manager, and then immediately pruning the files within the artifact_cache unpack.

The only case where a zinc task would newly miss out on both artifact cache and incremental results is if the results_dir existed but the artifact was uncached, which is probably pretty rare in practice.

As said in the comment, this has been a long-term uncaught exception,
surfacing when os.unlink was called on a directory. This resulted
in non-zero exit for Pants, with Operation not Permitted raising
from the OS.

Should now give a decent msg, but the outcome of Pants in this
situation will continue to be an execution-halting exception.

This behavior has diverged between relative and absolute symlinks
methods. the absolute_symlink checks for dirs, and silently rms
them. That behavior should probably be reconciled, one way or another,
one of these days.

Once a determination is made, it would likely inform the final determination
on whether safe_mkdir should be allowed to continue to be symlink agnostic.
This was probably really clogging the IO, since every jvm
target was:

* copying the entire previous_results_dir over
* doing it a second time
* checking for cached artifact
* if found, parsing the analysis and pruning unneeded results.

Now it creates the current VT results_dir, and then checks for the
cache. This includes the double check, so the last thing that is
done before yielding the VT as invalid is to copy any
available previous results.

My read on this is that it will save a good chunk of unnecessary system
calls.
This appears to save a bunch of unnecessary results_dir copies.
I added some new test coverage to exercise this, but there was
very limited testing to begin with.

I have one more followup commit that significantly cleans up
the logic in cache_manager. That can land separatly from this commit,
I wanted to get some test coverage in before I tried that one.
@mateor
Copy link
Member Author

mateor commented Dec 16, 2016

I added some additional test coverage around the discoveries we discussed in the bug thread. I probably should do another pass over this once it passes CI - will do this afternoon.

WHoops. I have gotten so used to disabling the style checks while
developing!

Otherwise appears to be good. There is still not enough test coverage
around invalidation in general, and incremental tasks specifically.
But I have added a lot more then there used to be.

I have a logic refactor as a followup, and I will try and
continue to add coverage as a part of that.
Copy link
Sponsor Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mateo... excellent work. Some minor nits, but not blocking probably.


# Finally, create the stable symlink.
relative_symlink(current_dir, stable_dir)
# If the target is valid, both directories are assumed to exist.
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth validating that here in the else clause?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - I have a followup with that change. My hope was to change the behavior here as little as possible in the change, since the coverage was light and we were scheduling to backport to a stable branch.

But I ended up needing to make somewhat significant changes due to moving the previous check, so I can go ahead and add that here safely, I think.

# Set the self._previous last, so that it is only True after the copy completed.
self._previous_results_dir = previous_path

def _use_previous_dir(self, allow_incremental, root_dir):
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused method? It returns the self._previous_results_dir, which wouldn't be set when this method was called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thanks for the catch. All the functions of this method are covered elsewhere, so will delete.

It was a miss when splitting from the followup commit, which cleans up the cache_manager's more mutation-heavy choices: mateor@62e49a3

It needs to be updated but I intend to put up a roughly analogous RB as a followup.

At the moment this throws an error in these circumstances.
This is probably a win from a UX perspective, since we will
finally get some caught exceptions if someone starts
deleting dirs under pants.d. But perhaps you would prefer
to silently restore the working state for now, so let me
know if that is the case.

Also removes no longer needed method and retitles the
InvalidResultsDir() exception to avoid overloading "invalid" any
more than it already is.
Copy link
Contributor

@wisechengyi wisechengyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mateor. a few nits

raise self.IllegalResultsDir(
'\nThe results_dirs state should not be manually cleaned or recreated by tasks.\n{}'.format(errors)
)
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to return any value since it is not used.

docstring += 'i.e. self._result_dir has be to be a link to a dir'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes...I added the return value because things like this often end up being useful as bools, and people naturally would try it. Like added to a should_cache_artifacts method or something. It was an attempt to conform to expectation. Without the return True, `vt.ensure_legal() would be Falsey, which felt confusing.

I would prefer to leave it, but will remove if you let me know you disagree.


# Finally, create the stable symlink.
relative_symlink(current_dir, stable_dir)
def copy_previous_results(self, root_dir):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could privatize

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so - this is called by Task.py after cache is checked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might make better sense to move to Task honestly - it is weird that the cache manager manages that part.

I have a followup that refactors cache_manager, will consider doing that there.

task_output = os.path.join(vt.results_dir, 'a_file')
self.create_file(task_output, 'foo')

def is_empty(self, dirname):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could make this static method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that sounds good. I considered adding it to the base class. Will go ahead and make static for now.

def is_empty(self, dirname):
return not os.listdir(dirname)

def matching_result_dirs(self, vt):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static

# Ensure that the result_dirs contain the same files.
return self.is_empty(vt.results_dir) == self.is_empty(vt.current_results_dir)

def clobber_symlink(self, vt):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static


def matching_result_dirs(self, vt):
# Ensure that the result_dirs contain the same files.
return self.is_empty(vt.results_dir) == self.is_empty(vt.current_results_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it safe to conclude the content of the dirs are the same if neither of them are empty? or should the docstring be updated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this for sure needs to be changed. I was continually testing the file list and decided to break it out into a function - but I did not do a very good job.


# The main protection for this is the exception raised when the cache_manager attempts to mark the VT valid.
self.assertFalse(vt.valid)
vt.update()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc the exception will be raised at vt.update(), would it be more clear to say

<blob>
with self.assertRaises(VersionedTargetSet.IllegalResultsDir):
  vt.update()

instead of

with self.assertRaises(VersionedTargetSet.IllegalResultsDir):
  <blob>
  vt.update()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, will bring above the contextmanager block.

vt = self.make_vt()
self.clobber_symlink(vt)
vts = VersionedTargetSet.from_versioned_targets([vt])
vts.update()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

…subclass.

Also added some additional test coverage and some TODOs.
@mateor
Copy link
Member Author

mateor commented Dec 19, 2016

Okay, followed up on review, thanks @wisechengyi .

@mateor mateor merged commit 6a43414 into pantsbuild:master Dec 19, 2016
stuhood pushed a commit to twitter/pants that referenced this pull request Dec 20, 2016
pantsbuild#4139)

Tasks that sometimes fail due to outside factors (download failures,
resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True)
in order to wipe possibly truncated or crufty state.

But since vt.results_dir is a symlink, that replaced it with a real dir.
That ended up breaking caching, since:
    * Product pipeline was consuming from the vt.results_dir
    * But the artifacts in the cache were created using vt.current_results_dir,
       which no longer got the output.

If a target is invalid, the cache_manager will now wipe the results_dir prior
to passing to the task, removing the need for tasks to clean it themself.
It also now checks to make sure that the results_dir is legal before marking
a VT valid and passing to the artifact_cache.

Also:
* Raise an Exception if relative_symlink tries to unlink a concrete file
* Copy over previous results *only* after checking for cache hits
    * Before this change every file in compile.zinc results_dir was copied twice
       per cache hit, and then thrown away. Now it only copies after a cache miss.

The majority of the change is added test coverage to cover the original
and added behavior.
lenucksi pushed a commit to lenucksi/pants that referenced this pull request Apr 25, 2017
pantsbuild#4139)

Tasks that sometimes fail due to outside factors (download failures,
resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True)
in order to wipe possibly truncated or crufty state.

But since vt.results_dir is a symlink, that replaced it with a real dir.
That ended up breaking caching, since:
    * Product pipeline was consuming from the vt.results_dir
    * But the artifacts in the cache were created using vt.current_results_dir,
       which no longer got the output.

If a target is invalid, the cache_manager will now wipe the results_dir prior
to passing to the task, removing the need for tasks to clean it themself.
It also now checks to make sure that the results_dir is legal before marking
a VT valid and passing to the artifact_cache.

Also:
* Raise an Exception if relative_symlink tries to unlink a concrete file
* Copy over previous results *only* after checking for cache hits
    * Before this change every file in compile.zinc results_dir was copied twice
       per cache hit, and then thrown away. Now it only copies after a cache miss.

The majority of the change is added test coverage to cover the original
and added behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tasks that call safe_mkdir(results_dir, clean=True) silently overwrite symlink with real directory
3 participants