Ensure that invalid vts have results_dir cleaned before passing to ta… #4139

mateor · 2016-12-14T01:26:45Z

Tasks that sometimes fail due to outside factors (download failures,
resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True)
in order to wipe possibly truncated or crufty state.

But since vt.results_dir is a symlink, that replaced it with a real dir.
That ended up breaking caching, since the task output was therefore
never making it into the artifact cache, and other weird
bugs.

This is a small change that deletes the existing directories if
a target is invalid, removing the need for tasks to wipe the
results_dir, and also now checks to make sure that the results_dir
is legal before marking valid and passing to the artifact_caches.

The majority of the change is added test coverage around the breaks.
I have a couple immediate followups that cover an additional failure
state, and reworks the cache_manager to remove some of the harder
to reason about bits.

Closes: #4137

…sks. Tasks that sometimes fail due to outside factors (download failures, resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True) in order to wipe possibly truncated or crufty state. But since vt.results_dir is a symlink, that replaced it with a real dir. That ended up breaking caching, since the task output was therefore never making it into the artifact cache, and other weird bugs. This is a small change that deletes the existing directories if a target is invalid, removing the need for tasks to wipe the results_dir, and also now checks to make sure that the results_dir is legal before marking valid and passing to the artifact_caches. The majority of the change is added test coverage around the breaks. I have a couple immediate followups that cover an additional failure state, and reworks the cache_manager to remove some of the harder to reason about bits. Closes: pantsbuild#4137

mateor · 2016-12-14T01:29:39Z

I am pushing this to CI to get Travis to take a crack at it - this will be my first non-release CR through github so forgive me if I muck the process at all.

You can read whenever you want, but I won't be adding specific reviewers until it passes CI.

ETA: I have two followups getting polished, one of which should also be backported to 1.2.1 (the error message for relative_symlink failure state)

mateor · 2016-12-14T10:40:59Z

I looked into the failures a bit tonight. They are legit, although it wasn't clear to me exactly what went wrong.

I will dig at it tomorrow and add some unit tests coverage around whatever it ends up deriving from.

stuhood · 2016-12-14T20:15:40Z

src/python/pants/invalidation/cache_manager.py

+    # If the vt is invalid, clean. Deletes both because if the stable_dir has somehow been replaced with a real dir,
+    # the relative_symlink call below raises an uncaught exception when it attempts to unlink the real directory.
+    safe_rmtree(current_dir)
+    safe_rmtree(stable_dir)


IMO, it would be better to blow up in the case of a deleted/replaced results_dir symlink to expose an issue in a task, rather than to silently try to fix it.

I have a followup that does that - I broke it out into a separate commit from due to your original comment. Can add here if you prefer.

stuhood · 2016-12-14T20:17:27Z

src/python/pants/invalidation/cache_manager.py

@@ -264,11 +287,13 @@ def __init__(self,
  def update(self, vts):
    """Mark a changed or invalidated VersionedTargetSet as successfully processed."""
    for vt in vts.versioned_targets:
+      vt._ensure_legal()


This method should be public if it will be called here (it needn't be API: public)

oh yes, that is true.

stuhood · 2016-12-14T20:24:53Z

tests/python/pants_test/task/test_task.py

@@ -95,15 +105,17 @@ def test_incremental(self):
    vtB = task.execute()
    self.assertContent(vtB, one + two)

-    # Incremental atop existing directory for vtA.
+    # vtC.previous_cache_key == vtB.cache_key so the task appends the file to vtB's results.
+    # This behavior used to be skipped because the current_dir existed when checked by cache_manager._use_previous_dir.


I thought about this a bit more... the idea was that if we switched back to exactly the same inputs as some previous build with zinc (for example), then zinc would quickly do the right thing: which is noop.

On the other hand, clearing the directory regardless as you're doing now is definitely less error prone then ever reusing an existing directory, so I think your change makes sense.

I would still love to see the test updated before landing this though.

I am tinkering with this now...I think I can break out the previous_results move to happen after the cache check. It is tough to reason about the cache manager code, I have written another followup that deletes a lot (but not all) of the circular referencing and object mutation.

The above test should currently pass (although maybe it doesn't? I am working on a different integration test failure atm). The case it presents is no longer very useful though. Can you elaborate on what you are asking for with:

"I would still love to see the test updated before landing this though."

You added a comment at the top of the method indicating that it should be updated.

Hmmm...yes that sounds like something I would do :)

Okay, I will add a some more coverage. I also believe that I was able to add some additional correctness as mentioned above.

Assuming that the code is working, the incremental task behavior should now be:

full noop if target unchanged from previous run

noop from zinc on cached VT

previous_results copied in before zinc execution for invalid, uncached VT

I will add some tests to show the above, as much as I am able.

I didn't see a check for already populated current_result_dirs under the pants_workdir, just local and remote cache checks. Which surprised me, since the incremental implementation proves Pants is willing to consume from previous result_dirs. I wonder if that case happens often enough to be worth considering. Perhaps rare to the point of not mattering...

I didn't see a check for already populated current_result_dirs under the pants_workdir, just local and remote cache checks. Which surprised me, since the incremental implementation proves Pants is willing to consume from previous result_dirs.

AFAICT, that's actually exactly what this test was doing before: vtA had the same directory as vtC... I'm fine with dropping that behaviour for the reasons mentioned above, but I believe that that is what was being tested here.

I understood why it didn't copy the previous_results (the directory was found to exist) but it was not clear to me that was happening on purpose.

And the task itself doesn't noop, which is what I meant above. The invalidation block runs all three times. Zinc just passively inherits the existing results_dir since the namespace is reused. That may be equivalent to a noop for the zinc tool, for all I know, but either way no other tasks even hit that check, since zinc is the only thing that hits that previous_dir code.

Maybe the build_invalidator could go ahead and be fully namespaced in the same manner as the results_dir, and therefore hold an arbitrary number of VT results. Since those would only be VTs that successfully exited invalidation, the results_dir would only require a wipe if the current_results_dir exists but no hash file was dropped.

I guess that is firmly in the land of a new feature, in any rate, and possibly you know reasons that my speculating is flawed :)

I will be able to provide the contract detailed above. Tasks that are rebuilding cache_keys older than the most recent will consume from the cache, and if not found will be opting out of incremental support since they will no longer passively inherit their results.

Which is maybe preferred when taken as a whole? Cache hits look like they supercede the previous_results, in which case all cached VTs were copying over the previous results in the cache_manager, and then immediately pruning the files within the artifact_cache unpack.

The only case where a zinc task would newly miss out on both artifact cache and incremental results is if the results_dir existed but the artifact was uncached, which is probably pretty rare in practice.

As said in the comment, this has been a long-term uncaught exception, surfacing when os.unlink was called on a directory. This resulted in non-zero exit for Pants, with Operation not Permitted raising from the OS. Should now give a decent msg, but the outcome of Pants in this situation will continue to be an execution-halting exception. This behavior has diverged between relative and absolute symlinks methods. the absolute_symlink checks for dirs, and silently rms them. That behavior should probably be reconciled, one way or another, one of these days. Once a determination is made, it would likely inform the final determination on whether safe_mkdir should be allowed to continue to be symlink agnostic.

This was probably really clogging the IO, since every jvm target was: * copying the entire previous_results_dir over * doing it a second time * checking for cached artifact * if found, parsing the analysis and pruning unneeded results. Now it creates the current VT results_dir, and then checks for the cache. This includes the double check, so the last thing that is done before yielding the VT as invalid is to copy any available previous results. My read on this is that it will save a good chunk of unnecessary system calls.

This appears to save a bunch of unnecessary results_dir copies. I added some new test coverage to exercise this, but there was very limited testing to begin with. I have one more followup commit that significantly cleans up the logic in cache_manager. That can land separatly from this commit, I wanted to get some test coverage in before I tried that one.

mateor · 2016-12-16T13:12:46Z

I added some additional test coverage around the discoveries we discussed in the bug thread. I probably should do another pass over this once it passes CI - will do this afternoon.

WHoops. I have gotten so used to disabling the style checks while developing! Otherwise appears to be good. There is still not enough test coverage around invalidation in general, and incremental tasks specifically. But I have added a lot more then there used to be. I have a logic refactor as a followup, and I will try and continue to add coverage as a part of that.

stuhood

Thanks Mateo... excellent work. Some minor nits, but not blocking probably.

stuhood · 2016-12-16T19:54:43Z

src/python/pants/invalidation/cache_manager.py


-    # Finally, create the stable symlink.
-    relative_symlink(current_dir, stable_dir)
+    # If the target is valid, both directories are assumed to exist.


Maybe worth validating that here in the else clause?

I agree - I have a followup with that change. My hope was to change the behavior here as little as possible in the change, since the coverage was light and we were scheduling to backport to a stable branch.

But I ended up needing to make somewhat significant changes due to moving the previous check, so I can go ahead and add that here safely, I think.

stuhood · 2016-12-16T19:58:57Z

src/python/pants/invalidation/cache_manager.py

+    # Set the self._previous last, so that it is only True after the copy completed.
+    self._previous_results_dir = previous_path
+
+  def _use_previous_dir(self, allow_incremental, root_dir):


Unused method? It returns the self._previous_results_dir, which wouldn't be set when this method was called.

Yep, thanks for the catch. All the functions of this method are covered elsewhere, so will delete.

It was a miss when splitting from the followup commit, which cleans up the cache_manager's more mutation-heavy choices: mateor@62e49a3

It needs to be updated but I intend to put up a roughly analogous RB as a followup.

At the moment this throws an error in these circumstances. This is probably a win from a UX perspective, since we will finally get some caught exceptions if someone starts deleting dirs under pants.d. But perhaps you would prefer to silently restore the working state for now, so let me know if that is the case. Also removes no longer needed method and retitles the InvalidResultsDir() exception to avoid overloading "invalid" any more than it already is.

wisechengyi

Thanks @mateor. a few nits

wisechengyi · 2016-12-19T01:47:22Z

src/python/pants/invalidation/cache_manager.py

+        raise self.IllegalResultsDir(
+          '\nThe results_dirs state should not be manually cleaned or recreated by tasks.\n{}'.format(errors)
+        )
+    return True


no need to return any value since it is not used.

docstring += 'i.e. self._result_dir has be to be a link to a dir'

Yes...I added the return value because things like this often end up being useful as bools, and people naturally would try it. Like added to a should_cache_artifacts method or something. It was an attempt to conform to expectation. Without the return True, `vt.ensure_legal() would be Falsey, which felt confusing.

I would prefer to leave it, but will remove if you let me know you disagree.

wisechengyi · 2016-12-19T02:25:10Z

src/python/pants/invalidation/cache_manager.py


-    # Finally, create the stable symlink.
-    relative_symlink(current_dir, stable_dir)
+  def copy_previous_results(self, root_dir):


could privatize

I don't think so - this is called by Task.py after cache is checked.

Might make better sense to move to Task honestly - it is weird that the cache manager manages that part.

I have a followup that refactors cache_manager, will consider doing that there.

wisechengyi · 2016-12-19T02:40:34Z

tests/python/pants_test/invalidation/test_cache_manager.py

+    task_output = os.path.join(vt.results_dir, 'a_file')
+    self.create_file(task_output, 'foo')
+
+  def is_empty(self, dirname):


could make this static method.

yes, that sounds good. I considered adding it to the base class. Will go ahead and make static for now.

wisechengyi · 2016-12-19T02:47:08Z

tests/python/pants_test/invalidation/test_cache_manager.py

+  def is_empty(self, dirname):
+    return not os.listdir(dirname)
+
+  def matching_result_dirs(self, vt):


wisechengyi · 2016-12-19T02:47:13Z

tests/python/pants_test/invalidation/test_cache_manager.py

+    # Ensure that the result_dirs contain the same files.
+    return self.is_empty(vt.results_dir) == self.is_empty(vt.current_results_dir)
+
+  def clobber_symlink(self, vt):


wisechengyi · 2016-12-19T02:50:04Z

tests/python/pants_test/invalidation/test_cache_manager.py

+
+  def matching_result_dirs(self, vt):
+    # Ensure that the result_dirs contain the same files.
+    return self.is_empty(vt.results_dir) == self.is_empty(vt.current_results_dir)


is it safe to conclude the content of the dirs are the same if neither of them are empty? or should the docstring be updated?

Yeah, this for sure needs to be changed. I was continually testing the file list and decided to break it out into a function - but I did not do a very good job.

wisechengyi · 2016-12-19T02:56:44Z

tests/python/pants_test/invalidation/test_cache_manager.py

+
+      # The main protection for this is the exception raised when the cache_manager attempts to mark the VT valid.
+      self.assertFalse(vt.valid)
+      vt.update()


iiuc the exception will be raised at vt.update(), would it be more clear to say

<blob> with self.assertRaises(VersionedTargetSet.IllegalResultsDir): vt.update()

instead of

with self.assertRaises(VersionedTargetSet.IllegalResultsDir): <blob> vt.update()

agree, will bring above the contextmanager block.

wisechengyi · 2016-12-19T02:57:44Z

tests/python/pants_test/invalidation/test_cache_manager.py

+      vt = self.make_vt()
+      self.clobber_symlink(vt)
+      vts = VersionedTargetSet.from_versioned_targets([vt])
+      vts.update()


…subclass. Also added some additional test coverage and some TODOs.

mateor · 2016-12-19T18:35:28Z

Okay, followed up on review, thanks @wisechengyi .

pantsbuild#4139) Tasks that sometimes fail due to outside factors (download failures, resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True) in order to wipe possibly truncated or crufty state. But since vt.results_dir is a symlink, that replaced it with a real dir. That ended up breaking caching, since: * Product pipeline was consuming from the vt.results_dir * But the artifacts in the cache were created using vt.current_results_dir, which no longer got the output. If a target is invalid, the cache_manager will now wipe the results_dir prior to passing to the task, removing the need for tasks to clean it themself. It also now checks to make sure that the results_dir is legal before marking a VT valid and passing to the artifact_cache. Also: * Raise an Exception if relative_symlink tries to unlink a concrete file * Copy over previous results *only* after checking for cache hits * Before this change every file in compile.zinc results_dir was copied twice per cache hit, and then thrown away. Now it only copies after a cache miss. The majority of the change is added test coverage to cover the original and added behavior.

stuhood reviewed Dec 14, 2016

View reviewed changes

mateor added 3 commits December 15, 2016 21:09

mateor requested review from peiyuwang and gmalmquist December 16, 2016 13:12

stuhood approved these changes Dec 16, 2016

View reviewed changes

mateor added 2 commits December 16, 2016 21:56

Update create_results_dir docs,now all results_dirs are checked.

648e9c0

wisechengyi approved these changes Dec 19, 2016

View reviewed changes

mateor added 2 commits December 19, 2016 18:28

Make some common methods static, remove the custom CacheKeyGenerator …

cb3a227

…subclass. Also added some additional test coverage and some TODOs.

Missed one

b462567

Make has_previous_dir depend on the directory existing.

4fbc4d4

mateor merged commit 6a43414 into pantsbuild:master Dec 19, 2016

mateor mentioned this pull request Apr 29, 2018

Some streamlining of the artifact cache code. #5766

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure that invalid vts have results_dir cleaned before passing to ta… #4139

Ensure that invalid vts have results_dir cleaned before passing to ta… #4139

mateor commented Dec 14, 2016 •

edited

mateor commented Dec 14, 2016 •

edited

mateor commented Dec 14, 2016

stuhood Dec 14, 2016

mateor Dec 14, 2016

stuhood Dec 14, 2016

mateor Dec 14, 2016

stuhood Dec 14, 2016

mateor Dec 14, 2016

stuhood Dec 14, 2016

mateor Dec 14, 2016

stuhood Dec 14, 2016

mateor Dec 14, 2016 •

edited

mateor commented Dec 16, 2016

stuhood left a comment

stuhood Dec 16, 2016

mateor Dec 16, 2016

stuhood Dec 16, 2016

mateor Dec 16, 2016

wisechengyi left a comment

wisechengyi Dec 19, 2016

mateor Dec 19, 2016

wisechengyi Dec 19, 2016

mateor Dec 19, 2016

mateor Dec 19, 2016

wisechengyi Dec 19, 2016

mateor Dec 19, 2016

wisechengyi Dec 19, 2016

wisechengyi Dec 19, 2016

wisechengyi Dec 19, 2016

mateor Dec 19, 2016

wisechengyi Dec 19, 2016

mateor Dec 19, 2016

wisechengyi Dec 19, 2016

mateor commented Dec 19, 2016

Ensure that invalid vts have results_dir cleaned before passing to ta… #4139

Ensure that invalid vts have results_dir cleaned before passing to ta… #4139

Conversation

mateor commented Dec 14, 2016 • edited

mateor commented Dec 14, 2016 • edited

mateor commented Dec 14, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mateor Dec 14, 2016 • edited

Choose a reason for hiding this comment

mateor commented Dec 16, 2016

stuhood left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wisechengyi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mateor commented Dec 19, 2016

mateor commented Dec 14, 2016 •

edited

mateor commented Dec 14, 2016 •

edited

mateor Dec 14, 2016 •

edited