Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Gating and Crash Dumping #63
Conversation
johnsca
reviewed
Dec 23, 2016
I got an error after the test failure: http://pastebin.ubuntu.com/23674398/
Looks like glitch somehow continued after the failure. That seems more like an issue with the handling of the TestFailure exception than this change, though.
The one thing we're still missing with this is for the gated test failure to be surfaced via the Matrix exit code.
| @@ -9,21 +9,5 @@ | ||
| "periodic": 5 | ||
| "do": | ||
| "task": matrix.tasks.health | ||
| + "gating": true |
johnsca
Dec 23, 2016
Owner
Not that it matters overmuch, but I thought our discussion had been to put it at the rule level instead of the arg level? TBH, I would be in favor of moving task up to the top level as well, doing away with do and replacing it with an explicit args or task_args field, to remove the unnecessary level of nesting. At any rate, for something that is supported in some way across all tasks, I'm inclined to have it at the rule level.
petevg
Dec 23, 2016
Collaborator
From our IRC convo:
@johnsca for the gating stuff, we actually want health and matrix not to gate by default, even though "gating" defaults to "True" at a higher level, which is why I put the gating check in "task".
13:52 cory_fu: we can discuss it after the break ... we might need to do a more extensive rethink of our logic.
13:53 (Or just add "gating: false" to a lot of our tests, rather than "gating: true" to some of them ...)
| + try: | ||
| + await utils.crashdump(log=log) | ||
| + except Exception as e: | ||
| + log.exception("Error while running crashdump.") |
petevg
Dec 23, 2016
Collaborator
@johnsca log.exception will print out the trace, which contains the error.
| + actionf.__name__, type(e), e)) | ||
| + errors = True | ||
| + if errors and task.gating: | ||
| + raise TestFailure(task, "Exceptions were raised during glitch run.") |
johnsca
Dec 23, 2016
Owner
On the other hand, perhaps we should consider making task exceptions always gating, since it seems like they'd indicate a bug in Matrix.
petevg
Dec 23, 2016
Collaborator
@johnsca In this case, the exception is raised because of a bug in core, not in matrix (the reboot thing); I think that we always want to talk about the exception, but we only want to gate if we've set matrix to "gating".
| @@ -53,4 +58,9 @@ | ||
| else: | ||
| _log = rule.log.info | ||
| _log("Health check: %s", result) | ||
| + | ||
| + if result == 'unhealthy' and task.gating: |
johnsca
Dec 23, 2016
Owner
If we do move gating up to the rule level, I think it would just be rule.gating without the additional assignment above.
petevg
Jan 4, 2017
Collaborator
@johnsca I think that I'm happy w/ @bcsaller's design of sticking the flag in task (though he called it fail_fast rather than gating). That way, you can do things like have a glitch that doesn't gate on exceptions, but does gate on an unhealthy status in the cloud via a gating health check.
(Also, all the code is written this way, and it would not be compatible with the deadline to tear it up and move it up to the rule level :-p)
| @@ -8,3 +8,4 @@ juju>=0.0.1,<1.0.0 | ||
| #git+https://github.com/juju-solutions/python-libjuju.git#egg=juju | ||
| enforce==0.3.1 | ||
| ubuntui>=0.1.1,<1.0.0 | ||
| +jujuplugins |
petevg
Dec 23, 2016
Collaborator
@johnsca I figured that there was no harm in leaving it in. Saves the trouble for when we fix the juju ssh bug and uncomment code.
johnsca
Jan 4, 2017
Owner
I thought we had decided to drop installing the plugins as part of matrix and leave it up to a higher level (i.e., layer-cwr). Then we'd just use crashdump if it was already installed in the system, otherwise skip it (like how bundletester handles matrix). I already added code to layer-cwr to install the plugins there.
| +"tests": | ||
| +- "name": Verify gating | ||
| + "description": > | ||
| + This test should fail! We're verifying that our gating on health works. |
johnsca
Dec 23, 2016
Owner
If core ever fixes the issue with trusty, this will stop failing. At some point it would probably be worth adding a force_unhealthy task that uses juju_model.run('status-set error') (if that works). Then we could use the basic bundle as well to make the test (marginally) faster.
petevg
Dec 23, 2016
Collaborator
@johnsca Agree. We neeed a better way of breaking stuff. I think that we should do that as a separate ticket, though, as this works in the context of our deadline.
| @@ -368,8 +369,13 @@ def allow_event(e): | ||
| self.handle_shutdown(None) | ||
| if self.fail_fast and e.task.gating is True: | ||
| log.info("Fail Fast on Test failure: %s" % test.name) | ||
| + self.exit_code = 1 |
petevg
Dec 23, 2016
Collaborator
This check isn't actually catching our TestFailure, so setting the .exit_code here doesn't work. :-/
Need to investigate and fix ...
|
Pushed some updates to this PR. Don't merge it, though: our gating code isn't working correctly, and we need to hammer on it some to figure out why. |
|
@johnsca: This is working and ready to review! To verify that the exit code 1 works, try: To sanity check: |
|
@johnsca Just to follow-up on that failure that you saw: it should be fixed now. And by "fixed", I mean that glitch should catch the error, log about it, and continue, then raise a TestFailure after it is all done. (An alternative would be to raise the test failure right away, but due to the nature of glitch, it felt more interesting to continue.) |
| - self.handle_shutdown(None) | ||
| - if self.fail_fast and e.task.gating is True: | ||
| - log.info("Fail Fast on Test failure: %s" % test.name) | ||
| - break | ||
| finally: |
petevg
Jan 4, 2017
Collaborator
The TestFailure gets caught and handled elsewhere (in the asyncio lib), so this code never gets executed. Chopping it out so that it won't confuse future devs as much as it confused me :-)
| @@ -10,20 +10,3 @@ | ||
| "do": | ||
| "task": matrix.tasks.health | ||
| "until": health.status.healthy | ||
| -- "name": traffic | ||
| - "description": Traffic in the face of Chaos |
johnsca
Jan 4, 2017
Owner
Why is this being removed? I thought the whole point was to make it non-gating so that we could leave it in.
petevg
Jan 4, 2017
Collaborator
@johnsca It may be non gating, but glitch will still cause uncaught exceptions, which will lead to test failures (you were adamantly against me adding a generic catch for Exception in glitch).
Since traffic isn't implemented anyway, I think that it's better to leave it out of the default tests for now.
johnsca
Jan 4, 2017
Owner
Why is it throwing uncaught exceptions? I'm against a blanket catch because it will mask bugs, but if there are exceptions we know are coming up for reasons that aren't a bug in glitch but that indicate a failure, then we should catch them and turn them in to TestFailures.
I know that traffic isn't implemented yet, but the whole point of adding the gating logic was specifically so we wouldn't have to remove glitch entirely from the default suite; otherwise, we could have just made failures always gating.
| if exceptions: | ||
| - for t in exceptions: | ||
| + for t, e in exceptions: | ||
| s = io.StringIO() | ||
| t.print_stack(file=s) | ||
| log.error("Exception processing test: %s\n%s", |
| + | ||
| + message = "Deliberate Test Failure" | ||
| + | ||
| + if task.args.get('exception_type') == 'TestFailure': |
johnsca
Jan 4, 2017
Owner
I feel like TestFailure should be the default with the override to Exception. Also, with the param named exception_type implies that you could specify an arbitrary exception class. Maybe just make it a boolean exception: true
| @@ -8,3 +8,4 @@ juju>=0.0.1,<1.0.0 | ||
| #git+https://github.com/juju-solutions/python-libjuju.git#egg=juju | ||
| enforce==0.3.1 | ||
| ubuntui>=0.1.1,<1.0.0 | ||
| +jujuplugins |
petevg
Dec 23, 2016
Collaborator
@johnsca I figured that there was no harm in leaving it in. Saves the trouble for when we fix the juju ssh bug and uncomment code.
johnsca
Jan 4, 2017
Owner
I thought we had decided to drop installing the plugins as part of matrix and leave it up to a higher level (i.e., layer-cwr). Then we'd just use crashdump if it was already installed in the system, otherwise skip it (like how bundletester handles matrix). I already added code to layer-cwr to install the plugins there.
petevg commentedDec 23, 2016
Wired up health check to gating logic.
Simplifies our basic deploy test, and allows it to fail if things aren't
healthy after a deploy. This accomplishes one of matrix's basic goals:
to replace all the "just deploy" amulet tests that charmers write.
Added crash dump.
After each test, we zip up some relevant info about the test. For now,
this info simply comprises the matrix log. Once we fix an issue w/ 'juju
ssh' not working on our test environment, we can uncomment some of the
code in this commit, and get a more complete crash dump.
@bcsaller @johnsca