Gating and Crash Dumping #63

Merged
merged 12 commits into from Jan 4, 2017

Conversation

Projects
None yet
2 participants
Collaborator

petevg commented Dec 23, 2016

Wired up health check to gating logic.

Simplifies our basic deploy test, and allows it to fail if things aren't
healthy after a deploy. This accomplishes one of matrix's basic goals:
to replace all the "just deploy" amulet tests that charmers write.

Added crash dump.

After each test, we zip up some relevant info about the test. For now,
this info simply comprises the matrix log. Once we fix an issue w/ 'juju
ssh' not working on our test environment, we can uncomment some of the
code in this commit, and get a more complete crash dump.

@bcsaller @johnsca

I got an error after the test failure: http://pastebin.ubuntu.com/23674398/

Looks like glitch somehow continued after the failure. That seems more like an issue with the handling of the TestFailure exception than this change, though.

The one thing we're still missing with this is for the gated test failure to be surfaced via the Matrix exit code.

matrix/matrix.yaml
@@ -9,21 +9,5 @@
"periodic": 5
"do":
"task": matrix.tasks.health
+ "gating": true
@johnsca

johnsca Dec 23, 2016

Owner

Not that it matters overmuch, but I thought our discussion had been to put it at the rule level instead of the arg level? TBH, I would be in favor of moving task up to the top level as well, doing away with do and replacing it with an explicit args or task_args field, to remove the unnecessary level of nesting. At any rate, for something that is supported in some way across all tasks, I'm inclined to have it at the rule level.

@petevg

petevg Dec 23, 2016

Collaborator

From our IRC convo:

@johnsca for the gating stuff, we actually want health and matrix not to gate by default, even though "gating" defaults to "True" at a higher level, which is why I put the gating check in "task".
13:52 cory_fu: we can discuss it after the break ... we might need to do a more extensive rethink of our logic.
13:53 (Or just add "gating: false" to a lot of our tests, rather than "gating: true" to some of them ...)

+ try:
+ await utils.crashdump(log=log)
+ except Exception as e:
+ log.exception("Error while running crashdump.")
@johnsca

johnsca Dec 23, 2016

Owner

Should log what the error was.

@petevg

petevg Dec 23, 2016

Collaborator

@johnsca log.exception will print out the trace, which contains the error.

matrix/tasks/glitch/main.py
+ actionf.__name__, type(e), e))
+ errors = True
+ if errors and task.gating:
+ raise TestFailure(task, "Exceptions were raised during glitch run.")
@johnsca

johnsca Dec 23, 2016

Owner

On the other hand, perhaps we should consider making task exceptions always gating, since it seems like they'd indicate a bug in Matrix.

@petevg

petevg Dec 23, 2016

Collaborator

@johnsca In this case, the exception is raised because of a bug in core, not in matrix (the reboot thing); I think that we always want to talk about the exception, but we only want to gate if we've set matrix to "gating".

@@ -53,4 +58,9 @@
else:
_log = rule.log.info
_log("Health check: %s", result)
+
+ if result == 'unhealthy' and task.gating:
@johnsca

johnsca Dec 23, 2016

Owner

If we do move gating up to the rule level, I think it would just be rule.gating without the additional assignment above.

@petevg

petevg Jan 4, 2017

Collaborator

@johnsca I think that I'm happy w/ @bcsaller's design of sticking the flag in task (though he called it fail_fast rather than gating). That way, you can do things like have a glitch that doesn't gate on exceptions, but does gate on an unhealthy status in the cloud via a gating health check.

(Also, all the code is written this way, and it would not be compatible with the deadline to tear it up and move it up to the rule level :-p)

requirements.txt
@@ -8,3 +8,4 @@ juju>=0.0.1,<1.0.0
#git+https://github.com/juju-solutions/python-libjuju.git#egg=juju
enforce==0.3.1
ubuntui>=0.1.1,<1.0.0
+jujuplugins
@johnsca

johnsca Dec 23, 2016

Owner

This should be removed, eh? (Along with the binary in the wheelhouse)

@petevg

petevg Dec 23, 2016

Collaborator

@johnsca I figured that there was no harm in leaving it in. Saves the trouble for when we fix the juju ssh bug and uncomment code.

@johnsca

johnsca Jan 4, 2017

Owner

I thought we had decided to drop installing the plugins as part of matrix and leave it up to a higher level (i.e., layer-cwr). Then we'd just use crashdump if it was already installed in the system, otherwise skip it (like how bundletester handles matrix). I already added code to layer-cwr to install the plugins there.

tests/test_health_gating.matrix
+"tests":
+- "name": Verify gating
+ "description": >
+ This test should fail! We're verifying that our gating on health works.
@johnsca

johnsca Dec 23, 2016

Owner

If core ever fixes the issue with trusty, this will stop failing. At some point it would probably be worth adding a force_unhealthy task that uses juju_model.run('status-set error') (if that works). Then we could use the basic bundle as well to make the test (marginally) faster.

@petevg

petevg Dec 23, 2016

Collaborator

@johnsca Agree. We neeed a better way of breaking stuff. I think that we should do that as a separate ticket, though, as this works in the context of our deadline.

matrix/rules.py
@@ -368,8 +369,13 @@ def allow_event(e):
self.handle_shutdown(None)
if self.fail_fast and e.task.gating is True:
log.info("Fail Fast on Test failure: %s" % test.name)
+ self.exit_code = 1
@petevg

petevg Dec 23, 2016

Collaborator

This check isn't actually catching our TestFailure, so setting the .exit_code here doesn't work. :-/

Need to investigate and fix ...

Collaborator

petevg commented Dec 23, 2016

Pushed some updates to this PR. Don't merge it, though: our gating code isn't working correctly, and we need to hammer on it some to figure out why.

Collaborator

petevg commented Jan 3, 2017

@johnsca: This is working and ready to review!

To verify that the exit code 1 works, try:
matrix -s raw -l DEBUG -p path/to/wiki-simple -D tests/test_health_gating.matrix || echo "test succeeded!"

To sanity check:
matrix -s raw -l DEBUG -p path/to/wiki-simple && echo "test worked!"

Collaborator

petevg commented Jan 4, 2017

@johnsca Just to follow-up on that failure that you saw: it should be fixed now. And by "fixed", I mean that glitch should catch the error, log about it, and continue, then raise a TestFailure after it is all done. (An alternative would be to raise the test failure right away, but due to the nature of glitch, it felt more interesting to continue.)

- self.handle_shutdown(None)
- if self.fail_fast and e.task.gating is True:
- log.info("Fail Fast on Test failure: %s" % test.name)
- break
finally:
@petevg

petevg Jan 4, 2017

Collaborator

The TestFailure gets caught and handled elsewhere (in the asyncio lib), so this code never gets executed. Chopping it out so that it won't confuse future devs as much as it confused me :-)

matrix/matrix.yaml
@@ -10,20 +10,3 @@
"do":
"task": matrix.tasks.health
"until": health.status.healthy
-- "name": traffic
- "description": Traffic in the face of Chaos
@johnsca

johnsca Jan 4, 2017

Owner

Why is this being removed? I thought the whole point was to make it non-gating so that we could leave it in.

@petevg

petevg Jan 4, 2017

Collaborator

@johnsca It may be non gating, but glitch will still cause uncaught exceptions, which will lead to test failures (you were adamantly against me adding a generic catch for Exception in glitch).

Since traffic isn't implemented anyway, I think that it's better to leave it out of the default tests for now.

@johnsca

johnsca Jan 4, 2017

Owner

Why is it throwing uncaught exceptions? I'm against a blanket catch because it will mask bugs, but if there are exceptions we know are coming up for reasons that aren't a bug in glitch but that indicate a failure, then we should catch them and turn them in to TestFailures.

I know that traffic isn't implemented yet, but the whole point of adding the gating logic was specifically so we wouldn't have to remove glitch entirely from the default suite; otherwise, we could have just made failures always gating.

matrix/rules.py
if exceptions:
- for t in exceptions:
+ for t, e in exceptions:
s = io.StringIO()
t.print_stack(file=s)
log.error("Exception processing test: %s\n%s",
@johnsca

johnsca Jan 4, 2017

Owner

I feel like TestFailures should not be logged as exceptions.

matrix/tasks/fail.py
+
+ message = "Deliberate Test Failure"
+
+ if task.args.get('exception_type') == 'TestFailure':
@johnsca

johnsca Jan 4, 2017

Owner

I feel like TestFailure should be the default with the override to Exception. Also, with the param named exception_type implies that you could specify an arbitrary exception class. Maybe just make it a boolean exception: true

requirements.txt
@@ -8,3 +8,4 @@ juju>=0.0.1,<1.0.0
#git+https://github.com/juju-solutions/python-libjuju.git#egg=juju
enforce==0.3.1
ubuntui>=0.1.1,<1.0.0
+jujuplugins
@johnsca

johnsca Dec 23, 2016

Owner

This should be removed, eh? (Along with the binary in the wheelhouse)

@petevg

petevg Dec 23, 2016

Collaborator

@johnsca I figured that there was no harm in leaving it in. Saves the trouble for when we fix the juju ssh bug and uncomment code.

@johnsca

johnsca Jan 4, 2017

Owner

I thought we had decided to drop installing the plugins as part of matrix and leave it up to a higher level (i.e., layer-cwr). Then we'd just use crashdump if it was already installed in the system, otherwise skip it (like how bundletester handles matrix). I already added code to layer-cwr to install the plugins there.

petevg added some commits Dec 23, 2016

Gating and Crash Dumping
Wired up health check to gating logic.

Simplifies our basic deploy test, and allows it to fail if things aren't
healthy after a deploy. This accomplishes one of matrix's basic goals:
to replace all the "just deploy" amulet tests that charmers write.

Added crash dump.

After each test, we zip up some relevant info about the test. For now,
this info simply comprises the matrix log. Once we fix an issue w/ 'juju
ssh' not working on our test environment, we can uncomment some of the
code in this commit, and get a more complete crash dump.
Slightly better (but still broken) gating
Addresses some (but not all) issues w/ PR.
Set matrix.exit_code in a place that actually works.
Our TestFailure wasn't being caught in the place that we thought that it
was being caught.
Updated test_health_gating
We not verify that a success doesn't wipe out a failure.
Catch TestFaiure in a better place.
Also handle generic uncaught Exception.
Moved log message.
Was logging about doing something even if we didn't pass to check to see
if we wanted to do it. Whoops.
Wrote better tests for gating stuff.
Found and fixed an issue with them.

@johnsca johnsca merged commit b728bd2 into master Jan 4, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment