local exec healthchecks #703

coryschwartz · 2020-03-18T05:52:35Z

Final fix for #660

Adds healthchecks, with fixes, for Prometheus and the Pushgateway to the local:exec runner.

I tried to lay this out so it would be less repetitive to add more checks if needed in the future.

First time running (so they all need to be fixed):

↬ ./testground healthcheck --runner local:exec
Mar 18 05:28:34.409635	INFO	testground client initialized	{"addr": "localhost:8080"}
checking runner local:exec
finished checking runner local:exec
Checks:
- local-redis: ok; local-redis instance check: OK
- local-prometheus: ok; local-prometheus instance check: OK
- local-pushgateway: ok; local-pushgateway instance check: OK
Fixes:
- local-redis: ok; local-redis instance check: OK
- local-prometheus: ok; local-prometheus instance check: OK
- local-pushgateway: ok; local-pushgateway instance check: OK

Second time running, now that they have all started:

↬ ./testground healthcheck --runner local:exec
Mar 18 05:28:36.364931	INFO	testground client initialized	{"addr": "localhost:8080"}
checking runner local:exec
finished checking runner local:exec
Checks:
- local-redis: ok; local-redis instance check: OK
- local-prometheus: ok; local-prometheus instance check: OK
- local-pushgateway: ok; local-pushgateway instance check: OK
No fixes applied.

Additionally, the process context is canceled and all of the processes are killed, as expected, when the daemon closes.

raulk · 2020-03-18T11:47:17Z

Going to put this in the queue -- focusing on releasing v0.3 today, and this feature was planned for v0.4. Nice work getting ahead of the curve, @coryschwartz!

nonsense · 2020-03-19T13:43:46Z

docs/USAGE.md

+In order to run the tests with the local:exec runner, there are a few things that must be taken care of first.
+1. install required test software
+  * redis (redis.io)
+  * prometheus (prometheus.io)


Are prometheus and prometheus-pushgateway required components of testground now? I thought that monitoring components (especially when run with local:exec, local:docker, etc.) are mostly optional.

If prometheus crashes for some reason, what is the expected outcome of a testplan (assuming the testplan is correct and passing)?

coryschwartz · Mar 19, 2020

I think that is a completely valid concern. I'm not particularly a fan of having these as required infrastructure, but having them opportunistically available would probably be a good thing.

What do you think about this: if the pushgateway is unavailable and we can't start it, don't fail the healthcheck, but don't report a healthy OK either.

This way it starts automatically if it can, and doesn't prevent you from running without it.

raulk · Mar 19, 2020

I regard them as required. We offer a metric API that wires directly onto Prometheus. Plans using that API will just end up sending stuff to /dev/null if those components are optional, and the environment doesn’t have them. Why would there be any benefit in making them optional?

nonsense · Mar 20, 2020

In general you don't always care about metrics, for example in tests. Would we have to set up prometheus as part of our unit and integration tests for modules that are instrumented?

Most previous projects I've worked on have had a way to disable metrics, so that they don't incur a performance cost when you don't need them. Generally this is done for tracing, but there is no reason not to do it for metrics as well.
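
One common way to get such a switch is a no-op implementation behind an interface. The sketch below is an illustration only: `Sink`, `noopSink`, `memorySink`, and `NewSink` are hypothetical names, and the in-memory sink merely stands in for a real Prometheus-backed one.

```go
package main

import "fmt"

// Sink is a hypothetical interface that instrumented code records metrics through.
type Sink interface {
	Record(name string, value float64)
}

// noopSink discards everything; it is what callers get when metrics are disabled.
type noopSink struct{}

func (noopSink) Record(string, float64) {}

// memorySink keeps points in memory and stands in for a real metrics backend.
type memorySink struct {
	points map[string][]float64
}

func (m *memorySink) Record(name string, value float64) {
	m.points[name] = append(m.points[name], value)
}

// NewSink returns a recording sink when enabled and a no-op sink otherwise,
// so instrumented code never has to branch on configuration itself.
func NewSink(enabled bool) Sink {
	if !enabled {
		return noopSink{}
	}
	return &memorySink{points: make(map[string][]float64)}
}

func main() {
	for _, enabled := range []bool{true, false} {
		sink := NewSink(enabled)
		sink.Record("dial_latency_ms", 12.5) // same call site in both modes

		_, recording := sink.(*memorySink)
		fmt.Printf("enabled=%v recording=%v\n", enabled, recording)
	}
}
```

Unit tests for instrumented modules could then construct the disabled sink and run without any monitoring infrastructure.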

nonsense · 2020-03-20T19:05:29Z

pkg/runner/local_exec.go

+			}
+			// Checker failed, try to fix.
+			err := hcp.Fixer()
+			if err == nil {


Nitpick, but it's probably a good idea to switch err == nil to err != nil; it makes the code easier to read.

nonsense left a comment

Overall LGTM, but I think we should consider metrics functionality as optional, and/or have a way to disable it (for example for the purpose of tests).

Also, if we start having problems with Prometheus (not that I think this will happen), I don't think this should have any effect on the actual runs of testplans; we should just lose the measurements.

local exec healthchecks

e4b27eb

raulk self-requested a review March 18, 2020 11:47

Robmat05 added this to the Testground v0.4 milestone Mar 18, 2020

Robmat05 assigned coryschwartz Mar 18, 2020

Robmat05 added the status/waiting label Mar 18, 2020

coryschwartz requested a review from nonsense March 18, 2020 23:38

nonsense reviewed Mar 19, 2020

coryschwartz mentioned this pull request Mar 19, 2020

DRY-er healthchecks and fixes code #720

Closed

2 tasks

nonsense reviewed Mar 20, 2020

nonsense approved these changes Mar 20, 2020

Cory Schwartz added 4 commits March 20, 2020 23:47

swap == nil for != nil

f4c055b

Merge branch 'master' into feat/local-exec-healthchecks

e131767

fix lint

c6a0e11

Merge branch 'master' into feat/local-exec-healthchecks

6f6666f

coryschwartz merged commit d295689 into master Mar 23, 2020

coryschwartz deleted the feat/local-exec-healthchecks branch March 23, 2020 04:15

Robmat05 added status/done and removed status/waiting labels Mar 23, 2020

coryschwartz mentioned this pull request Mar 24, 2020

DRY out the healthcheck api; use containers for local:exec dependencies; wire context to healthchecks #734

Merged

aschmahmann pushed a commit that referenced this pull request Mar 24, 2020

local exec healthchecks (#703)

3be44c9

* local exec healthchecks * swap == nil for != nil * fix lint

coryschwartz commented Mar 18, 2020

raulk commented Mar 18, 2020 (edited)

nonsense Mar 19, 2020

coryschwartz Mar 19, 2020

raulk Mar 19, 2020

nonsense Mar 20, 2020

nonsense Mar 20, 2020

nonsense left a comment
