Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus monitoring, forever loop + sleep and some refactoring #14

Merged
merged 41 commits into from
Nov 30, 2018

Conversation

robertodauria
Copy link
Contributor

@robertodauria robertodauria commented Nov 21, 2018

Summary of changes:

  • gmx_site_maintenance is taken into account when looking for offline switches
  • Prometheus metrics are exposed (default port: 9999)
  • Rebot is not one-shot anymore, but runs an infinite loop sleeping 30 minutes between checks.
  • Terminating it with CTRL+C (SIGINT) or SIGTERM does a proper cleanup (writing the history file)
  • Added rebot_last_rebooted metric, exposing machine+site that were rebooted on the last run (WIP)

This change is Reviewable

This is in preparation to adding Prometheus instrumentation.
Make sure that when Rebot is terminated by SIGINT or SIGTERM the history file
is written to disk.
TODO: make sure rebootMany() gets the site label.
@robertodauria robertodauria changed the title Change switchQuery to check gmx_site_maintenance. Prometheus monitoring, forever loop + sleep and some refactoring Nov 26, 2018
…d tests.

With this change it's possible to expose the "site" field in Prometheus
metrics as well.
TODO: make it testable again.
@pboothe pboothe self-requested a review November 26, 2018 15:41
@pboothe
Copy link
Contributor

pboothe commented Nov 26, 2018

Quick feedback: please add a .travis.yml file to test the code in Travis-CI. You can copy from github.com/m-lab/pusher if your tests are complicated and need authentication or you can copy from github.com/m-lab/index2ip if your tests are simple.

I have added this repo to Travis and Coveralls.io, so once you add a .travis.yml file, everything should "just work".

@pboothe
Copy link
Contributor

pboothe commented Nov 26, 2018

Please put prometheus.go, history.go, and reboot.go (and their tests) each in their own subpackages. I suggest naming the packages healthcheck, history, and reboot. The moving of that code will cause build breakages, but those breakages will help expose the parts of the code where the separation of concerns has broken down a little, and suggest further cleanups.

Once the movement and subsequent fixes are done, please let me know and I will start my next round of review

Copy link
Contributor

@pboothe pboothe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comments.

.travis.yml Outdated
- go get github.com/mattn/goveralls

script:
- $GOPATH/bin/goveralls -service=travis-ci
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goveralls is only reporting the coverage for the main package. Try $GOPATH/bin/goveralls -service=travis-ci -package=./... instead.

and if that doesn't work, try the script pusher uses: https://github.com/m-lab/pusher/blob/master/.travis.yml#L19

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the go team fixed things!
Now you should make your script be:

- go test -covermode=count -coverprofile=profile.cov ./...
- goveralls -coverprofile=profile.cov -service=travis-ci

This works for Go 1.10 and later!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified to work in m-lab/pusher#36

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That worked. Thanks!

@robertodauria
Copy link
Contributor Author

Done.

Copy link
Contributor

@pboothe pboothe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small changes suggested, but this is definitely the last round of review - things look good!

Please get coverage up to 100% in each subpackage - it looks like there are just a few easy error cases and then deleting some code which became unused after refactoring.

Please move some functionality out of cheackAndReboot and into the healthcheck module - doing so will allow the healthcheck module to export only one function but still have all the functionality it needs.

Please normalize your flag usage - no need to start variable names with f, no periods in flag names, and please consider using flagx.ArgsFromEnv, as it may simplify your life.

This looks good!

Reviewed 1 of 10 files at r3.
Reviewable status: 1 of 12 files reviewed, 11 unresolved discussions (waiting on @pboothe, @evfirerob, and @stephen-soltesz)


main.go, line 49 at r3 (raw file):

	oneshot         bool

	fDryRun     bool

no need to name variables manipulated by flags with a preceding f. You can if you want, but it's definitely optional and in some cases can impair readability.


main.go, line 119 at r3 (raw file):

	}

	if fOneshot {

fOneshot and oneshot should be the same variable. Don't use one for the flag and the other for the action.


main.go, line 124 at r3 (raw file):

}

func readEnv() {

Unnecessary if you use flagx.ArgsFromEnv


main.go, line 207 at r3 (raw file):

	flag.BoolVar(&fOneshot, "oneshot", false,
		"Execute just once, do not loop.")
	flag.StringVar(&fListenAddr, "web.listen-address", ":9999",

please don't put . in flag names.


main.go, line 214 at r3 (raw file):

func main() {
	readEnv()
	parseFlags()

rather than parseFlags, maybe use flagx.ArgsFromEnv() - https://github.com/m-lab/go/blob/master/flagx/argsfromenv.go


main.go, line 222 at r3 (raw file):

	// and make sure we always write it back on exit.
	candidateHistory := history.Read(historyPath)

This logic is pretty complicated, and it seems like what you really want is just defer cleanup(candidateHistory)


main_test.go, line 185 at r3 (raw file):

func Test_main(t *testing.T) {
	restore := osx.MustSetenv("REBOT_ONESHOT", "1")

recommend defer restore()


healthcheck/prometheus.go, line 51 at r3 (raw file):

// GetOfflineNodes checks for offline nodes in the last N minutes.
// It returns a Vector of samples.
func GetOfflineNodes(prom promtest.PromClient, minutes int) ([]node.Node, error) {

Please make this also perform the filtering of offline sites. Then, only export GetOfflineNodes and make all other functions internal to the healthcheck package.


node/node.go, line 19 at r3 (raw file):

}

func NewNode(name string, site string) Node {

just New for this one. When people use it, they will type node.New and that is pretty self-documenting and avoids stutter, as requested in https://blog.golang.org/package-names


history/history.go, line 49 at r3 (raw file):

// the nodes slice. If a candidate did not previously exist, it creates a
// new one.
func Upsert(candidates []node.Node, history map[string]node.History) {

Update is a perfectly fine name for this - upsert is a term usually restricted to more-obscure database ops.

@pboothe
Copy link
Contributor

pboothe commented Nov 28, 2018

Also, in the spirit of nitpicking, log.Fatal (or log.Fatalf or log.Fatalln) is usually preferred to panic(), because it ensures that the failure message will go to the logging system instead of always stderr. Only use panic() if you need a stack trace. But that's a small thing and not worth holding up this review over.

@bassosimone
Copy link

Random:

Only use panic() if you need a stack trace. But that's a small thing and not worth holding up this review over.

I really like this suggestion :-)

Copy link
Contributor

@pboothe pboothe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 1 of 13 files reviewed, 1 unresolved discussion (waiting on @evfirerob and @stephen-soltesz)

@robertodauria robertodauria merged commit 9e7e0ae into master Nov 30, 2018
@pboothe pboothe deleted the rebot-go branch November 30, 2018 15:34
@pboothe
Copy link
Contributor

pboothe commented Nov 30, 2018

Woohoo!

@bassosimone
Copy link

👍 !!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants