Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics package to publish metrics #531

Merged
merged 1 commit into from
Jun 13, 2018
Merged

Add metrics package to publish metrics #531

merged 1 commit into from
Jun 13, 2018

Conversation

ashish-amarnath
Copy link
Member

Add a metrics package to handle metric registration and publishing
Add a metricsAddress field to the server struct
Start a metrics endpoint as part of starting the controllers
Report backup status on backup completion

@ashish-amarnath
Copy link
Member Author

ashish-amarnath commented Jun 6, 2018

Starting a WorkInProgress PR to get some feedback on the approach here to add metrics. Related to #84 Adding some code to the discussion on the issue. I will continue to add more metrics following the RED.

@ncdc
Copy link
Contributor

ncdc commented Jun 7, 2018

@ashish-amarnath thanks for this PR. We've got a lot going on but at the moment (prepping for an 0.9 alpha with restic support), but we'll review it soon!

@@ -501,6 +509,13 @@ func (s *server) runControllers(config *api.Config) error {
s.logger,
)

go func() {
http.Handle("/metrics", prometheus.Handler())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to use package-level http handler registration. Would you mind creating an http.ServeMux, registering the handler with that, and then passing the ServeMux as the 2nd parameter to http.ListenAndServe?

/****Metrics related to Ark server backup***********/
lastBackupStatus = prometheus.NewGauge(
prometheus.GaugeOpts{
Namespace: "heptio-ark", // TODO: don't hard-code this value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the namespace needs to be flexible, I think it might make more sense to have a struct that holds on to all the metrics, e.g.

type Metrics struct {
  lastBackupStatus prometheus.Guage
}

func NewMetrics(namespace string) *Metrics {
  return &Metrics{
    lastBackupStatus: prometheus.NewGuage(
      prometheus.GaugeOpts{
        Namespace: namespace,
        // ...
      },
    ),
  }
}

func (m *Metrics) UpdateLastBackupResult(success bool) {
  if success {
    m.lastBackupStatus.Set(1)
  } else {
    m.lastBackupStatus.Set(0)
  }
}

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense. I would have gone down that route to get rid of that TODO that I had.

@ncdc
Copy link
Contributor

ncdc commented Jun 7, 2018

Just did a very quick first pass

@ashish-amarnath
Copy link
Member Author

@ncdc Thanks for the quick review! I've made changes to address your comments. I will now start adding some metrics

lastBackupStatus prometheus.Gauge
}

var metrics *serverMetrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, 1 more change here, please 😄. I'd prefer not to have any package level variables.

// GetMetricsMux returns a mux with the metrics handle
func GetMetricsMux() *http.ServeMux {
mux := http.NewServeMux()
mux.Handle("/metrics", prometheus.Handler())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the new-fangled way to do this is to use promhttp.Handler().

lastBackupStatus prometheus.Gauge
}

var metrics *serverMetrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid the package-level variable. Would you mind looking at https://github.com/heptio/gimbal/blob/master/discovery/pkg/metrics/metrics.go and basing this file on that one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's helpful!

@ashish-amarnath ashish-amarnath changed the title [WIP] Implement support for metrics in ark server Add metrics package to publish metrics Jun 8, 2018
@@ -96,7 +98,14 @@ func NewCommand() *cobra.Command {
}
namespace := getServerNamespace(namespaceFlag)

s, err := newServer(namespace, fmt.Sprintf("%s-%s", c.Parent().Name(), c.Name()), pluginDir, logger)
metricAddressFlag := c.Flag("metric-address")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a variable at the top of this function (where logLevelFlag and pluginDir are) and set it to the default address, and then please bind it similar to how we bind log-level and plugin-dir). Also please call the flag metrics-address.

@@ -123,8 +132,17 @@ func getServerNamespace(namespaceFlag *pflag.Flag) string {
return api.DefaultNamespace
}

func getMetricAddress(metricAddressFlag *pflag.Flag) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need this function any more once you do what I commented above.

@@ -37,4 +37,8 @@ const (
// NamespaceScopedDir is the name of the directory containing namespace-scoped
// resource within an Ark backup.
NamespaceScopedDir = "namespaces"

// DefaultMetricAddress is the port at which the metric server will listen on to make the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call this DefaultMetricsAddress please. I'd also move the constant to pkg/cmd/server/server.go.

"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ServerMetrics Prometheus metrics for the server
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServerMetrics contains Prometheus metrics for the Ark server.


// ServerMetrics Prometheus metrics for the server
type ServerMetrics struct {
/****Metrics related to Ark server backup***********/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the comment is needed?

}
}

// GetMetricsMux returns a mux with the metrics handle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get rid of this function and set up the mux where you spawn the goroutine in server.go so this package doesn't need to know anything about http.

@ncdc ncdc requested a review from skriss June 8, 2018 20:22
metricsMux := http.NewServeMux()
metricsMux.Handle("/metrics", promhttp.Handler())
err := http.ListenAndServe(s.metricsAddress, metricsMux)
s.logger.Fatalf("Failed to start metrics: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's only make the fatal call if err != nil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also verify that if you run ark server by hand and you do ctrl-c that it shuts down cleanly and this doesn't keep it running?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops! Thanks for catching that.

if err := controller.runBackup(backup, controller.bucket); err != nil {
logContext.WithError(err).Error("backup failed")
backup.Status.Phase = api.BackupPhaseFailed
controller.metrics.RegisterBackupFailed(backupName, backupScheduleName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the RegisterBackupSuccess case may need to be moved into an else clause here, such that a backup doesn't register as failed and succeeded.

I tried to run a backup using a GCP serviceaccount that didn't have access to the bucket selected, and my metrics stated that the backup failed and succeeded.

ERRO[0436] Error uploading log file                      bucket=<BUCKET_REDACTED> error="rpc error: code = Unknown desc = googleapi: Error 403: <ACCOUNT_REDACTED>.iam.gserviceaccount.com does not have storage.objects.create access to <BUCKET_REDACTED>/default-backup/default-backup-logs.gz., forbidden" key=default-backup/default-backup-logs.gz logSource="pkg/cloudprovider/backup_service.go:148"

ark backup get shows the backup as failed.

% ./_output/bin/linux/amd64/ark backup get
NAME             STATUS    CREATED                         EXPIRES   SELECTOR
default-backup   Failed    2018-06-12 09:52:38 -0400 EDT   29d       <none>

These are the metrics:

ark_server_backup_attempt_total{backupName="heptio-ark/default-backup",schedule="UNSCHEDULED"} 1
ark_server_backup_failed_total{backupName="heptio-ark/default-backup",schedule="UNSCHEDULED"} 1
ark_server_backup_size_bytes{backupName="heptio-ark/default-backup",schedule="UNSCHEDULED"} 4865
ark_server_backup_success_total{backupName="heptio-ark/default-backup",schedule="UNSCHEDULED"} 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nrb thanks for running the test for me and catching this bug! Pushed a change with the fix.

func getBackupScheduleName(backup *api.Backup) string {
scheduleName := kubeutil.LabelValue(backup, "ark-schedule")
if scheduleName == "" {
scheduleName = "UNSCHEDULED"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a Backup doesn't have a Schedule, we should simply not add the schedule label to the metrics rather than adding an arbitrary value.. This will simplify the prometheus queries later on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this allows users to collect metrics for all unscheduled backups easily. I don't have a strong preference on this though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How I've seen this work with Prometheus/Grafana is to use label matching operators:

ark_server_backup_attempt_total{ark-schedule=".+"} --> Matches any backup with a schedule specified
ark_server_backup_attempt_total{ark-schedule=~"${Schedule}" --> matches any backup with schedule using a variable, defaults to ".+" for "All", ignores any backups without a schedule
ark_server_backup_attempt_total{ark-schedule!=".+"} --> Matches all backups without a schedule

--plugin-dir string directory containing Ark plugins (default "/plugins")
-h, --help help for server
--log-level the level at which to log. Valid values are debug, info, warning, error, fatal, panic. (default info)
--metrics-address string the address to expose prometheus metrics (default ":8085")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be simplified to just port? Or does this allow us to specify a listen address & port (e.g. localhost:8085 or $public_ip:8085)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does allow you to specify an address. In this case, it will always be localhost:8085 or just :8085.

go func() {
metricsMux := http.NewServeMux()
metricsMux.Handle("/metrics", promhttp.Handler())
if err := http.ListenAndServe(s.metricsAddress, metricsMux); err != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: It's nice to log the address/port that the metrics are using. It's helpful in troubleshooting them not working.

err := http.ListenAndServe(s.metricsAddress, metricsMux)
s.logger.Fatalf("Failed to start metrics: %v", err)
if err := http.ListenAndServe(s.metricsAddress, metricsMux); err != nil {
s.logger.Fatalf("Failed to start metrics: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s.logger.WithError(err).Fatalf("Failed to start metrics")

const (
metricNamespace = "ark_server"
backupSizeBytesGauge = "backup_size_bytes"
backupAttemptsCount = "backup_attempts_total"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it typical to append something like total here - it seems weird to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ncdc it's a normal thing - it can help with understanding the units of metrics. See: https://prometheus.io/docs/practices/naming/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, prom newbie here :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the convention I've seen for counters. The idea is we are counting 'events' and the value is the total number of events that have occurred, hence the 'total'.

metricNamespace = "ark_server"
backupSizeBytesGauge = "backup_size_bytes"
backupAttemptsCount = "backup_attempts_total"
backupSuccessCount = "backup_success_total"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep total, this name is fine as-is. Otherwise I'd maybe call it backup_successes

Also note that we may move away from success/failure for backups. For restores, the valid phases are new, in progress, failed validation, and completed. In each restore's status, we record the number of warnings and errors that occurred, but as long as we're able to restore something, the status is always completed. We are considering doing something similar for backups. (#286 is related)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we make that change, we can call this metric backup_completed_total. Or even have a metric per phase of the backup. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, and we can discuss it if/when we make that change.

backupSizeBytesGauge = "backup_size_bytes"
backupAttemptsCount = "backup_attempts_total"
backupSuccessCount = "backup_success_total"
backupFailedCount = "backup_failed_total"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep total, I'd call this backup_failure_total; otherwise, backup_failures.


const (
metricNamespace = "ark_server"
backupSizeBytesGauge = "backup_size_bytes"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rosskukulinski please compare with your naming suggestions here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the backup_size_bytes the size of the resulting tarbal? If so, I suggest ark_backup_tarball_size_bytes because that's different than the PV-related sizes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed.

@@ -28,6 +28,14 @@ import (
corev1listers "k8s.io/client-go/listers/core/v1"
)

// LabelValue looks up and returns the value for a label on the object
func LabelValue(objMeta metav1.Object, labelName string) string {
if value, exists := objMeta.GetLabels()[labelName]; exists {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that labels are map[string]string and the zero value of any undefined key is the empty string, do we need this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -269,9 +273,16 @@ func (controller *backupController) processBackup(key string) error {

logContext.Debug("Running backup")
// execution & upload of backup
backupName := kubeutil.NamespaceAndName(backup)
backupScheduleName := kubeutil.LabelValue(backup, "ark-schedule")
controller.metrics.RegisterBackupAttempt(backupName, backupScheduleName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no schedule, won't this generate a new metric per backup, e.g.

ark_server_backup_attempts_total{backupName="foo1", schedule=""} 1
ark_server_backup_attempts_total{backupName="foo2", schedule=""} 1
ark_server_backup_attempts_total{backupName="foo3, schedule=""} 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will. I thought that is the behaviour @rosskukulinski was preferring, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take back my initial answer. I am not sure if it will create a metric per backup. trying to get the correct answer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.. after consulting with a collegue who knows more prom than me! we might just drop the backup name altogether and just have the schedule name.

Name: backupSizeBytesGauge,
Help: "Size, in bytes, of a backup",
},
[]string{"schedule", "backupName"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use constants for these. Also please rename backupName to backup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

const (
metricNamespace = "ark_server"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the _server part of this? I'm leaning toward just ark.

@@ -269,9 +273,15 @@ func (controller *backupController) processBackup(key string) error {

logContext.Debug("Running backup")
// execution & upload of backup
backupScheduleName, _ := backup.GetLabels()["ark-schedule"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the , _

errs = append(errs, err)
}

backupScheduleName, _ := backup.GetLabels()["ark-schedule"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

const (
metricNamespace = "ark_server"
backupTarballSizeBytesGauge = "backup_tarball_size_bytes"
backupAttemptsCount = "backup_attempts_total"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to be consistent with singular vs plural and our nomenclature. Should this be attempt or attempts?

success or successes?

failure or failures?

}

const (
metricNamespace = "ark_server"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're already in the metrics package - can we call this namespace instead?

* add a metrics package to handle metric registration and publishing
* add a metricsAddress field to the server struct
* make metrics a part of the server
* start a metrics endpoint as part of starting the controllers
* instrument backup_controller to report metrics
* update cli-reference docs
* update example deployments with prometheus annotations
* update 'pkg/install' tooling with prometheus annotations

Signed-off-by: Ashish Amarnath <ashish.amarnath@gmail.com>
@skriss
Copy link
Contributor

skriss commented Jun 13, 2018

I don't have any additional comments. Excited for this!

@ncdc
Copy link
Contributor

ncdc commented Jun 13, 2018

LGTM. Just doing a final test.

@ncdc ncdc merged commit d1e3688 into vmware-tanzu:master Jun 13, 2018
@ncdc ncdc mentioned this pull request Jun 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants