Metrics overhaul #6279

ichernetsky · 2018-06-04T18:23:46Z

JIRA issues: to be added later.

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-1.

mesosphere-ci

✗ Build of #6279 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-2.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-2.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.6.494-eee26f168/marathon-1.6.494-eee26f168.tgz",
"sha1": "f3d0f5972f7c359e855d1e3b00710446f19cb7d3"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

timcharper

Thanks for getting this started!

I'd like to decide and get all values updated/synced on Ioannis metrics sheet before I feel comfortable going forward with this

timcharper · 2018-06-04T22:17:15Z

src/main/scala/mesosphere/marathon/core/deployment/impl/DeploymentManagerActor.scala

@@ -138,8 +138,8 @@ class DeploymentManagerActor(
  val runningDeployments: mutable.Map[String, DeploymentInfo] = mutable.Map.empty
  val deploymentStatus: mutable.Map[String, DeploymentStepInfo] = mutable.Map.empty

-  private[this] val runningDeploymentsMetric = Metrics.minMaxCounter(ServiceMetric, getClass, "currentDeploymentCount")
-  private[this] val totalDeploymentsMetric = Metrics.minMaxCounter(ServiceMetric, getClass, "deploymentCount")
+  private[this] val runningDeploymentsMetric = Metrics.atomicGauge("marathon.deployments.running")


@wavesoft spreadsheet indicates this should be marathon.deployments.active

prefer marathon.deployments.active.count

@kensipe, I thought we agreed on following the naming guidelines that can be found on the Prometheus website. In case of counters, it suggests suffixing with .total, and in case of gauges, there is no prefix.

I like active part better. Thank you, guys!

kensipe · 2018-06-04T22:55:24Z

src/main/scala/mesosphere/marathon/core/deployment/impl/DeploymentManagerActor.scala

@@ -138,8 +138,8 @@ class DeploymentManagerActor(
  val runningDeployments: mutable.Map[String, DeploymentInfo] = mutable.Map.empty
  val deploymentStatus: mutable.Map[String, DeploymentStepInfo] = mutable.Map.empty

-  private[this] val runningDeploymentsMetric = Metrics.minMaxCounter(ServiceMetric, getClass, "currentDeploymentCount")
-  private[this] val totalDeploymentsMetric = Metrics.minMaxCounter(ServiceMetric, getClass, "deploymentCount")
+  private[this] val runningDeploymentsMetric = Metrics.atomicGauge("marathon.deployments.running")


prefer marathon.deployments.active.count

kensipe · 2018-06-04T22:56:35Z

src/main/scala/mesosphere/marathon/core/event/impl/stream/HttpEventStreamActor.scala

@@ -21,7 +21,7 @@ trait HttpEventStreamHandle {
 }

 class HttpEventStreamActorMetrics() {
-  val numberOfStreams: SettableGauge = Metrics.atomicGauge(ApiMetric, getClass, "number-of-streams")
+  val numberOfStreams: SettableGauge = Metrics.atomicGauge("marathon.http.event.streams")


seems like count or total should be here.. or active.count marathon.http.event.streams.active.count

It is a gauge, so .total shouldn't be there. I will append .active. As for .count, please refer to my comment above.

kensipe · 2018-06-04T22:58:12Z

src/main/scala/mesosphere/marathon/core/flow/impl/ReviveOffersActor.scala

@@ -37,7 +37,7 @@ private[impl] class ReviveOffersActor(
    offersWanted: Observable[Boolean],
    driverHolder: MarathonSchedulerDriverHolder) extends Actor with StrictLogging {

-  private[this] val reviveCountMetric = Metrics.minMaxCounter(ServiceMetric, getClass, "reviveCount")
+  private[this] val reviveCountMetric = Metrics.counter("marathon.revive.ops.total")


I don't think I like ops and if so.. I think I prefer it prefixed.. marathon.ops.revive.total .

I am fine with marathon.ops.revive.total. What do you prefer instead of ops? The full word: operations? mesos.ops/operations?

kensipe · 2018-06-04T22:58:57Z

src/main/scala/mesosphere/marathon/core/group/GroupManagerModule.scala

-      System.currentTimeMillis() - startedAt
-    )
+    Metrics.gauge(
+      "marathon.uptime.duration",


marathon.uptime.seconds

If we follow the Prometheus naming guidelines, it should stay as is.

In one place I appended .seconds in addition to .duration. But I am unsure if we need to do it. Maybe it is better to have a documented rule that durations are in seconds, and memory/space are in bytes by default everywhere (HTTP, StatsD, etc). What do you think?

kensipe · 2018-06-04T22:59:54Z

src/main/scala/mesosphere/marathon/core/matcher/manager/impl/OfferMatcherManagerActor.scala

@@ -27,9 +27,9 @@ import scala.util.control.NonFatal

 private[manager] class OfferMatcherManagerActorMetrics() {
  private[manager] val launchTokenGauge: SettableGauge =
-    Metrics.atomicGauge(ServiceMetric, getClass, "launchTokens")
+    Metrics.atomicGauge("marathon.launch.tokens")


marathon.launch.tokens.total

It is a gauge.

kensipe · 2018-06-04T23:00:28Z

src/main/scala/mesosphere/marathon/core/matcher/manager/impl/OfferMatcherManagerActor.scala

  private[manager] val currentOffersGauge: SettableGauge =
-    Metrics.atomicGauge(ServiceMetric, getClass, "currentOffers")
+    Metrics.atomicGauge("marathon.offers.current")


marathon.offers.active.count

kensipe · 2018-06-04T23:01:00Z

src/main/scala/mesosphere/marathon/core/task/tracker/impl/InstanceTrackerActor.scala

@@ -56,8 +56,8 @@ object InstanceTrackerActor {

  private[tracker] class ActorMetrics {
    // We can't use Metrics as we need custom names for compatibility.
-    val stagedCount: AtomicGauge = AtomicGauge("service.mesosphere.marathon.task.staged.count")
-    val runningCount: AtomicGauge = AtomicGauge("service.mesosphere.marathon.task.running.count")
+    val stagedCount: SettableGauge = Metrics.atomicGauge("marathon.tasks.staged")


marathon.tasks.staged.active.count

What does active mean here?

kensipe · 2018-06-04T23:01:14Z

src/main/scala/mesosphere/marathon/core/task/tracker/impl/InstanceTrackerActor.scala

-    val stagedCount: AtomicGauge = AtomicGauge("service.mesosphere.marathon.task.staged.count")
-    val runningCount: AtomicGauge = AtomicGauge("service.mesosphere.marathon.task.running.count")
+    val stagedCount: SettableGauge = Metrics.atomicGauge("marathon.tasks.staged")
+    val runningCount: SettableGauge = Metrics.atomicGauge("marathon.tasks.running")


marathon.tasks.running.active.count

What does active mean here?

ichernetsky · 2018-06-04T23:22:04Z

src/main/scala/mesosphere/marathon/core/storage/store/impl/BasePersistenceStore.scala

-  private val deleteTimer: Timer = Metrics.timer(ServiceMetric, getClass, "delete")
-  private val storeTimer: Timer = Metrics.timer(ServiceMetric, getClass, "store")
-  private val versionTimer: Timer = Metrics.timer(ServiceMetric, getClass, "versions")
+  private val idsTimer: Timer = Metrics.timer("marathon.storage.op.ids.duration", unit = Time.Seconds)


@kensipe, what do you think of op here?

ichernetsky · 2018-06-04T23:23:06Z

src/main/scala/mesosphere/marathon/core/task/tracker/impl/InstanceUpdateActor.scala

@@ -36,16 +37,17 @@ object InstanceUpdateActor {

  class ActorMetrics {
    /** the number of ops that are for instances that already have an op ready */
-    val numberOfQueuedOps: SettableGauge = Metrics.atomicGauge(ServiceMetric, classOf[InstanceUpdateActor], "delayed-ops")
+    val numberOfQueuedOps: SettableGauge = Metrics.atomicGauge("marathon.instance.update.ops.delayed")


@kensipe, what do you think of ops here?

ichernetsky · 2018-06-04T23:25:32Z

src/main/scala/mesosphere/marathon/core/election/ElectionService.scala

@@ -331,12 +333,12 @@ class ElectionServiceImpl(
 }

 object ElectionService extends StrictLogging {
-  private val leaderDurationMetric = "service.mesosphere.marathon.leaderDuration"
+  private val leaderDurationMetric = "marathon.leadership.duration.seconds"


Do we append .seconds to .duration metrics or just have a convention that all durations are in seconds everywhere?

wavesoft

I was thinking to restrict this freedom you are proposing, by giving the developer a subset of choices to chose from instead:

Prepend marathon. implicitly in all metrics
Separate metric name in 3 parts : Namespace, Class and Name.
Make the first two (namespace / class) enums, and allow only the name to be defined in spot.
Introduce a custom units enum, that could potentially also define the suffix of the metric. Ex Units.SECONDS will automatically add _seconds on the metric.

wavesoft · 2018-06-06T14:47:32Z

src/main/scala/mesosphere/marathon/api/HTTPMetricsFilter.scala

-  private[this] val bytesReadMetric = Metrics.counter(ServiceMetric, getClass, "bytesRead")
-  private[this] val bytesWrittenMetric = Metrics.counter(ServiceMetric, getClass, "bytesWritten")
+  private[this] val bytesReadMetric =
+    Metrics.counter("marathon.http.data.read.bytes.total", unit = Memory.Bytes)


I don't really like repeating that much of information on every instrument instantiation. Mainly because this is a recipe for trouble. That's because someone can easily bypass the rules we have right now in mind and come up with a totally ridiculous naming in the future.

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-4.

mesosphere-ci

✗ Build of #6279 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-5.

mesosphere-ci

✗ Build of #6279 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-6.

mesosphere-ci

✗ Build of #6279 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-7.

mesosphere-ci

✗ Build of #6279 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-8.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-8.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.30-49b9e7b10/marathon-1.7.30-49b9e7b10.tgz",
"sha1": "d777d8fcaa70eb7a84183c9f8b35cfa48d52bbea"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-9.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-9.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.31-f77c306ff/marathon-1.7.31-f77c306ff.tgz",
"sha1": "11cfea938e6e12a026fa559b0e755754ef8fb231"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-10.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-10.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.33-25988e938/marathon-1.7.33-25988e938.tgz",
"sha1": "b7bc0f8cd083128611af7c1329bc32a77e0f6907"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-22.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-22.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.40-d97f5abf6/marathon-1.7.40-d97f5abf6.tgz",
"sha1": "b4837b0203258405c21b7528585883d22a5eb069"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

timcharper · 2018-07-24T04:40:43Z

docs/docs/deprecatedMetrics.md

+`service.mesosphere.marathon.uptime` (gauge) - The uptime of the reporting Marathon process in milliseconds. This is helpful to diagnose stability problems that cause Marathon to restart.
+
+### App, group, and task counts
+`service.mesosphere.marathon.leaderDuration` (gauge) - The duration since the last leader election happened


This has nothing to do with app / tasks / groups

timcharper · 2018-08-01T09:50:53Z

docs/docs/command-line-flags.md

+* <span class="label label-default">v1.7.0</span> `--metrics_name_prefix`:
+    Configure the prefix that is used when constructing metric names (default: marathon).
+* <span class="label label-default">v1.7.0</span> `--metrics_prometheus`:
+    Enable the StatsD reporter. Once enabled, metrics in the Prometheus format are availalble at `/metrics/prometheus`.


Typo: available

timcharper · 2018-08-01T09:52:32Z

docs/docs/command-line-flags.md

+    Enable the StatsD reporter. Once enabled, metrics in the Prometheus format are availalble at `/metrics/prometheus`.
+* <span class="label label-default">v1.7.0</span> `--metrics_statsd`:
+    Enable the StatsD reporter.
+* <span class="label label-default">v1.7.0</span> `--metrics_statsd_host`:


Should we just make this a single field? host:port?

I don't have a strong opinion here. What the others think?

I'm inclined towards host:port, but not a strong opinion either.

timcharper · 2018-08-01T09:57:26Z

docs/docs/metrics.md


-<span class="label label-default">v0.15</span>
-`service.mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl.publishFuture` (timer) - This metric calculates how long it takes Marathon to process status updates.
+* a `counter` is a monotonically increasing integer, for instance, the


minor nit but I'm seeing some pretty inconsistent word wrapping applied :)

You mean, in this doc as compared to the others?

timcharper · 2018-08-01T10:01:02Z

src/main/scala/mesosphere/marathon/api/HttpModule.scala

+
+    val loginService = new HashLoginService()
+    loginService.setHotReload(false)
+    loginService.setUserStore(userStore)


timcharper · 2018-08-01T10:06:42Z

src/main/scala/mesosphere/marathon/core/deployment/impl/DeploymentManagerActor.scala

+  private[this] val newRunningDeploymentsMetric =
+    metrics.gauge("deployments.active")
+  private[this] val oldTotalDeploymentsMetric =
+    metrics.deprecatedMinMaxCounter(ServiceMetric, getClass, "deploymentCount")


I've been wondering... since we already have a layer of indirection in our metrics, why not have this layer receive a new name and an old name, and then dispatch to each underlying library accordingly?

This is one of two approaches I took initially, but had to roll back to a clean slate two times. The thing is that some metrics are of different metric types now.

Okay :) I thought that might be the case. We'll have to do lots of editing either way when we remove the old metrics.

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-23.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-23.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.41-8b6e7d03c/marathon-1.7.41-8b6e7d03c.tgz",
"sha1": "36cc7b20f97fc2cd3e8f7da3bc7681f4452c5120"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

ichernetsky · 2018-08-02T16:09:15Z

Squashed and rebased.

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-25.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-25.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.44-5ac8204e0/marathon-1.7.44-5ac8204e0.tgz",
"sha1": "e6c7497bbe6ee6de7bce0d4cbfd4d9e49d21828d"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-26.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-26.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.45-3c96f9478/marathon-1.7.45-3c96f9478.tgz",
"sha1": "113f63a93466f89167f2e36bd17f01d59f5a6325"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

kensipe

No show stoppers... there are comments marked AI which means Action Item.

Lots of repeating old and new ways of handling metrics... when we made the discussion to do that... it was because we wanted to make this breaking change and allow folks an out... however now that we are targeting DCOS 1.12... I would be for removing all old kamon code which would clean out a lot of code.

If not for this merge... we should do it before the release IMO.

kensipe · 2018-08-02T20:11:37Z

docs/docs/command-line-flags.md

+    Specify the host to push metrics to in the DataDog format.
+* <span class="label label-default">v1.7.0</span> `--metrics_datadog_port`:
+    Specify the port to push metrics to in the DataDog format.
+* <span class="label label-default">v1.7.0</span> `--metrics_datadog_protocol`:


why all the flags? are we planning to support multiple metrics at the same time? I don't see the value of that. I see several better options:

--metrics_host, --metrics_port, --metrics_transmission_interval_ms regardless of metric reporter.

and/or . --metric_provider=statsd

Though I do like your suggestion, I am not entirely sure it generalizes well. For instance, what if a user wants to enable multiple reporters? Now, we have Dropwizard (HTTP endpoint), Prometheus (the same), StatsD, DataDog. I bet we'll have more as time goes by. What do you think?

there may be different configurations for different impls... but there is a lot of commonality between them as listed. It would be nice to eliminate all the specific flags and generalize. the providers we support and how we configure them would then be in documentation.

I agree, but I don't see how to get rid of this. Suppose, one wants to enable A and B reporters. There has to be params like --metrics_a_host and --metrics_b_host, in order to enable them to report to different remote endpoints, as one might expect.

kensipe · 2018-08-02T20:12:08Z

docs/docs/command-line-flags.md

-* <span class="label label-default">v0.13.0</span> `--reporter_graphite` (Optional. Default: disabled):
-    Report metrics to [Graphite](http://graphite.wikidot.com) (StatsD) as defined by the given URL.
+* <span class="label label-default">v1.7.0</span> `--metrics_name_prefix`:
+    Configure the prefix that is used when constructing metric names (default: marathon).


does this support namespace name? like dcos.marathon or root.marathon?

kensipe · 2018-08-02T20:17:19Z

docs/docs/deprecatedMetrics.md

+
+### Statsd and derived statistics
+
+Statsd typically creates derived statistics (mean, p99) from mean values Marathon reports. Our metrics package also reports derived statistics. To avoid accidentally aggregating statistics multiple times, be sure you know where you are reporting and computing mean values.


these docs are killer! 👏

I think I should generalize it and add it to the new metrics doc. :)

kensipe · 2018-08-02T20:18:31Z

docs/docs/rest-api/public/api/v2/types/dropwizardMetrics.raml

@@ -0,0 +1,143 @@
+#%RAML 1.0 Library
+types:
+  DropwizardCounter:


dislike exposing our implementation details "dropwizard" to the outside world

based on maintaining both kamon and drop... perhaps we have to :(

Yeah, I would much more prefer to have just Counter, Gauge etc. and not expose the metrics library to the outside...

Probably I am missing something here. In which way it is exposed to the outside world? I can surely drop Dropwizard prefix, and replace Kamon with Deprecated.

kensipe · 2018-08-02T20:22:13Z

src/main/java/mesosphere/marathon/metrics/current/BufferPoolMetricSet.java

+                    mBeanServer.getMBeanInfo(on);
+                    gauges.put(name(pool, name), new JmxAttributeGauge(mBeanServer, on, attribute));
+                } catch (JMException ignored) {
+                    LOGGER.debug("Unable to load buffer pool MBeans, possibly running on Java 6");


AI: this should be a LOGGER.error

I am going to change it to warn.

kensipe · 2018-08-02T20:42:57Z

src/main/scala/mesosphere/marathon/core/event/impl/stream/HttpEventStreamServlet.scala

@@ -50,8 +58,9 @@ class HttpEventSSEHandle(request: HttpServletRequest, emitter: Emitter, allowHea
    * @return Passes through the `payload` argument
    */
  private def measureFrameBytesSent(eventName: String, payload: String): Unit = {
-    val overhead: Long = 16L + eventName.length
-    bytesWrittenMetric.increment(payload.length + overhead)
+    val overhead: Long = 16L + eventName.length.toLong


ai: magic number 16... I realize it was there before... it would be great to have a variable name that represents the reason why 16 is so magical

I believe sseFrameOverhead would be appropriate. @wavesoft, what do you think?

kensipe · 2018-08-02T20:53:08Z

...scala/mesosphere/marathon/core/task/update/impl/steps/NotifyHealthCheckManagerStepImpl.scala

@@ -14,6 +14,7 @@ import scala.concurrent.Future
 class NotifyHealthCheckManagerStepImpl @Inject() (healthCheckManagerProvider: Provider[HealthCheckManager])
  extends InstanceChangeHandler {
  override def name: String = "notifyHealthCheckManager"
+  override def metricName: String = "notify-health-check-manager"


surprised we did create a function to convert camelCase -> camel-case .
not a fan of inheritance for state... it would be better to have a class class for StepName with a name and metric IMO

I don't think I understand what you mean here. Could you please elaborate?

all metricNames are just conversions of the name of the step... notifyHealthCheckManager -> notify-health-check-manager . all the steps follow this pattern. It would be great to use a function that provides that... that way we have that behavior guaranteed. It isn't a big deal.

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-27.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-27.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.48-17db512ee/marathon-1.7.48-17db512ee.tgz",
"sha1": "634ee5f246078bd5a92bda45e9698377eebd8128"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-28.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-28.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.49-f0efe8b62/marathon-1.7.49-f0efe8b62.tgz",
"sha1": "f6372788c71491582164a137153d01fce1f7c202"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

alenkacz

I have just cosmetics for now...

alenkacz · 2018-08-03T09:43:05Z

docs/docs/deprecatedMetrics.md

+
+* graphite via `--reporter_graphite`.
+* datadog via `--reporter_datadog`.
+* statsd via `--reporter_datadog` (datadog reports supports statsd).


reports support? (when plural, it shouldn't be reports probably?)

I think it should be reporter.

alenkacz · 2018-08-03T09:52:30Z

docs/docs/deprecatedMetrics.md

+
+## Metric names
+
+All metric names have to prefixed by a prefix that you specify and are subject to modification by graphite, datadog, or statsd. For example, if we write that the name of a metric is `service.mesosphere.marathon.uptime`, it might be available under `stats.gauges.marathon_test.service.mesosphere.marathon.uptime` in your configuration.


alenkacz · 2018-08-03T09:57:19Z

docs/docs/metrics.md


-<span class="label label-default">v0.15</span>
-`service.mesosphere.marathon.state.GroupManager.processing` (gauge) - The number of currently processed app configuration updates. Since we serialize these updates, this is either 0 or 1.
+* none


what is a "none" unit?

should it be just integer?

Rephrased. Please take a look once I push my changes.

alenkacz · 2018-08-03T10:03:02Z

docs/docs/metrics.md

+  deployments.
+* `marathon.deployments.counter` — the count of deployments received
+  since the current Marathon instance became a leader.
+* `marathon.deployments.dismissed.counter` — the count of deployments


what's a dismissed deployment? I am just wondering if I don't know that whether users know that...

Elaborated.

alenkacz · 2018-08-03T10:05:00Z

docs/docs/metrics.md

+  since the current Marathon instance became a leader.
+* `marathon.deployments.dismissed.counter` — the count of deployments
+  dismissed since the current Marathon instance became a leader.
+* `marathon.groups.active.gauge` — the number of active groups.


there are groups that are not active? Why not just groups.count?

Our naming conventions. count alludes a counter. And it is not just groups, because this way we name counters. In the future there might be such one, and it will mean the total number of groups created so far by the current Marathon instance.

alenkacz · 2018-08-03T10:07:52Z

docs/docs/rest-api/public/api/v2/types/dropwizardMetrics.raml

@@ -0,0 +1,143 @@
+#%RAML 1.0 Library
+types:
+  DropwizardCounter:


Yeah, I would much more prefer to have just Counter, Gauge etc. and not expose the metrics library to the outside...

alenkacz · 2018-08-03T10:08:49Z

docs/index.md

@@ -37,7 +37,8 @@ Marathon is a production-grade container orchestration platform for Mesosphere's
 - [Service Discovery & Load Balancing](https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html). Several methods available.
 - [Health Checks](https://mesosphere.github.io/marathon/docs/health-checks.html). Evaluate your application's health using HTTP or TCP checks.
 - [Event Subscription](https://mesosphere.github.io/marathon/docs/event-bus.html#subscription-to-events-via-the-event-stream). Supply an HTTP endpoint to receive notifications - for example to integrate with an external load balancer.
- [Metrics](https://mesosphere.github.io/marathon/docs/metrics.html). Query them at /metrics in JSON format or push them to systems like graphite, statsd and Datadog.
+- [Metrics](https://mesosphere.github.io/marathon/docs/metrics.html). Query them at `/metrics` in JSON format, push them to systems like Graphite, StatsD and DataDog, or scrape them using Prometheus.
+- [Deprecated Metrics](https://mesosphere.github.io/marathon/docs/deprecatedMetrics.html). Query them at `/metrics` in JSON format, or push them to systems like Graphite, StatsD and DataDog.


should we maybe have just Metrics here and link to the deprecated metrics from the metrics page?

No strong opinion here, if you ask me :)

alenkacz · 2018-08-03T10:10:19Z

src/main/java/mesosphere/marathon/metrics/current/BufferPoolMetricSet.java

+                    mBeanServer.getMBeanInfo(on);
+                    gauges.put(name(pool, name), new JmxAttributeGauge(mBeanServer, on, attribute));
+                } catch (JMException ignored) {
+                    LOGGER.warn("Unable to load buffer pool MBeans, possibly running on Java 6");


should we maybe log also the original exception?

I don't see any value, but I don't mind too. :)

alenkacz · 2018-08-03T10:55:32Z

src/main/java/mesosphere/marathon/metrics/current/InstrumentedHandler.java

+        this.asyncDispatches = metricRegistry.meter(name(prefix, "debug.http.dispatches.async.rate"));
+        this.asyncTimeouts = metricRegistry.meter(name(prefix, "debug.http.dispatches.async.timeouts.rate"));
+
+        this.responses = new Meter[]{


I don't like this being an array :( can we just put them into separately named variables?

I want to keep the changes between it and the original one to the minimum. Essentially, just different metric names. Also, there is a reason for this. Please refer to updateResponses

alenkacz · 2018-08-03T11:33:32Z

src/main/scala/mesosphere/marathon/api/SystemResource.scala

+    withAuthorization(ViewResource, SystemConfig){
+      if (config.isDeprecatedFeatureEnabled(DeprecatedFeatures.kamonMetrics)) {
+        notFound("Prometheus reporter is not available with the deprecated metrics")
+      } else if (!config.metricsPrometheusReporter()) {


why is there even enable/disable for this? It's not costing us anything to be enabled, right?

You are right. No strong opinion on my side. In fact, initially it was done the way you suggest, but then I added it for the sake of consistency.

zen-dog

Part 1 of the review. Nothing major but I left a few questions/suggestions.

zen-dog · 2018-08-02T15:42:45Z

build.sbt

@@ -125,20 +125,6 @@ lazy val commonSettings = Seq(
  fork in run := true
 )

-val aspect4jSettings = SbtAspectj.aspectjSettings ++ Seq(


zen-dog · 2018-08-03T10:09:31Z

docs/docs/deprecatedMetrics.md

+
+## Metric names
+
+All metric names have to prefixed by a prefix that you specify and are subject to modification by graphite, datadog, or statsd. For example, if we write that the name of a metric is `service.mesosphere.marathon.uptime`, it might be available under `stats.gauges.marathon_test.service.mesosphere.marathon.uptime` in your configuration.


s/to prefixed/to be prefixed

zen-dog · 2018-08-03T10:10:29Z

docs/docs/deprecatedMetrics.md

+
+### App, group, and task counts
+
+`service.mesosphere.marathon.app.count` (gauge) - The number of defined apps. This number influences the performance of Marathon: if you have


Wdyt about: Be advised that high app number may lead to a degraded performance ?

zen-dog · 2018-08-03T11:09:55Z

docs/docs/deprecatedMetrics.md

+### Task update processing
+
+<span class="label label-default">v0.15</span>
+`service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.queued` (gauge) - The number of queued status updates.


Are those the old metric names that we keep for consistency? Because we shouldn't leak implementation details like class names e.g. ThrottlingTaskStatusUpdateProcessor or TaskStatusUpdateProcessorImpl into metric names.

It is deprecated/old metrics, which we keep around for the time being.

zen-dog · 2018-08-03T11:54:48Z

docs/docs/metrics.md


-Our metrics library calculates derived metrics like "mean" and "p99" using a sliding average window algorithm. This means that every time you fetch the `/metrics` endpoint you are looking at the average of the last N seconds. By default the length of the window is 30 seconds, but this can be configured with the `--metrics_averaging_window` flag.
+Marathon can send metrics to a DataDog agent over UDP, or directly to
+the DataDog cloud over HTTP. It is specified using


Do we still support using apiKey when sending metrics to DataDogHQ (like it was before Kamon)? Is it part of the --metrics_datadog_host parameter similar to how it is described in command-line-flags.md?

ok, I found the DataDog reporter so I take this question back. But the example with the apiKey should probably be mentioned in the docs.

zen-dog · 2018-08-03T13:10:01Z

src/main/scala/mesosphere/marathon/core/storage/zookeeper/ZooKeeperPersistenceStore.scala

-  private[this] val existsMetric             = Metrics.counter(ServiceMetric, getClass, "exists")
-  private[this] val transactionMetric        = Metrics.counter(ServiceMetric, getClass, "transaction")
-  private[this] val transactionOpCountMetric = Metrics.counter(ServiceMetric, getClass, "transactionOpCount")
+  private[this] val oldCreateMetric =


Since this store is never used in production (except for tests and benchmarks) you don't need to keep the old metrics. Just remove them.

zen-dog · 2018-08-03T13:33:52Z

src/main/scala/mesosphere/marathon/metrics/HistogramTimer.scala

-    case Time.Seconds => duration.toSeconds
-    case Time(factor, _) => (duration.toNanos * (Time.Nanoseconds.factor / factor)).toLong
-  }
+  def update(value: Long): Unit = timer.update(value)


Are those 2 methods used? Seems like you're calling timer.update directly

zen-dog · 2018-08-03T13:44:43Z

src/main/scala/mesosphere/marathon/metrics/MetricsConf.scala

+    descr = "Histogram reservoirs are divided into this number of chunks, and one chunk is cleared after each (resetting interval / number of chunks) elapsed",
+    default = Some(0),
+    argName = "chunks",
+    validate = v => v == 0 || v >= 2,


no max chunks cap?

Your suggestion?

¯_(ツ)_/¯ in doubt take 42 ;)

zen-dog · 2018-08-03T13:55:20Z

src/main/scala/mesosphere/marathon/metrics/current/DropwizardMetrics.scala

+    override def update(value: Long): Unit = timer.update(value, TimeUnit.NANOSECONDS)
+  }
+
+  private val timersLock = RichLock()


Why using RichLock here? what's wrong with plain old synchronized or does ReentrantLock has real performance advantages here?

Initially, I added histogram() method too, but there was no single use of it in the end, so I removed. And there was no reason to have histogram and timer synchronized on the same object. So, if in the future, copy & past & edit will result into the right thing, I guess.

Ok, got it. But unless you expect massive contention on those methods, I don't see a reason to prefer ReentrantLock to synchronized

Actually, managed to remove it completely.

zen-dog · 2018-08-03T14:38:57Z

src/main/scala/mesosphere/marathon/metrics/current/DropwizardMetrics.scala

+  private val timersLock = RichLock()
+  def timer(name: String): Timer = timersLock {
+    val effectiveName = constructName(name, "timer", DropwizardUnitOfMeasurement.Time)
+    if (registry.getTimers().containsKey(effectiveName)) {


Why don't we check for existing metrics with e.g. gauges or counters? What happens if you register >1 metric with the same name? And if nothing bad happens, maybe we don't need to synchronize here at all?

Because we use .register only for timers. In the cases this check is done by Dropwizard itself. If .register is called twice with the same metric name, the second call will result into an exception thrown.

I am dismissing my review until I finish going through the PR

zen-dog · 2018-08-03T16:40:18Z

src/main/scala/mesosphere/marathon/metrics/current/reporters/StatsDReporter.scala

+    Props(new StatsDReporter(metricsConf, registry))
+  }
+
+}


I'm concerned about all those reporters that we add. They are not rocket science but they are untested code which is also hard to test - we'll have to fiddle with ports and what not.
Potentially, I'd even put them in their own sub-project - it might even get some stars on github since there aren't many native scala exporters for dropwizard.

Yes, but there is no other way for the reasons we've discussed. As for putting them into in a (sub)project, I guess, @meichstedt should chime in, because it sounds like a product/managerial decision to make.

meln1k

One thing to fix, and good to go!

meln1k · 2018-08-03T16:43:11Z

src/main/scala/mesosphere/marathon/metrics/current/reporters/DataDogAPIReporter.scala

+  override def receive: Receive = {
+    case Tick =>
+      report()
+    case HttpResponse(code, _, entity, _) =>


I think we must call entity.discardBytes() in case we don't consume it, otherwise we will trigger backpressure.

Great catch. Thank you!

mesosphere-ci

I'm building your change at jenkins-marathon-pipelines-PR-6279-30.

mesosphere-ci

✔ Build of #6279 completed successfully.

See details at jenkins-marathon-pipelines-PR-6279-30.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.47-0678f7c85/marathon-1.7.47-0678f7c85.tgz",
"sha1": "c4709c2c9a0a089ca63f1a9bf223d15f372ef923"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6279.
The job then reports back to this PR.

＼\ ٩( ᐛ )و /／

zen-dog

Looks like a very solid effort. Lot's of love and attention to detail in the docs which is awesome ❤️ If it wasn't for the PR size, it would've been perfect.

ichernetsky added wip needs backport hold merge labels Jun 4, 2018

mesosphere-ci requested changes Jun 4, 2018

View reviewed changes

mesosphere-ci approved these changes Jun 4, 2018

View reviewed changes

timcharper suggested changes Jun 4, 2018

View reviewed changes

kensipe reviewed Jun 4, 2018

View reviewed changes

ichernetsky commented Jun 4, 2018

View reviewed changes

wavesoft previously requested changes Jun 6, 2018

View reviewed changes

ichernetsky force-pushed the inc/metrics-overhaul branch from eee26f1 to 518c8d3 Compare July 17, 2018 22:53

ichernetsky removed the needs backport label Jul 17, 2018

ichernetsky force-pushed the inc/metrics-overhaul branch from 518c8d3 to f1cbf5e Compare July 18, 2018 04:01

mesosphere-ci requested changes Jul 18, 2018

View reviewed changes

mesosphere-ci requested changes Jul 19, 2018

View reviewed changes

mesosphere-ci approved these changes Jul 19, 2018

View reviewed changes

mesosphere-ci requested changes Jul 19, 2018

View reviewed changes

mesosphere-ci approved these changes Jul 19, 2018

View reviewed changes

mesosphere-ci requested changes Jul 19, 2018

View reviewed changes

mesosphere-ci approved these changes Jul 19, 2018

View reviewed changes

mesosphere-ci requested changes Jul 31, 2018

View reviewed changes

mesosphere-ci approved these changes Jul 31, 2018

View reviewed changes

timcharper reviewed Aug 1, 2018

View reviewed changes

mesosphere-ci requested changes Aug 2, 2018

View reviewed changes

mesosphere-ci approved these changes Aug 2, 2018

View reviewed changes

timcharper approved these changes Aug 2, 2018

View reviewed changes

ichernetsky force-pushed the inc/metrics-overhaul branch from f98fdb4 to 5ac8204 Compare August 2, 2018 16:09

mesosphere-ci requested changes Aug 2, 2018

View reviewed changes

mesosphere-ci approved these changes Aug 2, 2018

View reviewed changes

ichernetsky force-pushed the inc/metrics-overhaul branch from 5ac8204 to 3c96f94 Compare August 2, 2018 17:05

mesosphere-ci requested changes Aug 2, 2018

View reviewed changes

mesosphere-ci approved these changes Aug 2, 2018

View reviewed changes

kensipe approved these changes Aug 2, 2018

View reviewed changes

mesosphere-ci requested changes Aug 3, 2018

View reviewed changes

mesosphere-ci approved these changes Aug 3, 2018

View reviewed changes

mesosphere-ci requested changes Aug 3, 2018

View reviewed changes

mesosphere-ci approved these changes Aug 3, 2018

View reviewed changes

alenkacz reviewed Aug 3, 2018

View reviewed changes

zen-dog reviewed Aug 3, 2018

View reviewed changes

meln1k approved these changes Aug 3, 2018

View reviewed changes

Refactoring of old metrics, implementation of new metrics, and new docs

0678f7c

ichernetsky force-pushed the inc/metrics-overhaul branch from 1f56cb2 to 0678f7c Compare August 3, 2018 19:02

mesosphere-ci requested changes Aug 3, 2018

View reviewed changes

mesosphere-ci approved these changes Aug 3, 2018

View reviewed changes

zen-dog approved these changes Aug 4, 2018

View reviewed changes

ichernetsky merged commit e9c8138 into master Aug 6, 2018

ichernetsky deleted the inc/metrics-overhaul branch August 6, 2018 06:47


		### Statsd and derived statistics

		Statsd typically creates derived statistics (mean, p99) from mean values Marathon reports. Our metrics package also reports derived statistics. To avoid accidentally aggregating statistics multiple times, be sure you know where you are reporting and computing mean values.


		## Metric names

		All metric names have to prefixed by a prefix that you specify and are subject to modification by graphite, datadog, or statsd. For example, if we write that the name of a metric is `service.mesosphere.marathon.uptime`, it might be available under `stats.gauges.marathon_test.service.mesosphere.marathon.uptime` in your configuration.


		### App, group, and task counts

		`service.mesosphere.marathon.app.count` (gauge) - The number of defined apps. This number influences the performance of Marathon: if you have

Metrics overhaul #6279

Metrics overhaul #6279

Conversation

ichernetsky commented Jun 4, 2018

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

timcharper left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wavesoft left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

mesosphere-ci left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timcharper left a comment •

edited