Proposal: Service/Task Stats #1284

Open
jhorwit2 opened this issue Aug 1, 2016 · 19 comments
Comments

@jhorwit2

jhorwit2 commented Aug 1, 2016

Re: moby/moby#24597

I was doing some thinking about this problem and wanted to propose a possible solution to solicit feedback before I started working on it. The current problem is that there is no easy way to get stats for a service and/or task. You can currently get task stats by running docker stats on the container id, but you have to be on that node.

Saving stats to managers:

  1. The agent/worker will get a new reporter for stats.
    • The worker will keep track of subscribing/unsubscribing from container stats (existing logic in daemon) based on the current listen method.
    • All stats from the worker will be sent to managers via a gRPC stream, UpdateStats(stream obj). Stats will be the same data we currently push on that endpoint, but as protobuf, and will include the task ID + service ID for DB indexes. That way we get both task & service stats depending on how you query.
  2. The manager will listen for stat updates via gRPC, similar to how the dispatcher handles task status updates.
    • When an update is received, it will overwrite the previous stat value for the given task. This means only the latest value will ever be stored in Raft. When a task is removed, its stat will be removed too.
    • Updates should be batched, just like task status updates.
    • A new task_stats table will need to be created, with indexes on service ID & task ID to support service/task queries.

Getting stats:

  • Add a new RPC method, GetStats(StatRequest) stream StatsResponse, where you can query by either service ID or task ID (querying by both is equivalent to querying by just the task ID, so that should be fine too). StatsResponse will contain a list of stats (one per task). Technically this could be a bidirectional stream, since you could have multiple requests for any number of tasks; then you could just use the queue/sink machinery from go-events to push results out. The client would push listen/remove requests & the server would push responses back. A rough sketch of both RPCs follows.
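To make the proposed surface concrete, here is a rough sketch of what the two RPCs (UpdateStats for workers pushing samples, GetStats for querying by service/task) might look like as Go server interfaces. Every name here (TaskStats, StatRequest, StatsResponse, the stream stubs) is hypothetical and only mirrors the shape described above; it is not an existing or agreed swarmkit API.

// Hypothetical, generated-style Go types for the proposed Stats RPCs. None of
// these exist in swarmkit today; they only illustrate the proposal above.

// TaskStats is one stat sample, tagged with the IDs used for indexing and
// service-level aggregation.
type TaskStats struct {
	TaskID    string
	ServiceID string
	// Same data currently pushed on the engine stats endpoint, as protobuf.
	Stats []byte
}

// StatRequest selects tasks either directly or via their service.
type StatRequest struct {
	ServiceID string
	TaskID    string
}

// StatsResponse carries one entry per matching task.
type StatsResponse struct {
	Stats []*TaskStats
}

// StatsServer is the manager-side surface: workers stream samples in via
// UpdateStats, clients stream query results out via GetStats.
type StatsServer interface {
	UpdateStats(stream Stats_UpdateStatsServer) error
	GetStats(req *StatRequest, stream Stats_GetStatsServer) error
}

// Simplified stand-ins for grpc-go generated stream interfaces.
type Stats_UpdateStatsServer interface {
	Recv() (*TaskStats, error)
}

type Stats_GetStatsServer interface {
	Send(*StatsResponse) error
}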

Questions:

  • Does this seem ok 👼 ?
  • Should this be added to the dispatcher service, or to a new service? The latter seems to make more sense to me, but I'm not sure.
  • Is collecting stats every 1 second OK? That's the current hardcoded value in the daemon for the stats collector.
  • Should the StatRequest support multiple tasks/services? I can see it for tasks, but displaying multiple services might get interesting.

Doing things this way also seems to make it easy to add autoscaling support in the future, since the managers will have access to the aggregated stats for a service. Then you can watch for updates -> run them against service rules -> schedule updates accordingly.

@aaronlehmann
Collaborator

This seems well thought out. My only concern is whether the Raft store is the right place to store task stats. Since it's a fully consistent distributed store, the overhead of writes is quite high. Our current design limits the store to one concurrent writer, so that writes can never conflict with each other. If a Raft write takes tens of milliseconds (or more) to reach consensus, and we are updating the stats for each task every second, the stats reporting could become a bottleneck for the entire system after just a handful of tasks.

We generally batch writes into transactions to avoid this problem for other kinds of writes, and something similar could help for stats reporting. For example, the dispatcher collects task updates and writes all recent updates to Raft in a single transaction (see d.processTaskUpdates). Doing something similar here (collecting stats from all tasks and then writing all of them to Raft in one transaction once a second) would deal with the write latency issue.
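For illustration, a minimal sketch of that once-per-second batching pattern, in the spirit of d.processTaskUpdates. It reuses the hypothetical TaskStats type from the earlier sketch, and the StatsStore/Tx interfaces are stand-ins for a transactional store rather than swarmkit's actual store API.

import (
	"context"
	"sync"
	"time"
)

// statsBatcher buffers the latest sample per task and flushes the whole
// buffer to the store in a single transaction once per second, so the write
// latency is paid per interval rather than per task.
type statsBatcher struct {
	mu      sync.Mutex
	pending map[string]*TaskStats // keyed by task ID; last sample wins
	store   StatsStore
}

// StatsStore and Tx are hypothetical stand-ins for a transactional store.
type StatsStore interface {
	Update(func(tx Tx) error) error
}

type Tx interface {
	PutTaskStats(*TaskStats) error
}

// Record buffers the newest sample for a task, overwriting any older one.
func (b *statsBatcher) Record(s *TaskStats) {
	b.mu.Lock()
	b.pending[s.TaskID] = s
	b.mu.Unlock()
}

// Run flushes the buffer once per second until ctx is cancelled.
func (b *statsBatcher) Run(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			b.mu.Lock()
			batch := b.pending
			b.pending = make(map[string]*TaskStats)
			b.mu.Unlock()
			if len(batch) == 0 {
				continue
			}
			// One transaction for all buffered samples.
			_ = b.store.Update(func(tx Tx) error {
				for _, s := range batch {
					if err := tx.PutTaskStats(s); err != nil {
						return err
					}
				}
				return nil
			})
		}
	}
}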

But there are other reasons why I wonder if Raft is the right place for this data. You suggest only storing the latest stat sample in Raft, which makes sense because Raft should contain a limited data set that will fit in memory, but I'd expect that in a lot of cases you want to keep historical stats. Storing the stats in Raft would make this awkward.

Perhaps an alternative would be to have some sidecar service that listens for stats updates and stores them to a separate service. But admittedly this would be a lot more complex both in terms of design and operation.

Anyway, just some random thoughts on this. Raft may well be the right place to store the latest stat values, at least for now, as long as proper write batching is used. I'm curious to hear what other people think, though.

@jhorwit2
Author

jhorwit2 commented Aug 1, 2016

@aaronlehmann Yea, the write concern was one that I had, but I figured batching should take care of that. The only reason I mention not storing historical stats is that I don't see a need yet, or really ever. When you run docker stats right now you don't have an option to look at history; it only shows data from this point forward. Even if autoscaling became a thing, you would only need to store a limited set of data, which doesn't need to live in Raft. You could just have each manager listen for write events for task stats, then store the last N stat points (depending on your streaming algorithm or w/e you use to decide when to autoscale) in memory.

Do you have a specific use case in mind where storing historical events would be useful?

@aaronlehmann
Collaborator

> Do you have a specific use case in mind where storing historical events would be useful? IMO, that should be solved by some other service, like influx/prometheus/.

Not really. Just seems like something that would be useful for tracking down problems.

@dongluochen
Contributor

Besides service/task stats, we should get node stats on resource utilization.

I don't know if Raft is the right place to store them. There is no consistency requirement for stats. A new leader can reconstruct stats in memory within a reasonable time, and it can keep a history of N data points.

@jhorwit2
Author

jhorwit2 commented Aug 1, 2016

While I do agree node stats are important, I don't think they quite fit into this proposal since they aren't the same stats.

@stevvooe
Contributor

stevvooe commented Aug 2, 2016

Flowing stats through the managers and Raft really is going to become a scaling bottleneck very quickly. If you just look at a modest calculation of 1000 containers, with 40 statistics at 8 bytes a record, we are talking 26-27 GB of data pushed per day with a 1-second sampling period. You can write a lot less to disk, but you have to be careful about how this is indexed and stored.
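For reference, the arithmetic behind that estimate (a back-of-the-envelope check only, using exactly 1000 tasks, 40 metrics per task, 8 bytes per sample, and a 1-second sampling period):

package main

import "fmt"

func main() {
	const (
		tasks          = 1000
		metricsPerTask = 40
		bytesPerSample = 8
		secondsPerDay  = 24 * 60 * 60
	)
	perSecond := tasks * metricsPerTask * bytesPerSample // 320,000 bytes/s
	perDay := perSecond * secondsPerDay                  // 27,648,000,000 bytes/day
	fmt.Printf("%d KB/s, ~%.1f GB/day\n", perSecond/1000, float64(perDay)/1e9)
	// Output: 320 KB/s, ~27.6 GB/day
}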

The other issue here is integration with existing metrics systems. For docker/distribution, we exported a large number of statistics via log files and /debug/vars. We did not release the analogous plugins to collect and report these metrics to existing metrics systems. Most users threw their hands up and helplessly said there was no monitoring, even though the data was there. Whatever happens, it needs to integrate easily with existing metrics systems, and aggregation needs to be supported.

We can look at a few use cases to get a better idea of what to do:

  1. Convenient monitoring data for users (docker stats).
  2. Scheduling feedback.
  3. Export to external monitoring systems.

Given these use cases, it makes a lot more sense to build this kind of thing on something that can export. We can re-import into the managers through an aggregator, and convenience monitoring can be coordinated through manager resolution to an individual endpoint or queried from the aggregator.

Put simply: let's just use prometheus. It defines an export format that we can build upon. The aggregators already exist and we can embed the aggregator right into the swarm manager, such that we have the data on tap for use cases 1 and 2.

@jhorwit2
Author

jhorwit2 commented Aug 2, 2016

@stevvooe When you say "use prometheus", do you mean actually embedding a prometheus server into the swarm cluster?

@jhorwit2
Author

jhorwit2 commented Aug 2, 2016

Thinking about this some more. Why not just:

  • Limit the stats being collected for tasks/services to just what we expose in docker stats. You go from ~40 data points to the ~8 shown in docker stats (maybe a couple more specific to tasks and/or services).
    • If you want all the things, you can use another service built for this (cadvisor, telegraf, etc). You would then get all the data points you want (and way more), and you can use w/e exporters they support (graphite, influxdb, prometheus, etc, etc).
  • Have each manager only store (in memory) the last stat point for each task it's in charge of, plus whatever other metrics are deemed important, perhaps for that manager (or its nodes). This gets rid of Raft as storage for stats.
  • Expose the metrics, if configured, on a specific port for exporting to prometheus (see the sketch below). That way a prometheus node (or w/e) can scrape at a given interval. Prometheus would be the first thing supported; in the future more support could be added for graphite, influxdb, etc.
  • When a user runs docker stats, you look up which managers have the stats for the task(s) and then stream that data to the daemon to show the user.

That solves 1, 2 and 3.
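A minimal sketch of the "expose metrics on a port" piece using the Prometheus Go client (github.com/prometheus/client_golang); the port is arbitrary, and whether this lives in the manager or a separate collector is exactly the open question in this thread.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve whatever collectors have been registered with the default
	// registry in the Prometheus text format; a Prometheus server (or any
	// scraper that speaks the format) can then pull at its own interval.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}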

@stevvooe
Contributor

stevvooe commented Aug 2, 2016

@jhorwit2 Yes, let's just embed prometheus or some collector.

Now, I'm not saying to use prometheus' tooling and dashboard. We should have tools to translate to other environments. I am saying that we should use the export format and scraping model to keep the data under control and avoid the volume from ballooning.

There are three solid reasons:

  1. The export format provides enough sophistication to be integrated with different systems.
  2. We'll end up having to build half of a distributed metrics system anyway.
  3. Aggregation and querying are going to have to be built somewhere.

If we look at the api output of docker stats, there isn't really anything special there that requires a different format.

cc @crosbymichael

@jhorwit2
Author

jhorwit2 commented Aug 2, 2016

@stevvooe I'm not sure I see why embedding prometheus is worth the added complexity.

  • Prometheus isn't geared towards long-term storage (nor do I really see why this data should be stored for long in docker). That task is typically offloaded to things like influxdb.
  • This now requires people (users/customers) to learn prometheus. Other solutions often expose a prometheus handler / implement a custom collector, but that does not require core prometheus code. You can use the client for that & just implement the Collector interface. Other solutions also provide ways to push to other storage backends like influxdb, etc.
  • If you plan to use prometheus as a way to aggregate / query data, there will need to be ways to tune it. This can get pretty fun. It means you (docker) now have to worry about scaling swarm/docker clustering & scaling some other embedded service you don't technically control.
  • How to handle HA? Prometheus doesn't handle replication natively, so you need to have every prometheus instance have the same scraper config.

Perhaps the best solution would be something like what @aaronlehmann mentioned: another small service that exists solely to handle metrics. Each node in the cluster can establish a gRPC connection with the metric node(s) to stream stats for every task. The service can then aggregate based on service and export (via a prometheus handler or by pushing to supported backends). The service can also handle responding to API requests for stats. Perhaps something like worker/manager/collector roles. By default, managers could be collectors, but you would have the option to run collectors as their own thing for resource/scaling purposes. Or, if you are running cadvisor/telegraf on every node, you could omit the collector completely.

@stevvooe
Contributor

stevvooe commented Aug 3, 2016

@jhorwit2 I think you are attributing aspects of the prometheus server to what is a minimal proposal to integrate the prometheus tools. I am not suggesting we use their storage system or query system. People won't even have to learn prometheus for basic cases (although people would have to learn a half-baked custom solution too, so this point is moot). A solid integration wouldn't even expose anything about prometheus unless the user wanted it.

My point is that we shouldn't be building a metrics system. Any metrics system purpose-built for swarm will require integration and a metric model. Prometheus already provides an export format and protobuf that could be used to transmit metrics. Let's not rebuild this from scratch.

Metrics collection is woefully complex and most underestimate the problem. There is no avoiding complexity in this kind of thing. docker stats is a great example: it provides a few interesting stats, but immediately becomes less useful as soon as you need more than that. I'd rather we take something like prometheus and integrate, over building a half-baked solution that falls over at scale.

@jhorwit2
Author

jhorwit2 commented Aug 3, 2016

@stevvooe < insert sigh of relief gif > lol. I got a little confused/worried by the talk of aggregators/scrapers, embedding prometheus, and querying 👼 . I completely agree that using prometheus' metric objects and export format is a smart idea.

So basically, we implement the collector interface to collect stats. Same paradigm used by things like node_exporter to register / collect metrics. We'd want to use the vector metrics so we can collect/get/remove based on labels (service name, etc).

So, the flow would be:

Collecting stats:

  • Workers transmit task stats (engine-api/types/stats.go) to the collector.
  • The collector takes the stat obj per task -> collects it into vec objs via specific labels.

Querying:

  • A worker/manager queries for metrics based on labels.
  • Export metrics via the prometheus metric protobuf type (converts metric collectors to single metrics).
  • Respond.

You still have the same problems with scale though, I think. You need to stream the stats/metrics and ideally collect on more than one manager for HA. You do get the metric collectors / models / math for free though 👍 .

edit: when I say labels, I'm referring to the prometheus metric label, which can be any value, like the task ID. (A small sketch of that label-keyed approach follows.)
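A small sketch of that label-keyed approach using the Go client's vector metrics. The metric name and helper functions are placeholders; the point is that samples are addressed, updated, and removed per (service_id, task_id) label pair, matching the task lifecycle.

package collector

import "github.com/prometheus/client_golang/prometheus"

// One vector metric per exported stat; each task gets its own label pair.
var memoryUsage = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "task_memory_usage_bytes", // placeholder name
	Help: "Last reported memory usage for a task.",
}, []string{"service_id", "task_id"})

func init() {
	prometheus.MustRegister(memoryUsage)
}

// Observe records the latest sample for a task; repeated calls overwrite it.
func Observe(serviceID, taskID string, usageBytes float64) {
	memoryUsage.WithLabelValues(serviceID, taskID).Set(usageBytes)
}

// Forget drops a task's series once the task is removed, so stale label
// pairs don't linger in the export.
func Forget(serviceID, taskID string) {
	memoryUsage.DeleteLabelValues(serviceID, taskID)
}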

@kunalkushwaha
Contributor

One thing regarding scaling: do stats need to be strongly consistent?
Each manager node could independently fetch (or collect by listening) from collectors, instead of using Raft to sync them across managers.

> How to handle HA? Prometheus doesn't handle replication natively, so you need to have every prometheus instance have the same scraper config.

Instead of configuring Prometheus in HA mode, deploying multiple Prometheus instances is preferred.

@jhorwit2
Author

jhorwit2 commented Aug 3, 2016

@stevvooe I made a quick example: https://github.com/jhorwit2/docker_collector. The aggregator basically just opens/closes stat streams for containers (like docker stats does); the collector is the prometheus stuff.

edit: This doesn't have querying but that's provided by looking up collectors (metrics) based on labels, then returning those.

@stevvooe
Contributor

stevvooe commented Aug 3, 2016

@kunalkushwaha No, stats don't need consistency. In fact, the model of multiple aggregators would be fine, where each manager scrapes its own set of stats. The main issue is actually data volume. Whether using a push or pull model, there needs to be a way to reduce the sampling rate as the number of metrics goes up. In practice, it is very easy to hit this tipping point, since the volume is a function of the number of tasks * metrics * rate.

@jhorwit2 Great demo! Do you have an example of the export text format?

@jhorwit2
Author

jhorwit2 commented Aug 3, 2016

This is the output from the text export (http://localhost:8080/metrics in the demo)

This

metrics: []containerMetric{
    containerMetric{
        name:       "memory_usage",
        help:       "Current memory usage in MiB",
        metricType: prometheus.CounterValue,
        getValue: func(s types.StatsJSON) float64 {
            return float64(s.MemoryStats.Usage)
        },
    },
    containerMetric{
        name:       "memory_limit",
        help:       "Memory limit in MiB",
        metricType: prometheus.CounterValue,
        getValue: func(s types.StatsJSON) float64 {
            return float64(s.MemoryStats.Limit)
        },
    },
    containerMetric{
        name:       "memory_percentage",
        help:       "Current memory percentage utilized in MiB",
        metricType: prometheus.CounterValue,
        getValue: func(s types.StatsJSON) float64 {
            return float64(s.MemoryStats.Usage) / float64(s.MemoryStats.Limit) * 100.0
        },
    },
},

Gets turned into this in text format.

# HELP memory_limit Memory limit in MiB
# TYPE memory_limit counter
memory_limit{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="15rbd495mv1iwreqyoxjz7oxq"} 2.097557504e+09
memory_limit{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="4zuqzcpunly7ax82hw3bf2z1g"} 2.097557504e+09
memory_limit{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="6nz6nhdsmxajo5ca2mr4h8dun"} 2.097557504e+09
memory_limit{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="c54xlrljprvsdykzsysq3qike"} 2.097557504e+09
# HELP memory_percentage Current memory percentage utilized in MiB
# TYPE memory_percentage counter
memory_percentage{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="15rbd495mv1iwreqyoxjz7oxq"} 0.07791462197739109
memory_percentage{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="4zuqzcpunly7ax82hw3bf2z1g"} 0.06971308282187623
memory_percentage{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="6nz6nhdsmxajo5ca2mr4h8dun"} 0.06912725859648233
memory_percentage{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="c54xlrljprvsdykzsysq3qike"} 0.0693225333382803
# HELP memory_usage Current memory usage in MiB
# TYPE memory_usage counter
memory_usage{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="15rbd495mv1iwreqyoxjz7oxq"} 1.634304e+06
memory_usage{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="4zuqzcpunly7ax82hw3bf2z1g"} 1.462272e+06
memory_usage{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="6nz6nhdsmxajo5ca2mr4h8dun"} 1.449984e+06
memory_usage{service_id="0z06lez523v6gt6ey4g5hsbjt",task_id="c54xlrljprvsdykzsysq3qike"} 1.45408e+06

Or you can take the metrics and put them into protobuf (or JSON) to send around.
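For completeness, this is roughly how a custom prometheus.Collector could emit those per-task samples as constant metrics at scrape time (the other main option besides the GaugeVec approach sketched earlier); the types and lookup wiring here are simplified placeholders, not the linked demo's actual code, and a gauge value type is used since memory usage goes up and down.

package collector

import "github.com/prometheus/client_golang/prometheus"

// taskSample is the latest stat snapshot kept in memory for one task.
type taskSample struct {
	serviceID, taskID string
	memoryUsage       float64 // bytes
}

// statsCollector implements prometheus.Collector over the in-memory samples.
type statsCollector struct {
	samples func() []taskSample // returns the current snapshot
	desc    *prometheus.Desc
}

func newStatsCollector(samples func() []taskSample) *statsCollector {
	return &statsCollector{
		samples: samples,
		desc: prometheus.NewDesc(
			"memory_usage",
			"Current memory usage",
			[]string{"service_id", "task_id"},
			nil,
		),
	}
}

func (c *statsCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

// Collect turns each in-memory sample into a constant metric carrying the
// service/task labels, which is what produces lines like the ones shown above.
func (c *statsCollector) Collect(ch chan<- prometheus.Metric) {
	for _, s := range c.samples() {
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, s.memoryUsage,
			s.serviceID, s.taskID,
		)
	}
}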

@aluzzardi
Member

/cc @crosbymichael @icecrime

@jhorwit2
Author

So I saw moby/moby#25820; are stats being worked on in swarmkit as well? I was going to try to start this week, but I want to make sure I'm not duplicating anything.

@aluzzardi
Member

@jhorwit2 Hey, so, there are two kinds of metrics: 1) internal metrics (as in, internal to docker/swarmkit) and 2) container metrics.

There's ongoing work in Docker and soon SwarmKit to expose internal metrics using the Prometheus format.

/cc @crosbymichael
