
Question: Strategy for large number of services #31

Closed
neufeldtech opened this issue Jul 15, 2019 · 6 comments


@neufeldtech

commented Jul 15, 2019

Hi there,
Thanks again for your time developing this software; it's helped us out immensely thus far. We've been running an old version (0.x) for quite a while and it's been good to us. Right now we're running two fastly_exporter instances, with approximately 150 services each, on two VMs to share the load. I'm interested in the auto-discovery feature you've implemented in the new versions of this exporter, but I have concerns about how I can manage a large number of Fastly services with it.

In total, we have approximately 900 Fastly services deployed to one Fastly account. As you can imagine, if I were to boot the fastly_exporter with autodiscovery enabled, it would be a bad time. Up until this point, I'd been manually curating our list of 'important services' to monitor with the exporter, filtering out our staging environments and so on.

I was wondering whether there are any existing strategies for dealing with a very large number of Fastly properties, and how one might go about architecting the exporter and the Prometheus ingestion to handle these volumes.

A couple of key things come to mind:

  • It might be nice (or necessary) to distribute and coordinate chunks of services across different instances of the exporter.
  • Perhaps a command-line flag to exclude/include services based on a regular expression? Combined with the existing autodiscovery feature, this could be a viable way to dynamically and predictably consume a subsection of the work. (Many of our services have convention-based names, which would make it easy to filter them out in bulk.)
@peterbourgon

Owner

commented Jul 15, 2019

Interesting. At a high level, I don't see an inherent reason that one fastly-exporter shouldn't be able to handle 1000 services, though I can imagine the current architecture may not be ideal. Have you tried? If so, how does it explode?

Even if it's possible to do everything in one process, I can certainly understand why you'd want to "shard" the services out over multiple processes. My intuition would be to do something really simplistic. As a strawman, maybe we could have -shard-total and -shard-identity integer flags. If they're set, and if no explicit -service is provided, then each instance will "own" the discovered service IDs whose hash, modulo total, is equal to identity. So, if you want to split across 3 fastly-exporter instances, you'd start them as

fastly-exporter ... -shard-total 3 -shard-identity 0
fastly-exporter ... -shard-total 3 -shard-identity 1
fastly-exporter ... -shard-total 3 -shard-identity 2

I would want less stupid names for those flags, or to otherwise make it more intuitive to set up, but would something like that work?
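The strawman above could be sketched roughly as follows. Everything here is illustrative: the `shardOwns` helper, the FNV hash choice, and the flag semantics are assumptions, not fastly-exporter's actual code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardOwns reports whether this exporter instance (identity) should
// monitor the given service ID, using hash(serviceID) mod total.
// This mirrors the proposed -shard-total / -shard-identity flags.
func shardOwns(serviceID string, total, identity uint64) bool {
	h := fnv.New64a()
	h.Write([]byte(serviceID))
	return h.Sum64()%total == identity
}

func main() {
	// With -shard-total 3, each discovered service ID lands on
	// exactly one of the three instances (identity 0, 1, or 2).
	ids := []string{"svc-alpha", "svc-beta", "svc-gamma", "svc-delta"}
	for _, id := range ids {
		for shard := uint64(0); shard < 3; shard++ {
			if shardOwns(id, 3, shard) {
				fmt.Printf("%s -> shard %d\n", id, shard)
			}
		}
	}
}
```

Because the assignment depends only on the service ID, each instance computes its share independently with no coordination; the tradeoff is that changing `-shard-total` reshuffles most services across instances.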

@peterbourgon

Owner

commented Jul 15, 2019

Thinking further, a -service-name-regexp ... flag would also be a nice way to do this. Would that be equally effective for you?

@neufeldtech

Author

commented Jul 16, 2019

Did a few experiments with the docker-compose file included in the repo.

When running v2.2.0, I let the exporter attempt to discover all services dynamically.

When it boots, it discovers all 923 services successfully 👍

However, after letting it run for several minutes, it's only able to scrape 273 service IDs.

The exporter reports a constant stream of timeouts for the other services that are not working, similar to this:

fastly-exporter_1  | level=error component=monitors service_id=REDACTED service_name=redacted-service-name.example.com err="Get https://rt.fastly.com/v1/channel/REDACTED/ts/1563245983: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

I like the idea of using the hashmod approach that you've described for predictable 'sharding'.

I'm also a big fan of the regex approach, which I could see myself using for excluding large swaths of services that are generated for CI builds (that have convention-based names), but should not be included in monitoring.

I think these two features together would open up a broad range of flexibility to be able to easily support large deployments.

Let me know if you'd like me to run any further tests with different timeout settings, or if other logs would be helpful for looking into possible constraints of a single exporter.

@peterbourgon

Owner

commented Jul 16, 2019

Great, this is good food for thought. I'm pretty sure we can come up with something that will work; let me roll it around in my head for a little while. In the meantime, could you build and test the version in #32, under the maxconns branch? I have a hunch it might eliminate the timeouts.

@neufeldtech

Author

commented Jul 16, 2019

I built the maxconns branch and tried it with docker-compose pointed at my local image, with the same results as before. As you can see, it is gradually able to scrape more services, but levels off at 275 this time (much the same as before).

[Graph: count of successfully scraped services climbing gradually, then leveling off at 275]

I built and ran both the master branch and the maxconns branch locally on my OSX machine, and I only ever observed 2 connections to rt.fastly.com while it was running (even with all the services). Is this what you'd expect to see? 🤔

my-mac$ netstat -anv | grep `pidof fastly-exporter`
tcp4       0    113  10.0.0.145.54733       151.101.126.34.443     ESTABLISHED 566694 131072  24260      0 0x0102 0x00000020
tcp4       0      0  127.0.0.1.8080         *.*                    LISTEN      131072 131072  24260      0 0x0100 0x00000026
tcp4       0      0  10.0.0.145.54595       151.101.126.35.443     ESTABLISHED 243264 131768  24260      0 0x0102 0x00000028
@peterbourgon

Owner

commented Jul 23, 2019

Should be good now 🚀
