Question: Strategy for large number of services #31
In total, we have approximately 900 Fastly services deployed to one Fastly account. As one can imagine, if I were to even attempt to boot the fastly_exporter with autodiscovery enabled, it would only be a bad time. Up until this point, I'd been manually curating our list of 'important services' to monitor with the exporter by filtering out our Staging environments, etc.
I was wondering if there were any existing strategies out there for dealing with an excessive number of Fastly properties with the exporter, and how one might go about architecting the exporter and the prometheus ingestion to deal with these volumes.
A couple of key things come to mind:
Interesting. At a high level, I don't see an inherent reason that one fastly-exporter shouldn't be able to handle 1000 services, though I can imagine the current architecture may not be ideal. Have you tried? If so, how does it explode?
Even if it's possible to do everything in one process, I can certainly understand why you'd want to "shard" the services out over multiple processes. My intuition would be to do something really simplistic. As a strawman, maybe we could have
I would want to have less stupid names for those flags, or otherwise make it more intuitive to set up—but would something like that work?
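To make the strawman concrete, here's a minimal sketch of how deterministic "sharding" of services across exporter processes could work. The flag names and service IDs are purely illustrative assumptions, not actual fastly-exporter flags: each process would be told its shard index and the total shard count, and would only monitor services whose hashed ID lands on its shard.

```go
// Hypothetical sketch: deterministically assign Fastly service IDs to one of
// N exporter shards. Flag names (-shard-index, -shard-total) and service IDs
// below are illustrative, not part of fastly-exporter.
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor hashes a service ID with FNV-1a and maps it onto [0, total).
// The same ID always lands on the same shard, so every process makes the
// same decision independently.
func shardFor(serviceID string, total uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(serviceID))
	return h.Sum32() % total
}

func main() {
	// Pretend these came from the Fastly API service listing.
	services := []string{"SU1Z0isxPaozGVKXdv0eY", "7i6HN3TK9wS159v2gPZ8"}

	const shardIndex, shardTotal = 0, 4 // e.g. -shard-index=0 -shard-total=4
	for _, id := range services {
		if shardFor(id, shardTotal) == shardIndex {
			fmt.Printf("shard %d monitors %s\n", shardIndex, id)
		}
	}
}
```

Because the assignment is a pure function of the service ID, adding a new exporter process only requires changing the two flags, with no coordination between processes.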
Did a few experiments with the docker-compose file included in the repo.
When running v2.2.0, I let the exporter attempt to discover all services dynamically.
When it boots, it discovers all 923 services successfully
However, after letting it run for several minutes, it's only able to scrape 273 service IDs.
The exporter reports a constant stream of timeouts for the remaining services, similar to this:
fastly-exporter_1 | level=error component=monitors service_id=REDACTED service_name=redacted-service-name.example.com err="Get https://rt.fastly.com/v1/channel/REDACTED/ts/1563245983: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
I like the idea of using the hashmod approach that you've described for predictable 'sharding'.
I'm also a big fan of the regex approach, which I could see myself using for excluding large swaths of services that are generated for CI builds (that have convention-based names), but should not be included in monitoring.
I think these two features together would open up a broad range of flexibility to be able to easily support large deployments.
Let me know if you'd like me to run any further tests with different timeout settings, or if other logs would be helpful for looking into possible constraints of a single exporter.
Great, this is good food for thought. I'm pretty sure we can come up with something that will work, let me roll it around in my head for a little while. In the meantime, is it possible that you can build and test out the version in #32, under the
I built the
I built and ran both the master branch and the
my-mac$ netstat -anv | grep `pidof fastly-exporter`
tcp4  0  113  10.0.0.145.54733  18.104.22.168.443   ESTABLISHED  566694  131072  24260  0  0x0102  0x00000020
tcp4  0    0  127.0.0.1.8080    *.*                 LISTEN       131072  131072  24260  0  0x0100  0x00000026
tcp4  0    0  10.0.0.145.54595  22.214.171.124.443  ESTABLISHED  243264  131768  24260  0  0x0102  0x00000028