Refactor, structural and performance optimizations #33
This rather large PR accomplishes a few things in one motion.
First, we add support for filtering services by name (via regex) and by shard (a new concept introduced for large accounts). This means that the authoritative "flow" of information during the exporter runtime has changed. When we refresh services visible to a token from api.fastly.com, we need to apply the filtering rules at that point, and only update the api.Cache with the services that pass. Then, api.Cache becomes the source of authority for service IDs for the rt.Manager, and therefore a dependency of it. (Addresses #31.)
@neufeldtech — would you be so kind as to review the new service filtering options (maybe checking the README is the easiest way to get an overview) and tell me if this would work for you?
(These changes were significant enough that I bit the bullet and did a ground-up refactor, creating package api and api.Cache, for interacting with api.fastly.com; package rt with rt.Subscriber and rt.Manager, for interacting with the rt.fastly.com real-time stats service; and package prom, holding an improved version of the exposed Prometheus metrics. I also refactored component dependency relationships. It's a lot easier to read and understand now, I think.)
Second, we make performance improvements for deserialization of rt.fastly.com responses. This was the primary performance bottleneck, especially in exporters configured with many services. Using jsoniter instead of stdlib package encoding/json gives us ~10x CPU and allocation improvements. (Obviates #27.)
@keur — if you have the time and energy, would you be so kind as to build and test the version in this branch, to see if it's a noticeable improvement for you?
Finally, I've updated the Dockerfile to reflect these changes, and others that were missed previously.
@mrnetops — would you be so kind as to review my Dockerfile changes?
I spent some time with the new code, both in Docker and locally (OSX).
Many of these services are generated by naming conventions, and only get traffic when we have CI builds running against them. It's expected that these CI services, and many of our staging services, won't have metrics available much of the time. This means that out of approximately 900 services, we may only see metrics for 200-300 of them.
We can see evidence of the above when looking at the debug logs:
```
level=debug component=rt.fastly.com status_code=200 response_ts=1563744844 err="No data available, please retry"
level=debug component=rt.fastly.com status_code=200 response_ts=1563744844 err="No data available, please retry"
level=debug component=rt.fastly.com status_code=200 response_ts=1563744844 err="No data available, please retry"
```
When testing with 2 shards, I'm seeing about 135 services on shard 1, and 116 services on shard 2. Without performing in-depth analysis about which services should have metrics available, the numbers sound about right.
Regarding the timeouts: I am still seeing some, and it's not clear exactly what impact they are having on collection. If there is any further analysis you'd like me to perform regarding the timeouts, let me know.
```
level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<redacted>/ts/1563745509: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<redacted>/ts/1563745510: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<redacted>/ts/1563745515: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
```
Regarding the regex feature: I tested it and it does work. However, to my dismay, I learned that Go does not support negative lookaheads/lookbehinds in regular expressions.
The use-case I was intending to solve was as follows:
Given the following set of services, I'd like to scrape metrics for all of them, except any service that begins with
```
www.example.com
v-www.example.com
stg-www.example.com
shield-www.example.com
v-shield-www.example.com
stg-shield-www.example.com
search.example.com
v-search.example.com
stg-search.example.com
shield-search.example.com
v-shield-search.example.com
stg-shield-search.example.com
```
Because lookaround regex features are not supported, I wasn't able to construct an expression that satisfied my requirement of "filter out anything that starts with
To accomplish the same requirement, would it make sense to also include another flag that makes it possible to exclude services, rather than include them? I imagine it could function much the same as the
Yes, this is how rt.fastly.com works, and it slipped my mind that it might be the cause of the missing services. I would like to make this condition more visible, but I don't think lifting that specific error to the info log level is the right approach, as it would be pretty spammy. Brainstorming just now: maybe I could add a new synthetic realtime_responses_total counter, and have labels for service ID and result, with the result being one of: success, no data, or error. What do you think?
That's strange, as the client timeout for rt.fastly.com requests (45s) is well above the forced server timeout of the Fastly service fronting that API. The only obvious answer I can think of is congestion or something between you and the API, but I'll ask around and see if there's a better explanation. In the meantime, I've added a
If I understand correctly, I think that regexp is just
I've been testing with the new timeout flags and have had great success so far, even without any service filtering via regex. It should be noted that previously, I was also occasionally getting timeouts on some
When running the new build in Docker with the timeout flags set to
I am, however, seeing one new error on service startup. It only occurs a handful of times, but perhaps it necessitates another flag for connection timeout tuning:
```
fastly-exporter_1 | level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<REDACTED>/ts/0: net/http: TLS handshake timeout"
fastly-exporter_1 | level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<REDACTED>/ts/0: net/http: TLS handshake timeout"
```
When checking the two service IDs that had the TLS handshake timeout, one has data, and the exporter was able to ingest it successfully even after the TLS handshake error. The other has 'no data', and we see the 'no data' counter incrementing as expected, which leads me to believe that this one is also "OK" even after experiencing the initial TLS handshake error.
This doesn't quite satisfy my use case, as I found out in testing. It turns out that any service name that starts with a
I don't think this regex behaviour will continue to be a blocker for me, though, especially since services that don't have metrics available on rt.fastly.com don't show up in the fastly_exporter.
This build of the exporter seems stable so far, and I may test it out in our existing Prometheus setup to see how it fares with all services (no regex filtering) and 2 shards.
That's good news. Let's leave it there.
So I suspect it's the same root cause as the other errors. I've improved the way we update the realtime_results_total counter; since it's working eventually, I would just leave this as-is unless you object.
Yes, another facepalm moment for me; using a regex to encode "anything EXCEPT a specific multi-character expression" is not currently possible in Go's implementation. It would be possible to change
I'll merge this and cut a new release in a day or so, unless you say otherwise before then. Thanks a lot for your detailed feedback, I think it's really improved the exporter!
Having both an
Thank you very much for these improvements to this exporter, and I look forward to the release.
OK, easy enough:
I'm mostly interested if you notice a subjective improvement. But I've just pushed 1442d1b which re-adds the pprof endpoints, so if you wanted to send some profiles over, I would happily review them.