
Refactor, structural and performance optimizations #33

Merged
merged 29 commits into master from optimize on Jul 23, 2019

Conversation

@peterbourgon (Owner) commented Jul 19, 2019

This rather large PR accomplishes a few things in one motion.

First, we add support for filtering services by name (via regex) and by shard (a new concept introduced for large accounts). This means that the authoritative "flow" of information during the exporter runtime has changed. When we refresh the services visible to a token from api.fastly.com, we need to apply the filtering rules at that point, and only update the api.Cache with the services that pass. The api.Cache then becomes the source of authority for service IDs for the rt.Manager, and therefore a dependency of it. (Addresses #31.)
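As an illustration of that filtering step (a sketch, not the PR's actual code): a service survives the refresh if its name matches the regex and its ID lands in this exporter's shard. The FNV-hash-modulo shard assignment shown here is an assumption.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"regexp"
)

// permit is a hypothetical helper sketching the filtering described above:
// a service survives the refresh if its name matches the regex (when one is
// set) and its ID hashes into this exporter's shard. The FNV-hash-modulo
// shard assignment is an assumption, not the PR's confirmed mechanism.
func permit(serviceID, serviceName string, nameRegex *regexp.Regexp, shard, totalShards uint64) bool {
	if nameRegex != nil && !nameRegex.MatchString(serviceName) {
		return false
	}
	if totalShards > 1 {
		h := fnv.New64a()
		h.Write([]byte(serviceID))
		if h.Sum64()%totalShards != shard-1 { // shards numbered from 1
			return false
		}
	}
	return true
}

func main() {
	re := regexp.MustCompile(`^prod-`)
	fmt.Println(permit("AbCdEf123456", "prod-www", re, 1, 2)) // true or false, depending on shard
}
```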

@neufeldtech — would you be so kind as to review the new service filtering options (maybe checking the README is the easiest way to get an overview) and tell me if this would work for you?

(These changes were significant enough that I bit the bullet and did a ground-up refactor, creating package api and api.Cache, for interacting with api.fastly.com; package rt with rt.Subscriber and rt.Manager, for interacting with the rt.fastly.com real-time stats service; and package prom, holding an improved version of the exposed Prometheus metrics. I also refactored component dependency relationships. It's a lot easier to read and understand now, I think.)

Second, we make performance improvements for deserialization of rt.fastly.com responses. This was the primary performance bottleneck, especially in exporters configured with many services. Using jsoniter instead of stdlib package encoding/json gives us ~10x CPU and allocation improvements. (Obviates #27.)
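Since jsoniter mirrors the encoding/json API, the swap is close to a drop-in. A minimal sketch (the Response type is illustrative, and whether the exporter uses ConfigFastest or another preset isn't shown here):

```go
package main

import (
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

// json is a drop-in replacement for the stdlib encoding/json package;
// ConfigFastest trades some strictness for decode speed.
var json = jsoniter.ConfigFastest

// Response is an illustrative stand-in for the rt.fastly.com payload type,
// not the exporter's actual struct.
type Response struct {
	Timestamp uint64 `json:"Timestamp"`
}

func main() {
	var r Response
	// Unmarshal has the same signature as encoding/json.Unmarshal,
	// so existing call sites don't change.
	if err := json.Unmarshal([]byte(`{"Timestamp": 1563745509}`), &r); err != nil {
		panic(err)
	}
	fmt.Println(r.Timestamp)
}
```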

@keur — if you have the time and energy, would you be so kind as to build and test the version in this branch, to see if it's a noticeable improvement for you?

Finally, I've updated the Dockerfile to reflect these changes, and others that were missed previously.

@mrnetops — would you be so kind as to review my Dockerfile changes?

@neufeldtech commented Jul 19, 2019

I'll do some testing and report back with the results.

@neufeldtech commented Jul 21, 2019

I spent some time with the new code, both in Docker and locally (OSX).
At first, I was concerned that I was still not seeing the correct number of services. However, after running with the -debug=true flag, it's clearer what is (probably) going on.

Many of these services are generated by naming conventions and only get traffic when we have CI builds running against them. It's expected that these CI services, and many of our staging services, won't have metrics available much of the time. This means that out of approximately 900 services, we may only see metrics for 200-300 of them.

We can see evidence of the above when looking at the debug logs:

level=debug component=rt.fastly.com status_code=200 response_ts=1563744844 err="No data available, please retry"
level=debug component=rt.fastly.com status_code=200 response_ts=1563744844 err="No data available, please retry"
level=debug component=rt.fastly.com status_code=200 response_ts=1563744844 err="No data available, please retry"

When testing with 2 shards, I'm seeing about 135 services on shard 1, and 116 services on shard 2. Without performing in-depth analysis about which services should have metrics available, the numbers sound about right.

$ curl -s localhost:8081/metrics | ggrep -Po "[a-zA-Z0-9]{22}" | sort | uniq | wc -l
     116
$ curl -s localhost:8080/metrics | ggrep -Po "[a-zA-Z0-9]{22}" | sort | uniq | wc -l
     135

Regarding the timeouts, I am still receiving some timeouts, and it's not clear exactly what impact they are having on collection. If there is any further analysis you'd like me to perform regarding the timeouts, let me know.

level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<redacted>/ts/1563745509: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<redacted>/ts/1563745510: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<redacted>/ts/1563745515: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

Regarding the regex feature, I tested it and it does work; however, to my dismay, I learned that Go does not support negative lookaheads/lookbehinds in regular expressions.

The use-case I was intending to solve was as follows:

Given the following set of services, I'd like to scrape metrics for all of them, except any service whose name begins with v-:

www.example.com
v-www.example.com
stg-www.example.com

shield-www.example.com
v-shield-www.example.com
stg-shield-www.example.com

search.example.com
v-search.example.com
stg-search.example.com

shield-search.example.com
v-shield-search.example.com
stg-shield-search.example.com

Because lookaround regex features are not supported, I wasn't able to construct an expression that satisfied my requirement of "filter out anything that starts with v-". If a suitable expression is still possible (while being generic enough to support future service names), please share!
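To make the limitation concrete, Go's RE2-based regexp package rejects lookaround at compile time:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// RE2, the engine behind Go's regexp package, guarantees linear-time
	// matching and therefore rejects lookaround constructs outright.
	_, err := regexp.Compile(`^(?!v-)`)
	fmt.Println(err)
	// Output:
	// error parsing regexp: invalid or unsupported Perl syntax: `(?!`
}
```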

To accomplish the same requirement, would it make sense to also include another flag that makes it possible to exclude services, rather than include them? I imagine it could function much the same as the -name-regex flag, except that once a match is made, the exporter removes the service from the whitelist rather than adding it.

peterbourgon added some commits Jul 22, 2019

@peterbourgon (Owner, Author) commented Jul 22, 2019

Many of these services are generated by naming conventions and only get traffic when we have CI builds running against them. It's expected that these CI services, and many of our staging services, won't have metrics available much of the time. This means that out of approximately 900 services, we may only see metrics for 200-300 of them.

🤦‍♂

Yes, this is how rt.fastly.com works, and it slipped my mind that it might be the cause of the missing services. I would like to make this condition more visible, but I don't think lifting that specific error to the info log level is the right approach, as it would be pretty spammy. Brainstorming just now: maybe I could add a new synthetic realtime_responses_total counter, and have labels for service ID and result, with the result being one of: success, no data, or error. What do you think?
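A minimal sketch of that counter with the Prometheus Go client; the metric and label names follow the brainstorm here, while the commit that landed later in this PR named it realtime_api_requests_total and added a service_name label:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// realtimeResponses counts rt.fastly.com responses per service and outcome,
// so low-traffic services that legitimately return "no data" are visible
// without spamming the info log.
var realtimeResponses = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "realtime_responses_total", // landed as realtime_api_requests_total
		Help: "Responses from the real-time stats API, by service and result.",
	},
	[]string{"service_id", "service_name", "result"},
)

func init() { prometheus.MustRegister(realtimeResponses) }

// record classifies one response; result is "success", "no data", or "error".
func record(serviceID, serviceName, result string) {
	realtimeResponses.WithLabelValues(serviceID, serviceName, result).Inc()
}

func main() {
	record("AbCdEf123456", "www.example.com", "no data")
}
```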

Regarding the timeouts, I am still receiving some timeouts, and it's not clear exactly what impact they are having on collection. If there is any further analysis you'd like me to perform regarding the timeouts, let me know.

That's strange, as the client timeout for rt.fastly.com requests (45s) is well above the forced server timeout of the Fastly service fronting that API. The only obvious answer I can think of is congestion or something between you and the API, but I'll ask around and see if there's a better explanation. In the meantime, I've added a -rt-timeout flag to the branch; if you wanted, you could try setting it to its maximum value (120s) and see if that changes anything.
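For illustration, a sketch of how -rt-timeout plausibly wires into the real-time client; the flag name, 45s default, and 120s cap come from this comment, but the wiring itself is an assumption:

```go
package main

import (
	"flag"
	"net/http"
	"time"
)

func main() {
	// -rt-timeout bounds each rt.fastly.com request end to end, including
	// connection setup and TLS handshake.
	rtTimeout := flag.Duration("rt-timeout", 45*time.Second, "timeout for rt.fastly.com requests")
	flag.Parse()

	if *rtTimeout > 120*time.Second {
		*rtTimeout = 120 * time.Second // cap mentioned above; clamping is an assumption
	}

	rtClient := &http.Client{Timeout: *rtTimeout}
	_ = rtClient // handed to the real-time subscriber in the real program
}
```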

Given the following set of services, I'd like to scrape metrics for all of them, except any service whose name begins with v- . . .

If I understand correctly, I think that regexp is just ^[^(v\-)]? This playground link seems to give the output you want.

peterbourgon added some commits Jul 22, 2019

Add realtime_api_requests_total metric
With service_id, service_name, and result labels,
and result including "no data", this should help
low-traffic services not appear broken.

@peterbourgon peterbourgon force-pushed the optimize branch from fbd2046 to 83efc10 Jul 22, 2019

@neufeldtech commented Jul 22, 2019

I was testing with the new timeout flags and have found great success so far, even without any service filtering via regex. It should be noted that previously I was also occasionally getting timeouts on some api.fastly.com calls during the initial service inventory query.

When running the new build in Docker with the timeout flags set to -api-timeout=30s -rt-timeout=90s, it appears that I'm able to satisfactorily scrape all services on the account, and I'm no longer experiencing the same timeout errors that I was seeing before.

I am, however, seeing one new error on service startup. It only occurs a handful of times, but perhaps it necessitates another flag for connection-timeout tuning:

fastly-exporter_1  | level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<REDACTED>/ts/0: net/http: TLS handshake timeout"
fastly-exporter_1  | level=error component=rt.fastly.com during="execute request" err="Get https://rt.fastly.com/v1/channel/<REDACTED>/ts/0: net/http: TLS handshake timeout"

When checking the two service IDs that had the TLS handshake timeout, one has data, and the exporter was able to ingest it successfully even after the TLS handshake error. The other has 'no data', and we see the 'no data' counter incrementing as expected, which leads me to believe that this one is also OK despite the initial TLS handshake error.

If I understand correctly, I think that regexp is just ^[^(v-)]? This playground link seems to give the output you want.

This doesn't quite satisfy my use case, as I've found out in testing. It turns out that any service name that starts with a v, such as video-live.example.com, gets filtered out by this expression. See Playground Link.

I don't think this regex behaviour will continue to be a blocker for me, especially since services that don't have metrics available on rt.fastly.com don't show up in the fastly-exporter.

As this build of the exporter seems stable so far, I may test it out in our existing Prometheus setup to see how it fares with all services (no regex filtering) and 2 shards.

@peterbourgon (Owner, Author) commented Jul 22, 2019

When running the new build in Docker with the timeout flags set to -api-timeout=30s -rt-timeout=90s, it appears that I'm able to satisfactorily scrape all services on the account, and I'm no longer experiencing the same timeout errors that I was seeing before.

That's good news. Let's leave it there.

I am, however, seeing one new error on service startup. It only occurs a handful of times, but perhaps it necessitates another flag for connection-timeout tuning:

Interesting. While it's possible to set an explicit net/http.Transport.TLSHandshakeTimeout, that part of connection establishment is subject to the net/http.Client.Timeout (source):

[Image: net/http.Client timeout infographic]

So I suspect it's the same root cause as the other errors. I've improved the way we update the realtime_results_total counter, and since it's working eventually, I would just leave this as-is, unless you object.
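A small sketch to make the relationship concrete: even with an explicit handshake timeout on the transport, the handshake still counts against the client's overall Timeout, so a separate knob wouldn't escape it (all values illustrative):

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// rtClient shows the two timeouts side by side: Timeout is the end-to-end
// budget covering DNS, connect, TLS handshake, request write, and response
// header read; TLSHandshakeTimeout limits only the handshake phase, but the
// handshake still counts against the overall Timeout.
var rtClient = &http.Client{
	Timeout: 45 * time.Second,
	Transport: &http.Transport{
		DialContext:         (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
		TLSHandshakeTimeout: 10 * time.Second,
	},
}

func main() {
	_ = rtClient // used wherever the real-time stats requests are made
}
```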

This doesn't quite satisfy my use case as I've found out in testing. It turns out that any service name that starts with a v, such as video-live.example.com gets filtered out by this expression.

Yes, another facepalm moment for me; using a regex to encode "anything EXCEPT a specific multi-character expression" is not currently possible in Go's implementation. It would be possible to change -name-regex to e.g. -name-regex-include and add another -name-regex-exclude, but if you don't have a concrete need for it now, I would leave it out. Is that OK?

I'll merge this and cut a new release in a day or so, unless you say otherwise before then. Thanks a lot for your detailed feedback, I think it's really improved the exporter!

@neufeldtech commented Jul 22, 2019

Having both an include and exclude regex option would certainly be a "nice to have" feature, as it would allow me to ignore large swaths of services where it's currently not possible to do so. That said, my use case will probably be fine without it for now, so this feature could be put on hold.

Thank you very much for these improvements to this exporter, and I look forward to the release.

@keur commented Jul 22, 2019

@peterbourgon Wow. This is awesome. Thanks so much. I will check this out. Would you like me to send you more profiles too?

peterbourgon added some commits Jul 22, 2019

Update README.md
Co-Authored-By: Jordan Neufeld <neufeldtech@users.noreply.github.com>
@peterbourgon (Owner, Author) commented Jul 22, 2019

@neufeldtech

Having both an include and exclude regex option would certainly be a "nice to have" feature

OK, easy enough: -name-include-regex and -name-exclude-regex now work as you'd expect.
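For reference, the include/exclude semantics presumably reduce to something like the following sketch; the precedence shown (include checked first, then exclude) is my assumption, not confirmed by the PR:

```go
package main

import (
	"fmt"
	"regexp"
)

// keep reports whether a service name passes the filters backing
// -name-include-regex and -name-exclude-regex. The precedence shown
// (include checked first, then exclude) is an assumption.
func keep(name string, include, exclude *regexp.Regexp) bool {
	if include != nil && !include.MatchString(name) {
		return false
	}
	if exclude != nil && exclude.MatchString(name) {
		return false
	}
	return true
}

func main() {
	exclude := regexp.MustCompile(`^v-`)
	fmt.Println(keep("v-www.example.com", nil, exclude))      // false
	fmt.Println(keep("video-live.example.com", nil, exclude)) // true
}
```

Note that video-live.example.com survives here, which is exactly what the lookahead-style include pattern couldn't express.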

@keur

Would you like me to send you more profiles too?

I'm mostly interested in whether you notice a subjective improvement. But I've just pushed 1442d1b, which re-adds the pprof endpoints, so if you wanted to send some profiles over, I would happily review them.

peterbourgon and others added some commits Jul 22, 2019

Update cmd/fastly-exporter/main.go
Co-Authored-By: Jordan Neufeld <neufeldtech@users.noreply.github.com>
@neufeldtech commented Jul 22, 2019

@peterbourgon Tested out the -name-exclude-regex flag, and it works perfectly with my ^v- use case. Many thanks for this!

@peterbourgon peterbourgon merged commit 54b4a6c into master Jul 23, 2019

1 check passed: builds.sr.ht (.build.yml) job completed successfully

@peterbourgon peterbourgon deleted the optimize branch Jul 23, 2019
