Scheduler: Don't schedule runner intervals on fixed cron expressions #345

joekohlsdorf · 2022-12-16T13:32:26Z

Here is a CPU graph for 10 collector instances started at different times (v0.46.1) running with default settings.

As you can see the CPU peaks every 10 minutes all align on round numbers, like 05:00, 05:10, 05:20 etc.

This isn't very nice on our infrastructure, to support this aligned load we have to provision much more infrastructure than we would need if the load was distributed.
In the pgAnalyze Enterprise server we see the same peaks. This is probably not very nice for pgAnalyze aaS either because all collectors of all customers will ingest their data exactly at the same time.

In my opinion, there is no need to schedule like this.
I think two changes are needed:

Random startup scheduling delay between 0 and 60 seconds so that timers don't exactly align even if 10 instances are restarted at the same time.
Simple summing of timestamps to determine when the next task should be scheduled instead of cron determined scheduling.
cronexpr is no longer maintained so this dependency could be removed entirely.

I'd like to use this issue to discuss if this proposed change makes sense.
If you agree I can get a PR ready.

lfittl · 2022-12-16T20:34:09Z

@joekohlsdorf Thanks for bringing this up!

The challenge here is that doing the data fetching at different random intervals would cause the statistics to be subtly different, potentially causing surprising downstream effects in terms of how data shows up in the app.

For context, whilst we do send the time a particular snapshot is for (https://github.com/pganalyze/collector-snapshot/blob/main/full_snapshot.proto#L19), I'd worry that e.g. the difference between looking at 2 seconds of query statistics counters vs 110 seconds would end up being quite confusing. Therefore I'd be -0.5 on adding a random component to the overall scheduling.

That said, I think a more specific approach could work here, i.e. not solving this in the scheduler directly, but rather running expensive code portions over a longer time period - e.g. parsing and fingerprinting queries can be expensive, but that could happen with a slight delay potentially.

To inform this better, if you are interested, what would be most helpful is a profile of the collector, so that we have a better sense whether your system is busy with (i.e. whether its the query parsing/fingerprinting, or something else).

joekohlsdorf · 2022-12-16T21:01:56Z

What I am suggesting is to continue using fixed second intervals but to not align them on the same exact minutes/seconds on the clock. Instead the scheduling should depend on collector startup time.

Example: If I start the collector at 00:03:42 my suggestion is that the 10min interval should be at 00:13:42, 00:23:42 and so on.

The idea of the random component is just to avoid ending up with the same problem when you have multiple collector instances which you restart all at the same time. My suggestion is to only do the random delay for the first scheduled run of a job when the collector starts up.
I see that you currently have 4 scheduled jobs and it would probably be beneficial to not start all of them when the clock hits 0 seconds like it is now.

I think this can be achieved quite easily with a Ticker.

lfittl · 2022-12-16T22:06:36Z

Ah, thanks for the clarification. We were discussed this a bit more internally, and we're still not sure if a change like this wouldn't end up causing unexpected problems (e.g. with some of the graphs) - so we'd prefer to not make this change at this point without first having debugged the underlying performance aspects a bit more.

What would be helpful is if you could run a profile on your workload, to help us understand better where the bottlenecks are for your system. Something like this is what my colleague just ran for some tests:

Add this to the main function:

f, err := os.Create("profile")
if err != nil {
    log.Fatal(err)
}
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()

and then after running the collector and shutting it down, enter the interactive go tool pprof profile tool and entering top30 -cum. If you have that result you could share it here, or email it to me (I believe you should have my email already).

joekohlsdorf · 2022-12-17T03:42:05Z

We don't have any performance issues. We just run a separate collector instance for each database so we see how all of them process data at exactly the same time. I didn't screenshot the y-axis of the chart because I thought it was irrelevant, the top of the graph is only ~0.2 CPUs so it's nothing too crazy.

As I said, I think this will also help your pgAnalyze SaaS offering because at the moment all customers submit their data at exactly the same time so you probably have to overprovision by quite a lot to support these write spikes. Unless you internally delay processing of customer data which I hope you aren't doing because some customers might use pgAnalyze to review top database queries during incidents.

A problem I see with my first suggestion is that intervals wouldn't line up anymore when you restart the collector.
A way to avoid that could be to calculate a fixed offset with a value derived from a per-database value which doesn't change. For example for the ten minute interval we could hash the hostname of the database instance to a number and then reduce this to a value between 0 and 600 by calculating module 600.
Then you would just add this fixed offset in the delay calculation here: https://github.com/pganalyze/collector/blob/main/scheduler/scheduler.go#L18

lfittl · 2022-12-21T03:38:00Z

@joekohlsdorf Makes sense, thanks for clarification re: overall CPU usage being low.

I think over time we do want to evolve the scheduling approach, but currently the trade-off here (i.e. making sure using dynamic interval offset doesn't break anything else) means that we'll not prioritize this for now from the looks of it. I appreciate you bringing it up though, it sparked some conversations here.

For context, on the processing side (i.e. our cloud or Enterprise Server deployments), the compact snapshots (which happen at varying intervals between 3 and 30 seconds) are often the bigger portion of the workload, and whilst some of them also follow a schedule (e.g. the gathering of pg_stat_activity data) it doesn't show the same kind of contention at the exact same time.

There is some ongoing work over in #346 that will likely reduce a good portion of the work that full snapshots do (when caused by the fingerprinting of queries, since that right now parses and fingerprints all the pg_stat_statements queries, not just the ones that are not yet known), which is why I was asking for the profiling data, since it'd allow us to confirm whether that'd help reduce the spike you're seeing.

As an alternative to profiling, once that PR is merged and in an official release, it'd be interesting to see whether that affects your collector CPU metrics at all.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler: Don't schedule runner intervals on fixed cron expressions #345

Scheduler: Don't schedule runner intervals on fixed cron expressions #345

joekohlsdorf commented Dec 16, 2022

lfittl commented Dec 16, 2022

joekohlsdorf commented Dec 16, 2022

lfittl commented Dec 16, 2022

joekohlsdorf commented Dec 17, 2022

lfittl commented Dec 21, 2022

Scheduler: Don't schedule runner intervals on fixed cron expressions #345

Scheduler: Don't schedule runner intervals on fixed cron expressions #345

Comments

joekohlsdorf commented Dec 16, 2022

lfittl commented Dec 16, 2022

joekohlsdorf commented Dec 16, 2022

lfittl commented Dec 16, 2022

joekohlsdorf commented Dec 17, 2022

lfittl commented Dec 21, 2022