Efficiency suggestion: move from multiple queues to a single job queue #63
To give an indication of the performance hit of running multiple queues, I changed the queue interval from the default 3 seconds to 15 seconds. Going from 13 queues @ 3s intervals (4.3 timeouts/sec) to 13 queues @ 15s intervals (0.87 timeouts/sec) reduced CPU usage by 20-40%, and will also reduce database ops by a factor of 5. A single queue at the default 3s interval would be only 0.33 timeouts/sec and would reduce database ops by a factor of 13.

Also, looking at the code, it appears that in a multi-server environment all servers are running all queues regardless of whether they are in control or not. If you have lots of jobs configured this could be a significant waste of resources (server & database CPU, and bandwidth between the servers and database). Why not just have the idle servers checking every so often that the controlling server is still alive?

Please don't get me wrong, I love this package and am very grateful that you have shared your excellent and hard work with the community! My intention is to help make it even better, not to criticise what you have achieved. You've made it so easy for us to create scheduled jobs that survive redeployments and server failures that a developer might create more and more job queues, each with its own overheads. If I had more time I would fork and then make a PR, but at the moment I have other priorities. Maybe in a few months I could help out, if you're willing to consider PRs?
Since there has been no comment on my suggestion I decided to fix the problem myself. However, since the one-queue-per-job-type design is fundamentally ingrained into the code, it was easier for me to pretty much start from scratch. My alternative package is here: wildhart:meteor.jobs. In some cases it could be a drop-in replacement for msavin:sjobs, but there are some potentially breaking API changes where I've simplified things to remove functionality I wasn't using.
Hey man, sorry that it took me so long to respond. I just found the time to study and see the problem. If you decide to go back to Steve Jobs: that feature, along with purging, has been implemented. The dominator will now check its status at most every 10 seconds. They are both configurable, and on by default, in the new 3.1 release. For more info:
Hi @msavin, no problem. I quite like my miniaturised 'fork' for now, although I'd happily abandon it and return to Steve Jobs if you make the jobs queue more efficient, particularly when there are lots of different job types.

To me, having multiple job queues, each running on its own Meteor.setInterval, is the core inefficiency. My version has one queue, and after each batch of jobs is run, a single setTimeout is set for the next due job.

One side effect of this is that adding/removing/rescheduling jobs wouldn't be picked up automatically, and there's no quick/easy way for one server to tell whichever server is running the queue to reset the timeout. Instead, I just make whichever server changes the job queue immediately take control of the queue and create its own timeout. A nice side effect of this might be that the servers share the queue and the resulting job processing a bit more evenly, which might mean that CPU credits (AWS) accrue more evenly.

While thinking about that, I wondered about your implementation, and couldn't see any mechanism for server A to tell server B to start/stop a queue.

I haven't implemented pausing jobs in my version, but if I did I would add a field to the dominator database to record which job types are paused. Then any server which takes control of the queue will know which are paused. The list of paused jobs would also survive a server reboot.

Would you consider implementing similar improvements to Steve Jobs - a more efficient queue and multi-server job-type pausing?
Hey man, thanks for responding, and I forgot to say, I appreciate you making the graphs and helping identify the problem.
1 / This relates to the issue around the dominator polling constantly, correct? The issue has been corrected. If I missed something else, please let me know.
2 / The Meteor.setInterval function has a very low cost, and the queue doesn't hit the database unless that server is marked as active. It could be optimized, I think, as part of some greater refactor, but for now I suspect the cost of that would be negligible.
3 / Agree with the thinking there - but the reason I did not go this way is that a new job may be scheduled to run before that timeout runs its course. Plus, the issues you had indicated in the following paragraph. Change Streams could be interesting here - but I do not have experience with them yet.
4 / Yes, I thought somebody might PR this feature but it never happened. The logic around it is a bit tricky, though it gets simpler with the purging feature. This definitely needs to happen.
5 / Yeah, I am all for it. What I really want is to improve how the jobs run, to make sure they run on time, and to set up a way to run more than one at a time. Change Streams can be a big help here. I've been thinking of breaking the package up.
No, I think the performance issue is mostly around each job type having its own queue, each with a default 3-second interval. I'm sorry that my initial bug report probably put too much blame on the dominator polling. My 'fork' has proven (to me at least) that the queue can be more efficient and more on-time by using a single setTimeout armed for the next due job.
My solution to this is simple - make the server which changes the queue take immediate control and set its own, correct, timeout (or if it's already in control, just cancel the original timeout and create a new one). Then inside the original timeout callback, the previous server will first check that it's still in control, and not run any jobs if it's not. This also makes a shared paused list fairly trivial, although I didn't bother implementing it in my version because I don't need to pause jobs (obviously none of your users do either, because you say nobody has requested this ability yet). I don't think it needs Change Streams. My suggestion would be:
Thanks for responding and showing an interest in my suggestions!
I think Meteor.setInterval uses Fibers to time the functions, and it's a nearly zero-cost implementation. Thus, I suspect the problem was around the dominator running excessively. Beyond that, using one queue would be faster than thirteen, but it would also have a hidden price: you only run one job at a time instead of potentially 13 at a time. This could make a big difference for some applications. This does reinforce the idea of splitting the package up. There could be various implementations.
The limitation there is that a server takes dominance rarely - only if the server that had dominance goes down. Thus, you would need some way to update that timeout - if not by Change Streams then by an interval timer. The interval timer can just check, "hey, is there any job in the database due sooner than this one?", and if so, update the timeout.
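That re-check idea can be sketched in plain JavaScript. This is purely illustrative (an in-memory array stands in for the jobs collection, and `pendingJobs`, `armedDue`, and `armTimeout` are made-up names, not anything from either package):

```javascript
// Hypothetical sketch: re-arm a single timeout whenever a job due sooner
// than the currently armed one appears in the queue.

let pendingJobs = [];      // stand-in for the jobs collection
let armedDue = Infinity;   // due time (ms epoch) the current timeout targets
let timeoutHandle = null;

function nextDue() {
  // Earliest due time among pending jobs, or Infinity if none.
  return pendingJobs.reduce((min, j) => Math.min(min, j.due), Infinity);
}

function armTimeout(runJobs) {
  // The periodic check: "is any job due sooner than the one we armed for?"
  const due = nextDue();
  if (due >= armedDue) return false;             // nothing sooner - keep waiting
  if (timeoutHandle) clearTimeout(timeoutHandle);
  armedDue = due;
  timeoutHandle = setTimeout(runJobs, Math.max(0, due - Date.now()));
  return true;                                   // timeout re-armed
}
```

Run from an interval timer, this only touches the timer when a sooner job actually appears, which is the cheap middle ground between constant polling and Change Streams.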
I think most of it is on track. I think it can be even simpler if just one document was maintained in the dominator collection, and you'd update the
And one more thing: I think the 2.0 version had one queue and ran just one job at a time. However, I figured that may be too constrained. At the end of the day, a job runs very similarly to a method, if not the same, and if Meteor can take 1,000 requests and more per second, then the burden should be quite low.
The setInterval itself is cheap, but inside each interval you are querying the jobs collection for a single job type. This is not cheap, particularly if the db is on a remote server. In my case this was 13 very similar db queries every 3 seconds, when most of the time there isn't a job scheduled for hours. (Previously each interval was also updating the dominator, but I think you've fixed that.)
I think we have a misunderstanding due to different interpretations of the word 'queue'. I think when you read 'queue' you are only thinking about one job type (because naturally you have one queue per job type in your implementation). In my interpretation I have only one queue containing all jobs of different types.

On each timeout I am querying the db for any jobs which are due, regardless of what type they are (or 'queue' they are in, using your terminology). So if 13 different jobs of different types are all scheduled for the same instant, they will all execute during the same timeout. Then once they have all finished, a new timeout will be set for the next job in the updated queue. See my versions of findNextJob() and executeJobs(). So in my version there is one timeout, one db query and 13 jobs executed. In yours there are 13 intervals, 13 db queries and 13 jobs executed. Mine occurs only when jobs are due; yours occurs every 3 seconds.

One of my jobs is only scheduled to run every 4 hours, to process automatic subscription payments, send payment reminders, delete expired accounts, etc. Why query the jobs db every 3 seconds when you know the job isn't scheduled again for 4 hours? Other jobs are only created on-demand, e.g. sending a welcome email a day after a new user signs up - if users only sign up every couple of days, why query the db every 3 seconds?
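To make the single-queue interpretation concrete, here is a hypothetical plain-JS sketch. An in-memory array stands in for the Mongo collection, and the function names are illustrative simplifications, not the actual wildhart:meteor.jobs code:

```javascript
// Simplified single-queue model: one query finds ALL due jobs regardless
// of type, and one delay is computed for the next setTimeout.

const jobs = [];   // each job: { name, due (ms epoch), state }

function executeDueJobs(now, run) {
  // One pass over the whole queue: run every due pending job, whatever its type.
  jobs
    .filter(j => j.state === 'pending' && j.due <= now)
    .sort((a, b) => a.due - b.due)
    .forEach(j => { j.state = 'done'; run(j); });
}

function delayToNextJob(now) {
  // Delay to arm the single setTimeout with, or null if the queue is empty.
  const pending = jobs.filter(j => j.state === 'pending');
  if (pending.length === 0) return null;
  return Math.max(0, Math.min(...pending.map(j => j.due)) - now);
}
```

The point of the sketch is the shape of the loop: run everything due now, then sleep exactly until the next due time instead of waking every 3 seconds per job type.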
That's up to you, but personally I don't think it's necessary. My system runs very efficiently across multiple servers regardless of how many jobs are scheduled, whereas yours gets less efficient for every additional job type ('queue') you create.
In your implementation the server only takes dominance if another server goes down. In my implementation a server takes dominance whenever it makes changes to the job queue. There is very little cost to this: it just updates the dominator database with its own serverId and the current time. Then when the previous server enters its timeout callback, it checks the dominator database, sees that another server has taken control, and does nothing. Have a read through my code - it's pretty simple really, and apart from not saving job history or pausing jobs (which yours doesn't do multi-server anyway), it's functionally identical to yours.
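A stripped-down model of that takeover mechanism might look like this. A plain object stands in for the `jobs_dominator_3` collection, and `takeControl`/`checkControl` here are simplified stand-ins for the real code's Mongo updates, not its actual signatures:

```javascript
// Whichever server modifies the job queue writes its own serverId to the
// dominator record, claiming control immediately. The previously dominant
// server's timeout callback checks control first and backs off if it lost.

const dominator = { serverId: null, lastChanged: 0 };

function takeControl(serverId) {
  dominator.serverId = serverId;
  dominator.lastChanged = Date.now();
}

function checkControl(serverId) {
  // Called at the top of the timeout callback armed by `serverId`.
  return dominator.serverId === serverId;
}
```

The design choice worth noting: no server-to-server messaging is needed, because control is resolved lazily at the moment the old timeout fires.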
Good point - that would make the querying/updating of the dominator collection easier/faster. Another option is to use observeChanges on the dominator collection, to be notified if another server takes control or if no server is in control.
Yes, but 13 queries, even simultaneously, isn't much in reality. If you have a chat app with just 100 concurrent clients, you are probably going to get more input/output than that.
I just zoomed into your code, and while that can be more efficient, it does have a limitation: if there are many pending jobs, they will all be returned from the database at once.

```javascript
function executeJobs(jobsArray=null) { // Jobs.execute() calls this function with [job] as a parameter
    settings.log && settings.log('Jobs', 'executeJobs', jobsArray);
    executing = true; // so that rescheduling, removing, etc, within the jobs doesn't result in lots of calls to findNextJob() (which is done once at the end of this function)
    try {
        (jobsArray || Jobs.collection.find({state: "pending", due: {$lte: new Date()}}, {sort: {due: 1, priority: -1}})).forEach(job => {
            if (inControl && !checkControl()) console.warn('Jobs', 'LOST CONTROL WHILE EXECUTING JOBS'); // should never happen
            if (inControl || jobsArray) executeJob(job); // allow Jobs.execute() to run a job even on a server which isn't in control, otherwise leave execution to server in control
        });
    } catch(e) {
        console.warn('Jobs', 'executeJobs ERROR');
        console.warn(e);
    }
    executing = false;
    findNextJob();
}
```

This bit can be dangerous if you have, say, 10,000 scheduled emails. At the very least, you should consider putting a limit on the number of documents retrieved, to help preserve memory.
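One way to apply that limit is a batching loop, sketched here in plain JS. This is a hypothetical helper (the name `processInBatches` is made up); in Mongo/Meteor terms the fetch step would be a `find` with a `limit` option, repeated until nothing comes back:

```javascript
// Process due jobs in fixed-size batches rather than materializing every
// matching document at once, bounding memory use.

function processInBatches(fetchBatch, handleJob, batchSize = 200) {
  let total = 0;
  for (;;) {
    const batch = fetchBatch(batchSize);   // at most batchSize jobs per round trip
    if (batch.length === 0) return total;  // queue drained
    batch.forEach(handleJob);
    total += batch.length;
  }
}
```

This only terminates if `handleJob` changes each job's state so that it isn't matched by the next fetch - which the executeJobs code above already does by marking jobs as executed.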
You would also have an issue with blocking. For example, if we have 10,000 scheduled emails, other jobs may not run until those emails are sent out. This is why SJ makes many queries - to make sure other jobs are given a chance to run near their scheduled time. Again, much of it depends on your use case, but my goal is to have SJ be sufficient for any use case. The only area where I could see it lag behind is not being able to execute multiple jobs at once.
I missed that difference - makes sense, but again, I could see an issue where jobs are executed twice. Even if something is being done on MongoDB to account for this, the reads may not reflect the writes in time.
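A standard guard against that double execution - purely illustrative here, since the thread doesn't show either package doing this - is to claim each job atomically before running it. In Mongo terms that would be a single findOneAndUpdate flipping the state from pending to running; the sketch below models the same compare-and-set in memory, with a made-up `tryClaim` helper:

```javascript
// Only the server whose claim succeeds runs the job, so two servers that
// both read the job as due cannot both execute it.

function tryClaim(job, serverId) {
  if (job.state !== 'pending') return false;  // another server claimed it first
  job.state = 'running';                      // atomic in Mongo via findOneAndUpdate
  job.claimedBy = serverId;
  return true;
}
```

The in-memory version is only atomic because JavaScript is single-threaded per tick; across servers, the atomicity has to come from the database operation itself.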
I showed in my graph above that 13 queries every 3 seconds does cause a noticeable performance impact. That was back when each interval was also updating the dominator database, so maybe the impact is less now. Even so, just the thought of 13 intervals and db reads happening every 3 seconds when the next scheduled job might be hours away makes me cringe. It also makes me reluctant to find more uses for different job types ('queues') when each comes with extra overhead (albeit negligible).
No, the
That is a good point. However, I think if all those emails had to be sent in bulk then a developer is more likely to write a single job to send 10,000 emails (in a controlled way), rather than create 10,000 jobs to send 1 email each. (If they were sensible they would use a 3rd-party mail provider to send the emails for them.) But I suppose with SJ you have provided the developer with an easy mechanism to perform such bulk actions in a controlled way, whereas my version doesn't. That problem is relatively easy to overcome though (code off the top of my head, not tested):

```javascript
var job, doneJobs;
do {
    doneJobs = [];
    do {
        job = Jobs.collection.findOne({
            state: "pending",
            due: {$lte: new Date()},
            name: {$nin: doneJobs}, // give other job types a chance...
        }, {sort: {due: 1, priority: -1}});
        if (job) {
            execute(job);
            doneJobs.push(job.name); // don't do this job type again until we've tried other job types
        }
    } while (job);
} while (doneJobs.length);
```

(Funny - in 30 years of coding I've hardly ever found a use for a do...while loop.)
That is also true, and it did occur to me as a possibility. It's less of a problem with my system though, because jobs tend to be run individually more or less exactly at their due time, rather than batched into interval periods. I think a way of fixing that might be to revert to a single server staying in control, but using observeChanges on the queue to react to job changes (and if multi-server paused jobs were to be implemented, the same mechanism could be used). Thanks for taking the time to engage in this discussion and look at my code.
Oops, pushed the wrong button, sorry. Note that my code above would run all the jobs in the same fibre, which may block the server. Some of that code may benefit from using async or being wrapped in a zero-delay timeout.
MongoDB is definitely loading all the results into memory, and they get stale there.
You still need to consider this use case no matter what. To counter your example, an app might be sending 10,000 customized emails, in which case it's easier and more reliable to queue them up individually. In any case, it's not uncommon to have a queue with thousands of items due to run.
Yes, if you're making a developer tool you need to consider this, or the people who use your code could have big problems. My older versions looked quite similar to what you have now. I really think it would make sense to update the
I've updated my own package with all the improvements/fixes mentioned above:
To achieve this I've moved to only having one server in control of the queue - changes to the job queue by other servers are monitored with an observer. Job execution is shared across job queues, and I still have only one setTimeout for the whole queue. It would be great if you could implement something similar with SJ.
Today I added my 20th job type ("queue") to one of my apps, and so I decided to revert back to the current version of msavin:sjobs to compare the CPU usage with my own 'fork'. Deployed to a fresh 1vCPU Digital Ocean server with an empty db and no connections/users. 20 jobs defined but no jobs executing during the measurement interval:

- msavin:sjobs with 20 jobs defined = 1.63% CPU
- wildhart:jobs with 20 jobs defined = 0.37% CPU

The fluctuating memory usage with msavin:sjobs shows that there's lots more going on compared to the perfectly flat memory of wildhart:jobs. So I added 20 more jobs just for comparison:

- msavin:sjobs with 40 jobs = 2.62%
- wildhart:jobs with 40 jobs = 0.39%
This shows that the one-setInterval-per-queue design of msavin:sjobs does come with measurable overhead compared to the single-observer-and-setTimeout-for-all-jobs design of wildhart:jobs.
The multiple-queue approach plus polling ensures that jobs run as smoothly as possible and do not block each other. It also lets you process more jobs at the same time.

I'm assuming you are using Mongo oplog-driven observe? That would break on applications that do not use oplog.

Is 1.6% vs 0.4% so significant?
My approach also prevents jobs blocking each other, as I've discussed above. How can more jobs be processed at the same time when there's only one thread? My system will process jobs just as quickly as yours. With my observer/timeout approach, jobs will also run more precisely on their due time, because there isn't the discretization of the interval period. I'm using a normal observer.

I deliberately refrained from using the word "significant"; instead I said "measurable". The 0.4% is mostly just the background CPU of Meteor itself, because my single setTimeout costs almost nothing. Which, to be honest, is not significant - but unless it comes with extra benefit, why do something a demonstrably less efficient way? Why should defining extra job types each come with a little extra overhead? Jobs are very useful; I shouldn't have to think twice about adding more as I find more uses for them.

Plus, these measurements do not include the additional db CPU. The db was on the same virtual server - if the db was remote then there would be additional bandwidth as well. All unnecessary.

I'm happy with my own fork, and happy for your package to use whichever system you want to use. I just thought I'd present the data as food for thought. Do with it what you will ;-)
I appreciate it, and will definitely consider it as part of how to improve the package. I'm considering making the
I was reading over @wildhart's optimizations and thought I would mention that using only a
I'm loving SteveJobs, thanks very much!

Now that I'm starting to use it for lots of different tasks, I've noticed that `lastPing` in the `jobs_dominator_3` collection is being updated too frequently - every 200 ms to 1 s, if I connect to mongo and very quickly repeatedly query the latest serverId:

The only configuration option I'm changing is:

Looking through your code, it appears that each job queue has its own `Meteor.setInterval`, defaulting to 3 seconds. In my case I have 13 different jobs defined, which means that `lastPing` is being updated and the jobs queue is being queried 13 times every 3 seconds! I might add more jobs as my app grows in complexity. Jobs are really useful, so adding a new job type shouldn't come with extra overhead, particularly if some jobs aren't used that often.
I propose moving to a single job queue and maintaining a list of active job types; then at each interval the job queue can be queried for due and ACTIVE jobs. This should reduce the server overhead significantly, depending on how many job types you use.

This raises a question which I haven't delved into the code enough to answer yet - how is the list of active or inactive job types maintained during a server reload or transfer of control from one server to another? To achieve this you would either need a separate db collection to store the list of active jobs, or, with less overhead, add a list of active jobs to the current entry in `jobs_dominator_3`. Then when a server restarts or takes control, it can get the latest entry from that collection and see which jobs are active.

(I noticed this while testing mongodump and mongorestore on a staging server - the mongorestore complained of a duplicate key on `serverId` because SteveJobs was trying to update the ping at the same time that it was being restored. This was happening about 50% of the time, which told me that the document was being updated quite frequently...)
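The single-queue-with-active-types proposal could look something like this hypothetical query helper. The in-memory filter stands in for a Mongo query along the lines of `{state: "pending", due: {$lte: now}, name: {$in: activeTypes}}`; none of the names below come from the actual package:

```javascript
// One query serves every job type: due, pending, and of a currently active type.

function dueActiveJobs(jobs, activeTypes, now) {
  return jobs
    .filter(j => j.state === 'pending' && j.due <= now && activeTypes.includes(j.name))
    .sort((a, b) => a.due - b.due);   // soonest first
}
```

Paused ("inactive") job types simply drop out of `activeTypes`, so pausing needs no per-queue timers - just a change to the list stored alongside the dominator entry.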