
Efficiency suggestion: move from multiple queues to a single job queue #63

wildhart opened this issue Aug 18, 2018 · 19 comments

@wildhart

I'm loving SteveJobs, thanks very much!

Now that I'm starting to use it for lots of different tasks, I've noticed that the jobs_dominator_3 lastPing is being updated too frequently - as often as every 200 ms to 1 s. If I connect to mongo and repeatedly query the latest serverId in quick succession:

> db.jobs_dominator_3.find({_id:"bxPzJdjLCd4cWMpge"},{lastPing:1})
{ "_id" : "bxPzJdjLCd4cWMpge", "lastPing" : ISODate("2018-08-17T20:26:33.498Z") }
> db.jobs_dominator_3.find({_id:"bxPzJdjLCd4cWMpge"},{lastPing:1})
{ "_id" : "bxPzJdjLCd4cWMpge", "lastPing" : ISODate("2018-08-17T20:26:34.393Z") } // 895ms
> db.jobs_dominator_3.find({_id:"bxPzJdjLCd4cWMpge"},{lastPing:1})
{ "_id" : "bxPzJdjLCd4cWMpge", "lastPing" : ISODate("2018-08-17T20:26:34.393Z") }
> db.jobs_dominator_3.find({_id:"bxPzJdjLCd4cWMpge"},{lastPing:1})
{ "_id" : "bxPzJdjLCd4cWMpge", "lastPing" : ISODate("2018-08-17T20:26:35.386Z") } // 993ms
> db.jobs_dominator_3.find({_id:"bxPzJdjLCd4cWMpge"},{lastPing:1})
{ "_id" : "bxPzJdjLCd4cWMpge", "lastPing" : ISODate("2018-08-17T20:26:35.664Z") } // 278ms!!

The only configuration option I'm changing is:

Jobs.configure({
    maxWait: 30*1000, // specify how long the server could be inactive before another server takes on the master role  (default=5 min)
});

Looking through your code, it appears that each job queue has its own Meteor.setInterval defaulting to 3 seconds. In my case I have 13 different jobs defined, which means that lastPing is being updated and the jobs queue is being queried 13 times every 3 seconds!

I might add more jobs as my app grows in complexity. Jobs are really useful, so adding a new job type shouldn't come with extra overhead, particularly if some jobs aren't used that often.

I propose moving to a single job queue and maintaining a list of active job types, then at each interval the job queue can be queried for due and ACTIVE jobs. This should reduce the server overhead significantly, depending on how many job types you use.

This raises a question which I haven't delved into the code enough to answer yet - how is the list of active or inactive job types maintained during a server reload, or during a transfer of control from one server to another? To achieve this you would either need a separate db collection to store the list of active jobs, or, with less overhead, add a list of active jobs to the current entry in jobs_dominator_3. Then when a server restarts or takes control, it can get the latest entry from that collection and see which jobs are active.
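
To illustrate the idea, here is a rough sketch of a single-queue poll combined with an activeJobs list stored on the dominator entry. The collection handle JobsCollection, the activeJobTypes array and the hard-coded serverId are assumptions for the example, not SteveJobs' actual internals:

import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const JobsCollection = new Mongo.Collection('jobs_data_example'); // assumed name for the jobs collection
const Dominator = new Mongo.Collection('jobs_dominator_3');
const serverId = 'example-server-id';  // normally a random id
const activeJobTypes = [];             // filled in as job types are registered

Meteor.setInterval(function () {
    // one query covering every registered job type, instead of one query per type
    JobsCollection.find({
        state: "pending",
        due: {$lte: new Date()},
        name: {$in: activeJobTypes},   // due AND active jobs only
    }, {sort: {due: 1}}).forEach(function (job) {
        // execute the job here
    });

    // persist the active list on the dominator entry so a server which
    // restarts or takes control can recover it
    Dominator.upsert(serverId, {$set: {lastPing: new Date(), activeJobs: activeJobTypes}});
}, 3000);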

(I noticed this while testing mongodump and mongorestore on a staging server - the mongorestore complained of a duplicate key serverId because SteveJobs was trying to update the ping at the same time that it was being restored. This was happening about 50% of the time which told me that the document was being updated quite frequently...)

@wildhart
Author

wildhart commented Aug 18, 2018

To give an indication of the performance hit of running multiple queues, I changed the updateInterval from the default of 3 seconds to 15 seconds on both my staging and production servers. Each has 13 different jobs configured, but during this period they were all idle (either no jobs in the queue or jobs well in the future) and neither had much user activity going on.

[graph: CPU usage before and after changing updateInterval from 3 s to 15 s]

So going from 13 queues @ 3s intervals (4.3 timeouts/sec) to 13 queues @ 15s intervals (0.87 timeouts/sec) has reduced CPU usage by 20-40%, and will also reduce database ops by a factor of 5.

Having a single queue at the default 3s intervals will be only 0.33 timeouts/sec and reduce database ops by a factor of 13.

Also, looking at the code, it appears that in a multi-server environment all servers are running all queues regardless of whether they are in control or not. If you have lots of jobs configured this could be a significant waste of resources (server & database CPU, and bandwidth between the servers and database). Why not just have the idle servers check every maxWait seconds to see if the lastPing has changed?
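
As a sketch of that idea (not how SteveJobs currently works), a server which is not in control could run nothing but a cheap dominator check on a slow timer. This reuses the Dominator handle from the sketch above, and takeControlAndStartQueues() is a hypothetical helper:

const maxWait = 5 * 60 * 1000;  // same idea as the maxWait option above

Meteor.setInterval(function () {
    const dominator = Dominator.findOne({}, {sort: {lastPing: -1}});
    const stale = !dominator || (Date.now() - dominator.lastPing.getTime() > maxWait);
    if (stale) {
        takeControlAndStartQueues();  // claim the dominator entry and start running jobs
    }
}, maxWait);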

While I'm making suggestions, it seems that jobs_dominator_3 grows by n_servers records with every re-deployment (assuming the default random serverId). In a multi-server environment with frequent deployments this collection could grow significantly. During local development it grows with every meteor run. Is there any intention of automatically purging old entries? (I'm going to start using setServerId with a constant id, seeing as I'm still single-server.) Let me know if you want me to raise new issues for these two extra suggestions.

Please don't get me wrong, I love this package and am very grateful that you have shared your excellent and hard work with the community! My intention is to help make it even better, not to criticise what you have achieved. You've made it so easy for us to create scheduled jobs that survive redeployments and server failures that developers might create more and more job queues, each with its own overhead.

If I had more time I would fork and then make a PR, but at the moment I have other priorities. Maybe in a few months I could help out if you're willing to consider PRs?

@wildhart
Author

Since there has been no comment on my suggestion I decided to fix the problem myself. However, since the one-queue-per-job-type design is fundamentally ingrained in the code, it was easier for me to pretty much start from scratch.

My alternative package is here: wildhart:meteor.jobs.

In some cases it could be a drop-in replacement for msavin:sjobs, but there are some potentially breaking API changes where I've simplified things to remove some functionality I wasn't using.

@msavin
Owner

msavin commented Jan 13, 2019

Hey man, sorry that it took me so long to respond. I just found the time to study and see the problem.
One of the planned optimizations was to create a grace period for dominator, but I held off on it as I didn't consider just how intensive it could get.

If you decide to go back to Steve Jobs, that feature, along with purging, has been implemented. Dominator will now check its status at most every 10 seconds. They are both configurable, and on by default in the new 3.1 release.

For more info:
27cdf83

@msavin msavin closed this as completed Jan 13, 2019
@wildhart
Author

Hi @msavin, no problem.

I quite like my miniaturised 'fork' for now, although I'd happily abandon it and return to Steve Jobs if you make the jobs queue more efficient, particularly when there are lots of different jobs. Your readme says "It can run hundreds of jobs in seconds with minimal CPU impact", but that's not the case - my graphs above demonstrate that even when idle there is noticeable CPU impact with only 13 different jobs defined.

To me, having multiple job queues, each running on a setInterval and polling the database each time, seems very inefficient - particularly when all servers are doing this, regardless of whether they are in control or not.

My version has one queue, and after each batch of jobs is run, a setTimeout is created for the next job in the queue. That means when no jobs are scheduled, the jobs queue is completely idle, not using any CPU at all (except for the dominator checking). It also means that jobs run exactly on schedule, instead of being delayed by the interval (which I had set to a long period to avoid excessive CPU usage).
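
Something like the following captures that scheduling pattern - a minimal sketch with assumed names (JobsCollection, runDueJobs and the field names are placeholders rather than my package's literal code):

let nextTimeout = null;

function scheduleNextJob() {
    const next = JobsCollection.findOne({state: "pending"}, {sort: {due: 1}, fields: {due: 1}});
    if (nextTimeout) Meteor.clearTimeout(nextTimeout);
    nextTimeout = null;
    if (!next) return;  // nothing scheduled: stay completely idle
    const delay = Math.max(0, next.due.getTime() - Date.now());
    nextTimeout = Meteor.setTimeout(function () {
        runDueJobs();       // run everything that is now due
        scheduleNextJob();  // then re-arm for the next due job
    }, delay);
}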

One side effect of this is that adding/removing/rescheduling jobs wouldn't be picked up automatically, and there's no quick/easy way for one server to tell whichever server is running the queue to reset the timeout. Instead, I just make whichever server changes the job queue immediately take control of the queue and create its own timeout. I suppose a nice side effect of this might be that the servers share the queue and the resulting job processing a bit more evenly, which might mean that CPU credits (AWS) accrue more evenly.

While thinking about that though, I wondered about your implementation, and couldn't see any mechanism for server A to tell server B to start/stop a queue. Your docs for Jobs.start() and Jobs.stop() say "This function currently only works on the server where it is called." If the request to start a job queue is initiated by a client request it could be handled by any server, so how are you supposed to tell all servers to start/stop a queue?

I haven't implemented pausing jobs in my version, but if I did so then I would add a field to the dominator database to record which job types are paused. Then any server which takes control of the queue will know which are paused. The list of paused jobs would also survive a server reboot.

Would you consider implementing similar improvements to Steve Jobs - a more efficient queue and multi-server job-type pausing?

@msavin
Owner

msavin commented Jan 15, 2019

Hey man, thanks for responding, and I forgot to say, I appreciate you making the graphs and helping identify the problem.

Your readme says "It can run hundreds of jobs in seconds with minimal CPU impact", but that's not the case - my graphs above demonstrate that even when idle there is noticeable CPU impact with only 13 different jobs defined.

1 / This relates to the issue around dominator polling constantly, correct? That issue has been corrected. If I missed something else, please let me know.

To me, having multiple job queues, each running on a setInterval and then polling the database each time seems very inefficient. Particularly when all servers are doing this, regardless of if they are in control or not.

2 / The Meteor.setInterval function has a very low cost, and the queue doesn't hit the database unless that server is marked as active. It could be optimized, I think, as part of some greater refactor, but for now I suspect the cost of that would be negligible.

My version has one queue, and after each batch of jobs is run, a setTimeout is created for the next job in the queue. That means when no jobs are scheduled, the jobs queue is completely idle, not using any CPU at all (except for the dominator checking). It also means that jobs run exactly on schedule, instead of being delayed by the interval (which I had set to a long period to avoid excessive CPU usage).

3 / Agree with the thinking there - but the reason I did not go this way is that a new job may be scheduled to run before that timeout runs its course. Plus, the issues you had indicated in the following paragraph.

Change Streams could be interesting here - but I do not have experience with them yet.

If the request to start a job queue is initiated by a client request it could be handled by any server, so how are you supposed to tell all servers to start/stop a queue?

I haven't implemented pausing jobs in my version, but if I did so then I would add a field to the dominator database to record which job types are paused. Then any server which takes control of the queue will know which are paused. The list of paused jobs would also survive a server reboot.

4 / Yes, I thought somebody might PR this feature but it never happened. The logic around it is a bit tricky, though it gets simpler with the purging feature. This definitely needs to happen.

Would you consider implementing similar improvements to Steve Jobs - a more efficient queue and multi-server job-type pausing?

5 / Yeah I am all for it. What I really want is to improve how the jobs run, to make sure they run on time, and to set up a way to run more than one at a time. Change Streams can be a big help here.

I've been thinking of breaking the package up - essentially to split up actions, which are the developer APIs, and operator, which runs the jobs, so that someone else can build a better version of it. I'm happy to collaborate on that as well.

@wildhart
Author

1 / This relates to the issue around dominator polling constantly, correct? That issue has been corrected. If I missed something else, please let me know.

No, I think the performance issue is mostly around each job type having its own queue, each with a default 3-second setInterval. If you define lots of different types of job (I currently use 13 - is that unusual?) then each server has lots of setIntervals going on, which I think is unnecessary.

I'm sorry that my initial bug report probably blames the dominator polling for the inefficiency, but that is only how I originally spotted the frequent polling/updating. Once I delved into your code I realised that each job type had its own queue and setInterval, and raised this issue.

My 'fork' has proven (to me at least) that the queue can be more efficient and more on-time by using setTimeout instead of setInterval.

3 / Agree with the thinking there - but the reason I did not go this way is that a new job may be scheduled to run before that timeout runs its course.

My solution to this is simple - make the server which changes the queue take immediate control and set its own, correct, timeout (or if it's already in control just cancel the original timeout and create a new one). Then inside the original timeout callback, the previous server will first check that it's still in control and not run any jobs if it's not.
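
A rough sketch of that take-control-on-write mechanism (Dominator, THIS_SERVER_ID, scheduleNextJob and runDueJobs are assumed names for this example, not my actual API):

function takeControl() {
    Dominator.upsert('control', {$set: {serverId: THIS_SERVER_ID, lastPing: new Date()}});
    scheduleNextJob();  // set our own, correct timeout
}

function checkControl() {
    const dom = Dominator.findOne('control');
    return !!dom && dom.serverId === THIS_SERVER_ID;
}

// inside the timeout callback on whichever server armed it earlier:
function onTimeout() {
    if (!checkControl()) return;  // another server has taken control - do nothing
    runDueJobs();
}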

This also makes the shared paused list fairly trivial as well, although I didn't bother implementing it in my version because I don't need to pause jobs (obviously none of your users do either because you say nobody requested this ability yet).

I don't think it needs Change Streams. My suggestion (sketched in code after this list) would be:

  1. When a particular job type is paused or started, the server doing the starting/stopping immediately updates its own pausedJobs array, takes control of the job queue, then sets the appropriate timeout for the next job using {name: {$nin: pausedJobs}} as part of the query.
  2. In the process of taking control it updates the jobs_dominator_3 database, including a pausedJobs field.
  3. When the other (not-in-control) servers regularly check the jobs_dominator_3 database they always make a note of the pausedJobs field. Then, if they need to take control they already know which jobs are paused.
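
Sketched in code, those three steps could look roughly like this (the pausedJobs field name, the Dominator handle and the helper functions are all assumptions, reusing the names from the earlier sketch):

let pausedJobs = [];

// 1. the server doing the pausing updates its own list, takes control,
//    and re-arms the timeout with {name: {$nin: pausedJobs}} in its query
function pauseJobType(name) {
    if (!pausedJobs.includes(name)) pausedJobs.push(name);
    // 2. persist the list on the dominator entry while taking control
    Dominator.upsert('control', {$set: {serverId: THIS_SERVER_ID, lastPing: new Date(), pausedJobs: pausedJobs}});
    scheduleNextJob();
}

// 3. non-dominant servers pick the list up during their regular dominator check,
//    so they already know what is paused if they ever need to take over
function syncPausedJobs() {
    const dom = Dominator.findOne('control');
    if (dom && dom.serverId !== THIS_SERVER_ID) pausedJobs = dom.pausedJobs || [];
}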

Thanks for responding and showing an interest in my suggestions!

@msavin
Owner

msavin commented Jan 15, 2019

No, I think the performance issue is mostly around each job type having its own queue, each with a default 3-second setInterval. If you define lots of different types of job (I currently use 13 - is that unusual?) then each server has lots of setIntervals going on, which I think is unnecessary.

I'm sorry that my initial bug report probably blames the dominator polling for the inefficiency, but that is only how I originally spotted the frequent polling/updating. Once I delved into your code I realised that each job type had its own queue and setInterval, and raised this issue.

My 'fork' has proven (to me at least) that the queue can be more efficient and more on-time by using setTimeout instead of setInterval.

I think Meteor.setInterval uses Fibers to time the functions, and it's a nearly zero cost implementation. Thus, I suspect the problem was around dominator running excessively.

Beyond that, using one queue would be faster than thirteen, but it would also have a hidden price: you only run one job at a time instead of potentially 13 at a time. This could make a big difference for some applications.

This does reinforce the idea of splitting the package up. There could be various implementations of operator. One could focus on just running one job at a time, for minimal impact, another could be the current one, which runs one job per queue at a time, and then later there could be a more advanced one which runs multiple jobs, perhaps across multiple servers.

My solution to this is simple - make the server which changes the queue take immediate control and set its own, correct, timeout (or if it's already in control just cancel the original timeout and create a new one). Then inside the original timeout callback, the previous server will first check that it's still in control and not run any jobs if it's not.

This also makes the shared paused list fairly trivial as well, although I didn't bother implementing it in my version because I don't need to pause jobs (obviously none of your users do either because you say nobody requested this ability yet).

The limitation there is, a server takes dominance rarely - only if the server that had dominance goes down. Thus, you would need some way to update that timeout, if not by Change Streams then by an interval timer. The interval timer can just check, "hey, is there any job in the database due sooner than this one?", and if so, update the timeout settings.
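
That fallback could be as simple as a slow interval re-checking whether anything is due sooner than the armed timeout - a sketch with assumed names (nextTimeoutDue would be recorded whenever the timeout is set, and scheduleNextJob re-arms it):

Meteor.setInterval(function () {
    const soonest = JobsCollection.findOne({state: "pending"}, {sort: {due: 1}, fields: {due: 1}});
    if (soonest && soonest.due < nextTimeoutDue) {
        scheduleNextJob();  // something is now due sooner - re-arm the timer
    }
}, 60 * 1000);              // arbitrary slow check, e.g. once a minute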

My suggestion would be:

I think most of it is on track. I think it can be even simpler if just one document was maintained in the dominator collection, and you'd update the serverId when necessary.

@msavin msavin reopened this Jan 15, 2019
@msavin msavin changed the title efficiency suggestion: move from multiple queues to a single job queue Efficiency suggestion: move from multiple queues to a single job queue Jan 15, 2019
@msavin
Owner

msavin commented Jan 15, 2019

And one more thing: I think the 2.0 version had one queue and just ran one job at a time. However, I figured that might be too constrained. At the end of the day, a job runs very similarly to a method, if not identically, and if Meteor can take 1000 or more requests per second, then the burden should be quite low.

@wildhart
Author

I think Meteor.setInterval uses Fibers to time the functions, and it's a nearly zero cost implementation. Thus, I suspect the problem was around dominator running excessively.

The setInterval itself is cheap, but inside each interval you are querying the jobs collection for a single job type. This is not cheap, particularly if the db is on a remote server. In my case this was 13 very similar db queries every 3 seconds when most of the time there isn't a job scheduled for hours. (Previously you were also updating the dominator, but I think you've fixed that.)

Beyond that, using one queue would be faster than thirteen, but it would also have a hidden price: you only run one job at a time instead of potentially 13 at a time. This could make a big difference for some applications.

I think we have a misunderstanding due to different interpretations of the word 'queue'. I think when you read 'queue' you are only thinking about one job type (because naturally you have one queue per job type in your implementation). In my interpretation I have only one queue containing all jobs of different types. On each timeout I am querying the db for any jobs which are due, regardless of what type they are (or which 'queue' they are in, using your terminology).

So if 13 different jobs of different types are all scheduled for the same instant, they will all execute during the same timeout. Then once they have all finished a new timeout will be set for the next job in the updated queue. See my versions of findNextJob() and executeJobs().

So in my version there is one timeout, one db query and 13 jobs executed. In yours there are 13 intervals, 13 db queries and 13 jobs executed. Mine occurs only when jobs are due, yours occurs every 3 seconds.

One of my jobs is only scheduled to run every 4 hours to process automatic subscription payments, send payment reminders, delete expired accounts, etc. Why query the jobs db every 3 seconds when you know the job isn't scheduled again for 4 hours? Other jobs are only created on-demand, e.g. sending a welcome email a day after a new user signs up - if users only sign up every couple of days, why query the db every 3 seconds?

This does reinforce the idea of splitting the package up. There could be various implementations of operator. One could focus on just running one job at a time, for minimal impact, the other can be the current one, which runs one job per queue at a time, and then later, there can be a more advanced one which runs multiple jobs perhaps across multiple servers.

That's up to you, but personally I don't think it's necessary. My system runs very efficiently across multiple servers regardless of how many jobs are scheduled, whereas yours gets less efficient with every additional job type ('queue') you create.

The limitation there is, a server takes dominance rarely - only if the server that had dominance goes down. Thus, you would need some way to update that timeout, if not by Change Streams then by an interval timer. The interval timer can just check, "hey, is there any job in the database due sooner than this one?", and if so, update the timeout settings.

In your implementation a server only takes dominance if another server goes down. In my implementation a server takes dominance whenever it makes changes to the job queue. There is very little cost to this: it just updates the dominator database with its own serverId and the current time. Then when the previous server enters its timeout callback it checks the dominator database, sees that another server has taken control, and does nothing. Have a read through my code - it's pretty simple really, and apart from not saving job history or pausing jobs (which yours doesn't do across multiple servers anyway) it's functionally identical to yours.

I think most of it is on track. I think it can be even simpler if just one document was maintained in the dominator collection, and you'd update the serverId when necessary.

Good point, that would make the querying/updating of the dominator collection easier/faster. Another option is to use observeChanges on the dominator collection, to be notified if another server takes control or if no server is in control, and to share the pausedJobs array. I'm not sure if that really helps though.

@msavin
Owner

msavin commented Jan 15, 2019

The setInterval itself is cheap, but inside each interval you are querying the jobs collection for a single job type. This is not cheap, particularly if the db is on a remote server. In my case this was 13 very similar db queries every 3 seconds when most of the time there isn't a job scheduled for hours. (Previously you were also updating the dominator, but I think you've fixed that.)

Yes, but 13 queries, even simultaneously, isn't much in reality. Like, if you have a chat app with just 100 concurrent clients, you are probably going to get more input/output than that.

On each timeout I am querying the db for any jobs which are due, regardless of what type they are (or 'queue' they are in using your terminology).

I just zoomed into your code, and while that can be more efficient, it does have a limitation: if there are many pending jobs, they will all be returned from the database at once.

function executeJobs(jobsArray=null) { // Jobs.execute() calls this function with [job] as a parameter
	settings.log && settings.log('Jobs', 'executeJobs', jobsArray);
	executing = true; // so that rescheduling, removing, etc, within the jobs doesn't result in lots of calls to findNextJob() (which is done once at the end of this function)
	try {
		(jobsArray || Jobs.collection.find({state: "pending", due: {$lte: new Date()}}, {sort: {due: 1, priority: -1}})).forEach(job => {
			if (inControl && !checkControl()) console.warn('Jobs', 'LOST CONTROL WHILE EXECUTING JOBS'); // should never happen
			if (inControl || jobsArray) executeJob(job); // allow Jobs.execute() to run a job even on a server which isn't in control, otherwise leave execution to server in control
		});
	} catch(e) {
		console.warn('Jobs', 'executeJobs ERROR');
		console.warn(e);
	}
	executing = false;
	findNextJob();
}

This bit here can be dangerous if you have like 10,000 scheduled emails. At least, you should consider putting a limit on the number of documents retrieved, to help preserve memory.

Jobs.collection.find({state: "pending", due: {$lte: new Date()}}, {sort: {due: 1, priority: -1}})

You would also have an issue with blocking. For example, if we have 10,000 scheduled emails, other jobs may not run until these emails are all sent out. This is why SJ makes many queries - to make sure other jobs are given a chance to run near their scheduled time.

Again, much of it depends on your use case, but my goal is to have SJ be sufficient for any use case. The only area where I could see it lag behind is not being able to execute multiple jobs at once.
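
One way to bound both the memory use and the blocking, as suggested above, would be to pull due jobs in fixed-size batches rather than through a single unbounded cursor - a sketch only, with the batch size chosen arbitrarily:

const BATCH = 100;  // arbitrary example limit
let batch;
do {
    batch = Jobs.collection.find(
        {state: "pending", due: {$lte: new Date()}},
        {sort: {due: 1, priority: -1}, limit: BATCH}
    ).fetch();
    batch.forEach(function (job) {
        executeJob(job);  // assumes executeJob() marks the job so it no longer matches "pending"
    });
} while (batch.length === BATCH);  // keep going while full batches come back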

In your implementation a server only takes dominance if another server goes down. In my implementation a server takes dominance whenever it makes changes to the job queue. There is very little cost to this: it just updates the dominator database with its own serverId and the current time. Then when the previous server enters its timeout callback it checks the dominator database, sees that another server has taken control, and does nothing. Have a read through my code - it's pretty simple really, and apart from not saving job history or pausing jobs (which yours doesn't do across multiple servers anyway) it's functionally identical to yours.

I missed that difference - makes sense, but again, I could see an issue where jobs are executed twice. For example, if executeJobs is looping through a hundred jobs, and then another server takes over, and starts looping through them too, you will have jobs that run twice.

Even if something is being done on MongoDB to account for this, the reads may not reflect the writes in time.

@wildhart
Author

Yes, but 13 queries, even simultaneously, isn't much in reality. Like, if you have a chat app with just 100 concurrent clients, you are probably going to get more input/output than that.

I showed in my graph above that 13 queries every 3 seconds does cause a noticeable performance impact. That was back when each interval was also updating the dominator database, so maybe the impact is less now. Even so, just the thought of 13 intervals and db reads happening every 3 seconds when the next scheduled jobs might be hours away makes me cringe. It also makes me reluctant to find more uses for different job types ('queues') when each comes with extra overhead (albeit negligible).

I just checked out your code, and while that can be more efficient, it does have a limitation: if there are like 1000 pending jobs, they will all be returned from the database and run at once.

No, the Jobs.collection.find().forEach() only returns one job document at a time, so they aren't all pulled into memory simultaneously. Internally I presume mongodb isn't loading all the results into its memory either. However, maybe a better solution is below...

You would also have an issue with blocking. For example, if we have 10,000 scheduled emails, other jobs may not run until these emails are all sent out. This is why SJ makes many queries - to make sure other jobs are given a chance to run near their scheduled time.

That is a good point. However, I think if all those emails had to be sent in bulk then a developer is more likely to write a single job to send 10,000 emails (in a controlled way), rather than create 10,000 jobs to send 1 email each. (If they were sensible they would use a 3rd party mail provider to send the emails for them).

But I suppose with SJ you have provided the developer with an easy mechanism to perform such bulk actions in a controlled way, whereas my version doesn't.

That problem is relatively easy to overcome though (code off the top of my head, not tested):

var job, doneJobs;
do {
    doneJobs = [];
    do {
        job = Jobs.collection.findOne({
            state: "pending",
            due: {$lte: new Date()},
            name: {$nin: doneJobs}, // give other job types a chance...
        }, {sort: {due: 1, priority: -1}});
        if (job) {
            execute(job);
            doneJobs.push(job.name); // don't do this job type again until we've tried other job types
        }
    } while (job);
} while (doneJobs.length);

(funny, in 30 years of coding I've hardly ever found a use for a do..while loop before, yet here are two!)

I missed that difference - makes sense, but again, I could see an issue where jobs are executed twice. For example, if executeJobs is looping through a hundred jobs, and then another server takes over, and starts looping through them too, you will have jobs that run twice.

That is also true and it did occur to me as a possibility. It's less of a problem with my system though because jobs tend to be run individually, more or less exactly at their due time, rather than batched into setInterval periods. Still could happen though, with the likelihood increasing with the number of jobs scheduled.

I think a way of fixing that might be to revert to a single server staying in control but using observeChanges on the queue to react to job changes (and if multi-server paused jobs were to be implemented it could use observeChanges on the dominator to see if the list of pausedJobs is changed by another server).
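
For reference, Meteor's cursor observe API makes that fairly compact - a sketch reusing the Jobs.collection and findNextJob() names from my code above (limit: 1 keeps the observer focused on just the next due job):

const handle = Jobs.collection.find(
    {state: "pending"},
    {sort: {due: 1}, limit: 1, fields: {due: 1}}
).observe({
    added: function (job) { findNextJob(); },    // a sooner job may have appeared
    changed: function (job) { findNextJob(); },  // the next job was rescheduled
    removed: function (job) { findNextJob(); },  // the next job was cancelled or completed
});
// handle.stop() when shutting the queue down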

Thanks for taking the time to engage in this discussion and look at my code.

@wildhart
Author

Oops, pushed the wrong button, sorry.

Note that my code above would run all the jobs in the same fibre, which may block the server. Some of that code may benefit from using async or being wrapped in a zero-delay timeout.
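
As a sketch of the zero-delay idea (executeJob() standing in for the per-job runner), running one job per tick would let other fibers - method calls, publications - get a look in between jobs:

function runJobsOneAtATime(jobs) {
    if (!jobs.length) return;
    executeJob(jobs[0]);  // run a single job in this tick
    Meteor.setTimeout(function () {
        runJobsOneAtATime(jobs.slice(1));  // yield, then continue with the rest
    }, 0);
}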

@msavin
Owner

msavin commented Jan 15, 2019

No, the Jobs.collection.find().forEach() only returns one job document at a time, so they aren't all pulled into memory simultaneously. Internally I presume mongodb isn't loading all the results into its memory either. However, maybe a better solution is below...

MongoDB is definitely loading all the results into memory, and they get stale there.

That is a good point. However, I think if all those emails had to be sent in bulk then a developer is more likely to write a single job to send 10,000 emails (in a controlled way), rather than create 10,000 jobs to send 1 email each. (If they were sensible they would use a 3rd party mail provider to send the emails for them).

You still need to consider this use case no matter what. To counter your example, an app might be sending 10,000 customized emails, in which case it's easier and more reliable to queue them up individually. In any case, it's not uncommon to have a queue that has thousands of items due to run.

That is also true and it did occur to me as a possibility. [...] Still could happen though, with the likelihood increasing with the number of jobs scheduled.

Yes, if you're making a developer tool you need to consider this or the people who use your code could have big problems.


My older versions looked quite similar to what you have now. I really think it would make sense to update the operator for this package rather than to start a new fork of essentially the same thing.

@wildhart
Author

I've updated my own package with all the improvements/fixes mentioned above:

  • 1000s of jobs on one queue cannot hold up jobs in other queues;
  • While executing jobs, only one job is queried from the db at a time, to minimise memory use and stale data;
  • Jobs cannot be executed simultaneously on two servers;
  • Job queues can be stopped and started, across multiple servers;

To achieve this I've moved to only having one server in control of the queue - changes to the job queue by other servers are monitored with an observer. Job execution is shared across job queues using findOne() inside a loop instead of find().forEach().

And I still have only one timeout set for the next due job of any type, so most of the time the servers are doing nothing, and you can add as many different job types ('queues') as you want without any additional overhead.

It would be great if you could implement something similar with SJ.

@wildhart
Author

wildhart commented Oct 4, 2019

Today I added my 20th job type ("queue") to one of my apps and so I decided to revert back to the current version of msavin:sjobs to compare the CPU usage with my own 'fork'.

Deployed to a fresh 1vCPU Digital Ocean server with empty db and no connections/users. 20 jobs defined but no jobs executing during the measurement interval:

msavin:sjobs with 20 jobs defined = 1.63% CPU:

[graph: CPU and memory usage with msavin:sjobs, 20 job types]

wildhart:jobs with 20 jobs defined = 0.37% CPU:

[graph: CPU and memory usage with wildhart:jobs, 20 job types]

The fluctuating memory usage with msavin:sjobs shows that there's lots more going on compared to the perfectly flat memory of wildhart:jobs.

So I added 20 more jobs just for comparison:

  • msavin:sjobs with 40 jobs = 2.62%
  • wildhart:jobs with 40 jobs = 0.39%

This shows that the one-setInterval-per-queue design of msavin:sjobs does come with measurable overhead compared to the single-observer-and-setTimeout-for-all-jobs design of wildhart:jobs.

@msavin
Owner

msavin commented Oct 4, 2019 via email

@wildhart
Author

wildhart commented Oct 4, 2019

My approach also prevents jobs blocking each other, as I've discussed above.

How can more jobs be processed at the same time when there's only one thread? My system will process jobs just as quickly as yours.

With my observer/timeout approach jobs will also run more precisely on their due time because there isn't the discretization of the interval period.

I'm using a normal collection .find().observe(). This particular server was set up to use oplog. If oplog support isn't enabled then the observer would automatically fall back to polling, but it would only poll one query instead of 20.

I deliberately refrained from using the word "significant"; instead I said "measurable". Although the 0.4% is actually the background CPU of Meteor, because my single setTimeout was doing precisely nothing during the measurement period - the next scheduled job was 4 hours in the future. So the comparison is really 1.2% vs zero.

Which to be honest is not significant, but unless it comes with extra benefit, why do something in a demonstrably less efficient way? Why should each extra job type come with a little extra overhead? Jobs are very useful; I shouldn't have to think twice about adding more as I find more uses for them.

Plus these measurements do not include the additional db CPU. The db was on the same virtual server - if the db were remote then there would be additional bandwidth usage as well. All unnecessary.

I'm happy with my own fork, and happy for your package to use whichever system you want to use. I just thought I'd present the data as food-for-thought. Do with it what you will ;-)

Repository owner deleted a comment from toys-inc Oct 4, 2019
@msavin
Owner

msavin commented Oct 4, 2019

I appreciate it, and will definitely consider it as part of how to improve the package. I’m considering making the operator part of the package swappable, as there can be multiple strategies for running jobs. At the same time, I think the real challenge is running multiple jobs across multiple servers, ideally, utilizing unused CPU power.

@evolross

evolross commented Sep 25, 2020

I was reading over @wildhart's optimizations and thought I would mention that using only a setTimeout to schedule the next job might suffer from node.js's setTimeout max limit. I opened an issue for this on wildhart/meteor.jobs.
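
For context, Node clamps setTimeout delays longer than 2^31 - 1 ms (about 24.8 days), so a timer armed for a job further out than that would fire almost immediately. A common workaround is to cap the delay and re-arm when the cap is reached - a sketch:

const MAX_TIMEOUT = 2147483647;  // 2^31 - 1 ms, Node's maximum setTimeout delay

function setLongTimeout(fn, delay) {
    if (delay > MAX_TIMEOUT) {
        // wake up at the cap, then re-arm with the remaining delay
        return setTimeout(function () { setLongTimeout(fn, delay - MAX_TIMEOUT); }, MAX_TIMEOUT);
    }
    return setTimeout(fn, delay);
}

(Clearing such a chained timer needs a little extra bookkeeping, since each leg returns a new handle.)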

This issue was closed.