New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setInterval callback function unexpected halt #22149

Closed
newbreach opened this Issue Aug 6, 2018 · 35 comments

Comments

Projects
None yet
10 participants
@newbreach

newbreach commented Aug 6, 2018

There is a setInterval method in the express project. When the project has been running for a month or so, there is no error, the project does not stop, but the callback of the setInterval is no longer executed.

like this :
setInterval(function(){
//do samething
console.log(Date.now());
},500);
A month later, no output is available, but the application process is robust.

@ChALkeR ChALkeR added the timers label Aug 6, 2018

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 6, 2018

This is interesting. I have an app with some code looking like this:

function doSomething() {
    promiseReturningFunction().then((data) => {
        //do some
    }).catch((error) => {
        //log error
    }).then(() => {
        setTimeout(doSomething, 5000);
    });
}

This is just meant to run these code blocks forever. Then suddenly a few installations stopped working recently. The app still responds to api requests and new timers that are set during for example an api call is executed.

However, the doSomething functions are no longer being exectuted. It is as if the event queue at some point just was emptied or something.

My installations that are having this problem are running node 10.4.1 or 10.6.0 depending on when they were upgraded.

A bug in the core nodejs is not my first thought but I cant find a logical explanation for my app to suddenly drop 5-6 timer constructs like this that has worked for a few years before....

tkarls commented Aug 6, 2018

This is interesting. I have an app with some code looking like this:

function doSomething() {
    promiseReturningFunction().then((data) => {
        //do some
    }).catch((error) => {
        //log error
    }).then(() => {
        setTimeout(doSomething, 5000);
    });
}

This is just meant to run these code blocks forever. Then suddenly a few installations stopped working recently. The app still responds to api requests and new timers that are set during for example an api call is executed.

However, the doSomething functions are no longer being exectuted. It is as if the event queue at some point just was emptied or something.

My installations that are having this problem are running node 10.4.1 or 10.6.0 depending on when they were upgraded.

A bug in the core nodejs is not my first thought but I cant find a logical explanation for my app to suddenly drop 5-6 timer constructs like this that has worked for a few years before....

@targos

This comment has been minimized.

Show comment
Hide comment
@targos
Member

targos commented Aug 6, 2018

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 6, 2018

Member

For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before - even on Node servers with an uptime of more than a month.

@tkarls

The app still responds to api requests and new timers that are set during for example an api call is executed.

It is very likely that promiseReturningFunction never fulfilled or that you are returning a promise that was pending forever from either the catch or the then below it - or the then below is never executed.

Also note that is not a great way to write Node.js code (and this is core's fault IMO). Are you using async_hooks and some form of instrumentation? Can you check the logs for unhandledRejections?

Member

benjamingr commented Aug 6, 2018

For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before - even on Node servers with an uptime of more than a month.

@tkarls

The app still responds to api requests and new timers that are set during for example an api call is executed.

It is very likely that promiseReturningFunction never fulfilled or that you are returning a promise that was pending forever from either the catch or the then below it - or the then below is never executed.

Also note that is not a great way to write Node.js code (and this is core's fault IMO). Are you using async_hooks and some form of instrumentation? Can you check the logs for unhandledRejections?

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 6, 2018

@benjamingr yea that would of course be the obvious conclusion and what I initially thought until I realized it wall all timers at once. I just can't figure out how that problem suddenly occurred on 6 different places in the code.

The "background tasks" are not depending on each other and hasn't been changed for a loong time. Had it been one timer that stopped that's one thing but all at the same time? This was also on 7 different servers running the same app in different locations and with a similar up-time. Like 3-4 weeks or so until the issue occurred.

Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.

Not familiar with the async_hooks and logs are not saved (which is a flaw...) :(

tkarls commented Aug 6, 2018

@benjamingr yea that would of course be the obvious conclusion and what I initially thought until I realized it wall all timers at once. I just can't figure out how that problem suddenly occurred on 6 different places in the code.

The "background tasks" are not depending on each other and hasn't been changed for a loong time. Had it been one timer that stopped that's one thing but all at the same time? This was also on 7 different servers running the same app in different locations and with a similar up-time. Like 3-4 weeks or so until the issue occurred.

Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.

Not familiar with the async_hooks and logs are not saved (which is a flaw...) :(

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 6, 2018

Member

@tkarls to be clear, I am not accusing you of writing code problematically - I am saying Node should guide users to safer ways to write code and make them "obvious".

If there is a way to get the "unhandledRejection" warnings then there is a very high chance you'll see it there. I've seen this before several times:

  • server runs for days.
  • there is a networking "hickup" and stuff fails for just a second or two.
  • recurring tasks fail, but since Node doesn't exit on promise unhandled rejection warnings the server chugs on.
  • Many people don't log warnings in production.

I'm getting "off" timer land and into "promise debugging land" here - just saying this is entirely on our radar and something we're thinking and talking about improving. Also it's possible your problem was entirely different and we do have a bug in setInterval - so take everything I say (guess) with a grain of salt.

Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.

Typically, not on the server and using something that does scheduling - either what your cloud provides, crons if you're on a VM, k8s/cron jobs etc. I also warmly recommend turning on warnings (in production) for safety and using an APM (application performance monitoring) tool and setting up alerts for uptime and certain tasks.

Again, I strongly believe this is an issue with Node's user experience (not your or most people's code) and I feel strongly that Node (well, us) should fix it.

Member

benjamingr commented Aug 6, 2018

@tkarls to be clear, I am not accusing you of writing code problematically - I am saying Node should guide users to safer ways to write code and make them "obvious".

If there is a way to get the "unhandledRejection" warnings then there is a very high chance you'll see it there. I've seen this before several times:

  • server runs for days.
  • there is a networking "hickup" and stuff fails for just a second or two.
  • recurring tasks fail, but since Node doesn't exit on promise unhandled rejection warnings the server chugs on.
  • Many people don't log warnings in production.

I'm getting "off" timer land and into "promise debugging land" here - just saying this is entirely on our radar and something we're thinking and talking about improving. Also it's possible your problem was entirely different and we do have a bug in setInterval - so take everything I say (guess) with a grain of salt.

Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.

Typically, not on the server and using something that does scheduling - either what your cloud provides, crons if you're on a VM, k8s/cron jobs etc. I also warmly recommend turning on warnings (in production) for safety and using an APM (application performance monitoring) tool and setting up alerts for uptime and certain tasks.

Again, I strongly believe this is an issue with Node's user experience (not your or most people's code) and I feel strongly that Node (well, us) should fix it.

@newbreach

This comment has been minimized.

Show comment
Hide comment
@newbreach

newbreach Aug 6, 2018

As tkarls says, I have a similar situation; @benjamingr thank you for your advice on coding, but I can't use cron because I use doSomething to trigger about 400-500 milliseconds.

newbreach commented Aug 6, 2018

As tkarls says, I have a similar situation; @benjamingr thank you for your advice on coding, but I can't use cron because I use doSomething to trigger about 400-500 milliseconds.

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 6, 2018

Member

@newbreach in a similar situation - what sort of code do you have in the setTimeout?

Member

benjamingr commented Aug 6, 2018

@newbreach in a similar situation - what sort of code do you have in the setTimeout?

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 6, 2018

@benjamingr no problem, I didn't feel particularly 'accused' because it it was an easy fix I could do to my code I want to know about it so I can fix the problem :D

Thanks for your advises. Using external schedule trigging had not even occurred to me! Today the node process is kept alive using a systemd service that limits memory and tasks and restarts the node process if needed while ensuring always up. I'm sure there are more advanced monitors.

But back to the issue at hand. I hear what you are saying about unhandled rejections but I do have that catch before the then that sets the timer. So any error or exception should be caught I think (I have double checked that all promises are returned and not just created locally too).

One of my "workers" query an api over the network. Another checks a local database and talks to a local service. A third one is querying the NTP status on the server etc. Most of the functions are rather small and easy to overview. So I find it improbable (but I suppose not impossible) that all these functions have a bug in them that manifest at the same time. This happened on several servers during the summer (running 10.4.1-10.6.0) but not on any of our 60+ servers running <= 9.8.0. Granted our sw is also newer on the newer installations but a few of the "workers" hasn't been changed in well over a year.

The only thing that doesn't indicate a bug/change in setTimeout/setInterval to me is probably that there isn't 500 people reporting it already due to node being so popular. Which I can agree is a pretty strong indication too...

I do appreciate you insights and guesses though. That is all I'm doing right now too :)

tkarls commented Aug 6, 2018

@benjamingr no problem, I didn't feel particularly 'accused' because it it was an easy fix I could do to my code I want to know about it so I can fix the problem :D

Thanks for your advises. Using external schedule trigging had not even occurred to me! Today the node process is kept alive using a systemd service that limits memory and tasks and restarts the node process if needed while ensuring always up. I'm sure there are more advanced monitors.

But back to the issue at hand. I hear what you are saying about unhandled rejections but I do have that catch before the then that sets the timer. So any error or exception should be caught I think (I have double checked that all promises are returned and not just created locally too).

One of my "workers" query an api over the network. Another checks a local database and talks to a local service. A third one is querying the NTP status on the server etc. Most of the functions are rather small and easy to overview. So I find it improbable (but I suppose not impossible) that all these functions have a bug in them that manifest at the same time. This happened on several servers during the summer (running 10.4.1-10.6.0) but not on any of our 60+ servers running <= 9.8.0. Granted our sw is also newer on the newer installations but a few of the "workers" hasn't been changed in well over a year.

The only thing that doesn't indicate a bug/change in setTimeout/setInterval to me is probably that there isn't 500 people reporting it already due to node being so popular. Which I can agree is a pretty strong indication too...

I do appreciate you insights and guesses though. That is all I'm doing right now too :)

@newbreach

This comment has been minimized.

Show comment
Hide comment
@newbreach

newbreach Aug 6, 2018

@benjamingr

My code logic is probably like this

  1. accept the HTTP request and send the message to the TCP client.

  2. the request callback method is added to the queue.

  3. receive the message of TCP client. If a reply is received, the callback method is removed from the queue and the callback is executed. The content of the callback method is "response.write" ({"success": true}).

  4. timer checks whether the request message is more than 40 seconds. If it is over, remove the callback method and execute it.

The part of the code is as follows:

setInterval(function () {
  let cbs = [];
  for (let key of callbackQueue.keys()) {
    let item = callbackQueue.get(key)
    if (item && Date.now() - item.time >= 40000) {
      callbackQueue.delete(key);
      cbs.push(item);
    }
  }
  for (let i in cbs) {
    cbs[i].cb.call(cbs[i].thisArg, new util.BusinessError('network fail', 4001));
  }
}, 200);

thanks!

Edit by @Fishrock123 - code formatting

newbreach commented Aug 6, 2018

@benjamingr

My code logic is probably like this

  1. accept the HTTP request and send the message to the TCP client.

  2. the request callback method is added to the queue.

  3. receive the message of TCP client. If a reply is received, the callback method is removed from the queue and the callback is executed. The content of the callback method is "response.write" ({"success": true}).

  4. timer checks whether the request message is more than 40 seconds. If it is over, remove the callback method and execute it.

The part of the code is as follows:

setInterval(function () {
  let cbs = [];
  for (let key of callbackQueue.keys()) {
    let item = callbackQueue.get(key)
    if (item && Date.now() - item.time >= 40000) {
      callbackQueue.delete(key);
      cbs.push(item);
    }
  }
  for (let i in cbs) {
    cbs[i].cb.call(cbs[i].thisArg, new util.BusinessError('network fail', 4001));
  }
}, 200);

thanks!

Edit by @Fishrock123 - code formatting

@newbreach newbreach closed this Aug 6, 2018

@newbreach

This comment has been minimized.

Show comment
Hide comment
@newbreach

newbreach Aug 6, 2018

Click the mistake, I'm sorry

newbreach commented Aug 6, 2018

Click the mistake, I'm sorry

@newbreach newbreach reopened this Aug 6, 2018

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 6, 2018

Member

I'm trying to reproduce this (with NODE_DEBUG=*, and the code setInterval(() => console.log('ping', Date.now()), 500);) - I'll see if I can hit it after a day or two. Otherwise I'm not sure how I can do "time dilation" and run it faster in a VM - I'll look into that.

Edit: there appears to be a way to test it in virtualbox if anyone is interested in taking a stab sooner: https://superuser.com/a/1022317/248212

Member

benjamingr commented Aug 6, 2018

I'm trying to reproduce this (with NODE_DEBUG=*, and the code setInterval(() => console.log('ping', Date.now()), 500);) - I'll see if I can hit it after a day or two. Otherwise I'm not sure how I can do "time dilation" and run it faster in a VM - I'll look into that.

Edit: there appears to be a way to test it in virtualbox if anyone is interested in taking a stab sooner: https://superuser.com/a/1022317/248212

@Fishrock123

This comment has been minimized.

Show comment
Hide comment
@Fishrock123

Fishrock123 Aug 7, 2018

Member

This just became a problem recently? Like, in Node 10 ... or something else?

Member

Fishrock123 commented Aug 7, 2018

This just became a problem recently? Like, in Node 10 ... or something else?

@apapirovski

This comment has been minimized.

Show comment
Hide comment
@apapirovski

apapirovski Aug 7, 2018

Member

I'm not seeing anything in the 10.x code that jumps out at me but will continue investigating and try to reproduce.

Member

apapirovski commented Aug 7, 2018

I'm not seeing anything in the 10.x code that jumps out at me but will continue investigating and try to reproduce.

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 8, 2018

@Fishrock123 Well for me I have not experienced anything like this on node <= 9.8
The next version I have used after that is 10.4.1 and there I got this problem after a few weeks (3-4) of run-time on several servers.

Today I will try and attach a debugger to one of my failing servers. I said before that new timers set by setTimeout() in api calls were executed correctly. This may have been a premature conclusion. Because it looks like this is not the case in two other API calls. I don't seem to get an error thrown though since the API call ends with success. Just the scheduled task does not run.

I will add any info I get from the debugging here

tkarls commented Aug 8, 2018

@Fishrock123 Well for me I have not experienced anything like this on node <= 9.8
The next version I have used after that is 10.4.1 and there I got this problem after a few weeks (3-4) of run-time on several servers.

Today I will try and attach a debugger to one of my failing servers. I said before that new timers set by setTimeout() in api calls were executed correctly. This may have been a premature conclusion. Because it looks like this is not the case in two other API calls. I don't seem to get an error thrown though since the API call ends with success. Just the scheduled task does not run.

I will add any info I get from the debugging here

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 8, 2018

Member

Code is running with all the debug stuff for 2 days now here - not sure how to debug it otherwise - if anyone has any experience "speeding time" on an emulated system please chime in.

Member

benjamingr commented Aug 8, 2018

Code is running with all the debug stuff for 2 days now here - not sure how to debug it otherwise - if anyone has any experience "speeding time" on an emulated system please chime in.

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 8, 2018

Ok, so I have successfully been able to connect the debugger over a ssh tunnel. I have 3 timers I can access via the api. 2 is not working in the sense that I can step the code and see that setTimeout() is indeed called with sensible values and does not throw. But the callback is never called. One callback is working as expected which is a bit strange.

I called process._getActiveHandles(); before the setTimeout of the working timer and got 21 handles. After the call to setTimeout() I got 22 handles and the last handle in the list is my new timer which I can see from the msecs value.

Doing the same thing on a setTimeout that is not working results is:
before setTimeout I have 23 active handles.
after setTimeout I have 23 active handles and the last one in the list is NOT my timer. Probably why it is never called.

I do have several timers in the list already with timer values that I do recognize and the only thing I can see is really different from the timer that is still working and the one that is not is the timeout value. The working timer uses a unique timeout value only used for that timer. The timers that isn't working is using the same timeout values as several of my "background timers" is using. Not sure if this is a coincidence or not... but I can also see in the debugger that one and exactly one background timer is still working. That one also has a unique timeout value that no other times has in the code. All other timers uses "common values" such as 3000, 5000, 10000 and 30000 and more than one timer uses each value.

sounds pretty strange to me...

tkarls commented Aug 8, 2018

Ok, so I have successfully been able to connect the debugger over a ssh tunnel. I have 3 timers I can access via the api. 2 is not working in the sense that I can step the code and see that setTimeout() is indeed called with sensible values and does not throw. But the callback is never called. One callback is working as expected which is a bit strange.

I called process._getActiveHandles(); before the setTimeout of the working timer and got 21 handles. After the call to setTimeout() I got 22 handles and the last handle in the list is my new timer which I can see from the msecs value.

Doing the same thing on a setTimeout that is not working results is:
before setTimeout I have 23 active handles.
after setTimeout I have 23 active handles and the last one in the list is NOT my timer. Probably why it is never called.

I do have several timers in the list already with timer values that I do recognize and the only thing I can see is really different from the timer that is still working and the one that is not is the timeout value. The working timer uses a unique timeout value only used for that timer. The timers that isn't working is using the same timeout values as several of my "background timers" is using. Not sure if this is a coincidence or not... but I can also see in the debugger that one and exactly one background timer is still working. That one also has a unique timeout value that no other times has in the code. All other timers uses "common values" such as 3000, 5000, 10000 and 30000 and more than one timer uses each value.

sounds pretty strange to me...

@Fishrock123

This comment has been minimized.

Show comment
Hide comment
@Fishrock123

Fishrock123 Aug 8, 2018

Member

That sounds like a rescheduling/reuse/re-enroll bug. (tl;dr some path causes pooled timers to not execute or schedule)

@apapirovski You're probably most familiar with this code now, any ideas?

Member

Fishrock123 commented Aug 8, 2018

That sounds like a rescheduling/reuse/re-enroll bug. (tl;dr some path causes pooled timers to not execute or schedule)

@apapirovski You're probably most familiar with this code now, any ideas?

@apapirovski

This comment has been minimized.

Show comment
Hide comment
@apapirovski

apapirovski Aug 8, 2018

Member

Nothing off the top of my head. Will investigate.

Member

apapirovski commented Aug 8, 2018

Nothing off the top of my head. Will investigate.

@apapirovski

This comment has been minimized.

Show comment
Hide comment
@apapirovski

apapirovski Aug 8, 2018

Member

@tkarls can you elaborate on "background timers"? Are these using unref or something? Any chance either me or @Fishrock123 could get access to the codebase? It would make it easier to know what I should be looking for in our code.

Member

apapirovski commented Aug 8, 2018

@tkarls can you elaborate on "background timers"? Are these using unref or something? Any chance either me or @Fishrock123 could get access to the codebase? It would make it easier to know what I should be looking for in our code.

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 9, 2018

@apapirovski
With background timer I mean that It's basically a function that does some task and then reschedule itself using setTimeout(). Not using unref(), only setTimeout and sometimes clearTimeout

Like this:

function doSomething() {
    promiseReturningFunction().then((data) => {
        //do some
    }).catch((error) => {
        //log error
    }).then(() => {
        setTimeout(doSomething, 5000);
    });
}

I call the function once when the app is lifted and then they are supposed to keep running without being poked from the outside. Hence "backgrund task" but I never use anything else than setTimeout to do the scheduling.

The api timer for instance is set in response to an api call. For instance I have a call that enables the firewall that sets an timer to auto disable it unless the user confirms the action within 30 seconds. This timer is not working. The relevant code is as follows;

enableFirewall: (options) => {
        return getFwStatus().then((before) => {
            return execp('sudo ufw --force enable').then(()=>{
                setDisableTimer(disableTime);
            }).then(()=>{
                return getFwStatus().then((after) => {
                    options.before = before;
                    options.after = after;
                    return addAuditLogEntry(options);
                }); 
            });
        }).then(()=>{
            return disableTime;
        }).catch( (error) => {
            options.name = 'firewall-error';
            return disableFirewall(options).catch(() => {
                return execp('sudo ufw disable')
            }).then(() => {
                return Promise.reject(error);
            });
        });
    }

function setDisableTimer(timeout) {
    cancelDisableTimer();
    disableFirewallTimer = setTimeout(() => {
        disableFirewall({ user: '--disable timer--'});
    }, timeout);
}

function cancelDisableTimer() {
    if (disableFirewallTimer != null) {
        clearTimeout(disableFirewallTimer);
        disableFirewallTimer = null;
    }
}

function disableFirewall(options) {
    var options = options || { user: '--unknown user--' }
    return getFwStatus().then((before) => {
        return execp('sudo ufw disable').then(() => {
            return getFwStatus().then((after) => {
                options.before = before;
                options.after = after;
                return addAuditLogEntry(options);
            });
        });
    });
}

The api call will call enableFirewall and I do see when stepping the code that it reaches and executes without error

disableFirewallTimer = setTimeout(() => {
        disableFirewall({ user: '--disable timer--'});
    }, timeout);

and that the timeout parameter indeed is set to 30000 as I expected it to be. But a breakpoint in the anonymous function never get hit (and the firewall doesn't get called either).

One of the "background tasks" that are not working any more

function keepNtpStatusFresh() {
    getNtpStatusOutputInternal().then((data) => {
        lastNtpCheck.data = data;
        lastNtpCheck.when = new Date();
    }).catch((error) => {
        lastNtpCheck.data = null;
        sails.log.error(error);
    }).then(() => {
        setTimeout(keepNtpStatusFresh, 5000);
    });
}

function getNtpStatusOutputInternal() {
    return execp(sails.config.paneda.ntpStatus, { timeout: 30000 }).then((result) => {
        var re = new RegExp(/^(.)([^\s=]+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/gm);
        var match = re.exec(result.stdout);

        var servers = [];
        while (match != null) {
            servers.push({
                type: match[1],
                remote: match[2],
                refid: match[3],
                st: Number(match[4]),
                t: match[5],
                when: Number(match[6]),
                poll: Number(match[7]),
                reach: Number(match[8]),
                delay: Number(match[9]),
                offset: Number(match[10]),
                jitter: Number(match[11]),
            });
            match = re.exec(result.stdout);
        }
        return { output: result.stdout, serverList: servers };
    });
} 

tkarls commented Aug 9, 2018

@apapirovski
With background timer I mean that It's basically a function that does some task and then reschedule itself using setTimeout(). Not using unref(), only setTimeout and sometimes clearTimeout

Like this:

function doSomething() {
    promiseReturningFunction().then((data) => {
        //do some
    }).catch((error) => {
        //log error
    }).then(() => {
        setTimeout(doSomething, 5000);
    });
}

I call the function once when the app is lifted and then they are supposed to keep running without being poked from the outside. Hence "backgrund task" but I never use anything else than setTimeout to do the scheduling.

The api timer for instance is set in response to an api call. For instance I have a call that enables the firewall that sets an timer to auto disable it unless the user confirms the action within 30 seconds. This timer is not working. The relevant code is as follows;

enableFirewall: (options) => {
        return getFwStatus().then((before) => {
            return execp('sudo ufw --force enable').then(()=>{
                setDisableTimer(disableTime);
            }).then(()=>{
                return getFwStatus().then((after) => {
                    options.before = before;
                    options.after = after;
                    return addAuditLogEntry(options);
                }); 
            });
        }).then(()=>{
            return disableTime;
        }).catch( (error) => {
            options.name = 'firewall-error';
            return disableFirewall(options).catch(() => {
                return execp('sudo ufw disable')
            }).then(() => {
                return Promise.reject(error);
            });
        });
    }

function setDisableTimer(timeout) {
    cancelDisableTimer();
    disableFirewallTimer = setTimeout(() => {
        disableFirewall({ user: '--disable timer--'});
    }, timeout);
}

function cancelDisableTimer() {
    if (disableFirewallTimer != null) {
        clearTimeout(disableFirewallTimer);
        disableFirewallTimer = null;
    }
}

function disableFirewall(options) {
    var options = options || { user: '--unknown user--' }
    return getFwStatus().then((before) => {
        return execp('sudo ufw disable').then(() => {
            return getFwStatus().then((after) => {
                options.before = before;
                options.after = after;
                return addAuditLogEntry(options);
            });
        });
    });
}

The api call will call enableFirewall and I do see when stepping the code that it reaches and executes without error

disableFirewallTimer = setTimeout(() => {
        disableFirewall({ user: '--disable timer--'});
    }, timeout);

and that the timeout parameter indeed is set to 30000 as I expected it to be. But a breakpoint in the anonymous function never get hit (and the firewall doesn't get called either).

One of the "background tasks" that are not working any more

function keepNtpStatusFresh() {
    getNtpStatusOutputInternal().then((data) => {
        lastNtpCheck.data = data;
        lastNtpCheck.when = new Date();
    }).catch((error) => {
        lastNtpCheck.data = null;
        sails.log.error(error);
    }).then(() => {
        setTimeout(keepNtpStatusFresh, 5000);
    });
}

function getNtpStatusOutputInternal() {
    return execp(sails.config.paneda.ntpStatus, { timeout: 30000 }).then((result) => {
        var re = new RegExp(/^(.)([^\s=]+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/gm);
        var match = re.exec(result.stdout);

        var servers = [];
        while (match != null) {
            servers.push({
                type: match[1],
                remote: match[2],
                refid: match[3],
                st: Number(match[4]),
                t: match[5],
                when: Number(match[6]),
                poll: Number(match[7]),
                reach: Number(match[8]),
                delay: Number(match[9]),
                offset: Number(match[10]),
                jitter: Number(match[11]),
            });
            match = re.exec(result.stdout);
        }
        return { output: result.stdout, serverList: servers };
    });
} 
@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 9, 2018

Member
ping 1533808667121
TIMER 61156: 500 list wait because diff is -1
TIMER 61156: timeout callback 500
TIMER 61156: now: 116841691

My test based on the above is still running for several days now by the way.

Member

benjamingr commented Aug 9, 2018

ping 1533808667121
TIMER 61156: 500 list wait because diff is -1
TIMER 61156: timeout callback 500
TIMER 61156: now: 116841691

My test based on the above is still running for several days now by the way.

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 9, 2018

Yes, also since our servers were restarted they are working fine for little over 10 days now...

I have tried to check when the servers were upgraded and when the error showed up on each server. It looks like they all failed after about 25 days of run-time.

If this was a totally random problem or some sort of race condition I would have expected one server to fail after a few days, one to fail after a few weeks and perhaps 1 should still be running. 100% of the upgraded servers did fail. They are also running in completely different environments.

My brain can't help noticing that 25 days is almost exactly the number of milliseconds that can fit into a signed 32 bit integer (2 160 000 000 vs 2 147 483 647 ). Not sure if this is a pure coincidence (especially since Javascript doesn't have signed 32 bit integers AFAIK) but I'm mentioning it anyway :)

tkarls commented Aug 9, 2018

Yes, also since our servers were restarted they are working fine for little over 10 days now...

I have tried to check when the servers were upgraded and when the error showed up on each server. It looks like they all failed after about 25 days of run-time.

If this was a totally random problem or some sort of race condition I would have expected one server to fail after a few days, one to fail after a few weeks and perhaps 1 should still be running. 100% of the upgraded servers did fail. They are also running in completely different environments.

My brain can't help noticing that 25 days is almost exactly the number of milliseconds that can fit into a signed 32 bit integer (2 160 000 000 vs 2 147 483 647 ). Not sure if this is a pure coincidence (especially since Javascript doesn't have signed 32 bit integers AFAIK) but I'm mentioning it anyway :)

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 9, 2018

Member

@tkarls as far as I know livuv uses 64bit ints for timers from what I can tell and Node just calls into libuv for checking.

Maybe @cjihrig or @saghul will know? (since libuv timers)

(We do cap the timeout itself to 32bit integers but not the number of repeats).

I wonder if this correlates with a libuv update - I'd try and bisect but for a bug that takes ~25 days to reproduce I don't think it'll help.

Member

benjamingr commented Aug 9, 2018

@tkarls as far as I know livuv uses 64bit ints for timers from what I can tell and Node just calls into libuv for checking.

Maybe @cjihrig or @saghul will know? (since libuv timers)

(We do cap the timeout itself to 32bit integers but not the number of repeats).

I wonder if this correlates with a libuv update - I'd try and bisect but for a bug that takes ~25 days to reproduce I don't think it'll help.

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 9, 2018

@apapirovski @benjamingr @Fishrock123

I was just able to reproduce this with a smaller example 😁

I used libfaketime to speed things up by a factor of 1000 on a linux server and executed a test program containing functions scheduling themselves in the callback and printing the current date. The simulated start date is today (2018-08-09) and the last print is in september 03 wich happened to be 25 days!

Here is the script for reference:

function timer1 () {
    console.log('timer1' + new Date().toString());
    setTimeout(timer1, 500);
}

function timer2 () {
    console.log('timer2' + new Date().toString());
    setTimeout(timer2, 500);
}

function timer3 () {
    console.log('timer3' + new Date().toString());
    setTimeout(timer3, 500);
}

function timer4 () {
    console.log('timer4' + new Date().toString());
    setTimeout(timer4, 500);
}

function timer5 () {
    console.log('timer5' + new Date().toString());
    setTimeout(timer5, 1500);
}

function timer6 () {
    console.log('timer6' + new Date().toString());
    setTimeout(timer6, 500);
}

function timer7 () {
    console.log('timer7' + new Date().toString());
    setTimeout(timer7, 500);
}

function timer8 () {
    console.log('timer8' + new Date().toString());
    setTimeout(timer8, 500);
}

function timer9 () {
    console.log('timer9' + new Date().toString());
    setTimeout(timer9, 500);
}
function timer10 () {
    console.log('timer10' + new Date().toString());
    setTimeout(timer10, 500);
}

timer1();
timer2();
timer3();
timer4();
timer5();
timer6();
timer7();
timer8();
timer9();
timer10();

In the middle of september 3 the output stopped. No error other output in the console.

Here is how I started it:
faketime -f "+10 x1000" node timer_test

@benjamingr So, libuv... I'm not at all familiar with that lib. But I suppose the bug could be there instead...

tkarls commented Aug 9, 2018

@apapirovski @benjamingr @Fishrock123

I was just able to reproduce this with a smaller example 😁

I used libfaketime to speed things up by a factor of 1000 on a linux server and executed a test program containing functions scheduling themselves in the callback and printing the current date. The simulated start date is today (2018-08-09) and the last print is in september 03 wich happened to be 25 days!

Here is the script for reference:

function timer1 () {
    console.log('timer1' + new Date().toString());
    setTimeout(timer1, 500);
}

function timer2 () {
    console.log('timer2' + new Date().toString());
    setTimeout(timer2, 500);
}

function timer3 () {
    console.log('timer3' + new Date().toString());
    setTimeout(timer3, 500);
}

function timer4 () {
    console.log('timer4' + new Date().toString());
    setTimeout(timer4, 500);
}

function timer5 () {
    console.log('timer5' + new Date().toString());
    setTimeout(timer5, 1500);
}

function timer6 () {
    console.log('timer6' + new Date().toString());
    setTimeout(timer6, 500);
}

function timer7 () {
    console.log('timer7' + new Date().toString());
    setTimeout(timer7, 500);
}

function timer8 () {
    console.log('timer8' + new Date().toString());
    setTimeout(timer8, 500);
}

function timer9 () {
    console.log('timer9' + new Date().toString());
    setTimeout(timer9, 500);
}
function timer10 () {
    console.log('timer10' + new Date().toString());
    setTimeout(timer10, 500);
}

timer1();
timer2();
timer3();
timer4();
timer5();
timer6();
timer7();
timer8();
timer9();
timer10();

In the middle of september 3 the output stopped. No error other output in the console.

Here is how I started it:
faketime -f "+10 x1000" node timer_test

@benjamingr So, libuv... I'm not at all familiar with that lib. But I suppose the bug could be there instead...

@apapirovski

This comment has been minimized.

Show comment
Hide comment
@apapirovski

apapirovski Aug 9, 2018

Member

I think I know the bug... confirming right now. Ugh.

Member

apapirovski commented Aug 9, 2018

I think I know the bug... confirming right now. Ugh.

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 9, 2018

Member

@tkarls excellent work isolating it. libuv is the underlying i/o library Node uses (it was written for Node but is used by others now) - it sounds like Anatoli has some idea about where the bug is though :)

Member

benjamingr commented Aug 9, 2018

@tkarls excellent work isolating it. libuv is the underlying i/o library Node uses (it was written for Node but is used by others now) - it sounds like Anatoli has some idea about where the bug is though :)

@apapirovski apapirovski referenced this issue Aug 9, 2018

Closed

src: fix integer overflow in GetNow #22214

3 of 3 tasks complete
@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Aug 9, 2018

Great that the cause is identified now 👍

I wonder why all these strange errors always happen to me 😂

tkarls commented Aug 9, 2018

Great that the cause is identified now 👍

I wonder why all these strange errors always happen to me 😂

@newbreach

This comment has been minimized.

Show comment
Hide comment
@newbreach

newbreach Aug 9, 2018

I'm very happy. Don't worry about the boss getting angry! 😁Thx👍 @tkarls @benjamingr @apapirovski

newbreach commented Aug 9, 2018

I'm very happy. Don't worry about the boss getting angry! 😁Thx👍 @tkarls @benjamingr @apapirovski

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Aug 12, 2018

Member

The fix landed and should be out at the next release. Thanks for the report!

Member

benjamingr commented Aug 12, 2018

The fix landed and should be out at the next release. Thanks for the report!

@girng

This comment has been minimized.

Show comment
Hide comment
@girng

girng Sep 2, 2018

For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before

So much for having a gameserver be up for over a month! The absolute horror!

girng commented Sep 2, 2018

For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before

So much for having a gameserver be up for over a month! The absolute horror!

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Sep 2, 2018

Member

So much for having a gameserver be up for over a month! The absolute horror!

Note that what I meant by that isn't that your services aren't expected to be up for a month. It's that redundancy is expected (2+ instances behind a load balancer) so servers are resilient to failure.

There are plenty of cases where you need to restart your server (sceptre/meltdown style attacks) regardless of Node. I was just pointing out that relying on it being up for a month isn't something I'd personally recommend doing.

Member

benjamingr commented Sep 2, 2018

So much for having a gameserver be up for over a month! The absolute horror!

Note that what I meant by that isn't that your services aren't expected to be up for a month. It's that redundancy is expected (2+ instances behind a load balancer) so servers are resilient to failure.

There are plenty of cases where you need to restart your server (sceptre/meltdown style attacks) regardless of Node. I was just pointing out that relying on it being up for a month isn't something I'd personally recommend doing.

jhuckaby added a commit to jhuckaby/Cronicle that referenced this issue Sep 2, 2018

Version 0.8.27
Now emitting a warning on startup for Node v10.0 to v10.8, due to the core timer bug: nodejs/node#22149
Updated timer bug warning in README.
Bumped pixl-server version to 1.0.16 for more verbose startup log entry.
@murrayju

This comment has been minimized.

Show comment
Hide comment
@murrayju

murrayju Sep 4, 2018

The fix landed and should be out at the next release

@benjamingr What version(s) contain(s) (or will contain) this fix?

murrayju commented Sep 4, 2018

The fix landed and should be out at the next release

@benjamingr What version(s) contain(s) (or will contain) this fix?

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Sep 4, 2018

Member

Node v10.9.0

Member

benjamingr commented Sep 4, 2018

Node v10.9.0

@CurtisCBS CurtisCBS referenced this issue Sep 9, 2018

Closed

2018-09-10 #75

tejohnso added a commit to Rise-Vision/display-monitoring-server that referenced this issue Sep 10, 2018

Merge #38 add-logging
Add logging per cycle.

The interval was stopping due to [NodeJS issue](nodejs/node#22149)

This merge will trigger an image rebuild and grab the fix from node:latest
@halis

This comment has been minimized.

Show comment
Hide comment
@halis

halis Oct 11, 2018

Just use setTimeout instead and use recursion to call it again.

halis commented Oct 11, 2018

Just use setTimeout instead and use recursion to call it again.

@tkarls

This comment has been minimized.

Show comment
Hide comment
@tkarls

tkarls Oct 12, 2018

@halis
Did you read the thread? 😂

tkarls commented Oct 12, 2018

@halis
Did you read the thread? 😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment