Corrupted data when CPU load is high #22

Closed

skrobul opened this issue Jul 9, 2014 · 10 comments

@skrobul

skrobul commented Jul 9, 2014

Hi,
I am using node-tail on a server with relatively high load. My app basically monitors a syslog text file that is written to at a relatively high rate (about 25-35 kB/s). The file looks more or less like this:

2014-07-09T09:56:16+00:00 nszen3.dc1.company.net [info] postfix/smtpd: lost connection after CONNECT from unknown[10.12.109.70]
2014-07-09T09:56:16+00:00 nszen3.dc1.company.net [info] postfix/smtpd: disconnect from unknown[10.12.109.70]
2014-07-09T09:56:15+00:00 xe17-1d.dc2.company.net [notice] 897262: Jul  9 09:56:14.844 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/16, changed state to down
2014-07-09T09:56:13+00:00 aw1-1p.dc3.company.net [notice] 1724: Jul  9 09:56:15.584 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
2014-07-09T09:56:13+00:00 aw1-1.dc3.company.net [notice] 2673: Jul  9 09:56:15.622 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
2014-07-09T09:56:40+00:00 aw1-1.dc3.company.net [error] Postfix blah blah
multi
line message
2014-07-09T09:56:13+00:00 aw1-1.dc3.company.net [notice] 2673: Jul  9 09:56:15.622 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
...

The end goal is to 'split' the stream into an array of syslog messages. The logical choice would be to use \n as a separator, but that does not work in this particular case because some of the messages span multiple lines.

I ended up writing code that looks like this:

var tail = new Tail("/opt/syslog/archive.log", "\n20");
tail.on('line', function(data) {
  // re-attach the "20" that the "\n20" separator consumed
  process_further('20' + data);
});
tail.on('error', function(error) { console.log(error); });

The results are as expected the vast majority of the time - I get the exact message that's supposed to be extracted. Unfortunately, once every couple of minutes the message gets corrupted in a funny way: process_further() for some reason gets called with junk data at the beginning of the string, for example:

2014-2014-07-09T06:03:36+00:00 478242-dhcp2.kickstart.company.com [notice] info: dhcpd: DHCPREQUEST for 10.132.29.112 (10.192.0.40) from 00:26:b9:4e:e3:07 via 10.192.28.3

or

202014-07-08T12:47:15+00:00 f4-42-1.company.net [notice] : 2014 Jul  8 12:46:19 UTC: %ETHPORT-5-IF_DUPLEX: Interface Ethernet1/34, operational duplex mode changed to Full

Please bear in mind that it happens only once every couple of thousand lines.
I have been troubleshooting this for the last couple of days and I am not able to find the culprit. So far I have checked the following things:

  • There are no 'weird characters' or anything else unusual in the raw file being monitored when it happens. The affected area (a few lines before and after the corrupted message) has been verified with a hex editor, and the only character after the last line is \n aka 0x0A.
  • It seems to be related to CPU load. Currently the system is running with a load average of 8.5.
  • I am not able to replicate the issue when replaying the same file on another computer.
  • Replaying the same file on the same computer does not produce corruption in the same place.
  • One of my theories was that the file somehow gets overwritten or, worse, segments are removed between reads. I've added a debugging console.log in https://github.com/lucagrulla/node-tail/blob/master/tail.coffee#L48 to check if the condition is ever met:
if (stats.size < _this.pos) {
  console.log("rewinded! stats.size=" + stats.size + " @pos=" + _this.pos + " difference(@pos-size)=" + (_this.pos - stats.size));
  _this.pos = stats.size;
}

and to my surprise it fired a couple of times:

rewinded! stats.size=1219983229 @pos=1219983490 difference(@pos-size)=261
rewinded! stats.size=1219983750 @pos=1219984281 difference(@pos-size)=531

I thought this was not possible, but wanted to verify anyway, so I wrote a short Ruby script to monitor the file size returned by the system and produce an alert when the file is rewound:

#!/usr/bin/ruby
filename = '/opt/syslog/archive.log'
prev = 0
loop do
  fsize = File.stat(filename).size
  if fsize - prev < 0
    puts "size: #{fsize} prev: #{prev} diff: #{fsize - prev} REWINDED!"
    abort("File rewinded, aborting")
  end
  prev = fsize
  sleep 0.1
end

That script has been running for the last couple of hours and has never fired, so the file is not being rewound, yet for some reason Node thinks it is.

Still, the rewinds do not seem to correlate with the times when messages are corrupted:

rewinded! stats.size=1219983229 @pos=1219983490 difference(@pos-size)=261
rewinded! stats.size=1219983750 @pos=1219984281 difference(@pos-size)=531
rewinded! stats.size=1226892183 @pos=1226894946 difference(@pos-size)=2763
rewinded! stats.size=1232545312 @pos=1232545481 difference(@pos-size)=169
rewinded! stats.size=1232581791 @pos=1232582488 difference(@pos-size)=697
rewinded! stats.size=1235945205 @pos=1235945374 difference(@pos-size)=169
rewinded! stats.size=1245025044 @pos=1245026503 difference(@pos-size)=1459
rewinded! stats.size=1245908859 @pos=1245909751 difference(@pos-size)=892
Problem with chunk: 014-07-09T10:48:01+00:00 dev.dc3.company.net [info] postfix/cleanup: 821A8DDD33: message-id=<20140709104801.7F0B7DDD4F@dev-script.dc3.company.net>
14-07-09T10:48:01+00:00 phase3sniffer.dc3.company.net [info] postfix/qmgr: 84FA12EC0085: from=<root@phase3sniffer.dc3.company.net>, size=1159, nrcpt=1 (queue active)
rewinded! stats.size=1263001128 @pos=1263001772 difference(@pos-size)=644
rewinded! stats.size=1270921071 @pos=1270921374 difference(@pos-size)=303
rewinded! stats.size=1274044805 @pos=1274045193 difference(@pos-size)=388
rewinded! stats.size=1276448317 @pos=1276449057 difference(@pos-size)=740
rewinded! stats.size=1276991206 @pos=1276992015 difference(@pos-size)=809
rewinded! stats.size=1277356959 @pos=1277357127 difference(@pos-size)=168
rewinded! stats.size=1289780552 @pos=1289781614 difference(@pos-size)=1062
  • My other theory was a race condition in which @buffer gets overwritten. Could it happen that readBlock() has not finished executing lines 11-22 when @internalDispatcher receives another next event from watch()? (A sketch of that scenario follows below.)
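
For illustration only, here is a minimal sketch of the kind of interleaving that theory describes. This is not node-tail's actual readBlock; the file path and the simplified read logic are assumptions. If a second pass starts before the first one finishes, both append to the same shared buffer and the chunks can interleave or overlap:

// Hypothetical sketch of the suspected race, NOT node-tail's real readBlock.
// Two overlapping read passes share `buffer` and `pos`; if the second pass
// starts before the first one finishes, their chunks interleave in `buffer`.
var fs = require('fs');

var filename = '/opt/syslog/archive.log'; // assumed path, for the example only
var buffer = '';
var pos = 0;

function readBlock(end) {
  var stream = fs.createReadStream(filename, { start: pos, end: end - 1, encoding: 'utf8' });
  stream.on('data', function(chunk) {
    buffer += chunk; // shared state: a concurrent readBlock appends here as well
  });
  stream.on('end', function() {
    pos = end;       // also shared: a concurrent readBlock may still see the old value
  });
}

// Two change notifications arriving close together start overlapping reads:
// the second call still sees pos = 0, so part of the file is read twice.
readBlock(1024);
readBlock(2048);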

More (possibly relevant) details:

  • Node 0.10.25-linux-64
  • Linux 2.6.18-92
  • ext3 filesystem

The simplest code to reproduce the problem:

var Tail = require('tail').Tail
var tail = new Tail("/opt/syslog/archive.log", "\n20")

tail.on('line', function(data) {
  // with the "\n20" separator, every chunk should begin with "14-..." (the year with its leading "20" stripped)
  if(data.substr(0,2) != '14') {
    console.log("Problem with chunk: " + data);
  }
})
tail.on('error', function(error) { throw error })

Please note that the first 'Problem' must be ignored, since node-tail starts reading from the beginning of the file, which is unlikely to start with \n20.

Thanks for reading this. I would appreciate any ideas or suggestions as to what else I might try.

thanks,
Marek

@lucagrulla
Owner

Hi Marek,

I am not sure what is causing the problem you are seeing.

From your description it looks like the top-level watchEvent gets confused and passes wrong start and end position offsets to the rest of the system.

fs.watch has some platform dependencies (see http://nodejs.org/api/fs.html#fs_caveats), but given you are using Linux it should work as expected (it doesn't work too well on Windows).

My first suggestion would be to rewrite your test script (the Ruby one) in Node to mimic the internals of tail. Basically, add an event listener to watch (https://github.com/lucagrulla/node-tail/blob/master/tail.coffee#L40) and, in the event handler (which could look a bit like https://github.com/lucagrulla/node-tail/blob/master/tail.coffee#L44), just print when the offsets look wrong (as you did in the Ruby script).

Ignore everything else: no tailing, no buffer, just keep track of @pos and compare it with what is passed from fs.watch; this should help us understand whether fs.watch is really feeding wrong information to the rest of the library under your specific load.

Let me know the outcomes of this experiment.

Thanks,
Luca

@skrobul
Author

skrobul commented Jul 14, 2014

Hi Luca,
Thanks for looking into this. Correct me if I'm wrong, but I was under the impression that fs.watch merely forwards the kernel notification that inodes on the underlying filesystem have changed, while the actual file size comes from fs.stat, correct? Anyway, following your advice, I've reimplemented my Ruby script in Node.

It looks as follows:

var fs = require('fs')
var filename = '/opt/syslog/archive.log'
prev = 0

check_if_rewound_fswatch = function() {
  fs.watch(filename, function(ev, fsname) {
    fs.stat(filename, function(err, stats) {
      if((stats.size - prev) < 0) {
        console.log("File possibly rewound. prev=" + prev + " current: " + stats.size)
        process.exit(1)
      }
      prev = stats.size
    });
  });
}
check_if_rewound_fswatch();

The above code produced an error after about 10 minutes of running:

$ ./bin/node myrevtest.js
File possibly rewound. prev=1378298051 current: 1378297333
$

That was a bit suspicious to me, so I wrote a version that exactly imitates what the Ruby script does: a file size check triggered by a timer rather than by a callback from fs.watch(). It looked more or less like this:

check_if_rewound_interval = function() {
  fs.stat(filename, function(err, stats){
    if((stats.size - prev) < 0) {
      console.log("File possibly rewound. prev=" + prev + " current: " + stats.size)
      process.exit(1)
    }
    prev = stats.size
    setTimeout(check_if_rewound_interval, 10)
  });
}
check_if_rewound_interval()

The snippet above has been running for the last 40 minutes and has not triggered the condition. Any ideas what may be causing this?
Could it be that a new fs.watch() callback gets started before the previous one has finished? I'm relatively new to Node.js and thought it was single-threaded, but I'm not sure if such a situation is possible...

@skrobul
Author

skrobul commented Jul 14, 2014

I was worried that the above could be a Node.js bug, or that I might simply be missing the 'event' while doing interval checks, so I decided to reimplement the whole thing in pure C (see code here) and call the kernel/inotify directly. So far it has not detected a single 'rewound' event... I don't know what to make of this yet.

@lucagrulla
Owner

Hi Marek,

I had another look at the problem, and the only concurrency issue I can see is an fs.stat initiated at T2 returning before an fs.stat initiated at T1; technically it's possible given that fs.stat is asynchronous, so it might be that the problem appears at high load while staying dormant otherwise.

To prove this, let's move the call from async to sync.

Take the same Node.js script you used to test and replace fs.stat with fs.statSync; the code has to change because statSync returns an fs.Stats object instead of accepting a callback.

Your code will look roughly like the following (I have not tried this code, so it might not work straight out of the box ;-)):

check_if_rewound_fswatch = function() {
  fs.watch(filename, function(ev, fsname) {
    // synchronous stat: results can no longer arrive out of order
    var stats = fs.statSync(filename);
    if ((stats.size - prev) < 0) {
      console.log("File possibly rewound. prev=" + prev + " current: " + stats.size);
      process.exit(1);
    }
    prev = stats.size;
  });
}

Let me know if this new version solves the problem; if you confirm the error is gone I'll fix the library and release a new version in the next few days.
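
To make the T1/T2 ordering issue above concrete, here is a small, self-contained simulation. It is only an illustration under assumed timings, not the library's code or real fs calls; fakeStat is a hypothetical stand-in for an asynchronous fs.stat whose result arrives after an arbitrary delay:

// Deterministic simulation of the suspected hazard (no real fs calls).
// Two "stat" results issued at T1 and T2 complete in the opposite order,
// so the shared `prev` runs ahead and the late, smaller size looks like a rewind.
var prev = 0;

function fakeStat(size, delayMs, label) {
  // stand-in for an async fs.stat whose callback fires after `delayMs`
  setTimeout(function() {
    if (size - prev < 0) {
      console.log(label + ": looks rewound. prev=" + prev + " current=" + size);
    }
    prev = size;
  }, delayMs);
}

fakeStat(1000, 50, 'stat issued at T1'); // file was 1000 bytes, but the result is slow
fakeStat(2000, 10, 'stat issued at T2'); // file has grown to 2000 bytes, result arrives first

// Prints: "stat issued at T1: looks rewound. prev=2000 current=1000"

Because fs.statSync returns inline inside the watch callback, and each callback runs to completion on Node's single-threaded event loop, the size checks cannot be reordered this way.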

@skrobul
Author

skrobul commented Jul 14, 2014

I've started both the sync and async versions side by side; the async one stopped after just a couple of minutes, while the sync one is still holding up, so it all looks promising. I will keep it running and give you an update in a few hours. Thanks for your help!

@skrobul
Author

skrobul commented Jul 15, 2014

Just a quick update - it seems to be working so far.

@lucagrulla
Owner

you mean statSync is working fine?

Thanks,
L.


@skrobul
Author

skrobul commented Jul 15, 2014

Yes, sorry - should have been more specific. The fs.statSync is the working one.

@lucagrulla
Owner

OK.

I'll need to change the library to statSync then; there's clearly a race condition that manifests itself very clearly at high throughput.

L.


@lucagrulla
Owner

Based on the spike testing last night, I released 0.3.6 with a fix that should resolve this async issue.

I'll close the issue for now; let me know if you experience it again.
