Corrupted data when CPU load is high #22
Comments
Hi Marek, I am not sure what is causing the problem you are seeing. From your description, it looks like the top-level watch event gets confused and reports wrong start and end position offsets to the rest of the system. fs.watch has some platform dependencies (see http://nodejs.org/api/fs.html#fs_caveats), but given that you are using Linux it should work as expected (it doesn't work too well on Windows). My first suggestion would be to rewrite your test script (the Ruby one) in Node to mimic the internals of tail. Mostly, add a watch event (https://github.com/lucagrulla/node-tail/blob/master/tail.coffee#L40) and, in the event handler (which could look a bit like https://github.com/lucagrulla/node-tail/blob/master/tail.coffee#L44), just log when the offsets look wrong (as you did in the Ruby script). Ignore everything else: no tailing, no buffer, just keep track of @pos and compare it with what is passed from fs.watch; this should help us understand whether fs.watch is really feeding wrong information to the rest of the library under your specific load. Let me know the outcome of this experiment. Thanks, |
Hi Luca, it looks as follows:
var fs = require('fs')
var filename = '/opt/syslog/archive.log'
prev = 0
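// A watch-driven check: on every fs.watch event, stat the file and flag
// any size decrease relative to the last size seen (a possible 'rewind').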
check_if_rewound_fswatch = function() {
fs.watch(filename, function(ev, fsname) {
fs.stat(filename, function(err, stats) {
if((stats.size - prev) < 0) {
console.log("File possibly rewound. prev=" + prev + " current: " + stats.size)
process.exit(1)
}
prev = stats.size
});
});
}
check_if_rewound_fswatch();
The code above produced an error after about 10 minutes of running:
That was a bit suspicious to me, so I wrote a version that exactly imitates what the Ruby script does - a file size check triggered by a timer rather than by a callback from fs.watch:
check_if_rewound_interval = function() {
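// Poll the size on a 10 ms timer instead of reacting to fs.watch events,
// mirroring what the Ruby script does.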
fs.stat(filename, function(err, stats){
if((stats.size - prev) < 0) {
console.log("File possibly rewound. prev=" + prev + " current: " + stats.size)
process.exit(1)
}
prev = stats.size
setTimeout(check_if_rewound_interval, 10)
});
}
check_if_rewound_interval()
The snippet above has been running for the last 40 minutes and has not triggered the condition. Any ideas what may be causing this? |
I was worried that the above could be a Node.js bug, or that I might simply be missing the 'event' while doing interval checks, so I decided to reimplement the whole thing in pure C (see code here) and call the kernel/inotify directly. So far it has not detected a single 'rewound' event... I don't know what to make of this yet |
Hi Marek, I had another look at the problem, and the only concurrency issue I can see is a fs.stat initiated at T2 returning before a fs.stat initiated at T1; technically that's possible given that fs.stat is asynchronous, so it might be that the problem surfaces under high load while staying dormant otherwise. To prove this, let's move the call from async to sync: take the same Node.js script you used to test and replace fs.stat with fs.statSync. The code has to change because statSync returns a fs.Stats object instead of accepting a callback. Your code will roughly look like the following (I have not tried this code, so it might not work straight out of the box ;-)):
check_if_rewound_fswatch = function() {
fs.watch(filename, function(ev, fsname) {
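// Using statSync (next line) makes the check synchronous: the handler runs
// to completion, so two watch events can no longer interleave their checks.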
var stats = fs.statSync(filename);
if((stats.size - prev) < 0) {
console.log("File possibly rewound. prev=" + prev + " current: " + stats.size)
process.exit(1)
}
prev = stats.size
});
}
Let me know if this new version solves the problem; if you confirm the error is gone, I'll fix the library and release a new version in the next few days. |
I've started both the sync and async versions side by side, and the async one stopped after just a couple of minutes; the sync one is still holding up, so it all looks promising. I will keep it running and give you an update in a few hours. Thanks for your help! |
Just a quick update - it seems to be working so far. |
You mean statSync is working fine? Thanks, |
Yes, sorry - I should have been more specific. The statSync version is the one that's still working fine. |
OK, I'll need to change the library to statSync then; there's clearly a race condition. L. |
Based on the spike testing last night, I released 0.3.6 with the fix that should resolve this async issue. I'll close the issue for now; let me know if you experience it again. |
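For the record, the essence of that fix - moving the size check from fs.stat to fs.statSync inside the watch handler - can be sketched as follows. This is an illustrative sketch only, not the actual 0.3.6 change; the makeWatcher helper and its callback are hypothetical:
var fs = require('fs');

// Hypothetical helper illustrating the statSync pattern; not node-tail's code.
function makeWatcher(filename, onGrow) {
  var pos = 0;
  return fs.watch(filename, function () {
    var stats;
    try {
      stats = fs.statSync(filename); // synchronous: size checks run strictly in event order
    } catch (e) {
      return; // the file may have been rotated away between the event and the stat
    }
    if (stats.size < pos) pos = 0;   // a genuine truncation: start over from the top
    if (stats.size > pos) {
      onGrow(pos, stats.size);       // caller reads the bytes in [pos, stats.size)
      pos = stats.size;
    }
  });
}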
Hi
I am using node-tail on a server with relatively high load. My app basically monitors a syslog text file that is written to at a relatively high rate (about 25-35 kB/s). The file looks more or less like this:
The end goal is to 'split' the stream into an array of syslog messages. The logical step would be to use \n as a separator, but that does not work in this particular case because some of the messages are multi-line. I ended up writing code that looks like this:
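(The snippet itself did not survive in this copy of the thread. As a hypothetical reconstruction of the approach - assuming messages are split wherever a newline is followed by a line starting with '20', and that process_further is the consumer mentioned below - it could look roughly like this:)
var buffer = '';

// Accumulate incoming chunks and cut out one message every time a new line
// begins with "20" (the start of a timestamp); keep the incomplete tail buffered.
function onChunk(chunk) {
  buffer += chunk;
  var parts = buffer.split(/\n(?=20)/);
  buffer = parts.pop();
  parts.forEach(process_further);
}

function process_further(message) {
  console.log('syslog message: ' + message);
}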
The results are as expected the vast majority of the time - I get the exact message that's supposed to be extracted. Unfortunately, once every couple of minutes a message gets corrupted in a funny way.
process_further() for some reason gets called with junk data at the beginning of the string, for example:
or
Please bear in mind that it happens only once every couple of thousand lines.
I have been troubleshooting this for the last couple of days and I am not able to find the culprit. So far I have checked the following things:
\n aka 0x0A.
and to my surprise it was fired a couple of times:
I thought that was not possible, but wanted to verify anyway, so I wrote a short Ruby script to monitor the file size returned by the system and produce an alert when it is rewound:
That script has been running for the last couple of hours and it never fired, so the file is not being rewound, yet for some reason Node thinks it is.
Still, it does not look like it correlates with the time when messages are corrupted:
@buffer gets overwritten. Would it be possible that readBlock() did not finish executing lines 11-22, but @internalDispatcher received another next event from watch()?
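(As a hypothetical illustration - not part of the original report - of why that interleaving is plausible: two asynchronous fs.stat calls started in order are not guaranteed to complete in order.)
var fs = require('fs');
var filename = '/tmp/test.log'; // hypothetical path
var completed = 0;

function statWithTag(tag) {
  fs.stat(filename, function (err, stats) {
    completed += 1;
    // Under load, the call tagged 1 can complete after the call tagged 2;
    // any shared state updated here (like @pos) is then written out of order.
    console.log('started as #' + tag + ', completed as #' + completed);
  });
}

statWithTag(1);
statWithTag(2);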
More (possibly relevant) details:
Simplest code to reproduce the problem:
Please note that the first 'Problem' must be ignored, since node-tail will start reading from the beginning of the file, which is unlikely to start with \n20.
Thanks for reading this. I would appreciate any ideas or suggestions as to what else I might try.
thanks,
Marek