Multiple Files Being Left Unprocessed with Identical Timestamps #57
Comments
Did a bit more thinking on this one. Changing to …
I have this same issue as well. The problem doesn't occur when the files with the same timestamp are processed in the same run of the listing; only by keeping the full list of already processed files that share the latest `last_modified` timestamp can the next run decide which files are actually new. Example:

- Keep the list (in the sincedb?) of already processed files; in this case all three files.
- Find files whose `last_modified` timestamp is greater than or equal to the saved one.
- New list: all 7 files with that timestamp; only the ones not in the saved list still need processing.

This is an acceptable solution because the list of filenames to save and compare against is small in practice.
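A minimal sketch of that bookkeeping in Ruby (an illustration only, not the plugin's actual code; the `ProcessedAtLatest` class and the `{ key:, last_modified: }` hashes are made up for the example):

```ruby
# Remember every key already processed at the newest last_modified timestamp,
# so a later run can tell genuinely new objects apart from ones that merely
# share that timestamp.
class ProcessedAtLatest
  def initialize
    @latest = Time.at(0)  # newest last_modified seen so far
    @keys   = []          # keys already processed at @latest
  end

  # objects: array of { key: String, last_modified: Time } hashes,
  # e.g. built from a bucket listing.
  def select_new(objects)
    objects.select do |obj|
      obj[:last_modified] > @latest ||
        (obj[:last_modified] == @latest && !@keys.include?(obj[:key]))
    end
  end

  def record(obj)
    if obj[:last_modified] > @latest
      @latest = obj[:last_modified]
      @keys   = [obj[:key]]
    elsif obj[:last_modified] == @latest
      @keys << obj[:key]
    end
  end
end
```

On the second run in the example above, the three saved keys are filtered out and only the four new objects sharing the same timestamp are returned.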
Dumb workaround if you're using 'delete': set the registry contents to a really old date once in a while.
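For what it's worth, a one-liner along those lines, assuming the "registry" here means the plugin's sincedb file and that it holds a single timestamp string (both the path and the format are assumptions for the example):

```ruby
# Push the stored timestamp far into the past so nothing is treated as
# already seen on the next run. The path is hypothetical.
File.write('/var/lib/logstash/plugins/inputs/s3/sincedb_example', Time.at(0).to_s)
```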
Yes, that would work even if you're using 'backup'. But it defeats the purpose of automation.
Had the same issue, realized I could set …
I've seen this issue as well; the way I understand it is that `list_new_files` can run at e.g. 08:00:00.04, processing a file modified at 08:00:00.03, and 08:00:00 is written to the sincedb. If a file is then added to the S3 bucket with a modified time of 08:00:00.07, for example, it will not have been picked up by the first run of `list_new_files`, and will be detected as "Object Not Modified" on the second run. I've submitted a PR that defines a cut-off time 2 seconds in the past, so anything modified within 2 seconds of `list_new_files` running will be ignored. It's probably not the most elegant of solutions, but it has worked for us.
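A rough illustration of that cut-off idea (paraphrased, not the PR's actual code; the object hashes and the helper name are made up for the example):

```ruby
# Ignore anything modified within the last CUTOFF_SECONDS of the listing run,
# so objects still being written "right now" are picked up on a later pass
# instead of being skipped once the sincedb has moved past them.
CUTOFF_SECONDS = 2

def eligible_objects(objects, sincedb_time, now = Time.now)
  cutoff = now - CUTOFF_SECONDS
  objects.select do |obj|
    obj[:last_modified] > sincedb_time && obj[:last_modified] <= cutoff
  end
end
```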
We are seeing similar issues when processing CloudTrail logs that are being sent to a centralized bucket from multiple accounts (one key prefix per account). We are using the delete mode to remove the files after processing, so we do not actually need to worry about the sincedb timestamp. To work around the issue we ended up adding a new setting (`sincedb_disabled`) to the input that, when set to true, skips the sincedb timestamp comparison entirely.
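A sketch of what that workaround amounts to (hypothetical code, not the fork's actual patch; `sincedb_disabled` is the custom setting mentioned above, not a stock plugin option):

```ruby
# With sincedb_disabled set, skip the last_modified comparison entirely and
# treat every listed object as new; deleting objects after processing is what
# prevents them from being read twice.
def newer?(last_modified, sincedb_time, sincedb_disabled)
  return true if sincedb_disabled
  last_modified > sincedb_time
end
```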
Thank you for reporting the issue. The new release v3.6.0 fixed it.
I've had a long-standing issue with LS 1.5.x leaving files pending in my S3 input bucket and not being read. In tracing this down, this test (https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L408) looks suspect to me. I am processing S3 files on a batch basis and have multiple files present on start-up with identical `last_modified` timestamps (second-level precision). It looks like the `newer` function, using a greater-than comparison, will process the first file it finds with the earliest timestamp, but skip over any subsequent ones.

For example, given `file1.gz`, `file2.gz` and `file3.gz` all sharing the same `last_modified` timestamp: if I'm understanding the code, `file1.gz` will get parsed, but `file2.gz` and `file3.gz` will fail the `>` test and will be skipped, leaving them hanging around after the conclusion of the batch run. Is my interpretation correct, and could this be changed safely to `>=`? I'm using delete after read, so the concern of re-reading the same file isn't a problem for me.
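For reference, the comparison in question boils down to something like this (a paraphrase of that check, not the plugin source verbatim):

```ruby
# Current behaviour: a strict greater-than skips every object whose
# last_modified equals the stored sincedb time, which is what strands
# file2.gz and file3.gz in the example above.
def newer(last_modified, sincedb_time)
  last_modified > sincedb_time
  # The proposed change would relax this to:
  #   last_modified >= sincedb_time
  # at the cost of possibly re-reading ties, which is harmless when the
  # input deletes objects after processing.
end
```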