
Multiple Files Being Left Unprocessed with Identical Timestamps #57

Closed
davidski opened this issue Oct 5, 2015 · 8 comments

davidski commented Oct 5, 2015

I've had a long-standing issue with LS 1.5.x leaving files pending in my S3 input bucket, never to be read. While tracing this down, this test (https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L408) looks suspect to me. I process S3 files in batches and have multiple files present on startup with identical last_modified timestamps (second-level precision).

It looks like the newer function, because it uses a strict greater-than comparison, will process the first file it finds with the earliest timestamp but skip over any subsequent files that share that timestamp.

For example:

Timestamp File
2015-01-01 08:00:00Z file1.gz
2015-01-01 08:00:00Z file2.gz
2015-01-01 08:00:00Z file3.gz
2015-01-01 08:00:01Z file4.gz
2015-01-01 08:00:02Z file5.gz
2015-01-01 08:00:03Z file6.gz

If I'm understanding the code correctly, file1.gz will get parsed, but file2.gz and file3.gz will fail the > test and be skipped, leaving them hanging around after the batch run completes.

Is my interpretation correct, and could this safely be changed to >=? I'm using delete after read, so re-reading the same file isn't a concern for me.
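
For illustration, a minimal sketch of the comparison being described (not the plugin's actual code; names here are stand-ins):

```ruby
require 'time'

# Hypothetical stand-in for the sincedb "newer" test being discussed:
# a strict greater-than comparison against the stored timestamp.
def newer?(last_modified, sincedb_time)
  last_modified > sincedb_time
end

sincedb_time = Time.parse("2015-01-01 08:00:00Z") # written after file1.gz is processed

newer?(Time.parse("2015-01-01 08:00:00Z"), sincedb_time) # => false, file2.gz and file3.gz are skipped
newer?(Time.parse("2015-01-01 08:00:01Z"), sincedb_time) # => true,  file4.gz is processed
```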


davidski commented Oct 6, 2015

Did a bit more thinking on this one. Changing to >= is a problem for folks who aren't using delete mode, since files with the same timestamp could be read over and over. A better solution would be to extend the sincedb key from just the last_modified timestamp to a combination of last_modified and the filename.
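
A rough sketch of that idea (hypothetical helper, not the plugin's API): key the sincedb entry on last_modified plus the object key, so two objects sharing a timestamp no longer collide.

```ruby
require 'time'

# Hypothetical composite sincedb key: last_modified plus the object key,
# instead of the timestamp alone.
def sincedb_key(last_modified, object_key)
  "#{last_modified.utc.iso8601}/#{object_key}"
end

sincedb_key(Time.parse("2015-01-01 08:00:00Z"), "file1.gz")
# => "2015-01-01T08:00:00Z/file1.gz"
sincedb_key(Time.parse("2015-01-01 08:00:00Z"), "file2.gz")
# => "2015-01-01T08:00:00Z/file2.gz"  -- no longer indistinguishable from file1.gz
```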

@kjwierenga

I have this same issue as well. The problem doesn't occur when the files with the same timestamp are processed in the same run of process_files. It only happens when, say, file1.gz and file2.gz are processed and the sincedb timestamp is set to 2015-01-01 08:00:00Z in one call to process_files, and then, interval seconds later, file3.gz is considered in the next call to process_files. In this case file3.gz is skipped because its timestamp is not newer than (>) the sincedb timestamp.

Only by keeping the full list of already-processed files that share that last_modified timestamp can the next run of process_files compare against the list and skip those files (see the sketch after the example below).

Example

  • 1st call process_files
Timestamp Log Object Key
2016-05-12 14:26:00 2016-05-12-12-25-12-D68E3F15AAE3DC9A
2016-05-12 14:26:00 2016-05-12-12-25-15-58C0CBDF807CD34B
2016-05-12 14:26:00 2016-05-12-12-25-26-5FC01F0ED4E5A430

Keep the list (in sincedb?) of already processed files; in this case all three files:

2016-05-12 14:26:00 2016-05-12-12-25-12-D68E3F15AAE3DC9A,2016-05-12-12-25-15-58C0CBDF807CD34B,2016-05-12-12-25-26-5FC01F0ED4E5A430

  • 2nd call process_files (interval seconds later)

Find files >= 2016-05-12 14:26:00 and skip if name matches a name on the list of already processed files:

Timestamp Log Object Key Skip/Process
2016-05-12 14:26:00 2016-05-12-12-25-12-D68E3F15AAE3DC9A skip
2016-05-12 14:26:00 2016-05-12-12-25-15-58C0CBDF807CD34B skip
2016-05-12 14:26:00 2016-05-12-12-25-26-5FC01F0ED4E5A430 skip
2016-05-12 14:26:00 2016-05-12-12-25-28-FFC2D6CB851FA917 process
2016-05-12 14:26:00 2016-05-12-12-25-31-213B56F74A224293 process
2016-05-12 14:26:00 2016-05-12-12-25-43-BAA23FC4D63F463B process
2016-05-12 14:26:00 2016-05-12-12-25-59-327A55EDBF5CCDF2 process

New list: all 7 files with timestamp 2016-05-12 14:26:00.

This is an acceptable solution because the list of filenames to save and compare against is small in practice.
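
A rough sketch of the approach described above (hypothetical structure, not the plugin's code): store the newest timestamp together with the keys already processed at exactly that timestamp, and consult both on the next run.

```ruby
require 'time'

# Hypothetical sincedb state: the newest timestamp seen so far plus the keys
# already processed at exactly that timestamp.
sincedb = {
  time: Time.parse("2016-05-12 14:26:00Z"),
  keys: %w[
    2016-05-12-12-25-12-D68E3F15AAE3DC9A
    2016-05-12-12-25-15-58C0CBDF807CD34B
    2016-05-12-12-25-26-5FC01F0ED4E5A430
  ],
}

def process?(sincedb, last_modified, key)
  return true  if last_modified > sincedb[:time]
  return false if last_modified < sincedb[:time]
  !sincedb[:keys].include?(key) # same timestamp: skip only the keys already seen
end

process?(sincedb, Time.parse("2016-05-12 14:26:00Z"), "2016-05-12-12-25-12-D68E3F15AAE3DC9A") # => false (skip)
process?(sincedb, Time.parse("2016-05-12 14:26:00Z"), "2016-05-12-12-25-28-FFC2D6CB851FA917") # => true  (process)
```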

ph added the P2 label May 12, 2016

webmstr commented Jun 1, 2016

Dumb workaround if you're using 'delete': set the registry contents to a really old date once in a while.
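
In practice that reset could be as simple as overwriting the sincedb (registry) file with an old timestamp. A sketch only; the path here is hypothetical and the exact file format depends on your setup:

```ruby
# Sketch of the workaround: periodically overwrite the sincedb file with a
# very old timestamp so every remaining object looks "newer" again.
# The path is a placeholder; use the sincedb_path configured for the input.
sincedb_path = "/var/lib/logstash/plugins/inputs/s3/sincedb"
File.write(sincedb_path, Time.utc(2000, 1, 1).to_s)
```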


kjwierenga commented Jun 1, 2016

Yes, that would work even if you're using 'backup'. But it defeats the purpose of automation.

@tiwilliam

Had the same issue; I realized I could set sincedb_path to /dev/null to ignore mtime entirely. This solves it, at least if you have delete enabled.
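
For reference, that workaround looks roughly like this in the pipeline config (the bucket name is a placeholder):

```
input {
  s3 {
    bucket       => "my-bucket"    # placeholder
    delete       => true           # objects are removed after reading
    sincedb_path => "/dev/null"    # never persist a last_modified cutoff
  }
}
```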

@lsutcliffe-dbg

I've seen this issue as well. The way I understand it, list_new_files can run at e.g. 08:00:00.04, process a file modified at 08:00:00.03, and write 08:00:00 to the sincedb. If a file is then added to the S3 bucket with a modified time of, for example, 08:00:00.07, it will not have been picked up by the first run of list_new_files and will be reported as "Object Not Modified" on the second run.

I've submitted a PR that defines a cut-off time 2 seconds in the past, so anything modified within 2 seconds of list_new_files running is ignored.

It's probably not the most elegant of solutions but it's worked for us.
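
A rough sketch of that cut-off idea (hypothetical helper, not the PR itself): ignore anything modified within the last couple of seconds, so objects landing mid-listing are picked up on the next run instead of falling behind the sincedb timestamp.

```ruby
CUTOFF_SECONDS = 2

# Sketch: treat only objects older than a short cut-off as listable, so an
# object written while list_new_files is running is deferred to the next run
# rather than silently skipped once the sincedb timestamp has moved past it.
def listable?(last_modified, now = Time.now)
  last_modified <= (now - CUTOFF_SECONDS)
end
```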

@SpencerLN

We are seeing similar issues when processing CloudTrail logs that are sent to a centralized bucket from multiple accounts in the format s3://bucket/AWSLogs/ACCOUNTID/YEAR/MONTH/DAY/. We occasionally see files from different accounts landing in the bucket with the same last-modified date, and when those files are not processed in the same Logstash iteration they end up being skipped.

We are using delete mode to remove files after processing, so we do not actually need the sincedb timestamp. To work around the issue we ended up adding a new setting (sincedb_disabled) to the input that, when set to true, skips the "sincedb.newer" check.
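
Roughly, the change amounts to making the newer check a no-op when the option is set (sketch of the idea only; the option name is the one described above, and the method name here is hypothetical):

```ruby
# Sketch: with a sincedb_disabled option, the "is this object newer than the
# sincedb timestamp?" test always passes, and delete mode alone ensures each
# object is read only once before it is removed from the bucket.
def process_object?(last_modified, sincedb_time, sincedb_disabled)
  sincedb_disabled || last_modified > sincedb_time
end
```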

@kaisecheng

Thank you for reporting the issue. The new release, v3.6.0, fixes it.
