
Multiple Files Being Left Unprocessed with Identical Timestamps #57

Closed
davidski opened this issue Oct 5, 2015 · 8 comments

davidski commented Oct 5, 2015

I've had a long-standing issue with LS 1.5.x leaving files pending in my S3 input bucket, never to be read. While tracing this down, this test (https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L408) looks suspect to me. I process S3 files in batches and have multiple files present on startup with identical last_modified timestamps (second-level precision).

It looks like the newer function, because it uses a strict greater-than comparison, will process the first file it finds with the earliest timestamp but skip over any subsequent files that share that timestamp.

For example:

Timestamp File
2015-01-01 08:00:00Z file1.gz
2015-01-01 08:00:00Z file2.gz
2015-01-01 08:00:00Z file3.gz
2015-01-01 08:00:01Z file4.gz
2015-01-01 08:00:02Z file5.gz
2015-01-01 08:00:03Z file6.gz

If I'm understanding the code correctly, file1.gz will get parsed, but file2.gz and file3.gz will fail the > test and be skipped, leaving them hanging around after the batch run completes.

Is my interpretation correct, and could this safely be changed to >=? I'm using delete after read, so re-reading the same file isn't a concern for me.
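
For illustration, a minimal sketch of the comparison being described (not the plugin's actual code; names here are stand-ins):

```ruby
require 'time'

# Hypothetical stand-in for the sincedb "newer" test being discussed:
# a strict greater-than comparison against the stored timestamp.
def newer?(last_modified, sincedb_time)
  last_modified > sincedb_time
end

sincedb_time = Time.parse("2015-01-01 08:00:00Z") # written after file1.gz is processed

newer?(Time.parse("2015-01-01 08:00:00Z"), sincedb_time) # => false, file2.gz and file3.gz are skipped
newer?(Time.parse("2015-01-01 08:00:01Z"), sincedb_time) # => true,  file4.gz is processed
```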


davidski commented Oct 6, 2015

Did a bit more thinking on this one. Changing to >= is a problem for folks who aren't using delete mode, since files with the same timestamp could be read over and over. A better solution would be to extend the sincedb key from just the last_modified timestamp to a combination of last_modified and the filename.
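
A rough sketch of that idea (hypothetical helper, not the plugin's API): key the sincedb entry on last_modified plus the object key, so two objects sharing a timestamp no longer collide.

```ruby
require 'time'

# Hypothetical composite sincedb key: last_modified plus the object key,
# instead of the timestamp alone.
def sincedb_key(last_modified, object_key)
  "#{last_modified.utc.iso8601}/#{object_key}"
end

sincedb_key(Time.parse("2015-01-01 08:00:00Z"), "file1.gz")
# => "2015-01-01T08:00:00Z/file1.gz"
sincedb_key(Time.parse("2015-01-01 08:00:00Z"), "file2.gz")
# => "2015-01-01T08:00:00Z/file2.gz"  -- no longer indistinguishable from file1.gz
```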

@kjwierenga

I have this same issue as well. The problem doesn't occur when the files with the same timestamp are processed in the same run of process_files. It only happens when, say, file1.gz and file2.gz are processed and the sincedb timestamp is set to 2015-01-01 08:00:00Z in one call to process_files, and then, interval seconds later, file3.gz is considered in the next call to process_files. In this case file3.gz is skipped because its timestamp is not newer than (>) the sincedb timestamp.

Only by keeping the full list of already-processed files that share that last_modified timestamp can the next run of process_files compare against the list and skip those files (see the sketch after the example below).

Example

  • 1st call process_files
Timestamp Log Object Key
2016-05-12 14:26:00 2016-05-12-12-25-12-D68E3F15AAE3DC9A
2016-05-12 14:26:00 2016-05-12-12-25-15-58C0CBDF807CD34B
2016-05-12 14:26:00 2016-05-12-12-25-26-5FC01F0ED4E5A430

Keep the list (in sincedb?) of already processed files; in this case all three files:

2016-05-12 14:26:00 2016-05-12-12-25-12-D68E3F15AAE3DC9A,2016-05-12-12-25-15-58C0CBDF807CD34B,2016-05-12-12-25-26-5FC01F0ED4E5A430

  • 2nd call process_files (interval seconds later)

Find files >= 2016-05-12 14:26:00 and skip if name matches a name on the list of already processed files:

Timestamp Log Object Key Skip/Process
2016-05-12 14:26:00 2016-05-12-12-25-12-D68E3F15AAE3DC9A skip
2016-05-12 14:26:00 2016-05-12-12-25-15-58C0CBDF807CD34B skip
2016-05-12 14:26:00 2016-05-12-12-25-26-5FC01F0ED4E5A430 skip
2016-05-12 14:26:00 2016-05-12-12-25-28-FFC2D6CB851FA917 process
2016-05-12 14:26:00 2016-05-12-12-25-31-213B56F74A224293 process
2016-05-12 14:26:00 2016-05-12-12-25-43-BAA23FC4D63F463B process
2016-05-12 14:26:00 2016-05-12-12-25-59-327A55EDBF5CCDF2 process

New list: all 7 files with timestamp 2016-05-12 14:26:00.

This is an acceptable solution because the list of filenames to save and compare against is small in practice.
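
A rough sketch of the approach described above (hypothetical structure, not the plugin's code): store the newest timestamp together with the keys already processed at exactly that timestamp, and consult both on the next run.

```ruby
require 'time'

# Hypothetical sincedb state: the newest timestamp seen so far plus the keys
# already processed at exactly that timestamp.
sincedb = {
  time: Time.parse("2016-05-12 14:26:00Z"),
  keys: %w[
    2016-05-12-12-25-12-D68E3F15AAE3DC9A
    2016-05-12-12-25-15-58C0CBDF807CD34B
    2016-05-12-12-25-26-5FC01F0ED4E5A430
  ],
}

def process?(sincedb, last_modified, key)
  return true  if last_modified > sincedb[:time]
  return false if last_modified < sincedb[:time]
  !sincedb[:keys].include?(key) # same timestamp: skip only the keys already seen
end

process?(sincedb, Time.parse("2016-05-12 14:26:00Z"), "2016-05-12-12-25-12-D68E3F15AAE3DC9A") # => false (skip)
process?(sincedb, Time.parse("2016-05-12 14:26:00Z"), "2016-05-12-12-25-28-FFC2D6CB851FA917") # => true  (process)
```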

ph added the P2 label May 12, 2016

webmstr commented Jun 1, 2016

Dumb workaround if you're using 'delete': set the registry contents to a really old date once in a while.
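
In practice that reset could be as simple as overwriting the sincedb (registry) file with an old timestamp. A sketch only; the path here is hypothetical and the exact file format depends on your setup:

```ruby
# Sketch of the workaround: periodically overwrite the sincedb file with a
# very old timestamp so every remaining object looks "newer" again.
# The path is a placeholder; use the sincedb_path configured for the input.
sincedb_path = "/var/lib/logstash/plugins/inputs/s3/sincedb"
File.write(sincedb_path, Time.utc(2000, 1, 1).to_s)
```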


kjwierenga commented Jun 1, 2016

Yes, that would work even if you're using 'backup'. But it defeats the purpose of automation.

@tiwilliam

Had the same issue; I realized I could set sincedb_path to /dev/null to ignore mtime entirely. This solves it, at least if you have delete enabled.
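
For reference, that workaround looks roughly like this in the pipeline config (the bucket name is a placeholder):

```
input {
  s3 {
    bucket       => "my-bucket"    # placeholder
    delete       => true           # objects are removed after reading
    sincedb_path => "/dev/null"    # never persist a last_modified cutoff
  }
}
```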

@lsutcliffe-dbg

I've seen this issue as well. The way I understand it, list_new_files can run at e.g. 08:00:00.04, process a file modified at 08:00:00.03, and write 08:00:00 to the sincedb. If a file is then added to the S3 bucket with a modified time of, for example, 08:00:00.07, it will not have been picked up by the first run of list_new_files and will be reported as "Object Not Modified" on the second run.

I've submitted a PR that defines a cut-off time 2 seconds in the past, so anything modified within 2 seconds of list_new_files running is ignored.

It's probably not the most elegant of solutions but it's worked for us.
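
A rough sketch of that cut-off idea (hypothetical helper, not the PR itself): ignore anything modified within the last couple of seconds, so objects landing mid-listing are picked up on the next run instead of falling behind the sincedb timestamp.

```ruby
CUTOFF_SECONDS = 2

# Sketch: treat only objects older than a short cut-off as listable, so an
# object written while list_new_files is running is deferred to the next run
# rather than silently skipped once the sincedb timestamp has moved past it.
def listable?(last_modified, now = Time.now)
  last_modified <= (now - CUTOFF_SECONDS)
end
```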

@SpencerLN

We are seeing similar issues when processing CloudTrail logs that are sent to a centralized bucket from multiple accounts in the format s3://bucket/AWSLogs/ACCOUNTID/YEAR/MONTH/DAY/. We occasionally see files from different accounts landing in the bucket with the same last-modified date, and when those files are not processed in the same Logstash iteration they end up being skipped.

We are using delete mode to remove files after processing, so we do not actually need the sincedb timestamp. To work around the issue we ended up adding a new setting (sincedb_disabled) to the input that, when set to true, skips the "sincedb.newer" check.
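
Roughly, the change amounts to making the newer check a no-op when the option is set (sketch of the idea only; the option name is the one described above, and the method name here is hypothetical):

```ruby
# Sketch: with a sincedb_disabled option, the "is this object newer than the
# sincedb timestamp?" test always passes, and delete mode alone ensures each
# object is read only once before it is removed from the bucket.
def process_object?(last_modified, sincedb_time, sincedb_disabled)
  sincedb_disabled || last_modified > sincedb_time
end
```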

@kaisecheng

Thank you for reporting the issue. The new release, v3.6.0, fixes it.
