Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: sincedb_clean_after not being respected #276

Merged
merged 5 commits into from
Sep 25, 2020

Conversation

kares
Copy link
Contributor

@kares kares commented Sep 10, 2020

In certain cases, in read mode such as the one described in #250 sincedb is not updated.

There was no periodic check for sincedb updates that would cause the sincedb_clean_after entries to cleanup.
The cleanup relied on new files being discovered or new content being added to existing files -> causing sincedb updates.

The fix here is to periodically flush sincedb (from the watch loop).
Besides, to make the process more deterministic, there's a minor change to make sure the same "updated" timestamp is used to mark the last changed time.

resolves #250
expected to also resolve #260

for now the plugin relied on other files coming in or getting new
content - would cause the sincedb flush to eventually trigger (due
changes). however, if there isn't any activity going on we still need to
cleanup sincedb periodically to respect the clean_after setting...
@kares kares changed the title Sincedb clean Fix: sincedb_clean_after not being respected Sep 10, 2020
@kares
Copy link
Contributor Author

kares commented Sep 10, 2020

CI 🔴 is intermittent here's a 🟢 https://travis-ci.org/github/kares/logstash-input-file/builds/725910454

Copy link
Contributor

@colinsurprenant colinsurprenant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@yaauie yaauie assigned yaauie and unassigned elasticsearch-bot Sep 10, 2020
sleep(@settings.stat_interval)
# we need to check potential expired keys (sincedb_clean_after) periodically
sincedb_collection.flush_at_interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔

We already have a SincedbCollection#write_if_requested in this until quit? loop, which conditionally sends SincedbCollection#flush_at_interval IFF a write has been requested. Should we simply change this to send SincedbCollection#flush_at_interval instead, or is there a valid reason to sometimes flush twice in a loop?

Also, this isn't new to your PR, but it looks like we are passing an instance of SincedbCollection around a fair bit, with this flush_at_interval and other state-mutating methods being invoked by potentially multiple threads simultaneously 😩. We should probably file a separate issue to audit the thread-safety.

Copy link
Contributor Author

@kares kares Sep 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as the described in the bug report and in the description the problem is when no writes are requested (no new discovered files or new content in watched files) write_if_requested is simply a noop and thus sincedb_clean_after won't get triggered.

the plugin should guarantee a cleanup of expired sincedb entried regardless of whether there's actual new content. currently cleanup is only guaranteed to happen on shutdown.

have decided to add an explicit flush_at_interval along side while keeping the existing write_if_requested, since they accomplish different things we want to happen in a loop a) write changes if any b) guarantee periodic sincedb flushes. the actual write won't happen twice in the loop since flushing only happens if sincedb_write_interval time has passed.

the whole thing is ripe for deeper refactoring but unfortunately I do not have the cycles to get into that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaauie is your preference still (given ^^^) :

I think that the change to FileWatch::Watch#subscribe can be simplified a bit by jumping straight to invoking SincedbCollection#flush_at_interval exactly once per loop instead of invoking it once-or-twice.

do not have strong opinions for either way given the current state of code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, that the actual flush is safeguarded to not flush more often than the interval, so it doesn't matter much if we attempt it once or twice in the loop. I'll update my review to an LGTM.

Copy link
Contributor

@yaauie yaauie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the reorganization and the as_of change to make sure we mark the time-of-action correctly in cases where the clock advances while we are preparing to flush.

I think that the change to FileWatch::Watch#subscribe can be simplified a bit by jumping straight to invoking SincedbCollection#flush_at_interval exactly once per loop instead of invoking it once-or-twice.

Copy link
Contributor

@yaauie yaauie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants