Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set counter consistently so zero always indicates all records #7801

Merged
merged 7 commits into from
Feb 24, 2023

Conversation

bgirardeau-figma
Copy link
Contributor

This change sets the counter to zero only when all records are being logged as part of an initial run or new epoch.

Previously in some cases, the counter would be zero for a differential log (specifically events tables or if the query changed). For these cases, the counter is now reset to 1 instead. This change also does a small refactor to ensure the epoch in the database is updated for events tables, so that it will be set accurately if the same query is later changed to run over a non-evented table.

More context and motivation for this change is in #7799

@bgirardeau-figma bgirardeau-figma requested review from a team as code owners October 27, 2022 00:47
@mike-myers-tob mike-myers-tob added ready for review Pull requests that are ready to be reviewed by a maintainer core labels Nov 9, 2022
Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the intent here.

One ask -- can something about the meaning of 0 be added to the docs/wiki?

@bgirardeau-figma
Copy link
Contributor Author

One ask -- can something about the meaning of 0 be added to the docs/wiki?

Yeah! There is documentation about counter, so I updated it to be a little more explicit about when it starts at 0 vs 1.

Comment on lines 45 to 49
uint64_t Query::getQueryCounter(bool all_records, bool new_query) const {
uint64_t counter = 0;
if (new_query) {
if (all_records) {
return counter;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can roll with this, but I think it's a bit confusing.

For events, this calls: incrementCounter(false, new_epoch || new_query, counter);
but for normal queries: incrementCounter(new_epoch, new_query, counter);

And that distinction feels weird. I don't really understand why it's not something like:

uint64_t Query::getQueryCounter(bool start_at_zero, bool new_query) const {
  if (new_query) {
    if (all_records) {
      return 0;
    }
    return 1;
}
    uint64_t counter = 0;

which would allow a more consistent call semantics.

But I might be missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I found making a clear distinction between if we are returning all records vs something about the query changed was the least confusing option for me, but I agree it's tricky.

The issue is that events and normal queries are different in that events never have a snapshot/all records available, so they never start at 0 and always reset to 1, while normal queries have a distinction between epoch change or just query SQL changing when they are resetting the counter (to 0 vs 1).

So normal queries would still need to change start_at_zero based on new_epoch. ie: incrementCounter(new_epoch, new_query || new_epoch, counter), which means it doesn't really change the logic in a meaningful way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I found making a clear distinction between if we are returning all records vs something about the query changed was the least confusing option for me, but I agree it's tricky.

Yeah -- I wasn't suggesting a logic change, just a naming/signature change. But I think you're right -- bool start_at_zero is no better than bool all_records. They're both just describing the behavior. Mostly this pairs with the other comment.

if (update_db || fresh_results || new_query) {
auto status = incrementCounter(fresh_results || new_query, counter);
if (update_db || new_epoch || new_query) {
auto status = incrementCounter(new_epoch, new_query, counter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not auto status = incrementCounter(true, new_epoch || new_query, counter);

Will this do the right thing with the same epoch and a new query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: summary is it won't do the right thing because we actually want to have the behavior to return 1 in the new query same epoch case (since it does not return all records then).

The first argument should be false if it's the same epoch because that means the results are a diff, rather than including all records (even if the query changes). In the same epoch but new query case, counter resets to 1 instead of 0.

auto status = incrementCounter(new_epoch, new_epoch || new_query, counter) would work here though. It won't have different behavior since new_epoch always forces counter of 0 anyway, happy to change if you think it's clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very confused here.

I thought new_epoch referred to a new epoch, eg, the epoch has been changed. And that new_query referred to the first execution of a query. (I get this from reading the comments in the code).

As I understand the goal, either a new_query OR a new_epoch should reset the counter. And then what we reset the counter to depends on whether it's evented or not. Put differently, I think if there's a new_epoch, every query should be a new_query. And that we should be calling this as incrementCounter(<event-or-regular>, new_epoch || new_query, counter);

Splitting it up as incrementCounter(new_epoch, new_query, counter); feels weird. I think when a new config is loaded, and there are new queries, this would return 1 which seems wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I see the confusion, I had been using "new query" to mean the query SQL changed, not necessarily when the query is brand new overall. And "new epoch" to mean a new epoch for this query, including its first epoch overall.

I think it could be clearer if I rename these variables instead to new_query_epoch and new_query_sql -- I'll try that now but happy to revert or tweak if you prefer other names.

Here's another way to describe the behaviors I was going for in this PR:

non-evented table

  1. query is brand new -> reset counter to 0, include all records
  2. epoch changed -> reset counter to 0, include all records
  3. epoch same, query sql changed -> reset counter to 1 (formerly 0), do a normal diff
  4. epoch same, query sql same -> increment counter by 1, do a normal diff

evented table

  1. query is brand new -> reset counter to 1 (formerly 0), stream events
  2. epoch changed -> reset counter to 1 (formerly 0), stream events
  3. epoch same, query sql changed -> reset counter to 1 (formerly 0), stream events
  4. epoch same, query sql same -> increment counter by 1, stream events

The bolded parts are what this PR changes vs what osquery does today. To me the new behavior better matches the intent of the documentation to be able "to skip the initial added records" by ignoring when counter = 0 described here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(updated the PR, hopefully the naming makes more sense!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, your table form is helpful. I am confused, because I disagree with one of the lines:

non-evented table

  1. query is brand new -> reset counter to 0, include all records
  2. epoch changed -> reset counter to 0, include all records
  3. epoch same, query sql changed -> reset counter to 1 (formerly 0), do a normal diff
  4. epoch same, query sql same -> increment counter by 1, do a normal diff

I think (1), (2), and (3) should have the same behavior. They should reset to 0 and includes all records.

I think a SQL change is functionally a new query. I'm curious why not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, maybe we're both wrong?

If the query name is unchanged, we should not be resetting the the counter. [query name, epoch, counter] should be unique. Changing the query sql, but not the name, should not change the counter because it would invalidate that. It's also probably a bad idea.

But I'm not sure what the underlying system actually does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think there are good arguments either way for whether to return all records for a SQL change, so I was hesitant to change the current behavior of it doing a diff - it seemed like a larger/potentially more disruptive change than changing the counter behavior.

If the query name is unchanged, we should not be resetting the the counter.

I like this idea! While I was most bothered by the counter being 0 in this case, I'm not actually sure why resetting at all there is helpful -- I agree violating the uniqueness of that triple is annoying.

I'm happy to make this update. If anyone really relied on knowing when results are due to a query SQL update for some reason, in theory they could do that in a more clear/direct way by looking for the status line or joining on the osquery_schedule table in the query itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think there are good arguments either way for whether to return all records for a SQL change, so I was hesitant to change the current behavior of it doing a diff - it seemed like a larger/potentially more disruptive change than changing the counter behavior.

Yes, I think that's the correct judgement -- This PR should preserve the existing behavior. Not that we can't fix it, but we should be deliberate about doing so.

In that vein, this PR feels reasonable. It's not a deep change. Can you open a blueprint issue with a proposal for better counter behavior on changing SQL? I imagine there might be discussion to be had (I can think of 2 solutions)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I opened one here #7945

@directionless directionless added this to the 5.8.0 milestone Feb 15, 2023
Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat trepidatious -- as discussed in the comment threads, I think this exposes the corner case where we have poor behavior on a sql change. (osquery resets the counter, thus violating the idea that [queryname, epoch, counter] is immutable.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core ready for review Pull requests that are ready to be reviewed by a maintainer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants