Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark time column as "additional" for event tables #8020

Merged

Conversation

bgirardeau-figma
Copy link
Contributor

Background

Osquery optimizes event tables in scheduled queries to only return data since the last time the scheduled query was run (unless --events_optimize=false is set). Sometimes this behavior is not desirable if the query is for a specifically set time range, so there is code to turn off the optimization if a constraint on the time column is present. However this does not work today because the constraint is not available to the table implementation unless it is marked "additional" in the schema.

Change

This change adds additional=True to time columns in event tables, disabling event optimization when a time constraint is present in a scheduled query.

If the old behavior is desired (having the time constraint filtered by SQL while optimization is applied), this is still possible by using the + operator to disable indexing in a query like +time > 0.

In theory this change has been the expected behavior, but it's possible others are relying on the old behavior. This should help fix #7352 (at least partially)

Test Plan

I did a couple basic tests with the query planner and disk events tables to make sure the time constraint was passed to xFilter when expected. It would be helpful if anyone else has more time or ideas on how to best test this.

{
  "schedule": {
      "optimized_query": {
          "query": "select time, action, name from disk_events order by time asc limit 5",
          "interval": 30
      },
      "unoptimized_query": {
          "query": "select time, action, name from disk_events where time > 0 order by time asc limit 5",
          "interval": 30
      },
      "force_optimized_query": {
          "query": "select time, action, name from disk_events where +time > 0 order by time asc limit 5",
          "interval": 30
      }
  }
}

@bgirardeau-figma bgirardeau-figma requested review from a team as code owners May 12, 2023 01:30
Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems kinda safe, the one thing that gives me pause is that I don't remember how how row_id is calculated.

sqlite needs each row to have a unique row_id. If there are no indexes it generates one by concating everything. If there are indexes, it concatenates those columns.

But I can't remember how additional works. It would be a problem if the time was the rowid.

@bgirardeau-figma
Copy link
Contributor Author

bgirardeau-figma commented May 24, 2023

sqlite needs each row to have a unique row_id. If there are no indexes it generates one by concating everything. If there are indexes, it concatenates those columns.

But I can't remember how additional works. It would be a problem if the time was the rowid.

Good question, it looks like it does concatenate all the columns together if there're only additional=True columns:

// If there are only 'additional' columns (rare), pkey is the 'unique row'.
// Otherwise an additional constraint will create duplicate rowids.
if (!indexed && epilog[kDisableRowId]) {
pkeys.clear();
for (const auto& column : columns) {
pkeys.push_back(std::get<0>(column));
}
}

Before this change pk was all zeros:

osquery> PRAGMA table_info(disk_events);
+-----+------------+---------+---------+------------+----+
| cid | name       | type    | notnull | dflt_value | pk |
+-----+------------+---------+---------+------------+----+
| 0   | action     | TEXT    | 1       |            | 1  |
| 1   | path       | TEXT    | 1       |            | 2  |
| 2   | name       | TEXT    | 1       |            | 3  |
| 3   | device     | TEXT    | 1       |            | 4  |
| 4   | uuid       | TEXT    | 1       |            | 5  |
| 5   | size       | BIGINT  | 1       |            | 6  |
| 6   | ejectable  | INTEGER | 1       |            | 7  |
| 7   | mountable  | INTEGER | 1       |            | 8  |
| 8   | writable   | INTEGER | 1       |            | 9  |
| 9   | content    | TEXT    | 1       |            | 10 |
| 10  | media_name | TEXT    | 1       |            | 11 |
| 11  | vendor     | TEXT    | 1       |            | 12 |
| 12  | filesystem | TEXT    | 1       |            | 13 |
| 13  | checksum   | TEXT    | 1       |            | 14 |
| 14  | time       | BIGINT  | 1       |            | 15 |
+-----+------------+---------+---------+------------+----+

Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for digging that up in tables.cpp. Interesting how pk column changes, but maybe that doesn't matter.

@Smjert
Copy link
Member

Smjert commented May 25, 2023

Thank you for digging that up in tables.cpp. Interesting how pk column changes, but maybe that doesn't matter.

It's in the part that @bgirardeau-figma has shown, since with only additional and no other primary key column which is actually unique, all the columns are part of the primary key. Plus table_info documentation (https://www.sqlite.org/pragma.html#pragma_table_info) says:

This pragma returns [...] "pk" (either zero for columns that are not part of the primary key, or the 1-based index of the column within the primary key)

@directionless directionless merged commit 09b05ad into osquery:master Jun 1, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in scheduler events optimization, correlation query never outputting anything
3 participants