Skip to content

Add denylist mechanism to distributed queries#7675

Merged
mike-myers-tob merged 4 commits intoosquery:masterfrom
lucasmrod:add-denylist-mechanism-to-dist-queries
Aug 8, 2022
Merged

Add denylist mechanism to distributed queries#7675
mike-myers-tob merged 4 commits intoosquery:masterfrom
lucasmrod:add-denylist-mechanism-to-dist-queries

Conversation

@lucasmrod
Copy link
Copy Markdown
Contributor

@lucasmrod lucasmrod commented Jul 6, 2022

Implements proposal #7658.

Am not sure about the default value for denylisting expiration, my guess is we could start with a default value of 1h.

Apart from the added unit tests I manually tested this with Fleet, by running an expensive live query twice (first time to cause the watcher to kill the worker, second time to verify denylisting):
Screen Shot 2022-07-06 at 09 00 25

@lucasmrod lucasmrod requested review from a team as code owners July 6, 2022 12:06
@mike-myers-tob mike-myers-tob added ready for review Pull requests that are ready to be reviewed by a maintainer distributed labels Jul 6, 2022
Copy link
Copy Markdown
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by the logic in this PR. Working through it, I think, what it's doing is:

  1. When I query starts running, time stamp it
  2. When osquery stops, remove it
  3. Before a query runs, check that we don't have a record of the query running in the last hour. (This works because we remove queries on completion)

So a different way of saying that, is that this does nothing to limit resource usage (that's either the watchdog or the OS's responsibility), and this will ensure that a query either exists cleanly, or does not get to run again within DURATION.

That might be a reasonable choice. Might be worth documenting on https://osquery.readthedocs.io/en/latest/deployment/performance-safety/

The implementation feels like it only works because distributed queries are single threaded. Which makes me a little nervous, but might be a reasonable, simple choice today.

Comment on lines +43 to +46
FLAG(uint64,
distributed_denylist_duration,
3600,
"Seconds to denylist distributed queries (default 1 hour)");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit of a nit, but I kinda want to quibble over this. I don't know that I feel strongly, but some questions...

  • How do we handle the scheduled queries? I don't quickly see them in the --help output
  • Should this be prefaced with watchdog_?
  • I'd probably default to a 86,400 second block

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Probably a good idea to match schedule queries deny-listing duration.

denylist_[failed_query_] = getUnixTime() + 86400;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed default to 86400 seconds. 8a72448.

for (const auto& query : queries) {
auto request = popRequest(query);

if (request.query.size() < max_query_size) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels weird. What happens if it's larger than max_query_size? Should we truncate? Hash?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of hashes, does it make more sense to use one instead?

It solves the oddness around max-query size, and I suspect is somewhat smoother for it. Downside, is that one cannot see the query in question. Not totally sure what makes sense

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I thought of using the hash too, but ended up using the actual query as key for better UX for troubleshooting/debugging.

I added this check as a safety-check. If the query is for some reason > 8 MB, then this new "distributed deny-listing" feature is just not used (to not mess up the storage basically).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can be persuaded, but it feels like a weird place. I dunno, maybe I'm old skool. Keys should be small. 😆

UX wise, I think the only place people will see this is in the logs, which can log with the full query anyhow, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Changed to using SHA256 of the queries as key: 8a72448.

const std::string kCarves = "carves";
const std::string kLogs = "logs";
const std::string kDistributedQueries = "distributed";
const std::string kDistributedRunningQueries = "distributed_running";
Copy link
Copy Markdown
Member

@Smjert Smjert Jul 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we're creating another column family for this when we already have the distributed one?

As far as we've used them, I see them as high level categories. I know that RocksDB has also some specific behavior around how it handles their SST and some implications on performance, but what I'm also partially worried about is that this also creates again a non-backward compatible database.

The alternative is to simply provide a prefix to the keys to differentiate them from the others.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we're creating another column family for this when we already have the distributed one?

I wanted to implement this as an additional+simplest feature/code and not modify existing working code.
If we wanted to reuse "distributed" domain, then I would need to modify the following for example:

std::vector<std::string> Distributed::getPendingQueries() {
std::vector<std::string> queries;
scanDatabaseKeys(kDistributedQueries, queries);
return queries;
}

(I would need to differentiate "queryName -> query" from "query -> timestamp" key/values.)

but what I'm also partially worried about is that this also creates again a non-backward compatible database.

Why is adding new domains a non-backwards compatible change?

Copy link
Copy Markdown
Member

@Smjert Smjert Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean; it would require changing the prefix of the keys and then write a migration step, as we have done in the past, to fixup the key names, and it would unfortunately be non-backward compatible anyway.
I should've created a prefix for those keys when I moved it to the new domain ^^'

Anyway, adding column families (the domain) is non-backward compatible because older versions of osquery won't attempt to open that new column family and RocksDB will error out, due to the fact that not all column families have been opened.

Something like this:

Rocksdb open failed (4:0) Invalid argument: You have to open all column families. Column families not opened: distributed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... I see, that makes sense, thanks for the clarification.

  • Curious what the plan forward is:
    • Is supporting rollback of osquery version important for users? (am guessing that there were issues when e.g. carves were introduced to osquery, which introduced a new column family)
    • Should osquery document adding new domains for new features is not supported or discouraged unless necessary?

(Maybe there are docs around this and I missed them, sorry.)


A non-breaking alternative is to use the "default" column family, but would require careful use of prefixes throughout the application. It also depends on the feature; for this particular feature the key space will not grow considerably (1 entry per deny-listed query).

Copy link
Copy Markdown
Member

@Smjert Smjert Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to add the new domain for now, but I think this definitely raises the question on what we should do going forward.
I think it's mostly an annoyance because the user maybe wants to test a new version, finds a problem, but then it can't go back, and if not careful then it can lose the data (logs and events) that were stored.

I'm not sure but maybe it's possible to just list all the column families in the DB and use that list to open the DB, so that we don't have to explicitly list them and the unused ones are ignored.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Draft PR with this idea: #7712.

@lucasmrod
Copy link
Copy Markdown
Contributor Author

I'm a little confused by the logic in this PR. Working through it, I think, what it's doing is:

  1. When I query starts running, time stamp it
  2. When osquery stops, remove it
  3. Before a query runs, check that we don't have a record of the query running in the last hour. (This works because we remove queries on completion)

Correct. Nit: step 2 is actually:
2. When the query finishes, remove it from the "distributed_running" domain.

So a different way of saying that, is that this does nothing to limit resource usage (that's either the watchdog or the OS's responsibility), and this will ensure that a query either exists cleanly, or does not get to run again within DURATION.

Correct. This PR/solution is mimicking the current implementation for deny-listing of scheduled queries (but for distributed queries):

void Config::recordQueryStart(const std::string& name) {
// There should only ever be a single executing query in the schedule.
setDatabaseValue(kPersistentSettings, kExecutingQuery, name);
// Store the time this query name last executed for later results eviction.
// When configuration updates occur the previous schedule is searched for
// 'stale' query names, aka those that have week-old or longer last execute
// timestamps. Offending queries have their database results purged.
setDatabaseValue(
kPersistentSettings, "timestamp." + name, std::to_string(getUnixTime()));
}

I used a totally separate "domain" to not mix distributed and schedule functionality.

@Smjert
Copy link
Copy Markdown
Member

Smjert commented Jul 25, 2022

I used a totally separate "domain" to not mix distributed and schedule functionality.

I'm not sure I follow, the schedule functionality uses the queries domain or kQueries, while the distributed uses distributed or kDistributedQueries, so there's already no overlap.

@lucasmrod
Copy link
Copy Markdown
Contributor Author

I'm not sure I follow, the schedule functionality uses the queries domain or kQueries, while the distributed uses distributed or kDistributedQueries, so there's already no overlap.

Let me know if #7675 (comment) makes sense.

@directionless
Copy link
Copy Markdown
Member

Correct. This PR/solution is mimicking the current implementation for deny-listing of scheduled queries (but for distributed queries):

Thanks for going through the logic with me. I think it makes sense.

I would like to request we document this, and maybe the scheduled behavior. Probably in https://osquery.readthedocs.io/en/latest/deployment/performance-safety/ somewhere?

@directionless directionless added this to the 5.5.0 milestone Jul 29, 2022
@lucasmrod
Copy link
Copy Markdown
Contributor Author

I would like to request we document this, and maybe the scheduled behavior. Probably in https://osquery.readthedocs.io/en/latest/deployment/performance-safety/ somewhere?

Added docs to docs/wiki/deployment/debugging.md. Let me know if that works. 8a72448.

Copy link
Copy Markdown
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test this, but I think it looks good.

@mike-myers-tob
Copy link
Copy Markdown
Member

Merging this now, with follow-up potential in #7712 and with the plan to test this in pre-release testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

distributed ready for review Pull requests that are ready to be reviewed by a maintainer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants