Added cache and throttling for certificates, keychain_acls, and keychain_items tables. #8192
Conversation
This is mostly looking good to me. Minor changes requested.
I would like to manually verify I get the same results before and after this change but I'm currently blocked by #8196 (and working on resolving that).
Something to be careful about is that every table can be called concurrently, by a query via the scheduler and a query via the distributed/ad-hoc system.
Here the cache is global, so it would need to be thread safe.
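To illustrate the concern above, a minimal sketch of a process-global cache guarded by a mutex so scheduler and distributed/ad-hoc queries can use it concurrently. All names here (KeychainCache, get, set, the category keys) are hypothetical, not osquery's actual API.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <string>

// Hypothetical sketch: a global cache shared by all keychain tables.
// Every public method takes the mutex, so concurrent queries cannot
// observe half-written entries.
class KeychainCache {
 public:
  std::optional<std::string> get(const std::string& category) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = data_.find(category);
    if (it == data_.end()) {
      return std::nullopt; // cache miss
    }
    return it->second;
  }

  void set(const std::string& category, std::string results) {
    std::lock_guard<std::mutex> lock(mutex_);
    data_[category] = std::move(results);
  }

 private:
  std::mutex mutex_;
  std::map<std::string, std::string> data_;
};

// One instance shared by all keychain tables.
KeychainCache g_keychain_cache;
```

Note that this only makes individual reads and writes atomic; as discussed below, the table's whole read-then-write sequence is still not a transaction.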
@Smjert I added mutex locks to make Keychain Cache thread safe. Should I also update table documentation for these tables? For example:
I definitely like the idea of updating the table documentation. The mutex work lgtm, but I'll let @Smjert weigh in.
This is a bit of a drive-by review. One thing I noticed is that putting the lock inside the cache itself protects the cache data from corruption or half-reads, but the operation done in the table is not a transaction. You have to consider that threads can stop at any point between the read and the write, and they can swap order.

What could happen is that two concurrent queries both read the cache initially and miss, so both go and read the keychain data from the file. But then the one that read the keychain first (with the less up-to-date data) gets preempted, and the one that came second is the first to write into the cache; when the first thread wakes up, it will overwrite the most recent data with an older copy.

That said, it's unlikely that anything changes in the time between the preemption of the first thread, the keychain file read by the second thread, and the first thread's later wake-up. I don't know if you have already considered this; it doesn't necessarily seem like an issue from this quick review, but I thought I'd mention it.
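The lost-update hazard described above can be sketched in a few lines, together with one common mitigation (not what this PR does): tag each keychain read with a generation number taken under the lock, and refuse to publish results older than what the cache already holds. Everything here is illustrative; Cache, beginRead, and publish are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>

// Sketch of a generation-checked cache write. A reader grabs a generation
// before reading the keychain file; a stale reader that was preempted can
// no longer clobber results published by a later reader.
struct Cache {
  std::mutex mutex;
  uint64_t next_gen = 1;   // handed out to readers before the file read
  uint64_t stored_gen = 0; // generation of the data currently cached
  std::string data;

  uint64_t beginRead() {
    std::lock_guard<std::mutex> lock(mutex);
    return next_gen++;
  }

  // Returns false (and drops the results) if a newer read already
  // published into the cache.
  bool publish(uint64_t gen, std::string results) {
    std::lock_guard<std::mutex> lock(mutex);
    if (gen < stored_gen) {
      return false; // stale: keep the newer data
    }
    stored_gen = gen;
    data = std::move(results);
    return true;
  }
};
```

With this scheme, the interleaving above (first reader preempted, second reader writes first) leaves the newer data in place instead of the older copy.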
I see that Stefano approved this, but I find myself confused about how the cache works. Do entries persist between queries? If so, do they expire? There's an access interval, which isn't quite the same as cache expiration. If you have a reasonable story for those, I'm not sure this PR needs to change. But a follow-up PR with some clear docs in the "wiki" would help.
Yes, the cache is global and persists. Basically, the cache has a copy of all the results (as if no constraint was applied) for each of the categories specified here: https://github.com/osquery/osquery/pull/8192/files#diff-cbf1c8c0ddce99dc2309a7852b01cdd527797a808a0c1ad0aa4a231729bb8213R32

Each osquery table writes into one of the different categories, always substituting the whole contents of that category. But there's still the problem of deciding how long to reuse the data (when to invalidate). Currently that's implemented with the table checking the hash of the file: if the hash has changed and enough time has passed (keychain_access_interval) since the last keychain read, then we invalidate the cache and re-read the data from disk; otherwise we return the cached data.

This is my understanding of how this works. EDIT: sorry, I forgot to mention the hashing.
With the further caveat that I expressed here: #8192 (comment), where two queries running concurrently could either find the cache empty, or find the file hash changed and enough time elapsed, and both decide they need to access the keychain in close succession, not respecting the interval after the new, first access.
@Smjert Yes, if two threads miss in the cache at the same time, they will both access the keychain files and both write to the cache. If the keychain file was also modified during this, we can end up with:
In both cases, the data will remain there for the keychain_access_interval, and hash/data will be updated with a subsequent access. Since osquery will recover from this corner case situation, it doesn't seem like a huge concern. @directionless I will add additional documentation to the Command Line Flags wiki in a separate PR. |
Additional documentation added in PR #8205 |
@Smjert After thinking some more, there is a chance of cache corruption with my current approach:
I will make a change to use 1 mutex across all 3 keychain tables, so that only 1 keychain table can be accessed at a time. This will have the side benefit of guaranteeing that multiple threads will not be using Apple keychain APIs at the same time. Perhaps that was actually the root cause of the bug in the first place. |
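The "one mutex across all three tables" approach can be sketched as holding a single global lock for the entire read-check-write sequence, so cache lookup, keychain file read, and cache update behave as one transaction, and no two tables touch the Apple keychain APIs at the same time. The function and parameter names below are illustrative, not the PR's literal code.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>

// Hypothetical sketch: one mutex shared by the certificates,
// keychain_acls, and keychain_items tables.
std::mutex g_keychain_mutex;

// The whole lookup/read/update sequence runs under the lock, so two
// concurrent queries can no longer both miss and both read the file.
std::string queryKeychainTable(
    std::function<std::string()> readCache,
    std::function<std::string()> readKeychainFile,
    std::function<void(const std::string&)> writeCache) {
  std::lock_guard<std::mutex> lock(g_keychain_mutex);
  std::string cached = readCache();
  if (!cached.empty()) {
    return cached; // hit: serve cached rows
  }
  std::string fresh = readKeychainFile(); // miss: exactly one file read
  writeCache(fresh);
  return fresh;
}
```

The trade-off is coarser serialization: only one keychain table query runs at a time, which the comment above argues is acceptable (and possibly desirable, given the suspected keychain API concurrency bug).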
I'm all for the more correct approach, although I don't follow where the corruption is in that example. From what I see Thread2 is writing the cache with hash2 and the new data (post keychain modification) and Thread1 is later overwriting that with hash1 and the old data (pre keychain modification), which is undesirable but it is still a complete result which will recover after an interval. |
You're right -- I don't see how we can get a new hash with old data unless the keychain file is modified twice. I think it's still worthwhile to limit access to one keychain table at a time.
EDIT: ah saw your response now; yeah I guess this is further reason to simplify this ^^'. Thanks! |
This lgtm. Thank you Victor! @Smjert please have a look soon otherwise I propose we merge this week.
Hello, while I understand the value of a cache for osquery when running as a service, I am curious whether we have considered scenarios where the main use case is the command line. Do we have a way to opt out of caching? Our primary use case is to execute a query on the command line, i.e. osqueryd.exe -S --json "select * from xyz", and we do not wish to activate any caching. In other words, all caching should be inactive by default for non-interactive command execution.
Hi @great944 The cache in this PR is specific to the macOS keychain related tables, and given the underlying corruption issues from the OS it would be a bad idea to disable it. That said, this PR also includes an option to disable it. Can you share what your concerns are, and why you wouldn't want it for straight CLI usage? |
Also keep in mind that the cache only helps multiple queries to the keychain tables made within the same command-line execution, given that the cache is in memory only.
Fixes #7780
Related issue: fleetdm/fleet#13065
Adding a cache for macOS keychain file accesses.
The cache checks whether a keychain file has been modified by comparing the file's SHA256 hash. If the file has been modified, the cache also checks whether the file has been accessed recently. If it has been accessed within the configured interval, the old cached results are returned.
The cache works independently for each table. This means that multiple tables can access the keychain files within the interval, but each one of them can only do so once.
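The invalidation rule described above can be sketched as a single predicate: re-read the keychain file only when its SHA256 hash has changed and the configured access interval has elapsed since the last read; otherwise serve cached rows. The Entry struct and shouldReread name are illustrative, not the PR's actual code.

```cpp
#include <cassert>
#include <chrono>
#include <string>

using Clock = std::chrono::steady_clock;

// Hypothetical per-table cache entry.
struct Entry {
  std::string hash;           // SHA256 of the keychain file at last read
  Clock::time_point last_read; // when the file was last read
  std::string rows;            // cached table results
};

// True only when the file changed AND the throttling interval has passed.
bool shouldReread(const Entry& e,
                  const std::string& current_hash,
                  Clock::time_point now,
                  std::chrono::seconds interval) {
  if (current_hash == e.hash) {
    return false; // file unchanged: always serve the cache
  }
  return (now - e.last_read) >= interval; // changed, but throttled
}
```

With the default keychain_access_interval of 5 minutes, a modified keychain file still serves stale cached rows until the interval expires, which matches the throttling behavior discussed in the review thread.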
The following feature flags have been added:

Default keychain_access_interval is 5 minutes.

Old table results exactly match new table results when ordered by primary key.

Performance results for certificates for 10 rounds (old vs new):

Performance results for certificates for 10 counts (new is faster and uses less memory due to the cache):

Performance results for keychain_acls for 10 rounds (old vs new):

Performance results for keychain_acls for 10 counts (new is faster and uses less memory due to the cache):

Performance results for keychain_items for 10 rounds (old vs new). New performance is better. This is likely because the new code only opens each keychain file once, while the old code opened each keychain file multiple times -- once for each keychain item type (password, certificate, etc.)

Performance results for keychain_items for 10 counts (new is way faster and uses less memory):