
Added cache and throttling for certificates, keychain_acls, and keychain_items tables. #8192

Merged (15 commits) on Dec 15, 2023

Conversation

@getvictor (Contributor) commented Nov 14, 2023

Fixes #7780
Related issue: fleetdm/fleet#13065

Adding a cache for macOS keychain file accesses.

The cache checks whether a keychain file has been modified by comparing the file's SHA256 hash. If the file is unchanged, the cached results are returned. If the file has been modified, the cache also checks whether it was accessed within the configured interval; if so, the old cached results are still returned, and otherwise the file is re-read.

The cache works independently for each table: multiple tables can each access the keychain files within the interval, but each table can only do so once per interval.
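
A minimal C++ sketch of that decision, with hypothetical names (CacheEntrySketch, canReuse) rather than the PR's actual identifiers:

    // Illustrative sketch only -- not the PR's actual code. One cache entry per
    // table; cached rows are reused unless the file hash changed AND enough time
    // has passed since the last real read.
    #include <chrono>
    #include <map>
    #include <string>
    #include <vector>

    using Row = std::map<std::string, std::string>;
    using QueryData = std::vector<Row>;

    struct CacheEntrySketch {
      std::string sha256;                                   // hash at last read
      std::chrono::system_clock::time_point last_access{};  // time of last read
      QueryData results;                                    // cached table rows
    };

    // Returns true when the cached results may be returned without re-reading.
    bool canReuse(const CacheEntrySketch& entry,
                  const std::string& current_sha256,
                  std::chrono::minutes interval) {
      if (entry.sha256 == current_sha256) {
        return true;  // file unchanged: always reuse the cached rows
      }
      // File changed, but the last read was recent: throttle and reuse old rows.
      return (std::chrono::system_clock::now() - entry.last_access) < interval;
    }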

The following feature flags have been added:

    --keychain_access_cache                          Use a cache for keychain accesses (default true)
    --keychain_access_interval VALUE                 Minimum minutes required between keychain accesses. Keychain cache must be enabled to use

Default keychain_access_interval is 5 minutes.
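
For example, the throttling interval could be raised to 10 minutes (a hypothetical value; the cache itself defaults to enabled) with something like:

    osqueryd --keychain_access_interval=10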

Old table results exactly match new table results when ordered by primary key.

Performance results for certificates for 10 rounds (old vs new):

 U:1  C:0  M:3  F:0  D:0  manual: utilization: 13.940000000000001 cpu_time: 0.14197365480000002 memory: 35979264.0 fds: 4.0 duration: 0.5259468078613281
 U:1  C:0  M:3  F:0  D:0  manual: utilization: 13.236666666666668 cpu_time: 0.13929492440000002 memory: 35386163.2 fds: 4.0 duration: 0.5783782720565795

Performance results for certificates for 10 counts (new is faster and uses less memory due to the cache):

 U:2  C:1  M:3  F:0  D:2  manual: utilization: 44.29999999999999 cpu_time: 0.669130928 memory: 50331648.0 fds: 4.0 duration: 1.5303008556365967
 U:2  C:1  M:3  F:0  D:2  manual: utilization: 28.366666666666664 cpu_time: 0.42878788 memory: 40534016.0 fds: 4.0 duration: 1.0357069969177246

Performance results for keychain_acls for 10 rounds (old vs new):

 U:1  C:0  M:3  F:0  D:0  manual: utilization: 10.57 cpu_time: 0.10779758619999999 memory: 29374873.6 fds: 4.0 duration: 0.5254422664642334
 U:1  C:0  M:3  F:0  D:0  manual: utilization: 11.366666666666669 cpu_time: 0.1201387166 memory: 29420748.8 fds: 4.0 duration: 0.5800627708435059

Performance results for keychain_acls for 10 counts (new is faster and uses less memory due to the cache):

 U:2  C:1  M:3  F:0  D:2  manual: utilization: 22.059999999999995 cpu_time: 0.557502384 memory: 77463552.0 fds: 4.0 duration: 2.0405030250549316
 U:2  C:1  M:3  F:0  D:2  manual: utilization: 26.733333333333334 cpu_time: 0.405785928 memory: 36782080.0 fds: 4.0 duration: 1.0386288166046143

Performance results for keychain_items for 10 rounds (old vs new). New performance is better. This is likely because the new code only opens each keychain file once, while the old code opened each keychain file multiple times -- once for each keychain item type (password, certificate, etc.).

 U:2  C:1  M:3  F:0  D:1  manual: utilization: 30.510833333333334 cpu_time: 0.45226203760000006 memory: 29961420.8 fds: 4.0 duration: 0.9804916620254517
 U:2  C:0  M:3  F:0  D:0  manual: utilization: 20.805 cpu_time: 0.2112246234 memory: 26363494.4 fds: 4.0 duration: 0.5286885976791382

Performance results for keychain_items for 10 counts (new is much faster and uses less memory):

 U:3  C:2  M:3  F:0  D:3  manual: utilization: 78.25999999999999 cpu_time: 3.946559488 memory: 41320448.0 fds: 4.0 duration: 4.564563989639282
 U:2  C:1  M:3  F:0  D:2  manual: utilization: 25.474999999999998 cpu_time: 0.511153552 memory: 33996800.0 fds: 4.0 duration: 1.5302011966705322

@getvictor changed the title from "Added cache for keychain_acls table." to "Added cache for certificates and keychain_acls tables." on Nov 14, 2023
@getvictor mentioned this pull request on Nov 14, 2023
@getvictor changed the title from "Added cache for certificates and keychain_acls tables." to "Added cache and throttling for certificates, keychain_acls, and keychain_items tables." on Nov 15, 2023
@getvictor marked this pull request as ready for review on November 15, 2023 17:20
@getvictor requested review from a team as code owners on November 15, 2023 17:20
@zwass (Member) left a comment

This is mostly looking good to me. Minor changes requested.

I would like to manually verify I get the same results before and after this change but I'm currently blocked by #8196 (and working on resolving that).

Review comments (outdated, resolved) on:
osquery/tables/system/darwin/keychain_items.cpp
osquery/tables/system/darwin/keychain.h
osquery/tables/system/darwin/keychain_utils.cpp
@Smjert (Member) left a comment

Something to be careful about is that every table can be called concurrently, e.g. by a query via the scheduler and a query via the distributed/ad-hoc system.

Here the cache is global, so it would need to be thread-safe.

@directionless added this to the 5.11.0 milestone on Nov 25, 2023
@getvictor (Contributor, Author) commented

@Smjert I added mutex locks to make the Keychain Cache thread-safe.

Should I also update table documentation for these tables? For example:

table_name("keychain_acls")
description("Applications that have ACL entries in the keychain. NOTE: osquery limits frequent access to keychain files. This limit is controlled by the keychain_access_interval flag.")
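
A minimal sketch of the kind of mutex-guarded cache access described above, using hypothetical names rather than the PR's actual code:

    // Illustrative sketch only -- not the PR's actual code. Every read and write
    // of the shared cache holds a mutex, so a scheduled query and a distributed
    // query hitting the same table concurrently cannot corrupt the cache.
    #include <map>
    #include <mutex>
    #include <string>
    #include <vector>

    using Row = std::map<std::string, std::string>;
    using QueryData = std::vector<Row>;

    class KeychainCacheSketch {
     public:
      QueryData read(const std::string& table) {
        std::lock_guard<std::mutex> lock(mutex_);
        return results_[table];  // copy out while holding the lock
      }

      void write(const std::string& table, std::string sha256, QueryData rows) {
        std::lock_guard<std::mutex> lock(mutex_);
        hashes_[table] = std::move(sha256);
        results_[table] = std::move(rows);
      }

     private:
      std::mutex mutex_;  // protects hashes_ and results_
      std::map<std::string, std::string> hashes_;
      std::map<std::string, QueryData> results_;
    };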

@zwass (Member) commented Nov 27, 2023

I definitely like the idea of updating the table documentation. The mutex work lgtm but I'll let @Smjert weigh in.

@Smjert (Member) commented Nov 28, 2023

This is a bit of a drive-by review: one thing I noticed is that, while putting the lock inside the cache itself protects the cache data from corruption or half-reads, the read-then-write operation done in the table is not a transaction. You have to consider that threads can be preempted at any point between the read and the write, and their order can swap.

What could happen is that two concurrent queries both initially read the cache, miss, and go read the keychain data from the file. The one that read the keychain first (with the less up-to-date data) then gets preempted, and the one that came second is the first to write into the cache; when the first thread wakes up, it overwrites the most recent data with an older copy.

That said, it's unlikely that anything changes in the time between the preemption of the first thread, the keychain file read by the second thread, and the first thread's later wake-up.

I don't know if you have already considered this; again, it doesn't necessarily seem to be an issue from this quick review, but I thought I'd mention it.

@Smjert previously approved these changes Nov 28, 2023
@directionless (Member) commented
I see that Stefano approved this, but I find myself confused about how the cache works. Do entries persist between queries? If so, do they expire? There's an access interval, which isn't quite the same as cache expiration.

If you have a reasonable story for those, I'm not sure this PR needs to change. But a follow-up PR with some clear docs in the "wiki" would help.

@Smjert (Member) commented Nov 28, 2023

I see that Stefano approved this, but I find myself confused about how the cache works. Do entries persist between queries? If so, do they expire? There's an access interval, which isn't quite the same as cache expiration.

If you have a reasonable story for those, I'm not sure this PR needs to change. But a follow-up PR with some clear docs in the "wiki" would help.

Yes, the cache is global and persists. Basically the cache has a copy of all the results (as if no constraint was applied) for each of the categories specified here: https://github.com/osquery/osquery/pull/8192/files#diff-cbf1c8c0ddce99dc2309a7852b01cdd527797a808a0c1ad0aa4a231729bb8213R32

Each osquery table writes into one of the categories, always replacing everything in that category.
So there's no way to continuously grow the cache size by keeping both old and new results.

But there's still the problem of deciding how long to reuse the data (when to invalidate), and currently that's implemented with the table checking the hash of the file; if the hash has changed and enough time has passed (keychain_access_interval) from the last keychain read, then we invalidate the cache and re-read the data from disk, otherwise we return the cached data.

This is my understanding of how this works.

EDIT: sorry I forgot to mention the hashing.

@Smjert (Member) commented Nov 28, 2023

if the hash has changed and enough time has passed (keychain_access_interval) from the last keychain read, then we invalidate the cache and re-read the data from disk, otherwise we return the cached data.

With the further caveat that I expressed here: #8192 (comment): two queries running concurrently could both find the cache empty, or both find the file hash changed with enough time having passed, and both decide they need to access the keychain in close succession, not respecting the interval after the new, first access.

@getvictor (Contributor, Author) commented

@Smjert Yes, if two threads miss in the cache at the same time, they will both access the keychain files and both write to the cache. If the keychain file was also modified during this window, we can end up with:

  1. Old hash pointing to old data.
  2. Old hash pointing to new data.

In both cases, the data will remain there for the keychain_access_interval, and the hash/data will be updated by a subsequent access. Since osquery will recover from this corner case, it doesn't seem like a huge concern.

@directionless I will add additional documentation to the Command Line Flags wiki in a separate PR.

@getvictor (Contributor, Author) commented

Additional documentation added in PR #8205

@getvictor (Contributor, Author) commented

@Smjert After thinking some more, there is a chance of cache corruption with my current approach:

  1. Thread1 reads cache, creates hash1, and reads keychain file
  2. Keychain file modified
  3. Thread2 reads cache and creates hash2
  4. Thread2 writes cache
  5. Thread1 writes cache <-- bad data!

I will make a change to use 1 mutex across all 3 keychain tables, so that only 1 keychain table can be accessed at a time. This will have the side benefit of guaranteeing that multiple threads will not be using Apple keychain APIs at the same time. Perhaps that was actually the root cause of the bug in the first place.
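
A minimal sketch of that single shared mutex, with hypothetical names rather than the PR's actual code:

    // Illustrative sketch only -- not the PR's actual code. A single mutex shared
    // by the certificates, keychain_acls, and keychain_items implementations
    // serializes cache access and Apple keychain API usage, so only one of the
    // three table generators runs at a time.
    #include <map>
    #include <mutex>
    #include <string>
    #include <vector>

    using Row = std::map<std::string, std::string>;
    using QueryData = std::vector<Row>;

    static std::mutex keychainTableMutex;  // shared by all three keychain tables

    QueryData genCertificatesSketch() {
      std::lock_guard<std::mutex> lock(keychainTableMutex);
      QueryData results;
      // ... check the cache; on a miss, read the keychain files via the Apple
      //     Security framework and replace this table's cache entry ...
      return results;
    }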

@Smjert (Member) commented Nov 29, 2023

@Smjert After thinking some more, there is a chance of cache corruption with my current approach:

  1. Thread1 reads cache, creates hash1, and reads keychain file
  2. Keychain file modified
  3. Thread2 reads cache and creates hash2
  4. Thread2 writes cache
  5. Thread1 writes cache <-- bad data!

I will make a change to use 1 mutex across all 3 keychain tables, so that only 1 keychain table can be accessed at a time. This will have the side benefit of guaranteeing that multiple threads will not be using Apple keychain APIs at the same time. Perhaps that was actually the root cause of the bug in the first place.

I'm all for the more correct approach, although I don't follow where the corruption is in that example.

From what I see, Thread2 writes the cache with hash2 and the new data (post keychain modification), and Thread1 later overwrites that with hash1 and the old data (pre keychain modification), which is undesirable but still a complete result that will recover after an interval.
What am I missing?

@getvictor (Contributor, Author) commented

You're right -- I don't see how we can get a new hash with old data unless the keychain file is modified twice. I think it is still worthwhile to limit access to one keychain table at a time.

@Smjert (Member) commented Nov 29, 2023

EDIT: ah saw your response now; yeah I guess this is further reason to simplify this ^^'. Thanks!

@directionless (Member) commented

@Smjert @zwass How does this look?

@zwass (Member) left a comment

This lgtm. Thank you Victor! @Smjert please have a look soon otherwise I propose we merge this week.

@zwass merged commit 6f380fc into osquery:master on Dec 15, 2023
16 checks passed
@great944 commented Jan 7, 2024

Hello, while I understand the value of a cache for osquery when running as a service, I am curious whether we have considered scenarios where the main use case is the command line. Do we have a way to opt out of caching? Our primary use case is to execute a query on the command line, i.e. osqueryd.exe -S --json "select * from xyz", and we do not wish to activate any caching. In other words, all caching should be inactive by default for non-interactive command execution.

@directionless (Member) commented

Hi @great944

The cache in this PR is specific to the macOS keychain-related tables, and given the underlying corruption issues from the OS it would be a bad idea to disable it. That said, this PR does include an option to disable it.

Can you share what your concerns are, and why you wouldn't want it for straight CLI usage?
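
For reference, a run with the cache disabled via the flag from the PR description might look like the following (the boolean flag syntax shown is an assumption based on osquery's usual flag handling):

    osqueryd -S --json --keychain_access_cache=false "select * from certificates"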

@Smjert (Member) commented Jan 8, 2024

Also keep in mind that the cache only helps when multiple queries hit the keychain tables within the same command-line execution, given that the cache is in-memory only.
So sure, there might be a slight slowdown to create the cache at each execution, but if you run a single query against a keychain table, each execution will not use the cache and will not be subject to the interval.

Merging this pull request may close: Use a non-deprecated API for Keychain access
5 participants