High Build times on Big Sur #510

Closed
ameyah opened this issue Dec 2, 2020 · 19 comments · Fixed by #524

Comments

@ameyah

ameyah commented Dec 2, 2020

Santa 1.15 on Big Sur is causing high local build times.

What logs would be useful to debug this?

> santactl status
>>> Daemon Info
  Driver Connected          | Yes
  Mode                      | Monitor
  File Logging              | Yes
  Watchdog CPU Events       | 72  (Peak: 665.36%)
  Watchdog RAM Events       | 0  (Peak: 216.54MB)

[Screenshot: Screen Shot 2020-12-02 at 6 40 23 AM]

@russellhancox
Collaborator

Do you have any other system extensions loaded? What does systemextensionsctl list output?

@subuavudaifb

This is the output:

systemextensionsctl list
2 extension(s)
--- com.apple.system_extension.endpoint_security
enabled	active	teamID	bundleID (version)	name	[state]
*	*	7AGZNQ2S2T	com.carbonblack.es-loader.es-extension (1.0/1)	es-extension	[activated enabled]
*	*	EQHXZ8M8AV	com.google.santa.daemon (1.15/1.15)	santad	[activated enabled]
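
To check quickly whether more than one Endpoint Security client is active, listing output like the above can be parsed. This is a rough sketch: it assumes the columns are tab-separated as they appear in the paste, and the function name and SAMPLE constant are made up for illustration.

```python
from typing import List


def active_es_extensions(listing: str) -> List[str]:
    """Return bundle IDs of activated Endpoint Security extensions."""
    bundle_ids = []
    in_es_section = False
    for line in listing.splitlines():
        if line.startswith("---"):
            # Category header, e.g. "--- com.apple.system_extension.endpoint_security"
            in_es_section = line.strip().endswith("endpoint_security")
            continue
        # Assumed row layout: enabled, active, teamID, "bundleID (version)", name, [state]
        fields = line.split("\t")
        if in_es_section and len(fields) >= 6 and "activated" in fields[5]:
            bundle_ids.append(fields[3].split(" ")[0])
    return bundle_ids


# The listing from this thread, with tabs written explicitly.
SAMPLE = (
    "2 extension(s)\n"
    "--- com.apple.system_extension.endpoint_security\n"
    "enabled\tactive\tteamID\tbundleID (version)\tname\t[state]\n"
    "*\t*\t7AGZNQ2S2T\tcom.carbonblack.es-loader.es-extension (1.0/1)\tes-extension\t[activated enabled]\n"
    "*\t*\tEQHXZ8M8AV\tcom.google.santa.daemon (1.15/1.15)\tsantad\t[activated enabled]\n"
)

extensions = active_es_extensions(SAMPLE)
if len(extensions) > 1:
    print("Multiple ES clients active:", ", ".join(extensions))
```

More than one entry in the result means Santa is sharing the EndpointSecurity framework with another client, which is the situation being investigated here.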

@russellhancox
Collaborator

Can you remove com.carbonblack.es-loader.es-extension temporarily, retry the build and check to see if Santa is still using high CPU?

@subuavudaifb

Thanks Russell! It looks like Santa is triggering a bug with the Carbon Black system extension. Our theory is that com.google.santad.plist periodically checks whether the Santa system extension is alive and loads it if it is not. On Big Sur, however, this somehow kills com.carbonblack.es-loader.es-extension over and over again. Do you have suggestions on how to run the plist only when the Santa system extension is down?

@russellhancox
Collaborator

I'm not sure about that theory - com.google.santad.plist doesn't exist when the system extension is being used, loading of santad is left up to sysxd.

The reason I asked you to test that is that we have a theory that the bug is related to the caching mechanism provided by the EndpointSecurity framework - either that caching mechanism is broken, some system extensions refuse to allow anything to cache, or some extensions are clearing the cache very aggressively. The result is that Santa is having to perform work for every single execution even for unmodified binaries it has previously seen; during a build this is unusable as you've discovered.

Tom and I are going to work on bypassing the ES caching system and using our own, as we did pre-sysx, which will avoid these issues. Until then, I'm afraid your options are:

a) put up with the bad performance
b) only run Santa
c) only run CarbonBlack
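
To illustrate the idea of a self-managed cache (this is not Santa's actual code, which is not shown in this thread), here is a minimal sketch. It assumes decisions can be keyed on a file's device, inode, modification time and size, so an unmodified binary is evaluated only once. DecisionCache and its evaluate callback are hypothetical names.

```python
import os
from typing import Callable, Dict, Tuple

# Assumed cache key: (device, inode, mtime in ns, size). A real ES client
# would likely use different identity and invalidation rules.
FileKey = Tuple[int, int, int, int]


class DecisionCache:
    """Remember allow/deny decisions for files that have not changed."""

    def __init__(self, evaluate: Callable[[str], bool]) -> None:
        self._evaluate = evaluate  # the expensive per-binary check
        self._decisions: Dict[FileKey, bool] = {}
        self.evaluations = 0  # how many times the expensive check ran

    @staticmethod
    def _key(path: str) -> FileKey:
        st = os.stat(path)
        return (st.st_dev, st.st_ino, st.st_mtime_ns, st.st_size)

    def allowed(self, path: str) -> bool:
        key = self._key(path)
        if key not in self._decisions:
            # Cache miss: first sighting, or the file changed on disk.
            self.evaluations += 1
            self._decisions[key] = self._evaluate(path)
        return self._decisions[key]
```

With a cache like this, the thousands of compiler and linker executions in a build would hit the expensive check once per distinct binary rather than once per execution.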

@gu0keno0

gu0keno0 commented Dec 2, 2020

@russellhancox thanks for the suggestions. And I'd like to confirm a few more things to make sure that I fully understand your points:

  1. I checked Santa's install.sh installation script, and it seems to load the com.google.santad.plist launchd config: https://github.com/google/santa/blob/main/Conf/install.sh#L68-L69

  2. The plist file https://github.com/google/santa/blob/main/Conf/com.google.santad.plist will periodically invoke the system extension binary, because the process will exit with 0 once it is launched. I assume this binary does some checking on the status of the system extension, but correct me if I'm wrong here. We have this plist file in our production as well.

  3. I'm fairly certain that launchd invoking the sysext binary causes the macOS kernel to kill the CarbonBlack es-loader system extension. I still need to understand the interaction better, and I think it is a CarbonBlack issue, but if we simply stop launchd from periodically invoking /Applications/Santa.app/Contents/Library/SystemExtensions/com.google.santa.daemon.systemextension/Contents/MacOS/com.google.santa.daemon, the issue will be mitigated.

Therefore I'd like to confirm whether the periodic invocation of /Applications/Santa.app/Contents/Library/SystemExtensions/com.google.santa.daemon.systemextension/Contents/MacOS/com.google.santa.daemon by launchd is intentional. If not, could we make it conditional? For example, could we launch it only when Santa's system extension is not running?

@russellhancox
Collaborator

That plist is installed and loaded for the case where Santa is running as a kext. If Santa is configured to run as a sysx (as it is by default on 10.15+), then when it is started from launchd via that plist it deletes the plist and triggers a re-load via sysx. If santad is running as a system extension, as it is in this case, that plist file should not exist.

@gu0keno0

gu0keno0 commented Dec 2, 2020

@russellhancox: thanks. I double-checked, and the plist file does indeed appear to be deployed by our infra rather than by Santa itself. Is there documentation on which launchd daemons Santa should install? It would help us verify whether there are more misconfigurations.
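
For anyone wanting to check a fleet for the same misconfiguration, a rough sketch follows. It flags hosts where Santa's system extension is activated but the kext-era launchd plist is still on disk. The function name is hypothetical, and the plist location used in the usage note is an assumption that should be verified against your own deployment.

```python
import os


def stale_launchd_plist(sysx_listing: str, plist_path: str) -> bool:
    """True if Santa's sysx is activated while the launchd plist still exists.

    sysx_listing is the text printed by `systemextensionsctl list`;
    plist_path is wherever your deployment places com.google.santad.plist.
    """
    sysx_active = any(
        "com.google.santa.daemon" in line and "activated" in line
        for line in sysx_listing.splitlines()
    )
    return sysx_active and os.path.exists(plist_path)
```

A caller would pass the captured command output and a path such as /Library/LaunchDaemons/com.google.santad.plist (assumed location).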

@subuavudaifb

> The reason I asked you to test that is that we have a theory that the bug is related to the caching mechanism provided by the EndpointSecurity framework - either that caching mechanism is broken, some system extensions refuse to allow anything to cache, or some extensions are clearing the cache very aggressively. The result is that Santa is having to perform work for every single execution even for unmodified binaries it has previously seen; during a build this is unusable as you've discovered.
>
> Tom and I are going to work on bypassing the ES caching system and use our own, as we did pre-sysx and that will avoid these issues. Until then, I'm afraid your options are:

Hi @russellhancox , were you able to make any progress on this issue?

@russellhancox
Collaborator

> Is there documentation about what are the launchd daemons that should be installed by Santa?

Unfortunately not. We used to, but the sysx migration has changed this a few times and we haven't yet documented what the expected end-state is. At the system level the only "manually managed" daemon is com.google.santa.bundleservice; the daemon itself is managed by sysxd.

> Hi @russellhancox , were you able to make any progress on this issue?

We have a test client with a self-managed caching layer messily integrated. It seems to work well, but we haven't yet had a chance to test it with another system extension loaded to see whether it actually fixes the problem.

@Quantu

Quantu commented Jan 7, 2021

I have this same exact issue running AMP and Santa (1.15 and 1.17 tested) on Big Sur. High CPU usage by the Santa System Extension and very very very slow builds.

@russellhancox Any way I can get my hands on a copy of that test client with the self-managed caching layer? 😉

@russellhancox
Collaborator

Unfortunately due to the signing/entitlement requirements it's not possible for us to distribute development builds (they're signed with dev profiles that are linked to specific devices) and we can't produce production builds without reviewed & submitted code.

I have just sent out a PR that includes this feature and makes it optional, so we should be able to get a build out that includes it early next week.

@Quantu

Quantu commented Jan 8, 2021

That's wonderful news, thanks for the update!

@russellhancox
Collaborator

Sorry, closed this a little prematurely. The v2021.1 release includes this cache; you'll need to enable it in your config profile by setting EnableSysxCache to true.
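
As a rough illustration of the setting itself, the snippet below renders the key as an XML property list fragment. Only the EnableSysxCache key comes from the comment above. The helper name is made up, and how the fragment is wrapped in a full configuration profile (payload type, identifiers and so on) depends on your MDM setup and is deliberately left out.

```python
import plistlib


def santa_settings_fragment(enable_sysx_cache: bool = True) -> bytes:
    """Serialize the Santa cache setting as an XML plist (sketch only)."""
    # EnableSysxCache is the key named in this thread; nothing else is set.
    settings = {"EnableSysxCache": enable_sysx_cache}
    return plistlib.dumps(settings, fmt=plistlib.FMT_XML)


print(santa_settings_fragment().decode("utf-8"))
```

The printed dictionary would be merged into the Santa payload of an existing profile rather than deployed on its own.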

@Quantu

Quantu commented Jan 13, 2021

v2021.1 with the new cache option enabled is confirmed to fix the issue when running alongside AMP. 👍

@parkisan

parkisan commented Mar 3, 2021

Not to necropost, but did you guys (@russellhancox, @tburgin) follow up with Apple on the caching issues with Endpoint Security? My team is providing feedback for a few items in Big Sur and we were wondering if this has been highlighted.

@russellhancox
Collaborator

We didn't file anything about this - we're unsure whether there is a bug or, if there is, whether it's in the ES framework itself or in other ES clients. And as we don't run 2 ES clients side-by-side and haven't seen this ourselves, it's hard to gather the logs that would be necessary to file something.

I'm happy to provide any needed details I can if you decide to file one, fwiw.

@parkisan

parkisan commented Mar 4, 2021

Sure, I'd appreciate any pointers you may have on how to collect any more debug data for Apple/vendors for this particular issue.

We can reproduce it in Big Sur with Santa 1.13. We run at least 2 other sysexts which should be looking at executions as well. I don't believe we've done extensive testing on isolating this particular bug down to a combination of sysexts so we can try that too.

@russellhancox
Collaborator

Apple will want sysdiagnose output, almost certainly. I wouldn't know about debug data for vendors, but Console.app can show you a lot of information about processes, especially with Info/Debug messages turned on.

From our position all I can say is we had multiple reports of Santa reporting very high (several hundred percent) CPU usage when run alongside other EndpointSecurity agents; with some logs from Console.app we determined Santa was repeating work for every execution and guessed the issue must be related to the ES cache. At first we thought it was this other client misbehaving, but we've heard of this with at least 3 different products (Carbon Black, CrowdStrike and Cisco AMP), so it's either a bug in ES, widespread confusion about ES's caching, or some interesting undocumented behavior. We could probably confirm which by writing a test ES client of our own so we could run 2 things we control side-by-side, but a lack of time has prevented us from doing that, especially as we have a fix in place that we're happy with.
