Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opa cli slow on Windows 10 #4646

Closed
wermerb opened this issue Apr 30, 2022 · 37 comments
Closed

Opa cli slow on Windows 10 #4646

wermerb opened this issue Apr 30, 2022 · 37 comments
Labels
bug investigating Issues being actively investigated

Comments

@wermerb
Copy link

wermerb commented Apr 30, 2022

Hi,

It seems something has changed with the Windows binary in the last couple of releases.
The cli bootstrap time got significantly slower.
If I run the following command opa help response times are:

  • v0.38.0: almost instant
  • v0.39.0: 3 seconds till response
  • v0.40.0: 5 seconds till response

I cannot reproduce the above with the linux binaries.

Thank you

@wermerb wermerb added the bug label Apr 30, 2022
@srenatus
Copy link
Contributor

I think I've seen this related to the location of the binary. If it's used straight out of the download folder, extra checks apply that take time. It's a Windows thing. Can you check where there binaries are placed and perhaps try in another location, please?

@wermerb
Copy link
Author

wermerb commented Apr 30, 2022

Thanks for the quick response, I have tested on following locations, results are the same:

  • Downloads
  • Documents
  • Desktop

@srenatus
Copy link
Contributor

Can you try disabling any anti virus software that could be interfering?

@wermerb
Copy link
Author

wermerb commented Apr 30, 2022

I have tested it further. Without any antivirus software running i get the same results.
I have tried it on 3 different machines and i have found something interesting:

  • When i tested on an Intel processor based pc the results are the same
  • However when i ran it on an AMD processor based pc the issue above cannot be reproduced.

@kyorav
Copy link

kyorav commented May 2, 2022

Also a windows user here (Windows 10 pro, Intel processor).

I ran 'opa version' with different versions and using 'time' to get the stats. Here is what I got running on gitbash:
v0.40.0 : real 0m12.597s, user 0m0.000s, sys 0m0.030s
v0.39.0 : real 0m3.278s, user 0m0.015s, sys 0m0.172s
v0.38.0: real 0m0.495s, user 0m0.000s, sys 0m0.140s

I also tried from inside WSL2, its faster but still not OK:
v0.40.0: real 0m6.771s, user 0m0.012s, sys 0m0.010s
v0.39.0: real 0m4.014s, user 0m0.015s, sys 0m0.008s
v0.38.0: real 0m0.320s, user 0m0.008s, sys 0m0.004s

Oddly enough, running on PowerShell is even worse than gitbash

@srenatus
Copy link
Contributor

srenatus commented May 2, 2022

@kyorav @wermerb Is the timing consistent? The only difference I was able to reproduce (so far) was in first run vs second run, presumably due to the binary being read from disk and cached. (Do you use harddrives or SSDs?)

@wermerb
Copy link
Author

wermerb commented May 2, 2022

Its consistent for me and I have SSD

@srenatus
Copy link
Contributor

srenatus commented May 2, 2022

@wermerb thanks for the quick reply! :)

@srenatus
Copy link
Contributor

srenatus commented May 2, 2022

So, trying to shed some light onto this, I'll build binaries that drop trace logs on exit. If you could download them, please, and give it a go, and share the resulting trace.out file with me, that would be great. 🤞 we'll know better what's going on here if we have a trace... The builds should show up in an artifact called "binaries", a zip archive with all the binaries built during that PR run, here: https://github.com/open-policy-agent/opa/actions/runs/2257707275

@srenatus
Copy link
Contributor

srenatus commented May 2, 2022

I often find this hard to find, so here's a treasure map -- scroll down on that page if it doesn't fit in your screen:

image

@srenatus srenatus added the investigating Issues being actively investigated label May 2, 2022
@wermerb
Copy link
Author

wermerb commented May 2, 2022

Here you go:
trace.zip

@srenatus
Copy link
Contributor

srenatus commented May 2, 2022

Hrmm from the trace, the binary was running for 25ms. So it must be something outside of the binary's doing?

Regardless, I've pushed another commit to the debug PR, because we do something special after all on windows: the "mouse trap". Cobra, the CLI lib we use, does that by default via https://github.com/inconshreveable/mousetrap.

If you don't mind, could you try again with the binaries that'll show up in this build in a bit? Since it was only 25ms before, I don't think this is the issue, but it would still be interesting to have another trace without the mousetrap call...

@wermerb
Copy link
Author

wermerb commented May 2, 2022

Sure np, here is the new trace:

trace.zip

@srenatus
Copy link
Contributor

srenatus commented May 2, 2022

Thanks. I assume there's no huge change from that last run? The execution went down to ~7ms; because it didn't to the mousetrap work, but this still means that whatever happens happens outside of the binary execution. Perhaps loading some libraries, or trying to find them, or something 🤔

I'll need to do some reading on what could be causing this...

@srenatus
Copy link
Contributor

srenatus commented May 3, 2022

So I've asked around internally, and we came up with a few more things to note and check:

  1. https://textslashplain.com/2016/04/04/downloads-and-the-mark-of-the-web/ This could be interesting to verify: is it set? does removing it have an effect?
  2. If you're using MS AV: you could tell MS AV to trust/skip directory
    https://support.microsoft.com/en-us/windows/add-an-exclusion-to-windows-security-811816c0-4dfd-af4a-47e4-c301afe13b26
  3. Windows 10 allowed multiple AntiVirus applications to be installed.
  4. You could try looking in EventViewer: Applications And Services Logs -> Microsoft -> Windows Defender
  5. You can use procmon to diagnose such issues, https://docs.microsoft.com/en-us/sysinternals/downloads/procmon

@kyorav+@wermerb Could you have a look at those points, by any chance? I'd appreciate it a lot.

@wermerb
Copy link
Author

wermerb commented May 3, 2022

Hey @srenatus,

I'll will iterate over the list later today.
At least point 3 is out of the picture on my end.

@wermerb
Copy link
Author

wermerb commented May 3, 2022

I had some success with procmon.
What was interesting that in v0.40.0 the following events are showing up multiple times:

4:58:19.9830473 PM	opa40.exe	21244	TCP Send	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 268, startime: 123865444, endtime: 123865450, seqnum: 0, connid: 0
4:58:19.9830567 PM	opa40.exe	21244	TCP TCPCopy	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 204, seqnum: 0, connid: 0
4:58:19.9831632 PM	opa40.exe	21244	TCP Receive	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 204, seqnum: 0, connid: 0
4:58:20.0459476 PM	opa40.exe	21244	TCP Send	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 252, startime: 123865450, endtime: 123865456, seqnum: 0, connid: 0
4:58:20.0459599 PM	opa40.exe	21244	TCP TCPCopy	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 300, seqnum: 0, connid: 0
4:58:20.0460388 PM	opa40.exe	21244	TCP Receive	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 300, seqnum: 0, connid: 0
4:58:20.1231452 PM	opa40.exe	21244	TCP Send	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 140, startime: 123865456, endtime: 123865464, seqnum: 0, connid: 0
4:58:20.1231566 PM	opa40.exe	21244	TCP TCPCopy	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 140, seqnum: 0, connid: 0
4:58:20.1232209 PM	opa40.exe	21244	TCP Receive	my-pc.fqdn.com:52248 -> my-dns.fqdn.com:49667	SUCCESS	Length: 140, seqnum: 0, connid: 0

The above events only displayed when i ran the v0.40.0 and v0.39.0.

@srenatus
Copy link
Contributor

srenatus commented May 3, 2022

Hrrm but what is it...? Can you use tcpdump or wireshark or something to sniff the traffic?

Does the binary without them mousetrap exhibit this, too? Could be it's only related to gathering the process info. The TCP calls alone don't mean that this is what's causing the wait time.

@wermerb
Copy link
Author

wermerb commented May 3, 2022

I'll check it as well, however I think its interesting that the v0.38.0 does not have any TCP event.

@srenatus
Copy link
Contributor

srenatus commented May 3, 2022

#4374 I have a suspicion. I'll get you a build that doesn't do that tomorrow 🤞

@srenatus
Copy link
Contributor

srenatus commented May 4, 2022

More detail about the last comment above: I suspect the TCP connection you've seen is related to looking up the user when trying to see if their UID or GID is 0. Since that can't happen on Windows I've removed that check on windows in 49ddc71.

@wermerb Could you please grab that build's binary and see how it behaves on your system? Specifically, I'd be interested in

  • Is the TCP connection still happening?
  • Is the overall execution still slow?

Thinking about this more, if you've seen the TCP event using opa version, it is probably not related to this. The only windows-specific code running on opa version (that I'm aware of) is the mouse trap. Have you had a chance to check the mouse trap build for TCP events?

Sorry if this is dragging along, but I won't drop this until it's resolved 😅

@wermerb
Copy link
Author

wermerb commented May 4, 2022

@srenatus no worries happy to help :).

For testing I'm using the opa test command.

  • Is the TCP connection still happening?
    • Yes unfortunately it is
  • Is the overall execution still slow?
    • Yes

To help the investigation, I'll create and attach the procmon logs for the v0.38.0 and v0.40.0 maybe you'll find something other interesting point in it.

@srenatus
Copy link
Contributor

srenatus commented May 4, 2022

For testing I'm using the opa test command.

Ah, ok. So that doesn't instantiate the runtime, so my suspicion was wrong. Well, that's progress, too. 🙃

To help the investigation, I'll create and attach the procmon logs for the v0.38.0 and v0.40.0 maybe you'll find something other interesting point in it.

Thank you! Looking forward to 🔍 them.

@wermerb
Copy link
Author

wermerb commented May 4, 2022

One more thing that is changed is the Go version, i dont know whether it has anything to do with it or not:

  • v0.38.1: go1.17.8
  • v0.40.0: go1.18.1

Here are the procmon logs:
v0.38.1.CSV
v0.40.0.CSV

@srenatus
Copy link
Contributor

srenatus commented May 9, 2022

Thanks! I'm having a look and try to find anything suspicious... I'm not an expert on windows, though 😓

@srenatus
Copy link
Contributor

srenatus commented May 9, 2022

So it seems like the procmon csv captures the time it takes to run opa help:

scratch/windows-slow % head -2 *.CSV
==> v0.38.1.CSV <==
Time of Day,Process Name,PID,Operation,Path,Result,Detail
06:47.5,opa.exe,10204,Process Start,,SUCCESS,"Parent PID: 27012, Command line: opa  help, Current directory: C:\Users\<user-name>\Documents\opa\, Environment: "

==> v0.40.0.CSV <==
Time of Day,Process Name,PID,Operation,Path,Result,Detail
54:30.4,opa40.exe,22352,Process Start,,SUCCESS,"Parent PID: 27012, Command line: opa40.exe  help, Current directory: C:\Users\<user-name>\Documents\opa\, Environment: "
scratch/windows-slow %  tail -2 *.CSV
==> v0.38.1.CSV <==
06:47.6,opa.exe,10204,RegCloseKey,HKLM\System\CurrentControlSet\Services\WinSock2\Parameters\Protocol_Catalog9,SUCCESS,
06:47.6,opa.exe,10204,RegCloseKey,HKLM\System\CurrentControlSet\Services\WinSock2\Parameters\NameSpace_Catalog5,SUCCESS,

==> v0.40.0.CSV <==
54:31.9,opa40.exe,22352,CloseFile,C:\Windows\System32\en-US\mswsock.dll.mui,SUCCESS,
54:31.9,opa40.exe,22352,TCP Disconnect,mymachine.mydomain.com:56111 -> pgs-it-wdctrn12.mydomain.com:49155,SUCCESS,"Length: 0, seqnum: 0, connid: 0"

☝️ While I'lm not certain how to read those time stamps, it's 06:47.5-06:47.6 for 0.38.1 and 54:30.4-54:31.9 -- so there's significantly more time spent that those logs cover. That's good, we've got the right measurement (I hadn't seen that time spent in the golang traces from above).

Looking at the event counts, we find:

v0.38.1:

scratch/windows-slow % awk -F, < v0.38.1.CSV '{ m[$4]++ } END { for (ev in m) print m[ev] ": "ev }'  | sort -nr
194: RegQueryValue
135: RegOpenKey
80: CreateFile
76: CloseFile
69: RegCloseKey
62: QueryEAFile
49: RegQueryKey
40: QueryStandardInformationFile
40: CreateFileMapping
31: QueryStreamInformationFile
31: QueryDeviceInformationVolume
31: Load Image
23: QueryNameInformationFile
17: QueryBasicInformationFile
12: Thread Exit
12: Thread Create
9: ReadFile
4: QueryDirectory
2: RegSetInfoKey
1: RegSetValue
1: QuerySecurityFile
1: QueryInformationVolume
1: QueryAttributeTagFile
1: QueryAllInformationFile
1: Process Start
1: Process Exit
1: Operation

v0.40.0:

scratch/windows-slow % awk -F, < v0.40.0.CSV '{ m[$4]++ } END { for (ev in m) print m[ev] ": " ev }' | sort -nr
408: RegQueryValue
238: RegOpenKey
172: CreateFile
162: CloseFile
131: RegCloseKey
101: RegQueryKey
94: QueryEAFile
84: CreateFileMapping
61: QueryStandardInformationFile
53: QuerySecurityFile
47: QueryStreamInformationFile
47: QueryDeviceInformationVolume
47: Load Image
42: QueryBasicInformationFile
30: QueryNameInformationFile
14: Thread Exit
14: Thread Create
14: RegCreateKey
8: ReadFile
7: TCP TCPCopy
7: TCP Send
7: TCP Receive
4: QueryDirectory
2: TCP Disconnect
2: TCP Connect
2: RegSetInfoKey
2: RegEnumValue
1: RegQueryKeySecurity
1: QueryInformationVolume
1: QueryAttributeTagFile
1: QueryAllInformationFile
1: Process Start
1: Process Exit
1: Operation

So there's lot more file action going on. But looking at the differences in the referred to files, it seems to be user info related:

scratch/windows-slow % awk -F, < v0.38.1.CSV '/File/{print $5}' | sort | uniq > 38.files
scratch/windows-slow % awk -F, < v0.40.0.CSV '/File/{print $5}' | sort | uniq > 40.files
scratch/windows-slow % diff 38.files 40.files
8a9
> C:\Users\<user-name>\Documents\opa\MSASN1.dll
10,11c11,15
< C:\Users\<user-name>\Documents\opa\opa.exe
< C:\WINDOWS\system32\007d7daa9b4ead5bd0d1dff8c2febf38f1d16b8964c7e78cfa734a59704008bfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
---
> C:\Users\<user-name>\Documents\opa\cryptdll.dll
> C:\Users\<user-name>\Documents\opa\opa40.exe
> C:\Users\<user-name>\Documents\opa\profapi.dll
> C:\Users\<user-name>\Documents\opa\sspicli.dll
> C:\WINDOWS\system32\36cf5eefc646236c097672f25e0abcbb24184e7c3cbf8b98c31c62116d937cc3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
13c17,18
< C:\Windows\Prefetch\OPA.EXE-44413B5A.pf
---
> C:\Windows\Prefetch\OPA40.EXE-F16A340E.pf
> C:\Windows\System32\FWPUCLNT.DLL
17a23
> C:\Windows\System32\apphelp.dll
21a28
> C:\Windows\System32\cryptdll.dll
22a30
> C:\Windows\System32\dsparse.dll
27a36
> C:\Windows\System32\kerberos.dll
28a38,39
> C:\Windows\System32\logoncli.dll
> C:\Windows\System32\msasn1.dll
31a43,44
> C:\Windows\System32\netapi32.dll
> C:\Windows\System32\netutils.dll
33a47
> C:\Windows\System32\ntdsapi.dll
36a51,52
> C:\Windows\System32\profapi.dll
> C:\Windows\System32\rasadhlp.dll
38a55
> C:\Windows\System32\secur32.dll
39a57
> C:\Windows\System32\sspicli.dll
45a64
> C:\Windows\System32\wkscli.dll
47a67
> C:\Windows\apppatch\sysmain.sdb
50c70,71
< HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\opa.exe
---
> HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\opa40.exe
> HKLM\SOFTWARE\Policies\Microsoft\Windows NT\DNSClient\UseHostsFile
52c73
< HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\opa.exe
---
> HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\opa40.exe
55a77
> HKLM\System\CurrentControlSet\Services\Dnscache\Parameters\UseHostsFile

Looking at the RegQueryValue keys, we've got a bunch of TCP and socket-related registry keys being read, but that's also related to the TCP connections.

I suspect this still boils down to something lookup up user information, and that's a costly operation on windows (at least in some situations, AD, kerberos, etc). Now what I don't understand is why that happens:

For opa help

  • we don't check the current user (the docker "rootless" business)
  • we do check the mousetrap, but we did that with 0.38.1, too.

@srenatus
Copy link
Contributor

srenatus commented May 9, 2022

@wermerb I've compared the 0.40.0 binary, and 0.38.1's, using objdump -x. The 0.40.0 requires quite a few things that 0.38.1 doesn't seem to need; and I'm still trying to understand where that comes from. There's another test I'd like to run with you, if you can spare the time:

Using the binaries from this build, do we

  1. get the same (long) time for opa help?
  2. get the tcp connections, too?

@wermerb
Copy link
Author

wermerb commented May 10, 2022

@srenatus Thanks for the update.
I have tested the new binary its a bit faster now and no TCP connections are happening, but still slower than the 0.38.1
Logfile.CSV

@srenatus
Copy link
Contributor

srenatus commented Jun 7, 2022

So I've not made any breakthrough here. How are you holding up? Using the old version, or waiting patiently?

@wermerb
Copy link
Author

wermerb commented Jun 7, 2022

I'm using the older version for now.

@srenatus
Copy link
Contributor

Good news. We've located another user.Current() call in the transitive deps of badger, which was introduced in 0.39.0. I'm hoping that #4805 will make a difference for you, too. If you wouldn't mind trying the binaries from that PR, I'd appreciate that.

@wermerb
Copy link
Author

wermerb commented Jun 22, 2022

Hey, @srenatus thanks for the update. I would be happy to test it. Unfortunately im not so familiar with Github actions could you point me where can i find the corresponding binaries?

@srenatus
Copy link
Contributor

https://github.com/open-policy-agent/opa/actions/runs/2541221364 Please try the binaries from here, from the Artifacts section on that page 😃

@wermerb
Copy link
Author

wermerb commented Jun 22, 2022

Yup, you hit the nail on the head, this one works :)

@srenatus
Copy link
Contributor

Yay. This makes me happy

@srenatus
Copy link
Contributor

I'll close this. The next release will come with the fix. Thanks for bearing with me!

@wermerb
Copy link
Author

wermerb commented Jun 22, 2022

Thank you sir :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug investigating Issues being actively investigated
Projects
None yet
Development

No branches or pull requests

3 participants