runtime: cgocall is low performance for Conn on Windows #58336
Comments
There's not really a whole lot of information here. You said you'd previously looked at it in pprof; what did you run with pprof, and can you link the output? I'd also consider changing the package in the title.
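For reference, one low-friction way to capture such a profile from a long-running program like the proxy is the standard net/http/pprof handler. This is only a sketch; the listen address is an arbitrary choice, not taken from the original program.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on a side port while the proxy runs.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the proxy workload here ...
	select {}
}
```

A 30-second CPU profile can then be fetched with go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 and attached here.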
Does this reproduce when downloading large files over HTTP?
I have constructed a short, simple proxy that can help reproduce the issue.
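Since the linked example is not included above, here is a rough stand-in: a plain TCP port forwarder rather than a full HTTP/SOCKS proxy, with placeholder addresses. It only illustrates the kind of Conn.Read/Conn.Write loop under discussion; whether it reproduces the CPU difference is not established.

```go
// Minimal TCP port forwarder: accept on a local port and relay each
// connection to a fixed upstream address. Addresses are placeholders.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	const listenAddr = "localhost:8000"   // placeholder
	const upstreamAddr = "localhost:8080" // placeholder

	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func() {
			upstream, err := net.Dial("tcp", upstreamAddr)
			if err != nil {
				client.Close()
				return
			}
			// Relay both directions; every chunk is a Conn.Read followed by a
			// Conn.Write, the path that ends in WSARecv/WSASend on Windows.
			go func() {
				io.Copy(upstream, client)
				upstream.Close()
			}()
			io.Copy(client, upstream)
			client.Close()
		}()
	}
}
```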
When the profile tells you that a lot of time is spent in runtime.cgocall, that generally means the program is spending that time in non-Go code, here the Windows socket calls. So it's not clear that this is a bug as such. Of course it would be nice to get more complete information. And it is possible that your program is really not spending much time in non-Go code, in which case there may be a bug somewhere.
The problem here is at the OS level, compared with programs developed in other languages.
This is not the answer here, I think. Programs in all languages call the same OS network APIs. My description may not cover your concern; please have a try yourself.
It seems possible that WSARecv is making blocking calls, which puts the syscall over the 1µs timeout for rejoining the P without penalty, and that will create a fair amount of context switching when called repeatedly. I believe Linux and macOS make non-blocking calls for socket read/write and get the results via netpoll? I don't know how I would check that this is the case, though. Obviously net.Conn.Read is blocking, but that doesn't say much about syscall.WSARecv.
Hi @lifenjoiner, do you have CPU usage numbers reported by the OS that you could share for your sample program versus shadowsocks-rust or one of the other programs?
I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound; the vast majority of its runtime should be spent waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.
Measurements
Base info: both buffer sizes are (should be) 8k.
Results: see the attached data.
Do you have a tool to measure context switching in there?
Does waiting consume a lot of CPU? If you go to line 157 in 02d8ebd
Currently, no.
The context-switching rate should be high; see the reference. All data are included in the attachment above.
No. Rather, when Go goes into the syscall state, it takes the OS thread with it. My understanding is that it has 1µs to return before Go spins up another OS thread to handle the goroutines that were previously being handled by the thread that was taken for the syscall. The thing is, if there isn't enough work to sustain an additional OS thread beyond what it was previously using (which is most cases, since Go is generally already using the correct number of threads), Go will keep spinning up and shutting down OS threads, causing an obscene number of context switches, which seems to be what's happening here. I would say that Go's handling of goroutines returning from syscalls is generally poor when it can't put everything back exactly where it found it, and getting to the point where this code maintained two OS threads, given the usage pattern in play, would probably be ideal, even though making these calls non-blocking would be a quick fix.
By the way, there's another case like this (poor decisions about when to spin down OS threads causing insane numbers of context switches) that I'm aware of that isn't cgo-related: #21827. In that case, parking an OS-locked thread with a sleep or a blocking channel read causes the OS thread to be spun down since it isn't doing any work; the goroutine is moved to a thread where it can't run, and then the locked thread has to be woken when the goroutine unparks, causing two context switches.
@lifenjoiner Could you collect a runtime trace (https://pkg.go.dev/runtime/trace#Start) from this example? Just collect for a few seconds; otherwise you'll probably get a huge trace file.
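For completeness, a minimal sketch of collecting a bounded execution trace with the standard runtime/trace package, assuming it is wired into the program under test; the file name and the 30-second window are arbitrary choices.

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Run the workload (e.g. the download through the proxy) for a bounded window.
	time.Sleep(30 * time.Second)
	trace.Stop()
}
```

The resulting file can be inspected with go tool trace trace.out.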
@prattmic Many thanks for diving in! Here is a trace file for 30s while the download is stable.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes. It is long standing.
What operating system and processor architecture are you using (go env)?
What did you do?
I think this is a new case of #19574, taking effect at the higher level Conn and on the specific platform Windows; it was suggested that this be filed as a new issue. This is a long-term user observation.
It is traced into cgocall by pprof (high percentage): on Windows, Conn Read/Write finally calls the APIs WSARecv/WSASend through runtime.cgocall. This is very basic functionality for web servers, proxies, and other network data-transfer tools. Low performance shows up when there is a large amount of data, with high CPU usage (for example 10%+ in Task Manager, at the OS level), which can be 10+ times that of similar apps developed in Rust or C.
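To make the path concrete, here is a trivial client doing one Write and one Read on a net.Conn; the peer address and request are placeholders, and the comments only restate the call path described above rather than any verified runtime internals.

```go
package main

import (
	"log"
	"net"
)

func main() {
	c, err := net.Dial("tcp", "example.com:80") // placeholder peer
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Conn.Write: on Windows this ultimately issues WSASend (per the report,
	// attributed to runtime.cgocall in the profile).
	if _, err := c.Write([]byte("HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")); err != nil {
		log.Fatal(err)
	}

	// Conn.Read: on Windows this ultimately issues WSARecv.
	buf := make([]byte, 8<<10)
	n, err := c.Read(buf)
	log.Println(n, err)
}
```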
Since the scenario is network data transfer, I don't have a simple/short example that can reproduce the issue; anyone who can create one is appreciated. Proxies are the most typical case (both Read and Write). My example is below in the reply.
Test existing programs if you are interested
Maybe we can get started with some highly rated open-source proxies for now. Here I used the client as a local proxy without any upstream proxies, and wget (single-threaded) to download through it.
Attention: downloading self-hosted resources on the same machine doesn't help reproduce the difference; loopback MSS = 65495 is adopted.
go: xray, v2ray, and glider.
Config file: direct.json.zip.
rust: shadowsocks-rust.
Mitigation
An alternative to calling Read/Write directly is to use a big enough buffer. By piling up packet data and then doing the Read/Write in one go, it definitely reduces the context-switching cost. But it has disadvantages and does not work 100% of the time.
My experimental local proxy with a big buffer.
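As an illustration of the big-buffer idea (not the author's linked implementation), a helper like the following could replace the io.Copy calls in a relay loop such as the forwarder sketched earlier. The 256 KiB size is an arbitrary example. io.CopyBuffer is deliberately not used here: it ignores the supplied buffer when the destination implements io.ReaderFrom, which *net.TCPConn does.

```go
// relayWithBigBuffer copies src to dst through one large buffer so that each
// Read can return, and each Write can send, as much data as possible, which
// can reduce the number of WSARecv/WSASend calls per byte transferred.
func relayWithBigBuffer(dst io.Writer, src io.Reader) (written int64, err error) {
	buf := make([]byte, 256<<10) // 256 KiB; an arbitrary "big enough" size
	for {
		n, rerr := src.Read(buf)
		if n > 0 {
			w, werr := dst.Write(buf[:n])
			written += int64(w)
			if werr != nil {
				return written, werr
			}
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}
```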
I have tried ReadFrom, and observed 4k-length packets being transferred on the loopback interface, triggering high CPU usage.
What did you expect to see?
Lower CPU usage, at the OS level.
What did you see instead?
High CPU usage.