
runtime: cgocall is low performance for Conn on Windows #58336

Closed
lifenjoiner opened this issue Feb 5, 2023 · 30 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. OS-Windows Performance
Milestone

Comments

@lifenjoiner

lifenjoiner commented Feb 5, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20 windows/386

Does this issue reproduce with the latest release?

Yes. It is long-standing.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=auto
set GOARCH=386
set GOBIN=D:\build\go\GOPATH\bin
set GOCACHE=D:\build\go\GOCACHE
set GOENV=C:\Users\HYX\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=386
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=D:\build\go\GOPATH\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=D:\build\go\GOPATH
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=D:\build\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=D:\build\go\pkg\tool\windows_386
set GOVCS=
set GOVERSION=go1.20
set GCCGO=gccgo
set GO386=sse2
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=0
set GOMOD=
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m32 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=C:\Users\HYX\AppData\Local\Temp\go-build3874135637=/tmp/go-build -gno-record-gcc-switches

What did you do?

I think this is a new case of #19574, affecting the higher-level Conn on a specific platform, Windows. It was suggested that this be filed as a new issue.

This is a long-term user observation.

Profiling with pprof traces it into cgocall (a high percentage of samples): on Windows, Conn Read/Write ultimately calls the WSARecv/WSASend APIs through runtime.cgocall.

This is very basic functionality for web servers, proxies, and other network data-transfer tools.

Low performance shows up when a large amount of data is transferred: high CPU usage (for example 10%+ at the OS level, per Task Manager), which can be 10+ times that of similar apps developed in Rust or C.

Since the scenario is network data transfer, I don't have a simple/short example that reproduces the issue. Help from anyone who can produce one would be appreciated.

Proxies are the most typical case (both Read and Write). My example is below, in a reply.

Test existing programs if you are interested

Maybe we can get started with some highly rated open-source proxies for now.
Here I used the client as a local proxy without any upstream proxies, and wget (single-threaded) to download via it.
Attention: downloading self-hosted resources on the same machine doesn't help reproduce the difference, since loopback adopts MSS = 65495.

go: xray, v2ray, and glider.

xray.exe -config direct.json
v2ray.exe run -config direct.json
wget.exe -c -O nul -e https_proxy=127.0.0.1:5080 <https-big-file-url>

Config file: direct.json.zip.

glider.exe -listen=http://127.0.0.1:6080

rust: shadowsocks-rust.

sslocal.exe -b 127.0.0.1:7080 --protocol=http

Mitigation
An alternative to calling Read/Write directly is to use a large enough buffer. Piling up multiple packets' worth of data and then doing the Read/Write in one call definitely reduces the context-switching cost. But it has disadvantages and does not always work:

  • buffers should not be too big, or memory consumption grows quickly as connections increase;
  • if packet data can't be accumulated, buffering is ineffective;
  • small, high-frequency packets that must be answered promptly are common.

My experimental local proxy with big buffer.

I have tried ReadFrom, and observed that 4 KB packets are transferred on the loopback interface, triggering high CPU usage.

What did you expect to see?

Lower CPU usage at the OS level.

What did you see instead?

High CPU usage.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 5, 2023
@CannibalVox

There's not really a whole lot of information here. You said you'd previously looked at it in pprof: what did you run with pprof, and can you link the output? I'd also consider changing the package in the title to net, since for all we know cgo could be working correctly and net is just calling it too many times.

@CannibalVox

Does this reproduce when downloading large files from http?

@lifenjoiner
Author

I have constructed a short, simple proxy that can help reproduce the issue.

tunnel.go Source Code (Click to show)
package main

import (
	"flag"
	"log"
	"net"
	"sync"
	"time"
)

const GlobalTimeout = 15 * time.Second

type Config struct {
	BuffSize    int
	LocalAddr   string
	RemoteAddr  string
}

func parseConfig() *Config {
	conf := &Config{}

	flag.IntVar(&conf.BuffSize, "buffsize", 8192, "Specify connection buffer size.")
	flag.StringVar(&conf.LocalAddr, "localaddr", "127.0.0.1:6699", "Local address to listen to.")
	flag.StringVar(&conf.RemoteAddr, "remoteaddr", "", "Remote address to connect to.")

	flag.Parse()

	return conf
}

func ListenAndServe(cfg *Config) {
	l, err := net.Listen("tcp", cfg.LocalAddr)
	if err != nil {
		log.Printf("[tcp] failed to listen on %s: %v\n", cfg.LocalAddr, err)
		return
	}
	defer l.Close()

	log.Printf("[tcp] listening on %s\n", cfg.LocalAddr)
	for {
		c, err := l.Accept()
		if err != nil {
			log.Printf("[tcp] failed to accept: %v\n", err)
			continue
		}
		go Serve(c, cfg)
	}
}

func Serve(c net.Conn, cfg *Config) {
	var wg sync.WaitGroup
	var LrErr, LwErr, RrErr, RwErr error

	defer c.Close()

	LeftTimeout := 2 * GlobalTimeout
	RightTimeout := GlobalTimeout

	cr, err := net.DialTimeout("tcp", cfg.RemoteAddr, GlobalTimeout)
	if err != nil {
		return
	}
	defer cr.Close()

	wg.Add(1)
	go func() {
		defer wg.Done()
		var n int
		LeftBuf := make([]byte, cfg.BuffSize)
		for {
			c.SetDeadline(time.Now().Add(LeftTimeout))
			n, LrErr = c.Read(LeftBuf)
			if LrErr == nil {
				cr.SetDeadline(time.Now().Add(RightTimeout))
				_, RwErr = cr.Write(LeftBuf[0:n])
			}
			if LrErr != nil || LwErr != nil || RrErr != nil || RwErr != nil {
				c.SetDeadline(time.Now())
				cr.SetDeadline(time.Now())
				break
			}
		}
	}()

	RightBuf := make([]byte, cfg.BuffSize)
	for {
		var n int
		cr.SetDeadline(time.Now().Add(RightTimeout))
		n, RrErr = cr.Read(RightBuf)
		if RrErr == nil {
			c.SetDeadline(time.Now().Add(LeftTimeout))
			_, LwErr = c.Write(RightBuf[0:n])
		}
		if LrErr != nil || LwErr != nil || RrErr != nil || RwErr != nil {
			c.SetDeadline(time.Now())
			cr.SetDeadline(time.Now())
			break
		}
	}
	wg.Wait()

	if RrErr != nil {
		log.Printf("Remote error: %v\n", RrErr)
	} else if RwErr != nil {
		log.Printf("Remote error: %v\n", RwErr)
	}
}

func main() {
	cfg := parseConfig()
	if cfg.RemoteAddr == "" {
		log.Printf("Invalid remoteaddr!")
		return
	}

	ListenAndServe(cfg)
}

If you want to profile with pprof, drop pprof.go into the same folder and compile with -tags=pprof.

Because it is a very simple proxy, its functionality is limited too.

Usage of tunnel.exe:
  -buffsize int
        Specify connection buffer size. (default 8192)
  -localaddr string
        Local address to listen to. (default "127.0.0.1:6699")
  -remoteaddr string
        Remote address to connect to.

Here is one example for testing:

tunnel.exe -localaddr=127.0.0.1:443 -remoteaddr=dl.google.com:443 -buffsize=512000
curl.exe -o nul --resolve dl.google.com:443:127.0.0.1 https://dl.google.com/go/go1.20.windows-386.zip

You may specify whatever common buffsize you prefer. A bigger buffer can reduce the number of cgocalls, but doesn't eliminate the problem.

Results for the same high-speed resource on my side:

[Image: graph-sample-8kBuffer-CPUHigh.png]

[Image: graph-sample-500kBuffer-CPULow.png]


@CannibalVox

net is calling it too many times

I think that would be necessary, as messages need timely responses.

Does this reproduce when downloading large files from http?

Of course. It shouldn't be related to the content. I also confirmed it.


The key is the context-switching cost of cgo calls. That cost seems a bit too high, so high-speed Read/Write consumes more resources than it should.

@ianlancetaylor
Contributor

When the profile tells you that a lot of time is spent in cgocall what it is really telling you is that your program is spending most of its time in functions that are not written in Go. The profiler is not able to get a stack trace from functions that are not written in Go, so what you see is the backtrace in Go code that leads up to the non-Go code. That backtrace ends in cgocall, because that is the name of the function that calls the non-Go function.

So it's not clear that this is a bug as such. Of course it would be nice to get more complete information. And it is possible that your program is really not spending much time in non-Go code, in which case there may be a bug somewhere.

@lifenjoiner
Author

cgocall high CPU usage could be in 2 scopes:

  1. program level, by pprof,
  2. OS level, by task manager.

The problem here is at the OS level, compared with programs developed in other languages.
For example shadowsocks-rust, which I mentioned above.
They are tested by downloading the same file at the same speed. Using Wireshark, I can see the max packet data length from shadowsocks-rust is 8192, so there should be the same level of API/cgo calling. Yet with the same buffer size set in the above example, high OS-level CPU usage is observed.

@ianlancetaylor

When the profile tells you that a lot of time is spent in cgocall what it is really telling you is that your program is spending most of its time in functions that are not written in Go.

I don't think that's the answer here: programs in all these languages call the same OS network APIs.
The profiles don't really settle it, though some people may still want to have a glance, as they do tell something.


My description may not cover your concern; please give it a try yourself.

@CannibalVox

The key is the contexts switching cost of cgo calling. That seems to be a bit too much, resulting high speed Read/Write cost more resources.

It seems possible that WSARecv is making blocking calls, which puts the syscall over the 1µs timeout to rejoin the P without penalty; that will create a fair amount of context switching when called repeatedly. I believe Linux and macOS make non-blocking calls for socket read/write and get results via netpoll?

I don't know how I would check that this is the case, though. Obviously net.Conn.Read is blocking, but that doesn't say much about syscall.WSARecv.

@thepudds
Contributor

thepudds commented Feb 7, 2023

Hi @lifenjoiner, do you have numeric CPU usage numbers reported by the OS you could share for your sample program vs maybe shadowsocks-rust or one of the other programs?

@CannibalVox

I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound: the vast majority of its runtime should be waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

@lifenjoiner
Author

Measurements by typeperf:

Base info

Both buffer sizes are (should be) 8 KB.

tunnel.exe

curl.exe -o nul -x http://127.0.0.1:7080 <URL>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0   948k      0  0:06:58  0:06:58 --:--:--  948k

sslocal.exe

curl.exe -o nul --resolve <domain>:443:127.0.0.1 <URL>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0   949k      0  0:06:57  0:06:57 --:--:--  949k

Results

% User Time is likely what I saw in Process Explorer.

tunnel.exe
[Image: tunnel]

sslocal.exe
[Image: sslocal]

Raw data: typeperf-process-counters.tunnel-vs-sslocal.zip

@CannibalVox

Do you have a tool to measure context switching in there?

@lifenjoiner
Author

lifenjoiner commented Feb 7, 2023

I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound: the vast majority of its runtime should be waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

Does waiting consume a lot of CPU resources?

If you go to View -> Source in pprof's HTTP UI, you will see most of the time is spent at:

entersyscall()

Do you have a tool to measure context switching in there?

Currently, no.
I can see the Properties -> Threads -> Cycles Delta in Process Explorer:

Process      Threads   Cycles Delta
tunnel.exe   2+        50+ M
sslocal.exe  1         20 ~ 30 M

@lifenjoiner
Author

lifenjoiner commented Feb 7, 2023

The context-switching rate should be high, referring to:

You can determine whether context switching is excessive by comparing it with the value of Processor\% Privileged Time. If this counter is at 40 percent or more and the context-switching rate is high, then you can investigate the cause for high rates of context switches.

It says \Processor(_Total)\% Privileged Time; here I show \Process(tunnel)\% Privileged Time and \Process(sslocal)\% Privileged Time, which should tell something.

tunnel.exe
[Image: tunnel-2]

sslocal.exe
[Image: sslocal-2]

All data are included in the attachment above.

@CannibalVox

I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound: the vast majority of its runtime should be waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

Does waiting consume a lot of CPU resources?

No. Rather, when Go enters the syscall state, it takes the OS thread with it. My understanding is that it has 1µs to return before Go spins up another OS thread to handle the goroutines that were previously being handled by the thread taken for the syscall. The thing is, if there isn't enough work to sustain an additional OS thread beyond what was previously in use (which is most cases, since Go is generally using the correct number of threads), Go will keep spinning OS threads up and down, causing an obscene number of context switches, which seems to be what's happening here.

I would say that Go's handling of goroutines returning from a syscall is generally poor when it can't put everything back exactly where it found it. Getting this code to maintain two OS threads, given the usage patterns in play, would probably be ideal, even though making these calls non-blocking would be a quick fix.

@CannibalVox

btw, there's another case like this (poor decisions about when to spin down OS threads causing insane context switches) that I'm aware of and that isn't cgo-related: #21827

In that case, parking an OS-locked thread with sleep or a blocking channel read causes the OS thread to be spun down since it isn't doing any work; the goroutine is moved to a thread where it can't run, and then the locked thread has to be awoken when the goroutine unparks, causing two context switches.

@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 10, 2023
@prattmic
Member

@lifenjoiner Could you collect a runtime trace (https://pkg.go.dev/runtime/trace#Start) from this example? Just collect for a few seconds, otherwise you'll probably get a huge trace file.

@lifenjoiner
Author

lifenjoiner commented Feb 14, 2023

@prattmic Many thanks for diving in!

Here is a trace file for 30s while downloading is stable.
trace-30s.zip

Method:
https://pkg.go.dev/net/http/pprof#hdr-Usage_examples

curl -o trace-30s.out http://localhost:6060/debug/pprof/trace?seconds=30

Addendum: one more trace that includes the download startup:
trace-30s-includes-download-startup.zip

@mknyszek mknyszek added this to the Backlog milestone Feb 15, 2023
@CannibalVox

Hey @prattmic, I saw that you were assigned #44343 and it is scheduled for 1.21. If the goal is to deal with tasks that block Windows users from upgrading Go to the latest version, this is a big one. While it raises some interesting high-level questions about how the scheduler handles things, the task as reported may not be that complicated: network calls on Windows appear to be synchronous, and I believe they need to be made asynchronous to stop this behavior. I don't see how it's possible to run a web service on Windows in Go 1.21 without this being fixed.

@qmuntal
Contributor

qmuntal commented Apr 18, 2023

@lifenjoiner could you run your benchmarks using go tip? Some weeks ago I submitted an improvement that makes some UDP/TCP operations asynchronous: CL 463839.

@lifenjoiner
Author

@qmuntal Thanks for your work! Following are my benchmarks.

Base Info

  • go1.20.3.windows-386.zip
  • Patch: manually patched my local src/internal/poll/fd_windows.go. Confirmed the patch took effect: introducing a syntax error in the file triggered a compile error.
  • Set GOARCH=amd64 to build PE64 files, to keep the comparison on the same page.
  • My network changed, so --limit-rate was added to my previous download commands to make comparable tests. But the speed is NOT as uniform as before: I saw peaks and pauses in download speed and CPU usage in Process Explorer, so another group of tests at a higher download speed was made.
  • A different PC.

Results

  1. --limit-rate 1M, 2.753 / 0.293 = 9.4. Previous: 6.367 / 0.543 = 11.7.

tunnel.exe

$>curl.exe --limit-rate 1M -o nul --resolve dldir1.qq.com:443:127.0.0.1 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  1036k      0  0:06:22  0:06:22 --:--:-- 1272k

[Image: tunnel-d1mps-u-20230418]

sslocal.exe

$>curl.exe --limit-rate 1M -o nul -x http://127.0.0.1:7080 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  1033k      0  0:06:23  0:06:23 --:--:-- 1184k

[Image: sslocal-d1mps-u-20230418]

  2. --limit-rate 3M, 8.541 / 0.729 = 11.7.

tunnel.exe

$>curl.exe --limit-rate 3M -o nul --resolve dldir1.qq.com:443:127.0.0.1 <url>/WeCom_4.1.0.6007.exe
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  3000k      0  0:02:12  0:02:12 --:--:-- 1005k

[Image: tunnel-d3mps-u-20230418]

sslocal.exe

$>curl.exe --limit-rate 3M -o nul -x http://127.0.0.1:7080 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  3203k      0  0:02:03  0:02:03 --:--:-- 4164k

[Image: sslocal-d3mps-u-20230418]

Conclusion

It seems there are still more improvements to be made.

@qmuntal
Contributor

qmuntal commented Apr 18, 2023

Conclusion
Seems there are still more improvements can be done.

I'm having a hard time interpreting those results. Has there been an improvement in the metric you are using to benchmark this issue? I want to know whether CL 463839 goes in the right direction or is something unrelated.

@lifenjoiner
Author

lifenjoiner commented Apr 18, 2023

The screenshots are reports generated by typeperf. I've selected the valid duration and % User Time; the average is shown under the scroll bar.

I don't know if CL 463839 is the right direction.
Here are my new benchmarks on the same PC and network: CL463839 vs. NoPatch, both with 10 samples.

My network is not very stable. Maybe you can re-measure them yourself.

Prerequisite

  1. benchmark.bat. The script that I used.
@echo off
setlocal ENABLEDELAYEDEXPANSION
pushd %~dp0

::-code-::

for /L %%i in (0, 1, 9) do (
    title %%i %date% %time%
    start /min typeperf -f csv -si 00:01 -cf typeperf-process-counters.txt -o typeperf-process-counters-%%i.csv
    sleepy 2s
    curl.exe --limit-rate 3M -o nul --resolve dldir1.qq.com:443:127.0.0.1 https://dldir1.qq.com/wework/work_weixin/WeCom_4.1.3.6008.exe
    sleepy 2s
    taskkill /f /im typeperf.exe
)

::-code-::

:Exit
popd
endlocal
exit /b

You may choose a URL at your convenience.

  2. typeperf-process-counters.txt
\Process(tunnel)\% Privileged Time
\Process(tunnel)\% Processor Time
\Process(tunnel)\% User Time
  3. CL463839.exe and NoPatch.exe

tunnel.go

tunnel.exe -localaddr=127.0.0.1:443 -remoteaddr=dldir1.qq.com:443

Compile them by yourself. All other files and logs are in the attached zip file.

Data Files

CL463839-Vs-NoPatch_10-samples.zip
benchmark.xlsx

Analysis

[Image: CL463839-Vs-NoPatch]

@qmuntal
Contributor

qmuntal commented May 18, 2023

I will say that if WSARecv was non-blocking, I would expect the vast majority of this program to be spent in netpoll rather than in syscall, right? The program is network, the vast majority of its runtime should be waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

@CannibalVox WSARecv is non-blocking; I have verified that using @lifenjoiner's tunnel.go.

@lifenjoiner your initial guess seems to be right: there are a lot of context switches due to the high number of cgo calls. I haven't seen any low-hanging fruit in net or internal/poll that could mitigate that, so I'm inclined to say this is a duplicate of #19574.

@CannibalVox

I don't understand; I thought sub-1µs cgo calls don't cause context switches.

@CannibalVox

Also, #19574 didn't repro when I tried it a few months ago.

@lifenjoiner
Author

@qmuntal Thanks for your efforts!

Hoping someone will make more progress.

@gopherbot

Change https://go.dev/cl/562915 mentions this issue: runtime: don't call lockOSThread for every syscall call on Windows

@qmuntal
Contributor

qmuntal commented Mar 11, 2024

@lifenjoiner CL 562915 might fix your issue. It removes the LockOSThread call from syscall.Syscall, which is a cause of context switches. Could you give it a try?

To build a Go application using a specific commit, follow these instructions:

  1. run go install golang.org/dl/gotip@latest
  2. run gotip download 562915
  3. run gotip build .

@lifenjoiner
Author

  • --limit-rate 3M
  • Measured curl at the same time.

CL562915
CL562915.xlsx

Seems better. Great work! @qmuntal

@CannibalVox

!!! This seems like a really big deal, @qmuntal, thank you very much!

gopherbot pushed a commit that referenced this issue Mar 26, 2024
Windows syscall.SyscallN currently calls lockOSThread for every syscall.
This can be expensive and produce unnecessary context switches,
especially when the syscall is called frequently under high contention.

The lockOSThread was necessary to ensure that cgocall wouldn't
reschedule the goroutine to a different M, as the syscall return values
are reported back in the M struct.

This CL instructs cgocall to copy the syscall return values into
the M that will see the caller on return, so the caller no longer needs
to call lockOSThread.

Updates #58336.

Cq-Include-Trybots: luci.golang.try:gotip-windows-arm64,gotip-windows-amd64-longtest
Change-Id: If6644fd111dbacab74e7dcee2afa18ca146735da
Reviewed-on: https://go-review.googlesource.com/c/go/+/562915
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Auto-Submit: Emmanuel Odeke <emmanuel@orijtech.com>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@qmuntal
Contributor

qmuntal commented Mar 26, 2024

Seems better. Great work! @qmuntal

If I read your charts correctly, CL 562915 reduced User Time from ~6% to ~2% and Privileged Time from ~50% to ~6%. Given that the Privileged Time is now way less than the high-context-switch threshold (40% according to #58336 (comment)), IMO we can close this issue.

@qmuntal qmuntal closed this as completed Mar 26, 2024
@dmitshur dmitshur modified the milestones: Backlog, Go1.23 Mar 26, 2024
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 26, 2024
Mchnan pushed a commit to Mchnan/go-sylixos that referenced this issue Mar 28, 2024 (same commit message as the one above).