
runtime: non-preemptible zeroing in clear, append, bytes.growSlice, etc #69327

Open
rhysh opened this issue Sep 6, 2024 · 8 comments
Labels: compiler/runtime, NeedsInvestigation, Performance
Milestone: Backlog

Comments

@rhysh
Contributor

rhysh commented Sep 6, 2024

The clear and append built-ins can result in the need to zero an arbitrary amount of memory. For byte slices, the compiler appears to use a call to runtime.memclrNoHeapPointers. That function cannot be preempted, which can lead to arbitrary delays when another goroutine wants to stop the world (such as to start or end a GC cycle).
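
One quick way to see this lowering: compile a trivial wrapper and inspect the assembly with go build -gcflags=-S (the Zero function below is made up purely for illustration):

package p

// Zero clears a byte slice. For []byte, the compiler lowers the clear
// built-in to a single runtime.memclrNoHeapPointers call covering the
// whole slice, with no preemption point inside.
func Zero(b []byte) {
	clear(b)
}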

Applications that use bytes.Buffer can experience this when a call to bytes.(*Buffer).Write leads to a call to bytes.growSlice which uses append, as seen in one of the execution traces from #68399.

The runtime and compiler should collaborate to allow opportunities for preemption when zeroing large amounts of memory.

CC @golang/runtime @mknyszek


Reproducer, using the `clear` built-in plus `runtime.ReadMemStats` to provide STWs
package memclr

import (
	"context"
	"fmt"
	"math"
	"runtime"
	"runtime/metrics"
	"sync"
	"testing"
	"time"
)

func BenchmarkMemclr(b *testing.B) {
	for exp := 4; exp <= 9; exp++ {
		size := int(math.Pow10(exp))
		b.Run(fmt.Sprintf("bytes=10^%d", exp), testcaseMemclr(size))
	}
}

func testcaseMemclr(l int) func(b *testing.B) {
	return func(b *testing.B) {
		b.SetBytes(int64(l))
		v := make([]byte, l)
		for range b.N {
			clear(v)
		}
	}
}

func BenchmarkSTW(b *testing.B) {
	for exp := 4; exp <= 9; exp++ {
		size := int(math.Pow10(exp))
		b.Run(fmt.Sprintf("bytes=10^%d", exp), testcaseSTW(size))
	}
}

func testcaseSTW(size int) func(*testing.B) {
	const name = "/sched/pauses/stopping/other:seconds"

	return func(b *testing.B) {
		ctx, cancel := context.WithCancel(context.Background())

		clears := 0
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := make([]byte, size)
			for ctx.Err() == nil {
				clear(v)
				clears++
			}
		}()

		before := readMetric(name)
		var memstats runtime.MemStats
		for range b.N {
			runtime.ReadMemStats(&memstats)
			time.Sleep(10 * time.Microsecond) // allow others to make progress
		}
		after := readMetric(name)

		cancel()
		wg.Wait()

		ns := float64(time.Second.Nanoseconds())
		diff := delta(before.Float64Histogram(), after.Float64Histogram())
		b.ReportMetric(worst(diff)*ns, "worst-ns")
		b.ReportMetric(avg(diff)*ns, "avg-ns")
		b.ReportMetric(float64(clears), "clears")
	}
}

func readMetric(name string) metrics.Value {
	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)
	return samples[0].Value
}

// delta returns the difference b-a between two snapshots of the same
// histogram, assuming both share the same bucket boundaries.
func delta(a, b *metrics.Float64Histogram) *metrics.Float64Histogram {
	v := &metrics.Float64Histogram{
		Buckets: a.Buckets,
		Counts:  append([]uint64(nil), b.Counts...),
	}
	for i := range a.Counts {
		v.Counts[i] -= a.Counts[i]
	}
	return v
}

// worst returns the lower bound of the highest non-empty bucket, a lower
// estimate of the worst observed value.
func worst(h *metrics.Float64Histogram) float64 {
	var v float64
	for i, n := range h.Counts {
		if n > 0 {
			v = h.Buckets[i]
		}
	}
	return v
}

// avg returns the count-weighted mean of the finite bucket lower bounds.
func avg(h *metrics.Float64Histogram) float64 {
	var v float64
	var nn uint64
	for i, n := range h.Counts {
		if bv := h.Buckets[i]; !math.IsInf(bv, 0) && !math.IsNaN(bv) {
			v += float64(n) * h.Buckets[i]
			nn += n
		}
	}
	return v / float64(nn)
}
Reproducer results, showing average time to stop the world is more than 1 ms (instead of less than 10 µs) when another part of the app is clearing a 100 MB byte slice
$ GOGC=off go test -cpu=2 -bench=. ./memclr
goos: darwin
goarch: arm64
pkg: issues/memclr
cpu: Apple M1
BenchmarkMemclr/bytes=10^4-2             9421521               122.2 ns/op      81857.90 MB/s
BenchmarkMemclr/bytes=10^5-2             1000000              1433 ns/op        69779.61 MB/s
BenchmarkMemclr/bytes=10^6-2               99464             15148 ns/op        66016.50 MB/s
BenchmarkMemclr/bytes=10^7-2                8704            153918 ns/op        64969.54 MB/s
BenchmarkMemclr/bytes=10^8-2                 758           1632702 ns/op        61248.17 MB/s
BenchmarkMemclr/bytes=10^9-2                  67          16443990 ns/op        60812.49 MB/s
BenchmarkSTW/bytes=10^4-2                  29718             40598 ns/op              5473 avg-ns          2912452 clears            98304 worst-ns
BenchmarkSTW/bytes=10^5-2                  29895             38866 ns/op              4920 avg-ns           560027 clears            81920 worst-ns
BenchmarkSTW/bytes=10^6-2                  26226             44481 ns/op              8116 avg-ns            70132 clears            16384 worst-ns
BenchmarkSTW/bytes=10^7-2                   8925            164844 ns/op            120482 avg-ns             8919 clears           655360 worst-ns
BenchmarkSTW/bytes=10^8-2                   2184           1571734 ns/op           1376487 avg-ns             2102 clears          4194304 worst-ns
BenchmarkSTW/bytes=10^9-2                   1209           7075640 ns/op           6506152 avg-ns              529.0 clears       16777216 worst-ns
PASS
ok      issues/memclr   29.098s
`bytes.growSlice` calling `runtime.memclrNoHeapPointers`
$ go version
go version go1.23.0 darwin/arm64

$ go tool objdump -s 'bytes.growSlice$' `which go` | grep CALL
  buffer.go:249         0x10012a3cc             97fd23cd                CALL runtime.growslice(SB)              
  buffer.go:249         0x10012a3f0             97fd3db8                CALL runtime.memclrNoHeapPointers(SB)   
  buffer.go:250         0x10012a43c             97fd3e0d                CALL runtime.memmove(SB)                
  buffer.go:251         0x10012a464             94000ef3                CALL bytes.growSlice.func1(SB)          
  buffer.go:251         0x10012a484             97fd3ca3                CALL runtime.panicSliceAcap(SB)         
  buffer.go:249         0x10012a488             97fc9d82                CALL runtime.panicmakeslicelen(SB)      
  buffer.go:249         0x10012a490             97fc2cb8                CALL runtime.deferreturn(SB)            
  buffer.go:229         0x10012a4c0             97fd332c                CALL runtime.morestack_noctxt.abi0(SB)
@rhysh rhysh added Performance compiler/runtime Issues related to the Go compiler and/or runtime. labels Sep 6, 2024
@prattmic
Member

prattmic commented Sep 9, 2024

mallocgc uses memclrNoHeapPointersChunked to address this. It seems like the builtins could do the same?
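
For reference, that helper has roughly the following shape (taken from the runtime's malloc.go; details such as the chunk size and helper names may differ between Go versions):

// memclrNoHeapPointersChunked clears size bytes at x in fixed-size
// chunks, checking for a pending preemption request between chunks.
func memclrNoHeapPointersChunked(size uintptr, x unsafe.Pointer) {
	v := uintptr(x)
	const chunkBytes = 256 * 1024 // chunk size chosen by benchmarking
	vsize := v + size
	for voff := v; voff < vsize; voff += chunkBytes {
		if getg().preempt {
			// Yield at a safe point if preemption was requested.
			goschedguarded()
		}
		// Clear min(remaining, chunkBytes) bytes.
		n := vsize - voff
		if n > chunkBytes {
			n = chunkBytes
		}
		memclrNoHeapPointers(unsafe.Pointer(voff), n)
	}
}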

cc @golang/runtime @mknyszek @dr2chase

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 9, 2024
@dmitshur dmitshur added this to the Backlog milestone Sep 9, 2024
@dr2chase
Contributor

dr2chase commented Sep 9, 2024

@prattmic that seems plausible to me.

@mknyszek
Contributor

mknyszek commented Sep 9, 2024

The simplest way to fix this would be to have memclrNoHeapPointers just always do the Chunked part, right? That doesn't even need a compiler change: rename memclrNoHeapPointers to memclrNoHeapPointers0 and memclrNoHeapPointersChunked to memclrNoHeapPointers.

I think the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit acquirem), but there are very few of these cases, and we can still call memclrNoHeapPointers0 there.
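
In code, the rename might look something like this (a sketch of the proposal, not an actual patch; memclrNoHeapPointers0 is the hypothetical new name for the assembly routine):

// memclrNoHeapPointers0 would be the existing non-preemptible assembly
// implementation, renamed. The old name and signature stay behind as a
// preemptible chunked wrapper, which keeps linkname users working.
func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr) {
	const chunkBytes = 256 * 1024
	for n > 0 {
		if getg().preempt {
			goschedguarded() // yield if preemption is pending
		}
		c := n
		if c > chunkBytes {
			c = chunkBytes
		}
		memclrNoHeapPointers0(ptr, c)
		ptr = unsafe.Pointer(uintptr(ptr) + c)
		n -= c
	}
}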

@rhysh
Contributor Author

rhysh commented Sep 9, 2024

From go doc -u runtime.memclrNoHeapPointers:

    memclrNoHeapPointers ensures that if ptr is pointer-aligned, and n is a
    multiple of the pointer size, then any pointer-aligned, pointer-sized
    portion is cleared atomically. Despite the function name, this is necessary
    because this function is the underlying implementation of typedmemclr and
    memclrHasPointers. See the doc of memmove for more details.

    The (CPU-specific) implementations of this function are in memclr_*.s.

    memclrNoHeapPointers should be an internal detail, but widely used packages
    access it using linkname. Notable members of the hall of shame include:
      - github.com/bytedance/sonic
      - github.com/chenzhuoyu/iasm
      - github.com/cloudwego/frugal
      - github.com/dgraph-io/ristretto
      - github.com/outcaste-io/ristretto

    Do not remove or change the type signature. See go.dev/issue/67401.

First, @mknyszek , you said

the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit acquirem), but there are very few of these cases

Are you referring to the "ensures ... cleared atomically" part of that documentation? (I see that memclrNoHeapPointers is marked NOSPLIT — that's probably one of the load-bearing components that ensures the atomicity.) What's the reason for not using an explicit acquirem, or an explicit getg().m.preemptoff, or other explicit mechanism of disabling preemption?
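
(For concreteness, an explicit mechanism might look like the sketch below; acquirem/releasem are the runtime-internal calls that pin a goroutine to its M and thereby disable preemption. This is illustrative only, not existing runtime code.)

// Hypothetical wrapper for call sites that truly must not be preempted.
func memclrNonPreemptible(ptr unsafe.Pointer, n uintptr) {
	mp := acquirem() // disable preemption for the duration of the clear
	memclrNoHeapPointers(ptr, n)
	releasem(mp)
}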

Second, given that the current function definition includes a promise of atomicity — is that part of the API that members of the linkname hall of shame might require? (I would guess that it would be hard to use that property effectively outside of the runtime.)

Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions — the size of a single pointer — are each cleared atomically?

@randall77
Contributor

randall77 commented Sep 9, 2024

I think the distinguishing use case for memclrNoHeapPointers is clearing junk from something that is about to contain pointers.
We want to ensure that the GC can't see a partially-cleared buffer and treat the uncleared portion as valid pointers.

I don't think any user code linknaming into memclrNoHeapPointers would run into this problem. Only the runtime can see Go-heap-allocated, not-yet-initialized memory.

I think there are a few ways to solve this particular issue. One is to ensure that an object mid-initialization is never scanned by the GC. Another is to ensure that an object mid-initialization has no ptr bits set yet. A third is to ensure that it is scanned conservatively. With any of those, I think interrupting the memclr itself should be fine.

@randall77
Contributor

randall77 commented Sep 9, 2024

Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions — the size of a single pointer — are each cleared atomically?

Correct.

@mknyszek
Contributor

mknyszek commented Sep 9, 2024

First, @mknyszek , you said

the only concern would be if this function is called in places where we have to be very carefully non-preemptible (no explicit acquirem), but there are very few of these cases

Are you referring to the "ensures ... cleared atomically" part of that documentation? (I see that memclrNoHeapPointers is marked NOSPLIT — that's probably one of the load-bearing components that ensures the atomicity.) What's the reason for not using an explicit acquirem, or an explicit getg().m.preemptoff, or other explicit mechanism of disabling preemption?

Yeah, that's right. I think a lot of places this is used don't actually care about true atomicity, but we need to be very careful.

I was looking a bit more deeply into this, and it's going to be harder than I thought. Take, for example, typedmemclr, which assumes it cannot be interrupted. The concern is a situation like the following:

(1) typedmemclr begins outside of a GC cycle, so the write-barrier check fails and bulkBarrierPreWrite is skipped.
(2) typedmemclr is preempted after that (skipped) barrier point, before or during the clear.
(3) The GC mark phase begins.
(4) We delete a bunch of pointers without barriers during the mark phase.
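
For context, typedmemclr has roughly the following shape in the runtime (a simplified sketch; exact field and flag names vary across Go versions), which shows where a preemption point would fall relative to the barrier decision:

//go:nosplit
func typedmemclr(typ *_type, ptr unsafe.Pointer) {
	if writeBarrier.enabled && typ.Pointers() {
		// Tell the GC about the pointers that are about to be deleted.
		bulkBarrierPreWrite(uintptr(ptr), 0, typ.PtrBytes, typ)
	}
	// A preemption point between the check above and the clear below is
	// the hazard in steps (1)-(4): the mark phase could begin after the
	// barrier decision was already made.
	memclrNoHeapPointers(ptr, typ.Size_)
}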

Now whether or not this is a real problem is subtle. I think (but am not 100% certain) that you could make the argument that this isn't actually a problem in this case. The purpose of the deletion barrier is to catch pointers which are hidden in already-scanned stacks. If this is just a regular clear, then there's no chance of that. But if we read pointers from what we're about to clear and stored them on the stack before the mark phase, we can be certain they will be visible during the mark phase, if they're still relevant.

Unfortunately, there may be more bad (or possibly-bad-but-actually-fine) scenarios, and reasoning through all of them will be subtle.

I don't know why these functions do not explicitly disable preemption, but I suspect it's just performance. However, I think explicitly disabling preemption would probably not meaningfully impact performance either.

Second, given that the current function definition includes a promise of atomicity — is that part of the API that members of the linkname hall of shame might require? (I would guess that it would be hard to use that property effectively outside of the runtime.)

Though the comment of "any pointer-aligned, pointer-sized portion is cleared atomically" might only guarantee that individual portions — the size of a single pointer — are each cleared atomically?

I agree with @randall77 that it's unlikely linknamed Go code will run into this.

... With maybe one exception: IIRC sonic JITs code, and it may be making calls to mallocgc with needzero==false. Though if they're doing that with anything containing pointers, that might already be broken; there are already preemption points in mallocgc.
