
proposal: add built-in function "formallyRead" for preventing elimination of "dead" stores #33325

Closed
nsajko opened this issue Jul 27, 2019 · 13 comments

@nsajko commented Jul 27, 2019

Dead stores are writes to variables that are never followed by corresponding reads. While dead store elimination is usually a desirable optimization, sometimes it is essential for program correctness that dead store elimination (or the dead code elimination that it sometimes implies) does not happen.

The two examples on my mind are:

  • Secure zeroization of sensitive data (also known as cleansing, wiping, etc.), where the store (to computer memory) itself is essential, and it should not legitimately be followed by a read.

  • Benchmarking, where the problem is the elimination of the very code that is meant to be benchmarked.

In C, these problems might be solved with memset_s, volatile-qualified types, or inline assembly tricks.

In current Go user code, convoluted efforts to prevent dead store elimination when wiping memory, and dead code elimination when benchmarking, ultimately rely only on the lack of such optimizations in current Go implementations, not on any formal guarantee. This proposal aims to provide that guarantee, while also simplifying user code.

The new built-in "formallyRead" is meant to return nothing and either

  • takes a pointer or an arbitrary number of pointers

  • takes a slice or an arbitrary number of slices

  • if package "unsafe" is imported it additionally can take an unsafe.Pointer and a uintptr (to indicate base and size of a memory object)

The pointer arguments to formallyRead indicate objects that must be considered as having formally been read at the point of the formallyRead invocation. Slice arguments indicate the same for the slice's underlying array in the range up to len(slice).
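For concreteness, here is a sketch of the three call forms. Since formallyRead does not exist, a stub declaration with a variadic interface{} parameter stands in for the proposed built-in; it only illustrates how call sites would look:

package main

import "unsafe"

// Stand-in for the proposed built-in; a real built-in would need neither
// this declaration nor the interface{} boxing.
func formallyRead(args ...interface{}) {}

func main() {
	var key [32]byte
	buf := make([]byte, 64)

	formallyRead(&key)                                          // pointer form
	formallyRead(buf)                                           // slice form, covering buf[:len(buf)]
	formallyRead(unsafe.Pointer(&key[0]), uintptr(len(key)))    // base-and-size form, only with "unsafe" imported
}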

The following minimal example is meant to show the effects of formallyRead:

package main

func main() {
	// ...

	var key [32]byte

	// Do cryptographic operations with the key.
	// ...

	for i := range key {
		key[i] = 0
	}
	// Pretend to read the key so the compiler would not optimize out the preceding loop.
	formallyRead(&key) // or formallyRead(key[:])

	// ...
}

The following example showcases how formallyRead would cut down benchmark code, because package-level sink variables would no longer be necessary:

var GlobalF float64

func BenchmarkAcos(b *testing.B) {
	x := 0.0
	for i := 0; i < b.N; i++ {
		x = Acos(.5)
	}
	GlobalF = x
}

turns into:

func BenchmarkAcos(b *testing.B) {
	for i := 0; i < b.N; i++ {
		x := Acos(.5)
		formallyRead(&x)
	}
}

See also:

Regarding benchmarking:

// Global exported variables are used to store the

https://groups.google.com/forum/#!topic/golang-dev/eCPwwvqs2Cg

A more ambitious and ambiguous proposal regarding the zeroization use case:

#21374

Discusses some ugly hacks that are done in the name of zeroization:

#21865

@gopherbot added this to the Proposal milestone Jul 27, 2019

@gopherbot added the Proposal label Jul 27, 2019

@ianlancetaylor commented Jul 27, 2019

My initial thought is that a builtin function zero that zeros out the memory associated with a pointer or slice would be easier to understand than formallyRead.

The benefits to benchmarking seem less interesting to me. Using global variables is easy enough to understand, and it doesn't seem likely to fail any time soon.

@nsajko commented Jul 27, 2019

The benefits to benchmarking seem less interesting to me. Using global variables is easy enough to understand, and it doesn't seem likely to fail any time soon.

But you said in the linked mailing list discussion that:

Note that gccgo will already discard assignments to an unexported variable that is never read, among other optimizations.

Also, from the same discussion it seems like both you and rsc think that using sink variables complicates benchmarking code, but you say that formallyRead is not easy to understand ... Maybe the main problem is that formallyRead is not a good name?

Edit: I was confused about the "unexported variable" part and thus misunderstood what GCC optimizes and what it does not; sorry.

@ianlancetaylor commented Jul 27, 2019

Maybe the main problem is that formallyRead is not a good name?

Yes, maybe. For the benchmarking case, perhaps we could repurpose runtime.KeepAlive. Perhaps it already works.

@nsajko referenced this issue Jul 27, 2019
@randall77 commented Jul 28, 2019

I think runtime.KeepAlive should work as currently implemented (for both uses discussed here).

For benchmarking particularly, perhaps there should be a method on testing.B which could delegate to runtime.KeepAlive for now, but have a more evocative name.
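As a hypothetical sketch of that idea, written as a standalone helper rather than a real method (user code cannot add methods to testing.B, and the name sink is only illustrative):

package bench

import (
	"runtime"
	"testing"
)

// sink marks v as used so the work that produced it is not eliminated
// as dead code. A real version would be a method on *testing.B with a
// more evocative name, delegating to runtime.KeepAlive as done here.
func sink(_ *testing.B, v interface{}) {
	runtime.KeepAlive(v)
}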

@awnumar commented Jul 29, 2019

func wipe(key []byte) {
    for i := range key {
        key[i] = 0
    }
    runtime.KeepAlive(key)
}

What's to stop the compiler from optimizing out the entire function? KeepAlive is currently implemented as:

func KeepAlive(x interface{}) {
    // Introduce a use of x that the compiler can't eliminate.
    // This makes sure x is alive on entry. We need x to be alive
    // on entry for "defer runtime.KeepAlive(x)"; see issue 21402.
    if cgoAlwaysFalse {
        println(x)
    }
}

I think the branch is there to confuse it into thinking there's a possibility of the value being needed?

@randall77 commented Jul 29, 2019

runtime.KeepAlive cannot be optimized away.
Or perhaps more precisely, if we come up with an optimization that causes runtime.KeepAlive to be optimized away, we will change the implementation of runtime.KeepAlive to avoid that optimization.

I think the branch is there to confuse it into thinking there's a possibility of the value being needed?

Correct.

@nsajko commented Jul 30, 2019

@randall77 A question: Suppose one passes a slice to runtime.KeepAlive (like in awnumar's example above). Does runtime.KeepAlive then protect just the slice structure (the one that has the pointer to the backing array, length and capacity; I think) or does it also protect the slice's backing array?

I am asking because I think println does not need to read the slice's backing array.

@randall77 commented Jul 30, 2019

@nsajko Both. runtime.KeepAlive keeps its argument live directly (the slice structure). Anything that is kept live will also keep all its referents live, so the backing store is thus kept live also. If the slice backing store contained pointers, anything those pointed to would also be kept live, etc.
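A minimal sketch of the zeroization example from the proposal, rewritten on that basis (relying, as clarified further below, on runtime.KeepAlive also forcing the preceding stores to happen):

package main

import "runtime"

func main() {
	var key [32]byte

	// ... cryptographic operations with the key ...

	for i := range key {
		key[i] = 0
	}
	// Passing &key keeps the array itself reachable here, so the
	// zeroing loop above cannot be treated as dead stores.
	runtime.KeepAlive(&key)
}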

@nsajko commented Jul 30, 2019

@randall77 Sorry for bothering you again, but, in your last message, did you mean "live" just in the GC sense, or also in the "prevents dead code elimination" sense?

@randall77 commented Jul 30, 2019

I mean in the "live" sense. Dead code elimination isn't observable in the language.

@awnumar commented Jul 30, 2019

@randall77 So when you said,

I think runtime.KeepAlive should work as currently implemented (for both uses discussed here).

How does this hold?

@randall77 commented Jul 30, 2019

Sorry, I was talking about the language semantics. Yes, runtime.KeepAlive also makes sure its argument is computed.

Anything passed to runtime.KeepAlive must be computed and available at the runtime.KeepAlive call. As if someone might read it and check its contents. Whether that hypothetical someone is the GC or another goroutine.
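Applied to the benchmark example from the proposal, that gives roughly the following sketch (using math.Acos as before; whether the interface conversion at the KeepAlive call adds measurable overhead is a separate question):

package math_test

import (
	"math"
	"runtime"
	"testing"
)

func BenchmarkAcos(b *testing.B) {
	for i := 0; i < b.N; i++ {
		x := math.Acos(.5)
		// The result must be computed and available at this call,
		// so the Acos call cannot be eliminated as dead code.
		runtime.KeepAlive(x)
	}
}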

nsajko added a commit to nsajko/memguard that referenced this issue Jul 31, 2019

core: in Wipe and Scramble, keep buf alive after being written to

This is to prevent dead store elimination.

See the discussion at golang/go#33325.
nsajko added a commit to nsajko/go-1 that referenced this issue Jul 31, 2019

math: replace exported variables with runtime.KeepAlive in benchmarks
Use runtime.KeepAlive for preventing benchmarked code from being
eliminated as "dead" code (by marking its return values as live),
instead of relying on "package level exported sink variable" hacks that
are far from obvious to those not familiar with the internals of the Go
compilers and also do not seem to be robust in the long term (I mean,
it seems like an update to one of the Go implementations could easily
break benchmarks by doing dead code elimination of benchmarked code
regardless of those sink variables).

The discussion that led to this patch happening:
golang#33325

Also changed the BenchmarkDim benchmark a bit since it seemed very
possible it would be optimized out otherwise (because subtracting zero
from any number is a non-operation).

Some other benchmarks other than BenchmarkDim also experienced speedups
or slowdowns, which is quite mysterious to me in some cases (although I
have not checked the assembly) because KeepAlive is supposed to be
intrinsified?

Benchstat output, benchmarked with tip on Linux AMD64 on Intel i5-8300H:

name                   old time/op  new time/op   delta
Acos-8                 13.6ns ± 0%   13.8ns ± 0%   +1.72%  (p=0.000 n=13+15)
Acosh-8                18.7ns ± 1%   19.1ns ± 0%   +2.37%  (p=0.000 n=14+13)
Asin-8                 10.7ns ± 1%   11.0ns ± 1%   +3.33%  (p=0.000 n=14+15)
Asinh-8                24.6ns ± 1%   25.2ns ± 0%   +2.44%  (p=0.000 n=15+15)
Atan-8                 6.68ns ± 0%   6.96ns ± 0%   +4.24%  (p=0.000 n=14+15)
Atanh-8                20.9ns ± 0%   21.1ns ± 0%   +0.89%  (p=0.000 n=15+15)
Atan2-8                12.4ns ± 1%   12.7ns ± 0%   +2.36%  (p=0.000 n=15+15)
Cbrt-8                 10.9ns ± 0%   11.9ns ±11%   +9.54%  (p=0.000 n=13+15)
Ceil-8                 2.01ns ± 0%   2.01ns ± 0%     ~     (all equal)
Copysign-8             0.75ns ± 1%   0.75ns ± 0%     ~     (p=0.902 n=14+16)
Cos-8                  9.76ns ± 0%   9.88ns ± 0%   +1.23%  (p=0.000 n=15+13)
Cosh-8                 13.7ns ± 2%   13.8ns ± 0%   +0.83%  (p=0.001 n=15+13)
Erf-8                  6.86ns ± 0%   7.19ns ± 0%   +4.84%  (p=0.000 n=14+14)
Erfc-8                 8.10ns ± 1%   8.29ns ± 1%   +2.31%  (p=0.000 n=15+16)
Erfinv-8               8.78ns ± 0%   9.08ns ± 0%   +3.33%  (p=0.000 n=14+14)
Erfcinv-8              8.95ns ± 0%   9.07ns ± 0%   +1.39%  (p=0.000 n=14+15)
Exp-8                  8.56ns ± 0%   8.60ns ± 0%   +0.48%  (p=0.000 n=14+15)
ExpGo-8                22.5ns ± 1%   22.3ns ± 0%   -0.54%  (p=0.000 n=16+15)
Expm1-8                11.8ns ± 0%   12.1ns ± 0%   +2.54%  (p=0.000 n=16+14)
Exp2-8                 20.8ns ± 0%   21.3ns ± 1%   +2.23%  (p=0.000 n=14+14)
Exp2Go-8               20.9ns ± 0%   21.7ns ± 1%   +3.88%  (p=0.000 n=14+15)
Abs-8                  0.38ns ± 0%   0.38ns ± 0%     ~     (p=0.082 n=15+13)
Dim-8                  0.63ns ± 0%   1.21ns ±47%  +92.18%  (p=0.000 n=16+16)
Floor-8                2.01ns ± 1%   2.01ns ± 0%   -0.14%  (p=0.000 n=14+16)
Max-8                  2.44ns ±12%   2.55ns ±31%     ~     (p=0.187 n=16+14)
Min-8                  3.02ns ±49%   3.41ns ±44%     ~     (p=0.115 n=16+16)
Mod-8                  41.9ns ±32%   37.4ns ±34%     ~     (p=0.115 n=16+16)
Frexp-8                5.17ns ±39%   4.31ns ±26%     ~     (p=0.116 n=16+14)
Gamma-8                11.5ns ±54%   10.8ns ±49%     ~     (p=0.291 n=16+16)
Hypot-8                3.93ns ±23%   3.22ns ± 3%     ~     (p=0.087 n=16+13)
HypotGo-8              5.92ns ±27%   6.25ns ±45%     ~     (p=0.830 n=16+16)
Ilogb-8                3.59ns ±49%   3.73ns ±46%   +4.01%  (p=0.016 n=16+16)
J0-8                   50.0ns ± 0%   59.9ns ±47%  +19.74%  (p=0.000 n=13+16)
J1-8                   50.1ns ± 0%   50.8ns ± 0%   +1.40%  (p=0.000 n=13+13)
Jn-8                    108ns ± 0%    108ns ± 0%     ~     (all equal)
Ldexp-8                5.67ns ±15%   5.56ns ± 0%     ~     (p=0.831 n=13+15)
Lgamma-8               14.3ns ±52%   13.5ns ±49%     ~     (p=0.316 n=16+16)
Log-8                  8.95ns ± 1%  10.24ns ±46%  +14.49%  (p=0.000 n=13+15)
Logb-8                 4.00ns ±48%   3.92ns ±50%     ~     (p=0.387 n=16+15)
Log1p-8                16.7ns ±43%   13.6ns ± 1%     ~     (p=0.241 n=16+13)
Log10-8                11.2ns ± 2%   11.6ns ± 0%   +3.43%  (p=0.000 n=13+14)
Log2-8                 5.90ns ±48%   6.32ns ±45%     ~     (p=0.130 n=15+16)
Modf-8                 4.73ns ±31%   4.70ns ±28%     ~     (p=0.993 n=16+16)
Nextafter32-8          4.66ns ±37%   4.55ns ±31%     ~     (p=0.299 n=16+16)
Nextafter64-8          3.37ns ±15%   3.95ns ±44%     ~     (p=0.133 n=14+16)
PowInt-8               32.4ns ±34%   31.9ns ±37%     ~     (p=0.830 n=16+16)
PowFrac-8              66.7ns ± 0%   77.3ns ±50%     ~     (p=0.160 n=13+16)
Pow10Pos-8             1.01ns ± 1%   1.26ns ±21%  +25.36%  (p=0.000 n=16+15)
Pow10Neg-8             1.54ns ±41%   1.74ns ±28%     ~     (p=0.069 n=16+16)
Round-8                2.53ns ±39%   2.47ns ±41%     ~     (p=0.229 n=16+16)
RoundToEven-8          0.57ns ± 1%   0.63ns ± 2%  +11.42%  (p=0.000 n=13+13)
Remainder-8            27.4ns ± 1%   26.9ns ± 0%   -1.80%  (p=0.000 n=13+13)
Signbit-8              0.50ns ± 1%   0.75ns ± 0%  +49.28%  (p=0.000 n=13+13)
Sin-8                  9.09ns ± 0%   9.49ns ±19%   +4.45%  (p=0.000 n=15+13)
Sincos-8               11.5ns ± 1%   11.7ns ± 0%   +1.73%  (p=0.000 n=15+13)
Sinh-8                 13.7ns ± 3%   13.9ns ± 0%     ~     (p=0.628 n=16+15)
SqrtIndirect-8         2.26ns ± 1%   2.45ns ±32%     ~     (p=0.075 n=14+16)
SqrtLatency-8          3.27ns ± 1%   3.51ns ±46%     ~     (p=0.490 n=14+14)
SqrtIndirectLatency-8  5.78ns ± 0%   6.03ns ± 0%   +4.33%  (p=0.000 n=16+16)
SqrtGoLatency-8        34.7ns ± 0%   35.0ns ± 0%   +0.88%  (p=0.000 n=16+16)
SqrtPrime-8            2.53µs ± 0%   2.56µs ± 0%   +1.34%  (p=0.000 n=16+16)
Tan-8                  9.59ns ± 0%   9.88ns ± 0%   +3.04%  (p=0.000 n=16+15)
Tanh-8                 14.3ns ± 3%   14.5ns ± 0%   +1.58%  (p=0.001 n=16+13)
Trunc-8                2.01ns ± 0%   2.01ns ± 0%     ~     (all equal)
Y0-8                   49.0ns ± 1%   49.7ns ± 0%   +1.56%  (p=0.000 n=16+15)
Y1-8                   49.0ns ± 1%   49.7ns ± 0%   +1.28%  (p=0.000 n=16+15)
Yn-8                    106ns ± 0%    106ns ± 0%     ~     (all equal)
Float64bits-8          0.50ns ± 0%   0.30ns ± 1%  -40.17%  (p=0.000 n=16+16)
Float64frombits-8      0.50ns ± 0%   0.28ns ± 0%  -43.82%  (p=0.000 n=15+16)
Float32bits-8          0.25ns ± 0%   0.28ns ± 0%  +12.25%  (p=0.000 n=16+12)
Float32frombits-8      0.25ns ± 0%   0.25ns ± 0%     ~     (p=0.704 n=16+16)
@ianlancetaylor commented Aug 27, 2019

Based on the discussion above, runtime.KeepAlive already does what the proposed new builtin function would do. So, closing.
