Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: runtime: add a pointer check code to gcWriteBarrier for helping debugging zombie pointers #51276

Open
octobered opened this issue Feb 20, 2022 · 8 comments
Labels
Projects
Milestone

Comments

@octobered
Copy link

@octobered octobered commented Feb 20, 2022

Background

Zombie pointer problem is very difficult to debug. Now GoGC will only report this issue in scanning objects, even though Go GC will panic here, information is not enough for debugging (reportZombies will print this span, but won't tell developer how this pointer collected to gcw, developer will have to find out where this memory address comes from.).

Currently there are two ways that could put pointers to gcw: greyobject and gcWriteBarrier (and its cx, dx... derivation). greyobject already has debug.gccheckmark flag to validate if this pointer is ok, however gcWriteBarrier has no corresponding way to do the validation.

For example, a misuse atomic.StorePointer won't have any errors reported, if developers don't pay attention to the correctness of this value. (Espcially when this field is an ID or some other int64 fields have no real world meaning) code snippet

package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

type struct1 struct {
	Field int64
}

type struct1ptr *struct1

var s1 struct1ptr = &struct1{
	1,
}

func main() {
	f1 := struct1{
		Field: 0,
	}
	atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(s1)), unsafe.Pointer(&f1))
	fmt.Println(s1.Field)
}

Moreover, if this function is working under heavy load, this wrong assign may leads to a zombie pointer which is hard to debug, link As this panic will only tell developers which mspan contains zombie pointer, but no information about where this pointer is allocated.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
	"unsafe"
)

var current time.Time

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			rc := time.Now()
			atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&current)), unsafe.Pointer(&rc))
		}()
		go func() {
			defer wg.Done()
			fmt.Printf("%v\n", current.Unix())
		}()
		runtime.GC()
	}
	wg.Wait()
	fmt.Println("===over===")
}

Another case is unsafe type conversion. playground Same UserInfo from different RPCs may leads developer to use unsafe.Pointer to elimate new allocation performance loss and avoid writing too long code for mapping field. But after IDL files updated, UserInfoFromARPC and UserInfoFromBRPC may be different, and there is a high probability that the newly added fields will be of pointer type, since new fields are often optional.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"unsafe"
)

type UserInfoFromARPC struct {
	ID  int64
	Age int64
}

type UserInfoFromBRPC struct {
	ID    int64
	Age   int64
	Extra *string // this field was introduced after UnsafeQueryB is finished, and generated by some idl generator
}

func Query() *UserInfoFromARPC {
	return &UserInfoFromARPC{
		ID:  100,
		Age: 100,
	}
}

func UnsafeQueryB() *UserInfoFromBRPC {
	return (*UserInfoFromBRPC)(unsafe.Pointer(Query()))
}
func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			fmt.Println(UnsafeQueryB())
			wg.Done()
		}()
		runtime.GC()
	}
	wg.Wait()
}

Proposal

Idea in brief

Add a macro-controlled code section in gcWriteBarrier, gcWriteBarrier will try to check pointer validation before store it. If the pointer is illegal, panic here. In general, the panic stack will have the frame of the wrong "allocation" happened function. Developer could use this to find their bugs easier.

Why choose to modify gcWriteBarrier

ZombiePointers will only happened when this wrong pointer has been esacped to heap. If a pointer doesn't escaped to heap, it won't cause zombie pointer at least.

Why choose to using a macro instead of a flag

As gcWriteBarrier is a frequently called function, using macro will avoid changing regular code. Only when developers find there is a zombie pointer in their program and it's reproducible, they could compile codes with -+ and this flag to recompile runtime and enable these tracking code, and waiting for panic to find out where the bug is.

Cons

  1. Controlling compiled code by c macro isn't a common way in go

but maybe least performance overhead?

PS: should we modify reportZombies to make it not panic when invalidptr=0 is set?

@gopherbot gopherbot added this to the Proposal milestone Feb 20, 2022
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Feb 20, 2022
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 20, 2022

Go doesn't have macros, so it's not clear to me exactly what you are suggesting.

gcWriteBarrier is run very frequently, so I agree that we should not add a run-time check for whether to check that the pointer is valid.

Perhaps we could do this with build tags, but I have a feeling that few people would such a feature in practice.

CC @golang/runtime

@ericlagergren
Copy link
Contributor

@ericlagergren ericlagergren commented Feb 21, 2022

The first example seems like something that vet should catch. It is rather easy to either forget that the API is actually func StorePointer(dst **T, t *T) or to typo the unwieldily (*unsafe.Pointer)(...) expression. I make this mistake basically every time I use the API. A vet check would catch the problem at its source.

I think that it might be possible for vet to catch the second case. It would a false positive if UserInfoFromARPC were allocated off Go's heap (like C.malloc or syscall.Mmap), but that could be worked around by using //go:notinheap. But I don't know.

If vet could catch these then there wouldn't be a need for runtime checks.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Feb 22, 2022

FWIW, in the past for debugging I tried to set the write barrier buffer size to 1, so it will always go through the slow path, which has at least some check e.g. in findObject, and it was helpful. Perhaps we can make a GODEBUG mode that set the the buffer size to 1.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 22, 2022

I think that you can actually get the effect of a buffer size of 1 today by using GODEBUG=cgocheck=2. Though that's hardly intuitive.

@guodongli-google
Copy link

@guodongli-google guodongli-google commented Feb 22, 2022

A vet checker looks plausible: it can strip the unsafe.Pointer() conversions and obtain the original types, and check whether the types of the two arguments:

atomic.StorePointer(T1, T2)

Valid cases include:

  • T1 = *T2
  • T1 = *T3, T2 is compatible with T3.

The check can easily go transitively over the types, e.g. when T2 and T3 are structs with fields of named types. So the checker may narrow down the patterns it will report, e.g. only report the case when T1 = T2, which covers the examples:

atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(s1)), unsafe.Pointer(&f1))

and

atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&current)), unsafe.Pointer(&rc))

The frequency of these patterns in real code is a question though.

@octobered
Copy link
Author

@octobered octobered commented Feb 23, 2022

The first example seems like something that vet should catch. It is rather easy to either forget that the API is actually func StorePointer(dst **T, t *T) or to typo the unwieldily (*unsafe.Pointer)(...) expression. I make this mistake basically every time I use the API. A vet check would catch the problem at its source.

I think that it might be possible for vet to catch the second case. It would a false positive if UserInfoFromARPC were allocated off Go's heap (like C.malloc or syscall.Mmap), but that could be worked around by using //go:notinheap. But I don't know.

If vet could catch these then there wouldn't be a need for runtime checks.

I get this point, vet would be a less intrusive way. Actually I'm trying to developing a plugin under golangci-lint so that if anyone meets this problem, they could use it to debug. ( I think these codes are not very common, should I add it go go vet instead of golangci-lint?

However the actual problem I concern it's that two examples I gave are just coming from my personal knowledge, I don't know other cases that could trigger this.

I think that you can actually get the effect of a buffer size of 1 today by using GODEBUG=cgocheck=2. Though that's hardly intuitive.

Well they are kinda different actually, I think you mean when cgocheck=2, cgoCheckWriteBarrier will be called. cgoCheckWriteBarrier seems will not check it dst and src are both Go pointers, as this check depends on if this pointer smaller than span limit, which they are "in span" very likely. (These zombie pointers are smaller than span, but they are larger than freeindex and their allocBits are unset. They could be in span and zombie.)

Go doesn't have macros, so it's not clear to me exactly what you are suggesting.

gcWriteBarrier is run very frequently, so I agree that we should not add a run-time check for whether to check that the pointer is valid.

Perhaps we could do this with build tags, but I have a feeling that few people would such a feature in practice.

CC @golang/runtime

I do find out go rarely use macros, however gcWriteBarrier is assemble, I think I might could introduce a macro like GOAMD64_v2/3/4 and it's controlled by flags? (I haven't read related code, it's just my guess😂

@octobered
Copy link
Author

@octobered octobered commented Feb 23, 2022

and another thing I think maybe should be pointed out is invalidptr can not fully disable zombie pointer check. invalidptr will only affect if this pointer is collected by findObject, if this pointer is marked by gcWriteBarrier, this will just panic in reportZombies, which is not controlled by invalidptr.

I am not very sure but is it ok to make reportZombies controlled by invalidptr?

@randall77
Copy link
Contributor

@randall77 randall77 commented Feb 24, 2022

Another "fix" for this problem is to just wait for #50860 to be accepted and implemented. Then we can have people move to using the generic (*Pointer[T]).Store which doesn't have the same confusion that the current StorePointer has.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Proposals
Incoming
Development

No branches or pull requests

7 participants