runtime: defer/recover support in race detector #26813

Open
adilbaig opened this Issue Aug 5, 2018 · 17 comments

@adilbaig

adilbaig commented Aug 5, 2018

Hi Go Devs,

I've noticed a strange memory leak when running my Go program with the race detector. The program is an ad-hit tracker: it receives data from RabbitMQ and writes it to the database. When there are errors in the DB, the program leaks memory (the RES memory usage grows unchecked until the machine runs out of RAM).

Run the program (below) with and without the race detector to see the difference.

What version of Go are you using (go version)?

go version go1.10.3 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/adil/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/adil/go"
GORACE=""
GOROOT="/snap/go/2130"
GOTMPDIR=""
GOTOOLDIR="/snap/go/2130/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build049513405=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Run this program without the race detector:

> go build -o race experimental/rabbitmqmemoryleak.go && ./race

Observe the memory usage in top. It should flatline. Then run it with the race detector enabled:

> go build -race -o race experimental/rabbitmqmemoryleak.go && ./race

The memory usage grows unchecked.

package main

import (
	"database/sql"
	"log"

	"github.com/go-errors/errors"
	_ "github.com/lib/pq"
	"github.com/streadway/amqp"
)

// This program demonstrates a memory leak in Go's built-in race detector.
// You will need RabbitMQ and Postgres installed on your machine.

// First, run the program like this:
// > 	go build -o race experimental/rabbitmqmemoryleak.go && ./race
// Observe the 'RES' memory in `top`. It should level off after a bit
//
// Then, run the program like this:
// > 	go build -race -o race experimental/rabbitmqmemoryleak.go && ./race
// Watch as RAM usage grows unchecked

// Strangely, running `go run -race experimental/rabbitmqmemoryleak.go` does not leak memory.

func main() {

	var err error

	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	channel, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer channel.Close()

	queue := "testing-memory-leak"

	// 1. Create the queue
	var args amqp.Table
	_, err = channel.QueueDeclare(queue, false, true, false, false, args)
	defer channel.QueueDelete(queue, false, false, true)

	// 2. Fill in messages
	cnt := 100000
	log.Printf("Inserting %d messages", cnt)
	for index := 0; index < cnt; index++ {
		msg := `{"date":"2018-07-17 08:44:30.202358+02","campaignID":18275,"affiliateSiteID":70002,"materialID":969206,"referenceID":0,"referenceHash":null,"referenceName":null,"isUnique":true,"merchantCommission":"0","affiliateCommission":"0","IP":"37.74.164.6","postImpressionHash":"0ebdd4720ef7eba5f09a84b8c68367c6","universalFingerprintHash":"c47e486da2ff1a4695b1499052b09807","pushToDynamoDB":true,"timestamp":"1531809870","handlerType":"impression","host":"snipped","pid":7962,"userAgentID":null,"userAgentHash":"08ac6aeb466ccd8f05ad37a144afe0f8","userAgentName":"Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/58.0.3029.110 Safari\/537.36 Edge\/16.16299","requestHash":"29c2363878fee77174d0b1e2491ea93b"}`
		err := channel.Publish("", queue, true, false, amqp.Publishing{ContentType: "application/json", Body: []byte(msg), DeliveryMode: 2})
		if err != nil {
			log.Panic(err)
		}
	}
	log.Printf("Done")

	pgb, err := sql.Open("postgres", "host=localhost user=user password=pass dbname=db sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer pgb.Close()

	// 3. Listen for messages
	forever := make(chan bool)
	for i := 0; i < 10; i++ {
		go handler(queue, pgb)
	}
	<-forever
}

func handler(queue string, pgb *sql.DB) {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672")
	if err != nil {
		log.Fatal(err)
	}

	channel, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	for {
		message, ok, err := channel.Get(queue, false)
		if err != nil {
			log.Fatal(err)
		}
		if !ok {
			continue
		}

		log.Printf("%+v", message)

		err = someErroneousSQLTransaction(pgb)
		if err != nil {
			log.Printf("ERROR: %s", err)
			message.Nack(false, true)
		} else {
			message.Ack(false)
		}
	}
}

func someErroneousSQLTransaction(pgb *sql.DB) error {
	refTx, err := pgb.Begin()
	if err != nil {
		return errors.Wrap(err, 0)
	}
	defer func() {
		if err != nil {
			refTx.Rollback()
		} else {
			refTx.Commit()
		}
	}()
	_, err = refTx.Exec(`set transaction isolation level repeatable read`)
	if err != nil {
		return err
	}

	var referenceID int64
	switch err = pgb.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&referenceID); err {
	case sql.ErrNoRows:
		break
	case nil:
		break
	default:
		return err
	}

	return nil
}

What did you expect to see?

Flatline memory usage, as measured with top, when running with -race. I expect it to use more memory to maintain internal data structures, but it should flatline at some point.

What did you see instead?

Constantly growing RES memory usage in top.

Strangely, running go run -race experimental/rabbitmqmemoryleak.go does not leak memory.

@ianlancetaylor ianlancetaylor changed the title from Go's race detector is leaking memory on RabbitMQ and Postgres to runtime: Go's race detector is leaking memory on RabbitMQ and Postgres Aug 5, 2018

@ianlancetaylor ianlancetaylor added this to the Go1.12 milestone Aug 5, 2018

@ianlancetaylor

Contributor

ianlancetaylor commented Aug 5, 2018

@dvyukov

Member

dvyukov commented Aug 6, 2018

The difference between go build and go run looks strange. Do we have any known precedent for such a difference?

@dvyukov

Member

dvyukov commented Aug 6, 2018

@adilbaig can you provide step-by-step repro instructions?

@agnivade

Member

agnivade commented Aug 6, 2018

It is there in the code comments, I believe:

// This program demonstrates a memory leak in Go's built-in race detector.
// You will need RabbitMQ and Postgres installed on your machine.

// First, run the program like this:
// > 	go build -o race experimental/rabbitmqmemoryleak.go && ./race
// Observe the 'RES' memory in `top`. It should level off after a bit
//
// Then, run the program like this:
// > 	go build -race -o race experimental/rabbitmqmemoryleak.go && ./race
// Watch as RAM usage grows unchecked

Did you want anything specific?

@dvyukov

Member

dvyukov commented Aug 6, 2018

Something that I can copy-paste into my console.

@adilbaig

adilbaig commented Aug 6, 2018

@dvyukov How about this:

# Have RabbitMQ and Postgres set up (tested with version 9.6) on Ubuntu
sudo apt install rabbitmq-server postgresql-9.6
go get "github.com/streadway/amqp"
go build -o race rabbitmqmemoryleak.go && ./race
# Open `top` in another shell and see that memory usage (RES) is flat for this process. CTRL+C ./race.
go build -race -o race rabbitmqmemoryleak.go && ./race
# Now, open `top` in another shell and see the memory leak

You don't actually need to create the Postgres tables. I see the leak when the SQL errors out (for example, due to bad login credentials).

@dvyukov

Member

dvyukov commented Aug 6, 2018

I can reproduce it now.
The problem is that the program grows an infinite stack through infinite recursion:

...
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
...

So it will crash sooner or later without the race detector too; in the meantime it makes GC slow, crash messages unreadable, etc.
It's just that the race detector is more sensitive to thousands of unique stacks, because it memorizes tons of contextual information about the program as it runs. In this case it memorizes several variations of the stack per recursion frame, which results in quadratic memory consumption.
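
To make "quadratic" concrete, here is a back-of-the-envelope sketch (my own illustration, not taken from this thread; framesPerMessage is an assumed constant): if every failed message adds a fixed number of frames to the goroutine stack, and the detector records context proportional to the current stack depth for each new frame, the total recorded context grows roughly quadratically in the number of messages.

package main

import "fmt"

func main() {
	// Assumed: ~3 frames added per failed message
	// (handler -> someErroneousSQLTransaction -> Begin, as in the trace above).
	const framesPerMessage = 3
	total := 0
	for n := 1; n <= 100000; n++ {
		// The n-th message is recorded against a stack that is already
		// about n*framesPerMessage frames deep.
		total += n * framesPerMessage
	}
	fmt.Printf("~%d frames' worth of recorded stack context after 100000 messages\n", total)
}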

@adilbaig

adilbaig commented Aug 6, 2018

@dvyukov

So it will crash without race detector sooner or later too

I haven't seen that happen, as I ran it without the race detector over the weekend. Can you confirm that it's stable without -race?

I'm a novice at Go but, given your explanation, am I right to think this will be an issue for any evented system (that errors out on events)?

@dvyukov

Member

dvyukov commented Aug 6, 2018

Can you confirm that it's stable without -race?

No.

I'm a novice at Go, but, given your explanation, am i right to think this will be an issue for any evented system (that errors out on events).?

I don't know about these libraries, but the first thing that comes to mind is that there should be a way to use them without infinite recursion. You both have an infinite loop in handler and call it recursively. I would expect that you need to do only one of these things, not both. Have you looked at other usage examples of these libraries? Do all of them use them in the same way?

@adilbaig

adilbaig commented Aug 7, 2018

I don't know about these libraries, but the first thing that comes to mind is that there should be a way to use them without infinite recursion. You both have an infinite loop in handler and call it recursively. I would expect that you need to do only one of these things, not both.

I guess it only looks recursive in the dump, but it isn't; it's just running in a loop, one message after another, erroring out on every message. So I'm guessing that with the race detector on, a reference to something in the sql package is still being bandied around. I tried replacing the sql error with a plain fmt.Errorf, and memory usage remained flat. So it isn't any error that causes the leak, but something specific to the sql library. That's my guess.

	for {
		message, ok, err := channel.Get(queue, false)
		if err != nil {
			log.Fatal(err)
		}
		if !ok {
			continue
		}

		log.Printf("%+v", message)

		// err = someErroneousSQLTransaction(pgb)
		err = fmt.Errorf("FAKE ERROR!") // Didn't leak
		if err != nil {
			log.Printf("ERROR: %s", err)
			message.Nack(false, true)
		} else {
			message.Ack(false)
		}
	}
...

Have you looked at other usage examples of these libraries? Do all of them use it in the same way?

Yep, it's largely just fetching and processing a message in a loop. Here's a canonical example from the site: https://www.rabbitmq.com/tutorials/tutorial-one-go.html (Code is at https://github.com/rabbitmq/rabbitmq-tutorials/blob/master/go/receive.go)
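
For reference, a condensed sketch of that tutorial's consume-in-a-loop pattern (paraphrased rather than copied verbatim, untested here, and reusing the queue name from the reproducer above): a single goroutine ranges over a delivery channel, with no recursion involved.

package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Consume delivers messages on a channel instead of polling with Get.
	msgs, err := ch.Consume("testing-memory-leak", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range msgs {
		log.Printf("received: %s", d.Body)
		d.Ack(false)
	}
}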

@dvyukov

Member

dvyukov commented Aug 8, 2018

I guess it only looks recursive in the dump, but it isn't, it's just running in a loop one message after another erroring out at every message.

No, it calls someErroneousSQLTransaction, which calls sql.(*DB).Begin, which calls handler again.

@agnivade

Member

agnivade commented Aug 11, 2018

Maybe I am missing something, but I fail to see how a sql.(*DB).Begin() call can call into user-level code.

Also, @adilbaig, you are creating a transaction but are not using the tx to make your queries; instead, you are using the main db handle itself to make the QueryRow call. Any reason for that? If you want your queries to go through the transaction, you should use the tx handle itself to make them.

Also, we already have options for transaction isolation levels: https://golang.org/pkg/database/sql/#IsolationLevel. You can set them directly when creating a tx object.

My suspicion is that creating multiple transactions from multiple goroutines is somehow causing this behavior in the race detector. Can you try pgb.BeginTx() with opts set to repeatable read, use the tx to make the query, and see if it makes any difference?

@adilbaig

adilbaig commented Aug 12, 2018

@agnivade That was a mistake. Thanks for pointing it out!

Also, we already have options to transaction types ..

Nice! I will use that in my code.

The change doesn't make a difference, though. I also think the bug resides somewhere in sql creating connections (not necessarily transactions) inside goroutines. Here's the code:

func someErroneousSQLTransaction(pgb *sql.DB) error {
	ctx := context.Background()
	refTx, err := pgb.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			refTx.Rollback()
		} else {
			refTx.Commit()
		}
	}()

	var referenceID int64
	switch err = refTx.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&referenceID); err {
	case sql.ErrNoRows:
		break
	case nil:
		break
	default:
		return err
	}

	return nil
}

@agnivade

Member

agnivade commented Aug 12, 2018

If that is the case, then I think we can take RabbitMQ out of the equation. Perhaps you can shorten the test by just spawning 10 goroutines which share a *sql.DB connection and perform the same query, then run that with the race detector and see what happens.
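
A minimal sketch of what such a stripped-down test might look like (the connection string and query are placeholders carried over from the reproducer above; this only suggests the shape of the test and is not a verified leak reproducer):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder credentials; as in the original report, the queries are
	// expected to keep failing.
	db, err := sql.Open("postgres", "host=localhost user=user password=pass dbname=db sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for i := 0; i < 10; i++ {
		go func() {
			for {
				var id int64
				if err := db.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&id); err != nil {
					log.Printf("ERROR: %s", err)
				}
			}
		}()
	}
	select {} // block forever; watch RES in top with and without -race
}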

Also, we are still unclear as to why the behavior does not show up with go run -race, only with go build -race. @dvyukov, do you have any clues on that?

@adilbaig

adilbaig commented Aug 12, 2018

@agnivade - Well, no; it only leaks when you're processing RMQ messages. Plain channels throwing sql errors don't leak. The combination of the two libraries is playing a part.

@dvyukov

Member

dvyukov commented Aug 13, 2018

Right. It seems that the trace is indeed broken.
The first thing that comes to mind is broken panic/defer/recover support. On recover we should unwind the race stack accordingly, calling racefuncexit the necessary number of times, but I don't see this code at all. Did we ever do this? Does this program recover from panics? Defer should work out of the box, at least that was the intention, but I wonder if it somehow leads to unmatched racefuncenter/racefuncexit calls.
I am travelling for the next 3 weeks and most likely won't be able to look closer.

@dvyukov dvyukov changed the title from runtime: Go's race detector is leaking memory on RabbitMQ and Postgres to runtime: defer/recover support in race detector Sep 2, 2018

@dvyukov

Member

dvyukov commented Sep 2, 2018

Here is a minimized reproducer:

package main

var x uint64

func main() {
	go func() {
		x++
	}()
	for i := 0; i < 10; i++ {
		baz()
	}
	x++
}

func baz() {
	defer func() {
		recover()
	}()
	foo()
}

func foo() {
	panic(0)
}

$ go version
go version devel +c9cc20bd3a Sat Sep 1 15:43:42 2018 +0000 linux/amd64

$ go run -race -gcflags=-l /tmp/test.go
==================
WARNING: DATA RACE
Read at 0x0000005253e8 by main goroutine:
  main.main()
      /tmp/test.go:12 +0x72
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54

Previous write at 0x0000005253e8 by goroutine 5:
  main.main.func1()
      /tmp/test.go:7 +0x56

Goroutine 5 (finished) created at:
  main.main()
      /tmp/test.go:6 +0x46
==================

The repeated frames must not be present in the trace.
The root cause is that we don't unwind the tsan stack on panic, i.e. we do not call racefuncexit for each unwound frame. The bug is more visible with recover, but strictly speaking it affects plain defer too. Consider that a race happens in a deferred func, or that the main application loop runs in a deferred function.
This is not a regression, panics were never properly supported.
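
As a toy illustration of the unmatched racefuncenter/racefuncexit pairs described above (a simplified model of the shadow stack, not the actual runtime or compiler instrumentation):

package main

import "fmt"

var shadow int // simulated depth of the per-goroutine tsan shadow stack

func racefuncenter() { shadow++ } // stands in for the instrumented function prologue
func racefuncexit()  { shadow-- } // stands in for the instrumented function epilogue

func foo() {
	racefuncenter()
	panic(0)
	// the matching racefuncexit before foo's normal return is never reached
}

func baz() {
	racefuncenter()
	defer func() {
		recover()
		// nothing pops the frames that the panic unwound, so foo's and
		// baz's racefuncenter calls stay on the simulated shadow stack
	}()
	foo()
	racefuncexit() // skipped because foo panicked
}

func main() {
	for i := 0; i < 10; i++ {
		baz()
	}
	fmt.Println("leaked shadow frames:", shadow) // prints 20 after 10 iterations
}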

We could either defer racefuncexit instead of calling it, or manually call racefuncexit the necessary number of times before calling deferred functions (can we figure out how many frames we skip?). In both cases we need to be careful handling runtime functions, because the runtime is not instrumented, so we could potentially call racefuncexit too many times, or rely on racefuncexit being deferred when it's actually not.
