This repository has been archived by the owner on Nov 5, 2021. It is now read-only.

A panic occurred when using HTTP probe #222

Closed · mazing80 opened this issue Mar 19, 2019 · 8 comments

@mazing80 (Contributor) commented Mar 19, 2019

I am running multiple HTTP probes with a single cloudprober instance, but after some time (about 30 minutes to an hour), a panic occurs.

Environment:
OS: CentOS 7.6.1810
Docker: 18.09.3
Cloudprober: v0.10.1

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4ee41c]

goroutine 45 [running]:
bufio.(*Writer).Flush(0xc00012f680, 0xc000434100, 0x68)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/bufio/bufio.go:575 +0x5c
bufio.(*Writer).Write(0xc00012f680, 0xc000434100, 0x6a, 0x100, 0x1, 0x9, 0x3)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/bufio/bufio.go:611 +0xeb
net/http.(*response).write(0xc0004c2000, 0x6a, 0xc000434100, 0x6a, 0x100, 0x0, 0x0, 0x15eebc0, 0x0, 0x0)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/net/http/server.go:1551 +0x149
net/http.(*response).Write(0xc0004c2000, 0xc000434100, 0x6a, 0x100, 0x3, 0x3, 0x409813)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/net/http/server.go:1521 +0x56
fmt.Fprintf(0xf65da0, 0xc0004c2000, 0xea1527, 0x9, 0xc00069be00, 0x3, 0x3, 0x6a, 0x0, 0x0)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/fmt/print.go:189 +0xa5
github.com/google/cloudprober/surfacers/prometheus.New.func1(0xf65da0, 0xc0004c2000, 0xc000246300, 0xc0002e8a20, 0x56)
        /home/travis/gopath/src/github.com/google/cloudprober/surfacers/prometheus/prometheus.go:131 +0x1bf
github.com/google/cloudprober/surfacers/prometheus.(*PromSurfacer).writeData(0xc0003ea120, 0xf65da0, 0xc0004c2000)
        /home/travis/gopath/src/github.com/google/cloudprober/surfacers/prometheus/prometheus.go:360 +0x24d
github.com/google/cloudprober/surfacers/prometheus.New.func3(0xc0003ea120, 0xc0003ea180)
        /home/travis/gopath/src/github.com/google/cloudprober/surfacers/prometheus/prometheus.go:150 +0x12c
created by github.com/google/cloudprober/surfacers/prometheus.New
        /home/travis/gopath/src/github.com/google/cloudprober/surfacers/prometheus/prometheus.go:144 +0x1d7
@manugarg (Contributor)

@mazing80 Thanks a lot for reporting the issue.

This panic is very strange though. The goroutine that you pasted above is running the prometheus surfacer and is in bufio.Writer.Flush(). I can't imagine a reason for a panic in that function, as the data is already prepared by the time that function is called (unless there is a bug in net/http or bufio -- extremely unlikely). I think there must be other goroutines in the stack trace. Can you please paste them too? The panic may be occurring in some other goroutine.

Thanks once again.

@manugarg manugarg self-assigned this Mar 19, 2019
@mazing80 (Contributor, Author)

@manugarg Thank you for the quick reply!
Sadly, that's the entire stack trace. I agree it's a strange issue.
Anyway, I will also debug it on my end.

Thank you!

@manugarg (Contributor)

@mazing80 If you could share your config (privately if you'd prefer that), I can keep a cloudprober instance running with that config, or even look at the code for avenues for possible panics. Also, any other information -- for example, whether adding a specific probe or surfacer triggers it -- would help.

@mazing80 (Contributor, Author) commented Mar 28, 2019

@manugarg Sorry for the late reply.
Here is my sample configuration:

probe {
        name: "api-staging"
        type: HTTP
        targets {
                host_names: "host1,host2,host3"
        }

        interval_msec: 5000             # 5s
        timeout_msec: 2000              # 2s

        latency_unit: "ms"
        latency_distribution {
                explicit_buckets: "2,4,8,16,32,64,128,256,512"
        }

        http_probe {
                relative_url: "/health"
                port: 8080
        }
}

probe {
        name: "api-release"
        type: HTTP
        targets {
                host_names: "host1-prod,host2-prod,host3-prod"
        }

        interval_msec: 5000             # 5s
        timeout_msec: 2000              # 2s

        latency_unit: "ms"
        latency_distribution {
                explicit_buckets: "2,4,8,16,32,64,128,256,512"
        }

        http_probe {
                relative_url: "/health"
                port: 8080
        }
}
...

BTW, I found the cause.
Please see this code: https://github.com/google/cloudprober/blob/master/surfacers/prometheus/prometheus.go#L144-L159

If you look at the code, the handler registered with http.HandleFunc(ps.c.GetMetricsUrl(), ...) can be running multiple times concurrently, and a send on the shared done channel can be received by any of those currently running handlers. So a handler whose ps.writeData(w) call is still in progress can receive a done signal that was meant for a different request and return early. net/http then resets the underlying Writer while writeData is still writing to it, and eventually the panic occurs.
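
To make the race concrete, here is a minimal sketch of the pattern as I understand it (simplified, illustrative code only, not the actual cloudprober source; the type and function names are made up):

package main

import (
        "fmt"
        "net/http"
)

// promSurfacer: all HTTP handlers share a single "done" channel with
// the goroutine that actually writes the metrics data.
type promSurfacer struct {
        queryChan chan http.ResponseWriter
        done      chan struct{}
}

func newPromSurfacer() *promSurfacer {
        ps := &promSurfacer{
                queryChan: make(chan http.ResponseWriter),
                done:      make(chan struct{}, 1),
        }

        // Handler: hand the ResponseWriter to the writer goroutine and
        // wait on the *shared* done channel.
        http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
                ps.queryChan <- w
                // BUG: with two concurrent scrapes, this handler may receive
                // the signal meant for the other handler and return early.
                // net/http then finalizes its ResponseWriter while the writer
                // goroutine is still writing to it, which can panic inside
                // bufio.(*Writer).Flush.
                <-ps.done
        })

        // Writer goroutine: serializes writes of the metrics data.
        go func() {
                for w := range ps.queryChan {
                        fmt.Fprintf(w, "probe_metric 1\n")
                        ps.done <- struct{}{}
                }
        }()

        return ps
}

func main() {
        newPromSurfacer()
        http.ListenAndServe(":9313", nil)
}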

I have created PR #229 to fix this, so please take a look.
Thank you!

@manugarg (Contributor)

@mazing80 That's a great find. Excellent work :)

This bug is quite hard to reproduce because, even when serving multiple requests in parallel, it depends a bit on chance -- which handler reads from the "done" channel first. The logic is clearly flawed though. I'll review your PR. Thank you!
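
For anyone following along, the fix is roughly of this shape -- give each request its own done channel instead of sharing one -- though this is only a simplified sketch with made-up names, not the exact change in the PR:

package main

import (
        "fmt"
        "net/http"
)

// httpWriter pairs a ResponseWriter with a per-request done channel.
type httpWriter struct {
        w        http.ResponseWriter
        doneChan chan struct{}
}

type promSurfacer struct {
        queryChan chan *httpWriter
}

func newPromSurfacer() *promSurfacer {
        ps := &promSurfacer{queryChan: make(chan *httpWriter)}

        http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
                hw := &httpWriter{w: w, doneChan: make(chan struct{})}
                ps.queryChan <- hw
                // Closed only after *this* request's data has been written,
                // so a signal can never be consumed by the wrong handler.
                <-hw.doneChan
        })

        go func() {
                for hw := range ps.queryChan {
                        fmt.Fprintf(hw.w, "probe_metric 1\n")
                        close(hw.doneChan)
                }
        }()

        return ps
}

func main() {
        newPromSurfacer()
        http.ListenAndServe(":9313", nil)
}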

manugarg pushed a commit that referenced this issue Apr 5, 2019
Using a shared "done" channel may result in panic if an ongoing HTTP handler reads the channel before the handler for which the request has actually finished. See #222 for more details.

PiperOrigin-RevId: 242071124
@manugarg (Contributor) commented Apr 5, 2019

Fixed by: e086f8e.

@manugarg manugarg closed this as completed Apr 5, 2019
@ylwangs commented Apr 10, 2019

Perfect! I also encountered the same problem. I will verify whether this fix resolves it effectively.

@manugarg (Contributor)

@MrDragon1122 Thanks for reporting. Please do let us know what you find :)

pranj pushed a commit to yext/cloudprober that referenced this issue May 20, 2019
Using a shared "done" channel may result in panic if an ongoing HTTP handler reads the channel before the handler for which the request has actually finished. See google#222 for more details.

PiperOrigin-RevId: 242071124