This repository has been archived by the owner on Nov 5, 2021. It is now read-only.

A panic occurred when using HTTP probe #222

Closed · mazing80 opened this issue Mar 19, 2019 · 8 comments

@mazing80 (Contributor) commented Mar 19, 2019

I am running multiple HTTP probes with a single cloudprober instance, but after some time (about 30 minutes to an hour), a panic occurs.

Environment:
OS: CentOS 7.6.1810
Docker: 18.09.3
Cloudprober: v0.10.1

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4ee41c]

goroutine 45 [running]:
bufio.(*Writer).Flush(0xc00012f680, 0xc000434100, 0x68)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/bufio/bufio.go:575 +0x5c
bufio.(*Writer).Write(0xc00012f680, 0xc000434100, 0x6a, 0x100, 0x1, 0x9, 0x3)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/bufio/bufio.go:611 +0xeb
net/http.(*response).write(0xc0004c2000, 0x6a, 0xc000434100, 0x6a, 0x100, 0x0, 0x0, 0x15eebc0, 0x0, 0x0)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/net/http/server.go:1551 +0x149
net/http.(*response).Write(0xc0004c2000, 0xc000434100, 0x6a, 0x100, 0x3, 0x3, 0x409813)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/net/http/server.go:1521 +0x56
fmt.Fprintf(0xf65da0, 0xc0004c2000, 0xea1527, 0x9, 0xc00069be00, 0x3, 0x3, 0x6a, 0x0, 0x0)
        /home/travis/.gimme/versions/go1.11.linux.amd64/src/fmt/print.go:189 +0xa5
github.com/google/cloudprober/surfacers/prometheus.New.func1(0xf65da0, 0xc0004c2000, 0xc000246300, 0xc0002e8a20, 0x56)
        /home/travis/gopath/src/github.com/google/cloudprober/surfacers/prometheus/prometheus.go:131 +0x1bf
github.com/google/cloudprober/surfacers/prometheus.(*PromSurfacer).writeData(0xc0003ea120, 0xf65da0, 0xc0004c2000)
        /home/travis/gopath/src/github.com/google/cloudprober/surfacers/prometheus/prometheus.go:360 +0x24d
github.com/google/cloudprober/surfacers/prometheus.New.func3(0xc0003ea120, 0xc0003ea180)
        /home/travis/gopath/src/github.com/google/cloudprober/surfacers/prometheus/prometheus.go:150 +0x12c
created by github.com/google/cloudprober/surfacers/prometheus.New
        /home/travis/gopath/src/github.com/google/cloudprober/surfacers/prometheus/prometheus.go:144 +0x1d7
@manugarg (Contributor)

@mazing80 Thanks a lot for reporting the issue.

This panic is very strange though. The goroutine that you pasted above is running the prometheus surfacer and is in bufio.Writer.Flush(). I can't imagine a reason for a panic in that function, as the data is already prepared by the time that function is called (unless there is a bug in net/http or bufio -- extremely unlikely). I think there must be other goroutines in the stack trace. Can you please paste them too? The panic may be occurring in some other goroutine.

Thanks once again.

@manugarg manugarg self-assigned this Mar 19, 2019
@mazing80 (Contributor, Author)

@manugarg Thank you for the quick reply!
Sadly, that's the entire stack trace. I agree it's a strange issue.
Anyway, I will also debug it on my end.

Thank you!

@manugarg (Contributor)

@mazing80 If you could share your config (privately if you'd prefer that), I can keep a cloudprober instance running with that config, or even look at the code for avenues for possible panics. Also, any other information -- for example, whether adding a specific probe or surfacer triggers it -- would help.

@mazing80 (Contributor, Author) commented Mar 28, 2019

@manugarg Sorry for the late reply.
Here is my sample configuration:

probe {
        name: "api-staging"
        type: HTTP
        targets {
                host_names: "host1,host2,host3"
        }

        interval_msec: 5000             # 5s
        timeout_msec: 2000              # 2s

        latency_unit: "ms"
        latency_distribution {
                explicit_buckets: "2,4,8,16,32,64,128,256,512"
        }

        http_probe {
                relative_url: "/health"
                port: 8080
        }
}

probe {
        name: "api-release"
        type: HTTP
        targets {
                host_names: "host1-prod,host2-prod,host3-prod"
        }

        interval_msec: 5000             # 5s
        timeout_msec: 2000              # 2s

        latency_unit: "ms"
        latency_distribution {
                explicit_buckets: "2,4,8,16,32,64,128,256,512"
        }

        http_probe {
                relative_url: "/health"
                port: 8080
        }
}
...

BTW, I found the cause.
Please see this code: https://github.com/google/cloudprober/blob/master/surfacers/prometheus/prometheus.go#L144-L159

If you look at the code, the handler registered with http.HandleFunc(ps.c.GetMetricsUrl(), ...) can be running multiple times concurrently, and a send on the shared done channel can be received by any of those currently running handlers. So a handler whose ps.writeData(w) call is still in progress can receive a done signal that was meant for a different request and return early. net/http then resets the underlying Writer while writeData is still writing to it, and eventually the panic occurs.
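
To make the race concrete, here is a minimal sketch of the pattern as I understand it (simplified, illustrative code only, not the actual cloudprober source; the type and function names are made up):

package main

import (
        "fmt"
        "net/http"
)

// promSurfacer: all HTTP handlers share a single "done" channel with
// the goroutine that actually writes the metrics data.
type promSurfacer struct {
        queryChan chan http.ResponseWriter
        done      chan struct{}
}

func newPromSurfacer() *promSurfacer {
        ps := &promSurfacer{
                queryChan: make(chan http.ResponseWriter),
                done:      make(chan struct{}, 1),
        }

        // Handler: hand the ResponseWriter to the writer goroutine and
        // wait on the *shared* done channel.
        http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
                ps.queryChan <- w
                // BUG: with two concurrent scrapes, this handler may receive
                // the signal meant for the other handler and return early.
                // net/http then finalizes its ResponseWriter while the writer
                // goroutine is still writing to it, which can panic inside
                // bufio.(*Writer).Flush.
                <-ps.done
        })

        // Writer goroutine: serializes writes of the metrics data.
        go func() {
                for w := range ps.queryChan {
                        fmt.Fprintf(w, "probe_metric 1\n")
                        ps.done <- struct{}{}
                }
        }()

        return ps
}

func main() {
        newPromSurfacer()
        http.ListenAndServe(":9313", nil)
}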

I have created PR #229 to fix this, so please take a look.
Thank you!

@manugarg (Contributor)

@mazing80 That's a great find. Excellent work :)

This bug is quite hard to reproduce because, even when serving multiple requests in parallel, it depends a bit on chance -- which handler reads from the "done" channel first. The logic is clearly flawed though. I'll review your PR. Thank you!
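
For anyone following along, the fix is roughly of this shape -- give each request its own done channel instead of sharing one -- though this is only a simplified sketch with made-up names, not the exact change in the PR:

package main

import (
        "fmt"
        "net/http"
)

// httpWriter pairs a ResponseWriter with a per-request done channel.
type httpWriter struct {
        w        http.ResponseWriter
        doneChan chan struct{}
}

type promSurfacer struct {
        queryChan chan *httpWriter
}

func newPromSurfacer() *promSurfacer {
        ps := &promSurfacer{queryChan: make(chan *httpWriter)}

        http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
                hw := &httpWriter{w: w, doneChan: make(chan struct{})}
                ps.queryChan <- hw
                // Closed only after *this* request's data has been written,
                // so a signal can never be consumed by the wrong handler.
                <-hw.doneChan
        })

        go func() {
                for hw := range ps.queryChan {
                        fmt.Fprintf(hw.w, "probe_metric 1\n")
                        close(hw.doneChan)
                }
        }()

        return ps
}

func main() {
        newPromSurfacer()
        http.ListenAndServe(":9313", nil)
}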

manugarg pushed a commit that referenced this issue Apr 5, 2019
Using a shared "done" channel may result in panic if an ongoing HTTP handler reads the channel before the handler for which the request has actually finished. See #222 for more details.

PiperOrigin-RevId: 242071124
@manugarg (Contributor) commented Apr 5, 2019

Fixed by: e086f8e.

@manugarg manugarg closed this as completed Apr 5, 2019
@ylwangs commented Apr 10, 2019

Perfect! I also encountered the same problem. I will verify whether this fix resolves it effectively.

@manugarg (Contributor)

@MrDragon1122 Thanks for reporting. Please do let us know what you find :)

pranj pushed a commit to yext/cloudprober that referenced this issue May 20, 2019
Using a shared "done" channel may result in panic if an ongoing HTTP handler reads the channel before the handler for which the request has actually finished. See google#222 for more details.

PiperOrigin-RevId: 242071124