0.10.0-nightly-8004f11 crashing multiple times a day! #5303

Closed
jhhayden opened this issue Jan 7, 2016 · 30 comments

Comments

@jhhayden

jhhayden commented Jan 7, 2016

I am accepting traffic via HTTP and also UDP. The engine is tsm1, on a single server with a completely clean install. It always "appears" to crash right after a query, but the same query can be run again after the restart and works fine. Here are snippets of some crashes from a few days ago:

[http] 2016/01/05 15:42:03 10.25.2.164 - - [05/Jan/2016:15:42:03 +0000] POST /write?db=myTestDB&precision=s&u=&p= HTTP/1.1 204 0 - EventMachine HttpClient de31571d-b3c2-11e5-9ccf-000000000000 1.847503ms
[query] 2016/01/05 15:42:05 SELECT recBytes FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND time > now() - 1h
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x612a56]

goroutine 1743764 [running]:
github.com/influxdb/influxdb/tsdb.(*RawExecutor).execute(0xc82b17c8a0, 0xc83d5bd3e0, 0xc83d5bd260)
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:229 +0xef6
created by github.com/influxdb/influxdb/tsdb.(*RawExecutor).Execute
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:59 +0x67
[http] 2016/01/05 17:45:07 10.25.2.164 - - [05/Jan/2016:17:45:07 +0000] POST /write?db=myTestDB&precision=s&u=&p= HTTP/1.1 204 0 - EventMachine HttpClient 0f0e0c73-b3d4-11e5-ac3f-000000000000 1.853793ms
[query] 2016/01/05 17:45:07 SELECT count(backendTime) FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND (elbStatusCode = '500' OR elbStatusCode = '504') AND time > now() - 30m GROUP BY time(1s)
[query] 2016/01/05 17:45:07 SELECT recBytes FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND time > now() - 30m
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x612a56]

goroutine 181186 [running]:
github.com/influxdb/influxdb/tsdb.(*RawExecutor).execute(0xc827bdb890, 0xc829caafc0, 0xc829caae40)
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:229 +0xef6
created by github.com/influxdb/influxdb/tsdb.(*RawExecutor).Execute
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:59 +0x67
[http] 2016/01/05 20:33:02 10.25.2.164 - - [05/Jan/2016:20:33:02 +0000] POST /write?db=myTestDB&precision=s&u=&p= HTTP/1.1 204 0 - EventMachine HttpClient 84934e4e-b3eb-11e5-9b14-000000000000 2.23985ms
[query] 2016/01/05 20:33:08 SELECT recBytes FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND time > now() - 30m
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x612a56]

goroutine 364592 [running]:
github.com/influxdb/influxdb/tsdb.(*RawExecutor).execute(0xc833450a80, 0xc825c06360, 0xc825c06180)
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:229 +0xef6
created by github.com/influxdb/influxdb/tsdb.(*RawExecutor).Execute
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:59 +0x67
[http] 2016/01/05 21:03:00 10.25.2.164 - - [05/Jan/2016:21:03:00 +0000] POST /write?db=myTestDB&precision=s&u=&p= HTTP/1.1 204 0 - EventMachine HttpClient b4663e99-b3ef-11e5-8d48-000000000000 2.154557ms
[query] 2016/01/05 21:03:08 SELECT recBytes FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND time > now() - 30m
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x612a56]

goroutine 10318 [running]:
github.com/influxdb/influxdb/tsdb.(*RawExecutor).execute(0xc823da2d20, 0xc82a736540, 0xc82a7363c0)
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:229 +0xef6
created by github.com/influxdb/influxdb/tsdb.(*RawExecutor).Execute
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:59 +0x67
[http] 2016/01/05 21:15:04 10.25.2.164 - - [05/Jan/2016:21:15:04 +0000] POST /write?db=myTestDB&precision=s&u=&p= HTTP/1.1 204 0 - EventMachine HttpClient 638db845-b3f1-11e5-98b0-000000000000 2.1595ms
[query] 2016/01/05 21:15:09 SELECT recBytes FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND time > now() - 30m
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x612a56]

goroutine 19107 [running]:
github.com/influxdb/influxdb/tsdb.(*RawExecutor).execute(0xc829bdfb90, 0xc8244c0780, 0xc8244c0600)
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:229 +0xef6
created by github.com/influxdb/influxdb/tsdb.(*RawExecutor).Execute
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:59 +0x67
[http] 2016/01/05 22:33:35 10.25.2.164 - - [05/Jan/2016:22:33:35 +0000] POST /write?db=myTestDB&precision=s&u=&p= HTTP/1.1 204 0 - EventMachine HttpClient 5b8b79d3-b3fc-11e5-8cf1-000000000000 1.777596ms
[query] 2016/01/05 22:33:40 SELECT count(backendTime) FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND (elbStatusCode = '500' OR elbStatusCode = '504') AND time > now() - 30m GROUP BY time(1s)
[query] 2016/01/05 22:33:40 SELECT recBytes FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND time > now() - 30m
[query] 2016/01/05 22:33:40 SELECT count(backendTime) FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND elbStatusCode != '200' AND elbStatusCode != '201' AND elbStatusCode != '204' AND time > now() - 30m GROUP BY time(1s), elbStatusCode
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x612a56]

goroutine 108404 [running]:
github.com/influxdb/influxdb/tsdb.(*RawExecutor).execute(0xc82116d9b0, 0xc827ce4600, 0xc827ce4420)
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:229 +0xef6
created by github.com/influxdb/influxdb/tsdb.(*RawExecutor).Execute
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:59 +0x67

If there is some kind of tracing/debugging switch I can turn on to help debug this, let me know.

John Hayden

@jhhayden
Author

jhhayden commented Jan 7, 2016

P.S. The queries are coming in from the latest Grafana.

@beckettsean
Contributor

@jhhayden I notice all of these queries are basically SELECT recBytes FROM elbs WHERE elb =~ /elb-api-server$/ AND time > now() - 30m. Do you see panics from any other queries, perhaps one that doesn't use a regex in the WHERE clause?

@jhhayden
Author

jhhayden commented Jan 7, 2016

Every panic is preceded by a query using a regex. I cannot seem to trigger it on demand, though.

@beckettsean
Contributor

Interesting. So it appears that having a regex in the query is necessary but not sufficient. Since the same query doesn't always produce the same behavior, there must be some other internal condition required to expose the bug.

The line that throws the panic is in the query return sorting code: https://github.com/influxdata/influxdb/blob/0.9.6/tsdb/raw.go#L229. That strongly implicates chunkedOutput.Values as sometimes being an invalid data structure. This is now beyond my ability to investigate.

@benbjohnson, any intuition for why that code might cause a panic intermittently, perhaps correlated with regex measurements?

@otoolep
Contributor

otoolep commented Jan 7, 2016

@beckettsean is close: the problem is that chunkedOutput never gets any data and stays nil.

From examining the code, it looks like no data is available to be returned (or no mappers exist). Obviously we handle cases like this in our code, because lots of queries return no data and don't crash.

@otoolep otoolep self-assigned this Jan 7, 2016
@otoolep otoolep removed the area/tsm label Jan 7, 2016
@otoolep
Contributor

otoolep commented Jan 7, 2016

This is unlikely to be tsm-specific so removing that label.

@otoolep
Contributor

otoolep commented Jan 7, 2016

At least, chunkedOutput being nil looks like the issue.

@rossmcdonald
Contributor

I've seen this as well:

[query] SELECT * FROM database."default".measurement WHERE time > 403316h AND time < 403484h AND tag = 'value' AND tag2 = 'value2' GROUP BY tag3 LIMIT 1
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x633045]

goroutine 9915771 [running]:
github.com/influxdb/influxdb/tsdb.(*RawExecutor).execute(0xc2432dc6f0, 0xc2384aef00, 0xc2384aee40)
        /tmp/tmp.DxFelBPhVV/src/github.com/influxdb/influxdb/tsdb/raw.go:229 +0xf55
created by github.com/influxdb/influxdb/tsdb.(*RawExecutor).Execute
        /tmp/tmp.DxFelBPhVV/src/github.com/influxdb/influxdb/tsdb/raw.go:59 +0x64

@jhhayden
Author

I know how to recreate this error! Due to a crash in my testDB and the fact I was troubleshooting a different issue for 4 days, I had a db with a gap of 4 days of no data. When I restarted the db, it crashed as soon as a query came in from grafana.

[query] 2016/01/27 23:21:53 SELECT recBytes FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND time > now() - 1d
[query] 2016/01/27 23:21:53 SELECT count(backendTime) FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND elbStatusCode != '200' AND elbStatusCode != '201' AND elbStatusCode != '204' AND time > now() - 1d GROUP BY time(1m), elbStatusCode
[query] 2016/01/27 23:21:53 SELECT count(backendTime) FROM elb."default".elbs WHERE elb =~ /elb-api-server$/ AND (elbStatusCode = '500' OR elbStatusCode = '504') AND time > now() - 1d GROUP BY time(1m)
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x612a56]

goroutine 81 [running]:
github.com/influxdb/influxdb/tsdb.(*RawExecutor).execute(0xc82a1d8f90, 0xc823124a20, 0xc8231248a0)
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:229 +0xef6
created by github.com/influxdb/influxdb/tsdb.(*RawExecutor).Execute
        /tmp/tmp.eqCMvBE7Hx/src/github.com/influxdb/influxdb/tsdb/raw.go:59 +0x67

So it appears that if there is no data (or none can be found within the query's parameters, such as the time range), it crashes. I hope this helps.

Question: Is it worth upgrading to the latest nightly build?

@rossmcdonald
Contributor

@jhhayden That's very helpful, thank you. It may be worth upgrading, though this issue will probably still occur. If this is just a test system, then I'd say go for it. If you have production/live data, then you may want to hold off until the next official release (which should be GA sometime next week).

@jsternberg
Contributor

@jhhayden if you are willing to use nightlies, the 0.11 nightly has a completely refactored query engine where this shouldn't happen.

We still want to fix this for people using 0.10 in a patch fix, but we're still having trouble reproducing it on our own machines. If I push up a branch with a small addition to add a nil check around the place that is panicking, would you be able to test it out to see if it works properly?

Thanks.

@jhhayden
Author

I'd love to! Just let me know what to install.

John Hayden

@jsternberg
Contributor

If you can build from the js-5303-mapped-chunks-nil-panic branch and then run the binary produced from that, it would help to determine if that commit fixes this issue.

Thank you!

@jhhayden
Author

This was my first shot at building a Go project, and of course it failed ;-) Here is what I ran and the results.

./package.sh -t rpm 0.11
which: no fpm in (/root/.gvm/pkgsets/go1.4.3/global/bin:/root/.gvm/gos/go1.4.3/bin:/root/.gvm/pkgsets/go1.4.3/global/overlay/bin:/root/.gvm/bin:/root/.gvm/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/aws/bin:/root/bin)

Starting package process...

Current branch is js-5303-mapped-chunks-nil-panic. Start packaging this branch? [Y/n] y
/root/.gvm/bin/gvm
Now using version go1.4.3
GOPATH (/root/gocodez) looks sane, using /root/gocodez for installation.
Git tree is clean.
From https://github.com/influxdata/influxdb
 * branch            js-5303-mapped-chunks-nil-panic -> FETCH_HEAD
CONFLICT (modify/delete): tsdb/raw.go deleted in HEAD and modified in 839df96. Version 839df96 of tsdb/raw.go left in tree.
Auto-merging cmd/influx_tsm/main.go
CONFLICT (content): Merge conflict in cmd/influx_tsm/main.go
Auto-merging CHANGELOG.md
CONFLICT (content): Merge conflict in CHANGELOG.md
Automatic merge failed; fix conflicts and then commit the result.
Failed to pull latest code -- aborting.

John Hayden

@jsternberg
Contributor

If you have the go tools installed, you should just be able to do this:

$ git fetch
$ git checkout js-5303-mapped-chunks-nil-panic
$ go get ./...
$ go build -o influxd ./cmd/influxd

@jhhayden
Author

git fetch
remote: Counting objects: 9, done.
remote: Total 9 (delta 6), reused 6 (delta 6), pack-reused 3
Unpacking objects: 100% (9/9), done.
From https://github.com/influxdata/influxdb
 * [new branch]      jl-influx-tsm -> origin/jl-influx-tsm
[root@ip-10-25-2-167 influxdb]# git checkout js-5303-mapped-chunks-nil-panic
CHANGELOG.md: needs merge
cmd/influx_tsm/main.go: needs merge
tsdb/raw.go: needs merge
error: you need to resolve your current index first

@jsternberg
Contributor

@jhhayden you need to reset the git repository. You also shouldn't be building as root if you can avoid it. Where are you attempting to build influxd? Do you have the Go compiler installed and the Go workspace set up for building?

@jhhayden
Author

I tried following the instructions in CONTRIBUTING.md.

Agreed about the root thing, but this is just a test machine. I'll try again as non-root.

John Hayden

@jsternberg
Contributor

If you have go installed, you should be able to just do this in a new directory.

$ mkdir /tmp/influxbuild && cd /tmp/influxbuild
$ export GOPATH=/tmp/influxbuild
$ mkdir -p src/github.com/influxdb
$ git clone git://github.com/influxdata/influxdb -b js-5303-mapped-chunks-nil-panic src/github.com/influxdb/influxdb
$ cd src/github.com/influxdb/influxdb
$ go get -u ./...
$ go build -o /tmp/influxd ./cmd/influxd

Then you can just run the binary at /tmp/influxd.

@jhhayden
Author

go build -o /tmp/influxd ./cmd/influxd
cmd/influxd/run/command.go:16:2: cannot find package "github.com/BurntSushi/toml" in any of:
        /home/ec2-user/.gvm/gos/go1.4/src/github.com/BurntSushi/toml (from $GOROOT)
        /tmp/influxbuild/src/github.com/BurntSushi/toml (from $GOPATH)

John Hayden

@jsternberg
Contributor

Did you do the go get command from above?

@jhhayden
Author

yes

John Hayden

@jsternberg
Contributor

Have you tried checking if this folder actually exists? /tmp/influxbuild/src/github.com/BurntSushi/toml

Otherwise, you can try installing Go directly from the tarball instead of through gvm. Start a new bash session to clear any environment variables previously set by gvm, then try the build instructions from above again.

@jhhayden
Author

I looked more closely at the output from the go get command and noticed this:

package github.com/influxdb/influxdb: /tmp/influxbuild/src/github.com/influxdb/influxdb is from git://github.com/influxdata/influxdb, should be from https://github.com/influxdb/influxdb

All the lines say the package is from .../influxdata/influxdb when it should be from influxdb/influxdb.

Could the cause be the company name change a while ago?

John Hayden

@jsternberg
Contributor

Ah, yes. That would be it. 0.10 still has to be built with the previous location, but it seems go get doesn't enjoy the git clone call I had. You can either clone it again with influxdb instead of influxdata or you can probably just change the remote.

git remote set-url origin git://github.com/influxdb/influxdb

@rossmcdonald
Contributor

@jhhayden Can you try using the build.py script located in the root directory of the GitHub repo? The full set of steps should look similar to this (assuming you already have Python and Go 1.4 installed):

mkdir /tmp/influxbuild && cd /tmp/influxbuild
export GOPATH=/tmp/influxbuild
mkdir -p src/github.com/influxdb
git clone https://github.com/influxdata/influxdb -b js-5303-mapped-chunks-nil-panic src/github.com/influxdb/influxdb
cd src/github.com/influxdb/influxdb
./build.py

The resulting binaries should be located in ./build.

@jhhayden
Author

Ross McDonald's plan worked! I will stop and then rerun influx using the new binary. I will let people know if it crashes.

Thanks again

John Hayden

@jsternberg
Contributor

@jhhayden have you experienced any crashes yet, or have you been able to test the change?

@jhhayden
Author

No crashes at all. It's looking good, but I have gone as long as 2 days between crashes (maybe due to traffic lessening) before it starts up again. But I think this is good.

John Hayden

@jsternberg
Contributor

Then I'm merging the PR and we'll have this as part of 0.10.1.
