
Timeout when running influx_inspect buildtsi #12890

Closed
sorrison opened this issue Mar 26, 2019 · 10 comments

@sorrison

sorrison commented Mar 26, 2019

I'm trying to convert to the tsi1 index and I get the error below.
It seems to do a lot of work and then just gets stuck.

I'm using InfluxDB 1.7.4.

We have quite a large DB (69 GB), so I'm wondering if that is the issue?
I have also tried it on a single shard and it just times out after a while of heavy activity.

Running strace on the process produces no further output; I just see

strace: Process 16587 attached
futex(0x10d8260, FUTEX_WAIT_PRIVATE, 0, NULL^

And the process doesn't seem to be using any CPU.

Error is:

<truncated>
2019-03-26T01:31:37.790502Z	error	Cannot compact index files	{"log_id": "0EPuf~40000", "db_instance": "gnocchi", "db_rp": "rp_31536000", "db_shard_id": 70834, "index": "tsi", "tsi1_partition": "8", "trace_id": "0EPup1GW000", "op_name": "tsi1_compact_to_level", "tsi1_level": 3, "error": "tsi1: compaction interrupted"}
2019-03-26T01:31:37.790582Z	info	TSI level compaction (end)	{"log_id": "0EPuf~40000", "db_instance": "gnocchi", "db_rp": "rp_31536000", "db_shard_id": 70834, "index": "tsi", "tsi1_partition": "8", "trace_id": "0EPup1GW000", "op_name": "tsi1_compact_to_level", "tsi1_level": 3, "op_event": "end", "op_elapsed": "12924.505ms"}
panic: sync: WaitGroup is reused before previous Wait has returned

goroutine 196 [running]:
sync.(*WaitGroup).Wait(0xc000cd4990)
	/usr/local/go/src/sync/waitgroup.go:132 +0xad
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait(0xc000cd4900)
	/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/partition.go:327 +0x33
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Wait(0xc004f80e10)
	/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index.go:335 +0x4a
github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.IndexShard(0xc004cc0000, 0xc000b7c180, 0x30, 0xc000b7c1b0, 0x2f, 0x100000, 0x40000000, 0x2710, 0xc005918480, 0xc000fba100, ...)
	/go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:314 +0x9bf
github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.(*Command).processRetentionPolicy.func1(0xc000bb4138, 0xc000bae100, 0xc00015e580, 0xc0000da758, 0x7, 0xc0000d53ff, 0xb, 0xc005542180, 0xc004cc0000, 0xc000bb0060, ...)
	/go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:189 +0x1e5
created by github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.(*Command).processRetentionPolicy
	/go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:180 +0x5c7

Time:

real	2m42.510s
user	47m43.832s
sys	12m18.648s
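
(For context: the panic above is Go's built-in guard in sync.WaitGroup. It fires when Add is called on a WaitGroup while a previous Wait call has been woken up but has not yet returned, and the trace shows it surfacing from tsi1.(*Partition).Wait during the index build. Below is a minimal standalone sketch, not InfluxDB code, that can reproduce the same message; because it is a race, it may take many iterations or several runs to trigger.)

package main

import "sync"

func main() {
	// Reusing a WaitGroup in a tight loop while a previous Wait call is
	// still returning can trigger:
	//   panic: sync: WaitGroup is reused before previous Wait has returned
	var wg sync.WaitGroup
	for i := 0; i < 1000000; i++ {
		wg.Add(1)    // re-arm the counter for this iteration
		go wg.Wait() // a waiter that may still be waking up...
		wg.Done()    // ...when the next iteration's Add runs
	}
}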
@boombatower

Perhaps related (< 100M database): since either 1.7.4 or 1.7.5, I'm seeing timeouts on all queries and writes and a slow memory build-up (leak) until restart, after which it is fine for a few hours. The workload has been exactly the same for the last few months; only the InfluxDB version was updated.

@nitper

nitper commented Apr 12, 2019

I'm seeing the same issue. After less than 24 hours, InfluxDB (running in Docker, very tiny DB) stops responding and clients time out. I'm seeing the same FUTEX_WAIT_PRIVATE, 0, NULL when running strace. InfluxDB is still logging queries, but they are all 500 timeouts.

CPU and disk I/O are basically zero, and memory usage is very low. I'll try an older version and see if that fixes it.

@nitper

nitper commented Apr 12, 2019

After some more searching, this is probably the same issue as #13010

@timhallinflux
Contributor

The fix for #13010 is in the 1.7 branch if you are building from source, and our plan is to have 1.7.6 tagged and built next week.

@timhallinflux
Contributor

1.7.6 build is now available.

@sorrison
Author

sorrison commented May 1, 2019

Just tried the 1.7.6 version and got the same issue:

panic: sync: WaitGroup is reused before previous Wait has returned

goroutine 610312 [running]:
sync.(*WaitGroup).Wait(0xc55316c090)
	/usr/local/go/src/sync/waitgroup.go:132 +0xad
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait(0xc55316c000)
	/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/partition.go:327 +0x33
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Wait(0xc1cea74000)
	/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index.go:335 +0x4a
github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.IndexShard(0xc009f7c0a0, 0xc594117500, 0x30, 0xc594117560, 0x2f, 0x100000, 0x40000000, 0x2710, 0xc123a9f380, 0x7fa7bff0e200, ...)
	/go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:314 +0x9bf
github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.(*Command).processRetentionPolicy.func1(0xc02470c1ac, 0xc04a37c860, 0xc00015e480, 0xc0000da758, 0x7, 0xc011591b4f, 0xb, 0xc06c455f20, 0xc009f7c0a0, 0xc03d486f60, ...)
	/go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:189 +0x1e5
created by github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.(*Command).processRetentionPolicy
	/go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:180 +0x5c7

@stale

stale bot commented Jul 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 30, 2019
@brtkwr

brtkwr commented Aug 2, 2019

Also hitting the same issue, I think...

2019-08-01T16:16:20.696544Z     info    TSI log compaction (start)  {"log_id": "0G~T0Jel000", "db_instance": "monasca", "db_rp": "default", "db_shard_id": 352, "index": "tsi", "tsi1_partition": "7", "trace_id": "0G~VKjr0000", "op_name": "tsi1_compact_log_file", "tsi1_log_file_id": 5, "op_event": "start"}
2019-08-01T16:16:21.951093Z     error   Cannot compact index files  {"log_id": "0G~T0Jel000", "db_instance": "monasca", "db_rp": "default", "db_shard_id": 480, "index": "tsi", "tsi1_partition": "8", "trace_id": "0G~VHonG000", "op_name": "tsi1_compact_to_level", "tsi1_level": 3, "error": "tsi1: compaction interrupted"}
2019-08-01T16:16:21.951263Z     info    TSI level compaction (end)  {"log_id": "0G~T0Jel000", "db_instance": "monasca", "db_rp": "default", "db_shard_id": 480, "index": "tsi", "tsi1_partition": "8", "trace_id": "0G~VHonG000", "op_name": "tsi1_compact_to_level", "tsi1_level": 3, "op_event": "end", "op_elapsed": "49141.900ms"}
2019-08-01T16:16:22.002947Z     info    TSI log compaction (start)  {"log_id": "0G~T0Jel000", "db_instance": "monasca", "db_rp": "default", "db_shard_id": 352, "index": "tsi", "tsi1_partition": "8", "trace_id": "0G~VKoxW000", "op_name": "tsi1_compact_log_file", "tsi1_log_file_id": 5, "op_event": "start"}
panic: sync: WaitGroup is reused before previous Wait has returned

goroutine 13 [running]:
sync.(*WaitGroup).Wait(0xd30c1a0090)
        /usr/local/go/src/sync/waitgroup.go:132 +0xad
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait(0xd30c1a0000)
        /go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/partition.go:327 +0x33
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Wait(0xd30d496000)
        /go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index.go:335 +0x4a
github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.IndexShard(0xc0000d6640, 0xd5f5ae4480, 0x1d, 0xd5f5ae44a0, 0x1c, 0x100000, 0x40000000, 0x2710, 0xd5f5c4e420, 0x0, ...)
        /go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:314 +0x9bf
github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.(*Command).processRetentionPolicy.func1(0xd5f7393b60, 0xdc1cf82180, 0xc000146480, 0xc0000347ca, 0x7, 0xc000034672, 0x7, 0xc0000a8720, 0xc0000d6640, 0xc000034860, ...)
        /go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:189 +0x1e5
created by github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi.(*Command).processRetentionPolicy
        /go/src/github.com/influxdata/influxdb/cmd/influx_inspect/buildtsi/buildtsi.go:180 +0x5c7

I have been halving the concurrency down from the default of 24, to 12, then 6, and each time it has eventually failed. I need to drop series from the DB because the cardinality has gone out of control and is slowing down all our queries, and I cannot do that without converting all the indices.

stale bot removed the wontfix label Aug 2, 2019
@e-dard
Contributor

e-dard commented Sep 18, 2019

This will be fixed when #14902 is backported to the 1.7 and 1.8 branches. The fix will be part of the 1.7.9 release.

@8none1
Contributor

8none1 commented Dec 9, 2019

This can be closed now that #14902 is merged and backported.

8none1 closed this as completed Dec 9, 2019