Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.9.3] Seemingly deadlocked when ingesting metrics via graphite plugin #3869

Closed
pkittenis opened this issue Aug 27, 2015 · 3 comments
Closed

Comments

@pkittenis
Copy link

Hello,

First off, apologies but I don't have a reproducible test for this yet.

The behaviour we see with Influxdb 0.9.3 with real life clients sending graphite metrics to the graphite plugin port of influxdb is this:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                 
32484 influxdb  20   0 1047m 745m  11m S 102.6  4.7  68:50.94 influxd                                                                

The db keeps spinning on CPU and we see a slew of write timeouts in log in between wal flushes 'due to idle'. I presume both of these cannot be happening at same time. Is this a case of a deadlock somewhere?

Can see the measurements in show measurements but data points are missing. No surprise there, given the write timeouts. The surprise is there's nothing hitting the disk and the wal is seemingly idle.

The actual data being sent is ~450K graphite metrics/minute from about 25k unique series. This is real data from real graphite clients and is being ingested in both graphite and influxdb 0.8 with no issues.

I am in the process of trying to replicate this locally - no luck yet. Would appreciate any pointers in the meantime. I can provide a profile of influxdb while it is doing this if that would be helpful.

Iostat shows data drive being hit very very little, few hundred KB per minute.

Influxdb logs and induced stack trace by SIGQUIT while this is happening:

[metastore] 2015/08/27 15:19:30 Using data dir: /var/opt/influxdb/meta
[metastore] 2015/08/27 15:19:30 Node at localhost:8088 [Follower]
[metastore] 2015/08/27 15:19:32 Node at localhost:8088 [Leader]. peers=[]
[metastore] 2015/08/27 15:19:32 Created local node: id=1, host=localhost:8088
[store] 2015/08/27 15:19:32 Using data dir: /var/opt/influxdb/data
[handoff] 2015/08/27 15:19:32 Starting hinted handoff service
[handoff] 2015/08/27 15:19:32 Using data dir: /var/opt/influxdb/hh
[tcp] 2015/08/27 15:19:32 Starting cluster service
[shard-precreation] 2015/08/27 15:19:32 Starting precreation service with check interval of 10m0s, advance period of 30m0s
[snapshot] 2015/08/27 15:19:32 Starting snapshot service
[admin] 2015/08/27 15:19:32 Starting admin service
[admin] 2015/08/27 15:19:32 Listening on HTTP: 0.0.0.0:8083
[continuous_querier] 2015/08/27 15:19:32 Starting continuous query service
[httpd] 2015/08/27 15:19:32 Starting HTTP service
[httpd] 2015/08/27 15:19:32 Authentication enabled: false
[httpd] 2015/08/27 15:19:32 Listening on HTTP: 0.0.0.0:8086
[retention] 2015/08/27 15:19:32 Starting rentention policy enforcement service
[graphite] 2015/08/27 15:19:32 Starting graphite service, batch size 1000, batch timeout 1s
[graphite] 2015/08/27 15:19:32 Listening on UDP: 127.0.0.1:2013
2015/08/27 15:19:32 InfluxDB starting, version 0.9.3, branch 0.9.3, commit 5d42b212fca8facfe9db0c83822f09b88be643ec
2015/08/27 15:19:32 GOMAXPROCS set to 4
[run] 2015/08/27 15:19:32 Listening for signals
[wal] 2015/08/27 15:19:37 WAL starting with 30720 ready series size, 0.50 compaction threshold, and 20971520 partition size threshold
[wal] 2015/08/27 15:19:37 WAL writing to /db/influxdb_wal/wal/graphite/default/1
[wal] 2015/08/27 15:19:38 WAL starting with 30720 ready series size, 0.50 compaction threshold, and 20971520 partition size threshold
[wal] 2015/08/27 15:19:38 WAL writing to /db/influxdb_wal/wal/graphite/default/2
[wal] 2015/08/27 15:19:43 WAL starting with 30720 ready series size, 0.50 compaction threshold, and 20971520 partition size threshold
[wal] 2015/08/27 15:19:43 WAL writing to /db/influxdb_wal/wal/graphite/default/3
[graphite] 2015/08/27 15:20:22 failed to write point batch to database "graphite": timeout
[wal] 2015/08/27 15:20:32 WAL starting with 30720 ready series size, 0.50 compaction threshold, and 20971520 partition size threshold
[wal] 2015/08/27 15:20:32 WAL writing to /db/influxdb_wal/wal/graphite/default/4
[graphite] 2015/08/27 15:21:22 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:21:27 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:21:32 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:22:16 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:22:21 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:22:26 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:22:31 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:22:36 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:22:43 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:23:18 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:23:23 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:23:28 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:23:33 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:23:38 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:24:16 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:24:21 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:24:26 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:24:31 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:24:36 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:24:41 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:25:20 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:25:25 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:25:30 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:25:35 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:25:41 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:26:18 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:26:23 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:26:28 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:26:33 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:26:38 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:26:43 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:26:48 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:26:53 failed to write point batch to database "graphite": timeout
[wal] 2015/08/27 15:26:57 WAL starting with 30720 ready series size, 0.50 compaction threshold, and 20971520 partition size threshold
[wal] 2015/08/27 15:26:57 WAL writing to /db/influxdb_wal/wal/graphite/default/5
[graphite] 2015/08/27 15:26:58 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:27:18 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:27:23 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:27:28 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:27:33 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:27:38 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:27:43 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:27:48 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:27:53 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:18 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:23 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:28 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:33 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:38 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:43 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:48 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:53 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:28:58 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:29:17 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:29:22 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:29:27 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:29:32 failed to write point batch to database "graphite": timeout
[retention] 2015/08/27 15:29:32 retention policy enforcement check commencing
[retention] 2015/08/27 15:29:32 retention policy shard deletion check commencing
[metastore] 2015/08/27 15:29:32 new shard group 6 successfully created for database graphite, retention policy default
[metastore] 2015/08/27 15:29:32 new shard group 7 successfully created for database graphite, retention policy default
[graphite] 2015/08/27 15:29:37 failed to write point batch to database "graphite": timeout
[wal] 2015/08/27 15:29:38 Flushing 43861 measurements and 104881 series to index
[wal] 2015/08/27 15:29:38 Flushing 168 measurements and 170 series to index
[wal] 2015/08/27 15:29:38 Metadata flush took 36.446349ms
[wal] 2015/08/27 15:29:39 Metadata flush took 746.320935ms
[graphite] 2015/08/27 15:29:42 failed to write point batch to database "graphite": timeout
[wal] 2015/08/27 15:29:43 Flushing 109 measurements and 109 series to index
[wal] 2015/08/27 15:29:43 Metadata flush took 46.66375ms
[graphite] 2015/08/27 15:29:47 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:29:55 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:30:20 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:30:25 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:30:30 failed to write point batch to database "graphite": timeout
[wal] 2015/08/27 15:30:32 Flushing 17 measurements and 17 series to index
[wal] 2015/08/27 15:30:32 Metadata flush took 64.250652ms
[graphite] 2015/08/27 15:30:35 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:30:40 failed to write point batch to database "graphite": timeout
[wal] 2015/08/27 15:30:43 Flush due to idle. Flushing 22 series with 1207 bytes from partition 4. Compacting 0 series
[wal] 2015/08/27 15:30:43 compaction of partition 4 took 23.559258ms
[wal] 2015/08/27 15:30:43 Flush due to idle. Flushing 23 series with 1275 bytes from partition 5. Compacting 0 series
[wal] 2015/08/27 15:30:43 compaction of partition 5 took 12.574245ms
[wal] 2015/08/27 15:30:43 Flush due to idle. Flushing 25 series with 1377 bytes from partition 1. Compacting 0 series
[wal] 2015/08/27 15:30:43 compaction of partition 1 took 30.696511ms
[wal] 2015/08/27 15:30:43 Flush due to idle. Flushing 23 series with 1224 bytes from partition 2. Compacting 0 series
[wal] 2015/08/27 15:30:43 compaction of partition 2 took 13.580917ms
[wal] 2015/08/27 15:30:43 Flush due to idle. Flushing 16 series with 850 bytes from partition 3. Compacting 0 series
[wal] 2015/08/27 15:30:43 compaction of partition 3 took 14.095429ms
[graphite] 2015/08/27 15:30:45 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:30:50 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:30:55 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:00 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:05 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:10 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:15 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:20 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:25 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:30 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:37 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:42 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:47 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:52 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:31:57 failed to write point batch to database "graphite": timeout
[wal] 2015/08/27 15:31:59 Flush due to idle. Flushing 5 series with 85 bytes from partition 1. Compacting 0 series
[wal] 2015/08/27 15:31:59 compaction of partition 1 took 55.903703ms
[wal] 2015/08/27 15:31:59 Flush due to idle. Flushing 14 series with 238 bytes from partition 2. Compacting 0 series
[wal] 2015/08/27 15:31:59 compaction of partition 2 took 20.480222ms
[wal] 2015/08/27 15:31:59 Flush due to idle. Flushing 11 series with 187 bytes from partition 3. Compacting 0 series
[wal] 2015/08/27 15:31:59 compaction of partition 3 took 12.124952ms
[wal] 2015/08/27 15:31:59 Flush due to idle. Flushing 9 series with 153 bytes from partition 4. Compacting 0 series
[wal] 2015/08/27 15:31:59 compaction of partition 4 took 23.220238ms
[wal] 2015/08/27 15:31:59 Flush due to idle. Flushing 14 series with 238 bytes from partition 5. Compacting 0 series
[wal] 2015/08/27 15:31:59 compaction of partition 5 took 30.866042ms
[graphite] 2015/08/27 15:32:02 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:07 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:16 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:21 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:26 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:31 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:36 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:41 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:47 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:52 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:32:57 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/27 15:33:02 failed to write point batch to database "graphite": timeout
[wal] 2015/08/27 15:33:04 Flush due to idle. Flushing 2 series with 153 bytes from partition 5. Compacting 0 series
[wal] 2015/08/27 15:33:04 compaction of partition 5 took 25.157718ms
[wal] 2015/08/27 15:33:04 Flush due to idle. Flushing 1 series with 68 bytes from partition 1. Compacting 0 series
[wal] 2015/08/27 15:33:04 compaction of partition 1 took 10.731526ms
[wal] 2015/08/27 15:33:04 Flush due to idle. Flushing 3 series with 221 bytes from partition 2. Compacting 0 series
[wal] 2015/08/27 15:33:04 compaction of partition 2 took 17.326634ms
[wal] 2015/08/27 15:33:04 Flush due to idle. Flushing 8 series with 612 bytes from partition 3. Compacting 0 series
[wal] 2015/08/27 15:33:04 compaction of partition 3 took 33.873323ms
[wal] 2015/08/27 15:33:04 Flush due to idle. Flushing 3 series with 204 bytes from partition 4. Compacting 0 series
[wal] 2015/08/27 15:33:04 compaction of partition 4 took 7.881754ms
SIGQUIT: quit
PC=0x4415cb

goroutine 1144 [running]:
runtime.cmpbody()
    /root/.gvm/gos/go1.4.2/src/runtime/asm_amd64.s:1195 +0x2b fp=0xc2080f28d0 sp=0xc2080f28c8
runtime.cmpstring(0xc2081f6320, 0x46, 0xc20da7d900, 0x4d, 0xffffffffffffffff)
    /root/.gvm/gos/go1.4.2/src/runtime/asm_amd64.s:1161 +0x19 fp=0xc2080f28d8 sp=0xc2080f28d0
sort.(*StringSlice).Less(0xc2148ca060, 0xaee1, 0xb722, 0xb301)
    <autogenerated>:13 +0xfa fp=0xc2080f2928 sp=0xc2080f28d8
sort.doPivot(0x2b8236b61560, 0xc2148ca060, 0xaee1, 0xb869, 0xb869, 0xaee1)
    /root/.gvm/gos/go1.4.2/src/sort/sort.go:138 +0x231 fp=0xc2080f2990 sp=0xc2080f2928
sort.quickSort(0x2b8236b61560, 0xc2148ca060, 0xaee1, 0xb869, 0x1b)
    /root/.gvm/gos/go1.4.2/src/sort/sort.go:173 +0xa9 fp=0xc2080f29d8 sp=0xc2080f2990
sort.quickSort(0x2b8236b61560, 0xc2148ca060, 0xa555, 0xb869, 0x1c)
    /root/.gvm/gos/go1.4.2/src/sort/sort.go:180 +0x178 fp=0xc2080f2a20 sp=0xc2080f29d8
sort.quickSort(0x2b8236b61560, 0xc2148ca060, 0x65be, 0xcb7b, 0x1e)
    /root/.gvm/gos/go1.4.2/src/sort/sort.go:180 +0x178 fp=0xc2080f2a68 sp=0xc2080f2a20
sort.quickSort(0x2b8236b61560, 0xc2148ca060, 0x0, 0xcb7b, 0x1f)
    /root/.gvm/gos/go1.4.2/src/sort/sort.go:180 +0x178 fp=0xc2080f2ab0 sp=0xc2080f2a68
sort.Sort(0x2b8236b61560, 0xc2148ca060)
    /root/.gvm/gos/go1.4.2/src/sort/sort.go:200 +0x7b fp=0xc2080f2ae0 sp=0xc2080f2ab0
sort.Strings(0xc21df00000, 0xcb7b, 0xec00)
    /root/.gvm/gos/go1.4.2/src/sort/sort.go:276 +0x82 fp=0xc2080f2b30 sp=0xc2080f2ae0
github.com/influxdb/influxdb/tsdb.(*DatabaseIndex).CreateMeasurementIndexIfNotExists(0xc2082ce140, 0xc20b98e2d0, 0x4e, 0x4e)
    /root/go/src/github.com/influxdb/influxdb/tsdb/meta.go:110 +0x48d fp=0xc2080f2c28 sp=0xc2080f2b30
github.com/influxdb/influxdb/tsdb.(*DatabaseIndex).CreateSeriesIndexIfNotExists(0xc2082ce140, 0xc20b98e2d0, 0x4e, 0xc20983ccf0, 0xc20983cc60)
    /root/go/src/github.com/influxdb/influxdb/tsdb/meta.go:88 +0x8e fp=0xc2080f2c68 sp=0xc2080f2c28
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc2082e06c0, 0xc214d84000, 0x3e8, 0x400, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tsdb/shard.go:165 +0x1dd fp=0xc2080f2e90 sp=0xc2080f2c68
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc2080321a0, 0x1, 0xc214d84000, 0x3e8, 0x400, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tsdb/store.go:303 +0xf7 fp=0xc2080f2ed8 sp=0xc2080f2e90
github.com/influxdb/influxdb/cluster.func·002(0x1, 0x1, 0xc214d84000, 0x3e8, 0x400)
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:241 +0xfc fp=0xc2080f2fb8 sp=0xc2080f2ed8
runtime.goexit()
    /root/.gvm/gos/go1.4.2/src/runtime/asm_amd64.s:2232 +0x1 fp=0xc2080f2fc0 sp=0xc2080f2fb8
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:271 +0x30b

goroutine 1 [chan receive, 13 minutes]:
main.(*Main).Run(0xc20806da80, 0xc20800a010, 0x4, 0x4, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/cmd/influxd/main.go:91 +0x777
main.main()
    /root/go/src/github.com/influxdb/influxdb/cmd/influxd/main.go:42 +0xdc

goroutine 6 [syscall, 13 minutes]:
os/signal.loop()
    /root/.gvm/gos/go1.4.2/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /root/.gvm/gos/go1.4.2/src/os/signal/signal_unix.go:27 +0x35

goroutine 17 [syscall, 13 minutes, locked to thread]:
runtime.goexit()
    /root/.gvm/gos/go1.4.2/src/runtime/asm_amd64.s:2232 +0x1

goroutine 8 [IO wait, 13 minutes]:
net.(*pollDesc).Wait(0xc20800e450, 0x72, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20800e450, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc20800e3f0, 0x0, 0x2b8236b59f40, 0xc2080fe000)
    /root/.gvm/gos/go1.4.2/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc208036de8, 0x0, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/tcpsock_posix.go:234 +0x4e
net.(*TCPListener).Accept(0xc208036de8, 0x0, 0x0, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/tcpsock_posix.go:244 +0x4c
github.com/influxdb/influxdb/tcp.(*Mux).Serve(0xc2080e29c0, 0x2b8236b5c5d0, 0xc208036de8, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tcp/mux.go:45 +0x57
created by github.com/influxdb/influxdb/cmd/influxd/run.func·003
    /root/go/src/github.com/influxdb/influxdb/cmd/influxd/run/server.go:313 +0x8f2

goroutine 9 [chan receive, 13 minutes]:
github.com/influxdb/influxdb/tcp.(*listener).Accept(0xc208036df0, 0x0, 0x0, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tcp/mux.go:117 +0x80
github.com/influxdb/influxdb/meta.(*raftLayer).Accept(0xc2080e2b10, 0x0, 0x0, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/meta/store.go:2058 +0x64
github.com/hashicorp/raft.(*NetworkTransport).listen(0xc20800e4d0)
    /root/go/src/github.com/hashicorp/raft/net_transport.go:346 +0x50
created by github.com/hashicorp/raft.NewNetworkTransport
    /root/go/src/github.com/hashicorp/raft/net_transport.go:138 +0x2c4

goroutine 10 [select]:
github.com/hashicorp/raft.(*Raft).leaderLoop(0xc2080c41c0)
    /root/go/src/github.com/hashicorp/raft/raft.go:826 +0xc7c
github.com/hashicorp/raft.(*Raft).runLeader(0xc2080c41c0)
    /root/go/src/github.com/hashicorp/raft/raft.go:798 +0x74e
github.com/hashicorp/raft.(*Raft).run(0xc2080c41c0)
    /root/go/src/github.com/hashicorp/raft/raft.go:595 +0xc9
github.com/hashicorp/raft.*Raft.(github.com/hashicorp/raft.run)·fm()
    /root/go/src/github.com/hashicorp/raft/raft.go:245 +0x27
github.com/hashicorp/raft.func·011()
    /root/go/src/github.com/hashicorp/raft/state.go:152 +0x51
created by github.com/hashicorp/raft.(*raftState).goFunc
    /root/go/src/github.com/hashicorp/raft/state.go:153 +0xe3

goroutine 11 [select, 3 minutes]:
github.com/hashicorp/raft.(*Raft).runFSM(0xc2080c41c0)
    /root/go/src/github.com/hashicorp/raft/raft.go:502 +0x1043
github.com/hashicorp/raft.*Raft.(github.com/hashicorp/raft.runFSM)·fm()
    /root/go/src/github.com/hashicorp/raft/raft.go:246 +0x27
github.com/hashicorp/raft.func·011()
    /root/go/src/github.com/hashicorp/raft/state.go:152 +0x51
created by github.com/hashicorp/raft.(*raftState).goFunc
    /root/go/src/github.com/hashicorp/raft/state.go:153 +0xe3

goroutine 12 [select]:
github.com/hashicorp/raft.(*Raft).runSnapshots(0xc2080c41c0)
    /root/go/src/github.com/hashicorp/raft/raft.go:1613 +0x466
github.com/hashicorp/raft.*Raft.(github.com/hashicorp/raft.runSnapshots)·fm()
    /root/go/src/github.com/hashicorp/raft/raft.go:247 +0x27
github.com/hashicorp/raft.func·011()
    /root/go/src/github.com/hashicorp/raft/state.go:152 +0x51
created by github.com/hashicorp/raft.(*raftState).goFunc
    /root/go/src/github.com/hashicorp/raft/state.go:153 +0xe3

goroutine 13 [select, 13 minutes]:
github.com/influxdb/influxdb/meta.(*localRaft).logLeaderChanges(0xc208072de0)
    /root/go/src/github.com/influxdb/influxdb/meta/state.go:187 +0x46e
created by github.com/influxdb/influxdb/meta.(*localRaft).open
    /root/go/src/github.com/influxdb/influxdb/meta/state.go:177 +0x130c

goroutine 34 [chan receive, 13 minutes]:
github.com/influxdb/influxdb/tcp.(*listener).Accept(0xc208036df8, 0x0, 0x0, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tcp/mux.go:117 +0x80
github.com/influxdb/influxdb/meta.(*Store).serveExecListener(0xc208062000)
    /root/go/src/github.com/influxdb/influxdb/meta/store.go:619 +0x8f
created by github.com/influxdb/influxdb/meta.(*Store).Open
    /root/go/src/github.com/influxdb/influxdb/meta/store.go:239 +0x391

goroutine 35 [chan receive, 13 minutes]:
github.com/influxdb/influxdb/tcp.(*listener).Accept(0xc208036e00, 0x0, 0x0, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tcp/mux.go:117 +0x80
github.com/influxdb/influxdb/meta.(*Store).serveRPCListener(0xc208062000)
    /root/go/src/github.com/influxdb/influxdb/meta/store.go:723 +0xc9
created by github.com/influxdb/influxdb/meta.(*Store).Open
    /root/go/src/github.com/influxdb/influxdb/meta/store.go:242 +0x3db

goroutine 14 [select]:
github.com/influxdb/influxdb/services/hh.(*Service).retryWrites(0xc208045180)
    /root/go/src/github.com/influxdb/influxdb/services/hh/service.go:103 +0x2e7
created by github.com/influxdb/influxdb/services/hh.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/hh/service.go:67 +0x20e

goroutine 37 [select, 13 minutes]:
github.com/influxdb/influxdb/cmd/influxd/run.(*Server).monitorErrorChan(0xc2080ac0e0, 0xc2080724e0)
    /root/go/src/github.com/influxdb/influxdb/cmd/influxd/run/server.go:442 +0x14a
created by github.com/influxdb/influxdb/cmd/influxd/run.func·003
    /root/go/src/github.com/influxdb/influxdb/cmd/influxd/run/server.go:319 +0xa63

goroutine 15 [select, 13 minutes]:
github.com/influxdb/influxdb/services/hh.(*Service).expireWrites(0xc208045180)
    /root/go/src/github.com/influxdb/influxdb/services/hh/service.go:121 +0x2a4
created by github.com/influxdb/influxdb/services/hh.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/hh/service.go:68 +0x228

goroutine 16 [chan receive, 13 minutes]:
github.com/influxdb/influxdb/tcp.(*listener).Accept(0xc208036e08, 0x0, 0x0, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tcp/mux.go:117 +0x80
github.com/influxdb/influxdb/cluster.(*Service).serve(0xc20800e2a0)
    /root/go/src/github.com/influxdb/influxdb/cluster/service.go:83 +0x102
created by github.com/influxdb/influxdb/cluster.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/cluster/service.go:60 +0x138

goroutine 50 [select, 3 minutes]:
github.com/influxdb/influxdb/services/precreator.(*Service).runPrecreation(0xc20800adc0)
    /root/go/src/github.com/influxdb/influxdb/services/precreator/service.go:75 +0x407
created by github.com/influxdb/influxdb/services/precreator.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/precreator/service.go:53 +0x20a

goroutine 51 [chan receive, 13 minutes]:
github.com/influxdb/influxdb/tcp.(*listener).Accept(0xc208036e10, 0x0, 0x0, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tcp/mux.go:117 +0x80
github.com/influxdb/influxdb/services/snapshotter.(*Service).serve(0xc20800aeb0)
    /root/go/src/github.com/influxdb/influxdb/services/snapshotter/service.go:75 +0xad
created by github.com/influxdb/influxdb/services/snapshotter.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/snapshotter/service.go:48 +0x133

goroutine 52 [IO wait, 13 minutes]:
net.(*pollDesc).Wait(0xc20816c0d0, 0x72, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20816c0d0, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc20816c070, 0x0, 0x2b8236b59f40, 0xc2080540d0)
    /root/.gvm/gos/go1.4.2/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc208164028, 0xc20801e700, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/tcpsock_posix.go:234 +0x4e
net.(*TCPListener).Accept(0xc208164028, 0x0, 0x0, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/tcpsock_posix.go:244 +0x4c
net/http.(*Server).Serve(0xc208072180, 0x2b8236b5c5d0, 0xc208164028, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/http/server.go:1728 +0x92
net/http.Serve(0x2b8236b5c5d0, 0xc208164028, 0x2b8236b61340, 0xc2080540c0, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/http/server.go:1606 +0xa1
github.com/influxdb/influxdb/services/admin.(*Service).serve(0xc20800af50)
    /root/go/src/github.com/influxdb/influxdb/services/admin/service.go:107 +0x147
created by github.com/influxdb/influxdb/services/admin.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/admin/service.go:70 +0x472

goroutine 53 [select]:
github.com/influxdb/influxdb/services/continuous_querier.(*Service).backgroundLoop(0xc20800e310)
    /root/go/src/github.com/influxdb/influxdb/services/continuous_querier/service.go:157 +0x3e9
created by github.com/influxdb/influxdb/services/continuous_querier.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/continuous_querier/service.go:93 +0x2df

goroutine 54 [IO wait, 13 minutes]:
net.(*pollDesc).Wait(0xc20816c140, 0x72, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20816c140, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc20816c0e0, 0x0, 0x2b8236b59f40, 0xc2080dc010)
    /root/.gvm/gos/go1.4.2/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc208164030, 0xc208110000, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/tcpsock_posix.go:234 +0x4e
net.(*TCPListener).Accept(0xc208164030, 0x0, 0x0, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/tcpsock_posix.go:244 +0x4c
net/http.(*Server).Serve(0xc208112120, 0x2b8236b5c5d0, 0xc208164030, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/http/server.go:1728 +0x92
net/http.Serve(0x2b8236b5c5d0, 0xc208164030, 0x2b8236b61368, 0xc20800e380, 0x0, 0x0)
    /root/.gvm/gos/go1.4.2/src/net/http/server.go:1606 +0xa1
github.com/influxdb/influxdb/services/httpd.(*Service).serve(0xc20800b090)
    /root/go/src/github.com/influxdb/influxdb/services/httpd/service.go:108 +0x79
created by github.com/influxdb/influxdb/services/httpd.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/httpd/service.go:76 +0x68b

goroutine 55 [select, 3 minutes]:
github.com/influxdb/influxdb/services/retention.(*Service).deleteShardGroups(0xc208072a20)
    /root/go/src/github.com/influxdb/influxdb/services/retention/service.go:69 +0x342
created by github.com/influxdb/influxdb/services/retention.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/retention/service.go:45 +0x138

goroutine 56 [select, 3 minutes]:
github.com/influxdb/influxdb/services/retention.(*Service).deleteShards(0xc208072a20)
    /root/go/src/github.com/influxdb/influxdb/services/retention/service.go:102 +0x71b
created by github.com/influxdb/influxdb/services/retention.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/retention/service.go:46 +0x152

goroutine 66 [chan send]:
github.com/influxdb/influxdb/tsdb.func·001()
    /root/go/src/github.com/influxdb/influxdb/tsdb/batcher.go:58 +0x6c
github.com/influxdb/influxdb/tsdb.func·002()
    /root/go/src/github.com/influxdb/influxdb/tsdb/batcher.go:89 +0x402
created by github.com/influxdb/influxdb/tsdb.(*PointBatcher).Start
    /root/go/src/github.com/influxdb/influxdb/tsdb/batcher.go:104 +0x29a

goroutine 67 [select]:
github.com/influxdb/influxdb/cluster.(*PointsWriter).WritePoints(0xc20800e230, 0xc20faab540, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:210 +0x634
github.com/influxdb/influxdb/services/graphite.(*Service).processBatches(0xc20809e0c0, 0xc2080737a0)
    /root/go/src/github.com/influxdb/influxdb/services/graphite/service.go:260 +0x239
created by github.com/influxdb/influxdb/services/graphite.(*Service).Open
    /root/go/src/github.com/influxdb/influxdb/services/graphite/service.go:105 +0x76c

goroutine 68 [chan send]:
github.com/influxdb/influxdb/services/graphite.(*Service).handleLine(0xc20809e0c0, 0xc214b72ba0, 0x5e)
    /root/go/src/github.com/influxdb/influxdb/services/graphite/service.go:246 +0x2a7
github.com/influxdb/influxdb/services/graphite.func·003()
    /root/go/src/github.com/influxdb/influxdb/services/graphite/service.go:219 +0x1f7
created by github.com/influxdb/influxdb/services/graphite.(*Service).openUDPServer
    /root/go/src/github.com/influxdb/influxdb/services/graphite/service.go:222 +0x2fa

goroutine 69 [select, 13 minutes]:
github.com/influxdb/influxdb/cmd/influxd/run.(*Command).monitorServerErrors(0xc2080ae300)
    /root/go/src/github.com/influxdb/influxdb/cmd/influxd/run/command.go:140 +0x2da
created by github.com/influxdb/influxdb/cmd/influxd/run.(*Command).Run
    /root/go/src/github.com/influxdb/influxdb/cmd/influxd/run/command.go:122 +0xdef

goroutine 74 [select]:
github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).autoflusher(0xc2080ac620, 0xc208c40960)
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:603 +0x4d3
created by github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).Open
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:219 +0x79b

goroutine 26 [select]:
github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).autoflusher(0xc2080ac1c0, 0xc20825c360)
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:603 +0x4d3
created by github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).Open
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:219 +0x79b

goroutine 1186 [semacquire]:
sync.(*Mutex).Lock(0xc2082ce140)
    /root/.gvm/gos/go1.4.2/src/sync/mutex.go:66 +0xd3
sync.(*RWMutex).Lock(0xc2082ce140)
    /root/.gvm/gos/go1.4.2/src/sync/rwmutex.go:82 +0x33
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc2082e06c0, 0xc214d4c000, 0x3e8, 0x400, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tsdb/shard.go:163 +0x14c
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc2080321a0, 0x1, 0xc214d4c000, 0x3e8, 0x400, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tsdb/store.go:303 +0xf7
github.com/influxdb/influxdb/cluster.func·002(0x1, 0x1, 0xc214d4c000, 0x3e8, 0x400)
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:241 +0xfc
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:271 +0x30b

goroutine 100 [select]:
github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).autoflusher(0xc2080ac700, 0xc209323a40)
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:603 +0x4d3
created by github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).Open
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:219 +0x79b

goroutine 1188 [semacquire]:
sync.(*RWMutex).RLock(0xc2082ce140)
    /root/.gvm/gos/go1.4.2/src/sync/rwmutex.go:36 +0x5f
github.com/influxdb/influxdb/tsdb.(*Shard).validateSeriesAndFields(0xc2082e06c0, 0xc212794000, 0x3e8, 0x400, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /root/go/src/github.com/influxdb/influxdb/tsdb/shard.go:336 +0x135
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc2082e06c0, 0xc212794000, 0x3e8, 0x400, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tsdb/shard.go:156 +0x75
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc2080321a0, 0x1, 0xc212794000, 0x3e8, 0x400, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tsdb/store.go:303 +0xf7
github.com/influxdb/influxdb/cluster.func·002(0x1, 0x1, 0xc212794000, 0x3e8, 0x400)
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:241 +0xfc
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:271 +0x30b

goroutine 167 [select]:
github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).autoflusher(0xc209908620, 0xc20b9c0f00)
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:603 +0x4d3
created by github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).Open
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:219 +0x79b

goroutine 790 [select]:
github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).autoflusher(0xc2080acc40, 0xc211e45860)
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:603 +0x4d3
created by github.com/influxdb/influxdb/tsdb/engine/wal.(*Log).Open
    /root/go/src/github.com/influxdb/influxdb/tsdb/engine/wal/wal.go:219 +0x79b

goroutine 1160 [semacquire]:
sync.(*Mutex).Lock(0xc2082ce140)
    /root/.gvm/gos/go1.4.2/src/sync/mutex.go:66 +0xd3
sync.(*RWMutex).Lock(0xc2082ce140)
    /root/.gvm/gos/go1.4.2/src/sync/rwmutex.go:82 +0x33
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc2082e06c0, 0xc214db6000, 0x381, 0x400, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tsdb/shard.go:163 +0x14c
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc2080321a0, 0x1, 0xc214db6000, 0x381, 0x400, 0x0, 0x0)
    /root/go/src/github.com/influxdb/influxdb/tsdb/store.go:303 +0xf7
github.com/influxdb/influxdb/cluster.func·002(0x1, 0x1, 0xc214db6000, 0x381, 0x400)
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:241 +0xfc
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:271 +0x30b

goroutine 1187 [select]:
github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard(0xc20800e230, 0xc21337ca00, 0xb37f10, 0x8, 0xc2080546e8, 0x7, 0x1, 0xc212794000, 0x3e8, 0x400, ...)
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:278 +0x913
github.com/influxdb/influxdb/cluster.func·001(0xc21337ca00, 0xb37f10, 0x8, 0xc2080546e8, 0x7, 0xc212794000, 0x3e8, 0x400)
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:205 +0xad
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).WritePoints
    /root/go/src/github.com/influxdb/influxdb/cluster/points_writer.go:206 +0x447

rax     0xb723
rbx     0x46
rcx     0xc210a1ca20
rdx     0x4d
rdi     0xc20da7d900
rsi     0xc2081f6320
rbp     0x46
rsp     0xc2080f28c8
r8      0xc2080f28e8
r9      0xa555
r10     0xc21df00000
r11     0xa90fe0
r12     0x61a
r13     0x0
r14     0x0
r15     0x51d
rip     0x4415cb
rflags  0x206
cs      0x33
fs      0x0
gs      0x0
@jwilder jwilder self-assigned this Aug 27, 2015
jwilder added a commit that referenced this issue Aug 27, 2015
Writes could timeout and when adding new measurement names to the
index if the sort took a long time.  The names slice was never
actually used (except a test) so keeping it in index wastes memory
and sort it wastes CPU and increases lock contention.  The sorting
was happening while the shard held a write-lock.

Fixes #3869
@jwilder
Copy link
Contributor

jwilder commented Aug 27, 2015

That trace looks like there might be a lot of time spent sorting measurement names when writing new series. https://github.com/influxdb/influxdb/blob/master/tsdb/meta.go#L110

When the sort is slow, it looks like it's causing the write lock taken here: https://github.com/influxdb/influxdb/blob/master/tsdb/shard.go#L163 to be held for too long. I think the sorting and the keeping the names slice may not be necessary.

@pkittenis
Copy link
Author

Thanks a lot! I'll verify that it fixes the issue and let you know.

Much appreciate your prompt response.

My test case was generating similar names so never deadlocked..

@pkittenis
Copy link
Author

Can confirm the deadlock no longer happens with this change - many thanks once again for your help.

Between this change and the wal partition changes performance now exceeds the 0.8 series which is pretty awesome - thanks for the great work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants