[0.9.5-nightly-952f1d5] not all nodes have measurements in _internal database #4212

Closed
ecables opened this issue Sep 24, 2015 · 20 comments

ecables commented Sep 24, 2015

Just upgraded to 0.9.5-nightly-952f1d5, and I noticed that only one of the nodes has measurements in the _internal database.

db1:

$ /opt/influxdb/influx
Connected to http://localhost:8086 version 0.9.5-nightly-952f1d5
InfluxDB shell 0.9.5-nightly-952f1d5
> use _internal
Using database _internal
> show measurements
name: measurements
------------------
name
cluster
engine
httpd
runtime
shard
wal

> SELECT mean(write_req) FROM monitor.httpd WHERE time > now() - 15m GROUP BY time(60s),hostname
name: httpd
tags: hostname=db1
time            mean
----            ----
1443062340000000000 605
1443062400000000000 606.8333333333334
1443062460000000000 611.3333333333334
1443062520000000000 613.5
1443062580000000000 618.1666666666666
1443062640000000000 620
1443062700000000000 626.1666666666666
1443062760000000000 630.8333333333334
1443062820000000000 633.3333333333334
1443062880000000000 637.6666666666666
1443062940000000000 640.8333333333334
1443063000000000000 642.3333333333334
1443063060000000000 645
1443063120000000000 647
1443063180000000000 651
1443063240000000000 655

>

db2:

$ /opt/influxdb/influx
Connected to http://localhost:8086 version 0.9.5-nightly-952f1d5
InfluxDB shell 0.9.5-nightly-952f1d5
> use _internal
Using database _internal
> SELECT mean(write_req) FROM monitor.httpd WHERE time > now() - 15m GROUP BY time(60s),hostname
> show measurements
>

db3:

$ /opt/influxdb/influx
Connected to http://localhost:8086 version 0.9.5-nightly-952f1d5
InfluxDB shell 0.9.5-nightly-952f1d5
> use _internal
Using database _internal
> SELECT mean(write_req) FROM monitor.httpd WHERE time > now() - 15m GROUP BY time(60s),hostname
> show measurements
>
otoolep (Contributor) commented Sep 24, 2015

That's probably because the first shard groups of the _internal database are not fully replicated.

Show us the output of 'SHOW SHARDS'.

ecables (Author) commented Sep 24, 2015

Requested output below:

$ /opt/influxdb/influx
Connected to http://localhost:8086 version 0.9.5-nightly-952f1d5
InfluxDB shell 0.9.5-nightly-952f1d5
> show shards
name: stuffhere
-------------
id  start_time      end_time        expiry_time     owners
2   2015-08-17T00:00:00Z    2015-08-24T00:00:00Z    2017-08-23T00:00:00Z    3,1,2
4   2015-08-24T00:00:00Z    2015-08-31T00:00:00Z    2017-08-30T00:00:00Z    2,3,1
12  2015-08-31T00:00:00Z    2015-09-07T00:00:00Z    2017-09-06T00:00:00Z    2,3,1
20  2015-09-07T00:00:00Z    2015-09-14T00:00:00Z    2017-09-13T00:00:00Z    3,1,2
32  2015-09-14T00:00:00Z    2015-09-21T00:00:00Z    2017-09-20T00:00:00Z    2,3,1
54  2015-09-21T00:00:00Z    2015-09-28T00:00:00Z    2017-09-27T00:00:00Z    2,3,1
1   2015-08-22T00:00:00Z    2015-08-23T00:00:00Z    2015-10-07T00:00:00Z    2,3,1
3   2015-08-23T00:00:00Z    2015-08-24T00:00:00Z    2015-10-08T00:00:00Z    1,2,3
5   2015-08-24T00:00:00Z    2015-08-25T00:00:00Z    2015-10-09T00:00:00Z    3,1,2
6   2015-08-25T00:00:00Z    2015-08-26T00:00:00Z    2015-10-10T00:00:00Z    1,2,3
7   2015-08-26T00:00:00Z    2015-08-27T00:00:00Z    2015-10-11T00:00:00Z    2,3,1
8   2015-08-27T00:00:00Z    2015-08-28T00:00:00Z    2015-10-12T00:00:00Z    2,3,1
9   2015-08-28T00:00:00Z    2015-08-29T00:00:00Z    2015-10-13T00:00:00Z    3,1,2
10  2015-08-29T00:00:00Z    2015-08-30T00:00:00Z    2015-10-14T00:00:00Z    3,1,2
11  2015-08-30T00:00:00Z    2015-08-31T00:00:00Z    2015-10-15T00:00:00Z    3,1,2
13  2015-08-31T00:00:00Z    2015-09-01T00:00:00Z    2015-10-16T00:00:00Z    1,2,3
14  2015-09-01T00:00:00Z    2015-09-02T00:00:00Z    2015-10-17T00:00:00Z    2,3,1
15  2015-09-02T00:00:00Z    2015-09-03T00:00:00Z    2015-10-18T00:00:00Z    3,1,2
16  2015-09-03T00:00:00Z    2015-09-04T00:00:00Z    2015-10-19T00:00:00Z    2,3,1
17  2015-09-04T00:00:00Z    2015-09-05T00:00:00Z    2015-10-20T00:00:00Z    2,3,1
18  2015-09-05T00:00:00Z    2015-09-06T00:00:00Z    2015-10-21T00:00:00Z    1,2,3
19  2015-09-06T00:00:00Z    2015-09-07T00:00:00Z    2015-10-22T00:00:00Z    2,3,1
21  2015-09-07T00:00:00Z    2015-09-08T00:00:00Z    2015-10-23T00:00:00Z    1,2,3
22  2015-09-08T00:00:00Z    2015-09-09T00:00:00Z    2015-10-24T00:00:00Z    2,3,1
23  2015-09-09T00:00:00Z    2015-09-10T00:00:00Z    2015-10-25T00:00:00Z    1,2,3
24  2015-09-10T00:00:00Z    2015-09-11T00:00:00Z    2015-10-26T00:00:00Z    2,3,1
26  2015-09-11T00:00:00Z    2015-09-12T00:00:00Z    2015-10-27T00:00:00Z    1,2,3
28  2015-09-12T00:00:00Z    2015-09-13T00:00:00Z    2015-10-28T00:00:00Z    2,3,1
30  2015-09-13T00:00:00Z    2015-09-14T00:00:00Z    2015-10-29T00:00:00Z    3,1,2
33  2015-09-14T00:00:00Z    2015-09-15T00:00:00Z    2015-10-30T00:00:00Z    2,3,1
35  2015-09-15T00:00:00Z    2015-09-16T00:00:00Z    2015-10-31T00:00:00Z    1,2,3
37  2015-09-16T00:00:00Z    2015-09-17T00:00:00Z    2015-11-01T00:00:00Z    2,3,1
39  2015-09-17T00:00:00Z    2015-09-18T00:00:00Z    2015-11-02T00:00:00Z    1,2,3
41  2015-09-18T00:00:00Z    2015-09-19T00:00:00Z    2015-11-03T00:00:00Z    1,2,3
50  2015-09-19T00:00:00Z    2015-09-20T00:00:00Z    2015-11-04T00:00:00Z    1,2,3
52  2015-09-20T00:00:00Z    2015-09-21T00:00:00Z    2015-11-05T00:00:00Z    1,2,3
55  2015-09-21T00:00:00Z    2015-09-22T00:00:00Z    2015-11-06T00:00:00Z    1,2,3
57  2015-09-22T00:00:00Z    2015-09-23T00:00:00Z    2015-11-07T00:00:00Z    2,3,1
59  2015-09-23T00:00:00Z    2015-09-24T00:00:00Z    2015-11-08T00:00:00Z    3,1,2
60  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-11-09T00:00:00Z    1,2,3


name: _internal
---------------
id  start_time      end_time        expiry_time     owners
81  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
82  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
83  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2

>

otoolep (Contributor) commented Sep 24, 2015

Yeah -- you need to bump the replication factor of your _internal database to 3, or configure a dedicated database and retention policy for monitoring data yourself, set it in the config, and bounce the nodes.

Make sense?
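
For what it's worth, a minimal sketch of the second option, assuming the 0.9.x [monitor] config section and its store-database / store-interval keys (the key names and config path should be checked against the influxdb.conf that shipped with your build; the "node_stats" database name is just an example):

> CREATE DATABASE node_stats
> CREATE RETENTION POLICY monitor ON node_stats DURATION 7d REPLICATION 3 DEFAULT

# /etc/opt/influxdb/influxdb.conf, on every node
[monitor]
  store-enabled  = true           # keep recording internal stats
  store-database = "node_stats"   # write them to the dedicated database instead of _internal
  store-interval = "10s"

Then bounce the nodes so they pick up the new [monitor] settings.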

ecables (Author) commented Sep 24, 2015

Okay, I altered the retention policy for _internal to a replication factor of 3, like so:

> ALTER RETENTION POLICY monitor ON _internal REPLICATION 3
> SHOW RETENTION POLICIES ON "_internal"
name    duration    replicaN    default
monitor 168h0m0s    3       true

The output of SHOW RETENTION ... confirms this, but SHOW SHARDS still shows a single node for each ID.

> SHOW SHARDS
...

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
81  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
82  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
83  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2

Do I need to restart Influx for this change to take effect?

otoolep (Contributor) commented Sep 24, 2015

@ecables -- this change will take effect when the next set of shards -- called a "shard group" -- is created. That group will be created a few minutes before the current shards expire.
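
As a sanity check once that next shard group appears (the shard id and times below are made up for illustration), the new _internal entries in SHOW SHARDS should carry all three node ids under owners:

> SHOW SHARDS
...
name: _internal
---------------
id  start_time      end_time        expiry_time     owners
110 2015-09-25T00:00:00Z    2015-09-26T00:00:00Z    2015-10-03T00:00:00Z    1,2,3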

ecables (Author) commented Sep 24, 2015

Well, I ended up restarting the nodes (db1, db2, and db3), thinking that might trigger this, and on two of the nodes I'm now seeing owners with IDs 4, 5, and 6 for _internal, despite there being only 3 nodes in the cluster. Looking at SHOW SERVERS, I'm seeing unresolved versions of our existing servers.

db1:

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
104 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    6
105 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
106 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2
107 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
108 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    4
109 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    5

> show servers
id  cluster_addr            raft
1   db2.domain.com:8088 true
2   db1.domain.com:8088 true
3   db3.domain.com:8088 true
4   10.0.9.39:8088      false
5   10.0.11.23:8088     false
6   10.0.12.32:8088     false

In the above output, servers 4-6 are the unresolved versions of 1-3.

db2 shows only 3:

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
61  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
62  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
63  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2

> show servers
id  cluster_addr            raft
1   db2.domain.com:8088 true
2   db1.domain.com:8088 true
3   db3.domain.com:8088 true

db3 also shows 6 servers:

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
104 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    6
105 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
106 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2
107 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
108 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    4
109 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    5

> show servers
id  cluster_addr            raft
1   db2.domain.com:8088 true
2   db1.domain.com:8088 true
3   db3.domain.com:8088 true
4   10.0.9.39:8088      false
5   10.0.11.23:8088     false
6   10.0.12.32:8088     false

Thoughts? We've had weird issues like this in the past, where 3 new unresolved versions of the existing servers show up, and then create errors in the logs when writes to unknown nodes occur.

otoolep (Contributor) commented Sep 24, 2015

@jwilder -- any thoughts?

otoolep (Contributor) commented Sep 24, 2015

To be clear @ecables -- you are not running a 6-node cluster, right? When the name resolution didn't happen, everything got doubled?

ecables (Author) commented Sep 24, 2015

No, we only have 3 nodes; each one simply shows up twice, once by hostname and once by its unresolved IP address.

otoolep (Contributor) commented Sep 24, 2015

Yeah, that's what I assumed, but just wanted to be sure.

I've CC'd @jwilder, who most recently worked on the clustering code.

jwilder (Contributor) commented Sep 24, 2015

@ecables When you restart the nodes, is the same hostname always passed via the config or via the -hostname flag? Those unresolved addresses look like a server was started without changing the hostname and also without using the same meta data dirs. It seems to have joined the cluster as a new node, which is why it has new node IDs and raft=false.
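
For reference, pinning the identity on each node would look roughly like this (a sketch, assuming the 0.9 package layout shown earlier in this thread; the config path and exact placement of the hostname key are worth double-checking against your own influxdb.conf):

# /etc/opt/influxdb/influxdb.conf on db1 -- the hostname other cluster members see
hostname = "db1.domain.com"

# or, equivalently, passed at startup
$ /opt/influxdb/influxd -config /etc/opt/influxdb/influxdb.conf -hostname db1.domain.com

Either way, each node should keep pointing at the same meta data dir it used before the restart.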

ecables (Author) commented Sep 24, 2015

No changes were made to the configuration during the restart; before starting the cluster during the upgrade from 0.9.4.1 -> 0.9.5-nightly I ensured that the hostname configuration variable was accurate.

Jhors2 commented Sep 24, 2015

I also want to add that this cluster has not been wiped since around 0.9.2. At one point we specified our peers via the "peers = []" config directive, which used IP addresses at the time. Around the 0.9.3 train, the peers configuration changed from permitting IPs to permitting hostnames, and peers.json was then updated to use hostnames rather than IPs.

I'm assuming this MAY be stale information because of that. If so, is there a way to gracefully yank these entries out of the cluster? The other funny thing to note here is that these nodes come and go as we restart the cluster.
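
For illustration, the suspicion is that the meta store still knows both address forms for the same three machines -- roughly the old IP-style peer list alongside the hostname-style one (contents below are reconstructed from the SHOW SERVERS output above, not copied from an actual peers.json):

# old-style peer addresses (IPs)
["10.0.9.39:8088", "10.0.11.23:8088", "10.0.12.32:8088"]

# current-style peer addresses (hostnames)
["db1.domain.com:8088", "db2.domain.com:8088", "db3.domain.com:8088"]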

otoolep (Contributor) commented Sep 24, 2015

Right now we do not have configuration support for removing nodes from the cluster -- the DROP SERVER command is due to ship in 0.9.5 or 0.9.6.

We may be able to do some work here to help, but this could be awkward. Your cluster is not in a fully healthy state, and it may be due to the upgrade from 0.9.2 to the latest code.
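
Once DROP SERVER ships, cleaning up the phantom entries would presumably look something like this (the syntax is a sketch pending the final implementation; the ids are the non-raft ones from the SHOW SERVERS output above):

> DROP SERVER 4
> DROP SERVER 5
> DROP SERVER 6

leaving only the three raft members (1-3) registered.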

otoolep (Contributor) commented Sep 25, 2015

I am going to close this issue for now, as I believe the system is behaving as expected.

otoolep closed this as completed Sep 25, 2015
otoolep (Contributor) commented Sep 25, 2015

With regards to the replication of the internal stats, that is.

otoolep (Contributor) commented Oct 3, 2015

After some recent reports, this does not appear to be solved.

otoolep reopened this Oct 3, 2015
otoolep (Contributor) commented Oct 3, 2015

It may be related to the extra nodes on the cluster, which will hopefully be cleaned up by #4310

otoolep (Contributor) commented Oct 3, 2015

Patch ready for merging, 4 green CI builds.

https://circleci.com/gh/influxdb/influxdb/tree/drop_node_non_raft

jsternberg (Contributor) commented:

This is an issue for an older version of InfluxDB and clustering is no longer supported in the open source version. I'm going to close this issue. Thank you.
