[0.9.5-nightly-952f1d5] not all nodes have measurements in _internal database #4212

Closed
ecables opened this issue Sep 24, 2015 · 20 comments

ecables commented Sep 24, 2015

Just upgraded to 0.9.5-nightly-952f1d5, and I noticed that only one of the nodes has measurements in the _internal database.

db1:

$ /opt/influxdb/influx
Connected to http://localhost:8086 version 0.9.5-nightly-952f1d5
InfluxDB shell 0.9.5-nightly-952f1d5
> use _internal
Using database _internal
> show measurements
name: measurements
------------------
name
cluster
engine
httpd
runtime
shard
wal

> SELECT mean(write_req) FROM monitor.httpd WHERE time > now() - 15m GROUP BY time(60s),hostname
name: httpd
tags: hostname=db1
time            mean
----            ----
1443062340000000000 605
1443062400000000000 606.8333333333334
1443062460000000000 611.3333333333334
1443062520000000000 613.5
1443062580000000000 618.1666666666666
1443062640000000000 620
1443062700000000000 626.1666666666666
1443062760000000000 630.8333333333334
1443062820000000000 633.3333333333334
1443062880000000000 637.6666666666666
1443062940000000000 640.8333333333334
1443063000000000000 642.3333333333334
1443063060000000000 645
1443063120000000000 647
1443063180000000000 651
1443063240000000000 655

>

db2:

$ /opt/influxdb/influx
Connected to http://localhost:8086 version 0.9.5-nightly-952f1d5
InfluxDB shell 0.9.5-nightly-952f1d5
> use _internal
Using database _internal
> SELECT mean(write_req) FROM monitor.httpd WHERE time > now() - 15m GROUP BY time(60s),hostname
> show measurements
>

db3:

$ /opt/influxdb/influx
Connected to http://localhost:8086 version 0.9.5-nightly-952f1d5
InfluxDB shell 0.9.5-nightly-952f1d5
> use _internal
Using database _internal
> SELECT mean(write_req) FROM monitor.httpd WHERE time > now() - 15m GROUP BY time(60s),hostname
> show measurements
>
otoolep (Contributor) commented Sep 24, 2015

That's probably because the first shard groups of the _internal database are not fully replicated.

Show us the output of 'SHOW SHARDS'.

ecables (Author) commented Sep 24, 2015

Requested output below:

$ /opt/influxdb/influx
Connected to http://localhost:8086 version 0.9.5-nightly-952f1d5
InfluxDB shell 0.9.5-nightly-952f1d5
> show shards
name: stuffhere
-------------
id  start_time      end_time        expiry_time     owners
2   2015-08-17T00:00:00Z    2015-08-24T00:00:00Z    2017-08-23T00:00:00Z    3,1,2
4   2015-08-24T00:00:00Z    2015-08-31T00:00:00Z    2017-08-30T00:00:00Z    2,3,1
12  2015-08-31T00:00:00Z    2015-09-07T00:00:00Z    2017-09-06T00:00:00Z    2,3,1
20  2015-09-07T00:00:00Z    2015-09-14T00:00:00Z    2017-09-13T00:00:00Z    3,1,2
32  2015-09-14T00:00:00Z    2015-09-21T00:00:00Z    2017-09-20T00:00:00Z    2,3,1
54  2015-09-21T00:00:00Z    2015-09-28T00:00:00Z    2017-09-27T00:00:00Z    2,3,1
1   2015-08-22T00:00:00Z    2015-08-23T00:00:00Z    2015-10-07T00:00:00Z    2,3,1
3   2015-08-23T00:00:00Z    2015-08-24T00:00:00Z    2015-10-08T00:00:00Z    1,2,3
5   2015-08-24T00:00:00Z    2015-08-25T00:00:00Z    2015-10-09T00:00:00Z    3,1,2
6   2015-08-25T00:00:00Z    2015-08-26T00:00:00Z    2015-10-10T00:00:00Z    1,2,3
7   2015-08-26T00:00:00Z    2015-08-27T00:00:00Z    2015-10-11T00:00:00Z    2,3,1
8   2015-08-27T00:00:00Z    2015-08-28T00:00:00Z    2015-10-12T00:00:00Z    2,3,1
9   2015-08-28T00:00:00Z    2015-08-29T00:00:00Z    2015-10-13T00:00:00Z    3,1,2
10  2015-08-29T00:00:00Z    2015-08-30T00:00:00Z    2015-10-14T00:00:00Z    3,1,2
11  2015-08-30T00:00:00Z    2015-08-31T00:00:00Z    2015-10-15T00:00:00Z    3,1,2
13  2015-08-31T00:00:00Z    2015-09-01T00:00:00Z    2015-10-16T00:00:00Z    1,2,3
14  2015-09-01T00:00:00Z    2015-09-02T00:00:00Z    2015-10-17T00:00:00Z    2,3,1
15  2015-09-02T00:00:00Z    2015-09-03T00:00:00Z    2015-10-18T00:00:00Z    3,1,2
16  2015-09-03T00:00:00Z    2015-09-04T00:00:00Z    2015-10-19T00:00:00Z    2,3,1
17  2015-09-04T00:00:00Z    2015-09-05T00:00:00Z    2015-10-20T00:00:00Z    2,3,1
18  2015-09-05T00:00:00Z    2015-09-06T00:00:00Z    2015-10-21T00:00:00Z    1,2,3
19  2015-09-06T00:00:00Z    2015-09-07T00:00:00Z    2015-10-22T00:00:00Z    2,3,1
21  2015-09-07T00:00:00Z    2015-09-08T00:00:00Z    2015-10-23T00:00:00Z    1,2,3
22  2015-09-08T00:00:00Z    2015-09-09T00:00:00Z    2015-10-24T00:00:00Z    2,3,1
23  2015-09-09T00:00:00Z    2015-09-10T00:00:00Z    2015-10-25T00:00:00Z    1,2,3
24  2015-09-10T00:00:00Z    2015-09-11T00:00:00Z    2015-10-26T00:00:00Z    2,3,1
26  2015-09-11T00:00:00Z    2015-09-12T00:00:00Z    2015-10-27T00:00:00Z    1,2,3
28  2015-09-12T00:00:00Z    2015-09-13T00:00:00Z    2015-10-28T00:00:00Z    2,3,1
30  2015-09-13T00:00:00Z    2015-09-14T00:00:00Z    2015-10-29T00:00:00Z    3,1,2
33  2015-09-14T00:00:00Z    2015-09-15T00:00:00Z    2015-10-30T00:00:00Z    2,3,1
35  2015-09-15T00:00:00Z    2015-09-16T00:00:00Z    2015-10-31T00:00:00Z    1,2,3
37  2015-09-16T00:00:00Z    2015-09-17T00:00:00Z    2015-11-01T00:00:00Z    2,3,1
39  2015-09-17T00:00:00Z    2015-09-18T00:00:00Z    2015-11-02T00:00:00Z    1,2,3
41  2015-09-18T00:00:00Z    2015-09-19T00:00:00Z    2015-11-03T00:00:00Z    1,2,3
50  2015-09-19T00:00:00Z    2015-09-20T00:00:00Z    2015-11-04T00:00:00Z    1,2,3
52  2015-09-20T00:00:00Z    2015-09-21T00:00:00Z    2015-11-05T00:00:00Z    1,2,3
55  2015-09-21T00:00:00Z    2015-09-22T00:00:00Z    2015-11-06T00:00:00Z    1,2,3
57  2015-09-22T00:00:00Z    2015-09-23T00:00:00Z    2015-11-07T00:00:00Z    2,3,1
59  2015-09-23T00:00:00Z    2015-09-24T00:00:00Z    2015-11-08T00:00:00Z    3,1,2
60  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-11-09T00:00:00Z    1,2,3


name: _internal
---------------
id  start_time      end_time        expiry_time     owners
81  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
82  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
83  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2

>

otoolep (Contributor) commented Sep 24, 2015

Yeah -- you need to bump the replication factor of your _internal database to 3, or configure a dedicated database and retention policy for monitoring data yourself, set it in the config, and bounce the nodes.

Make sense?
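
For what it's worth, a minimal sketch of the second option, assuming the 0.9.x [monitor] config section and its store-database / store-interval keys (the key names and config path should be checked against the influxdb.conf that shipped with your build; the "node_stats" database name is just an example):

> CREATE DATABASE node_stats
> CREATE RETENTION POLICY monitor ON node_stats DURATION 7d REPLICATION 3 DEFAULT

# /etc/opt/influxdb/influxdb.conf, on every node
[monitor]
  store-enabled  = true           # keep recording internal stats
  store-database = "node_stats"   # write them to the dedicated database instead of _internal
  store-interval = "10s"

Then bounce the nodes so they pick up the new [monitor] settings.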

ecables (Author) commented Sep 24, 2015

Okay, I altered the retention policy for _internal to a replication factor of 3, like so:

> ALTER RETENTION POLICY monitor ON _internal REPLICATION 3
> SHOW RETENTION POLICIES ON "_internal"
name    duration    replicaN    default
monitor 168h0m0s    3       true

The output of SHOW RETENTION ... confirms this, but SHOW SHARDS still shows a single node for each ID.

> SHOW SHARDS
...

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
81  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
82  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
83  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2

Do I need to restart Influx for this change to take effect?

otoolep (Contributor) commented Sep 24, 2015

@ecables -- this change will take effect when the next set of shards -- called a "shard group" -- is created. That group will be created a few minutes before the current shards expire.
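
As a sanity check once that next shard group appears (the shard id and times below are made up for illustration), the new _internal entries in SHOW SHARDS should carry all three node ids under owners:

> SHOW SHARDS
...
name: _internal
---------------
id  start_time      end_time        expiry_time     owners
110 2015-09-25T00:00:00Z    2015-09-26T00:00:00Z    2015-10-03T00:00:00Z    1,2,3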

ecables (Author) commented Sep 24, 2015

Well, I ended up restarting the nodes (db1, db2, and db3), thinking that might trigger this, and on two of the nodes I'm now seeing owners with IDs 4, 5, and 6 for _internal, despite there being only 3 nodes in the cluster. Looking at SHOW SERVERS, I'm seeing unresolved versions of our existing servers.

db1:

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
104 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    6
105 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
106 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2
107 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
108 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    4
109 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    5

> show servers
id  cluster_addr            raft
1   db2.domain.com:8088 true
2   db1.domain.com:8088 true
3   db3.domain.com:8088 true
4   10.0.9.39:8088      false
5   10.0.11.23:8088     false
6   10.0.12.32:8088     false

In the above output, servers 4-6 are the unresolved versions of 1-3.

db2 shows only 3:

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
61  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
62  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
63  2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2

> show servers
id  cluster_addr            raft
1   db2.domain.com:8088 true
2   db1.domain.com:8088 true
3   db3.domain.com:8088 true

db3 also shows 6 servers:

name: _internal
---------------
id  start_time      end_time        expiry_time     owners
104 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    6
105 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    1
106 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    2
107 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    3
108 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    4
109 2015-09-24T00:00:00Z    2015-09-25T00:00:00Z    2015-10-02T00:00:00Z    5

> show servers
id  cluster_addr            raft
1   db2.domain.com:8088 true
2   db1.domain.com:8088 true
3   db3.domain.com:8088 true
4   10.0.9.39:8088      false
5   10.0.11.23:8088     false
6   10.0.12.32:8088     false

Thoughts? We've had weird issues like this in the past, where 3 new unresolved versions of the existing servers show up, and then create errors in the logs when writes to unknown nodes occur.

otoolep (Contributor) commented Sep 24, 2015

@jwilder -- any thoughts?

otoolep (Contributor) commented Sep 24, 2015

To be clear @ecables -- you are not running a 6-node cluster, right? When the name resolution didn't happen, everything got doubled?

ecables (Author) commented Sep 24, 2015

No, we only have 3 nodes; each one simply shows up twice, once by hostname and once by its unresolved IP address.

otoolep (Contributor) commented Sep 24, 2015

Yeah, that's what I assumed, but just wanted to be sure.

I've CC'd @jwilder, who most recently worked on the clustering code.

jwilder (Contributor) commented Sep 24, 2015

@ecables When you restart the nodes, is the same hostname always passed via the config or via the -hostname flag? Those unresolved addresses look like a server was started without changing the hostname and also without using the same meta data dirs. It seems to have joined the cluster as a new node, which is why it has new node IDs and raft=false.
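
For reference, pinning the identity on each node would look roughly like this (a sketch, assuming the 0.9 package layout shown earlier in this thread; the config path and exact placement of the hostname key are worth double-checking against your own influxdb.conf):

# /etc/opt/influxdb/influxdb.conf on db1 -- the hostname other cluster members see
hostname = "db1.domain.com"

# or, equivalently, passed at startup
$ /opt/influxdb/influxd -config /etc/opt/influxdb/influxdb.conf -hostname db1.domain.com

Either way, each node should keep pointing at the same meta data dir it used before the restart.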

ecables (Author) commented Sep 24, 2015

No changes were made to the configuration during the restart; before starting the cluster during the upgrade from 0.9.4.1 -> 0.9.5-nightly I ensured that the hostname configuration variable was accurate.

Jhors2 commented Sep 24, 2015

I also want to add that this cluster has not been wiped since around 0.9.2. At one point we specified our peers via the "peers = []" config directive, which used IP addresses at the time. Around the 0.9.3 train, the peers configuration changed from permitting IPs to permitting hostnames, and peers.json was then updated to use hostnames rather than IPs.

I'm assuming this MAY be stale information because of that. If so, is there a way to gracefully yank these entries out of the cluster? The other funny thing to note here is that these nodes come and go as we restart the cluster.
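
For illustration, the suspicion is that the meta store still knows both address forms for the same three machines -- roughly the old IP-style peer list alongside the hostname-style one (contents below are reconstructed from the SHOW SERVERS output above, not copied from an actual peers.json):

# old-style peer addresses (IPs)
["10.0.9.39:8088", "10.0.11.23:8088", "10.0.12.32:8088"]

# current-style peer addresses (hostnames)
["db1.domain.com:8088", "db2.domain.com:8088", "db3.domain.com:8088"]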

otoolep (Contributor) commented Sep 24, 2015

Right now we do not have configuration support for removing nodes from the cluster -- the DROP SERVER command is due to ship in 0.9.5 or 0.9.6.

We may be able to do some work here to help, but this could be awkward. Your cluster is not in a fully healthy state, and it may be due to the upgrade from 0.9.2 to the latest code.
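
Once DROP SERVER ships, cleaning up the phantom entries would presumably look something like this (the syntax is a sketch pending the final implementation; the ids are the non-raft ones from the SHOW SERVERS output above):

> DROP SERVER 4
> DROP SERVER 5
> DROP SERVER 6

leaving only the three raft members (1-3) registered.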

otoolep (Contributor) commented Sep 25, 2015

I am going to close this issue for now, as I believe the system is behaving as expected.

otoolep closed this as completed Sep 25, 2015
otoolep (Contributor) commented Sep 25, 2015

With regards to the replication of the internal stats, that is.

otoolep (Contributor) commented Oct 3, 2015

After some recent reports, this does not appear to be solved.

otoolep reopened this Oct 3, 2015
otoolep (Contributor) commented Oct 3, 2015

It may be related to the extra nodes on the cluster, which will hopefully be cleaned up by #4310

otoolep (Contributor) commented Oct 3, 2015

Patch ready for merging, 4 green CI builds.

https://circleci.com/gh/influxdb/influxdb/tree/drop_node_non_raft

jsternberg (Contributor) commented:

This is an issue for an older version of InfluxDB and clustering is no longer supported in the open source version. I'm going to close this issue. Thank you.
