Distributed Query/Clustering Fixes #2353

Merged — 7 commits merged from 2272-fix into master on Apr 21, 2015
Conversation

@jwilder (Contributor) commented Apr 20, 2015

This PR fixes cluster joins and distributed queries.

New data nodes would never join an existing cluster properly. They would start up but would not attempt to join the cluster so they all reported themselves as Server ID 1.

The distributed queries issue was that previously sent data was not being cleared out on each iteration of the mapper output processing. Subsequent iterations would send out duplicate data causing incorrect results.

Fixes #2348 #2343 #2334 #2272
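
For illustration, a minimal Go sketch of the mapper-output class of bug described above (hypothetical names, not the actual influxdb mapper code): an output slice that accumulates across iterations unless it is cleared at the top of each pass.

```go
// Minimal sketch (illustrative only, not the actual mapper code) of the bug:
// an output buffer that is not cleared between iterations re-sends everything
// it accumulated so far, producing duplicate data and incorrect results.
package main

import "fmt"

func main() {
	chunks := [][]string{{"a", "b"}, {"c"}, {"d", "e"}}

	var out []string
	for _, chunk := range chunks {
		out = out[:0] // the fix: clear previously sent data before this iteration
		out = append(out, chunk...)
		fmt.Println("sending:", out) // without the reset, earlier chunks would be re-sent
	}
}
```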

  t.Parallel()
- testName := "3-node server integration partial replication"
+ testName := "6-node server integration partial replication"
Contributor

6 is not really a valid number for a cluster. Best practice is to use an odd number of nodes. Why 6?

Contributor

Why change it at all? Would 5 or 7 be better?

Contributor

What we are trying to do is get the server to create more than one shard per shard group, to test the full range of possibilities across writing/reading series data.

From what we could tell, the number of shards a shard group creates is calculated as the number of data nodes divided by the replication factor. So we needed a node count and replication factor that would exercise that. We should probably make this a 5-node cluster with a replication factor of 2.

Contributor

Yes, a node size of 5 and a replication factor of 2 would be better. That should result in shard groups of 2 shards in size, which will then result in 4 actual shards on the cluster.
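
For reference, a quick Go sketch of the arithmetic being discussed (the variable names are illustrative, not InfluxDB code): shards per group come from dividing data nodes by the replication factor, and each shard is replicated RF times, giving the total shard count on the cluster.

```go
// Sanity check of the shard math from this thread (illustrative only).
package main

import "fmt"

func main() {
	dataNodes := 5
	replicationFactor := 2

	shardsPerGroup := dataNodes / replicationFactor   // 5 / 2 = 2 (integer division)
	totalShards := shardsPerGroup * replicationFactor // 2 * 2 = 4 shards across the cluster

	fmt.Printf("shards per group: %d, total shards: %d\n", shardsPerGroup, totalShards)
}
```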

@otoolep (Contributor) commented Apr 20, 2015

Some of these changes look really important. I'm not ready to +1 it yet however, as I have some questions.

@@ -642,19 +642,8 @@ func (cmd *RunCommand) openServer(joinURLs []url.URL) *influxdb.Server {
// Give brokers time to elect a leader if entire cluster is being restarted.
time.Sleep(1 * time.Second)

if s.ID() == 0 && s.Index() == 0 {
Contributor

So why wasn't this condition correct? Why was also requiring Index() to be 0 wrong?

Contributor

From what I understand, after you open your own meta store you read your ID from it; if your ID isn't 0, your index certainly shouldn't be either, since you have been part of a cluster before.

Contributor

If I understand this issue correctly, the bug was that sometimes s.ID() was 0 but s.Index() was not?

Contributor Author

Yes. Index was non-zero.
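
A hedged sketch of the join decision discussed in this thread (the types and helper below are illustrative, not the actual RunCommand code): the fix keys the "join an existing cluster" decision off the server ID alone, because the local index can be non-zero even when the node never received an ID.

```go
// Illustrative sketch of the corrected join check (not the actual influxdb code).
package main

import "fmt"

type server struct {
	id    uint64
	index uint64
}

func (s *server) ID() uint64    { return s.id }
func (s *server) Index() uint64 { return s.index }

// shouldJoin reports whether the node still needs to join a cluster.
// The old check (s.ID() == 0 && s.Index() == 0) skipped the join whenever the
// index was non-zero, so every new node ended up reporting server ID 1.
func shouldJoin(s *server) bool {
	return s.ID() == 0
}

func main() {
	fresh := &server{id: 0, index: 7}              // the buggy case: no ID, non-zero index
	fmt.Println("should join:", shouldJoin(fresh)) // true with the fix; false with the old check
}
```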

@toddboom (Contributor)
Reviewing this now.

  t.Parallel()
- testName := "3-node server integration partial replication"
+ testName := "6-node server integration partial replication"
Contributor

The testName isn't quite right: 6 -> 5.

@otoolep (Contributor) commented Apr 21, 2015

Changes look good. I have some minor comments you might wish to address, but I don't need to re-review. As long as I understand why just using s.ID() is correct, great. @benbjohnson also took a quick look at this.

+1, nice work.

@otoolep (Contributor) commented Apr 21, 2015

Don't forget the changelog.

New data nodes would never actually join the cluster; they would all report themselves as server ID 1.

Drop database did not close open shard files or any topic readers/heartbeats. The tests create and drop new databases on each run, so these open files and connections slowed things down and consumed a lot of RAM as the tests progressed.
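
A minimal sketch of the cleanup that commit message describes (illustrative names and types, not the actual influxdb code): dropping a database should close its open shard files and topic readers/heartbeats before discarding it, otherwise file handles and connections pile up as tests create and drop databases.

```go
// Illustrative drop-database cleanup (not the actual influxdb implementation).
package main

import (
	"fmt"
	"io"
)

type database struct {
	shards  []io.Closer // open shard files
	readers []io.Closer // topic readers / heartbeat connections
}

// close releases everything the database holds open, returning the first error.
func (db *database) close() error {
	var firstErr error
	for _, c := range db.shards {
		if err := c.Close(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	for _, c := range db.readers {
		if err := c.Close(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

// dropDatabase closes the database's resources before removing it from the map.
func dropDatabase(dbs map[string]*database, name string) error {
	db, ok := dbs[name]
	if !ok {
		return nil
	}
	if err := db.close(); err != nil {
		return err
	}
	delete(dbs, name)
	return nil
}

func main() {
	dbs := map[string]*database{"testdb": {}}
	fmt.Println(dropDatabase(dbs, "testdb"), len(dbs)) // <nil> 0
}
```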
jwilder added a commit that referenced this pull request Apr 21, 2015: Distributed Query/Clustering Fixes
@jwilder merged commit fbf9cdb into master on Apr 21, 2015
@jwilder deleted the 2272-fix branch on April 21, 2015 at 19:45
@otoolep (Contributor) commented Apr 21, 2015

Yeah, here's hoping! It is probable.

mark-rushakoff pushed a commit that referenced this pull request Jan 9, 2019
…error

Change typo of procstats to procstat and make exe required
Successfully merging this pull request may close these issues.

Data node fail to join cluster in 0.9.0rc25
4 participants