This repository has been archived by the owner on Mar 26, 2020. It is now read-only.

Switch to etcd/clientv3 for store and remove libkv #284

Merged
merged 10 commits into gluster:master on Apr 12, 2017

Conversation

kshlm
Member

@kshlm kshlm commented Mar 21, 2017

Closes #260

... etcd client v3 api does not support directories any more.
@kshlm kshlm self-assigned this Mar 21, 2017
@@ -200,29 +204,33 @@ func volumeCreateHandler(w http.ResponseWriter, r *http.Request) {
Store: "vol-create.Store",
Rollback: "vol-create.Rollback",
LogFields: &log.Fields{
"reqid": uuid.NewRandom().String(),
"reqid": reqid,
Contributor

I have a question unrelated to your patch. Here's what a debug log for a txn step currently looks like (on master):

DEBU[2017-03-22T14:32:29+05:30] RunStep request recieved                      reqid=9c47b856-f49c-4218-9783-ba38d3246569 stepfunc=vol-create.Stage txnid=4c9aa1e4-e6a6-4ff7-95a7-d861e3e85f73

Why is there a distinction between reqid and txnid? Can one client request turn into multiple transactions (not steps)? If not, we could just keep txnid and make things simple.

It would also be good to log the request-to-txn mapping once at the beginning of the transaction.

Member Author

My original reason for having them was that a transaction could lead to new requests on other gd2s. But that would just make things more complicated to follow. A single ID the whole way through would be better.

This is roughly what OpenTracing does: it allocates an ID at the beginning of a request, and that ID flows through the whole cluster as actions are performed. I was looking into whether we could make use of it. We should try to adopt an established standard like this for tracing operations instead of inventing our own.
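
For illustration, a minimal sketch of the single-ID idea discussed above: the ID is minted once when the request arrives and then tags every log entry and transaction step. The runStep function and the logging setup here are simplified stand-ins, not the actual glusterd2 code.

```go
package main

import (
	"github.com/pborman/uuid"
	log "github.com/sirupsen/logrus"
)

// runStep stands in for a transaction step; it receives a logger that is
// already tagged with the request's ID instead of minting its own.
func runStep(logger *log.Entry, stepfunc string) {
	logger.WithField("stepfunc", stepfunc).Debug("RunStep request received")
}

func main() {
	log.SetLevel(log.DebugLevel)

	// The ID is allocated once, where the request enters the system...
	reqid := uuid.NewRandom().String()
	logger := log.WithField("reqid", reqid)

	// ...and the same ID accompanies every step, so a single grep for the
	// reqid follows the whole operation across steps and, ideally, nodes.
	runStep(logger, "vol-create.Stage")
	runStep(logger, "vol-create.Store")
}
```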

Contributor

A single ID the whole way through would be better.

Exactly. Currently, we have two. I'll file an issue for this later.

Contributor

@prashanthpai prashanthpai left a comment

I'm trying this change out. I have two glusterd2 instances on the same node (using different ports). The volume create request just hangs for me, and CPU usage shoots up to close to 100%.

@kshlm
Member Author

kshlm commented Mar 22, 2017

I'm trying this change out. I have two glusterd2 instances on the same node (using different ports). The volume create request just hangs for me, and CPU usage shoots up to close to 100%.

I've hit this intermittently as well. It seems to be a problem with the locking mechanism. It sometimes resolved itself after a while. Waiting a little while after peer probe before doing any operations seems to help. I've only seen this on fresh starts; on restarts it works fine.

@kshlm
Member Author

kshlm commented Apr 4, 2017

I've hit this intermittently as well. It seems to be a problem with the locking mechanism. It sometimes resolved itself after a while. Waiting a little while after peer probe before doing any operations seems to help. I've only seen this on fresh starts; on restarts it works fine.

It seems the newly added peer is hung, which is preventing etcd operations from happening. I'm not sure how my changes here cause this. I need to investigate more.

@kshlm
Member Author

kshlm commented Apr 6, 2017

@prashanthpai This now works properly. The problem was that the embedded etcd was being destroyed without disconnecting the store first. This caused store.Session.Close() to hang later, when reconnecting was attempted, as it needs to do some cleanup before closing down.

The first peer was always working right. It used to hang waiting for the new peer to respond.
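
To make the failure mode concrete, here is a minimal, self-contained sketch of the teardown ordering being described, using clientv3 and its concurrency session directly. The endpoint and timeout are placeholders, and the surrounding embedded-etcd management is only hinted at in comments.

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}

	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}

	// Order matters on teardown: the session lease must be revoked while
	// the etcd server is still reachable, otherwise Close() blocks.
	if err := sess.Close(); err != nil {
		log.Printf("failed to close etcd session: %v", err)
	}
	if err := cli.Close(); err != nil {
		log.Printf("failed to close etcd client: %v", err)
	}
	// Only after this would the embedded etcd server itself be stopped
	// and its data destroyed.
}
```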

Contributor

@prashanthpai prashanthpai left a comment

Peer probe works fine now. But I see the same issue (CPU usage shoots up) during peer detach now.

On the peer being detached:

INFO[2017-04-07T10:30:47+05:30] Etcd embedded server is stopped.             
INFO[2017-04-07T10:30:47+05:30] Etcd data dir, WAL dir and config file removed 
ERRO[2017-04-07T10:30:47+05:30] Could not start embedded etcd server.         Error=listen tcp 192.168.56.25:2380: bind: address already in use

Further, this log message needs fixing (on the initiator node):

INFO[2017-04-07T10:29:47+05:30] Added new member to etcd cluster              member-id=4319042937552170850
DEBU[2017-04-07T10:29:47+05:30] Reconfiguring etcd on remote peer             initial-cluster=ddf9b1f8-20f4-494a-ab0c-b8fbdb63a38d=http://192.168.56.25:2380,cc2de8b8-aa4f-4182-a4a9-3f831483ed4c=http://192.168.56.25:2480

@@ -8,61 +8,50 @@ import (
"time"
Contributor

You can remove the libkv comment present at the top of this file.

store/store.go Outdated

// Close closes the store connections
func (s *GDStore) Close() {
s.Session.Close()
Contributor

Both of these calls can return an error. It would be nice to log on error.
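
A small sketch of what this suggestion could look like, assuming s.Session is a clientv3 concurrency session, s.Client is the clientv3 client, and log is logrus (the field names are guesses based on the snippet above):

```go
// Close closes the store connections, logging (rather than ignoring)
// any errors returned by the underlying close calls.
func (s *GDStore) Close() {
	if err := s.Session.Close(); err != nil {
		log.WithError(err).Warn("failed to close etcd session")
	}
	if err := s.Client.Close(); err != nil {
		log.WithError(err).Warn("failed to close etcd client")
	}
}
```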

store/store.go Outdated
config "github.com/spf13/viper"
)

const (
// GlusterPrefix prefixes all paths in the store
GlusterPrefix string = "gluster/"
directoryVal = "thisisadirectory"
Contributor

Although the etcd v3 API has a flat key structure without directory support, I don't think this is needed. It isn't used anywhere currently. Am I missing something?
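
For context, the flat v3 keyspace usually makes a sentinel value like this unnecessary: "directories" are emulated purely with key prefixes and prefix queries. A minimal sketch, where the gluster/peers/ prefix and the endpoint are only examples:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// A "directory" is just a key prefix in the v3 API.
	if _, err := cli.Put(ctx, "gluster/peers/peer1", "{}"); err != nil {
		log.Fatal(err)
	}

	// Listing the "directory" is a prefix Get; no placeholder key is needed.
	resp, err := cli.Get(ctx, "gluster/peers/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```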

return nil, err
}

// We cannot have more than one peer with a given ID
// TODO: Fix this to return a proper error
if len(resp.Kvs) > 1 {
Contributor

Why not use resp.Count here?
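
For reference, clientv3's GetResponse exposes a Count field, so the check could read like the fragment below; cli, ctx and id are assumed from the surrounding code, and the key path is only illustrative:

```go
resp, err := cli.Get(ctx, "gluster/peers/"+id)
if err != nil {
	return nil, err
}
// We cannot have more than one peer with a given ID.
// resp.Count reports how many keys matched, so it can stand in
// for len(resp.Kvs) here.
if resp.Count > 1 {
	return nil, errors.New("multiple peers found with the same ID")
}
```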

@kshlm
Member Author

kshlm commented Apr 7, 2017

Peer probe works fine now. But I see the same issue (CPU usage shoots up) during peer detach now.

Huh, it works fine for me. Are you testing with a single node and multiple instances?

@prashanthpai
Contributor

Huh, it works fine for me. Are you testing with a single node and multiple instances?

Single node, multiple instances. I'll test this further.

@prashanthpai
Contributor

It's still the same for me:

INFO[2017-04-07T17:31:26+05:30] Etcd embedded server is stopped.             
INFO[2017-04-07T17:31:26+05:30] Etcd data dir, WAL dir and config file removed 
ERRO[2017-04-07T17:31:26+05:30] Could not start embedded etcd server.         Error=listen tcp 192.168.56.25:2380: bind: address already in use

This is weird. Out of curiosity, I tried a peer listing on the peer being removed, and here's what I see:

[ppai@gd2-1 json]$ curl -s -X GET http://192.168.56.25:23007/v1/peers | python -m json.tool
{
    "Error": "grpc: the client connection is closing"
}

@prashanthpai
Contributor

This is definitely a problem. This session close call is the one that takes a very long time and makes the CPU usage shoot up:

        if e := s.Session.Close(); e != nil {
                log.WithError(e).Warn("failed to close etcd session")
        }

The port is not being released correctly (you can check this manually) before etcdmgmt.StartEmbeddedEtcd() is called.
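
One way to confirm this, a hypothetical debugging helper that is not part of the PR: poll the peer listen address until it can actually be bound again before restarting the embedded server.

```go
// waitForPort retries binding the given address until it is free, which is
// a quick way to check whether the old etcd listener has really gone away.
func waitForPort(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		l, err := net.Listen("tcp", addr)
		if err == nil {
			l.Close() // the port is free again
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("port %s still in use: %v", addr, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```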

@kshlm
Member Author

kshlm commented Apr 11, 2017

@prashanthpai I don't seem to be able to hit your problem at all. I do get a slight pause when session.Close is happening, but it always succeeds and etcd restarts correctly.

Considering that this is only a problem during detach, and only on the node being detached, I think we can treat it as a known issue for now and merge this PR. We plan to replace the current solution with elastic-etcd anyway, which shouldn't have this problem.

... avoids the problem of session.Close() blocking and causing failures.
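
The exact workaround isn't shown in this excerpt, but one common pattern for avoiding a blocking Close() is to run it in a goroutine and give up after a timeout, logging a warning instead of hanging; a sketch under that assumption, reusing the concurrency session and logrus from the earlier examples:

```go
// closeSessionWithTimeout closes the etcd session but refuses to block
// forever: if the close does not finish in time, it logs a warning and
// returns instead of hanging the caller.
func closeSessionWithTimeout(s *concurrency.Session, timeout time.Duration) {
	done := make(chan error, 1)
	go func() { done <- s.Close() }()

	select {
	case err := <-done:
		if err != nil {
			log.WithError(err).Warn("failed to close etcd session")
		}
	case <-time.After(timeout):
		log.Warn("timed out waiting for etcd session to close")
	}
}
```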
Contributor

@prashanthpai prashanthpai left a comment

The workaround works (there is a benign WARN log entry on session close), although over the long term we'll probably need a more robust fix, perhaps in etcd itself.

@prashanthpai prashanthpai merged commit 9062b80 into gluster:master on Apr 12, 2017