This repository has been archived by the owner on Mar 26, 2020. It is now read-only.

Switch to etcd/clientv3 for store and remove libkv #284

Merged
merged 10 commits into gluster:master on Apr 12, 2017

Conversation

kshlm
Member

@kshlm kshlm commented Mar 21, 2017

Closes #260

... etcd client v3 api does not support directories any more.
@kshlm kshlm self-assigned this Mar 21, 2017
@@ -200,29 +204,33 @@ func volumeCreateHandler(w http.ResponseWriter, r *http.Request) {
Store: "vol-create.Store",
Rollback: "vol-create.Rollback",
LogFields: &log.Fields{
"reqid": uuid.NewRandom().String(),
"reqid": reqid,
Contributor

I have a question unrelated to your patch. Here's what a debug log for a txn step currently looks like (on master):

DEBU[2017-03-22T14:32:29+05:30] RunStep request recieved                      reqid=9c47b856-f49c-4218-9783-ba38d3246569 stepfunc=vol-create.Stage txnid=4c9aa1e4-e6a6-4ff7-95a7-d861e3e85f73

Why is there a distinction between reqid and txnid? Can one client request turn into multiple transactions (not steps)? If not, we could just keep txnid and make things simple.

It would also be good to log the request-to-txn mapping once at the beginning of the transaction.

Member Author

My original reason for having them was that a transaction could lead to new requests on other gd2s. But that would just make things more complicated to follow. A single ID the whole way through would be better.

This is roughly what OpenTracing does: it allocates an ID at the beginning of a request, and that ID flows through the whole cluster as actions are performed. I was looking into whether we could make use of it. We should try to adopt an established standard like this for tracing operations instead of inventing our own.
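
For illustration, a minimal sketch of the single-ID idea discussed above: the ID is minted once when the request arrives and then tags every log entry and transaction step. The runStep function and the logging setup here are simplified stand-ins, not the actual glusterd2 code.

```go
package main

import (
	"github.com/pborman/uuid"
	log "github.com/sirupsen/logrus"
)

// runStep stands in for a transaction step; it receives a logger that is
// already tagged with the request's ID instead of minting its own.
func runStep(logger *log.Entry, stepfunc string) {
	logger.WithField("stepfunc", stepfunc).Debug("RunStep request received")
}

func main() {
	log.SetLevel(log.DebugLevel)

	// The ID is allocated once, where the request enters the system...
	reqid := uuid.NewRandom().String()
	logger := log.WithField("reqid", reqid)

	// ...and the same ID accompanies every step, so a single grep for the
	// reqid follows the whole operation across steps and, ideally, nodes.
	runStep(logger, "vol-create.Stage")
	runStep(logger, "vol-create.Store")
}
```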

Contributor

A single ID the whole way through would be better.

Exactly. Currently, we have two. I'll file an issue for this later.

Contributor

@prashanthpai prashanthpai left a comment

I'm trying this change out. I have two glusterd2 instances on the same node (using different ports). The volume create request just hangs for me, and CPU usage shoots up to close to 100%.

@kshlm
Member Author

kshlm commented Mar 22, 2017

I'm trying this change out. I have two glusterd2 instances on the same node (using different ports). The volume create request just hangs for me, and CPU usage shoots up to close to 100%.

I've hit this intermittently as well. It seems to be a problem with the locking mechanism. It sometimes resolved itself after a while. Waiting a little while after peer probe before doing any operations seems to help. I've only seen this on fresh starts; on restarts it works fine.

@kshlm
Member Author

kshlm commented Apr 4, 2017

I've hit this intermittently as well. It seems to be a problem with the locking mechanism. It sometimes resolved itself after a while. Waiting a little while after peer probe before doing any operations seems to help. I've only seen this on fresh starts; on restarts it works fine.

It seems the newly added peer is hung, which is preventing etcd operations from happening. I'm not sure how my changes here cause this. I need to investigate more.

@kshlm
Member Author

kshlm commented Apr 6, 2017

@prashanthpai This now works properly. The problem was that the embedded etcd was being destroyed without disconnecting the store first. This caused store.Session.Close() to hang later, when reconnecting was attempted, as it needs to do some cleanup before closing down.

The first peer was always working right. It used to hang waiting for the new peer to respond.
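
To make the failure mode concrete, here is a minimal, self-contained sketch of the teardown ordering being described, using clientv3 and its concurrency session directly. The endpoint and timeout are placeholders, and the surrounding embedded-etcd management is only hinted at in comments.

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}

	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}

	// Order matters on teardown: the session lease must be revoked while
	// the etcd server is still reachable, otherwise Close() blocks.
	if err := sess.Close(); err != nil {
		log.Printf("failed to close etcd session: %v", err)
	}
	if err := cli.Close(); err != nil {
		log.Printf("failed to close etcd client: %v", err)
	}
	// Only after this would the embedded etcd server itself be stopped
	// and its data destroyed.
}
```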

Contributor

@prashanthpai prashanthpai left a comment

Peer probe works fine now. But I see the same issue (CPU usage shoots up) during peer detach now.

On the peer being detached:

INFO[2017-04-07T10:30:47+05:30] Etcd embedded server is stopped.             
INFO[2017-04-07T10:30:47+05:30] Etcd data dir, WAL dir and config file removed 
ERRO[2017-04-07T10:30:47+05:30] Could not start embedded etcd server.         Error=listen tcp 192.168.56.25:2380: bind: address already in use

Further, this log message needs fixing (on the initiator node):

INFO[2017-04-07T10:29:47+05:30] Added new member to etcd cluster              member-id=4319042937552170850
DEBU[2017-04-07T10:29:47+05:30] Reconfiguring etcd on remote peer             initial-cluster=ddf9b1f8-20f4-494a-ab0c-b8fbdb63a38d=http://192.168.56.25:2380,cc2de8b8-aa4f-4182-a4a9-3f831483ed4c=http://192.168.56.25:2480

@@ -8,61 +8,50 @@ import (
"time"
Contributor

You can remove the libkv comment present at the top of this file.

store/store.go Outdated

// Close closes the store connections
func (s *GDStore) Close() {
s.Session.Close()
Contributor

Both of these calls can return an error. It would be nice to log on error.
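
A small sketch of what this suggestion could look like, assuming s.Session is a clientv3 concurrency session, s.Client is the clientv3 client, and log is logrus (the field names are guesses based on the snippet above):

```go
// Close closes the store connections, logging (rather than ignoring)
// any errors returned by the underlying close calls.
func (s *GDStore) Close() {
	if err := s.Session.Close(); err != nil {
		log.WithError(err).Warn("failed to close etcd session")
	}
	if err := s.Client.Close(); err != nil {
		log.WithError(err).Warn("failed to close etcd client")
	}
}
```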

store/store.go Outdated
config "github.com/spf13/viper"
)

const (
// GlusterPrefix prefixes all paths in the store
GlusterPrefix string = "gluster/"
directoryVal = "thisisadirectory"
Contributor

Although the etcd v3 API has a flat key structure without directory support, I don't think this is needed. It isn't used anywhere currently. Am I missing something?
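
For context, the flat v3 keyspace usually makes a sentinel value like this unnecessary: "directories" are emulated purely with key prefixes and prefix queries. A minimal sketch, where the gluster/peers/ prefix and the endpoint are only examples:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// A "directory" is just a key prefix in the v3 API.
	if _, err := cli.Put(ctx, "gluster/peers/peer1", "{}"); err != nil {
		log.Fatal(err)
	}

	// Listing the "directory" is a prefix Get; no placeholder key is needed.
	resp, err := cli.Get(ctx, "gluster/peers/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```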

return nil, err
}

// We cannot have more than one peer with a given ID
// TODO: Fix this to return a proper error
if len(resp.Kvs) > 1 {
Contributor

Why not use resp.Count here?
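
For reference, clientv3's GetResponse exposes a Count field, so the check could read like the fragment below; cli, ctx and id are assumed from the surrounding code, and the key path is only illustrative:

```go
resp, err := cli.Get(ctx, "gluster/peers/"+id)
if err != nil {
	return nil, err
}
// We cannot have more than one peer with a given ID.
// resp.Count reports how many keys matched, so it can stand in
// for len(resp.Kvs) here.
if resp.Count > 1 {
	return nil, errors.New("multiple peers found with the same ID")
}
```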

@kshlm
Member Author

kshlm commented Apr 7, 2017

Peer probe works fine now. But I see the same issue (CPU usage shoots up) during peer detach now.

Huh, it works fine for me. Are you testing with a single node and multiple instances?

@prashanthpai
Contributor

Huh, it works fine for me. Are you testing with a single node and multiple instances?

Single node, multiple instances. I'll test this further.

@prashanthpai
Contributor

It's still the same for me:

INFO[2017-04-07T17:31:26+05:30] Etcd embedded server is stopped.             
INFO[2017-04-07T17:31:26+05:30] Etcd data dir, WAL dir and config file removed 
ERRO[2017-04-07T17:31:26+05:30] Could not start embedded etcd server.         Error=listen tcp 192.168.56.25:2380: bind: address already in use

This is weird. Out of curiosity, I tried a peer listing on the peer being removed, and here's what I see:

[ppai@gd2-1 json]$ curl -s -X GET http://192.168.56.25:23007/v1/peers | python -m json.tool
{
    "Error": "grpc: the client connection is closing"
}

@prashanthpai
Contributor

This is definitely a problem. This session close call is the one that takes a very long time and makes the CPU usage shoot up:

        if e := s.Session.Close(); e != nil {
                log.WithError(e).Warn("failed to close etcd session")
        }

The port is not being released correctly (you can check this manually) before etcdmgmt.StartEmbeddedEtcd() is called.
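
One way to confirm this, a hypothetical debugging helper that is not part of the PR: poll the peer listen address until it can actually be bound again before restarting the embedded server.

```go
// waitForPort retries binding the given address until it is free, which is
// a quick way to check whether the old etcd listener has really gone away.
func waitForPort(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		l, err := net.Listen("tcp", addr)
		if err == nil {
			l.Close() // the port is free again
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("port %s still in use: %v", addr, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```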

@kshlm
Member Author

kshlm commented Apr 11, 2017

@prashanthpai I don't seem to be able to hit your problem at all. I do get a slight pause when session.Close is happening, but it always succeeds and etcd restarts correctly.

Considering that this is only a problem during detach, and only on the node being detached, I think we can treat it as a known issue for now and merge this PR. We plan to replace the current solution with elastic-etcd anyway, which shouldn't have this problem.

... avoids the problem of session.Close() blocking and causing failures.
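
The exact workaround isn't shown in this excerpt, but one common pattern for avoiding a blocking Close() is to run it in a goroutine and give up after a timeout, logging a warning instead of hanging; a sketch under that assumption, reusing the concurrency session and logrus from the earlier examples:

```go
// closeSessionWithTimeout closes the etcd session but refuses to block
// forever: if the close does not finish in time, it logs a warning and
// returns instead of hanging the caller.
func closeSessionWithTimeout(s *concurrency.Session, timeout time.Duration) {
	done := make(chan error, 1)
	go func() { done <- s.Close() }()

	select {
	case err := <-done:
		if err != nil {
			log.WithError(err).Warn("failed to close etcd session")
		}
	case <-time.After(timeout):
		log.Warn("timed out waiting for etcd session to close")
	}
}
```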
Contributor

@prashanthpai prashanthpai left a comment

The workaround works (there is a benign WARN log entry on session close), although over the long term we'll probably need a more robust fix, perhaps in etcd itself.

@prashanthpai prashanthpai merged commit 9062b80 into gluster:master on Apr 12, 2017