This repository has been archived by the owner on Mar 26, 2020. It is now read-only.

Implement brick multiplexing #1301

Merged: 1 commit merged into gluster:master from bmux-in-gd2 on Dec 4, 2018

Conversation

@prashanthpai (Contributor) commented Oct 27, 2018:

This PR adds support for multiple brick translator stacks running in
a single brick server process. Multiplexing is controlled by the
cluster.brick-multiplex global option. By default it's off, and
bricks are started in separate processes as before. If multiplexing is
enabled, then compatible bricks (those with the same volume options)
will be started in the same process.

Multiplexing and demultiplexing involve sending an RPC request from
glusterd2 to the brick process.

Pidfile and socket files:

  • Each multiplexed brick will have a separate pidfile (returned by
    PidFile() method of Daemon interface).
  • As multiplexed bricks share a single server xlator, the brick process
    listens on only one Unix domain socket file. A hardlink to this socket
    file is created per multiplexed brick (see the sketch below). Hence,
    glusterd2 maintains a separate connection for each multiplexed brick.
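
A minimal sketch, assuming a hypothetical helper and path layout (this is
not the actual glusterd2 code), of how such a per-brick socket hardlink
could be created:

    package brickmuxsketch

    import (
        "os"
        "path/filepath"
    )

    // createBrickSocketLink hardlinks the shared server socket so that each
    // multiplexed brick appears to have its own socket file under runDir.
    // The helper name and naming scheme here are assumptions for illustration.
    func createBrickSocketLink(runDir, sharedSocket, brickID string) (string, error) {
        brickSocket := filepath.Join(runDir, brickID+".socket")
        // Tolerate an already-existing link from a previous start attempt.
        if err := os.Link(sharedSocket, brickSocket); err != nil && !os.IsExist(err) {
            return "", err
        }
        return brickSocket, nil
    }

glusterd2 can then dial the per-brick path, giving it a separate connection
per multiplexed brick even though every link points to the same underlying
socket.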

Differences from glusterd1's implementation:
No dynamic information about port is stored on disk or in etcd. Unlike
glusterd1, all multiplexed bricks in a brick process will now sign in
on reconnection (https://review.gluster.org/21503). This makes it easy
for glusterd2 to repopulate its pmap registry on restart. This is a
much simpler approach than what glusterd1 has to do, i.e. read
/proc/<brick-pid>/fd/ and recover the stored port information.

Bonus:
The sign-in RPC request has been modified (https://review.gluster.org/21503)
to send the PID of the brick process. Glusterd2 can (in future) rely on this
instead of having to read and manage pidfiles.
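
If that lands, the sign-in request could carry the PID roughly as below; this
is only a sketch, and the field names are assumptions rather than the actual
pmap definitions:

    // Sketch of a sign-in request that also carries the brick's PID.
    // Brick and Port mirror what the existing sign-in handler uses; Pid is
    // the assumed new field carrying the brick process ID.
    type SignInReq struct {
        Brick string // brick path
        Port  int    // port the server xlator signed in with
        Pid   int    // PID of the (possibly shared) brick process
    }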

Code organisation:
For easier review, most of the brick multiplexing code has been kept
separate in 'brickmux' package. Minimal changes have been introduced
to other existing packages. However, there's an opportunity for a
future refactor which can converge 'brickmux', 'brick' and 'pmap'
packages.

Using brick multiplexing:

$ glustercli volume set all cluster.brick-multiplex on
$ glustercli volume set all cluster.brick-multiplex off

What's not been done (not GCS priority per se):

  • Brick multiplexing hasn't been implemented for the volume expand
    operation, i.e. add-brick.
  • The 'cluster.max-bricks-per-process' option hasn't been implemented. This
    means all compatible bricks will be multiplexed into a single brick
    process.
  • Turning on/off brick multiplexing with volumes actively running has
    no effect. In GCS production, brick multiplexing shall only be turned
    on at the beginning.

Reference (glusterd1 implementation):
https://review.gluster.org/16496

Closes #219
Signed-off-by: Prashanth Pai ppai@redhat.com

@prashanthpai (author):

retest this please

@@ -138,7 +138,9 @@ func (p *GfPortmap) SignIn(args *SignInReq, reply *SignInRsp) error {
"port": args.Port,
}).Debug("brick signed in")

registry.Update(args.Port, args.Brick, conn)
// TODO: Add Pid field to SignInRsp and pass it here when
Contributor:

@prashanthpai Question: We also need to add a Pid field to SignInReq as well, right?

Contributor (author):

Oops, I meant SignInReq not SignInRsp

Contributor (author):

That change can only be done after https://review.gluster.org/#/c/glusterfs/+/21503/ is merged and it ends up in the nightly RPMs

Contributor (author):

Fixed the comment now

@atinmu (Contributor) commented Nov 8, 2018:

I'm starting to review this now, expect it to be complete by next Monday.

@atinmu (Contributor) left a comment:

@vpandey-RH I'm still reviewing. Haven't got a chance to go through all the changes yet.

startedVolsPresent = true
break
}
}
Contributor:

Do we actually need this loop? I see that we loop through all the volumes in the subsequent code anyway. Can't we figure out there itself whether a volume is started? In a cluster configuration with a high number of volumes, such a loop can be costly.

Contributor (author):

This loop is an optimisation. Looping over a high number of volumes is a CPU-bound operation, and its cost is negligible compared with the latency involved in fetching from etcd.

var targetVolume *volume.Volinfo
if !startedVolsPresent {
// no started volumes present; allow bricks belonging to volume
// that's about to be started to be multiplexed to the same
Contributor:

'about to be started' - not sure if I understand this logic completely.

Contributor (author):

In production, bricks belonging to the same volume will not end up on the same machine. However, during development, one may create a volume with two or more bricks on the same machine. This code ensures that all bricks multiplex into the same process in such cases.

For example:

  1. create a replica 3 volume on a single-node cluster, i.e. all bricks will be on the same node.
  2. start volume txn:
    • the first brick will be started as a process
    • the second and third bricks must not be started as separate processes; they must be mux'd onto the first brick

// if volume isn't started, we can't multiplex.
continue
}
if reflect.DeepEqual(v.Options, brickVolinfo.Options) {
Contributor:

Please see unsafe_option () in the glusterfs source. We actually ignore auth*, *event-threads, *diagnostics.brick-log in this comparison in GD1 as of now.

Contributor (author):

Good point 👍 For now with limited scope of GCS, this should be okay I guess.
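
If that filtering is adopted later, it could take roughly the following shape;
this is an illustrative sketch only (the PR currently compares the full option
maps with reflect.DeepEqual, and the option-name patterns below approximate
GD1's unsafe_option()):

    package brickmuxsketch

    import (
        "reflect"
        "strings"
    )

    // optionsCompatible compares two volume option maps while skipping options
    // that do not affect mux compatibility, similar in spirit to GD1's
    // unsafe_option(). The patterns and map[string]string type are assumptions.
    func optionsCompatible(a, b map[string]string) bool {
        filter := func(opts map[string]string) map[string]string {
            out := make(map[string]string, len(opts))
            for k, v := range opts {
                if strings.Contains(k, "auth") ||
                    strings.HasSuffix(k, "event-threads") ||
                    strings.Contains(k, "diagnostics.brick-log") {
                    continue // ignored for compatibility purposes
                }
                out[k] = v
            }
            return out
        }
        return reflect.DeepEqual(filter(a), filter(b))
    }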

}

brickDaemon, _ := brick.NewGlusterfsd(localBrick)
if running, _ := daemon.IsRunning(brickDaemon); running {
Contributor:

Is this check even needed? We've not started the brick anyways?

Contributor (author):

  1. We pick a compatible target volume.
  2. We loop over local bricks of target volume to pick a running brick (hence the check) as the target brick.
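
The selection roughly follows the shape below; brick.NewGlusterfsd and
daemon.IsRunning appear in the diff above, while GetLocalBricks, targetVolume
and targetBrick are assumed names used only for this sketch:

    // Pick the first already-running local brick of the compatible target
    // volume as the mux target (sketch only, not the exact PR code).
    var targetBrick *brick.Brickinfo
    for _, localBrick := range targetVolume.GetLocalBricks() {
        brickDaemon, err := brick.NewGlusterfsd(localBrick)
        if err != nil {
            continue
        }
        // Only an already-running brick can act as the mux target,
        // hence the IsRunning check being discussed.
        if running, _ := daemon.IsRunning(brickDaemon); running {
            b := localBrick
            targetBrick = &b
            break
        }
    }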

@prashanthpai (author) left a comment:

In my testing, when TestBrickMux is run in a loop, sometimes volume stop fails with a socket error (unexpected EOF). There's a race somewhere. I haven't had time to look at it yet.

@atinmu (Contributor) commented Nov 25, 2018:

Had a pair review with @vpandey-RH and highlighted a few additional test areas. Overall the code structure is neat; thanks for the good work, @prashanthpai.

Things which need to be tested:

  1. What happens when a brick process is killed? Does gluster volume status show all multiplexed bricks as offline?
  2. When the gluster pod/node is rebooted, do the multiplexed bricks come back up in the right manner?

(I suspect that both of the above wouldn't work as expected.)

If these two scenarios work fine, then overall I'm fine with the code. I'll pass it on to @aravindavk, @kshlm and @Madhu-1 for review. Of course, a follow-up PR would be needed to handle brick-mux in add-brick and to introduce the cluster.max-bricks-per-process option.

@vpandey-RH (Contributor):

@atinmu I have added some more e2e tests.

@atinmu (Contributor) commented Nov 28, 2018:

@vpandey-RH Did we cover all the tests?

@vpandey-RH (Contributor):

@atinmu I have manually tested the above scenarios that you mentioned, plus a couple more. I have added these tests in e2e. PTAL, and notify me in case I have missed something.


r.Nil(client.VolumeDelete(volname2))
r.Nil(client.VolumeDelete(volname1))
}
Contributor:

I'm not seeing a test where we create ~10-odd volumes with brick mux, kill all the gluster processes, restart glusterd and check if all bricks are online with one single glusterfsd process.

Contributor:

Yeah, I had some doubt about implementing this scenario; I was a bit skeptical whether restarting GD2 would work as expected or whether it would be a fresh instance with no volume records and other such things. I had a discussion with Aravinda and according to him it should work fine. So, I am working on this test case.

@vpandey-RH (Contributor):

@atinmu I have added the above test case, but there still seems to be some issue with VolumeStop. Still working on finding the real cause of the volume stop failure.

@vpandey-RH force-pushed the bmux-in-gd2 branch 2 times, most recently from 7735b30 to baf36e9 on November 29, 2018 08:56
@prashanthpai (author):

As mentioned earlier, in my testing, when TestBrickMux is run in a loop, volume stop sometimes fails with a socket error (unexpected EOF). This is easily reproducible on a local machine.

@vpandey-RH (Contributor):

@prashanthpai Yes, volumeStop does fail with an EOF. I have a question here:
are the detach RPC calls in volume-stop synchronous or asynchronous?

return fmt.Errorf("detach brick RPC request failed; OpRet = %d", rsp.OpRet)
}

time.Sleep(time.Second * 1)
Contributor:

What's the reason for adding this sleep if the RPC call and callback are synchronous? Also, shouldn't we be marking the brick status as stopped in case op.ret is 0?

Contributor:

@atinmu The reason for adding this was to check whether my doubt about client.Call being asynchronous was true. I have seen better results after adding a sleep before we delete the socket file. But if @kshlm says that it's synchronous, then I don't feel there is any need for this sleep. At the same time, looking at the e2e test results, I think the EOF error could occur because we delete the socket file even before the RPC result has returned.
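
For reference, Go's standard net/rpc makes this distinction explicit: Call blocks until the reply arrives (or the connection fails), while Go returns immediately and the caller waits on Done. Whether the RPC client glusterd2 uses towards the brick follows the same semantics is exactly the question being asked, so the snippet below only illustrates the distinction:

    package rpcsketch

    import "net/rpc"

    // Synchronous: returns only once the brick has replied or the socket errored.
    func terminateBrickSync(client *rpc.Client, req, rsp interface{}) error {
        return client.Call("Brick.OpBrickTerminate", req, rsp)
    }

    // Asynchronous: Go returns a pending call; the caller must wait on Done
    // before tearing down the socket the reply will arrive on.
    func terminateBrickAsync(client *rpc.Client, req, rsp interface{}) error {
        call := client.Go("Brick.OpBrickTerminate", req, rsp, nil)
        <-call.Done
        return call.Error
    }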

Contributor:

Another question that I had: does the detach request return only after the brick has been detached and a sign-out request has been sent? Or does glusterfsd detect that it's a valid request, start the sign-out process and return, like what happens in heal (glustershd gets the heal RPC request, starts the heal process, returns with an output saying heal has started, and then runs heal as an independent activity)?

Contributor:

Good question. The detach request returns as soon as the RPC request succeeds, and the brick graph removal from the glusterfsd stack happens asynchronously. Comparing with the code of glusterd_volume_stop_glusterfs() in GD1, it looks like we mark this brick's RPC connection as NULL after sending the detach request. Is this what we're missing here?

Contributor:

I am not entirely sure about this, but what we have to make sure of is that all operations related to marking a brick as stopped (such as a sign-out request) happen before we delete the socket file and pidfile. That is why I added a sleep, to give those operations some time before we delete the socket file, although adding a sleep isn't a very good preventive measure.
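
One alternative to a fixed sleep (a sketch only, not what this PR implements) is to poll for the detach/sign-out completion with a timeout before deleting the socket file and pidfile; checkDetached below is a stand-in for whatever condition actually confirms the detach finished:

    package brickmuxsketch

    import "time"

    // waitDetached polls checkDetached until it reports true or the timeout
    // expires, instead of sleeping for a fixed duration.
    func waitDetached(checkDetached func() bool, timeout time.Duration) bool {
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            if checkDetached() {
                return true
            }
            time.Sleep(50 * time.Millisecond)
        }
        return false
    }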

@atinmu (Contributor) commented Nov 29, 2018:

@vpandey-RH The following things need to be checked for the volume stop issue:

  1. Does it fail while sending the RPC request, or does rsp.opRet contain a non-zero value? One way to confirm is to look at the brick log and see whether a detach request for the same brick was received or not.

@atinmu (Contributor) commented Nov 30, 2018:

Update on the volume stop EOF issue:

from glusterd2.log

time="2018-11-29 20:22:15.256522" level=debug msg="starting demultiplexing" brick=/tmp/gd2_func_test/TestBrickMux/brick773983421 source="[demultiplex.go:30:brickmux.Demultiplex]"
time="2018-11-29 20:22:15.256693" level=debug msg="brick socket file" brick=/tmp/gd2_func_test/TestBrickMux/w1/run/gluster/b0868b3ae644bab9.socket source="[demultiplex.go:36:brickmux.Demultiplex]"
time="2018-11-29 20:22:15.257023" level=info msg="connected to daemon" name=glusterfsd socket=/tmp/gd2_func_test/TestBrickMux/w1/run/gluster/b0868b3ae644bab9.socket source="[connection.go:62:daemon. GetRPCClient]"
time="2018-11-29 20:22:15.261865" level=debug msg="failed to call Brick Terminate" brick=/tmp/gd2_func_test/TestBrickMux/brick773983421 source="[demultiplex.go:51:brickmux.Demultiplex]"
time="2018-11-29 20:22:15.261967" level=error msg="Step failed on node." error="unexpected EOF" node=49593f2b-a251-4d71-9d11-b4ca3b7840e9 reqid=f993b09b-63c0-4fca-ad01-0197597a8705 source="[step.go:119: transaction.runStepFuncOnNodes]" step=vol-stop.StopBricks txnid=0a20cc74-e6c4-4b9c-b745-566d4afddc21
time="2018-11-29 20:22:15.262061" level=error msg="Transaction failed, rolling back changes" error="Step vol-stop.StopBricks failed on 1 nodes" reqid=f993b09b-63c0-4fca-ad01-0197597a8705 source="[transaction.go:143:transaction.(*Txn).Do]" txnid=0a20cc74-e6c4-4b9c-b745-566d4afddc21
time="2018-11-29 20:22:15.262140" level=error msg="transaction to stop volume failed" error="Step vol-stop.StopBricks failed on 1 nodes" reqid=f993b09b-63c0-4fca-ad01-0197597a8705 source="[volume-stop.go:178: volumes.volumeStopHandler]" volume=TestBrickMux3

from the log file of the parent brick:

[2018-11-29 14:52:15.258611] W [socket.c:721:__socket_rwv] 0-socket.glusterfsd: writev on failed (Bad file descriptor)
[2018-11-29 14:52:15.258823] I [glusterfsd-mgmt.c:279:glusterfs_handle_terminate] 0-glusterfs: detaching not-only child /tmp/gd2_func_test/TestBrickMux/brick773983421
[2018-11-29 14:52:15.258947] I [server.c:408:server_call_xlator_mem_cleanup] 0-TestBrickMux1-server: Create graph janitor thread for brick /tmp/gd2_func_test/TestBrickMux/brick773983421
[2018-11-29 14:52:15.260564] I [barrier.c:665:fini] 0-TestBrickMux2-barrier: Disabling barriering and dequeuing all the queued fops
[2018-11-29 14:52:15.270207] I [barrier.c:665:fini] 0-TestBrickMux2-barrier: Disabling barriering and dequeuing all the queued fops
[2018-11-29 14:52:15.277525] E [xlator.c:1553:glusterfs_delete_volfile_checksum] 0-TestBrickMux1-server: failed to get volfile checksum for volfile id TestBrickMux3.49593f2b-a251-4d71-9d11-b4ca3b7840e9.tmp- gd2_func_test-TestBrickMux-brick773983421.

So what's happening is that, from Demultiplex(), the client.Call("Brick.OpBrickTerminate", req, &rsp) call is failing with an unexpected EOF on the socket. We did see that the demultiplex request reached the brick process and it successfully detached the brick; however, most probably the socket connection is broken before the response is sent back, as highlighted by the socket_rwv failure with the EBADF error code.

We still need to figure out the root cause of this socket closure.

@prashanthpai @vpandey-RH @aravindavk @kshlm ^^

@ghost assigned atinmu on Dec 3, 2018
@atinmu (Contributor) commented Dec 3, 2018:

retest this please

@atinmu (Contributor) commented Dec 3, 2018:

Interestingly, with the latest GD2 & glusterfs master I ran the brick mux test in my local container 10 times but didn't see any failure yet:

root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (54.72s)
PASS
ok github.com/gluster/glusterd2/e2e 54.795s
root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (56.70s)
PASS
ok github.com/gluster/glusterd2/e2e 56.748s
root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (54.60s)
PASS
ok github.com/gluster/glusterd2/e2e 54.688s
root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (56.62s)
PASS
ok github.com/gluster/glusterd2/e2e 56.731s
root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (56.54s)
PASS
ok github.com/gluster/glusterd2/e2e 56.632s
root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (54.41s)
PASS
ok github.com/gluster/glusterd2/e2e 54.547s
root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (56.24s)
PASS
ok github.com/gluster/glusterd2/e2e 56.298s
root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (53.05s)
PASS
ok github.com/gluster/glusterd2/e2e 53.110s
root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
=== RUN TestBrickMux
--- PASS: TestBrickMux (55.12s)
PASS
ok github.com/gluster/glusterd2/e2e 55.197s

@atinmu (Contributor) commented Dec 3, 2018:

retest this please

1 similar comment
@atinmu (Contributor) commented Dec 3, 2018:

retest this please

@atinmu (Contributor) commented Dec 4, 2018:

Avoiding duplicate pmap signout is tracked through the fix https://review.gluster.org/#/c/glusterfs/+/21792/ and BZ https://bugzilla.redhat.com/show_bug.cgi?id=1655861

@vpandey-RH force-pushed the bmux-in-gd2 branch 2 times, most recently from 1b7497d to 98c2435 on December 4, 2018 09:01
// make appropriate changes once glusterfsd side is fixed.
time.Sleep(time.Millisecond * 200)

log.WithField("brick", b.String()).Debug("deleting brick pid file")
Contributor:

The following debug logs do not make much sense.

Contributor:

@atinmu I have modified this log a little bit, just to track the deletion of pid and socket file.

@aravindavk merged commit 6bbda20 into gluster:master on Dec 4, 2018
@ghost removed the in progress label on Dec 4, 2018