Conversation
Force-pushed from 08e93bd to 1b3fd4d
retest this please
Force-pushed from 1b3fd4d to 60054c3
glusterd2/pmap/rpc_prog.go (Outdated)
@@ -138,7 +138,9 @@ func (p *GfPortmap) SignIn(args *SignInReq, reply *SignInRsp) error {
		"port": args.Port,
	}).Debug("brick signed in")

	registry.Update(args.Port, args.Brick, conn)
	// TODO: Add Pid field to SignInRsp and pass it here when
@prashanthpai Question: we also need to add a Pid field to SignInReq as well, right?
Oops, I meant SignInReq, not SignInRsp.
That change can only be done after https://review.gluster.org/#/c/glusterfs/+/21503/ is merged and it ends up in the nightly RPMs
Fixed the comment now
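For context, a rough sketch of what the sign-in request might carry once the glusterfs-side change lands; the fields other than Pid mirror what the handler above already reads from args, and the exact types are assumptions rather than the actual GD2 definition:

```go
// Illustrative stand-in for the pmap sign-in request, not the real GD2 type.
type SignInReq struct {
	Brick string // brick path, already read by the handler above
	Port  int    // port the brick process listens on
	Pid   int    // PID of the brick process, the field under discussion
}
```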
Force-pushed from bd7e210 to 2b00b47
I'm starting to review this now; expect it to be complete by next Monday.
@vpandey-RH I'm still reviewing. Haven't got a chance to go through all the changes yet.
			startedVolsPresent = true
			break
		}
	}
Do we actually need this loop? I see that we loop through all the volumes in the subsequent code anyway. Can't we figure out if a volume is started there itself? In a cluster configuration with a high number of volumes, such a loop can be costly.
This loop is an optimisation. Looping over a high number of volumes is a CPU-bound operation, and its cost is negligible compared with the latency involved in fetching from etcd.
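For illustration, a minimal self-contained sketch of that early-exit check, assuming the volume list has already been fetched from the store once; Volinfo and its state values are simplified stand-ins for glusterd2's volume.Volinfo:

```go
package main

import "fmt"

// Volinfo is a trimmed stand-in for glusterd2's volume.Volinfo.
type Volinfo struct {
	Name  string
	State string // e.g. "Started" or "Stopped"
}

func main() {
	// Assumed to have been fetched from etcd once; the loop below is pure CPU.
	volumes := []Volinfo{
		{Name: "vol1", State: "Stopped"},
		{Name: "vol2", State: "Started"},
		{Name: "vol3", State: "Stopped"},
	}

	startedVolsPresent := false
	for _, v := range volumes {
		if v.State == "Started" {
			startedVolsPresent = true
			break // stop at the first started volume
		}
	}
	fmt.Println("started volumes present:", startedVolsPresent)
}
```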
	var targetVolume *volume.Volinfo
	if !startedVolsPresent {
		// no started volumes present; allow bricks belonging to volume
		// that's about to be started to be multiplexed to the same
'about to be started' - not sure if I understand this logic completely.
In production, bricks belonging to the same volume will not end up on the same machine. However, during development, one may create a volume having two or more bricks on the same machine. This code ensures that all bricks multiplex to the same process in such cases (see the sketch after this example).
For example:
- create a replica 3 volume on a single-node cluster, i.e. all bricks will be on the same node.
- start volume txn:
  - the first brick will be started as a process
  - the second and third bricks must not be started as separate processes; they must be mux'd onto the first brick
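A rough sketch of that fallback, with simplified stand-in types rather than the actual glusterd2 structures:

```go
package brickmuxsketch

// Volinfo is a trimmed stand-in for glusterd2's volume.Volinfo.
type Volinfo struct {
	Name    string
	Options map[string]string
}

// chooseTargetVolume sketches the fallback described above: if no started
// volumes exist (the single-node development case), bricks of the volume that
// is about to be started are multiplexed onto that volume's own first brick
// process; otherwise a compatible started volume is used.
func chooseTargetVolume(startedVolsPresent bool, started []*Volinfo, brickVolinfo *Volinfo) *Volinfo {
	if !startedVolsPresent {
		// No started volumes present: mux onto the volume being started itself.
		return brickVolinfo
	}
	// Otherwise pick a compatible started volume (the options comparison is
	// elided here; see the discussion on unsafe options below).
	if len(started) > 0 {
		return started[0]
	}
	return nil
}
```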
			// if volume isn't started, we can't multiplex.
			continue
		}
		if reflect.DeepEqual(v.Options, brickVolinfo.Options) {
Please see unsafe_option() in the glusterfs source. We actually ignore auth*, *event-threads and *diagnostics.brick-log in this comparison in GD1 as of now.
Good point 👍 For now, with the limited scope of GCS, this should be okay, I guess.
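To illustrate what a GD1-style comparison could look like, here is a self-contained sketch; the prefix/suffix patterns follow the option names listed above, but the helper names and matching rules are illustrative, not actual GD1 or GD2 code:

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// isUnsafeOption mirrors the idea behind GD1's unsafe_option(): options that
// don't affect mux compatibility (auth*, *event-threads, *diagnostics.brick-log)
// are ignored when comparing two volumes' options.
func isUnsafeOption(key string) bool {
	return strings.HasPrefix(key, "auth") ||
		strings.HasSuffix(key, "event-threads") ||
		strings.HasSuffix(key, "diagnostics.brick-log")
}

// filteredOptions drops unsafe options before the comparison.
func filteredOptions(opts map[string]string) map[string]string {
	out := make(map[string]string)
	for k, v := range opts {
		if !isUnsafeOption(k) {
			out[k] = v
		}
	}
	return out
}

func main() {
	a := map[string]string{"performance.io-threads": "on", "client.event-threads": "4"}
	b := map[string]string{"performance.io-threads": "on", "client.event-threads": "8"}
	// With unsafe options filtered out, the two volumes compare as compatible.
	fmt.Println(reflect.DeepEqual(filteredOptions(a), filteredOptions(b))) // true
}
```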
		}

		brickDaemon, _ := brick.NewGlusterfsd(localBrick)
		if running, _ := daemon.IsRunning(brickDaemon); running {
Is this check even needed? We haven't started the brick yet anyway, have we?
- We pick a compatible target volume.
- We loop over the local bricks of the target volume to pick a running brick (hence the check) as the target brick (sketched below).
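A condensed sketch of that selection, with a hypothetical Brick type and an injected isRunning check standing in for the brick.NewGlusterfsd and daemon.IsRunning pair from the diff:

```go
package brickmuxsketch

// Brick is a minimal stand-in for glusterd2's brick type.
type Brick struct {
	Path string
}

// pickTargetBrick walks the local bricks of the chosen target volume and
// returns the first one whose daemon is running; that brick's process becomes
// the multiplexing target for the brick being started.
func pickTargetBrick(localBricks []Brick, isRunning func(Brick) bool) (Brick, bool) {
	for _, b := range localBricks {
		if isRunning(b) {
			return b, true
		}
	}
	// No running brick yet; the caller falls back to starting a new process.
	return Brick{}, false
}
```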
In my testing, when TestBrickMux is run in a loop, volume stop sometimes fails with a socket error (unexpected EOF). There's a race somewhere. I haven't got time to look at it yet.
Force-pushed from 2b00b47 to c766c38
Had a pair review with @vpandey-RH and highlighted a few additional test areas. Overall the code structure is neat; thanks for the good work @prashanthpai. Things which need to be tested:
(I suspect that both of the above wouldn't work as expected.) If these two scenarios work fine, then overall I'm fine with the code. I'll pass it to @aravindavk, @kshlm and @Madhu-1 for review. Of course, a follow-up PR would be needed to handle brick-mux in add-brick and introduce the cluster.max-bricks-per-process option.
Force-pushed from c766c38 to aaa6771
@atinmu I have added some more e2e tests.
@vpandey-RH Did we cover all the tests?
@atinmu I have manually tested the above scenarios that you mentioned, plus a couple more scenarios. In e2e I have added these tests. PTAL, and in case I have missed something, notify me.
	r.Nil(client.VolumeDelete(volname2))
	r.Nil(client.VolumeDelete(volname1))
}
I'm not seeing a test where we create ~10 odd volumes with brick mux, kill all the gluster processes, restart glusterd, and check if all bricks are online with one single glusterfsd process.
Yeah, I had some doubts implementing this scenario; I was a bit skeptical whether restarting GD2 would work as expected or whether it would come up as a fresh instance with no volume records and other such things. I had a discussion with Aravinda and according to him it should work fine, so I am working on this test case.
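For reference, an outline of how such a test case could be structured; the helper functions are hypothetical stubs standing in for the real e2e harness (cluster setup, process control, REST client) and are not the actual glusterd2 test API:

```go
package e2e

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// Stubs below only show the intended sequence, not real harness calls.
func createStartedMuxVolumes(t *testing.T, n int) []string { return make([]string, n) }
func killAllGlusterProcesses(t *testing.T)                 {}
func restartGlusterd2(t *testing.T)                        {}
func onlineBricksAndProcs(t *testing.T) (int, int)         { return 30, 1 }

func TestBrickMuxRestart(t *testing.T) {
	r := require.New(t)

	// Create ~10 volumes with cluster.brick-multiplex on and start them all.
	vols := createStartedMuxVolumes(t, 10)
	r.Len(vols, 10)

	// Kill glusterd2 and the single multiplexed brick process.
	killAllGlusterProcesses(t)

	// Restart glusterd2; it should bring the bricks back from its store.
	restartGlusterd2(t)

	// Every brick should be online again, all served by one glusterfsd.
	online, procs := onlineBricksAndProcs(t)
	r.NotZero(online)
	r.Equal(1, procs)
}
```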
Force-pushed from aaa6771 to 2a56d54
@atinmu I have added the above test case, but there still seems to be some issue with VolumeStop. Still working on finding the real cause of the volume stop failure.
Force-pushed from 7735b30 to baf36e9
As mentioned earlier, in my testing, when TestBrickMux is run in a loop, volume stop sometimes fails with a socket error (unexpected EOF).
@prashanthpai Yes, volumeStop does fail with an EOF. I have a question here -
glusterd2/brickmux/demultiplex.go (Outdated)
		return fmt.Errorf("detach brick RPC request failed; OpRet = %d", rsp.OpRet)
	}

	time.Sleep(time.Second * 1)
What's the reason for adding this sleep if the RPC call and callback are synchronous? Also, shouldn't we be marking the brick status as stopped in case OpRet is 0?
@atinmu So, the reason for adding this is to see if my doubt about client.Call being asynchronous was true. Now, I have seen better results after adding a sleep before we delete the socket file. But if @kshlm says that it's synchronous, then I don't feel there is any need for this sleep. At the same time, after looking at the e2e test results, I am thinking that the EOF error could occur due to deleting the socket file even before the RPC result has returned.
Another question that I had is: does the detach request return only after the brick has been detached and a sign-out request has been sent? Or does glusterfsd detect that it's a valid request, start the sign-out process and return, like what happens in heal (glustershd gets the heal RPC request, starts the heal process, returns with an output that heal has started, and then runs heal as an independent activity)?
Good question. The detach request returns as soon as the RPC request succeeds, and the brick graph removal from the glusterfsd stack happens asynchronously. Comparing with the code of glusterd_volume_stop_glusterfs() from GD1, it looks like we mark this brick's rpc connection as NULL after sending the detach request. Is this what we're missing here?
I am not entirely sure about this, but what we have to make sure of is that all the operations related to marking a brick as stopped (such as a sign-out request) happen before we delete the socketfile and pidfile. That is why I added a sleep, to give some time for all of this before we delete the socketfile, although adding a sleep isn't a very good preventive measure.
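To keep the moving parts in one place, here is a condensed sketch of the sequence being debated: detach RPC, a short grace period, then file cleanup. The standard library net/rpc client and the response struct are stand-ins for glusterd2's actual brick RPC types, and the 200ms value just mirrors the stopgap in the diff:

```go
package brickmuxsketch

import (
	"fmt"
	"net/rpc"
	"os"
	"time"
)

// demultiplexSketch: send the detach RPC, give glusterfsd a moment to finish
// its asynchronous graph removal and sign-out, then remove the per-brick
// pidfile and socketfile.
func demultiplexSketch(client *rpc.Client, req interface{}, pidfile, socketfile string) error {
	var rsp struct{ OpRet int }

	// From the caller's side this is synchronous: Call only returns once the
	// brick replies (or the connection breaks, e.g. with an unexpected EOF).
	if err := client.Call("Brick.OpBrickTerminate", req, &rsp); err != nil {
		return err
	}
	if rsp.OpRet != 0 {
		return fmt.Errorf("detach brick RPC request failed; OpRet = %d", rsp.OpRet)
	}

	// Stopgap mirroring the diff above: the brick-side detach/sign-out is
	// asynchronous, so wait briefly before deleting files it may still use.
	time.Sleep(200 * time.Millisecond)

	if err := os.Remove(pidfile); err != nil && !os.IsNotExist(err) {
		return err
	}
	if err := os.Remove(socketfile); err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}
```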
@vpandey-RH The following things need to be checked for the volume stop issue:
Update on the volume stop EOF issue.

From glusterd2.log:

time="2018-11-29 20:22:15.256522" level=debug msg="starting demultiplexing" brick=/tmp/gd2_func_test/TestBrickMux/brick773983421 source="[demultiplex.go:30:brickmux.Demultiplex]"

From the log file of the parent brick:

[2018-11-29 14:52:15.258611] W [socket.c:721:__socket_rwv] 0-socket.glusterfsd: writev on failed (Bad file descriptor)

So what's happening is that from Demultiplex(), the client.Call("Brick.OpBrickTerminate", req, &rsp) call is failing with an unexpected EOF on the socket. We did see that the demultiplex request reached the brick process and that it successfully detached the brick; however, most probably before sending the response back, the socket connection is broken, as highlighted by the __socket_rwv failure with the EBADF error code. We still need to figure out the root cause of this socket closure.
retest this please
Interestingly, with the latest GD2 & glusterfs master I ran the brick mux test in my local container 10 times but didn't see any failure yet:

root@b9bad6008a5e:/home/go/src/github.com/gluster/glusterd2# go test ./e2e -v -functest -run "TestBrickMux"
retest this please
Avoiding duplicate pmap signout is tracked through the fix https://review.gluster.org/#/c/glusterfs/+/21792/ and BZ https://bugzilla.redhat.com/show_bug.cgi?id=1655861
Force-pushed from 1b7497d to 98c2435
glusterd2/brickmux/demultiplex.go (Outdated)
	// make appropriate changes once glusterfsd side is fixed.
	time.Sleep(time.Millisecond * 200)

	log.WithField("brick", b.String()).Debug("deleting brick pid file")
The following debug logs do not make much sense.
@atinmu I have modified this log a little bit, just to track the deletion of the pid and socket files.
Force-pushed from 98c2435 to c3473e0
This PR adds support for multiple brick translator stacks running in a single brick server process. Multiplexing is controlled by the cluster.brick-multiplex global option. By default it's off, and bricks are started in separate processes as before. If multiplexing is enabled, then compatible bricks (those with the same volume options) will be started in the same process.

Multiplexing and demultiplexing involve sending an RPC request from glusterd2 to the brick process.

Pidfile and socket files:

- Each multiplexed brick will have a separate pidfile (returned by the PidFile() method of the Daemon interface).
- As multiplexed bricks will have only one server xlator, it will be listening on only one Unix Domain Socket file. A hardlink to this socketfile is created per multiplexed brick. Hence, glusterd2 will have a separate connection for each multiplexed brick (a small sketch follows).
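A minimal sketch of that hardlink idea, with made-up paths for illustration; the only point is that each multiplexed brick gets its own name for the single socket the brick process actually listens on:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Illustrative paths only; the real socket files live wherever glusterd2
	// places socket files for the host brick process.
	realSocket := "/var/run/glusterd2/first-brick.socket"   // the one listening socket
	muxedSocket := "/var/run/glusterd2/second-brick.socket" // per-brick hard link

	// After the link, both names refer to the same socket file, so glusterd2
	// can keep a per-brick connection without extra listeners in glusterfsd.
	if err := os.Link(realSocket, muxedSocket); err != nil {
		fmt.Println("link failed:", err)
	}
}
```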
Differences from glusterd1's implementation:

No dynamic information about the port is stored on disk or in etcd. Unlike glusterd1, all multiplexed bricks in a brick process will now sign in on reconnection (https://review.gluster.org/21503). This makes it easy for glusterd2 to repopulate its pmap registry on restart. This is a much simpler approach than what glusterd1 has to do, i.e. read /proc/<brick-pid>/fd/ and read stored port information.

Bonus:

The sign-in RPC request has been modified (https://review.gluster.org/21503) to send the PID of the brick process. Glusterd2 can (in future) rely on this instead of having to read/manage pidfiles.
Code organisation:

For easier review, most of the brick multiplexing code has been kept separate in the 'brickmux' package. Minimal changes have been introduced to other existing packages. However, there's an opportunity for a future refactor which can converge the 'brickmux', 'brick' and 'pmap' packages.

Using brick multiplexing:

$ glustercli volume set all cluster.brick-multiplex on
$ glustercli volume set all cluster.brick-multiplex off

What's not been done (not GCS priority per se):

- Brick multiplexing hasn't been implemented for the volume expand operation, i.e. add brick.
- The 'cluster.max-bricks-per-process' option hasn't been implemented. This means all compatible bricks shall be multiplexed to one single brick process.
- Turning brick multiplexing on/off with volumes actively running has no effect. In GCS production, brick multiplexing shall only be turned on at the beginning.

Reference (glusterd1 implementation):
https://review.gluster.org/16496

Closes #219

Signed-off-by: Prashanth Pai ppai@redhat.com