Migrate to tirpc usage [Tested | Works on F27, F28, F29, Centos7] #182
Conversation
TODO:
Force-pushed from 30a4bc5 to 3c9c230.
TODO's Progress:
Not yet done.
Done.
Done.
Force-pushed from 2701208 to 98a57b9.
All done. @amarts @nixpanic @lxbsz please help review this series. Most important is the part overriding the library function svc_getreq_poll() in daemon/gluster-blockd.c; without it, the daemon unregisters itself after the first CLI request. Some notable changes:
I have tested this PR on F27, F28 & F29 (with 3-node HA setups each) and everything works well for me. Thanks!
I could still hit the crash problem after testing for a while; my test steps were:
The code was like:
Compiled using:
Thanks
@lxbsz what is your libtirpc version?
The first time I compiled without the ..., and the second time is as in the last comment I pasted. Did I miss something important, @pkalever? I will test it again later. Thanks.
I reproduced it again, please see the detailed steps:
Yeah, this was coming from targetcli; switch to release v2.1.fb49 (not master) and rtslib 2.1.fb69. Not sure about the root cause for the trace seen above; I scaled up to 250 block-volume creates last night and didn't see any issue. Would it be possible to share your setup with me? Otherwise I will have to spend my cycles reproducing this one with ASan enabled.
Can you fix your targetcli and rtslib first? Meanwhile I will try reproducing this myself. Feel free to debug it yourself (just in case it's an env issue).
@pkalever The crashes in my setups are mostly in malloc and free, randomly everywhere, including what we can see in gfapi, etc., but this is not a gfapi issue: mainly the malloc chunk/metadata is corrupted by a double free in libtirpc. But there is still the hang where the RPC server sends no reply. So there are at least 2 issues, the double free and the RPC server hang; maybe they are the same one. For the double-free issue the root cause is still not found yet. Hope this is useful. Thanks.
The third issue: when deleting the block, I can hit the following errors very easily. It seems the memory is corrupted.
Mar 21 14:29:56 fedora2 gluster-blockd[16893]: 2 unable to free arguments, 4
@lxbsz right. I'm also hitting similar double-free issues. Maybe we should consider implementing our own stub routines :-( Let's investigate in this direction, as we don't have any other way.
Force-pushed from 7a777fb to a8157bf.
@pkalever But there is still the hang problem:
[2019-03-25 08:58:40.058450] ERROR: block_create_cli_1: RPC: Unable to receive; errno = Connection reset by peer
block block21 create on volume dht with hosts 10.70.39.243 failed
Thanks
@lxbsz thanks for trying this out.
Force-pushed from a8157bf to d984d5f.
@amarts @lxbsz At the moment the build of gluster-block on F28 and F29 fails without these patches. With these patches gluster-block enables integration with TIRPC, but unfortunately we sometimes hit the crash below in the TIRPC libraries, like:
However this is not very immediate; I can at least create 20-30 devices every time before I hit the crash. Now the question: can we live with it for now? IMHO, having these patches is better than gluster-block not working with TIRPC (on F28/F29). @amarts @lxbsz @nixpanic What's your take? Many thanks!
I am inclined to mark it as a known issue. One of the main bases for the project at the moment is CentOS/RHEL, and in that setup tirpc is not yet mandatory. That way, having these patches will let the project have proper RPMs, and we can always improve and fix things. Let's call out the issue at scale on setups where we have libtirpc in our release notes, and take these patches in.
If the problem is reproducible, it would be good to have a look at the coredump. It sounds like a use-after-free of some kind. This change is needed, and also encouraged for CentOS/RHEL builds. Identifying and fixing the issue should have a relatively high priority.
@nixpanic but this must be coming from libtirpc itself, right? A more detailed crash:
Yeah, I am trying to send one patch to make sure that only one svc_run loop will run in one process. Thanks.
I created around five hundred blocks and the above patch works for me. Thanks
Good news indeed :-)
Nice work on figuring out the issue!
👍 Cool, hopefully this allows us to make a release.
Force-pushed from 2a96b5c to 3e23a58.
@amarts @lxbsz updated the PR now. Splitting this PR into sub-PRs in the below fashion:
BTW: expect this PR to see Travis build failures, as the build happens on F27 and this PR expects '--enable-tirpc=no' to be passed at build time.
AC_SUBST(TIRPC_CFLAGS)
AC_SUBST(TIRPC_LIBS)
if test "$enable_tirpc" != "no"; then
enable_tirpc="yes";
We may need to add a help string, something like:
$ ./configure --help
[...]
--enable-tirpc Please enable libtirpc library if glibc >= 2.26 [default=yes]
[...]
Would this make sense?
Hmm, Okay! We can add a help string.
Will refresh this in the next spin.
Thanks!
@lxbsz please check:
✨ ./configure --help | grep tirpc
--enable-libtirpc enable use of tirpc [default: yes]
Yeah, cool. it works as expected.
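For reference, an --enable option with a help string like the one shown above is normally declared with AC_ARG_ENABLE in configure.ac. A rough sketch, assuming pkg-config is used to locate libtirpc (this is not the project's actual configure.ac):

```m4
dnl Sketch of an --enable-tirpc option with a help string; names and
dnl defaults here are assumptions, not the gluster-block sources.
AC_ARG_ENABLE([tirpc],
  [AS_HELP_STRING([--enable-tirpc],
    [enable use of tirpc @<:@default: yes@:>@])],
  [enable_tirpc=$enableval],
  [enable_tirpc=yes])

AS_IF([test "x$enable_tirpc" != "xno"], [
  PKG_CHECK_MODULES([TIRPC], [libtirpc])
  AC_SUBST([TIRPC_CFLAGS])
  AC_SUBST([TIRPC_LIBS])
])
```

The AS_HELP_STRING text is what `./configure --help | grep tirpc` would print.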
Force-pushed from 3e23a58 to 49d6af4.
daemon/gluster-blockd.c
Outdated
LOG("mgmt", GB_LOG_DEBUG,
    "server process received signal:%d with pid:%lu", signum, getpid());
exit(EXIT_SUCCESS); /* server process terminated by signal */
Here we may use svc_stop() instead to stop the loop, and libtirpc will do the cleanup to finish the current work. Would that make sense?
Yeah, I was looking for something like that actually :-)
Was not aware (forgot) there is a graceful way of exiting the svc_run loop.
Thanks, let me test this and incorporate the change.
Cool, before, when debugging this issue, I tested this code and it worked for me at that time.
Thanks.
BTW: where did you see svc_stop () ?
gluster-blockd.c:79:3: warning: implicit declaration of function ‘svc_stop’; did you mean ‘svc_stat’? [-Wimplicit-function-declaration]
svc_stop();
Do you mean svc_exit()?
Yeah, it is.
Sorry, I mixed up svc_exit with the other libs.
For the libtirpc code, in svc_run() IMO, it should add the rdlock to protect the svc_pollfd, or it may cause the crash here.
For the libtirpc code, in svc_run() IMO, it should add the rdlock to protect the svc_pollfd, or it may cause the crash here.
Please ignore this too; currently svc_exit() must be called in the process's signal handler, or it will hit the crash issue.
kill(ctx.chpid, signum); /* Pass the signal to server process */
waitpid(ctx.chpid, &wstatus, 0);
And also here use svc_stop to stop the loop; then just return, and the cli process will also do the cleanup in libtirpc.
Then the cli process will waitpid for the child process at line 602.
AC_SUBST(TIRPC_CFLAGS)
AC_SUBST(TIRPC_LIBS)
if test "$enable_tirpc" != "no"; then
enable_tirpc="yes";
Yeah, cool. it works as expected.
@pkalever Thanks.
Force-pushed from 49d6af4 to d84d632.
@lxbsz please check this out.
daemon/gluster-blockd.c
Outdated
svc_exit();
exit(EXIT_SUCCESS);
There is no need to do the exit(EXIT_SUCCESS) here.
For some reason I see WEXITSTATUS() returning 1 if I don't enforce this here, which means an abnormal exit of the child.
Can you check?
There is another problem:
[2019-04-15 11:54:52.545869] INFO: cli process pid: (18235) [at gluster-blockd.c+580 :]
[2019-04-15 11:54:52.546274] INFO: server process pid: (18250) [at gluster-blockd.c+570 :]
[2019-04-15 11:56:31.793613] INFO: Block Hosting Volfile Server Set to: localhost [at utils.c+276 :]
[2019-04-15 11:56:31.793715] CRIT: glfsLruCount now is 5 [at lru.c+42 :]
[2019-04-15 11:56:31.793729] CRIT: logLevel now is INFO [at utils.c+53 :]
[2019-04-15 11:56:31.793920] INFO: Inotify is watching "/etc/sysconfig", wd: 1, mask: IN_MODIFY [at dyn-config.c+510 :]
[2019-04-15 11:56:31.796159] INFO: Distro ID=fedora. Current kernel version: '4.18.16-300.fc29.x86_64'. [at gluster-blockd.c+421 :]
[2019-04-15 11:56:35.381259] INFO: Block Hosting Volfile Server Set to: localhost [at utils.c+276 :]
[2019-04-15 11:56:35.381346] CRIT: glfsLruCount now is 5 [at lru.c+42 :]
[2019-04-15 11:56:35.381539] CRIT: logLevel now is INFO [at utils.c+53 :]
[2019-04-15 11:56:35.382742] INFO: Inotify is watching "/etc/sysconfig", wd: 1, mask: IN_MODIFY [at dyn-config.c+510 :]
[2019-04-15 11:56:35.384463] INFO: Distro ID=fedora. Current kernel version: '4.18.16-300.fc29.x86_64'. [at gluster-blockd.c+421 :]
[2019-04-15 11:56:36.870342] INFO: capabilities fetched successfully [at gluster-blockd.c+550 :]
[2019-04-15 11:56:36.871805] INFO: cli process pid: (19434) [at gluster-blockd.c+580 :]
[2019-04-15 11:56:36.871881] INFO: server process pid: (19449) [at gluster-blockd.c+570 :]
[2019-04-15 11:56:36.872043] ERROR: bind on port 24010 failed (Address already in use) [at gluster-blockd.c+263 :]
[root@fedora2 gluster-block]#
[root@fedora2 gluster-block]# systemctl status gluster-blockd tcmu-runner
● gluster-blockd.service - Gluster block storage utility
Loaded: loaded (/usr/local/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-04-15 17:26:35 IST; 47s ago
Main PID: 19434 (gluster-blockd)
Tasks: 4 (limit: 2357)
Memory: 2.7M
CGroup: /system.slice/gluster-blockd.service
├─18250 /usr/local/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO
└─19434 /usr/local/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO
I restarted the gluster-blockd service; the cli process is new, but the server process is still the old one.
@lxbsz
Did you notice the bind failure in the above log ?
It might have happened that your parent process exited before the child in your previous run and the child was re-parented to pid 1.
Can you run
# pkill -9 gluster-blockd
then try again ?
Please ignore this; it should be because I killed only the cli process with kill -9 and then restarted it.
For some reason I see WEXITSTATUS() returning 1 if I don't enforce this here.
Which means an abnormal exit of child.Can you check ?
From the manual we can see that WIFEXITED() returning 1 (true) means a normal exit:
WIFEXITED(status)
returns true if the child terminated normally, that is, by calling exit(3) or _exit(2), or by returning from main().
That means the process called exit(EXIT_SUCCESS == 0) or returned directly.
I agree, but how can we consider the child returning value 1 as success/normal?
And do you see anything wrong with exit(EXIT_SUCCESS) from the child after svc_exit()?
Yeah, if we directly exit(EXIT_SUCCESS) here, then there may be no chance to do a graceful exit for the child process, which would execute the code after svc_run() in the server process.
Makes sense?
@lxbsz
Hmm, makes sense to me, I got your concern here.
Please check the latest patch, which fixes it by moving the exit() part to glusterBlockServerProcess.
svc_exit();
return;
And we could remove the 'return;' here.
This is a matter of style, to indicate we reached the end of a void function.
In fact you can notice this in many void functions across the gluster-block code.
Void functions can return without a value :-)
Okay.
@amarts @nixpanic Requesting a final review on this PR [+ the other split PRs, #182 (comment)]? Thanks!
Force-pushed from d84d632 to 7c8fa21.
@pkalever
Looks good to me.
Thanks.
Reviewed By: Amar Tumballi
glibc has removed the rpc functions from current releases. Instead of relying on glibc providing these, the modern libtirpc library should be used instead.

Change-Id: I46c979e52147abce956255de5ad16b01b5621f52
Updates: gluster#56
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Reviewed-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
glibc will not contain the `rpcgen` tool in new versions anymore. In Fedora 28 this tool can now be found in its own package.

Change-Id: I1de67cdf7418cb509e096e62a4201b5b8707ef24
Updates: gluster#56
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Reviewed-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Creating the socket is not needed for svcunix_create(); it does all the work for us (and prevents "address already in use" errors). For svctcp_create(), the socket is expected on port 24010 and hence needs to be set up completely with bind() and listen() before it can be used. Registering the RPC program should make it possible to use a dynamically assigned port, but the gluster-block CLI expects it on the static port.

This also fixes a warning when building the RPM from the 'make dist' tarball, as 22 June 2017 was a Thursday, not a Tuesday.

Changes from pkalever: add a flag at config time for tweaking the use of tirpc or glibc sunrpc:
$ ./autogen.sh && ./configure --enable-tirpc=yes/no

Change-Id: If3a6b7527399dd0a5a16f4273efdd607617289de
Updates: gluster#56
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Problem: When using TIRPC we have seen some contention with poll fds; this was because we are calling svc_run() in both the cli and remote threads. Probably an issue in TIRPC, as the current model worked well for us with the glibc sun-rpc implementation.

Solution: Convert the cli and remote threads to individual processes.

How to compile?
On systems having glibc sun-rpc (F27/CentOS/RHEL):
$ ./autogen.sh && ./configure --enable-tirpc=no
On systems having tirpc (F28 & above):
$ ./autogen.sh && ./configure

Thanks to Xiubo for debugging this issue along.

Closes: gluster#57
Closes: gluster#165
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Force-pushed from 7c8fa21 to e87fef6.
This PR works with the glibc sunrpc protocol:
$ ./autogen.sh && ./configure --enable-tirpc=no && make -j install
For using tirpc protocol:
$ ./autogen.sh && ./configure --enable-tirpc=yes && make -j install
or
$ ./autogen.sh && ./configure && make -j install