
Migrate to tirpc usage [Tested | Works on F27, F28, F29, Centos7] #182

Merged
merged 4 commits into gluster:master from glibc-ti-rpc-comp-fix on Apr 15, 2019

Conversation

pkalever
Contributor

@pkalever pkalever commented Mar 18, 2019

This PR works with the glibc sunrpc protocol:
$ ./autogen.sh && ./configure --enable-tirpc=no && make -j install

To use the tirpc protocol:
$ ./autogen.sh && ./configure --enable-tirpc=yes && make -j install
or
$ ./autogen.sh && ./configure && make -j install

@ghost ghost assigned pkalever Mar 18, 2019
@ghost ghost added the in progress label Mar 18, 2019
@pkalever
Contributor Author

@nixpanic @amarts @lxbsz @itisravi Please take a look. Thanks!

@pkalever
Contributor Author

pkalever commented Mar 18, 2019

TODO:

  • some code cleanup
  • Maintain compatibility with glibc sunrpc (F27)
  • Edit the travis docker build file to fix
# cat build.log
[...]
checking for TIRPC... no
configure: error: libtirpc is required to build gluster-block
The command '/bin/sh -c true  && ./autogen.sh  && ./configure  && make  && make check  && make install  && make clean  && true' returned a non-zero code: 1
The command "docker build -f ./extras/docker/Dockerfile.buildtest ." exited with 1.

@pkalever pkalever changed the title from [WIP] Migrate to tirpc usage [Tested | Works on F28 & F29] to [WIP] Migrate to tirpc usage [Tested | Works on F27, F28 & F29] Mar 18, 2019
@pkalever
Contributor Author

TODO's Progress:

  • some code cleanup

Not yet done.
Will do it soon.

  • Maintain compatibility with glibc sunrpc (F27)

Done.
With this PR gluster-block works on F27, F28 and F29, i.e. with both the glibc and tirpc libraries.

  • Edit the travis docker build file to fix
# cat build.log
[...]
checking for TIRPC... no
configure: error: libtirpc is required to build gluster-block
The command '/bin/sh -c true  && ./autogen.sh  && ./configure  && make  && make check  && make install  && make clean  && true' returned a non-zero code: 1
The command "docker build -f ./extras/docker/Dockerfile.buildtest ." exited with 1.

Done.
Along with the current F27 docker file, I have added docker build files for F28 and F29 too. As per the travis build test, all checks (builds) pass on F27, F28 & F29.

@pkalever pkalever force-pushed the glibc-ti-rpc-comp-fix branch 2 times, most recently from 2701208 to 98a57b9 on March 18, 2019 16:10
@pkalever pkalever changed the title from [WIP] Migrate to tirpc usage [Tested | Works on F27, F28 & F29] to Migrate to tirpc usage [Tested | Works on F27, F28 & F29] Mar 18, 2019
@pkalever
Contributor Author

pkalever commented Mar 18, 2019

TODO's Progress:

All Done.

@amarts @nixpanic @lxbsz please help review this series. Most important is the part overriding the library function svc_getreq_poll() in daemon/gluster-blockd.c, without which the daemon unregisters itself after the first cli request.
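For context, a minimal sketch of what such an override can look like, following the standard svc_getreq_poll() shape (this only illustrates the idea, not the exact code in this PR; the POLLNVAL handling is an assumption):

/* Application-local svc_getreq_poll(): the linker binds svc_run()'s
 * dispatch to this copy instead of the one in libtirpc. Assumption
 * for illustration: skip the transport unregistration on POLLNVAL,
 * so the service is not dropped after the first request. */
#include <rpc/rpc.h>
#include <poll.h>

void
svc_getreq_poll(struct pollfd *pfdp, int pollretval)
{
        int i, fds_found;

        for (i = fds_found = 0; fds_found < pollretval; i++) {
                struct pollfd *p = &pfdp[i];

                if (!p->revents)
                        continue;
                fds_found++;
                if (p->revents & POLLNVAL)
                        continue;        /* don't unregister the transport */
                svc_getreq_common(p->fd);
        }
}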

Some notable changes:

  • with glibc we use a unix socket (cli) + an inet socket (daemon)
  • with tirpc we use inet socket communication for the cli too (of course on localhost); see the sketch below
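A rough sketch of the two setups (hedged: HAVE_TIRPC, sockfd and GB_UNIX_ADDRESS are illustrative names here, not necessarily the exact PR code):

/* cli listener: unix socket with glibc sunrpc, localhost inet socket
 * with tirpc; error handling trimmed for brevity */
SVCXPRT *transp;
#ifdef HAVE_TIRPC
/* sockfd is assumed to be already bound/listening on a localhost port */
transp = svctcp_create(sockfd, 0, 0);
#else
transp = svcunix_create(RPC_ANYSOCK, 0, 0, GB_UNIX_ADDRESS);
#endif
if (!transp)
        exit(EXIT_FAILURE);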

I have tested this PR on F27, F28 & F29 (with HA 3 setups each); all works well for me.

Thanks!

@lxbsz
Collaborator

lxbsz commented Mar 19, 2019

@pkalever

I could still hit the crash problem after testing for a while. My test steps, on Fedora 29, were:
1. Created one block device with HA = 1; it succeeded.
2. Then tried to delete it, but that failed; this should be a targetcli-fb/rtslib issue, as I am using the upstream version you suggested before. Up to this point gluster-blockd was still working well.
3. Tried to delete the same block device again using 'force'.
4. Executed 'targetcli saveconfig'.
5. Tried to create a new block device, but found the following errors:

# targetcli 
targetcli shell version 2.1.fb48
Copyright 2011-2013 by Datera, Inc and others.
For help on commands, type 'help'.

/> saveconfig 
Configuration saved to /etc/target/saveconfig.json
/> exit

# ls /etc/target/
backup  saveconfig.json

# gluster-block create repvol/block116 ha 1 10.70.39.241 1G
client create failed: RPC: Program not registered, tcp localhost

# systemctl status tcmu-runner gluster-blockd
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-03-19 06:59:35 IST; 8min ago
 Main PID: 8397 (tcmu-runner)
    Tasks: 20 (limit: 2357)
   Memory: 7.0M
   CGroup: /system.slice/tcmu-runner.service
           └─8397 /usr/bin/tcmu-runner

Mar 19 06:59:35 fedora3 systemd[1]: Starting LIO Userspace-passthrough daemon...
Mar 19 06:59:35 fedora3 tcmu-runner[8397]: log file path now is '/var/log/tcmu-runner.log'
Mar 19 06:59:35 fedora3 tcmu-runner[8397]: Starting...
Mar 19 06:59:35 fedora3 tcmu-runner[8397]: tcmu_cfgfs_set_str:287: Kernel does not support configfs file /sys/module/target_core_user/parameters/block_netlink.
Mar 19 06:59:35 fedora3 systemd[1]: Started LIO Userspace-passthrough daemon.

● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/local/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Tue 2019-03-19 07:05:33 IST; 2min 22s ago
  Process: 8419 ExecStart=/usr/local/sbin/gluster-blockd --glfs-lru-count $GB_GLFS_LRU_COUNT --log-level $GB_LOG_LEVEL $GB_EXTRA_ARGS (code=killed, signal=SEGV)
 Main PID: 8419 (code=killed, signal=SEGV)

Mar 19 06:59:36 fedora3 gluster-blockd[8419]: Parameter loglevel_file is now 'info'.
Mar 19 06:59:36 fedora3 gluster-blockd[8419]: Parameter logfile is now '/var/log/gluster-block/gluster-block-configshell.log'.
Mar 19 06:59:36 fedora3 gluster-blockd[8419]: Parameter auto_save_on_exit is now 'false'.
Mar 19 06:59:48 fedora3 gluster-blockd[8419]: Command not found saveconfig
Mar 19 07:01:49 fedora3 gluster-blockd[8419]: Unexpected keyword parameter 'save'.
Mar 19 07:03:43 fedora3 gluster-blockd[8419]: Unexpected keyword parameter 'save'.
Mar 19 07:03:43 fedora3 gluster-blockd[8419]: No such Target in configfs: /sys/kernel/config/target/iscsi/iqn.2016-12.org.gluster-block:9ddcbf01-2bfc-419d-ab17-f116b6767f88
Mar 19 07:05:33 fedora3 gluster-blockd[8419]: Unexpected keyword parameter 'save'.
Mar 19 07:05:33 fedora3 systemd[1]: gluster-blockd.service: Main process exited, code=killed, status=11/SEGV
Mar 19 07:05:33 fedora3 systemd[1]: gluster-blockd.service: Failed with result 'signal'.


#journalctl -r
-- Logs begin at Sun 2019-03-10 09:11:13 IST, end at Tue 2019-03-19 07:05:34 IST. --
Mar 19 07:05:34 fedora3 abrt-notification[8778]: Process 8419 (gluster-blockd) crashed in ??()
Mar 19 07:05:33 fedora3 systemd-coredump[8726]: Process 8419 (gluster-blockd) of user 0 dumped core.
                                                
                                                Stack trace of thread 8444:
                                                #0  0x00007f61eb79f420 n/a (libtirpc.so.3)
                                                #1  0x00007f61eb7a3853 n/a (libtirpc.so.3)
                                                #2  0x00007f61eb7a398f n/a (libtirpc.so.3)
                                                #3  0x00007f61eb7a3b67 n/a (libtirpc.so.3)
                                                #4  0x00007f61eb7a3bea n/a (libtirpc.so.3)
                                                #5  0x00007f61eb7a3caf n/a (libtirpc.so.3)
                                                #6  0x00007f61eb7a2b59 xdr_u_int32_t (libtirpc.so.3)
                                                #7  0x00007f61eb796d5b xdr_callmsg (libtirpc.so.3)
                                                #8  0x00007f61eb79fb0c n/a (libtirpc.so.3)
                                                #9  0x00007f61eb79c0a8 svc_getreq_common (libtirpc.so.3)
                                                #10 0x0000000000405ba9 svc_getreq_poll (gluster-blockd)
                                                #11 0x00007f61eb79eb6e svc_run (libtirpc.so.3)
                                                #12 0x00000000004057fe glusterBlockServerThreadProc (gluster-blockd)
                                                #13 0x00007f61eb98358e start_thread (libpthread.so.0)
                                                #14 0x00007f61eb6bb513 __clone (libc.so.6)
                                                
                                                Stack trace of thread 8420:
                                                #0  0x00007f61eb98cd34 read (libpthread.so.0)
                                                #1  0x0000000000435d7e glusterBlockDynConfigStart (gluster-blockd)
                                                #2  0x00007f61eb98358e start_thread (libpthread.so.0)
                                                #3  0x00007f61eb6bb513 __clone (libc.so.6)
[......]
                                                Stack trace of thread 8460:
                                                #0  0x00007f61eb6bb847 epoll_wait (libc.so.6)
                                                #1  0x00007f61eb8affe4 n/a (libglusterfs.so.0)
                                                #2  0x00007f61eb98358e start_thread (libpthread.so.0)
                                                #3  0x00007f61eb6bb513 __clone (libc.so.6)
Mar 19 07:05:33 fedora3 systemd[1]: gluster-blockd.service: Failed with result 'signal'.
Mar 19 07:05:33 fedora3 systemd[1]: gluster-blockd.service: Main process exited, code=killed, status=11/SEGV
Mar 19 07:05:33 fedora3 kernel: Code: 25 28 00 00 00 48 89 44 24 08 31 c0 48 85 ff 0f 84 02 01 00 00 4c 8b af 80 00 00 00 48 89 fb 49 89 f7 41 89 d4 8b 2f 49 89 e6 <41> 8b >
Mar 19 07:05:33 fedora3 kernel: gluster-blockd[8444]: segfault at 1d4 ip 00007f61eb79f420 sp 00007f61e9ea45f0 error 4 in libtirpc.so.3.0.0[7f61eb78c000+1f000]

The code is:

# git log --oneline
e5d1cff (HEAD -> pk1) glfs  ==> this is my local patch fixing the glfs_ftruncate(,, NULL, NULL) issue
98a57b9 (origin/glibc-ti-rpc-comp-fix) travis: build on f27, f28 and f29
cfd8a89 gluster-block: fix compatability with glibc sunrpc
9f6a207 gluster-block: use inet domain for cli communation too
97a4ce9 rpc: fix assert in __xprt_do_unregister
67af766 rpc: fix leak reported by asan
49c0950 rpc: fix heap-buffer-overflow reported by asan
e7dc3a8 daemon: check for dependencies versions at runtime
1ae9f1e gluster-block: correct some rpc socket setup bits
ae5d750 rpc: support the modern non-glibc rpcgen tool
eb1f602 rpc: use modern libtirpc instead of old glibc implementation
2629966 (origin/master, origin/HEAD, master) cli: add timeout option
9b0849c Minor fixes (#176)
0903285 cli: support env variable way of controlling GB_CLI_TIMEOUT
61e6fa8 cli-timeout: make rpc-timeout as configurable option
0fd2ea4 coverity: fix multiple issues (#172)
9e38bb5 coverity: fix pointless_string_compare issues (#171)
21771ee loadConfig: retry opening conf-file if the initial attempt fail
88624a0 cli: push output to stderr if remote rpc return non-zero
d913aa7 all: fix high severity issues from coverity (#162)
b80148e gluster-blockd: defend for NULL to avoid crash
866f9e6 cli: clean up the code style

Compiled using:

./autogen.sh && ./configure --enable-tirpc=yes && make -j install

Thanks

@pkalever
Contributor Author

@lxbsz what is your libtirpc version?
Also, how frequently do you hit this, i.e. every time or only once?

@lxbsz
Collaborator

lxbsz commented Mar 19, 2019

@lxbsz what is your libtirpc version?
Also, how frequently do you hit this, i.e. every time or only once?

[root@fedora3 gluster-block]# rpm -qa|grep libtirpc
libtirpc-1.1.4-2.rc2.fc29.x86_64
libtirpc-devel-1.1.4-2.rc2.fc29.x86_64
[root@fedora3 gluster-block]# 
[root@fedora3 gluster-block]# cat /etc/os-release 
NAME=Fedora
VERSION="29 (Server Edition)"
ID=fedora
VERSION_ID=29
PLATFORM_ID="platform:f29"
PRETTY_NAME="Fedora 29 (Server Edition)"
ANSI_COLOR="0;34"
CPE_NAME="cpe:/o:fedoraproject:fedora:29"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f29/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=29
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=29
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Server Edition"
VARIANT_ID=server
[root@fedora3 gluster-block]# 

The first time I compiled without the --enable-tirpc=yes option and it crashed immediately, so I ran make clean and reconfigured with your method.

The second time is as in the last comment I pasted above.

Did I miss something important, @pkalever?

I will test it again later.

Thanks.

@lxbsz
Collaborator

lxbsz commented Mar 19, 2019

@pkalever

I reproduced it again; please see the detailed steps:


[root@fedora3 gluster-block]# systemctl restart tcmu-runner gluster-blockd
[root@fedora3 gluster-block]# systemctl status tcmu-runner gluster-blockd
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-03-19 08:42:57 IST; 2s ago
 Main PID: 10688 (tcmu-runner)
    Tasks: 6 (limit: 2357)
   Memory: 1.5M
   CGroup: /system.slice/tcmu-runner.service
           └─10688 /usr/bin/tcmu-runner

Mar 19 08:42:57 fedora3 systemd[1]: Starting LIO Userspace-passthrough daemon...
Mar 19 08:42:57 fedora3 tcmu-runner[10688]: log file path now is '/var/log/tcmu-runner.log'
Mar 19 08:42:57 fedora3 tcmu-runner[10688]: Starting...
Mar 19 08:42:57 fedora3 tcmu-runner[10688]: tcmu_cfgfs_set_str:287: Kernel does not support configfs file /sys/module/target_core_user/parameters/block_netlink.
Mar 19 08:42:57 fedora3 systemd[1]: Started LIO Userspace-passthrough daemon.

● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/local/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-03-19 08:42:57 IST; 2s ago
 Main PID: 10696 (gluster-blockd)
    Tasks: 4 (limit: 2357)
   Memory: 1.8M
   CGroup: /system.slice/gluster-blockd.service
           └─10696 /usr/local/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO

Mar 19 08:42:57 fedora3 systemd[1]: Started Gluster block storage utility.
Mar 19 08:42:57 fedora3 gluster-blockd[10696]: Parameter auto_add_default_portal is now 'false'.
Mar 19 08:42:57 fedora3 gluster-blockd[10696]: Parameter auto_enable_tpgt is now 'false'.
Mar 19 08:42:57 fedora3 gluster-blockd[10696]: Parameter loglevel_file is now 'info'.
Mar 19 08:42:57 fedora3 gluster-blockd[10696]: Parameter logfile is now '/var/log/gluster-block/gluster-block-configshell.log'.
Mar 19 08:42:57 fedora3 gluster-blockd[10696]: Parameter auto_save_on_exit is now 'false'.
[root@fedora3 gluster-block]# gluster-block create repvol/block200 ha 1 10.70.39.241 1G
IQN: iqn.2016-12.org.gluster-block:8bb3e789-3ce3-4453-a2f7-6273c8e78901
PORTAL(S):  10.70.39.241:3260
RESULT: SUCCESS
[root@fedora3 gluster-block]# targetcli ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 0]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  | o- user:glfs .............................................................................................. [Storage Objects: 1]
  |   o- block200 ........................ [repvol@10.70.39.241/block-store/8bb3e789-3ce3-4453-a2f7-6273c8e78901 (1.0GiB) activated]
  |     o- alua ................................................................................................... [ALUA Groups: 3]
  |       o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  |       o- glfs_tg_pt_gp_ano .................................................................. [ALUA state: Active/non-optimized]
  |       o- glfs_tg_pt_gp_ao ....................................................................... [ALUA state: Active/optimized]
  o- iscsi ............................................................................................................ [Targets: 1]
  | o- iqn.2016-12.org.gluster-block:8bb3e789-3ce3-4453-a2f7-6273c8e78901 ................................................ [TPGs: 1]
  |   o- tpg1 .................................................................................................. [gen-acls, no-auth]
  |     o- acls .......................................................................................................... [ACLs: 0]
  |     o- luns .......................................................................................................... [LUNs: 1]
  |     | o- lun0 ............................................................................... [user/block200 (glfs_tg_pt_gp_ao)]
  |     o- portals .................................................................................................... [Portals: 1]
  |       o- 10.70.39.241:3260 ................................................................................................ [OK]
  o- loopback ......................................................................................................... [Targets: 0]
  o- vhost ............................................................................................................ [Targets: 0]
[root@fedora3 gluster-block]# gluster-block delete repvol/block200
FAILED ON:   10.70.39.241
SUCCESSFUL ON: None
RESULT: FAIL
[root@fedora3 gluster-block]# gluster-block delete repvol/block200 force
FAILED ON:   10.70.39.241
SUCCESSFUL ON: None
RESULT: SUCCESS
[root@fedora3 gluster-block]# gluster-block create repvol/block201 ha 1 10.70.39.241 1G
IQN: iqn.2016-12.org.gluster-block:1a97d140-b39f-402c-aed4-a55ae1f7a4e7
PORTAL(S):  10.70.39.241:3260
RESULT: SUCCESS
[root@fedora3 gluster-block]# gluster-block delete repvol/block200 force
block repvol/block200 doesn't exist
RESULT:FAIL
[root@fedora3 gluster-block]# gluster-block delete repvol/block201 force
FAILED ON:   10.70.39.241
SUCCESSFUL ON: None
RESULT: SUCCESS
[root@fedora3 gluster-block]# gluster-block delete repvol/block201 force
block repvol/block201 doesn't exist
RESULT:FAIL
[root@fedora3 gluster-block]# gluster-block create repvol/block201 ha 1 10.70.39.241 1G
IQN: -
PORTAL(S): -
ROLLBACK FAILED ON: 10.70.39.241 
RESULT: FAIL
[root@fedora3 gluster-block]# gluster-block create repvol/block201 ha 1 10.70.39.241 1G
BLOCK with name: 'block201' already EXIST

RESULT:FAIL
[root@fedora3 gluster-block]# gluster-block create repvol/block200 ha 1 10.70.39.241 1G
IQN: -
PORTAL(S): -
ROLLBACK FAILED ON: 10.70.39.241 
RESULT: FAIL
[root@fedora3 gluster-block]# gluster-block create repvol/block202 ha 1 10.70.39.241 1G
client create failed: RPC: Program not registered, tcp localhost
[root@fedora3 gluster-block]# systemctl status tcmu-runner gluster-blockd
● tcmu-runner.service - LIO Userspace-passthrough daemon
   Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-03-19 08:42:57 IST; 2min 20s ago
 Main PID: 10688 (tcmu-runner)
    Tasks: 16 (limit: 2357)
   Memory: 6.6M
   CGroup: /system.slice/tcmu-runner.service
           └─10688 /usr/bin/tcmu-runner

Mar 19 08:42:57 fedora3 systemd[1]: Starting LIO Userspace-passthrough daemon...
Mar 19 08:42:57 fedora3 tcmu-runner[10688]: log file path now is '/var/log/tcmu-runner.log'
Mar 19 08:42:57 fedora3 tcmu-runner[10688]: Starting...
Mar 19 08:42:57 fedora3 tcmu-runner[10688]: tcmu_cfgfs_set_str:287: Kernel does not support configfs file /sys/module/target_core_user/parameters/block_netlink.
Mar 19 08:42:57 fedora3 systemd[1]: Started LIO Userspace-passthrough daemon.

● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/local/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2019-03-19 08:44:59 IST; 18s ago
  Process: 10696 ExecStart=/usr/local/sbin/gluster-blockd --glfs-lru-count $GB_GLFS_LRU_COUNT --log-level $GB_LOG_LEVEL $GB_EXTRA_ARGS (code=exited, status=1/FAILURE)
 Main PID: 10696 (code=exited, status=1/FAILURE)

Mar 19 08:44:39 fedora3 gluster-blockd[10696]: This ALUATargetPortGroup already exists in configFS
Mar 19 08:44:39 fedora3 gluster-blockd[10696]: This ALUATargetPortGroup already exists in configFS
Mar 19 08:44:39 fedora3 gluster-blockd[10696]: Command not found saveconfig
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: UserBackedStorageObject creation failed.
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: This ALUATargetPortGroup already exists in configFS
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: This ALUATargetPortGroup already exists in configFS
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: Command not found saveconfig
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: unable to free arguments
Mar 19 08:44:59 fedora3 systemd[1]: gluster-blockd.service: Main process exited, code=exited, status=1/FAILURE
Mar 19 08:44:59 fedora3 systemd[1]: gluster-blockd.service: Failed with result 'exit-code'.
[root@fedora3 gluster-block]# 

@pkalever
Contributor Author

@lxbsz

[root@fedora3 gluster-block]# gluster-block delete repvol/block200 force
FAILED ON: 10.70.39.241
SUCCESSFUL ON: None
RESULT: SUCCESS

Yeah, this was coming from targetcli; switch to release v2.1.fb49 (not master) and rtslib 2.1.fb69.

Not sure about the root cause of the trace seen above; I scaled up to 250 block volume creates last night and didn't see any issue. Would it be possible to share your setup with me? Otherwise I will have to spend my cycles reproducing this one with asan enabled.

Mar 19 08:44:39 fedora3 gluster-blockd[10696]: Command not found saveconfig
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: UserBackedStorageObject creation failed.
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: This ALUATargetPortGroup already exists in configFS
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: This ALUATargetPortGroup already exists in configFS
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: Command not found saveconfig
Mar 19 08:44:59 fedora3 gluster-blockd[10696]: unable to free arguments

Can you fix your targetcli and rtslib first? Meanwhile I will try reproducing this myself.

Feel free to debug it yourself as well (just in case it's an env issue).

@lxbsz
Collaborator

lxbsz commented Mar 21, 2019

@pkalever
In libtirpc.so there seems to be a double-free issue. I debugged it by patching it to not free any memory at all, and with that change I no longer see any crash.

The crashes in my setups are mostly in malloc and free, randomly everywhere (we can also see them in gfapi, etc.), but this is not a gfapi issue; mainly, the malloc chunk/metadata is corrupted by the double free in libtirpc.

But there is still the hang where the RPC server gives no reply.

So there are at least 2 issues: a memory double free and the RPC server getting stuck; maybe they are the same one.

For the double-free issue the root cause is not found yet.

Hope this is useful.

Thanks.

@lxbsz
Collaborator

lxbsz commented Mar 21, 2019

A third issue: when deleting a block, I can hit the following errors very easily. It seems the memory is corrupted.

Mar 21 14:29:56 fedora2 gluster-blockd[16893]: 2 unable to free arguments, 4

void
gluster_block_cli_1(struct svc_req *rqstp, register SVCXPRT *transp)
{
[...]
        /* rpcgen-generated dispatch stub, with debug output added around
         * the failing svc_freeargs() call */
        if (!svc_freeargs (transp, (xdrproc_t) _xdr_argument, (caddr_t) &argument)) {
                fprintf (stderr, "%s, %d", "2 unable to free arguments", rqstp->rq_proc);
                exit (1);
        }
        if (!gluster_block_cli_1_freeresult (transp, _xdr_result, (caddr_t) &result))
                fprintf (stderr, "%s", "unable to free results");

        return;
}
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/local/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2019-03-21 14:29:56 IST; 455ms ago
  Process: 16893 ExecStart=/usr/local/sbin/gluster-blockd --glfs-lru-count $GB_GLFS_LRU_COUNT --log-level $GB_LOG_LEVEL $GB_EXTRA_ARGS (code=exited, status=1/FAILURE)
 Main PID: 16893 (code=exited, status=1/FAILURE)

Mar 21 14:29:21 fedora2 gluster-blockd[16893]: lxb clnt_vc_call:427
Mar 21 14:29:21 fedora2 gluster-blockd[16893]: lxb clnt_vc_call:433
Mar 21 14:29:21 fedora2 gluster-blockd[16893]: lxb clnt_vc_call:435
Mar 21 14:29:56 fedora2 gluster-blockd[16893]: lxb read_vc:547
Mar 21 14:29:56 fedora2 gluster-blockd[16893]: lxb read_vc:566
Mar 21 14:29:56 fedora2 gluster-blockd[16893]: lxb read_vc:568
Mar 21 14:29:56 fedora2 gluster-blockd[16893]: lxb svc_vc_freeargs:711, ret: 0
Mar 21 14:29:56 fedora2 gluster-blockd[16893]: 2 unable to free arguments, 4
Mar 21 14:29:56 fedora2 systemd[1]: gluster-blockd.service: Main process exited, code=exited, status=1/FAILURE
Mar 21 14:29:56 fedora2 systemd[1]: gluster-blockd.service: Failed with result 'exit-code'.

@pkalever
Contributor Author

@lxbsz right. I'm also hitting similar double-free issues. Maybe we should consider implementing our own stub routines :-(

Let's investigate in this direction, as we don't have any other way.

@lxbsz
Collaborator

lxbsz commented Mar 25, 2019

@lxbsz right. I'm also hitting similar double-free issues. Maybe we should consider implementing our own stub routines :-(

@pkalever
Yeah, this makes sense.

Let's investigate in this direction, as we don't have any other way.

Sure :-)

Thanks

@pkalever
Contributor Author

pkalever commented Mar 25, 2019

@lxbsz patch a8157bf solved some crashes for me. Please take a look.

Thanks!

@lxbsz
Collaborator

lxbsz commented Mar 25, 2019

@pkalever
Tested your new changes and it works much better than before.
I tested for about 5 minutes and didn't hit any crash issue, which I could previously reproduce within about 10 seconds every time in my local setups.

But there is still the hang problem:

[2019-03-25 08:58:40.058450] ERROR: block_create_cli_1: RPC: Unable to receive; errno = Connection reset by peer
block block21 create on volume dht with hosts 10.70.39.243 failed

Thanks

@pkalever
Contributor Author

pkalever commented Mar 25, 2019

@lxbsz thanks for trying this out.
Reaching the 21st block create is encouraging :-)

@pkalever
Contributor Author

@amarts @lxbsz At the moment the build of gluster-block on F28 and F29 fails without these patches.

With these patches gluster-block gains integration with TIRPC, but unfortunately we sometimes hit a crash in the TIRPC libraries like the one below:

AddressSanitizer:DEADLYSIGNAL                                                                                        
=================================================================                                                    
==8233==ERROR: AddressSanitizer: SEGV on unknown address 0x60e07b800085 (pc 0x7fcb7893040b bp 0x631000a78800 sp 0x7fcb
742fc4f0 T3)                                                                                                         
==8233==The signal is caused by a READ memory access.                                                                
    #0 0x7fcb7893040a  (/lib64/libtirpc.so.3+0x1b40a)
    #1 0x7fcb78934852  (/lib64/libtirpc.so.3+0x1f852)
    #2 0x7fcb7893498e  (/lib64/libtirpc.so.3+0x1f98e)
    #3 0x7fcb78934b66  (/lib64/libtirpc.so.3+0x1fb66)
    #4 0x7fcb78934be9  (/lib64/libtirpc.so.3+0x1fbe9)                                                                
    #5 0x7fcb78934cae  (/lib64/libtirpc.so.3+0x1fcae)                                                                
    #6 0x7fcb78933b58 in xdr_u_int32_t (/lib64/libtirpc.so.3+0x1eb58)                                                
    #7 0x7fcb78927d5a in xdr_callmsg (/lib64/libtirpc.so.3+0x12d5a)                                                  
    #8 0x7fcb78930b0b  (/lib64/libtirpc.so.3+0x1bb0b)
    #9 0x7fcb7892d0a7 in svc_getreq_common (/lib64/libtirpc.so.3+0x180a7)                                            
    #10 0x404cfa in svc_getreq_poll /root/gluster-block/daemon/gluster-blockd.c:104                                  
    #11 0x7fcb7892fb6d in svc_run (/lib64/libtirpc.so.3+0x1ab6d)
    #12 0x40665f in glusterBlockServerThreadProc /root/gluster-block/daemon/gluster-blockd.c:298
    #13 0x7fcb78aea58d in start_thread (/lib64/libpthread.so.0+0x858d)
    #14 0x7fcb7884c6a2 in clone (/lib64/libc.so.6+0xfd6a2)

However this is not very immediate; I can create at least 20-30 devices every time before I hit the crash.

Now the question: can we live with it for now?

IMHO, having these patches is better than gluster-block not working with TIRPC (on F28/F29).

@amarts @lxbsz @nixpanic What's your take?

Many Thanks!

@amarts
Member

amarts commented Mar 26, 2019

I am inclined to mark it as a known issue.

The main base for the project at the moment is CentOS/RHEL, and in that setup tirpc is not yet mandatory.

That way, having these patches will give the project proper RPMs, and we can always improve and fix things later. Let's call out the scale issue on setups that have libtirpc in our release notes, and take these patches in.

@pkalever
Contributor Author

@amarts yes, that completely makes sense to me.

@lxbsz @amarts Can I get your review on this PR?

Thanks!

@nixpanic
Member

If the problem is reproducible, it would be good to have a look at the coredump. It sounds like a use-after-free of some kind.

This change is needed, and also encouraged for CentOS/RHEL builds. Identifying and fixing the issue should have a relatively high priority.

@pkalever
Contributor Author

@nixpanic but this must be coming from libtirpc itself, right?

More detailed crash:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==15327==ERROR: AddressSanitizer: SEGV on unknown address 0x60e022800001 (pc 0x7f2bc7a13b6d bp 0x7f2bc3dfd5c0 sp 0x7f2bc3dfd570 T2)
==15327==The signal is caused by a READ memory access.
    #0 0x7f2bc7a13b6c in read_vc /root/libtirpc-1.1.4/src/svc_vc.c:500
    #1 0x7f2bc7a19a05 in fill_input_buf /root/libtirpc-1.1.4/src/xdr_rec.c:655
    #2 0x7f2bc7a19af3 in get_input_bytes /root/libtirpc-1.1.4/src/xdr_rec.c:683
    #3 0x7f2bc7a19b9d in set_input_fragment /root/libtirpc-1.1.4/src/xdr_rec.c:704
    #4 0x7f2bc7a18e7b in xdrrec_getbytes /root/libtirpc-1.1.4/src/xdr_rec.c:283
    #5 0x7f2bc7a18d3a in xdrrec_getlong /root/libtirpc-1.1.4/src/xdr_rec.c:237
    #6 0x7f2bc7a17c28 in xdr_u_int32_t /root/libtirpc-1.1.4/src/xdr.c:243
    #7 0x7f2bc7a0871b in xdr_callmsg /root/libtirpc-1.1.4/src/rpc_callmsg.c:185
    #8 0x7f2bc7a13f05 in svc_vc_recv /root/libtirpc-1.1.4/src/svc_vc.c:635
    #9 0x7f2bc7a0eebb in svc_getreq_common /root/libtirpc-1.1.4/src/svc.c:682
    #10 0x404cfa in svc_getreq_poll /root/gluster-block/daemon/gluster-blockd.c:104
    #11 0x7f2bc7a1277d in svc_run /root/libtirpc-1.1.4/src/svc_run.c:91
    #12 0x406257 in glusterBlockCliThreadProc /root/gluster-block/daemon/gluster-blockd.c:259
    #13 0x7f2bc79db58d in start_thread (/lib64/libpthread.so.0+0x858d)
    #14 0x7f2bc790a6a2 in clone (/lib64/libc.so.6+0xfd6a2)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /root/libtirpc-1.1.4/src/svc_vc.c:500 in read_vc
Thread T2 created by T0 here:
    #0 0x7f2bc7c25f63 in __interceptor_pthread_create (/lib64/libasan.so.5+0x52f63)
    #1 0x40c6d1 in main /root/gluster-block/daemon/gluster-blockd.c:706
    #2 0x7f2bc7831412 in __libc_start_main (/lib64/libc.so.6+0x24412)

==15327==ABORTING

@lxbsz
Collaborator

lxbsz commented Apr 9, 2019

@lxbsz good find!

The items that the operating system must store that are unique to each thread are:

Thread ID
Saved registers, stack pointer, instruction pointer
Stack (local variables, temporary variables, return addresses)
Signal mask
Priority (scheduling information)


The items that are shared among threads within a process are:

Text segment (instructions)
Data segment (static and global data)
BSS segment (uninitialized data)
Open file descriptors
Signals
Current working directory
User and group IDs

IMO TIRPC is mishandling shared items as if they were non-shared.

Especially with '__svc_xports[p->fd]' in the TIRPC code, which is why it was failing to get the fd and asserting on it every now and then.

We can fix it or file a bug later; forking the cli thread into a separate process won't harm for now.

Yeah, I am trying to send a patch to make sure that only one svc_run loop runs in one process.

Thanks.

Thanks Xiubo!
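A minimal sketch of that direction, one svc_run() loop per process (glusterBlockServerProcess follows this thread; glusterBlockCliProcess and the other details are illustrative assumptions):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>

int wstatus;
pid_t chpid = fork();   /* split cli and server loops into two processes */

if (chpid < 0) {
        exit(EXIT_FAILURE);
} else if (chpid == 0) {
        glusterBlockServerProcess();   /* child: server svc_run() loop */
        exit(EXIT_SUCCESS);
}
glusterBlockCliProcess();              /* parent: cli svc_run() loop (hypothetical name) */
waitpid(chpid, &wstatus, 0);           /* reap the server process */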

@lxbsz
Collaborator

lxbsz commented Apr 9, 2019

@pkalever

Created around five hundred blocks and the above patch works for me.

Thanks
BRs

@pkalever
Contributor Author

pkalever commented Apr 9, 2019

@pkalever

Created around five hundred blocks and the above patch works for me.

Good news indeed :-)

@nixpanic
Member

nixpanic commented Apr 9, 2019

Nice work on figuring out the issue!

@amarts
Member

amarts commented Apr 11, 2019

👍 Cool, hopefully this allows us to make a release.

@pkalever
Contributor Author

pkalever commented Apr 15, 2019

@amarts @lxbsz updated the PR now.

Splitting this PR into sub-PRs as below:

  1. TIRPC core fixes (Uses this PR Migrate to tirpc usage [Tested | Works on F27, F28, F29, Centos7] #182 )
  2. targetcli/tcmu-runner dependency version check (daemon: check for dependencies versions at runtime #206 )
  3. leak fixes by asan (Asan fixes #207 )
  4. travis builds for f28/f27 and Centos (travis: build on f27, f28, f29 and centos7 #208 )

BTW: expect this PR to see travis build failures, as the build happens on F27 and this PR expects '--enable-tirpc=no' to be passed at build time.

AC_SUBST(TIRPC_CFLAGS)
AC_SUBST(TIRPC_LIBS)
if test "$enable_tirpc" != "no"; then
enable_tirpc="yes";
Collaborator


We may need to add a help prompt, something like:

$ ./configure --help
[...]
--enable-tirpc Please enable the libtirpc library if glibc >= 2.26 [default=yes]
[...]

Would this make sense?

Contributor Author


Hmm, Okay! We can add a help string.

Will refresh this in the next spin.

Thanks!

Contributor Author


@lxbsz please check:
$ ./configure --help | grep tirpc
--enable-libtirpc enable use of tirpc [default: yes]

Collaborator


Yeah, cool. It works as expected.

LOG("mgmt", GB_LOG_DEBUG,
"server process received signal:%d with pid:%lu", signum, getpid());

exit(EXIT_SUCCESS); /* server process terminated by signal */
Collaborator


Here we could use svc_stop() instead, to stop the loop and let libtirpc do the cleanup and finish the current work. Would that make sense?

Contributor Author


Yeah, I was looking for something like that actually :-)

Was not aware (forgot) that there is a graceful way of exiting the svc_run loop.

Thanks, let me test this and incorporate the change.

Collaborator


Cool. Earlier, when debugging this issue, I tested this code and it worked for me at that time.
Thanks.

Contributor Author


@lxbsz

BTW: where did you see svc_stop()?

gluster-blockd.c:79:3: warning: implicit declaration of function ‘svc_stop’; did you mean ‘svc_stat’? [-Wimplicit-function-declaration]
svc_stop();

Contributor Author


Maybe you meant svc_exit()?

Collaborator


Yeah, it is.
Sorry, I mixed up svc_exit with the other libs.

Collaborator


For the libtirpc code, IMO svc_run() should take the rdlock to protect svc_pollfd, or it may cause the crash here.

Collaborator


For the libtirpc code, IMO svc_run() should take the rdlock to protect svc_pollfd, or it may cause the crash here.

Please ignore this too; currently svc_exit() must be called in the process's signal handler, or it will hit the crash issue.
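For reference, a minimal sketch of the pattern being described, i.e. calling svc_exit() from the signal handler so svc_run() returns (an assumed shape, not the exact patch; the names are illustrative):

#include <rpc/rpc.h>   /* svc_run(), svc_exit() */
#include <signal.h>

static void
blockdExitHandler(int signum)          /* hypothetical name */
{
        (void)signum;
        svc_exit();                    /* svc_run() returns; cleanup can proceed */
}

static void
serveLoop(void)                        /* hypothetical wrapper */
{
        signal(SIGTERM, blockdExitHandler);
        svc_run();                     /* returns only after svc_exit() */
        /* graceful libtirpc/application cleanup happens here */
}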


kill(ctx.chpid, signum); /* Pass the signal to server process */
waitpid(ctx.chpid, &wstatus, 0);

Collaborator


And also here we can use svc_stop, then just return, and the cli process will also do the cleanup in libtirpc.
Then the cli process will waitpid for the child process at LINE 602.


@lxbsz
Collaborator

lxbsz commented Apr 15, 2019

@pkalever
Tested it on RHEL 7 and Fedora 29; all works as expected.
On Fedora 29 I created at least 300 blocks.

Thanks.

@pkalever
Contributor Author

@lxbsz please check this out.


svc_exit();

exit(EXIT_SUCCESS);
Collaborator


There is no need to do the exit(EXIT_SUCCESS) here.

Contributor Author


@lxbsz

For some reason I see WEXITSTATUS() returning 1 if I don't enforce this here, which means an abnormal exit of the child.

Can you check?

Collaborator


There is another problem:

[2019-04-15 11:54:52.545869] INFO: cli process pid: (18235) [at gluster-blockd.c+580 :]
[2019-04-15 11:54:52.546274] INFO: server process pid: (18250) [at gluster-blockd.c+570 :]
[2019-04-15 11:56:31.793613] INFO: Block Hosting Volfile Server Set to: localhost [at utils.c+276 :]
[2019-04-15 11:56:31.793715] CRIT: glfsLruCount now is 5 [at lru.c+42 :]
[2019-04-15 11:56:31.793729] CRIT: logLevel now is INFO [at utils.c+53 :]
[2019-04-15 11:56:31.793920] INFO: Inotify is watching "/etc/sysconfig", wd: 1, mask: IN_MODIFY [at dyn-config.c+510 :]
[2019-04-15 11:56:31.796159] INFO: Distro ID=fedora. Current kernel version: '4.18.16-300.fc29.x86_64'. [at gluster-blockd.c+421 :]
[2019-04-15 11:56:35.381259] INFO: Block Hosting Volfile Server Set to: localhost [at utils.c+276 :]
[2019-04-15 11:56:35.381346] CRIT: glfsLruCount now is 5 [at lru.c+42 :]
[2019-04-15 11:56:35.381539] CRIT: logLevel now is INFO [at utils.c+53 :]
[2019-04-15 11:56:35.382742] INFO: Inotify is watching "/etc/sysconfig", wd: 1, mask: IN_MODIFY [at dyn-config.c+510 :]
[2019-04-15 11:56:35.384463] INFO: Distro ID=fedora. Current kernel version: '4.18.16-300.fc29.x86_64'. [at gluster-blockd.c+421 :]
[2019-04-15 11:56:36.870342] INFO: capabilities fetched successfully [at gluster-blockd.c+550 :]
[2019-04-15 11:56:36.871805] INFO: cli process pid: (19434) [at gluster-blockd.c+580 :]
[2019-04-15 11:56:36.871881] INFO: server process pid: (19449) [at gluster-blockd.c+570 :]
[2019-04-15 11:56:36.872043] ERROR: bind on port 24010 failed (Address already in use) [at gluster-blockd.c+263 :]
[root@fedora2 gluster-block]#

[root@fedora2 gluster-block]# systemctl status gluster-blockd tcmu-runner
● gluster-blockd.service - Gluster block storage utility
   Loaded: loaded (/usr/local/lib/systemd/system/gluster-blockd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-04-15 17:26:35 IST; 47s ago
 Main PID: 19434 (gluster-blockd)
    Tasks: 4 (limit: 2357)
   Memory: 2.7M
   CGroup: /system.slice/gluster-blockd.service
           ├─18250 /usr/local/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO
           └─19434 /usr/local/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO

I restarted the gluster-blockd service; the cli process is new, but the server process is still the old one.

Contributor Author


@lxbsz
Did you notice the bind failure in the above log?

It might have happened that your parent process exited before the child in your previous run, and the child got reparented to pid 1.

Can you run
# pkill -9 gluster-blockd

and then try again?

Collaborator


Please ignore this; the reason should be that I killed (-9) only the cli process and then restarted it.

Collaborator


@lxbsz

For some reason I see WEXITSTATUS() returning 1 if I don't enforce this here, which means an abnormal exit of the child.

Can you check?

From the manual we can see that returning like this is still a normal exit:

   WIFEXITED(status)
            returns true if the child terminated normally, that is, by calling exit(3) or _exit(2), or by returning from main().

That means the process either called exit(EXIT_SUCCESS == 0) or returned directly.
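In code, the check under discussion looks roughly like this (a hedged sketch; ctx.chpid follows the diff above, the rest is illustrative):

#include <sys/wait.h>

int wstatus;

waitpid(ctx.chpid, &wstatus, 0);
if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == EXIT_SUCCESS) {
        /* child called exit(EXIT_SUCCESS) or returned 0 from main() */
} else if (WIFEXITED(wstatus)) {
        /* still a "normal" termination per WIFEXITED, but with a
         * non-zero status such as the 1 seen here */
}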

Contributor Author


I agree, but how can we consider the child returning value 1 as success/normal?

And do you see anything wrong with exit(EXIT_SUCCESS) from the child after svc_exit()?

Collaborator


Yeah, if we directly exit(EXIT_SUCCESS) here, then there may be no chance for a graceful exit of the child process, which would otherwise execute the code after svc_run() in the server process.
Makes sense?

Contributor Author


@lxbsz
Hmm, makes sense to me; I get your concern here.
Please check the latest patch, which fixes this by moving the exit() part into glusterBlockServerProcess.
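Roughly the shape of that fix (a hedged sketch; the real function does more):

static void
glusterBlockServerProcess(void)
{
        /* ... create transports and register the RPC program ... */
        svc_run();            /* returns after svc_exit() in the signal handler */
        /* graceful cleanup after the loop */
        exit(EXIT_SUCCESS);   /* child reports success to the parent's waitpid() */
}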


svc_exit();

return;
Collaborator


And we could remove the 'return;' here.

Contributor Author

@pkalever pkalever Apr 15, 2019


This is a matter of style, to indicate that we reached the end of a void function.

In fact you can see this in many void functions across the gluster-block code.

Void functions can return without a value :-)

Collaborator


Okay.

@pkalever
Contributor Author

@amarts @nixpanic Requesting a final review on this PR [+ the other split PRs, #182 (comment)].

Thanks!

Collaborator

@lxbsz lxbsz left a comment


@pkalever
Looks good to me.
Thanks.

Member

@amarts amarts left a comment


Reviewed By: Amar Tumballi

nixpanic and others added 4 commits April 15, 2019 19:27
glibc has removed the rpc functions from current releases. Instead of
relying on glibc providing these, the modern libtirpc library should be
used instead.

Change-Id: I46c979e52147abce956255de5ad16b01b5621f52
Updates: gluster#56
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Reviewed-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
glibc will not contain the `rpcgen` tool in new versions anymore. In
Fedora 28 this tool can now be found in its own package.

Change-Id: I1de67cdf7418cb509e096e62a4201b5b8707ef24
Updates: gluster#56
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Reviewed-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Creating the socket is not needed for svcunix_create(), it does all the
work for us (and prevents "address already in use" errors).

For svctcp_create(), the socket is expected on port 24010 and hence
needs to be setup completely with bind() and listen() before it can be
used. Registering the RPC-program should make it possible to use a
dynamically assigned port, but the gluster-block CLI expects it on the
static port.

This also fixes a warning when building the RPM from 'make dist' tarball
as 22 June 2017 was a Thursday, not a Tuesday.

Changes from pkalever:
Add a flag at config time for tweaking the use of tirpc or glibc sunrpc
$ ./autogen.sh && ./configure --enable-tirpc=yes/no

Change-Id: If3a6b7527399dd0a5a16f4273efdd607617289de
Updates: gluster#56
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Problem:
-------
When using TIRPC we have seen some contention with poll fd's, this was
because we are calling svc_run() in both the cli and remote threads.

Probably an issue in TIRPC as the current model worked well for us with
glibc sun-rpc implementation.

Solution:
--------
Convert cli and remote threads to individual processes

How to compile ?
----------------
On systems having glibc sun-rpc: (F27/Centos/RHEL)
$ ./autogen.sh && ./configure --enable-tirpc=no

On systems having tirpc: (F28 & above)
$ ./autogen.sh && ./configure

Thanks to Xiubo for debugging this issue along.

Closes: gluster#57
Closes: gluster#165
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
@pkalever
Contributor Author

Updated the tags. Ready for Merge now.

Thanks to @nixpanic (for the initial patches) @lxbsz (for test debug and review) @amarts (for review).

@pkalever pkalever merged commit f70857a into gluster:master Apr 15, 2019
@ghost ghost removed the in progress label Apr 15, 2019
@lxbsz lxbsz mentioned this pull request Apr 24, 2019
@pkalever pkalever deleted the glibc-ti-rpc-comp-fix branch May 13, 2020 15:49