Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changelog: A brick process is getting crash due to SIGSEGV in changelog #3522

Merged
merged 1 commit into from Jun 27, 2022

Conversation

mohit84
Copy link
Contributor

@mohit84 mohit84 commented May 13, 2022

A brick process is getting crashed while using glusterfind tool.
The glusterfind tool uses changelog xlator and the xlator has race
condition to handle crpc object list so at the time of calling
ev_connector thread it is getting crashed.

Solution: The xlator is not using correct lock to sync the list
in crpc object so use crpc->lock to handle the crpc->list.

Fixes: #3521
Change-Id: I13ec8603dc06ecba4cd293cb48012a2ebef55749
Signed-off-by: Mohit Agrawal moagrawa@redhat.com

@mohit84 mohit84 requested a review from xhernandez May 13, 2022 07:14
@mohit84
Copy link
Contributor Author

mohit84 commented May 13, 2022

/run regression

@mykaul
Copy link
Contributor

mykaul commented May 16, 2022

What about changelog_rpc_clnt_unref() which is calling list_del(&crpc->list) ?

(I must say there are so many locks, and some are called under others, that it's hard to easily identify the order and the correctness of locks there!)

@mohit84
Copy link
Contributor Author

mohit84 commented May 16, 2022

changelog_rpc_clnt_unref

changelog_rpc_clnt_unref call only while crpc ref value is 0 it means no one is using crpc object so we don't need to call
this under critical section.

@mohit84
Copy link
Contributor Author

mohit84 commented May 16, 2022

What about changelog_rpc_clnt_unref() which is calling list_del(&crpc->list) ?

(I must say there are so many locks, and some are called under others, that it's hard to easily identify the order and the correctness of locks there!)

IIUC a brick process is getting crashed during access of crpc->list so to make it thread safe i have used crpc->lock wherever
we are accessing crpc->list.

@mykaul
Copy link
Contributor

mykaul commented May 16, 2022

changelog_rpc_clnt_unref

changelog_rpc_clnt_unref call only while crpc ref value is 0 it means no one is using crpc object so we don't need to call this under critical section.

Can't someone ref it in between? Looks racy to me, but I don't know the code well enough.

@mohit84
Copy link
Contributor Author

mohit84 commented May 16, 2022

changelog_rpc_clnt_unref

changelog_rpc_clnt_unref call only while crpc ref value is 0 it means no one is using crpc object so we don't need to call this under critical section.

Can't someone ref it in between? Looks racy to me, but I don't know the code well enough.

It should not be otherwise it is a bug, last reference has been unref by only while a client want to destroy the rpc connection.
Before destroy a connection the client tries disconnect when it has successfully disconnect then it tries destroy a connection
after calling the function rpc_clnt_trigger_destroy that eventually send a RPC_CLNT_DESTROY event to the client to cleanup
the things.
@xhernandez Can you please share your view if you also have any concern on it.

@mohit84
Copy link
Contributor Author

mohit84 commented May 23, 2022

/run regression

A brick process is getting crashed while using glusterfind tool.
The glusterfind tool uses changelog xlator and the xlator has race
condition to handle crpc object list so at the time of calling
ev_connector thread it is getting crashed.

Solution: The xlator is not using correct lock to sync the list
          in crpc object so use crpc->lock to handle the crpc->list.

Fixes: gluster#3521
Change-Id: I13ec8603dc06ecba4cd293cb48012a2ebef55749
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
@mohit84
Copy link
Contributor Author

mohit84 commented May 23, 2022

/run regression

@mohit84
Copy link
Contributor Author

mohit84 commented May 23, 2022

@aravindavk Can you please review it.

@mohit84 mohit84 requested a review from aravindavk May 23, 2022 09:32
@mohit84
Copy link
Contributor Author

mohit84 commented May 24, 2022

/run brick-mux regression

@gluster-ant
Copy link
Collaborator

7 test(s) failed
./tests/basic/afr/lk-quorum.t
./tests/basic/open-fd-snap-delete.t
./tests/basic/uss.t
./tests/bugs/distribute/bug-907072.t
./tests/bugs/replicate/do-not-reopen-fd.t
./tests/features/index/index-link-count-lifecycle.t
./tests/features/worm_sh.t

6 test(s) generated core
./tests/000-flaky/bugs_distribute_bug-1122443.t
./tests/000-flaky/bugs_glusterd_bug-857330/normal.t
./tests/bugs/glusterd/mgmt-handshake-and-volume-sync-post-glusterd-restart.t
./tests/bugs/glusterd/remove-brick-testcases.t
./tests/features/trash.t
./tests/features/worm_sh.t

18 test(s) needed retry
./tests/000-flaky/basic_afr_split-brain-favorite-child-policy.t
./tests/000-flaky/basic_mount-nfs-auth.t
./tests/000-flaky/bugs_distribute_bug-1122443.t
./tests/000-flaky/bugs_glusterd_bug-857330/normal.t
./tests/000-flaky/bugs_nfs_bug-1116503.t
./tests/basic/afr/lk-quorum.t
./tests/basic/distribute/file-rename.t
./tests/basic/ec/ec-1468261.t
./tests/basic/open-fd-snap-delete.t
./tests/basic/uss.t
./tests/bugs/distribute/bug-862967.t
./tests/bugs/distribute/bug-907072.t
./tests/bugs/distribute/bug-973073.t
./tests/bugs/glusterfs-server/bug-877992.t
./tests/bugs/md-cache/bug-1211863.t
./tests/bugs/replicate/do-not-reopen-fd.t
./tests/features/index/index-link-count-lifecycle.t
./tests/features/worm_sh.t

4 flaky test(s) marked as success even though they failed
./tests/000-flaky/basic_mount-nfs-auth.t
./tests/000-flaky/bugs_distribute_bug-1122443.t
./tests/000-flaky/bugs_glusterd_bug-857330/normal.t
./tests/000-flaky/bugs_nfs_bug-1116503.t

@mohit84
Copy link
Contributor Author

mohit84 commented May 24, 2022

The brick_mux crash is not specific to this patch so we can ignore it now.

@xhernandez
Copy link
Contributor

/run regression

@xhernandez
Copy link
Contributor

The brick_mux crash is not specific to this patch so we can ignore it now.

Are you sure the crashes are not related ? the latest regression has also had several crashes.

@mohit84
Copy link
Contributor Author

mohit84 commented May 27, 2022

The brick_mux crash is not specific to this patch so we can ignore it now.

Are you sure the crashes are not related ? the latest regression has also had several crashes.

There is no crash(https://build.gluster.org/job/gh_centos7-regression/2525/console) for centos-regression, yes i am sure
the crash is not specific to this patch. We are getting crash only for brick_mux.

@mohit84
Copy link
Contributor Author

mohit84 commented May 27, 2022

The brick_mux crash is not specific to this patch so we can ignore it now.

Are you sure the crashes are not related ? the latest regression has also had several crashes.

There is no crash(https://build.gluster.org/job/gh_centos7-regression/2525/console) for centos-regression, yes i am sure the crash is not specific to this patch. We are getting crash only for brick_mux.

I will run a brick_mux regression without apply a patch and share the link.

@mohit84
Copy link
Contributor Author

mohit84 commented May 27, 2022

The brick_mux crash is not specific to this patch so we can ignore it now.

Are you sure the crashes are not related ? the latest regression has also had several crashes.

There is no crash(https://build.gluster.org/job/gh_centos7-regression/2525/console) for centos-regression, yes i am sure the crash is not specific to this patch. We are getting crash only for brick_mux.

I will run a brick_mux regression without apply a patch and share the link.

Below is the link as we can see for brick_mux multiple test cases are getting crashed.
#3559

@xhernandez
Copy link
Contributor

The brick_mux crash is not specific to this patch so we can ignore it now.

Are you sure the crashes are not related ? the latest regression has also had several crashes.

There is no crash(https://build.gluster.org/job/gh_centos7-regression/2525/console) for centos-regression, yes i am sure the crash is not specific to this patch. We are getting crash only for brick_mux.

I will run a brick_mux regression without apply a patch and share the link.

Below is the link as we can see for brick_mux multiple test cases are getting crashed. #3559

Any idea why ? brick-mux should be stable.

@mohit84
Copy link
Contributor Author

mohit84 commented Jun 14, 2022

The brick_mux crash is not specific to this patch so we can ignore it now.

Are you sure the crashes are not related ? the latest regression has also had several crashes.

There is no crash(https://build.gluster.org/job/gh_centos7-regression/2525/console) for centos-regression, yes i am sure the crash is not specific to this patch. We are getting crash only for brick_mux.

I will run a brick_mux regression without apply a patch and share the link.

Below is the link as we can see for brick_mux multiple test cases are getting crashed. #3559

Any idea why ? brick-mux should be stable.

The brick_mux regression is getting crashed because of this change(2e825b3), We have to revert it. Though it is not a leak because vol_opt object is added to the xlator->volume_options while do validate auth_options that's why the list is not delete in gf_auth_fini function. The object is deleted while xlator->volume_options has been cleaned during xlator cleanup.

@xhernandez xhernandez merged commit 13a299c into gluster:devel Jun 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

changelog: A brick process is getting crash due to SIGSEGV in changelog
4 participants