Brick going offline on another host as well as the host which rebooted #2480
Comments
@dit101 Could you attach the client and brick logs to find what happened? |
logs.tar.gz |
@pranithk could you have a look at the logs to see the issue please? |
@dit101 As per the following logs on server2:
There are no cleanup_and_exit() logs, which would indicate a graceful exit of the process, so something must have killed the process with SIGKILL. One possibility could be an OOM kill. Did you see any OOM kill messages in dmesg or /var/log/messages on server2 when this happened? Is this consistently reproducible? |
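(For reference, one quick way to check for OOM-killer activity on server2 could look like the following; the log path assumes a stock CentOS 7 install, so adjust as needed.)

```sh
# Search the kernel ring buffer for OOM-killer activity
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

# Search syslog, including rotated copies, for the same markers
grep -iE 'out of memory|oom-killer|killed process' /var/log/messages*
```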
@pranithk it doesn't happen all the time. I may be able to repeat it by rebooting the same server again. I checked the logs on server2 and OOM killer didn't run. |
@dit101 if the OOM killer ran you should see it in the /var/log/messages file; check the rotated logs as well. |
@pranithk I checked /var/log/messages from yesterday for anything related to OOM Killer and there's nothing there |
@pranithk the log message at 13:45 would have been me force-starting the volume to bring the brick online. Server1 was rebooted at 13:39, which then brought the bricks on Server2 offline as well as those on Server1. These servers aren't in use at the moment as we're testing before they go live. I can try to recreate the issue again if you want to collect more information? |
@dit101 What I was pointing at is, before that, there is no cleanup_and_exit() log. Something like this.
If this log is missing, then the brick was killed with SIGKILL, which can happen with the OOM killer, which is why I was asking. |
@pranithk thanks for explaining. Nothing was done on Server2 and there was no issue. When I rebooted Server1 the bricks went offline on Server2. I've seen this issue on a couple of clusters now. Is there anything else you need from me? |
@dit101 I went through the logs multiple times. I don't see anything that indicates an operation on the brick which could lead to the brick process going down on server2. SIGKILL is the only possibility with the info you have given so far. If there is a way to find out who is killing this process, that would be helpful. |
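(One possible way to catch whoever sends the signal, assuming auditd is available as it usually is on CentOS 7, is a temporary audit rule on the kill syscall; this is only a sketch, not something requested above.)

```sh
# Record every kill() syscall so the sender's pid/comm ends up in the audit log
auditctl -a always,exit -F arch=b64 -S kill -k brick_kill_trace

# After the brick drops offline, look up who sent the signal
ausearch -k brick_kill_trace -i | less
```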
@pranithk I don't have a way to check at the moment, but I think the brick process was still running and just goes offline in Gluster, so a process listing still shows the brick process. I can try to validate this by reproducing the issue if you want. If I can reproduce the issue, is there any logging or output you'd like? |
Oh, Could you capture |
cc @nik-redhat One possibility for this could be portmap handling issues. Do you have any other information that needs to be captured? |
I too couldn't find anything problematic in the shared logs.
|
Makes sense. I am wondering if there is a situation where the portmap information went wrong, i.e. the brick is running but the port in glusterd is not mapped correctly, or something along those lines.
|
That might be the case, hence we need to check what port value is being stored in the brick volfiles, and whether it is getting updated correctly or not. |
In that case maybe netstat output would be helpful too. |
|
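(A rough sketch of what could be captured on the affected node to compare the port glusterd has recorded with what the brick is actually listening on; volume1 is taken from this report, and the listen-port field in the brickinfo files under /var/lib/glusterd is an assumption about where glusterd keeps the mapping.)

```sh
# Port glusterd reports for each brick
gluster volume status volume1

# Port glusterd has recorded for the bricks (assumed location of the brickinfo files)
grep -r listen-port /var/lib/glusterd/vols/volume1/bricks/

# Brick processes actually running, and the ports they are listening on
pgrep -af glusterfsd
ss -tlnp | grep glusterfsd
```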
Thanks @pranithk @nik-redhat I'll try to reproduce later and gather the required information |
@pranithk @nik-redhat this time when I rebooted server4, the server1 brick for volume1 went offline. I've collected and attached the logs, including process and netstat output, so I hope they help. The brick process continued to run on server1 as per the process listing. |
@nik-redhat Do you want to give this a shot? I am suspecting this to be because of recent changes to port-mapper. But I could be wrong. |
@pranithk the recent changes to the portmapper have not yet gone into a release. It is only in the devel branch till now, so that shouldn't be the case here. |
Ah! Cool.
Sure, will take a look. |
Based on the info provided today, I went back to yesterday's logs and found the following suspicious logs:
These logs suggest that when glusterd went down on server1, the brick processes were sending portmap signin and signout to server2 as if they had come up and gone down there, which led to the volume status misbehaving on server2 because the brick paths are identical on both servers. This happened because, when glusterd was brought down on server1, the bricks connected to the backup volfile server, i.e. server2.
So when the bricks were killed on server1, the signout went to server2. @dit101 As a workaround, please kill the brick processes before killing glusterd for now. @amarts @xhernandez @rafikc30 @srijan-sivakumar As per my understanding the following patch introduced the bug. I think we shouldn't send the extra volfile servers for the bricks.
Code in glusterfsd-mgmt.c that is relevant:
|
Thanks @pranithk. I shut down the bricks first and they didn't go offline on another server. I'll let you know if I still see the issue when shutting down the bricks first, as it doesn't happen all the time. But from what you've posted it looks like you've found the cause. |
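(For anyone else hitting this, a minimal sketch of the workaround ordering before a reboot, assuming the usual process names: glusterfsd for the brick processes and glusterd for the management daemon.)

```sh
# Stop the brick processes first, while the local glusterd is still up,
# so their portmap signout goes to the local glusterd and not to a backup volfile server
pkill glusterfsd

# Then stop glusterd and reboot
systemctl stop glusterd
init 6
```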
@pranithk Nice explanation. I can send a patch if you are busy. |
A brick process needs to register its port (portmap) with its local glusterd. Since the portmapper is not centralized, the information is stored locally in each glusterd. When a glusterd goes down, connecting to a backup volfile server results in undefined behaviour, especially when the portmap signin and signout requests are sent to a different glusterd than intended. If that happens, there can be undefined behaviour when bricks with the same path are present on different nodes. In this patch, we prevent bricks from connecting to backup volfile servers, which means that the bricks won't connect to any other glusterd to receive a management update if the glusterd on the local node goes down. THANKS TO PRANITH FOR THE RCA HERE gluster#2480
Fixes: gluster#2480
Change-Id: Iddd6f1d0f0da1cf0c90729043f23a293d478bf7c
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
Thanks guys. I was a bit busy and will be all week. I was going to try this over the coming weekend if you needed me to. Since @pranithk confirmed it won't occur with different brick paths, I won't test that scenario :-) |
Description of problem:
Hi,
I have an issue where, sometimes, if I reboot a Gluster node, the bricks on that host go offline as expected, but a brick on another host also goes offline, which can cause volume failures. I have to force a volume start to bring the brick back online.
Thanks
The exact command to reproduce the issue:
systemctl stop glusterd
killall glusterfs glusterfsd glusterd
init 6
gluster volume status
The full output of the command that failed:
Server1 was rebooted and the bricks on server2 also went offline and stayed offline
[root@server2 ~]# gluster volume status
Status of volume: volume1
Gluster process                                TCP Port  RDMA Port  Online  Pid
Brick server1:/data/gluster/brick2/brick       49152     0          Y       1459
Brick server2:/data/gluster/brick2/brick       N/A       N/A        N       N/A
Brick server4:/data/gluster/brick2/brick       49152     0          Y       8873
Self-heal Daemon on localhost                  N/A       N/A        Y       8847
Bitrot Daemon on localhost                     N/A       N/A        Y       8994
Scrubber Daemon on localhost                   N/A       N/A        Y       9019
Self-heal Daemon on server3                    N/A       N/A        Y       8980
Bitrot Daemon on server3                       N/A       N/A        Y       9108
Scrubber Daemon on server3                     N/A       N/A        Y       9127
Self-heal Daemon on server4                    N/A       N/A        Y       8839
Bitrot Daemon on server4                       N/A       N/A        Y       8997
Scrubber Daemon on server4                     N/A       N/A        Y       9008
Self-heal Daemon on server1.prod.blueface.com  N/A       N/A        Y       1521
Bitrot Daemon on server1.prod.blueface.com     N/A       N/A        Y       1481
Scrubber Daemon on server1.prod.blueface.com   N/A       N/A        Y       1493

Task Status of Volume volume1
There are no active volume tasks

Status of volume: volume2
Gluster process                                TCP Port  RDMA Port  Online  Pid
Brick server1:/data/gluster/brick1/brick       49153     0          Y       1470
Brick server2:/data/gluster/brick1/brick       N/A       N/A        N       N/A
Brick server3:/data/gluster/brick1/brick       49152     0          Y       8963
Self-heal Daemon on localhost                  N/A       N/A        Y       8847
Bitrot Daemon on localhost                     N/A       N/A        Y       8994
Scrubber Daemon on localhost                   N/A       N/A        Y       9019
Self-heal Daemon on server4                    N/A       N/A        Y       8839
Bitrot Daemon on server4                       N/A       N/A        Y       8997
Scrubber Daemon on server4                     N/A       N/A        Y       9008
Self-heal Daemon on server1.prod.blueface.com  N/A       N/A        Y       1521
Bitrot Daemon on server1.prod.blueface.com     N/A       N/A        Y       1481
Scrubber Daemon on server1.prod.blueface.com   N/A       N/A        Y       1493
Self-heal Daemon on server3                    N/A       N/A        Y       8980
Bitrot Daemon on server3                       N/A       N/A        Y       9108
Scrubber Daemon on server3                     N/A       N/A        Y       9127

Task Status of Volume volume2
There are no active volume tasks
Expected results:
The expectation is that the bricks on server2 would not go offline when server1 is rebooted.
Mandatory info:
- The output of the gluster volume info command:
[root@server2 ~]# gluster volume info
Volume Name: volume1
Type: Replicate
Volume ID: f330ead1-2f98-49bc-a8ec-db3f5c18a3f4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: server1:/data/gluster/brick2/brick
Brick2: server2:/data/gluster/brick2/brick
Brick3: server4:/data/gluster/brick2/brick (arbiter)
Options Reconfigured:
diagnostics.brick-log-level: INFO
features.scrub: Active
features.bitrot: on
network.ping-timeout: 5
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
Volume Name: volume2
Type: Replicate
Volume ID: f645aa78-cd37-4670-a27b-c4e3bb14965e
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: server1:/data/gluster/brick1/brick
Brick2: server2:/data/gluster/brick1/brick
Brick3: server3:/data/gluster/brick1/brick (arbiter)
Options Reconfigured:
diagnostics.brick-log-level: INFO
features.scrub: Active
features.bitrot: on
network.ping-timeout: 5
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
- The output of the gluster volume status command:
[root@server2 ~]# gluster volume status
Status of volume: volume1
Gluster process                                TCP Port  RDMA Port  Online  Pid
Brick server1:/data/gluster/brick2/brick       49152     0          Y       1459
Brick server2:/data/gluster/brick2/brick       N/A       N/A        N       N/A
Brick server4:/data/gluster/brick2/brick       49152     0          Y       8873
Self-heal Daemon on localhost                  N/A       N/A        Y       8847
Bitrot Daemon on localhost                     N/A       N/A        Y       8994
Scrubber Daemon on localhost                   N/A       N/A        Y       9019
Self-heal Daemon on server3                    N/A       N/A        Y       8980
Bitrot Daemon on server3                       N/A       N/A        Y       9108
Scrubber Daemon on server3                     N/A       N/A        Y       9127
Self-heal Daemon on server4                    N/A       N/A        Y       8839
Bitrot Daemon on server4                       N/A       N/A        Y       8997
Scrubber Daemon on server4                     N/A       N/A        Y       9008
Self-heal Daemon on server1.prod.blueface.com  N/A       N/A        Y       1521
Bitrot Daemon on server1.prod.blueface.com     N/A       N/A        Y       1481
Scrubber Daemon on server1.prod.blueface.com   N/A       N/A        Y       1493

Task Status of Volume volume1
There are no active volume tasks

Status of volume: volume2
Gluster process                                TCP Port  RDMA Port  Online  Pid
Brick server1:/data/gluster/brick1/brick       49153     0          Y       1470
Brick server2:/data/gluster/brick1/brick       N/A       N/A        N       N/A
Brick server3:/data/gluster/brick1/brick       49152     0          Y       8963
Self-heal Daemon on localhost                  N/A       N/A        Y       8847
Bitrot Daemon on localhost                     N/A       N/A        Y       8994
Scrubber Daemon on localhost                   N/A       N/A        Y       9019
Self-heal Daemon on server4                    N/A       N/A        Y       8839
Bitrot Daemon on server4                       N/A       N/A        Y       8997
Scrubber Daemon on server4                     N/A       N/A        Y       9008
Self-heal Daemon on server1.prod.blueface.com  N/A       N/A        Y       1521
Bitrot Daemon on server1.prod.blueface.com     N/A       N/A        Y       1481
Scrubber Daemon on server1.prod.blueface.com   N/A       N/A        Y       1493
Self-heal Daemon on server3                    N/A       N/A        Y       8980
Bitrot Daemon on server3                       N/A       N/A        Y       9108
Scrubber Daemon on server3                     N/A       N/A        Y       9127

Task Status of Volume volume2
There are no active volume tasks
- The output of the gluster volume heal command:
[root@server2 ~]# gluster volume heal volume1 info
Brick server1:/data/gluster/brick2/brick
Status: Connected
Number of entries: 0
Brick server2:/data/gluster/brick2/brick
Status: Connected
Number of entries: 0
Brick server4:/data/gluster/brick2/brick
Status: Connected
Number of entries: 0
[root@server2 ~]# gluster volume heal volume2 info
Brick server1:/data/gluster/brick1/brick
Status: Connected
Number of entries: 0
Brick server2:/data/gluster/brick1/brick
Status: Connected
Number of entries: 0
Brick server3:/data/gluster/brick1/brick
Status: Connected
Number of entries: 0
- Provide logs present on the following locations of client and server nodes:
/var/log/glusterfs/
- Is there any crash? Provide the backtrace and coredump:
No crash
Additional info:
To get the bricks online again I used the force command
gluster volume start volname force
- The operating system / glusterfs version:
CentOS 7 / GlusterFS 9.2
Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration