After "gprecoverseg -i" dbid changed #9837

Closed
star-hou opened this issue Mar 27, 2020 · 7 comments

@star-hou

star-hou commented Mar 27, 2020

Greenplum version or build

6.1

After initializing my Greenplum cluster, the content of gp_segment_configuration is:

 dbid | content | role | preferred_role | mode | status | port |          hostname           | address |          datadir          
------+---------+------+----------------+------+--------+------+-----------------------------+---------+---------------------------
    1 |      -1 | p    | p              | n    | u      | 5432 | hdz-0327-5-master-0         | mdw     | /home/data/master/gpseg-1
    8 |       2 | m    | m              | s    | u      | 7000 | hdz-0327-5-segment-host-0   | sdw1    | /home/data/mirror/gpseg2
    2 |       0 | p    | p              | s    | u      | 6000 | hdz-0327-5-segment-host-0   | sdw1    | /home/data/primary/gpseg0
    3 |       1 | p    | p              | s    | u      | 6001 | hdz-0327-5-segment-host-0   | sdw1    | /home/data/primary/gpseg1
    9 |       3 | m    | m              | s    | u      | 7001 | hdz-0327-5-segment-host-0   | sdw1    | /home/data/mirror/gpseg3
    4 |       2 | p    | p              | s    | u      | 6000 | hdz-0327-5-segment-host-1   | sdw2    | /home/data/primary/gpseg2
    6 |       0 | m    | m              | s    | u      | 7000 | hdz-0327-5-segment-host-1   | sdw2    | /home/data/mirror/gpseg0
    7 |       1 | m    | m              | s    | u      | 7001 | hdz-0327-5-segment-host-1   | sdw2    | /home/data/mirror/gpseg1
    5 |       3 | p    | p              | s    | u      | 6001 | hdz-0327-5-segment-host-1   | sdw2    | /home/data/primary/gpseg3
   10 |      -1 | m    | m              | s    | u      | 5432 | hdz-0327-5-standby-master-0 | smdw    | /home/data/master/gpseg-1

Then the machine hosting my master node failed, so I used gpactivatestandby and gpinitstandby to recover my cluster. After recovery, the content of gp_segment_configuration is:

 dbid | content | role | preferred_role | mode | status | port |          hostname           | address |          datadir          
------+---------+------+----------------+------+--------+------+-----------------------------+---------+---------------------------
   11 |      -1 | m    | m              | s    | u      | 5432 | hdz-0327-5-master-0         | mdw     | /home/data/master/gpseg-1
    8 |       2 | m    | m              | s    | u      | 7000 | hdz-0327-5-segment-host-0   | sdw1    | /home/data/mirror/gpseg2
    2 |       0 | p    | p              | s    | u      | 6000 | hdz-0327-5-segment-host-0   | sdw1    | /home/data/primary/gpseg0
    3 |       1 | p    | p              | s    | u      | 6001 | hdz-0327-5-segment-host-0   | sdw1    | /home/data/primary/gpseg1
    9 |       3 | m    | m              | s    | u      | 7001 | hdz-0327-5-segment-host-0   | sdw1    | /home/data/mirror/gpseg3
    4 |       2 | p    | p              | s    | u      | 6000 | hdz-0327-5-segment-host-1   | sdw2    | /home/data/primary/gpseg2
    6 |       0 | m    | m              | s    | u      | 7000 | hdz-0327-5-segment-host-1   | sdw2    | /home/data/mirror/gpseg0
    7 |       1 | m    | m              | s    | u      | 7001 | hdz-0327-5-segment-host-1   | sdw2    | /home/data/mirror/gpseg1
    5 |       3 | p    | p              | s    | u      | 6001 | hdz-0327-5-segment-host-1   | sdw2    | /home/data/primary/gpseg3
   10 |      -1 | p    | p              | s    | u      | 5432 | hdz-0327-5-standby-master-0 | smdw    | /home/data/master/gpseg-1

Then, for other reasons, I wanted to replace my segment node, so I ran gprecoverseg -i. After doing this, the content of gp_segment_configuration is:

 dbid | content | role | preferred_role | mode | status | port |                 hostname                  | address |          datadir          
------+---------+------+----------------+------+--------+------+-------------------------------------------+---------+---------------------------
   11 |      -1 | m    | m              | s    | u      | 5432 | hdz-0327-5-master-0                       | mdw     | /home/data/master/gpseg-1
   10 |      -1 | p    | p              | s    | u      | 5432 | hdz-0327-5-standby-master-0               | smdw    | /home/data/master/gpseg-1
    2 |       0 | p    | p              | s    | u      | 6000 | hdz-0327-5-segment-host-0                 | sdw1    | /home/data/primary/gpseg0
    5 |       0 | m    | m              | s    | u      | 7000 | hdz-0327-5-jdw-dc1-4xlarge-segment-host-0 | sdw3    | /home/data/mirror/gpseg0
    3 |       1 | p    | p              | s    | u      | 6001 | hdz-0327-5-segment-host-0                 | sdw1    | /home/data/primary/gpseg1
    6 |       1 | m    | m              | s    | u      | 7001 | hdz-0327-5-jdw-dc1-4xlarge-segment-host-0 | sdw3    | /home/data/mirror/gpseg1
    8 |       2 | m    | m              | n    | d      | 7000 | hdz-0327-5-segment-host-0                 | sdw1    | /home/data/mirror/gpseg2
    1 |       2 | p    | p              | n    | u      | 6000 | hdz-0327-5-jdw-dc1-4xlarge-segment-host-0 | sdw3    | /home/data/primary/gpseg2
    9 |       3 | m    | m              | n    | d      | 7001 | hdz-0327-5-segment-host-0                 | sdw1    | /home/data/mirror/gpseg3
    4 |       3 | p    | p              | n    | u      | 6001 | hdz-0327-5-jdw-dc1-4xlarge-segment-host-0 | sdw3    | /home/data/primary/gpseg3

And then I saw this failure in the log on my segment node:

2020-03-27 15:40:53.007499 CST,"gpadmin",,p1410,th-508311424,"192.168.1.213","47910",2020-03-27 15:40:53 CST,0,,,seg3,,,,,"WARNING","01000","message type: PROBE received dbid:4 doesn't match this segments configured dbid:5",,,,,,,0,,"ftsmessagehandler.c",427,

This looks like a problem caused by a dbid mismatch, and looking at the gp_segment_configuration output above, the dbids really have changed. So, what caused this problem, and how can I solve it?
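
For readers following along: the dbid in the probe message can be cross-checked against the catalog. The seg3 prefix in the log line corresponds to content 3, so a query along these lines shows the catalog side of the mismatch (the local side, dbid 5, is what the segment still stores). This is only an illustrative check based on the tables above.

psql -d postgres -c "SELECT dbid, role, status, hostname, port FROM gp_segment_configuration WHERE content = 3;"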

@ashwinstar
Contributor

There might be some problem in the gprecoverseg -i handling. Each segment locally stores its gp_dbid in the file internal.auto.conf. So it seems gp_segment_configuration was changed but internal.auto.conf was not modified, and that caused the issue.

We would be happy to help with this issue; please provide the exact command and input file you used for gprecoverseg -i so we can reproduce the problem and resolve it.
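
For anyone verifying the same mismatch, a minimal check might look like the sketch below. It assumes the GPDB 6 layout in which internal.auto.conf inside the segment data directory carries a gp_dbid line; the path and values are taken from the tables in this report and would differ on other clusters.

# on the segment host: the dbid the segment stores locally
grep gp_dbid /home/data/primary/gpseg3/internal.auto.conf

# on the master: the dbid the catalog expects for that same segment
psql -d postgres -c "SELECT dbid FROM gp_segment_configuration WHERE address = 'sdw3' AND port = 6001;"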

@star-hou
Author

There might be some problem in the gprecoverseg -i handling. Each segment locally stores its gp_dbid in the file internal.auto.conf. So it seems gp_segment_configuration was changed but internal.auto.conf was not modified, and that caused the issue.

We would be happy to help with this issue; please provide the exact command and input file you used for gprecoverseg -i so we can reproduce the problem and resolve it.

Yes, you are right. On the failed node, I checked the file internal.auto.conf. The values in the file are different from those in the table gp_segment_configuration.

I will reconstruct the steps of my operation:

  1. For some reason the master node shut down, so I used gpactivatestandby to promote my standby node. On the standby node, I first added some access permission entries to the pg_hba.conf file, then ran echo '*:5432:gpperfmon:gpmon:gpmon' > /home/gpadmin/.pgpass; chmod 600 /home/gpadmin/.pgpass. After that I ran export PGPORT=5432; gpactivatestandby -d $MASTER_DATA_DIRECTORY -a -f to promote the standby, and then used psql %s -c 'ANALYZE;' to analyze my user databases.
  2. After I manually restored the master node, I decided to use it as a new standby node. So I first ran mv /home/data/master /home/data/master_old && mkdir /home/data/master on my failed master node, then ran gpinitstandby -as mdw on my new master node (the original standby node).
  3. I then wanted to replace the segment machine, so I first ran gpstop -a --host ***, and used SELECT address, port, datadir FROM gp_segment_configuration WHERE hostname = $1; to generate a recv file manually (see the sketch after these steps). The content of the recv file is:

sdw2|6000|/home/data/primary/gpseg2 sdw4|6000|/home/data/primary/gpseg2
sdw2|6001|/home/data/primary/gpseg3 sdw4|6001|/home/data/primary/gpseg3
sdw2|7001|/home/data/mirror/gpseg1 sdw4|7001|/home/data/mirror/gpseg1
sdw2|7000|/home/data/mirror/gpseg0 sdw4|7000|/home/data/mirror/gpseg0

Then I ran gprecoverseg -a -v -i /home/gpadmin/recv to recover the failed segment node. The plan was to run gprecoverseg -a -v -r once it succeeded, but it failed with the error described above.
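
For reference, the failed|replacement pairs in that recv format can be generated on the master with a query roughly like the one below. This is only an illustrative sketch: the replacement address sdw4 and the reuse of the same ports and data directories mirror the recv file shown above, and the output should still be reviewed by hand before feeding it to gprecoverseg -i.

# run on the master; emits one "failed new" pair per segment whose address is sdw2
psql -At -d postgres -c "
  SELECT address || '|' || port::text || '|' || datadir || ' ' ||
         'sdw4' || '|' || port::text || '|' || datadir
  FROM gp_segment_configuration
  WHERE address = 'sdw2';
" > /home/gpadmin/recv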

@ashwinstar
Contributor

I think we can ignore everything related to gpactivatestandby and gpinitstandby in this issue; that is just noise. The only aspect we need to focus on is the gprecoverseg -a -v -i command you used to move the primary and mirror to another host. There seems to be a logic error there, since it modifies the dbid and causes the problem. We will look into this issue and try to resolve it. Thanks for reporting.

@ashwinstar ashwinstar changed the title After gpactivatestandby and gpinitstandby dbid changed After "gprecoverseg -i" dbid changed Mar 31, 2020
@gaos1 gaos1 self-assigned this Apr 1, 2020
@star-hou
Author

star-hou commented Apr 1, 2020

First of all, thank you for your help. I ran into some other problems and fixed them manually; I'll share them here, and if you think they are real issues, please take a look.

  1. The operation gprecoverseg -p: I used gprecoverseg -o ./recv -p sdw3 to replace my failed segment node. The content of the generated configuration file recv is:
    sdw2|6000|/home/data/primary/gpseg2 sdw3|6000|/home/data/primary/gpseg2
    sdw2|6001|/home/data/primary/gpseg3 sdw3|6001|/home/data/primary/gpseg3
    sdw2|7001|/home/data/mirror/gpseg1 sdw3|6002|/home/data/mirror/gpseg1
    sdw2|7000|/home/data/mirror/gpseg0 sdw3|6003|/home/data/mirror/gpseg0
    You can see that the ports before and after are inconsistent. This has no effect on the gprecoverseg -i operation itself, but if I ignore it, errors appear later when I run gpexpand, so I have to edit the file manually to make the ports consistent; after that the errors no longer appear.
  2. The operation gprecoverseg -i: I think the gprecoverseg -i operation may forget to update the pg_hba.conf file. Because the IP authentication entries for the new node have not been added to pg_hba.conf, some permission failures appear when I run gprecoverseg -r. So after finishing gprecoverseg -i, I manually add the new node's IP information to all the pg_hba.conf files on the new node, in the format host all gpadmin 192.168.1.68/32 trust, and then the errors no longer appear (a sketch of this follows the list).
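
For item 2, a rough sketch of that manual workaround. The address and data directory below come from this report; the entry format and the set of pg_hba.conf files to touch should be adapted to your own cluster.

# on the new node, append the needed entry to the segment's pg_hba.conf
echo "host all gpadmin 192.168.1.68/32 trust" >> /home/data/primary/gpseg2/pg_hba.conf

# on the master, reload configuration files without restarting the cluster
gpstop -u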

@gaos1

gaos1 commented Apr 1, 2020

For the pg_hba.conf permission issue, it has probably already been fixed in the latest 6.x release by several related PRs, e.g. #8597.

@star-hou
Author

star-hou commented Apr 1, 2020

For the pg_hba.conf permission issue, it has probably already been fixed in the latest 6.x release by several related PRs, e.g. #8597.

Yeah I see it. Thank you very much!!!!

dreamedcheng pushed a commit to dreamedcheng/gpdb that referenced this issue Nov 10, 2021
PR9974 introduced case 'segwalrep/recoverseg_from_file' to guard
the fix to issue greenplum-db#9837, but the case is not able to do that.
Because the content of configure in it is not suitable, function
'__callSegmentAddMirror' won't be called when executing
'gprecoverseg -i xxx'.
higuoxing pushed a commit that referenced this issue Nov 12, 2021
PR #9974 introduced case 'segwalrep/recoverseg_from_file' to guard
the fix to issue #9837, but the case is not able to do that.
Because the content of configure in it is not suitable, function
'__callSegmentAddMirror' won't be called when executing
'gprecoverseg -i xxx'.

Co-authored-by: wuchengwen <wcw190496@alibaba-inc.com>
@higuoxing
Member

I think this issue has been resolved in commits f7965d4 and 69fc4a4. Feel free to reopen if it doesn't work as expected.
