
Update ora-host defaults #59

Closed
wants to merge 1 commit into from

Conversation

Contributor

@amadel8 amadel8 commented Nov 16, 2020

Disabling firewalld is a must or the installer will fail. Also adding some additional required RPMs from the 19c installation guide and from the runcluvfy post crsinst output.
Member

@mfielding mfielding left a comment

A key goal of a toolkit like this is to be secure by default; having a properly configured host firewall is a key part of that. (I do have a playbook in development that sets up a RAC-compatible firewall, but it is bare-metal specific.)

Regarding the packages, they appear to be already satisfied by package-level dependencies. For example, libaio-devel depends on libaio, glibc-devel depends on glibc, etc. I've done several installs from small default images and have not found any missing packages.
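
As a quick sanity check of that claim, a minimal sketch (illustrative only; it reuses the dbasm host group from install-sw.yml and the two package pairs named above):

- hosts: dbasm
  become: true
  tasks:
    # yum resolves dependencies, so requesting only the -devel packages
    # should also pull in libaio and glibc if they are missing.
    - name: Install devel packages and let yum resolve the base libraries
      yum:
        name:
          - libaio-devel
          - glibc-devel
        state: present

    # Quick verification that the base packages ended up installed.
    - name: Check that the base libraries are present
      command: rpm -q libaio glibc
      changed_when: false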

Contributor Author

amadel8 commented Nov 25, 2020

For firewalld, as you mentioned: if it is enabled then we must apply some firewall configuration to allow interconnect communication between the nodes (by default firewalld blocks it), otherwise the installer will fail. An alternative would be to make firewall_enabled an external parameter that the user provides, instead of having it referenced internally by a role; if the user wants the firewall enabled, we then configure the RAC enablement rules. As the toolkit stands now, the installer will always fail unless some manual work is done before running it. I also see the following in the toolkit's user guide: "The disabling of the Linux firewall and SELinux, as recommended for Oracle database servers." which I think is not accurate.
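
A rough sketch of that idea, purely illustrative: the variable name disable_firewall, the interconnect_cidr value, and the file locations below are hypothetical and not part of the current ora-host role.

# Hypothetical default, e.g. roles/ora-host/defaults/main.yml:
disable_firewall: false

# Hypothetical tasks, e.g. roles/ora-host/tasks/main.yml:
- name: Stop and disable firewalld only if the user explicitly asks for it
  systemd:
    name: firewalld
    state: stopped
    enabled: false
  when: disable_firewall | bool

- name: Otherwise allow the RAC interconnect through the running firewall
  firewalld:
    zone: public
    # interconnect_cidr is a placeholder for the private interconnect subnet
    rich_rule: rule family=ipv4 source address="{{ interconnect_cidr }}" accept
    state: enabled
    immediate: yes
    permanent: true
  when: not (disable_firewall | bool)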

For the packages, I know some of the ones I added might be redundant, but the currently listed ones are not sufficient. The reason I say that is that after doing an installation with the current package list, I ran runcluvfy post crsinst and got errors that some packages were missing. What I did was add the packages listed by runcluvfy plus all the packages from the install guide.
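
For illustration only, a task of the kind that could carry the extra RPMs in the ora-host role; the list below is a representative subset of the 19c installation guide packages, not the exact set added in this PR:

# Representative subset only; the exact list comes from the 19c install
# guide and the runcluvfy output mentioned above.
- name: Install additional packages required by Oracle 19c
  yum:
    name:
      - bc
      - binutils
      - elfutils-libelf-devel
      - fontconfig-devel
      - ksh
      - libaio-devel
      - libstdc++-devel
      - libX11
      - libXau
      - libXi
      - libXrender
      - libXtst
      - make
      - net-tools
      - nfs-utils
      - psmisc
      - smartmontools
      - sysstat
    state: present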

Collaborator

jcnars commented Apr 27, 2021

Tested this on RHEL 7.7 and hit the issue caused by the firewall.

Error is:

TASK [rac-gi-setup : rac-gi-install | Run script root.sh] **********************
failed: [bms_server_2 -> None] (item=bms_server_2) => {"ansible_loop_var": "item", "changed": true, "cmd": ["/u01/app/19.3.0/grid/root.sh"], "delta": "0:03:18.511818", "end": "2021-04-21 19:22:42.038860", "item": "bms_server_2", "msg": "non-zero return code", "rc": 25, "start": "2021-04-21 19:19:23.527042", "stderr": "", "stderr_lines": [], "stdout": "Check /u01/app/19.3.0/grid/install/root_bms_server_2_2021-04-21_19-19-23-535764383.log for the output of root script", "stdout_lines": ["Check /u01/app/19.3.0/grid/install/root_bms_server_2_2021-04-21_19-19-23-535764383.log for the output of root script"]}

Manually reproducing it on the server:

[root@bms_server_2 ~]# /u01/app/19.3.0/grid/root.sh
Check /u01/app/19.3.0/grid/install/root_bms_server_2_2021-04-23_10-55-47-796714031.log for the output of root script
[root@bms_server_2 ~]# 

[root@bms_server_2 ~]# tail -f /u01/app/19.3.0/grid/install/root_bms_server_2_2021-04-23_10-55-47-796714031.log
2021/04/23 10:55:52 CLSRSC-594: Executing installation step 7 of 19: 'SetupLocalGPNP'.
2021/04/23 10:55:53 CLSRSC-594: Executing installation step 8 of 19: 'CreateRootCert'.
2021/04/23 10:55:55 CLSRSC-594: Executing installation step 9 of 19: 'ConfigOLR'.
2021/04/23 10:55:55 CLSRSC-594: Executing installation step 10 of 19: 'ConfigCHMOS'.
2021/04/23 10:55:55 CLSRSC-594: Executing installation step 11 of 19: 'CreateOHASD'.
2021/04/23 10:55:56 CLSRSC-594: Executing installation step 12 of 19: 'ConfigOHASD'.
2021/04/23 10:55:57 CLSRSC-594: Executing installation step 13 of 19: 'InstallAFD'.
2021/04/23 10:55:57 CLSRSC-594: Executing installation step 14 of 19: 'InstallACFS'.
2021/04/23 10:55:58 CLSRSC-594: Executing installation step 15 of 19: 'InstallKA'.
2021/04/23 10:55:59 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.

Creation of ASM spfile in disk group failed.
ORA-29783: GPnP attribute SET failed with error [CLSGPNP_NOT_FOUND]


2021/04/23 10:58:04 CLSRSC-184: Configuration of ASM failed
2021/04/23 10:58:05 CLSRSC-258: Failed to configure and start ASM
Died at /u01/app/19.3.0/grid/crs/install/crsinstall.pm line 2565.

Matching metalink note: "The root.sh Fails with ORA-29783: GPnP Attribute SET Failed With Error [CLSGPNP_NOT_FOUND]" (Doc ID 2180883.1)

Fix: instead of disabling the firewall entirely, the following Ansible snippet has been proven to work on multiple BMS sites:

- hosts: all
  become: true
  tasks:
    - name: Allow local networks for RAC
      firewalld:
        zone: public
        rich_rule: rule family=ipv4 source address="{{ item.value.ipv4.network }}/{{ item.value.ipv4.netmask }}" accept
        state: enabled
        immediate: yes
        permanent: true
      with_items:
        - "{{ ansible_facts | dict2items | selectattr('value.ipv4.network', 'defined') | list }}"

The output before and after running the snippet is:
Before:

[root@bms_server_2 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: bond0 bond0.111 bond1 bond1.112 enp173s0f0 enp173s0f1 enp17s0f0 enp17s0f1
  sources: 
  services: dhcpv6-client ssh
  ports: 1521/tcp             <====== this was added by the recently [merged](https://github.com/google/bms-toolkit/blob/master/roles/rac-lsnr-firewall/tasks/main.yml#L18) rac-lsnr-firewall role
  protocols: 
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules: 

After:

[root@bms_server_2 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: bond0 bond0.111 bond1 bond1.112 enp173s0f0 enp173s0f1 enp17s0f0 enp17s0f1
  sources: 
  services: dhcpv6-client ssh
  ports: 1521/tcp
  protocols: 
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules: 
	rule family="ipv4" source address="169.254.0.0/255.255.224.0" accept
	rule family="ipv4" source address="127.0.0.0/255.0.0.0" accept
	rule family="ipv4" source address="172.16.30.0/255.255.255.0" accept
	rule family="ipv4" source address="192.168.3.0/255.255.255.0" accept

In summary: wholesale disabling of the firewall is a big hammer that is not needed and could be counterproductive with respect to security; we can surgically add the firewall rules as mentioned in the previous comments and tested as noted above.

Contributor Author

amadel8 commented Apr 28, 2021

I fully agree with your analysis; adding the accept rule for the interconnect network would work, and we could then also use firewalld to block access to port 80 on the metadata server.

Collaborator

jcnars commented May 18, 2021

Hi,
I was able to reproduce the errors caused by the network firewall on BMS hosts running OEL 7.9.

Opening up hosts in firewall:
[grid@mntrl-host2 ~]$ /u01/app/19.3.0/grid/gridSetup.sh -silent -responseFile /u01/app/19.3.0/grid/gridsetup.rsp   -J-Doracle.install.mgmtDB=false -J-Doracle.install.mgmtDB.CDB=false -J-Doracle.install.crs.enableRemoteGIMR=false -ignorePrereqFailure
Launching Oracle Grid Infrastructure Setup Wizard...

[FATAL] [INS-41116] Installer has detected that the selected following nodes do not have connectivity with other cluster nodes through the selected interface. 
 [[]]. 
 
 These nodes will be ignored and not participate in the configured Grid Infrastructure.
*ADDITIONAL INFORMATION:*
Summary of node specific errors
mntrl-host1
mntrl-host2
 - PRVG-11067 : TCP connectivity from node "mntrl-host2": "172.16.110.1" to node "mntrl-host1": "172.16.110.2" failed. PRVG-11095 : The TCP system call "connect" failed with error "113" while executing exectask on node "mntrl-host2" No route to host
 - Cause:  Errors occurred while attempting to establish Transmission Control Protocol (TCP) connectivity between the identified two interfaces.
 - Action:  Ensure that there are no firewalls blocking TCP operations and no process monitors running that can interfere with programs'' network operations.

This was resolved by adding the following to rac_lsnr_firewall/tasks/main.yml:

- hosts: all
  become: true
  tasks:
    - name: Allow local networks for RAC
      firewalld:
        zone: public
        rich_rule: rule family=ipv4 source address="{{ item.value.ipv4.network }}/{{ item.value.ipv4.netmask }}" accept
        state: enabled
        immediate: yes
        permanent: true
      with_items:
        - "{{ ansible_facts | dict2items | selectattr('value.ipv4.network', 'defined') | list }}"

And then calling that block on both nodes in install-sw.yml, like below:

- hosts: dbasm
  serial: 1
  tasks:
    - name: rac-gi-install | defaults from common
      include_vars:
        dir: roles/common/defaults
    - name: rac-gi-install | open firewall
      include_role:
        name: rac-lsnr-firewall
Opening up HAIP addresses:

The following error...:

            "[FATAL] PRCR-1079 : Failed to start resource ora.orcl.db", 
            "ORA-03113: end-of-file on communication channel", 
            "Process ID: 0", 
            "Session ID: 0 Serial number: 0", 
            "", 
            "CRS-2674: Start of 'ora.asm' on 'at-3793329-svr006' failed", 
            "ORA-03113: end-of-file on communication channel", 
            "Process ID: 0", 
            "Session ID: 0 Serial number: 0", 
            "", 
            "CRS-2674: Start of 'ora.asm' on 'at-3793329-svr006' failed", 
            "CRS-5017: The resource action \"ora.orcl.db start\" encountered the following error: ", 
            "ORA-03113: end-of-file on communication channel", 
            "Process ID: 0", 
            "Session ID: 0 Serial number: 0", 
            ". For details refer to \"(:CLSN00107:)\" in \"/u01/app/oracle/diag/crs/at-3793329-svr005/crs/trace/crsd_oraagent_oracle.trc\".", 
            "", 
            "CRS-2674: Start of 'ora.orcl.db' on 'at-3793329-svr005' failed", 
            "67% complete", 

matched with this metalink note:
Only One Instance of a RAC Database Can Start at a Time: Second Instance Fails to Start due to "No reconfig messages from other instances" - LMON is terminating the instance (Doc ID 2528588.1)

Our snippets:
From the alert log /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log:

   2110 Cluster Communication is configured to use IPs from: GPnP
   2111 IP: 169.254.24.169       Subnet: 169.254.0.0
...
   2268 No connectivity to other instances in the cluster during startup. Hence, LMON is terminating the instance. Please check the LMON trace file for details. Also, please check the network logs of this instance along with clusterwide network health for problems and then re-start this instance.
   2269 LMON (ospid: ): terminating the instance due to ORA error
   2270 Cause - 'Instance is being terminated by LMON'   

from lmon trc file: /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lmon_19696.trc:

  170 *** 2021-05-13T20:14:28.294209-07:00
    171 IPCLW:[0.41]{E}[WAIT]:PROTO: [1620962068294072]RETRANS DBG local acnh 0x7f2eae398ab0 dump:
    172 IPCLW:[0.42]{-}[WAIT]:UTIL: [1620962068294072]  ACNH 0x7f2eae398ab0 State: 1 SMSN: 987049026 PKT(987049027.734422451) # Pending: 1
    173 IPCLW:[0.43]{-}[WAIT]:UTIL: [1620962068294072]   Peer: LMON.KSXP_cgs.19981 AckSeq: 734422450
    174 IPCLW:[0.44]{-}[WAIT]:UTIL: [1620962068294072]   Flags: 0x00000000 IHint: 0x700a68060000001f THint: 0x713cf1d0000001f
    175 IPCLW:[0.45]{-}[WAIT]:UTIL: [1620962068294072]   Local Address: 169.254.24.169:59825 Remote Address: 169.254.26.248:63753
    176 IPCLW:[0.46]{-}[WAIT]:UTIL: [1620962068294072]   Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 1be7 xdata2 9edd2d5c
    177 IPCLW:[0.47]{-}[WAIT]:UTIL: [1620962068294072]             : mmsz 32768 mmr 4096 mms 4096 xdata c2413228
    178 IPCLW:[0.48]{-}[WAIT]:UTIL: [1620962068294072]   IVPort: 4084 TVPort: 12840 IMPT: 52849 RMPT: 7143   Pending Sends: Yes Unacked Sends: Yes
    179 IPCLW:[0.49]{-}[WAIT]:UTIL: [1620962068294072]   Send Engine Queued: No sshdl -1 ssts 0 rtts 1620962068294188 snderrchk 4 creqcnt 2 credits 7/8
    180 IPCLW:[0.50]{-}[WAIT]:UTIL: [1620962068294072]   Unackd Messages 987049026 -> 987049026. SSEQ 734422450 Send Time: 0:0:25.744.744074 SMSN # Xmits: 256 EMSN 0:0:25.744.744074
    181 IPCLW:[0.51]{-}[WAIT]:UTIL: [1620962068294072]  Pending send queue:
...


    358 
    359 No reconfig messages from other instances in the cluster during startup. Hence, LMON is terminating the instance. Please check the network logs of this instance as well as the network health of the cluster for problems or if the GI is in patching mode

Direct matches to snippets in 2528588.1.

This was resolved by reusing snippets from Marc's experimental playbook:

- name: Allow HAIP networks
  firewalld:
    zone: public
    rich_rule: rule family=ipv4 source address="169.254.0.0/19" accept
    state: enabled
    immediate: yes
    permanent: true
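
The 169.254.0.0/19 source matches the link-local subnet HAIP actually plumbed on these hosts: it is the same range that appears as 169.254.0.0/255.255.224.0 in the firewall-cmd output earlier in this thread.
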
Summary:

  • CRS IPs get plumbed on the private interconnect by the clusterware:
    • 169.254.x.x IPs show up on the private NIC (bond1.106) as bond1.106:1.
    • This IP is used by HAIP and shows up in gv$cluster_interconnect.
  • HAIP summary in 2 lines (ref: 1210883.1):
    • The 169.254.x.x link-local IP is the HAIP and it is configured on 1 or more physical NICs. If one of the NICs is shot, the IP moves over to the other NIC.
  • The reason LMON kills the other instance is that it is not able to receive heartbeats from the other instance via the configured cluster_interconnect.
  • Summary of all code changes to fix this:

    • roles/rac_lsnr_firewall/tasks/main.yml:
- hosts: all
  become: true
  tasks:
    - name: Allow local networks for RAC
      firewalld:
        zone: public
        rich_rule: rule family=ipv4 source address="{{ item.value.ipv4.network }}/{{ item.value.ipv4.netmask }}" accept
        state: enabled
        immediate: yes
        permanent: true
      with_items:
        - "{{ ansible_facts | dict2items | selectattr('value.ipv4.network', 'defined') | list }}"

    - name: Allow HAIP networks
      firewalld:
        zone: public
        rich_rule: rule family=ipv4 source address="169.254.0.0/19" accept
        state: enabled
        immediate: yes
        permanent: true

And then, calling that block to run on both nodes in install-sw.yml, like below:

- hosts: dbasm
  serial: 1
  tasks:
    - name: rac-gi-install | defaults from common
      include_vars:
        dir: roles/common/defaults
    - name: rac-gi-install | open firewall
      include_role:
        name: rac-lsnr-firewall

I will submit the firewall related changes in a new PR.

Thanks

P.S.:
We will handle blocking the metadata server's 169.254.169.254:80 in a separate thread, as the firewalld module in Ansible doesn't support rules for outgoing traffic at this point.
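
For reference, one possible interim workaround (a sketch only, not something the toolkit currently does) is to fall back to firewalld direct rules via the command module, since direct rules can match the OUTPUT chain:

# Not idempotent as written: firewall-cmd will warn if the rule already
# exists; a real task would guard this with a check or changed_when.
- name: Block outbound access to the metadata server on port 80 (sketch)
  command: >
    firewall-cmd --permanent --direct --add-rule ipv4 filter OUTPUT 0
    -d 169.254.169.254 -p tcp --dport 80 -j REJECT

- name: Reload firewalld so the permanent direct rule takes effect
  command: firewall-cmd --reload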

Collaborator

jcnars commented May 27, 2021

Hello Ahmad,
Closing the loop on this: the following 2 PRs address the points identified in PRs #59 and #60.

Thanks for identifying the issues.

@jcnars jcnars closed this Jun 9, 2021
@jcnars jcnars reopened this Jun 9, 2021
@amadel8 amadel8 closed this Jun 9, 2021