CPS gets into a state where IP addresses cannot be configured #66

dimbleby · 2018-02-26T16:05:41Z

We sometimes see a VM get into a state where attempts to set an IP address on an interface fail.

It's currently unclear exactly what the necessary steps to reproduce this are - I am working on getting a clearer picture of this.

Once in the bad state, attempts to set an IP address via CPS fail. Eg running this script results in us hitting the error branch.

I've seen examples where the opx-pas service has failed - but restarting it does not resolve the problem.

I've also seen examples where the opx-nas service makes a series of interesting logs:

root@NST-OPX-TEST-001:~# systemctl -l status opx-nas.service
● opx-nas.service - Network abstraction service
   Loaded: loaded (/lib/systemd/system/opx-nas.service; enabled)
   Active: active (running) since Mon 2018-02-26 15:13:09 UTC; 46min ago
  Process: 645 ExecStartPre=/usr/bin/dn_rules.sh (code=exited, status=0/SUCCESS)
 Main PID: 703 (opx_nas_daemon)
   CGroup: /system.slice/opx-nas.service
           └─703 /usr/bin/opx_nas_daemon

Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [INTERFACE:INTF-C], No npu port attribute found in event object
Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [ev_log_t_SAI_PORT:Switch Id: 0], Attr get for port id 0x1000000000001's attr index 0 attr id 76 failed with err -65536
Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [ev_log_t_SAI_PORT:Switch Id: 0], Attr get for port id 0x1000000000001's attr index 0 attr id 76 failed with err -65536
Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [DSAPI:CPS-EVNT-THREAD], Event processing has been stopped due to negative return from CB observed/dell-base-if-cmn/if/interfaces-state/interface
                                                      if/interfaces-state/interface/name : e101-001-0
                                                      if/interfaces-state/interface/if-index : 3
                                                      if/interfaces-state/interface/admin-status : 1

Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [INTERFACE:INTF-C], No npu port attribute found in event object
Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [INTERFACE:INTF-C], No npu port attribute found in event object
Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [INTERFACE:INTF-C], No npu port attribute found in event object
Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [ev_log_t_SAI_PORT:Switch Id: 0], Attr get for port id 0x1000000000001's attr index 0 attr id 76 failed with err -65536
Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [ev_log_t_SAI_PORT:Switch Id: 0], Attr get for port id 0x1000000000001's attr index 0 attr id 76 failed with err -65536
Feb 26 15:59:34 NST-OPX-TEST-001 opx_nas_daemon[703]: [ev_log_t_SAI_PORT:Switch Id: 0], Attr get for port id 0x1000000000001's attr index 0 attr id 76 failed with err -65536

... but restarting the opx-nas service does not resolve the issue either.

Rebooting the box does allow IP address configuration to start succeeding again.

Where should we be looking for additional diagnostics to explain what is going on when the configuration fails?

Thanks!

The text was updated successfully, but these errors were encountered:

atanu-mandal · 2018-02-27T02:07:06Z

Regarding the SAI_PORT error (i.e. attr index 0 attr id 76 failed) the below fix should solve the issue (merged recently)

https://review.openswitch.net/#/c/14562/

We need to check why the opx-pas is failing, can you please provide "service opx-pas status" details.

dimbleby · 2018-02-27T09:07:34Z

Here's opx-pas status from a recent example:

root@NST-OPX-TEST-002:~# service opx-pas status
● opx-pas.service - This PAS service is to initialize platform.
   Loaded: loaded (/lib/systemd/system/opx-pas.service; enabled)
   Active: active (running) since Tue 2018-02-27 08:37:03 UTC; 28min ago
 Main PID: 613 (opx_pas_service)
   CGroup: /system.slice/opx-pas.service
           └─613 /usr/bin/opx_pas_service

Feb 27 08:37:11 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM PPID not programmed
Feb 27 08:37:11 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM part number not programmed
Feb 27 08:37:11 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM service tag not programmed
Feb 27 09:05:21 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM vendor name not programmed
Feb 27 09:05:21 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM product name not programmed
Feb 27 09:05:21 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM hardware revision not programmed
Feb 27 09:05:21 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM platform name not programmed
Feb 27 09:05:21 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM PPID not programmed
Feb 27 09:05:21 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM part number not programmed
Feb 27 09:05:21 NST-OPX-TEST-002 opx_pas_service[613]: [PAS:chassis_resp_set], Chassis EEPROM service tag not programmed

dimbleby · 2018-02-27T11:51:44Z

I have discovered that when CPS gets into this state, it can be recovered by service opx-ip restart.

However there is no indication that opx-ip was unhappy:

root@NST-OPX-TEST-002:~# service opx-ip status
● opx-ip.service - IP address handler
   Loaded: loaded (/lib/systemd/system/opx-ip.service; enabled)
   Active: active (running) since Tue 2018-02-27 11:43:49 UTC; 2min 42s ago
 Main PID: 5081 (python)
   CGroup: /system.slice/opx-ip.service
           └─5081 /usr/bin/python -u /usr/bin/base_ip.py

Feb 27 11:43:48 NST-OPX-TEST-002 systemd[1]: Starting IP address handler...
Feb 27 11:43:49 NST-OPX-TEST-002 systemd[1]: Started IP address handler.
Feb 27 11:44:36 NST-OPX-TEST-002 python[5081]: Attempting to add address  192.170.0.0/31 e101-001-0

Hope that helps.

dimbleby · 2018-02-27T14:36:12Z

I can now reliably reproduce this condition, by using ansible to install a debian package - any package, even one that does not exist! - on the switch.

On some other host (from which you will run ansible):

create a hosts file containing the IP address of the OPX switch, eg

echo 172.17.31.206 > hosts

run this command:

ansible -i hosts all -m apt -a "deb=foo.deb state=present"

You should now find that attempts to configure IP addresses through CPS fail.

It seems as though something happened during the ansible command that breaks the registration from /usr/bin/base_ip.py. When we attempt to configure IP addresses, trans_cb() in that file is simply not executed.

atanu-mandal · 2018-03-04T08:10:26Z

Somehow I could not reproduce the issue having followed the above procedure, hence need more info to troubleshoot further.

dimbleby · 2018-03-04T10:29:35Z

Sure - what do you need?

atanu-mandal · 2018-03-05T03:44:14Z

Can you provide the logs related to the issue (error logs, registration break, cps command failure ,etc), that may help.

dimbleby · 2018-03-05T08:31:56Z

I really don't have any useful logs:

I can tell that trans_cb() is not called because I added an additional print('hello') on entry to it - and when in the broken state we do not see that log
All I get from configure_ip_address.py is this runtime exception

What else do you need? Are there likely to be other useful logs somewhere else? Perhaps you'll want to provide debug versions of some code, making additional logs?

atanu-mandal · 2018-03-06T04:25:13Z

Assuming you are able to reproduce the issue consistently, can you please check if below changes fix the issue. Recently we have encountered some EPIPE randomly and following code helps to avoid that.
You can make the change on the target (/usr/bin/base_ip.py).

Diff:
Diff.txt

File attached with above change (please rename the file to base_ip.py) :

base_ip.py.txt

dimbleby · 2018-03-06T10:22:10Z

The code that you provided tries to log the undefined variable vrf_name. As a result, no IP configuration succeeds at all. I guess you forgot to test this!

However, if I fix that so that we only try to log addr and name - then that does seem to solve the problem. So this fix is basically looking good.

I see that there are still a handful of uses of print in this code. Will you want to fix those too?

atanu-mandal · 2018-03-07T00:00:06Z

Great to know that the change fixes the issue. We will commit this change. We may not need to remove other print.

I actually tested without restarting opx-ip service, hence it worked :-) Sorry for that.

As you have mentioned this is the updated file
base_ip.py.txt

atanu-mandal · 2018-03-08T03:47:39Z

Fix in review

https://review.openswitch.net/#/c/14580/

atanu-mandal · 2018-03-12T02:10:49Z

Closing this, please reopen for any further issue.

jeff-yin assigned jeff-yin and atanu-mandal and unassigned jeff-yin Mar 2, 2018

atanu-mandal closed this as completed Mar 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CPS gets into a state where IP addresses cannot be configured #66

CPS gets into a state where IP addresses cannot be configured #66

dimbleby commented Feb 26, 2018

atanu-mandal commented Feb 27, 2018

dimbleby commented Feb 27, 2018

dimbleby commented Feb 27, 2018 •

edited

Loading

dimbleby commented Feb 27, 2018 •

edited

Loading

atanu-mandal commented Mar 4, 2018

dimbleby commented Mar 4, 2018

atanu-mandal commented Mar 5, 2018

dimbleby commented Mar 5, 2018

atanu-mandal commented Mar 6, 2018

dimbleby commented Mar 6, 2018 •

edited

Loading

atanu-mandal commented Mar 7, 2018

atanu-mandal commented Mar 8, 2018

atanu-mandal commented Mar 12, 2018

CPS gets into a state where IP addresses cannot be configured #66

CPS gets into a state where IP addresses cannot be configured #66

Comments

dimbleby commented Feb 26, 2018

atanu-mandal commented Feb 27, 2018

dimbleby commented Feb 27, 2018

dimbleby commented Feb 27, 2018 • edited Loading

dimbleby commented Feb 27, 2018 • edited Loading

atanu-mandal commented Mar 4, 2018

dimbleby commented Mar 4, 2018

atanu-mandal commented Mar 5, 2018

dimbleby commented Mar 5, 2018

atanu-mandal commented Mar 6, 2018

dimbleby commented Mar 6, 2018 • edited Loading

atanu-mandal commented Mar 7, 2018

atanu-mandal commented Mar 8, 2018

atanu-mandal commented Mar 12, 2018

dimbleby commented Feb 27, 2018 •

edited

Loading

dimbleby commented Feb 27, 2018 •

edited

Loading

dimbleby commented Mar 6, 2018 •

edited

Loading