[FeG] Gx/Gy interfaces over SCTP with connections suddenly closed, then no interface/service auto-recovered #12067

crbertoldo · 2022-03-10T18:04:14Z

Your Environment

Version: 1.5
Affected Component: FeG
Affected Subcomponent: session_proxy, FeG EC2 instance itself, and/or its underlying network
Deployment Environment: e.g. AWS (orc8r), bare-metal (AGW)
Extra: AWS Site-to-Site VPN between the FeG's VPC (us-west-2) and Cyan's OCS/PCRF (both on the same instance in Miami)

Describe the Issue

We have two issues that cause the described by the title of this issue:

Unexpected SCTP SHUTDOWNs for Gx and/or Gy that sometimes is not proceeded by the interface recovery automatically or FeG/session_proxy takes a while to recover it (observations for an interval between a few seconds to an hour to do so). We see these SHUTDOWNs happening right after a packet drop for Diameter DWR towards the PCRF/OCS. But, not every time we have a drop, SHUTDOWNs happen. When there's no auto-recovery, we can see at /proc/net/sctp/assocs the missing association(s). Restarting the FeG containers recover the interface(s).
Unexpected SCTP ABORTs also for Gx and/or Gy that is never proceeded by the interface recovery automatically. When it happens neither restarting the FeG containers recover the interface(s). Just rebooting the FeG VM (and sometimes at least 2 or 3 times consecutively) recovers the interface(s). Different from 1 above, the sctp/assoc(s) keep active for the affected interface(s) while its Rx queue keeps increasing (looks like it cannot send data to the application layer, so keeps buffering). Example for Gy:

 ASSOC     SOCK   STY SST ST HBKT ASSOC-ID TX_QUEUE RX_QUEUE UID INODE LPORT RPORT LADDRS <-> RADDRS HBINT INS OUTS MAXRT T1X T2X RTXC wmema wmemq sndbuf rcvbuf
       0        0 2   1   3  13      2        0        0       0 22809 43447  3969  10.10.0.137 172.17.0.1 <-> *10.218.10.91 	    7500    10    10   10    0    0        1        1        0   212992   212992
       0        0 2   1   3  97      3        0    54656       0 27358 60552  3869  10.10.0.137 172.17.0.1 <-> *10.218.10.91 	    7500    10    10   10    0    0        0        1        0   212992   212992

To Reproduce
Since the deployment is sorta particular, we suggest using our deployment (lab and/or prod) to take any information needed.

Expected behavior
The issues can be related to the network issues (like the drops for 1 above), to workload vs. provisioned computational resources, etc., but we would expect both logs showing why these SHUTDOWNs/ABORTs happen and better mechanisms by session_proxy to handle such "natural" failures (if so).

Additional context

Due to customer's requirements, additional logs and traces containing subscriber's data should be shared in private
session_proxy logs are verbose command: envdir /var/opt/magma/envdir /var/opt/magma/bin/session_proxy -logtostderr=true -v=10
We have backported/deployed the following fixes: [feg] Diameter failed connection recovery (possible fix for issue #7496) #7637 and fix(feg): fix destroyConnection log nil pointer exception #10765
Recent observations show a "pattern" of 15 min intervals between the ABORTS
We saw a similar issue with TCP (suddenly TCP FINs and/or RST), so why switched to SCPT
Docker network mode is host

x-generic-service: &service
  volumes: *volumes_anchor
  logging: *logging_anchor
  restart: always
  network_mode: host

Per below, protocol, retransmits, retry_count and watchdog_interval have been changed in order to try an improvement, but unsuccessfully.

  "gx": {
    "server": {
      "address": "10.218.10.91:3969",
      "dest_host": "10.218.10.91",
      "dest_realm": "inew-cs.com",
      "host": "feg.freedomfi.com",
      "protocol": "sctp",
      "realm": "freedomfi.com",
      "retransmits": 10,
      "retry_count": 10,
      "watchdog_interval": 10
    },
    "servers": null,
    "virtual_apn_rules": []
  },
  "gy": {
    "init_method": 2,
    "server": {
      "address": "10.218.10.91:3869",
      "dest_host": "10.218.10.91",
      "dest_realm": "inew-cs.com",
      "host": "feg.freedomfi.com",
      "protocol": "sctp",
      "realm": "freedomfi.com",
      "retransmits": 10,
      "retry_count": 10,
      "watchdog_interval": 10
    }

The log snippet below suggests that client, server is not passed to connection.go

magma/feg/gateway/diameter/connection_manager.go

Line 63 in beac53c

conn = newConnection(client, server)

Feb 24 03:58:14 ip-10-10-0-137 26ec7c59990c[1339]: E0224 03:58:14.746758       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Feb 24 03:58:14 ip-10-10-0-137 26ec7c59990c[1339]: E0224 03:58:14.746786       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Feb 24 03:58:15 ip-10-10-0-137 43999015c726[1339]: I0224 03:58:15.238761       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr:} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr:} connection
Feb 24 03:58:15 ip-10-10-0-137 69aefcc97113[1339]: E0224 03:58:15.287367       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.

The text was updated successfully, but these errors were encountered:

crbertoldo · 2022-03-10T18:17:08Z

@jpad-freedomfi @malcolmmadsheep @taaraora @uri200 @emakeev @amarpad

uri200 · 2022-03-10T21:21:51Z

Why don't you have local_address defined? You don't need to defin the IP, but you need the port. Local ports for GX, GY and S6a should be different.

You could use "local_address": ":0", to let feg chose a random port, but note then you will not be able to send messages initiated by OCS/PCRF because you won't know the port. So the best is define specific ports

I also suggest to update your complete setup to 1.7. The version is not official yet, but we area already testing it and it works properly. Note 1.5 is almost a year old and some of the bugs you may be experiencing are already fixed there.

    "gx": {
      "server": {
        "address": "172.16.1.6:3868",
        "dest_host": "magma-fedgw.magma.com",
        "dest_realm": "magma.com",
        "host": "string",
        "local_address": ":3907",     <-----
        "product_name": "string",
        "protocol": "sctp",
        "realm": "string"
      },
      "servers": null,
      "virtual_apn_rules": null
    },
    "gy": {
      "server": {
        "address": "172.16.1.14:3868",
        "dest_host": "magma-fedgw.magma.com",
        "dest_realm": "magma.com",
        "host": "string",
        "local_address": ":3906",      <-----
        "product_name": "string",
        "protocol": "sctp",
        "realm": "string"
      },
      "servers": null,
      "virtual_apn_rules": null
    },

crbertoldo · 2022-03-10T22:35:15Z

Thanks @uri200. It's applied.

I see in your example no gy.init_method. May we change it also?

Side note: when trying via the API, I got

{
  "message": "failed to load entity being updated: expected to load 1 ent for update, got 0"
}

So performed the local_address addition via the web UI, that made me lose:

"retransmits": 10,
"retry_count": 10,
"watchdog_interval": 10

Now:

"gx": {
  "server": {
    "address": "10.218.10.91:3969",
    "dest_host": "10.218.10.91",
    "dest_realm": "inew-cs.com",
    "host": "feg.freedomfi.com",
    "local_address": ":3907",
    "protocol": "sctp",
    "realm": "freedomfi.com"
  },
  "servers": null,
  "virtual_apn_rules": []
},
"gy": {
  "init_method": 2,
  "server": {
    "address": "10.218.10.91:3869",
    "dest_host": "10.218.10.91",
    "dest_realm": "inew-cs.com",
    "host": "feg.freedomfi.com",
    "local_address": ":3906",
    "protocol": "sctp",
    "realm": "freedomfi.com"
},

crbertoldo · 2022-03-10T22:59:42Z

@uri200 update: the ABORTs are not solved. I'm not sure about the SHUTDOWNs because the high frequency of ABORTs is not allowing them to happen now.

crbertoldo · 2022-03-10T23:12:35Z

An almost relationship between:

Mar 10 20:57:16 ip-10-10-0-137 53c6a4c9e856[1337]: E0310 20:57:16.918775       1 credits.go:201] Timed out receiving responses from OCS
Mar 10 21:27:21 ip-10-10-0-137 53c6a4c9e856[1331]: E0310 21:27:21.229695       1 credits.go:201] Timed out receiving responses from OCS
Mar 10 22:12:37 ip-10-10-0-137 53c6a4c9e856[1343]: E0310 22:12:37.210470       1 credits.go:201] Timed out receiving responses from OCS
Mar 10 22:12:38 ip-10-10-0-137 53c6a4c9e856[1343]: E0310 22:12:38.153714       1 credits.go:201] Timed out receiving responses from OCS
Mar 10 22:42:45 ip-10-10-0-137 9a3fab4f2a0e[1347]: E0310 22:42:45.195263       1 credits.go:201] Timed out receiving responses from OCS

And:

2022-03-10 20:57:16.821319: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 21:12:22.556856: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 21:27:19.147239: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 21:42:25.729426: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 21:57:32.976971: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 22:12:41.270736: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 22:27:40.597123: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 22:42:43.197597: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 22:57:49.840778: FeG VM rebooting task upon an SCTP ABORT!

Extra:

Mar 10 22:27:54 ip-10-10-0-137 systemd-modules-load[384]: Inserted module 'nf_conntrack_proto_sctp'
Mar 10 22:27:59 ip-10-10-0-137 16ad333eaa18[1347]: E0310 22:27:59.067607       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:27:59 ip-10-10-0-137 16ad333eaa18[1347]: E0310 22:27:59.067638       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:27:59 ip-10-10-0-137 kernel: [   20.180101] sctp: Hash tables configured (established 256 bind 256)
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: I0310 22:27:59.125459       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:27:59 ip-10-10-0-137 0f0b8d7339c3[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 0f0b8d7339c3[1347]: E0310 22:27:59.233479       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.
Mar 10 22:32:18 ip-10-10-0-137 eff22fdee358[1347]: 2022/03/10 22:32:18 Using default value: sctp for flag: network
Mar 10 22:32:18 ip-10-10-0-137 eff22fdee358[1347]: E0310 22:32:18.519157       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: 2022/03/10 22:32:19 Using default value: sctp for flag: network
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: 2022/03/10 22:32:19 Using default value: sctp for flag: network
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: 2022/03/10 22:32:19 Using default value: sctp for flag: network
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: 2022/03/10 22:32:19 Using default value: sctp for flag: network
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: I0310 22:32:19.097711       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:32:19 ip-10-10-0-137 60d45f6719bd[1347]: E0310 22:32:19.166250       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:32:19 ip-10-10-0-137 60d45f6719bd[1347]: E0310 22:32:19.166623       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:42:29 ip-10-10-0-137 402168a9ba1f[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 402168a9ba1f[1347]: E0310 22:42:29.115059       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: I0310 22:42:29.483542       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:42:29 ip-10-10-0-137 0199d594ffce[1347]: E0310 22:42:29.635278       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:42:29 ip-10-10-0-137 0199d594ffce[1347]: E0310 22:42:29.635299       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:43:03 ip-10-10-0-137 systemd-modules-load[408]: Inserted module 'nf_conntrack_proto_sctp'
Mar 10 22:43:07 ip-10-10-0-137 402168a9ba1f[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 kernel: [   18.501797] sctp: Hash tables configured (established 256 bind 256)
Mar 10 22:43:07 ip-10-10-0-137 402168a9ba1f[1367]: E0310 22:43:07.170181       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: I0310 22:43:07.803749       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:43:07 ip-10-10-0-137 0199d594ffce[1367]: E0310 22:43:07.906226       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:43:07 ip-10-10-0-137 0199d594ffce[1367]: E0310 22:43:07.906257       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:58:03 ip-10-10-0-137 systemd-modules-load[371]: Inserted module 'nf_conntrack_proto_sctp'
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: I0310 22:58:08.198689       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:58:08 ip-10-10-0-137 0199d594ffce[1356]: E0310 22:58:08.216933       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:58:08 ip-10-10-0-137 0199d594ffce[1356]: E0310 22:58:08.216960       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:58:08 ip-10-10-0-137 kernel: [   19.342397] sctp: Hash tables configured (established 256 bind 256)
Mar 10 22:58:08 ip-10-10-0-137 402168a9ba1f[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 402168a9ba1f[1356]: E0310 22:58:08.290802       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.

emakeev · 2022-03-10T23:17:04Z

can you share FeG pcaps? Maybe on a private channel.

crbertoldo · 2022-03-10T23:29:11Z

can you share FeG pcaps? Maybe on a private channel.

Sent @emakeev. Thanks!

uri200 · 2022-03-11T06:54:40Z

@crbertoldo I don't think the unit method has any impact here .

I would suggest to use

docker-compose -f logs session_proxy

To filter out the logs. Otherwise you will be collecting other services logs that can be mistaken.

BTW, solution to this issue may be found at
#12192, #12194, #12220, #12234 & #12244.

crbertoldo · 2022-03-11T13:33:36Z

Thanks @uri200

emakeev · 2022-03-24T22:28:31Z

@crbertoldo should we close the issue now?

crbertoldo · 2022-03-25T00:23:50Z

@crbertoldo should we close the issue now?

Yes @emakeev . Thanks again

crbertoldo added the type: bug Something isn't working label Mar 10, 2022

This was referenced Mar 17, 2022

fix(feg): Diameter closed connection recovery #12192

Merged

fix(feg): FeG diameter handler blocking fix and additional logging for diameter & session_proxy #12220

Merged

crbertoldo closed this as completed Mar 25, 2022

uri200 assigned uri200 and emakeev Jun 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FeG] Gx/Gy interfaces over SCTP with connections suddenly closed, then no interface/service auto-recovered #12067

[FeG] Gx/Gy interfaces over SCTP with connections suddenly closed, then no interface/service auto-recovered #12067

crbertoldo commented Mar 10, 2022

crbertoldo commented Mar 10, 2022

uri200 commented Mar 10, 2022 •

edited

crbertoldo commented Mar 10, 2022

crbertoldo commented Mar 10, 2022

crbertoldo commented Mar 10, 2022 •

edited

emakeev commented Mar 10, 2022

crbertoldo commented Mar 10, 2022

uri200 commented Mar 11, 2022 •

edited

crbertoldo commented Mar 11, 2022

emakeev commented Mar 24, 2022

crbertoldo commented Mar 25, 2022

[FeG] Gx/Gy interfaces over SCTP with connections suddenly closed, then no interface/service auto-recovered #12067

[FeG] Gx/Gy interfaces over SCTP with connections suddenly closed, then no interface/service auto-recovered #12067

Comments

crbertoldo commented Mar 10, 2022

Your Environment

Describe the Issue

crbertoldo commented Mar 10, 2022

uri200 commented Mar 10, 2022 • edited

crbertoldo commented Mar 10, 2022

crbertoldo commented Mar 10, 2022

crbertoldo commented Mar 10, 2022 • edited

emakeev commented Mar 10, 2022

crbertoldo commented Mar 10, 2022

uri200 commented Mar 11, 2022 • edited

crbertoldo commented Mar 11, 2022

emakeev commented Mar 24, 2022

crbertoldo commented Mar 25, 2022

uri200 commented Mar 10, 2022 •

edited

crbertoldo commented Mar 10, 2022 •

edited

uri200 commented Mar 11, 2022 •

edited