Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FeG] Gx/Gy interfaces over SCTP with connections suddenly closed, then no interface/service auto-recovered #12067

Closed
crbertoldo opened this issue Mar 10, 2022 · 11 comments
Assignees
Labels
type: bug Something isn't working

Comments

@crbertoldo
Copy link

Your Environment

  • Version: 1.5
  • Affected Component: FeG
  • Affected Subcomponent: session_proxy, FeG EC2 instance itself, and/or its underlying network
  • Deployment Environment: e.g. AWS (orc8r), bare-metal (AGW)
  • Extra: AWS Site-to-Site VPN between the FeG's VPC (us-west-2) and Cyan's OCS/PCRF (both on the same instance in Miami)

Describe the Issue

We have two issues that cause the described by the title of this issue:

  1. Unexpected SCTP SHUTDOWNs for Gx and/or Gy that sometimes is not proceeded by the interface recovery automatically or FeG/session_proxy takes a while to recover it (observations for an interval between a few seconds to an hour to do so). We see these SHUTDOWNs happening right after a packet drop for Diameter DWR towards the PCRF/OCS. But, not every time we have a drop, SHUTDOWNs happen. When there's no auto-recovery, we can see at /proc/net/sctp/assocs the missing association(s). Restarting the FeG containers recover the interface(s).
  2. Unexpected SCTP ABORTs also for Gx and/or Gy that is never proceeded by the interface recovery automatically. When it happens neither restarting the FeG containers recover the interface(s). Just rebooting the FeG VM (and sometimes at least 2 or 3 times consecutively) recovers the interface(s). Different from 1 above, the sctp/assoc(s) keep active for the affected interface(s) while its Rx queue keeps increasing (looks like it cannot send data to the application layer, so keeps buffering). Example for Gy:
 ASSOC     SOCK   STY SST ST HBKT ASSOC-ID TX_QUEUE RX_QUEUE UID INODE LPORT RPORT LADDRS <-> RADDRS HBINT INS OUTS MAXRT T1X T2X RTXC wmema wmemq sndbuf rcvbuf
       0        0 2   1   3  13      2        0        0       0 22809 43447  3969  10.10.0.137 172.17.0.1 <-> *10.218.10.91 	    7500    10    10   10    0    0        1        1        0   212992   212992
       0        0 2   1   3  97      3        0    54656       0 27358 60552  3869  10.10.0.137 172.17.0.1 <-> *10.218.10.91 	    7500    10    10   10    0    0        0        1        0   212992   212992

To Reproduce
Since the deployment is sorta particular, we suggest using our deployment (lab and/or prod) to take any information needed.

Expected behavior
The issues can be related to the network issues (like the drops for 1 above), to workload vs. provisioned computational resources, etc., but we would expect both logs showing why these SHUTDOWNs/ABORTs happen and better mechanisms by session_proxy to handle such "natural" failures (if so).

Additional context

  1. Due to customer's requirements, additional logs and traces containing subscriber's data should be shared in private
  2. session_proxy logs are verbose command: envdir /var/opt/magma/envdir /var/opt/magma/bin/session_proxy -logtostderr=true -v=10
  3. We have backported/deployed the following fixes: [feg] Diameter failed connection recovery (possible fix for issue #7496) #7637 and fix(feg): fix destroyConnection log nil pointer exception #10765
  4. Recent observations show a "pattern" of 15 min intervals between the ABORTS
  5. We saw a similar issue with TCP (suddenly TCP FINs and/or RST), so why switched to SCPT
  6. Docker network mode is host
x-generic-service: &service
  volumes: *volumes_anchor
  logging: *logging_anchor
  restart: always
  network_mode: host
  1. Per below, protocol, retransmits, retry_count and watchdog_interval have been changed in order to try an improvement, but unsuccessfully.
  "gx": {
    "server": {
      "address": "10.218.10.91:3969",
      "dest_host": "10.218.10.91",
      "dest_realm": "inew-cs.com",
      "host": "feg.freedomfi.com",
      "protocol": "sctp",
      "realm": "freedomfi.com",
      "retransmits": 10,
      "retry_count": 10,
      "watchdog_interval": 10
    },
    "servers": null,
    "virtual_apn_rules": []
  },
  "gy": {
    "init_method": 2,
    "server": {
      "address": "10.218.10.91:3869",
      "dest_host": "10.218.10.91",
      "dest_realm": "inew-cs.com",
      "host": "feg.freedomfi.com",
      "protocol": "sctp",
      "realm": "freedomfi.com",
      "retransmits": 10,
      "retry_count": 10,
      "watchdog_interval": 10
    }
  1. The log snippet below suggests that client, server is not passed to connection.go
    conn = newConnection(client, server)
Feb 24 03:58:14 ip-10-10-0-137 26ec7c59990c[1339]: E0224 03:58:14.746758       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Feb 24 03:58:14 ip-10-10-0-137 26ec7c59990c[1339]: E0224 03:58:14.746786       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Feb 24 03:58:15 ip-10-10-0-137 43999015c726[1339]: I0224 03:58:15.238761       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr:} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr:} connection
Feb 24 03:58:15 ip-10-10-0-137 69aefcc97113[1339]: E0224 03:58:15.287367       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.

@crbertoldo crbertoldo added the type: bug Something isn't working label Mar 10, 2022
@crbertoldo
Copy link
Author

@uri200
Copy link
Contributor

uri200 commented Mar 10, 2022

Why don't you have local_address defined? You don't need to defin the IP, but you need the port. Local ports for GX, GY and S6a should be different.

You could use "local_address": ":0", to let feg chose a random port, but note then you will not be able to send messages initiated by OCS/PCRF because you won't know the port. So the best is define specific ports

I also suggest to update your complete setup to 1.7. The version is not official yet, but we area already testing it and it works properly. Note 1.5 is almost a year old and some of the bugs you may be experiencing are already fixed there.

    "gx": {
      "server": {
        "address": "172.16.1.6:3868",
        "dest_host": "magma-fedgw.magma.com",
        "dest_realm": "magma.com",
        "host": "string",
        "local_address": ":3907",     <-----
        "product_name": "string",
        "protocol": "sctp",
        "realm": "string"
      },
      "servers": null,
      "virtual_apn_rules": null
    },
    "gy": {
      "server": {
        "address": "172.16.1.14:3868",
        "dest_host": "magma-fedgw.magma.com",
        "dest_realm": "magma.com",
        "host": "string",
        "local_address": ":3906",      <-----
        "product_name": "string",
        "protocol": "sctp",
        "realm": "string"
      },
      "servers": null,
      "virtual_apn_rules": null
    },

@crbertoldo
Copy link
Author

Thanks @uri200. It's applied.

I see in your example no gy.init_method. May we change it also?

Side note: when trying via the API, I got

{
  "message": "failed to load entity being updated: expected to load 1 ent for update, got 0"
}

So performed the local_address addition via the web UI, that made me lose:

"retransmits": 10,
"retry_count": 10,
"watchdog_interval": 10

Now:

"gx": {
  "server": {
    "address": "10.218.10.91:3969",
    "dest_host": "10.218.10.91",
    "dest_realm": "inew-cs.com",
    "host": "feg.freedomfi.com",
    "local_address": ":3907",
    "protocol": "sctp",
    "realm": "freedomfi.com"
  },
  "servers": null,
  "virtual_apn_rules": []
},
"gy": {
  "init_method": 2,
  "server": {
    "address": "10.218.10.91:3869",
    "dest_host": "10.218.10.91",
    "dest_realm": "inew-cs.com",
    "host": "feg.freedomfi.com",
    "local_address": ":3906",
    "protocol": "sctp",
    "realm": "freedomfi.com"
},

@crbertoldo
Copy link
Author

@uri200 update: the ABORTs are not solved. I'm not sure about the SHUTDOWNs because the high frequency of ABORTs is not allowing them to happen now.

@crbertoldo
Copy link
Author

crbertoldo commented Mar 10, 2022

An almost relationship between:

Mar 10 20:57:16 ip-10-10-0-137 53c6a4c9e856[1337]: E0310 20:57:16.918775       1 credits.go:201] Timed out receiving responses from OCS
Mar 10 21:27:21 ip-10-10-0-137 53c6a4c9e856[1331]: E0310 21:27:21.229695       1 credits.go:201] Timed out receiving responses from OCS
Mar 10 22:12:37 ip-10-10-0-137 53c6a4c9e856[1343]: E0310 22:12:37.210470       1 credits.go:201] Timed out receiving responses from OCS
Mar 10 22:12:38 ip-10-10-0-137 53c6a4c9e856[1343]: E0310 22:12:38.153714       1 credits.go:201] Timed out receiving responses from OCS
Mar 10 22:42:45 ip-10-10-0-137 9a3fab4f2a0e[1347]: E0310 22:42:45.195263       1 credits.go:201] Timed out receiving responses from OCS

And:

2022-03-10 20:57:16.821319: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 21:12:22.556856: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 21:27:19.147239: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 21:42:25.729426: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 21:57:32.976971: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 22:12:41.270736: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 22:27:40.597123: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 22:42:43.197597: FeG VM rebooting task upon an SCTP ABORT!
2022-03-10 22:57:49.840778: FeG VM rebooting task upon an SCTP ABORT!

Extra:

Mar 10 22:27:54 ip-10-10-0-137 systemd-modules-load[384]: Inserted module 'nf_conntrack_proto_sctp'
Mar 10 22:27:59 ip-10-10-0-137 16ad333eaa18[1347]: E0310 22:27:59.067607       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:27:59 ip-10-10-0-137 16ad333eaa18[1347]: E0310 22:27:59.067638       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:27:59 ip-10-10-0-137 kernel: [   20.180101] sctp: Hash tables configured (established 256 bind 256)
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 f8f284827e11[1347]: I0310 22:27:59.125459       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:27:59 ip-10-10-0-137 0f0b8d7339c3[1347]: 2022/03/10 22:27:59 Using default value: sctp for flag: network
Mar 10 22:27:59 ip-10-10-0-137 0f0b8d7339c3[1347]: E0310 22:27:59.233479       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.
Mar 10 22:32:18 ip-10-10-0-137 eff22fdee358[1347]: 2022/03/10 22:32:18 Using default value: sctp for flag: network
Mar 10 22:32:18 ip-10-10-0-137 eff22fdee358[1347]: E0310 22:32:18.519157       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: 2022/03/10 22:32:19 Using default value: sctp for flag: network
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: 2022/03/10 22:32:19 Using default value: sctp for flag: network
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: 2022/03/10 22:32:19 Using default value: sctp for flag: network
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: 2022/03/10 22:32:19 Using default value: sctp for flag: network
Mar 10 22:32:19 ip-10-10-0-137 7a4ab4581bd5[1347]: I0310 22:32:19.097711       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:32:19 ip-10-10-0-137 60d45f6719bd[1347]: E0310 22:32:19.166250       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:32:19 ip-10-10-0-137 60d45f6719bd[1347]: E0310 22:32:19.166623       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:42:29 ip-10-10-0-137 402168a9ba1f[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 402168a9ba1f[1347]: E0310 22:42:29.115059       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: 2022/03/10 22:42:29 Using default value: sctp for flag: network
Mar 10 22:42:29 ip-10-10-0-137 9a3fab4f2a0e[1347]: I0310 22:42:29.483542       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:42:29 ip-10-10-0-137 0199d594ffce[1347]: E0310 22:42:29.635278       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:42:29 ip-10-10-0-137 0199d594ffce[1347]: E0310 22:42:29.635299       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:43:03 ip-10-10-0-137 systemd-modules-load[408]: Inserted module 'nf_conntrack_proto_sctp'
Mar 10 22:43:07 ip-10-10-0-137 402168a9ba1f[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 kernel: [   18.501797] sctp: Hash tables configured (established 256 bind 256)
Mar 10 22:43:07 ip-10-10-0-137 402168a9ba1f[1367]: E0310 22:43:07.170181       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: 2022/03/10 22:43:07 Using default value: sctp for flag: network
Mar 10 22:43:07 ip-10-10-0-137 9a3fab4f2a0e[1367]: I0310 22:43:07.803749       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:43:07 ip-10-10-0-137 0199d594ffce[1367]: E0310 22:43:07.906226       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:43:07 ip-10-10-0-137 0199d594ffce[1367]: E0310 22:43:07.906257       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:58:03 ip-10-10-0-137 systemd-modules-load[371]: Inserted module 'nf_conntrack_proto_sctp'
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 9a3fab4f2a0e[1356]: I0310 22:58:08.198689       1 main.go:162] Using distinct Gy: {Addr:10.218.10.91:3869 Protocol:sctp LocalAddr::3906} & Gx: {Addr:10.218.10.91:3969 Protocol:sctp LocalAddr::3907} connection
Mar 10 22:58:08 ip-10-10-0-137 0199d594ffce[1356]: E0310 22:58:08.216933       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:58:08 ip-10-10-0-137 0199d594ffce[1356]: E0310 22:58:08.216960       1 sctp_connection.go:256] Couldn't parse the whole string of IPs 
Mar 10 22:58:08 ip-10-10-0-137 kernel: [   19.342397] sctp: Hash tables configured (established 256 bind 256)
Mar 10 22:58:08 ip-10-10-0-137 402168a9ba1f[1356]: 2022/03/10 22:58:08 Using default value: sctp for flag: network
Mar 10 22:58:08 ip-10-10-0-137 402168a9ba1f[1356]: E0310 22:58:08.290802       1 connection.go:56] Failed to establish new sctp diameter connection from '' to ''; error: invalid argument, will retry later.

@emakeev
Copy link
Contributor

emakeev commented Mar 10, 2022

can you share FeG pcaps? Maybe on a private channel.

@crbertoldo
Copy link
Author

can you share FeG pcaps? Maybe on a private channel.

Sent @emakeev. Thanks!

@uri200
Copy link
Contributor

uri200 commented Mar 11, 2022

@crbertoldo I don't think the unit method has any impact here .

I would suggest to use

docker-compose -f logs session_proxy 

To filter out the logs. Otherwise you will be collecting other services logs that can be mistaken.

BTW, solution to this issue may be found at
#12192, #12194, #12220, #12234 & #12244.

@crbertoldo
Copy link
Author

Thanks @uri200

@emakeev
Copy link
Contributor

emakeev commented Mar 24, 2022

@crbertoldo should we close the issue now?

@crbertoldo
Copy link
Author

@crbertoldo should we close the issue now?

Yes @emakeev . Thanks again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants