s1ap_remove_ue #9902

Open · sentry-io bot opened this issue Oct 26, 2021 · 8 comments

sentry-io bot commented Oct 26, 2021

Sentry Issue: LAB-AGWS-NATIVE-2W

SIGSEGV /SEGV_MAPERR: Fatal Error: SIGSEGV /SEGV_MAPERR
  File "s1ap_mme.c", line 561, in s1ap_remove_ue
  File "s1ap_mme_handlers.c", line 3871, in s1ap_mme_release_ue_context
  File "s1ap_mme_handlers.c", line 1365, in s1ap_mme_generate_ue_context_release_command
  File "s1ap_mme_handlers.c", line 1491, in s1ap_handle_ue_context_release_command
  File "s1ap_mme.c", line 216, in handle_message
...
(4 additional frame(s) were not displayed)
ulaskozat (Contributor) commented

Looks like this assertion:

DevAssert(enb_ref->nb_ue_associated > 0);
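For context, here is a standalone sketch (stand-in types, not the real Magma structs) of how that line can bring the process down: a duplicate removal of the same UE leaves nb_ue_associated at zero, so the assert aborts; and if enb_ref itself were NULL, the same read would instead produce the SIGSEGV seen in the Sentry report above.

#include <assert.h>

/* Stand-in for Magma's DevAssert, which also aborts on failure. */
#define DevAssert(cond) assert(cond)

/* Stand-in type; the real enb_description_t has many more fields. */
typedef struct {
  unsigned nb_ue_associated; /* UEs currently associated with this eNB */
} enb_description_t;

/* Simplified analogue of the removal path at s1ap_mme.c:561. */
static void remove_ue_sketch(enb_description_t* enb_ref) {
  DevAssert(enb_ref->nb_ue_associated > 0); /* aborts on a double release */
  enb_ref->nb_ue_associated--;
}

int main(void) {
  enb_description_t enb = {.nb_ue_associated = 1};
  remove_ue_sketch(&enb); /* first release: 1 -> 0, fine */
  remove_ue_sketch(&enb); /* duplicate release: DevAssert fires here */
  return 0;
}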

stale bot added the wontfix label Mar 9, 2022
stale bot closed this as completed Mar 16, 2022
ssanadhya reopened this Mar 16, 2022
stale bot removed the wontfix label Mar 16, 2022
magma deleted a comment from stale bot Mar 30, 2022
magma deleted a comment from stale bot Mar 30, 2022
crbertoldo commented

Hi @ulaskozat @ssanadhya @rsarwad (ref: #12677 (comment)), we may have a similar issue in prod (v1.6); see the logs below. The MME appears to have been crashing whenever a specific subscriber was being put into idle state. Please let me know if you have any clues.

Jun 21 07:26:01 4G-cluster-node-4 mme[789983]: LeakSanitizer:DEADLYSIGNAL
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]: ==789983==ERROR: LeakSanitizer: SEGV on unknown address 0x000000000204 (pc 0x556099b8399c bp 0x7f80781da710 sp 0x7f80781da6e0 T23)
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]: ==789983==The signal is caused by a READ memory access.
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]: ==789983==Hint: address points to the zero page.
Jun 21 07:26:01 4G-cluster-node-4 sctpd[11738]: I0621 07:26:01.653967 11743 sctpd_downlink_impl.cpp:76] SctpdDownlinkImpl::SendDl starting
Jun 21 07:26:01 4G-cluster-node-4 sctpd[11738]: I0621 07:26:01.675810 11942 sctp_connection.cpp:167] HandleClientSock sd = 49
Jun 21 07:26:01 4G-cluster-node-4 sctpd[11738]: I0621 07:26:01.675860 11942 sctp_connection.cpp:220] [sd:49] msg of len 106 on 62:1
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #0 0x556099b8399c in s1ap_remove_ue /home/ubuntu/jenkins-agent/workspace/build/build_agw_packages_ubuntu/lte/gateway/c/core/oai/tasks/s1ap/s1ap_mme.c:561
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #1 0x556099bd70d4 in s1ap_mme_release_ue_context /home/ubuntu/jenkins-agent/workspace/build/build_agw_packages_ubuntu/lte/gateway/c/core/oai/tasks/s1ap/s1ap_mme_handlers.c:3879
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #2 0x556099bd73c0 in s1ap_mme_generate_ue_context_release_command /home/ubuntu/jenkins-agent/workspace/build/build_agw_packages_ubuntu/lte/gateway/c/core/oai/tasks/s1ap/s1ap_mme_handlers.c:1365
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #3 0x556099bd747b in s1ap_handle_ue_context_release_command /home/ubuntu/jenkins-agent/workspace/build/build_agw_packages_ubuntu/lte/gateway/c/core/oai/tasks/s1ap/s1ap_mme_handlers.c:1491
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #4 0x556099b831d1 in handle_message /home/ubuntu/jenkins-agent/workspace/build/build_agw_packages_ubuntu/lte/gateway/c/core/oai/tasks/s1ap/s1ap_mme.c:216
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #5 0x7f80887db5c6 in zloop_start (/lib/x86_64-linux-gnu/libczmq.so.4+0x295c6)
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #6 0x556099b82ab8 in s1ap_mme_thread /home/ubuntu/jenkins-agent/workspace/build/build_agw_packages_ubuntu/lte/gateway/c/core/oai/tasks/s1ap/s1ap_mme.c:377
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #7 0x7f80893ce608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608)
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]:     #8 0x7f80881ab162 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f162)
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]: LeakSanitizer can not provide additional info.
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]: SUMMARY: LeakSanitizer: SEGV /home/ubuntu/jenkins-agent/workspace/build/build_agw_packages_ubuntu/lte/gateway/c/core/oai/tasks/s1ap/s1ap_mme.c:561 in s1ap_remove_ue
Jun 21 07:26:01 4G-cluster-node-4 mme[789983]: ==789983==ABORTING
Jun 21 07:26:01 4G-cluster-node-4 systemd[1]: magma@mme.service: Main process exited, code=exited, status=23/n/a

ssanadhya (Collaborator) commented

@crbertoldo, could you please share the steps to reproduce this issue? The backtrace looks different from the one in #12677.

crbertoldo commented

@ssanadhya unfortunately, we don't know how to reproduce it. We had to flush Redis and restart the AGW to recover the services while it was happening. As for #12677, it looks somewhat similar, but I don't think it leads to any conclusion, so please disregard it.

alledpaiva commented

Following are the MME and kernel logs, as requested. The earliest MME log we have starts at 07:26:10; it is the first one listed below.

If you have any doubts, please let me know.

Files:

kernel-log-06-21.txt
MME.4G-cluster-node-4.root.log.INFO.20220621-072610.791052.txt
MME.4G-cluster-node-4.root.log.INFO.20220621-072654.791847.txt
MME.4G-cluster-node-4.root.log.INFO.20220621-072855.793853.txt

rsarwad (Contributor) commented Jul 6, 2022

Thanks @crbertoldo and @alledpaiva for sharing the logs.
Please find my analysis below:

  1. Some UEs attach to the network and move to idle state. For some reason, MME restarts, fetches the state from the DB, and re-establishes all contexts.
  2. MME then receives a TAU Request in an s1ap Initial UE Message, rejects it, and sends a TAU Reject and a UE Context Release Command to s1ap. While handling the UE Context Release Command, s1ap sends an s1ap UE Context Release Command to the eNB, sends a UE Context Release Complete message to mme_app, and frees the UE context. Once mme_app receives the UE Context Release Complete, it moves the UE's ECM state to idle.
     But before mme_app has moved the UE to idle, it receives another TAU Request in a second s1ap Initial UE Message. While handling it, mme_app finds a UE context holding a valid s1ap_id_key, so before processing the new Initial UE Message it first clears the previous S1 signaling connection by sending a UE Context Release Command to s1ap, and s1ap clears the UE context. But the context it clears is the one allocated for the second Initial UE Message, so at this point nb_ue_associated becomes zero.
  3. While handling the second TAU Request, mme_app again rejects it and sends a TAU Reject and a UE Context Release Command to s1ap. When s1ap tries to remove the UE, it hits DevAssert(enb_ref->nb_ue_associated > 0), which crashes the MME (see the sketch after this list).
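For illustration, a minimal sketch (stand-in types and names; not necessarily what PRs #11795 and #12275 actually implement) of the kind of defensive check that would avoid the double release: verify that the S1AP UE context still exists before removing it and decrementing the counter again.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
  unsigned nb_ue_associated; /* UEs currently associated with this eNB */
} enb_description_t;

typedef struct {
  enb_description_t* enb_ref;
  bool released; /* stand-in for "context already freed from the state" */
} ue_description_t;

/* Sketch of handling a UE Context Release Command defensively: if the
 * context was already freed by the implicit release triggered while
 * handling the second Initial UE Message, skip the removal instead of
 * decrementing nb_ue_associated below zero. */
static bool handle_release_cmd_sketch(ue_description_t* ue_ref) {
  if (ue_ref == NULL || ue_ref->released) {
    return false; /* nothing left to remove; do not touch the counter */
  }
  if (ue_ref->enb_ref->nb_ue_associated > 0) {
    ue_ref->enb_ref->nb_ue_associated--;
  }
  ue_ref->released = true; /* free the UE context */
  return true;
}

int main(void) {
  enb_description_t enb = {.nb_ue_associated = 1};
  ue_description_t ue = {.enb_ref = &enb, .released = false};
  handle_release_cmd_sketch(&ue); /* first command: releases the context */
  handle_release_cmd_sketch(&ue); /* second command: safely ignored */
  return 0;
}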

The attached call flow may help with visualizing this:
9902_mme_callflow
There was a similar issue: #12114.
Could you pull PRs #11795 and #12275 into v1.6? I am not able to simulate this scenario locally, so could you please pull the above PRs and verify? Alternatively, could you share details of a system where I can validate?

rsarwad (Contributor) commented Jul 19, 2022

Hi @crbertoldo and @alledpaiva, were you able to pull the above-mentioned PRs and test?

crbertoldo commented

Hi @rsarwad, sorry for the late reply. We haven't had a chance to do that yet, but we will as soon as possible and let you know. Thanks a lot!
