Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Endless AGW crashloop, mme exiting on start with Sanitizer CHECK failed #15279

Open
rechia-venko opened this issue Aug 2, 2023 · 22 comments
Open
Labels
type: bug Something isn't working

Comments

@rechia-venko
Copy link

rechia-venko commented Aug 2, 2023

Your Environment

  • Version: 1.8
  • Affected Component: Access Gateway
  • Affected Subcomponent: AGW MME, sessiond, mobilityd, pipelined
  • Deployment Environment: e.g. bare-metal AGW, VM orc8r deployed with magma-galaxy (https://github.com/ShubhamTatvamasi/magma-galaxy)

Describe the Issue

The MME service no longer starts, it stays in an endless crashloop together with mobilityd, pipelined, sessiond.

We have observed that the mme service keeps restarting with this error:

Aug 02 14:17:41 magma mme[4004418]: ==4004418==Sanitizer CHECK failed: ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53 ((!lsan_init_is_running)) != (0) (0, 0)

Aug 02 14:17:30 magma systemd[1]: magma@mme.service: Main process exited, code=exited, status=23/n/a
Aug 02 14:17:31 magma systemd[1]: magma@mme.service: Failed with result 'exit-code'.
Aug 02 14:17:36 magma systemd[1]: magma@mme.service: Scheduled restart job, restart counter is at 1451.
Aug 02 14:17:36 magma systemd[1]: Stopped Magma OAI MME service.
Aug 02 14:17:40 magma systemd[1]: Starting Magma OAI MME service...
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: enable_nat
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: congestion_control_enabled
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: enable5g_features
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_default_slice_service_type
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_default_slice_differentiator
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_name
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_region_id
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_set_id
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_pointer
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: default_dnn
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: auth_retry_max_count
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: auth_retry_interval
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for spgw, key not found: ovs_internal_sampling_fwd_tbl
Aug 02 14:17:40 magma mme[4004398]: WARNING:root:Service (spgw) missing in mconfig
Aug 02 14:17:41 magma mme[4004416]: STATELESS	mme -> use_stateless
Aug 02 14:17:41 magma mme[4004416]: STATELESS	pipelined -> clean_restart
Aug 02 14:17:41 magma mme[4004416]: STATELESS	pipelined -> redis_enabled
Aug 02 14:17:41 magma mme[4004416]: STATELESS	sessiond -> support_stateless
Aug 02 14:17:41 magma mme[4004416]: Check returning return_codes.STATELESS
Aug 02 14:17:41 magma mme[4004416]: AGW is stateless, no need to restart Sctpd
Aug 02 14:17:41 magma systemd[1]: Started Magma OAI MME service.
Aug 02 14:17:41 magma mme[4004418]: ==4004418==Sanitizer CHECK failed: ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53 ((!lsan_init_is_running)) != (0) (0, 0)
Aug 02 14:17:41 magma systemd[1]: magma@mme.service: Main process exited, code=exited, status=23/n/a
Aug 02 14:17:42 magma systemd[1]: magma@mme.service: Failed with result 'exit-code'.

To Reproduce

We don't know why this is started happening. We were having issues with user data plane (no traffic was being forwarded from UE to the Internet). At some point during troubleshooting we decided to:

  1. systemctl stop magma@*
  2. systemctl start magma@magmad

Then the mme service never came back up.

Expected behavior

the mme service should come up

Screenshots

Here's the output of health_cli.py:

image

Additional context

Here's a syslog that shows part of the problem.

Aug 1 11:04 - at around this time we tried to systemctl stop magma@*. A bunch of errors show up in the syslog when the services are going down.
Aug 1 11:07:15 - we attempt to restart the service with systemctl start magma@magmad. Many errors show up, mainly related to redis connections.
Aug 1 11:07:24 - we see the first entry of this error, that now always occurs constantly:

magma mme[4019572]: ==4019572==Sanitizer CHECK failed: ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53 ((!lsan_init_is_running)) != (0) (0, 0)

here's the full syslog of when the error started happening. syslog.1.log.gz

The test scenario that we were trying to bring up was the S1 handover using srsran and zmq radio https://docs.srsran.com/projects/4g/en/latest/app_notes/source/handover/source/index.html#s1-handover.

temporary workaround

try at your own risk

Downgrade liblsan0 anb gcc-10-base of your agw from 10.5.0 to a lower version:

wget https://ftp.debian.org/debian/pool/main/g/gcc-10/liblsan0_10.2.1-6_amd64.deb
wget https://ftp.debian.org/debian/pool/main/g/gcc-10/gcc-10-base_10.2.1-6_amd64.deb
sudo dpkg -i gcc-10-base_10.2.1-6_amd64.deb liblsan0_10.2.1-6_amd64.deb
@rechia-venko rechia-venko added the type: bug Something isn't working label Aug 2, 2023
@javieraubert
Copy link

Hello @rechia-venko I'm having the same issue, did you find what causes it?

@gabrielrbeiro
Copy link

gabrielrbeiro commented Aug 3, 2023

We had this error with our AGW in 1.6.1, but even with a brand new AGW in 1.8.0 (orc8r also in 1.8.0) the issue continue.

@vinothinfinity
Copy link

vinothinfinity commented Aug 3, 2023

Related issue. https://www.openwall.com/lists/musl/2015/07/02/13
As suggested in link, upgradinggcc/glibc might help !!

One more ref - rust-lang/rust#111073

@harsharao87
Copy link

harsharao87 commented Aug 3, 2023

We are facing this issue in the recent docker image builds of agw locally

There is a recent update in ubuntu sources for gcc-10-base package

https://launchpad.net/ubuntu/+source/gcc-10/+changelog

due to which the liblsan0 got upgraded from 10.3.0 to 10.5.0.

@gabrielrbeiro
Copy link

It seems that gcc-10-base isn't available anymore in version 10.3.0 did you manage to downgrade it ?

@harsharao87
Copy link

yes 10.3.0 is not available now, but a 10.2.1 version of deb images for gcc-10-base and liblsan0 is available - http://ftp.debian.org/debian/pool/main/g/gcc-10/.
This will help you unblock for now, please try this out in development environments only, as the impact is unknown yet.

But, we are unable to root cause why mme service is exiting with updated 10.5.0 liblsan0 package.

@rechia-venko
Copy link
Author

Great tip, thanks a lot @harsharao87! We managed to bring our agw back up with the following:

wget https://ftp.debian.org/debian/pool/main/g/gcc-10/liblsan0_10.2.1-6_amd64.deb
wget https://ftp.debian.org/debian/pool/main/g/gcc-10/gcc-10-base_10.2.1-6_amd64.deb
sudo dpkg -i gcc-10-base_10.2.1-6_amd64.deb liblsan0_10.2.1-6_amd64.deb

We'll lilkely have to do this in production environments. Better to have something working than nothing. 😂

@vinothinfinity
Copy link

vinothinfinity commented Aug 3, 2023

We saw the issue in v1.6.1 (service based arch). Here AGW went for a reboot and post reboot unattended-upgrades.service upgraded these packages. (You might want to check /var/log/dpkg.log You can see upgrade for pkg happening)

I think WA wise and RCA wise we are good, but what is the long term fix ? :)

@edaspb
Copy link
Contributor

edaspb commented Aug 4, 2023

As far as I see automatic packet update is active by default on AGW. Which can be potentially dangerous.
image

@jordanvrtanoski
Copy link
Contributor

Docker containers for x86_64 are also impacted. Workaround is to add the following lines in magma/lte/gateway/docker/services/c/Dockerfile:

RUN wget https://ftp.debian.org/debian/pool/main/g/gcc-10/liblsan0_10.2.1-6_amd64.deb && \
    wget https://ftp.debian.org/debian/pool/main/g/gcc-10/gcc-10-base_10.2.1-6_amd64.deb && \
    dpkg -i gcc-10-base_10.2.1-6_amd64.deb liblsan0_10.2.1-6_amd64.deb

before the line

RUN ldconfig 2> /dev/null

@rechia-venko
Copy link
Author

The workaround works, but afterwards it prevents us from further upgrading the system due to broken dependencies. For instance, that OVS upgrade script doesn't work anymore for us.

@Darlanewe
Copy link

Darlanewe commented Aug 24, 2023

forgive-me for speaking maybe out of topic, but anyone else after these workaround experienced problems with UE connections with the agw in non nat mode
because after the workaround all of the services up and running aparently normal in non nat mode the ue get the ip from the router as expected but then theres no communication of any kind, no ping, nothing and the only change was the downgrade of the packages to normalize mme services, if any hel is offered will be much appreciated, and again sorry if this is off topic

@edaspb
Copy link
Contributor

edaspb commented Aug 24, 2023

@Darlanewe, we don't see this behavior. 4 AGW in bridge mode. Workaround from rechia-venko

@rechia-venko
Copy link
Author

I just found out that a fresh agw docker installation works. Interestingly, the magmad container has the older version of liblsan 10.3.0 running:

image

I don't know where the docker takes its libraries from, but it is different from the host in this case. Can we be sure that the containers will always stay with the old version?

@Darlanewe
Copy link

Darlanewe commented Sep 14, 2023

hi there,
we experienced the same mme malfunction today in one of our agw's
the error message is different but it has to do with sanitizer also
follow the error log
image

the gcc-10-base and liblsan0 are still applied accordingly to the rechia venko fix

the most strange thing about it is that the services were not restarted (at least not on purpose)
it was working perfectly and then the s1 envoy message came up and the mme entered the restart loop

@rechia-venko
Copy link
Author

rechia-venko commented Sep 14, 2023 via email

@Darlanewe
Copy link

i can't say i have all the variables in this cenario but,
the files shown in the error with /home/vagrant (this directory does not exist on my agw) i have no idea how that happened
then i decided to renew the agw/orchestrator registration
and as soon as i deleted the /var/opt/magma/configs/gateway.mconfig
the mme came back to life

@ferrieux
Copy link

ferrieux commented Oct 1, 2023

May I ask what is the status of this ?
Magma is supposed to be "production grade", yet has a fatal crash loop with an unexplained connection with the latest gcc. Then what ?
(a) gcc has a bug, Magma should be proud to have found it, be sure to advertise it here
(b) Magma has a bug, earlier gcc/sanitizers overlooked it, but now thankfully we can fix it. Issue number ?
(c) Nobody knows.
Which of the above is true ?

@ferrieux
Copy link

ferrieux commented Oct 2, 2023

Update:

TL;DR:

  • (1) it is a gcc/sanitizer bug
  • (2) it is here to stay in gcc10 which is now end-of-support
  • (3) it is fixed in gcc11+ through reimplementation of dlsym allocator.
  • (4) since Ubuntu 20.04 is stuck to gcc10, a good workaround is to neutralize liblsan0 with a fake lib => NO NEED TO DOWNGRADE GCC ANY MORE

Detail:

I launched the MME with LSAN_OPTION=abort_on_error=1, which produced a core dump.
There I got a stack trace, shown below

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f1080c9f859 in __GI_abort () at abort.c:79
#2  0x00007f108209dc22 in __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:155
#3  0x00007f10820a92fc in __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
#4  0x00007f10820a9381 in __sanitizer::CheckFailed (file=file@entry=0x7f10820c1a90 "../../../../src/libsanitizer/lsan/lsan_interceptors.cpp", line=line@entry=53, 
    cond=cond@entry=0x7f10820c1040 "((!lsan_init_is_running)) != (0)", v1=v1@entry=0, v2=v2@entry=0) at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:82
#5  0x00007f108208da55 in __interceptor_malloc (size=size@entry=79) at ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53
#6  0x00007f108299feb9 in __GI__dl_exception_create_format (exception=exception@entry=0x7fffc7eb2780, objname=0x7f10829b6ea0 "/lib/x86_64-linux-gnu/liblsan.so.0", fmt=fmt@entry=0x7f10829ac6f5 "undefined symbol: %s%s%s")
    at dl-exception.c:146
#7  0x00007f10829933bd in _dl_lookup_symbol_x (undef_name=0x7f10820c5abb "_thread_db_sizeof_pthread", undef_map=0x7f1082970000, ref=0x7fffc7eb2808, symbol_scope=0x7f1082970368, version=0x0, type_class=0, flags=3, 
    skip_map=0x0) at dl-lookup.c:878
#8  0x00007f1080ddd44d in do_sym (flags=<optimized out>, vers=0x0, who=0x7f108209b088 <__sanitizer::ThreadDescriptorSize()+40>, name=0x7f10820c5abb "_thread_db_sizeof_pthread", handle=<optimized out>) at dl-sym.c:117
#9  _dl_sym (handle=<optimized out>, name=0x7f10820c5abb "_thread_db_sizeof_pthread", who=0x7f108209b088 <__sanitizer::ThreadDescriptorSize()+40>) at dl-sym.c:274
#10 0x00007f1080c764a8 in dlsym_doit (a=a@entry=0x7fffc7eb2a50) at dlsym.c:50
#11 0x00007f1080ddd928 in __GI__dl_catch_exception (exception=exception@entry=0x7fffc7eb29e0, operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dl-error-skeleton.c:208
#12 0x00007f1080ddd9f3 in __GI__dl_catch_error (objname=objname@entry=0x7f10820dc210 <__interceptor_calloc::calloc_memory_for_dlsym+16>, 
    errstring=errstring@entry=0x7f10820dc218 <__interceptor_calloc::calloc_memory_for_dlsym+24>, mallocedp=mallocedp@entry=0x7f10820dc208 <__interceptor_calloc::calloc_memory_for_dlsym+8>, 
    operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dl-error-skeleton.c:227
#13 0x00007f1080c76b59 in _dlerror_run (operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dlerror.c:170
#14 0x00007f1080c76525 in __dlsym (handle=handle@entry=0x0, name=name@entry=0x7f10820c5abb "_thread_db_sizeof_pthread") at dlsym.c:70
#15 0x00007f108209b088 in __sanitizer::ThreadDescriptorSize () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:320
#16 0x00007f108209bbbe in __sanitizer::ThreadDescriptorSize () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:316
#17 __sanitizer::GetTls (size=0x7fffc7eb2b88, addr=0x7fffc7eb2bb0) at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:447
#18 __sanitizer::GetThreadStackAndTls (main=main@entry=true, stk_addr=stk_addr@entry=0x7fffc7eb2b90, stk_size=stk_size@entry=0x7fffc7eb2b80, tls_addr=tls_addr@entry=0x7fffc7eb2bb0, tls_size=tls_size@entry=0x7fffc7eb2b88)
    at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:520
#19 0x00007f108208f7db in __lsan::ThreadStart (tid=tid@entry=0, os_id=1996211, thread_type=thread_type@entry=__sanitizer::ThreadType::Regular) at ../../../../src/libsanitizer/lsan/lsan_thread.cpp:83
#20 0x00007f1082089072 in __lsan_init () at ../../../../src/libsanitizer/lsan/lsan.cpp:119
#21 0x00007f1082998cf6 in _dl_init (main_map=0x7f10829b6190, argc=5, argv=0x7fffc7eb2c58, env=0x7fffc7eb2c88) at dl-init.c:104
#22 0x00007f108298813a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#23 0x0000000000000005 in ?? ()
#24 0x00007fffc7eb47da in ?? ()
#25 0x00007fffc7eb47ed in ?? ()
#26 0x00007fffc7eb47f0 in ?? ()
#27 0x00007fffc7eb480c in ?? ()
#28 0x00007fffc7eb480f in ?? ()
#29 0x0000000000000000 in ?? ()

There we see that __lsan_init() calls dlsym() which calls (the intercepted form of) malloc() which barfs since it doesn't expect to be called during initialization.

Funnily, the "interceptor" code contains specific steps for calloc() but none for malloc(). This is done initially to solve a chicken-and-egg problem for the instrumentation of calloc itself by dlsym (which calls calloc ...). But this also allows to call dlsym repeatedly within the init. This has worked for years I guess, but is fragile, as it makes the assumption that dlsym will always call calloc and not malloc... Too bad nobody cared to notify the dlsym() developers about this :)

As we can see in the above stack trace, dlsym in some situations ends up calling malloc, hence our crash.

Later versions of gcc fix that by using a dedicated allocator for dlsym. End of the story ?
Nope, for people like us who are stuck with a version "beyond backports", as 10.x is end-of-support !!!

Okay, no problem, we have radical options: just get rid of the bugger.
Write a one-line replacement for liblsan0.so that just does:

void __lsan_init(void){}

And replace the (dysfunctional) system-installed libsan0.

MME happy and running without any downgrade :)

@Darlanewe
Copy link

hi @ferrieux
could you be more specific on how to do the whole replacement process of this lib?

@ferrieux
Copy link

ferrieux commented Oct 3, 2023

Just compile with gcc -shared -fPIC, then either overwrite the existing liblsan0.so.0 wherever your distro puts it, or put the new one earlier in the LD_LIBRARY_PATH.

@ferrieux
Copy link

ferrieux commented Oct 3, 2023

Note, I have reported the gcc bug to Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/gcc-defaults/+bug/2038386

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
None yet
Development

No branches or pull requests

9 participants