"omiagent" segfault libnss_dns on Linux (scx provider) #96

Closed
srice01 opened this issue Mar 5, 2018 · 13 comments
@srice01

srice01 commented Mar 5, 2018

Copied over from microsoft/omi#491 (please see this for full communication on this issue).

On our RM provisioned VMs in Azure we noticed that the root partition is filling up with large numbers of "core.###" files in the /var/opt/omi/run directory.

Further investigation shows segmentation faults (in /var/log/messages) as follows:

Jan 30 10:17:17 ML001 kernel: omiagent[2054]: segfault at 7f6b7e181e00 ip 00007f6b7e181e00 sp 00007f6b78846d50 error 14
Jan 30 10:32:12 ML001 kernel: omiagent[3298]: segfault at 7fa8356f4e00 ip 00007fa8356f4e00 sp 00007fa82fdb9d50 error 14
Jan 30 11:02:19 ML001 kernel: omiagent[5873]: segfault at 7fbd84d3de00 ip 00007fbd84d3de00 sp 00007fbd7ede0d50 error 14 in libnss_dns-2.17.so[7fbd84ec5000+5000]
Jan 30 11:17:19 ML001 kernel: omiagent[13175]: segfault at 7fbac740ae00 ip 00007fbac740ae00 sp 00007fbac54bcd50 error 14 in libnss_dns-2.17.so[7fbac7592000+5000]
Jan 30 11:32:21 ML001 kernel: omiagent[20049]: segfault at 7f79230dfe00 ip 00007f79230dfe00 sp 00007f791d7a4d50 error 14
Jan 30 12:02:14 ML001 kernel: omiagent[46782]: segfault at 7f9fa8939e00 ip 00007f9fa8939e00 sp 00007f9fa296dd50 error 14 in libnss_dns-2.17.so[7f9fa8ac1000+5000]
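
To gauge how often this happens on a given node, the kernel lines above can be counted per day. This is only a rough sketch: it assumes the syslog format shown here, and `count_omiagent_segfaults` is a made-up helper name, not something shipped with OMI.

```shell
#!/usr/bin/env bash
# count_omiagent_segfaults: hypothetical helper, not part of OMI or SCX.
# Reads syslog lines on stdin and prints "<month> <day> <count>" for each
# day that has at least one omiagent segfault line.
count_omiagent_segfaults() {
    grep -E 'kernel: omiagent\[[0-9]+\]: segfault' \
        | awk '{ n[$1 " " $2]++ } END { for (d in n) print d, n[d] }' \
        | sort
}

# Example call (the log path is an assumption, it varies by distribution):
#   count_omiagent_segfaults < /var/log/messages
```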

Environment information:

  • OMS Agent: 1.4.4-210
  • OMI 1.4.2
  • SCX 1.6.3-527

Operating System: CentOS Release 7.4.1708 (fully patched, that is, "yum update" shows no updates pending).

So far the workaround has been to write a cron job (!) to periodically wipe the core files but obviously this is not an ideal situation.
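
For anyone needing the same stopgap, the cleanup can be sketched roughly as below. This is not the exact job we run: `purge_omi_cores` is a made-up name, the 60-minute default age is an arbitrary assumption, and it relies on GNU find (`-maxdepth`, `-mmin`, `-delete`).

```shell
#!/usr/bin/env bash
# purge_omi_cores: hypothetical helper illustrating the cron workaround.
# $1 = directory holding the core files (default taken from this report)
# $2 = minimum age in minutes before a core file is deleted (assumed default)
purge_omi_cores() {
    local dir="${1:-/var/opt/omi/run}"
    local min_age_minutes="${2:-60}"
    # Only regular files directly inside $dir whose names start with
    # "core." followed by a digit; anything else is left alone.
    find "$dir" -maxdepth 1 -type f -name 'core.[0-9]*' \
        -mmin "+${min_age_minutes}" -delete
}
```

Keeping a minimum age means the most recent dump survives in case someone wants to load it into gdb.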

Further information from "JumpingYang001":

The following debug info shows that omiagent loaded the scx provider:

(gdb) info sharedlibrary

From                To                  Syms Read   Shared Object Library
0x00007fa598b2f900  0x00007fa598b3ace1  Yes (*)     /lib64/libpthread.so.0
0x00007fa598926e60  0x00007fa59892795e  Yes (*)     /lib64/libdl.so.2
0x00007fa598719670  0x00007fa598720d0c  Yes (*)     /lib64/libpam.so.0
0x00007fa5984bfbb0  0x00007fa5984fb58d  Yes (*)     /opt/omi/lib/libssl.so.1.0.0
0x00007fa5980b0f00  0x00007fa5981e8bd7  Yes (*)     /opt/omi/lib/libcrypto.so.1.0.0
0x00007fa597ca0480  0x00007fa597de6bcf  Yes (*)     /lib64/libc.so.6
0x00007fa598d46b10  0x00007fa598d61440  Yes (*)     /lib64/ld-linux-x86-64.so.2
0x00007fa597a5c100  0x00007fa597a62402  Yes (*)     /lib64/libaudit.so.1
0x00007fa597818650  0x00007fa59784aa1a  Yes (*)     /lib64/libgssapi_krb5.so.2
0x00007fa597549a10  0x00007fa5975b0e8a  Yes (*)     /lib64/libkrb5.so.3
0x00007fa597321570  0x00007fa597322143  Yes (*)     /lib64/libcom_err.so.2
0x00007fa5970f18c0  0x00007fa59710fc0f  Yes (*)     /lib64/libk5crypto.so.3
0x00007fa596ed9170  0x00007fa596ee56f8  Yes (*)     /lib64/libz.so.1
0x00007fa596cd2580  0x00007fa596cd43bc  Yes (*)     /lib64/libcap-ng.so.0
0x00007fa596ac6890  0x00007fa596acd42b  Yes (*)     /lib64/libkrb5support.so.0
0x00007fa5968c05b0  0x00007fa5968c11cc  Yes (*)     /lib64/libkeyutils.so.1
0x00007fa5966a89d0  0x00007fa5966b77e1  Yes (*)     /lib64/libresolv.so.2
0x00007fa596484ac0  0x00007fa59649a8c6  Yes (*)     /lib64/libselinux.so.1
0x00007fa59621d5f0  0x00007fa5962635b0  Yes (*)     /lib64/libpcre.so.1
0x00007fa595ed6430  0x00007fa596034438  Yes         /opt/omi/lib/libSCXCoreProviderModule.so
0x00007fa598e0fcc0  0x00007fa598e2b568  Yes         /opt/omi/lib/libmicxx.so
0x00007fa595b78e50  0x00007fa595b7daac  Yes (*)     /lib64/libcrypt.so.1
0x00007fa595972250  0x00007fa59597504c  Yes (*)     /lib64/librt.so.1
0x00007fa5956c3510  0x00007fa59572a5ba  Yes (*)     /lib64/libstdc++.so.6
0x00007fa59536b370  0x00007fa5953d6276  Yes (*)     /lib64/libm.so.6
0x00007fa595152af0  0x00007fa5951622a5  Yes (*)     /lib64/libgcc_s.so.1
0x00007fa594f4dba0  0x00007fa594f4e309  Yes (*)     /lib64/libfreebl3.so
0x00007fa58e8131d0  0x00007fa58e81a3e1  Yes (*)     /lib64/libnss_files.so.2
0x00007fa58e60c090  0x00007fa58e60f4f0  Yes (*)     /lib64/libnss_dns.so.2
0x00007fa58d908ec0  0x00007fa58d933b0f  Yes (*)     /lib64/libssl3.so
0x00007fa58d6df380  0x00007fa58d6f3e57  Yes (*)     /lib64/libsmime3.so
0x00007fa58d3c5740  0x00007fa58d498654  Yes (*)     /lib64/libnss3.so
0x00007fa58d18b390  0x00007fa58d199d45  Yes (*)     /lib64/libnssutil3.so
0x00007fa58cf7bf10  0x00007fa58cf7cc78  Yes (*)     /lib64/libplds4.so
0x00007fa58cd77510  0x00007fa58cd78b78  Yes (*)     /lib64/libplc4.so
0x00007fa58cb44ca0  0x00007fa58cb64cc0  Yes (*)     /lib64/libnspr4.so
0x00007fa58c27e2d0  0x00007fa58c2a7f5c  Yes (*)     /lib64/libsoftokn3.so
---Type <return> to continue, or q <return> to quit---
0x00007fa57e552a00  0x00007fa57e5da860  Yes (*)     /lib64/libsqlite3.so.0
0x00007fa57e2c8bc0  0x00007fa57e32196d  Yes (*)     /lib64/libfreeblpriv3.so
0x00007fa58c077cd0  0x00007fa58c0783cb  Yes (*)     /lib64/libnsssysinit.so
0x00007fa57e09e7e0  0x00007fa57e0b8496  Yes (*)     /lib64/libnsspem.so

(*): Shared library is missing debugging information.

(gdb)

The crash is at 0x00007fa58e405e00, which is in /lib64/libnss_dns.so.2, the same as the segmentation faults in your /var/log/messages.

http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /opt/omi/bin/omiagent...done.
[New LWP 123588]
[New LWP 108293]
[New LWP 108369]
[New LWP 108394]
[New LWP 108295]
[New LWP 108294]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/omi/bin/omiagent 9 10 --destdir / --providerdir /opt/omi/lib --loglevel WA'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fa58e405e00 in ?? ()
Missing separate debuginfos, use: debuginfo-install omi-1.4.2-1.x86_64
(gdb) bt
#0 0x00007fa58e405e00 in ?? ()
#1 0x00007fa58e449f47 in ?? ()
#2 0x00007fa58e449b60 in ?? ()
#3 0xffffffff00000073 in ?? ()
#4 0x0000000000000000 in ?? ()

Here are the threads:

(gdb) info threads
Id Target Id Frame
6 Thread 0x7fa598e01f00 (LWP 108294) 0x00007fa598b35cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 0x7fa598dc2f00 (LWP 108295) 0x00007fa598b35cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x7fa58e5caf00 (LWP 108394) 0x00007fa598b35cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
3 Thread 0x7fa58e609f00 (LWP 108369) 0x00007fa598b35cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2 Thread 0x7fa598f53880 (LWP 108293) 0x00007fa597d707a3 in select () from /lib64/libc.so.6
(*) 1 Thread 0x7fa57ffff700 (LWP 123588) 0x00007fa58e405e00 in ?? ()

Please let me know if any further debug information is required.

@srice01
Author

srice01 commented Feb 12, 2019

Is there any activity on this? Even after almost a year we are still having these same problems with a number of our nodes.

@asoccer

asoccer commented Sep 19, 2019

This is still a very prominent issue in Azure. Is there seriously no work being put into this anymore? It's a broken tool that's causing production VMs to run out of disk space.

@srice01
Author

srice01 commented Sep 19, 2019

We have ended up creating a cron job to delete the core files (hopefully frequently enough to avoid HD filling) rather than waiting for a fix from Microsoft that it appears will never come.

@johanburati

@srice01 Are you still having this issue with the latest versions?

  • OMS Agent: 1.11.0-9
  • OMI 1.6.2-0
  • SCX 1.6.3-659

@srice01
Author

srice01 commented Sep 23, 2019

Yes (I am using CentOS 7.6.1810).

[root]# rpm -qa | grep -i omi
omi-1.6.2-0.x86_64
[root]# rpm -qa | grep -i scx
scx-1.6.3-659.x86_64
[root]# rpm -qa | grep -i walinux
WALinuxAgent-2.2.42-1.el7.noarch
[root]# rpm -qa | grep -i oms
auoms-2.0.0-13.x86_64
omsagent-1.11.0-9.x86_64
omsconfig-1.1.1-926.x86_64

[root]# ls -al /var/opt/omi/run/
total 404888
drwxr-xr-x. 3 omi omi 4096 Sep 23 16:31 .
drwxr-xr-x. 8 root root 81 May 30 04:23 ..
-rw------- 1 root root 30789632 Sep 23 08:01 core.101250
-rw------- 1 root root 30789632 Sep 23 08:16 core.104089
-rw------- 1 root root 30814208 Sep 23 08:31 core.106826
-rw------- 1 root root 30728192 Sep 23 08:46 core.109601
-rw------- 1 root root 30711808 Sep 23 09:01 core.112330
-rw------- 1 root root 30728192 Sep 23 09:16 core.115170
-rw------- 1 root root 30793728 Sep 23 09:31 core.117975
-rw------- 1 root root 30801920 Sep 23 09:46 core.120825
-rw------- 1 root root 30814208 Sep 23 10:01 core.123533
-rw------- 1 root root 30814208 Sep 23 11:46 core.12592
...
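
To put a number on the listing above, something like the following could total the dumps. It is only a sketch: `summarize_omi_cores` is a made-up helper name, and `-printf` is specific to GNU find.

```shell
#!/usr/bin/env bash
# summarize_omi_cores: hypothetical helper, not part of OMI.
# Prints how many core.<digit>... files sit directly in the directory
# and their combined size in bytes.
summarize_omi_cores() {
    local dir="${1:-/var/opt/omi/run}"
    find "$dir" -maxdepth 1 -type f -name 'core.[0-9]*' -printf '%s\n' \
        | awk '{ n++; bytes += $1 } END { printf "%d files, %d bytes\n", n, bytes }'
}
```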

@johanburati

@srice01 Could you please open a support ticket and tell them to engage me (joburati)?
That way I will be able to follow up with the devs internally and get this issue worked on.

@srice01
Author

srice01 commented Sep 24, 2019

I am assuming you mean for me to create a support ticket in Azure. This is support request 119092422001455.

@johanburati

Thanks @srice01, will get in touch with you via the ticket and try to get this moving.

@johanburati

@srice01 Good news, I was able to fix the problem on your image.

The issue is that the DSCForLinux extension installs version 1.1.1-294 of the dsc package, and this version causes omiagent to segfault. Installing version 1.1.1-926 fixes the issue.

All those cases are related to this issue:

I have already submitted a fix to bump up the version of the dsc package:

I am following up with PG internally for them to merge and push the fix:

Meanwhile you can fix the issue by installing the package manually:

wget https://github.com/microsoft/PowerShell-DSC-for-Linux/releases/download/v1.1.1-926/dsc-1.1.1-926.ssl_098.x64.rpm
yum upgrade dsc-1.1.1-926.ssl_098.x64.rpm -y
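
Before upgrading, it may be worth checking whether a node still has the older build. A rough sketch follows, assuming the RPM is registered under the name `dsc` and using `sort -V` as an approximation of RPM's own version ordering; `version_lt` is a made-up helper.

```shell
#!/usr/bin/env bash
# version_lt A B: hypothetical helper that succeeds when version string A
# sorts strictly before B under GNU "sort -V". This approximates, but is
# not identical to, RPM's version comparison.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example use (the package name "dsc" is an assumption):
#   installed=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' dsc)
#   if version_lt "$installed" "1.1.1-926"; then
#       echo "dsc $installed predates the fixed build"
#   fi
```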

I hope this helps.

@srice01
Author

srice01 commented Sep 26, 2019

@johanburati - This is indeed good news. Given that the DSCForLinux extension is installed by Azure (not ourselves) I take it your changes are to make sure the fixed version is installed by default in future?

@johanburati

@srice01 yes

Once my patch is merged and a new release of the DSCForLinux extension is pushed by the devs, it will be fixed for good. Until then you will have to bump up the version of the package manually.

@johanburati

If you are having this issue, check Azure/azure-linux-extensions#875 for details and the solution.

@srice01
Author

srice01 commented Sep 27, 2019

It has been 24 hours since I installed the update and I have seen no core dumps, so I believe this is now resolved.

srice01 closed this as completed Sep 27, 2019