[Bug]: eBPF integration with apps.plugin is causing excessive CPU/memory usage growing with time #12719

vobruba-martin · 2022-04-19T10:55:57Z

Bug description

You can see that %sys CPU usage is growing with time. This happens only if I have apps = yes in ebpf.d.conf.

I don't see Netdata directly causing this issue. It seems to be caused by the Apache httpd processes, but only if apps = yes.

From ~11:48 to ~11:53 I killed eBPF plugin several times to find the cause.
~11:55 Netdata was restarted with apps = no in ebpf.d.conf.
~12:02 Netdata was restarted with apps = yes in ebpf.d.conf.

After further investigation I can tell that socket = no also helps to avoid this issue.

Expected behavior

I expect higher CPU usage, but it should be constant and not grow with time.
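
One way to tell "higher but constant" apart from "growing with time" is to fit a straight line to periodic samples and look at the slope. The sketch below is only an illustration: the `growth_per_sample` helper and the sample readings are hypothetical and are not taken from the charts in this report.

```python
from statistics import mean

def growth_per_sample(samples):
    """Least-squares slope of the samples against their index (units per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Hypothetical %sys readings, one per minute (not measured data).
steady = [12.0, 12.5, 11.8, 12.2, 12.1, 11.9]
growing = [12.0, 13.1, 14.0, 15.2, 16.1, 17.0]

print("steady slope: ", round(growth_per_sample(steady), 3))
print("growing slope:", round(growth_per_sample(growing), 3))
```

A slope near zero matches the expected behavior; a slope that stays clearly positive over a long window matches what is reported here.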

Steps to reproduce

I can reproduce this issue only by running Netdata along with several Apache httpd instances on Ubuntu 20.04. For example, a MySQL instance, which also accepts a lot of network connections (but not from so many sources), doesn't seem to be affected.

Installation method

kickstart.sh

System info

Linux 5.13.0-37-generic #42~20.04.1-Ubuntu SMP Tue Mar 15 15:44:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.4 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.4 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal

Netdata build info

Version: netdata v1.34.1
Configure options:  '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-protobuf' 'CFLAGS=-O2' 'LDFLAGS='
Install type: kickstart-build
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (bundled)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: NO

Additional info

No response


ilyam8 · 2022-04-19T11:03:44Z

@vobruba-martin hi, did you check memory usage? Does it grow too?

I noticed it a few days ago on my server, the memory usage is constantly growing.

vobruba-martin · 2022-04-19T11:22:51Z

@ilyam8 Yes, memory usage seems to be growing.

With apps = no there was no memory usage at all.

thiagoftsm · 2022-04-19T11:39:29Z

Hello @vobruba-martin ,

I have only one more question, because this will be my priority for today. Are you using the default configuration for the plugin, or did you change it?

Best regards!

vobruba-martin · 2022-04-19T11:44:48Z

Hi @thiagoftsm ,

I'm running with all ebpf programs set to yes.

See my ebpf.d.conf

#
# Global options
#
# The `ebpf load mode` option accepts the following values :
#  `entry` : The eBPF collector only monitors calls for the functions, and does not show charts related to errors.
#  `return`: In the `return` mode, the eBPF collector monitors the same kernel functions as `entry`, but also creates
#            new charts for the return of these functions, such as errors.
#
# The eBPF collector also creates charts for each running application through an integration with the `apps.plugin`
# or `cgroups.plugin`.
# If you want to disable the integration with `apps.plugin` or `cgroups.plugin` along with the above charts, change the setting 
# `apps` and `cgroups` to  'no'.
#
# The `update every` option defines the number of seconds used to read data from kernel and send to netdata
#
# The `pid table size` defines the maximum number of PIDs stored in the application hash tables.
#
[global]
    ebpf load mode = entry
    apps = yes
    cgroups = no
    update every = 5
    pid table size = 32768

#
# eBPF Programs
#
# The eBPF collector enables and runs the following eBPF programs by default:
#
#  `cachestat` : Make charts for kernel functions related to page cache.
#  `dcstat`    : Make charts for kernel functions related to directory cache.
#  `disk`      : Monitor I/O latencies for disks
#  `fd`        : This eBPF program creates charts that show information about file manipulation.
#  `mdflush`   : Monitors flush counts for multi-devices.
#  `mount`     : Monitor calls for syscalls mount and umount
#  `filesystem`: Monitor calls for functions used to manipulate specific filesystems
#  `hardirq`   : Monitor latency of serving hardware interrupt requests (hard IRQs).
#  `oomkill`   : This eBPF program creates a chart that shows which process got OOM killed and when.
#  `process`   : This eBPF program creates charts that show information about process life.
#  `shm`       : Monitor calls for syscalls shmget, shmat, shmdt and shmctl.
#  `socket`    : This eBPF program creates charts with information about `TCP` and `UDP` functions, including the
#                bandwidth consumed by each.
#  `softirq`   : Monitor latency of serving software interrupt requests (soft IRQs).
#  `sync`      : Monitor calls for syscall sync(2).
#  `swap`      : Monitor calls for internal swap functions.
#  `vfs`       : This eBPF program creates charts that show information about process VFS IO, VFS file manipulation and
#               files removed.
[ebpf programs]
    cachestat = yes
    dcstat = yes
    disk = yes
    fd = yes
    filesystem = yes
    hardirq = yes
    mdflush = yes
    mount = yes
    oomkill = yes
    process = yes
    shm = yes
    socket = yes
    softirq = yes
    sync = yes
    swap = yes
    vfs = yes
    network connections = yes
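
For readers who want to check quickly which parts of such a file are switched on, here is a minimal sketch using Python's standard `configparser`. It is not part of Netdata; the `summarize` helper is hypothetical, the embedded text is a trimmed copy of the file above, and `swap = no` is changed from the original only to show the filtering.

```python
import configparser

# Trimmed copy of the configuration above. Assumption for illustration:
# swap is set to "no" here, unlike in the reported file.
EBPF_CONF = """
# comment lines are ignored by the parser
[global]
    ebpf load mode = entry
    apps = yes
    cgroups = no
    update every = 5

[ebpf programs]
    process = yes
    socket = yes
    swap = no
"""

def summarize(conf_text):
    """Return (apps_enabled, sorted list of enabled eBPF programs)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    apps_enabled = parser.getboolean("global", "apps")
    programs = parser["ebpf programs"]
    enabled = sorted(name for name in programs if programs.getboolean(name))
    return apps_enabled, enabled

apps_enabled, enabled = summarize(EBPF_CONF)
print("apps integration:", apps_enabled)
print("enabled programs:", ", ".join(enabled))
```

With the settings reported in this issue, both `apps` and `socket` are enabled, which is the combination described in the bug description.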

thiagoftsm · 2022-04-19T18:59:38Z

Hello @vobruba-martin ,

The behavior that you and @ilyam8 described happens mainly because ebpf.plugin allocates memory dynamically for apps, and every time a process ends, it goes and cleans this memory.

I agree with you that we have room to improve the way the plugin works right now, and I have agreed with the product team to work on this issue in the coming days.

As soon as the PR is merged, I will return here with more details about how you can enable the new feature, which won't be used by default. 🤝

best regards!

ilyam8 · 2022-04-19T19:16:02Z

> mainly because ebpf.plugin allocates memory dynamically for apps. And every time a process ends, it goes and cleans this memory

Does it explain constant CPU/mem increasing (no decreasing)?

thiagoftsm · 2022-04-19T20:54:59Z

> Does it explain constant CPU/mem increasing (no decreasing)?

Yes. Every time you request a specific memory area from the kernel, it tries to allocate more memory than you requested. When a large number of connections or other actions come from different processes/threads on your host, both CPU and memory usage can increase, because you are requiring more effort from the kernel to allocate data.

Now about the "decreasing": I am reviewing all code related to allocation in the PR I am working on right now. I am not ruling out that, for a specific thread, we did not clean up data until the thread ended. :/
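
The explanation above (per-application state allocated dynamically and released when a process ends) can be pictured with a toy model. This is not code from ebpf.plugin: `PidTable`, its methods, and the 64-byte entry size are hypothetical, chosen only to show why many short-lived processes make usage climb if any cleanup path is missed.

```python
class PidTable:
    """Toy per-PID store: an entry is allocated the first time a PID is seen."""

    ENTRY_SIZE = 64  # hypothetical bytes per entry

    def __init__(self, cleanup_on_exit=True):
        self.cleanup_on_exit = cleanup_on_exit
        self.entries = {}

    def on_activity(self, pid):
        # Allocate lazily, only for PIDs that actually show activity.
        self.entries.setdefault(pid, bytearray(self.ENTRY_SIZE))

    def on_exit(self, pid):
        # Release the entry when the process ends (if cleanup is enabled).
        if self.cleanup_on_exit:
            self.entries.pop(pid, None)

def simulate(table, total_pids, max_alive):
    """Spawn short-lived PIDs, keeping at most max_alive alive at once."""
    alive = []
    for pid in range(1, total_pids + 1):
        table.on_activity(pid)
        alive.append(pid)
        if len(alive) > max_alive:
            table.on_exit(alive.pop(0))
    return len(table.entries)

with_cleanup = simulate(PidTable(cleanup_on_exit=True), total_pids=1000, max_alive=10)
without_cleanup = simulate(PidTable(cleanup_on_exit=False), total_pids=1000, max_alive=10)
print(with_cleanup, without_cleanup)  # prints "10 1000"
```

In this model the table stays bounded by the number of live processes when every exit is handled, and grows with the total number of processes ever seen when it is not.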

vobruba-martin · 2022-04-26T15:05:06Z

BTW today I had to set apps = no because socket = no was not enough.

thiagoftsm · 2022-04-26T15:30:37Z

Hello @vobruba-martin ,

The PR was initially created last night, but I was missing a few threads to finish it. Now that the PR blocking the fix has been merged, I am working on it today, and tomorrow it will be ready for reviewers. I expect to merge the final solution by Friday.

Best regards!

thiagoftsm · 2022-04-29T01:48:38Z

Hello guys,

An update: I am finishing the description of the PR that fixes the majority of the problems. As you can see in the next image, when the PR is merged, ebpf.plugin will use less CPU than apps:

And it will also use less memory:

The memory usage won't be static yet, because the functions opendir, readdir, and closedir are responsible for the behavior that apps and cgroup are showing, but as you can see in the image, the memory grows for a period of time and falls later. To solve this completely, I will have to change an eBPF program, so I will bring the final solution in another PR. I will talk with the product team (ping @cpipilas ) to plan a possible adjustment for apps.plugin too.
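
For context on the opendir/readdir/closedir remark above: a collector that discovers processes by walking a directory repeats that open/read/close sequence on every collection cycle. The sketch below assumes the walk is over /proc, which the comment does not state explicitly, and uses Python's `os.scandir` as a stand-in for those calls; `list_pids` is a hypothetical helper, not a Netdata function.

```python
import os

def list_pids(proc_root="/proc"):
    """Return the numeric directory names under proc_root, sorted as integers."""
    pids = []
    # The with-block closes the directory handle once iteration is done.
    with os.scandir(proc_root) as entries:
        for entry in entries:
            if entry.name.isdigit() and entry.is_dir():
                pids.append(int(entry.name))
    return sorted(pids)

if __name__ == "__main__" and os.path.isdir("/proc"):
    print(len(list_pids()), "processes visible")
```

Each call opens the directory, reads every entry, and closes it again, so the cost scales with the number of processes on the host and recurs at every update interval.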

Best regards!

ilyam8 · 2022-09-21T15:27:36Z

@thiagoftsm is this issue fixed, are you working on it, can't reproduce it, etc.?

@vobruba-martin it's been a while and @thiagoftsm added some changes to the ebpf plugin. Can you check the latest (nightly) Netdata?

thiagoftsm · 2022-09-21T15:55:10Z

Hello @ilyam8 ,

This is fixed and will be available in the next stable release, but the nightly already has the code merged.

vobruba-martin · 2022-09-23T05:37:47Z

Hello @thiagoftsm @ilyam8 ,

Do you plan to release the next stable in near future? I have bad experiences with nightlies so I'd rather wait for the stable release.

ilyam8 · 2022-09-23T07:01:30Z

> bad experiences with nightlies

@vobruba-martin are you on nightly and the problem is not fixed? Can you show your Netdata Agent version?

vobruba-martin · 2022-09-23T10:05:05Z

@ilyam8 No, I run current stable version and I'd like to wait for the next stable release to test if this issue is fixed.

thiagoftsm · 2022-09-23T12:14:42Z

> Hello @thiagoftsm @ilyam8 ,
>
> Do you plan to release the next stable in near future? I have bad experiences with nightlies so I'd rather wait for the stable release.

Hello @vobruba-martin ,

We had an internal talk about this yesterday, and we expect to have the next release in the first weeks of October.

vobruba-martin added the bug and needs triage labels Apr 19, 2022

ilyam8 added the area/collectors and collectors/ebpf labels Apr 19, 2022

ilyam8 assigned thiagoftsm Apr 19, 2022

ilyam8 removed the needs triage label Apr 19, 2022

ilyam8 changed the title ~~[Bug]: eBPF integration with apps.plugin is causing excessive CPU usage growing with time~~ [Bug]: eBPF integration with apps.plugin is causing excessive CPU/memory usage growing with time Apr 19, 2022

thiagoftsm mentioned this issue Apr 26, 2022

Add option for eBPF plugin to manipulate memory #12760

Closed

cpipilas added the priority/high label Jun 29, 2022

thiagoftsm closed this as completed Sep 21, 2022
