
Export individual processes stats #794

Closed
lawre opened this issue Feb 20, 2016 · 25 comments

Comments

@lawre

lawre commented Feb 20, 2016

I would really like some way to export the individual process metrics to InfluxDB so we can better pinpoint which process is hogging our memory, network, or CPU, and for how long. I work in a software build environment, so these kinds of metrics can really help nail down bottlenecks in a Jenkins build.

@nicolargo
Owner

It is a good idea but... how do we identify the process (primary key)?

  1. PID ?
  2. Command line ?
  3. Process name ?
  4. Other ?

TBD
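For illustration, all of the candidate keys above are available through psutil, the library Glances itself builds on. A minimal sketch (assuming psutil is installed), inspecting the current process:

```python
import psutil  # the library Glances builds on

# Inspect the candidate primary keys for the current process:
# PID, process name, and full command line.
proc = psutil.Process()
candidates = {
    "pid": proc.pid,                      # unique while the process lives, but reused after exit
    "name": proc.name(),                  # not unique (e.g. several nginx workers share it)
    "cmdline": " ".join(proc.cmdline()),  # more specific, but long and still not guaranteed unique
}
print(candidates)
```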

@lawre
Author

lawre commented Feb 21, 2016

My primary goal for this request is to be able to hover over a line in a Grafana graph and have a tooltip pop up with the full command line of the process I'm interested in. How can this best be accomplished?

primary = PID: This would be a good choice if we can figure out how to get Grafana to show a second field in the tooltip instead of the one we're querying on. I'm not so interested in the PID per se, but it does offer the convenience of guaranteeing uniqueness for each process.

primary = Command Line: I think this would be the most straightforward approach, but these lines can get pretty long, with lots of funny characters. Can InfluxDB handle this? Perhaps truncate at a fixed number of characters? This approach also leaves the danger of a program like Apache spawning multiple processes for the same thing; these wouldn't look unique in this case.

primary = PID-Command Line: Concatenating the PID and the command line might be the best approach to guarantee uniqueness if we can't solve the problem of making the tooltip show secondary info.

primary = command: This does not get granular enough for me. Think of the Apache web server, or in my case the Jenkins node allowing simultaneous runs.

Thank you for considering this feature!

@wingsof

wingsof commented Feb 23, 2016

Wow! I also need this feature!
My purpose is to capture the load of specific processes while running stress tests on several solutions. I think only nmon currently exports this information.

Regarding the primary key, I think the combination of PID and the start time of the process (because of PID reuse) might be enough to uniquely identify a process.
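The (pid, start time) composite key suggested here can be sketched with psutil (assuming it is installed); this is illustrative, not Glances code:

```python
import psutil

# PIDs are reused by the OS after a process exits, but a
# (pid, start_time) pair uniquely identifies one process instance.
proc = psutil.Process()
key = (proc.pid, proc.create_time())  # create_time() is seconds since the epoch
print(key)
```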

@johnhill2

My team desperately needs this feature too. I've had to do some crazy stuff with the per-process utilization flat file and rsync. Having it export directly to InfluxDB is our number one feature request.

@gabrieljames

+1

I've been working my way through the data collection agents (collectd, telegraf, topbeat) and have been completely surprised, shocked even, that none of these agents collect all process CPU, memory, and disk metrics. Some allow specific process monitors to be defined in configuration files, but I'm looking for comprehensive process monitoring metrics stored in InfluxDB or Logstash.

Hate to say it, but this is straight forward in Windows performance logging.

I am about to test Glances specifically for this capability; I was expecting it to be there based on the data displayed in the command line interface.

@nicolargo
Owner

@lawre Another approach: use the PID as primary key and use tags to store the process name, full command line, process start time...

What do you think ?
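As a sketch of this proposal, one exported "point" might look like the following. This is illustrative only; the names do not reflect the actual Glances export schema:

```python
# PID as primary key, descriptive attributes as tags, measured values as fields.
# All values below are made up for illustration.
point = {
    "measurement": "process",
    "key": 12345,  # the PID
    "tags": {
        "name": "nginx",
        "cmdline": "/usr/sbin/nginx -g 'daemon off;'",
        "start_time": "2016-03-26T12:00:00Z",
    },
    "fields": {"cpu_percent": 1.5, "memory_percent": 0.8},
}
print(point["tags"]["name"])
```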

@johnhill2

@nicolargo that will work for me

@nicolargo nicolargo modified the milestone: 2.7 Mar 26, 2016
@nicolargo nicolargo changed the title export individual process stats to influxdb Export individual process stats to influxdb Mar 27, 2016
@nicolargo nicolargo changed the title Export individual process stats to influxdb Export individual processes stats to influxdb Mar 27, 2016
@lawre
Author

lawre commented May 11, 2016

I think that would work, good suggestion.

@nicolargo
Owner

nicolargo commented Jul 6, 2016

My proposal to use the PID as primary key and store the process name and command line in tags would only work for InfluxDB; the current architecture of the Glances export module is not tied to the storage database (the same functions are used to export to CSV, InfluxDB, Cassandra...). Tags only exist in the InfluxDB ecosystem.

Two others points to keep in mind:

  • for the moment, the Glances process list is optimized: it grabs cpu_percent, memory_percent, io_counters, name, and cmdline for all processes (the so-called mandatory stats), and other stats (extended stats) only for processes displayed in the UI. Export will only be done on mandatory stats.
  • a performance issue can occur while writing a huge process list to the database

My new proposal: define a list of processes to export (for example using a regular expression on the command line); only the filtered processes will be exported.

For example, if you want to monitor the NGinx process:

[processlist]
# Export process stats (export_* lines)
# Export NGinx processes (name matching ^nginx.* regexp) to foonginx key
export_foonginx=^nginx.*

The key is foonginx. It will export all the mandatory stats for processes whose command line starts with nginx (one line per process, or one line for all the processes, to be discussed).

What do you think ?
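The filtering step proposed above could be sketched as follows. The filter key, pattern, and process list are taken from the example config; the matching logic is a hypothetical illustration, not the Glances implementation:

```python
import re

# Keep only processes whose name matches the configured pattern
# (here ^nginx.*, keyed "foonginx" as in the config sketch above).
export_filters = {"foonginx": re.compile(r"^nginx.*")}

# Made-up process list for illustration.
processes = [
    {"pid": 101, "name": "nginx: master process"},
    {"pid": 102, "name": "nginx: worker process"},
    {"pid": 200, "name": "sshd"},
]

for key, pattern in export_filters.items():
    matched = [p for p in processes if pattern.match(p["name"])]
    print(key, [p["pid"] for p in matched])
```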

@nicolargo
Owner

@lawre : Any heads-up concerning my last proposal?

@nicolargo nicolargo changed the title Export individual processes stats to influxdb Export individual processes stats Jul 18, 2016
@lawre
Author

lawre commented Jul 18, 2016

Sorry, I didn't see the alert for your question before.

collectd's processes plugin already allows regex matching of processes, so in our case we would just continue to use that. Is there a possibility for this to be an "InfluxDB only" feature, as it is probably the only database uniquely qualified for this type of data? I'm not sure any other fixed-key DB would even be able to handle this. I couldn't imagine reading a CSV where there could be 20-100 ever-changing PIDs to key off.

@nicolargo
Owner

I just tried to code a first version in a local branch. The main problem is that the process stats update is done asynchronously in a specific thread, so when the stats are exported we are not sure that the process stats are complete. One workaround is to force the update before the export, but it costs a lot of CPU...

@nicolargo nicolargo added this to the Next releases milestone Dec 24, 2016
@nicolargo nicolargo modified the milestones: Glances 3.0, Glances 2.11 Aug 12, 2017
@nicolargo
Owner

No time to work on this request for Glances 3.0. Need a contributor!

@nicolargo nicolargo removed this from the Glances 3.0 milestone Feb 25, 2018
@gotjoshua

Just getting into Glances, and I doubt I (or anyone on my team) will have the bandwidth to be that contributor... but I want to broaden the scope beyond InfluxDB; let me know if it is better to spin this off into a new issue:

from the original issue:

better pinpoint what process is hogging our memory, network or cpu and for how long.

I would love to be able to configure actions/alerts to show the current "hogs", e.g.:

Warning or critical alerts (last 10 entries)
2019-01-27 10:14:30 (00:00:22) - CRITICAL on MEM (96.0) [Hogs: apached in dev (55%), gitlab worker in gitlab (41%)]

So the format I envision is: [{{process_name}} in {{container_name}} ({{hog%}})]

If this info could be logged to a file (which I could add to Logstash) and displayed in the web interface alert list, it would be mega sexy.

Is this at all possible with the current API? Or even with the command line somehow?

I ask for it in the alerts list because it persists in the UI for longer, so when I check intermittently it would be great to have this glimpse into the previously critical moments.

@unlikelyzero
Sponsor

For now, I'm going to be using a combination of https://github.com/ncabatoff/process-exporter and Glances (for container and GPU measurement). Ideally, I'd route all of this through the Grafana agent to make a single target in Prometheus.

@nicolargo
Owner

nicolargo commented Apr 2, 2024

Feature previously implemented in the branch: https://github.com/nicolargo/glances/tree/issue794

The processes list to export could be defined in the Glances configuration file:

#
# Define the list of processes to export:
# export is a comma-separated list of Glances filters
export=.*firefox.*,username:nicolargo

or with the --export-process-filter option:

glances -C --export csv --export-csv-file /tmp/glances.csv --disable-plugin all --enable-plugin processlist --quiet --export-process-filter ".*python.*"

@nicolargo
Owner

nicolargo commented Apr 6, 2024

Feature merged into develop.

The documentation is here: https://github.com/nicolargo/glances/blob/develop/docs/aoa/ps.rst#export-process

Need beta testers on this feature !

cc: @unlikelyzero @gotjoshua @lawre @johnhill2 @gabrieljames @wingsof @alex-ruhl

@bLuka

bLuka commented Apr 18, 2024

What a great day to be alive! I’ve been looking for such a feature for years; it’s been ages since I last touched Glances, and I just redid my monitoring setup over Glances + InfluxDB + Grafana these past few days.
Only to learn my long-awaited feature was implemented 2 weeks ago ❤️

I deployed upstream’s develop Glances on my server, and the process exporter works like a charm! I see data coming into InfluxDB and rendered in Grafana already. I don’t see anything wrong so far; I’ll keep you updated if you want to confirm it’s stable in the long run.

I’m wondering (as I’m investigating disk I/O bottlenecks from unmonitored processes): is there any reason per-process disk I/O metrics are not exported?
I am talking about the bytes R/s and W/s metrics which are already present in the curses display.

EDIT: I faced weird behaviour where Glances ended up claiming several dozen GiB of RAM, along with a few CPU cores at 100% (IIRC, 40 GiB and 4 cores at 100%). I’m investigating whether it comes from misuse, from develop, or from the process list export.

I cannot reproduce it. I believe it came from my first experiments without filtering the output. CPU is still a bit high (2 cores ~100%), but I believe that is known and expected.

@bLuka

bLuka commented Apr 21, 2024

I find the cardinality a bit odd (and difficult to manipulate afterwards).
Are there specific reasons/constraints that require an export such as:

timestamp  821811.num_threads  821811.name          821811.cpu_timesuser  2950324.num_threads  2950324.name  2950324.cpu_timesuser
N          1                   kworker/13:5-events  0.02                  22                   dotnet        233.79
N+1        1                   kworker/13:5-events  0.02                  22                   dotnet        233.79

instead of a format with one row per process and fewer columns, such as:

timestamp  pid      num_threads  name                 cpu_timesuser
N          821811   1            kworker/13:5-events  0.02
N          2950324  22           dotnet               233.79
N+1        821811   1            kworker/13:5-events  0.02
N+1        2950324  22           dotnet               233.79

I’m currently struggling to parse the results to build a Grafana dashboard from the InfluxDB backend, and the latter format would make it easier.
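For what it's worth, the wide layout can be reshaped into the long one with a few lines of stdlib Python. The column names follow the example above; this is a hypothetical post-processing sketch, not Glances code:

```python
import csv
import io

# A miniature version of the wide CSV layout: one column group per PID.
wide_csv = """timestamp,821811.num_threads,821811.name,2950324.num_threads,2950324.name
N,1,kworker/13:5-events,22,dotnet
N+1,1,kworker/13:5-events,22,dotnet
"""

# Reshape to one row per (timestamp, pid) by splitting "pid.field" columns.
long_rows = []
for row in csv.DictReader(io.StringIO(wide_csv)):
    per_pid = {}
    for col, value in row.items():
        if col == "timestamp":
            continue
        pid, field = col.split(".", 1)
        per_pid.setdefault(pid, {"timestamp": row["timestamp"], "pid": pid})[field] = value
    long_rows.extend(per_pid.values())

print(long_rows)
```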

Additionally, regarding the format, I noticed a weird behavior where the processes were not aligned in the proper columns (have a look at the 270.pid column):
[screenshot: processes misaligned across CSV columns]

I reproduce this consistently using the following command (and my own process list, which includes a lot of processes):

glances -C --export csv --export-csv-file /tmp/glances.csv  --export-process-filter ".*dotnet.*,*java*,*qemu*,*samba*,*lxd*,*docker*" --disable-plugin all --enable-plugin processlist --quiet

I’m wondering if it comes from the fact that I monitor a lot of processes (some of which terminate while others spawn)?

@nicolargo
Owner

Hi @bLuka and thanks for the feedback.

Concerning the CSV export, I want to keep one line per timestamp, because the header should be the same for each line.
This behavior is also used for other plugins, like network, where stats are exported with a similar flattened format. However, the weird behavior concerning the columns is not "normal". I think I understand what happens: if one of your processes dies during the capture, its columns can be overwritten by another process's stats. I have to investigate this...

The InfluxDB export does not behave this way: all the stats are exported line by line, and the pid becomes a tag (in the InfluxDB data model). I just added the process name as another tag. This should simplify the way InfluxDB stores the information and therefore also simplify Grafana dashboard creation.
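As an illustration of that data model, here is a hypothetical formatter emitting InfluxDB line protocol with pid and name as tags. The measurement and field names are made up, and real line protocol also requires escaping special characters, which this toy omits:

```python
# Build one line-protocol record: measurement,tags fields timestamp
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "processlist",
    {"pid": 821811, "name": "dotnet"},          # tags: indexed, used for grouping
    {"cpu_percent": 1.5, "memory_percent": 0.8},  # fields: the measured values
    1713900000000000000,
)
print(line)
```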

@nicolargo
Owner

nicolargo commented Apr 24, 2024

For the "weird" behavior: I also reproduce it on my side. Without any creation or deletion of an exported process, the columns are generated in a different order every minute...

[screenshot: CSV columns reordered between captures]

The problem is related to cache_timeout=60 in the processes.py file:

class GlancesProcesses(object):
    """Get processed stats using the psutil library."""

    def __init__(self, cache_timeout=60):
        """Init the class to collect stats about processes."""
        ...

If I change cache_timeout=60 to 30, the glitch appears every 30 seconds...

@nicolargo
Owner

@bLuka The last commit should correct the issue with column alignment. Stats are now sorted before export.

For the last point (newly created or removed processes), I do not know what the best solution is:

  1. Generate a new CSV file with a new header
  2. Change the header and add a new column if a new process is created
  3. Following 2), if a process is stopped, fill its column with empty values

Any advice?
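The "sort before export" fix can be illustrated in a few lines: deriving the CSV header from sorted stat keys makes the column order deterministic across refreshes, whatever order the process cache returns them in. The stat dicts below are made up for illustration:

```python
# Two refreshes returning the same stats in different internal orders.
stats_run_1 = {"2950324.name": "dotnet", "821811.name": "kworker", "821811.num_threads": 1}
stats_run_2 = {"821811.num_threads": 1, "821811.name": "kworker", "2950324.name": "dotnet"}

# Sorting the keys yields the same header either way.
header_1 = sorted(stats_run_1)
header_2 = sorted(stats_run_2)
assert header_1 == header_2
print(header_1)
```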

@guidocioni

Hey all. I've been testing this feature as well, so I'll report back here in case I have some feedback.

Just a quick question: is there any way to export to the CSV only the N (user-defined) processes with the highest CPU usage? Or is the process filter with a regular expression the only way to filter what is saved in the CSV (or any other export)? I guess it would be hard to have PIDs changing at every timestamp, as the schema of the exported table is defined at the beginning.
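The requested top-N selection would be a one-liner over each snapshot; a sketch with a made-up process list (not a Glances API):

```python
# Keep only the n processes with the highest CPU usage at this snapshot.
def top_n_by_cpu(processes, n):
    return sorted(processes, key=lambda p: p["cpu_percent"], reverse=True)[:n]

snapshot = [
    {"pid": 1, "name": "mostly-idle", "cpu_percent": 0.1},
    {"pid": 2, "name": "builder", "cpu_percent": 87.5},
    {"pid": 3, "name": "webserver", "cpu_percent": 12.0},
]

print([p["name"] for p in top_n_by_cpu(snapshot, 2)])
```

As the comment notes, the hard part for CSV would be that the selected PIDs (and so the columns) change between snapshots.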

@nicolargo
Owner

@guidocioni For the moment it is not possible, but it would be a nice feature. Can you open a new issue?

Thanks !
