Skip to content

Conversation

@rgwood
Copy link
Contributor

@rgwood rgwood commented Oct 22, 2022

Description

Fixes #6829, and another newly discovered bug where CPU usage was always reported as zero on Linux.

Bug 1

Previously, ps would sometimes return an empty list on Linux. The easiest way to see this is to run ps in a loop:

/home/pi〉1..100 | each { ps | describe }
╭────┬─────────────────────────────────────────────────────────────────────────────────────────────╮
│  0 │ table<pid: int, name: string, status: string, cpu: float, mem: filesize, virtual: filesize> │
│  1 │ table<pid: int, name: string, status: string, cpu: float, mem: filesize, virtual: filesize> │
│  2 │ list<any>                                                                                   │
│  3 │ table<pid: int, name: string, status: string, cpu: float, mem: filesize, virtual: filesize> │
│  4 │ table<pid: int, name: string, status: string, cpu: float, mem: filesize, virtual: filesize> │
...

This line was the culprit:

Err(_) => return Vec::<ProcessInfo>::new(),

ps works by taking 2 snapshots of process activity, and then comparing them to get an idea of process activity over time. It's a reasonable approach, but we weren't handling the situation where a process dies between the 2 snapshots well.

The fix: skip over just the process that died instead of skipping over all processes.

Bug 2

Another problem: I noticed that the cpu field in ps results was always returning zero. Even when compiling Rust code on all cores, ps was saying that every process was using zero CPU.

When I took a closer look, the problem seemed to be that we were using procfs::process::Process as if it were a snapshot of process activity, when in reality it's more like a handle to a process:

pub fn cpu_usage(&self) -> f64 {
let curr_time = match self.curr_proc.stat() {
Ok(c) => c.utime + c.stime,
Err(_) => 0,
};
let prev_time = match self.prev_proc.stat() {
Ok(c) => c.utime + c.stime,
Err(_) => 0,
};
let usage_ms =
(curr_time - prev_time) * 1000 / procfs::ticks_per_second().unwrap_or(100) as u64;

The calls to curr_proc.stat() and prev_proc.stat() are not retrieving previously taken snapshots of stat data ; they are taking new snapshots. And so the intended 100ms delay between snapshots was effectively 0ms.

To fix this, I tweaked the ps code to take snapshots of stat data like we already do for I/O data.

Tests

I did a bunch of manual testing, but this is an area that's somewhat difficult to write automated tests for. If anyone has ideas, I'm all ears.

  • Add tests that cover your changes, either in the command examples, the crate/tests folder, or in the /tests folder.
  • Try to think about corner cases and various ways how your changes could break. Cover them with tests.
  • If adding tests is not possible, please document in the PR body a minimal example with steps on how to reproduce so one can verify your change works.

Make sure you've run and fixed any issues with these commands:

  • cargo fmt --all -- --check to check standard code formatting (cargo fmt --all applies these changes)
  • cargo clippy --workspace --features=extra -- -D warnings -D clippy::unwrap_used -A clippy::needless_collect to check that you're using the standard code style
  • cargo test --workspace --features=extra to check that all the tests pass

@fdncred
Copy link
Contributor

fdncred commented Oct 22, 2022

we used to rely exclusively on helm (iirc), then we tried another crate, then there were so many problems we rolled our own with pieces of a handful of crates. I'm not surprised at all how brittle ps has been. hopefully, this will continue to move us forward toward stabilization. good work here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ps emits empty output sometimes

2 participants