
Is it possible to derive the list of currently running actors from the performance database? #273

Open
DavidPCoster opened this issue Oct 25, 2023 · 4 comments

Comments

@DavidPCoster
Contributor

Is it possible to derive the list of currently running actors from the performance database?

This could be useful if there are longer-running actors ...

If so, it would be good to also know when it started running this time (or the run time so far), and, perhaps, the average of previous run times.

@LourensVeen
Contributor

You're not the first to ask 😄

The direct answer is no. The profiling subsystem measures things that libmuscle does, and computing things isn't part of that, so whether something is running needs to be inferred from the fact that it isn't doing anything else. Mostly, that would be waiting to receive a message. That is recorded as an event, but the record is only complete once we actually receive the message, so the manager doesn't know about it while it's going on. Also, records are saved up for a while and sent in batches in the background to reduce the performance impact, which delays things further.
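To make the inference concrete: if the only thing the profiler records for an instance is its completed receive-wait events, then the gaps between those events are the best available proxy for "running". A minimal sketch, assuming the wait events have already been extracted from the profiling database as sorted (begin, end) pairs in seconds (the real database has its own event types and timestamp format):

```python
def infer_compute_intervals(wait_events, t_start, t_end):
    """Return the intervals in [t_start, t_end] not covered by wait events.

    wait_events: sorted list of (wait_begin, wait_end) tuples, in seconds.
    The gaps between waits are presumably time spent computing, but this is
    an inference, not a measurement: libmuscle only records what it does.
    """
    intervals = []
    cursor = t_start
    for begin, end in wait_events:
        if begin > cursor:
            # Not waiting between cursor and begin, so presumably computing.
            intervals.append((cursor, begin))
        cursor = max(cursor, end)
    if cursor < t_end:
        intervals.append((cursor, t_end))
    return intervals
```

Note that this only works after the fact: the final gap is open-ended until the next wait record arrives at the manager, which is exactly why the live question can't be answered this way.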

I think what we want to have is some kind of monitoring system, but monitoring isn't the same as profiling. The latter collects exhaustive data for analysis after the fact, while the former aims to give the user a real-time look at what's going on. Since users aren't very fast compared to CPUs, monitoring can sample, and skip or summarise some data, in particular when things are going very quickly.

We do have a remote logging system, through which log messages are sent from the instances to the manager at least if muscle_remote_log_level is set low enough. That could be upgraded to do structured logging, so that we can send machine readable records with things like "instance x has been waiting on port y for one second now". This information could then be collected by the manager and amalgamated into a view of the global state of the simulation, and then there would have to be a way to get that information to the user in a user-friendly way. This is especially interesting if we're running on HPC. Do we write snapshots to a file? Run a TCP server? Is there a graphical display, a textual summary?
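A structured record like the one described could be as simple as a JSON object per status message. Purely as illustration (the field names and the function are made up here, not part of the libmuscle remote logging protocol):

```python
import json


def make_status_record(instance, port, waiting_since, now):
    """Build a hypothetical machine-readable status record as a JSON string.

    instance, port: names as strings; waiting_since, now: timestamps in
    seconds. None of these fields exist in libmuscle today; this just shows
    what "instance x has been waiting on port y for one second" could look
    like on the wire.
    """
    return json.dumps({
        'event': 'waiting_for_message',
        'instance': instance,
        'port': port,
        'waiting_seconds': round(now - waiting_since, 1),
    })
```

The manager side would then parse these records and fold them into its view of the global simulation state, however that ends up being exposed.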

Of course, this would still be limited to monitoring things that libmuscle does. It would probably be nice to have effectively a kind of top for the simulation that monitors actual usage of the CPU and other resources. That would require a way of mapping instances to a (node, PID) pair identifying the process, and then some kind of node agent that can do whatever top does to collect the data and send it to the manager.
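On Linux, the node-agent part is at least mechanically straightforward once you have the PID: per-process CPU time is in /proc/&lt;pid&gt;/stat, where fields 14 and 15 (1-based) are utime and stime in clock ticks. A sketch of the parsing, which is fiddlier than it looks because the command name in field 2 may itself contain spaces and parentheses:

```python
def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line.

    The comm field is enclosed in parentheses and may contain spaces, so
    split on the last ')' first; after that, fields are space-separated
    starting from field 3 (state). utime is field 14, stime field 15.
    """
    rest = stat_line.rsplit(')', 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime
```

Sampling this periodically per instance and shipping deltas to the manager is essentially what a per-simulation top would need, plus the instance-to-(node, PID) mapping mentioned above.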

That's all doable, but a fair bit of work. We should probably investigate whether there are existing performance monitoring tools that can do this, and if we can just integrate better with them. I'm not aware of there being many open source options in this space though, and existing tools may not work so well with complex coupled simulations, so it may still be worth it.

@DavidPCoster
Contributor Author

Thanks for the explanation.

I think as a starting point, a CSV file that is appended to every <user_selected_time> seconds, with an entry for each actor containing a character indicating waiting, sending, receiving, or running, might be good.

@LourensVeen
Contributor

Oh, that's a nice idea. Then you can run watch tail <file> to see it live, and also load it into Pandas or a spreadsheet for analysis after the fact. You may need a very wide screen for a large simulation, though.

@LourensVeen
Contributor

See also #171.
