
Is it possible to derive the list of currently running actors from the performance database? #273

Open
DavidPCoster opened this issue Oct 25, 2023 · 4 comments

Comments

@DavidPCoster
Contributor

Is it possible to derive the list of currently running actors from the performance database?

This could be useful if there are longer-running actors ...

If so, it would be good to also know when it started running this time (or the run time so far), and, perhaps, the average of previous run times.

@LourensVeen
Contributor

You're not the first to ask 😄

The direct answer is no. The profiling subsystem measures things that libmuscle does, and computing things isn't part of that, so whether something is running needs to be inferred from the fact that it isn't doing anything else. Mostly, that would be waiting to receive a message. That is recorded as an event, but the record is only complete once we actually receive the message, so the manager doesn't know about it while it's going on. Also, records are saved up for a while and sent in batches in the background to reduce the performance impact, which delays things further.
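To make the inference concrete: if the only thing the profiler records for an instance is its completed receive-wait events, then the gaps between those events are the best available proxy for "running". A minimal sketch, assuming the wait events have already been extracted from the profiling database as sorted (begin, end) pairs in seconds (the real database has its own event types and timestamp format):

```python
def infer_compute_intervals(wait_events, t_start, t_end):
    """Return the intervals in [t_start, t_end] not covered by wait events.

    wait_events: sorted list of (wait_begin, wait_end) tuples, in seconds.
    The gaps between waits are presumably time spent computing, but this is
    an inference, not a measurement: libmuscle only records what it does.
    """
    intervals = []
    cursor = t_start
    for begin, end in wait_events:
        if begin > cursor:
            # Not waiting between cursor and begin, so presumably computing.
            intervals.append((cursor, begin))
        cursor = max(cursor, end)
    if cursor < t_end:
        intervals.append((cursor, t_end))
    return intervals
```

Note that this only works after the fact: the final gap is open-ended until the next wait record arrives at the manager, which is exactly why the live question can't be answered this way.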

I think what we want to have is some kind of monitoring system, but monitoring isn't the same as profiling. The latter collects exhaustive data for analysis after the fact, while the former aims to give the user a real-time look at what's going on. Since users aren't very fast compared to CPUs, monitoring can sample, and skip or summarise some data, in particular when things are going very quickly.

We do have a remote logging system, through which log messages are sent from the instances to the manager at least if muscle_remote_log_level is set low enough. That could be upgraded to do structured logging, so that we can send machine readable records with things like "instance x has been waiting on port y for one second now". This information could then be collected by the manager and amalgamated into a view of the global state of the simulation, and then there would have to be a way to get that information to the user in a user-friendly way. This is especially interesting if we're running on HPC. Do we write snapshots to a file? Run a TCP server? Is there a graphical display, a textual summary?
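A structured record like the one described could be as simple as a JSON object per status message. Purely as illustration (the field names and the function are made up here, not part of the libmuscle remote logging protocol):

```python
import json


def make_status_record(instance, port, waiting_since, now):
    """Build a hypothetical machine-readable status record as a JSON string.

    instance, port: names as strings; waiting_since, now: timestamps in
    seconds. None of these fields exist in libmuscle today; this just shows
    what "instance x has been waiting on port y for one second" could look
    like on the wire.
    """
    return json.dumps({
        'event': 'waiting_for_message',
        'instance': instance,
        'port': port,
        'waiting_seconds': round(now - waiting_since, 1),
    })
```

The manager side would then parse these records and fold them into its view of the global simulation state, however that ends up being exposed.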

Of course, this would still be limited to monitoring things that libmuscle does. It would probably be nice to have effectively a kind of top for the simulation that monitors actual usage of the CPU and other resources. That would require a way of mapping instances to a (node, PID) pair identifying the process, and then some kind of node agent that can do whatever top does to collect the data and send it to the manager.
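On Linux, the node-agent part is at least mechanically straightforward once you have the PID: per-process CPU time is in /proc/&lt;pid&gt;/stat, where fields 14 and 15 (1-based) are utime and stime in clock ticks. A sketch of the parsing, which is fiddlier than it looks because the command name in field 2 may itself contain spaces and parentheses:

```python
def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line.

    The comm field is enclosed in parentheses and may contain spaces, so
    split on the last ')' first; after that, fields are space-separated
    starting from field 3 (state). utime is field 14, stime field 15.
    """
    rest = stat_line.rsplit(')', 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime
```

Sampling this periodically per instance and shipping deltas to the manager is essentially what a per-simulation top would need, plus the instance-to-(node, PID) mapping mentioned above.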

That's all doable, but a fair bit of work. We should probably investigate whether there are existing performance monitoring tools that can do this, and if we can just integrate better with them. I'm not aware of there being many open source options in this space though, and existing tools may not work so well with complex coupled simulations, so it may still be worth it.

@DavidPCoster
Contributor Author

Thanks for the explanation.

I think as a starting point, a CSV file that is appended to every <user_selected_time> seconds, with an entry for each actor containing a character indicating waiting, sending, receiving, or running, might be good.

@LourensVeen
Contributor

Oh, that's a nice idea. Then you can run watch tail <file> to see it live, and also load it into Pandas or a spreadsheet for analysis after the fact. You may need a very wide screen for a large simulation, though.

@LourensVeen
Contributor

See also #171.
