Add basic performance statistics to phone home #3044
Conversation
This requires the psutil module, and is still opt-in based on the report_stats config option.
3f5519a
to
4ceaa74
Compare
Should probably go out in combination with #3041 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broadly looks good, just a few nits.
We would appreciate it if you could assist by ensuring this module is available | ||
and ``report_stats`` is enabled. This will let us see if performance changes to | ||
synapse are having an impact to the general community. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably go in the changelog IMO, otherwise a) we'll forget to update $NEXT_VERSION
and b) I think most people look at CHANGELOG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably then move the historical information into the changelog and leave UPDATING as just the top level howto, referring to the changelog for any specific requirements on upgrade? Just so we don't split the idea of "upgrade notes" across multiple places.
synapse/app/homeserver.py
Outdated
stats["memory_rss"] = 0 | ||
stats["cpu_average"] = 0 | ||
for process in stats_process: | ||
with process.oneshot(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't exist in older versions of psutil, alas. We have theoretically support any version of psutil >= 2. and in practice I think the debian package (for jessie) relies on being able to use v4.x
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
synapse/app/homeserver.py
Outdated
if hs.config.report_stats: | ||
logger.info("Scheduling stats reporting for 3 hour intervals") | ||
clock.looping_call(phone_stats_home, 3 * 60 * 60 * 1000) | ||
|
||
# We need to defer this init for the cases that we daemonize | ||
# otherwise the process ID we get is that of the non-daemon process | ||
clock.call_later(15, performance_stats_init) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using 0
works to defer it to the next reactor tick, which is probably nicer than arbitrarily waiting 15 seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option is to add a callback to _base.start_reactor
to accept a function to run on the reactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
synapse/app/homeserver.py
Outdated
@@ -401,6 +401,7 @@ def profiled(*args, **kargs): | |||
start_time = clock.time() | |||
|
|||
stats = {} | |||
stats_process = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you using an array here?
Wouldn't it possible to just use an assignment?
The loop will cause multiple entries to be overwritten either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting set from inside a local function, so an assignment won't work, e.g.:
stats_proc = None
def get_proc():
stats_proc = object() # This creates a new local variable
get_proc()
assert stats_proc is None
Though I think you're right that its a bit unclear at the moment, we generally write:
# Contains the psutil.Process once we've fetched it.
stats_proc = [None]
def get_proc():
stats_proc[0] = object()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh good to know. I expected python to behave like other languages...
then nvm
synapse/app/homeserver.py
Outdated
logger.warn( | ||
"report_stats enabled but psutil is not installed or incorrect version." | ||
" Disabling reporting of memory/cpu stats." | ||
" Ensuring psutil is available will help matrix track performance changes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/matrix/matrix.org/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thanks
If I'm not mistaken, this doesn't seem to handle people who use workers. Is that intentional/not a concern? |
So the logic behind having the processes as an array and loop (even if we only have one entry at the moment) over them to sum up the values is that the times when we do split workers out into other processes (such as for the main matrix.org server and elsewhere) we should pass those PIDs around to monitor the memory/cpu of everything that's cooperating to run the synapse. Doing that is a bit of a bigger job than just gluing in a check for the main synapse worker - we need to make the startup of workers cooperate more with the main python which sends the stats. But in those cases we're also needing to worry about how many workers, what they're doing, it kinda grows a little beyond simple stats. So yeah, I'm aware we're not getting the full picture for those using workers, but I just wanted to get some trivial first cut on the data out there, especially for those running smaller homeserver instances. |
stats["cpu_average"] = 0 | ||
for process in stats_process: | ||
stats["memory_rss"] += process.memory_info().rss | ||
stats["cpu_average"] += int(process.cpu_percent(interval=None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to include the time since the last call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nvm
To help identifying if code changes to performance changes are having an impact to homeservers deployed in the community, include statistics on memory used and average CPU use in the statistics we phone home every 3 hours.