This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add basic performance statistics to phone home #3044

Merged
merged 3 commits into from Apr 4, 2018

Conversation

michaelkaye
Contributor

To help identify whether performance-related code changes are having an impact on homeservers deployed in the community, include statistics on memory used and average CPU use in the statistics we phone home every 3 hours.

This requires the psutil module, and is still opt-in based on the report_stats
config option.
@michaelkaye
Contributor Author

Should probably go out in combination with #3041

Member

@erikjohnston erikjohnston left a comment

Broadly looks good, just a few nits.

We would appreciate it if you could assist by ensuring this module is available
and ``report_stats`` is enabled. This will let us see if performance changes to
synapse are having an impact to the general community.

Member

This should probably go in the changelog IMO, otherwise a) we'll forget to update $NEXT_VERSION and b) I think most people look at CHANGELOG

Contributor Author

We should probably then move the historical information into the changelog and leave UPDATING as just the top level howto, referring to the changelog for any specific requirements on upgrade? Just so we don't split the idea of "upgrade notes" across multiple places.

stats["memory_rss"] = 0
stats["cpu_average"] = 0
for process in stats_process:
    with process.oneshot():
Member

This doesn't exist in older versions of psutil, alas. In theory we support any version of psutil >= 2, and in practice I think the Debian package (for jessie) relies on being able to use v4.x.

Contributor Author

👍
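
One way to stay compatible with psutil releases that lack `Process.oneshot()` (as far as I know it was introduced in psutil 5.0) is to fall back to a no-op context manager. The sketch below is an illustration, not the code that was merged: `oneshot_or_noop`, `collect_stats`, `MemInfo` and `FakeOldProcess` are hypothetical names, and the fake stands in for an old `psutil.Process` so the example runs without psutil installed.

```python
import contextlib
from collections import namedtuple


def oneshot_or_noop(process):
    """Use process.oneshot() when it exists, otherwise a no-op context."""
    oneshot = getattr(process, "oneshot", None)
    if oneshot is None:
        # Older psutil: each stat is simply fetched with its own lookup.
        return contextlib.nullcontext()
    return oneshot()


def collect_stats(processes):
    """Sum resident memory and CPU percentage over process-like objects."""
    stats = {"memory_rss": 0, "cpu_average": 0}
    for process in processes:
        with oneshot_or_noop(process):
            stats["memory_rss"] += process.memory_info().rss
            stats["cpu_average"] += int(process.cpu_percent(interval=None))
    return stats


# Hypothetical stand-in for a psutil.Process from a release without oneshot().
MemInfo = namedtuple("MemInfo", ["rss"])


class FakeOldProcess:
    def memory_info(self):
        return MemInfo(rss=100)

    def cpu_percent(self, interval=None):
        return 12.7


example_stats = collect_stats([FakeOldProcess(), FakeOldProcess()])
```

`contextlib.nullcontext` needs Python 3.7 or later; on older interpreters a small hand-written context manager would play the same role.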

if hs.config.report_stats:
    logger.info("Scheduling stats reporting for 3 hour intervals")
    clock.looping_call(phone_stats_home, 3 * 60 * 60 * 1000)

    # We need to defer this init for the cases that we daemonize
    # otherwise the process ID we get is that of the non-daemon process
    clock.call_later(15, performance_stats_init)
Member

Using 0 works to defer it to the next reactor tick, which is probably nicer than arbitrarily waiting 15 seconds

Member

The other option is to extend _base.start_reactor so that it accepts a function to run once the reactor is running.

Contributor Author

👍
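
The effect of a zero delay can be shown with the standard library's asyncio loop, used here only as a stand-in for the Twisted reactor that synapse actually runs on. In this sketch (the function name `run_with_deferred_init` and the event strings are made up for illustration), a callback scheduled with a delay of 0 does not run until the loop starts, so it fires after all synchronous startup work has finished.

```python
import asyncio


def run_with_deferred_init():
    """Record the order in which startup work and a deferred callback run."""
    events = []
    loop = asyncio.new_event_loop()
    try:
        # Scheduled with zero delay, but nothing runs until the loop does.
        loop.call_later(0, events.append, "performance_stats_init")

        # Synchronous startup work (in the real server this is where the
        # process would daemonize) still happens first.
        events.append("startup finished")

        # Stop the loop once the pending zero-delay callbacks have run.
        loop.call_later(0, loop.stop)
        loop.run_forever()
    finally:
        loop.close()
    return events


order = run_with_deferred_init()
```

The same ordering is what makes a zero delay sufficient in the review suggestion: the init function only needs to run after startup has completed, not a fixed number of seconds later.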

@@ -401,6 +401,7 @@ def profiled(*args, **kargs):
start_time = clock.time()

stats = {}
stats_process = []
Contributor

Why are you using an array here?
Wouldn't it be possible to just use an assignment?
The loop will cause multiple entries to be overwritten either way.

Member

This is getting set from inside a local function, so an assignment won't work, e.g.:

stats_proc = None

def get_proc():
    stats_proc = object()  # This creates a new local variable

get_proc()

assert stats_proc is None

Though I think you're right that it's a bit unclear at the moment. We generally write:

# Contains the psutil.Process once we've fetched it.
stats_proc = [None]

def get_proc():
    stats_proc[0] = object()

Contributor

Ohh good to know. I expected python to behave like other languages...
then nvm
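
For code that only needs to run on Python 3, the `nonlocal` statement lets a nested function rebind a variable in its enclosing function, which removes the need for the one-element list. It is not available on Python 2, so it was not an option for a codebase that still had to run there. A minimal sketch, with `make_tracker`, `set_proc` and `get_proc` as hypothetical names:

```python
def make_tracker():
    """Return (set_proc, get_proc) closures sharing one enclosed variable."""
    stats_proc = None

    def set_proc():
        # Without this declaration the assignment below would create a new
        # local variable and leave the enclosing stats_proc untouched.
        nonlocal stats_proc
        stats_proc = object()

    def get_proc():
        return stats_proc

    return set_proc, get_proc


set_proc, get_proc = make_tracker()
before = get_proc()
set_proc()
after = get_proc()
```

Note that `nonlocal` only applies to variables of an enclosing function; for a module-level name the equivalent declaration is `global`.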

logger.warn(
    "report_stats enabled but psutil is not installed or incorrect version."
    " Disabling reporting of memory/cpu stats."
    " Ensuring psutil is available will help matrix track performance changes"
Contributor

s/matrix/matrix.org/

Contributor Author

👍 thanks

@turt2live
Member

If I'm not mistaken, this doesn't seem to handle people who use workers. Is that intentional/not a concern?

@michaelkaye
Contributor Author

The logic behind keeping the processes in an array and looping over them to sum the values (even though there is only one entry at the moment) is that, when workers are split out into other processes (as on the main matrix.org server and elsewhere), we should pass those PIDs around so we can monitor the memory and CPU of everything that cooperates to run synapse.

Doing that is a bigger job than just gluing in a check for the main synapse worker: the startup of workers needs to cooperate more with the main Python process, which sends the stats. In those cases we also need to worry about how many workers there are and what they're doing, so it grows a little beyond simple stats.

So yeah, I'm aware we're not getting the full picture for those using workers, but I just wanted to get some trivial first cut on the data out there, especially for those running smaller homeserver instances.

stats["cpu_average"] = 0
for process in stats_process:
    stats["memory_rss"] += process.memory_info().rss
    stats["cpu_average"] += int(process.cpu_percent(interval=None))
Member

Do we need to include the time since the last call?

Member

Oh, nvm
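
The likely reason no explicit interval is needed: with `interval=None`, psutil's `cpu_percent()` returns immediately and reports usage over the time elapsed since the previous call on that process object (the very first call returns 0.0 because there is no baseline yet). Called once per report, that amounts to an average over the reporting period. The class below is a rough, stdlib-only model of that bookkeeping, not psutil's implementation; `CpuPercentTracker` and its injectable clocks are assumptions made for the example.

```python
import time


class CpuPercentTracker:
    """Report CPU usage as a percentage of wall time since the last call."""

    def __init__(self, cpu_clock=time.process_time, wall_clock=time.monotonic):
        self._cpu_clock = cpu_clock
        self._wall_clock = wall_clock
        self._last = None  # (cpu_seconds, wall_seconds) at the previous call

    def cpu_percent(self):
        now = (self._cpu_clock(), self._wall_clock())
        last, self._last = self._last, now
        if last is None:
            # No baseline yet, so there is nothing meaningful to report.
            return 0.0
        cpu_delta = now[0] - last[0]
        wall_delta = now[1] - last[1]
        if wall_delta <= 0:
            return 0.0
        return 100.0 * cpu_delta / wall_delta


# Fake clocks: 1.5 s of CPU time consumed over 10 s of wall time.
cpu_ticks = iter([1.0, 2.5])
wall_ticks = iter([10.0, 20.0])
tracker = CpuPercentTracker(
    cpu_clock=lambda: next(cpu_ticks),
    wall_clock=lambda: next(wall_ticks),
)
first = tracker.cpu_percent()
second = tracker.cpu_percent()
```

Because the baseline is stored on the tracker itself, the caller never has to pass in the elapsed time.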

@richvdh richvdh merged commit a89f9f8 into develop Apr 4, 2018
@DMRobertson DMRobertson deleted the michaelk/performance_stats branch June 28, 2022 11:19
5 participants