
Increase in calls to get_user_by_id and other strange metrics #7351

Closed

richvdh opened this issue Apr 27, 2020 · 7 comments
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication)

Comments


richvdh commented Apr 27, 2020

We switched matrix.org to current develop for a couple of hours on 2020-04-24, and saw some surprising changes in the metrics, including a large increase in the rate of get_user_by_id database transactions.

Some of these could be explained by the more accurate metrics reporting introduced in #7203, but there were other changes which merit investigation.

@richvdh richvdh self-assigned this Apr 29, 2020

richvdh commented Apr 29, 2020

To enumerate the things that I'd like to understand:

  • CPU usage by replication_notifier bg task went up from ~8% to ~20%
  • CPU usage by replication-USER_IP bg task went up from ~0.05% to ~2%
  • txn rate for get_user_by_id rose from zero to 20Hz
  • txn rate for should_update_remote_profile_cache_for_user fell from ~40Hz to ~15Hz
  • CPU usage and db transactions of many blocks fell significantly.
  • DB transactions for state._resolve_events rose from ~2.5 to ~10
  • cache lookups for client_ip_last_seen increased from ~60Hz to ~100Hz
  • cache lookups for get_user_by_id increased from ~2Hz to ~50Hz. Hit rate dropped from 75% to 50%.
  • user_ip_cache replication events increased from ~0 to ~40Hz.


richvdh commented Apr 29, 2020

  • CPU usage by replication-USER_IP bg task went up from ~0.05% to ~2%
  • txn rate for get_user_by_id rose from zero to 20Hz
  • cache lookups for client_ip_last_seen increased from ~60Hz to ~100Hz
  • cache lookups for get_user_by_id increased from ~2Hz to ~50Hz. Hit rate dropped from 75% to 50%.
  • user_ip_cache replication events increased from ~0 to ~40Hz.

These are all due to the same thing. It turns out that we weren't correctly processing USER_IP commands previously: ServerReplicationStreamProtocol attempted to call ReplicationStreamer.on_user_ip synchronously, but that is an async function, so the coroutine was never awaited and the handler never actually ran.

There is a warning about this in the logs, but it's lost in the noise:

2020-04-24 16:14:08,657 - twisted - 192 - WARNING - replication-USER_IP-0 - /home/synapse/src/synapse/replication/tcp/protocol.py:474: builtins.RuntimeWarning: coroutine 'ReplicationStreamer.on_user_ip' was never awaited
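
For illustration, here is a minimal, self-contained sketch of that failure mode. The class and handler names below are placeholders rather than Synapse's real code; the point is just that calling an async function without awaiting it creates a coroutine object that never runs, and Python only complains via a RuntimeWarning once that object is garbage-collected:

import asyncio

class Streamer:
    async def on_user_ip(self, user_id: str, ip: str) -> None:
        # In the real code this would record the client IP and kick off the
        # consent-notice checks; here it just prints.
        print("processing USER_IP for", user_id, ip)

def handle_command(streamer: Streamer) -> None:
    # BUG: missing await. This only builds a coroutine object and drops it,
    # so the body of on_user_ip never executes.
    streamer.on_user_ip("@alice:example.com", "203.0.113.1")

async def main() -> None:
    handle_command(Streamer())

asyncio.run(main())
# stderr: RuntimeWarning: coroutine 'Streamer.on_user_ip' was never awaited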

For each USER_IP command, we run the server notices code to consider sending a consent notice to any users who haven't consented to the ToS. 40Hz of USER_IP commands corresponds to 40Hz of get_user_by_id cache lookups, which with a 50% hit rate corresponds to 20Hz of get_user_by_id database transactions. We might want to consider increasing the cache size for get_user_by_id.
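
As a back-of-the-envelope check of those numbers (using the approximate rates quoted above):

# Every USER_IP command triggers a get_user_by_id cache lookup; only misses
# reach the database.
user_ip_rate_hz = 40    # ~40Hz of USER_IP replication commands
cache_hit_rate = 0.5    # observed get_user_by_id hit rate after the change
db_txn_rate_hz = user_ip_rate_hz * (1 - cache_hit_rate)
print(db_txn_rate_hz)   # 20.0, matching the ~20Hz of get_user_by_id transactions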

This seems like it was introduced in #6666.


richvdh commented Apr 29, 2020

  • txn rate for should_update_remote_profile_cache_for_user fell from ~40Hz to ~15Hz

This appears to be a red herring. It seems that this rate falls whenever the process gets busy, and of course we'd deployed during a busy time of the week.


richvdh commented Apr 29, 2020

  • CPU usage and db transactions of many blocks fell significantly.

This is a regression introduced by #7120. Raised as #7370.

@richvdh richvdh changed the title from "Increase in calls to get_user_by_id" to "Increase in calls to get_user_by_id and other strange metrics" Apr 29, 2020

richvdh commented Apr 29, 2020

  • DB transactions for state._resolve_events rose from ~2.5 to ~10

This seems to mean "on average, database transactions done within state._resolve_events took about 10ms to execute, where previously they took about 2.5ms". I'm not convinced this is reliable given #7370.


richvdh commented Apr 29, 2020

  • CPU usage by replication_notifier bg task went up from ~8% to ~20%

This didn't actually begin immediately after deployment: it took about 10 minutes to appear, which coincides with the first of assertion failures like this:

2020-04-24 13:19:03,152 - synapse.replication.tcp.resource - 169 - INFO - replication_notifier-10430 - Failed to handle stream federation
...
Traceback (most recent call last):
  File "/home/synapse/src/synapse/metrics/background_process_metrics.py", line 213, in run
    return (yield result)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/synapse/src/synapse/replication/tcp/resource.py", line 166, in _run_notifier_loop
    updates, current_token, limited = await stream.get_updates()
  File "/home/synapse/src/synapse/replication/tcp/streams/_base.py", line 138, in get_updates
    self.last_token, current_token
  File "/home/synapse/src/synapse/replication/tcp/streams/_base.py", line 163, in get_updates_since
    from_token, upto_token, _STREAM_UPDATE_TARGET_ROW_COUNT,
  File "/home/synapse/src/synapse/replication/tcp/streams/_base.py", line 182, in update_function
    assert len(updates) <= limit
AssertionError

That is later joined by similar assertion failures for the account_data stream, and then the events stream, which we already know was broken as per #7308. I'm going to track this under #7340; we should probably check the other streams while we are there.
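
For context, a simplified sketch of the contract that assertion enforces (illustrative code, not the real stream update_function signature): when asked for at most limit rows between two stream tokens, the update function must cap its result and report where the caller should resume, otherwise the notifier loop cannot page through the stream correctly.

from typing import List, Tuple

Row = Tuple[int, str]  # (stream position, payload), purely illustrative

def get_updates_since(rows: List[Row], from_token: int, upto_token: int, limit: int):
    updates = [r for r in rows if from_token < r[0] <= upto_token]
    limited = len(updates) > limit
    if limited:
        updates = updates[:limit]
        upto_token = updates[-1][0]  # the caller resumes from the truncation point
    assert len(updates) <= limit     # the invariant that was tripping for some streams
    return updates, upto_token, limited

rows = [(i, "row %d" % i) for i in range(1, 101)]
updates, resume_token, limited = get_updates_since(rows, 0, 100, limit=25)
# len(updates) == 25, resume_token == 25, limited == True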


richvdh commented Apr 29, 2020

Right, I think that's everything. Things that need fixing are #7340 and #7370.

@richvdh richvdh closed this as completed Apr 29, 2020
@richvdh richvdh added the A-Workers Problems related to running Synapse in Worker Mode (or replication) label Feb 16, 2022