Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit the number of EDUs in transactions to 100 as expected by receiver #5138

Merged
merged 7 commits into from May 9, 2019

Conversation

2 participants
@superboum
Copy link
Contributor

commented May 4, 2019

Goal

I have observed this problem on my server:

2019-05-04 06:16:57,274 - synapse.federation.sender.transaction_manager - 87 - INFO - federation_transaction_transmission_loop-678370- TX [matrix.livda.fr] {1556878459543} Sending transaction [1556878459543], (PDUs: 0, EDUs: 103)
2019-05-04 06:16:57,338 - synapse.http.matrixfederationclient - 356 - INFO - federation_transaction_transmission_loop-678370- {PUT-O-448448} [matrix.livda.fr] Sending request: PUT matrix://matrix.livda.fr/_matrix/federation/v1/send/1556878459543; timeout 60.000000s
2019-05-04 06:16:57,339 - synapse.http.federation.matrix_federation_agent - 373 - INFO - federation_transaction_transmission_loop-678370- Connecting to matrix.livda.fr:443
[...]
2019-05-04 06:17:08,409 - synapse.http.matrixfederationclient - 388 - INFO - federation_transaction_transmission_loop-678370- {PUT-O-448448} [matrix.livda.fr] Got response headers: 400 Bad Request
2019-05-04 06:17:08,410 - synapse.http.matrixfederationclient - 472 - WARNING - federation_transaction_transmission_loop-678370- {PUT-O-448448} [matrix.livda.fr] Request failed: PUT matrix://matrix.livda.fr/_matrix/federation/v1/send/1556878459543: HttpResponseException("400: b'Bad Request'",)
2019-05-04 06:17:08,411 - synapse.federation.sender.transaction_manager - 123 - INFO - federation_transaction_transmission_loop-678370- TX [matrix.livda.fr] {1556878459543} got 400 response

The corresponding issue seems to be Synapse is not limiting the number of EDUs in a transaction #3951.
It is a blocker since Reject large transactions on federation #4513 as it blocks the federation between two servers.
In my case, the bug was caused by device_message_edus that was containing 103 entries.

I don't know synapse internals and just wrote this quick fix to bring back federations. I might have written some wrong/dangerous things.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file
  • Pull request includes a sign off

superboum added some commits May 4, 2019

Limit EDUs number to 100 in a transaction
Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
Add a changelog file
Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>

@superboum superboum marked this pull request as ready for review May 4, 2019

@richvdh
Copy link
Member

left a comment

Thanks for this! A couple of thoughts below.

@@ -217,17 +217,21 @@ def _transaction_transmission_loop(self):
pending_edus = []

pending_edus.extend(self._get_rr_edus(force_flush=False))
pending_edus.extend(device_message_edus[:99])

This comment has been minimized.

Copy link
@richvdh

richvdh May 6, 2019

Member

I don't think we can safely discard the other device_message_edus ?

pending_edus.extend(self._pop_pending_edus(100 - len(pending_edus)))
# But we should keep one slot free for presence
to_remove = []
for key, val in self._pending_edus_keyed.items():

This comment has been minimized.

Copy link
@richvdh

richvdh May 6, 2019

Member

I think this would be clearer written as a while loop:

while len(pending_edus) < 100:
    _, val = self._pending_edus_keyed.popitem()
    pending_edus.append(val)
@superboum

This comment has been minimized.

Copy link
Contributor Author

commented May 7, 2019

Thanks for your reply. I will work on it tonight.
However, feel free to directly edit the PR if you need to apply this fix quickly.

Build pending_edus without black holing EDUs
Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>

@superboum superboum force-pushed the superboum:pr_edus_synapse branch from 1c45efc to a533acc May 7, 2019

contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id
last_device_list = self._last_device_list_stream_id
now_stream_id, results = yield self._store.get_devices_by_remote(

This comment has been minimized.

Copy link
@superboum

superboum May 7, 2019

Author Contributor

It seems that get_devices_by_remote has a hard coded limit of 20 elements.
So we start with this request to be able to complete with the following one (this is why I swapped get_devices_by_remote and get_new_device_msgs_for_remote).

This comment has been minimized.

Copy link
@richvdh

richvdh May 8, 2019

Member

makes sense. I think it would be helpful to include a comment that says get_devices_by_remote will return no more than 20 entries.

Remove wrong comment
Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
last_device_stream_id = self._last_device_stream_id
to_device_stream_id = self._store.get_to_device_stream_token()
contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id, 98 - len(edus)

This comment has been minimized.

Copy link
@superboum

superboum May 7, 2019

Author Contributor

Here we can set a limit to fill the edus list. I chose 98 instead of 100 as presence and rr_edus could both take one slot after that.

This comment has been minimized.

Copy link
@richvdh

richvdh May 8, 2019

Member

I think, rather than hardcode 98 here, it would be clearer to make _get_new_device_messages take a limit parameter.

This comment has been minimized.

Copy link
@richvdh

richvdh May 8, 2019

Member

Also: the default value for the limit parameter on get_new_device_msgs_for_remote is now redundant. I suggest removing the default value for clarity.

@superboum

This comment has been minimized.

Copy link
Contributor Author

commented May 7, 2019

I have 4 failing unit tests for now (tox -e py37) :

tests.test_visibility.FilterEventsForServerTestCase.test_large_room
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/quentin/Documents/dev/synapse/tests/federation/test_federation_sender.py", line 46, in test_send_receipts
    mock_send_transaction.assert_called_once()
  File "/home/quentin/Documents/dev/synapse/.tox/py37/lib/python3.7/site-packages/mock/mock.py", line 918, in assert_called_once
    raise AssertionError(msg)
builtins.AssertionError: Expected 'send_transaction' to have been called once. Called 0 times.

tests.federation.test_federation_sender.FederationSenderTestCases.test_send_receipts
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/quentin/Documents/dev/synapse/tests/federation/test_federation_sender.py", line 81, in test_send_receipts_with_backoff
    mock_send_transaction.assert_called_once()
  File "/home/quentin/Documents/dev/synapse/.tox/py37/lib/python3.7/site-packages/mock/mock.py", line 918, in assert_called_once
    raise AssertionError(msg)
builtins.AssertionError: Expected 'send_transaction' to have been called once. Called 0 times.

tests.federation.test_federation_sender.FederationSenderTestCases.test_send_receipts_with_backoff
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/quentin/Documents/dev/synapse/tests/handlers/test_typing.py", line 195, in test_started_typing_remote_send
    try_trailing_slash_on_400=True,
  File "/home/quentin/Documents/dev/synapse/.tox/py37/lib/python3.7/site-packages/mock/mock.py", line 955, in assert_called_once_with
    raise AssertionError(msg)
builtins.AssertionError: Expected 'put_json' to be called once. Called 0 times.

tests.handlers.test_typing.TypingNotificationsTestCase.test_started_typing_remote_send
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/quentin/Documents/dev/synapse/tests/handlers/test_typing.py", line 273, in test_stopped_typing
    try_trailing_slash_on_400=True,
  File "/home/quentin/Documents/dev/synapse/.tox/py37/lib/python3.7/site-packages/mock/mock.py", line 955, in assert_called_once_with
    raise AssertionError(msg)
builtins.AssertionError: Expected 'put_json' to be called once. Called 0 times.

tests.handlers.test_typing.TypingNotificationsTestCase.test_stopped_typing
-------------------------------------------------------------------------------
Ran 635 tests in 471.275s

FAILED (skips=1, errors=4, successes=630)

I don't know if its due to my modifications. I will check that tomorrow.

@richvdh richvdh self-requested a review May 8, 2019

@richvdh richvdh added this to In progress in Homeserver Task Board via automation May 8, 2019

@richvdh
Copy link
Member

left a comment

This is looking great. Thanks again for your work on it. Suggestion below which I think will fix the test failures, and a few other suggestions for minor cleanups.

pending_edus.extend(device_message_edus)
pending_edus.extend(self._pop_pending_edus(100 - len(pending_edus)))
while len(pending_edus) < 100:
_, val = self._pending_edus_keyed.popitem()

This comment has been minimized.

Copy link
@richvdh

richvdh May 8, 2019

Member

From _trial_temp/test.log when running the tests:

        Traceback (most recent call last):
          File "/home/rav/work/synapse/synapse/federation/sender/per_destination_queue.py", line 244, in _transaction_transmission_loop
            _, val = self._pending_edus_keyed.popitem()
        KeyError: 'popitem(): dictionary is empty'

Apologies, this was my fault. I suggest adding a check that _pending_edus_keyed is not empty to the while condition.

@@ -248,6 +238,12 @@ def _transaction_transmission_loop(self):
)
)

pending_edus.extend(device_message_edus)
pending_edus.extend(self._pop_pending_edus(100 - len(pending_edus)))

This comment has been minimized.

Copy link
@richvdh

richvdh May 8, 2019

Member

this 100 number is starting to appear in a lot of places. Could I suggest creating a constant MAX_EDUS_PER_TRANSACTION in this file?

contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id
last_device_list = self._last_device_list_stream_id
now_stream_id, results = yield self._store.get_devices_by_remote(

This comment has been minimized.

Copy link
@richvdh

richvdh May 8, 2019

Member

makes sense. I think it would be helpful to include a comment that says get_devices_by_remote will return no more than 20 entries.

last_device_stream_id = self._last_device_stream_id
to_device_stream_id = self._store.get_to_device_stream_token()
contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id, 98 - len(edus)

This comment has been minimized.

Copy link
@richvdh

richvdh May 8, 2019

Member

I think, rather than hardcode 98 here, it would be clearer to make _get_new_device_messages take a limit parameter.

last_device_stream_id = self._last_device_stream_id
to_device_stream_id = self._store.get_to_device_stream_token()
contents, stream_id = yield self._store.get_new_device_msgs_for_remote(
self._destination, last_device_stream_id, to_device_stream_id, 98 - len(edus)

This comment has been minimized.

Copy link
@richvdh

richvdh May 8, 2019

Member

Also: the default value for the limit parameter on get_new_device_msgs_for_remote is now redundant. I suggest removing the default value for clarity.

Rework EDUs limitation handling
Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>
@superboum

This comment has been minimized.

Copy link
Contributor Author

commented May 8, 2019

Thanks a lot for your extensive and precise review. Now tests are passing locally.

  • I added a check for self._pending_edus_keyed directly in the while condition
  • I created a constant as suggested and replaced the hardcoded value by the constant.
  • I added the comment to mention that it will not return more than 20 entries (as I am not very familiar with the code, I am not 100% sure it is the case, so if you have any doubt, we should check before merging)
  • I added a limit parameter to _get_new_device_messages instead of hardcoding 98. I call this function with MAX_EDUS_PER_TRANSACTION - 2 and added a comment to explain the - 2 part.
  • I removed the default value for the limit

However I still have a concern, the limit passed to _get_new_device_messages will be enforced only if limit >= 20 : as the first call is hardcoded to 20, it may return 20 values whatever the limit value is. Even worse, it could result in a negative limit in the second call. Can we suppose that it will not be called with a value less than 20? Should we add a check?

@richvdh

This comment has been minimized.

Copy link
Member

commented May 8, 2019

However I still have a concern, the limit passed to _get_new_device_messages will be enforced only if limit >= 20 : as the first call is hardcoded to 20, it may return 20 values whatever the limit value is. Even worse, it could result in a negative limit in the second call. Can we suppose that it will not be called with a value less than 20? Should we add a check?

well, given we've just hardcoded it to be called with 98, I think it's a valid assumption for now.

I think an easy way to sanity-check this would be to add:

assert len(edus) <= limit, "get_devices_by_remote returned too many EDUs"
@richvdh

This comment has been minimized.

Copy link
Member

commented May 8, 2019

as I am not very familiar with the code, I am not 100% sure it is the case, so if you have any doubt, we should check before

[I checked the code. I am satisfied that it will never return more than 20 rows, however I did discover #5153. That is a separate problem, though...]

@superboum

This comment has been minimized.

Copy link
Contributor Author

commented May 8, 2019

To be sure, is the PR blocked on your side (still some review to do, internal process, etc.) or on my side (you expect some modifications, like the assert one even if the assumption is valid for now)?

@richvdh

This comment has been minimized.

Copy link
Member

commented May 8, 2019

If you could add the assertion that would be great.

Add an assert to check that limit has been enforced after calling get…
…_devices_by_remote

Signed-off-by: Quentin Dufour <quentin@deuxfleurs.fr>

@superboum superboum force-pushed the superboum:pr_edus_synapse branch from 74ecfec to 47b8871 May 9, 2019

@richvdh

richvdh approved these changes May 9, 2019

Copy link
Member

left a comment

lgtm!

@richvdh richvdh merged commit 11ea167 into matrix-org:develop May 9, 2019

0 of 4 checks passed

ci/circleci: sytestpy2merged CircleCI is running your tests
Details
ci/circleci: sytestpy2postgresmerged Your tests are queued behind your running builds
Details
ci/circleci: sytestpy3merged CircleCI is running your tests
Details
ci/circleci: sytestpy3postgresmerged CircleCI is running your tests
Details

Homeserver Task Board automation moved this from In progress to Done May 9, 2019

anoadragon453 added a commit that referenced this pull request May 10, 2019

Merge branch 'develop' into anoa/blacklist_ip_ranges
* develop: (45 commits)
  URL preview blacklisting fixes (#5155)
  Revert 085ae34
  Add a DUMMY stage to captcha-only registration flow
  Make Prometheus snippet less confusing on the metrics collection doc (#4288)
  Set syslog identifiers in systemd units (#5023)
  Run Black on the tests again (#5170)
  Add AllowEncodedSlashes to apache (#5068)
  remove instructions for jessie installation (#5164)
  Run `black` on per_destination_queue
  Limit the number of EDUs in transactions to 100 as expected by receiver (#5138)
  Fix bogus imports in tests (#5154)
  add options to require an access_token to GET /profile and /publicRooms on CS API (#5083)
  Do checks on aliases for incoming m.room.aliases events (#5128)
  Remove the requirement to authenticate for /admin/server_version. (#5122)
  Fix spelling in server notices admin API docs (#5142)
  Fix sample config
  0.99.3.2
  include disco in deb build target list
  changelog
  Debian: we now need libpq-dev.
  ...

anoadragon453 added a commit that referenced this pull request May 10, 2019

Merge branch 'develop' into anoa/device_list_update_fixes
* develop:
  Revert 085ae34
  Add a DUMMY stage to captcha-only registration flow
  Make Prometheus snippet less confusing on the metrics collection doc (#4288)
  Set syslog identifiers in systemd units (#5023)
  Run Black on the tests again (#5170)
  Add AllowEncodedSlashes to apache (#5068)
  remove instructions for jessie installation (#5164)
  Run `black` on per_destination_queue
  Limit the number of EDUs in transactions to 100 as expected by receiver (#5138)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.