Replace HTTP replication with TCP replication (Server side part) #2082

Merged
erikjohnston merged 17 commits into develop from erikj/repl_tcp_server on Apr 4, 2017

2 participants
Owner

erikjohnston commented Mar 30, 2017 edited

This is the server side component of #2069, including the requested changes.

Docs rendered

erikjohnston added some commits Mar 27, 2017

Add new storage functions for new replication
The new replication protocol will keep all the streams separate, rather
than muxing multiple streams into one.
Initial TCP protocol implementation
This defines the low level TCP replication protocol
Add functions to presence to support remote syncs
The TCP replication protocol streams deltas of who has started or
stopped syncing. This is different from the HTTP API which periodically
sends the full list of users who are syncing. This commit adds support
for the new TCP style of sending deltas.
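
As a rough illustration of the difference this commit message describes, here is a minimal sketch; it is not the code from this PR. The class and method names (`SyncTracker`, `apply_full_list`, `apply_delta`) are hypothetical and exist only for this example.

```python
class SyncTracker:
    """Tracks which users are syncing on one worker (hypothetical helper)."""

    def __init__(self):
        self.syncing = set()

    def apply_full_list(self, user_ids):
        """HTTP style: the worker periodically resends the complete list,
        so the receiver has to diff it against what it already knows."""
        new = set(user_ids)
        started = new - self.syncing
        stopped = self.syncing - new
        self.syncing = new
        return started, stopped

    def apply_delta(self, user_id, is_syncing):
        """TCP style: the worker sends a single start/stop change."""
        if is_syncing:
            self.syncing.add(user_id)
        else:
            self.syncing.discard(user_id)
```

With deltas the receiver does no diffing, but it has to cope with a worker disconnecting without sending the matching "stopped" updates.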
synapse/notifier.py
+ """Returns a deferred which resolves when there is new data for
+ replication to handle.
+ """
+ return self.replication_deferred.observe()
@erikjohnston

erikjohnston Mar 30, 2017

Owner

It may be nicer to let the replication resource register a callback rather than bouncing through a deferred?

@richvdh

richvdh Mar 30, 2017

Member

yes, I think it would. it may be more efficient too.

@richvdh

richvdh Mar 30, 2017

Member

otherwise I think it needs a make_deferred_yieldable.
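
For reference, the callback approach suggested in this thread could look roughly like the sketch below. It is a simplified, Twisted-free illustration under assumed names: `add_replication_callback`, `notify_replication`, `ReplicationStreamer` and `on_notifier_poke` are hypothetical here and are not taken from the diff.

```python
class Notifier:
    """Minimal stand-in for the notifier (illustrative only)."""

    def __init__(self):
        self._replication_callbacks = []

    def add_replication_callback(self, cb):
        # The replication code registers itself once, instead of
        # repeatedly observing a deferred in a loop.
        self._replication_callbacks.append(cb)

    def notify_replication(self):
        # Called whenever there is new data for replication to handle.
        for cb in self._replication_callbacks:
            cb()


class ReplicationStreamer:
    """Hypothetical consumer that is poked on every new piece of data."""

    def __init__(self, notifier):
        self.pokes = 0
        notifier.add_replication_callback(self.on_notifier_poke)

    def on_notifier_poke(self):
        # A real handler would check the streams for updates here.
        self.pokes += 1
```

Because the notifier calls the handler directly, no deferred is created per wake-up and there is no logcontext hand-off to get wrong.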

+ connections if there are.
+
+ This should get called each time new data is available, even if it
+ is currently being executed, so that nothing gets missed
@erikjohnston

erikjohnston Mar 30, 2017

Owner

Maybe it would be a lot clearer if this function looped until there were no changes? Although it is less immediately obvious that that is correct.

It may be more obvious if we can register this function directly with the notifier, rather than bouncing via a deferred in a loop.

@richvdh

richvdh Mar 30, 2017

Member

+1 to registering a callback
-1 to looping until there are no changes. It will complicate this function and I don't think it will make anything clearer.

erikjohnston added some commits Mar 30, 2017

docs/tcp_replication.rst
+TCP Replication
+===============
+
+This describes the TCP replication protocol that replaces the HTTP protocol.
@richvdh

richvdh Mar 30, 2017

Member

get rid of this line, it's too vague to be useful

docs/tcp_replication.rst
+Motivation
+----------
+
+The HTTP API used long poll from the workers to the master, this has the problem
@richvdh

richvdh Mar 30, 2017

Member

This paragraph is going to look out of date real soon. I would go straight for:

Previously the workers used an HTTP long poll mechanism to get updates from the master, which had the problem of causing a lot of duplicate work on the server. This TCP protocol replaces those APIs with the aim of increased efficiency.

[or something]

docs/tcp_replication.rst
+--------
+
+The protocol is based on fire and forget, line based commands. An example flow
+would be (where '>' indicates master->worker and '<' worker->master flows)::
@richvdh

richvdh Mar 30, 2017

Member

can you find some other way of writing "master->worker" so that github doesn't put a linebreak in the middle of "->"

docs/tcp_replication.rst
+The example shows the server accepting a new connection and sending its identity
+with the ``SERVER`` command, followed by the client asking to subscribe to the
+``events`` stream from the token ``53``. The server then periodically sends ``RDATA``
+commands which have the format ``RDATA <stream_name> <token> <row>```, where the
@richvdh

richvdh Mar 30, 2017

Member

excess `
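
To make the ``RDATA <stream_name> <token> <row>`` format from the excerpt above concrete, here is a minimal parsing sketch. It assumes the token is an integer and the row is JSON, which is how the examples in this PR look but is not confirmed by the quoted text; `parse_rdata` and `Rdata` are hypothetical names, not part of the PR.

```python
import json
from collections import namedtuple

# Hypothetical container for one parsed RDATA line.
Rdata = namedtuple("Rdata", ["stream_name", "token", "row"])


def parse_rdata(line):
    """Parse one line of the form 'RDATA <stream_name> <token> <row>'.

    Returns None for blank lines, which the protocol says are ignored.
    Assumes an integer token and a JSON-encoded row.
    """
    line = line.strip()
    if not line:
        return None

    command, _, rest = line.partition(" ")
    if command != "RDATA":
        raise ValueError("not an RDATA line: %r" % (line,))

    # The row may itself contain spaces, so split at most twice.
    stream_name, token, row_json = rest.split(" ", 2)
    return Rdata(stream_name, int(token), json.loads(row_json))
```

Splitting at most twice keeps the row intact even when the encoded row contains spaces.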

+
+Blank lines are ignored.
+
+
@richvdh

richvdh Mar 30, 2017

Member

it would be nice to give a complete list of the commands here, with the command syntax, the direction of transmission, a quick summary and a reference to the section where it is explained in more detail.

docs/tcp_replication.rst
+Reliability
+~~~~~~~~~~~
+
+In general the replication stream should be consisdered an unreliable transport
@richvdh

richvdh Mar 30, 2017

Member

consisdered

synapse/handlers/presence.py
+ updates.append(prev_state.copy_and_replace(
+ last_user_sync_ts=time_now_ms,
+ ))
+ process_presence.discard(user_id)
@richvdh

richvdh Mar 30, 2017

Member

we'll hit this if is_syncing and user_id in process_presence. I think it's incorrect.

+ if updates:
+ yield self._update_states(updates)
+
+ self.external_process_last_updated_ms[process_id] = self.clock.time_msec()
@richvdh

richvdh Mar 30, 2017

Member

just use time_now_ms ?


synapse/replication/tcp/__init__.py
+ > RDATA events 55 ["$foo4:bar.com", ...]
+
+The example shows the server accepting a new connection and sending its identity
+with the `SERVER` command, followed by the client asking to subscribe to the
@richvdh

richvdh Mar 30, 2017

Member

I'm not going to insist you go through and change them, but for future reference: AIUI docstrings are interpreted as RST and really ought to have double-backticks.

synapse/replication/tcp/__init__.py
+ * resource.py - the server classes that accepts and handle client connections
+ * streams.py - the definitons of all the valid streams
+
+Further details can be found in docs/tcp_replication.rst
@richvdh

richvdh Mar 30, 2017

Member

suggest moving this up a bit to where you describe the protocol. I nearly missed it here.

Tbh I'm not sure you actually need the description of the protocol in this file - just referring to the doc would be fine imho

synapse/replication/tcp/streams.py
+ True then limit is provided, otherwise it's not.
+
+ Returns:
+ list(tuple): the first entry in the tuple is the token for that
@richvdh

richvdh Mar 30, 2017

Member

If it is valid for this function to return a Deferred, say so.

synapse/storage/pusher.py
+ """Get all the pushers that have changed between the given tokens.
+
+ Returns:
+ list(tuple): each tuple consists of:
@richvdh

richvdh Mar 30, 2017

Member

No, it returns a Deferred, which resolves to that lot.

(I'd prefer it if we even put this on @defer.inlineCallbacks, but I won't insist there because it's "obvious". Here though, there is no clue that it actually returns a Deferred, rather than a list.)

@richvdh richvdh assigned erikjohnston and unassigned richvdh Mar 30, 2017

erikjohnston added some commits Mar 31, 2017

Owner

erikjohnston commented Mar 31, 2017 edited

I think I've addressed all the issues raised.

I also cheekily added a <last_sync_ms> to the USER_SYNC command (which is the last sync time), as I think a) it's more correct and b) it is required if we want to batch up the USER_SYNC stuff on the workers to reduce traffic.

erikjohnston added some commits Mar 31, 2017

Add a timestamp to USER_SYNC command
This timestamp is used to indicate when the user last sync'd

@erikjohnston erikjohnston assigned richvdh and unassigned erikjohnston Apr 3, 2017

richvdh approved these changes Apr 3, 2017

lgtm, modulo the conflict

@richvdh richvdh assigned erikjohnston and unassigned richvdh Apr 3, 2017

@erikjohnston erikjohnston merged commit 27cc627 into develop Apr 4, 2017

8 checks passed

Sytest Dendron (Commit): Build #1889 origin/erikj/repl_tcp_server succeeded in 13 min
Sytest Dendron (Merged PR): Build finished.
Sytest Postgres (Commit): Build #2717 origin/erikj/repl_tcp_server succeeded in 8 min 12 sec
Sytest Postgres (Merged PR): Build finished.
Sytest SQLite (Commit): Build #2789 origin/erikj/repl_tcp_server succeeded in 7 min 2 sec
Sytest SQLite (Merged PR): Build finished.
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

psaavedra added a commit to psaavedra/synapse that referenced this pull request May 19, 2017

Merge tag 'v0.21.0' into v0.21.0_no_federate_by_default
Changes in synapse v0.21.0 (2017-05-18)
=======================================

No changes since v0.21.0-rc3

Changes in synapse v0.21.0-rc3 (2017-05-17)
===========================================

Features:

* Add per user rate-limiting overrides (PR #2208)
* Add config option to limit maximum number of events requested by ``/sync``
  and ``/messages`` (PR #2221) Thanks to @psaavedra!

Changes:

* Various small performance fixes (PR #2201, #2202, #2224, #2226, #2227, #2228,
  #2229)
* Update username availability checker API (PR #2209, #2213)
* When purging, don't de-delta state groups we're about to delete (PR #2214)
* Documentation to check synapse version (PR #2215) Thanks to @hamber-dick!
* Add an index to event_search to speed up purge history API (PR #2218)

Bug fixes:

* Fix API to allow clients to upload one-time-keys with new sigs (PR #2206)

Changes in synapse v0.21.0-rc2 (2017-05-08)
===========================================

Changes:

* Always mark remotes as up if we receive a signed request from them (PR #2190)

Bug fixes:

* Fix bug where users got pushed for rooms they had muted (PR #2200)

Changes in synapse v0.21.0-rc1 (2017-05-08)
===========================================

Features:

* Add username availability checker API (PR #2183)
* Add read marker API (PR #2120)

Changes:

* Enable guest access for the 3pl/3pid APIs (PR #1986)
* Add setting to support TURN for guests (PR #2011)
* Various performance improvements (PR #2075, #2076, #2080, #2083, #2108,
  #2158, #2176, #2185)
* Make synctl a bit more user friendly (PR #2078, #2127) Thanks @APwhitehat!
* Replace HTTP replication with TCP replication (PR #2082, #2097, #2098,
  #2099, #2103, #2014, #2016, #2115, #2116, #2117)
* Support authenticated SMTP (PR #2102) Thanks @DanielDent!
* Add a counter metric for successfully-sent transactions (PR #2121)
* Propagate errors sensibly from proxied IS requests (PR #2147)
* Add more granular event send metrics (PR #2178)

Bug fixes:

* Fix nuke-room script to work with current schema (PR #1927) Thanks
  @zuckschwerdt!
* Fix db port script to not assume postgres tables are in the public schema
  (PR #2024) Thanks @jerrykan!
* Fix getting latest device IP for user with no devices (PR #2118)
* Fix rejection of invites to unreachable servers (PR #2145)
* Fix code for reporting old verify keys in synapse (PR #2156)
* Fix invite state to always include all events (PR #2163)
* Fix bug where synapse would always fetch state for any missing event (PR #2170)
* Fix a leak with timed out HTTP connections (PR #2180)
* Fix bug where we didn't time out HTTP requests to ASes (PR #2192)

Docs:

* Clarify doc for SQLite to PostgreSQL port (PR #1961) Thanks @benhylau!
* Fix typo in synctl help (PR #2107) Thanks @HarHarLinks!
* ``web_client_location`` documentation fix (PR #2131) Thanks @matthewjwolff!
* Update README.rst with FreeBSD changes (PR #2132) Thanks @feld!
* Clarify setting up metrics (PR #2149) Thanks @encks!

@erikjohnston erikjohnston deleted the erikj/repl_tcp_server branch Oct 26, 2017
