This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sync current token may be ahead of event stream cache position using workers #14158

Open
Fizzadar opened this issue Oct 12, 2022 · 21 comments
Labels
  • A-Sync: defects related to /sync
  • A-Workers: Problems related to running Synapse in Worker Mode (or replication)
  • O-Uncommon: Most users are unlikely to come across this or unexpected workflow
  • S-Minor: Blocks non-critical functionality, workarounds exist.
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Fizzadar
Contributor

Based on my initial work investigating sync cache races in: #14154

Sync makes use of the event stream cache to determine whether a room has changed between the since and current tokens. This is then used to limit the set of rooms that events are queried for in get_room_events_stream_for_rooms. After discovering the cache invalidation races above I added a quick log line for this: beeper/synapse@62497db (after beeper/synapse@5297155).
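
For readers less familiar with this path, a minimal sketch of the kind of check involved (names and signatures simplified, not the exact Synapse code), which also shows where a stale cache position bites:

# Hypothetical sketch: pick which rooms to query events for in an incremental
# sync. If the stream change cache's max position lags behind the current
# token, a room that changed inside that gap can be wrongly reported as
# unchanged and silently skipped.
def rooms_to_query(stream_cache, joined_room_ids, since_token):
    return [
        room_id
        for room_id in joined_room_ids
        # True if the room may have changed after since_token (or the cache
        # cannot rule it out); False means "definitely unchanged"... unless
        # the cache has not yet caught up with the current token.
        if stream_cache.has_entity_changed(room_id, since_token)
    ]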

And it logs! There have been only a very small handful of occurrences over the last ~5 days, and the position difference has so far been 1 every time. I suspect this may also occur with other stream caches but have not confirmed.

The worry here is that if an event was sent within the gap it may be missed from an incremental sync, which is especially bad because the user will never see or know about the event unless they re-initial-sync (or the client backfills it?).

One solution to this is to implement a waiting mechanism on StreamCache, so a worker can wait for the cache to catch up with the current token for a given sync before fetching data. Because this is super rare, and even when it happens it's a tiny position difference, this would probably have negligible impact on sync performance while providing a shield against cache invalidation races over replication.
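
For illustration only, a minimal sketch of such a wait, assuming a hypothetical accessor for the cache's max known position (polling is just the simplest possible approach, not a proposal for the final shape):

import asyncio

# Hypothetical helper, not existing Synapse API: block briefly until the
# stream change cache has seen positions up to `token`, so sync doesn't
# answer based on a cache that is behind the current token.
async def wait_for_cache_position(get_cache_max_pos, token, timeout=5.0, poll=0.05):
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while get_cache_max_pos() < token:
        if loop.time() > deadline:
            raise TimeoutError(f"stream change cache did not reach position {token}")
        await asyncio.sleep(poll)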

@DMRobertson added the A-Sync, S-Minor, T-Defect, O-Uncommon and A-Workers labels on Oct 14, 2022
@Fizzadar
Contributor Author

Fizzadar commented Oct 21, 2022

Confirmed via the log change above: this is actually not that rare (😱). We have large bursts of these in sync workers:

Checking entity in DeviceInboxStreamChangeCache stream cache cache with
    current position 42696041 ahead of  the max stream cache position 42696040

There are even more on federation sender instances for DeviceFederationOutboxStreamChangeCache - pretty much a constant flow of these:

Checking entity in DeviceFederationOutboxStreamChangeCache stream cache cache with
    current position 42696140 ahead of the max stream cache position 42696092

Finally, the background worker has a constant (1/s) flow of such errors on the _curr_state_delta_stream_cache cache, where the cache often appears to be lagging significantly:

Checking entity in _curr_state_delta_stream_cache stream cache cache with
    current position 540123283 ahead of the max stream cache position 540123279

@Fizzadar
Contributor Author

Implemented wait-for fix option: #14269

@richvdh
Member

richvdh commented Oct 24, 2022

err, interesting. That doesn't sound good.

There's even more in federation sender instances DeviceFederationOutboxStreamChangeCache, pretty much a constant flow of these:

Checking entity in DeviceFederationOutboxStreamChangeCache stream cache cache with
    current position 42696140 ahead of the max stream cache position 42696092

This sounds very much like it could be the cause of #14251 and other device-list-tracking bugs we've been tearing our hair out over.

For those of us struggling to keep up at the back: are you able to explain the mechanics of this race? How does the "current position" come to be ahead of the "max stream cache position"?

I wasn't able to find where exactly you introduced these log lines - can you point me to a relevant diff?

@Fizzadar
Contributor Author

Diff for the logging code: #14239

Essentially it seems to boil down to:

  • Something gets written to the database using a new token (get_next) out of an ID generator
  • At that point, get_current_token will immediately point at that token
  • But at that point the stream cache may not yet have been notified about the change (sketched below), for one of two reasons:
    • no call to entity_has_changed is made after incrementing the token (I only noticed this yesterday and made a provisional fix for event persistence in 9b5ef68); I believe this would affect both worker and monolithic installs
    • replication lag: the time taken for the change to propagate from the writer to other workers, where they call entity_has_changed
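
A rough sketch of the race those bullets describe (method names simplified, not the real Synapse signatures):

# Illustrative only - the timing window between advancing a stream ID and
# telling the stream change cache about the change.

def writer_side(id_gen, db, stream_cache, room_id):
    stream_id = id_gen.get_next()       # 1. allocate a new stream position
    db.write_event(room_id, stream_id)  # 2. persist the row
    # --- window: a reader's get_current_token() may already include stream_id,
    #     but the stream change cache has not yet been told about the change
    #     (either because entity_has_changed was never called, or because the
    #     notification is still in flight over replication).
    stream_cache.entity_has_changed(room_id, stream_id)  # 3. too late for a sync in the window

def reader_side(id_gen, stream_cache, room_id, since_token):
    current = id_gen.get_current_token()
    # Inside the window above, this can return False even though the room
    # changed between since_token and current.
    return stream_cache.has_entity_changed(room_id, since_token)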

Definitely seems likely this would trigger the device list issues as well.

@erikjohnston
Member

At that point, get_current_token will immediately point at that token

get_current_token should only return the new token when we've finished persisting that token (and it gets more complicated with multi-writer). So, as long as we poke the stream change cache at the same time as calling get_next (or advance in the replication case), this should be fine.

I think part of the confusion here is that there is not a one-to-one mapping between stream caches and stream ID generators. For example, the device inbox ID generator feeds into both DeviceInboxStreamChangeCache and DeviceFederationOutboxStreamChangeCache:

async with self._device_inbox_id_gen.get_next() as stream_id:
    now_ms = self._clock.time_msec()
    await self.db_pool.runInteraction(
        "add_messages_to_device_inbox", add_messages_txn, now_ms, stream_id
    )

    for user_id in local_messages_by_user_then_device.keys():
        self._device_inbox_stream_cache.entity_has_changed(user_id, stream_id)
    for destination in remote_messages_by_destination.keys():
        self._device_federation_outbox_stream_cache.entity_has_changed(
            destination, stream_id
        )

So it's not unexpected if one xor the other stream change cache lags behind the current token.


Now, that's not to say there isn't a problem here. I'd be tempted for some of these stream caches to pass in the relevant stream ID generator, and then log if we are inserting a change with a lower stream ID than returned by get_current_token() (and maybe also get_current_token(instance_name) for multi-writer streams). This may have some false positives in process_replication_rows, where we advance the ID gen and then update the stream change cache, e.g.:

self._device_inbox_id_gen.advance(instance_name, token)
for row in rows:
    if row.entity.startswith("@"):
        self._device_inbox_stream_cache.entity_has_changed(
            row.entity, token
        )
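
For concreteness, a sketch of roughly what that logging could look like (hypothetical wrapper with made-up names, not current Synapse code):

import logging

logger = logging.getLogger(__name__)

# Hypothetical: a stream change cache that knows its ID generator and logs
# when a change is recorded below the position readers can already observe.
class CheckedStreamChangeCache:
    def __init__(self, name, stream_cache, id_gen):
        self._name = name
        self._cache = stream_cache  # the underlying StreamChangeCache
        self._id_gen = id_gen       # the associated stream ID generator

    def entity_has_changed(self, entity, stream_pos):
        current = self._id_gen.get_current_token()
        if stream_pos < current:
            # Readers could already have observed `current` without this
            # change being visible in the cache - exactly the race above.
            logger.warning(
                "%s: recording change at %d but current token is already %d",
                self._name, stream_pos, current,
            )
        self._cache.entity_has_changed(entity, stream_pos)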

@Fizzadar
Contributor Author

I think I've been misunderstanding get_current_token - I was thinking it fetched the token from the database, but it doesn't: it's the "known latest position", which is effectively the same as the waiting thing. Which makes this more confusing - how are we getting into a situation where the event ID generator has a token ahead of the events stream cache?


Re: device inbox - I discovered this too, where the two stream caches share one ID generator. My WIP PR #14286 addresses this by making sure both stream caches are always updated (to enable the waiting logic).

@Fizzadar
Contributor Author

Fizzadar commented Oct 25, 2022

Digging into the events stream a bit, the only place the stream ID is advanced is:

def process_replication_rows(
    self,
    stream_name: str,
    instance_name: str,
    token: int,
    rows: Iterable[Any],
) -> None:
    if stream_name == EventsStream.NAME:
        self._stream_id_gen.advance(instance_name, token)
    elif stream_name == BackfillStream.NAME:
        self._backfill_id_gen.advance(instance_name, -token)

And the change should be propagated at the same time here:

def _process_event_stream_row(self, token: int, row: EventsStreamRow) -> None:
    data = row.data

    if row.type == EventsStreamEventRow.TypeId:
        assert isinstance(data, EventsStreamEventRow)
        self._invalidate_caches_for_event(
            token,
            data.event_id,
            data.room_id,
            data.type,
            data.state_key,
            data.redacts,
            data.relates_to,
            backfilled=False,
        )
    elif row.type == EventsStreamCurrentStateRow.TypeId:
        assert isinstance(data, EventsStreamCurrentStateRow)
        self._curr_state_delta_stream_cache.entity_has_changed(data.room_id, token)

        if data.type == EventTypes.Member:
            self.get_rooms_for_user_with_stream_ordering.invalidate(
                (data.state_key,)
            )
            self.get_rooms_for_user.invalidate((data.state_key,))
    else:
        raise Exception("Unknown events stream row type %s" % (row.type,))

Is it possible that we see a EventsStreamCurrentStateRow.TypeId row in a separate batch to EventsStreamEventRow.TypeId? Both of those would advance the token but only the first would flag the change in the stream cache at that token.

@Fizzadar
Contributor Author

Should the entity_has_changed call just be moved into the EventsWorkerStore.process_replication_rows? It seems to me the token and change cache are deeply interlinked and keeping the code together would make sense (like DeviceInboxWorkerStore does).

@erikjohnston
Member

Is it possible that we see a EventsStreamCurrentStateRow.TypeId row in a separate batch to EventsStreamEventRow.TypeId? Both of those would advance the token but only the first would flag the change in the stream cache at that token.

The code that decides what gets sent down the events stream is at:

async def _update_function(
    self,
    instance_name: str,
    from_token: Token,
    current_token: Token,
    target_row_count: int,
) -> StreamUpdateResult:
    # the events stream merges together three separate sources:
    # * new events
    # * current_state changes
    # * events which were previously outliers, but have now been de-outliered.
    #
    # The merge operation is complicated by the fact that we only have a single
    # "stream token" which is supposed to indicate how far we have got through
    # all three streams. It's therefore no good to return rows 1-1000 from the
    # "new events" table if the state_deltas are limited to rows 1-100 by the
    # target_row_count.
    #
    # In other words: we must pick a new upper limit, and must return *all* rows
    # up to that point for each of the three sources.
    #
    # Start by trying to split the target_row_count up. We expect to have a
    # negligible number of ex-outliers, and a rough approximation based on recent
    # traffic on sw1v.org shows that there are approximately the same number of
    # event rows between a given pair of stream ids as there are state
    # updates, so let's split our target_row_count among those two types. The target
    # is only an approximation - it doesn't matter if we end up going a bit over it.

    target_row_count //= 2

    # now we fetch up to that many rows from the events table
    event_rows = await self._store.get_all_new_forward_event_rows(
        instance_name, from_token, current_token, target_row_count
    )

    # we rely on get_all_new_forward_event_rows strictly honouring the limit, so
    # that we know it is safe to just take upper_limit = event_rows[-1][0].
    assert (
        len(event_rows) <= target_row_count
    ), "get_all_new_forward_event_rows did not honour row limit"

    # if we hit the limit on event_updates, there's no point in going beyond the
    # last stream_id in the batch for the other sources.
    if len(event_rows) == target_row_count:
        limited = True
        upper_limit: int = event_rows[-1][0]
    else:
        limited = False
        upper_limit = current_token

    # next up is the state delta table.
    (
        state_rows,
        upper_limit,
        state_rows_limited,
    ) = await self._store.get_all_updated_current_state_deltas(
        instance_name, from_token, upper_limit, target_row_count
    )

    limited = limited or state_rows_limited

    # finally, fetch the ex-outliers rows. We assume there are few enough of these
    # not to bother with the limit.
    ex_outliers_rows = await self._store.get_ex_outlier_stream_rows(
        instance_name, from_token, upper_limit
    )

    # we now need to turn the raw database rows returned into tuples suitable
    # for the replication protocol (basically, we add an identifier to
    # distinguish the row type). At the same time, we can limit the event_rows
    # to the max stream_id from state_rows.
    event_updates: Iterable[Tuple[int, Tuple]] = (
        (stream_id, (EventsStreamEventRow.TypeId, rest))
        for (stream_id, *rest) in event_rows
        if stream_id <= upper_limit
    )

    state_updates: Iterable[Tuple[int, Tuple]] = (
        (stream_id, (EventsStreamCurrentStateRow.TypeId, rest))
        for (stream_id, *rest) in state_rows
    )

    ex_outliers_updates: Iterable[Tuple[int, Tuple]] = (
        (stream_id, (EventsStreamEventRow.TypeId, rest))
        for (stream_id, *rest) in ex_outliers_rows
    )

    # we need to return a sorted list, so merge them together.
    updates = list(heapq.merge(event_updates, state_updates, ex_outliers_updates))
    return updates, upper_limit, limited

Anything that function returns with the same token will (well, should) get batched up.

Should the entity_has_changed call just be moved into the EventsWorkerStore.process_replication_rows? It seems to me the token and change cache are deeply interlinked and keeping the code together would make sense (like DeviceInboxWorkerStore does).

Yes, that sounds like a good idea. I wonder if we should have a class that stores both the ID gen and the stream change cache, to make it more obvious that they are interlinked.
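
Something like this, perhaps (purely a sketch of the idea, with made-up names):

# Hypothetical wrapper keeping an ID generator and its stream change cache
# together so they can only ever be updated in the safe order.
class TrackedStream:
    def __init__(self, id_gen, stream_cache):
        self._id_gen = id_gen
        self._stream_cache = stream_cache

    def get_current_token(self):
        return self._id_gen.get_current_token()

    def on_replication_rows(self, instance_name, token, changed_entities):
        # Record the changes *before* advancing the token, so any reader that
        # observes the new current token is guaranteed to also see them.
        for entity in changed_entities:
            self._stream_cache.entity_has_changed(entity, token)
        self._id_gen.advance(instance_name, token)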

@Fizzadar
Contributor Author

Fizzadar commented Nov 1, 2022

I'm keen to work on this plan and should have some time over the next few weeks - feel free to assign it to me :) There's some precursor work I've identified as well, involving cleanup of the slaved stores, so I'll start there.

@Fizzadar
Contributor Author

Fizzadar commented Nov 26, 2022

Small update - the new ReceiptsRoomChangeCache also triggers the same issue. I threw up a very quick debug change on our sync workers, and it looks like they are getting calls to process_replication_data with an empty row list (beeper/synapse@2610e25).

This is expected for POSITION commands... but as far as I can tell these are not sent for receipts, so how is there ever an empty list here!

UPDATE: this probably makes sense - it seems to happen on worker startup when fetching rows, and because receipts replace each other, gaps would be expected. The above idea of combining stream change caches + ID generators would address this problem.

@Fizzadar
Contributor Author

Fizzadar commented Dec 9, 2022

I think I may have figured out why this happens with the EventsRoomStreamChangeCache!

  • Sync workers call get_unread_event_push_actions_by_room_for_user for rooms
  • This executes a check on the cache inside a transaction, and thus inside a thread
  • Python releases the GIL every few milliseconds (5ms by default) to switch threads, so it's possible this thread is switched to mid-way through the process_replication_rows calls
  • Because of class resolution order, EventsWorkerStore.process_replication_rows is called before CacheInvalidationWorkerStore.process_replication_rows, which opens up this window

This explains why we're not seeing this particular log line cause major issues and why it's so rare; I have pushed some logging to our fork to confirm the above is the case. Fixing this is relatively easy by making sure the stream cache is updated before the ID generator is advanced: #14648.
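
In other words, roughly this ordering (a simplified sketch, not the actual diff in #14648):

# Illustrative ordering fix: update the stream change cache before advancing
# the ID generator, so a thread switch between the two steps can never expose
# a current token whose cache update hasn't landed yet.
def process_replication_rows(self, stream_name, instance_name, token, rows):
    if stream_name == "events":  # stream name simplified
        for row in rows:
            self._stream_cache.entity_has_changed(row.room_id, token)  # 1. cache first
        self._stream_id_gen.advance(instance_name, token)              # 2. then advance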

@erikjohnston
Member

Fixing this is relatively easy by making sure the stream cache is updated before the ID generator

Should we add an assertion in the stream change cache where we assert that when we mark something as changed the stream ID is strictly greater than the current token?

@erikjohnston
Member

Also, 😱

@Fizzadar
Contributor Author

Fizzadar commented Dec 9, 2022

Should we add an assertion in the stream change cache where we assert that when we mark something as changed the stream ID is strictly greater than the current token?

Yes! That'd be ideal; it will probably require passing the relevant ID generator into the stream change cache, but that seems sensible anyway…

@Fizzadar
Contributor Author

Fizzadar commented Dec 9, 2022

I have added that to my PR, enabled only for the events stream, in 5ee0425. I think enabling for other streams is best left to a new PR?

@erikjohnston
Member

@Fizzadar which PR?

@clokep
Contributor

clokep commented Dec 9, 2022

@Fizzadar which PR?

Seems to be #14648

Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this issue Dec 22, 2022
This creates a new store method, `process_replication_position` that
is called after `process_replication_rows`. By moving stream ID advances
here this guarantees any relevant cache invalidations will have been
applied before the stream is advanced.

This avoids race conditions where Python switches between threads mid
way through processing the `process_replication_rows` method where stream
IDs may be advanced before caches are invalidated due to class resolution
ordering.

See this comment/issue for further discussion:
	matrix-org#14158 (comment)
erikjohnston pushed a commit that referenced this issue Jan 4, 2023
This creates a new store method, `process_replication_position` that
is called after `process_replication_rows`. By moving stream ID advances
here this guarantees any relevant cache invalidations will have been
applied before the stream is advanced.

This avoids race conditions where Python switches between threads mid
way through processing the `process_replication_rows` method where stream
IDs may be advanced before caches are invalidated due to class resolution
ordering.

See this comment/issue for further discussion:
	#14158 (comment)
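
For anyone skimming, the shape of the change described in that commit message is roughly as follows (a sketch of the pattern, not the exact code in #14723):

# Sketch: rows are processed (and caches invalidated) first, and only then is
# the stream position advanced in a separate step, so callers never observe a
# token whose cache updates haven't been applied yet.
class ExampleWorkerStore:
    def process_replication_rows(self, stream_name, instance_name, token, rows):
        # apply cache invalidations / stream change cache updates for `rows`
        for row in rows:
            self._stream_cache.entity_has_changed(row.entity, token)

    def process_replication_position(self, stream_name, instance_name, token):
        # called afterwards: only now do we advertise the new position
        self._id_gen.advance(instance_name, token)
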
@richvdh
Member

richvdh commented Jan 9, 2023

As I understand it, this has been partially addressed by #14723? Is anyone able to provide an update on what remains to be done here?

@Fizzadar
Contributor Author

As I understand it, this has been partially addressed by #14723? Is anyone able to provide an update on what remains to be done here?

This should indeed resolve the issue once and for all - I'm waiting on rolling out 1.75 to our deployment to confirm (as we have the logs indicating the issue), but I'm pretty sure that PR solves it. Will update here/close the issue once that's live.

Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this issue Jan 16, 2023
…4723)

This creates a new store method, `process_replication_position` that
is called after `process_replication_rows`. By moving stream ID advances
here this guarantees any relevant cache invalidations will have been
applied before the stream is advanced.

This avoids race conditions where Python switches between threads mid
way through processing the `process_replication_rows` method where stream
IDs may be advanced before caches are invalidated due to class resolution
ordering.

See this comment/issue for further discussion:
	matrix-org#14158 (comment)
# Conflicts:
#	synapse/storage/databases/main/devices.py
#	synapse/storage/databases/main/events_worker.py
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this issue Jan 17, 2023
…4723) (#52)

* Update all stream IDs after processing replication rows (matrix-org#14723)

This creates a new store method, `process_replication_position` that
is called after `process_replication_rows`. By moving stream ID advances
here this guarantees any relevant cache invalidations will have been
applied before the stream is advanced.

This avoids race conditions where Python switches between threads mid
way through processing the `process_replication_rows` method where stream
IDs may be advanced before caches are invalidated due to class resolution
ordering.

See this comment/issue for further discussion:
	matrix-org#14158 (comment)
# Conflicts:
#	synapse/storage/databases/main/devices.py
#	synapse/storage/databases/main/events_worker.py

* Fix bad cherry-picking

* Remove leftover stream advance
@Fizzadar
Contributor Author

So this is still happening on our deployment, but it did occur to me this morning - could this be a side effect of multiple event persisters, where one is running behind the others?

If I'm following the ReplicationCommandHandler._process_position code properly, it looks like replication positions may be processed out of order. This would then mean changes are read from/written to the change caches out of order from time to time?

Note: this probably means the assertion that cache changes are incremental-only would blow up in these situations...

Does it make sense to force an order here and prevent out-of-order processing between event persisters? It would mean any slow persister holds back sync/replication everywhere, but they really shouldn't be that far behind each other...
