Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

CS-API /messages requests fail on develop when using postgres due to NumericValueOutOfRange if some events have repeatedly failed to backfill #13929

Closed
Tracked by #14284
DMRobertson opened this issue Sep 28, 2022 · 8 comments · Fixed by #13936
Assignees
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Regression Something broke which worked on a previous release X-Release-Blocker Must be resolved before making a release

Comments

@DMRobertson
Copy link
Contributor

DMRobertson commented Sep 28, 2022

https://sentry.tools.element.io/organizations/element/issues/33447/?project=2&query=is%3Aunresolved

Running this query:

SELECT backward_extrem.event_id, event.depth FROM events AS event
/**
* Get the edge connections from the event_edges table
* so we can see whether this event's prev_events points
* to a backward extremity in the next join.
*/
INNER JOIN event_edges AS edge
ON edge.event_id = event.event_id
/**
* We find the "oldest" events in the room by looking for
* events connected to backwards extremeties (oldest events
* in the room that we know of so far).
*/
INNER JOIN event_backward_extremities AS backward_extrem
ON edge.prev_event_id = backward_extrem.event_id
/**
* We use this info to make sure we don't retry to use a backfill point
* if we've already attempted to backfill from it recently.
*/
LEFT JOIN event_failed_pull_attempts AS failed_backfill_attempt_info
ON
failed_backfill_attempt_info.room_id = backward_extrem.room_id
AND failed_backfill_attempt_info.event_id = backward_extrem.event_id
WHERE
backward_extrem.room_id = ?
/* We only care about non-state edges because we used to use
* `event_edges` for two different sorts of "edges" (the current
* event DAG, but also a link to the previous state, for state
* events). These legacy state event edges can be distinguished by
* `is_state` and are removed from the codebase and schema but
* because the schema change is in a background update, it's not
* necessarily safe to assume that it will have been completed.
*/
AND edge.is_state is ? /* False */
/**
* Exponential back-off (up to the upper bound) so we don't retry the
* same backfill point over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
*
* We use `1 << n` as a power of 2 equivalent for compatibility
* with older SQLites. The left shift equivalent only works with
* powers of 2 because left shift is a binary operation (base-2).
* Otherwise, we would use `power(2, n)` or the power operator, `2^n`.
*/
AND (
failed_backfill_attempt_info.event_id IS NULL
OR ? /* current_time */ >= failed_backfill_attempt_info.last_attempt_ts + /*least*/%s((1 << failed_backfill_attempt_info.num_attempts) * ? /* step */, ? /* upper bound */)
)
/**
* Sort from highest to the lowest depth. Then tie-break on
* alphabetical order of the event_ids so we get a consistent
* ordering which is nice when asserting things in tests.
*/
ORDER BY event.depth DESC, backward_extrem.event_id DESC

with args

[['!redacted_room_name:example.com', False, 1664362899049, 3600000, 604800000]]

from

txn.execute(
sql % (least_function,),
(
room_id,
False,
self._clock.time_msec(),
1000 * BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_SECONDS,
1000 * BACKFILL_EVENT_BACKOFF_UPPER_BOUND_SECONDS,
),
)

Failed handle request via "RoomMessageListRestServlet": "<XForwardedForRequest at 0x7fba292b3490 method='GET' uri='/_matrix/client/r0/rooms/!redacted_room_id:example.com/messages?from=s3305809885_757284974_68989_1616590947_1642068190_3598845_602714760_5320124732_0&dir=b&limit=50&filter=%7B%22lazy_load_members%22%3Atrue%7D' clientproto='HTTP/1.1' site='11106'>"

Running on 1.68.0 (b=matrix-org-hotfixes,3f30bdca19)

@DMRobertson DMRobertson added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Regression Something broke which worked on a previous release A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) O-Uncommon Most users are unlikely to come across this or unexpected workflow labels Sep 28, 2022
@DMRobertson DMRobertson self-assigned this Sep 28, 2022
@DMRobertson DMRobertson added the X-Release-Blocker Must be resolved before making a release label Sep 28, 2022
@DMRobertson
Copy link
Contributor Author

I don't like the look of this:

matrix=> SELECT count(*), (
matrix(>     1 << event_failed_pull_attempts.num_attempts
matrix(> ) c
matrix-> FROM event_failed_pull_attempts
matrix-> GROUP BY c
matrix-> ORDER BY c ASC;
 count │      c      
═══════╪═════════════
    15 │ -2147483648
    11 │           1
   848 │           2
   921 │           4
   378 │           8
   180 │          16
   117 │          32
    70 │          64
    40 │         128
    39 │         256
    26 │         512
    53 │        1024
    26 │        2048
    33 │        4096
    15 │        8192
    25 │       16384
    18 │       32768
    13 │       65536
    20 │      131072
    18 │      262144
     6 │      524288
    18 │     1048576
     6 │     2097152
    19 │     4194304
    35 │     8388608
    18 │    16777216
    17 │    33554432
     8 │    67108864
    13 │   134217728
     3 │   268435456
     4 │   536870912
    13 │  1073741824
(32 rows)

I think that first row is the smallest 32 bit signed int?

@DMRobertson
Copy link
Contributor Author

Right. In the query

SELECT count(*)
FROM event_failed_pull_attempts
WHERE 1664364716150 >= last_attempt_ts + LEAST((1 << num_attempts) * 3600000, 604800000);

(which causes ERROR: integer out of range), consider the boolean expression

1664364716150 >= last_attempt_ts + LEAST((1 << num_attempts) * 3600000, 604800000) 

The types of "leaf" expressions involved here are

matrix=> SELECT (pg_typeof(1664364716150) currentSELECT_time, pg_typeof(last_attempt_ts) last_attempt_ts, pg_typeof(num_attempts) num_attempts, pg_typeof(3600000) ms_per_hour, pg_typeof(604800000) ms_per_week
FROM event_failed_pull_attempts
LIMIT 1;
matrix(>     pg_typeof(1664364716150)   AS current_time,
matrix(>     pg_typeof(last_attempt_ts) AS last_attempt_ts,
matrix(>     pg_typeof(num_attempts)    AS num_attempts,
matrix(>     pg_typeof(3600000)         AS ms_per_hour,
matrix(>     pg_typeof(604800000)       AS ms_per_week
matrix(> FROM event_failed_pull_attempts
matrix(> LIMIT 1;
matrix(> ^C
matrix=> SELECT
matrix->     pg_typeof(1664364716150)   AS current_time,
matrix->     pg_typeof(last_attempt_ts) AS last_attempt_ts,
matrix->     pg_typeof(num_attempts)    AS num_attempts,
matrix->     pg_typeof(3600000)         AS ms_per_hour,
matrix->     pg_typeof(604800000)       AS ms_per_week
matrix-> FROM event_failed_pull_attempts
matrix-> LIMIT 1;
 current_time │ last_attempt_ts │ num_attempts │ ms_per_hour │ ms_per_week 
══════════════╪═════════════════╪══════════════╪═════════════╪═════════════
 bigint       │ bigint          │ integer      │ integer     │ integer
(1 row)

@DMRobertson
Copy link
Contributor Author

The left-shift expression is

SELECT pg_typeof(1 << num_attempts) AS shifted FROM event_failed_pull_attempts LIMIT 1;
 shifted 
═════════
 integer

because 1 is an integer.

our expression is now bigint >= bigint + LEAST(integer * integer, integer).

It looks like multiplication of integers (32 bit signed ints) simply errors on overflow, but I couldn't see anything in the docs which explains this.

matrix=> select (1<<30) * (1 << 30);
ERROR:  integer out of range

Similarly for bigints (64 bit signed ints):

matrix=> select (1::bigint << 60) * (1::bigint << 60);
ERROR:  bigint out of range

In contrast the left shift operation will happily overflow, wrapping around to negative ints.

matrix=> WITH series(x) AS (SELECT generate_series(1, 40)) SELECT x, 1 << x AS shifted FROM series;
 x  │   shifted   
════╪═════════════
  1 │           2
  2 │           4
  3 │           8
  4 │          16
  5 │          32
  6 │          64
  7 │         128
  8 │         256
  9 │         512
 10 │        1024
 11 │        2048
 12 │        4096
 13 │        8192
 14 │       16384
 15 │       32768
 16 │       65536
 17 │      131072
 18 │      262144
 19 │      524288
 20 │     1048576
 21 │     2097152
 22 │     4194304
 23 │     8388608
 24 │    16777216
 25 │    33554432
 26 │    67108864
 27 │   134217728
 28 │   268435456
 29 │   536870912
 30 │  1073741824
 31 │ -2147483648
 32 │           1
 33 │           2
 34 │           4
 35 │           8
 36 │          16
 37 │          32
 38 │          64
 39 │         128
 40 │         256
(40 rows)

@DMRobertson
Copy link
Contributor Author

TL;DR: (1 << num_attempts) * 3600000 is too large for a 32 bit integer in for some value of num_attempts that we have in our database.

As for why we sometimes num_attempts % 32 is very large, and hence 1 << num_attempts is large. To try to illustrate this:

matrix=> SELECT MAX(num_attempts), MAX(num_attempts % 32), MIN(1 << num_attempts), MAX(1 << num_attempts) FROM event_failed_pull_attempts;
  max   │ max │     min     │    max     
════════╪═════╪═════════════╪════════════
 157177 │  31 │ -2147483648 │ 1073741824
(1 row)
matrix=> select pct, percentile_disc(pct) WITHIN GROUP (ORDER BY num_attempts) FROM event_failed_pull_attempts, generate_series(0, 1, 0.05) AS pct group by pct;
 pct  │ percentile_disc 
══════╪═════════════════
    0 │               1
 0.05 │               1
 0.10 │               1
 0.15 │               1
 0.20 │               1
 0.25 │               1
 0.30 │               2
 0.35 │               2
 0.40 │               2
 0.45 │               2
 0.50 │               2
 0.55 │               2
 0.60 │               3
 0.65 │               3
 0.70 │               3
 0.75 │               4
 0.80 │               6
 0.85 │              10
 0.90 │              17
 0.95 │              38
 1.00 │          157406

I think we need to clamp 1 << num_attempts somehow, e.g. by using 1 << LEAST(num_attempts, 5) instead. If I understand correctly, that should mean a maximum backoff time of 1 << 5 = 32 hours?

@DMRobertson
Copy link
Contributor Author

Assuming a regression in #13635.

@squahtx
Copy link
Contributor

squahtx commented Sep 28, 2022

I think we need to clamp 1 << num_attempts somehow

or cast the left hand side to a bigint?

synapse=# SELECT CAST(1 AS BIGINT) << 33;
  ?column?
------------
 8589934592
(1 row)

@DMRobertson
Copy link
Contributor Author

or cast the left hand side to a bigint?

Even if we do this, (bigint << integer) * integer will become (bigint * integer). Postgres still errors on overflows for those multiplications.

matrix=> SELECT (1::bigint << 50) * 3600000;
ERROR:  bigint out of range

(Maybe there's some setting to control overflow behaviour?)

@squahtx
Copy link
Contributor

squahtx commented Sep 28, 2022

Ah yes you're right. I missed that num_attempts would keep going up even though we limit the backoff interval.

@DMRobertson DMRobertson changed the title NumericValueOutOfRange seen on develop CS-API /messages requests fail on develop when using postgres due to NumericValueOutOfRange if some events have repeatedly failed to backfill Sep 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Regression Something broke which worked on a previous release X-Release-Blocker Must be resolved before making a release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants