Remove duplicates in the user_ips table and add an index #4370

Merged
merged 23 commits into develop from hawkowl/remove-userip-dupes on Jan 11, 2019

Conversation

3 participants
@hawkowl
Contributor

hawkowl commented Jan 9, 2019

No description provided.

hawkowl added some commits Jan 3, 2019

fix
@codecov-io

codecov-io commented Jan 9, 2019

Codecov Report

Merging #4370 into develop will decrease coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4370      +/-   ##
===========================================
- Coverage    73.68%   73.64%   -0.05%     
===========================================
  Files          300      300              
  Lines        29815    29815              
  Branches      4895     4897       +2     
===========================================
- Hits         21970    21958      -12     
- Misses        6407     6419      +12     
  Partials      1438     1438
Impacted Files                   Coverage Δ
synapse/util/stringutils.py      61.7% <0%> (-6.39%) ⬇️
synapse/api/auth.py              87.96% <0%> (-2.78%) ⬇️
synapse/handlers/search.py       80.24% <0%> (ø) ⬆️
synapse/handlers/federation.py   61.72% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55c3e85...497945f. Read the comment docs.

@hawkowl hawkowl requested a review from matrix-org/synapse-core Jan 9, 2019

@erikjohnston

Member

erikjohnston commented Jan 11, 2019

A couple of notes:

  1. Given how large this table is I'd probably want to do this using the batching mechanism (which correctly balances the batches so that it doesn't overload the DB). A rough sketch of how that wiring might look is below.
  2. We probably want the index to be (user_id, access_token, ip), i.e. remove user_agent and add user_id. We don't really care about keeping unique UAs; we just use them to get an idea of what type of device it is, and it shouldn't change much in practice anyway. We add user_id mainly so that we can reuse the existing index during the de-duplication process (and then we can drop the existing index after adding the new one).
  3. We need to be careful that new inserts don't add duplicates while we're de-duplicating.
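
As an illustration of point 1, here is a minimal sketch of how the work could be hooked into the background-update batching machinery. The store/handler structure follows how other background updates in the storage layer are wired up, but the update name, the _dedupe_one_span_txn helper and the exact signatures are assumptions on my part, not what the PR necessarily does:

```python
from twisted.internet import defer

from synapse.storage.background_updates import BackgroundUpdateStore


class ClientIpStore(BackgroundUpdateStore):
    def __init__(self, db_conn, hs):
        super(ClientIpStore, self).__init__(db_conn, hs)

        # The background updater calls the handler repeatedly with a tuned
        # batch_size, so a single run doesn't overload the database.
        self.register_background_update_handler(
            "user_ips_remove_dupes", self._remove_user_ip_dupes
        )

    @defer.inlineCallbacks
    def _remove_user_ip_dupes(self, progress, batch_size):
        # `progress` is persisted between batches; start from where the
        # previous batch left off.
        last_seen = progress.get("last_seen", 0)

        # De-duplicate one span of last_seen per batch; _dedupe_one_span_txn
        # is a hypothetical helper, sketched after the step-by-step plan below.
        finished = yield self.runInteraction(
            "user_ips_dups_dedupe_batch",
            self._dedupe_one_span_txn,
            last_seen,
            batch_size,
        )

        if finished:
            yield self._end_background_update("user_ips_remove_dupes")

        defer.returnValue(batch_size)
```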

The way I'd probably do this is by iterating over the table using last_seen, and then deleting rows that have duplicate entries. Something like the following (a rough sketch tying the steps together is given after the list):

  1. Find a span of last_seen times that has roughly the right number of rows:
SELECT last_seen FROM user_ips
WHERE last_seen > ?
ORDER BY last_seen
LIMIT 1
OFFSET ?
  2. Fetch all rows in that span which have duplicate entries (I've checked this query is fast enough):
SELECT user_id, access_token, ip, MAX(last_seen)
FROM (
    SELECT user_id, access_token, ip
    FROM user_ips
    WHERE ? <= last_seen AND last_seen < ?
    ORDER BY last_seen
) c
INNER JOIN user_ips USING (user_id, access_token, ip)
GROUP BY user_id, access_token, ip
HAVING count(*) > 1;
  3. Delete all rows matching the returned (user_id, access_token, ip) and reinsert one row with the returned last_seen.

  4. Repeat.
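
Putting those steps together, the per-batch transaction body (the hypothetical _dedupe_one_span_txn helper from the sketch above, written here as a plain function for readability) might look roughly like this. The "?" placeholder style, the bool return value and the reduced column list are assumptions:

```python
def _dedupe_one_span_txn(txn, begin_last_seen, batch_size):
    # Step 1: find the end of a span containing roughly batch_size rows.
    txn.execute(
        """
        SELECT last_seen FROM user_ips
        WHERE last_seen > ?
        ORDER BY last_seen
        LIMIT 1
        OFFSET ?
        """,
        (begin_last_seen, batch_size),
    )
    row = txn.fetchone()
    if row is None:
        return True  # ran off the end of the table, so we're finished
    end_last_seen = row[0]

    # Step 2: duplicated (user_id, access_token, ip) tuples in the span,
    # each with the most recent last_seen.
    txn.execute(
        """
        SELECT user_id, access_token, ip, MAX(last_seen)
        FROM (
            SELECT user_id, access_token, ip
            FROM user_ips
            WHERE ? <= last_seen AND last_seen < ?
            ORDER BY last_seen
        ) c
        INNER JOIN user_ips USING (user_id, access_token, ip)
        GROUP BY user_id, access_token, ip
        HAVING count(*) > 1
        """,
        (begin_last_seen, end_last_seen),
    )

    # Step 3: delete every duplicated row and re-insert a single one that
    # carries the newest last_seen. (device_id and user_agent are omitted
    # here for brevity; the real version would have to carry them across.)
    for user_id, access_token, ip, last_seen in txn.fetchall():
        txn.execute(
            "DELETE FROM user_ips"
            " WHERE user_id = ? AND access_token = ? AND ip = ?",
            (user_id, access_token, ip),
        )
        txn.execute(
            "INSERT INTO user_ips (user_id, access_token, ip, last_seen)"
            " VALUES (?, ?, ?, ?)",
            (user_id, access_token, ip, last_seen),
        )

    # Step 4: not finished yet; the caller records end_last_seen as progress
    # (e.g. via _background_update_progress_txn) and goes round again.
    return False
```

Deleting every row for a duplicated tuple and re-inserting exactly one is the simplest way to guarantee a single survivor per key while the unique index doesn't exist yet.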

After we've successfully walked the entire table we probably need to:

  1. Stop inserting into user_ips.
  2. Rerun the deduplication for the time span since we started, i.e. for the period we might still have been inserting into.
  3. Create the unique index (a rough sketch of this final step is below).
  4. Allow insertions again.

Otherwise we risk inserting duplicates while trying to build the unique index :(
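
Step 3 of that checklist is then just the index creation itself. A minimal sketch, with an index name of my own choosing (on Postgres one might instead use CREATE UNIQUE INDEX CONCURRENTLY, which has to run outside a transaction):

```python
def _add_user_ips_unique_index_txn(txn):
    # With duplicates removed and inserts paused, this should now succeed.
    # The index name is illustrative.
    txn.execute(
        """
        CREATE UNIQUE INDEX user_ips_user_token_ip_unique_index
        ON user_ips (user_id, access_token, ip)
        """
    )
```

Pausing inserts around this (steps 1 and 4) is what stops the index build racing with newly inserted duplicates.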

Resolved review thread on changelog.d/4370.feature (outdated)
@hawkowl (comment minimized)

hawkowl added some commits Jan 11, 2019

@erikjohnston
Member

erikjohnston left a comment

Looks good! Would be good to add a final BG update to remove the old index
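
For what that final BG update could look like, here is a hedged sketch of a follow-up background update that drops the old index once the unique one exists. The old index name user_ips_user_ip and both update names are guesses, not necessarily what the PR uses:

```python
from twisted.internet import defer

from synapse.storage.background_updates import BackgroundUpdateStore


class ClientIpStore(BackgroundUpdateStore):
    def __init__(self, db_conn, hs):
        super(ClientIpStore, self).__init__(db_conn, hs)
        # Only useful once the unique-index background update has completed.
        self.register_background_update_handler(
            "user_ips_drop_nonunique_index", self._drop_old_user_ips_index
        )

    @defer.inlineCallbacks
    def _drop_old_user_ips_index(self, progress, batch_size):
        def drop_old_index_txn(txn):
            # Works on both SQLite and Postgres; the index name is a guess.
            txn.execute("DROP INDEX IF EXISTS user_ips_user_ip")

        yield self.runInteraction("user_ips_drop_old_index", drop_old_index_txn)
        yield self._end_background_update("user_ips_drop_nonunique_index")
        defer.returnValue(1)
```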

VALUES (?, ?, ?, ?, ?, ?)
""",
(user_id, access_token, ip, device_id, user_agent, last_seen)
)


@erikjohnston

erikjohnston Jan 11, 2019

Member

May as well use the _simple_{insert,delete}_txn wrappers here?


@hawkowl

hawkowl Jan 11, 2019

Author Contributor

yeahhhh but everything else in this function is sql :P
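
For reference, the wrapper-based version being suggested would look roughly like this. It is a sketch: I'm assuming _simple_delete_txn/_simple_insert_txn take (txn, table, keyvalues/values), and the column set simply mirrors the raw SQL quoted above:

```python
# Same effect as the raw DELETE + INSERT above, via the storage wrappers.
self._simple_delete_txn(
    txn,
    table="user_ips",
    keyvalues={"user_id": user_id, "access_token": access_token, "ip": ip},
)
self._simple_insert_txn(
    txn,
    table="user_ips",
    values={
        "user_id": user_id,
        "access_token": access_token,
        "ip": ip,
        "device_id": device_id,
        "user_agent": user_agent,
        "last_seen": last_seen,
    },
)
```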

Resolved review thread on synapse/storage/client_ips.py
uid, access_token, ip = key
if uid == user_id:
user_id, access_token, ip = key
if user_id == user_id:

@erikjohnston (comment minimized)

@hawkowl

hawkowl Jan 11, 2019

Author Contributor

HMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

yeah need some coffee
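
The slip in the hunk above is that re-using the name user_id in the unpacking makes the check compare the variable with itself, so it is always true. A minimal sketch of the intended shape, using a hypothetical helper name and assuming each key is a (user, access_token, ip) tuple:

```python
def keys_for_user(keys, user_id):
    """Return only the (user, access_token, ip) keys belonging to user_id."""
    matching = []
    for key in keys:
        # Unpack into a distinct name so the comparison is against the
        # user_id argument, not against itself.
        uid, access_token, ip = key
        if uid == user_id:
            matching.append(key)
    return matching
```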

ORDER BY last_seen
) c
INNER JOIN user_ips USING (user_id, access_token, ip)
GROUP BY user_id, access_token, ip;


@erikjohnston

erikjohnston Jan 11, 2019

Member

Spurious trailing ;

(user_id, access_token, ip, device_id, user_agent, last_seen)
)

self._background_update_progress_txn(txn, {"last_seen": last_seen})


@erikjohnston

erikjohnston Jan 11, 2019

Member

Missing update name param
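
That is, the progress helper also takes the name of the background update it belongs to. A sketch of the corrected call, assuming the update was registered as "user_ips_remove_dupes":

```python
# _background_update_progress_txn(txn, update_name, progress)
self._background_update_progress_txn(
    txn, "user_ips_remove_dupes", {"last_seen": last_seen}
)
```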

hawkowl and others added some commits Jan 11, 2019

@hawkowl hawkowl merged commit a35c66a into develop Jan 11, 2019

4 of 5 checks passed

continuous-integration/travis-ci/pr - The Travis CI build is in progress
ci/circleci: sytestpy2merged - Your tests passed on CircleCI!
ci/circleci: sytestpy2postgresmerged - Your tests passed on CircleCI!
ci/circleci: sytestpy3merged - Your tests passed on CircleCI!
ci/circleci: sytestpy3postgresmerged - Your tests passed on CircleCI!

Superceded by https://github.com/orgs/matrix-org/projects/8 automation moved this from Review to Done - Operations Jan 11, 2019

@hawkowl hawkowl deleted the hawkowl/remove-userip-dupes branch Jan 11, 2019
