Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Casefold when processing email addresses #374

Merged
merged 32 commits into from Aug 24, 2021
Merged

Conversation

H-Shay
Copy link
Contributor

@H-Shay H-Shay commented Jul 7, 2021

First crack at resolving #341. I tried to make sure emails were being casefolded in the proper places (i.e. if an address was passed through a few functions the email address was casefolded in a function that made the most sense logically) and I tried to only casefold where necessary.

@callahad callahad requested a review from a team July 8, 2021 21:49
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem here is that if anybody has already registered with Richard@Example.com, then that's what's going to be stored in the database, so if we instead look up richard@example.com, we're not going to find it.

I think we either need to migrate the existing database entries to a casefolded format, or we need to do two lookups. Did anyone more familiar than me with Sydent (@babolivier? @reivilibre) had any thoughts on this?

@H-Shay
Copy link
Contributor Author

H-Shay commented Jul 12, 2021

I think Brendan outlined a migration step here: https://github.com/matrix-org/matrix-doc/blob/babolivier/msc_email_case/proposals/2265-email-lowercase.md#conflicts-with-existing-associations, but that seems to be something that would be implemented by people actually running/maintaining a Sydent instance?

@babolivier
Copy link
Contributor

babolivier commented Jul 14, 2021

For context, I was thinking of that migration step as something that'd need to be implemented in Sydent. It probably can just be a script, though we'd need to figure out how to make sure people running their own Sydent know they need to run it - maybe we can prevent Sydent from starting if we find not-casefolded addresses in the DB (with an error to tell them to run the script to fix their setup)?

@H-Shay
Copy link
Contributor Author

H-Shay commented Jul 15, 2021

@babolivier: I am trying to think through the migration script portion of this, and I'm wondering if this makes sense to you (and forgive me if this is super obvious, I haven't done much work with databases yet):
Would it be fair to say that the script might look something like this:

  1. from table local_threepid_associations, select all associations where the medium is 'email'
  2. iterate through the selected associations, casefolding the email and storing the new casefolded email with it's associations
  3. check to see if there are any other mxids associated with the original, non-casefolded email address
    a. if there are, implement the steps you listed in the casefolding proposal:
    1. list all MXIDs associated with a variant of the email address, and the timestamp of that association
    2. delete all associations except for the most recent one [0]
    3. inform the user of the deletion by sending them an email notice to the email address

My assumption is that we would have to repeat this process for the global_threepid_associations table as well- I wasn't totally sure about that. Does this approach make sense overall? Thanks in advance for your input!

@babolivier
Copy link
Contributor

@H-Shay Gah I'm sorry I've completely missed your comment.

I think your solution is heading in the right direction, but it looks to me like "check to see if there are any other mxids associated with the original, non-casefolded email address" cannot work. There's a unique index on (medium, address) so only one association can exist for a given case of the same email address, and sqlite doesn't have a casefold function (so we can't do WHERE casefold(address) = '...'). This means that the only way to do this is to store every variation of a same email address we've encountered while iterating on the database results.

If we need to do this, I think the easiest solution is, after SELECTing all associations in the local_threepid_associations with the email medium, just iterating through them and storing them in a Dict[str, List[Tuple[str, str]]]. In this dict, the key is the casefolded address and the list contains associated tuple of MXIDs with the associated email address with its original case. Then you can just iterate over that dict and:

  • update the row for the first result
  • if there's any other result, delete them from the database, send an email with the MXIDs

Which might have been what you suggested, sorry if I haven't parsed it correctly. Also this is just a suggestion, feel free to adapt it as you want.

Bonus points for queuing up the DB operations and running them all at once with executemany instead of running them one by one, as a perf improvement.

In Python, my proposal would look something like this:

def update_assocs(conn: sqlite3.Connection):
    # Do everything in a transaction: that way if a deletion or update fails, or the
    # script is stopped halfway through, we don't end up in a weird state.
    cur = conn.cursor()

    res = cur.execute(
        """
        SELECT address, mxid FROM local_threepid_associations
        WHERE medium = 'email'
        ORDER BY ts DESC
        """
    )

    assocs: Dict[str, List[Tuple[str, str]]] = {}

    # Iterate over the raw database results to build our data set.
    # In this case, fetchall returns a list of tuples with the address as first value and
    # the mxid as second value.
    for row in res.fetchall():
        cf_address = row[0].casefold()

        if cf_address not in assocs:
            assocs[cf_address] = []

        assocs[cf_address].append((row[0], row[1]))

    # The UPDATE statement we'll give to executemany will need three arguments:
    #   * the casefolded address (SET address = ...)
    #   * the original address and the original MXID (WHERE address = ... AND mxid = ...)
    update_args: List[Tuple[str, str, str]] = []
    # The DELETE stetement we'll give to executemany will need one argument, which is the
    # original address (WHERE address = ...)
    delete_args: List[str] = []

    # Now we've got our data set, let's use it to determine what changes we need to make
    # to the database.
    for cf_address, assoc_tuples in assocs.items():
        # We know there's at least one tuple. If there's more than one, we've ordered the
        # results by most recent association first, so we can just use that one.
        update_args.append((cf_address, assoc_tuples[0][0], assoc_tuples[0][1]))

        if len(assoc_tuples) > 1:
            # List of the MXIDs we need to tell the email address's owner they've been
            # disassociated from the address.
            mxids: List[str] = []

            # Iterate over all associations except for the first one, since we've already
            # processed it.
            for assoc_tuple in assoc_tuples[1:]:
                delete_args.append(assoc_tuple[0])
                mxids.append(assoc_tuple[1])

            # Note: we could scope the mxids list outside of the for loop and process them
            # only after we've finished iterating.
            send_mail(mxids)

    # Delete the extra rows.
    if len(delete_args) > 0:
        cur.executemany(
            "DELETE FROM local_threepid_associations WHERE address = ?",
            delete_args,
        )

    # Update the remaining rows with the casefolded versions of the addresses.
    if len(update_args) > 0:
        cur.executemany(
            "UPDATE local_threepid_associations SET address = ? WHERE address = ? AND mxid = ?",
            update_args,
        )

    # We've finished updating the database, committing the transaction.
    conn.commit()

(again, this is just a suggestion, feel free to adapt it however you want; if anything it'll also require more logging since there's none at all in this snippet)

My assumption is that we would have to repeat this process for the global_threepid_associations table as well- I wasn't totally sure about that

Yes I'm afraid we'll have to. I think the best approach would be to get every row that's got a originServer value that matches our server name, then do basically the same operation as with local_threepid_associations; except we'll also need to update the sgAssoc with the new data and recalculate its signature. I think a good place in the code to look at to figure out how to do this would be here:

sgassoc = globalAssocStore.signedAssociationStringForThreepid(medium, address)
if not sgassoc:
return {}
sgassoc = json_decoder.decode(sgassoc)
if self.sydent.server_name not in sgassoc["signatures"]:
# We have not yet worked out what the proper trust model should be.
#
# Maybe clients implicitly trust a server they talk to (and so we
# should sign every assoc we return as ourselves, so they can
# verify this).
#
# Maybe clients really want to know what server did the original
# verification, and want to only know exactly who signed the assoc.
#
# Until we work out what we should do, sign all assocs we return as
# ourself. This is vaguely ok because there actually is only one
# identity server, but it happens to have two names (matrix.org and
# vector.im), and so we're not really lying too much.
#
# We do this when we return assocs, not when we receive them over
# replication, so that we can undo this decision in the future if
# we wish, without having destroyed the raw underlying data.
sgassoc = signedjson.sign.sign_json(
sgassoc, self.sydent.server_name, self.sydent.keyring.ed25519
)

Hope this was helpful (and sorry for the wall of text), feel free to ping me if you've got any question 🙂

@babolivier
Copy link
Contributor

As pointed out in DM, we also need to update the lookup_hashes, which I've missed.

@H-Shay
Copy link
Contributor Author

H-Shay commented Jul 29, 2021

So I added the functions to update the DB, but I wasn't sure where to put the functions, so I just popped them in a file in the DB folder for now, which I am sure is wrong. Obviously they will have to be modified slightly depending on where the file ends up living, but the core logic will stay the same and that's what I am looking for feedback on. I also wrote some tests to try and convince myself that the scrips were working, I pushed them just in case people wanted to look but I assume it might be better to drop them before actually merging?

Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coming along!

Sorry for the wall of comments, but hopefully you should be able to get something out of them. I tried to explain what I'm thinking. Ask about anything you like (here or Matrix is fine)!

And by all means, please do include the tests! We will bite your hands off for tests!

N.B. I haven't checked that you're casefolding in all the needed places, but I'll go through and double check that in the next review.
I also should have a closer look at the tests.

res/matrix-org/migration_template.eml Outdated Show resolved Hide resolved
sydent/db/migration.py Outdated Show resolved Hide resolved
Comment on lines 15 to 18
res = cur.execute(
"SELECT address, mxid FROM local_threepid_associations WHERE medium = 'email'"
"ORDER BY ts DESC"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know how many rows we're expecting? If this were e.g. Synapse, we'd split this up into batches somehow to avoid a really large query.

I don't know how big our Sydent deployments are, but to be honest I'm not sure it's vector.im that's the scary one -- perhaps some of the government deployments have everyone in the identity lookup, which I would guess could be several hundred thousand users? (@babolivier do you know?)

If you end up making this run in batches, you might want to save the position of this process to the database so that if Sydent is restarted, it can resume where it left off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this I don't know-I do know that the deletion step is supposed to be rare, but as for how large a DB this code will run on, I have no idea-this might be a questions for Brendan.

sydent/db/migration.py Outdated Show resolved Hide resolved
sydent/db/migration.py Outdated Show resolved Hide resolved
sydent/db/migration.py Outdated Show resolved Hide resolved
sydent/db/migration.py Outdated Show resolved Hide resolved
sydent/db/migration.py Outdated Show resolved Hide resolved
sydent/db/migration.py Outdated Show resolved Hide resolved
tests/test_migration.py Outdated Show resolved Hide resolved
@H-Shay H-Shay requested a review from reivilibre August 2, 2021 21:03
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more comments :)

res/matrix-org/migration_template.eml Outdated Show resolved Hide resolved
sydent/threepid/bind.py Outdated Show resolved Hide resolved
sydent/db/migration.py Outdated Show resolved Hide resolved
tests/test_migration.py Outdated Show resolved Hide resolved
tests/test_migration.py Outdated Show resolved Hide resolved
@H-Shay H-Shay requested a review from reivilibre August 5, 2021 22:35
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully we're coming close now. I think you saw the discussion about how this is going to work in #synapse-dev:

We would like to make this migration executable as a script.
We don't need to check for casefoldable addresses & warn -- we'll assume that the sysadmin will check the upgrade notes for now.

The current structure (where a lot of reading happens before a small amount of writing all happens at the end) is good for minimising the time that this will hold an EXCLUSIVE lock on the SQLite3 database, so it should be fine to run on a live system (though the SELECT will block other writers, so it might interrupt the ability to create new bindings for a short time, but I think we'll be OK as the database isn't very large).

It would be nice to have two options for this script:

  • dry-run: don't actually make any changes or send any e-mails
  • no-emails: don't send e-mails

Sorry for the brain-dump... but does it make sense? :)

Thanks!

changelog.d/374.misc Outdated Show resolved Hide resolved
sydent/db/invite_tokens.py Outdated Show resolved Hide resolved
Comment on lines 75 to 80
sendEmail(
self.sydent,
templateFile,
address,
{"mxid": "mxid", "subject_header_value": "MatrixID Update"},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I think one thing we could do would be to not send an e-mail if all the bindings are for the same Matrix ID.
(For bob@example.org → @bob:example.org and BOB@example.org → @bob:example.org, it won't make any difference if you remove one of the two.)
Just thinking that this will reduce the number of e-mails sent and will reduce the number of people who might get confused by this.

@H-Shay H-Shay requested a review from reivilibre August 13, 2021 14:33
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db_dry_run.py Outdated Show resolved Hide resolved
scripts/casefold_db_dry_run.py Outdated Show resolved Hide resolved
@@ -0,0 +1,199 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for this file; make it a flag

@H-Shay H-Shay requested a review from reivilibre August 16, 2021 21:54
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, the structure is good

whilst you're here, do you want to add scripts/ as one of the directories to be checked in .github/workflows/pipeline.yml? 😇
(If the other scripts are making the CI fail, then just add scripts/casefold_db.py instead, since that's not your fault and we shouldn't be fixing them in this PR :))

scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
tests/test_casefold_migration.py Outdated Show resolved Hide resolved
@H-Shay H-Shay requested a review from reivilibre August 18, 2021 20:26
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking great; let's get another set of eyes from Patrick (edit: or Brendan; I think Patrick was looking at the other PR, that's me being confused) today?

scripts/casefold_db.py Outdated Show resolved Hide resolved
tests/test_casefold_migration.py Outdated Show resolved Hide resolved
Copy link
Contributor

@babolivier babolivier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this! And sorry I didn't have a look sooner :/
It broadly looks good, there are just a few issues to fix before I'm 100% happy with it.

Hello,

We’ve recently improved how people discover your Matrix account.
In the past, identity services did not take capitalization into account when creating and storing Matrix IDs. We’ve now updated this behavior so anyone can find you, no matter how your email is capitalized. As part of this recent update, the duplicate Matrix ID %(mxid)s is no longer associated with this e-mail address.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some parts of this paragraph are confusing, e.g.:

identity services did not take capitalization into account when creating and storing Matrix IDs

Isn't the issue that they did take capitalisation (btw I think case might be a better word?) into account and we don't want them to?

the duplicate Matrix ID %(mxid)s is no longer associated with this e-mail address

The Matrix ID might still be associated with the email address, just not the email address with this case.

Generally, I think this should be looked at by our wordsmiths before we can use it on vector.im and matrix.org, so I don't think fixing it should block this PR as long as we remember to ping the right people internally.

tests/test_casefold_migration.py Outdated Show resolved Hide resolved
tests/test_casefold_migration.py Outdated Show resolved Hide resolved
tests/test_casefold_migration.py Outdated Show resolved Hide resolved
tests/test_casefold_migration.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Outdated Show resolved Hide resolved
scripts/casefold_db.py Show resolved Hide resolved
scripts/casefold_db.py Show resolved Hide resolved
@H-Shay H-Shay requested a review from babolivier August 23, 2021 20:21
Copy link
Contributor

@babolivier babolivier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there!

scripts/casefold_db.py Outdated Show resolved Hide resolved
tests/test_casefold_migration.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants