Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Casefold when processing email addresses #374

Merged
merged 32 commits into from Aug 24, 2021
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
087bd4a
add casefolding to bind and associated functions
H-Shay Jul 7, 2021
0988bdd
add casefolding to necessary functions + lints
H-Shay Jul 7, 2021
d80191a
add changelog
H-Shay Jul 7, 2021
3227de7
fix broken terms test
H-Shay Jul 27, 2021
5950186
draft migration script
H-Shay Jul 27, 2021
80a33f3
draft test for migration script
H-Shay Jul 27, 2021
406c9ac
add send email and email test
H-Shay Jul 28, 2021
d20c2bc
function for updating global table added
H-Shay Jul 29, 2021
4449f00
lints
H-Shay Jul 29, 2021
9b363c5
requested changes
H-Shay Aug 2, 2021
a99a18e
lints
H-Shay Aug 2, 2021
568b603
fix lint fail
H-Shay Aug 2, 2021
775f0ec
update tests + misc requested changes
H-Shay Aug 4, 2021
7ae6d36
extract casefold logic to function and add to affected files
H-Shay Aug 5, 2021
5d3ecc8
update email template and associated test
H-Shay Aug 5, 2021
76ad036
lints
H-Shay Aug 5, 2021
e23232a
requested changes
H-Shay Aug 10, 2021
3871005
rename file, rework to be script, rework tests to test script
H-Shay Aug 12, 2021
1f8ebfe
update tests and dry run/no email versions
H-Shay Aug 12, 2021
69f654d
update tests + scripts
H-Shay Aug 12, 2021
4b83b25
lints
H-Shay Aug 13, 2021
ac08265
add ability to call script from command line and update tests
H-Shay Aug 16, 2021
91d98d3
remove deleted files
H-Shay Aug 16, 2021
ec585a3
refine commandline invocation, add print statement
H-Shay Aug 16, 2021
ef40611
lints
H-Shay Aug 16, 2021
e774fa1
requested changes
H-Shay Aug 18, 2021
83560a4
requested changes + lints
H-Shay Aug 19, 2021
eb5ad93
requested changes
H-Shay Aug 23, 2021
f1b4f9e
lints
H-Shay Aug 23, 2021
969dda4
Actually let me just do that myself - doesn't have to wait till Shay …
babolivier Aug 24, 2021
b6b95c6
Merge branch 'matrix-org:main' into casefold
H-Shay Aug 24, 2021
96eee48
lints
H-Shay Aug 24, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/374.misc
@@ -0,0 +1 @@
Implements casefolding on email addresses when binding to mxid, storing associations, or looking up asssociations by email.
H-Shay marked this conversation as resolved.
Show resolved Hide resolved
38 changes: 38 additions & 0 deletions res/matrix-org/migration_template.eml
@@ -0,0 +1,38 @@
Date: %(date)s
From: %(from)s
To: %(to)s
Message-ID: %(messageid)s
Subject: %(subject_header_value)s
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="%(multipart_boundary)s"

--%(multipart_boundary)s
Content-Type: text/plain; charset=UTF-8
Content-Disposition: inline

Hello,

In the past, identity services did not take case into account when creating and storing
mxids. This led to your email address, %(to)s being stored with two mxids. A recent update
changed this behavior and streamlined the mxids associated with your email address.
This is a notification to let you know that matrix id %(mxid)s has been deleted in accordance
with this update.
H-Shay marked this conversation as resolved.
Show resolved Hide resolved


About Matrix:

Matrix.org is an open standard for interoperable, decentralised, real-time communication
over IP, supporting group chat, file transfer, voice and video calling, integrations to
other apps, bridges to other communication systems and much more. It can be used to power
Instant Messaging, VoIP/WebRTC signalling, Internet of Things communication - or anywhere
you need a standard HTTP API for publishing and subscribing to data whilst tracking the
conversation history.

Matrix defines the standard, and provides open source reference implementations of
Matrix-compatible Servers, Clients, Client SDKs and Application Services to help you
create new communication solutions or extend the capabilities and reach of existing ones.

Thanks,

Matrix
3 changes: 3 additions & 0 deletions sydent/db/invite_tokens.py
Expand Up @@ -34,6 +34,9 @@ def storeToken(
:param sender: The MXID of the user that sent the invite.
:param token: The token to store.
"""
if medium == "email":
address = address.casefold()

cur = self.sydent.db.cursor()

cur.execute(
Expand Down
197 changes: 197 additions & 0 deletions sydent/db/migration.py
@@ -0,0 +1,197 @@
import json
import sqlite3

import signedjson.sign

from sydent.util import json_decoder
from sydent.util.emailutils import sendEmail


def update_local_associations(self, conn: sqlite3.Connection):
"""Update the DB table local_threepid_associations so that all stored
emails are casefolded, and any duplicate mxid's associated with the
given email are deleted.

:return: None
"""
cur = conn.cursor()

res = cur.execute(
"SELECT address, mxid FROM local_threepid_associations WHERE medium = 'email'"
"ORDER BY ts DESC"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know how many rows we're expecting? If this were e.g. Synapse, we'd split this up into batches somehow to avoid a really large query.

I don't know how big our Sydent deployments are, but to be honest I'm not sure it's vector.im that's the scary one -- perhaps some of the government deployments have everyone in the identity lookup, which I would guess could be several hundred thousand users? (@babolivier do you know?)

If you end up making this run in batches, you might want to save the position of this process to the database so that if Sydent is restarted, it can resume where it left off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this I don't know-I do know that the deletion step is supposed to be rare, but as for how large a DB this code will run on, I have no idea-this might be a questions for Brendan.


# a dict that associates an email address with correspoinding mxids and lookup hashes
associations: Dict[str, List[Tuple[str, str, str]]] = {}
H-Shay marked this conversation as resolved.
Show resolved Hide resolved

# iterate through selected associations, casefold email, rehash it, and add to
# associations dict
for address, mxid in res.fetchall():
casefold_address = address.casefold()

# rehash email since hashes are case-sensitive
lookup_hash = self.calculate_lookup(casefold_address)

if casefold_address in associations:
associations[casefold_address].append((address, mxid, lookup_hash))
else:
associations[casefold_address] = [(address, mxid, lookup_hash)]

# list of arguments to update db with
db_update_args: List[Tuple[str, str, str, str]] = []

# list of mxids to delete
to_delete: List[Tuple[str]] = []

# list of mxids to send emails to letting them know the mxid has been deleted
mxids: List[str] = []

for casefold_address, assoc_tuples in associations.items():
db_update_args.append(
(
casefold_address,
assoc_tuples[0][2],
assoc_tuples[0][0],
assoc_tuples[0][1],
)
)

if len(assoc_tuples) > 1:
H-Shay marked this conversation as resolved.
Show resolved Hide resolved
# Iterate over all associations except for the first one, since we've already
# processed it.
for address, mxid, _ in assoc_tuples[1:]:
to_delete.append((address,))
mxids.append((mxid, address))

# iterate through the mxids and send email
for mxid, address in mxids:
templateFile = self.sydent.get_branded_template(
"matrix-org",
"migration_template.eml",
("email", "email.template"),
)

sendEmail(
self.sydent,
templateFile,
address,
{"mxid": "mxid", "subject_header_value": "MatrixID Update"},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I think one thing we could do would be to not send an e-mail if all the bindings are for the same Matrix ID.
(For bob@example.org → @bob:example.org and BOB@example.org → @bob:example.org, it won't make any difference if you remove one of the two.)
Just thinking that this will reduce the number of e-mails sent and will reduce the number of people who might get confused by this.


if len(to_delete) > 0:
cur.executemany(
"DELETE FROM local_threepid_associations WHERE address = ?", to_delete
)
H-Shay marked this conversation as resolved.
Show resolved Hide resolved

if len(db_update_args) > 0:
cur.executemany(
"UPDATE local_threepid_associations SET address = ?, lookup_hash = ? WHERE address = ? AND mxid = ?",
db_update_args,
)

# We've finished updating the database, committing the transaction.
conn.commit()


def update_global_assoc(self, conn: sqlite3.Connection):
H-Shay marked this conversation as resolved.
Show resolved Hide resolved
"""Update the DB table global_threepid_associations so that all stored
emails are casefolded, the signed association is re-signed and any duplicate
mxid's associated with the given email are deleted.

:return: None
"""

# get every row where the local server is origin server and medium is email
origin_server = self.sydent.server_name
medium = "email"

cur = self.sydent.db.cursor()
res = cur.execute(
"SELECT address, mxid, sgAssoc FROM global_threepid_associations WHERE medium = ?"
"AND originServer = ? ORDER BY ts DESC",
(medium, origin_server),
)

# dict that stores email address with mxid, email address, lookup hash, and
# signed association
associations: Dict[str, List[Tuple[str, str, str, str]]] = {}

# iterate through selected associations, casefold email, rehash it, re-sign the
# associations and add to associations dict
for address, mxid, sg_assoc in res.fetchall():
casefold_address = address.casefold()

# rehash the email since hash functions are case-sensitive
lookup_hash = self.calculate_lookup(casefold_address)
H-Shay marked this conversation as resolved.
Show resolved Hide resolved

# update signed associations with new casefolded address and re-sign
sg_assoc = json_decoder.decode(sg_assoc)
sg_assoc["address"] = address.casefold()
sg_assoc = json.dumps(
signedjson.sign.sign_json(
sg_assoc, self.sydent.server_name, self.sydent.keyring.ed25519
)
)

if casefold_address in associations:
associations[casefold_address].append(
(address, mxid, lookup_hash, sg_assoc)
)
else:
associations[casefold_address] = [(address, mxid, lookup_hash, sg_assoc)]

# list of arguments to update db with
db_update_args: List[Tuple[str, str, str, str]] = []

# list of mxids to delete
to_delete: List[Tuple[str]] = []

# list of mxids and addresses to send emails to letting them know the mxid
# has been deleted
mxids: List[Tuple[str]] = []

for casefold_address, assoc_tuples in associations.items():
db_update_args.append(
(
casefold_address,
assoc_tuples[0][2],
assoc_tuples[0][3],
assoc_tuples[0][0],
assoc_tuples[0][1],
)
)

if len(assoc_tuples) > 1:
# Iterate over all associations except for the first one, since we've already
# processed it.
for address, mxid, _, _ in assoc_tuples[1:]:
to_delete.append((address,))
mxids.append((mxid, address))

# iterate through the mxids and send email
for mxid, address in mxids:
templateFile = self.sydent.get_branded_template(
"matrix-org",
"migration_template.eml",
("email", "email.template"),
)

sendEmail(
self.sydent,
templateFile,
address,
{"mxid": "mxid", "subject_header_value": "MatrixID Update"},
)

if len(to_delete) > 0:
cur.executemany(
"DELETE FROM global_threepid_associations WHERE address = ?", to_delete
)

if len(db_update_args) > 0:
cur.executemany(
"UPDATE global_threepid_associations SET address = ?, lookup_hash = ?, sgAssoc = ? WHERE address = ? AND mxid = ?",
db_update_args,
)

conn.commit()
10 changes: 10 additions & 0 deletions sydent/db/threepid_associations.py
Expand Up @@ -189,6 +189,7 @@ def signedAssociationStringForThreepid(
:return: The signed association, or None if no association was found for this
3PID.
"""

cur = self.sydent.db.cursor()
# We treat address as case-insensitive because that's true for all the
# threepids we have currently (we treat the local part of email addresses as
Expand Down Expand Up @@ -219,6 +220,9 @@ def getMxid(self, medium: str, address: str) -> Optional[str]:

:return: The associated MXID, or None if no MXID is associated with this 3PID.
"""
if medium == "email":
address = address.casefold()

cur = self.sydent.db.cursor()
res = cur.execute(
"select mxid from global_threepid_associations where "
Expand Down Expand Up @@ -307,6 +311,9 @@ def addAssociation(
:param commit: Whether to commit the database transaction after inserting the
association.
"""
if assoc.medium == "email":
assoc.address = assoc.address.casefold()

cur = self.sydent.db.cursor()
cur.execute(
"insert or ignore into global_threepid_associations "
Expand Down Expand Up @@ -357,6 +364,9 @@ def removeAssociation(self, medium: str, address: str) -> None:
:param medium: The medium for the 3PID.
:param address: The address for the 3PID.
"""
if medium == "email":
address = address.casefold()

cur = self.sydent.db.cursor()
cur.execute(
"DELETE FROM global_threepid_associations WHERE "
Expand Down
2 changes: 1 addition & 1 deletion sydent/terms/terms.py
Expand Up @@ -107,7 +107,7 @@ def get_terms(sydent) -> Optional[Terms]:
return Terms(None)

with open(termsPath) as fp:
termsYaml = yaml.full_load(fp)
termsYaml = yaml.safe_load(fp)
if "master_version" not in termsYaml:
raise Exception("No master version")
if "docs" not in termsYaml:
Expand Down
3 changes: 3 additions & 0 deletions sydent/threepid/bind.py
Expand Up @@ -58,6 +58,9 @@ def addBinding(self, medium: str, address: str, mxid: str) -> Dict[str, Any]:

:return: The signed association.
"""
if medium == "email":
address = address.casefold()
H-Shay marked this conversation as resolved.
Show resolved Hide resolved

localAssocStore = LocalAssociationStore(self.sydent)

# Fill out the association details
Expand Down