Metadata resistance #4565

Open
gerg5c42g542g2c54g52c opened this issue Feb 5, 2019 · 1 comment

Comments

@gerg5c42g542g2c54g52c
commented Feb 5, 2019

Currently Synapse (and AFAIK the whole Matrix ecosystem) doesn't attempt to minimize metadata gathering in any way. This is one of its biggest issues in terms of security and privacy. It makes Matrix a poor option for people who care about these values: they have to choose between privacy/security and decentralization/a modern FOSS protocol, and I think the latter values are significantly less important. In the next few weeks Matrix should get to a state where there's bandwidth available to get these basics right, and only then work on less important things like new features, app rewrites and Dendrite. I think it's a good strategy to make the base robust first and only then move further.

Incomplete list of unnecessary data gathered by Synapse:

  • The database stores unnecessary information. All joins and leaves in every room are stored, these entries consist of user_id, access_token, device_id, ip, user_agent, last_seen, timestamp. There's most likely more. These should be truncated to contain only information that is truly necessary and shouldn't be stored longer than necessary.
  • Logs contain vast amount of information. Remove anything that isn't absolutely necessary from them and either implement a user-friendly mechanism (or documentation) to manage them, purge them automatically after a short period of time (e.g. 7 days) or don't store them at all. Logs in production releases of Synapse shouldn't contain debugging information, only information required for security reasons, e.g. an audit after a breach, and the documentation should give guidance on how to secure this data while minimizing metadata retention.
  • Other things like redacted and deleted events, accounts, sent files.

I didn't investigate this thoroughly and there's likely more; if you know of anything else, please share it in the comments.

Since Synapse requires other services for operation, like a reverse proxy, coturn and Postgres (I'm not sure whether Python or anything else logs anything), these should also be dealt with, either by removing these dependencies or by writing good documentation, together with tools, that enables even a person without an infosec or sysadmin background to set everything up easily, properly and quickly using only that documentation. This is particularly important as Matrix aims to have a well-balanced ecosystem of smaller servers, avoiding the common problem of federation.

Users should be sufficiently and visibly informed in the documentation about anything that is stored and about the options for changing this behavior, e.g. log removal and how it should be done.

@ara4n

Member

commented Oct 10, 2019

As you can see from the timeline here, this issue is very much on our radar, although we haven't fed back on the points raised here that we've since fixed (oops).

> All joins and leaves in every room are stored, these entries consist of user_id, access_token, device_id, ip, user_agent, last_seen, timestamp.

We no longer store access_token, device_id, user_agent, last_seen for users (unless that user's session is still active), as of #6098.

It's inevitable that we track the user_id and timestamp for when users join/leave rooms in order for the room history to actually function. However, MSC1228 will help us obfuscate the user_id, and it is coming shortly.

> Logs contain vast amount of information.

We have always gone to great lengths to avoid logging any sensitive data (e.g. message contents, secrets, key data etc) in logs. However, log lines do include user IDs and room IDs required to trace problems. Synapse doesn't run in a log minimisation configuration by default because it's still not stable enough to run unattended by itself, flying blind. We need the logs to help people out when things break. As soon as we hit a sufficient level of stability we'll change the default log level for sure (and we are headed in that direction).
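To illustrate the tension described above (identifiers are needed to trace problems, yet they are also metadata), here is a minimal sketch of a Python `logging.Filter` that swaps Matrix-style user IDs for keyed pseudonyms. The same user always maps to the same token, so log lines can still be correlated. This is not something Synapse ships as far as this thread says; the class name, the regular expression and the key are all hypothetical.

```python
import hashlib
import hmac
import logging
import re

# Rough, illustrative pattern for IDs such as @alice:example.org (an assumption,
# not the full Matrix user ID grammar).
USER_ID_RE = re.compile(r"@[A-Za-z0-9._=/+-]+:[A-Za-z0-9.-]+(?::\d+)?")


class PseudonymiseUserIds(logging.Filter):
    """Hypothetical filter: replace user IDs in log messages with stable tokens."""

    def __init__(self, secret: bytes):
        super().__init__()
        self._secret = secret

    def _token(self, match):
        # Keyed hash, so tokens cannot be recomputed without the secret.
        digest = hmac.new(
            self._secret, match.group(0).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        return "@user-" + digest[:12]

    def filter(self, record):
        # Render the message first so IDs passed as %-style args are covered too.
        record.msg = USER_ID_RE.sub(self._token, record.getMessage())
        record.args = ()
        return True


log_filter = PseudonymiseUserIds(secret=b"rotate-me")  # hypothetical key
record = logging.makeLogRecord(
    {"msg": "user %s joined", "args": ("@alice:example.org",)}
)
log_filter.filter(record)
print(record.getMessage())  # the user ID is replaced by an "@user-..." token
```

In a real deployment the filter would be attached to a handler with `handler.addFilter(...)`. Whether pseudonymised IDs are enough to debug federation problems is an open question this sketch does not answer.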

> Remove anything that isn't absolutely necessary from them and either implement a user-friendly mechanism (or documentation) to manage them, purge them automatically after a short period of time (e.g. 7 days) or don't store them at all.

Synapse doesn't dictate how you store your logs or what retention scheme you apply. Each package of Synapse does it differently (systemd; python logging; docker logs etc), and it's up to the sysadmin to specify the log rotation & retention policy. They can also switch the log level to WARN if they want, which hides all PII.
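For the python logging case, a sysadmin-side policy along these lines could be sketched with the standard library alone. This is an illustration, not Synapse's shipped configuration: the logger name, the file path and the 7-file retention are assumptions, the latter borrowed from the "7 days" suggested in the issue.

```python
import logging
import logging.handlers
import os
import tempfile


def build_logger(log_path, retention_days=7):
    """Log WARNING and above to a file rotated at midnight, keeping a few old files."""
    handler = logging.handlers.TimedRotatingFileHandler(
        log_path,
        when="midnight",             # start a new file each day
        backupCount=retention_days,  # rotated files beyond this count are deleted
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger = logging.getLogger("example.homeserver")  # hypothetical logger name
    logger.setLevel(logging.WARNING)  # INFO and DEBUG records are dropped
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "homeserver.log")  # throwaway location
logger = build_logger(log_path)
logger.info("user @alice:example.org joined !room:example.org")  # not written
logger.warning("database connection pool exhausted")             # written
```

Raising the level only drops records below WARNING; whether the remaining lines are free of identifiers depends on what the application puts into its warnings and errors.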

> Other things like redacted and deleted events, accounts, sent files.

Redacted/deleted events now get pruned after N days as of #5934. Deleting files referenced by redacted events is harder, but we're working on it.
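The general idea of pruning after N days can be sketched as follows. This is not the code from #5934, whose details are not given in this thread; the `StoredEvent` record, its field names and the in-memory list are hypothetical stand-ins for whatever the database actually holds.

```python
from dataclasses import dataclass
from typing import List, Optional

DAY_MS = 24 * 60 * 60 * 1000


@dataclass
class StoredEvent:
    """Hypothetical stored event; field names are illustrative only."""
    event_id: str
    content: dict
    redacted_at_ms: Optional[int] = None  # None means the event was never redacted


def prune_redacted(events: List[StoredEvent], now_ms: int, retention_days: int) -> List[str]:
    """Blank the content of events redacted at least retention_days ago.

    Returns the IDs of the events that were pruned on this pass.
    """
    cutoff_ms = now_ms - retention_days * DAY_MS
    pruned = []
    for event in events:
        expired = event.redacted_at_ms is not None and event.redacted_at_ms <= cutoff_ms
        if expired and event.content:
            event.content = {}  # keep the event shell, drop the original payload
            pruned.append(event.event_id)
    return pruned


now_ms = 100 * DAY_MS
events = [
    StoredEvent("$old", {"body": "secret"}, redacted_at_ms=now_ms - 10 * DAY_MS),
    StoredEvent("$recent", {"body": "oops"}, redacted_at_ms=now_ms - 1 * DAY_MS),
    StoredEvent("$live", {"body": "hello"}),
]
pruned_ids = prune_redacted(events, now_ms, retention_days=7)
print(pruned_ids)  # ['$old']
```

The sketch keeps the event record itself and only empties its content, on the assumption that room history still needs the event to exist. Files referenced by a pruned event are not touched here, which matches the "harder" case mentioned above.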
