Metadata resistance #4565
Currently Synapse (and AFAIK the whole Matrix ecosystem) doesn't attempt to minimize metadata gathering in any way. This is one of its biggest issues in terms of security and privacy. It means Matrix is not a sensible option for people who care about these values: they have to choose between privacy/security and decentralization/a modern FOSS protocol, and I think the latter values are significantly less important. In the next few weeks Matrix should get to a state where there's bandwidth available to get these basics right, and only then work on things of lesser importance like new features, app rewrites and Dendrite. I think it's a good strategy to first make the base robust and only then move further.
Incomplete list of unnecessary data gathered by synapse:
I didn't investigate this thoroughly and there's likely more, if you know of anything else, don't forget to share in comments.
Since Synapse requires other services to operate, such as a reverse proxy, coturn and Postgres (I'm not sure whether Python or anything else logs anything), these should also be dealt with, either by removing these dependencies or by writing good documentation, together with tools, that enables even a person without an infosec or sysadmin background to set it up easily, properly and quickly using only that documentation. This is particularly important as Matrix aims to have a well-balanced ecosystem of smaller servers, avoiding the common problem of federation.
Users should be sufficiently and visibly informed in the documentation about anything that is stored and about the options for changing this behavior, e.g. log removal and how it should be done.
As you can see from the timeline here this issue is very much on our radar, although we haven't fed back on the points raised here which we've fixed (oops).
We no longer store access_token, device_id, user_agent, last_seen for users (unless that user's session is still active), as of #6098.
It's inevitable that we track the user_id and timestamp for when users join/leave rooms in order for the room history to actually function. MSC1228 will help us obfuscate the user_id however and is coming shortly.
We have always gone to great lengths to avoid logging any sensitive data (e.g. message contents, secrets, key data etc) in logs. However, log lines do include user IDs and room IDs required to trace problems. Synapse doesn't run in a log minimisation configuration by default because it's still not stable enough to run unattended by itself, flying blind. We need the logs to help people out when things break. As soon as we hit a sufficient level of stability we'll change the default log level for sure (and we are headed in that direction).
Synapse doesn't dictate how you store your logs or what retention scheme you apply. Each package of Synapse does it differently (systemd; python logging; docker logs etc), and it's up to the sysadmin to specify the log rotation & retention policy. They can also switch the log level to WARN if they want, which hides all PII.
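For anyone going the plain python logging route, here is a rough sketch of what "WARN level plus a retention policy" can look like using only the standard library. It is not Synapse's shipped log config: the `build_log_config` helper, the `keep_days` parameter and the file path are made up for illustration, and it only shows the mechanics of level filtering and rotation, not what any particular log line contains.

```python
import logging
import logging.config


def build_log_config(log_path, keep_days=7):
    """Return a logging dictConfig that logs WARNING and above to a file
    rotated daily, keeping at most `keep_days` old files.

    Hypothetical helper for illustration; not part of Synapse.
    """
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "brief": {
                "format": "%(asctime)s %(levelname)s %(name)s: %(message)s",
            },
        },
        "handlers": {
            "file": {
                "class": "logging.handlers.TimedRotatingFileHandler",
                "filename": log_path,
                "when": "midnight",        # rotate once per day
                "backupCount": keep_days,  # older rotated files are deleted
                "formatter": "brief",
                "delay": True,             # open the file on first write
            },
        },
        # INFO and DEBUG records are dropped before they reach the handler.
        "root": {"level": "WARNING", "handlers": ["file"]},
    }


if __name__ == "__main__":
    logging.config.dictConfig(build_log_config("example.log", keep_days=3))
    log = logging.getLogger("example")
    log.info("this line is not written")
    log.warning("this line is written")
```

With `backupCount` set, the handler removes the oldest rotated file itself, so retention is bounded without a separate cron job. Deployments that log to systemd or docker would need to set retention in those tools instead.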
Redacted/deleted events now get pruned after N days as of #5934. Deleting files referenced by redacted events is harder, but we're working on it.
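To make the "pruned after N days" idea concrete, here is a minimal sketch of that kind of retention pass. It is not the code from #5934: the `StoredEvent` type, its fields and the `prune_redacted` function are hypothetical, and the real implementation works against the database rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class StoredEvent:
    # Hypothetical stand-in for a stored event row.
    event_id: str
    content: Optional[dict]
    redacted_at: Optional[datetime] = None  # None means never redacted


def prune_redacted(events: List[StoredEvent], now: datetime,
                   retention_days: int) -> int:
    """Drop the content of events redacted at least `retention_days` ago.

    Returns the number of events pruned. The event itself is kept so that
    anything referring to it by ID still resolves.
    """
    cutoff = now - timedelta(days=retention_days)
    pruned = 0
    for ev in events:
        if (ev.redacted_at is not None
                and ev.redacted_at <= cutoff
                and ev.content is not None):
            ev.content = None
            pruned += 1
    return pruned


if __name__ == "__main__":
    now = datetime(2019, 10, 1, tzinfo=timezone.utc)
    events = [
        StoredEvent("$old", {"body": "secret"}, now - timedelta(days=10)),
        StoredEvent("$recent", {"body": "oops"}, now - timedelta(days=1)),
        StoredEvent("$normal", {"body": "hello"}),
    ]
    print(prune_redacted(events, now, retention_days=7))
```

Only the event redacted more than seven days before `now` loses its content here; the recently redacted one and the untouched one are left alone.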