Opting Out #47

Open
Sleuth56 opened this issue Aug 6, 2022 · 60 comments
Labels
T-Enhancement New feature or request

Comments

@Sleuth56

Sleuth56 commented Aug 6, 2022

Has there been any thought given to how room admins or homeserver admins could opt out their room or server?
My thought would be some sort of state event in a room and the already standard X-Robots-Tag for servers.

I know this project is in very early stages but given the distributed nature of the matrix network I believe this to be a very important thing, especially coming from the matrix core team (Or at least that's who it looks like it's being driven by).


Related MSCs:

@MadLittleMods MadLittleMods added the T-Enhancement New feature or request label Aug 8, 2022
@MadLittleMods
Contributor

The archive will only access rooms where the history is world_readable. A room can be public without world_readable history so there is already a mechanism to sorta control this.

But it is possible that we add an additional signal/control to determine whether search engines should index it. I imagine that you would still be able to access the room via Matrix public archive but we would tell Google not to index the room if that signal is present.
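
One way such a signal could be relayed to crawlers is the standard X-Robots-Tag response header mentioned in the issue description. The following is only a minimal sketch: `room_noindex` is a hypothetical per-room flag (Matrix does not define one today) and `robots_headers` is an illustrative helper, not part of the archive's code. The page itself stays reachable; only the indexing hint changes.

```python
def robots_headers(room_noindex: bool) -> dict[str, str]:
    """Return extra HTTP response headers for a room page.

    `room_noindex` is a hypothetical opt-out signal; Matrix does not
    currently define one.
    """
    if room_noindex:
        # Ask crawlers not to index the page or follow its links.
        return {"X-Robots-Tag": "noindex, nofollow"}
    # No opt-out signal: send no extra header.
    return {}
```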

@Mikaela

Mikaela commented Aug 16, 2022

Will removing world readability affect the archive or will the past history still be visible using it?

Is there any mechanism in Matrix for redacting the past public history that any room administrator can use without resorting to running bots/servers or unspecced hacks (m.room.retention and specced changing history visibility to members only and /upgraderoom 10)?

@Mikaela

Mikaela commented Aug 16, 2022

Another concern is do tombstoned rooms automatically get excluded by both this and matrix-static? They may be inaccessible if they have weird join rules and no one to invite.

@MadLittleMods
Contributor

MadLittleMods commented Aug 16, 2022

Will removing world readability affect the archive or will the past history still be visible using it?

@Mikaela I'm not sure exactly how the m.room.history_visibility lookup will happen (current state or state at the time). It depends on that.

Is there any mechanism in Matrix for redacting the past public history [...]

Best to create an issue and ask elsewhere about this.

Another concern is do tombstoned rooms automatically get excluded by both this and matrix-static? They may be inaccessible if they have weird join rules and no one to invite.

Any world readable rooms, including tombstoned will be accessible. I don't have any details about the other project, matrix-static, for how it works around this.

MadLittleMods added a commit that referenced this issue Sep 9, 2022
…ex `world_readable` (#66)

Only show `world_readable` or `public` rooms in the archive. Only allow `world_readable` rooms to be indexed by search engines.

Related to #47
@MadLittleMods
Contributor

Related to matrix-org/synapse#14127

@Cyberes

Cyberes commented May 6, 2023

I have a few rooms that are set to world_readable and published to my homeserver's room directory that I don't want to be on my public archive site. My solution was to simply ban the archive bot from those rooms. This really isn't a good solution because the rooms are still listed and when you try to view them you get an error page.

What if the archive backend was to query all the rooms to check if its bot has access? If it is banned from a room then it would remove that room from the list on the website.

I imagine this check would be in addition to whatever setting is used to opt-out.

@MadLittleMods
Contributor

MadLittleMods commented May 6, 2023

@Cyberes The app is stateless so we can't just hold onto that kind of access/ban information across requests. It needs to be something we can query from the API and for the room directory, we wouldn't want to make a separate look-up for every single room we're trying to display in the grid so it needs to come from the room directory, /publicRooms endpoint.

A new state event like X-Robots-Tag/<meta name="robots" content="noindex, nofollow"> that the issue description mentions seems like a good way to go. The room directory can relay this information just like it already does for join_rule and world_readable. A new MSC needs to be created for this though.
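
To illustrate why relaying the signal through the room directory matters for a stateless app, here is a rough sketch that filters a `/publicRooms` response in a single pass, with no per-room lookup. The `noindex` field is purely hypothetical (it would have to come from a future MSC), and `visible_rooms` is an illustrative name.

```python
from typing import Any


def visible_rooms(chunk: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only rooms that have not opted out.

    `chunk` mirrors the `chunk` array of a /publicRooms response.
    `noindex` is a hypothetical field that a future MSC might add;
    rooms without it are treated as not opted out.
    """
    return [room for room in chunk if not room.get("noindex", False)]
```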


If you can share the details, what's the distinction that your rooms should be public, world_readable but not accessible in the archive? Trying to understand the context/use-case.

@Cyberes

Cyberes commented May 6, 2023

The rooms that fit the public world_readable but excluded from the archive are all related to "server administration" such as announcements, the room for things relating to the homeserver, etc. It's fine if a user from another HS finds and joins those rooms but I'd rather not have their content public on the internet (you know what they say, "the internet never forgets" or whatever).

@paul90

paul90 commented May 7, 2023

Really not sure about how fit for purpose using public world_readable is as a basis for inclusion in the archive. I'm going to assume that world_readable equates to Anyone in the Room Security & Privacy options in Element. So, if you want people to see content before joining the room it will get archived.

Changing the setting to Members only (since the point in time of selecting this option) doesn't change the visibility of existing history. And what if @archive:matrix.org is still a member of the room?

At least for now, banning the bot seems like the only solution, but the resulting 403 really doesn't look good for the archive.

@Mikaela

This comment was marked as outdated.

@tulir
Member

tulir commented May 30, 2023

After the bot joined a room with history visibility set to joined, it rightfully showed an error on the website. However, the error says

StatusError: 403 - Only world_readable or shared rooms that are public can be viewed in the archive.

shared rooms should not be archived, only world_readable rooms should be.

@MadLittleMods
Contributor

MadLittleMods commented May 30, 2023

shared rooms should not be archived, only world_readable rooms should be.

"archived" is a bit of an overloaded term here but given that this project is called "Matrix Public Archive" I can see where the confusion may be coming from. Any public room with shared/world_readable history visibility should be viewable in Matrix Public Archive. The idea is if a random Matrix user can view the room, then it should be viewable in the archive. But only history_visibility: "world_readable" rooms are indexable by search engines.

The Matrix Public Archive doesn't hold onto any data (it's stateless) and requests the messages from the homeserver every time (it archives nothing). The archive.matrix.org instance has some caching in place: 5 minutes for the current day, and 2 days for past content.
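
The caching figures above can be pictured as a small lookup. This is only a sketch: the two durations come from the comment, but `cache_ttl` is an illustrative name and how the real deployment applies the values (for example, which time zone defines "the current day") is an assumption.

```python
from datetime import date, timedelta


def cache_ttl(requested_day: date, today: date) -> timedelta:
    """Pick a cache lifetime for an archive page.

    Uses the figures quoted above: 5 minutes for the current day,
    2 days for past content. How the real deployment applies them is
    an assumption here.
    """
    if requested_day >= today:
        # The current day can still receive new messages.
        return timedelta(minutes=5)
    # Past days are treated as settled content.
    return timedelta(days=2)
```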

I've tried to clarify more of this in the FAQ document and added more details on why not guest access/peeking.

Banning @archive:matrix.org will prevent the room from showing up on archive.matrix.org and the cache will expire after 5-minutes/2-days for any content that is showing there now. Adding better opt-out controls like this issue is discussing is on the list 👍. I've updated the description with the current MSC proposals out there.

@bkil

This comment was marked as off-topic.

@bkil

bkil commented May 31, 2023

Why was my comment marked off topic?

But also, it was not obvious to many of us that this project aims for 5 minutes/2 days ephemeral caching.

According to the documentation and the name, most users seem to associate it with some kind of crawling mass-joining bot with unlimited storage and resources that slows the federation to a crawl and eternally captures all of their content to use against them or train a mastermind that will take over the world.

Many moderators are also suffering from PTSD due to encountering unidentified bots almost every week mass-joining basically every room they can and about half of them start to flood or spam thousands of rooms at a time after a few days of delay.

This bot and the system itself falls under a different category as long as it is true that it is an interactive agent acting on behalf of a given human visitor who is browsing the calendar or clicking on a permalink they got from another platform. This would be seen highly beneficial and a quality of life improvement by most. That would be much more palatable to all stakeholders. If this was true, the spirit of the robots.txt exclusion standard similarly wouldn't apply.

I recommend renaming the project to something that makes this more obvious, such as matrix-static-proxy, matrix-nojs-preview, matrix-nojs-reader, matrix-nojs-client, matrix-ssr, server-rendered-rooms, matrix-permalinker, matrix-permalink-resolver, ...

It seems logical and uncontested that we should allow crawling & indexing of a message by search engines as long as it is linked this way from a personal web page. However, enabling spidering (following intra-site links with indefinite scrolling) again falls under a different philosophical debate as can be seen from the relevant MSCs.

@MTRNord

MTRNord commented Jun 1, 2023

Just an additional data point but is it considered that it may be a bad idea to allow anyone to activate this on any room? Shouldn't it be a thing only room admins can do? Basically because a) a room admin may not be aware of this b) a room admin may not want this but also doesn't want to clutter state with yet another ban.

Another thing is the right to be forgotten which exists in the EU. A user has the right to be forgotten as per GDPR. It's not clear how an individual can opt out even if the room itself is deciding to allow the bot. Mass redactions in the past have been the equivalent of a denial of service attack. So I don't think they are a sensible way to do this.

@bkil

bkil commented Jun 1, 2023

@MTRNord Could you please read the past messages? Or at least mine #47 (comment)

@n0toose

n0toose commented Jun 3, 2023

I want to never, ever, ever feel the need to write an entire essay to this organization for something that should be obvious to an organization developing infrastructures for communities and open-source projects.

No matter if you think that you can implement it in a way that is better / more ethical, the existence of a service joining channels out of nowhere (while letting everyone else try to put the pieces together as to why it exists, how it joined, did somebody invite it?, etc.) wastes time of volunteers using Matrix to figure out what changed in their channel and what implications it may have.

There's a clear power imbalance, you have the matrix.org infrastructure that cannot have any accountability whatsoever and are also being paid to clean the entire thing with the "move fast, break things" approach. On the other hand, you have a lot of volunteer-run communities that depend on you and have to figure out what's going on with their channel with zero communication whatsoever. I spent at least 2 hours of limited, volunteering time examining what was happening, and there were more people involved (EDIT: Just to be clear, this is a strictly personal opinion and I do not speak for them.).

Discord uses announcements to explain changes in the way people run their communities. You can announce stuff server-wide in IRC. The UX problems that exist in Matrix are not the fault of the communities that use your infrastructure and the "move fast, break things" approach to this is in every way super annoying, especially with the Trust & Safety implications that were dismissed with a series of whataboutisms.

@Mikaela

Mikaela commented Jun 3, 2023

Do I understand correctly that the whole XMPP protocol is also opted-in to the bot with no method of opting out?

@MTRNord

MTRNord commented Jun 3, 2023

I had just another thought about this:

Would it help any admins or in general if the bot instead of silently joining would announce what is happening and link to its privacy policy as well as the FAQ? Currently, it joins silently, which feels like intentionally wanting to stay under the radar, even if that's not intentional. I think at least in some cases having the bot explain itself might help with acceptance and also allows room admins to more easily and quickly opt out if they wish to do so.

@akierig

akierig commented Jun 4, 2023

Hi I'm a Libera channel op and my channels' policy is to kickban anyone with [m] in their nick. Matrix bridges with their "we remember everything forever by default" are bad enough, this is just appalling and I agree with most of @n0toose's comment above, depending on one's view of the matrix.org/element organization

The design right now is, in my honest opinion, very user-hostile and I would think that this was being done by a very malicious party if this repository was not under the matrix-org namespace.

I'd also point to Libera's own public logging policy: https://libera.chat/policies/#public-logging

Some projects may wish to log their channels publicly, if you do so the logging should be authorised by the channel owners and users in the channel should be notified (through for instance the topic, entry message, or similar) that public logging is taking place. Channel operators should consider ways for users to make unlogged comments and a process for requesting the removal of certain logs.

If you operate a service that scrapes internal channel content or published logs, ensure that you have obtained permission to do so from Libera Chat staff or the channel owners before you start scraping data, also make sure that there is an easy way for channels to opt-out.

If you wish to publish logs of a single conversation, please make sure you have gotten permission from all participants before doing so.

@JokerGermany

Hi I'm a Libera channel op and my channels' policy is to kickban anyone with [m] in their nick. Matrix bridges with their "we remember everything forever by default" are bad enough, [...]

And what do you want in a matrix related Issue?
It looks like you are hostile against Matrix. Kickbanning anyone with [m] in their nick is your choice, but it's weird that you are commenting on a Matrix project which you seem to hate...

@Porkepix

Porkepix commented Jun 4, 2023

Hi I'm a Libera channel op and my channels' policy is to kickban anyone with [m] in their nick. Matrix bridges with their "we remember everything forever by default" are bad enough, [...]

And what do you want in a matrix related Issue? It looks like you are hostile against Matrix. Kickbanning anyone with [m] in their nick is your choice, but it's weird that you are commenting on a Matrix project which you seem to hate...

Whether they hate it or not is a thing, and I won't comment on this. But it's completely legitimate even for non-users to comment on such matters, considering that, due to the various bridging here and there, even non-users are affected by things such as public logs available to everyone, or public indexing/crawling/scraping resulting in logs that can be searched for in search engines.

I.e. a pure IRC user with many Matrix users that joined their channel (and you can translate that to XMPP or pretty much any other protocol Matrix is bridging to) is completely affected by these choices, pretty much against their will by design (not even talking about all the users who're not even aware of this).

@akierig

akierig commented Jun 5, 2023

And what do you want in a matrix related Issue?

This affects me because I participate in IRC networks that matrix bridges to. I never opted in to being archived publicly by Matrix, nor do I want to be.

@MadLittleMods
Contributor

@akierig It looks like the libera.chat rooms have history visibility set to joined which means the rooms won't be accessible in the archive at all (will just give a 403 Forbidden).

The archive bot will still join the room because it doesn't know the history visibility before it joins but it won't show any content from the room in that case (only world_readable and shared history visibility are accessible in the archive). If the public room is in the room directory, it will be listed in the archive but will still lead to a 403 Forbidden in that case.
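
The behaviour described here, together with the indexing rule stated earlier in the thread, can be summarised as a small decision function. This is a sketch under stated assumptions: `archive_response` is an illustrative name, the room is assumed to be public and already joined by the bot, and the real project's checks may be structured differently.

```python
def archive_response(history_visibility: str) -> tuple[int, bool]:
    """Return (http_status, indexable) for a public room the bot has joined.

    `world_readable` rooms are viewable and indexable, `shared` rooms are
    viewable but not indexable, and everything else yields 403 Forbidden.
    """
    if history_visibility == "world_readable":
        return 200, True
    if history_visibility == "shared":
        return 200, False
    # "joined", "invited", or any unrecognised value
    return 403, False
```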

It seems useful if we had an endpoint that would return the history visibility information without joining (GET /_matrix/client/v3/directory/list/room/{roomId} currently only returns whether it's in the room directory or not). It would need to take a roomIdOrAlias instead to be useful though as that's the beauty of POST /_matrix/client/v3/join/{roomIdOrAlias} right now. It would also be nice to hide these rooms from the archive homepage as well which means /publicRooms would need to show the history visibility beyond the world_readable indicator it has now.
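
For reference, a minimal sketch of building the request paths for the two existing endpoints named above. The helper names are illustrative only; the one detail worth showing is that room IDs and aliases contain characters (`!`, `#`, `:`) that must be percent-encoded when placed in a URL path.

```python
from urllib.parse import quote


def join_path(room_id_or_alias: str) -> str:
    """Build the path for POST /_matrix/client/v3/join/{roomIdOrAlias}."""
    # safe="" makes quote() escape every reserved character, including the
    # leading '!' or '#' sigil and the ':' before the server name.
    return "/_matrix/client/v3/join/" + quote(room_id_or_alias, safe="")


def directory_visibility_path(room_id: str) -> str:
    """Build the path for GET /_matrix/client/v3/directory/list/room/{roomId}."""
    return "/_matrix/client/v3/directory/list/room/" + quote(room_id, safe="")
```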

@Mikaela

Mikaela commented Jun 7, 2023

I would like to request increasing the priority for resolving this issue as there are currently at least two instances of bot requiring manual banning:

Additionally I suspect the following of being matrix-public-archive instances:

  • @archive:privacyguides.org - I have no idea where its web interface is, but the homeserver admin is @jonaharagon who is active on this issue tracker and hasn't gotten back to requests for comment at Community Moderation Effort (or PrivacyGuides moderators haven't reported back hearing anything about him).

@jonaharagon

jonaharagon commented Jun 7, 2023

@Mikaela oops, I didn't see a question in CME. I do have this set up, but my instance can only join two specific rooms (which I admin), so it is not contributing to this problem.

@Half-Shot

For what it's worth, we've had to ban the archive bot on the bridge because it joined too many rooms. The IRC bridge (at least, the libera.chat bridge) requires that all Matrix-side users are joined to the IRC channels they are bridged to.

However, most IRC networks will limit how many channels one user may join at any time. The bot exceeded this value and the result was the bridge became very unstable. We're hoping there might be a solution to this problem at some point because I believe the archive provides value to some users of the bridge, but ultimately in its present incarnation it's not suitable.

@Mikaela

Mikaela commented Jun 23, 2023

As there is even more development with opting more rooms into archiving, I would like to ask whether there is development also with opting out?

I also wonder whether this issue could be pinned for its significance? This tracker doesn't currently utilise that feature and three can be pinned at once on GitHub as far as I am aware of.

Additionally I have been wondering whether declaring Free Tibet and Slava Ukraini in public rooms gets them opted out from Baidu and Yandex?

@Mikaela

Mikaela commented Jun 23, 2023

I just learned that https://staging.archive.matrix.org/ is a thing and is using a separate account, @archiver:gitter.im. I find it upsetting that it's not mentioned anywhere and means yet more whack-a-mole, while you could be better and allow rooms to just opt out of this "service".

You could be even better and be opt-in for room administrators which would be in spirit of privacy friendliness without even mentioning modern privacy legislations or directives.

@Mikaela

Mikaela commented Jun 23, 2023

I apologise for multiple comments in a row and heated emotions, but if staging.archive.matrix.org is meant for internal testing/staging, why is it publicly accessible? Surely Matrix Foundation would have the resources to at least put HTTP Basic Auth in front of it?

I also question it having a different account. If the bot is truly stateless, why cannot it share the account so one ban would affect both instances?

@jae1911

jae1911 commented Jun 23, 2023

Doing a ban evasion with a second “testing” bot that is publishing data publicly isn’t the smartest thing to do.
Stay the fuck away from my rooms, make your thing opt-in.

@MadLittleMods
Contributor

I just learned that https://staging.archive.matrix.org/ is a thing and using a separate account @archiver:gitter.im

staging.archive.matrix.org has search engine indexing turned off completely with the stopSearchEngineIndexing option.

I've also updated it to use the @archive:matrix.org user so there is no difference anymore. It was only historically using @archiver:gitter.im because MSC3030 wasn't supported on matrix.org while developing things.

The end-goal is to have staging.archive.matrix.org live behind some authentication. This is tracked internally by the deployment issue.

@1bcb

1bcb commented Jun 23, 2023

Currently Element gives room admins the option to make a room visible by "anyone" or "members only." I don't see anything wrong with making a public archive of rooms that are readable by anyone already.

But the archive bot is joining rooms where the admins have already set the room to not be publicly visible, and it's archiving their history. My room is set to be visible only to members, but the archive bot joined today and made the history visible on the public archive. It's gone now, since I banned the archive bot, but I shouldn't have had to do that when I already "opted out" by setting my room to be not publicly visible.

View.matrix.org does this right.

@spacekookie

Even if a room is publicly readable, that doesn't mean that its admins consent to conversations in that room being systematically scraped. Yes, anyone could come by and archive the room against the will of its participants. But this requires a targeted or coordinated effort.

I am very pro the idea of archiving Matrix rooms by the way. So much information gets lost on the internet every day because it got shared once in a Discord or Matrix or whatever room and will disappear with time. But this should be an opt-in process.

Make a UI element for it. Show people why they should have their rooms archived, make people care about archiving. Don't force this on people. Making privacy decisions for the user, not trusting them to understand what is best for them, is one of the ways that the tech industry has been eroding the trust of users over the last 20 years.

Please don't follow Silicon Valley's footsteps. It won't lead you anywhere nice.

@Midou36O

I am very pro the idea of archiving Matrix rooms by the way.

Funnily enough it doesn't actually archive, it just makes search engines able to scrape the content. Once the bot is banned, any content stored there is gone.

@haansn08

I never used any Matrix products but I might be affected since I use XMPP and IRC. Can I request a listing of my personal data which is stored by Matrix under the GDPR?

@Mikaela

This comment was marked as off-topic.

@br4yd

br4yd commented Jun 25, 2023

Just an additional data point but is it considered that it may be a bad idea to allow anyone to activate this on any room? Shouldn't it be a thing only room admins can do? Basically because a) a room admin may not be aware of this b) a room admin may not want this but also doesn't want to clutter state with yet another ban.

Another thing is the right to be forgotten which exists in the EU. A user has the right to be forgotten as per GDPR. It's not clear how an individual can opt out even if the room itself is deciding to allow the bot. Mass redactions in the past have been the equivalent of a denial of service attack. So I don't think they are a sensible way to do this.

The GDPR thing is really important. Especially inside the EU people can really harm admins if a room gets archived against their will and they take legal action on this. A court would probably not have the knowledge to understand what this is and would blame the room admin or the server admin the room was created on for the fact that everyone's message gets archived (yes it's not really archived as in stored but archived as in search engines can scrape it). To be compatible with the GDPR it probably would have to be opt-in, and opt-in on a user-by-user basis instead of a room-by-room basis. I as a room admin can never decide if all my users would be okay to have their (public) messages archived and scanned by search engines so I can't willingly take that decision. It would probably have to be so that I as a Matrix user can say "yes, archive all my messages I write in all public rooms by default, but don't archive my messages in room X Y and Z". This would only cause some part or contents of a room's conversation to be archived but it would be GDPR compatible because it'd be opt-in on a per-user basis.
I'm not a lawyer but GDPR can be good for users but very bad for service administrators, especially when those administrators can't really change it due to technical limitations or when I as an administrator am not even aware of a feature because it's not implemented with a notice inside the clients or the protocol itself.
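
The per-user opt-in described above could be modelled roughly as follows. Everything here is hypothetical: neither `UserArchivePreference` nor `may_archive` corresponds to anything in Matrix or the archive today, and the sketch says nothing about whether such a scheme would actually satisfy GDPR.

```python
from dataclasses import dataclass, field


@dataclass
class UserArchivePreference:
    """Hypothetical per-user setting; nothing like this exists in Matrix today."""

    # Opt-in model: the default is "do not archive".
    opted_in: bool = False
    # Rooms the user excludes even after opting in ("room X Y and Z").
    excluded_rooms: set[str] = field(default_factory=set)


def may_archive(pref: UserArchivePreference, room_id: str) -> bool:
    """True only if the user opted in and has not excluded this room."""
    return pref.opted_in and room_id not in pref.excluded_rooms
```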

Don't get me wrong I love the idea of public rooms being indexed by search engines so that knowledge can be shared in a better way and also archived for the future, BUT it's not good if some random user takes legal action against an administrator because of that at some point and this administrator then lands in a court without knowledge or technical understanding about this topic so the administrator eventually gets punished without any real reason.

-> Very hypothetical example but it's still possible IMO.

@Mikaela

Mikaela commented Jun 25, 2023

Additional complication with GDPR is that nothing prevents archiving the archive. From Forĝejo discussion on the bots:

There are also other archives like archive.ph, view there.

Edit: I also forgot that IPFS Companion provides decentralised archive/snapshots too, so whack-a-mole with archives may not work any better on archive.matrix.org side than users trying to opt-out of these archive bots.


@wojtekLs,

BTW: Why was your comment marked as off-topic? It's on topic to its core.

I try to keep the issue cleanish by marking my own comments as off-topic when they are not strictly relevant to the issue at hand.
