This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Stop tying settings that don't imply that something should be on a Google indexed directory as such #14127

Closed
ell1e opened this issue Oct 11, 2022 · 14 comments
Labels
O-Occasional: Affects or can be seen by some users regularly or most users rarely
S-Tolerable: Minor significance, cosmetic issues, low or no impact to users.
T-Enhancement: New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@ell1e

ell1e commented Oct 11, 2022

Description

I noticed that a space on matrix.org gives a very counter-intuitive result with the following settings: I have the space's "Visibility" set to "Public", which to me means guests can join. A "History" setting isn't shown in Element. I also enabled the "Preview" setting, which I expect to allow people to see on matrix.to what the space's name and description are and nothing more.

All of the above somehow implies that the account names and handles of everyone who joined get fully leaked into the Google index (uhm, what?).

I never set any "[Add to] Public Directory Listing" setting where I would expect no privacy for anyone joining; I just want people to see a preview on matrix.to.

Therefore, please add a separate "Public Directory Listing" setting instead; linking other related settings like this is just weird. This also concerns setting a regular room's "History" to "Anyone" for previews: I've approached multiple channel owners to ask them to change it, and only 1 out of around 10 wasn't extremely surprised. You should make it separate.

Steps to reproduce

  1. Create a room or space
  2. Enable previewing by either setting "Preview" to on (space) or "History" to "Anyone" (room) for visitors on matrix.to
  3. Observe there is no ability now to not be thrown into a public directory and be Google-indexed, even though this really should be a separate setting

Homeserver

matrix.org

Synapse Version

1.69.0rc2 (b=matrix-org-hotfixes,3d5242da14)

Installation Method

No response

Platform

don't know, I'm not running it

Relevant log output

don't know, I'm not running it

Anything else that would be useful to know?

No response

@richvdh
Member

richvdh commented Oct 17, 2022

I'm not really following this. Setting the history visibility shouldn't affect whether the room is in the room directory, so I think you must mean something different when you say "public directory". Can you give more details of where this "public directory" can be found?

@richvdh
Member

richvdh commented Oct 17, 2022

(also, sorry, but if you make something visible to the public internet, there's not much we can do to stop Google seeing it too.)

@ell1e
Author

ell1e commented Oct 17, 2022

Setting the history visibility shouldn't affect whether the room is in the room directory

It does, "history" and "anyone" in Element for a room will make the messages visible on view.matrix.org and then indexed by Google. Any other history setting will give an error on view.matrix.org, leading to the actual message contents not being indexed.

The problem is that spaces preview and "history" set to "anyone" aren't associated by anyone with "visible to the public internet via a way that also invites general purpose search engines", but rather only visible to users connected via a personal matrix client. This is exactly why I made this ticket. I'm quite sure the average user doesn't expect Matrix to be akin to the regular, browsable web.

To further my point, even Facebook and LinkedIn have a "make profile indexable by search engines" option, which is independent from "can other people already on the platform see my profile". Why can't Element and Synapse add that clear setting for history? That is something people understand and can adjust to.

And I'm not proposing you shut view.matrix.org down, just maybe make people aware what's happening?

@richvdh
Member

richvdh commented Oct 17, 2022

Setting the history visibility shouldn't affect whether the room is in the room directory

It does, "history" and "anyone" in Element for a room will make the messages visible on view.matrix.org

We might be arguing semantics, but view.matrix.org is not the same thing as the public room directory: public room directory is the results of /publicRooms, which already has a control:
image

... so suggestions like 'please add a separate "Public Directory Listing" setting' are distinctly unhelpful. Let's stop talking about "public room directory" and try to use terms that we all understand to mean the same thing.
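
To illustrate the distinction drawn above, here is a minimal sketch of the two separate lookups, assuming the Matrix client-server API paths as I understand them (`directory/list/room` for the room directory flag, and the `m.room.history_visibility` state event for history). The function names and the room ID are hypothetical, and nothing here contacts a server; it only builds the request paths.

```python
from urllib.parse import quote


def directory_visibility_path(room_id: str) -> str:
    """Path for the room directory flag, i.e. whether the room shows up in /publicRooms."""
    return f"/_matrix/client/v3/directory/list/room/{quote(room_id, safe='')}"


def history_visibility_path(room_id: str) -> str:
    """Path for the m.room.history_visibility state event (empty state key)."""
    return (
        f"/_matrix/client/v3/rooms/{quote(room_id, safe='')}"
        "/state/m.room.history_visibility"
    )


# Hypothetical room ID, used only for illustration.
room = "!example:matrix.org"
print(directory_visibility_path(room))
print(history_visibility_path(room))
```

The point of the sketch is only that these are two independent pieces of room configuration: changing one does not change the other.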

@richvdh
Member

richvdh commented Oct 17, 2022

The problem is that spaces preview and "history" set to "anyone" aren't associated by anyone with "visible to the public internet via a way that also invites general purpose search engines", but rather only visible to users connected via a personal matrix client.

I'm struggling to follow this sentence, but does turning off "Enable guest access" do what you want?

image

(view.matrix.org is one of the "supported clients" here)

If so, it sounds like this just needs clarifying in the Element UI. If not, could you explain again what you're suggesting?

@ell1e
Author

ell1e commented Oct 17, 2022

My bad, when I say public listing I meant view.matrix.org.

If so, it sounds like this just needs clarifying in the Element UI.

I think it's possible to leave the guest setting enabled but history set to members only, and view.matrix.org won't list it. So this might be the wrong one.

To unlist, you'd not set "history" to "anyone", but that disables previewing the room before joining in Element; similarly, disabling "Preview" in spaces to unlist likely does the same. However, people often want previews in the client, but not the indexed web listing. I got this from a couple of room owners I talked to about the unintentional indexing, who were then like "hm, I'll disable it, but that'll disable previews too right? That's a shame".

So I think it's not just done with a rename; I think the view.matrix.org listing just needs a different setting. (At least as long as it remains indexed, which I understand has reasons.)

@bkil

bkil commented Jun 27, 2023

I can see quite some confusion in recent issues, including this one. What works for a centralized service such as LinkedIn usually does not work at all even in theory in decentralized & federated systems.

Providing one to "just" preview a room implies that the chat log must be distributed to members outside of those who have joined the room. To rephrase, any outside requestor may ask for previewing of the chat log of any other participating HS as per how Matrix works.

How would this work in a theoretically "perfect" world in your viewpoint? Would you somehow authenticate to each of the participating HS on such a query? Would they log you and your request and analyze it to make sure you are not previewing too many rooms per 24 hours? Wouldn't this itself raise additional GDPR questions itself (extensive logging by random servers for non-essential purpose)?

If nobody would log or authenticate such requests, what could stop a crawler from just registering a bot account and regularly previewing every room in sight (let alone joining every one - we encounter such a spider every month in practice)?

You can't properly exclude S2S due to how federation works, but let's ignore this for a moment. What would the web interface of such a preview pane look like if you assume that good web crawlers can already run JavaScript (or given noJS matrix clients)? Should such an "Allow SEO chat log indexing" room option alter robots.txt or add robots meta tags to the page <head>? Should it show a CAPTCHA per IP before each access, possibly through reCAPTCHA? Should it block access from AS associated with data centres (and Tor as a trivial abuse vector)?

Some web archiving services declare that they readily ignore robots.txt for various reasons (and if they do obey, it only affects their momentary presentation of the material, but not its crawling and persistence on their instance).

I find it amusing how public phpBB forums (along with its modern incarnations such as StackExchange) are already indexed and serve as a valuable knowledge base and reference material that many benefit from during their web search quest while nobody raised such concerns back in the days. The two problems are closely related, as moderators of forums and certain boards within can and do make some of them private and invite only as well, yet they only opt to exercise this with great caution - weighing its costs to benefits.

@Porkepix

Providing one to "just" preview a room implies that the chat log must be distributed to members outside of those who have joined the room. To rephrase, any outside requestor may ask for previewing of the chat log of any other participating HS as per how Matrix works.

Previewing isn't always available. And it isn't always something that can be considered a good thing (for my part, I see more sense in this in a corporate context than otherwise).

How would this work in a theoretically "perfect" world in your viewpoint? Would you somehow authenticate to each of the participating HS on such a query? Would they log you and your request and analyze it to make sure you are not previewing too many rooms per 24 hours? Wouldn't this itself raise additional GDPR questions itself (extensive logging by random servers for non-essential purpose)?

You'd need to gather freely given, specific, informed and unambiguous consent, and therefore duly inform people. And not giving your consent must have no negative effect for the user if it's not required to run the service or for legal reasons (and a couple of other exceptions). Here, it doesn't seem like most people are even duly informed.

I'm not even talking about people who don't use Matrix and who'd still have all their messages brought in via the various bridges.

You can't properly exclude S2S due to how federation works, but let's ignore this for a moment. What would the web interface of such a preview pane look like if you assume that good web crawlers can already run JavaScript (or given noJS matrix clients)? Should such an "Allow SEO chat log indexing" room option alter robots.txt or add robots meta tags to the page <head>? Should it show a CAPTCHA per IP before each access, possibly through reCAPTCHA? Should it block access from AS associated with data centres (and Tor as a trivial abuse vector)?

reCAPTCHA is illegal in Europe (Schrems II). Not to mention its use would really go against privacy values.
Aside from this, just because you can technically do something doesn't mean it's legal or moral.

I find it amusing how public phpBB forums (along with its modern incarnations such as StackExchange) are already indexed and serve as a valuable knowledge base and reference material that many benefit from during their web search quest while nobody raised such concerns back in the days. The two problems are closely related, as moderators of forums and certain boards within can and do make some of them private and invite only as well, yet they only opt to exercise this with great caution - weighing its costs to benefits.

One doesn't expect from a chat the same as you'd expect from a forum (and even forums can be private, or block indexing): your point here could also be used to say that all emails or mailing lists, newsgroups and so on should be indexed because forums are?
The Web never forgets. If people know everything they say will be stored forever and available for anyone to find, they act differently.

So, to come back to the initial title of this issue: aside from all of the other problems raised by this feature, mentioned in my answer and in other places, yes, having a dedicated setting should be pretty obvious (but it should even be a personal setting, as mentioned by someone else in another issue).

@bkil

bkil commented Jun 27, 2023

As cited just now in matrix-org/matrix-viewer#239 (comment)

world_readable - All events while this is the m.room.history_visibility value may be shared by any participating homeserver with anyone, regardless of whether they have ever joined the room.

All major clients display the chat log visibility rules within the room settings when you join a given room as a user, hence you had the chance to inform about this since like 2014. Feel free to open a PR to your client if you find this property of the given room too hidden for your taste. You may also consider that the moderator & admin of each given room has the legal obligation to repeat this information in the room topic where they also link to the code of conduct that applies within the given room.

The default setting when an admin creates a matrix room is shared (members only) logging (in the above PR, they also plan to allow indexing of the chat log with this setting). We take elective action in dozens of our big rooms to set this to world_readable for the benefit of the community at large. This is our choice. We made this conscious choice back in around 2018 and have been following up on it ever since. view.matrix already existed back then and we all knew it would index our content.

Public NNTP groups and mailing lists are being indexed by various services and have been for quite some time.
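
As a rough illustration of the `world_readable` rule quoted above, the following sketch checks whether the content of an `m.room.history_visibility` event lets someone read history without joining. The four values are the ones I believe the Matrix spec defines; the function name is hypothetical, and falling back to `shared` when the key is missing is an assumption based on the default mentioned in this thread.

```python
# History visibility values, as I understand the Matrix spec to define them.
KNOWN_VALUES = {"invited", "joined", "shared", "world_readable"}


def readable_without_joining(content: dict) -> bool:
    """Return True if non-members may be shown the room history.

    `content` is the content of an m.room.history_visibility state event.
    Assumption: a missing key is treated as "shared" (members only).
    """
    value = content.get("history_visibility", "shared")
    if value not in KNOWN_VALUES:
        raise ValueError(f"unknown history_visibility: {value!r}")
    return value == "world_readable"


print(readable_without_joining({"history_visibility": "world_readable"}))
print(readable_without_joining({"history_visibility": "shared"}))
```

Only `world_readable` passes the check; every other value keeps history restricted to (current or past) members.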

@bkil

bkil commented Jun 27, 2023

You also failed to address my main question. If the admin set this new checkbox on the room (Allow web search engine indexing), would the only difference it made be to add <meta name="robots" content="noindex,noarchive"> to the top of the generated HTML? Would that comply with your wishes? Anything else would make potential human nojs users of archive.matrix suffer.
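
To make the question concrete, here is a minimal sketch of what such a toggle might do on the rendering side, assuming a hypothetical per-room `allow_indexing` flag that does not exist in Synapse or the archive viewer today. It emits the robots meta tag only when indexing is not allowed; the `render_head` helper is purely illustrative.

```python
from html import escape

# The meta tag mentioned in the comment above.
ROBOTS_NOINDEX = '<meta name="robots" content="noindex,noarchive">'


def render_head(room_name: str, allow_indexing: bool) -> str:
    """Build a <head> element for a hypothetical room archive page.

    `allow_indexing` stands in for the proposed room-level checkbox.
    """
    lines = ["<head>", f"  <title>{escape(room_name)}</title>"]
    if not allow_indexing:
        # Ask well-behaved crawlers not to index or archive the page.
        lines.append(f"  {ROBOTS_NOINDEX}")
    lines.append("</head>")
    return "\n".join(lines)


print(render_head("Example room", allow_indexing=False))
```

As noted elsewhere in the thread, this only affects crawlers that choose to honour the tag; it does not restrict who can fetch the page.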

@Porkepix

As cited just now in matrix-org/matrix-public-archive#239 (comment)

world_readable - All events while this is the m.room.history_visibility value may be shared by any participating homeserver with anyone, regardless of whether they have ever joined the room.

All major clients display the chat log visibility rules within the room settings when you join a given room as a user, hence you had the chance to inform about this since like 2014. Feel free to open a PR to your client if you find this property of the given room too hidden for your taste. You may also consider that the moderator & admin of each given room has the legal obligation to repeat this information in the room topic where they also link to the code of conduct that applies within the given room.

1/ Visible by anyone isn't synonymous with "indexed by search engines forever", especially for a chat;
2/ In 90% of the cases, my "client" is an IRC one, for which I never asked for public logs to exist, even less for them to be indexed and offered to search engines.

The default setting when an admin creates a matrix room is shared (members only) logging (in the above PR, they also plan to allow indexing of the chat log with this setting). We take elective action in dozens of our big rooms to set this to world_readable for the benefit of the community at large. This is our choice. We made this conscious choice back in around 2018 and have been following up on it ever since. view.matrix already existed back then and we all knew it would index our content.

You might have known. I didn't, and by the look of the various answers and reactions here and there I was by far not alone in this situation.

Public NNTP groups and mailing lists are being indexed by various services and have been for quite some time.

Not all of them.

@Mikaela
Contributor

Mikaela commented Jun 28, 2023

All major clients display the chat log visibility rules within the room settings when you join a given room as a user, hence you had the chance to inform about this since like 2014.

I don't know how you define a major client, but I would argue Nheko to be one and it doesn't display history visibility anywhere. Nheko-Reborn/nheko#1470

I would support making the visibility status more visible in other clients too as room settings is not something I constantly open when joining new rooms.

@bkil

bkil commented Jun 28, 2023

For inspiration, here is the choice you have for a similar class of properties on Friendica:

  • Publish your profile in your local site directory?
    Your profile will be published in this node's local directory. Your profile details may be publicly visible depending on the system settings.
  • Allow your profile to be searchable globally?
    Activate this setting if you want others to easily find and follow you. Your profile will be searchable on remote systems. This setting also determines whether Friendica will inform search engines that your profile should be indexed or not. Your profile will also be published in the global friendica directories (e.g. https://dir.friendica.social/).
  • Hide your public content from anonymous viewers
    Anonymous visitors will only see your basic profile details. Your public posts and replies will still be freely accessible on the remote servers of your followers and through relays.
  • Make public posts unlisted
    Your public posts will not appear on the community pages or in search results, nor be sent to relay servers. However they can still appear on public feeds on remote servers.
  • Make all posted pictures accessible
    This option makes every posted picture accessible via the direct link. This is a workaround for the problem that most other networks can't handle permissions on pictures. Non public pictures still won't be visible for the public on your photo albums though.
  • Allow anonymous access to your calendar
    Allows anonymous visitors to consult your calendar and your public events. Contact birthday events are private to you.

@clokep
Contributor

clokep commented Jun 29, 2023

I'm going to close this issue as it seems like it is not a bug in Synapse. This issue is long and confusing, but it sounds like it is either:

If that sounds incorrect and Synapse is not implementing something properly please reply and we can reopen this issue. Regardless, please be clear & concise about what the issue is.

@clokep closed this as not planned on Jun 29, 2023