MSC4021: Archive client controls #4021

jonaharagon · 2023-05-28T14:52:39Z

This proposal solves these problems:

MTRNord

I believe the alternative linked in my comment may be more useful and this should require fed peeking to exist to actually work as intended.

MTRNord · 2023-05-28T16:21:31Z

proposals/4021-archive-controls.md

@@ -0,0 +1,36 @@
+# MSC4021: Archive client controls


#2291

Should be added as an alternative.

Also, this would likely need to depend on fed peeking, since currently you need to join a room to access the info, which some people may find bad.

I don't think that 2291 is an alternative; I think the goals are different. 2291 indicates whether the bot is allowed to crawl the room, whereas it looks like the intent for this one is to communicate to search engines whether they are allowed to index the room. For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.

Imho the indexing is already a form of crawling a room. That's my reasoning. And the other MSC can also be used for this case imho. It's a little more generic than this one.

There is a desire (matrix-org/matrix-viewer#47 (comment)) to have the room directory API include this sort of information directly, which is why I'm not sure 2291 will work here. I edited the doc to expand upon this and add 2291 as an alternative. Because this is intended to function more similarly to m.room.join_rules I don't think fed peeking is an issue, but I'm not sure.

> For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.

@uhoreg With MSC2291, I think this could be achieved with a mix of messages and log in the m.room.robots event. Am I misinterpreting?

m.room.robots

{ "*": { "messages": false, "log": true } }

messages: (boolean) whether the bot is allowed to index the room's messages. Default: true if m.room.history_visibility is world_readable, and false otherwise.

log: (boolean) whether the bot is allowed to display logs of the room to users. This will be false if messages is false. Default: true if m.room.history_visibility is world_readable, and false otherwise.

The names are slightly confusing given what they actually do.
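To make the interaction between the two fields concrete, here is a minimal sketch of how a bot might resolve a single rule under the semantics quoted above. The function name `resolve_robots_rule` is hypothetical, and the logic is based only on the quoted text, not on the full MSC2291 proposal.

```python
def resolve_robots_rule(rule: dict, history_visibility: str) -> dict:
    """Resolve the effective messages/log values for one m.room.robots rule.

    Assumption: both fields default to True only when the room's
    m.room.history_visibility is "world_readable", as in the quoted text.
    """
    default = history_visibility == "world_readable"
    messages = bool(rule.get("messages", default))
    log = bool(rule.get("log", default))
    # The quoted text says log "will be false if messages is false".
    if not messages:
        log = False
    return {"messages": messages, "log": log}


example = resolve_robots_rule({"messages": False, "log": True}, "world_readable")
```

If the quoted definitions are read literally, the example rule above resolves to `log` being false as well, so `messages: false` combined with `log: true` would not by itself express "show in the archive but do not index".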

In MSC2291, messages is intended to indicate whether the bot itself is allowed to index messages, whereas this proposal is intended to communicate preferences to other crawlers that crawl the bot's logs. This may be able to be done with an addition to 2291 (e.g. add a new property), but 2291 itself doesn't do this.

Feels like there might not be any difference between the bot itself (Matrix Public Archive) and a different crawler that crawls the bot's logs (a search engine). They're both accessing the same information (same or derived), and it feels like messages from MSC2291 to control indexing of messages covers that. In other words, if messages: false, the archive can't index messages and neither can search engines.

Basically, any bot preference should probably be passed down for other bots to follow?

I think if it's a wildcard *, it should apply to downstream bots. It's less clear how things should flow if someone specified an app. Perhaps it wouldn't flow in the specific app case but could use the * rules to govern how search engines look at it.

And maybe we want to define some generic "search_engines" key for example since it might be common. But not all of the preferences are applicable since we can't pass along all of this preference detail seamlessly (impedance mismatch).

uhoreg · 2023-05-28T17:31:52Z

proposals/4021-archive-controls.md

+from returning duplicate content or taking precedence in search results over an organization's self-hosted archive.
+
+For example, if `via` is set to `"archive.example.net"` in `#main:example.net`, the page at
+https://archive.matrix.org/r/main:example.net/date/2023/05/28 should return this HTTP header:


This seems to assume that all archivers will have the same URL format, which may not be true. If they all run matrix-public-archive, then that may be, but it's possible that some other archiving software may use a different format.

An alternative could be to have via be a full URI, like https://archive.example.net/r/main:example.net, and then https://archive.matrix.org/r/main:example.net/date/2023/05/28 would return:

Link: <https://archive.example.net/r/main:example.net>; rel="canonical"

It would miss out on features like date pagination, although it now occurs to me that for the purposes of web indexing, this might actually be preferable behavior?

The problem with this alternative is that it might be more difficult for the self-hosted client at archive.example.net to parse and not include this canonical link header, because I don't think it would be ideal for the canonical archive to return this header. So I don't know, maybe that's something to leave up to client interpretation, maybe a standard URL format should be part of the spec? 🤷‍♂️
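As a rough illustration of the full-URI variant, the sketch below builds the `Link` header value and skips it when the request is already being served by the canonical archive. The function name `canonical_link_header` is hypothetical, and comparing hosts is only one assumed way for a client to decide that it is the canonical archive; the thread leaves that question open.

```python
from typing import Optional
from urllib.parse import urlsplit


def canonical_link_header(via: str, request_url: str) -> Optional[str]:
    """Return a Link header value pointing at `via`, or None.

    Assumption: `via` is a full URI, and an archive treats itself as the
    canonical one when its host matches the host in `via`.
    """
    via_host = urlsplit(via).netloc.lower()
    request_host = urlsplit(request_url).netloc.lower()
    # Skip malformed values and avoid self-referencing on the canonical archive.
    if not via_host or via_host == request_host:
        return None
    return f'<{via}>; rel="canonical"'


header_value = canonical_link_header(
    "https://archive.example.net/r/main:example.net",
    "https://archive.matrix.org/r/main:example.net/date/2023/05/28",
)
```

A server would send the returned string as the value of the `Link` response header, and send nothing when the function returns `None`.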

If we wanted something specific to the Matrix Public Archive URL format, we could use an event type scoped to the sub-domain like org.matrix.archive.canonical to convey this information.

MadLittleMods · 2023-06-01T07:20:34Z

proposals/4021-archive-controls.md

+| `archive` | boolean | | Whether the room should be included in room directory listings which are indended to be viewed by the public |
+| `robots` | [string] | Valid [robots meta rules](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives) | A list of rules which should be included in a `robots` meta tag and/or [HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag-implementation) by public-facing clients. e.g. `["noarchive"]` or `["noindex", "nofollow"]`.


For the Matrix Public Archive, there are kind of two things to consider:

1. Whether you want to show up in the archive at all (display)
2. Whether you want to allow search engines to index that content (indexing)

The robots field definitely covers the search engine indexing decision, since you can opt out with noindex.

For the display decision, it's less clear whether robots can cover it. But noarchive sounds pretty decent just by name and also because of what it means:

> noarchive
>
> Requests the search engine not to cache the page content.
>
> -- https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta/name#other_metadata_names

And the Matrix Public Archive really just allows you to view a public Matrix room with some potential caching on top (it doesn't store anything). But this might be an overloaded usage of noarchive, since caching is not the same as displaying, which the archive also does at its core.

Depending on the answer here, the archive field may be redundant compared to what can be specified in robots.

Perhaps the display should be keyed off something else entirely anyway.
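For the indexing half, a public-facing client could turn the `robots` list into both output forms mentioned in the proposal table. This is a sketch only: `robots_outputs` is a hypothetical helper, and `KNOWN_RULES` is a small illustrative subset of directives, not the complete set a real client would accept.

```python
from html import escape
from typing import List, Optional

# Illustrative subset only; a real client would follow the full directive list.
KNOWN_RULES = {"all", "none", "noindex", "nofollow", "noarchive", "nosnippet"}


def robots_outputs(robots: List[str]) -> Optional[dict]:
    """Build an X-Robots-Tag header and a robots meta tag from a rule list.

    Unknown rules are dropped; None is returned if nothing usable remains.
    """
    rules = [rule for rule in robots if rule in KNOWN_RULES]
    if not rules:
        return None
    value = ", ".join(rules)
    return {
        "header": ("X-Robots-Tag", value),
        "meta": f'<meta name="robots" content="{escape(value)}">',
    }


outputs = robots_outputs(["noindex", "nofollow"])
```

Whether `noarchive` should also suppress display in the archive is the open question above; this sketch only passes the rules through to crawlers and makes no display decision.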

MadLittleMods · 2023-06-01T07:24:12Z

proposals/4021-archive-controls.md

+
+## Proposal
+
+Add an `m.room.archive_controls` state event where you can specify information about if and how you would like your


m.room.archive_controls feels very specific to the archive use case and we may want to be more generic.

For example, people building a blog or forum on Matrix would use similar robots controls (see other beyond-chat applications for Matrix).

Maybe we only need to be generic with an m.room.robots state event, and other archive-specific event types would still be useful.

Maybe, but I didn't want this to be confused for controls over Matrix chat/integration bots, which this really isn't. It's more of a control over a specific class of clients in my mind, which I wasn't sure how to refer to.

Unless you think this has purpose outside of clients which are intended for public unauthenticated access, but I think a comments system on a blog would also fall under that category.

alphapapa · 2023-06-04T21:21:04Z

IMHO this proposal is misguided, because a room with world-readable history can have its history read by any client, which would be free to ignore such "controls." That is, these would not be controls, but merely expressed preferences. They would likely give users or room admins a false sense of security, because while they may have set a "control" to prevent indexing of their room's history, any party could be doing so with just a few lines of code, even connecting it to their own archive.matrix.org-like instance and feeding it to search engines.

Instead, we should seek to make the consequences of room settings as clear as possible to users and admins.

Create 4021-archive-controls.md

acffccf

MTRNord reviewed May 28, 2023

uhoreg added the proposal, client-server, kind:feature, and needs-implementation labels May 28, 2023

uhoreg reviewed May 28, 2023

Update 4021-archive-controls.md

144fff4

MadLittleMods mentioned this pull request May 30, 2023

Opting Out matrix-org/matrix-viewer#47

Open

Mikaela mentioned this pull request May 30, 2023

Room history visibility cannot be seen anywhere Nheko-Reborn/nheko#1470

Closed

MadLittleMods reviewed Jun 1, 2023

jonaharagon mentioned this pull request Jun 2, 2023

Think about rel=canonical linking matrix-org/matrix-viewer#251

Closed

MadLittleMods mentioned this pull request Jun 22, 2023

Search should return private rooms where the bot is already in or invited to (world_readable) matrix-org/matrix-viewer#271

Open

bkil mentioned this pull request Jun 28, 2023

Room history visibility is not explicit enough element-hq/element-meta#1807

Open


jonaharagon commented May 28, 2023 • edited by MadLittleMods

MTRNord left a comment

MTRNord May 28, 2023

uhoreg May 28, 2023

MTRNord May 28, 2023

jonaharagon May 28, 2023 • edited

MadLittleMods Jun 29, 2023 • edited

uhoreg Jun 30, 2023 • edited by MadLittleMods

MadLittleMods Jul 5, 2023

uhoreg May 28, 2023

jonaharagon May 28, 2023

jonaharagon May 28, 2023

MadLittleMods Jun 1, 2023 • edited

MadLittleMods Jun 1, 2023 • edited

MadLittleMods Jun 1, 2023 • edited

jonaharagon Jun 1, 2023 • edited

alphapapa commented Jun 4, 2023
