MSC4021: Archive client controls #4021

Draft
wants to merge 2 commits into base: main


@jonaharagon jonaharagon commented May 28, 2023

Contributor

@MTRNord MTRNord left a comment

I believe the alternative linked in my comment may be more useful, and this would require fed peeking to exist in order to work as intended.

@@ -0,0 +1,36 @@
# MSC4021: Archive client controls
Contributor

#2291 should be added as an alternative.

This would also likely need to depend on fed peeking, since currently you need to join a room to access this information, which some people may object to.

Member

I don't think that 2291 is an alternative; I think the goals are different. 2291 indicates whether the bot is allowed to crawl the room, whereas it looks like the intent for this one is to communicate to search engines whether they are allowed to index the room. For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.

Contributor

IMHO, indexing is already a form of crawling a room; that's my reasoning. The other MSC can also be used for this case, and it's a little more generic than this one.

Author

@jonaharagon jonaharagon May 28, 2023

There is a desire (matrix-org/matrix-viewer#47 (comment)) to have the room directory API include this sort of information directly, which is why I'm not sure 2291 will work here. I edited the doc to expand upon this and add 2291 as an alternative. Because this is intended to function more similarly to m.room.join_rules I don't think fed peeking is an issue, but I'm not sure.

Contributor

@MadLittleMods MadLittleMods Jun 29, 2023

> For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.

@uhoreg With MSC2291, I think this could be achieved with a mix of messages and log in the m.room.robots event. Am I misinterpreting?

m.room.robots

{
  "*": {
    "messages": false,
    "log": true
  }
}
  • messages: (boolean) whether the bot is allowed to index the room's messages. Default: true if m.room.history_visibility is world_readable, and false otherwise.
  • log: (boolean) whether the bot is allowed to display logs of the room to users. This will be false if messages is false. Default: true if m.room.history_visibility is world_readable, and false otherwise.

The names are slightly confusing relative to what they actually do.
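
As a rough illustration of the defaults quoted above, here is a minimal Python sketch. The function name `resolve_robots` and the lookup order (a bot-specific key first, then `*`) are assumptions made for this example; only the two defaults and the "log is false if messages is false" rule come from the quoted text.

```python
from typing import Any, Dict, Mapping


def resolve_robots(
    content: Mapping[str, Any], bot: str, history_visibility: str
) -> Dict[str, bool]:
    """Resolve the effective `messages` and `log` preferences for one bot.

    `content` is the content of an m.room.robots state event.
    The lookup order (bot key, then "*") is an assumption of this sketch.
    """
    # Quoted default: true only when history is world_readable.
    default = history_visibility == "world_readable"
    rules = content.get(bot) or content.get("*") or {}
    messages = bool(rules.get("messages", default))
    # Quoted rule: `log` is forced to false whenever `messages` is false.
    log = bool(rules.get("log", default)) and messages
    return {"messages": messages, "log": log}
```

If the quoted rule is read literally, the example event above (`messages: false`, `log: true`) would resolve to `log: false`, which may be part of why the names feel confusing.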

Member

@uhoreg uhoreg Jun 30, 2023

In MSC2291, messages is intended to indicate whether the bot itself is allowed to index messages, whereas this proposal is intended to communicate preferences to other crawlers that crawl the bot's logs. This may be able to be done with an addition to 2291 (e.g. add a new property), but 2291 itself doesn't do this.

Contributor

Feels like there might not be any difference between the bot itself (Matrix Public Archive) and a different crawler that crawls the bot's logs (a search engine). They're both accessing the same information (same or derived), and it feels like messages from MSC2291, which controls indexing of messages, covers that. In other words, if messages: false, the archive can't index messages and neither can search engines.

Basically, any bot preference should probably be passed down for other bots to follow?

I think if it's a wildcard *, it should apply to downstream bots. It's less clear how things should flow if someone specified an app. Perhaps it wouldn't flow in the specific app case but could use the * rules to govern how search engines look at it.

And maybe we want to define some generic "search_engines" key for example since it might be common. But not all of the preferences are applicable since we can't pass along all of this preference detail seamlessly (impedance mismatch).
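
To make the flow-down idea concrete, here is a small Python sketch of one possible interpretation. Everything in it is hypothetical: the `search_engines` key is only floated above, mapping `messages: false` to a `noindex` directive is an assumption of this example, and the room is assumed to be world_readable so that an absent `messages` means true.

```python
from typing import Any, List, Mapping


def directives_for_search_engines(robots_content: Mapping[str, Any]) -> List[str]:
    """Derive robots directives that an archive could pass to search engines.

    Only the hypothetical "search_engines" key and the "*" wildcard flow
    downstream; rules for a specific app are deliberately ignored.
    """
    rules = robots_content.get("search_engines")
    if rules is None:
        rules = robots_content.get("*")
    if rules is None:
        return []
    # Assumption: a world_readable room, so a missing `messages` means true.
    if rules.get("messages", True) is False:
        return ["noindex"]
    return []
```

With this reading, `{"*": {"messages": false}}` yields `["noindex"]`, while a rule set only for a specific app yields no directives. Preferences such as `log` have no obvious robots equivalent, which is the impedance mismatch mentioned above.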

@uhoreg added the proposal, client-server, kind:feature, and needs-implementation labels May 28, 2023
from returning duplicate content or taking precedence in search results over an organization's self-hosted archive.

For example, if `via` is set to `"archive.example.net"` in `#main:example.net`, the page at
https://archive.matrix.org/r/main:example.net/date/2023/05/28 should return this HTTP header:
Member

This seems to assume that all archivers will have the same URL format, which may not be true. If they all run matrix-public-archive, then that may be, but it's possible that some other archiving software may use a different format.

Author

An alternative could be to have via be a full URI, like https://archive.example.net/r/main:example.net, and then https://archive.matrix.org/r/main:example.net/date/2023/05/28 would return:

Link: <https://archive.example.net/r/main:example.net>; rel="canonical"

It would miss out on features like date pagination, although it now occurs to me that for the purposes of web indexing, this might actually be preferable behavior?
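
A minimal Python sketch of this full-URI variant, assuming `via` holds an absolute http(s) URI. The function name `canonical_link_header` and the decision to skip non-http(s) values are assumptions made for illustration.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def canonical_link_header(via: Optional[str]) -> Optional[Tuple[str, str]]:
    """Build a canonical Link header from a full-URI `via` value.

    Returns None when `via` is missing or is not an absolute http(s) URI.
    """
    if not via:
        return None
    parts = urlsplit(via)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    # The same canonical target is returned for every page of the room.
    return ("Link", f'<{via}>; rel="canonical"')
```

Because the header never varies by date, every paginated page of the room points at the same canonical URL, which matches the behavior described above.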

Author

The problem with this alternative is that it might be harder for the self-hosted client at archive.example.net to recognize that it is the canonical archive and omit this header, because I don't think it would be ideal for the canonical archive to return it. So I don't know, maybe that's something to leave up to client interpretation, maybe a standard URL format should be part of the spec? 🤷‍♂️
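
One possible way for a client to handle this, sketched in Python under the assumption that `via` is a full URI and that each archive knows its own public hostname. The helper name `should_send_canonical` and the `own_host` parameter are hypothetical.

```python
from urllib.parse import urlsplit


def should_send_canonical(via: str, own_host: str) -> bool:
    """Return True if this archive should emit the canonical Link header.

    The archive that `via` points at is treated as canonical and stays silent.
    """
    canonical_host = urlsplit(via).hostname  # lowercased by urlsplit
    if canonical_host is None:
        # `via` is not an absolute URI; there is nothing to point at.
        return False
    return canonical_host != own_host.lower()
```

A plain hostname comparison would not cover archives reachable under several hostnames, so this is only a starting point for the client-interpretation option.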

Contributor

@MadLittleMods MadLittleMods Jun 1, 2023

If we wanted something specific to the Matrix Public Archive URL format, we could use an event type scoped to the sub-domain like org.matrix.archive.canonical to convey this information.

Comment on lines +21 to +22
| `archive` | boolean | | Whether the room should be included in room directory listings which are intended to be viewed by the public |
| `robots` | [string] | Valid [robots meta rules](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives) | A list of rules which should be included in a `robots` meta tag and/or [HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag-implementation) by public-facing clients. e.g. `["noarchive"]` or `["noindex", "nofollow"]`.
Contributor

@MadLittleMods MadLittleMods Jun 1, 2023

For the Matrix Public Archive, there are kind of two things to consider:

  • Whether you want to show up in the archive at all (display)
  • Whether you want to allow search engines to index that content (indexing)

The robots field definitely covers the search engine indexing decision, since a room can opt out with noindex.

For the display decision, it's less clear whether robots can cover it. But noarchive sounds pretty decent just by name and also because of what it means:

noarchive

Requests the search engine not to cache the page content.

-- https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta/name#other_metadata_names

And the Matrix Public Archive really just allows you to view a public Matrix room with some potential caching on top (it doesn't store anything). But this might be an overloaded usage of noarchive since caching is not the same as displaying which the archive also does at its core.

Depending on the answer here, the archive field may be redundant with what can be specified in robots.

Perhaps the display should be keyed off something else entirely anyway.
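
For the indexing half of that split, here is a minimal Python sketch of how a public-facing client might turn the proposed `robots` list into an `X-Robots-Tag` header and a `robots` meta tag. The function name `robots_outputs` and the small allowlist are assumptions of this example; the allowlist is not exhaustive and leaves out valued directives such as `max-snippet`.

```python
from html import escape
from typing import Any, Dict, Mapping, Optional

# A partial allowlist of well-known directives (an assumption, not exhaustive).
KNOWN_DIRECTIVES = frozenset(
    {"all", "none", "noindex", "nofollow", "noarchive", "nosnippet",
     "notranslate", "noimageindex"}
)


def robots_outputs(content: Mapping[str, Any]) -> Optional[Dict[str, Any]]:
    """Render the `robots` list of the event content as header and meta tag.

    Unknown entries are dropped; returns None when nothing valid remains.
    """
    rules = [r for r in content.get("robots", []) if r in KNOWN_DIRECTIVES]
    if not rules:
        return None
    value = ", ".join(rules)
    return {
        "header": ("X-Robots-Tag", value),
        "meta": f'<meta name="robots" content="{escape(value)}">',
    }
```

The display decision is intentionally left out of this sketch, since it is still open whether `archive`, `noarchive`, or something else entirely should control it.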


## Proposal

Add an `m.room.archive_controls` state event where you can specify information about if and how you would like your
Contributor

@MadLittleMods MadLittleMods Jun 1, 2023

m.room.archive_controls feels very specific to the archive use case and we may want to be more generic.

For example, people building a blog or forum on Matrix would use similar robots controls (see other beyond-chat applications for Matrix).

Maybe we only need to be generic with an m.room.robots state event, and other archive-specific event types would still be useful.

Author

@jonaharagon jonaharagon Jun 1, 2023

Maybe, but I didn't want this to be confused for controls over Matrix chat/integration bots, which this really isn't. It's more of a control over a specific class of clients in my mind, which I wasn't sure how to refer to.

Unless you think this has purpose outside of clients which are intended for public unauthenticated access, but I think a comments system on a blog would also fall under that category.

@alphapapa

IMHO this proposal is misguided, because a room with world-readable history can have its history read by any client, which would be free to ignore such "controls." That is, these would not be controls, but merely expressed preferences. They would likely give users or room admins a false sense of security, because while they may have set a "control" to prevent indexing of their room's history, any party could be doing so with just a few lines of code, even connecting it to their own archive.matrix.org-like instance and feeding it to search engines.

Instead, we should seek to make the consequences of room settings as clear as possible to users and admins.
