MSC2291: Configuration to Control Crawling #2291

Open

uhoreg wants to merge 5 commits into matrix-org:old_master from uhoreg:crawling

Member

uhoreg commented Sep 14, 2019

uhoreg added 2 commits

September 13, 2019 23:23


          initial draft

1cf962a


          rename file to match MSC number

deb6cb5

uhoreg changed the title ~~MSCxxxx: Configuration to Control Crawling~~ MSC2291: Configuration to Control Crawling

uhoreg added proposal proposal-in-review labels

turt2live self-requested a review

September 14, 2019 05:08

turt2live approved these changes

Member

turt2live left a comment

Looks good, thank you! This sounds like a thing which might fit into the integrations spec when it exists, but could also fit nicely into the client-server spec (unlike widgets, though similar to widgets).

ps: merging develop -> the branch should fix the build error

anoadragon453 reviewed

proposals/2291-configuration-to-control-crawling.md

+                "*": {
+                  "members": false
+                },
+                "io.t2bot": {

Member

anoadragon453 Sep 22, 2019

I wonder whether it'd be easier for bot authors to parse io.t2bot or io.t2bot* here.

Member Author

uhoreg Jun 16, 2021

I think that io.t2bot* would imply that io.t2botfoobar would use that key as well, and io.t2bot.* is unclear whether or not a bot named simply io.t2bot should use that key. But I'm largely indifferent to this issue, and anyone with strong opinions should give a good reason for one way or the other.
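
The difference between the two spellings can be checked with ordinary shell-style globbing. The sketch below is only an illustration: it assumes glob semantics (via Python's `fnmatchcase`), which the MSC does not specify, and `matching_names` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matching_names(pattern, names):
    """Return the bot names that a glob-style key would cover (assumed semantics)."""
    return [name for name in names if fnmatchcase(name, pattern)]


names = ["io.t2bot", "io.t2bot.voyager", "io.t2botfoobar"]

# "io.t2bot*" also covers the unrelated name io.t2botfoobar.
print(matching_names("io.t2bot*", names))

# "io.t2bot.*" leaves out io.t2botfoobar, but also leaves out plain io.t2bot.
print(matching_names("io.t2bot.*", names))
```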

turt2live added the kind:feature label

KB1RD reviewed

proposals/2291-configuration-to-control-crawling.md

+              following the Java package naming convention.  For example, the Voyager bot
+              from t2bot.io could use the name `io.t2bot.voyager`.
+              A new room state event `m.room.robots` is used to define what bots are allowed

KB1RD Apr 29, 2021

I wonder if this could be applied to homeservers as well. That is, for a homeserver to join a room, its owner must agree to keep room data confidential to the best of their ability. Homeserver owners would have to confirm that they do not mine data from rooms to be able to join rooms with such an option.

Member Author

uhoreg May 3, 2021

Possibly, but that would probably be the subject for a different MSC. Though I think it would be hard to encode what policies a homeserver admin needs to agree with.

erkinalp suggested changes

erkinalp left a comment

Alternative: policy lists as rooms

proposals/2291-configuration-to-control-crawling.md (resolved)


          remove obsolete conclusion section, add unstable prefix

c920c9f

and recommend that bots leave when not allowed

jplatte reviewed

proposals/2291-configuration-to-control-crawling.md (outdated, resolved)


          Update proposals/2291-configuration-to-control-crawling.md

24061a9

Co-authored-by: Jonas Platte <jplatte@users.noreply.github.com>

turt2live added the needs-implementation label

MTRNord reviewed

proposals/2291-configuration-to-control-crawling.md

+              - Otherwise, it will use the default value for that parameter.
+              A bot may have multiple names that could be applicable to it.  For example, if
+              uhoreg.ca ran an instance of the Voyager bot, then the configuration for both

Contributor

MTRNord Jun 16, 2021

How do these get defined? Do common ones get specced? Is it purely based on "others used this"? Or why not do this via an MXID? Basically, how do I know what to look for when I build a bot? Having this for each bot individually would probably make it hard for admins to use.

Member Author

uhoreg Jun 16, 2021

The names would be using the Java package naming convention, but bot authors/admins would declare the name(s) that they use. So, for example, Travis could decide that his Voyager bot uses the io.t2bot.voyager name. Your server stats bot could then use either both io.t2bot.voyager and dev.nordeganken.serverstats, if you think that its behaviour is close enough to the original behaviour, or just dev.nordeganken.serverstats if you think that its behaviour is sufficiently different that it should no longer be considered as a Voyager bot.

This is similar to how web crawlers define their own User-agent when checking robots.txt.

Perhaps when we get extensible profiles, we can add something in there so that bots can declare which names the bot uses.

Contributor

MTRNord Jun 16, 2021

OK, yeah, that makes sense. That just leaves the question of how a) a room admin and b) new bot writers who aren't as connected as we are get to know these keys :) That's kind of the one flaw I see here. While I see this MSC as necessary, it is only as effective as the number of known bot names and the names that bots filter for. For robots.txt this is pretty much solved by having lists, and I am not sure how to solve it in the spec considering how long spec changes take. So maybe this should be part of an appendix or, even better, a key in the bot's entry on the "Try matrix now" page (to make those keys somewhat discoverable)?

proposals/2291-configuration-to-control-crawling.md

+                rooms.  This will be `false` if `messages` is `false`.  Default: `true` if
+                `m.room.history_visibility` is `world_readable`, and `false` otherwise.
+              Bots may use other parameter names, but the names that are not listed in the

Contributor

MTRNord Jun 16, 2021

Why would the custom ones need to be named in a different format? If I remember the "Java package naming convention" correctly, the predefined ones proposed above do not follow this scheme. Or am I missing something? It feels inconsistent.

Member Author

uhoreg Jun 16, 2021

Yeah, we could use m.* for the pre-defined keys. The format used here is consistent with what we do with room events, but aside from that, I'm fine with either way.

proposals/2291-configuration-to-control-crawling.md

+                "io.t2bot": {
+                  "allow": false
+                },
+                "io.t2bot.voyager": {

Contributor

MTRNord Jun 16, 2021

Combined with the other comment I made: would this apply only to the Voyager bot running on t2bot, or to all Voyager-type bots? In other words, is this a type or a user? This matters because most Voyager bots around crawl in slightly different ways. So a room admin might be fine with the one by Travis, which only looks at future messages after joining, but might not want to allow mine because my bot also looks into the history.

Member Author

uhoreg Jun 16, 2021

I think I mostly answered this in #2291 (comment). Your bot could look up both io.t2bot.voyager and dev.nordeganken.serverstats, preferring dev.nordeganken.serverstats. So room admins could declare a config for just io.t2bot.voyager, in which case all Voyager-type bots would use the same config. Or they could declare a config for both io.t2bot.voyager and dev.nordeganken.serverstats, in which case your bot would use dev.nordeganken.serverstats for keys defined there and io.t2bot.voyager for other keys, and other Voyager-type bots would only use the io.t2bot.voyager config.
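
A minimal sketch of that per-key lookup, for illustration only. It assumes the mapping from bot names to parameters has already been extracted from the `m.room.robots` event content into a plain dict, that the bot lists its names most-preferred first, and that a `"*"` entry acts as a last fallback before the default (modelled on robots.txt; the thread does not confirm this). `resolve_parameter` is a hypothetical helper, not part of the MSC.

```python
def resolve_parameter(robots, bot_names, parameter, default):
    """Resolve one parameter for a bot that answers to several names.

    robots:    dict mapping bot names to parameter dicts (assumed shape).
    bot_names: the bot's names, most preferred first.
    """
    # Assumption: "*" is consulted after every name the bot declares.
    for name in [*bot_names, "*"]:
        section = robots.get(name)
        if isinstance(section, dict) and parameter in section:
            return section[parameter]
    # No applicable section defines the parameter, so use its default.
    return default


robots = {
    "io.t2bot.voyager": {"allow": True, "messages": False},
    "dev.nordeganken.serverstats": {"messages": True},
}
stats_bot = ["dev.nordeganken.serverstats", "io.t2bot.voyager"]
other_voyager = ["io.t2bot.voyager"]

# The stats bot prefers its own section, then falls back key by key.
print(resolve_parameter(robots, stats_bot, "messages", False))
print(resolve_parameter(robots, stats_bot, "allow", False))
# Other Voyager-type bots only see the io.t2bot.voyager section.
print(resolve_parameter(robots, other_voyager, "messages", True))
```

Resolving each key separately, rather than picking one whole section, is what lets a room admin override a single parameter for one bot while the rest is inherited from the shared Voyager config.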

turt2live force-pushed the old_master branch from e895827 to dca99ee

August 30, 2021 22:34

uhoreg mentioned this pull request

Rooms need a way to indicate their spidering preferences matrix-org/matrix-spec#286

Open

turt2live removed the proposal-in-review label

MTRNord mentioned this pull request

MSC4021: Archive client controls #4021

Draft

MadLittleMods mentioned this pull request

Opting Out matrix-org/matrix-viewer#47

Open

MadLittleMods reviewed

proposals/2291-configuration-to-control-crawling.md (outdated, resolved)

MadLittleMods reviewed

proposals/2291-configuration-to-control-crawling.md (outdated, resolved)

MadLittleMods reviewed

proposals/2291-configuration-to-control-crawling.md

Comment on lines +58 to +60

+                then the bot may not display any information about the room to users who are
+                searching its directory, and may not store any information about the room
+                other than its existence and its crawling preferences.  The bot should also

Contributor

MadLittleMods Jun 29, 2023

In order to best avoid displaying a room that isn't allowed, it would be nice if the m.room.robots content would also be included in the /publicRooms room directory response. Probably under the robots key for a PublicRoomsChunk.

Otherwise, a stateless app navigating the room directory has to make a request for each room to determine whether it's allowed.

Member

turt2live Jun 29, 2023

I'd be more in favour of adding it to the stripped state for the room, and exposing stripped state properly on /publicRooms. The extensibility of PublicRoomsChunk doesn't scale.

Contributor

MadLittleMods Jun 29, 2023

Sounds great to me 👍

Stripped state spec docs for reference: https://spec.matrix.org/v1.7/client-server-api/#stripped-state

m.room.history_visibility would be another good one to add but seems like that would fall better under a separate MSC.
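
For illustration, here is roughly what a stateless directory app could do if each room entry carried the robots content. This is a sketch under several assumptions: the `robots` key on each entry is the hypothetical one suggested above (it would look different if exposed through stripped state), the fallback to a `"*"` section is assumed, and the behaviour for rooms with no preference is left to the caller via `default_allow`. `allowed_rooms` is a hypothetical helper.

```python
def allowed_rooms(chunk, bot_names, default_allow=True):
    """Return the room IDs from a directory chunk that the bot may display.

    chunk:     list of room entries, each with "room_id" and, hypothetically,
               a "robots" dict mapping bot names to parameter dicts.
    bot_names: the bot's names, most preferred first.
    """
    allowed = []
    for room in chunk:
        robots = room.get("robots") or {}
        decision = default_allow
        # Assumption: "*" is consulted after the bot's own names.
        for name in [*bot_names, "*"]:
            section = robots.get(name)
            if isinstance(section, dict) and "allow" in section:
                decision = bool(section["allow"])
                break
        if decision:
            allowed.append(room["room_id"])
    return allowed


chunk = [
    {"room_id": "!a:example.org", "robots": {"io.t2bot": {"allow": False}}},
    {"room_id": "!b:example.org"},
    {"room_id": "!c:example.org",
     "robots": {"*": {"allow": False}, "io.t2bot": {"allow": True}}},
]
print(allowed_rooms(chunk, ["io.t2bot"]))
```

With the preferences inline, the filtering is a single pass over the directory response instead of one extra request per room.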

MadLittleMods reviewed

proposals/2291-configuration-to-control-crawling.md

Comment on lines +139 to +141

+              to index the room may not want to join the room.  However, bots may not be able
+              to peek in rooms that its server is not already a part of until
+              [MSC1777](https://github.com/matrix-org/matrix-doc/pull/1777) is fixed.

Contributor

MadLittleMods Jun 29, 2023

The room summary APIs from MSC3266 could cover the federated peeking niche.

Would just need to layer on the m.room.robots info in a robots key (same as the room directory)

(shout out to @turt2live for pointing MSC3266 out)


          Apply suggestions from code review

9b52343

Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>

FSG-Cat reviewed

proposals/2291-configuration-to-control-crawling.md

Comment on lines +58 to +60

+                then the bot may not display any information about the room to users who are
+                searching its directory, and may not store any information about the room
+                other than its existence and its crawling preferences.  The bot should also

Contributor

FSG-Cat Oct 20, 2023

This problem came up again in late October 2023. It has become desirable to be able to opt out of aggregated room directory searches, where results from multiple room directories are combined.

To use the robots event in this context, a query param could be used to ask only for rooms that allow themselves to be returned in aggregated searches. This would create a distinction like the one between showing up on Google search and being public on your website. In this case, a room would show up in direct searches like those that current-gen clients do, but be invisible to aggregated searches powered by spiders.

tezlm reviewed

proposals/2291-configuration-to-control-crawling.md

+              - `messages`: (boolean) whether the bot is allowed to index the room's
+                messages.  Default: `true` if `m.room.history_visibility` is
+                `world_readable`, and `false` otherwise.
+              - `log`: (boolean) whether the bot is allowed to display logs of the room to

tezlm Oct 20, 2023

What is the difference between messages and log? Does there need to be two, or is one permission for messages in general fine?

Member Author

uhoreg Oct 20, 2023

The difference is that with just messages but not log, the bot can process the room's messages, but cannot display them to end users. For example, say that the bot is part of some room searching thing, and you ask it for rooms related to "cats". For rooms that just have messages enabled, it can say "Here are some rooms that I think are related to cats". For rooms that have both messages and log enabled, it can say "Here are some rooms that I think are related to cats, and here are some messages from the room that are about cats". (I don't think it makes sense to have log enabled, and messages disabled.)

Member Author

uhoreg Oct 20, 2023

(I don't think it makes sense to have log enabled, and messages disabled.)

(And in fact, the MSC does say that if messages is false, then log is false.)
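
The interaction between the two flags can be sketched as follows. This is a rough illustration, not the MSC's wording: it assumes that `log` has the same default as `messages` (`true` only when `m.room.history_visibility` is `world_readable`, as the quoted diff fragments suggest), and `effective_permissions` is a hypothetical helper that takes the already-resolved parameter dict for the bot.

```python
def effective_permissions(config, history_visibility):
    """Compute the effective messages/log flags for a bot in one room.

    config:             resolved parameter dict for this bot (may be empty).
    history_visibility: value of the room's m.room.history_visibility.
    """
    # Both flags default to true only for world-readable rooms (assumed for log).
    default = history_visibility == "world_readable"
    messages = bool(config.get("messages", default))
    # log is forced to false whenever messages is false.
    log = messages and bool(config.get("log", default))
    return {"messages": messages, "log": log}


# Indexing allowed, but no message excerpts may be shown to users.
print(effective_permissions({"messages": True, "log": False}, "shared"))
# log cannot be enabled on its own.
print(effective_permissions({"messages": False, "log": True}, "world_readable"))
```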

Reviewers

MadLittleMods left review comments

jplatte left review comments

turt2live approved these changes

anoadragon453 left review comments

MTRNord left review comments

erkinalp requested changes

KB1RD left review comments

FSG-Cat left review comments

tezlm left review comments