Add SEARCH #496

130 changes: 130 additions & 0 deletions extensions/search.md
@@ -0,0 +1,130 @@
---
title: Search
layout: spec
work-in-progress: true
copyrights:
-
name: "delthas"
email: "delthas@dille.cc"
period: "2022"
---

## Notes for implementing work-in-progress version

This is a work-in-progress specification.

Software implementing this work-in-progress specification MUST NOT use the
unprefixed `search` capability name. Instead, implementations SHOULD
use the `draft/search` capability name to be interoperable with other
software implementing a compatible work-in-progress version.

The final version of the specification will use an unprefixed capability name.

## Introduction

This feature is intended to allow clients to search server-side through previously sent messages. Server-side search enables clients to quickly find messages rather than downloading all the history from their server.

It is *not* a goal of this specification to offer context around matching messages, as this is covered by the [`chathistory`](chathistory.md) specification.

The server as mentioned in this document may refer to either an IRC server or an IRC bouncer.

Full support for this extension requires support for the batch, server-time and message-tags capabilities. However, limited functionality is available to clients without support for these CAPs. Servers SHOULD NOT enforce that clients support all related capabilities before using the search extension.

## Architecture

### Capabilities

This specification introduces the `search` capability. This capability advertises to clients that the `SEARCH` command is available.

The `search` capability MUST be negotiated.

### Batch type

This specification introduces the `search` batch type. This batch type is used when listing matching messages of a `SEARCH` command.

### Messages

This specification introduces the `SEARCH` command.

## Description
Contributor

typo


Clients can request a message search by sending the `SEARCH` command to the server.

This command has the following general syntax:

    SEARCH <attributes>

The `attributes` parameter is a dictionary of attribute keys and values, formatted according to the [`message-tags`](message-tags.md) specification (without the leading `@`).
Contributor

s/according to/like message tags, as defined in/

I would have gone with the ISUPPORT format instead of message-tags format; because the ISUPPORT format is already used in parameters; but ok.
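
A minimal Python sketch of how a server might decode the `attributes` parameter described above, assuming it follows the message-tags value escaping rules (`\:` for `;`, `\s` for a space, `\\`, `\r` and `\n`). The names `unescape_value` and `parse_attributes` are hypothetical and are not defined by this specification.

```python
# Escape sequences from the message-tags specification.
_UNESCAPES = {":": ";", "s": " ", "\\": "\\", "r": "\r", "n": "\n"}


def unescape_value(value: str) -> str:
    """Undo message-tags escaping in a single attribute value."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt is None:
            break  # a trailing lone backslash is dropped
        # Unknown escapes keep the character and drop the backslash.
        out.append(_UNESCAPES.get(nxt, nxt))
    return "".join(out)


def parse_attributes(raw: str) -> dict[str, str]:
    """Parse 'key=value;key2=value2' into a dictionary (hypothetical helper)."""
    attrs = {}
    for item in raw.split(";"):
        if not item:
            continue  # tolerate empty items such as a trailing ';'
        key, _, value = item.partition("=")
        attrs[key] = unescape_value(value)
    return attrs
```

With this sketch, `parse_attributes("from=jackie;in=#chan")` yields `{"from": "jackie", "in": "#chan"}`, and a query containing a space would be sent as `text=hello\sworld`.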


If the batch capability was negotiated, the server MUST reply to a successful SEARCH command using a batch with batch type `search`. If no content exists to return, the server SHOULD return an empty batch in order to avoid the client waiting for a reply.

The server then replies with a batch of batch type `search` containing messages matching all the specified match attributes. These messages MUST be `PRIVMSG` or `NOTICE` messages.
Contributor

Why is it a MUST instead of a SHOULD? No TAGMSG?
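
The reply described above can be consumed on the client side roughly as follows. This is a sketch only: it assumes the client already has the raw lines of the reply, it skips tag value unescaping, and the helper names `parse_tags` and `collect_search_batch` are hypothetical.

```python
def parse_tags(raw_tags):
    """Split 'k=v;k2=v2' into a dict (escape handling omitted for brevity)."""
    tags = {}
    for item in raw_tags.split(";"):
        key, _, value = item.partition("=")
        tags[key] = value
    return tags


def collect_search_batch(lines):
    """Return (tags, message) pairs from the first `search` batch in lines."""
    batch_id = None
    results = []
    for line in lines:
        tags = {}
        if line.startswith("@"):
            raw_tags, _, line = line[1:].partition(" ")
            tags = parse_tags(raw_tags)
        words = line.split(" ")
        if words[0].startswith(":"):
            words = words[1:]  # drop the source prefix
        if not words:
            continue
        if words[0] == "BATCH" and len(words) > 1:
            reference = words[1]
            if reference.startswith("+") and words[2:3] == ["search"]:
                batch_id = reference[1:]  # a search batch was opened
            elif batch_id is not None and reference == "-" + batch_id:
                return results  # end of the batch, possibly empty
        elif (batch_id is not None and tags.get("batch") == batch_id
              and words[0] in ("PRIVMSG", "NOTICE")):
            results.append((tags, line))
    return results
```

An empty batch simply produces an empty list, which lets the client distinguish "no results" from "still waiting for a reply".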


### Returned message notes

The order of returned messages within the batch is implementation-defined, but SHOULD be ascending time order or some approximation thereof. The server-time tag on each message SHOULD be the time at which the message was received by the IRC server. When provided, the msgid tag that identifies each individual message in a response MUST be the msgid tag as originally sent by the IRC server.

What about supporting specification of a specific ordering to return? (Oldest, newest, most relevant, etc)


Servers SHOULD provide clients with a consistent message order that is valid across the lifetime of a single connection, and which determinately orders any two messages (even if they share a timestamp). This order SHOULD coincide with the order in which messages are returned within a response batch. It need not coincide with the delivery order of messages when they were relayed on any particular server.
Contributor

You mention three orders here: 1. ???, 2. in response batch 3. delivery; and say 1 and 2 should coincide. But I don't understand what this first order is. Is this something internal to the IRCd?
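
One way a server could obtain an order that also separates messages sharing a timestamp is to sort on the timestamp first and break ties with the message ID. The sketch below assumes messages are dictionaries with `time` (UTC, fixed millisecond precision, so that string comparison agrees with chronological order) and `msgid` keys. Using `msgid` as the tie-breaker is an assumption of this example, not something the specification requires.

```python
def stable_order(messages):
    """Sort messages by timestamp, then by msgid to break ties (assumed rule)."""
    return sorted(messages, key=lambda m: (m["time"], m["msgid"]))
```

Because the key is a total order over `(time, msgid)` pairs, repeated searches on the same connection return matching messages in the same relative order.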


### Errors and Warnings

Errors are returned using the standard replies syntax.

If the attributes were invalid, the `INVALID_PARAMS` error code SHOULD be returned.

    FAIL SEARCH INVALID_PARAMS [invalid_parameters] :Invalid parameters

If the search cannot be run due to an internal error, the `INTERNAL_ERROR` error code SHOULD be returned.

    FAIL SEARCH INTERNAL_ERROR [extra_context] :The search could not be run
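
As an illustration, a server could assemble these standard replies with a small helper. The function name `search_fail` is hypothetical, and the optional context words stand in for the bracketed `[invalid_parameters]` and `[extra_context]` placeholders above.

```python
def search_fail(code, description, *context):
    """Build a FAIL standard reply line for the SEARCH command."""
    return " ".join(["FAIL", "SEARCH", code, *context, ":" + description])
```

For example, `search_fail("INVALID_PARAMS", "Invalid parameters", "limit")` produces `FAIL SEARCH INVALID_PARAMS limit :Invalid parameters`.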

### Standard search attributes

Servers MUST recognise the following attributes.

A message matches each of the following match attributes when:
* `in`: the message was sent to this target (channel or user).
* `from`: the message was sent with this nick.
* `after`: the message was sent at or after this time (same format as the [`server-time`](server-time.md) specification).
* `before`: the message was sent at or before this time (same format as the [`server-time`](server-time.md) specification).
* `text`: the message text matches the specified text. The actual algorithm used for matching the text is implementation defined.
Contributor

@progval progval Apr 15, 2022

> The actual algorithm used for matching the text is implementation defined

IMO, not defining it at all makes this spec unusable by some clients. It should at least define whether:

  1. it matches full lines or just substrings by default
  2. `.*` vs `*` vs `%`

Contributor

A possible solution: standardize two different attributes:

  1. one with generic IRC matching (just * and ?) as a standard which is easy to translate to any backend
  2. one with advanced implementation-defined behavior

On one hand, I like the potential of specifying multiple sets of matching behavior that could be used - you could potentially imagine even more advanced pattern matching. However, in practice, would clients care about this format, or just end users? Additionally, I think it's worth considering that "viable" pattern types/matching algorithms may be highly dependent on the backend (e.g. if you're using Postgres with a GIN index, a MySQL FULLTEXT index, Elasticsearch, ...)

Contributor

The reason why the matching algo is undefined is indeed that each message store implementation may have different capabilities and syntax. Users would be expected to type the search query, hence it's not a very big deal to leave it implementation-defined.


If `after` is specified, messages SHOULD be searched from that time. Otherwise, messages SHOULD be searched from the `before` time, which defaults to the current server time.
Contributor

What does this sentence mean? That before should be ignored if after is given?


Additionally, the following attributes MUST be recognized:
* `limit`: a number representing an upper bound on the count of messages to return. The server MAY return fewer messages than this number.

Contributor

We should allow vendors to define their own vendor-prefixed extensions; as well as a mechanism to list supported attributes.
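
The standard attributes above can be sketched as a server-side filter over an in-memory message list. This is an illustration under stated assumptions, not a reference implementation: messages are assumed to be dictionaries with `target`, `nick`, `time` and `text` keys, names are compared with a plain `lower()` rather than the server's real casemapping, and `text` is treated as a case-insensitive substring match, which is only one of the behaviours the specification permits. The names `matches` and `search` are hypothetical.

```python
from datetime import datetime

# server-time format, e.g. 2019-01-04T14:33:26.123Z
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_time(value):
    return datetime.strptime(value, TIME_FORMAT)


def matches(message, attrs):
    """Return True if the message satisfies every match attribute given."""
    if "in" in attrs and message["target"].lower() != attrs["in"].lower():
        return False
    if "from" in attrs and message["nick"].lower() != attrs["from"].lower():
        return False
    sent = parse_time(message["time"])
    if "after" in attrs and sent < parse_time(attrs["after"]):
        return False  # `after` is inclusive: at or after this time
    if "before" in attrs and sent > parse_time(attrs["before"]):
        return False  # `before` is inclusive: at or before this time
    if "text" in attrs and attrs["text"].lower() not in message["text"].lower():
        return False  # assumed algorithm: case-insensitive substring
    return True


def search(messages, attrs):
    """Return matching messages, honouring `limit` as an upper bound."""
    limit = int(attrs["limit"]) if "limit" in attrs else None
    if limit is not None and limit < 0:
        raise ValueError("limit must not be negative")
    found = [m for m in messages if matches(m, attrs)]
    return found if limit is None else found[:limit]
```

A malformed `limit` or timestamp raises `ValueError` here, which a server could translate into the `INVALID_PARAMS` error.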

### Examples

Searching messages sent by `jackie` in `#chan`
~~~~
[c] SEARCH from=jackie;in=#chan
[s] :irc.example.com BATCH +ID search
[s] @batch=ID;msgid=1234;time=2019-01-04T14:33:26.123Z :jackie!indent@host PRIVMSG #chan :Be what you want
[s] @batch=ID;msgid=1235;time=2019-01-04T14:35:26.123Z :jackie!indent@host PRIVMSG #chan :Want what you be
[s] :irc.example.com BATCH -ID
~~~~

Searching messages matching the text `fast` in `#chan`, returning up to 2 messages
~~~~
[c] SEARCH text=fast;in=#chan;limit=2
[s] :irc.example.com BATCH +ID search
[s] @batch=ID;msgid=1236;time=2019-01-04T14:33:26.123Z :bill!indent@host PRIVMSG #chan :That was fast!
[s] @batch=ID;msgid=1237;time=2019-01-04T14:35:26.123Z :jackie!indent@host PRIVMSG #chan :Fasting is hard.
[s] :irc.example.com BATCH -ID
~~~~

Searching messages when none match
~~~~
[c] SEARCH before=2010-01-01T00:00:00.000Z;in=#chan
[s] :irc.example.com BATCH +ID search
[s] :irc.example.com BATCH -ID
~~~~
Comment on lines +100 to +122
Contributor

Please use the same format for example blocks as other specs; and replace irc.host with an invalid or documentation domain.

You should also fix the duplicate msgids.


## Implementation Considerations

Server implementations may use various algorithms for matching messages against the specified `text`. Some implementations may choose to match by substrings, by whole words, or by other algorithms such as those offered by their database (e.g. SQLite full-text search). The comparison may be case-insensitive or case-sensitive.
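
To show how much the choice of algorithm can change the results, here is a small sketch comparing two of the strategies mentioned above, both case-insensitive. The function names are hypothetical, and neither behaviour is mandated by this specification.

```python
import re


def matches_substring(query, text):
    """Match if the query appears anywhere in the text."""
    return query.lower() in text.lower()


def matches_whole_word(query, text):
    """Match only if the query appears as a standalone word."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None
```

With the query `fast`, the message `Fasting is hard.` matches under substring matching but not under whole-word matching, while `That was fast!` matches under both.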

## Security Considerations

Processing logs can be slow. Servers offering this feature should implement a timeout on their total request time.
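
A minimal sketch of such a timeout, assuming the server scans messages in a loop and can check a deadline between messages. The names `SearchTimeout` and `search_with_deadline` are hypothetical, and the `clock` parameter exists only so the behaviour can be exercised without waiting.

```python
import time


class SearchTimeout(Exception):
    """Raised when a search exceeds its time budget."""


def search_with_deadline(messages, predicate, timeout_seconds, clock=time.monotonic):
    """Collect messages accepted by predicate, aborting once the deadline passes."""
    deadline = clock() + timeout_seconds
    results = []
    for message in messages:
        if clock() > deadline:
            raise SearchTimeout("search exceeded its time budget")
        if predicate(message):
            results.append(message)
    return results
```

A monotonic clock is used so that wall-clock adjustments cannot extend or shorten the budget. How the server reports an aborted search to the client is left to the implementation.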