Lazy and partial message cache #7770

Open
ChristophWurst opened this issue Dec 15, 2022 · 10 comments

Comments

@ChristophWurst
Member

ChristophWurst commented Dec 15, 2022

Is your feature request related to a problem? Please describe.

As a user of the Mail app, I notice that the first-use experience is slow. That is because the app indexes all my emails before I'm able to access a mailbox.

From a technical PoV, we do this because experience has shown that IMAP search is not always reliable, especially if one wants to sort messages by their date. This feature depends on IMAP capabilities that are not always available. As a consequence, Horde falls back to a client-side pagination algorithm that fetches the full mailbox, sorts it locally and then fetches the details of the calculated page. This shows up as slow performance in our app.
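To illustrate why the fallback is expensive, here is a minimal sketch of client-side pagination. It is not Horde's code; the `Envelope` type and the `client_side_page` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Envelope:
    """Minimal per-message data needed for sorting (hypothetical type)."""
    uid: int
    date: datetime


def client_side_page(envelopes, page, page_size):
    """Return the UIDs of one page, newest first.

    Every envelope of the mailbox must already be fetched before even the
    first page can be computed, which is the costly part on large mailboxes.
    """
    ordered = sorted(envelopes, key=lambda e: e.date, reverse=True)
    start = page * page_size
    return [e.uid for e in ordered[start:start + page_size]]


def _utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


mailbox = [
    Envelope(1, _utc(2022, 12, 1)),
    Envelope(2, _utc(2022, 12, 15)),
    Envelope(3, _utc(2022, 11, 20)),
    Envelope(4, _utc(2022, 12, 10)),
]
first_page = client_side_page(mailbox, page=0, page_size=2)
second_page = client_side_page(mailbox, page=1, page_size=2)
```

Only the UIDs on the returned page need a second, detailed fetch, but the sort itself still requires the date of every message in the mailbox.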

Moreover, every connection to IMAP has a latency penalty for our web app because a new connection needs to be established, authentication has to happen, etc. A classic desktop client can leave the connection open.

Describe the solution you'd like

Relax the way the message cache works. Do not index all messages at once before we give users access to the mailbox.

Without concrete technical ideas in mind, we need to match some acceptance criteria:

  1. There must be an efficient way to fetch the latest x threads in a mailbox (not just messages).
  2. There must be an efficient way to build message threads without having all message data available locally. We can not use IMAP threading because that is limited to a single mailbox and we want to thread across mailboxes. As in, even combine messages from Inbox and Sent so that threads appear like a conversation in a chat.
  3. There must be an efficient way to access the input data we need for the importance classifier training.

Describe alternatives you've considered

N/a

Additional context

No response

@alpianon

alpianon commented Jan 9, 2023

This feature depends on IMAP capabilities that are not always available

What about integrating Nextcloud with only one officially supported IMAP server, with an optimized configuration that works well with the Nextcloud Mail app? Maybe with an admin web UI to manage email accounts directly in Nextcloud?

There must be an efficient way to build message threads without having all message data available locally. We can not use IMAP threading because that is limited to a single mailbox and we want to thread across mailboxes

An external search engine may help here. If not Elasticsearch (license issues etc.), something like that.

@the-djmaze

the-djmaze commented Jan 10, 2023

2. We can not use IMAP threading because that is limited to a single mailbox and we want to thread across mailboxes

This depends on the server configuration.
With the Dovecot virtual plugin you can set up an \All mailbox, and then all messages in a thread can be fetched.

But when a thread is spread over a long period and thousands of unrelated messages are in between, it takes a long time.

For faster fetching, you can call STATUS or SELECT, which return the number of messages in a mailbox.
Then use SORT, SEARCH, etc. for the last N messages.
For example, if there are 5000 messages and you want the last 100: SORT (REVERSE DATE) UTF-8 4901:*

It's always best to fetch more IDs than the page size because message IDs are not related to the sent date. This is still not foolproof when someone moves old messages to other folders and the old messages get a new, higher ID.

Sorting by FROM, SUBJECT or SIZE is more complex because then all messages have to be analyzed.
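A minimal sketch of the range calculation described above, with an optional over-fetch margin. The function name `last_n_sequence_set` and the `overfetch` parameter are made up for illustration; the message count is assumed to come from a prior STATUS or SELECT.

```python
def last_n_sequence_set(total, n, overfetch=0):
    """Build an IMAP sequence set covering the last n (+ overfetch) messages.

    total     -- number of messages in the mailbox (from STATUS or SELECT)
    n         -- page size
    overfetch -- extra messages to fetch, since sequence numbers do not
                 follow the sent date
    """
    if total <= 0 or n <= 0:
        return None  # nothing to fetch
    first = max(1, total - (n + overfetch) + 1)
    return f"{first}:*"


# The last 100 of 5000 messages are sequence numbers 4901 through 5000.
page_range = last_n_sequence_set(5000, 100)
padded_range = last_n_sequence_set(5000, 100, overfetch=50)

# With Python's imaplib and a server that advertises SORT, the range could
# then be passed on roughly like this (not executed here):
#   conn.sort("(REVERSE DATE)", "UTF-8", page_range)
```

The margin only reduces the chance of missing a recent message; as noted, moved messages with new IDs can still end up outside the window.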

@alpianon

alpianon commented Jan 10, 2023

With the Dovecot virtual plugin you can set up an \All mailbox, and then all messages in a thread can be fetched.

But when a thread is spread over a long period and thousands of unrelated messages are in between, it takes a long time.

Actually, if one uses Dovecot with the fts-elastic plugin, speed is not a problem, even with hundreds of thousands of messages in between. But it currently cannot search in virtual folders (filiphanes/fts-elastic#19) 😞, while the Solr plugin apparently works. I need to try it.

@ChristophWurst
Member Author

ChristophWurst commented Jan 10, 2023

But when a thread is spread over a long period and thousands of unrelated messages are in between, it takes a long time.

I did test those scenarios and it's actually not a problem for the threading algorithm itself. It's relatively fast. The problem is rather that you need to fetch a lot of data to run the algorithm.

@ChristophWurst
Member Author

There must be an efficient way to fetch the latest x threads in a mailbox (not just messages)

To me that is still the biggest blocker. Finding the x latest messages is solvable with search. Finding out if those messages belong to threads and loading that data when the message/thread is opened is a lot more complex unfortunately.

@the-djmaze

Finding out if those messages belong to threads and loading that data when the message/thread is opened is a lot more complex unfortunately.

There are two headers in a MIME message for this:

  • References
  • In-Reply-To

https://www.rfc-editor.org/rfc/rfc5322#section-3.6.4

Although they are optional, they should be there.
If not, the sender might not want the message to be referenced (or does, but something went wrong).

The complex part is: find all thread messages in all mailboxes
But is that important?
Mostly when you reply you are quoting the parent message and the recipient will receive his text and your comments.
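As a rough illustration of how the two headers link messages, the sketch below groups raw messages into threads with a union-find over Message-IDs. It is not the Mail app's threading algorithm; `linked_ids` and `group_threads` are hypothetical helpers, and only Python's standard `email` and `re` modules are used.

```python
import re
from collections import defaultdict
from email import message_from_string

MSG_ID = re.compile(r"<[^<>\s]+>")


def linked_ids(raw_message):
    """Return (own Message-ID or None, set of referenced Message-IDs)."""
    msg = message_from_string(raw_message)
    own = MSG_ID.findall(msg.get("Message-ID", ""))
    refs = MSG_ID.findall(msg.get("References", ""))
    refs += MSG_ID.findall(msg.get("In-Reply-To", ""))
    return (own[0] if own else None), set(refs)


def group_threads(raw_messages):
    """Group messages connected through References/In-Reply-To."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    own_ids = []
    for raw in raw_messages:
        own, refs = linked_ids(raw)
        if own is None:
            continue  # without a Message-ID the message cannot be linked
        own_ids.append(own)
        find(own)
        for ref in refs:
            parent[find(own)] = find(ref)  # union the two sets

    threads = defaultdict(list)
    for own in own_ids:
        threads[find(own)].append(own)
    return sorted(sorted(ids) for ids in threads.values())


messages = [
    "Message-ID: <a@x>\n\nfirst",
    "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nReferences: <a@x>\n\nreply",
    "Message-ID: <c@x>\nReferences: <a@x> <b@x>\n\nsecond reply",
    "Message-ID: <d@y>\n\nunrelated",
]
threads = group_threads(messages)
```

The grouping itself is cheap; the input is the hard part, because the headers of every candidate message, in every mailbox, have to be available first.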

@ChristophWurst
Member Author

Thank you @the-djmaze. I am aware of the headers. I wrote the threading algorithm for this app.

Mostly when you reply you are quoting the parent message and the recipient will receive his text and your comments.

Fair point, but along a thread you lose attachments, can't verify signed messages once they are quoted, and so on. So I think there are good reasons to still show the thread as a conversation, even though most text is preserved in replies.

@chbusold

chbusold commented Mar 3, 2023

There must be an efficient way to fetch the latest x threads in a mailbox (not just messages)

To me that is still the biggest blocker. Finding the x latest messages is solvable with search. Finding out if those messages belong to threads and loading that data when the message/thread is opened is a lot more complex unfortunately.

Not sure if you are aware, so I wanted to mention what I think the Dovecot solution for this is: virtual folders (https://doc.dovecot.org/configuration_manual/virtual_plugin/). See the examples for a conversation view, "which shows all threads that have messages in INBOX, but shows all messages in the thread regardless of in what mailbox they physically exist in".
I don't know about other IMAP servers, but it may be worth having this as an option at least for Dovecot users, since it should be much more efficient.

@ChristophWurst
Member Author

That is nice but, as you say, specific to the IMAP server. We can't generally rely on an \All mailbox and would therefore have to implement threading twice.

@ChristophWurst
Member Author

To reduce the amount of data we have to write to the database cache, it could be an interesting idea to remove the recipients table.

Pro:

  • The average message has at least two entries: one for the sender and one for the recipient. Messages to groups have one row for the sender and one for each recipient. We can save at least two INSERT statements per cached message.

Con:

  • When showing messages, we have to go to IMAP to fetch the recipients. This round trip can cost 150 ms because that is a typical time for an IMAP login.
  • Searches in recipients are potentially slower because they are performed on IMAP, not the indexed, local database.
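A back-of-the-envelope sketch of this trade-off, assuming one table row per address and one fresh IMAP login per message view. Both assumptions are simplifications, and the function names are made up for illustration.

```python
LOGIN_MS = 150  # typical IMAP login time mentioned in the list above


def rows_saved(recipient_counts):
    """Rows not written when the recipients table is dropped.

    Each message accounts for one sender row plus one row per recipient.
    """
    return sum(1 + count for count in recipient_counts)


def added_latency_ms(message_views, login_ms=LOGIN_MS):
    """Extra wait if every message view needs its own IMAP login (assumption)."""
    return message_views * login_ms


# Two one-to-one messages and one message to a group of five.
saved = rows_saved([1, 1, 5])
extra_wait = added_latency_ms(20)
```

The saving occurs once per cached message, while the latency cost recurs on every view that cannot reuse an open connection.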
