fts_decoder option support for indexing non-text attachments #68

Closed
EuroTrash2 opened this issue Feb 5, 2021 · 7 comments

EuroTrash2 commented Feb 5, 2021

Hi, I'm trying to use dovecot's fts_decoder option but it doesn't seem to work.

My environment for testing:

  • 2 fresh dovecot setups, Debian buster-backports, one with xapian search, one with Solr 7.7.3
  • same set of emails in both setups
  • dovecot's fts_decoder sample script with the necessary dependencies: poppler-utils, antiword, unzip, catdoc
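
Before debugging the plugin itself, it can help to confirm that the converter binaries the sample script relies on are actually on the PATH. A minimal sketch in Python, assuming the packages above provide the commands `pdftotext` (from poppler-utils), `antiword`, `unzip` and `catdoc`; the function name `missing_converters` is hypothetical and not part of Dovecot:

```python
import shutil

# Commands assumed to come from the packages listed above
# (poppler-utils -> pdftotext; the others share their package name).
CONVERTERS = ("pdftotext", "antiword", "unzip", "catdoc")


def missing_converters(commands=CONVERTERS, which=shutil.which):
    """Return the commands that cannot be found on the PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    missing = missing_converters()
    if missing:
        print("Missing converters:", ", ".join(missing))
    else:
        print("All converters found")
```

Run it as the same user the decoder service runs as, since the PATH may differ between accounts.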

My dovecot configuration for the xapian setup:
plugin {
  plugin = fts fts_xapian

  fts = xapian
  fts_xapian = partial=3 full=20 attachments=1 verbose=2

  fts_autoindex = yes
  fts_enforced = yes

  fts_autoindex_exclude = \Trash
  fts_autoindex_exclude2 = \Junk
}
service indexer-worker {
  vsz_limit = 2G
}
plugin {
  fts_decoder = decode2text
}
service decode2text {
  # NB: cp /usr/share/doc/dovecot-core/examples/decode2text.sh (Debian's own misplacement) to /usr/lib/dovecot
  executable = script /usr/lib/dovecot/decode2text.sh
  user = dovecot
  unix_listener decode2text {
    mode = 0666
  }
}

The Solr setup is pretty much the same, with Solr's relevant options.

My steps to reproduce, using Mozilla Thunderbird:

  • Perform an IMAP SEARCH BODY with a known keyword that's present in one of the message bodies: both setups return the relevant message.
  • Perform an IMAP SEARCH BODY with a known keyword that's present in one of the PDF attachments: only the Solr setup returns results.
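
The two searches above can also be reproduced without a mail client. A rough sketch using Python's standard `imaplib`, assuming an IMAPS listener and an ASCII keyword; the helper names, host, credentials and mailbox are placeholders, not values from this setup:

```python
import imaplib


def body_search_criterion(keyword):
    """Build an IMAP SEARCH BODY criterion with the keyword as a quoted string."""
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return f'BODY "{escaped}"'


def search_body(host, user, password, keyword, mailbox="INBOX"):
    """Return the sequence numbers of messages whose body matches keyword."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select(mailbox, readonly=True)
        status, data = conn.search(None, body_search_criterion(keyword))
        if status != "OK":
            raise RuntimeError(f"SEARCH failed: {data!r}")
        # data[0] is a space-separated byte string of sequence numbers.
        return data[0].split()
```

Calling `search_body` with the same keyword against both servers makes the difference visible: a keyword that only occurs inside an attachment should come back empty from the backend that did not index it.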

Is the fts_decoder option supported in this Xapian FTS plugin? If not, are there any plans to support searching non-text attachments? I'd love to have that option instead of having to use Solr.

grosjo (Owner) commented Feb 7, 2021

Yes, this part needs a bit more debugging. I'll give it a push.

grosjo added a commit that referenced this issue Feb 11, 2021

grosjo (Owner) commented Feb 11, 2021

Please try applying the Dovecot PR (dovecot/core#155) together with the latest git version.

grosjo added a commit that referenced this issue Mar 3, 2021
grosjo added a commit that referenced this issue Mar 3, 2021

grosjo (Owner) commented Mar 8, 2021

@EuroTrash2 Any news?

arodier (Contributor) commented Mar 21, 2021

Is there any way to force Xapian to index attachments?

I can clearly see the error when running my unit tests:

doveadm(camille): Info: FTS Xapian: fts_backend_xapian_check_access
doveadm(camille): Info: FTS Xapian: Memory stats : Used = 56 MB, Free = 66 MB
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_index_hdr
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_query
doveadm(camille): Info: FTS Xapian: Query= uid:"44"
doveadm(camille): Info: FTS Xapian: Ngram(S) -> 63 items (total 0 KB)
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_unset_build_key
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=Message-Id,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_build_more
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_check_access
doveadm(camille): Info: FTS Xapian: Memory stats : Used = 56 MB, Free = 66 MB
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_index_hdr
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_query
doveadm(camille): Info: FTS Xapian: Query= uid:"44"
doveadm(camille): Info: FTS Xapian: Ngram(XMID) -> 4 items (total 0 KB)
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_unset_build_key
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=X-Mailer,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'xmailer'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=MIME-Version,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'mimeversion'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=Content-Type,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'contenttype'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=Authentication-Results,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'authenticationresults'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=X-AV-Checked,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'xavchecked'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=Content-Type,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'contenttype'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=(null),Type=text/plain,Disposition=(null))
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_build_more
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_unset_build_key
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=Content-Type,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'contenttype'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=Content-Description,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'contentdescription'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=Content-Disposition,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'contentdisposition'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=Content-Transfer-Encoding,Type=(null),Disposition=(null))
doveadm(camille): Info: FTS Xapian: Unknown header (indexing) 'contenttransferencoding'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_build_key
doveadm(camille): Info: FTS Xapian: New part (Header=(null),Type=text/csv,Disposition=attachment; filename="file.csv")
doveadm(camille): Info: FTS Xapian: Skipping part of type 'text/csv' and disposition 'attachment; filename="file.csv"'
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_set_mailbox
doveadm(camille): Info: FTS Xapian: Unset box 'INBOX' (c0d4e304584e5460dae30000075d7e67)
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_oldbox
doveadm(camille): Info: FTS Xapian: Done indexing 'INBOX' (c0d4e304584e5460dae30000075d7e67) (13 msgs in 261 ms, rate: 49.8)
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_release (unset_box)
doveadm(camille): Info: FTS Xapian: Committed 'unset_box' in 17 ms
doveadm(camille): Info: FTS Xapian: Box is empty
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_update_deinit (/home/users/camille/mails/indexes/xapian-indexes)
doveadm(camille): Info: FTS Xapian: fts_backend_xapian_release (update_deinit)
doveadm(camille): Info: FTS Xapian: Committed 'update_deinit' in 0 ms
doveadm(camille): Info: FTS Xapian: Deinit /home/users/camille/mails/indexes/xapian-indexes)

The annoying line is this one:

doveadm(camille): Info: FTS Xapian: Skipping part of type 'text/csv' and disposition 'attachment; filename="file.csv"'
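
In a long verbose log, lines like this are easy to miss. A small sketch that pulls every skipped part out of a captured log, assuming the message format shown above (the wording may differ between plugin versions); `skipped_parts` is a hypothetical helper, not something the plugin provides:

```python
import re

# Pattern modelled on the "Skipping part" line in the log above.
SKIPPED = re.compile(
    r"FTS Xapian: Skipping part of type '(?P<type>[^']*)' "
    r"and disposition '(?P<disposition>.*)'\s*$"
)


def skipped_parts(log_lines):
    """Return (content_type, disposition) for every skipped MIME part."""
    found = []
    for line in log_lines:
        match = SKIPPED.search(line)
        if match:
            found.append((match.group("type"), match.group("disposition")))
    return found
```

Feeding it the lines of the log above would report the single `text/csv` attachment, which makes it straightforward to check whether a given content type is being passed to the decoder at all.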

grosjo (Owner) commented Mar 21, 2021

Are you using the latest version, and have you set up Dovecot according to https://doc.dovecot.org/settings/plugin/fts-plugin/#plugin-fts-setting-fts-decoder?

arodier (Contributor) commented Mar 21, 2021

I am using the version in Debian Bullseye, and I want to stick with the Debian package. I hope it will be fixed by the time Bullseye is released.

EuroTrash2 (Author) commented

I confirm this is fixed for me in v1.4.9a, as provided with Debian Bullseye. I can now IMAP SEARCH BODY for specific keywords contained in PDF attachments.
Apologies for the long silence. I'm sorry I didn't have the bandwidth to compile from source.

This is beautiful.
Thanks a lot @grosjo for finding the problem on the Dovecot side and chasing it with the Dovecot devs.
