New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rich text indexing issues again #2066

Closed
mauritsvanrees opened this Issue Jun 8, 2017 · 4 comments

Comments

Projects
None yet
4 participants
@mauritsvanrees
Member

mauritsvanrees commented Jun 8, 2017

There is code in plone.app.contenttypes/indexers.py that is getting changed back and forth. The code gets rich text for the SearchableText index like this:

textvalue = richtext.text
text = transforms.convertTo('text/plain',
    safe_unicode(textvalue.out).encode('utf8'),
    mimetype=textvalue.mimeType)

Let's simplify this to make it easier to talk about:

convert(text, mimetype)

It is not clear what text and mimetype need to be.
It started out like this:

convert(textvalue.output, textvalue.mimeType)

Then @petri filed plone/plone.app.contenttypes#357.
There he argued that the current code was using the output in combination with the original mimetype, which indeed sounds strange, although he did not give an example of an actual bug caused by this.
His idea was that it should be one of these:

# 1: use output and output mimetype
convert(textvalue.output, textvalue.outputMimeType)
# 2: use original text and original mimetype
convert(textvalue.raw, textvalue.mimeType)

Both seem okay.

So he made pull requests choosing the first solution: output and output mimetype. This was included in 1.1.2 and 1.2.18.

Then @agitator noticed problems with this in #1844.
Basically, the outputMimeType was text/x-safe-html, and there was no transform from this to text/plain, so the SearchableText contained nothing (except title and description).
This got fixed by reverting to the original:

convert(textvalue.output, textvalue.mimeType)

This was released in 1.2.19 (branch 1.2.x and master). It was not yet backported to branch 1.1.x, but it should be: I notice the same problem there.

So:

  • @petri: What problems did you encounter? Do they come back when you use 1.2.19?
  • The fix from @agitator needs to be backported to 1.1.x. I can do that.
  • Most importantly: Should we instead use the second method on all branches? So convert(textvalue.raw, textvalue.mimeType). It seems to work fine then.
  • Once we have the correct code, an upgrade step to reindex the SearchableText index for Collections, Documents and News Items would be good. I have such code in plone/plone.app.contenttypes#407 for Collections already.
@petri

This comment has been minimized.

Show comment
Hide comment
@petri

petri Jun 18, 2017

Member

It's been a while... I don't remember of any particular problems in Plone that I would have encountered.

After re-reading the sources and #1844 discussion, it seems to me that for indexing, we should use the raw RichText value for plaintext conversion, not the output value, because:

  • the .output property triggers what seems a unnecessary extra portal transform run, before plaintext conversion
  • I cannot imagine how that extra first transform would add any value for indexing purposes (ideas, anyone?)
  • we can use convertTo correctly, ie. pass the raw value mime type to it (with no missing transforms)

Thus it seems to me the above-mentioned possible alternative by @mauritsvanrees is what we should do:

Most importantly: Should we instead use the second method on all branches? So convert(textvalue.raw, textvalue.mimeType). It seems to work fine then.

If I understand correctly, this is what @thet implied as well, in the #1844 discussion.

Member

petri commented Jun 18, 2017

It's been a while... I don't remember of any particular problems in Plone that I would have encountered.

After re-reading the sources and #1844 discussion, it seems to me that for indexing, we should use the raw RichText value for plaintext conversion, not the output value, because:

  • the .output property triggers what seems a unnecessary extra portal transform run, before plaintext conversion
  • I cannot imagine how that extra first transform would add any value for indexing purposes (ideas, anyone?)
  • we can use convertTo correctly, ie. pass the raw value mime type to it (with no missing transforms)

Thus it seems to me the above-mentioned possible alternative by @mauritsvanrees is what we should do:

Most importantly: Should we instead use the second method on all branches? So convert(textvalue.raw, textvalue.mimeType). It seems to work fine then.

If I understand correctly, this is what @thet implied as well, in the #1844 discussion.

mauritsvanrees added a commit to plone/plone.app.contenttypes that referenced this issue Jul 12, 2017

Use original raw text and mimetype when indexing rich text.
This avoids a double transform (raw source to output mimetype to plain text).
Issue plone/Products.CMFPlone#2066

mauritsvanrees added a commit to plone/plone.app.contenttypes that referenced this issue Jul 12, 2017

Use original raw text and mimetype when indexing rich text.
This avoids a double transform (raw source to output mimetype to plain text).
Includes a reindex of the SearchableText index for Collections, Documents and News Items.

Issue plone/Products.CMFPlone#2066
@pbauer

This comment has been minimized.

Show comment
Hide comment
@pbauer

pbauer Jul 25, 2017

Member

Yes on all points.

Member

pbauer commented Jul 25, 2017

Yes on all points.

mauritsvanrees added a commit to plone/plone.app.contenttypes that referenced this issue Aug 9, 2017

Use original raw text and mimetype when indexing rich text.
This avoids a double transform (raw source to output mimetype to plain text).
Includes a reindex of the SearchableText index for Collections, Documents and News Items.

Issue plone/Products.CMFPlone#2066

mauritsvanrees added a commit to plone/plone.app.contenttypes that referenced this issue Aug 9, 2017

Use original raw text and mimetype when indexing rich text.
This avoids a double transform (raw source to output mimetype to plain text).
Includes a reindex of the SearchableText index for Collections, Documents and News Items.

Backported from master
Issue plone/Products.CMFPlone#2066
@jensens

This comment has been minimized.

Show comment
Hide comment
@jensens

jensens Oct 18, 2017

Member

any reason to keep this one open?

Member

jensens commented Oct 18, 2017

any reason to keep this one open?

@mauritsvanrees

This comment has been minimized.

Show comment
Hide comment
@mauritsvanrees

mauritsvanrees Oct 19, 2017

Member

Nope, everything has been fixed as far as I know.

Member

mauritsvanrees commented Oct 19, 2017

Nope, everything has been fixed as far as I know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment