Finder: Indexer Parsers break search result descriptions #4327

andykirk · 2014-09-23T15:29:24Z

Files in question:

(A): \administrator\components\com_finder\helpers\indexer\parser.php
(B): \administrator\components\com_finder\helpers\indexer\parser\html.php

I have found two related issues with the code in the above files.

Firstly in file (A) the parse method checks if the $input is greater than 2048 bytes and if so chunks the $input at the last space.
The problem with this is that it doesn't take into account any HTML tags.
If the last space is inside a tag, the chunking splits the tag before it's passed to the parse method of file (B).
It's this parse method that's responsible for striping HTML tags (strip_tags()) so by the time the $input reaches this function it's too late - the tag is already split the search result description is left with partial tag garbage.

Example:

'Lorum ipsum ...' + 2006 more characters + 'dolor <abbr title="Some useful title">A thing</abbr> ...'

The 2048 character count would finish at the end of 'useful', the last available space is after 'Some'.
Because of the chunking the Finder result description looks like:

'Lorum ipsum ... dolor useful title">A thing ...'

Whereas it should be:

'Lorum ipsum ... dolor A thing ...'

The only ways I can think of solving this would be to either strip_tags() BEFORE chunking (simple but not inline with code separation) or to introduce some proper HTML chunking (complicated) with each parser having it's own chunking method.

The second issue I've found is that line 41 of file (B) ($input = str_replace('>', '> ', $input);) messes up the Finder results description too.
(I'm not sure what this line achieves but I suspect it fixes something that occurs because of the chunking)

Example:

I have a need to stylistically offset certain letters that make up an acronym.
I can't do this with CSS so I need to use tags like this:

'HyperText Markup Language'

Because of the line I mentioned (and because it occurs before tags are stripped), the Finder results description looks like this:

'H yperT ext M arkup L anguage' - see the extra spaces.

Line 41 needs to be changed to take into account tags, removed or strip_tags() needs to occure first.

Because this isn't straightforward to fix and would need discussion or a decision by a Joomla dev, I can't really raise a PR, but please let me know if there's anything else you want me to do.

Thanks,
Andy

Kubik-Rubik · 2014-09-24T21:34:40Z

Hello Andy,

thank you taking the time to create this ticket. I have analyzed the code and have to give you right.

How or where do you get the description with the messed up code in the output? In the smart search component the field for the description in the database is only 255 characters long, so only the first 255 characters are saved for the output. The problematic area after 2000 + x characters is not considered at all. Nevertheless, this bug should be fixed in the function process in the html.php file. I would suggest to add an additional check whether some HTML parts are still present.

I can also confirm the second issue with the example text that you've posted.

Summary: We should tackle it! :-)

This comment was created with the J!Tracker Application at http://issues.joomla.org/.

andykirk · 2014-09-25T09:35:19Z

Hi,
I'm seeing this on the Smart Search results page.
Live example: https://www.npeu.ox.ac.uk/search?q=graces+training&Search=
Description includes remnants of a <video>.

This chunking is all happening before tags are stripped and before the 255 character truncation.
The process is going like this:

'Lorum ipsum ...' + 2006 more characters + 'dolor A thing ...'

Example as above. If the 2006 more characters is mainly HTML, that gets stripped.
In my case there's a couple of <video> and <params> etc weighing in at around 850 characters each.
Added together that's 1700 characters striped between the start and the badly chunked tag.
Add in a 100 or so more characters from other tags and suddenly the useful title">A thing is right within the 255 limit.

I hope that makes sense.

Kubik-Rubik · 2014-09-25T09:47:32Z

@andykirk Yes, it makes totally sense. As I suggested, we should improve the check whether some HTML parts are still present and remove them to avoid such an output!

It is on my to do list. :-)

andykirk · 2014-09-25T09:50:37Z

If it helps, I created a truncate HTML function ages ago.
I've just put it up as a gist:

https://gist.github.com/andykirk/b304a3c84594515677e6

I'm sure it could do with a lot of improvement but I thought I'd share in case it was useful.

Thanks,

Kubik-Rubik · 2014-09-25T09:57:23Z

@andykirk Wow, too much code... :-) I just wanted to add a simple check whether an angle bracket is present and remove the part from the beginning to this bracket. This should solve the first issue completely.

andykirk · 2014-09-25T10:05:34Z

That's totally fine. As I said - I just put the truncate function up in case it was useful.
Actually, thinking about it - it probably won't be useful, as you could still end up with HTML showing when the 255 limit kicks in.

andykirk · 2014-09-25T10:08:38Z

FWIW - still think using strip_tags before the string is counted would be a more bullet-proof way of solving this.
I.e.

$input = strip_tags($input);
Above:
if (strlen($input) > 2048)

Edit: in file (A).

Cheers

Kubik-Rubik · 2014-09-25T10:26:47Z

Yes, this is of course the easiest solution but we have the HTML parser exactly for that reason, so the processing should be in the corresponding parser class (parser/html.php).

andykirk · 2014-09-25T11:30:05Z

Yes, of course, I understand that, but parser.php makes assumptions about the character length and chunks it before passing it to parser/html.php parse().
I'm not 100% sure I fully understand how your suggested fix will work, but I've got a feeling it may still be problematic, especially if you do that before strip tags, but I could be wrong :-)

One idea could be to add a pre_process() method parser/html.php which is simply:

return strip_tags($input);

Then parser.php could check for the existence of this method before establishing the character count:

...
$return = null;
/*add this >>>*/ if (method_exists('pre_process', $this))
{
    $input = $this->pre_process($input);
}

// Parse the input in batches if bigger than 2KB.
if (strlen($input) > 2048)
{
...

As I say, just an idea. :-)

Kubik-Rubik · 2014-09-25T11:47:38Z

@andykirk 👍 for your idea! :-)

But then we should also reduce the process function. Will check it now!

andykirk · 2014-09-25T11:53:56Z

True.
And thanks :-)

Kubik-Rubik · 2014-09-25T12:47:02Z

Please check out the fix in #4345

I decided to go with a change only in the HTML parser.

Kubik-Rubik · 2014-09-25T12:50:01Z

I will close this issue here, please go on in the linked PR with the discussion.

Thank you for reporting this, @andykirk!

This comment was created with the J!Tracker Application at http://issues.joomla.org/.

jissues-bot · 2014-09-25T12:50:14Z

Set to "closed" on behalf of @Kubik-Rubik by The JTracker Application at issues.joomla.org

andykirk · 2014-09-25T13:07:16Z

Hi, thanks for the fix. However, I still anticipate a problem (though I'm unable to test at the moment).

Firstly, the fix makes this line redundant:
$input = str_replace('>', '> ', $input);
since it's all just being deleted anyway.

Secondly, as per my original example:
Lorum ipsum ... + 2006 more characters + dolor <abbr title="Some useful title">A thing</abbr> ...

the chunk of text that's passed to the parser would be:

Lorum ipsum ... + 2006 more characters + dolor <abbr title="Some

So maybe you meant to use < in the fix and not > to remove the tag remnant?

Cheers

Kubik-Rubik · 2014-09-25T13:11:23Z

Yes, this line is obsolete with the fix (could only be applied to a broken HTML part which is removed later anyway). You are right, the cut could also lead to an open bracket. So we should check both possibilities (in the beginning or in the end of the string).

Kubik-Rubik · 2014-09-25T13:18:06Z

Please test again, thanks!

andykirk · 2014-09-25T13:53:12Z

That's great. I can't test until tomorrow, but it looks good to me :-)

Thanks for getting on with this so quickly.

andykirk · 2014-09-26T13:58:26Z

Hi,
Applied fix and re-index my dev site and the results are now as expected, so I can confirm the fix works.

Many thanks. :-)

…4327. Fixes #4345 Fix - Finder: Indexer Parsers break search result descriptions #4327

…oomla#4327. Fixes joomla#4345 Fix - Finder: Indexer Parsers break search result descriptions joomla#4327

Kubik-Rubik mentioned this issue Sep 25, 2014

Fix - Finder: Indexer Parsers break search result descriptions #4327 #4345

Closed

jissues-bot closed this as completed Sep 25, 2014

phproberto pushed a commit that referenced this issue Oct 8, 2014

[fix] Finder: Indexer Parsers break search result descriptions. Fixes #…

88effc7

…4327. Fixes #4345 Fix - Finder: Indexer Parsers break search result descriptions #4327

phproberto pushed a commit that referenced this issue Oct 8, 2014

[fix] Finder: Indexer Parsers break search result descriptions. Fixes #…

9036e28

…4327. Fixes #4345 Fix - Finder: Indexer Parsers break search result descriptions #4327

rdeutz pushed a commit to rdeutz/joomla-cms that referenced this issue Oct 24, 2014

[fix] Finder: Indexer Parsers break search result descriptions. Fixes j…

cc8df04

…oomla#4327. Fixes joomla#4345 Fix - Finder: Indexer Parsers break search result descriptions joomla#4327

zero-24 added the No Code Attached Yet label Jul 7, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finder: Indexer Parsers break search result descriptions #4327

Finder: Indexer Parsers break search result descriptions #4327

andykirk commented Sep 23, 2014

Kubik-Rubik commented Sep 24, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

jissues-bot commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

andykirk commented Sep 26, 2014

Finder: Indexer Parsers break search result descriptions #4327

Finder: Indexer Parsers break search result descriptions #4327

Comments

andykirk commented Sep 23, 2014

Kubik-Rubik commented Sep 24, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

jissues-bot commented Sep 25, 2014

andykirk commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

Kubik-Rubik commented Sep 25, 2014

andykirk commented Sep 25, 2014

andykirk commented Sep 26, 2014