Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finder: Indexer Parsers break search result descriptions #4327

Closed
andykirk opened this issue Sep 23, 2014 · 19 comments
Closed

Finder: Indexer Parsers break search result descriptions #4327

andykirk opened this issue Sep 23, 2014 · 19 comments

Comments

@andykirk
Copy link
Contributor

Files in question:

(A): \administrator\components\com_finder\helpers\indexer\parser.php
(B): \administrator\components\com_finder\helpers\indexer\parser\html.php

I have found two related issues with the code in the above files.

Firstly in file (A) the parse method checks if the $input is greater than 2048 bytes and if so chunks the $input at the last space.
The problem with this is that it doesn't take into account any HTML tags.
If the last space is inside a tag, the chunking splits the tag before it's passed to the parse method of file (B).
It's this parse method that's responsible for striping HTML tags (strip_tags()) so by the time the $input reaches this function it's too late - the tag is already split the search result description is left with partial tag garbage.

Example:

'Lorum ipsum ...' + 2006 more characters + 'dolor <abbr title="Some useful title">A thing</abbr> ...'

The 2048 character count would finish at the end of 'useful', the last available space is after 'Some'.
Because of the chunking the Finder result description looks like:

'Lorum ipsum ... dolor useful title">A thing ...'

Whereas it should be:

'Lorum ipsum ... dolor A thing ...'

The only ways I can think of solving this would be to either strip_tags() BEFORE chunking (simple but not inline with code separation) or to introduce some proper HTML chunking (complicated) with each parser having it's own chunking method.

The second issue I've found is that line 41 of file (B) ($input = str_replace('>', '> ', $input);) messes up the Finder results description too.
(I'm not sure what this line achieves but I suspect it fixes something that occurs because of the chunking)

Example:

I have a need to stylistically offset certain letters that make up an acronym.
I can't do this with CSS so I need to use tags like this:

'<b>H</b>yper<b>T</b>ext <b>M</b>arkup <b>L</b>anguage'

Because of the line I mentioned (and because it occurs before tags are stripped), the Finder results description looks like this:

'H yperT ext M arkup L anguage' - see the extra spaces.

Line 41 needs to be changed to take into account tags, removed or strip_tags() needs to occure first.

Because this isn't straightforward to fix and would need discussion or a decision by a Joomla dev, I can't really raise a PR, but please let me know if there's anything else you want me to do.

Thanks,
Andy

@Kubik-Rubik
Copy link
Member

Hello Andy,

thank you taking the time to create this ticket. I have analyzed the code and have to give you right.

How or where do you get the description with the messed up code in the output? In the smart search component the field for the description in the database is only 255 characters long, so only the first 255 characters are saved for the output. The problematic area after 2000 + x characters is not considered at all. Nevertheless, this bug should be fixed in the function process in the html.php file. I would suggest to add an additional check whether some HTML parts are still present.

I can also confirm the second issue with the example text that you've posted.

Summary: We should tackle it! :-)

This comment was created with the J!Tracker Application at http://issues.joomla.org/.

@andykirk
Copy link
Contributor Author

Hi,
I'm seeing this on the Smart Search results page.
Live example: https://www.npeu.ox.ac.uk/search?q=graces+training&Search=
Description includes remnants of a <video>.

This chunking is all happening before tags are stripped and before the 255 character truncation.
The process is going like this:

'Lorum ipsum ...' + 2006 more characters + 'dolor A thing ...'

Example as above. If the 2006 more characters is mainly HTML, that gets stripped.
In my case there's a couple of <video> and <params> etc weighing in at around 850 characters each.
Added together that's 1700 characters striped between the start and the badly chunked tag.
Add in a 100 or so more characters from other tags and suddenly the useful title">A thing is right within the 255 limit.

I hope that makes sense.

@Kubik-Rubik
Copy link
Member

@andykirk Yes, it makes totally sense. As I suggested, we should improve the check whether some HTML parts are still present and remove them to avoid such an output!

It is on my to do list. :-)

@andykirk
Copy link
Contributor Author

If it helps, I created a truncate HTML function ages ago.
I've just put it up as a gist:

https://gist.github.com/andykirk/b304a3c84594515677e6

I'm sure it could do with a lot of improvement but I thought I'd share in case it was useful.

Thanks,

@Kubik-Rubik
Copy link
Member

@andykirk Wow, too much code... :-) I just wanted to add a simple check whether an angle bracket is present and remove the part from the beginning to this bracket. This should solve the first issue completely.

@andykirk
Copy link
Contributor Author

That's totally fine. As I said - I just put the truncate function up in case it was useful.
Actually, thinking about it - it probably won't be useful, as you could still end up with HTML showing when the 255 limit kicks in.

@andykirk
Copy link
Contributor Author

FWIW - still think using strip_tags before the string is counted would be a more bullet-proof way of solving this.
I.e.

$input = strip_tags($input);
Above:
if (strlen($input) > 2048)

Edit: in file (A).

Cheers

@Kubik-Rubik
Copy link
Member

Yes, this is of course the easiest solution but we have the HTML parser exactly for that reason, so the processing should be in the corresponding parser class (parser/html.php).

@andykirk
Copy link
Contributor Author

Yes, of course, I understand that, but parser.php makes assumptions about the character length and chunks it before passing it to parser/html.php parse().
I'm not 100% sure I fully understand how your suggested fix will work, but I've got a feeling it may still be problematic, especially if you do that before strip tags, but I could be wrong :-)

One idea could be to add a pre_process() method parser/html.php which is simply:

return strip_tags($input);

Then parser.php could check for the existence of this method before establishing the character count:

...
$return = null;
/*add this >>>*/ if (method_exists('pre_process', $this))
{
    $input = $this->pre_process($input);
}

// Parse the input in batches if bigger than 2KB.
if (strlen($input) > 2048)
{
...

As I say, just an idea. :-)

@Kubik-Rubik
Copy link
Member

@andykirk 👍 for your idea! :-)

But then we should also reduce the process function. Will check it now!

@andykirk
Copy link
Contributor Author

True.
And thanks :-)

@Kubik-Rubik
Copy link
Member

Please check out the fix in #4345

I decided to go with a change only in the HTML parser.

@Kubik-Rubik
Copy link
Member

I will close this issue here, please go on in the linked PR with the discussion.

Thank you for reporting this, @andykirk!

This comment was created with the J!Tracker Application at http://issues.joomla.org/.

@jissues-bot
Copy link

Set to "closed" on behalf of @Kubik-Rubik by The JTracker Application at issues.joomla.org

@andykirk
Copy link
Contributor Author

Hi, thanks for the fix. However, I still anticipate a problem (though I'm unable to test at the moment).

Firstly, the fix makes this line redundant:
$input = str_replace('>', '> ', $input);
since it's all just being deleted anyway.

Secondly, as per my original example:
Lorum ipsum ... + 2006 more characters + dolor <abbr title="Some useful title">A thing</abbr> ...

the chunk of text that's passed to the parser would be:

Lorum ipsum ... + 2006 more characters + dolor <abbr title="Some

So maybe you meant to use < in the fix and not > to remove the tag remnant?

Cheers

@Kubik-Rubik
Copy link
Member

Yes, this line is obsolete with the fix (could only be applied to a broken HTML part which is removed later anyway). You are right, the cut could also lead to an open bracket. So we should check both possibilities (in the beginning or in the end of the string).

@Kubik-Rubik
Copy link
Member

Please test again, thanks!

@andykirk
Copy link
Contributor Author

That's great. I can't test until tomorrow, but it looks good to me :-)

Thanks for getting on with this so quickly.

@andykirk
Copy link
Contributor Author

Hi,
Applied fix and re-index my dev site and the results are now as expected, so I can confirm the fix works.

Many thanks. :-)

phproberto pushed a commit that referenced this issue Oct 8, 2014
…4327. Fixes #4345

Fix - Finder: Indexer Parsers break search result descriptions #4327
phproberto pushed a commit that referenced this issue Oct 8, 2014
…4327. Fixes #4345

Fix - Finder: Indexer Parsers break search result descriptions #4327
rdeutz pushed a commit to rdeutz/joomla-cms that referenced this issue Oct 24, 2014
…oomla#4327. Fixes joomla#4345

Fix - Finder: Indexer Parsers break search result descriptions joomla#4327
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants