New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finder: Indexer Parsers break search result descriptions #4327
Comments
Hello Andy, thank you taking the time to create this ticket. I have analyzed the code and have to give you right. How or where do you get the description with the messed up code in the output? In the smart search component the field for the description in the database is only 255 characters long, so only the first 255 characters are saved for the output. The problematic area after 2000 + x characters is not considered at all. Nevertheless, this bug should be fixed in the function I can also confirm the second issue with the example text that you've posted. Summary: We should tackle it! :-) |
Hi, This chunking is all happening before tags are stripped and before the 255 character truncation. 'Lorum ipsum ...' + 2006 more characters + 'dolor A thing ...' Example as above. If the 2006 more characters is mainly HTML, that gets stripped. I hope that makes sense. |
@andykirk Yes, it makes totally sense. As I suggested, we should improve the check whether some HTML parts are still present and remove them to avoid such an output! It is on my to do list. :-) |
If it helps, I created a truncate HTML function ages ago. https://gist.github.com/andykirk/b304a3c84594515677e6 I'm sure it could do with a lot of improvement but I thought I'd share in case it was useful. Thanks, |
@andykirk Wow, too much code... :-) I just wanted to add a simple check whether an angle bracket is present and remove the part from the beginning to this bracket. This should solve the first issue completely. |
That's totally fine. As I said - I just put the truncate function up in case it was useful. |
FWIW - still think using strip_tags before the string is counted would be a more bullet-proof way of solving this.
Edit: in file (A). Cheers |
Yes, this is of course the easiest solution but we have the HTML parser exactly for that reason, so the processing should be in the corresponding parser class (parser/html.php). |
Yes, of course, I understand that, but parser.php makes assumptions about the character length and chunks it before passing it to parser/html.php parse(). One idea could be to add a
Then parser.php could check for the existence of this method before establishing the character count:
As I say, just an idea. :-) |
@andykirk 👍 for your idea! :-) But then we should also reduce the process function. Will check it now! |
True. |
Please check out the fix in #4345 I decided to go with a change only in the HTML parser. |
I will close this issue here, please go on in the linked PR with the discussion. Thank you for reporting this, @andykirk! |
Set to "closed" on behalf of @Kubik-Rubik by The JTracker Application at issues.joomla.org |
Hi, thanks for the fix. However, I still anticipate a problem (though I'm unable to test at the moment). Firstly, the fix makes this line redundant: Secondly, as per my original example: the chunk of text that's passed to the parser would be:
So maybe you meant to use Cheers |
Yes, this line is obsolete with the fix (could only be applied to a broken HTML part which is removed later anyway). You are right, the cut could also lead to an open bracket. So we should check both possibilities (in the beginning or in the end of the string). |
Please test again, thanks! |
That's great. I can't test until tomorrow, but it looks good to me :-) Thanks for getting on with this so quickly. |
Hi, Many thanks. :-) |
…oomla#4327. Fixes joomla#4345 Fix - Finder: Indexer Parsers break search result descriptions joomla#4327
Files in question:
(A): \administrator\components\com_finder\helpers\indexer\parser.php
(B): \administrator\components\com_finder\helpers\indexer\parser\html.php
I have found two related issues with the code in the above files.
Firstly in file (A) the parse method checks if the $input is greater than 2048 bytes and if so chunks the $input at the last space.
The problem with this is that it doesn't take into account any HTML tags.
If the last space is inside a tag, the chunking splits the tag before it's passed to the parse method of file (B).
It's this parse method that's responsible for striping HTML tags (strip_tags()) so by the time the $input reaches this function it's too late - the tag is already split the search result description is left with partial tag garbage.
Example:
'Lorum ipsum ...' + 2006 more characters + 'dolor <abbr title="Some useful title">A thing</abbr> ...'
The 2048 character count would finish at the end of 'useful', the last available space is after 'Some'.
Because of the chunking the Finder result description looks like:
'Lorum ipsum ... dolor useful title">A thing ...'
Whereas it should be:
'Lorum ipsum ... dolor A thing ...'
The only ways I can think of solving this would be to either strip_tags() BEFORE chunking (simple but not inline with code separation) or to introduce some proper HTML chunking (complicated) with each parser having it's own chunking method.
The second issue I've found is that line 41 of file (B) ($input = str_replace('>', '> ', $input);) messes up the Finder results description too.
(I'm not sure what this line achieves but I suspect it fixes something that occurs because of the chunking)
Example:
I have a need to stylistically offset certain letters that make up an acronym.
I can't do this with CSS so I need to use tags like this:
'<b>H</b>yper<b>T</b>ext <b>M</b>arkup <b>L</b>anguage'
Because of the line I mentioned (and because it occurs before tags are stripped), the Finder results description looks like this:
'H yperT ext M arkup L anguage' - see the extra spaces.
Line 41 needs to be changed to take into account tags, removed or strip_tags() needs to occure first.
Because this isn't straightforward to fix and would need discussion or a decision by a Joomla dev, I can't really raise a PR, but please let me know if there's anything else you want me to do.
Thanks,
Andy
The text was updated successfully, but these errors were encountered: