Improved HTML parser for Smart Search #5340
Thanks! I will test this ASAP.
@smanzi I just finished my document for the test:
Perfect, Dimitris, thanks, but... too neat! I'll minify this into a "one liner"...
@dgt41 are you on latest staging? I can't run the indexer... no progress bar... stuck... only:
Just to be clear the above problem is without this PR!
@test on a 2k boundary success Article used:
Result:
WTF! #5336 MUST be merged!!! 😄 thanks! 👍
Seems to be really OK, but, being the a..hole I am, I want to test it more thoroughly and also in a multilingual environment. In the meantime I'm going (tail between my legs) to close #5206. Thanks Chris!!
It definitely needs testing in a multi-lingual environment. I think it should be okay because all the string manipulations are done using byte offsets, but it really does need testing to make sure. Thanks for #5206. It was the stimulus I needed to get the new parser finished!
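A minimal sketch of why byte offsets can be safe with multibyte text, assuming UTF-8 content and cuts made only at ASCII delimiters such as a space or `<`. It is Python for illustration, not the parser's own code, and `split_at_ascii` is a hypothetical helper. In UTF-8 every byte of a multibyte character is 0x80 or higher, so an ASCII byte found by a byte search never sits inside a character; a cut at an arbitrary offset has no such guarantee.

```python
def split_at_ascii(data: bytes, delimiter: bytes, limit: int) -> tuple[bytes, bytes]:
    """Cut `data` at the last `delimiter` found before byte offset `limit`.

    Hypothetical helper: `delimiter` is expected to be a single ASCII byte.
    """
    cut = data.rfind(delimiter, 0, limit)
    if cut == -1:
        # No delimiter in range: fall back to a raw cut, which may split a character.
        cut = limit
    return data[:cut], data[cut:]


text = "Καλημέρα κόσμε " * 10      # Greek sample, two bytes per letter in UTF-8
data = text.encode("utf-8")

head, tail = split_at_ascii(data, b" ", 50)
print(head.decode("utf-8"))         # decodes cleanly: the cut sits on a space
print(len(head), len(tail))

try:
    data[:1].decode("utf-8")        # raw cut after the first byte of the first letter
except UnicodeDecodeError as exc:
    print("raw cut split a character:", exc.reason)
```

Whether the parser only ever cuts at ASCII positions is exactly what the multilingual test needs to confirm.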
I'm DUMB
But
I always use "editor: none" 😃
@smanzi @chrisdavenport Confirmed that tags in series should have a space to separate the words
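A small sketch of the spacing issue, assuming a simple regex-based tag stripper (this is not the Smart Search parser, and both function names are made up for the example). Removing tags outright glues the words of adjacent elements together; replacing each tag with a space and then collapsing whitespace keeps them apart.

```python
import re

TAG = re.compile(r"<[^>]+>")


def strip_tags_naive(html: str) -> str:
    """Delete tags outright; words in adjacent elements run together."""
    return TAG.sub("", html)


def strip_tags_spaced(html: str) -> str:
    """Replace each tag with a space, then collapse runs of whitespace."""
    return " ".join(TAG.sub(" ", html).split())


sample = "<p>first</p><p>second</p>"
print(strip_tags_naive(sample))   # firstsecond
print(strip_tags_spaced(sample))  # first second
```

A regex is only a stand-in here; it does not handle comments, scripts or malformed markup.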
@chrisdavenport Chris, about multilingual: do you know of the bug (feature?) where content flagged for "All" languages is really searched only in the "default" language and not in any other? This is driving me crazy, because on one of my sites (bilingual) I also have pages where the content is not assigned to a particular language but to "All" (it really is content for all languages!), and... I can't find it in the secondary language. But this is of course another story....
@dgt41 really??? My bad ignorance.... sorry!
@chrisdavenport @smanzi Sorry I removed the patch earlier 😕
@dgt41 This is generated by JCH Optimizer Pro. It might be their mistake, but...
@dgt41 I think only inline elements (eg.
So, is there a problem? If there is, can you give me a specific test where it fails?
@smanzi Regarding the language issue, can you open a separate issue for that (assuming you haven't already)? Let's keep the issues separate. Thanks.
No Chris, no problem! The only problem is me, that I disabled your PR when I couldn't index (because of the lack of PR #5336) and then forgot to re-enable it... So far everything is OK!
@chrisdavenport The other issue... I opened it back in the days of the old JTracker, and I think I also reopened it here on GitHub. Let me check...
@chrisdavenport All Good here as well @test success
@chrisdavenport There is #5204 where I reported both issues: the one for the tags without spacing and also the multilingual search...
@chrisdavenport Do you mind if I wait to give you the @test until I've finished some more tests? Anyway... it seems REALLY OK!
@chrisdavenport The spacing problem also occurs at the 2k boundary, try this:
The word "some" should come up as a single word, but it comes up as
…im so that tokens remain separated when 2Kb chunks are re-assembled.
Okay, I fixed a couple of bugs.
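One way trimming can merge tokens when chunks are put back together, sketched in Python with a tiny chunk size so the effect is visible. This is an assumption about the kind of bug involved, not the PR's code; `chunks` and `reassemble` are hypothetical names, and the real chunk size discussed here is 2048 bytes.

```python
def chunks(text: str, size: int) -> list[str]:
    """Split `text` into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def reassemble(parts: list[str], trim: bool) -> str:
    """Join chunks back together, optionally trimming each one first."""
    if trim:
        parts = [p.strip() for p in parts]
    return "".join(parts)


parts = chunks("alpha beta", 6)          # ['alpha ', 'beta']
print(reassemble(parts, trim=True))      # alphabeta   (separator lost at the boundary)
print(reassemble(parts, trim=False))     # alpha beta  (tokens stay separated)
```

The space that ends the first chunk is the only thing separating the two tokens, so trimming it away changes what gets indexed.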
@dgt41 How do you make sure that a short string used for searching falls exactly at the 2k boundary inside a long string? 2k counting from where?
@smanzi count 2048
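A hedged sketch of how such a test string could be generated instead of counted by hand. It assumes plain ASCII filler (so characters equal bytes) and that the 2048 is counted from the first byte of the text handed to the indexer; `make_boundary_text` is a hypothetical helper, not part of Joomla.

```python
CHUNK = 2048  # chunk size mentioned in this thread


def make_boundary_text(word: str, chunk: int = CHUNK, filler: str = "lorem ") -> str:
    """Return ASCII text in which `word` straddles byte offset `chunk`."""
    start = chunk - len(word) // 2                      # word begins just before the boundary
    pad = (filler * (start // len(filler) + 1))[: start - 1] + " "
    return pad + word + " and the rest of the article"


text = make_boundary_text("some")
print(repr(text[:CHUNK][-2:]), repr(text[CHUNK:][:2]))  # 'so' 'me'
```

Pasting the result into an article only puts the word on the boundary if nothing else (editor markup, wrapper tags) is added in front of it, which is why the "editor: none" setting matters for this test.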
I have prepared a test file (an extended and slightly modified version of the @dgt41 one). N.B.: it is not enough to switch editor: editor none must be the default editor...
@test success
@dgt41 Thanks Dimitris! You can then send it to me by mail or skype...
@test success no trim on the 2k boundary!
@dgt41 Dimitris, can you give the @test to this also in http://issues.joomla.org/tracker/joomla-cms/5340 so that this can go RTC? Thanks!
RTC This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/5340.
And merged into staging. Thanks Chris!
This PR fixes several known issues with the HTML parser used by Smart Search. Most of the issues result from breaking the input string into 2Kb chunks to improve performance when saving large articles.
In particular
To test this PR