[4.0] Add OutputFilter tests #28493

Hackwar · 2020-03-28T14:39:17Z

This adds unit tests for the OutputFilter class. The code is partly copied over from 3.x. Nothing special to look at. The Chinese symbol should mean "test" and be a 4-byte Unicode character.

This is related to #25845.

infograf768 · 2020-03-31T06:53:10Z

Note

The PR does not respect letter case in the path of the test file.
We have
a/tests/Unit/libraries/cms/Filter/OutputFilterTest.php
which should be
a/tests/Unit/Libraries/Cms/Filter/OutputFilterTest.php

Before patch

Interesting results in 4.0 vs 3.x (see #25845).
Although the 4-byte character 𠹷 is not highlighted, we now get a result.

And, contrary to 3.x, the string 不能创建文件 is highlighted.

After patch

We do get a single highlighted 4-byte character, but it is not the correct one.
Looking for 𨈇, we get 𠺝 highlighted.

infograf768 · 2020-03-31T07:00:23Z

Hint: It also looks like a simple 不, which is not in the range that is supposed to have problems, is not found.
In 3.x it was found but not highlighted.

infograf768 · 2020-03-31T09:39:24Z

I also tried using StringHelper for str_pad:
$code = StringHelper::str_pad(dechex(StringHelper::ord($chr)), 4, '0', STR_PAD_LEFT);

but there was no apparent change.
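
For reference, a rough Python equivalent of the expression above, assuming StringHelper::ord returns the Unicode code point of the character. The helper name padded_hex is made up for this illustration. It suggests that the padding only matters for code points with fewer than four hex digits, so a character outside the Basic Multilingual Plane still produces five digits.

```python
def padded_hex(ch: str) -> str:
    """Hex code point of a single character, left-padded with zeros to at least 4 digits."""
    # format(..., "x") mirrors dechex(); rjust(4, "0") mirrors str_pad(..., 4, '0', STR_PAD_LEFT).
    return format(ord(ch), "x").rjust(4, "0")


print(padded_hex("A"))        # 0041 (padding applied)
print(padded_hex("不"))       # 4e0d (already 4 digits, unchanged)
print(len(padded_hex("𨈇")))  # 5 (padding never truncates a longer code)
```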

Fixing 5 byte character encoding

Hackwar · 2020-04-04T20:37:55Z

Thank you @infograf768 for your tests here. You are right, the case of the folder names was wrong. Curse you, Windows! I fixed that. The OutputFilter class and the tests in here are correct, so this PR would be fine to merge, but there is indeed at least one serious bug in Smart Search regarding 5 byte characters. I will try to document what I could find out so far:

Indexing messes up 5 byte characters. Create a test article on an English site with "𨈇𠺝". This will result in the latter character being added to the index, but not the first one. The characters are properly added to the #__finder_tokens table, but then they are not properly moved over to #__finder_tokens_aggregate. It fails when doing a "SELECT DISTINCT term FROM #__finder_tokens" in the indexer in /administrator/components/com_finder/src/Indexer/Driver/Mysql.php, line 262. No idea what to do here...
The search term is somehow messed up as well. When searching for 𨈇, it matches on 𠺝. It states that 𨈇 is required in the output, but then asks the highlighter to highlight 𠺝.

I will open new issues for this, but would like to ask that this PR is merged. The error described by @infograf768 is not in this part of the code.

Hackwar · 2020-04-04T21:40:13Z

Sorry, I messed up a bit above. It's not 5-byte chars, but chars whose code point has 5 hex digits, which are 4-byte characters in UTF-8. In any case, MySQL simply takes those 2 characters, compares them and thinks they are identical.
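
The difference between hex digits and bytes can be checked quickly. The sketch below is Python rather than the PHP used in Joomla, and the helper name describe is hypothetical. For each character it reports the number of hex digits in the code point and the length of its UTF-8 encoding.

```python
def describe(ch: str) -> tuple:
    """Return (hex digits in the code point, number of UTF-8 bytes) for one character."""
    return len(format(ord(ch), "x")), len(ch.encode("utf-8"))


for ch in ("不", "𨈇", "𠺝"):
    digits, nbytes = describe(ch)
    print(f"U+{ord(ch):04X}: {digits} hex digits, {nbytes} UTF-8 bytes")
```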

richard67 · 2020-04-04T22:35:13Z

It seems there really is a problem with MySQL and these 2 Chinese characters, 𨈇 and 𠺝. Using phpMyAdmin, I inserted some records with these characters into a varchar column with utf8mb4_unicode_ci collation on MySQL 8.0.19, ran a "SELECT DISTINCT ...", and got only one of them back.

richard67 · 2020-04-04T22:38:06Z

If the table has the collation utf8mb4_0900_ai_ci, it works, but as far as I can see this collation is available on MySQL 8 and not on, e.g., 5.7.

richard67 · 2020-04-04T22:42:44Z

For possible explanations, see https://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details/.
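
A toy model of one possible explanation, written in Python. It assumes that the older Unicode collations have no individual weights for characters above U+FFFF and give all of them the same replacement weight; that is my reading of the MySQL documentation, not something verified in this thread. The names bmp_only_key, binary_key and distinct are invented for this sketch, and none of this is MySQL's actual implementation.

```python
REPLACEMENT_WEIGHT = 0xFFFD  # assumed shared weight for all supplementary characters


def bmp_only_key(term: str) -> tuple:
    """Toy comparison key: every code point above U+FFFF collapses to one weight."""
    return tuple(cp if cp <= 0xFFFF else REPLACEMENT_WEIGHT for cp in map(ord, term))


def binary_key(term: str) -> bytes:
    """Comparison key of a binary collation: the raw UTF-8 bytes."""
    return term.encode("utf-8")


def distinct(terms, key):
    """Keep the first term seen per key, roughly what SELECT DISTINCT does under a collation."""
    seen = {}
    for term in terms:
        seen.setdefault(key(term), term)
    return list(seen.values())


terms = ["𨈇", "𠺝", "不"]
print(len(distinct(terms, bmp_only_key)))  # 2: the two supplementary characters collapse
print(len(distinct(terms, binary_key)))    # 3: all terms stay separate
```

Which of the two collapsed characters a real server returns is not modelled here; the sketch only shows why one of them disappears.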

richard67 · 2020-04-04T23:03:45Z

@alikon If you check my 3 previous comments I am 100 % sure about what you will say: With PostgreSQL we don't have that problem ;-)

richard67 · 2020-04-05T12:01:16Z

To me the changes in this PR seem to be correct.

The problems we have, e.g. with certain Chinese characters, seem to be a collation problem in MySQL (and MariaDB):

Now one could think "let's use utf8mb4_bin collation everywhere", but that might not be correct for a particular language. Only the language-specific collation is really correct for that language, both for the equivalence of certain UTF-8 characters or 2-character sequences (as in German with ß and ss) and for sorting, so at least on multilingual sites you always have to make some compromise.
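
The trade-off can be illustrated in Python. This is only an analogy: str.casefold() stands in for a case-insensitive, language-aware collation, and a comparison of UTF-8 bytes stands in for utf8mb4_bin. The function names are made up for the example.

```python
def equal_binary(a: str, b: str) -> bool:
    """Byte-for-byte comparison, as a binary collation would do."""
    return a.encode("utf-8") == b.encode("utf-8")


def equal_folded(a: str, b: str) -> bool:
    """Comparison after Unicode case folding, which maps ß to ss."""
    return a.casefold() == b.casefold()


# A binary comparison keeps the two Chinese characters apart ...
print(equal_binary("𨈇", "𠺝"))           # False
# ... but it also loses equivalences a German reader would expect.
print(equal_binary("Straße", "STRASSE"))  # False
print(equal_folded("Straße", "STRASSE"))  # True
```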

@infograf768 Any thoughts?

richard67 · 2020-04-05T12:31:11Z

@Hackwar @chrisdavenport Maybe it would make sense to use utf8mb4_bin collation for the finder_tokens and finder_tokens_aggregate tables, or at least for particular columns of these tables, e.g. term? See my previous comment about the difference between some collations.

infograf768 · 2020-04-05T17:08:27Z

This is exactly why I have always said that we should keep com_search... It is a viable alternative for some languages.

infograf768 · 2020-04-06T06:47:26Z

See #25845 (comment) concerning com_search

alikon · 2020-04-06T07:25:49Z

@richard67 yes, PostgreSQL works just fine as usual 🤣

the test table

CREATE TABLE IF NOT EXISTS "chtest" (
  "id" serial NOT NULL,
  "col1" varchar(50) NOT NULL,
  PRIMARY KEY ("id")
);

the test data

insert into "chtest" values (1,'𨈇');
insert into "chtest" values (2,'𠺝');
insert into "chtest" values (3,'Ä');
insert into "chtest" values (4,'ä');
insert into "chtest" values (5,'Ë');
insert into "chtest" values (6,'ë');
insert into "chtest" values (7,'Ï');
insert into "chtest" values (8,'ï');
insert into "chtest" values (9,'Ö');
insert into "chtest" values (10,'ö');
insert into "chtest" values (11,'Ü');
insert into "chtest" values (12,'ü');
insert into "chtest" values (13,'Ÿ');
insert into "chtest" values (14,'ÿ');
insert into "chtest" values (15,'å');
insert into "chtest" values (16,'æé');
insert into "chtest" values (17,'ø');

the test query

select distinct col1 from chtest;

the results

infograf768 · 2020-04-06T10:19:19Z

Please test #28587

joomla-cms-bot added PR-4.0-dev Unit/System Tests labels Mar 28, 2020

infograf768 mentioned this pull request Mar 31, 2020

Support non-english chars in JS #25845

Merged

Adding tests for OutputFilter

002fee6

Hackwar force-pushed the j4testoutput branch from dc4a6f9 to 002fee6 Compare April 4, 2020 20:02

rdeutz merged commit 7abd203 into joomla:4.0-dev Apr 6, 2020

rdeutz added this to the Joomla 4.0 milestone Apr 6, 2020

infograf768 mentioned this pull request Apr 6, 2020

[4.0] Fixing smartsearch issue with some multibytes characters #28587

Closed

richard67 mentioned this pull request Apr 6, 2020

[4.0] Fixing smartsearch issue with some multibytes characters (alternative proposal) #28592

Merged

Hackwar deleted the j4testoutput branch April 16, 2020 17:56
