Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.0] Add OutputFilter tests #28493

Merged
merged 1 commit into from
Apr 6, 2020
Merged

[4.0] Add OutputFilter tests #28493

merged 1 commit into from
Apr 6, 2020

Conversation

Hackwar
Copy link
Member

@Hackwar Hackwar commented Mar 28, 2020

This adds unittests for the OutputFilter class. The code is half copied over from 3.x. Nothing special to look at. The chinese symbol should mean "test" and be a 4-byte-unicode character.

This is related to #25845.

@infograf768
Copy link
Member

Note

The PR does not respect case for the test part
we have
a/tests/Unit/libraries/cms/Filter/OutputFilterTest.php
which should be
a/tests/Unit/Libraries/Cms/Filter/OutputFilterTest.php

Before patch

Interesting results in 4.0 vs 3.x (See #25845 )
Although the 4bytes character 𠹷 is not highlighted, we now get a result.
Screen Shot 2020-03-31 at 08 29 53

And, contrary to 3.x, the string 不能创建文件 is highlighted.
finder before 4_0

After patch

We do get a highlighted single 4 bytes character BUT it is not the correct one.
Looking for 𨈇, we get 𠺝 highlighted.
Screen Shot 2020-03-31 at 08 48 11

@infograf768
Copy link
Member

infograf768 commented Mar 31, 2020

Hint: Looks also like a simple which is not in the range supposed to have problems, is not found.
In 3.x it was found but not highlighted.

@infograf768
Copy link
Member

tried using StringHelper also for str_pad
$code = StringHelper::str_pad(dechex(StringHelper::ord($chr)), 4, '0', STR_PAD_LEFT);

but no apparent change.

Fixing 5 byte character encoding
@Hackwar
Copy link
Member Author

Hackwar commented Apr 4, 2020

Thank you @infograf768 for your tests here. You are right, the cases for the folder names was wrong. Curse you, Windows! I fixed that. The OutputFilter class and the tests in here are correct, so this PR would be fine to merge, but there is indeed at least one serious bug in Smart Search regarding 5 byte characters. I will try to document what I could find out so far:

  1. Indexing messes up 5 byte characters. Create a test article on an english site with "𨈇 𠺝". This will result in the later character being added to the index, the first one not. The characters are properly added to the #__finder_tokens table, but then those are not properly moved over to the #__finder_tokens_aggregate. It fails when doing a "SELECT DISTINCT term FROM #__finder_tokens" in the indexer in /administrator/components/com_finder/src/Indexer/Driver/Mysql.php line 262 No idea what to do here...

  2. The search term is somehow messed up as well. When searching for 𨈇, it matches on 𠺝. It states that 𨈇 is required in the output, but then again asks the highlighter to highlight 𠺝.

I will open new issues for this, but would like to ask that this PR is merged. The error described by @infograf768 is not in this part of the code.

@Hackwar
Copy link
Member Author

Hackwar commented Apr 4, 2020

Sorry, I messed up a bit above. It's not 5 byte chars, but 5 hex chars, thus 3 byte unicode characters. In any way, MySQL simply takes those 2 characters, compares them and thinks they are identical.

@richard67
Copy link
Member

It seems there is really a problem with MySQL and these 2 chinese characters 𨈇 and 𠺝 . I've inserted with phpMyAdmin some records with these characters into a table in a varchar column with utf8mb4_unicode_ci collation on MySQL 8.0.19 and did a "SELECT DISTINCT ..." and got returned only one of them.

@richard67
Copy link
Member

If the table has collation utf8mb4_0900_ai_ci it works, but this collation is as far as I can see available on MySQL 8 but not on e.g. 5.7.

@richard67
Copy link
Member

Possible explanations for that see https://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details/.

@richard67
Copy link
Member

@alikon If you check my 3 previous comments I am 100 % sure about what you will say: With PostgreSQL we don't have that problem ;-)

@richard67
Copy link
Member

To me the changes in this PR seem to be correct.

The problems we have e.g. with certain Chinese characters seems to be a collation problem in MySQL (and MariaDB):

j4-collation-problem-00

j4-collation-problem-01

j4-collation-problem-02

j4-collation-problem-03

j4-collation-problem-04

Now one could think "let's use utf8mb4_bin collation everywhere", but that might not be correct for a particular language. Really correct for that language regarding equivalence of certain UTF-8 characters or 2 character sequences like it is in German with ß and ssand regarding sorting would always only be the language-specific collation, and so at least on multilingual sites you always have to make some compromise.

@infograf768 Any thoughts?

@richard67
Copy link
Member

@Hackwar @chrisdavenport Maybe it could make sense to use utf8mb4_bin collation for the finder_tokens and the finder_tokens_aggregate table, or at least for particular coluns of these tables, e.g. term? See my previous comment about the difference between some collations.

@infograf768
Copy link
Member

This is exactly why I have always said that we should keep com_search... It is a viable alternative for some languages.

@infograf768
Copy link
Member

See #25845 (comment) concerning com_search

@alikon
Copy link
Contributor

alikon commented Apr 6, 2020

@richard67 yes postgresql works just fine as usual 🤣

the test table

CREATE TABLE IF NOT EXISTS "chtest" (
  "id" serial NOT NULL,
  "col1" varchar(50) NOT NULL,
  PRIMARY KEY ("id")
);

the test data

insert into "chtest" values (1,'𨈇');
insert into "chtest" values (2,'𠺝');
insert into "chtest" values (3,'Ä');
insert into "chtest" values (4,'ä');
insert into "chtest" values (5,'Ë');
insert into "chtest" values (6,'ë');
insert into "chtest" values (7,'Ï');
insert into "chtest" values (8,'ï');
insert into "chtest" values (9,'Ö');
insert into "chtest" values (10,'ö');
insert into "chtest" values (11,'Ü');
insert into "chtest" values (12,'ü');
insert into "chtest" values (13,'Ÿ');
insert into "chtest" values (14,'ÿ');
insert into "chtest" values (15,'å');
insert into "chtest" values (16,'æé');
insert into "chtest" values (17,'ø');

thet test query

select distinct col1 from chtest;

the results
Screenshot from 2020-04-06 09-24-42

@rdeutz rdeutz merged commit 7abd203 into joomla:4.0-dev Apr 6, 2020
@rdeutz rdeutz added this to the Joomla 4.0 milestone Apr 6, 2020
@infograf768
Copy link
Member

Please test #28587

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants