[4.0] Add OutputFilter tests #28493
Note: The PR does not respect case for the test part.
Before patch
Interesting results in 4.0 vs 3.x (see #25845).
And, contrary to 3.x, the string
After patch
We do get a highlighted single 4-byte character, BUT it is not the correct one. |
tried using but no apparent change. |
Fixing 5 byte character encoding
Thank you @infograf768 for your tests here. You are right, the case of the folder names was wrong. Curse you, Windows! I fixed that. The OutputFilter class and the tests in here are correct, so this PR would be fine to merge, but there is indeed at least one serious bug in Smart Search regarding 5 byte characters. I will try to document what I could find out so far:
I will open new issues for this, but would like to ask that this PR is merged. The error described by @infograf768 is not in this part of the code. |
Sorry, I messed up a bit above. It's not 5-byte characters but characters whose code points have 5 hex digits, which makes them 4-byte characters in UTF-8. In any case, MySQL simply takes those 2 characters, compares them and thinks they are identical. |
It seems there really is a problem with MySQL and these 2 Chinese characters, 𨈇 and 𠺝. Using phpMyAdmin, I inserted some records with these characters into a varchar column with utf8mb4_unicode_ci collation on MySQL 8.0.19, ran a "SELECT DISTINCT ...", and got only one of them back. |
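As a side note on the byte counts discussed above, the width of these two characters can be checked directly. The following is a minimal sketch in Python; the helper name `describe` is hypothetical and not part of Joomla or this PR. It reports how many hex digits the code point has and how many bytes the UTF-8 encoding needs.

```python
def describe(ch: str) -> tuple[int, int]:
    """Return (hex digits of the code point, bytes in the UTF-8 encoding)."""
    code_point = ord(ch)
    hex_digits = len(f"{code_point:X}")      # digits in U+XXXXX notation
    utf8_bytes = len(ch.encode("utf-8"))     # storage size in UTF-8
    return hex_digits, utf8_bytes


if __name__ == "__main__":
    # The two characters from this thread.
    for ch in ("𨈇", "𠺝"):
        digits, size = describe(ch)
        print(f"U+{ord(ch):X}: {digits} hex digits, {size} bytes in UTF-8")
```

Both characters lie outside the Basic Multilingual Plane, which is why a 3-byte `utf8` column type cannot store them and `utf8mb4` is needed.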
If the table has collation |
For possible explanations, see https://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details/. |
@alikon If you check my 3 previous comments I am 100 % sure about what you will say: With PostgreSQL we don't have that problem ;-) |
To me the changes in this PR seem to be correct. The problems we have e.g. with certain Chinese characters seem to be a collation problem in MySQL (and MariaDB): Now one could think "let's use utf8mb4_bin collation everywhere", but that might not be correct for a particular language. Really correct for that language regarding equivalence of certain UTF-8 characters or 2-character sequences like it is in German with @infograf768 Any thoughts? |
@Hackwar @chrisdavenport Maybe it could make sense to use utf8mb4_bin collation for the finder_tokens and the finder_tokens_aggregate tables, or at least for particular columns of these tables, e.g. term? See my previous comment about the difference between some collations. |
This is exactly why I have always said that we should keep com_search... It is a viable alternative for some languages. |
See #25845 (comment) concerning com_search |
@richard67 yes, PostgreSQL works just fine as usual 🤣
The test table:
CREATE TABLE IF NOT EXISTS "chtest" (
"id" serial NOT NULL,
"col1" varchar(50) NOT NULL,
PRIMARY KEY ("id")
);
The test data:
insert into "chtest" values (1,'𨈇');
insert into "chtest" values (2,'𠺝');
insert into "chtest" values (3,'Ä');
insert into "chtest" values (4,'ä');
insert into "chtest" values (5,'Ë');
insert into "chtest" values (6,'ë');
insert into "chtest" values (7,'Ï');
insert into "chtest" values (8,'ï');
insert into "chtest" values (9,'Ö');
insert into "chtest" values (10,'ö');
insert into "chtest" values (11,'Ü');
insert into "chtest" values (12,'ü');
insert into "chtest" values (13,'Ÿ');
insert into "chtest" values (14,'ÿ');
insert into "chtest" values (15,'å');
insert into "chtest" values (16,'æé');
insert into "chtest" values (17,'ø');
The test query:
select distinct col1 from chtest; |
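The test above can also be sketched in a self-contained form. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in, which is neither PostgreSQL nor MySQL, and the function name `distinct_values` is hypothetical. SQLite's default BINARY collation compares the stored bytes, so it is closer in spirit to the utf8mb4_bin suggestion than to utf8mb4_unicode_ci.

```python
import sqlite3

# Same test data as in the comment above.
TEST_VALUES = [
    "𨈇", "𠺝", "Ä", "ä", "Ë", "ë", "Ï", "ï", "Ö", "ö",
    "Ü", "ü", "Ÿ", "ÿ", "å", "æé", "ø",
]


def distinct_values(values):
    """Insert the values into an in-memory table and run SELECT DISTINCT."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumption: SQLite with its default BINARY collation, not MySQL.
        conn.execute(
            "CREATE TABLE chtest (id INTEGER PRIMARY KEY, col1 TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO chtest (col1) VALUES (?)", [(v,) for v in values]
        )
        rows = conn.execute("SELECT DISTINCT col1 FROM chtest").fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    result = distinct_values(TEST_VALUES)
    print(len(result), "distinct values out of", len(TEST_VALUES))
```

With a byte-wise comparison every inserted value stays distinct, including 𨈇 and 𠺝. The trade-off mentioned earlier in the thread still applies: a binary collation also treats pairs such as Ä and ä as different.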
Please test #28587 |
This adds unit tests for the OutputFilter class. The code is half copied over from 3.x. Nothing special to look at. The Chinese symbol should mean "test" and be a 4-byte Unicode character.
This is related to #25845.