[4.0] Fixing smartsearch issue with some multibytes characters (alternative proposal) #28592

Merged


richard67
Member

@richard67 richard67 commented Apr 6, 2020

Pull Request for remaining part of Issue #28493 .

Alternative to PR #28587 .

Summary of Changes

Changing collation of columns term, stem and soundex in table #__finder_terms and columns term and stem in tables #__finder_tokens and #__finder_tokens_aggregates to utf8mb4_bin.

This is the same as done in PR #28587, except that here it is done only for the columns mentioned above and not for the complete tables.

Changing table #__finder_terms_common so that only column term has utf8mb4_bin collation, which matches how we handle other tables that have utf8mb4_bin collated columns.
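
To illustrate why a binary collation can help here, the following is a minimal Python sketch, not MySQL's actual algorithm. It assumes, as the MySQL documentation describes for its older non-binary Unicode collations, that all supplementary (4-byte) characters share a single weight, whereas a binary collation compares the characters as stored. Whether this is exactly what happened in the original issue is an assumption; the function names `ci_weights` and `bin_weights` are hypothetical.

```python
def ci_weights(text):
    """Toy model of a non-binary collation (NOT MySQL's real algorithm).

    Assumption: letters are case-folded, and every supplementary character
    (code point above U+FFFF) collapses to one shared weight, 0xFFFD.
    """
    return [0xFFFD if ord(ch) > 0xFFFF else ord(ch) for ch in text.casefold()]


def bin_weights(text):
    """Toy model of a binary collation: compare the UTF-8 bytes as stored."""
    return text.encode("utf-8")


if __name__ == "__main__":
    a, b = "𠹷", "𨈇"  # two of the 4-byte test characters used in this PR
    print(ci_weights(a) == ci_weights(b))    # the toy non-binary model cannot tell them apart
    print(bin_weights(a) == bin_weights(b))  # the byte-wise comparison can
```

Under this model, two different 4-byte search terms would collide in the non-binary case, which is the kind of ambiguity a `utf8mb4_bin` column avoids.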

Testing Instructions

Thanks to @infograf768 for the testing instructions.

Moving testing instructions to a comment below because Drone seems to have problems with certain unicode characters in the description of a PR.

Documentation Changes Required

None.

@infograf768
Member

This works also.

@richard67
Member Author

@wilsonge Please choose which one you like more, this one or #28587.

@richard67
Member Author

richard67 commented Apr 7, 2020

Testing Instructions

Thanks to @infograf768 for the testing instructions. I've added the update test to make sure I haven't made a mistake in the update SQL script.

Test 1: New installation

Step 1: Patch and make a clean install.

Step 2: Create and publish an article which contains

Chinese: 不能创建文件

Greek: Εγκατάσταση Γλωσσών

German: Europäer

French: être noël

Simple Chinese: 不

Four-byte characters: 𠹷 or 𨈇

Equivalent of U+20E9D: 𠺝

Grouped by 3: 不𠹷𨈇创
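
As a quick sanity check on the test data above, this small Python sketch reports how many bytes each kind of sample needs in UTF-8. The helper name `utf8_len` and the dictionary keys are hypothetical; the characters are taken from the list above and written as escapes so the code does not depend on the file's encoding.

```python
def utf8_len(text):
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Sample characters from the testing instructions, written as escapes.
SAMPLES = {
    "german_a_umlaut": "\u00e4",       # the umlaut in "Europäer"
    "greek_epsilon": "\u0395",         # first letter of the Greek sample
    "simple_chinese": "\u4e0d",        # the "Simple Chinese" sample
    "supplementary": "\U00020E9D",     # U+20E9D, outside the Basic Multilingual Plane
}

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(f"{label}: {ch} -> {utf8_len(ch)} byte(s)")
```

Only the supplementary character needs four bytes, which is why the test puts special emphasis on those characters.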

Step 3: Create a smartsearch module in frontend.

Step 4: In the frontend, search for any Chinese character or group of characters, but especially for the 4-byte characters 𠹷 and 𨈇.

Result: OK, and highlighting is correct.
Screen Shot 2020-04-06 at 12 16 14

Test 2: Update

Step 1: Update a 3.10 installation to 4.0-dev with the patch of this PR applied, using the update package built for this PR or the corresponding custom update URL. Packages and update URL can be found here: https://ci.joomla.org/artifacts/joomla/joomla-cms/4.0-dev/28592/downloads/31062/.

Step 2: Repeat steps 2 to 4 of the previous "Test 1: New installation".

Result: Same as for "Test 1: New installation".

@richard67 richard67 marked this pull request as ready for review April 7, 2020 08:08
@richard67 richard67 changed the title from "[4.0] [WiP] Fixing smartsearch issue with some multibytes characters (alternative proposal)" to "[4.0] Fixing smartsearch issue with some multibytes characters (alternative proposal)" Apr 7, 2020
@Hackwar
Member

Hackwar commented Apr 7, 2020

I would prefer this over #28587.

@alikon
Contributor

alikon commented Apr 8, 2020

I have tested this item ✅ successfully on 0b20612

with mysql 8.0.19


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/28592.

@infograf768
Member

I have tested this item ✅ successfully on 0b20612

Will close the other one



@infograf768
Member

RTC



@joomla-cms-bot joomla-cms-bot added the RTC This Pull Request is Ready To Commit label Apr 8, 2020
@infograf768 infograf768 added this to the Joomla 4.0 milestone Apr 8, 2020
@richard67
Member Author

@alikon @infograf768 I hope you have also done "Test 2: Update".

@alikon
Contributor

alikon commented Apr 8, 2020

No, my bad, that was laziness on my part. I'll do it, sorry.

@richard67
Member Author

richard67 commented Apr 8, 2020

The update test is important to check that I haven't made a mistake in the update SQL and that the order of processing is correct, i.e. that in the end the collations are as they should be for the tables and columns handled by this PR.
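
One way to check the outcome of the update test is to compare the column collations reported by the database with the ones this PR intends. The Python sketch below assumes rows of (table, column, collation) as they could be fetched from MySQL's `information_schema.COLUMNS`; the `jos_` table prefix, the `find_mismatches` helper, and the sample rows are hypothetical, and the query is shown only as a string and is not executed.

```python
# Columns this PR switches to utf8mb4_bin, per the summary of changes.
EXPECTED_BIN_COLUMNS = {
    "finder_terms": {"term", "stem", "soundex"},
    "finder_tokens": {"term", "stem"},
    "finder_tokens_aggregates": {"term", "stem"},
    "finder_terms_common": {"term"},
}

# Query one might run to obtain the rows (placeholder style depends on the driver).
QUERY = (
    "SELECT TABLE_NAME, COLUMN_NAME, COLLATION_NAME "
    "FROM information_schema.COLUMNS "
    "WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME LIKE %s"
)


def find_mismatches(rows, prefix="jos_"):
    """Return the rows that should be utf8mb4_bin according to this PR but are not.

    Note: columns missing from `rows` entirely are not reported by this sketch.
    """
    mismatches = []
    for table, column, collation in rows:
        if not table.startswith(prefix):
            continue
        short_name = table[len(prefix):]
        if column in EXPECTED_BIN_COLUMNS.get(short_name, set()):
            if collation != "utf8mb4_bin":
                mismatches.append((table, column, collation))
    return mismatches


if __name__ == "__main__":
    # Made-up rows for illustration; real rows would come from the query above.
    sample_rows = [
        ("jos_finder_terms", "term", "utf8mb4_bin"),
        ("jos_finder_terms", "stem", "utf8mb4_unicode_ci"),
        ("jos_finder_terms", "language", "utf8mb4_unicode_ci"),
    ]
    print(find_mismatches(sample_rows))
```

An empty result after both a clean installation and an update would suggest that the install and update SQL scripts end in the same state for these columns.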

@richard67
Member Author

@alikon @infograf768 Please wait with the update test. I've just updated to the latest 4.0-dev, so a new update package will be built by Drone.

@richard67
Member Author

A new update package and custom URL have been built by Drone. I've updated the link in the testing instructions.

@alikon Ready for the update test now.

@infograf768 If you have already done both tests, installation and update, please just mark your test result again. Otherwise, if you haven't done the update test, could you do that test, too? It is important to check that there is no mistake in the update SQL.

@alikon
Contributor

alikon commented Apr 8, 2020

Screenshot from 2020-04-08 12-19-43
Test done; I can confirm that updating from 3.10 to 4 works.

@alikon
Contributor

alikon commented Apr 8, 2020

Screenshot from 2020-04-08 12-32-27

@alikon
Contributor

alikon commented Apr 8, 2020

I have tested this item ✅ successfully on 938edd1

This time I tested the update too.



@humblehumanbeing

Not sure if related, but Joomla 3.x table fields such as images in #__content, and params in #__modules and #__menu, are converting Unicode characters to 6 bytes. They appear as Unicode in the backend, but you see multi-byte in the DB.

@infograf768
Member

infograf768 commented Apr 9, 2020

@humblehumanbeing

> Not sure if related, but Joomla 3.x table fields such as images in #__content, and params in #__modules and #__menu, are converting Unicode characters to 6 bytes. They appear as Unicode in the backend, but you see multi-byte in the DB.

It looks like you are confusing bytes and bits. UTF-8 uses at most 4 bytes per character.
Please give some examples of what you mean.
For params, getting the format \u....\u... is totally normal and is unrelated to the issue here.

@humblehumanbeing

Sorry if it is the case.
In the backend, create an article and enter 'ş İ Ğ Ö Ç' as the full text image alt text.
Check the images field via phpMyAdmin.
It is encoded as '\u015f \u0130 \u011e \u00d6 \u00c7'.
The same happens in module and menu params.
This makes DB manipulation with database software, find and replace for example, nearly impossible.

@infograf768
Member

These values are JSON-encoded, which is not related to the Finder issue here.
See https://www.php.net/manual/fr/function.json-encode.php
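
The escaping described above is easy to reproduce. The sketch below uses Python's `json` module as a stand-in for PHP's `json_encode` (an assumption made purely for illustration, since Joomla itself runs on PHP), with the sample string from the earlier comment. By default, non-ASCII characters are written as `\uXXXX` escapes, and decoding restores the original text.

```python
import json

text = "ş İ Ğ Ö Ç"  # sample string from the comment above

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(text)

# Alternative: keep the characters as-is in the encoded output.
unescaped = json.dumps(text, ensure_ascii=False)

if __name__ == "__main__":
    print(escaped)
    print(unescaped)
    print(json.loads(escaped) == text)  # decoding gives back the original string
```

In PHP, the comparable option is the `JSON_UNESCAPED_UNICODE` flag of `json_encode`. Either way the stored value decodes to the same string, so the escapes are a storage detail rather than data corruption.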

@humblehumanbeing

Thanks for the clarification.

@wilsonge wilsonge merged commit 4617ce9 into joomla:4.0-dev Apr 9, 2020
@wilsonge
Contributor

wilsonge commented Apr 9, 2020

Thanks!

@joomla-cms-bot joomla-cms-bot removed the RTC This Pull Request is Ready To Commit label Apr 9, 2020
@richard67 richard67 deleted the 4.0-dev-smart-search-binary-collation-1 branch April 11, 2020 19:53