Unicode math italics/bold truncates text notes #7663

Closed
fgnievinski opened this issue Jan 31, 2022 · 10 comments

@fgnievinski
Contributor

fgnievinski commented Jan 31, 2022

A reviewer copied content from an article (typeset externally), resulting in formatted text such as "𝑚𝑒𝑡𝑒𝑟𝑠" (contrast: meters and meters).

When such text is pasted into a TinyMCE box, it's displayed correctly, but saving the text is not possible:
image

I believe these Latin italics characters belong to the Unicode Mathematical Alphanumeric Symbols block.
Maybe the database doesn't support storing such extended Unicode characters?
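
To check the guess about the Unicode block, here is a minimal Python sketch that lists the code point and Unicode name of each pasted character. The block range U+1D400 to U+1D7FF comes from the Unicode standard; the helper name `describe` is made up for illustration.

```python
import unicodedata

# Range of the Mathematical Alphanumeric Symbols block in the Unicode standard.
MATH_ALNUM_START, MATH_ALNUM_END = 0x1D400, 0x1D7FF


def describe(text):
    """Return (char, code point, Unicode name, in-math-block?) per character."""
    rows = []
    for ch in text:
        cp = ord(ch)
        in_block = MATH_ALNUM_START <= cp <= MATH_ALNUM_END
        rows.append((ch, f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>"), in_block))
    return rows


# The italic-looking text from the report, followed by plain ASCII for contrast.
for row in describe("𝑚𝑒𝑡𝑒𝑟𝑠"):
    print(row)
for row in describe("meters"):
    print(row)
```

Every character in this block lies above U+FFFF, that is, outside the Basic Multilingual Plane.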

Slightly related issue: #2564

@asmecher
Member

@fgnievinski, OJS should support the full Unicode set. Can you please include the following information:

  • What version of OJS are you using?
  • Does anything appear in the PHP error log from the time the request was made to save the form?

@fgnievinski
Contributor Author

I was using OJS 3.3.0.8 in an installation where I don't have access to the logs (it's a hassle to get IT to do anything, sorry).

But I've just tried the public demo/test-drive installation (also v3.3.0.8); maybe the logs could be retrieved more easily there?

Pasting "A𝐵C" (with a math italic character only in the middle) results in a string truncated at the offending character, so only "A" is stored.

  • in a non-TinyMCE field: Workflow > Submission > Submission Files > 🠞🠟 > More Information > Notes > Add Note
    image

  • in a TinyMCE field: Workflow > Review > Round 1 > Reviewers > 🠞🠟 > Editorial Notes > OK
    image

I tried to replicate this on my local installation, where I'm still unable to save at all. The only difference seemed to be the primary language, so I changed the primary language in the test drive, but I could still save notes there, with the truncation persisting.

So the reproducible problem is: text gets truncated after math italics characters.
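
The observed behavior can be mimicked with a small Python sketch. It assumes (this is only a hypothesis at this point in the thread) that the storage layer keeps everything up to the first character that needs more than 3 bytes in UTF-8 and silently drops the rest. The function name `truncate_like_utf8mb3` is made up for illustration and is not part of OJS or MySQL.

```python
def truncate_like_utf8mb3(text):
    """Cut text at the first character needing more than 3 bytes in UTF-8.

    Simulation only: it imitates a 3-byte-limited storage charset that
    drops the value from the first unsupported character onward.
    """
    for i, ch in enumerate(text):
        if len(ch.encode("utf-8")) > 3:
            return text[:i]
    return text


print(truncate_like_utf8mb3("A𝐵C"))  # prints: A
```

Running a check like this on submitted text before saving would at least make the data loss visible instead of silent.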

@jonasraoni
Contributor

As a note... I've had some encoding issues when using the utf8 charset in MySQL, but they were gone after I updated the database to use the newer utf8mb4.
If you're going to try, ensure the database, tables and also the columns are all using the same charset and collation.
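
As a rough sketch of what such a conversion might look like, the Python helper below only builds the MySQL statements as strings; it does not connect to anything. The database and table names are hypothetical, the collation default is just one possible choice, and the identifier quoting is naive. Back up the database before running statements like these.

```python
def conversion_statements(database, tables,
                          charset="utf8mb4", collation="utf8mb4_general_ci"):
    """Build MySQL statements that switch a database and its tables to charset.

    ALTER TABLE ... CONVERT TO CHARACTER SET also converts the table's
    character columns, which keeps database, tables and columns consistent.
    """
    stmts = [
        f"ALTER DATABASE `{database}` CHARACTER SET {charset} COLLATE {collation};"
    ]
    for table in tables:
        stmts.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
    return stmts


# "ojs" and the table names below are placeholders, not a real schema listing.
for stmt in conversion_statements("ojs", ["notes", "review_assignments"]):
    print(stmt)
```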

@fgnievinski
Contributor Author

> As a note... I've had some encoding issues when using the utf8 charset in MySQL, but they were gone after I updated the database to use the newer utf8mb4. If you're going to try, ensure the database, tables and also the columns are all using the same charset and collation.

Thanks for the tip. The installation was recently restored after being compromised, so there are probably some loose ends.

I'll change the issue description to focus on the reproducible problem.

@fgnievinski changed the title from "Unicode math italics/bold prevents text to be saved" to "Unicode math italics/bold truncates text notes" on Jan 31, 2022
@jonasraoni
Contributor

Anyway, every application will present some kind of weird behavior when faced with the strings in this list https://github.com/minimaxir/big-list-of-naughty-strings, so perhaps you've found one of them :)

@fgnievinski
Contributor Author

Just to tell a sad story: the reviewer submitted their review without noticing it had been truncated, so the author ended up missing half of the comments, and the reviewer was not very pleased. :-|

@NateWr
Contributor

NateWr commented Feb 1, 2022

I was able to reproduce this on the OJS3 test drive install but not locally.

Locally, I am running psql (PostgreSQL) 12.9 (Ubuntu 12.9-0ubuntu0.20.04.1) and this is the character encoding:

ojs_330=# SHOW SERVER_ENCODING;
 server_encoding 
-----------------
 UTF8

It seems this is related to the deployment in some way. If this requires a change to the default or recommended database configuration, can someone propose a change?

@jonasraoni
Contributor

jonasraoni commented Feb 1, 2022

@NateWr I've just checked in my environment, and my previous comment is enough to address the issue. I just forgot to add that the charset must also be configured in config.inc.php:

[i18n]
connection_charset = utf8mb4

[database]
collation = utf8mb4_general_ci

As I've already created an issue to address this configuration, I'll close this one.
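
For MySQL installations, the two settings above could be checked with a short Python sketch like the following. It assumes the relevant part of config.inc.php parses as plain INI; the function name `charset_warnings` and the sample text are made up for illustration, and this is not an OJS tool.

```python
import configparser


def charset_warnings(config_text):
    """Return warnings for charset settings that are not utf8mb4-based."""
    # interpolation=None keeps literal '%' characters in values from failing.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    warnings = []
    conn = parser.get("i18n", "connection_charset", fallback=None)
    if conn != "utf8mb4":
        warnings.append(f"[i18n] connection_charset is {conn!r}, expected 'utf8mb4'")
    coll = parser.get("database", "collation", fallback=None)
    if coll is None or not coll.startswith("utf8mb4"):
        warnings.append(f"[database] collation is {coll!r}, expected a utf8mb4 collation")
    return warnings


sample = """
[i18n]
connection_charset = utf8

[database]
collation = utf8_general_ci
"""
for warning in charset_warnings(sample):
    print(warning)
```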

@NateWr
Contributor

NateWr commented Feb 2, 2022

It's interesting, though, because these are my config settings and I didn't have the problem. Is it because I'm on postgres?

[i18n]

; Default locale
locale = en_US

; Client output/input character set
client_charset = utf-8

; Database connection character set
; Must be set to "Off" if not supported by the database server
; If enabled, must be the same character set as "client_charset"
; (although the actual name may differ slightly depending on the server)
connection_charset = utf8

; Database storage character set
; Must be set to "Off" if not supported by the database server
database_charset = utf8

Also, I don't have [database]collation at all in my config.inc.php. I see it in the template though. My config is probably from a long time ago...

@jonasraoni
Contributor

Yeah, it happens only in MySQL. The old utf8 charset (an alias for utf8mb3) supports characters of at most 3 bytes, but here we have characters that need 4 bytes 😁

PostgreSQL supports 4-byte UTF-8 sequences, but it will also fail if you try to insert an invalid UTF-8 sequence.
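
The 3-byte versus 4-byte distinction is easy to see in Python. The sketch below prints the UTF-8 length of a few sample characters and uses Python's own decoder as a stand-in for a validity check; it is not PostgreSQL's validator, and the helper names are made up for illustration.

```python
def utf8_length(ch):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def is_valid_utf8(data):
    """True if the byte string decodes cleanly as UTF-8 (Python's decoder)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII, Latin-1 range, rest of the BMP, and a math italic letter.
for ch in ["A", "é", "€", "𝐵"]:
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")

print(is_valid_utf8("𝐵".encode("utf-8")))  # well-formed 4-byte sequence
print(is_valid_utf8(b"\xf0\x9d\x90"))       # 4-byte sequence cut short
```

Only the last sample character needs 4 bytes, which is exactly the case the old MySQL utf8 charset cannot store.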
