Support characters in the Unicode Astral Plane #7030

Closed
ranek opened this issue Jan 31, 2014 · 14 comments

Comments

@ranek

ranek commented Jan 31, 2014

Expected behaviour

Users can use all Unicode characters when naming files, entering calendar appointments, or saving contacts.

Actual behaviour

Using characters outside the Basic Multilingual Plane causes severe problems on both the web interface and through WebDAV-based sync services.

Steps to reproduce

  1. Name a file, contact, or event with an astral character (emoji are an easy choice).
  2. Try to sync the file with a desktop app or reload the web interface.
  3. Notice the entry will not sync or does not appear.

Server configuration

Operating system: Ubuntu 12.04.4 LTS

Web server: Apache/2.2.22

Database: MySQL 5.5.35-0ubuntu0.12.04.2

PHP version: 5.3.10-1ubuntu3.9

ownCloud version: ownCloud 6.0.1 (stable)

Updated from an older ownCloud or fresh install: updated from 6.0.0

Client configuration

Browser: Safari 7.0.1

Operating system: Mac OS X 10.9.1

Logs

ownCloud log (data/owncloud.log)

Example with contact card:

OCA\Contacts\Contact::retrieve Error parsing carddata for: 907 Invalid VObject. Document ended prematurely.

Related Issues

@ranek ranek closed this as completed Jan 31, 2014
@ranek ranek reopened this Jan 31, 2014
@ranek
Author

ranek commented Jan 31, 2014

Upon further investigation, this appears to be a result of ownCloud using the utf8 charset with its MySQL backend, which only supports characters up to three bytes long. Perhaps switching to utf8mb4 would be sufficient to fix this and the related issues? Right now the database truncates entries at the first 4-byte character, leading to invalid objects being stored that cause problems when loaded back into ownCloud.
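The truncation described above can be illustrated with a minimal sketch. It only models the reported behaviour (the value is cut at the first 4-byte character); `simulate_3byte_truncation` and the sample vCard are hypothetical and not taken from ownCloud or MySQL.

```python
def first_astral_index(text: str) -> int:
    """Return the index of the first code point above U+FFFF, or -1 if none."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:  # outside the Basic Multilingual Plane
            return i
    return -1


def simulate_3byte_truncation(text: str) -> str:
    """Hypothetical model of the reported behaviour: keep only the text
    that precedes the first character needing four bytes in UTF-8."""
    i = first_astral_index(text)
    return text if i == -1 else text[:i]


# Illustrative card data (an assumption, not a real record from the issue).
card = "BEGIN:VCARD\nFN:Party \U0001F389 Planner\nEND:VCARD"
stored = simulate_3byte_truncation(card)

print(len("\U0001F389".encode("utf-8")))  # bytes needed for the emoji
print(repr(stored))                       # the END:VCARD line is gone
```

In this model the stored card loses its closing `END:VCARD` line, which is consistent with the "Document ended prematurely" parser error quoted in the log.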

@PVince81
Contributor

PVince81 commented Feb 3, 2014

If we do, we'd need to make sure this also works with other databases.

CC @bantu @icewind1991 @DeepDiver1975

@bantu

bantu commented Feb 3, 2014

This is a MySQL only problem. Everyone else does this properly. utf8mb4 requires MySQL 5.5.3. Also see https://area51.phpbb.com/phpBB/viewtopic.php?f=108&t=44807

@DeepDiver1975
Member

Moving to utf8mb4 has an impact on the index length as @bantu pointed out https://area51.phpbb.com/phpBB/viewtopic.php?f=108&t=44807#p258271

We will really run into issues here from a conceptual point of view as our indexes are 'optimized' to fit into 3x255

Would it make sense to extend your db schema xml to give us the possibility to choose utf8mb4?

@karlitschek

@karlitschek
Contributor

We have to check the compatibility with older and other databases. And we also have to consider the increased space requirements and the decreased speed that this would mean. I think a valid option would be to just not support characters like that.
Other opinions?

@bantu

bantu commented Feb 3, 2014

> We have to check the compatibility with older and other databases.

As I said, this is a MySQL only issue. All other DBMSes support 4 byte utf8 characters just fine.

> And we also have to consider the increased space requirements and the decreased speed that this would mean.

There are basically no additional space requirements: both utf8 and utf8mb4 use a variable number of 8-bit blocks (this is what the 8 stands for). The difference is that utf8 only supports up to three bytes per character, while utf8mb4 supports up to four bytes (which, per RFC 3629, covers basically all UTF-8 characters). Four bytes per character will only be used when required.
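As a rough illustration of the variable-width point, the sketch below prints how many bytes a few sample characters take in UTF-8. The sample characters and the helper name `utf8_width` are chosen here for illustration only.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters (assumptions for illustration), one per width class.
samples = [
    ("a", "U+0061, ASCII"),
    ("\u00e9", "U+00E9, Latin small e with acute"),
    ("\u20ac", "U+20AC, euro sign, still in the BMP"),
    ("\U0001F600", "U+1F600, emoji in an astral plane"),
]

for ch, label in samples:
    print(f"{label}: {utf8_width(ch)} byte(s)")

# Text without astral characters has the same encoded size either way:
# the bytes needed depend on the characters, not on the declared maximum.
text = "Gr\u00fc\u00dfe aus K\u00f6ln"
print(len(text), "characters,", len(text.encode("utf-8")), "bytes")
```

Only the last sample needs four bytes; text that stays within the BMP encodes to the same byte sequence whether the column allows three or four bytes per character.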

> I think a valid option would be to just not support characters like that.

The only good way of doing this (in terms of complexity and required work) in my opinion is to just switch from utf8 to utf8mb4 and require MySQL 5.5.3.

The only remaining concern is key/index size considerations.
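The index-size concern comes down to simple arithmetic. The sketch below assumes the classic InnoDB index key limit of 767 bytes; the actual limit depends on the MySQL version, row format, and settings, so the constant should be read as an assumption rather than a statement about any particular server.

```python
# Assumption: classic InnoDB per-column index key limit in bytes.
INNODB_KEY_LIMIT_BYTES = 767


def index_bytes(chars: int, max_bytes_per_char: int) -> int:
    """Worst-case index size for a column of `chars` characters."""
    return chars * max_bytes_per_char


def max_indexable_chars(max_bytes_per_char: int,
                        limit: int = INNODB_KEY_LIMIT_BYTES) -> int:
    """Longest column (in characters) that still fits the key limit."""
    return limit // max_bytes_per_char


for charset, width in (("utf8", 3), ("utf8mb4", 4)):
    size = index_bytes(255, width)
    fits = size <= INNODB_KEY_LIMIT_BYTES
    print(f"{charset}: 255 chars -> {size} bytes, fits: {fits}, "
          f"max indexable chars: {max_indexable_chars(width)}")
```

Under that assumption, an indexed 255-character column fits with three bytes per character (the "3x255" mentioned earlier in the thread) but not with four, where the ceiling drops to 191 characters.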

@whitehairtong

Not sure if this may be related, but I tested the mapper.php file in ownCloud 6.0.2 (\lib\private\files).

There is a private function slugify($text).

It seems to change the file name stored in physic_path in the MySQL table oc_file_map in such a way that the Unicode part of the file name is removed.

For example, (unicodename).pdf becomes -.pdf in the physic_path stored in the database.
At the same time, logic_path in the database stores the correct path with Unicode.

In the MySQL oc_file_map table:
logic_path
J:\datafolder(unicode).pdf
physic_path
J:\datafolder-.pdf

I think this can be one of the reasons why Unicode files have problems. Consider two Unicode file names:

(unicode1).pdf
(unicode2).pdf
While logic_path is correct:

j:\datapath(unicode1).pdf
j:\datapath(unicode2).pdf

under the current arrangement, both files will be stored in physic_path as
j:\datapath-.pdf
j:\datapath-.pdf
and then you have a problem when you open or download the file through the web interface...

I tested that if the private function slugify($text) simply returns $text,
then the correct Unicode file path is stored in both logic_path and physic_path in the MySQL table oc_file_map.

But there may still be other functions that need to be modified so that the correct Unicode file name is also stored in the data folder (keeping the Unicode file name).

Andrew
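The collision described in this comment can be reproduced with a small stand-in. `slugify_sketch` below is a hypothetical function written for illustration, not the `slugify` from mapper.php; it only assumes that every run of non-ASCII-alphanumeric characters in the file stem is replaced by a single `-`, which matches the `-.pdf` result reported above.

```python
import os
import re


def slugify_sketch(name: str) -> str:
    """Hypothetical stand-in (NOT ownCloud's slugify): replace each run of
    characters other than ASCII letters and digits in the stem with '-'."""
    stem, ext = os.path.splitext(name)
    return re.sub(r"[^A-Za-z0-9]+", "-", stem) + ext


# Two distinct logical names (sample values, assumptions for illustration).
logical_names = ["\u6587\u4ef6\u4e00.pdf", "\u6587\u4ef6\u4e8c.pdf"]

# Group logical names by the physical name they would be mapped to.
physical_to_logical = {}
for logical in logical_names:
    physical_to_logical.setdefault(slugify_sketch(logical), []).append(logical)

for physical, logicals in physical_to_logical.items():
    status = "COLLISION" if len(logicals) > 1 else "ok"
    print(physical, "<-", logicals, status)
```

Both logical names map to the same physical name, so one physical path ends up standing for two different files, which is the situation the comment describes.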

@bantu bantu added this to the ownCloud 8 milestone Jun 14, 2014
@butonic butonic mentioned this issue Jun 16, 2014
2 tasks
@DeepDiver1975 DeepDiver1975 modified the milestones: 8.1-next, ownCloud 8 Jan 9, 2015
@oparoz
Contributor

oparoz commented Feb 2, 2015

Please don't push this back too much. Without it, no notes, calendar, or text app can be used in a professional environment, as it's too unreliable. Any field using an emoji will be saved as empty text.

@thrdroom

thrdroom commented Feb 2, 2015

I'm totally, 100% with what @oparoz said! I'm not that deep into coding; I'm more of a frontend developer. As I see it, this problem does not seem very hard to solve, but it is a big problem and a long-time showstopper that keeps me from suggesting ownCloud to others. I don't understand why it has taken so long to get this working, because other projects that work with sabredav and MySQL got it working a long time ago. One example is http://baikal-server.com/.

@PVince81
Contributor

PVince81 commented Feb 2, 2015

Please be aware that there were a lot of other more important bugs that needed to be fixed and that resources/time are limited, which is why this issue here hasn't been fixed yet.

As this is an open-source project, you and others are free to look into the issue too and submit a proposal / pull request that fixes it. Even documentation/research details about how to fix can be useful and save some time.

Thanks for your understanding.

@DeepDiver1975
Member

To ensure greater consistency, with oc8.1 we properly detect astral plane characters and throw exceptions to the clients (browser, desktop and mobile). Full support for astral plane characters is moved to the backlog with respect to files.

With respect to contacts and calendar we need to find a way to make things work on mysql - urlencoding is still my idea to fight this issue ...
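The two approaches mentioned in this comment, rejecting astral plane characters up front and URL-encoding values before storage, could look roughly like the sketch below. All names (`AstralCharacterError`, `reject_astral`, `encode_for_storage`, `decode_from_storage`) are hypothetical and do not reflect how ownCloud implements either idea.

```python
from urllib.parse import quote, unquote


class AstralCharacterError(ValueError):
    """Hypothetical error raised when a value contains an astral character."""


def reject_astral(text: str) -> str:
    """Return the text unchanged, or raise if any code point is above U+FFFF."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise AstralCharacterError(
                f"unsupported character U+{ord(ch):04X}")
    return text


def encode_for_storage(text: str) -> str:
    """Percent-encode the text so the stored value is pure ASCII."""
    return quote(text, safe="")


def decode_from_storage(stored: str) -> str:
    """Reverse encode_for_storage."""
    return unquote(stored)


title = "Lunch \U0001F355"  # sample value, an assumption for illustration

try:
    reject_astral(title)
except AstralCharacterError as err:
    print("rejected:", err)

stored = encode_for_storage(title)
print(stored.isascii(), decode_from_storage(stored) == title)
```

Percent-encoding keeps the stored value within ASCII, so a 3-byte charset never sees a 4-byte character, at the cost of longer values and of searching or sorting on encoded text.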

@PVince81
Contributor

PVince81 commented Sep 7, 2017

@DeepDiver1975 does mb4 support cover this? If yes, please close.

@PVince81
Contributor

Please try again with 10.0.4 which supports emojis when MySQL is configured properly for mb4 support. If emojis work but not astral plane chars, please reopen.

@lock

lock bot commented Aug 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 2, 2019

8 participants