ownCloud File Mapper Not Encoding UTF-8 Properly #12112

Closed
LeeThompson opened this issue Nov 11, 2014 · 11 comments

Comments

@LeeThompson

Preface

I've been wrestling with an issue with ownCloud on a new install; my initial report and research are also at http://forum.owncloud.org/viewtopic.php?f=29&t=24688

I have been able to isolate exactly what the actual issue is and even what ownCloud needs to do to resolve it. I do plan on working on a fix and submitting it, but that may take a while since I'm new to ownCloud and not familiar with its codebase.

This may be related to #4513

Steps to reproduce

  • Have owncloud browse to a folder with international characters in a filename via WebUI
  • Have owncloud sync with a folder with international characters in a filename

Sample:
"é or ç.txt"

Expected behaviour

It should be able to handle the file or folder.

Actual behaviour

  • WebUI will return an unknown error
  • Insert to file_map will fail with "incorrect string value"
  • Sync application will fail.

Solution?
I was able to reproduce the issue with a test script and table (mimicking the file_map table)
and to fix it by encoding the logic and physic paths with utf8_encode. That is
probably not the optimal solution, but it did insert the row properly.

ownCloud's INSERT

INSERT INTO `oc_file_map` (`logic_path`, `physic_path`, `logic_path_hash`, `physic_path_hash`) VALUES ('é or ç.txt', 'é or ç.txt', 'TEST', 'TEST');

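correct INSERT (the same statement, but with both path values encoded as UTF-8, so é and ç are sent as \xC3\xA9 and \xC3\xA7 instead of the single bytes \xE9 and \xE7)

INSERT INTO `oc_file_map` (`logic_path`, `physic_path`, `logic_path_hash`, `physic_path_hash`) VALUES ('é or ç.txt', 'é or ç.txt', 'TEST', 'TEST');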

This can be done with utf8_encode("é or ç.txt") but that's not an optimal solution for many reasons.
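
To make the byte-level difference concrete, here is a minimal sketch, assuming the path arrives as the single-byte string seen in the error log and that the mbstring extension is loaded. It uses mb_convert_encoding with an ISO-8859-1 source, which performs the same conversion as utf8_encode. The third step illustrates one possible reason the approach is fragile (this is my assumption about the "many reasons", not something stated above): running the conversion on an already converted string corrupts it.

```php
<?php
// Bytes as they appear in the error log: single-byte \xE9 (é) and \xE7 (ç).
$raw = "\xE9 or \xE7.txt";

// Same conversion as utf8_encode($raw), which always assumes ISO-8859-1 input.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Converting a second time treats the UTF-8 bytes as ISO-8859-1 again
// and produces a different, double-encoded string.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($raw), "\n";
echo bin2hex($utf8), "\n";
echo bin2hex($twice), "\n";
```

The hex dumps show é going from e9 to c3a9 after one conversion, and to a longer sequence after two, so any real fix would need to know whether a path has already been converted.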

Server configuration

Operating system:
Windows Server 2008 R2

Web server:
Apache 2.4.10

Database:
MariaDB 10.0.14 (MySQL compatible)

PHP version:
5.6.2

ownCloud version:
7.0.2 (community edition)

Updated from an older ownCloud or fresh install:
fresh

List of activated apps:
default applications only
deleted files/Versioning are disabled

config.php
data/config.php with sensitive information redacted.

$CONFIG = array (
  'instanceid' => 'oc76783e0396',
  'passwordsalt' => REDACTED,
  'datadirectory' => 'E:\\ownCloud',
  'dbtype' => 'mysql',
  'version' => '7.0.2.1',
  'dbname' => 'owncloud',
  'dbhost' => 'localhost',
  'dbtableprefix' => 'oc_',
  'dbuser' => 'oc_admin',
  'dbpassword' => REDACTED,
  'installed' => true,
  'skeletondirectory' => '',
  'logtimezone' => 'America/Los_Angeles',
  'loglevel' => 2,
  'logfile' => 'C:\\Data\\Logs\\ownCloud\\owncloud.log',
  'log_rotate_size' => 209715200,
  'log_query' => false,
  'maintenance' => false,
  'blacklisted_files' => 
  array (
    0 => '.htaccess',
    1 => '*.lnk',
    2 => '.DS_Store',
    3 => 'thumbs.db',
    4 => 'desktop.ini',
    5 => '*.url',
    6 => '\$RECYCLE.BIN',
    7 => '~\$*',
    8 => 'hiberfil.sys',
    9 => 'pagefile.sys',
  ),
  'trusted_domains' => 
  array (
    0 => 'localhost',
    1 => REDACTED,
    2 => REDACTED,
    3 => REDACTED,
    4 => REDACTED,
    5 => REDACTED,
  ),
  'filesystem_check_changes' => 2,
  'forcessl' => true,
  'mail_smtpmode' => 'smtp',
  'mail_smtphost' => REDACTED,
  'mail_from_address' => 'noreply-owncloud',
  'mail_domain' => REDACTED,
);

External Storage:

local
smb

Encryption:

no

Client configuration

not a client issue

Logs

not a web server issue

ownCloud log (data/owncloud.log)

{"app":"index","message":"Doctrine\\DBAL\\DBALException: An exception occurred while executing 'INSERT INTO `oc_file_map` (`logic_path`, `physic_path`, `logic_path_hash`, `physic_path_hash`)\n\t\t\t\tVALUES (?, ?, ?, ?)':\n\nSQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\\xE9 or \\xE7...' for column 'logic_path' at row 1","level":4,"time":"2014-11-10T16:43:43-08:00"}
@PVince81
Contributor

@nickvergessen

@nickvergessen
Contributor

The message from the error log seems to show a different issue:
SQLSTATE[22007]: Invalid datetime format

@LeeThompson
Author

@nickvergessen, actually no, the pdo driver adds the invalid datetime format for some reason (the mysqli driver does not). The rest of the error message is the same in both drivers, however. "Incorrect string value: '\xE9 or \xE7...' for column 'logic_path' at row 1".

(The invalid datetime format confuses the hell out of me too, for the record.)

@LeeThompson
Author

I made two test scripts in php inserting a row into a copy of the oc_file_map table, one uses pdo_mysql and the other uses mysqli:

pdo_mysql

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xE9 or \xE7...' for column 'logic_path' at row 1

mysqli

Incorrect string value: '\xE9 or \xE7...' for column 'logic_path' at row 1

In both cases, if the script is changed so logic_path and physic_path are run through utf8_encode, the insert is fine and no errors are thrown.

$logic_path = "E:/ownCloud/testuser/files/MyFiles/_test/é or ç.txt";
$physic_path = "E:/ownCloud/testuser/files/MyFiles/_test/é or ç.txt";
$logic_path = utf8_encode($logic_path);
$physic_path = utf8_encode($physic_path);

utf8_encode changes the bytes of the string: the single-byte é (\xE9) and ç (\xE7) in "é or ç.txt" become the two-byte UTF-8 sequences \xC3\xA9 and \xC3\xA7.

success

And the row is inserted (appears twice, one from pdo_test one from mysqli_test).

@LeeThompson
Author

I think I've found a solution. Please note, I have so far only run this test script on Windows.

This test script scans a subdirectory called "files" and converts encoding, if needed.

Files
The contents of ./files for this run are:

AnotherFile.txt
é or ç.txt
My File.txt

Script

<?php

$folder = './files';
$files = scandir($folder);
$log_file = "./test.log";

$target_encoding = "UTF-8";
$default_codepage = "UTF-8";

if ( 'WIN' == substr( PHP_OS, 0, 3 ) ) {
    $codepage = 'Windows-' . trim( strstr( setlocale( LC_CTYPE, "" ), '.' ), '.' );
} else {
    $codepage = $default_codepage;
}

echo "Using Codepage: $codepage\n";
echo "Target Encoding: $target_encoding\n";
echo "\n";

file_put_contents($log_file, "");

foreach($files as $filename){
    if ($filename === '.' || $filename === '..') { continue; }
    $encoded_filename = mb_convert_encoding( $filename, $target_encoding, $codepage );
    $output = "$filename = $encoded_filename";
    echo "$output\n";
    file_put_contents($log_file, "$output\n", FILE_APPEND | LOCK_EX);
}
?>
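
The codepage lookup in the script can be sketched as a small function, which also covers the case where the locale string has no ".NNNN" suffix (for a locale such as "C", the strstr/trim expression above would yield the incomplete name "Windows-"). The function name codepage_from_locale is hypothetical, and mbstring may not recognise every Windows-NNNN name this produces, so treat it as a sketch under those assumptions.

```php
<?php
// Hypothetical helper: derive an mbstring encoding name from a locale string
// such as "English_United States.1252". Falls back to $default when the
// locale carries no numeric codepage suffix (for example "C").
function codepage_from_locale($locale, $default = 'UTF-8')
{
    if (is_string($locale) && preg_match('/\.(\d+)$/', $locale, $m)) {
        return 'Windows-' . $m[1];
    }
    return $default;
}

// Same platform check as the script above; setlocale(LC_CTYPE, "") selects
// the system default locale and returns its name (or false on failure).
$isWindows = strtoupper(substr(PHP_OS, 0, 3)) === 'WIN';
$codepage = $isWindows
    ? codepage_from_locale(setlocale(LC_CTYPE, ''))
    : 'UTF-8';

echo "Using Codepage: $codepage\n";
```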

Comments

Unfortunately, mb_detect_encoding($filename, "auto") was not sufficient: in this test it returned only ASCII or UTF-8, including for the filename that was not actually UTF-8.
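
As a possible alternative to detection, a validity check can at least decide whether a name needs converting at all. The sketch below assumes mbstring is available; ensure_utf8 is a hypothetical helper, and the source codepage still has to be supplied by the caller because a validity check cannot tell one single-byte codepage from another.

```php
<?php
// Hypothetical guard: convert only when the bytes are not already valid UTF-8.
// $assumedSource is the codepage the caller believes the file system uses.
function ensure_utf8($name, $assumedSource)
{
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name; // plain ASCII and well-formed UTF-8 pass through unchanged
    }
    return mb_convert_encoding($name, 'UTF-8', $assumedSource);
}

echo bin2hex(ensure_utf8("\xE9 or \xE7.txt", 'Windows-1252')), "\n";
echo ensure_utf8('My File.txt', 'Windows-1252'), "\n";
```

This is a heuristic rather than a guarantee: a Windows-1252 name whose bytes happen to form well-formed UTF-8 would be passed through untouched.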

Output

AnotherFile.txt = AnotherFile.txt
My File.txt = My File.txt
é or ç.txt = é or ç.txt

Conclusion

I believe if the file scanner could simply do the steps as shown above, it should be able to handle this much better.

Some issues come to mind however:

  • When syncing, if the client codepage is different from the server's it may not be correctly encoded. (Perhaps the client could do this when uploading the file or at least report its codepage so it can be encoded properly at the server.)
  • If external storage locations are using different codepages, it would not be detected properly (I'm not sure if Windows would translate it or not.) Perhaps just add a codepage dropdown to setting up external storages.
  • It might also be a good idea to have a config option to force the file system codepage in case of something weird.
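
The last two ideas could look roughly like the sketch below. Every name in it is an assumption made for illustration: the 'filesystem_codepage' config key, the per-storage 'codepage' key, and both functions are hypothetical and are not existing ownCloud options or APIs.

```php
<?php
// Hypothetical lookup order: per-storage setting, then a forced
// instance-wide setting, then UTF-8 as the default.
function resolve_codepage(array $config, array $storage = array())
{
    if (!empty($storage['codepage'])) {
        return $storage['codepage'];            // chosen when the storage was set up
    }
    if (!empty($config['filesystem_codepage'])) {
        return $config['filesystem_codepage'];  // forced value from config.php
    }
    return 'UTF-8';
}

// Convert a name read from a storage into UTF-8 before it is stored or hashed.
function storage_name_to_utf8($name, array $config, array $storage = array())
{
    $source = resolve_codepage($config, $storage);
    return $source === 'UTF-8'
        ? $name
        : mb_convert_encoding($name, 'UTF-8', $source);
}

$config = array('filesystem_codepage' => 'Windows-1252');
echo bin2hex(storage_name_to_utf8("\xE9 or \xE7.txt", $config)), "\n";
```

Keeping the decision in one lookup function means the scanner never has to guess; it only applies whatever the administrator configured.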

@LeeThompson
Author

Also a note: when creating the SQL tables, if you want true UTF-8 on MySQL 5.5.3+ (and compatible forks) you need to use utf8mb4 with the utf8mb4_bin or utf8mb4_unicode_ci collation.
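
For reference, a rough sketch of what requesting utf8mb4 might look like from PHP. The two helper functions are hypothetical; the host, database, and table names are taken from the configuration posted above. The charset parameter in the pdo_mysql DSN and the ALTER TABLE ... CONVERT TO CHARACTER SET statement are standard, but converting indexed columns can run into index length limits on some server versions, so this should be tried on a copy first. It also does not by itself fix the error in this issue, since the server still rejects bytes that are not valid UTF-8.

```php
<?php
// Hypothetical helper: build a pdo_mysql DSN that requests utf8mb4
// as the connection character set.
function mysql_dsn($host, $dbname, $charset = 'utf8mb4')
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical helper: build the statement that converts an existing table.
function convert_table_sql($table, $collation = 'utf8mb4_bin')
{
    // Backticks inside the identifier are doubled so it stays one quoted name.
    $quoted = '`' . str_replace('`', '``', $table) . '`';
    return "ALTER TABLE $quoted CONVERT TO CHARACTER SET utf8mb4 COLLATE $collation";
}

echo mysql_dsn('localhost', 'owncloud'), "\n";
echo convert_table_sql('oc_file_map'), "\n";
```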

@LeeThompson
Author

Been researching this and it appears that the best solution is to use the Windows-1252 codepage when storing and referencing pathnames.

Implementation-wise, it would probably be best to give storages the ability to specify which codepage to use on an individual basis, since shares may be using different file systems (not extremely likely, but possible), with a configurable default of UTF-8.

Unfortunately PHP's overall handling of this leaves a lot to be desired (its own UTF-8 handling is... iffy... at best), so configuration options are the best course of action.

I submitted a proposal to work on this on the dev mailing list but no one seems to reply (?) so I guess they don't want me to work on it.

@PVince81
Contributor

@nickvergessen is already working on fixing encoding issues / fixing the file mapper

@PVince81
Contributor

Not sure what the current state is, you might want to check on master.

@sdesvergez

Everything seems to be fixed purely on the MariaDB/MySQL server side once the type of the logic_path and physic_path columns is changed to BLOB. All INSERT queries then work as expected.
However, from the WebGUI or the sync client the SELECT query does not work and the current directory looks empty. Queries on BLOB values should work with the built-in function CONVERT(column USING utf8), for example: select convert(logic_path using utf8) from oc_file_map limit 0,10
But I really don't know where it could be used.

@DeepDiver1975 DeepDiver1975 modified the milestone: backlog Mar 2, 2015
@LeeThompson
Author

Just for the heck of it, I tested this stuff under Debian Unix without Windows involved at all. UTF-8 is still broken.

I've given up on owncloud.

@MorrisJobke MorrisJobke removed this from the backlog milestone Jun 10, 2015
@lock lock bot locked as resolved and limited conversation to collaborators Aug 11, 2019