Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 NFD file name on SMB storage cannot be accessed #21365

Closed
PVince81 opened this issue Dec 24, 2015 · 77 comments · Fixed by #24349
Closed

UTF-8 NFD file name on SMB storage cannot be accessed #21365

PVince81 opened this issue Dec 24, 2015 · 77 comments · Fixed by #24349

Comments

@PVince81
Copy link
Contributor

Steps

  1. Setup a SMB server on Linux
  2. Configure a share "shared"
  3. Directly inside the share's folder on disk (not going through the network), create a file "good-ümlaut.txt"
  4. Create another file "bad-ümlaut.txt"
  5. Convert the file name encoding to NFD: convmv -r -f utf8 -t utf8 --nfd --notest bad-ümlaut.txt (ref: http://laufer.tumblr.com/post/1066670961/solved-weird-problem-with-special-characters-in).
  6. Setup an external storage in OC and mount that share as "/smb"
  7. Navigate into the share in the web UI

Expected result

File appears in list

Actual result

File not visible

{"reqId":"KAz39vteBSxGCMxr1TEG","remoteAddr":"127.0.0.1","app":"OC\\Files\\Cache\\Scanner","message":"!!! Path 'encoding\/bad-u\u0308mlaut.txt' is not accessible or present !!!","level":0,"time":"2015-12-24T09:58:29+00:00","method":"GET","url":"\/owncloud\/index.php\/apps\/files\/ajax\/list.php?dir=%2Fsmb%2Fencoding&sort=name&sortdirection=asc"}

Versions

Observed in OC 8.2.2

@PVince81
Copy link
Contributor Author

Had a chat with @icewind1991 and he told me that the file names are normalized by OC.
So if the remote SMB server contains a file with a file with a "mac" umlaut, it will normalize the name and store the normalized name in the file cache. Then later when it wants to download the file, it would use the normalized name. So obviously that name doesn't exist on the SMB storage => not found.

We cannot disable normalization of file names because it will break syncing on Mac OS then.

Ideal would be if the SMB server itself would have a setting that tells it to normalize UTF-8 names...

So far the only known workaround is to use convmv recursively to fix the file names on the remote share.

@icewind1991 @DeepDiver1975 @butonic won't fix?
CC @cmonteroluque

@PVince81
Copy link
Contributor Author

Another idea would be to have ownCloud detect such files and automatically normalize the file names on the remote storage, but that would be a bit "intrusive" and it doesn't feel like it's OC's job to do that.

@PVince81
Copy link
Contributor Author

More info about NFD vs NFC:

@PVince81
Copy link
Contributor Author

Another possibly, very ugly and with possible performance impact: whenever we have a UTF-8 file name and connect to the remote SMB, try with both the NFC and NFD file name encoding. If one matches, then assume it's the same file.
But then what happens if the SMB storage has two files, one with each encoding ? 💥

@PVince81
Copy link
Contributor Author

I just tried locally and it is indeed possible to have both files with the same name and different encoding:

-rw-r--r-- 1 vincent nobody 18 Dec 24 10:51 bad-ümlaut.txt
-rw-r--r-- 1 vincent nobody 18 Dec 24 11:24 bad-ümlaut.txt
-rw-r--r-- 1 vincent nobody 32 Dec 24 10:51 good-ümlaut.txt

My font represent them slightly differently:
umlauts

@PVince81
Copy link
Contributor Author

We could also reopen the discussion about adding a new column to oc_filecache to store the real file name (maybe call it storage_name).
Let me find the old ticket about it.

@PVince81
Copy link
Contributor Author

also CC @nickvergessen who worked on this idea back then

@PVince81
Copy link
Contributor Author

Somehow can't find an exact ticket.
Found some bits here:

@PVince81
Copy link
Contributor Author

Setting to 9.0 for now to keep in view.

@PVince81 PVince81 added this to the 9.0-current milestone Dec 24, 2015
@karlitschek
Copy link
Contributor

@DeepDiver1975 FYI For the WND discussions

@PVince81
Copy link
Contributor Author

Can we add a new column in oc_filecache to store the original non-normalized file name ?
@DeepDiver1975 @karlitschek @nickvergessen

@PVince81
Copy link
Contributor Author

and @icewind1991

@nickvergessen
Copy link
Contributor

Can we add a new column in oc_filecache

technically or in terms of "should we".
With a millions of entries the migration step is going to take hours, so I'd rather use another way when possible.

@PVince81
Copy link
Contributor Author

Both.

I don't see another solution for this, except maybe having this value in a separate table, mapped by file id. If we don't store the non-normalized name there is no way to properly match it on the remote storage.

@PVince81
Copy link
Contributor Author

@icewind1991 can you post information about NFC/NFD collisions, in what direction they can happen and how often ?

@PVince81
Copy link
Contributor Author

Discussed so far: if we'd add an additional column in the cache, we'd need to query that column in the Storage layer which would cause additional SQL queries and impair performance.
So basically if ownCloud calls $storage->file_exists($someFile) with $someFile being the normalized name (NFC), the storage would have to ask the cache again to find out what the original NFD name was. This would cause additional queries.

@PVince81
Copy link
Contributor Author

@icewind1991 said it might be possible to provide an mount options to tell the storage to always manually convert all file names to NFD when accessing the storage. But this assumed that ALL files on the remote storage are using the NFD encoding, and none is using NFC.

@PVince81
Copy link
Contributor Author

I was wondering why we are normalizing the file names in the first place, the answer is that it is needed due to sync clients and also due to the fact that URLs must be encoded in NFC format. Not doing so would then break the web UI and Webdav (and sync clients since they use the Webdav URLs)

@PVince81
Copy link
Contributor Author

Regarding the proposal from #21365 (comment) with the extra column: since the extra DB query to retrieve the non-normalized name can penalize the performance, one idea is to make it optional, per storage.
So whenever someone sets up an external storage, it's disabled by default.
If NFD files are to be used, the admin can choose to enable the option in the mount options dropdown.
This would impact performance, but only in such specific cases.

It could be implemented as a storage wrapper that only gets inserted when the option is set.
The storage wrapper would retrieve the non-normalized name from the database.

@PVince81
Copy link
Contributor Author

@karlitschek @icewind1991 @DeepDiver1975 let me know what you think about #21365 (comment)
So far I think it's the best solution.

@LukasReschke
Copy link
Member

@icewind1991 Might #19825 help here a bit?

@karlitschek
Copy link
Contributor

Hmmm. Sounds interesting. But would needs testing.
@DeepDiver1975 What do you think?

@DeepDiver1975
Copy link
Member

Sounds reasonable - and this also adds testing effort - (not only but as well) on ci.

We basically need to run our smb tests twice - so 4 in total:

  • windows and samba
  • with NFD and without NFD support

@PVince81
Copy link
Contributor Author

I don't think #19825 will work.
Because when we call "file_exists($nfcName)", how can we find the nfd name to compare against ?
We'd have to do a ls on the remote SMB first to get ALL file names and then try and find the matching name in the list. => even worse performance

@PVince81
Copy link
Contributor Author

As discussed, use the storage wrapper solution 3 from #21365 (comment)

@SergioBertolinSG
Copy link
Contributor

This is happening also in google drive. Maybe all external storages?

@butonic
Copy link
Member

butonic commented Mar 31, 2016

sure ... whenever something changes the encoding outsido of oc we can run into this problem.

Maybe it would make more sense if OC and the clients ignored the UTF encoding and used it as is ... AFAICT Win/Mac/Linux understand both ... not sure where the translation happens. IIRC the client does some magic to always talk NFC to the server .... @dragotin care to shed some light into the reason?

@dragotin
Copy link
Contributor

@butonic no, we can not ignore that problem as it is not true that the OSes understand both. If it were, we would not have this problem.

The client normalizes all filenames when they were read from a native system to NFC unicode. This happens with this function: https://github.com/owncloud/client/blob/master/csync/src/std/c_string.c#L263

This very carefully engineered (don't ask where my hair color comes from ;-) function converts from win32 wide char (a problem that the server does not have as it does not run on windows) and from NFD utf8 on mac (within the macro WITH_ICONV). There is also a function that does the other way round, from NFC UTF8 to locale.

That is the exact same problem solved that you guys experience on the server now, with a little difference: The client knows that it is always NFD when he's on Mac. In the server external storage case it can be NFC ("usual" case) or NFD which makes solving it a bit more complicated.

I think solution 3) is fine for the error case that we see now: Whenever the file is accessed on the storage, first the NFC name is checked, and if that fails, NFD is tried. A problem is if the file needs to be written later: The correct file name needs to be known for that. You could, again, first try the NFC, if that fails, try NFD, however, here you would have to know if the file is new or not, new files should probably default to NFC.

I think another thing is very important: As on the client, I would demand that in ownCloud core, every filename is supposed to be in UTF8 NFC. That saves a lot of trouble in there. All file system access happens through the external storage api, and always happen with NFC filenames from the core POV.

The conversions have to happen in the external storage layer. Whenever a filename is read from the external storage (through readdir usually), the name needs to be converted to NFC immediately with a function similar to the client ones. That requires that core does not access the files directly, I hope that is doable.

For file access (stat(), write(), open() and such), core continues to use NFC based names, and the specific external storage implementation needs to know how to convert the file name to access the right file. That can either happen through trail and error (try NFC first, if fail, try NFD), which might be tricky for the cases where the file may simply not exist. The other alternative is that the ext storage module somehow remembers the encoding of the filenames in the DB, which is prolly better.

Bottom line: Keep core NFC only. Use the ext storage implementations as converter. That is what @DeepDiver1975 and me discussed ages ago already.

@danimo fyi, that fun again, did I say right?

@MTRichards
Copy link
Contributor

Estimation: 1 week

@butonic
Copy link
Member

butonic commented Apr 1, 2016

OS X uses HFS+ and by default stores files in that encoding. I wonder what happens when you mount a windows share (real windows, not samba) in OS X and store a file with umlaut on it. Will windows see the correct character? will it be stored as NFC / NFD? Is there some magic conversion implemented on windows or macos? I don't have a mac to test this, but looking at the reports from our customers it seems that the filename is stored as is (NFD) and the windows clients should see broken chars. If that is the case we can at least point in that direction and blame apple...

@MorrisJobke
Copy link
Contributor

00004418

@MorrisJobke
Copy link
Contributor

00004611

@dragotin
Copy link
Contributor

I mounted a share from my windows machine on my Mac and put some interesting files on it.
bildschirmfoto 2016-04-14 um 16 41 54

@dragotin
Copy link
Contributor

And that ends up on windows like this:
glyphs

So there is no blaming. Lets find out how the filenams are encoded now on windows.

@dragotin
Copy link
Contributor

All files synced nicely from Windows to ownCloud, with proper names. That makes me confident that the file names on windows are properly encoded.

@MorrisJobke
Copy link
Contributor

I had another case. There the files were migrated from an old setup. So it's possible that they were already broken on that old setup, because the storage is only used by ownCloud directly.

@MorrisJobke
Copy link
Contributor

I had another case. There the files were migrated from an old setup. So it's possible that they were already broken on that old setup, because the storage is only used by ownCloud directly.

00005213

@danimo
Copy link
Contributor

danimo commented Apr 25, 2016

@MorrisJobke occ scan:files needs to warn about NFD files, at least on local storage.

@PVince81
Copy link
Contributor Author

PR to display a warning when scanning on the CLI: #24341
(preliminary work)

@PVince81
Copy link
Contributor Author

Okay, I did some debugging in regards to normalized file names and found this interesting behavior:

  • whenever the scanner finds a NFD file name, it keeps the non-normalized name most of the time and only normalizes at the time it writes it to the database
  • the entry is correctly inserted into the database
  • however, there is a call to Watcher::cleanFolders() which will iterate over every file in the folder to automatically remove all entries where file_exists is false. Since file_exists for the non-normalized name it will remove the entry.

@icewind1991 now thinking of it, why do we even have cleanFolders ? I thought scanChildren was already able to do a diff of the directory contents. Maybe something to consider separately for perf improvements.

@PVince81
Copy link
Contributor Author

  • extra challenge: moving NFD file to trash and back again, the file name would be normalized on restore. I think that should be fine

@PVince81
Copy link
Contributor Author

WIP PR for the new storage wrapper here: #24349

@PVince81
Copy link
Contributor Author

PVince81 commented May 3, 2016

Raised #24421 to discuss cleanFolder()

@lock
Copy link

lock bot commented Aug 4, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 4, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.