UTF-8 NFD file name on SMB storage cannot be accessed #21365
Comments
Had a chat with @icewind1991 and he told me that the file names are normalized by OC. We cannot disable normalization of file names because that would break syncing on Mac OS. Ideally the SMB server itself would have a setting that tells it to normalize UTF-8 names... So far the only known workaround is to use @icewind1991 @DeepDiver1975 @butonic won't fix? |
Another idea would be to have ownCloud detect such files and automatically normalize the file names on the remote storage, but that would be a bit "intrusive" and it doesn't feel like it's OC's job to do that. |
More info about NFD vs NFC: |
Another possibility, very ugly and with a possible performance impact: whenever we have a UTF-8 file name and connect to the remote SMB, try both the NFC and the NFD encoding of the file name. If one matches, assume it's the same file. |
I just tried locally and it is indeed possible to have both files with the same name and different encoding:
|
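For illustration only (this is not the output from the comment above): a minimal Python sketch of why two names that look identical can coexist on a storage. The file name is taken from the reproduction steps in this issue; the helper `describe` is hypothetical.

```python
import unicodedata

# "bad-ümlaut.txt" with a precomposed u-umlaut (U+00FC), i.e. NFC.
nfc_name = "bad-\u00fcmlaut.txt"
# The same visible name in decomposed form: "u" followed by U+0308 (NFD).
nfd_name = unicodedata.normalize("NFD", nfc_name)


def describe(name):
    """Return (number of code points, number of UTF-8 bytes) for a name."""
    return len(name), len(name.encode("utf-8"))


print(describe(nfc_name))    # (14, 15)
print(describe(nfd_name))    # (15, 16)
print(nfc_name == nfd_name)  # False: different strings, same rendering
# Normalizing both to one form makes them compare equal again.
print(unicodedata.normalize("NFC", nfd_name) == nfc_name)  # True
```

A storage that compares names byte by byte treats the two as distinct files, which is why both can exist side by side.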
We could also reopen the discussion about adding a new column to oc_filecache to store the real file name (maybe call it |
also CC @nickvergessen who worked on this idea back then |
Somehow can't find an exact ticket.
|
Setting to 9.0 for now to keep in view. |
@DeepDiver1975 FYI For the WND discussions |
Can we add a new column in oc_filecache to store the original non-normalized file name ? |
and @icewind1991 |
technically or in terms of "should we". |
Both. I don't see another solution for this, except maybe having this value in a separate table, mapped by file id. If we don't store the non-normalized name there is no way to properly match it on the remote storage. |
@icewind1991 can you post information about NFC/NFD collisions, in what direction they can happen and how often ? |
Discussed so far: if we'd add an additional column in the cache, we'd need to query that column in the Storage layer which would cause additional SQL queries and impair performance. |
@icewind1991 said it might be possible to provide a mount option that tells the storage to always manually convert all file names to NFD when accessing the storage. But this assumes that ALL files on the remote storage use the NFD encoding, and none use NFC. |
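A rough sketch of what such a mount option could do, and where it falls short. This is Python for illustration, not ownCloud code; `to_storage_name`, `to_core_name` and the `force_nfd` flag are hypothetical names.

```python
import unicodedata


def to_storage_name(core_name, force_nfd):
    """Map the NFC name used internally to the name sent to the storage."""
    if force_nfd:
        return unicodedata.normalize("NFD", core_name)
    return core_name


def to_core_name(storage_name):
    """Names read back from the storage are always normalized to NFC."""
    return unicodedata.normalize("NFC", storage_name)


nfc = "bad-\u00fcmlaut.txt"
nfd_only_share = {unicodedata.normalize("NFD", nfc)}  # every name is NFD
mixed_share = {nfc}                                   # this name is NFC

print(to_storage_name(nfc, True) in nfd_only_share)  # True: file is found
print(to_storage_name(nfc, True) in mixed_share)     # False: file is missed
```

The second lookup shows the limitation described above: as soon as one file on the share is stored as NFC, the forced conversion no longer finds it.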
I was wondering why we are normalizing the file names in the first place. The answer is that it is needed because of the sync clients, and also because URLs must be encoded in NFC. Not doing so would break the web UI and WebDAV (and the sync clients, since they use the WebDAV URLs). |
Regarding the proposal from #21365 (comment) with the extra column: since the extra DB query to retrieve the non-normalized name can penalize the performance, one idea is to make it optional, per storage. It could be implemented as a storage wrapper that only gets inserted when the option is set. |
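To make the wrapper idea concrete, here is a small Python sketch under stated assumptions. It is not the ownCloud storage API: `MemoryStorage`, `OriginalNameWrapper`, `mount` and the `preserve_original_names` option are all hypothetical, and a dict stands in for the extra column or table.

```python
import unicodedata


class MemoryStorage:
    """Hypothetical stand-in for a remote storage: raw name -> content."""

    def __init__(self, files):
        self.files = dict(files)

    def list_names(self):
        return list(self.files)

    def read(self, name):
        return self.files[name]


class OriginalNameWrapper:
    """Exposes NFC names and maps them back to the raw names on storage."""

    def __init__(self, inner):
        self.inner = inner
        # NFC name -> raw name; stands in for the extra column or table.
        self.original_names = {}

    def list_names(self):
        names = []
        for raw in self.inner.list_names():
            nfc = unicodedata.normalize("NFC", raw)
            self.original_names[nfc] = raw
            names.append(nfc)
        return names

    def read(self, name):
        # Fall back to the given name if no mapping was recorded.
        return self.inner.read(self.original_names.get(name, name))


def mount(storage, options):
    """Only insert the wrapper (and pay for the lookup) when asked to."""
    if options.get("preserve_original_names"):
        return OriginalNameWrapper(storage)
    return storage
```

Storages mounted without the option are returned unwrapped, so they never pay for the extra lookup. The sketch does not handle the collision case where both the NFC and the NFD variant of a name exist on the same storage; the later entry would simply overwrite the earlier mapping.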
@karlitschek @icewind1991 @DeepDiver1975 let me know what you think about #21365 (comment) |
@icewind1991 Might #19825 help here a bit? |
Hmmm. Sounds interesting. But it would need testing. |
Sounds reasonable - and this also adds testing effort - (not only but as well) on ci. We basically need to run our smb tests twice - so 4 in total:
|
I don't think #19825 will work. |
As discussed, use the storage wrapper solution 3 from #21365 (comment) |
This is happening also in google drive. Maybe all external storages? |
sure ... whenever something changes the encoding outside of OC we can run into this problem. Maybe it would make more sense if OC and the clients ignored the UTF encoding and used it as is ... AFAICT Win/Mac/Linux understand both ... not sure where the translation happens. IIRC the client does some magic to always talk NFC to the server .... @dragotin care to shed some light on the reason? |
@butonic no, we cannot ignore that problem, as it is not true that the OSes understand both. If they did, we would not have this problem. The client normalizes all file names to NFC Unicode when they are read from a native system. This happens in this function: https://github.com/owncloud/client/blob/master/csync/src/std/c_string.c#L263 This very carefully engineered (don't ask where my hair color comes from ;-) function converts from win32 wide char (a problem that the server does not have as it does not run on Windows) and from NFD UTF-8 on Mac (within the macro That is the exact same problem that you guys experience on the server now, with a little difference: the client knows that it is always NFD when it is on Mac. In the server external storage case it can be NFC ("usual" case) or NFD, which makes solving it a bit more complicated.
I think solution 3) is fine for the error case that we see now: whenever the file is accessed on the storage, first the NFC name is checked, and if that fails, NFD is tried. A problem arises if the file needs to be written later: the correct file name needs to be known for that. You could, again, first try NFC and, if that fails, try NFD; however, here you would have to know whether the file is new or not. New files should probably default to NFC.
I think another thing is very important: as on the client, I would demand that in ownCloud core every file name is supposed to be in UTF-8 NFC. That saves a lot of trouble in there. All file system access happens through the external storage API, and always happens with NFC file names from the core POV. The conversions have to happen in the external storage layer. Whenever a filename is read from the external storage (through For file access (stat(), write(), open() and such), core continues to use NFC-based names, and the specific external storage implementation needs to know how to convert the file name to access the right file.
That can either happen through trial and error (try NFC first, if that fails, try NFD), which might be tricky for the cases where the file may simply not exist. The other alternative is that the ext storage module somehow remembers the encoding of the file names in the DB, which is prolly better. Bottom line: keep core NFC only. Use the ext storage implementations as converters. That is what @DeepDiver1975 and I discussed ages ago already. @danimo fyi, that fun again, did I say right? |
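The trial-and-error variant described in this comment could look roughly like the following Python sketch. It is only an illustration of the lookup order; the function names are hypothetical and a plain set of names stands in for the remote storage.

```python
import unicodedata


def resolve_existing(nfc_name, remote_names):
    """Return the raw name found on the storage, or None if it is absent.

    The NFC form is tried first, then the NFD form.
    """
    for form in ("NFC", "NFD"):
        candidate = unicodedata.normalize(form, nfc_name)
        if candidate in remote_names:
            return candidate
    return None


def name_for_write(nfc_name, remote_names):
    """Reuse the existing raw name; new files default to NFC."""
    existing = resolve_existing(nfc_name, remote_names)
    if existing is not None:
        return existing
    return unicodedata.normalize("NFC", nfc_name)


nfc = "bad-\u00fcmlaut.txt"
nfd = unicodedata.normalize("NFD", nfc)

print(resolve_existing(nfc, {nfd}) == nfd)  # True: found via the NFD fallback
print(resolve_existing(nfc, set()))         # None: the file does not exist
print(name_for_write(nfc, set()) == nfc)    # True: new files are written as NFC
```

A missing file costs two lookups before it can be reported as absent, which is the tricky case mentioned above; remembering the encoding in the DB would avoid the second probe.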
Estimation: 1 week |
OS X uses HFS+ and by default stores file names in that encoding (NFD). I wonder what happens when you mount a Windows share (real Windows, not Samba) in OS X and store a file with an umlaut on it. Will Windows see the correct character? Will it be stored as NFC or NFD? Is there some magic conversion implemented on Windows or Mac OS? I don't have a Mac to test this, but looking at the reports from our customers it seems that the file name is stored as is (NFD) and the Windows clients should see broken chars. If that is the case we can at least point in that direction and blame Apple... |
00004418 |
00004611 |
All files synced nicely from Windows to ownCloud, with proper names. That makes me confident that the file names on windows are properly encoded. |
I had another case. There the files were migrated from an old setup. So it's possible that they were already broken on that old setup, because the storage is only used by ownCloud directly. |
00005213 |
@MorrisJobke |
PR to display a warning when scanning on the CLI: #24341 |
Okay, I did some debugging in regards to normalized file names and found this interesting behavior:
@icewind1991 now thinking of it, why do we even have |
|
WIP PR for the new storage wrapper here: #24349 |
Raised #24421 to discuss |
Steps
convmv -r -f utf8 -t utf8 --nfd --notest bad-ümlaut.txt
(ref: http://laufer.tumblr.com/post/1066670961/solved-weird-problem-with-special-characters-in).
Expected result
File appears in list
Actual result
File not visible
Versions
Observed in OC 8.2.2