Non-Unicode filenames are lost #1764

Open
yotann opened this issue Feb 21, 2022 · 3 comments

yotann commented Feb 21, 2022

When using a UTF-8 locale on Linux, Kopia gets confused by filenames that aren't valid UTF-8. It replaces the non-UTF-8 bytes with the replacement character U+FFFD, making it impossible to recover the original filename. Ideally, Kopia would preserve the original bytes of the filename, but if nothing else it should log an error message.

$ echo $LANG
en_US.UTF-8
$ touch $'\xfe'
$ touch $'\xff'
$ kopia snapshot create .
Snapshotting user@host:/tmp/test ...
 * 0 hashing, 1 hashed (0 B), 0 cached (0 B), uploaded 198 B, estimating...
Created snapshot with root k66fef850bdb93bbb59883c9a3598943c and ID ad11d536043bebc02cf947d46c87929a in 0s
$ kopia content show -j k66fef850bdb93bbb59883c9a3598943c
{
  "stream": "kopia:directory",
  "entries": [
    {
      "name": "\ufffd",
      "type": "f",
      "mode": "0644",
      "mtime": "2022-02-20T22:22:11.374460959-06:00",
      "uid": 1000,
      "gid": 100,
      "obj": "7b5e7b718fdf4d06e0e7e7a8d2c12894"
    },
    {
      "name": "\ufffd",
      "type": "f",
      "mode": "0644",
      "mtime": "2022-02-20T22:20:09.030978745-06:00",
      "uid": 1000,
      "gid": 100,
      "obj": "7b5e7b718fdf4d06e0e7e7a8d2c12894"
    }
  ],
  "summary": {
    "size": 0,
    "files": 2,
    "symlinks": 0,
    "dirs": 1,
    "maxTime": "2022-02-20T22:22:11.374460959-06:00",
    "numFailed": 0
  }
}
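
One plausible mechanism for the output above, offered as an assumption rather than a confirmed diagnosis of Kopia's code: Go's `encoding/json` coerces string values to valid UTF-8 when marshaling, so any invalid byte is written as the replacement character. The sketch below uses a hypothetical `entry` type (not Kopia's actual directory entry struct) to show that two distinct byte sequences become indistinguishable after encoding.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// entry is a hypothetical stand-in for a directory entry; it is not Kopia's type.
type entry struct {
	Name string `json:"name"`
}

// encodeName marshals a raw filename the way a naive JSON-based format would.
func encodeName(raw string) string {
	b, err := json.Marshal(entry{Name: raw})
	if err != nil {
		panic(err) // not expected for a struct with a single string field
	}
	return string(b)
}

func main() {
	// The same two filenames as in the reproduction: single bytes 0xFE and 0xFF.
	for _, raw := range []string{"\xfe", "\xff"} {
		fmt.Printf("valid UTF-8: %v, encoded: %s\n",
			utf8.ValidString(raw), encodeName(raw))
	}
}
```

Both names encode to the same JSON, which is why the original bytes cannot be recovered from the stored entry.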

yotann commented Feb 21, 2022

Both Borg and Restic handle this correctly (preserving the bytes). I'm not sure exactly how they store filenames.


yotann commented Feb 21, 2022

Options that occur to me:

  • Represent all filenames as bytes, not Unicode. Seems bad, since almost all filenames are valid Unicode.
  • Add a new field that indicates whether the name field contains normal Unicode or base64-encoded bytes.
  • Leave the name field as is, and add a new field name_bytes that holds the original bytes when necessary.
  • Use \x00 or / as escape codes in the name field, making something like "name": "/base64:/w==".
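
As a rough illustration of the third option, here is a minimal sketch in Go. The `dirEntry` type, the `name_bytes` JSON key, and the helper names are all hypothetical and are not part of Kopia's schema. It relies on the fact that `encoding/json` serializes `[]byte` fields as base64, so the extra field is only written when the name is not valid UTF-8.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"unicode/utf8"
)

// dirEntry is a hypothetical entry layout, not Kopia's actual schema.
type dirEntry struct {
	Name string `json:"name"`
	// NameBytes holds the original bytes only when Name is lossy.
	// encoding/json writes []byte as a base64 string.
	NameBytes []byte `json:"name_bytes,omitempty"`
}

// newDirEntry keeps valid UTF-8 names as-is and preserves the raw bytes otherwise.
func newDirEntry(raw string) dirEntry {
	if utf8.ValidString(raw) {
		return dirEntry{Name: raw}
	}
	return dirEntry{
		Name:      strings.ToValidUTF8(raw, "\uFFFD"), // human-readable fallback
		NameBytes: []byte(raw),
	}
}

// originalName returns the exact bytes of the filename as seen on disk.
func (e dirEntry) originalName() string {
	if len(e.NameBytes) > 0 {
		return string(e.NameBytes)
	}
	return e.Name
}

// roundTrip encodes a filename to JSON and decodes it back.
func roundTrip(raw string) string {
	b, err := json.Marshal(newDirEntry(raw))
	if err != nil {
		panic(err)
	}
	var e dirEntry
	if err := json.Unmarshal(b, &e); err != nil {
		panic(err)
	}
	return e.originalName()
}

func main() {
	for _, raw := range []string{"report.txt", "\xfe", "\xff"} {
		fmt.Printf("%q -> %q\n", raw, roundTrip(raw))
	}
}
```

A design like this keeps existing readers working, since they can still display the `name` field, while a restore path that knows about the extra field can recreate the original filename.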

jkowalski (Contributor) commented

Thanks for the report. We should definitely fix that. I'm wondering how this behaves on Windows, which uses only Unicode filenames.

@github-actions github-actions bot added the stale label Jun 11, 2023
@github-actions github-actions bot closed this as not planned Jun 18, 2023
@julio-lopez julio-lopez removed the stale label Jul 6, 2023
@julio-lopez julio-lopez reopened this Jul 6, 2023