Non latin-1 filenames are not supported #13
Comments
Replacing
    return codecs.latin_1_encode(x)[0]  # codecs.latin_1_encode("зображення")
with
    return codecs.utf_8_encode(x)[0]  # codecs.utf_8_encode("зображення")
will work, but will likely raise an exception elsewhere, where …
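The difference between the two calls mentioned in this comment can be checked in isolation. This is only a minimal sketch using the example string from the report; it does not touch pyFileFixity itself and says nothing about how the rest of the library reacts to UTF-8 bytes.

```python
import codecs

name = "зображення"  # example string from the report

# latin-1 only covers code points 0..255, so Cyrillic text cannot be encoded.
try:
    codecs.latin_1_encode(name)
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# codecs.utf_8_encode returns a (bytes, characters_consumed) tuple,
# which is why the original code indexes the result with [0].
utf8_bytes, consumed = codecs.utf_8_encode(name)

print("latin-1 can encode it:", latin1_ok)
print("UTF-8 byte length:", len(utf8_bytes), "for", consumed, "characters")
```

Each Cyrillic letter takes two bytes in UTF-8, so the byte length no longer matches the character count. Any code that assumes one byte per character would need to be reviewed before switching codecs.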
Thank you for your feedback!
I think your change should be fine. There is an exhaustive unit test; you can try running it with your change, and if it passes, you are good.
I will try to do that myself, but no guarantees: I have a big backlog of project maintenance...
On 12 Nov. 2023 at 19:59:34, Bogdan ***@***.***> wrote:
…
Thank you for a very well-thought-out tool! I am currently evaluating it for keeping my 400+GB, 50k-file archive safe(r).
While doing so, came across this exception:
    Traceback (most recent call last):
      File "/home/user/.local/bin/pff", line 8, in <module>
        sys.exit(main())
      File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/pff.py", line 108, in main
        return saecc_main(argv=subargs, command=fullcommand)
      File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 574, in main
        relfilepath_ecc = compute_ecc_hash_from_string(relfilepath, ecc_manager_intra, hasher_intra, max_block_size, resilience_rate_intra)
      File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 203, in compute_ecc_hash_from_string
        fpfile = BytesIO(b(string))
      File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/lib/_compat.py", line 36, in b
        return codecs.latin_1_encode(x)[0]
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 16-26: ordinal not in range(256)
Looking at the code, it seems that *latin-1* is used as an internal encoding, which indeed cannot handle characters outside the latin-1 range:
    if sys.version_info < (3,):
        def b(x):
            return x
    else:
        import codecs
        def b(x):
            if isinstance(x, _str):
                return codecs.latin_1_encode(x)[0]  # <-- here
            else:
                return x
Problematic filename had Ukrainian/Cyrillic characters, which I think are not a part of *latin-1* encoding. Example string: *зображення*.
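To see which characters of a filename fall outside latin-1, one can compare code points against 255, the highest value latin-1 can represent. The helper name `non_latin1_chars` below is hypothetical and is not part of pyFileFixity; it is only a sketch for checking filenames before running the tool.

```python
def non_latin1_chars(name):
    """Return the characters of `name` that latin-1 cannot represent.

    Hypothetical helper for illustration; latin-1 maps exactly the
    code points 0..255, so anything above 0xFF cannot be encoded.
    """
    return [ch for ch in name if ord(ch) > 0xFF]


# Cyrillic letters live around U+0400..U+04FF, well above 0xFF.
print(non_latin1_chars("зображення.txt"))

# Western European accented letters such as U+00E9 fit within latin-1.
print(non_latin1_chars("caf\u00e9.txt"))
```

An empty list means the name would pass through the current `b()` helper without raising `UnicodeEncodeError`.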
pyFileFixity version 3.1.4 installed with *pip*. I'm on Python 3.10.12.
Or you know what? If you can make a PR, an automated CI workflow will run the unit tests online, so you won't need to run them yourself. And it will let me credit you properly for this change :-)
OK, so I remember why it is in latin-1: the software encodes byte by byte, and a byte can take only 256 distinct values (0 to 255), so the idea was to use latin-1 as a codec if necessary, although normally these should be treated as bytes. This is old code left over from the Python 2/3 compatibility era; now that Python 2 support has been dropped everywhere, I should rewrite it to be more idiomatic Python 3. Could you please share a minimal example file that reproduces this issue? A simple text file with some random non-latin-1 characters should be enough (I'll try to make some myself, but it would help if you could provide an example file too).
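The byte-by-byte property described here can be illustrated with the standard codecs alone. The sketch below is not taken from pyFileFixity; it only shows that latin-1 maps every byte value to exactly one character and back, while UTF-8 uses more than one byte for characters outside ASCII.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# latin-1 decodes each byte to the code point with the same number,
# so the round trip is lossless and lengths are preserved.
as_text = all_bytes.decode("latin-1")
round_trip = as_text.encode("latin-1")

print(len(as_text) == len(all_bytes))  # one character per byte
print(round_trip == all_bytes)         # lossless round trip

# UTF-8 has no such one-to-one mapping: a single Cyrillic letter
# occupies two bytes.
cyrillic_bytes = "з".encode("utf-8")
print(len(cyrillic_bytes))
```

This one-to-one mapping is what makes latin-1 convenient for carrying raw bytes inside a `str`, and also why it cannot represent text whose code points exceed 255.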
OK, I can reproduce the issue using the example filename you provided, thank you very much. I can't believe I never tested a non-latin-1 filename. I will work on it; hopefully it's not too complicated.