This repository has been archived by the owner on Jul 3, 2024. It is now read-only.

Can't parse files with BOM #45

Closed
mwtoews opened this issue Jan 20, 2022 · 5 comments · Fixed by #46

Comments

@mwtoews
Contributor

mwtoews commented Jan 20, 2022

absolufy-imports 0.3.0 cannot parse a Python file that starts with a BOM.

For example, download bom.py.txt and rename it to bom.py (this step is necessary because GitHub only allows *.txt files to be attached), move it into an example project (e.g. under src), then run the following:

$ cat src/bom.py
import sys
$ file src/bom.py
src/bom.py: UTF-8 Unicode (with BOM) text
$ python src/bom.py
$ absolufy-imports src/bom.py
Traceback (most recent call last):
  File "/tmp/py310/bin/absolufy-imports", line 8, in <module>
    sys.exit(main())
  File "/tmp/py310/lib/python3.10/site-packages/absolufy_imports.py", line 208, in main
    absolute_imports(
  File "/tmp/py310/lib/python3.10/site-packages/absolufy_imports.py", line 151, in absolute_imports
    tree = ast.parse(txt)
  File "/opt/python-3.10.0/lib/python3.10/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    import sys
    ^
SyntaxError: invalid non-printable character U+FEFF

Perhaps a better error message would help? Alternatively, compare these two ways of reading the source file; the second passes encoding='utf-8-sig' and parses cleanly:

$ python -c "from ast import parse; from pathlib import Path; parse(Path('src/bom.py').read_text())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/python-3.10.0/lib/python3.10/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    import sys
    ^
SyntaxError: invalid non-printable character U+FEFF
$ python -c "from ast import parse; from pathlib import Path; parse(Path('src/bom.py').read_text(encoding='utf-8-sig'))"
$ 
@mwtoews
Contributor Author

mwtoews commented Jan 20, 2022

I suppose the other facet is rewriting: if the file needs to be rewritten, is it written back with the BOM if it had one before? (I also noticed that DOS files get rewritten as UNIX files, which is a separate issue...)
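
One way to keep both the BOM and the original line endings across a rewrite is to work on bytes and put the BOM back explicitly. The following is only a minimal sketch of that idea, not what absolufy-imports does; the helper names `read_preserving` and `write_preserving` are hypothetical.

```python
import codecs


def read_preserving(path):
    """Return (text, had_bom); text keeps its original line endings."""
    with open(path, 'rb') as f:
        raw = f.read()
    had_bom = raw.startswith(codecs.BOM_UTF8)
    if had_bom:
        # Drop the 3-byte UTF-8 BOM so it does not end up in the text.
        raw = raw[len(codecs.BOM_UTF8):]
    # Decoding bytes directly does no newline translation, so '\r\n' survives.
    return raw.decode('utf-8'), had_bom


def write_preserving(path, text, had_bom):
    """Write text back, restoring the BOM if the original file had one."""
    data = text.encode('utf-8')
    if had_bom:
        data = codecs.BOM_UTF8 + data
    # Binary mode avoids any platform-specific newline conversion on write.
    with open(path, 'wb') as f:
        f.write(data)
```

Reading and writing in binary mode sidesteps Python's universal-newlines handling, which is what would otherwise turn a DOS file into a UNIX one.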

@MarcoGorelli
Owner

Hi @mwtoews

Thanks for your report - do you want to submit a pull request to fix this?

@mwtoews
Contributor Author

mwtoews commented Jan 20, 2022

I'll take a look...

@MarcoGorelli
Owner

I think something like this

    with open(file, 'rb') as fb:
        contents_bytes = fb.read()
    try:
        contents_text = contents_bytes.decode()
    except UnicodeDecodeError:
        print(f'{file} is non-utf-8 (not supported)')
        return 1
    tree = ast.parse(contents_text)

should be enough. I'll try to get something out today, as there's another issue to fix.
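
A note on the snippet above: `bytes.decode()` defaults to UTF-8, which keeps a leading BOM as the character U+FEFF, so the decoded text could still trigger the SyntaxError from the original report. Below is a hedged variant that decodes with 'utf-8-sig' instead, as suggested earlier in this thread. The function name `parse_source` and its `filename` parameter are assumptions made for illustration, not part of absolufy-imports.

```python
import ast


def parse_source(contents_bytes, filename='<unknown>'):
    """Return an AST, or None if the bytes are not valid UTF-8."""
    try:
        # 'utf-8-sig' decodes UTF-8 and drops a leading BOM if present;
        # files without a BOM decode exactly as with plain 'utf-8'.
        contents_text = contents_bytes.decode('utf-8-sig')
    except UnicodeDecodeError:
        print(f'{filename} is non-utf-8 (not supported)')
        return None
    return ast.parse(contents_text, filename=filename)
```

For example, `parse_source(b'\xef\xbb\xbfimport sys\n')` returns a module node, while undecodable input prints the message and returns None.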

@mwtoews
Contributor Author

mwtoews commented Jan 20, 2022

I have something similar, but I'm also looking at line breaks ...
