read_ndjson read_json: Add encoding parameter #15301

Open
2 tasks done
JulianCologne opened this issue Mar 26, 2024 · 2 comments
Labels
bug (Something isn't working), P-high (Priority: high), python (Related to Python Polars)

Comments

@JulianCologne
Contributor

JulianCologne commented Mar 26, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

data.json (encoding=UTF-8-BOM)

{"col1": "a", "col2": 1}
{"col1": "b", "col2": 2}

import polars as pl

pl.read_ndjson("data.json")
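
For anyone who wants to recreate the input file, here is a minimal sketch using only the Python standard library. It assumes that "UTF-8-BOM" means UTF-8 with a leading byte order mark, which is what Python's `utf-8-sig` codec writes. The helper name `write_bom_ndjson` is hypothetical and not part of Polars.

```python
import pathlib
import tempfile


def write_bom_ndjson(path: pathlib.Path) -> bytes:
    """Write the two example rows as NDJSON with a UTF-8 BOM; return the raw bytes."""
    rows = ['{"col1": "a", "col2": 1}', '{"col1": "b", "col2": 2}']
    # "utf-8-sig" prepends the three BOM bytes EF BB BF when encoding.
    # newline="\n" keeps line endings identical across platforms.
    with open(path, "w", encoding="utf-8-sig", newline="\n") as handle:
        handle.write("\n".join(rows) + "\n")
    return path.read_bytes()


# Write into a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    raw = write_bom_ndjson(pathlib.Path(tmp_dir) / "data.json")

print(raw[:3])  # the BOM bytes at the start of the file
```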

Log output

RuntimeError: BindingsError: "InternalError(TapeError) at character 0 ('ï')"
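
The 'ï' in the message is consistent with the parser hitting the first byte of the UTF-8 BOM. The sketch below shows that byte 0xEF renders as 'ï' when read as a single Latin-1 character. That the error message is produced this way is an assumption based on the output, not something confirmed from the Polars source.

```python
import codecs

# The UTF-8 byte order mark is the three-byte sequence EF BB BF.
bom = codecs.BOM_UTF8

# Interpreting only the first byte (0xEF) as one character gives U+00EF, "ï".
first_char = bom[:1].decode("latin-1")

print(bom.hex(), first_char)
```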

Issue description

read_ndjson raises an error on files that are not plain UTF-8 (for example UTF-8 with a BOM), and there is no option to specify the encoding.

Many popular ETL tools, such as Azure Data Factory, write JSON files with UTF-8-BOM encoding, so this happens quite a lot.

IMO the encoding should be specifiable whenever you read a file. Pandas also has this parameter.

Expected behavior

read_ndjson and read_json should accept an encoding parameter and read the data into a DataFrame.

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.8 (tags/v3.11.8:db85d51, Feb  6 2024, 22:03:32) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.9.1
matplotlib:           <not installed>
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              14.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@JulianCologne JulianCologne added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 26, 2024
@JulianCologne JulianCologne changed the title read_ndjson: RuntimeError: BindingsError: "InternalError(TapeError) at character 0 ('ï')" read_ndjson: Add encoding parameter Mar 26, 2024
@JulianCologne JulianCologne changed the title read_ndjson: Add encoding parameter read_ndjson read_json: Add encoding parameter Mar 26, 2024
@ritchie46
Member

We will not support encodings other than ASCII or UTF-8, but we should ignore the BOM if we encounter it.
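
Until the BOM is ignored by the reader, one possible workaround is to strip it before parsing. This is a minimal sketch, not the fix planned for Polars. `strip_utf8_bom` is a hypothetical helper, and the commented-out call assumes that `pl.read_ndjson` accepts a file-like object.

```python
import codecs
import io


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other input is returned unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Stand-in for the contents of data.json from the reproducible example.
raw = codecs.BOM_UTF8 + b'{"col1": "a", "col2": 1}\n{"col1": "b", "col2": 2}\n'
clean = strip_utf8_bom(raw)
buffer = io.BytesIO(clean)

# Assumption: with Polars installed, the cleaned buffer can then be parsed:
# import polars as pl
# df = pl.read_ndjson(buffer)
```

For a file on disk, the same helper can be applied to the result of `pathlib.Path("data.json").read_bytes()`. This reads the whole file into memory, so it suits small to medium inputs.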

@ritchie46 ritchie46 added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Mar 26, 2024
@ritchie46
Member

p-high because it is low effort and fixes unreadable files.
