Writing Parquet to JSON when row_oriented=True #15410

Open
2 tasks done
bobir01 opened this issue Apr 1, 2024 · 3 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

bobir01 commented Apr 1, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import time
from pathlib import Path
from tempfile import NamedTemporaryFile

import polars as pl
from polars import selectors as sc


def get_pl_table(file: NamedTemporaryFile) -> pl.DataFrame:
    s_time = time.time()

    # Cast Binary columns to String and keep every other column as-is.
    df_lazy = pl.scan_parquet(file.name).select(
        sc.by_dtype(pl.Binary).cast(pl.String),
        sc.all().exclude(pl.Binary)
    ).collect()

    # This call panics when row_oriented=True.
    df_lazy.write_json(Path(__file__).parent / 'data' / "support_ticket_sla1.json", row_oriented=True)
    return df_lazy

Log output

/Users/bobdev/PycharmProjects/korzinkaGo/.venv/bin/python /Users/bobdev/PycharmProjects/korzinkaGo/aws_tasks/commpnl_upsert.py 
Time elapsed for s3 download: 3.2596969604492188

thread '<unnamed>' panicked at crates/polars-json/src/json/write/serialize.rs:494:18:
not yet implemented: Writing BinaryView to JSON
stack backtrace:
   0:        0x175468524 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h6aecb9d07bb8db1b
   1:        0x17362c24c - core::fmt::write::h57932930d5d73fd6
   2:        0x1754420d4 - std::io::Write::write_fmt::hb638df451817bf25
   3:        0x17546b3f0 - std::sys_common::backtrace::print::h6910c90959d8cad9
   4:        0x17546ad34 - std::panicking::default_hook::{{closure}}::h1bb1130b8dcb1188
   5:        0x17546c600 - std::panicking::rust_panic_with_hook::hfc5079cd9be86c57
   6:        0x17546b720 - std::panicking::begin_panic_handler::{{closure}}::h5532dad9383be66c
   7:        0x17546b684 - std::sys_common::backtrace::__rust_end_short_backtrace::h39762fc8d44c97d9
   8:        0x17546b678 - _rust_begin_unwind
   9:        0x1756092f4 - core::panicking::panic_fmt::hdaff94c2cbb4d934
  10:        0x174329e60 - polars_json::json::write::serialize::new_serializer::h41779dfb65e9fc7c
  11:        0x174328708 - polars_json::json::write::serialize::new_serializer::h41779dfb65e9fc7c
  12:        0x174328f70 - polars_json::json::write::serialize::new_serializer::h41779dfb65e9fc7c
  13:        0x174334630 - polars_json::json::write::serialize::serialize::h65b146cc2bb39732
  14:        0x1731b2da8 - <polars_io::json::JsonWriter<W> as polars_io::SerWriter<W>>::finish::h2955a06d621ed22b
  15:        0x1733f4ee0 - polars::dataframe::_::<impl polars::dataframe::PyDataFrame>::__pymethod_write_json__::hdb909103cdc666f3
  16:        0x172ffa130 - pyo3::impl_::trampoline::trampoline::hccfd42f2554b28e9
  17:        0x173554d70 - polars::dataframe::_::_::__INVENTORY::trampoline::h60e14f143153bc4a
  18:        0x102ce1b70 - _method_vectorcall_VARARGS_KEYWORDS
  19:        0x102e09160 - _call_function
  20:        0x102e0050c - __PyEval_EvalFrameDefault
  21:        0x102df9f44 - __PyEval_Vector
  22:        0x102cd6dcc - _method_vectorcall
  23:        0x102e09160 - _call_function
  24:        0x102dffc80 - __PyEval_EvalFrameDefault
  25:        0x102df9f44 - __PyEval_Vector
  26:        0x102e09160 - _call_function
  27:        0x102dffbfc - __PyEval_EvalFrameDefault
  28:        0x102df9f44 - __PyEval_Vector
  29:        0x102e09160 - _call_function
  30:        0x102dffbfc - __PyEval_EvalFrameDefault
  31:        0x102df9f44 - __PyEval_Vector
  32:        0x102e647cc - _pyrun_file
  33:        0x102e63f10 - __PyRun_SimpleFileObject
  34:        0x102e6355c - __PyRun_AnyFileObject
  35:        0x102e8f76c - _pymain_run_file_obj
  36:        0x102e8ee0c - _pymain_run_file
  37:        0x102e8e3f4 - _pymain_run_python
  38:        0x102e8e288 - _Py_RunMain
  39:        0x102e8f914 - _pymain_main
  40:        0x102e8fbd8 - _Py_BytesMain
Traceback (most recent call last):
  File "/Users/bobdev/PycharmProjects/korzinkaGo/aws_tasks/commpnl_upsert.py", line 118, in <module>
    main()
  File "/Users/bobdev/PycharmProjects/korzinkaGo/aws_tasks/commpnl_upsert.py", line 113, in main
    table = get_pl_table(file)
  File "/Users/bobdev/PycharmProjects/korzinkaGo/aws_tasks/commpnl_upsert.py", line 53, in get_pl_table
    df_lazy.write_json(Path(__file__).parent / 'data'/ "support_ticket_sla1.json", row_oriented=True)
  File "/Users/bobdev/PycharmProjects/korzinkaGo/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py", line 2490, in write_json
    self._df.write_json(file, pretty, row_oriented)
pyo3_runtime.PanicException: not yet implemented: Writing BinaryView to JSON

Process finished with exit code 1

Issue description

When I enable row_oriented=True, write_json panics with a "not yet implemented" error, but it works fine when I disable it. I am trying to convert a Parquet file to JSON; the default JSON output without row_oriented=True is around 17 MB.
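
For readers unfamiliar with the two layouts, here is a minimal sketch of the difference between column-oriented and row-oriented JSON. It uses only the standard library and hypothetical sample data; the helper name `to_row_oriented` is an assumption for illustration, and the column layout shown is a generic one, not the exact structure Polars writes.

```python
import json

# Hypothetical sample data, not taken from the Parquet file in this issue.
columns = {"id": [1, 2], "status": ["open", "closed"]}


def to_row_oriented(cols: dict[str, list]) -> list[dict]:
    """Turn a mapping of column name -> values into a list of row dicts."""
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]


# Column-oriented: one JSON object, one array per column.
column_json = json.dumps(columns)

# Row-oriented: one JSON array, one object per row.
row_json = json.dumps(to_row_oriented(columns))

print(column_json)
print(row_json)
```

The row-oriented form repeats every column name in each row, which is why it is usually larger than the column-oriented output for the same data.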

Expected behavior

It should output row-oriented JSON.

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             macOS-13.6-arm64-arm-64bit
Python:               3.10.3 (v3.10.3:a342a49189, Mar 16 2022, 09:34:18) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              15.0.1
pydantic:             2.6.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.51
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0
bobir01 added the bug, needs triage, and python labels on Apr 1, 2024
@ritchie46
Member

Can you make a reproducible example? I don't have the parquet file you used. Ideally your example doesn't include any files, but creates an example from memory.

@reswqa
Collaborator

reswqa commented Apr 1, 2024

I think the MRE can be:

import io

import polars as pl

df = pl.DataFrame({"a": [b"123", b"abc"]})
buf = io.StringIO()
df.write_json(buf, row_oriented=True)

Is there any reason why we wouldn't want to implement this for Binary here?

pub(crate) fn new_serializer<'a>(
    array: &'a dyn Array,
    offset: usize,
    take: usize,
) -> Box<dyn StreamingIterator<Item = [u8]> + 'a + Send + Sync> {
    match array.data_type().to_logical_type() {
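
One reason a binary case needs a deliberate decision is that JSON has no native bytes type, so a writer has to pick a text encoding. The following is a rough sketch of that idea in plain Python, using the same values as the MRE above. The helper names `encode_cell` and `rows_to_json` are hypothetical, and base64 is only an assumed choice here; it is not a statement of how Polars does or would serialize Binary columns.

```python
import base64
import json


def encode_cell(value):
    """Return a JSON-friendly value; bytes become a base64 ASCII string."""
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(value).decode("ascii")
    return value


def rows_to_json(cols: dict[str, list]) -> str:
    """Serialize column data as row-oriented JSON, encoding bytes cells."""
    names = list(cols)
    rows = [
        {name: encode_cell(cell) for name, cell in zip(names, values)}
        for values in zip(*cols.values())
    ]
    return json.dumps(rows)


# Same values as the MRE: a single column "a" holding two bytes objects.
result = rows_to_json({"a": [b"123", b"abc"]})
print(result)
```

Base64 round-trips arbitrary bytes, whereas decoding as UTF-8 (similar in spirit to casting Binary to String) only works when the bytes happen to be valid text.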

@bobir01
Author

bobir01 commented Apr 5, 2024

Hi @ritchie46,
here is the full reproducible code:

from tempfile import NamedTemporaryFile
from pprint import pprint
import polars as pl
from polars import selectors as sc
import requests
from pathlib import Path

base_dir = Path(__file__).parent


def get_pl_table(file: NamedTemporaryFile) -> pl.DataFrame:
    df = pl.scan_parquet(file.name).select(
        sc.by_dtype(pl.Binary).cast(pl.String),
        sc.all().exclude(pl.Binary)
    ).collect()
    # let's print the schema of the dataframe
    pprint(df.schema)
    # ERROR-prone code -> when enabling row_oriented=True, the code will fail
    # for other cases, it will work fine
    df.write_json(base_dir / 'tmp_sprt.json', row_oriented=True)
    file.close()
    return df


def get_parquet_file() -> NamedTemporaryFile:
    base_url = 'https://cloud-api.yandex.net/v1/disk/public/resources/download?public_key=https://disk.yandex.com/d/bAhxN41-pwdGwA'
    response = requests.get(base_url)
    download_url = response.json()['href']
    response = requests.get(download_url)
    file = NamedTemporaryFile()
    file.write(response.content)
    # Flush so the data is on disk before the file is reopened by name.
    file.flush()
    return file


def main():
    file = get_parquet_file()
    get_pl_table(file)


if __name__ == '__main__':
    main()

Please note that this URL is safe and points to my own cloud storage. I believe the problem is on the Rust side, because it fails only when row_oriented=True is enabled.
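
A side note for anyone reproducing this: when a download is written to a NamedTemporaryFile and then read back by name, some of the data may still sit in Python's write buffer unless the file is flushed first. A minimal sketch of that pattern, assuming a POSIX system; the helper name `write_temp_bytes` and the placeholder payload are hypothetical and stand in for the downloaded Parquet bytes.

```python
from pathlib import Path
from tempfile import NamedTemporaryFile


def write_temp_bytes(payload: bytes):
    """Write payload to a named temp file and flush it for by-name readers."""
    file = NamedTemporaryFile(suffix=".parquet")
    file.write(payload)
    file.flush()  # push buffered bytes to the OS before anyone opens file.name
    return file


# Placeholder bytes only; this is not a valid Parquet file.
payload = b"PAR1" + b"\x00" * 1024

tmp = write_temp_bytes(payload)
size_on_disk = Path(tmp.name).stat().st_size
tmp.close()  # the temporary file is deleted on close

print(size_on_disk == len(payload))
```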
