BUG: read_json broken for S3 URL with non-null chunksize #47659

dungba88 · 2022-07-10T02:34:08Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.read_json(path_or_buf="s3://...json", lines=True, chunksize=100)

Issue Description

This issue occurs when calling pandas read_json on an S3 URL (via s3fs) with a non-null chunksize. There's a similar report for the null chunksize case.

Running the code above results in this error:

TypeError: initial_value must be str or None, not bytes
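The error itself does not need S3 to show up. A minimal sketch, assuming the S3 handle hands back bytes (the `payload` value below is made up for illustration), shows that `io.StringIO` refuses a bytes argument:

```python
from io import StringIO

# Hypothetical payload: bytes, as a binary file handle would return them.
payload = b'{"a": 1}\n{"a": 2}\n'

error = None
try:
    StringIO(payload)  # StringIO only accepts str or None
except TypeError as exc:
    error = exc  # keep a reference; `exc` is cleared after the except block

print(error)
```

Decoding the bytes first (for example `StringIO(payload.decode("utf-8"))`) avoids the exception.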

I traced it to this method:

    def _preprocess_data(self, data):
        """
        At this point, the data either has a `read` attribute (e.g. a file
        object or a StringIO) or is a string that is a JSON document.
        If self.chunksize, we prepare the data for the `__next__` method.
        Otherwise, we read it into memory for the `read` method.
        """
        if hasattr(data, "read") and not (self.chunksize or self.nrows):
            with self:
                data = data.read()
        if not hasattr(data, "read") and (self.chunksize or self.nrows):
            data = StringIO(data)  # <-- raises the TypeError when data is bytes

        return data

The fix looks simple: change the line marked above to:

            data = StringIO(ensure_str(data))
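To illustrate the idea outside pandas, here is a rough sketch. `to_text` and `preprocess` are hypothetical stand-ins (not the pandas implementations), UTF-8 is an assumed encoding, and the `nrows` and context-manager handling of the real method are left out:

```python
from io import StringIO


def to_text(data, encoding="utf-8"):
    """Hypothetical stand-in for ensure_str: decode bytes, pass str through."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def preprocess(data, chunksize=None):
    """Simplified stand-in for _preprocess_data (chunksize handling only)."""
    if hasattr(data, "read") and not chunksize:
        # No chunking requested: read the whole document into memory.
        data = data.read()
    if not hasattr(data, "read") and chunksize:
        # Chunking requested: wrap the text so it can be read line by line.
        data = StringIO(to_text(data))
    return data


buf = preprocess(b'{"a": 1}\n{"a": 2}\n', chunksize=1)
first_line = buf.readline()
print(first_line)
```

With the decoding step in place, both bytes and str inputs end up wrapped in a StringIO that can be read in chunks.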

Will put together a PR

Expected Behavior

Using pandas read_json with an S3 URL and a non-null chunksize should work.

Installed Versions

It happens with the current version (1.4.3).


dungba88 added the Bug and Needs Triage labels Jul 10, 2022

simonjayhawkins added the IO JSON label Feb 6, 2024
