BUG: Series read_json tries to convert all column values to dates even when using keep_default_dates=True, if one column has an na value #49585

Open · 2 of 3 tasks
efagerberg opened this issue Nov 8, 2022 · 12 comments
Labels: Bug, IO JSON (read_json, to_json, json_normalize)

Comments

@efagerberg

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
x = pd.Series({'a': None, 'b': '012345', 'c': 1})
print(x)
print(
    pd.read_json(
        pd.Series(x).to_json(),
        typ="series",
        orient="records",
        keep_default_dates=True,
    )
)


a      None
b    012345
c         1
dtype: object
a                   NaT
b   1970-01-01 03:25:45
c   1970-01-01 00:00:01
dtype: datetime64[ns]


# Without na column

x = pd.Series({'b': '012345', 'c': 1})
print(x)
print(
    pd.read_json(
        pd.Series(x).to_json(),
        typ="series",
        orient="records",
        keep_default_dates=True,
    )
)

b    012345
c         1
dtype: object
b    12345
c        1
dtype: int64

Issue Description

When a series has a column that could be parsed as a date and another column has an NA value, read_json converts all columns to datetimes.
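
The timestamps in the output above are consistent with the values being read as seconds since the Unix epoch: 12345 seconds is 3 h 25 min 45 s, and 1 is one second past midnight. A minimal standard-library check of that arithmetic follows. It is an illustration only and does not call pandas; `as_epoch_seconds` is a hypothetical helper name, and the seconds unit is an assumption inferred from the output.

```python
from datetime import datetime, timezone


def as_epoch_seconds(value):
    """Interpret an int or numeric string as seconds since 1970-01-01 UTC."""
    # int() accepts numeric strings with leading zeros, so "012345" -> 12345.
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


print(as_epoch_seconds("012345"))  # 1970-01-01 03:25:45+00:00
print(as_epoch_seconds(1))         # 1970-01-01 00:00:01+00:00
```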

Expected Behavior

Ideally none of the columns would be parsed as dates, unless I set keep_default_dates=False or I do not supply it.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 91111fd
python : 3.9.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-52-generic
Version : #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.22.2
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.1.2
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.10.0
gcsfs : None
matplotlib : None
numba : 0.56.3
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.42
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None
tzdata : None

efagerberg added the Bug and Needs Triage labels Nov 8, 2022
@simar2001

Hey, I'm a student at the University of Michigan and was looking to contribute for a course project. However, this is my first time contributing to an open-source project. Do you think I would be able to take this one?

@simar2001

Hey @efagerberg I'm a new contributor to this project. Would you be able to walk me through where you think this bug might be located in the codebase?

@efagerberg
Author

efagerberg commented Dec 5, 2022

Sure. I have not contributed to pandas myself, but here is the trace I can see:

  1. json_reader = JsonReader(
  2. class JsonReader(abc.Iterator, Generic[FrameSeriesStrT]):
  3. def read(self) -> DataFrame | Series:

There is also a test file here that you can likely use to replicate the issue and to validate your fix:

@pytest.mark.parametrize(
    "reader, module, path",
    [
        (pd.read_csv, "os", ("io", "data", "csv", "iris.csv")),
        (pd.read_table, "os", ("io", "data", "csv", "iris.csv")),
        (
            pd.read_fwf,
            "os",
            ("io", "data", "fixed_width", "fixed_width_format.txt"),
        ),
        (pd.read_excel, "xlrd", ("io", "data", "excel", "test1.xlsx")),
        (
            pd.read_feather,
            "pyarrow",
            ("io", "data", "feather", "feather-0_3_1.feather"),
        ),
        (
            pd.read_hdf,
            "tables",
            ("io", "data", "legacy_hdf", "datetimetz_object.h5"),
        ),
        (pd.read_stata, "os", ("io", "data", "stata", "stata10_115.dta")),
        (pd.read_sas, "os", ("io", "sas", "data", "test1.sas7bdat")),
        (pd.read_json, "os", ("io", "json", "data", "tsframe_v012.json")),
        (
            pd.read_pickle,
            "os",
            ("io", "data", "pickle", "categorical.0.25.0.pickle"),
        ),
    ],
)
@pytest.mark.filterwarnings(  # pytables np.object usage
    "ignore:`np.object` is a deprecated alias:DeprecationWarning"
)
def test_read_fspath_all(self, reader, module, path, datapath):
    pytest.importorskip(module)
    path = datapath(*path)
    mypath = CustomFSPath(path)
    result = reader(mypath)
    expected = reader(path)
    if path.endswith(".pickle"):
        # categorical
        tm.assert_categorical_equal(result, expected)
    else:
        tm.assert_frame_equal(result, expected)

@simar2001

simar2001 commented Dec 5, 2022

Thanks @efagerberg. I have looked at the issue for a little bit and noticed that we can avoid the problem by adding convert_dates=False. For your initial reproducible example, it would look something like this to get the output you want:

x = pd.Series({'a': None, 'b': '012345', 'c': 1})
print(x)

print(
    pd.read_json(
        pd.Series(x).to_json(),
        typ="series",
        orient="records",
        convert_dates=False,
        keep_default_dates=True,
    )
)

I think this issue might actually just come down to how read_json works, since convert_dates defaults to True. One solution I might suggest is to change the default value of convert_dates to False, to avoid situations like the one shown above.

I'm not sure whether this change is too drastic, as it would change the default behavior that pandas users currently expect.

Let me know what you think.

@efagerberg
Author

One tricky part of just changing the default is that people upgrading from older versions may suddenly get string dates where they were previously parsed, so the change is not backwards compatible.

It may be advisable to analyze the whole series more thoroughly to get a stronger signal on whether a column is a date or not. In my example it is pretty nebulous, so I would expect pandas to make fewer assumptions.
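
As a rough idea of what "more analysis of the whole series" could look like, here is a small pure-Python sketch. It is not pandas code and not a proposed patch: the function name `looks_like_epoch_series` and the one-year threshold are hypothetical choices made for illustration. The sketch ignores nulls and then requires every remaining value to be an integer large enough to be a plausible epoch timestamp.

```python
def looks_like_epoch_series(values, min_stamp=31_536_000):
    """Hypothetical check: ignore nulls, then require every remaining value
    to be an integer above an assumed minimum timestamp (one year in seconds)."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing but nulls: no evidence that this is date data.
        return False
    return all(
        isinstance(v, int) and not isinstance(v, bool) and v > min_stamp
        for v in present
    )


print(looks_like_epoch_series([None, "012345", 1]))  # False
print(looks_like_epoch_series([None, 1667865600]))   # True
```

Under this stricter rule the series from the report would be left alone, because it mixes a numeric string with a small integer.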

@simar2001

Thanks for the insight @efagerberg. After stepping through the code side by side for the two examples, I noticed that pandas tries to figure out whether the input is "nansafe" before parsing the JSON string into dates or int64.

Since pandas detects that a None is present in the dataset, I think it disregards that None and converts the rest of the JSON. Pandas never gets a chance to look at the JSON data without the None value present.
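
One way to picture the difference between the two examples is the toy model below. It is not pandas source code; it assumes, for illustration only, that a plausibility (range) check runs only when every value can be cast to an integer, so a single None causes the check to be skipped. The name `model_date_guard` and the `MIN_STAMP` threshold are hypothetical.

```python
MIN_STAMP = 31_536_000  # assumed threshold: one year in seconds


def model_date_guard(values, min_stamp=MIN_STAMP):
    """Return True if this simplified model would attempt date conversion."""
    try:
        # int(None) raises TypeError, so one null makes the whole cast fail.
        numbers = [int(v) for v in values]
    except (TypeError, ValueError):
        # Cast failed: the range check below is never reached.
        return True
    # Cast succeeded: only values above the threshold look like timestamps.
    return all(n > min_stamp for n in numbers)


print(model_date_guard(["012345", 1]))        # False
print(model_date_guard([None, "012345", 1]))  # True
```

In this model the presence of None flips the outcome even though the other values are unchanged, which matches the pattern seen in the two reproducible examples.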

Instead of reworking the entire logic of how pandas infers what kind of data was passed to read_json, I would go back to my earlier suggestion of switching the default from convert_dates=True to convert_dates=False.

Currently, if convert_dates is not passed, it is automatically set to True. I think it is unintuitive to assume that the user always wants their data converted to dates. We could instead make convert_dates=False the default and leave it to the user to pass convert_dates=True when they want the conversion.

I understand this may not be backwards compatible, but I think in order to:

  1. Solve this bug
  2. Make the read_json function work more intuitively

This switch is the best solution I can propose.

@efagerberg
Author

That seems like a reasonable plan to me.

@simar2001

Sounds good. Would you be able to assign me this task or is there some way I can do it myself?

@efagerberg
Author

Hmm, I can't do it on my side; it seems like only maintainers would be able to do it.

@simar2001

take

@MarcoGorelli
Member

Thanks for the report

Changing the default doesn't solve the issue and would need a deprecation cycle anyway
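
For readers unfamiliar with the term, a deprecation cycle usually means keeping the old default for a while and warning callers who rely on it before the default changes. A generic sketch of that pattern follows; the helper `resolve_convert_dates` and the sentinel `_NO_DEFAULT` are hypothetical names, and this is not how pandas implements it. As noted above, it would not fix the underlying bug either way.

```python
import warnings

_NO_DEFAULT = object()  # sentinel: the caller did not pass convert_dates


def resolve_convert_dates(convert_dates=_NO_DEFAULT):
    """Hypothetical helper: keep the old default but warn that it will change."""
    if convert_dates is _NO_DEFAULT:
        warnings.warn(
            "The default of convert_dates will change from True to False "
            "in a future version; pass it explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        return True  # old behaviour stays in place during the deprecation period
    return convert_dates
```

Callers who pass convert_dates explicitly see no warning, so only code that depends on the implicit default is nudged to change.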

@simar2001

Thanks @MarcoGorelli. I am a new contributor so I am looking for a little help on this issue.

Would you have ideas or proposals about how I can go about solving this issue? At this point I am a little stumped.

Specifically, I am having a hard time navigating the pandas code base and would really appreciate it if you could point me in the direction of where you think this issue might be located.

topper-123 added the IO JSON label and removed the Needs Triage label May 26, 2023
4 participants