BUG: Series read_json tries to convert all column values to dates even when using keep_default_dates=True, if one column has an na value #49585

efagerberg · 2022-11-08T18:26:36Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
x = pd.Series({'a': None, 'b': '012345', 'c': 1})
print(x)
print(
    pd.read_json(
        pd.Series(x).to_json(),
        typ="series",
        orient="records",
        keep_default_dates=True,
    )
)


a      None
b    012345
c         1
dtype: object
a                   NaT
b   1970-01-01 03:25:45
c   1970-01-01 00:00:01
dtype: datetime64[ns]


# Without na column

x = pd.Series({'b': '012345', 'c': 1})
print(x)
print(
    pd.read_json(
        pd.Series(x).to_json(),
        typ="series",
        orient="records",
        keep_default_dates=True,
    )
)

b    012345
c         1
dtype: object
b    12345
c        1
dtype: int64

Issue Description

When a series has a column that could be parsed as a date, and when there is another column with an na value, read_json will convert all columns to datetimes.
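For context on the values in the first output: 1970-01-01 03:25:45 is what you get when 12345 is read as a count of seconds since the Unix epoch, and 1970-01-01 00:00:01 is 1 second. A minimal standard-library sketch of that arithmetic follows. The helper name epoch_seconds_to_utc is made up for illustration, and this only reproduces the timestamps; it does not claim to be how pandas performs the conversion internally.

```python
from datetime import datetime, timezone


def epoch_seconds_to_utc(value):
    """Interpret value (int or numeric string) as seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


b = epoch_seconds_to_utc("012345")  # int() drops the leading zero
c = epoch_seconds_to_utc(1)

print(b.strftime("%Y-%m-%d %H:%M:%S"))  # 1970-01-01 03:25:45
print(c.strftime("%Y-%m-%d %H:%M:%S"))  # 1970-01-01 00:00:01
```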

Expected Behavior

Ideally none of the columns would be parsed as dates, unless I set keep_default_dates=False or I do not supply it.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 91111fd
python : 3.9.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-52-generic
Version : #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.22.2
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.1.2
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.10.0
gcsfs : None
matplotlib : None
numba : 0.56.3
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.42
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None
tzdata : None


simar2001 · 2022-11-08T22:59:02Z

Hey, I'm a student at the University of Michigan and was looking to contribute for a course project. However, this is my first time contributing to an open-source project. Do you think I would be able to take this one?

simar2001 · 2022-12-05T05:23:48Z

Hey @efagerberg I'm a new contributor to this project. Would you be able to walk me through where you think this bug might be located in the codebase?

efagerberg · 2022-12-05T13:27:14Z

Sure, I have not contributed to pandas myself but here is the trace I can see:

pandas/pandas/io/json/_json.py, line 724 in 0cebd75:

    json_reader = JsonReader(

pandas/pandas/io/json/_json.py, line 749 in 0cebd75:

    class JsonReader(abc.Iterator, Generic[FrameSeriesStrT]):

pandas/pandas/io/json/_json.py, line 885 in 0cebd75:

    def read(self) -> DataFrame | Series:

There is also a test file which you can likely use to replicate the issue and help you validate you have fixed it here:

pandas/pandas/tests/io/test_common.py

Lines 289 to 335 in 0cebd75

    
    @pytest.mark.parametrize(
        "reader, module, path",
        [
            (pd.read_csv, "os", ("io", "data", "csv", "iris.csv")),
            (pd.read_table, "os", ("io", "data", "csv", "iris.csv")),
            (
                pd.read_fwf,
                "os",
                ("io", "data", "fixed_width", "fixed_width_format.txt"),
            ),
            (pd.read_excel, "xlrd", ("io", "data", "excel", "test1.xlsx")),
            (
                pd.read_feather,
                "pyarrow",
                ("io", "data", "feather", "feather-0_3_1.feather"),
            ),
            (
                pd.read_hdf,
                "tables",
                ("io", "data", "legacy_hdf", "datetimetz_object.h5"),
            ),
            (pd.read_stata, "os", ("io", "data", "stata", "stata10_115.dta")),
            (pd.read_sas, "os", ("io", "sas", "data", "test1.sas7bdat")),
            (pd.read_json, "os", ("io", "json", "data", "tsframe_v012.json")),
            (
                pd.read_pickle,
                "os",
                ("io", "data", "pickle", "categorical.0.25.0.pickle"),
            ),
        ],
    )
    @pytest.mark.filterwarnings(  # pytables np.object usage
        "ignore:`np.object` is a deprecated alias:DeprecationWarning"
    )
    def test_read_fspath_all(self, reader, module, path, datapath):
        pytest.importorskip(module)
        path = datapath(*path)

        mypath = CustomFSPath(path)
        result = reader(mypath)
        expected = reader(path)

        if path.endswith(".pickle"):
            # categorical
            tm.assert_categorical_equal(result, expected)
        else:
            tm.assert_frame_equal(result, expected)

simar2001 · 2022-12-05T18:17:48Z

Thanks @efagerberg. I have looked at the issue for a little bit and noticed that we can avoid it by passing convert_dates=False. So for your initial Reproducible Example, it would look something like this to get the output you want:

x = pd.Series({'a': None, 'b': '012345', 'c': 1})
print(x)

print(
    pd.read_json(
        pd.Series(x).to_json(),
        typ="series",
        orient="records",
        convert_dates=False,
        keep_default_dates=True,
    )
)

I think this issue might actually just be down to how read_json works, since convert_dates defaults to True. One solution I might suggest is to change the default value of convert_dates to False, to avoid situations like the one shown above.

I am not sure whether this change is too drastic, as it would change default pandas behavior that people currently expect.

Let me know what you think.

efagerberg · 2022-12-05T18:35:11Z

One tricky part of just changing the default is that people upgrading from older versions may suddenly get string dates where they were previously parsed, so in that way it is not backwards compatible.

It may be advisable to analyze the whole series to get more signal on whether the values are dates or not. In my example it is pretty nebulous, so I would expect pandas to make fewer assumptions.
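To make "fewer assumptions" a bit more concrete, here is a rough standard-library sketch of a stricter whole-series check: treat the values as epoch timestamps only if every non-null value is a real number above some cutoff. The function name and the cutoff are hypothetical, chosen only for illustration, and nothing here is existing pandas API.

```python
from numbers import Real

# Hypothetical cutoff for illustration: about one year after 1970-01-01.
ASSUMED_MIN_EPOCH_SECONDS = 31_536_000


def all_values_look_like_epochs(values):
    """Return True only if every non-null value is a real number above the cutoff."""
    present = [v for v in values if v is not None]
    if not present:
        return False  # nothing but nulls: no evidence of dates
    return all(
        isinstance(v, Real)
        and not isinstance(v, bool)  # bool is an int subclass, so exclude it
        and v > ASSUMED_MIN_EPOCH_SECONDS
        for v in present
    )


mixed = all_values_look_like_epochs([None, "012345", 1])
stamps = all_values_look_like_epochs([None, 1667932000, 1667932060])
print(mixed, stamps)  # False True
```

Under this kind of rule, the series from my example would be left alone because it contains a string and a tiny integer, while a series of plausible timestamps with a missing value would still be converted.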

simar2001 · 2022-12-07T23:49:28Z

Thanks for the insight @efagerberg. After debugging the two examples side by side, I noticed that pandas tries to figure out whether the data is "nansafe" before going ahead and parsing the JSON string into dates or int64.

Since pandas detects that a None is present in the data, I think it disregards that None and converts the rest of the JSON. Pandas never gets a chance to look at the data without the None value present.
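One way to picture how a single None could flip the result is the toy model below. It is a simplified guess at the control flow, not the actual pandas code: guess_is_date and MIN_STAMP are hypothetical names, the cutoff value is an assumption, and the model simply assumes that date parsing succeeds whenever it is attempted.

```python
# Simplified model, NOT the actual pandas implementation.
MIN_STAMP = 31_536_000  # assumed cutoff: roughly one year in seconds


def guess_is_date(values):
    """Return True if this toy heuristic would treat values as timestamps."""
    try:
        # Step 1: try to view every value as an integer. None makes this fail.
        numbers = [int(v) for v in values]
    except (TypeError, ValueError):
        # Step 2a: the cast failed, so the range check below never runs and
        # the values go straight to date parsing (assumed to succeed here).
        return True
    # Step 2b: the cast worked, so small numbers are rejected as dates.
    return all(n > MIN_STAMP for n in numbers)


with_na = guess_is_date([None, "012345", 1])
without_na = guess_is_date(["012345", 1])
print(with_na, without_na)  # True False
```

If the real reader behaves anything like this, the None does not get "disregarded" so much as it stops the sanity check on small numbers from ever running.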

Instead of reworking the entire logic pandas uses to work out what kind of data is passed to read_json, I go back to my earlier suggestion of switching the default from convert_dates=True to convert_dates=False.

Currently, if convert_dates is not passed, it is automatically set to True. I think it is unintuitive to assume the user always wants their data converted to dates. We could make the user responsible for specifying convert_dates=True when they want that conversion, and switch the default to convert_dates=False.

I understand this may not be backwards compatible, but in order to (1) solve this bug and (2) make the read_json function work more intuitively, this switch is the best solution I can propose.

efagerberg · 2022-12-07T23:56:01Z

That seems like a reasonable plan to me.

simar2001 · 2022-12-08T00:23:40Z

Sounds good. Would you be able to assign me this task or is there some way I can do it myself?

efagerberg · 2022-12-08T00:30:28Z

Hmm, I can't do it on my side. It seems like only maintainers would be able to do it.

simar2001 · 2022-12-08T04:35:45Z

take

MarcoGorelli · 2022-12-08T08:29:51Z

Thanks for the report

Changing the default doesn't solve the issue and would need a deprecation cycle anyway
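For readers unfamiliar with the term, a deprecation cycle for a default change usually means warning for one or more releases before flipping the value. The sketch below shows the general pattern with a sentinel default and a FutureWarning. Everything in it is hypothetical: read_json_like and _NO_DEFAULT are illustrative stand-ins, not pandas functions, and this is not a proposal for the actual change.

```python
import warnings

_NO_DEFAULT = object()  # sentinel: distinguishes "not passed" from "passed True"


def read_json_like(data, convert_dates=_NO_DEFAULT):
    """Hypothetical stand-in for a reader whose default is being changed."""
    if convert_dates is _NO_DEFAULT:
        warnings.warn(
            "The default of convert_dates will change from True to False "
            "in a future version. Pass convert_dates explicitly to silence "
            "this warning.",
            FutureWarning,
            stacklevel=2,
        )
        convert_dates = True  # keep the old behaviour during the cycle
    return {"data": data, "convert_dates": convert_dates}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = read_json_like("{}")  # warns, still converts dates
    explicit = read_json_like("{}", convert_dates=False)  # no warning

print(len(caught), implicit["convert_dates"], explicit["convert_dates"])
# 1 True False
```

Only callers who rely on the implicit default see the warning, which gives them time to opt in or out before the behaviour changes.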

simar2001 · 2022-12-08T16:21:34Z

Thanks @MarcoGorelli. I am a new contributor so I am looking for a little help on this issue.

Would you have ideas or proposals about how I can go about solving this issue? At this point I am a little stumped.

Specifically, I am having a hard time navigating the pandas code base and would really appreciate it if you could point me to where you think this issue might be located.

efagerberg added the Bug and Needs Triage labels Nov 8, 2022

github-actions bot assigned simar2001 Dec 8, 2022

simar2001 mentioned this issue Dec 8, 2022

BUG: Series read_json tries to convert all column values to dates even when using keep_default_dates=True, if one column has an na value FIX #50123 (Closed)

topper-123 added the IO JSON label and removed the Needs Triage label May 26, 2023

This was referenced Sep 6, 2023

ENH: Extending the orient="table" option to all Table Schema types #55038 (Open)

PDEP-12: compact-and-reversible-JSON-interface.md #53714 (Merged)
