
BUG: Python's date and time objects do not roundtrip with JSON orient='table' #44705

Closed
3 tasks done
jmg-duarte opened this issue Dec 1, 2021 · 12 comments · Fixed by #44722
Labels
Bug IO JSON read_json, to_json, json_normalize
Comments

@jmg-duarte (Contributor)

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import datetime as dt
import pandas as pd

date_df = pd.DataFrame({"d": [dt.date.today()]})
date_df_json = date_df.to_json(orient='table')
date_df_back = pd.read_json(date_df_json, orient='table')
assert date_df.equals(date_df_back) # fails

time_df = pd.DataFrame({"t": [dt.time()]})
time_df_json = time_df.to_json(orient='table')
time_df_back = pd.read_json(time_df_json, orient='table')
assert time_df.equals(time_df_back) # fails

Issue Description

datetime.date and datetime.time objects are not serialized correctly when getting converted to and from JSON.

The main issue seems to be the lack of type information when (de)serializing the DataFrame to and from JSON.
Given that pandas makes use of the Table Schema from the Frictionless Standards, which explicitly defines both date and time field types, it should support both.

NOTE: This has been discussed in #20612 and #32037.

Expected Behavior

Both types make the roundtrip.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 73c68257545b5f8530b7044f56647bd2db92e2ba
python           : 3.8.10.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.6.0
Version          : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.29.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.11.0
fastparquet : 0.7.1
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

jmg-duarte added the Bug and Needs Triage labels Dec 1, 2021
@jorisvandenbossche (Member)

Since the Table Schema specification supports date and time, it seems we can support this as well (though our date and time support is quite limited of course, given you can only use it with the generic object dtype).

Currently we are using type="string" for any object dtype column in the json table schema, which is wrong anyway. I would suggest that we use infer_dtype on object dtype columns to check the actual content of the column, and this way we can use "date" or "time" in the schema we write to json.

The other question is the exact format we use for the values in the JSON. The specification's default format for "date" is in the YYYY-MM-DD format (while currently the low level C code will convert dates to timestamps and thus end up using something like "2021-12-01T00:00:00.000Z"). Now it seems that the specification allows to specify the exact format being used, so that could be a workaround. Alternatively we could also convert the column to proper YYYY-MM-DD strings before passing it down to the low-level C code to serialize to JSON.
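The two ideas above can be sketched roughly like this: infer_dtype inspecting the actual contents of an object-dtype column, and a hypothetical pre-conversion of the values to the spec's default string formats before handing the frame to the serializer (this is illustrative, not current pandas behavior):

```python
import datetime as dt
import pandas as pd
from pandas.api.types import infer_dtype

df = pd.DataFrame({"d": [dt.date(2021, 12, 1)], "t": [dt.time(8, 30)]})

# infer_dtype looks at the objects themselves, so it can tell date and
# time columns apart from plain strings, even though all are object dtype
print(infer_dtype(df["d"]))  # "date"
print(infer_dtype(df["t"]))  # "time"

# hypothetical pre-conversion to the Table Schema default formats
# (YYYY-MM-DD for date, hh:mm:ss for time) before JSON serialization
out = df.copy()
out["d"] = out["d"].map(dt.date.isoformat)  # "2021-12-01"
out["t"] = out["t"].map(dt.time.isoformat)  # "08:30:00"
```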

jorisvandenbossche added the IO JSON label and removed the Needs Triage label Dec 1, 2021
@jmg-duarte (Contributor, Author)

The other question is the exact format we use for the values in the JSON. The specification's default format for "date" is in the YYYY-MM-DD format (while currently the low level C code will convert dates to timestamps and thus end up using something like "2021-12-01T00:00:00.000Z"). Now it seems that the specification allows to specify the exact format being used, so that could be a workaround. Alternatively we could also convert the column to proper YYYY-MM-DD strings before passing it down to the low-level C code to serialize to JSON.

I am in favor of serializing date as YYYY-MM-DD and time as hh:mm:ss. As for time, I don't know how it currently works, though.

Regardless, if I am to work with the C code, guidance on what/where I should change would be much appreciated.

@jorisvandenbossche (Member)

If we convert to string up front, then no C changes should be necessary. I think the main advantage of doing it in C is that you don't have the full converted column at the same time (so lower memory use), and potentially faster (but that might depend on the way the formatting is done).
In any case I think a first version of a PR doing it in Python is certainly good enough. We can discuss which parts would be better done in C or not.

@Dr-Irv (Contributor) commented Dec 1, 2021

@jmg-duarte In #20612 (comment), you wrote:

Putting it simply, I have actual datetime.date and datetime.time objects inside the DataFrame.

I don't understand why you don't just use the pandas datetime type to represent the dates and times.

That's not to say that we can't support date and time in the JSON table format, but it's worth noting the following:

  1. The python json library does not support converting datetime.date or datetime.time datatypes to JSON, so we are putting something in pandas that isn't even supported in python. (But maybe that is OK)
  2. If we do decide to do this, it does open the questions of what other python object types (e.g., ipaddress.IPv4Address, decimal.Decimal, etc.) we should support, or if we do as suggested in Serialization / Deserialization of ExtensionArrays #20612 and come up with general support for creating JSON for extension types.

@jmg-duarte (Contributor, Author)

  1. If we do decide to do this, it does open the questions of what other python object types (e.g., ipaddress.IPv4Address, decimal.Decimal, etc.) we should support, or if we do as suggested in Serialization / Deserialization of ExtensionArrays #20612 and come up with general support for creating JSON for extension types.

You now see why I hoped that the ExtensionDtype could help me. With proper serialization hooks, I could implement an extension and work from there.

@Dr-Irv (Contributor) commented Dec 1, 2021

In any case I think a first version of a PR doing it in Python is certainly good enough. We can discuss which parts would be better done in C or not.

I think the serialization and deserialization code is in python, and you'd have to modify pandas/io/json/_table_schema.py .

@Dr-Irv (Contributor) commented Dec 1, 2021

You now see why I hoped that the ExtensionDtype could help me. With proper serialization hooks, I could implement an extension and work from there.

You might be able to do this for all ExtensionDtype classes by modifying pandas/io/json/_table_schema.py and leveraging the existing extension API ExtensionArray._from_sequence_of_strings() for deserialization, ExtensionArray._formatter() for serialization, and ExtensionDtype.name . That might be better than doing something specific for datetime.date and datetime.time .
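As a rough illustration of the round trip described above, using the built-in Int64 extension dtype (note these are private pandas APIs, so exact signatures may vary between versions):

```python
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int64")

# serialize: _formatter(boxed=True) returns a per-element renderer (str here)
fmt = arr._formatter(boxed=True)
strings = [fmt(v) for v in arr]  # ["1", "2", "3"]

# deserialize: rebuild the extension array from those strings;
# the schema could record which dtype to use via ExtensionDtype.name
back = type(arr)._from_sequence_of_strings(strings, dtype=arr.dtype)
```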

@jmg-duarte (Contributor, Author)

You might be able to do this for all ExtensionDtype classes by modifying pandas/io/json/_table_schema.py and leveraging the existing extension API ExtensionArray._from_sequence_of_strings() for deserialization, ExtensionArray._formatter() for serialization, and ExtensionDtype.name . That might be better than doing something specific for datetime.date and datetime.time .

Could you be more explicit about what you mean by "modifying pandas/io/json/_table_schema.py"? I'd rather not fork pandas for a fix just on my side, especially considering that allowing ExtensionDtypes to serialize as their author intends would be an expected feature.

@jorisvandenbossche (Member)

To clarify why I suggested to start with special casing "date" and "time" instead of general (extension) dtype support, is because the Table Schema explicitly supports date and time (while this is not explicitly the case for eg decimal, although you can of course store that as a float number). The python default json serialization might not handle date and time, but it's also not creating a very specific subset of json as is the case here (i.e. the table schema spec).

Could you be more explicit what you mean with "modifying pandas/io/json/_table_schema.py"? I'd rather not fork pandas for a fix just on my side,

(I think @Dr-Irv meant that suggestion of where to look for modifying the code if you would do a PR to contribute it upstream, not just for patching it on your side)

@jmg-duarte (Contributor, Author)

To clarify why I suggested to start with special casing "date" and "time" instead of general (extension) dtype support, is because the Table Schema explicitly supports date and time (while this is not explicitly the case for eg decimal, although you can of course store that as a float number). The python default json serialization might not handle date and time, but it's also not creating a very specific subset of json as is the case here (i.e. the table schema spec).

👍 I was writing the same point and you beat me to it.

(I think @Dr-Irv meant that suggestion of where to look for modifying the code if you would do a PR to contribute it upstream, not just for patching it on your side)

I'm happy to help, but I need more guidance w.r.t. the extension types // _table_schema.py part.

@jmg-duarte (Contributor, Author) commented Dec 2, 2021

Diving into the serialization code, the schema is easy enough to extend.
In pandas/io/json/_table_schema.py we can apply the following:

diff --git a/pandas/io/json/_table_schema.py b/pandas/io/json/_table_schema.py
index 75fd950cd6..8d16f668df 100644
--- a/pandas/io/json/_table_schema.py
+++ b/pandas/io/json/_table_schema.py
@@ -18,13 +18,17 @@ from pandas._typing import (
     JSONSerializable,
 )
 
+from pandas.api.types import infer_dtype
+
 from pandas.core.dtypes.common import (
     is_bool_dtype,
     is_categorical_dtype,
     is_datetime64_dtype,
     is_datetime64tz_dtype,
     is_integer_dtype,
     is_numeric_dtype,
+    is_object_dtype,
     is_period_dtype,
     is_string_dtype,
     is_timedelta64_dtype,
@@ -83,6 +87,8 @@ def as_json_table_type(x: DtypeObj) -> str:
         return "duration"
     elif is_categorical_dtype(x):
         return "any"
+    elif is_object_dtype(x): # first because the string dtype covers this
+        return "object"
     elif is_string_dtype(x):
         return "string"
     else:
@@ -112,17 +118,19 @@ def set_default_names(data):
         data.index.name = data.index.name or "index"
     return data
 
-
 def convert_pandas_type_to_json_field(arr):
     dtype = arr.dtype
     if arr.name is None:
         name = "values"
     else:
         name = arr.name
+
     field: dict[str, JSONSerializable] = {
         "name": name,
         "type": as_json_table_type(dtype),
     }
+    if field["type"] == "object":
+        field["pytype"] = infer_dtype(arr)
 
     if is_categorical_dtype(dtype):
         cats = dtype.categories
@@ -179,7 +187,7 @@ def convert_json_field_to_pandas_type(field):
     'datetime64[ns, US/Central]'
     """
     typ = field["type"]
-    if typ == "string":
+    if typ in ["string", "object"]:
         return "object"
     elif typ == "integer":
         return "int64"

Which will be able to generate the following schema:

{
  "schema":{
    "fields":[
      {
        "name":"index",
        "type":"integer"
      },
      {
        "name":"d",
        "type":"object",
        "pytype":"date"
      }
    ],
    "primaryKey":[
      "index"
    ],
    "pandas_version":"0.20.0"
  },
  "data":[
    {
      "index":0,
      "d":"2021-12-02T00:00:00.000Z"
    }
  ]
}

However, when reading, given that datetime.date and datetime.time aren't actual dtypes I can't really just

    df = df.astype(dtypes)

I could try to convert the date and time columns by routing them through datetime64[ns] and then calling df['column'].dt.date and df['column'].dt.time.
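That recovery path might look like this (a hypothetical post-processing step on the parsed frame, with made-up column values):

```python
import datetime as dt
import pandas as pd

# values as they might come back from the parsed JSON (object-dtype strings)
df = pd.DataFrame({"d": ["2021-12-02"], "t": ["08:30:00"]})

# route through datetime64[ns], then project back to python date/time objects
df["d"] = pd.to_datetime(df["d"]).dt.date
df["t"] = pd.to_datetime(df["t"], format="%H:%M:%S").dt.time
```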

@jmg-duarte (Contributor, Author)

Just submitted a PR (linked above); it ends up taking the approach suggested by @Dr-Irv (quick note: when deserializing, it calls _from_sequence rather than _from_sequence_of_strings).

jreback added this to the 1.4 milestone Dec 15, 2021