
BUG: Python's date and time objects do not roundtrip with JSON orient='table' #44705

Closed
3 tasks done
jmg-duarte opened this issue Dec 1, 2021 · 12 comments · Fixed by #44722
Labels
Bug IO JSON read_json, to_json, json_normalize
Comments

@jmg-duarte (Contributor)

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import datetime as dt
import pandas as pd

date_df = pd.DataFrame({"d": [dt.date.today()]})
date_df_json = date_df.to_json(orient='table')
date_df_back = pd.read_json(date_df_json, orient='table')
assert date_df.equals(date_df_back) # fails

time_df = pd.DataFrame({"t": [dt.time()]})
time_df_json = time_df.to_json(orient='table')
time_df_back = pd.read_json(time_df_json, orient='table')
assert time_df.equals(time_df_back) # fails

Issue Description

datetime.date and datetime.time objects are not serialized correctly when getting converted to and from JSON.

The main issue seems to be the lack of type information when (de)serializing the DataFrame to and from JSON.
Given that pandas makes use of the Table Schema from the Frictionless Standards, which explicitly defines both date and time field types, it should support both.

NOTE: This has been discussed in #20612 and #32037.

Expected Behavior

Both types make the roundtrip.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 73c68257545b5f8530b7044f56647bd2db92e2ba
python           : 3.8.10.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.6.0
Version          : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.29.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.11.0
fastparquet : 0.7.1
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

jmg-duarte added the Bug and Needs Triage labels Dec 1, 2021
@jorisvandenbossche (Member)

Since the Table Schema specification supports date and time, it seems we can support this as well (though our date and time support is quite limited of course, given you can only use it with the generic object dtype).

Currently we are using type="string" for any object dtype column in the json table schema, which is wrong anyway. I would suggest that we use infer_dtype on object dtype columns to check the actual content of the column, and this way we can use "date" or "time" in the schema we write to json.

The other question is the exact format we use for the values in the JSON. The specification's default format for "date" is in the YYYY-MM-DD format (while currently the low level C code will convert dates to timestamps and thus end up using something like "2021-12-01T00:00:00.000Z"). Now it seems that the specification allows to specify the exact format being used, so that could be a workaround. Alternatively we could also convert the column to proper YYYY-MM-DD strings before passing it down to the low-level C code to serialize to JSON.
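The two ideas above can be sketched roughly like this: infer_dtype inspecting the actual contents of an object-dtype column, and a hypothetical pre-conversion of the values to the spec's default string formats before handing the frame to the serializer (this is illustrative, not current pandas behavior):

```python
import datetime as dt
import pandas as pd
from pandas.api.types import infer_dtype

df = pd.DataFrame({"d": [dt.date(2021, 12, 1)], "t": [dt.time(8, 30)]})

# infer_dtype looks at the objects themselves, so it can tell date and
# time columns apart from plain strings, even though all are object dtype
print(infer_dtype(df["d"]))  # "date"
print(infer_dtype(df["t"]))  # "time"

# hypothetical pre-conversion to the Table Schema default formats
# (YYYY-MM-DD for date, hh:mm:ss for time) before JSON serialization
out = df.copy()
out["d"] = out["d"].map(dt.date.isoformat)  # "2021-12-01"
out["t"] = out["t"].map(dt.time.isoformat)  # "08:30:00"
```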

jorisvandenbossche added the IO JSON label and removed the Needs Triage label Dec 1, 2021
@jmg-duarte (Contributor, Author)

The other question is the exact format we use for the values in the JSON. The specification's default format for "date" is in the YYYY-MM-DD format (while currently the low level C code will convert dates to timestamps and thus end up using something like "2021-12-01T00:00:00.000Z"). Now it seems that the specification allows to specify the exact format being used, so that could be a workaround. Alternatively we could also convert the column to proper YYYY-MM-DD strings before passing it down to the low-level C code to serialize to JSON.

I am in favor of serializing date as YYYY-MM-DD and time as hh:mm:ss. As for time, I don't know how it currently works, though.

Regardless, if I am to work with the C code, guidance on what/where I should change would be much appreciated.

@jorisvandenbossche (Member)

If we convert to string up front, then no C changes should be necessary. I think the main advantage of doing it in C is that you don't have the full converted column at the same time (so lower memory use), and potentially faster (but that might depend on the way the formatting is done).
In any case I think a first version of a PR doing it in Python is certainly good enough. We can discuss which parts would be better done in C or not.

@Dr-Irv (Contributor) commented Dec 1, 2021

@jmg-duarte In #20612 (comment), you wrote:

Putting it simply, I have actual datetime.date and datetime.time objects inside the DataFrame.

I don't understand why you don't just use the pandas datetime type to represent the dates and times.

That's not to say that we can't support date and time in the JSON table format, but it's worth noting the following:

  1. The python json library does not support converting datetime.date or datetime.time datatypes to JSON, so we are putting something in pandas that isn't even supported in python. (But maybe that is OK)
  2. If we do decide to do this, it does open the questions of what other python object types (e.g., ipaddress.IPv4Address, decimal.Decimal, etc.) we should support, or if we do as suggested in Serialization / Deserialization of ExtensionArrays #20612 and come up with general support for creating JSON for extension types.

@jmg-duarte (Contributor, Author)

  1. If we do decide to do this, it does open the questions of what other python object types (e.g., ipaddress.IPv4Address, decimal.Decimal, etc.) we should support, or if we do as suggested in Serialization / Deserialization of ExtensionArrays #20612 and come up with general support for creating JSON for extension types.

You now see why I hoped that the ExtensionDtype could help me. With proper serialization hooks, I could implement an extension and work from there.

@Dr-Irv (Contributor) commented Dec 1, 2021

In any case I think a first version of a PR doing it in Python is certainly good enough. We can discuss which parts would be better done in C or not.

I think the serialization and deserialization code is in python, and you'd have to modify pandas/io/json/_table_schema.py .

@Dr-Irv (Contributor) commented Dec 1, 2021

You now see why I hoped that the ExtensionDtype could help me. With proper serialization hooks, I could implement an extension and work from there.

You might be able to do this for all ExtensionDtype classes by modifying pandas/io/json/_table_schema.py and leveraging the existing extension API ExtensionArray._from_sequence_of_strings() for deserialization, ExtensionArray._formatter() for serialization, and ExtensionDtype.name . That might be better than doing something specific for datetime.date and datetime.time .
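As a rough illustration of the round trip described above, using the built-in Int64 extension dtype (note these are private pandas APIs, so exact signatures may vary between versions):

```python
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int64")

# serialize: _formatter(boxed=True) returns a per-element renderer (str here)
fmt = arr._formatter(boxed=True)
strings = [fmt(v) for v in arr]  # ["1", "2", "3"]

# deserialize: rebuild the extension array from those strings;
# the schema could record which dtype to use via ExtensionDtype.name
back = type(arr)._from_sequence_of_strings(strings, dtype=arr.dtype)
```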

@jmg-duarte (Contributor, Author)

You might be able to do this for all ExtensionDtype classes by modifying pandas/io/json/_table_schema.py and leveraging the existing extension API ExtensionArray._from_sequence_of_strings() for deserialization, ExtensionArray._formatter() for serialization, and ExtensionDtype.name . That might be better than doing something specific for datetime.date and datetime.time .

Could you be more explicit about what you mean by "modifying pandas/io/json/_table_schema.py"? I'd rather not fork pandas for a fix just on my side, especially considering that allowing ExtensionDtypes to serialize as their author intends would be an expected feature.

@jorisvandenbossche (Member)

To clarify why I suggested to start with special casing "date" and "time" instead of general (extension) dtype support, is because the Table Schema explicitly supports date and time (while this is not explicitly the case for eg decimal, although you can of course store that as a float number). The python default json serialization might not handle date and time, but it's also not creating a very specific subset of json as is the case here (i.e. the table schema spec).

Could you be more explicit what you mean with "modifying pandas/io/json/_table_schema.py"? I'd rather not fork pandas for a fix just on my side,

(I think @Dr-Irv meant that suggestion of where to look for modifying the code if you would do a PR to contribute it upstream, not just for patching it on your side)

@jmg-duarte (Contributor, Author)

To clarify why I suggested to start with special casing "date" and "time" instead of general (extension) dtype support, is because the Table Schema explicitly supports date and time (while this is not explicitly the case for eg decimal, although you can of course store that as a float number). The python default json serialization might not handle date and time, but it's also not creating a very specific subset of json as is the case here (i.e. the table schema spec).

👍 I was writing the same point and you beat me to it.

(I think @Dr-Irv meant that suggestion of where to look for modifying the code if you would do a PR to contribute it upstream, not just for patching it on your side)

I'm happy to help, but I need more guidance w.r.t. the extension types // _table_schema.py part.

@jmg-duarte (Contributor, Author) commented Dec 2, 2021

Diving into the serialization code, the schema is easy enough to extend.
In pandas/io/json/_table_schema.py we can apply the following:

diff --git a/pandas/io/json/_table_schema.py b/pandas/io/json/_table_schema.py
index 75fd950cd6..8d16f668df 100644
--- a/pandas/io/json/_table_schema.py
+++ b/pandas/io/json/_table_schema.py
@@ -18,13 +18,17 @@ from pandas._typing import (
     JSONSerializable,
 )
 
+from pandas.api.types import infer_dtype
+
 from pandas.core.dtypes.common import (
     is_bool_dtype,
     is_categorical_dtype,
     is_datetime64_dtype,
     is_datetime64tz_dtype,
     is_integer_dtype,
     is_numeric_dtype,
+    is_object_dtype,
     is_period_dtype,
     is_string_dtype,
     is_timedelta64_dtype,
@@ -83,6 +87,8 @@ def as_json_table_type(x: DtypeObj) -> str:
         return "duration"
     elif is_categorical_dtype(x):
         return "any"
+    elif is_object_dtype(x): # first because the string dtype covers this
+        return "object"
     elif is_string_dtype(x):
         return "string"
     else:
@@ -112,17 +118,19 @@ def set_default_names(data):
         data.index.name = data.index.name or "index"
     return data
 
-
 def convert_pandas_type_to_json_field(arr):
     dtype = arr.dtype
     if arr.name is None:
         name = "values"
     else:
         name = arr.name
+
     field: dict[str, JSONSerializable] = {
         "name": name,
         "type": as_json_table_type(dtype),
     }
+    if field["type"] == "object":
+        field["pytype"] = infer_dtype(arr)
 
     if is_categorical_dtype(dtype):
         cats = dtype.categories
@@ -179,7 +187,7 @@ def convert_json_field_to_pandas_type(field):
     'datetime64[ns, US/Central]'
     """
     typ = field["type"]
-    if typ == "string":
+    if typ in ["string", "object"]:
         return "object"
     elif typ == "integer":
         return "int64"

Which will be able to generate the following schema:

{
  "schema":{
    "fields":[
      {
        "name":"index",
        "type":"integer"
      },
      {
        "name":"d",
        "type":"object",
        "pytype":"date"
      }
    ],
    "primaryKey":[
      "index"
    ],
    "pandas_version":"0.20.0"
  },
  "data":[
    {
      "index":0,
      "d":"2021-12-02T00:00:00.000Z"
    }
  ]
}

However, when reading, given that datetime.date and datetime.time aren't actual dtypes I can't really just

    df = df.astype(dtypes)

I could try to convert the date and time columns by routing them through datetime64[ns] and then calling df['column'].dt.date and df['column'].dt.time.
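That recovery path might look like this (a hypothetical post-processing step on the parsed frame, with made-up column values):

```python
import datetime as dt
import pandas as pd

# values as they might come back from the parsed JSON (object-dtype strings)
df = pd.DataFrame({"d": ["2021-12-02"], "t": ["08:30:00"]})

# route through datetime64[ns], then project back to python date/time objects
df["d"] = pd.to_datetime(df["d"]).dt.date
df["t"] = pd.to_datetime(df["t"], format="%H:%M:%S").dt.time
```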

@jmg-duarte (Contributor, Author)

Just submitted a PR (linked above); it ends up taking the approach suggested by @Dr-Irv (quick note: when deserializing, it calls _from_sequence rather than _from_sequence_of_strings).

jreback added this to the 1.4 milestone Dec 15, 2021