BUG: pd.read_json sets wrong value for numeric column names #40674

ChiQiao · 2021-03-29T04:32:52Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame([1.], columns=[0])
pd.read_json(df.to_json(orient="table"), orient="table")

Problem description

The returned DataFrame is all NaN, with no exception or warning.
If the value is an int instead (e.g., df = pd.DataFrame([1], columns=[0])), a ValueError is raised. The error message is not clear ("Cannot convert non-finite values (NA or inf) to integer"), but it is still better than returning a DataFrame with wrong values.
The NaN values are introduced during the call to pd.read_json.

Expected Output

Based on #19129, it looks like numeric column names are not allowed in the first place. I think an exception should be raised rather than silently setting NaN, which makes debugging difficult.
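To illustrate the kind of early failure being asked for, here is a minimal sketch of a caller-side guard. The helper name `require_string_labels` and its error message are hypothetical; this is not pandas behaviour or pandas API, only an assumption about what a fail-fast check could look like.

```python
from typing import Hashable, Iterable


def require_string_labels(labels: Iterable[Hashable]) -> None:
    """Hypothetical guard: raise if any column label is not a str."""
    bad = [label for label in labels if not isinstance(label, str)]
    if bad:
        raise ValueError(
            f"orient='table' round trip needs string column names; got {bad!r}"
        )


# String labels pass silently.
require_string_labels(["a", "b"])

# A numeric label fails loudly instead of producing NaN later on.
try:
    require_string_labels([0])
    raised = False
except ValueError:
    raised = True
```

Under these assumptions, one could call `require_string_labels(df.columns)` before `df.to_json(orient="table")`.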

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 9d598a5
python : 3.7.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.2.1
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : 0.29.22
pytest : 6.2.2
hypothesis : None
sphinx : 3.5.2
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.1
sqlalchemy : 1.3.23
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0

jbrockmendel · 2021-04-10T23:56:10Z

cc @WillAyd

WillAyd · 2021-04-16T20:00:15Z

Hmm yea this is a little strange. If you just look at the to_json output

>>> df.to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":0,"type":"number"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"0":1.0}]}'

You'll see that the "schema" part says there is a column named 0 (an integer), but the "data" part writes it out under the key "0" (a string).
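The mismatch can be checked with nothing but the standard library. The sketch below parses the JSON text quoted above and compares the names declared in the schema with the keys that actually appear in the data; the variable names are illustrative only.

```python
import json

# JSON text copied from the to_json(orient="table") output quoted above.
payload = (
    '{"schema":{"fields":[{"name":"index","type":"integer"},'
    '{"name":0,"type":"number"}],"primaryKey":["index"],'
    '"pandas_version":"0.20.0"},"data":[{"index":0,"0":1.0}]}'
)

parsed = json.loads(payload)
schema_names = [field["name"] for field in parsed["schema"]["fields"]]
data_keys = list(parsed["data"][0].keys())

# Schema names with no equal key in the data record.
# The integer 0 does not compare equal to the string "0".
unmatched = [name for name in schema_names if name not in data_keys]

print(schema_names)  # ['index', 0]
print(data_keys)     # ['index', '0']
print(unmatched)     # [0]
```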

Maybe we should ensure that all of the names defined in the schema are strings, for strict JSON compliance. There may be other solutions as well.
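As a rough sketch of that idea applied outside pandas, the snippet below post-processes the JSON text so every schema field name becomes a string. The helper `stringify_schema_names` is hypothetical and is not part of pandas; it only shows that, after the conversion, the schema names line up with the data keys.

```python
import json


def stringify_schema_names(table_json: str) -> str:
    """Hypothetical helper: coerce every schema field name to str."""
    doc = json.loads(table_json)
    for field in doc["schema"]["fields"]:
        field["name"] = str(field["name"])
    return json.dumps(doc)


# JSON text copied from the to_json(orient="table") output quoted above.
payload = (
    '{"schema":{"fields":[{"name":"index","type":"integer"},'
    '{"name":0,"type":"number"}],"primaryKey":["index"],'
    '"pandas_version":"0.20.0"},"data":[{"index":0,"0":1.0}]}'
)

fixed = json.loads(stringify_schema_names(payload))
fixed_names = [field["name"] for field in fixed["schema"]["fields"]]

# Every schema name now has a matching key in the data record.
all_matched = all(name in fixed["data"][0] for name in fixed_names)
```

Whether the fix belongs in the writer (as suggested here) or in the reader is left open in this thread; the sketch only demonstrates the string-name variant.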

ChristopherDavisUCI · 2022-06-27T08:59:22Z

In the example above, would it make sense to have "name":"0" instead of "name":0? I feel like that would be in line with what happens in df.to_json(orient="columns"). (I was picturing converting the column name to a string. I think that's also what @WillAyd meant.)

I'd be happy to try to contribute if there is something in this direction that would be helpful.

I had one other naive question about this method (I'm not sure how related it is). Can the schema returned by df.to_json(orient="table") be used on a website like JSON Schema validator? I haven't been able to get it to work, but maybe this returned schema serves a different purpose.

Here is an example:

import pandas as pd
import json

df = pd.DataFrame([[7.3]])

result = df.to_json(orient="table", index=False)
parsed = json.loads(result)

print(json.dumps(parsed, indent=4))

The code above displays:

{
    "schema": {
        "fields": [
            {
                "name": 0,
                "type": "number"
            }
        ],
        "pandas_version": "1.4.0"
    },
    "data": [
        {
            "0": 7.3
        }
    ]
}

The closest thing I've gotten to work with JSON Schema validator is the following:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "0": {
        "type": "number"
      }
    }
  }
}

against which the following validates successfully:

[
    {
        "0": 7.3
    }
]
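For what it's worth, the pandas documentation describes the "schema" block of orient="table" as a Table Schema, which is a different format from JSON Schema, so a JSON Schema validator is not expected to accept it directly. The sketch below is one hypothetical way to derive a JSON Schema like the hand-written one above from the Table Schema fields. The helper name and the type mapping are assumptions: only the four type names that both formats share are carried over, and other Table Schema types are left unconstrained.

```python
import json

# Type names that mean the same thing in Table Schema and JSON Schema.
SHARED_TYPES = {"number", "integer", "string", "boolean"}


def table_schema_to_json_schema(table_schema: dict) -> dict:
    """Hypothetical helper: build a JSON Schema for the "data" array."""
    properties = {}
    for field in table_schema["fields"]:
        prop = {}
        if field.get("type") in SHARED_TYPES:
            prop["type"] = field["type"]
        # JSON object keys are always strings, so coerce the name.
        properties[str(field["name"])] = prop
    return {
        "type": "array",
        "items": {"type": "object", "properties": properties},
    }


# The "schema" block from the output shown above.
table_schema = {
    "fields": [{"name": 0, "type": "number"}],
    "pandas_version": "1.4.0",
}

json_schema = table_schema_to_json_schema(table_schema)
print(json.dumps(json_schema, indent=2))
```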

jmg-duarte · 2022-09-10T10:50:43Z

This issue is related to #46392. The problem lies in the fact that JSON keys must be strings.

jmg-duarte · 2022-09-10T11:07:11Z

And so is #38256

ChiQiao added the Bug and Needs Triage labels Mar 29, 2021

jbrockmendel added the IO JSON label and removed the Needs Triage label Apr 10, 2021

coatless mentioned this issue Dec 5, 2022

Preserve Pandas Period Columns in pl.to_json() and handle exporting NaN values PrairieLearn/PrairieLearn#6501

Closed
