pd.json_normalize doesn't return data with index from series #51452

regularised · 2023-02-17T10:21:44Z

TLDR;

pd.json_normalize creates a new index for the normalised series.

Issue

pd.json_normalize returns a new index for the normalised series, rather than the one supplied. Given the other methods in pandas, this seems like it violates the 'principal of least surprise'? This caught me out for longer than I would care to admit :D

Minimum working example:

import pandas as pd
from IPython.display import display
data = [
    {'col1':1, 'col2':'positive', 'json_data':{'a':1, 'b':2, 'c':3}},
    {'col1':2, 'col2':'negative', 'json_data':{'a':-1, 'b':-2, 'c':-3}}
]
df1 = pd.DataFrame(data)
display(df1)
df2 = df1.sort_values('col1', ascending=False)
display(df2)

df1j = pd.json_normalize(df1['json_data'])
df2j = pd.json_normalize(df2['json_data'])
display(df1.join(df1j))
display(df2.join(df2j))

The join on the two indices provides and unexpected result as the two rows are now no-longer consistent.

When looking at the data returned from json_normalize, we can see that the index of the returned data has been reset, which ultimately means the join is not as expected.

display(df1j)
display(df2j)

The text was updated successfully, but these errors were encountered:

marickmanrho · 2023-05-07T10:29:25Z

It sure looks weird, but I don't think it is a bug. You perform the join on two dataframes with the rows interchanged due to the sort you performed. If you do this:

df1j = pd.json_normalize(df1['json_data'].sort_index())
df2j = pd.json_normalize(df2['json_data'].sort_index())
display(df1.join(df1j))
display(df2.join(df2j))

you'll see it is again as expected. The behavior you described happens because the itteration over a sorted DataFrame returns the items in the sorted order and not in the order of the index.

By the way, it is surprising to me that this works at all, as json_normalized is only documented to work with a dict or a list of dicts.

pawin35 · 2023-06-22T09:39:36Z

I encountered a similar issue while using pd.json_normalize with a filtered DataFrame. It seems that the resulting json normalized series does not retain the original index. Since filtering is a common operation in data analysis, it would be great if json_normalize could preserve the original index to ensure consistency in the data when performing subsequent operations.
Here is the minimal code to illustrate:

import pandas as pd

data = [
    {"name": "Peter", "content": {"text": "Hello", "time": "2021-01-01 00:00:00"}},
    {"name": "Mark", "content": {"text": "Hi", "time": "2021-01-01 00:00:01"}},
    {"name": "Peter", "content": {"text": "How are you?", "time": "2021-01-01 00:00:02"}}
]

df = pd.DataFrame(data)
print(df) # df has index 0,1,2

# filter out mark
df = df[df["name"] != "Mark"]
print(df) # df has index 0,2

# json_normalize
js = pd.json_normalize(df["content"])
print(js) # js has index 0,1

marickmanrho · 2023-06-23T07:23:09Z

The current implementation of json_normalize() doesn't allow you to supply a DataFrame object. Hence, I'd make this a feature request which asks for json_normalize() to parse DataFrame objects.

JMBurley · 2023-11-09T20:35:04Z

The root problem here is that python duck-typing allows any array-like-of-dicts to pass through json_normalise successfully. It will iterate over them in order and then generate a new index on the output dataframe 0:N-1.

Out in the real world, I bet any amount of money that the most common use-case for json_normalise is a Pandas series:

pd.json_normalise(df['some_col_full_of_dicts'])

And principle of least surprise should be to persist the index on an incoming Pandas series. It is a bug in the existing code that should not have made it into prod. But unfortunately fixing that is a breaking change, so for now we should have a toggleable param keep_index:bool = False

proposed solution:

pd.json_normalise: add pd.Series to the valid input params (it already works without warning).
pd.json_normalise: add param to signature keep_index:bool = False, and implement the code + add tests. This is trivial

jbrockmendel added the IO JSON read_json, to_json, json_normalize label May 3, 2023

nworb-cire mentioned this issue Feb 14, 2024

ENH: Preserve Series index on json_normalize #57422

Merged

5 tasks

mroeschke closed this as completed in #57422 Feb 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pd.json_normalize doesn't return data with index from series #51452

pd.json_normalize doesn't return data with index from series #51452

regularised commented Feb 17, 2023

marickmanrho commented May 7, 2023

pawin35 commented Jun 22, 2023

marickmanrho commented Jun 23, 2023

JMBurley commented Nov 9, 2023 •

edited

Loading

pd.json_normalize doesn't return data with index from series #51452

pd.json_normalize doesn't return data with index from series #51452

Comments

regularised commented Feb 17, 2023

TLDR;

Issue

Minimum working example:

marickmanrho commented May 7, 2023

pawin35 commented Jun 22, 2023

marickmanrho commented Jun 23, 2023

JMBurley commented Nov 9, 2023 • edited Loading

JMBurley commented Nov 9, 2023 •

edited

Loading