Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.json_normalize doesn't return data with index from series #51452

Closed
regularised opened this issue Feb 17, 2023 · 4 comments · Fixed by #57422
Closed

pd.json_normalize doesn't return data with index from series #51452

regularised opened this issue Feb 17, 2023 · 4 comments · Fixed by #57422
Labels
IO JSON read_json, to_json, json_normalize

Comments

@regularised
Copy link

TLDR;

pd.json_normalize creates a new index for the normalised series.

Issue

pd.json_normalize returns a new index for the normalised series, rather than the one supplied. Given the other methods in pandas, this seems like it violates the 'principal of least surprise'? This caught me out for longer than I would care to admit :D

Minimum working example:

import pandas as pd
from IPython.display import display
data = [
    {'col1':1, 'col2':'positive', 'json_data':{'a':1, 'b':2, 'c':3}},
    {'col1':2, 'col2':'negative', 'json_data':{'a':-1, 'b':-2, 'c':-3}}
]
df1 = pd.DataFrame(data)
display(df1)
df2 = df1.sort_values('col1', ascending=False)
display(df2)

image

df1j = pd.json_normalize(df1['json_data'])
df2j = pd.json_normalize(df2['json_data'])
display(df1.join(df1j))
display(df2.join(df2j))

image

The join on the two indices provides and unexpected result as the two rows are now no-longer consistent.

When looking at the data returned from json_normalize, we can see that the index of the returned data has been reset, which ultimately means the join is not as expected.

display(df1j)
display(df2j)

image

@jbrockmendel jbrockmendel added the IO JSON read_json, to_json, json_normalize label May 3, 2023
@marickmanrho
Copy link

It sure looks weird, but I don't think it is a bug. You perform the join on two dataframes with the rows interchanged due to the sort you performed. If you do this:

df1j = pd.json_normalize(df1['json_data'].sort_index())
df2j = pd.json_normalize(df2['json_data'].sort_index())
display(df1.join(df1j))
display(df2.join(df2j))

you'll see it is again as expected. The behavior you described happens because the itteration over a sorted DataFrame returns the items in the sorted order and not in the order of the index.

By the way, it is surprising to me that this works at all, as json_normalized is only documented to work with a dict or a list of dicts.

@pawin35
Copy link

pawin35 commented Jun 22, 2023

I encountered a similar issue while using pd.json_normalize with a filtered DataFrame. It seems that the resulting json normalized series does not retain the original index. Since filtering is a common operation in data analysis, it would be great if json_normalize could preserve the original index to ensure consistency in the data when performing subsequent operations.
Here is the minimal code to illustrate:

import pandas as pd

data = [
    {"name": "Peter", "content": {"text": "Hello", "time": "2021-01-01 00:00:00"}},
    {"name": "Mark", "content": {"text": "Hi", "time": "2021-01-01 00:00:01"}},
    {"name": "Peter", "content": {"text": "How are you?", "time": "2021-01-01 00:00:02"}}
]

df = pd.DataFrame(data)
print(df) # df has index 0,1,2

# filter out mark
df = df[df["name"] != "Mark"]
print(df) # df has index 0,2

# json_normalize
js = pd.json_normalize(df["content"])
print(js) # js has index 0,1

@marickmanrho
Copy link

The current implementation of json_normalize() doesn't allow you to supply a DataFrame object. Hence, I'd make this a feature request which asks for json_normalize() to parse DataFrame objects.

@JMBurley
Copy link
Contributor

JMBurley commented Nov 9, 2023

The root problem here is that python duck-typing allows any array-like-of-dicts to pass through json_normalise successfully. It will iterate over them in order and then generate a new index on the output dataframe 0:N-1.

Out in the real world, I bet any amount of money that the most common use-case for json_normalise is a Pandas series:

pd.json_normalise(df['some_col_full_of_dicts'])

And principle of least surprise should be to persist the index on an incoming Pandas series. It is a bug in the existing code that should not have made it into prod. But unfortunately fixing that is a breaking change, so for now we should have a toggleable param keep_index:bool = False

proposed solution:

  • pd.json_normalise: add pd.Series to the valid input params (it already works without warning).
  • pd.json_normalise: add param to signature keep_index:bool = False, and implement the code + add tests. This is trivial

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants