@parthava-adabala

This PR fixes a bug in pd.json_normalize where an AttributeError was raised if max_level was set to an integer and the input data contained NaN or other non-dict items.

The fix involves two parts:

  • Updating the if any(...) check in json_normalize to correctly trigger nested_to_record.
  • Adding a check inside nested_to_record to handle non-dict items (like nan) by treating them as empty dicts, which prevents the AttributeError.
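A minimal sketch of the second point, assuming a much-simplified one-level flattener. `flatten_records` is a hypothetical stand-in for illustration only, not the pandas implementation; it just shows how a non-dict item such as nan would be turned into an empty record instead of failing on `.items()`.

```python
import math


def flatten_records(records, sep="."):
    """Hypothetical, simplified flattener (not pandas code)."""
    flat = []
    for record in records:
        if not isinstance(record, dict):
            # Guard described above: a non-dict item (nan, None, ...)
            # has no .items(), so treat it as an empty record.
            flat.append({})
            continue
        out = {}
        for key, value in record.items():
            if isinstance(value, dict):
                # Flatten one level of nesting into "parent.child" keys.
                for sub_key, sub_value in value.items():
                    out[f"{key}{sep}{sub_key}"] = sub_value
            else:
                out[key] = value
        flat.append(out)
    return flat


rows = flatten_records([{"a": {"b": 1}}, math.nan])
```

With this guard the nan entry silently becomes `{}`, so the resulting frame would get an all-missing row for it.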

@parthava-adabala
Author

pre-commit.ci autofix

Member

@rhshadrach left a comment

Thanks for the PR!

Comment on lines 121 to 122
new_ds.append({})
continue
Member

The json_normalize function is type hinted as "dict or list of dicts". It seems to me if this is not adhered to, the method should raise instead of silently ignoring entries.

Author

Thanks for the review, @rhshadrach. I see that this violates the "dict or list of dicts" requirement.

So ideally it should raise a TypeError regardless of whether max_level is set. In that case, I'm thinking of adding a new validation check at the top of the json_normalize function.

Author

I've updated the PR based on your feedback.

The function now raises a TypeError if data is a list containing any non-dict items. The check runs before either the max_level=None or the max_level=0 path is taken. I have also updated the corresponding tests and the whatsnew docs.
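As a rough sketch of the kind of up-front check described here (the helper name `validate_records` and the error message are assumptions for illustration, not the exact code in the PR):

```python
def validate_records(data):
    """Hypothetical helper: reject lists that contain non-dict items."""
    if isinstance(data, list):
        for item in data:
            if not isinstance(item, dict):
                # Assumed message wording; the real one may differ.
                raise TypeError(
                    "All items in data must be of type dict, "
                    f"found {type(item).__name__}"
                )


# A list of dicts passes silently.
validate_records([{"id": 1}, {"id": 2}])

# A list containing nan (a float) is rejected before any flattening happens.
try:
    validate_records([{"id": 1}, float("nan")])
except TypeError as err:
    message = str(err)
```

Running the check once at the top means both code paths see the same behaviour, at the cost of one extra pass over the list.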

Member

@rhshadrach left a comment

Looks good - just a small request on simplifying the test.

I'd like to get another pair of eyes here. I don't love the O(n) validation, but I don't see a better approach, and the check takes only about 1% of the runtime without it.

Timings
import pandas as pd

# Baseline: normalize 10,000 flat records (run in IPython for %timeit)
d = [{"id": 12, "size": 20} for _ in range(10_000)]
%timeit pd.json_normalize(d, max_level=0)
# 16.5 ms ± 127 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Stand-alone cost of the O(n) type check
def validate(data):
    for item in data:
        if not isinstance(item, dict):
            raise TypeError

%timeit validate(d)
# 167 μs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@rhshadrach requested a review from mroeschke October 27, 2025 20:40
@rhshadrach added the Error Reporting (Incorrect or improved errors from pandas) and IO JSON (read_json, to_json, json_normalize) labels Oct 27, 2025
Member

@rhshadrach left a comment

lgtm

@rhshadrach added this to the 3.0 milestone Oct 28, 2025

Development

Successfully merging this pull request may close these issues.

BUG: json_normalize doesn't handle nan well when max_level=n

3 participants