BUG: Handle non-dict items in json_normalize with max_level #62848
Conversation
pre-commit.ci autofix
Thanks for the PR!
pandas/io/json/_normalize.py
new_ds.append({})
continue
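For context, a minimal self-contained sketch of the handling those two quoted lines imply; the surrounding function and names here are illustrative, not the actual pandas source:

from typing import Any

def normalize_records(data: list[Any]) -> list[dict]:
    """Sketch of the originally proposed handling: non-dict items are
    replaced with an empty record and skipped instead of raising
    (illustrative only, not pandas/io/json/_normalize.py)."""
    new_ds: list[dict] = []
    for item in data:
        if not isinstance(item, dict):
            # silently substitute an empty record for non-dict items
            new_ds.append({})
            continue
        new_ds.append(item)
    return new_ds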
The json_normalize function is type hinted as "dict or list of dicts". It seems to me that if this is not adhered to, the method should raise instead of silently ignoring entries.
Thanks for the review @rhshadrach, I see it's violating the "dict or list of dicts" requirement.
So ideally it should raise a TypeError regardless of whether max_level is set or not. In that case, I'm thinking of adding a validation check at the top of the json_normalize function.
I've updated the PR based on your feedback.
The function now raises a TypeError if data is a list containing any non-dict items, enforced before either the max_level=None or max_level=0 path is taken. I have also updated the respective tests and the what's new entry.
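A minimal sketch of what such an up-front check could look like; the helper name and error message are illustrative, not the exact code added in this PR:

def validate_records(data) -> None:
    """Raise if a list input contains non-dict items (illustrative helper,
    not the actual implementation in this PR)."""
    if isinstance(data, list):
        for item in data:
            if not isinstance(item, dict):
                raise TypeError(
                    "Input data must be a dict or a list of dicts, "
                    f"got an item of type {type(item).__name__}"
                )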
Looks good - just a small request on simplifying the test.
Would like to get another eye here; I don't love O(n) validation, but I don't see a better approach, and the time it takes is about 1% of the runtime without it.
Timings
d = [{"id": 12, "size": 20} for _ in range(10_000)]
%timeit pd.json_normalize(d, max_level=0)
# 16.5 ms ± 127 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
def validate(data):
    for item in data:
        if not isinstance(item, dict):
            raise TypeError

%timeit validate(d)
# 167 μs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
lgtm
This PR fixes a bug in pd.json_normalize where an AttributeError was raised if max_level was set to an integer and the input data contained NaN or other non-dict items. An entry has been added to doc/source/whatsnew/v3.0.0.rst.
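A short reproduction of the reported behavior as described above (NaN mixed into the list with an integer max_level); the exact exception depends on the pandas version in use:

import numpy as np
import pandas as pd

data = [{"id": 1, "meta": {"x": 1}}, np.nan]

# Before this PR, the NaN entry triggered an AttributeError when max_level
# was an integer; with the change discussed here a TypeError is raised
# instead (behavior as described in this conversation, not verified against
# a released pandas version).
pd.json_normalize(data, max_level=0)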
The fix involves two parts: