PERF: json_normalize, for basic use case #40035

Merged
merged 18 commits into from
Mar 5, 2021

Conversation

smpurkis
Contributor

@smpurkis smpurkis commented Feb 24, 2021

Proposed speedup for very simple uses of the pd.json_normalize function, e.g. pd.json_normalize(data)

The speed up can be seen in this example:

import datetime
from time import time

import pandas as pd

# example json data
data = {"hello": ["thisisatest", 999898, datetime.date.today()],
        "nest1": {"nest2": {"nest3": "nest3_value", "nest3_int": 3445}},
        "nest1_list": {"nest2": ["blah", 32423, 546456.876, 92030234]},
        "hello2": "string"}

hundred_thousand_rows = [data for i in range(100000)]

s = time()
pd.json_normalize(hundred_thousand_rows)
pandas_json_normalize_time_taken = time() - s
print(f"\npandas time taken for a 100,000 rows: {pandas_json_normalize_time_taken} seconds")

With output from Pandas 1.2.2: pandas time taken for a 100,000 rows: 3.0518009662628174 seconds
From this branch: pandas time taken for a 100,000 rows: 0.632451057434082 seconds
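
For readers unfamiliar with what the basic case has to do, the following is a minimal sketch of flattening nested dictionaries into separator-joined keys. It is not the code from this branch: `flatten_record` and `flatten_records` are hypothetical names, lists are left untouched as values, and column ordering is not addressed.

```python
from typing import Any, Dict, List


def flatten_record(
    record: Dict[str, Any], sep: str = ".", prefix: str = ""
) -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the joined key as prefix.
            flat.update(flatten_record(value, sep=sep, prefix=full_key))
        else:
            # Scalars and lists are stored as-is.
            flat[full_key] = value
    return flat


def flatten_records(records: List[Dict[str, Any]], sep: str = ".") -> List[Dict[str, Any]]:
    """Flatten every record in a list (hypothetical helper)."""
    return [flatten_record(record, sep=sep) for record in records]
```

A list of flat dictionaries like this can be passed straight to the `pd.DataFrame` constructor, which is presumably where most of the saving over the general code path would come from.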

To show that the tests pass for the relevant file, I ran pytest pandas/tests/io/json/test_normalize.py -v (log: test_normalize.py_pytest.log)

To show that pre-commit passes on the file, I ran pre-commit run --files pandas/io/json/_normalize.py (log: _normalize.py_pre-commit.log)

Running ./ci/code_checks.sh flagged one issue (log: code_checks.log):

pandas/io/json/_normalize.py:208: error: Incompatible types in assignment (expression has type "List[Union[List[Dict[Any, Any]], Dict[Any, Any]]]", variable has type "Dict[Any, Any]")  [assignment]

As it is a type hint issue, I decided to open the pull request anyway. Can you advise on how to fix it? I'm fairly new to type hints.

Kind regards,
Sam

@smpurkis
Contributor Author

smpurkis commented Feb 24, 2021

Just remembered I forgot to update the changelog, whoops

Contributor

@jreback jreback left a comment

is it possible to simply dispatch to this if the basic case is selected?

why is ordering not preserved?

do we have sufficient asv's for this? e.g. pls add the cases you are measuring.

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Feb 25, 2021
@smpurkis
Contributor Author

smpurkis commented Feb 25, 2021

@jreback

is it possible to simply dispatch to this if the basic case is selected?

If possible that would be best, but I am not familiar enough with the pandas codebase to know where to look. I have tried looking around pandas/io/json/ but can't find where dispatch configuration is set up. Any advice on where to look or how to get started would be appreciated.
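
One way to read "dispatch" here is a plain conditional at the top of the function rather than any configuration mechanism. The sketch below illustrates that idea under stated assumptions: `is_basic_call` and `normalize_sketch` are hypothetical names, the option names mirror the public `json_normalize` signature, and which options count as "basic" is a guess rather than what this PR decided.

```python
from typing import Any, Callable, Dict, List, Optional

Rows = List[Dict[str, Any]]


def is_basic_call(
    record_path: Optional[Any] = None,
    meta: Optional[Any] = None,
    meta_prefix: Optional[str] = None,
    record_prefix: Optional[str] = None,
    max_level: Optional[int] = None,
) -> bool:
    """Return True when none of the options that need the general path is set."""
    options = (record_path, meta, meta_prefix, record_prefix, max_level)
    return all(option is None for option in options)


def normalize_sketch(
    data: Rows,
    fast_path: Callable[[Rows], Rows],
    general_path: Callable[..., Rows],
    **options: Any,
) -> Rows:
    """Route to fast_path for the basic case, otherwise to general_path."""
    if is_basic_call(**options):
        return fast_path(data)
    return general_path(data, **options)
```

The two paths are passed in as callables only to keep the sketch self-contained; in a real function they would simply be two private helpers in the same module.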

why is ordering not preserved?

Oh it is, I need to update that part of the comment.

do we have sufficient asv's for this? e.g. pls add the cases you are measuring.

I do, I will add those in at the next opportunity and think of a few more cases.

What would you recommend to fix the type hint issue I have?

pandas/io/json/_normalize.py:208: error: Incompatible types in assignment (expression has type "List[Union[List[Dict[Any, Any]], Dict[Any, Any]]]", variable has type "Dict[Any, Any]")  [assignment]
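
An error of this shape typically appears when one variable is first bound to a dict, so mypy infers `Dict[Any, Any]`, and is later assigned a list. The sketch below shows one common remedy, declaring the variable with a union type before either branch assigns it. The names `NestedRecords`, `flatten_one` and `normalize_inputs` are hypothetical and this is not the code at line 208. Using two differently named variables for the dict and list results is an equally valid fix.

```python
from typing import Any, Dict, List, Union

NestedRecords = Union[Dict[str, Any], List[Dict[str, Any]]]


def flatten_one(record: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder for the per-record work; returns a shallow copy."""
    return dict(record)


def normalize_inputs(ds: NestedRecords) -> NestedRecords:
    # Declaring the union up front lets both branches assign to `result`.
    # Without the annotation, mypy infers the type from the first
    # assignment and rejects the second as an incompatible assignment.
    result: NestedRecords
    if isinstance(ds, dict):
        result = flatten_one(ds)
    else:
        result = [flatten_one(row) for row in ds]
    return result
```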

@pep8speaks

pep8speaks commented Feb 25, 2021

Hello @smpurkis! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-03 19:07:29 UTC

@smpurkis
Contributor Author

Added an asv benchmark, as there were none for the json_normalize function, although I can't run it locally as my machine isn't powerful enough.
I have also touched up a comment and a condition statement.
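
For context, an asv benchmark is an ordinary Python class whose `time_*` methods are timed once per entry in `params`, with `setup` run beforehand and excluded from the timing. The sketch below is a hypothetical benchmark for the basic case, not the one added in this PR: the class name, the `n_rows` parameter and the record contents are assumptions.

```python
try:
    import pandas as pd
except ImportError:
    pd = None


class NormalizeBasicCase:
    """Hypothetical asv benchmark for json_normalize with no options."""

    # asv runs each time_* method once per value in params.
    params = [1_000, 10_000]
    param_names = ["n_rows"]

    def setup(self, n_rows):
        # asv marks a benchmark as skipped when setup raises NotImplementedError.
        if pd is None:
            raise NotImplementedError("pandas is not installed")
        record = {
            "hello": ["thisisatest", 999898],
            "nest1": {"nest2": {"nest3": "nest3_value", "nest3_int": 3445}},
            "hello2": "string",
        }
        self.data = [record for _ in range(n_rows)]

    def time_json_normalize(self, n_rows):
        # Only this call is timed; building the input happens in setup.
        pd.json_normalize(self.data)
```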

@smpurkis
Contributor Author

The asv benchmark is failing; I've not written one before and my laptop is too slow to run them

@smpurkis
Contributor Author

asv benchmark should be working now

@WillAyd
Member

WillAyd commented Feb 26, 2021

Can you run the related JSON benchmarks and post the output of them here?

@smpurkis
Contributor Author

Running

asv continuous -f 1.1 -E virtualenv upstream/master HEAD -b json

gave this log:
asv-json-benchmark.log

@jreback
Contributor

jreback commented Feb 26, 2021

so net effect is this

     <issue-15621-improve-json-normalize-perf>       <master>  
         320±2ms        285±0.5ms     0.89  io.json.NormalizeJSON.time_normalize_json('values', 'df_int_floats')
        326±20ms          277±9ms     0.85  io.json.NormalizeJSON.time_normalize_json('records', 'df_int_floats')

?

@jreback
Contributor

jreback commented Feb 26, 2021

this doesn't seem to match your results.

@smpurkis
Contributor Author

smpurkis commented Feb 27, 2021

this doesn't seem to match your results.

Can you please explain a bit more about how you are comparing them? The test is the same, but asv sets up its own environment, which surely would change the times.
Unless asv runs the same benchmark on pandas master, I'm not sure how the results are useful.

@jreback
Contributor

jreback commented Feb 27, 2021

if you added an asv then we could see

this doesn't match your timings from the top (the ratio, not the absolute time)
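
The point about ratios can be checked with a little arithmetic. The ratio column in the asv output is the second timing divided by the first, so it does not depend on how fast the machine is. The sketch below applies the same division to the manual timings from the top of the thread and to the two asv rows quoted above; `timing_ratio` is a hypothetical helper name.

```python
def timing_ratio(first: float, second: float) -> float:
    """Ratio as shown in the asv output: second timing divided by first."""
    return second / first


# Manual timings from the top of the thread, in seconds (pandas 1.2.2 vs. branch).
manual = timing_ratio(3.0518009662628174, 0.632451057434082)

# The two asv rows quoted above, in milliseconds.
asv_values = timing_ratio(320, 285)
asv_records = timing_ratio(326, 277)

print(round(manual, 2))       # 0.21
print(round(asv_values, 2))   # 0.89
print(round(asv_records, 2))  # 0.85
```

A ratio near 0.21 corresponds to roughly a fivefold speedup, while 0.85 to 0.89 is only a modest change, which is the mismatch being pointed out.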

@smpurkis
Contributor Author

Assuming asv is comparing my forked master to main master then my addition in asv might be wrong.
Will have another look, and do some reading up on asv

@smpurkis
Contributor Author

Found the issue, my checking of the parameters was incorrect. Reran the benchmark.

       before           after         ratio
     [a241cfc6]       [5959eaab]
     <issue-15621-improve-json-normalize-perf>       <master>  
-         317±1ms         65.5±2ms     0.21  io.json.NormalizeJSON.time_normalize_json('values', 'df_date_idx')
-       317±0.5ms       65.4±0.3ms     0.21  io.json.NormalizeJSON.time_normalize_json('split', 'df_date_idx')
-         317±2ms       65.3±0.9ms     0.21  io.json.NormalizeJSON.time_normalize_json('values', 'df_td_int_ts')
-         316±2ms       65.2±0.5ms     0.21  io.json.NormalizeJSON.time_normalize_json('index', 'df_date_idx')
-         317±1ms       65.4±0.4ms     0.21  io.json.NormalizeJSON.time_normalize_json('index', 'df_int_floats')
-       316±0.9ms       65.1±0.3ms     0.21  io.json.NormalizeJSON.time_normalize_json('values', 'df')
-       315±0.8ms       64.9±0.1ms     0.21  io.json.NormalizeJSON.time_normalize_json('columns', 'df_td_int_ts')
-       316±0.4ms       65.0±0.2ms     0.21  io.json.NormalizeJSON.time_normalize_json('index', 'df')
-       316±0.6ms       64.9±0.4ms     0.21  io.json.NormalizeJSON.time_normalize_json('split', 'df_td_int_ts')
-         316±2ms       64.9±0.3ms     0.21  io.json.NormalizeJSON.time_normalize_json('split', 'df_int_floats')
-       316±0.6ms       64.8±0.2ms     0.21  io.json.NormalizeJSON.time_normalize_json('records', 'df')
-         317±1ms       65.0±0.2ms     0.21  io.json.NormalizeJSON.time_normalize_json('records', 'df_date_idx')
-         317±1ms       65.0±0.5ms     0.21  io.json.NormalizeJSON.time_normalize_json('index', 'df_int_float_str')
-       317±0.4ms       65.1±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('records', 'df_int_floats')
-       316±0.7ms       64.8±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('values', 'df_int_floats')
-       316±0.2ms       64.8±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('columns', 'df_int_floats')
-         315±1ms       64.5±0.3ms     0.20  io.json.NormalizeJSON.time_normalize_json('split', 'df_int_float_str')
-         318±1ms       64.9±0.1ms     0.20  io.json.NormalizeJSON.time_normalize_json('split', 'df')
-       317±0.9ms       64.8±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('columns', 'df_date_idx')
-         317±1ms       64.8±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('values', 'df_int_float_str')
-         318±1ms       64.8±0.3ms     0.20  io.json.NormalizeJSON.time_normalize_json('records', 'df_td_int_ts')
-       318±0.8ms       64.8±0.4ms     0.20  io.json.NormalizeJSON.time_normalize_json('records', 'df_int_float_str')
-         319±5ms       65.1±0.5ms     0.20  io.json.NormalizeJSON.time_normalize_json('columns', 'df_int_float_str')
-         317±2ms       64.6±0.2ms     0.20  io.json.NormalizeJSON.time_normalize_json('columns', 'df')
-         319±2ms       64.9±0.4ms     0.20  io.json.NormalizeJSON.time_normalize_json('index', 'df_td_int_ts')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Contributor

@jreback jreback left a comment

looks pretty good. can you add a whatsnew note in the 1.3 Perf section?

@jreback jreback added this to the 1.3 milestone Feb 27, 2021
@jreback jreback added the Performance Memory or execution speed performance label Feb 27, 2021
@smpurkis
Contributor Author

I have added the whatsnew note and made the changes you advised.
There is still the type hint issue originally picked up; I haven't found a way to correct it.

@jreback jreback merged commit c7d3e9b into pandas-dev:master Mar 5, 2021
@jreback
Contributor

jreback commented Mar 5, 2021

thanks @smpurkis very nice!

Labels
IO JSON read_json, to_json, json_normalize Performance Memory or execution speed performance
Development

Successfully merging this pull request may close these issues.

PERF: json_normalize
5 participants