PERF: 5x speedup for read_json() with orient='index' by avoiding transpose #26773

Merged
jreback merged 1 commit into pandas-dev:master from qwhelan:read_json_speedup on Jul 17, 2019

Conversation

@qwhelan
Contributor

commented Jun 10, 2019

The .T operator can be quite slow on mixed-type DataFrames because it creates object-dtype columns. In comparison, direct construction with DataFrame.from_dict() is generally much more efficient.

Making that swap inside pd.read_json() yields a ~5-6x speedup for the orient='index' case:

       before           after         ratio
     [d47fc0cb]       [b0fd99ec]
     <read_json_speedup~1>       <read_json_speedup>
-      5.37±0.03s          907±5ms     0.17  io.json.ReadJSON.time_read_json('index', 'int')
-      5.27±0.01s          804±3ms     0.15  io.json.ReadJSON.time_read_json('index', 'datetime')
  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
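
To illustrate the dtype point in the description, here is a minimal sketch comparing the two construction routes on a small mixed-type mapping. The `records` data and variable names are made up for illustration, and the snippet is not the actual code inside `pd.read_json()`; it only shows why transposing a mixed-type frame ends up with object columns while `DataFrame.from_dict(..., orient="index")` can infer per-column dtypes.

```python
import pandas as pd

# Hypothetical index-oriented payload: outer keys are row labels,
# inner keys are column names (the shape orient='index' JSON decodes to).
records = {
    "r1": {"a": 1, "b": 1.5, "c": "x"},
    "r2": {"a": 2, "b": 2.5, "c": "y"},
}

# Route 1: build with row labels as columns, then transpose.
# Each intermediate column mixes int, float and str, so it is object dtype,
# and the transposed result keeps object dtype for every column.
via_transpose = pd.DataFrame(records).T

# Route 2: build the final layout directly, no transpose involved.
direct = pd.DataFrame.from_dict(records, orient="index")

print(via_transpose.dtypes)
print(direct.dtypes)
```

Both frames hold the same values, but only the direct route yields numeric dtypes for the numeric columns without a later conversion step.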
@WillAyd

Member

commented Jun 10, 2019

cc @TomAugspurger would this be related to #24387 at all?

@WillAyd WillAyd added the Performance label Jun 10, 2019

@WillAyd

Member

commented Jun 10, 2019

Ignore my previous comment; I was too focused on the constructor and not the transposition. This makes sense to me.

@codecov


commented Jun 10, 2019

Codecov Report

Merging #26773 into master will decrease coverage by 50.5%.
The diff coverage is 0%.

@@             Coverage Diff             @@
##           master   #26773       +/-   ##
===========================================
- Coverage   91.71%   41.21%   -50.51%     
===========================================
  Files         178      178               
  Lines       50771    50771               
===========================================
- Hits        46567    20926    -25641     
- Misses       4204    29845    +25641
Flag        Coverage Δ
#multiple   ?
#single     41.21% <0%> (-0.07%) ⬇️

Impacted Files                             Coverage Δ
pandas/io/json/json.py                     63.17% <0%> (-30.07%) ⬇️
pandas/io/formats/latex.py                 0% <0%> (-100%) ⬇️
pandas/plotting/_matplotlib/__init__.py    0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py             0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py         0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py                 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py                  0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py                  0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py                 0% <0%> (-90.1%) ⬇️
pandas/core/sparse/scipy_sparse.py         10.14% <0%> (-89.86%) ⬇️
... and 133 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d47fc0c...b0fd99e. Read the comment docs.

@alimcmaster1

Contributor

commented Jun 10, 2019

Nice! @qwhelan, mind taking a look at the test cases? (It looks like this changes the order of the index.) https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=12630

>   raise_assert_detail(obj, msg, lobj, robj)
E   AssertionError: DataFrame.columns are different
E   
E   DataFrame.columns values are different (100.0 %)
E   [left]:  Index(['A', 'B', 'C', 'D'], dtype='object')
E   [right]: Index(['D', 'C', 'B', 'A'], dtype='object')
@qwhelan

Contributor Author

commented Jun 10, 2019

@alimcmaster1 Given that this only fails on 3.5, I'm guessing this is a dict-orderedness issue in from_dict()
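
As a rough illustration of the ordering concern, the sketch below builds a frame from nested dicts whose inner keys are deliberately written in reverse order, then normalizes the column order explicitly. The `records` data is hypothetical, and this is not the fix that was applied in this PR; it only shows one way to make a comparison independent of dict ordering, which plain dicts did not guarantee on Python 3.5.

```python
import pandas as pd

# Hypothetical index-oriented data with inner keys in reverse order.
records = {
    "row1": {"D": 4, "C": 3, "B": 2, "A": 1},
    "row2": {"D": 8, "C": 7, "B": 6, "A": 5},
}

# Column order here follows whatever order the dicts are iterated in.
# Plain dicts only guarantee insertion order from Python 3.7 onward.
df = pd.DataFrame.from_dict(records, orient="index")

# Selecting columns in sorted order gives a layout that does not depend
# on dict iteration order.
normalized = df[sorted(df.columns)]

print(list(normalized.columns))
```

A test that compares against an expected frame could apply the same normalization to both sides, or compare with column order ignored.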

@jreback jreback added the IO JSON label Jun 27, 2019

@qwhelan qwhelan force-pushed the qwhelan:read_json_speedup branch 2 times, most recently from d77a2a2 to 5edd63c Jul 8, 2019

@jreback jreback added this to the 0.25.0 milestone Jul 8, 2019

@jreback

Contributor

commented Jul 8, 2019

lgtm, can you add a note in Performance for 0.25.0, ping on green.

@qwhelan qwhelan force-pushed the qwhelan:read_json_speedup branch from 5edd63c to cef3d80 Jul 8, 2019

@jreback jreback merged commit a373e0e into pandas-dev:master Jul 17, 2019

14 checks passed

codecov/patch: 100% of diff hit (target 50%)
codecov/project: 92.82% (target 82%)
continuous-integration/travis-ci/pr: The Travis CI build passed
pandas-dev.pandas: Build #20190708.27 succeeded
pandas-dev.pandas (Checks): Checks succeeded
pandas-dev.pandas (Docs): Docs succeeded
pandas-dev.pandas (Linux py35_compat): Linux py35_compat succeeded
pandas-dev.pandas (Linux py36_locale_slow): Linux py36_locale_slow succeeded
pandas-dev.pandas (Linux py36_locale_slow_old_np): Linux py36_locale_slow_old_np succeeded
pandas-dev.pandas (Linux py37_locale): Linux py37_locale succeeded
pandas-dev.pandas (Linux py37_np_dev): Linux py37_np_dev succeeded
pandas-dev.pandas (Windows py36_np15): Windows py36_np15 succeeded
pandas-dev.pandas (Windows py37_np141): Windows py37_np141 succeeded
pandas-dev.pandas (macOS py35_macos): macOS py35_macos succeeded
@jreback

Contributor

commented Jul 17, 2019

thanks @qwhelan
