
PERF: json support for blocks GH9037 #9130

Merged (1 commit) on Dec 24, 2014

Conversation

@Komnomnomnom (Contributor)

This adds block support to the JSON serialiser, as per #9037. I also added code to directly cast and serialise numpy data, which replaces the previous use of intermediate Python objects.

Large performance improvement (~25x) for mixed frames containing datetimes / timedeltas.

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
packers_write_json_mixed_delta_int_tstamp    |  97.1374 | 2633.7843 |   0.0369 |
packers_write_json_mixed_float_int_T         |  68.7390 |  86.4150 |   0.7955 |
packers_write_json_date_index                |  68.9886 |  83.2930 |   0.8283 |
packers_write_json_T                         |  61.1283 |  72.9477 |   0.8380 |
packers_write_json                           |  60.7293 |  71.3053 |   0.8517 |
packers_read_json_date_index                 | 157.4903 | 161.6703 |   0.9741 |
packers_read_json                            | 157.4220 | 157.8477 |   0.9973 |
packers_write_json_mixed_float_int           |  98.1897 |  98.1696 |   1.0002 |
packers_write_json_mixed_float_int_str       |  84.0390 |  83.6937 |   1.0041 |
-------------------------------------------------------------------------------

Some questions, any comments appreciated:

  1. There's a little overhead in dealing with blocks, so when the frame is 'simple' I avoid them and serialise values directly. I'm using the BlockManager methods _is_single_block and is_mixed_dtype to check for a simple frame.
  2. I'm using the DataFrame's _data attr and mgr_locs to get access to the block data and the block-to-column mapping. Are there any caveats to this? I know the DataFrame does some caching, but I'm not familiar enough with the details.
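To make the second question concrete, here is a small sketch of the block layout being discussed. Note that `_mgr`/`_data` and `mgr_locs` are private pandas internals (the attribute names have shifted between versions), so this is illustrative only:

```python
import pandas as pd

# Illustrative only: _mgr/_data and mgr_locs are private pandas internals,
# but they expose the block-to-column mapping the serialiser relies on.
df = pd.DataFrame({"i": [1, 2], "f": [3.0, 4.2], "j": [5, 6]})

mgr = df._mgr if hasattr(df, "_mgr") else df._data  # attr name varies by version
layout = {str(blk.dtype): sorted(int(i) for i in blk.mgr_locs)
          for blk in mgr.blocks}
print(layout)  # e.g. {'int64': [0, 2], 'float64': [1]} -- the int columns consolidate
```

Each block holds one dtype, and `mgr_locs` maps its rows back to column positions in the frame.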

Tested locally on Python 2.7 for 32- and 64-bit Linux, and on 3.3 for 64-bit Linux. The JSON tests were run through valgrind. I'd appreciate it if someone could give it a bash on Windows before merging.

@cpcloud I also fixed a ref leak and added support for date_unit in the #9028 code.

@jreback jreback added IO JSON read_json, to_json, json_normalize Performance Memory or execution speed performance labels Dec 22, 2014
@jreback jreback added this to the 0.16.0 milestone Dec 22, 2014
@jreback (Contributor) commented Dec 22, 2014

@Komnomnomnom

  1. You can really just use _is_mixed_type. _is_single_block distinguishes between ndim==1 and ndim>1 (e.g. a Series and a DataFrame).

The key is that .values on a non-mixed type (e.g. a Series or a single-dtyped DataFrame) is free (no copying of data), and you will get back a single dtype.

  2. _data is guaranteed to exist on a PandasObject (e.g. Series/DataFrame), as is mgr_locs (which is also correct to use); it handles the mapping from position in the block to the column names.
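The `.values` behaviour described above can be illustrated with a quick check: a single-dtype frame hands back a homogeneous array as-is, while a mixed frame has to materialise a common dtype (here the ints are upcast to float):

```python
import pandas as pd

# Single-dtype frame: .values returns the homogeneous block directly.
single = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Mixed frame: .values must build a common dtype, upcasting the ints.
mixed = pd.DataFrame({"i": [1, 2], "f": [3.0, 4.2]})

print(single.values.dtype)  # float64 (no conversion needed)
print(mixed.values.dtype)   # float64 (ints upcast to fit)
```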

lots of code changes! I will give a test on windows and let you know.

Looks good though.

Has the serialization order changed at all? Does it matter if it does? IOW for some formats the orderings might be different now.

@Komnomnomnom (Contributor, Author)

Thanks @jreback. I used ._data.mgr_locs rather than .blocks so I could preserve the column order (also, I think .blocks on the frame returns a copy?).

I haven't modified (or added) any tests, so the existing ones should enforce that serialisation order hasn't changed. Maybe I'll add a compat test though, just to be sure.

FWIW, the valgrind run for these changes is clean (but I'll be submitting a PR soon to fix an unrelated segfault in the code handling Python datetime.time objects).

@jreback (Contributor) commented Dec 23, 2014

this passes everything for me on windows, so good to go for me. lmk when you are satisfied with the tests.

aside: https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_json/test_ujson.py#L120 fails on windows (on master), but I wonder if it's just a precision issue in the test (the values agree to something like 1 part in 100 digits), so it would pass an .allclose-style test
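The tolerance-based comparison suggested here would look something like the sketch below (stdlib `json` stands in for pandas' bundled ujson; the point is comparing decoded floats within precision rather than exactly):

```python
import json
import math

# Compare a decoded extreme value with a relative tolerance instead of
# exact equality, since decoders may differ in the last few bits.
decoded = json.loads("1e-100")
assert math.isclose(decoded, 1e-100, rel_tol=1e-15)
```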

@jreback (Contributor) commented Dec 23, 2014

cc @cpcloud

@cpcloud (Member) commented Dec 23, 2014

+1000 here nice work. Sorry about the leak!

@cpcloud (Member) commented Dec 23, 2014

@Komnomnomnom

What are your thoughts on a round-trippable JSON orient (maybe "roundtrip")? I.e. provide enough metadata to reconstruct a frame or series with 100% fidelity.
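A minimal sketch of what such a roundtrip could look like, building on the `split` orient plus a dtype side-table. The function names and the `_meta` key are hypothetical, not pandas API:

```python
import json
import pandas as pd

def to_json_roundtrip(df):
    # Hypothetical: 'split' payload plus a _meta entry recording dtypes.
    payload = json.loads(df.to_json(orient="split"))
    payload["_meta"] = {"dtypes": {c: str(t) for c, t in df.dtypes.items()}}
    return json.dumps(payload)

def read_json_roundtrip(s):
    # Rebuild the frame, then restore per-column dtypes from the metadata.
    payload = json.loads(s)
    meta = payload.pop("_meta")
    df = pd.DataFrame(payload["data"],
                      index=payload["index"],
                      columns=payload["columns"])
    return df.astype(meta["dtypes"])

df = pd.DataFrame({"i": [1, 2], "f": [3.0, 4.2]})
restored = read_json_roundtrip(to_json_roundtrip(df))
assert restored.equals(df)
```

Whether dtypes alone are enough metadata (index types, names, etc. may also matter) is exactly the open question in the thread.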

@cpcloud (Member) commented Dec 23, 2014

Alternatively, how about exposing dumps at the top level, so folks could roll their own?

@Komnomnomnom (Contributor, Author)

@cpcloud a roundtrip orient sounds like a good idea; there was some related discussion in #4889, but I'm not keen on adding metadata into all the orients. I do like the notion of a new orient with the same format as split plus an additional _meta entry with info on the blocks and their dtypes. Would it need any other info to roundtrip properly? I've been thinking about a couple more orients too (from #8333 and #5729).

I also like the idea of exposing json at the top level, since you can easily give it more than just frames and series; it will happily process numpy arrays, pandas indices and other Python types quite efficiently. Maybe pd.to_json?
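A user-space stand-in for such a top-level dumps can be rolled with stdlib `json` and a `default` hook for numpy types. The name `dumps_any` is illustrative only, not a pandas function:

```python
import json
import numpy as np

def dumps_any(obj):
    # Hypothetical top-level dumps: fall back to conversions for numpy
    # scalars/arrays and anything exposing to_dict (frames, series).
    def default(o):
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        if isinstance(o, np.ndarray):
            return o.tolist()
        if hasattr(o, "to_dict"):
            return o.to_dict()
        raise TypeError("unserialisable type: %r" % type(o))
    return json.dumps(obj, default=default)

print(dumps_any(np.array([1, 2])))  # [1, 2]
```

The C serialiser in the PR is of course far faster; this only shows the interface idea.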

@Komnomnomnom (Contributor, Author)

@jreback that windows issue is weird, what do you get when you try:

In [6]: import pandas.json as ujson

In [7]: ujson.encode(1e-100)
Out[7]: '1e-100'

In [8]: ujson.decode('1e-100')
Out[8]: 1e-100

@Komnomnomnom (Contributor, Author)

Ok I've added a compat test for completeness, removed the call to _is_single_block and updated the release notes.

The only compat issue I can think of is that the json output for mixed frames will be slightly different due to integers being promoted to floats previously.

e.g. v0.15.2

In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'

this PR

In [3]:  pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
Out[3]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'

I've added an entry in the release notes about this too.
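The behaviour change above can be verified on any pandas release that includes this PR (the parsed int values come back as Python ints rather than floats):

```python
import json
import pandas as pd

# The int column now serialises as integers instead of being promoted
# to floats as in v0.15.2.
out = pd.DataFrame({"i": [1, 2], "f": [3.0, 4.2]}).to_json()
parsed = json.loads(out)
print(parsed["i"])  # {'0': 1, '1': 2} -- ints, not 1.0/2.0
```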

@@ -28,6 +28,7 @@ Backwards incompatible API changes
.. _whatsnew_0160.api_breaking:

- ``Index.duplicated`` now returns `np.array(dtype=bool)` rather than `Index(dtype=object)` containing `bool` values. (:issue:`8875`)
- ``DataFrame.to_json`` now returns accurate type serialisation for each column for frames of mixed dtype (:issue:`9037`)
@jreback (Contributor) commented on the diff:

maybe expand on this just a bit (you can in fact show the example of int/float conversions that you gave). You can do a code-block for both prior and current behavior if you want.

@Komnomnomnom (Contributor, Author) replied:

Ok, done. Thanks!

@jreback (Contributor) commented Dec 24, 2014

@Komnomnomnom minor doc comment. ping when ready and we can merge.

@jreback (Contributor) commented Dec 24, 2014

@Komnomnomnom

I tested 0.15.2 on win64 and

In [8]: ujson.decode('1e-100')
Out[8]: 1e-100

is indeed produced.

With the current PR, 9.999999999999999999998e-101 is produced instead. Very odd.

@Komnomnomnom (Contributor, Author)

Ok, I just tried this myself on win64. I used conda and this article to compile with msvc. I had to make a couple of code fixes to get it to compile, and ujson.decode('1e-100') worked fine for me.

The deserialisation code wasn't changed by this PR, and I don't think it has changed since 0.15.2. Might it be an issue with your compiler? What are you using, mingw?

@jreback (Contributor) commented Dec 24, 2014

if you conda install libpython then you can use mingw out of the box to compile.

So probably a compiler issue then. Ok, maybe we should skip this test on windows (or at least that part)?

@Komnomnomnom (Contributor, Author)

I'm not sure how conda is set up, but I've had issues before with an extension compiled with mingw when the Python core and other extensions were compiled with msvc.

@jreback (Contributor) commented Dec 24, 2014

it's ABI compatible, and works for me (I think older versions might have had that issue)

ok no biggie then

@jreback (Contributor) commented Dec 24, 2014

ok works fine with msvc.....

thanks @Komnomnomnom

jreback added a commit that referenced this pull request Dec 24, 2014
PERF: json support for blocks GH9037
@jreback jreback merged commit 9b453e0 into pandas-dev:master Dec 24, 2014
@jorisvandenbossche (Member)

@Komnomnomnom @cpcloud If you are interested in what is mentioned above (the 'roundtrip' orient, or exposing dumps or a to_json in the top-level namespace), maybe open a new issue for that?

@Komnomnomnom (Contributor, Author)

Thanks @jorisvandenbossche #9146 #9147

@Komnomnomnom Komnomnomnom deleted the json-block-support branch December 24, 2014 18:27