New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] handle } in line delimited json #14391

Closed
wants to merge 11 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@joshowen
Contributor

joshowen commented Oct 10, 2016

  • closes #14390
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

joshowen added some commits Oct 10, 2016

@joshowen joshowen changed the title from [BUG] fix for quoted special characters to [BUG] fix for quoted special characters in line delimited json Oct 10, 2016

@joshowen joshowen changed the title from [BUG] fix for quoted special characters in line delimited json to [BUG] handle } in line delimited json Oct 10, 2016

@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Oct 11, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14391 into master will increase coverage by <.01%

@@             master     #14391   diff @@
==========================================
  Files           140        140          
  Lines         50634      50639     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43173      43178     +5   
  Misses         7461       7461          
  Partials          0          0          

Powered by Codecov. Last update d98e982...edb1488

codecov-io commented Oct 11, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14391 into master will increase coverage by <.01%

@@             master     #14391   diff @@
==========================================
  Files           140        140          
  Lines         50634      50639     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43173      43178     +5   
  Misses         7461       7461          
  Partials          0          0          

Powered by Codecov. Last update d98e982...edb1488

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 11, 2016

Contributor

can you add a benchmark for lines=True in asv_bench/benchmarks/packers.py (alongside the other json ones)

Contributor

jreback commented Oct 11, 2016

can you add a benchmark for lines=True in asv_bench/benchmarks/packers.py (alongside the other json ones)

joshowen added some commits Oct 11, 2016

@joshowen

This comment has been minimized.

Show comment
Hide comment
@joshowen

joshowen Oct 11, 2016

Contributor

Not sure why that test failed,, the only change was a line of whitespace

Contributor

joshowen commented Oct 11, 2016

Not sure why that test failed,, the only change was a line of whitespace

joshowen added some commits Oct 12, 2016

self.f = '__test__.msg'
self.N = 100000
self.C = 5
self.index = date_range('20000101', periods=self.N, freq='H')

This comment has been minimized.

@jreback

jreback Oct 12, 2016

Contributor

looks like some things got repeated here

@jreback

jreback Oct 12, 2016

Contributor

looks like some things got repeated here

This comment has been minimized.

@joshowen

joshowen Oct 12, 2016

Contributor

Looks like self.N/self.C are repeated in most of these classes. Want me to clean them all up?

@joshowen

joshowen Oct 12, 2016

Contributor

Looks like self.N/self.C are repeated in most of these classes. Want me to clean them all up?

This comment has been minimized.

@jreback

jreback Oct 12, 2016

Contributor

sure that would be great (you can also make a common base class(s) if that helps as well)

@jreback

jreback Oct 12, 2016

Contributor

sure that would be great (you can also make a common base class(s) if that helps as well)

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Oct 12, 2016

Member

@joshowen you can leave it here as is. I cleaned this up in another PR (@jreback yes I know, I should merge that ...)

@jorisvandenbossche

jorisvandenbossche Oct 12, 2016

Member

@joshowen you can leave it here as is. I cleaned this up in another PR (@jreback yes I know, I should merge that ...)

This comment has been minimized.

@jreback

jreback Oct 12, 2016

Contributor

ok that's fine (though @joshowen make sure your example doesn't have dups as this is new code)

@jreback

jreback Oct 12, 2016

Contributor

ok that's fine (though @joshowen make sure your example doesn't have dups as this is new code)

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Oct 12, 2016

Member

ah, yes, it is of course OK to remove the lines in this added code that you do not need for this benchmark

@jorisvandenbossche

jorisvandenbossche Oct 12, 2016

Member

ah, yes, it is of course OK to remove the lines in this added code that you do not need for this benchmark

df = DataFrame([["foo}", "bar"], ['foo"', "bar"]], columns=['a', 'b'])
result = df.to_json(orient="records", lines=True)
expected = '{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}'
self.assertEqual(result, expected)

This comment has been minimized.

@jreback

jreback Oct 12, 2016

Contributor

can you also round trip it and user assert_frame_equal on the result (in addition to the above test)

@jreback

jreback Oct 12, 2016

Contributor

can you also round trip it and user assert_frame_equal on the result (in addition to the above test)

@jorisvandenbossche jorisvandenbossche added this to the 0.19.1 milestone Oct 12, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 12, 2016

Contributor

@joshowen can you also post a run for this benchmark (versus previous); can also do it in a %timeit as well. Just checking if any perf issues.

Contributor

jreback commented Oct 12, 2016

@joshowen can you also post a run for this benchmark (versus previous); can also do it in a %timeit as well. Just checking if any perf issues.

@joshowen

This comment has been minimized.

Show comment
Hide comment
@joshowen

joshowen Oct 12, 2016

Contributor

@jreback is there an easy way to do that? Or should I port the asv test to master and run/compare?

Contributor

joshowen commented Oct 12, 2016

@jreback is there an easy way to do that? Or should I port the asv test to master and run/compare?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 12, 2016

Contributor

you can run asv if you want, otherwise just do it in ipython (before and after), e.g. something like

In [6]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]), index=index).reset_index(drop=True)

In [7]: df.to_json('foo.json',orient='records',lines=True)

In [8]: %timeit df.to_json('foo.json',orient='records',lines=True)
1 loop, best of 3: 3.66 s per loop

In [9]: %timeit df.to_json('foo.json',orient='records')
10 loops, best of 3: 98.8 ms per loop

and looking at this, we have a BIG perf hit when lines=True (this is a separate issue).

Contributor

jreback commented Oct 12, 2016

you can run asv if you want, otherwise just do it in ipython (before and after), e.g. something like

In [6]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]), index=index).reset_index(drop=True)

In [7]: df.to_json('foo.json',orient='records',lines=True)

In [8]: %timeit df.to_json('foo.json',orient='records',lines=True)
1 loop, best of 3: 3.66 s per loop

In [9]: %timeit df.to_json('foo.json',orient='records')
10 loops, best of 3: 98.8 ms per loop

and looking at this, we have a BIG perf hit when lines=True (this is a separate issue).

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 12, 2016

Contributor
Contributor

jreback commented Oct 12, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 12, 2016

Contributor

@joshowen can you open another issue about the perf

In [15]: %prun df.to_json('foo.json',orient='records',lines=True)
         100058 function calls (100056 primitive calls) in 3.759 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.017    2.017    3.656    3.656 json.py:600(_convert_to_line_delimits)
        1    1.002    1.002    1.002    1.002 {method 'join' of 'str' objects}
        1    0.630    0.630    0.630    0.630 {numpy.core.multiarray.array}
        1    0.078    0.078    0.078    0.078 {pandas.json.dumps}
        1    0.012    0.012    0.012    0.012 {method 'write' of 'file' objects}

something odd going on here

Contributor

jreback commented Oct 12, 2016

@joshowen can you open another issue about the perf

In [15]: %prun df.to_json('foo.json',orient='records',lines=True)
         100058 function calls (100056 primitive calls) in 3.759 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.017    2.017    3.656    3.656 json.py:600(_convert_to_line_delimits)
        1    1.002    1.002    1.002    1.002 {method 'join' of 'str' objects}
        1    0.630    0.630    0.630    0.630 {numpy.core.multiarray.array}
        1    0.078    0.078    0.078    0.078 {pandas.json.dumps}
        1    0.012    0.012    0.012    0.012 {method 'write' of 'file' objects}

something odd going on here

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback
Contributor

jreback commented Oct 14, 2016

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche
Member

jorisvandenbossche commented Oct 15, 2016

Thanks @joshowen !

tworec pushed a commit to RTBHOUSE/pandas that referenced this pull request Oct 21, 2016

@joshowen joshowen referenced this pull request Nov 19, 2016

Closed

BUG: Fix for json lines issue with backslashed quotes #14693

3 of 3 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment