[BUG] handle } in line delimited json #14391

Closed
wants to merge 11 commits into
from

Contributor

joshowen commented Oct 10, 2016 edited

  • closes #14390
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

joshowen added some commits Oct 10, 2016

@joshowen joshowen fix for quoted special characters
58074fd
@joshowen joshowen fix typo in expected output
9b150b5

joshowen changed the title from [BUG] fix for quoted special characters to [BUG] fix for quoted special characters in line delimited json Oct 10, 2016

joshowen changed the title from [BUG] fix for quoted special characters in line delimited json to [BUG] handle } in line delimited json Oct 10, 2016

@joshowen joshowen added whatsnew entry
d2724d3

codecov-io commented Oct 11, 2016 edited

Current coverage is 85.26% (diff: 100%)

Merging #14391 into master will increase coverage by <.01%

@@             master     #14391   diff @@
==========================================
  Files           140        140          
  Lines         50634      50639     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43173      43178     +5   
  Misses         7461       7461          
  Partials          0          0          

Powered by Codecov. Last update d98e982...edb1488

Contributor

jreback commented Oct 11, 2016

can you add a benchmark for lines=True in asv_bench/benchmarks/packers.py (alongside the other json ones)

joshowen added some commits Oct 11, 2016

@joshowen joshowen added asv benchmark
3a9dd5d
@joshowen joshowen added whitespace
3560a8e
Contributor

joshowen commented Oct 11, 2016

Not sure why that test failed; the only change was a line of whitespace.

joshowen added some commits Oct 12, 2016

@joshowen joshowen handle double quotes in strings
2aefa85
@joshowen joshowen fix typo
444e6c2
@joshowen joshowen fixed string formatting
be43f39
@joshowen joshowen lint
edb1488
+ self.f = '__test__.msg'
+ self.N = 100000
+ self.C = 5
+ self.index = date_range('20000101', periods=self.N, freq='H')
@jreback

jreback Oct 12, 2016

Contributor

looks like some things got repeated here

@joshowen

joshowen Oct 12, 2016

Contributor

Looks like self.N/self.C are repeated in most of these classes. Want me to clean them all up?

@jreback

jreback Oct 12, 2016

Contributor

sure, that would be great (you can also make one or more common base classes if that helps)

@jorisvandenbossche

jorisvandenbossche Oct 12, 2016

Owner

@joshowen you can leave it here as is. I cleaned this up in another PR (@jreback yes I know, I should merge that ...)

@jreback

jreback Oct 12, 2016

Contributor

ok that's fine (though @joshowen make sure your example doesn't have dups as this is new code)

@jorisvandenbossche

jorisvandenbossche Oct 12, 2016

Owner

ah, yes, it is of course OK to remove the lines in this added code that you do not need for this benchmark

+ df = DataFrame([["foo}", "bar"], ['foo"', "bar"]], columns=['a', 'b'])
+ result = df.to_json(orient="records", lines=True)
+ expected = '{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}'
+ self.assertEqual(result, expected)
@jreback

jreback Oct 12, 2016

Contributor

can you also round trip it and use assert_frame_equal on the result (in addition to the above test)
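As a rough illustration of the round trip being asked for, here is a minimal sketch that uses only the standard-library `json` module as a stand-in for pandas. The helper names `to_json_lines` and `from_json_lines` are hypothetical and are not part of this PR; the actual test would go through pandas' own writer and reader and compare frames with `assert_frame_equal`.

```python
import json

# Same two rows as the test above: one value contains "}", the other a double quote.
records = [{"a": "foo}", "b": "bar"}, {"a": 'foo"', "b": "bar"}]


def to_json_lines(rows):
    """Serialize each record on its own line, using compact separators."""
    return "\n".join(json.dumps(row, separators=(",", ":")) for row in rows)


def from_json_lines(text):
    """Parse line-delimited JSON back into a list of records."""
    return [json.loads(line) for line in text.split("\n") if line]


encoded = to_json_lines(records)
decoded = from_json_lines(encoded)

# The round trip should give back exactly what went in.
assert decoded == records
```

The point of the round trip is that a string comparison alone only proves the writer's output looks right; parsing it back confirms that each line is valid JSON on its own.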

jorisvandenbossche added this to the 0.19.1 milestone Oct 12, 2016

@joshowen joshowen remove duplicate and assert round_trip works
8b057ff
Contributor

jreback commented Oct 12, 2016 edited

@joshowen can you also post a run for this benchmark (versus previous); can also do it in a %timeit as well. Just checking if any perf issues.

@joshowen joshowen remove duplicated code
8591cb9
Contributor

joshowen commented Oct 12, 2016

@jreback is there an easy way to do that? Or should I port the asv test to master and run/compare?

Contributor

jreback commented Oct 12, 2016 edited

you can run asv if you want, otherwise just do it in ipython (before and after), e.g. something like

In [6]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]), index=index).reset_index(drop=True)

In [7]: df.to_json('foo.json',orient='records',lines=True)

In [8]: %timeit df.to_json('foo.json',orient='records',lines=True)
1 loop, best of 3: 3.66 s per loop

In [9]: %timeit df.to_json('foo.json',orient='records')
10 loops, best of 3: 98.8 ms per loop

and looking at this, we have a BIG perf hit when lines=True (this is a separate issue).

Contributor

jreback commented Oct 12, 2016

@joshowen can you open another issue about the perf

In [15]: %prun df.to_json('foo.json',orient='records',lines=True)
         100058 function calls (100056 primitive calls) in 3.759 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.017    2.017    3.656    3.656 json.py:600(_convert_to_line_delimits)
        1    1.002    1.002    1.002    1.002 {method 'join' of 'str' objects}
        1    0.630    0.630    0.630    0.630 {numpy.core.multiarray.array}
        1    0.078    0.078    0.078    0.078 {pandas.json.dumps}
        1    0.012    0.012    0.012    0.012 {method 'write' of 'file' objects}

something odd going on here
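For context on the kind of work a function like `_convert_to_line_delimits` has to do, here is a minimal sketch of turning a JSON array of records into line-delimited JSON while ignoring braces and commas that sit inside quoted strings. This is not pandas' implementation: the function name `array_to_lines` is hypothetical, and the sketch assumes compact input with no whitespace between records.

```python
import json


def array_to_lines(s):
    """Turn '[{...},{...}]' into '{...}\\n{...}', skipping delimiters inside strings."""
    s = s.strip()
    if not (s.startswith("[") and s.endswith("]")):
        return s  # not a JSON array; leave it untouched

    out = []
    depth = 0          # nesting depth of {} and [] outside of strings
    in_string = False  # True while between an opening and closing double quote
    escaped = False    # True right after a backslash inside a string

    for ch in s[1:-1]:
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; take it literally
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        elif ch == "," and depth == 0:
            ch = "\n"                # a top-level comma separates two records
        out.append(ch)

    return "".join(out)


source = json.dumps(
    [{"a": "foo}", "b": "bar"}, {"a": 'foo"', "b": "bar"}],
    separators=(",", ":"),
)
lines = array_to_lines(source)
```

A character-by-character pass in pure Python like this one can be slow on large inputs. Whether that is what the profile above is showing is an assumption, not something established in this thread.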

Contributor

jreback commented Oct 14, 2016

Thanks @joshowen!

jreback closed this in 286b9b9 Oct 15, 2016

@tworec tworec added a commit to RTBHOUSE/pandas that referenced this pull request Oct 21, 2016

@joshowen @tworec joshowen + tworec BUG: fix json with lines=True for quoted special characters
closes #14391
closes #14390
04023d2