
Modest performance, address #12647 #12656

Closed
wants to merge 19 commits

Conversation

kshedden
Contributor

closes #12659
closes #12654
closes #12647
closes #12809

Major performance improvements through use of Cython

@jreback
Contributor

jreback commented Mar 17, 2016

@kshedden you probably know this, but all of the perf issues have to do with going back and forth between Cython and Python. The ideal would be to put the entire loop in Cython, rather than keeping the loop in Python with a call into Cython to parse each line.
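
A minimal sketch of the pattern being suggested (illustrative only, not pandas' actual code): keep the whole row loop in Cython, so the Python/Cython boundary is crossed once per chunk instead of once per row, and the per-row helper becomes a cheap cdef call.

    import numpy as np

    cdef double _parse_row(double[:] row):
        # stand-in for the real per-row parsing work
        cdef double total = 0
        cdef Py_ssize_t j
        for j in range(row.shape[0]):
            total += row[j]
        return total

    def read_chunk(double[:, :] data):
        # one Python-level entry per chunk; the loop and the call to
        # _parse_row compile to plain C with no Python dispatch
        cdef Py_ssize_t i
        out = np.empty(data.shape[0], dtype=np.float64)
        cdef double[:] out_view = out
        for i in range(data.shape[0]):
            out_view[i] = _parse_row(data[i])
        return out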

@jreback jreback added Performance Memory or execution speed performance IO SAS SAS: read_sas labels Mar 17, 2016
@kshedden
Contributor Author

Actually, I didn't know that there was that much function call overhead for Cython. I am just passing scalar values and memoryview slices (plus the parser object, but we could refactor that out if that is the problem).


@jreback
Contributor

jreback commented Mar 17, 2016

Yeah, because everything has to be checked every time. It's not what you are passing, but the function call itself. You should be able to move the entire loop, and you will get pretty good speedups.

@kshedden
Contributor Author

This should resolve #12659, #12654, #12647. Also adds test coverage with two new files.

The tests pass on most setups, but there is one core dump on Travis. I'm not sure if that is related to this PR.

There are also some performance enhancements here in the 2x-4x range.

@gdementen @jreback @ywhuofu @benjello @raderaj

break


def _readline(parser):
Contributor

make this a cdef

Contributor

cdef bint _readline
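
A sketch of what that suggestion might look like; the body and attribute names here are hypothetical stand-ins, not the parser's real logic:

    cdef bint _readline(object parser):
        # declaring the function cdef with a C bint return lets other
        # Cython code call it with no Python call overhead
        return parser.current_row_in_file_index < parser.row_count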

@jreback jreback added this to the 0.18.1 milestone Mar 19, 2016
def process_byte_array_with_data(parser, int offset, int length,
                                 np.ndarray[uint8_t, ndim=2] byte_chunk,
                                 np.ndarray[dtype=object, ndim=2] string_chunk):

def _do_read(parser, int nrows):
Contributor

you don't have to, but it's good to do: _do_read(object parser, int nrows):

@kshedden
Contributor Author

kshedden commented Apr 9, 2016

@jreback I'm having trouble working with object dtype arrays in Cython; I would appreciate your help.

I have the following code, where source is a memoryview slice copied directly from the SAS file, which I need to slice into smaller chunks corresponding to individual data values. I am able to do something similar without trouble for float64 types (not shown below), but I can't get it to work for variable-length byte arrays. These byte arrays will eventually become strings, but I don't want to do the conversion here because we give the user the option to retain the raw bytes. The error message I am getting is copied below the code.

I have also tried typing string_chunk as object[:, ::1], but that didn't work either.

cdef void process_byte_array_with_data(object parser, int offset, int length,
                                       uint8_t[:, ::1] byte_chunk,
                                       np.ndarray string_chunk):
    # ...
    bvec = bytearray(source[start:start+lngt])
    string_chunk[js, parser._current_row_in_chunk_index] = bvec  # raises here
BufferError: Object is not writable.
Exception ignored in: 'pandas.io.sas.saslib.process_byte_array_with_data'
Traceback (most recent call last):
  File "stringsource", line 616, in View.MemoryView.memoryview_cwrapper (pandas/io/sas/saslib.c:14511)
  File "stringsource", line 323, in View.MemoryView.memoryview.__cinit__ (pandas/io/sas/saslib.c:10880)

@jreback
Contributor

jreback commented Apr 9, 2016

docs are here

I think you want something like:

def process_byte_data(unsigned char[:] data):
    length = data.shape[0]    # number of bytes in the view
    first_byte = data[0]      # indexing yields an integer byte value
    slice_view = data[1:-1]   # slicing returns a view, no copy

so your data would be your source; it's just a bunch of bytes, and you can slice them directly. Then you can cast them to what you need (you need to allocate memory and such). This is very much like working directly in C.

Note that you are always working with bytes, NOT unicode/str; you can decode things later.

HTH, and lmk.
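
For reference, the signature that eventually appears in this PR (see the process_byte_array_with_data diff above) resolves the object-array side of the question by typing string_chunk with ndarray buffer syntax rather than a memoryview, since object arrays cannot back a typed memoryview. A sketch of that pattern, with a hypothetical function name:

    cimport numpy as cnp

    cdef void store_value(cnp.uint8_t[:] source, int start, int lngt,
                          cnp.ndarray[object, ndim=2] string_chunk,
                          int js, int row):
        # the object ndarray is indexed directly rather than exposed
        # as a memoryview, which is what raised the BufferError above;
        # bytes(...) copies the slice out of the uint8 view
        string_chunk[js, row] = bytes(source[start:start + lngt])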

@kshedden
Contributor Author

@jreback I think this is ready now; the known bugs and performance issues should be resolved.

@jreback
Contributor

jreback commented Apr 11, 2016

can you post a perf comparison? (also, I haven't looked, but do we have an asv benchmark for this?)

@kshedden
Contributor Author

We don't have an asv benchmark; I tried setting one up but failed (I don't have a Continuum distro and couldn't get it to work with virtualenv).

A rough timing on a file with 100K rows: the released version (v0.18) takes around 11 seconds, while this version takes around 5 seconds. However, CSV reading is about 50x faster, so there is still a ways to go.

@jreback
Contributor

jreback commented Apr 17, 2016

how is this coming?

@kshedden
Contributor Author

I think it is ready to go; the known bugs are fixed and the performance is at least somewhat better.

@@ -191,7 +191,7 @@ Deprecations
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~


- Improved speed of SAS reader (PR 12656)
Contributor

write the issue number like the others (it doesn't matter that it's a PR number)

@jreback
Contributor

jreback commented Apr 17, 2016

a PEP8 check failed:

git diff master | flake8 --diff

@jreback
Contributor

jreback commented Apr 21, 2016

did you look at the refactor I posted?

@kshedden
Contributor Author

Do you mean this one?

#12656 (comment)


@jreback
Contributor

jreback commented Apr 22, 2016

kshedden#1

@kshedden
Contributor Author

Thanks; it runs clean locally, waiting on Travis now. The only change I made is that I left a few things in the decompressors as int type due to overflow potential.

@jreback
Contributor

jreback commented Apr 22, 2016

btw, I only had a really small benchmark file; you probably have better ones.

# Loop until a data row is read
while True:
    if self.parser._current_page_type == const.page_meta_type:
        flag = (self.current_row_on_page_index >=
Contributor

FYI, there is a fair amount more that could be typed here, but it would involve some rewriting.
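
A sketch of the kind of typing meant here (standalone and illustrative; the real change would restructure the loop above):

    cdef bint _page_done(int current_page_type, int page_meta_type,
                         int row_index, int rows_on_page):
        # with all inputs typed as C ints, the comparison compiles to
        # plain C with no Python object checks
        if current_page_type == page_meta_type:
            return row_index >= rows_on_page
        return False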

@kshedden
Contributor Author

For the test file I have been using, I am now getting 3 seconds per 100K lines, versus 11 seconds per 100K lines with version 0.18.0.

Also, I'm not aware of any files that fail under this version, although in some cases the encoding will need to be specified manually.

@jreback
Contributor

jreback commented Apr 22, 2016

oh that sounds great!

oh, so you are able to infer encodings in some instances? worth mentioning in the docs / docstring?

are there any other updates for the docstring / docs?

@jreback
Contributor

jreback commented Apr 22, 2016

couple of issues on Windows. going to fix them up.

@jreback
Contributor

jreback commented Apr 22, 2016

the airlines.sas7bdat and .csv files seem to be missing; can you push them up in another commit?

@kshedden
Contributor Author

done


@jreback
Contributor

jreback commented Apr 22, 2016

ty. not sure why that didn't fail the tests, but oh well.

@jreback jreback closed this in 33683cc Apr 22, 2016
@jreback
Contributor

jreback commented Apr 22, 2016

thanks @kshedden great effort!!!!

@jreback
Contributor

jreback commented Apr 22, 2016

I may come back around with some more perf improvements in the Cython code when I have some time.
