
Modest performance, address #12647 #12656

Closed
wants to merge 19 commits

Conversation

kshedden
Contributor

closes #12659
closes #12654
closes #12647
closes #12809

Major performance improvements through use of Cython

@jreback
Contributor

jreback commented Mar 17, 2016

@kshedden you probably know this, but all of the perf issues have to do with going back and forth between Cython and Python. The ideal would be to put the entire loop in Cython, rather than keeping the loop in Python with a call into Cython to parse each line.
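
A minimal sketch of the pattern being suggested (illustrative only, not pandas' actual code): keep the whole row loop in Cython, so the Python/Cython boundary is crossed once per chunk instead of once per row, and the per-row helper becomes a cheap cdef call.

    import numpy as np

    cdef double _parse_row(double[:] row):
        # stand-in for the real per-row parsing work
        cdef double total = 0
        cdef Py_ssize_t j
        for j in range(row.shape[0]):
            total += row[j]
        return total

    def read_chunk(double[:, :] data):
        # one Python-level entry per chunk; the loop and the call to
        # _parse_row compile to plain C with no Python dispatch
        cdef Py_ssize_t i
        out = np.empty(data.shape[0], dtype=np.float64)
        cdef double[:] out_view = out
        for i in range(data.shape[0]):
            out_view[i] = _parse_row(data[i])
        return out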

@jreback jreback added Performance Memory or execution speed performance IO SAS SAS: read_sas labels Mar 17, 2016
@kshedden
Contributor Author

Actually, I didn't know that there was that much function call overhead for Cython. I am just passing scalar values and memoryview slices (plus the parser object, but we could refactor that out if that is the problem).


@jreback
Contributor

jreback commented Mar 17, 2016

Yeah, because everything has to be checked every time. It's not what you are passing, but the function call itself. You should be able to move the entire loop, and you will get pretty good speedups.

@kshedden
Contributor Author

This should resolve #12659, #12654, #12647. Also adds test coverage with two new files.

The tests pass on most setups, but there is one core dump on Travis. I'm not sure if that is related to this PR.

There are also some performance enhancements here in the 2x-4x range.

@gdementen @jreback @ywhuofu @benjello @raderaj

break


def _readline(parser):
Contributor

make this a cdef

Contributor

cdef bint _readline
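
A sketch of what that suggestion might look like; the body and attribute names here are hypothetical stand-ins, not the parser's real logic:

    cdef bint _readline(object parser):
        # declaring the function cdef with a C bint return lets other
        # Cython code call it with no Python call overhead
        return parser.current_row_in_file_index < parser.row_count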

@jreback jreback added this to the 0.18.1 milestone Mar 19, 2016
def process_byte_array_with_data(parser, int offset, int length,
                                 np.ndarray[uint8_t, ndim=2] byte_chunk,
                                 np.ndarray[dtype=object, ndim=2] string_chunk):

def _do_read(parser, int nrows):
Contributor

you don't have to, but it's good to do: _do_read(object parser, int nrows):

@kshedden
Contributor Author

kshedden commented Apr 9, 2016

@jreback I'm having trouble working with object dtype arrays in Cython; I would appreciate your help.

I have the following code, where source is a memoryview slice copied directly from the SAS file, which I need to slice into smaller chunks corresponding to individual data values. I am able to do something similar without trouble for float64 types (not shown below), but I can't get it to work for variable-length byte arrays. These byte arrays will eventually become strings, but I don't want to do the conversion here because we give the user the option to retain the raw bytes. The error message I am getting is copied below the code.

I have also tried typing string_chunk as object[:, ::1], but that didn't work either.

cdef void process_byte_array_with_data(object parser, int offset, int length,
                                       uint8_t[:, ::1] byte_chunk,
                                       np.ndarray string_chunk):
    # ...
    bvec = bytearray(source[start:start+lngt])
    string_chunk[js, parser._current_row_in_chunk_index] = bvec  # raises here
BufferError: Object is not writable.
Exception ignored in: 'pandas.io.sas.saslib.process_byte_array_with_data'
Traceback (most recent call last):
  File "stringsource", line 616, in View.MemoryView.memoryview_cwrapper (pandas/io/sas/saslib.c:14511)
  File "stringsource", line 323, in View.MemoryView.memoryview.__cinit__ (pandas/io/sas/saslib.c:10880)

@jreback
Contributor

jreback commented Apr 9, 2016

docs are here

I think you want something like:

def process_byte_data(unsigned char[:] data):
    length = data.shape[0]    # number of bytes in the view
    first_byte = data[0]      # indexing yields an integer byte value
    slice_view = data[1:-1]   # slicing returns a view, no copy

so your data would be your source; it's just a bunch of bytes, and you can slice them directly. Then you can cast them to what you need (you need to allocate memory and such). This is very much like working directly in C.

Note that you are always working with bytes, NOT unicode/str; you can decode things later.

HTH, and lmk.
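
For reference, the signature that eventually appears in this PR (see the process_byte_array_with_data diff above) resolves the object-array side of the question by typing string_chunk with ndarray buffer syntax rather than a memoryview, since object arrays cannot back a typed memoryview. A sketch of that pattern, with a hypothetical function name:

    cimport numpy as cnp

    cdef void store_value(cnp.uint8_t[:] source, int start, int lngt,
                          cnp.ndarray[object, ndim=2] string_chunk,
                          int js, int row):
        # the object ndarray is indexed directly rather than exposed
        # as a memoryview, which is what raised the BufferError above;
        # bytes(...) copies the slice out of the uint8 view
        string_chunk[js, row] = bytes(source[start:start + lngt])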

@kshedden
Contributor Author

@jreback I think this is ready now; the known bugs and performance issues should be resolved.

@jreback
Contributor

jreback commented Apr 11, 2016

can you post a perf comparison? (also, I haven't looked, but do we have an asv benchmark for this?)

@kshedden
Contributor Author

We don't have an asv benchmark; I tried setting one up but failed (I don't have a Continuum distro and couldn't get it to work with virtualenv).

A rough timing on a file with 100K rows: the released version (v0.18) takes around 11 seconds, while this version takes around 5 seconds. However, CSV reading is about 50x faster, so there is still a ways to go.

@jreback
Contributor

jreback commented Apr 17, 2016

how is this coming?

@kshedden
Contributor Author

I think it is ready to go; the known bugs are fixed and the performance is at least somewhat better.

@@ -191,7 +191,7 @@ Deprecations
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~


- Improved speed of SAS reader (PR 12656)
Contributor

write the issue number like the others (it doesn't matter that it's a PR number)

@jreback
Contributor

jreback commented Apr 17, 2016

a PEP8 check failed:

git diff master | flake8 --diff

@jreback
Contributor

jreback commented Apr 21, 2016

did you look at the refactor I posted?

@kshedden
Contributor Author

Do you mean this one?

#12656 (comment)


@jreback
Contributor

jreback commented Apr 22, 2016

kshedden#1

@kshedden
Contributor Author

Thanks; it runs clean locally, waiting on Travis now. The only change I made is that I left a few things in the decompressors as int type due to overflow potential.

@jreback
Contributor

jreback commented Apr 22, 2016

btw, I only had a really small benchmark file; you probably have better ones.

# Loop until a data row is read
while True:
    if self.parser._current_page_type == const.page_meta_type:
        flag = (self.current_row_on_page_index >=
Contributor

FYI, there is a fair amount more that could be typed here, but it would involve some rewriting.
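
A sketch of the kind of typing meant here (standalone and illustrative; the real change would restructure the loop above):

    cdef bint _page_done(int current_page_type, int page_meta_type,
                         int row_index, int rows_on_page):
        # with all inputs typed as C ints, the comparison compiles to
        # plain C with no Python object checks
        if current_page_type == page_meta_type:
            return row_index >= rows_on_page
        return False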

@kshedden
Contributor Author

For the test file I have been using, I am now getting 3 seconds per 100K lines, versus 11 seconds per 100K lines with version 0.18.0.

Also, I'm not aware of any files that fail under this version, although in some cases the encoding will need to be specified manually.

@jreback
Contributor

jreback commented Apr 22, 2016

oh that sounds great!

oh, so you are able to infer encodings in some instances? worth mentioning in the docs / docstring?

are there any other updates for the docstring / docs?

@jreback
Contributor

jreback commented Apr 22, 2016

couple of issues on Windows. going to fix them up.

@jreback
Contributor

jreback commented Apr 22, 2016

the airlines.sas7bdat and .csv files seem to be missing; can you push them up in another commit?

@kshedden
Contributor Author

done


@jreback
Contributor

jreback commented Apr 22, 2016

ty. not sure why that didn't fail the tests, but oh well.

@jreback jreback closed this in 33683cc Apr 22, 2016
@jreback
Contributor

jreback commented Apr 22, 2016

thanks @kshedden great effort!!!!

@jreback
Contributor

jreback commented Apr 22, 2016

I may come back around with some more perf improvements in the Cython code when I have some time.
