BUG/PERF: Stata value labels #11591

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
3 participants
Contributor

kshedden commented Nov 13, 2015

closes #12014

This PR fixes a minor bug and introduces some performance enhancements, all related to value label reading in Stata files.

The bug is that when reading a Stata file incrementally, the value labels will be read even when specifying convert_categoricals=False (this does not happen when reading the entire file at once).

The performance enhancements are:

  1. Read key/value information as an ndarray using np.frombuffer rather than as a Python loop.
  2. When splitting the value labels at the offsets, we currently pass txt[off[i]:] to null_terminate, which creates many copies of large segments of a potentially large byte array. I changed it so that
    only the relevant part of the string is copied.

Relating to 2, further performance improvements might be possible since there is no trailing null byte to remove except for the last element of txt (thus some of the work in _null_terminate is superfluous).

Background: This is an issue when processing large Stata files with millions of distinct value labels.

Contributor

jreback commented Nov 13, 2015

ok, sounds gr8!.

just confirm that we have reasonable benchmarks in asv (the packers.py) file is the only one for stata.

Contributor

kshedden commented Nov 15, 2015

Is there a place to put data files for ASV to read in? The stata writer is more limited than the stata reader, so it's hard to test some features like strls that are not part of dta version 114.

Contributor

jreback commented Nov 15, 2015

you can just make a directory under the asv_benchmarks

Contributor

jreback commented Nov 25, 2015

@kshedden want to update

Contributor

kshedden commented Nov 26, 2015

I added a test but haven't been able to get asv to run it. I don't have conda so switched the asv conf file to use virtualenv. It fails when using pip to install pytables:

(py3)-bash-4.1$ pip install --upgrade pytables Collecting pytables Could not find a version that satisfies the requirement pytables (from versions: ) No matching distribution found for pytables

I'm using python 3.4.2rc1 (and also switched asv conf to use python 3.4)

Contributor

jreback commented Nov 27, 2015

asv can work with pip but needs a slighty modified config file (e.g. a different one) as pip uses tables (not that conda accepts both of these).

Contributor

jreback commented Nov 27, 2015

pls add a whatsnew note (in performance). you don't necessarily need to include an asv benchmark (though nice if its easy), but post a perf comparison at the top of the issue.

Contributor

pv commented Nov 27, 2015

Re conda vs. virtualenv, this may be of interest: spacetelescope/asv#322 (comment) spacetelescope/asv#329

Contributor

jreback commented Nov 29, 2015

@pv that's a nice feature....care to do a PR to upgrade pandas/asv.conf.json

Contributor

jreback commented Dec 11, 2015

@kshedden can you update

Contributor

jreback commented Jan 2, 2016

@kshedden can you update

Contributor

kshedden commented Jan 9, 2016

Sorry for the delay... I wasn't able to test this in ASV. I updated whatsnew, not sure what else needs to be done.

@jreback jreback commented on an outdated diff Jan 10, 2016

doc/source/whatsnew/v0.18.0.txt
@@ -416,7 +416,7 @@ Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~
- Improved performance of ``andrews_curves`` (:issue:`11534`)
-
+- Improved performance of ``StataReader``
@jreback

jreback Jan 10, 2016

Contributor

add this issue number here

Contributor

jreback commented Jan 10, 2016

looks good
can u run flake8 on the diff and fix any issue

kshedden referenced this pull request Jan 11, 2016

Closed

BUG: Stata value labels #12014

jreback added this to the 0.18.0 milestone Jan 11, 2016

@jreback jreback commented on the diff Jan 11, 2016

doc/source/whatsnew/v0.18.0.txt
@@ -440,6 +439,7 @@ Bug Fixes
- Bug in consistency of passing nested dicts to ``.groupby(...).agg(...)`` (:issue:`9052`)
- Accept unicode in ``Timedelta`` constructor (:issue:`11995`)
+- Bug in value label reading for ``StataReader`` when reading incrementally (:issue:`12014`)
@jreback

jreback Jan 11, 2016

Contributor

this is also a perf enhancement, yes? pls add a line in Performance (use this PR number for it)

Contributor

jreback commented Jan 11, 2016

minor comments. pls squash. ping when green.

@kshedden kshedden Initial commit for PR
added text to whatsnew

Update whatsnew

flake8 edits

edited whatsnew
098c805
Contributor

kshedden commented Jan 12, 2016

@jreback should be good to go

Contributor

jreback commented Jan 13, 2016

merged via 449ab6b

thanks!

jreback closed this Jan 13, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment