BUG/PERF: Stata value labels #11591
|
ok, sounds gr8!. just confirm that we have reasonable benchmarks in |
jreback
added Performance IO Stata
labels
Nov 13, 2015
|
Is there a place to put data files for ASV to read in? The stata writer is more limited than the stata reader, so it's hard to test some features like strls that are not part of dta version 114. |
|
you can just make a directory under the |
|
@kshedden want to update |
|
I added a test but haven't been able to get asv to run it. I don't have conda so switched the asv conf file to use virtualenv. It fails when using pip to install pytables:
I'm using python 3.4.2rc1 (and also switched asv conf to use python 3.4) |
|
asv can work with pip but needs a slightly modified config file (e.g. a different one) as pip uses |
|
pls add a whatsnew note (in performance). you don't necessarily need to include an asv benchmark (though nice if it's easy), but post a perf comparison at the top of the issue. |
|
Re conda vs. virtualenv, this may be of interest: spacetelescope/asv#322 (comment) spacetelescope/asv#329 |
|
@pv that's a nice feature....care to do a PR to upgrade |
|
@kshedden can you update |
|
@kshedden can you update |
|
Sorry for the delay... I wasn't able to test this in ASV. I updated whatsnew, not sure what else needs to be done. |
jreback
commented on an outdated diff
Jan 10, 2016
|
looks good |
jreback
added this to the
0.18.0
milestone
Jan 11, 2016
jreback
commented on the diff
Jan 11, 2016
| @@ -440,6 +439,7 @@ Bug Fixes | ||
| - Bug in consistency of passing nested dicts to ``.groupby(...).agg(...)`` (:issue:`9052`) | ||
| - Accept unicode in ``Timedelta`` constructor (:issue:`11995`) | ||
| +- Bug in value label reading for ``StataReader`` when reading incrementally (:issue:`12014`) |
jreback
Contributor
|
|
minor comments. pls squash. ping when green. |
|
@jreback should be good to go |
|
merged via 449ab6b thanks! |
kshedden commented Nov 13, 2015
closes #12014
This PR fixes a minor bug and introduces some performance enhancements, all related to value label reading in Stata files.
The bug is that when reading a Stata file incrementally, the value labels will be read even when specifying convert_categoricals=False (this does not happen when reading the entire file at once).
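For reference, a minimal sketch of the two call patterns being compared, assuming a reasonably recent pandas; the data and file name are made up for illustration. With convert_categoricals=False, the labelled column should come back as plain integer codes in both the whole-file read and the chunked read.

```python
import os
import tempfile

import pandas as pd

# Made-up data: "color" is categorical, so to_stata writes it with value labels.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "color": pd.Categorical(["red", "blue", "red", "green"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")  # hypothetical file name
    df.to_stata(path, write_index=False)

    # Whole-file read, value labels not applied.
    whole = pd.read_stata(path, convert_categoricals=False)

    # Incremental read, two rows per chunk, with the same option.
    with pd.read_stata(path, chunksize=2, convert_categoricals=False) as reader:
        incremental = pd.concat(list(reader), ignore_index=True)

# Both reads should agree on the raw codes of the labelled column.
print(whole["color"].tolist() == incremental["color"].tolist())
```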
The performance enhancements are:
only the relevant part of the string is copied.
Relating to 2, further performance improvements might be possible since there is no trailing null byte to remove except for the last element of txt (thus some of the work in _null_terminate is superfluous).
Background: This is an issue when processing large Stata files with millions of distinct value labels.
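To illustrate the "copy only the relevant part" idea, here is a rough sketch in plain Python. It is not the code in this PR: read_labels is a hypothetical helper, and the layout (ascending byte offsets into one block of null-terminated label strings) is an assumption made for the example.

```python
def read_labels(offsets, values, txt):
    """Map each value to its label, slicing only that label's bytes out of txt.

    Assumes offsets are ascending byte positions into txt and that each
    label is terminated by a null byte (illustrative layout only).
    """
    n = len(offsets)
    labels = {}
    for i in range(n):
        # A label ends where the next one starts; the last one runs to the end.
        end = offsets[i + 1] if i < n - 1 else len(txt)
        chunk = txt[offsets[i]:end]
        labels[values[i]] = chunk.rstrip(b"\x00").decode("latin-1")
    return labels


# Made-up label block: three labels packed back to back.
txt = b"low\x00medium\x00high\x00"
offsets = [0, 4, 11]
values = [1, 2, 3]

labels = read_labels(offsets, values, txt)
print(labels)
```

Each iteration touches only a slice the size of one label, rather than copying everything from the offset to the end of the block, which is what matters when there are millions of labels.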