to_stata + read_stata results in NaNs (close to double precision limit) #14618
Comments
These look like out-of-bounds floats (or values very close to the limit).
mverleg commented Nov 8, 2016
Yeah, it's close to the limit, but I think it should still be within the range of an IEEE double-precision float. I don't know whether Stata handles those values differently.
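To make the point above concrete, here is a minimal standard-library sketch (not taken from the thread) showing that a value near the upper limit is still a finite IEEE 754 double and survives a raw binary round trip unchanged. The sample value is an arbitrary illustration, not data from the report.

```python
import math
import struct
import sys

# Largest finite IEEE 754 double (about 1.797e308).
ieee_max = sys.float_info.max

# Arbitrary illustrative value close to the limit (not from the report).
value = 1.5e308

# Pack to 8 little-endian bytes and unpack again: a plain binary round trip.
round_tripped = struct.unpack("<d", struct.pack("<d", value))[0]

print(math.isfinite(value), value <= ieee_max, round_tripped == value)
```

So any NaNs that appear after a Stata round trip would have to come from how the format interprets the bits, not from IEEE overflow in the script.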
mverleg changed the title from "to_stata + read_stata results in NaNs" to "to_stata + read_stata results in NaNs (close to double precision limit)" on Nov 8, 2016
The floats might have overflowed, in which case round-tripping them has undefined behavior. You're welcome to have a look, though using numbers close to the limit can easily cause issues. Any particular reason you are trying to do this?
mverleg commented Nov 8, 2016
If they're overflowing (which seems likely), it happens in either to_stata or read_stata. The values have not overflowed in the script itself; I compared other methods (npy, csv, etc.), which don't give NaNs. I'm using it for a benchmark, so I'm using random data that covers the full range to test compression. I guess most people don't use such data and I don't urgently need it, so this is probably a low-priority bug. But it seems like a bug nonetheless.
CSV is not a valid comparison, since the floats get stringified. Sure, it could be either in writing to or reading from Stata. I'll mark it, but it would take a community pull request to fix.
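To illustrate why a text format behaves differently: CSV stores the decimal string of each float and parses it back, so no binary bit pattern is ever reinterpreted by the file format. A minimal sketch with the standard `csv` module (the value is an arbitrary example, not data from the report):

```python
import csv
import io

value = 1.5e308  # arbitrary illustrative value near the double limit

# Write the float as text, the way a CSV file stores it.
buffer = io.StringIO()
csv.writer(buffer).writerow([value])
text = buffer.getvalue().strip()

# Read it back: the parser only ever sees a decimal string.
parsed = float(next(csv.reader(io.StringIO(text)))[0])

print(text, parsed == value)
```

A binary format with reserved bit patterns can therefore lose values that a text round trip preserves.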
jreback added the Bug, Numeric, IO Stata, Difficulty Intermediate, and Effort Low labels on Nov 8, 2016
jreback added this to the Next Major Release milestone on Nov 8, 2016
cc @bashtage
mverleg commented Nov 8, 2016
Yeah, CSV doesn't win the benchmark :-) Thanks, I'll use smaller data for now; hope no one else runs into it!
Stata has a maximum value for doubles and uses the very largest values to indicate coded values. From the dta spec:
|
|
It would probably be best to warn or raise an error when values like these are encountered in float and double columns. Right now, integers are promoted to a larger type where possible to avoid this issue.
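A sketch of what such a check could look like on the write side. It assumes that the largest non-missing Stata double is `0x1.fffffffffffffp+1022` (about 8.988e307), with larger bit patterns reserved for missing-value codes; verify that limit against the dta spec. The helper name `check_stata_double_range` is hypothetical and is not pandas API, and only the upper bound is checked here.

```python
import math

# Assumed upper limit for Stata's double type (verify against the dta spec).
STATA_DOUBLE_MAX = float.fromhex("0x1.fffffffffffffp+1022")  # ~8.988e307


def check_stata_double_range(values):
    """Raise ValueError if any non-NaN value exceeds the assumed maximum.

    Hypothetical helper for illustration. NaN is skipped because it is
    treated as a missing value rather than as data; +inf is rejected.
    """
    for v in values:
        if math.isnan(v):
            continue
        if v > STATA_DOUBLE_MAX:
            raise ValueError(f"{v!r} exceeds the largest value storable as a Stata double")


check_stata_double_range([0.0, -1.0e308, 8.0e307])  # passes silently

try:
    check_stata_double_range([1.0e308])
except ValueError as exc:
    print("rejected:", exc)
```

Raising before writing trades a single pass over each float column for the guarantee that values are not silently turned into missing codes.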
I don't think there is any promise to correctly round trip to_stata/read_stata, especially for edge cases. The most important cases are to read data saved by Stata with
|
Also, for performance measurement, at its core to_stata uses |
mverleg commented Nov 9, 2016
Ah I guess it's related to the encoding thing. It's probably good to use NaNs rather than just returning A warning would be useful though. If the performance penalty is worth it, which I'm not sure of. Also thanks for the benchmark hint. |
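As a rough illustration of the read side, the sketch below decodes a raw double and maps anything above the assumed Stata maximum to NaN while emitting a warning. The threshold is an assumption to verify against the dta spec, and `decode_stata_double` is a hypothetical helper, not how pandas is implemented.

```python
import math
import struct
import warnings

# Assumed largest non-missing Stata double; larger bit patterns are assumed
# to be reserved for missing-value codes.
STATA_DOUBLE_MAX = float.fromhex("0x1.fffffffffffffp+1022")


def decode_stata_double(raw):
    """Decode 8 little-endian bytes, mapping the reserved range to NaN.

    Hypothetical helper, not pandas API: it warns instead of silently
    returning NaN.
    """
    value = struct.unpack("<d", raw)[0]
    if value > STATA_DOUBLE_MAX:
        warnings.warn("value lies in the missing-value range; returning NaN", RuntimeWarning)
        return math.nan
    return value


ok = decode_stata_double(struct.pack("<d", 8.0e307))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lost = decode_stata_double(struct.pack("<d", 1.0e308))

print(ok, lost, len(caught))
```

The extra comparison per value is the performance cost in question; whether it is worth paying depends on how often such data occurs in practice.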
mverleg referenced this issue on Nov 9, 2016: to_htlm + read_html small errors for floats despite formatter #14623 (Closed)
bashtage added a commit to bashtage/pandas that referenced this issue on Nov 10, 2016: 8c547c4
bashtage added a commit to bashtage/pandas that referenced this issue on Nov 10, 2016: b6f6432
bashtage referenced this issue on Nov 10, 2016: ENH: Explicit range checking when writing Stata #14637 (Closed)
bashtage added a commit to bashtage/pandas that referenced this issue on Nov 10, 2016: db89413
bashtage added a commit to bashtage/pandas that referenced this issue on Nov 11, 2016: 90d65fe
bashtage added a commit to bashtage/pandas that referenced this issue on Nov 14, 2016: af41353
I think this is ready unless you see something.
bashtage added a commit to bashtage/pandas that referenced this issue on Nov 15, 2016: f057d03
mverleg commented Nov 16, 2016
Couldn't get Cython to work to test it, but the source looks good!
bashtage added a commit to bashtage/pandas that referenced this issue on Nov 17, 2016: 55a98f5
jreback modified the milestone: 0.19.2, Next Major Release on Nov 17, 2016
jreback closed this in fe555db on Nov 17, 2016
amolkahat added a commit to amolkahat/pandas that referenced this issue on Nov 26, 2016: 143c85c (bashtage + amolkahat)
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue on Dec 14, 2016: 36fc61e (bashtage + jorisvandenbossche)
mverleg commented Nov 8, 2016
Explanation
Saving and loading data as stata results in a lot of NaNs.
I think the code & output is pretty self-explanatory, otherwise please ask.
I've not been able to test this on other systems yet.
If this is somehow expected behaviour, maybe a bigger warning would be in order.
A small, complete example of the issue
Expected Output
Actual Output
Output of pd.show_versions()
pandas: 0.18.1
nose: None
pip: 9.0.1
setuptools: 26.1.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None