
BUG: Fixed incorrect type in integer conversion in to_stata #6335

Closed
wants to merge 71 commits into from

Conversation

bashtage
Contributor

Simple fix for bug #6327. All conversions were off by a factor of 2.
Added test for this case.

Types used for integer conversion were always half the size they should be.
Produced a bug when exporting data tables with long integer data (np.int64).
Added a test for incorrect integer conversion from int16, int32 and int64
@jreback
Contributor

jreback commented Feb 12, 2014

typo in your test...pls test locally!

@bashtage
Contributor Author

This should probably be rejected. I had a look at

http://www.stata.com/help.cgi?dta#variabletypes

and there does not appear to be any int64 type.

What's the correct thing?

  • Raise an exception and ask the user to coerce int64 to double
  • Silently coerce long to double.

I think the former is probably the correct solution.

@jreback
Contributor

jreback commented Feb 12, 2014

hmm...int64 is kind of a fundamental type in pandas/numpy land.

The issue only really arises if the value cannot fit in an int32, right?
pretty easy to try to coerce and raise if it doesn't work properly.
(of course this will essentially make the round-trip non-exact, but oh well)
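The coerce-and-raise approach could be sketched roughly like this. This is a hypothetical helper, not the actual pandas code, and for simplicity it uses NumPy's int32 bounds even though Stata's real "long" maximum is slightly lower because the top values are reserved for missing codes:

```python
import numpy as np
import pandas as pd

def coerce_int64(col: pd.Series) -> pd.Series:
    """Downcast an int64 column to int32 if the values fit, else raise."""
    i32 = np.iinfo(np.int32)
    if col.min() < i32.min or col.max() > i32.max:
        raise ValueError(
            f"Column '{col.name}' has values outside the int32 range "
            "and cannot be stored as a Stata integer; cast to float64 first."
        )
    return col.astype(np.int32)

s = pd.Series([1, 2, 3], dtype=np.int64, name="x")
print(coerce_int64(s).dtype)  # int32
```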

@jreback
Contributor

jreback commented Feb 16, 2014

can u look at the current test failures?

@jreback jreback added this to the 0.14.0 milestone Feb 16, 2014
@bashtage
Contributor Author

I have been working backward on this and I'm still not sure that a fix which makes to_stata() produce genuinely Stata-compliant files will ever be acceptable, since, from what I can tell, Stata does not support any kind of 8-byte integer.

As you say, this is a real problem for a DataFrame since many columns are naturally int64 (e.g. a plain index).

I tried removing all references to int64 and all that did was more-or-less break everything.

I've found another issue in the writer: when it iterates across rows of the DataFrame, each row is produced as a Series. This is a problem with mixed numeric data, since the row is first converted to the largest common data type, including coercing integer types to float if the row contains a float, before being written. The loss of fidelity this can induce is not desirable.

This shows what is effectively happening inside to_stata:

from pandas import Series, DataFrame
import numpy as np

original = np.int64(2**63 - 2**32)

s1 = Series(original, dtype=np.int64)
s2 = Series(0.1, dtype=np.float32)
df = DataFrame({'long': s1, 'single': s2})

# No problem in the DataFrame: each column keeps its own dtype
print(df['long'] - original)

# Coercing int64 to float during row extraction creates a problem
print(df.iloc[0]['long'] - original)

@bashtage
Contributor Author

The conversion is silently happening at

https://github.com/bashtage/pandas/blob/stata-export-datatype/pandas/io/stata.py#L1144

since the rows of the DataFrame are being iterated over.
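One way to avoid the row-wise upcasting would be to extract the data into a NumPy structured array, where each field keeps its own dtype. This is a sketch of the idea, not what stata.py currently does; `DataFrame.to_records(index=False)` achieves much the same thing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "long": np.array([2**63 - 2**32], dtype=np.int64),
    "single": np.array([0.1], dtype=np.float32),
})

# A structured (record) array keeps a separate dtype per field, so
# extracting a row does not upcast everything to a common type.
rec = np.empty(len(df), dtype=[("long", "<i8"), ("single", "<f4")])
for name in rec.dtype.names:
    rec[name] = df[name].values

print(rec["long"][0] == 2**63 - 2**32)  # True: exact, no float coercion
```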

@jreback
Contributor

jreback commented Feb 24, 2014

cc @PKEuS

@bashtage IIUC this happens for values that don't fit in float32? e.g. large ints. So a warning / raising is prob appropriate. This 'works' it seems if the values fit, though? E.g. if you generate a file with to_stata it is read by stata ok?

@bashtage
Contributor Author

@jreback I have now tested it in Stata and the dta produced with pandas 0.13.1 is not a correct Stata file.

This simple code outputs and inputs fine in pandas, but the variables have the wrong values in Stata

import numpy as np
import pandas as pd

s1 = pd.Series(1, dtype=np.int8)
s2 = pd.Series(2**8 + 1, dtype=np.int16)
s3 = pd.Series(2**16 + 1, dtype=np.int32)
s4 = pd.Series(2**32 + 1, dtype=np.int64)
df = pd.DataFrame({'s1': s1, 's2': s2, 's3': s3, 's4': s4})
df.to_stata('test.dta')

Running the command

list

returns

index   s1  s2  s3              s4
     0   0   0   0        1.68e+07

I ran a couple of other experiments with other data dtypes (e.g. float64) and they also do not load the correct values into Stata 12.1.

@jreback
Contributor

jreback commented Feb 25, 2014

ok, feel free to make a new version that works with out-of-size values. I was asking whether the more normal case works, e.g. regular sized values.

@bashtage
Contributor Author

Those are all normal-sized values. s1 is a byte with the value 1; s2 is an int16 with value 257. Stata sees them with the correct type but the wrong values.

Even the trivial dta

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1.0]]), columns=['s1'])
df.to_stata('test.dta')
print(df.dtypes)

does not contain the correct values when loaded into Stata.

@jreback
Contributor

jreback commented Feb 25, 2014

so the test data files are wrong? are they not valid Stata files? since these can be roundtripped, you are saying that the read is compensating somehow? (for the writing issues)

@jreback
Contributor

jreback commented Feb 25, 2014

is this a version issue? I recall very specific versions of formats that this supports (or doesn't support)

@jseabold are you seeing the same thing here?

@bashtage
Contributor Author

That seems to be the case. I need to do some more systematic investigation which may take a little while.

It appears that in some cases the reader and writer are internally compatible, but not always compatible with what Stata wants.

@jreback
Contributor

jreback commented Feb 25, 2014

ok compat is obviously important
do pls let us know

@jseabold
Contributor

I can have a look if needed. I didn't write the writer, so I'm less familiar with it. I can also point you to (or provide you with) the technical documentation if needed. It comes with Stata since v. 11.x or so.

You are correct though that Stata doesn't have an int64. It only has:

type     min                       max                      bytes
byte     -127                      100                      1
int      -32,767                   32,740                   2
long     -2,147,483,647            2,147,483,620            4
float    -1.70141173319 x 10^38    1.70141173319 x 10^38    4
double   -8.9884656742 x 10^307    8.9884656743 x 10^307    8
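Those limits can be turned into a simple type-selection helper. This is a hypothetical sketch (not code from this PR); the bounds come straight from the table above:

```python
import numpy as np

# Stata numeric type limits; values above each max are reserved
# for Stata's missing-value codes.
STATA_LIMITS = {
    "byte":   (-127, 100),
    "int":    (-32767, 32740),
    "long":   (-2147483647, 2147483620),
}

def smallest_stata_int(values: np.ndarray) -> str:
    """Pick the smallest Stata integer type that can hold `values`."""
    lo, hi = values.min(), values.max()
    for name in ("byte", "int", "long"):
        tmin, tmax = STATA_LIMITS[name]
        if lo >= tmin and hi <= tmax:
            return name
    raise ValueError("values exceed Stata's 'long' range; use double")

print(smallest_stata_int(np.array([1, 257])))  # int
```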

@jseabold
Contributor

Adding my comment from #6327 here. Ripping out the int64 stuff is the way to go (except for handling it appropriately on the writer side of things). The reader shouldn't know about it, and it shouldn't be in dtype_maps. Those aren't suggestions, but determined from the Stata binary format. Cf. the statsmodels implementation without int64.

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/iolib/foreign.py#L289

FIX: Remove unnecessary, potentially precision degrading cast to Series when writing data
ENH: Added function to cast columns from NumPy data types to Stata data types
FIX: Corrected tests for correct Stata datatypes
The extreme values of float and double (Stata; pandas equivalents: float32 and
float64) were not correct.  This resulted in incorrect truncation.  The
handling of missing values has been improved and code to convert missing
values in any format has been added.  The improvement differentiates between
valid ranges for data and missing values.

Additional issues were found when handling missing Dates, where missing Dates
(NaT) were converted to non-missing dates when written.

A test has been added for extreme numeric values as well as missing values.
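For reference, a rough sketch of how missing integers could be mapped onto Stata's sentinel codes (the base '.' missing value sits just above each type's valid maximum). The sentinels below are assumptions inferred from Stata's documented ranges, not code from this PR:

```python
import numpy as np
import pandas as pd

# Assumed '.' missing codes: one above the valid max of each type.
MISSING_CODE = {"byte": 101, "int": 32741, "long": 2147483621}

def encode_missing(col: pd.Series, stata_type: str) -> np.ndarray:
    """Replace NaN with the Stata '.' sentinel for an integer column."""
    out = col.to_numpy(dtype=np.float64, copy=True)
    out[np.isnan(out)] = MISSING_CODE[stata_type]
    return out.astype(np.int64)

s = pd.Series([1.0, np.nan, 3.0])
print(list(encode_missing(s, "int")))  # [1, 32741, 3]
```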
Renamed Stata data files to include file format
@jreback
Contributor

jreback commented Mar 4, 2014

if u can squash this down a bit would be great

@bashtage
Contributor Author

bashtage commented Mar 4, 2014

I think it would benefit from some more squashing, so please do.

Kevin

@bashtage
Contributor Author

bashtage commented Mar 4, 2014

There is still at least one issue that hasn't been fixed, and I don't think it is practical to fix now. Most Stata dates do not contain times, so using datetime64[ns] really limits the range. One of the tests has the year 2 (as in 2 AD) in it, which cannot be represented because of the limited range of ns. Once this is resolved, it would make more sense to use [D], [M] or [Y], corresponding to the Stata datetime formats (%td, %tm, %ty), which would provide a more accurate range.

Not sure if more general datetime64 is planned though.
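For illustration, Stata's %td format is just an integer count of days since 1960-01-01, so plain Python dates can compute it without the datetime64[ns] range limitation (this sketch assumes the 1960 epoch; it is not the converter used in stata.py):

```python
from datetime import date

# Stata %td: integer days since 1960-01-01. Python's date covers
# years 1-9999, well beyond the ~1677-2262 datetime64[ns] window.
STATA_EPOCH = date(1960, 1, 1)

def to_stata_days(d: date) -> int:
    return (d - STATA_EPOCH).days

print(to_stata_days(date(1960, 1, 2)))  # 1
print(to_stata_days(date(2, 1, 1)))     # year 2 AD: large negative count
```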

@jreback
Contributor

jreback commented Mar 4, 2014

http://pandas-docs.github.io/pandas-docs-travis/gotchas.html#timestamp-limitations

datetimes were settled long ago
if u need to represent outside the ns range then use a period

@jreback
Contributor

jreback commented Mar 4, 2014

pls add a release notes and v0.14.0 entry in the API sections referencing the issue that this closes

@jreback
Contributor

jreback commented Mar 4, 2014

I rebased and squashed, see here: jreback@42e3e17

just give a once over

I am not sure how you got into that state. You should investigate using --amend, which rewrites the last commit, so you don't have a gazillion commits. Just makes it harder to work with.

@bashtage
Contributor Author

bashtage commented Mar 5, 2014

Would it be best if I just merged it down to a single commit, with a nice commit message? My git strategy is, at best, not very good.

@jreback
Contributor

jreback commented Mar 5, 2014

Already did, that's the commit I posted. It was a bit non-trivial to do this actually, but it's done now.

@jseabold
Contributor

jseabold commented Mar 5, 2014

FWIW, the pandas strategy of hide all commit messages isn't preferred everywhere ;). Sometimes a nice, informative commit message about why a particular line changed can save much headbanging against the desk later... It's admittedly a delicate balance and mostly personal/team preference though.

@jreback
Contributor

jreback commented Mar 5, 2014

merged via 689d765

FYI most of your commit messages are there. It's actually pretty informative.

thanks for the effort!

@jreback jreback closed this Mar 5, 2014
@jseabold
Contributor

jseabold commented Mar 5, 2014

Yes, many thanks for catching this and most importantly, fixing it.

@jreback
Contributor

jreback commented Mar 5, 2014

@bashtage or @jseabold if you have insight into #6545 pls comment

@jseabold
Contributor

jseabold commented Mar 5, 2014

Re: squashed commits. The commit messages are there in the full merge commit, but not from git log or git blame IIUC.

@jreback
Contributor

jreback commented Mar 5, 2014

@jseabold right

i put logical changes usually in a small number of commits
it prob would work other ways just as well
but that has been the convention

rebase - squash - merge

I find it makes pieces independent and pretty easy to look at
you can always look back at the original history via the PR if u want
of course, but master is pretty clean

@bashtage bashtage deleted the stata-export-datatype branch March 13, 2014 22:25
@bashtage bashtage restored the stata-export-datatype branch March 21, 2014 19:49
@bashtage bashtage deleted the stata-export-datatype branch March 23, 2014 15:37