StataReader.variable_labels() does not read variable label correctly for stata datasets saved under Stata 13 using 'save' (but it can read datasets saved using 'saveold') #7816

Closed
shafiquejamal opened this Issue Jul 22, 2014 · 14 comments


If I use StataReader to read a Stata dataset saved in Stata 13 using the save command, I can get the data but not the variable labels.

If, however, I use the saveold command in Stata 13, I am able to get the variable labels in Python3 using StataReader.variable_labels().

Can anyone suggest how to accommodate Stata 13? Thanks,

Contributor

jreback commented Jul 22, 2014

docs are here: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-from-stata-format

something like:

reader = pandas.io.stata.StataReader(file)

# labels
reader.variable_labels()

# data
reader.data(....)

look inside the pandas.io.stata.read_stata (and doc-string of StataReader)

Contributor

jreback commented Jul 22, 2014

closing as a usage question

jreback closed this Jul 22, 2014

Hello, I'm sorry if I wasn't clear earlier.

I did use the variable_labels() method of reader. But this does NOT work for Stata datasets saved in later versions of Stata (e.g. Stata 13) using the save command. (It DOES work if the dataset was saved in Stata 13 using the saveold command.)

Can you please re-open this issue? It is still not resolved (I am using the latest Pandas master branch). Thanks.

Contributor

jreback commented Jul 22, 2014

ok, so this is a feature/bug request then? ok

jreback reopened this Jul 22, 2014

jreback added the Bug label Jul 22, 2014

jreback added this to the 0.15.1 milestone Jul 22, 2014

Contributor

jreback commented Jul 22, 2014

Yes it is a bug/feature request. I guess Stata changed something in how they save data files, which means that the Stata reader needs to be updated to accommodate this change. Many thanks!

Contributor

bashtage commented Jul 22, 2014

@shafiquejamal It would be helpful if you could share a simple example .dta file which produces the problem, as well as a v12 one that works.

This looks like it is implemented in the v13 path - although it probably is buggy

jreback removed the Usage Question label Jul 22, 2014

Certainly. I have a couple of .dta files of about 450kb each that I can share (problem dataset HHRosterEducHealth_small_varwithnolabel_notsaveold and non-problem dataset HHRosterEducHealth_small_varwithnolabel_saveold).

I tried dragging them into this comment window, but I'm getting this error at the bottom of this comment window: "Unfortunately, we don't support that file type. Try again with a PNG, GIF, or JPG."

How can I share these .dta files with you? Thanks,

Contributor

jreback commented Jul 22, 2014

@shafiquejamal put them up on a public dropbox / share site. I think you can do it via gist as well.

and post the link here.

Here is the dropbox link:

https://www.dropbox.com/sh/4r0fhspsiwpim5p/AACBaC-lu7TaNPLUQQgU_rt4a

So StataReader can handle the file ending in _saveold.dta (saved using an old Stata dataset format), not the file ending in _notsaveold.dta (saved using the newer Stata dataset format). Thanks.

Contributor

bashtage commented Jul 22, 2014

The bug, unfortunately, seems to be in Stata. Stata's dta file definition claims that it gives the offset to the start of this segment as one of 14 8-byte values in <map>. Unfortunately, this value is 0 (0000 0000 0000 0000 in the file) in this file, and is also 0 in one I just saved from Stata 13.

The code appears to be a correct implementation of Stata's documented file format, so I'm not sure if this should be "fixed" (which would be to hack around Stata's problem).
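For anyone who wants to check this on their own files, here is a minimal sketch of reading those 14 values. It assumes the format-117 layout described in `help dta`: a `<map>` tag followed by 14 unsigned 8-byte integers. The function name, the entry names in `MAP_ENTRIES`, and the synthetic bytes are illustrative assumptions, not pandas code.

```python
import struct

# Names for the 14 <map> entries, in the order I understand `help dta`
# to list them for format 117 (an assumption; verify against your docs).
MAP_ENTRIES = [
    "stata_data", "map", "variable_types", "varnames", "sortlist",
    "formats", "value_label_names", "variable_labels", "characteristics",
    "data", "strls", "value_labels", "stata_data_close", "end_of_file",
]


def read_map_offsets(raw: bytes, byteorder: str = "<") -> dict:
    """Return the 14 offsets stored after the first <map> tag.

    byteorder is "<" for LSF files and ">" for MSF files.
    Caveat: this naive search would be fooled by a dataset label
    that itself contains the text "<map>".
    """
    start = raw.find(b"<map>")
    if start == -1:
        raise ValueError("no <map> section found; not a format-117 file?")
    start += len(b"<map>")
    values = struct.unpack_from(byteorder + "14Q", raw, start)
    return dict(zip(MAP_ENTRIES, values))


# Synthetic demonstration (not a real .dta file): a stub header followed
# by a map whose variable_labels entry is zero, as reported above.
fake_offsets = list(range(100, 1500, 100))  # 14 made-up offsets
fake_offsets[7] = 0
fake = (b"<stata_dta><header></header><map>"
        + struct.pack("<14Q", *fake_offsets)
        + b"</map>")
offsets = read_map_offsets(fake)
print(offsets["variable_labels"])  # 0
```

With a real file you would pass the bytes read from it in binary mode instead of `fake`.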

Thanks for looking into this so quickly. I'll see about contacting folks at Stata to see whether they can fix their documentation, which would then justify modifying Pandas.

To summarize then: the problem is that the offset (to the start of the segment in the dta file that defines the variable labels) should be 1, according to Stata's documentation (in help dta), but this offset is in fact 0 instead. Correct?

Many thanks,

Contributor

bashtage commented Jul 22, 2014

I have submitted a patch that works around the difference between the docs and the implementation. The required value is technically unnecessary since it can be computed from other values.
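To illustrate what "computed from other values" can mean, here is a rough sketch of the arithmetic. It is not necessarily the exact code in the patch. It assumes that `<variable_labels>` immediately follows `<value_label_names>` in a format-117 file and that each variable takes 33 bytes in `<value_label_names>`; the function and constant names are hypothetical.

```python
# Assumed width of one entry in <value_label_names> for format 117.
VALUE_LABEL_NAME_WIDTH = 33


def variable_labels_offset(value_label_names_offset: int, nvar: int) -> int:
    """Derive the offset of the <variable_labels> opening tag.

    value_label_names_offset is the map entry for <value_label_names>
    (the position of its opening tag); nvar is the number of variables.
    """
    return (value_label_names_offset
            + len(b"<value_label_names>")      # opening tag
            + VALUE_LABEL_NAME_WIDTH * nvar    # one fixed-width name per variable
            + len(b"</value_label_names>"))    # closing tag


# Made-up numbers: section starts at byte 700, dataset has 10 variables.
print(variable_labels_offset(700, 10))  # 1069
```

The label text itself would start a further `len(b"<variable_labels>")` bytes past the returned offset.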

jreback modified the milestone: 0.15.0, 0.15.1 Jul 22, 2014

bashtage pushed a commit to bashtage/pandas that referenced this issue Jul 23, 2014

Kevin Sheppard BUG: Fixed failure in StataReader when reading variable labels in 117 files

Stata's implementation does not match the online dta file format description.
The solution used here is to directly compute the offset rather than reading
it from the dta file.  If Stata fixes their implementation, the original code
can be restored.
closes #7816
6265450

Thanks! It's working with my datasets. Cheers,

jreback closed this in #7818 Jul 23, 2014
