Trouble writing to_stata with a GzipFile #21041

Closed
karldw opened this Issue May 14, 2018 · 10 comments

karldw commented May 14, 2018

Problem description

When writing a Stata dataset to a GzipFile, the written dataset is all zeros/blank.
I think pandas would ideally write the correct data to the GzipFile or, if that's not an easy change, raise an error when the user tries to write to one.

Expected Output

I expected to read back the same data I tried to write, or to get an error when writing.

Here's the table I tried to write to the GzipFile (df in the code):

a b c
1 1.5 "z"

Here's the table that gets read back (df_from_gzip in the code):

a b c
0 0.0 ""

I think this is an error in writing, rather than in reading back, because Stata reads the same all-zeros table.

Code Sample

import pandas as pd
import gzip
import subprocess


df = pd.DataFrame({
    'a': [1],
    'b': [1.5],
    'c': ["z"]})

# Use GzipFile to write a compressed version:
with gzip.GzipFile("test_gz.dta.gz", mode = "wb") as f:
    df.to_stata(f, write_index = False)

# Use the system gunzip to extract (using GzipFile fails; see attempt below)
subprocess.run(["gunzip", "--keep", "test_gz.dta.gz"])
df_from_gzip = pd.read_stata("test_gz.dta")

print(df)
print(df_from_gzip)

Other fun facts

  • bz2.BZ2File and lzma.LZMAFile refuse to write dta files, with the error "UnsupportedOperation: Seeking is only supported on files open for reading"
  • Everything works for feather files.
  • This isn't an issue with read_stata; opening the files in Stata itself gives the same results.
  • Variable types are retained.
  • Value labels for categorical variables are written correctly.
  • The number of rows is correct, even for larger examples.
  • Reading a system-compressed Stata file is fine.
import bz2
import lzma


# Try to read the compressed file created before -- fails with the message
# "Not a gzipped file (b'\x01\x00')". I'm not sure why, but it's not central
# to this issue.
with gzip.GzipFile("test_gz.dta.gz") as f:
    df2 = pd.read_stata(f)

    
# Writing feather files to these compressed connections works:
with gzip.GzipFile("test_gz.feather.gz", mode = "wb") as f:
    df.to_feather(f)
with bz2.BZ2File("test_bz.feather.bz2", mode = "wb") as f:
    df.to_feather(f)
with lzma.LZMAFile("test_xz.feather.xz", mode = "wb") as f:
    df.to_feather(f)
        

# Next, writing stata files with other compressors fails because the
# file isn't open for reading.
with bz2.BZ2File("test_bz.dta.bz", mode = "wb") as f:
    df.to_stata(f)  # this raises an error
with lzma.LZMAFile("test_xz.dta.xz", mode = "wb") as f:
    df.to_stata(f)  # this also raises an error


# But reading a system-compressed Stata file works:
df.to_stata("test.dta", write_index = False)
subprocess.run(["gzip", "test.dta"])
with gzip.GzipFile("test.dta.gz") as f:
    assert all(pd.read_stata(f) == df)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger commented May 15, 2018

Thanks! Could you narrow down your example to a minimal example? It's hard to see exactly what the problem is with that long of an input. http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

karldw commented May 15, 2018

Sorry about that! I reorganized things above, hopefully for the better.

TomAugspurger commented May 16, 2018

Thanks for the update. Agreed something is going on here. Any interest in debugging further? It'll all be in https://github.com/pandas-dev/pandas/blob/master/pandas/io/stata.py probably. I can help narrow it down further if you need.

karldw commented May 19, 2018

I can give it a look, but it will take me a bit. If you (or anyone else reading this) want to get this working sooner, please do!

bashtage commented May 21, 2018

I can see this is never going to work with the new-format Stata dta writer, since writing a dta file requires rewriting some values in an area of the file called the map. It might work with the old format, since that is pretty linear. The docstring should be updated to reflect that it only really works with file objects, not general file-like objects.
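
As an aside (my own sketch, not part of the original comment), the seek limitation is easy to see: a GzipFile opened for writing only allows forward seeks, so any attempt to go back and patch the map area raises an error.

import gzip

# Hypothetical filename; this only illustrates the seek restriction on gzip write streams.
with gzip.GzipFile("seek_demo.dta.gz", mode="wb") as f:
    f.write(b"\x00" * 16)     # placeholder where the map would go
    f.write(b"data section")  # rest of the file
    try:
        f.seek(0)             # the new-format writer would need to come back here
    except OSError as exc:
        # Typically "Negative seek in write mode"
        print("cannot seek back to rewrite the map:", exc)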

bashtage commented May 21, 2018

It appears that ndarray.tofile, which is used to write the data, does not work correctly with gzip files.

import gzip
import numpy as np

a = np.arange(2**12)
with gzip.GzipFile("test_nb.gz", mode = "wb") as f:
    a.tofile(f)

and then

gunzip test_nb.gz

gzip: test_nb.gz: not in gzip format
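
For comparison (a sketch I'm adding here, not part of the original comment), writing the same bytes through GzipFile.write itself does produce a valid gzip stream, which points at ndarray.tofile rather than at gzip:

import gzip
import numpy as np

a = np.arange(2**12)
with gzip.GzipFile("test_nb2.gz", mode="wb") as f:
    f.write(a.tobytes())  # goes through the GzipFile wrapper, so the stream is compressed

# gunzip test_nb2.gz now succeeds and yields the raw array bytes.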

karldw commented May 21, 2018

Ah, that map business is messy.

bashtage commented May 22, 2018

Unfortunately, BytesIO doesn't work with NumPy's tofile either, and a TemporaryFile is destroyed immediately on close, so neither of these would allow gzipped files to be written cleanly to disk without an intermediate dta file.
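
A quick check of the BytesIO point (my own sketch; the exact exception depends on the NumPy version):

import io
import numpy as np

a = np.arange(8)
try:
    a.tofile(io.BytesIO())  # tofile needs a real file descriptor, which BytesIO lacks
except Exception as exc:  # NumPy may raise UnsupportedOperation or OSError here
    print(type(exc).__name__, exc)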

bashtage added a commit to bashtage/pandas that referenced this issue May 22, 2018

BUG: Enable stata files to be written to buffers
Enable support for general file-like objects when exporting stata files

closes pandas-dev#21041

bashtage added a commit to bashtage/pandas that referenced this issue May 22, 2018

BUG: Enable stata files to be written to buffers
Enable support for general file-like objects when exporting stata files

closes pandas-dev#21041

bashtage added a commit to bashtage/pandas that referenced this issue May 22, 2018

BUG: Enable stata files to be written to buffers
Enable support for general file-like objects when exporting stata files

closes pandas-dev#21041

karldw commented May 22, 2018

@bashtage, you can get around the deletion with NamedTemporaryFile(delete = False), but I'm not sure this is better.

import tempfile
import subprocess
import shutil
import pandas as pd

df = pd.DataFrame({
    'a': [1],
    'b': [1.5],
    'c': ["z"]})
tmp = tempfile.NamedTemporaryFile(delete = False, suffix = ".dta")
df.to_stata(tmp.name)
tmp.close()
subprocess.run(['gzip', tmp.name])
shutil.move(tmp.name + ".gz", some_file)  # some_file: wherever the compressed file should end up

bashtage commented May 22, 2018

With current pandas, you could use something like:

with gzip.GzipFile('test.dta.gz','wb') as gz, tempfile.NamedTemporaryFile(delete=False) as ntf:
    df.to_stata(ntf)
    with open(ntf.name,'rb') as ntf2:
        gz.write(ntf2.read())

The patch fixes this issue so that a standard GzipFile can be used. It should be in 0.23.1.

@jreback jreback added this to the 0.23.1 milestone May 23, 2018

bashtage added a commit to bashtage/pandas that referenced this issue May 24, 2018

BUG: Enable stata files to be written to buffers
Enable support for general file-like objects when exporting stata files

closes pandas-dev#21041

bashtage added a commit to bashtage/pandas that referenced this issue May 24, 2018

BUG: Enable stata files to be written to buffers
Enable support for general file-like objects when exporting stata files

closes pandas-dev#21041

jreback added a commit that referenced this issue May 24, 2018

BUG: Enable stata files to be written to buffers (#21169)
Enable support for general file-like objects when exporting stata files

closes #21041

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Jun 8, 2018

BUG: Enable stata files to be written to buffers (pandas-dev#21169)
Enable support for general file-like objects when exporting stata files

closes pandas-dev#21041
(cherry picked from commit f91e28c)

jorisvandenbossche added a commit that referenced this issue Jun 9, 2018

BUG: Enable stata files to be written to buffers (#21169)
Enable support for general file-like objects when exporting stata files

closes #21041
(cherry picked from commit f91e28c)

david-liu-brattle-1 added a commit to david-liu-brattle-1/pandas that referenced this issue Jun 18, 2018

BUG: Enable stata files to be written to buffers (pandas-dev#21169)
Enable support for general file-like objects when exporting stata files

closes pandas-dev#21041