BUG: read_stata ignores columns parameter and dtypes of empty dta files #46240

sterlinm · 2022-03-05T19:42:19Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# create an empty DataFrame with int64 and float64 dtypes
df = pd.DataFrame(data={"a": range(3), "b": [1.0, 2.0, 3.0]}).head(0)

# write to Stata .dta file
df.to_stata('empty.dta', write_index=False, version=117)

# read one column of empty .dta file
df2 = pd.read_stata('empty.dta', columns=["a"])

# show dtypes of df2
df2.dtypes

Issue Description

A Stata .dta file with zero rows still has type information, but when you read an empty .dta file with pd.read_stata, all of the columns come back with object dtype. It also ignores the columns parameter and reads all of the columns.

Expected Behavior

In the above example df2.dtypes should return:

In [2]: df2.dtypes
Out[2]:
a    object
b    object
dtype: object

Installed Versions

Apologies, pd.show_versions() fails for some reason. I've included the traceback below; the pandas version is 1.4.1.

In [5]: pd.__version__
Out[5]: '1.4.1'

In [3]: pd.show_versions()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [3], in <module>
----> 1 pd.show_versions()

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py:109, in show_versions(as_json)
     94 """
     95 Provide useful information, important for bug reports.
     96
   (...)
    106     * If True, outputs info in JSON format to the console.
    107 """
    108 sys_info = _get_sys_info()
--> 109 deps = _get_dependency_info()
    111 if as_json:
    112     j = {"system": sys_info, "dependencies": deps}

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py:88, in _get_dependency_info()
     86 result: dict[str, JSONSerializable] = {}
     87 for modname in deps:
---> 88     mod = import_optional_dependency(modname, errors="ignore")
     89     result[modname] = get_version(mod) if mod else None
     90 return result

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/compat/_optional.py:126, in import_optional_dependency(name, extra, errors, min_version)
    121 msg = (
    122     f"Missing optional dependency '{install_name}'. {extra} "
    123     f"Use pip or conda to install {install_name}."
    124 )
    125 try:
--> 126     module = importlib.import_module(name)
    127 except ImportError:
    128     if errors == "raise":

File ~/mambaforge/envs/py310/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124             break
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1006, in _find_and_load_unlocked(name, import_)

File <frozen importlib._bootstrap>:688, in _load_unlocked(spec)

File <frozen importlib._bootstrap_external>:883, in exec_module(self, module)

File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/setuptools/__init__.py:8, in <module>
      5 import os
      6 import re
----> 8 import _distutils_hack.override  # noqa: F401
     10 import distutils.core
     11 from distutils.errors import DistutilsOptionError

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/override.py:1, in <module>
----> 1 __import__('_distutils_hack').do_override()

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/__init__.py:72, in do_override()
     70 if enabled():
     71     warn_distutils_present()
---> 72     ensure_local_distutils()

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/__init__.py:59, in ensure_local_distutils()
     57 # check that submodules load as expected
     58 core = importlib.import_module('distutils.core')
---> 59 assert '_distutils' in core.__file__, core.__file__
     60 assert 'setuptools._distutils.log' not in sys.modules

Pydare · 2022-03-08T23:01:51Z

A stata .dta file with zero rows still has type information, but when you try to read an empty .dta file using pd.read_stata all of the columns have object type.

I think this occurs for other file types as well. I tried it for .csv and .xlsx file types and the same thing occurred.

The second part of this bug (the columns parameter being ignored) did not occur with these file types, so I am guessing it's an issue with the read_stata method itself.

Pydare · 2022-03-08T23:02:12Z

take

phofl · 2022-03-10T11:52:45Z

Could you check your expected? I think you meant that the dtype should be something else than object?

csv files do not have dtype information, so this is expected. Not sure what we would want to do with excel files though
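
To illustrate the point about CSV, here is a minimal sketch (not taken from the thread) showing that a header-only CSV carries nothing to infer dtypes from, so the only way to get typed empty columns back is to pass them explicitly. The column names and dtypes are just the ones used in the example above.

```python
import io

import pandas as pd

# An empty frame still carries its dtypes in memory.
df_empty = pd.DataFrame(
    {"a": pd.Series(dtype="int64"), "b": pd.Series(dtype="float64")}
)

# CSV stores only text, so a header-only file has no values to infer from.
csv_text = df_empty.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Passing dtype= explicitly is one way to recover typed empty columns.
typed = pd.read_csv(
    io.StringIO(csv_text), dtype={"a": "int64", "b": "float64"}
)

print(df_empty.dtypes.to_dict())
print(from_csv.dtypes.to_dict())
print(typed.dtypes.to_dict())
```

A .dta file differs because its header records a type for every variable regardless of how many rows follow.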

sterlinm · 2022-03-11T00:27:40Z

csv files do not have dtype information, so this is expected. Not sure what we would want to do with excel files though

I'm not sure if Excel files have any implicit type but I don't think so. I could check and see what happens with parquet and SAS files. I think both of those still have type information even if there are no observations.

simonjayhawkins · 2022-06-20T18:58:27Z

Could you check your expected? I think you meant that the dtype should be something else than object?

@sterlinm you wrote in the OP

Expected Behavior

In the above example df2.dtypes should return:
In [2]: df2.dtypes
Out[2]:
a    object
b    object
dtype: object

you meant

a    int32
dtype: object

sterlinm · 2022-07-08T01:53:59Z

@simonjayhawkins You're right, I mixed it up because I was highlighting two separate issues with reading empty files:

pd.read_stata ignores the columns parameter when reading an empty file.
pd.read_stata loses dtype information when reading an empty file.

Here's an updated example:

Reproducible Example

import numpy as np
import pandas as pd
from pandas.io.stata import StataReader

# create a DataFrame with int32 and float64 dtypes
df = pd.DataFrame(data={"a": range(3), "b": [1.0, 2.0, 3.0]})
# assign the column directly; setting through .loc[:, 'a'] may keep the original int64 dtype
df['a'] = df['a'].astype('int32')
df_empty = df.head(0)

# write the empty and non-empty DataFrames to .dta files
df.to_stata('nonempty.dta', write_index=False, version=117)
df_empty.to_stata('empty.dta', write_index=False, version=117)

# column variables
expected_cols = pd.Index(['a'])
all_cols = df.columns

# reading one column of non-empty .dta file works
assert pd.read_stata('nonempty.dta', columns=["a"]).columns.equals(expected_cols)

# reading one column of empty .dta file does not work
assert pd.read_stata('empty.dta', columns=["a"]).columns.equals(all_cols)
assert pd.read_stata('empty.dta', columns=["xyz"]).columns.equals(all_cols)  # should raise error

# reading non-empty .dta file retains correct dtypes
assert pd.read_stata('nonempty.dta').dtypes.equals(df.dtypes)

# reading empty .dta file makes all the columns object columns
assert (pd.read_stata('empty.dta').dtypes == 'object').all()

# we can confirm that the empty .dta file does retain the type information
expected_dtyplist = [np.dtype('int32'), np.dtype('float64')]
assert StataReader('nonempty.dta').dtyplist == expected_dtyplist
assert StataReader('empty.dta').dtyplist == expected_dtyplist

Expected Behavior

In the above example pd.read_stata('empty.dta').dtypes should return:

In [2]: pd.read_stata('empty.dta').dtypes
Out[2]:
a    int32
b    float64
dtype: object
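
Until a fixed pandas is available, one possible workaround is to enforce the column selection and dtypes after reading. The sketch below is an assumption-laden illustration, not pandas API: `read_stata_typed` is a hypothetical helper, and the caller must supply the expected dtypes because the helper does not read them from the file header.

```python
import os
import tempfile

import pandas as pd


def read_stata_typed(path, columns=None, dtypes=None):
    """Hypothetical helper: apply column selection and dtypes after reading."""
    df = pd.read_stata(path)
    if columns is not None:
        # Plain indexing raises KeyError for names that are not in the file.
        df = df[list(columns)]
    if dtypes is not None:
        # Only cast columns that survived the selection above.
        wanted = {col: dt for col, dt in dtypes.items() if col in df.columns}
        df = df.astype(wanted)
    return df


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "empty.dta")
    empty = pd.DataFrame(
        {"a": pd.Series(dtype="int32"), "b": pd.Series(dtype="float64")}
    )
    empty.to_stata(path, write_index=False, version=117)
    result = read_stata_typed(
        path, columns=["a"], dtypes={"a": "int32", "b": "float64"}
    )

print(result.dtypes)
```

Because the selection and cast happen after the read, the helper should behave the same whether or not the installed pandas already handles empty files correctly.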

sterlinm added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 5, 2022

github-actions bot assigned Pydare Mar 8, 2022

phofl added IO Stata read_stata, to_stata Needs Info Clarification about behavior needed to assess issue labels Mar 10, 2022

Pydare removed their assignment Mar 11, 2022

sterlinm mentioned this issue Nov 8, 2022

CLN/FIX/PERF: Don't buffer entire Stata file into memory #49228

Merged

8 tasks

phofl removed Needs Info Clarification about behavior needed to assess issue Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 9, 2022

bashtage added a commit to bashtage/pandas that referenced this issue May 14, 2023

BUG: Correct behavior when reading empty dta files

a3f80cf

Correct column selection when reading empty dta files Correct omitted dtype information in empty dta files closes pandas-dev#46240

bashtage mentioned this issue May 14, 2023

BUG: Correct behavior when reading empty dta files #53226

Merged

5 tasks

mroeschke closed this as completed in #53226 May 15, 2023

mroeschke pushed a commit that referenced this issue May 15, 2023

BUG: Correct behavior when reading empty dta files (#53226)

d8046b5

Correct column selection when reading empty dta files Correct omitted dtype information in empty dta files closes #46240
