Reading with read_stata in chunks messes up categories #31544

Closed
toobaz opened this issue Feb 1, 2020 · 1 comment · Fixed by #34128
Labels: Bug, Categorical (Categorical Data Type), IO Stata (read_stata, to_stata)
Milestone: 1.1

Comments

@toobaz
Member

toobaz commented Feb 1, 2020

Code Sample, a copy-pastable example if possible

In [2]: df = pd.DataFrame({'col{}'.format(k) : pd.Categorical(['a_label'] +
                                                              ['another_label']*500)
                           for k in range(2)})

In [3]: df.dtypes
Out[3]:
col0    category
col1    category
dtype: object

In [4]: df.dtypes[0]
Out[4]: CategoricalDtype(categories=['a_label', 'another_label'], ordered=False)

In [5]: df.to_stata('/tmp/stata_test.dta', write_index=False)

In [6]: pd.read_stata('/tmp/stata_test.dta').dtypes
Out[6]:
col0    category
col1    category
dtype: object
# ... that's good

In [7]: reader = pd.read_stata('/tmp/stata_test.dta', chunksize=100)

In [8]: reader.value_labels()
Out[8]:
{'col0': {0: 'a_label', 1: 'another_label'},
 'col1': {0: 'a_label', 1: 'another_label'}}
# ... still all good

In [9]: out_chunks = [chunk for chunk in reader]

In [10]: out_chunks[1].dtypes[0]
Out[10]: CategoricalDtype(categories=['another_label'], ordered=True)
# Ooops... where's the other label gone?

In [11]: reader.close()

In [12]: all_together = pd.concat(out_chunks)

In [13]: all_together.dtypes[0]
Out[13]: dtype('O')
# Ouch!

Problem description

My data has categories, but they are lost purely because I'm reading it in chunks. I noticed this while reading in chunks a large dataset of which I only needed a subset of columns: ironically, reading it in chunks is precisely what made memory usage explode when I concatenated the chunks back together, since the concatenated columns fall back to object dtype.

And by the way, Out[8] shows that pandas is aware of the actual categories even before iterating, so this is the information that should be used to recreate them consistently, and all chunks should have exactly the same (as in Python's is operator) categorical dtype.
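For anyone hitting this on an affected version, here is a rough sketch of the idea above as a user-side workaround: read the chunks as raw codes with convert_categoricals=False and rebuild every column from reader.value_labels(), so all chunks share one dtype. The helper name apply_value_labels is made up for illustration, and the sketch assumes each label set is named after its column (as in Out[8]) and that labels within a set are unique. It is not what pandas does internally.

```python
import os
import tempfile

import pandas as pd


def apply_value_labels(chunk, value_labels, ordered=True):
    """Hypothetical helper: turn integer-coded columns into categoricals.

    Assumes the keys of value_labels are column names (true for files
    written by to_stata, as in Out[8]) and that labels are unique.
    """
    out = chunk.copy()
    for col, mapping in value_labels.items():
        if col not in out.columns:
            continue
        # Build the dtype from the full label set, not from the chunk's values.
        categories = [mapping[code] for code in sorted(mapping)]
        dtype = pd.CategoricalDtype(categories, ordered=ordered)
        # Codes without a label become NaN in the resulting categorical.
        out[col] = out[col].map(mapping).astype(dtype)
    return out


# Same data as in the report above.
df = pd.DataFrame(
    {f"col{k}": pd.Categorical(["a_label"] + ["another_label"] * 500) for k in range(2)}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stata_test.dta")
    df.to_stata(path, write_index=False)
    with pd.read_stata(path, chunksize=100, convert_categoricals=False) as reader:
        labels = reader.value_labels()
        chunks = [apply_value_labels(chunk, labels) for chunk in reader]

# Every chunk carries the same dtype, so concat keeps the categorical.
all_together = pd.concat(chunks)
```

Because the dtype is derived from the label dictionary rather than from the values that happen to appear in a chunk, a chunk containing only 'another_label' still lists both categories.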

Expected Output

Out[10] should feature both categories, and Out[13] should still be a categorical.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-6-amd64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8

pandas : 1.1.0.dev0+276.g2495068ad
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 18.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : 4.6.3
hypothesis : 3.71.11
sphinx : 1.8.4
blosc : 1.7.0
feather : None
xlsxwriter : 0.9.3
lxml.etree : 4.3.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.7 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.2
matplotlib : 3.0.2
numexpr : 2.6.9
odfpy : None
openpyxl : 2.4.9
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 4.6.3
pyxlsb : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.18
tables : 3.4.4
tabulate : 0.8.3
xarray : 0.11.3
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : 0.9.3
numba : 0.45.0

toobaz added the Categorical and IO Stata labels Feb 1, 2020
bashtage added the Bug label Mar 25, 2020
@bashtage
Contributor

Seems like a bug. Might need to handle categories using a new path when chunked.
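Until chunked reading handles this itself, one possible user-side way to repair chunks that were already read with mismatched categories is to merge the category sets after the fact. This is only a sketch: harmonize_chunks is a hypothetical helper, it uses pandas.api.types.union_categoricals, it drops the ordering of the categories, and it is not the approach taken in any pandas fix. The two small frames below stand in for chunks like those in Out[10].

```python
import pandas as pd
from pandas.api.types import union_categoricals


def harmonize_chunks(chunks):
    """Hypothetical helper: give each categorical column one shared dtype."""
    chunks = list(chunks)
    out = [chunk.copy() for chunk in chunks]
    cat_cols = [
        col
        for col in chunks[0].columns
        if isinstance(chunks[0][col].dtype, pd.CategoricalDtype)
    ]
    for col in cat_cols:
        # ignore_order=True is needed because ordered categoricals with
        # different categories cannot otherwise be unioned.
        merged = union_categoricals(
            [chunk[col] for chunk in chunks], ignore_order=True
        )
        dtype = pd.CategoricalDtype(merged.categories, ordered=False)
        for chunk in out:
            chunk[col] = chunk[col].astype(dtype)
    return out


# Stand-ins for two chunks whose categories disagree.
chunk0 = pd.DataFrame(
    {"col0": pd.Categorical(["a_label", "another_label"], ordered=True)}
)
chunk1 = pd.DataFrame(
    {"col0": pd.Categorical(["another_label", "another_label"], ordered=True)},
    index=[2, 3],
)

fixed = harmonize_chunks([chunk0, chunk1])
all_together = pd.concat(fixed)  # stays categorical instead of object
```

Note that this needs all chunks in memory at once to compute the union, so it helps with the dtype after concatenation but not with peak memory while reading.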

bashtage pushed a commit to bashtage/pandas that referenced this issue May 12, 2020
Return categoricals with the same categories if possible when reading
data through an iterator.
Warn if not possible.

closes pandas-dev#31544
jreback added this to the 1.1 milestone May 13, 2020
bashtage pushed a commit to bashtage/pandas that referenced this issue Jun 2, 2020
Return categoricals with the same categories if possible when reading
data through an iterator.
Warn if not possible.

closes pandas-dev#31544
jreback pushed a commit that referenced this issue Jun 4, 2020
…StataReader (#34128)

* BUG/ENH: Correct categorical on iterators

Return categoricals with the same categories if possible when reading
data through an iterator.
Warn if not possible.

closes #31544

3 participants