Reading with read_stata in chunks messes up categories #31544

toobaz · 2020-02-01T15:39:57Z

Code Sample, a copy-pastable example if possible

In [2]: df = pd.DataFrame({'col{}'.format(k) : pd.Categorical(['a_label'] +
                                                              ['another_label']*500)
                           for k in range(2)})                                                                           

In [3]: df.dtypes                                                                                                                                                                                
Out[3]: 
col0    category
col1    category
dtype: object

In [4]: df.dtypes[0]                                                                                                                                                                             
Out[4]: CategoricalDtype(categories=['a_label', 'another_label'], ordered=False)

In [5]: df.to_stata('/tmp/stata_test.dta', write_index=False)                                                                                                                                    

In [6]: pd.read_stata('/tmp/stata_test.dta').dtypes                                                                                                                                              
Out[6]: 
col0    category
col1    category
dtype: object
# ... that's good

In [7]: reader = pd.read_stata('/tmp/stata_test.dta', chunksize=100)                                                                                                                             

In [8]: reader.value_labels()                                                                                                                                                                    
Out[8]: 
{'col0': {0: 'a_label', 1: 'another_label'},
 'col1': {0: 'a_label', 1: 'another_label'}}
# ... still all good

In [9]: out_chunks = [chunk for chunk in reader]                                                                                                                                                 

In [10]: out_chunks[1].dtypes[0]                                                                                                                                                                 
Out[10]: CategoricalDtype(categories=['another_label'], ordered=True)
# Ooops... where's the other label gone?

In [11]: reader.close()                                                                                                                                                                          

In [12]: all_together = pd.concat(out_chunks)                                                                                                                                                    

In [13]: all_together.dtypes[0]                                                                                                                                                                  
Out[13]: dtype('O')
# Ouch!

Problem description

My data has categories, but they are lost simply because I'm reading it in chunks. I noticed this while reading in chunks a large database of which I only needed a subset of columns: ironically, the very fact that I was reading it in chunks made memory usage explode when I concatenated the chunks back together.
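To illustrate why reattaching the chunks loses the categorical dtype, here is a minimal sketch that does not touch Stata at all. It assumes two hand-made Series standing in for chunks (`chunk_a`, `chunk_b` and `shared` are illustrative names, not part of the report): concatenating categoricals whose categories differ falls back to a non-categorical dtype, while casting both to one shared `CategoricalDtype` first keeps the result categorical.

```python
import pandas as pd

# Two stand-in "chunks" whose categories were inferred independently,
# so the second one only knows about the label it actually contains.
chunk_a = pd.Series(pd.Categorical(["a_label", "another_label"]))
chunk_b = pd.Series(pd.Categorical(["another_label", "another_label"]))

# The categories differ, so concat cannot keep a categorical dtype
# (the report above shows dtype('O') for this case).
mismatched = pd.concat([chunk_a, chunk_b], ignore_index=True)

# Casting every chunk to one shared dtype before concatenating
# preserves the categorical.
shared = pd.CategoricalDtype(["a_label", "another_label"], ordered=True)
aligned = pd.concat(
    [chunk_a.astype(shared), chunk_b.astype(shared)], ignore_index=True
)

print(mismatched.dtype)
print(aligned.dtype)
```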

And by the way, Out[8] shows that pandas is aware of the actual categories even before iterating, so this is the information that should be used to recreate them consistently, and all chunks should have exactly the same (as in `is`) categorical dtype.
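A possible workaround along these lines is sketched below. The helper name `read_stata_chunks_with_shared_categories` is hypothetical, and the sketch assumes that each value-label set is named after its column (as in the file written by `to_stata` above), that every stored code has a label, and that labels within a set are unique. It reads the raw integer codes with `convert_categoricals=False` and builds one dtype per column from `reader.value_labels()`, so every chunk gets an equal dtype.

```python
import pandas as pd


def read_stata_chunks_with_shared_categories(path, chunksize):
    """Yield chunks whose labelled columns share one CategoricalDtype.

    Hypothetical helper, not part of pandas. Assumes value-label sets
    are named after their columns and that labels are unique per set.
    """
    with pd.read_stata(
        path, chunksize=chunksize, convert_categoricals=False
    ) as reader:
        labels = reader.value_labels()
        # One dtype per labelled column, built once from the full label set
        # and ordered by the underlying integer code.
        dtypes = {
            name: pd.CategoricalDtype(
                [mapping[code] for code in sorted(mapping)], ordered=True
            )
            for name, mapping in labels.items()
        }
        for chunk in reader:
            for name, dtype in dtypes.items():
                if name in chunk.columns:
                    # Codes without a label become NaN in this sketch.
                    chunk[name] = chunk[name].map(labels[name]).astype(dtype)
            yield chunk
```

Since every yielded chunk carries an equal dtype, passing the chunks to `pd.concat` should keep the labelled columns categorical.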

Expected Output

Out[10] should feature both categories, and Out[13] should still be a categorical.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-6-amd64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8

pandas : 1.1.0.dev0+276.g2495068ad
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 18.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : 4.6.3
hypothesis : 3.71.11
sphinx : 1.8.4
blosc : 1.7.0
feather : None
xlsxwriter : 0.9.3
lxml.etree : 4.3.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.7 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.2
matplotlib : 3.0.2
numexpr : 2.6.9
odfpy : None
openpyxl : 2.4.9
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 4.6.3
pyxlsb : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.18
tables : 3.4.4
tabulate : 0.8.3
xarray : 0.11.3
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : 0.9.3
numba : 0.45.0


bashtage · 2020-03-25T11:31:58Z

Seems like a bug. Might need to handle categories using a new path when chunked.
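For illustration only, one way to picture such a chunk-aware path is sketched below. It follows the description attached to the linked commits ("Return categoricals with the same categories if possible ... Warn if not possible"), but it is not the code that was merged into pandas; the function name `labelled_chunk_to_categorical` and its parameters are assumptions made for this example.

```python
import warnings

import pandas as pd


def labelled_chunk_to_categorical(codes, value_labels, ordered=True):
    """Convert one chunk of integer codes into a labelled categorical.

    Illustrative sketch, not the pandas implementation. `codes` is a
    Series of raw codes, `value_labels` maps code -> label.
    """
    keys = sorted(value_labels)
    if codes.isin(keys).all():
        # Every code in this chunk is labelled: use the full label set,
        # so all chunks end up with the same categories.
        cat = pd.Categorical(codes, categories=keys, ordered=ordered)
    else:
        # Some codes have no label: consistent categories cannot be
        # guaranteed, so warn and infer them from this chunk alone.
        warnings.warn(
            "Unlabelled codes found; categories may differ between chunks.",
            UserWarning,
            stacklevel=2,
        )
        cat = pd.Categorical(codes, ordered=ordered)
    # Swap the integer categories for their labels where one exists.
    names = [value_labels.get(code, code) for code in cat.categories]
    return pd.Series(cat.rename_categories(names), index=codes.index)
```

With the labels from Out[8], a chunk containing only code 1 would still come back with both `a_label` and `another_label` as categories.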


toobaz added the Categorical (Categorical Data Type) and IO Stata (read_stata, to_stata) labels Feb 1, 2020

bashtage added the Bug label Mar 25, 2020

bashtage pushed a commit to bashtage/pandas that referenced this issue May 12, 2020

BUG/ENH: Correct categorical on iterators

7238bd0

Return categoricals with the same categories if possible when reading data through an iterator. Warn if not possible. closes pandas-dev#31544

bashtage mentioned this issue May 12, 2020

BUG/ENH: Improve categorical construction when using the iterator in StataReader #34128

Merged

5 tasks

bashtage pushed a commit to bashtage/pandas that referenced this issue May 12, 2020

BUG/ENH: Correct categorical on iterators

3f09064

bashtage pushed a commit to bashtage/pandas that referenced this issue May 12, 2020

BUG/ENH: Correct categorical on iterators

fcdcbcf

bashtage pushed a commit to bashtage/pandas that referenced this issue May 12, 2020

BUG/ENH: Correct categorical on iterators

aef1622

bashtage pushed a commit to bashtage/pandas that referenced this issue May 12, 2020

BUG/ENH: Correct categorical on iterators

0f8116e

jreback added this to the 1.1 milestone May 13, 2020

bashtage pushed a commit to bashtage/pandas that referenced this issue Jun 2, 2020

BUG/ENH: Correct categorical on iterators

0a069be

jreback closed this as completed in #34128 Jun 4, 2020
