Fatal error with astype if duplicate columns are supplied for categorical #24704

Open
thomasfrederikhoeck opened this issue Jan 10, 2019 · 5 comments

@thomasfrederikhoeck
commented Jan 10, 2019

Code Sample

The following code, which converts the object columns to category, runs without problems:

import pandas as pd

df = pd.DataFrame({'a': ['1',1,3], 'b' : [1,2,3]})

print(df.dtypes)

categoricals = list(df.select_dtypes(include='object').columns.values)
df[categoricals] = df[categoricals].astype('category')

print(df.dtypes)

which returns

a    object
b     int64
dtype: object

a    category
b       int64
dtype: object

If a column is mistakenly listed twice (here 'a' is added again):

import pandas as pd

df = pd.DataFrame({'a': ['1',1,3], 'b' : [1,2,3]})

print(df.dtypes)

categoricals = list(df.select_dtypes(include='object').columns.values)
categoricals = categoricals + ['a']

df[categoricals] = df[categoricals].astype('category')

print(df.dtypes)

Python crashes with

a    object
b     int64
dtype: object
Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007f806cac8700 (most recent call first):
  File "<frozen importlib._bootstrap>", line 172 in _get_module_lock
  File "<frozen importlib._bootstrap>", line 148 in __enter__
  File "<frozen importlib._bootstrap>", line 960 in _find_and_load
  File "<frozen importlib._bootstrap>", line 205 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 936 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 961 in _find_and_load
  File "<frozen importlib._bootstrap>", line 205 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 936 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 961 in _find_and_load
  File "/home/runner/.site-packages/pandas/core/indexes/base.py", line 4960 in _ensure_index
  File "/home/runner/.site-packages/pandas/core/indexes/base.py", line 3363 in get_indexer_non_unique
  File "/home/runner/.site-packages/pandas/core/indexes/base.py", line 3386 in get_indexer_for
  File "/home/runner/.site-packages/pandas/core/internals.py", line 4132 in get
  File "/home/runner/.site-packages/pandas/core/frame.py", line 2698 in _getitem_column
  File "/home/runner/.site-packages/pandas/core/frame.py", line 2671 in __getitem__
  File "/home/runner/.site-packages/pandas/core/generic.py", line 4996 in <genexpr>
  File "/home/runner/.site-packages/pandas/core/reshape/concat.py", line 256 in __init__
  File "/home/runner/.site-packages/pandas/core/reshape/concat.py", line 225 in concat
  File "/home/runner/.site-packages/pandas/core/generic.py", line 5005 in astype
  File "/home/runner/.site-packages/pandas/util/_decorators.py", line 178 in wrapper
  ... (the five frames above repeat for the remainder of the traceback)

Problem description

One would expect pandas to either raise an error saying that there are duplicate columns or drop the duplicates, rather than crashing the interpreter.

I'm using Python 3.6.1 and pandas-0.23.4.

Expected Output

"The list of columns you have supplied has duplicates"
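Until pandas handles this itself, the expected behaviour can be approximated on the user side. The following is only a minimal sketch: `astype_checked` is a hypothetical helper (not part of pandas) that refuses a column list containing duplicates and otherwise converts one column at a time.

```python
from collections import Counter

import pandas as pd


def astype_checked(df, columns, dtype):
    """Hypothetical helper: convert `columns` of `df` to `dtype`,
    raising instead of recursing when a label is listed twice."""
    duplicates = [label for label, count in Counter(columns).items() if count > 1]
    if duplicates:
        raise ValueError(
            "The list of columns you have supplied has duplicates: {}".format(duplicates)
        )
    out = df.copy()
    # Convert one column at a time so each assignment is unambiguous.
    for label in columns:
        out[label] = out[label].astype(dtype)
    return out


df = pd.DataFrame({'a': ['1', 1, 3], 'b': [1, 2, 3]})

converted = astype_checked(df, ['a'], 'category')
print(converted.dtypes)

try:
    astype_checked(df, ['a', 'a'], 'category')
except ValueError as exc:
    print(exc)
```

Raising early keeps the mistake visible; silently dropping the duplicate would also avoid the crash but could hide a bug in how the column list was built.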

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-1011-gcp
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@thomasfrederikhoeck

Author

commented Jan 10, 2019

The same seems to happen when converting from int64, so it is not specific to object columns:

import pandas as pd

df = pd.DataFrame({'a': ['1',1,3], 'b' : [1,2,3]})

print(df.dtypes)

categoricals = list(df.select_dtypes(include='int64').columns.values)
categoricals = categoricals + ['b']

df[categoricals] = df[categoricals].astype('category')

print(df.dtypes)

a    object
b     int64
dtype: object
Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007fa47ad78700 (most recent call first):
  File "<frozen importlib._bootstrap>", line 172 in _get_module_lock
  File "<frozen importlib._bootstrap>", line 148 in __enter__
  File "<frozen importlib._bootstrap>", line 960 in _find_and_load
  File "<frozen importlib._bootstrap>", line 205 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 936 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 961 in _find_and_load
  File "<frozen importlib._bootstrap>", line 205 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 936 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 961 in _find_and_load
  File "/home/runner/.site-packages/pandas/core/indexes/base.py", line 4960 in _ensure_index
  File "/home/runner/.site-packages/pandas/core/indexes/base.py", line 3363 in get_indexer_non_unique
  File "/home/runner/.site-packages/pandas/core/indexes/base.py", line 3386 in get_indexer_for
  File "/home/runner/.site-packages/pandas/core/internals.py", line 4132 in get
  File "/home/runner/.site-packages/pandas/core/frame.py", line 2698 in _getitem_column
  File "/home/runner/.site-packages/pandas/core/frame.py", line 2671 in __getitem__
  File "/home/runner/.site-packages/pandas/core/generic.py", line 4996 in <genexpr>
  File "/home/runner/.site-packages/pandas/core/reshape/concat.py", line 256 in __init__
  File "/home/runner/.site-packages/pandas/core/reshape/concat.py", line 225 in concat
  File "/home/runner/.site-packages/pandas/core/generic.py", line 5005 in astype
  File "/home/runner/.site-packages/pandas/util/_decorators.py", line 178 in wrapper
  ... (the five frames above repeat for the remainder of the traceback)
@thomasfrederikhoeck

Author

commented Jan 10, 2019

However, it does not happen when the target dtype is 'int64':

import pandas as pd

df = pd.DataFrame({'a': ['1',1,3], 'b' : [1,2,3]})

print(df.dtypes)

categoricals = list(df.select_dtypes(include='object').columns.values)
categoricals = categoricals + ['a']

df[categoricals] = df[categoricals].astype('int64')

print(df.dtypes)

a    object
b     int64
dtype: object

a    int64
b    int64
dtype: object

So the problem is probably specific to the category dtype.

@jreback

Contributor

commented Jan 10, 2019

You are trying to set with a duplicated column label:

In [6]: categoricals
Out[6]: ['b', 'b']
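If the duplicate is unintentional, one possible user-side workaround is to drop repeated labels before converting. This is a sketch rather than a recommendation from the thread; `unique_categoricals` is a name introduced here, and the order-preserving behaviour of `dict.fromkeys` assumes Python 3.7 or later.

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', 1, 3], 'b': [1, 2, 3]})

categoricals = list(df.select_dtypes(include='int64').columns.values)
categoricals = categoricals + ['b']  # 'b' is now listed twice, as in the report

# dict.fromkeys keeps the first occurrence of each label, in order.
unique_categoricals = list(dict.fromkeys(categoricals))

# Convert column by column so no duplicated key is ever used for setting.
for col in unique_categoricals:
    df[col] = df[col].astype('category')

print(unique_categoricals)
print(df.dtypes)
```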
@jschendel

Member

commented Jan 10, 2019

It looks like there are actually two issues here:

1. DataFrame.astype(ExtensionDtype) fails with duplicate columns

In [2]: df = pd.DataFrame([[1, 2], [1, 1], [3, 2]], columns=['a', 'a'])

In [3]: df
Out[3]:
   a  a
0  1  2
1  1  1
2  3  2

In [4]: df.astype('category')
---------------------------------------------------------------------------
RecursionError: maximum recursion depth exceeded

In [5]: df.astype('Int64')
---------------------------------------------------------------------------
RecursionError: maximum recursion depth exceeded

This works for other dtypes when duplicate columns are present, and the fix looks easy, so we could probably support it.
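For illustration, item 1 can be worked around today by converting the columns by position rather than by label, which sidesteps the ambiguous label lookup. This is only a sketch under that assumption and not necessarily how a fix inside pandas would look; `astype_by_position` is a hypothetical helper name.

```python
import pandas as pd


def astype_by_position(df, dtype):
    """Hypothetical helper: cast every column of `df` to `dtype`,
    addressing columns by position so duplicate labels are harmless.
    Assumes `df` has at least one column."""
    converted = [df.iloc[:, i].astype(dtype) for i in range(df.shape[1])]
    out = pd.concat(converted, axis=1)
    out.columns = df.columns  # restore the original (possibly duplicated) labels
    return out


df = pd.DataFrame([[1, 2], [1, 1], [3, 2]], columns=['a', 'a'])
result = astype_by_position(df, 'category')
print(result.dtypes)
```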

2. Setting to a DataFrame[ExtensionDtype] with duplicate columns results in object dtype

Even if item 1 was fixed, the setting process would result in an object dtype instead of a categorical/extension dtype:

In [6]: df = pd.DataFrame({'a': [1, 1, 2], 'b' :['foo', 'bar', 'baz']})

In [7]: df
Out[7]:
   a    b
0  1  foo
1  1  bar
2  2  baz

In [8]: df.dtypes
Out[8]:
a     int64
b    object
dtype: object

In [9]: df_aa = pd.concat([pd.Series([10, 20, 30], name='a', dtype='category'),
    ...:                   pd.Series([11, 22, 33], name='a', dtype='category')], axis=1)

In [10]: df_aa
Out[10]:
    a   a
0  10  11
1  20  22
2  30  33

In [11]: df_aa.dtypes
Out[11]:
a    category
a    category
dtype: object

In [12]: df['a'] = df_aa

In [13]: df
Out[13]:
    a    b
0  10  foo
1  20  bar
2  30  baz

In [14]: df.dtypes
Out[14]:
a    object
b    object
dtype: object

In [15]: df[['a', 'a']] = df_aa

In [16]: df
Out[16]:
    a    b
0  10  foo
1  20  bar
2  30  baz

In [17]: df.dtypes
Out[17]:
a    object
b    object
dtype: object

I'm not sure that this should be supported. The operation doesn't really make sense to me, and I'm a little bit surprised that it didn't raise.
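As a point of comparison for item 2, assigning a single Series picked by position is unambiguous and keeps the categorical dtype. The sketch below reuses `df` and `df_aa` from the session above; choosing the first of the two 'a' columns is an arbitrary assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['foo', 'bar', 'baz']})
df_aa = pd.concat([pd.Series([10, 20, 30], name='a', dtype='category'),
                   pd.Series([11, 22, 33], name='a', dtype='category')], axis=1)

# With duplicated labels, df_aa['a'] is a two-column DataFrame, not a Series,
# so select exactly one column by position before assigning.
if not df_aa.columns.is_unique:
    df['a'] = df_aa.iloc[:, 0]

print(df.dtypes)
```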

@jreback : what are your thoughts on items 1 and 2?

@jreback

Contributor

commented Jan 10, 2019

Yeah, 1) is OK; 2) is somewhat tricky and probably OK not to support right away.
