BUG: "cannot reindex from duplicate axis" thrown using unique indexes, duplicated column names and a specific numpy array values #31954

Closed
igorluppi opened this issue Feb 13, 2020 · 32 comments · Fixed by #38354
Labels: Error Reporting (Incorrect or improved errors from pandas), good first issue, Indexing (Related to indexing on series/frames, not to indexes themselves), Needs Tests (Unit test(s) needed to prevent regressions)

Comments

@igorluppi

igorluppi commented Feb 13, 2020

Code Sample

import pandas 
import numpy as np

a = np.array([[1,2],[3,4]]) 

# Does NOT work
b = np.array([[0.5,6],[7,8]])  
# b = np.array([[.5,6],[7,8]])  # The same problem

# This one works fine:
# b = np.array([[5,6],[7,8]]) 

dfA = pandas.DataFrame(a)
# This works fine EVEN using .5, because the column names are different
# dfA = pandas.DataFrame(a, columns=['a','b'])
dfB = pandas.DataFrame(b)

df_new = pandas.concat([dfA, dfB], axis = 1)

print(df_new[df_new > 5])

Problem description

There is a bug triggered by the combination of specific numpy values and duplicated DataFrame column names when a selection such as df[df > 5] is applied. An exception is raised saying "cannot reindex from a duplicate axis". It should not be raised, because:

  • The DataFrame has no duplicated index values (df.index.is_unique is True)
  • The DataFrame has duplicated column names, but that should not be a problem for a selection such as df_new[df_new > 5]
  • The DataFrame holds float or int numpy values, and that difference should not change the behavior of the code

However, the values in the numpy array DO change the behavior of the DataFrame selection if the DataFrame has duplicated column names.
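
The three claims above can be checked directly. The following is a minimal sketch that rebuilds the frame from the code sample and inspects both axes; the `pd` alias and the printed checks are choices made for this illustration only.

```python
import numpy as np
import pandas as pd

# Same inputs as the code sample: an integer array and a float array.
a = np.array([[1, 2], [3, 4]])
b = np.array([[0.5, 6], [7, 8]])

df_new = pd.concat([pd.DataFrame(a), pd.DataFrame(b)], axis=1)

# The row index is unique, so the duplicates are not in the index.
print(df_new.index.is_unique)

# The column labels are duplicated: both inputs use the default labels 0 and 1.
print(df_new.columns.is_unique)
print(df_new.columns.tolist())

# The two halves carry different dtypes (integer vs. float).
print(df_new.dtypes.tolist())
```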

Expected Output

    0   1    0  1
0 NaN NaN  NaN  6
1 NaN NaN  7.0  8

Current Output

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-28-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@igorluppi
Author

Moreover, doing this:

In [177]:  new_df = df.reset_index(drop=True) 
In [178]:  new_df[new_df > 10]

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

So we can be 100% sure there are no duplicate index values here. What is going on?

@MarcoGorelli
Member

Thanks @igorluppi

I just tried

df = pd.DataFrame(np.random.randn(150001, 792))
df[df>10]                                                                                                                                                         

and got no error - could you give us some more details about your dataframe? Do you still get the error if you only consider its head, or if you only use (say) its first 5 columns?

@igorluppi
Author

igorluppi commented Feb 13, 2020

I have many dataframes, and I put all of them into a single one.

Let df_items be a list of dataframes.
I got this error using:
df_final = pandas.concat(df_items, axis = 1)

However I verified that
df_final = reduce(lambda x, y: pandas.merge(x, y, left_index=True, right_index=True, how='outer'), df_items)

works fine: I get the same resulting DataFrame, and df_final[df_final>10] works on it. But this method takes much longer; concat is at least 10 times faster.

Thanks to https://stackoverflow.com/questions/45885043/pandas-concat-cannot-reindex-from-a-duplicate-axis?rq=1 for this possible solution. But why does the error happen?
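
One possible way to keep the speed of concat while avoiding duplicated column labels is to pass `keys`, which places each input frame under its own outer column label. This is a sketch, not something proposed in the thread; it assumes MultiIndex columns are acceptable downstream, and `df_items` here is a small stand-in for the real list of frames.

```python
import numpy as np
import pandas as pd

# Stand-in for the real list of frames; both use the default labels 0 and 1.
df_items = [
    pd.DataFrame(np.array([[1, 2], [3, 4]])),
    pd.DataFrame(np.array([[0.5, 6], [7, 8]])),
]

# keys= adds an outer column level, so (0, 0) and (1, 0) no longer collide.
df_final = pd.concat(df_items, axis=1, keys=list(range(len(df_items))))
print(df_final.columns.is_unique)

# With unique column labels the boolean mask can be applied.
masked = df_final[df_final > 5]
print(masked)
```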

@MarcoGorelli
Member

Applying df[df>10] I got "cannot reindex from duplicate axis",

I got this error using:
df_final = pandas.concat(df_items, axis = 1)

Sorry, I'm a bit confused, which command gave you the error - pd.concat or df[df>10]?

@igorluppi
Author

igorluppi commented Feb 13, 2020

pd.concat gave me df_final, and df_final raises that error when I use df_final[df_final>10].
The interesting part is that when I use the reduce method and get df_final2, that one works; in other words, df_final2[df_final2>10] works fine.

Moreover,

In [14]: df_final.equals(df_final2)                                                                                                                                                                            
Out[14]: False

But I didn't find where the difference is.
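
When `equals` returns False without an obvious reason, comparing the pieces one at a time can narrow it down. Below is a minimal sketch; `explain_difference` is a hypothetical helper written for this illustration, and the value comparison assumes purely numeric frames.

```python
import numpy as np
import pandas as pd


def explain_difference(left, right):
    """Hypothetical helper: report which parts of two frames differ."""
    report = {
        "same_shape": left.shape == right.shape,
        "same_index": left.index.equals(right.index),
        "same_columns": left.columns.equals(right.columns),
    }
    if report["same_shape"]:
        # Assumes numeric data; NaNs in the same position count as equal.
        report["same_values"] = bool(
            np.array_equal(
                left.to_numpy(dtype=float),
                right.to_numpy(dtype=float),
                equal_nan=True,
            )
        )
    return report


first = pd.DataFrame({"dup": [1, 2]})
second = pd.DataFrame({"dup": [0.5, 6.0]})

via_concat = pd.concat([first, second], axis=1)
via_merge = pd.merge(first, second, left_index=True, right_index=True, how="outer")

report = explain_difference(via_concat, via_merge)
print(report)
```

In this small case the values and the index match, and only the column labels differ.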

@igorluppi
Author

@MarcoGorelli I found why the problem is happening, but it points to another problem regarding the exception I got. Give me a second.

@igorluppi
Author

igorluppi commented Feb 13, 2020

@MarcoGorelli "cannot reindex from duplicate axis" should be split into two messages:
"cannot reindex from duplicate index" and "cannot reindex from duplicate columns". I will explain why.

All the messages and solutions I found told me to look at the index, but in my case the duplicates were in the columns.

But why did the second case work? reduce(lambda x, y: pandas.merge(x, y, left_index=True, right_index=True, how='outer'), df_items)
In this case, when merge finds a duplicated column, it automatically appends the suffix "_x" to the name, so it becomes "duplicated_column_x". That is not the case for concat, which keeps the duplicated column name "duplicated_column".
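
The difference in column naming between the two approaches can be seen on a two-frame example. This is a small sketch with made-up data; the column name "duplicated_column" is taken from the description above, and `_x`/`_y` are the default merge suffixes.

```python
import pandas as pd

left = pd.DataFrame({"duplicated_column": [1, 2]})
right = pd.DataFrame({"duplicated_column": [3.5, 4.5]})

# merge disambiguates overlapping column names with its default suffixes.
merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")
print(merged.columns.tolist())

# concat keeps the labels exactly as they are, duplicates included.
concatenated = pd.concat([left, right], axis=1)
print(concatenated.columns.tolist())
```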

My suggestion

Please change the exception to state whether the problem is in the columns or in the index. Just saying "duplicate axis" made it confusing to find the solution.
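
Until the message is more specific, a user-side check can tell which axis holds the duplicates. The following is only a sketch of that idea; `describe_duplicate_axes` is a hypothetical helper and not part of pandas.

```python
import numpy as np
import pandas as pd


def describe_duplicate_axes(df):
    """Hypothetical helper: list which axis (index or columns) has duplicate labels."""
    problems = []
    for name, axis in (("index", df.index), ("columns", df.columns)):
        if not axis.is_unique:
            # Index.duplicated() marks every repeat after the first occurrence.
            repeated = axis[axis.duplicated()].unique().tolist()
            problems.append(f"duplicate labels in {name}: {repeated}")
    return problems


df_new = pd.concat(
    [pd.DataFrame(np.array([[1, 2], [3, 4]])), pd.DataFrame(np.array([[0.5, 6], [7, 8]]))],
    axis=1,
)
problems = describe_duplicate_axes(df_new)
print(problems)
```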

@MarcoGorelli
Member

MarcoGorelli commented Feb 13, 2020

Thanks @igorluppi

tbh I still can't reproduce the error:

df = pd.DataFrame([0, 1], columns=['a'])
new_df = pd.concat([df, df], axis=1)
new_df[new_df>0]  # works

could you try coming up with a minimal reproducible example?

@igorluppi
Author

Ok, I will create a simple example

@igorluppi
Author

igorluppi commented Feb 13, 2020

@MarcoGorelli

import pandas 
import numpy as np

a = np.array([[1,2],[3,4]]) 

# Does NOT work
b = np.array([[0.5,6],[7,8]]) 
# OR
# b = np.array([[.5,6],[7,8]])

# This one works fine:
# b = np.array([[5,6],[7,8]]) 

dfA = pandas.DataFrame(a)
# This works fine EVEN using .5, because the column names are different
# dfA = pandas.DataFrame(a, columns=['a','b'])
dfB = pandas.DataFrame(b)

df_new = pandas.concat([dfA, dfB], axis = 1)

df_new[df_new>3]
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

Basically, using .5 or 0.5 in the numpy array breaks the DataFrame operation. This might be a problem in how pandas interacts with numpy.

The interesting part is that numpy float values only break the code when column names are duplicated.
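
For anyone hitting this on an affected version, one workaround that sidesteps label-based alignment is to mask the underlying NumPy array and rebuild the frame. This is a sketch, not something proposed in this thread; it assumes all columns are numeric so that `to_numpy()` can produce a single float array.

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame(np.array([[1, 2], [3, 4]]))
dfB = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
df_new = pd.concat([dfA, dfB], axis=1)

# Work on positions instead of labels, so duplicate column names do not matter.
values = df_new.to_numpy(dtype=float)
masked_values = np.where(values > 5, values, np.nan)

# Rebuild the frame with the original (still duplicated) labels.
masked = pd.DataFrame(masked_values, index=df_new.index, columns=df_new.columns)
print(masked)
```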

@MarcoGorelli
Member

@igorluppi great, thanks! Could you edit this example into the original post?

@igorluppi igorluppi changed the title on Feb 14, 2020
from: "cannot reindex from duplicate axis" when I apply some operation like df[df > 10] using unique indexes
to: BUG: "cannot reindex from duplicate axis" thrown using unique indexes, duplicated column names and a specific numpy array values
@igorluppi
Author

@MarcoGorelli

For sure, it's done my friend!

@igorluppi
Author

@MarcoGorelli is it a bug? Anything new?

@MarcoGorelli
Member

cc @jorisvandenbossche

@igorluppi
Author

should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli

@MarcoGorelli
Member

should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli

I don't think so - I presume the core team is prioritising what'll be in the v1.0.2 release. I'm working on another issue at the moment but I plan to get back to this

@igorluppi
Author

Any news ? @MarcoGorelli @jorisvandenbossche

@MarcoGorelli
Member

I've not (yet) looked into this more, but you're welcome to submit a pull request if you like https://pandas.pydata.org/pandas-docs/stable/development/contributing.html

@Dr-Irv
Contributor

Dr-Irv commented Sep 5, 2020

This works fine in pandas 1.1.1

@Dr-Irv Dr-Irv closed this as completed Sep 5, 2020
@MarcoGorelli
Member

This works fine in pandas 1.1.1

Any idea when it was fixed? It's probably good to make sure this was intentional and that there's a test for it...I'll do a git bisect

@MarcoGorelli
Member

MarcoGorelli commented Sep 5, 2020

If I've done git bisect correctly (which I'm not sure I have, see below), it looks like this was fixed in #33616

Could do with a test, so am reopening.


Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)

(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ git checkout 1c0cc62e30a3077476e97f8e7e6ba17b4ac754b6
Previous HEAD position was ad8ce0be9 CLN: Clean missing.py (#33631)
HEAD is now at 1c0cc62e3 REF: get .items out of BlockManager.apply (#33616)
(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ python setup.py build_ext -i -j 8
running build_ext
building 'pandas._libs.tslibs.nattype' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslibs/nattype.c -o build/temp.linux-x86_64-3.8/pandas/_libs/tslibs/nattype.o -Werror
building 'pandas._libs.interval' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs -Ipandas/_libs/src/klib -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/interval.c -o build/temp.linux-x86_64-3.8/pandas/_libs/interval.o -Werror
pandas/_libs/tslibs/nattype.c:5108:18: error: ‘__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__’ defined but not used [-Werror=unused-function]
 5108 | static PyObject *__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_other) {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pandas/_libs/interval.c:8278:18: error: ‘__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__’ defined but not used [-Werror=unused-function]
 8278 | static PyObject *__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_y) {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
cc1: all warnings being treated as errors
error: command 'gcc' failed with exit status 1

EDIT

@dsaxton I saw you've brought something similar up in the Gitter chat, were you able to resolve it?

@MarcoGorelli MarcoGorelli reopened this Sep 5, 2020
@MarcoGorelli MarcoGorelli added the Needs Tests Unit test(s) needed to prevent regressions label Sep 5, 2020
@MarcoGorelli MarcoGorelli added this to the Contributions Welcome milestone Sep 5, 2020
@jorisvandenbossche
Member

@MarcoGorelli thanks for the analysis!

@dsaxton
Member

dsaxton commented Sep 6, 2020

@MarcoGorelli I found that building instead with the command CFLAGS='-Wno-error=deprecated-declarations' python setup.py build_ext -i generally fixes things, although I'm not sure if it'll work in this case. There's a thread about these problems here: #33315

@simonjayhawkins
Member

Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)

I've set up a workflow for bisecting. I didn't see that error, but I added || exit 125 to the runner script to skip failed builds.
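
The same idea can be expressed as a small Python runner for `git bisect run`. This is a sketch and not the workflow linked below: the function name is made up, and it assumes the bisect was started with the buggy state as the "old" term and the fixed state as the "new" term, since the goal is to find the commit that fixed the bug. `git bisect run` treats exit code 0 as old/good, 125 as "skip this commit", and other codes from 1 to 127 as new/bad.

```python
import sys


def bisect_exit_code():
    """Return the exit code a `git bisect run` script would use for this checkout."""
    try:
        import numpy as np
        import pandas as pd
    except Exception:
        # The build is broken at this commit: ask git bisect to skip it.
        return 125

    dfA = pd.DataFrame(np.array([[1, 2], [3, 4]]))
    dfB = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
    df_new = pd.concat([dfA, dfB], axis=1)

    try:
        df_new[df_new > 5]
    except ValueError:
        # Bug still present: this commit is on the "old" side.
        return 0
    # No error: the fix is already in, so this commit is on the "new" side.
    return 1


# In a real runner script the last line would be: sys.exit(bisect_exit_code())
print(bisect_exit_code())
```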

https://github.com/simonjayhawkins/pandas/runs/1078479989?check_suite_focus=true

It agrees that #33616 fixed this.

@MarcoGorelli
Member

I've set up a workflow for bisecting.

wow, nice!!

@GabrielSimonetto

take

@GabrielSimonetto

Ok, I'm stuck.

After investigating PR 33616, I see that 2 files were changed:
pandas/core/generic.py and pandas/core/internals/managers.py. They seem tightly correlated, and generic.py directly reindexes things, but after setting some breakpoints and running the following code, I noticed that the generic.py portion is never called.

df = pd.DataFrame([[1,2,5,6],
                    [3,4,7,8]])
df.columns=[0,1,0,1]
df[df>5]

Besides that, by grepping I found that the exception mentioned in this issue is only raised in the function _can_reindex(), and this function is only used in reindex_indexer(), which should make it easy to debug how the error happens:

(venv) [bigode@coala pandas]$ grep -r _can_reindex
core/indexes/base.py:    def _can_reindex(self, indexer):
core/internals/managers.py:            self.axes[axis]._can_reindex(indexer)

The problem is that, after setting breakpoints in both functions, neither is ever called during this operation! This means the fix in pandas/core/internals/managers.py made the code avoid a section it should never have entered in the first place, which is supported by the comments @jreblack inserted:

# The caller is responsible for ensuring that
#  obj.axes[-1].equals(self.items)

I was already a bit stuck before on what the specific test should be...

(I was experimenting with pandas.core.internals.managers.BlockManager.{reindex_indexer, reindex_axis}, but I could not confirm they are being used: the only entry point I could confirm was the aforementioned internals.managers.apply(), and setting breakpoints on reindex_indexer and reindex_axis had no effect with the test code, which makes me think they are not being called, as absurd as that sounds),

...but now I'm completely lost. If someone could shed some light on the issue that would be awesome. Besides that, if I have some spare time I will try to use a pandas version prior to PR 33616 to see if I can pinpoint what exact interaction fixed this issue.

@Dr-Irv
Contributor

Dr-Irv commented Oct 12, 2020

@GabrielSimonetto To address this issue, you only need to add a test that demonstrates that the bug was fixed. Don't worry about the internals. What happened here is that I saw the issue was fixed, and closed it, then @MarcoGorelli wanted to figure out where it was fixed, and we reopened it deciding we just needed a test to make sure that the issue is truly addressed.

@MarcoGorelli
Member

Yup 😄 @GabrielSimonetto if you wanted to submit a test to make sure this doesn't break again in the future, that would be welcome!

@GabrielSimonetto

@Dr-Irv do you know which module would be the right place for this test? If I understood correctly, a high-level check will be enough?

@MarcoGorelli
Member

@GabrielSimonetto You can use the example provided in #31954 (comment) as a test

If you open a pull request you can put it in where you think a sensible location is and if necessary we'll ask you to put it somewhere else
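
A high-level regression test along these lines might look as follows. This is a sketch based on the example in this issue, not the test that was eventually merged; the test name is made up, and the expected frame assumes the fully masked integer columns come back as float64.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_boolean_mask_with_duplicate_columns_and_mixed_dtypes():
    # Integer and float inputs that share the default column labels 0 and 1.
    dfA = pd.DataFrame(np.array([[1, 2], [3, 4]]))
    dfB = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
    df = pd.concat([dfA, dfB], axis=1)

    # On affected versions this line raised
    # "ValueError: cannot reindex from a duplicate axis".
    result = df[df > 5]

    expected = pd.DataFrame(
        [[np.nan, np.nan, np.nan, 6.0], [np.nan, np.nan, 7.0, 8.0]],
        columns=[0, 1, 0, 1],
    )
    tm.assert_frame_equal(result, expected)
```

Run under pytest, the function is collected by its `test_` prefix; it can also be called directly.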

@GabrielSimonetto

Great @MarcoGorelli! I'm on it, thanks!

@jreback jreback modified the milestones: Contributions Welcome, 1.2 Oct 16, 2020
@jreback jreback added Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves labels Oct 16, 2020
@jreback jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.3 Dec 8, 2020
8 participants