ERR: better error message on invalid on with multi-index columns #9455

Closed
amelio-vazquez-reina opened this Issue Feb 9, 2015 · 15 comments

Contributor

amelio-vazquez-reina commented Feb 9, 2015

Consider the following:

> a_df

   object_uid       item_uid
0  0FlIRitxKX     0D1dPxep2C
1  0FLPRriacw     0DoYxRkSdr
2  0FUh6FQ4mQ     0DsOBGsYl9
3  0FH1hIfNCL     0DoYxRkSdr
4  0FZVBzi2DX     0DoYxRkSdr
5  0FAJ5vfLQs     0DoYxRkSdr
6  0F7lEk65qa     0DoYxRkSdr
7  0Fqe3TIN7Q     0DoYxRkSdr
8  0FrSgt4Vut     0DsOBGsYl9
9  0FByA7F19x     0DoYxRkSdr

and

> b_df

   object_uid         after        before change_rate
                micro_spend   micro_spend            
0  0F0092Ntoi  2.940774e+09  2.932917e+09    0.267909
1  0F01SZnTfs  3.000654e+07  2.887034e+07    3.935523
2  0F027Rm6rd  6.729106e+07  6.912947e+07   -2.659376
3  0F02BZeTr6  2.072908e+09  2.048737e+09    1.179802
4  0F02Vwd6Ou  1.619934e+08  1.682784e+08   -3.734909

I would like to do an inner join on object_uid. When I do:

pd.merge(a_df, b_df, on='object_uid')

I get:
AttributeError: 'numpy.ndarray' object has no attribute 'start'

This is with:

> pd.__version__
'0.15.2'

The full error trace is below:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-131-5e11878c8b10> in <module>()
----> 1 pd.merge(exp_setup_df.head(10).iloc[:, [0,4]].copy(), spend_diff.reset_index().head(10).copy(), on='flight_uid')

/Users/avazquez/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/tools/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_on=right_on, left_index=left_index,
     37                          right_index=right_index, sort=sort, suffixes=suffixes,
---> 38                          copy=copy)
     39     return op.get_result()
     40 if __debug__:

/Users/avazquez/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/tools/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy)
    182         (self.left_join_keys,
    183          self.right_join_keys,
--> 184          self.join_names) = self._get_merge_keys()
    185 
    186     def get_result(self):

/Users/avazquez/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/tools/merge.py in _get_merge_keys(self)
    386 
    387         if right_drop:
--> 388             self.right = self.right.drop(right_drop, axis=1)
    389 
    390         return left_keys, right_keys, join_names

/Users/avazquez/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, **kwargs)
   1559                 new_axis = axis.drop(labels, level=level)
   1560             else:
-> 1561                 new_axis = axis.drop(labels)
   1562             dropped = self.reindex(**{axis_name: new_axis})
   1563             try:

/Users/avazquez/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/core/index.py in drop(self, labels, level)
   3734                 inds.append(loc)
   3735             else:
-> 3736                 inds.extend(lrange(loc.start, loc.stop))
   3737 
   3738         return self.delete(inds)

AttributeError: 'numpy.ndarray' object has no attribute 'start'

Also, the following works:

> result_df = pd.concat([a_df.set_index('object_uid'), b_df.set_index('object_uid')], axis=1)

but it flattens my columns:

> result_df.columns
Index(['object_uid', ('after', 'micro_spend'), ('before', 'micro_spend'), 'item_id'], dtype='object')
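For reference, the shape of the two frames can be sketched with toy data (values here are made up; only the column structure matters):

```python
import pandas as pd

# flat columns, like a_df above
a_df = pd.DataFrame({'object_uid': ['x', 'y'], 'item_uid': ['i1', 'i2']})

# two-level columns, like b_df above
b_df = pd.DataFrame(
    [['x', 1.0, 2.0, 0.5], ['y', 3.0, 4.0, -0.5]],
    columns=pd.MultiIndex.from_tuples([
        ('object_uid', ''),
        ('after', 'micro_spend'),
        ('before', 'micro_spend'),
        ('change_rate', ''),
    ]),
)

# a_df has one column level, b_df has two -- this mismatch is what
# trips up pd.merge(a_df, b_df, on='object_uid')
print(a_df.columns.nlevels, b_df.columns.nlevels)
```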
Contributor

jreback commented Feb 9, 2015

show .info() on both frames

Contributor

amelio-vazquez-reina commented Feb 9, 2015

Thanks @jreback

> a_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
object_uid      10 non-null object
item_uid        10 non-null object
dtypes: object(2)
memory usage: 240.0+ bytes

and

> b_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 4 columns):
(object_uid, )           10 non-null object
(after, micro_spend)     10 non-null float64
(before, micro_spend)    10 non-null float64
(change_rate, )          10 non-null float64
dtypes: float64(3), object(1)
memory usage: 400.0+ bytes
Contributor

jreback commented Feb 9, 2015

how did you create the columns ?

Contributor

jreback commented Feb 9, 2015

the columns index, I mean

Contributor

amelio-vazquez-reina commented Feb 9, 2015

I created the multi-index one (b_df) as follows:

b_df            = pd.concat([left_df, right_df], keys=['after', 'before'], axis=1)
b_df.index.name = 'object_uid'
b_df            = b_df.reset_index()

Other than that, a_df and the data in left_df and right_df come from SQL queries and perhaps a few other joins before that.
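That construction can be reproduced with hypothetical stand-ins for left_df and right_df (the real ones come from SQL):

```python
import pandas as pd

# hypothetical stand-ins for left_df / right_df in the snippet above
left_df = pd.DataFrame({'micro_spend': [1.0, 2.0]}, index=['x', 'y'])
right_df = pd.DataFrame({'micro_spend': [3.0, 4.0]}, index=['x', 'y'])

b_df = pd.concat([left_df, right_df], keys=['after', 'before'], axis=1)
b_df.index.name = 'object_uid'
b_df = b_df.reset_index()

# keys= adds an outer column level; reset_index() then inserts the index
# as a column, padding the missing level with '' -- i.e. ('object_uid', '')
print(list(b_df.columns))
```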

Contributor

amelio-vazquez-reina commented Feb 9, 2015

That said @jreback, I have tried using .copy() on both dataframes in the hope that it might recreate them, but with no luck.

Contributor

amelio-vazquez-reina commented Feb 9, 2015

Also, the following (flattening columns for b_df) works:

b_df.columns = ['object_uid', 'after', 'before', 'change_rate']
pd.merge(a_df, b_df, on='object_uid')
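This workaround can be checked end-to-end with toy data (values made up):

```python
import pandas as pd

a_df = pd.DataFrame({'object_uid': ['x', 'y'], 'item_uid': ['i1', 'i2']})
b_df = pd.DataFrame(
    [['x', 1.0, 2.0, 0.5], ['y', 3.0, 4.0, -0.5]],
    columns=pd.MultiIndex.from_tuples([
        ('object_uid', ''), ('after', 'micro_spend'),
        ('before', 'micro_spend'), ('change_rate', ''),
    ]),
)

# collapse the two-level columns to plain labels, then merge normally
b_df.columns = ['object_uid', 'after', 'before', 'change_rate']
merged = pd.merge(a_df, b_df, on='object_uid')
print(merged.shape)  # two matching rows, 2 + 4 - 1 columns
```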
Contributor

jreback commented Feb 10, 2015

When you have multi-index columns, you have to specify the entire column depth in the on, IOW a tuple. The error message could be better though, I think. Want to submit a pull-request to improve that?
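A sketch of what "the entire column depth" looks like, assuming both frames carry two-level columns (toy frames; in the original report a_df has flat columns, so flattening as above is the practical route):

```python
import pandas as pd

cols_a = pd.MultiIndex.from_tuples([('object_uid', ''), ('item_uid', '')])
a2 = pd.DataFrame([['x', 'i1'], ['y', 'i2']], columns=cols_a)

cols_b = pd.MultiIndex.from_tuples([('object_uid', ''), ('after', 'micro_spend')])
b2 = pd.DataFrame([['x', 1.0], ['y', 2.0]], columns=cols_b)

# the key names the full column depth: a tuple, not a bare string
merged = pd.merge(a2, b2, on=[('object_uid', '')])
print(merged.shape)
```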

@jreback jreback added this to the 0.17.0 milestone Feb 10, 2015

@jreback jreback changed the title from Unable to merge single-level with multi-level to ERR: better error message on invalid on with multi-index columns Feb 10, 2015

Contributor

amelio-vazquez-reina commented Feb 11, 2015

Thanks @jreback I wonder if the following problem is related to this:

Consider a dataframe that we populate as follows:

df = pd.DataFrame()
df['col1'] = ...
df['col2'] = ...

idx = pd.IndexSlice
df[idx['higher_col_1', 'col3']] = ...
df[idx['higher_col_2', 'col4']] = ...

I end up with

list(df.columns)
['col1',
 'col2',
 ('higher_col_1', 'col3'),
 ('higher_col_2', 'col4')]

and when I do:

df.sort_index(axis=1)

I get:

TypeError: unorderable types: str() > tuple()

Contributor

amelio-vazquez-reina commented Feb 11, 2015

And on that note, is there any way to recreate the index in the example above automatically? (e.g. promoting any label that is shallower than the axis's maximum depth and filling in the missing levels with empty values?)
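There is no built-in for this as far as I know, but the padding described above can be sketched by hand (pad_label is a hypothetical helper, not a pandas API):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1], 'col2': [2]})
df[('higher_col_1', 'col3')] = 3  # stored as a tuple label in a flat Index

# mixed str/tuple labels cannot be compared under Python 3, so
# sort_index(axis=1) raises TypeError; pad every label to a common
# depth first, then rebuild the axis as a real MultiIndex
def pad_label(label, depth):
    t = label if isinstance(label, tuple) else (label,)
    return t + ('',) * (depth - len(t))

depth = max(len(c) if isinstance(c, tuple) else 1 for c in df.columns)
df.columns = pd.MultiIndex.from_tuples(
    [pad_label(c, depth) for c in df.columns])

df = df.sort_index(axis=1)  # now sorts cleanly
```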

Contributor

jreback commented Feb 11, 2015

what you are doing is not supported, you need to use df.loc[...]
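A sketch of the supported pattern: start with two-level columns from the outset, so every label already has the same depth, and assign through .loc with a full tuple key (labels here are illustrative):

```python
import numpy as np
import pandas as pd

# build the frame with a MultiIndex on the columns up front
cols = pd.MultiIndex.from_tuples([('base', 'col1'), ('base', 'col2')])
df = pd.DataFrame(np.ones((2, 2)), columns=cols)

# assigning with the full tuple keeps the column depth uniform
df.loc[:, ('higher_col_1', 'col3')] = 5.0

df = df.sort_index(axis=1)  # sorts, since all labels share one depth
```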

Contributor

amelio-vazquez-reina commented Feb 11, 2015

Thanks @jreback. That's good to know. I changed all my df[idx[..]] commands to df.loc[:, idx[..]] and, for what it's worth, I get the exact same result (unable to sort the index).

I will definitely try to improve the error message with a PR, but I think the problem I am having in my last two posts is different (unable to sort the index with columns of different depths).

Contributor

jreback commented Feb 11, 2015

@amelio-vazquez-reina you cannot have columns of different depths; you think you do, but they are really all the same depth.
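This point can be checked directly: after a tuple-key assignment the axis is still a flat Index whose labels happen to include a tuple, not a MultiIndex:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1]})
df[('higher_col_1', 'col3')] = 3

# every label lives at depth 1; the tuple is just one opaque label
print(df.columns.nlevels)  # 1
print(isinstance(df.columns, pd.MultiIndex))  # False
```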

Contributor

amelio-vazquez-reina commented Feb 11, 2015

Thanks @jreback. I understand the problem now.

Contributor

nbonnotte commented Feb 3, 2016

Xref #12078, #11640

The initial issue has been solved with PR #12158

I'll do a PR to add some tests

@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 25, 2016

@jreback jreback closed this in bb9b9c5 Apr 25, 2016

nps added a commit to nps/pandas that referenced this issue May 17, 2016

ERR automatic broadcast for merging different levels
closes pandas-dev#9455

Author: Nicolas Bonnotte <nicolas.bonnotte@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff@reback.net>

Closes pandas-dev#12219 from nbonnotte/9455-test-for-merge and squashes the following commits:

6532ab2 [Nicolas Bonnotte] ERR automatic broadcast for merging different levels, pandas-dev#9455