ENH: cross join in merge() and join() #5401

Closed
RomanPekar opened this issue Oct 31, 2013 · 42 comments · Fixed by #37864
Labels
Enhancement Indexing Reshaping

Comments

@RomanPekar
Contributor

RomanPekar commented Oct 31, 2013

related here: #7203

Unless I've missed something, there's no simple way to do a Cartesian product in pandas. For example, see http://stackoverflow.com/questions/19684765/pandas-extending-the-dataframe-by-adding-another-levels/19684948#19684948.
To do a cross join, one has to add dummy keys to the DataFrames, like this:

>>> dfA = pd.DataFrame(['a','b','c'],columns=['A'])
>>> dfB = pd.DataFrame(range(3),columns=['B'])
>>> dfA['key'] = 1
>>> dfB['key'] = 1
>>> pd.merge(dfB, dfA, on='key')[['A','B']]
   A  B
0  a  0
1  b  0
2  c  0
3  a  1
4  b  1
5  c  1
6  a  2
7  b  2
8  c  2

How about adding the possibility to pass how='cross' to the merge() and join() methods?

@jtratner
Contributor

jtratner commented Nov 1, 2013

Why did you close this?

@RomanPekar RomanPekar reopened this Nov 1, 2013
@RomanPekar
Contributor Author

RomanPekar commented Nov 1, 2013

Oops, my bad, it was totally by accident; I hit the button somehow :(

@jreback
Contributor

jreback commented Nov 1, 2013

related #3835 , there is a method: from pandas.tools.util import cartesian_product

how is this different from an outer join? (which does this and just drops the na groups)

@RomanPekar
Contributor Author

RomanPekar commented Nov 1, 2013

@jreback could you provide an example of how to do an outer join if there's no column on which to join? Maybe it's possible to do with apply somehow?

@jreback
Contributor

jreback commented Nov 1, 2013

A bit hacky

In [28]: import pandas.tools.util as tools

In [29]: DataFrame(index=MultiIndex.from_arrays(
                  tools.cartesian_product([dfA['A'].tolist(),dfB['B'].tolist()]),
                    names=['A','B'])).reset_index()
Out[29]: 
   A  B
0  a  0
1  a  1
2  a  2
3  b  0
4  b  1
5  b  2
6  c  0
7  c  1
8  c  2

@RomanPekar
Contributor Author

RomanPekar commented Nov 1, 2013

@jreback yes, good one! But what if there's more than one column in either dataframe, or in both?

@jreback
Contributor

jreback commented Nov 1, 2013

not sure....can you create input and output frames?

@RomanPekar
Contributor Author

RomanPekar commented Nov 1, 2013

>>> dfA = pd.DataFrame({'A':list('ab'), 'B':range(2,0,-1)})
>>> dfB = pd.DataFrame({'C':range(2), 'D':range(4, 6)})
>>> dfA
   A  B
0  a  2
1  b  1
>>> dfB
   C  D
0  0  4
1  1  5
>>> dfA['key'] = 1
>>> dfB['key'] = 1
>>> df = pd.merge(dfA, dfB, on='key')
>>> del df['key']
>>> df
   A  B  C  D
0  a  2  0  4
1  a  2  1  5
2  b  1  0  4
3  b  1  1  5

here's an SQL Fiddle example of a cross join

@dsm054
Contributor

dsm054 commented Nov 9, 2013

FWIW, this is something I've often wanted too, and I think it's a fundamental enough operation to deserve a better way.

@CarstVaartjes

CarstVaartjes commented Nov 14, 2013

I use the same method as @RomanPekar regularly too (Cartesian products are handy for basket analysis, as they create a normalized version of a sparse matrix). The main thing, I guess, is that pandas could improve performance if it knows it doesn't have to look back and forth at an index, and can instead expand the dataframes over every possible combination. In regular Python I would expect something like this:

list_1 = [['a', 1], ['b', 2], ['c', 3]]
list_2 = [['x', 11], ['y', 12], ['z', 13]]

list_result = []
for sub_list_1 in list_1:
    for sub_list_2 in list_2:
        list_result.append(sub_list_1 + sub_list_2)

When it comes to notation, I myself would like a how='cartesian' where you do not specify left_on and right_on...
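The nested loop above is exactly what itertools.product in the standard library computes; a minimal sketch of the same result:

```python
import itertools

list_1 = [['a', 1], ['b', 2], ['c', 3]]
list_2 = [['x', 11], ['y', 12], ['z', 13]]

# itertools.product yields every (row_1, row_2) pair; concatenating
# each pair reproduces the nested-loop result above.
list_result = [row_1 + row_2 for row_1, row_2 in itertools.product(list_1, list_2)]

print(len(list_result))   # 3 * 3 = 9 rows
print(list_result[0])     # ['a', 1, 'x', 11]
```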

@dsm054
Contributor

dsm054 commented Jun 18, 2014

Is there any reason not to implement this? The only thing I'm not sure of is what we expect from how='cross' when there are common columns; do we want (1) a complete merge, (2) _x and _y suffixes, or (3) to raise an exception? I.e. which of

>>> df0 = pd.DataFrame({"A": [1,2], "B": [3,4]})
>>> df1 = pd.DataFrame({"C": [5,6], "B": [3,8]})
>>> df0["key"] = 1; df1["key"] = 2; m = df0.merge(df1, how='outer'); del m["key"]; print(m)
    A  B   C
0   1  3 NaN
1   2  4 NaN
2 NaN  3   5
3 NaN  8   6
>>> df0["key"] = 1; df1["key"] = 2; m = df0.merge(df1, on="key", how='outer'); del m["key"]; print(m)
    A  B_x  B_y   C
0   1    3  NaN NaN
1   2    4  NaN NaN
2 NaN  NaN    3   5
3 NaN  NaN    8   6

do we want?

@larsmans
Contributor

larsmans commented Feb 2, 2015

+1, I just hit this issue too while trying to represent factors of Bayesian networks as dataframes. When factors have no common variables, their product is a Cartesian product, i.e., a cross join. My workaround was to add a dummy column and join on that, then remove it afterwards. I see other people do that as well.

In this application it doesn't matter what happens when there are common columns since I need to treat one of them specially anyway, but I guess adding suffixes would be a sane default; that would be like joining on an empty list of columns.
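The dummy-column workaround described above reads roughly like this (a sketch; cross_join and _tmp_key are made-up names, not pandas API):

```python
import pandas as pd

def cross_join(left, right, key='_tmp_key'):
    # Add a constant key to both frames, inner-merge on it, then drop it.
    # `key` is a hypothetical column name assumed absent from both frames.
    both = pd.merge(left.assign(**{key: 1}), right.assign(**{key: 1}), on=key)
    return both.drop(key, axis=1)

factor_a = pd.DataFrame({'x': [0, 1]})
factor_b = pd.DataFrame({'y': [0, 1]})
product = cross_join(factor_a, factor_b)  # 4 rows: (0,0), (0,1), (1,0), (1,1)
```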

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@dsm054
Contributor

dsm054 commented Apr 14, 2015

Because concat doesn't give you the Cartesian product:

>>> dfA = pd.DataFrame({'A':list('ab'), 'B':range(2,0,-1)})
>>> dfB = pd.DataFrame({'C':range(2), 'D':range(4, 6)})
>>> pd.concat([dfA, dfB])
     A   B   C   D
0    a   2 NaN NaN
1    b   1 NaN NaN
0  NaN NaN   0   4
1  NaN NaN   1   5
>>> dfA["key"] = 1; dfB["key"] = 1;
>>> m = dfA.merge(dfB,how='outer'); del m["key"];
>>> m
   A  B  C  D
0  a  2  0  4
1  a  2  1  5
2  b  1  0  4
3  b  1  1  5

@cpcloud
Member

cpcloud commented Apr 14, 2015

It's worth noting that attempting a CROSS JOIN with an ON predicate raises a syntax error in postgres. Pandas should disallow the on argument.

pcloud=# select * from orders;
 number | order_date | cust_id | salesperson_id | amount
--------+------------+---------+----------------+--------
     10 | 1996-08-02 |       4 |              2 |    540
     20 | 1999-01-30 |       4 |              8 |   1800
     30 | 1995-07-14 |       9 |              1 |    460
     40 | 1998-01-29 |       7 |              2 |   2400
     50 | 1998-02-03 |       6 |              7 |    600
     60 | 1998-03-02 |       6 |              7 |    720
     70 | 1998-05-06 |       9 |              7 |    150
     80 | 1999-07-23 |       4 |             12 |    200
     90 | 2000-07-23 |       9 |             12 |   2300
(9 rows)

pcloud=# select * from customer;
 id |   name   |   city   | industry_type
----+----------+----------+---------------
  4 | Samsonic | pleasant | J
  6 | Panasung | oaktown  | J
  7 | Samony   | jackson  | B
  9 | Orange   | Jackson  | B
(4 rows)

pcloud=# select * from orders cross join customer on orders.cust_id = customer.id;
ERROR:  syntax error at or near "on"
LINE 1: select * from orders cross join customer on orders.cust_id =...
                                                 ^
pcloud=# select * from orders left join customer on orders.cust_id = customer.id;
 number | order_date | cust_id | salesperson_id | amount | id |   name   |   city   | industry_type
--------+------------+---------+----------------+--------+----+----------+----------+---------------
     10 | 1996-08-02 |       4 |              2 |    540 |  4 | Samsonic | pleasant | J
     20 | 1999-01-30 |       4 |              8 |   1800 |  4 | Samsonic | pleasant | J
     30 | 1995-07-14 |       9 |              1 |    460 |  9 | Orange   | Jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |  7 | Samony   | jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |  6 | Panasung | oaktown  | J
     60 | 1998-03-02 |       6 |              7 |    720 |  6 | Panasung | oaktown  | J
     70 | 1998-05-06 |       9 |              7 |    150 |  9 | Orange   | Jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |  4 | Samsonic | pleasant | J
     90 | 2000-07-23 |       9 |             12 |   2300 |  9 | Orange   | Jackson  | B
(9 rows)

@cpcloud
Member

cpcloud commented Apr 14, 2015

We need to differentiate same-named columns, because the Cartesian product can potentially generate different values:

pcloud=# select * from orders, customer2;

 number | order_date | cust_id | salesperson_id | amount | cust_id |   name   |   city   | industry_type
--------+------------+---------+----------------+--------+---------+----------+----------+---------------
     10 | 1996-08-02 |       4 |              2 |    540 |       4 | Samsonic | pleasant | J
     10 | 1996-08-02 |       4 |              2 |    540 |       6 | Panasung | oaktown  | J
     10 | 1996-08-02 |       4 |              2 |    540 |       7 | Samony   | jackson  | B
     10 | 1996-08-02 |       4 |              2 |    540 |       9 | Orange   | Jackson  | B
     20 | 1999-01-30 |       4 |              8 |   1800 |       4 | Samsonic | pleasant | J
     20 | 1999-01-30 |       4 |              8 |   1800 |       6 | Panasung | oaktown  | J
     20 | 1999-01-30 |       4 |              8 |   1800 |       7 | Samony   | jackson  | B
     20 | 1999-01-30 |       4 |              8 |   1800 |       9 | Orange   | Jackson  | B
     30 | 1995-07-14 |       9 |              1 |    460 |       4 | Samsonic | pleasant | J
     30 | 1995-07-14 |       9 |              1 |    460 |       6 | Panasung | oaktown  | J
     30 | 1995-07-14 |       9 |              1 |    460 |       7 | Samony   | jackson  | B
     30 | 1995-07-14 |       9 |              1 |    460 |       9 | Orange   | Jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |       4 | Samsonic | pleasant | J
     40 | 1998-01-29 |       7 |              2 |   2400 |       6 | Panasung | oaktown  | J
     40 | 1998-01-29 |       7 |              2 |   2400 |       7 | Samony   | jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |       9 | Orange   | Jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |       4 | Samsonic | pleasant | J
     50 | 1998-02-03 |       6 |              7 |    600 |       6 | Panasung | oaktown  | J
     50 | 1998-02-03 |       6 |              7 |    600 |       7 | Samony   | jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |       9 | Orange   | Jackson  | B
     60 | 1998-03-02 |       6 |              7 |    720 |       4 | Samsonic | pleasant | J
     60 | 1998-03-02 |       6 |              7 |    720 |       6 | Panasung | oaktown  | J
     60 | 1998-03-02 |       6 |              7 |    720 |       7 | Samony   | jackson  | B
     60 | 1998-03-02 |       6 |              7 |    720 |       9 | Orange   | Jackson  | B
     70 | 1998-05-06 |       9 |              7 |    150 |       4 | Samsonic | pleasant | J
     70 | 1998-05-06 |       9 |              7 |    150 |       6 | Panasung | oaktown  | J
     70 | 1998-05-06 |       9 |              7 |    150 |       7 | Samony   | jackson  | B
     70 | 1998-05-06 |       9 |              7 |    150 |       9 | Orange   | Jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |       4 | Samsonic | pleasant | J
     80 | 1999-07-23 |       4 |             12 |    200 |       6 | Panasung | oaktown  | J
     80 | 1999-07-23 |       4 |             12 |    200 |       7 | Samony   | jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |       9 | Orange   | Jackson  | B
     90 | 2000-07-23 |       9 |             12 |   2300 |       4 | Samsonic | pleasant | J
     90 | 2000-07-23 |       9 |             12 |   2300 |       6 | Panasung | oaktown  | J
     90 | 2000-07-23 |       9 |             12 |   2300 |       7 | Samony   | jackson  | B
     90 | 2000-07-23 |       9 |             12 |   2300 |       9 | Orange   | Jackson  | B
(36 rows)

@dsm054
Contributor

dsm054 commented Apr 14, 2015

Okay, @cpcloud, so if I understand the semantics we need to match, so as not to violate least surprise for our database-minded friends: shared column names should behave as if they had an implicit suffix to distinguish them?

@cpcloud
Member

cpcloud commented Apr 14, 2015

@dsm054 Yep.

@dsm054
Contributor

dsm054 commented Apr 17, 2015

I've got this working for merge, so that df0.merge(df1, how='cross') works exactly as it would if you introduced the temp column, but I'm not entirely sure what to do with join. Thoughts?

@bmoscon
Contributor

bmoscon commented May 20, 2016

Any update on this? Would love to see this feature/enhancement in an upcoming release

@pkch

pkch commented Aug 11, 2016

For API purposes, would it be useful to just allow an arbitrary join condition? That way, cross-join is just the special case of no condition (None, which is interpreted as lambda x, y: True).
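The general-condition idea can be sketched on top of the dummy-key workaround: build the full cross product, then filter by the predicate (conditional_join is a hypothetical helper, not a pandas API):

```python
import pandas as pd

def conditional_join(left, right, condition=None):
    """Join on an arbitrary row-pair predicate; None means no condition,
    which degenerates to a plain cross join."""
    # Full Cartesian product via the usual dummy-key trick.
    merged = pd.merge(left.assign(_k=1), right.assign(_k=1), on='_k').drop('_k', axis=1)
    if condition is None:  # no predicate: pure cross join
        return merged
    return merged[merged.apply(condition, axis=1)].reset_index(drop=True)

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2, 3, 4]})
everything = conditional_join(df1, df2)                          # 9 rows
lt = conditional_join(df1, df2, lambda row: row['a'] < row['b']) # 6 rows
```

Materializing the full product first is of course the slow path; as noted, a dedicated cross join could be more performant as a special case.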

@jreback
Contributor

jreback commented Aug 15, 2016

@pkch how='cross' is a nice API here (it could be implemented as a general conditional, but is more performant as a special case). Separately, on= could be a condition (e.g. a callable)

@ngould

ngould commented May 31, 2017

I just came across this thread while looking for an answer to the same question. I think the following API would be great, noting that when the how argument is set to 'cross', the on argument must be None.

For pd.merge:
result = pd.merge(df1, df2, how='cross')

For pd.DataFrame.merge:
result = df1.merge(df2, how='cross')

For pd.DataFrame.join:
result = df1.join(df2, how='cross')

My experience is the same as noted above, in that the best way to accomplish this in pandas currently is:

df1['key'] = 1
df2['key'] = 1
result = pd.merge(df1, df2, on='key')

That works, but in my opinion isn't readable enough to use in polite company. Thoughts on how to move forward?

@dsm054
Contributor

dsm054 commented May 31, 2017

I was just thinking about this the other day when I was looking at old branches. It seems reasonable to try to get this into 0.21.

@dsm054
Contributor

dsm054 commented Jun 3, 2017

I'd like to get some input on the possible behaviour of a cross join. I have a candidate implementation, but while testing different cases I came across some decisions I want people to have a chance to think about. Partly this is because, after lots of time spent on SO, I think we chose the wrong default for as_index in groupby, but it's too late to change now; I'd prefer that not to happen again, so if someone else can spot a problem, that'd be great.

Overview

Based on the earlier discussion, I'm proposing a new option how='cross', which would give the Cartesian product of the two dataframes. In terms of row values, this would be equivalent to adding a dummy column to each frame, performing an inner merge on that column, and then deleting it from the result, which is the standard workaround at the moment.

As an example:

In [17]: df0 = pd.DataFrame({"A": [1,2]})

In [18]: df1 = pd.DataFrame({"B": [3,4,5]})

In [19]: df0.merge(df1, how='cross')
Out[19]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [20]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[20]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

Parameters

pd.merge has lots of options, though, and we need to decide how they're being handled. Here are my thoughts so far.

  • on, left_on, right_on: if they're specified, raise a ValueError. how='cross' implicitly specifies its own join target for each frame, and so it doesn't make sense to allow these.
  • sort: keep but ignore. The join keys are effectively all equal, so sorting has no effect, and leaving it in is harmless.
  • suffixes: keep. In the case of matching column names in each frame, it'll behave just like the reference merge, and we get
In [25]: df0.merge(df1, how='cross')
Out[25]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50

In [26]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[26]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50
  • copy: keep, although whether or not it does anything will depend on the current merge implementation (and the promise "not unnecessarily" is pretty weak anyhow.)
  • indicator: keep. The output won't be very interesting, as it will always be "both", but it's harmless.
  • validate: behave exactly as if we're doing an inner merge on a dummy column (so that two one-row frames pass a 'one_to_one' check but longer frames would raise a MergeError.)
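The key-argument rules above can be sketched as a small validation helper (hypothetical, not pandas internals): key arguments are meaningless under how='cross', so the first bullet says they should raise rather than be silently ignored.

```python
def validate_cross_kwargs(on=None, left_on=None, right_on=None):
    # how='cross' supplies its own implicit join target, so any explicit
    # key argument is a contradiction and should fail loudly.
    if on is not None or left_on is not None or right_on is not None:
        raise ValueError("cannot pass on, left_on, or right_on with how='cross'")

validate_cross_kwargs()  # fine: no join keys given
```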

Index

For me, the interesting question is what to do with left_index and right_index. I can think of at least three defensible options:

  1. Raise if either left_index or right_index is not False. The docstring says "Use the index from the left DataFrame as the join key(s)" as the meaning of left_index, but how='cross' already specified a custom join key, and so it doesn't make sense.

Advantage: very easy to explain and understand. Also doesn't prevent us from changing our minds later, as it's much easier to give code which used to fail a new meaning than to change the meaning of code which used to give a result. If you want the index information to be preserved, you'd still have the option of calling .reset_index() on the arguments.

Disadvantage: pandas code is already too cluttered with .reset_index() calls, IMHO.

  2. Behave exactly as merging on a dummy column would. In other words:
In [72]: df0 = pd.DataFrame({"A": [1,2]}, index=[10,20])

In [73]: df1 = pd.DataFrame({"B": [3,4,5]}, index=[20,30,40])

In [74]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[74]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [75]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True).drop('m', axis=1)
Out[75]: 
    A  B
20  1  3
30  1  4
40  1  5
20  2  3
30  2  4
40  2  5

In [76]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1)
Out[76]: 
    A  B
10  1  3
10  1  4
10  1  5
20  2  3
20  2  4
20  2  5

In [77]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True, right_index=True).drop('m', axis=1)
Out[77]: 
    A  B
20  2  3

That is, in the case of left_index=right_index=False, give the natural index. (Which is an Int64Index, not a Range, but that's minor). In the case of left_index=True, repeat the right index values. In the case of right_index=True, repeat the left index values. In the case of both, keep the shared indices.

Advantage: the behaviour, while not obvious, can be explained by reference to the workaround everyone uses to get the cross join now.

Disadvantage: it's really not obvious at first glance that this is the behaviour you get if you specify on as well as left/right index, and setting them both breaks the expectation that you'd have len(df0) * len(df1) rows in the output of a cross join.

3a) Do the same as in #2, but instead of keeping only the common indices when left_index and right_index are True, set the index to the multiindex product of the two indices:

In [109]: m0 = pd.MultiIndex.from_product([df0.index, df1.index])

In [110]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1).set_index(m0)
Out[110]: 
       A  B
10 20  1  3
   30  1  4
   40  1  5
20 20  2  3
   30  2  4
   40  2  5

Advantage: now the cross join produces the expected number of rows, and there's a way to preserve the index information if you want it, while the default (with left_index=right_index=False) would continue to ignore it.

Disadvantage: now it really feels like we're reinterpreting left_index and right_index in the how='cross' case.

3b) Do the same as in (3a), but change the interpretation of the index flags in the case of how='cross' to mean keep that index information. In other words, left_index=True would give you an index in the result which consists of df0's index elements repeated appropriately, right_index=True would give you df1's index elements repeated appropriately, and both true would give you the multiindex.

Advantage: all of those of 3a, with a much easier way to explain what gets used in the resulting index.

Disadvantage: inverts the behaviour from anyone used to the merge behaviour in this (admittedly obscure) case.

Conclusion

The fact that we have indices to worry about complicates things a bit. Option #1 (raise) is certainly the simplest. I think option #2, although I like its consistency with the workaround, is ruled out by the fact that it doesn't always return the expected number of rows.

I don't think I've ever used on with left_index=True or right_index=True. To be honest, it feels a little weird that it works in the first place... but since I never use it, any incompatibility with current behaviour isn't a problem for me. If I were the only user of pandas I'd go for 3b and get on with my day, but I want to preserve least surprise for anyone who might be using it.

Thoughts?

@ngould

ngould commented Jun 3, 2017

Wow, great writeup @dsm054. I agree with ruling out option 2, since it could yield the wrong row count. And 3b stands out to me as the least surprising behavior. Namely,

  • I get the correct number of rows.
  • I get an ordered integer index when I use the default values of left_index and right_index. (Same behavior as other types of merge.)
  • I have the ability to specify which original indices I want to keep, if any.

You mentioned that 3b inverts the usual interpretation of left_index and right_index. Can you explain why you think so? My understanding is that merge gives an integer index by default, and only preserves index info from the original dataframes if left_index or right_index is specified. So 3b seems in line with that behavior, unless I'm misunderstanding something.

As for option 1, can you elaborate on your comment about reset_index(), and how it could be used to restore the indices from the dataframes being merged? I don't see how that is possible, but again that might be my misunderstanding.

@dsm054
Contributor

dsm054 commented Jun 23, 2017

On 1, I meant that you can reset_index before you call merge:

In [12]: pd.merge(df0, df1, how='cross')
Out[12]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [13]: pd.merge(df0.reset_index(), df1.reset_index(), how='cross')
Out[13]: 
   index_x  A  index_y  B
0       10  1       20  3
1       10  1       30  4
2       10  1       40  5
3       20  2       20  3
4       20  2       30  4
5       20  2       40  5

On 3b's "inversion", what I mean is that if you think of the cross join as an inner merge on a dummy column, then when you do left_index=True, you get the right index elements in the result, and vice versa:

In [21]: df0
Out[21]: 
    A
10  1
20  2

In [22]: df1
Out[22]: 
    B
20  3
30  4
40  5

In [23]: pd.merge(df0.assign(m=1), df1.assign(m=1), how='inner', on='m', left_index=True)
Out[23]: 
    A  m  B
20  1  1  3
30  1  1  4
40  1  1  5
20  2  1  3
30  2  1  4
40  2  1  5

In [24]: pd.merge(df0.assign(m=1), df1.assign(m=1), how='inner', on='m', right_index=True)
Out[24]: 
    A  m  B
10  1  1  3
10  1  1  4
10  1  1  5
20  2  1  3
20  2  1  4
20  2  1  5

which is the opposite of the relation between X_index=True and the resulting index that you might expect.

BUT given #16228, which would forbid the use of left_index/right_index being set when left_on or right_on is set, maybe this isn't a problem, because either we follow option #1 and bypass the whole issue, or there's not going to be any other behaviour to interfere with anymore because you won't be setting left_on and left_index together.

@ngould

ngould commented Jun 23, 2017

Thanks, got it! Seems like for consistency, #16228 should also aim to forbid on with any *_index args. Right now it only mentions right_on with right_index in particular. Would you agree, @dsm054?

If we're going down the path of disallowing those combinations of args, obviously option 1 is the way to go.

@jreback
Contributor

jreback commented Jun 28, 2017

yes #16228 would disallow combo operations here (and it should account for any of index setting w/ on setting)

@jreback
Contributor

jreback commented Jun 28, 2017

@ngould

Thanks, got it! Seems like for consistency, #16228 should also aim to forbid on with any *_index args. Right now it only mentions right_on with right_index in particular. Would you agree, @dsm054?

the purpose is left_on with left_index and right_on with right_index

@ftune

ftune commented Feb 9, 2018

I came up with an idea to do it:

def crossjoindf(df1, df2):
    # build a key column name guaranteed absent from both frames
    key = str(set(df1.columns) | set(df2.columns))
    df1 = df1.assign(**{key: 0})
    df2 = df2.assign(**{key: 0})
    return pd.merge(df1, df2, on=key).drop([key], axis=1)

It might be too easy to build a dedicated `merge(df1, df2, how='cross')` for it.
Is that the reason why there is no such function in the pandas documentation?
(T_^)
#19614

@dsm054
Contributor

dsm054 commented Nov 13, 2018

Okay, I want to bring this back. If, to avoid ambiguity and weird corner cases of interpretation, we disallow on, left_on, right_on, left_index, and right_index, then we can pretty easily have

In [35]: df.merge(df2, how='cross')                                                                                         
Out[35]: 
   A  B  C   D
0  1  2  7  77
1  1  2  8  88
2  2  3  7  77
3  2  3  8  88
4  3  4  7  77
5  3  4  8  88

In [36]: df.merge(df2.rename(columns={"C": "B"}), how='cross')                                                              
Out[36]: 
   A  B_x  B_y   D
0  1    2    7  77
1  1    2    8  88
2  2    3    7  77
3  2    3    8  88
4  3    4    7  77
5  3    4    8  88

@jreback: any objection if I bring this back with a PR?

@Coldsp33d

Coldsp33d commented Dec 12, 2018

Hi, I've recently written a canonical post on computing the cartesian product of two DataFrames on SO.
Included in that post is a fast and simple method of computing the cartesian product using np.broadcast_arrays. Here it is:

import numpy as np
import pandas as pd

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

This is just a very basic example and does not handle column names and suffixes. It is by no means the fastest possible method, but I feel it could be a good starting point in terms of simplicity and performance. I have added performance benchmarks in the answer, if you would like to check it out.
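For reference, a self-contained usage sketch (the helper is re-declared so the snippet runs on its own; note the result columns come back positional, illustrating the column-name caveat):

```python
import numpy as np
import pandas as pd

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    # Broadcasting the two open grids yields every (i, j) index pair,
    # which is then used to gather rows positionally from each frame.
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

dfA = pd.DataFrame({'A': [1, 2]})
dfB = pd.DataFrame({'B': [3, 4, 5]})
out = cartesian_product_simplified(dfA, dfB)
# 6 rows; columns are the positional labels 0 and 1, not 'A' and 'B'.
```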

@jreback, @dsm054: What are your thoughts?

@jreback
Contributor

jreback commented Dec 12, 2018

@dsm054 have to have this; I think we should address #16228 first, then the choice of what to do is easier. I would simply do the 'easy' impl of adding a column (you'd probably have to generate a temporary hash column name to avoid collisions). In a later PR we could explore whether this is performant (might be ok).

@Coldsp33d I am not sure that soln is going to work on mixed dtypes, nor will it easily handle things like column suffixing.

@Coldsp33d

Coldsp33d commented Dec 13, 2018

@jreback Fair point!

Perhaps we can substitute the calls to column_stack and .values.__getitem__ with pd.concat and loc, but my tests show a significant reduction in speed (almost 10x) for larger frames. If there is some efficient way of determining whether the columns contain mixed dtypes (e.g., pd.api.types.infer_dtype()), we could decide whether to use the optimized version or fall back to a more primitive approach that preserves the types. I will need to do some more tests to see if it is worth it, or whether something more substantial is required.
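One cheap proxy for "would .values lose dtype information?" is checking for object columns (a sketch; has_object_columns is a hypothetical helper, and a full implementation would need to handle more cases):

```python
import pandas as pd

def has_object_columns(df):
    # Any object-dtype column means df.values degrades to a
    # Python-object ndarray, so a type-preserving fallback is needed.
    return any(dtype == object for dtype in df.dtypes)

# A homogeneous numeric frame round-trips through .values reasonably...
numeric = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
# ...but a mixed frame degrades everything to Python objects.
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

print(has_object_columns(numeric))  # False
print(has_object_columns(mixed))    # True
print(mixed.values.dtype)           # object
```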

@reedjosh

reedjosh commented May 1, 2019

Was there ever any conclusion here? how='cross' would be super helpful.

@jreback
Contributor

jreback commented May 1, 2019

a pull request for a naive implementation would be a good start

@DiogoFerrari

DiogoFerrari commented Jan 31, 2020

It looks like it has not been implemented yet...

@mxmaxime

mxmaxime commented Mar 3, 2020

Hi, Is this topic abandoned?

@Meng0625

Meng0625 commented Mar 21, 2020

Really expect this.....

@dilip1408

dilip1408 commented Apr 7, 2020

Seems this has not been implemented yet. I'm using the latest version, 1.0.3.

@ishmeetk

ishmeetk commented Jun 24, 2020

Will this feature get implemented, or is it still in discussion? Thank you.

@jreback
Contributor

jreback commented Jun 24, 2020

@ishmeetk things are implemented when someone from the community submits a pull request.

this is not hard and we would welcome it

the core team can provide review
