ENH: cross join in merge() and join() #5401

Closed
RomanPekar opened this issue Oct 31, 2013 · 42 comments · Fixed by #37864
Labels
Enhancement Indexing Reshaping

Comments

@RomanPekar
Contributor

RomanPekar commented Oct 31, 2013

related here: #7203

Unless I've missed something, there's no simple way to do a Cartesian product in pandas. For example, see http://stackoverflow.com/questions/19684765/pandas-extending-the-dataframe-by-adding-another-levels/19684948#19684948.
To do a cross join, one has to add dummy keys to the DataFrames, like this:

>>> dfA = pd.DataFrame(['a','b','c'],columns=['A'])
>>> dfB = pd.DataFrame(range(3),columns=['B'])
>>> dfA['key'] = 1
>>> dfB['key'] = 1
>>> pd.merge(dfB, dfA, on='key')[['A','B']]
   A  B
0  a  0
1  b  0
2  c  0
3  a  1
4  b  1
5  c  1
6  a  2
7  b  2
8  c  2

How about adding the possibility to pass how='cross' to the merge() and join() methods?

@jtratner
Contributor

jtratner commented Nov 1, 2013

Why did you close this?

@RomanPekar RomanPekar reopened this Nov 1, 2013
@RomanPekar
Contributor Author

RomanPekar commented Nov 1, 2013

Oops, my bad, it was totally by accident; I hit the button somehow :(

@jreback
Contributor

jreback commented Nov 1, 2013

related #3835 , there is a method: from pandas.tools.util import cartesian_product

how is this different from an outer join? (which does this and just drops the na groups)

@RomanPekar
Contributor Author

RomanPekar commented Nov 1, 2013

@jreback could you provide an example of how to do an outer join if there's no column on which to join? Maybe it's possible to do with apply somehow?

@jreback
Contributor

jreback commented Nov 1, 2013

A bit hacky

In [28]: import pandas.tools.util as tools

In [29]: DataFrame(index=MultiIndex.from_arrays(
                  tools.cartesian_product([dfA['A'].tolist(),dfB['B'].tolist()]),
                    names=['A','B'])).reset_index()
Out[29]: 
   A  B
0  a  0
1  a  1
2  a  2
3  b  0
4  b  1
5  b  2
6  c  0
7  c  1
8  c  2

@RomanPekar
Contributor Author

RomanPekar commented Nov 1, 2013

@jreback yes, good one! But what if there's more than one column in either dataframe, or in both?

@jreback
Contributor

jreback commented Nov 1, 2013

not sure....can you create input and output frames?

@RomanPekar
Contributor Author

RomanPekar commented Nov 1, 2013

>>> dfA = pd.DataFrame({'A':list('ab'), 'B':range(2,0,-1)})
>>> dfB = pd.DataFrame({'C':range(2), 'D':range(4, 6)})
>>> dfA
   A  B
0  a  2
1  b  1
>>> dfB
   C  D
0  0  4
1  1  5
>>> dfA['key'] = 1
>>> dfB['key'] = 1
>>> df = pd.merge(dfA, dfB, on='key')
>>> del df['key']
>>> df
   A  B  C  D
0  a  2  0  4
1  a  2  1  5
2  b  1  0  4
3  b  1  1  5

here's an SQL Fiddle example of a cross join

@dsm054
Contributor

dsm054 commented Nov 9, 2013

FWIW, this is something I've often wanted too, and I think it's a fundamental enough operation to deserve a better way.

@CarstVaartjes

CarstVaartjes commented Nov 14, 2013

I use the same method as @RomanPekar regularly too (Cartesian products are handy for basket analysis, as they create a normalized version of a sparse matrix). The main thing, I guess, is that pandas could improve performance if it knows it doesn't have to look back and forth at an index, and can instead expand the dataframes over every possible combination. In regular Python I would expect something like this:

list_1 = [['a', 1], ['b', 2], ['c', 3]]
list_2 = [['x', 11], ['y', 12], ['z', 13]]

list_result = []
for sub_list_1 in list_1:
    for sub_list_2 in list_2:
        list_result.append(sub_list_1 + sub_list_2)

When it comes to notation, I myself would like a how='cartesian' where you do not specify left_on and right_on...
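The nested loop above is exactly what itertools.product in the standard library computes; a minimal sketch of the same result:

```python
import itertools

list_1 = [['a', 1], ['b', 2], ['c', 3]]
list_2 = [['x', 11], ['y', 12], ['z', 13]]

# itertools.product yields every (row_1, row_2) pair; concatenating
# each pair reproduces the nested-loop result above.
list_result = [row_1 + row_2 for row_1, row_2 in itertools.product(list_1, list_2)]

print(len(list_result))   # 3 * 3 = 9 rows
print(list_result[0])     # ['a', 1, 'x', 11]
```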

@dsm054
Contributor

dsm054 commented Jun 18, 2014

Is there any reason not to implement this? The only thing I'm not sure of is what we expect from how='cross' when there are common columns; do we want (1) a complete merge, (2) _x and _y suffixes, or (3) to raise an exception? I.e. which of

>>> df0 = pd.DataFrame({"A": [1,2], "B": [3,4]})
>>> df1 = pd.DataFrame({"C": [5,6], "B": [3,8]})
>>> df0["key"] = 1; df1["key"] = 2; m = df0.merge(df1, how='outer'); del m["key"]; print(m)
    A  B   C
0   1  3 NaN
1   2  4 NaN
2 NaN  3   5
3 NaN  8   6
>>> df0["key"] = 1; df1["key"] = 2; m = df0.merge(df1, on="key", how='outer'); del m["key"]; print(m)
    A  B_x  B_y   C
0   1    3  NaN NaN
1   2    4  NaN NaN
2 NaN  NaN    3   5
3 NaN  NaN    8   6

do we want?

@larsmans
Contributor

larsmans commented Feb 2, 2015

+1, I just hit this issue too while trying to represent factors of Bayesian networks as dataframes. When factors have no common variables, their product is a Cartesian product, i.e., a cross join. My workaround was to add a dummy column and join on that, then remove it afterwards. I see other people do that as well.

In this application it doesn't matter what happens when there are common columns since I need to treat one of them specially anyway, but I guess adding suffixes would be a sane default; that would be like joining on an empty list of columns.
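The dummy-column workaround described above reads roughly like this (a sketch; cross_join and _tmp_key are made-up names, not pandas API):

```python
import pandas as pd

def cross_join(left, right, key='_tmp_key'):
    # Add a constant key to both frames, inner-merge on it, then drop it.
    # `key` is a hypothetical column name assumed absent from both frames.
    both = pd.merge(left.assign(**{key: 1}), right.assign(**{key: 1}), on=key)
    return both.drop(key, axis=1)

factor_a = pd.DataFrame({'x': [0, 1]})
factor_b = pd.DataFrame({'y': [0, 1]})
product = cross_join(factor_a, factor_b)  # 4 rows: (0,0), (0,1), (1,0), (1,1)
```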

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@dsm054
Contributor

dsm054 commented Apr 14, 2015

Because concat doesn't give you the Cartesian product:

>>> dfA = pd.DataFrame({'A':list('ab'), 'B':range(2,0,-1)})
>>> dfB = pd.DataFrame({'C':range(2), 'D':range(4, 6)})
>>> pd.concat([dfA, dfB])
     A   B   C   D
0    a   2 NaN NaN
1    b   1 NaN NaN
0  NaN NaN   0   4
1  NaN NaN   1   5
>>> dfA["key"] = 1; dfB["key"] = 1;
>>> m = dfA.merge(dfB,how='outer'); del m["key"];
>>> m
   A  B  C  D
0  a  2  0  4
1  a  2  1  5
2  b  1  0  4
3  b  1  1  5

@cpcloud
Member

cpcloud commented Apr 14, 2015

It's worth noting that attempting a CROSS JOIN with an ON predicate raises a syntax error in postgres. Pandas should disallow the on argument.

pcloud=# select * from orders;
 number | order_date | cust_id | salesperson_id | amount
--------+------------+---------+----------------+--------
     10 | 1996-08-02 |       4 |              2 |    540
     20 | 1999-01-30 |       4 |              8 |   1800
     30 | 1995-07-14 |       9 |              1 |    460
     40 | 1998-01-29 |       7 |              2 |   2400
     50 | 1998-02-03 |       6 |              7 |    600
     60 | 1998-03-02 |       6 |              7 |    720
     70 | 1998-05-06 |       9 |              7 |    150
     80 | 1999-07-23 |       4 |             12 |    200
     90 | 2000-07-23 |       9 |             12 |   2300
(9 rows)

pcloud=# select * from customer;
 id |   name   |   city   | industry_type
----+----------+----------+---------------
  4 | Samsonic | pleasant | J
  6 | Panasung | oaktown  | J
  7 | Samony   | jackson  | B
  9 | Orange   | Jackson  | B
(4 rows)

pcloud=# select * from orders cross join customer on orders.cust_id = customer.id;
ERROR:  syntax error at or near "on"
LINE 1: select * from orders cross join customer on orders.cust_id =...
                                                 ^
pcloud=# select * from orders left join customer on orders.cust_id = customer.id;
 number | order_date | cust_id | salesperson_id | amount | id |   name   |   city   | industry_type
--------+------------+---------+----------------+--------+----+----------+----------+---------------
     10 | 1996-08-02 |       4 |              2 |    540 |  4 | Samsonic | pleasant | J
     20 | 1999-01-30 |       4 |              8 |   1800 |  4 | Samsonic | pleasant | J
     30 | 1995-07-14 |       9 |              1 |    460 |  9 | Orange   | Jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |  7 | Samony   | jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |  6 | Panasung | oaktown  | J
     60 | 1998-03-02 |       6 |              7 |    720 |  6 | Panasung | oaktown  | J
     70 | 1998-05-06 |       9 |              7 |    150 |  9 | Orange   | Jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |  4 | Samsonic | pleasant | J
     90 | 2000-07-23 |       9 |             12 |   2300 |  9 | Orange   | Jackson  | B
(9 rows)

@cpcloud
Member

cpcloud commented Apr 14, 2015

We need to differentiate same-named columns, because the Cartesian product can potentially generate different values:

pcloud=# select * from orders, customer2;

 number | order_date | cust_id | salesperson_id | amount | cust_id |   name   |   city   | industry_type
--------+------------+---------+----------------+--------+---------+----------+----------+---------------
     10 | 1996-08-02 |       4 |              2 |    540 |       4 | Samsonic | pleasant | J
     10 | 1996-08-02 |       4 |              2 |    540 |       6 | Panasung | oaktown  | J
     10 | 1996-08-02 |       4 |              2 |    540 |       7 | Samony   | jackson  | B
     10 | 1996-08-02 |       4 |              2 |    540 |       9 | Orange   | Jackson  | B
     20 | 1999-01-30 |       4 |              8 |   1800 |       4 | Samsonic | pleasant | J
     20 | 1999-01-30 |       4 |              8 |   1800 |       6 | Panasung | oaktown  | J
     20 | 1999-01-30 |       4 |              8 |   1800 |       7 | Samony   | jackson  | B
     20 | 1999-01-30 |       4 |              8 |   1800 |       9 | Orange   | Jackson  | B
     30 | 1995-07-14 |       9 |              1 |    460 |       4 | Samsonic | pleasant | J
     30 | 1995-07-14 |       9 |              1 |    460 |       6 | Panasung | oaktown  | J
     30 | 1995-07-14 |       9 |              1 |    460 |       7 | Samony   | jackson  | B
     30 | 1995-07-14 |       9 |              1 |    460 |       9 | Orange   | Jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |       4 | Samsonic | pleasant | J
     40 | 1998-01-29 |       7 |              2 |   2400 |       6 | Panasung | oaktown  | J
     40 | 1998-01-29 |       7 |              2 |   2400 |       7 | Samony   | jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |       9 | Orange   | Jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |       4 | Samsonic | pleasant | J
     50 | 1998-02-03 |       6 |              7 |    600 |       6 | Panasung | oaktown  | J
     50 | 1998-02-03 |       6 |              7 |    600 |       7 | Samony   | jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |       9 | Orange   | Jackson  | B
     60 | 1998-03-02 |       6 |              7 |    720 |       4 | Samsonic | pleasant | J
     60 | 1998-03-02 |       6 |              7 |    720 |       6 | Panasung | oaktown  | J
     60 | 1998-03-02 |       6 |              7 |    720 |       7 | Samony   | jackson  | B
     60 | 1998-03-02 |       6 |              7 |    720 |       9 | Orange   | Jackson  | B
     70 | 1998-05-06 |       9 |              7 |    150 |       4 | Samsonic | pleasant | J
     70 | 1998-05-06 |       9 |              7 |    150 |       6 | Panasung | oaktown  | J
     70 | 1998-05-06 |       9 |              7 |    150 |       7 | Samony   | jackson  | B
     70 | 1998-05-06 |       9 |              7 |    150 |       9 | Orange   | Jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |       4 | Samsonic | pleasant | J
     80 | 1999-07-23 |       4 |             12 |    200 |       6 | Panasung | oaktown  | J
     80 | 1999-07-23 |       4 |             12 |    200 |       7 | Samony   | jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |       9 | Orange   | Jackson  | B
     90 | 2000-07-23 |       9 |             12 |   2300 |       4 | Samsonic | pleasant | J
     90 | 2000-07-23 |       9 |             12 |   2300 |       6 | Panasung | oaktown  | J
     90 | 2000-07-23 |       9 |             12 |   2300 |       7 | Samony   | jackson  | B
     90 | 2000-07-23 |       9 |             12 |   2300 |       9 | Orange   | Jackson  | B
(36 rows)

@dsm054
Contributor

dsm054 commented Apr 14, 2015

Okay, @cpcloud, so if I understand the semantics we need to match, so as not to violate least surprise for our database-minded friends: shared column names should behave as if they had an implicit suffix to distinguish them?

@cpcloud
Member

cpcloud commented Apr 14, 2015

@dsm054 Yep.

@dsm054
Contributor

dsm054 commented Apr 17, 2015

I've got this working for merge, so that df0.merge(df1, how='cross') works exactly as it would if you introduced the temp column, but I'm not entirely sure what to do with join. Thoughts?

@bmoscon
Contributor

bmoscon commented May 20, 2016

Any update on this? Would love to see this feature/enhancement in an upcoming release

@pkch

pkch commented Aug 11, 2016

For API purposes, would it be useful to just allow an arbitrary join condition? That way, cross-join is just the special case of no condition (None, which is interpreted as lambda x, y: True).
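The general-condition idea can be sketched on top of the dummy-key workaround: build the full cross product, then filter by the predicate (conditional_join is a hypothetical helper, not a pandas API):

```python
import pandas as pd

def conditional_join(left, right, condition=None):
    """Join on an arbitrary row-pair predicate; None means no condition,
    which degenerates to a plain cross join."""
    # Full Cartesian product via the usual dummy-key trick.
    merged = pd.merge(left.assign(_k=1), right.assign(_k=1), on='_k').drop('_k', axis=1)
    if condition is None:  # no predicate: pure cross join
        return merged
    return merged[merged.apply(condition, axis=1)].reset_index(drop=True)

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2, 3, 4]})
everything = conditional_join(df1, df2)                          # 9 rows
lt = conditional_join(df1, df2, lambda row: row['a'] < row['b']) # 6 rows
```

Materializing the full product first is of course the slow path; as noted, a dedicated cross join could be more performant as a special case.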

@jreback
Contributor

jreback commented Aug 15, 2016

@pkch how='cross' is a nice API here (it could be implemented as a general conditional, but is more performant as a special case). Separately, on= could be a condition (e.g. a callable)

@ngould

ngould commented May 31, 2017

I just came across this thread while looking for an answer to the same question. I think the following API would be great, noting that when the how argument is set to 'cross', the on argument must be None.

For pd.merge:
result = pd.merge(df1, df2, how='cross')

For pd.DataFrame.merge:
result = df1.merge(df2, how='cross')

For pd.DataFrame.join:
result = df1.join(df2, how='cross')

My experience is the same as noted above, in that the best way to accomplish this in pandas currently is:

df1['key'] = 1
df2['key'] = 1
result = pd.merge(df1, df2, on='key')

That works, but in my opinion isn't readable enough to use in polite company. Thoughts on how to move forward?

@dsm054
Contributor

dsm054 commented May 31, 2017

I was just thinking about this the other day when I was looking at old branches. It seems reasonable to try to get this into 0.21.

@dsm054
Contributor

dsm054 commented Jun 3, 2017

I'd like to get some input on the possible behaviour of a cross join. I have a candidate implementation, but while testing different cases I came across some decisions I want people to have a chance to think about. Partly this is because, after lots of time spent on SO, I think we chose the wrong default for as_index in groupby, but it's too late to change now; I'd prefer that not to happen again, so if someone else can spot a problem, that'd be great.

Overview

Based on the earlier discussion, I'm proposing a new option how='cross', which would give the Cartesian product of the two dataframes. In terms of row values, this would be equivalent to adding a dummy column to each frame, performing an inner merge on that column, and then deleting it from the result, which is the standard workaround at the moment.

As an example:

In [17]: df0 = pd.DataFrame({"A": [1,2]})

In [18]: df1 = pd.DataFrame({"B": [3,4,5]})

In [19]: df0.merge(df1, how='cross')
Out[19]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [20]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[20]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

Parameters

pd.merge has lots of options, though, and we need to decide how they're being handled. Here are my thoughts so far.

  • on, left_on, right_on: if they're specified, raise a ValueError. how='cross' implicitly specifies its own join target for each frame, and so it doesn't make sense to allow these.
  • sort: keep but ignore. The join keys are effectively all equal, so sorting has no effect, and leaving it in is harmless.
  • suffixes: keep. In the case of matching column names in each frame, it'll behave just like the reference merge, and we get
In [25]: df0.merge(df1, how='cross')
Out[25]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50

In [26]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[26]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50
  • copy: keep, although whether or not it does anything will depend on the current merge implementation (and the promise "not unnecessarily" is pretty weak anyhow.)
  • indicator: keep. The output won't be very interesting, as it will always be "both", but it's harmless.
  • validate: behave exactly as if we're doing an inner merge on a dummy column (so that two one-row frames pass a 'one_to_one' check but longer frames would raise a MergeError.)
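The key-argument rules above can be sketched as a small validation helper (hypothetical, not pandas internals): key arguments are meaningless under how='cross', so the first bullet says they should raise rather than be silently ignored.

```python
def validate_cross_kwargs(on=None, left_on=None, right_on=None):
    # how='cross' supplies its own implicit join target, so any explicit
    # key argument is a contradiction and should fail loudly.
    if on is not None or left_on is not None or right_on is not None:
        raise ValueError("cannot pass on, left_on, or right_on with how='cross'")

validate_cross_kwargs()  # fine: no join keys given
```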

Index

For me, the interesting question is what to do with left_index and right_index. I can think of at least three defensible options:

  1. Raise if either left_index or right_index is not False. The docstring says "Use the index from the left DataFrame as the join key(s)" as the meaning of left_index, but how='cross' already specified a custom join key, and so it doesn't make sense.

Advantage: very easy to explain and understand. Also doesn't prevent us from changing our minds later, as it's much easier to give code which used to fail a new meaning than to change the meaning of code which used to give a result. If you want the index information to be preserved, you'd still have the option of calling .reset_index() on the arguments.

Disadvantage: pandas code is already too cluttered with .reset_index() calls, IMHO.

  2. Behave exactly as merging on a dummy column would. In other words:
In [72]: df0 = pd.DataFrame({"A": [1,2]}, index=[10,20])

In [73]: df1 = pd.DataFrame({"B": [3,4,5]}, index=[20,30,40])

In [74]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[74]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [75]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True).drop('m', axis=1)
Out[75]: 
    A  B
20  1  3
30  1  4
40  1  5
20  2  3
30  2  4
40  2  5

In [76]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1)
Out[76]: 
    A  B
10  1  3
10  1  4
10  1  5
20  2  3
20  2  4
20  2  5

In [77]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True, right_index=True).drop('m', axis=1)
Out[77]: 
    A  B
20  2  3

That is, in the case of left_index=right_index=False, give the natural index. (Which is an Int64Index, not a Range, but that's minor). In the case of left_index=True, repeat the right index values. In the case of right_index=True, repeat the left index values. In the case of both, keep the shared indices.

Advantage: the behaviour, while not obvious, can be explained by reference to the workaround everyone uses to get the cross join now.

Disadvantage: it's really not obvious at first glance that this is the behaviour you get if you specify on as well as left/right index, and setting them both breaks the expectation that you'd have len(df0) * len(df1) rows in the output of a cross join.

3a) Do the same as in #2, but instead of keeping only the common indices when left_index and right_index are True, set the index to the multiindex product of the two indices:

In [109]: m0 = pd.MultiIndex.from_product([df0.index, df1.index])

In [110]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1).set_index(m0)
Out[110]: 
       A  B
10 20  1  3
   30  1  4
   40  1  5
20 20  2  3
   30  2  4
   40  2  5

Advantage: now the cross join produces the expected number of rows, and there's a way to preserve the index information if you want it, while the default (with left_index=right_index=False) would continue to ignore it.

Disadvantage: now it really feels like we're reinterpreting left_index and right_index in the how='cross' case.

3b) Do the same as in (3a), but change the interpretation of the index flags in the case of how='cross' to mean keep that index information. In other words, left_index=True would give you an index in the result which consists of df0's index elements repeated appropriately, right_index=True would give you df1's index elements repeated appropriately, and both true would give you the multiindex.

Advantage: all of those of 3a, with a much easier way to explain what gets used in the resulting index.

Disadvantage: inverts the behaviour from anyone used to the merge behaviour in this (admittedly obscure) case.

Conclusion

The fact that we have indices to worry about complicates things a bit. Option #1 (raise) is certainly the simplest. I think option #2, although I like its consistency with the workaround, is ruled out by the fact that it doesn't always return the expected number of rows.

I don't think I've ever used on with left_index=True or right_index=True. To be honest, it feels a little weird that it works in the first place... but since I never use it, any incompatibility with current behaviour isn't a problem for me. If I were the only user of pandas I'd go for 3b and get on with my day, but I want to preserve least surprise for anyone who might be using it.

Thoughts?

@ngould

ngould commented Jun 3, 2017

Wow, great writeup @dsm054. I agree with ruling out option 2, since it could yield the wrong row count. And 3b stands out to me as the least surprising behavior. Namely,

  • I get the correct number of rows.
  • I get an ordered integer index when I use the default values of left_index and right_index. (Same behavior as other types of merge.)
  • I have the ability to specify which original indices I want to keep, if any.

You mentioned that 3b inverts the usual interpretation of left_index and right_index. Can you explain why you think so? My understanding is that merge gives an integer index by default, and only preserves index info from the original dataframes if left_index or right_index is specified. So 3b seems in line with that behavior, unless I'm misunderstanding something.

As for option 1, can you elaborate on your comment about reset_index(), and how it could be used to restore the indices from the dataframes being merged? I don't see how that is possible, but again that might be my misunderstanding.

@dsm054
Contributor

dsm054 commented Jun 23, 2017

On 1, I meant that you can reset_index before you call merge:

In [12]: pd.merge(df0, df1, how='cross')
Out[12]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [13]: pd.merge(df0.reset_index(), df1.reset_index(), how='cross')
Out[13]: 
   index_x  A  index_y  B
0       10  1       20  3
1       10  1       30  4
2       10  1       40  5
3       20  2       20  3
4       20  2       30  4
5       20  2       40  5

On 3b's "inversion", what I mean is that if you think of the cross join as an inner merge on a dummy column, then when you do left_index=True, you get the right index elements in the result, and vice versa:

In [21]: df0
Out[21]: 
    A
10  1
20  2

In [22]: df1
Out[22]: 
    B
20  3
30  4
40  5

In [23]: pd.merge(df0.assign(m=1), df1.assign(m=1), how='inner', on='m', left_index=True)
Out[23]: 
    A  m  B
20  1  1  3
30  1  1  4
40  1  1  5
20  2  1  3
30  2  1  4
40  2  1  5

In [24]: pd.merge(df0.assign(m=1), df1.assign(m=1), how='inner', on='m', right_index=True)
Out[24]: 
    A  m  B
10  1  1  3
10  1  1  4
10  1  1  5
20  2  1  3
20  2  1  4
20  2  1  5

which is the opposite of the relation between X_index=True and the resulting index that you might expect.

BUT given #16228, which would forbid the use of left_index/right_index being set when left_on or right_on is set, maybe this isn't a problem, because either we follow option #1 and bypass the whole issue, or there's not going to be any other behaviour to interfere with anymore because you won't be setting left_on and left_index together.

@ngould

ngould commented Jun 23, 2017

Thanks, got it! Seems like for consistency, #16228 should also aim to forbid on with any *_index args. Right now it only mentions right_on with right_index in particular. Would you agree, @dsm054?

If we're going down the path of disallowing those combinations of args, obviously option 1 is the way to go.

@jreback
Contributor

jreback commented Jun 28, 2017

yes #16228 would disallow combo operations here (and it should account for any of index setting w/ on setting)

@jreback
Contributor

jreback commented Jun 28, 2017

@ngould

Thanks, got it! Seems like for consistency, #16228 should also aim to forbid on with any *_index args. Right now it only mentions right_on with right_index in particular. Would you agree, @dsm054?

the purpose is left_on with left_index and right_on with right_index

@ftune

ftune commented Feb 9, 2018

I came up with an idea to do it:

def crossjoindf(df1, df2):
    # build a key column name guaranteed absent from both frames
    key = str(set(df1.columns) | set(df2.columns))
    df1 = df1.assign(**{key: 0})
    df2 = df2.assign(**{key: 0})
    return pd.merge(df1, df2, on=key).drop([key], axis=1)

It might be too easy to build a dedicated `merge(df1, df2, how='cross')` for it.
Is that the reason why there is no such function in the pandas documentation?
(T_^)
#19614

@dsm054
Contributor

dsm054 commented Nov 13, 2018

Okay, I want to bring this back. If, to avoid ambiguity and weird corner cases of interpretation, we disallow on, left_on, right_on, left_index, and right_index, then we can pretty easily have

In [35]: df.merge(df2, how='cross')                                                                                         
Out[35]: 
   A  B  C   D
0  1  2  7  77
1  1  2  8  88
2  2  3  7  77
3  2  3  8  88
4  3  4  7  77
5  3  4  8  88

In [36]: df.merge(df2.rename(columns={"C": "B"}), how='cross')                                                              
Out[36]: 
   A  B_x  B_y   D
0  1    2    7  77
1  1    2    8  88
2  2    3    7  77
3  2    3    8  88
4  3    4    7  77
5  3    4    8  88

@jreback: any objection if I bring this back with a PR?

@Coldsp33d

Coldsp33d commented Dec 12, 2018

Hi, I've recently written a canonical post on computing the cartesian product of two DataFrames on SO.
Included in that post is a fast and simple method of computing the cartesian product using np.broadcast_arrays. Here it is:

import numpy as np
import pandas as pd

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

This is just a very basic example and does not handle column names and suffixes. It is by no means the fastest possible method, but I feel it could be a good starting point in terms of simplicity and performance. I have added performance benchmarks in the answer, if you would like to check it out.
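For reference, a self-contained usage sketch (the helper is re-declared so the snippet runs on its own; note the result columns come back positional, illustrating the column-name caveat):

```python
import numpy as np
import pandas as pd

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    # Broadcasting the two open grids yields every (i, j) index pair,
    # which is then used to gather rows positionally from each frame.
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

dfA = pd.DataFrame({'A': [1, 2]})
dfB = pd.DataFrame({'B': [3, 4, 5]})
out = cartesian_product_simplified(dfA, dfB)
# 6 rows; columns are the positional labels 0 and 1, not 'A' and 'B'.
```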

@jreback, @dsm054: What are your thoughts?

@jreback
Contributor

jreback commented Dec 12, 2018

@dsm054 have to have this; I think we should address #16228 first, then the choice of what to do is easier. I would simply do the 'easy' impl of adding a column (you'd probably have to generate a temporary hash column name to avoid collisions). In a later PR we could explore whether this is performant (might be ok).

@Coldsp33d I am not sure that soln is going to work on mixed dtypes, nor will it easily handle things like column suffixing.

@Coldsp33d

Coldsp33d commented Dec 13, 2018

@jreback Fair point!

Perhaps we can substitute the calls to column_stack and .values.__getitem__ with pd.concat and loc, but my tests show a significant reduction in speed (almost 10x) for larger frames. If there is some efficient way of determining whether the columns contain mixed dtypes (e.g., pd.api.types.infer_dtype()), we could decide whether to use the optimized version or fall back to a more primitive approach that preserves the types. I will need to do some more tests to see if it is worth it, or whether something more substantial is required.
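One cheap proxy for "would .values lose dtype information?" is checking for object columns (a sketch; has_object_columns is a hypothetical helper, and a full implementation would need to handle more cases):

```python
import pandas as pd

def has_object_columns(df):
    # Any object-dtype column means df.values degrades to a
    # Python-object ndarray, so a type-preserving fallback is needed.
    return any(dtype == object for dtype in df.dtypes)

# A homogeneous numeric frame round-trips through .values reasonably...
numeric = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
# ...but a mixed frame degrades everything to Python objects.
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

print(has_object_columns(numeric))  # False
print(has_object_columns(mixed))    # True
print(mixed.values.dtype)           # object
```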

@reedjosh

reedjosh commented May 1, 2019

Was there ever any conclusion here? how='cross' would be super helpful.

@jreback
Contributor

jreback commented May 1, 2019

a pull request for a naive implementation would be a good start

@DiogoFerrari

DiogoFerrari commented Jan 31, 2020

It looks like it has not been implemented yet...

@mxmaxime

mxmaxime commented Mar 3, 2020

Hi, Is this topic abandoned?

@Meng0625

Meng0625 commented Mar 21, 2020

Really expect this.....

@dilip1408

dilip1408 commented Apr 7, 2020

Seems this has not been implemented yet. I'm using the latest version, 1.0.3.

@ishmeetk

ishmeetk commented Jun 24, 2020

Will this feature get implemented, or is it still in discussion? Thank you.

@jreback
Contributor

jreback commented Jun 24, 2020

@ishmeetk things are implemented when someone from the community submits a pull request.

this is not hard and we would welcome it

the core team can provide review
