
ENH: cross join in merge() and join() #5401

Open
RomanPekar opened this Issue Oct 31, 2013 · 30 comments

@RomanPekar
Contributor

RomanPekar commented Oct 31, 2013

related here: #7203

Unless I've missed something, there's no simple way to compute a Cartesian product in pandas. For example, see http://stackoverflow.com/questions/19684765/pandas-extending-the-dataframe-by-adding-another-levels/19684948#19684948.
To do a cross join, one has to add a dummy key column to both DataFrames, like this:

>>> dfA = pd.DataFrame(['a','b','c'],columns=['A'])
>>> dfB = pd.DataFrame(range(3),columns=['B'])
>>> dfA['key'] = 1
>>> dfB['key'] = 1
>>> pd.merge(dfB, dfA, on='key').loc[:, ['A', 'B']]
   A  B
0  a  0
1  b  0
2  c  0
3  a  1
4  b  1
5  c  1
6  a  2
7  b  2
8  c  2

How about adding the possibility to pass how='cross' to the merge() and join() methods?
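For illustration, the proposed call would look like this (a sketch assuming a pandas version in which how='cross' were implemented; at the time of this issue it does not exist):

```python
import pandas as pd

dfA = pd.DataFrame({'A': list('abc')})
dfB = pd.DataFrame({'B': range(3)})

# Proposed API: no dummy key column, no cleanup afterwards.
result = pd.merge(dfA, dfB, how='cross')
```

Every row of dfA would be paired with every row of dfB, giving len(dfA) * len(dfB) rows.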

@RomanPekar RomanPekar closed this Oct 31, 2013

@jtratner
Contributor

jtratner commented Nov 1, 2013

Why did you close this?

@RomanPekar RomanPekar reopened this Nov 1, 2013

@RomanPekar
Contributor

RomanPekar commented Nov 1, 2013

Oops, my bad, it was totally an accident; I hit the button somehow :(

@jreback
Contributor

jreback commented Nov 1, 2013

related #3835 , there is a method: from pandas.tools.util import cartesian_product

how is this different from an outer join? (which does this and just drops the na groups)

@RomanPekar
Contributor

RomanPekar commented Nov 1, 2013

@jreback could you provide an example of how to do an outer join when there's no column to join on? Maybe it's possible to do with apply somehow?

@jreback
Contributor

jreback commented Nov 1, 2013

A bit hacky

In [28]: import pandas.tools.util as tools

In [29]: DataFrame(index=MultiIndex.from_arrays(
                  tools.cartesian_product([dfA['A'].tolist(),dfB['B'].tolist()]),
                    names=['A','B'])).reset_index()
Out[29]: 
   A  B
0  a  0
1  a  1
2  a  2
3  b  0
4  b  1
5  b  2
6  c  0
7  c  1
8  c  2
@RomanPekar
Contributor

RomanPekar commented Nov 1, 2013

@jreback yes, good one! But what if there's more than one column in either DataFrame, or in both?

@jreback
Contributor

jreback commented Nov 1, 2013

Not sure... can you create example input and output frames?

@RomanPekar
Contributor

RomanPekar commented Nov 1, 2013

>>> dfA = pd.DataFrame({'A':list('ab'), 'B':range(2,0,-1)})
>>> dfB = pd.DataFrame({'C':range(2), 'D':range(4, 6)})
>>> dfA
   A  B
0  a  2
1  b  1
>>> dfB
   C  D
0  0  4
1  1  5
>>> dfA['key'] = 1
>>> dfB['key'] = 1
>>> df = pd.merge(dfA, dfB, on='key')
>>> del df['key']
>>> df
   A  B  C  D
0  a  2  0  4
1  a  2  1  5
2  b  1  0  4
3  b  1  1  5

here's sql fiddle example of cross join
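The dummy-key trick above can be wrapped in a small helper (a hypothetical `cross_join` function, not part of pandas) so callers don't have to mutate their frames:

```python
import pandas as pd

def cross_join(left, right):
    """Cartesian product of two DataFrames via a throwaway join key.

    Uses assign(), so the input frames are not modified in place.
    """
    key = '_cross_key'  # assumes neither frame already has a column by this name
    out = pd.merge(left.assign(**{key: 1}), right.assign(**{key: 1}), on=key)
    return out.drop(columns=key)

dfA = pd.DataFrame({'A': list('ab'), 'B': range(2, 0, -1)})
dfB = pd.DataFrame({'C': range(2), 'D': range(4, 6)})
df = cross_join(dfA, dfB)
```

This handles any number of columns on either side, which addresses the multi-column case discussed above.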

@dsm054

This comment has been minimized.

Show comment
Hide comment
@dsm054

dsm054 Nov 9, 2013

Contributor

FWIW, this is something I've often wanted too, and I think is fundamental enough an operation to deserve a better way.

Contributor

dsm054 commented Nov 9, 2013

FWIW, this is something I've often wanted too, and I think is fundamental enough an operation to deserve a better way.

@CarstVaartjes

CarstVaartjes commented Nov 14, 2013

I use the same method as @RomanPekar regularly too (Cartesian products are handy for basket analysis, as they create a normalized version of a sparse matrix). The main thing, I guess, is that pandas could improve performance internally if it knows it doesn't have to look up an index to match rows back and forth, but can instead multiply the DataFrames into every possible combination. In regular Python I would expect something like this:

list_1 = [['a', 1], ['b', 2], ['c', 3]]
list_2 = [['x', 11], ['y', 12], ['z', 13]]

list_result = []
for sub_list_1 in list_1:
    for sub_list_2 in list_2:
        list_result.append(sub_list_1 + sub_list_2)

When it comes to notation, I myself would like a how='cartesian' where you do not specify left_on and right_on...
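The nested loop above is exactly what itertools.product expresses; a minimal sketch of the same pairing:

```python
from itertools import product

list_1 = [['a', 1], ['b', 2], ['c', 3]]
list_2 = [['x', 11], ['y', 12], ['z', 13]]

# Each output row is one row from list_1 concatenated with one row from list_2.
list_result = [a + b for a, b in product(list_1, list_2)]
```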

@dsm054
Contributor

dsm054 commented Jun 18, 2014

Is there any reason not to implement this? The only thing I'm not sure of is what we expect from how='cross' when there are common columns; do we want (1) a complete merge, (2) _x and _y suffixes, or (3) to raise an exception? I.e. which of

>>> df0 = pd.DataFrame({"A": [1,2], "B": [3,4]})
>>> df1 = pd.DataFrame({"C": [5,6], "B": [3,8]})
>>> df0["key"] = 1; df1["key"] = 2; m = df0.merge(df1, how='outer'); del m["key"]; print(m)
    A  B   C
0   1  3 NaN
1   2  4 NaN
2 NaN  3   5
3 NaN  8   6
>>> df0["key"] = 1; df1["key"] = 2; m = df0.merge(df1, on="key", how='outer'); del m["key"]; print(m)
    A  B_x  B_y   C
0   1    3  NaN NaN
1   2    4  NaN NaN
2 NaN  NaN    3   5
3 NaN  NaN    8   6

do we want?

@larsmans
Contributor

larsmans commented Feb 2, 2015

+1, I just hit this issue too while trying to represent factors of Bayesian networks as DataFrames. When factors have no common variables, their product is a Cartesian product, i.e., a cross join. My workaround was to add a dummy column, join on that, then remove it afterwards. I see other people do that as well.

In this application it doesn't matter what happens when there are common columns since I need to treat one of them specially anyway, but I guess adding suffixes would be a sane default; that would be like joining on an empty list of columns.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

@dsm054
Contributor

dsm054 commented Apr 14, 2015

Because concat doesn't give you the Cartesian product:

>>> dfA = pd.DataFrame({'A':list('ab'), 'B':range(2,0,-1)})
>>> dfB = pd.DataFrame({'C':range(2), 'D':range(4, 6)})
>>> pd.concat([dfA, dfB])
     A   B   C   D
0    a   2 NaN NaN
1    b   1 NaN NaN
0  NaN NaN   0   4
1  NaN NaN   1   5
>>> dfA["key"] = 1; dfB["key"] = 1;
>>> m = dfA.merge(dfB,how='outer'); del m["key"];
>>> m
   A  B  C  D
0  a  2  0  4
1  a  2  1  5
2  b  1  0  4
3  b  1  1  5
@cpcloud
Member

cpcloud commented Apr 14, 2015

It's worth noting that attempting a CROSS JOIN with an ON predicate raises a syntax error in postgres. Pandas should disallow the on argument.

pcloud=# select * from orders;
 number | order_date | cust_id | salesperson_id | amount
--------+------------+---------+----------------+--------
     10 | 1996-08-02 |       4 |              2 |    540
     20 | 1999-01-30 |       4 |              8 |   1800
     30 | 1995-07-14 |       9 |              1 |    460
     40 | 1998-01-29 |       7 |              2 |   2400
     50 | 1998-02-03 |       6 |              7 |    600
     60 | 1998-03-02 |       6 |              7 |    720
     70 | 1998-05-06 |       9 |              7 |    150
     80 | 1999-07-23 |       4 |             12 |    200
     90 | 2000-07-23 |       9 |             12 |   2300
(9 rows)

pcloud=# select * from customer;
 id |   name   |   city   | industry_type
----+----------+----------+---------------
  4 | Samsonic | pleasant | J
  6 | Panasung | oaktown  | J
  7 | Samony   | jackson  | B
  9 | Orange   | Jackson  | B
(4 rows)

pcloud=# select * from orders cross join customer on orders.cust_id = customer.id;
ERROR:  syntax error at or near "on"
LINE 1: select * from orders cross join customer on orders.cust_id =...
                                                 ^
pcloud=# select * from orders left join customer on orders.cust_id = customer.id;
 number | order_date | cust_id | salesperson_id | amount | id |   name   |   city   | industry_type
--------+------------+---------+----------------+--------+----+----------+----------+---------------
     10 | 1996-08-02 |       4 |              2 |    540 |  4 | Samsonic | pleasant | J
     20 | 1999-01-30 |       4 |              8 |   1800 |  4 | Samsonic | pleasant | J
     30 | 1995-07-14 |       9 |              1 |    460 |  9 | Orange   | Jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |  7 | Samony   | jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |  6 | Panasung | oaktown  | J
     60 | 1998-03-02 |       6 |              7 |    720 |  6 | Panasung | oaktown  | J
     70 | 1998-05-06 |       9 |              7 |    150 |  9 | Orange   | Jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |  4 | Samsonic | pleasant | J
     90 | 2000-07-23 |       9 |             12 |   2300 |  9 | Orange   | Jackson  | B
(9 rows)
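If pandas grew how='cross' along these lines, the analogous guard would be for merge to refuse an explicit join key (a sketch assuming a pandas version where how='cross' exists and rejects on=, as argued above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'key': [1, 1]})
df2 = pd.DataFrame({'B': [3, 4], 'key': [1, 1]})

# A cross join with an explicit join key is contradictory, so it should raise,
# mirroring the postgres syntax error above.
try:
    pd.merge(df1, df2, how='cross', on='key')
    raised = False
except ValueError:  # assumption: the error would be MergeError, a ValueError subclass
    raised = True
```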
@cpcloud
Member

cpcloud commented Apr 14, 2015

We need to differentiate same-named columns, because the Cartesian product can pair them with different values:

pcloud=# select * from orders, customer2;

 number | order_date | cust_id | salesperson_id | amount | cust_id |   name   |   city   | industry_type
--------+------------+---------+----------------+--------+---------+----------+----------+---------------
     10 | 1996-08-02 |       4 |              2 |    540 |       4 | Samsonic | pleasant | J
     10 | 1996-08-02 |       4 |              2 |    540 |       6 | Panasung | oaktown  | J
     10 | 1996-08-02 |       4 |              2 |    540 |       7 | Samony   | jackson  | B
     10 | 1996-08-02 |       4 |              2 |    540 |       9 | Orange   | Jackson  | B
     20 | 1999-01-30 |       4 |              8 |   1800 |       4 | Samsonic | pleasant | J
     20 | 1999-01-30 |       4 |              8 |   1800 |       6 | Panasung | oaktown  | J
     20 | 1999-01-30 |       4 |              8 |   1800 |       7 | Samony   | jackson  | B
     20 | 1999-01-30 |       4 |              8 |   1800 |       9 | Orange   | Jackson  | B
     30 | 1995-07-14 |       9 |              1 |    460 |       4 | Samsonic | pleasant | J
     30 | 1995-07-14 |       9 |              1 |    460 |       6 | Panasung | oaktown  | J
     30 | 1995-07-14 |       9 |              1 |    460 |       7 | Samony   | jackson  | B
     30 | 1995-07-14 |       9 |              1 |    460 |       9 | Orange   | Jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |       4 | Samsonic | pleasant | J
     40 | 1998-01-29 |       7 |              2 |   2400 |       6 | Panasung | oaktown  | J
     40 | 1998-01-29 |       7 |              2 |   2400 |       7 | Samony   | jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |       9 | Orange   | Jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |       4 | Samsonic | pleasant | J
     50 | 1998-02-03 |       6 |              7 |    600 |       6 | Panasung | oaktown  | J
     50 | 1998-02-03 |       6 |              7 |    600 |       7 | Samony   | jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |       9 | Orange   | Jackson  | B
     60 | 1998-03-02 |       6 |              7 |    720 |       4 | Samsonic | pleasant | J
     60 | 1998-03-02 |       6 |              7 |    720 |       6 | Panasung | oaktown  | J
     60 | 1998-03-02 |       6 |              7 |    720 |       7 | Samony   | jackson  | B
     60 | 1998-03-02 |       6 |              7 |    720 |       9 | Orange   | Jackson  | B
     70 | 1998-05-06 |       9 |              7 |    150 |       4 | Samsonic | pleasant | J
     70 | 1998-05-06 |       9 |              7 |    150 |       6 | Panasung | oaktown  | J
     70 | 1998-05-06 |       9 |              7 |    150 |       7 | Samony   | jackson  | B
     70 | 1998-05-06 |       9 |              7 |    150 |       9 | Orange   | Jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |       4 | Samsonic | pleasant | J
     80 | 1999-07-23 |       4 |             12 |    200 |       6 | Panasung | oaktown  | J
     80 | 1999-07-23 |       4 |             12 |    200 |       7 | Samony   | jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |       9 | Orange   | Jackson  | B
     90 | 2000-07-23 |       9 |             12 |   2300 |       4 | Samsonic | pleasant | J
     90 | 2000-07-23 |       9 |             12 |   2300 |       6 | Panasung | oaktown  | J
     90 | 2000-07-23 |       9 |             12 |   2300 |       7 | Samony   | jackson  | B
     90 | 2000-07-23 |       9 |             12 |   2300 |       9 | Orange   | Jackson  | B
(36 rows)
@dsm054
Contributor

dsm054 commented Apr 14, 2015

Okay, @cpcloud, so if I understand the semantics we need to match so as not to violate least surprise for our database-minded friends, shared column names should behave as if they had an implicit suffix to distinguish them?

@cpcloud
Member

cpcloud commented Apr 14, 2015

@dsm054 Yep.

@dsm054
Contributor

dsm054 commented Apr 17, 2015

I've got this working for merge, so that df0.merge(df1, how='cross') works exactly as it would if you introduced the temp column, but I'm not entirely sure what to do with join. Thoughts?

@bmoscon

bmoscon commented May 20, 2016

Any update on this? Would love to see this feature/enhancement in an upcoming release

@pkch

pkch commented Aug 11, 2016

For API purposes, would it be useful to just allow an arbitrary join condition? That way, cross-join is just the special case of no condition (None, which is interpreted as lambda x, y: True).
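That idea could be sketched as a cross product followed by a row filter, with a hypothetical `conditional_join` helper (not a pandas API; pred=None meaning "always true", i.e. a plain cross join):

```python
import pandas as pd

def conditional_join(left, right, pred=None):
    """Join on an arbitrary row predicate (hypothetical helper, not a pandas API).

    Builds the full cross product via a dummy key, then keeps the rows where
    pred(row) is True; pred=None keeps everything, i.e. a cross join.
    """
    out = pd.merge(left.assign(_k=1), right.assign(_k=1), on='_k').drop(columns='_k')
    if pred is None:
        return out
    return out[out.apply(pred, axis=1)].reset_index(drop=True)

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': [2, 3]})

cross = conditional_join(df1, df2)                          # 3 * 2 = 6 rows
lt = conditional_join(df1, df2, lambda r: r['A'] < r['B'])  # non-equi join
```

As jreback notes below, a dedicated cross-join path can be far faster than this generate-then-filter approach, which materializes the full product.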

@jreback
Contributor

jreback commented Aug 15, 2016

@pkch how='cross' is a nice API here (it could be implemented as a general conditional, but is more performant as a special case). Separately, on= could be a condition (e.g. a callable)

@ngould

ngould commented May 31, 2017

I just came across this thread while looking for an answer to the same question. I think the following API would be great, noting that when the how argument is set to 'cross', the on argument must be None.

For pd.merge:
result = pd.merge(df1, df2, how='cross')

For pd.DataFrame.merge:
result = df1.merge(df2, how='cross')

For pd.DataFrame.join:
result = df1.join(df2, how='cross')

My experience is the same as noted above, in that the best way to accomplish this in pandas currently is:

df1['key'] = 1
df2['key'] = 1
result = pd.merge(df1, df2, on='key')

That works, but in my opinion isn't readable enough to use in polite company. Thoughts on how to move forward?
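The same workaround can also be written without mutating df1 and df2, using assign() for the temporary key:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4, 5]})

# assign() returns copies, so df1 and df2 keep their original columns.
result = (pd.merge(df1.assign(key=1), df2.assign(key=1), on='key')
            .drop(columns='key'))
```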

@dsm054
Contributor

dsm054 commented May 31, 2017

I was just thinking about this the other day when I was looking at old branches. It seems reasonable to try to get this into 0.21.

@dsm054
Contributor

dsm054 commented Jun 3, 2017

I'd like to get some input on the possible behaviour of a cross join. I have a candidate implementation, but while testing different cases I came across some decisions I want people to have a chance to think about. This is partly because, after lots of time spent on SO, I think we chose the wrong default for as_index in groupby, and it's too late to change now; I'd prefer that not to happen again, so if someone else can spot a problem here, that'd be great.

Overview

Based on the earlier discussion, I'm proposing a new option how='cross', which would give the Cartesian product of the two dataframes. In terms of row values, this would be equivalent to adding a dummy column to each frame, performing an inner merge on that column, and then deleting it from the result, which is the standard workaround at the moment.

As an example:

In [17]: df0 = pd.DataFrame({"A": [1,2]})

In [18]: df1 = pd.DataFrame({"B": [3,4,5]})

In [19]: df0.merge(df1, how='cross')
Out[19]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [20]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[20]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

Parameters

pd.merge has lots of options, though, and we need to decide how they're being handled. Here are my thoughts so far.

  • on, left_on, right_on: if they're specified, raise a ValueError. how='cross' implicitly specifies its own join target for each frame, and so it doesn't make sense to allow these.
  • sort: keep but ignore. The join keys are effectively all equal, so sorting has no effect, and leaving it in is harmless.
  • suffixes: keep. In the case of matching column names in each frame, it'll behave just like the reference merge, and we get
In [25]: df0.merge(df1, how='cross')
Out[25]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50

In [26]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[26]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50
  • copy: keep, although whether or not it does anything will depend on the current merge implementation (and the promise "not unnecessarily" is pretty weak anyhow.)
  • indicator: keep. The output won't be very interesting, as it will always be "both", but it's harmless.
  • validate: behave exactly as if we're doing an inner merge on a dummy column (so that two one-row frames pass a 'one_to_one' check but longer frames would raise a MergeError.)
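The validate behaviour described in the last bullet can be checked against today's dummy-column workaround: two one-row frames pass a 'one_to_one' check, while longer frames raise MergeError (a sketch using the current merge API, not the proposed how='cross'):

```python
import pandas as pd
from pandas.errors import MergeError

one_a = pd.DataFrame({'A': [1]})
one_b = pd.DataFrame({'B': [2]})

# One row on each side: the dummy key really is one-to-one.
ok = pd.merge(one_a.assign(m=1), one_b.assign(m=1), on='m', validate='one_to_one')

two_a = pd.DataFrame({'A': [1, 2]})
# Two rows on the left duplicate the dummy key, so 'one_to_one' must fail.
try:
    pd.merge(two_a.assign(m=1), one_b.assign(m=1), on='m', validate='one_to_one')
    raised = False
except MergeError:
    raised = True
```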

Index

For me, the interesting question is what to do with left_index and right_index. I can think of at least three defensible options:

  1. Raise if either left_index or right_index is not False. The docstring gives "Use the index from the left DataFrame as the join key(s)" as the meaning of left_index, but how='cross' already specifies its own join key, and so it doesn't make sense.

Advantage: very easy to explain and understand. Also doesn't prevent us from changing our minds later, as it's much easier to give code which used to fail a new meaning than to change the meaning of code which used to give a result. If you want the index information to be preserved, you'd still have the option of calling .reset_index() on the arguments.

Disadvantage: pandas code is already too cluttered with .reset_index() calls, IMHO.

  2. Behave exactly as merging on a dummy column would. In other words:
In [72]: df0 = pd.DataFrame({"A": [1,2]}, index=[10,20])

In [73]: df1 = pd.DataFrame({"B": [3,4,5]}, index=[20,30,40])

In [74]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[74]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [75]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True).drop('m', axis=1)
Out[75]: 
    A  B
20  1  3
30  1  4
40  1  5
20  2  3
30  2  4
40  2  5

In [76]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1)
Out[76]: 
    A  B
10  1  3
10  1  4
10  1  5
20  2  3
20  2  4
20  2  5

In [77]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True, right_index=True).drop('m', axis=1)
Out[77]: 
    A  B
20  2  3

That is, in the case of left_index=right_index=False, give the natural index. (Which is an Int64Index, not a Range, but that's minor). In the case of left_index=True, repeat the right index values. In the case of right_index=True, repeat the left index values. In the case of both, keep the shared indices.

Advantage: the behaviour, while not obvious, can be explained by reference to the workaround everyone uses to get the cross join now.

Disadvantage: it's really not obvious at first glance that this is the behaviour you get if you specify on as well as left/right index, and setting them both breaks the expectation that you'd have len(df0) * len(df1) rows in the output of a cross join.

3a) Do the same as in #2, but instead of keeping only the common indices when left_index and right_index are True, set the index to the multiindex product of the two indices:

In [109]: m0 = pd.MultiIndex.from_product([df0.index, df1.index])

In [110]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1).set_index(m0)
Out[110]: 
       A  B
10 20  1  3
   30  1  4
   40  1  5
20 20  2  3
   30  2  4
   40  2  5

Advantage: now the cross join produces the expected number of rows, and there's a way to preserve the index information if you want it, while the default (with left_index=right_index=False would continue to ignore it.)

Disadvantage: now it really feels like we're reinterpreting left_index and right_index in the how='cross' case.

3b) Do the same as in (3a), but change the interpretation of the index flags in the case of how='cross' to mean keep that index information. In other words, left_index=True would give you an index in the result which consists of df0's index elements repeated appropriately, right_index=True would give you df1's index elements repeated appropriately, and both true would give you the multiindex.

Advantage: all of those of 3a, with a much easier way to explain what gets used in the resulting index.

Disadvantage: inverts the behaviour from anyone used to the merge behaviour in this (admittedly obscure) case.

Conclusion

The fact we have indices to worry about complicates things a bit. Option #1 (raise) is certainly the simplest. I think option #2, although I like the consistency with the workaround, is ruled out by the fact it doesn't always return the expected number of elements.

I don't think I've ever used on with left_index=True or right_index=True. To be honest, it feels a little weird that it works in the first place.. but for me since I never use it, any incompatibility with current behaviour isn't a problem. If I were the only user of pandas I'd go for 3b and get on with my day, but I want to preserve least surprise for anyone who might be using it.

Thoughts?

Contributor

dsm054 commented Jun 3, 2017

I'd like to get some input on the possible behaviour of a cross join. I have a candidate implementation, but while testing different cases I came across some decisions I want people to have a chance to think about. Partly this is because, after lots of time spent on SO, I think we chose the wrong default for as_index in groupby, and it's too late to change now -- I'd prefer that not to happen again, so if someone else can spot a problem, that'd be great.

Overview

Based on the earlier discussion, I'm proposing a new option how='cross', which would give the Cartesian product of the two dataframes. In terms of row values, this would be equivalent to adding a dummy column to each frame, performing an inner merge on that column, and then deleting it from the result, which is the standard workaround at the moment.

As an example:

In [17]: df0 = pd.DataFrame({"A": [1,2]})

In [18]: df1 = pd.DataFrame({"B": [3,4,5]})

In [19]: df0.merge(df1, how='cross')
Out[19]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [20]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[20]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

Parameters

pd.merge has lots of options, though, and we need to decide how they're being handled. Here are my thoughts so far.

  • on, left_on, right_on: if they're specified, raise a ValueError. how='cross' implicitly specifies its own join target for each frame, and so it doesn't make sense to allow these.
  • sort: keep but ignore. The join keys are effectively all equal, so sorting has no effect, and leaving it in is harmless.
  • suffixes: keep. In the case of matching column names in each frame (say df0 and df1 each also carry a column C), it'll behave just like the reference merge, and we get
In [25]: df0.merge(df1, how='cross')
Out[25]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50

In [26]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[26]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50
  • copy: keep, although whether or not it does anything will depend on the current merge implementation (and the promise "not unnecessarily" is pretty weak anyhow.)
  • indicator: keep. The output won't be very interesting, as it will always be "both", but it's harmless.
  • validate: behave exactly as if we're doing an inner merge on a dummy column (so that two one-row frames pass a 'one_to_one' check but longer frames would raise a MergeError.)
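
Putting the parameter handling above together, the row values of the proposed how='cross' can be sketched as a thin wrapper over the dummy-key workaround (the helper name cross_merge and the key name '_cross_key' are illustrative only, not part of any API):

```python
import pandas as pd

def cross_merge(left, right, suffixes=('_x', '_y')):
    # sketch of how='cross': inner-merge on a temporary constant key,
    # then drop the key from the result
    key = '_cross_key'  # illustrative; a real implementation would avoid collisions
    out = pd.merge(left.assign(**{key: 1}),
                   right.assign(**{key: 1}),
                   on=key, how='inner', suffixes=suffixes)
    return out.drop(key, axis=1)

df0 = pd.DataFrame({"A": [1, 2]})
df1 = pd.DataFrame({"B": [3, 4, 5]})
result = cross_merge(df0, df1)
# len(result) == len(df0) * len(df1)
```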

Index

For me, the interesting question is what to do with left_index and right_index. I can think of at least three defensible options:

  1. Raise if either left_index or right_index is not False. The docstring gives "Use the index from the left DataFrame as the join key(s)" as the meaning of left_index, but how='cross' already specifies its own join key, so the index flags don't make sense here.

Advantage: very easy to explain and understand. Also doesn't prevent us from changing our minds later, as it's much easier to give code which used to fail a new meaning than to change the meaning of code which used to give a result. If you want the index information to be preserved, you'd still have the option of calling .reset_index() on the arguments.

Disadvantage: pandas code is already too cluttered with .reset_index() calls, IMHO.

  2. Behave exactly as merging on a dummy column would. In other words:
In [72]: df0 = pd.DataFrame({"A": [1,2]}, index=[10,20])

In [73]: df1 = pd.DataFrame({"B": [3,4,5]}, index=[20,30,40])

In [74]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[74]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [75]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True).drop('m', axis=1)
Out[75]: 
    A  B
20  1  3
30  1  4
40  1  5
20  2  3
30  2  4
40  2  5

In [76]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1)
Out[76]: 
    A  B
10  1  3
10  1  4
10  1  5
20  2  3
20  2  4
20  2  5

In [77]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True, right_index=True).drop('m', axis=1)
Out[77]: 
    A  B
20  2  3

That is, in the case of left_index=right_index=False, give the natural index (which is an Int64Index, not a RangeIndex, but that's minor). In the case of left_index=True, repeat the right index values. In the case of right_index=True, repeat the left index values. In the case of both, keep the shared indices.

Advantage: the behaviour, while not obvious, can be explained by reference to the workaround everyone uses to get the cross join now.

Disadvantage: it's really not obvious at first glance that this is the behaviour you get if you specify on as well as left/right index, and setting them both breaks the expectation that you'd have len(df0) * len(df1) rows in the output of a cross join.

3a) Do the same as in #2, but instead of keeping only the common indices when left_index and right_index are both True, set the index to the MultiIndex product of the two indices:

In [109]: m0 = pd.MultiIndex.from_product([df0.index, df1.index])

In [110]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1).set_index(m0)
Out[110]: 
       A  B
10 20  1  3
   30  1  4
   40  1  5
20 20  2  3
   30  2  4
   40  2  5

Advantage: now the cross join produces the expected number of rows, and there's a way to preserve the index information if you want it, while the default (with left_index=right_index=False) would continue to ignore it.

Disadvantage: now it really feels like we're reinterpreting left_index and right_index in the how='cross' case.

3b) Do the same as in (3a), but change the interpretation of the index flags in the case of how='cross' to mean keep that index information. In other words, left_index=True would give you an index in the result which consists of df0's index elements repeated appropriately, right_index=True would give you df1's index elements repeated appropriately, and both true would give you the multiindex.

Advantage: all of those of 3a, with a much easier way to explain what gets used in the resulting index.

Disadvantage: inverts the behaviour for anyone used to the current merge behaviour in this (admittedly obscure) case.
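
For concreteness, the index each 3b flag combination would produce can be built directly from the two inputs (a sketch of the proposed semantics, not existing behaviour):

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame({"A": [1, 2]}, index=[10, 20])
df1 = pd.DataFrame({"B": [3, 4, 5]}, index=[20, 30, 40])

# left_index=True: df0's index elements, each repeated len(df1) times
left_idx = df0.index.repeat(len(df1))
# right_index=True: df1's index elements, tiled len(df0) times
right_idx = pd.Index(np.tile(df1.index, len(df0)))
# both True: the MultiIndex product
both_idx = pd.MultiIndex.from_product([df0.index, df1.index])
```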

Conclusion

The fact that we have indices to worry about complicates things a bit. Option #1 (raise) is certainly the simplest. Option #2, although I like its consistency with the workaround, is ruled out by the fact that it doesn't always return the expected number of rows.

I don't think I've ever used on with left_index=True or right_index=True. To be honest, it feels a little weird that it works in the first place, but since I never use it, any incompatibility with current behaviour isn't a problem for me. If I were the only user of pandas I'd go for 3b and get on with my day, but I want to preserve least surprise for anyone who might be relying on the current behaviour.

Thoughts?

@ngould

ngould commented Jun 3, 2017

Wow, great writeup @dsm054. I agree with ruling out option 2, since it could yield the wrong row count. And 3b stands out to me as the least surprising behavior. Namely,

  • I get the correct number of rows.
  • I get an ordered integer index when I use the default values of left_index and right_index. (Same behavior as other types of merge.)
  • I have the ability to specify which original indices I want to keep, if any.

You mentioned that 3b inverts the usual interpretation of left_index and right_index. Can you explain why you think so? My understanding is that merge gives an integer index by default, and only preserves index info from the original dataframes if left_index or right_index is specified. So 3b seems in line with that behavior, unless I'm misunderstanding something.

As for option 1, can you elaborate on your comment about reset_index(), and how it could be used to restore the indices from the dataframes being merged? I don't see how that is possible, but again that might be my misunderstanding.

@dsm054

Contributor

dsm054 commented Jun 23, 2017

On 1, I meant that you can reset_index before you call merge:

In [12]: pd.merge(df0, df1, how='cross')
Out[12]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [13]: pd.merge(df0.reset_index(), df1.reset_index(), how='cross')
Out[13]: 
   index_x  A  index_y  B
0       10  1       20  3
1       10  1       30  4
2       10  1       40  5
3       20  2       20  3
4       20  2       30  4
5       20  2       40  5
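
Note that this index-preserving effect of reset_index already works today via the dummy-column workaround (here 'm' is an arbitrary key name):

```python
import pandas as pd

df0 = pd.DataFrame({"A": [1, 2]}, index=[10, 20])
df1 = pd.DataFrame({"B": [3, 4, 5]}, index=[20, 30, 40])

out = (df0.reset_index().assign(m=1)
          .merge(df1.reset_index().assign(m=1), on='m')
          .drop('m', axis=1))
# columns: index_x, A, index_y, B -- the original index values survive as data
```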

On 3b's "inversion", what I mean is that if you think of the cross join as an inner merge on a dummy column, then when you do left_index=True, you get the right index elements in the result, and vice versa:

In [21]: df0
Out[21]: 
    A
10  1
20  2

In [22]: df1
Out[22]: 
    B
20  3
30  4
40  5

In [23]: pd.merge(df0.assign(m=1), df1.assign(m=1), how='inner', on='m', left_index=True)
Out[23]: 
    A  m  B
20  1  1  3
30  1  1  4
40  1  1  5
20  2  1  3
30  2  1  4
40  2  1  5

In [24]: pd.merge(df0.assign(m=1), df1.assign(m=1), how='inner', on='m', right_index=True)
Out[24]: 
    A  m  B
10  1  1  3
10  1  1  4
10  1  1  5
20  2  1  3
20  2  1  4
20  2  1  5

which relates X_index=True to the resulting index in the opposite way from what you might expect.

BUT given #16228, which would forbid setting left_index/right_index together with left_on or right_on, maybe this isn't a problem: either we follow option #1 and bypass the whole issue, or there's no longer any other behaviour to interfere with, because you won't be setting left_on and left_index together anymore.

@ngould

ngould commented Jun 23, 2017

Thanks, got it! Seems like for consistency, #16228 should also aim to forbid on with any *_index args. Right now it only mentions right_on with right_index in particular. Would you agree, @dsm054?

If we're going down the path of disallowing those combinations of args, obviously option 1 is the way to go.

@jreback

Contributor

jreback commented Jun 28, 2017

yes #16228 would disallow combo operations here (and it should account for any of index setting w/ on setting)

@jreback

Contributor

jreback commented Jun 28, 2017

@ngould

Thanks, got it! Seems like for consistency, #16228 should also aim to forbid on with any *_index args. Right now it only mentions right_on with right_index in particular. Would you agree, @dsm054?

the purpose is left_on with left_index and right_on with right_index

@ftune

ftune commented Feb 9, 2018

I come up with an idea to do it:

def crossjoindf(df1, df2):
    # build the key column name from the union of both frames' columns,
    # so it can't collide with an existing column
    keycol = str(set(df1.columns) | set(df2.columns))
    df1 = df1.assign(**{keycol: 0})
    df2 = df2.assign(**{keycol: 0})
    return pd.merge(df1, df2, on=keycol).drop(keycol, axis=1)

It might be too easy to build a dedicated `merge(df1, df2, how='cross')` for it.
Is that the reason why there is no such function in the pandas docs?
(T_^)
#19614
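
A self-contained version of the idea above, with the undefined dfr references replaced by the actual arguments and without mutating the inputs (cross_join is just an illustrative name):

```python
import pandas as pd

def cross_join(df1, df2):
    # derive a key name from the union of both frames' columns,
    # so it cannot collide with an existing column
    key = str(set(df1.columns) | set(df2.columns))
    return (df1.assign(**{key: 0})
               .merge(df2.assign(**{key: 0}), on=key)
               .drop(key, axis=1))

left = pd.DataFrame({'color': ['red', 'blue']})
right = pd.DataFrame({'size': ['S', 'M', 'L']})
result = cross_join(left, right)
# 2 * 3 == 6 rows, columns 'color' and 'size'
```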
