Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame self-joins #2996

Closed
willfurnass opened this issue Mar 9, 2013 · 7 comments

Comments

Projects
None yet
5 participants
@willfurnass
Copy link
Contributor

commented Mar 9, 2013

Given the following DataFrame

area point test value
A 11 0 1234234
A 11 1 12341234
A 16 0 234234
A 16 1 2343
A 16 2 234234
C 4 0 234234
C 4 1 234234

it would be nice if there were a way of grouping say columns area and point and comparing the value per test > 1 with the value for test - 1.

This can be done by iterating over df.groupby(['area', 'point', 'test']) and using the sorting provided by groupby() on the specified columns to compare current and previous values. However, it would be neat if this could also be done in a more Pandas-esque way using something akin to a SQL self-join.

NB request first made in pystatsmodels Google Group; was asked by Wes to create a Github issue for this.

@wesm

This comment has been minimized.

Copy link
Member

commented Mar 9, 2013

Doesn't this do what you want?

In [25]: merge(df, df, left_on=['point', 'area', 'test'], right_on=['point', 'area', df['test'] - 1])
Out[25]: 
  area  point  test  test_x  value_x  test_y   value_y
0    A     11     0       0  1234234       1  12341234
1    A     16     0       0   234234       1      2343
2    A     16     1       1     2343       2    234234
3    C      4     0       0   234234       1    234234

or

In [24]: merge(df, df, left_on='test', right_on=df['test'] - 1)
Out[24]: 
    test area_x  point_x  test_x   value_x area_y  point_y  test_y   value_y
0      0      A       11       0   1234234      A       11       1  12341234
1      0      A       11       0   1234234      A       16       1      2343
2      0      A       11       0   1234234      C        4       1    234234
3      0      A       16       0    234234      A       11       1  12341234
4      0      A       16       0    234234      A       16       1      2343
5      0      A       16       0    234234      C        4       1    234234
6      0      C        4       0    234234      A       11       1  12341234
7      0      C        4       0    234234      A       16       1      2343
8      0      C        4       0    234234      C        4       1    234234
9      1      A       11       1  12341234      A       16       2    234234
10     1      A       16       1      2343      A       16       2    234234
11     1      C        4       1    234234      A       16       2    234234
@willfurnass

This comment has been minimized.

Copy link
Contributor Author

commented Mar 10, 2013

Cheers, that does the trick. I'll close this Issue.

@mizzao

This comment has been minimized.

Copy link

commented Oct 7, 2015

In pandas 0.16.2 I get the following error when trying to attempt a join like this, using the following code to add one column:

df_filtered = df[['gameId', 'round', 'player', 'action']]
df.merge(df_filtered, left_on=['gameId', 'player', 'round'], right_on=['gameId', 'player', df_filtered.round + 1], how='left')

results in

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-597-6ba80432ed68> in <module>()
      1 # Add last action to dataframe
      2 df_filtered = df[['gameId', 'round', 'player', 'action']]
----> 3 pd.DataFrame.merge(df, df_filtered, left_on=['gameId', 'player', 'round'], right_on=['gameId', 'player', df_filtered.round + 1], how='left')

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
   4069                      left_on=left_on, right_on=right_on,
   4070                      left_index=left_index, right_index=right_index, sort=sort,
-> 4071                      suffixes=suffixes, copy=copy)
   4072 
   4073     #----------------------------------------------------------------------

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_index=right_index, sort=sort, suffixes=suffixes,
     37                          copy=copy)
---> 38     return op.get_result()
     39 if __debug__:
     40     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in get_result(self)
    203         result = typ(result_data).__finalize__(self, method='merge')
    204 
--> 205         self._maybe_add_join_keys(result, left_indexer, right_indexer)
    206 
    207         return result

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in _maybe_add_join_keys(self, result, left_indexer, right_indexer)
    249                 right_na_indexer = right_indexer.take(na_indexer)
    250                 key_col.put(na_indexer, com.take_1d(self.right_join_keys[i],
--> 251                                                     right_na_indexer))
    252                 result.insert(i, name, key_col)
    253 

/usr/local/lib/python2.7/dist-packages/pandas/core/common.pyc in take_nd(arr, indexer, axis, out, fill_value, mask_info, allow_fill)
    842 
    843     indexer = _ensure_int64(indexer)
--> 844     func(arr, indexer, out, fill_value)
    845 
    846     if flip_order:

TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got Series)

This error also seems to have been reproduced by another user: http://stackoverflow.com/q/24091385/586086

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 8, 2015

you would have to show the input frame df_filtered.info() and df.head()

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Oct 8, 2015

Also, right_on=['gameId', 'player', df_filtered.round + 1] doesn't really make sense. right_on should be a list of strings, but df_filtered.round + 1 will be a Series.

Oh, and pandas 0.17 added a DataFrame.round method. Use df_filtered['round'].

@mizzao

This comment has been minimized.

Copy link

commented Oct 9, 2015

@TomAugspurger: @wesm used the series in place of strings in his post above. In any case, that syntax produces the same error:

df_test.merge(df_test, left_on=['gameId', 'player', 'round'], right_on=['gameId', 'player', df_test['round'] + 1], how='left')

@jreback, the info you requested (See here for a description of the data):

df_test.head()

    gameId  round   player  action
0   Afom9bWqYBgZXXXN8   1   PvQ8B5kuA9Fbq9N59   1
1   Afom9bWqYBgZXXXN8   1   PJmJgrqusFZ8KRShQ   0
2   Afom9bWqYBgZXXXN8   2   PvQ8B5kuA9Fbq9N59   0
3   Afom9bWqYBgZXXXN8   2   PJmJgrqusFZ8KRShQ   0
4   Afom9bWqYBgZXXXN8   3   PJmJgrqusFZ8KRShQ   0
5   Afom9bWqYBgZXXXN8   3   PvQ8B5kuA9Fbq9N59   0
20  QdZM4yPMnjGj8f25R   1   Q6knaWEruc6BDPQT7   1
21  QdZM4yPMnjGj8f25R   1   xnAjMcWaFRpfBbukz   1
22  QdZM4yPMnjGj8f25R   2   xnAjMcWaFRpfBbukz   1
23  QdZM4yPMnjGj8f25R   2   Q6knaWEruc6BDPQT7   0
24  QdZM4yPMnjGj8f25R   3   Q6knaWEruc6BDPQT7   1
25  QdZM4yPMnjGj8f25R   3   xnAjMcWaFRpfBbukz   1
40  riMD6ctT8DLwdhHpE   1   EKkrMpMqy2PRLm7ur   1
41  riMD6ctT8DLwdhHpE   1   EqbbmngPfZBEmPTzq   1
42  riMD6ctT8DLwdhHpE   2   EKkrMpMqy2PRLm7ur   1
43  riMD6ctT8DLwdhHpE   2   EqbbmngPfZBEmPTzq   1
44  riMD6ctT8DLwdhHpE   3   EqbbmngPfZBEmPTzq   1
45  riMD6ctT8DLwdhHpE   3   EKkrMpMqy2PRLm7ur   1
60  hyEjkAg5K4WpubJA9   1   7CHpY4setLKb9ssnN   1
61  hyEjkAg5K4WpubJA9   1   hbud2J3YvitEhj4xZ   0
62  hyEjkAg5K4WpubJA9   2   hbud2J3YvitEhj4xZ   0
63  hyEjkAg5K4WpubJA9   2   7CHpY4setLKb9ssnN   0
64  hyEjkAg5K4WpubJA9   3   7CHpY4setLKb9ssnN   0
65  hyEjkAg5K4WpubJA9   3   hbud2J3YvitEhj4xZ   1
80  ay5pmpeNcwqHJ8JBH   1   tWA9ZxSnKpZyWwYsQ   1
81  ay5pmpeNcwqHJ8JBH   1   2qiHdJgL4WQe5qrHQ   1
82  ay5pmpeNcwqHJ8JBH   2   2qiHdJgL4WQe5qrHQ   1
83  ay5pmpeNcwqHJ8JBH   2   tWA9ZxSnKpZyWwYsQ   1
84  ay5pmpeNcwqHJ8JBH   3   tWA9ZxSnKpZyWwYsQ   1
85  ay5pmpeNcwqHJ8JBH   3   2qiHdJgL4WQe5qrHQ   1

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 374251 entries, 0 to 374250
Data columns (total 4 columns):
gameId    374251 non-null object
player    374251 non-null object
round     374251 non-null int64
action    374251 non-null int64
dtypes: int64(2), object(2)
memory usage: 14.3+ MB
@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 14, 2015

Is this what you want?

In [39]: df['round2']=df['round']+1

In [40]: df.merge(df, left_on=['gameId', 'player', 'round'], right_on=['gameId', 'player','round2'], how='left')
Out[40]: 
               gameId  round_x             player  action_x  round2_x  round_y  action_y  round2_y
0   Afom9bWqYBgZXXXN8        1  PvQ8B5kuA9Fbq9N59         1         2      NaN       NaN       NaN
1   Afom9bWqYBgZXXXN8        1  PJmJgrqusFZ8KRShQ         0         2      NaN       NaN       NaN
2   Afom9bWqYBgZXXXN8        2  PvQ8B5kuA9Fbq9N59         0         3        1         1         2
3   Afom9bWqYBgZXXXN8        2  PJmJgrqusFZ8KRShQ         0         3        1         0         2
4   Afom9bWqYBgZXXXN8        3  PJmJgrqusFZ8KRShQ         0         4        2         0         3
5   Afom9bWqYBgZXXXN8        3  PvQ8B5kuA9Fbq9N59         0         4        2         0         3
6   QdZM4yPMnjGj8f25R        1  Q6knaWEruc6BDPQT7         1         2      NaN       NaN       NaN
7   QdZM4yPMnjGj8f25R        1  xnAjMcWaFRpfBbukz         1         2      NaN       NaN       NaN
8   QdZM4yPMnjGj8f25R        2  xnAjMcWaFRpfBbukz         1         3        1         1         2
9   QdZM4yPMnjGj8f25R        2  Q6knaWEruc6BDPQT7         0         3        1         1         2
10  QdZM4yPMnjGj8f25R        3  Q6knaWEruc6BDPQT7         1         4        2         0         3
11  QdZM4yPMnjGj8f25R        3  xnAjMcWaFRpfBbukz         1         4        2         1         3
12  riMD6ctT8DLwdhHpE        1  EKkrMpMqy2PRLm7ur         1         2      NaN       NaN       NaN
13  riMD6ctT8DLwdhHpE        1  EqbbmngPfZBEmPTzq         1         2      NaN       NaN       NaN
14  riMD6ctT8DLwdhHpE        2  EKkrMpMqy2PRLm7ur         1         3        1         1         2
15  riMD6ctT8DLwdhHpE        2  EqbbmngPfZBEmPTzq         1         3        1         1         2
16  riMD6ctT8DLwdhHpE        3  EqbbmngPfZBEmPTzq         1         4        2         1         3
17  riMD6ctT8DLwdhHpE        3  EKkrMpMqy2PRLm7ur         1         4        2         1         3
18  hyEjkAg5K4WpubJA9        1  7CHpY4setLKb9ssnN         1         2      NaN       NaN       NaN
19  hyEjkAg5K4WpubJA9        1  hbud2J3YvitEhj4xZ         0         2      NaN       NaN       NaN
20  hyEjkAg5K4WpubJA9        2  hbud2J3YvitEhj4xZ         0         3        1         0         2
21  hyEjkAg5K4WpubJA9        2  7CHpY4setLKb9ssnN         0         3        1         1         2
22  hyEjkAg5K4WpubJA9        3  7CHpY4setLKb9ssnN         0         4        2         0         3
23  hyEjkAg5K4WpubJA9        3  hbud2J3YvitEhj4xZ         1         4        2         0         3
24  ay5pmpeNcwqHJ8JBH        1  tWA9ZxSnKpZyWwYsQ         1         2      NaN       NaN       NaN
25  ay5pmpeNcwqHJ8JBH        1  2qiHdJgL4WQe5qrHQ         1         2      NaN       NaN       NaN
26  ay5pmpeNcwqHJ8JBH        2  2qiHdJgL4WQe5qrHQ         1         3        1         1         2
27  ay5pmpeNcwqHJ8JBH        2  tWA9ZxSnKpZyWwYsQ         1         3        1         1         2
28  ay5pmpeNcwqHJ8JBH        3  tWA9ZxSnKpZyWwYsQ         1         4        2         1         3
29  ay5pmpeNcwqHJ8JBH        3  2qiHdJgL4WQe5qrHQ         1         4        2         1         3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.