# Combining data

In [2]:
import pandas as pd

# Joining data: side by side (column-wise)

This section covers merge, join, and concat.

First, build sample DataFrames from dictionaries.

In [27]:
df1 = pd.DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})
df2 = pd.DataFrame({'key':['a','b','d','a'],'data2':range(4)})

In [28]:
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [29]:
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,d
3,3,a


In [30]:
print(df1['key'].unique())
print(df2['key'].unique())

['b' 'a' 'c']
['a' 'b' 'd']


### Choosing between inner, left outer, and full outer joins

- Inner join

In [31]:
pd.merge(df1,df2,on='key',how='inner')

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,2,a,3
5,4,a,0
6,4,a,3
7,5,a,0
8,5,a,3


Only key values that appear in both df1 and df2 are kept, and every matching pair of rows becomes a row of the new table.
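Because every matching pair is emitted, the size of an inner join follows from the key counts on each side. The sketch below rebuilds the sample DataFrames and checks that; the helper names (`left_counts`, `right_counts`, `common`, `expected_rows`) are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd', 'a'], 'data2': range(4)})

# Count how often each key value occurs on each side.
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# For a key present on both sides, the join emits left_count * right_count rows.
common = left_counts.index.intersection(right_counts.index)
expected_rows = int((left_counts[common] * right_counts[common]).sum())

inner = pd.merge(df1, df2, on='key', how='inner')
print(expected_rows, len(inner))
```

Here 'a' contributes 3 x 2 rows and 'b' contributes 3 x 1 rows, which is why the result is longer than either input.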

- Full outer join

In [32]:
pd.merge(df1,df2,on='key',how='outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,2.0,a,3.0
5,4.0,a,0.0
6,4.0,a,3.0
7,5.0,a,0.0
8,5.0,a,3.0
9,3.0,c,


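Key values that appear in at least one of df1 and df2 are kept, and every matching pair of rows becomes a row of the new table. Rows with no match on the other side are filled with NaN. The display above is cut off after ten rows, so the row for key d (present only in df2) is not shown.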
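To see which side each row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch on the same sample data; the variable names `outer`, `left_only_keys`, and `right_only_keys` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd', 'a'], 'data2': range(4)})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Keys that exist on one side only.
left_only_keys = sorted(outer.loc[outer['_merge'] == 'left_only', 'key'].unique())
right_only_keys = sorted(outer.loc[outer['_merge'] == 'right_only', 'key'].unique())

print(len(outer))
print(left_only_keys, right_only_keys)
```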

- Left outer join

In [33]:
pd.merge(df1,df2,on='key',how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,1,b,1.0
2,2,a,0.0
3,2,a,3.0
4,3,c,
5,4,a,0.0
6,4,a,3.0
7,5,a,0.0
8,5,a,3.0
9,6,b,1.0


Every key value in df1 is kept; rows of df2 are included only when their key also appears in df1. Every matching pair of rows becomes a row of the new table.

- Right outer join

In [34]:
pd.merge(df1,df2,on='key',how='right')

Unnamed: 0,data1,key,data2
0,0.0,b,1
1,1.0,b,1
2,6.0,b,1
3,2.0,a,0
4,4.0,a,0
5,5.0,a,0
6,2.0,a,3
7,4.0,a,3
8,5.0,a,3
9,,d,2


A right outer join contains the same rows as a left outer join with the two DataFrames swapped; only the row and column order differ.

In [35]:
pd.merge(df2,df1,on='key',how='left')

Unnamed: 0,data2,key,data1
0,0,a,2.0
1,0,a,4.0
2,0,a,5.0
3,1,b,0.0
4,1,b,1.0
5,1,b,6.0
6,2,d,
7,3,a,2.0
8,3,a,4.0
9,3,a,5.0
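The equivalence can be checked by putting both results into the same column and row order before comparing. This is a sketch; `normalize` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd', 'a'], 'data2': range(4)})

right = pd.merge(df1, df2, on='key', how='right')
swapped = pd.merge(df2, df1, on='key', how='left')

cols = ['key', 'data1', 'data2']

def normalize(df):
    # Same column order, same row order, fresh index.
    return df[cols].sort_values(cols).reset_index(drop=True)

same = normalize(right).equals(normalize(swapped))
print(same)
```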


The default is an inner join. When on is omitted, the columns common to both DataFrames are used as the key.

In [37]:
pd.merge(df1,df2)

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,2,a,3
5,4,a,0
6,4,a,3
7,5,a,0
8,5,a,3


When the key columns have different names

In [38]:
df3 = pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})
df4 = pd.DataFrame({'rkey':['a','b','c','a'],'data2':range(4)})

In [39]:
df3

Unnamed: 0,data1,lkey
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [40]:
df4

Unnamed: 0,data2,rkey
0,0,a
1,1,b
2,2,c
3,3,a


In [42]:
pd.merge(df3,df4)

MergeError: No common columns to perform merge on

Specify the keys with left_on and right_on.

In [41]:
pd.merge(df3,df4,left_on='lkey',right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1,b
1,1,b,1,b
2,6,b,1,b
3,2,a,0,a
4,2,a,3,a
5,4,a,0,a
6,4,a,3,a
7,5,a,0,a
8,5,a,3,a
9,3,c,2,c
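With left_on and right_on, both key columns are kept even though they hold the same values in an inner join. One way to tidy the result, as a sketch (the names `merged`, `keys_match`, and `tidy` are illustrative):

```python
import pandas as pd

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'c', 'a'], 'data2': range(4)})

merged = pd.merge(df3, df4, left_on='lkey', right_on='rkey')

# In an inner join the two key columns agree on every row.
keys_match = bool((merged['lkey'] == merged['rkey']).all())

# Keep a single key column under a common name.
tidy = merged.drop(columns='rkey').rename(columns={'lkey': 'key'})
print(keys_match)
print(tidy.columns.tolist())
```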


You can also join on the index.

In [43]:
pd.merge(df3,df4,left_index=True,right_index=True)

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,0,a
1,1,b,1,b
2,2,a,2,c
3,3,c,3,a


In [44]:
pd.merge(df3,df4,left_index=True,right_index=True,how='outer')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,0.0,a
1,1,b,1.0,b
2,2,a,2.0,c
3,3,c,3.0,a
4,4,a,,
5,5,a,,
6,6,b,,


You can also specify, for each side, whether the key is a column or the index.

In [45]:
pd.merge(df3,df4,left_on='data1',right_index=True,how='outer')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,0.0,a
1,1,b,1.0,b
2,2,a,2.0,c
3,3,c,3.0,a
4,4,a,,
5,5,a,,
6,6,b,,


When non-key column names overlap, suffixes (_x and _y by default) are appended automatically.

In [47]:
df5 = pd.DataFrame({'key':['b','b','a','c','a','a','b'],'data':range(7)})
df6 = pd.DataFrame({'key':['a','b','d','a'],'data':range(4)})

In [48]:
df5

Unnamed: 0,data,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [49]:
df6

Unnamed: 0,data,key
0,0,a
1,1,b
2,2,d
3,3,a


In [50]:
pd.merge(df5,df6,on='key',how='left')

Unnamed: 0,data_x,key,data_y
0,0,b,1.0
1,1,b,1.0
2,2,a,0.0
3,2,a,3.0
4,3,c,
5,4,a,0.0
6,4,a,3.0
7,5,a,0.0
8,5,a,3.0
9,6,b,1.0


In [52]:
pd.merge(df5,df6,on='key',how='left',suffixes=('_1','_2'))

Unnamed: 0,data_1,key,data_2
0,0,b,1.0
1,1,b,1.0
2,2,a,0.0
3,2,a,3.0
4,3,c,
5,4,a,0.0
6,4,a,3.0
7,5,a,0.0
8,5,a,3.0
9,6,b,1.0


The same operation is also available as a DataFrame method.

In [53]:
df1.merge(df2,on='key',how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,1,b,1.0
2,2,a,0.0
3,2,a,3.0
4,3,c,
5,4,a,0.0
6,4,a,3.0
7,5,a,0.0
8,5,a,3.0
9,6,b,1.0


pd.merge(df1,df2) and df1.merge(df2) accept the same arguments.

When the DataFrames share no column names and you want to join on the index, you can use join (a left join by default).

In [56]:
df3.join(df4)

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,0.0,a
1,1,b,1.0,b
2,2,a,2.0,c
3,3,c,3.0,a
4,4,a,,
5,5,a,,
6,6,b,,


join raises an error when the DataFrames share a column name and no suffix is specified.

In [57]:
df1.join(df2)

ValueError: columns overlap but no suffix specified: Index(['key'], dtype='object')
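The error message points at the fix: passing `lsuffix` and/or `rsuffix` lets join rename the overlapping columns. A minimal sketch on the sample data; the suffix strings are arbitrary choices.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd', 'a'], 'data2': range(4)})

# join aligns on the index (left join by default); suffixes resolve the 'key' clash.
joined = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(joined.columns.tolist())
print(len(joined))
```

Note that this aligns rows by index position, not by the values in key, so it answers a different question than pd.merge(df1,df2,on='key').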

In [58]:
left = pd.DataFrame([[1.,2.],[3.,4.],[5.,6.]],index=['a','c','e'],columns=['Ohio','Nevada'])
right = pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]],index=['b','c','d','e'],columns=['Missoure','Alabama'])

In [59]:
left

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [60]:
right

Unnamed: 0,Missoure,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [61]:
left.join(right,how='outer')

Unnamed: 0,Ohio,Nevada,Missoure,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In [68]:
another = pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]],index=['a','c','e','f'],columns=['New York','Oregon'])

In [69]:
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,13.0,14.0


join can also combine several DataFrames at once.

In [71]:
left.join([right,another])

Unnamed: 0,Ohio,Nevada,Missoure,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0


concat, described later, can also be used.

- Although it depends on the analysis, as a rule choose a key whose values are unique.
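Key uniqueness can be checked before or during a merge. The sketch below uses `Series.is_unique` and the `validate` argument of `pd.merge`; the `orders`, `lookup_ok`, and `lookup_dup` tables are made-up examples, not data from this notebook.

```python
import pandas as pd

orders = pd.DataFrame({'key': ['a', 'b', 'a'], 'amount': [10, 20, 30]})
lookup_ok = pd.DataFrame({'key': ['a', 'b'], 'label': ['alpha', 'beta']})
lookup_dup = pd.DataFrame({'key': ['a', 'a', 'b'], 'label': ['alpha', 'ALPHA', 'beta']})

# Quick check on a single column.
print(lookup_ok['key'].is_unique, lookup_dup['key'].is_unique)

# validate='many_to_one' requires the right-hand keys to be unique.
ok = pd.merge(orders, lookup_ok, on='key', how='left', validate='many_to_one')

try:
    pd.merge(orders, lookup_dup, on='key', how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    # Duplicate keys on the right side violate the many-to-one expectation.
    raised = True

print(len(ok), raised)
```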

# Concatenating data: stacking vertically (row-wise)

First, create some Series.

In [75]:
s1 = pd.Series([0,1],index=['a','b'])
s2 = pd.Series([2,3,4],index=['c','d','e'])
s3 = pd.Series([5,6],index=['f','g'])

In [None]:
pd.concat([s1,s2])

Several objects can be concatenated at once.

In [79]:
pd.concat([s1,s2,s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
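After concatenation, nothing records which piece a value came from. If that matters, the `keys` argument of `pd.concat` adds an outer index level. A sketch with arbitrary labels 'one', 'two', and 'three':

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# keys= builds a two-level index: (label of the piece, original index).
labelled = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

# Select everything that came from the second piece.
second = labelled.loc['two']
print(labelled.index.nlevels)
print(second.tolist())
```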

## Works in both directions: concat

With axis=1, concat joins side by side, aligning on the index, so it can be used somewhat like merge.

In [80]:
pd.concat([s1,s2,s3],axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


axis=0 concatenates vertically, which is the default.

In [81]:
pd.concat([s1,s2,s3],axis=0)

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [82]:
s4 = pd.concat([s1 * 5,s3])
s4

a    0
b    5
f    5
g    6
dtype: int64

By default, concat performs a full outer join on the index.

In [83]:
pd.concat([s1,s4],axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,5
f,,5
g,,6


You can request an inner join keyed on the index.

In [84]:
pd.concat([s1,s4],axis=1,join='inner')

Unnamed: 0,0,1
a,0,0
b,1,5
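The two modes differ only in which index labels survive. The sketch below compares them and also passes `keys` so the resulting columns get names instead of 0 and 1; the labels 's1' and 's4' are chosen here for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])

# Default (outer): union of the index labels, NaN where a Series has no value.
outer = pd.concat([s1, s4], axis=1)

# Inner: only labels present in every Series.
inner = pd.concat([s1, s4], axis=1, join='inner')

# keys= names the columns when concatenating along axis=1.
named = pd.concat([s1, s4], axis=1, join='inner', keys=['s1', 's4'])

print(sorted(outer.index))
print(inner.index.tolist())
print(named.columns.tolist())
```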


In [1]:
import pandas_datareader.data as web
from datetime import datetime

In [24]:
# Closing prices only. The ['Close',:,:] indexing assumes DataReader returns a
# 3-D Panel for several tickers, as older versions of the library did.
ticker1 = ['AAPL','IBM']
start1 = '2016-01-01'
end1 = '2016-05-31'
df1 = web.DataReader(ticker1,'yahoo',start1,end1)['Close',:,:]
ticker2 = ['MSFT','GOOG']
start2 = '2016-08-01'
end2 = '2016-12-31'
df2 = web.DataReader(ticker2,'yahoo',start2,end2)['Close',:,:]
ticker3 = ['AAPL','IBM','MSFT','GOOG']
start3 = '2016-05-01'
end3 = '2016-08-31'
df3 = web.DataReader(ticker3,'yahoo',start3,end3)['Close',:,:]

In [25]:
df1

Unnamed: 0_level_0,AAPL,IBM
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2016-01-04,105.349998,135.949997
2016-01-05,102.709999,135.850006
2016-01-06,100.699997,135.169998
2016-01-07,96.449997,132.860001
2016-01-08,96.959999,131.630005
2016-01-11,98.529999,133.229996
2016-01-12,99.959999,132.899994
2016-01-13,97.389999,131.169998
2016-01-14,99.519997,132.910004
2016-01-15,97.129997,130.029999


In [26]:
df2

Unnamed: 0_level_0,GOOG,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2016-08-01,772.880005,56.580002
2016-08-02,771.070007,56.580002
2016-08-03,773.179993,56.970001
2016-08-04,771.609985,57.389999
2016-08-05,782.219971,57.959999
2016-08-08,781.760010,58.060001
2016-08-09,784.260010,58.200001
2016-08-10,784.679993,58.020000
2016-08-11,784.849976,58.299999
2016-08-12,783.219971,57.939999


In [27]:
df3

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-05-02,93.639999,698.210022,145.270004,50.610001
2016-05-03,95.180000,692.359985,144.130005,49.779999
2016-05-04,94.190002,695.700012,144.250000,49.869999
2016-05-05,93.239998,701.429993,146.470001,49.939999
2016-05-06,92.720001,711.119995,147.289993,50.389999
2016-05-09,92.790001,712.900024,147.339996,50.070000
2016-05-10,93.419998,723.179993,149.970001,51.020000
2016-05-11,92.510002,715.289978,148.949997,51.049999
2016-05-12,90.339996,713.309998,148.839996,51.509998
2016-05-13,90.519997,710.830017,147.720001,51.080002


In [28]:
pd.concat([df1,df3],axis=0)

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-01-04,105.349998,,135.949997,
2016-01-05,102.709999,,135.850006,
2016-01-06,100.699997,,135.169998,
2016-01-07,96.449997,,132.860001,
2016-01-08,96.959999,,131.630005,
2016-01-11,98.529999,,133.229996,
2016-01-12,99.959999,,132.899994,
2016-01-13,97.389999,,131.169998,
2016-01-14,99.519997,,132.910004,
2016-01-15,97.129997,,130.029999,


In [29]:
pd.concat([df1,df3],axis=1)

Unnamed: 0_level_0,AAPL,IBM,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2016-01-04,105.349998,135.949997,,,,
2016-01-05,102.709999,135.850006,,,,
2016-01-06,100.699997,135.169998,,,,
2016-01-07,96.449997,132.860001,,,,
2016-01-08,96.959999,131.630005,,,,
2016-01-11,98.529999,133.229996,,,,
2016-01-12,99.959999,132.899994,,,,
2016-01-13,97.389999,131.169998,,,,
2016-01-14,99.519997,132.910004,,,,
2016-01-15,97.129997,130.029999,,,,


In [30]:
pd.concat([df1,df2],axis=0)

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-01-04,105.349998,,135.949997,
2016-01-05,102.709999,,135.850006,
2016-01-06,100.699997,,135.169998,
2016-01-07,96.449997,,132.860001,
2016-01-08,96.959999,,131.630005,
2016-01-11,98.529999,,133.229996,
2016-01-12,99.959999,,132.899994,
2016-01-13,97.389999,,131.169998,
2016-01-14,99.519997,,132.910004,
2016-01-15,97.129997,,130.029999,


In [31]:
pd.concat([df1,df2],axis=1)

Unnamed: 0_level_0,AAPL,IBM,GOOG,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-01-04,105.349998,135.949997,,
2016-01-05,102.709999,135.850006,,
2016-01-06,100.699997,135.169998,,
2016-01-07,96.449997,132.860001,,
2016-01-08,96.959999,131.630005,,
2016-01-11,98.529999,133.229996,,
2016-01-12,99.959999,132.899994,,
2016-01-13,97.389999,131.169998,,
2016-01-14,99.519997,132.910004,,
2016-01-15,97.129997,130.029999,,


In [32]:
pd.concat([df1,df2,df3],axis=0)

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-01-04,105.349998,,135.949997,
2016-01-05,102.709999,,135.850006,
2016-01-06,100.699997,,135.169998,
2016-01-07,96.449997,,132.860001,
2016-01-08,96.959999,,131.630005,
2016-01-11,98.529999,,133.229996,
2016-01-12,99.959999,,132.899994,
2016-01-13,97.389999,,131.169998,
2016-01-14,99.519997,,132.910004,
2016-01-15,97.129997,,130.029999,


In [33]:
pd.concat([df1,df2,df3],axis=1)

Unnamed: 0_level_0,AAPL,IBM,GOOG,MSFT,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2016-01-04,105.349998,135.949997,,,,,,
2016-01-05,102.709999,135.850006,,,,,,
2016-01-06,100.699997,135.169998,,,,,,
2016-01-07,96.449997,132.860001,,,,,,
2016-01-08,96.959999,131.630005,,,,,,
2016-01-11,98.529999,133.229996,,,,,,
2016-01-12,99.959999,132.899994,,,,,,
2016-01-13,97.389999,131.169998,,,,,,
2016-01-14,99.519997,132.910004,,,,,,
2016-01-15,97.129997,130.029999,,,,,,


In [34]:
pd.concat([df1,df3],axis=0,join='inner')

Unnamed: 0_level_0,AAPL,IBM
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2016-01-04,105.349998,135.949997
2016-01-05,102.709999,135.850006
2016-01-06,100.699997,135.169998
2016-01-07,96.449997,132.860001
2016-01-08,96.959999,131.630005
2016-01-11,98.529999,133.229996
2016-01-12,99.959999,132.899994
2016-01-13,97.389999,131.169998
2016-01-14,99.519997,132.910004
2016-01-15,97.129997,130.029999


In [35]:
pd.concat([df1,df3],axis=1,join='inner')

Unnamed: 0_level_0,AAPL,IBM,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2016-05-02,93.639999,145.270004,93.639999,698.210022,145.270004,50.610001
2016-05-03,95.18,144.130005,95.18,692.359985,144.130005,49.779999
2016-05-04,94.190002,144.25,94.190002,695.700012,144.25,49.869999
2016-05-05,93.239998,146.470001,93.239998,701.429993,146.470001,49.939999
2016-05-06,92.720001,147.289993,92.720001,711.119995,147.289993,50.389999
2016-05-09,92.790001,147.339996,92.790001,712.900024,147.339996,50.07
2016-05-10,93.419998,149.970001,93.419998,723.179993,149.970001,51.02
2016-05-11,92.510002,148.949997,92.510002,715.289978,148.949997,51.049999
2016-05-12,90.339996,148.839996,90.339996,713.309998,148.839996,51.509998
2016-05-13,90.519997,147.720001,90.519997,710.830017,147.720001,51.080002
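With join='inner', the axis that is not being concatenated is intersected: along axis=0 only the shared columns remain, and along axis=1 only the shared dates remain. The sketch below reproduces this on small made-up frames (`early` and `late` are hypothetical, with invented prices) so it runs without downloading anything.

```python
import pandas as pd

# Made-up prices; the two date ranges overlap on 2016-05-03 only.
early = pd.DataFrame({'AAPL': [1.0, 2.0, 3.0], 'IBM': [4.0, 5.0, 6.0]},
                     index=pd.date_range('2016-05-01', periods=3))
late = pd.DataFrame({'AAPL': [3.0, 9.0], 'IBM': [6.0, 9.5], 'MSFT': [7.0, 8.0]},
                    index=pd.date_range('2016-05-03', periods=2))

# Stack rows: columns are intersected, so MSFT is dropped.
rows_inner = pd.concat([early, late], axis=0, join='inner')

# Place side by side: dates are intersected, so only the shared date remains.
cols_inner = pd.concat([early, late], axis=1, join='inner')

print(rows_inner.columns.tolist(), len(rows_inner))
print(cols_inner.shape)
```

Stacking rows does not deduplicate the index, so the shared date appears twice in rows_inner.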


Unnamed: 0_level_0,GOOG,GOOG
Unnamed: 0_level_1,AAPL,IBM
Date,Unnamed: 1_level_2,Unnamed: 2_level_2
2016-01-04,105.349998,135.949997
2016-01-05,102.709999,135.850006
2016-01-06,100.699997,135.169998
2016-01-07,96.449997,132.860001
2016-01-08,96.959999,131.630005
2016-01-11,98.529999,133.229996
2016-01-12,99.959999,132.899994
2016-01-13,97.389999,131.169998
2016-01-14,99.519997,132.910004
2016-01-15,97.129997,130.029999


In [44]:
s4 = df2['GOOG']

In [45]:
s4

Date
2016-08-01    772.880005
2016-08-02    771.070007
2016-08-03    773.179993
2016-08-04    771.609985
2016-08-05    782.219971
2016-08-08    781.760010
2016-08-09    784.260010
2016-08-10    784.679993
2016-08-11    784.849976
2016-08-12    783.219971
2016-08-15    782.440002
2016-08-16    777.140015
2016-08-17    779.909973
2016-08-18    777.500000
2016-08-19    775.419983
2016-08-22    772.150024
2016-08-23    772.080017
2016-08-24    769.640015
2016-08-25    769.409973
2016-08-26    769.539978
2016-08-29    772.150024
2016-08-30    769.090027
2016-08-31    767.049988
2016-09-01    768.780029
2016-09-02    771.460022
2016-09-06    780.080017
2016-09-07    780.349976
2016-09-08    775.320007
2016-09-09    759.659973
2016-09-12    769.020020
                 ...    
2016-11-17    771.229980
2016-11-18    760.539978
2016-11-21    769.200012
2016-11-22    768.270020
2016-11-23    760.989990
2016-11-25    761.679993
2016-11-28    768.239990
2016-11-29    770.840027
2016-11-30    758.03

In [46]:
type(s4)

pandas.core.series.Series

Concatenating a DataFrame and a Series

In [47]:
pd.concat([df1,s4],axis=0)

Unnamed: 0_level_0,AAPL,IBM,0
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2016-01-04,105.349998,135.949997,
2016-01-05,102.709999,135.850006,
2016-01-06,100.699997,135.169998,
2016-01-07,96.449997,132.860001,
2016-01-08,96.959999,131.630005,
2016-01-11,98.529999,133.229996,
2016-01-12,99.959999,132.899994,
2016-01-13,97.389999,131.169998,
2016-01-14,99.519997,132.910004,
2016-01-15,97.129997,130.029999,


In [48]:
pd.concat([df1,s4],axis=1)

Unnamed: 0_level_0,AAPL,IBM,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2016-01-04,105.349998,135.949997,
2016-01-05,102.709999,135.850006,
2016-01-06,100.699997,135.169998,
2016-01-07,96.449997,132.860001,
2016-01-08,96.959999,131.630005,
2016-01-11,98.529999,133.229996,
2016-01-12,99.959999,132.899994,
2016-01-13,97.389999,131.169998,
2016-01-14,99.519997,132.910004,
2016-01-15,97.129997,130.029999,
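Along axis=1 the Series becomes one more column, named after the Series, and dates missing from either object are filled with NaN. A small sketch with made-up numbers (`prices` and `goog` are hypothetical stand-ins for df1 and s4):

```python
import pandas as pd

dates = pd.date_range('2016-01-04', periods=3, freq='D')
prices = pd.DataFrame({'AAPL': [1.0, 2.0, 3.0], 'IBM': [4.0, 5.0, 6.0]}, index=dates)

# The Series covers only the last two dates; its name becomes the column name.
goog = pd.Series([7.0, 8.0], index=dates[1:], name='GOOG')

combined = pd.concat([prices, goog], axis=1)
print(combined.columns.tolist())
print(int(combined['GOOG'].isna().sum()))
```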
