# How to merge two datasets
This notebook is a proof of concept (PoC) for merging two datasets on a shared column with `pd.merge`. It is based on [Jezrael (2016)](https://stackoverflow.com/questions/37697195/how-to-merge-two-data-frames-based-on-particular-column-in-pandas-python).

The second part of the notebook shows how to flatten multiple pandas DataFrames, that is, how to average over many DataFrames to get a single one, as answered by [ali_m (2014)](https://stackoverflow.com/questions/25057835/get-the-mean-across-multiple-pandas-dataframes).

In [4]:
import pandas as pd
import numpy as np

In [2]:
# Mixing data types in the key column is OK.
# Keys that are not present in both frames are silently
# dropped by default (how='inner'); all keys are kept
# when passing how='outer' to pd.merge.
d1 = {'id': ["A", "B", 3, 4, "D"], 'value1': [5, 6, 7, 8, 1]}
df1 = pd.DataFrame(data=d1)

# OK to change the order of the values in id
d2 = {'id': ["A", 3, "B", 4], 'value2': [9, 10, 11, "C"]}
df2 = pd.DataFrame(data=d2)

In [3]:
df1

,id,value1
0,A,5
1,B,6
2,3,7
3,4,8
4,D,1


In [4]:
df2

,id,value2
0,A,9
1,3,10
2,B,11
3,4,C
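
One caveat when mixing types in the key column: keys are matched by equality, so the integer `3` and the string `'3'` count as different keys. The sketch below illustrates this with two small hypothetical frames (`left` and `right` are illustrative names, not part of the notebook). It assumes that casting every key to a string is acceptable for the data at hand, which may not hold in general.

```python
import pandas as pd

# Hypothetical frames: the same logical key is stored as int 3 and str '3'
left = pd.DataFrame({'id': ['A', 3], 'value1': [5, 7]})
right = pd.DataFrame({'id': ['A', '3'], 'value2': [9, 10]})

# With the raw keys, only 'A' is found in both frames
merged_raw = pd.merge(left, right, on='id')

# Assumption: normalising the keys to strings is fine for this data
left_str = left.assign(id=left['id'].astype(str))
right_str = right.assign(id=right['id'].astype(str))
merged_str = pd.merge(left_str, right_str, on='id')

print(len(merged_raw), len(merged_str))
```

After normalising, both rows find a partner, whereas the raw merge keeps only the `'A'` row.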


In [5]:
df3 = pd.merge(df1, df2, on='id', how='outer')

In [6]:
df3

,id,value1,value2
0,A,5,9
1,B,6,11
2,3,7,10
3,4,8,C
4,D,1,NaN
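
To see which rows an inner merge would drop, one option is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column saying whether each key came from the left frame, the right frame, or both. This is a minimal sketch with hypothetical string-only frames (`left`, `right`) rather than the notebook's mixed-type data.

```python
import pandas as pd

# Hypothetical frames: 'D' exists only on the left, 'E' only on the right
left = pd.DataFrame({'id': ['A', 'B', 'D'], 'value1': [5, 6, 1]})
right = pd.DataFrame({'id': ['A', 'B', 'E'], 'value2': [9, 11, 12]})

# Default how='inner': keep only keys present in both frames
inner = pd.merge(left, right, on='id')

# how='outer' keeps every key; indicator=True records where each row came from
outer = pd.merge(left, right, on='id', how='outer', indicator=True)

# Rows that the inner merge silently dropped
unmatched = outer[outer['_merge'] != 'both']
print(unmatched[['id', '_merge']])
```

Inspecting `unmatched` before settling on `how='inner'` makes the otherwise silent drop visible.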


## Part 2: Flattening multiple DataFrames

In [26]:
# Two random DataFrames that share the same id column
df_len = 5
ids = np.arange(1, df_len + 1)
df1 = pd.DataFrame(dict(id=ids, x=np.random.randn(df_len), y=np.random.randint(0, 5, df_len)))
df2 = pd.DataFrame(dict(id=ids, x=np.random.randn(df_len), y=np.random.randint(0, 5, df_len)))

In [27]:
df1

,id,x,y
0,1,0.643401,4
1,2,0.598743,3
2,3,-0.843181,2
3,4,1.203864,2
4,5,-0.635627,0


In [28]:
df2

,id,x,y
0,1,-1.30207,0
1,2,1.656506,2
2,3,-0.316538,4
3,4,1.661356,3
4,5,-1.268055,4


In [29]:
# Stack the frames row-wise; each frame keeps its own row index, so index values repeat
df3 = pd.concat((df1, df2))

In [30]:
df3

,id,x,y
0,1,0.643401,4
1,2,0.598743,3
2,3,-0.843181,2
3,4,1.203864,2
4,5,-0.635627,0
0,1,-1.30207,0
1,2,1.656506,2
2,3,-0.316538,4
3,4,1.661356,3
4,5,-1.268055,4


In [36]:
# Group rows that share an id, then aggregate each group
by_id = df3.groupby('id')
df_means = by_id.mean()
df_stds = by_id.std()  # sample standard deviation (ddof=1)
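
The concatenate-then-group steps generalise to any number of frames. The sketch below wraps them in a hypothetical helper, `flatten_frames` (the name and the `key` parameter are assumptions for illustration), and assumes every frame has the same key column and numeric value columns. It uses a seeded generator so the result is reproducible, unlike the unseeded random data in the notebook.

```python
import numpy as np
import pandas as pd

def flatten_frames(frames, key='id'):
    """Hypothetical helper: per-key mean and sample std across several frames."""
    stacked = pd.concat(frames, ignore_index=True)
    # agg with a list yields MultiIndex columns such as ('x', 'mean') and ('x', 'std')
    return stacked.groupby(key).agg(['mean', 'std'])

rng = np.random.default_rng(0)
ids = np.arange(1, 6)
frames = [
    pd.DataFrame({'id': ids,
                  'x': rng.standard_normal(5),
                  'y': rng.integers(0, 5, 5)})
    for _ in range(3)
]

summary = flatten_frames(frames)
print(summary)
```

Returning the mean and standard deviation in one table keeps the two statistics aligned on the same `id` index.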

In [33]:
df_means

id,x,y
1,-0.329334,2.0
2,1.127625,2.5
3,-0.57986,3.0
4,1.43261,2.5
5,-0.951841,2.0


In [35]:
# Sanity check: mean of x for id 5, computed by hand
(-0.635627 + (-1.268055))/2

-0.9518409999999999

In [37]:
df_stds

id,x,y
1,1.375656,2.828427
2,0.747952,0.707107
3,0.372393,1.414214
4,0.323496,0.707107
5,0.447194,2.828427
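
The standard deviations can be checked by hand in the same way as the mean. The sketch below reuses the two `x` values shown for `id` 5 and compares pandas' default (sample standard deviation, `ddof=1`) with NumPy's default (population standard deviation, `ddof=0`). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The two x values for id 5, copied from the df1 and df2 outputs
x = pd.Series([-0.635627, -1.268055])

pandas_std = x.std()                           # pandas default: ddof=1
numpy_sample = np.std(x.to_numpy(), ddof=1)    # same definition in NumPy
numpy_population = np.std(x.to_numpy())        # NumPy default: ddof=0

print(pandas_std, numpy_sample, numpy_population)
```

With only two values per group, the two definitions differ by a factor of the square root of 2, so it matters which one is reported.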
