## Joining DataFrames in Pandas

Joining and merging DataFrames is often one of the first steps in a data analysis or machine learning task.

- When the data for a model comes from different sources, we may need to combine several CSV files into a single DataFrame.

- To combine DataFrames, pandas provides several functions, including:

    - concat()
    - merge()
    - join()
    
- For a horizontal combination we have merge() and join(), whereas for a vertical combination we can use concat(). The older DataFrame.append() did the same job but was deprecated in pandas 1.4 and removed in pandas 2.0.

- merge() and join() perform similar tasks but differ in their defaults, much as concat() and append() did.
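As a minimal sketch of the multi-source case described above, the snippet below reads two CSV sources and stacks them into one DataFrame. The `io.StringIO` objects and their contents are hypothetical stand-ins for real files on disk; the names `csv_a`, `csv_b`, and `combined` are assumptions made for illustration.

```python
import io
import pandas as pd

# In-memory stand-ins for two CSV files (hypothetical contents)
csv_a = io.StringIO("id,Feature1\n1,A\n2,C\n")
csv_b = io.StringIO("id,Feature1\n6,O\n7,Q\n")

# Read each source, keeping 'id' as text, then stack the rows into one DataFrame
frames = [pd.read_csv(src, dtype={"id": str}) for src in (csv_a, csv_b)]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

With real files, the `StringIO` objects would be replaced by file paths passed to `pd.read_csv`.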

In [58]:
import pandas as pd
import numpy as np

In [60]:
# Create three example DataFrames
dummy_data1 = {
        'id': ['1', '2', '3', '4', '5', '20'],
        'Feature1': ['A', 'C', 'E', 'G', 'I', 'X'],
        'Feature2': ['B', 'D', 'F', 'H', 'J', 'Z'],
        'Feature4': ['R', 'S', 't', 'u', 'v', 'w'],}

dummy_data2 = {
        'id': ['1', '2', '6', '7', '8'],
        'Feature1': ['K', 'M', 'O', 'Q', 'S'],
        'Feature2': ['L', 'N', 'P', 'R', 'T']}
dummy_data3 = {
        'id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'Feature3': [12, 13, 14, 15, 16, 17, 15, 12, 13, 23]}

df1 = pd.DataFrame(dummy_data1, columns = ['id', 'Feature1', 'Feature2', 'Feature4'])
df2 = pd.DataFrame(dummy_data2, columns = ['id', 'Feature1', 'Feature2'])
df3 = pd.DataFrame(dummy_data3, columns = ['id', 'Feature3'])

In [61]:
df1

Unnamed: 0,id,Feature1,Feature2,Feature4
0,1,A,B,R
1,2,C,D,S
2,3,E,F,t
3,4,G,H,u
4,5,I,J,v
5,20,X,Z,w


In [62]:
df2

Unnamed: 0,id,Feature1,Feature2
0,1,K,L
1,2,M,N
2,6,O,P
3,7,Q,R
4,8,S,T


In [63]:
df3

Unnamed: 0,id,Feature3
0,1,12
1,2,13
2,3,14
3,4,15
4,5,16
5,7,17
6,8,15
7,9,12
8,10,13
9,11,23


### 1. Concatenate DataFrames using concat()

In [66]:
# axis=1 places df1 and df2 side by side; with ignore_index=True the column labels become 0..6
df_row = pd.concat([df1, df2], axis=1, ignore_index=True)

In [67]:
df_row

Unnamed: 0,0,1,2,3,4,5,6
0,1,A,B,R,1.0,K,L
1,2,C,D,S,2.0,M,N
2,3,E,F,t,6.0,O,P
3,4,G,H,u,7.0,Q,R
4,5,I,J,v,8.0,S,T
5,20,X,Z,w,,,
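The call above combines the frames column-wise. For contrast, here is a small sketch of both directions on two hypothetical frames (`top` and `bottom` are illustrative names, not part of the notebook): `axis=0`, the default, stacks rows, while `axis=1` aligns rows by index and places the columns next to each other.

```python
import pandas as pd

# Two small hypothetical frames with the same columns
top = pd.DataFrame({"id": ["1", "2"], "Feature1": ["A", "C"]})
bottom = pd.DataFrame({"id": ["6", "7"], "Feature1": ["O", "Q"]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```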


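### 2. Merge DataFrames using merge()

merge() combines two DataFrames on common columns or on their indices.

join() vs. merge(): join() is a DataFrame method that aligns on the index by default, whereas merge() matches on shared columns by default.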
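To make the comparison concrete, the following sketch performs the same inner join both ways. The frames `left` and `right` are hypothetical and only loosely modelled on the example data; the point is that join() needs the key in the index, while merge() can use a column directly.

```python
import pandas as pd

left = pd.DataFrame({"id": ["1", "2", "3"], "Feature1": ["A", "C", "E"]})
right = pd.DataFrame({"id": ["1", "2", "7"], "Feature3": [12, 13, 17]})

# merge(): matches on the shared column 'id' (inner join by default)
merged = pd.merge(left, right, on="id")

# join(): matches on the index, so move 'id' into the index first;
# how='inner' is passed explicitly because join() defaults to a left join
joined = (
    left.set_index("id")
    .join(right.set_index("id"), how="inner")
    .reset_index()
)

print(merged)
print(joined)
```

Both results should contain the rows for ids 1 and 2 only, since those are the ids present in both frames.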

In [37]:
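# Stack the shared columns of df1 and df2 row-wise; the side-by-side df_row above has columns 0..6 and nothing in common with df3
df_row = pd.concat([df1[['id', 'Feature1', 'Feature2']], df2], ignore_index=True)
# merge() defaults to how='inner': only rows whose 'id' appears in both frames are kept
df_merge_col = pd.merge(df_row, df3)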

In [38]:
df_merge_col

Unnamed: 0,id,Feature1,Feature2,Feature3
0,1,A,B,12
1,1,K,L,12
2,2,C,D,13
3,2,M,N,13
4,3,E,F,14
5,4,G,H,15
6,5,I,J,16
7,7,Q,R,17
8,8,S,T,15


In [39]:
# how='left' keeps every row of df_row; ids missing from df3 get NaN in Feature3 (how='right' keeps every row of df3 instead)
df_left_merge_col = pd.merge(df_row, df3, how='left')

In [41]:
df_left_merge_col

Unnamed: 0,id,Feature1,Feature2,Feature3
0,1,A,B,12.0
1,2,C,D,13.0
2,3,E,F,14.0
3,4,G,H,15.0
4,5,I,J,16.0
5,20,X,Z,
6,1,K,L,12.0
7,2,M,N,13.0
8,6,O,P,
9,7,Q,R,17.0
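The `how` argument also accepts 'right' and 'outer'. The sketch below compares the row counts of all four join types on two small hypothetical frames and uses `indicator=True` to show which side each row of an outer join came from. The frame names and values are assumptions chosen for illustration.

```python
import pandas as pd

left = pd.DataFrame({"id": ["1", "2", "20"], "Feature1": ["A", "C", "X"]})
right = pd.DataFrame({"id": ["1", "2", "9"], "Feature3": [12, 13, 12]})

# Compare row counts for each join type
counts = {}
for how in ["inner", "left", "right", "outer"]:
    counts[how] = len(pd.merge(left, right, on="id", how=how))
print(counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True adds a '_merge' column recording where each row came from
outer = pd.merge(left, right, on="id", how="outer", indicator=True)
print(outer[["id", "_merge"]])
```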


In [43]:
# Name the key column explicitly with on=; it must exist in both DataFrames
df_merge_col_on = pd.merge(df_row, df3, on='id')

In [44]:
df_merge_col_on

Unnamed: 0,id,Feature1,Feature2,Feature3
0,1,A,B,12
1,1,K,L,12
2,2,C,D,13
3,2,M,N,13
4,3,E,F,14
5,4,G,H,15
6,5,I,J,16
7,7,Q,R,17
8,8,S,T,15
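Passing `on=` matters most when the two frames share more than the key column. In the hypothetical sketch below, `a` and `b` both have `id` and `Feature1`; without `on=`, merge() matches on every shared column, and with it the clashing columns are kept and renamed through `suffixes`. The suffix strings are arbitrary choices.

```python
import pandas as pd

a = pd.DataFrame({"id": ["1", "2"], "Feature1": ["A", "C"]})
b = pd.DataFrame({"id": ["1", "2"], "Feature1": ["K", "M"]})

# Without on=, merge() matches on both 'id' and 'Feature1', so no rows survive
both_keys = pd.merge(a, b)
print(len(both_keys))  # 0

# With on='id', the clashing 'Feature1' columns are kept and renamed via suffixes
by_id = pd.merge(a, b, on="id", suffixes=("_df1", "_df2"))
print(by_id.columns.tolist())  # ['id', 'Feature1_df1', 'Feature1_df2']
```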


## Aggregating with aggregate() / agg() in pandas
The aggregate() method (alias agg()) applies a function, or a list of functions, along one axis of the DataFrame. The default, axis=0, applies each function down the rows of every column.

`DataFrame.aggregate(func=None, axis=0, *args, **kwargs)`

DataFrame.aggregate() aggregates one or more columns. The func argument can be a callable, a string function name, a list of either, or a dict mapping column labels to any of these. Frequently used aggregations include 'sum', 'min', and 'max'.

In [45]:
df = pd.read_csv("nba.csv")

In [46]:
df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


For example, suppose we want the sum and minimum of every numeric column.

In [49]:
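# Restrict to numeric columns first; newer pandas versions may raise on text columns instead of dropping them
df.select_dtypes(include='number').aggregate(['sum', 'min'])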


Unnamed: 0,Number,Age,Weight,Salary
sum,8079.0,12311.0,101236.0,2159837000.0
min,0.0,19.0,161.0,30888.0
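A dict or a callable can be passed in the same way. The sketch below is run on a hypothetical three-row frame, `players`, whose values are copied from the first three rows of the `df.head()` output above; it is a stand-in so the example runs without nba.csv.

```python
import pandas as pd

# Hypothetical stand-in built from the first three rows shown by df.head()
players = pd.DataFrame({
    "Age": [25.0, 25.0, 27.0],
    "Weight": [180.0, 235.0, 205.0],
    "Salary": [7730337.0, 6796117.0, None],
})

# A dict maps each column to its own aggregation; NaN values are skipped
summary = players.agg({"Age": "max", "Weight": "mean", "Salary": "sum"})
print(summary)

# A callable works too: here, the spread between oldest and youngest player
age_range = players["Age"].agg(lambda s: s.max() - s.min())
print(age_range)  # 2.0
```

The dict form returns a Series indexed by column name when each column has a single aggregation.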
