## Handling Data in Multiple Files

Data is often spread across several files. To work with it as a whole, we need to merge those files into a single DataFrame.
pandas offers several ways to combine data. In this example, we use ``pd.merge`` in two situations: when the key column has the same name in both DataFrames, and when the names differ.

In this example, we will use four datasets:
- emp_dept.csv (columns: "employee", "group")
- emp_date.csv (columns: "employee", "hire_date")
- emp_supervisor.csv (columns: "group", "supervisor")
- emp_salary.csv (columns: "name", "salary")
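If the CSV files are not at hand, the same four tables can be rebuilt in memory. The sketch below is only a stand-in: the variable names are hypothetical, and the values are copied from the cell outputs shown in this section rather than read from the real files.

```python
import pandas as pd

# In-memory stand-ins for the four CSV files.
# Variable names are illustrative; values mirror the outputs in this section.
emp_dept = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
emp_date = pd.DataFrame({
    "employee": ["Lisa", "Bob", "Jake", "Sue"],
    "hire_date": [2004, 2008, 2012, 2014],
})
emp_supervisor = pd.DataFrame({
    "group": ["Accounting", "Engineering", "HR"],
    "supervisor": ["Carly", "Guido", "Steve"],
})
emp_salary = pd.DataFrame({
    "name": ["Bob", "Jake", "Lisa", "Sue"],
    "salary": [70000, 80000, 12000, 90000],
})

# Print the column names of each table to see which keys they share
for label, frame in [
    ("emp_dept", emp_dept),
    ("emp_date", emp_date),
    ("emp_supervisor", emp_supervisor),
    ("emp_salary", emp_salary),
]:
    print(label, list(frame.columns))
```

Listing the columns side by side makes it easy to see that "employee" and "group" can serve as shared keys, while "name" in the salary table has no counterpart with the same name.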

In [2]:
import numpy as np
import pandas as pd

In [3]:
dataEmp = pd.read_csv('data/emp_dept.csv')

In [4]:
dataEmp

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR


In [5]:
dataEmp2 = pd.read_csv('data/emp_date.csv')

In [6]:
dataEmp2

Unnamed: 0,employee,hire_date
0,Lisa,2004
1,Bob,2008
2,Jake,2012
3,Sue,2014


In [9]:
# Simple merge of two DataFrames that share a column name ("employee")
cDataEmp = pd.merge(dataEmp, dataEmp2)  # on="employee" can also be passed explicitly

In [8]:
cDataEmp

Unnamed: 0,employee,group,hire_date
0,Bob,Accounting,2008
1,Jake,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014
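When ``on`` is omitted, ``pd.merge`` uses the column names that both DataFrames have in common as the key. The following sketch, which assumes small in-memory copies of the two tables (the names ``left`` and ``right`` are illustrative), checks that the implicit and explicit forms give the same result here.

```python
import pandas as pd

# Illustrative in-memory copies of emp_dept and emp_date
left = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
right = pd.DataFrame({
    "employee": ["Lisa", "Bob", "Jake", "Sue"],
    "hire_date": [2004, 2008, 2012, 2014],
})

# Column names present in both frames; these become the default merge key
common = left.columns.intersection(right.columns).tolist()

implicit = pd.merge(left, right)                 # key inferred from common columns
explicit = pd.merge(left, right, on="employee")  # key named explicitly

print(common)
print(implicit.equals(explicit))
```

Note that the rows of ``right`` are in a different order than those of ``left``; the merge matches on the key values, not on row position.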


In [8]:
dataEmp3 = pd.read_csv('data/emp_supervisor.csv')

In [9]:
dataEmp3

Unnamed: 0,group,supervisor
0,Accounting,Carly
1,Engineering,Guido
2,HR,Steve


In [10]:
# Simple merge of two DataFrames on their shared column name ("group")
cDataEmp1 = pd.merge(cDataEmp, dataEmp3)

In [11]:
cDataEmp1

Unnamed: 0,employee,group,hire_date,supervisor
0,Bob,Accounting,2008,Carly
1,Jake,Engineering,2012,Guido
2,Lisa,Engineering,2004,Guido
3,Sue,HR,2014,Steve


In [12]:
# Naming the key column explicitly makes the merge unambiguous
cDataEmp2 = pd.merge(cDataEmp, dataEmp3, on="group")  # on= names the key column used as the basis of the merge

In [13]:
cDataEmp2

Unnamed: 0,employee,group,hire_date,supervisor
0,Bob,Accounting,2008,Carly
1,Jake,Engineering,2012,Guido
2,Lisa,Engineering,2004,Guido
3,Sue,HR,2014,Steve
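The merge on "group" is many-to-one: several employees can belong to one group, but each group has a single supervisor row. As an optional safeguard, ``pd.merge`` accepts a ``validate`` argument that checks this assumption. The sketch below uses illustrative in-memory data, and the extra supervisor "Dana" is a made-up value added only to trigger the check.

```python
import pandas as pd

# Illustrative in-memory copies of the merged employee table and the supervisor table
employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
    "hire_date": [2008, 2012, 2004, 2014],
})
supervisors = pd.DataFrame({
    "group": ["Accounting", "Engineering", "HR"],
    "supervisor": ["Carly", "Guido", "Steve"],
})

# validate="many_to_one" requires the key to be unique in the right DataFrame
merged = pd.merge(employees, supervisors, on="group", validate="many_to_one")
print(merged)

# Hypothetical bad input: "HR" now appears twice on the right side
bad_supervisors = pd.concat(
    [supervisors, pd.DataFrame({"group": ["HR"], "supervisor": ["Dana"]})],
    ignore_index=True,
)
try:
    pd.merge(employees, bad_supervisors, on="group", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    # Raised because the right-hand key is no longer unique
    duplicate_detected = True
print(duplicate_detected)
```

Without ``validate``, the second merge would succeed silently and return two rows for Sue, which is easy to overlook in larger data.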


In [14]:
cDataEmp3 = pd.read_csv('data/emp_salary.csv')

In [15]:
cDataEmp3

Unnamed: 0,name,salary
0,Bob,70000
1,Jake,80000
2,Lisa,12000
3,Sue,90000


The salary data uses the column "name" where the other DataFrames use "employee", so the two have no column name in common and pandas cannot infer a key on its own. In this case, ``left_on`` and ``right_on`` specify the key column of the left and right DataFrame, respectively.

In [28]:
cDataEmp4 = pd.merge(cDataEmp2, cDataEmp3, left_on="employee", right_on="name")

In [29]:
cDataEmp4

Unnamed: 0,employee,group,hire_date,supervisor,name,salary
0,Bob,Accounting,2008,Carly,Bob,70000
1,Jake,Engineering,2012,Guido,Jake,80000
2,Lisa,Engineering,2004,Guido,Lisa,12000
3,Sue,HR,2014,Steve,Sue,90000


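With ``left_on`` and ``right_on``, both key columns are kept, so the result contains an extra "name" column that duplicates "employee". Since our intention is to treat "employee" and "name" as the same key, we can remove the redundant column with ``drop``, passing ``axis=1`` to indicate that a column (not a row) should be dropped.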

In [24]:
cDataEmp5 = pd.merge(cDataEmp2, cDataEmp3, left_on="employee", right_on="name").drop('name', axis=1)
# axis=1 tells drop to remove a column rather than a row

In [25]:
cDataEmp5

Unnamed: 0,employee,group,hire_date,supervisor,salary
0,Bob,Accounting,2008,Carly,70000
1,Jake,Engineering,2012,Guido,80000
2,Lisa,Engineering,2004,Guido,12000
3,Sue,HR,2014,Steve,90000
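An alternative to dropping the redundant column afterwards is to rename the key before merging, so that both DataFrames share the same column name. This is only a sketch of that variant, using illustrative in-memory data and hypothetical variable names (``staff``, ``salary``), not the approach taken in the cells above.

```python
import pandas as pd

# Illustrative in-memory data; "name" plays the role of "employee" in the salary table
staff = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
salary = pd.DataFrame({
    "name": ["Bob", "Jake", "Lisa", "Sue"],
    "salary": [70000, 80000, 12000, 90000],
})

# Variant 1: rename the key first, then merge on the shared name
renamed = pd.merge(staff, salary.rename(columns={"name": "employee"}), on="employee")

# Variant 2: merge on differently named keys, then drop the duplicate column
dropped = pd.merge(staff, salary, left_on="employee", right_on="name").drop("name", axis=1)

print(renamed.columns.tolist())
print(renamed.equals(dropped))
```

Both variants give the same table here; renaming first avoids ever creating the duplicate column, while ``drop`` keeps the original salary DataFrame's naming visible in the merge call.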


## Learning Check

1. How do you combine two datasets?
2. How do you combine two datasets whose key columns have different names?
3. How do you drop (delete) a column that becomes redundant after a merge?

In [23]:
# 1. Use the merge function (pd.merge)
# 2. Pass left_on and right_on
# 3. Call drop with the column name and axis=1

In [1]:
import numpy as np
np.zeros(5, dtype=int)

array([0, 0, 0, 0, 0])