# Combining Data 
This notebook demonstrates basic ways to combine and reshape DataFrames in pandas, with examples of the following methods. You can find additional tutorials on these methods [here](https://www.datacamp.com/community/tutorials/joining-dataframes-pandas).
1. concat
2. merge
3. melt  

In this notebook we use four small data sets. They contain PISA 2012, 2015 and 2018 indicators (mean scores) for some countries in the mathematics and reading domains; Mat_3.csv also includes 2009. You can access additional information and the full data sets [here](https://data.oecd.org/pisa/mathematics-performance-pisa.htm#indicator-chart).

The file names are:  
Mat_1.csv   
Mat_2.csv  
Mat_3.csv  
Reading.csv


In [1]:
#import pandas
import pandas as pd

In [2]:
#read the data sets and take a look at them
Mat_1=pd.read_csv('Mat_1.csv')
Mat_1

Unnamed: 0,Country,2012,2015,2018
0,Finland,519,511,507
1,United States,481,470,478
2,Turkey,448,420,454
3,OECD_Average,494,490,489


In [3]:
Mat_2=pd.read_csv('Mat_2.csv')
Mat_2

Unnamed: 0,Country,2012,2015,2018
0,Canada,518,516,512
1,Mexico,413,408,409
2,Japan,536,532,527


In [13]:
Mat_3=pd.read_csv('Mat_3.csv')
Mat_3

Unnamed: 0,Country,2009,2012,2015,2018
0,Finland,542,519,511,507
1,United States,497,481,470,478
2,Turkey,451,448,420,454
3,OECD_Average,501,494,490,489
4,Denmark,511,507,516,511


In [22]:
Reading=pd.read_csv('Reading.csv')
Reading

Unnamed: 0,Country,2012,2015,2018
0,Finland,524,526,520
1,United States,498,497,505
2,Turkey,509,492,484
3,OECD_Average,496,493,487
4,Canada,523,527,520
5,Mexico,424,423,420
6,Japan,538,516,504


## concat() method
concat() concatenates DataFrames along the rows by default (axis=0) and along the columns when axis=1 is passed.  


In [5]:
Mathematics=pd.concat([Mat_1,Mat_2])
Mathematics

Unnamed: 0,Country,2012,2015,2018
0,Finland,519,511,507
1,United States,481,470,478
2,Turkey,448,420,454
3,OECD_Average,494,490,489
0,Canada,518,516,512
1,Mexico,413,408,409
2,Japan,536,532,527


In [23]:
#the index values above are repeated; ignore_index=True renumbers them from 0
Mathematics=pd.concat([Mat_2,Mat_1],ignore_index=True)
Mathematics

Unnamed: 0,Country,2012,2015,2018
0,Canada,518,516,512
1,Mexico,413,408,409
2,Japan,536,532,527
3,Finland,519,511,507
4,United States,481,470,478
5,Turkey,448,420,454
6,OECD_Average,494,490,489
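
The cells above only concatenate along the rows. As a minimal sketch of the axis=1 case, the example below places two small frames side by side. The frames `left` and `right` are made up for illustration (the values are copied from the Finland and Turkey rows of Mat_1) and are not part of the original notebook.

```python
import pandas as pd

# Illustrative frames; both use the default row index 0..1
left = pd.DataFrame({"Country": ["Finland", "Turkey"], "2012": [519, 448]})
right = pd.DataFrame({"2015": [511, 420], "2018": [507, 454]})

# axis=1 puts the frames next to each other, aligning rows on the index
wide = pd.concat([left, right], axis=1)
print(wide)
```

Note that axis=1 aligns rows on the index labels, not on a key column such as Country, so it is only safe when both frames list the rows in the same order. When they do not, merge() is the better tool.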


## merge() Method
The merge() method joins two DataFrames on one or more common key columns (or on their indexes). It supports left, right, inner and outer joins. Here are a couple of examples.

In [8]:
#Merge the mathematics and reading dfs based on Country
df_1=pd.merge(Mathematics,Reading,on='Country')
df_1

Unnamed: 0,Country,2012_x,2015_x,2018_x,2012_y,2015_y,2018_y
0,Canada,518,516,512,523,527,520
1,Mexico,413,408,409,424,423,420
2,Japan,536,532,527,538,516,504
3,Finland,519,511,507,524,526,520
4,United States,481,470,478,498,497,505
5,Turkey,448,420,454,509,492,484
6,OECD_Average,494,490,489,496,493,487


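merge() matched the rows of the two dfs on Country and placed their columns side by side. Every country appears in both dfs, so no rows were dropped.  
As the year column names are identical in both dfs, pandas added the default suffixes `_x` and `_y` to them. We can define the suffixes ourselves.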

In [9]:
df_1=pd.merge(Mathematics,Reading,on='Country',suffixes=('_mat','_reading'))
df_1

Unnamed: 0,Country,2012_mat,2015_mat,2018_mat,2012_reading,2015_reading,2018_reading
0,Canada,518,516,512,523,527,520
1,Mexico,413,408,409,424,423,420
2,Japan,536,532,527,538,516,504
3,Finland,519,511,507,524,526,520
4,United States,481,470,478,498,497,505
5,Turkey,448,420,454,509,492,484
6,OECD_Average,494,490,489,496,493,487
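
Mathematics and Reading list the countries in a different order, yet the scores above line up correctly. The sketch below illustrates why: merge() matches rows by the value of the key column, not by position. The two small frames are illustrative, with 2012 values copied from the tables above.

```python
import pandas as pd

# Same countries, listed in a different order in each frame
mat = pd.DataFrame({"Country": ["Canada", "Finland"], "2012": [518, 519]})
reading = pd.DataFrame({"Country": ["Finland", "Canada"], "2012": [524, 523]})

# Rows are paired by the value in Country, regardless of row position
merged = pd.merge(mat, reading, on="Country", suffixes=("_mat", "_reading"))
by_country = merged.set_index("Country")
print(by_country)
```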


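When the two dfs do not contain the same set of keys, we need to specify how we want to merge them with the how parameter. The four main options are:

* __outer:__ keeps every row from both dfs and places NaN where a row has no match in the other df
* __inner:__ keeps only the rows whose key appears in both dfs (this is the default)
* __left:__ keeps all rows of the left df and places NaN where the right df has no match
* __right:__ keeps all rows of the right df and places NaN where the left df has no match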


In [17]:
#let's merge the Reading and Mat_3 dfs
df_outer=pd.merge(Reading, Mat_3, on='Country', how='outer',suffixes=('_reading','_mat'))

#if your id columns have different names, you can use the left_on and right_on parameters instead of on

df_outer

Unnamed: 0,Country,2012_reading,2015_reading,2018_reading,2009,2012_mat,2015_mat,2018_mat
0,Finland,524.0,526.0,520.0,542.0,519.0,511.0,507.0
1,United States,498.0,497.0,505.0,497.0,481.0,470.0,478.0
2,Turkey,509.0,492.0,484.0,451.0,448.0,420.0,454.0
3,OECD_Average,496.0,493.0,487.0,501.0,494.0,490.0,489.0
4,Canada,523.0,527.0,520.0,,,,
5,Mexico,424.0,423.0,420.0,,,,
6,Japan,538.0,516.0,504.0,,,,
7,Denmark,,,,511.0,507.0,516.0,511.0
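
The comment in the cell above mentions left_on and right_on without showing them. A minimal sketch follows, assuming a hypothetical version of the mathematics table whose key column is called `Nation` instead of `Country`; that column name and the two small frames are invented for illustration.

```python
import pandas as pd

reading = pd.DataFrame({"Country": ["Finland", "Canada"], "2018": [520, 520]})
# Hypothetical: the key column has a different name in this frame
mat = pd.DataFrame({"Nation": ["Finland", "Denmark"], "2018": [507, 511]})

# left_on names the key in the left df, right_on the key in the right df
merged = pd.merge(reading, mat, left_on="Country", right_on="Nation",
                  how="outer", suffixes=("_reading", "_mat"))
print(merged)
```

Both key columns are kept in the result, so after the merge you may want to drop or combine one of them.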


In [20]:
df_inner=pd.merge(Reading, Mat_3, on='Country', how='inner',suffixes=('_reading','_mat'))
df_inner

Unnamed: 0,Country,2012_reading,2015_reading,2018_reading,2009,2012_mat,2015_mat,2018_mat
0,Finland,524,526,520,542,519,511,507
1,United States,498,497,505,497,481,470,478
2,Turkey,509,492,484,451,448,420,454
3,OECD_Average,496,493,487,501,494,490,489


In [18]:
df_left=pd.merge(Reading, Mat_3, on='Country', how='left',suffixes=('_reading','_mat'))
df_left

Unnamed: 0,Country,2012_reading,2015_reading,2018_reading,2009,2012_mat,2015_mat,2018_mat
0,Finland,524,526,520,542.0,519.0,511.0,507.0
1,United States,498,497,505,497.0,481.0,470.0,478.0
2,Turkey,509,492,484,451.0,448.0,420.0,454.0
3,OECD_Average,496,493,487,501.0,494.0,490.0,489.0
4,Canada,523,527,520,,,,
5,Mexico,424,423,420,,,,
6,Japan,538,516,504,,,,


In [19]:
df_right=pd.merge(Reading, Mat_3, on='Country', how='right',suffixes=('_reading','_mat'))
df_right

Unnamed: 0,Country,2012_reading,2015_reading,2018_reading,2009,2012_mat,2015_mat,2018_mat
0,Finland,524.0,526.0,520.0,542,519,511,507
1,United States,498.0,497.0,505.0,497,481,470,478
2,Turkey,509.0,492.0,484.0,451,448,420,454
3,OECD_Average,496.0,493.0,487.0,501,494,490,489
4,Denmark,,,,511,507,516,511
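
To check which df each row of a merge came from, pandas offers the indicator parameter of merge(). The sketch below uses two small illustrative frames (2018 values copied from the tables above) rather than the full files.

```python
import pandas as pd

reading = pd.DataFrame({"Country": ["Finland", "Canada"], "2018": [520, 520]})
mat = pd.DataFrame({"Country": ["Finland", "Denmark"], "2018": [507, 511]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
check = pd.merge(reading, mat, on="Country", how="outer",
                 suffixes=("_reading", "_mat"), indicator=True)

# Map each country to the side(s) it was found in
origin = dict(zip(check["Country"], check["_merge"].astype(str)))
print(origin)
```

Rows marked `left_only` or `right_only` are the ones that an inner join would drop.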


## melt() method
The melt() method is a useful way to reshape data from wide to long format. We identify one or more columns as identifiers (id_vars), and the remaining columns are treated as measured variables. An example is provided here. You can find additional information [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html)  
Let's transform the Reading and Mathematics dfs into the following form:  

| Country   |      Year      |  Score |  
|----------|:-------------:|------:|
| Finland |  2012 | 524 |


In [25]:
#Reading scores
Reading_melt=pd.melt(Reading, id_vars='Country',var_name='Year',value_name='Reading_Score')
Reading_melt.head()

Unnamed: 0,Country,Year,Reading_Score
0,Finland,2012,524
1,United States,2012,498
2,Turkey,2012,509
3,OECD_Average,2012,496
4,Canada,2012,523
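
By default melt() unpivots every column that is not listed in id_vars. As a minimal sketch, the example below restricts the melt to selected years with value_vars and converts Year from text to integers, since the year labels come from column names and are therefore strings. The small frame is illustrative, with values copied from the Reading table.

```python
import pandas as pd

reading = pd.DataFrame({
    "Country": ["Finland", "Turkey"],
    "2012": [524, 509],
    "2015": [526, 492],
    "2018": [520, 484],
})

# value_vars limits which columns are unpivoted; 2012 is left out here
long_df = pd.melt(reading, id_vars="Country", value_vars=["2015", "2018"],
                  var_name="Year", value_name="Reading_Score")

# Column names are strings, so convert Year to integers for numeric sorting
long_df["Year"] = long_df["Year"].astype(int)
print(long_df)
```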


In [26]:
#Mathematics scores (melt the Mathematics df, not Reading)
Mathematics_melt=pd.melt(Mathematics, id_vars='Country',var_name='Year',value_name='Mathematics_Score')
Mathematics_melt.head()

Unnamed: 0,Country,Year,Mathematics_Score
0,Canada,2012,518
1,Mexico,2012,413
2,Japan,2012,536
3,Finland,2012,519
4,United States,2012,481


In [30]:
#Let's merge these two dfs based on country and year
df_final=pd.merge(Reading_melt, Mathematics_melt, on=['Country','Year'], how='outer')

df_final.sort_values(['Country', 'Year']).head()

Unnamed: 0,Country,Year,Reading_Score,Mathematics_Score
4,Canada,2012,523,518
11,Canada,2015,527,516
18,Canada,2018,520,512
0,Finland,2012,524,519
7,Finland,2015,526,511
