<!--NAVIGATION-->
< [Intro and Load Data](01-intro-and-load-data.ipynb) | [Contents](Index.ipynb) | [Data Cleaning and Handling Missing Values](03-data-cleaning-and-handling-missing-values.ipynb) >

## Join, Merge and Concatenate

pandas provides several ways to combine Series and DataFrame objects: set logic on the indexes when concatenating, and relational-algebra functionality for join / merge-type operations.

To test these functions, let's load two dataframes: the airline data exported from the data warehouse (DB2) and an extension table. Both are read here from CSV files.

In [None]:
import pandas as pd

df = pd.read_csv('../Data/airlineDT.csv', sep=',')
df.head(5)

In [None]:
df_csv = pd.read_csv('D_Data_Extension.csv', sep=';') 

### Merge

As in SQL, pandas can join, merge or concatenate dataframes. Since we have two dataframes, one from DB2 and one from a CSV file, we can merge them. A merge is defined by HOW the rows are matched (`outer`, `inner`, `left`, `right`) and ON which column(s).

<img src="../fig/SQL_pandas.png" width="500">

Now we want to use the extension file to enrich the main dataframe. Rows are matched on ORIGIN, and the goal is to add the FULL_ORIGIN column.

In [None]:
df_test = pd.merge(df, df_csv, how="outer", on=["ORIGIN"])
df_test.head(5)

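Notice that the FULL_ORIGIN column has been merged into the DB2 dataframe. Because `how="outer"` is used, ORIGIN values that appear in only one of the two dataframes are kept as well, with missing values in the columns coming from the other dataframe.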
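
To make the effect of the `how` argument more concrete, here is a minimal sketch on two tiny dataframes. The frames `flights` and `airports` and their values are made up for illustration; they only mimic the ORIGIN / FULL_ORIGIN layout described above and are not the tutorial data.

```python
import pandas as pd

# Hypothetical stand-ins for the airline data and the extension table.
flights = pd.DataFrame({"ORIGIN": ["JFK", "LAX", "SFO"], "DELAY": [5, 12, 0]})
airports = pd.DataFrame(
    {
        "ORIGIN": ["JFK", "LAX", "ORD"],
        "FULL_ORIGIN": ["New York JFK", "Los Angeles", "Chicago O'Hare"],
    }
)

# Keep only the key and the column to add, so nothing else is merged in.
lookup = airports[["ORIGIN", "FULL_ORIGIN"]]

# inner: only ORIGIN values present in both frames
merged_inner = pd.merge(flights, lookup, how="inner", on="ORIGIN")
# left: every row of flights, NaN where the lookup has no match
merged_left = pd.merge(flights, lookup, how="left", on="ORIGIN")
# outer: ORIGIN values from either frame; indicator=True adds a _merge
# column telling where each row came from
merged_outer = pd.merge(flights, lookup, how="outer", on="ORIGIN", indicator=True)

print(len(merged_inner), len(merged_left), len(merged_outer))  # 2 3 4
```

When the goal is only to add a descriptive column to an existing table, `how="left"` is often the safer choice, because it never adds rows that were not in the original dataframe.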

### Join

The `join` method is a convenient way to combine the columns of two potentially differently-indexed dataframes into a single result. Here is a simple example with two dataframes:

In [None]:
left = pd.DataFrame(
     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
    ) 

right = pd.DataFrame(
     {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
    )
left

In [None]:
right

Now we want to join the **right** dataframe to the **left** dataframe:

In [None]:
result = left.join(right)
result

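As the result shows, `join` performs a **left** join by default: every index label of the left dataframe is kept, and rows without a match in the right dataframe (here K1) get missing values. To keep only the labels present in both dataframes, pass `how="inner"`: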

In [None]:
result = left.join(right, how="inner")
result
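
As a quick comparison, the sketch below runs the same join with all four `how` options and records which index labels survive. It reuses the `left` and `right` frames from above; the dictionary name `index_by_how` is only an illustrative helper.

```python
import pandas as pd

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
)
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
)

# Collect the resulting index labels for each join type.
index_by_how = {
    how: list(left.join(right, how=how).index)
    for how in ["left", "right", "inner", "outer"]
}
for how, labels in index_by_how.items():
    print(how, labels)

# join on the index gives the same result as an index-based merge.
same_as_merge = left.join(right).equals(
    pd.merge(left, right, how="left", left_index=True, right_index=True)
)
print(same_as_merge)
```

Only `how="outer"` keeps all four labels K0 to K3; the default keeps exactly the labels of `left`.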

### Concatenate

pandas makes it easy to concatenate dataframes either by row or by column. To demonstrate this, let's first split the data into two parts that we can then concatenate.

In [None]:
# Row position at the halfway point of the dataframe, rounded to an integer.
row_split = round(len(df) * 0.5)

In [None]:
df1 = df[:row_split].copy()
df2 = df[row_split:].copy()

In [None]:
df1.head()

In [None]:
df2.head()

Here we concatenate the two dataframes by row, which is the default behavior (`axis=0`).

In [None]:
result = pd.concat([df1,df2])
result
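
Row-wise concatenation keeps the original index labels of each piece. The sketch below, on a small made-up frame `df_small` (not the tutorial data), shows that re-joining the two halves restores the original order, and that `ignore_index=True` can be used to renumber the rows when the pieces are stacked in a different order.

```python
import pandas as pd

# Hypothetical small frame standing in for the airline data.
df_small = pd.DataFrame({"x": [10, 11, 12, 13, 14]})
half = len(df_small) // 2
top = df_small[:half].copy()
bottom = df_small[half:].copy()

# Stacking the halves in the original order restores the original rows.
restored = pd.concat([top, bottom])

# Stacking in reverse order keeps the old labels (2, 3, 4, 0, 1) ...
reordered = pd.concat([bottom, top])
# ... unless ignore_index=True asks for a fresh 0..n-1 index.
renumbered = pd.concat([bottom, top], ignore_index=True)

print(list(reordered.index))
print(list(renumbered.index))
```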

To concatenate by column, pass `axis=1`. Rows are then aligned on the index, and positions without a matching index label are filled with missing values (NaN). Since df1 and df2 share no index labels, every row of the result below has missing values in one half of its columns.

In [None]:
result = pd.concat([df1,df2], axis=1)
result
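
If the intent is to place the two halves side by side by position rather than by index label, one possible approach is to reset the index of each piece first. This is a sketch on a small made-up frame; the names `df_small`, `wide` and `side_by_side` and the keys `"first"` / `"second"` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical small frame split into two halves with disjoint index labels.
df_small = pd.DataFrame({"x": [1, 2, 3, 4]})
top = df_small[:2].copy()      # index 0, 1
bottom = df_small[2:].copy()   # index 2, 3

# Aligned on index: no labels in common, so each row is half NaN.
wide = pd.concat([top, bottom], axis=1)

# Aligned by position: reset both indexes, and use keys to tell the
# duplicated column names apart.
side_by_side = pd.concat(
    [top.reset_index(drop=True), bottom.reset_index(drop=True)],
    axis=1,
    keys=["first", "second"],
)

print(wide.shape, int(wide.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

Passing `join="inner"` to `pd.concat` is another option when only index labels common to all pieces should be kept.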

In general, all of these methods can be used to combine data, and different methods, such as join and concatenate, can produce the same result. To explore this topic further, see the following references:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
- https://pandas.pydata.org/docs/reference/api/pandas.concat.html
