---
# Combining Datasets in Pandas


---
## 1. The Power of Combining Datasets:
Datasets often contain information on a single topic. For example, companies in the banking sector maintain and govern vast datasets containing client data, financial data, credit data, etc. __Combining datasets__ across topics by leveraging their 'commonalities' (shared keys or columns) can reveal valuable information that would stay hidden if the data were kept separate. In that sense, __combining data__ is a pivotal skill for analysing data and building a 'narrative' around it, which in turn facilitates better decision-making.

---
## 2. Combining Datasets - Methods:
In what follows, we will learn and explore 3 main ways to combine datasets:
- Merging (Joining) datasets -- `pd.merge()`
- Concatenating datasets -- `pd.concat()`
- Appending datasets -- `.append()` (deprecated in pandas 1.4 and removed in pandas 2.0, where `pd.concat()` takes its place)


The method we choose depends on the structure of our datasets and on the analysis we want to conduct.

---
### 2.1 Merge and Join:
__Merging Datasets - Definition__ :

- Merging refers to __horizontally joining__ two datasets.
- The merge is done on a common __column/set of columns__ between the datasets, which act as __join keys__ for the operation.
- It is the equivalent of a SQL join.

__Merging Datasets - Types__:

- left merge - uses only keys from left dataframe
- right merge - uses only keys from right dataframe
- inner merge - uses intersection of keys from both dataframes
- outer merge - uses union of keys from both dataframes
- cross merge - creates the cartesian product of both dataframes
    
Syntax:

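- `merged = pd.merge(df1, df2, left_on = 'df1_key_column', right_on = 'df2_key_column', how = '...')` where `how` takes one of the values `'left'`, `'right'`, `'inner'`, `'outer'`, `'cross'` (the default is `'inner'`). For `how = 'cross'` the key arguments are omitted, since every row of `df1` is paired with every row of `df2`.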

<center>
<div>
<img src="merge_types.png" width="500"/>
</div>
</center>
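
To make the merge types concrete, here is a minimal sketch on two small, made-up dataframes. The names `clients` and `accounts`, their columns and their values are assumptions chosen purely for illustration; the key columns are deliberately named differently so that `left_on` and `right_on` are both needed.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
clients = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "name": ["Anna", "Bella", "Charlie", "Dan"],
})
accounts = pd.DataFrame({
    "cust_id": [2, 3, 5],
    "balance": [250.0, 1200.0, 75.5],
})

# The join keys have different names, so we pass left_on / right_on
keys = dict(left_on="client_id", right_on="cust_id")

left = pd.merge(clients, accounts, how="left", **keys)    # all keys from clients
right = pd.merge(clients, accounts, how="right", **keys)  # all keys from accounts
inner = pd.merge(clients, accounts, how="inner", **keys)  # keys present in both (2 and 3)
outer = pd.merge(clients, accounts, how="outer", **keys)  # keys present in either

# A cross merge takes no keys: every row is paired with every row
cross = pd.merge(clients, accounts, how="cross")

for label, frame in [("left", left), ("right", right), ("inner", inner),
                     ("outer", outer), ("cross", cross)]:
    print(label, frame.shape)
```

Rows without a match on the other side are kept in the left, right and outer merges, with the missing values filled in as `NaN`.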

---
### 2.2 Concatenate:
__Concatenating Datasets - Definition__ :
- refers to joining two or more dataframes along a __particular axis__ - effectively this allows joining dataframes either __horizontally__ or __vertically__
- if we use `axis = 0` this stacks the two dataframes on top of each other, aligning them by __column name__
- if we use `axis = 1` this glues the two dataframes next to each other, aligning them by __index__

Syntax:
- `pd.concat([df1, df2], axis = ...)` where `axis = 0 or 1`
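
The two axis options can be sketched as follows. All dataframes here (`jan`, `feb`, `prices`, `stock`) are hypothetical toy data, invented only to show how the alignment works.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
jan = pd.DataFrame({"product": ["A", "B"], "units": [10, 20]})
feb = pd.DataFrame({"product": ["A", "C"], "units": [15, 5]})

# axis=0: stack the rows on top of each other; columns are matched by name
stacked = pd.concat([jan, feb], axis=0)
print(stacked.shape)        # (4, 2)
print(list(stacked.index))  # [0, 1, 0, 1] - the original row labels are kept

# ignore_index=True replaces the repeated labels with a fresh 0..n-1 index
stacked_clean = pd.concat([jan, feb], axis=0, ignore_index=True)

# axis=1: glue the columns next to each other; rows are matched by index label,
# not by position, so the different row order in `stock` does not matter
prices = pd.DataFrame({"price": [2.5, 4.0]}, index=["A", "B"])
stock = pd.DataFrame({"in_stock": [True, False]}, index=["B", "A"])
side_by_side = pd.concat([prices, stock], axis=1)
print(side_by_side.loc["A", "in_stock"])  # False
```

Note that `pd.concat()` takes a list of dataframes, so more than two can be combined in a single call.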


---
### 2.3 Append:
__Appending Datasets - Definition__ :
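- refers to appending the rows of one dataframe to the end of another
- essentially identical to concatenating with __axis = 0__, i.e. joining two dataframes vertically
- `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0; in current versions use `pd.concat([df1, df2])` instead

Syntax:
- `df1.append(df2)` (pandas versions before 2.0 only)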
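
Because `DataFrame.append()` is not available in pandas 2.0 and later, the sketch below shows the same row-wise append written with `pd.concat()`, which works across versions. The dataframes `old_rows` and `new_rows` are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
old_rows = pd.DataFrame({"client": ["Anna", "Bella"], "balance": [100, 200]})
new_rows = pd.DataFrame({"client": ["Charlie"], "balance": [300]})

# pandas < 2.0 only:
#   combined = old_rows.append(new_rows, ignore_index=True)

# Version-independent equivalent: concatenate along axis=0 (the default)
combined = pd.concat([old_rows, new_rows], ignore_index=True)
print(combined.shape)  # (3, 2)
```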

---
## 3. Sorting - Methods:

__Sorting - Definition:__
- sorting refers to putting a collection of items into some well-defined order
- sorting can be alphabetical, numerical, or based on some other ordering (e.g. chronological)
- sorting a dataset allows us to compare items and obtain quick and meaningful insights from the data


__Sorting Methods__:
- sorting by index - `.sort_index()`
- sorting by values - `.sort_values('column_name', ascending = ...)` where `ascending = True/False`
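
A minimal sketch of both sorting methods, using a hypothetical `portfolios` dataframe whose rows are deliberately out of order:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
portfolios = pd.DataFrame(
    {"client": ["Dan", "Anna", "Charlie", "Bella"],
     "balance": [300, 100, 400, 200]},
    index=[3, 0, 2, 1],
)

# Sort rows by their index labels (ascending by default)
by_index = portfolios.sort_index()
print(list(by_index["client"]))    # ['Anna', 'Bella', 'Charlie', 'Dan']

# Sort rows by the values in a column, largest first
by_balance = portfolios.sort_values("balance", ascending=False)
print(list(by_balance["client"]))  # ['Charlie', 'Dan', 'Bella', 'Anna']
```

Both methods return a new, sorted dataframe and leave `portfolios` itself unchanged.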


---
## 4. Summary:
- Combining data allows us to discover valuable insights that would stay hidden if the datasets were kept separate
- We can combine datasets both __horizontally__ and __vertically__ - i.e. 'joining datasets next to each other' and 'on top of each other'
- The 3 main methods to combine data are `pd.merge()`, `pd.concat()` and `.append()` (the last one was removed in pandas 2.0 in favour of `pd.concat()`)
- To sort data, we use methods `.sort_index()` and `.sort_values()`

---
## 5. Concept Check:

1. Suppose we have two DataFrames `df1` and `df2`. Suppose also that `df1.shape == (4, 4)` and `df2.shape == (3, 3)`, and that both dataframes have a common column called 'key'. The 'key' column in each dataframe has unique values.
    - Explain what it means to perform a 'left merge' of `df2` onto `df1`.
    - What would be the shape of the output produced by `pd.merge(df1, df2, on = ..., how = 'left')`?
2. What is the difference between:
    - `.merge()` and `.append()`?
    - `pd.concat(..., axis=0)` and `.append()`?
    - `.merge()` and `pd.concat(..., axis=1)`?
3. Suppose we have the following DataFrames:
    - `df1 = pd.DataFrame({'Service': ['Advisory', 'Advisory', 'Discretionary', 'Credit'], 'Port ID':[1,2,3,4]}, index = ['Anna', 'Bella', 'Charlie', 'Dan'])`
    - `df2 = pd.DataFrame({'Booking Location':['UK', 'UK', 'Zurich', 'Vienna'], 'Port ID':[2,3,4,1]}, index = ['Bella', 'Charlie', 'Dan', 'Anna'])`
    - Join the two DataFrames horizontally (i.e. 'glue next to each other') using both `merge()` and `concat()`.