# Joining Data with pandas - 3

In [1]:
import pandas as pd
import numpy as np

%load_ext nb_black

## `pandas.concat`

```python
pandas.concat(objs: Union[Iterable['DataFrame'], Mapping[Label, 'DataFrame']], axis=0, join: str = 'outer', ignore_index: bool = False, keys=None, levels=None, names=None, verify_integrity: bool = False, sort: bool = False, copy: bool = True) -> 'DataFrame'
```

```python
pandas.concat(objs: Union[Iterable[FrameOrSeries], Mapping[Label, FrameOrSeries]], axis=0, join: str = 'outer', ignore_index: bool = False, keys=None, levels=None, names=None, verify_integrity: bool = False, sort: bool = False, copy: bool = True) -> FrameOrSeriesUnion
```

* Concatenate pandas objects along a particular axis with optional set logic along the other axes.
* Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.


### Parameters

* `objs`: a sequence or mapping of Series or DataFrame objects
    * If a mapping is passed, its sorted keys are used as the `keys` argument, unless `keys` is passed explicitly, in which case only the values stored under those keys are selected (see below).
    * Any `None` objects are dropped silently unless they are all `None`, in which case a `ValueError` is raised.
* `axis`: {0/'index', 1/'columns'}, default 0
    * The axis to concatenate along.
* `join`: {'inner', 'outer'}, default 'outer'
    * How to handle indexes on other axis (or axes).
* `ignore_index`: bool, default False
    * If True, do not use the index values along the concatenation axis. 
    * The resulting axis will be labeled 0, …, n - 1
    * This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. 
    * Note the index values on the other axes are still respected in the join.
* `keys`: sequence, default None
    * Construct a hierarchical index using the passed keys as the outermost level.
    * If multiple levels are passed, `keys` should contain tuples.
* `levels`: list of sequences, default None
    * Specific levels (unique values) to use for constructing a MultiIndex. 
    * Otherwise they will be inferred from the keys.
* `names`: list, default None
    * Names for the levels in the resulting hierarchical index.
* `verify_integrity`: bool, default False
    * Check whether the new concatenated axis contains duplicates. 
    * This can be very expensive relative to the actual data concatenation.
* `sort`: bool, default False
    * Sort non-concatenation axis if it is not already aligned when join is 'outer'. 
    * This has no effect when join='inner', which already preserves the order of the non-concatenation axis.
* `copy`: bool, default True
    * If False, do not copy data unnecessarily.
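
The mapping form of `objs` described above can be sketched as follows. The labels `first` and `second` are made up for this example, and they are chosen so that insertion order and sorted order coincide, since key ordering may differ between pandas versions.

```python
import pandas as pd

s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])

# Passing a dict: its keys become the outer level of the result's index.
by_name = pd.concat({"first": s1, "second": s2})

# Passing `keys` as well selects only the values stored under those keys.
only_second = pd.concat({"first": s1, "second": s2}, keys=["second"])

print(by_name)
print(only_second)
```

The first result carries a two-level index (dict key, original row label); the second contains only the rows of `s2`.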

> Returns: object, type of objs
* When concatenating all `Series` along the index (`axis=0`), a `Series` is returned. 
* When objs contains at least one `DataFrame`, a `DataFrame` is returned. When concatenating along the columns (`axis=1`), a DataFrame is returned.
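
A quick way to check these return-type rules is to inspect the type of each result. This is only a sketch; the frame `df` is a hypothetical object introduced for the illustration.

```python
import pandas as pd

s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])
df = pd.DataFrame({"letter": ["e", "f"]})  # hypothetical frame for the demo

rows = pd.concat([s1, s2])          # all Series, axis=0
cols = pd.concat([s1, s2], axis=1)  # all Series, axis=1
mixed = pd.concat([df, s1])         # at least one DataFrame

print(type(rows).__name__)   # Series
print(type(cols).__name__)   # DataFrame
print(type(mixed).__name__)  # DataFrame
```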

In [2]:
s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])
pd.concat([s1, s2])

0    a
1    b
0    c
1    d
dtype: object

In [3]:
pd.concat([s1, s2], ignore_index=True)

0    a
1    b
2    c
3    d
dtype: object

Add a hierarchical index at the outermost level of the data with the `keys` option.

In [4]:
pd.concat([s1, s2], keys=["s1", "s2"])

s1  0    a
    1    b
s2  0    c
    1    d
dtype: object

Label the index keys you create with the `names` option.

In [5]:
pd.concat([s1, s2], keys=["s1", "s2"], names=["Series name", "Row ID"])

Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object
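
One reason to add and name these levels is that they make it easy to pull the original pieces back out. A minimal sketch, reusing the same `keys` and `names` as above:

```python
import pandas as pd

s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])

combined = pd.concat(
    [s1, s2], keys=["s1", "s2"], names=["Series name", "Row ID"]
)

# Select everything that came from the first input by its outer key.
first = combined.loc["s1"]

# Take a cross-section on the inner level, addressed by its name.
second_rows = combined.xs(1, level="Row ID")

print(first)
print(second_rows)
```

Here `first` holds the rows of `s1`, and `second_rows` holds the row labeled `1` from each input, indexed by `Series name`.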

In [6]:
df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df1

  letter  number
0      a       1
1      b       2


In [7]:
df2 = pd.DataFrame([["c", 3], ["d", 4]], columns=["letter", "number"])
df2

  letter  number
0      c       3
1      d       4


In [8]:
pd.concat([df1, df2])

  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
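
Note that the row labels `0` and `1` now appear twice. The sketch below shows one consequence of that, and how `ignore_index=True` avoids it; the variable names `stacked` and `renumbered` are chosen only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["c", 3], ["d", 4]], columns=["letter", "number"])

# Default behaviour keeps each input's own row labels, so 0 and 1 repeat.
stacked = pd.concat([df1, df2])
print(stacked.loc[0])  # two rows share the label 0

# ignore_index=True relabels the rows 0..n-1, making each label unique.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.loc[0])  # a single row
```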


Combine `DataFrame` objects with overlapping columns and return everything. Columns outside the intersection will be filled with `NaN` values.

In [9]:
df3 = pd.DataFrame(
    [["c", 3, "cat"], ["d", 4, "dog"]], columns=["letter", "number", "animal"]
)
df3

  letter  number animal
0      c       3    cat
1      d       4    dog


In [10]:
pd.concat([df1, df3], sort=False)

  letter  number animal
0      a       1    NaN
1      b       2    NaN
0      c       3    cat
1      d       4    dog
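
To see what the `sort` argument changes here, the following sketch compares the column order with `sort=False` and `sort=True` on the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df3 = pd.DataFrame(
    [["c", 3, "cat"], ["d", 4, "dog"]], columns=["letter", "number", "animal"]
)

# sort=False keeps the columns in order of first appearance.
unsorted = pd.concat([df1, df3], sort=False)
print(list(unsorted.columns))  # ['letter', 'number', 'animal']

# sort=True sorts the non-concatenation axis (here, the column labels).
sorted_cols = pd.concat([df1, df3], sort=True)
print(list(sorted_cols.columns))  # ['animal', 'letter', 'number']
```

In both cases the rows that came from `df1` hold `NaN` in the `animal` column.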


Combine `DataFrame` objects with overlapping columns and return only those that are shared by passing `inner` to the `join` keyword argument.

In [11]:
pd.concat([df1, df3], join="inner")

  letter  number
0      a       1
1      b       2
0      c       3
1      d       4


Combine `DataFrame` objects horizontally (side by side, along the columns) by passing in `axis=1`.

In [12]:
df4 = pd.DataFrame(
    [["bird", "polly"], ["monkey", "george"]], columns=["animal", "name"]
)
pd.concat([df1, df4], axis=1)

  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george
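
With `axis=1`, the `join` argument applies to the row index rather than to the columns. The sketch below uses two hypothetical frames, `left` and `right`, whose row labels only partly overlap, to show the difference between the two join modes:

```python
import pandas as pd

# Hypothetical frames whose row labels overlap only on label 1.
left = pd.DataFrame({"letter": ["a", "b"]}, index=[0, 1])
right = pd.DataFrame({"name": ["george", "polly"]}, index=[1, 2])

# Outer join (the default): keep every row label, fill the gaps with NaN.
outer = pd.concat([left, right], axis=1)
print(outer)

# Inner join: keep only the row labels present in both frames.
inner = pd.concat([left, right], axis=1, join="inner")
print(inner)
```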


Prevent the result from including duplicate index values with the `verify_integrity` option.

In [13]:
df5 = pd.DataFrame([1], index=["a"])
df5

   0
a  1


In [14]:
df6 = pd.DataFrame([2], index=["a"])
df6

   0
a  2


In [15]:
pd.concat([df5, df6], verify_integrity=True)

ValueError: Indexes have overlapping values: Index(['a'], dtype='object')
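
In a script, this error can be caught and handled. The sketch below shows one possible reaction, falling back to `ignore_index=True`; whether that fallback is appropriate depends on whether the row labels carry meaning in your data.

```python
import pandas as pd

df5 = pd.DataFrame([1], index=["a"])
df6 = pd.DataFrame([2], index=["a"])

message = None
try:
    result = pd.concat([df5, df6], verify_integrity=True)
except ValueError as exc:
    # Both inputs use the label "a", so the integrity check fails.
    message = str(exc)
    # One possible fallback: discard the labels and renumber the rows.
    result = pd.concat([df5, df6], ignore_index=True)

print(message)
print(result)
```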

## `append()`

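* `DataFrame.append()` is a simplified version of the `pd.concat()` function. It was deprecated in pandas 1.4 and removed in pandas 2.0, so current code should call `pd.concat()` instead.
* Supports `ignore_index` and `sort`
* Does not support `keys` or `join`
    - Always behaves like `join='outer'`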
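
Because `DataFrame.append()` was removed in pandas 2.0, the same result is usually obtained with `pd.concat`. A minimal sketch of the equivalent call, assuming the two example frames from earlier in this notebook:

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["c", 3], ["d", 4]], columns=["letter", "number"])

# Older code: df1.append(df2, ignore_index=True)
# Equivalent with concat (join="outer" is already the default):
appended = pd.concat([df1, df2], ignore_index=True, sort=False)
print(appended)
```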
    
## Validating a merge:

```python
DataFrame.merge(right, how='inner', ..., validate=None)
```

`validate`: str, optional

If specified, checks if merge is of specified type:
* "one_to_one" or "1:1": check if merge keys are unique in both left and right datasets.
* "one_to_many" or "1:m": check if merge keys are unique in left dataset.
* "many_to_one" or "m:1": check if merge keys are unique in right dataset.
* "many_to_many" or "m:m": allowed, but does not result in checks.
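
The checks listed above can be illustrated with a small sketch. The frames `owners` and `visits` and their columns are invented for this example: `pet_id` is unique in `owners` but repeated in `visits`.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical data: each pet has one owner but may have several visits.
owners = pd.DataFrame({"pet_id": [1, 2], "owner": ["ann", "bob"]})
visits = pd.DataFrame({"pet_id": [1, 1, 2], "visit": ["mon", "tue", "wed"]})

# Keys are unique on the left, so a one-to-many check passes.
ok = owners.merge(visits, on="pet_id", validate="one_to_many")
print(ok)

# Keys repeat on the right, so a one-to-one check raises MergeError.
try:
    owners.merge(visits, on="pet_id", validate="one_to_one")
    failed = False
except MergeError as exc:
    failed = True
    print(exc)
```

Using `validate` this way turns a silent row multiplication into an explicit error at merge time.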