# Combining Datasets: Concat and Append
_____________________

``Series`` and ``DataFrame`` objects are built for combining data. These operations

* work on data that comes from different data sources

* range **from** very straightforward **concatenation** of two different datasets **to** more complicated **joins** and **merges** that correctly handle any overlaps between the datasets

* are fast and straightforward with Pandas functions and methods

In [8]:
import pandas as pd
import numpy as np

In [9]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

In [10]:
# example DataFrame
make_df('ABC', range(3))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


The following class displays multiple ``DataFrame``s side by side. It defines a ``_repr_html_`` method, which IPython uses to render rich output:

In [11]:
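class display:
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""

    def __init__(self, *args):
        # Each argument is a string holding an expression, e.g. 'df1'
        self.args = args

    def _repr_html_(self):
        # Rich output: show each expression string above its evaluated HTML table.
        # eval() runs the strings as code, so pass only trusted expressions.
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        # Plain-text fallback for front ends without HTML support
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)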
    

## 1. Simple Concatenation with ``pd.concat()``
------------------------

#### 1.1. Concatenation of ``Series`` objects
-----------------

* The options of ``pd.concat()`` can be combined to produce a wide range of behaviors when **joining** two datasets

In [12]:
pd.concat?

Signature:
pd.concat(
    objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]',
    axis: 'Axis' = 0,
    join: 'str' = 'outer',
    ignore_index: 'bool' = False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity: 'bool' = False,
    sort: 'bool' = False,
    copy: 'bool' = True,

In [13]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [14]:
ser2.append(ser1)

4    D
5    E
6    F
1    A
2    B
3    C
dtype: object

In [15]:
pd.concat([ser2, ser1])

4    D
5    E
6    F
1    A
2    B
3    C
dtype: object
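
The same function can also place ``Series`` side by side. The following is a minimal sketch that reuses the values of ``ser1`` and ``ser2`` from above; the column labels ``'left'`` and ``'right'`` are illustrative names only. With ``axis=1`` the inputs are aligned on their index, and labels missing from one input are filled with ``NaN``:

```python
import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])

# axis=1 puts each Series in its own column; keys supplies the column labels
side_by_side = pd.concat([ser1, ser2], axis=1, keys=['left', 'right'])
print(side_by_side)
```

Because the two index sets do not overlap here, every row of the result has a value in exactly one column.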

#### 1.2. Concatenation of ``DataFrame`` objects
-------------

* By default, the concatenation takes place row-wise within the ``DataFrame`` (i.e., ``axis=0``)

In [16]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

Unnamed: 0,A,B
1,A1,B1
2,A2,B2

Unnamed: 0,A,B
3,A3,B3
4,A4,B4

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


* The ``axis`` argument specifies the axis along which concatenation takes place:

In [17]:
df3 = make_df('AB', [1, 2])
df4 = make_df('CD', [1, 2])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")

Unnamed: 0,A,B
1,A1,B1
2,A2,B2

Unnamed: 0,C,D
1,C1,D1
2,C2,D2

Unnamed: 0,A,B,C,D
1,A1,B1,C1,D1
2,A2,B2,C2,D2


In [18]:
display('df3', 'df4', "pd.concat([df3, df4])")

Unnamed: 0,A,B
1,A1,B1
2,A2,B2

Unnamed: 0,C,D
1,C1,D1
2,C2,D2

Unnamed: 0,A,B,C,D
1,A1,B1,,
2,A2,B2,,
1,,,C1,D1
2,,,C2,D2


In [19]:
df5 = make_df('AB', [2, 3])
display('df1', 'df5', 'pd.concat([df1, df5], axis=1)')

Unnamed: 0,A,B
1,A1,B1
2,A2,B2

Unnamed: 0,A,B
2,A2,B2
3,A3,B3

Unnamed: 0,A,B,A.1,B.1
1,A1,B1,,
2,A2,B2,A2,B2
3,,,A3,B3
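
When the row labels overlap only partly, as in the last example, the ``join`` argument also applies to column-wise concatenation. The sketch below assumes two frames built like ``df1`` and ``df5`` above (the names ``left`` and ``right`` are placeholders) and keeps only the row labels shared by both:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as defined earlier)"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

left = make_df('AB', [1, 2])
right = make_df('AB', [2, 3])

# join='inner' keeps only the row labels present in both inputs
shared = pd.concat([left, right], axis=1, join='inner')
print(shared)

# Column labels are repeated, so selecting 'A' returns both 'A' columns
print(shared['A'])
```

Note that ``pd.concat`` does not rename repeated column labels; if unique column names matter, they need to be assigned separately.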


## 2. Duplicate indices and ways to handle them
_____________________________________________

#### 2.1. Preserving indices
-----------

* Pandas concatenation *preserves indices*, even if the result has duplicate indices (an important difference between ``np.concatenate`` and ``pd.concat``):

In [20]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')

Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,A,B
0,A2,B2
1,A3,B3

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
0,A2,B2
1,A3,B3
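
To make the contrast with ``np.concatenate`` concrete, here is a small sketch using frames with the same values as ``x`` and ``y``. The NumPy result carries no labels at all, while the Pandas result keeps the repeated labels, so a label lookup can return more than one row:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# NumPy stacks the raw values; positions are the only way to address rows
stacked = np.concatenate([x.to_numpy(), y.to_numpy()])
print(stacked.shape)

# Pandas keeps the original labels, duplicates included
combined = pd.concat([x, y])
print(list(combined.index))

# A lookup by label 0 now matches two rows
print(combined.loc[0])
```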


* ``verify_integrity=True`` -- treats repeats as an error: the concatenation raises a ``ValueError`` if the result would contain duplicate indices:

In [21]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
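
If raising an exception is too strict, duplicates can also be inspected after the fact. This is one possible approach, not the only one: it uses ``Index.is_unique`` and ``Index.duplicated()`` on the concatenated result, with frames that hold the same values as ``x`` and ``y``:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

combined = pd.concat([x, y])

# Index.is_unique is False when any label occurs more than once
print(combined.index.is_unique)

# Index.duplicated() marks every repeat of an earlier label
repeated = combined.index[combined.index.duplicated()].unique()
print(list(repeated))
```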


#### 2.2. Managing duplicate indices
-----------------
* ``ignore_index=True`` -- ignores the index when the index itself does not matter: the concatenation creates a new integer index for the resulting ``DataFrame``:

In [22]:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')

Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,A,B
0,A2,B2
1,A3,B3

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


* The ``keys`` option specifies a label for each data source; the result is a hierarchically indexed ``DataFrame`` (its index is a ``MultiIndex``):

In [23]:
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")

Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,A,B
0,A2,B2
1,A3,B3

Unnamed: 0,Unnamed: 1,A,B
x,0,A0,B0
x,1,A1,B1
y,0,A2,B2
y,1,A3,B3


In [24]:
dxy = pd.concat([x, y], keys=['x', 'y'])
dxy

Unnamed: 0,Unnamed: 1,A,B
x,0,A0,B0
x,1,A1,B1
y,0,A2,B2
y,1,A3,B3


In [25]:
dxy.loc['x']

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
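
Since ``objs`` may also be a mapping (see the signature above), the labels can be supplied as dictionary keys instead of through ``keys``. The sketch below rebuilds ``x`` and ``y`` with the same values; the level names ``'source'`` and ``'row'`` are made up for illustration:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# The dictionary keys become the outer level of the MultiIndex
dxy = pd.concat({'x': x, 'y': y})

# Naming the levels makes later selections easier to read
dxy.index.names = ['source', 'row']

# One cell, addressed by (source, row) and column
print(dxy.loc[('y', 1), 'A'])

# Cross-section: row 0 from every source
first_rows = dxy.xs(0, level='row')
print(first_rows)
```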


#### 2.3. Concatenation with joins when column names differ
--------------

* ``join='outer'`` -- the default -- outer join: the result uses the union of the input columns, and entries for which no data is available are filled with ``NaN``. The ``join_axes`` argument from older pandas versions was removed in pandas 1.0.

In [26]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6], sort=False)' )

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2

Unnamed: 0,B,C,D
3,B3,C3,D3
4,B4,C4,D4

Unnamed: 0,A,B,C,D
1,A1,B1,C1,
2,A2,B2,C2,
3,,B3,C3,D3
4,,B4,C4,D4


In [27]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2

Unnamed: 0,B,C,D
3,B3,C3,D3
4,B4,C4,D4

Unnamed: 0,A,B,C,D
1,A1,B1,C1,
2,A2,B2,C2,
3,,B3,C3,D3
4,,B4,C4,D4


* ``join='inner'`` -- inner join: the result uses only the columns common to all inputs

In [28]:
display('df5', 'df6',
        "pd.concat([df5, df6], join='inner')")

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2

Unnamed: 0,B,C,D
3,B3,C3,D3
4,B4,C4,D4

Unnamed: 0,B,C
1,B1,C1
2,B2,C2
3,B3,C3
4,B4,C4
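
When neither the union nor the intersection of the columns is wanted, one possible approach (a sketch, not the only option) is to concatenate with the default outer join and then select the desired columns with ``reindex``. Here the columns of the first input are kept; ``df5`` and ``df6`` are rebuilt with the same values as above:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as defined earlier)"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])

# Outer join first, then keep exactly the columns of df5
result = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(result)
```

Column ``D`` is dropped, and column ``A`` is missing for the rows that came from ``df6``.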


## 3. Concatenation with ``Series.append()`` and ``DataFrame.append()``  
___________________

* ``Series`` and ``DataFrame`` objects have an ``append()`` method for direct concatenation

* ``append()`` does not modify the original object; instead it creates a new object (unlike the ``append()`` and ``extend()`` methods of Python lists)

* ``append()`` is also **not very efficient**, because it involves the creation of a new index *and* a new data buffer

* For multiple ``append`` operations, it is generally better to build a list of ``DataFrame``s and pass them all at once to the ``concat()`` function

* ``Series.append()`` and ``DataFrame.append()`` were deprecated in pandas 1.4 and removed in pandas 2.0; the cells below run only on older versions, and ``pd.concat()`` is the replacement
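
The advice to collect the pieces in a list and concatenate once can be sketched as follows. The ``batches`` list is a hypothetical stand-in for data that arrives piece by piece; both variants use ``pd.concat()`` so that the example does not depend on ``append()`` being available:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as defined earlier)"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# Hypothetical batches standing in for data that arrives piece by piece
batches = [make_df('AB', [2 * i, 2 * i + 1]) for i in range(3)]

# Growing the result step by step copies all accumulated rows each time
grown = batches[0]
for batch in batches[1:]:
    grown = pd.concat([grown, batch])

# Collecting the pieces and concatenating once copies each row only once
combined = pd.concat(batches)

# Both routes produce the same DataFrame
print(combined.equals(grown))
```

The results are identical; the difference is only in how much data is copied along the way, which matters as the number of pieces grows.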

In [None]:
pd.Series.append?

In [None]:
pd.DataFrame.append?

In [29]:
display('df1', 'df2', 'df1.append(df2)')



Unnamed: 0,A,B
1,A1,B1
2,A2,B2

Unnamed: 0,A,B
3,A3,B3
4,A4,B4

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


In [30]:
display('x', 'y', 'x.append(y, ignore_index=True)')



Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,A,B
0,A2,B2
1,A3,B3

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


In [None]:
display('x')