<p><font size="6"><b>Pandas: Combining datasets Part I - concat</b></font></p>

> *Data wrangling in Python*  
> *November, 2020*
>
> *© 2020, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

In [2]:
import pandas as pd

Combining data is essential functionality in a data analysis workflow. 

Data is often distributed over multiple files, different pieces of information need to be merged, or newly calculated data needs to be added to an existing dataset. Pandas provides various facilities for easily combining Series and DataFrame objects.

In [3]:
# redefining the example objects

# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries

Unnamed: 0,country,population,area,capital
0,Belgium,11.3,30510,Brussels
1,France,64.3,671308,Paris
2,Germany,81.3,357050,Berlin
3,Netherlands,16.9,41526,Amsterdam
4,United Kingdom,64.9,244820,London


# Adding columns

As we have already seen, adding a single column is very easy:

In [4]:
pop_density = countries['population']*1e6 / countries['area']

In [5]:
pop_density

0    370.370370
1     95.783158
2    227.699202
3    406.973944
4    265.092721
dtype: float64

In [6]:
countries['pop_density'] = pop_density

In [7]:
countries

Unnamed: 0,country,population,area,capital,pop_density
0,Belgium,11.3,30510,Brussels,370.37037
1,France,64.3,671308,Paris,95.783158
2,Germany,81.3,357050,Berlin,227.699202
3,Netherlands,16.9,41526,Amsterdam,406.973944
4,United Kingdom,64.9,244820,London,265.092721


Adding multiple columns at once is also possible. For example, the following method gives us a DataFrame of two columns:

In [8]:
countries["country"].str.split(" ", expand=True)

Unnamed: 0,0,1
0,Belgium,
1,France,
2,Germany,
3,Netherlands,
4,United,Kingdom


We can add both at once to the dataframe:

In [9]:
countries[['first', 'last']] = countries["country"].str.split(" ", expand=True)

In [10]:
countries

Unnamed: 0,country,population,area,capital,pop_density,first,last
0,Belgium,11.3,30510,Brussels,370.37037,Belgium,
1,France,64.3,671308,Paris,95.783158,France,
2,Germany,81.3,357050,Berlin,227.699202,Germany,
3,Netherlands,16.9,41526,Amsterdam,406.973944,Netherlands,
4,United Kingdom,64.9,244820,London,265.092721,United,Kingdom
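Assigning to two column names only works when the split yields exactly two columns. As a minimal sketch (the `names` Series below, including 'United Arab Emirates', is hypothetical and not part of the example data), passing `n=1` to `str.split` limits the number of splits so that longer names do not produce a third column:

```python
import pandas as pd

# Hypothetical names, one of them with three words
names = pd.Series(['Belgium', 'United Kingdom', 'United Arab Emirates'])

# Without a limit, the three-word name produces three columns
unlimited = names.str.split(" ", expand=True)

# n=1 splits at most once, so there are never more than two columns
limited = names.str.split(" ", n=1, expand=True)

print(unlimited.shape)  # (3, 3)
print(limited.shape)    # (3, 2)
```

With `n=1`, everything after the first space ends up in the second column.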


# Concatenating data

The ``pd.concat`` function does all of the heavy lifting of combining data in different ways.

``pd.concat`` takes a list or dict of Series/DataFrame objects and concatenates them in a certain direction (`axis`) with some configurable handling of “what to do with the other axes”.
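To make the role of `axis` concrete, here is a minimal sketch using two tiny, hypothetical frames (`left` and `right` are illustration-only names, not part of the example data). It only shows the resulting shapes for both directions:

```python
import pandas as pd

# Two tiny, hypothetical frames used only for illustration
left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
right = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# axis=0 (the default): rows are stacked, columns are the "other axis"
stacked = pd.concat([left, right], axis=0)

# axis=1: frames are placed side by side, the row index is the "other axis"
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```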


## Combining rows - ``pd.concat``

![](../img/pandas/schema-concat0.svg)

Assume we have data similar to `countries`, but for a different set of countries:

In [11]:
data = {'country': ['Nigeria', 'Rwanda', 'Egypt', 'Morocco', ],
        'population': [182.2, 11.3, 94.3, 34.4],
        'area': [923768, 26338 , 1010408, 710850],
        'capital': ['Abuja', 'Kigali', 'Cairo', 'Rabat']}
countries_africa = pd.DataFrame(data)
countries_africa 

Unnamed: 0,country,population,area,capital
0,Nigeria,182.2,923768,Abuja
1,Rwanda,11.3,26338,Kigali
2,Egypt,94.3,1010408,Cairo
3,Morocco,34.4,710850,Rabat


We now want to combine the rows of both datasets:

In [12]:
pd.concat([countries, countries_africa])

Unnamed: 0,country,population,area,capital,pop_density,first,last
0,Belgium,11.3,30510,Brussels,370.37037,Belgium,
1,France,64.3,671308,Paris,95.783158,France,
2,Germany,81.3,357050,Berlin,227.699202,Germany,
3,Netherlands,16.9,41526,Amsterdam,406.973944,Netherlands,
4,United Kingdom,64.9,244820,London,265.092721,United,Kingdom
0,Nigeria,182.2,923768,Abuja,,,
1,Rwanda,11.3,26338,Kigali,,,
2,Egypt,94.3,1010408,Cairo,,,
3,Morocco,34.4,710850,Rabat,,,


If we don't want the index to be preserved:

In [13]:
pd.concat([countries, countries_africa], ignore_index=True)

Unnamed: 0,country,population,area,capital,pop_density,first,last
0,Belgium,11.3,30510,Brussels,370.37037,Belgium,
1,France,64.3,671308,Paris,95.783158,France,
2,Germany,81.3,357050,Berlin,227.699202,Germany,
3,Netherlands,16.9,41526,Amsterdam,406.973944,Netherlands,
4,United Kingdom,64.9,244820,London,265.092721,United,Kingdom
5,Nigeria,182.2,923768,Abuja,,,
6,Rwanda,11.3,26338,Kigali,,,
7,Egypt,94.3,1010408,Cairo,,,
8,Morocco,34.4,710850,Rabat,,,


When the two dataframes don't have the same set of columns, by default missing values get introduced:

In [14]:
pd.concat([countries, countries_africa[['country', 'capital']]], ignore_index=True)

Unnamed: 0,country,population,area,capital,pop_density,first,last
0,Belgium,11.3,30510.0,Brussels,370.37037,Belgium,
1,France,64.3,671308.0,Paris,95.783158,France,
2,Germany,81.3,357050.0,Berlin,227.699202,Germany,
3,Netherlands,16.9,41526.0,Amsterdam,406.973944,Netherlands,
4,United Kingdom,64.9,244820.0,London,265.092721,United,Kingdom
5,Nigeria,,,Abuja,,,
6,Rwanda,,,Kigali,,,
7,Egypt,,,Cairo,,,
8,Morocco,,,Rabat,,,
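The default behaviour corresponds to `join='outer'` (the union of all columns). As a small sketch with reduced, hypothetical copies of the data (`full` and `partial` are names introduced here), `join='inner'` keeps only the columns that all inputs share, so no missing values are introduced:

```python
import pandas as pd

# Reduced, hypothetical versions of the example frames
full = pd.DataFrame({'country': ['Belgium', 'France'],
                     'population': [11.3, 64.3],
                     'capital': ['Brussels', 'Paris']})
partial = pd.DataFrame({'country': ['Nigeria', 'Rwanda'],
                        'capital': ['Abuja', 'Kigali']})

# Default join='outer': union of the columns, gaps are filled with NaN
outer = pd.concat([full, partial], ignore_index=True)

# join='inner': keep only the columns present in every input
inner = pd.concat([full, partial], ignore_index=True, join='inner')

print(outer.columns.tolist())
print(inner.columns.tolist())
```

Which option fits best depends on whether the extra columns are still needed for the rows that have them.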


We can also pass a dictionary of objects instead of a list of objects. Now the keys of the dictionary are preserved as an additional index level:

In [15]:
pd.concat({'europe': countries, 'africa': countries_africa})

Unnamed: 0,Unnamed: 1,country,population,area,capital,pop_density,first,last
europe,0,Belgium,11.3,30510,Brussels,370.37037,Belgium,
europe,1,France,64.3,671308,Paris,95.783158,France,
europe,2,Germany,81.3,357050,Berlin,227.699202,Germany,
europe,3,Netherlands,16.9,41526,Amsterdam,406.973944,Netherlands,
europe,4,United Kingdom,64.9,244820,London,265.092721,United,Kingdom
africa,0,Nigeria,182.2,923768,Abuja,,,
africa,1,Rwanda,11.3,26338,Kigali,,,
africa,2,Egypt,94.3,1010408,Cairo,,,
africa,3,Morocco,34.4,710850,Rabat,,,
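The extra index level makes it easy to get one of the original datasets back. The following sketch uses small, hypothetical frames (`europe` and `africa`) and also shows the `keys` argument, which gives the same labelling when starting from a list:

```python
import pandas as pd

# Small, hypothetical frames used only for illustration
europe = pd.DataFrame({'country': ['Belgium', 'France'],
                       'capital': ['Brussels', 'Paris']})
africa = pd.DataFrame({'country': ['Nigeria', 'Rwanda'],
                       'capital': ['Abuja', 'Kigali']})

# Dictionary keys become the outer level of the row index
combined = pd.concat({'europe': europe, 'africa': africa})

# Select all rows that came from one of the inputs
africa_rows = combined.loc['africa']

# Passing a list together with `keys` labels the inputs in the same way
combined_keys = pd.concat([europe, africa], keys=['europe', 'africa'])

print(africa_rows)
```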


## Combining columns  - ``pd.concat`` with ``axis=1``

![](../img/pandas/schema-concat1.svg)

Assume we have another DataFrame with some additional statistics for some of the same countries:

In [16]:
data = {'country': ['Belgium', 'France', 'Netherlands'],
        'GDP': [496477, 2650823, 820726],
        'area': [8.0, 9.9, 5.7]}
country_economics = pd.DataFrame(data).set_index('country')
country_economics

country,GDP,area
Belgium,496477,8.0
France,2650823,9.9
Netherlands,820726,5.7


In [21]:
countries2 = countries.set_index("country")

In [22]:
pd.concat([countries2, country_economics], axis=1)

Unnamed: 0,population,area,capital,pop_density,first,last,GDP,area.1
Belgium,11.3,30510,Brussels,370.37037,Belgium,,496477.0,8.0
France,64.3,671308,Paris,95.783158,France,,2650823.0,9.9
Germany,81.3,357050,Berlin,227.699202,Germany,,,
Netherlands,16.9,41526,Amsterdam,406.973944,Netherlands,,820726.0,5.7
United Kingdom,64.9,244820,London,265.092721,United,Kingdom,,


`pd.concat` matches the different objects based on the index:

In [None]:
countries2 = countries.set_index('country')

In [None]:
countries2

In [None]:
pd.concat([countries2, country_economics], axis=1)
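Matching on the index means that the order of the rows in the inputs does not matter. A minimal sketch with two small, hypothetical frames (`capitals` and `gdp`, whose rows are deliberately in a different order) illustrates this:

```python
import pandas as pd

# Hypothetical frames indexed by country; the rows of `gdp` are in a different order
capitals = pd.DataFrame({'capital': ['Brussels', 'Paris', 'Berlin']},
                        index=['Belgium', 'France', 'Germany'])
gdp = pd.DataFrame({'GDP': [2650823, 496477]},
                   index=['France', 'Belgium'])

# Values are aligned on the index labels, not on the row position
aligned = pd.concat([capitals, gdp], axis=1)

print(aligned.loc['Belgium', 'GDP'])  # 496477.0
print(aligned.loc['Germany', 'GDP'])  # nan
```

Labels that are missing in one of the inputs (here 'Germany' in `gdp`) get a missing value, which is also why the integer GDP column becomes a float column.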

# Joining data with `pd.merge`

Using `pd.concat` above, we combined datasets that had the same columns or the same index values. Another typical case is where you want to add information from a second dataframe to a first one, based on the values in one of the columns. That can be done with [`pd.merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html).

Let's look again at the titanic passenger data, but taking a small subset of it to make the example easier to grasp:

In [None]:
df = pd.read_csv("data/titanic.csv")
df = df.loc[:9, ['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked']]

In [None]:
df

Assume we have another dataframe with more information about the 'Embarked' locations:

In [None]:
locations = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'N'],
                          'City': ['Southampton', 'Cherbourg', 'Queenstown', 'New York City'],
                          'Country': ['United Kingdom', 'France', 'Ireland', 'United States']})

In [None]:
locations

We now want to add those columns to the titanic dataframe, for which we can use `pd.merge`, specifying the column on which we want to merge the two datasets:

In [None]:
pd.merge(df, locations, on='Embarked', how='left')

In this case we use `how='left'` (a "left join") because we want to keep the original rows of `df` and only add the matching values from `locations` to them. Other options are 'inner', 'outer' and 'right' (see the [docs](http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra) for more on this).
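The difference between the `how` options is easiest to see on a tiny example. The sketch below does not use the titanic data; `passengers` and `ports` are hypothetical frames, and the code 'X' is an invented value without a match in `ports`:

```python
import pandas as pd

# Hypothetical data: 'X' has no match in `ports`, 'Q' has no match in `passengers`
passengers = pd.DataFrame({'Name': ['A', 'B', 'C'],
                           'Embarked': ['S', 'C', 'X']})
ports = pd.DataFrame({'Embarked': ['S', 'C', 'Q'],
                      'City': ['Southampton', 'Cherbourg', 'Queenstown']})

# left: keep every row of `passengers`, unmatched rows get NaN for 'City'
left_join = pd.merge(passengers, ports, on='Embarked', how='left')

# inner: keep only rows whose key occurs in both frames
inner_join = pd.merge(passengers, ports, on='Embarked', how='inner')

# outer: keep all keys from both frames
outer_join = pd.merge(passengers, ports, on='Embarked', how='outer')

print(len(left_join), len(inner_join), len(outer_join))  # 3 2 4
```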