# Combining data frames

Frequently, the data of interest is spread over several data frames. To process it, we need to combine them into a single data frame.

In [1]:
import pandas as pd

# Concatenation

Concatenation simply attaches data frames to one another, adding either new rows or new columns.

## Concatenating Series objects

In [2]:
s1 = pd.Series(["A","B","C"],index=[1,2,3])
s2 = pd.Series(["A","B","C"],index=[4,5,6])
pd.concat([s1,s2])

1    A
2    B
3    C
4    A
5    B
6    C
dtype: object

## Concatenating dataframes

In [3]:
df1 = pd.DataFrame({"A":["A1","A2"], "B":["B1","B2"]}, index=[1,2])
df1

Unnamed: 0,A,B
1,A1,B1
2,A2,B2


In [4]:
df2 = pd.DataFrame({"A":["A3","A4"], "B":["B3","B4"]}, index=[3,4])
df2

Unnamed: 0,A,B
3,A3,B3
4,A4,B4


Let's combine these two data frames by putting one after the other:

In [5]:
pd.concat([df1,df2])

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


Now let's combine them in the other direction, pasting them as new columns.

In [6]:
df1 = pd.DataFrame({"A":["A1","A2"], "B":["B1","B2"]}, index=[1,2])
df2 = pd.DataFrame({"C":["C1","C2"], "D":["D1","D2"]}, index=[1,2])
pd.concat([df1,df2],axis="columns")  # or axis=1

Unnamed: 0,A,B,C,D
1,A1,B1,C1,D1
2,A2,B2,C2,D2


## Duplicated indices

If the concatenated DataFrames share index labels, the resulting DataFrame keeps them, so the same label appears more than once.

In [7]:
df1 = pd.DataFrame({"A":["A1","A2"], "B":["B1","B2"]}) # indices: [0,1]
df2 = pd.DataFrame({"A":["A3","A4"], "B":["B3","B4"]}) # indices: [0,1]
pd.concat([df1,df2])

Unnamed: 0,A,B
0,A1,B1
1,A2,B2
0,A3,B3
1,A4,B4


However, this is frequently not what you want. You can reset the indices with the parameter `ignore_index=True`.

In [8]:
pd.concat([df1,df2], ignore_index=True)

Unnamed: 0,A,B
0,A1,B1
1,A2,B2
2,A3,B3
3,A4,B4
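
Besides resetting the index, `concat()` offers two other ways of dealing with repeated labels. The sketch below uses the same two frames; the labels `"first"` and `"second"` are arbitrary names chosen for illustration. `verify_integrity=True` makes pandas raise an error instead of silently repeating labels, and `keys=` adds an outer index level that records which frame each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]})  # index: [0, 1]
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]})  # index: [0, 1]

# verify_integrity=True raises ValueError if the result would
# contain duplicate index labels.
try:
    pd.concat([df1, df2], verify_integrity=True)
    duplicates_detected = False
except ValueError:
    duplicates_detected = True

# keys= builds a two-level index: (source label, original index).
# "first" and "second" are illustrative labels, not pandas defaults.
labeled = pd.concat([df1, df2], keys=["first", "second"])

# Rows from the second frame can then be selected by their label.
second_rows = labeled.loc["second"]
```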


## Inner and outer joins

In the examples above, we concatenated DataFrames whose columns match. Now let's see a more typical example where we need to combine DataFrames with different indices and columns.

In [9]:
df1 = pd.DataFrame({"A":["A1","A2"], "B":["B1","B2"], "C":["C1","C2"]}, index=[1,2])
df1

Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2


In [10]:
df2 = pd.DataFrame({"B":["B12","B13","B14"], "C":["C12","C13","C14"], "D":["D12","D13","D14"]}, index=[3,4,5])
df2

Unnamed: 0,B,C,D
3,B12,C12,D12
4,B13,C13,D13
5,B14,C14,D14


By default, `concat()` stacks the rows and keeps the union of the column labels. This is called an *outer join*.

An outer join often leaves some cells without data. *pandas* fills them with `NaN` (a.k.a. NA, null, empty).

In [11]:
pd.concat([df1,df2])

Unnamed: 0,A,B,C,D
1,A1,B1,C1,
2,A2,B2,C2,
3,,B12,C12,D12
4,,B13,C13,D13
5,,B14,C14,D14


In contrast, an _inner join_ combines only the intersection of columns.

In [12]:
pd.concat([df1,df2],join="inner")

Unnamed: 0,B,C
1,B1,C1
2,B2,C2
3,B12,C12
4,B13,C13
5,B14,C14
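
The `join` parameter always applies to the axis that is *not* being concatenated. As a small sketch with made-up frames (`left` and `right` are illustrative names), concatenating along the columns makes `join` act on the index labels instead of the column labels:

```python
import pandas as pd

left = pd.DataFrame({"A": ["A1", "A2", "A3"]}, index=[1, 2, 3])
right = pd.DataFrame({"B": ["B2", "B3", "B4"]}, index=[2, 3, 4])

# Outer join (default): the result has the union of the index labels;
# cells without data are filled with NaN.
outer = pd.concat([left, right], axis="columns")

# Inner join: only the index labels present in both frames are kept.
inner = pd.concat([left, right], axis="columns", join="inner")
```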


# Database-style merging

## One-to-one joins

Often we have data in different tables, related by a common column. For example, we can have a table of city names and temperatures, and another table of city names and humidity.

In [13]:
df1 = pd.DataFrame({"city":["Istanbul","Ankara","Edirne", "Izmir"],
                    "temp":[14,16,17,20]})
df1

Unnamed: 0,city,temp
0,Istanbul,14
1,Ankara,16
2,Edirne,17
3,Izmir,20


In [14]:
df2 = pd.DataFrame({"city":["Edirne","Ankara","Istanbul","Bursa"],
                    "hum":[39.5,54.6,13.7,20.5]})
df2

Unnamed: 0,city,hum
0,Edirne,39.5
1,Ankara,54.6
2,Istanbul,13.7
3,Bursa,20.5


To combine these into a single data frame, the `merge()` function can be used:

In [15]:
pd.merge(df1,df2)

Unnamed: 0,city,temp,hum
0,Istanbul,14,13.7
1,Ankara,16,54.6
2,Edirne,17,39.5


This is called a *one-to-one join*, because every temperature value is associated with exactly one humidity value.

The same could be done with `concat()`, albeit more clunkily:

In [16]:
pd.concat([df1.set_index("city"),df2.set_index("city")], join="inner", axis=1).reset_index()

Unnamed: 0,city,temp,hum
0,Istanbul,14,13.7
1,Ankara,16,54.6
2,Edirne,17,39.5


Notice:

* The association is done correctly even though the order of the cities is different in the two dataframes.
* The merging is done over the "city" column (the *key field*). If not specified, `merge()` uses the column names that are common to both dataframes.
* `merge()` performs an *inner join* by default (Izmir and Bursa are excluded because neither of them is found in both dataframes).
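
The key field can also be named explicitly with the `on` parameter, which makes the intent clearer and guards against accidentally joining on some other shared column. A minimal sketch with the same city tables:

```python
import pandas as pd

df1 = pd.DataFrame({"city": ["Istanbul", "Ankara", "Edirne", "Izmir"],
                    "temp": [14, 16, 17, 20]})
df2 = pd.DataFrame({"city": ["Edirne", "Ankara", "Istanbul", "Bursa"],
                    "hum": [39.5, 54.6, 13.7, 20.5]})

# Explicit key: join on the "city" column only.
explicit = pd.merge(df1, df2, on="city")

# Implicit key: pandas picks the columns common to both frames,
# which here is again just "city".
implicit = pd.merge(df1, df2)
```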

To get an outer join, set the parameter `how="outer"`.

In [17]:
pd.merge(df1,df2, how="outer")

Unnamed: 0,city,temp,hum
0,Ankara,16.0,54.6
1,Bursa,,20.5
2,Edirne,17.0,39.5
3,Istanbul,14.0,13.7
4,Izmir,20.0,
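
To see where each row of an outer join came from, `merge()` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below reuses the city tables; the `origin` dictionary is just a convenience built for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"city": ["Istanbul", "Ankara", "Edirne", "Izmir"],
                    "temp": [14, 16, 17, 20]})
df2 = pd.DataFrame({"city": ["Edirne", "Ankara", "Istanbul", "Bursa"],
                    "hum": [39.5, 54.6, 13.7, 20.5]})

# indicator=True appends a categorical "_merge" column.
merged = pd.merge(df1, df2, how="outer", indicator=True)

# Map each city to the source of its row, as plain strings.
origin = merged.set_index("city")["_merge"].astype(str).to_dict()
```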


If the two data frames have different column labels for the city names, we need to explicitly specify the label to join on.

In [18]:
df1 = pd.DataFrame({"city":["Istanbul","Ankara","Edirne", "Izmir"],
                    "temp":[14,16,17,20]})
df1

Unnamed: 0,city,temp
0,Istanbul,14
1,Ankara,16
2,Edirne,17
3,Izmir,20


In [19]:
df2 = pd.DataFrame({"location":["Edirne","Ankara","Istanbul","Bursa"],
                    "hum":[39.5,54.6,13.7,20.5]})
df2

Unnamed: 0,location,hum
0,Edirne,39.5
1,Ankara,54.6
2,Istanbul,13.7
3,Bursa,20.5


In that case, we use the `left_on` and `right_on` parameters to specify which column names are to be used for merging.

In [20]:
pd.merge(df1, df2, left_on="city", right_on="location")

Unnamed: 0,city,temp,location,hum
0,Istanbul,14,Istanbul,13.7
1,Ankara,16,Ankara,54.6
2,Edirne,17,Edirne,39.5


We can then remove one of the duplicate columns with the `.drop()` method.

In [21]:
pd.merge(df1, df2, left_on="city", right_on="location").drop("location",axis=1)

Unnamed: 0,city,temp,hum
0,Istanbul,14,13.7
1,Ankara,16,54.6
2,Edirne,17,39.5
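
An alternative that avoids the duplicate column altogether is to rename the key column before merging. This is only a sketch of one possible approach; the variable name `renamed` is made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"city": ["Istanbul", "Ankara", "Edirne", "Izmir"],
                    "temp": [14, 16, 17, 20]})
df2 = pd.DataFrame({"location": ["Edirne", "Ankara", "Istanbul", "Bursa"],
                    "hum": [39.5, 54.6, 13.7, 20.5]})

# Give the key column the same label in both frames (df2 is not modified).
renamed = df2.rename(columns={"location": "city"})

# Now a plain merge on "city" works and yields a single key column.
merged = pd.merge(df1, renamed, on="city")
```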


## Many-to-one joins

We have a table of employee data:

In [22]:
df1 = pd.DataFrame({"employee":["Ali","Fatma","Meral","Kaan","Ziya","Filiz"],
                    "department":["HR","R&D","HR","Logistics","HR","R&D"],
                    "hire date":[2001,2004,2010,2013,2020,2019]})
df1

Unnamed: 0,employee,department,hire date
0,Ali,HR,2001
1,Fatma,R&D,2004
2,Meral,HR,2010
3,Kaan,Logistics,2013
4,Ziya,HR,2020
5,Filiz,R&D,2019


and a table of department supervisors:

In [23]:
df2 = pd.DataFrame({"department":["HR","R&D","Accounting","Logistics"],
                    "supervisor":["Joe","Mehmet","Julie","Nur"]})
df2

Unnamed: 0,department,supervisor
0,HR,Joe
1,R&D,Mehmet
2,Accounting,Julie
3,Logistics,Nur


Suppose we want to associate every employee with their supervisor. This is called a *many-to-one* join because multiple employees are associated with one supervisor.

`merge()` can make this join automatically:

In [24]:
pd.merge(df1,df2)

Unnamed: 0,employee,department,hire date,supervisor
0,Ali,HR,2001,Joe
1,Fatma,R&D,2004,Mehmet
2,Meral,HR,2010,Joe
3,Kaan,Logistics,2013,Nur
4,Ziya,HR,2020,Joe
5,Filiz,R&D,2019,Mehmet


## Joining by index values

If the merging should be done by index values, we set the `right_index` and/or the `left_index` parameters to True.

In [25]:
df1 = pd.DataFrame({"temp":[14,16,17]}, index=["Istanbul","Ankara","Edirne"])
df1

Unnamed: 0,temp
Istanbul,14
Ankara,16
Edirne,17


In [26]:
df2 = pd.DataFrame({"hum":[39.5,54.6,13.7]}, index=["Istanbul","Ankara","Edirne"])
df2

Unnamed: 0,hum
Istanbul,39.5
Ankara,54.6
Edirne,13.7


In [27]:
pd.merge(df1,df2, right_index=True, left_index=True)

Unnamed: 0,temp,hum
Istanbul,14,39.5
Ankara,16,54.6
Edirne,17,13.7
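
For index-based joins, the `DataFrame.join()` method is a common shorthand. Unlike `merge()`, it performs a left join by default, but with identical indices, as here, both give the same rows. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"temp": [14, 16, 17]}, index=["Istanbul", "Ankara", "Edirne"])
df2 = pd.DataFrame({"hum": [39.5, 54.6, 13.7]}, index=["Istanbul", "Ankara", "Edirne"])

# join() aligns on the index of both frames (left join by default).
joined = df1.join(df2)

# The equivalent merge() call from the text above.
merged = pd.merge(df1, df2, left_index=True, right_index=True)
```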


If we merge one dataframe by its index and the other by a column, we combine `left_index=True` with `right_on` (or `right_index=True` with `left_on`).

In [28]:
df1 = pd.DataFrame({"temp":[14,16,17]}, index=["Istanbul","Ankara","Edirne"])
df1

Unnamed: 0,temp
Istanbul,14
Ankara,16
Edirne,17


In [29]:
df2 = pd.DataFrame({"city":["Istanbul","Ankara","Edirne"], "hum":[39.5,54.6,13.7]})
df2

Unnamed: 0,city,hum
0,Istanbul,39.5
1,Ankara,54.6
2,Edirne,13.7


In [30]:
pd.merge(df1,df2, right_on="city", left_index=True)

Unnamed: 0,temp,city,hum
0,14,Istanbul,39.5
1,16,Ankara,54.6
2,17,Edirne,13.7
