# Combining Datasets: Concat and Merge

Combining datasets is a routine task. Pandas provides several functions for it, chiefly `pd.concat()` for simple stacking and `pd.merge()` for database-style joins.

In [None]:
import pandas as pd
import numpy as np

## Simple Concatenation with `pd.concat`

`pd.concat` can be used for a simple concatenation of `Series` or `DataFrame` objects.

In [None]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

In [None]:
# example DataFrame
make_df('ABC', range(3))

In [None]:
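# Concatenating two DataFrames row-wise; their columns differ ('AB' vs 'CD'),
# so the result holds the union of columns with NaN where a value is missing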
df1 = make_df('AB', [1, 2])
df2 = make_df('CD', [3, 4])
pd.concat([df1, df2])

By default, `concat` works row-wise (`axis=0`). You can specify `axis=1` to concatenate column-wise.

In [None]:
# Concatenate along columns
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
pd.concat([df3, df4], axis=1)
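
When the inputs do not share all of their columns, `pd.concat` keeps the union of the columns by default and fills the gaps with `NaN`; passing `join='inner'` keeps only the columns common to every input. The sketch below illustrates this with two small made-up frames (the names `left` and `right` are illustrative only):

```python
import pandas as pd

# Two small illustrative frames that share only column 'B'
left = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
right = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4']}, index=[3, 4])

# Default (join='outer'): union of columns, missing entries become NaN
outer = pd.concat([left, right])

# join='inner': keep only the columns present in every input
inner = pd.concat([left, right], join='inner')

print(outer)
print(inner)
```

The outer result is usually the safer default because no data is dropped, while the inner result is handy when only the shared columns matter downstream.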

## Combining Datasets: Merge and Join

For more complex, database-style merging, Pandas provides `pd.merge()`.
This is the entry point for all standard database join operations.

### Categories of Joins

`pd.merge()` implements one-to-one, one-to-many, and many-to-many joins.

In [None]:
# One-to-one join
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

In [None]:
df1

In [None]:
df2

In [None]:
df3 = pd.merge(df1, df2)
df3

In [None]:
# One-to-many join
# Each `group` key is unique in `df4`, but can match several rows in `df3`
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
print(df4)
pd.merge(df3, df4)
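
The many-to-many case arises when the key column contains duplicates on both sides: each left row is paired with every matching right row. The following is a minimal sketch with made-up data (the frames `employees` and `group_skills` are illustrative). It also shows the `validate` argument of `pd.merge`, which raises `pd.errors.MergeError` when the keys do not have the relationship you expected:

```python
import pandas as pd

# 'Engineering' repeats on the left; every group repeats on the right
employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering',
                                    'Engineering', 'HR']})
group_skills = pd.DataFrame({'group': ['Accounting', 'Accounting',
                                       'Engineering', 'Engineering',
                                       'HR', 'HR'],
                             'skills': ['math', 'spreadsheets', 'coding',
                                        'linux', 'spreadsheets',
                                        'organization']})

# Many-to-many: each employee is paired with every skill of their group
many_to_many = pd.merge(employees, group_skills)
print(many_to_many)

# Asking pandas to check for a one-to-one relationship fails here,
# because the keys are duplicated
try:
    pd.merge(employees, group_skills, validate='one_to_one')
    validation_error = None
except pd.errors.MergeError as err:
    validation_error = err
    print(f'MergeError: {err}')
```

Using `validate` is optional, but it turns a silent row explosion into an explicit error when a key you believed to be unique is not.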

### Specification of the Merge Key

You can explicitly specify the key column to merge on using the `on` keyword.

In [None]:
# Merging on a specific key
pd.merge(df1, df2, on='employee') # use a list if specifying multiple

If the column names are different in the two dataframes, you can use `left_on` and `right_on`.

In [None]:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
pd.merge(df1, df3, left_on="employee", right_on="name")
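
Merging with `left_on` and `right_on` keeps both key columns, so the result carries the same information twice. A common follow-up, sketched here with illustrative frames named `staff` and `pay`, is to drop the redundant column:

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                      'group': ['Accounting', 'Engineering',
                                'Engineering', 'HR']})
pay = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# Both 'employee' and 'name' survive the merge
merged = pd.merge(staff, pay, left_on='employee', right_on='name')

# Drop the duplicate key column
tidy = merged.drop(columns='name')
print(tidy)
```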

### Specifying Set Arithmetic for Joins

The `how` keyword controls what happens to keys that appear in only one of the two DataFrames. The default is an `inner` join, which keeps only the keys present in both.

In [None]:
df6 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'Marketing'],
                    'skills': ['math', 'coding', 'communication']})
# Inner join is the default: keep only rows whose key values match in both
# DataFrames (across every key column, if `on` is a list)
pd.merge(df1, df6, on='group')

In [None]:
# Outer join returns all entries from both, filling missing with NaN
pd.merge(df1, df6, how='outer')

In [None]:
# Left join returns all entries from the left dataframe
pd.merge(df1, df6, how='left')
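
For completeness, `how='right'` mirrors the left join, and `indicator=True` adds a `_merge` column recording where each row came from. The sketch below uses illustrative frames named `people` and `skills`:

```python
import pandas as pd

people = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                       'group': ['Accounting', 'Engineering',
                                 'Engineering', 'HR']})
skills = pd.DataFrame({'group': ['Accounting', 'Engineering', 'Marketing'],
                       'skills': ['math', 'coding', 'communication']})

# Right join: keep every row of `skills`; 'Marketing' has no employee
right_join = pd.merge(people, skills, how='right')
print(right_join)

# Outer join with an indicator column: 'both', 'left_only' or 'right_only'
audit = pd.merge(people, skills, how='outer', indicator=True)
print(audit[['employee', 'group', '_merge']])
```

The indicator column is a convenient way to audit which keys failed to match before deciding on a join type.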

### Overlapping Column Names: The `suffixes` Keyword

If both DataFrames contain a column with the same name that is not a join key, the output would have conflicting column names. By default Pandas appends `_x` and `_y` to tell them apart; the `suffixes` keyword lets you choose your own suffixes.

In [None]:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [4, 3, 2, 1]})
# The suffixes are applied to every overlapping column that is not a join key
pd.merge(df8, df9, on="name", suffixes=["_left", "_right"])

### Verifying Non-Duplicate Indices

When concatenating or merging DataFrames, it is important to check that the resulting index is still unique. `pd.concat` preserves the original index labels even when they repeat, and duplicate labels can lead to subtle bugs and incorrect results. Pandas provides several ways to verify that an index has no duplicate values.

In [None]:
df_a = make_df('AB', [0, 1])
df_b = make_df('AB', [1, 2])

print(df_a)
print(df_b)
print("---")

bad_concat = pd.concat([df_a, df_b])
print(bad_concat)
print("---")

print(f'Is the index unique? {bad_concat.index.is_unique}')
print(f'Duplicated entries:\n{bad_concat[bad_concat.index.duplicated()]}')

Here `index.is_unique` is `False` because the label `1` appears in both inputs. The `index.duplicated()` method returns a boolean NumPy array that marks each repeated label after its first occurrence, so indexing with it shows only the second row labeled `1`.
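
Beyond checking after the fact, `pd.concat` itself offers options for duplicate labels: `verify_integrity=True` raises a `ValueError` on overlap, `ignore_index=True` discards the old labels and renumbers, and `keys` adds an outer index level naming each source. A minimal sketch with illustrative frames `top` and `bottom`:

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
bottom = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])

# Option 1: fail loudly when the indices overlap
error_message = None
try:
    pd.concat([top, bottom], verify_integrity=True)
except ValueError as err:
    error_message = str(err)
    print(f'ValueError: {error_message}')

# Option 2: throw away the old labels and renumber 0..n-1
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered)

# Option 3: keep the old labels but add an outer level naming the source
labeled = pd.concat([top, bottom], keys=['top', 'bottom'])
print(labeled)
```

Which option fits depends on whether the original labels carry meaning: renumbering is simplest when they do not, while `keys` preserves them without ambiguity.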

### Example: US States Data

Let's combine some data about US states to see a more realistic example.

In [None]:
# Data from https://github.com/jakevdp/PythonDataScienceHandbook/
pop = pd.read_csv('../data/state-population.csv')
areas = pd.read_csv('../data/state-areas.csv')
abbrevs = pd.read_csv('../data/state-abbrevs.csv')

# Let's merge the population and abbreviation data
merged = pd.merge(pop, abbrevs, how='outer',
                  left_on='state/region', right_on='abbreviation')
print(merged)

# Drop the duplicate info
merged = merged.drop(columns='abbreviation')
merged.head()

In [None]:
# Now merge with the area data
final = pd.merge(merged, areas, on='state', how='left')
final.head()
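
Outer and left merges like the ones above can leave rows with missing values wherever a key had no match, so it is worth checking for nulls before computing anything. The sketch below does not read the real CSV files; it uses tiny made-up stand-ins (`pop_small`, `abbrevs_small`, with invented population numbers) in which one region code deliberately has no abbreviation entry:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the population and abbreviation tables
pop_small = pd.DataFrame({'state/region': ['AL', 'AK', 'PR'],
                          'year': [2010, 2010, 2010],
                          'population': [100, 50, 30]})  # made-up numbers
abbrevs_small = pd.DataFrame({'state': ['Alabama', 'Alaska'],
                              'abbreviation': ['AL', 'AK']})

merged_small = pd.merge(pop_small, abbrevs_small, how='outer',
                        left_on='state/region', right_on='abbreviation')
merged_small = merged_small.drop(columns='abbreviation')

# Which columns contain at least one null?
null_columns = merged_small.isnull().any()
print(null_columns)

# Which region codes failed to find a full state name?
unmatched = merged_small.loc[merged_small['state'].isnull(),
                             'state/region'].unique()
print(unmatched)
```

Running the same two checks on `merged` and `final` would show whether any region codes or state names in the real data failed to match.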

In [None]:
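# Let's calculate population density in 2010.
# The population file has one row per age bracket in its 'ages' column
# ('total' and 'under18'), so keep only the all-ages rows.
final_2010 = final[(final['year'] == 2010) & (final['ages'] == 'total')]
# Reassign instead of using inplace=True to avoid modifying a slice of `final`
final_2010 = final_2010.set_index('state')
density = final_2010['population'] / final_2010['area (sq. mi)']
density = density.sort_values(ascending=False)
density.head()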