# Merging and Concatenating DataFrames and Series

**Learning Objectives:** Learn how to combine multiple DataFrames using `merge` and `concat` and learn about relationships between DataFrames.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import seaborn as sns

## Introduction to merging

To perform a **merge** or **join**, you need two DataFrames with one or more columns in common, called "key" columns or keys.

* Find common "key" column(s) which will be the merge keys.
* Find the unique values in the merge keys and use `how` to pick which key values will be in the new DataFrame:
  - `inner`: take key values present in both DataFrames.
  - `outer`: take key values present in either DataFrame.
  - `left`/`right`: take all key values present in the left/right DataFrame, whether or not they appear in the other one.
* Build a new DataFrame with all columns from both DFs, but the merge keys just once.
* Use `left_on/right_on` to specify which columns to use as the merge keys or `left_index/right_index` to specify that the index should be used as the merge key.
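
As a minimal sketch of the last point, the example below merges two DataFrames whose key columns have different names, first column to column and then column to index. The frames `left` and `right` and the column names `lkey` and `rkey` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
left = pd.DataFrame({'lkey': list('abc'), 'data1': range(3)})
right = pd.DataFrame({'rkey': list('abd'), 'data2': range(3)})

# Match left.lkey against right.rkey; the default how='inner' keeps 'a' and 'b'.
by_columns = pd.merge(left, right, left_on='lkey', right_on='rkey')
print(by_columns)

# Use the index of the right DataFrame as its merge key instead of a column.
right_indexed = right.set_index('rkey')
by_index = pd.merge(left, right_indexed, left_on='lkey', right_index=True)
print(by_index)
```

When the key columns have different names, both of them are kept in the result; when the right key is the index, only the left key column remains.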

In [None]:
df1 = DataFrame({'key': list('bbacaab'), 'data1': range(7)})
df1

In [None]:
df1.key.unique()

In [None]:
df2 = DataFrame({'key': list('abbd'), 'data2': range(4)})
df2

In [None]:
df2.key.unique()

The default merge method is `how="inner"`, which only includes keys that are in both DataFrames (`ab`): 

In [None]:
pd.merge(df1, df2)

The `how="outer"` approach includes keys that are in either DataFrame (`abcd`):

In [None]:
pd.merge(df1, df2, how='outer')
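
To see which rows of an outer merge found a partner, one option is the `indicator` argument of `pd.merge`. The sketch below rebuilds `df1` and `df2` so that it runs on its own; the variable names `outer` and `unmatched` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': list('bbacaab'), 'data1': range(7)})
df2 = pd.DataFrame({'key': list('abbd'), 'data2': range(4)})

# indicator=True adds a `_merge` column whose value for each row is
# 'left_only', 'right_only', or 'both'.
outer = pd.merge(df1, df2, how='outer', indicator=True)

# Rows whose key has no partner in the other DataFrame.
unmatched = outer.loc[outer['_merge'] != 'both', ['key', '_merge']]
print(unmatched)
```

The unmatched rows are the ones that an inner merge would drop.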

The `how="left"` approach includes every key in the left DataFrame (`abc`), whether or not it appears in the right one:

In [None]:
pd.merge(df1, df2, how='left')

The `how="right"` approach includes every key in the right DataFrame (`abd`), whether or not it appears in the left one:

In [None]:
pd.merge(df1, df2, how='right')
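
The four merge types can be compared side by side. This sketch rebuilds `df1` and `df2` and records, for each value of `how`, which keys survive and how many rows result; the `summary` dictionary is just a convenience for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': list('bbacaab'), 'data1': range(7)})
df2 = pd.DataFrame({'key': list('abbd'), 'data2': range(4)})

# Collect the surviving keys and the row count for each merge type.
summary = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(df1, df2, how=how)
    keys = ''.join(sorted(merged['key'].unique()))
    summary[how] = (keys, len(merged))

for how, (keys, n_rows) in summary.items():
    print(f'{how:>5}: keys={keys} rows={n_rows}')
```

The row counts can exceed the size of either input because repeated keys are paired in every combination: `b` appears three times in `df1` and twice in `df2`, so it alone contributes six rows.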

## Relationships between DataFrames

When you have multiple DataFrames that have common keys you can have **relationships** between the entities in the different DataFrames. There are three types of entity relationships that are possible:

* 1-to-1
* 1-to-many
* many-to-many

Here is a small data set from the TV show [The Simpsons]() to illustrate these relationships.

First, here is a DataFrame with students' first and last names, indexed by a unique student id:

In [None]:
students = DataFrame({'fname': ['Bart','Lisa','Milhouse'],
                      'lname': ['Simpson','Simpson','Van Houten']},
                     index=list('abc'))
students

Here is a DataFrame with the student social security numbers, indexed by their unique student id:

In [None]:
ssns = DataFrame({'ssn':[1234,5678,9101]}, index=list('abc'))
ssns

Each student can have aliases or nicknames:

In [None]:
aliases = DataFrame({'alias':['Bartman','Bartron','Cosmos','Truth Teller','Lady Penelope Ariel',
                              'Jake Boyman','Lou La Trec','Eagle Eye','Maestro'],
                     'student': list('aaabbbccc')})
aliases

Here are the student home addresses:

In [None]:
addresses = DataFrame({'address':['742 Evergreen Terrace','742 Evergreen Terrace','316 Pikeland Ave.']},
                      index=list('abc'))
addresses

A table of courses the students can be enrolled in:

In [None]:
courses = DataFrame({'name':['Biology','Math','PE','Underwater electronics']}, index=range(4))
courses

This table contains the enrollment for each course. Each row pairs a student id (the `student` column) with a course id (the index).

In [None]:
enroll = DataFrame({'student':['a','b','b','c','c','c']},index=(2,0,1,0,1,2))
enroll

## 1-1 relationships

* Each student has exactly one SSN.
* Each SSN belongs to exactly one student.

Here we are merging on the index of both DataFrames, so we use `left_index` and `right_index`:

In [None]:
pd.merge(students, ssns, left_index=True, right_index=True)
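
If the data is expected to be 1-to-1, `pd.merge` can check that for you through its `validate` argument. The sketch below rebuilds `students` and `ssns`; the `duplicated_ssns` table is hypothetical bad data invented to trigger the check.

```python
import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))
ssns = pd.DataFrame({'ssn': [1234, 5678, 9101]}, index=list('abc'))

# validate='one_to_one' checks that the merge keys are unique on both sides.
checked = pd.merge(students, ssns, left_index=True, right_index=True,
                   validate='one_to_one')

# Hypothetical bad data: student 'a' appears twice, so the check fails.
duplicated_ssns = pd.DataFrame({'ssn': [1234, 4321, 5678]}, index=list('aab'))
try:
    pd.merge(students, duplicated_ssns, left_index=True, right_index=True,
             validate='one_to_one')
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False

print(is_one_to_one)
```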

When the merge is on the index of both DataFrames, we can also use the `.join()` method of the left DataFrame:

In [None]:
students.join(ssns)
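
One difference worth keeping in mind is that `pd.merge` defaults to `how='inner'`, while `DataFrame.join` defaults to `how='left'`. The two give the same result here only because every student has an SSN. The sketch below uses a hypothetical `partial_ssns` table that is missing student `c` to show the difference.

```python
import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))

# Hypothetical table with no SSN recorded for student 'c'.
partial_ssns = pd.DataFrame({'ssn': [1234, 5678]}, index=list('ab'))

# merge defaults to an inner join: student 'c' is dropped.
inner_result = pd.merge(students, partial_ssns, left_index=True, right_index=True)

# join defaults to a left join: student 'c' is kept with a missing ssn.
left_result = students.join(partial_ssns)

# Passing how='inner' to join matches the merge result.
same_as_merge = students.join(partial_ssns, how='inner')
```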

## 1-many relationships

### Students and addresses

* Each student has exactly one address.
* Each address can have many students.

In [None]:
pd.merge(students, addresses, left_index=True, right_index=True)
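
To look at the relationship from the address side, the merged table can be grouped by address. This is a small sketch that rebuilds `students` and `addresses`; the name `residents` is illustrative.

```python
import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))
addresses = pd.DataFrame({'address': ['742 Evergreen Terrace',
                                      '742 Evergreen Terrace',
                                      '316 Pikeland Ave.']},
                         index=list('abc'))

merged = pd.merge(students, addresses, left_index=True, right_index=True)

# Collect the first names of the students living at each address.
residents = merged.groupby('address')['fname'].apply(list)
print(residents)
```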

### Students and aliases

* Each student can have many aliases.
* Each alias belongs to exactly one student.

Here we are joining on the left DataFrame's index and the right DataFrame's `student` column:

In [None]:
pd.merge(students, aliases, left_index=True, right_on='student').set_index('student')

## Many-many relationships

* A student can take multiple classes.
* A single class can have multiple students.

In [None]:
m1 = pd.merge(students, enroll, left_index=True, right_on='student')
m1

In [None]:
pd.merge(m1, courses, left_index=True, right_index=True).sort_values('student')

In [None]:
pd.merge(m1, courses, left_index=True, right_index=True, how='outer').sort_values('student')
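
The merges above rely on the course id living in the index of `enroll` and `courses`. As an alternative sketch, the ids can be turned into ordinary columns so that every merge key is named explicitly. The column name `course_id` and the `*_cols` variables are assumptions made for this illustration.

```python
import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))
courses = pd.DataFrame({'name': ['Biology', 'Math', 'PE',
                                 'Underwater electronics']}, index=range(4))
enroll = pd.DataFrame({'student': ['a', 'b', 'b', 'c', 'c', 'c']},
                      index=(2, 0, 1, 0, 1, 2))

# Turn each index into a named column (the names are illustrative).
students_cols = students.rename_axis('student').reset_index()
courses_cols = courses.rename_axis('course_id').reset_index()
enroll_cols = enroll.rename_axis('course_id').reset_index()

# enroll_cols is the link table: one row per (course_id, student) pair.
step1 = pd.merge(students_cols, enroll_cols, on='student')
full = pd.merge(step1, courses_cols, on='course_id', how='outer')

# count() skips missing values, so a course with no students gets 0.
per_course = full.groupby('name')['student'].count()
print(per_course)
```

The outer merge keeps the course that nobody is enrolled in, which is why it shows up in the count with a value of zero.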

## Introduction to concatenation

Concatenation is closely related to merging and can be done on sets of `Series` or `DataFrames`. The basic idea is that `concat` simply stacks the different objects along a particular axis.

Here are three `Series`:

In [None]:
s1 = Series(range(5))
s2 = Series(range(5,10))
s3 = Series(range(10,15))

The default concatenation is along `axis=0`, which stacks the Series on top of each other. Notice how the indices of the different Series are preserved.

In [None]:
pd.concat([s1, s2, s3])

If we pass `ignore_index=True`, the indices for each component are discarded and a new index is created:

In [None]:
pd.concat([s1, s2, s3], ignore_index=True)
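
The practical consequence of preserving the original indices is that labels repeat. The sketch below rebuilds the three Series and compares label lookups with and without `ignore_index`; the names `stacked` and `renumbered` are illustrative.

```python
import pandas as pd

s1 = pd.Series(range(5))
s2 = pd.Series(range(5, 10))
s3 = pd.Series(range(10, 15))

# With the default settings, each label 0..4 appears three times.
stacked = pd.concat([s1, s2, s3])
print(stacked.index.is_unique)
print(stacked.loc[0].tolist())   # one value from each input Series

# With ignore_index=True, the result gets a fresh 0..14 index.
renumbered = pd.concat([s1, s2, s3], ignore_index=True)
print(renumbered.index.is_unique)
print(renumbered.loc[0])         # a single value
```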

If `axis=1` is set, the different objects are put side by side. In this case, the original Series have the same indices, and the final DataFrame inherits that index:

In [None]:
pd.concat([s1,s2], axis=1)

However, if the different objects have different indices, the final DataFrame will have NaNs where the indices don't overlap:

In [None]:
s1.index=list('abcde')

In [None]:
pd.concat([s1,s2], axis=1)
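
How non-overlapping labels are handled can be controlled with the `join` argument of `pd.concat`. The sketch below uses two hypothetical Series, `x` and `y`, whose indices overlap only on `c`, `d`, and `e`.

```python
import pandas as pd

# Hypothetical Series whose indices only partly overlap ('c', 'd', 'e').
x = pd.Series(range(5), index=list('abcde'), name='x')
y = pd.Series(range(5, 10), index=list('cdefg'), name='y')

# Default join='outer': union of the labels, NaN where a Series has no value.
outer = pd.concat([x, y], axis=1)
print(outer)

# join='inner': only the labels present in every input are kept.
inner = pd.concat([x, y], axis=1, join='inner')
print(inner)
```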

The `concat` function also works on DataFrames. Here we are stacking the `students` and `addresses` DataFrames on top of each other. This doesn't make much sense conceptually; the point is that `concat` is not "smart" in any way.

In [None]:
pd.concat([students, addresses])
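
When unrelated DataFrames are stacked like this, the `keys` argument of `pd.concat` can at least record which input each row came from. This is a sketch that rebuilds both DataFrames; the labels `'students'` and `'addresses'` passed to `keys` are arbitrary choices.

```python
import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))
addresses = pd.DataFrame({'address': ['742 Evergreen Terrace',
                                      '742 Evergreen Terrace',
                                      '316 Pikeland Ave.']},
                         index=list('abc'))

# keys adds an outer index level naming the source of each row.
labeled = pd.concat([students, addresses], keys=['students', 'addresses'])
print(labeled)

# Select the rows that came from one input via the outer level.
from_addresses = labeled.loc['addresses']
print(from_addresses)
```

Columns that an input does not have are filled with NaN, so the rows from `addresses` have no `fname` or `lname`.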

Using `axis=1` in this case provides a meaningful way of combining the `students` and `ssns` DataFrames:

In [None]:
pd.concat([students, ssns], axis=1)

More than two DataFrames can be concatenated in a single call. This doesn't work with `merge`, which combines exactly two DataFrames at a time.

In [None]:
pd.concat([students, ssns, addresses], axis=1)
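
Because `pd.merge` takes two DataFrames per call, one way to combine several with it is to chain the calls. The sketch below does this with `functools.reduce` and compares the result with the `concat` version; the helper `merge_on_index` is hypothetical and written only for this example.

```python
from functools import reduce

import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))
ssns = pd.DataFrame({'ssn': [1234, 5678, 9101]}, index=list('abc'))
addresses = pd.DataFrame({'address': ['742 Evergreen Terrace',
                                      '742 Evergreen Terrace',
                                      '316 Pikeland Ave.']},
                         index=list('abc'))


def merge_on_index(left, right):
    """Hypothetical helper: inner merge of two DataFrames on their indices."""
    return pd.merge(left, right, left_index=True, right_index=True)


# Apply the pairwise merge from left to right across the whole list.
chained = reduce(merge_on_index, [students, ssns, addresses])

# The single-call equivalent with concat.
wide = pd.concat([students, ssns, addresses], axis=1)

print(chained)
```

The two results agree here because all three DataFrames share the same index; with differing indices, the inner merges would drop rows that `concat` keeps.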