# Set Operations on DataFrames

To start, let's define what "set" operations are. "Set" here does not mean Python's `set` data type; it means **set algebra**: https://en.wikipedia.org/wiki/Algebra_of_sets

**Operations**
* UNION
* INTERSECTION
* MINUS / EXCEPT
* COMPLEMENT

**Relations**
* EQUALITY
* INCLUSION

## OPERATIONS

In [1]:
import pandas as pd
family = pd.DataFrame([['Paul','M'],['Anny','F'],['Sarahlynn','F'],['Jim','M']], columns=['Name','Gender'])
kirkwood = pd.DataFrame([['Paul','M'],['Anny','F'],['Sarahlynn','F'],['Rob','M']], columns=['Name','Gender'])

family.set_index('Name', inplace=True)
kirkwood.set_index('Name', inplace=True)

In [2]:
family

Name,Gender
Paul,M
Anny,F
Sarahlynn,F
Jim,M


In [3]:
kirkwood

Name,Gender
Paul,M
Anny,F
Sarahlynn,F
Rob,M


In [4]:
# UNION: an outer join on the index keeps every name found in either frame
pd.concat([family, kirkwood], join='outer', axis=1, sort=False)

,Gender,Gender
Paul,M,M
Anny,F,F
Sarahlynn,F,F
Jim,M,
Rob,,M
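
The outer join above places the two `Gender` columns side by side. If the goal is instead a single frame with one row per name, one possible sketch is to stack the rows and drop repeated index labels. It assumes that rows sharing a name carry the same data in both frames, so keeping the first occurrence loses nothing; `stacked` and `row_union` are names introduced here for illustration.

```python
import pandas as pd

family = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"], ["Jim", "M"]],
    columns=["Name", "Gender"],
).set_index("Name")
kirkwood = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"], ["Rob", "M"]],
    columns=["Name", "Gender"],
).set_index("Name")

# Stack the rows of both frames (axis=0 is the default).
stacked = pd.concat([family, kirkwood])

# Keep only the first occurrence of each index label.
# Assumption: duplicate names hold identical data in both frames.
row_union = stacked[~stacked.index.duplicated(keep="first")]
print(row_union)
```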


In [5]:
# INTERSECTION: an inner join on the index keeps only names present in both frames
pd.concat([family, kirkwood], join='inner', axis=1, sort=False)

Name,Gender,Gender
Paul,M,M
Anny,F,F
Sarahlynn,F,F
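
The same intersection can also be computed on the index alone and then used to select rows, which mirrors the way MINUS is done with `Index.difference`. This is a minimal sketch, not the only way to do it; `common` and `both` are illustrative names.

```python
import pandas as pd

family = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"], ["Jim", "M"]],
    columns=["Name", "Gender"],
).set_index("Name")
kirkwood = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"], ["Rob", "M"]],
    columns=["Name", "Gender"],
).set_index("Name")

# Labels present in both indexes.
common = family.index.intersection(kirkwood.index)

# Select those rows from one of the frames; only one Gender column results.
both = family.loc[common]
print(both)
```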


In [6]:
# MINUS: names in family that are not in kirkwood
family.loc[family.index.difference(kirkwood.index)]

Name,Gender
Jim,M


In [8]:
# MINUS in the other direction: names in kirkwood that are not in family
kirkwood.loc[kirkwood.index.difference(family.index)]

Name,Gender
Rob,M
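
Both directions of MINUS, plus the intersection, can be read off a single outer merge with `indicator=True`. The sketch below is an alternative to the index-based approach; note that it compares whole rows (`Name` and `Gender` together) rather than index labels only. `merged` and the three result lists are names chosen for illustration.

```python
import pandas as pd

family = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"], ["Jim", "M"]],
    columns=["Name", "Gender"],
)
kirkwood = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"], ["Rob", "M"]],
    columns=["Name", "Gender"],
)

# The outer merge keeps every row; the _merge column records where it came from.
merged = family.merge(kirkwood, on=["Name", "Gender"], how="outer", indicator=True)

only_family = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
only_kirkwood = merged.loc[merged["_merge"] == "right_only", "Name"].tolist()
in_both = merged.loc[merged["_merge"] == "both", "Name"].tolist()

print(only_family, only_kirkwood, in_both)
```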


**COMPLEMENT** isn't really a useful concept for DataFrames, because there is no "entire universe of possible values" to take the complement against.
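
If a universe is chosen explicitly, a complement reduces to an ordinary MINUS against it. The sketch below assumes, purely for illustration, that the universe is the union of the two name lists; that choice is not part of the original example.

```python
import pandas as pd

family_names = pd.Index(["Paul", "Anny", "Sarahlynn", "Jim"], name="Name")
kirkwood_names = pd.Index(["Paul", "Anny", "Sarahlynn", "Rob"], name="Name")

# Assumption: the "universe" is every name seen in either index.
universe = family_names.union(kirkwood_names)

# Complement of family relative to that universe.
not_in_family = universe.difference(family_names)
print(not_in_family)
```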


## RELATIONS

In [9]:
import pandas as pd
family = pd.DataFrame([['Paul','M'],['Anny','F'],['Sarahlynn','F'],['Jim','M']], columns=['Name','Gender'])
kirkwood_family = pd.DataFrame([['Paul','M'],['Anny','F'],['Sarahlynn','F']], columns=['Name','Gender'])

family.set_index('Name', inplace=True)
kirkwood_family.set_index('Name', inplace=True)

In [10]:
family

Name,Gender
Paul,M
Anny,F
Sarahlynn,F
Jim,M


In [11]:
kirkwood_family

Name,Gender
Paul,M
Anny,F
Sarahlynn,F


In [12]:
# Test for equality
family.index.equals(kirkwood_family.index)

False

In [13]:
# Element-wise == needs indexes of equal length, so this comparison raises
family.index == kirkwood_family.index

ValueError: Lengths must match to compare
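
One caveat worth illustrating: `Index.equals` compares elements in order, so two indexes holding the same names in a different order are not "equal" by that test. For set-style equality, one option is to check that the symmetric difference is empty. This is a sketch that assumes the labels are unique; `a`, `b`, and `same_members` are illustrative names.

```python
import pandas as pd

a = pd.Index(["Paul", "Anny", "Sarahlynn"])
b = pd.Index(["Sarahlynn", "Paul", "Anny"])

# Order-sensitive comparison: same members, different order.
ordered_equal = a.equals(b)

# Set-style comparison: no label belongs to only one of the two indexes.
same_members = a.symmetric_difference(b).empty

print(ordered_equal, same_members)
```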

In [14]:
# Test for inclusion: kirkwood_family is a subset of family if nothing is left after subtracting family
len(kirkwood_family.index.difference(family.index)) == 0

True
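
The inclusion test can also be written with `Index.isin`, which has the side benefit of showing which labels break the inclusion when it fails. A minimal sketch, with `is_subset`, `is_superset_reversed`, and `missing` as illustrative names:

```python
import pandas as pd

family = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"], ["Jim", "M"]],
    columns=["Name", "Gender"],
).set_index("Name")
kirkwood_family = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"]],
    columns=["Name", "Gender"],
).set_index("Name")

# Every label of kirkwood_family appears in family.
is_subset = kirkwood_family.index.isin(family.index).all()

# The reverse does not hold; the boolean mask shows which labels are missing.
in_kirkwood = family.index.isin(kirkwood_family.index)
is_superset_reversed = in_kirkwood.all()
missing = family.index[~in_kirkwood]

print(is_subset, is_superset_reversed, list(missing))
```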

In [15]:
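# Same inclusion test when the names differ only in letter case:
# build a lowercase key column and index on that instead
import pandas as pd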
family = pd.DataFrame([['Paul','M'],['Anny','F'],['Sarahlynn','F'],['Jim','M']], columns=['Name','Gender'])
kirkwood_family = pd.DataFrame([['PAUL','M'],['ANNY','F'],['SARAHLYNN','F']], columns=['Name','Gender'])


In [16]:
family['n'] = family['Name'].str.lower()

In [17]:
kirkwood_family['n'] = kirkwood_family['Name'].str.lower()

In [18]:
family.set_index('n', inplace=True)
kirkwood_family.set_index('n', inplace=True)

In [19]:
family

n,Name,Gender
paul,Paul,M
anny,Anny,F
sarahlynn,Sarahlynn,F
jim,Jim,M


In [20]:
# Inclusion test on the lowercase key
len(kirkwood_family.index.difference(family.index)) == 0

True

In [21]:
# The difference itself is empty, which is why the test returns True
kirkwood_family.index.difference(family.index)

Index([], dtype='object', name='n')
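
The normalize-then-compare steps above could be wrapped in a small helper. The function below is a hypothetical sketch, not part of pandas: `is_subset_ignoring_case` and its `column` parameter are names invented here, and it uses `str.casefold()` rather than `str.lower()` as a slightly more aggressive normalization.

```python
import pandas as pd


def is_subset_ignoring_case(small: pd.DataFrame, big: pd.DataFrame, column: str = "Name") -> bool:
    """Return True if every value of small[column] appears in big[column], ignoring case."""
    small_keys = pd.Index(small[column].str.casefold())
    big_keys = pd.Index(big[column].str.casefold())
    # Inclusion holds when nothing remains after subtracting the bigger key set.
    return small_keys.difference(big_keys).empty


family = pd.DataFrame(
    [["Paul", "M"], ["Anny", "F"], ["Sarahlynn", "F"], ["Jim", "M"]],
    columns=["Name", "Gender"],
)
kirkwood_family = pd.DataFrame(
    [["PAUL", "M"], ["ANNY", "F"], ["SARAHLYNN", "F"]],
    columns=["Name", "Gender"],
)

print(is_subset_ignoring_case(kirkwood_family, family))
print(is_subset_ignoring_case(family, kirkwood_family))
```

Unlike the cells above, this leaves the original frames untouched, since the key is built on the fly instead of being stored as a new index.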