# Section 6: Working with Multiple DataFrames

In [1]:
import pandas as pd
import numpy as np

## Introducing Five New Datasets

In this section, we will be working with data describing salary information for US colleges by major (field of study) and region. Each of these data sources stands separately, and it will be our job to piece them together with the methods we will learn in this section. 

We will begin by defining local variables that point to the URLs for the data.

In [2]:
# Dataset URL Sources
eng_url = 'https://andybek.com/pandas-eng'
state_url = 'https://andybek.com/pandas-state'
party_url = 'https://andybek.com/pandas-party'
liberal_url = 'https://andybek.com/pandas-liberal'
ivies_url = 'https://andybek.com/pandas-ivies'


Let's begin by reading in the engineering school salaries. This gives a list of 19 engineering schools in the US and their starting and mid-career median salaries.

In [3]:
eng = pd.read_csv(eng_url)

In [4]:
eng.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00"


Now let's read in the state school data, which describes public universities.

In [5]:
state = pd.read_csv(state_url)

In [6]:
state.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


Next, we'll read in the party data. These are schools that the Wall Street Journal categorizes as "party schools" with heavy alcohol use.

In [7]:
party = pd.read_csv(party_url)

In [8]:
party.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,University of Illinois at Urbana-Champaign (UIUC),Party,"$52,900.00","$96,100.00"
1,"University of Maryland, College Park",Party,"$52,000.00","$95,000.00"
2,"University of California, Santa Barbara (UCSB)",Party,"$50,500.00","$95,000.00"
3,University of Texas (UT) - Austin,Party,"$49,700.00","$93,900.00"
4,State University of New York (SUNY) at Albany,Party,"$44,500.00","$92,200.00"


Next up we have liberal arts schools, which emphasize rational thinking and first principles over technical knowledge.

In [9]:
liberal = pd.read_csv(liberal_url)

In [10]:
liberal.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00"
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00"


Last but not least, let's take a look at the Ivy League schools - a group of highly prestigious and highly selective schools in the American northeast.

In [11]:
ivies = pd.read_csv(ivies_url)

In [12]:
ivies.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"


## Concatenating DataFrames with `pd.concat()`

In the previous section, we read in 5 dataframes that share the same structure but are stored separately. Let's check the shape of each dataframe to confirm that they are shaped similarly.

In [13]:
dfs = [state, eng, liberal, ivies, party]

In [14]:
for df in dfs:
  print(df.shape)

(175, 4)
(19, 4)
(47, 4)
(8, 4)
(20, 4)


So we have 5 two-dimensional dataframes, each with a different number of schools but the same four columns. 

As a first step, let's put them all together, a process called **concatenation**. Suppose we want to concatenate the Ivies and engineering schools. To do this, we can call the `pd.concat()` function, passing in a list of the dataframes that we want to concatenate.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

In [15]:
pd.concat([ivies, eng])

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


Let's check the shape of the concatenated dataframe.

In [16]:
pd.concat([ivies, eng]).shape

(27, 4)
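
The same arithmetic can be checked offline with two small made-up frames. This is only a sketch: the names `toy_a` and `toy_b` and their values are illustrative and not part of the course data.

```python
import pandas as pd

# Toy frames standing in for two of the school datasets (values are made up)
toy_a = pd.DataFrame({"School Name": ["A College", "B College"],
                      "School Type": ["Ivy League", "Ivy League"]})
toy_b = pd.DataFrame({"School Name": ["C Tech", "D Tech", "E Tech"],
                      "School Type": ["Engineering"] * 3})

# Stack the rows of toy_b underneath the rows of toy_a
combined = pd.concat([toy_a, toy_b])

print(combined.shape)  # (5, 2): row counts add up, columns are unchanged
```

Because both frames share the same columns, the result keeps those two columns and simply has 2 + 3 rows.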

Let's try concatenating all of these dataframes!

In [17]:
pd.concat(dfs)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
15,University of New Hampshire (UNH),Party,"$41,800.00","$78,300.00"
16,West Virginia University (WVU),Party,"$43,100.00","$78,100.00"
17,University of Tennessee,Party,"$43,800.00","$74,600.00"
18,Ohio University,Party,"$42,200.00","$73,400.00"


We've now concatenated all of the dataframes into one master frame!



In [18]:
pd.concat(dfs).head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


But there's a problem with this - these datasets are NOT mutually exclusive. Some schools are in more than one dataframe - for example, most party schools are also state schools.

In [19]:
pd.concat(dfs)["School Name"].value_counts()

University of Iowa (UI)              2
University of New Hampshire (UNH)    2
University of Mississippi            2
University of Texas (UT) - Austin    2
University of Florida (UF)           2
                                    ..
St. Olaf College                     1
Boise State University (BSU)         1
East Carolina University (ECU)       1
Cleveland State University           1
Utah State University                1
Name: School Name, Length: 249, dtype: int64

To see this more clearly, let's use the `set()` function and the `difference()` method to compare the party and state school name lists. Remember the following Python fundamentals:
* sets do not allow duplicate values. Thus, we get a collection of unique values when we call the `set()` function.
* the `difference()` method returns the values that are in the first collection but not in the other

In [20]:
set(party["School Name"]).difference(state['School Name'])

{'Randolph-Macon College'}

And thus we see that Randolph-Macon College is the only party school that is not also a state school. For those who care, Randolph-Macon College is a private party school. It's also a liberal arts school.

In [21]:
'Randolph-Macon College' in liberal["School Name"].values

True

Thus, all party schools are either state schools, liberal arts schools, or both. 

Just for kicks, let's see if any engineering schools are also party schools using the `intersection()` set method.

In [22]:
set(eng["School Name"]).intersection(party['School Name'])

set()

Guess not.
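
One detail worth keeping in mind is that `difference()` is directional, while `intersection()` is symmetric. A minimal sketch with made-up school names (not taken from the course data):

```python
# Toy name sets (made up) to show that difference() depends on the order
party_names = {"Alpha State", "Beta State", "Gamma College"}
state_names = {"Alpha State", "Beta State", "Delta State"}

only_party = party_names.difference(state_names)  # in party, not in state
only_state = state_names.difference(party_names)  # in state, not in party
both = party_names.intersection(state_names)      # present in both sets

print(only_party)    # {'Gamma College'}
print(only_state)    # {'Delta State'}
print(sorted(both))  # ['Alpha State', 'Beta State']
```

Both methods accept any iterable as their argument, which is why a pandas Series can be passed in directly without wrapping it in `set()` first.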

Another way to check for duplication is to use the `duplicated()` method on the concatenated dataframe. Remember that this generates a boolean Series, which we can use to make a selection.

In [23]:
pd.concat(dfs).duplicated(subset = ['School Name'], keep = 'first')

0     False
1     False
2     False
3     False
4     False
      ...  
15     True
16     True
17     True
18     True
19     True
Length: 269, dtype: bool

In [24]:
pd.concat(dfs)[pd.concat(dfs).duplicated(subset = ['School Name'], keep = 'first')]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,University of Illinois at Urbana-Champaign (UIUC),Party,"$52,900.00","$96,100.00"
1,"University of Maryland, College Park",Party,"$52,000.00","$95,000.00"
2,"University of California, Santa Barbara (UCSB)",Party,"$50,500.00","$95,000.00"
3,University of Texas (UT) - Austin,Party,"$49,700.00","$93,900.00"
4,State University of New York (SUNY) at Albany,Party,"$44,500.00","$92,200.00"
5,University of Florida (UF),Party,"$47,100.00","$87,900.00"
6,Louisiana State University (LSU),Party,"$46,900.00","$87,800.00"
7,University of Georgia (UGA),Party,"$44,100.00","$86,000.00"
8,Pennsylvania State University (PSU),Party,"$49,900.00","$85,700.00"
9,Arizona State University (ASU),Party,"$47,400.00","$84,100.00"


So check that out, there are 20 schools that are duplicates in our dataset, and they all happen to be party schools. Because we know that our *party* dataframe has 20 rows, this means that the entire *party* dataframe consists of schools that show up in at least one other constituent dataframe. 

How can we fix this? One way is to simply remove the party schools from our list of dataframes that we use for the concatenation. 

A more programmatic way of doing this is to use the `drop_duplicates()` method, subsetting by school name.

In [25]:
pd.concat(dfs).drop_duplicates(subset = "School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"


Note that we've now lost the information of whether a particular school is a party school (in addition to being a state or liberal arts school).
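
If that information matters, one possible approach (not the one used in this section) is to record party membership in its own column before dropping the duplicates. The sketch below uses made-up frames, and the column name `Is Party` is hypothetical.

```python
import pandas as pd

# Made-up frames: "Beta State" appears in both the state and the party data
toy_state = pd.DataFrame({"School Name": ["Alpha State", "Beta State"],
                          "School Type": ["State", "State"]})
toy_party = pd.DataFrame({"School Name": ["Beta State"],
                          "School Type": ["Party"]})

# ignore_index=True gives the combined frame fresh row labels 0..n-1
toy_all = pd.concat([toy_state, toy_party], ignore_index=True)

# Flag party membership BEFORE the duplicate rows are dropped
toy_all["Is Party"] = toy_all["School Name"].isin(toy_party["School Name"])

# Keep the first row per school; the flag survives on the kept row
toy_schools = toy_all.drop_duplicates(subset="School Name")
print(toy_schools)
```

Each school still appears once, with its first-seen `School Type`, but the extra boolean column remembers which ones were also listed as party schools.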

Let's lock this into memory by assigning to a variable.

In [26]:
schools = pd.concat(dfs).drop_duplicates(subset = "School Name")

As a side activity, let's try to remove all party schools from the concatenated dataframe completely. That is, if a school is a party school, it is removed even if it is also a state school, liberal arts school, etc. This would allow comparisons such as starting salaries at party schools versus non-party schools. Remember that we can't simply leave the *party* dataframe out of the concatenation, because all of those schools are also present in the *state* or *liberal arts* dataframes. We have to remove every row whose school name appears in the party list.

To start, let's use a list comprehension to create a boolean mask that indicates whether a school **is not** in the list of party schools. This will be tested for every school in the concatenated school dataframe.

In [27]:
is_not_party_school = [school not in party["School Name"].values for school in schools["School Name"].values]

We can now use selection to grab the schools from the concatenated dataframe that are NOT party schools!

In [28]:
schools[is_not_party_school]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
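
The same mask can also be built without a Python-level loop by combining the `isin()` method with the `~` (negation) operator. A small sketch on made-up data, with the list-comprehension version alongside for comparison:

```python
import pandas as pd

# Made-up stand-ins for the deduplicated schools and the party list
toy_schools = pd.DataFrame({"School Name": ["Alpha State", "Beta State", "Gamma College"]})
toy_party = pd.DataFrame({"School Name": ["Beta State"]})

# Vectorized membership test; ~ flips True/False in the boolean Series
is_not_party = ~toy_schools["School Name"].isin(toy_party["School Name"])

# Equivalent mask built with a list comprehension
mask_list = [name not in toy_party["School Name"].values
             for name in toy_schools["School Name"].values]

print(toy_schools[is_not_party])
```

Both masks select the same rows; the `isin()` version is usually shorter to read and avoids scanning the party list once per school.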


## The Duplicated Index Issue

In the previous lecture, we concatenated 5 dataframes together and dropped the duplicated party school rows. Let's look at this dataframe.


In [29]:
schools

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"


See the problem here? The index of the concatenated dataframe contains duplicate values. As an example, let's grab all rows with an index of `0`.

In [30]:
schools.loc[0]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"


This happens because the component dataframes all have indices that start at 0 and increment until the end, and **`pd.concat()` does not discard the original index of the dataframes being concatenated**.

To illustrate further, let's take a look at the indices using the `duplicated()` method.


In [31]:
schools.index.duplicated()

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

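The printed array is cut off, but the pattern is this: the first 175 entries (the state schools) are `False`, and every later entry is `True`, because the indices of the later dataframes overlap completely with the indices of the first dataframe.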

It is technically okay to have duplicated indices in Pandas dataframes. However, you give up a lot of functionality by doing so. For example, label-based slices with `loc[]` raise an error when the index is non-unique and unsorted, as it is here.


In [32]:
## This does NOT work with duplicated indices
# schools.loc[0:2]
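
To see the failure without touching the real data, here is a small sketch on made-up frames. It assumes the same situation as above: a concatenated index that is both non-unique and unsorted.

```python
import pandas as pd

toy_a = pd.DataFrame({"School Name": ["A", "B", "C"]})
toy_b = pd.DataFrame({"School Name": ["D", "E"]})
toy = pd.concat([toy_a, toy_b])  # row labels are now 0, 1, 2, 0, 1

print(toy.index.is_unique)       # False: labels 0 and 1 repeat
print(len(toy.loc[0]))           # 2: a single label matches two rows

try:
    toy.loc[0:2]                 # label slice on a non-unique, unsorted index
    slice_failed = False
except KeyError as err:
    slice_failed = True
    print("KeyError:", err)
```

Selecting a single label still works (it just returns several rows), but pandas cannot decide where a slice starting at label `0` should begin, so it raises a `KeyError`.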

For this and other reasons, it is usually best practice to have at least one common index with unique values across all records. Can we fix this for our concatenated dataframe? Yes we can!

The first method we can use is `reset_index()`. This method restores a 0-based range index in the dataframe while pushing the old index into a regular column in the dataframe.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

In [33]:
schools.reset_index()

Unnamed: 0,index,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...,...
244,3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
245,4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
246,5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
247,6,Brown University,Ivy League,"$56,200.00","$109,000.00"


Oftentimes, however, there is no use for the old index. So we can remove it entirely by setting the `drop` parameter to `True`. Let's also perform the method in place to modify the underlying dataframe.

In [34]:
schools.reset_index(drop = True, inplace=True)

In [35]:
schools

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
244,Harvard University,Ivy League,"$63,400.00","$124,000.00"
245,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
246,Cornell University,Ivy League,"$60,300.00","$110,000.00"
247,Brown University,Ivy League,"$56,200.00","$109,000.00"


Let's make sure the duplicated indices are gone.

In [36]:
schools.index.duplicated()

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

Beautiful. We now know that our index is unique.
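
Reading a long boolean array by eye is error-prone. As a hedged alternative, the `is_unique` attribute of the index, or `.any()` on the `duplicated()` result, gives a single answer. A sketch on a made-up frame:

```python
import pandas as pd

# Two tiny made-up frames whose default indices both start at 0
toy = pd.concat([pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})])

before = toy.index.is_unique              # False: label 0 appears twice
toy = toy.reset_index(drop=True)          # fresh labels 0, 1, 2
after = toy.index.is_unique               # True
any_dupes = toy.index.duplicated().any()  # False: no label repeats

print(before, after, any_dupes)
```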

An even more elegant approach is to **discard the old indices as we do the concatenation**. We can do this by passing the `ignore_index` parameter to the `pd.concat()` function. This tells Pandas to ignore the indices of the dataframes being concatenated and to generate a new zero-based index for the resulting dataframe.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html



In [37]:
pd.concat(dfs, ignore_index=True).drop_duplicates(subset = "School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
244,Harvard University,Ivy League,"$63,400.00","$124,000.00"
245,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
246,Cornell University,Ivy League,"$60,300.00","$110,000.00"
247,Brown University,Ivy League,"$56,200.00","$109,000.00"


As a sanity check, let's check the index for duplicates.

In [38]:
pd.concat(dfs, ignore_index=True).drop_duplicates(subset = "School Name").index.duplicated()

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

Awesome, no duplicates!
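
One caveat: the index here is also gap-free only because the dropped party rows sit at the very end of the concatenated frame. When duplicates fall in the middle, `drop_duplicates()` leaves holes in the numbering. A sketch on made-up data, chaining `reset_index(drop=True)` to close the gap:

```python
import pandas as pd

toy_a = pd.DataFrame({"School Name": ["Alpha", "Beta"]})
toy_b = pd.DataFrame({"School Name": ["Beta", "Gamma"]})

# The second "Beta" gets label 2 and is then dropped as a duplicate
deduped = pd.concat([toy_a, toy_b], ignore_index=True).drop_duplicates(subset="School Name")
print(list(deduped.index))  # [0, 1, 3]: unique, but label 2 is missing

# Renumber once more after dropping to get a contiguous index
tidy = deduped.reset_index(drop=True)
print(list(tidy.index))     # [0, 1, 2]
```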

## Enforcing Unique Indices

We previously explored ways to remove the old indices of dataframes when concatenating them. But what if those indices contained useful information and we wanted to keep them, yet still create a concatenated dataframe with unique indices? Turns out we can do that.

To illustrate, let's assume our original dataframes have the school name as the indices.

In [39]:
ivies2 = ivies.set_index('School Name')

In [40]:
ivies2

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"


Let's do the same for the engineering schools.

In [41]:
eng2 = eng.set_index("School Name")

In [42]:
eng2.head()

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
Cooper Union,Engineering,"$62,200.00","$114,000.00"


Now, if we were to concatenate the two dataframes and set `ignore_index` to `True`, we would lose important information - namely, the school names.

In [43]:
pd.concat([ivies2, eng2], ignore_index = True)

Unnamed: 0,School Type,Starting Median Salary,Mid-Career Median Salary
0,Ivy League,"$58,000.00","$134,000.00"
1,Ivy League,"$66,500.00","$131,000.00"
2,Ivy League,"$59,100.00","$126,000.00"
3,Ivy League,"$63,400.00","$124,000.00"
4,Ivy League,"$60,900.00","$120,000.00"
5,Ivy League,"$60,300.00","$110,000.00"
6,Ivy League,"$56,200.00","$109,000.00"
7,Ivy League,"$59,400.00","$107,000.00"
8,Engineering,"$72,200.00","$126,000.00"
9,Engineering,"$75,500.00","$123,000.00"


That is NOT very useful. So, how do we preserve the indices while enforcing uniqueness? 

Let's start by concatenating with the original indices.

In [44]:
pd.concat([ivies2, eng2])

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"
Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


That's better. But how do we *ensure* that none of the indices are duplicated (it turns out they are not in this example, but let's pretend that they could be)? We can do that with the `verify_integrity` parameter of the `pd.concat()` function. This parameter performs a unique index check for us: Pandas will raise an error if it detects a duplicated index.

In [45]:
pd.concat([ivies2, eng2], verify_integrity=True)

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"
Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


Turns out our dataframes had unique indices and thus no error was thrown. Let's mess around with our data to illustrate the power of `verify_integrity`.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

Let's pick a random sample from the *eng2* dataframe and assign it to a new variable.

In [46]:
random_eng_school = eng2.sample()

In [47]:
random_eng_school

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Virginia Polytechnic Institute and State University (Virginia Tech),Engineering,"$53,500.00","$95,400.00"


Now let's add this school to the *ivies2* dataframe.

In [48]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() adds the row instead
ivies2 = pd.concat([ivies2, random_eng_school])

In [49]:
ivies2

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"
Virginia Polytechnic Institute and State University (Virginia Tech),Engineering,"$53,500.00","$95,400.00"


We now have Virginia Tech (the school sampled in this run; `sample()` is random, so yours may differ) in both *ivies2* and *eng2*. Now let's try to concatenate them with integrity verification. It throws an error!

In [50]:
## This will throw an error!
# pd.concat([ivies2, eng2], verify_integrity=True)
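
For a version that can be run safely, the sketch below wraps the call in `try`/`except` on two made-up frames that share one index label. The frame names and school names are illustrative only.

```python
import pandas as pd

# Made-up frames indexed by school name; "Beta Tech" appears in both
toy_ivies = pd.DataFrame(
    {"School Type": ["Ivy League", "Engineering"]},
    index=pd.Index(["Alpha University", "Beta Tech"], name="School Name"),
)
toy_eng = pd.DataFrame(
    {"School Type": ["Engineering", "Engineering"]},
    index=pd.Index(["Beta Tech", "Gamma Tech"], name="School Name"),
)

try:
    pd.concat([toy_ivies, toy_eng], verify_integrity=True)
    raised = False
except ValueError as err:
    # pandas reports the overlapping index labels in the error message
    raised = True
    print("ValueError:", err)
```

The check raises a `ValueError`, so catching that exception type is enough to handle the overlap gracefully.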

To get around this and allow us to concatenate the dataframes with shared indices, we have a few options.
1. We could choose a different index for the two dataframes that *does* have unique values. The instructor recommends this approach.
2. We can turn off `verify_integrity` and then deal with the duplicates afterward, either by dropping duplicates or resetting the index.
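
The second option can be sketched as follows, again on made-up frames. Filtering on `index.duplicated()` is one way to drop repeated labels; it is shown here as an assumption about what "dropping duplicates" would look like for an index rather than a column.

```python
import pandas as pd

# Made-up frames indexed by school name; "Beta Tech" appears in both
toy_ivies = pd.DataFrame(
    {"School Type": ["Ivy League", "Engineering"]},
    index=pd.Index(["Alpha University", "Beta Tech"], name="School Name"),
)
toy_eng = pd.DataFrame(
    {"School Type": ["Engineering", "Engineering"]},
    index=pd.Index(["Beta Tech", "Gamma Tech"], name="School Name"),
)

# verify_integrity defaults to False, so the overlap is allowed through
combined = pd.concat([toy_ivies, toy_eng])

# Option A: keep only the first row for each index label
unique_rows = combined[~combined.index.duplicated(keep="first")]

# Option B: move the names into a regular column and use a fresh integer index
renumbered = combined.reset_index()

print(unique_rows)
print(renumbered)
```

Option A keeps the school names as the index but discards the repeated row; Option B keeps every row and gives up the names as row labels.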

One final note is that you can also concatenate along axis 1 (the column axis).
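
As a brief sketch of what that looks like, the example below places two made-up frames side by side with `axis=1`. The column names and salary figures are invented for illustration.

```python
import pandas as pd

# Made-up salary figures, indexed by made-up school names
starting = pd.DataFrame({"Starting": [50000, 60000]}, index=["Alpha", "Beta"])
mid_career = pd.DataFrame({"Mid-Career": [110000, 90000]}, index=["Beta", "Gamma"])

# Columns are placed next to each other; rows are matched by index label
side_by_side = pd.concat([starting, mid_career], axis=1)
print(side_by_side)
```

Rows are aligned on their index labels, so a school that is missing from one of the frames gets `NaN` in that frame's columns.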

## Creating Multiple Indices with `concat()`

We previously looked at how to replace the index of a concatenated dataframe with a new zero-based index using the `ignore_index` parameter of `concat()`. But what if we wanted to add another level of indexing to easily partition or identify the constituent dataframes within the concatenated dataframe? We'll explore this more in the next section, but here's a sneak peek.

Say we're combining the ivies and the engineering schools again, but we want to be able to specify within the index the origin of the data. Recall that when using the vanilla `concat()` function, we'll get repeating indices.


In [51]:
pd.concat([ivies, eng])

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


The `pd.concat()` function has a `keys` parameter that constructs a hierarchical index, using the passed-in keys as the outermost level. The keys are applied in order, matching the order in which the constituent dataframes were passed into `pd.concat()`.


In [52]:
new_df = pd.concat([ivies, eng], keys = ["ivyleague_schools", "engineering_schools"])

In [53]:
new_df

Unnamed: 0,Unnamed: 1,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
ivyleague_schools,0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
ivyleague_schools,1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
ivyleague_schools,2,Yale University,Ivy League,"$59,100.00","$126,000.00"
ivyleague_schools,3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
ivyleague_schools,4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
ivyleague_schools,5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
ivyleague_schools,6,Brown University,Ivy League,"$56,200.00","$109,000.00"
ivyleague_schools,7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
engineering_schools,0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
engineering_schools,1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


Our new dataframe contains an index with two levels. The first (outer) level contains the labels (keys) that we specified. The second (inner) level is the numerical index from the original dataframes. And if we check the index, we see that each entry is a tuple.

In [54]:
new_df.index

MultiIndex([(  'ivyleague_schools',  0),
            (  'ivyleague_schools',  1),
            (  'ivyleague_schools',  2),
            (  'ivyleague_schools',  3),
            (  'ivyleague_schools',  4),
            (  'ivyleague_schools',  5),
            (  'ivyleague_schools',  6),
            (  'ivyleague_schools',  7),
            ('engineering_schools',  0),
            ('engineering_schools',  1),
            ('engineering_schools',  2),
            ('engineering_schools',  3),
            ('engineering_schools',  4),
            ('engineering_schools',  5),
            ('engineering_schools',  6),
            ('engineering_schools',  7),
            ('engineering_schools',  8),
            ('engineering_schools',  9),
            ('engineering_schools', 10),
            ('engineering_schools', 11),
            ('engineering_schools', 12),
            ('engineering_schools', 13),
            ('engineering_schools', 14),
            ('engineering_schools', 15),
            ('en

This means that we cannot use `loc[]` with only the inner (numerical) label, as we would with a single-level index. Instead, we have to pass the entire tuple that identifies the row.

In [55]:
## This raises a KeyError: 3 is not a label in the outer level of the index.
# new_df.loc[3]

In [56]:
## This will work
new_df.loc[('ivyleague_schools', 3)]

School Name                 Harvard University
School Type                         Ivy League
Starting Median Salary             $63,400.00 
Mid-Career Median Salary          $124,000.00 
Name: (ivyleague_schools, 3), dtype: object

We can also avoid this issue by using position-based indexing instead of label-based indexing. Of course, this means you must know where the data you want is located within the dataframe.

In [57]:
new_df.iloc[3]

School Name                 Harvard University
School Type                         Ivy League
Starting Median Salary             $63,400.00 
Mid-Career Median Salary          $124,000.00 
Name: (ivyleague_schools, 3), dtype: object
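
Beyond full tuples and `iloc[]`, a hierarchical index can also be queried one level at a time. The sketch below uses two tiny stand-in dataframes (the names `ivy_demo`, `eng_demo`, and their contents are hypothetical, not the course data) to show selection by outer key with `loc[]` and by inner label with `xs()`.

```python
import pandas as pd

# Two tiny stand-in dataframes (hypothetical values, not the course data).
ivy_demo = pd.DataFrame({"School Name": ["A College", "B University"]})
eng_demo = pd.DataFrame({"School Name": ["C Institute", "D Tech"]})

combined = pd.concat([ivy_demo, eng_demo], keys=["ivy", "eng"])

# Passing only the outer key returns every row under that key,
# indexed by the original (inner) labels.
ivy_rows = combined.loc["ivy"]

# A full (outer, inner) tuple returns a single row as a Series.
one_row = combined.loc[("eng", 1)]

# xs() selects on the inner level across all outer keys.
second_rows = combined.xs(1, level=1)
```

Here `second_rows` holds the row labeled 1 from each constituent dataframe, indexed by the outer keys.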

## Column Axis Concatenation

We previously concatenated only by rows, or along the index axis (as opposed to the column axis). But we can change this behavior and concatenate along the column axis.

To do this, we simply change the `axis` parameter of `pd.concat()` from 0 to 1. When might this be useful?

Suppose we want to show, side-by-side, the top 5 Ivy League and engineering schools that produce the highest earning graduates. One way to approach this is to produce sorted copies of the dataframes. Let's grab the top 5 starting salaries from each dataframe sorted in descending order, drop the original indices, and reset the index. Keep in mind that the salary columns still hold strings, so `sort_values()` orders them as text; that agrees with numerical order only as long as every value has the same number of digits.

In [58]:
ivies3 = ivies.sort_values(by = "Starting Median Salary", ascending=False)[0:5].reset_index(drop=True)

In [59]:
ivies3

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Princeton University,Ivy League,"$66,500.00","$131,000.00"
1,Harvard University,Ivy League,"$63,400.00","$124,000.00"
2,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
3,Cornell University,Ivy League,"$60,300.00","$110,000.00"
4,Columbia University,Ivy League,"$59,400.00","$107,000.00"


In [60]:
eng3 = eng.sort_values(by = "Starting Median Salary", ascending=False)[0:5].reset_index(drop=True)

In [61]:
eng3

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
1,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00"


Finally, let's put these dataframes side-by-side in a new dataframe.

In [62]:
pd.concat([ivies3, eng3], axis = 1)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,School Name.1,School Type.1,Starting Median Salary.1,Mid-Career Median Salary.1
0,Princeton University,Ivy League,"$66,500.00","$131,000.00",California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
1,Harvard University,Ivy League,"$63,400.00","$124,000.00",Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
2,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,Cornell University,Ivy League,"$60,300.00","$110,000.00","Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Columbia University,Ivy League,"$59,400.00","$107,000.00",Cooper Union,Engineering,"$62,200.00","$114,000.00"


This looks pretty decent. For example, we can easily see that in general, the starting median salaries of top engineering schools compete well with the salaries at Ivy League schools. Interesting. Don't jump to any conclusions though - there are a lot of questions to keep in mind, including:
1. What if we considered only engineering studies at Ivy League schools? Would the starting median salary increase in that case?
2. What if we considered non-engineering students at the top engineering schools? Would the median salaries decrease?

It is possible the Ivy Leagues simply have more students that enter lower-paying, non-engineering careers, which brings down the median starting salary. It is entirely possible that engineering students at Ivy Leagues also make a good amount of money. 

A final note: while this is not terrible to look at, there are better ways to visualize this type of data. The point here was simply to illustrate that `pd.concat()` can combine dataframes along the column axis.
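
One detail worth spelling out is why the index was reset before concatenating: with `axis=1`, `pd.concat()` matches rows by index label rather than by position. The following is a minimal sketch with hypothetical miniature dataframes (`left` and `right` are illustrative names) showing what happens with and without the reset.

```python
import pandas as pd

# Hypothetical miniature dataframes with non-matching index labels,
# as you would get after sorting without resetting the index.
left = pd.DataFrame({"School": ["A", "B"]}, index=[3, 1])
right = pd.DataFrame({"School": ["C", "D"]}, index=[0, 2])

# Rows are matched by index label, so nothing lines up here:
# the result has four rows, each half-filled with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting both indices to 0, 1 places the rows side by side.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
```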

## The `append()` Method: A Special Case of `concat()`

We used the `append()` method in the previous section to add new rows onto dataframes. You may have noticed that `append()` and `concat()` are similar in that regard. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the `append()` cell below only runs on older versions.

For instance, we can use `append()` to add the *party* dataframe onto the *liberal* dataframe.

In [63]:
liberal.append(party)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00"
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00"
...,...,...,...,...
15,University of New Hampshire (UNH),Party,"$41,800.00","$78,300.00"
16,West Virginia University (WVU),Party,"$43,100.00","$78,100.00"
17,University of Tennessee,Party,"$43,800.00","$74,600.00"
18,Ohio University,Party,"$42,200.00","$73,400.00"


Equivalently, we can use `pd.concat()` to do this.

In [64]:
pd.concat([liberal, party])

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00"
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00"
...,...,...,...,...
15,University of New Hampshire (UNH),Party,"$41,800.00","$78,300.00"
16,West Virginia University (WVU),Party,"$43,100.00","$78,100.00"
17,University of Tennessee,Party,"$43,800.00","$74,600.00"
18,Ohio University,Party,"$42,200.00","$73,400.00"


So what's the difference? In this use case, the result is identical. However, they behave differently in other contexts and offer different options.
* `append()` is an instance method that can only be called on an existing dataframe or series object, whereas `concat()` is a function in the top-level pandas namespace.
* `concat()` is generally more flexible in that it gives us the ability to concatenate along both the index and column axes. `append()` does not do column-wise concatenation; its axis of operation is fixed to `0`, that is, the index axis.
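
Because `DataFrame.append()` no longer exists in pandas 2.0 and later, `pd.concat()` is the portable way to stack rows. The sketch below uses hypothetical stand-ins for the *liberal* and *party* dataframes and also shows the `ignore_index` option, which renumbers the result instead of keeping the original labels.

```python
import pandas as pd

# Hypothetical stand-ins for the liberal and party dataframes.
liberal_demo = pd.DataFrame({"School Name": ["A College", "B College"]})
party_demo = pd.DataFrame({"School Name": ["C University"]})

# Row-wise stacking that works on both old and new pandas versions.
stacked = pd.concat([liberal_demo, party_demo])

# ignore_index=True discards the original labels and renumbers from 0.
renumbered = pd.concat([liberal_demo, party_demo], ignore_index=True)
```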

## `concat()` on Different Columns

So far, we've been working with dataframes that stack very nicely - they have the same columns and the same column names. Outstanding!

But sometimes we need to concatenate dataframes that don't necessarily have the same columns. For instance, let's add a STEM column to the engineering schools that indicates whether or not the school specializes in Science, Technology, Engineering, or Mathematics. We'll add this to a copy of *eng* so that we don't affect the original dataframe.

In [65]:
eng4 = eng.copy()

In [66]:
eng4['STEM'] = True

In [67]:
eng4.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,STEM
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00",True
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00",True
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00",True
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00",True
4,Cooper Union,Engineering,"$62,200.00","$114,000.00",True


Great, now let's go back and concatenate this dataframe with our original *ivies* dataframe and see what happens.

In [68]:
pd.concat([ivies, eng4])

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,STEM
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",
2,Yale University,Ivy League,"$59,100.00","$126,000.00",
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",
6,Brown University,Ivy League,"$56,200.00","$109,000.00",
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00",True
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00",True


As we see, the concatenated dataframe includes the STEM column, and for the Ivy League schools the value defaults to `NaN`. This happens because `pd.concat()`, by default, includes all of the columns of the original dataframes. This is known as an *outer* join.
* https://radacad.com/wp-content/uploads/2015/07/joins.jpg

That said, `concat()` has a `join` parameter where we can select the type of join. For example, if we choose an *inner* join, only columns that are common to both dataframes will be in the final concatenated dataframe.

In [69]:
pd.concat([ivies, eng4], join='inner')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
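
The difference between the two `join` settings is easiest to see on the column labels alone. Here is a minimal sketch with hypothetical one-row dataframes (`a` and `b` are illustrative names), only one of which has a STEM column.

```python
import pandas as pd

# Hypothetical miniature dataframes; only the second has a STEM column.
a = pd.DataFrame({"School Name": ["A College"], "School Type": ["Ivy League"]})
b = pd.DataFrame(
    {"School Name": ["B Tech"], "School Type": ["Engineering"], "STEM": [True]}
)

outer = pd.concat([a, b])                # default join="outer": union of columns
inner = pd.concat([a, b], join="inner")  # intersection of columns
```

`outer` keeps the STEM column and fills it with `NaN` for the row from `a`, while `inner` drops the column entirely.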


In addition to having extra columns, you may also have columns that have the same type of data and intention, but have different column names. In those cases `pd.concat()` is not the best method to use. There are other techniques for dealing with this and we'll cover those later.

## Skill Challenge


#### 1. Concatenate the *liberal* and *state* schools into a new dataframe. How many unique school names are there?

Let's start by concatenating the two dataframes:

In [70]:
pd.concat([liberal, state])

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00"
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00"
...,...,...,...,...
170,Austin Peay State University,State,"$37,700.00","$59,200.00"
171,Pittsburg State University,State,"$40,400.00","$58,200.00"
172,Southern Utah University,State,"$41,900.00","$56,500.00"
173,Montana State University - Billings,State,"$37,900.00","$50,600.00"


Since we're interested in comparing school names, let's isolate that column.

In [71]:
pd.concat([liberal, state]).loc[:, 'School Name']

0                      Bucknell University
1                       Colgate University
2                          Amherst College
3                        Lafayette College
4                          Bowdoin College
                      ...                 
170           Austin Peay State University
171             Pittsburg State University
172               Southern Utah University
173    Montana State University - Billings
174           Black Hills State University
Name: School Name, Length: 222, dtype: object

We see that this concatenated dataframe has 222 rows in it. Are any of these duplicated (that is, are any schools both state schools and liberal arts schools)?

In [72]:
pd.concat([liberal, state])['School Name'].unique()

array(['Bucknell University', 'Colgate University', 'Amherst College',
       'Lafayette College', 'Bowdoin College',
       'College of the Holy Cross', 'Occidental College',
       'Washington and Lee University', 'Swarthmore College',
       'Davidson College', 'Carleton College', 'Williams College',
       'Pomona College', 'Wesleyan University (Middletown, Connecticut)',
       'Bates College', 'Union College', 'University of Richmond',
       'Vassar College', 'Middlebury College', 'Mount Holyoke College',
       'Franklin and Marshall College', 'DePauw University',
       'St. Olaf College', 'Colby College', 'Gettysburg College',
       'Siena College', 'Smith College', 'Hamilton College',
       'Randolph-Macon College', 'Wellesley College',
       'Denison University', 'Oberlin College',
       'University of Puget Sound', 'Colorado College (CC)',
       'Reed College', 'Gustavus Adolphus College', 'Whitman College',
       'Ursinus College', 'Juniata College', 'Wittenberg Uni

It appears that there are 222 unique names, and thus there are no duplicates in this concatenated dataframe. Just to confirm, let's chain `duplicated()` and `value_counts()` on the column.

In [73]:
pd.concat([liberal, state])['School Name'].duplicated().value_counts()

False    222
Name: School Name, dtype: int64

An alternative approach is to simply use the `nunique()` method.

In [74]:
pd.concat([liberal, state])['School Name'].nunique()

222

#### 2. Calculate the average median starting salary in the dataframe that we created in Part 1. Because the datatype for the salaries is "object", we'll need to convert it to "float".

In [75]:
pd.concat([liberal, state]).loc[:, "Starting Median Salary"]

0      $54,100.00 
1      $52,800.00 
2      $54,500.00 
3      $53,900.00 
4      $48,100.00 
          ...     
170    $37,700.00 
171    $40,400.00 
172    $41,900.00 
173    $37,900.00 
174    $35,300.00 
Name: Starting Median Salary, Length: 222, dtype: object

The presence of the commas and dollar signs makes this a bit challenging. Let's start by removing both using the `replace()` method and a regular expression.

In [76]:
# Raw strings keep the regex backslashes from being read as Python escape sequences.
pd.concat([liberal, state]).loc[:, "Starting Median Salary"].replace(to_replace = r'\$(\d+),*(\d+\.\d+)', value = r'\1\2', regex = True)

0      54100.00 
1      52800.00 
2      54500.00 
3      53900.00 
4      48100.00 
         ...    
170    37700.00 
171    40400.00 
172    41900.00 
173    37900.00 
174    35300.00 
Name: Starting Median Salary, Length: 222, dtype: object

Note that an alternative approach would have been to select the dollar signs ($) and commas (,) and replace them with empty strings, which effectively removes them.

Now let's convert this all to floats. We can do this by chaining the `astype()` method onto the result of `replace()`.

In [77]:
pd.concat([liberal, state]).loc[:, "Starting Median Salary"].replace(to_replace = r'\$(\d+),*(\d+\.\d+)', value = r'\1\2', regex = True).astype(np.float64)

0      54100.0
1      52800.0
2      54500.0
3      53900.0
4      48100.0
        ...   
170    37700.0
171    40400.0
172    41900.0
173    37900.0
174    35300.0
Name: Starting Median Salary, Length: 222, dtype: float64

Finally, let's calculate the average median starting salary.

In [78]:
pd.concat([liberal, state]).loc[:, "Starting Median Salary"].replace(to_replace = r'\$(\d+),*(\d+\.\d+)', value = r'\1\2', regex = True).astype(np.float64).mean()

44469.36936936937

Just for fun, let's do the same calculation for the "Mid-Career Median Salary".

In [79]:
pd.concat([liberal, state]).loc[:, "Mid-Career Median Salary"].replace(to_replace = r'\$(\d+),*(\d+\.\d+)', value = r'\1\2', regex = True).astype(np.float64).mean()

80856.3063063063
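
The alternative mentioned earlier in Part 2, stripping the dollar signs and commas directly, could look roughly like the sketch below. It uses a hypothetical sample series in the same format as the salary columns, and it swaps in the `.str` accessor and `pd.to_numeric()` in place of `replace()` and `astype()`.

```python
import pandas as pd

# Hypothetical sample in the same format as the salary columns,
# including the trailing space seen in the output above.
salaries = pd.Series(["$54,100.00 ", "$52,800.00 ", "$48,100.00 "])

# Remove every "$" and "," character, trim whitespace,
# then parse what remains as numbers.
cleaned = salaries.str.replace(r"[$,]", "", regex=True).str.strip()
numeric = pd.to_numeric(cleaned)

mean_salary = numeric.mean()
```

The character class `[$,]` matches either symbol, so a single pass handles both.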

#### 3. Create a short dataframe that shows the top 3 *liberal* and *state* schools that produce the highest (mid-career) earning graduates. Show the *School Name* and *Mid-Career Median Salary* columns from each dataframe, side-by-side (i.e. horizontally). BONUS: nest the column labels within 'Liberal Arts' and 'State' labels.

We need to sort the liberal arts and state schools by mid-career median salary and isolate the top 3 from each. Remember that the values are currently stored with the "object" datatype, so we have to convert them before doing any numerical work.

We'll start by converting both the *liberal* and *state* dataframes' dollar values from strings to floats. Then we will sort these individual dataframes by the numerical values of *Mid-Career Median Salary*, select the top 3 from each, and concatenate them together.

In [80]:
liberal_numeric = liberal.replace(to_replace = r'\$(\d+),*(\d+\.\d+)', value = r'\1\2', regex = True)

In [81]:
liberal_numeric = liberal_numeric.astype(
    {
        "Starting Median Salary" : np.float64,
        "Mid-Career Median Salary" : np.float64
    }
)

In [82]:
liberal_numeric

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,54100.0,110000.0
1,Colgate University,Liberal Arts,52800.0,108000.0
2,Amherst College,Liberal Arts,54500.0,107000.0
3,Lafayette College,Liberal Arts,53900.0,107000.0
4,Bowdoin College,Liberal Arts,48100.0,107000.0
5,College of the Holy Cross,Liberal Arts,50200.0,106000.0
6,Occidental College,Liberal Arts,51900.0,105000.0
7,Washington and Lee University,Liberal Arts,53600.0,104000.0
8,Swarthmore College,Liberal Arts,49700.0,104000.0
9,Davidson College,Liberal Arts,46100.0,104000.0


In [83]:
liberal_numeric.sort_values(
    by = "Mid-Career Median Salary",
    ascending = False
)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,54100.0,110000.0
1,Colgate University,Liberal Arts,52800.0,108000.0
2,Amherst College,Liberal Arts,54500.0,107000.0
3,Lafayette College,Liberal Arts,53900.0,107000.0
4,Bowdoin College,Liberal Arts,48100.0,107000.0
5,College of the Holy Cross,Liberal Arts,50200.0,106000.0
6,Occidental College,Liberal Arts,51900.0,105000.0
7,Washington and Lee University,Liberal Arts,53600.0,104000.0
8,Swarthmore College,Liberal Arts,49700.0,104000.0
9,Davidson College,Liberal Arts,46100.0,104000.0


In [84]:
state_numeric = state.replace(to_replace = r'\$(\d+),*(\d+\.\d+)', value = r'\1\2', regex = True)

In [85]:
state_numeric = state_numeric.astype(
    {
        "Starting Median Salary" : np.float64,
        "Mid-Career Median Salary" : np.float64
    }
)

In [86]:
state_numeric.sort_values(by = "Mid-Career Median Salary", ascending = False)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,59900.0,112000.0
1,University of Virginia (UVA),State,52700.0,103000.0
2,Cal Poly San Luis Obispo,State,57200.0,101000.0
3,University of California at Los Angeles (UCLA),State,52600.0,101000.0
4,"University of California, San Diego (UCSD)",State,51100.0,101000.0
...,...,...,...,...
170,Austin Peay State University,State,37700.0,59200.0
171,Pittsburg State University,State,40400.0,58200.0
172,Southern Utah University,State,41900.0,56500.0
173,Montana State University - Billings,State,37900.0,50600.0


Collect the top 3 schools by *Mid-Career Median Salary* from each.

In [87]:
state_top3 = state_numeric.sort_values(by = "Mid-Career Median Salary", ascending = False).iloc[0:3]

In [88]:
state_top3

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,59900.0,112000.0
1,University of Virginia (UVA),State,52700.0,103000.0
2,Cal Poly San Luis Obispo,State,57200.0,101000.0


In [89]:
liberal_top3 = liberal_numeric.sort_values(
    by = "Mid-Career Median Salary",
    ascending = False
).iloc[0:3]

In [90]:
liberal_top3

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,54100.0,110000.0
1,Colgate University,Liberal Arts,52800.0,108000.0
2,Amherst College,Liberal Arts,54500.0,107000.0


Finally, concatenate them along the column axis and nest them within the "State" and "Liberal Arts" labels using the `keys` parameter of `pd.concat()`.

In [91]:
pd.concat([state_top3, liberal_top3], keys = ['State', 'Liberal Arts'], axis = 1)

Unnamed: 0_level_0,State,State,State,State,Liberal Arts,Liberal Arts,Liberal Arts,Liberal Arts
Unnamed: 0_level_1,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,59900.0,112000.0,Bucknell University,Liberal Arts,54100.0,110000.0
1,University of Virginia (UVA),State,52700.0,103000.0,Colgate University,Liberal Arts,52800.0,108000.0
2,Cal Poly San Luis Obispo,State,57200.0,101000.0,Amherst College,Liberal Arts,54500.0,107000.0
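
The challenge asks for only the *School Name* and *Mid-Career Median Salary* columns, whereas the result above keeps all four. A possible variant is sketched below on hypothetical miniature dataframes; the helper `top3` and the use of `nlargest()` in place of `sort_values()` plus `iloc[]` are choices made for this illustration, not part of the original solution.

```python
import pandas as pd

# Hypothetical miniature versions of the numeric dataframes.
state_demo = pd.DataFrame({
    "School Name": ["S1", "S2", "S3", "S4"],
    "School Type": ["State"] * 4,
    "Mid-Career Median Salary": [90000.0, 112000.0, 101000.0, 103000.0],
})
liberal_demo = pd.DataFrame({
    "School Name": ["L1", "L2", "L3", "L4"],
    "School Type": ["Liberal Arts"] * 4,
    "Mid-Career Median Salary": [110000.0, 95000.0, 108000.0, 107000.0],
})

cols = ["School Name", "Mid-Career Median Salary"]

def top3(df):
    # Three highest mid-career salaries, requested columns only,
    # renumbered 0-2 so the two halves align row by row.
    return df.nlargest(3, "Mid-Career Median Salary")[cols].reset_index(drop=True)

side_by_side = pd.concat(
    [top3(state_demo), top3(liberal_demo)],
    keys=["State", "Liberal Arts"],
    axis=1,
)
```

Resetting the index matters whenever the source rows are not already in salary order; in the course data they happen to be, which is why the original cells line up without it.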


Note that the instructor video (as of 10/10/2021) has an error in Part 3: the salary values were not converted from object to numeric before sorting. Didactically, this is a minor detail, as the section was focused on concatenating and shaping.

## The `merge()` Method

The `merge()` function combines dataframes in a way that is very similar to a SQL join. It gives us a flexible interface for joining dataframe and series objects.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html

Note that `merge()` is also a dataframe method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

How is *merging* different from *concatenation*? 
* Think of concatenation as "gluing" datasets together. It is a structure-focused operation and does not concern itself much with the content of the data.
* On the other hand, merging combines data sets together based on the content they share. As such, it is much more flexible than concatenation. See the instructor's course notes.

With that, let's play around with `merge()`. We begin by reading in some new data describing regional information on the schools in our dataset.

In [92]:
regions_url = 'https://andybek.com/pandas-regions'

In [93]:
regions = pd.read_csv(regions_url)

In [94]:
regions.shape

(269, 2)

In [95]:
regions.head()

Unnamed: 0,School Name,Region
0,Massachusetts Institute of Technology (MIT),Northeastern
1,California Institute of Technology (CIT),California
2,Harvey Mudd College,California
3,"Polytechnic University of New York, Brooklyn",Northeastern
4,Cooper Union,Northeastern


Our goal is to extend the schools dataframe by adding the regions information! Remember that the *schools* dataframe already has a column with "School Name", just like the *regions* dataframe.

In [96]:
schools.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


Let's merge that sucker.

In [97]:
pd.merge(schools, regions)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00",California
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00",Southern
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00",California
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00",California
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00",California
...,...,...,...,...,...
264,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
265,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
266,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
267,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern


Notice that the merge automatically occurred on "School Name", which is the only column name shared by the two dataframes.

This could have been more explicitly coded by using the `on` parameter as follows.
* Note that if `on` is not used, the merge defaults to the intersection of the column names in the two dataframes.

In [98]:
pd.merge(schools, regions, on = "School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00",California
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00",Southern
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00",California
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00",California
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00",California
...,...,...,...,...,...
264,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
265,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
266,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
267,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
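
To see that the implicit and explicit forms behave the same way, and that rows are matched by key value rather than by position, consider this minimal sketch. The dataframes `schools_demo` and `regions_demo` are hypothetical miniatures, with the rows of the second deliberately listed in a different order.

```python
import pandas as pd

# Hypothetical miniature dataframes sharing a "School Name" column.
schools_demo = pd.DataFrame({
    "School Name": ["A College", "B University"],
    "School Type": ["Liberal Arts", "State"],
})
regions_demo = pd.DataFrame({
    "School Name": ["B University", "A College"],
    "Region": ["Western", "Northeastern"],
})

# Key inferred from the shared column name.
implicit = pd.merge(schools_demo, regions_demo)

# Key stated explicitly.
explicit = pd.merge(schools_demo, regions_demo, on="School Name")
```

Both results pair each school with its own region even though the two inputs list the schools in different orders.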


What do we do if the columns that we want to merge on are labeled differently in the two dataframes? Stay tuned...

## The `left_on` and `right_on` Parameters

You will often come across dataframes that you want to merge but that do not share a key column of the same name. We can still merge them.

Let's bring in some new data describing the distribution (percentiles) of mid-career salaries by school.

In [99]:
income_url = 'https://andybek.com/pandas-mid'

In [100]:
mid_career = pd.read_csv(income_url)

In [101]:
mid_career.head()

Unnamed: 0,school_name,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,Massachusetts Institute of Technology (MIT),"$76,800.00","$99,200.00","$168,000.00","$220,000.00"
1,California Institute of Technology (CIT),,"$104,000.00","$161,000.00",
2,Harvey Mudd College,,"$96,000.00","$180,000.00",
3,"Polytechnic University of New York, Brooklyn","$66,800.00","$94,300.00","$143,000.00","$190,000.00"
4,Cooper Union,,"$80,200.00","$142,000.00",


Now we want to add this to our *schools* dataframe.

In [102]:
schools.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


But we have a problem - the key columns containing the school names are labeled a bit differently. Let's try `pd.merge()` anyway.

In [103]:
## This raises a MergeError because the two dataframes have no column name in common to merge on.
# pd.merge(schools, mid_career)

To resolve this issue, we use the `left_on` and `right_on` parameters, where we provide the names of the columns in the "left" and "right" dataframes on which to merge. This allows us to identify exactly which column to merge on, even if they are named differently.

In [104]:
pd.merge(schools, mid_career, left_on = "School Name", right_on = "school_name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,school_name,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00","University of California, Berkeley","$59,500.00","$81,000.00","$149,000.00","$201,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00",University of Virginia (UVA),"$52,200.00","$71,800.00","$146,000.00","$215,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00",Cal Poly San Luis Obispo,"$55,000.00","$74,700.00","$133,000.00","$178,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00",University of California at Los Angeles (UCLA),"$51,300.00","$72,500.00","$139,000.00","$193,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00","University of California, San Diego (UCSD)","$51,700.00","$75,400.00","$131,000.00","$177,000.00"
...,...,...,...,...,...,...,...,...,...
264,Harvard University,Ivy League,"$63,400.00","$124,000.00",Harvard University,"$54,800.00","$86,200.00","$179,000.00","$288,000.00"
265,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",University of Pennsylvania,"$55,900.00","$79,200.00","$192,000.00","$282,000.00"
266,Cornell University,Ivy League,"$60,300.00","$110,000.00",Cornell University,"$56,800.00","$79,800.00","$160,000.00","$210,000.00"
267,Brown University,Ivy League,"$56,200.00","$109,000.00",Brown University,"$55,400.00","$74,400.00","$159,000.00","$228,000.00"


This is *almost* what we wanted. The merged dataframe contains all of the data from both constituent dataframes. However, we now have two columns with school names, which is redundant. There are multiple ways to remove the extra column, but the easiest based on what we already know is to use the `drop()` method.

In [105]:
pd.merge(schools, mid_career, left_on = "School Name", right_on = "school_name").drop(columns = "school_name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00","$59,500.00","$81,000.00","$149,000.00","$201,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00","$52,200.00","$71,800.00","$146,000.00","$215,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00","$55,000.00","$74,700.00","$133,000.00","$178,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00","$51,300.00","$72,500.00","$139,000.00","$193,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00","$51,700.00","$75,400.00","$131,000.00","$177,000.00"
...,...,...,...,...,...,...,...,...
264,Harvard University,Ivy League,"$63,400.00","$124,000.00","$54,800.00","$86,200.00","$179,000.00","$288,000.00"
265,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00","$55,900.00","$79,200.00","$192,000.00","$282,000.00"
266,Cornell University,Ivy League,"$60,300.00","$110,000.00","$56,800.00","$79,800.00","$160,000.00","$210,000.00"
267,Brown University,Ivy League,"$56,200.00","$109,000.00","$55,400.00","$74,400.00","$159,000.00","$228,000.00"


## Inner vs Outer Joins

We've learned how to merge dataframes on a *key column*. But we haven't yet discussed which keys (entries) from each dataframe end up in the final merge.

An **inner** join includes only the keys that are common to both dataframes. If a key appears in the key column of one dataframe but not the other, it will not appear in the merge. It is similar to a *set intersection*.

An **outer** join will include all keys from both key columns of the two dataframes, regardless of whether the keys appear in one key column or both. It is similar to a *set union*. Any key that does not have a counterpart in the other dataframe will have `NaN` as its value in the merged dataframe. 
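As a rough illustration of the intersection/union analogy, here is a minimal sketch on two made-up frames. The names `left`, `right`, and the `key` column are hypothetical and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical toy frames; only keys "b" and "c" appear in both
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")
outer = pd.merge(left, right, on="key", how="outer")

# Inner keeps the intersection of the keys; outer keeps the union
inner_keys = set(inner["key"])
outer_keys = set(outer["key"])

# Keys found on one side only get NaN in the other side's columns
missing_per_column = outer.isna().sum()
```

In this toy case the inner merge keeps two keys and the outer merge keeps four, with one missing value in each of the value columns.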




The type of join performed by `pd.merge()` is controlled by the `how` parameter.

Suppose we want to merge the *ivies* dataframe with the *regions* dataframe. Let's start with the basic merge.

In [106]:
pd.merge(ivies, regions)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
6,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern


Looks good, all of the Ivy League schools are in the Northeastern U.S. But wait a second - the *regions* dataframe has many schools in it (269 to be exact), and yet the merged dataframe only has the ivies. What gives?

The reason is that `pd.merge()` defaults to an *inner* merge (`how = "inner"`). Only the Ivy League school names were common to the key column *School Name* in each dataframe, and thus only those names were kept.

Let's now try an *outer* merge.

In [107]:
pd.merge(ivies, regions, how='outer')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
...,...,...,...,...,...
264,Austin Peay State University,,,,Southern
265,Pittsburg State University,,,,Midwestern
266,Southern Utah University,,,,Western
267,Montana State University - Billings,,,,Western


Here the Ivy League schools have every column populated, while all of the other schools have a Region but no values for School Type or the salary columns. That's because the key column (School Name) of the *ivies* dataframe contains only the Ivy League keys. The remaining keys (schools) are found only in the *regions* dataframe, so they have none of the School Type and salary data from *ivies*; those values default to `NaN`.
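The `indicator` parameter of `pd.merge()` offers one way to see which side each row of an outer merge came from. The sketch below uses hypothetical toy frames (`small` and `big`) rather than the course data.

```python
import pandas as pd

# Hypothetical toy frames standing in for a small and a large dataframe
small = pd.DataFrame({"School Name": ["A", "B"],
                      "School Type": ["Ivy League", "Ivy League"]})
big = pd.DataFrame({"School Name": ["A", "B", "C"],
                    "Region": ["Northeastern", "Northeastern", "Western"]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
outer = pd.merge(small, big, how="outer", indicator=True)

# Rows whose key exists only in `big` have NaN for the columns from `small`
right_only = outer[outer["_merge"] == "right_only"]
```

Filtering on `_merge` like this is a quick way to list the keys that had no counterpart on the other side.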

## Left vs. Right Joins

The `how` parameter of the `merge()` method also supports `left` and `right` joins. This essentially tells Pandas from which dataframe to preserve the keys. For example, if we do a "left" join with *ivies* as the left dataframe, only the keys from *ivies* will be preserved in the merge.

In [108]:
pd.merge(ivies, regions, how='left')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
6,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern


If we do a "right" join, the keys from the *regions* dataframe will be preserved. Because the *regions* dataframe already includes every Ivy League school, the result contains the same set of rows as an outer join.

In [109]:
pd.merge(ivies, regions, how='right')

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Massachusetts Institute of Technology (MIT),,,,Northeastern
1,California Institute of Technology (CIT),,,,California
2,Harvey Mudd College,,,,California
3,"Polytechnic University of New York, Brooklyn",,,,Northeastern
4,Cooper Union,,,,Northeastern
...,...,...,...,...,...
264,Austin Peay State University,,,,Southern
265,Pittsburg State University,,,,Midwestern
266,Southern Utah University,,,,Western
267,Montana State University - Billings,,,,Western


A final note is that if we flip the order in which we pass the dataframes, the joins flip as well. For instance, these two calls return the same rows (only the column order differs):

`pd.merge(ivies, regions, how='left')`

`pd.merge(regions, ivies, how='right')`
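A small sketch of this flip on hypothetical toy frames (`a` and `b` are made up for illustration). The helper `normalize` is also hypothetical; it only fixes the column and row order so that the content can be compared.

```python
import pandas as pd

# Hypothetical toy frames; "A" has no region and "D" has no school type
a = pd.DataFrame({"School Name": ["A", "B", "C"], "School Type": ["x", "y", "z"]})
b = pd.DataFrame({"School Name": ["B", "C", "D"], "Region": ["r1", "r2", "r3"]})

left_join = pd.merge(a, b, how="left")
flipped = pd.merge(b, a, how="right")

def normalize(df, columns):
    # Put columns in a fixed order and sort rows so only the content is compared
    return df[columns].sort_values("School Name").reset_index(drop=True)

cols = ["School Name", "School Type", "Region"]
same_rows = normalize(left_join, cols).equals(normalize(flipped, cols))
column_orders = (list(left_join.columns), list(flipped.columns))
```

Both calls keep the keys of `a`; the columns simply come out in a different order because the left frame's columns are placed first.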


## One-to-One and One-to-Many Joins

When working with data, you will eventually come across terminology that describes the type of association between different entities in our data. In this section we'll review these terms and how they are reflected in Pandas (particularly the `merge()` method).

A **one-to-one** join occurs when each record in a dataframe is associated with one and only one record in another dataframe.

If we merge *ivies* and *regions*, is this a one-to-one merge? Is each school in *ivies* associated with one and only one record in *regions*?

In [110]:
pd.merge(ivies, regions, how ='inner', on = "School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
6,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern


The answer comes down to whether the two datasets both have unique values in the key column. For *ivies*, we clearly see that the School Name key column contains unique values. What about *regions*? 

Let's start by selecting the schools in *regions* that are in *ivies*. If the merge is one-to-one, then each Ivy League school will show up only once in this selection.
* Under the hood, this code creates a boolean mask, returning `True` or `False` for each entry in the "School Name" column of *regions* depending on whether that entry is in the "School Name" column of *ivies*.

In [111]:
regions[regions['School Name'].isin(ivies["School Name"])]

Unnamed: 0,School Name,Region
86,Dartmouth College,Northeastern
87,Princeton University,Northeastern
88,Yale University,Northeastern
89,Harvard University,Northeastern
90,University of Pennsylvania,Northeastern
91,Cornell University,Northeastern
92,Brown University,Northeastern
93,Columbia University,Northeastern


Thus, for the keys that they have in common (the Ivy League school names), both datasets have unique values in the key column.
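The same check can be written as a short test on the shared keys. This is a minimal sketch on hypothetical toy frames, where `left`, `right`, and `shared` are illustrative names only.

```python
import pandas as pd

# Hypothetical toy frames; "Z" is duplicated on the right but is not a shared key
left = pd.DataFrame({"School Name": ["A", "B"]})
right = pd.DataFrame({"School Name": ["A", "B", "Z", "Z"],
                      "Region": ["r1", "r1", "r2", "r2"]})

# Keys from `right` that also appear in `left`
shared = right.loc[right["School Name"].isin(left["School Name"]), "School Name"]

# Unique on both sides for the shared keys means the inner merge is one-to-one
one_to_one_on_shared = left["School Name"].is_unique and shared.is_unique
merged_rows = len(pd.merge(left, right, on="School Name"))
```

Note that `right` as a whole still contains duplicates; only the keys that take part in the inner merge matter for this check.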

What happens if we try to merge our *state* dataframe with *regions*? What type of join is this? That depends on whether both dataframes have unique instances of the values that they have in common in the *key column*. 

A **one-to-many** join occurs when a record in one dataframe can be associated with more than one record in the other dataframe.

In [112]:
state

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
170,Austin Peay State University,State,"$37,700.00","$59,200.00"
171,Pittsburg State University,State,"$40,400.00","$58,200.00"
172,Southern Utah University,State,"$41,900.00","$56,500.00"
173,Montana State University - Billings,State,"$37,900.00","$50,600.00"


Let's check whether *state* contains only unique school names.

In [113]:
state['School Name'].is_unique

True

The *state* dataframe has only unique school names, and there are 175 of them.

Now let's use the same approach as above to select only the schools from *regions* that appear in *state*. 

In [114]:
regions[regions["School Name"].isin(state["School Name"])]

Unnamed: 0,School Name,Region
19,University of Illinois at Urbana-Champaign (UIUC),Midwestern
20,"University of Maryland, College Park",Southern
21,"University of California, Santa Barbara (UCSB)",California
22,University of Texas (UT) - Austin,Southern
23,State University of New York (SUNY) at Albany,Northeastern
...,...,...
264,Austin Peay State University,Southern
265,Pittsburg State University,Midwestern
266,Southern Utah University,Western
267,Montana State University - Billings,Western


This is interesting - when we selected the schools from *regions* that are also in *state*, we got 194 schools back. However, *state* only has 175 unique schools in it. This means that *regions* must have duplicated school names - one or more of the school names in *state* appear two or more times in the *regions* dataframe.

We can confirm this by using the `is_unique` attribute and the `value_counts()` method on *regions*.

In [115]:
regions["School Name"].is_unique

False

In [116]:
regions["School Name"].value_counts()

University of Mississippi               2
Ohio University                         2
Indiana University (IU), Bloomington    2
Randolph-Macon College                  2
Florida State University (FSU)          2
                                       ..
University of Rhode Island (URI)        1
St. Olaf College                        1
Boise State University (BSU)            1
Tarleton State University (TSU)         1
Utah State University                   1
Name: School Name, Length: 249, dtype: int64

Thus, a merge between *regions* and *state* is a **one-to-many** association. Each school name in *state* is unique, but one or more of them is associated with at least two entries in *regions*. When performing the merge, Pandas will duplicate the records that appear more than once. In the merged dataframe, the non-unique records will appear the same number of times as they did in the dataframe that contained the duplicates. 
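A toy sketch of this duplication, using hypothetical frames (`left` and `right` are made up and much smaller than the real data):

```python
import pandas as pd

# Hypothetical toy frames: "B" appears once on the left and twice on the right
left = pd.DataFrame({"School Name": ["A", "B"],
                     "School Type": ["State", "State"]})
right = pd.DataFrame({"School Name": ["A", "B", "B"],
                      "Region": ["Western", "Southern", "Southern"]})

merged = pd.merge(left, right, how="inner", on="School Name")

# Each left row is repeated once per matching right row
rows_per_school = merged["School Name"].value_counts()
```

Here the merged frame has three rows even though `left` has only two, because "B" matches two rows in `right`.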

In [117]:
pd.merge(state, regions, how ='inner', on="School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00",California
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00",Southern
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00",California
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00",California
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00",California
...,...,...,...,...,...
189,Austin Peay State University,State,"$37,700.00","$59,200.00",Southern
190,Pittsburg State University,State,"$40,400.00","$58,200.00",Midwestern
191,Southern Utah University,State,"$41,900.00","$56,500.00",Western
192,Montana State University - Billings,State,"$37,900.00","$50,600.00",Western


Which schools are duplicated? We can check with a simple boolean mask and selection.

In [118]:
pd.merge(state, regions, how ='inner', on="School Name").duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
189    False
190    False
191    False
192    False
193    False
Length: 194, dtype: bool

In [119]:
merged = pd.merge(state, regions, how='inner', on="School Name")
merged.loc[merged.duplicated(keep=False)]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
8,University of Illinois at Urbana-Champaign (UIUC),State,"$52,900.00","$96,100.00",Midwestern
9,University of Illinois at Urbana-Champaign (UIUC),State,"$52,900.00","$96,100.00",Midwestern
14,"University of Maryland, College Park",State,"$52,000.00","$95,000.00",Southern
15,"University of Maryland, College Park",State,"$52,000.00","$95,000.00",Southern
16,"University of California, Santa Barbara (UCSB)",State,"$50,500.00","$95,000.00",California
17,"University of California, Santa Barbara (UCSB)",State,"$50,500.00","$95,000.00",California
18,University of Texas (UT) - Austin,State,"$49,700.00","$93,900.00",Southern
19,University of Texas (UT) - Austin,State,"$49,700.00","$93,900.00",Southern
22,State University of New York (SUNY) at Albany,State,"$44,500.00","$92,200.00",Northeastern
23,State University of New York (SUNY) at Albany,State,"$44,500.00","$92,200.00",Northeastern


In this instance, these duplicates do not add any extra value. We can safely drop them from the merge.

In [120]:
pd.merge(state, regions, how ='inner', on="School Name").drop_duplicates()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00",California
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00",Southern
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00",California
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00",California
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00",California
...,...,...,...,...,...
189,Austin Peay State University,State,"$37,700.00","$59,200.00",Southern
190,Pittsburg State University,State,"$40,400.00","$58,200.00",Midwestern
191,Southern Utah University,State,"$41,900.00","$56,500.00",Western
192,Montana State University - Billings,State,"$37,900.00","$50,600.00",Western


Better yet, since we know that *regions* contains duplicates, we can remove them from the *regions* dataframe before we even perform the merge. In doing so, the merge becomes one-to-one since *regions* no longer has any repeating school names.

In [121]:
pd.merge(state, regions.drop_duplicates(), how ='inner', on="School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00",California
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00",Southern
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00",California
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00",California
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00",California
...,...,...,...,...,...
170,Austin Peay State University,State,"$37,700.00","$59,200.00",Southern
171,Pittsburg State University,State,"$40,400.00","$58,200.00",Midwestern
172,Southern Utah University,State,"$41,900.00","$56,500.00",Western
173,Montana State University - Billings,State,"$37,900.00","$50,600.00",Western
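One way to guard against unexpected duplicates is the `validate` parameter of `pd.merge()`, which raises an error when the key columns do not have the declared relationship. The sketch below uses hypothetical toy frames, not the course data.

```python
import pandas as pd

# Hypothetical toy frames; the right frame repeats the "B" row
left = pd.DataFrame({"School Name": ["A", "B"],
                     "School Type": ["State", "State"]})
right = pd.DataFrame({"School Name": ["A", "B", "B"],
                      "Region": ["Western", "Southern", "Southern"]})

# validate="one_to_one" makes merge raise if either key column has duplicates
try:
    pd.merge(left, right, on="School Name", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# After dropping the duplicated rows, the same check passes
deduped = pd.merge(left, right.drop_duplicates(), on="School Name",
                   validate="one_to_one")
```

The check looks at the whole key column, so it is stricter than inspecting only the keys the two frames share.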


## Many-to-Many Joins

Many-to-many joins occur when we have duplicates in the key columns of both dataframes that are being merged.

Suppose we asked 4 people how much value they perceive in an Ivy League or Engineering degree, and recorded their survey responses.

In [122]:
survey = pd.DataFrame({
    "School Type": ['Ivy League', 'Ivy League', 'Engineering', 'Engineering'],
    "Prestige":['High', "Good", "Good", "Okay"],
    "Respondent": [1, 2, 3, 4]
})

In [123]:
survey

Unnamed: 0,School Type,Prestige,Respondent
0,Ivy League,High,1
1,Ivy League,Good,2
2,Engineering,Good,3
3,Engineering,Okay,4


Before we go further, notice that "School Type" has duplicates.

Now let's merge the made-up survey data with the *ivies* dataframe. In this case, the **key column** for the merge will be "School Type", since that is the column that they have in common. But notice that the "School Type" column in *ivies* also contains duplicates. So what's going to happen? 

In [124]:
ivies

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"


In [125]:
pd.merge(ivies, survey)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Prestige,Respondent
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",High,1
1,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Good,2
2,Princeton University,Ivy League,"$66,500.00","$131,000.00",High,1
3,Princeton University,Ivy League,"$66,500.00","$131,000.00",Good,2
4,Yale University,Ivy League,"$59,100.00","$126,000.00",High,1
5,Yale University,Ivy League,"$59,100.00","$126,000.00",Good,2
6,Harvard University,Ivy League,"$63,400.00","$124,000.00",High,1
7,Harvard University,Ivy League,"$63,400.00","$124,000.00",Good,2
8,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",High,1
9,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Good,2


When we merge these together, we end up with a many-to-many association. The merged dataframe has twice the number of records that we started with (16 instead of 8). This is because each school in *ivies* is of the type "Ivy League", and the *survey* dataframe has two entries for Ivy League in the "School Type" key column. 

More concretely, the "Ivy League" school type received two responses in our survey, and thus the merge will need to reflect two responses for each school. 
* There are no engineering schools in the *ivies* dataframe, and thus they are not applicable here.
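The row count of a many-to-many inner merge can be estimated ahead of time: each shared key contributes (occurrences on the left) × (occurrences on the right) rows, which is how 8 schools × 2 responses gives 16. A minimal sketch on hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy frames: key "x" appears 3 times on the left, 2 times on the right
left = pd.DataFrame({"key": ["x", "x", "x", "y"], "left_id": [1, 2, 3, 4]})
right = pd.DataFrame({"key": ["x", "x", "z"], "right_id": [10, 20, 30]})

# For an inner merge, each shared key contributes (left count) * (right count) rows
left_counts = left["key"].value_counts()
right_counts = right["key"].value_counts()

# Keys present on one side only become NaN in the product and are dropped
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = pd.merge(left, right, on="key")
```

In this toy case only "x" is shared, so the merge has 3 × 2 = 6 rows.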

Let's add another respondent to our *survey* dataframe and see what happens to the merge.

In [126]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement
survey = pd.concat([survey, pd.DataFrame({"School Type": "Ivy League", "Prestige": "Very High", "Respondent": 5}, index=[4])])

In [127]:
survey

Unnamed: 0,School Type,Prestige,Respondent
0,Ivy League,High,1
1,Ivy League,Good,2
2,Engineering,Good,3
3,Engineering,Okay,4
4,Ivy League,Very High,5


Now our *survey* dataframe contains THREE Ivy League responses. So when we repeat the merge with *ivies*, we should get 24 rows since each Ivy League school type will have three different values applied to it.

In [128]:
pd.merge(ivies, survey)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Prestige,Respondent
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",High,1
1,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Good,2
2,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Very High,5
3,Princeton University,Ivy League,"$66,500.00","$131,000.00",High,1
4,Princeton University,Ivy League,"$66,500.00","$131,000.00",Good,2
5,Princeton University,Ivy League,"$66,500.00","$131,000.00",Very High,5
6,Yale University,Ivy League,"$59,100.00","$126,000.00",High,1
7,Yale University,Ivy League,"$59,100.00","$126,000.00",Good,2
8,Yale University,Ivy League,"$59,100.00","$126,000.00",Very High,5
9,Harvard University,Ivy League,"$63,400.00","$124,000.00",High,1


## Merging by Index


Thus far, we've been merging by columns. Pandas either identifies the key column automatically or is directed by us (via the `on`, `left_on`, and `right_on` parameters) to use a particular column in each dataset as the key for the merge.

Occasionally, however, we may be interested in joining by **index** instead. Fortunately, the Pandas `merge()` method fully supports this.

Suppose we are working with versions of our *ivies* and *regions* dataframes where the school name is the index (instead of the incremental numbers).

In [130]:
ivies4 = ivies.set_index("School Name")

In [131]:
ivies4

Unnamed: 0_level_0,School Type,Starting Median Salary,Mid-Career Median Salary
School Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"


In [132]:
regions2 = regions.set_index("School Name")

In [134]:
regions2.head()

Unnamed: 0_level_0,Region
School Name,Unnamed: 1_level_1
Massachusetts Institute of Technology (MIT),Northeastern
California Institute of Technology (CIT),California
Harvey Mudd College,California
"Polytechnic University of New York, Brooklyn",Northeastern
Cooper Union,Northeastern


Now, if we attempt a standard merge on this, we'll see that Pandas complains because there is no common column to merge on.

In [136]:
## This will not work - no common column
# pd.merge(ivies4, regions2)
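For reference, a hedged sketch of what this failure looks like, reproduced on hypothetical toy frames whose only shared information is the index:

```python
import pandas as pd

# Hypothetical toy frames whose shared information lives only in the index
left = pd.DataFrame({"School Type": ["Ivy League"]},
                    index=pd.Index(["A"], name="School Name"))
right = pd.DataFrame({"Region": ["Northeastern"]},
                     index=pd.Index(["A"], name="School Name"))

try:
    pd.merge(left, right)  # no shared column names and no keys specified
    error_message = None
except pd.errors.MergeError as exc:
    # pandas reports that it found no common columns to merge on
    error_message = str(exc)
```

Catching `pd.errors.MergeError` keeps the notebook running while still showing why the call is rejected.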

Thus, we need a way to tell Pandas to use the index axis as the key column. To do this, we can use the `left_index` and `right_index` parameters, which accept boolean values. Let's try that.

In [137]:
pd.merge(ivies4, regions2, left_index = True, right_index = True)

Unnamed: 0_level_0,School Type,Starting Median Salary,Mid-Career Median Salary,Region
School Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern
Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern


So what did this do? It's not too different from a column-based join, except that the key "column" is the index of each dataframe!

You can also use these parameters to merge using the index of one dataframe as a key column and a standard column of another dataframe as a key column. You can do this by:
* combining `left_index` with `right_on`
* combining `right_index` with `left_on`

Let's test this out with the standard *regions* dataframe and *ivies4*. Here, we will merge on the index of *ivies4* and on the "School Name" column of *regions*.

In [138]:
regions.head()

Unnamed: 0,School Name,Region
0,Massachusetts Institute of Technology (MIT),Northeastern
1,California Institute of Technology (CIT),California
2,Harvey Mudd College,California
3,"Polytechnic University of New York, Brooklyn",Northeastern
4,Cooper Union,Northeastern


In [139]:
pd.merge(ivies4, regions, left_index = True, right_on = "School Name")

Unnamed: 0,School Type,Starting Median Salary,Mid-Career Median Salary,School Name,Region
86,Ivy League,"$58,000.00","$134,000.00",Dartmouth College,Northeastern
87,Ivy League,"$66,500.00","$131,000.00",Princeton University,Northeastern
88,Ivy League,"$59,100.00","$126,000.00",Yale University,Northeastern
89,Ivy League,"$63,400.00","$124,000.00",Harvard University,Northeastern
90,Ivy League,"$60,900.00","$120,000.00",University of Pennsylvania,Northeastern
91,Ivy League,"$60,300.00","$110,000.00",Cornell University,Northeastern
92,Ivy League,"$56,200.00","$109,000.00",Brown University,Northeastern
93,Ivy League,"$59,400.00","$107,000.00",Columbia University,Northeastern


## The `join()` Method

For the types of joins that we saw in the previous lecture (index-on-index and index-on-columns), Pandas provides a convenient dataframe instance method called `join()`.

Recall our *ivies4* and *regions2* dataframes. We can use `join()` to merge these together very succinctly.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
* If no additional parameters are provided, the method automatically joins index-on-index.

In [140]:
ivies4.join(regions2)

Unnamed: 0_level_0,School Type,Starting Median Salary,Mid-Career Median Salary,Region
School Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern
Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern


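This gives the same result as the following, which we ran above. Note that `join()` defaults to `how='left'` while `merge()` defaults to `how='inner'`; the results match here because every school in *ivies4* also appears in *regions2*.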

In [143]:
pd.merge(ivies4, regions2, left_index = True, right_index = True)

Unnamed: 0_level_0,School Type,Starting Median Salary,Mid-Career Median Salary,Region
School Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern
Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern


`join()` also supports column-on-index joins. The `on` parameter names the column in the caller dataframe to use as the key, and the index of the *other* dataframe serves as the key on that side.

In [145]:
ivies.join(regions2, on = "School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
6,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern


Compare that to the `merge()` method to accomplish the same task. More typing, same result.

In [148]:
pd.merge(ivies, regions2, left_on = "School Name", right_index = True)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00",Northeastern
1,Princeton University,Ivy League,"$66,500.00","$131,000.00",Northeastern
2,Yale University,Ivy League,"$59,100.00","$126,000.00",Northeastern
3,Harvard University,Ivy League,"$63,400.00","$124,000.00",Northeastern
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Northeastern
5,Cornell University,Ivy League,"$60,300.00","$110,000.00",Northeastern
6,Brown University,Ivy League,"$56,200.00","$109,000.00",Northeastern
7,Columbia University,Ivy League,"$59,400.00","$107,000.00",Northeastern


Think of the `join()` method as a convenience method that allows us to use shorter code in instances where we want to merge **index-to-index** or **column-to-index**. Performing a **column-to-column** merge requires the `merge()` method. 

Under the hood, `join()` calls the `merge()` method, so anything you can accomplish with `join()` can also be accomplished with `merge()`.
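A small sketch of that correspondence on hypothetical toy frames. One detail worth keeping in mind is that the two have different defaults: `join()` uses `how="left"`, while `merge()` uses `how="inner"`.

```python
import pandas as pd

# Hypothetical toy frames indexed by school name; "B" has no region
left = pd.DataFrame({"School Type": ["Ivy League", "State"]},
                    index=pd.Index(["A", "B"], name="School Name"))
right = pd.DataFrame({"Region": ["Northeastern"]},
                     index=pd.Index(["A"], name="School Name"))

# join() defaults to how="left"; merge() defaults to how="inner"
joined = left.join(right)
merged_left = pd.merge(left, right, left_index=True, right_index=True, how="left")
merged_default = pd.merge(left, right, left_index=True, right_index=True)
```

With `how="left"` spelled out, the `merge()` call matches `join()`; with the default inner merge, the unmatched "B" row is dropped.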

## Skill Challenge

#### 1. Merge the *liberal* dataframe with the *regions* dataframe and name the resulting dataframe *dfm*. Which region has the highest number of liberal arts schools?

Start by merging the dataframes. The two dataframes share a common column in "School Name", so we'll merge on that automatically using the `merge()` method.

In [155]:
liberal.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00"
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00"


In [156]:
regions.head()

Unnamed: 0,School Name,Region
0,Massachusetts Institute of Technology (MIT),Northeastern
1,California Institute of Technology (CIT),California
2,Harvey Mudd College,California
3,"Polytechnic University of New York, Brooklyn",Northeastern
4,Cooper Union,Northeastern


In [158]:
dfm = pd.merge(liberal, regions)

In [159]:
dfm.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00",Northeastern
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00",Northeastern
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00",Northeastern
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00",Northeastern
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00",Northeastern


We can determine the region with the largest number of liberal arts schools by calling `value_counts()` on the "Region" column.

In [161]:
dfm.Region.value_counts()

Northeastern    25
Midwestern       8
Western          7
Southern         5
California       3
Name: Region, dtype: int64

The Northeast has the greatest number of liberal arts schools.
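If only the top region is needed, rather than the full table of counts, `idxmax()` can pull it out directly. A minimal sketch on a hypothetical stand-in for the "Region" column:

```python
import pandas as pd

# Hypothetical stand-in for a "Region" column
region = pd.Series(["Northeastern", "Western", "Northeastern", "Southern"],
                   name="Region")

counts = region.value_counts()

# idxmax() returns the index label (here, the region) with the largest count
top_region = counts.idxmax()
top_count = counts.max()
```

Applied to the merged data, the same pattern would be `dfm["Region"].value_counts().idxmax()`.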

#### 2. Set *school_name* as the index of the *mid_career* dataframe. Do the operation inplace so that it's ready to go for part 3.

This is simple enough, but first let's remind ourselves what *mid_career* looks like.

In [162]:
mid_career.head()

Unnamed: 0,school_name,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,Massachusetts Institute of Technology (MIT),"$76,800.00","$99,200.00","$168,000.00","$220,000.00"
1,California Institute of Technology (CIT),,"$104,000.00","$161,000.00",
2,Harvey Mudd College,,"$96,000.00","$180,000.00",
3,"Polytechnic University of New York, Brooklyn","$66,800.00","$94,300.00","$143,000.00","$190,000.00"
4,Cooper Union,,"$80,200.00","$142,000.00",


Now let's set the index.

In [163]:
mid_career.set_index("school_name", inplace = True)

In [165]:
mid_career.head()

Unnamed: 0_level_0,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
school_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Massachusetts Institute of Technology (MIT),"$76,800.00","$99,200.00","$168,000.00","$220,000.00"
California Institute of Technology (CIT),,"$104,000.00","$161,000.00",
Harvey Mudd College,,"$96,000.00","$180,000.00",
"Polytechnic University of New York, Brooklyn","$66,800.00","$94,300.00","$143,000.00","$190,000.00"
Cooper Union,,"$80,200.00","$142,000.00",
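
As an aside, `set_index` returns a new dataframe by default, so reassignment is an alternative to `inplace = True`. A minimal sketch, assuming a hypothetical two-row miniature called `mini` rather than the real *mid_career*:

```python
import pandas as pd

# Hypothetical miniature of mid_career (values copied from the preview above)
mini = pd.DataFrame({
    "school_name": ["Massachusetts Institute of Technology (MIT)", "Cooper Union"],
    "Mid-Career 25th Percentile Salary": ["$99,200.00", "$80,200.00"],
})

# Without inplace=True, set_index leaves `mini` untouched and returns a new frame
mini_indexed = mini.set_index("school_name")

print(mini_indexed.index.name)      # name of the new index
print(list(mini_indexed.columns))   # school_name is no longer a column
```

The exercise asks for the in-place version, which is what we used above; the reassignment form is shown only for comparison.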


#### 3. Merge *dfm* from Part 1 with the *mid_career* dataframe. Is this join operation one-to-one?

Let's start by performing the merge. *dfm* and *mid_career* share the school names: they sit in the index of *mid_career* (which happens to be named "school_name") and in the "School Name" column of *dfm*. So that is what we will merge on.

In [184]:
pd.merge(dfm, mid_career, left_on = "School Name", right_index = True)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00",Northeastern,"$62,800.00","$80,600.00","$156,000.00","$251,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00",Northeastern,"$60,000.00","$76,700.00","$167,000.00","$265,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00",Northeastern,,"$84,900.00","$162,000.00",
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00",Northeastern,"$70,600.00","$79,300.00","$144,000.00","$204,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00",Northeastern,,"$74,600.00","$146,000.00",
5,College of the Holy Cross,Liberal Arts,"$50,200.00","$106,000.00",Northeastern,,"$65,600.00","$143,000.00",
6,Occidental College,Liberal Arts,"$51,900.00","$105,000.00",California,,"$54,800.00","$157,000.00",
7,Washington and Lee University,Liberal Arts,"$53,600.00","$104,000.00",Southern,,"$82,800.00","$146,000.00",
8,Swarthmore College,Liberal Arts,"$49,700.00","$104,000.00",Northeastern,,"$67,200.00","$167,000.00",
9,Davidson College,Liberal Arts,"$46,100.00","$104,000.00",Southern,,"$70,500.00","$146,000.00",
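
To make the `left_on` / `right_index` combination concrete, here is a small sketch on toy data. The frames `left` and `right` and the column `p25` are hypothetical and are not part of the datasets used in this section. It also shows one alternative: turning the index back into a column with `reset_index()` and merging on two columns.

```python
import pandas as pd

# Hypothetical toy frames: the key is a column on the left, the index on the right
left = pd.DataFrame({
    "School Name": ["A", "B"],
    "Region": ["Northeastern", "Southern"],
})
right = pd.DataFrame(
    {"p25": [1, 2]},
    index=pd.Index(["A", "B"], name="school_name"),
)

# Match the left column against the right index
via_index = pd.merge(left, right, left_on="School Name", right_index=True)

# Alternative: move the index into a column, then merge column against column
via_column = pd.merge(
    left, right.reset_index(), left_on="School Name", right_on="school_name"
)

print(via_index)
print(via_column)
```

The second form keeps both key columns ("School Name" and "school_name") in the result, which is one reason the index-based form can be tidier here.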


Remember that in a one-to-one merge, both dataframes have unique, non-duplicated values in their keys. To determine whether our merge is one-to-one, we can use the `duplicated()` method followed by `value_counts()`. Called on the merged dataframe, `duplicated()` flags rows that are identical to an earlier row across all columns.

In [178]:
pd.merge(dfm, mid_career, left_on = "School Name", right_index = True).duplicated(keep = "first").value_counts()

False    47
True      3
dtype: int64

Thus, this merged dataframe has three entries that are duplicates of one or more other entries. Which ones are the duplicates? Other than looking through the frame, we can select the duplicates explicitly.

First, let's save our boolean mask of duplicates to a variable, *dfm_dupes*.

In [181]:
dfm_dupes = pd.merge(dfm, mid_career, left_on = "School Name", right_index = True).duplicated(keep = "first")

Now let's assign our merged dataframe to the variable *dfm_mid_merge*.

In [182]:
dfm_mid_merge = pd.merge(dfm, mid_career, left_on = "School Name", right_index = True)

Finally, let's use our *dfm_dupes* boolean mask to select the duplicates from *dfm_mid_merge*.

In [183]:
dfm_mid_merge.loc[dfm_dupes]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
28,Randolph-Macon College,Liberal Arts,"$42,600.00","$83,600.00",Southern,,"$54,100.00","$123,000.00",
29,Randolph-Macon College,Liberal Arts,"$42,600.00","$83,600.00",Southern,,"$54,100.00","$123,000.00",
29,Randolph-Macon College,Liberal Arts,"$42,600.00","$83,600.00",Southern,,"$54,100.00","$123,000.00",


We therefore see that Randolph-Macon College accounts for all three flagged rows. Together with its first occurrence, which `keep = "first"` leaves unflagged, that makes four entries in total. In which dataframe did it appear multiple times?

In [189]:
dfm.loc[dfm["School Name"] == 'Randolph-Macon College']

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,Region
28,Randolph-Macon College,Liberal Arts,"$42,600.00","$83,600.00",Southern
29,Randolph-Macon College,Liberal Arts,"$42,600.00","$83,600.00",Southern


In [196]:
mid_career[mid_career.index == "Randolph-Macon College"]

school_name,Mid-Career 10th Percentile Salary,Mid-Career 25th Percentile Salary,Mid-Career 75th Percentile Salary,Mid-Career 90th Percentile Salary
Randolph-Macon College,,"$54,100.00","$123,000.00",
Randolph-Macon College,,"$54,100.00","$123,000.00",


Randolph-Macon College appeared twice in both constituent dataframes. That explains why there were four entries (2 x 2) in the merged dataframe. This is a **many-to-many** merge.
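
The 2 x 2 effect, and two other ways of checking the merge type, can be sketched on toy data. Everything below is hypothetical (the frames `left` and `right`, the keys "X" and "Y", and the column `p25`). The idea is to inspect the keys directly with `is_unique`, or to let `pd.merge` do the check through its `validate` parameter, which raises a `MergeError` when the keys do not match the stated relationship.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy frames: the key "X" appears twice on each side
left = pd.DataFrame({
    "School Name": ["X", "X", "Y"],
    "Region": ["Southern", "Southern", "Western"],
})
right = pd.DataFrame(
    {"p25": [10, 10, 20]},
    index=pd.Index(["X", "X", "Y"], name="school_name"),
)

# Option 1: check the keys themselves before merging
left_unique = left["School Name"].is_unique
right_unique = right.index.is_unique
print(left_unique, right_unique)

# Each "X" on the left pairs with each "X" on the right: 2 x 2 = 4 rows
merged = pd.merge(left, right, left_on="School Name", right_index=True)
n_x = int((merged["School Name"] == "X").sum())
print(n_x, len(merged))

# Option 2: ask merge to enforce the relationship we expect
try:
    pd.merge(left, right, left_on="School Name", right_index=True,
             validate="one_to_one")
    raised = False
except MergeError:
    raised = True
print(raised)
```

Checking the keys has the advantage of not depending on entire rows being identical, which is what `duplicated()` on the merged dataframe relies on.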