# Section 6: Working with Multiple DataFrames

In [1]:
import pandas as pd
import numpy as np

## Introducing Five New Datasets

In this section, we will be working with data describing salary information for US colleges, grouped by school type (engineering, state, party, liberal arts, and Ivy League). Each of these data sources stands separately, and it will be our job to piece them together with the methods we will learn in this section.

We will begin by defining local variables that point to the URLs for the data.

In [2]:
# Dataset URL Sources
eng_url = 'https://andybek.com/pandas-eng'
state_url = 'https://andybek.com/pandas-state'
party_url = 'https://andybek.com/pandas-party'
liberal_url = 'https://andybek.com/pandas-liberal'
ivies_url = 'https://andybek.com/pandas-ivies'


Let's begin by reading in the engineering school salaries. This gives a list of 19 engineering schools in the US and their starting and mid-career median salaries.

In [3]:
eng = pd.read_csv(eng_url)

In [4]:
eng.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00"


Now let's read in the state school data, which describes public universities.

In [5]:
state = pd.read_csv(state_url)

In [6]:
state.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


Next, we'll read in the party school data: schools that the Wall Street Journal categorizes as "party schools" with heavy alcohol use.

In [7]:
party = pd.read_csv(party_url)

In [8]:
party.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,University of Illinois at Urbana-Champaign (UIUC),Party,"$52,900.00","$96,100.00"
1,"University of Maryland, College Park",Party,"$52,000.00","$95,000.00"
2,"University of California, Santa Barbara (UCSB)",Party,"$50,500.00","$95,000.00"
3,University of Texas (UT) - Austin,Party,"$49,700.00","$93,900.00"
4,State University of New York (SUNY) at Albany,Party,"$44,500.00","$92,200.00"


Next up, we have liberal arts schools, which emphasize rational thinking and first principles over technical knowledge.

In [9]:
liberal = pd.read_csv(liberal_url)

In [10]:
liberal.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00"
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00"


Last but not least, let's take a look at the Ivy League schools, a group of highly prestigious and highly selective schools in the American northeast.

In [11]:
ivies = pd.read_csv(ivies_url)

In [12]:
ivies.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"


## Concatenating DataFrames

In the previous section, we read in 5 separate dataframes with a similar structure. Let's check the shape of each dataframe to confirm that they are shaped similarly.

In [13]:
dfs = [state, eng, liberal, ivies, party]

In [14]:
for df in dfs:
    print(df.shape)

(175, 4)
(19, 4)
(47, 4)
(8, 4)
(20, 4)


So we have 5 two-dimensional dataframes, each with a different number of schools but the same four columns. 

As a first step, let's put them all together, a process called **concatenation**. Suppose we want to concatenate the Ivies and engineering schools. To do this, we can call the `pd.concat()` function, passing in a list of the dataframes that we want to concatenate.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

In [15]:
pd.concat([ivies, eng])

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


Let's check the shape of the concatenated dataframe.

In [16]:
pd.concat([ivies, eng]).shape

(27, 4)

Let's try concatenating all of these dataframes!

In [17]:
pd.concat(dfs)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
15,University of New Hampshire (UNH),Party,"$41,800.00","$78,300.00"
16,West Virginia University (WVU),Party,"$43,100.00","$78,100.00"
17,University of Tennessee,Party,"$43,800.00","$74,600.00"
18,Ohio University,Party,"$42,200.00","$73,400.00"


We've now concatenated all of the dataframes into one master frame!



In [18]:
pd.concat(dfs).head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


But there's a problem with this - these datasets are NOT mutually exclusive. Some schools are in more than one dataframe - for example, most party schools are also state schools.

In [19]:
pd.concat(dfs)["School Name"].value_counts()

University of California, Santa Barbara (UCSB)    2
University of Iowa (UI)                           2
University of Tennessee                           2
Pennsylvania State University (PSU)               2
University of New Hampshire (UNH)                 2
                                                 ..
University of Nevada, Reno (UNR)                  1
University of Wisconsin (UW) - Oshkosh            1
University of Missouri - Columbia                 1
Colby College                                     1
Penn State - Harrisburg                           1
Name: School Name, Length: 249, dtype: int64

To see this more clearly, let's use the `set()` function and the set `difference()` method to compare the party and state school name lists. Remember that
* sets do not allow duplicate values. Thus, we will get a unique collection when we call the `set()` function.
* the `difference()` method returns any values that are in the first collection but not in the other

In [20]:
set(party["School Name"]).difference(state['School Name'])

{'Randolph-Macon College'}

And thus we see that Randolph-Macon College is the only party school that is not also a state school. For those who care, Randolph-Macon College is a private party school. It's also a liberal arts school.

In [21]:
'Randolph-Macon College' in liberal["School Name"].values

True

Thus, all party schools are either state schools, liberal arts schools, or both. 

Just for kicks, let's see if any engineering schools are also party schools using the `intersection()` set method.

In [22]:
set(eng["School Name"]).intersection(party['School Name'])

set()

Guess not.

Another way to check for duplication is to use the `duplicated()` method on the concatenated dataframe. Remember this will generate a boolean Series, which we can use to make a selection.

In [23]:
pd.concat(dfs).duplicated(subset = ['School Name'], keep = 'first')

0     False
1     False
2     False
3     False
4     False
      ...  
15     True
16     True
17     True
18     True
19     True
Length: 269, dtype: bool

In [24]:
pd.concat(dfs)[pd.concat(dfs).duplicated(subset = ['School Name'], keep = 'first')]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,University of Illinois at Urbana-Champaign (UIUC),Party,"$52,900.00","$96,100.00"
1,"University of Maryland, College Park",Party,"$52,000.00","$95,000.00"
2,"University of California, Santa Barbara (UCSB)",Party,"$50,500.00","$95,000.00"
3,University of Texas (UT) - Austin,Party,"$49,700.00","$93,900.00"
4,State University of New York (SUNY) at Albany,Party,"$44,500.00","$92,200.00"
5,University of Florida (UF),Party,"$47,100.00","$87,900.00"
6,Louisiana State University (LSU),Party,"$46,900.00","$87,800.00"
7,University of Georgia (UGA),Party,"$44,100.00","$86,000.00"
8,Pennsylvania State University (PSU),Party,"$49,900.00","$85,700.00"
9,Arizona State University (ASU),Party,"$47,400.00","$84,100.00"


So check that out, there are 20 schools that are duplicates in our dataset, and they all happen to be party schools. Because we know that our *party* dataframe has 20 rows, this means that the entire *party* dataframe consists of schools that show up in at least one other constituent dataframe. 
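The `keep` argument decides which occurrence of a repeated value gets flagged. Here is a minimal sketch on a made-up frame (the `toy` dataframe and its values are hypothetical, not part of the salary data) comparing the three options.

```python
import pandas as pd

# Hypothetical toy data: "B" appears twice.
toy = pd.DataFrame({
    "School Name": ["A", "B", "B", "C"],
    "School Type": ["State", "State", "Party", "Ivy League"],
})

# keep="first": flag every occurrence except the first one
first_mask = toy.duplicated(subset=["School Name"], keep="first")
# keep="last": flag every occurrence except the last one
last_mask = toy.duplicated(subset=["School Name"], keep="last")
# keep=False: flag all occurrences of a repeated value
all_mask = toy.duplicated(subset=["School Name"], keep=False)

print(first_mask.tolist())  # [False, False, True, False]
print(last_mask.tolist())   # [False, True, False, False]
print(all_mask.tolist())    # [False, True, True, False]
```

With `keep='first'`, the rows that get flagged are the later copies, which is why only the party rows show up as duplicates above: the *party* dataframe is last in our list.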

How can we fix this? One way is to simply remove the party schools from our list of dataframes that we use for the concatenation. 

A more programmatic way of doing this is to use the `drop_duplicates()` method, subsetting by school name.

In [25]:
pd.concat(dfs).drop_duplicates(subset = "School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"


Note that we've now lost the information of whether a particular school is a party school (in addition to being a state or liberal arts school).
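One possible way to keep that information, sketched here on made-up data, is to record it in a boolean column before dropping the duplicates. The frames `state_toy` and `party_toy` and the column name `Is Party School` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-ins for the state and party dataframes.
state_toy = pd.DataFrame({
    "School Name": ["School A", "School B"],
    "School Type": ["State", "State"],
})
party_toy = pd.DataFrame({
    "School Name": ["School B"],
    "School Type": ["Party"],
})

combined = pd.concat([state_toy, party_toy], ignore_index=True)

# Flag every row whose school also appears in the party list...
combined["Is Party School"] = combined["School Name"].isin(party_toy["School Name"])

# ...and only then drop the repeated school names.
deduped = combined.drop_duplicates(subset="School Name")
print(deduped)
```

Negating the same column with `~` would give a mask for the non-party schools.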

Let's lock this into memory by assigning to a variable.

In [26]:
schools = pd.concat(dfs).drop_duplicates(subset = "School Name")

As a side activity, let's try to remove all party schools from the concatenated data frame completely. This would allow us to do comparisons like, for example, comparing starting salaries at party schools versus non-party schools. Remember that we can't simply remove the party schools from the dataframe, because all of those schools are also present as *state* or *liberal arts* schools which are in the dataframe. We have to remove any school that is a party school completely.

To start, let's use a list comprehension to create a new boolean mask that indicates whether a school **is not** in the list of party schools. This will be tested for every school in the concatenated school dataframe.

In [27]:
is_not_party_school = [school not in party["School Name"].values for school in schools["School Name"].values]

We can now use selection to grab the schools from the concatenated dataframe that are NOT party schools!

In [28]:
schools[is_not_party_school]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"


## The Duplicated Index Issue

In the previous lecture, we concatenated 5 dataframes together and dropped the duplicated schools. Let's look at this dataframe.


In [29]:
schools

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"


See the problem here? Our index for the concatenated dataframe contains duplicate indices. As an example, let's grab all rows with an index of `0`.

In [30]:
schools.loc[0]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"


This happens because the component dataframes all have indices that start at 0, and **`pd.concat()` does not discard the original index of the dataframes being concatenated**.

To illustrate further, let's take a look at the indices using the `duplicated()` method.


In [31]:
schools.index.duplicated()

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

As you can see, the indices of the later dataframes overlap completely with the indices of the first dataframe.

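It is technically okay to have duplicated indices in Pandas dataframes. However, you will give up a lot of functionality by doing so. For example, label-based slices with `.loc` will not work on our dataframe, because the slice boundaries are ambiguous when a label appears more than once.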


In [32]:
## This does NOT work with duplicated indices
# schools.loc[0:2]
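To see the failure without breaking the notebook, here is a small sketch on made-up frames (`left`, `right`, and `stacked` are hypothetical names) that checks `is_unique` and catches the error raised by the slice.

```python
import pandas as pd

# Hypothetical toy frames, each with its own default 0-based index.
left = pd.DataFrame({"School Name": ["A", "B", "C"]})
right = pd.DataFrame({"School Name": ["D", "E"]})

stacked = pd.concat([left, right])  # resulting index: 0, 1, 2, 0, 1

print(stacked.index.is_unique)  # False

# The label 0 occurs twice, so pandas cannot tell where the slice starts.
try:
    stacked.loc[0:2]
    slice_failed = False
except KeyError as err:
    slice_failed = True
    print("KeyError:", err)
```

Checking `index.is_unique` is a quick single-value alternative to scanning the full output of `index.duplicated()`.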

For this and other reasons, it is usually best practice to have an index with unique values across all records. Can we fix this for our concatenated dataframe? Yes we can!

The first method we can use is `reset_index()`. This method restores a 0-based range index in the dataframe while pushing the old index into a regular column in the dataframe.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

In [33]:
schools.reset_index()

Unnamed: 0,index,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...,...
244,3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
245,4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
246,5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
247,6,Brown University,Ivy League,"$56,200.00","$109,000.00"


Oftentimes, however, there is no use for the old index. So we can remove it entirely by setting the `drop` parameter to `True`. Let's also perform the method in place to modify the underlying dataframe.

In [34]:
schools.reset_index(drop = True, inplace=True)

In [35]:
schools

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
244,Harvard University,Ivy League,"$63,400.00","$124,000.00"
245,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
246,Cornell University,Ivy League,"$60,300.00","$110,000.00"
247,Brown University,Ivy League,"$56,200.00","$109,000.00"


Let's make sure the duplicated indices are gone.

In [36]:
schools.index.duplicated()

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

Beautiful. We now know that our index is unique.

An even more elegant approach is to **discard the old indices as we do the concatenation**. We can do this by passing the `ignore_index` parameter to `pd.concat()`. This tells Pandas to ignore the indices of the dataframes being concatenated and generate a new index for the resulting dataframe instead.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html



In [37]:
pd.concat(dfs, ignore_index=True).drop_duplicates(subset = "School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
244,Harvard University,Ivy League,"$63,400.00","$124,000.00"
245,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
246,Cornell University,Ivy League,"$60,300.00","$110,000.00"
247,Brown University,Ivy League,"$56,200.00","$109,000.00"


As a sanity check, let's check the index for duplicates.

In [38]:
pd.concat(dfs, ignore_index=True).drop_duplicates(subset = "School Name").index.duplicated()

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

Awesome, no duplicates!
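One caveat worth keeping in mind: `ignore_index` numbers the rows *before* `drop_duplicates()` runs, so the index stays unique but can end up with gaps when a dropped row sits in the middle. The sketch below uses made-up frames (`a` and `b` are hypothetical) and chains `reset_index(drop=True)` to close the gap.

```python
import pandas as pd

# Hypothetical toy frames that share the school "B".
a = pd.DataFrame({"School Name": ["A", "B"]})
b = pd.DataFrame({"School Name": ["B", "C"]})

deduped = pd.concat([a, b], ignore_index=True).drop_duplicates(subset="School Name")
print(deduped.index.tolist())  # [0, 1, 3]  (unique, but label 2 is gone)

# Renumber after dropping if a gap-free 0-based index is wanted.
tidy = deduped.reset_index(drop=True)
print(tidy.index.tolist())     # [0, 1, 2]
```

In our salary data the dropped rows all come from the last dataframe in the list, so no gap appears there.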

## Enforcing Unique Indices

We previously explored ways to remove the old indices of dataframes when concatenating them. But what if those indices contained useful information and we wanted to keep them, yet still create a concatenated dataframe with unique indices? Turns out we can do that.

To illustrate, let's assume our original dataframes have the school name as the index.

In [41]:
ivies2 = ivies.set_index('School Name')

In [42]:
ivies2

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"


Let's do the same for the engineering schools.

In [43]:
eng2 = eng.set_index("School Name")

In [44]:
eng2.head()

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
Cooper Union,Engineering,"$62,200.00","$114,000.00"


Now, if we were to concatenate the two dataframes and set `ignore_index` to `True`, we would be losing important information - namely, the school names.

In [45]:
pd.concat([ivies2, eng2], ignore_index = True)

Unnamed: 0,School Type,Starting Median Salary,Mid-Career Median Salary
0,Ivy League,"$58,000.00","$134,000.00"
1,Ivy League,"$66,500.00","$131,000.00"
2,Ivy League,"$59,100.00","$126,000.00"
3,Ivy League,"$63,400.00","$124,000.00"
4,Ivy League,"$60,900.00","$120,000.00"
5,Ivy League,"$60,300.00","$110,000.00"
6,Ivy League,"$56,200.00","$109,000.00"
7,Ivy League,"$59,400.00","$107,000.00"
8,Engineering,"$72,200.00","$126,000.00"
9,Engineering,"$75,500.00","$123,000.00"


That is NOT very useful. So, how do we preserve the indices while enforcing uniqueness? 

Let's start by concatenating with the original indices.

In [50]:
pd.concat([ivies2, eng2])

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"
Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


That's better. But how do we *ensure* that none of the indices are duplicated (it turns out they are not, but let's pretend that they could be)? We can do that by using the `verify_integrity` parameter of the `pd.concat()` function. This parameter performs a unique index check for us: Pandas will raise a `ValueError` if it detects a duplicated index.

In [51]:
pd.concat([ivies2, eng2], verify_integrity=True)

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"
Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


Turns out our dataframes had unique indices and thus no error was thrown. Let's mess around with our data to illustrate the power of `verify_integrity`.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

Let's pick a random sample from the *eng2* dataframe and assign it to a new variable.

In [53]:
random_eng_school = eng2.sample()

In [54]:
random_eng_school

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Worcester Polytechnic Institute (WPI),Engineering,"$61,000.00","$114,000.00"


Now let's add this school to the *ivies2* dataframe.

In [55]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
ivies2 = pd.concat([ivies2, random_eng_school])

In [56]:
ivies2

School Name,School Type,Starting Median Salary,Mid-Career Median Salary
Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
Princeton University,Ivy League,"$66,500.00","$131,000.00"
Yale University,Ivy League,"$59,100.00","$126,000.00"
Harvard University,Ivy League,"$63,400.00","$124,000.00"
University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
Cornell University,Ivy League,"$60,300.00","$110,000.00"
Brown University,Ivy League,"$56,200.00","$109,000.00"
Columbia University,Ivy League,"$59,400.00","$107,000.00"
Worcester Polytechnic Institute (WPI),Engineering,"$61,000.00","$114,000.00"


We now have the Worcester Polytechnic Institute in both *ivies2* and *eng2*. Now let's try to concatenate them with integrity verification. It throws an error!

In [57]:
## This will throw an error!
# pd.concat([ivies2, eng2], verify_integrity=True)

ValueError: ignored
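If you would rather handle the failure than let it stop the notebook, the call can be wrapped in `try`/`except`. This is only a sketch on made-up frames (`a`, `b`, and the school labels are hypothetical), not the course's own code.

```python
import pandas as pd

# Hypothetical toy frames that share the index label "School B".
a = pd.DataFrame({"School Type": ["Ivy League", "Engineering"]},
                 index=["School A", "School B"])
b = pd.DataFrame({"School Type": ["Engineering"]},
                 index=["School B"])

try:
    combined = pd.concat([a, b], verify_integrity=True)
except ValueError as err:
    # verify_integrity refuses to build a frame with repeated index labels
    combined = None
    print("Concatenation refused:", err)
```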

To get around this and allow us to concatenate the dataframes with shared indices, we have a few options.
1. We could choose a different index for the two dataframes that *does* have unique values. The instructor recommends this approach.
2. We can turn off `verify_integrity` and then deal with the duplicates afterward, either by dropping duplicates or resetting the index.
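As a rough sketch of the second option, assuming we want to keep the first row for each repeated label, we can concatenate without the check and then filter on `index.duplicated()`. The frames `a` and `b` below are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical toy frames that share the index label "School B".
a = pd.DataFrame({"School Type": ["Ivy League", "Engineering"]},
                 index=["School A", "School B"])
b = pd.DataFrame({"School Type": ["Engineering"]},
                 index=["School B"])

# verify_integrity defaults to False, so this succeeds despite the overlap.
combined = pd.concat([a, b])

# Keep only the first row for each index label.
unique_rows = combined[~combined.index.duplicated(keep="first")]

print(unique_rows.index.is_unique)  # True
```

Note that `drop_duplicates()` compares column values, not index labels, which is why the mask here is built from the index itself.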

One final note is that you can also concatenate along axis 1 (the column axis).

## Creating Multiple Indices with `concat()`

We previously looked at how to replace the index of a concatenated dataframe with a new zero-based index using the `ignore_index` parameter of `concat()`. But what if we wanted to add another level of indexing to easily partition or identify the constituent dataframes within the structure of the concatenated dataframe? We'll explore this more in the next section, but here's a sneak peek.

Say we're combining the ivies and the engineering schools again, but we want to be able to specify within the index the origin of the data. Recall that when using the vanilla `concat()` function, we'll get repeating indices.


In [58]:
pd.concat([ivies, eng])

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


The `pd.concat()` function has a `keys` parameter that constructs a hierarchical index using the passed-in keys as the outermost level. The keys are applied in order, matching the order in which the constituent dataframes were passed into `pd.concat()`.


In [60]:
new_df = pd.concat([ivies, eng], keys = ["ivyleague_schools", "engineering_schools"])

In [61]:
new_df

Unnamed: 0,Unnamed: 1,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
ivyleague_schools,0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
ivyleague_schools,1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
ivyleague_schools,2,Yale University,Ivy League,"$59,100.00","$126,000.00"
ivyleague_schools,3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
ivyleague_schools,4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
ivyleague_schools,5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
ivyleague_schools,6,Brown University,Ivy League,"$56,200.00","$109,000.00"
ivyleague_schools,7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
engineering_schools,0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
engineering_schools,1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"


Our new dataframe contains an index with two levels. The first (outer) level contains the labels (keys) that we specified. The second (inner) level is the numerical index from the original dataframes. And if we check the index, we see that each entry is a tuple.

In [62]:
new_df.index

MultiIndex([(  'ivyleague_schools',  0),
            (  'ivyleague_schools',  1),
            (  'ivyleague_schools',  2),
            (  'ivyleague_schools',  3),
            (  'ivyleague_schools',  4),
            (  'ivyleague_schools',  5),
            (  'ivyleague_schools',  6),
            (  'ivyleague_schools',  7),
            ('engineering_schools',  0),
            ('engineering_schools',  1),
            ('engineering_schools',  2),
            ('engineering_schools',  3),
            ('engineering_schools',  4),
            ('engineering_schools',  5),
            ('engineering_schools',  6),
            ('engineering_schools',  7),
            ('engineering_schools',  8),
            ('engineering_schools',  9),
            ('engineering_schools', 10),
            ('engineering_schools', 11),
            ('engineering_schools', 12),
            ('engineering_schools', 13),
            ('engineering_schools', 14),
            ('engineering_schools', 15),
            ('en

This means that a bare inner label no longer works with `loc[]`, because `loc[]` looks up a single label in the outer level first. Instead, we identify a row by its full index tuple.

In [64]:
## This raises a KeyError: 3 is not a label in the outer level.
# new_df.loc[3]

In [66]:
## This will work
new_df.loc[('ivyleague_schools', 3)]

School Name                 Harvard University
School Type                         Ivy League
Starting Median Salary             $63,400.00 
Mid-Career Median Salary          $124,000.00 
Name: (ivyleague_schools, 3), dtype: object
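
The full tuple is not the only label-based option. As a minimal sketch, assuming a miniature version of the concatenated dataframe (`ivies_demo`, `eng_demo`, and `demo` are hypothetical stand-ins), the outer key alone selects a whole group, and `xs()` selects by the inner level across all groups.

```python
import pandas as pd

# Hypothetical miniature frames standing in for ivies and eng
ivies_demo = pd.DataFrame({"School Name": ["Dartmouth College", "Princeton University"]})
eng_demo = pd.DataFrame(
    {"School Name": ["Massachusetts Institute of Technology (MIT)",
                     "California Institute of Technology (CIT)",
                     "Harvey Mudd College"]}
)
demo = pd.concat([ivies_demo, eng_demo], keys=["ivyleague_schools", "engineering_schools"])

# Outer key alone: every row in that group, indexed by the inner level
eng_rows = demo.loc["engineering_schools"]

# Full tuple: a single row, returned as a Series
one_row = demo.loc[("ivyleague_schools", 1)]

# Inner label across all groups: xs() with level=1 matches the inner level
second_rows = demo.xs(1, level=1)

print(eng_rows)
print(one_row)
print(second_rows)
```

Here `second_rows` holds the row labeled 1 from each group, and its index is the outer key, since `xs()` drops the level it selected on by default.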

We can also avoid this issue by using position-based indexing instead of label-based indexing. Of course, this means you must know where the data you want is located within the dataframe.

In [67]:
new_df.iloc[3]

School Name                 Harvard University
School Type                         Ivy League
Starting Median Salary             $63,400.00 
Mid-Career Median Salary          $124,000.00 
Name: (ivyleague_schools, 3), dtype: object

## Column Axis Concatenation

So far we have concatenated only along the index axis, stacking rows on top of each other. We can also concatenate along the column axis.

To do this, we change the `axis` parameter of the `pd.concat()` function from 0 to 1. When might this be useful?

Suppose we want to show, side by side, the top 5 Ivy League schools and the top 5 engineering schools by starting median salary. One way to approach this is to produce sorted copies of the dataframes. Let's sort each dataframe by starting median salary in descending order, keep the first 5 rows, and reset the index so that both copies are labeled 0 through 4.

In [72]:
ivies3 = ivies.sort_values(by = "Starting Median Salary", ascending=False)[0:5].reset_index(drop=True)

In [75]:
ivies3

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Princeton University,Ivy League,"$66,500.00","$131,000.00"
1,Harvard University,Ivy League,"$63,400.00","$124,000.00"
2,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
3,Cornell University,Ivy League,"$60,300.00","$110,000.00"
4,Columbia University,Ivy League,"$59,400.00","$107,000.00"


In [73]:
eng3 = eng.sort_values(by = "Starting Median Salary", ascending=False)[0:5].reset_index(drop=True)

In [74]:
eng3

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
1,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00"
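
One caveat about the sorts above: the salary columns appear to be stored as text (the earlier row outputs report `dtype: object`), so `sort_values()` compares them character by character. That gives the right order here only because every value has the same number of digits. The sketch below, which uses made-up salaries and a helper column name of our own choosing, shows how the text sort can go wrong and one way to sort on numeric values instead.

```python
import pandas as pd

# Hypothetical salaries chosen to expose the pitfall; not taken from the datasets
demo = pd.DataFrame({
    "School Name": ["School A", "School B", "School C"],
    "Starting Median Salary": ["$9,500.00", "$72,200.00", "$100,000.00"],
})

# Sorting the raw strings compares characters, so "$9,500.00" ranks first
by_text = demo.sort_values(by="Starting Median Salary", ascending=False)

# Strip "$" and "," and convert to numbers in a helper column (name is our own)
demo["Starting Salary (num)"] = pd.to_numeric(
    demo["Starting Median Salary"].str.replace("[$,]", "", regex=True)
)
by_value = demo.sort_values(by="Starting Salary (num)", ascending=False)

print(by_text["School Name"].tolist())
print(by_value["School Name"].tolist())
```

Keeping the original text column and adding a numeric one leaves the display formatting intact while making sorts and comparisons reliable.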


Finally, let's put these dataframes side-by-side in a new dataframe.

In [76]:
pd.concat([ivies3, eng3], axis = 1)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary,School Name.1,School Type.1,Starting Median Salary.1,Mid-Career Median Salary.1
0,Princeton University,Ivy League,"$66,500.00","$131,000.00",California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
1,Harvard University,Ivy League,"$63,400.00","$124,000.00",Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
2,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00",Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,Cornell University,Ivy League,"$60,300.00","$110,000.00","Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Columbia University,Ivy League,"$59,400.00","$107,000.00",Cooper Union,Engineering,"$62,200.00","$114,000.00"
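
Two details of column-axis concatenation are worth seeing on a small example. First, `pd.concat()` matches rows by index label, which is why the index was reset beforehand. Second, both halves carry the same column names, and the `keys` parameter can label each group of columns. The sketch below uses made-up frames (`left` and `right` are hypothetical stand-ins for the sorted copies).

```python
import pandas as pd

# Hypothetical stand-ins for two sorted frames that kept their original labels
left = pd.DataFrame({"School Name": ["L1", "L2"]}, index=[4, 1])
right = pd.DataFrame({"School Name": ["R1", "R2"]}, index=[0, 3])

# Without resetting the index, rows are matched by label, leaving gaps (NaN)
misaligned = pd.concat([left, right], axis=1)

# After reset_index(drop=True), both frames share labels 0 and 1 and line up
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
    keys=["ivyleague_schools", "engineering_schools"],  # labels each column group
)

print(misaligned)
print(aligned)
print(aligned["engineering_schools"])
```

With `keys`, the columns form a two-level index, so `aligned["engineering_schools"]` returns just that half of the table.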


This is easy to read. For example, we can see that in every row the starting median salary at a top engineering school matches or exceeds the salary at the Ivy League school beside it. Don't jump to any conclusions, though. There are a lot of questions to keep in mind, including:
1. What if we considered only engineering students at Ivy League schools? Would the starting median salary increase in that case?
2. What if we considered non-engineering students at the top engineering schools? Would the median salaries decrease?

It is possible that the Ivy League schools simply have more students who enter lower-paying, non-engineering careers, which brings down the median starting salary. Engineering students at Ivy League schools may well earn starting salaries comparable to those at the top engineering schools.

A final note: while this table is not terrible to look at, there are better ways to visualize this type of data. The goal here was simply to illustrate how the `pd.concat()` function can combine dataframes along the column axis.