# Section 6: Working with Multiple DataFrames

In [2]:
import pandas as pd
import numpy as np

## Introducing Five New Datasets

In this section, we will be working with data describing salary information for US colleges by major (field of study) and region. Each of these data sources stands separately, and it will be our job to piece them together with the methods we will learn in this section. 

We will begin by defining local variables that point to the URLs for the data.

In [3]:
# Dataset URL Sources
eng_url = 'https://andybek.com/pandas-eng'
state_url = 'https://andybek.com/pandas-state'
party_url = 'https://andybek.com/pandas-party'
liberal_url = 'https://andybek.com/pandas-liberal'
ivies_url = 'https://andybek.com/pandas-ivies'


Let's begin by reading in the engineering school salaries. This gives a list of 19 engineering schools in the US and their starting and mid-career median salaries.

In [5]:
eng = pd.read_csv(eng_url)

In [6]:
eng.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
2,Harvey Mudd College,Engineering,"$71,800.00","$122,000.00"
3,"Polytechnic University of New York, Brooklyn",Engineering,"$62,400.00","$114,000.00"
4,Cooper Union,Engineering,"$62,200.00","$114,000.00"


Now let's read in the state school data, which describes public universities.

In [7]:
state = pd.read_csv(state_url)

In [12]:
state.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


Next, we'll read in the party school data. These are schools that the Wall Street Journal categorizes as "party schools" with heavy alcohol use.

In [13]:
party = pd.read_csv(party_url)

In [14]:
party.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,University of Illinois at Urbana-Champaign (UIUC),Party,"$52,900.00","$96,100.00"
1,"University of Maryland, College Park",Party,"$52,000.00","$95,000.00"
2,"University of California, Santa Barbara (UCSB)",Party,"$50,500.00","$95,000.00"
3,University of Texas (UT) - Austin,Party,"$49,700.00","$93,900.00"
4,State University of New York (SUNY) at Albany,Party,"$44,500.00","$92,200.00"


Next up, we have liberal arts schools, which emphasize rational thinking and first principles over technical knowledge.

In [15]:
liberal = pd.read_csv(liberal_url)

In [16]:
liberal.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Bucknell University,Liberal Arts,"$54,100.00","$110,000.00"
1,Colgate University,Liberal Arts,"$52,800.00","$108,000.00"
2,Amherst College,Liberal Arts,"$54,500.00","$107,000.00"
3,Lafayette College,Liberal Arts,"$53,900.00","$107,000.00"
4,Bowdoin College,Liberal Arts,"$48,100.00","$107,000.00"


Last but not least, let's take a look at the Ivy League schools - a group of highly prestigious and highly selective schools in the American northeast.

In [17]:
ivies = pd.read_csv(ivies_url)

In [18]:
ivies.head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"


## Concatenating DataFrames

In the previous section, we read in 5 dataframes that share a similar structure but are stored separately. Let's check the shape of each dataframe to confirm that they are shaped similarly.

In [21]:
dfs = [state, eng, liberal, ivies, party]

In [22]:
for df in dfs:
  print(df.shape)

(175, 4)
(19, 4)
(47, 4)
(8, 4)
(20, 4)


So we have 5 two-dimensional dataframes, each with a different number of schools but the same four columns. 
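Comparing shapes only tells us that the column counts match, not that the column labels do. As a minimal sketch, assuming toy frames with made-up school names (the names `a`, `b`, and `frames` are hypothetical and not part of the datasets above), the labels can be compared directly and the expected row count after concatenation can be computed up front:

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical data, same kind of layout)
cols = ["School Name", "School Type"]
a = pd.DataFrame([["School A", "State"], ["School B", "State"]], columns=cols)
b = pd.DataFrame([["School C", "Ivy League"]], columns=cols)
frames = [a, b]

# True only if every frame has the same column labels in the same order
same_columns = all(f.columns.equals(frames[0].columns) for f in frames)

# Number of rows we should expect after stacking the frames vertically
total_rows = sum(len(f) for f in frames)

print(same_columns, total_rows)  # True 3
```

The same two checks should carry over to the real `dfs` list unchanged.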

As a first step, let's put them all together, a process called **concatenation**. Suppose we want to concatenate the Ivies and engineering schools. To do this, we can call the `pandas.concat()` function, passing in a list of the dataframes that we want to concatenate.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

In [28]:
pd.concat([ivies, eng])

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,Dartmouth College,Ivy League,"$58,000.00","$134,000.00"
1,Princeton University,Ivy League,"$66,500.00","$131,000.00"
2,Yale University,Ivy League,"$59,100.00","$126,000.00"
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
7,Columbia University,Ivy League,"$59,400.00","$107,000.00"
0,Massachusetts Institute of Technology (MIT),Engineering,"$72,200.00","$126,000.00"
1,California Institute of Technology (CIT),Engineering,"$75,500.00","$123,000.00"
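Notice that the row labels in the output above restart at 0 where the engineering rows begin, because `concat()` keeps each frame's original index by default. The sketch below, which uses small made-up frames (`top` and `bottom` are hypothetical names), shows two options that `pd.concat()` offers for this: `ignore_index=True` to renumber the rows, and `keys=` to label which frame each row came from.

```python
import pandas as pd

# Two tiny frames with made-up school names
top = pd.DataFrame({"School Name": ["School A", "School B"]})
bottom = pd.DataFrame({"School Name": ["School C"]})

# Default behavior: original row labels are kept, so 0 appears twice
stacked = pd.concat([top, bottom])
print(list(stacked.index))     # [0, 1, 0]

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([top, bottom], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]

# keys= adds an outer index level naming the source of each row
labeled = pd.concat([top, bottom], keys=["top", "bottom"])
print(labeled.loc["bottom"])
```

Which option fits best depends on whether the original row positions or the source frame matter later on.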


Let's check the shape of the concatenated dataframe.

In [30]:
pd.concat([ivies, eng]).shape

(27, 4)

Let's try concatenating all of these dataframes!

In [31]:
pd.concat(dfs)

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
15,University of New Hampshire (UNH),Party,"$41,800.00","$78,300.00"
16,West Virginia University (WVU),Party,"$43,100.00","$78,100.00"
17,University of Tennessee,Party,"$43,800.00","$74,600.00"
18,Ohio University,Party,"$42,200.00","$73,400.00"


We've now concatenated all of the dataframes into one master frame!



In [34]:
pd.concat(dfs).head()

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"


But there's a problem with this - these datasets are NOT mutually exclusive. Some schools are in more than one dataframe - for example, most party schools are also state schools.

In [35]:
pd.concat(dfs)["School Name"].value_counts()

Indiana University (IU), Bloomington                  2
Ohio University                                       2
University of Tennessee                               2
Randolph-Macon College                                2
University of California, Santa Barbara (UCSB)        2
                                                     ..
University of Colorado - Denver                       1
Stony Brook University                                1
Fitchburg State College                               1
University of South Florida (USF)                     1
University of North Carolina at Chapel Hill (UNCH)    1
Name: School Name, Length: 249, dtype: int64

To see this more clearly, let's use Python's built-in `set()` type and its `difference()` method to compare the party and state school name lists. Remember that
* sets do not allow duplicate values. Thus, we will get a collection of unique values when we call `set()`.
* the `difference()` method returns the values that are in the first collection but not in the other

In [37]:
set(party["School Name"]).difference(state['School Name'])

{'Randolph-Macon College'}

And thus we see that Randolph-Macon College is the only party school that is not also a state school. For those who care, Randolph-Macon College is a private party school. It's also a liberal arts school.

In [38]:
'Randolph-Macon College' in liberal["School Name"].values

True

Thus, all party schools are either state schools, liberal arts schools, or both. 

Just for kicks, let's see if any engineering schools are also party schools using the `intersection()` set method.

In [42]:
set(eng["School Name"]).intersection(party['School Name'])

set()

Guess not.
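The set operations used above can be tried on a few made-up names to see exactly what each one returns. This is only an illustrative sketch: the lists below are hypothetical and not taken from the datasets. Note that `difference()` is not symmetric, so the order of the two collections matters.

```python
# Made-up name lists; "School B" is repeated on purpose
party_names = ["School A", "School B", "School B", "School C"]
state_names = ["School B", "School C", "School D"]

# set() drops the repeated "School B"
unique_party = set(party_names)

# In party but not in state
only_party = unique_party.difference(state_names)

# In state but not in party (a different result: order matters)
only_state = set(state_names).difference(party_names)

# In both collections
both = unique_party.intersection(state_names)

print(only_party)  # {'School A'}
print(only_state)  # {'School D'}
```

Both methods accept any iterable as their argument, which is why a pandas column can be passed in directly.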

Another way to check for duplication is to use the `duplicated()` method on the concatenated dataframe. Remember that this generates a boolean Series, which we can use to make a selection.

In [43]:
pd.concat(dfs).duplicated(subset = ['School Name'], keep = 'first')

0     False
1     False
2     False
3     False
4     False
      ...  
15     True
16     True
17     True
18     True
19     True
Length: 269, dtype: bool

In [44]:
pd.concat(dfs)[pd.concat(dfs).duplicated(subset = ['School Name'], keep = 'first')]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,University of Illinois at Urbana-Champaign (UIUC),Party,"$52,900.00","$96,100.00"
1,"University of Maryland, College Park",Party,"$52,000.00","$95,000.00"
2,"University of California, Santa Barbara (UCSB)",Party,"$50,500.00","$95,000.00"
3,University of Texas (UT) - Austin,Party,"$49,700.00","$93,900.00"
4,State University of New York (SUNY) at Albany,Party,"$44,500.00","$92,200.00"
5,University of Florida (UF),Party,"$47,100.00","$87,900.00"
6,Louisiana State University (LSU),Party,"$46,900.00","$87,800.00"
7,University of Georgia (UGA),Party,"$44,100.00","$86,000.00"
8,Pennsylvania State University (PSU),Party,"$49,900.00","$85,700.00"
9,Arizona State University (ASU),Party,"$47,400.00","$84,100.00"


So check that out: there are 20 duplicated rows in our concatenated dataset, and they all happen to be party schools. Because we know that our *party* dataframe has 20 rows, this means that the entire *party* dataframe consists of schools that show up in at least one other constituent dataframe.
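Which rows get flagged depends on the `keep` argument of `duplicated()`. Here is a small sketch on a made-up frame (the name `combined` and its contents are hypothetical) that compares the three accepted values and stores the concatenated frame once instead of rebuilding it for each step:

```python
import pandas as pd

# Made-up frame in which "School A" appears twice
combined = pd.DataFrame({
    "School Name": ["School A", "School B", "School A"],
    "School Type": ["State", "State", "Party"],
})

# keep="first": every occurrence after the first is flagged
first = combined.duplicated(subset=["School Name"], keep="first")
print(first.tolist())  # [False, False, True]

# keep="last": every occurrence before the last is flagged
last = combined.duplicated(subset=["School Name"], keep="last")
print(last.tolist())   # [True, False, False]

# keep=False: all occurrences of a repeated name are flagged
every = combined.duplicated(subset=["School Name"], keep=False)
print(every.tolist())  # [True, False, True]

# The boolean Series selects the flagged rows
repeats = combined[first]
```

With `keep="first"`, the flagged rows are always the later occurrences, so the order of the frames passed to `concat()` decides which copy counts as the duplicate.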

How can we fix this? One way is to simply remove the party schools from our list of dataframes that we use for the concatenation. 

A more programmatic way of doing this is to use the `drop_duplicates()` method, subsetting by school name.

In [45]:
pd.concat(dfs).drop_duplicates(subset = "School Name")

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"


Note that we've now lost the information about whether a particular school is a party school (in addition to being a state or liberal arts school).

Let's keep this result by assigning it to a variable.

In [46]:
schools = pd.concat(dfs).drop_duplicates(subset = "School Name")
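One possible way to keep the party-school information through deduplication is to record it in its own column before dropping rows. The sketch below uses made-up frames. The `toy_` names and the `Is Party School` column are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Made-up frames; "School A" is both a state school and a party school
toy_state = pd.DataFrame({
    "School Name": ["School A", "School B"],
    "School Type": ["State", "State"],
})
toy_party = pd.DataFrame({
    "School Name": ["School A"],
    "School Type": ["Party"],
})

toy_combined = pd.concat([toy_state, toy_party], ignore_index=True)

# Flag party membership first, so the flag survives deduplication
toy_combined["Is Party School"] = toy_combined["School Name"].isin(
    toy_party["School Name"]
)

# Keep the first occurrence of each school name, flag included
toy_schools = toy_combined.drop_duplicates(subset="School Name")
print(toy_schools)
```

Here "School A" keeps its `State` row while the new column still records that it is a party school.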

In [57]:
for school in schools["School Name"].values:
  print(school in party["School Name"].values)

False
False
False
False
False
False
False
False
True
False
False
False
False
True
True
True
False
False
True
False
False
False
False
False
True
False
True
False
False
False
False
False
False
True
True
False
False
False
False
False
False
False
False
False
False
False
False
False
True
False
True
True
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
True
False
False
False
False
False
False
False
False
False
False
True
False
False
False
False
False
False
False
False
False
False
True
False
True
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
True
False
False
False
False
False
False
False
False
True
False
False
True
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False

In [64]:
is_not_party_school = [school not in party["School Name"].values for school in schools["School Name"].values]

In [65]:
schools[is_not_party_school]

Unnamed: 0,School Name,School Type,Starting Median Salary,Mid-Career Median Salary
0,"University of California, Berkeley",State,"$59,900.00","$112,000.00"
1,University of Virginia (UVA),State,"$52,700.00","$103,000.00"
2,Cal Poly San Luis Obispo,State,"$57,200.00","$101,000.00"
3,University of California at Los Angeles (UCLA),State,"$52,600.00","$101,000.00"
4,"University of California, San Diego (UCSD)",State,"$51,100.00","$101,000.00"
...,...,...,...,...
3,Harvard University,Ivy League,"$63,400.00","$124,000.00"
4,University of Pennsylvania,Ivy League,"$60,900.00","$120,000.00"
5,Cornell University,Ivy League,"$60,300.00","$110,000.00"
6,Brown University,Ivy League,"$56,200.00","$109,000.00"
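The membership test above can also be written without a Python-level loop by using `Series.isin()` and negating the result with `~`. This is a sketch on made-up data (the `toy_` names are hypothetical). It is meant as an alternative to the list comprehension that should give the same selection.

```python
import pandas as pd

# Made-up frames standing in for schools and party
toy_schools = pd.DataFrame({"School Name": ["School A", "School B", "School C"]})
toy_party = pd.DataFrame({"School Name": ["School B"]})

# isin() returns True where the name appears in toy_party; ~ inverts it
is_not_party = ~toy_schools["School Name"].isin(toy_party["School Name"])

# Boolean selection keeps only the rows marked True
non_party = toy_schools[is_not_party]
print(non_party["School Name"].tolist())  # ['School A', 'School C']
```

Because the mask is derived from the same frame it filters, its index lines up with the rows being selected.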
