In [1]:
import numpy as np
import pandas as pd

# Combining Pandas Datasets with Concatenation [MORE INFO](https://pandas.pydata.org/pandas-docs/stable/merging.html)

## Introduction

In [5]:
# For this tutorial, we will need the college_loan_defaults dataset.
college_loan_defaults = pd.read_csv(
    './data/college-loan-default-rates.csv', index_col='opeid')

# Keep in mind that the original dataset has this many rows
print(college_loan_defaults.shape)
college_loan_defaults.head()
# college_loan_defaults.isnull().sum()

(4596, 21)


opeid,name,address,city,state,state_desc,zipcode,zipcode_ext,school_type_code,school_type_desc,year_1,...,year_1_borrowers_in_repay,year_1_default_rate,year_2,year_2_borrowers_in_default,year_2_borrowers_in_repay,year_2_default_rate,year_3,year_3_borrowers_in_default,year_3_borrowers_in_repay,year_3_default_rate
31505,A - TECHNICAL COLLEGE,1033 SOUTH BROADWAY STREET,LOS ANGELES,CA,CALIFORNIA,90015,3535,3,PrivateForProfit,2013,...,92,27.1,2012,9.0,33.0,27.2,2011,11.0,35.0,31.4
41495,A & W HEALTHCARE EDUCATORS,6930 MARTIN DRIVE,NEW ORLEANS,LA,LOUISIANA,70126,2923,3,PrivateForProfit,2013,...,31,12.9,2012,5.0,31.0,16.1,2011,0.0,5.0,0.0
2477,A. T. STILL UNIVERSITY OF HEALTH SCIENCES,800 WEST JEFFERSON STREET,KIRKSVILLE,MO,MISSOURI,63501,1497,2,PrivateNonProfit,2013,...,837,1.6,2012,15.0,886.0,1.6,2011,24.0,763.0,3.1
30649,AARON'S ACADEMY OF BEAUTY,11690 DOOLITTLE DRIVE,WALDORF,MD,MARYLAND,20602,2715,3,PrivateForProfit,2013,...,106,35.8,2012,14.0,36.0,38.8,2011,18.0,41.0,43.9
30651,ABC BEAUTY COLLEGE,"203 SOUTH 26TH ST, SUITE B",ARKADELPHIA,AR,ARKANSAS,71923,4206,3,PrivateForProfit,2013,...,45,26.6,2012,10.0,52.0,19.2,2011,8.0,45.0,17.7


In [6]:
college_loan_defaults.isnull().sum()
print(college_loan_defaults.shape)
college_loan_defaults_clean = college_loan_defaults.dropna().copy()


(4596, 21)


The Office of Postsecondary Education Identification (OPEID) code for each college is used as the index.

In [8]:
# File under 'useful to know': you can create new columns
# (here, the three yearly default rates averaged and rounded to a whole number)
college_loan_defaults_clean['avg_default_rate'] = round(
        (college_loan_defaults_clean['year_1_default_rate'] +
         college_loan_defaults_clean['year_2_default_rate'] +
         college_loan_defaults_clean['year_3_default_rate']
        ) / 3
)

# File under 'also might be useful to know': copy the index value to a new column
college_loan_defaults_clean['COLLEGE_ID'] = college_loan_defaults_clean.index

college_loan_defaults_clean.head()

opeid,name,address,city,state,state_desc,zipcode,zipcode_ext,school_type_code,school_type_desc,year_1,...,year_2,year_2_borrowers_in_default,year_2_borrowers_in_repay,year_2_default_rate,year_3,year_3_borrowers_in_default,year_3_borrowers_in_repay,year_3_default_rate,avg_default_rate,COLLEGE_ID
31505,A - TECHNICAL COLLEGE,1033 SOUTH BROADWAY STREET,LOS ANGELES,CA,CALIFORNIA,90015,3535,3,PrivateForProfit,2013,...,2012,9.0,33.0,27.2,2011,11.0,35.0,31.4,29.0,31505
41495,A & W HEALTHCARE EDUCATORS,6930 MARTIN DRIVE,NEW ORLEANS,LA,LOUISIANA,70126,2923,3,PrivateForProfit,2013,...,2012,5.0,31.0,16.1,2011,0.0,5.0,0.0,10.0,41495
2477,A. T. STILL UNIVERSITY OF HEALTH SCIENCES,800 WEST JEFFERSON STREET,KIRKSVILLE,MO,MISSOURI,63501,1497,2,PrivateNonProfit,2013,...,2012,15.0,886.0,1.6,2011,24.0,763.0,3.1,2.0,2477
30649,AARON'S ACADEMY OF BEAUTY,11690 DOOLITTLE DRIVE,WALDORF,MD,MARYLAND,20602,2715,3,PrivateForProfit,2013,...,2012,14.0,36.0,38.8,2011,18.0,41.0,43.9,40.0,30649
30651,ABC BEAUTY COLLEGE,"203 SOUTH 26TH ST, SUITE B",ARKADELPHIA,AR,ARKANSAS,71923,4206,3,PrivateForProfit,2013,...,2012,10.0,52.0,19.2,2011,8.0,45.0,17.7,21.0,30651


## `pd.concat`
You can think of the `pd.concat` function as the equivalent of the NumPy `concatenate` function for `Series` and `DataFrame` objects.

We will spend most of our time on how this function works with `DataFrame` objects rather than `Series` objects, since in practice that is how it is used most frequently.

When it comes to using the `pd.concat` function, the most basic question is whether you are adding *additional rows* or *additional columns*. We'll run through the function arguments based on concatenating rows and then come back for a look at how we perform column concatenations.
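To make the rows-versus-columns distinction concrete, here is a minimal sketch on tiny made-up frames. The names `top`, `bottom`, and `side` are illustrative and are not part of the college dataset. The default `axis=0` stacks rows, while `axis=1` places columns side by side.

```python
import pandas as pd

# Hypothetical toy frames, used only for illustration
top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
bottom = pd.DataFrame({'a': [5], 'b': [6]}, index=['z'])
side = pd.DataFrame({'c': [7, 8]}, index=['x', 'y'])

more_rows = pd.concat([top, bottom])         # axis=0 (default): adds rows
more_cols = pd.concat([top, side], axis=1)   # axis=1: adds columns

print(more_rows.shape)
print(more_cols.shape)
```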

### Concatenating `DataFrame` Rows

In [9]:
# Here, I'll split college_loan_defaults into multiple
# sections of rows that we will then stitch back together.
part_1 = college_loan_defaults.iloc[:1000]
part_2 = college_loan_defaults.iloc[1000:2000]
part_3 = college_loan_defaults.iloc[1999:]

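# This creates three parts:
# rows 0-999
# rows 1000-1999
# rows 1999-end -> notice that row 1999 appears in both part_2 and part_3
# Index.intersection returns the index labels that the two parts share
part_3.index.intersection(part_2.index)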

Int64Index([1698], dtype='int64', name='opeid')

#### Basic Usage

In [10]:
# Join all three parts together with pd.concat
concatenated_dataframe = pd.concat([part_3, part_1, part_2])
concatenated_dataframe.head()

opeid,name,address,city,state,state_desc,zipcode,zipcode_ext,school_type_code,school_type_desc,year_1,...,year_1_borrowers_in_repay,year_1_default_rate,year_2,year_2_borrowers_in_default,year_2_borrowers_in_repay,year_2_default_rate,year_3,year_3_borrowers_in_default,year_3_borrowers_in_repay,year_3_default_rate
1698,JOHN MARSHALL LAW SCHOOL (THE),315 SOUTH PLYMOUTH COURT,CHICAGO,IL,ILLINOIS,60604,3968,2,PrivateNonProfit,2013,...,530,0.9,2012,9.0,485.0,1.8,2011,8.0,491.0,1.6
41340,"JOHN PAOLO'S XTREME BEAUTY INSTITUTE, GOLDWELL...",2144 SARATOGA AVENUE,BALLSTON SPA,NY,NEW YORK,12020,1245,3,PrivateForProfit,2013,...,90,13.3,2012,11.0,68.0,16.1,2011,13.0,44.0,29.5
41657,"JOHN PAOLO'S XTREME BEAUTY, GOLDWELL PRODUCTS ...",638 COLUMBIA STREET EXTENSION,LATHAM,NY,NEW YORK,12110,3053,3,PrivateForProfit,2013,...,47,21.2,2012,5.0,24.0,20.8,2011,1.0,5.0,20.0
4004,JOHN TYLER COMMUNITY COLLEGE,13101 JEFFERSON DAVIS HIGHWAY,CHESTER,VA,VIRGINIA,23831,5316,1,Public,2013,...,1125,15.8,2012,131.0,938.0,13.9,2011,128.0,747.0,17.1
12813,JOHN WOOD COMMUNITY COLLEGE,1301 SOUTH 48TH STREET,QUINCY,IL,ILLINOIS,62305,401,1,Public,2013,...,552,18.2,2012,86.0,582.0,14.7,2011,89.0,565.0,15.7


In [11]:
print(concatenated_dataframe.shape[0])
print(part_1.shape[0], part_2.shape[0], part_3.shape[0])

4597
1000 1000 2597


<div class="alert alert-block alert-info">
Notice that `pd.concat` does not sort the elements of the DataFrame that it returns.
</div>

#### Handling Duplicate Index Values with `verify_integrity` & `ignore_index` Parameters
You probably didn't notice, but one school appears twice in our list.

In [14]:
# The `DataFrame.index.duplicated` function returns a boolean array
# we can use as a mask to extract duplicate records.
concatenated_dataframe.index.duplicated()

array([False, False, False, ..., False, False,  True])

In [15]:
concatenated_dataframe[concatenated_dataframe.index.duplicated()]

opeid,name,address,city,state,state_desc,zipcode,zipcode_ext,school_type_code,school_type_desc,year_1,...,year_1_borrowers_in_repay,year_1_default_rate,year_2,year_2_borrowers_in_default,year_2_borrowers_in_repay,year_2_default_rate,year_3,year_3_borrowers_in_default,year_3_borrowers_in_repay,year_3_default_rate
1698,JOHN MARSHALL LAW SCHOOL (THE),315 SOUTH PLYMOUTH COURT,CHICAGO,IL,ILLINOIS,60604,3968,2,PrivateNonProfit,2013,...,530,0.9,2012,9.0,485.0,1.8,2011,8.0,491.0,1.6


### About loc[]

As we have seen, you can access the columns of a dataframe using dictionary-like syntax. For instance, you can get all the cities with `college_loan_defaults['city']`, or multiple columns by passing a list of column names, as in `college_loan_defaults[['city', 'state']]`.

What if you want to access a row by its explicit index label? If you try dictionary-like syntax, pandas will assume you are looking for a column and raise a `KeyError`.

The way around this is `.loc[]`, which tells pandas that you are looking up rows by index label.
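The difference can be sketched on a two-row miniature of the data. The frame name `schools` is made up for this illustration; it is indexed by `opeid` just like the real dataset.

```python
import pandas as pd

# Hypothetical two-row miniature of the college data, indexed by opeid
schools = pd.DataFrame(
    {'city': ['CHICAGO', 'QUINCY'], 'state': ['IL', 'IL']},
    index=pd.Index([1698, 12813], name='opeid'))

# Dictionary-style access looks for a *column* named 1698 and fails
try:
    schools[1698]
    found_column = True
except KeyError:
    found_column = False

# .loc looks up the *row* whose index label is 1698
row = schools.loc[1698]
print(found_column, row['city'])
```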

In [19]:
concatenated_dataframe.loc[1698]

opeid,name,address,city,state,state_desc,zipcode,zipcode_ext,school_type_code,school_type_desc,year_1,...,year_1_borrowers_in_repay,year_1_default_rate,year_2,year_2_borrowers_in_default,year_2_borrowers_in_repay,year_2_default_rate,year_3,year_3_borrowers_in_default,year_3_borrowers_in_repay,year_3_default_rate
1698,JOHN MARSHALL LAW SCHOOL (THE),315 SOUTH PLYMOUTH COURT,CHICAGO,IL,ILLINOIS,60604,3968,2,PrivateNonProfit,2013,...,530,0.9,2012,9.0,485.0,1.8,2011,8.0,491.0,1.6
1698,JOHN MARSHALL LAW SCHOOL (THE),315 SOUTH PLYMOUTH COURT,CHICAGO,IL,ILLINOIS,60604,3968,2,PrivateNonProfit,2013,...,530,0.9,2012,9.0,485.0,1.8,2011,8.0,491.0,1.6


Now, I purposely caused this problem for us (by including the row at position 1999 in both `part_2` and `part_3`), but in the real world this is pretty common!

Sometimes you might want to keep both entries (often the case if the index value is the same but the rest of the data is different). If so, you can pass the **`ignore_index`** parameter with a value of **`True`** to the function; **all existing index values will be discarded** and a new integer-based index will be created for you.

In [20]:
concatenated_dataframe = pd.concat([part_2, part_3, part_1], ignore_index=True)
print(concatenated_dataframe.index.duplicated().sum())
concatenated_dataframe.head()

0


Unnamed: 0,name,address,city,state,state_desc,zipcode,zipcode_ext,school_type_code,school_type_desc,year_1,...,year_1_borrowers_in_repay,year_1_default_rate,year_2,year_2_borrowers_in_default,year_2_borrowers_in_repay,year_2_default_rate,year_3,year_3_borrowers_in_default,year_3_borrowers_in_repay,year_3_default_rate
0,COLUMBUS COLLEGE OF ART & DESIGN,60 CLEVELAND AVENUE,COLUMBUS,OH,OHIO,43215,1758,2,PrivateNonProfit,2013,...,348,11.4,2012,30.0,377.0,7.9,2011,38.0,371.0,10.2
1,COLUMBUS STATE COMMUNITY COLLEGE,550 EAST SPRING STREET,COLUMBUS,OH,OHIO,43215,1786,1,Public,2013,...,11464,19.3,2012,2627.0,12290.0,21.3,2011,2120.0,9911.0,21.3
2,COLUMBUS STATE UNIVERSITY,4225 UNIVERSITY AVENUE,COLUMBUS,GA,GEORGIA,31907,5645,1,Public,2013,...,2396,8.1,2012,189.0,2148.0,8.7,2011,217.0,1890.0,11.4
3,COLUMBUS TECHNICAL COLLEGE,928 MANCHESTER EXPRESSWAY,COLUMBUS,GA,GEORGIA,31904,6572,1,Public,2013,...,523,19.3,2012,2.0,12.0,16.6,2011,0.0,0.0,0.0
4,COMMONWEALTH INSTITUTE OF FUNERAL SERVICE,415 BARREN SPRINGS DRIVE,HOUSTON,TX,TEXAS,77090,5918,2,PrivateNonProfit,2013,...,142,5.6,2012,8.0,113.0,7.0,2011,6.0,84.0,7.1


If, on the other hand, a duplicate index would mean there is a data problem that you don't want to allow, you can specify the `verify_integrity` parameter as `True`.

When this is passed, the existence of duplicate indices will generate a `ValueError` exception.

In [21]:
concatenated_dataframe = pd.concat([part_2, part_3, part_1], verify_integrity=True)

ValueError: Indexes have overlapping values: Int64Index([1698], dtype='int64', name='opeid')
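In a script you would usually catch this exception rather than let it stop the run. The following is one possible sketch on made-up frames (`first` and `second` are hypothetical names); the fallback of keeping the first occurrence of each label is just one choice among several, not a recommendation from the dataset's authors.

```python
import pandas as pd

# Hypothetical frames that share the index label 2
first = pd.DataFrame({'value': [10, 20]}, index=[1, 2])
second = pd.DataFrame({'value': [30, 40]}, index=[2, 3])

try:
    combined = pd.concat([first, second], verify_integrity=True)
except ValueError as err:
    # Fall back to dropping the repeated label, keeping its first occurrence
    print('Overlap detected:', err)
    combined = pd.concat([first, second])
    combined = combined[~combined.index.duplicated(keep='first')]

print(combined)
```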

#### Handling Column Mismatches with the `join` Parameter
Sometimes you will have two sets of rows that you want to join together, but the sets don't have all of the same columns.

I'll create a couple of additional small `DataFrame` objects from our college loan dataset to demonstrate our options here.

In [22]:
# DataFrame 1
# Contains the first 5 rows of the original dataset
# But only the name, city, and state columns
name_city_state_columns_only = college_loan_defaults[['name', 'city', 'state']][:5]

# DataFrame 2
# Contains the second 5 rows of the original dataset
# But only the name, state, and zipcode columns
name_state_zipcode_columns_only = college_loan_defaults[['name', 'state', 'zipcode']][5:10]

In [23]:
name_city_state_columns_only

opeid,name,city,state
31505,A - TECHNICAL COLLEGE,LOS ANGELES,CA
41495,A & W HEALTHCARE EDUCATORS,NEW ORLEANS,LA
2477,A. T. STILL UNIVERSITY OF HEALTH SCIENCES,KIRKSVILLE,MO
30649,AARON'S ACADEMY OF BEAUTY,WALDORF,MD
30651,ABC BEAUTY COLLEGE,ARKADELPHIA,AR


In [24]:
name_state_zipcode_columns_only

opeid,name,state,zipcode
41833,ABCOTT INSTITUTE,MI,48075
37834,ABDILL CAREER COLLEGE,OR,97504
3537,ABILENE CHRISTIAN UNIVERSITY,TX,79699
7087,ABINGTON MEMORIAL HOSPITAL DIXON SCHOOL OF NUR...,PA,19090
1541,ABRAHAM BALDWIN AGRICULTURAL COLLEGE,GA,31793


We have 2 sets of 5 rows that we want to concatenate, but they have different columns. Let's see what happens if you leave the **`join`** parameter at its default value, `'outer'`.

In [25]:
pd.concat(
    [name_city_state_columns_only, name_state_zipcode_columns_only], 
    sort=False)

opeid,name,city,state,zipcode
31505,A - TECHNICAL COLLEGE,LOS ANGELES,CA,
41495,A & W HEALTHCARE EDUCATORS,NEW ORLEANS,LA,
2477,A. T. STILL UNIVERSITY OF HEALTH SCIENCES,KIRKSVILLE,MO,
30649,AARON'S ACADEMY OF BEAUTY,WALDORF,MD,
30651,ABC BEAUTY COLLEGE,ARKADELPHIA,AR,
41833,ABCOTT INSTITUTE,,MI,48075.0
37834,ABDILL CAREER COLLEGE,,OR,97504.0
3537,ABILENE CHRISTIAN UNIVERSITY,,TX,79699.0
7087,ABINGTON MEMORIAL HOSPITAL DIXON SCHOOL OF NUR...,,PA,19090.0
1541,ABRAHAM BALDWIN AGRICULTURAL COLLEGE,,GA,31793.0


See how Pandas adds the special `NaN` value for any column that didn't have a value in the original dataframes? 

The other option is to drop any columns that do not have data in both sets of rows. You can do this by passing a value of **`inner`** to the `join` parameter of the function.

Let's demonstrate how doing so will result in only the shared columns (name, state) appearing in the final dataframe.

In [26]:
pd.concat(
    [name_city_state_columns_only, name_state_zipcode_columns_only], 
    join='inner')

opeid,name,state
31505,A - TECHNICAL COLLEGE,CA
41495,A & W HEALTHCARE EDUCATORS,LA
2477,A. T. STILL UNIVERSITY OF HEALTH SCIENCES,MO
30649,AARON'S ACADEMY OF BEAUTY,MD
30651,ABC BEAUTY COLLEGE,AR
41833,ABCOTT INSTITUTE,MI
37834,ABDILL CAREER COLLEGE,OR
3537,ABILENE CHRISTIAN UNIVERSITY,TX
7087,ABINGTON MEMORIAL HOSPITAL DIXON SCHOOL OF NUR...,PA
1541,ABRAHAM BALDWIN AGRICULTURAL COLLEGE,GA
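The two `join` settings can be compared side by side on made-up frames (`left` and `right` are hypothetical names). The sketch also shows a side effect visible in the outer-join output earlier: an integer column such as `zipcode` is converted to floats, because the `NaN` placeholder is a float value.

```python
import pandas as pd

# Hypothetical frames with partly different columns
left = pd.DataFrame({'name': ['A'], 'city': ['X']})
right = pd.DataFrame({'name': ['B'], 'zipcode': [12345]})

outer = pd.concat([left, right], sort=False)     # join='outer' is the default
inner = pd.concat([left, right], join='inner')

print(list(outer.columns))   # union of the columns
print(list(inner.columns))   # only the shared columns
print(right['zipcode'].dtype, '->', outer['zipcode'].dtype)
```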


### Concatenating `DataFrame` Columns
Now let's go back and see how we can use the `pd.concat` function to merge two sets of columns with the same index (row) values.

The data will start out a little dirty but we will clean it up with our parameters.

In [27]:
# DataFrame 1
# Contains the first 5 rows of the original dataset
# But only the name, city, and state columns
name_city_state_columns = college_loan_defaults[['name', 'city', 'state']][:5]

# DataFrame 2
# Contains the first 7 rows of the original dataset - two more rows than DataFrame 1
# But only default rates columns
default_rates = college_loan_defaults[
    ['year_1_default_rate',
     'year_2_default_rate', 
     'year_3_default_rate']][:7]

In [28]:
name_city_state_columns

opeid,name,city,state
31505,A - TECHNICAL COLLEGE,LOS ANGELES,CA
41495,A & W HEALTHCARE EDUCATORS,NEW ORLEANS,LA
2477,A. T. STILL UNIVERSITY OF HEALTH SCIENCES,KIRKSVILLE,MO
30649,AARON'S ACADEMY OF BEAUTY,WALDORF,MD
30651,ABC BEAUTY COLLEGE,ARKADELPHIA,AR


In [29]:
default_rates

opeid,year_1_default_rate,year_2_default_rate,year_3_default_rate
31505,27.1,27.2,31.4
41495,12.9,16.1,0.0
2477,1.6,1.6,3.1
30649,35.8,38.8,43.9
30651,26.6,19.2,17.7
41833,16.4,15.1,0.0
37834,17.1,20.7,19.6


Now let's do a simple concatenation. To add columns, we have to set the `axis` parameter to **`1`** to indicate that we are adding columns, not rows.

In [30]:
pd.concat(
    [name_city_state_columns, default_rates], 
    axis=1)

opeid,name,city,state,year_1_default_rate,year_2_default_rate,year_3_default_rate
2477,A. T. STILL UNIVERSITY OF HEALTH SCIENCES,KIRKSVILLE,MO,1.6,1.6,3.1
30649,AARON'S ACADEMY OF BEAUTY,WALDORF,MD,35.8,38.8,43.9
30651,ABC BEAUTY COLLEGE,ARKADELPHIA,AR,26.6,19.2,17.7
31505,A - TECHNICAL COLLEGE,LOS ANGELES,CA,27.1,27.2,31.4
37834,,,,17.1,20.7,19.6
41495,A & W HEALTHCARE EDUCATORS,NEW ORLEANS,LA,12.9,16.1,0.0
41833,,,,16.4,15.1,0.0


<div class="alert alert-block alert-info">
<p>
Note that the ``pd.concat()`` function is smart enough to match the rows based on the index `opeid`.
</p>
</div>

There are a couple of important things to notice here:
* Unlike when concatenating rows, this time Pandas did **sort the rows based on the index**. Just something to be aware of.
* Notice the two rows with `NaN` values in their first three columns. That's because our `default_rates` dataframe had two additional rows for which there were no corresponding values in `name_city_state_columns`.

Let's drop the rows with `NaN` values by specifying an inner join.

In [31]:
pd.concat(
    [name_city_state_columns, default_rates], 
    axis=1, join='inner')

opeid,name,city,state,year_1_default_rate,year_2_default_rate,year_3_default_rate
31505,A - TECHNICAL COLLEGE,LOS ANGELES,CA,27.1,27.2,31.4
41495,A & W HEALTHCARE EDUCATORS,NEW ORLEANS,LA,12.9,16.1,0.0
2477,A. T. STILL UNIVERSITY OF HEALTH SCIENCES,KIRKSVILLE,MO,1.6,1.6,3.1
30649,AARON'S ACADEMY OF BEAUTY,WALDORF,MD,35.8,38.8,43.9
30651,ABC BEAUTY COLLEGE,ARKADELPHIA,AR,26.6,19.2,17.7


Finally, let's talk about how the **`verify_integrity`** and **`ignore_index`** parameters work when concatenating columns.

Let's say that we had included the city column in both dataframes:
* The default behavior of `pd.concat` would have been to create a new dataframe with 2 "city" columns.
* You could make Pandas raise a `ValueError` exception by passing `verify_integrity=True` to the function.
* You could also pass `ignore_index=True` to throw out all the column names and replace them with a 0-based series of integers. The values of "city" would still be duplicated in two columns, but the columns would have different integer "names".
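The three behaviors in the list above can be sketched on two made-up frames that both carry a `city` column. The names `names` and `rates` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Hypothetical frames that both carry a 'city' column
names = pd.DataFrame({'name': ['A', 'B'], 'city': ['X', 'Y']})
rates = pd.DataFrame({'city': ['X', 'Y'], 'rate': [1.5, 2.5]})

# Default: the result simply has two 'city' columns
both = pd.concat([names, rates], axis=1)
print(list(both.columns))

# verify_integrity=True rejects the duplicate column label
try:
    pd.concat([names, rates], axis=1, verify_integrity=True)
    raised = False
except ValueError:
    raised = True
print(raised)

# ignore_index=True replaces the column labels with 0, 1, 2, ...
renumbered = pd.concat([names, rates], axis=1, ignore_index=True)
print(list(renumbered.columns))
```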

# Combining Datasets with Merge [MORE INFO](https://pandas.pydata.org/pandas-docs/stable/merging.html)

We will be exploring another way to combine datasets through the **`pd.merge`** function.

Those who have a background in databases will find a significant amount of overlap between SQL joins and the merge function.

## The 3 Categories of Joins
There are 3 different categories of merges/joins which are defined by the characteristics of the shared columns/indices:
* One-to-One: Each shared value exists only once in both dataframes.
* One-to-Many: A given shared value exists once in the first dataframe, but 1 or more times in the second dataframe.
* Many-to-Many: A given shared value exists 1 or more times in both dataframes.
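If you already know which category a merge should fall into, `pd.merge` can check it for you through its `validate` parameter. This is a small sketch on made-up data (`owners` and `pets` are hypothetical frames), not something the rest of this notebook relies on.

```python
import pandas as pd

# Hypothetical frames: each owner appears once, but may have several pets
owners = pd.DataFrame({'owner': ['Ann', 'Bob'], 'city': ['Reno', 'Tulsa']})
pets = pd.DataFrame({'owner': ['Ann', 'Ann', 'Bob'],
                     'pet': ['cat', 'dog', 'fish']})

# Passes: 'owner' is unique on the left and repeated on the right
merged = pd.merge(owners, pets, validate='one_to_many')
print(len(merged))

# Fails: the right side repeats 'Ann', so this is not one-to-one
try:
    pd.merge(owners, pets, validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```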

Let's provide an example of each type of join from our datasets.

### One-to-One Join

<div class="alert alert-block alert-info">
<p>
This will feel pretty similar to concatenating columns.
</p>
</div>

In [32]:
# Team Members Favorite Restaurants
team_restaurants = pd.DataFrame(
    {'restaurant': ['In-N-Out', 'Chipotle', 'Chick-Fil-A'], 
    'name': ['Mike', 'Kim', 'Roger']})
team_restaurants

Unnamed: 0,restaurant,name
0,In-N-Out,Mike
1,Chipotle,Kim
2,Chick-Fil-A,Roger


In [33]:
# Item Locations
items_locations = pd.DataFrame(
    {'items': ['Fries', 'Pizza', 'Barritos','Pasta', 'Shakes'], 
    'locations': ['Chicago', 'New York', 'San Diego', 'Pittsburgh', 'Seattle']})
items_locations

Unnamed: 0,items,locations
0,Fries,Chicago
1,Pizza,New York
2,Barritos,San Diego
3,Pasta,Pittsburgh
4,Shakes,Seattle


In [34]:
# Restaurant Items
restaurant_items = pd.DataFrame(
    {
        'item': [
        'Shakes', 
        'Burritos', 
        'Burger'
        ]
    ,
        'restaurant':[
        'In-N-Out',
        'Chipotle',
        'Five Guys',
    ]
    }
)
restaurant_items

Unnamed: 0,item,restaurant
0,Shakes,In-N-Out
1,Burritos,Chipotle
2,Burger,Five Guys


In [35]:
team_restaurants

Unnamed: 0,restaurant,name
0,In-N-Out,Mike
1,Chipotle,Kim
2,Chick-Fil-A,Roger


The **`restaurant`** field is unique in both the `restaurant_items` and `team_restaurants` datasets; that is, each restaurant name appears only once in each dataset.

Because of this, if we merge the two dataframes it will be a **1-1 join.**

In [36]:
pd.merge(team_restaurants, restaurant_items)

Unnamed: 0,restaurant,name,item
0,In-N-Out,Mike,Shakes
1,Chipotle,Kim,Burritos


Great. Here's what Pandas did:
1. Identified the matching column(s) between the two dataframes: **`restaurant`**.
1. Found matching **`restaurant`** values between the two dataframes.
1. Merged the columns of matching **`restaurant`** values together.
1. **Important**: Notice that a new index was generated.
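The first step, inferring the join column, can also be made explicit with the `on` parameter. The sketch below rebuilds the two small frames under the shorter, illustrative names `team` and `items` and checks that both calls give the same result.

```python
import pandas as pd

# Copies of the two small frames, under illustrative names
team = pd.DataFrame({'restaurant': ['In-N-Out', 'Chipotle', 'Chick-Fil-A'],
                     'name': ['Mike', 'Kim', 'Roger']})
items = pd.DataFrame({'item': ['Shakes', 'Burritos', 'Burger'],
                      'restaurant': ['In-N-Out', 'Chipotle', 'Five Guys']})

implicit = pd.merge(team, items)                   # pandas infers the join column
explicit = pd.merge(team, items, on='restaurant')  # same join, stated explicitly

print(implicit.equals(explicit))
print(list(explicit.index))
```

Stating `on` explicitly can make the intent clearer to readers and guards against pandas picking up an unintended shared column.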

<div class="alert alert-block alert-info">
<p>
In our discussion, we will refer to the columns that pandas uses to find matches between dataframes as the "join column(s)".
</p>
</div>

#### Controlling the Join Type with the `how` Parameter
Did you notice that some of the records from each of the original dataframes didn't make it into the merge product?

This is because the type of join that was applied to the dataframes was called an **inner join**.

There are actually 4 types of joins that you can use:
* **Inner Join**: To be included in the output dataframe, the join column(s) value must exist in both original dataframes. 
    * This is why some of the records didn't get included in the output: they didn't have a corresponding join column(s) value in the other dataframe.
* **Outer Join**: All records from both dataframes are included in the output. Pandas simply fills in `NaN` where there is no corresponding join column(s) value.
* **Left Join**: All rows from the first (left) dataframe will be included in the output dataframe, regardless of whether there is a matching join column(s) value in the second (right) dataframe.
* **Right Join**: All rows from the second (right) dataframe will be included in the output dataframe, regardless of whether there is a matching join columns value in the left (first) dataframe.
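One way to see which side each row came from is the `indicator` parameter of `pd.merge`, which adds a `_merge` column to the result. The sketch below uses illustrative copies of the two small frames (`team` and `items` are names chosen here).

```python
import pandas as pd

# Copies of the two small frames, under illustrative names
team = pd.DataFrame({'restaurant': ['In-N-Out', 'Chipotle', 'Chick-Fil-A'],
                     'name': ['Mike', 'Kim', 'Roger']})
items = pd.DataFrame({'item': ['Shakes', 'Burritos', 'Burger'],
                      'restaurant': ['In-N-Out', 'Chipotle', 'Five Guys']})

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
tagged = pd.merge(team, items, how='outer', indicator=True)
print(tagged[['restaurant', '_merge']])
```

Rows marked `both` are exactly the ones an inner join would keep; a left join keeps `both` plus `left_only`, and a right join keeps `both` plus `right_only`.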

Let's go ahead and try all these different types of joins to see how our output changes.

In [37]:
team_restaurants

Unnamed: 0,restaurant,name
0,In-N-Out,Mike
1,Chipotle,Kim
2,Chick-Fil-A,Roger


In [38]:
restaurant_items

Unnamed: 0,item,restaurant
0,Shakes,In-N-Out
1,Burritos,Chipotle
2,Burger,Five Guys


In [39]:
# Outer Join
# All records from both dataframes are included.
# NaN is inserted wherever a value is missing.
pd.merge(team_restaurants, restaurant_items, how="outer")

Unnamed: 0,restaurant,name,item
0,In-N-Out,Mike,Shakes
1,Chipotle,Kim,Burritos
2,Chick-Fil-A,Roger,
3,Five Guys,,Burger


In [40]:
pd.merge(team_restaurants, restaurant_items, how="left")

Unnamed: 0,restaurant,name,item
0,In-N-Out,Mike,Shakes
1,Chipotle,Kim,Burritos
2,Chick-Fil-A,Roger,


In [41]:
pd.merge(team_restaurants, restaurant_items, how="right")

Unnamed: 0,restaurant,name,item
0,In-N-Out,Mike,Shakes
1,Chipotle,Kim,Burritos
2,Five Guys,,Burger


### One-to-Many Join

In [42]:
# Restaurant Items
restaurant_items = pd.DataFrame(
    {
        'item': [
        'Burgers', 'Fries', 'Shakes', 
        'Tacos', 'Burritos', 'Chips',
        'Chicken Sandwich', 'Fries', 'Salads'
        ]
    ,
        'rest':[
        'In-N-Out', 'In-N-Out', 'In-N-Out', 
        'Chipotle', 'Chipotle', 'Chipotle',
        'Five Guys', 'Five Guys', 'Five Guys'
    ]
    }
)
restaurant_items

Unnamed: 0,item,rest
0,Burgers,In-N-Out
1,Fries,In-N-Out
2,Shakes,In-N-Out
3,Tacos,Chipotle
4,Burritos,Chipotle
5,Chips,Chipotle
6,Chicken Sandwich,Five Guys
7,Fries,Five Guys
8,Salads,Five Guys


In [43]:
team_restaurants

Unnamed: 0,restaurant,name
0,In-N-Out,Mike
1,Chipotle,Kim
2,Chick-Fil-A,Roger


In [44]:
pd.merge(team_restaurants, restaurant_items)

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

#### Specifying the Join Columns
Well... that isn't what we wanted.

Thankfully though, the error message is pretty self-explanatory. Pandas thinks there are no common columns to merge on.

The reason for this is that the common values are held in columns with slightly different names. We have to explain to Pandas what to do when this happens by specifying the names of the columns to join on.

In [45]:
team_restaurants

Unnamed: 0,restaurant,name
0,In-N-Out,Mike
1,Chipotle,Kim
2,Chick-Fil-A,Roger


In [46]:
restaurant_items

Unnamed: 0,item,rest
0,Burgers,In-N-Out
1,Fries,In-N-Out
2,Shakes,In-N-Out
3,Tacos,Chipotle
4,Burritos,Chipotle
5,Chips,Chipotle
6,Chicken Sandwich,Five Guys
7,Fries,Five Guys
8,Salads,Five Guys


In [47]:
# Use the left_on and right_on parameters to specify the
# name(s) of the join column(s) in the first(left)
# and second(right) dataframes.
pd.merge(
    team_restaurants, 
    restaurant_items,
    left_on='restaurant',
    right_on='rest')

Unnamed: 0,restaurant,name,item,rest
0,In-N-Out,Mike,Burgers,In-N-Out
1,In-N-Out,Mike,Fries,In-N-Out
2,In-N-Out,Mike,Shakes,In-N-Out
3,Chipotle,Kim,Tacos,Chipotle
4,Chipotle,Kim,Burritos,Chipotle
5,Chipotle,Kim,Chips,Chipotle


<div class="alert alert-block alert-info">
<h5>There can be more than 1 join column</h5>
<p>
In this example, we have specified only one join column. But you can specify multiple columns if you so desire. Just pass them as a list to the `left_on` and `right_on` parameters.
</p>
</div>
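As a rough sketch of a multi-column join, the made-up frames below are matched on a state and a year together. All names and numbers here (`population`, `budget`, `st`, `yr`) are hypothetical.

```python
import pandas as pd

# Hypothetical frames keyed by (state, year)
population = pd.DataFrame({'state': ['CA', 'CA', 'TX'],
                           'year': [2011, 2012, 2012],
                           'population': [100, 110, 90]})
budget = pd.DataFrame({'st': ['CA', 'TX'],
                       'yr': [2012, 2012],
                       'budget': [5, 4]})

# A row matches only when *both* columns agree
merged = pd.merge(population, budget,
                  left_on=['state', 'year'],
                  right_on=['st', 'yr'])
print(merged)
```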

### Many-to-Many Join

In [48]:
# Team Members Favorite Restaurants
team_restaurants = pd.DataFrame(
    {'restaurant': ['In-N-Out', 'Chipotle', 'Chick-Fil-A', 'Chick-Fil-A', 'In-N-Out'], 
    'name': ['Mike', 'Kim', 'Roger', 'Sam', 'Sonia']})
team_restaurants


Unnamed: 0,restaurant,name
0,In-N-Out,Mike
1,Chipotle,Kim
2,Chick-Fil-A,Roger
3,Chick-Fil-A,Sam
4,In-N-Out,Sonia


In [49]:
# Restaurant Items
restaurant_items = pd.DataFrame(
    {
        'item': [
        'Burgers', 'Fries', 'Shakes', 
        'Tacos', 'Burritos', 'Chips',
        'Chicken Sandwich', 'Fries', 'Salads'
        ]
    ,
        'rest':[
        'In-N-Out', 'In-N-Out', 'In-N-Out', 
        'Chipotle', 'Chipotle', 'Chipotle',
        'Five Guys', 'Five Guys', 'Five Guys'
    ]
    }
)
restaurant_items

Unnamed: 0,item,rest
0,Burgers,In-N-Out
1,Fries,In-N-Out
2,Shakes,In-N-Out
3,Tacos,Chipotle
4,Burritos,Chipotle
5,Chips,Chipotle
6,Chicken Sandwich,Five Guys
7,Fries,Five Guys
8,Salads,Five Guys


In [50]:
new_df = pd.merge(
    team_restaurants, restaurant_items, 
        left_on = 'restaurant', 
        right_on = 'rest', 
        how = "outer"
)

new_df


Unnamed: 0,restaurant,name,item,rest
0,In-N-Out,Mike,Burgers,In-N-Out
1,In-N-Out,Mike,Fries,In-N-Out
2,In-N-Out,Mike,Shakes,In-N-Out
3,In-N-Out,Sonia,Burgers,In-N-Out
4,In-N-Out,Sonia,Fries,In-N-Out
5,In-N-Out,Sonia,Shakes,In-N-Out
6,Chipotle,Kim,Tacos,Chipotle
7,Chipotle,Kim,Burritos,Chipotle
8,Chipotle,Kim,Chips,Chipotle
9,Chick-Fil-A,Roger,,


<div class="alert alert-block alert-info">
<p> You could merge two dataframes based on index as well. </p>

<p>
If you wanted to, you could actually use the index of one dataframe and a column of the other dataframe. Pandas gives you great flexibility here. 
</p>
</div>
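Both variants can be sketched with the `left_index` and `right_index` parameters of `pd.merge`. The frames `area_by_state` and `capitals` are made up for this illustration.

```python
import pandas as pd

# Hypothetical frames: one indexed by state name, one with a state column
area_by_state = pd.DataFrame(
    {'area': [52423, 656425]},
    index=pd.Index(['Alabama', 'Alaska'], name='state'))
capitals = pd.DataFrame({'state_name': ['Alaska', 'Alabama'],
                         'capital': ['Juneau', 'Montgomery']})

# Index on both sides
by_index = pd.merge(area_by_state, capitals.set_index('state_name'),
                    left_index=True, right_index=True)

# Column on the left, index on the right
mixed = pd.merge(capitals, area_by_state,
                 left_on='state_name', right_index=True)

print(by_index)
print(mixed)
```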

## Activity: Compute the population density of each state

We will practice the ``pd.merge()`` operation using three datasets from your textbook.


* Load the datasets

In [51]:
pop = pd.read_csv('./data/state-population.csv')
areas = pd.read_csv('./data/state-areas.csv')
abbrevs = pd.read_csv('./data/state-abbrevs.csv')

In [52]:
pop.head()

Unnamed: 0,state_abrv,ages,year,population
0,AL,under18,2012,1117489.0
1,AL,total,2012,4817528.0
2,AL,under18,2010,1130966.0
3,AL,total,2010,4785570.0
4,AL,under18,2011,1125763.0


In [53]:
areas.head()

Unnamed: 0,state,area (sq. mi)
0,Alabama,52423
1,Alaska,656425
2,Arizona,114006
3,Arkansas,53182
4,California,163707


In [54]:
abbrevs.head()

Unnamed: 0,state,abbreviation
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA


* Create a DataFrame named `areas_abbrvs_merged` by merging the ``areas`` DataFrame and ``abbrevs`` DataFrame to get the state names, state abbreviations, and area into one DataFrame

In [55]:
areas_abbrvs_merged = pd.merge(areas,abbrevs)
areas_abbrvs_merged.head()

Unnamed: 0,state,area (sq. mi),abbreviation
0,Alabama,52423,AL
1,Alaska,656425,AK
2,Arizona,114006,AZ
3,Arkansas,53182,AR
4,California,163707,CA


* Merge the DataFrame created above (`areas_abbrvs_merged`) with the `pop` DataFrame to create the `complete_data` DataFrame

In [63]:
complete_data = pd.merge(
    areas_abbrvs_merged,
    pop, 
    left_on='abbreviation', 
    right_on='state_abrv'
)

complete_data.sample()

Unnamed: 0,state,area (sq. mi),abbreviation,state_abrv,ages,year,population
1738,Oregon,98386,OR,OR,total,1995,3184369.0


* Finally, create a column called `density` using the `population` and `area (sq. mi)` columns.

In [73]:
# density == pop / area

complete_data['density'] = (
    complete_data['population'] / complete_data['area (sq. mi)']
)

complete_data.sample()

Unnamed: 0,state,area (sq. mi),abbreviation,state_abrv,ages,year,population,density
1733,Oregon,98386,OR,OR,under18,1993,778973.0,7.917519


* Which state has the highest total population density in the year 2012?

In [74]:
# Step 1: Keep only the rows where ages is 'total' and year is 2012
total_pop_for_2012 = complete_data[ 
    (complete_data['ages'] == 'total') 
        & ( complete_data['year'] == 2012)
]

total_pop_for_2012.sample(5)

Unnamed: 0,state,area (sq. mi),abbreviation,state_abrv,ages,year,population,density
287,Colorado,104100,CO,CO,total,2012,5189458.0,49.850701
1295,Nebraska,77358,NE,NE,total,2012,1855350.0,23.983945
1584,North Dakota,70704,ND,ND,total,2012,701345.0,9.919453
432,Georgia,59441,GA,GA,total,2012,9915646.0,166.814926
1104,Mississippi,48434,MS,MS,total,2012,2986450.0,61.660197


In [72]:
# Step 2: Find the state that has the maximum pop density

max_id = total_pop_for_2012['density'].idxmax()
complete_data.loc[max_id]

state            District of Columbia
area (sq. mi)                      68
abbreviation                       DC
state_abrv                         DC
ages                            total
year                             2012
population                     633427
density                        9315.1
Name: 2401, dtype: object
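If you want a ranking rather than a single winner, one alternative to `idxmax` is `Series.nlargest`. The sketch below uses a made-up stand-in frame (`sample`, with invented numbers) rather than the real `total_pop_for_2012` data.

```python
import pandas as pd

# Hypothetical stand-in for total_pop_for_2012, with made-up numbers
sample = pd.DataFrame({'state': ['A', 'B', 'C'],
                       'density': [10.0, 250.5, 42.0]})

# Index by state so the result is labeled by name, then take the top 2
top_two = sample.set_index('state')['density'].nlargest(2)
print(top_two)
```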

# Writing files back 

Until now you have been loading datasets from your computer. However, you might also want to store data back to your computer to use it later.

For example, in the activity above, where you computed population density by merging a bunch of DataFrames, you might want to save the resulting DataFrame rather than redo all the steps.

## pd.DataFrame.to_csv()

In [None]:
pd.DataFrame.to_csv?

In [None]:
complete_data.to_csv('./data/state-population-density.csv')

<div class="alert alert-block alert-info">
<p>
By default, ``to_csv()`` also writes the index (row labels), which creates an additional column in the file. To avoid this, pass the keyword parameter ``index=False``.
</p>
</div>
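The effect of the index column is easiest to see by reading the file back. This sketch writes to in-memory buffers (`io.StringIO`) instead of real files so that it leaves nothing on disk; the frame `frame` and its values are made up.

```python
import io
import pandas as pd

# Hypothetical small frame with made-up values
frame = pd.DataFrame({'state': ['OR', 'WA'], 'density': [1.5, 2.5]})

with_index = io.StringIO()
frame.to_csv(with_index)                  # index written as an unnamed first column
without_index = io.StringIO()
frame.to_csv(without_index, index=False)  # index left out

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

The extra column shows up as `Unnamed: 0` when the file is read back, which is a common sign that a CSV was saved with its index.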

In [None]:
complete_data.to_csv('./data/state-population-density.csv', index = False)

# Hierarchical Indexing


### Multiindex

If you set the index to more than one column, you are creating a multi-index, or hierarchical index. This makes asking questions based on indexes much easier, and also opens up the possibility of working with multidimensional data.

We'll use the example sourced from [here](https://chrisalbon.com/python/pandas_hierarchical_data.html). 

In [None]:
# Create dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

In [None]:
df_1_ind = df.set_index('regiment')
df_1_ind

* How do we get the average scores, based on the regiment? 

In [None]:
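# mean(level=...) was removed in pandas 2.0; group on the index level instead
df_1_ind.groupby(level='regiment').mean(numeric_only=True)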

* What if you want to get the mean scores based on the company, but not the regiment?

In [None]:
# Set the hierarchical index to be by regiment, and then by company
df_2_ind = df.set_index(['regiment', 'company'])
df_2_ind

<div class="alert alert-block alert-info">
<p>
Having multiple indexes gives you an easy way to model data with more than two dimensions using DataFrames, which are by default two-dimensional data structures.
</p>
<p>
For the above example, you can imagine each regiment is a two-dimensional array giving details about the company, names and the scores, and they are stacked one below the other. 
</p>
</div>

In [None]:
df_2_ind.groupby(level='company').mean(numeric_only=True)

In [None]:
df_2_ind.groupby(level='regiment').mean(numeric_only=True)

In [None]:
df_2_ind.groupby(level=['regiment', 'company']).mean(numeric_only=True)
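Besides aggregating, a hierarchical index also lets you select by one or more levels. The following is a minimal sketch on a smaller, made-up version of the regiment data (`scores` is a hypothetical name), using `.loc` for the outer level and `xs` for the inner one.

```python
import pandas as pd

# A hypothetical, smaller version of the regiment data
scores = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'postTestScore': [25, 57, 70, 94],
}).set_index(['regiment', 'company']).sort_index()

# All companies of one regiment (outer level)
one_regiment = scores.loc['Nighthawks']

# A single value, addressed by both levels
one_cell = scores.loc[('Dragoons', '2nd'), 'postTestScore']

# Cross-section on the inner level: every regiment's 1st company
first_companies = scores.xs('1st', level='company')

print(one_regiment)
print(one_cell)
print(first_companies)
```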