# Merging data
Here, you'll learn all about merging pandas DataFrames. You'll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You'll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.

In [1]:
from IPython.display import HTML, Image
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
%%HTML
<video style="display:block; margin: 0 auto;" controls>
      <source src="_Docs/01-Merging_DataFrames.mp4" type="video/mp4">
</video>

### Merging company DataFrames
Suppose your company has operations in several different cities under several different managers. The DataFrames `revenue` and `managers` contain partial information related to the company. That is, the rows of the `city` columns don't quite match in `revenue` and `managers` (the Mendocino branch has no revenue yet since it just opened, and the manager of the Springfield branch recently left the company).

The DataFrames have been printed. If you were to run the command `combined = pd.merge(revenue, managers, on='city')`, how many rows would `combined` have?

In [3]:
dic = {'city':['Austin','Denver','Springfield'], 'revenue':[100,83,4]}
revenue = pd.DataFrame(dic)
revenue

Unnamed: 0,city,revenue
0,Austin,100
1,Denver,83
2,Springfield,4


In [4]:
dic = {'city':['Austin','Denver','Mendocino'], 'manager':['Charlers','Joel','Brett']}
managers = pd.DataFrame(dic)
managers

Unnamed: 0,city,manager
0,Austin,Charlers
1,Denver,Joel
2,Mendocino,Brett


In [5]:
combined = pd.merge(revenue, managers, on='city')
combined

Unnamed: 0,city,revenue,manager
0,Austin,100,Charlers
1,Denver,83,Joel
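
The result has two rows because `pd.merge()` performs an inner join by default. As a minimal sketch (the `audit` and `unmatched` names are introduced here for illustration only), an outer join with `indicator=True` is one way to see which cities were dropped and which DataFrame they came from:

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield'],
                        'revenue': [100, 83, 4]})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Mendocino'],
                         'manager': ['Charlers', 'Joel', 'Brett']})

# pd.merge defaults to how='inner': only cities present in both frames survive
inner = pd.merge(revenue, managers, on='city')

# An outer join with indicator=True adds a '_merge' column recording whether
# each row came from the left frame only, the right frame only, or both
audit = pd.merge(revenue, managers, on='city', how='outer', indicator=True)
unmatched = audit.loc[audit['_merge'] != 'both', ['city', '_merge']]

print(len(inner))
print(unmatched)
```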


### Merging on a specific column
This exercise follows on from the last one with the DataFrames `revenue` and `managers` for your company. You expect your company to grow and, eventually, to operate in cities with the same name in different states. As such, you decide that every branch should have a numerical branch identifier. Thus, you add a `branch_id` column to both DataFrames. Moreover, new cities have been added to both the `revenue` and `managers` DataFrames as well. pandas has been imported as `pd` and both DataFrames are available in your namespace.

At present, there should be a 1-to-1 relationship between the `city` and `branch_id` fields. In that case, the result of a merge on the `city` columns ought to give you the same output as a merge on the `branch_id` columns. Do they? Can you spot an ambiguity in one of the DataFrames?

In [6]:
# List of Tuples
records = [('Austin',10,100),
            ('Denver',20,83),
            ('Springfield',30,4),
            ('Mendocino',47,200)]
#Create a DataFrame object
revenue = pd.DataFrame(records, columns = ['city','branch_id','revenue'])
revenue.branch_id = pd.to_numeric(revenue.branch_id)
revenue

Unnamed: 0,city,branch_id,revenue
0,Austin,10,100
1,Denver,20,83
2,Springfield,30,4
3,Mendocino,47,200


In [7]:
# List of Tuples
records = [('Austin',10,'Charles'),
            ('Denver',20,'Joel'),
            ('Mendocino',47,'Brett'),
            ('Springfield',31,'Sally')]
#Create a DataFrame object
managers = pd.DataFrame(records, columns = ['city','branch_id','manager'])
managers.branch_id = pd.to_numeric(managers.branch_id)
managers

Unnamed: 0,city,branch_id,manager
0,Austin,10,Charles
1,Denver,20,Joel
2,Mendocino,47,Brett
3,Springfield,31,Sally


In [8]:
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue,managers,on='city')

# Print merge_by_city
merge_by_city

Unnamed: 0,city,branch_id_x,revenue,branch_id_y,manager
0,Austin,10,100,10,Charles
1,Denver,20,83,20,Joel
2,Springfield,30,4,31,Sally
3,Mendocino,47,200,47,Brett


In [9]:
# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue,managers,on='branch_id')

# Print merge_by_id
merge_by_id

Unnamed: 0,city_x,branch_id,revenue,city_y,manager
0,Austin,10,100,Austin,Charles
1,Denver,20,83,Denver,Joel
2,Mendocino,47,200,Mendocino,Brett
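
The two merges disagree because Springfield carries `branch_id` 30 in `revenue` and 31 in `managers`. One possible way to surface that kind of ambiguity programmatically is to merge on `city` and compare the two suffixed identifier columns. This is a sketch; the `_rev`/`_mgr` suffixes and the `conflicts` name are choices made here, not part of the exercise:

```python
import pandas as pd

revenue = pd.DataFrame(
    [('Austin', 10, 100), ('Denver', 20, 83),
     ('Springfield', 30, 4), ('Mendocino', 47, 200)],
    columns=['city', 'branch_id', 'revenue'])
managers = pd.DataFrame(
    [('Austin', 10, 'Charles'), ('Denver', 20, 'Joel'),
     ('Mendocino', 47, 'Brett'), ('Springfield', 31, 'Sally')],
    columns=['city', 'branch_id', 'manager'])

# Custom suffixes make it clearer which frame each branch_id came from
merge_by_city = pd.merge(revenue, managers, on='city',
                         suffixes=('_rev', '_mgr'))

# Rows where the two sources disagree about the branch identifier
conflicts = merge_by_city[
    merge_by_city['branch_id_rev'] != merge_by_city['branch_id_mgr']]
print(conflicts)
```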


### Merging on columns with non-matching labels
You continue working with the `revenue` & `managers` DataFrames from before. This time, someone has changed the field name `'city'` to `'branch'` in the managers table. Now, when you attempt to merge DataFrames, an exception is thrown:
```Python
>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
    ... <text deleted> ...
    pd.merge(revenue, managers, on='city')
    ... <text deleted> ...
KeyError: 'city'
```    
Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the `left_on` and `right_on` parameters in the call to `pd.merge()`.

As before, pandas has been pre-imported as `pd` and the `revenue` and `managers` DataFrames are in your namespace. They have been printed in the IPython Shell so you can examine the columns prior to merging.

Are you able to merge better than in the last exercise? How should the rows with `Springfield` be handled?

In [10]:
# List of Tuples
records = [('Austin'     ,10,'TX',100),
           ('Denver'     ,20,'CO',83 ),
           ('Springfield',30,'IL',4  ),
           ('Mendocino'  ,47,'CA',200)]
#Create a DataFrame object
revenue = pd.DataFrame(records, columns = ['city','branch_id','state','revenue'])
revenue.branch_id = pd.to_numeric(revenue.branch_id)
revenue

Unnamed: 0,city,branch_id,state,revenue
0,Austin,10,TX,100
1,Denver,20,CO,83
2,Springfield,30,IL,4
3,Mendocino,47,CA,200


In [11]:
# List of Tuples
records = [('Austin'     ,10,'TX','Charles'),
           ('Denver'     ,20,'CO','Joel'   ),
           ('Mendocino'  ,47,'CA','Brett'  ),
           ('Springfield',31,'MO','Sally'  )]
#Create a DataFrame object
managers = pd.DataFrame(records, columns = ['branch','branch_id','state','manager'])
managers.branch_id = pd.to_numeric(managers.branch_id)
managers

Unnamed: 0,branch,branch_id,state,manager
0,Austin,10,TX,Charles
1,Denver,20,CO,Joel
2,Mendocino,47,CA,Brett
3,Springfield,31,MO,Sally


In [12]:
# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue,managers,left_on='city',right_on='branch')

# Print combined
combined

Unnamed: 0,city,branch_id_x,state_x,revenue,branch,branch_id_y,state_y,manager
0,Austin,10,TX,100,Austin,10,TX,Charles
1,Denver,20,CO,83,Denver,20,CO,Joel
2,Springfield,30,IL,4,Springfield,31,MO,Sally
3,Mendocino,47,CA,200,Mendocino,47,CA,Brett
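
Notice that `left_on`/`right_on` keeps both key columns (`city` and `branch`) in the result. An alternative sketch, assuming you are free to rename columns, is to rename `branch` to `city` on the fly so that a plain `on='city'` works and only one key column remains. The `managers_renamed` name and the suffixes are illustrative:

```python
import pandas as pd

revenue = pd.DataFrame(
    [('Austin', 10, 'TX', 100), ('Denver', 20, 'CO', 83),
     ('Springfield', 30, 'IL', 4), ('Mendocino', 47, 'CA', 200)],
    columns=['city', 'branch_id', 'state', 'revenue'])
managers = pd.DataFrame(
    [('Austin', 10, 'TX', 'Charles'), ('Denver', 20, 'CO', 'Joel'),
     ('Mendocino', 47, 'CA', 'Brett'), ('Springfield', 31, 'MO', 'Sally')],
    columns=['branch', 'branch_id', 'state', 'manager'])

# rename returns a copy, so the original managers frame is left untouched
managers_renamed = managers.rename(columns={'branch': 'city'})

combined = pd.merge(revenue, managers_renamed, on='city',
                    suffixes=('_rev', '_mgr'))
print(combined.columns.tolist())
```

Either way, the Springfield rows are still matched on name alone, even though the two frames place them in different states.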


### Merging on multiple columns
Another strategy to disambiguate cities with identical names is to add information on the states in which the cities are located. To this end, you add a column called `state` to both DataFrames from the preceding exercises. Again, pandas has been pre-imported as `pd` and the `revenue` and `managers` DataFrames are in your namespace.

Your goal in this exercise is to use `pd.merge()` to merge DataFrames using multiple columns (using `'branch_id'`, `'city'`, and `'state'` in this case).

Are you able to match all your company's branches correctly?

In [13]:
# List of Tuples
records = [('Austin',10,100),
            ('Denver',20,83),
            ('Springfield',30,4),
            ('Mendocino',47,200)]
#Create a DataFrame object
revenue = pd.DataFrame(records, columns = ['city','branch_id','revenue'])
revenue.branch_id = pd.to_numeric(revenue.branch_id)
revenue

Unnamed: 0,city,branch_id,revenue
0,Austin,10,100
1,Denver,20,83
2,Springfield,30,4
3,Mendocino,47,200


In [14]:
# List of Tuples
records = [('Austin',10,'Charles'),
            ('Denver',20,'Joel'),
            ('Mendocino',47,'Brett'),
            ('Springfield',31,'Sally')]
#Create a DataFrame object
managers = pd.DataFrame(records, columns = ['city','branch_id','manager'])
managers.branch_id = pd.to_numeric(managers.branch_id)
managers

Unnamed: 0,city,branch_id,manager
0,Austin,10,Charles
1,Denver,20,Joel
2,Mendocino,47,Brett
3,Springfield,31,Sally


In [15]:
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']

# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']

# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue,managers,on=['branch_id', 'city', 'state'])

# Print combined
combined

Unnamed: 0,city,branch_id,revenue,state,manager
0,Austin,10,100,TX,Charles
1,Denver,20,83,CO,Joel
2,Mendocino,47,200,CA,Brett
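
When `on` is a list, a pair of rows matches only if every listed key agrees. The following sketch counts the inner-join matches for progressively stricter key sets; `key_sets` and `rows_matched` are names invented for this illustration:

```python
import pandas as pd

revenue = pd.DataFrame(
    [('Austin', 10, 'TX', 100), ('Denver', 20, 'CO', 83),
     ('Springfield', 30, 'IL', 4), ('Mendocino', 47, 'CA', 200)],
    columns=['city', 'branch_id', 'state', 'revenue'])
managers = pd.DataFrame(
    [('Austin', 10, 'TX', 'Charles'), ('Denver', 20, 'CO', 'Joel'),
     ('Mendocino', 47, 'CA', 'Brett'), ('Springfield', 31, 'MO', 'Sally')],
    columns=['city', 'branch_id', 'state', 'manager'])

key_sets = [['city'], ['city', 'state'], ['branch_id', 'city', 'state']]

# Number of rows surviving an inner join for each choice of keys
rows_matched = {tuple(keys): len(pd.merge(revenue, managers, on=keys))
                for keys in key_sets}
print(rows_matched)
```

Merging on `city` alone pairs the two Springfield rows; adding `state` (or `branch_id`) separates them, so they drop out of the inner join.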


# Joining DataFrames
## Merging with left join
- Keeps all rows of the left DF in the merged DF
- For rows in the left DF with matches in the right DF:
    - Non-joining columns of right DF are appended to left DF
- For rows in the left DF with no matches in the right DF:
    - Non-joining columns are filled with nulls
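
As a small illustration of these rules, here is a sketch that reuses the two-column `revenue` and `managers` frames from the first exercise; the `left` name is introduced here:

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield'],
                        'revenue': [100, 83, 4]})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Mendocino'],
                         'manager': ['Charlers', 'Joel', 'Brett']})

# how='left' keeps every row of revenue; Springfield has no manager, so its
# 'manager' value is null, and Mendocino (right frame only) is dropped
left = pd.merge(revenue, managers, on='city', how='left')
print(left)
```
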
## Which should you use?
- `df1.append(df2)`: stacking vertically (removed in pandas 2.0; use `pd.concat` instead)
- `pd.concat([df1, df2])`:
    - stacking many horizontally or vertically
    - simple inner/outer joins on Indexes
- `df1.join(df2)`: inner/outer/left/right joins on Indexes
- `pd.merge(df1, df2)`: many joins on multiple columns
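
To make the comparison concrete, here is a hedged sketch that applies each tool to two tiny, made-up frames (`df1`, `df2`, and the result names are purely illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# Vertical stacking: rows of df2 go under df1, missing cells become NaN
stacked = pd.concat([df1, df2], ignore_index=True)

# Simple inner join on the index via concat along the column axis
side_by_side = pd.concat([df1.set_index('key'), df2.set_index('key')],
                         axis=1, join='inner')

# Index-based left join
joined = df1.set_index('key').join(df2.set_index('key'), how='left')

# Column-based outer join
merged = pd.merge(df1, df2, on='key', how='outer')

print(stacked.shape, side_by_side.index.tolist(), joined.shape, len(merged))
```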

In [16]:
%%HTML
<video style="display:block; margin: 0 auto;" controls>
      <source src="_Docs/02-Joining_DataFrames.mp4" type="video/mp4">
</video>

### Joining by Index
The DataFrames `revenue` and `managers` are displayed in the IPython Shell. Here, they are indexed by `'branch_id'`.

Choose the function call below that will join the DataFrames on their indexes and return 5 rows with index labels `[10, 20, 30, 31, 47]`. 

In [17]:
# List of Tuples
records = [('Austin'     ,10,'TX',100),
           ('Denver'     ,20,'CO',83 ),
           ('Springfield',30,'IL',4  ),
           ('Mendocino'  ,47,'CA',200)]
#Create a DataFrame object
revenue = pd.DataFrame(records, columns = ['city','branch_id','state','revenue'])
revenue.branch_id = pd.to_numeric(revenue.branch_id)
revenue = revenue.set_index('branch_id')
revenue

branch_id,city,state,revenue
10,Austin,TX,100
20,Denver,CO,83
30,Springfield,IL,4
47,Mendocino,CA,200


In [18]:
# List of Tuples
records = [('Austin'     ,10,'TX','Charles'),
           ('Denver'     ,20,'CO','Joel'   ),
           ('Mendocino'  ,47,'CA','Brett'  ),
           ('Springfield',31,'MO','Sally'  )]
#Create a DataFrame object
managers = pd.DataFrame(records, columns = ['branch','branch_id','state','manager'])
managers.branch_id = pd.to_numeric(managers.branch_id)
managers = managers.set_index('branch_id')
managers

branch_id,branch,state,manager
10,Austin,TX,Charles
20,Denver,CO,Joel
47,Mendocino,CA,Brett
31,Springfield,MO,Sally


In [19]:
revenue.join(managers, lsuffix='_rev', rsuffix='_mng', how='outer')

branch_id,city,state_rev,revenue,branch,state_mng,manager
10,Austin,TX,100.0,Austin,TX,Charles
20,Denver,CO,83.0,Denver,CO,Joel
30,Springfield,IL,4.0,,,
31,,,,Springfield,MO,Sally
47,Mendocino,CA,200.0,Mendocino,CA,Brett
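
Only `how='outer'` yields all five index labels. The sketch below compares the index labels produced by each `how` option on the same two frames; `index_by_how` is a name made up for this comparison, and the labels are sorted so that row order does not matter:

```python
import pandas as pd

revenue = pd.DataFrame(
    [('Austin', 10, 'TX', 100), ('Denver', 20, 'CO', 83),
     ('Springfield', 30, 'IL', 4), ('Mendocino', 47, 'CA', 200)],
    columns=['city', 'branch_id', 'state', 'revenue']).set_index('branch_id')
managers = pd.DataFrame(
    [('Austin', 10, 'TX', 'Charles'), ('Denver', 20, 'CO', 'Joel'),
     ('Mendocino', 47, 'CA', 'Brett'), ('Springfield', 31, 'MO', 'Sally')],
    columns=['branch', 'branch_id', 'state', 'manager']).set_index('branch_id')

# 'state' exists in both frames, so join needs suffixes to tell them apart
index_by_how = {
    how: sorted(revenue.join(managers, lsuffix='_rev', rsuffix='_mng',
                             how=how).index.tolist())
    for how in ['left', 'right', 'inner', 'outer']
}
print(index_by_how)
```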


### Choosing a joining strategy
Suppose you have two DataFrames: `students` (with columns `'StudentID'`, `'LastName'`, `'FirstName'`, and `'Major'`) and `midterm_results` (with columns `'StudentID'`, `'Q1'`, `'Q2'`, and `'Q3'` for their scores on midterm questions).

You want to combine the DataFrames into a single DataFrame `grades`, and be able to easily spot which students wrote the midterm and which didn't (their midterm question scores `'Q1'`, `'Q2'`, & `'Q3'` should be filled with `NaN` values).

You also want to drop rows from `midterm_results` in which the `StudentID` is not found in `students`.

Which of the following strategies gives the desired result?

```Python
# A left join
grades = pd.merge(students, midterm_results, how='left')
```
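
To see why a left join fits, here is a sketch with invented data: the student names, majors, and scores below are hypothetical and serve only to exercise the column layout described above.

```python
import pandas as pd

# Hypothetical sample data matching the described columns
students = pd.DataFrame({
    'StudentID': [1, 2, 3],
    'LastName': ['Ng', 'Diaz', 'Okafor'],
    'FirstName': ['Ann', 'Luis', 'Chi'],
    'Major': ['Math', 'Physics', 'Biology'],
})
midterm_results = pd.DataFrame({
    'StudentID': [1, 3, 9],   # 9 is not enrolled; 2 did not write the midterm
    'Q1': [8, 7, 5],
    'Q2': [9, 6, 5],
    'Q3': [10, 8, 5],
})

# 'StudentID' is the only shared column, so it is used as the key by default
grades = pd.merge(students, midterm_results, how='left')

# Students with NaN scores did not write the midterm
did_not_write = grades.loc[grades['Q1'].isna(), 'StudentID'].tolist()
print(grades)
print(did_not_write)
```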

### Left & right merging on multiple columns
You now have, in addition to the `revenue` and `managers` DataFrames from prior exercises, a DataFrame `sales` that summarizes units sold from specific branches (identified by `city` and `state` but not `branch_id`).

Once again, the `managers` DataFrame uses the label `branch` in place of `city` as in the other two DataFrames. Your task here is to employ `left` and `right` merges to preserve data and identify where data is missing.

By merging `revenue` and `sales` with a `right merge`, you can identify the missing revenue values. Here, you don't need to specify `left_on` or `right_on` because the columns to merge on have matching labels.

By merging `sales` and `managers` with a `left merge`, you can identify the missing manager. Here, the columns to merge on have conflicting labels, so you must specify `left_on` and `right_on`. In both cases, you're looking to figure out how to connect the fields in rows containing `Springfield`.

pandas has been imported as `pd` and the three DataFrames `revenue`, `managers`, and `sales` have been pre-loaded. They have been printed for you to explore in the IPython Shell.

In [20]:
# List of Tuples
records = [('Austin'     ,10,'TX',100),
           ('Denver'     ,20,'CO',83 ),
           ('Springfield',30,'IL',4  ),
           ('Mendocino'  ,47,'CA',200)]
#Create a DataFrame object
revenue = pd.DataFrame(records, columns = ['city','branch_id','state','revenue'])
revenue.branch_id = pd.to_numeric(revenue.branch_id)
revenue

Unnamed: 0,city,branch_id,state,revenue
0,Austin,10,TX,100
1,Denver,20,CO,83
2,Springfield,30,IL,4
3,Mendocino,47,CA,200


In [21]:
# List of Tuples
records = [('Austin'     ,10,'TX','Charles'),
           ('Denver'     ,20,'CO','Joel'   ),
           ('Mendocino'  ,47,'CA','Brett'  ),
           ('Springfield',31,'MO','Sally'  )]
#Create a DataFrame object
managers = pd.DataFrame(records, columns = ['branch','branch_id','state','manager'])
managers.branch_id = pd.to_numeric(managers.branch_id)
managers

Unnamed: 0,branch,branch_id,state,manager
0,Austin,10,TX,Charles
1,Denver,20,CO,Joel
2,Mendocino,47,CA,Brett
3,Springfield,31,MO,Sally


In [22]:
# List of Tuples
records = [('Mendocino'  ,'CA',1),
           ('Denver'     ,'CO',4),
           ('Austin'     ,'TX',2),
           ('Springfield','MO',5),
           ('Springfield','IL',1)]
#Create a DataFrame object
sales = pd.DataFrame(records, columns = ['city','state','units'])
sales.units = pd.to_numeric(sales.units)
sales

Unnamed: 0,city,state,units
0,Mendocino,CA,1
1,Denver,CO,4
2,Austin,TX,2
3,Springfield,MO,5
4,Springfield,IL,1


In [23]:
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue,sales,how='right',on=['city', 'state'])

# Print revenue_and_sales
revenue_and_sales

Unnamed: 0,city,branch_id,state,revenue,units
0,Austin,10.0,TX,100.0,2
1,Denver,20.0,CO,83.0,4
2,Springfield,30.0,IL,4.0,1
3,Mendocino,47.0,CA,200.0,1
4,Springfield,,MO,,5


In [24]:
# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales,managers,how='left',left_on=['city', 'state'],right_on=['branch', 'state'])

# Print sales_and_managers
sales_and_managers

Unnamed: 0,city,state,units,branch,branch_id,manager
0,Mendocino,CA,1,Mendocino,47.0,Brett
1,Denver,CO,4,Denver,20.0,Joel
2,Austin,TX,2,Austin,10.0,Charles
3,Springfield,MO,5,Springfield,31.0,Sally
4,Springfield,IL,1,,,
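
Once the left and right merges are in place, the missing data can be located by filtering on null values. A minimal sketch, where `missing_revenue` and `missing_manager` are names introduced here:

```python
import pandas as pd

revenue = pd.DataFrame(
    [('Austin', 10, 'TX', 100), ('Denver', 20, 'CO', 83),
     ('Springfield', 30, 'IL', 4), ('Mendocino', 47, 'CA', 200)],
    columns=['city', 'branch_id', 'state', 'revenue'])
managers = pd.DataFrame(
    [('Austin', 10, 'TX', 'Charles'), ('Denver', 20, 'CO', 'Joel'),
     ('Mendocino', 47, 'CA', 'Brett'), ('Springfield', 31, 'MO', 'Sally')],
    columns=['branch', 'branch_id', 'state', 'manager'])
sales = pd.DataFrame(
    [('Mendocino', 'CA', 1), ('Denver', 'CO', 4), ('Austin', 'TX', 2),
     ('Springfield', 'MO', 5), ('Springfield', 'IL', 1)],
    columns=['city', 'state', 'units'])

revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])
sales_and_managers = pd.merge(sales, managers, how='left',
                              left_on=['city', 'state'],
                              right_on=['branch', 'state'])

# Rows that exist in sales but found no partner in the other frame
missing_revenue = revenue_and_sales[revenue_and_sales['revenue'].isna()]
missing_manager = sales_and_managers[sales_and_managers['manager'].isna()]

print(missing_revenue[['city', 'state']])
print(missing_manager[['city', 'state']])
```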


### Merging DataFrames with outer join
This exercise picks up where the previous one left off. The DataFrames `revenue`, `managers`, and `sales` are pre-loaded into your namespace (and, of course, pandas is imported as `pd`). Moreover, the merged DataFrames `revenue_and_sales` and `sales_and_managers` have been pre-computed exactly as you did in the previous exercise.

The merged DataFrames contain enough information to construct a DataFrame with 5 rows with all known information correctly aligned and each branch listed only once. You will try to merge the merged DataFrames on all matching keys (which computes an inner join by default). You can compare the result to an outer join and also to an outer join with a restricted subset of columns as keys.

In [25]:
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers,revenue_and_sales)

# Print merge_default
merge_default

Unnamed: 0,city,state,units,branch,branch_id,manager,revenue
0,Mendocino,CA,1,Mendocino,47.0,Brett,200.0
1,Denver,CO,4,Denver,20.0,Joel,83.0
2,Austin,TX,2,Austin,10.0,Charles,100.0


In [26]:
# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers,revenue_and_sales,how='outer')

# Print merge_outer
merge_outer

Unnamed: 0,city,state,units,branch,branch_id,manager,revenue
0,Mendocino,CA,1,Mendocino,47.0,Brett,200.0
1,Denver,CO,4,Denver,20.0,Joel,83.0
2,Austin,TX,2,Austin,10.0,Charles,100.0
3,Springfield,MO,5,Springfield,31.0,Sally,
4,Springfield,IL,1,,,,
5,Springfield,IL,1,,30.0,,4.0
6,Springfield,MO,5,,,,


In [27]:
# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers,revenue_and_sales,on=['city','state'],how='outer')

# Print merge_outer_on
merge_outer_on

Unnamed: 0,city,state,units_x,branch,branch_id_x,manager,branch_id_y,revenue,units_y
0,Mendocino,CA,1,Mendocino,47.0,Brett,47.0,200.0,1
1,Denver,CO,4,Denver,20.0,Joel,20.0,83.0,4
2,Austin,TX,2,Austin,10.0,Charles,10.0,100.0,2
3,Springfield,MO,5,Springfield,31.0,Sally,,,5
4,Springfield,IL,1,,,,30.0,4.0,1
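
The unrestricted outer join produces seven rows because, with no `on` argument, every shared column becomes a key, including `branch_id` and `units`. The sketch below lists those shared columns and then shows one possible cleanup of the restricted outer join, filling `branch_id` from whichever side has it. The `tidy` name and the use of `combine_first` are choices made for this illustration, not part of the exercise:

```python
import pandas as pd

revenue = pd.DataFrame(
    [('Austin', 10, 'TX', 100), ('Denver', 20, 'CO', 83),
     ('Springfield', 30, 'IL', 4), ('Mendocino', 47, 'CA', 200)],
    columns=['city', 'branch_id', 'state', 'revenue'])
managers = pd.DataFrame(
    [('Austin', 10, 'TX', 'Charles'), ('Denver', 20, 'CO', 'Joel'),
     ('Mendocino', 47, 'CA', 'Brett'), ('Springfield', 31, 'MO', 'Sally')],
    columns=['branch', 'branch_id', 'state', 'manager'])
sales = pd.DataFrame(
    [('Mendocino', 'CA', 1), ('Denver', 'CO', 4), ('Austin', 'TX', 2),
     ('Springfield', 'MO', 5), ('Springfield', 'IL', 1)],
    columns=['city', 'state', 'units'])

revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])
sales_and_managers = pd.merge(sales, managers, how='left',
                              left_on=['city', 'state'],
                              right_on=['branch', 'state'])

# Columns shared by both frames: these are the default merge keys
common = [c for c in sales_and_managers.columns
          if c in revenue_and_sales.columns]
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')

# Restrict the keys, then reconcile the duplicated columns
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales,
                          on=['city', 'state'], how='outer')
tidy = merge_outer_on.assign(
    branch_id=merge_outer_on['branch_id_x'].combine_first(
        merge_outer_on['branch_id_y']),
    units=merge_outer_on['units_x'],
).drop(columns=['branch_id_x', 'branch_id_y', 'units_x', 'units_y', 'branch'])

print(common)
print(tidy)
```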


# Ordered merges

In [28]:
%%HTML
<video style="display:block; margin: 0 auto;" controls>
      <source src="_Docs/03-Ordered_merges.mp4" type="video/mp4">
</video>

### Using `merge_ordered()`
This exercise uses pre-loaded DataFrames `austin` and `houston` that contain weather data from the cities `Austin` and `Houston` respectively. They have been printed in the IPython Shell for you to examine.

Weather conditions were recorded on separate days and you need to merge these two DataFrames together such that the dates are ordered. To do this, you'll use `pd.merge_ordered()`. After you're done, note the order of the rows before and after merging.

In [29]:
# List of Tuples
records = [('2016-01-01','Cloudy'),
           ('2016-02-08','Cloudy'),
           ('2016-01-17','Sunny' )
          ]
#Create a DataFrame object
austin = pd.DataFrame(records, columns = ['date','ratings'])
austin

Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-02-08,Cloudy
2,2016-01-17,Sunny


In [30]:
# List of Tuples
records = [('2016-01-04','Rainy'),
           ('2016-01-01','Cloudy'),
           ('2016-03-01','Sunny' )
          ]
#Create a DataFrame object
houston = pd.DataFrame(records, columns = ['date','ratings'])
houston

Unnamed: 0,date,ratings
0,2016-01-04,Rainy
1,2016-01-01,Cloudy
2,2016-03-01,Sunny


In [31]:
# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin,houston)

# Print tx_weather
tx_weather

Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-01-04,Rainy
2,2016-01-17,Sunny
3,2016-02-08,Cloudy
4,2016-03-01,Sunny


In [32]:
# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin,houston,on='date',suffixes=['_aus','_hus'])

# Print tx_weather_suff
tx_weather_suff

Unnamed: 0,date,ratings_aus,ratings_hus
0,2016-01-01,Cloudy,Cloudy
1,2016-01-04,,Rainy
2,2016-01-17,Sunny,
3,2016-02-08,Cloudy,
4,2016-03-01,,Sunny


In [33]:
# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin,houston,on='date',suffixes=['_aus','_hus'],fill_method='ffill')

# Print tx_weather_ffill
tx_weather_ffill

Unnamed: 0,date,ratings_aus,ratings_hus
0,2016-01-01,Cloudy,Cloudy
1,2016-01-04,Cloudy,Rainy
2,2016-01-17,Sunny,Rainy
3,2016-02-08,Cloudy,Rainy
4,2016-03-01,Cloudy,Sunny
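
For intuition, an ordered merge on `date` behaves much like an outer merge followed by a sort on the key. The sketch below compares the two on the same frames; the `manual` name is introduced here. The dates are plain strings, which sort correctly only because they use the ISO `YYYY-MM-DD` format:

```python
import pandas as pd

austin = pd.DataFrame(
    [('2016-01-01', 'Cloudy'), ('2016-02-08', 'Cloudy'), ('2016-01-17', 'Sunny')],
    columns=['date', 'ratings'])
houston = pd.DataFrame(
    [('2016-01-04', 'Rainy'), ('2016-01-01', 'Cloudy'), ('2016-03-01', 'Sunny')],
    columns=['date', 'ratings'])

ordered = pd.merge_ordered(austin, houston, on='date',
                           suffixes=['_aus', '_hus'])

# Roughly equivalent: outer merge, then sort by the key column
manual = (pd.merge(austin, houston, on='date', how='outer',
                   suffixes=['_aus', '_hus'])
            .sort_values('date')
            .reset_index(drop=True))

print(ordered['date'].tolist())
print(manual['date'].tolist())
```

What `merge_ordered()` adds on top of this is the `fill_method` argument, as used in the third merge above.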


### Using merge_asof()
Similar to `pd.merge_ordered()`, the `pd.merge_asof()` function **will also merge values in order using the `on` column, but for each row in the left DataFrame, only the last row from the right DataFrame whose `on` column value is less than or equal to the left value will be kept.**

**This function can be used to align disparate datetime frequencies without having to first resample.**
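
A small sketch of this matching rule on made-up data (the dates, prices, and `mpg` values below are hypothetical, not taken from the oil or automobile datasets):

```python
import pandas as pd

# Both key columns must be sorted and share a dtype (datetime here)
left = pd.DataFrame({
    'yr': pd.to_datetime(['1970-01-01', '1970-08-15', '1971-01-01']),
    'mpg': [18.0, 15.0, 25.0],
})
right = pd.DataFrame({
    'Date': pd.to_datetime(['1970-01-01', '1970-06-01',
                            '1971-01-01', '1971-06-01']),
    'Price': [3.35, 3.40, 3.56, 3.60],
})

# Default: take the most recent right row with Date <= yr
merged = pd.merge_asof(left, right, left_on='yr', right_on='Date')

# allow_exact_matches=False requires Date to be strictly earlier than yr
strict = pd.merge_asof(left, right, left_on='yr', right_on='Date',
                       allow_exact_matches=False)

print(merged[['yr', 'Price']])
print(strict[['yr', 'Price']])
```

With the default settings an exact date match is allowed, so the 1970-01-01 row picks up the price recorded on that same day; with `allow_exact_matches=False` it has no earlier price to use and is left null.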

Here, you'll merge monthly oil prices (US dollars) into a full automobile fuel efficiency dataset. The oil and automobile datasets have been pre-loaded as the DataFrames `oil` and `auto`. The first 5 rows of each have been printed in the IPython Shell for you to explore.

These datasets will align such that the first price of the year will be broadcast into the rows of the automobiles DataFrame. This is considered correct since by the start of any given year, most automobiles for that year will have already been manufactured.

You'll then inspect the merged DataFrame, resample by year, and compute the mean `'Price'` and `'mpg'`. You should be able to see a trend in these two columns, which you can confirm by computing the Pearson correlation between resampled `'Price'` and `'mpg'`.

In [34]:
oil = pd.read_csv('../_datasets/oil_price.csv',parse_dates=True)
oil.head()

Unnamed: 0,Date,Price
0,1970-01-01,3.35
1,1970-02-01,3.35
2,1970-03-01,3.35
3,1970-04-01,3.35
4,1970-05-01,3.35


In [35]:
auto = pd.read_csv('../_datasets/automobiles.csv',parse_dates=True)
auto.head()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,1970-01-01,US,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,1970-01-01,US,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,1970-01-01,US,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,1970-01-01,US,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,1970-01-01,US,ford torino


In [36]:
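# parse_dates=True only tries to parse the index, so 'yr' and 'Date' are still
# strings here; merge_asof needs sorted keys of a matching datetime/numeric dtype
auto['yr'] = pd.to_datetime(auto['yr'])
oil['Date'] = pd.to_datetime(oil['Date'])

# Merge auto and oil: merged
merged = pd.merge_asof(auto,oil,left_on='yr',right_on='Date')

# Print the tail of merged
merged.tail()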

In [None]:
# Resample merged: yearly
yearly = merged.resample('A',on='Date')[['mpg','Price']].mean()

# Print yearly
yearly

In [None]:
# print yearly.corr()
print(yearly.corr())
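
The yearly aggregation and correlation can also be sketched without a frequency alias by grouping on the calendar year of `'Date'`. This is a swapped-in alternative to `resample`, shown on a tiny made-up `merged` frame rather than the real datasets:

```python
import pandas as pd

# Hypothetical stand-in for the merged auto/oil data
merged = pd.DataFrame({
    'Date': pd.to_datetime(['1970-01-01', '1970-01-01',
                            '1971-01-01', '1971-01-01']),
    'mpg': [10.0, 20.0, 20.0, 30.0],
    'Price': [3.0, 3.0, 4.0, 4.0],
})

# Group by calendar year and average the two columns of interest
yearly = merged.groupby(merged['Date'].dt.year)[['mpg', 'Price']].mean()

# Pearson correlation between the yearly means
corr = yearly['mpg'].corr(yearly['Price'])

print(yearly)
print(corr)
```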