# Merging data
Here, you'll learn all about merging pandas DataFrames. You'll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You'll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.

# 1. Merging DataFrames
### 1.1 Merging company DataFrames
Suppose your company has operations in several different cities under several different managers. The DataFrames `revenue` and `managers` contain partial information related to the company. That is, the rows of the `city` columns don't quite match in `revenue` and `managers` (the Mendocino branch has no revenue yet since it just opened, and the manager of the Springfield branch recently left the company).

The DataFrames have been printed in the IPython Shell. If you were to run the command `combined = pd.merge(revenue, managers, on='city')`, how many rows would `combined` have?

In [1]:
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield"], "revenue": [100, 83, 4]})
revenue
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino"], "manager": ["Charlers", "Joel", "Brett"]})
managers

Unnamed: 0,city,manager
0,Austin,Charlers
1,Denver,Joel
2,Mendocino,Brett



#### Possible Answers
1. 0 rows.
2. 2 rows.
3. 3 rows.
4. 4 rows.

#### Answer:
Since the default strategy for `pd.merge()` is an _inner join_, `combined` will have 2 rows.
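
As a quick check of the answer above, the following sketch rebuilds the two DataFrames from cell `In [1]` and counts the rows produced by the default merge. The `combined_outer` variable is not part of the exercise; it is added here only to contrast the inner join with an outer join.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield"],
                        "revenue": [100, 83, 4]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino"],
                         "manager": ["Charlers", "Joel", "Brett"]})

# Default merge is an inner join: only cities present in both DataFrames survive
combined = pd.merge(revenue, managers, on="city")
print(len(combined))

# An outer join keeps every city from either side and fills the gaps with NaN
# (combined_outer is an illustrative name, not part of the exercise)
combined_outer = pd.merge(revenue, managers, on="city", how="outer")
print(len(combined_outer))
```

Only Austin and Denver appear in both DataFrames, which is why the inner join keeps two rows while the outer join keeps all four cities.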

### 1.2 Merging on a specific column
This exercise follows on from the last one with the DataFrames `revenue` and `managers` for your company. You expect your company to grow and, eventually, to operate in cities with the same name in different states. As such, you decide that every branch should have a numerical branch identifier. Thus, you add a `branch_id` column to both DataFrames. Moreover, new cities have been added to both the `revenue` and `managers` DataFrames as well.

At present, there should be a 1-to-1 relationship between the `city` and `branch_id` fields. In that case, the result of a merge on the `city` columns ought to give you the same output as a merge on the `branch_id` columns. Do they? Can you spot an ambiguity in one of the DataFrames?

### Instructions:
* Using `pd.merge()`, merge the DataFrames `revenue` and `managers` on the `'city'` column of each. Store the result as `merge_by_city`.
* Print the DataFrame `merge_by_city`. 
* Merge the DataFrames `revenue` and `managers` on the `'branch_id'` column of each. Store the result as `merge_by_id`.
* Print the DataFrame `merge_by_id`.

In [3]:
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino", "Springfield"],
                        "branch_id": [10, 20, 47, 31],
                        "manager": ["Charles", "Joel", "Brett", "Sally"]})

In [4]:
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')

# Print merge_by_city
merge_by_city

Unnamed: 0,city,branch_id_x,revenue,branch_id_y,manager
0,Austin,10,100,10,Charles
1,Denver,20,83,20,Joel
2,Springfield,30,4,31,Sally
3,Mendocino,47,200,47,Brett


In [5]:
# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')

# Print merge_by_id
merge_by_id

Unnamed: 0,city_x,branch_id,revenue,city_y,manager
0,Austin,10,100,Austin,Charles
1,Denver,20,83,Denver,Joel
2,Mendocino,47,200,Mendocino,Brett


Notice that when you merge on `'city'`, the resulting DataFrame has a peculiar result: In row 2, the city Springfield has two different branch IDs. This is because there are actually two different cities named Springfield - one in the State of Illinois, and the other in Missouri. The `revenue` DataFrame has the one from Illinois, and the `managers` DataFrame has the one from Missouri. Consequently, when you merge on `'branch_id'`, both of these get dropped from the merged DataFrame.
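
One way to surface this kind of ambiguity programmatically, rather than by eye, is to compare the two suffixed `branch_id` columns after merging on `'city'`. This is a minimal sketch using the data from cell `In [3]`; the variable name `mismatched` is an illustrative choice.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "manager": ["Charles", "Joel", "Brett", "Sally"]})

merge_by_city = pd.merge(revenue, managers, on="city")

# Rows where the two sides disagree on the branch identifier
# (mismatched is a hypothetical helper name)
mismatched = merge_by_city[
    merge_by_city["branch_id_x"] != merge_by_city["branch_id_y"]
]
print(mismatched)
```

Any row returned here matched on the city name but refers to two different branches.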

### 1.3 Merging on columns with non-matching labels
You continue working with the `revenue` & `managers` DataFrames from before. This time, someone has changed the field name `'city'` to `'branch'` in the `managers` table. Now, when you attempt to merge DataFrames, an exception is thrown:
```
>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
    ... <text deleted> ...
    pd.merge(revenue, managers, on='city')
    ... <text deleted> ...
KeyError: 'city'
```
Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the `left_on` and `right_on` parameters in the call to `pd.merge()`.

Are you able to merge better than in the last exercise? How should the rows with `Springfield` be handled?

### Instructions:
* Merge the DataFrames `revenue` and `managers` into a single DataFrame called `combined` using the `'city'` and `'branch'` columns from the appropriate DataFrames.
    * In your call to `pd.merge()`, you will have to specify the parameters `left_on` and `right_on` appropriately.
* Print the new DataFrame `combined`.

In [6]:
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state":["TX","CO","IL","CA"],
                        "revenue": [100, 83, 4, 200]})

revenue

Unnamed: 0,city,branch_id,state,revenue
0,Austin,10,TX,100
1,Denver,20,CO,83
2,Springfield,30,IL,4
3,Mendocino,47,CA,200


In [7]:
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                        "branch_id": [10, 20, 47, 31],
                        "state":["TX","CO","CA","MO"],
                        "manager": ["Charlers", "Joel", "Brett", "Sally"]})
                        
managers

Unnamed: 0,branch,branch_id,state,manager
0,Austin,10,TX,Charlers
1,Denver,20,CO,Joel
2,Mendocino,47,CA,Brett
3,Springfield,31,MO,Sally


In [8]:
# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on='city', right_on='branch')

# Print combined
combined

Unnamed: 0,city,branch_id_x,state_x,revenue,branch,branch_id_y,state_y,manager
0,Austin,10,TX,100,Austin,10,TX,Charlers
1,Denver,20,CO,83,Denver,20,CO,Joel
2,Springfield,30,IL,4,Springfield,31,MO,Sally
3,Mendocino,47,CA,200,Mendocino,47,CA,Brett


It is important to pay attention to how columns are named in different DataFrames.
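
The default `_x` and `_y` suffixes in the output above do not say which DataFrame a column came from. As a sketch, the `suffixes` parameter of `pd.merge()` can make that explicit; the suffix strings `'_rev'` and `'_mgr'` are arbitrary choices made for this illustration.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state": ["TX", "CO", "IL", "CA"],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "state": ["TX", "CO", "CA", "MO"],
                         "manager": ["Charlers", "Joel", "Brett", "Sally"]})

# Same merge as in the exercise, but with descriptive suffixes for the
# overlapping non-key columns ('branch_id' and 'state')
combined = pd.merge(revenue, managers,
                    left_on="city", right_on="branch",
                    suffixes=("_rev", "_mgr"))
print(combined.columns.tolist())
```

With these names, the Springfield row shows `state_rev` and `state_mgr` side by side, which makes the Illinois/Missouri mismatch easier to read.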

### 1.4 Merging on multiple columns
Another strategy to disambiguate cities with identical names is to add information on the _states_ in which the cities are located. To this end, you add a column called `state` to both DataFrames from the preceding exercises.

Your goal in this exercise is to use `pd.merge()` to merge DataFrames using multiple columns (using `'branch_id'`, `'city'`, and `'state'` in this case).

Are you able to match all your company's branches correctly?

### Instructions:
* Create a column called `'state'` in the DataFrame `revenue`, consisting of the list `['TX','CO','IL','CA']`.
* Create a column called `'state'` in the DataFrame `managers`, consisting of the list `['TX','CO','CA','MO']`.
* Merge the DataFrames `revenue` and `managers` using _three_ columns: `'branch_id'`, `'city'`, and `'state'`. Pass them in as a list to the `on` parameter of `pd.merge()`.

In [9]:
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino", "Springfield"],
                        "branch_id": [10, 20, 47, 31],
                        "manager": ["Charlers", "Joel", "Brett", "Sally"]})

In [10]:
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']
revenue

Unnamed: 0,city,branch_id,revenue,state
0,Austin,10,100,TX
1,Denver,20,83,CO
2,Springfield,30,4,IL
3,Mendocino,47,200,CA


In [11]:
# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']
managers

Unnamed: 0,city,branch_id,manager,state
0,Austin,10,Charlers,TX
1,Denver,20,Joel,CO
2,Mendocino,47,Brett,CA
3,Springfield,31,Sally,MO


In [12]:
# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])

# Print combined
combined

Unnamed: 0,city,branch_id,revenue,state,manager
0,Austin,10,100,TX,Charlers
1,Denver,20,83,CO,Joel
2,Mendocino,47,200,CA,Brett
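
The inner join above silently drops both Springfield branches. To see which rows failed to match and from which side they came, one option (not part of the original exercise) is an outer join with `indicator=True`, which adds a `_merge` column. The name `combined_check` is illustrative.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "manager": ["Charlers", "Joel", "Brett", "Sally"]})
revenue["state"] = ["TX", "CO", "IL", "CA"]
managers["state"] = ["TX", "CO", "CA", "MO"]

# Outer join on all three keys; '_merge' records where each row was found
combined_check = pd.merge(revenue, managers,
                          on=["branch_id", "city", "state"],
                          how="outer", indicator=True)
print(combined_check[["city", "state", "branch_id", "_merge"]])
```

Rows marked `left_only` exist only in `revenue`, and rows marked `right_only` exist only in `managers`.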


# 2. Joining DataFrames
### 2.1 Joining by Index
The DataFrames `revenue` and `managers` are displayed. Here, they are indexed by `'branch_id'`.

Choose the function call below that will join the DataFrames on their indexes and return 5 rows with index labels `[10, 20, 30, 31, 47]`. Explore each of them in the IPython Shell to get a better understanding of their functionality.

In [14]:
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state":["TX","CO","IL","CA"],
                        "revenue": [100, 83, 4, 200]}).set_index("branch_id")
revenue

Unnamed: 0_level_0,city,state,revenue
branch_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10,Austin,TX,100
20,Denver,CO,83
30,Springfield,IL,4
47,Mendocino,CA,200


In [15]:
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                        "branch_id": [10, 20, 47, 31],
                        "state":["TX","CO","CA","MO"],
                        "manager": ["Charlers", "Joel", "Brett", "Sally"]}).set_index("branch_id")
managers

Unnamed: 0_level_0,branch,state,manager
branch_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10,Austin,TX,Charlers
20,Denver,CO,Joel
47,Mendocino,CA,Brett
31,Springfield,MO,Sally


### Possible Answers:
1. `pd.merge(revenue, managers, on='branch_id')`.
2. `pd.merge(managers, revenue, how='left')`.
3. `revenue.join(managers, lsuffix='_rev', rsuffix='_mng', how='outer')`.
4. `managers.join(revenue, lsuffix='_mgn', rsuffix='_rev', how='left')`.

In [17]:
print(pd.merge(revenue, managers, on='branch_id'))
print(50*'-')
print(pd.merge(managers, revenue, how='left'))
print(50*'-')
print(revenue.join(managers, lsuffix='_rev', rsuffix='_mng', how='outer'))
print(50*'-')
print(managers.join(revenue, lsuffix='_mgn', rsuffix='_rev', how='left'))

                city state_x  revenue     branch state_y   manager
branch_id                                                         
10            Austin      TX      100     Austin      TX  Charlers
20            Denver      CO       83     Denver      CO      Joel
47         Mendocino      CA      200  Mendocino      CA     Brett
--------------------------------------------------
        branch state   manager       city  revenue
0       Austin    TX  Charlers     Austin    100.0
1       Denver    CO      Joel     Denver     83.0
2    Mendocino    CA     Brett  Mendocino    200.0
3  Springfield    MO     Sally        NaN      NaN
--------------------------------------------------
                  city state_rev  revenue       branch state_mng   manager
branch_id                                                                 
10              Austin        TX    100.0       Austin        TX  Charlers
20              Denver        CO     83.0       Denver        CO      Joel
30      

<div align=right><b>Answer:</b> (3)</div>
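
To confirm the answer, the sketch below rebuilds the indexed DataFrames from cells `In [14]` and `In [15]` and inspects the index of the outer join (option 3) next to the left join (option 4). The variable names `joined` and `left_joined` are illustrative.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state": ["TX", "CO", "IL", "CA"],
                        "revenue": [100, 83, 4, 200]}).set_index("branch_id")
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "state": ["TX", "CO", "CA", "MO"],
                         "manager": ["Charlers", "Joel", "Brett", "Sally"]}).set_index("branch_id")

# Option 3: outer join on the index keeps every branch_id from both sides
joined = revenue.join(managers, lsuffix="_rev", rsuffix="_mng", how="outer")
print(joined.index.tolist())

# Option 4: left join keeps only the branch_ids found in managers
left_joined = managers.join(revenue, lsuffix="_mgn", rsuffix="_rev", how="left")
print(left_joined.index.tolist())
```

Only the outer join contains both `30` and `31`, because each of those labels appears in just one of the two DataFrames.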

### 2.2 Choosing a joining strategy
Suppose you have two DataFrames: `students` (with columns `'StudentID'`, `'LastName'`, `'FirstName'`, and `'Major'`) and `midterm_results` (with columns `'StudentID'`, `'Q1'`, `'Q2'`, and `'Q3'` for their scores on midterm questions).

You want to combine the DataFrames into a single DataFrame `grades`, and be able to easily spot which students wrote the midterm and which didn't (their midterm question scores `'Q1'`, `'Q2'`, & `'Q3'` should be filled with `NaN` values).

You also want to drop rows from `midterm_results` in which the `StudentID` is not found in `students`.

Which of the following strategies gives the desired result?

### Possible Answers:
1. A left join: `grades = pd.merge(students, midterm_results, how='left')`.
2. A right join: `grades = pd.merge(students, midterm_results, how='right')`.
3. An inner join: `grades = pd.merge(students, midterm_results, how='inner')`.
4. An outer join: `grades = pd.merge(students, midterm_results, how='outer')`.

<div align=right><b>Answer:</b> (1)</div>
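
The exercise does not provide data for `students` and `midterm_results`, so the sketch below uses small made-up DataFrames (all names, majors, and scores are invented for illustration) to show how the left join behaves.

```python
import pandas as pd

# Hypothetical data: student 2 did not write the midterm,
# and StudentID 4 is not enrolled in the course
students = pd.DataFrame({"StudentID": [1, 2, 3],
                         "LastName": ["Ng", "Diaz", "Okafor"],
                         "FirstName": ["Ann", "Luis", "Chi"],
                         "Major": ["Math", "Physics", "Biology"]})
midterm_results = pd.DataFrame({"StudentID": [1, 3, 4],
                                "Q1": [8, 6, 9],
                                "Q2": [7, 9, 5],
                                "Q3": [10, 4, 6]})

# 'StudentID' is the only shared column, so it is used as the key by default
grades = pd.merge(students, midterm_results, how="left")
print(grades)
```

Every row of `students` is kept, the student without results gets `NaN` in `'Q1'`, `'Q2'`, and `'Q3'`, and the unmatched row from `midterm_results` is dropped.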

### 2.3 Left & right merging on multiple columns
You now have, in addition to the `revenue` and `managers` DataFrames from prior exercises, a DataFrame `sales` that summarizes units sold from specific branches (identified by `city` and `state` but not `branch_id`).

Once again, the `managers` DataFrame uses the label `branch` where the other two DataFrames use `city`. Your task here is to employ _left_ and _right_ merges to preserve data and identify where data is missing.

By merging `revenue` and `sales` with a _right_ merge, you can identify the missing `revenue` values. Here, you don't need to specify `left_on` or `right_on` because the columns to merge on have matching labels.

By merging `sales` and `managers` with a _left_ merge, you can identify the missing `manager`. Here, the columns to merge on have conflicting labels, so you must specify `left_on` and `right_on`. In both cases, you're looking to figure out how to connect the fields in rows containing `Springfield`.

### Instructions:
* Execute a right merge using `pd.merge()` with `revenue` and `sales` to yield a new DataFrame `revenue_and_sales`.
    * Use `how='right'` and `on=['city', 'state']`.
* Print the new DataFrame `revenue_and_sales`. This has been done for you.
* Execute a left merge with `sales` and `managers` to yield a new DataFrame `sales_and_managers`.
    * Use `how='left'`, `left_on=['city', 'state']`, and `right_on=['branch', 'state']`.
* Print the new DataFrame `sales_and_managers`.

In [20]:
sales = pd.DataFrame({"city": ["Mendocino", "Denver", "Austin", "Springfield", "Springfield"],
                      "state":["CA","CO","TX","MO","IL"],
                      "units": [1, 4, 2, 5, 1]})
sales

Unnamed: 0,city,state,units
0,Mendocino,CA,1
1,Denver,CO,4
2,Austin,TX,2
3,Springfield,MO,5
4,Springfield,IL,1


In [21]:
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state":["TX","CO","IL","CA"],
                        "revenue": [100, 83, 4, 200]})
revenue

Unnamed: 0,city,branch_id,state,revenue
0,Austin,10,TX,100
1,Denver,20,CO,83
2,Springfield,30,IL,4
3,Mendocino,47,CA,200


In [22]:
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                        "branch_id": [10, 20, 47, 31],
                        "state":["TX","CO","CA","MO"],
                        "manager": ["Charlers", "Joel", "Brett", "Sally"]})
managers

Unnamed: 0,branch,branch_id,state,manager
0,Austin,10,TX,Charlers
1,Denver,20,CO,Joel
2,Mendocino,47,CA,Brett
3,Springfield,31,MO,Sally


In [23]:
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])

# Print revenue_and_sales
revenue_and_sales

Unnamed: 0,city,branch_id,state,revenue,units
0,Austin,10.0,TX,100.0,2
1,Denver,20.0,CO,83.0,4
2,Springfield,30.0,IL,4.0,1
3,Mendocino,47.0,CA,200.0,1
4,Springfield,,MO,,5


In [24]:
# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
sales_and_managers

Unnamed: 0,city,state,units,branch,branch_id,manager
0,Mendocino,CA,1,Mendocino,47.0,Brett
1,Denver,CO,4,Denver,20.0,Joel
2,Austin,TX,2,Austin,10.0,Charlers
3,Springfield,MO,5,Springfield,31.0,Sally
4,Springfield,IL,1,,,
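
Instead of scanning the printed tables for gaps, the missing entries can be selected directly with `isna()`. This is a sketch built on the same data as cells `In [20]` to `In [24]`; `missing_revenue` and `missing_manager` are illustrative names.

```python
import pandas as pd

sales = pd.DataFrame({"city": ["Mendocino", "Denver", "Austin", "Springfield", "Springfield"],
                      "state": ["CA", "CO", "TX", "MO", "IL"],
                      "units": [1, 4, 2, 5, 1]})
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state": ["TX", "CO", "IL", "CA"],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "state": ["TX", "CO", "CA", "MO"],
                         "manager": ["Charlers", "Joel", "Brett", "Sally"]})

revenue_and_sales = pd.merge(revenue, sales, how="right", on=["city", "state"])
sales_and_managers = pd.merge(sales, managers, how="left",
                              left_on=["city", "state"],
                              right_on=["branch", "state"])

# Sales rows with no matching revenue record
missing_revenue = revenue_and_sales[revenue_and_sales["revenue"].isna()]
# Sales rows with no matching manager record
missing_manager = sales_and_managers[sales_and_managers["manager"].isna()]
print(missing_revenue[["city", "state"]])
print(missing_manager[["city", "state"]])
```

The two results point at different Springfield branches: the Missouri branch has no revenue record, and the Illinois branch has no manager.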


### 2.4 Merging DataFrames with outer join
This exercise picks up where the previous one left off. 

The merged DataFrames contain enough information to construct a DataFrame with 5 rows with all known information correctly aligned and each branch listed only once. You will try to merge the merged DataFrames on all matching keys (which computes an inner join by default). You can compare the result to an outer join and also to an outer join with a restricted subset of columns as keys.

### Instructions:
* Merge `sales_and_managers` with `revenue_and_sales`. Store the result as `merge_default`.
* Print `merge_default`.
* Merge `sales_and_managers` with `revenue_and_sales` using `how='outer'`. Store the result as `merge_outer`.
* Print `merge_outer`. 
* Merge `sales_and_managers` with `revenue_and_sales` only on `['city','state']` using an outer join. Store the result as `merge_outer_on`.

In [25]:
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)

# Print merge_default
merge_default

Unnamed: 0,city,state,units,branch,branch_id,manager,revenue
0,Mendocino,CA,1,Mendocino,47.0,Brett,200.0
1,Denver,CO,4,Denver,20.0,Joel,83.0
2,Austin,TX,2,Austin,10.0,Charlers,100.0


In [26]:
# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')

# Print merge_outer
merge_outer

Unnamed: 0,city,state,units,branch,branch_id,manager,revenue
0,Mendocino,CA,1,Mendocino,47.0,Brett,200.0
1,Denver,CO,4,Denver,20.0,Joel,83.0
2,Austin,TX,2,Austin,10.0,Charlers,100.0
3,Springfield,MO,5,Springfield,31.0,Sally,
4,Springfield,IL,1,,,,
5,Springfield,IL,1,,30.0,,4.0
6,Springfield,MO,5,,,,


In [27]:
# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, how='outer', on=['city','state'])

# Print merge_outer_on
merge_outer_on

Unnamed: 0,city,state,units_x,branch,branch_id_x,manager,branch_id_y,revenue,units_y
0,Mendocino,CA,1,Mendocino,47.0,Brett,47.0,200.0,1
1,Denver,CO,4,Denver,20.0,Joel,20.0,83.0,4
2,Austin,TX,2,Austin,10.0,Charlers,10.0,100.0,2
3,Springfield,MO,5,Springfield,31.0,Sally,,,5
4,Springfield,IL,1,,,,30.0,4.0,1


Notice how the default merge drops the `Springfield` rows, while the default outer merge includes them twice.
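
The duplication happens because, without an `on` argument, `pd.merge()` uses every column name the two DataFrames share as a key. The sketch below rebuilds the inputs and lists those shared columns; the name `shared` is illustrative.

```python
import pandas as pd

sales = pd.DataFrame({"city": ["Mendocino", "Denver", "Austin", "Springfield", "Springfield"],
                      "state": ["CA", "CO", "TX", "MO", "IL"],
                      "units": [1, 4, 2, 5, 1]})
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state": ["TX", "CO", "IL", "CA"],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "state": ["TX", "CO", "CA", "MO"],
                         "manager": ["Charlers", "Joel", "Brett", "Sally"]})

revenue_and_sales = pd.merge(revenue, sales, how="right", on=["city", "state"])
sales_and_managers = pd.merge(sales, managers, how="left",
                              left_on=["city", "state"],
                              right_on=["branch", "state"])

# Columns used as keys when no 'on' argument is given
shared = [c for c in sales_and_managers.columns if c in revenue_and_sales.columns]
print(shared)

merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how="outer")
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales,
                          how="outer", on=["city", "state"])
print(len(merge_outer), len(merge_outer_on))
```

Since `branch_id` is among the shared keys and each Springfield row has a missing `branch_id` on one side and a real value on the other, the rows never match. Restricting the keys to `['city', 'state']` lets them align, giving one row per branch.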

# 3. Ordered merges
### 3.1 Using merge_ordered()
This exercise uses pre-loaded DataFrames `austin` and `houston` that contain weather data from the cities Austin and Houston respectively. They have been printed in the IPython Shell for you to examine.

Weather conditions were recorded on separate days and you need to merge these two DataFrames together such that the dates are ordered. To do this, you'll use `pd.merge_ordered()`. After you're done, note the order of the rows before and after merging.

### Instructions:
* Perform an ordered merge on `austin` and `houston` using `pd.merge_ordered()`. Store the result as `tx_weather`.
* Print `tx_weather`. You should notice that the rows are sorted by the date but it is not possible to tell which observation came from which city.
* Perform another ordered merge on `austin` and `houston`.
    * This time, specify the keyword arguments `on='date'` and `suffixes=['_aus','_hus']` so that the rows can be distinguished. Store the result as `tx_weather_suff`.
* Print `tx_weather_suff` to examine its contents. 
* Perform a third ordered merge on `austin` and `houston`.
    * This time, in addition to the `on` and `suffixes` parameters, specify the keyword argument `fill_method='ffill'` to use _forward-filling_ to replace `NaN` entries with the most recent non-null entry.

In [32]:
austin = pd.DataFrame({"date": pd.to_datetime(["2016-01-01", "2016-02-08", "2016-01-17"]),
                       "ratings": ["Cloudy", "Cloudy", "Sunny"]})
austin

Unnamed: 0,date,ratings
0,2016-01-01,Cloudy
1,2016-02-08,Cloudy
2,2016-01-17,Sunny


In [33]:
houston = pd.DataFrame(
    {"date": pd.to_datetime(["2016-01-04", "2016-01-01", "2016-03-01"]),
     "ratings": ["Rainy", "Cloudy", "Sunny"]})
houston

Unnamed: 0,date,ratings
0,2016-01-04,Rainy
1,2016-01-01,Cloudy
2,2016-03-01,Sunny
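
The three merges described in the instructions can be sketched as follows, using the `austin` and `houston` DataFrames defined above. The name `tx_weather_fill` for the third result is an assumption, since the instructions do not name it.

```python
import pandas as pd

austin = pd.DataFrame({"date": pd.to_datetime(["2016-01-01", "2016-02-08", "2016-01-17"]),
                       "ratings": ["Cloudy", "Cloudy", "Sunny"]})
houston = pd.DataFrame({"date": pd.to_datetime(["2016-01-04", "2016-01-01", "2016-03-01"]),
                        "ratings": ["Rainy", "Cloudy", "Sunny"]})

# 1. Ordered merge on all shared columns ('date' and 'ratings'):
#    rows come out sorted, but the city of origin is lost
tx_weather = pd.merge_ordered(austin, houston)
print(tx_weather)

# 2. Merge on 'date' only, with suffixes to tell the two cities apart
tx_weather_suff = pd.merge_ordered(austin, houston, on="date",
                                   suffixes=("_aus", "_hus"))
print(tx_weather_suff)

# 3. Same as above, forward-filling NaN with the most recent non-null entry
#    (tx_weather_fill is a hypothetical name)
tx_weather_fill = pd.merge_ordered(austin, houston, on="date",
                                   suffixes=("_aus", "_hus"),
                                   fill_method="ffill")
print(tx_weather_fill)
```

Note that the input rows are not in date order, while each merged result is sorted by `date`.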
