# Merging data
Here, you'll learn all about merging pandas DataFrames. You'll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You'll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.

# 1. Merging DataFrames
### 1.1 Merging company DataFrames
Suppose your company has operations in several different cities under several different managers. The DataFrames `revenue` and `managers` contain partial information related to the company. That is, the rows of the `city` columns don't quite match in `revenue` and `managers` (the Mendocino branch has no revenue yet since it just opened, and the manager of the Springfield branch recently left the company).

The DataFrames have been printed in the IPython Shell. If you were to run the command `combined = pd.merge(revenue, managers, on='city')`, how many rows would `combined` have?

In [1]:
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield"], "revenue": [100, 83, 4]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino"], "manager": ["Charlers", "Joel", "Brett"]})

print(revenue)
print("")
print(managers)

          city  revenue
0       Austin      100
1       Denver       83
2  Springfield        4

        city   manager
0     Austin  Charlers
1     Denver      Joel
2  Mendocino     Brett


#### Possible Answers
1. 0 rows.
2. 2 rows.
3. 3 rows.
4. 4 rows.

#### Answer:
Since the default strategy for `pd.merge()` is an _inner join_, only cities present in both DataFrames are kept. Austin and Denver are the only such cities, so `combined` will have 2 rows.
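
To see how the join type changes the answer, the sketch below rebuilds the two DataFrames from this exercise and counts the rows produced by each value of the `how` parameter. The loop and the `row_counts` name are illustrative additions, not part of the original exercise.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield"],
                        "revenue": [100, 83, 4]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino"],
                         "manager": ["Charlers", "Joel", "Brett"]})

# Count the rows produced by each join type (row_counts is an illustrative name)
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(revenue, managers, on="city", how=how)
    row_counts[how] = len(merged)

print(row_counts)
# {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

The inner join keeps only the two shared cities, the left and right joins keep every row of one side, and the outer join keeps all four distinct cities.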

### 1.2 Merging on a specific column
This exercise follows on from the last one, using the DataFrames `revenue` and `managers` for your company. You expect your company to grow and, eventually, to operate in cities with the same name in different states. As such, you decide that every branch should have a numerical branch identifier. Thus, you add a `branch_id` column to both DataFrames. New cities have also been added to both the `revenue` and `managers` DataFrames.

At present, there should be a 1-to-1 relationship between the `city` and `branch_id` fields. In that case, the result of a merge on the `city` columns ought to give you the same output as a merge on the `branch_id` columns. Do they? Can you spot an ambiguity in one of the DataFrames?

### Instructions:
* Using `pd.merge()`, merge the DataFrames `revenue` and `managers` on the `'city'` column of each. Store the result as `merge_by_city`.
* Print the DataFrame `merge_by_city`. 
* Merge the DataFrames `revenue` and `managers` on the `'branch_id'` column of each. Store the result as `merge_by_id`.
* Print the DataFrame `merge_by_id`.

In [4]:
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino", "Springfield"],
                        "branch_id": [10, 20, 47, 31],
                        "manager": ["Charles", "Joel", "Brett", "Sally"]})

In [5]:
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')

# Print merge_by_city
print(merge_by_city)

          city  branch_id_x  revenue  branch_id_y  manager
0       Austin           10      100           10  Charles
1       Denver           20       83           20     Joel
2  Springfield           30        4           31    Sally
3    Mendocino           47      200           47    Brett


In [6]:
# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')

# Print merge_by_id
print(merge_by_id)

      city_x  branch_id  revenue     city_y  manager
0     Austin         10      100     Austin  Charles
1     Denver         20       83     Denver     Joel
2  Mendocino         47      200  Mendocino    Brett


Notice that when you merge on `'city'`, the resulting DataFrame has a peculiar result: In row 2, the city Springfield has two different branch IDs. This is because there are actually two different cities named Springfield - one in the State of Illinois, and the other in Missouri. The `revenue` DataFrame has the one from Illinois, and the `managers` DataFrame has the one from Missouri. Consequently, when you merge on `'branch_id'`, both of these get dropped from the merged DataFrame.
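
One way to surface this kind of ambiguity programmatically is to merge on `'city'` and then filter for rows where the two branch IDs disagree. The following is a minimal sketch; the suffix names `_rev` and `_mgr` and the variable `conflicts` are assumptions chosen for readability, not part of the exercise.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "manager": ["Charles", "Joel", "Brett", "Sally"]})

# Use descriptive suffixes instead of the default _x / _y
merge_by_city = pd.merge(revenue, managers, on="city",
                         suffixes=("_rev", "_mgr"))

# Keep only rows where the two sources disagree on the branch ID
mismatch = merge_by_city["branch_id_rev"] != merge_by_city["branch_id_mgr"]
conflicts = merge_by_city[mismatch]
print(conflicts)
```

Here `conflicts` contains only the Springfield row, which is the one that the merge on `'branch_id'` drops.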

### 1.3 Merging on columns with non-matching labels
You continue working with the `revenue` and `managers` DataFrames from before. This time, someone has changed the field name `'city'` to `'branch'` in the `managers` table. Now, when you attempt to merge the DataFrames, an exception is raised:
```
>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
    ... <text deleted> ...
    pd.merge(revenue, managers, on='city')
    ... <text deleted> ...
KeyError: 'city'
```
Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the `left_on` and `right_on` parameters in the call to `pd.merge()`.

Are you able to merge better than in the last exercise? How should the rows with `Springfield` be handled?

### Instructions:
* Merge the DataFrames `revenue` and `managers` into a single DataFrame called `combined` using the `'city'` and `'branch'` columns from the appropriate DataFrames.
    * In your call to `pd.merge()`, you will have to specify the parameters `left_on` and `right_on` appropriately.
* Print the new DataFrame `combined`.

In [10]:
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state":["TX","CO","IL","CA"],
                        "revenue": [100, 83, 4, 200]})

revenue

          city  branch_id  state  revenue
0       Austin         10     TX      100
1       Denver         20     CO       83
2  Springfield         30     IL        4
3    Mendocino         47     CA      200


In [11]:
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                        "branch_id": [10, 20, 47, 31],
                        "state":["TX","CO","CA","MO"],
                        "manager": ["Charlers", "Joel", "Brett", "Sally"]})
                        
managers

        branch  branch_id  state   manager
0       Austin         10     TX  Charlers
1       Denver         20     CO      Joel
2    Mendocino         47     CA     Brett
3  Springfield         31     MO     Sally


In [9]:
# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on='city', right_on='branch')

# Print combined
combined

          city  branch_id_x  state_x  revenue       branch  branch_id_y  state_y   manager
0       Austin           10       TX      100       Austin           10       TX  Charlers
1       Denver           20       CO       83       Denver           20       CO      Joel
2  Springfield           30       IL        4  Springfield           31       MO     Sally
3    Mendocino           47       CA      200    Mendocino           47       CA     Brett


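It is important to pay attention to how columns are named in different DataFrames. Note also that merging on the city name alone still pairs the Illinois Springfield from `revenue` with the Missouri Springfield from `managers`, as the differing `branch_id` and `state` values in row 2 show.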
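
An alternative to `left_on` and `right_on`, sketched here as an assumption rather than the exercise's intended solution, is to rename the mismatched column on a copy before merging. This avoids carrying both `city` and `branch` in the result. The name `managers_renamed` is illustrative.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "state": ["TX", "CO", "IL", "CA"],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"branch": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "state": ["TX", "CO", "CA", "MO"],
                         "manager": ["Charlers", "Joel", "Brett", "Sally"]})

# rename() returns a new DataFrame, so the original managers is left unchanged
managers_renamed = managers.rename(columns={"branch": "city"})

# Both frames now share the key name, so a plain on='city' works
combined = pd.merge(revenue, managers_renamed, on="city")
print(combined.columns.tolist())
```

The merged result has a single `city` key column, while `branch_id` and `state` still receive the `_x` and `_y` suffixes because they appear in both DataFrames but are not merge keys.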

### 1.4 Merging on multiple columns
Another strategy to disambiguate cities with identical names is to add information on the _states_ in which the cities are located. To this end, you add a column called `state` to both DataFrames from the preceding exercises.

Your goal in this exercise is to use `pd.merge()` to merge DataFrames using multiple columns (using `'branch_id'`, `'city'`, and `'state'` in this case).

Are you able to match all your company's branches correctly?

### Instructions:
* Create a column called `'state'` in the DataFrame `revenue`, consisting of the list `['TX','CO','IL','CA']`.
* Create a column called `'state'` in the DataFrame `managers`, consisting of the list `['TX','CO','CA','MO']`.
* Merge the DataFrames `revenue` and `managers` using _three_ columns: `'branch_id'`, `'city'`, and `'state'`. Pass them in as a list to the `on` parameter of `pd.merge()`.

In [17]:
revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "revenue": [100, 83, 4, 200]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino", "Springfield"],
                        "branch_id": [10, 20, 47, 31],
                        "manager": ["Charlers", "Joel", "Brett", "Sally"]})

In [18]:
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']
revenue

          city  branch_id  revenue  state
0       Austin         10      100     TX
1       Denver         20       83     CO
2  Springfield         30        4     IL
3    Mendocino         47      200     CA


In [20]:
# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']
managers

          city  branch_id   manager  state
0       Austin         10  Charlers     TX
1       Denver         20      Joel     CO
2    Mendocino         47     Brett     CA
3  Springfield         31     Sally     MO


In [21]:
# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])

# Print combined
combined

        city  branch_id  revenue  state   manager
0     Austin         10      100     TX  Charlers
1     Denver         20       83     CO      Joel
2  Mendocino         47      200     CA     Brett
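
The inner merge on three columns silently drops both Springfield rows. To check which rows failed to match, one option is an outer merge with `indicator=True`, which adds a `_merge` column recording the origin of each row. This is a sketch for inspection only; the names `audit` and `unmatched` are illustrative.

```python
import pandas as pd

revenue = pd.DataFrame({"city": ["Austin", "Denver", "Springfield", "Mendocino"],
                        "branch_id": [10, 20, 30, 47],
                        "revenue": [100, 83, 4, 200],
                        "state": ["TX", "CO", "IL", "CA"]})
managers = pd.DataFrame({"city": ["Austin", "Denver", "Mendocino", "Springfield"],
                         "branch_id": [10, 20, 47, 31],
                         "manager": ["Charlers", "Joel", "Brett", "Sally"],
                         "state": ["TX", "CO", "CA", "MO"]})

# Outer merge keeps every row; _merge is 'both', 'left_only', or 'right_only'
audit = pd.merge(revenue, managers,
                 on=["branch_id", "city", "state"],
                 how="outer", indicator=True)

# Rows that appear in only one of the two DataFrames
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["city", "branch_id", "state", "_merge"]])
```

Both unmatched rows are Springfield branches: the Illinois one exists only in `revenue` and the Missouri one only in `managers`, which confirms they are distinct branches rather than a data-entry mismatch in the keys.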
