# Merges and matching
* You might have noticed in the previous example that there is more we might want to do: the two files concern the same entities!

This workbook is based upon [the merge section of *10 minutes to pandas*](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#merge) in the Pandas documentation. 

# Data lakes
* A *data lake* is a large set of files with some columns that match. 
* The *merging problem* is to combine data from two or more files to get more complete information about an entity. 
* Here is a compelling example. 
* *The key is to create columns you can match.* 

*In the following, I have shortened printing of large tables. You can remove the shortening as needed.*

Let's load our first file from http://data.gov: population of towns in Connecticut. 

In [None]:
import pandas as pd
population = pd.read_csv('2010_Population_By_Town.csv')
population.head()

This is a table of populations for each town in Connecticut. We also have the following table: 

In [None]:
tax = pd.read_csv('2012_Retail_Sales_By_Town_ALL_NAICS.csv', engine='python', skipfooter=8)
tax.head()

This is a table of sales tax for the same towns. 

# The key observation

* These tables are both about the same towns, but 
* They represent the towns differently. 
* To combine them into one table, we need a common description of the town 

# What's different?
* case of town name. 
* extra stuff in town name. 

# The key strategy
* turn them both into something that matches. 
* lowercase the unadorned name. 
* lowercase and trim the adorned name. 
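
Before touching the real files, here is a minimal sketch of the strategy on made-up names. The town strings and the helper `normalize` are hypothetical, chosen only to show that an unadorned name and an adorned name can be reduced to the same key.

```python
import pandas as pd

# Hypothetical town names, made up for illustration (not read from the real files).
plain = pd.Series(['ANDOVER', 'Ansonia'])
adorned = pd.Series(['Andover (Town) ', 'Ansonia (City)'])

def normalize(names: pd.Series) -> pd.Series:
    """Keep the text before the first '(', trim spaces, and lowercase it."""
    return names.str.split('(').str[0].str.strip().str.lower()

plain_key = normalize(plain)
adorned_key = normalize(adorned)
print(plain_key.tolist())
print(adorned_key.tolist())
```

Both series end up with identical keys, which is all a merge needs.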

# Step 1: add lowercase names to `population`

In [None]:
lowercase = population.TOWN.str.lower()
lowercase.head()

In [None]:
population['lowercase'] = lowercase
population.head()

# Step 2: transform the name in `tax`
* `tax.Municipality`: the adorned name. 
* `.str.split('(')`: split at '(' character. 
* `[0]`: select first part of split. 
* `.str.strip()`: remove spaces from both sides. 
* `.str.lower()`: lowercase the result. 

In [None]:
lowercase = tax.Municipality.str.split('(', expand=True)[0].str.strip().str.lower()
lowercase.head()


In [None]:
tax['lowercase'] = lowercase
tax.head()

# Step 3: merge on now-common column 'lowercase': 
* `left`, `right`: the two source `DataFrame`s; columns from `left` come first in the merged `DataFrame`. 
* `how='outer'`: keep records that don't match, with missing values for the columns from the other table. 
* `on=`: the column to match on.

In [None]:
both = pd.merge(left=population, right=tax, how='outer', on='lowercase')
both.head()
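
After an outer merge it helps to know which rows failed to find a partner. One way to check, sketched here on toy tables with made-up values, is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The frames `left` and `right` below are stand-ins, not the real `population` and `tax` tables.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are made up for illustration.
left = pd.DataFrame({'lowercase': ['andover', 'ansonia', 'ashford'],
                     'population': [100, 200, 300]})
right = pd.DataFrame({'lowercase': ['andover', 'ansonia', 'avon'],
                      'sales': [10.0, 20.0, 30.0]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
checked = pd.merge(left, right, how='outer', on='lowercase', indicator=True)

# Rows that found no partner in the other table.
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['lowercase', '_merge']])
```

Unmatched rows are often a sign that the key columns still differ in some small way.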

# A few notes
* We could have used most any mechanism to make columns the same. 
* E.g., uppercase rather than lowercase. 
* The important thing is that they are exactly the same in format. 
* If the matching columns are the indexes of both tables, so much the better. This can improve performance. 
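
As a rough sketch of the index-based variant, assuming toy tables with made-up values, we can move the key into the index with `set_index` and then combine with `DataFrame.join`. This shows only the mechanics; it makes no measurement of speed.

```python
import pandas as pd

# Toy tables with made-up values; 'lowercase' plays the role of the common key.
left = pd.DataFrame({'lowercase': ['andover', 'ansonia'], 'population': [100, 200]})
right = pd.DataFrame({'lowercase': ['ansonia', 'andover'], 'sales': [20.0, 10.0]})

# Move the key column into the index of each table.
left_indexed = left.set_index('lowercase')
right_indexed = right.set_index('lowercase')

# join() matches rows on the index by default.
joined = left_indexed.join(right_indexed, how='outer')
print(joined)
```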

# The data merging problem
* https://data.gov is a huge "data lake" of CSV files. 
* Many of them describe the same entities. 
* But they may depict the entities differently. Substantial creativity may be necessary to collect all the data for each entity (in this case, towns in Connecticut). 

# Data fusion
* More generally, there is a problem of *Data Fusion* that goes beyond mere tables.
* Entities can be geospatial, i.e., on a map. 
* Entities can overlap. 
* Data may only be measured for part of an entity, e.g., a county of a state. 
* Data may not be commensurate for the same entity. 

# Often, data fusion is more difficult than the analysis that follows. 
* Column names are synonyms, or missing. 
* Some columns that are named the same contain different data.
* The US 'Open Data Initiative' says that data has to be available, but *does not specify its format or metadata format.* 
* The EU version is even less specific: *it doesn't even specify that data should be machine readable!*
* The [Research Data Alliance](https://rd-alliance.org/) is trying to do something about this by defining metadata and structural standards for CSV data. 
* My own project [HydroShare](https://www.hydroshare.org) goes much further, and is compliant with rather rigorous metadata and discoverability standards defined by the [DataONE initiative](https://dataone.org). 

# What is `how`? 
* `how`: the join type.
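* `'outer'`: keep all rows from both tables, even those without matches. 
* `'inner'`: keep only rows that match in both tables. 
* `'left'`: keep all rows from the left table, matched or not; drop unmatched rows from the right. 
* `'right'`: keep all rows from the right table, matched or not; drop unmatched rows from the left. 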
* These names are consistent with names of "join types" in database theory. 
* In fact, that's what we're doing. 
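
To see the four join types side by side, here is a small sketch on two made-up tables that share only the keys `'b'` and `'c'`. The frames and column names are hypothetical.

```python
import pandas as pd

# Two toy tables; only keys 'b' and 'c' appear in both.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_value': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_value': [20, 30, 40]})

# Record which keys survive under each join type.
keys_by_how = {}
for how in ['outer', 'inner', 'left', 'right']:
    merged = pd.merge(left, right, how=how, on='key')
    keys_by_how[how] = sorted(merged['key'])
    print(how, keys_by_how[how])
```

The outer join keeps every key, the inner join keeps only the shared ones, and the left and right joins keep every key from their own side.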

# Let's put this into practice. 

Let's register you for grading purposes. 

In [None]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('03-05-data-manipulation.ok')
ok.auth(inline=True)

Then let's create some `DataFrame`s to play with. 

In [None]:
phones = pd.DataFrame({
    'name': ['Mark', 'Anne', 'Frank', 'Lisa'],
    'phone': ['7815551212', 'unlisted', '4035551212', '9195551212']  # NB
})
print("phones:")
print(phones)

addresses = pd.DataFrame({
    'name': ['Frank', 'Anne', 'Mark', 'Samantha'],
    'city': ['Boston', 'Austin', 'Boston', 'Los Angeles'],
    'state': ['MA', 'TX', 'MA', 'CA']
})
print('\naddresses:')
print(addresses)

pets = pd.DataFrame({
    'name': ['Garfield', 'Snoopy', 'Brrf', 'Bill'],
    'type': ['cat', 'dog', 'dog', 'cat'],
    'owner': ['Frank', 'Lisa', 'Samantha', 'Lisa']
})
print('\npets:')
print(pets)

hobbies = pd.DataFrame({
    'name':['Frank', 'Frank', 'George', 'Anne', 'Mark', 'Mark', 'Samantha'], 
    'hobby':['cycling', 'astronomy', 'knitting', 'tennis', 'cycling', 'rock climbing', 'astronomy'],
})
print('\nhobbies:')
print(hobbies)

(NB: all the phone numbers are directory assistance. I have learned, in the past, that in any public set of notes, some idiot will actually call the numbers I specify!)

1. Make up a regular address book `book` by combining `addresses` and `phones`. We want an entry even if we don't have a phone or an address for a person. 

In [None]:
# Your answer: 
book = ...
book

In [None]:
_ = ok.grade('q01')  # run to check your answer

2. Merge `pets` with `book` to get an address book `owners` for the `pets`. In this case we leave out an entry if there isn't a pet there. Hint: since column names aren't the same, use `left_on=` and `right_on=`. See [full merge documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) for details.
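
As an illustration of the hint, here is a sketch of `left_on=` and `right_on=` on two hypothetical tables that have nothing to do with the exercise data. The key is called `author` in one table and `surname` in the other.

```python
import pandas as pd

# Hypothetical tables, unrelated to the exercise data.
books = pd.DataFrame({'title': ['Dune', 'Emma'], 'author': ['Herbert', 'Austen']})
authors = pd.DataFrame({'surname': ['Austen', 'Herbert', 'Tolstoy'],
                        'born': [1775, 1920, 1828]})

# Match books.author against authors.surname; both key columns are kept.
combined = pd.merge(books, authors, how='inner',
                    left_on='author', right_on='surname')
print(combined)
```

Note that the result keeps both key columns, since they have different names.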

In [None]:
# Your answer:
owners = ...
owners

In [None]:
_ = ok.grade('q02')  # run to check your answer

3. Figure out which people share the same hobby by creating a merge `common` of `hobbies` with itself. Then eliminate rows with the same name for both people via row selection. Also eliminate duplicates by ensuring that names are in alphabetical order from left to right. 

In [None]:
# Your answer: 
common = ...
common

In [None]:
_ = ok.grade('q03')  # run to check your answer

4. For people in `common`, join with `addresses` and select those that live in the same state. Put the result into `pairs`.

In [None]:
# Your answer: 
pairs = ...
pairs

In [None]:
_ = ok.grade('q04')  # run to check your answer

5. Delete extra columns from `pairs` to leave `name_x`, `hobby`, and `name_y`. 
Put the result in `possible`. Hint: google this.

In [None]:
# Your answer: 
possible = ...
possible

In [None]:
_ = ok.grade('q05')  # run to check your answer

# When you are done with this notebook, 

* Save and checkpoint. 
* Ensure that the name of this file is precisely `03-05-data-manipulation.ipynb`. 
* <del>Change `ready` to `True` in the cell below. </del>
* <del>Run the cell below to submit your work for grading. </del>

* If your Jupyter installation can download the notebook as a PDF,
    * (File >> Download as >> PDF via LaTeX (.pdf)), 
    * Rename the downloaded file to `<loginid>-03-05-data-manipulation.pdf`. In other words, my filename would be `jsingh11-03-05-data-manipulation.pdf`.
    * Submit the file `<loginid>-03-05-data-manipulation.pdf` to Canvas.
* Otherwise 
    * (File >> Download as >> Notebook (.ipynb)),
    * Rename the downloaded file to `<loginid>-03-05-data-manipulation.ipynb`. In other words, my filename would be `jsingh11-03-05-data-manipulation.ipynb`.
    * Submit the file `<loginid>-03-05-data-manipulation.ipynb` to Canvas.