# Intro

Everyone in this room can be conceptualized as a series of records that have existed, exist now, or will exist.
- We begin with a birth certificate ...
- ... and end with a death certificate.
- In between, there will be medical records, school records, marriage records, bank records, arrest records, etc.

Imagine what we could learn about ourselves by integrating all of that information, from all of those different sources, into one single, cohesive story.

## Quick example

- Public health example: Joining hospital records to birth certificates.
    - What problems would occur?
        1. People change
            - Names, addresses, ...
        2. People make mistakes
            - Typos, spelling errors, nicknames, abbreviations, ...
        3. People lie
            - Age, weight, neighborhood, ...
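
To make the "people make mistakes" problem concrete, here is a minimal sketch using Python's standard-library `difflib`. The names are invented for illustration, and `SequenceMatcher` is just one convenient similarity measure, not necessarily the one a production RL system would use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Hypothetical name pairs, made up for this example.
pairs = [
    ("Katherine Johnson", "Katherine Johnson"),  # identical
    ("Katherine Johnson", "Kathrine Jonson"),    # typos
    ("Katherine Johnson", "Kate Johnson"),       # nickname
    ("Katherine Johnson", "Robert Smith"),       # different person
]

for a, b in pairs:
    # An exact comparison only succeeds for the first pair;
    # the similarity score degrades more gracefully.
    print(f"{a!r} vs {b!r}: exact={a == b}, similarity={similarity(a, b):.2f}")
```

An exact string comparison treats the typo and the nickname the same as a completely different person, whereas a similarity score at least ranks them as closer.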

## Additional example areas
Data matching is not new -- well before computers, we needed to match records belonging to the same individual.
- National census
    - Governments around the world rely on census data to allocate resources appropriately.
    - Record linkage (RL) plays an important role in improving the quality and accuracy of census data.
    - The U.S. Census Bureau has played a major role in the development of RL techniques for several decades.

<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/85/Seal_of_the_United_States_Census_Bureau.svg/200px-Seal_of_the_United_States_Census_Bureau.svg.png" alt="Census Bureau seal" height="140" width="140"></p>

- Medicine and public health 
    - (historically referred to as _medical_ record linkage)
    - Simply consider all of the doctors, hospitals, insurance companies, and pharmacies you've interacted with and it becomes obvious why medical records are another major RL application area.
    - In addition, _longitudinally-matched records_ can provide novel insights into health outcomes, as in the example given previously.

- Customer records
    - To target their customers effectively, businesses need to minimize the redundancy that builds up as customers change name, address, etc.
    - This requires periodically removing redundant records, so that the business keeps an accurate picture of its customer base (often a main source of revenue).

- Genealogy
    - Given that more than 10% of men and women were named 'John' and 'Mary' in nineteenth-century England, it becomes obvious why RL is an invaluable tool for genealogical databases, which now support a billion-dollar industry.
    - [LDS have spent a lot of $$$ and published a few papers on RL]
    
<p><img src="https://upload.wikimedia.org/wikipedia/commons/a/a8/1900_census_Kershaw_Lindauer.gif" alt="1900 census Kershaw Lindauer.gif" height="480" width="456">

<br>By 1900 US Census, Public Domain, <a href="https://commons.wikimedia.org/w/index.php?curid=11768459">Link</a></p>

## Why do I care?

- As a society, we are producing more data than ever before. In order to make use of it, we need intelligent solutions to integrate data from disparate sources.
- Such tools play an important role in both data mining _and_ data warehousing -- using RL, we can not only improve the quality (and statistical power) of our data, but also reveal relationships not contained within any single database.

# Overview

## Challenges
### Missing unique identifiers
### Computational complexity
### Lack of training data
### Privacy

## Classic record linkage approach
- First, I want to give you a snapshot of the classic record linkage approach, which includes:
### Pre-processing (normalization of undesired variation)
### Indexing (blocking)
### Comparison and classification
### Evaluation
    
## Advancements (rename)
### HMM for pre-processing(?)
### Complex features
#### NLP
### Neural networks, etc.

# Classic record linkage approach
I will use 'record linkage' to refer to the matching of records across two (or more) databases.
- This can also include the special case of _'de-duplication'_, which simply involves using the same approach* to find duplicate records in _the same_ database.

<div style="text-align: right">*De-duplication can sometimes involve matching more than 2 records within a database.</div>

Most commonly, each record refers to a real-life person*.
- Customers in a business database
- Constituents in a government database
- Patients in a hospital database
<div style="text-align: right">*Sometimes the entity to be matched is a business, or some other object.</div>

# Challenges

## In all of these cases, the challenge that we have to overcome is the lack of a unique identifier for the entities we are matching.
- For example, if we had perfectly accurate social security numbers for each record, the task would reduce to a straightforward join of two databases.
- This is often not the case for multiple reasons:
    1. accurate record keeping is hard
    2. privacy is usually a concern (in some countries use of such identifiers is illegal).
- As such, in order to match records across databases, we must use common attributes shared by both databases.
    - e.g. Name, address, phone number, age
- The quality of data points such as these is notoriously low, for the reasons described earlier.
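
As a rough illustration of why an exact join is not enough, the sketch below joins two tiny, made-up tables on an identifier. All records, field names, and the `exact_join` helper are hypothetical; the point is only that a single transposed digit or misspelled name silently drops a true match.

```python
# Hypothetical records, invented for this example.
hospital = [
    {"ssn": "123-45-6789", "name": "Mary Smith", "diagnosis": "asthma"},
    {"ssn": "987-65-4321", "name": "John Doe", "diagnosis": "fracture"},
]
births = [
    {"ssn": "123-45-6789", "name": "Mary Smith", "birth_year": 1990},
    # Same person as "John Doe" above, but with two transposed digits
    # in the identifier and a misspelled first name.
    {"ssn": "987-65-4312", "name": "Jon Doe", "birth_year": 1985},
]


def exact_join(left, right, key):
    """Pair up rows whose `key` values are exactly equal."""
    index = {row[key]: row for row in right}
    return [(row, index[row[key]]) for row in left if row[key] in index]


matches_by_ssn = exact_join(hospital, births, "ssn")
matches_by_name = exact_join(hospital, births, "name")

# Both joins find Mary Smith but miss John Doe.
print([left["name"] for left, _ in matches_by_ssn])
print([left["name"] for left, _ in matches_by_name])
```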

## Computational complexity
- As a naive approach, one might try comparing each record in one database to each record in the other, to determine whether each pair under consideration might be a match.
- The number of comparisons in such an approach, however, is the product of the two database sizes, so it grows quadratically ($O(N^2)$) when both databases hold roughly $N$ records.
- As we'll see, some nice tricks exist to reduce the size of the problem substantially.
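
To give a feel for the numbers, here is a small sketch comparing the naive all-pairs approach with a simple form of blocking (the indexing step listed in the overview). The surnames and the choice of "first letter of the surname" as the blocking key are assumptions made purely for illustration; real blocking keys are usually chosen more carefully.

```python
from collections import defaultdict
from itertools import product

# Hypothetical surnames from two databases.
db_a = ["Smith", "Smyth", "Jones", "Johnson", "Brown"]
db_b = ["Smith", "Jonas", "Johnston", "Braun", "Taylor"]

# Naive approach: compare every record in A with every record in B.
naive_pairs = list(product(db_a, db_b))


def block_key(surname: str) -> str:
    """Assumed blocking key: the first letter of the surname."""
    return surname[0].upper()


def blocked_pairs(a, b):
    """Only generate candidate pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record in b:
        blocks[block_key(record)].append(record)
    return [(x, y) for x in a for y in blocks.get(block_key(x), [])]


candidate_pairs = blocked_pairs(db_a, db_b)
print(f"naive comparisons:   {len(naive_pairs)}")
print(f"blocked comparisons: {len(candidate_pairs)}")
```

The trade-off is that any true match whose records fall into different blocks (say, a typo in the first letter) is never compared at all.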

## Lack of training labels
- In the typical (supervised) machine learning approach, labeled training data is used as feedback by a statistical model during the process of training. 
- In some cases, there is no training data that tells us whether two records correspond to the same individual.
- This can make the evaluation of the model's matches especially challenging.

## Privacy
- Given that these records often contain sensitive personal information (such as medical or employment records), special attention must be paid to preserving privacy via _'de-identification'_.
- This is especially important for academic or medical researchers working with HIPAA-protected datasets.
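
One simple way to picture de-identification is to replace an identifying field with a keyed hash before the data leaves its source. The sketch below is only an illustration under that assumption: the key, the normalization, and the `pseudonymize` helper are hypothetical, and this alone should not be read as sufficient for HIPAA compliance.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored and shared securely.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifying string with a keyed SHA-256 digest (HMAC)."""
    # Light normalization so that trivial case/whitespace differences
    # do not produce different pseudonyms.
    normalized = " ".join(value.lower().split())
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# The same person hashes to the same token, so exact matching still works ...
print(pseudonymize("Mary  Smith") == pseudonymize("mary smith"))
# ... but a one-letter typo yields an unrelated token.
print(pseudonymize("Mary Smith") == pseudonymize("Mary Smyth"))
```

Note the tension with the earlier challenges: once values are hashed, the typo-tolerant comparisons that RL relies on no longer work directly on those fields.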

# Classic record linkage approach

There are two main approaches to matching two records:
## Deterministic
## Probabilistic