## Splink data linking demo (link only)

In this demo we link two small datasets.  

We assume we have a list of people in one table whom we want to find in a larger table.  Because of transcription or other errors, there will often not be an exact match.

The larger table contains duplicates, but in this notebook we use the `link_only` setting, so `splink` makes no attempt to deduplicate these records.    Note it is possible to simultaneously link and dedupe using the `link_and_dedupe` setting.

**Important** Where deduplication is not required, `link_only` can provide an important performance boost by dramatically reducing the number of records which need to be compared.

For example, if you wanted to link 10 records to 1,000, then the maximum number of comparisons that need to be made (i.e. with no blocking rules) is 10 × 1,000 = 10,000.  If you need to dedupe as well, all n = 1,010 records are compared with each other, so that number would be n(n-1)/2 = 509,545.
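The arithmetic above can be checked with a short sketch. The function names are illustrative only and are not part of `splink`; the dedupe figure pools both tables, so n is 10 + 1,000 = 1,010.

```python
def link_only_comparisons(n_left: int, n_right: int) -> int:
    """Pairs formed by comparing every left record with every right record."""
    return n_left * n_right


def link_and_dedupe_comparisons(n_left: int, n_right: int) -> int:
    """Unordered pairs among all records pooled together: n(n-1)/2."""
    n = n_left + n_right
    return n * (n - 1) // 2


print(link_only_comparisons(10, 1000))        # 10000
print(link_and_dedupe_comparisons(10, 1000))  # 509545
```

The gap widens quickly as the larger table grows, which is why `link_only` is worth using whenever deduplication is not needed.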

I print the output at each stage using `spark_dataframe.show()`.  This is for instructional purposes only - it degrades performance and shouldn't be used in a production setting.

## Step 1:  Imports and setup

The following is just boilerplate code that sets up the Spark session and sets some other non-essential configuration options.

In [16]:
!pip3 install --upgrade splink

Collecting splink
  Downloading splink-1.0.5-py3-none-any.whl (346 kB)
     |████████████████████████████████| 346 kB 24.4 MB/s eta 0:00:01
Installing collected packages: splink
  Attempting uninstall: splink
    Found existing installation: splink 1.0.4
    Uninstalling splink-1.0.4:
      Successfully uninstalled splink-1.0.4
Successfully installed splink-1.0.5
You should consider upgrading via the '/usr/local/bin/python3.6 -m pip install --upgrade pip' command.


In [17]:
from pyspark.context import SparkContext, SparkConf
from pyspark.sql import SparkSession, Window
from pyspark.sql.types import StructType
import pyspark.sql.functions as f

In [18]:
import os
import time
import json
import requests
import xml.etree.ElementTree as ET
import datetime

#Extracting the correct URL from hive-site.xml
tree = ET.parse('/etc/hadoop/conf/hive-site.xml')
root = tree.getroot()

for prop in root.findall('property'):
    if prop.find('name').text == "hive.metastore.warehouse.dir":
        storage = prop.find('value').text.split("/")[0] + "//" + prop.find('value').text.split("/")[2]

print("The correct Cloud Storage URL is: {}".format(storage))

os.environ['STORAGE'] = storage

The correct Cloud Storage URL is: s3a://demo-aws-2
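The extraction step above keeps only the scheme and bucket of the Hive warehouse directory. A minimal standalone sketch of the same idea follows, using `urllib.parse` instead of string splitting. The XML fragment is a made-up sample (only the `s3a://demo-aws-2` prefix comes from the output above), and `storage_root` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical hive-site.xml fragment; the path after the bucket is invented.
SAMPLE_HIVE_SITE = """
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>s3a://demo-aws-2/warehouse/tablespace/managed/hive</value>
  </property>
</configuration>
"""


def storage_root(hive_site_xml: str) -> str:
    """Return '<scheme>://<bucket>' of the Hive warehouse directory."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "hive.metastore.warehouse.dir":
            parsed = urlparse(prop.findtext("value"))
            return f"{parsed.scheme}://{parsed.netloc}"
    # Fail loudly rather than leaving the result undefined.
    raise KeyError("hive.metastore.warehouse.dir not found")


print(storage_root(SAMPLE_HIVE_SITE))
```

Raising an error when the property is missing avoids a later `NameError` on an undefined variable.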


In [19]:
#conf=SparkConf()

# Load in a jar that provides extended string comparison functions such as Jaro Winkler.
# Splink
#     conf.set('spark.driver.extraClassPath', 'jars/scala-udf-similarity-0.0.6.jar,jars/graphframes-0.6.0-spark2.3-s_2.11.jar')
#     conf.set('spark.jars', 'jars/scala-udf-similarity-0.0.6.jar,jars/graphframes-0.6.0-spark2.3-s_2.11.jar')
#conf.set('spark.driver.extraClassPath', 'jars/scala-udf-similarity-0.0.6.jar')
#conf.set('spark.jars', 'jars/scala-udf-similarity-0.0.6.jar')
#conf.set('spark.jars.packages', 'graphframes:graphframes:0.6.0-spark2.3-s_2.11')

#sc = SparkContext.getOrCreate(conf=conf)
#sc.setCheckpointDir("temp_graphframes/")


spark = SparkSession\
    .builder\
    .appName("Entity Resolution with Lineage")\
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region","us-east-1")\
    .config("spark.yarn.access.hadoopFileSystems", os.environ['STORAGE'])\
    .config("spark.driver.extraClassPath", "jars/scala-udf-similarity-0.0.6.jar")\
    .config("spark.jars", "jars/scala-udf-similarity-0.0.6.jar")\
    .getOrCreate()

# Register UDFs
from pyspark.sql import types
spark.udf.registerJavaFunction('jaro_winkler_sim', 'uk.gov.moj.dash.linkage.JaroWinklerSimilarity', types.DoubleType())
spark.udf.registerJavaFunction('Dmetaphone', 'uk.gov.moj.dash.linkage.DoubleMetaphone', types.StringType())

In [20]:
spark

In [21]:
import pandas as pd 
pd.options.display.max_columns = 500

In [22]:
import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

## Step 2: Read in the data

The `l` and `r` in the file names stand for 'left' and 'right'.  It doesn't matter which of the two datasets you choose as the left; performance and results will be the same.

⚠️ Note that `splink` makes the following assumptions about your data:

-  There is a field containing a unique record identifier in each dataset
-  The two datasets being linked have common column names - e.g. date of birth is represented in both datasets in a field of the same name.   In many cases, this means that the user needs to rename columns prior to using `splink`
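These assumptions can be checked before any linking is attempted. The sketch below works on plain lists of column names; `check_linkable` is a hypothetical helper, not part of `splink`, and the `date_of_birth` column is an invented example of a mismatched name.

```python
def check_linkable(cols_l, cols_r, comparison_cols, id_col="unique_id"):
    """Return a list of problems that would prevent linking two tables."""
    problems = []
    for side, cols in (("left", cols_l), ("right", cols_r)):
        if id_col not in cols:
            problems.append(f"{side} table has no '{id_col}' column")
        for col in comparison_cols:
            if col not in cols:
                problems.append(f"{side} table has no '{col}' column")
    return problems


left = ["unique_id", "first_name", "surname", "dob", "city", "email"]
right = ["unique_id", "first_name", "surname", "date_of_birth", "city", "email"]

print(check_linkable(left, right, ["first_name", "surname", "dob"]))
# ["right table has no 'dob' column"]

# Renaming the mismatched column resolves the problem.
rename = {"date_of_birth": "dob"}
right_renamed = [rename.get(c, c) for c in right]
print(check_linkable(left, right_renamed, ["first_name", "surname", "dob"]))
# []
```

On a Spark DataFrame the equivalent rename would be done with `withColumnRenamed` before passing the data to `splink`.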


In [33]:
from pyspark.sql.functions import lit 
df_1 = spark.read.parquet("data/fake_df_l.parquet")
df_1 = df_1.withColumn("source_dataset", lit("df_1"))
df_2 = spark.read.parquet("data/fake_df_r.parquet")
df_2 = df_2.withColumn("source_dataset", lit("df_2"))
print(f"The count of rows in `df_1` is {df_1.count()}")
df_1.show(5)
print(f"The count of rows in `df_2` is {df_2.count()}")
df_2.show(5)

The count of rows in `df_1` is 181
+---------+----------+-------+----------+------------+--------------------+-----+--------------+
|unique_id|first_name|surname|       dob|        city|               email|group|source_dataset|
+---------+----------+-------+----------+------------+--------------------+-----+--------------+
|        0|    Julia |   null|2015-10-29|      London| hannah88@powers.com|    0|          df_1|
|        4|      oNah| Watson|2008-03-23|      Bolton|matthew78@ballard...|    1|          df_1|
|       13|    Molly |   Bell|2002-01-05|Peterborough|                null|    2|          df_1|
|       15| Alexander|Amelia |1983-05-19|     Glasgow|ic-mpbell@alleale...|    3|          df_1|
|       20|    Ol vri|ynnollC|1972-03-08|    Plymouth|derekwilliams@nor...|    4|          df_1|
+---------+----------+-------+----------+------------+--------------------+-----+--------------+
only showing top 5 rows

The count of rows in `df_2` is 819
+---------+----------+-------+--

In [24]:
df_l.schema

StructType(List(StructField(unique_id,LongType,true),StructField(first_name,StringType,true),StructField(surname,StringType,true),StructField(dob,StringType,true),StructField(city,StringType,true),StructField(email,StringType,true),StructField(group,LongType,true)))

## Step 3:  Configure splink using the `settings` object

Most of `splink`'s configuration options are stored in a settings dictionary.  This dictionary allows significant customisation, and can therefore get quite complex.  

💥 We provide a tool for authoring valid settings dictionaries, with tooltips and autocomplete, which you can find [here](http://robinlinacre.com/splink_settings_editor/).

Customisation overrides default values built into splink.  For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.

To help with authoring and validation of the settings dictionary, we have written a [json schema](https://json-schema.org/), which can be found [here](https://github.com/moj-analytical-services/splink/blob/master/splink/files/settings_jsonschema.json).  




In [34]:
# The comparison expression allows for the case where a first name and surname have been inverted 
sql_case_expression = """
CASE 
WHEN first_name_l = first_name_r AND surname_l = surname_r THEN 4 
WHEN first_name_l = surname_r AND surname_l = first_name_r THEN 3
WHEN first_name_l = first_name_r THEN 2
WHEN surname_l = surname_r THEN 1
ELSE 0 
END
"""

settings = {
    "link_type": "link_only", 
    "max_iterations": 20,
    "blocking_rules": [
    ],
    "comparison_columns": [
       {
            "custom_name": "name_inversion",
            "custom_columns_used": ["first_name", "surname"],
            "case_expression": sql_case_expression,
            "num_levels": 5
        },
        {
            "col_name": "city",
            "num_levels": 3
        },
        {
            "col_name": "email",
            "num_levels": 3
        },
        {
            "col_name": "dob"
        }
    ],
    "additional_columns_to_retain": ["group"]
    
}

In words, this settings dictionary says:

- We are performing a data linking task (the other options are `dedupe_only` and `link_and_dedupe`)
- The `blocking_rules` array is empty, so no blocking is applied and all possible comparisons (the cartesian product of the two input datasets) are generated.  If blocking rules were supplied, record comparisons would be restricted to those generated by at least one of the rules in the array
- When comparing records, we will use information from the `first_name`, `surname`, `dob`, `city` and `email` columns to compute a match score.
- `first_name` and `surname` are compared jointly by the custom `name_inversion` SQL case expression, which has five levels:
    - Level 4: Both first name and surname match
    - Level 3: The names match once first name and surname are swapped
    - Level 2: Only the first name matches
    - Level 1: Only the surname matches
    - Level 0: No match
- For `city` and `email`, string comparisons will have three levels:
    - Level 2: Strings are (almost) exactly the same
    - Level 1: Strings are similar 
    - Level 0: No match
- `dob` does not specify `num_levels`, so it relies on the `splink` default
- No term frequency adjustments are specified in this settings dictionary
- We will retain the `group` column in the results even though this is not used as part of comparisons.  This is a labelled dataset and `group` contains the true match - i.e. where group matches, the records pertain to the same person
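To make the `name_inversion` case expression in the settings concrete, here is a plain-Python mirror of its logic. This is only an illustration of how the levels are assigned (`splink` evaluates the SQL itself), and the names in the examples are invented.

```python
def name_inversion_level(first_l, surname_l, first_r, surname_r):
    """Mirror the SQL CASE expression: return a comparison level from 4 to 0."""

    def eq(a, b):
        # In SQL, an equality test involving NULL is never true.
        return a is not None and b is not None and a == b

    if eq(first_l, first_r) and eq(surname_l, surname_r):
        return 4  # both names match
    if eq(first_l, surname_r) and eq(surname_l, first_r):
        return 3  # names match once first name and surname are swapped
    if eq(first_l, first_r):
        return 2  # only the first name matches
    if eq(surname_l, surname_r):
        return 1  # only the surname matches
    return 0      # no match


print(name_inversion_level("Amelia", "Watson", "Amelia", "Watson"))  # 4
print(name_inversion_level("Amelia", "Watson", "Watson", "Amelia"))  # 3
print(name_inversion_level("Amelia", "Watson", "Amelia", None))      # 2
print(name_inversion_level("Amelia", "Watson", "Molly", "Watson"))   # 1
print(name_inversion_level("Amelia", "Watson", "Molly", "Bell"))     # 0
```

Because the conditions are tested in order, the five values line up with `"num_levels": 5` in the settings.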

## Step 3b: Save the two datasets as Spark tables

In [26]:
df_l.write.format('parquet').mode("overwrite").saveAsTable('ER_table_left')
df_r.write.format('parquet').mode("overwrite").saveAsTable('ER_table_right')

In [27]:
#spark.catalog.listTables("default")

In [28]:
df_l.show()

+---------+----------+---------+----------+---------------+--------------------+-----+
|unique_id|first_name|  surname|       dob|           city|               email|group|
+---------+----------+---------+----------+---------------+--------------------+-----+
|        0|    Julia |     null|2015-10-29|         London| hannah88@powers.com|    0|
|        4|      oNah|   Watson|2008-03-23|         Bolton|matthew78@ballard...|    1|
|       13|    Molly |     Bell|2002-01-05|   Peterborough|                null|    2|
|       15| Alexander|  Amelia |1983-05-19|        Glasgow|ic-mpbell@alleale...|    3|
|       20|    Ol vri|  ynnollC|1972-03-08|       Plymouth|derekwilliams@nor...|    4|
|       23|      null| Thompson|1996-03-22|          Leeds|jefferyduke@brown...|    5|
|       27|  Matilda |    Hsrir|1983-04-30|         London| patrcio47@davis.cam|    6|
|       32|    Baxter|    Aria |1992-09-07|         London|christineshepherd...|    7|
|       37|    Wilson| Charlie |1998-09-15|

## Step 4:  Estimate match scores using the Expectation Maximisation algorithm

In [35]:
from splink import Splink

linker = Splink(settings, [df_1, df_2], spark)
df_e = linker.get_scored_comparisons()

# persist() caches the scored comparisons, so they do not have to be recomputed
# each time df_e is used in the steps below.
df_e.persist()  


  col_settings[key] = default
INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.40568520724773405 for key name_inversion, level 4
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.06933289766311646 for key email, level 1
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.02503591775894165 for key dob, level 0
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.009511321783065796 for key dob, level 0
INFO:splink.iterate:Iteration 4 complete
INFO:splink.model:The maximum change in parameters was 0.004227638244628906 for key dob, level 0
INFO:splink.iterate:Iteration 5 complete
INFO:splink.model:The maximum change in parameters was 0.0022344589233398438 for key dob, level 0
INFO:splink.iterate:Iteration 6 complete
INFO:splink.model:The maximum change in parameters was 0.001312553882598877 for key dob, l

DataFrame[match_probability: double, source_dataset_l: string, unique_id_l: bigint, source_dataset_r: string, unique_id_r: bigint, first_name_l: string, first_name_r: string, surname_l: string, surname_r: string, gamma_name_inversion: int, city_l: string, city_r: string, gamma_city: int, email_l: string, email_r: string, gamma_email: int, dob_l: string, dob_r: string, gamma_dob: int, group_l: bigint, group_r: bigint]
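The log above shows the parameters changing less with every iteration. As a rough illustration of what such an Expectation Maximisation loop does, here is a small pure-Python sketch of the Fellegi-Sunter model. It is not `splink`'s implementation (which runs in Spark): the function names, the starting values and the synthetic comparison vectors are all assumptions made for this example.

```python
def match_probability(gamma, lam, m, u):
    """P(match | comparison vector), treating columns as conditionally independent."""
    p_match, p_non_match = lam, 1.0 - lam
    for col, level in enumerate(gamma):
        p_match *= m[col][level]
        p_non_match *= u[col][level]
    return p_match / (p_match + p_non_match)


def max_change(old, new):
    """Largest absolute difference between two nested parameter lists."""
    return max(abs(a - b) for ro, rn in zip(old, new) for a, b in zip(ro, rn))


def run_em(gammas, n_levels, lam=0.5, max_iterations=20, tol=1e-4):
    # Starting guess: higher levels are more likely among matches (m),
    # lower levels more likely among non-matches (u).
    m = [[(lvl + 1) / (k * (k + 1) / 2) for lvl in range(k)] for k in n_levels]
    u = [row[::-1] for row in m]
    for _ in range(max_iterations):
        # E-step: score every comparison with the current parameters.
        weights = [match_probability(g, lam, m, u) for g in gammas]
        # M-step: re-estimate the parameters from the weighted comparisons.
        total_m = sum(weights)
        total_u = len(weights) - total_m
        new_m = [[0.0] * k for k in n_levels]
        new_u = [[0.0] * k for k in n_levels]
        for g, w in zip(gammas, weights):
            for col, level in enumerate(g):
                new_m[col][level] += w / total_m
                new_u[col][level] += (1.0 - w) / total_u
        change = max(max_change(m, new_m), max_change(u, new_u))
        lam, m, u = total_m / len(weights), new_m, new_u
        if change < tol:  # stop once the parameters have settled
            break
    return lam, m, u


# Synthetic comparison vectors for three binary columns (1 = agree, 0 = disagree).
gammas = (
    [(1, 1, 1)] * 74
    + [(1, 1, 0), (1, 0, 1), (0, 1, 1)] * 16
    + [(1, 0, 0), (0, 1, 0), (0, 0, 1)] * 74
    + [(0, 0, 0)] * 656
)
lam, m, u = run_em(gammas, n_levels=[2, 2, 2])
print(f"estimated proportion of matches: {lam:.3f}")
print(f"P(match | all agree)    = {match_probability((1, 1, 1), lam, m, u):.3f}")
print(f"P(match | all disagree) = {match_probability((0, 0, 0), lam, m, u):.3f}")
```

The stopping rule uses the maximum change in any parameter, which is the same quantity reported in the log lines above.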

In [None]:
df_e.write.format('parquet').mode("overwrite").saveAsTable('ER_target')

## Step 5: Inspect results 



In [None]:
# Inspect main dataframe that contains the match scores
df_e.toPandas().sample(5)

The `params` property of the `linker` is an object that contains a lot of diagnostic information about how the match probability was computed.  The following cells demonstrate some of its functionality.

An alternative representation of the parameters displays them in terms of the effect different values in the comparison vectors have on the match probability:

In [None]:
params = linker.params  # the `params` property of the linker, described above
params.bayes_factor_chart()
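A Bayes factor for a comparison level is the ratio m/u: how much more likely that level is among matches than among non-matches. The sketch below shows how such factors combine with a prior to give a match probability. The numbers are invented for illustration and are not taken from the model trained above.

```python
def bayes_factor(m: float, u: float) -> float:
    """Ratio of the level's probability among matches to that among non-matches."""
    return m / u


def posterior_match_probability(prior: float, bayes_factors) -> float:
    """Multiply the prior odds by each Bayes factor, then convert back to a probability."""
    odds = prior / (1.0 - prior)
    for bf in bayes_factors:
        odds *= bf
    return odds / (1.0 + odds)


# Hypothetical values: a 1% prior and two columns that both agree.
prior = 0.01
factors = [bayes_factor(0.9, 0.01), bayes_factor(0.8, 0.1)]  # roughly 90 and 8
probability = posterior_match_probability(prior, factors)
print(f"{probability:.3f}")  # about 0.879
```

A factor above 1 pushes the match probability up, a factor below 1 pushes it down, and a factor of exactly 1 leaves it unchanged.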

In [None]:
# If charts aren't displaying correctly in your notebook, you can write them to a file (by default splink_charts.html)
params.all_charts_write_html_file("splink_charts.html", overwrite=True)

You can also generate a report which explains how the match probability was computed for an individual comparison row.  

Note that you need to convert the row to a dictionary for this to work.

In [None]:
from splink.intuition import intuition_report
row_dict = df_e.toPandas().sample(1).to_dict(orient="records")[0]
print(intuition_report(row_dict, params))

In [None]:
from splink.diagnostics import splink_score_histogram
from pyspark.sql.functions import expr 
splink_score_histogram(df_e.filter(expr('match_probability > 0.001')), spark)

## Step 6: Create a Custom Atlas Type (Process) reflecting the EM algorithm

First we need to instantiate the connection to Atlas in CDP.

In [None]:
import atlasclient

The endpoint, username and password are stored as CML project variables and passed in dynamically.

In [None]:
from atlasclient.client import Atlas
client = Atlas(os.environ["atlas_endpoint"], port='', username=os.environ["atlas_username"], password=os.environ["atlas_password"])

Verify that the client connection is working by querying an existing Atlas entity by its GUID.

In [None]:
guid = "c845eb62-d85d-4591-8abe-0c31449cdd95"

In [None]:
entity = client.entity_guid(guid)

In [None]:
entity.entity['attributes']

It looks like we have successfully established the connection. Next we can create a custom Atlas type (process) reflecting the EM algorithm.

In [None]:
typedef_dict = {
    "enumTypes": [],
    "structTypes": [],
    "classificationDefs":[],
    "entityDefs": [{
        "superTypes": ["Process"],
        "name": "EM_algorithm_linkage",
        "description":"custom_type_for_Entity_Resolution",
        "attributeDefs": [{
            "name": "startTime",
            "isOptional": True,
            "isUnique": False,
            "isIndexable": False,
            "typeName":"string",
            "valuesMaxCount":1,
            "cardinality":"SINGLE",
            "valuesMinCount":0
        }]
    }]
}

And we can now register the new type with Atlas. For more on the Atlas type model, please visit this page: https://docs.cloudera.com/runtime/7.2.7/cdp-governance-overview/topics/atlas-metadata-model-overview.html

In [None]:
#Has already run once so will not run again
#client.typedefs.create(data=typedef_dict)

## Step 7: Instantiate the EM algorithm in Atlas along with lineage reflecting our Linkage Job above

Note: we need to pass the Atlas GUIDs of the two datasets we compared above (and of the output table). They were registered in Atlas when they were saved as Spark tables.

In [None]:
# Retrieve the GUIDs for the three tables via the Atlas client (search by name)

In [None]:
#params = {'typeName': 'hive_table', 'attrName': 'data', 'attrValue': 'provider', 'offset': '1', 'limit':'10'}
#search_results = client.search_basic(**params)
#for s in search_results:
#    for e in s.entities:
#        print(e.guid)
#        print(e.attributes)
#        print(e.attributes.values)
#        print(e.typeName)
#        print(e.attributes)

In [None]:
#params = {'typeName': 'hive_table', 'attrName': 'name', 'attrValue': 'cc_data', 'offset': '1', 'limit':'10'}
#search_results = client.search_attribute(**params)
#for s in search_results:
#    for e in s.entities:
#        print(e.guid)
#        print(e.attributes)

In [None]:
#for s in search_results:
#    print(s.entities.to_dict())

In [None]:
#data = {'typeName': 'hive_table', 'attrName': 'name', 'attrValue': 'cc_data', 'offset': '1', 'limit': '100'}
#search_results = client.search_basic.create(data=data)
#for e in search_results.entities:
#    print(e.guid)
#    print(e.attributes)

In [None]:
process_entity_dict = {
  "entity" : {
    "guid" : "-2089428075574333",
    "status" : "ACTIVE",
    "createdBy" : "pdefusco",
    "updatedBy" : "pdefusco",
    "createTime" : "12342",
    "updateTime" : "12342",
    "version" : "12342",
    "relationshipAttributes" : {},
    "classifications" : [],
    "typeName" : "EM_algorithm_linkage",
    "attributes" : {
      "startTime" : "123",
      "qualifiedName": "EM Record Linkage",
      "name":"EM Record Linkage",
      "description":"Record Linkage Algorithm",
      "owner": "pdefusco",
      "inputs":[{"guid": "aa955089-5a11-46d9-9dbf-2f6b75f4d65b", "typeName":"hive_table"},
               {"guid": "43d788ce-4af4-4253-af0b-465ea45c1b93", "typeName":"hive_table"}], 
      "outputs":[{"guid":"ac1bdcb3-73c8-4198-a8e6-0aa104c606bb", "typeName":"hive_table"}]
    }, 
  },
  
}
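Writing the payload by hand makes it easy to mistype a key in one of the table references. One option is a small builder that produces every reference the same way. This is a sketch only: `object_ref` and `build_process_entity` are hypothetical helpers, not part of `atlasclient`, the GUIDs in the example are placeholders, and only a subset of the fields used above is included.

```python
def object_ref(guid, type_name="hive_table"):
    """Reference to an existing Atlas entity, always using the same keys."""
    return {"guid": guid, "typeName": type_name}


def build_process_entity(name, input_guids, output_guids, owner,
                         type_name="EM_algorithm_linkage",
                         placeholder_guid="-2089428075574333"):
    """Assemble a process entity payload linking input tables to output tables."""
    return {
        "entity": {
            "guid": placeholder_guid,  # placeholder value copied from the cell above
            "status": "ACTIVE",
            "typeName": type_name,
            "attributes": {
                "qualifiedName": name,
                "name": name,
                "owner": owner,
                "inputs": [object_ref(g) for g in input_guids],
                "outputs": [object_ref(g) for g in output_guids],
            },
        }
    }


# Placeholder GUIDs for illustration only.
payload = build_process_entity(
    "EM Record Linkage",
    input_guids=["guid-left-table", "guid-right-table"],
    output_guids=["guid-target-table"],
    owner="pdefusco",
)
print(payload["entity"]["attributes"]["inputs"])
```

The resulting dictionary has the same overall shape as `process_entity_dict` and could be passed to the client in the same way.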

In [None]:
client.entity_post.create(data=process_entity_dict)

## Step 8: Browse the lineage in Atlas

Navigate to Atlas (SDX) and browse for the "EM_algorithm_linkage" entity. Expand the lineage tab and the source and target datasets will be shown.

![title](images/ER_atlas_lineage.png)

Next we can optionally remove the EM algorithm instance from Atlas via the client.

In [None]:
entity = client.entity_guid("44848fe5-6950-4a73-a89c-9775b736b4c9")

In [None]:
entity.entity['attributes']["owner"]

In [None]:
entity.delete()

## We have completed our introduction to Splink and the Atlas Client. 
## Next we will simulate a real world Application with CML Jobs and COD (Cloudera Operational Database)