## Sparklink demo 

In this demo we de-duplicate a small dataset.

The purpose is to provide an end-to-end example of how to use the package.

I print the output at each stage using `spark_dataframe.show()`.  This is for instructional purposes only: each call forces Spark to evaluate the dataframe, which degrades performance, so it shouldn't be used in a production setting.

## Step 1:  Imports and setup

The following is boilerplate code that sets up the Spark session and sets some other non-essential configuration options.

In [1]:
import pandas as pd 
pd.options.display.max_columns = 500
pd.options.display.max_rows = 100

In [2]:
import logging 
logging.basicConfig()

# Set to DEBUG if you want sparklink to log the SQL statements it's executing under the hood
logging.getLogger("sparklink").setLevel(logging.INFO)

In [3]:
from pyspark.context import SparkContext, SparkConf
from pyspark.sql import SparkSession, Window
from pyspark.sql.types import StructType
import pyspark.sql.functions as f

conf=SparkConf()

# Load in a jar that provides extended string comparison functions such as Jaro Winkler.
# Sparklink 
conf.set('spark.driver.extraClassPath', 'jars/scala-udf-similarity-0.0.6.jar')
conf.set('spark.jars', 'jars/scala-udf-similarity-0.0.6.jar')   


# WARNING:
# These config options are appropriate only if you're running Spark locally!!!
conf.set('spark.driver.memory', '4g')
conf.set("spark.sql.shuffle.partitions", "8") 

sc = SparkContext.getOrCreate(conf=conf)

spark = SparkSession(sc)

# Register UDFs
from pyspark.sql import types
spark.udf.registerJavaFunction('jaro_winkler_sim', 'uk.gov.moj.dash.linkage.JaroWinklerSimilarity', types.DoubleType())
spark.udf.registerJavaFunction('Dmetaphone', 'uk.gov.moj.dash.linkage.DoubleMetaphone', types.StringType())

## Step 2:  Configure sparklink using the `settings` object

Most of `sparklink`'s configuration options are stored in a settings dictionary.  This dictionary allows significant customisation, and can therefore get quite complex.  

Customisation overrides default values built into sparklink.  For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.

To help with authoring and validation of the settings dictionary, we have written a [json schema](https://json-schema.org/), which can be found [here](https://github.com/moj-analytical-services/sparklink/blob/dev/sparklink/files/settings_jsonschema.json).  We also provide a tool for authoring valid settings dictionaries, which includes tooltips and autocomplete, and which you can find [here](https://robinlinacre.com/simple_sparklink_settings_editor/).
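
A quick structural check can catch typos in a settings dictionary before it reaches Spark. The sketch below is a hand-rolled stand-in and not the published JSON schema: the function name `check_settings` and the individual rules are assumptions for illustration, based only on the keys that appear in this demo.

```python
# Hypothetical, simplified check. The authoritative rules live in the JSON schema.
VALID_LINK_TYPES = {"dedupe_only", "link_only", "link_and_dedupe"}


def check_settings(settings):
    """Return a list of human-readable problems found in a settings dict."""
    problems = []

    if settings.get("link_type") not in VALID_LINK_TYPES:
        problems.append("link_type must be one of: " + ", ".join(sorted(VALID_LINK_TYPES)))

    # Assumption: a prior of exactly 0 or 1 is not useful, so require an open interval.
    prop = settings.get("proportion_of_matches")
    if prop is not None and not (0 < prop < 1):
        problems.append("proportion_of_matches must be strictly between 0 and 1")

    columns = settings.get("comparison_columns", [])
    if not columns:
        problems.append("comparison_columns must contain at least one entry")
    for i, column in enumerate(columns):
        # Assumption: every comparison column is identified by col_name, as in this demo.
        if "col_name" not in column:
            problems.append(f"comparison_columns[{i}] is missing col_name")

    return problems


good = {
    "link_type": "dedupe_only",
    "proportion_of_matches": 0.5,
    "comparison_columns": [{"col_name": "first_name", "num_levels": 3}],
}
bad = {"link_type": "dedupe", "comparison_columns": [{"num_levels": 3}]}

print(check_settings(good))
print(check_settings(bad))
```

An empty list means none of these simplified checks failed; it does not guarantee that the dictionary is valid against the full schema.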



In [4]:
settings = {
    "proportion_of_matches": 0.5,
    "link_type": "dedupe_only",
    "blocking_rules": [
        'l.first_name = r.first_name',
        'l.surname = r.surname',
        'l.dob = r.dob'
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "city"
        },
        {
            "col_name": "email"
        }
    ]
    
}

In words, this settings dictionary says:
- We are performing a deduplication task (the other options are `link_only`, or `link_and_dedupe`)
- The initial value (prior belief) for the proportion of matches amongst comparisons is 0.5 (50% of comparisons that result from the blocking rules are matches).
- We are going to generate comparisons subject to the blocking rules contained in the specified array
- When comparing records, we will use information from the `first_name`, `surname`, `dob`, `city` and `email` columns to compute a match score.
- For `first_name` and `surname`, string comparisons will have three levels:
    - Level 2: Strings are (almost) exactly the same
    - Level 1: Strings are similar 
    - Level 0: No match
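
The three-level string comparison described in the list above can be sketched as follows. This is an illustration only: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for the Jaro-Winkler function registered earlier, and the thresholds `high` and `low` are hypothetical values, not sparklink's defaults. Nulls are mapped to -1, which is how they appear in the comparison vector output in this demo.

```python
from difflib import SequenceMatcher


def gamma_for_strings(left, right, high=0.95, low=0.80):
    """Map a pair of strings to a comparison level.

    Returns 2 (almost identical), 1 (similar), 0 (no match) or -1 (a value is null).
    The thresholds are illustrative assumptions.
    """
    if left is None or right is None:
        return -1
    similarity = SequenceMatcher(None, left, right).ratio()
    if similarity >= high:
        return 2
    if similarity >= low:
        return 1
    return 0


for pair in [("Julia", "Julia"), ("Sheffield", "Sheffied"), ("Julia", "Watson"), (None, "Taylor")]:
    print(pair, gamma_for_strings(*pair))
```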


## Step 3:  Read in data and create comparisons using blocking rules

In [5]:
df = spark.read.parquet("data/fake_1000.parquet")
df.show(5)

+---------+----------+-------+----------+------+--------------------+-----+
|unique_id|first_name|surname|       dob|  city|               email|group|
+---------+----------+-------+----------+------+--------------------+-----+
|        0|    Julia |   null|2015-10-29|London| hannah88@powers.com|    0|
|        1|    Julia | Taylor|2015-07-31|London| hannah88@powers.com|    0|
|        2|    Julia | Taylor|2016-01-27|London| hannah88@powers.com|    0|
|        3|    Julia | Taylor|2015-10-29|  null|  hannah88opowersc@m|    0|
|        4|      oNah| Watson|2008-03-23|Bolton|matthew78@ballard...|    1|
+---------+----------+-------+----------+------+--------------------+-----+
only showing top 5 rows



In [6]:
from sparklink.blocking import block_using_rules
from sparklink.gammas import complete_settings_dict
settings = complete_settings_dict(settings)
df_comparison = block_using_rules(settings, df=df, spark=spark)
df_comparison.show(5)

+-----------+-----------+------------+------------+---------+---------+----------+----------+------+------+-------------------+-------------------+
|unique_id_l|unique_id_r|first_name_l|first_name_r|surname_l|surname_r|     dob_l|     dob_r|city_l|city_r|            email_l|            email_r|
+-----------+-----------+------------+------------+---------+---------+----------+----------+------+------+-------------------+-------------------+
|          0|          3|      Julia |      Julia |     null|   Taylor|2015-10-29|2015-10-29|London|  null|hannah88@powers.com| hannah88opowersc@m|
|          0|          2|      Julia |      Julia |     null|   Taylor|2015-10-29|2016-01-27|London|London|hannah88@powers.com|hannah88@powers.com|
|          0|          1|      Julia |      Julia |     null|   Taylor|2015-10-29|2015-07-31|London|London|hannah88@powers.com|hannah88@powers.com|
|          1|          3|      Julia |      Julia |   Taylor|   Taylor|2015-07-31|2015-10-29|London|  null|hanna
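
To make the effect of the blocking rules concrete, here is a minimal pure-Python sketch of the idea: a pair of records becomes a comparison only if at least one blocking column is equal on both sides, and each pair is emitted once. The function `block_pairs` is hypothetical and is not how `block_using_rules` is implemented; it also compares every pair of records, which is only workable for tiny inputs. As in SQL, a null never equals anything.

```python
from itertools import combinations

# A few records taken from the sample rows shown above
records = [
    {"unique_id": 0, "first_name": "Julia", "surname": None, "dob": "2015-10-29"},
    {"unique_id": 1, "first_name": "Julia", "surname": "Taylor", "dob": "2015-07-31"},
    {"unique_id": 3, "first_name": "Julia", "surname": "Taylor", "dob": "2015-10-29"},
    {"unique_id": 4, "first_name": "oNah", "surname": "Watson", "dob": "2008-03-23"},
]


def block_pairs(records, blocking_columns):
    """Return (left_id, right_id) for pairs that agree on any blocking column."""
    pairs = []
    for left, right in combinations(records, 2):
        for column in blocking_columns:
            # Nulls never satisfy an equality rule
            if left[column] is not None and left[column] == right[column]:
                pairs.append((left["unique_id"], right["unique_id"]))
                break  # one satisfied rule is enough; avoid duplicate pairs
    return pairs


pairs = block_pairs(records, ["first_name", "surname", "dob"])
print(pairs)
```

Record 4 shares no first name, surname or date of birth with the others, so it generates no comparisons.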

## Step 3:  Compute Fellegi Sunter comparison vectors from the table of comparisons

Columns are assumed to be strings by default.  See the 'comparison vector settings' notebook for details of configuration options.

In [7]:
from sparklink.gammas import add_gammas

df_gammas = add_gammas(df_comparison, settings, spark, include_orig_cols = True)
df_gammas.persist()
df_gammas.show(5)

+-----------+-----------+------------+------------+----------------+---------+---------+-------------+----------+----------+---------+------+------+----------+-------------------+-------------------+-----------+
|unique_id_l|unique_id_r|first_name_l|first_name_r|gamma_first_name|surname_l|surname_r|gamma_surname|     dob_l|     dob_r|gamma_dob|city_l|city_r|gamma_city|            email_l|            email_r|gamma_email|
+-----------+-----------+------------+------------+----------------+---------+---------+-------------+----------+----------+---------+------+------+----------+-------------------+-------------------+-----------+
|          0|          3|      Julia |      Julia |               2|     null|   Taylor|           -1|2015-10-29|2015-10-29|        1|London|  null|        -1|hannah88@powers.com| hannah88opowersc@m|          0|
|          0|          2|      Julia |      Julia |               2|     null|   Taylor|           -1|2015-10-29|2016-01-27|        0|London|London|    

## Step 4:  Initialise parameters (m and u probabilities)

In [8]:
from sparklink.params import Params 
params = Params(settings)


# Note all initial parameters are customisable - see the 'comparison vector settings' notebook for details of configuration options.

In [9]:
#  Note that the params object has a formatted, human-readable __repr__ representation when you print it
params.params

{'λ': 0.5,
 'π': {'gamma_first_name': {'gamma_index': 0,
   'desc': 'Comparison of first_name',
   'column_name': 'first_name',
   'num_levels': 3,
   'prob_dist_match': {'level_0': {'value': 0, 'probability': 0.1},
    'level_1': {'value': 1, 'probability': 0.2},
    'level_2': {'value': 2, 'probability': 0.7}},
   'prob_dist_non_match': {'level_0': {'value': 0, 'probability': 0.7},
    'level_1': {'value': 1, 'probability': 0.19999999999999996},
    'level_2': {'value': 2, 'probability': 0.09999999999999998}}},
  'gamma_surname': {'gamma_index': 1,
   'desc': 'Comparison of surname',
   'column_name': 'surname',
   'num_levels': 3,
   'prob_dist_match': {'level_0': {'value': 0, 'probability': 0.1},
    'level_1': {'value': 1, 'probability': 0.2},
    'level_2': {'value': 2, 'probability': 0.7}},
   'prob_dist_non_match': {'level_0': {'value': 0, 'probability': 0.7},
    'level_1': {'value': 1, 'probability': 0.19999999999999996},
    'level_2': {'value': 2, 'probability': 0.099999999
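
The way `λ` and the two probability distributions combine into a match probability can be sketched with the standard Fellegi-Sunter formula, which assumes the columns are conditionally independent. The function name `match_probability` is hypothetical, and the example uses the initial values printed above (0.7 and 0.1 for level 2, 0.1 and 0.7 for level 0) for two columns only.

```python
def match_probability(lam, m_probs, u_probs):
    """Posterior probability that a comparison is a match.

    lam:     prior proportion of matches (the λ parameter).
    m_probs: P(observed level | match) for each column.
    u_probs: P(observed level | non-match) for each column.
    """
    match_term = lam
    non_match_term = 1 - lam
    for m, u in zip(m_probs, u_probs):
        match_term *= m
        non_match_term *= u
    return match_term / (match_term + non_match_term)


# first_name and surname both at level 2 (near-exact agreement)
p_agree = match_probability(0.5, [0.7, 0.7], [0.1, 0.1])

# first_name and surname both at level 0 (no match)
p_disagree = match_probability(0.5, [0.1, 0.1], [0.7, 0.7])

print(round(p_agree, 4), round(p_disagree, 4))
```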

## Step 5:  Repeatedly apply expectation and maximisation step to improve parameter estimates

In [11]:
from sparklink.iterate import iterate
logging.getLogger("sparklink").setLevel(logging.INFO)
df_e = iterate(df_gammas, spark, params, num_iterations=5, compute_ll=True)
df_e.persist()
df_e.show()

INFO:sparklink.expectation_step:
Log likelihood for iteration 5:  -15450.728664992195

INFO:sparklink.expectation_step:
Log likelihood for iteration 6:  -15445.65232979209

INFO:sparklink.expectation_step:
Log likelihood for iteration 7:  -15442.149315795372

INFO:sparklink.expectation_step:
Log likelihood for iteration 8:  -15439.608912537664

INFO:sparklink.expectation_step:
Log likelihood for iteration 9:  -15437.702593455539

INFO:sparklink.expectation_step:
Log likelihood for iteration 10:  -15436.23335290754



+--------------------+-----------+-----------+------------+------------+----------------+---------+---------+-------------+----------+----------+---------+------+----------+----------+--------------------+--------------------+-----------+-------------------------------+---------------------------+----------------------------+------------------------+------------------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+
|   match_probability|unique_id_l|unique_id_r|first_name_l|first_name_r|gamma_first_name|surname_l|surname_r|gamma_surname|     dob_l|     dob_r|gamma_dob|city_l|    city_r|gamma_city|             email_l|             email_r|gamma_email|prob_gamma_first_name_non_match|prob_gamma_first_name_match|prob_gamma_surname_non_match|prob_gamma_surname_match|prob_gamma_dob_non_match|prob_gamma_dob_match|prob_gamma_city_non_match|prob_gamma_city_match|prob_gamma_email_non_match|prob_gamma_email_match|
+-----
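
For intuition about what each iteration does, the following is a minimal pure-Python sketch of expectation maximisation for a Fellegi-Sunter model with two-level (agree or disagree) comparisons. It is not sparklink's implementation: the function `em_iteration` and the toy comparison vectors are assumptions made for illustration. As in the log output above, the log likelihood should not decrease from one iteration to the next.

```python
import math


def em_iteration(gammas, lam, m, u):
    """Run one expectation step followed by one maximisation step.

    gammas: list of tuples of 0/1 agreement indicators, one tuple per comparison.
    lam:    current estimate of the proportion of matches.
    m, u:   P(agree | match) and P(agree | non-match) for each column.
    Returns (new_lam, new_m, new_u, log_likelihood of the input parameters).
    """
    weights = []
    log_likelihood = 0.0
    # Expectation: compute each comparison's match probability
    for g in gammas:
        p_match, p_non_match = lam, 1 - lam
        for k, agree in enumerate(g):
            p_match *= m[k] if agree else 1 - m[k]
            p_non_match *= u[k] if agree else 1 - u[k]
        total = p_match + p_non_match
        weights.append(p_match / total)
        log_likelihood += math.log(total)

    # Maximisation: re-estimate parameters as weighted proportions
    n = len(gammas)
    sum_w = sum(weights)
    new_lam = sum_w / n
    new_m = [sum(w * g[k] for w, g in zip(weights, gammas)) / sum_w
             for k in range(len(m))]
    new_u = [sum((1 - w) * g[k] for w, g in zip(weights, gammas)) / (n - sum_w)
             for k in range(len(u))]
    return new_lam, new_m, new_u, log_likelihood


# Toy comparison vectors for two columns (made up for this sketch)
gammas = [(1, 1)] * 30 + [(0, 0)] * 50 + [(1, 0)] * 10 + [(0, 1)] * 10

lam, m, u = 0.5, [0.7, 0.7], [0.3, 0.3]
history = []
for _ in range(5):
    lam, m, u, ll = em_iteration(gammas, lam, m, u)
    history.append(ll)

print(history)
```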

## Step 6: Inspect results 



In [12]:
# Inspect main dataframe that contains the match scores
df_e.toPandas().sample(5)

Unnamed: 0,match_probability,unique_id_l,unique_id_r,first_name_l,first_name_r,gamma_first_name,surname_l,surname_r,gamma_surname,dob_l,dob_r,gamma_dob,city_l,city_r,gamma_city,email_l,email_r,gamma_email,prob_gamma_first_name_non_match,prob_gamma_first_name_match,prob_gamma_surname_non_match,prob_gamma_surname_match,prob_gamma_dob_non_match,prob_gamma_dob_match,prob_gamma_city_non_match,prob_gamma_city_match,prob_gamma_email_non_match,prob_gamma_email_match
1418,0.999634,519,525,Martha,Martha,2,Brown,Brown,2,2002-09-01,2002-09-01,1,Southend-on-Sea,Southend-on-Sea,1,watsonthomas@jones-stuart.biz,watsonthomas@jones-stuart.biz,1,0.490485,0.473628,0.682115,0.49257,0.025586,0.806543,0.146712,0.693776,0.016559,0.703854
1029,0.017254,359,934,George,George,2,Wallace,Hodgson,0,2018-10-31,1980-12-30,0,Stoke-on-Trent,Sheffield,0,jonathan74@glover.com,lori8y@hu8h.biz,0,0.490485,0.473628,0.315947,0.433086,0.974414,0.193457,0.853288,0.306224,0.983441,0.296146
2248,0.566253,982,983,Ball,Ball,2,Layla,Layla,2,1992-05-07,1992-04-02,0,Newcastle-upoT-nye,Newcastle-upon-Tyne,0,stacykelly@brown.info,stacykelly@brown.info,1,0.490485,0.473628,0.682115,0.49257,0.974414,0.193457,0.853288,0.306224,0.016559,0.703854
2192,0.995199,933,936,George,George,2,Hodgson,Hodgson,2,1980-12-30,1980-12-30,1,Sheffield,Sheffied,0,lori88@huynh.biz,lori88@huynh.biz,1,0.490485,0.473628,0.682115,0.49257,0.025586,0.806543,0.853288,0.306224,0.016559,0.703854
1484,0.803614,555,557,Henry,Henry,2,Owen,Owen,2,2016-05-16,2016-05-16,1,,Kingston-upon-Hull,-1,nicholasbutler@jackson.net,nicobaslutler@jackson.net,0,0.490485,0.473628,0.682115,0.49257,0.025586,0.806543,1.0,1.0,0.983441,0.296146


Note that the `params` object is updated during the process of iteration.

This means that if we inspect it again, we will see a new set of parameters: those that result from applying the expectation maximisation algorithm.

In [None]:
params.probability_distribution_chart()

An alternative representation of the parameters displays them in terms of the effect different values in the comparison vectors have on the match probability:

In [None]:
params.adjustment_factor_chart()
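
One common way to express the effect of a comparison level on the match probability is the ratio of its match and non-match probabilities (a Bayes factor): values above 1 push towards a match, values below 1 push away from it. The sketch below uses the `dob` and `first_name` probabilities from the sampled rows above. The exact scaling used by the chart is not shown in this demo, so the normalised form `m / (m + u)` is an assumption for illustration, as are both function names.

```python
def bayes_factor(m, u):
    """Ratio of P(level | match) to P(level | non-match)."""
    return m / u


def normalised_adjustment(m, u):
    """Assumed rescaling to the range 0 to 1, where 0.5 means no evidence either way."""
    return m / (m + u)


# (m, u) pairs read from the prob_gamma_*_match / prob_gamma_*_non_match columns above
levels = {
    "dob agrees (gamma_dob = 1)": (0.806543, 0.025586),
    "dob disagrees (gamma_dob = 0)": (0.193457, 0.974414),
    "first_name at level 2": (0.473628, 0.490485),
}

for label, (m, u) in levels.items():
    print(label, round(bayes_factor(m, u), 3), round(normalised_adjustment(m, u), 3))
```

On these numbers, agreement on date of birth is strong evidence for a match, while a level 2 first name comparison carries almost no evidence in either direction.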

In [None]:
# If charts aren't displaying correctly in your notebook, you can write them to a file (by default sparklink_charts.html)
params.all_charts_write_html_file()

You can also generate a report which explains how the match probability was computed for an individual comparison row.  

Note that you need to convert the row to a dictionary for this to work.

In [None]:
from sparklink.intuition import intuition_report
row_dict = df_e.toPandas().sample(1).to_dict(orient="records")[0]
print(intuition_report(row_dict, params))

## Step 7: Term frequency adjustments

Sparklink enables you to make adjustments for term frequency on any number of columns.

This enables match probabilities to be adjusted for, for example, the fact that the name John Smith is more prevalent than Robin Linacre.
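
The intuition can be sketched in a few lines: agreement on a rare name is stronger evidence than agreement on a common one, because two non-matching records are less likely to share it by chance. This is not sparklink's adjustment procedure. The surname counts, the constant `m_agree` and the function `term_bayes_factor` are all made up for illustration, and the chance of a coincidental agreement is approximated by the term's relative frequency.

```python
from collections import Counter

# Made-up surname column: Smith is common, Linacre is rare
surnames = ["Smith"] * 50 + ["Taylor"] * 30 + ["Brown"] * 19 + ["Linacre"] * 1
counts = Counter(surnames)
total = sum(counts.values())


def term_bayes_factor(term, m_agree=0.9):
    """Evidence for a match when both records carry this specific term.

    m_agree is an assumed P(agree | match). The term-specific u probability is
    approximated by the term's relative frequency in the data.
    """
    u_term = counts[term] / total
    return m_agree / u_term


for name in ["Smith", "Linacre"]:
    print(name, round(term_bayes_factor(name), 2))
```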

In [None]:
from sparklink.term_frequencies import make_adjustment_for_term_frequencies
df_e_adj = make_adjustment_for_term_frequencies(df_e, params, settings, retain_adjustment_columns=True, spark=spark)

In [None]:
pdtf = df_e_adj.toPandas()
sam = pdtf.sample(10)
sam[["match_probability", "tf_adjusted_match_prob"] + list(pdtf.columns)]