## Working offline

The data security policies of some organisations mean that Splink sometimes must be used in an offline environment (i.e. no internet connection).

This has two main implications:
- Some of the charts rely on JavaScript libraries that are hosted online. A workaround is needed for these to work offline (`splink.charts.save_offline_chart`)
- The user cannot install `jars` from an online package repository like `maven` using the `'spark.jars.packages'` [config option](https://spark.apache.org/docs/latest/configuration.html#runtime-environment) and instead must download and reference local copies of these jars

This notebook contains code examples that demonstrate how to work around these issues.

## Charts

`splink` provides a function called `splink.charts.save_offline_chart`, which takes any chart output and saves it to a standalone `.html` file. This file can be viewed in a web browser, or in an IFrame in JupyterLab.

This `.html` file needs no internet connection.

The following is an example:

In [1]:
from splink.charts import save_offline_chart

In [2]:
from utility_functions.demo_utils import get_spark
spark = get_spark() # See utility_functions/demo_utils.py for how to set up Spark
df = spark.read.csv("data/fake_1000.csv", header=True)

22/01/11 05:51:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/11 05:51:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

In [3]:
from splink import Splink

settings = {
    "link_type": "dedupe_only",
    "comparison_columns": [
        {
            "col_name": "dob",
            "m_probabilities": [0.38818904757499695, 0.6118109226226807],
            "u_probabilities": [0.9997655749320984, 0.00023440067889168859],
        },
        {
            "col_name": "city",
            "m_probabilities": [0.29216697812080383, 0.7078329920768738],
            "u_probabilities": [0.9105007648468018, 0.08949924260377884],
        }
    ],
    }

linker = Splink(settings, df, spark)

In [4]:
chart_output = linker.model.bayes_factor_chart()
save_offline_chart(chart_output, filename="my_bayes_factor_chart.html", overwrite=True)

Chart saved to my_bayes_factor_chart.html

To view in Jupyter you can use the following command:

from IPython.display import IFrame
IFrame(src="./my_bayes_factor_chart.html", width=1000, height=500)



You can now open `my_bayes_factor_chart.html` in a web browser, or view it in an IFrame as follows:

In [5]:
from IPython.display import IFrame
IFrame(src="./my_bayes_factor_chart.html", width=1000, height=200)
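
Before copying a saved chart to an offline machine, it can be useful to confirm that the file does not reference any remotely hosted assets. The following is a minimal sketch of such a check using only the Python standard library. The names `RemoteAssetFinder` and `find_remote_assets` are hypothetical helpers written for this illustration and are not part of Splink.

```python
from html.parser import HTMLParser
from pathlib import Path


class RemoteAssetFinder(HTMLParser):
    """Collect src/href URLs on asset tags that point at a remote host."""

    ASSET_TAGS = {"script", "link", "img", "iframe"}
    REMOTE_PREFIXES = ("http://", "https://", "//")

    def __init__(self):
        super().__init__()
        self.remote_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.ASSET_TAGS:
            return
        for name, value in attrs:
            if name in {"src", "href"} and value and value.startswith(self.REMOTE_PREFIXES):
                self.remote_urls.append(value)


def find_remote_assets(html_path):
    """Return the remote asset URLs referenced by the HTML file at html_path."""
    finder = RemoteAssetFinder()
    finder.feed(Path(html_path).read_text(encoding="utf-8"))
    return finder.remote_urls
```

Calling `find_remote_assets("my_bayes_factor_chart.html")` returns a list of any remote URLs found. An empty list suggests the file is self-contained, although this check only inspects tag attributes and does not detect URLs loaded dynamically by scripts.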

## Jars

### Similarity functions

As of version 2.0, Splink bundles the Scala similarity jar, which provides efficient Scala-based implementations of common record linkage functions such as `jaro-winkler`.

Splink also provides a function which reports the location of this file.

This `jar` must still be registered with Spark when the SparkContext is created, but Splink now provides more helpful error messages.

An example follows:

In [6]:
# Get the location of the jar
from splink.jar_location import similarity_jar_location
similarity_jar_location()

'/Users/robinlinacre/anaconda3/lib/python3.8/site-packages/splink/jars/scala-udf-similarity-0.0.9.jar'

In [7]:
# Set up Spark to use it
path = similarity_jar_location()

from pyspark.sql import SparkSession, types
from pyspark.context import SparkConf, SparkContext
conf = SparkConf()
# conf.set('spark.driver.extraClassPath', path) # Spark 2.x only, not needed in spark 3
conf.set('spark.jars', path)

sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

spark.udf.registerJavaFunction('jaro_winkler_sim','uk.gov.moj.dash.linkage.JaroWinklerSimilarity',types.DoubleType())


22/01/11 05:51:32 WARN SimpleFunctionRegistry: The function jaro_winkler_sim replaced a previously registered function.


In [8]:
from pyspark.sql import Row
data_list = [
    {"comp_l": 'Robin', "comp_r": 'Rob'},
    {"comp_l": 'Robin', "comp_r": 'Robin'},
    {"comp_l": 'Robin', "comp_r": 'Robyn'},
        ]

df = spark.createDataFrame(Row(**x) for x in data_list)
df.createOrReplaceTempView("df")

sql = """
select comp_l, comp_r, jaro_winkler_sim(comp_l, comp_r) as jaro_score
from df 
"""
spark.sql(sql).toPandas()

                                                                                

|   | comp_l | comp_r | jaro_score |
|---|--------|--------|------------|
| 0 | Robin  | Rob    | 0.906667   |
| 1 | Robin  | Robin  | 1.0        |
| 2 | Robin  | Robyn  | 0.906667   |
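
To cross-check that the registered UDF is returning sensible values, the scores above can be compared against a pure-Python calculation. The following is a sketch of the textbook Jaro-Winkler formula (prefix scale 0.1, prefix capped at 4 characters). It is not the code inside the Scala jar, and the jar's implementation may differ in edge cases. The function names are chosen for this illustration.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are within this distance of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the shared prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


for left, right in [("Robin", "Rob"), ("Robin", "Robin"), ("Robin", "Robyn")]:
    print(left, right, round(jaro_winkler(left, right), 6))
```

For these three pairs the sketch gives 0.906667, 1.0 and 0.906667, which agrees with the `jaro_score` column returned by the UDF.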


## Clustering

The clustering functionality offered by the `splink.cluster.clusters_at_thresholds` function has two requirements:
- The `graphframes` python library corresponding to the user's version of Spark
- A graphframes `jar` corresponding to the user's version of Spark

These libraries are separate from Splink and maintained by third parties, so they are not bundled with Splink.

### Graphframes Python library

The Python package `graphframes` is required to use the `splink.cluster.clusters_at_thresholds` function.

For Spark `2.4.x`, you can `pip install graphframes==0.6.0` or download from https://github.com/graphframes/graphframes/tags

For Spark `>=3.0.0`, the package version you need is not available from PyPI, and you should download the version corresponding to your version of Spark from https://github.com/graphframes/graphframes/releases
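
In an offline environment a missing package cannot be fetched on demand, so it may help to check up front that the required packages are importable. The following is a minimal sketch using the standard library. `missing_packages` is a hypothetical helper written for this illustration, not a Splink function.

```python
import importlib.util


def missing_packages(package_names):
    """Return the top-level package names that cannot be found for import."""
    return [
        name for name in package_names
        if importlib.util.find_spec(name) is None
    ]


# Check the two packages the clustering functionality relies on
missing = missing_packages(["pyspark", "graphframes"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("pyspark and graphframes are both importable")
```

This only confirms that the packages are present. It does not verify that the installed `graphframes` version corresponds to your version of Spark.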

### Graphframes jar

The suggested code for Spark 2.4.5 is:
```
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("my_app")
    .config('spark.driver.extraClassPath', 'jars/graphframes-0.6.0-spark2.3-s_2.11.jar,jars/scala-logging-api_2.11-2.1.2.jar,jars/scala-logging-slf4j_2.11-2.1.2.jar') # Spark 2.x only
    .config('spark.jars', 'jars/graphframes-0.6.0-spark2.3-s_2.11.jar,jars/scala-logging-api_2.11-2.1.2.jar,jars/scala-logging-slf4j_2.11-2.1.2.jar')
    .getOrCreate()
    )

spark.sparkContext.setCheckpointDir("graphframes_tempdir/")
```



Note that `spark.driver.extraClassPath` is needed on Spark `2.x` only. This line must be omitted in Spark `>=3.0.0`.

You can find these jars [here](https://github.com/moj-analytical-services/splink_graph/tree/master/jars)

You can find a list of jars corresponding to different versions of Spark [here](https://mvnrepository.com/artifact/graphframes/graphframes?repo=spark-packages)

More information on adding jars to Spark is available [here](https://spark.apache.org/docs/latest/configuration.html#runtime-environment).
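
Rather than typing the jar paths by hand, the configuration values used in the Spark session example can be assembled from a local directory of downloaded jars. The following is a sketch under the assumption that all required jars sit in a single folder. `local_jar_config` is a hypothetical helper written for this illustration, and it mirrors the comma-separated format used in that example.

```python
from pathlib import Path


def local_jar_config(jar_dir, spark_major_version):
    """Build Spark config entries that reference every .jar file in jar_dir.

    Raises FileNotFoundError if the directory contains no jars, so that a
    missing download is noticed before the Spark session is created.
    """
    jars = sorted(str(path) for path in Path(jar_dir).glob("*.jar"))
    if not jars:
        raise FileNotFoundError(f"No .jar files found in {jar_dir}")

    joined = ",".join(jars)
    config = {"spark.jars": joined}
    # extraClassPath is only set for Spark 2.x, as noted above
    if spark_major_version < 3:
        config["spark.driver.extraClassPath"] = joined
    return config


# Example usage (requires pyspark and a local "jars" directory):
#
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("my_app")
# for key, value in local_jar_config("jars", spark_major_version=2).items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

Failing early when no jars are found avoids a less obvious error later, when Spark first tries to load a class from a jar that was never registered.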
