## Loading and saving models

Data linking tasks can take a long time to run, so it is often useful to save the results. For example, saved model parameters can be applied to new data, or iterations can be resumed from where they left off.

In this demo, we save a model to a JSON file and then reload it.

It assumes you have already completed the [data deduplication quick start](quickstart_demo_deduplication.ipynb).

## Step 1:  Imports and setup

The following boilerplate sets up the Spark session and some other non-essential configuration options.

In [1]:
import logging 
from utility_functions.demo_utils import get_spark

logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)
spark = get_spark()
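The logger named `splink.iterate` that appears in the output later in this demo is a child of the `splink` logger configured above. As a minimal sketch using only the standard library (no Splink import needed), the snippet below shows why setting the level once on `splink` is enough: child loggers with no level of their own inherit the effective level of their parent.

```python
import logging

logging.basicConfig()

# Configure only the parent logger, as in the cell above.
parent = logging.getLogger("splink")
parent.setLevel(logging.DEBUG)

# The child logger has no level of its own (NOTSET), so it inherits DEBUG.
child = logging.getLogger("splink.iterate")
inherited_debug = child.getEffectiveLevel() == logging.DEBUG
print("child inherits DEBUG:", inherited_debug)

# Switch back to INFO to keep the output short; the child follows again.
parent.setLevel(logging.INFO)
inherited_info = child.getEffectiveLevel() == logging.INFO
print("child inherits INFO:", inherited_info)
```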

## Step 2:  Read in data and run linking

In [2]:
df = spark.read.parquet("data/fake_1000.parquet")

settings = {
    "link_type": "dedupe_only",
    "max_iterations": 5,
    "blocking_rules": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
        "l.dob = r.dob"
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "city"
        },
        {
            "col_name": "email"
        }
    ]
}

from splink import Splink

linker = Splink(settings, spark, df=df)
df_e = linker.get_scored_comparisons()
df_e.limit(5).toPandas()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.iterate:Iteration 1 complete
INFO:splink.iterate:Iteration 2 complete
INFO:splink.iterate:Iteration 3 complete
INFO:splink.iterate:Iteration 4 complete


|  | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | prob_gamma_first_name_non_match | prob_gamma_first_name_match | surname_l | surname_r | ... | city_l | city_r | gamma_city | prob_gamma_city_non_match | prob_gamma_city_match | email_l | email_r | gamma_email | prob_gamma_email_non_match | prob_gamma_email_match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.999646 | 0 | 3 | Julia | Julia | 2 | 0.47229 | 0.567037 |  | Taylor | ... | London |  | -1 | 1.0 | 1.0 | hannah88@powers.com | hannah88opowersc@m | 1 | 0.007089 | 0.894782 |
| 1 | 0.985811 | 0 | 2 | Julia | Julia | 2 | 0.47229 | 0.567037 |  | Taylor | ... | London | London | 1 | 0.14658 | 0.780896 | hannah88@powers.com | hannah88@powers.com | 1 | 0.007089 | 0.894782 |
| 2 | 0.985811 | 0 | 1 | Julia | Julia | 2 | 0.47229 | 0.567037 |  | Taylor | ... | London | London | 1 | 0.14658 | 0.780896 | hannah88@powers.com | hannah88@powers.com | 1 | 0.007089 | 0.894782 |
| 3 | 0.916171 | 1 | 3 | Julia | Julia | 2 | 0.47229 | 0.567037 | Taylor | Taylor | ... | London |  | -1 | 1.0 | 1.0 | hannah88@powers.com | hannah88opowersc@m | 1 | 0.007089 | 0.894782 |
| 4 | 0.983115 | 1 | 2 | Julia | Julia | 2 | 0.47229 | 0.567037 | Taylor | Taylor | ... | London | London | 1 | 0.14658 | 0.780896 | hannah88@powers.com | hannah88@powers.com | 1 | 0.007089 | 0.894782 |


## Step 3:  Save model

We are going to save the model settings, current parameters, and iteration history to a file called `saved_model.json`.

In [3]:
linker.save_model_as_json("saved_model.json", overwrite=True)
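Because the model is saved as plain JSON, it can be inspected with the standard library before reloading it. The sketch below is illustrative only: `summarize_json` is a hypothetical helper, not part of Splink, and it makes no assumption about which keys Splink writes. It simply reports the top-level keys and the type of each value.

```python
import json
import os


def summarize_json(path):
    """Return {top-level key: type name} for the JSON document at `path`.

    Hypothetical helper for illustration; it is not part of Splink.
    """
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # Non-object JSON documents (lists, numbers, ...) have no keys to list.
    return {"<root>": type(data).__name__}


# Only inspect the file if the save step above has actually been run.
if os.path.exists("saved_model.json"):
    for key, kind in summarize_json("saved_model.json").items():
        print(f"{key}: {kind}")
```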

## Step 4: Reload model

Reloading the model creates a new Splink object. Its settings are populated from the saved JSON file, and the parameters (the `m_probabilities` and `u_probabilities`) are restored from the same file.

In [4]:
from splink import load_from_json
linker_2 = load_from_json("saved_model.json", spark=spark, df=df) 

# Run another set of iterations, starting from the restored parameters
df_e_2 = linker_2.get_scored_comparisons()
df_e_2.limit(5).toPandas()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.iterate:Iteration 1 complete
INFO:splink.iterate:Iteration 2 complete
INFO:splink.iterate:Iteration 3 complete
INFO:splink.iterate:Iteration 4 complete


|  | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | prob_gamma_first_name_non_match | prob_gamma_first_name_match | surname_l | surname_r | ... | city_l | city_r | gamma_city | prob_gamma_city_non_match | prob_gamma_city_match | email_l | email_r | gamma_email | prob_gamma_email_non_match | prob_gamma_email_match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.999962 | 0 | 3 | Julia | Julia | 2 | 0.469065 | 0.568189 |  | Taylor | ... | London |  | -1 | 1.0 | 1.0 | hannah88@powers.com | hannah88opowersc@m | 1 | 0.001349 | 0.875251 |
| 1 | 0.997613 | 0 | 2 | Julia | Julia | 2 | 0.469065 | 0.568189 |  | Taylor | ... | London | London | 1 | 0.140833 | 0.769179 | hannah88@powers.com | hannah88@powers.com | 1 | 0.001349 | 0.875251 |
| 2 | 0.997613 | 0 | 1 | Julia | Julia | 2 | 0.469065 | 0.568189 |  | Taylor | ... | London | London | 1 | 0.140833 | 0.769179 | hannah88@powers.com | hannah88@powers.com | 1 | 0.001349 | 0.875251 |
| 3 | 0.984606 | 1 | 3 | Julia | Julia | 2 | 0.469065 | 0.568189 | Taylor | Taylor | ... | London |  | -1 | 1.0 | 1.0 | hannah88@powers.com | hannah88opowersc@m | 1 | 0.001349 | 0.875251 |
| 4 | 0.997146 | 1 | 2 | Julia | Julia | 2 | 0.469065 | 0.568189 | Taylor | Taylor | ... | London | London | 1 | 0.140833 | 0.769179 | hannah88@powers.com | hannah88@powers.com | 1 | 0.001349 | 0.875251 |
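A common way to combine Steps 2 to 4 is to reuse a saved model when one exists and only start from scratch otherwise. The following is a minimal sketch of that pattern, not a Splink feature: `load_or_train` is a hypothetical helper that takes the loading and training steps as callables, so it does not depend on any particular Splink signature.

```python
import os


def load_or_train(path, load_fn, train_fn):
    """Return (model, was_loaded).

    Hypothetical helper for illustration; it is not part of Splink.
    If `path` exists, the model comes from load_fn(path); otherwise
    train_fn() is called to build a fresh one.
    """
    if os.path.exists(path):
        return load_fn(path), True
    return train_fn(), False


# With the objects from this demo, the helper could be wired up like this
# (left commented out because it needs a running Spark session):
#
# linker, was_loaded = load_or_train(
#     "saved_model.json",
#     load_fn=lambda p: load_from_json(p, spark=spark, df=df),
#     train_fn=lambda: Splink(settings, spark, df=df),
# )
```

Passing the two steps in as callables keeps the decision logic separate from Spark, which also makes it easy to check without a cluster.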


In [5]:
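# The iteration history now covers 10 iterations: 5 from the original run and 5 after reloading.
# overwrite=True replaces more_charts.html if it already exists; without it, this call raises
# "ValueError: The path more_charts.html already exists. Please provide a different path."
linker_2.params.all_charts_write_html_file("more_charts.html", overwrite=True)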