# Deduplication tutorial
This example shows how to de-duplicate a small dataset using simple settings.

The aim is to demonstrate core Splink functionality succinctly, rather than to document all configuration options comprehensively.

We use the DuckDB backend, which is the recommended option for smaller datasets of up to around 1 million records on a normal laptop.

In [1]:
from splink.duckdb.duckdb_linker import DuckDBLinker

# Read in the data

In [2]:
import pandas as pd 
pd.options.display.max_rows = 1000
df = pd.read_csv("./data/fake_1000.csv")
df.head(5)

Unnamed: 0,unique_id,first_name,surname,dob,city,email,cluster
0,0,Robert,Alan,1971-06-24,,robert255@smith.net,0
1,1,Robert,Allen,1971-05-24,,roberta25@smith.net,0
2,2,Rob,Allen,1971-06-24,London,roberta25@smith.net,0
3,3,Robert,Alen,1971-06-24,Lonon,,0
4,4,Grace,,1997-04-26,Hull,grace.kelly52@jones.com,1


Note that the `cluster` column represents the 'ground truth': it tells us which rows refer to the same person. In most real linkage scenarios, we wouldn't have this column (it is what Splink is trying to estimate).
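To make the role of the ground truth concrete, here is a minimal pandas sketch that counts how many records belong to each true entity. The `sample` frame is typed in by hand from the five rows shown above and is only an illustration; it is not part of the Splink API.

```python
import pandas as pd

# Hand-typed copy of the five rows displayed above (illustrative only)
sample = pd.DataFrame(
    {
        "unique_id": [0, 1, 2, 3, 4],
        "first_name": ["Robert", "Robert", "Rob", "Robert", "Grace"],
        "cluster": [0, 0, 0, 0, 1],
    }
)

# Number of records that the ground truth assigns to each person
cluster_sizes = sample.groupby("cluster")["unique_id"].count()
print(cluster_sizes.to_dict())
```

In these five rows, the first four records all describe one person despite the differing spellings, which is the kind of grouping a deduplication model has to recover without the label.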

# Exploratory analysis

Splink contains exploratory analysis tools designed to highlight the aspects of your data most relevant to data linking, such as missingness and skew, and to show whether further data cleaning may be needed prior to linking.

In [3]:
# Initialise the linker, passing in the input dataset(s)
linker = DuckDBLinker(df)

In [4]:
import altair as alt
alt.renderers.enable('default')
# Chart the proportion of missing (null) values in each column
linker.missingness_chart()
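As a rough cross-check on what the missingness chart summarises, the share of null values per column can be computed directly in pandas. This is a sketch on a hand-typed copy of the five rows shown earlier, not Splink's own implementation; the `sample` and `null_share` names are illustrative.

```python
import pandas as pd

# Hand-typed copy of the five rows displayed earlier (illustrative only)
sample = pd.DataFrame(
    {
        "first_name": ["Robert", "Robert", "Rob", "Robert", "Grace"],
        "surname": ["Alan", "Allen", "Allen", "Alen", None],
        "city": [None, None, "London", "Lonon", "Hull"],
        "email": [
            "robert255@smith.net",
            "roberta25@smith.net",
            "roberta25@smith.net",
            None,
            "grace.kelly52@jones.com",
        ],
    }
)

# Fraction of rows in which each column is null
null_share = sample.isna().mean()
print(null_share)
```

Columns with a high share of nulls carry less information for matching, so they are worth spotting before choosing comparison columns.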

The `profile_columns` method creates summary charts. You may input column names (e.g. `first_name`), or arbitrary SQL expressions like `concat(first_name, surname)`.

In [5]:
linker.profile_columns(["first_name", "city", "substr(dob, 1,4)"], top_n=10, bottom_n=5)
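The profiling charts are built around how often each value occurs. The sketch below approximates that idea in pandas for the expression `substr(dob, 1,4)`, using a hand-typed copy of the five `dob` values shown earlier. It is an illustration of value-frequency profiling under that assumption, not Splink's implementation.

```python
import pandas as pd

# Hand-typed copy of the dob column from the five rows displayed earlier
sample = pd.DataFrame(
    {
        "dob": [
            "1971-06-24",
            "1971-05-24",
            "1971-06-24",
            "1971-06-24",
            "1997-04-26",
        ]
    }
)

# pandas equivalent of the SQL expression substr(dob, 1, 4): the year of birth
year = sample["dob"].str[:4]

# Frequency of each distinct value, most common first
year_counts = year.value_counts()
print(year_counts.to_dict())
```

A heavily skewed distribution, where a few values account for most records, means that agreement on that column is weaker evidence of a match.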

In [6]:
# Count the pairwise record comparisons that this blocking rule would generate
linker.compute_number_of_comparisons_generated_by_blocking_rule("l.first_name = r.first_name")

{'count_of_pairwise_comparisons_generated': 1998}
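To see where a count like this comes from, the blocking rule `l.first_name = r.first_name` can be imitated with a pandas self-join. The sketch below runs on a hand-typed copy of the five rows shown earlier and assumes that, when deduplicating, each unordered pair of records is counted once and records are not compared with themselves. It is not Splink's implementation.

```python
import pandas as pd

# Hand-typed copy of the five rows displayed earlier (illustrative only)
sample = pd.DataFrame(
    {
        "unique_id": [0, 1, 2, 3, 4],
        "first_name": ["Robert", "Robert", "Rob", "Robert", "Grace"],
    }
)

# Self-join on the blocking key, mirroring l.first_name = r.first_name
pairs = sample.merge(sample, on="first_name", suffixes=("_l", "_r"))

# Keep each unordered pair once and drop self-comparisons (assumption)
pairs = pairs[pairs["unique_id_l"] < pairs["unique_id_r"]]

n_comparisons = len(pairs)
print(n_comparisons)
```

Without any blocking, five records would produce 5 * 4 / 2 = 10 candidate pairs, so the rule reduces the work but also misses the "Rob" record, which illustrates the trade-off between speed and recall.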

In [7]:
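# Load a saved model specification, then chart the proportion of records
# that cannot be linked at a given match threshold
linker.load_settings_from_json("./demo_settings/real_time_settings.json")
linker.unlinkables_chart()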
