# Training a record pair classifier for Customer Data Deduplication

<table>
  <td>
    <a target="_blank" href="https://www.recogn.ai/biome-text/documentation/tutorials/Training_a_record_pair_classifier_for_Customer_Data_Deduplication.html"><img src="https://www.recogn.ai/biome-text/assets/img/biome-isotype.svg" width=32 />View on recogn.ai</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/recognai/biome-text/blob/master/docs/docs/documentation/tutorials/Training_a_record_pair_classifier_for_Customer_Data_Deduplication.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/recognai/biome-text/blob/master/docs/docs/documentation/tutorials/Training_a_record_pair_classifier_for_Customer_Data_Deduplication.ipynb"><img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width=32 />View source on GitHub</a>
  </td>
</table>

In this tutorial we will train a record pair classifier to detect possible duplicates in customer data.

Matching and deduplicating customer records is a challenge frequently encountered by large businesses that manage thousands of customers in their databases.
A first rough selection of possible matches is often made by [fuzzy matching](https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)) techniques or the computation of some [string metric](https://en.wikipedia.org/wiki/String_metric) (like the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)).
In a second step, the fine-grained match often requires human intervention or is made by an AI model that takes advantage of previous decisions taken by humans.
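
As a rough illustration of the first step, the following sketch computes the Levenshtein distance between two strings and uses it to preselect candidate pairs. It is not part of *biome.text*; the helper names and the `max_distance` threshold are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions
    needed to turn `a` into `b` (classic dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def is_candidate_pair(a: str, b: str, max_distance: int = 2) -> bool:
    """Rough preselection: keep pairs whose lowercased strings are close.
    The threshold of 2 is an arbitrary choice for this sketch."""
    return levenshtein(a.lower(), b.lower()) <= max_distance


print(levenshtein("Werner", "Wernre"))
print(is_candidate_pair("Werner", "Wernre"))
print(is_candidate_pair("Hugo", "Karl-Heinz"))
```

Pairs that pass such a cheap filter would then be handed to the classifier trained in this tutorial for the fine-grained decision.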

This tutorial covers the latter approach, for which we use a data set curated by [Uniserv](https://www.uniserv.com/en/) that contains pairs of fictional customer records with a corresponding label.
With this data we will train a binary classifier that predicts whether two given records are duplicates.

When running this tutorial in Google Colab, make sure to install *biome.text* first:

In [None]:
#!pip install -U git+https://github.com/recognai/biome-text.git

Ignore warnings and don't forget to restart your runtime afterwards (*Runtime -> Restart runtime*).

## Explore the data

Let's take a look at the data before starting with the configuration of our pipeline.
For this we create a `DataSource` instance providing a path to our data.

In [2]:
from biome.text.data import DataSource

In [3]:
train_ds = DataSource(source="https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/record_pair_classifier/train.json")
train_ds.head()

Unnamed: 0,record1,record2,target,path
0,"{'@gender': 'xxx', '@firstname': 'Werner', '@l...","{'@gender': 'xxx', '@firstname': 'Wernre', '@l...",duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
1,"{'@gender': 'Herr', '@firstname': 'Hugo', '@la...","{'@gender': 'Herr', '@firstname': 'Hugo', '@la...",not_duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
2,"{'@gender': 'Herr', '@firstname': 'Peter', '@l...","{'@gender': 'Herr', '@firstname': 'Pteer Hans'...",not_duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
3,"{'@gender': 'Herr', '@firstname': 'Karl-Heinz'...","{'@gender': 'Herr', '@firstname': 'Karl-Heinz'...",duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
4,"{'@gender': 'xxx', '@firstname': 'Walter', '@l...","{'@gender': 'xxx', '@firstname': 'Watler', '@l...",duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
5,"{'@gender': 'Frau', '@firstname': 'Lissa', '@l...","{'@gender': 'Frau', '@firstname': 'Lisssa', '@...",duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
6,"{'@gender': 'Herr', '@firstname': 'Siegfried',...","{'@gender': 'Herr', '@firstname': 'Siegfriedd'...",duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
7,"{'@gender': 'xxx', '@firstname': 'Franz', '@la...","{'@gender': 'xxx', '@firstname': 'Franz', '@la...",not_duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
8,"{'@gender': 'Herr', '@firstname': 'Peter', '@l...","{'@gender': 'Herr', '@firstname': 'Peterr', '@...",duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...
9,"{'@gender': 'Herr', '@firstname': 'Jens', '@la...","{'@gender': 'Herr', '@firstname': 'Jenns', '@l...",duplicate,https://biome-tutorials-data.s3-eu-west-1.amaz...


As we can see we have three relevant columns for our task: *record1*, *record2* and *target*. 
The *path* column is added automatically by the [DataSource](../../api/biome/text/data/datasource.html#datasource) class to keep track of the source file.

Each record is a Python dictionary that maps the field names of the record to their values.
It is helpful to mark the field names with a specific token (here the `@` prefix) so that the model can explicitly separate them from the values.
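
If your own records do not carry such a marker yet, it could be added along the lines of the sketch below. The helper functions and the example record are hypothetical and only illustrate the idea; they do not show how *biome.text* processes records internally.

```python
FIELD_MARKER = "@"  # same marker as in the tutorial data, e.g. "@firstname"


def mark_field_names(record: dict, marker: str = FIELD_MARKER) -> dict:
    """Prefix every field name with the marker unless it already carries it."""
    return {
        name if name.startswith(marker) else f"{marker}{name}": value
        for name, value in record.items()
    }


def flatten(record: dict) -> list:
    """Turn a record into an alternating list of field names and values."""
    tokens = []
    for name, value in record.items():
        tokens.extend([name, str(value)])
    return tokens


# Hypothetical record without markers (values are made up for illustration).
raw_record = {"gender": "Herr", "firstname": "Hugo"}
marked_record = mark_field_names(raw_record)
tokens = flatten(marked_record)

# Thanks to the marker, field names can be told apart from values.
field_names = [token for token in tokens if token.startswith(FIELD_MARKER)]
print(tokens)
print(field_names)
```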

Since the [TaskHead](../../api/biome/text/modules/heads/task_head.html#taskhead) of our model (the [RecordPairClassification](../../api/biome/text/modules/heads/classification/record_pair_classification.html#recordpairclassification)) will expect a *record1*, *record2* and a *label* column to be present in the dataframe, we need to provide a `mapping` dictionary to adjust for the different names:

In [4]:
train_ds.mapping = {
    "record1": "record1", 
    "record2": "record2", 
    "label": "target"
}
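
Conceptually, the mapping tells the pipeline which source column feeds which model input. The plain-Python sketch below mimics this renaming on a single made-up row; `apply_mapping` is a hypothetical helper, and keeping only the mapped columns is an assumption of this sketch, not a statement about the `DataSource` implementation.

```python
# Keys are the names the model expects, values are the column names in our data.
mapping = {"record1": "record1", "record2": "record2", "label": "target"}


def apply_mapping(row: dict, mapping: dict) -> dict:
    """Return a new row keyed by the model's input names."""
    return {
        model_name: row[source_column]
        for model_name, source_column in mapping.items()
    }


# Made-up row with the same column names as the training data.
row = {
    "record1": {"@firstname": "Hugo"},
    "record2": {"@firstname": "Hugo"},
    "target": "duplicate",
    "path": "train.json",
}

mapped_row = apply_mapping(row, mapping)
print(sorted(mapped_row))
print(mapped_row["label"])
```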

The [DataSource](../../api/biome/text/data/datasource.html#datasource) class stores the data in an underlying [Dask DataFrame](https://docs.dask.org/en/latest/dataframe.html) that you can easily access.
For example, let's check the size of our training data:

In [5]:
len(train_ds.to_dataframe())

10000

Or let's check the distribution of our labels:

In [6]:
df = train_ds.to_mapped_dataframe().compute()
df.label.value_counts()

duplicate        5697
not_duplicate    4303
Name: label, dtype: int64
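
These counts also give us a simple reference point for later: a trivial classifier that always predicts the most frequent label. The short calculation below uses only the numbers printed above; the variable names are chosen for this example.

```python
# Label counts as printed by value_counts() above.
label_counts = {"duplicate": 5697, "not_duplicate": 4303}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of always predicting the majority label on the training set.
majority_baseline = label_counts[majority_label] / total

print(majority_label, round(majority_baseline, 4))
```

A trained model should clearly exceed this accuracy of roughly 57% to be of any use.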

## Configure your *biome.text* Pipeline

A typical [Pipeline](../../api/biome/text/pipeline.html#pipeline) consists of tokenizing the input, extracting features, optionally applying a language encoding, and finally executing a task-specific head.

After training a pipeline, you can use it to make predictions or explore the underlying model via the [explore UI](../../documentation/user-guides/02.explore.html).

As a first step we must define a configuration for our pipeline. 
In this tutorial we will create a configuration dictionary and use the `Pipeline.from_config()` method to create our pipeline, but there are [other ways](../../api/biome/text/pipeline.html#pipeline).

A *biome.text* pipeline has the following main components:

```yaml
name: # a descriptive name of your pipeline

tokenizer: # how to tokenize the input

features: # input features of the model

encoder: # the language encoder

head: # your task configuration

```

See the [Configuration section](../../documentation/user-guides/05.configuration.html) for a detailed description of how these main components can be configured.

Our complete configuration for this tutorial is the following:

In [23]:
pipeline_dict = {
    "name": "uniserv_record_pairs",
    
    "tokenizer": {
        "text_cleaning": {
            "rules": ["strip_spaces"]
        }
    },
    
    "features": {
#         "words": {
#             "embedding_dim": 32,
#             "lowercase_tokens": True,
#         },
        "chars": {
            "embedding_dim": 64,
            "dropout": 0.1,
            "encoder": {
                "type": "gru",
                "hidden_size": 128,
                "num_layers": 1,
                "bidirectional": True,
            },
            "lowercase_characters": True,
        },
    },
    
    "head": {
        "type": "RecordPairClassification",
        "labels": list(df.label.value_counts().index),
        "dropout": 0.1,
        "field_encoder": {
            "type": "gru",
            "bidirectional": False,
            "hidden_size": 64,
            "num_layers": 1,
        },
        "record_encoder": {
            "type": "gru",
            "bidirectional": True,
            "hidden_size": 32,
            "num_layers": 1,
        },
        "matcher_forward": {
            "is_forward": True,
            "num_perspectives": 10,
            "with_full_match": False,
        },
        "matcher_backward": {
            "is_forward": False,
            "num_perspectives": 10,
            "with_full_match": False,
        },
        "aggregator": {
            "type": "gru",
            "bidirectional": True,
            "hidden_size": 32,
            "num_layers": 1,
            "dropout": 0.0,
        },
        "classifier_feedforward": {
            "num_layers": 1,
            "hidden_dims": [32],
            "activations": ["relu"],
            "dropout": [0.0],
        },
        "initializer": {
            "regexes": [
                ["_output_layer.weight", {"type": "xavier_normal"}],
                ["_output_layer.bias", {"type": "constant", "val": 0}],
                [".*linear_layers.*weight", {"type": "xavier_normal"}],
                [".*linear_layers.*bias", {"type": "constant", "val": 0}],
                [".*weight_ih.*", {"type": "xavier_normal"}],
                [".*weight_hh.*", {"type": "orthogonal"}],
                [".*bias.*", {"type": "constant", "val": 0}],
                [".*matcher.*match_weights.*", {"type": "kaiming_normal"}],
            ],
        },
    },
}

In [24]:
from biome.text import Pipeline

In [25]:
pl = Pipeline.from_config(pipeline_dict)

## Create a vocabulary

Before we can start the training we need to create the vocabulary for our model.
For this we define a `VocabularyConfiguration`.

The pipeline in this tutorial only uses the character feature, so we include every character found in the training set and do not set a minimum count.
If you enable the word feature, you can restrict its vocabulary to the most frequent words in the training set via the `min_count` argument. For a complete list of available arguments see the [VocabularyConfiguration API](../../api/biome/text/configuration.html#vocabularyconfiguration).

In [10]:
from biome.text.configuration import VocabularyConfiguration, WordFeatures

In [26]:
# To restrict a word vocabulary, you could add: min_count={WordFeatures.namespace: 5000}
vocab_config = VocabularyConfiguration(sources=[train_ds])

We then pass this configuration to our `Pipeline` to create the vocabulary:

In [27]:
pl.create_vocabulary(vocab_config)

After creating the vocabulary we can check the size of our entire model in terms of trainable parameters:

In [28]:
pl.trainable_parameters

124930

In [29]:
pl.backbone.vocab.get_token_to_index_vocabulary("char")

{'@@PADDING@@': 0,
 '@@UNKNOWN@@': 1,
 'e': 2,
 'r': 3,
 't': 4,
 '@': 5,
 'n': 6,
 's': 7,
 'a': 8,
 'i': 9,
 'm': 10,
 'u': 11,
 'h': 12,
 'd': 13,
 'o': 14,
 'l': 15,
 'c': 16,
 'g': 17,
 'b': 18,
 'f': 19,
 'p': 20,
 'z': 21,
 'y': 22,
 'x': 23,
 'k': 24,
 '3': 25,
 '4': 26,
 '5': 27,
 '7': 28,
 '6': 29,
 '2': 30,
 '1': 31,
 '.': 32,
 'w': 33,
 '9': 34,
 '8': 35,
 '0': 36,
 '-': 37,
 'ü': 38,
 ',': 39,
 'ö': 40,
 'j': 41,
 'v': 42,
 'ß': 43,
 'ä': 44,
 'q': 45,
 'é': 46,
 'è': 47,
 'ó': 48,
 'á': 49,
 '/': 50,
 'í': 51,
 "'": 52}

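The vocabulary above maps each character to an integer index, with two special entries for padding and unknown characters. Note that `@` is part of it because the field names carry this marker. The sketch below shows one way such a frequency-ordered character vocabulary could be built; it is a simplified illustration with hypothetical helper names, not the code *biome.text* runs internally.

```python
from collections import Counter

PADDING, UNKNOWN = "@@PADDING@@", "@@UNKNOWN@@"


def build_char_vocab(texts, lowercase: bool = True) -> dict:
    """Map characters to indices, most frequent characters first."""
    counter = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        # Whitespace is skipped in this sketch.
        counter.update(char for char in text if not char.isspace())
    vocab = {PADDING: 0, UNKNOWN: 1}
    for char, _count in counter.most_common():
        vocab[char] = len(vocab)
    return vocab


def encode(text: str, vocab: dict, lowercase: bool = True) -> list:
    """Replace each character by its index, unseen ones by the UNKNOWN index."""
    if lowercase:
        text = text.lower()
    return [vocab.get(char, vocab[UNKNOWN]) for char in text if not char.isspace()]


vocab = build_char_vocab(["Werner", "Wernre"])
print(vocab)
print(encode("Wexner", vocab))
```
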
In [30]:
from biome.text.configuration import TrainerConfiguration

In [31]:
valid_ds = DataSource(
    source="https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/record_pair_classifier/valid.json",
    mapping={"record1": "record1", "record2": "record2", "label": "target"}
)

In [32]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adam",
        "lr": 0.002,
    },
    batch_size=32,
    num_epochs=5,
    cuda_device=0,
)
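
To get a feeling for the length of the training run, we can estimate the number of batches from the values above. This back-of-the-envelope calculation assumes that the last, smaller batch of each epoch is kept and that training is not stopped early.

```python
import math

num_examples = 10000  # size of the training set, as computed earlier
batch_size = 32       # from the trainer configuration
num_epochs = 5        # from the trainer configuration

# 10000 / 32 = 312.5, so the last batch of each epoch is only half full.
batches_per_epoch = math.ceil(num_examples / batch_size)
max_optimizer_steps = batches_per_epoch * num_epochs

print(batches_per_epoch, max_optimizer_steps)
```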

In [33]:
pl.train(
    output="output_rpc2",
    training=train_ds,
    validation=valid_ds,
    trainer=trainer_config,
)

INFO:allennlp.common.params:validation_dataset_reader = None
INFO:allennlp.common.params:train_data_path = output_rpc2/.datasources/training_train.json.yml
INFO:allennlp.common.params:validation_data_path = output_rpc2/.datasources/validation_valid.json.yml
INFO:allennlp.common.params:test_data_path = None
INFO:allennlp.common.params:random_seed = 13370
INFO:allennlp.common.params:numpy_seed = 1337
INFO:allennlp.common.params:pytorch_seed = 133
INFO:allennlp.common.checks:Pytorch version: 1.5.0
INFO:allennlp.common.params:trainer.no_grad = ()
INFO:allennlp.common.params:trainer.type = gradient_descent
INFO:allennlp.common.params:trainer.local_rank = 0
INFO:allennlp.common.params:trainer.patience = 2
INFO:allennlp.common.params:trainer.validation_metric = -loss
INFO:allennlp.common.params:trainer.num_epochs = 5
INFO:allennlp.common.params:trainer.cuda_device = 0
INFO:allennlp.common.params:trainer.grad_norm = None
INFO:allennlp.common.params:trainer.grad_clipping = None
INFO:allennlp.co

INFO:allennlp.training.trainer:Validating

INFO:allennlp.training.tensorboard_writer:                             Training |  Validation
INFO:allennlp.training.tensorboard_writer:macro/fscore             |     0.646  |     0.744
INFO:allennlp.training.tensorboard_writer:_fscore/not_duplicate    |     0.519  |     0.656
INFO:allennlp.training.tensorboard_writer:cpu_memory_MB            |  6737.108  |       N/A
INFO:allennlp.training.tensorboard_writer:_precision/not_duplicate |     0.792  |     0.925
INFO:allennlp.training.tensorboard_writer:macro/precision          |     0.729  |     0.826
INFO:allennlp.training.tensorboard_writer:loss                     |     0.576  |     0.487
INFO:allennlp.training.tensorboard_writer:micro/fscore             |     0.692  |     0.774
INFO:allennlp.training.tensorboard_writer:_fscore/duplicate        |     0.774  |     0.831
INFO:allennlp.training.tensorboard_writer:_precision/duplicate     |     0.666  |     0.727
INFO:allennlp.training.tensorboard_writer:_recall/duplicate        |     0.923




INFO:allennlp.training.trainer:Validating

INFO:allennlp.training.tensorboard_writer:                             Training |  Validation
INFO:allennlp.training.tensorboard_writer:macro/fscore             |     0.749  |     0.722
INFO:allennlp.training.tensorboard_writer:_fscore/not_duplicate    |     0.667  |     0.620
INFO:allennlp.training.tensorboard_writer:cpu_memory_MB            |  7350.636  |       N/A
INFO:allennlp.training.tensorboard_writer:_precision/not_duplicate |     0.923  |     0.949
INFO:allennlp.training.tensorboard_writer:macro/precision          |     0.825  |     0.830
INFO:allennlp.training.tensorboard_writer:loss                     |     0.478  |     0.470
INFO:allennlp.training.tensorboard_writer:micro/fscore             |     0.775  |     0.760
INFO:allennlp.training.tensorboard_writer:_fscore/duplicate        |     0.831  |     0.825
INFO:allennlp.training.tensorboard_writer:_precision/duplicate     |     0.728  |     0.711
INFO:allennlp.training.tensorboard_writer:_recall/duplicate        |     0.967




INFO:allennlp.training.trainer:Validating

INFO:allennlp.training.tensorboard_writer:                             Training |  Validation
INFO:allennlp.training.tensorboard_writer:macro/fscore             |     0.748  |     0.747
INFO:allennlp.training.tensorboard_writer:_fscore/not_duplicate    |     0.666  |     0.661
INFO:allennlp.training.tensorboard_writer:cpu_memory_MB            |  7350.636  |       N/A
INFO:allennlp.training.tensorboard_writer:_precision/not_duplicate |     0.912  |     0.922
INFO:allennlp.training.tensorboard_writer:macro/precision          |     0.820  |     0.826
INFO:allennlp.training.tensorboard_writer:loss                     |     0.456  |     0.430
INFO:allennlp.training.tensorboard_writer:micro/fscore             |     0.774  |     0.775
INFO:allennlp.training.tensorboard_writer:_fscore/duplicate        |     0.829  |     0.832
INFO:allennlp.training.tensorboard_writer:_precision/duplicate     |     0.728  |     0.730
INFO:allennlp.training.tensorboard_writer:_recall/duplicate        |     0.962




INFO:allennlp.training.trainer:Validating

INFO:allennlp.training.tensorboard_writer:                             Training |  Validation
INFO:allennlp.training.tensorboard_writer:macro/fscore             |     0.750  |     0.753
INFO:allennlp.training.tensorboard_writer:_fscore/not_duplicate    |     0.678  |     0.751
INFO:allennlp.training.tensorboard_writer:cpu_memory_MB            |  7350.636  |       N/A
INFO:allennlp.training.tensorboard_writer:_precision/not_duplicate |     0.857  |     0.658
INFO:allennlp.training.tensorboard_writer:macro/precision          |     0.797  |     0.767
INFO:allennlp.training.tensorboard_writer:loss                     |     0.426  |     0.419
INFO:allennlp.training.tensorboard_writer:micro/fscore             |     0.771  |     0.754
INFO:allennlp.training.tensorboard_writer:_fscore/duplicate        |     0.822  |     0.756
INFO:allennlp.training.tensorboard_writer:_precision/duplicate     |     0.737  |     0.876
INFO:allennlp.training.tensorboard_writer:_recall/duplicate        |     0.929




INFO:allennlp.training.trainer:Validating

INFO:allennlp.training.tensorboard_writer:                             Training |  Validation
INFO:allennlp.training.tensorboard_writer:macro/fscore             |     0.788  |     0.829
INFO:allennlp.training.tensorboard_writer:_fscore/not_duplicate    |     0.750  |     0.815
INFO:allennlp.training.tensorboard_writer:cpu_memory_MB            |  7350.636  |       N/A
INFO:allennlp.training.tensorboard_writer:_precision/not_duplicate |     0.790  |     0.759
INFO:allennlp.training.tensorboard_writer:macro/precision          |     0.794  |     0.829
INFO:allennlp.training.tensorboard_writer:loss                     |     0.365  |     0.343
INFO:allennlp.training.tensorboard_writer:micro/fscore             |     0.795  |     0.830
INFO:allennlp.training.tensorboard_writer:_fscore/duplicate        |     0.827  |     0.843
INFO:allennlp.training.tensorboard_writer:_precision/duplicate     |     0.798  |     0.899
INFO:allennlp.training.tensorboard_writer:_recall/duplicate        |     0.857
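
The logged metrics are related to each other. Assuming the reported F-score is the standard F1 (the harmonic mean of precision and recall) and that the macro average is the unweighted mean over both labels, the values of the last epoch can be reproduced approximately from the rounded numbers in the log:

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Training values for the "duplicate" label in the last epoch, taken from the log.
duplicate_f1 = f_score(precision=0.798, recall=0.857)

# Macro F-score: unweighted mean of the per-label F-scores of the last epoch.
train_macro_f1 = (0.750 + 0.827) / 2
valid_macro_f1 = (0.815 + 0.843) / 2

# Small deviations from the log are expected, since the inputs are rounded.
print(round(duplicate_f1, 3), round(train_macro_f1, 4), round(valid_macro_f1, 3))
```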


