# Using Raw String Similarity Features

The paper [@bilenko2002learning] outlines a process for learning an edit
distance between two strings from a small training set of coreferent string
pairs.
Coreferent strings are strings that appear in references to the same entity.
The idea is to fine-tune the algorithm that computes the string edit distance to
the business domain in which we perform entity resolution. Doing so increases
the chances of correctly recognizing records that refer to the same entity.
The paper also references an earlier algorithm [@ristad1998learning] that learns
a string edit distance from training data. That method is computationally
intensive and other methods have superseded it.
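
To make the idea of a tunable edit distance concrete, here is a minimal sketch
of an edit distance whose operation costs are parameters. The function name
`weighted_edit_distance` and its cost arguments are hypothetical and chosen for
illustration only. The cited papers learn far richer, probabilistic models;
this sketch only shows how changing the costs changes the resulting distance.

```python
def weighted_edit_distance(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Edit distance with configurable operation costs (illustrative only)."""
    # prev[j] is the cost of turning the processed prefix of `a` into b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + del_cost,  # delete ca
                    curr[j - 1] + ins_cost,  # insert cb
                    prev[j - 1] + (0.0 if ca == cb else sub_cost),  # match/substitute
                )
            )
        prev = curr
    return prev[-1]
```

With unit costs this reduces to the classic Levenshtein distance. Lowering
`sub_cost`, for example, makes strings that differ only by substituted
characters look closer, which is the kind of adjustment a learned distance
would make from coreferent training pairs.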

Another finding suggests that accuracy gains can be made by training the SVM
kernel that performs entity resolution in a specific way. This notebook goes
through that training procedure and compares the results obtained with SVMs to
those obtained with logistic regression.

In [None]:
%load_ext tensorboard

In [None]:
import os

import polars as pl
from transformers import AutoTokenizer

DATADIR = os.path.abspath("../../data")

BERT_MODEL_NAME = "roberta-base"
LEFT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Abt.csv")
RIGHT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Buy.csv")
GROUND_TRUTH_PATH = os.path.join(DATADIR, "abt-buy", "abt_buy_perfectMapping.csv")

In [None]:
from matchescu.matching.extraction import Traits, CsvDataSource

# set up abt extraction
abt_traits = list(Traits().int([0]).string([1, 2]).currency([3]))
abt = CsvDataSource(name="abt", traits=abt_traits).read_csv(LEFT_CSV_PATH)
# set up buy extraction
buy_traits = list(Traits().int([0]).string([1, 2, 3]).currency([4]))
buy = CsvDataSource(name="buy", traits=buy_traits).read_csv(RIGHT_CSV_PATH)
# set up ground truth
gt = set(
    pl.read_csv(
        GROUND_TRUTH_PATH,
        ignore_errors=True,
    ).iter_rows()
)

def _id(row):
    # the record identifier is stored in the first column of each source
    return row[0]

Let's set up a couple of functions that will help us shortly.

We need to be able to perform a k-fold split of the training/test data to
reproduce the results for classic edit distance from the paper
[@bilenko2002learning].
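
As a rough illustration of the k-fold split mentioned above, the following
sketch uses only the standard library. The helper name `k_fold_splits` and its
`seed` parameter are assumptions made for this example; they are not part of
`matchescu` or of the paper's experimental setup.

```python
import random


def k_fold_splits(items, k, seed=42):
    """Yield (train, test) lists for each of k folds over the shuffled items."""
    shuffled = list(items)
    if not 2 <= k <= len(shuffled):
        raise ValueError("k must be between 2 and the number of items")
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Striding assigns every item to exactly one fold.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each item lands in the test set of exactly one fold and in the training set of
the remaining `k - 1` folds.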

We will use the Levenshtein distance between coreferent strings to train an
SVM kernel and a logistic regression model. We wouldn't be able to use a Naive
Bayes approach or classic pattern matching here since those don't work with
continuous feature values.
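
To show what continuous features could look like, the sketch below computes a
normalized Levenshtein similarity per attribute and collects the values into a
feature vector. The helper names (`levenshtein`, `similarity`,
`feature_vector`) and the dictionary-based record layout are assumptions for
this example, not the `matchescu` API.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (ca != cb),  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map the distance to [0, 1], where 1 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def feature_vector(left, right, attributes):
    """One continuous similarity value per compared attribute."""
    return [similarity(left[attr], right[attr]) for attr in attributes]
```

A classifier such as an SVM or logistic regression can consume these vectors
directly, one vector per candidate record pair.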

In [None]:
from matchescu.matching.blocking import BlockEngine


blocker = BlockEngine().add_source(abt, _id).add_source(buy, _id).tf_idf(0.25)
blocker.filter_candidates_jaccard(0.6)
blocker.update_candidate_pairs(False)
metrics = blocker.calculate_metrics(gt)

display(metrics)

Now let's set the stable global parameters:

- the fold count (i.e. the number of folds we split the input data into)
- the models (i.e. a mapping of user-friendly model name to training function)

In [None]:
from matchescu.matching.ml.ditto._ditto_dataset import DittoDataset


tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL_NAME)
ds = DittoDataset(
    blocker,
    _id,
    _id,
    gt,
    tokenizer,
    left_cols=("name", "description", "price"),
    right_cols=("name", "description", "manufacturer", "price"),
    size=1000,
)

For the deduplication scenario, we want to both reproduce the findings of
Bilenko et al. [@bilenko2002learning] and test how logistic regression compares
to SVM.

We start by preparing a data deduplication dataframe.

In [None]:
from matchescu.matching.ml.ditto._ditto_module import DittoModel

ditto = DittoModel(BERT_MODEL_NAME)
ditto.run_training(ds, BERT_MODEL_NAME)

In [None]:
%tensorboard --logdir ./logs