# Using Raw String Similarity Features

The paper [@bilenko2002learning] outlines a process by which we can learn the
edit distance between two strings using minimal training over pairs of
coreferent strings.
Coreferent strings are strings found in references to the same entity.
The idea is to fine tune the algorithm that computes the string edit distance to
the business domain where we perform entity resolution. By doing so we increase
the chances of recognizing records that refer to the same entity more
accurately.
The paper also references an algorithm [@ristad1998learning] which learns to
compute the Levenshtein distance between two strings from training data. The
method is computationally intensive and other methods have superseded it.

Another finding suggests that accuracy gains can be made by training the SVM
kernel that performs entity resolution in a specific way. This notebook goes
through that training procedure and compares the results using SVMs to those
obtained using Logistic Regression.

In [15]:
import os

import numpy as np
import polars as pl


DATADIR = os.path.abspath("../../data")
CORA_PATH = os.path.join(DATADIR, "cora", "cora.csv")

LEFT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Abt.csv")
RIGHT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Buy.csv")
GROUND_TRUTH_PATH = os.path.join(DATADIR, "abt-buy", "abt_buy_perfectMapping.csv")

In [16]:
from matchescu.matching.ml.datasets import Traits, CsvDataSource

cora = (
    pl.read_csv(CORA_PATH, has_header=False, ignore_errors=True)
    .rename(
        {
            "column_1": "id",
            "column_3": "class",
            "column_4": "author",
            "column_5": "volume",
            "column_6": "title",
            "column_7": "institution",
            "column_8": "venue",
            "column_11": "year",
        }
    )
    .select(pl.col("id", "class", "author", "title", "venue", "year"))
)

# set up abt extraction
abt_traits = list(Traits().int([0]).string([1, 2]).currency([3]))
abt = CsvDataSource(name="abt", traits=abt_traits).read_csv(LEFT_CSV_PATH)
# set up buy extraction
buy_traits = list(Traits().int([0]).string([1, 2, 3]).currency([4]))
buy = CsvDataSource(name="buy", traits=buy_traits).read_csv(RIGHT_CSV_PATH)
# set up ground truth
gt = set(
    pl.read_csv(
        os.path.join(DATADIR, "abt-buy", "abt_buy_perfectMapping.csv"),
        ignore_errors=True,
    ).iter_rows()
)

Let's set up a couple of functions that will help us shortly.

We need to be able to perform a k-fold split of the training/test data to
reproduce the results for classic edit distance from the paper
[@bilenko2002learning].

We will use the Levenshtein distance between coreferent strings to train an
SVM kernel and a Logistic regression model. We wouldn't be able to use an NB
approach or the classic pattern matching here since those don't work with
continuous feature values.

In [17]:
from jellyfish import levenshtein_distance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC


def generate_k_folds(
    training_features: np.ndarray, training_target: np.ndarray, n_splits: int
) -> list[tuple]:
    kf = KFold(n_splits)
    result = []
    for train_idx, cv_idx in kf.split(training_features, training_target):
        result.append(
            (
                training_features[train_idx],
                training_target[train_idx],
                training_features[cv_idx],
                training_target[cv_idx],
            )
        )
    return result


def train_svm(features, target):
    svm_kernel = LinearSVC()
    svm_kernel.fit(features, target)
    return svm_kernel


def train_regression(features, target):
    log_regression = LogisticRegression()
    log_regression.fit(features, target)
    return log_regression


def _safe_str(value) -> str:
    if value is None:
        return ""
    if not isinstance(value, str):
        return str(value)
    return value


def compute_distances(values: tuple) -> tuple:
    values = tuple(map(_safe_str, values))
    return (
        levenshtein_distance(values[2], values[8]),
        levenshtein_distance(values[3], values[9]),
        levenshtein_distance(values[4], values[10]),
    )

Now let's set the stable global parameters:

- the fold count (i.e. the number of folds we're going to take through the input data)
- the models (i.e. a mapping of user-friendly model name to training function)

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

fold_count = int(input("folds:"))
models = {"svm": train_svm, "regression": train_regression}

For the deduplication scenario, we want to both reproduce the findings of Milenko et al
and test how logistic regression compares to SVM.

We start by preparing a data deduplication dataframe.

In [19]:
self_join = cora.join(cora, how="cross")
y = self_join.select(pl.nth(1) == pl.nth(len(cora.columns) + 1)).to_numpy().ravel()
X = self_join.map_rows(compute_distances).to_numpy()
print(X.shape, y.shape)

(1671849, 3) (1671849,)


Next, we evaluate the two models on the data deduplication dataframe.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
folds = generate_k_folds(X_train, y_train, fold_count)
stats = []
for idx, (
    X_fold_train,
    y_fold_train,
    X_fold_validation,
    y_fold_validation,
) in enumerate(folds, start=1):
    trained_models = {
        model: train(X_fold_train, y_fold_train) for model, train in models.items()
    }
    print("trained on fold #", idx)

    for model_name, model in trained_models.items():
        prediction = model.predict(X_fold_validation)
        f1 = f1_score(y_fold_validation, prediction)
        stats.append(
            {
                "model": model_name,
                "precision": precision_score(y_fold_validation, prediction),
                "recall": recall_score(y_fold_validation, prediction),
                "f1": f1,
            }
        )
    print("evaluated fold #", idx)

stats = pl.DataFrame(stats)
display(stats)

trained on fold # 1
evaluated fold # 1
trained on fold # 2
evaluated fold # 2
trained on fold # 3
evaluated fold # 3
trained on fold # 4
evaluated fold # 4
trained on fold # 5
evaluated fold # 5
trained on fold # 6
evaluated fold # 6
trained on fold # 7
evaluated fold # 7
trained on fold # 8
evaluated fold # 8
trained on fold # 9
evaluated fold # 9
trained on fold # 10
evaluated fold # 10


model,precision,recall,f1
str,f64,f64,f64
"""svm""",0.767399,0.823154,0.794299
"""regression""",0.782952,0.792016,0.787458
"""svm""",0.768405,0.805143,0.786345
"""regression""",0.780032,0.772198,0.776095
"""svm""",0.748223,0.805802,0.775946
…,…,…,…
"""regression""",0.775828,0.782067,0.778935
"""svm""",0.773529,0.831292,0.801371
"""regression""",0.789762,0.798499,0.794106
"""svm""",0.753534,0.828287,0.789144


In [21]:
avg_stats = stats.group_by("model").mean()
display(avg_stats)

model,precision,recall,f1
str,f64,f64,f64
"""svm""",0.760782,0.819162,0.788859
"""regression""",0.776637,0.789554,0.783011


We've managed to reproduce similar results as the ones obtained in [@bilenko2002learning]
for the SVM kernel. We notice that logistic regression and SVMs perform similarly.

Our next task is to evaluate SVMs and Logistic Regression on the Abt-Buy dataset.
This dataset is much more challenging than the Cora dataset so we expect lower
scores, in line with the scores we obtained in our other notebooks.

In contrast to the other notebooks so far, we will stick to comparing strings.
Except we'll permute the coreferences across name, manufacturer and description
gaining much more information.

In [25]:
from matchescu.matching.ml.datasets import RecordLinkageDataSet
from matchescu.matching.entity_reference import RawComparison

cmp_config = (
    RawComparison()
    .levenshtein_distance("name", 1, 1)
    .levenshtein_distance("description", 2, 2)
    .levenshtein_distance("name_description", 1, 2)
    .levenshtein_distance("description_name", 2, 1)
    .levenshtein_distance("name_manufacturer", 1, 3)
    .levenshtein_distance("description_manufacturer", 2, 3)
)

ds = RecordLinkageDataSet(abt, buy, gt).attr_compare(cmp_config).cross_sources()
display(ds.feature_matrix)

name,description,name_description,description_name,name_manufacturer,description_manufacturer
i64,i64,i64,i64,i64,i64
42,171,47,171,24,190
35,182,22,171,24,190
38,167,43,168,23,187
64,173,32,157,24,188
34,168,50,168,23,187
…,…,…,…,…,…
39,297,181,338,40,356
46,356,40,324,37,351
38,356,40,343,40,355
39,356,40,342,40,355


In [27]:
from sklearn.preprocessing import StandardScaler

X = ds.feature_matrix.to_numpy()
y = ds.target_vector.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
scaler = StandardScaler().fit(X_train)
scaler.transform(X_train)
folds = generate_k_folds(X_train, y_train, fold_count)

stats = []
for idx, (
    X_fold_train,
    y_fold_train,
    X_fold_validation,
    y_fold_validation,
) in enumerate(folds, start=1):
    trained_models = {
        model: train(X_fold_train, y_fold_train) for model, train in models.items()
    }
    print("trained on fold #", idx)

    for model_name, model in trained_models.items():
        prediction = model.predict(X_fold_validation)
        f1 = f1_score(y_fold_validation, prediction, zero_division=0)
        stats.append(
            {
                "model": model_name,
                "precision": precision_score(
                    y_fold_validation, prediction, zero_division=0
                ),
                "recall": recall_score(y_fold_validation, prediction, zero_division=0),
                "f1": f1,
            }
        )
    print("evaluated fold #", idx)

stats = pl.DataFrame(stats)
display(stats)
avg_stats = stats.group_by("model").mean()
display(avg_stats)

trained on fold # 1
evaluated fold # 1
trained on fold # 2
evaluated fold # 2
trained on fold # 3
evaluated fold # 3
trained on fold # 4
evaluated fold # 4
trained on fold # 5
evaluated fold # 5
trained on fold # 6
evaluated fold # 6
trained on fold # 7
evaluated fold # 7
trained on fold # 8
evaluated fold # 8
trained on fold # 9
evaluated fold # 9
trained on fold # 10
evaluated fold # 10


model,precision,recall,f1
str,f64,f64,f64
"""svm""",0.0,0.0,0.0
"""regression""",0.588235,0.147059,0.235294
"""svm""",0.0,0.0,0.0
"""regression""",0.3,0.083333,0.130435
"""svm""",0.0,0.0,0.0
…,…,…,…
"""regression""",0.578947,0.142857,0.229167
"""svm""",0.0,0.0,0.0
"""regression""",0.35,0.104478,0.16092
"""svm""",0.0,0.0,0.0


model,precision,recall,f1
str,f64,f64,f64
"""regression""",0.484448,0.121926,0.193939
"""svm""",0.0,0.0,0.0


We see a marked difference between SVMs and logistic regression. The number of
features shouldn't be such a big factor because it's still small enough for
SVMs to work properly.
However, we can validate that logistic regression using co-referent attributes
works properly here and in the other notebooks. The F1 score we can expect on
Abt-Buy is around 0.20.

# Conclusion

This notebook reproduces the original findings in [@bilenko2002learning] to a
sufficient degree to inspire confidence in the other results observed here.
Using the Levenshtein distance instead of patterns or match/non-match semantics
seems to improve both the precision and the recall of the matching process for
both SVMs and Logistic Regression.
But this is only an illusion!

The improvement in precision, recall and F1 score are due to the shape/structure
of the data.
Running the same models on Abt-Buy, a much more challenging data set than the
one used in the [@bilenko2002learning] paper, we get scores similar to our other
notebooks that use Abt-Buy.
This shows that the probability that entity resolution is correct doesn't
increase with data partitioning choice or with using more attribute coreferences
in our comparisons.
Instead, the results of the entity resolution solution depend largely on the
data where they are applied.

Furthermore, we see that the current approach requires a _lot_ of data
preprocessing to function correctly. For example, we have to compute a specific
feature matrix which uses coreferences. Computing this matrix is by far the
most expensive part of the training process.

On top of this, the models:

* can only compare apples to apples (that is, we can compare only the
coreferences we define - if such a coreference can't be provided then the method
doesn't work);
* do not transfer any of what they learn to a new comparison.

It's not obvious that we can get much further with coreferent attributes.
Note that all the models which use coreferent attributes are just variations on
the Fellegi-Sunter model.
In order to get better results, we need to move to a different matching model
which does not rely on coreferent attributes, but instead interprets the meaning
of entity references.
Logistic regression might still work with these models, but SVMs will not due to
the higher dimensionality of these theoretical models.