# Using Raw String Similarity Features

The paper [@bilenko2002learning] outlines a process by which we can learn the
edit distance between two strings using minimal training over pairs of
coreferent strings.
Coreferent strings are strings found in references to the same entity.
The idea is to fine tune the algorithm that computes the string edit distance to
the business domain where we perform entity resolution. By doing so we increase
the chances of recognizing records that refer to the same entity more
accurately.
The paper also references an algorithm [@ristad1998learning] which learns to
compute the Levenshtein distance between two strings from training data. The
method is computationally intensive and other methods have superseded it.

Another finding suggests that accuracy gains can be made by training the SVM
kernel that performs entity resolution in a specific way. This notebook goes
through that training procedure and compares the results using SVMs to those
obtained using Logistic Regression.

In [1]:
import os
import random

import numpy as np
import polars as pl
from matchescu.matching.similarity import LevenshteinLearner


DATADIR = os.path.abspath("../../data")
CSV_PATH = os.path.join(DATADIR, "cora", "cora.csv")

In [2]:
df = (
    pl.read_csv(CSV_PATH, has_header=False, ignore_errors=True)
    .rename(
        {
            "column_1": "id",
            "column_3": "class",
            "column_4": "author",
            "column_5": "volume",
            "column_6": "title",
            "column_7": "institution",
            "column_8": "venue",
            "column_11": "year",
        }
    )
    .select(pl.col("id", "class", "author", "title", "venue", "year"))
)
display(df)

id,class,author,title,venue,year
i64,str,str,str,str,str
1,"""blum1993""","""avrim blum, merrick furst, mic…","""cryptographic primitives based…","""in pre-proceedings of crypto '…","""1993"""
2,"""blum1993""","""avrim blum, merrick furst, mic…","""cryptographic primitives based…","""proc. crypto 93,""","""1994"""
3,"""blum1993""","""a. blum, m. furst, m. kearns, …","""cryptographic primitives based…","""crypto,""","""1993"""
4,"""blum1994""","""blum, a., furst, m., jackson, …","""weakly learning dnf and charac…","""proceedings of the 26th annual…","""(1994)."""
5,"""blum1994""","""blum, a., furst, m., jackson, …","""weakly learning dnf and charac…","""in proceedings of the twenty-s…","""(1994)."""
…,…,…,…,…,…
1289,"""schapire1998""","""robert e. schapire and yoram s…","""improved boosting algorithms u…","""in proceedings of the eleventh…","""1998"""
1290,"""schapire""","""schapire, r. e., freund, y., b…","""boosting the margin: a new exp…",,"""(1998)."""
1291,"""schapire1998mm""","""robert e. schapire and yoram s…","""a system for multiclass multi-…","""unpublished manuscript,""","""1998"""
1292,"""singer""","""robert e. schapire yoram singe…","""improved boosting algorithms u…",,


In [3]:
records = list(df.iter_rows(named=True))
dedupe_data = []
y = []
for i, left_record in enumerate(records):
    for j, right_record in enumerate(records, i + 1):
        lclass, rclass = left_record["class"], right_record["class"]
        row = {f"{k}_left": v for k, v in left_record.items()}
        row.update({f"{k}_right": v for k, v in right_record.items()})
        dedupe_data.append(row)
        y.append(int(lclass == rclass))
X = pl.DataFrame(dedupe_data).to_numpy()
y = np.array(y)

Let's set up a couple of functions that will help us shortly.

We need to be able to perform a k-fold split of the training/test data to
reproduce the results for classic edit distance from the paper
[@bilenko2002learning].

We will use the Levenshtein distance between coreferent strings to train an
SVM kernel and a Logistic regression model. We wouldn't be able to use an NB
approach or the classic pattern matching here since those don't work with
continuous feature values.

In [19]:
from jellyfish import levenshtein_distance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC


def generate_k_folds(n_splits: int) -> list[tuple]:
    kf = KFold(n_splits)
    result = []
    for (train_idx, test_idx) in kf.split(X, y):
        result.append((X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return result


def train_svm(features, target):
    svm_kernel = LinearSVC()
    svm_kernel.fit(features, target)
    return svm_kernel


def train_regression(features, target):
    log_regression = LogisticRegression()
    log_regression.fit(features, target)
    return log_regression


def _safe_str(value) -> str:
    if value is None:
        return ""
    if not isinstance(value, str):
        return str(value)
    return value

def compute_distances(values: tuple) -> tuple:
    values = tuple(map(_safe_str, values))
    return (
        levenshtein_distance(values[2], values[8]),
        levenshtein_distance(values[3], values[9]),
        levenshtein_distance(values[4], values[10]),
    )

The main part of the notebook is to split the data, train SVM, train logistic
regression, run predictions for both models and compare results.

In [20]:
from sklearn.metrics import precision_score, recall_score, f1_score

fold_count = int(input("folds:"))
stats = []
folds = generate_k_folds(fold_count)

models = {
    "svm": train_svm,
    "regression": train_regression
}

for idx, (X_train, y_train, X_test, y_test) in enumerate(folds, start=1):
    title_train = list(map(compute_distances, X_train))
    trained_models = {
        model: train(title_train, y_train)
        for model, train in models.items()
    }
    print("trained on fold #", idx)

    title_test = list(map(compute_distances, X_test))
    for model_name, model in trained_models.items():
        prediction = model.predict(title_test)
        stats.append(
            {
                "model": model_name,
                "precision": precision_score(y_test, prediction),
                "recall": recall_score(y_test, prediction),
                "f1": f1_score(y_test, prediction),
            }
        )
    print("evaluated fold #", idx)
stats = pl.DataFrame(stats)
display(stats)

trained on fold # 1
evaluated fold # 1
trained on fold # 2
evaluated fold # 2
trained on fold # 3
evaluated fold # 3
trained on fold # 4
evaluated fold # 4
trained on fold # 5
evaluated fold # 5
trained on fold # 6
evaluated fold # 6
trained on fold # 7
evaluated fold # 7
trained on fold # 8
evaluated fold # 8
trained on fold # 9
evaluated fold # 9
trained on fold # 10
evaluated fold # 10


model,precision,recall,f1
str,f64,f64,f64
"""svm""",0.935116,0.338898,0.497497
"""regression""",0.941388,0.328666,0.487227
"""svm""",0.865943,0.902752,0.883965
"""regression""",0.870763,0.879817,0.875266
"""svm""",0.789357,0.848967,0.818078
…,…,…,…
"""regression""",0.664469,0.893884,0.76229
"""svm""",0.636447,0.82921,0.720152
"""regression""",0.644112,0.790104,0.709677
"""svm""",0.896607,0.816977,0.854942


In [21]:
display(stats.group_by("model").mean())

model,precision,recall,f1
str,f64,f64,f64
"""svm""",0.765938,0.828311,0.77241
"""regression""",0.779555,0.79966,0.766207


# Conclusion

This notebook reproduces the original findings in [@bilenko2002learning] to a
sufficient degree to inspire confidence in the other results observed here.
Using the Levenshtein distance instead of patterns or match/non-match semantics
seems to dramatically improve both the precision and the recall of the matching
process for both SVMs and Logistic Regression.
The difference in performance between the two is negligible.

On one hand, we see that using more expressive feature values improves the
match performance.
This is akin to having a stronger signal and the conclusion is not surprising
at all.

On the other hand we see that the models still need a lot of data preprocessing
to function correctly. They:

* can only compare apples to apples
* do not transfer any of the learning to a new comparison of pears to pears
* are reasonably powerful only if data has been previously cleaned

Note that these models are still variations on the Fellegi-Sunter model in that
they only manipulate probability distributions using polynomials.
Neither SVMs (the construction of the kernel is probabilistic) nor Logistic
Regression (the sigmoid function emulates probabilities) represent breakthroughs
from what we've seen so far.
The one thing they do in addition to more classical implementations of the F-S
model is that they work with continuous valued input features.
