# Learning String Similarity

The paper [@bilenko2002learning] outlines a process by which we can learn the
edit distance between two strings using minimal training over pairs of
coreferent strings.
Coreferent strings are strings found in references to the same entity.
The idea is to fine tune the algorithm that computes the string edit distance to
the business domain where we perform entity resolution. By doing so we increase
the chances of recognizing records that refer to the same entity more
accurately.
The aforementioned paper references an algorithm [@ristad1998learning] which
learns to compute the Levenshtein distance between two strings from training
data.

Another finding suggests that accuracy gains can be made by training the SVM
kernel that performs entity resolution in a specific way. This notebook goes
through that training procedure.

In [1]:
import os
import random

import numpy as np
import polars as pl
from matchescu.matching.similarity import LevenshteinLearner


DATADIR = os.path.abspath("../../data")
CSV_PATH = os.path.join(DATADIR, "cora", "cora.csv")

In [2]:
df = (
    pl.read_csv(CSV_PATH, has_header=False, ignore_errors=True)
    .rename(
        {
            "column_1": "id",
            "column_3": "class",
            "column_4": "author",
            "column_5": "volume",
            "column_6": "title",
            "column_7": "institution",
            "column_8": "venue",
            "column_11": "year",
        }
    )
    .select(pl.col("id", "class", "author", "title", "venue", "year"))
)
display(df)

id,class,author,title,venue,year
i64,str,str,str,str,str
1,"""blum1993""","""avrim blum, merrick furst, mic…","""cryptographic primitives based…","""in pre-proceedings of crypto '…","""1993"""
2,"""blum1993""","""avrim blum, merrick furst, mic…","""cryptographic primitives based…","""proc. crypto 93,""","""1994"""
3,"""blum1993""","""a. blum, m. furst, m. kearns, …","""cryptographic primitives based…","""crypto,""","""1993"""
4,"""blum1994""","""blum, a., furst, m., jackson, …","""weakly learning dnf and charac…","""proceedings of the 26th annual…","""(1994)."""
5,"""blum1994""","""blum, a., furst, m., jackson, …","""weakly learning dnf and charac…","""in proceedings of the twenty-s…","""(1994)."""
…,…,…,…,…,…
1289,"""schapire1998""","""robert e. schapire and yoram s…","""improved boosting algorithms u…","""in proceedings of the eleventh…","""1998"""
1290,"""schapire""","""schapire, r. e., freund, y., b…","""boosting the margin: a new exp…",,"""(1998)."""
1291,"""schapire1998mm""","""robert e. schapire and yoram s…","""a system for multiclass multi-…","""unpublished manuscript,""","""1998"""
1292,"""singer""","""robert e. schapire yoram singe…","""improved boosting algorithms u…",,


In [4]:
records = list(df.iter_rows(named=True))
dedupe_data = []
y = []
for i, left_record in enumerate(records):
    for j, right_record in enumerate(records, i + 1):
        lclass, rclass = left_record["class"], right_record["class"]
        row = {f"{k}_left": v for k, v in left_record.items()}
        row.update({f"{k}_right": v for k, v in right_record.items()})
        dedupe_data.append(row)
        y.append(int(lclass == rclass))
X = pl.DataFrame(dedupe_data).to_numpy()
y = np.array(y)

Now we split the data using k-fold.

In [11]:
from sklearn.model_selection import KFold

kf = KFold(10)
folds = []
for i, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    folds.append((X[train_idx], y[train_idx], X[test_idx], y[test_idx]))

In [12]:
from sklearn.svm import LinearSVC

def train_svm(features, target):
    model = LinearSVC()
    model.fit(features, target)
    return model

Next, train the Levenshtein distance estimator using 100 random samples from
the data over 10 epochs. 

In [17]:
from sklearn.metrics import precision_score, recall_score, f1_score
from jellyfish import levenshtein_distance
from IPython.display import display


def _title_levenshtein(values: tuple) -> tuple:
    return (levenshtein_distance(values[3], values[9]),)


stats = []

for idx, (X_train, y_train, X_test, y_test) in enumerate(folds):
    title_train = list(map(_title_levenshtein, X_train))
    model = train_svm(title_train, y_train)
    print("trained SVM levenshtein #", idx + 1)
    title_test = list(map(_title_levenshtein, X_test))
    prediction = model.predict(title_test)
    print("evaluated levenshtein #", idx + 1)

    stats.append(
        {
            "levenshtein precision": precision_score(y_test, prediction),
            "levenshtein recall": recall_score(y_test, prediction),
            "levenshtein f1": f1_score(y_test, prediction),
        }
    )

trained SVM levenshtein # 1
evaluated levenshtein # 1
trained SVM levenshtein # 2
evaluated levenshtein # 2
trained SVM levenshtein # 3
evaluated levenshtein # 3
trained SVM levenshtein # 4
evaluated levenshtein # 4
trained SVM levenshtein # 5
evaluated levenshtein # 5
trained SVM levenshtein # 6
evaluated levenshtein # 6
trained SVM levenshtein # 7
evaluated levenshtein # 7
trained SVM levenshtein # 8
evaluated levenshtein # 8
trained SVM levenshtein # 9
evaluated levenshtein # 9
trained SVM levenshtein # 10
evaluated levenshtein # 10


In [18]:
display(pl.DataFrame(stats))

levenshtein precision,levenshtein recall,levenshtein f1
f64,f64,f64
0.958499,0.563813,0.709991
0.863609,0.892661,0.877895
0.78119,0.863928,0.820478
0.487437,0.833065,0.615019
0.853659,0.87627,0.864816
0.731262,0.97869,0.837075
0.782515,0.929739,0.849797
0.653339,0.894495,0.755131
0.645709,0.909816,0.755342
0.887536,0.822866,0.853978


In [19]:
stats = pl.DataFrame(stats)
display(stats)
display(stats.mean())

levenshtein precision,levenshtein recall,levenshtein f1
f64,f64,f64
0.958499,0.563813,0.709991
0.863609,0.892661,0.877895
0.78119,0.863928,0.820478
0.487437,0.833065,0.615019
0.853659,0.87627,0.864816
0.731262,0.97869,0.837075
0.782515,0.929739,0.849797
0.653339,0.894495,0.755131
0.645709,0.909816,0.755342
0.887536,0.822866,0.853978


levenshtein precision,levenshtein recall,levenshtein f1
f64,f64,f64
0.764476,0.856534,0.793952


In [None]:
The conclusion here is that we get infinitely higher performance by embedding
more information within each comparison. This track leads us to the next
breakthrough: vectors.