# Entity Resolution using Logistic Regression

In other notebooks we saw how the classical Fellegi-Sunter model and the Naive
Bayes method work. They're beautiful theoretical constructions without much
practical power. Two limitations stand out: each feature carries very little
information, and the models impose strict constraints on the structure of each
data source. Indeed, once trained, the Naive Bayes model can only be used on
very similar data (to the point that it almost doesn't transfer), and it fails
to capture any meaning beyond whether entity references match on a certain
attribute or not.
 
Logistic regression is a model well supported in the literature:

* [@peruzzi2014remerge]
* [@efremova2015multi]
* [@steorts2018generalized]
* [@ye2018effect]
* [@kobayashi2018entity]

The main advantage of logistic regression is that it allows us to use the
similarity scores directly and engineer features which take advantage of
continuously valued functions. For example, we can use polynomials to better fit
the data.

In [41]:
import os
import polars as pl

from matchescu.matching.entity_reference import (
    RawComparison,
)
from matchescu.matching.ml.datasets import CsvDataSource, Traits

Next, we need to define some fairly important 'constants' (alas, there's no such thing in Python).
Feel free to change these values to whichever dataset you want to test.

In [42]:
DATADIR = os.path.abspath("../../data")
LEFT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Abt.csv")
RIGHT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Buy.csv")
GROUND_TRUTH_PATH = os.path.join(DATADIR, "abt-buy", "abt_buy_perfectMapping.csv")

The sources of information can be structured in any way. However, when we read
from a data source, we expect to be able to refer to discrete pieces of
information.
The important bit is to have a decent feature extraction process that produces
relatively uniformly shaped entity references. That's what `Traits()` does.
That way we can get a neat matching process going.

In [43]:
# set up abt extraction
abt_traits = list(Traits().int([0]).string([1, 2]).currency([3]))
abt = CsvDataSource(name="abt", traits=abt_traits).read_csv(LEFT_CSV_PATH)
# set up buy extraction
buy_traits = list(Traits().int([0]).string([1, 2, 3]).currency([4]))
buy = CsvDataSource(name="buy", traits=buy_traits).read_csv(RIGHT_CSV_PATH)
# set up ground truth
gt = set(
    pl.read_csv(
        GROUND_TRUTH_PATH,
        ignore_errors=True,
    ).iter_rows()
)

As in the other notebooks, we now configure how we're going to compare
entity reference attributes.

In [44]:
fs_config = (
    RawComparison()
    .levenshtein("name", 1, 1, threshold=0.8)
    .levenshtein("description", 2, 2, threshold=0.8)
    .exact("price", 3, 4)
)

As you can see, the setup is still very similar to the Fellegi-Sunter model.
We can even reuse the `RecordLinkageDataSet` that we used to showcase the Naive
Bayes model.

In [45]:
from matchescu.matching.ml.datasets import RecordLinkageDataSet

ds = RecordLinkageDataSet(abt, buy, gt)
y = ds.target_vector.flatten()
X = ds.compute_feature_matrix(fs_config).to_numpy()
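
To make the feature values concrete, here is a minimal sketch of a normalized Levenshtein similarity. It is not the implementation behind `RawComparison().levenshtein(...)`; the helper names and the choice to normalize by the length of the longer string are assumptions made for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


# "kitten" -> "sitting" needs 3 edits; the longer string has 7 characters
print(levenshtein_similarity("kitten", "sitting"))
```

A value like this, rather than a binary match/no-match flag, is what each column of the feature matrix is meant to carry.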

Our feature matrix contains the actual similarity values for the defined
attribute comparisons. Remember that $similarity(a, b) \in \left[0, 1\right]$.

We'll use scikit-learn's `train_test_split` to split our data into training
and test sets.

In [46]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)
y_ratio = len(y[y == 1]) / len(y)
y_train_ratio = len(y_train[y_train == 1]) / len(y_train)
y_test_ratio = len(y_test[y_test == 1]) / len(y_test)
print(y_ratio, y_train_ratio, y_test_ratio)

0.0009293050458637878 0.0009290229304884713 0.0009297282186280262
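
The three ratios above are close, but with so few positive examples a purely random split gives no guarantee of that. As a sketch, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. The data below is synthetic and the `*_demo` names are placeholders, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# synthetic stand-in: 3 similarity features, 100 matches in 100,000 pairs
X_demo = rng.random((100_000, 3))
y_demo = np.zeros(100_000, dtype=int)
y_demo[:100] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.6, stratify=y_demo, random_state=0
)
# the proportion of matches is the same in the full set, train and test
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Fixing `random_state` also makes the split reproducible between runs.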


We can now train a Logistic Regression model using our feature matrix.

In [47]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X_train, y_train)

Logistic regression training is much faster than Fellegi-Sunter pattern matching
and on par with Naive Bayes.

In [48]:
y_pred = model.predict(X_test)

Now we can compute our metrics.

In [49]:
from sklearn.metrics import precision_score, recall_score, f1_score

p = precision_score(y_test, y_pred)
r = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"precision={p}", f"recall={r}", f"F1={f1}")

precision=0.504950495049505 recall=0.11617312072892938 F1=0.18888888888888888
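
As a quick sanity check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two printed values. The variable names below are illustrative only.

```python
# values copied from the output above
precision_value = 0.504950495049505
recall_value = 0.11617312072892938

# F1 = 2 * P * R / (P + R)
f1_check = 2 * precision_value * recall_value / (precision_value + recall_value)
print(round(f1_check, 6))
```

Because the harmonic mean is dominated by the smaller of the two values, the low recall is what keeps F1 below 0.2 here.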


The scores we obtain are still underwhelming, but we doubled our F1 score
without doing anything to the data. It seems we're on the right track with
the idea of placing more information in each feature.

The logistic regression model is still very much anchored in the Fellegi-Sunter
model. Instead of using patterns of matches, it attempts to "learn" the extent
to which each attribute comparison value is useful for finding two matching
entity references. That's not to say that we couldn't encode patterns of
matches instead of individual values.
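
The "learned extent" is visible in the fitted coefficients. The sketch below uses synthetic data with one informative similarity score and one pure-noise score; the names `name_sim` and `noise_sim` are made up for illustration and do not come from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
y_demo = rng.integers(0, 2, size=n)

# feature 0 is higher for matches, feature 1 is unrelated to the label
name_sim = np.clip(0.3 + 0.5 * y_demo + rng.normal(0, 0.1, n), 0, 1)
noise_sim = rng.random(n)
X_demo = np.column_stack([name_sim, noise_sim])

demo_model = LogisticRegression().fit(X_demo, y_demo)
# one coefficient per attribute comparison
print(dict(zip(["name_sim", "noise_sim"], demo_model.coef_[0])))
```

The informative comparison should receive a large positive weight, while the noise column should stay close to zero.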
 
As stated in the beginning, logistic regression works with continuous values.
It can therefore be tuned using higher order polynomials in the feature matrix,
as long as we don't produce features that combine the comparisons in any way.
Combining features effectively violates the probabilistic independence of the
variables that were used in feature engineering. The probabilistic
independence of the entity reference comparison's components is fundamental in
the Fellegi-Sunter model.
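
One way to raise features to higher powers without combining comparisons is to stack the element-wise powers of the feature matrix. The helper `per_feature_powers` below is hypothetical and only meant as a sketch. For contrast, scikit-learn's `PolynomialFeatures` by default also generates a bias column and interaction terms such as $x_1 x_2$, which do mix comparisons.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def per_feature_powers(X: np.ndarray, degree: int) -> np.ndarray:
    """Stack X, X**2, ..., X**degree column-wise, without any cross terms."""
    return np.hstack([X**d for d in range(1, degree + 1)])


# two candidate pairs, three similarity scores each
X_demo = np.array([[0.9, 0.5, 1.0], [0.2, 0.1, 0.0]])

pure = per_feature_powers(X_demo, degree=2)
full = PolynomialFeatures(degree=2).fit_transform(X_demo)

# 3 features * 2 powers = 6 columns, versus 1 bias + 3 linear + 6 quadratic = 10
print(pure.shape, full.shape)  # (2, 6) (2, 10)
```

Which of the two expansions is preferable depends on how strictly the independence assumption should be kept.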

Let's have a look at the effect polynomials would have on the result.

In [57]:
import plotly.graph_objs as go
from sklearn.preprocessing import PolynomialFeatures

degrees = list(range(1, 11))
precision, recall, f1 = [], [], []
trials = 10

max_degree, max_f1 = 1, 0
for degree in degrees:
    X_polynomial = PolynomialFeatures(degree).fit_transform(X)
    d_p, d_r, d_f1 = [], [], []
    for i in range(1, trials + 1):  # run exactly `trials` trials
        print("performing trial", i, "for degree", degree)
        # draw a fresh random split per trial; with a single split per degree
        # every trial would fit the same data and produce identical scores
        Xp_train, Xp_test, yp_train, yp_test = train_test_split(
            X_polynomial, y, train_size=0.6
        )
        model.fit(Xp_train, yp_train)
        y_pred = model.predict(Xp_test)
        d_p.append(precision_score(yp_test, y_pred))
        d_r.append(recall_score(yp_test, y_pred))
        d_f1.append(f1_score(yp_test, y_pred))
    # average each metric over the trials for this degree
    precision.append(sum(d_p) / len(d_p))
    recall.append(sum(d_r) / len(d_r))
    degree_f1 = sum(d_f1) / len(d_f1)
    if degree_f1 > max_f1:
        max_f1 = degree_f1
        max_degree = degree
    f1.append(degree_f1)

fig = go.Figure(
    [
        # go.Line is deprecated; go.Scatter with mode="lines" draws the same curves
        go.Scatter(x=degrees, y=precision, mode="lines", name="precision"),
        go.Scatter(x=degrees, y=recall, mode="lines", name="recall"),
        go.Scatter(x=degrees, y=f1, mode="lines", name="f1"),
    ]
)

fig.update_layout(xaxis_title="polynomial degree", yaxis_title="score")
fig.show()
print("got the best F1 score for polynomials of degree", max_degree)

performing trial 1 for degree 1
performing trial 2 for degree 1
[...]




got the best F1 score for polynomials of degree 8


The polynomial features only bring a marginal improvement. We can also see
that using higher order polynomials doesn't guarantee higher F1 scores
(see $x^9$). That said, using probabilistic metrics to calibrate our
solution seems acceptable, because these metrics are dangerous only in the case
of very high recall values.

We can draw a bunch of conclusions from this notebook:

* logistic regression is a much better starting point for optimisation than
  Naive Bayes and the classic FSM
    * tweaking attribute-level matching will yield better results, but using a
      superior method provides a better starting point
* logistic regression does not solve the knowledge transfer problem (you can't
  use the logistic regression model we just trained on Abt-Buy to perform entity
  resolution effectively on another dataset)
* the features used on another dataset must employ the same number and type of
  comparisons (not the same attributes) as the features the model was trained on
    * if polynomials are used, then the same polynomial expansion must be used on
      the target dataset
* model training is still very quick, allowing us to iterate through 10
  training sessions in a relatively short amount of time for each degree of
  polynomial expansion of the initial feature matrix
* polynomial expansion doesn't change the results very much (only a marginal
  improvement), so we should focus on adding more meaning to the feature values
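
The requirement that the same polynomial expansion be applied to the target dataset can be enforced mechanically by bundling the expansion and the classifier into one scikit-learn pipeline. This is a sketch on synthetic data; `X_source`, `y_source` and `X_target` are placeholders for feature matrices built with the same number and type of comparisons, and the labelling rule is invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# synthetic stand-ins: 3 similarity scores per candidate pair
X_source = rng.random((1_000, 3))
y_source = (X_source.mean(axis=1) > 0.6).astype(int)  # invented labelling rule
X_target = rng.random((200, 3))  # another dataset, same number of comparisons

# the expansion is stored with the model, so predict() always reapplies it
pipeline = make_pipeline(
    PolynomialFeatures(degree=2),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_source, y_source)

y_target_pred = pipeline.predict(X_target)
print(y_target_pred.shape)
```

A pipeline only guarantees that the shapes and transformations line up; it does not address the knowledge transfer problem described above.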
 
In our next notebook, we're going to try logistic regression with features that
encode comparison patterns as real numbers. 