# Entity Resolution using Logistic Regression

The other notebook on Logistic Regression shows the default approach to entity
matching with this type of model.
In this notebook, we give that approach a different spin.
We encode the raw similarity values obtained from comparing attributes against
all possible patterns of the output class values (0 and 1).


In [1]:
import os
import polars as pl

from matchescu.matching.entity_reference import (
    RawComparison,
)
from matchescu.matching.ml.datasets import CsvDataSource, Traits

Constant definitions. Change the values here to use a different dataset.

In [2]:
DATADIR = os.path.abspath("../../data")
LEFT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Abt.csv")
RIGHT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Buy.csv")
GROUND_TRUTH_PATH = os.path.join(DATADIR, "abt-buy", "abt_buy_perfectMapping.csv")

The sources of information can be structured in any way. However, when we read
from a data source we expect to be able to refer to discrete pieces of
information.
What matters is a decent feature extraction process that produces relatively
uniformly shaped entity references. That's what `Traits()` does, and it lets us
get a neat matching process going.

In [3]:
# set up abt extraction
abt_traits = list(Traits().int([0]).string([1, 2]).currency([3]))
abt = CsvDataSource(name="abt", traits=abt_traits).read_csv(LEFT_CSV_PATH)
# set up buy extraction
buy_traits = list(Traits().int([0]).string([1, 2, 3]).currency([4]))
buy = CsvDataSource(name="buy", traits=buy_traits).read_csv(RIGHT_CSV_PATH)
# set up ground truth
gt = set(
    pl.read_csv(
        GROUND_TRUTH_PATH,
        ignore_errors=True,
    ).iter_rows()
)

For Logistic Regression we want each similarity function to return a real value
between 0 and 1.

In [4]:
cmp_config = (
    RawComparison()
    .levenshtein("name", 1, 1, threshold=0.8)
    .levenshtein("description", 2, 2, threshold=0.8)
    .exact("price", 3, 4)
)
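To illustrate what a similarity in the $[0, 1]$ range can look like, here is a minimal sketch of a length-normalised Levenshtein similarity. It is not the implementation behind `RawComparison().levenshtein(...)`, and it does not model the `threshold` argument; the function names `levenshtein_distance` and `levenshtein_similarity` are made up for this example.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Hypothetical normalisation: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


sim = levenshtein_similarity("kitten", "sitting")
print(round(sim, 4))
```

Dividing by the length of the longer string keeps the result between 0 and 1, which is the range the paragraph above asks for.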

Next, it's time to instantiate the `RecordLinkageDataSet` with a 'pattern
encoded' sampling strategy.

In [33]:
from matchescu.matching.ml.datasets import RecordLinkageDataSet

r = 2  # y contains 0 or 1
ds = RecordLinkageDataSet(abt, buy, gt).pattern_encoded(cmp_config, r).cross_sources()
y = ds.target_vector.to_numpy()
X = ds.feature_matrix.to_numpy()
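Conceptually, building the target vector across two sources can be pictured as pairing every left reference with every right reference and labelling the pair 1 when it appears in the ground truth. The sketch below is an assumption about what `cross_sources()` does rather than a description of the library's code; `label_cross_pairs` and the toy identifiers are invented for illustration.

```python
from itertools import product


def label_cross_pairs(left_ids, right_ids, ground_truth):
    """Hypothetical helper: label every (left, right) pair against a set of known matches."""
    return [
        ((left, right), int((left, right) in ground_truth))
        for left, right in product(left_ids, right_ids)
    ]


# toy identifiers, not taken from the Abt-Buy data
toy_gt = {(1, "a"), (3, "b")}
pairs = label_cross_pairs([1, 2, 3], ["a", "b"], toy_gt)
labels = [label for _, label in pairs]
print(len(pairs), sum(labels))
```

A full cross product grows as the product of the two source sizes, which is why the number of candidate pairs quickly dwarfs the number of true matches.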

Notice that we now have $r^m$ features instead of $m$, where $m$ is the number
of comparisons we make and $r$ the number of possible results. With $m = 3$
comparisons and $r = 2$, that gives $2^3 = 8$ features. The question really is
whether this is useful or not.

In [34]:
print(X.shape, y.shape)

(1180452, 8) (1180452,)
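To make the 8 columns more concrete, here is one plausible way to pattern-encode a row of similarities for $r = 2$. This is an assumption made for illustration, not the actual `pattern_encoded` implementation: for each of the $2^m$ binary patterns, the feature is the product of $s_i$ where the pattern says "match" and $1 - s_i$ where it says "non-match". The name `pattern_encode` is hypothetical.

```python
from itertools import product
from math import prod


def pattern_encode(sims):
    """Hypothetical binary (r = 2) pattern encoding of one row of similarities in [0, 1]."""
    features = []
    for pattern in product((0, 1), repeat=len(sims)):
        # s_i where the pattern expects a match, 1 - s_i where it expects a non-match
        terms = [s if bit == 1 else 1.0 - s for s, bit in zip(sims, pattern)]
        features.append(prod(terms))
    return features


row = pattern_encode([0.9, 0.4, 1.0])
print(len(row))
```

Under this particular encoding the features of a row sum to 1, so each column can be read as how strongly the raw similarities agree with one specific match/non-match pattern.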


The feature matrix now contains values that are still based on similarities, but
encoded to express how well the attribute similarities agree with each possible
pattern of matching or non-matching.
We'll use scikit-learn's `train_test_split` to split our data into training
and test sets, and then train a Logistic Regression model on the training part
of our feature matrix.

In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)
y_ratio = len(y[y == 1]) / len(y)
y_train_ratio = len(y_train[y_train == 1]) / len(y_train)
y_test_ratio = len(y_test[y_test == 1]) / len(y_test)

# the model is adjusted to handle the class imbalance specific to ER datasets
model = LogisticRegression(class_weight={0: 1, 1: 9})
model = model.fit(X_train, y_train)
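The `{0: 1, 1: 9}` weights above are chosen by hand. As a point of comparison, the sketch below computes weights with the heuristic that scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * count)`. The helper name `balanced_class_weights` and the toy labels are assumptions made for this example; the imbalance of the real data can be read from `y_ratio`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# toy label vector with a 9:1 imbalance, not the notebook's data
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights[1] / weights[0])
```

On this toy vector the minority class ends up weighted 9 times as heavily as the majority class, the same relative weighting as the hand-picked dictionary.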

Logistic regression training is much faster than Fellegi-Sunter pattern matching
and on par with Naive Bayes.

In [36]:
y_pred = model.predict(X_test)

Now we can compute our metrics.

In [37]:
from sklearn.metrics import precision_score, recall_score, f1_score

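# use descriptive names so that `r` (the number of classes above) is not overwritten
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"precision={precision}", f"recall={recall}", f"F1={f1}")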

precision=0.23448275862068965 recall=0.4657534246575342 F1=0.3119266055045872
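As a quick sanity check on these numbers, F1 is the harmonic mean of precision and recall, $F_1 = \frac{2PR}{P + R}$. The small sketch below recomputes it from the two values printed above; `f1_from` is a helper name introduced only for this example.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.23448275862068965
reported_recall = 0.4657534246575342
recomputed_f1 = f1_from(reported_precision, reported_recall)
print(round(recomputed_f1, 4))  # 0.3119
```

Because the harmonic mean is pulled towards the smaller of the two values, the low precision dominates the F1 score here.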


Compared with standard logistic regression, this yields only marginal
improvements, similar to what we saw when using polynomials. This is expected,
because all we've done is combine the existing feature values in various ways.