# Entity Resolution using the Naive Bayes ML model

We're going to dive into using the Naive Bayes classifier. This model
is probabilistic in nature and can be trained to classify data into a given
number of categories.
The model is supervised, meaning that it learns from labelled, known-good
examples.
Sounds like an ideal fit for entity resolution, right?

Let's dive in!

_(btw, the first parts of this code are the same as the Fellegi-Sunter example
notebook, so check that one out, too)_.   

In [1]:
!test -f ~/requirements.txt && pip install -r ~/requirements.txt



In [2]:
import os

import numpy as np
import polars as pl

from matchescu.matching.entity_reference import (
    NaiveBayesComparison,
)
from matchescu.matching.ml.datasets import CsvDataSource, Traits

Next, we need to define some fairly important 'constants' (alas, there's no such thing in Python).
Feel free to change these values to whichever dataset you want to test.

In [3]:
DATADIR = os.path.abspath("../../data")
LEFT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Abt.csv")
RIGHT_CSV_PATH = os.path.join(DATADIR, "abt-buy", "Buy.csv")
GROUND_TRUTH_PATH = os.path.join(DATADIR, "abt-buy", "abt_buy_perfectMapping.csv")

The sources of information can be structured in any way. However, when we read
from the data source we expect to be able to refer to discrete pieces of
information.
The important bit is to have a decent feature extraction process that is able to
produce relatively uniformly shaped entity references. That's the job of
`Traits()`. That way we can get a neat matching process going.

In [4]:
# set up abt extraction
abt_traits = list(Traits().int([0]).string([1, 2]).currency([3]))
abt = CsvDataSource(name="abt", traits=abt_traits).read_csv(LEFT_CSV_PATH)
# set up buy extraction
buy_traits = list(Traits().int([0]).string([1, 2, 3]).currency([4]))
buy = CsvDataSource(name="buy", traits=buy_traits).read_csv(RIGHT_CSV_PATH)
# set up ground truth
gt = set(
    pl.read_csv(
        GROUND_TRUTH_PATH,
        ignore_errors=True,
    ).iter_rows()
)

So far, this is very similar to the setup we had for the Fellegi-Sunter model.
It's time to introduce the twist required to use the Naive Bayes model.

In [5]:
cmp_config = (
    NaiveBayesComparison()
    .levenshtein("name", 1, 1, threshold=0.8)
    .levenshtein("description", 2, 2, threshold=0.8)
    .exact("price", 3, 4)
)

As you can see, the setup is still very similar to the Fellegi-Sunter model.
We can even reuse our `RecordLinkageDataSet` to showcase the Naive Bayes model.

In [6]:
from matchescu.matching.ml.datasets import RecordLinkageDataSet

ds = RecordLinkageDataSet(abt, buy, gt).attr_compare(cmp_config).cross_sources()
y = ds.target_vector.to_numpy()
X = ds.feature_matrix.to_numpy()

We just created the same kind of feature matrix and target vector as the ones
we had for the Fellegi-Sunter model.
This time around, however, we need to pass them to scikit-learn, so all our
data has to be `numpy.ndarray`s.
Also, the feature matrix values will be in the set `{-1, 1}`.
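
To make the `{-1, 1}` encoding concrete, here is a minimal sketch of how a thresholded Levenshtein comparison could produce such a value. The helper names (`levenshtein`, `similarity`, `binary_feature`), the length-based normalisation and the `>=` comparison are all assumptions made for illustration; the library's actual implementation may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1]; 1.0 means identical strings (assumed scheme)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def binary_feature(a: str, b: str, threshold: float = 0.8) -> int:
    """Hypothetical encoding: 1 when similar enough, -1 otherwise."""
    return 1 if similarity(a, b) >= threshold else -1


print(binary_feature("Sony Bravia 40", "Sony Bravia 40in"))  # similarity 0.875
print(binary_feature("kitten", "sitting"))                   # similarity ~0.571
```

Under these assumptions the first pair clears the `0.8` threshold and the second does not, so each attribute comparison contributes exactly one of two values to the feature matrix.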

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)
y_ratio = len(y[y == 1]) / len(y)
y_train_ratio = len(y_train[y_train == 1]) / len(y_train)
y_test_ratio = len(y_test[y_test == 1]) / len(y_test)
print(y_ratio, y_train_ratio, y_test_ratio)

0.0009293050458637878 0.0009219634857279205 0.0009403173782934934
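
With so few positive examples, a random split can drift away from the overall match ratio. One option, shown here as a sketch on synthetic stand-in data (the `*_demo`, `X_tr` and `X_te` names are made up for this example), is to pass `stratify` to `train_test_split` and fix `random_state` so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 candidate pairs, 10 true matches.
rng = np.random.default_rng(0)
X_demo = rng.choice([-1, 1], size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:10] = 1

# stratify keeps the match ratio the same in both halves;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.6, stratify=y_demo, random_state=42
)
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

The cell above does not do this, so its exact numbers will vary from run to run.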


We can now train one of the Naive Bayes models that ship with scikit-learn.

In [8]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model = model.fit(X_train, y_train)

Much easier than the Fellegi-Sunter method. Let's make predictions and compute
precision, recall and the F1 score.

In [9]:
y_pred = model.predict(X_test)

Now we can compute our metrics.

In [10]:
tp = np.sum(np.logical_and(y_pred == 1, y_test == 1))
fp = np.sum(np.logical_and(y_pred == 1, y_test == 0))
tn = np.sum(np.logical_and(y_pred == 0, y_test == 0))
fn = np.sum(np.logical_and(y_pred == 0, y_test == 1))
total = len(y_test)
print(f"total comparisons: {total}")
print(f"tp={tp};fp={fp};tn={tn};fn={fn}")

p = tp / (tp + fp) if tp + fp > 0 else 0
r = tp / (tp + fn) if tp + fn > 0 else 0
f1 = 2 * p * r / (p + r) if p + r > 0 else 0
print(f"precision={p}", f"recall={r}", f"F1={f1}")

total comparisons: 472181
tp=42;fp=148;tn=471589;fn=402
precision=0.22105263157894736 recall=0.0945945945945946 F1=0.13249211356466878
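
As a sanity check, the metrics can be recomputed from the confusion counts printed above. The sketch below also computes accuracy, which we don't report because the classes are so imbalanced that it is misleadingly high. The function and variable names are made up for this example.

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall and F1 computed directly from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Equivalent to 2 * p * r / (p + r), written in terms of counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Counts taken from the output above.
demo_tp, demo_fp, demo_tn, demo_fn = 42, 148, 471589, 402
demo_p, demo_r, demo_f1 = prf_from_counts(demo_tp, demo_fp, demo_fn)
demo_accuracy = (demo_tp + demo_tn) / (demo_tp + demo_fp + demo_tn + demo_fn)
print(demo_p, demo_r, demo_f1, demo_accuracy)
```

Accuracy comes out above 0.99 even though the model misses roughly nine out of ten true matches, which is why precision, recall and F1 are the numbers to watch here.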


The Naive Bayes model takes a different approach to the decision that the
Fellegi-Sunter model bases on the $\mu$ and $\lambda$ error margins for
definitive match and non-match, respectively: it assigns each pair to the class
with the higher estimated probability.
It yields slightly better results than the classical Fellegi-Sunter
deterministic decision model. It also involves writing much less code and there
are still many ways of improving the results. A more subtle note is that this
model yields more balanced precision/recall, which makes it a better candidate
for improvements via attribute similarity (because it is more sensitive by
default).

To improve our results, we can try various probability distributions (we're
using the normal distribution here to make decisions, but we could very well try
out the Bernoulli distribution). Since the training is so fast, we can actually
find the best fit for our data by iterating over many options. We know full well
that a model picked this way is overfitted to this dataset and won't transfer to
other data, but at least we get a very tunable evaluator fairly quickly and
easily.
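
Iterating over options could look roughly like the sketch below. It runs on synthetic stand-in data rather than the Abt-Buy features, and the `candidates` and `scores` names are made up for this example. Picking the winner by its test-set score is exactly the kind of overfitting mentioned above.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic stand-in data (NOT the Abt-Buy features): three {-1, 1} features
# that agree with the label about 90% of the time.
rng = np.random.default_rng(0)
n_pairs = 5_000
y_demo = (rng.random(n_pairs) < 0.05).astype(int)
signed_label = np.where(y_demo == 1, 1, -1)[:, None]
agrees = rng.random((n_pairs, 3)) < 0.9
X_demo = np.where(agrees, signed_label, -signed_label)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.6, stratify=y_demo, random_state=0
)

candidates = {
    "gaussian": GaussianNB(),
    # binarize=0.0 maps -1 to 0 and 1 to 1 before fitting
    "bernoulli": BernoulliNB(binarize=0.0),
}
scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, candidate.predict(X_te), zero_division=0)

best = max(scores, key=scores.get)
print(scores, best)
```

The same loop extends naturally to other knobs, such as the comparison thresholds, as long as we remember that the scores only describe this one dataset.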

Before we get into the weeds, there's something obviously wrong with this model:
it doesn't even capture the case when data is missing. So maybe a better way
forward is to capture more information in each individual feature. That starts
by having floating point values in the feature matrix.

Enter logistic regression!