# Example for using PINE

PINE (Pair INterpretation for Entity matching) is an explainable entity matching algorithm.

PINE takes two records (entities) as input and outputs correlated token pairs as an explanation of the entity matching decision.

## Install

```bash
git clone https://github.com/m-hironori/pine-explainer.git
cd pine-explainer
pip install .
```

## Advance preparation

### Building an entity matching model for your dataset

PINE can explain arbitrary entity matching models, so you can train your own model and analyze it with PINE.

In our experiments, we used the `DITTO` model and the `py_entitymatching` model on the `Magellan Datasets`, built with the `lemon` module.
You can see how we built these models in [make_DITTO_model_by_lemon.ipynb](make_DITTO_model_by_lemon.ipynb) and [make_py_entitymatching_model_by_lemon.ipynb](make_py_entitymatching_model_by_lemon.ipynb).

In this example, we use one of the entity matching models from the experiments (the `py_entitymatching` model for the `Structured Amazon-Google` dataset).

You can download this model from here and place it in the `data/model/magellan/structured_amazon_google` directory.

### Converting your entity matching function to PINE's function format

PINE can work with your entity matching function once it follows PINE's function format.

```python
import pandas


def your_proba_func(
    records_a: pandas.DataFrame,
    records_b: pandas.DataFrame,
    record_id_pairs: pandas.DataFrame,
) -> pandas.Series:
    """Score entity pairs built from two sets of records.

    Arguments:
        records_a: Entity data from one dataset (an index is required)
        records_b: Entity data from the other dataset (an index is required)
        record_id_pairs: Entity pairs between records_a and records_b (an index is required)
    Return:
        pandas.Series: scores (-1 <= score <= 1)
    """
    raise NotImplementedError  # replace with a call to your own model
```
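
As a minimal sketch of this format, the toy function below scores each pair by the Jaccard similarity of the records' whitespace tokens and rescales it to the range from -1 to 1. It is not part of PINE: the name `jaccard_proba_func`, the helper `_tokens`, the `a.rid`/`b.rid` column names (borrowed from the lemon dataset shown later in this example), and the choice to index the result like `record_id_pairs` are all assumptions made for illustration.

```python
import pandas as pd


def _tokens(record: pd.Series) -> set:
    """Collect the lower-cased whitespace tokens of all non-missing values."""
    tokens = set()
    for value in record.dropna():
        tokens.update(str(value).lower().split())
    return tokens


def jaccard_proba_func(
    records_a: pd.DataFrame,
    records_b: pd.DataFrame,
    record_id_pairs: pd.DataFrame,
) -> pd.Series:
    """Toy matcher (hypothetical): token Jaccard similarity rescaled to [-1, 1]."""
    scores = []
    # Assumption: the pair table stores record ids in "a.rid" and "b.rid"
    for a_id, b_id in zip(record_id_pairs["a.rid"], record_id_pairs["b.rid"]):
        tokens_a = _tokens(records_a.loc[a_id])
        tokens_b = _tokens(records_b.loc[b_id])
        union = tokens_a | tokens_b
        jaccard = len(tokens_a & tokens_b) / len(union) if union else 0.0
        scores.append(2.0 * jaccard - 1.0)  # map [0, 1] onto [-1, 1]
    return pd.Series(scores, index=record_id_pairs.index, name="score")


# Made-up toy records, only to exercise the function
records_a = pd.DataFrame(
    {"title": ["sims 2 glamour life"], "price": [24.99]}, index=[10]
)
records_b = pd.DataFrame(
    {"title": ["sims 2 glamour life", "quickbooks 2007"], "price": [24.99, None]},
    index=[20, 21],
)
pairs = pd.DataFrame({"a.rid": [10, 10], "b.rid": [20, 21]})

scores = jaccard_proba_func(records_a, records_b, pairs)
print(scores)
```

A real matcher would replace the Jaccard computation with a call to a trained model; the point here is only the shape of the inputs and the output.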

Matcher functions for `DITTO` and `py_entitymatching` are already provided in PINE (here and here).
This example uses these functions.

```python
# Load the entity matching function
from pine.matcher.magellan_matcher import make_magellan_matcher_func
from pine.matcher.transformer_matcher import make_transformer_matcher_func

model_root_dir = "data/model"
target_dataset_name = "structured_amazon_google"

# Keep the line that corresponds to the kind of model you prepared
# proba_fn = make_magellan_matcher_func(target_dataset_name, model_root_dir)
proba_fn = make_transformer_matcher_func(target_dataset_name, model_root_dir)
```


## Explaining with PINE

### Converting your data to PINE's data format

Your data must be converted to PINE's data format:

- Each record (entity) is a single row of a `pandas.DataFrame`.
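
One pitfall when preparing single-row frames is that `df.loc[label]` returns a `pandas.Series` rather than a one-row `pandas.DataFrame`. The sketch below, using a few toy rows with values borrowed from the tables in this example, shows two plain pandas ways to get a single-row frame: selecting with a list of labels, or building the frame from a dictionary. The variable names are illustrative only.

```python
import pandas as pd

records = pd.DataFrame(
    {
        "title": ["singing coach unlimited", "learning quickbooks 2007"],
        "manufacturer": ["carry-a-tune technologies", "intuit"],
        "price": [99.99, 38.99],
    }
)

as_series = records.loc[0]   # scalar label -> pandas.Series
as_frame = records.loc[[0]]  # list of labels -> one-row pandas.DataFrame

# Building a one-row DataFrame directly from a dict of field values
new_record = pd.DataFrame(
    [{"title": "qb pos 6.0 basic software", "manufacturer": "intuit", "price": 637.99}]
)

print(type(as_series).__name__, type(as_frame).__name__, new_record.shape)
```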

In this example, we use a lemon dataset, which is represented as `pandas.DataFrame` objects.
The `pine-explainer` package contains a helper function for loading lemon datasets.

```python
from IPython.display import display
from pine.dataset import load_dataset

dataset_root_dir = "data/dataset"
dataset = load_dataset(target_dataset_name, dataset_root_dir)
print("Dataset a")
display(dataset.test.records.a.head())
print("Dataset b")
display(dataset.test.records.b.head())
print("Dataset record-id pairs")
display(dataset.test.record_id_pairs.head())
print("Dataset ground truth labels")
display(dataset.test.labels.head())
print("Test DATA SIZE =", len(dataset.test.record_id_pairs))
```

Dataset a

| __id | title | manufacturer | price |
|---|---|---|---|
| 0 | clickart 950 000 premier image pack ( dvd-rom ) | broderbund | |
| 1 | ca international arcserve lap/desktop oem 30pk | computer associates | |
| 2 | noah 's ark activity center ( jewel case ages 3-8 ) | victory multimedia | |
| 3 | peachtree by sage premium accounting for nonprofits 2007 | sage software | 599.99 |
| 4 | singing coach unlimited | carry-a-tune technologies | 99.99 |


Dataset b

| __id | title | manufacturer | price |
|---|---|---|---|
| 0 | learning quickbooks 2007 | intuit | 38.99 |
| 1 | superstart ! fun with reading & writing ! | | 8.49 |
| 2 | qb pos 6.0 basic software | intuit | 637.99 |
| 3 | math missions : the amazing arcade adventure ( grades 3-5 ) | | 12.95 |
| 4 | production prem cs3 mac upgrad | adobe software | 805.99 |


Dataset record-id pairs

| pid | a.rid | b.rid |
|---|---|---|
| 0 | 393 | 831 |
| 1 | 559 | 324 |
| 2 | 558 | 3023 |
| 3 | 762 | 1618 |
| 4 | 1262 | 2860 |


Dataset ground truth labels

```
pid
0    False
1    False
2    False
3    False
4    False
Name: label, dtype: bool
```

Test DATA SIZE = 2293


```python
labels = dataset.test.labels
# Pairs whose label is "match"
match_label_pid = labels[labels].index
match_label_record_pair_id = dataset.test.record_id_pairs.loc[match_label_pid]
# Pairs whose label is "non-match"
unmatch_label_pid = labels[~labels].index
unmatch_label_record_pair_id = dataset.test.record_id_pairs.loc[unmatch_label_pid]
```
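
Before explaining a pair, it can help to look at the selected records side by side. The following sketch does this with plain pandas on made-up toy data that mimics the layout shown above (records indexed by id, a pair table with `a.rid` and `b.rid` columns, and boolean labels). The toy frames and the `a.`/`b.` column prefixes are assumptions for illustration, not PINE helpers.

```python
import pandas as pd

# Made-up toy data in the same layout as the dataset above
records_a = pd.DataFrame(
    {"title": ["sims 2 glamour life stuff pack", "singing coach unlimited"]},
    index=[0, 1],
)
records_b = pd.DataFrame(
    {"title": ["aspyr media inc sims 2 glamour life stuff pack", "learning quickbooks 2007"]},
    index=[0, 1],
)
record_id_pairs = pd.DataFrame({"a.rid": [0, 1], "b.rid": [0, 1]})
record_id_pairs.index.name = "pid"
labels = pd.Series([True, False], index=record_id_pairs.index, name="label")

# Keep only the pairs labeled as a match
match_pairs = record_id_pairs.loc[labels[labels].index]

# Attach both records to each pair; `on` matches the id column against the record index
side_by_side = (
    match_pairs
    .join(records_a.add_prefix("a."), on="a.rid")
    .join(records_b.add_prefix("b."), on="b.rid")
)
print(side_by_side)
```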

### Making an explanation

```python
import pandas as pd
from pine.entity import Entity, EntityPair
from pine.explainer import LimeResultPair
from pine.explainer.pine_explainer import make_explanation
```

```python
def explain_records(
    record_left: pd.DataFrame, record_right: pd.DataFrame, proba_fn, topk=5
):
    # Convert each single-row DataFrame to PINE's entity format
    entity_left = Entity.from_dataframe(record_left)
    entity_right = Entity.from_dataframe(record_right)
    entity_pair = EntityPair(entity_left, entity_right)
    # Make the explanation (top-k token pairs)
    explanation, entity_pair_marged = make_explanation(entity_pair, proba_fn, topk)
    return explanation, entity_pair_marged


# Explain the first pair labeled as a match
record_left = dataset.test.records.a.loc[[match_label_record_pair_id.iloc[0]["a.rid"]]]
record_right = dataset.test.records.b.loc[[match_label_record_pair_id.iloc[0]["b.rid"]]]
explanation, entity_pair_marged = explain_records(record_left, record_right, proba_fn)
```

```python
def display_explanation(explanation: LimeResultPair, entity_pair: EntityPair):
    display("Records")
    display(entity_pair.entity_l.to_dataframe())
    display(entity_pair.entity_r.to_dataframe())
    display("Explanation summary")
    display(f"match_score {explanation.match_score}")
    token_pair_attributions = []
    for attr in explanation.attributions:
        segment = entity_pair.merged_segment_list[attr.index]
        # Show the first token on each side; an empty string means that side has no token
        token_left = (
            entity_pair.entity_l.get_segment_label(segment.segment_list_in_l[0])
            if segment.segment_list_in_l
            else ""
        )
        token_right = (
            entity_pair.entity_r.get_segment_label(segment.segment_list_in_r[0])
            if segment.segment_list_in_r
            else ""
        )
        token_pair_attributions.append(
            {"token_left": token_left, "token_right": token_right, "score": attr.score}
        )
    display(pd.DataFrame(token_pair_attributions))


display_explanation(explanation, entity_pair_marged)
```

'Records'

| __id | title | manufacturer | price |
|---|---|---|---|
| 0 | sims 2 glamour life stuff pack | aspyr media | 24.99 |

| __id | title | manufacturer | price |
|---|---|---|---|
| 0 | aspyr media inc sims 2 glamour life stuff pack | | 23.44 |

'Explanation summary'

'match_score 0.89081871509552'

| | token_left | token_right | score |
|---|---|---|---|
| 0 | aspyr | aspyr | 0.014076 |
| 1 | 24.99 | 23.44 | -0.009165 |
| 2 | life | life | 0.005146 |
| 3 | glamour | glamour | 6.8e-05 |
| 4 | 2 | 2 | 5.1e-05 |
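
One way to read such a summary is to rank the token pairs by the magnitude of their score and split them by sign. The sketch below retypes the summary for the matching pair into a DataFrame by hand. Treating positive scores as evidence for a match and negative scores as evidence against it is an interpretation assumed here, and the variable names are illustrative only.

```python
import pandas as pd

# Values copied by hand from the explanation summary of the matching pair
attributions = pd.DataFrame(
    {
        "token_left": ["aspyr", "24.99", "life", "glamour", "2"],
        "token_right": ["aspyr", "23.44", "life", "glamour", "2"],
        "score": [0.014076, -0.009165, 0.005146, 6.8e-05, 5.1e-05],
    }
)

# Rank pairs by magnitude so the most influential ones come first
order = attributions["score"].abs().sort_values(ascending=False).index
ranked = attributions.loc[order]

# Assumed reading: positive scores support a match, negative scores oppose it
supporting = ranked[ranked["score"] > 0]
opposing = ranked[ranked["score"] < 0]

print(supporting)
print(opposing)
```

Under this reading, the shared manufacturer token `aspyr` contributes most toward the match, while the differing prices are the only pair that pulls against it.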


```python
# Explain the first pair labeled as a non-match
record_left = dataset.test.records.a.loc[
    [unmatch_label_record_pair_id.iloc[0]["a.rid"]]
]
record_right = dataset.test.records.b.loc[
    [unmatch_label_record_pair_id.iloc[0]["b.rid"]]
]
explanation, entity_pair_marged = explain_records(record_left, record_right, proba_fn)
display_explanation(explanation, entity_pair_marged)
```

'Records'

| __id | title | manufacturer | price |
|---|---|---|---|
| 0 | microsoft visual studio test agent 2005 cd 1 processor license | microsoft software | 5099.0 |

| __id | title | manufacturer | price |
|---|---|---|---|
| 0 | individual software professor teaches microsoft office 2007 | | 29.99 |

'Explanation summary'

'match_score -0.9930281639099121'

| | token_left | token_right | score |
|---|---|---|---|
| 0 | 5099.0 | | -0.000573 |
| 1 | agent | office | -0.000214 |
| 2 | microsoft | software | -0.000184 |
| 3 | processor | microsoft | -5.5e-05 |
| 4 | cd | | -4.5e-05 |
