# Introduction


This IPython notebook walks through a basic workflow for matching two tables using deepmatcher. Our goal is to match books listed on Amazon and Goodreads. Specifically, we want a matcher with precision of at least 90% and recall as high as possible. Both datasets contain information about books.

We use Magellan (the py_entitymatching package) to block tuple pairs, and the deepmatcher package to do the matching.

First, we import the py_entitymatching package and the other libraries we need:

In [142]:
import sys
import py_entitymatching as em
import deepmatcher as dm
import os

Matching two tables typically consists of the following three steps:

1. Reading the input tables

2. Blocking the input tables to get a candidate set

3. Matching the tuple pairs in the candidate set

# Read input tables

In [143]:
# Get the paths
path_A = "../DATA/TableA.csv"
path_B = "../DATA/TableB.csv"
A = em.read_csv_metadata(path_A, key='ID')
B = em.read_csv_metadata(path_B, key='ID')



In [144]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))

Number of tuples in A: 3000
Number of tuples in B: 3000
Number of tuples in A X B (i.e the cartesian product): 9000000


In [145]:
A.head(2)

Unnamed: 0,ID,title,author,rating,format
0,a1,To Kill a Mockingbird,Harper Lee,4.26,Paperback
1,a2,Harry Potter and the Sorcerer's Stone (Harry Potter #1),J.K. Rowling,4.45,Hardcover


In [146]:
B.head(2)

Unnamed: 0,ID,title,author,rating,format
0,b1,Make Your Bed: Little Things That Can Change Your Life...And Maybe the World,William H. McRaven,4.7,Hardcover
1,b2,The Silent Wife: A gripping emotional page turner with a twist that will take your breath away,Kerry Fisher,4.1,Paperback


In [147]:
# Display the keys of the input tables
em.get_key(A), em.get_key(B)

('ID', 'ID')

# Block tables to get candidate set

Before matching, we would like to remove the obviously non-matching tuple pairs. This reduces the number of tuple pairs considered for matching.
For the matching problem at hand, we know that two books with different titles will not match. So we decide to apply blocking over title, keeping only the pairs whose titles share at least three words:

In [148]:
# Create overlap blocker
ob = em.OverlapBlocker()

# Block using title attribute
C1= ob.block_tables(A, B, 'title', 'title', word_level=True, overlap_size=3, l_output_attrs=['title','author'], r_output_attrs=['title','author'], show_progress=False)

In [149]:
len(C1)

54723
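
To make the blocking criterion concrete: with `word_level=True` and `overlap_size=3`, a pair survives only if the two titles share at least three word tokens. The sketch below imitates that test in plain Python. It is not the library's implementation; the helper names `word_overlap` and `survives_blocking` and the lowercase whitespace tokenization are assumptions made for illustration, and py_entitymatching may normalize strings differently.

```python
def word_overlap(left, right):
    """Count the distinct whitespace-separated words shared by two strings."""
    left_words = set(left.lower().split())
    right_words = set(right.lower().split())
    return len(left_words & right_words)


def survives_blocking(left, right, overlap_size=3):
    """Keep a pair only if it shares at least `overlap_size` words."""
    return word_overlap(left, right) >= overlap_size


# Titles taken from the tables shown in this notebook.
print(survives_blocking("The Little House", "The Little House (9 Volumes Set)"))
print(survives_blocking("Jane Eyre", "Jane Eyre"))
```

Under this sketch, a two-word title such as "Jane Eyre" can never reach an overlap of three, even against an identical title.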

We write C1 to a CSV file. This set is unlabeled and will later be used as the candidate set for deepmatcher.

In [150]:
path_C = os.path.join('.', '..', 'DATA', 'candidate.csv')
C1.to_csv(path_C, index=False)

The number of tuple pairs is still large, so we would like to reduce it further. Knowing the dataset, we can say that two books with different authors will not match. Hence, we block the candidate set of tuple pairs on author, keeping the pairs whose author fields share at least one word.

In [151]:
C2 = ob.block_candset(C1, 'author', 'author', word_level=True, overlap_size=1, show_progress=False)

In [152]:
len(C2)

548
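
The effect of the two blocking steps can be summarized as a reduction ratio, that is, the fraction of the Cartesian product that blocking removed. The following is a small arithmetic sketch using only the sizes reported in this notebook; the function name `reduction_ratio` is introduced here for illustration.

```python
def reduction_ratio(num_candidates, num_left, num_right):
    """Fraction of the Cartesian product removed by blocking."""
    total_pairs = num_left * num_right
    return 1.0 - num_candidates / total_pairs


# Sizes reported above: 3000 x 3000 tuples, 54723 pairs after title
# blocking, and 548 pairs after the additional author blocking.
after_title = reduction_ratio(54723, 3000, 3000)
after_author = reduction_ratio(548, 3000, 3000)

print(f"after title blocking:  {after_title:.4%}")
print(f"after author blocking: {after_author:.4%}")
```

A high reduction ratio says nothing about whether true matches were dropped along the way, which is why the blocker output still needs to be checked.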

# Debug blocker output

The number of tuple pairs considered for matching is reduced to 548 (from 9000000), but we want to make sure that the blocker did not drop any potential matches. We can debug the blocker output in py_entitymatching as follows:

In [153]:
# Debug blocker output
dbg = em.debug_blocker(C2, A, B, output_size=200)

In [154]:
# Display first few tuple pairs from the debug_blocker's output
dbg.head()

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_title,ltable_author,ltable_rating,ltable_format,rtable_title,rtable_author,rtable_rating,rtable_format
0,0,a33,b2990,Holy Bible: King James Version,Anonymous,4.43,Hardcover,"Holy Bible: King James Version, 1611 Edition",Hendrickson Publishers,4.4,Hardcover
1,1,a1861,b166,The Liars' Club,Mary Karr,3.93,Paperback,The Book Club,Mary Alice Monroe,4.3,Paperback
2,2,a2142,b2554,A Short History of Progress,Ronald Wright,4.11,Paperback,A Short History of Nearly Everything,Bill Bryson,4.6,Paperback
3,3,a120,b2197,The Lorax,Dr. Seuss,4.35,Hardcover,The Sneetches and Other Stories,Dr. Seuss,4.8,Hardcover
4,4,a1688,b2024,Fantastic Beasts and Where to Find Them,Newt Scamander,3.97,Hardcover,Fantastic Beasts and Where to Find Them: The Original Screenplay,J.K. Rowling,4.6,Hardcover


From the debug blocker's output we observe that the current blocker drops a few potential matches. We want to update the blocking sequence to avoid dropping them.
The dropped potential matches are pairs whose book titles are shorter than 3 words, which can never satisfy the three-word overlap required by the title blocker. Hence, we use a black box blocker with a rule that keeps these tuple pairs: it first compares the titles word by word and then applies the Jaccard measure over 3-grams to the 'author' attribute.

In [155]:
# Get the 3-gram tokenizer and the Jaccard similarity function from the library
qgm_3 = em.get_tokenizers_for_blocking()['qgm_3']
jaccard = em.get_sim_funs_for_blocking()['jaccard']
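
To see what a 3-gram Jaccard score looks like on author names, here is a minimal pure-Python sketch. It is an illustration only: the helpers `qgrams` and `jaccard_similarity` are hypothetical stand-ins, and the library's `qgm_3` tokenizer may, for example, pad strings with boundary characters, so its scores can differ from these.

```python
def qgrams(text, q=3):
    """Return the set of overlapping q-character substrings of `text`."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_similarity(left, right):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


# Two spellings of the same author that appear in the input tables.
# "\u00eb" is the precomposed character "e with diaeresis".
score = jaccard_similarity(qgrams("Charlotte Bront\u00eb"),
                           qgrams("Charlotte Bronte"))
print(round(score, 3))
```

The two spellings differ only in their final 3-gram, so the score stays well above the 0.5 cut-off applied to authors in this notebook.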

In [156]:
# Define the black box blocker rule. It returns True to drop a pair and False to keep it.
def bbRule(ltuple, rtuple):
    l_title = ltuple['title'].split()
    r_title = rtuple['title'].split()
    # Keep only pairs whose titles are shorter than 3 words and identical word for word.
    if len(l_title) >= 3 or l_title != r_title:
        return True
    # Drop the pair if the authors are dissimilar (3-gram Jaccard below 0.5).
    if jaccard(qgm_3(ltuple['author']), qgm_3(rtuple['author'])) < 0.5:
        return True
    return False

bb = em.BlackBoxBlocker()
# Setting the rule in black box blocker
bb.set_black_box_function(bbRule)

In [157]:
D = bb.block_tables(A, B, l_output_attrs=['title','author'], r_output_attrs=['title','author'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:41


In [158]:
len(D)

17

In [159]:
D.head(2)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_title,ltable_author,rtable_title,rtable_author
0,0,a10,b2979,Jane Eyre,Charlotte Brontë,Jane Eyre,Charlotte Bronte
1,1,a28,b2987,Wuthering Heights,Emily Brontë,Wuthering Heights,Emily Bronte


In [160]:
# Combine blocker outputs
C = em.combine_blocker_outputs_via_union([C2, D])

In [161]:
len(C)

565
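
The sizes above also tell us how the two blocker outputs relate. Assuming the union keeps each (ltable_ID, rtable_ID) pair once, the size of the union equals the sum of the two sizes minus the number of shared pairs. The sketch below works through that arithmetic; the example pairs are only stand-ins chosen from rows shown in this notebook, not the actual contents of C2.

```python
def implied_overlap(size_a, size_b, size_union):
    """Number of shared pairs implied by |A| + |B| - |A union B|."""
    return size_a + size_b - size_union


# Sizes reported above: len(C2) = 548, len(D) = 17, len(C) = 565.
shared = implied_overlap(548, 17, 565)
print("pairs present in both C2 and D:", shared)

# The same idea on a toy example with sets of ID pairs (illustrative only).
pairs_c2 = {("a1", "b1421"), ("a21", "b1418")}
pairs_d = {("a10", "b2979"), ("a28", "b2987")}
combined = pairs_c2 | pairs_d
print(len(combined), len(pairs_c2 & pairs_d))
```

Under that assumption, every pair recovered by the black box blocker is a pair the overlap blockers had dropped.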

We observe that the current blocker sequence does not drop obvious potential matches, so we can proceed with the matching step. A subtle point to note here is that debugging the blocker output provides a practical stopping criterion for modifying the blocker sequence.

## Train a classifier using labeled data
Once we have the labeled data, we use `deepmatcher` to train a classifier. The first step is to split the data. In this example, we split the labeled data into training, validation, and test sets with a ratio of 3:1:1. (For now, only a three-way train/valid/test split is supported, where the validation set is used to select the best model across the training epochs.) To support caching and progressive training, we first save the splits to disk as CSV files and then load them back in. A cache file is written during loading. Subsequent training runs reuse the cache to save preprocessing time on the raw CSV files, unless the CSV files have been modified, in which case a new cache file is generated.

As only 565 tuple pairs remain after blocking, we can skip sampling the candidate set. We have manually labeled all the tuple pairs that survived blocking.

In [162]:
#Loading manually labeled data
G = em.read_csv_metadata("../DATA/Labelled Set G.csv", key="_id", fk_ltable="ltable_ID", fk_rtable="rtable_ID", ltable=A, rtable=B)
len(G)



565

In [163]:
G.head()

Unnamed: 0.1,Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_title,ltable_author,rtable_title,rtable_author,label
0,0,0,a1,b1421,To Kill a Mockingbird,Harper Lee,To Kill a Mockingbird,Harper Lee,1
1,1,1,a10,b2979,Jane Eyre,Charlotte Brontë,Jane Eyre,Charlotte Bronte,1
2,2,2,a1003,b1012,Mrs. Frisby and the Rats of NIMH (Rats of NIMH #1),Robert C. O'Brien,The Million Dollar Decision: Get Out of the Rigged Game of Investing and Add a Million to Your N...,Robert Rolih,0
3,3,3,a1003,b1932,Mrs. Frisby and the Rats of NIMH (Rats of NIMH #1),Robert C. O'Brien,The 21 Irrefutable Laws of Leadership: Follow Them and People Will Follow You (10th Anniversary ...,John C. Maxwell and,0
4,4,4,a1003,b2980,Mrs. Frisby and the Rats of NIMH (Rats of NIMH #1),Robert C. O'Brien,"The Lion, the Witch and the Wardrobe (The Chronicles of Narnia)",C. S. Lewis,0


In [164]:
G = G.drop(columns=['Unnamed: 0'])

In [165]:
# The directory where the data splits will be saved.
split_path = os.path.join('.', '..', 'DATA')

In [166]:
# Split the labeled data into train, valid, and test CSV files on disk, with a split ratio of 3:1:1.
dm.data.split(G, split_path, 'train.csv', 'valid.csv', 'test.csv',
              [3, 1, 1])
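
As a rough check on what a 3:1:1 split of 565 labeled pairs yields, the following sketch computes the expected split sizes. The helper `split_sizes` is hypothetical; deepmatcher's own rounding and shuffling may differ, so the actual file sizes can deviate slightly.

```python
def split_sizes(total, ratio):
    """Approximate split sizes for `total` rows under an integer ratio."""
    parts = sum(ratio)
    sizes = [total * r // parts for r in ratio]
    # Assign any rounding remainder to the first (training) split.
    sizes[0] += total - sum(sizes)
    return sizes


train_size, valid_size, test_size = split_sizes(565, [3, 1, 1])
print(train_size, valid_size, test_size)
```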

## Selecting the best learning-based matcher

Selecting the best learning-based matcher involves training one model for each of four attribute summarizers:

    SIF
    Hybrid
    RNN
    Attention

In [167]:
# Load the train, validation, and test files from disk. The columns listed in
# ignore_columns are skipped during data preprocessing.
train, validation, test = dm.data.process(
    path=os.path.join('.', '..', 'DATA'),
    cache='train_cache.pth',
    train='train.csv',
    validation='valid.csv',
    test='test.csv',
    left_prefix='ltable_',
    right_prefix='rtable_',
    id_attr='_id',
    ignore_columns=('ltable_id', 'rtable_id'))

Rebuilding data cache because: {'One or more data files have been modified.'}
Load time: 0.444913548999466
Vocab time: 0.07991927705006674
Metadata time: 0.19269640097627416
Cache time: 0.22147205099463463


## Hybrid model train and test

In [168]:
# Create a hybrid model.
model = dm.MatchingModel(attr_summarizer='hybrid')

In [169]:
# Train the hybrid model for 10 epochs with a batch size of 16 and a
# positive-to-negative ratio of 3. We save the best model (the one with the
# highest F1 score on the validation set) to 'hybrid_model.pth'.
model.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='hybrid_model.pth',
    pos_neg_ratio=3)

* Number of trainable parameters: 7133105
===>  TRAIN Epoch 1 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:09


Finished Epoch 1 || Run Time:   10.9 | Load Time:    0.0 || F1:  70.12 | Prec:  57.03 | Rec:  91.03 || Ex/s:  31.05

===>  EVAL Epoch 1 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 1 || Run Time:    1.3 | Load Time:    0.0 || F1:  81.82 | Prec:  84.91 | Rec:  78.95 || Ex/s:  86.30

* Best F1: 81.81818181818181
Saving best model...
===>  TRAIN Epoch 2 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:11


Finished Epoch 2 || Run Time:   13.5 | Load Time:    0.0 || F1:  85.63 | Prec:  80.34 | Rec:  91.67 || Ex/s:  25.02

===>  EVAL Epoch 2 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 2 || Run Time:    1.2 | Load Time:    0.0 || F1:  86.96 | Prec:  86.21 | Rec:  87.72 || Ex/s:  90.87

* Best F1: 86.95652173913044
Saving best model...
===>  TRAIN Epoch 3 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:10


Finished Epoch 3 || Run Time:   11.4 | Load Time:    0.0 || F1:  88.62 | Prec:  83.15 | Rec:  94.87 || Ex/s:  29.66

===>  EVAL Epoch 3 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 3 || Run Time:    0.9 | Load Time:    0.0 || F1:  83.64 | Prec:  86.79 | Rec:  80.70 || Ex/s: 118.20

===>  TRAIN Epoch 4 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:09


Finished Epoch 4 || Run Time:   10.7 | Load Time:    0.0 || F1:  90.15 | Prec:  84.36 | Rec:  96.79 || Ex/s:  31.70

===>  EVAL Epoch 4 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 4 || Run Time:    0.9 | Load Time:    0.0 || F1:  84.68 | Prec:  87.04 | Rec:  82.46 || Ex/s: 118.87

===>  TRAIN Epoch 5 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:09


Finished Epoch 5 || Run Time:   10.7 | Load Time:    0.0 || F1:  90.69 | Prec:  85.31 | Rec:  96.79 || Ex/s:  31.63

===>  EVAL Epoch 5 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 5 || Run Time:    0.9 | Load Time:    0.0 || F1:  85.96 | Prec:  85.96 | Rec:  85.96 || Ex/s: 119.00

===>  TRAIN Epoch 6 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:09


Finished Epoch 6 || Run Time:   11.0 | Load Time:    0.0 || F1:  92.40 | Prec:  87.86 | Rec:  97.44 || Ex/s:  30.66

===>  EVAL Epoch 6 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 6 || Run Time:    0.9 | Load Time:    0.0 || F1:  88.33 | Prec:  84.13 | Rec:  92.98 || Ex/s: 118.86

* Best F1: 88.33333333333333
Saving best model...
===>  TRAIN Epoch 7 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:09


Finished Epoch 7 || Run Time:   10.8 | Load Time:    0.0 || F1:  94.19 | Prec:  90.06 | Rec:  98.72 || Ex/s:  31.29

===>  EVAL Epoch 7 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 7 || Run Time:    0.9 | Load Time:    0.0 || F1:  85.04 | Prec:  77.14 | Rec:  94.74 || Ex/s: 117.42

===>  TRAIN Epoch 8 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:09


Finished Epoch 8 || Run Time:   10.9 | Load Time:    0.0 || F1:  94.41 | Prec:  91.57 | Rec:  97.44 || Ex/s:  31.09

===>  EVAL Epoch 8 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 8 || Run Time:    0.9 | Load Time:    0.0 || F1:  83.46 | Prec:  75.71 | Rec:  92.98 || Ex/s: 118.00

===>  TRAIN Epoch 9 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:09


Finished Epoch 9 || Run Time:   10.7 | Load Time:    0.0 || F1:  93.62 | Prec:  89.02 | Rec:  98.72 || Ex/s:  31.56

===>  EVAL Epoch 9 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 9 || Run Time:    1.0 | Load Time:    0.0 || F1:  84.48 | Prec:  83.05 | Rec:  85.96 || Ex/s: 116.40

===>  TRAIN Epoch 10 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:09


Finished Epoch 10 || Run Time:   10.7 | Load Time:    0.0 || F1:  95.71 | Prec:  91.76 | Rec: 100.00 || Ex/s:  31.66

===>  EVAL Epoch 10 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 10 || Run Time:    0.9 | Load Time:    0.0 || F1:  84.75 | Prec:  81.97 | Rec:  87.72 || Ex/s: 118.58

Loading best model...


In [170]:
# Evaluate the model on the test data; run_eval returns the F1 score.
model.run_eval(test)

===>  EVAL Epoch 6 :
Finished Epoch 6 || Run Time:    0.8 | Load Time:    0.0 || F1:  88.70 | Prec:  87.93 | Rec:  89.47 || Ex/s: 132.05



88.69565217391303
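
The F1 score in these logs is the harmonic mean of precision and recall. As a quick worked check, the sketch below recomputes it from the test-set precision and recall printed above; `f1_score` is a helper defined here for illustration, not a deepmatcher function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set precision and recall of the hybrid model, in percent.
test_f1 = f1_score(87.93, 89.47)
print(round(test_f1, 2))
```

The result agrees with the reported value up to the rounding of the printed inputs. Note that the test precision of 87.93% is below the 90% target stated in the introduction.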

## SIF model train and test

In [171]:
# Create a SIF model.
model_sif = dm.MatchingModel(attr_summarizer='sif')

In [172]:
# Train the SIF model for 10 epochs with a batch size of 16 and a
# positive-to-negative ratio of 3. We save the best model (the one with the
# highest F1 score on the validation set) to 'sif_model.pth'.
model_sif.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='sif_model.pth',
    pos_neg_ratio=3)

* Number of trainable parameters: 542402
===>  TRAIN Epoch 1 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 1 || Run Time:    0.2 | Load Time:    0.0 || F1:  64.00 | Prec:  47.65 | Rec:  97.44 || Ex/s: 1238.62

===>  EVAL Epoch 1 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 1 || Run Time:    0.0 | Load Time:    0.0 || F1:  67.06 | Prec:  50.44 | Rec: 100.00 || Ex/s: 2301.19

* Best F1: 67.05882352941177
Saving best model...
===>  TRAIN Epoch 2 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 2 || Run Time:    0.2 | Load Time:    0.0 || F1:  64.73 | Prec:  47.85 | Rec: 100.00 || Ex/s: 1253.28

===>  EVAL Epoch 2 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 2 || Run Time:    0.0 | Load Time:    0.0 || F1:  70.81 | Prec:  54.81 | Rec: 100.00 || Ex/s: 2449.75

* Best F1: 70.80745341614906
Saving best model...
===>  TRAIN Epoch 3 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 3 || Run Time:    0.2 | Load Time:    0.0 || F1:  77.61 | Prec:  63.41 | Rec: 100.00 || Ex/s: 1259.02

===>  EVAL Epoch 3 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 3 || Run Time:    0.0 | Load Time:    0.0 || F1:  73.33 | Prec:  59.14 | Rec:  96.49 || Ex/s: 2337.32

* Best F1: 73.33333333333333
Saving best model...
===>  TRAIN Epoch 4 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 4 || Run Time:    0.2 | Load Time:    0.0 || F1:  84.24 | Prec:  73.11 | Rec:  99.36 || Ex/s: 1261.69

===>  EVAL Epoch 4 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 4 || Run Time:    0.0 | Load Time:    0.0 || F1:  76.26 | Prec:  64.63 | Rec:  92.98 || Ex/s: 2288.56

* Best F1: 76.2589928057554
Saving best model...
===>  TRAIN Epoch 5 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 5 || Run Time:    0.2 | Load Time:    0.0 || F1:  86.91 | Prec:  76.85 | Rec: 100.00 || Ex/s: 1252.78

===>  EVAL Epoch 5 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 5 || Run Time:    0.0 | Load Time:    0.0 || F1:  77.94 | Prec:  67.09 | Rec:  92.98 || Ex/s: 2374.95

* Best F1: 77.94117647058825
Saving best model...
===>  TRAIN Epoch 6 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 6 || Run Time:    0.2 | Load Time:    0.0 || F1:  88.64 | Prec:  79.59 | Rec: 100.00 || Ex/s: 1263.37

===>  EVAL Epoch 6 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 6 || Run Time:    0.0 | Load Time:    0.0 || F1:  80.92 | Prec:  71.62 | Rec:  92.98 || Ex/s: 2314.76

* Best F1: 80.91603053435114
Saving best model...
===>  TRAIN Epoch 7 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 7 || Run Time:    0.2 | Load Time:    0.0 || F1:  89.40 | Prec:  80.83 | Rec: 100.00 || Ex/s: 1228.48

===>  EVAL Epoch 7 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 7 || Run Time:    0.0 | Load Time:    0.0 || F1:  80.00 | Prec:  71.23 | Rec:  91.23 || Ex/s: 2294.33

===>  TRAIN Epoch 8 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 8 || Run Time:    0.2 | Load Time:    0.0 || F1:  90.43 | Prec:  82.54 | Rec: 100.00 || Ex/s: 1220.06

===>  EVAL Epoch 8 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 8 || Run Time:    0.0 | Load Time:    0.0 || F1:  80.00 | Prec:  71.23 | Rec:  91.23 || Ex/s: 2229.02

===>  TRAIN Epoch 9 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 9 || Run Time:    0.2 | Load Time:    0.0 || F1:  91.23 | Prec:  83.87 | Rec: 100.00 || Ex/s: 1268.60

===>  EVAL Epoch 9 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 9 || Run Time:    0.0 | Load Time:    0.0 || F1:  80.62 | Prec:  72.22 | Rec:  91.23 || Ex/s: 2300.46

===>  TRAIN Epoch 10 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 10 || Run Time:    0.2 | Load Time:    0.0 || F1:  92.04 | Prec:  85.25 | Rec: 100.00 || Ex/s: 1254.65

===>  EVAL Epoch 10 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 10 || Run Time:    0.0 | Load Time:    0.0 || F1:  80.62 | Prec:  72.22 | Rec:  91.23 || Ex/s: 2341.79

Loading best model...


In [173]:
# Evaluate the model on the test data; run_eval returns the F1 score.
model_sif.run_eval(test)

===>  EVAL Epoch 6 :
Finished Epoch 6 || Run Time:    0.0 | Load Time:    0.0 || F1:  82.96 | Prec:  71.79 | Rec:  98.25 || Ex/s: 2774.77



82.96296296296296

## RNN model train and test

In [174]:
# Create an RNN model.
model_rnn = dm.MatchingModel(attr_summarizer='rnn')

In [175]:

model_rnn.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='rnn_model.pth',
    pos_neg_ratio=3)

* Number of trainable parameters: 1762802
===>  TRAIN Epoch 1 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 1 || Run Time:    2.9 | Load Time:    0.0 || F1:  62.14 | Prec:  45.76 | Rec:  96.79 || Ex/s: 115.10

===>  EVAL Epoch 1 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 1 || Run Time:    0.3 | Load Time:    0.0 || F1:  67.46 | Prec:  50.89 | Rec: 100.00 || Ex/s: 416.48

* Best F1: 67.45562130177515
Saving best model...
===>  TRAIN Epoch 2 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 2 || Run Time:    2.9 | Load Time:    0.0 || F1:  71.19 | Prec:  56.09 | Rec:  97.44 || Ex/s: 115.84

===>  EVAL Epoch 2 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 2 || Run Time:    0.3 | Load Time:    0.0 || F1:  77.14 | Prec:  65.06 | Rec:  94.74 || Ex/s: 413.86

* Best F1: 77.14285714285715
Saving best model...
===>  TRAIN Epoch 3 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 3 || Run Time:    2.9 | Load Time:    0.0 || F1:  88.12 | Prec:  80.42 | Rec:  97.44 || Ex/s: 114.66

===>  EVAL Epoch 3 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 3 || Run Time:    0.3 | Load Time:    0.0 || F1:  78.01 | Prec:  65.48 | Rec:  96.49 || Ex/s: 418.24

* Best F1: 78.01418439716312
Saving best model...
===>  TRAIN Epoch 4 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 4 || Run Time:    2.9 | Load Time:    0.0 || F1:  93.90 | Prec:  89.53 | Rec:  98.72 || Ex/s: 115.45

===>  EVAL Epoch 4 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 4 || Run Time:    0.3 | Load Time:    0.0 || F1:  78.26 | Prec:  66.67 | Rec:  94.74 || Ex/s: 420.04

* Best F1: 78.2608695652174
Saving best model...
===>  TRAIN Epoch 5 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 5 || Run Time:    2.9 | Load Time:    0.0 || F1:  97.20 | Prec:  94.55 | Rec: 100.00 || Ex/s: 115.86

===>  EVAL Epoch 5 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 5 || Run Time:    0.3 | Load Time:    0.0 || F1:  80.92 | Prec:  71.62 | Rec:  92.98 || Ex/s: 413.02

* Best F1: 80.91603053435114
Saving best model...
===>  TRAIN Epoch 6 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 6 || Run Time:    2.9 | Load Time:    0.0 || F1:  98.73 | Prec:  97.50 | Rec: 100.00 || Ex/s: 115.96

===>  EVAL Epoch 6 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 6 || Run Time:    0.3 | Load Time:    0.0 || F1:  79.10 | Prec:  68.83 | Rec:  92.98 || Ex/s: 416.77

===>  TRAIN Epoch 7 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 7 || Run Time:    2.9 | Load Time:    0.0 || F1: 100.00 | Prec: 100.00 | Rec: 100.00 || Ex/s: 116.05

===>  EVAL Epoch 7 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 7 || Run Time:    0.3 | Load Time:    0.0 || F1:  78.52 | Prec:  67.95 | Rec:  92.98 || Ex/s: 420.39

===>  TRAIN Epoch 8 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 8 || Run Time:    3.6 | Load Time:    0.0 || F1: 100.00 | Prec: 100.00 | Rec: 100.00 || Ex/s:  93.46

===>  EVAL Epoch 8 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 8 || Run Time:    0.3 | Load Time:    0.0 || F1:  79.10 | Prec:  68.83 | Rec:  92.98 || Ex/s: 318.92

===>  TRAIN Epoch 9 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 9 || Run Time:    3.5 | Load Time:    0.0 || F1: 100.00 | Prec: 100.00 | Rec: 100.00 || Ex/s:  96.92

===>  EVAL Epoch 9 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 9 || Run Time:    0.3 | Load Time:    0.0 || F1:  79.10 | Prec:  68.83 | Rec:  92.98 || Ex/s: 402.04

===>  TRAIN Epoch 10 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 10 || Run Time:    3.0 | Load Time:    0.0 || F1: 100.00 | Prec: 100.00 | Rec: 100.00 || Ex/s: 113.48

===>  EVAL Epoch 10 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 10 || Run Time:    0.3 | Load Time:    0.0 || F1:  79.10 | Prec:  68.83 | Rec:  92.98 || Ex/s: 416.68

Loading best model...


In [176]:
# Evaluate the model on the test data; run_eval returns the F1 score.
model_rnn.run_eval(test)

===>  EVAL Epoch 5 :
Finished Epoch 5 || Run Time:    0.2 | Load Time:    0.0 || F1:  83.87 | Prec:  77.61 | Rec:  91.23 || Ex/s: 508.46



83.8709677419355

## Attention model train and test

In [177]:
# Create an attention model.
model_attn = dm.MatchingModel(attr_summarizer='attention')

In [178]:

model_attn.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='attn_model.pth',
    pos_neg_ratio=3)

* Number of trainable parameters: 3429602
===>  TRAIN Epoch 1 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 1 || Run Time:    4.4 | Load Time:    0.0 || F1:  69.76 | Prec:  56.30 | Rec:  91.67 || Ex/s:  77.28

===>  EVAL Epoch 1 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 1 || Run Time:    0.5 | Load Time:    0.0 || F1:  77.78 | Prec:  71.01 | Rec:  85.96 || Ex/s: 225.60

* Best F1: 77.77777777777777
Saving best model...
===>  TRAIN Epoch 2 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 2 || Run Time:    4.2 | Load Time:    0.0 || F1:  82.19 | Prec:  71.77 | Rec:  96.15 || Ex/s:  80.21

===>  EVAL Epoch 2 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 2 || Run Time:    0.4 | Load Time:    0.0 || F1:  79.37 | Prec:  72.46 | Rec:  87.72 || Ex/s: 287.79

* Best F1: 79.36507936507935
Saving best model...
===>  TRAIN Epoch 3 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 3 || Run Time:    4.3 | Load Time:    0.0 || F1:  86.59 | Prec:  76.73 | Rec:  99.36 || Ex/s:  78.02

===>  EVAL Epoch 3 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 3 || Run Time:    0.5 | Load Time:    0.0 || F1:  81.54 | Prec:  72.60 | Rec:  92.98 || Ex/s: 227.78

* Best F1: 81.53846153846153
Saving best model...
===>  TRAIN Epoch 4 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 4 || Run Time:    4.2 | Load Time:    0.0 || F1:  89.21 | Prec:  81.82 | Rec:  98.08 || Ex/s:  80.28

===>  EVAL Epoch 4 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 4 || Run Time:    0.4 | Load Time:    0.0 || F1:  80.30 | Prec:  70.67 | Rec:  92.98 || Ex/s: 281.58

===>  TRAIN Epoch 5 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 5 || Run Time:    4.5 | Load Time:    0.0 || F1:  90.48 | Prec:  84.44 | Rec:  97.44 || Ex/s:  74.56

===>  EVAL Epoch 5 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 5 || Run Time:    0.5 | Load Time:    0.0 || F1:  82.17 | Prec:  73.61 | Rec:  92.98 || Ex/s: 213.41

* Best F1: 82.17054263565892
Saving best model...
===>  TRAIN Epoch 6 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 6 || Run Time:    4.5 | Load Time:    0.0 || F1:  90.59 | Prec:  83.70 | Rec:  98.72 || Ex/s:  74.64

===>  EVAL Epoch 6 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 6 || Run Time:    0.5 | Load Time:    0.0 || F1:  83.87 | Prec:  77.61 | Rec:  91.23 || Ex/s: 228.54

* Best F1: 83.8709677419355
Saving best model...
===>  TRAIN Epoch 7 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04


Finished Epoch 7 || Run Time:    5.0 | Load Time:    0.0 || F1:  91.76 | Prec:  84.78 | Rec: 100.00 || Ex/s:  67.03

===>  EVAL Epoch 7 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 7 || Run Time:    0.5 | Load Time:    0.0 || F1:  84.55 | Prec:  78.79 | Rec:  91.23 || Ex/s: 204.81

* Best F1: 84.55284552845528
Saving best model...
===>  TRAIN Epoch 8 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 8 || Run Time:    4.4 | Load Time:    0.0 || F1:  92.86 | Prec:  86.67 | Rec: 100.00 || Ex/s:  76.31

===>  EVAL Epoch 8 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 8 || Run Time:    0.5 | Load Time:    0.0 || F1:  85.25 | Prec:  80.00 | Rec:  91.23 || Ex/s: 233.16

* Best F1: 85.24590163934425
Saving best model...
===>  TRAIN Epoch 9 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 9 || Run Time:    4.3 | Load Time:    0.0 || F1:  93.98 | Prec:  88.64 | Rec: 100.00 || Ex/s:  78.33

===>  EVAL Epoch 9 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 9 || Run Time:    0.4 | Load Time:    0.0 || F1:  85.25 | Prec:  80.00 | Rec:  91.23 || Ex/s: 288.87

===>  TRAIN Epoch 10 :


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 10 || Run Time:    3.9 | Load Time:    0.0 || F1:  94.55 | Prec:  89.66 | Rec: 100.00 || Ex/s:  85.19

===>  EVAL Epoch 10 :


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finished Epoch 10 || Run Time:    0.4 | Load Time:    0.0 || F1:  85.95 | Prec:  81.25 | Rec:  91.23 || Ex/s: 289.93

* Best F1: 85.9504132231405
Saving best model...
Loading best model...


In [179]:
# Evaluate the model on the test data; run_eval returns the F1 score.
model_attn.run_eval(test)

===>  EVAL Epoch 10 :
Finished Epoch 10 || Run Time:    0.4 | Load Time:    0.0 || F1:  85.00 | Prec:  80.95 | Rec:  89.47 || Ex/s: 285.72



85.0
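
To compare the four matchers side by side, the sketch below collects the best validation F1 (the last "* Best F1" line of each training log) and the test F1 reported above, then picks the model with the highest validation F1. Selecting on the validation score rather than the test score is a choice made for this sketch; the numbers are copied from the logs and rounded to two decimals.

```python
# Best validation F1 and test F1 for each attribute summarizer, in percent.
results = {
    "hybrid":    {"valid_f1": 88.33, "test_f1": 88.70},
    "sif":       {"valid_f1": 80.92, "test_f1": 82.96},
    "rnn":       {"valid_f1": 80.92, "test_f1": 83.87},
    "attention": {"valid_f1": 85.95, "test_f1": 85.00},
}

# Rank the models by validation F1, best first.
ranking = sorted(results, key=lambda name: results[name]["valid_f1"], reverse=True)
best_name = ranking[0]

for name in ranking:
    scores = results[name]
    print(f"{name:<10} valid F1 {scores['valid_f1']:6.2f}   test F1 {scores['test_f1']:6.2f}")
print("selected:", best_name)
```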

## Predictions using Hybrid Model

In [180]:
# Load the model - Hybrid model.
model = dm.MatchingModel(attr_summarizer='hybrid')
model.load_state('hybrid_model.pth')

In [181]:
# Hybrid model: load the candidate set. Note that the trained model is an input parameter, as we need the trained
# model for candidate set preprocessing.
candidate = dm.data.process_unlabeled(
    path=os.path.join('.', '..', 'DATA', 'candidate.csv'),
    trained_model=model,
    ignore_columns=('ltable_id', 'rtable_id'))

Load time: 29.54633905098308
Vocab update time: 0.4504475080175325


In [182]:
# Predict the pairs in the candidate set and return a dataframe containing the pair id with 
# the score of being a match.
predictions = model.run_prediction(candidate, output_attributes=list(candidate.get_raw_table().columns))

===>  PREDICT Epoch 6 :


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:47


Finished Epoch 6 || Run Time:  344.0 | Load Time:    3.9 || F1:   0.00 | Prec:   0.00 | Rec:   0.00 || Ex/s:   0.00



In [183]:
predictions.head()

_id,match_score,ltable_ID,rtable_ID,ltable_title,ltable_author,rtable_title,rtable_author
26670,0.992591,a1,b1421,To Kill a Mockingbird,Harper Lee,To Kill a Mockingbird,Harper Lee
26668,0.993153,a21,b1418,Green Eggs and Ham,Dr. Seuss,Green Eggs and Ham,Dr. Seuss
14602,0.983933,a2270,b849,The Hate U Give,Angie Thomas,The Hate U Give,Angie Thomas
11028,0.991205,a1034,b615,The Very Hungry Caterpillar,Eric Carle,The Very Hungry Caterpillar,Eric Carle
969,0.985455,a1387,b60,The Book of Disquiet,Fernando Pessoa,The Book of Awesome,Neil Pasricha


In [184]:
high_score_pairs = predictions[predictions['match_score'] >= 0.9].sort_values(by=['match_score'], ascending=False)
high_score_pairs

_id,match_score,ltable_ID,rtable_ID,ltable_title,ltable_author,rtable_title,rtable_author
40357,0.996721,a1241,b2177,The Long Winter (Little House #6),Laura Ingalls Wilder,The Little House (9 Volumes Set),Laura Ingalls Wilder and
40353,0.996679,a87,b2177,The Little House Collection (Little House #1-9),Laura Ingalls Wilder,The Little House (9 Volumes Set),Laura Ingalls Wilder and
40355,0.995849,a380,b2177,Little House on the Prairie (Little House #2),Laura Ingalls Wilder,The Little House (9 Volumes Set),Laura Ingalls Wilder and
40356,0.995518,a1209,b2177,On the Banks of Plum Creek (Little House #4),Laura Ingalls Wilder,The Little House (9 Volumes Set),Laura Ingalls Wilder and
24562,0.995413,a175,b1301,In Cold Blood,Truman Capote,In Cold Blood,Truman Capote
40359,0.995267,a2956,b2177,The Little House,Virginia Lee Burton,The Little House (9 Volumes Set),Laura Ingalls Wilder and
38943,0.994365,a252,b2073,Things Fall Apart (The African Trilogy #1),Chinua Achebe,Things Fall Apart,Chinua Achebe
40358,0.994276,a1573,b2177,Little Town on the Prairie (Little House #7),Laura Ingalls Wilder,The Little House (9 Volumes Set),Laura Ingalls Wilder and
26668,0.993153,a21,b1418,Green Eggs and Ham,Dr. Seuss,Green Eggs and Ham,Dr. Seuss
40354,0.992893,a303,b2177,Little House in the Big Woods (Little House #1),Laura Ingalls Wilder,The Little House (9 Volumes Set),Laura Ingalls Wilder and
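
The output above shows a single right-hand record, b2177, scoring above 0.99 against several different left-hand books. If each book in B is expected to match at most one book in A (an assumption, not something the notebook states), one possible post-processing step is to keep only the top-scoring pair per `rtable_ID`. The sketch below is not part of the original workflow: the helper name `best_pair_per_right` is hypothetical, and it works on plain dictionaries so that it runs without deepmatcher. The sample rows are copied from the output above.

```python
# Hypothetical post-processing sketch (not part of the original notebook):
# keep only the highest-scoring candidate pair for each right-hand record.

def best_pair_per_right(pairs):
    """Return one pair per rtable_ID, choosing the highest match_score."""
    best = {}
    for pair in pairs:
        key = pair["rtable_ID"]
        if key not in best or pair["match_score"] > best[key]["match_score"]:
            best[key] = pair
    # Sort by score, highest first, to mirror the notebook's display order.
    return sorted(best.values(), key=lambda p: p["match_score"], reverse=True)


# A few rows copied from the high_score_pairs output above.
sample_pairs = [
    {"match_score": 0.996721, "ltable_ID": "a1241", "rtable_ID": "b2177"},
    {"match_score": 0.996679, "ltable_ID": "a87", "rtable_ID": "b2177"},
    {"match_score": 0.995413, "ltable_ID": "a175", "rtable_ID": "b1301"},
    {"match_score": 0.995267, "ltable_ID": "a2956", "rtable_ID": "b2177"},
    {"match_score": 0.994365, "ltable_ID": "a252", "rtable_ID": "b2073"},
]

deduplicated = best_pair_per_right(sample_pairs)
for pair in deduplicated:
    print(pair["ltable_ID"], pair["rtable_ID"], pair["match_score"])
```

On these rows the pair kept for b2177 is a1241 (The Long Winter), although a87 (The Little House Collection) looks like the closer match to a nine-volume set. Picking by score alone is therefore only a rough heuristic, and the surviving pairs would still need a manual check.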


In [185]:
valid_predictions = model.run_prediction(validation, output_attributes=True)
valid_predictions.head()

===>  PREDICT Epoch 6 :
Finished Epoch 6 || Run Time:    0.8 | Load Time:    0.0 || F1:  87.60 | Prec:  82.81 | Rec:  92.98 || Ex/s: 132.59



_id,match_score,ltable_ID,rtable_ID,ltable_title,ltable_author,rtable_title,rtable_author,label
69,0.989513,a125,b2986,Catching Fire (The Hunger Games #2),Suzanne Collins,The Hunger Games,Suzanne Collins,1
396,0.980363,a40,b1860,Brave New World / Brave New World Revisited,Aldous Huxley,Brave New World,Aldous Huxley,1
58,0.114351,a1221,b1008,Buddenbrooks: The Decline of a Family,Thomas Mann,The National Parks of the United States: A Photographic Journey,Andrew Thomas,0
218,0.578693,a2073,b1715,Influence: The Psychology of Persuasion,Robert B. Cialdini,Influence: The Psychology of Persuasion Revised Edition,Robert B. Cialdini,1
360,0.98646,a34,b2991,The Picture of Dorian Gray,Oscar Wilde,The Picture of Dorian Gray (Dover Thrift Editions),Oscar Wilde,1
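
The log line above reports F1, precision and recall for the whole validation set. To show how such figures relate to the `match_score` and `label` columns, here is a minimal sketch that recomputes them from scores and labels. It assumes a pair is predicted as a match when its score reaches a cutoff of 0.5; that cutoff and the helper name `precision_recall_f1` are assumptions made for illustration. Only the five rows displayed above are used, so the results will not equal the logged values.

```python
# Illustrative sketch: derive precision, recall and F1 from match scores and labels.
# The 0.5 default cutoff is an assumption, not something the notebook states.

def precision_recall_f1(scores, labels, threshold=0.5):
    """Treat score >= threshold as a predicted match and compare with the labels."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# match_score and label values copied from the five rows shown above.
scores = [0.989513, 0.980363, 0.114351, 0.578693, 0.98646]
labels = [1, 1, 0, 1, 1]

metrics_default = precision_recall_f1(scores, labels)
metrics_strict = precision_recall_f1(scores, labels, threshold=0.9)
print("cutoff 0.5:", metrics_default)
print("cutoff 0.9:", metrics_strict)
```

With the stricter cutoff, pair 218 (score 0.578693, label 1) is no longer predicted as a match, so recall drops on this sample while precision stays the same.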


## SIF Model predictions

In [186]:
# Load the trained SIF model.
model_sif = dm.MatchingModel(attr_summarizer='sif')
model_sif.load_state('sif_model.pth')

In [187]:
# SIF model - Load the candidate set. Note that the trained model is an input parameter, as we need the trained
# model for candidate set preprocessing.
candidate_sif = dm.data.process_unlabeled(
    path=os.path.join('.', '..', 'DATA', 'candidate.csv'),
    trained_model=model_sif,
    ignore_columns=('ltable_id', 'rtable_id'))

Load time: 31.04383582895389
Vocab update time: 0.47491544304648414


In [188]:
# Predict the pairs in the candidate set and return a dataframe containing the pair id with 
# the score of being a match.
predictions_sif = model_sif.run_prediction(candidate_sif, output_attributes=list(candidate_sif.get_raw_table().columns))

===>  PREDICT Epoch 6 :


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:13


Finished Epoch 6 || Run Time:    9.9 | Load Time:    4.0 || F1:   0.00 | Prec:   0.00 | Rec:   0.00 || Ex/s:   0.00



In [189]:
valid_predictions_sif = model_sif.run_prediction(validation, output_attributes=True)
valid_predictions_sif.head()

===>  PREDICT Epoch 6 :
Finished Epoch 6 || Run Time:    0.0 | Load Time:    0.0 || F1:  80.92 | Prec:  71.62 | Rec:  92.98 || Ex/s: 2526.88



_id,match_score,ltable_ID,rtable_ID,ltable_title,ltable_author,rtable_title,rtable_author,label
69,0.991062,a125,b2986,Catching Fire (The Hunger Games #2),Suzanne Collins,The Hunger Games,Suzanne Collins,1
396,0.999974,a40,b1860,Brave New World / Brave New World Revisited,Aldous Huxley,Brave New World,Aldous Huxley,1
58,0.184558,a1221,b1008,Buddenbrooks: The Decline of a Family,Thomas Mann,The National Parks of the United States: A Photographic Journey,Andrew Thomas,0
218,0.998985,a2073,b1715,Influence: The Psychology of Persuasion,Robert B. Cialdini,Influence: The Psychology of Persuasion Revised Edition,Robert B. Cialdini,1
360,0.999904,a34,b2991,The Picture of Dorian Gray,Oscar Wilde,The Picture of Dorian Gray (Dover Thrift Editions),Oscar Wilde,1
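
The SIF model's validation precision of 71.62 falls short of the 90% precision goal stated in the introduction. One common way to trade recall for precision is to raise the cutoff applied to `match_score`. The sketch below searches labeled pairs for the lowest cutoff that reaches a target precision. It is not part of the original notebook: the helper name `lowest_threshold_for_precision` is hypothetical, and the scores and labels are made-up illustration data, not results from this notebook.

```python
# Hypothetical sketch: find the lowest score cutoff that reaches a target precision.
# Recall can only fall as the cutoff rises, so the lowest qualifying cutoff
# also gives the highest recall among the cutoffs that meet the target.

def lowest_threshold_for_precision(scores, labels, target=0.9):
    """Return (threshold, precision, recall), or None if no cutoff reaches the target."""
    total_positives = sum(labels)
    for threshold in sorted(set(scores)):
        # Labels of the pairs that would be predicted as matches at this cutoff.
        kept = [y for s, y in zip(scores, labels) if s >= threshold]
        true_positives = sum(kept)
        precision = true_positives / len(kept)
        if precision >= target:
            recall = true_positives / total_positives if total_positives else 0.0
            return threshold, precision, recall
    return None


# Made-up validation scores and labels, for illustration only.
example_scores = [0.99, 0.97, 0.95, 0.91, 0.88, 0.80, 0.65, 0.55, 0.30, 0.10]
example_labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

result = lowest_threshold_for_precision(example_scores, example_labels, target=0.9)
print(result)
```

A cutoff chosen this way should come from labeled pairs such as the validation set, because precision cannot be measured on the unlabeled candidate set.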


## RNN Model predictions

In [190]:
# Load the model - RNN model.
model_rnn = dm.MatchingModel(attr_summarizer='rnn')
model_rnn.load_state('rnn_model.pth')

In [191]:
# RNN model - Load the candidate set. Note that the trained model is an input parameter, as we need the trained
# model for candidate set preprocessing.
candidate_RNN = dm.data.process_unlabeled(
    path=os.path.join('.', '..', 'DATA', 'candidate.csv'),
    trained_model=model_rnn,
    ignore_columns=('ltable_id', 'rtable_id'))

Load time: 29.02728861104697
Vocab update time: 0.4698701609740965


In [192]:
# Predict the pairs in the candidate set and return a dataframe containing the pair id with 
# the score of being a match.
predictions_rnn = model_rnn.run_prediction(candidate_RNN, output_attributes=list(candidate_RNN.get_raw_table().columns))

===>  PREDICT Epoch 5 :


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:01:20


Finished Epoch 5 || Run Time:   76.6 | Load Time:    3.8 || F1:   0.00 | Prec:   0.00 | Rec:   0.00 || Ex/s:   0.00



In [193]:
valid_predictions_rnn = model_rnn.run_prediction(validation, output_attributes=True)
valid_predictions_rnn.head()

===>  PREDICT Epoch 5 :
Finished Epoch 5 || Run Time:    0.2 | Load Time:    0.0 || F1:  80.92 | Prec:  71.62 | Rec:  92.98 || Ex/s: 482.45



_id,match_score,ltable_ID,rtable_ID,ltable_title,ltable_author,rtable_title,rtable_author,label
69,0.562076,a125,b2986,Catching Fire (The Hunger Games #2),Suzanne Collins,The Hunger Games,Suzanne Collins,1
396,0.990769,a40,b1860,Brave New World / Brave New World Revisited,Aldous Huxley,Brave New World,Aldous Huxley,1
58,0.04168,a1221,b1008,Buddenbrooks: The Decline of a Family,Thomas Mann,The National Parks of the United States: A Photographic Journey,Andrew Thomas,0
218,0.865173,a2073,b1715,Influence: The Psychology of Persuasion,Robert B. Cialdini,Influence: The Psychology of Persuasion Revised Edition,Robert B. Cialdini,1
360,0.966046,a34,b2991,The Picture of Dorian Gray,Oscar Wilde,The Picture of Dorian Gray (Dover Thrift Editions),Oscar Wilde,1


## Attention Model predictions

In [194]:
# Load the model - Attention model.
model_attn = dm.MatchingModel(attr_summarizer='attention')
model_attn.load_state('attn_model.pth')

In [195]:
# Attention model - Load the candidate set. Note that the trained model is an input parameter, as we need the trained
# model for candidate set preprocessing.
candidate_attn = dm.data.process_unlabeled(
    path=os.path.join('.', '..', 'DATA', 'candidate.csv'),
    trained_model=model_attn,
    ignore_columns=('ltable_id', 'rtable_id'))

Load time: 29.51213377097156
Vocab update time: 0.5610705970320851


In [196]:
# Predict the pairs in the candidate set and return a dataframe containing the pair id with 
# the score of being a match.
predictions_attn = model_attn.run_prediction(candidate_attn, output_attributes=list(candidate_attn.get_raw_table().columns))

===>  PREDICT Epoch 10 :


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:02:48


Finished Epoch 10 || Run Time:  164.2 | Load Time:    4.0 || F1:   0.00 | Prec:   0.00 | Rec:   0.00 || Ex/s:   0.00



In [197]:
valid_predictions_attn = model_attn.run_prediction(validation, output_attributes=True)
valid_predictions_attn.head()

===>  PREDICT Epoch 10 :
Finished Epoch 10 || Run Time:    0.4 | Load Time:    0.0 || F1:  85.95 | Prec:  81.25 | Rec:  91.23 || Ex/s: 276.53



_id,match_score,ltable_ID,rtable_ID,ltable_title,ltable_author,rtable_title,rtable_author,label
69,0.988211,a125,b2986,Catching Fire (The Hunger Games #2),Suzanne Collins,The Hunger Games,Suzanne Collins,1
396,0.958843,a40,b1860,Brave New World / Brave New World Revisited,Aldous Huxley,Brave New World,Aldous Huxley,1
58,0.216428,a1221,b1008,Buddenbrooks: The Decline of a Family,Thomas Mann,The National Parks of the United States: A Photographic Journey,Andrew Thomas,0
218,0.829784,a2073,b1715,Influence: The Psychology of Persuasion,Robert B. Cialdini,Influence: The Psychology of Persuasion Revised Edition,Robert B. Cialdini,1
360,0.987078,a34,b2991,The Picture of Dorian Gray,Oscar Wilde,The Picture of Dorian Gray (Dover Thrift Editions),Oscar Wilde,1
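
To compare the four models side by side, the sketch below collects the validation metrics from the `Finished Epoch` log lines above and checks them against the 90% precision goal from the introduction. The dictionary layout and the names `validation_metrics`, `TARGET_PRECISION` and `f1_from` are illustrative choices; the numbers are copied from the logs, and the key `model` refers to the first model loaded in this part of the notebook.

```python
# Validation metrics (in percent) copied from the "Finished Epoch" log lines above.
validation_metrics = {
    "model": {"precision": 82.81, "recall": 92.98, "f1": 87.60},
    "sif": {"precision": 71.62, "recall": 92.98, "f1": 80.92},
    "rnn": {"precision": 71.62, "recall": 92.98, "f1": 80.92},
    "attention": {"precision": 81.25, "recall": 91.23, "f1": 85.95},
}

TARGET_PRECISION = 90.0  # precision goal stated in the introduction


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


best_by_f1 = max(validation_metrics, key=lambda name: validation_metrics[name]["f1"])
meets_target = [
    name
    for name, metrics in validation_metrics.items()
    if metrics["precision"] >= TARGET_PRECISION
]

for name, metrics in validation_metrics.items():
    recomputed = f1_from(metrics["precision"], metrics["recall"])
    print(f"{name:10s} P={metrics['precision']:.2f} R={metrics['recall']:.2f} "
          f"F1={metrics['f1']:.2f} (recomputed {recomputed:.2f})")

print("Best F1:", best_by_f1)
print("Models meeting the precision target:", meets_target)
```

At these logged values no model reaches 90% precision on the validation set, and the first model has the highest F1. The recomputed F1 can differ from the logged one in the last digit because the logged precision and recall are already rounded.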
