# Entity Matching (EM) about Books

# Introduction

This IPython notebook shows a basic workflow two tables using *py_entitymatching*. We want to match data science books in library of UW-Madison and UIUC.  The book information of UW-Madison is from [here](https://search.library.wisc.edu/search/system?q=Data+Science) and the book information of UIUC is from [here](https://vufind.carli.illinois.edu/vf-uiu/Search/Home?lookfor=Data+Science+&type=all&start_over=1&submit=Find&search=new). Details can be found from our Stage 2 Report [here](https://github.com/iphyer/CS839ClassProject/blob/master/stage2/Stage2Report.pdf). 


First, we need to import *py_entitymatching* package and other libraries as follows:

In [217]:
import pandas as pd
import py_entitymatching as em

# Read input tables

We begin by loading the input tables.

We name the table about UW-Madison `TableA.csv` and the table about UIUC `TableB.csv`. And there are 

* 4824 tuples in table `TableA.csv`
* 5060 tuples in table `TableB.csv`

In [218]:
table_A = em.read_csv_metadata('../data/TableA.csv', key = 'ID')
table_B = em.read_csv_metadata('../data/TableB.csv', key = 'ID')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.


In [219]:
table_A.shape

(4824, 8)

In [220]:
table_B.shape

(5060, 8)

# Down sampling
Down sampling table A and B， get 1000 examples from both table A and B.

In [221]:
A, B = em.down_sample(table_A, table_B, size=1000, y_param = 1, show_progress=False)

In [102]:
A.shape

(1000, 8)

In [None]:
block_f = em.get_features_for_blocking(A, B)
block_t = em.get_tokenizers_for_blocking()
block_s = em.get_sim_funs_for_blocking()
r = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', block_t, block_s)
em.add_feature(block_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', r)

# Block tables to get candidate set

Here we will use several blockers to remove obviously non-matching tuple pairs from the input tables.

For the same book, since we got the data from two different library websites, their attributes may not be the exact same. Therefore, we applied an OverlapBlocker over some of the attributes, including the *Title* and *Author*.

After multiple tests, we found the best overlap_size for each attribute - for *Author* and *Title*, we set the overlap_size to be 2 and 4 respectively.

In [None]:
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'Author', 'Author', 
                    l_output_attrs=['Title','Author','Publication','Format','ISBN','Series', 'Physical Details'], 
                    r_output_attrs=['Title','Author','Publication','Format','ISBN','Series', 'Physical Details'], 
                    overlap_size = 2)

In [None]:
D = ob.block_candset(C, 'Title', 'Title', overlap_size = 4)

In [None]:
# D[['ltable_Title', 'rtable_Title']].to_csv('test1.csv', sep = ',')
# E = ob.block_candset(D, 'Series', 'Series', overlap_size = 1)
# F = ob.block_candset(E, 'Publication', 'Publication', overlap_size = 1)

In [None]:
# rule1 = ['Title_Title_jac_dlm_dc0_dlm_dc0(ltuple, rtuple) < 0.3']
# rb = em.RuleBasedBlocker()
# rb.add_rule(rule1, block_f)
# G = rb.block_candset(D)
# G[['ltable_Title', 'rtable_Title']].to_csv('test.csv', sep = ',')

# End of blocking
Set D contains all examples after blocking.

In [None]:
D[D['ltable_ISBN'] == D['rtable_ISBN']].shape
D.to_csv('Set_C.csv', sep = ',')

In [None]:
# E = em.label_table(D, label_column_name='label')

## Sampling from D
Sample 300 examples from D.

In [None]:
S = em.sample_table(E, 300)

In [None]:
# S.to_csv('Set_S.csv', sep = ',')
# em.to_csv_metadata(S, './table_S.csv')
# S[S['ltable_ISBN'] == S['rtable_ISBN']].shape
# S["label"] = (S["ltable_ISBN"] == S["rtable_ISBN"]).astype(int)

## Create label
After manually labeling the data, We get 300 candidates with labels in label_S. <br/>
Also, need to set the metadata for label_S appropriately.

In [103]:
label_S = pd.read_csv('./data_with_label.csv')
# em.copy_properties(S, label_S)
em.set_property(label_S, 'key', '_id')
em.set_property(label_S, 'fk_ltable', 'ltable_ID')
em.set_property(label_S, 'fk_rtable', 'rtable_ID')
label_S_rtable = em.read_csv_metadata('./label_S_rtable.csv')
label_S_ltable = em.read_csv_metadata('./label_S_ltable.csv')
em.set_property(label_S, 'rtable', label_S_rtable)
em.set_property(label_S, 'ltable', label_S_ltable)

True

In [None]:
# print(em.get_property(I, 'key'))
# print(em.get_property(I, 'ltable'))
# print(em.get_property(I, 'rtable'))
# I_rtable = em.get_property(I, 'rtable')
# em.set_property(label_S, 'rtable', I_rtable)
# I_ltable = em.get_property(I, 'ltable')
# em.set_property(label_S, 'ltable', I_ltable)

In [None]:
# em.to_csv_metadata(I_rtable, './label_S_rtable.csv')
# em.to_csv_metadata(I_ltable, './label_S_ltable.csv')

In [None]:
# label_S = em.read_csv_metadata('./label_S')
# em.to_csv_metadata(label_S, './label_S.csv')
# label_S_new = em.read_csv_metadata('./label_S.csv')

In [None]:
# df = label_S[['_id','label']]
# df.columns=['lid','label']
# S_new = S.iloc[:,:-1]
# S_new = pd.concat([S_new, df], axis = 1, ignore_index = False)
# S_new = S_new.merge(df,left_on='_id',right_on='lid', how = 'inner')
# S_new = S_new.drop(columns=['lid'])

In [None]:
# em.get_key(S)
# em.set_key(S_new, '_id')
# em.set_fk_ltable(S_new, 'ltable_ID')
# em.set_fk_rtable(S_new, 'rtable_ID')

In [None]:
# label_S[label_S['label'] == 0].shape

In [None]:
# em.get_fk_rtable(S_new)

In [None]:
# S.to_csv('Set_G.csv', sep = ',')

In [168]:
IJ = em.split_train_test(label_S, train_proportion=0.66, random_state=0)
I = IJ['train']
J = IJ['test']

In [169]:
# I.to_csv('Set_I.csv', sep = ',')

In [170]:
# J.to_csv('Set_J.csv', sep = ',')

In [171]:
# block_f = em.get_features_for_blocking(A, B)

# Training

In [190]:
match_f = em.get_features_for_matching(A, B)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,ID,ID,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,Title,Title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,Author,Author,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
3,Publication,Publication,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
4,Format,Format,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
5,ISBN,ISBN,numeric,numeric,Exact Match; Absolute Norm
6,Series,Series,medium string (5 words to 10 words),medium string (5 words to 10 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
7,Physical Details,Physical Details,short string (1 word),short string (1 word to 5 words),Not Applicable: Types do not match


Do you want to proceed? (y/n):y


In [191]:
match_t = em.get_tokenizers_for_matching()
match_s = em.get_sim_funs_for_matching()
f1 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', match_t, match_s)
f2 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Author"]), dlm_dc0(rtuple["Author"]))', match_t, match_s)
f3 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Publication"]), dlm_dc0(rtuple["Publication"]))', match_t, match_s)
f4 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Series"]), dlm_dc0(rtuple["Series"]))', match_t, match_s)
em.add_feature(match_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', f1)
em.add_feature(match_f, 'Author_Author_jac_dlm_dc0_dlm_dc0', f2)
em.add_feature(match_f, 'Publication_Publication_jac_dlm_dc0_dlm_dc0', f3)
em.add_feature(match_f, 'Series_Series_jac_dlm_dc0_dlm_dc0', f4)

True

In [192]:
# Add blackbox feature

import re
# for Roman numerals matching
def Title_Title_blackbox_1(x, y):
    
    # get name attribute
    x_title = x['Title']
    y_title = y['Title']
    regex_roman = '\s+[MDCLXVI]+\s+'
    x_match = None
    y_match = None
    if re.search(regex_roman, x_title):
        x_match = re.search(regex_roman, x_title).group(0)
    if re.search(regex_roman, y_title):
        y_match = re.search(regex_roman, y_title).group(0)

    if x_match is None or y_match is None:
        return False
    else:
        return x_match == y_match

em.add_blackbox_feature(match_f, 'blackbox_1', Title_Title_blackbox_1)


for number matching (e.g. 6th edition)
def Title_Title_blackbox_2(x, y):
    # x, y will be of type pandas series
    
    x_title = x['Title']
    y_title = y['Title']
    regex_number = '\s+(\d+)\s*th'
    x_match = None
    y_match = None
    if re.search(regex_number, x_title):
        x_match = re.search(regex_number, x_title).group(1)
    if re.search(regex_number, y_title):
        y_match = re.search(regex_number, y_title).group(1)

    if x_match is None or y_match is None:
        return False
    else:
        return x_match == y_match

em.add_blackbox_feature(match_f, 'blackbox_2', Title_Title_blackbox_2)

SyntaxError: invalid syntax (<ipython-input-192-99c38684fca7>, line 26)

In [193]:
# import re
# regex_roman = '\s+(\d+)\s*th'

# pattern = re.compile(regex_roman)
# x_title = '"Public key cryptography--PKC 2004 : 7th International Workshop on Theory and Practice in Public...'
# pattern.search(x_title).group(1)

Here we delete features that are related to ID and ISBN.

In [194]:
match_f = match_f[(match_f['left_attribute'] != 'ID') & (match_f['left_attribute'] != 'ISBN')]

In [195]:
# match_f

Extract feature from set I.

In [196]:
H = em.extract_feature_vecs(I, feature_table=match_f, attrs_after=['label'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [197]:
# H['blackbox_2'].sum()
# H[(H['blackbox_2'] == True) & (H['label'] == 0)]

In [198]:
dt = em.DTMatcher(name='DecisionTree', random_state = 0, max_depth = 5)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher('NaiveBayes')

result = em.select_matcher(matchers=[dt, rf, svm, lg, ln], 
                           table=H, 
                           exclude_attrs=['_id', 'ltable_ID', 'rtable_ID'], 
                           target_attr='label', 
                           k=5,
                           metric_to_select_matcher='precision'
                           )

In [199]:
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.845203,0.90825,0.846377
1,RF,0.889492,0.895561,0.923153
2,SVM,0.775531,0.688355,0.739918
3,LogReg,0.844767,0.841471,0.83814
4,LinReg,0.9068,0.935833,0.913828


In [200]:
rf.fit(table=H, 
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'], 
       target_attr='label')

In [201]:
H_test = em.extract_feature_vecs(J, feature_table=match_f, attrs_after=['label'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Extract feature from set J.

In [202]:
pred_table = rf.predict(table= H_test, 
                        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'], 
                        target_attr='predicted_labels', 
                        return_probs=True, 
                        probs_attr='proba', 
                        append=True)

In [203]:
# ln.fit(table=H, 
#        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'], 
#        target_attr='label')
# pred_table = ln.predict(table= H_test, 
#                         exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'], 
#                         target_attr='predicted_labels', 
#                         return_probs=True, 
#                         probs_attr='proba', 
#                         append=True)

In [211]:
# pred_table[(pred_table['label'] == 1) & (pred_table['predicted_labels'] == 0)]
# wrong = pred_table[(pred_table['label'] != pred_table['predicted_labels'])]

In [222]:
# J
# J.join(wrong, on = '_id', lsuffix = '_id', rsuffix = '_id')

In [205]:
# print(J[J['_id'] == 358]['ltable_Title'])
# print(J[J['_id'] == 358]['rtable_Title'])
# print(J[J['_id'] == 332]['ltable_Title'])
# print(J[J['_id'] == 332]['rtable_Title'])
# print(J[J['_id'] == 509]['ltable_Title'])
# print(J[J['_id'] == 509]['rtable_Title'])

In [206]:
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')

In [207]:
eval_summary

OrderedDict([('prec_numerator', 27.0),
             ('prec_denominator', 30.0),
             ('precision', 0.9),
             ('recall_numerator', 27.0),
             ('recall_denominator', 31.0),
             ('recall', 0.8709677419354839),
             ('f1', 0.8852459016393444),
             ('pred_pos_num', 30.0),
             ('false_pos_num', 3.0),
             ('false_pos_ls',
              [('a5146', 'b695'), ('a3061', 'b99'), ('a3709', 'b907')]),
             ('pred_neg_num', 72.0),
             ('false_neg_num', 4.0),
             ('false_neg_ls',
              [('a4779', 'b2500'),
               ('a3595', 'b826'),
               ('a120', 'b4255'),
               ('a1876', 'b862')])])