# Entity Matching (EM) about Books

# Introduction

This IPython notebook shows a basic workflow for matching two tables using *py_entitymatching*. We want to match data science books in the libraries of UW-Madison and UIUC. The book information for UW-Madison is from [here](https://search.library.wisc.edu/search/system?q=Data+Science) and the book information for UIUC is from [here](https://vufind.carli.illinois.edu/vf-uiu/Search/Home?lookfor=Data+Science+&type=all&start_over=1&submit=Find&search=new). Details can be found in our Stage 2 Report [here](https://github.com/iphyer/CS839ClassProject/blob/master/stage2/Stage2Report.pdf).


First, we need to import the *py_entitymatching* package and other libraries as follows:

In [518]:
import pandas as pd
import py_entitymatching as em

# Read input tables

We begin by loading the input tables.

We name the UW-Madison table `TableA.csv` and the UIUC table `TableB.csv`. They contain:

* 6963 tuples in table `TableA.csv`
* 5730 tuples in table `TableB.csv`

In [519]:
table_A = em.read_csv_metadata('../data/TableA.csv', key = 'ID')
table_B = em.read_csv_metadata('../data/TableB.csv', key = 'ID')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.


In [520]:
table_A.shape

(4824, 8)

In [521]:
table_B.shape

(5060, 8)

# Down sampling
Downsample tables A and B.

In [522]:
A, B = em.down_sample(table_A, table_B, size=1000, y_param = 1, show_progress=False)

In [523]:
A.shape

(1000, 8)

In [524]:
block_f = em.get_features_for_blocking(A, B)
block_t = em.get_tokenizers_for_blocking()
block_s = em.get_sim_funs_for_blocking()
r = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', block_t, block_s)
em.add_feature(block_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', r)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,ID,ID,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,Title,Title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,Author,Author,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
3,Publication,Publication,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
4,Format,Format,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
5,ISBN,ISBN,numeric,numeric,Exact Match; Absolute Norm
6,Series,Series,medium string (5 words to 10 words),medium string (5 words to 10 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
7,Physical Details,Physical Details,short string (1 word),short string (1 word to 5 words),Not Applicable: Types do not match


Do you want to proceed? (y/n):y


True

# Block tables to get candidate set

Here we will use several blockers to remove obviously non-matching tuple pairs from the input tables.

Because the data comes from two different library websites, the attributes of the same book may not match exactly. We therefore apply an OverlapBlocker to some of the attributes, including the *Title*, *Author* and *Series* of the book.

After multiple tests, we settled on an overlap_size for each attribute. The cells below block on *Author* with overlap_size = 2 and then on *Title* with overlap_size = 4; the *Series* and *Publication* steps are left commented out.
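To make the role of overlap_size concrete, here is a minimal sketch of the idea behind overlap blocking: a pair is kept only if the two attribute values share at least a given number of tokens. This is a simplified stand-in, not the library's implementation. The helper names `token_overlap` and `survives_blocking` are hypothetical, and the whitespace tokenization with lowercasing is an assumption made for illustration.

```python
def token_overlap(left: str, right: str) -> int:
    """Count the distinct whitespace-delimited tokens shared by two strings.

    Lowercasing and whitespace splitting are assumptions of this sketch.
    """
    return len(set(left.lower().split()) & set(right.lower().split()))


def survives_blocking(left: str, right: str, overlap_size: int) -> bool:
    """Keep a pair only if it shares at least `overlap_size` tokens."""
    return token_overlap(left, right) >= overlap_size


title_a = "Algorithms for data science"
title_b = "Data science at the command line"

# The two titles share the tokens "data" and "science".
print(token_overlap(title_a, title_b))          # 2
print(survives_blocking(title_a, title_b, 2))   # True
print(survives_blocking(title_a, title_b, 4))   # False
```

A larger overlap_size removes more pairs, which shrinks the candidate set but also risks dropping true matches whose values are formatted differently in the two catalogs.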

In [525]:
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'Author', 'Author', 
                    l_output_attrs=['Title','Author','Publication','Format','ISBN','Series', 'Physical Details'], 
                    r_output_attrs=['Title','Author','Publication','Format','ISBN','Series', 'Physical Details'], 
                    overlap_size = 2)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [526]:
len(C)

1233

In [527]:
D = ob.block_candset(C, 'Title', 'Title', overlap_size = 4)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [528]:
# D[['ltable_Title', 'rtable_Title']].to_csv('test1.csv', sep = ',')
# E = ob.block_candset(D, 'Series', 'Series', overlap_size = 1)
# F = ob.block_candset(E, 'Publication', 'Publication', overlap_size = 1)

In [529]:
len(D)

559

In [530]:
# rule1 = ['Title_Title_jac_dlm_dc0_dlm_dc0(ltuple, rtuple) < 0.3']
# rb = em.RuleBasedBlocker()
# rb.add_rule(rule1, block_f)
# G = rb.block_candset(D)
# G[['ltable_Title', 'rtable_Title']].to_csv('test.csv', sep = ',')

# End of blocking
Set D contains all candidate pairs that survive blocking.

In [447]:
D[D['ltable_ISBN'] == D['rtable_ISBN']].shape
D.to_csv('Set_C.csv', sep = ',')

In [540]:
E = em.label_table(D, label_column_name='label')

Column name (label) is not present in dataframe
  table.set_value(idxv[i], cols[j], val)


## Sampling from D
Sample 300 examples from D.

In [541]:
S = em.sample_table(E, 300)

In [136]:
# S.to_csv('Set_S.csv', sep = ',')
# em.to_csv_metadata(S, './table_S.csv')
# S[S['ltable_ISBN'] == S['rtable_ISBN']].shape
# S["label"] = (S["ltable_ISBN"] == S["rtable_ISBN"]).astype(int)

True

## Create label
After manually labeling the data, we get 300 labeled candidate pairs, stored in label_S.
We read the labeled data from a CSV file and set the metadata on label_S (key, foreign keys, and the left and right base tables).

In [611]:
label_S = pd.read_csv('./data_with_label.csv')
# em.copy_properties(S, label_S)
em.set_property(label_S, 'key', '_id')
em.set_property(label_S, 'fk_ltable', 'ltable_ID')
em.set_property(label_S, 'fk_rtable', 'rtable_ID')
label_S_rtable = em.read_csv_metadata('./label_S_rtable.csv')
label_S_ltable = em.read_csv_metadata('./label_S_ltable.csv')
em.set_property(label_S, 'rtable', label_S_rtable)
em.set_property(label_S, 'ltable', label_S_ltable)

True

In [605]:
# print(em.get_property(I, 'key'))
# print(em.get_property(I, 'ltable'))
# print(em.get_property(I, 'rtable'))
# I_rtable = em.get_property(I, 'rtable')
# em.set_property(label_S, 'rtable', I_rtable)
# I_ltable = em.get_property(I, 'ltable')
# em.set_property(label_S, 'ltable', I_ltable)

In [606]:
# em.to_csv_metadata(I_rtable, './label_S_rtable.csv')
# em.to_csv_metadata(I_ltable, './label_S_ltable.csv')

True

In [608]:
# label_S = em.read_csv_metadata('./label_S')
# em.to_csv_metadata(label_S, './label_S.csv')
# label_S_new = em.read_csv_metadata('./label_S.csv')

In [189]:
# df = label_S[['_id','label']]
# df.columns=['lid','label']
# S_new = S.iloc[:,:-1]
# S_new = pd.concat([S_new, df], axis = 1, ignore_index = False)
# S_new = S_new.merge(df,left_on='_id',right_on='lid', how = 'inner')
# S_new = S_new.drop(columns=['lid'])

In [214]:
# em.get_key(S)
# em.set_key(S_new, '_id')
# em.set_fk_ltable(S_new, 'ltable_ID')
# em.set_fk_rtable(S_new, 'rtable_ID')

True

In [612]:
label_S[label_S['label'] == 0].shape

(194, 18)

In [216]:
# em.get_fk_rtable(S_new)

'rtable_ID'

In [145]:
# S.to_csv('Set_G.csv', sep = ',')

In [613]:
IJ = em.split_train_test(label_S, train_proportion=0.66, random_state=0)
I = IJ['train']
J = IJ['test']

In [614]:
I.to_csv('Set_I.csv', sep = ',')

In [615]:
J.to_csv('Set_J.csv', sep = ',')

In [None]:
# block_f = em.get_features_for_blocking(A, B)

# Training

In [616]:
match_f = em.get_features_for_matching(A, B)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,ID,ID,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,Title,Title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,Author,Author,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
3,Publication,Publication,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
4,Format,Format,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
5,ISBN,ISBN,numeric,numeric,Exact Match; Absolute Norm
6,Series,Series,medium string (5 words to 10 words),medium string (5 words to 10 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
7,Physical Details,Physical Details,short string (1 word),short string (1 word to 5 words),Not Applicable: Types do not match


Do you want to proceed? (y/n):y


In [617]:
match_f

Unnamed: 0,feature_name,left_attribute,right_attribute,left_attr_tokenizer,right_attr_tokenizer,simfunction,function,function_source,is_auto_generated
0,ID_ID_lev_dist,ID,ID,,,lev_dist,<function ID_ID_lev_dist at 0x7f713ce166a8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
1,ID_ID_lev_sim,ID,ID,,,lev_sim,<function ID_ID_lev_sim at 0x7f713ce16950>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
2,ID_ID_jar,ID,ID,,,jaro,<function ID_ID_jar at 0x7f713ce16488>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
3,ID_ID_jwn,ID,ID,,,jaro_winkler,<function ID_ID_jwn at 0x7f713ce16620>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
4,ID_ID_exm,ID,ID,,,exact_match,<function ID_ID_exm at 0x7f713ce16730>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
5,ID_ID_jac_qgm_3_qgm_3,ID,ID,qgm_3,qgm_3,jaccard,<function ID_ID_jac_qgm_3_qgm_3 at 0x7f713ce16510>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
6,Title_Title_jac_qgm_3_qgm_3,Title,Title,qgm_3,qgm_3,jaccard,<function Title_Title_jac_qgm_3_qgm_3 at 0x7f713ce16840>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
7,Title_Title_cos_dlm_dc0_dlm_dc0,Title,Title,dlm_dc0,dlm_dc0,cosine,<function Title_Title_cos_dlm_dc0_dlm_dc0 at 0x7f713ce168c8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
8,Format_Format_jac_qgm_3_qgm_3,Format,Format,qgm_3,qgm_3,jaccard,<function Format_Format_jac_qgm_3_qgm_3 at 0x7f713ce16b70>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
9,Format_Format_cos_dlm_dc0_dlm_dc0,Format,Format,dlm_dc0,dlm_dc0,cosine,<function Format_Format_cos_dlm_dc0_dlm_dc0 at 0x7f713ce16a60>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True


In [618]:
match_t = em.get_tokenizers_for_matching()
match_s = em.get_sim_funs_for_matching()
f1 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', match_t, match_s)
f2 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Author"]), dlm_dc0(rtuple["Author"]))', match_t, match_s)
f3 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Publication"]), dlm_dc0(rtuple["Publication"]))', match_t, match_s)
f4 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Series"]), dlm_dc0(rtuple["Series"]))', match_t, match_s)


In [619]:
em.add_feature(match_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', f1)
em.add_feature(match_f, 'Author_Author_jac_dlm_dc0_dlm_dc0', f2)
em.add_feature(match_f, 'Publication_Publication_jac_dlm_dc0_dlm_dc0', f3)
em.add_feature(match_f, 'Series_Series_jac_dlm_dc0_dlm_dc0', f4)

True
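The four features added above all apply a Jaccard measure to delimiter-based tokens. As a rough illustration of what such a feature computes, the sketch below implements Jaccard similarity over whitespace tokens in plain Python. It is not the library's code: the function name `jaccard_tokens` is hypothetical, and treating `dlm_dc0` as a whitespace tokenizer is an assumption here.

```python
def jaccard_tokens(left: str, right: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over whitespace-delimited tokens."""
    a, b = set(left.split()), set(right.split())
    union = a | b
    if not union:
        # Both strings are empty: define the similarity as 0.0 in this sketch.
        return 0.0
    return len(a & b) / len(union)


# Three shared tokens ("for", "data", "science") out of six distinct tokens.
print(jaccard_tokens("Algorithms for data science",
                     "Intelligent techniques for data science"))  # 0.5

# A trailing space, as seen in some rtable titles, does not change the token set.
print(jaccard_tokens("The data science handbook",
                     "The data science handbook "))  # 1.0
```

Because the score depends only on the token sets, differences in word order or stray whitespace do not affect it, while punctuation attached to a word (for example `Liu,` versus `Liu`) does.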

In [620]:
import re
def Title_Title_blackbox_1(x, y):
    # x and y are pandas Series holding one tuple from each input table

    # get the Title attribute of each tuple
    x_title = x['Title']
    y_title = y['Title']
    # Raw string: a standalone run of Roman-numeral letters surrounded by whitespace
    regex_roman = r'\s+[MDCLXVI]+\s+'
    x_found = re.search(regex_roman, x_title)
    y_found = re.search(regex_roman, y_title)

    # If either title has no Roman numeral, the feature is False
    if x_found is None or y_found is None:
        return False
    # Otherwise compare the first Roman numeral found in each title
    return x_found.group(0) == y_found.group(0)

bb = em.add_blackbox_feature(match_f, 'blackbox_1', Title_Title_blackbox_1)
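The following sketch probes the behavior of the Roman-numeral pattern used by the blackbox feature on a few made-up titles. The helper `first_roman` and the example titles are hypothetical and exist only for illustration; the pattern itself is the one from the cell above.

```python
import re

# Same pattern as in the blackbox feature, written as a raw string.
ROMAN = re.compile(r"\s+[MDCLXVI]+\s+")


def first_roman(title: str):
    """Return the first whitespace-delimited Roman-numeral token, or None."""
    found = ROMAN.search(title)
    return found.group(0).strip() if found else None


print(first_roman("Advances in computing XVI : proceedings"))  # XVI
print(first_roman("Data science at the command line"))         # None
# A single capital letter from the numeral alphabet also matches.
print(first_roman("Programming in C and Python"))              # C
```

The last case suggests the feature can fire on titles that contain a standalone `C`, `D`, `I`, `L`, `M`, `V` or `X` that is not a volume number, and a numeral at the very end of a title is missed because the pattern requires trailing whitespace.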

In [348]:
# import re
# regex_roman = '\s+[MDCLXVI]+\s+'

# pattern = re.compile(regex_roman)
# x_title = '"Neural information processing : 24th XVI nternational Conference, ICONIP 2017, Guangzhou, China, N...'
# pattern.search(x_title)

<_sre.SRE_Match object; span=(37, 42), match=' XVI '>

In [621]:
match_f

Unnamed: 0,feature_name,left_attribute,right_attribute,left_attr_tokenizer,right_attr_tokenizer,simfunction,function,function_source,is_auto_generated
0,ID_ID_lev_dist,ID,ID,,,lev_dist,<function ID_ID_lev_dist at 0x7f713ce166a8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
1,ID_ID_lev_sim,ID,ID,,,lev_sim,<function ID_ID_lev_sim at 0x7f713ce16950>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
2,ID_ID_jar,ID,ID,,,jaro,<function ID_ID_jar at 0x7f713ce16488>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
3,ID_ID_jwn,ID,ID,,,jaro_winkler,<function ID_ID_jwn at 0x7f713ce16620>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
4,ID_ID_exm,ID,ID,,,exact_match,<function ID_ID_exm at 0x7f713ce16730>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
5,ID_ID_jac_qgm_3_qgm_3,ID,ID,qgm_3,qgm_3,jaccard,<function ID_ID_jac_qgm_3_qgm_3 at 0x7f713ce16510>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
6,Title_Title_jac_qgm_3_qgm_3,Title,Title,qgm_3,qgm_3,jaccard,<function Title_Title_jac_qgm_3_qgm_3 at 0x7f713ce16840>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
7,Title_Title_cos_dlm_dc0_dlm_dc0,Title,Title,dlm_dc0,dlm_dc0,cosine,<function Title_Title_cos_dlm_dc0_dlm_dc0 at 0x7f713ce168c8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
8,Format_Format_jac_qgm_3_qgm_3,Format,Format,qgm_3,qgm_3,jaccard,<function Format_Format_jac_qgm_3_qgm_3 at 0x7f713ce16b70>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
9,Format_Format_cos_dlm_dc0_dlm_dc0,Format,Format,dlm_dc0,dlm_dc0,cosine,<function Format_Format_cos_dlm_dc0_dlm_dc0 at 0x7f713ce16a60>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True


In [622]:
match_f = match_f[(match_f['left_attribute'] != 'ID') & (match_f['left_attribute'] != 'ISBN')]

In [623]:
H = em.extract_feature_vecs(I, feature_table=match_f, attrs_after=['label'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [624]:
H['blackbox_1'].sum()

5

In [625]:
H[H['_id'] == 880]

Unnamed: 0,_id,ltable_ID,rtable_ID,Title_Title_jac_qgm_3_qgm_3,Title_Title_cos_dlm_dc0_dlm_dc0,Format_Format_jac_qgm_3_qgm_3,Format_Format_cos_dlm_dc0_dlm_dc0,Format_Format_jac_dlm_dc0_dlm_dc0,Format_Format_mel,Format_Format_lev_dist,...,Series_Series_cos_dlm_dc0_dlm_dc0,Series_Series_mel,Series_Series_lev_dist,Series_Series_lev_sim,Title_Title_jac_dlm_dc0_dlm_dc0,Author_Author_jac_dlm_dc0_dlm_dc0,Publication_Publication_jac_dlm_dc0_dlm_dc0,Series_Series_jac_dlm_dc0_dlm_dc0,blackbox_1,label
215,880,a5257,b1674,0.252427,0.229416,0.063492,0.0,0.0,0.641071,50,...,0.47629,0.842939,110.0,0.382022,0.129032,0.05,0.125,0.304348,False,0


In [626]:
dt = em.DTMatcher(name='DecisionTree', max_depth = 5)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0, max_depth = 5, n_estimators = 40)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher('NaiveBayes')

In [627]:
label_S[label_S['_id'] == 880]

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Title,ltable_Author,ltable_Publication,ltable_Format,ltable_ISBN,ltable_Series,ltable_Physical Details,rtable_Title,rtable_Author,rtable_Publication,rtable_Format,rtable_ISBN,rtable_Series,rtable_Physical Details,label
215,880,a5257,b1674,"""Neural information processing : 24th International Conference, ICONIP 2017, Guangzhou, China, N...","""Derong Liu, Shengli Xie, Yuanqing Li, Dongbin Zhao, El-Sayed M. El-Alfy (eds.)""","""Cham, Switzerland : Springer, 2017.""","""Books""",9783319700960,"""Lecture notes in computer science ; 10635,LNCS sublibrary. Theoretical computer science and gen...","""nan""","""Advances in brain inspired cognitive systems : 6th International Conference, BICS 2013, Beijing...","""Liu, Derong, | Alippi, Cesare, | Zhao, Dongbin, | Hussain, A.""","""Berlin ; Springer, c2013.""","""Electronic books. | Conference papers and proceedings.""",9783642387869,"""Lecture notes in computer science ; 7888. 1611-3349 Lecture notes in computer science. Lecture ...","""1 online resource (xiv, 418 pages)""",0


In [628]:
result = em.select_matcher(matchers=[dt, rf, svm, lg, ln], 
                           table=H, 
                           exclude_attrs=['_id', 'ltable_ID', 'rtable_ID'], 
                           target_attr='label', 
                           k=5,
                           metric_to_select_matcher='precision', 
                           random_state=0)

In [629]:
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.874365,0.918953,0.895634
1,RF,0.930476,0.886282,0.906218
2,SVM,0.823869,0.688184,0.737271
3,LogReg,0.846691,0.817671,0.829644
4,LinReg,0.900441,0.938056,0.917791


In [630]:
rf.fit(table=H, 
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'], 
       target_attr='label')

In [631]:
H_test = em.extract_feature_vecs(J, feature_table=match_f, attrs_after=['label'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [632]:
pred_table = rf.predict(table= H_test, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'], target_attr='predicted_labels', return_probs=True, probs_attr='proba', append=True)

In [633]:
pred_table

Unnamed: 0,_id,ltable_ID,rtable_ID,Title_Title_jac_qgm_3_qgm_3,Title_Title_cos_dlm_dc0_dlm_dc0,Format_Format_jac_qgm_3_qgm_3,Format_Format_cos_dlm_dc0_dlm_dc0,Format_Format_jac_dlm_dc0_dlm_dc0,Format_Format_mel,Format_Format_lev_dist,...,Series_Series_lev_dist,Series_Series_lev_sim,Title_Title_jac_dlm_dc0_dlm_dc0,Author_Author_jac_dlm_dc0_dlm_dc0,Publication_Publication_jac_dlm_dc0_dlm_dc0,Series_Series_jac_dlm_dc0_dlm_dc0,blackbox_1,label,predicted_labels,proba
208,867,a5342,b797,0.355769,0.244558,0.063492,0.0,0.0,0.641071,50,...,85.0,0.325397,0.138889,0.000000,0.000000,0.352941,False,0,0,0.025476
188,802,a5399,b685,0.629630,0.669439,0.153846,0.0,0.0,0.517293,13,...,52.0,0.446809,0.500000,0.000000,0.000000,0.500000,False,0,0,0.374708
12,52,a153,b1561,0.376866,0.344628,0.153846,0.0,0.0,0.517293,13,...,23.0,0.798246,0.204545,0.000000,0.142857,0.555556,False,0,0,0.087496
221,895,a4311,b547,0.884615,0.771517,0.153846,0.0,0.0,0.517293,13,...,1.0,0.979592,0.625000,0.000000,0.210526,0.800000,False,1,1,0.967072
239,995,a5399,b862,0.189815,0.308607,0.063492,0.0,0.0,0.641071,50,...,54.0,0.425532,0.181818,0.000000,0.200000,0.545455,False,0,0,0.000391
136,576,a2488,b767,0.893617,0.824958,0.153846,0.0,0.0,0.517293,13,...,1.0,0.972973,0.700000,0.083333,0.000000,0.600000,False,1,1,0.940532
230,933,a3709,b907,0.653846,0.666667,0.153846,0.0,0.0,0.517293,13,...,3.0,0.930233,0.500000,0.000000,0.142857,0.750000,False,0,0,0.301905
206,859,a996,b527,0.699248,0.774597,0.063492,0.0,0.0,0.641071,50,...,102.0,0.291667,0.631579,0.000000,0.000000,0.333333,False,0,0,0.076388
52,205,a2989,b1435,0.184426,0.198030,0.063492,0.0,0.0,0.641071,50,...,169.0,0.206573,0.108108,0.181818,0.090909,0.260870,False,0,0,0.000000
108,443,a3526,b4041,0.144068,0.102062,0.153846,0.0,0.0,0.517293,13,...,26.0,0.133333,0.052632,0.022727,0.000000,0.000000,False,0,0,0.054124


In [634]:
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')

In [635]:
eval_summary

OrderedDict([('prec_numerator', 26.0),
             ('prec_denominator', 30.0),
             ('precision', 0.8666666666666667),
             ('recall_numerator', 26.0),
             ('recall_denominator', 29.0),
             ('recall', 0.896551724137931),
             ('f1', 0.8813559322033899),
             ('pred_pos_num', 30.0),
             ('false_pos_num', 4.0),
             ('false_pos_ls',
              [('a5146', 'b695'),
               ('a4258', 'b4518'),
               ('a3061', 'b99'),
               ('a3594', 'b858')]),
             ('pred_neg_num', 72.0),
             ('false_neg_num', 3.0),
             ('false_neg_ls',
              [('a4779', 'b2500'), ('a3595', 'b826'), ('a3020', 'b4720')])])
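As a cross-check, the precision, recall and F1 values in the summary above can be recomputed from its counts (26 true positives, 4 false positives, 3 false negatives). The sketch below uses the standard definitions; the variable names are ours, not part of the library output.

```python
# Counts taken from the evaluation summary above.
true_pos = 26
false_pos = 4    # predicted matches that are labeled non-matches
false_neg = 3    # labeled matches that were predicted as non-matches

precision = true_pos / (true_pos + false_pos)        # 26 / 30
recall = true_pos / (true_pos + false_neg)           # 26 / 29
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(precision, 4))  # 0.8667
print(round(recall, 4))     # 0.8966
print(round(f1, 4))         # 0.8814
```

The denominators also agree with the summary: 30 predicted positives (`pred_pos_num`) and 29 labeled matches in the test set (`recall_denominator`).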

# CODE 

In [None]:
S[S['ltable_ISBN'] == S['rtable_ISBN']].shape

In [62]:
len(G)

1286

In [11]:
F.head(20)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Title,ltable_Author,ltable_Publication,ltable_Format,ltable_ISBN,ltable_Series,ltable_Physical Details,rtable_Title,rtable_Author,rtable_Publication,rtable_Format,rtable_ISBN,rtable_Series,rtable_Physical Details
0,0,a2337,b0,"""Statistical learning and data science""","""edited by Mireille Gettler Summa ... [and others]""","""Boca Raton : CRC Press, [2012] ©2012""","""Books""",9781439867631,"""Series in computer science and data analysis,""","""nan""","""Statistical learning and data science ""","""Summa, Mireille Gettler.""","""Boca Raton""","""nan""",9781439867631,"""Series in computer science and data analysis.""","""xv, 227 p."""
1,1,a3097,b1,"""Intelligent techniques for data science""","""Rajendra Akerkar, Priti Srinivas Sajja""","""Cham, Switzerland : Springer, 2016.""","""Books""",9783319292069,"""nan""","""nan""","""Intelligent techniques for data science ""","""Akerkar, Rajendra""","""Cham, Switzerland""","""Electronic books.""",9783319292069,"""nan""","""1 online resource (xvi, 272 pages)"""
2,2,a4508,b2,"""Algorithms for data science""","""Brian Steele, John Chandler, Swarna Reddy""","""Cham, Switzerland : Springer, 2016.""","""Books""",9783319457970,"""nan""","""nan""","""Algorithms for data science ""","""Steele, Brian""","""Cham, Switzerland""","""Electronic books.""",9783319457970,"""nan""","""1 online resource (xxiii, 430 pages)"""
3,3,a3267,b3,"""Data science at the command line""","""Jeroen Janssens""","""First edition. Sebastopol, CA : O'Reilly, 2014. ©2015""","""Books""",9781491947852,"""nan""","""nan""","""Data science at the command line ""","""Janssens, Jeroen""","""Sebastopol, CA""","""nan""",9781491947852,"""nan""","""xvii, 191 pages"""
5,5,a4755,b5,"""Introduction to HPC with MPI for data science""","""Frank Nielsen""","""Cham : Springer, 2016.""","""Books""",9783319219035,"""Undergraduate topics in computer science,""","""nan""","""Introduction to HPC with MPI for data science ""","""Nielsen, Frank""","""Cham""","""Electronic books.""",9783319219035,"""Undergraduate topics in computer science, 1863-7310""","""1 online resource (xxxiii, 282 pages)"""
6,6,a2523,b6,"""Data Science Using Oracle Data Miner and Oracle R Enterprise : Transform Your Business Systems ...","""Sibanjan Das""","""Berkeley, CA : Apress, 2016. Berkeley, CA : Apress, 2016.""","""Books""",9781484226148,"""nan""","""nan""","""Data Science Using Oracle Data Miner and Oracle R Enterprise : Transform Your Business Systems ...","""Das, Sibanjan.""","""Berkeley, CA""","""Electronic books.""",9781484226148,"""nan""","""1 online resource (300 pages)"""
7,7,a3584,b7,"""The data science handbook""","""Field Cady""","""Hoboken, NJ : John Wiley & Sons, Inc., 2017.""","""Books""",9781119092933,"""nan""","""nan""","""The data science handbook ""","""Cady, Field, 1984-""","""Hoboken, NJ""","""Electronic books. | Handbooks and manuals.""",9781119092933,"""nan""","""1 online resource"""
8,8,a1869,b8,"""Data science : create teams that ask the right questions and deliver real value""","""Doug Rose""","""[Berkeley, CA] : Apress, 2016.""","""Books""",9781484222539,"""nan""","""nan""","""Data science : create teams that ask the right questions and deliver real value ""","""Rose, Doug, (Agile coach)""","""[Berkeley, CA]""","""Electronic books.""",9781484222539,"""nan""","""1 online resource"""
9,9,a1186,b9,"""Spatial big data science classification techniques for Earth observation imagery""","""Zhe Jiang, Shashi Shekhar""","""Cham : Springer, 2017.""","""Books""",9783319601953,"""nan""","""nan""","""Spatial big data science : classification techniques for Earth observation imagery ""","""Jiang, Zhe.""","""Cham""","""Electronic books.""",9783319601953,"""nan""","""1 online resource"""
10,10,a3432,b11,"""Practical data science cookbook : 89 hands-on recipes to help you complete real-world data scie...","""Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta""","""Birmingham, UK : Packt Publishing Ltd., September 2014. ©2014""","""Books""",9781783980246,"""nan""","""nan""","""Practical data science cookbook : 89 hands-on recipes to help you complete real-world data scie...","""Ojeda, Tony.""","""Birmingham, UK""","""Electronic books.""",9781783980253,"""nan""","""1 online resource"""


In [None]:
s1 = pd.merge(A, B, how='inner', on=['ISBN'])

In [8]:
s1.head(20)

NameError: name 's1' is not defined

In [None]:
C[C['ltable_ISBN'] == C['rtable_ISBN']].shape

In [None]:
D[D['ltable_ISBN'] == D['rtable_ISBN']].shape

In [None]:
E[E['ltable_ISBN'] == E['rtable_ISBN']].shape

In [59]:
F[F['ltable_ISBN'] == F['rtable_ISBN']].shape

(639, 17)

In [195]:
G[G['ltable_ISBN'] == G['rtable_ISBN']].shape

(627, 17)

In [196]:
S[S['ltable_ISBN'] == S['rtable_ISBN']].shape

(254, 18)

In [197]:
I[I['ltable_ISBN'] == I['rtable_ISBN']].shape

(176, 18)

In [198]:
len(I)

330

In [None]:
D = ob.block_candset(block_data, 'Title', 'Title', allow_missing=True)

In [None]:
em.get_key(data1)

In [None]:
data1.keys()