# Entity Matching (EM) about Books

# Introduction

This IPython notebook shows a basic workflow for matching two tables using *py_entitymatching*. We want to match data science books held by the libraries of UW-Madison and UIUC. The UW-Madison book information comes from [here](https://search.library.wisc.edu/search/system?q=Data+Science) and the UIUC book information comes from [here](https://vufind.carli.illinois.edu/vf-uiu/Search/Home?lookfor=Data+Science+&type=all&start_over=1&submit=Find&search=new). Details can be found in our Stage 2 Report [here](https://github.com/iphyer/CS839ClassProject/blob/master/stage2/Stage2Report.pdf).


First, we import the *py_entitymatching* package and the other libraries we need:

In [1]:
import pandas as pd
import py_entitymatching as em

# Read input tables

We begin by loading the input tables.

We store the UW-Madison table as `TableA.csv` and the UIUC table as `TableB.csv`. They contain:

* 6963 tuples in table `TableA.csv`
* 5730 tuples in table `TableB.csv`

In [2]:
A = em.read_csv_metadata('../data/TableA.csv', key = 'ID')
B = em.read_csv_metadata('../data/TableB.csv', key = 'ID')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.


In [12]:
A.shape

(4824, 8)

In [13]:
B.shape

(5060, 8)

# Down sampling

In [20]:
sample_A, sample_B = em.down_sample(A, B, size=500, y_param = 1.5, show_progress=False)

In [21]:
sample_A.shape

(815, 8)

In [8]:
block_f = em.get_features_for_blocking(A, B)
block_t = em.get_tokenizers_for_blocking()
block_s = em.get_sim_funs_for_blocking()
r = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', block_t, block_s)
em.add_feature(block_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', r)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,ID,ID,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,Title,Title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,Author,Author,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
3,Publication,Publication,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
4,Format,Format,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
5,ISBN,ISBN,numeric,numeric,Exact Match; Absolute Norm
6,Series,Series,medium string (5 words to 10 words),medium string (5 words to 10 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
7,Physical Details,Physical Details,short string (1 word),short string (1 word to 5 words),Not Applicable: Types do not match


Do you want to proceed? (y/n):n

If the attribute correspondences or types have been inferred incorrectly,
use the get_features() function with your  own correspondences and attribute
types to get the correct features for your data


AssertionError: Input feature table: None 
is not of type pandas dataframe

The assertion fails because the prompt was answered with `n`: in that case `get_features_for_blocking` produced no feature table, so `block_f` is `None` when it is passed to `add_feature`. Re-running the cell and answering `y`, or setting the `validate_inferred_attr_types` flag to false as the prompt suggests, avoids the error.

# Block tables to get candidate set

Here we use several blockers to remove obviously non-matching tuple pairs from the input tables.

Because the data for the same book comes from two different library websites, its attribute values may not be exactly the same in both tables. Therefore, we apply an OverlapBlocker to several attributes of the book: *Author*, *Title*, *Series* and *Publication*.

After multiple tests we settled on an overlap_size for each attribute. The cells below use 2 for *Author*, 4 for *Title*, and 1 for both *Series* and *Publication*.
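
To make the role of `overlap_size` concrete, the sketch below mimics word-level overlap blocking in plain Python. It is only an illustration under assumptions: the helper names `word_tokens` and `survives_overlap_block` are hypothetical, and the cleanup (lowercasing and stripping punctuation before splitting on whitespace) is assumed to approximate what the library does. The real `OverlapBlocker` has further options that are not modelled here.

```python
import re
import string

# Pattern matching any single ASCII punctuation character.
_PUNCTUATION = re.compile("[%s]" % re.escape(string.punctuation))


def word_tokens(text):
    """Lowercase, strip punctuation, and split on whitespace (assumed cleanup)."""
    cleaned = _PUNCTUATION.sub("", str(text).lower())
    return set(cleaned.split())


def survives_overlap_block(left, right, overlap_size=1):
    """Return True when the two strings share at least `overlap_size` words."""
    shared = word_tokens(left) & word_tokens(right)
    return len(shared) >= overlap_size


# Author strings taken from one of the sample rows shown later in this notebook.
left_author = "Jeroen Janssens"
right_author = "Janssens, Jeroen"

# Both words are shared, so the pair passes with overlap_size=2 but not 3.
print(survives_overlap_block(left_author, right_author, overlap_size=2))
print(survives_overlap_block(left_author, right_author, overlap_size=3))
```

A larger `overlap_size` prunes more pairs but can also drop true matches whose attribute values are short, which is one reason to tune it per attribute.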

In [217]:
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'Author', 'Author', l_output_attrs=['Title','Author','Publication','Format','ISBN','Series', 'Physical Details'], r_output_attrs=['Title','Author','Publication','Format','ISBN','Series', 'Physical Details'], overlap_size = 2)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [218]:
D = ob.block_candset(C, 'Title', 'Title', overlap_size = 4)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [219]:
E = ob.block_candset(D, 'Series', 'Series', overlap_size = 1)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [220]:
F = ob.block_candset(E, 'Publication', 'Publication', overlap_size = 1)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [221]:
len(F)

3456

In [222]:
block_f

Unnamed: 0,feature_name,left_attribute,right_attribute,left_attr_tokenizer,right_attr_tokenizer,simfunction,function,function_source,is_auto_generated
0,ID_ID_lev_dist,ID,ID,,,lev_dist,<function ID_ID_lev_dist at 0x7f9d014949d8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
1,ID_ID_lev_sim,ID,ID,,,lev_sim,<function ID_ID_lev_sim at 0x7f9d01494620>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
2,ID_ID_jar,ID,ID,,,jaro,<function ID_ID_jar at 0x7f9d01494f28>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
3,ID_ID_jwn,ID,ID,,,jaro_winkler,<function ID_ID_jwn at 0x7f9d01494510>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
4,ID_ID_exm,ID,ID,,,exact_match,<function ID_ID_exm at 0x7f9d01494840>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
5,ID_ID_jac_qgm_3_qgm_3,ID,ID,qgm_3,qgm_3,jaccard,<function ID_ID_jac_qgm_3_qgm_3 at 0x7f9d01494e18>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
6,Title_Title_jac_qgm_3_qgm_3,Title,Title,qgm_3,qgm_3,jaccard,<function Title_Title_jac_qgm_3_qgm_3 at 0x7f9d014946a8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
7,Title_Title_cos_dlm_dc0_dlm_dc0,Title,Title,dlm_dc0,dlm_dc0,cosine,<function Title_Title_cos_dlm_dc0_dlm_dc0 at 0x7f9d01494598>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
8,Format_Format_jac_qgm_3_qgm_3,Format,Format,qgm_3,qgm_3,jaccard,<function Format_Format_jac_qgm_3_qgm_3 at 0x7f9d01494ea0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
9,Format_Format_cos_dlm_dc0_dlm_dc0,Format,Format,dlm_dc0,dlm_dc0,cosine,<function Format_Format_cos_dlm_dc0_dlm_dc0 at 0x7f9d014941e0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True


In [242]:
rule1 = ['Title_Title_jac_dlm_dc0_dlm_dc0(ltuple, rtuple) < 0.5']
rb = em.RuleBasedBlocker()
rb.add_rule(rule1, block_f)
G = rb.block_candset(F)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [243]:
em.jaccard(['data'], ['data'])

1.0
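
The rule-based blocker above drops a candidate pair when the Jaccard similarity of the two titles is below 0.5. As a rough illustration, the sketch below recomputes that score in plain Python on whitespace-delimited tokens. The function names `jaccard` and `blocked_by_title_rule` are hypothetical stand-ins, not the library's implementation, and the handling of empty inputs is a choice made for this sketch.

```python
def jaccard(left_tokens, right_tokens):
    """Jaccard similarity: |intersection| / |union| of two token collections."""
    left_set, right_set = set(left_tokens), set(right_tokens)
    union = left_set | right_set
    if not union:
        # Choice made for this sketch; libraries may treat empty inputs differently.
        return 0.0
    return len(left_set & right_set) / len(union)


def blocked_by_title_rule(left_title, right_title, threshold=0.5):
    """Mirror rule1: the pair is dropped when the title score is below the threshold."""
    score = jaccard(left_title.split(), right_title.split())
    return score < threshold


# Titles from one of the sample rows shown later in this notebook.
left = "Spatial big data science classification techniques for Earth observation imagery"
right = "Spatial big data science : classification techniques for Earth observation imagery"

# The right title has one extra token (":"), so the score is 10 shared / 11 total.
print(jaccard(left.split(), right.split()))
print(blocked_by_title_rule(left, right))
```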

In [244]:
len(G)

1286

In [245]:
G[G['ltable_ISBN'] == G['rtable_ISBN']].shape

(627, 17)

In [143]:
G.to_csv('Set_C.csv', sep = ',')

In [None]:
# G = em.label_table(F, label_column_name='gold_labels')

In [118]:
S = em.sample_table(G, 500)

In [None]:
S

In [None]:
S["label"] = (S["ltable_ISBN"] == S["rtable_ISBN"]).astype(int)

In [318]:
G = em.label_table(F, label_column_name='label')

Column name (label) is not present in dataframe
  table.set_value(idxv[i], cols[j], val)


In [105]:
# S = S.drop(['ltable_ISBN', 'rtable_ISBN'], 1)

In [145]:
S.to_csv('Set_G.csv', sep = ',')

In [147]:
IJ = em.split_train_test(S, train_proportion=0.66, random_state=0)
I = IJ['train']
J = IJ['test']

In [148]:
I.to_csv('Set_I.csv', sep = ',')

In [149]:
J.to_csv('Set_J.csv', sep = ',')

In [None]:
# block_f = em.get_features_for_blocking(A, B)

In [246]:
S

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Title,ltable_Author,ltable_Publication,ltable_Format,ltable_ISBN,ltable_Series,ltable_Physical Details,rtable_Title,rtable_Author,rtable_Publication,rtable_Format,rtable_ISBN,rtable_Series,rtable_Physical Details,label
0,0,a2337,b0,"""Statistical learning and data science""","""edited by Mireille Gettler Summa ... [and others]""","""Boca Raton : CRC Press, [2012] ©2012""","""Books""",9781439867631,"""Series in computer science and data analysis,""","""nan""","""Statistical learning and data science ""","""Summa, Mireille Gettler.""","""Boca Raton""","""nan""",9781439867631,"""Series in computer science and data analysis.""","""xv, 227 p.""",1
3,3,a3267,b3,"""Data science at the command line""","""Jeroen Janssens""","""First edition. Sebastopol, CA : O'Reilly, 2014. ©2015""","""Books""",9781491947852,"""nan""","""nan""","""Data science at the command line ""","""Janssens, Jeroen""","""Sebastopol, CA""","""nan""",9781491947852,"""nan""","""xvii, 191 pages""",1
9,9,a1186,b9,"""Spatial big data science classification techniques for Earth observation imagery""","""Zhe Jiang, Shashi Shekhar""","""Cham : Springer, 2017.""","""Books""",9783319601953,"""nan""","""nan""","""Spatial big data science : classification techniques for Earth observation imagery ""","""Jiang, Zhe.""","""Cham""","""Electronic books.""",9783319601953,"""nan""","""1 online resource""",1
12,12,a3432,b12,"""Practical data science cookbook : 89 hands-on recipes to help you complete real-world data scie...","""Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta""","""Birmingham, UK : Packt Publishing Ltd., September 2014. ©2014""","""Books""",9781783980246,"""nan""","""nan""","""Practical data science cookbook : 89 hands-on recipes to help you complete real-world data scie...","""Ojeda, Tony.""","""Birmingham, UK""","""Electronic books.""",9781783980253,"""nan""","""1 online resource""",0
19,19,a4368,b16,"""Neural data science : a primer with MATLAB® and Python™""","""Erik Lee Nylen, Parsec Media, New York, NY, United States, Pascal Wallisch, New York University...","""London : San Diego, CA : Academic Press, [2017]""","""Books""",9780128040980,"""nan""","""nan""","""Neural data science : a primer with MATLABÂ® and Pythonâ¢ ""","""Nylen, Erik Lee""","""London""","""Electronic books.""",9780128040980,"""nan""","""1 online resource""",1
24,24,a107,b28,"""The data science design manual""","""Steven S. Skiena""","""Cham, Switzerland : Springer, 2017.""","""Books""",9783319554440,"""Texts in computer science,Undergraduate texts in computer science,""","""nan""","""The data science design manual ""","""Skiena, Steven S.""","""Cham, Switzerland""","""Electronic books.""",9783319554440,"""Texts in computer science, 1868-0941""","""1 online resource (xvii, 445 pages)""",1
80,80,a3115,b58,"""Big data at work : the data science revolution and organizational psychology""","""edited by Scott Tonidandel, Eden B. King, and Jose M. Cortina""","""New York : Routledge, Taylor & Francis Group, 2016.""","""Books""",9781848725812,"""Organizational frontiers series,Frontiers of industrial and organizational psychology,""","""nan""","""Big data at work : the data science revolution and organizational psychology ""","""Tonidandel, Scott. | King, Eden. | Cortina, Jose M.""","""New York""","""nan""",9781848725812,"""Organizational frontiers series.""","""xiii, 367 pages""",1
147,147,a3510,b71,"""Mathematical problems in data science : theoretical and practical methods""","""Li M. Chen, Zhixun Su, Bo Jiang""","""Cham : Springer, 2015.""","""Books""",9783319251271,"""nan""","""nan""","""Mathematical problems in data science : theoretical and practical methods ""","""Chen, Li M.""","""Cham""","""Electronic books.""",9783319251271,"""nan""","""1 online resource (xv, 213 pages)""",1
151,151,a3977,b72,"""Cyber-risk informatics : engineering evaluation with data science""","""Mehmet Sahinoglu""","""Hoboken, New Jersey : Wiley, 2016.""","""Books""",9781119087526,"""nan""","""nan""","""Cyber-risk informatics : engineering evaluation with data science ""","""Sahinoglu, Mehmet, 1951-""","""Hoboken, New Jersey""","""Electronic books.""",9781119087533,"""nan""","""1 online resource""",0
172,172,a3029,b91,"""Beginning data science in R : data analysis, visualization, and modelling for the data scientist""","""Thomas Mailund""","""New York : Apress, [2017]. ©2017""","""Books""",9781484226711,"""nan""","""nan""","""Beginning data science in R : data analysis, visualization, and modelling for the data scientis...","""Mailund, Thomas""","""New York""","""Electronic books.""",9781484226711,"""nan""","""1 online resource""",1


# Training

In [279]:
match_f = em.get_features_for_matching(A, B)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,ID,ID,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,Title,Title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,Author,Author,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
3,Publication,Publication,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
4,Format,Format,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
5,ISBN,ISBN,numeric,numeric,Exact Match; Absolute Norm
6,Series,Series,medium string (5 words to 10 words),medium string (5 words to 10 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
7,Physical Details,Physical Details,short string (1 word),short string (1 word to 5 words),Not Applicable: Types do not match


Do you want to proceed? (y/n):y


In [280]:
match_f

Unnamed: 0,feature_name,left_attribute,right_attribute,left_attr_tokenizer,right_attr_tokenizer,simfunction,function,function_source,is_auto_generated
0,ID_ID_lev_dist,ID,ID,,,lev_dist,<function ID_ID_lev_dist at 0x7f9cbff91598>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
1,ID_ID_lev_sim,ID,ID,,,lev_sim,<function ID_ID_lev_sim at 0x7f9d014ea840>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
2,ID_ID_jar,ID,ID,,,jaro,<function ID_ID_jar at 0x7f9d014ea6a8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
3,ID_ID_jwn,ID,ID,,,jaro_winkler,<function ID_ID_jwn at 0x7f9d011810d0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
4,ID_ID_exm,ID,ID,,,exact_match,<function ID_ID_exm at 0x7f9d01181158>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
5,ID_ID_jac_qgm_3_qgm_3,ID,ID,qgm_3,qgm_3,jaccard,<function ID_ID_jac_qgm_3_qgm_3 at 0x7f9d01181620>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
6,Title_Title_jac_qgm_3_qgm_3,Title,Title,qgm_3,qgm_3,jaccard,<function Title_Title_jac_qgm_3_qgm_3 at 0x7f9d01181268>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
7,Title_Title_cos_dlm_dc0_dlm_dc0,Title,Title,dlm_dc0,dlm_dc0,cosine,<function Title_Title_cos_dlm_dc0_dlm_dc0 at 0x7f9d01181400>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
8,Format_Format_jac_qgm_3_qgm_3,Format,Format,qgm_3,qgm_3,jaccard,<function Format_Format_jac_qgm_3_qgm_3 at 0x7f9d011812f0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
9,Format_Format_cos_dlm_dc0_dlm_dc0,Format,Format,dlm_dc0,dlm_dc0,cosine,<function Format_Format_cos_dlm_dc0_dlm_dc0 at 0x7f9d01181950>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True


In [281]:
match_t = em.get_tokenizers_for_matching()
match_s = em.get_sim_funs_for_matching()
f1 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', match_t, match_s)
f2 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Author"]), dlm_dc0(rtuple["Author"]))', match_t, match_s)
f3 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Publication"]), dlm_dc0(rtuple["Publication"]))', match_t, match_s)
f4 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Series"]), dlm_dc0(rtuple["Series"]))', match_t, match_s)

In [272]:
f1

{'function': <function fn>,
 'function_source': 'def fn(ltuple, rtuple):\n    return jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))',
 'is_auto_generated': False,
 'left_attr_tokenizer': 'dlm_dc0',
 'left_attribute': 'Title',
 'right_attr_tokenizer': 'dlm_dc0',
 'right_attribute': 'Title',
 'simfunction': 'jaccard'}

In [284]:
em.add_feature(match_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', f1)

Input feature name is already present in feature table


AssertionError: Input feature name is already present in feature table

In [287]:
em.add_feature(match_f, 'Publication_Publication_jac_dlm_dc0_dlm_dc0', f3)

True

In [288]:
em.add_feature(match_f, 'Series_Series_jac_dlm_dc0_dlm_dc0', f4)

True

In [289]:
match_f

Unnamed: 0,feature_name,left_attribute,right_attribute,left_attr_tokenizer,right_attr_tokenizer,simfunction,function,function_source,is_auto_generated
0,ID_ID_lev_dist,ID,ID,,,lev_dist,<function ID_ID_lev_dist at 0x7f9cbff91598>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
1,ID_ID_lev_sim,ID,ID,,,lev_sim,<function ID_ID_lev_sim at 0x7f9d014ea840>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
2,ID_ID_jar,ID,ID,,,jaro,<function ID_ID_jar at 0x7f9d014ea6a8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
3,ID_ID_jwn,ID,ID,,,jaro_winkler,<function ID_ID_jwn at 0x7f9d011810d0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
4,ID_ID_exm,ID,ID,,,exact_match,<function ID_ID_exm at 0x7f9d01181158>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
5,ID_ID_jac_qgm_3_qgm_3,ID,ID,qgm_3,qgm_3,jaccard,<function ID_ID_jac_qgm_3_qgm_3 at 0x7f9d01181620>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
6,Title_Title_jac_qgm_3_qgm_3,Title,Title,qgm_3,qgm_3,jaccard,<function Title_Title_jac_qgm_3_qgm_3 at 0x7f9d01181268>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
7,Title_Title_cos_dlm_dc0_dlm_dc0,Title,Title,dlm_dc0,dlm_dc0,cosine,<function Title_Title_cos_dlm_dc0_dlm_dc0 at 0x7f9d01181400>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
8,Format_Format_jac_qgm_3_qgm_3,Format,Format,qgm_3,qgm_3,jaccard,<function Format_Format_jac_qgm_3_qgm_3 at 0x7f9d011812f0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
9,Format_Format_cos_dlm_dc0_dlm_dc0,Format,Format,dlm_dc0,dlm_dc0,cosine,<function Format_Format_cos_dlm_dc0_dlm_dc0 at 0x7f9d01181950>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True


In [290]:
match_f = match_f[(match_f['left_attribute'] != 'ID') & (match_f['left_attribute'] != 'ISBN')]

In [295]:
H = em.extract_feature_vecs(I, feature_table=match_f, attrs_after=['label'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01


In [185]:
H

Unnamed: 0,_id,ltable_ID,rtable_ID,Title_Title_jac_qgm_3_qgm_3,Title_Title_cos_dlm_dc0_dlm_dc0,Format_Format_jac_qgm_3_qgm_3,Format_Format_cos_dlm_dc0_dlm_dc0,Format_Format_jac_dlm_dc0_dlm_dc0,Format_Format_mel,Format_Format_lev_dist,Format_Format_lev_sim,Format_Format_nmw,Format_Format_sw,Series_Series_jac_qgm_3_qgm_3,Series_Series_cos_dlm_dc0_dlm_dc0,Series_Series_mel,Series_Series_lev_dist,Series_Series_lev_sim,Title_Title_jac_dlm_dc0_dlm_dc0,label
18217,18217,a3828,b5590,0.956250,0.923381,0.063492,0.0,0.0,0.641071,50,0.107143,-43.0,4.0,0.746479,0.824958,0.967625,11.0,0.840580,0.857143,0
5553,5553,a3925,b738,0.947826,0.903696,0.076923,0.0,0.0,0.647619,39,0.133333,-32.0,4.0,0.622951,0.577350,0.940199,13.0,0.754717,0.823529,0
19908,19908,a4539,b6384,0.911765,0.801784,0.142857,0.0,0.0,0.605714,5,0.285714,0.0,1.0,0.714286,0.288675,0.952900,2.0,0.942857,0.666667,0
14540,14540,a606,b3709,0.871795,0.730297,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.693548,0.755929,0.952698,12.0,0.785714,0.571429,1
21117,21117,a4920,b6847,0.922078,0.843274,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.880000,0.833333,0.991111,1.0,0.977778,0.727273,1
6780,6780,a5123,b907,0.900000,0.828079,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.730769,0.857143,0.962791,4.0,0.906977,0.705882,0
12856,12856,a1510,b3048,0.925000,0.878310,0.063492,0.0,0.0,0.641071,50,0.107143,-43.0,4.0,0.868852,0.875000,0.986441,2.0,0.966102,0.782609,0
2892,2892,a651,b385,0.950820,0.909509,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.696429,0.624038,0.839636,106.0,0.561983,0.833333,1
12577,12577,a3228,b2833,0.808511,0.730297,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.314050,0.520266,0.828102,127.0,0.239521,0.571429,0
18180,18180,a4920,b5581,0.922078,0.843274,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.880000,0.833333,0.991111,1.0,0.977778,0.727273,0


In [313]:
dt = em.DTMatcher(name='DecisionTree', max_depth = 20000)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher('NaiveBayes')

In [316]:
result = em.select_matcher(matchers=[dt, rf, svm, lg, ln], 
                           table=H, 
                           exclude_attrs=['_id', 'ltable_ID', 'rtable_ID'], 
                           target_attr='label', 
                           k=5,
                           metric_to_select_matcher='precision', 
                           random_state=0)

In [317]:
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.752902,0.708199,0.726982
1,RF,0.791367,0.789844,0.783071
2,SVM,0.721147,0.874917,0.789138
3,LogReg,0.694734,0.750416,0.719114
4,LinReg,0.73176,0.86575,0.790542
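
Which matcher looks best depends on the metric used for selection. The short sketch below ranks the matchers using the averages copied from the `cv_stats` table above. The `best_matcher` helper is a hypothetical name used only for this illustration and is not part of *py_entitymatching*.

```python
# Cross-validation averages copied from the cv_stats table above.
cv_stats = {
    "DecisionTree": {"precision": 0.752902, "recall": 0.708199, "f1": 0.726982},
    "RF": {"precision": 0.791367, "recall": 0.789844, "f1": 0.783071},
    "SVM": {"precision": 0.721147, "recall": 0.874917, "f1": 0.789138},
    "LogReg": {"precision": 0.694734, "recall": 0.750416, "f1": 0.719114},
    "LinReg": {"precision": 0.73176, "recall": 0.86575, "f1": 0.790542},
}


def best_matcher(stats, metric):
    """Return the matcher name with the highest value for `metric`."""
    return max(stats, key=lambda name: stats[name][metric])


for metric in ("precision", "recall", "f1"):
    print(metric, "->", best_matcher(cv_stats, metric))
```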


In [127]:
pred_table = dt.predict(table=H, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'], target_attr='predicted_labels', return_probs=True, probs_attr='proba', append=True)

In [103]:
pred_table

Unnamed: 0,_id,ltable_ID,rtable_ID,Title_Title_jac_qgm_3_qgm_3,Title_Title_cos_dlm_dc0_dlm_dc0,Format_Format_jac_qgm_3_qgm_3,Format_Format_cos_dlm_dc0_dlm_dc0,Format_Format_jac_dlm_dc0_dlm_dc0,Format_Format_mel,Format_Format_lev_dist,...,ISBN_ISBN_lev_dist,ISBN_ISBN_lev_sim,Series_Series_jac_qgm_3_qgm_3,Series_Series_cos_dlm_dc0_dlm_dc0,Series_Series_mel,Series_Series_lev_dist,Series_Series_lev_sim,label,predicted_labels,proba
2,2,a4508,b2,0.828571,0.670820,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
3,3,a3267,b3,0.850000,0.771517,0.142857,0.0,0.0,0.605714,5,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
8,8,a1869,b8,0.929412,0.897085,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
9,9,a1186,b9,0.883721,0.821584,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
12,12,a3432,b12,0.956140,0.919255,0.153846,0.0,0.0,0.517293,13,...,2.0,0.846154,1.000000,1.000000,1.000000,0.0,1.000000,0,0,0.0
30,30,a1869,b34,0.929412,0.897085,0.142857,0.0,0.0,0.605714,5,...,2.0,0.846154,1.000000,1.000000,1.000000,0.0,1.000000,0,0,0.0
33,33,a539,b40,0.950495,0.944444,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,0.842105,0.666667,0.987879,1.0,0.969697,1,1,1.0
41,41,a2695,b48,0.872340,0.801784,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
80,80,a3115,b58,0.923077,0.870388,0.142857,0.0,0.0,0.605714,5,...,0.0,1.000000,0.452055,0.408248,0.860963,55.0,0.375000,1,1,1.0
113,113,a247,b66,0.934783,0.903696,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,0.837838,0.800000,0.987500,1.0,0.968750,1,1,1.0


In [128]:
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')

In [129]:
eval_summary

OrderedDict([('prec_numerator', 243.0),
             ('prec_denominator', 327.0),
             ('precision', 0.7431192660550459),
             ('recall_numerator', 243.0),
             ('recall_denominator', 254.0),
             ('recall', 0.9566929133858267),
             ('f1', 0.8364888123924269),
             ('pred_pos_num', 327.0),
             ('false_pos_num', 84.0),
             ('false_pos_ls',
              [('a3432', 'b12'),
               ('a2444', 'b1860'),
               ('a4114', 'b4042'),
               ('a3977', 'b72'),
               ('a5346', 'b2191'),
               ('a287', 'b95'),
               ('a3051', 'b109'),
               ('a4088', 'b112'),
               ('a4102', 'b706'),
               ('a104', 'b127'),
               ('a1384', 'b715'),
               ('a3385', 'b2197'),
               ('a3463', 'b2218'),
               ('a1785', 'b4171'),
               ('a1117', 'b2270'),
               ('a2263', 'b2289'),
               ('a917', 'b4218'),
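
The precision, recall and F1 values in `eval_summary` follow directly from the numerators and denominators it reports. The sketch below recomputes them from those counts using the standard definitions. The function name `precision_recall_f1` is a hypothetical helper for this illustration.

```python
def precision_recall_f1(true_pos, pred_pos, actual_pos):
    """Compute precision, recall and F1 from raw counts.

    true_pos:   predicted matches that are labelled as matches
    pred_pos:   all predicted matches
    actual_pos: all pairs labelled as matches
    """
    precision = true_pos / pred_pos
    recall = true_pos / actual_pos
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the eval_summary output above.
precision, recall, f1 = precision_recall_f1(true_pos=243, pred_pos=327, actual_pos=254)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```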
           

# CODE 

In [None]:
S[S['ltable_ISBN'] == S['rtable_ISBN']].shape

In [62]:
len(G)

1286

In [11]:
F.head(20)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Title,ltable_Author,ltable_Publication,ltable_Format,ltable_ISBN,ltable_Series,ltable_Physical Details,rtable_Title,rtable_Author,rtable_Publication,rtable_Format,rtable_ISBN,rtable_Series,rtable_Physical Details
0,0,a2337,b0,"""Statistical learning and data science""","""edited by Mireille Gettler Summa ... [and others]""","""Boca Raton : CRC Press, [2012] ©2012""","""Books""",9781439867631,"""Series in computer science and data analysis,""","""nan""","""Statistical learning and data science ""","""Summa, Mireille Gettler.""","""Boca Raton""","""nan""",9781439867631,"""Series in computer science and data analysis.""","""xv, 227 p."""
1,1,a3097,b1,"""Intelligent techniques for data science""","""Rajendra Akerkar, Priti Srinivas Sajja""","""Cham, Switzerland : Springer, 2016.""","""Books""",9783319292069,"""nan""","""nan""","""Intelligent techniques for data science ""","""Akerkar, Rajendra""","""Cham, Switzerland""","""Electronic books.""",9783319292069,"""nan""","""1 online resource (xvi, 272 pages)"""
2,2,a4508,b2,"""Algorithms for data science""","""Brian Steele, John Chandler, Swarna Reddy""","""Cham, Switzerland : Springer, 2016.""","""Books""",9783319457970,"""nan""","""nan""","""Algorithms for data science ""","""Steele, Brian""","""Cham, Switzerland""","""Electronic books.""",9783319457970,"""nan""","""1 online resource (xxiii, 430 pages)"""
3,3,a3267,b3,"""Data science at the command line""","""Jeroen Janssens""","""First edition. Sebastopol, CA : O'Reilly, 2014. ©2015""","""Books""",9781491947852,"""nan""","""nan""","""Data science at the command line ""","""Janssens, Jeroen""","""Sebastopol, CA""","""nan""",9781491947852,"""nan""","""xvii, 191 pages"""
5,5,a4755,b5,"""Introduction to HPC with MPI for data science""","""Frank Nielsen""","""Cham : Springer, 2016.""","""Books""",9783319219035,"""Undergraduate topics in computer science,""","""nan""","""Introduction to HPC with MPI for data science ""","""Nielsen, Frank""","""Cham""","""Electronic books.""",9783319219035,"""Undergraduate topics in computer science, 1863-7310""","""1 online resource (xxxiii, 282 pages)"""
6,6,a2523,b6,"""Data Science Using Oracle Data Miner and Oracle R Enterprise : Transform Your Business Systems ...","""Sibanjan Das""","""Berkeley, CA : Apress, 2016. Berkeley, CA : Apress, 2016.""","""Books""",9781484226148,"""nan""","""nan""","""Data Science Using Oracle Data Miner and Oracle R Enterprise : Transform Your Business Systems ...","""Das, Sibanjan.""","""Berkeley, CA""","""Electronic books.""",9781484226148,"""nan""","""1 online resource (300 pages)"""
7,7,a3584,b7,"""The data science handbook""","""Field Cady""","""Hoboken, NJ : John Wiley & Sons, Inc., 2017.""","""Books""",9781119092933,"""nan""","""nan""","""The data science handbook ""","""Cady, Field, 1984-""","""Hoboken, NJ""","""Electronic books. | Handbooks and manuals.""",9781119092933,"""nan""","""1 online resource"""
8,8,a1869,b8,"""Data science : create teams that ask the right questions and deliver real value""","""Doug Rose""","""[Berkeley, CA] : Apress, 2016.""","""Books""",9781484222539,"""nan""","""nan""","""Data science : create teams that ask the right questions and deliver real value ""","""Rose, Doug, (Agile coach)""","""[Berkeley, CA]""","""Electronic books.""",9781484222539,"""nan""","""1 online resource"""
9,9,a1186,b9,"""Spatial big data science classification techniques for Earth observation imagery""","""Zhe Jiang, Shashi Shekhar""","""Cham : Springer, 2017.""","""Books""",9783319601953,"""nan""","""nan""","""Spatial big data science : classification techniques for Earth observation imagery ""","""Jiang, Zhe.""","""Cham""","""Electronic books.""",9783319601953,"""nan""","""1 online resource"""
10,10,a3432,b11,"""Practical data science cookbook : 89 hands-on recipes to help you complete real-world data scie...","""Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta""","""Birmingham, UK : Packt Publishing Ltd., September 2014. ©2014""","""Books""",9781783980246,"""nan""","""nan""","""Practical data science cookbook : 89 hands-on recipes to help you complete real-world data scie...","""Ojeda, Tony.""","""Birmingham, UK""","""Electronic books.""",9781783980253,"""nan""","""1 online resource"""


In [None]:
s1 = pd.merge(A, B, how='inner', on=['ISBN'])

In [8]:
s1.head(20)

NameError: name 's1' is not defined

In [None]:
C[C['ltable_ISBN'] == C['rtable_ISBN']].shape

In [None]:
D[D['ltable_ISBN'] == D['rtable_ISBN']].shape

In [None]:
E[E['ltable_ISBN'] == E['rtable_ISBN']].shape

In [59]:
F[F['ltable_ISBN'] == F['rtable_ISBN']].shape

(639, 17)

In [195]:
G[G['ltable_ISBN'] == G['rtable_ISBN']].shape

(627, 17)
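
The counts above can be turned into a quick check of how the rule-based blocker trades candidate-set size against lost matches. The sketch below uses the numbers reported in this notebook and assumes, as the labelling cell does, that equal ISBNs stand in for a true match. The variable names are illustrative only.

```python
# Counts copied from the cell outputs in this notebook.
candidates_before = 3456   # len(F), before the rule-based blocker
candidates_after = 1286    # len(G), after the rule-based blocker
isbn_equal_before = 639    # ISBN-equal pairs in F
isbn_equal_after = 627     # ISBN-equal pairs in G

# Share of ISBN-equal pairs that survive the rule-based blocker.
retained = isbn_equal_after / isbn_equal_before

# Share of candidate pairs removed by the rule-based blocker.
reduction = 1 - candidates_after / candidates_before

print(f"ISBN-equal pairs retained: {retained:.3f}")
print(f"Candidate set reduction:   {reduction:.3f}")
```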

In [196]:
S[S['ltable_ISBN'] == S['rtable_ISBN']].shape

(254, 18)

In [197]:
I[I['ltable_ISBN'] == I['rtable_ISBN']].shape

(176, 18)

In [198]:
len(I)

330

In [None]:
D = ob.block_candset(block_data, 'Title', 'Title', allow_missing=True)

In [None]:
em.get_key(data1)

In [None]:
data1.keys()