# Entity Matching (EM) for Books

# Introduction

This IPython notebook shows a basic workflow for matching two tables using *py_entitymatching*. We want to match data science books held by the libraries of UW-Madison and UIUC. The UW-Madison book information comes from [here](https://search.library.wisc.edu/search/system?q=Data+Science) and the UIUC book information comes from [here](https://vufind.carli.illinois.edu/vf-uiu/Search/Home?lookfor=Data+Science+&type=all&start_over=1&submit=Find&search=new). Details can be found in our Stage 2 Report [here](https://github.com/iphyer/CS839ClassProject/blob/master/stage2/Stage2Report.pdf).


First, we import the *py_entitymatching* package and the other libraries we need:

In [16]:
import pandas as pd
import py_entitymatching as em

# Read input tables

We begin by loading the input tables.

The UW-Madison table is stored in `TableA.csv` and the UIUC table in `TableB.csv`. They contain:

* 6963 tuples in table `TableA.csv`
* 5730 tuples in table `TableB.csv`

In [17]:
A = em.read_csv_metadata('../data/TableA.csv', key = 'ID')
B = em.read_csv_metadata('../data/TableB.csv', key = 'ID')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
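Both tables are read with `key = 'ID'`, which only makes sense if every row has a distinct, non-empty `ID`. The sketch below illustrates that check in plain Python on a hypothetical three-row excerpt (the `SAMPLE_CSV` string and the `is_valid_key` helper are illustrative names, not part of *py_entitymatching*, and the second title is shortened):

```python
import csv
import io

# Hypothetical three-row excerpt standing in for TableA.csv.
SAMPLE_CSV = """ID,Title
a0,Data compression techniques and applications
a1,Information security and privacy
a2,Algorithms and data structures the science of computing
"""


def is_valid_key(rows, key):
    """Return True if every row has a non-empty, unique value for `key`."""
    values = [row.get(key) for row in rows]
    if any(v is None or v == "" for v in values):
        return False
    return len(set(values)) == len(values)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
key_ok = is_valid_key(rows, "ID")               # all IDs distinct
dup_ok = is_valid_key(rows + [rows[0]], "ID")   # a0 appears twice
print(key_ok, dup_ok)
```

Running a check like this before loading can catch duplicated IDs early, since a key column with repeated values cannot identify tuples unambiguously.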


In [18]:
A.head(10)

Unnamed: 0,ID,Title,Author,Publication,Format,ISBN,Series,Physical Details
0,a0,"""Data compression techniques and applications""","""Thomas J. Lynch""","""Belmont, Calif. : Lifetime Learning Publications, [1985] ©1985""","""Books""",9780534034184,"""nan""","""nan"""
1,a1,"""Information security and privacy : Third Australasian Conference, ACISP'98, Brisbane, Australia...","""Colin Boyd, Ed Dawson, (eds.)""","""Berlin ; New York : Springer, [1998] ©1998""","""Books""",9783540647324,"""Lecture notes in computer science. 1438,""","""nan"""
2,a2,"""Algorithms and data structures the science of computing""","""Douglas Baldwin and Greg W. Scragg""","""1st ed. Hingham, Mass. : Charles River Media, c2004.""","""Books""",9781584502500,"""Charles River Media computer engineering series,""","""nan"""
3,a3,"""Tele-informatics : data and computer communications""","""César Macchi and Jean-François Guilbert and 17 co-authors ; translated by J.M-A. St. Quinton""","""Amsterdam ; New York : North-Holland ; New York, N.Y., U.S.A. : Sole distributors for the U.S.A...","""Books""",9780444875075,"""Studies in telecommunication ; v. 3,""","""nan"""
4,a4,"""Mobile data management : 4th international conference, MDM 2003, Melbourne, Australia, January ...","""Ming-Syan Chen ... [and others], (eds.)""","""Berlin ; New York : Springer, [2003] ©2003""","""Books""",9783540003939,"""Lecture notes in computer science. 2574,""","""nan"""
5,a5,"""Medical data analysis : 4th International Symposium, ISMDA 2003, Berlin, Germany, October 9-10,...","""Petra Perner, Rüdiger Brause, Hermann-Georg Holzhütter (eds.)""","""Berlin ; New York : Springer, [2003] ©2003""","""Books""",9783540202820,"""Lecture notes in computer science. 2868,""","""nan"""
6,a6,"""Indexing it all : the subject in the age of documentation, information, and data""","""Ronald E. Day""","""Cambridge, Massachusetts ; London, England : The MIT Press, 2014. ©2014""","""Books""",9780262322775,"""History and foundations of information science,""","""nan"""
7,a7,"""Advanced analysis and learning on temporal data : first ECML PKDD Workshop, AALTD 2015, Porto, ...","""Ahlame Douzal-Chouakria, José A. Vilar, Pierre-François Marteau (eds.)""","""Switzerland : Springer, 2016.""","""Books""",9783319444123,"""Lecture notes in computer science. Lecture notes in artificial intelligence ; 9785,LNCS sublibr...","""nan"""
8,a8,"""Data clustering : algorithms and applications""","""[edited by] Charu C. Aggarwal, Chandan K. Reddy""","""Boca Raton : Chapman and Hall/CRC, [2014] ©2014""","""Books""",9781466558212,"""Chapman & Hall/CRC data mining and knowledge discovery series,""","""nan"""
9,a10,"""Knowledge discovery from sensor data : second international workshop, Sensor-KDD 2008, Las Vega...","""Mohamed Medhat Gaber ... [and others] (eds.)""","""Berlin ; New York : Springer, [2010] ©2010""","""Books""",9783642125188,"""Lecture notes in computer science ; 5840,LNCS sublibrary. Information systems and applications,...","""nan"""


In [51]:
block_f = em.get_features_for_blocking(A, B)
block_t = em.get_tokenizers_for_blocking()
block_s = em.get_sim_funs_for_blocking()
r = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', block_t, block_s)
em.add_feature(block_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', r)

# Block tables to get candidate set

Here we will use several blockers to remove obviously non-matching tuple pairs from the input tables.

Because the data comes from two different library websites, the same book may not have exactly the same attribute values in both tables. We therefore apply an OverlapBlocker to several attributes, including the *Title*, *Author* and *Series* of the book.

After multiple tests, we settled on an overlap_size for each attribute. The cells below use 2 for *Author*, 4 for *Title*, and 1 for both *Series* and *Publication*.
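To make the idea of overlap blocking concrete, here is a minimal sketch in plain Python. It assumes values are lowercased, stripped of punctuation and split on whitespace before tokens are compared; the helper names `word_tokens` and `survives_overlap` are hypothetical and this is not the library's implementation. The author strings are taken from the candidate pairs printed later in this notebook.

```python
import string

# Translation table that deletes ASCII punctuation (an assumption of this sketch).
_PUNCT = str.maketrans("", "", string.punctuation)


def word_tokens(value):
    """Lowercase, strip punctuation, split on whitespace, return a set."""
    return set(value.lower().translate(_PUNCT).split())


def survives_overlap(left, right, overlap_size=1):
    """Keep the pair only if the values share at least `overlap_size` tokens."""
    return len(word_tokens(left) & word_tokens(right)) >= overlap_size


# Same author written in two cataloguing styles: two shared tokens.
same_author = survives_overlap("Jeroen Janssens", "Janssens, Jeroen", overlap_size=2)
# Unrelated authors: no shared tokens, so the pair is dropped.
different = survives_overlap("Frank Nielsen", "Cady, Field, 1984-", overlap_size=2)
print(same_author, different)
```

A larger overlap_size removes more pairs but also risks dropping true matches whose values are short or formatted differently, which is why the threshold is tuned per attribute.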

In [3]:
ob = em.OverlapBlocker()
out_attrs = ['Title', 'Author', 'Publication', 'Format', 'ISBN', 'Series', 'Physical Details']
C = ob.block_tables(A, B, 'Author', 'Author',
                    l_output_attrs=out_attrs, r_output_attrs=out_attrs,
                    overlap_size=2)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [4]:
D = ob.block_candset(C, 'Title', 'Title', overlap_size = 4)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [5]:
E = ob.block_candset(D, 'Series', 'Series', overlap_size = 1)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [6]:
F = ob.block_candset(E, 'Publication', 'Publication', overlap_size = 1)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [None]:
F = ob.block_candset(F, 'Author', 'Author', overlap_size = 2)

In [60]:
# Drop any pair whose whitespace-token Jaccard similarity on Title is below 0.5.
rule1 = ['Title_Title_jac_dlm_dc0_dlm_dc0(ltuple, rtuple) < 0.5']
rb = em.RuleBasedBlocker()
rb.add_rule(rule1, block_f)
G = rb.block_candset(F)

'_rule_0'
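The rule above drops a candidate pair when the Jaccard similarity of the two titles falls below 0.5. A minimal sketch of that computation, assuming plain whitespace tokenization (the function names `jaccard` and `is_blocked` are illustrative, not the library's API):

```python
def jaccard(left_tokens, right_tokens):
    """Size of the intersection divided by size of the union of two token sets."""
    left, right = set(left_tokens), set(right_tokens)
    if not left and not right:
        return 1.0  # convention chosen in this sketch for two empty sets
    return len(left & right) / len(left | right)


def is_blocked(left_title, right_title, threshold=0.5):
    """Return True if the pair would be dropped by the 'jaccard < threshold' rule."""
    return jaccard(left_title.split(), right_title.split()) < threshold


left = "Spatial big data science classification techniques for Earth observation imagery"
right = "Spatial big data science : classification techniques for Earth observation imagery"

score = jaccard(left.split(), right.split())  # 10 shared tokens out of 11 distinct
blocked = is_blocked(left, right)
unrelated_blocked = is_blocked("The data science handbook", "Algorithms for data science")
print(round(score, 3), blocked, unrelated_blocked)
```

The two "Spatial big data science" titles differ only by a stray `:` token and survive, while the two unrelated titles share only "data" and "science" and are dropped.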

In [141]:
len(G)

1286

In [143]:
G.to_csv('Set_C.csv', sep = ',')

In [None]:
# G = em.label_table(F, label_column_name='gold_labels')

In [118]:
S = em.sample_table(G, 500)

In [None]:
S

In [None]:
# Label a pair as a match (1) when both ISBNs are identical, otherwise as a non-match (0).
S["label"] = (S["ltable_ISBN"] == S["rtable_ISBN"]).astype(int)
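The labeling step treats ISBN equality as ground truth. The following sketch applies the same rule to two candidate pairs whose IDs and ISBNs are copied from the candidate set printed later in this notebook; plain dictionaries stand in for the DataFrame rows.

```python
# Two candidate pairs with the ISBN values shown in the notebook's candidate set.
pairs = [
    {"ltable_ID": "a2337", "rtable_ID": "b0",
     "ltable_ISBN": 9781439867631, "rtable_ISBN": 9781439867631},
    {"ltable_ID": "a3432", "rtable_ID": "b11",
     "ltable_ISBN": 9781783980246, "rtable_ISBN": 9781783980253},
]

# Same rule as the cell above: 1 if the ISBNs are equal, else 0.
for pair in pairs:
    pair["label"] = int(pair["ltable_ISBN"] == pair["rtable_ISBN"])

labels = [pair["label"] for pair in pairs]
print(labels)
```

Note that the second pair carries the same title in both tables but different ISBNs, so this rule labels it a non-match; labels derived this way are only as reliable as the ISBN field itself.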

In [105]:
# S = S.drop(['ltable_ISBN', 'rtable_ISBN'], axis=1)

In [145]:
S.to_csv('Set_G.csv', sep = ',')

In [147]:
IJ = em.split_train_test(S, train_proportion=0.66, random_state=0)
I = IJ['train']
J = IJ['test']
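As a rough illustration of what a seeded 66/34 split does, the sketch below shuffles row indices with a fixed seed and cuts the list once. It is a simplified stand-in with a hypothetical helper name (`split_rows`), not the routine the library uses, so the exact rows it selects will differ from those in `I` and `J`.

```python
import random


def split_rows(rows, train_proportion=0.66, seed=0):
    """Shuffle with a fixed seed, then cut once into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_proportion)
    return shuffled[:n_train], shuffled[n_train:]


# 500 row indices, mirroring the size of the sampled table S.
train, test = split_rows(range(500))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so the saved `Set_I.csv` and `Set_J.csv` files can be regenerated from the same sample.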

In [148]:
I.to_csv('Set_I.csv', sep = ',')

In [149]:
J.to_csv('Set_J.csv', sep = ',')

In [None]:
# block_f = em.get_features_for_blocking(A, B)

# Training

In [163]:
match_f = em.get_features_for_matching(A, B)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,ID,ID,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,Title,Title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,Author,Author,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
3,Publication,Publication,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
4,Format,Format,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
5,ISBN,ISBN,numeric,numeric,Exact Match; Absolute Norm
6,Series,Series,medium string (5 words to 10 words),medium string (5 words to 10 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
7,Physical Details,Physical Details,short string (1 word),short string (1 word to 5 words),Not Applicable: Types do not match


Do you want to proceed? (y/n):y


In [165]:
match_f = match_f[(match_f['left_attribute'] != 'ID') & (match_f['left_attribute'] != 'ISBN')]

In [169]:
match_f

Unnamed: 0,feature_name,left_attribute,right_attribute,left_attr_tokenizer,right_attr_tokenizer,simfunction,function,function_source,is_auto_generated
6,Title_Title_jac_qgm_3_qgm_3,Title,Title,qgm_3,qgm_3,jaccard,<function Title_Title_jac_qgm_3_qgm_3 at 0x7f9cbff60158>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
7,Title_Title_cos_dlm_dc0_dlm_dc0,Title,Title,dlm_dc0,dlm_dc0,cosine,<function Title_Title_cos_dlm_dc0_dlm_dc0 at 0x7f9cbff601e0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
8,Format_Format_jac_qgm_3_qgm_3,Format,Format,qgm_3,qgm_3,jaccard,<function Format_Format_jac_qgm_3_qgm_3 at 0x7f9cbff60268>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
9,Format_Format_cos_dlm_dc0_dlm_dc0,Format,Format,dlm_dc0,dlm_dc0,cosine,<function Format_Format_cos_dlm_dc0_dlm_dc0 at 0x7f9cbff602f0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
10,Format_Format_jac_dlm_dc0_dlm_dc0,Format,Format,dlm_dc0,dlm_dc0,jaccard,<function Format_Format_jac_dlm_dc0_dlm_dc0 at 0x7f9cbff60378>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
11,Format_Format_mel,Format,Format,,,monge_elkan,<function Format_Format_mel at 0x7f9cbff60400>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
12,Format_Format_lev_dist,Format,Format,,,lev_dist,<function Format_Format_lev_dist at 0x7f9cbff60488>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
13,Format_Format_lev_sim,Format,Format,,,lev_sim,<function Format_Format_lev_sim at 0x7f9cbff60510>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
14,Format_Format_nmw,Format,Format,,,needleman_wunsch,<function Format_Format_nmw at 0x7f9cbff60598>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
15,left_attribute,right_attribute,left_attr_tokenizer,right_attr_tokenizer,simfunction,is_auto_generated,function,function_source,feature_name


In [167]:
match_t = em.get_tokenizers_for_matching()
match_s = em.get_sim_funs_for_matching()
r = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', match_t, match_s)

In [168]:
em.add_feature(match_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', r)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


True

In [None]:
feature1 = ['Title_Title_jac_dlm_dc0_dlm_dc0(ltuple, rtuple)']
em.add_feature(match_f, 'feature1', r)
rb.add_rule(rule1, block_f)
G = rb.block_candset(F)

In [125]:
H = em.extract_feature_vecs(S, feature_table=match_f, attrs_after=['label'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01


In [137]:
H

Unnamed: 0,_id,ltable_ID,rtable_ID,Title_Title_jac_qgm_3_qgm_3,Title_Title_cos_dlm_dc0_dlm_dc0,Format_Format_jac_qgm_3_qgm_3,Format_Format_cos_dlm_dc0_dlm_dc0,Format_Format_jac_dlm_dc0_dlm_dc0,Format_Format_mel,Format_Format_lev_dist,...,Format_Format_nmw,Format_Format_sw,Series_Series_jac_qgm_3_qgm_3,Series_Series_cos_dlm_dc0_dlm_dc0,Series_Series_mel,Series_Series_lev_dist,Series_Series_lev_sim,label,predicted_labels,proba
0,0,a2337,b0,0.866667,0.730297,0.142857,0.0,0.0,0.605714,5,...,0.0,1.0,0.882353,0.857143,0.991489,1.0,0.978723,1,1,0.804878
3,3,a3267,b3,0.850000,0.771517,0.142857,0.0,0.0,0.605714,5,...,0.0,1.0,1.000000,1.000000,1.000000,0.0,1.000000,1,1,0.804878
9,9,a1186,b9,0.883721,0.821584,0.153846,0.0,0.0,0.517293,13,...,-6.0,4.0,1.000000,1.000000,1.000000,0.0,1.000000,1,1,0.592814
12,12,a3432,b12,0.956140,0.919255,0.153846,0.0,0.0,0.517293,13,...,-6.0,4.0,1.000000,1.000000,1.000000,0.0,1.000000,0,1,0.851064
19,19,a4368,b16,0.782609,0.762770,0.153846,0.0,0.0,0.517293,13,...,-6.0,4.0,1.000000,1.000000,1.000000,0.0,1.000000,1,1,0.592814
24,24,a107,b28,0.837838,0.730297,0.153846,0.0,0.0,0.517293,13,...,-6.0,4.0,0.474576,0.547723,0.837926,39.0,0.426471,1,0,0.280000
80,80,a3115,b58,0.923077,0.870388,0.142857,0.0,0.0,0.605714,5,...,0.0,1.0,0.452055,0.408248,0.860963,55.0,0.375000,1,1,1.000000
147,147,a3510,b71,0.915493,0.858116,0.153846,0.0,0.0,0.517293,13,...,-6.0,4.0,1.000000,1.000000,1.000000,0.0,1.000000,1,1,0.592814
151,151,a3977,b72,0.916667,0.824958,0.153846,0.0,0.0,0.517293,13,...,-6.0,4.0,1.000000,1.000000,1.000000,0.0,1.000000,0,1,0.592814
172,172,a3029,b91,0.931818,0.889499,0.153846,0.0,0.0,0.517293,13,...,-6.0,4.0,1.000000,1.000000,1.000000,0.0,1.000000,1,1,0.851064
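The `lev_dist` and `lev_sim` columns can be reproduced by hand. The sketch below implements the standard Levenshtein edit distance and a similarity defined as 1 minus the distance divided by the longer string's length; that definition is an assumption that happens to agree with the values in the first row above (distance 1.0, similarity 0.978723). The two strings are the *Series* values of pair (a2337, b0), with the literal double quotes kept as they appear in the data.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lev_sim(a, b):
    """Similarity in [0, 1]: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


left = '"Series in computer science and data analysis,"'
right = '"Series in computer science and data analysis."'

distance = levenshtein(left, right)
similarity = lev_sim(left, right)
print(distance, round(similarity, 6))
```

The only difference between the two values is the trailing comma versus period, which is why the distance is 1 over a 47-character string.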


In [135]:
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

In [139]:
result = em.select_matcher(matchers=[dt, rf, svm, lg, ln], 
                           table=H, 
                           exclude_attrs=['_id', 'ltable_ID', 'rtable_ID'], 
                           target_attr='label', 
                           k=5,
                           metric_to_select_matcher='f1', 
                           random_state=0)

In [140]:
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.770724,0.737146,0.751981
1,RF,0.797944,0.783701,0.78845
2,SVM,0.700108,0.868054,0.77169
3,LogReg,0.738215,0.957824,0.832714
4,LinReg,0.739418,0.950842,0.83037
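Since `metric_to_select_matcher='f1'` was requested, the matcher with the highest average F1 in the table above is the one selected. A small sketch that reads the winner off the printed numbers (the `cv_stats` list is a hand-copied stand-in for the DataFrame):

```python
# (matcher, average precision, average recall, average f1), copied from the table above.
cv_stats = [
    ("DecisionTree", 0.770724, 0.737146, 0.751981),
    ("RF", 0.797944, 0.783701, 0.78845),
    ("SVM", 0.700108, 0.868054, 0.77169),
    ("LogReg", 0.738215, 0.957824, 0.832714),
    ("LinReg", 0.739418, 0.950842, 0.83037),
]

# Pick the row with the largest average F1 (index 3).
best = max(cv_stats, key=lambda row: row[3])
best_name = best[0]
print(best_name, best[3])
```

On these numbers logistic regression comes out ahead, narrowly beating linear regression, mostly because of its high recall.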


In [127]:
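# Fit the decision tree before predicting. Fitting and predicting on the same table H yields training-set scores.
dt.fit(table=H, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'], target_attr='label')
pred_table = dt.predict(table=H, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
                        target_attr='predicted_labels', return_probs=True,
                        probs_attr='proba', append=True)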

In [103]:
pred_table

Unnamed: 0,_id,ltable_ID,rtable_ID,Title_Title_jac_qgm_3_qgm_3,Title_Title_cos_dlm_dc0_dlm_dc0,Format_Format_jac_qgm_3_qgm_3,Format_Format_cos_dlm_dc0_dlm_dc0,Format_Format_jac_dlm_dc0_dlm_dc0,Format_Format_mel,Format_Format_lev_dist,...,ISBN_ISBN_lev_dist,ISBN_ISBN_lev_sim,Series_Series_jac_qgm_3_qgm_3,Series_Series_cos_dlm_dc0_dlm_dc0,Series_Series_mel,Series_Series_lev_dist,Series_Series_lev_sim,label,predicted_labels,proba
2,2,a4508,b2,0.828571,0.670820,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
3,3,a3267,b3,0.850000,0.771517,0.142857,0.0,0.0,0.605714,5,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
8,8,a1869,b8,0.929412,0.897085,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
9,9,a1186,b9,0.883721,0.821584,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
12,12,a3432,b12,0.956140,0.919255,0.153846,0.0,0.0,0.517293,13,...,2.0,0.846154,1.000000,1.000000,1.000000,0.0,1.000000,0,0,0.0
30,30,a1869,b34,0.929412,0.897085,0.142857,0.0,0.0,0.605714,5,...,2.0,0.846154,1.000000,1.000000,1.000000,0.0,1.000000,0,0,0.0
33,33,a539,b40,0.950495,0.944444,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,0.842105,0.666667,0.987879,1.0,0.969697,1,1,1.0
41,41,a2695,b48,0.872340,0.801784,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,1.000000,1.000000,1.000000,0.0,1.000000,1,1,1.0
80,80,a3115,b58,0.923077,0.870388,0.142857,0.0,0.0,0.605714,5,...,0.0,1.000000,0.452055,0.408248,0.860963,55.0,0.375000,1,1,1.0
113,113,a247,b66,0.934783,0.903696,0.153846,0.0,0.0,0.517293,13,...,0.0,1.000000,0.837838,0.800000,0.987500,1.0,0.968750,1,1,1.0


In [128]:
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')

In [129]:
eval_summary

OrderedDict([('prec_numerator', 243.0),
             ('prec_denominator', 327.0),
             ('precision', 0.7431192660550459),
             ('recall_numerator', 243.0),
             ('recall_denominator', 254.0),
             ('recall', 0.9566929133858267),
             ('f1', 0.8364888123924269),
             ('pred_pos_num', 327.0),
             ('false_pos_num', 84.0),
             ('false_pos_ls',
              [('a3432', 'b12'),
               ('a2444', 'b1860'),
               ('a4114', 'b4042'),
               ('a3977', 'b72'),
               ('a5346', 'b2191'),
               ('a287', 'b95'),
               ('a3051', 'b109'),
               ('a4088', 'b112'),
               ('a4102', 'b706'),
               ('a104', 'b127'),
               ('a1384', 'b715'),
               ('a3385', 'b2197'),
               ('a3463', 'b2218'),
               ('a1785', 'b4171'),
               ('a1117', 'b2270'),
               ('a2263', 'b2289'),
               ('a917', 'b4218'),
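The precision, recall and F1 values in the summary above follow directly from its numerators and denominators. As a worked check using only the counts printed there:

```python
# Counts taken from the evaluation summary above.
true_pos = 243      # prec_numerator and recall_numerator
pred_pos = 327      # prec_denominator (pairs predicted as matches)
actual_pos = 254    # recall_denominator (pairs labeled as matches)

precision = true_pos / pred_pos
recall = true_pos / actual_pos
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
false_pos = pred_pos - true_pos

print(precision, recall, f1, false_pos)
```

The high recall with lower precision means most labeled matches are found, at the cost of 84 predicted matches that the ISBN-based labels mark as non-matches.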
           

# CODE 

In [None]:
S[S['ltable_ISBN'] == S['rtable_ISBN']].shape

In [62]:
len(G)

1286

In [11]:
F.head(20)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Title,ltable_Author,ltable_Publication,ltable_Format,ltable_ISBN,ltable_Series,ltable_Physical Details,rtable_Title,rtable_Author,rtable_Publication,rtable_Format,rtable_ISBN,rtable_Series,rtable_Physical Details
0,0,a2337,b0,"""Statistical learning and data science""","""edited by Mireille Gettler Summa ... [and others]""","""Boca Raton : CRC Press, [2012] ©2012""","""Books""",9781439867631,"""Series in computer science and data analysis,""","""nan""","""Statistical learning and data science ""","""Summa, Mireille Gettler.""","""Boca Raton""","""nan""",9781439867631,"""Series in computer science and data analysis.""","""xv, 227 p."""
1,1,a3097,b1,"""Intelligent techniques for data science""","""Rajendra Akerkar, Priti Srinivas Sajja""","""Cham, Switzerland : Springer, 2016.""","""Books""",9783319292069,"""nan""","""nan""","""Intelligent techniques for data science ""","""Akerkar, Rajendra""","""Cham, Switzerland""","""Electronic books.""",9783319292069,"""nan""","""1 online resource (xvi, 272 pages)"""
2,2,a4508,b2,"""Algorithms for data science""","""Brian Steele, John Chandler, Swarna Reddy""","""Cham, Switzerland : Springer, 2016.""","""Books""",9783319457970,"""nan""","""nan""","""Algorithms for data science ""","""Steele, Brian""","""Cham, Switzerland""","""Electronic books.""",9783319457970,"""nan""","""1 online resource (xxiii, 430 pages)"""
3,3,a3267,b3,"""Data science at the command line""","""Jeroen Janssens""","""First edition. Sebastopol, CA : O'Reilly, 2014. ©2015""","""Books""",9781491947852,"""nan""","""nan""","""Data science at the command line ""","""Janssens, Jeroen""","""Sebastopol, CA""","""nan""",9781491947852,"""nan""","""xvii, 191 pages"""
5,5,a4755,b5,"""Introduction to HPC with MPI for data science""","""Frank Nielsen""","""Cham : Springer, 2016.""","""Books""",9783319219035,"""Undergraduate topics in computer science,""","""nan""","""Introduction to HPC with MPI for data science ""","""Nielsen, Frank""","""Cham""","""Electronic books.""",9783319219035,"""Undergraduate topics in computer science, 1863-7310""","""1 online resource (xxxiii, 282 pages)"""
6,6,a2523,b6,"""Data Science Using Oracle Data Miner and Oracle R Enterprise : Transform Your Business Systems ...","""Sibanjan Das""","""Berkeley, CA : Apress, 2016. Berkeley, CA : Apress, 2016.""","""Books""",9781484226148,"""nan""","""nan""","""Data Science Using Oracle Data Miner and Oracle R Enterprise : Transform Your Business Systems ...","""Das, Sibanjan.""","""Berkeley, CA""","""Electronic books.""",9781484226148,"""nan""","""1 online resource (300 pages)"""
7,7,a3584,b7,"""The data science handbook""","""Field Cady""","""Hoboken, NJ : John Wiley & Sons, Inc., 2017.""","""Books""",9781119092933,"""nan""","""nan""","""The data science handbook ""","""Cady, Field, 1984-""","""Hoboken, NJ""","""Electronic books. | Handbooks and manuals.""",9781119092933,"""nan""","""1 online resource"""
8,8,a1869,b8,"""Data science : create teams that ask the right questions and deliver real value""","""Doug Rose""","""[Berkeley, CA] : Apress, 2016.""","""Books""",9781484222539,"""nan""","""nan""","""Data science : create teams that ask the right questions and deliver real value ""","""Rose, Doug, (Agile coach)""","""[Berkeley, CA]""","""Electronic books.""",9781484222539,"""nan""","""1 online resource"""
9,9,a1186,b9,"""Spatial big data science classification techniques for Earth observation imagery""","""Zhe Jiang, Shashi Shekhar""","""Cham : Springer, 2017.""","""Books""",9783319601953,"""nan""","""nan""","""Spatial big data science : classification techniques for Earth observation imagery ""","""Jiang, Zhe.""","""Cham""","""Electronic books.""",9783319601953,"""nan""","""1 online resource"""
10,10,a3432,b11,"""Practical data science cookbook : 89 hands-on recipes to help you complete real-world data scie...","""Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta""","""Birmingham, UK : Packt Publishing Ltd., September 2014. ©2014""","""Books""",9781783980246,"""nan""","""nan""","""Practical data science cookbook : 89 hands-on recipes to help you complete real-world data scie...","""Ojeda, Tony.""","""Birmingham, UK""","""Electronic books.""",9781783980253,"""nan""","""1 online resource"""


In [None]:
s1 = pd.merge(A, B, how='inner', on=['ISBN'])

In [8]:
s1.head(20)

NameError: name 's1' is not defined

In [None]:
C[C['ltable_ISBN'] == C['rtable_ISBN']].shape

In [None]:
D[D['ltable_ISBN'] == D['rtable_ISBN']].shape

In [None]:
E[E['ltable_ISBN'] == E['rtable_ISBN']].shape

In [59]:
F[F['ltable_ISBN'] == F['rtable_ISBN']].shape

(639, 17)

In [63]:
G[G['ltable_ISBN'] == G['rtable_ISBN']].shape

(627, 17)

In [65]:
S[S['ltable_ISBN'] == S['rtable_ISBN']].shape

(243, 17)

In [None]:
D = ob.block_candset(block_data, 'Title', 'Title', allow_missing=True)

In [None]:
em.get_key(data1)

In [None]:
data1.keys()