# Entity Matching (EM) about Books

# Introduction

This IPython notebook shows a basic workflow for matching two tables using *py_entitymatching*. We want to match data science books held by the libraries of UW-Madison and UIUC. The UW-Madison book information comes from [here](https://search.library.wisc.edu/search/system?q=Data+Science) and the UIUC book information comes from [here](https://vufind.carli.illinois.edu/vf-uiu/Search/Home?lookfor=Data+Science+&type=all&start_over=1&submit=Find&search=new). Details can be found in our Stage 2 Report [here](https://github.com/iphyer/CS839ClassProject/blob/master/stage2/Stage2Report.pdf).


First, we import the *py_entitymatching* package and the other libraries we need:

In [7]:
import pandas as pd
import py_entitymatching as em

# Read input tables

We begin by loading the input tables.

We name the UW-Madison table `TableA.csv` and the UIUC table `TableB.csv`. They contain:

* 4824 tuples in table `TableA.csv`
* 5060 tuples in table `TableB.csv`

In [8]:
table_A = em.read_csv_metadata('../data/TableA.csv', key = 'ID')
table_B = em.read_csv_metadata('../data/TableB.csv', key = 'ID')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.


In [9]:
table_A.shape

(4824, 8)

In [10]:
table_B.shape

(5060, 8)

# Down sampling
Down sample tables A and B to get 1000 examples from each.

In [11]:
A, B = em.down_sample(table_A, table_B, size=1000, y_param = 1, show_progress=False)

In [12]:
A.shape

(1000, 8)

In [13]:
block_f = em.get_features_for_blocking(A, B)
block_t = em.get_tokenizers_for_blocking()
block_s = em.get_sim_funs_for_blocking()
r = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', block_t, block_s)
em.add_feature(block_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', r)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,ID,ID,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,Title,Title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,Author,Author,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
3,Publication,Publication,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
4,Format,Format,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
5,ISBN,ISBN,numeric,numeric,Exact Match; Absolute Norm
6,Series,Series,medium string (5 words to 10 words),medium string (5 words to 10 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
7,Physical Details,Physical Details,short string (1 word),short string (1 word to 5 words),Not Applicable: Types do not match


Do you want to proceed? (y/n):y


True

# Block tables to get candidate set

Here we will use several blockers to remove obviously non-matching tuple pairs from the input tables.

Because the data comes from two different library websites, the attributes of the same book may not be exactly the same in both tables. Therefore, we applied an OverlapBlocker over some of the attributes, including *Title* and *Author*.

After multiple tests, we found the best overlap_size for each attribute: 2 for *Author* and 4 for *Title*.
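As a rough illustration of what an overlap threshold means, the sketch below counts the whitespace-delimited tokens two strings share and keeps the pair only if the count reaches `overlap_size`. It is not the OverlapBlocker implementation: the helper names and the lowercasing step are assumptions made for this example, and the library's own preprocessing may differ.

```python
def word_overlap(left: str, right: str) -> int:
    """Count the distinct whitespace-delimited tokens shared by two strings."""
    # Lowercasing is an assumption of this sketch, not a documented library step.
    left_tokens = set(left.lower().split())
    right_tokens = set(right.lower().split())
    return len(left_tokens & right_tokens)


def survives_blocking(left: str, right: str, overlap_size: int) -> bool:
    """Keep a tuple pair only if it shares at least `overlap_size` tokens."""
    return word_overlap(left, right) >= overlap_size


# Hypothetical author strings, not taken from TableA or TableB.
print(survives_blocking("Foster Provost Tom Fawcett", "provost foster", 2))
print(survives_blocking("Foster Provost Tom Fawcett", "Wes McKinney", 2))
```

A larger `overlap_size` discards more pairs, which is why the longer *Title* attribute can afford a higher threshold than *Author*.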

In [14]:
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'Author', 'Author', 
                    l_output_attrs=['Title','Author','Publication','Format','ISBN','Series', 'Physical Details'], 
                    r_output_attrs=['Title','Author','Publication','Format','ISBN','Series', 'Physical Details'], 
                    overlap_size = 2)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [15]:
D = ob.block_candset(C, 'Title', 'Title', overlap_size = 4)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [47]:
len(D)
D.to_csv('Set_C.csv', sep = ',')

## Sampling from D
Sample 300 examples from D.

In [None]:
S = em.sample_table(D, 300)

## Create label
After manually labeling the data, we get 300 labeled candidates in label_S. <br/>
We also need to set the metadata for label_S appropriately.

In [3]:
label_S = pd.read_csv('./Set_G.csv')
# em.copy_properties(S, label_S)
em.set_property(label_S, 'key', '_id')
em.set_property(label_S, 'fk_ltable', 'ltable_ID')
em.set_property(label_S, 'fk_rtable', 'rtable_ID')
label_S_rtable = em.read_csv_metadata('./label_S_rtable.csv')
label_S_ltable = em.read_csv_metadata('./label_S_ltable.csv')
em.set_property(label_S, 'rtable', label_S_rtable)
em.set_property(label_S, 'ltable', label_S_ltable)

True

In [4]:
IJ = em.split_train_test(label_S, train_proportion=0.66, random_state=0)
I = IJ['train']
J = IJ['test']
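To make the split proportion concrete, here is a minimal sketch of a seeded shuffle-and-cut split over row indices. It only illustrates the idea behind `train_proportion=0.66`; the function name is hypothetical, and `em.split_train_test` may shuffle and round differently.

```python
import random


def split_indices(n_rows, train_proportion, seed=0):
    """Shuffle row indices with a fixed seed and cut them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Rounding rule chosen for this sketch; the library's rule may differ.
    cut = round(n_rows * train_proportion)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(300, 0.66)
print(len(train_idx), len(test_idx))  # prints: 198 102
```

Fixing the seed keeps the split reproducible, which is the role `random_state=0` plays in the cell above.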

In [45]:
I.to_csv('Set_I.csv', sep = ',')

In [46]:
J.to_csv('Set_J.csv', sep = ',')

# Training

In [26]:
match_f = em.get_features_for_matching(A, B)

The table shows the corresponding attributes along with their respective types.
Please confirm that the information  has been correctly inferred.
If you would like to skip this validation process in the future,
please set the flag validate_inferred_attr_types equal to false.


Unnamed: 0,Left Attribute,Right Attribute,Left Attribute Type,Right Attribute Type,Example Features
0,ID,ID,short string (1 word),short string (1 word),Levenshtein Distance; Levenshtein Similarity
1,Title,Title,short string (1 word),short string (1 word),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
2,Author,Author,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
3,Publication,Publication,medium string (5 words to 10 words),short string (1 word to 5 words),Not Applicable: Types do not match
4,Format,Format,short string (1 word to 5 words),short string (1 word to 5 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
5,ISBN,ISBN,numeric,numeric,Exact Match; Absolute Norm
6,Series,Series,medium string (5 words to 10 words),medium string (5 words to 10 words),"Jaccard Similarity [3-grams, 3-grams]; Cosine Similarity [Space Delimiter, Space Delimiter]"
7,Physical Details,Physical Details,short string (1 word),short string (1 word to 5 words),Not Applicable: Types do not match


Do you want to proceed? (y/n):y


# Generating features
In the feature engineering process, we first exclude the features that are automatically generated from ID and ISBN, and add Jaccard scores on Title, Author, Publication and Series, using space as the delimiter. We feed 19 features into Decision Tree, Random Forest, SVM, Naive Bayes, Logistic Regression and Linear Regression matchers with 5-fold cross-validation.
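For reference, a Jaccard score over space-delimited tokens can be sketched in a few lines. The function name is hypothetical and the handling of two empty strings is an assumption of this sketch. The library's `jaccard` and `dlm_dc0` may treat empty or missing values differently.

```python
def jaccard_ws(left: str, right: str) -> float:
    """Jaccard similarity of the whitespace-delimited token sets of two strings."""
    left_tokens = set(left.split())
    right_tokens = set(right.split())
    union = left_tokens | right_tokens
    if not union:
        # Assumption: two empty strings score 0.0 in this sketch.
        return 0.0
    return len(left_tokens & right_tokens) / len(union)


# Hypothetical titles: 2 shared tokens out of 5 distinct tokens.
print(jaccard_ws("data science for business", "data science handbook"))  # prints: 0.4
```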

In [27]:
match_t = em.get_tokenizers_for_matching()
match_s = em.get_sim_funs_for_matching()
f1 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Title"]), dlm_dc0(rtuple["Title"]))', match_t, match_s)
f2 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Author"]), dlm_dc0(rtuple["Author"]))', match_t, match_s)
f3 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Publication"]), dlm_dc0(rtuple["Publication"]))', match_t, match_s)
f4 = em.get_feature_fn('jaccard(dlm_dc0(ltuple["Series"]), dlm_dc0(rtuple["Series"]))', match_t, match_s)
em.add_feature(match_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', f1)
em.add_feature(match_f, 'Author_Author_jac_dlm_dc0_dlm_dc0', f2)
em.add_feature(match_f, 'Publication_Publication_jac_dlm_dc0_dlm_dc0', f3)
em.add_feature(match_f, 'Series_Series_jac_dlm_dc0_dlm_dc0', f4)

True

# Debugging
After the first round of debugging, we found that the learning methods had difficulty distinguishing books/journals that differ in version, edition, or conference. So we introduced a new feature that captures this information embedded in book titles: we extract the Roman numerals from each title and use 1/0 to indicate whether they are the same for a tuple pair.

In [28]:
# Add blackbox feature

import re

# Blackbox feature: compare the Roman numerals found in the two titles
def Title_Title_blackbox_1(x, y):

    # get the Title attribute of each tuple
    x_title = x['Title']
    y_title = y['Title']
    # raw string so that \s is not treated as a string escape
    regex_roman = r'\s+[MDCLXVI]+\s+'
    # first run of Roman-numeral letters surrounded by whitespace, if any
    x_found = re.search(regex_roman, x_title)
    y_found = re.search(regex_roman, y_title)

    # a title without such a numeral counts as "not the same"
    if x_found is None or y_found is None:
        return False
    return x_found.group(0) == y_found.group(0)

em.add_blackbox_feature(match_f, 'blackbox_1', Title_Title_blackbox_1)

True
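A small standalone check shows how this feature behaves. The function is re-declared under a hypothetical name so the snippet runs on its own, and the titles are made up for illustration, not taken from the tables.

```python
import re

# Same pattern as the blackbox feature: Roman-numeral letters between whitespace.
ROMAN = re.compile(r'\s+[MDCLXVI]+\s+')


def same_roman_numeral(left_title: str, right_title: str) -> bool:
    """Return True only if both titles contain a numeral and the matches are equal."""
    left = ROMAN.search(left_title)
    right = ROMAN.search(right_title)
    if left is None or right is None:
        return False
    return left.group(0) == right.group(0)


print(same_roman_numeral("Data Mining II : proceedings", "Data Mining II : second conference"))  # True
print(same_roman_numeral("Data Mining II : proceedings", "Data Mining III : proceedings"))       # False
print(same_roman_numeral("Data Mining II", "Data Mining II"))                                    # False
```

The last call returns False because the pattern requires whitespace after the numeral, so a numeral at the very end of a title is not detected. A standalone capital letter such as the pronoun "I" would also be picked up as a numeral.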

Here we delete features that are related to ID and ISBN.

In [29]:
match_f = match_f[(match_f['left_attribute'] != 'ID') & (match_f['left_attribute'] != 'ISBN')]

Extract feature vectors from set I.

In [30]:
H = em.extract_feature_vecs(I, feature_table=match_f, attrs_after=['label'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [32]:
dt = em.DTMatcher(name='DecisionTree', random_state = 0, max_depth = 5)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher('NaiveBayes')

result = em.select_matcher(matchers=[dt, rf, svm, lg, ln], 
                           table=H, 
                           exclude_attrs=['_id', 'ltable_ID', 'rtable_ID'], 
                           target_attr='label', 
                           k=5,
                           metric_to_select_matcher='precision'
                           )
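The selection above relies on two ideas: splitting the training vectors into k folds, and scoring each fold by precision. The sketch below shows both in plain Python under simplifying assumptions (contiguous, unshuffled folds; labels encoded as 0/1; precision of 0.0 when nothing is predicted positive). The helper names are hypothetical and this is not how `em.select_matcher` is implemented.

```python
def precision(y_true, y_pred):
    """Precision = true positives / all predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Assumption: report 0.0 when the matcher predicts no positives at all.
    return tp / (tp + fp) if (tp + fp) else 0.0


def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # The first n_rows % k folds get one extra row so every row is used once.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


# Hypothetical labels and predictions: 2 true positives, 1 false positive.
print(precision([1, 0, 1, 1], [1, 1, 0, 1]))
# Fold sizes for 11 rows and k=5.
print([len(test) for _, test in k_fold_indices(11, 5)])  # prints: [3, 2, 2, 2, 2]
```

Selecting on precision favors matchers that make few false-positive matches, at the possible cost of missing some true matches.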

In [44]:
result = em.select_matcher(matchers=[nb], 
                           table=H, 
                           exclude_attrs=['_id', 'ltable_ID', 'rtable_ID'], 
                           target_attr='label', 
                           k=5,
                           metric_to_select_matcher='precision'
                           )

JoblibTypeError: JoblibTypeError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/runpy.py in _run_module_as_main(mod_name='ipykernel_launcher', alter_argv=1)
    188         sys.exit(msg)
    189     main_globals = sys.modules["__main__"].__dict__
    190     if alter_argv:
    191         sys.argv[0] = mod_spec.origin
    192     return _run_code(code, main_globals, None,
--> 193                      "__main__", mod_spec)
        mod_spec = ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/site-packages/ipykernel_launcher.py')
    194 
    195 def run_module(mod_name, init_globals=None,
    196                run_name=None, alter_sys=False):
    197     """Execute a module's code without importing it

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/runpy.py in _run_code(code=<code object <module> at 0x7f5f0f863930, file "/...3.6/site-packages/ipykernel_launcher.py", line 5>, run_globals={'__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__cached__': '/home/xiuyuan/anaconda3/envs/fonduer/lib/python3...ges/__pycache__/ipykernel_launcher.cpython-36.pyc', '__doc__': 'Entry point for launching an IPython kernel.\n\nTh...orts until\nafter removing the cwd from sys.path.\n', '__file__': '/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel_launcher.py', '__loader__': <_frozen_importlib_external.SourceFileLoader object>, '__name__': '__main__', '__package__': '', '__spec__': ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/site-packages/ipykernel_launcher.py'), 'app': <module 'ipykernel.kernelapp' from '/home/xiuyua.../python3.6/site-packages/ipykernel/kernelapp.py'>, ...}, init_globals=None, mod_name='__main__', mod_spec=ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/site-packages/ipykernel_launcher.py'), pkg_name='', script_name=None)
     80                        __cached__ = cached,
     81                        __doc__ = None,
     82                        __loader__ = loader,
     83                        __package__ = pkg_name,
     84                        __spec__ = mod_spec)
---> 85     exec(code, run_globals)
        code = <code object <module> at 0x7f5f0f863930, file "/...3.6/site-packages/ipykernel_launcher.py", line 5>
        run_globals = {'__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__cached__': '/home/xiuyuan/anaconda3/envs/fonduer/lib/python3...ges/__pycache__/ipykernel_launcher.cpython-36.pyc', '__doc__': 'Entry point for launching an IPython kernel.\n\nTh...orts until\nafter removing the cwd from sys.path.\n', '__file__': '/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel_launcher.py', '__loader__': <_frozen_importlib_external.SourceFileLoader object>, '__name__': '__main__', '__package__': '', '__spec__': ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/site-packages/ipykernel_launcher.py'), 'app': <module 'ipykernel.kernelapp' from '/home/xiuyua.../python3.6/site-packages/ipykernel/kernelapp.py'>, ...}
     86     return run_globals
     87 
     88 def _run_module_code(code, init_globals=None,
     89                     mod_name=None, mod_spec=None,

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel_launcher.py in <module>()
     11     # This is added back by InteractiveShellApp.init_path()
     12     if sys.path[0] == '':
     13         del sys.path[0]
     14 
     15     from ipykernel import kernelapp as app
---> 16     app.launch_new_instance()

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/traitlets/config/application.py in launch_instance(cls=<class 'ipykernel.kernelapp.IPKernelApp'>, argv=None, **kwargs={})
    653 
    654         If a global instance already exists, this reinitializes and starts it
    655         """
    656         app = cls.instance(**kwargs)
    657         app.initialize(argv)
--> 658         app.start()
        app.start = <bound method IPKernelApp.start of <ipykernel.kernelapp.IPKernelApp object>>
    659 
    660 #-----------------------------------------------------------------------------
    661 # utility functions, for convenience
    662 #-----------------------------------------------------------------------------

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel/kernelapp.py in start(self=<ipykernel.kernelapp.IPKernelApp object>)
    481         if self.poller is not None:
    482             self.poller.start()
    483         self.kernel.start()
    484         self.io_loop = ioloop.IOLoop.current()
    485         try:
--> 486             self.io_loop.start()
        self.io_loop.start = <bound method BaseAsyncIOLoop.start of <tornado.platform.asyncio.AsyncIOMainLoop object>>
    487         except KeyboardInterrupt:
    488             pass
    489 
    490 launch_new_instance = IPKernelApp.launch_instance

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/tornado/platform/asyncio.py in start(self=<tornado.platform.asyncio.AsyncIOMainLoop object>)
    107         except (RuntimeError, AssertionError):
    108             old_loop = None
    109         try:
    110             self._setup_logging()
    111             asyncio.set_event_loop(self.asyncio_loop)
--> 112             self.asyncio_loop.run_forever()
        self.asyncio_loop.run_forever = <bound method BaseEventLoop.run_forever of <_Uni...EventLoop running=True closed=False debug=False>>
    113         finally:
    114             asyncio.set_event_loop(old_loop)
    115 
    116     def stop(self):

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/asyncio/base_events.py in run_forever(self=<_UnixSelectorEventLoop running=True closed=False debug=False>)
    416             sys.set_asyncgen_hooks(firstiter=self._asyncgen_firstiter_hook,
    417                                    finalizer=self._asyncgen_finalizer_hook)
    418         try:
    419             events._set_running_loop(self)
    420             while True:
--> 421                 self._run_once()
        self._run_once = <bound method BaseEventLoop._run_once of <_UnixS...EventLoop running=True closed=False debug=False>>
    422                 if self._stopping:
    423                     break
    424         finally:
    425             self._stopping = False

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/asyncio/base_events.py in _run_once(self=<_UnixSelectorEventLoop running=True closed=False debug=False>)
   1426                         logger.warning('Executing %s took %.3f seconds',
   1427                                        _format_handle(handle), dt)
   1428                 finally:
   1429                     self._current_handle = None
   1430             else:
-> 1431                 handle._run()
        handle._run = <bound method Handle._run of <Handle BaseAsyncIOLoop._handle_events(14, 1)>>
   1432         handle = None  # Needed to break cycles when an exception occurs.
   1433 
   1434     def _set_coroutine_wrapper(self, enabled):
   1435         try:

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/asyncio/events.py in _run(self=<Handle BaseAsyncIOLoop._handle_events(14, 1)>)
    140             self._callback = None
    141             self._args = None
    142 
    143     def _run(self):
    144         try:
--> 145             self._callback(*self._args)
        self._callback = <bound method BaseAsyncIOLoop._handle_events of <tornado.platform.asyncio.AsyncIOMainLoop object>>
        self._args = (14, 1)
    146         except Exception as exc:
    147             cb = _format_callback_source(self._callback, self._args)
    148             msg = 'Exception in callback {}'.format(cb)
    149             context = {

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/tornado/platform/asyncio.py in _handle_events(self=<tornado.platform.asyncio.AsyncIOMainLoop object>, fd=14, events=1)
     97             self.writers.remove(fd)
     98         del self.handlers[fd]
     99 
    100     def _handle_events(self, fd, events):
    101         fileobj, handler_func = self.handlers[fd]
--> 102         handler_func(fileobj, events)
        handler_func = <function wrap.<locals>.null_wrapper>
        fileobj = <zmq.sugar.socket.Socket object>
        events = 1
    103 
    104     def start(self):
    105         try:
    106             old_loop = asyncio.get_event_loop()

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/tornado/stack_context.py in null_wrapper(*args=(<zmq.sugar.socket.Socket object>, 1), **kwargs={})
    271         # Fast path when there are no active contexts.
    272         def null_wrapper(*args, **kwargs):
    273             try:
    274                 current_state = _state.contexts
    275                 _state.contexts = cap_contexts[0]
--> 276                 return fn(*args, **kwargs)
        args = (<zmq.sugar.socket.Socket object>, 1)
        kwargs = {}
    277             finally:
    278                 _state.contexts = current_state
    279         null_wrapper._wrapped = True
    280         return null_wrapper

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py in _handle_events(self=<zmq.eventloop.zmqstream.ZMQStream object>, fd=<zmq.sugar.socket.Socket object>, events=1)
    445             return
    446         zmq_events = self.socket.EVENTS
    447         try:
    448             # dispatch events:
    449             if zmq_events & zmq.POLLIN and self.receiving():
--> 450                 self._handle_recv()
        self._handle_recv = <bound method ZMQStream._handle_recv of <zmq.eventloop.zmqstream.ZMQStream object>>
    451                 if not self.socket:
    452                     return
    453             if zmq_events & zmq.POLLOUT and self.sending():
    454                 self._handle_send()

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py in _handle_recv(self=<zmq.eventloop.zmqstream.ZMQStream object>)
    475             else:
    476                 raise
    477         else:
    478             if self._recv_callback:
    479                 callback = self._recv_callback
--> 480                 self._run_callback(callback, msg)
        self._run_callback = <bound method ZMQStream._run_callback of <zmq.eventloop.zmqstream.ZMQStream object>>
        callback = <function wrap.<locals>.null_wrapper>
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    481         
    482 
    483     def _handle_send(self):
    484         """Handle a send event."""

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py in _run_callback(self=<zmq.eventloop.zmqstream.ZMQStream object>, callback=<function wrap.<locals>.null_wrapper>, *args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    427         close our socket."""
    428         try:
    429             # Use a NullContext to ensure that all StackContexts are run
    430             # inside our blanket exception handler rather than outside.
    431             with stack_context.NullContext():
--> 432                 callback(*args, **kwargs)
        callback = <function wrap.<locals>.null_wrapper>
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    433         except:
    434             gen_log.error("Uncaught exception in ZMQStream callback",
    435                           exc_info=True)
    436             # Re-raise the exception so that IOLoop.handle_callback_exception

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/tornado/stack_context.py in null_wrapper(*args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    271         # Fast path when there are no active contexts.
    272         def null_wrapper(*args, **kwargs):
    273             try:
    274                 current_state = _state.contexts
    275                 _state.contexts = cap_contexts[0]
--> 276                 return fn(*args, **kwargs)
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    277             finally:
    278                 _state.contexts = current_state
    279         null_wrapper._wrapped = True
    280         return null_wrapper

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel/kernelbase.py in dispatcher(msg=[<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>])
    278         if self.control_stream:
    279             self.control_stream.on_recv(self.dispatch_control, copy=False)
    280 
    281         def make_dispatcher(stream):
    282             def dispatcher(msg):
--> 283                 return self.dispatch_shell(stream, msg)
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    284             return dispatcher
    285 
    286         for s in self.shell_streams:
    287             s.on_recv(make_dispatcher(s), copy=False)

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel/kernelbase.py in dispatch_shell(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, msg={'buffers': [], 'content': {'allow_stdin': True, 'code': "result = em.select_matcher(matchers=[nb], \n     ..._matcher='precision'\n                           )", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 4, 18, 21, 3, 32, 172935, tzinfo=tzutc()), 'msg_id': 'b2954cab15ff49bf844a0e2c2bc60acf', 'msg_type': 'execute_request', 'session': '7c38e0acddad484fac587da61d5f0bba', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': 'b2954cab15ff49bf844a0e2c2bc60acf', 'msg_type': 'execute_request', 'parent_header': {}})
    228             self.log.warn("Unknown message type: %r", msg_type)
    229         else:
    230             self.log.debug("%s: %s", msg_type, msg)
    231             self.pre_handler_hook()
    232             try:
--> 233                 handler(stream, idents, msg)
        handler = <bound method Kernel.execute_request of <ipykernel.ipkernel.IPythonKernel object>>
        stream = <zmq.eventloop.zmqstream.ZMQStream object>
        idents = [b'7c38e0acddad484fac587da61d5f0bba']
        msg = {'buffers': [], 'content': {'allow_stdin': True, 'code': "result = em.select_matcher(matchers=[nb], \n     ..._matcher='precision'\n                           )", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 4, 18, 21, 3, 32, 172935, tzinfo=tzutc()), 'msg_id': 'b2954cab15ff49bf844a0e2c2bc60acf', 'msg_type': 'execute_request', 'session': '7c38e0acddad484fac587da61d5f0bba', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': 'b2954cab15ff49bf844a0e2c2bc60acf', 'msg_type': 'execute_request', 'parent_header': {}}
    234             except Exception:
    235                 self.log.error("Exception in message handler:", exc_info=True)
    236             finally:
    237                 self.post_handler_hook()

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel/kernelbase.py in execute_request(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, ident=[b'7c38e0acddad484fac587da61d5f0bba'], parent={'buffers': [], 'content': {'allow_stdin': True, 'code': "result = em.select_matcher(matchers=[nb], \n     ..._matcher='precision'\n                           )", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 4, 18, 21, 3, 32, 172935, tzinfo=tzutc()), 'msg_id': 'b2954cab15ff49bf844a0e2c2bc60acf', 'msg_type': 'execute_request', 'session': '7c38e0acddad484fac587da61d5f0bba', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': 'b2954cab15ff49bf844a0e2c2bc60acf', 'msg_type': 'execute_request', 'parent_header': {}})
    394         if not silent:
    395             self.execution_count += 1
    396             self._publish_execute_input(code, parent, self.execution_count)
    397 
    398         reply_content = self.do_execute(code, silent, store_history,
--> 399                                         user_expressions, allow_stdin)
        user_expressions = {}
        allow_stdin = True
    400 
    401         # Flush output before sending the reply.
    402         sys.stdout.flush()
    403         sys.stderr.flush()

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel/ipkernel.py in do_execute(self=<ipykernel.ipkernel.IPythonKernel object>, code="result = em.select_matcher(matchers=[nb], \n     ..._matcher='precision'\n                           )", silent=False, store_history=True, user_expressions={}, allow_stdin=True)
    203 
    204         self._forward_input(allow_stdin)
    205 
    206         reply_content = {}
    207         try:
--> 208             res = shell.run_cell(code, store_history=store_history, silent=silent)
        res = undefined
        shell.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = "result = em.select_matcher(matchers=[nb], \n     ..._matcher='precision'\n                           )"
        store_history = True
        silent = False
    209         finally:
    210             self._restore_input()
    211 
    212         if res.error_before_exec is not None:

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/ipykernel/zmqshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, *args=("result = em.select_matcher(matchers=[nb], \n     ..._matcher='precision'\n                           )",), **kwargs={'silent': False, 'store_history': True})
    532             )
    533         self.payload_manager.write_payload(payload)
    534 
    535     def run_cell(self, *args, **kwargs):
    536         self._last_traceback = None
--> 537         return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
        self.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        args = ("result = em.select_matcher(matchers=[nb], \n     ..._matcher='precision'\n                           )",)
        kwargs = {'silent': False, 'store_history': True}
    538 
    539     def _showtraceback(self, etype, evalue, stb):
    540         # try to preserve ordering of tracebacks and print statements
    541         sys.stdout.flush()

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell="result = em.select_matcher(matchers=[nb], \n     ..._matcher='precision'\n                           )", store_history=True, silent=False, shell_futures=True)
   2723                 self.displayhook.exec_result = result
   2724 
   2725                 # Execute the user code
   2726                 interactivity = "none" if silent else self.ast_node_interactivity
   2727                 has_raised = self.run_ast_nodes(code_ast.body, cell_name,
-> 2728                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler object>
   2729                 
   2730                 self.last_execution_succeeded = not has_raised
   2731                 self.last_execution_result = result
   2732 

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.Assign object>], cell_name='<ipython-input-44-2335990c12b9>', interactivity='none', compiler=<IPython.core.compilerop.CachingCompiler object>, result=<ExecutionResult object at 7f5eaaca4c88, executi..._before_exec=None error_in_exec=None result=None>)
   2845 
   2846         try:
   2847             for i, node in enumerate(to_run_exec):
   2848                 mod = ast.Module([node])
   2849                 code = compiler(mod, cell_name, "exec")
-> 2850                 if self.run_code(code, result):
        self.run_code = <bound method InteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 0x7f5eaa7c6660, file "<ipython-input-44-2335990c12b9>", line 1>
        result = <ExecutionResult object at 7f5eaaca4c88, executi..._before_exec=None error_in_exec=None result=None>
   2851                     return True
   2852 
   2853             for i, node in enumerate(to_run_interactive):
   2854                 mod = ast.Interactive([node])

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 0x7f5eaa7c6660, file "<ipython-input-44-2335990c12b9>", line 1>, result=<ExecutionResult object at 7f5eaaca4c88, executi..._before_exec=None error_in_exec=None result=None>)
   2905         outflag = True  # happens in more places, so it's easier as default
   2906         try:
   2907             try:
   2908                 self.hooks.pre_run_code_hook()
   2909                 #rprint('Running code', repr(code_obj)) # dbg
-> 2910                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 0x7f5eaa7c6660, file "<ipython-input-44-2335990c12b9>", line 1>
        self.user_global_ns = {'A':          ID  \
4096  a4866   
4098  a4868   
205...
4095            "nan"  

[1000 rows x 8 columns], 'B':          ID  \
3113  b4425   
1672  b2064   
345...resource (ix, 208 p.)"  

[1000 rows x 8 columns], 'C':        _id ltable_ID rtable_ID  \
0        0    ...urce (xv, 333 pages)"  

[1307 rows x 17 columns], 'D':        _id ltable_ID rtable_ID  \
1        1    ...urce (xxi, 538 pages)"  

[606 rows x 17 columns], 'H':       _id ltable_ID rtable_ID  Title_Title_jac_q...810       False      1  

[198 rows x 24 columns], 'H_test':       _id ltable_ID rtable_ID  Title_Title_jac_q...   0                 0  

[102 rows x 25 columns], 'I':       _id ltable_ID rtable_ID  \
126   548     a...xi, 807 pages)"      1  

[198 rows x 18 columns], 'IJ': OrderedDict([('train',       _id ltable_ID rtabl...155 pages)."      0  

[102 rows x 18 columns])]), 'In': ['', 'import pandas as pd\nimport py_entitymatching as em', "label_S = pd.read_csv('./data_with_label.csv')\n#...m.set_property(label_S, 'ltable', label_S_ltable)", "label_S = pd.read_csv('./data_with_label.csv')\n#...m.set_property(label_S, 'ltable', label_S_ltable)", "IJ = em.split_train_test(label_S, train_proporti...6, random_state=0)\nI = IJ['train']\nJ = IJ['test']", "I.to_csv('Set_I.csv', sep = ',')", "J.to_csv('Set_J.csv', sep = ',')", 'import pandas as pd\nimport py_entitymatching as em', "table_A = em.read_csv_metadata('../data/TableA.c...ad_csv_metadata('../data/TableB.csv', key = 'ID')", 'table_A.shape', 'table_B.shape', 'A, B = em.down_sample(table_A, table_B, size=1000, y_param = 1, show_progress=False)', 'A.shape', "block_f = em.get_features_for_blocking(A, B)\nblo...re(block_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', r)", "ob = em.OverlapBlocker()\nC = ob.block_tables(A, ...Details'], \n                    overlap_size = 2)", "D = ob.block_candset(C, 'Title', 'Title', overlap_size = 4)", 'len(D)', 'len(C)', 'len(D)', 'match_f = em.get_features_for_matching(A, B)', ...], 'J':       _id ltable_ID rtable_ID  \
208   867     a...i, 155 pages)."      0  

[102 rows x 18 columns], ...}
        self.user_ns = {'A':          ID  \
4096  a4866   
4098  a4868   
205...
4095            "nan"  

[1000 rows x 8 columns], 'B':          ID  \
3113  b4425   
1672  b2064   
345...resource (ix, 208 p.)"  

[1000 rows x 8 columns], 'C':        _id ltable_ID rtable_ID  \
0        0    ...urce (xv, 333 pages)"  

[1307 rows x 17 columns], 'D':        _id ltable_ID rtable_ID  \
1        1    ...urce (xxi, 538 pages)"  

[606 rows x 17 columns], 'H':       _id ltable_ID rtable_ID  Title_Title_jac_q...810       False      1  

[198 rows x 24 columns], 'H_test':       _id ltable_ID rtable_ID  Title_Title_jac_q...   0                 0  

[102 rows x 25 columns], 'I':       _id ltable_ID rtable_ID  \
126   548     a...xi, 807 pages)"      1  

[198 rows x 18 columns], 'IJ': OrderedDict([('train',       _id ltable_ID rtabl...155 pages)."      0  

[102 rows x 18 columns])]), 'In': ['', 'import pandas as pd\nimport py_entitymatching as em', "label_S = pd.read_csv('./data_with_label.csv')\n#...m.set_property(label_S, 'ltable', label_S_ltable)", "label_S = pd.read_csv('./data_with_label.csv')\n#...m.set_property(label_S, 'ltable', label_S_ltable)", "IJ = em.split_train_test(label_S, train_proporti...6, random_state=0)\nI = IJ['train']\nJ = IJ['test']", "I.to_csv('Set_I.csv', sep = ',')", "J.to_csv('Set_J.csv', sep = ',')", 'import pandas as pd\nimport py_entitymatching as em', "table_A = em.read_csv_metadata('../data/TableA.c...ad_csv_metadata('../data/TableB.csv', key = 'ID')", 'table_A.shape', 'table_B.shape', 'A, B = em.down_sample(table_A, table_B, size=1000, y_param = 1, show_progress=False)', 'A.shape', "block_f = em.get_features_for_blocking(A, B)\nblo...re(block_f, 'Title_Title_jac_dlm_dc0_dlm_dc0', r)", "ob = em.OverlapBlocker()\nC = ob.block_tables(A, ...Details'], \n                    overlap_size = 2)", "D = ob.block_candset(C, 'Title', 'Title', overlap_size = 4)", 'len(D)', 'len(C)', 'len(D)', 'match_f = em.get_features_for_matching(A, B)', ...], 'J':       _id ltable_ID rtable_ID  \
208   867     a...i, 155 pages)."      0  

[102 rows x 18 columns], ...}
   2911             finally:
   2912                 # Reset our crash handler in place
   2913                 sys.excepthook = old_excepthook
   2914         except SystemExit as e:

...........................................................................
/home/xiuyuan/private/838/CS839ClassProject/stage3/code/<ipython-input-44-2335990c12b9> in <module>()
      1 result = em.select_matcher(matchers=[nb], 
      2                            table=H, 
      3                            exclude_attrs=['_id', 'ltable_ID', 'rtable_ID'], 
      4                            target_attr='label', 
      5                            k=5,
----> 6                            metric_to_select_matcher='precision'
      7                            )

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/py_entitymatching/matcherselector/mlmatcherselection.py in select_matcher(matchers=[<py_entitymatching.matcher.nbmatcher.NBMatcher object>], x=array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object), y=array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]), table=      _id ltable_ID rtable_ID  Title_Title_jac_q...810       False      1  

[198 rows x 24 columns], exclude_attrs=['_id', 'ltable_ID', 'rtable_ID'], target_attr='label', metric_to_select_matcher='precision', metrics_to_display=['precision', 'recall', 'f1'], k=5, n_jobs=-1, random_state=None)
    119         mean_score_list = []
    120         # Run the cross validation for each matcher
    121         for m in matchers:
    122             # Use scikit learn's cross validation to get the matcher and the list
    123             #  of scores (one for each fold).
--> 124             matcher, scores = cross_validation(m, x, y, met, k, random_state, n_jobs)
        matcher = undefined
        scores = undefined
        m = <py_entitymatching.matcher.nbmatcher.NBMatcher object>
        x = array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object)
        y = array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
        met = 'precision'
        k = 5
        random_state = None
        n_jobs = -1
    125             # Fill a dictionary based on the matcher and the scores.
    126             val_list = [matcher.get_name(), matcher, k]
    127             val_list.extend(scores)
    128             val_list.append(pd.np.mean(scores))

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/py_entitymatching/matcherselector/mlmatcherselection.py in cross_validation(matcher=<py_entitymatching.matcher.nbmatcher.NBMatcher object>, x=array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object), y=array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]), metric='precision', k=5, random_state=None, n_jobs=-1)
    157     # Use KFold function from scikit learn to create a ms object that can be
    158     # used for cross_val_score function.
    159     cv = KFold(k, shuffle=True, random_state=random_state)
    160     # Call the scikit-learn's cross_val_score function
    161     scores = cross_val_score(matcher.clf, x, y, scoring=metric, cv=cv,
--> 162                              n_jobs=n_jobs)
        n_jobs = -1
    163     # Finally, return the matcher along with the scores.
    164     return matcher, scores
    165 
    166 

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator=GaussianNB(priors='NaiveBayes'), X=array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object), y=array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]), groups=None, scoring='precision', cv=KFold(n_splits=5, random_state=None, shuffle=True), n_jobs=-1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')
    337     cv_results = cross_validate(estimator=estimator, X=X, y=y, groups=groups,
    338                                 scoring={'score': scorer}, cv=cv,
    339                                 return_train_score=False,
    340                                 n_jobs=n_jobs, verbose=verbose,
    341                                 fit_params=fit_params,
--> 342                                 pre_dispatch=pre_dispatch)
        pre_dispatch = '2*n_jobs'
    343     return cv_results['test_score']
    344 
    345 
    346 def _fit_and_score(estimator, X, y, scorer, train, test, verbose,

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator=GaussianNB(priors='NaiveBayes'), X=array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object), y=array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]), groups=None, scoring={'score': make_scorer(precision_score)}, cv=KFold(n_splits=5, random_state=None, shuffle=True), n_jobs=-1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score=False)
    201     scores = parallel(
    202         delayed(_fit_and_score)(
    203             clone(estimator), X, y, scorers, train, test, verbose, None,
    204             fit_params, return_train_score=return_train_score,
    205             return_times=True)
--> 206         for train, test in cv.split(X, y, groups))
        cv.split = <bound method _BaseKFold.split of KFold(n_splits=5, random_state=None, shuffle=True)>
        X = array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object)
        y = array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
        groups = None
    207 
    208     if return_train_score:
    209         train_scores, test_scores, fit_times, score_times = zip(*scores)
    210         train_scores = _aggregate_score_dicts(train_scores)

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=-1), iterable=<generator object cross_validate.<locals>.<genexpr>>)
    784             if pre_dispatch == "all" or n_jobs == 1:
    785                 # The iterable was consumed all at once by the above for loop.
    786                 # No need to wait for async callbacks to trigger to
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=-1)>
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time
    792             self._print('Done %3i out of %3i | elapsed: %s finished',
    793                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
TypeError                                          Wed Apr 18 16:03:32 2018
PID: 5707     Python 3.6.4: /home/xiuyuan/anaconda3/envs/fonduer/bin/python
...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _fit_and_score>, (GaussianNB(priors='NaiveBayes'), array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object), array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]), {'score': make_scorer(precision_score)}, array([  0,   1,   2,   3,   4,   5,   8,   9,  ..., 188, 189, 190, 191, 192, 194,
       196, 197]), array([  6,   7,  15,  17,  18,  20,  22,  23,  ..., 170, 171, 175, 176, 183, 184, 193,
       195]), 0, None, None), {'return_times': True, 'return_train_score': False})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_and_score>
        args = (GaussianNB(priors='NaiveBayes'), array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object), array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]), {'score': make_scorer(precision_score)}, array([  0,   1,   2,   3,   4,   5,   8,   9,  ..., 188, 189, 190, 191, 192, 194,
       196, 197]), array([  6,   7,  15,  17,  18,  20,  22,  23,  ..., 170, 171, 175, 176, 183, 184, 193,
       195]), 0, None, None)
        kwargs = {'return_times': True, 'return_train_score': False}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator=GaussianNB(priors='NaiveBayes'), X=array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object), y=array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,... 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]), scorer={'score': make_scorer(precision_score)}, train=array([  0,   1,   2,   3,   4,   5,   8,   9,  ..., 188, 189, 190, 191, 192, 194,
       196, 197]), test=array([  6,   7,  15,  17,  18,  20,  22,  23,  ..., 170, 171, 175, 176, 183, 184, 193,
       195]), verbose=0, parameters=None, fit_params={}, return_train_score=False, return_parameters=False, return_n_test_samples=False, return_times=True, error_score='raise')
    453 
    454     try:
    455         if y_train is None:
    456             estimator.fit(X_train, **fit_params)
    457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
        estimator.fit = <bound method GaussianNB.fit of GaussianNB(priors='NaiveBayes')>
        X_train = array([[0.17437722419928825, 0.17541160386140586...33333, 0.5238095238095238, False]], dtype=object)
        y_train = array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,...0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1])
        fit_params = {}
    459 
    460     except Exception as e:
    461         # Note fit time as time until error
    462         fit_time = time.time() - start_time

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/sklearn/naive_bayes.py in fit(self=GaussianNB(priors='NaiveBayes'), X=array([[0.17437722, 0.1754116 , 0.06349206, ...,....., 0.33333333, 0.52380952,
        0.        ]]), y=array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,...0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1]), sample_weight=None)
    180         self : object
    181             Returns self.
    182         """
    183         X, y = check_X_y(X, y)
    184         return self._partial_fit(X, y, np.unique(y), _refit=True,
--> 185                                  sample_weight=sample_weight)
        sample_weight = None
    186 
    187     @staticmethod
    188     def _update_mean_variance(n_past, mu, var, X, sample_weight=None):
    189         """Compute online update of Gaussian mean and variance.

...........................................................................
/home/xiuyuan/anaconda3/envs/fonduer/lib/python3.6/site-packages/sklearn/naive_bayes.py in _partial_fit(self=GaussianNB(priors='NaiveBayes'), X=array([[0.17437722, 0.1754116 , 0.06349206, ...,....., 0.33333333, 0.52380952,
        0.        ]]), y=array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,...0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1]), classes=array([0, 1]), _refit=True, sample_weight=None)
    361             n_classes = len(self.classes_)
    362             # Take into account the priors
    363             if self.priors is not None:
    364                 priors = np.asarray(self.priors)
    365                 # Check that the provide prior match the number of classes
--> 366                 if len(priors) != n_classes:
        priors = array('NaiveBayes', dtype='<U10')
        n_classes = 2
    367                     raise ValueError('Number of priors must match number of'
    368                                      ' classes.')
    369                 # Check that the sum is 1
    370                 if priors.sum() != 1.0:

TypeError: len() of unsized object
___________________________________________________________________________
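
The last frames of the traceback point at the cause: `GaussianNB` was built with `priors='NaiveBayes'`, so scikit-learn turns the string into a zero-dimensional array and `len()` fails on it. One plausible explanation, which is an assumption because the cell that creates `nb` is not shown here, is that the matcher's name was passed positionally rather than as `name=...`. The sketch below reproduces only the NumPy behavior seen at line 366 of `naive_bayes.py`; `valid_priors` is an illustrative value, not one taken from this notebook.

```python
import numpy as np

# Mirror `priors = np.asarray(self.priors)` from the traceback.
priors = np.asarray('NaiveBayes')   # a plain string becomes a 0-dimensional array
print(priors.ndim)

try:
    len(priors)                     # same check as `len(priors) != n_classes`
except TypeError as exc:
    message = str(exc)
    print(message)

# Illustrative only: priors are expected to be one probability per class.
valid_priors = np.asarray([0.5, 0.5])
print(len(valid_priors))
```

If the assumption holds, creating the matcher with the name as a keyword argument would keep the string out of `priors`.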

In [53]:
pd.set_option('display.max_columns', 1000)

In [54]:
H.head(198)

Unnamed: 0,_id,ltable_ID,rtable_ID,Title_Title_jac_qgm_3_qgm_3,Title_Title_cos_dlm_dc0_dlm_dc0,Format_Format_jac_qgm_3_qgm_3,Format_Format_cos_dlm_dc0_dlm_dc0,Format_Format_jac_dlm_dc0_dlm_dc0,Format_Format_mel,Format_Format_lev_dist,Format_Format_lev_sim,Format_Format_nmw,Format_Format_sw,Series_Series_jac_qgm_3_qgm_3,Series_Series_cos_dlm_dc0_dlm_dc0,Series_Series_mel,Series_Series_lev_dist,Series_Series_lev_sim,Title_Title_jac_dlm_dc0_dlm_dc0,Author_Author_jac_dlm_dc0_dlm_dc0,Publication_Publication_jac_dlm_dc0_dlm_dc0,Series_Series_jac_dlm_dc0_dlm_dc0,blackbox_1,label
126,548,a1137,b1719,0.174377,0.175412,0.063492,0.0,0.0,0.641071,50,0.107143,-43.0,4.0,0.349693,0.533333,0.873977,53.0,0.592308,0.095238,0.03125,0.125,0.363636,False,0
171,726,a5003,b5121,0.792593,0.707107,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.719298,0.801784,0.95293,11.0,0.788462,0.545455,0.0,0.0,0.666667,False,1
71,295,a4188,b864,0.955556,0.927105,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.92,0.9,0.994521,1.0,0.986301,0.863636,0.0,0.142857,0.818182,False,1
227,920,a3709,b695,0.64486,0.602464,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.603175,0.801784,0.929195,15.0,0.722222,0.411765,0.0,0.0,0.666667,False,0
245,1020,a804,b5268,0.948454,0.889499,0.142857,0.0,0.0,0.605714,5,0.285714,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.8,0.0,0.222222,1.0,False,1
205,854,a1638,b527,0.766667,0.759072,0.063492,0.0,0.0,0.641071,50,0.107143,-43.0,4.0,0.253333,0.550019,0.837886,103.0,0.284722,0.611111,0.090909,0.0,0.333333,False,0
288,1252,a5497,b4357,0.561224,0.210819,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.707317,0.703526,0.812414,14.0,0.818182,0.117647,0.0,0.0,0.538462,False,1
215,880,a5257,b1674,0.252427,0.229416,0.063492,0.0,0.0,0.641071,50,0.107143,-43.0,4.0,0.388889,0.47629,0.842939,110.0,0.382022,0.129032,0.05,0.125,0.304348,False,0
154,681,a2792,b567,0.915493,0.843274,0.153846,0.0,0.0,0.517293,13,0.315789,-6.0,4.0,0.857143,0.75,0.989189,1.0,0.972973,0.727273,0.0,0.0,0.6,False,1
159,699,a5626,b6367,0.342857,0.342997,0.063492,0.0,0.0,0.641071,50,0.107143,-43.0,4.0,0.14433,0.251976,0.71082,33.0,0.421053,0.206897,0.142857,0.0,0.142857,False,0


In [33]:
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.879762,0.895701,0.896892
1,RF,0.934762,0.88197,0.850817
2,SVM,0.847135,0.678765,0.736943
3,LogReg,0.843874,0.840896,0.849191
4,LinReg,0.903068,0.953036,0.896838
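
To make the comparison explicit, the averages above can be ranked per metric. This is a small sketch in plain Python with the numbers copied from the `cv_stats` table; the helper `best_matcher` is hypothetical and not part of *py_entitymatching*.

```python
# Cross-validation averages copied from the cv_stats table above.
cv_stats = {
    "DecisionTree": {"precision": 0.879762, "recall": 0.895701, "f1": 0.896892},
    "RF":           {"precision": 0.934762, "recall": 0.88197,  "f1": 0.850817},
    "SVM":          {"precision": 0.847135, "recall": 0.678765, "f1": 0.736943},
    "LogReg":       {"precision": 0.843874, "recall": 0.840896, "f1": 0.849191},
    "LinReg":       {"precision": 0.903068, "recall": 0.953036, "f1": 0.896838},
}


def best_matcher(stats, metric):
    """Return the matcher name with the highest average value of `metric`."""
    return max(stats, key=lambda name: stats[name][metric])


best_by_precision = best_matcher(cv_stats, "precision")
best_by_recall = best_matcher(cv_stats, "recall")
best_by_f1 = best_matcher(cv_stats, "f1")
print(best_by_precision, best_by_recall, best_by_f1)
```

The winner depends on the metric: the random forest leads on precision, linear regression on recall, and the decision tree is marginally ahead of linear regression on F1.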


In [35]:
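# Random forest (RF): train on H, then evaluate on the test set J.

rf.fit(table=H,
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
       target_attr='label')

# Convert the labeled pairs in J into feature vectors, keeping the gold label.
H_test = em.extract_feature_vecs(J, feature_table=match_f, attrs_after=['label'])
# Append predicted labels and match probabilities to the test table.
pred_table = rf.predict(table=H_test,
                        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
                        target_attr='predicted_labels',
                        return_probs=True,
                        probs_attr='proba',
                        append=True)
# Compare the predictions with the gold labels.
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')
eval_summary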

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [39]:
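# Decision tree (DT): train on H, then evaluate on the test set J.

dt.fit(table=H,
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
       target_attr='label')

# Convert the labeled pairs in J into feature vectors, keeping the gold label.
H_test = em.extract_feature_vecs(J, feature_table=match_f, attrs_after=['label'])
# Append predicted labels and match probabilities to the test table.
pred_table = dt.predict(table=H_test,
                        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
                        target_attr='predicted_labels',
                        return_probs=True,
                        probs_attr='proba',
                        append=True)
# Compare the predictions with the gold labels.
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')
eval_summary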

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


OrderedDict([('prec_numerator', 25.0),
             ('prec_denominator', 27.0),
             ('precision', 0.9259259259259259),
             ('recall_numerator', 25.0),
             ('recall_denominator', 31.0),
             ('recall', 0.8064516129032258),
             ('f1', 0.8620689655172414),
             ('pred_pos_num', 27.0),
             ('false_pos_num', 2.0),
             ('false_pos_ls', [('a5146', 'b695'), ('a3709', 'b907')]),
             ('pred_neg_num', 75.0),
             ('false_neg_num', 6.0),
             ('false_neg_ls',
              [('a4779', 'b2500'),
               ('a2488', 'b767'),
               ('a1614', 'b3909'),
               ('a3595', 'b826'),
               ('a2082', 'b2692'),
               ('a5196', 'b5855')])])
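
The derived metrics in this summary follow directly from the three counts it reports. As a quick check, the sketch below recomputes them for the decision tree; the helper name `prf` is hypothetical, and it assumes none of the denominators is zero.

```python
def prf(num_correct, num_predicted, num_actual):
    """Precision, recall and F1 from match counts (denominators assumed nonzero)."""
    precision = num_correct / num_predicted   # prec_numerator / prec_denominator
    recall = num_correct / num_actual         # recall_numerator / recall_denominator
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the decision tree summary: 25 correct, 27 predicted, 31 actual matches.
precision, recall, f1 = prf(25, 27, 31)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The same function applies to the other summaries by substituting their numerators and denominators.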

In [41]:
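# Support vector machine (SVM): train on H, then evaluate on the test set J.

svm.fit(table=H,
        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
        target_attr='label')

# Convert the labeled pairs in J into feature vectors, keeping the gold label.
H_test = em.extract_feature_vecs(J, feature_table=match_f, attrs_after=['label'])
# Append predicted labels only; probabilities are not requested here.
pred_table = svm.predict(table=H_test,
                         exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
                         target_attr='predicted_labels',
                         return_probs=False,
                         probs_attr='proba',
                         append=True)
# Compare the predictions with the gold labels.
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')
eval_summary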

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


OrderedDict([('prec_numerator', 21.0),
             ('prec_denominator', 31.0),
             ('precision', 0.6774193548387096),
             ('recall_numerator', 21.0),
             ('recall_denominator', 31.0),
             ('recall', 0.6774193548387096),
             ('f1', 0.6774193548387096),
             ('pred_pos_num', 31.0),
             ('false_pos_num', 10.0),
             ('false_pos_ls',
              [('a4814', 'b140'),
               ('a5426', 'b3056'),
               ('a3061', 'b99'),
               ('a5426', 'b592'),
               ('a5640', 'b685'),
               ('a188', 'b4616'),
               ('a5426', 'b695'),
               ('a2905', 'b1241'),
               ('a4984', 'b1911'),
               ('a5426', 'b685')]),
             ('pred_neg_num', 71.0),
             ('false_neg_num', 10.0),
             ('false_neg_ls',
              [('a420', 'b5252'),
               ('a4451', 'b928'),
               ('a3594', 'b858'),
               ('a120', 'b4255'),
            

In [42]:
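# Logistic regression (LR): train on H, then evaluate on the test set J.

lg.fit(table=H,
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
       target_attr='label')

# Convert the labeled pairs in J into feature vectors, keeping the gold label.
H_test = em.extract_feature_vecs(J, feature_table=match_f, attrs_after=['label'])
# Append predicted labels only; probabilities are not requested here.
pred_table = lg.predict(table=H_test,
                        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
                        target_attr='predicted_labels',
                        return_probs=False,
                        probs_attr='proba',
                        append=True)
# Compare the predictions with the gold labels.
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')
eval_summary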

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


OrderedDict([('prec_numerator', 27.0),
             ('prec_denominator', 32.0),
             ('precision', 0.84375),
             ('recall_numerator', 27.0),
             ('recall_denominator', 31.0),
             ('recall', 0.8709677419354839),
             ('f1', 0.8571428571428571),
             ('pred_pos_num', 32.0),
             ('false_pos_num', 5.0),
             ('false_pos_ls',
              [('a3709', 'b907'),
               ('a4969', 'b907'),
               ('a5146', 'b695'),
               ('a5426', 'b3056'),
               ('a625', 'b685')]),
             ('pred_neg_num', 70.0),
             ('false_neg_num', 4.0),
             ('false_neg_ls',
              [('a4779', 'b2500'),
               ('a3595', 'b826'),
               ('a1876', 'b862'),
               ('a3020', 'b4720')])])

Extract feature vectors from set J and evaluate the `ln` matcher on them.

In [43]:
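# Linear regression (LN): train on H, then evaluate on the test set J.

ln.fit(table=H,
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
       target_attr='label')

# Convert the labeled pairs in J into feature vectors, keeping the gold label.
H_test = em.extract_feature_vecs(J, feature_table=match_f, attrs_after=['label'])
# Append predicted labels only; probabilities are not requested here.
pred_table = ln.predict(table=H_test,
                        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'label'],
                        target_attr='predicted_labels',
                        return_probs=False,
                        probs_attr='proba',
                        append=True)
# Compare the predictions with the gold labels.
eval_summary = em.eval_matches(pred_table, 'label', 'predicted_labels')
eval_summary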

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


OrderedDict([('prec_numerator', 28.0),
             ('prec_denominator', 31.0),
             ('precision', 0.9032258064516129),
             ('recall_numerator', 28.0),
             ('recall_denominator', 31.0),
             ('recall', 0.9032258064516129),
             ('f1', 0.9032258064516129),
             ('pred_pos_num', 31.0),
             ('false_pos_num', 3.0),
             ('false_pos_ls',
              [('a5146', 'b695'), ('a2493', 'b928'), ('a2714', 'b2276')]),
             ('pred_neg_num', 71.0),
             ('false_neg_num', 3.0),
             ('false_neg_ls',
              [('a4779', 'b2500'), ('a3595', 'b826'), ('a1876', 'b862')])])
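
Several pairs recur in the error lists above, which suggests they are hard cases rather than quirks of one matcher. The sketch below intersects the `false_pos_ls` and `false_neg_ls` lists of the DT, LR and LN summaries, copied verbatim from the outputs. SVM is left out because its false-negative list is cut off in the output, and RF because no summary is shown for it. The helper `shared_by_all` is hypothetical.

```python
# Error lists copied from the eval_summary outputs above (DT, LR, LN only).
false_pos = {
    "DT": {("a5146", "b695"), ("a3709", "b907")},
    "LR": {("a3709", "b907"), ("a4969", "b907"), ("a5146", "b695"),
           ("a5426", "b3056"), ("a625", "b685")},
    "LN": {("a5146", "b695"), ("a2493", "b928"), ("a2714", "b2276")},
}
false_neg = {
    "DT": {("a4779", "b2500"), ("a2488", "b767"), ("a1614", "b3909"),
           ("a3595", "b826"), ("a2082", "b2692"), ("a5196", "b5855")},
    "LR": {("a4779", "b2500"), ("a3595", "b826"), ("a1876", "b862"),
           ("a3020", "b4720")},
    "LN": {("a4779", "b2500"), ("a3595", "b826"), ("a1876", "b862")},
}


def shared_by_all(errors):
    """Pairs that every matcher in `errors` got wrong."""
    return set.intersection(*errors.values())


shared_fp = shared_by_all(false_pos)
shared_fn = shared_by_all(false_neg)
print(sorted(shared_fp))
print(sorted(shared_fn))
```

Pairs that all three matchers misclassify are natural candidates for manual inspection, for example to check the gold label or to look for a missing feature.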