# pyJedAI WorkFlow module

__Abt-Buy dataset__

The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1076 entities from abt.com and 1076 entities from buy.com as well as a gold standard (perfect mapping) with 1076 matching record pairs between the two data sources. The common attributes between the two data sources are: product name, product description and product price.

In [1]:
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
%load_ext autoreload
%autoreload 2
%reload_ext autoreload

Import pyjedai

In [2]:
from pyjedai.utils import (
    text_cleaning_method,
    print_clusters,
    print_blocks,
    print_candidate_pairs
)
from pyjedai.evaluation import Evaluation, write

## Data Reading

pyJedAI only requires the input data to be loaded into a pandas DataFrame, so it can work with any structured or semi-structured data source. Here, the Abt-Buy dataset is provided as .csv files.
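As an illustration of the semi-structured case, the sketch below flattens a few nested, JSON-like records into a DataFrame with `pandas.json_normalize`. The records, the `details` nesting and the column names are made up for this example and are not part of the Abt-Buy files.

```python
import pandas as pd

# Hypothetical semi-structured records (e.g. parsed from a JSON file).
records = [
    {"id": 0, "name": "Sony Turntable - PSLX350H",
     "details": {"description": "Belt Drive System"}},
    {"id": 1, "name": "Sony Switcher - SBV40S"},  # no description available
]

# Flatten nested fields into columns such as "details.description".
df = pd.json_normalize(records)
df = df.rename(columns={"details.description": "description"})

# Replace missing values with empty strings and cast everything to str,
# which mirrors the na_filter=False / astype(str) handling used for the CSVs.
df = df.fillna("").astype(str)

print(df)
```

The resulting `df` has the same shape of input as the CSV-based DataFrames and could be passed as `dataset_1` or `dataset_2` in the same way.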


In [3]:
from pyjedai.datamodel import Data

In [4]:
d1 = pd.read_csv("./data/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str)
d2 = pd.read_csv("./data/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str)
gt = pd.read_csv("./data/D2/gt.csv", sep='|', engine='python')

data = Data(
    dataset_1=d1,
    attributes_1=['id','name','description'],
    id_column_name_1='id',
    dataset_2=d2,
    attributes_2=['id','name','description'],
    id_column_name_2='id',
    ground_truth=gt,
)

data.process()

pyJedAI also offers dataset analysis methods (more are under development).

In [5]:
data.print_specs()

Type of Entity Resolution:  Clean-Clean
Number of entities in D1:  1076
Attributes provided  for D1:  ['id', 'name', 'description']

Number of entities in D2:  1076
Attributes provided  for D2:  ['id', 'name', 'description']

Total number of entities:  2152
Number of matching pairs in ground-truth:  1076


In [6]:
data.dataset_1.head(5)

|   | id | name | description | price |
|---|----|------|-------------|-------|
| 0 | 0 | Sony Turntable - PSLX350H | Sony Turntable - PSLX350H/ Belt Drive System/ ... | |
| 1 | 1 | Bose Acoustimass 5 Series III Speaker System -... | Bose Acoustimass 5 Series III Speaker System -... | 399.0 |
| 2 | 2 | Sony Switcher - SBV40S | Sony Switcher - SBV40S/ Eliminates Disconnecti... | 49.0 |
| 3 | 3 | Sony 5 Disc CD Player - CDPCE375 | Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change... | |
| 4 | 4 | Bose 27028 161 Bookshelf Pair Speakers In Whit... | Bose 161 Bookshelf Speakers In White - 161WH/ ... | 158.0 |


In [7]:
data.dataset_2.head(5)

|   | id | name | description | price |
|---|----|------|-------------|-------|
| 0 | 0 | Linksys EtherFast EZXS88W Ethernet Switch - EZ... | Linksys EtherFast 8-Port 10/100 Switch (New/Wo... | |
| 1 | 1 | Linksys EtherFast EZXS55W Ethernet Switch | 5 x 10/100Base-TX LAN | |
| 2 | 2 | Netgear ProSafe FS105 Ethernet Switch - FS105NA | NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw... | |
| 3 | 3 | Belkin Pro Series High Integrity VGA/SVGA Moni... | 1 x HD-15 - 1 x HD-15 - 10ft - Beige | |
| 4 | 4 | Netgear ProSafe JFS516 Ethernet Switch | Netgear ProSafe 16 Port 10/100 Rackmount Switc... | |


In [8]:
data.ground_truth.head(3)

|   | D1 | D2 |
|---|----|----|
| 0 | 206 | 216 |
| 1 | 60 | 46 |
| 2 | 182 | 160 |


## WorkFlow

In [9]:
from pyjedai.workflow import WorkFlow

### Block building

In [27]:
from pyjedai.block_building import (
    StandardBlocking,
    QGramsBlocking,
    ExtendedQGramsBlocking,
    SuffixArraysBlocking,
    ExtendedSuffixArraysBlocking
)

bb = dict(
    method=QGramsBlocking, 
    params=dict(qgrams=3),
    attributes_1=['name'],
    attributes_2=['name']
)
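To make this configuration more concrete, here is a minimal sketch of the idea behind q-gram blocking. It is not pyJedAI's implementation: it assumes that each name is split on whitespace, that every character trigram of a token becomes a blocking key, and that only keys shared by both datasets are kept. The helper names `qgrams` and `build_blocks` and the toy records are hypothetical.

```python
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of every whitespace token."""
    grams = set()
    for token in text.lower().split():
        if len(token) < q:
            grams.add(token)  # short tokens are kept whole
        else:
            grams.update(token[i:i + q] for i in range(len(token) - q + 1))
    return grams


def build_blocks(d1, d2, q=3):
    """Map each q-gram to the (D1 ids, D2 ids) whose text contains it."""
    blocks = defaultdict(lambda: (set(), set()))
    for side, dataset in enumerate((d1, d2)):
        for entity_id, text in dataset.items():
            for gram in qgrams(text, q):
                blocks[gram][side].add(entity_id)
    # A block is useful only if it holds entities from both datasets.
    return {k: v for k, v in blocks.items() if v[0] and v[1]}


# Toy records: id -> name
names_1 = {0: "Sony Turntable"}
names_2 = {0: "Sony Switcher", 1: "Bose Speaker"}

blocks = build_blocks(names_1, names_2, q=3)
print(sorted(blocks))
```

Only the trigrams of "sony" are shared here, so the two Sony products become candidates while the Bose speaker is never compared with the turntable.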

### Block cleaning

In [32]:
from pyjedai.block_cleaning import BlockFiltering, BlockPurging

bc = [
    dict(
        method=BlockFiltering, 
        params=dict(ratio=0.8)
    ),
    dict(
        method=BlockPurging, 
        params=dict(smoothing_factor=1.025)
    )
]
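The two cleaning steps configured here can be sketched as follows. This is a simplified illustration, not pyJedAI's code: it assumes that block filtering keeps every entity only in the smallest `ratio` share of its blocks, and that block purging drops blocks whose comparison count exceeds a limit. The explicit `max_comparisons` argument is a hypothetical stand-in for the limit that pyJedAI derives from `smoothing_factor`.

```python
from collections import defaultdict


def block_filtering(blocks, ratio=0.8):
    """Keep each entity only in the smallest `ratio` share of its blocks."""
    # blocks: key -> (set of D1 ids, set of D2 ids)
    size = {k: len(a) * len(b) for k, (a, b) in blocks.items()}
    filtered = {k: (set(), set()) for k in blocks}
    for side in (0, 1):
        blocks_of = defaultdict(list)
        for key, sides in blocks.items():
            for entity_id in sides[side]:
                blocks_of[entity_id].append(key)
        for entity_id, keys in blocks_of.items():
            keys.sort(key=lambda k: (size[k], k))     # smallest blocks first
            keep = max(1, round(ratio * len(keys)))   # at least one block
            for key in keys[:keep]:
                filtered[key][side].add(entity_id)
    return {k: v for k, v in filtered.items() if v[0] and v[1]}


def block_purging(blocks, max_comparisons):
    """Drop blocks that would generate more than `max_comparisons` pairs."""
    return {k: v for k, v in blocks.items()
            if len(v[0]) * len(v[1]) <= max_comparisons}


toy_blocks = {
    "a": ({0}, {0}),
    "b": ({0, 1}, {0, 1}),
    "c": ({0, 1, 2}, {0, 1, 2}),
}
filtered = block_filtering(toy_blocks, ratio=0.8)
purged = block_purging(toy_blocks, max_comparisons=4)
print(filtered)
print(sorted(purged))
```

In the toy input, entity 0 appears in three blocks and keeps only its two smallest, so it is removed from the large block `c`.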

### Comparison Cleaning - Meta-Blocking

In [44]:
from pyjedai.comparison_cleaning import (
    WeightedEdgePruning,
    WeightedNodePruning,
    CardinalityEdgePruning,
    CardinalityNodePruning,
    BLAST,
    ReciprocalCardinalityNodePruning,
    ReciprocalWeightedNodePruning,
    ComparisonPropagation
)

cc = dict(method=CardinalityEdgePruning)
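As a rough intuition for what a cardinality-based edge pruning step does, the sketch below weights every candidate pair by the number of blocks it shares and keeps only the `k` heaviest pairs. Both the weighting scheme (common blocks) and the explicit `k` are assumptions made for illustration; pyJedAI selects its own weighting and threshold.

```python
from collections import Counter
from itertools import product


def cardinality_edge_pruning(blocks, k):
    """Keep the k candidate pairs that co-occur in the most blocks."""
    weights = Counter()
    for d1_ids, d2_ids in blocks.values():
        for pair in product(sorted(d1_ids), sorted(d2_ids)):
            weights[pair] += 1  # one more shared block for this pair
    # Sort by descending weight; ties are broken by the pair itself.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:k]]


toy_blocks = {
    "son": ({0}, {0}),
    "ony": ({0}, {0}),
    "tur": ({0, 1}, {1}),
}
top_pairs = cardinality_edge_pruning(toy_blocks, k=1)
print(top_pairs)
```

The pair (0, 0) shares two blocks while the other pairs share one, so it is the only candidate that survives with `k=1`.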

### Entity Matching

In [52]:
from pyjedai.matching import EntityMatching

em = dict(
    method=EntityMatching, 
    metric='sorensen_dice',
    similarity_threshold=0.5,
    attributes = ['description', 'name']
)
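For reference, the Sørensen-Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). The sketch below applies it to whitespace tokens and compares the score with the 0.5 threshold used in this configuration. Tokenising on whitespace is an assumption of this example; pyJedAI may compute the metric over different units (for instance character shingles), so the scores are not expected to match its output.

```python
def sorensen_dice(a, b):
    """Sørensen-Dice similarity over lower-cased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty strings: treat as no evidence of a match
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


threshold = 0.5
score = sorensen_dice("Sony Turntable PSLX350H",
                      "Sony PSLX350H Belt Drive Turntable")
other = sorensen_dice("Sony Turntable PSLX350H", "Linksys Ethernet Switch")

print(score, score >= threshold)
print(other, other >= threshold)
```

The first pair shares all three tokens of the shorter string (2 * 3 / (3 + 5) = 0.75) and passes the threshold, while the second pair shares none.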

### Clustering

In [58]:
from pyjedai.clustering import ConnectedComponentsClustering

c = dict(method=ConnectedComponentsClustering)
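Connected components clustering treats every matched pair as an edge and puts all transitively linked entities in one cluster. A minimal, dependency-free sketch using union-find is shown below; the function name, the pair list and the `abt_`/`buy_` labels are hypothetical and only illustrate the technique, not pyJedAI's internals.

```python
def connected_components(pairs):
    """Group nodes linked (directly or transitively) by the given pairs."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


matched_pairs = [("abt_0", "buy_0"), ("abt_1", "buy_1"), ("abt_2", "buy_1")]
clusters = connected_components(matched_pairs)
print(clusters)
```

Because `abt_1` and `abt_2` both match `buy_1`, all three end up in the same cluster even though `abt_1` and `abt_2` were never compared directly.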

In [61]:
w = WorkFlow(
    block_building = bb,
    block_cleaning = bc,
    comparison_cleaning = cc,
    entity_matching = em,
    clustering = c
)

In [63]:
w.run(data, tqdm_disable=True)

# Q-Grams Blocking Evaluation 
---
Method name: Q-Grams Blocking
Parameters: 
	Q-Gramms: 3
Runtime: 0.1841 seconds
Scores:
	Precision:      0.08% 
	Recall:       100.00%
	F1-score:       0.17%
Classification report:
	True positives: 1076
	False positives: 1282428
	True negatives: -124652
	False negatives: 0
	Total comparisons: 1283504
---
# Block Filtering Evaluation 
---
Method name: Block Filtering
Parameters: 
	Ratio: 0.8
Runtime: 0.0661 seconds
Scores:
	Precision:      0.06% 
	Recall:        99.91%
	F1-score:       0.12%
Classification report:
	True positives: 1075
	False positives: 1757290
	True negatives: -599515
	False negatives: 1
	Total comparisons: 1758365
---
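The figures in the two evaluation blocks above can be reproduced with a few lines of arithmetic. The formulas in this sketch are inferred from the printed numbers, not taken from pyJedAI's source: precision is assumed to be true positives over total comparisons, recall is true positives over the 1076 ground-truth pairs, and true negatives are the 1076 × 1076 possible pairs minus false positives and false negatives. `blocking_report` is a hypothetical helper.

```python
def blocking_report(true_positives, total_comparisons, n_d1, n_d2, n_duplicates):
    """Recompute the reported scores from the raw counts."""
    precision = true_positives / total_comparisons
    recall = true_positives / n_duplicates
    f1 = 2 * precision * recall / (precision + recall)
    false_positives = total_comparisons - true_positives
    false_negatives = n_duplicates - true_positives
    # All possible cross-dataset pairs minus the wrongly classified ones.
    true_negatives = n_d1 * n_d2 - false_positives - false_negatives
    return dict(precision=precision, recall=recall, f1=f1,
                false_positives=false_positives,
                false_negatives=false_negatives,
                true_negatives=true_negatives)


qgrams_report = blocking_report(1076, 1283504, 1076, 1076, 1076)
filtering_report = blocking_report(1075, 1758365, 1076, 1076, 1076)

for report in (qgrams_report, filtering_report):
    print(f"P={100 * report['precision']:.2f}%  "
          f"R={100 * report['recall']:.2f}%  "
          f"F1={100 * report['f1']:.2f}%  "
          f"TN={report['true_negatives']}")
```

Under these assumptions the negative true-negative counts follow directly from the total comparisons exceeding 1076 × 1076 = 1,157,776, presumably because a pair that shares several blocks is counted once per block.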
# Block Purging Evaluation 
---
Method name: Block Purging
Parameters: 
	Smoothing factor: 1.025
	Max Comparisons per Block: 9191.0
Runtime: 0.0190 seconds
Scores:
	Precision:      0.05% 
	Recall:        99.91%
	F1-score:       0.10%
Classification report:
	True positives: 1075
	False positives: 2232151
	True negatives: