In [1]:
import sys, re
sys.path.append('/u/p/m/pmartinkus/Documents/CS_838/Stage 4')

import Code as hw
import pandas as pd
import numpy as np
import py_entitymatching as em

import warnings
warnings.filterwarnings('ignore')

## Blocking Step
If we examine the size of the Cartesian product of these two tables, we will see that there are too many pairs to check and that most of them cannot be matches. We can use a blocking step to remove obvious non-matches. This reduces the total number of potential matches to check and also gives our data a better ratio of matches to non-matches.

In [15]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e., the Cartesian product): ' + str(len(A)*len(B)))

Number of tuples in A: 3034
Number of tuples in B: 3102
Number of tuples in A X B (i.e., the Cartesian product): 9411468


To begin with, we will check that the two laptops have the same brand. Two laptops cannot be a match if they are not even made by the same company and the brand attribute is relatively clean in both tables. 

In [16]:
# Create attribute equivalence blocker
ab = em.AttrEquivalenceBlocker()

# Block tables comparing the brand
C1 = ab.block_tables(A, B, 'Brand', 'Brand',
                   l_output_attrs=['Name', 'Price', 'Brand', 'Screen Size', 'RAM', 'Hard Drive Capacity', 'Processor Type', 'Processor Speed', 'Operating System', 'Clean Name'],
                   r_output_attrs=['Name', 'Price', 'Brand', 'Screen Size', 'RAM', 'Hard Drive Capacity', 'Processor Type', 'Processor Speed', 'Operating System', 'Clean Name'])

# Let's see how much we've reduced the size of the candidate set
len(C1)

1644624
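Conceptually, attribute-equivalence blocking keeps only the pairs whose blocking attribute is equal in both tables. The following is a minimal sketch in plain Python on made-up toy rows; it is not how py_entitymatching implements the blocker, and `equivalence_block`, `A_toy`, and `B_toy` are names invented for this illustration.

```python
from collections import defaultdict

# Toy stand-ins for tables A and B; the rows are invented for illustration.
A_toy = [
    {'ID': 0, 'Brand': 'Dell'},
    {'ID': 1, 'Brand': 'HP'},
    {'ID': 2, 'Brand': 'Dell'},
]
B_toy = [
    {'ID': 0, 'Brand': 'HP'},
    {'ID': 1, 'Brand': 'Dell'},
    {'ID': 2, 'Brand': 'Lenovo'},
]

def equivalence_block(left, right, attr):
    """Return (left ID, right ID) pairs whose `attr` values are equal."""
    # Index the right table by attribute value so that each left row is
    # only paired with the right rows sharing its value.
    index = defaultdict(list)
    for r in right:
        index[r[attr]].append(r['ID'])
    return [(l['ID'], rid) for l in left for rid in index.get(l[attr], [])]

pairs = equivalence_block(A_toy, B_toy, 'Brand')
print(len(A_toy) * len(B_toy), 'pairs in the Cartesian product')
print(len(pairs), 'pairs after blocking:', pairs)
```

On the toy data only 3 of the 9 possible pairs survive, which mirrors on a small scale the drop from the full Cartesian product to C1.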

Now every tuple pair in our candidate set C1 shares the same brand. Unfortunately, there are still over 1.6 million tuple pairs in this set, so we need to continue blocking to reduce its size. Next, we will apply a rule-based blocker. This blocker measures how similar the Clean Name columns of each tuple pair are using a Jaccard score over 3-grams, and drops pairs that score below 0.2.

In [17]:
# Get features for rule based blocking
block_f = em.get_features_for_blocking(A, B, validate_inferred_attr_types=False)

# Create the rule based blocker and add rule for jaccard score on Clean Name column
rb = em.RuleBasedBlocker()
rb.add_rule(['Clean_Name_Clean_Name_jac_qgm_3_qgm_3(ltuple, rtuple) < 0.2'], block_f)

# Block the candset
C2 = rb.block_candset(C1)
len(C2)

Column Battery Life does not seem to qualify as any atomic type. It may contain all NaNs. Please update the values of column Battery Life
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:55


68742
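To make the rule more concrete, here is a rough sketch of a Jaccard score over character 3-grams in plain Python. It is a simplification under stated assumptions: the strings are not padded and repeated 3-grams are collapsed into a set, so the library's own tokenizer and similarity function may produce somewhat different values. The function names are invented for this example.

```python
def qgrams(text, q=3):
    """Return the set of character q-grams of `text` (no padding)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard_qgm(left, right, q=3):
    """Jaccard similarity between the q-gram sets of two strings."""
    a, b = qgrams(left, q), qgrams(right, q)
    union = a | b
    if not union:
        # Both strings are shorter than q; treat them as dissimilar
        # (a convention chosen for this sketch).
        return 0.0
    return len(a & b) / len(union)

# 'abcde' -> {abc, bcd, cde}, 'abcdf' -> {abc, bcd, cdf}: 2 shared of 4 total
print(jaccard_qgm('abcde', 'abcdf'))  # 0.5

# A pair would be dropped by the rule when its score is below the threshold.
score = jaccard_qgm('dell inspiron 15 laptop', 'hp pavilion 14 notebook')
print('drop pair' if score < 0.2 else 'keep pair')
```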

Now we have significantly reduced the size of the candidate set to around 69 thousand pairs, but still only a small portion of the candidates can be actual matches. Since matching laptops should have the same screen size, RAM, and hard drive capacity, we will use a black box blocker to check whether the numbers in those three fields are equal.

In [18]:
# Create a black box blocker
bb_screen = em.BlackBoxBlocker()
# Set the black box function
bb_screen.set_black_box_function((hw.screen_ram_hd_equal))
# Apply blocker 
C = bb_screen.block_candset(C2)

# Let's check the length of this candidate set
len(C)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:05


19052
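The body of `hw.screen_ram_hd_equal` is not shown in this notebook. As a hypothetical sketch only, a black box function of this kind could look like the code below. It assumes that the function receives the two tuples and returns True when the pair should be dropped; the helper `first_number`, the function name `screen_ram_hd_differ`, and the parsing rule are all assumptions made for illustration.

```python
import re

def first_number(value):
    """Extract the first number in a field such as '15.6 inches', or None."""
    match = re.search(r'\d+(?:\.\d+)?', str(value))
    return float(match.group()) if match else None

def screen_ram_hd_differ(ltuple, rtuple,
                         attrs=('Screen Size', 'RAM', 'Hard Drive Capacity')):
    """Return True (drop the pair) if any attribute has two different numbers.

    Missing or unparseable values are treated as unknown and never cause
    a pair to be dropped.
    """
    for attr in attrs:
        left, right = first_number(ltuple[attr]), first_number(rtuple[attr])
        if left is not None and right is not None and left != right:
            return True
    return False

# Plain dicts stand in for the table rows here.
left = {'Screen Size': '15.6 inches', 'RAM': '8 GB', 'Hard Drive Capacity': '256 GB'}
right = {'Screen Size': '15.6"', 'RAM': '16 GB', 'Hard Drive Capacity': '256 GB'}
print(screen_ram_hd_differ(left, right))  # True: RAM differs (8 vs 16)
```

A rule this simple compares raw numbers only, so it would also drop a pair listed as `1 TB` on one side and `1000 GB` on the other; a real implementation would need to normalize units first.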

Now that we have sufficiently reduced the size of the candidate set, let's debug the blocking scheme to see if we left out any potential matches.

In [19]:
dbg = em.debug_blocker(C, A, B)
dbg.to_csv(path+'debug.csv', index=False)

In [20]:
# Save the candidate set to a csv file
C.to_csv(path+'Candidate_Set.csv', index=False)
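As a quick summary of the blocking step, the candidate-set sizes reported in the cell outputs can be turned into the fraction of the Cartesian product that each stage keeps. This is only arithmetic on the numbers shown in the outputs; the stage labels are informal descriptions.

```python
# Candidate-set sizes copied from the cell outputs in the blocking step.
stages = [
    ('Cartesian product', 9411468),
    ('Brand equivalence', 1644624),
    ('Jaccard rule on Clean Name', 68742),
    ('Screen/RAM/hard drive check', 19052),
]

total = stages[0][1]
for name, size in stages:
    kept = size / total
    print(f'{name:<30} {size:>9,d}  kept {kept:8.4%}')

# Share of the Cartesian product removed by all three blockers together.
overall_reduction = 1 - stages[-1][1] / total
print(f'Overall reduction: {overall_reduction:.2%}')
```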

## Sampling and Labeling

Before we can start choosing a model, we need to label a set of data. First, we will randomly sample 300 candidate pairs of tuples and then we will manually label them all.

In [21]:
# Sample candidate set
S = em.sample_table(C, 300)

In [22]:
# Label S and specify the attribute name for the label column
#L = em.label_table(S, 'gold')

# Save Labeled data to csv file
#L.to_csv(path+'Labeled_Data.csv', index=False)

L = em.read_csv_metadata(path+'Labeled_Data.csv', 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_ID', fk_rtable='rtable_ID')

Metadata file is not present in the given path; proceeding to read the csv file.


## Feature Generation

Now that we have a set of labeled data, we will create feature vectors for matching. We will use the features automatically generated by py_entitymatching. However, we don't need any features relating to brand, since we already ensured that the brands match in the blocking stage. Additionally, many of the tuples are missing values for battery life, so we will not include any features based on that attribute either.

In [23]:
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)

# We don't need any features relating to brand
feature_subset = feature_table.iloc[np.r_[4:10, 40:len(feature_table)], :]

# List the names of the features generated
feature_subset['feature_name'].head()

Column Battery Life does not seem to qualify as any atomic type. It may contain all NaNs. Please update the values of column Battery Life


4        Name_Name_jac_qgm_3_qgm_3
5    Name_Name_cos_dlm_dc0_dlm_dc0
6                  Price_Price_exm
7                  Price_Price_anm
8             Price_Price_lev_dist
Name: feature_name, dtype: object
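The positional slice above depends on the order in which the features happen to be generated. An alternative that is easier to audit is to filter by feature name. The sketch below works on a toy list of names; the `Brand_Brand_...` and `Battery_Life_Battery_Life_...` names are assumed to follow the naming pattern visible in the output above and are not taken from the actual feature table.

```python
# Toy list of feature names, invented to follow the visible naming pattern.
feature_names = [
    'Brand_Brand_jac_qgm_3_qgm_3',
    'Brand_Brand_lev_sim',
    'Name_Name_jac_qgm_3_qgm_3',
    'Price_Price_exm',
    'Battery_Life_Battery_Life_lev_sim',
    'Clean_Name_Clean_Name_jac_qgm_3_qgm_3',
]

# Prefixes of the attributes we do not want features for.
excluded_prefixes = ('Brand_Brand_', 'Battery_Life_Battery_Life_')

# str.startswith accepts a tuple and matches any of its entries.
kept = [name for name in feature_names if not name.startswith(excluded_prefixes)]
print(kept)
```

The same test could be applied to the `feature_name` column of the real feature table to build a boolean mask instead of relying on row positions.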

We will split the data into a training set and a test set. The training set will be used for selecting our model, debugging the models, and training the final model. The test set is only used later on for evaluating the final model we settle on.

In [24]:
# Split the labeled data into development and evaluation set
development_evaluation = em.split_train_test(L, train_proportion=0.7)
development =  development_evaluation['train']
evaluation = development_evaluation['test']

# Save the development and evaluation sets to csv files
development.to_csv(path+'Development.csv', index=False)
evaluation.to_csv(path+'Evaluation.csv', index=False)

In [25]:
# Extract feature vectors
feature_vectors_dev = em.extract_feature_vecs(development, 
                            feature_table=feature_subset, 
                            attrs_after='gold')
# Display first few rows
feature_vectors_dev.head(3)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Unnamed: 0,_id,ltable_ID,rtable_ID,Name_Name_jac_qgm_3_qgm_3,Name_Name_cos_dlm_dc0_dlm_dc0,Price_Price_exm,Price_Price_anm,Price_Price_lev_dist,Price_Price_lev_sim,Processor_Speed_Processor_Speed_jac_qgm_3_qgm_3,...,Operating_System_Operating_System_lev_dist,Operating_System_Operating_System_lev_sim,Operating_System_Operating_System_nmw,Operating_System_Operating_System_sw,Clean_Name_Clean_Name_jac_qgm_3_qgm_3,Clean_Name_Clean_Name_cos_dlm_dc0_dlm_dc0,Clean_Name_Clean_Name_mel,Clean_Name_Clean_Name_lev_dist,Clean_Name_Clean_Name_lev_sim,gold
130,283662,167,2011,0.189055,0.359092,0.0,0.92678,5.0,0.166667,,...,0.0,1.0,10.0,10.0,0.211268,0.353553,0.784166,37.0,0.339286,0
171,478739,282,2925,0.406114,0.464642,0.0,0.925033,3.0,0.5,,...,0.0,1.0,10.0,10.0,0.457143,0.57735,0.69329,27.0,0.571429,1
23,11792,6,3004,0.345361,0.382692,0.0,0.977098,4.0,0.333333,0.285714,...,0.0,1.0,10.0,10.0,0.324675,0.503953,0.666835,35.0,0.363636,1


In [26]:
# Check for missing values
feature_vectors_dev.isnull().values.any()

True

In [27]:
# Impute feature vectors with the mean of the column values.
feature_vectors_dev = em.impute_table(feature_vectors_dev, 
                exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'gold'],
                strategy='mean')
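For intuition, mean imputation replaces each missing entry of a column with the mean of that column's observed values. The snippet below is a plain-Python stand-in on a single made-up column, not the library's implementation.

```python
import math

# One toy feature column with two missing values.
column = [0.285714, math.nan, 0.5, math.nan]

# Mean of the observed (non-missing) values only.
observed = [v for v in column if not math.isnan(v)]
mean = sum(observed) / len(observed)

# Replace each missing value with that mean.
imputed = [mean if math.isnan(v) else v for v in column]

has_missing_before = any(math.isnan(v) for v in column)
has_missing_after = any(math.isnan(v) for v in imputed)
print(has_missing_before, has_missing_after)
print(imputed)
```

Note that the mean is computed from the table being imputed, so the development and evaluation sets are each filled with their own column means.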

## Selecting a Matcher

Using the feature vectors we created, we can create some ML models and select the one that performs best. We will be selecting between the following methods: Decision Tree, SVM, Random Forest, Naive Bayes, Logistic Regression, and Linear Regression.

In [28]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree')
svm = em.SVMMatcher(name='SVM')
rf = em.RFMatcher(name='RF')
nb = em.NBMatcher(name='NB')
lg = em.LogRegMatcher(name='LogReg')
ln = em.LinRegMatcher(name='LinReg')

In [29]:
# Select the best ML matcher using Cross Validation
result = em.select_matcher([dt, rf, svm, nb, lg, ln], table=feature_vectors_dev, 
        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'gold'],
        k=10,
        target_attr='gold', 
        metric_to_select_matcher='f1',
        random_state=0)

In [30]:
# Let's take a look at the average cross validation results for each matcher
result['cv_stats']

        Matcher  Average precision  Average recall  Average f1
0  DecisionTree           0.917500        0.919048    0.910598
1            RF           0.940714        0.882937    0.910717
2           SVM           0.757738        0.671825    0.693077
3            NB           0.570000        0.576190    0.544761
4        LogReg           0.670000        0.529365    0.569893
5        LinReg           0.842381        0.630556    0.711829
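Since the matcher is selected on average F1, it is worth checking how large the margin between the top candidates actually is. The sketch below simply ranks the average F1 values copied from the table above.

```python
# Average F1 per matcher, copied from the cross validation table above.
cv_f1 = {
    'DecisionTree': 0.910598,
    'RF': 0.910717,
    'SVM': 0.693077,
    'NB': 0.544761,
    'LogReg': 0.569893,
    'LinReg': 0.711829,
}

# Rank matchers from best to worst average F1.
ranked = sorted(cv_f1.items(), key=lambda item: item[1], reverse=True)
best, runner_up = ranked[0], ranked[1]
margin = best[1] - runner_up[1]

print(f'Best: {best[0]} ({best[1]:.6f})')
print(f'Runner-up: {runner_up[0]} ({runner_up[1]:.6f}), margin {margin:.6f}')
```

At this stage the random forest and the decision tree are almost tied, so the choice between them rests on a very small difference.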


## Debugging a Matcher

The next step is to debug the matcher we have chosen. We will examine the false positives and false negatives to look for common patterns in our errors. This way, we can make changes to address these common problems.

In [31]:
# Split feature vectors into train and test
train_test = em.split_train_test(feature_vectors_dev, train_proportion=0.5)
train = train_test['train']
test = train_test['test']

In [32]:
# Debug random forest using GUI
em.vis_debug_rf(rf, train, test, 
        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'gold'],
        target_attr='gold')

After looking at the false positives and false negatives in the debugger, we discovered that the random forest model has trouble differentiating refurbished laptops from new laptops. We can add a new feature that checks if the two laptops are refurbished to help with this problem.

In [33]:
# Add new feature to check if laptops are refurbished or not
em.add_blackbox_feature(feature_subset, 'refurbished', hw.refurbished)

True
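The definition of `hw.refurbished` is not included in this notebook. Purely as a hypothetical sketch, a feature function of this kind could take the two tuples and report whether they agree on being refurbished. The function name, the use of the `Name` attribute, and the keyword test are assumptions made for illustration.

```python
def refurbished_agreement(ltuple, rtuple, attr='Name'):
    """Return 1 if both names agree on being refurbished (or not), else 0."""
    left = 'refurbished' in str(ltuple[attr]).lower()
    right = 'refurbished' in str(rtuple[attr]).lower()
    return int(left == right)

# Plain dicts stand in for the table rows; the names are made up.
new_laptop = {'Name': 'Dell Inspiron 15 Laptop'}
refurb_laptop = {'Name': 'Refurbished Dell Inspiron 15 Laptop'}

print(refurbished_agreement(new_laptop, refurb_laptop))     # 0: they disagree
print(refurbished_agreement(refurb_laptop, refurb_laptop))  # 1: they agree
```

A feature like this gives the tree-based models a direct signal for separating a new laptop from its refurbished listing, even when the rest of the name is nearly identical.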

Now we can select a model again to see how our new refurbish feature affected performance.

In [34]:
# Extract feature vectors
feature_vectors_dev = em.extract_feature_vecs(development, 
                            feature_table=feature_subset, 
                            attrs_after='gold')

# Impute feature vectors with the mean of the column values.
feature_vectors_dev = em.impute_table(feature_vectors_dev, 
                exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'gold'],
                strategy='mean')

# Select the best ML matcher using Cross Validation
result = em.select_matcher([dt, rf, svm, nb, lg, ln], table=feature_vectors_dev, 
        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'gold'],
        k=10,
        target_attr='gold', 
        metric_to_select_matcher='f1',
        random_state=0)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [35]:
result['cv_stats']

        Matcher  Average precision  Average recall  Average f1
0  DecisionTree           0.913929        0.919048    0.910598
1            RF           0.938492        0.927778    0.927661
2           SVM           0.784545        0.671825    0.706125
3            NB           0.714405        0.688889    0.682251
4        LogReg           0.728095        0.558532    0.616865
5        LinReg           0.812381        0.655952    0.721434


We can see above that the Random Forest model improved with the new feature (its average F1 rose from about 0.911 to about 0.928) and is now clearly the best-performing matcher on our development set. We can move forward and test it on the evaluation set.

## Evaluating the Matchers

Next, we will evaluate the matchers on the test dataset that we set aside earlier. Up until this point we have not looked at the evaluation set at all. The hope is that the evaluation set is representative of the complete data set (after blocking) and that our models have not overfit to it, since we have not looked at it.

First, we need to create feature vectors from the evaluation set. These features will be the same as before and will include the refurbished feature we developed after debugging the matchers.

In [36]:
# Get new set of features
feature_vectors_eval = em.extract_feature_vecs(evaluation, 
                                               feature_table=feature_subset, 
                                               attrs_after='gold')

# Impute feature vectors
feature_vectors_eval = em.impute_table(feature_vectors_eval, 
                exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'gold'],
                strategy='mean')

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Now we will fit the models on the entire development set. This will give us models that can be used to predict matches in the evaluation set. Then, using the trained models, we will generate predictions for the entire evaluation set and report precision, recall, and F1 for each model.

In [37]:
# Train using feature vectors from the development set
matchers = [dt, rf, svm, nb, lg, ln]
for matcher in matchers:
    matcher.fit(table=feature_vectors_dev, 
           exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'gold'], 
           target_attr='gold')

In [38]:
# Predict M for each matcher
predictions = []
for matcher in matchers:
    predictions.append(matcher.predict(table=feature_vectors_eval, 
                             exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'gold'], 
                             append=True, 
                             target_attr='predicted', 
                             inplace=False))

In [39]:
# Evaluate the result
for i, matcher in enumerate(matchers):
    print(matcher.name + ': ')
    eval_result = em.eval_matches(predictions[i], 'gold', 'predicted')
    em.print_eval_summary(eval_result)
    print('')

DecisionTree: 
Precision : 92.86% (26/28)
Recall : 96.3% (26/27)
F1 : 94.55%
False positives : 2 (out of 28 positive predictions)
False negatives : 1 (out of 62 negative predictions)

RF: 
Precision : 100.0% (24/24)
Recall : 88.89% (24/27)
F1 : 94.12%
False positives : 0 (out of 24 positive predictions)
False negatives : 3 (out of 66 negative predictions)

SVM: 
Precision : 66.67% (16/24)
Recall : 59.26% (16/27)
F1 : 62.75%
False positives : 8 (out of 24 positive predictions)
False negatives : 11 (out of 66 negative predictions)

NB: 
Precision : 56.67% (17/30)
Recall : 62.96% (17/27)
F1 : 59.65%
False positives : 13 (out of 30 positive predictions)
False negatives : 10 (out of 60 negative predictions)

LogReg: 
Precision : 69.23% (9/13)
Recall : 33.33% (9/27)
F1 : 45.0%
False positives : 4 (out of 13 positive predictions)
False negatives : 18 (out of 77 negative predictions)

LinReg: 
Precision : 89.47% (17/19)
Recall : 62.96% (17/27)
F1 : 73.91%
False positives : 2 (out of 19 positive predictions)

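We can see above that the Random Forest model performed well on the test set, reaching an F1 score of 94.12% with 100% precision. The decision tree scored a comparable 94.55% F1, with higher recall but two false positives.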
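The precision, recall, and F1 figures in the evaluation summary follow directly from the counts shown in parentheses. As a small worked check, the sketch below recomputes them for the two strongest matchers; the helper name `prf` is made up for this example.

```python
def prf(true_pos, predicted_pos, actual_pos):
    """Compute precision, recall, and F1 from raw counts."""
    precision = true_pos / predicted_pos
    recall = true_pos / actual_pos
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (true positives, positive predictions, actual positives) from the output above.
counts = {
    'DecisionTree': (26, 28, 27),
    'RF': (24, 24, 27),
}

for name, c in counts.items():
    p, r, f1 = prf(*c)
    print(f'{name}: precision={p:.2%} recall={r:.2%} F1={f1:.2%}')
```

With only 27 true matches in the evaluation set, a single additional error moves recall by almost four percentage points, so the small F1 gap between the two tree-based matchers should be read with caution.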