# Entity Matching Workflow - 31st March 2017

This notebook contains the entity matching (EM) steps that were carried out on song data using py_entitymatching. Our goal is to come up with a workflow to match songs across the datasets provided by a research lab in Silicon Valley. Specifically, we want a precision of at least 90% and a recall as high as possible. The datasets contain information about the songs.

First, we need to import the py_entitymatching package and other libraries:

In [None]:
import os

import py_entitymatching as em
import pandas as pd
import numpy as np

## Reading tables

In [None]:
path_A = os.path.join('..','dataset','datasets','tracks.csv')
path_B = os.path.join('..','dataset','datasets','songs.csv')

# Read the CSV files
# low_memory=False makes pandas process each file in one pass, which avoids mixed-type inference
A = em.read_csv_metadata(path_A,low_memory=False)
B = em.read_csv_metadata(path_B,low_memory=False)

## Down sampling tables

Two smaller datasets, tracks_sample.csv and songs_sample.csv, are produced by down sampling the full datasets. They contain 8592 and 10000 tuples, respectively.

To avoid case-sensitive string matching, we also changed the entire tables to lowercase. 

In [None]:
#Down sample the tables
sample_A, sample_B = em.down_sample(A, B, size=10000, y_param=1, show_progress=False)

# Set 'id' as the key of both sampled tables
em.set_key(sample_A,'id')
em.set_key(sample_B,'id')

# Convert every value to a lowercase string
sample_A = sample_A.apply(lambda x: x.astype(str).str.lower())
sample_B = sample_B.apply(lambda x: x.astype(str).str.lower())

#Print lengths of the sampled tables
print(len(sample_A))
print(len(sample_B))

#saving the down sampled datasets
sample_A.to_csv('tracks_sample.csv', index = False, sep = ',')
sample_B.to_csv('songs_sample.csv', index = False, sep = ',')

#Get headers of sampled tables
headers_A = list(sample_A.columns)
headers_B = list(sample_B.columns)

## Applying blocker on down sampled datasets


Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the down sampled tables. This would reduce the number of tuple pairs considered for matching.

We know that two songs with different titles or different artist names will not match, so we decided to block on song title and artist name.

In [None]:
block_f = em.get_features_for_blocking(sample_A, sample_B)

path_A = 'tracks_sample.csv'
path_B = 'songs_sample.csv'

# Read the CSV files
sample_A = em.read_csv_metadata(path_A,key='id',low_memory=False)
sample_B = em.read_csv_metadata(path_B,key='id',low_memory=False)

Three rules were applied in sequence to remove tuple pairs that are unlikely to match.

**1) Pairs whose song titles have a Jaccard score (over 3-grams) lower than 0.1 were removed.**

**2) Pairs whose song titles share fewer than 25% of their words were removed.**

**3) Pairs whose artists share fewer than 25% of their words were removed.**

Two helper functions, ***title_function*** and ***artists_function***, defined below, were used to filter out tuple pairs according to rules 2 and 3. Both functions keep a pair whenever one value is a substring of the other.
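
Rule 1 relies on a Jaccard score over character 3-grams. The following is a minimal, library-free sketch of that measure, added only to illustrate the idea. The names `qgrams`, `jaccard`, and `qgram_jaccard` are hypothetical, and the sketch does not pad the strings, so its scores may differ from those of the auto-generated feature used in the notebook.

```python
def qgrams(text, q=3):
    """Return the set of overlapping character q-grams in text (no padding)."""
    text = str(text)
    if len(text) < q:
        # Assumption: a string shorter than q is treated as a single token.
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(left, right):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = left | right
    if not union:
        return 0.0
    return len(left & right) / len(union)


def qgram_jaccard(a, b, q=3):
    """Jaccard similarity of the q-gram sets of two strings."""
    return jaccard(qgrams(a, q), qgrams(b, q))


# A title and a re-release of it share many 3-grams.
score_close = qgram_jaccard("hey jude", "hey jude (remastered)")
# Two unrelated titles share none.
score_far = qgram_jaccard("hey jude", "yellow submarine")

print(score_close >= 0.1)  # this pair would survive rule 1
print(score_far >= 0.1)    # this pair would be removed by rule 1
```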

In [None]:
def title_function(x, y):
    """Return True if the pair (x, y) should be blocked based on song_title."""
    x_title = str(x['song_title'])
    y_title = str(y['song_title'])

    # Keep the pair if one title contains the other
    if (x_title in y_title) or (y_title in x_title):
        return False

    # Otherwise block the pair when fewer than 25% of the words are shared
    x_words = set(x_title.split())
    y_words = set(y_title.split())
    intersection = len(x_words & y_words)
    union = len(x_words | y_words)
    return intersection / union < 0.25

In [None]:
def artists_function(x, y):
    """Return True if the pair (x, y) should be blocked based on artists."""
    x_artists = str(x['artists'])
    y_artists = str(y['artists'])

    # Keep the pair if one artist string contains the other
    if (x_artists in y_artists) or (y_artists in x_artists):
        return False

    # Otherwise block the pair when fewer than 25% of the words are shared
    x_words = set(x_artists.split())
    y_words = set(y_artists.split())
    intersection = len(x_words & y_words)
    union = len(x_words | y_words)
    return intersection / union < 0.25
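
The two functions above differ only in the attribute they read. As an illustration, they could be produced by one factory, shown here on hand-made tuples. This is a sketch and not part of the original workflow: `make_word_overlap_blocker` and the sample records are hypothetical, and plain dictionaries stand in for table rows.

```python
def make_word_overlap_blocker(attr, threshold=0.25):
    """Build a blocking function for one attribute.

    The returned function gives True when a pair should be dropped.
    """
    def should_block(x, y):
        x_val, y_val = str(x[attr]), str(y[attr])
        # Keep the pair when one value contains the other.
        if x_val in y_val or y_val in x_val:
            return False
        x_words, y_words = set(x_val.split()), set(y_val.split())
        union = x_words | y_words
        if not union:
            # Assumption: with no words to compare, keep the pair.
            return False
        return len(x_words & y_words) / len(union) < threshold
    return should_block


title_blocker = make_word_overlap_blocker('song_title')
artist_blocker = make_word_overlap_blocker('artists')

# Hypothetical records, lowercased like the sampled tables.
left = {'song_title': 'let it be', 'artists': 'the beatles'}
same_song = {'song_title': 'let it be (live)', 'artists': 'beatles'}
near_title = {'song_title': 'let it go', 'artists': 'idina menzel'}
unrelated = {'song_title': 'yellow submarine', 'artists': 'the beatles'}

print(title_blocker(left, same_song))   # substring match, pair is kept
print(title_blocker(left, near_title))  # 2 of 4 words shared, pair is kept
print(title_blocker(left, unrelated))   # no shared words, pair is blocked
print(artist_blocker(left, near_title)) # no shared words, pair is blocked
```

Note that `near_title` survives the title rule and is only removed by the artist rule, which is why the rules are applied in sequence.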

In [None]:
rb = em.RuleBasedBlocker()
ob = em.OverlapBlocker()
bb = em.BlackBoxBlocker()

# remove pairs that don't share similar titles
rule1 = ['song_title_song_title_jac_qgm_3_qgm_3(ltuple, rtuple) < 0.1']
rb.add_rule(rule1, block_f)

C1 = rb.block_tables(sample_A, sample_B, l_output_attrs=['song_title','year','artists'], r_output_attrs=['song_title','year','artists'], show_progress=False)

bb.set_black_box_function(title_function)
C2 = bb.block_candset(C1)

bb.set_black_box_function(artists_function)
C3 = bb.block_candset(C2)


In [None]:
print(len(C3))
C3.head(50)

## Debugging the blocker output

Using the blocker debugging tool supplied by Magellan, we examined the tuple pairs that had been blocked and found very few actual matches that were wrongly removed. In addition, by examining the tuple pairs that survived blocking, we confirmed that we ended up with a reasonable number of positive and negative examples.

In [1]:
# Debug the blocker using the final candidate set
dbg = em.debug_blocker(C3, sample_A, sample_B, output_size=50)
dbg


## Sampling and labeling the candidate set

First, we randomly sample 400 tuple pairs for labeling purposes.

In [None]:
#Sample the result set C3
S = em.sample_table(C3, 400)

Next, we labeled the sampled candidate set based on the following criteria.

1. *If two tuples have the same song name (excluding version information):*

    * if they share common artists: **MATCH**
    * else: **MISMATCH**

2. *Else:* **MISMATCH**
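
The labeling itself was done by hand. The sketch below only makes the decision rule explicit, and it rests on assumptions that do not come from the data: that version information appears as a trailing parenthesised or bracketed suffix, and that multiple artists are separated by a comma. `strip_version` and `label_pair` are hypothetical names.

```python
import re


def strip_version(title):
    """Drop a trailing '(...)' or '[...]' suffix such as '(live)'.

    Assumption: version information is written this way; the real data
    may use other conventions.
    """
    return re.sub(r'\s*[\(\[][^\)\]]*[\)\]]\s*$', '', title).strip()


def label_pair(l_title, l_artists, r_title, r_artists, sep=','):
    """Return 1 (MATCH) or 0 (MISMATCH) following the two criteria."""
    # Criterion 2: different song names are always a mismatch.
    if strip_version(l_title.lower()) != strip_version(r_title.lower()):
        return 0
    # Criterion 1: same song name, so the artists decide.
    left = {a.strip().lower() for a in l_artists.split(sep) if a.strip()}
    right = {a.strip().lower() for a in r_artists.split(sep) if a.strip()}
    return 1 if left & right else 0


print(label_pair('Let It Be (Live)', 'The Beatles',
                 'let it be', 'the beatles, billy preston'))
print(label_pair('let it be', 'the beatles', 'let it be', 'aretha franklin'))
print(label_pair('let it be', 'the beatles', 'let it go', 'the beatles'))
```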


In [None]:
# label the sampled data
G = em.label_table(S, label_column_name='gold_labels')

G.to_csv('labeled_data.csv', index = False, sep = ',')

## Matching tuple pairs in the candidate set

In this step, we match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching. This typically involves the following three steps:

1. Splitting the labeled data into development and evaluation set
2. Selecting the best learning based matcher using the development set
3. Evaluating the selected matcher using the evaluation set

### Splitting the labeled data into development and evaluation set

In this step, we split the labeled data into two sets: development (I) and evaluation (J). Specifically, the development set is used to come up with the best learning-based matcher and the evaluation set is used to evaluate the selected matcher on unseen data. Each set contains 200 tuple pairs.

In [None]:
# Split G into I and J
train_test = em.split_train_test(G, train_proportion=0.5,random_state=0)
I = train_test['train']
J = train_test['test']

I.to_csv('train.csv')
J.to_csv('test.csv')
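
To show what a seeded 50/50 split guarantees (two disjoint halves that come out the same on every run), here is a small library-free sketch. It is not how `em.split_train_test` is implemented; `split_in_half` is a hypothetical helper and the integer ids stand in for labeled tuple pairs.

```python
import random


def split_in_half(rows, train_proportion=0.5, seed=0):
    """Shuffle rows reproducibly and split them into two disjoint lists."""
    shuffled = list(rows)
    # A dedicated generator keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_proportion))
    return shuffled[:cut], shuffled[cut:]


pair_ids = list(range(400))  # stand-ins for the 400 labeled pairs
dev_ids, eval_ids = split_in_half(pair_ids)

print(len(dev_ids), len(eval_ids))
# The same seed always yields the same split.
print(split_in_half(pair_ids) == (dev_ids, eval_ids))
```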

### Selecting the best learning-based matcher

Selecting the best learning-based matcher typically involves the following steps:


1. Creating features
2. Converting the development set into feature vectors
3. Creating a set of learning-based matchers
4. Selecting the best learning-based matcher using k-fold cross validation

#### Creating features

Next, we need to create a set of features for the development set. Using automatic feature generation in py_entitymatching, a set of features F is generated based on the attributes in the input tables.

We removed the features that take 'id' as a parameter from F, as the id does not contribute effectively to deciding whether tuple pairs from A and B match. We also removed some of the features on 'year'.

In [None]:
# Generate a set of features
F = em.get_features_for_matching(A, B)
print(F.feature_name)

# Remove all features on id parameters
F = F[4:]

# Remove some features on year parameter
F = F.drop(F.index[[0,1,2]])

#### Converting the development set to feature vectors

In [None]:
# Convert I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='gold_labels',
                            show_progress=False) 

# Display first few rows
H.head()

#### Missing value handling

We imputed missing values for feature vectors with 0. 

In [None]:
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
H.isnull().values.any()

In [None]:
# Impute missing values in the feature vectors with 0.
H.fillna(value=0, inplace=True)

#### Creating a set of learning-based matchers

In [None]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0,max_depth=5)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

#### Selecting the best matcher using cross-validation

Now, we select the best matcher using 5-fold cross-validation. We compared the matchers on the 'precision', 'recall', and 'f1' metrics and found **Random Forest (X)** to be the best matcher.

In [None]:
# Compute precision, recall, and F1 and select the best ML matcher using CV
result_precision = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        k=5,
        target_attr='gold_labels', metric='precision', random_state=0)
result_recall = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        k=5,
        target_attr='gold_labels', metric='recall', random_state=0)
result_f1 = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        k=5,
        target_attr='gold_labels', metric='f1', random_state=0)
result_precision['cv_stats']
result_recall['cv_stats']
result_f1['cv_stats']
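
For readers unfamiliar with k-fold cross-validation, the sketch below shows how 200 development rows can be partitioned into 5 folds, each serving once as the held-out part. It is illustrative only: `k_fold_indices` is a hypothetical helper, it uses contiguous, unshuffled folds, and it is not the splitting logic inside `em.select_matcher`.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(200, k=5))
for train_idx, test_idx in folds:
    # Each round trains on 160 rows and scores on the remaining 40.
    print(len(train_idx), len(test_idx))
```

The score reported for each matcher is then an average over the five held-out folds, which makes it less sensitive to one lucky or unlucky split.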

#### Debugging matcher

To further improve the accuracy of X, we debugged it. To do so, first H was split into sets P and Q.

In [None]:
#Debug Random Forest Matcher X
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']

In [None]:
# Debug X using GUI
em.vis_debug_rf(rf, P, Q, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        target_attr='gold_labels')

On examining the false positives, we found that 2 of the 3 false positives were due to an exact match in either song_title or artists between the tuples.
We tried different subsets of the features involving these attributes to remove redundancy and improve accuracy.

In [None]:
#Debugging iteration 1 - remove song_title_song_title_lev_dist, song_title_song_title_nmw and song_title_song_title_sw,
#song_title_song_title_cos_dlm_dc0_dlm_dc0, song_title_song_title_mel,song_title_song_title_jac_dlm_dc0_dlm_dc0
F = F.drop(F.index[[2,3,4,5,7,8]])
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='gold_labels',
                            show_progress=False) 
# Impute feature vectors with 0.
H.fillna(value=0, inplace=True)
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']
# Debug the matcher using GUI
em.vis_debug_rf(rf, P, Q, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        target_attr='gold_labels')

In [None]:
#Debugging iteration 2 - remove song_title_song_title_lev_sim
F = F.drop(F.index[[2]])
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='gold_labels',
                            show_progress=False) 
# Impute feature vectors with 0.
H.fillna(value=0, inplace=True)
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']
# Debug the matcher using GUI
em.vis_debug_rf(rf, P, Q, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        target_attr='gold_labels')

In [None]:
#Debugging iteration 3 - remove artists_artists_lev_dist, artists_artists_nmw,artists_artists_sw
F = F.drop(F.index[[3,4,5,6,8,9]])
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='gold_labels',
                            show_progress=False) 
# Impute feature vectors with 0.
H.fillna(value=0, inplace=True)
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']
# Debug the matcher using GUI
em.vis_debug_rf(rf, P, Q, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        target_attr='gold_labels')


In [None]:
#Debugging iteration 4 - remove artists_artists_lev_sim
F = F.drop(F.index[[3]])
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='gold_labels',
                            show_progress=False) 
# Impute feature vectors with 0.
H.fillna(value=0, inplace=True)
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']
# Debug the matcher using GUI
em.vis_debug_rf(rf, P, Q, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        target_attr='gold_labels')

In [None]:
#Debugging iteration 5 - add feature product of jaccard measure on song_title and artists
H['song_title_song_title_jac_qgm_3_qgm_3']
H['artists_artists_jac_qgm_3_qgm_3']
H['song_title_artists_score']= H.song_title_song_title_jac_qgm_3_qgm_3*H.artists_artists_jac_qgm_3_qgm_3
# Impute feature vectors with 0.
H.fillna(value=0, inplace=True)
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']
# Debug RF matcher using GUI
em.vis_debug_rf(rf, P, Q, 
        exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'],
        target_attr='gold_labels')

Since precision dropped in every iteration, we proceed with matcher X as the best matcher.

### Evaluating the matching output

Evaluating the matching outputs for the evaluation set typically involves the following four steps:

1. Converting the evaluation set to feature vectors
2. Training matcher using the feature vectors extracted from the development set
3. Predicting the evaluation set using the trained matcher
4. Evaluating the predicted matches

#### Converting the evaluation set to feature vectors

As before, we convert the evaluation set into feature vectors using the feature table.

In [None]:
# Evaluate matching output
# Convert J into a set of feature vectors using feature table
L = em.extract_feature_vecs(J, feature_table=F,
                            attrs_after='gold_labels', show_progress=False)

In [None]:
# Check if the feature vectors contain missing values
# A printed value of True means that there are missing values
print(L.isnull().values.any())
# Impute missing values with 0
L.fillna(value=0, inplace=True)

In [None]:
# Train using feature vectors from I using decision tree
dt.fit(table=H, 
       exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'], 
       target_attr='gold_labels')
# Predict on L 
predictions = dt.predict(table=L, exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'], 
                         append=True,target_attr='predicted_labels')
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'gold_labels', 'predicted_labels')
em.print_eval_summary(eval_result)

In [None]:
# Train using feature vectors from I using random forest
rf.fit(table=H, 
       exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'], 
       target_attr='gold_labels')
# Predict on L 
predictions = rf.predict(table=L, exclude_attrs=['id', 'ltable_id', 'rtable_id', 'gold_labels'], 
                         append=True,target_attr='predicted_labels')
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'gold_labels', 'predicted_labels')
em.print_eval_summary(eval_result)
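
The evaluation summary reports precision, recall, and F1. As a reminder of how these relate to the goal of at least 90% precision, here is a minimal sketch that computes them from gold and predicted labels. The function name `precision_recall_f1` and the ten labels are made up for illustration; they are not results from this notebook.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for binary labels (1 = match)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
print(precision >= 0.9)  # whether the precision target is met
```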