## Entity Matching Workflow - 31st March 2017

This notebook documents the entity matching (EM) steps performed on song data using py_entitymatching. Our goal is to develop a workflow that matches songs across datasets provided by a research lab in Silicon Valley. Specifically, we target a precision of 90% and a recall as high as possible. The datasets contain information about the songs.
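
As a reminder of what the precision and recall targets measure, here is a minimal sketch in plain Python. The function name `precision_recall` and the toy pair IDs are hypothetical and are not part of the notebook's data or of py_entitymatching.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of matched pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted matches that are real matches.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real matches that were predicted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy data: each pair is (track id, song id).
gold_matches = {(1, 10), (2, 20), (3, 30), (4, 40)}
predicted_matches = {(1, 10), (2, 20), (5, 50)}

precision, recall = precision_recall(predicted_matches, gold_matches)
print(precision, recall)
```

In this toy case two of the three predicted pairs are correct, and two of the four true matches are found.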

First, we import the py_entitymatching package and the other libraries we need:

In [None]:
import os

import py_entitymatching as em
import pandas as pd
import numpy as np

### Reading tables

In [None]:
path_A = os.path.join('..','dataset','datasets','tracks.csv')
path_B = os.path.join('..','dataset','datasets','songs.csv')

# Read the CSV files
# low_memory=False makes pandas parse each file in one pass instead of in chunks,
# which avoids mixed-dtype inference within a column.
A = em.read_csv_metadata(path_A,low_memory=False)
B = em.read_csv_metadata(path_B,low_memory=False)

### Down-sampling tables

Two smaller datasets, tracks_sample.csv and songs_sample.csv, are produced by down-sampling the full datasets. They contain 8592 and 10000 tuples, respectively.

To avoid case-sensitive string matching, we also converted all values in both tables to lowercase.

In [None]:
#Down sample the tables
sample_A, sample_B = em.down_sample(A, B, size=10000, y_param=1, show_progress=False)

# Set 'id' as the key of each sampled table
em.set_key(sample_A,'id')
em.set_key(sample_B,'id')

# Lowercase every value (this also casts every column to string)
sample_A = sample_A.apply(lambda x: x.astype(str).str.lower())
sample_B = sample_B.apply(lambda x: x.astype(str).str.lower())

#Print lengths of the sampled tables
print(len(sample_A))
print(len(sample_B))

#saving the down sampled datasets
sample_A.to_csv('tracks_sample.csv', index = False, sep = ',')
sample_B.to_csv('songs_sample.csv', index = False, sep = ',')

#Get headers of sampled tables
headers_A = list(A.columns)
headers_B = list(B.columns)

### Applying blocker on down sampled datasets


Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the down sampled tables. This would reduce the number of tuple pairs considered for matching.
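
To give a sense of scale, the sketch below computes how many tuple pairs would have to be considered without blocking, using the table sizes reported above. The `reduction_ratio` helper and the candidate-set size passed to it are hypothetical and serve only as an illustration.

```python
# Table sizes reported for the down-sampled datasets.
num_tracks = 8592
num_songs = 10000

# Without blocking, every track is compared with every song.
all_pairs = num_tracks * num_songs
print(all_pairs)  # 85920000

def reduction_ratio(num_candidates, num_all_pairs):
    """Fraction of the Cartesian product that blocking removed."""
    return 1 - num_candidates / num_all_pairs

# Hypothetical candidate-set size, for illustration only.
print(reduction_ratio(50000, all_pairs))
```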

We know that two songs with clearly different titles, or with different artist names, will not match. So we decided to block on song title and artist name.

In [None]:
path_A = 'tracks_sample.csv'
path_B = 'songs_sample.csv'

# Read the down-sampled CSV files, using 'id' as the key
sample_A = em.read_csv_metadata(path_A,key='id',low_memory=False)
sample_B = em.read_csv_metadata(path_B,key='id',low_memory=False)

# Generate the blocking features from the reloaded tables
block_f = em.get_features_for_blocking(sample_A, sample_B)

Three rules were applied in a sequence to remove tuple pairs that are unlikely to match.

**1) Pairs whose song titles have a Jaccard score (over 3-grams) below 0.1 were removed.**

**2) Pairs whose song titles share less than 25% of their words (word-level Jaccard below 0.25) were removed, unless one title contains the other.**

**3) Pairs whose artists share less than 25% of their words (word-level Jaccard below 0.25) were removed, unless one artist string contains the other.**

Two helper functions, ***title_function*** and ***artists_function***, defined below, were used to filter out tuple pairs according to rules 2 and 3.
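
To make rule 1 concrete, here is a simplified pure-Python sketch of a Jaccard score over character 3-grams. It is not the library's implementation: the helper names `qgrams`, `jaccard`, and `blocked_by_rule1` are hypothetical, and the sketch assumes no padding of the strings, whereas the tokenizer behind the `jac_qgm_3_qgm_3` feature may pad them, so exact scores can differ.

```python
def qgrams(text, q=3):
    """Return the set of character q-grams of text (no padding)."""
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(a, b):
    """Jaccard score of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def blocked_by_rule1(left_title, right_title, threshold=0.1):
    """Return True if the pair would be dropped by the 3-gram rule."""
    return jaccard(qgrams(left_title), qgrams(right_title)) < threshold

# Titles sharing most 3-grams are kept; unrelated titles are dropped.
print(jaccard(qgrams("yesterday"), qgrams("yesterdays")))  # 0.875
print(blocked_by_rule1("yesterday", "yesterdays"))  # False
print(blocked_by_rule1("yesterday", "hello"))  # True
```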

In [None]:
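def title_function(x, y):
    """Return True (block the pair) if the song titles share too few words."""
    x_title = str(x['song_title'])
    y_title = str(y['song_title'])

    # Keep the pair if one title contains the other
    if (x_title in y_title) or (y_title in x_title):
        return False

    x_words = set(x_title.split())
    y_words = set(y_title.split())

    # Block the pair if the word-level Jaccard score is below 0.25
    # (float() keeps true division under Python 2 as well)
    intersection = len(x_words & y_words)
    union = len(x_words | y_words)
    return intersection / float(union) < 0.25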

In [None]:
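def artists_function(x, y):
    """Return True (block the pair) if the artist strings share too few words."""
    x_artists = str(x['artists'])
    y_artists = str(y['artists'])

    # Keep the pair if one artist string contains the other
    if (x_artists in y_artists) or (y_artists in x_artists):
        return False

    x_words = set(x_artists.split())
    y_words = set(y_artists.split())

    # Block the pair if the word-level Jaccard score is below 0.25
    # (float() keeps true division under Python 2 as well)
    intersection = len(x_words & y_words)
    union = len(x_words | y_words)
    return intersection / float(union) < 0.25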
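
To see how the logic of these two black-box functions behaves, the following stand-alone sketch applies the same substring check and word-level Jaccard test to plain strings. The helper name `word_overlap_blocks` and the example titles are hypothetical and are not taken from the datasets.

```python
def word_overlap_blocks(left, right, threshold=0.25):
    """Return True if the pair would be blocked (dropped)."""
    # Keep the pair if one string contains the other.
    if left in right or right in left:
        return False
    left_words, right_words = set(left.split()), set(right.split())
    overlap = len(left_words & right_words) / len(left_words | right_words)
    return overlap < threshold

examples = [
    ("love story", "love story (remix)"),               # substring: kept
    ("the long and winding road", "the road"),          # 2 of 5 words shared: kept
    ("dancing in the dark tonight", "the sound of silence"),  # 1 of 8: blocked
    ("hey jude", "let it be"),                          # nothing shared: blocked
]

for left, right in examples:
    print(left, "|", right, "->", word_overlap_blocks(left, right))
```

The substring check runs first, so a title that differs only by appended version information survives this step regardless of its word overlap.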

In [None]:
rb = em.RuleBasedBlocker()
ob = em.OverlapBlocker()
bb = em.BlackBoxBlocker()

# remove pairs that don't share similar titles
rule1 = ['song_title_song_title_jac_qgm_3_qgm_3(ltuple, rtuple) < 0.1']
rb.add_rule(rule1, block_f)

C1 = rb.block_tables(sample_A, sample_B, l_output_attrs=['song_title','year','artists'], r_output_attrs=['song_title','year','artists'], show_progress=False)

# remove pairs whose titles share too few words
bb.set_black_box_function(title_function)
C2 = bb.block_candset(C1)

# remove pairs whose artists share too few words
bb.set_black_box_function(artists_function)
C3 = bb.block_candset(C2)

In [None]:
print(len(C3))
C3.head(50)

In [1]:
# debug the blocker
dbg = em.debug_blocker(C3, sample_A, sample_B, output_size=50)
dbg

In [None]:
# Sample 400 tuple pairs from the candidate set C3
S = em.sample_table(C3, 400)

# label the sampled data

# CRITERIA:
#
# - same song name (excluding version information)
#    - same or common artist(s) -> MATCH
#    - completely different artist(s) -> MISMATCH
#    - artist(s) missing -> MISMATCH

G = em.label_table(S, label_column_name='gold_labels')

G.to_csv('labeled_data.csv', index = False, sep = ',')