## We aim in this notebook to identify duplicates in a CSV containing information about restaurants

In [None]:
!pip3 install nltk

In [11]:
import pandas as pd
import nltk
import time
import numpy as np
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/ccompain/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
df_restaurants = pd.read_csv("./restaurants.csv")

In [4]:
df_restaurants.head(5)

Unnamed: 0,name,address,city,cuisine,unique_id
0,103 west,103 w. paces ferry rd.,atlanta,continental,675
1,20 mott,20 mott st. between bowery and pell st.,new york,asian,172
2,21 club,21 w. 52nd st.,new york,american,23
3,2223,2223 market st.,san francisco,american,453
4,9 jones street,9 jones st.,new york,american,173


The dataframe contains duplicates records that represent to the same 'real-world' restanrants. The column 'unique_id' was added for this purpose. Two records that are associated with the same attribute value for unique_id represents the same restaurant.

In [5]:
df_restaurants[df_restaurants.unique_id == '23']

Unnamed: 0,name,address,city,cuisine,unique_id
2,21 club,21 w. 52nd st.,new york,american,23
753,21 club,21 w. 52nd st.,new york city,american (new),23


In the above example, the two records share the same value for attributes 'name' and 'address'. However, they have slightly different values for the columns 'city' and 'cuisine'

In [6]:
df_restaurants[df_restaurants.unique_id == '22']

Unnamed: 0,name,address,city,cuisine,unique_id
744,yujean kangs gourmet chinese cuisine,67 n. raymond ave.,los angeles,asian,22
864,yujean kangs,67 n. raymond ave.,pasadena,chinese,22


In the above example, on the other hand, the two records are associated with different names, cities and cuisiones.

This file represents a simple example of datasets, on which we can experiment with th etechniques presented in the course to try identify duplicates, without using (that is relying on the values of) the column "unique_id".

In [7]:
# We start by adding a new column to identify the records (lines) in our dataframe
df_restaurants.insert(0,'record_ID', range(0, len(df_restaurants)))

In [8]:
df_restaurants.head(5)

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675
1,1,20 mott,20 mott st. between bowery and pell st.,new york,asian,172
2,2,21 club,21 w. 52nd st.,new york,american,23
3,3,2223,2223 market st.,san francisco,american,453
4,4,9 jones street,9 jones st.,new york,american,173


Exhaustive comparisons: every record is compared with every other record

We start by applying an exhaustive strategy whereby every record in the CSV file, is compared with every other record. 

The code below does this for us. In doing so, it uses the following rule:

For two records to match, i.e. refer to the same restaurant in the real world:
* The edit distance between the attribute name values of the two records needs to be smaller or equal to 3, and 
* they need to have the same value for the cuisine attribute.

In [9]:
df_restaurants[df_restaurants.record_ID.isin([43, 622])]

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
43,43,barbecue kitchen,1437 virginia ave.,atlanta,bbq,678
622,622,shaan,57 w. 48th st.,new york,asian,349


In [12]:
num_records = len(df_restaurants)
matches = []
matchescomplet = []

number_of_matches = 0
tokens1=[]
tokens2=[]
start = time.process_time()
for i in range(0,num_records):
    
    # Après tokenization , calcul du ngrams (n=1) pour le name qui servira pour la Jaccard distance, pour la ligne i
    tokens1name = nltk.word_tokenize(df_restaurants.iloc[i,1]) 
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    
    # Après tokenization , calcul du ngrams (n=1) pour l'adresse qui servira pour la Jaccard distance,, pour la ligne i
    tokens1adr = nltk.word_tokenize(df_restaurants.iloc[i,2]) 
    ng1_tokensadr = set(nltk.ngrams(tokens1adr, n=1))
    
    
    for j in range(i+1,num_records):
        
        # Après tokenization , calcul du ngrams (n=1) pour le name qui servira pour la Jaccard distance, , pour la ligne j
        tokens2name = nltk.word_tokenize( df_restaurants.iloc[j,1]) 
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        
        # Après tokenization , calcul du ngrams (n=1) pour le name qui servira pour la Jaccard distance, , pour la ligne j
        tokens2adr = nltk.word_tokenize( df_restaurants.iloc[j,2]) 
        ng2_tokensadr = set(nltk.ngrams(tokens2adr, n=1))
       
        # calcul de la Jaccard distance pour le name entre la ligne i et la ligne j ("item based" avec ngrams (n=1)) 
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)  # jaccard distance entre les ngram=1 des names
        
        # calcul de la Jaccard distance pour l'adresse entre la ligne i et la ligne j ("item based" avec ngrams (n=1)) 
        jd_ng1_ng2_adr = nltk.jaccard_distance(ng1_tokensadr, ng2_tokensadr)  # jaccard distance entre les ngram=1 des adresses
    
        # Rule for matching: 
        # disjonction entre une similarité entre les names (name_score<=1) 
        # et une similarité conjugée entre les adresses et les noms (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6)
        name_score = nltk.edit_distance(df_restaurants.iloc[i,1], df_restaurants.iloc[j,1])
        
        # Rule for matching: Distance between names is smaller or equal to 3 and the cuisine is the same 
        if (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6) or name_score<=1 :
            number_of_matches = number_of_matches +1 
            # matchescomplet.append((df_restaurants.iloc[i,0],df_restaurants.iloc[i,1], df_restaurants.iloc[i,2],df_restaurants.iloc[i,5], df_restaurants.iloc[j,0],df_restaurants.iloc[j,1], df_restaurants.iloc[j,2],df_restaurants.iloc[j,5]))
            matches.append((df_restaurants.iloc[i,0],df_restaurants.iloc[j,0]))

end = time.process_time()

print("Number of matches: {}".format(number_of_matches))
print("Processing time: {}".format(end - start))
for _ in matchescomplet:
     print(_)

Number of matches: 127
Processing time: 153.38521500000002


In [13]:
# quelques tests pour ajuster les critères de notre algorithme
name_score = nltk.edit_distance(df_restaurants.iloc[73,1], df_restaurants.iloc[763,1])
name_score   # 11
# comme on le voit ci-dessous, la différence est le mot "restaurant", l'edit distance est très importante (11), 
# on ne peut pas se baser dessus pour dire que c le même resto, il faut qu'on ajoute un critère "item based" 
# en plus du critère edit_distance name_score<=1

11

In [14]:
# qq tests pour ajuster les critères de notre alogorithme
df_restaurants[df_restaurants.record_ID.isin([73, 763])]

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
73,73,bones,3130 piedmont road,atlanta,american,76
763,763,bones restaurant,3130 piedmont rd. ne,atlanta,steakhouses,76


In [15]:
# name_score = nltk.edit_distance(df_restaurants.iloc[32,1], df_restaurants.iloc[759,1])
#print(name_score)
tokens1 = nltk.word_tokenize(df_restaurants.iloc[73,1]) 
tokens2 = nltk.word_tokenize( df_restaurants.iloc[763,1]) 
print(tokens1)
print(tokens2)
ng1_tokens = set(nltk.ngrams(tokens1, n=1))
ng2_tokens = set(nltk.ngrams(tokens2, n=1))
print(ng1_tokens)
print(ng2_tokens)

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
print(jd_sent_1_2)
# jd_ng1_ng2_adr <= 0.6,ce seuil de 0.6 suffira dire que les lignes 32 et 759 sont le même restaurant

['bones']
['bones', 'restaurant']
{('bones',)}
{('bones',), ('restaurant',)}
0.5


In [16]:
# adresse
#print(name_score)
tokens1 = nltk.word_tokenize(df_restaurants.iloc[73,2]) 
tokens2 = nltk.word_tokenize( df_restaurants.iloc[763,2]) 
print(tokens1)
print(tokens2)
ng1_tokens = set(nltk.ngrams(tokens1, n=1))
ng2_tokens = set(nltk.ngrams(tokens2, n=1))
print(ng1_tokens)
print(ng2_tokens)

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
print(jd_sent_1_2)
# ça ne passe pas , mais c pas grave car mettre le seuil à 0.67 va nous rajouter beaucoup de faux positifs 
# on a testé ce seuil plus élevé de 0.67

['3130', 'piedmont', 'road']
['3130', 'piedmont', 'rd', '.', 'ne']
{('road',), ('piedmont',), ('3130',)}
{('ne',), ('.',), ('rd',), ('piedmont',), ('3130',)}
0.6666666666666666


In [17]:
name_score = nltk.edit_distance(df_restaurants.iloc[6,1], df_restaurants.iloc[754,1])
name_score

0

In [18]:
# name_score = nltk.edit_distance(df_restaurants.iloc[32,1], df_restaurants.iloc[759,1])
#print(name_score)
tokens1 = nltk.word_tokenize(df_restaurants.iloc[6,1]) 
tokens2 = nltk.word_tokenize( df_restaurants.iloc[754,1]) 
print(tokens1)
print(tokens2)
ng1_tokens = set(nltk.ngrams(tokens1, n=1))
ng2_tokens = set(nltk.ngrams(tokens2, n=1))
print(ng1_tokens)
print(ng2_tokens)

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
print(jd_sent_1_2)

['abruzzi']
['abruzzi']
{('abruzzi',)}
{('abruzzi',)}
0.0


In [19]:
name_score = nltk.edit_distance(df_restaurants.iloc[6,1], df_restaurants.iloc[754,1])
name_score

0

In [20]:
# name_score = nltk.edit_distance(df_restaurants.iloc[32,1], df_restaurants.iloc[759,1])
#print(name_score)
tokens1 = nltk.word_tokenize(df_restaurants.iloc[32,1]) 
tokens2 = nltk.word_tokenize( df_restaurants.iloc[759,1]) 
print(tokens1)
print(tokens2)
ng1_tokens = set(nltk.ngrams(tokens1, n=1))
ng2_tokens = set(nltk.ngrams(tokens2, n=1))
print(ng1_tokens)
print(ng2_tokens)

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
print(jd_sent_1_2)

['arts', 'delicatessen']
['arts', 'deli']
{('arts',), ('delicatessen',)}
{('deli',), ('arts',)}
0.6666666666666666


In [21]:
# Display results
for match in matches:
    print("The following records {} and {} match".format(match[0],match[1]))
    print("The restaurants with the following names {} and {} match.".format(df_restaurants.iloc[match[0],1],df_restaurants.iloc[match[1],1]))
    print("The restaurants with the following addresses {} and {} match.".format(df_restaurants.iloc[match[0],2],df_restaurants.iloc[match[1],2]))
    print("\n")

The following records 2 and 753 match
The restaurants with the following names 21 club and 21 club match.
The restaurants with the following addresses  21 w. 52nd st. and  21 w. 52nd st. match.


The following records 6 and 754 match
The restaurants with the following names abruzzi and abruzzi match.
The restaurants with the following addresses  2355 peachtree rd.  peachtree battle shopping center and  2355 peachtree rd. ne match.


The following records 13 and 755 match
The restaurants with the following names alain rondelli and alain rondelli match.
The restaurants with the following addresses  126 clement st. and  126 clement st. match.


The following records 26 and 756 match
The restaurants with the following names aquavit and aquavit match.
The restaurants with the following addresses  13 w. 54th st. and  13 w. 54th st. match.


The following records 27 and 757 match
The restaurants with the following names aqua and aqua match.
The restaurants with the following addresses  252 ca

Note that the rule applied in the above code is not great. You may want to try other kind of distances, other thresholds, and other rules to identify matches.

# Assessing the quality of the results

To do so, we first need to compute the ground truth (that is the list of correct matches) considering the attribute unique_id.

In [22]:
ground_truth_matches = pd.read_csv("./restaurants.csv")

In [23]:
ground_truth_matches.insert(0, 'record_ID', range(0, len(ground_truth_matches)))

In [24]:
ground_truth_matches.head(5)

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675
1,1,20 mott,20 mott st. between bowery and pell st.,new york,asian,172
2,2,21 club,21 w. 52nd st.,new york,american,23
3,3,2223,2223 market st.,san francisco,american,453
4,4,9 jones street,9 jones st.,new york,american,173


In [25]:
ground_truth_matches = pd.merge(ground_truth_matches,
                                ground_truth_matches,
                                on = 'unique_id')

In [26]:
ground_truth_matches.head(5)

Unnamed: 0,record_ID_x,name_x,address_x,city_x,cuisine_x,unique_id,record_ID_y,name_y,address_y,city_y,cuisine_y
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675,0,103 west,103 w. paces ferry rd.,atlanta,continental
1,1,20 mott,20 mott st. between bowery and pell st.,new york,asian,172,1,20 mott,20 mott st. between bowery and pell st.,new york,asian
2,2,21 club,21 w. 52nd st.,new york,american,23,2,21 club,21 w. 52nd st.,new york,american
3,2,21 club,21 w. 52nd st.,new york,american,23,753,21 club,21 w. 52nd st.,new york city,american (new)
4,753,21 club,21 w. 52nd st.,new york city,american (new),23,2,21 club,21 w. 52nd st.,new york,american


In [27]:
len(ground_truth_matches)

1089

In [28]:
ground_truth_matches = ground_truth_matches.query('record_ID_x < record_ID_y')

In [29]:
ground_truth_matches.head(20)

Unnamed: 0,record_ID_x,name_x,address_x,city_x,cuisine_x,unique_id,record_ID_y,name_y,address_y,city_y,cuisine_y
3,2,21 club,21 w. 52nd st.,new york,american,23,753,21 club,21 w. 52nd st.,new york city,american (new)
10,6,abruzzi,2355 peachtree rd. peachtree battle shopping...,atlanta,italian,74,754,abruzzi,2355 peachtree rd. ne,atlanta,italian
20,13,alain rondelli,126 clement st.,san francisco,french,94,755,alain rondelli,126 clement st.,san francisco,french (new)
36,26,aquavit,13 w. 54th st.,new york,continental,24,756,aquavit,13 w. 54th st.,new york city,scandinavian
40,27,aqua,252 california st.,san francisco,seafood,95,757,aqua,252 california st.,san francisco,american (new)
46,30,arnie mortons of chicago,435 s. la cienega blv.,los angeles,american,0,758,arnie mortons of chicago,435 s. la cienega blvd.,los angeles,steakhouses
51,32,arts delicatessen,12224 ventura blvd.,studio city,american,1,759,arts deli,12224 ventura blvd.,studio city,delis
58,36,aureole,34 e. 61st st.,new york,american,25,760,aureole,34 e. 61st st.,new york city,american (new)
62,37,bacchanalia,3125 piedmont rd. near peachtree rd.,atlanta,international,75,761,bacchanalia,3125 piedmont rd.,atlanta,californian
101,73,bones,3130 piedmont road,atlanta,american,76,763,bones restaurant,3130 piedmont rd. ne,atlanta,steakhouses


In [30]:
ground_truth_matches = ground_truth_matches[['record_ID_x','record_ID_y']]

In [31]:
print(ground_truth_matches)

      record_ID_x  record_ID_y
3               2          753
10              6          754
20             13          755
36             26          756
40             27          757
...           ...          ...
1030          708          860
1034          709          861
1041          713          862
1053          722          863
1078          744          864

[112 rows x 2 columns]


In [32]:
print(len(ground_truth_matches))

112


In [33]:
print(len(matches))


127


In [34]:
matches_df = pd.DataFrame(matches)
matches_df.head(5)

Unnamed: 0,0,1
0,2,753
1,6,754
2,13,755
3,26,756
4,27,757


In [35]:
matches_df = pd.DataFrame(matches)
matches_df.columns= ['record_ID_x','record_ID_y']

In [36]:
# on s'assure que les couples record_ID_x et record_ID_y sont dans le bons sens (record_ID_x < record_ID_y)
# comme dans ground_truth
matches_df[matches_df['record_ID_x'] >= matches_df['record_ID_y'] ]
# 0 lignes trouvées , donc c OK.


Unnamed: 0,record_ID_x,record_ID_y


In [37]:
matches_df.head()

Unnamed: 0,record_ID_x,record_ID_y
0,2,753
1,6,754
2,13,755
3,26,756
4,27,757


In [38]:
diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')

In [39]:
diff_df.head(5)

Unnamed: 0,record_ID_x,record_ID_y,Exist
0,2,753,both
1,6,754,both
2,13,755,both
3,26,756,both
4,27,757,both


In [40]:
true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']

In [41]:
# les vrais duplicats que notre algo a pu détecter
true_positives.head()

Unnamed: 0,record_ID_x,record_ID_y,Exist
0,2,753,both
1,6,754,both
2,13,755,both
3,26,756,both
4,27,757,both


In [42]:
#Example of a true positive
df_restaurants[df_restaurants.record_ID.isin(['6','754'])]

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [43]:
# notre algo les a sortis comme restos en double mais c pas vrai
false_positives.head()

Unnamed: 0,record_ID_x,record_ID_y,Exist
112,55,56,right_only
113,87,88,right_only
114,95,180,right_only
115,96,196,right_only
116,116,125,right_only


In [44]:
# notre critère de duplicate :
# (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6) or (name_score<=1 and jd_ng1_ng2_adr <= 0.6) 
# eliminer grace jd_ng1_ng2_adr = 0.6666
df_restaurants[df_restaurants.record_ID.isin(['55','56'])]
# le name est le même donc l'algo dit que ce le même restaurant alors que ce n'est pas vrai.

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [45]:
# pareil c pas le même resto alors que l'algo les a retenu comme duplicate
# car les names diffèrent d'un seul caractère.
df_restaurants[df_restaurants.record_ID.isin(['87','88'])]

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [46]:
# name_score<=1
name_score = nltk.edit_distance(df_restaurants.iloc[87,1], df_restaurants.iloc[88,1])
name_score

1

In [47]:
# les vrais duplicates que l'algo n'a pas détecté
false_negatives.head()

Unnamed: 0,record_ID_x,record_ID_y,Exist
6,32,759,left_only
9,73,763,left_only
28,179,781,left_only
34,235,786,left_only
56,388,808,left_only


In [48]:
df_restaurants[df_restaurants.record_ID.isin(['32','759'])]

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [49]:
# faux négatif
# pour l'algo le 32 et le 759 c'est pas le même restaurant, pourtant c le même
# en effet les names diffèrents en lettres et en mots : 
# name_score > 1 et jd_ng1_ng2_name > 0.6 (ça suffit pour l'algo pour l'éliminer ) et en plus jd_ng1_ng2_adr > 0.6
name_score = nltk.edit_distance(df_restaurants.iloc[32,1], df_restaurants.iloc[759,1])
name_score

8

In [50]:
 # (jd_ng1_ng2_adr <= 0.6) and jd_ng1_ng2_name <= 0.6) or (name_score<=1)
    
# name_score = nltk.edit_distance(df_restaurants.iloc[32,1], df_restaurants.iloc[759,1])

tokens1 = nltk.word_tokenize(df_restaurants.iloc[32,1])   # name
tokens2 = nltk.word_tokenize( df_restaurants.iloc[759,1]) 
print(tokens1)
print(tokens2)
ng1_tokens = set(nltk.ngrams(tokens1, n=1))
ng2_tokens = set(nltk.ngrams(tokens2, n=1))
print(ng1_tokens)
print(ng2_tokens)

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
print(jd_sent_1_2)

['arts', 'delicatessen']
['arts', 'deli']
{('arts',), ('delicatessen',)}
{('deli',), ('arts',)}
0.6666666666666666


In [51]:
 # (jd_ng1_ng2_adr <= 0.6) and (name_score<=2 or jd_ng1_ng2_name <= 0.67)
    
# name_score = nltk.edit_distance(df_restaurants.iloc[32,1], df_restaurants.iloc[759,1])
#print(name_score)
tokens1 = nltk.word_tokenize(df_restaurants.iloc[73,2])   # adresse 
tokens2 = nltk.word_tokenize( df_restaurants.iloc[763,2]) 
print(tokens1)
print(tokens2)
ng1_tokens = set(nltk.ngrams(tokens1, n=1))
ng2_tokens = set(nltk.ngrams(tokens2, n=1))
print(ng1_tokens)
print(ng2_tokens)

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
print(jd_sent_1_2)

['3130', 'piedmont', 'road']
['3130', 'piedmont', 'rd', '.', 'ne']
{('road',), ('piedmont',), ('3130',)}
{('ne',), ('.',), ('rd',), ('piedmont',), ('3130',)}
0.6666666666666666


In [52]:
print(len(ground_truth_matches))
print(len(matches_df))
print(len(true_positives) , 'true_positives')
print(len(false_positives) ,'false_positives')
print(len(false_negatives)  , 'false_negatives')

# len(true_positives)  +  len(false_negatives) = len(ground_truth_matches)

# len(matches_df)) - len(false_positif) + len(false_negatives)     = ground_truth_matches

112
127
99 true_positives
28 false_positives
13 false_negatives


In [53]:
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print(precision)

0.7795275590551181


Note that if you are using pyton 2.7 (instead of Python 3), you would need to convert integers to float prior to performing the division

In [54]:
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print(recall)

0.8839285714285714


In [55]:
f_measure = 2*(precision*recall)/(precision+recall)
print(f_measure)

0.8284518828451883


# Windowing (SNM) method

In [56]:
df_restaurants.head()

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675
1,1,20 mott,20 mott st. between bowery and pell st.,new york,asian,172
2,2,21 club,21 w. 52nd st.,new york,american,23
3,3,2223,2223 market st.,san francisco,american,453
4,4,9 jones street,9 jones st.,new york,american,173


In [57]:
# 841 842
# qq tests pour choisir sur quel champ on va faire le sort 
# le sorted name parait intéressant
df_restaurants.sort_values(by=['name']).head(20)

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675
1,1,20 mott,20 mott st. between bowery and pell st.,new york,asian,172
753,753,21 club,21 w. 52nd st.,new york city,american (new),23
2,2,21 club,21 w. 52nd st.,new york,american,23
3,3,2223,2223 market st.,san francisco,american,453
4,4,9 jones street,9 jones st.,new york,american,173
5,5,abbey,163 ponce de leon ave.,atlanta,international,379
754,754,abruzzi,2355 peachtree rd. ne,atlanta,italian,74
6,6,abruzzi,2355 peachtree rd. peachtree battle shopping...,atlanta,italian,74
7,7,acquarello,1722 sacramento st.,san francisco,italian,454


### Le tri est fait dans ce qui suit selon le champ "name"

In [58]:
window = 50   # 

# tri par name car c ce qui permet d'avoir des resto en double les plus proches possibles 

df_restaurants= df_restaurants.sort_values(by=['name'])  

number_of_matchesw = 0
num_records = len(df_restaurants)
matchesw = []
matchescompletw = []

start = time.process_time()
for i in range(0,min(window,len(df_restaurants))):
    
    tokens1name = nltk.word_tokenize(df_restaurants.iloc[i,1]) 
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    
    tokens1adr = nltk.word_tokenize(df_restaurants.iloc[i,2]) 
    ng1_tokensadr = set(nltk.ngrams(tokens1adr, n=1))
    
    
    for j in range(i+1,min(window,len(df_restaurants))):
        tokens2name = nltk.word_tokenize( df_restaurants.iloc[j,1]) 
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        
        
        tokens2adr = nltk.word_tokenize( df_restaurants.iloc[j,2]) 
        ng2_tokensadr = set(nltk.ngrams(tokens2adr, n=1))
#         print(tokens1)
#         print(tokens2)       
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)  # jaccard distance entre les ngram=1 des names
        jd_ng1_ng2_adr = nltk.jaccard_distance(ng1_tokensadr, ng2_tokensadr)  # jaccard distance entre les ngram=1 des adresses
    
        name_score = nltk.edit_distance(df_restaurants.iloc[i,1], df_restaurants.iloc[j,1])
        
        # Rule for matching: Distance between names is smaller or equal to 3 and the cuisine is the same 
        if (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6) or name_score<=1 :
            number_of_matchesw = number_of_matchesw +1 
            # matchescomplet.append((df_restaurants.iloc[i,0],df_restaurants.iloc[i,1], df_restaurants.iloc[i,2],df_restaurants.iloc[i,5], df_restaurants.iloc[j,0],df_restaurants.iloc[j,1], df_restaurants.iloc[j,2],df_restaurants.iloc[j,5]))
            matchesw.append((df_restaurants.iloc[i,0],df_restaurants.iloc[j,0]))
            matchescompletw.append((df_restaurants.iloc[i,0],df_restaurants.iloc[i,1], df_restaurants.iloc[i,2],df_restaurants.iloc[i,5], df_restaurants.iloc[j,0],df_restaurants.iloc[j,1], df_restaurants.iloc[j,2],df_restaurants.iloc[j,5]))
                     
            
            
for i in range(window,len(df_restaurants)):
    
    tokens1name = nltk.word_tokenize(df_restaurants.iloc[i,1]) 
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    
    tokens1adr = nltk.word_tokenize(df_restaurants.iloc[i,2]) 
    ng1_tokensadr = set(nltk.ngrams(tokens1adr, n=1))
    
    
    for j in range(i-window+1,i):
        tokens2name = nltk.word_tokenize( df_restaurants.iloc[j,1]) 
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        
        
        tokens2adr = nltk.word_tokenize( df_restaurants.iloc[j,2]) 
        ng2_tokensadr = set(nltk.ngrams(tokens2adr, n=1))
     
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)  # jaccard distance entre les ngram=1 des names
        jd_ng1_ng2_adr = nltk.jaccard_distance(ng1_tokensadr, ng2_tokensadr)  # jaccard distance entre les ngram=1 des adresses
    
        name_score = nltk.edit_distance(df_restaurants.iloc[i,1], df_restaurants.iloc[j,1])
        
        # Rule for matching: Distance between names is smaller or equal to 3 and the cuisine is the same 
        if (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6) or name_score<=1 :
            number_of_matchesw = number_of_matchesw +1 
            # matchescomplet.append((df_restaurants.iloc[i,0],df_restaurants.iloc[i,1], df_restaurants.iloc[i,2],df_restaurants.iloc[i,5], df_restaurants.iloc[j,0],df_restaurants.iloc[j,1], df_restaurants.iloc[j,2],df_restaurants.iloc[j,5]))
            matchesw.append((df_restaurants.iloc[i,0],df_restaurants.iloc[j,0]))
            matchescompletw.append((df_restaurants.iloc[i,0],df_restaurants.iloc[i,1], df_restaurants.iloc[i,2],df_restaurants.iloc[i,5], df_restaurants.iloc[j,0],df_restaurants.iloc[j,1], df_restaurants.iloc[j,2],df_restaurants.iloc[j,5]))
            
end = time.process_time()

print("Number of matches: {}".format(number_of_matchesw))
print("Processing time: {}".format(end - start))            
for _ in matchescompletw:
     print(_)  
# for _ in matches:
#      print(_)          

Number of matches: 112
Processing time: 18.117661999999996
(753, '21 club', ' 21 w. 52nd st.', '23', 2, '21 club', ' 21 w. 52nd st.', '23')
(754, 'abruzzi', ' 2355 peachtree rd. ne', '74', 6, 'abruzzi', ' 2355 peachtree rd.  peachtree battle shopping center', '74')
(755, 'alain rondelli', ' 126 clement st.', '94', 13, 'alain rondelli', ' 126 clement st.', '94')
(27, 'aqua', ' 252 california st.', '95', 757, 'aqua', ' 252 california st.', '95')
(26, 'aquavit', ' 13 w. 54th st.', '24', 756, 'aquavit', ' 13 w. 54th st.', '24')
(30, 'arnie mortons of chicago', ' 435 s. la cienega blv.', '0', 758, 'arnie mortons of chicago', ' 435 s. la cienega blvd.', '0')
(36, 'aureole', ' 34 e. 61st st.', '25', 760, 'aureole', ' 34 e. 61st st.', '25')
(37, 'bacchanalia', ' 3125 piedmont rd.  near peachtree rd.', '75', 761, 'bacchanalia', ' 3125 piedmont rd.', '75')
(56, 'bertolinis', ' 3570 las vegas blvd. s', '427', 55, 'bertolinis', ' 3500 peachtree rd.  phipps plaza', '385')
(76, 'boulevard', ' 1 miss

In [59]:
# Display results
for match in matchesw:
    print("The following records {} and {} match".format(match[0],match[1]))
    print("The restaurants with the following names {} and {} match.".format(df_restaurants.iloc[match[0],1],df_restaurants.iloc[match[1],1]))
    print("The restaurants with the following addresses {} and {} match.".format(df_restaurants.iloc[match[0],2],df_restaurants.iloc[match[1],2]))
    print("\n")

The following records 753 and 2 match
The restaurants with the following names stingray and 21 club match.
The restaurants with the following addresses  428 amsterdam ave.  between 80th and 81st sts. and  21 w. 52nd st. match.


The following records 754 and 6 match
The restaurants with the following names stoyanofs cafe and abbey match.
The restaurants with the following addresses  1240 9th ave. and  163 ponce de leon ave. match.


The following records 755 and 13 match
The restaurants with the following names straits cafe and agrotikon match.
The restaurants with the following addresses  3300 geary blvd. and  322 e. 14 st.  between 1st and 2nd aves. match.


The following records 27 and 757 match
The restaurants with the following names antonios and sundown cafe match.
The restaurants with the following addresses  3700 w. flamingo and  2165 cheshire bridge rd. match.


The following records 26 and 756 match
The restaurants with the following names anthonys and stringers fish camp and

In [60]:
matchesw_df = pd.DataFrame(matchesw)
matchesw_df.columns= ['record_ID_x','record_ID_y']

matchesw_df['MIN'] = matchesw_df[['record_ID_x','record_ID_y']].min(axis=1)
matchesw_df['MAX'] = matchesw_df[['record_ID_x','record_ID_y']].max(axis=1)
matchesw_df=matchesw_df[['MIN','MAX']]
matchesw_df.columns=['record_ID_x','record_ID_y']
matchesw_df


diffw_df = pd.merge(ground_truth_matches, matchesw_df, how='outer', indicator='Exist')
true_positivesw = diffw_df[diffw_df.Exist=='both']
false_positivesw = diffw_df[diffw_df.Exist=='right_only']
false_negativesw = diffw_df[diffw_df.Exist=='left_only']
precisionw = len(true_positivesw)/(len(true_positivesw)+ len(false_positivesw))
print(precisionw)
recallw = len(true_positivesw)/(len(true_positivesw)+ len(false_negativesw))
print(recallw)
f_measurew = 2*(precisionw*recallw)/(precisionw+recallw)
print(f_measurew)

0.8482142857142857
0.8482142857142857
0.8482142857142857


In [61]:
print(len(ground_truth_matches))
print(len(matchesw_df))
print(len(true_positivesw))
print(len(false_positivesw))
print(len(false_negativesw))  
# len(true_positives)  +  len(false_negatives) = len(ground_truth_matches)
# len(matches_df)) - len(false_positif) + len(false_negatives)     = ground_truth_matches

112
112
95
17
17


It is worth noting that in the above code, we do not implement the SNM algorithm in its entirety. In particular, we do not implement the last phase of inferring matches using transitivity

# Blocking method

In [62]:
df_restaurants = pd.read_csv("./restaurants.csv")

In [63]:
len(df_restaurants)

865

In [64]:
df_restaurants.head()

Unnamed: 0,name,address,city,cuisine,unique_id
0,103 west,103 w. paces ferry rd.,atlanta,continental,675
1,20 mott,20 mott st. between bowery and pell st.,new york,asian,172
2,21 club,21 w. 52nd st.,new york,american,23
3,2223,2223 market st.,san francisco,american,453
4,9 jones street,9 jones st.,new york,american,173


In [65]:
# We start by adding a new column to identify the records (lines) in our dataframe
df_restaurants.insert(0,'record_ID', range(0, len(df_restaurants)))

In [66]:
# The blocks correspond to resturants that are located in the same citydf_restaurants.loc[df_restaurants['city']==' atlanta']
df_restaurants.loc[df_restaurants['city']==' atlanta']

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675
5,5,abbey,163 ponce de leon ave.,atlanta,international,379
6,6,abruzzi,2355 peachtree rd. peachtree battle shopping...,atlanta,italian,74
15,15,alecks barbecue heaven,783 martin luther king jr. dr.,atlanta,barbecue,380
17,17,alons at the terrace,659 peachtree st.,atlanta,sandwiches,676
...,...,...,...,...,...,...
841,841,ritz-carlton cafe (buckhead),3434 peachtree rd. ne,atlanta,american (new),89
842,842,ritz-carlton dining room (buckhead),3434 peachtree rd. ne,atlanta,american (new),90
844,844,ritz-carlton restaurant,181 peachtree st.,atlanta,french (classic),91
858,858,toulouse,293-b peachtree rd.,atlanta,french (new),92


In [67]:
df_restaurants.loc[df_restaurants['city'].str.strip()=='atlanta']

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675
5,5,abbey,163 ponce de leon ave.,atlanta,international,379
6,6,abruzzi,2355 peachtree rd. peachtree battle shopping...,atlanta,italian,74
15,15,alecks barbecue heaven,783 martin luther king jr. dr.,atlanta,barbecue,380
17,17,alons at the terrace,659 peachtree st.,atlanta,sandwiches,676
...,...,...,...,...,...,...
841,841,ritz-carlton cafe (buckhead),3434 peachtree rd. ne,atlanta,american (new),89
842,842,ritz-carlton dining room (buckhead),3434 peachtree rd. ne,atlanta,american (new),90
844,844,ritz-carlton restaurant,181 peachtree st.,atlanta,french (classic),91
858,858,toulouse,293-b peachtree rd.,atlanta,french (new),92


In [68]:
# on va créer un dict "df_restov" des restaurants de chaque ville
# pour une clé= ville, la valeur du dict serait égale à un dataframe représentant les restos de cette ville
df_restov= {}
for ville in df_restaurants['city'].unique():
    
    df_restov[ville]   = df_restaurants.loc[df_restaurants['city']==ville]
    num_records = len(df_restov[ville])
    print(ville)   # on affiche la ville
    print(num_records) # on affiche le nombre de restos par ville



 atlanta
120
 new york
250
 san francisco
148
 los angeles
74
 new york city
88
 las vegas
63
 west la
4
 studio city
5
 westlake village
2
 beverly hills
8
 malibu
4
 santa monica
14
 pasadena
6
 northridge
1
 mar vista
1
 sherman oaks
3
 venice
3
 marietta
2
 la
15
 redondo beach
1
 w. hollywood
6
 westwood
2
 queens
5
 hollywood
2
 pacific palisades
1
 roswell
2
 bel air
2
 smyrna
1
 duluth
2
 culver city
1
 long beach
1
 decatur
2
 century city
1
 st. boyle hts.
1
 brooklyn
6
 rancho park
1
 st. hermosa beach
1
 marina del rey
1
 encino
2
 monterey park
1
 toluca lake
1
 chinatown
2
 burbank
1
 seal beach
1
 brentwood
1
 manhattan beach
1
 los feliz
2
 college park
1
 glendale
1
city
1


In [69]:
# on vérifie  pour atlanta que ça marche bien, on a bien le dataframe qu'on voudrait.
print(type(df_restov[" atlanta"]))
print(df_restov[" atlanta"])


<class 'pandas.core.frame.DataFrame'>
     record_ID                                 name  \
0            0                             103 west   
5            5                                abbey   
6            6                              abruzzi   
15          15               alecks barbecue heaven   
17          17                 alons at the terrace   
..         ...                                  ...   
841        841         ritz-carlton cafe (buckhead)   
842        842  ritz-carlton dining room (buckhead)   
844        844              ritz-carlton restaurant   
858        858                             toulouse   
862        862                       veni vidi vici   

                                               address      city  \
0                               103 w. paces ferry rd.   atlanta   
5                               163 ponce de leon ave.   atlanta   
6     2355 peachtree rd.  peachtree battle shopping...   atlanta   
15                      783 m

In [70]:
df_restov[" atlanta"].head()

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675
5,5,abbey,163 ponce de leon ave.,atlanta,international,379
6,6,abruzzi,2355 peachtree rd. peachtree battle shopping...,atlanta,italian,74
15,15,alecks barbecue heaven,783 martin luther king jr. dr.,atlanta,barbecue,380
17,17,alons at the terrace,659 peachtree st.,atlanta,sandwiches,676


In [71]:
# on testel'algo précédent sur juste un dataframe celui des restos de " atlanta"  (avec un espace devant)
num_records = len(df_restov[" atlanta"])
amatches = []
amatchescomplet = []

anumber_of_matches = 0
tokens1=[]
tokens2=[]
start = time.process_time()
for i in range(0,num_records):
    
    # Après tokenization , calcul du ngrams (n=1) pour le name qui servira pour la Jaccard distance, pour la ligne i
    tokens1name = nltk.word_tokenize(df_restov[" atlanta"].iloc[i,1]) 
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    
    # Après tokenization , calcul du ngrams (n=1) pour l'adresse qui servira pour la Jaccard distance,, pour la ligne i
    tokens1adr = nltk.word_tokenize(df_restov[" atlanta"].iloc[i,2]) 
    ng1_tokensadr = set(nltk.ngrams(tokens1adr, n=1))
    
    
    for j in range(i+1,num_records):
        
        # Après tokenization , calcul du ngrams (n=1) pour le name qui servira pour la Jaccard distance, , pour la ligne j
        tokens2name = nltk.word_tokenize( df_restov[" atlanta"].iloc[j,1]) 
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        
        # Après tokenization , calcul du ngrams (n=1) pour le name qui servira pour la Jaccard distance, , pour la ligne j
        tokens2adr = nltk.word_tokenize( df_restov[" atlanta"].iloc[j,2]) 
        ng2_tokensadr = set(nltk.ngrams(tokens2adr, n=1))
     
        # calcul de la Jaccard distance pour le name entre la ligne i et la ligne j ("item based" avec ngrams (n=1)) 
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
        
        # calcul de la Jaccard distance pour l'adresse entre la ligne i et la ligne j ("item based" avec ngrams (n=1)) 
        jd_ng1_ng2_adr = nltk.jaccard_distance(ng1_tokensadr, ng2_tokensadr)  
    
        name_score = nltk.edit_distance(df_restov[" atlanta"].iloc[i,1], df_restov[" atlanta"].iloc[j,1])
        
        # Rule for matching: 
        # disjonction entre une similarité entre les names (name_score<=1) 
        # et une similarité conjugée entre les adresses et les noms (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6)
        if (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6) or name_score<=1 :
            anumber_of_matches = anumber_of_matches +1 
            matchescomplet.append((df_restov[" atlanta"].iloc[i,0],df_restov[" atlanta"].iloc[i,1], \
            df_restov[" atlanta"].iloc[i,2],df_restov[" atlanta"].iloc[i,3], df_restov[" atlanta"].iloc[i,5], \
            df_restov[" atlanta"].iloc[j,0],df_restov[" atlanta"].iloc[j,1], df_restov[" atlanta"].iloc[j,2], \
            df_restov[" atlanta"].iloc[j,3],df_restov[" atlanta"].iloc[j,5]))
            amatches.append((df_restov[" atlanta"].iloc[i,0],df_restov[" atlanta"].iloc[j,0]))

end = time.process_time()

print("Number of matches: {}".format(anumber_of_matches))
print("Processing time: {}".format(end - start))
for _ in amatchescomplet:
     print(_)

Number of matches: 21
Processing time: 3.543459000000013


In [72]:
# nous allons refaire le dict mais en éliminant les espaces saisis avant et après chaque ville
# par précaution pour éviter des villes en double
# et nous allons imprimer le nombre de restos par ville.

df_restov={}
cumul= 0
# il faut enlever les espaces au début et à la fin de chaque ville dans le dataframe, 
# sinon on va rater des restos en double car ils ne seront pas dans le même block.

for ville in df_restaurants['city'].str.strip().unique():   
     print(ville)
     df_restov[ville]   = df_restaurants.loc[df_restaurants['city'].str.strip()==ville]
     print(len(df_restov[ville]))
     cumul += len(df_restov[ville])

print(cumul)
# on vérifie qu'on retrouve bien un total de 865 restaurants.

atlanta
120
new york
250
san francisco
148
los angeles
74
new york city
88
las vegas
63
west la
4
studio city
5
westlake village
2
beverly hills
8
malibu
4
santa monica
14
pasadena
6
northridge
1
mar vista
1
sherman oaks
3
venice
3
marietta
2
la
15
redondo beach
1
w. hollywood
6
westwood
2
queens
5
hollywood
2
pacific palisades
1
roswell
2
bel air
2
smyrna
1
duluth
2
culver city
1
long beach
1
decatur
2
century city
1
st. boyle hts.
1
brooklyn
6
rancho park
1
st. hermosa beach
1
marina del rey
1
encino
2
monterey park
1
toluca lake
1
chinatown
2
burbank
1
seal beach
1
brentwood
1
manhattan beach
1
los feliz
2
college park
1
glendale
1
city
1
865


In [73]:
# Généralisation de la BLOCKING METHOD à toutes les villes 
bmatches = []
bmatchescomplet = []
bnumber_of_matches = 0
start = time.process_time()
    
for ville in df_restaurants['city'].str.strip().unique():
        # affichage de la ville et du nombre de restos par ville
        # pour les matcher entre eux
        print(ville)  
        num_records = len(df_restov[ville])
        print(num_records)
        
        tokens1=[]
        tokens2=[]
       
        for i in range(0,num_records):

            tokens1name = nltk.word_tokenize(df_restov[ville].iloc[i,1]) 
            ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))

            tokens1adr = nltk.word_tokenize(df_restov[ville].iloc[i,2]) 
            ng1_tokensadr = set(nltk.ngrams(tokens1adr, n=1))


            for j in range(i+1,num_records):

                tokens2name = nltk.word_tokenize( df_restov[ville].iloc[j,1]) 
                ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))


                tokens2adr = nltk.word_tokenize( df_restov[ville].iloc[j,2]) 
                ng2_tokensadr = set(nltk.ngrams(tokens2adr, n=1))

                jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)  # jaccard distance entre les ngram=1 des names
                jd_ng1_ng2_adr = nltk.jaccard_distance(ng1_tokensadr, ng2_tokensadr)  # jaccard distance entre les ngram=1 des adresses

                name_score = nltk.edit_distance(df_restov[ville].iloc[i,1], df_restov[ville].iloc[j,1])

                # Rule for matching: Item based Jaccard Distance with ngram=1 between adresses and between names or edit distance between names 
                if (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6) or name_score<=1 :
                    bnumber_of_matches = bnumber_of_matches +1 
                    bmatchescomplet.append((df_restov[ville].iloc[i,0],df_restov[ville].iloc[i,1], \
                    df_restov[ville].iloc[i,2],df_restov[ville].iloc[i,3], df_restov[ville].iloc[i,5], \
                    df_restov[ville].iloc[j,0],df_restov[ville].iloc[j,1], df_restov[ville].iloc[j,2], \
                    df_restov[ville].iloc[j,3],df_restov[ville].iloc[j,5]))
                    bmatches.append((df_restov[ville].iloc[i,0],df_restov[ville].iloc[j,0]))

end = time.process_time()

print("Number of matches: {}".format(bnumber_of_matches))
print("Processing time: {}".format(end - start))
# for _ in matchescomplet:
#        print(_)
   

atlanta
120
new york
250
san francisco
148
los angeles
74
new york city
88
las vegas
63
west la
4
studio city
5
westlake village
2
beverly hills
8
malibu
4
santa monica
14
pasadena
6
northridge
1
mar vista
1
sherman oaks
3
venice
3
marietta
2
la
15
redondo beach
1
w. hollywood
6
westwood
2
queens
5
hollywood
2
pacific palisades
1
roswell
2
bel air
2
smyrna
1
duluth
2
culver city
1
long beach
1
decatur
2
century city
1
st. boyle hts.
1
brooklyn
6
rancho park
1
st. hermosa beach
1
marina del rey
1
encino
2
monterey park
1
toluca lake
1
chinatown
2
burbank
1
seal beach
1
brentwood
1
manhattan beach
1
los feliz
2
college park
1
glendale
1
city
1
Number of matches: 67
Processing time: 22.973241


#### Rappel des résultats de l'algo original sans blocking:
####  Number of matches: 127
#### Processing time: 167.984375

#### les infos de l'algo avec  blocking ci-dessus
#### Number of matches: 67
#### Processing time: 25.6875



In [74]:
import pandas as pd
ground_truth_matches = pd.read_csv("./restaurants.csv")
len(ground_truth_matches)

865

In [75]:
ground_truth_matches.insert(0, 'record_ID', range(0, len(ground_truth_matches)))

In [76]:
ground_truth_matches = pd.merge(ground_truth_matches,
                                ground_truth_matches,
                                on = 'unique_id')

In [77]:
ground_truth_matches.head(5)

Unnamed: 0,record_ID_x,name_x,address_x,city_x,cuisine_x,unique_id,record_ID_y,name_y,address_y,city_y,cuisine_y
0,0,103 west,103 w. paces ferry rd.,atlanta,continental,675,0,103 west,103 w. paces ferry rd.,atlanta,continental
1,1,20 mott,20 mott st. between bowery and pell st.,new york,asian,172,1,20 mott,20 mott st. between bowery and pell st.,new york,asian
2,2,21 club,21 w. 52nd st.,new york,american,23,2,21 club,21 w. 52nd st.,new york,american
3,2,21 club,21 w. 52nd st.,new york,american,23,753,21 club,21 w. 52nd st.,new york city,american (new)
4,753,21 club,21 w. 52nd st.,new york city,american (new),23,2,21 club,21 w. 52nd st.,new york,american


In [78]:
ground_truth_matches = ground_truth_matches.query('record_ID_x < record_ID_y')
len(ground_truth_matches)

112

In [79]:
ground_truth_matches = ground_truth_matches[['record_ID_x','record_ID_y']]

In [80]:
bmatches_df = pd.DataFrame(bmatches)
bmatches_df.columns= ['record_ID_x','record_ID_y']
bmatches_df.head()

Unnamed: 0,record_ID_x,record_ID_y
0,6,754
1,37,761
2,79,765
3,92,766
4,96,196


In [81]:
# on s'assure que les couples record_ID_x et record_ID_y sont dans le bons sens (record_ID_x < record_ID_y)
bmatches_df[bmatches_df['record_ID_x'] >= bmatches_df['record_ID_y'] ]
# 0 lignes trouvées , donc c OK.

Unnamed: 0,record_ID_x,record_ID_y


In [82]:
diff_df = pd.merge(ground_truth_matches, bmatches_df, how='outer', indicator='Exist')
diff_df.head()

Unnamed: 0,record_ID_x,record_ID_y,Exist
0,2,753,left_only
1,6,754,both
2,13,755,both
3,26,756,left_only
4,27,757,both


In [83]:
btrue_positives = diff_df[diff_df.Exist=='both']
bfalse_positives = diff_df[diff_df.Exist=='right_only']
bfalse_negatives = diff_df[diff_df.Exist=='left_only']


In [84]:
true_positives.head()

Unnamed: 0,record_ID_x,record_ID_y,Exist
0,2,753,both
1,6,754,both
2,13,755,both
3,26,756,both
4,27,757,both


In [85]:
# un vrai positif: c un vrai couple de restos en double qui a été détecté par notre algo sous forme de blocking method.
# en effet il vérifie le critère de name (edit_distance=0) et en plus les 2 restos se trouve dans la même ville d'atlanta.
df_restaurants[df_restaurants.record_ID.isin(['6','754'])]

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [86]:
# les couples détectés par notre algo comme des doubles mais à tort, ce ne sont pas des doubles.
false_positives.head()

Unnamed: 0,record_ID_x,record_ID_y,Exist
112,55,56,right_only
113,87,88,right_only
114,95,180,right_only
115,96,196,right_only
116,116,125,right_only


In [87]:
df_restaurants[df_restaurants.record_ID.isin(['96','196'])]
# ce couple n'est pas dans le ground_truth car unique_id différent
# mais il est dans le bmatches_df , (jd_ng1_ng2_adr <= 0.6 and jd_ng1_ng2_name <= 0.6) 
# cad les names sont proches pour la jaccard distance item based
# et les adresses sont proches pour la jaccard distance item based.
# et en plus ils se trouvent dans la même ville atlanta (blocking method)

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [88]:
# les couples de restos en double mais qui ne sont pas détectés par notre algo comme des doubles.
false_negatives.head()

Unnamed: 0,record_ID_x,record_ID_y,Exist
6,32,759,left_only
9,73,763,left_only
28,179,781,left_only
34,235,786,left_only
56,388,808,left_only


In [89]:
df_restaurants[df_restaurants.record_ID.isin(['2','753'])]
# ce couple est dans le ground_truth car même unique_id 
# mais il n'est pas dans le matches_df, malgré qu' ils ont le même name et la  même adresse (dans l'algo les détecte bien)
# mais le Blocking method ne permet pas à l'algo de les matcher car ils sont considérés ayant des villes différentes :
# 'new york' et 'new york city'  , à cause d'une mauvaise saisie de la ville.

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [90]:
len(bfalse_negatives) # y a beaucoup de false_ngatives par rapport à l'algo dans Blocking method (on avait 13)

58

In [91]:
# false negative
df_restaurants[df_restaurants.record_ID.isin(['26','756'])]
# du au blocking method : new yor et new york city

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [92]:
# false negative
df_restaurants[df_restaurants.record_ID.isin(['32','759'])]
# dû aux matching imprécis de l'algo 

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [93]:
# false negative
df_restaurants[df_restaurants.record_ID.isin(['36','760'])]
# du au blocking method : new yor et new york city

Unnamed: 0,record_ID,name,address,city,cuisine,unique_id


In [94]:
print(len(ground_truth_matches))
print(len(bmatches_df))
print(len(btrue_positives) , 'true_positives')
print(len(bfalse_positives) ,'false_positives')
print(len(bfalse_negatives)  , 'false_negatives')

# len(true_positives)  +  len(false_negatives) = len(ground_truth_matches)

# len(matches_df)) - len(false_positif) + len(false_negatives)     = ground_truth_matches

112
67
54 true_positives
13 false_positives
58 false_negatives


In [95]:
bprecision = len(btrue_positives)/(len(btrue_positives)+ len(bfalse_positives))
print(bprecision)

0.8059701492537313


In [96]:
brecall = len(btrue_positives)/(len(btrue_positives)+ len(bfalse_negatives))
print(brecall)
# recall faible car y a beaucoup de false negatives
# y a des duplicates que l'algo avec Blocking method n'a pas détecté car saisie à tort dans des villes différentes
# surtout new york et new york city 

0.48214285714285715


In [97]:
bf_measure = 2*(bprecision*brecall)/(bprecision+brecall)
print(bf_measure)

0.6033519553072626


### chiffres de l'algo original sans blocking method
### precision: 0.7795

### recall : 0.8839

### f_measure :0.82845