## Sequence Based Metrics
---

### Entity Resolution using PyPI Edit Distance

In [None]:
!pip install networkx==2.5
!pip install edit-distance==1.0.4
!pip install pandas==1.1.3

In [1]:
import edit_distance  # Levenshtein distance
import pandas as pd

In [2]:
actual_names = ['Los Angeles','New York City','Bangalore','Mumbai','Chennai','Kolkata','New Delhi',\
                'Saint Petersburg','Melbourne','Gothenburg','Vienna','Barcelona','Las Vegas']

input_names = ['City of Los Angeles','New York','Bengaluru','Bombay','Madras','Calutta','Delhi',\
               'St. Petersburg','Melborne','Goteborg','Wien','Barca', 'Las Vegas']

In [3]:
def edit_dist_metrics(actual_names, input_names, threshold):
    '''
    input:  list : actual_names list
            list : input_names list
            float : threshold value for similarity score 0 <= threshold <= 1
            
    The function compares every string in actual_names with every string in input_names
    using edit_distance and computes a similarity score. If the score is greater than or equal to
    the given threshold, the two strings are matched as the same entity and checked against the
    ground truth (names at the same list index refer to the same entity).
    The matches, precision and recall are printed.
    '''
    res = []
    for i, a_name in enumerate(actual_names):
        for j, i_name in enumerate(input_names):
            r = edit_distance.SequenceMatcher(a_name.lower(), i_name.lower()).ratio()
            if r >= threshold:
                res.append([i_name, a_name, r, i==j])

    df = pd.DataFrame(res, columns=['Input Name','Predicted Name','Similarity Score','Ground Truth'])
    precision = round(sum(df['Ground Truth'])/len(df),3)
    recall = round(sum(df['Ground Truth'])/len(actual_names),3)
    print(df,'\n')
    print("Precision: "+str(precision))
    print("Recall: "+str(recall))
    return

### The entity resolution is run for different threshold values

### The function takes a threshold value: the similarity score at or above which a pair of strings is predicted to be a match. The threshold can range from 0 to 1, where 1 is a perfect match.
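For intuition about the kind of score being thresholded, here is a minimal pure-Python sketch of Levenshtein distance and one common way to normalise it to the range 0 to 1. The helper names `levenshtein` and `similarity` are illustrative, and this normalisation is an assumption: it is not guaranteed to equal the `ratio()` returned by `edit_distance.SequenceMatcher`, so its scores may differ from the tables in this notebook.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Case-insensitive similarity: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a.lower(), b.lower()) / longest


print(levenshtein("melborne", "melbourne"))  # 1 (insert the missing 'u')
print(similarity("Las Vegas", "las vegas"))  # 1.0
```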

In [4]:
edit_dist_metrics(actual_names,input_names,0.4)

             Input Name    Predicted Name  Similarity Score  Ground Truth
0   City of Los Angeles       Los Angeles          0.733333          True
1             Las Vegas       Los Angeles          0.500000         False
2              New York     New York City          0.761905          True
3             Bengaluru         Bangalore          0.666667          True
4                 Barca         Bangalore          0.428571         False
5                Bombay            Mumbai          0.500000          True
6               Calutta           Kolkata          0.428571          True
7              New York         New Delhi          0.470588         False
8                 Delhi         New Delhi          0.714286          True
9        St. Petersburg  Saint Petersburg          0.800000          True
10             Goteborg  Saint Petersburg          0.416667         False
11             Melborne         Melbourne          0.941176          True
12       St. Petersburg        Gothenb

In [5]:
edit_dist_metrics(actual_names,input_names,0.5)

             Input Name    Predicted Name  Similarity Score  Ground Truth
0   City of Los Angeles       Los Angeles          0.733333          True
1             Las Vegas       Los Angeles          0.500000         False
2              New York     New York City          0.761905          True
3             Bengaluru         Bangalore          0.666667          True
4                Bombay            Mumbai          0.500000          True
5                 Delhi         New Delhi          0.714286          True
6        St. Petersburg  Saint Petersburg          0.800000          True
7              Melborne         Melbourne          0.941176          True
8              Goteborg        Gothenburg          0.777778          True
9                  Wien            Vienna          0.600000          True
10                Barca         Barcelona          0.714286          True
11            Las Vegas         Las Vegas          1.000000          True 

Precision: 0.917
Recall: 0.846


In [6]:
edit_dist_metrics(actual_names,input_names,0.6)

            Input Name    Predicted Name  Similarity Score  Ground Truth
0  City of Los Angeles       Los Angeles          0.733333          True
1             New York     New York City          0.761905          True
2            Bengaluru         Bangalore          0.666667          True
3                Delhi         New Delhi          0.714286          True
4       St. Petersburg  Saint Petersburg          0.800000          True
5             Melborne         Melbourne          0.941176          True
6             Goteborg        Gothenburg          0.777778          True
7                 Wien            Vienna          0.600000          True
8                Barca         Barcelona          0.714286          True
9            Las Vegas         Las Vegas          1.000000          True 

Precision: 1.0
Recall: 0.769


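### As the threshold increases, precision increases but recall decreases: from 0.5 to 0.6, precision rises from 0.917 to 1.0 while recall falls from 0.846 to 0.769. A threshold of 0.6 works well here, as it is the lowest tested value that gives 100% precision while keeping good recall.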

### This threshold of 0.6 can comfortably resolve local names against official names, like 'Goteborg' vs 'Gothenburg' and 'Bangalore' vs 'Bengaluru', while disambiguating similar but different names like 'Los Angeles' and 'Las Vegas'. It also resolves spelling mistakes like 'Melborne' and short forms of names like 'St. Petersburg'.
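The precision and recall figures printed above follow directly from the row counts in the output tables. The sketch below recomputes them; `precision_recall` is an illustrative helper, and its guard for an empty prediction list is an addition that the notebook's function does not have.

```python
def precision_recall(true_positives, predicted_pairs, actual_entities):
    """Precision = correct matches / predicted matches,
    recall = correct matches / entities that should be matched."""
    # Guard added for this sketch: no predictions means precision is undefined.
    precision = round(true_positives / predicted_pairs, 3) if predicted_pairs else 0.0
    recall = round(true_positives / actual_entities, 3)
    return precision, recall


# (threshold, rows marked True, total rows) read off the output tables above;
# actual_names holds 13 cities.
runs = [(0.5, 11, 12), (0.6, 10, 10)]
for threshold, correct, predicted in runs:
    print(threshold, precision_recall(correct, predicted, actual_entities=13))
# 0.5 (0.917, 0.846)
# 0.6 (1.0, 0.769)
```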

### Q1: Try to print the pairs with an edit-distance similarity score of 0.7 or higher

In [7]:
edit_dist_metrics(actual_names,input_names,0.7)

            Input Name    Predicted Name  Similarity Score  Ground Truth
0  City of Los Angeles       Los Angeles          0.733333          True
1             New York     New York City          0.761905          True
2                Delhi         New Delhi          0.714286          True
3       St. Petersburg  Saint Petersburg          0.800000          True
4             Melborne         Melbourne          0.941176          True
5             Goteborg        Gothenburg          0.777778          True
6                Barca         Barcelona          0.714286          True
7            Las Vegas         Las Vegas          1.000000          True 

Precision: 1.0
Recall: 0.615


---

## Set Based Metrics

### 1. Jaccard Index

In [8]:
import copy
tokens_1 = actual_names[0]
tokens_2 = input_names[0]
print(f"{tokens_1}, {tokens_2}")

Los Angeles, City of Los Angeles


### Q2: What's the Jaccard index between tokens_1 and tokens_2?

In [9]:
tokens_1 = set(tokens_1.split(" "))
tokens_2 = set(tokens_2.split(" "))

# Set operators return new sets, so no deep copies are needed
jaccard_sim = len(tokens_1 & tokens_2) / len(tokens_1 | tokens_2)
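As a worked example of the Jaccard index, the sketch below wraps the same computation in a small function and applies it to the two names above. The function name `jaccard` and its convention for two empty sets are assumptions made for illustration.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Shared tokens: {'Los', 'Angeles'}; all tokens: {'City', 'of', 'Los', 'Angeles'}
print(jaccard("Los Angeles".split(" "), "City of Los Angeles".split(" ")))  # 0.5
```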

### 2. TF-IDF

In [10]:
!pip install spacy scikit-learn # install sklearn for a built-in TF-IDF implementation



In [11]:
desc_1 = "mia's math adventure tells a captivating story with educational activities. games focus on developing math skills such as fractions geometry logic and mental computation. oh no! mia's house has just burnt down! but how could such a thing have ... "
desc_2 = "in mia's math adventure: just in time children will help mia save her house by using their math skills!"

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
def getTFIDFVector(doc):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(doc)

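    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = vectorizer.get_feature_names_out().tolist()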
    return len(feature_names), feature_names, tfidf_matrix, vectorizer

In [14]:
doc = [desc_1, desc_2]
len_feature, feature_names, tfidf_matrix, vectorizer = getTFIDFVector(doc)
print(len_feature, feature_names)
desc_1_tfidf = tfidf_matrix[0].toarray()
desc_2_tfidf = tfidf_matrix[1].toarray()
print(f"tfidf for desc_1: {desc_1_tfidf}")
print(f"tfidf for desc_2: {desc_2_tfidf}")

for i in range(2):
    print([desc_1, desc_2][i])
    tfidf_vector = tfidf_matrix[i]
    for feature_index, tfidf_value in zip(tfidf_vector.indices, tfidf_vector.data):
        feature_name = feature_names[feature_index]
        print(f"Feature: {feature_name}, TF-IDF Value: {tfidf_value}")

44 ['activities', 'adventure', 'and', 'as', 'burnt', 'but', 'by', 'captivating', 'children', 'computation', 'could', 'developing', 'down', 'educational', 'focus', 'fractions', 'games', 'geometry', 'has', 'have', 'help', 'her', 'house', 'how', 'in', 'just', 'logic', 'math', 'mental', 'mia', 'no', 'oh', 'on', 'save', 'skills', 'story', 'such', 'tells', 'their', 'thing', 'time', 'using', 'will', 'with']
tfidf for desc_1: [[0.16423278 0.11685298 0.16423278 0.16423278 0.16423278 0.16423278
  0.         0.16423278 0.         0.16423278 0.16423278 0.16423278
  0.16423278 0.16423278 0.16423278 0.16423278 0.16423278 0.16423278
  0.16423278 0.16423278 0.         0.         0.11685298 0.16423278
  0.         0.11685298 0.16423278 0.23370595 0.16423278 0.23370595
  0.16423278 0.16423278 0.16423278 0.         0.11685298 0.16423278
  0.32846556 0.16423278 0.         0.16423278 0.         0.
  0.         0.16423278]]
tfidf for desc_2: [[0.         0.16291028 0.         0.         0.         0.
  0.22



In [15]:
from sklearn.metrics.pairwise import cosine_similarity

In [16]:
print(cosine_similarity(desc_1_tfidf, desc_2_tfidf))

[[0.22843861]]
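To make explicit what sits behind this number, here is a pure-Python sketch of TF-IDF weighting followed by cosine similarity. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation). The helper names `tfidf_vectors` and `cosine` and the two toy documents are illustrative only.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for a list of strings."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vocab, (vec_a, vec_b) = tfidf_vectors(["math adventure game", "math skills"])
print(vocab)  # ['adventure', 'game', 'math', 'skills']
print(cosine(vec_a, vec_b))
```

Under this weighting, a term that appears in only one of two documents is weighted `ln(3/2) + 1 ≈ 1.405` times as heavily as a term appearing in both, which agrees with the ratio between the two non-zero values printed for desc_1 above (0.16423278 and 0.11685298).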


In [17]:
"""Remove Stop Words"""
import spacy
nlp = spacy.load("en_core_web_sm")
def preprocess(text):
    doc = nlp(text)
    processed_text = " ".join(token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct)
    return processed_text

In [18]:
desc_1 = preprocess(desc_1)
desc_2 = preprocess(desc_2)
print(desc_1)
print(desc_2)

mia math adventure tell captivate story educational activity game focus develop math skill fraction geometry logic mental computation oh mia house burn thing
mia math adventure time child help mia save house math skill


### Q3: Which token(s) have the highest tf-idf value after removing the stopwords?

In [19]:
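# Note: desc_1_tfidf and desc_2_tfidf were computed before preprocessing, so this prints the
# same value as In [16]; the vectors are only rebuilt from the preprocessed text in In [22].
print(cosine_similarity(desc_1_tfidf, desc_2_tfidf))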

[[0.22843861]]


In [20]:
desc_3 = "king for a day! fritz is in charge of the castle when his parents go on vacation. it's every kid's dream until the dastardly king black challenges the young stand-in king to a duel! explore the kingdom and discover 7 arcade-style games that explai"

### Q4: After we add desc_3 to our corpus, what is the cosine similarity between desc_1 and desc_2, and between desc_1 and desc_3?

In [21]:
doc = [desc_1, desc_2, desc_3]

In [22]:
len_feature, feature_names, tfidf_matrix, vectorizer = getTFIDFVector(doc)
print(len_feature, feature_names)
desc_1_tfidf = tfidf_matrix[0].toarray()
desc_2_tfidf = tfidf_matrix[1].toarray()
desc_3_tfidf = tfidf_matrix[2].toarray()
print(cosine_similarity(desc_1_tfidf, desc_2_tfidf))
print(cosine_similarity(desc_1_tfidf, desc_3_tfidf))


62 ['activity', 'adventure', 'and', 'arcade', 'black', 'burn', 'captivate', 'castle', 'challenges', 'charge', 'child', 'computation', 'dastardly', 'day', 'develop', 'discover', 'dream', 'duel', 'educational', 'every', 'explai', 'explore', 'focus', 'for', 'fraction', 'fritz', 'game', 'games', 'geometry', 'go', 'help', 'his', 'house', 'in', 'is', 'it', 'kid', 'king', 'kingdom', 'logic', 'math', 'mental', 'mia', 'of', 'oh', 'on', 'parents', 'save', 'skill', 'stand', 'story', 'style', 'tell', 'that', 'the', 'thing', 'time', 'to', 'until', 'vacation', 'when', 'young']
[[0.41795677]]
[[0.]]




# Blocking

In [None]:
!pip install pandas

In [23]:
import pandas as pd
df_dblp = pd.read_csv('DBLP-ACM/DBLP2.csv', encoding='latin1')
df_acm = pd.read_csv('DBLP-ACM/ACM.csv', encoding='latin1')
print(df_dblp.columns.tolist(), df_dblp.shape)
print(df_acm.columns.tolist(), df_acm.shape)

['id', 'title', 'authors', 'venue', 'year'] (2616, 5)
['id', 'title', 'authors', 'venue', 'year'] (2294, 5)


### Blocking based on Exact Match

In [24]:
merged_df = pd.merge(df_dblp, df_acm, on=['title', 'authors'])
print(merged_df.head(), merged_df.shape)

                   id_x                                              title  \
0  conf/vldb/PoosalaI96  Estimation of Query-Result Distribution and it...   
1     conf/vldb/HoelS95  Benchmarking Spatial Join Operations with Spat...   
2   conf/vldb/KemperK94          Dual-Buffering Strategies in Object Bases   
3  conf/vldb/ShaferAM96  SPRINT: A Scalable Parallel Classifier for Dat...   
4      conf/vldb/CuiW01  Lineage Tracing for General Data Warehouse Tra...   

                                        authors venue_x  year_x    id_y  \
0        Viswanath Poosala, Yannis E. Ioannidis    VLDB    1996  673321   
1                     Erik G. Hoel, Hanan Samet    VLDB    1995  673135   
2                Alfons Kemper, Donald Kossmann    VLDB    1994  672977   
3  John C. Shafer, Rakesh Agrawal, Manish Mehta    VLDB    1996  673491   
4                   Yingwei Cui, Jennifer Widom    VLDB    2001  672029   

                 venue_y  year_y  
0  Very Large Data Bases    1996  
1  Very La

In [25]:
value_counts = merged_df['id_x'].value_counts()
repeated_items = value_counts[value_counts > 1]
if len(repeated_items) > 0:
#     print("Some items appeared more than once:")
    print(repeated_items)

journals/sigmod/Aberer03d    7
journals/sigmod/Aberer03b    7
journals/vldb/MaratheS02     2
Name: id_x, dtype: int64


In [26]:
duplicated_ids = merged_df['id_x'].duplicated(keep=False)

# Filter out the rows with duplicated 'id_x'
merged_df = merged_df[~duplicated_ids]

In [27]:
value_counts = merged_df['id_y'].value_counts()
repeated_items = value_counts[value_counts > 1]
if len(repeated_items) > 0:
#     print("Some items appeared more than once:")
    print(repeated_items)
else:
    print("no repeated items")

no repeated items


In [28]:
merged_df.shape

(268, 8)
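The merge-then-deduplicate steps above can be sketched without pandas as follows. This is a simplified stand-in for `pd.merge` plus the `duplicated(keep=False)` filters, not the notebook's implementation; the helper names and the toy records (ids and the repeated "Editorial" title) are hypothetical.

```python
from collections import Counter, defaultdict


def exact_join(left, right, key):
    """All (left id, right id) pairs whose records agree exactly on key(record)."""
    index = defaultdict(list)
    for rec in right:
        index[key(rec)].append(rec["id"])
    return [(rec["id"], rid) for rec in left for rid in index.get(key(rec), [])]


def drop_repeated(pairs, position):
    """Drop every pair whose id at `position` occurs in more than one pair."""
    counts = Counter(pair[position] for pair in pairs)
    return [pair for pair in pairs if counts[pair[position]] == 1]


# Hypothetical records for illustration
dblp = [
    {"id": "d1", "title": "Editorial"},
    {"id": "d2", "title": "Editorial"},
    {"id": "d3", "title": "Dual-Buffering Strategies in Object Bases"},
]
acm = [
    {"id": 1, "title": "Editorial"},
    {"id": 2, "title": "Editorial"},
    {"id": 3, "title": "Dual-Buffering Strategies in Object Bases"},
]

pairs = exact_join(dblp, acm, key=lambda rec: rec["title"])
print(len(pairs))  # 5: the two "Editorial" records on each side pair up 2 x 2 ways
unique = drop_repeated(drop_repeated(pairs, 0), 1)
print(unique)  # [('d3', 3)]
```

Passing `key=lambda rec: (rec["title"], rec["authors"])` would correspond to merging on both columns, which is stricter and leaves fewer ambiguous pairs to discard.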

In [29]:
merged_df[['id_x', 'id_y']]

                         id_x    id_y
0        conf/vldb/PoosalaI96  673321
1           conf/vldb/HoelS95  673135
2         conf/vldb/KemperK94  672977
3        conf/vldb/ShaferAM96  673491
4            conf/vldb/CuiW01  672029
..                        ...     ...
279  journals/vldb/Atkinson00  765234
280          conf/vldb/HanF95  673134
281        conf/vldb/Suciu96a  673488
282     conf/vldb/GinisHKMT97  670997


# Q1: If we use only the title for the exact match, how many pairs can we find between the two datasets? After we remove the repeated ids, how many exact pairs do we get directly?

In [30]:
merged_df_ = pd.merge(df_dblp, df_acm, on='title')
print(merged_df_.head(), merged_df_.shape)

                     id_x                                              title  \
0    conf/vldb/PoosalaI96  Estimation of Query-Result Distribution and it...   
1  conf/vldb/GardarinGT96  Cost-based Selection of Path Expression Proces...   
2       conf/vldb/HoelS95  Benchmarking Spatial Join Operations with Spat...   
3     conf/vldb/KemperK94          Dual-Buffering Strategies in Object Bases   
4  journals/vldb/ChangG01  Approximate query mapping: Accounting for tran...   

                                           authors_x  venue_x  year_x    id_y  \
0             Viswanath Poosala, Yannis E. Ioannidis     VLDB    1996  673321   
1  Zhao-Hui Tang, Georges Gardarin, Jean-Robert G...     VLDB    1996  673484   
2                          Erik G. Hoel, Hanan Samet     VLDB    1995  673135   
3                     Alfons Kemper, Donald Kossmann     VLDB    1994  672977   
4       Hector Garcia-Molina, Kevin Chen-Chuan Chang  VLDB J.    2001  767145   

                                

In [31]:
merged_df_[['id_x', 'id_y']]

                               id_x    id_y
0              conf/vldb/PoosalaI96  673321
1            conf/vldb/GardarinGT96  673484
2                 conf/vldb/HoelS95  673135
3               conf/vldb/KemperK94  672977
4            journals/vldb/ChangG01  767145
..                              ...     ...
983             conf/vldb/AndreiV01  672357
984          journals/tods/KarpSP03  762473
985  journals/tods/ChakrabartiKMP02  375680
986  journals/tods/ChakrabartiKMP02  568520


In [32]:
value_counts = merged_df_['id_x'].value_counts()
repeated_items = value_counts[value_counts > 1]
if len(repeated_items) > 0:
#     print("Some items appeared more than once:")
    print(repeated_items)

journals/sigmod/Aberer03d                7
journals/sigmod/Aberer03b                7
journals/vldb/AbbadiSW01                 6
journals/vldb/Atkinson00                 6
journals/vldb/BernsteinIR03              6
journals/vldb/AtluriJY03                 6
journals/tods/Snodgrass01a               3
journals/tods/Snodgrass01                3
journals/vldb/ApersCS02                  3
journals/sigmod/RossHLW01                2
journals/sigmod/RossFLOSSVW00            2
journals/sigmod/RossCGLLM01              2
journals/sigmod/CherniakV03              2
journals/sigmod/RossAKSSY00              2
journals/sigmod/RossIJP00                2
journals/sigmod/RossGR03                 2
journals/sigmod/Snodgrass99b             2
journals/vldb/MaratheS02                 2
journals/sigmod/RossHKRRSS01             2
journals/sigmod/RossKMV02                2
journals/tods/ChakrabartiKMP02           2
journals/sigmod/SnodgrassGIMSU98         2
journals/sigmod/RossFS02                 2
journals/si

In [33]:
duplicated_ids = merged_df_['id_x'].duplicated(keep=False)

# Filter out the rows with duplicated 'id_x'
merged_df_ = merged_df_[~duplicated_ids]

In [34]:
value_counts = merged_df_['id_y'].value_counts()
repeated_items = value_counts[value_counts > 1]
if len(repeated_items) > 0:
#     print("Some items appeared more than once:")
    print(repeated_items)
else:
    print("no repeated items")

277955    2
671202    2
673643    2
672990    2
Name: id_y, dtype: int64


In [35]:
duplicated_ids = merged_df_['id_y'].duplicated(keep=False)

# Filter out the rows with duplicated 'id_y'
merged_df_ = merged_df_[~duplicated_ids]

In [36]:
merged_df_[['id_x', 'id_y']]

                       id_x    id_y
0      conf/vldb/PoosalaI96  673321
1    conf/vldb/GardarinGT96  673484
2         conf/vldb/HoelS95  673135
3       conf/vldb/KemperK94  672977
4    journals/vldb/ChangG01  767145
..                      ...     ...
981    conf/vldb/ChungCLL01  672347
982     conf/vldb/WienerN94  672979
983     conf/vldb/AndreiV01  672357
984  journals/tods/KarpSP03  762473


### Token-based Blocking 

In [37]:
# Filter df_acm and df_dblp to keep rows not present in merged_df
merged_titles = merged_df['title'].values
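# .copy() makes these independent DataFrames, so the later assignment to
# df_acm['venue'] does not trigger a SettingWithCopyWarning
df_dblp = df_dblp[~df_dblp['title'].isin(merged_titles)].copy()
df_acm = df_acm[~df_acm['title'].isin(merged_titles)].copy()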

In [38]:
dblp_venue_unique_items = df_dblp['venue'].unique()
print(dblp_venue_unique_items)

acm_venue_unique_items = df_acm['venue'].unique()
print(acm_venue_unique_items)

['SIGMOD Record' 'VLDB' 'SIGMOD Conference' 'VLDB J.'
 'ACM Trans. Database Syst.']
['International Conference on Management of Data' 'ACM SIGMOD Record '
 'ACM Transactions on Database Systems (TODS) '
 'The VLDB Journal &mdash; The International Journal on Very Large Data Bases '
 'Very Large Data Bases']


In [39]:
replacements = {
    'ACM SIGMOD Record ': 'SIGMOD Record',
    'Very Large Data Bases': 'VLDB',
    'The VLDB Journal &mdash; The International Journal on Very Large Data Bases ': 'VLDB J.',
    'ACM Transactions on Database Systems (TODS) ': 'ACM Trans. Database Syst.',
    
}

df_acm['venue'] = df_acm['venue'].replace(replacements)
print(df_acm['venue'].unique())

['International Conference on Management of Data' 'SIGMOD Record'
 'ACM Trans. Database Syst.' 'VLDB J.' 'VLDB']


In [40]:
for venue in ['SIGMOD Record', 'VLDB', 'VLDB J.', 'ACM Trans. Database Syst.']:
    filtered_dblp = df_dblp[df_dblp['venue'] == venue]
    filtered_acm = df_acm[df_acm['venue'] == venue]
    print(filtered_acm.shape, filtered_dblp.shape)

(494, 5) (565, 5)
(443, 5) (681, 5)
(186, 5) (192, 5)
(123, 5) (123, 5)
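The point of blocking is that only records sharing a block key (here, the normalised venue) are compared with each other. The sketch below shows the idea on hypothetical records and then uses the block sizes printed above to count how many comparisons remain for these four venues. `block_by` and `candidate_pairs` are illustrative names, not part of the notebook.

```python
from collections import defaultdict
from itertools import product


def block_by(records, key):
    """Group records into blocks that share the same key(record)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


def candidate_pairs(left, right, key):
    """Yield only the cross-dataset pairs that fall in the same block."""
    left_blocks, right_blocks = block_by(left, key), block_by(right, key)
    for block_key in left_blocks.keys() & right_blocks.keys():
        yield from product(left_blocks[block_key], right_blocks[block_key])


# Hypothetical records for illustration
acm = [{"id": 1, "venue": "VLDB"}, {"id": 2, "venue": "VLDB J."}, {"id": 3, "venue": "VLDB"}]
dblp = [{"id": "a", "venue": "VLDB"}, {"id": "b", "venue": "SIGMOD Record"}]
pairs = list(candidate_pairs(acm, dblp, key=lambda rec: rec["venue"]))
print(len(pairs), "of", len(acm) * len(dblp))  # 2 of 6

# Block sizes (ACM rows, DBLP rows) taken from the output above
sizes = [(494, 565), (443, 681), (186, 192), (123, 123)]
within = sum(a * d for a, d in sizes)
across = sum(a for a, _ in sizes) * sum(d for _, d in sizes)
print(within, across)  # 631634 1945006
```

For the records in these four venues, comparing only within blocks needs roughly a third of the pairwise comparisons that an all-against-all pass would.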


In [41]:
def edit_dist_metrics(df1, df2, threshold):
    '''
    input:  DataFrame : df1, ACM records with 'id', 'title' and 'authors' columns
            DataFrame : df2, DBLP records with the same columns
            float : threshold value for similarity score 0 <= threshold <= 1

    Compares every row of df1 with every row of df2 and scores each pair by the mean of the
    title similarity and the authors similarity. Returns a DataFrame of the pairs whose
    score is greater than or equal to the threshold.
    '''
    res = []
    for _, row1 in df1.iterrows():
        title1 = row1['title'].lower()
        authors1 = row1['authors'].lower()
        for _, row2 in df2.iterrows():
            title_score = edit_distance.SequenceMatcher(title1, row2['title'].lower()).ratio()
            authors_score = edit_distance.SequenceMatcher(authors1, row2['authors'].lower()).ratio()
            # Combined score: unweighted mean of the two attribute similarities
            score = (title_score + authors_score) / 2
            if score >= threshold:
                res.append([row1['id'], row2['id'], score])

    df = pd.DataFrame(res, columns=['ACM id', 'DBLP id', 'Similarity Score'])
    return df
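The scoring rule in this function, the unweighted mean of a title similarity and an authors similarity, can be tried on plain dictionaries. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for `edit_distance.SequenceMatcher`; that substitution is an assumption, and the two libraries are not guaranteed to return the same ratios. The author strings and record ids are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Case-insensitive similarity in [0, 1] (difflib stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_score(rec1, rec2):
    """Mean of the title similarity and the authors similarity."""
    return (ratio(rec1["title"], rec2["title"])
            + ratio(rec1["authors"], rec2["authors"])) / 2


def candidate_matches(left, right, threshold):
    """All (left id, right id, score) triples scoring at or above the threshold."""
    matches = []
    for a in left:
        for b in right:
            score = combined_score(a, b)
            if score >= threshold:
                matches.append((a["id"], b["id"], score))
    return matches


# Hypothetical authors and ids; both papers share the same author string
acm = [{"id": 1, "title": "The SIFT information dissemination system",
        "authors": "A. Author, B. Writer"}]
dblp = [
    {"id": "d1", "title": "The SIFT Information Dissemination System",
     "authors": "A. Author, B. Writer"},
    {"id": "d2", "title": "Index Structures for Selective Dissemination of "
                          "Information Under the Boolean Model",
     "authors": "A. Author, B. Writer"},
]

print(candidate_matches(acm, dblp, 0.95))      # [(1, 'd1', 1.0)]
print(len(candidate_matches(acm, dblp, 0.5)))  # 2
```

Because identical author strings contribute 0.5 to the mean on their own, two different papers by the same authors can clear a moderate threshold, which is one reason some ids end up with more than one candidate.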

In [42]:
filtered_acm_ACMTrans = filtered_acm[filtered_acm['venue'] == 'ACM Trans. Database Syst.']
filtered_dblp_ACMTrans = filtered_dblp[filtered_dblp['venue'] == 'ACM Trans. Database Syst.']

edit_dis = edit_dist_metrics(filtered_acm_ACMTrans, filtered_dblp_ACMTrans, 0.6)


In [43]:
# Count how often each id appears among the candidate pairs.
# rename_axis(...).reset_index(name=...) yields the same column names on pandas 1.x and 2.x.
acm_counts = edit_dis['ACM id'].value_counts().rename_axis('ACM id').reset_index(name='ACM id Count')
dblp_counts = edit_dis['DBLP id'].value_counts().rename_axis('DBLP id').reset_index(name='DBLP id Count')

# Merge the count DataFrames with 'edit_dis' DataFrame
merged_edit_dis = pd.merge(edit_dis, acm_counts, on='ACM id')
merged_edit_dis = pd.merge(merged_edit_dis, dblp_counts, on='DBLP id')

# Filter the DataFrame based on IDs that appear only once in both columns
AMCTrans_Match = merged_edit_dis[(merged_edit_dis['ACM id Count'] == 1) & (merged_edit_dis['DBLP id Count'] == 1)]

# Print the filtered DataFrame
print(AMCTrans_Match)


     ACM id                           DBLP id  Similarity Score  ACM id Count  \
0    331986       journals/tods/MuralidharS99          1.000000             1   
1    331984             journals/tods/DeySB99          0.843750             1   
6    331989            journals/tods/WandSW99          0.694444             1   
7    310710          journals/tods/DattaVCK99          0.822581             1   
8    320252         journals/tods/GravanoGT99          0.723214             1   
..      ...                               ...               ...           ...   
119  958948        journals/tods/HjaltasonS03          0.983871             1   
120  958943             journals/tods/TaoSP03          0.750000             1   
121  937600            journals/tods/JacoxS03          0.648148             1   
122  937601  journals/tods/Jimenez-PerisPAK03          0.827160             1   
123  937599     journals/tods/WijesekeraJPH03          0.750000             1   

     DBLP id Count  
0     

In [44]:
AMCTrans_Match[['ACM id', 'DBLP id']]

     ACM id                           DBLP id
0    331986       journals/tods/MuralidharS99
1    331984             journals/tods/DeySB99
6    331989            journals/tods/WandSW99
7    310710          journals/tods/DattaVCK99
8    320252         journals/tods/GravanoGT99
..      ...                               ...
119  958948        journals/tods/HjaltasonS03
120  958943             journals/tods/TaoSP03
121  937600            journals/tods/JacoxS03
122  937601  journals/tods/Jimenez-PerisPAK03


In [45]:
acm_ACMTrans_id = filtered_acm_ACMTrans[['id']]
dblp_ACMTrans_id = filtered_dblp_ACMTrans[['id']]

filtered_acm_edit_dis = filtered_acm_ACMTrans[~filtered_acm_ACMTrans['id'].isin(AMCTrans_Match['ACM id'])]
filtered_dblp_edit_dis = filtered_dblp_ACMTrans[~filtered_dblp_ACMTrans['id'].isin(AMCTrans_Match['DBLP id'])]

print(filtered_acm_edit_dis.shape, filtered_dblp_edit_dis.shape)

(15, 5) (15, 5)


In [46]:
filtered_edit_dis = merged_edit_dis[(merged_edit_dis['ACM id Count'] != 1) | (merged_edit_dis['DBLP id Count'] != 1)]
print(filtered_edit_dis)

    ACM id                     DBLP id  Similarity Score  ACM id Count  \
2   331992        journals/tods/YanG99          1.000000             2   
3   176573        journals/tods/YanG99          0.623147             2   
4   331992        journals/tods/YanG94          0.717742             2   
5   176573        journals/tods/YanG94          0.905405             2   
41  505049  journals/tods/Snodgrass01a          0.959459             2   
42  505055  journals/tods/Snodgrass01a          0.959459             2   
43  505049   journals/tods/Snodgrass01          0.959459             2   
44  505055   journals/tods/Snodgrass01          0.959459             2   
55  569785         journals/tods/Kim94          0.666667             1   
56  569784         journals/tods/Kim94          0.877778             1   
75  211416        journals/tods/Chen95          0.686441             2   
76  202110        journals/tods/Chen95          1.000000             2   
77  211416       journals/tods/Chen95a

In [47]:
corpus1 = df_acm['title'].tolist()

corpus2 = df_dblp['title'].tolist()

corpus1.extend(corpus2)


In [48]:
len_feature, feature_names, tfidf_matrix, vectorizer = getTFIDFVector(corpus1)
print(len_feature)
# filtered_acm_ACMTrans, filtered_dblp_ACMTrans
tfidf_dict = {}
for i, row in filtered_edit_dis.iterrows():
    print(row['ACM id'], row['DBLP id'])
    title1 = filtered_acm_ACMTrans[filtered_acm_ACMTrans['id'] == row['ACM id']]['title'].tolist()[0]
    title2 = filtered_dblp_ACMTrans[filtered_dblp_ACMTrans['id'] == row['DBLP id']]['title'].tolist()[0]
    print(title1)
    print(title2)
    tfidf1 = vectorizer.transform([title1]).toarray()
    tfidf2 = vectorizer.transform([title2]).toarray()
    # Compute the cosine similarity once and reuse it
    score = cosine_similarity(tfidf1, tfidf2)[0][0]
    print(score)
    # Record the score under the ACM id, keyed by the candidate DBLP id
    tfidf_dict.setdefault(row['ACM id'], {})[row['DBLP id']] = score
    print('--------')
print(tfidf_dict)

3261
331992 journals/tods/YanG99
The SIFT information dissemination system
The SIFT Information Dissemination System
1.0
--------
176573 journals/tods/YanG99
Index structures for selective dissemination of information under the Boolean model
The SIFT Information Dissemination System
0.31177412953716754
--------
331992 journals/tods/YanG94
The SIFT information dissemination system
Index Structures for Selective Dissemination of Information Under the Boolean Model
0.31177412953716754
--------
176573 journals/tods/YanG94
Index structures for selective dissemination of information under the Boolean model
Index Structures for Selective Dissemination of Information Under the Boolean Model
1.0
--------
505049 journals/tods/Snodgrass01a
Editorial
Editorial
1.0
--------
505055 journals/tods/Snodgrass01a
Editorial
Editorial
1.0
--------
505049 journals/tods/Snodgrass01
Editorial
Editorial
1.0
--------
505055 journals/tods/Snodgrass01
Editorial
Editorial
1.0
--------
569785 journals/tods/Kim94
Ed



In [49]:
pairs = [(key, max(value, key=value.get)) for key, value in tfidf_dict.items()]
print(len(pairs), pairs)

9 [(331992, 'journals/tods/YanG99'), (176573, 'journals/tods/YanG94'), (505049, 'journals/tods/Snodgrass01a'), (505055, 'journals/tods/Snodgrass01a'), (569785, 'journals/tods/Kim94'), (569784, 'journals/tods/Kim94'), (211416, 'journals/tods/Chen95a'), (202110, 'journals/tods/Chen95'), (185828, 'journals/tods/CeriFPT94')]
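Note that in the list above both 505049 and 505055 are paired with `journals/tods/Snodgrass01a`: `max` returns the first key it sees when scores tie, and each ACM id is resolved independently. The sketch below reproduces that behaviour with the scores printed earlier and contrasts it with a greedy one-to-one assignment. The greedy variant is an alternative shown for illustration, not the notebook's method, and with identical "Editorial" titles the assignment it picks is arbitrary.

```python
def best_match_per_key(scores):
    """For each ACM id, pick the DBLP id with the highest score (first one wins ties)."""
    return [(key, max(cands, key=cands.get)) for key, cands in scores.items()]


def greedy_one_to_one(scores):
    """Alternative: accept pairs in descending score order, using each id at most once."""
    ranked = sorted(
        ((score, acm_id, dblp_id)
         for acm_id, cands in scores.items()
         for dblp_id, score in cands.items()),
        key=lambda item: -item[0],  # stable sort keeps insertion order on ties
    )
    used_acm, used_dblp, pairs = set(), set(), []
    for _, acm_id, dblp_id in ranked:
        if acm_id not in used_acm and dblp_id not in used_dblp:
            pairs.append((acm_id, dblp_id))
            used_acm.add(acm_id)
            used_dblp.add(dblp_id)
    return pairs


# Cosine scores for the two "Editorial" records, as printed in the output above
scores = {
    505049: {"journals/tods/Snodgrass01a": 1.0, "journals/tods/Snodgrass01": 1.0},
    505055: {"journals/tods/Snodgrass01a": 1.0, "journals/tods/Snodgrass01": 1.0},
}

print(best_match_per_key(scores))
# [(505049, 'journals/tods/Snodgrass01a'), (505055, 'journals/tods/Snodgrass01a')]
print(greedy_one_to_one(scores))
# [(505049, 'journals/tods/Snodgrass01a'), (505055, 'journals/tods/Snodgrass01')]
```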


In [50]:
print(tfidf_dict[569785]['journals/tods/Kim94'], tfidf_dict[569784]['journals/tods/Kim94'], )
print(tfidf_dict[505049]['journals/tods/Snodgrass01a'], tfidf_dict[505055]['journals/tods/Snodgrass01a'], )

0.0 0.8668471738818802
1.0 1.0


In [51]:
# 569785 scores 0.0 against Kim94 while 569784 scores about 0.867, so drop the weaker pair
pairs.remove((569785, 'journals/tods/Kim94'))
print(len(pairs))

8


In [52]:
acm_ACMTrans_id = filtered_acm_edit_dis[['id']]
dblp_ACMTrans_id = filtered_dblp_edit_dis[['id']]

filtered_acm_edit_dis = filtered_acm_edit_dis[~filtered_acm_edit_dis['id'].isin([p[0] for p in pairs])]
filtered_dblp_edit_dis = filtered_dblp_edit_dis[~filtered_dblp_edit_dis['id'].isin([p[1] for p in pairs])]

print(filtered_acm_edit_dis.shape, filtered_dblp_edit_dis.shape)

(7, 5) (8, 5)


# Q2: Recall that the dataset provides title, year, authors, and venue as attributes. Would any of the others also be a good choice for blocking?

In [None]:
dblp_year_unique_items = df_dblp['year'].unique()
print(sorted(dblp_year_unique_items))

acm_year_unique_items = df_acm['year'].unique()
print(sorted(acm_year_unique_items))