# Exploring patient matching with unsupervised learning: Expectation-Maximisation and K-means
This notebook introduces key concepts of patient matching and demonstrates them in Python.

***********************************************

This notebook follows the record linkage process step by step, with minimal explanation, and assumes prior knowledge of the process. For more detailed information, please consult the references.


## Record linkage process (deduplication)

In [26]:
from IPython.display import display
import warnings
import numpy as np
import pandas as pd
import recordlinkage 
from recordlinkage.index import Block
from recordlinkage.preprocessing import phonetic
warnings.filterwarnings('ignore')

#### Get version information

In [27]:
# Get Version information
print("Pandas version: {0}".format(pd.__version__),'\n')
print("Python Record Linkage version: {0}".format(recordlinkage._version.get_versions()['version']),'\n')
print("Numpy version: {0}".format(np.__version__),'\n')

Pandas version: 1.5.3 

Python Record Linkage version: 0.15 

Numpy version: 1.22.0 



In [28]:
# file to deduplicate
IMPORT_FILE_TO_DEDUPLICATE = './data/dataset_febrl3.csv'


In this notebook we use a synthetic dataset from the Python Record Linkage Toolkit (PRLT). The package is distributed with four openly available synthetic datasets. For this project we use the Freely Extensible Biomedical Record Linkage (Febrl) dataset 3 (FEBRL3), which contains 5000 records (2000 originals and 3000 duplicates).

For more info: [Synthetic datasets](https://recordlinkage.readthedocs.io/en/latest/ref-datasets.html)

In [29]:
# load the dataset and index it by record id
df_a = pd.read_csv(IMPORT_FILE_TO_DEDUPLICATE)
df_a = df_a.set_index('rec_id')
print('Number of records :', len(df_a))
print(df_a.head())


Number of records : 5000
               given_name   surname        address_1               address_2  \
rec_id                                                                         
rec-1496-org     mitchell     green    wallaby place                  delmar   
rec-552-dup-3      harley  mccarthy    pridhamstreet                  milton   
rec-988-dup-1    madeline     mason  hoseason street  lakefront retrmnt vlge   
rec-1716-dup-1   isabelle       NaN    gundulu place               currin ga   
rec-1213-org       taylor  hathaway   yuranigh court          brentwood vlge   

                   suburb  postcode state  date_of_birth  soc_sec_id  
rec_id                                                                
rec-1496-org    cleveland      2119    sa     19560409.0     1804974  
rec-552-dup-3     marsden      3165   nsw     19080419.0     6089216  
rec-988-dup-1   granville      4881   nsw     19081128.0     2185997  
rec-1716-dup-1   utakarra      2193    wa     19921119.0   

## Pre-processing and standardization

In [30]:
df_head = df_a.head()

This is the first step of the record linkage process. The main task of **data cleaning and standardization** is to convert the raw input data into well-defined, consistent forms and to resolve inconsistencies in the way information is represented and encoded. (source:)

We split the date of birth into three columns (year, month and day) for easier comparison, and we compute the Metaphone encoding of the given name and of the surname.
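The date split can be illustrated without pandas. The following is a minimal sketch using only the standard library; the helper name `split_birth_date` is hypothetical, and returning `None` for unparseable values is an assumption that loosely mirrors the `errors='coerce'` behaviour used in the cell below.

```python
from datetime import datetime


def split_birth_date(raw):
    """Return (year, month, day) for a YYYYMMDD value, or (None, None, None) if it cannot be parsed."""
    try:
        # The dataset shows dates as numbers such as 19560409.0, so normalise to a digit string first
        parsed = datetime.strptime(str(int(float(raw))), "%Y%m%d")
    except (TypeError, ValueError, OverflowError):
        # Missing values (NaN, None) and impossible dates end up here
        return (None, None, None)
    return (parsed.year, parsed.month, parsed.day)


print(split_birth_date(19560409.0))    # a valid date taken from the first record
print(split_birth_date(19921350))      # month 13 does not exist
print(split_birth_date(float("nan")))  # missing date of birth
```

Keeping the three parts separate allows blocking on the day, month or year alone, so a typo in one part of the date does not hide the pair.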

**Metaphone** is a phonetic encoding algorithm that encodes the way words and syllables are pronounced, which helps **reduce the impact of minor typographical errors**. The output of a phonetic algorithm is an intentionally approximate phonetic representation of the word. Metaphone is an improvement on the Soundex algorithm, although its application is still limited to English words.
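To make the idea of phonetic encoding concrete, here is a minimal sketch of the simpler Soundex algorithm mentioned above. It is not the Metaphone encoding used in this notebook (Metaphone has many more rules), and the function name `soundex` is only illustrative.

```python
def soundex(name: str) -> str:
    """Encode a name with American Soundex: first letter plus three digits."""
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters_only = "".join(ch for ch in name.lower() if ch.isalpha())
    if not letters_only:
        return ""

    first = letters_only[0]
    encoded = []
    prev = codes.get(first, "")
    for ch in letters_only[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            encoded.append(digit)
        prev = digit  # vowels reset prev, so equal codes across a vowel are both kept
    # pad with zeros and truncate to the usual four characters
    return (first.upper() + "".join(encoded) + "000")[:4]


for a, b in [("smith", "smyth"), ("robert", "rupert")]:
    print(a, soundex(a), b, soundex(b))
```

Spelling variants that sound alike receive the same code, which is why a phonetic code is a useful blocking key even when the raw strings differ.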


In [31]:
# convert date of birth to datetime (unparseable values become NaT)
df_a['date_of_birth'] = pd.to_datetime(df_a['date_of_birth'],format='%Y%m%d', errors='coerce')
df_a['YearB'] = df_a['date_of_birth'].dt.year.astype('Int64') 
df_a['MonthB'] = df_a['date_of_birth'].dt.month.astype('Int64') 
df_a['DayB'] = df_a['date_of_birth'].dt.day.astype('Int64') 

df_a['metaphone_given_name'] = phonetic(df_a['given_name'], method='metaphone')
df_a['metaphone_surname'] = phonetic(df_a['surname'], method='metaphone')
#df_a.sort_values(['given_name'])

## Blocking and Indexing

The second step of the process, called **blocking** or **indexing**, tries to reduce the number of record pairs we need to compare. Instead of comparing every record in the dataset with every other record, we compare only the records that are most likely to match.
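A quick calculation shows how much work blocking saves. This sketch uses two figures reported in this notebook: the 5000 records of FEBRL3 and the 12873 candidate pairs printed by the indexing cell. The term "reduction ratio" is a common name for this measure and is introduced here only for illustration.

```python
n_records = 5000           # size of FEBRL3, as reported in this notebook
n_candidate_pairs = 12873  # pairs kept by the blocking keys in this notebook

# every unordered pair of distinct records: n * (n - 1) / 2
n_all_pairs = n_records * (n_records - 1) // 2

# share of the full comparison space that blocking lets us skip
reduction_ratio = 1 - n_candidate_pairs / n_all_pairs

print(f"all possible pairs : {n_all_pairs:,}")
print(f"candidate pairs    : {n_candidate_pairs:,}")
print(f"reduction ratio    : {reduction_ratio:.4f}")
```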

For example, you can decide to compare only patients with the same first name, last name and date of birth. This combination of fields is called a **blocking key**. Using a blocking key yields a reduced set of record pairs. In this notebook we use multiple blocking keys and take the **union** of the resulting sets as the candidate record pairs to evaluate for matching in the next steps.

Please note the use of the **metaphone** encoding here instead of the exact value. This tolerates typographic errors in the names and provides a wider range of candidate record pairs.
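The union of several blocking keys can be sketched in plain Python. This is a simplified illustration, not how the toolkit's `Block` indexer is implemented; the toy records, the field names and the helper `block` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy records (hypothetical): phonetic codes for the names plus the year of birth
records = {
    "a1": {"given": "MKHL", "surname": "HNKN", "year": 1977},
    "a2": {"given": "MKFL", "surname": "HNKN", "year": 1977},
    "a3": {"given": "MKHL", "surname": "HNN", "year": 1977},
    "b1": {"given": "JN", "surname": "BNJR", "year": 1971},
}


def block(records, key_fields):
    """Return all unordered pairs of record ids that share the same blocking key."""
    groups = defaultdict(list)
    for rec_id, rec in records.items():
        key = tuple(rec.get(field) for field in key_fields)
        if None not in key:  # records with a missing key value are not blocked
            groups[key].append(rec_id)
    pairs = set()
    for ids in groups.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


blocking_keys = [("given", "year"), ("surname", "year")]

# union of the pairs produced by each blocking key
candidate_pairs = set().union(*(block(records, key) for key in blocking_keys))

all_pairs = len(records) * (len(records) - 1) // 2
print(sorted(candidate_pairs), "out of", all_pairs, "possible pairs")
```

Each key alone misses one of the two pairs involving `a1`; the union recovers both, which is the reason for combining several keys.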

In [32]:
indexer = recordlinkage.Index()

# metaphone given name, metaphone surname, exact date of birth
indexer.add(Block(['metaphone_given_name','metaphone_surname','date_of_birth']))
# metaphone given name, day of birth
indexer.add(Block(['metaphone_given_name','DayB']))
# metaphone given name, month of birth
indexer.add(Block(['metaphone_given_name','MonthB']))
# metaphone surname, year of birth
indexer.add(Block(['metaphone_surname','YearB']))
# social security id
indexer.add(Block(['soc_sec_id']))

candidate_record_pairs = indexer.index(df_a)

print("Number of record pairs :",len(candidate_record_pairs))
candidate_record_pairs.to_frame(index=False).head()

Number of record pairs : 12873


,rec_id_1,rec_id_2
0,rec-0-org,rec-1023-org
1,rec-0-org,rec-1540-dup-1
2,rec-1-org,rec-1643-org
3,rec-1-org,rec-1986-org
4,rec-1-org,rec-41-org


In [33]:
candidate_record_pairs

MultiIndex([(   'rec-0-org',   'rec-1023-org'),
            (   'rec-0-org', 'rec-1540-dup-1'),
            (   'rec-1-org',   'rec-1643-org'),
            (   'rec-1-org',   'rec-1986-org'),
            (   'rec-1-org',     'rec-41-org'),
            ('rec-10-dup-0',   'rec-10-dup-2'),
            ('rec-10-dup-1',   'rec-10-dup-0'),
            ('rec-10-dup-1',   'rec-10-dup-2'),
            (  'rec-10-org',   'rec-10-dup-0'),
            (  'rec-10-org',   'rec-10-dup-1'),
            ...
            ( 'rec-998-org', 'rec-1207-dup-2'),
            ( 'rec-998-org', 'rec-1207-dup-3'),
            ( 'rec-998-org',   'rec-1207-org'),
            ( 'rec-998-org',   'rec-1530-org'),
            ( 'rec-998-org', 'rec-1727-dup-0'),
            ( 'rec-998-org', 'rec-1727-dup-2'),
            ( 'rec-998-org',   'rec-1727-org'),
            ( 'rec-998-org', 'rec-1758-dup-0'),
            ( 'rec-998-org', 'rec-1758-dup-1'),
            ( 'rec-998-org',   'rec-1758-org')],
           names=['rec_

In PRLT, the available phonetic encodings are “soundex”, “nysiis”, “metaphone” and “match_rating”. **Other phonetic algorithms not included in PRLT: double metaphone, Phonix, Phonex, ONCA, fuzzy Soundex (Christen 2012).**

## Comparison

Identify the similarity between the records of each pair to create comparison vectors.

The previous step gave us a list of record pairs. In this step we compare the corresponding fields of each record pair, using string distance algorithms for the text fields and exact comparison for the others.
Jaro-Winkler and Levenshtein produce a **score between 0 and 1** that is binarized using a threshold. In our case, if the **score > 0.85** we record agreement (1); otherwise we record disagreement (0).

The output of the comparison is the **comparison vector** that will be used for classification.
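The binarization step can be sketched with a hand-written Levenshtein similarity. This is an illustration only: normalising the edit distance by the length of the longer string is an assumption about how the score is scaled to the 0 to 1 range, and the helper names are hypothetical. The example strings are taken from the records shown in this notebook.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def agreement(a: str, b: str, threshold: float = 0.85) -> int:
    """Binarize the score: 1 for agreement, 0 for disagreement."""
    return int(similarity(a, b) > threshold)


# one comparison vector for a pair of records (surname, given name, address_1)
pair = [("hannagan", "hannaan"), ("mikhvyla", "mikhayla"), ("windradyen street", "rupp lace")]
vector = [agreement(a, b) for a, b in pair]
print(vector)
```

A single missing or substituted letter in an eight-letter name still counts as agreement, while two unrelated addresses do not.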

In [34]:

compare_cl = recordlinkage.Compare()
compare_cl.string('given_name', 'given_name', method='jarowinkler', threshold = 0.85, label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler',threshold = 0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('soc_sec_id', 'soc_sec_id', label='soc_sec_id')
compare_cl.string('address_1', 'address_1', method ='levenshtein' ,threshold = 0.85, label='address_1')
compare_cl.string('address_2', 'address_2', method ='levenshtein' ,threshold = 0.85, label='address_2')
compare_cl.string('suburb', 'suburb', method ='levenshtein' ,threshold = 0.85, label='suburb')
compare_cl.exact('postcode', 'postcode', label='postcode')
compare_cl.exact('state', 'state', label='state')

features = compare_cl.compute(candidate_record_pairs, df_a)
features.head(50)


rec_id_1,rec_id_2,given_name,surname,date_of_birth,soc_sec_id,address_1,address_2,suburb,postcode,state
rec-0-org,rec-1023-org,0.0,0.0,0,0,0.0,0.0,0.0,0,0
rec-0-org,rec-1540-dup-1,0.0,0.0,0,0,0.0,0.0,0.0,0,1
rec-1-org,rec-1643-org,0.0,0.0,0,0,0.0,0.0,0.0,0,0
rec-1-org,rec-1986-org,1.0,0.0,0,0,0.0,0.0,0.0,0,0
rec-1-org,rec-41-org,0.0,0.0,0,0,0.0,0.0,0.0,0,0
rec-10-dup-0,rec-10-dup-2,0.0,0.0,1,1,0.0,1.0,1.0,0,1
rec-10-dup-1,rec-10-dup-0,0.0,0.0,1,1,0.0,1.0,1.0,0,1
rec-10-dup-1,rec-10-dup-2,1.0,1.0,1,1,0.0,1.0,1.0,1,1
rec-10-org,rec-10-dup-0,0.0,0.0,1,1,1.0,1.0,1.0,0,1
rec-10-org,rec-10-dup-1,1.0,1.0,1,1,0.0,1.0,1.0,1,1


rec-1023-org,gianni,matson,willis street,boonooloo,clifton,3101,vic,19410111,2540080
rec-1540-dup-1,john,benger,gellibrand street,grandview,carnegie,4011,nsw,19710126,5651019


## Classification

Based on the comparison results, this step uses a classification algorithm to classify candidate record pairs as matches, non-matches or potential matches.

Probabilistic matching is based on a probability model that designates record pairs as matches, possible matches, or non-matches by calculating linkage scores and applying decision rules to those scores.
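A decision rule of this kind can be sketched as two thresholds on the score. The cut-offs 0.8 and 0.3 are the ones used later in this notebook; the function name `classify` and the label "Possible match" are assumptions made for the illustration.

```python
def classify(prob: float, upper: float = 0.8, lower: float = 0.3) -> str:
    """Map a match probability to one of three classes using two thresholds."""
    if prob >= upper:
        return "Match"
    if prob < lower:
        return "Non-Match"
    # scores between the two thresholds are typically sent to clerical review
    return "Possible match"


# a few probabilities of the kind produced by the classifier in this notebook
for prob in (1.0, 0.981, 0.518, 0.03, 0.0):
    print(f"{prob:5.3f} -> {classify(prob)}")
```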


### ECM Classifier

**EM algorithm**:
The Expectation-Maximisation (EM) algorithm is an unsupervised probabilistic algorithm that estimates, directly from the comparison vectors, how likely each field is to agree for matches and for non-matches. This lets it **decide automatically between match and non-match from the likelihood score**. It does not need training data.

References :
* Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1 Springer.
* Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.
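The following is a minimal sketch of EM for a two-class model with conditionally independent binary comparisons, in the spirit of the Fellegi-Sunter framework cited above. It is not the toolkit's `ECMClassifier` implementation; the starting values, the clamping constant `eps`, the function name and the toy data are all assumptions chosen for the illustration.

```python
from math import prod


def em_fellegi_sunter(vectors, n_iter=50, eps=1e-6):
    """Estimate p (share of matches), m and u (per-field agreement probabilities)."""
    n, n_fields = len(vectors), len(vectors[0])
    p, m, u = 0.5, [0.9] * n_fields, [0.1] * n_fields  # hypothetical starting values

    def clamp(x):
        # keep probabilities away from exactly 0 and 1
        return min(max(x, eps), 1 - eps)

    def likelihood(vec, probs):
        return prod(q if x else 1 - q for x, q in zip(vec, probs))

    def posterior(vec):
        """Probability that a comparison vector belongs to the match class."""
        match = p * likelihood(vec, m)
        non_match = (1 - p) * likelihood(vec, u)
        return match / (match + non_match)

    for _ in range(n_iter):
        # E-step: posterior match probability of every vector
        g = [posterior(v) for v in vectors]
        total = sum(g)
        # M-step: re-estimate the parameters from the weighted vectors
        p = clamp(total / n)
        m = [clamp(sum(gi * v[k] for gi, v in zip(g, vectors)) / total)
             for k in range(n_fields)]
        u = [clamp(sum((1 - gi) * v[k] for gi, v in zip(g, vectors)) / (n - total))
             for k in range(n_fields)]
    return p, m, u, posterior


# toy comparison vectors: 45 pairs that mostly agree, 55 that mostly disagree
vectors = [(1, 1, 1)] * 40 + [(1, 1, 0)] * 5 + [(0, 0, 1)] * 5 + [(0, 0, 0)] * 50
p, m, u, posterior = em_fellegi_sunter(vectors)
print("estimated share of matches:", round(p, 3))
print("P(match | all fields agree):", round(posterior((1, 1, 1)), 3))
print("P(match | no field agrees):", round(posterior((0, 0, 0)), 3))
```

No labels are supplied: the two classes emerge because agreement patterns cluster, which is what makes the approach usable without training data.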

In [35]:
ecm = recordlinkage.ECMClassifier()
matches = ecm.fit_predict(features)
print("Number of matched record pairs :",len(matches))
print(matches.to_frame(index=False).head())

Number of matched record pairs : 6277
       rec_id_1      rec_id_2
0  rec-10-dup-0  rec-10-dup-2
1  rec-10-dup-1  rec-10-dup-0
2  rec-10-dup-1  rec-10-dup-2
3    rec-10-org  rec-10-dup-0
4    rec-10-org  rec-10-dup-1


In [56]:
ecm = recordlinkage.ECMClassifier()
# fit() trains the classifier in place; `matches` from the previous cell is kept as-is
ecm.fit(features)
# match probability of every candidate record pair
p = ecm.prob(features)
p.tail(50)

rec_id_1       rec_id_2      
rec-992-org    rec-992-dup-0     1.000000
rec-993-dup-0  rec-1122-org      0.000001
               rec-1199-org      0.000053
               rec-269-org       0.000053
               rec-469-org       0.000001
               rec-993-dup-1     1.000000
               rec-993-dup-2     1.000000
               rec-993-dup-3     1.000000
               rec-993-dup-4     1.000000
               rec-993-org       1.000000
rec-993-dup-2  rec-993-dup-1     1.000000
               rec-993-dup-4     1.000000
rec-993-dup-3  rec-1199-org      0.000053
               rec-1202-org      0.000053
               rec-1469-dup-3    0.000053
               rec-1469-org      0.000053
               rec-269-org       0.000053
               rec-993-dup-2     1.000000
               rec-993-dup-4     1.000000
               rec-993-org       1.000000
rec-993-dup-4  rec-993-dup-1     1.000000
rec-993-org    rec-1122-org      0.000001
               rec-1199-org      0.000053
    

In [57]:
df1= pd.DataFrame(p)

In [62]:
df1.head()

,rec_id_1,rec_id_2,prob
0,rec-0-org,rec-1023-org,0.0
1,rec-0-org,rec-1540-dup-1,0.0
2,rec-1-org,rec-1643-org,0.0
3,rec-1-org,rec-1986-org,0.0
4,rec-1-org,rec-41-org,0.0


In [58]:
df1 = df1.rename(columns={0: 'prob'})


In [59]:
# round the probabilities to three decimals
df1['prob'] = df1['prob'].round(3)


In [60]:
# move the (rec_id_1, rec_id_2) MultiIndex into regular columns
df1 = df1.reset_index()
df1.to_csv("output.csv", index=False)

In [65]:
rec_id_1_value = 'rec-10-dup-1'
column_name = 'rec_id_2'

# create a subset of the DataFrame where rec_id_1 == rec_id_1_value
subset = df1[df1.rec_id_1 ==rec_id_1_value]
subset

,rec_id_1,rec_id_2,prob
6,rec-10-dup-1,rec-10-dup-0,1.0
7,rec-10-dup-1,rec-10-dup-2,1.0


In [81]:
df_b = df_a.reset_index()
input_record = df_b[df_b.rec_id == rec_id_1_value].values.tolist()

# start with the input record, then append each record it was paired with
out = [input_record[0]]
print("!!!!!!!Matches!!!!!!!!")
for value in subset[column_name].values:
    output_record = df_b[df_b.rec_id == value].values.tolist()
    out.append(output_record[0])

print(out)


!!!!!!!Matches!!!!!!!!
[['rec-10-dup-1', 'mikhvyla', 'hannagan', 'windradyen street', 'brentwood vlge', 'penshurst', 2257, 'vic', Timestamp('1977-05-01 00:00:00'), 1030769, 1977, 5, 1, 'MKFL', 'HNKN'], ['rec-10-dup-0', 'hannagan', 'mikhayla', 'rupp lace', 'brentwoo dvlge', 'penshurst', 2283, 'vic', Timestamp('1977-05-01 00:00:00'), 1030769, 1977, 5, 1, 'HNKN', 'MKHL'], ['rec-10-dup-2', 'mikhayla', 'hannaan', nan, 'brentwoodvlge', 'penshurst', 2257, 'vic', Timestamp('1977-05-01 00:00:00'), 1030769, 1977, 5, 1, 'MKHL', 'HNN']]


In [74]:


# out = []

['rec-10-dup-1',
 'mikhvyla',
 'hannagan',
 'windradyen street',
 'brentwood vlge',
 'penshurst',
 2257,
 'vic',
 Timestamp('1977-05-01 00:00:00'),
 1030769,
 1977,
 5,
 1,
 'MKFL',
 'HNKN']

In [41]:
# prob >= 0.8 : Match
# prob <  0.3 : Non-Match

In [42]:
# create a new column 'Match' with default value of 'Non-Match'
df1['Match'] = 'Non-Match'

# label pairs as 'Match' when the 'prob' column is at least 0.8
df1.loc[df1['prob'] >= 0.8, 'Match'] = 'Match'
# df1.to_csv('master_table.csv')
# pairs below 0.3 keep the default 'Non-Match' label
df1.loc[df1['prob'] < 0.3, 'Match'] = 'Non-Match'
# pairs with 0.3 <= prob < 0.8 are uncertain; to label them separately, combine the masks with & (not `and`):
# df1.loc[(df1['prob'] >= 0.3) & (df1['prob'] < 0.8), 'Match'] = 'UnSure'

In [43]:
#count
df1['prob'].value_counts()

0.000    6435
1.000    6271
0.003     135
0.001      16
0.002       4
0.007       4
0.981       4
0.969       1
0.023       1
0.518       1
0.030       1
Name: prob, dtype: int64

In [44]:
df1.to_csv("output_match.csv", index=False)

In [45]:
# Load the CSV file written in the previous cell
df2 = pd.read_csv('output_match.csv')

# Keep only the pairs whose 'prob' is at least 0.8
# (filter on df2 itself, and copy to avoid a SettingWithCopyWarning)
match_df = df2.loc[df2['prob'] >= 0.8].copy()

# Mark every remaining pair as a match
match_df['Match'] = 'Match'

# Save the match dataframe to a new CSV file
match_df.to_csv('match_records.csv', index=False)


In [48]:
rec_id_1_value = 'rec-10-dup-1'
column_name = 'rec_id_2'

# df1 has a default integer index after reset_index(), so df1.loc['rec-10-dup-1']
# raises a KeyError; filter on the rec_id_1 column instead
subset = df1[df1['rec_id_1'] == rec_id_1_value]

# access the values of the desired column in the subset using the .values attribute
values = subset[column_name].values

print("!!!!!!!Matches!!!!!!!!")
# iterate over the values and print them
for value in values:
    print(value)

#####################################################################################################


rec-1540-dup-1,john,benger,gellibrand street,grandview,carnegie,4011,nsw,19710126,5651019
rec-0-org,jinni,dreyer,were street,marriott downs,south melbourne,3172,nsw,19420127,3787407


rec-10-dup-0,hannagan,mikhayla,rupp lace,brentwoo dvlge,penshurst,2283,vic,19770501,1030769
rec-10-dup-2,mikhayla,hannaan,,brentwoodvlge,penshurst,2257,vic,19770501,1030769


### K-means classifier

In [52]:
kmeans = recordlinkage.KMeansClassifier()
matches_kmeans = kmeans.fit_predict(features)
# print(matches_kmeans)
# The predicted number of matches
# type(matches_kmeans)
print("Number of matched record pairs :",len(matches_kmeans))
matches_kmeans.to_frame(index=False).head(50)

Number of matched record pairs : 6229


,rec_id_1,rec_id_2
0,rec-10-dup-0,rec-10-dup-2
1,rec-10-dup-1,rec-10-dup-0
2,rec-10-dup-1,rec-10-dup-2
3,rec-10-org,rec-10-dup-0
4,rec-10-org,rec-10-dup-1
5,rec-10-org,rec-10-dup-2
6,rec-100-dup-0,rec-100-dup-1
7,rec-100-dup-0,rec-100-dup-3
8,rec-100-dup-2,rec-100-dup-0
9,rec-100-dup-2,rec-100-dup-1


In [None]:
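kmeans = recordlinkage.KMeansClassifier()
# fit() trains the classifier in place; do not overwrite `matches_kmeans` with its return value
kmeans.fit(features)
# K-means assigns pairs to clusters and has no probability model, so prob() raises an error
pb = kmeans.prob()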

AttributeError: It is not possible to compute probabilities for the KMeansClassfier

## Evaluation

Compare the match results with the known ground truth (gold standard) to measure the performance of the matching process.
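The three measures reported in this section can be computed from two sets of pairs. This is a minimal sketch with made-up pairs, not the toolkit's own implementation; treating `(a, b)` and `(b, a)` as the same pair is a design choice of this sketch, and the helper names are hypothetical.

```python
def normalise(pairs):
    """Treat (a, b) and (b, a) as the same link."""
    return {tuple(sorted(pair)) for pair in pairs}


def evaluate(true_links, predicted_links):
    """Return precision, recall and F-score of predicted links against true links."""
    true_set, pred_set = normalise(true_links), normalise(predicted_links)
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    denominator = precision + recall
    fscore = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "fscore": fscore}


# hypothetical gold standard and predictions
true_links = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("b1", "b2")]
predicted_links = [("a2", "a1"), ("a1", "a3"), ("b1", "c1")]

scores = evaluate(true_links, predicted_links)
print({name: round(value, 4) for name, value in scores.items()})
```

Precision penalises wrongly linked pairs, recall penalises missed duplicates, and the F-score is their harmonic mean, so it is high only when both are high.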


### Gold standard 
The main objective of evaluation is to verify that the process achieves **high matching quality**. To assess the quality of the matched data for a given project, ground-truth data, also known as a gold standard, is required.

There are several approaches to generating ground-truth data. In this notebook the gold standard was generated together with the synthetic data used for matching.

In [None]:
# gold standard or known truth
IMPORT_FILE_GOLD_STANDARD = './data/dataset_febrl3_true_links.csv'

In [None]:
df_true_links = pd.read_csv(IMPORT_FILE_GOLD_STANDARD)
df_true_links.columns=['rec_id_1','rec_id_2']
df_true_links.set_index(['rec_id_1','rec_id_2'],inplace=True)
df_true_links.head()

rec_id_1,rec_id_2
rec-552-dup-1,rec-552-dup-3
rec-552-dup-0,rec-552-dup-3
rec-552-dup-0,rec-552-dup-1
rec-552-org,rec-552-dup-3
rec-552-org,rec-552-dup-1


In [None]:
def metrics(links_true, links_pred, pairs):
    """Return precision, recall and F-score of the predicted links against the true links."""
    if len(links_pred) > 0:
        # confusion matrix (computed for inspection, not returned)
        matrix = recordlinkage.confusion_matrix(links_true, links_pred, len(pairs))

        # precision
        precision = recordlinkage.precision(links_true, links_pred)

        # recall
        recall = recordlinkage.recall(links_true, links_pred)

        # F-score
        fscore = recordlinkage.fscore(links_true, links_pred)

        return {'precision': precision, 'recall': recall, 'fscore': fscore}
    else:
        return {'precision': 0, 'recall': 0, 'fscore': 0}

In [None]:
## Create Function to Print Results
def get_results(metrics):
    print("\n{0:20}    {1:6}    {2:6}    {3:6}".format('Matching ','Precision','Recall','Fscore'))
    print('------------------------------------------------------')
    for i in metrics.keys():
        print("{0:20}    {1:<6.4}      {2:<6.4f}      {3:<6.4f}".format(i,metrics[i]['precision'],
                                                                      metrics[i]['recall'],
                                                                      float(metrics[i]['fscore'])))

In [None]:
results_score = {}

results_score['ECM'] =  metrics(df_true_links,matches,features)
results_score['K-means'] = metrics(df_true_links,matches_kmeans,features)

In [None]:
get_results(results_score)


Matching                Precision    Recall    Fscore
------------------------------------------------------
ECM                     1.0         0.9601      0.9796
K-means                 1.0         0.9527      0.9758


## View Duplicates


rec_id,given_name,surname,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-1496-org,mitchell,green,wallaby place,delmar,cleveland,2119,sa,19560409,1804974
rec-552-dup-3,harley,mccarthy,pridhamstreet,milton,marsden,3165,nsw,19080419,6089216

In [None]:
# Define a new record to compare against the existing records.
# Column types follow df_a: integer postcode and soc_sec_id, datetime date_of_birth.
new_record = pd.DataFrame({
    'rec_id': ['rec-552-dup-3'],
    'given_name': ['harley'],
    'surname': ['mccarthy'],
    'address_1': ['pridhamstreet'],
    'address_2': ['milton'],
    'suburb': ['marsden'],
    'postcode': [3165],
    'state': ['nsw'],
    'date_of_birth': pd.to_datetime(['19080419'], format='%Y%m%d'),
    'soc_sec_id': [6089216]
}).set_index('rec_id')

# candidate_record_pairs only contains pairs of ids from df_a, so passing it with
# new_record alone raises a KeyError. Pair the new record with every record of df_a instead.
new_pairs = recordlinkage.index.Full().index(new_record, df_a)

# Compute the comparison features with the Compare object defined earlier
features1 = compare_cl.compute(new_pairs, new_record, df_a)

# Classify the new pairs with the K-means classifier fitted above
prediction = kmeans.predict(features1)
print(prediction)

In [54]:
input_rec = 'rec-10-dup-1'
matches_df = matches_kmeans.to_frame()
matches_df = matches_df[matches_df.rec_id_1 ==input_rec]
matches_df= matches_df.reset_index(drop=True) 
#print(matches_df)
for ix, match in matches_df.iterrows():
    #print(df_a[df_a.index.isin(list(match[0]))])
    print(match[0])
    print('*'*50)



rec-10-dup-1
**************************************************
rec-10-dup-1
**************************************************
