# Record linkage with Splink on death records

## Environment

In [1]:
### Install the splink and recordlinkage packages
# The splink.duckdb.* imports used below belong to the Splink 3 API, hence the version pin
!pip install "splink<4"
!pip install recordlinkage



In [2]:
### Import the required libraries
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on

# For S3 access
import os
import s3fs

## Loading the data

The two tables contain the same individuals (same number of rows, same identifiers). The **deces_perturb** table has been deliberately degraded: inaccuracies were added to the identifying columns.

In [3]:
S3_ENDPOINT_URL = "https://" + os.environ["AWS_S3_ENDPOINT"]
fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': S3_ENDPOINT_URL})
BUCKET = "projet-ssplab"

# Load the death records table
FILE_KEY_S3 = "appariements/deces.parquet"
FILE_PATH_S3 = BUCKET + "/" + FILE_KEY_S3

with fs.open(FILE_PATH_S3, mode="rb") as file_in:
    deces = pd.read_parquet(file_in)

# Load the perturbed death records table
FILE_KEY_S3 = "appariements/deces_perturb.parquet"
FILE_PATH_S3 = BUCKET + "/" + FILE_KEY_S3

with fs.open(FILE_PATH_S3, mode="rb") as file_in:
    deces_perturb = pd.read_parquet(file_in)

The surname and given-name columns of the left-hand table need to be converted to lowercase.

In [4]:
deces['nom_etat_civil'] = deces['nom_etat_civil'].str.lower()
deces['prenoms_etat_civil'] = deces['prenoms_etat_civil'].str.lower()

## Linkage

Number of rows to select from each table (out of 26 million)

In [62]:
nb_lignes_gauche = 10000000
nb_lignes_droite = 100000


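Individuals share the same identifiers row by row (the perturbed table contains the same individuals, sorted in the same order). With the two sample sizes chosen above, the columns compared below have different lengths, so `equals` returns `False`.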

In [63]:
df_gauche = deces.iloc[:nb_lignes_gauche]
df_droite = deces_perturb.iloc[:nb_lignes_droite]

In [65]:
df_gauche['ident_deces'].equals(df_droite['ident_deces'])

False
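A minimal sketch of a check restricted to the rows present in both samples, using toy identifiers (the `id_*` values and the variable names are illustrative, not taken from the data):

```python
import pandas as pd

# Toy identifier columns standing in for df_gauche['ident_deces'] and df_droite['ident_deces']
left_ids = pd.Series(["id_1", "id_2", "id_3", "id_4"])
right_ids = pd.Series(["id_1", "id_2"])

# Different lengths: equals() is False even though the leading rows agree
whole_columns_equal = left_ids.equals(right_ids)

# Compare only the leading rows present in both samples
n_common = min(len(left_ids), len(right_ids))
common_rows_equal = (
    left_ids.iloc[:n_common].reset_index(drop=True)
    .equals(right_ids.iloc[:n_common].reset_index(drop=True))
)

print(whole_columns_equal, common_rows_equal)
```

Applied to the real tables, the same slicing would confirm or refute the row-by-row alignment independently of the sample sizes.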

Share of rows that were perturbed when the right-hand (perturbed) table was created

In [66]:
df_droite.agg(part=('perturbation', 'sum')) / len(df_droite)

|      | perturbation |
|------|--------------|
| part | 0.18392      |


In [67]:
df_gauche = df_gauche.drop(['datenaiss', 'datedeces', 'lieudeces', 'adeces'], axis=1)

Initialising the Linker object

In [68]:
linker = DuckDBLinker([df_gauche, df_droite], {"link_type": "link_only", "unique_id_column_name": "ident_deces"})

Blocking rule

In [69]:
blocking_rules = [
        "l.lieunaiss = r.lieunaiss and (substr(l.nom_etat_civil, 1, 3) = substr(r.nom_etat_civil, 1, 3) or substr(l.nom_etat_civil, length(l.nom_etat_civil) -2 , 3) = substr(r.nom_etat_civil, length(r.nom_etat_civil) - 2, 3))"
    ]

#blocking_rules = [
#        "l.lieunaiss = r.lieunaiss and substr(l.nom_etat_civil, 1, 3) = substr(r.nom_etat_civil, 1, 3)"
#    ]
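The active blocking rule keeps a pair when both records share the same place of birth and their surnames share either their first three or their last three characters. A plain-Python sketch of that predicate, assuming non-null surnames of at least three characters (the function name `same_block` is illustrative; the first two sample pairs appear in the results later in this notebook, the last two are made up):

```python
def same_block(left: dict, right: dict) -> bool:
    """Return True when the pair would be kept by the blocking rule."""
    if left["lieunaiss"] != right["lieunaiss"]:
        return False
    l_name, r_name = left["nom_etat_civil"], right["nom_etat_civil"]
    # Same first three characters OR same last three characters
    return l_name[:3] == r_name[:3] or l_name[-3:] == r_name[-3:]


pairs = [
    ({"lieunaiss": "84019", "nom_etat_civil": "emeraud"},
     {"lieunaiss": "84019", "nom_etat_civil": "beraud"}),   # same suffix "aud"
    ({"lieunaiss": "78487", "nom_etat_civil": "caron"},
     {"lieunaiss": "78487", "nom_etat_civil": "veron"}),    # same suffix "ron"
    ({"lieunaiss": "75114", "nom_etat_civil": "martin"},
     {"lieunaiss": "75114", "nom_etat_civil": "dupont"}),   # nothing in common
    ({"lieunaiss": "75114", "nom_etat_civil": "martin"},
     {"lieunaiss": "69123", "nom_etat_civil": "martin"}),   # different place of birth
]
decisions = [same_block(left, right) for left, right in pairs]
print(decisions)
```

The suffix condition is what lets pairs such as emeraud/beraud through: it tolerates errors at the start of the surname, at the cost of admitting unrelated names that merely end alike.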


In [None]:
#count = linker.cumulative_num_comparisons_from_blocking_rules_chart(blocking_rules)
#count

In [70]:
print("Number of pairs retained: "
f"{linker.count_num_comparisons_from_blocking_rule(' or '.join(blocking_rules))}")

Number of pairs retained: 13551773
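To put this count in perspective, it can be compared with the full Cartesian product of the two samples. A quick calculation, assuming the sample sizes set earlier in this notebook (10 000 000 and 100 000 rows):

```python
nb_lignes_gauche = 10_000_000
nb_lignes_droite = 100_000
pairs_retained = 13_551_773  # count reported by the blocking rule

# Every left record paired with every right record
total_pairs = nb_lignes_gauche * nb_lignes_droite
share_retained = pairs_retained / total_pairs

print(f"{total_pairs:.0e} possible pairs, share retained: {share_retained:.2e}")
```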


### Field comparison rules

In [71]:
comparisons_list = [
        cl.jaro_winkler_at_thresholds("nom_etat_civil", [0.95, 0.88], term_frequency_adjustments = True),
        cl.jaro_winkler_at_thresholds("prenoms_etat_civil", [0.95, 0.88], term_frequency_adjustments = True),
        cl.exact_match("mnais_etat_civil", term_frequency_adjustments=True),
        cl.exact_match("jnais_etat_civil", term_frequency_adjustments=True)
    ]
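The two Jaro-Winkler comparisons sort each pair into ordered levels: exact match, similarity of at least 0.95, similarity of at least 0.88, and everything else. A sketch of that assignment, assuming the similarity score has already been computed elsewhere (the function name and the level labels are illustrative, and the separate level Splink uses for null values is omitted):

```python
def jaro_winkler_level(left: str, right: str, similarity: float,
                       thresholds=(0.95, 0.88)) -> str:
    """Assign a pair to the first comparison level whose condition holds."""
    if left == right:
        return "exact match"
    # Thresholds are checked from the strictest to the loosest
    for threshold in thresholds:
        if similarity >= threshold:
            return f"jaro_winkler >= {threshold}"
    return "all other comparisons"


levels = [
    jaro_winkler_level("rochet", "rochet", 1.0),
    jaro_winkler_level("name_a", "name_b", 0.96),  # hypothetical score
    jaro_winkler_level("name_a", "name_c", 0.90),  # hypothetical score
    jaro_winkler_level("name_a", "name_d", 0.50),  # hypothetical score
]
print(levels)
```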

### Defining the settings dictionary

In [72]:
linkage_settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": blocking_rules,
    "comparisons": comparisons_list,
    "unique_id_column_name": "ident_deces"
}

## Parameter estimation

This is a special case: each row of the right-hand table is known to match exactly one row of the left-hand table.
The prior **probability_two_random_records_match** should therefore be set to 1 / nb_lignes when both tables have nb_lignes rows (nb_lignes true matches among nb_lignes² candidate pairs).

Number of pairs to sample for parameter estimation (in the code below it is passed as `max_pairs` to the random-sampling estimation of the u probabilities).
Recommendation from the documentation: "at least 10 million, but 1 billion for large tables".

The variables `nom_etat_civil` and `prenoms_etat_civil` were chosen to stay close to the Splink getting-started tutorial. The idea is to block on pairs that are likely true matches in order to estimate the proportion of errors and inaccuracies in the data. A closer reading of the documentation is needed to improve on this.

To estimate the **m** parameter, labelled data could be used, which might make the estimation easier.
The documentation gives the command `linker.estimate_m_from_label_column("social_security_number")`.
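When every right-hand record has exactly one true match on the left, the prior probability that two random records match is the number of true matches divided by the number of left/right pairs. A minimal sketch of that calculation and of where the value would go in the settings dictionary (the helper name and the table sizes are illustrative assumptions; the notebook itself does not set this key):

```python
import math


def prior_match_probability(n_left: int, n_right: int, n_true_matches: int) -> float:
    """Share of true matches among all left/right pairs of a link-only job."""
    return n_true_matches / (n_left * n_right)


# Illustrative case: two tables of 100 000 rows, one true match per right-hand row
n_rows = 100_000
prior = prior_match_probability(n_left=n_rows, n_right=n_rows, n_true_matches=n_rows)

# Fragment of a settings dictionary carrying the prior
settings_fragment = {
    "link_type": "link_only",
    "probability_two_random_records_match": prior,
}
print(settings_fragment)
```

With equal table sizes the value reduces to 1 / n_rows; with unequal sizes and one match per right-hand row it reduces to 1 / n_left.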

In [73]:
nb_paires_estimation = 1e8

In [74]:
%%time
linker = DuckDBLinker([df_gauche, df_droite], linkage_settings)
linker.estimate_u_using_random_sampling(max_pairs = nb_paires_estimation)
session_nom = linker.estimate_parameters_using_expectation_maximisation(block_on("nom_etat_civil"))
session_prenom = linker.estimate_parameters_using_expectation_maximisation(block_on("prenoms_etat_civil"))

----- Estimating u probabilities using random sampling -----



Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - nom_etat_civil (no m values are trained).
    - prenoms_etat_civil (no m values are trained).
    - mnais_etat_civil (no m values are trained).
    - jnais_etat_civil (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."nom_etat_civil" = r."nom_etat_civil"

Parameter estimates will be made for the following comparison(s):
    - prenoms_etat_civil
    - mnais_etat_civil
    - jnais_etat_civil

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - nom_etat_civil


Iteration 1: Largest change in params was -0.753 in the m_probability of prenoms_etat_civil, level `Exact match`
Iteration 2: Largest change in params was -0.0475 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 3: Largest change in params was 0.0372 in the m_probability of prenoms_etat_civil, level `Exact match`
Iteration 4: Largest change in params was 0.0332 in the m_probability of prenoms_etat_civil, level `Exact match`
Iteration 5: Largest change in params was 0.0317 in the m_probability of prenoms_etat_civil, level `Exact match`
Iteration 6: Largest change in params was 0.0316 in the m_probability of prenoms_etat_civil, level `Exact match`
Iteration 7: Largest change in params was -0.0321 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 8: Largest change in params was -0.0327 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 9: Largest change in params was -0.0327 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 10: Largest change in params was -0.032 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 11: Largest change in params was -0.0306 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 12: Largest change in params was -0.0286 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 13: Largest change in params was -0.0264 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 14: Largest change in params was -0.024 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 15: Largest change in params was -0.0216 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 16: Largest change in params was -0.0194 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 17: Largest change in params was -0.0173 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 18: Largest change in params was -0.0154 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 19: Largest change in params was -0.0137 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 20: Largest change in params was -0.0122 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 21: Largest change in params was -0.0108 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 22: Largest change in params was -0.00966 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 23: Largest change in params was -0.00863 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 24: Largest change in params was -0.00772 in the m_probability of prenoms_etat_civil, level `All other comparisons`
Iteration 25: Largest change in params was -0.00691 in the m_probability of prenoms_etat_civil, level `All other comparisons`

EM converged after 25 iterations

Your model is not yet fully trained. Missing estimates for:
    - nom_etat_civil (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."prenoms_etat_civil" = r."prenoms_etat_civil"

Parameter estimates will be made for the following comparison(s):
    - nom_etat_civil
    - mnais_etat_civil
    - jnais_etat_civil

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - prenoms_etat_civil


Iteration 1: Largest change in params was -0.803 in the m_probability of nom_etat_civil, level `Exact match`
Iteration 2: Largest change in params was 0.0392 in the m_probability of nom_etat_civil, level `All other comparisons`
Iteration 3: Largest change in params was 0.0121 in the m_probability of nom_etat_civil, level `Exact match`
Iteration 4: Largest change in params was 0.00996 in the m_probability of nom_etat_civil, level `Exact match`
Iteration 5: Largest change in params was -0.0102 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 6: Largest change in params was 0.0121 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 7: Largest change in params was -0.0129 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 8: Largest change in params was 0.013 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 9: Largest change in params was -0.0127 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 10: Largest change in params was 0.0122 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 11: Largest change in params was -0.0117 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 12: Largest change in params was -0.011 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 13: Largest change in params was 0.0103 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 14: Largest change in params was 0.00965 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 15: Largest change in params was -0.00895 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 16: Largest change in params was 0.00827 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 17: Largest change in params was 0.00761 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 18: Largest change in params was -0.00698 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 19: Largest change in params was -0.00639 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 20: Largest change in params was 0.00584 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 21: Largest change in params was -0.00532 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 22: Largest change in params was -0.00485 in the m_probability of jnais_etat_civil, level `All other comparisons`
Iteration 23: Largest change in params was 0.00441 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 24: Largest change in params was 0.00401 in the m_probability of jnais_etat_civil, level `Exact match`
Iteration 25: Largest change in params was -0.00365 in the m_probability of jnais_etat_civil, level `All other comparisons`

EM converged after 25 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values


CPU times: user 2h 57s, sys: 2min 26s, total: 2h 3min 24s
Wall time: 3min 46s


Another strategy (does not appear to work):

In [None]:
#%%time
#
#linker = DuckDBLinker([df_gauche, df_droite], linkage_settings)
#linker.estimate_u_using_random_sampling(max_pairs = nb_paires_estimation)
#
#training_blocking_rule_nom_prenom = block_on(["nom_etat_civil", "prenoms_etat_civil"])
#training_session_nom_prenom = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule_nom_prenom)
#
#training_blocking_rule_anais = block_on("anais_etat_civil")
#training_session_dob = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule_anais)

### Model analysis

In [75]:
linker.match_weights_chart() 

In [18]:
linker.m_u_parameters_chart()

In [19]:
linker.unlinkables_chart()

### Classifying the pairs

Note: the impact of the **0.5** threshold remains to be checked.

In [None]:
#results = linker.predict(threshold_match_probability=0.5)
results = linker.predict()
results_pandas = results.as_pandas_dataframe()
results_pandas.shape



### Conflict resolution

In [21]:
sql = f"""
with ranked as

(
select *,
row_number() OVER (
    PARTITION BY ident_deces_l order by match_weight desc
    ) as row_number
from {results.physical_name}
)

select *
from ranked
where row_number = 1


"""
results = linker.query_sql(sql)
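The query keeps, for each left-hand identifier, only the candidate pair with the highest match weight. A plain-Python sketch of the same selection on toy pairs (the function name and the identifiers are illustrative; ties are resolved here by keeping the first pair seen, whereas `row_number()` picks one arbitrarily):

```python
def best_match_per_left(pairs: list[dict]) -> list[dict]:
    """Keep the highest-weight pair for each left-hand identifier."""
    best: dict[str, dict] = {}
    for pair in pairs:
        key = pair["ident_deces_l"]
        if key not in best or pair["match_weight"] > best[key]["match_weight"]:
            best[key] = pair
    return list(best.values())


toy_pairs = [
    {"ident_deces_l": "A", "ident_deces_r": "X", "match_weight": 4.2},
    {"ident_deces_l": "A", "ident_deces_r": "Y", "match_weight": 9.1},
    {"ident_deces_l": "B", "ident_deces_r": "Y", "match_weight": -3.0},
]
kept = best_match_per_left(toy_pairs)
print(kept)
```

Because the partition is on the left-hand identifier only, a right-hand record can still appear in several retained pairs, as `Y` does in this toy example.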


In [26]:
list_vars = ['match_probability', 'ident_deces_l', 'ident_deces_r', 'nom_etat_civil_l', 'nom_etat_civil_r', 'prenoms_etat_civil_l', 'prenoms_etat_civil_r']
results.columns

Index(['match_weight', 'match_probability', 'source_dataset_l',
       'source_dataset_r', 'ident_deces_l', 'ident_deces_r',
       'nom_etat_civil_l', 'nom_etat_civil_r', 'gamma_nom_etat_civil',
       'prenoms_etat_civil_l', 'prenoms_etat_civil_r',
       'gamma_prenoms_etat_civil', 'mnais_etat_civil_l', 'mnais_etat_civil_r',
       'gamma_mnais_etat_civil', 'jnais_etat_civil_l', 'jnais_etat_civil_r',
       'gamma_jnais_etat_civil', 'lieunaiss_l', 'lieunaiss_r', 'row_number'],
      dtype='object')

In [28]:
# The prefixes are joined with '|' to form a regex alternation; interpolating the list
# itself would yield a character class matching any column that starts with one of its characters.
# 'match_weight', 'source_dataset' and 'row_number' are listed explicitly because the output below contains them.
pattern = ['match_weight', 'match_probability', 'source_dataset', 'ident_deces', 'nom_etat_civil', 'prenoms_etat_civil', 'lieunaiss', 'jnais_etat_civil', 'mnais_etat_civil', 'row_number']
res = results.filter(regex=f"^({'|'.join(pattern)})", axis=1)
res[res['match_weight'] < 0.5]

| | match_weight | match_probability | source_dataset_l | source_dataset_r | ident_deces_l | ident_deces_r | nom_etat_civil_l | nom_etat_civil_r | prenoms_etat_civil_l | prenoms_etat_civil_r | mnais_etat_civil_l | mnais_etat_civil_r | jnais_etat_civil_l | jnais_etat_civil_r | lieunaiss_l | lieunaiss_r | row_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | -11.441115 | 3.595218e-04 | `__splink__input_table_0` | `__splink__input_table_1` | Deces_2021_108191 | Deces_2021_47603 | rochet | rochet | francois joseph cesar | marie jeanne leonie | 10 | 01 | 28 | 01 | 74028 | 74028 | 1 |
| 32 | -25.923420 | 1.571350e-08 | `__splink__input_table_0` | `__splink__input_table_1` | Deces_2021_10822 | Deces_2021_308087 | emeraud | beraud | veronique annik claude | bernard raymond | 12 | 07 | 31 | 30 | 84019 | 84019 | 1 |
| 95 | -11.279960 | 4.019936e-04 | `__splink__input_table_0` | `__splink__input_table_1` | Deces_2021_121701 | Deces_2021_508490 | smith | smith | paulette lucie | michel charles roger | 09 | 11 | 29 | 17 | 78449 | 78449 | 1 |
| 135 | -6.842026 | 8.641234e-03 | `__splink__input_table_0` | `__splink__input_table_1` | Deces_2021_129769 | deces-1971_18928 | maury | maury | lucien georges gabriel | jean eugene emile | 01 | 01 | 26 | 23 | 12145 | 12145 | 1 |
| 261 | -19.051379 | 1.840613e-06 | `__splink__input_table_0` | `__splink__input_table_1` | Deces_2021_156438 | Deces_2021_132472 | cortequisse | bleusse | andre | danielle marie jacqueline | 11 | 11 | 01 | 13 | 80021 | 80021 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 891332 | -12.166468 | 2.174872e-04 | `__splink__input_table_0` | `__splink__input_table_1` | Deces_2021_288537 | Deces_2021_89409 | gerard | girard | odette marie alexandrine | louison noe servan | 01 | 01 | 29 | 22 | 44109 | 44109 | 1 |
| 891396 | -12.041980 | 2.370827e-04 | `__splink__input_table_0` | `__splink__input_table_1` | deces-1970_22133 | Deces_2021_231584 | roy | royer | olivia marie georgette | colette jeanne henriette | 06 | 06 | 06 | 12 | 75114 | 75114 | 1 |
| 891604 | -10.544677 | 6.690297e-04 | `__splink__input_table_0` | `__splink__input_table_1` | Deces_2021_446231 | deces-1972_129390 | roiron | boiron | mireille marie therese | jacques eugene rene | 06 | 05 | 15 | 15 | 42218 | 42218 | 1 |
| 891621 | -25.923420 | 1.571350e-08 | `__splink__input_table_0` | `__splink__input_table_1` | Deces_2021_450369 | Deces_2021_374499 | caron | veron | ginette renee | jacques bernard jean | 04 | 11 | 16 | 10 | 78487 | 78487 | 1 |
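In this output, `match_weight` is the base-2 logarithm of the overall odds of a match, so `match_probability` can be recovered from it. A short sketch of the conversion, checked against weights taken from the rows above (the function name is illustrative):

```python
import math


def weight_to_probability(match_weight: float) -> float:
    """Convert a log2 odds match weight into a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# Two weights from the table, plus the neutral weight 0
for weight in (-11.441115, -25.923420, 0.0):
    print(weight, weight_to_probability(weight))
```

A weight of 0 corresponds to a probability of 0.5, so filtering on `match_weight < 0.5` is not equivalent to filtering on `match_probability < 0.5`.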


### Quality evaluation

In [29]:
def compute_performance_metrics_FEBRL(results, dataset_size):
    """
    Compute performance metrics of a record linkage process on FEBRL synthetic data.
    The assumption is that the size of the two datasets is the same and every record 
    from dataset A has exactly one match in dataset B.

            Parameters:
                    results (pandas DataFrame): Output from the linkage process
                    dataset_size (int): Length of both datasets to be linked

            Returns:
                    performance_metrics (tuple): Tuple of metrics (TP, TN, FP, FN, precision, recall, F-measure)
    """
    # Extract the identifier as a Series (expand=False). Identifiers that do not follow the
    # Deces_2021_<n> pattern (e.g. deces-1971_18928) yield NaN, which never compares equal,
    # so pairs involving them are always counted as false positives.
    results['actual'] = (results['ident_deces_l'].str.extract(r'(Deces_2021_\d+)', expand=False)
                                == results['ident_deces_r'].str.extract(r'(Deces_2021_\d+)', expand=False))
    TP = sum(results['actual'])
    FP = sum(~results['actual'])
    #Pairs that were removed in the indexing phase must be taken into account to compute True and False negatives
    FN = dataset_size - TP
    TN = dataset_size*dataset_size - TP - FN - FP

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    Fscore = 2 * precision * recall / (precision + recall)
    performance_metrics = (TP, TN, FP, FN, precision, recall, Fscore)
    return performance_metrics

def print_performance_metrics(linkage_output, dataset_size):
    """
    Prints performance metrics of a record linkage process on synthetic data.
    The assumption is that the size of the two datasets is the same and every record 
    from dataset A has exactly one match in dataset B.

            Parameters:
                    linkage_output (pandas DataFrame): Output from the linkage process
                    dataset_size (int): Length of both datasets to be linked

            Returns:
                    None
    """
    # Use the function argument rather than the global `results`
    TP, TN, FP, FN, precision, recall, Fscore = compute_performance_metrics_FEBRL(linkage_output, dataset_size)
    print(f"True positives: {TP:,}".replace(',', ' '))
    print(f"True negatives: {TN:,}".replace(',', ' '))
    print(f"False positives: {FP:,}".replace(',', ' '))
    print(f"False negatives: {FN:,}".replace(',', ' '))
    print(f"Precision: {precision:.4}")
    print(f"Recall: {recall:.4}")
    print(f"F-measure: {Fscore:.4}")

# nb_lignes is not defined in the cells shown above; the output below implies a run
# with 900 000 rows per table (true positives + false negatives = 900 000).
print_performance_metrics(results, nb_lignes)



True positives: 655 191
True negatives: 809 998 863 509
False positives: 236 491
False negatives: 244 809
Precision: 0.7348
Recall: 0.728
F-measure: 0.7314
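The ratios reported above follow directly from the counts. A quick cross-check in plain Python, which also recovers the table size used for that run as true positives plus false negatives:

```python
# Counts copied from the output above
TP, FP, FN = 655_191, 236_491, 244_809

precision = TP / (TP + FP)   # share of retained pairs that are true matches
recall = TP / (TP + FN)      # share of true matches that were retained
f_measure = 2 * precision * recall / (precision + recall)

# Each record has exactly one true match, so TP + FN is the table size
dataset_size = TP + FN
true_negatives = dataset_size ** 2 - TP - FP - FN

print(f"{precision:.4} {recall:.4} {f_measure:.4} {dataset_size} {true_negatives}")
```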
