**Script to evaluate the output of the mappings between two lists of person names**

This notebook contains the steps for evaluating how precise the mappings between two lists of person names (ListA and ListB) were when produced with the mapping notebook "MappingPersonNames.ipynb".

As input, it needs the correctly mapped Ids between the two lists.

This script was written by Liliana Melgar-Estrada for the SKILLNET PROJECT (https://skillnet.nl/)

Last update: June 21, 2022

# Data preparation (externally, before importing)

If the mappings between two lists have been evaluated and confirmed by an expert, create a file that contains the mapped Ids. The original lists of names must be imported as well.
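
As a rough illustration of what such a file could look like, the sketch below builds the pair column that the evaluation cells read later on. The column names `correctMappings` and `match_personIdB` are the ones used further down in this notebook; the column `personIdA` in this file and the Ids themselves are assumptions made up for the example.

```python
import pandas as pd

# Invented example Ids; real Ids come from your own ListA and ListB
confirmed = pd.DataFrame({
    "personIdA": ["A001", "A002"],
    "match_personIdB": ["B101", "B205"],
})

# One "IdA,IdB" string per confirmed pair, the form that is compared later on
confirmed["correctMappings"] = confirmed["personIdA"] + "," + confirmed["match_personIdB"]

# Serialise to CSV text (pass a file path to to_csv to write to disk instead)
csv_text = confirmed.to_csv(index=False)
print(csv_text)
```

Because each pair contains a comma, the CSV writer quotes that field, so the file can be read back with `pd.read_csv` without splitting the pair.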

# Import libraries

In [None]:
import matplotlib
import pandas as pd
import numpy as np
import re
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# import jellyfish

from IPython.display import display
from IPython.display import clear_output

import csv

from IPython.display import display, HTML
# display(HTML("<style>.container { width:95% !important; }</style>"))
# pd.options.display.max_columns = 10
pd.options.display.max_rows = 1000
# pd.options.display.width = 1000

# to add timestamp to file names
import time

# for progress bar (https://datascientyst.com/progress-bars-pandas-python-tqdm/)
from tqdm import tqdm
from time import sleep

# Import files

In [None]:
# Test data is located in the repository folder indicated in the path below.
# This is the local path to the raw data on your own computer, i.e. the place where you downloaded/cloned the repository
pathRawDataFolder = '/Users/Melga001/stack/workspace/SKILLNET-PRODUCTION/_sharedRepositoriesGithub/mappingPersonNames/data/raw/'

## Import ListA

For the test version, ListA contains unique names from the Catalogus Epistolarum Neerlandicarum (CEN) extracted from a slice of correspondents from van Leeuwenhoek and Swammerdam (internal note: cy08).

In [None]:
# Import here the first file (ListA); these are the names you want to map the other list to.
# the list is imported as a pandas dataframe
dfA_t0 = pd.read_csv(f"{pathRawDataFolder}ListA_cy15Test2_CEN_onlyCorrect.csv", sep = ",", index_col=False, engine='python')

In [None]:
dfA_t0.info()

## Import ListB

For the test version, ListB contains unique names from the Epistolarium (http://ckcc.huygens.knaw.nl/epistolarium/) (internal note: cy13).

In [None]:
# Import here the second file (ListB), these are the names you want to map (find a match) to the initial list.
# the list is imported as a pandas dataframe
dfB_t0 = pd.read_csv(f"{pathRawDataFolder}ListB_cy15Test2_Episto_onlyCorrect.csv", sep = ",", index_col=False, engine='python')

In [None]:
dfB_t0.info()

## Import list of correct mapped pairs

In [None]:
# This is the local path to the test data on your own computer, i.e. the place where you downloaded/cloned the repository
pathTestDataFolder = '/Users/Melga001/stack/workspace/SKILLNET-PRODUCTION/_sharedRepositoriesGithub/mappingPersonNames/data/test/'

# Import file with correct mappings only
df3_correctMapPairs = pd.read_csv(f"{pathTestDataFolder}correctMappedIds.csv", sep = ",", index_col=False, encoding= 'unicode_escape', engine='python')

In [None]:
df3_correctMapPairs.info()

# Prepare ListA and ListB

In this step the data is prepared for the mappings: column names are reassigned and data types are changed in case they were not the right ones.
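
The type conversion applied to both lists in the cells below can be sketched as one reusable function. `prepare_list` is a hypothetical helper, not part of the original notebook, and it treats every non-numeric column as text, which is an assumption.

```python
import numpy as np
import pandas as pd

def prepare_list(df):
    """Return a copy with float columns as int (NaN -> 0) and text columns as str (NaN -> 'null')."""
    out = df.copy()
    for column in out.columns:
        if pd.api.types.is_float_dtype(out[column]):
            # missing dates become 0, which the mapping rules read as "unknown"
            out[column] = out[column].fillna(0.0).astype(int)
        elif not pd.api.types.is_numeric_dtype(out[column]):
            out[column] = out[column].fillna("null").astype(str)
    return out

example = pd.DataFrame({
    "nameString": ["Antoni van Leeuwenhoek", None],
    "dateBirth": [1632.0, np.nan],
})
prepared = prepare_list(example)
print(prepared)
```

Calling the same function on ListA and ListB would keep the two preparation steps identical.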

## Prepare ListA

In [None]:
# assign column names
dfA_t0.columns = [
                   'personIdA',
                   'personStrIdA', #--> delete for delivering
                   'nameStringA',
                   'dateBirthA',
                   'dateDeathA',
                   'dateFlA',
                   ]

In [None]:
# make a copy of the dataframe and rename it
dfA = dfA_t0.reset_index(drop=True).copy()

In [None]:
# convert datatypes and fill in empty values
dfA_columns = dfA.columns
for column in dfA_columns:
    dataType = dfA.dtypes[column]
    if dataType == np.float64:
        dfA[column] = dfA[column].fillna(0.0)
        dfA[column] = dfA[column].astype(int)
    if dataType == object:
        dfA[column] = dfA[column].fillna('null')
        dfA[column] = dfA[column].astype(str)

In [None]:
dfA.info()

In [None]:
dfA.head(10)

## Prepare ListB

In [None]:
# assign column names
dfB_t0.columns = [
                   'personIdB',
                   'personStrIdB', #--> delete for delivering
                   'nameStringB', 
                   'dateBirthB', 
                   'dateDeathB', 
                   'dateFlB',
                   ]

In [None]:
# make a copy of the dataframe and rename it
dfB = dfB_t0.reset_index(drop=True).copy()

In [None]:
# convert datatypes and fill in empty values
dfB_columns = dfB.columns
for column in dfB_columns:
    dataType = dfB.dtypes[column]
    if dataType == np.float64:
        dfB[column] = dfB[column].fillna(0.0)
        dfB[column] = dfB[column].astype(int)
    if dataType == object:
        dfB[column] = dfB[column].fillna('null')
        dfB[column] = dfB[column].astype(str)

In [None]:
# dfB.info()

In [None]:
# dfB.head(10)

## Create a dataframe to store the mappings

In [None]:
dfC = pd.DataFrame()

# Run mapping script

The mapping script below compares the names in ListB with the names in ListA. It checks whether the name strings match and, if so, applies some rules to determine whether the respective dates of birth/death/fl. have a logical relation. If they do, a mapping candidate is added to the dataframe dfC.
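
The general idea (a name comparison first, then a check on the dates) can be sketched as follows. This is not the project's mapping script: `difflib.SequenceMatcher` stands in for the fuzzy-matching library, and the threshold of 70 and the date rule are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher

def name_score(a, b):
    """Similarity of two names on a 0-100 scale (stand-in for a fuzzy-matching score)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def dates_compatible(birth_a, death_a, birth_b, death_b):
    """Hypothetical rule: a date known on both sides (non-zero) must be identical."""
    for a, b in ((birth_a, birth_b), (death_a, death_b)):
        if a != 0 and b != 0 and a != b:
            return False
    return True

def is_candidate(person_a, person_b, threshold=70):
    """person_* are (name, birth, death) tuples; 0 stands for an unknown date."""
    name_a, birth_a, death_a = person_a
    name_b, birth_b, death_b = person_b
    if name_score(name_a, name_b) < threshold:
        return False
    return dates_compatible(birth_a, death_a, birth_b, death_b)

a = ("Jan Swammerdam", 1637, 1680)
print(is_candidate(a, ("Jan Swammerdam", 1637, 0)))     # same name, compatible dates
print(is_candidate(a, ("Jan Swammerdam", 1700, 1760)))  # same name, conflicting dates
```

The real script distinguishes many more situations, depending on which dates are available in each list.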

This script is also stored separately here: 

The counter shows:
|percentage done|items processed/total items \[time passed < time left, number of iterations per second\]

In [None]:
##### PASTE HERE THE SCRIPT AVAILABLE IN THIS PATH: 
### {your path to repository}/mappingPersonNames/src/personMappingScript-v44-20220620.py

In [None]:
# dfC.info()

In [None]:
# dfC.scoreCase.value_counts()

# Prepare mapping output for analysis

#### Replace the .0 in person dates and convert to strings
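
A small sketch of this clean-up using the standard `re` module instead of pandas. The helper name `strip_float_suffix` is made up for the example, and anchoring the pattern with `$` is an assumption about the intended behaviour (only a trailing `.0` should disappear).

```python
import re

def strip_float_suffix(value):
    """Drop a trailing '.0' from the text form of a number, e.g. '1650.0' -> '1650'."""
    return re.sub(r"\.0$", "", str(value))

print(strip_float_suffix(1650.0))
print(strip_float_suffix("10.05"))
# Without the '$' anchor the same pattern also matches inside '10.05'
print(re.sub(r"\.0", "", "10.05"))
```

Writing the pattern as a raw string (`r"..."`) also avoids Python's warning about the invalid escape sequence `\.` in ordinary string literals.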

In [None]:

# strip the trailing '.0' left over from float formatting and keep the dates as strings
dateColumns = ['dateBirthA', 'dateDeathA', 'dateFlA', 'match_dateBirthB', 'match_dateDeathB', 'match_dateFlB']
for column in dateColumns:
    dfC[column] = dfC[column].astype(str).replace(r'\.0$', '', regex=True)


In [None]:
dfC.info()

#### Create joined / unique names and fill the blanks
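
A toy illustration of why the blanks need filling, with the dates reduced to the birth date to keep it short. The rows are invented. When a piece of the key is missing, the whole concatenated key is missing too, so unmatched rows end up empty and can be labelled afterwards.

```python
import pandas as pd

pairs = pd.DataFrame({
    "nameStringA": ["Jan Swammerdam", "Antoni van Leeuwenhoek"],
    "dateBirthA": ["1637", "1632"],
    "match_nameStringB": ["Swammerdam, Jan", None],  # None: no match was found
    "match_dateBirthB": ["1637", None],
})

# Concatenating with a missing value yields a missing value for the whole key
pairs["JoinedInitial"] = pairs["nameStringA"] + "^" + pairs["dateBirthA"]
pairs["JoinedMapped"] = pairs["match_nameStringB"] + "^" + pairs["match_dateBirthB"]

# Label the rows for which no mapped form exists
pairs["JoinedMapped"] = pairs["JoinedMapped"].fillna("notmapped")
print(pairs[["JoinedInitial", "JoinedMapped"]])
```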

In [None]:
dfC['JoinedInitial'] = dfC['nameStringA'] + '^' + dfC['dateBirthA'] + '^' + dfC['dateDeathA'] + '^' + dfC['dateFlA']
dfC['JoinedMapped'] = dfC['match_nameStringB'] + '^' + dfC['match_dateBirthB']  + '^' + dfC['match_dateDeathB'] + '^' + dfC['match_dateFlB']

# Fill in blanks
dfC['JoinedMapped'] = dfC['JoinedMapped'].fillna('notmapped')

In [None]:
dfC

In [None]:
# dfC.info()

#### Run the second script to detect variation in the mapped forms
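
The score computed in this step tells whether the mapped form differs from the initial form: 100 means the two joined strings are identical. A minimal sketch of that reading, using `difflib.SequenceMatcher` from the standard library as a stand-in for the fuzzy-matching ratio (the exact numbers may differ from those of the library used in the notebook):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """0-100 similarity; 100 means the two strings are identical."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

initial = "Jan Swammerdam^1637^1680^0"
print(similarity(initial, "Jan Swammerdam^1637^1680^0"))   # identical forms
print(similarity(initial, "Swammerdam, Jan^1637^1680^0"))  # same person, different form
print(similarity(initial, "notmapped"))                     # no match was found
```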

In [None]:
# Convert these joined names to strings
dfC['JoinedInitial'] = dfC['JoinedInitial'].astype('string')
dfC['JoinedMapped'] = dfC['JoinedMapped'].astype('string')

# # convert NodegoatPersonObjectID to integer again (turns into float in previous step)
# dfC['match_personIdB'] = dfC['match_personIdB'].astype(int)


In [None]:
dfC.info()

In [None]:

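for position, rowIndex in enumerate(dfC.index):
    clear_output(wait=True)
    # compare the joined form from ListA with the joined form mapped from ListB
    initialForm = dfC.loc[rowIndex, 'JoinedInitial']
    mappedForm = dfC.loc[rowIndex, 'JoinedMapped']
    matchScoreFinal = fuzz.ratio(initialForm, mappedForm)
    print("Current progress loop1:", np.round(position / len(dfC) * 100, 2), "%")
    # a score of 100 means that the mapped form is identical to the initial form
    dfC.loc[rowIndex, 'ScoreMappedVersionsNotChangedis100'] = matchScoreFinal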

In [None]:
# dfC

In [None]:
dfC.columns

In [None]:
# Reorder the columns in a way that is easier to evaluate mapping

dfD = dfC[['JoinedInitial',
        'JoinedMapped',
        'personIdA',
        'match_personIdB',
        'scoreCase',
        'scoreType',
        'scoreNameString',
        'ScoreMappedVersionsNotChangedis100']]


In [None]:
dfD.info()

In [None]:
dfD.scoreCase.value_counts()

# Evaluate output of mapping script

In [None]:
# make a copy of the full candidate list from the script above
df4_mapCandidates = dfD.reset_index(drop=True).copy()

In [None]:
df4_mapCandidates.info()

In [None]:
# if Ids are numbers, convert to strings and strip the trailing '.0' left over from float formatting
df4_mapCandidates['match_personIdB'] = df4_mapCandidates['match_personIdB'].astype('string')
df4_mapCandidates['match_personIdB'] = df4_mapCandidates['match_personIdB'].replace(r'\.0$', '', regex=True)

In [None]:
df4_mapCandidates.info()

In [None]:
# This is the info about the file with the correct mappings, imported above ("Import list of correct mapped pairs")
df3_correctMapPairs.info()

In [None]:
df3_correctMapPairs.head(2)

In [None]:
# if Ids are numbers, convert to strings
df3_correctMapPairs['match_personIdB'] = df3_correctMapPairs['match_personIdB'].astype(str)

In [None]:
# convert the correct pairs column to a list
list_correctMappingPairs = list(df3_correctMapPairs['correctMappings'])

In [None]:
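# Create a column in the mappings file with the pairs (to be able to compare with the correct mappings)
# both Ids are cast to strings so that the concatenation also works when the Ids are numbers
df4_mapCandidates['mappingPairs'] = df4_mapCandidates['personIdA'].astype(str) + ',' + df4_mapCandidates['match_personIdB'].astype(str)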

In [None]:
# df4_mapCandidates.mappingPairs.value_counts()

In [None]:
# add a mark to the full candidate list of the mappings that are correct
df4_mapCandidates['isCorrectMapping'] = df4_mapCandidates.mappingPairs[df4_mapCandidates.mappingPairs.isin(list_correctMappingPairs)].copy()

In [None]:
df4_mapCandidates.head(2)

## Evaluation of mapping candidates list
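
The counts gathered in the following cells can be condensed into two summary figures. The sketch below is an addition, not part of the original workflow: it computes precision (share of candidates that are correct) and recall (share of correct mappings that were found) from two lists of `IdA,IdB` pair strings. The function name and the example pairs are invented.

```python
def precision_recall(candidate_pairs, correct_pairs):
    """Return (precision, recall) for two collections of 'IdA,IdB' pair strings."""
    candidates = set(candidate_pairs)
    correct = set(correct_pairs)
    true_positives = candidates & correct
    precision = len(true_positives) / len(candidates) if candidates else 0.0
    recall = len(true_positives) / len(correct) if correct else 0.0
    return precision, recall

candidate_pairs = ["A001,B101", "A002,B999", "A003,B303"]
correct_pairs = ["A001,B101", "A003,B303", "A004,B404", "A005,B505"]

precision, recall = precision_recall(candidate_pairs, correct_pairs)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

In this notebook the two inputs would correspond to the `mappingPairs` column of the candidates and the `correctMappings` column of the confirmed pairs.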

### Correct mappings in mapping candidates

In [None]:
# Get the table of CORRECT mappings that came out in the candidate mappings df
df4_mapCandidates_CORRECT = df4_mapCandidates[df4_mapCandidates.isCorrectMapping.notnull()]

In [None]:
df4_mapCandidates_CORRECT.info()

In [None]:
df4_mapCandidates_CORRECT.scoreCase.value_counts()

In [None]:
df4_mapCandidates_CORRECT.scoreNameString.value_counts()

In [None]:
df4_mapCandidates_CORRECT.scoreType.value_counts()

In [None]:
checkScore2 = df4_mapCandidates_CORRECT[df4_mapCandidates_CORRECT.scoreType.str.contains('matchScore2')]

checkScore2.scoreNameString.value_counts()

### Incorrect mappings in mapping candidates

In [None]:
# Get the table of INCORRECT mappings that came out in the candidate mappings df
df4_mapCandidates_INCORRECT = df4_mapCandidates[df4_mapCandidates.isCorrectMapping.isnull()]

In [None]:
df4_mapCandidates_INCORRECT.info()

In [None]:
df4_mapCandidates_INCORRECT.scoreCase.value_counts()

In [None]:
df4_mapCandidates_INCORRECT.head(50)

In [None]:
checkScore = df4_mapCandidates_INCORRECT[df4_mapCandidates_INCORRECT.scoreCase.str.contains('Y')]

checkScore.scoreNameString.value_counts()

In [None]:
df4_mapCandidates_INCORRECT.scoreType.value_counts()

## Correct mappings not detected
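
The cells in this section find the missed pairs with a list comparison and then rebuild a dataframe from them. As an alternative sketch, the same "in the correct list but not among the candidates" selection can be expressed as a left merge with pandas' `indicator` option. The column name `pair` and the example Ids are assumptions.

```python
import pandas as pd

correct = pd.DataFrame({"pair": ["A001,B101", "A003,B303", "A004,B404"]})
detected = pd.DataFrame({"pair": ["A001,B101", "A003,B303"]})

# indicator=True adds a '_merge' column telling where each row was found
merged = correct.merge(detected.drop_duplicates(), on="pair", how="left", indicator=True)

# 'left_only' rows are correct mappings that the script did not propose
missed = merged.loc[merged["_merge"] == "left_only", "pair"].tolist()
print(missed)
```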

In [None]:
# first make a list of all CORRECT mapping candidate pairs in the candidate list
list_correctMappingPairsCandidates = list(df4_mapCandidates_CORRECT['mappingPairs'])

In [None]:
len(list_correctMappingPairsCandidates)

In [None]:
# this is the list of correct mapping pairs (from df3)
len(list_correctMappingPairs)

In [None]:
# then compare the two lists (all correct mappings vs. the correct mappings found among the candidates)
# and keep the correct mappings that do not appear among the candidates

# function to get the items of list1 that are missing from list2
def nonDetected(list1, list2):
    # initialize an empty list
    not_in_list = []
    # traverse all elements of list1
    for x in list1:
        # keep the element if it does not exist in list2
        if x not in list2:
            not_in_list.append(x)
    return not_in_list


nonDetectedCorrectMappings = nonDetected(list_correctMappingPairs, list_correctMappingPairsCandidates)

In [None]:
len(nonDetectedCorrectMappings)

In [None]:
# nonDetectedCorrectMappings

In [None]:
# Convert the list to a dataframe
df3_correctMapPairs_nonDetected = pd.DataFrame(nonDetectedCorrectMappings, columns=['personIdA,personIdB'])

In [None]:
df3_correctMapPairs_nonDetected.head(2)

In [None]:
# split columns and assign names (https://www.kite.com/python/answers/how-to-split-a-pandas-dataframe-column-in-python)
df3_correctMapPairs_nonDetected_t00 = df3_correctMapPairs_nonDetected['personIdA,personIdB'].str.split(",")

df3_correctMapPairs_nonDetected_t01 = df3_correctMapPairs_nonDetected_t00.to_list()

names = ["personIdA", "personIdB"]

df3_correctMapPairs_nonDetected_t02 = pd.DataFrame(df3_correctMapPairs_nonDetected_t01, columns=names)

In [None]:
df3_correctMapPairs_nonDetected_t02.head(2)

In [None]:
# Get the df from ListA (CEN) that wasn't captured
df1_nonDetected = df3_correctMapPairs_nonDetected_t02.merge(dfA, how = 'inner', left_on = 'personIdA', right_on = 'personIdA')

In [None]:
df1_nonDetected.info()

In [None]:
# Get the df from ListB (Epistolarium) that wasn't captured
df2_nonDetected = df3_correctMapPairs_nonDetected_t02.merge(dfB, how = 'inner', left_on = 'personIdB', right_on = 'personIdB')

In [None]:
df2_nonDetected.info()

In [None]:
# combine dfs to get a DF with Non-detected correct mappings
df_nonDetected = df1_nonDetected.merge(df2_nonDetected, how = 'inner', left_on = 'personIdA', right_on = 'personIdA')

In [None]:
df_nonDetected

In [None]:
# checking if both columns from merge are the same
check = df_nonDetected['personIdB_x'] == df_nonDetected['personIdB_y']

In [None]:
check.value_counts()

In [None]:
# drop duplicated column and rename
df_nonDetected = df_nonDetected.drop(['personIdB_y'], axis=1).copy()
df_nonDetected.rename(columns={"personIdB_x":"personIdB"},inplace=True)

In [None]:
df_nonDetected['pairs'] = df_nonDetected['personIdA'] + ',' + df_nonDetected['personIdB']

In [None]:
# check if the non-detected pairs from the merge dataset are correctly in the list of non-detected mapping pairs

# convert column to list
list_correctMappingPairsNonDetected = list(df_nonDetected['pairs'])

In [None]:
# check if they are the same from the list of non detected mappings

check2 = list_correctMappingPairsNonDetected == nonDetectedCorrectMappings

In [None]:
check2

### Determine causes for not mapping

#### Is it string matching score?
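
The check below uses two token-based scores. To show what a token-sort comparison adds over a plain comparison, here is a sketch built on `difflib` from the standard library. It imitates the idea (normalise, sort the words, then compare) and is not the fuzzy-matching library's implementation. The function names are invented.

```python
import re
from difflib import SequenceMatcher

def normalise(name):
    """Lower-case the name and replace punctuation and other non-word characters with spaces."""
    return re.sub(r"[\W_]+", " ", name.lower()).strip()

def plain_score(a, b):
    """0-100 similarity of the normalised names, word order preserved."""
    return round(100 * SequenceMatcher(None, normalise(a), normalise(b)).ratio())

def token_sort_score(a, b):
    """0-100 similarity after sorting the words, so word order no longer matters."""
    sorted_a = " ".join(sorted(normalise(a).split()))
    sorted_b = " ".join(sorted(normalise(b).split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())

name_a = "Leeuwenhoek, Antoni van"
name_b = "Antoni van Leeuwenhoek"
print(plain_score(name_a, name_b))
print(token_sort_score(name_a, name_b))
```

Names recorded as "surname, first name" in one list and "first name surname" in the other only reach the top score once the words are sorted.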

In [None]:
# first check if what went wrong was the string matching score

for index, row in df_nonDetected.iterrows():
    # capture variable for name in A and in B
    nameStringA = df_nonDetected.loc[index,'nameStringA']
    nameStringB = df_nonDetected.loc[index,'nameStringB']
    # define scores
    matchScore1 = fuzz.token_sort_ratio(nameStringA, nameStringB)
    matchScore2 = fuzz.token_set_ratio(nameStringA, nameStringB)
    df_nonDetected.loc[index,'matchScore1'] = matchScore1
    df_nonDetected.loc[index,'matchScore2'] = matchScore2

In [None]:
df_nonDetected

In [None]:
df_nonDetected.matchScore1.value_counts()

In [None]:
df_nonDetected.matchScore2.value_counts()

In [None]:
# CONCLUSION: the string score is not the cause of the non-captured mappings:
# except for one pair, all scores are above the middle score value (69)

#### Is it the rules/cases?
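
The cases are defined by which dates (birth, death, flourished) are available on each side, with 0 standing for an unknown date. A compact way to reason about them is to reduce each person to an availability signature first. The sketch below covers only three of the cases (A, M and Z, following the definitions given in the comments of the classification cell further down). The helper names are hypothetical.

```python
def availability(birth, death, fl):
    """Which of the three dates are known (0 stands for unknown)."""
    return (birth != 0, death != 0, fl != 0)

def classify(dates_a, dates_b):
    """Return the case label for cases A, M and Z only; '' when none of them applies."""
    only_fl = (False, False, True)
    no_dates = (False, False, False)
    sig_a = availability(*dates_a)
    sig_b = availability(*dates_b)
    # A: both sides have birth and death dates; the Flourished date is optional
    if sig_a[:2] == (True, True) and sig_b[:2] == (True, True):
        return "A"
    # M: both sides have only a Flourished date
    if sig_a == only_fl and sig_b == only_fl:
        return "M"
    # Z: no dates at all on either side
    if sig_a == no_dates and sig_b == no_dates:
        return "Z"
    return ""

print(classify((1632, 1723, 0), (1632, 1723, 1680)))
print(classify((0, 0, 1670), (0, 0, 1675)))
print(classify((0, 0, 0), (0, 0, 0)))
```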

In [None]:
df_nonDetected.columns

In [None]:
# determine if all pairs are correctly classified in the case/score type

for index, row in df_nonDetected.iterrows():
# Capture basic standard columns for the mapping dataset B (to be mapped) as variables 
    personIdB = df_nonDetected.loc[index,'personIdB']
    personStrIdB = df_nonDetected.loc[index,'personStrIdB']
    nameStringB = df_nonDetected.loc[index,'nameStringB']
    dateBirthB = df_nonDetected.loc[index,'dateBirthB']
    dateDeathB = df_nonDetected.loc[index,'dateDeathB']
    dateFlB = df_nonDetected.loc[index,'dateFlB']
    # Capture basic standard columns for the mapping dataset A (to be mapped to) as variables
    personIdA = df_nonDetected.loc[index,'personIdA']
    personStrIdA = df_nonDetected.loc[index,'personStrIdA']
    nameStringA = df_nonDetected.loc[index,'nameStringA']
    dateBirthA = df_nonDetected.loc[index,'dateBirthA']
    dateDeathA = df_nonDetected.loc[index,'dateDeathA']
    dateFlA = df_nonDetected.loc[index,'dateFlA']
    caseName = ''

### Paste here ONLY part of the script that identifies the case
### Script available here: ### {your path to repository}/mappingPersonNames/src/personMappingScript-v44-20220620.py

    df_nonDetected.loc[index,'caseNameTest'] = caseName

In [None]:
df_nonDetected.caseNameTest.value_counts()

In [None]:
checkNonDetectedDf = df_nonDetected[['nameStringA', 'dateBirthA', 'dateDeathA', 'dateFlA', 'nameStringB', 'dateBirthB', 'dateDeathB', 'dateFlB']]

In [None]:
display(checkNonDetectedDf)

In [None]:
testDf = checkNonDetectedDf.reset_index(drop=True).copy()

In [None]:
# determine to which general case a person name pair corresponds

for index, row in testDf.iterrows():
# Capture basic standard columns for the mapping dataset B (to be mapped) as variables 
    # personIdB = testDf.loc[index,'personIdB']
    # personStrIdB = testDf.loc[index,'personStrIdB']
    nameStringB = testDf.loc[index,'nameStringB']
    dateBirthB = testDf.loc[index,'dateBirthB']
    dateDeathB = testDf.loc[index,'dateDeathB']
    dateFlB = testDf.loc[index,'dateFlB']
    # Capture basic standard columns for the mapping dataset A (to be mapped to) as variables
    # personIdA = testDf.loc[index,'personIdA']
    # personStrIdA = testDf.loc[index,'personStrIdA']
    nameStringA = testDf.loc[index,'nameStringA']
    dateBirthA = testDf.loc[index,'dateBirthA']
    dateDeathA = testDf.loc[index,'dateDeathA']
    dateFlA = testDf.loc[index,'dateFlA']
    caseName = ''
    
    ############################## CAPTURE SCORE TYPES (NEEDS UPDATE...) #######################################
    ############# SCORES TYPE A
    # definition Score typeA: persons in both datasets (A and B) have complete dates of birth and death (rules = dates of birth and death are the same)
    if ((dateBirthA != 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA != 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB == 0)) or ((dateBirthA != 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA != 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB == 0)):
        caseName = 'A'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName

    ############# SCORES TYPE B
    # definition ScoreB: persons in either dataset A or B have complete dates of birth and death, and the mapping dataset has either of the two plus Flourished date (uses rules: either dates of birth or death are the same or with buffer, and date of Flourished is between dates of birth and/or death)
    #### B1
    elif ((dateBirthA != 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA != 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA != 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA != 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB == 0)):
        caseName = 'B1'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName
    #### B2 -> dates of death complete (applying rule for dateFl in relation to date of death of the other set)
    elif ((dateBirthA != 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA != 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB == 0)):
        caseName = 'B2'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE C
    # definition ScoreC: persons in either dataset A or B have complete dates of birth and death, and the mapping dataset has either of the two, but the Flourished date is not present in the set with incomplete dates (rules: either dates of birth or death are the same, and the Flourished date is not used)
    #### C1
    elif ((dateBirthA != 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA != 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA != 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA != 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB == 0)):
        caseName = 'C1'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName
    #### C2 -> dates of death complete (applying rule for dateFl in relation to date of death of the other set)
    elif ((dateBirthA != 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB == 0)) or ((dateBirthA != 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB == 0)):
        caseName = 'C2'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE D
    # definition ScoreD: persons in either dataset A or dataset B have either date (of birth and/or death) and both datasets also have the Flourished date (uses rules: either dates of birth or death are the same (or have buffer), and date of Flourished is between dates of birth and/or death following some rules)
    # D with date of birth
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB != 0)):
        caseName = 'D'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName
    # D with date of death
    elif ((dateBirthA == 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB != 0)):
        caseName = 'D'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE E
    # definition ScoreE: none of the persons in either dataset A or B have complete dates of birth and death, one set has the Flourished date and the other doesn't (rules = either dates of birth are the same, or dates of death are the same)
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB == 0)) or ((dateBirthA != 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB != 0)):
        caseName = 'E'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE F
    # definition ScoreF: none of the persons in datasets A or B have complete dates of birth and death, they don't have Flourished date either (rules = either dates of birth are the same, or dates of death are the same, or with buffer)
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB == 0)):
        caseName = 'F'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE G
    # definition ScoreG: persons in one of the datasets have complete dates of birth and death (Flourished date is optional) and persons to map have only Flourished date (rules: one of the persons has complete dates of birth and death, the other person has only Flourished date, which is between dates of birth and/or death following some rules)
    # both datasets have dfl
    elif ((dateBirthA != 0 and dateDeathA != 0 and dateFlA !=0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA !=0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB != 0)):
        caseName = 'G1'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName
    # one of the datasets doesn't have dfl
    elif ((dateBirthA != 0 and dateDeathA != 0 and dateFlA ==0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB != 0)):
        caseName = 'G2'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName
    elif ((dateBirthA == 0 and dateDeathA == 0 and dateFlA !=0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB == 0)):
        caseName = 'G2'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName
    


    ############# SCORES TYPE H
    # definition ScoreH: persons in both datasets have incomplete dates of birth and death (either of the two in contrary way), but Flourished date is there
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB != 0)):    
        caseName = 'H'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE I 
    # definition Score I: persons in both datasets have incomplete dates of birth and death (either of the two in contrary way), and Flourished date is only in one of the sets
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA != 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB != 0)):
        caseName = 'I'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE J
    # definition ScoreJ: persons in both datasets have incomplete dates of birth and death (either of the two) in the opposite way, and none has Flourished date
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB == 0)):
        caseName = 'J'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE K
    #definition ScoreK: persons in one dataset have incomplete dates of birth and death (either of the two) plus Flourished date, and persons in the other dataset have none of the two, only Flourished date
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB != 0)):
        caseName = 'K'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE L
    # definition ScoreL: persons in one dataset have incomplete dates of birth and death (either of the two) and no Flourished date, persons in the other dataset have no date of birth nor death, but do have Flourished date
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB == 0)):  
        caseName = 'L'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


    ############# SCORES TYPE M
    # definition ScoreM: persons in both datasets have only date of Flourished
    elif (dateBirthA == 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB != 0):
        caseName = 'M'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName
                

    # SCORES TYPE X
    # definition ScoreX: persons in one dataset have both dates, and in the other dataset no dates at all
    elif ((dateBirthA != 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA != 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB != 0 and dateFlB == 0)):
        caseName = 'X'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName
                                

    ############# SCORES TYPE Y
    # definition ScoreY: persons in one dataset have either date, and in the other dataset no dates at all
    elif ((dateBirthA != 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA != 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA != 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA != 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB != 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB != 0 and dateDeathB == 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB != 0 and dateFlB == 0)) or ((dateBirthA == 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB != 0)):
        caseName = 'Y'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName

    ############# SCORES TYPE Z
    # definition ScoreZ: this group includes persons with no dates at all in both datasets, scores rely on string matching only (no other rules)
    elif (dateBirthA == 0 and dateDeathA == 0 and dateFlA == 0) and (dateBirthB == 0 and dateDeathB == 0 and dateFlB == 0):
        caseName = 'Z'
        caseNameAdd = testDf.loc[index, 'caseName'] = caseName


In [None]:
testDf

In [None]:
# testDf.iloc[10,1] == testDf.iloc[10,5]