<a href="https://colab.research.google.com/github/lilimelgar/mappingPersonNames/blob/main/src/MappingPersonNames_adapted_for_GoogleColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Script to map two lists of person names**

This notebook contains the steps for mapping two lists of person names (ListA and ListB). The result is a list of candidate matches, each with a score.

This script was written by Liliana Melgar-Estrada for the SKILLNET project (https://skillnet.nl/).

Last update: June 17, 2022

# Data preparation (externally, before importing)

Please read the data preparation instructions in the README file of the GitHub repository: https://github.com/lilimelgar/mappingPersonNames/blob/main/README.md

# Import libraries

In [1]:
pip install fuzzywuzzy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [2]:
pip install python-Levenshtein

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-Levenshtein
  Downloading python_Levenshtein-0.20.8-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.20.8
  Downloading Levenshtein-0.20.8-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (174 kB)
Collecting rapidfuzz<3.0.0,>=2.3.0
  Downloading rapidfuzz-2.13.7-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Installing collected packages: rapidfuzz, Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.20.8 python-Levenshtein-0.20.8 rapidfuzz-2.13.7


# Upload the "data" directory of the GitHub repository to your own Google Drive
The data directory is here: https://github.com/lilimelgar/mappingPersonNames/tree/main/data
Upload it with the same folder structure.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
import matplotlib
import pandas as pd
import numpy as np
import re
import os
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# import jellyfish

import csv

from IPython.display import display, clear_output, HTML
# display(HTML("<style>.container { width:95% !important; }</style>"))
# pd.options.display.max_columns = 10
pd.options.display.max_rows = 1000
# pd.options.display.width = 1000

# to add timestamp to file names
import time

# for progress bar (https://datascientyst.com/progress-bars-pandas-python-tqdm/)
from tqdm import tqdm
from time import sleep

# Import files

## Set data directories

In [None]:
# The test data is in the "data" folder of the repository.
# The path below is relative: it points to the "data" folder one level above the folder
# this notebook runs from, on the computer where you downloaded/cloned the repository.
data_directory = os.path.abspath(os.path.join('..', 'data'))
data_raw_directory = os.path.join(data_directory, 'raw')
data_processed_directory = os.path.join(data_directory, 'processed')
data_temp_directory = os.path.join(data_directory, 'temp')
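
The relative path above suits a local clone. When the notebook runs in Colab with Google Drive mounted at `/content/drive`, the directories might instead be set as in the sketch below. The location of the uploaded folder (`MyDrive/data`) and the `drive_*` variable names are assumptions for illustration; adjust them to wherever you uploaded the "data" directory.

```python
import os

# ASSUMPTION: the "data" folder was uploaded to the top level of "My Drive".
# Adjust this hypothetical path if you placed it elsewhere.
drive_data_directory = os.path.join('/content/drive', 'MyDrive', 'data')

drive_raw_directory = os.path.join(drive_data_directory, 'raw')
drive_processed_directory = os.path.join(drive_data_directory, 'processed')
drive_temp_directory = os.path.join(drive_data_directory, 'temp')

# Report which of the expected sub-folders are actually present.
for folder in (drive_raw_directory, drive_processed_directory, drive_temp_directory):
    status = 'found' if os.path.isdir(folder) else 'missing'
    print(f"{folder} -> {status}")
```

If any folder is reported as missing, check that the upload kept the same folder structure as the repository.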

## Import ListA

For the test version, ListA contains unique names from the Catalogus Epistolarum Neerlandicarum (CEN) extracted from a slice of correspondents from van Leeuwenhoek and Swammerdam (internal note: cy08).

In [None]:
# Import the first file (ListA) here: these are the names you want to map the other list to.
# The list is imported as a pandas dataframe.
list_a_path = os.path.join(data_raw_directory, 'ListA.csv')
dfA_t0 = pd.read_csv(list_a_path, sep = ",", index_col=False, engine='python')

In [None]:
dfA_t0.info()

## Import ListB (the list to map)

For the test version, ListB contains unique names from the Epistolarium (http://ckcc.huygens.knaw.nl/epistolarium/) (internal note: cy13).

In [None]:
# Import the second file (ListB) here: these are the names you want to map to (find a match in) the initial list.
# The list is imported as a pandas dataframe.
list_b_path = os.path.join(data_raw_directory, 'ListB.csv')
dfB_t0 = pd.read_csv(list_b_path, sep = ",", index_col=False, engine='python')

In [None]:
dfB_t0.info()

# Prepare ListA and ListB

In this step, the data is prepared for the mapping: column names are reassigned and data types are converted where they were not the right ones.

## Prepare ListA

In [None]:
# assign column names
dfA_t0.columns = ['personIdA',
                   'nameStringA',
                   'dateBirthA', 
                   'dateDeathA', 
                   'dateFlA'
                   ]
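
Assigning five names to `dfA_t0.columns` raises an error when the CSV does not have exactly five columns, so it can help to check the file first. The sketch below does this with the standard-library `csv` module on an in-memory sample. The sample text, the header names in it, and the helper name `column_counts` are invented for illustration and are not part of the repository.

```python
import csv
import io

EXPECTED_COLUMNS = 5  # id, name string, birth, death, floruit

def column_counts(csv_text):
    """Return the set of field counts found across all rows of the CSV text."""
    return {len(row) for row in csv.reader(io.StringIO(csv_text))}

# Invented sample: note that the quoted name contains a comma but is still one field.
sample = '''personId,nameString,dateBirth,dateDeath,dateFl
1,"Swammerdam, Jan",1637,1680,
'''

widths = column_counts(sample)
if widths != {EXPECTED_COLUMNS}:
    raise ValueError(f"Expected {EXPECTED_COLUMNS} columns per row, found {sorted(widths)}")
print("Column check passed:", widths)
```

To check a real file, read its text with `open(path, newline='', encoding='utf-8').read()` and pass it to the same function.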

In [None]:
# make a copy of the dataframe and rename it
dfA = dfA_t0.reset_index(drop=True)

In [None]:
# convert datatypes and fill in empty values
dfA_columns = dfA.columns
for column in dfA_columns:
    dataType = dfA.dtypes[column]
    if dataType == np.float64:
        dfA[column] = dfA[column].fillna(0.0)
        dfA[column] = dfA[column].astype(int)
    if dataType == object:
        dfA[column] = dfA[column].fillna('null')
        dfA[column] = dfA[column].astype(str)

In [None]:
dfA.info()

In [None]:
dfA.head(10)

## Prepare ListB

In [None]:
# assign column names
dfB_t0.columns = [
                   'personIdB',
                   'nameStringB', 
                   'dateBirthB', 
                   'dateDeathB', 
                   'dateFlB',
                   ]

In [None]:
# make a copy of the dataframe and rename it
dfB = dfB_t0.reset_index(drop=True).copy()

In [None]:
# convert datatypes and fill in empty values
dfB_columns = dfB.columns
for column in dfB_columns:
    dataType = dfB.dtypes[column]
    if dataType == np.float64:
        dfB[column] = dfB[column].fillna(0.0)
        dfB[column] = dfB[column].astype(int)
    if dataType == object:
        dfB[column] = dfB[column].fillna('null')
        dfB[column] = dfB[column].astype(str)

In [None]:
dfB.info()

In [None]:
dfB.head(10)

## Store ListA and ListB for future reference

In [None]:
# this inserts the timestamp in the file name
timestr = time.strftime("%Y%m%d-%H%M%S")

fileListA = (f"{data_temp_directory}/ListA_{timestr}.csv")
dfA.to_csv(fileListA)

fileListB = (f"{data_temp_directory}/ListB_{timestr}.csv")
dfB.to_csv(fileListB)
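
Writing the CSV fails if the target folder does not exist yet. A minimal sketch of building the same kind of timestamped file name while creating the folder on demand is shown below. The helper name `timestamped_csv_path` and the throwaway demo folder are assumptions for illustration only.

```python
import os
import tempfile
import time

def timestamped_csv_path(directory, prefix):
    """Create `directory` if needed and return '<directory>/<prefix>_<timestamp>.csv'."""
    os.makedirs(directory, exist_ok=True)
    timestr = time.strftime("%Y%m%d-%H%M%S")
    return os.path.join(directory, f"{prefix}_{timestr}.csv")

# Demonstration in a throwaway folder (hypothetical, not the repository's data/temp).
demo_directory = os.path.join(tempfile.mkdtemp(), "temp")
demo_path = timestamped_csv_path(demo_directory, "ListA")
print(demo_path)
```

In the notebook, the returned path could be passed straight to `dfA.to_csv(...)`.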

## Create a dataframe to store the mappings

# Run mapping script

The mapping script below compares the names in ListB with the names in ListA. It first checks whether the name strings match; if they do, it applies a set of rules to determine whether the respective dates of birth, death, and floruit (fl.) have a logical relation. If so, a mapping candidate is added to dataframe C.

This script is also stored separately here: 

The counter shows:
`percentage done | items processed/total items [time passed < time left, number of iterations per second]`
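
The name-and-date comparison described at the start of this section can be illustrated with a small, self-contained sketch. This is not the project's `compare_names` function: the function names, the score threshold, the single `buffer` parameter, and the rule "0 means unknown" are all assumptions made for illustration, and `difflib` from the standard library stands in for fuzzywuzzy's scoring.

```python
from difflib import SequenceMatcher

def name_score(a, b):
    """Similarity from 0 to 100 (difflib stand-in for a fuzzy name score)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def dates_compatible(year_a, year_b, buffer=0):
    """ASSUMED rule: 0 means 'unknown'; known years must lie within `buffer` years."""
    if year_a == 0 or year_b == 0:
        return True
    return abs(year_a - year_b) <= buffer

def find_candidates(list_a, list_b, min_score=90, buffer=2):
    """Return (idA, idB, score) for every pair that passes the name and date checks."""
    candidates = []
    for id_a, name_a, birth_a, death_a in list_a:
        for id_b, name_b, birth_b, death_b in list_b:
            score = name_score(name_a, name_b)
            if (score >= min_score
                    and dates_compatible(birth_a, birth_b, buffer)
                    and dates_compatible(death_a, death_b, buffer)):
                candidates.append((id_a, id_b, score))
    return candidates

# Invented toy records: (id, name, birth year, death year); 0 means unknown.
toy_list_a = [("A1", "Swammerdam, Jan", 1637, 1680)]
toy_list_b = [("B1", "Swammerdam, Jan", 1637, 0),
              ("B2", "Swammerdam, Jan", 1550, 1610)]

toy_candidates = find_candidates(toy_list_a, toy_list_b)
print(toy_candidates)
```

With these toy records only B1 remains as a candidate: B2 has the same name, but its birth year is decades away from A1's, so the date check rejects it.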

In [None]:
mapped_candidates = compare_names(dfA, dfB)

# to use different buffers, pass them as keyword arguments, e.g.:
# mapped_candidates_buffer = compare_names(dfA, dfB, buf4=8)

In [None]:
mapped_candidates.info()

In [None]:
mapped_candidates.head(5)

In [None]:
# test = mapped_candidates[mapped_candidates.scoreCase.str.contains('L-')]

In [None]:
# test

In [None]:
mapped_candidates.scoreCase.value_counts()

# Prepare mapping output for analysis

## Replace the .0 in person dates and convert to strings

In [None]:

# Convert the date columns to strings and strip a trailing ".0" left over from float storage.
# The raw string avoids an invalid-escape warning; "$" limits the match to the end of the value.
dateColumns = ['dateBirthA', 'dateDeathA', 'dateFlA',
               'match_dateBirthB', 'match_dateDeathB', 'match_dateFlB']
for column in dateColumns:
    mapped_candidates[column] = mapped_candidates[column].astype(str).replace(r'\.0$', '', regex=True)
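
The ".0" appears when a year has been stored as a float (for example `1632.0`). The sketch below shows the same clean-up on plain Python values with the standard-library `re` module. The helper name `strip_float_suffix` and the sample values are invented for illustration. Anchoring the pattern with `$` is a suggested precaution, since an unanchored `\.0` would also remove a ".0" in the middle of a value.

```python
import re

def strip_float_suffix(value):
    """Convert a value to text and drop a trailing '.0' left over from float storage."""
    return re.sub(r"\.0$", "", str(value))

# Invented sample values: a float, a string, an int, and a value with ".0" in the middle.
cleaned = [strip_float_suffix(v) for v in (1632.0, "1723.0", 1680, "10.05")]
print(cleaned)  # ['1632', '1723', '1680', '10.05']

# For comparison, the unanchored pattern damages the last value.
print(re.sub(r"\.0", "", "10.05"))  # 105
```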


In [None]:
mapped_candidates.info()

## Create joined / unique names and fill the blanks

In [None]:
mapped_candidates['JoinedInitial'] = mapped_candidates['nameStringA'] + '^' + mapped_candidates['dateBirthA'] + '^' + mapped_candidates['dateDeathA'] + '^' + mapped_candidates['dateFlA']
mapped_candidates['JoinedMapped'] = mapped_candidates['match_nameStringB'] + '^' + mapped_candidates['match_dateBirthB']  + '^' + mapped_candidates['match_dateDeathB'] + '^' + mapped_candidates['match_dateFlB']

# Fill in blanks
mapped_candidates['JoinedMapped'] = mapped_candidates['JoinedMapped'].fillna('notmapped')

In [None]:
mapped_candidates.head(5)

In [None]:
# mapped_candidates.info()

## Run the second script to detect variation in the mapped forms

In [None]:
# Convert these joined names to strings
mapped_candidates['JoinedInitial'] = mapped_candidates['JoinedInitial'].astype('string')
mapped_candidates['JoinedMapped'] = mapped_candidates['JoinedMapped'].astype('string')

In [None]:
mapped_candidates.info()

In [None]:
# Create a score (0-100) for how much the initial and the mapped form of the name
# (including dates) differ; 100 means the two forms are identical.
# Columns are addressed by name and rows by index label, so the loop does not depend
# on column positions or on the index being 0..n-1.
total = len(mapped_candidates)
for position, rowIndex in enumerate(mapped_candidates.index):
    clear_output(wait=True)
    initialForm = mapped_candidates.loc[rowIndex, 'JoinedInitial']
    mappedForm = mapped_candidates.loc[rowIndex, 'JoinedMapped']
    matchScoreFinal = fuzz.ratio(initialForm, mappedForm)
    print("Current progress loop1:", np.round(position / total * 100, 2), "%")
    if 0 <= matchScoreFinal <= 100:
        mapped_candidates.loc[rowIndex, 'ScoreMappedVersionsNotChangedis100'] = matchScoreFinal
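
To see how this score behaves, the sketch below builds joined forms with the same `^` separator and scores them. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so exact values may differ from what fuzzywuzzy returns. The helper names and the sample records are invented for illustration.

```python
from difflib import SequenceMatcher

def joined_form(name, birth, death, fl):
    """Join name and dates with '^', mirroring the JoinedInitial/JoinedMapped layout."""
    return "^".join(str(part) for part in (name, birth, death, fl))

def variation_score(initial_form, mapped_form):
    """Similarity from 0 to 100; 100 means the two forms are identical."""
    return round(100 * SequenceMatcher(None, initial_form, mapped_form).ratio())

initial = joined_form("Swammerdam, Jan", 1637, 1680, 0)
unchanged = joined_form("Swammerdam, Jan", 1637, 1680, 0)
changed = joined_form("Swammerdam, Johannes", 1637, 1680, 0)

score_unchanged = variation_score(initial, unchanged)
score_changed = variation_score(initial, changed)
score_notmapped = variation_score(initial, "notmapped")
print(score_unchanged, score_changed, score_notmapped)
```

An identical pair scores 100, a variant spelling scores somewhat lower, and a row filled with `notmapped` scores far lower, which makes the column useful for sorting candidates during evaluation.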

In [None]:
# mapped_candidates

In [None]:
mapped_candidates.columns

In [None]:
# Reorder the columns so that the mappings are easier to evaluate

dfD = mapped_candidates[['JoinedInitial',
        'JoinedMapped',
        'personIdA',
        'match_personIdB',
        'scoreCase',
        'scoreType',
        'scoreNameString',
        'ScoreMappedVersionsNotChangedis100']]


In [None]:
dfD.info()

In [None]:
dfD.scoreCase.value_counts()

# Download the mapping candidates file

In [None]:
# This file will contain the mapping candidates, which are easier to evaluate externally, e.g., in OpenRefine

datasetA = 'ListA' #change list name if wanted
datasetB = 'ListB' #change list name if wanted
description = '' #add file description if wanted

# Save dfD; the timestamp in the file name keeps earlier exports from being overwritten
timestr = time.strftime("%Y%m%d-%H%M%S")
fileNameMappingCandidates = (f"{data_processed_directory}/MappingsPersonsCandidates_{datasetA}vs{datasetB}_{description}_{timestr}.csv")
dfD.to_csv(fileNameMappingCandidates)