There are two ways to start:

- either create the data again and then continue with the name distance calculation, or
- import existing data and continue with the name distance calculation

## Create Plazi Collectors Data Set and Match Names to WikiData

Create a data set of collectors recorded by Plazi:

- see <https://tb.plazi.org/GgServer/srsStats>, section “Materials Citation Data”
- select the data (columns) of interest, then adjust the output in the section **Fields to Use in Statistics** below:
    - choose **Operation** “show individual values”
    - filter values at **Filter on Values**
    - set the limit to e.g. 5 to preview the data you would get
    - further down the page, copy the download link for the data format you need

## Example Data

| Field Name | Filter on Values |
|-|-|
| Collector Name          | >0 |
| GBIF Occurrence ID      | !0 |
| Collecting Month        |    |
| Collecting Year         |    |
| Collecting Decade       |    |
| Collecting Date         |    |
| Materials Citation UUID |    |

```bash
# added filter: gbifOccurrenceId → !0
# added filter: collector → >0 (seems to give the non empty collector names)
filename="plazi-stats_numberOfTreatments_gbifOccurrenceId-not0_date_decade_year_month_collector-gt0_$(date '+%Y%m%d').tsv"
wget --output-document="${filename}" \
'https://tb.plazi.org/GgServer/srsStats/stats?outputFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector&groupingFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector&FP-matCit.gbifOccurrenceId=!0&FP-matCit.collector=%3E0&format=TSV'

cat "${filename}" | wc -l
# 417402 lines, i.e. 417401 records after subtracting the column header line

{ head -n 5 "${filename}"; echo "..."; tail -n 5 "${filename}"; } | column --table --separator $'\t' | sed 's@^@  # @;'
  # DocCount  MatCitId                          MatCitGbifOccurrenceId  MatCitDate  MatCitDecade  MatCitYear  MatCitMonth  MatCitCollector
  # 1         78F03CF8FFE2FFE5C0C4F883FE73F8B4  3419301320                          0             0           0            1888 - 1890 & Morong, T.
  # 1         78F03CF8FFE5FFE2C187FB83FD0AFB94  3419301397                          0             0           0            1914 & Chodat, R.
  # 1         1FFD3CFF806D3D11C410027311B3FEAC  4012799597              1980-09-19  1980          1980        9            1980 - Sino- American Botanical Expedition
  # 1         AFA17A73FFA8F2414DA6F9AB94DCF942  3466701331                          0             0           0            20. 8.201 3 & Delage, A.
  # ...                                                                                                                    
  # 1         3B7F3CD7FFEDFFF5FB68FCBD4061FCB8  3072658352              2017-07-05  2010          2017        7            Z. Z. Xia
  # 1         3B5C3CD3FF9FFFACFCCB2B09BAD0FE79  1699618906              2002-06-25  2000          2002        6            Z. Z. Yang
  # 1         B5B23CA2C006FF87FB6FF9CBFA17F94A  2028140173              2009-08-18  2000          2009        8            Z. Z. Yang
  # 1         3B063C92F16FFF93DA9FFC4DFEDB1D0B  3866542316              2015-06-08  2010          2015        6            ZZ Zhang
  # 1         3B7C3CAD6B18FFBCADDEFA01FE543FE5  3034555558              1956-06-20  1950          1956        6            А. Schnitnikov
```
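
The downloaded file can also be checked from Python. The following is a minimal sketch, assuming pandas is installed; the two sample rows are copied from the output above, and the variable names (`sample_tsv`, `df_sample`) are made up for this illustration. In practice you would pass the downloaded file name to `pd.read_csv` instead of the in-memory sample.

```python
import io
import pandas as pd

# Two records copied from the sample output above (tab-separated, with header).
sample_tsv = (
    "DocCount\tMatCitId\tMatCitGbifOccurrenceId\tMatCitDate\t"
    "MatCitDecade\tMatCitYear\tMatCitMonth\tMatCitCollector\n"
    "1\t3B7F3CD7FFEDFFF5FB68FCBD4061FCB8\t3072658352\t2017-07-05\t"
    "2010\t2017\t7\tZ. Z. Xia\n"
    "1\t78F03CF8FFE5FFE2C187FB83FD0AFB94\t3419301397\t\t"
    "0\t0\t0\t1914 & Chodat, R.\n"
)

# Read the identifier as text so that it is never converted to a float.
df_sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype={"MatCitGbifOccurrenceId": str},
)

# Equivalent of `wc -l` minus the header line.
record_count = len(df_sample)
# Records without a collecting date arrive as empty fields.
missing_dates = int(df_sample["MatCitDate"].isna().sum())

print(record_count, "records,", missing_dates, "without a date")
```
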



In [1]:
# First starting point: fetch current data from the Plazi server.
# Alternatively, use the second starting point below, which loads previously saved TSV files.
import json
import requests
import pandas as pd
import time
import pprint

# https://tb.plazi.org/GgServer/srsStats/stats?
#   outputFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector
#   &
#   groupingFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector
#   &
#   FP-matCit.gbifOccurrenceId=!0
#   &
#   FP-matCit.collector=%3E0
#   &
#   format=TSV
url = 'https://tb.plazi.org/GgServer/srsStats/stats'
params = [
    ('outputFields',   'matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade matCit.year matCit.month matCit.collector'),
    ('groupingFields', 'matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade matCit.year matCit.month matCit.collector'),
    ('FP-matCit.gbifOccurrenceId', '!0'),
    ('FP-matCit.collector', '>0'),
    ('format', 'JSON')
]

start_time = time.time()
print("Send data request to", url)

response = requests.get(url, params=params)
payload = response.json()  # named `payload` to avoid shadowing the built-in `dict`
collectors = payload['data']

print("Response of %s came in %s seconds (HTTP-code: %s)" % (
    url, 
    (time.time() - start_time), 
    response.status_code)
)

start_time = time.time()
print("Normalize JSON data with pandas …")

df = pd.json_normalize(collectors)

print("Normalization took %s seconds" % (time.time() - start_time) )

print("Print data sample …")
df



Send data request to https://tb.plazi.org/GgServer/srsStats/stats
Response of https://tb.plazi.org/GgServer/srsStats/stats came in 26.53520917892456 seconds (HTTP-code: 200)
Normalize JSON data with pandas …
Normalization took 4.050419092178345 seconds
Print data sample …


Unnamed: 0,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth,MatCitCollector
0,1,32B9471022665821C16802F7FC90F8D7,4429920328,1995-04-26,1990,1995,4,0. Haland
1,1,78F03CF8FFE2FFE5C0C4F883FE73F8B4,3419301320,,0,0,0,"1888 - 1890 & Morong, T."
2,1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0,"1914 & Chodat, R."
3,1,1FFD3CFF806D3D11C410027311B3FEAC,4012799597,1980-09-19,1980,1980,9,1980 - Sino- American Botanical Expedition
4,1,3B393CF1137BB1294D88FA06FD5A5319,4101309727,,0,0,0,1 Apri. 2009 & R. Zampaulo
...,...,...,...,...,...,...,...,...
464312,1,3B351656D566FFAA3AA6256646E8FCD2,3912951308,2017-05-23,2010,2017,5,Z. Z. Yang & C. G. Li
464313,1,3B351656D562FFAE3A9B2751458AFEEA,3912951303,2021-05-11,2020,2021,5,Z. Z. Yang & Z. M. Li
464314,1,3B351656D562FFAE3B42277A44B5FECE,3912951304,2021-05-11,2020,2021,5,Z. Z. Yang & Z. M. Li
464315,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6,ZZ Zhang
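
The parameter list used in the request corresponds to the query string shown in the comment of that cell. The following sketch, using only the standard library, shows how such a list is URL-encoded; it is an illustration of the encoding step, not necessarily byte-for-byte what `requests` sends. The variable names `fields`, `query` and `full_url` are made up for this example.

```python
from urllib.parse import urlencode

fields = (
    "matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade "
    "matCit.year matCit.month matCit.collector"
)

params = [
    ("outputFields", fields),
    ("groupingFields", fields),
    ("FP-matCit.gbifOccurrenceId", "!0"),  # filter: occurrence ID is not 0
    ("FP-matCit.collector", ">0"),         # filter: seems to give non-empty collector names
    ("format", "JSON"),
]

# Spaces become "+", ">" becomes "%3E" and "!" becomes "%21".
query = urlencode(params)
full_url = "https://tb.plazi.org/GgServer/srsStats/stats?" + query

print(full_url)
```

Keeping the parameters as a list of tuples preserves their order, which makes the resulting URL easy to compare with the one copied from the Plazi statistics page.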


In [2]:
list(df.columns)

['DocCount',
 'MatCitId',
 'MatCitGbifOccurrenceId',
 'MatCitDate',
 'MatCitDecade',
 'MatCitYear',
 'MatCitMonth',
 'MatCitCollector']

In [3]:
# move 'MatCitCollector' to be the first column (prepare parsing names for bin/agent_parse4tsv.rb: collectors in the 1st column)
col = df.pop("MatCitCollector")
df.insert(0, col.name, col)
df

Unnamed: 0,MatCitCollector,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
0,0. Haland,1,32B9471022665821C16802F7FC90F8D7,4429920328,1995-04-26,1990,1995,4
1,"1888 - 1890 & Morong, T.",1,78F03CF8FFE2FFE5C0C4F883FE73F8B4,3419301320,,0,0,0
2,"1914 & Chodat, R.",1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0
3,1980 - Sino- American Botanical Expedition,1,1FFD3CFF806D3D11C410027311B3FEAC,4012799597,1980-09-19,1980,1980,9
4,1 Apri. 2009 & R. Zampaulo,1,3B393CF1137BB1294D88FA06FD5A5319,4101309727,,0,0,0
...,...,...,...,...,...,...,...,...
464312,Z. Z. Yang & C. G. Li,1,3B351656D566FFAA3AA6256646E8FCD2,3912951308,2017-05-23,2010,2017,5
464313,Z. Z. Yang & Z. M. Li,1,3B351656D562FFAE3A9B2751458AFEEA,3912951303,2021-05-11,2020,2021,5
464314,Z. Z. Yang & Z. M. Li,1,3B351656D562FFAE3B42277A44B5FECE,3912951304,2021-05-11,2020,2021,5
464315,ZZ Zhang,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6


## Write the Output Data or Get Existing Data

Write source data and also set some global script variables


In [1]:
# second starting point: also picks up previously saved TSV data
import os
import time
import pandas as pd
import pprint

if not os.path.exists('data'):
    print("Make data directory for saving …")
    os.makedirs('data')

# Set some global variables
# this_timestamp_for_data=time.strftime('%Y%m%d') # 20230719
this_timestamp_for_data=20231116

this_name_source_file=\
  os.path.join("data", ("plazi_GbifOccurrenceId_CitCollector_%s.tsv" % this_timestamp_for_data))
this_name_source_file_parsed=\
  os.path.join("data", ("plazi_GbifOccurrenceId_CitCollector_%s_parsed.tsv" % this_timestamp_for_data))

if 'df' in locals():
    df.to_csv(this_name_source_file, sep='\t', index=False # skip the index
        # , header=["custom_colname_1", "custom_colname_2", "…"] # could rewrite header labels
    )
    print("Wrote data results into %s (%d kB)" % (
        this_name_source_file
        , os.path.getsize(this_name_source_file) >> 10 
          # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
        ) 
    )
else:
    if os.path.exists(this_name_source_file):
        print("Recent data from a Plazi data query was not found, but a data result file exists\nand can be used from %s (%d kB).\nIn this script we use:\n- %s\n- %s\n- timestamp: %s" % 
            (this_name_source_file
             , os.path.getsize(this_name_source_file) >> 10 # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
             , this_name_source_file
             , this_name_source_file_parsed
             , this_timestamp_for_data
            )
        )
    else:
        print("No source data found that can be analysed (%s)"
        "\nRun a new data request on Plazi again or set a different name source file." % this_name_source_file)



Recent data from a Plazi data query was not found, but a data result file exists
and can be used from data/plazi_GbifOccurrenceId_CitCollector_20231116.tsv (38076 kB).
In this script we use:
- data/plazi_GbifOccurrenceId_CitCollector_20231116.tsv
- data/plazi_GbifOccurrenceId_CitCollector_20231116_parsed.tsv
- timestamp: 20231116
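
The file sizes in these messages are computed with a right shift by 10 bits, which is the same as an integer division by 1024. A minimal sketch of that arithmetic (the helper name `size_in_kib` is made up for this illustration):

```python
def size_in_kib(num_bytes):
    """Convert a byte count to whole kibibytes (1 KiB = 1024 bytes)."""
    # Shifting right by 10 bits drops the lowest 10 bits, i.e. divides by 2**10.
    return num_bytes >> 10


example_bytes = 10000
print(size_in_kib(example_bytes))   # same value as 10000 // 1024
```

Because the shift truncates, a file smaller than 1024 bytes is reported as 0 kB.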


## Parse Collector Names

Now parse the names with dwcagent. The parsing script expects the collector names in the first column:

```bash
cd bin
ruby agent_parse4tsv.rb \
  --input ../data/plazi_GbifOccurrenceId_CitCollector_20231116.tsv \
  --output ../data/plazi_GbifOccurrenceId_CitCollector_20231116_parsed.tsv

# Alternatively, measure the running time of the parsing script with `time`;
# prefix the command with «nice» («nice ruby …») if the process slows the system down too much;
# add --logfile to get a log of skipped names

time ruby agent_parse4tsv.rb --logfile \
  --input ../data/plazi_GbifOccurrenceId_CitCollector_20231116.tsv \
  --output ../data/plazi_GbifOccurrenceId_CitCollector_20231116_parsed.tsv
# -------------------------
# Done.
# We have 24838 empty parsing cleaned results detected.
#   You can also use --develop to get a full result table including the used source data of each parsed line
# Wrote log file of skipped names to
#   ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv_dwcagent_3.0.11.0.log
# Wrote data to
#   ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv
# -------------------------
# 
# real    6m23,077s
# user    3m40,778s
# sys     2m4,969s
```

## Load WikiData Names and Parsed Collector Data

This procedure follows Niels Klazenga’s `match_names_to_wikidata_items.ipynb` (<https://github.com/nielsklazenga/avh-collectors/blob/47c3374f02bea4064b1c6708d79bcd9ba55a08a0/match_names_to_wikidata_items.ipynb>).

Use [`create_wikidata_datasets_botanists-altlabel.ipynb`](create_wikidata_datasets_botanists-altlabel.ipynb) to generate the WikiData botanist data set first, then load it to prepare the matching against your own data:

In [2]:
import pandas as pd

wikidata = pd.read_csv(
    # "data/wikidata_persons_botanists_20231030_1539.csv", # inverse match: [particle +] family, given
    # "data/wikidata_persons_botanists_20231116.csv",        # match: given [+ particle] + family[+ , suffix]
    "data/wikidata_persons_botanists_20260210.csv",
    index_col=0, low_memory=False,
    dtype={
        'yob':'Int32',
        'yod':'Int32',
        'wyb':'Int32',
        'wye':'Int32'
    }    
)
pprint.pprint(wikidata.columns)
display(wikidata.head())

Index(['item', 'itemLabel', 'surname', 'initials', 'canonical_string',
       'canonical_string_fullname', 'orcid', 'viaf', 'isni', 'harv', 'ipni',
       'abbr', 'bionomia_id', 'yob', 'yod', 'wikidata_link', 'orcid_link',
       'harv_link', 'ipni_link', 'bionomia_link'],
      dtype='str')


Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q100142069,Frida Eggens,,,Eggens,Eggens,,,,,20045232-1,Eggens,,,,http://www.wikidata.org/wiki/Q100142069,,,https://www.ipni.org/a/20045232-1,
1,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Frida,F.,F. Eggens,Frida Eggens,,,,,20045232-1,Eggens,,,,http://www.wikidata.org/wiki/Q100142069,,,https://www.ipni.org/a/20045232-1,
2,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Elizabeth,E.,E. Harrison,Elizabeth Harrison,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
3,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,,,Mrs A. H.,Mrs A. H.,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
4,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs Arnold,M. A.,M. A. Harrison,Mrs Arnold Harrison,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
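
The year columns are requested as `Int32` (capital "I"), which is the nullable integer type in pandas; a plain `int32` cannot hold missing values. The following sketch illustrates this with a tiny in-memory CSV. The reduced column set and the variable names are assumptions made for this example only.

```python
import io
import pandas as pd

# Reduced sample: one person without and one with known years of birth and death.
sample_csv = (
    ",itemLabel,yob,yod\n"
    "0,Frida Eggens,,\n"
    "1,Elizabeth Harrison,1792,1834\n"
)

people = pd.read_csv(
    io.StringIO(sample_csv),
    index_col=0,
    dtype={"yob": "Int32", "yod": "Int32"},  # nullable integers keep years as whole numbers
)

print(people.dtypes)
print(people)
```
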


In [3]:
# Create data frames with unique canonical strings:
# group by canonical name/string and count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()
wd_matchtest_fullnames = wikidata.groupby('canonical_string_fullname').agg({'item': ['count']}).reset_index()

display(wd_matchtest)
display(wd_matchtest_fullnames)

# colls = list(wikidata.columns)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"""F."" Ryser",1
1,"""N.A. Antipova"" (lapsus)",1
2,"""N.A.Antipova"" (lapsus)",1
3,"""The grandmother of female scientists in Ghana""",1
4,"""Н. А. Антипова"" (lapsus)",1
...,...,...
171443,赵云鹏,1
171444,郭亚龙,1
171445,金井弘夫(Hiroo Kanai),1
171446,金双 马,1


Unnamed: 0_level_0,canonical_string_fullname,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"""Fritz"" Ryser",1
1,"""N.A. Antipova"" (lapsus)",1
2,"""N.A.Antipova"" (lapsus)",1
3,"""The grandmother of female scientists in Ghana""",1
4,"""Н. А. Антипова"" (lapsus)",1
...,...,...
204788,赵云鹏,1
204789,郭亚龙,1
204790,金井弘夫(Hiroo Kanai),1
204791,金双 马,1
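
The `agg({'item': ['count']})` form yields two header levels, which is why the tables above show a second header row. If flat column names are preferred, named aggregation is one alternative. A minimal sketch on toy data follows; the item identifiers `Q1` to `Q3` and the column name `item_count` are hypothetical.

```python
import pandas as pd

# Toy data: one canonical string shared by two (hypothetical) WikiData items.
names = pd.DataFrame({
    "canonical_string": ["E. Harrison", "E. Harrison", "F. Eggens"],
    "item": ["Q1", "Q2", "Q3"],
})

# Named aggregation: the keyword becomes the (single-level) column name.
counts = (
    names.groupby("canonical_string")
    .agg(item_count=("item", "count"))
    .reset_index()
)

print(counts)
```

A count greater than 1 marks a canonical string that is ambiguous, i.e. shared by several items.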


In [4]:
# atomized names parsed already by ruby gem package: dwcagent

print("Load name parsed data from {file_name} ({file_size} kb)...".format(
    file_name=this_name_source_file_parsed,
    file_size=os.path.getsize(this_name_source_file_parsed) >> 10 # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
))

collectors = pd.read_csv(this_name_source_file_parsed, 
    sep="\t", low_memory=False,
    dtype={
        'family': str,
        'given': str,
        'suffix': str,
        'particle': str,
        'dropping_particle': str,
        'nick': str,
        'appellation': str,
        'title': str
    }
)
collectors.dropna(subset=['family'], inplace=True) # remove where family was NA, e.g. from originally «??» aso.
collectors.sort_values(by=['family', 'given'], inplace=True)

def convert_to_time_periode(x, freq='ms'):
    try:
        return pd.Period(x, freq=freq)
    except Exception:  # do not swallow KeyboardInterrupt or SystemExit
        # TODO check and curate date string values
        return pd.NaT

print("Modify MatCitDate to periode and remove some 0 time values...")

for col in ['MatCitDate']:
    print("- convert", col, "to pd.Period(...) in collectors ...")
    collectors[col] = collectors[col].apply(lambda x: convert_to_time_periode(x, freq='ms'))
    
for col in ['MatCitMonth', 'MatCitDecade', 'MatCitYear']:
    print("- replace in col", col,"0 by NA ...")
    collectors[col] = collectors[col].replace(0, pd.NA)
print("Done modifying.")    

collectors.dropna(subset=['family'], inplace=True) # remove where family was NA, e.g. from originally «??» aso.
collectors

Load name parsed data from data/plazi_GbifOccurrenceId_CitCollector_20231116_parsed.tsv (112021 kb)...
Modify MatCitDate to periode and remove some 0 time values...
- convert MatCitDate to pd.Period(...) in collectors ...
- replace in col MatCitMonth 0 by NA ...
- replace in col MatCitDecade 0 by NA ...
- replace in col MatCitYear 0 by NA ...
Done modifying.


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
6861,A,Acuna E.E.,,,,,,,Acuna E. E. A,parsed:Acuna E.E. A,cleaned:Acuna E.E. A,1,3B183CD2A46DFFABFF58FC35FE82FB92,3464288392,1960-07-17 00:00:00.000,1960,1960,7
6862,A,Acuna E.E.,,,,,,,Acuna E. E. A,parsed:Acuna E.E. A,cleaned:Acuna E.E. A,1,3B183CD2A46DFFABFF58FC75FC83FC72,3464288455,1960-07-17 00:00:00.000,1960,1960,7
8776,A,Ae,,,,,,,Ae. A,parsed:Ae A,cleaned:Ae A,1,B9AF7B1CFFACE27585C0FBAF12D3FAB7,1438449014,NaT,,,
8777,A,Ae,,,,,,,Ae. A,parsed:Ae A,cleaned:Ae A,1,B9AF7B1CFFA2E27B85C0FA83147BF984,1438449026,NaT,,,
562969,A,Ae,,,,,,,Seta I & Ae. A,parsed:I. Seta<SEP>Ae A,cleaned:I. Seta<SEP>Ae A,1,B9AF7B1CFFA7E27E85C0FCF41352FB5A,1438449025,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46974,ҫa,F. A. Mendon,,,,,,,A. W. Exell & F. A. Mendon ҫa,parsed:A.W. Exell<SEP>F.A.Mendon ҫa,cleaned:A.W. Exell<SEP>F. A. Mendon ҫa,1,3B373C82774654349F44F8D7E90BD86F,4037809305,1937-04-24 00:00:00.000,1930,1937,4
46976,ҫa,F. A. Mendon,,,,,,,A. W. Exell & F. A. Mendon ҫa,parsed:A.W. Exell<SEP>F.A.Mendon ҫa,cleaned:A.W. Exell<SEP>F. A. Mendon ҫa,1,3B373C82774454369CBCF861E9DCD8FB,4037809347,1937-04-27 00:00:00.000,1930,1937,4
46978,ҫa,F. A. Mendon,,,,,,,A. W. Exell & F. A. Mendon ҫa,parsed:A.W. Exell<SEP>F.A.Mendon ҫa,cleaned:A.W. Exell<SEP>F. A. Mendon ҫa,1,3B373C82774A54389851FCB0E8E5DC8D,4037809383,1937-05-06 00:00:00.000,1930,1937,5
89556,ҫa,Mendon,,,,,,,Carrisso & Mendon ҫa,parsed:Carrisso<SEP>Mendon ҫa,cleaned:Carrisso<SEP>Mendon ҫa,1,3B373C82774654349D9AFA1BEBD0D931,4037809348,1927-01-01 00:00:00.000,1920,1927,
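
The two clean-up steps of this cell (turning date strings into periods, and treating 0 as "no value") can be illustrated on a few values. This is a sketch under assumptions: it uses day frequency (`freq="D"`) for readability, whereas the cell above uses `'ms'`, and it uses `Series.mask` instead of `replace`. The names `to_period_or_nat`, `good`, `bad` and `years_clean` are made up for this example.

```python
import pandas as pd


def to_period_or_nat(value, freq="D"):
    """Return a pandas Period for a parsable date string, otherwise NaT."""
    try:
        return pd.Period(value, freq=freq)
    except Exception:
        # Unparsable values are kept as "not a time" for later curation.
        return pd.NaT


good = to_period_or_nat("1980-09-19")
bad = to_period_or_nat("not a date")

# Plazi reports 0 for an unknown year; mask those values as missing.
years = pd.Series([1980, 0, 2017])
years_clean = years.mask(years == 0)

print(good, bad)
print(years_clean)
```
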


#### Check Composition of Parsed Collector Data

In [5]:
# TODO review code of abbreviated names and full name matching
criterion_fullnames = collectors.given.str.contains('^\\w{3,}', na=False)
print("Show collectors whose given name is (probably) a full name (%s records) …" % len(collectors[criterion_fullnames].index))
collectors[criterion_fullnames]

Show collectors whose given name is (probably) a full name (64010 records) …


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
6861,A,Acuna E.E.,,,,,,,Acuna E. E. A,parsed:Acuna E.E. A,cleaned:Acuna E.E. A,1,3B183CD2A46DFFABFF58FC35FE82FB92,3464288392,1960-07-17 00:00:00.000,1960,1960,7
6862,A,Acuna E.E.,,,,,,,Acuna E. E. A,parsed:Acuna E.E. A,cleaned:Acuna E.E. A,1,3B183CD2A46DFFABFF58FC75FC83FC72,3464288455,1960-07-17 00:00:00.000,1960,1960,7
12286,A,Agrobosques S.,,,,,,,Agrobosques S. A & de Arevalo,parsed:Agrobosques S. A<SEP>de Arevalo,cleaned:Agrobosques S. A<SEP>de Arevalo,1,3B083C841A434F0C70EFF861ECEEFF3E,1701220194,1991-01-23 00:00:00.000,1990,1991,1
117530,A,Berkov,,,,,,,Coll. Morillo. Lopez & Berkov. A & Weevil,parsed:Morillo Lopez<SEP>Berkov A<SEP>Weevil,cleaned:Morillo Lopez<SEP>Berkov A<SEP>Weevil,1,3B603CD7D864FFEB3CB4FC893081F811,2597529809,2013-12-29 00:00:00.000,2010,2013,12
73063,A,Boothia,,,,,,,Boothia. A,parsed:Boothia A,cleaned:Boothia A,1,948CD254FF865E22FEA45B9C7636F9F4,2273437260,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
561454,Índios,Serra,,dos,,,,,Serra dos Índios,parsed:Serra dos Índios,cleaned:Serra dos Índios,1,A9ED3CB9FF8E0459ECCEFF60FB7B8F96,3127529311,2007-05-26 00:00:00.000,2000,2007,5
477724,Órgãos,Parque Estadual,,da Serra dos,,,,,Parque Estadual da Serra dos Órgãos,parsed:Parque Estadual da Serra dos Órgãos,cleaned:Parque Estadual da Serra dos Órgãos,1,3B553CF75100FF82996D98C5FE58FCC4,3320586456,2000-02-08 00:00:00.000,2000,2000,2
408271,Óros,Megáli,,,,,,,Megáli Óros,parsed:Megáli Óros,cleaned:Megáli Óros,1,3B4DA343EF6EFFE45A5CFC2D4CF5892D,3435945784,NaT,,,
89556,ҫa,Mendon,,,,,,,Carrisso & Mendon ҫa,parsed:Carrisso<SEP>Mendon ҫa,cleaned:Carrisso<SEP>Mendon ҫa,1,3B373C82774654349D9AFA1BEBD0D931,4037809348,1927-01-01 00:00:00.000,1920,1927,
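
To see what the pattern `^\w{3,}` selects, it helps to run it on a handful of given names taken from the tables above. A minimal sketch (the variable names are made up for this illustration):

```python
import pandas as pd

given = pd.Series([
    "Acuna E.E.",         # starts with a word of 3+ characters
    "W.L.",               # initials only
    "F. A. Mendon",       # full word present, but not at the start
    "Ae",                 # too short
    None,                 # missing given name
    "Kiyohiko Yamamoto",
])

# The pattern is anchored at the start: at least three word characters in a row.
looks_like_full_name = given.str.contains(r"^\w{3,}", na=False)

print(pd.DataFrame({"given": given, "full_name": looks_like_full_name}))
```

Because of the anchor, a value such as `F. A. Mendon` is not counted as a full name, which may be relevant when the TODO in the cell above is reviewed.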


In [6]:
# check whether the name-parsed columns are empty or need to be considered as data for matching
import pprint
for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
    test_collectors = collectors.loc[collectors[parsed_name_part].notna()]
    print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
    display(test_collectors.head())


----------------------------------------
show names with **particle** found 21726 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
353,A. A. Girault,G.,,as,,,,,A. A. Girault as G.,parsed:G. as A. A. Girault,cleaned:G. as A. A. Girault,1,E4E73CEFE566FFFE6C4A0CFC1E2D5BF7,3743912342,1909-08-25 00:00:00.000,1900,1909,8
354,A. A. Girault,G.,,as,,,,,A. A. Girault as G.,parsed:G. as A. A. Girault,cleaned:G. as A. A. Girault,1,E4E73CEFE566FFFE6C040CD1191C5BD2,3743912408,1910-07-01 00:00:00.000,1910,1910,7
8026,A. Donev,G.,,as,,,,,A. Donev & A. Donev as G. & D. Kostadinov,parsed:A. Donev<SEP>G. as A. Donev<SEP>D. Kost...,cleaned:A. Donev<SEP>G. as A. Donev<SEP>D. Kos...,1,E4E73CEFE5A4FF3C69BD0B1D1CAC5CBC,3743938309,1980-05-26 00:00:00.000,1980,1980,5
271086,A. Howden,H.,,x,,,,,H. x A. Howden,parsed:H. x A. Howden,cleaned:H. x A. Howden,1,CC884C68D535B27D87034627E9C8F90B,3909183447,1956-06-27 00:00:00.000,1950,1956,6
271087,A. Howden,H.,,x,,,,,H. x A. Howden,parsed:H. x A. Howden,cleaned:H. x A. Howden,1,CC884C68D535B27D847646C9EB52FF75,3909183474,1956-06-27 00:00:00.000,1950,1956,6



----------------------------------------
show names with **suffix** found 824 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
496266,Adair,W. Lee,Jr.,,,,,,P. Lucas & W. Lee Adair Jr.,parsed:P. Lucas<SEP>W.Lee Adair Jr.,cleaned:P. Lucas<SEP>W. Lee Adair Jr.,1,F2D43CDDFFC89A58FAF5CAB5FEBFFBDA,3046454523,1988-02-08 00:00:00.000,1980,1988,2
648468,Adair,W.L.,Jr.,,,,,,W. L. Adair Jr.,parsed:W.L. Adair Jr.,cleaned:W.L. Adair Jr.,1,F2D43CDDFFDE9A4EFAE0C9E4FB6DFEE6,3046454417,1990-09-12 00:00:00.000,1990,1990,9
356734,Adjuntas,Las,II,,,,,,Las Adjuntas II & Col. R. & Barba & Barrera. L...,parsed:Las Adjuntas II<SEP>R.<SEP>Barba<SEP>Ba...,cleaned:Las Adjuntas II<SEP><SEP>Barba<SEP><SE...,1,3E86E90AFFABFF9CFF44053FFD74FC45,1671744666,1991-11-27 00:00:00.000,1990,1991,11
543396,Agulhas,R.V.,II,,,,,,RV Agulhas II,parsed:R.V. Agulhas II,cleaned:R.V. Agulhas II,1,3B553C82415AFFD7FF594C15FAA26A14,4435726302,2017-10-20 00:00:00.000,2010,2017,10
543397,Agulhas,R.V.,II,,,,,,RV Agulhas II,parsed:R.V. Agulhas II,cleaned:R.V. Agulhas II,1,3B553C82415AFFD7FF374C5CFBFC6ADB,4435726301,2017-10-21 00:00:00.000,2010,2017,10



----------------------------------------
show names with **dropping_particle** found 0 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth



----------------------------------------
show names with **appellation** found 316 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
433238,A,Kiyohiko Yamamoto,,,,,Mr.,,Mr. Kiyohiko Yamamoto. A,parsed:Kiyohiko Yamamoto A,cleaned:Kiyohiko Yamamoto A,1,3B1B3CD1FFF4FE099AF8FD39DD26FD16,4437349317,2004-10-28 00:00:00.000,2000.0,2004.0,10.0
419962,Abernethy,O.,,,,,Miss,,Miss O. Abernethy,parsed:O. Abernethy,cleaned:O. Abernethy,1,25BDF64AFFD5FFBCBD02FEADFE41A738,2234227646,1922-03-07 00:00:00.000,1920.0,1922.0,3.0
492625,Araujo,,,,,,MS,,Pitfalltrap & MS Araujo & Silva,parsed:Araujo<SEP>Silva,cleaned:Araujo<SEP>Silva,1,3B123C97FFB6FFEAFEF5D209FE65F941,2610423338,2014-07-28 00:00:00.000,2010.0,2014.0,7.0
149230,Atkinson,W.S.,,,,,Mr,,Descr. & Indian & Insects Colln & Mr W. S. Atk...,parsed:Descr<SEP>Indian<SEP>Insects Colln<SEP>...,cleaned:Descr<SEP>Indian<SEP>Insects Colln<SEP...,1,4ACEB435FFE3FF8FFEFF0C5BFE3B600E,2622599334,NaT,,,
433754,Atkinson,W.S.,,,,,Mr.,,Mr. W. S. Atkinson,parsed:W.S. Atkinson,cleaned:W.S. Atkinson,1,3B7068124D2C7B5E816B6E5C9B7DFABE,4128848319,NaT,,,


Compile the `canonical_string...` column for the collector data, which is later matched against the WikiData names:

In [7]:
# combine parts of names similar to WikiData's given name labels
# collectors['canonical_string_collector_parsed'] = collectors[['given', 'particle', 'family', 'suffix']]\
#     .fillna('')\
#     .apply(
#         lambda this_df: "{given}{particle}{family}{suffix}".format(
#             given=this_df["given"],
#             particle=" " + this_df["particle"] if this_df["particle"] else '', 
#             family=" " + this_df["family"] if this_df["family"] else '', 
#             suffix=", " + this_df["suffix"] if this_df["suffix"] else ''
#         ), axis="columns"
#     )

c = collectors.fillna('')

# Build the parts individually
part = (" "  + c['particle']).where(c['particle'] != '', '')
fam  = (" "  + c['family']).where(c['family'] != '', '')
suff = (", " + c['suffix']).where(c['suffix'] != '', '')

collectors['canonical_string_collector_parsed'] = (c['given'] + part + fam + suff).str.strip()

criterion = collectors["particle"].str.contains("\\w+ \\w+", na=False)

# display(collectors['canonical_string_collector_parsed'][criterion].head())
collectors[['canonical_string_collector_parsed', 'particle']][criterion].drop_duplicates().head(10)


Unnamed: 0,canonical_string_collector_parsed,particle
569184,Sierra de la Abra,de la
582128,Sotillo de la Adrada,de la
81141,Buca della Croce di Agnano N,della Croce di
458761,F. Sao Pedro da Agua Branca,Sao Pedro da
86196,S. Camino de Aguadores,Camino de
19058,Algarao da Ribeira de Alte,da Ribeira de
77763,F. Brejo de Altitude,Brejo de
121893,Conservacion de la Amazonia,de la
215715,Fundacion de la Amazonia,de la
207434,Universidad de la Amazonia,de la
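
The string assembly follows the pattern given [+ particle] + family [+ ", " suffix]. The following sketch applies the same steps to three toy records built from names that appear in the tables above; the data frame `parts` and the result name `canonical` are made up for this illustration.

```python
import pandas as pd

parts = pd.DataFrame({
    "given":    ["W. Lee", "Serra",  None],
    "particle": [None,     "dos",    None],
    "family":   ["Adair",  "Índios", "Štěpánek"],
    "suffix":   ["Jr.",    None,     None],
})

c = parts.fillna("")

# Prefix each optional part with its separator only where the part is present.
part = (" " + c["particle"]).where(c["particle"] != "", "")
fam = (" " + c["family"]).where(c["family"] != "", "")
suff = (", " + c["suffix"]).where(c["suffix"] != "", "")

canonical = (c["given"] + part + fam + suff).str.strip()

print(canonical.tolist())
```

The final `str.strip()` removes the leading space that remains when the given name is empty.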


In [8]:
# move canonical_string_collector_parsed after column title (title was the last of the parsing columns)
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)

these_columns=["family", "given", "suffix", "particle", "dropping_particle", "nick", "appellation", "title", 'canonical_string_collector_parsed']

if 'source_data' in collectors.columns:
    these_columns.append("source_data")

display(collectors.tail().get(these_columns))

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data
46974,ҫa,F. A. Mendon,,,,,,,F. A. Mendon ҫa,A. W. Exell & F. A. Mendon ҫa
46976,ҫa,F. A. Mendon,,,,,,,F. A. Mendon ҫa,A. W. Exell & F. A. Mendon ҫa
46978,ҫa,F. A. Mendon,,,,,,,F. A. Mendon ҫa,A. W. Exell & F. A. Mendon ҫa
89556,ҫa,Mendon,,,,,,,Mendon ҫa,Carrisso & Mendon ҫa
89558,ҫa,Mendon,,,,,,,Mendon ҫa,Carrisso & Mendon ҫa


In [9]:
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

In [10]:
collectors.dtypes

family                                      str
given                                       str
suffix                                      str
particle                                    str
dropping_particle                           str
nick                                        str
appellation                                 str
title                                       str
canonical_string_collector_parsed           str
source_data                                 str
parsed_names                                str
cleaned_names                               str
DocCount                                  int64
MatCitId                                    str
MatCitGbifOccurrenceId                    int64
MatCitDate                           period[ms]
MatCitDecade                             object
MatCitYear                               object
MatCitMonth                              object
dtype: object

In [11]:
# group and aggregate data to have unique name rows only for the matching of names later on
collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    DocCount_count= ('DocCount', 'sum'), # sum of DocCount over all records of this name
    MatCitGbifOccurrenceId_firstsample=('MatCitGbifOccurrenceId', lambda x: list(x)[0]),
    source_data=('source_data', lambda x: list(x)[0]),
    MatCitDate_mean=('MatCitDate', 'mean'),
    MatCitDate_min=('MatCitDate', 'min'),
    MatCitDate_max=('MatCitDate', 'max'),
    # MatCitDecade_mean=('MatCitDecade', 'mean'),
    # MatCitDecade_min=('MatCitDecade', 'min'),
    # MatCitDecade_max=('MatCitDecade', 'max'),
    MatCitYear_mean=('MatCitYear', 'mean'),
    MatCitYear_min=('MatCitYear', 'min'),
    MatCitYear_max=('MatCitYear', 'max')
    # MatCitMonth_mean=('MatCitMonth', 'mean'),
    # MatCitMonth_min=('MatCitMonth', 'min'),
    # MatCitMonth_max=('MatCitMonth', 'max')
).reset_index()

# move canonical_string_collector_parsed after column title
col = collectors_unique.pop("canonical_string_collector_parsed")
collectors_unique.insert(collectors_unique.columns.get_loc('title') + 1, col.name, col)

display(collectors_unique)

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,MatCitGbifOccurrenceId_firstsample,source_data,MatCitDate_mean,MatCitDate_min,MatCitDate_max,MatCitYear_mean,MatCitYear_min,MatCitYear_max
0,A,,,,,,,,A,194,3758304303,A.,1975-08-20 22:28:57.931,1-10-06 00:00:00.000,2021-01-29 00:00:00.000,1998.141176,1893.0,2021.0
1,Virginia,A,,,,,,,A Virginia,2,3333037406,a 1 virginia,2019-08-01 12:00:00.000,2019-08-01 00:00:00.000,2019-08-02 00:00:00.000,2019.0,2019.0,2019.0
2,A. Ambros,A.,,,,,,,A. A. Ambros,2,3392596301,G. Ibarra-M & L. Gonzalez G. & A. Ambros A. & ...,1986-08-01 00:00:00.000,1986-08-01 00:00:00.000,1986-08-01 00:00:00.000,,,
3,A. C. Allyn,A.,,,,,,,A. A. C. Allyn,1,2248478804,H. L. King & Database & Allyn Museum Photo & N...,1972-06-01 00:00:00.000,1972-06-01 00:00:00.000,1972-06-01 00:00:00.000,1972.0,1972.0,1972.0
4,Filho,A. A. Costa Silva,,,,,,,A. A. Costa Silva Filho,1,2609494348,C. A. Rheims & A. A. Costa Silva Filho,2011-02-24 00:00:00.000,2011-02-24 00:00:00.000,2011-02-24 00:00:00.000,2011.0,2011.0,2011.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129229,Štěpánek,,,,,,,,Štěpánek,1,3047760324,Štěpánek,1927-07-01 00:00:00.000,1927-07-01 00:00:00.000,1927-07-01 00:00:00.000,1927.0,1927.0,1927.0
129230,Šumpich,,,,,,,,Šumpich,4,3987425391,Šumpich,2016-12-30 12:00:00.000,2015-06-21 00:00:00.000,2019-06-27 00:00:00.000,2016.5,2015.0,2019.0
129231,Calame,Τhomas,,,,,,,Τhomas Calame,1,2466103895,"Vinh Quang Luu, Τhomas Calame & Kieusomphone T...",2015-03-29 00:00:00.000,2015-03-29 00:00:00.000,2015-03-29 00:00:00.000,2015.0,2015.0,2015.0
129232,Schnitnikov,А.,,,,,,,А. Schnitnikov,1,3034555558,А. Schnitnikov,1956-06-20 00:00:00.000,1956-06-20 00:00:00.000,1956-06-20 00:00:00.000,1956.0,1956.0,1956.0


### Set Up the Text Search

See https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536

The ngrams function is used as an analyzer in the text search later.

In [12]:
import re
!pip install ftfy # amazing text cleaning for decode issues..
from ftfy import fix_text

def ngrams(string, n=3):
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower()
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title() # normalise case - capital at start of each word
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]



In [13]:
# pip install --upgrade scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# nbrs_data = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf_vector_data) 
# tfidf_vector_data contains the vectorized wikidata names from the previous step


def getNearestNeighbour(query, this_vectorizer, this_nbrs_data):
    """Find the nearest neighbours of the query data using the package scikit-learn

    @param query: iterable of strings, the query data to vectorize and transform
    @param this_vectorizer: the fitted vectorizer of TfidfVectorizer
    @param this_nbrs_data: the fitted model of NearestNeighbors
    @return: (distances, indices) two arrays of shape (n_queries, n_neighbors)
    @rtype (numpy.ndarray, numpy.ndarray)
    """
    queryTFIDF_ = this_vectorizer.transform(query)
    distances, indices = this_nbrs_data.kneighbors(queryTFIDF_)
    return distances, indices


def calculateTFIDFmatchingOfData(query_data, match_data, n_neighbors=1):
    """
    Calculate a TF-IDF (Term Frequency-Inverse Document Frequency) matching with getNearestNeighbour()

    @param query_data: usually a pandas data column (or list) of names or strings to query for
    @param match_data: pandas Series of names or strings to match against
    @param n_neighbors: Number of neighbors required for each sample by default for :meth:`kneighbors` queries (originally 5).

    @requires getNearestNeighbour()
    @requires ngrams()
    @requires TfidfVectorizer()
    @requires NearestNeighbors()

    @return: DataFrame a data frame of matches with columns 'namematch_source_data', 'namematch_resource_data', 'namematch_distance'
    """

    import time
    start = time.time()
    # de-duplicate the query names; the list keeps a fixed order for the index lookup below
    query_data = list(set(query_data))

    print('Vectorizing data. This may take a while...')
    # vectorize wikidata names
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
    tfidf_vector_data = vectorizer.fit_transform(match_data
        # wd_matchtest['canonical_string']
    )
    nbrs_data = NearestNeighbors(n_neighbors=n_neighbors, n_jobs=-1).fit(tfidf_vector_data)
    duration = time.time() - start
    print('Vectorizing completed: created a matrix of TF-IDF features after %s s' % duration)

    print('Getting nearest neighbours of %s data with %s neighbor sample(s)...' % (len(query_data), n_neighbors))
    distances, indices = getNearestNeighbour(query_data, vectorizer, nbrs_data)
    duration = time.time() - start
    print('Completed after %s s' % duration)

    query_data = list(query_data)  # convert back to list

    print('Finding matches build new data frame ...')
    matches = []
    for i, j in enumerate(indices):
        temp = [query_data[i], match_data.values[j][0], round(distances[i][0], 2)]
        matches.append(temp)

    duration = time.time() - start
    print('Building matches done after %s s' % duration)
    matches = pd.DataFrame(
        matches,
        columns=['namematch_source_data', 'namematch_resource_data', 'namematch_distance']
    )

    print('Done')
    return matches

In [14]:
# some example data
print("Show ngram examples:")
print("- simple name:", ngrams('Klazenga, N.'))
print("- data from collectors:", ngrams(collectors_unique["canonical_string_collector_parsed"].at[1])) 
print("- data from match-test:", ngrams(wd_matchtest['canonical_string'].at[0]))
print("- data from match-test (full name):", ngrams(wd_matchtest_fullnames['canonical_string_fullname'].at[0]))

# some example data
for i, row in enumerate(range(5)):
    if (i == 0):
        print('\n(WikiData’s) canonical_string = (constructed) canonical_string_fullname:') 
    print("- {short_name} = {long_name}".format(
        short_name=wd_matchtest['canonical_string'].at[row],
        long_name=wd_matchtest_fullnames['canonical_string_fullname'].at[row]
    ))


Show ngram examples:
- simple name: [' Kl', 'Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N', ' N ']
- data from collectors: [' A ', 'A V', ' Vi', 'Vir', 'irg', 'rgi', 'gin', 'ini', 'nia', 'ia ']
- data from match-test: [' "F', '"F"', 'F" ', '" R', ' Ry', 'Rys', 'yse', 'ser', 'er ']
- data from match-test (full name): [' "F', '"Fr', 'Fri', 'rit', 'itz', 'tz"', 'z" ', '" R', ' Ry', 'Rys', 'yse', 'ser', 'er ']

(WikiData’s) canonical_string = (constructed) canonical_string_fullname:
- "F." Ryser = "Fritz" Ryser
- "N.A. Antipova" (lapsus) = "N.A. Antipova" (lapsus)
- "N.A.Antipova" (lapsus) = "N.A.Antipova" (lapsus)
- "The grandmother of female scientists in Ghana" = "The grandmother of female scientists in Ghana"
- "Н. А. Антипова" (lapsus) = "Н. А. Антипова" (lapsus)


Vectorize Wikidata names. Background: we use an information retrieval technique, TF-IDF (Term Frequency-Inverse Document Frequency, see the blog post [towardsdatascience.com/tf-idf-explained…](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275)), to match the source names with Wikidata names. The calculated distance measure helps to match similar names and to separate out names that are probably no match. For general information see also <https://scikit-learn.org> and <https://pypi.org/project/scikit-learn/>.

Convert a collection of raw documents to a matrix of TF-IDF features and set up the function that performs the nearest neighbour matches...
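
As a rough illustration of the idea, the following standard-library sketch builds TF-IDF vectors from character trigrams and picks the nearest candidate by Euclidean distance. It is not the scikit-learn code used in this notebook: `trigrams()` is a simplified stand-in for `ngrams()`, and the helper names `fit_idf`, `tfidf_vector`, `euclidean` and `nearest` are made up for this example. The smoothed IDF formula and the L2 normalisation are assumed to follow the defaults of `TfidfVectorizer`.

```python
import math
import re
from collections import Counter


def trigrams(name, n=3):
    """Simplified stand-in for ngrams(): lower-case, drop punctuation, pad with spaces."""
    cleaned = re.sub(r"[^a-z ]", " ", name.lower())
    cleaned = " " + re.sub(r" +", " ", cleaned).strip() + " "
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def fit_idf(documents):
    """Smoothed inverse document frequency per trigram: ln((1 + N) / (1 + df)) + 1."""
    doc_freq = Counter(gram for doc in documents for gram in set(trigrams(doc)))
    total = len(documents)
    return {gram: math.log((1 + total) / (1 + df)) + 1 for gram, df in doc_freq.items()}


def tfidf_vector(name, idf):
    """L2-normalised TF-IDF vector as a dict; trigrams unknown to idf are ignored."""
    counts = Counter(gram for gram in trigrams(name) if gram in idf)
    weights = {gram: tf * idf[gram] for gram, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {gram: w / norm for gram, w in weights.items()} if norm else {}


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def nearest(query, candidates):
    """Return (distance, candidate) for the candidate closest to the query."""
    idf = fit_idf(candidates)
    query_vec = tfidf_vector(query, idf)
    return min((euclidean(query_vec, tfidf_vector(c, idf)), c) for c in candidates)


print(nearest("K. Abbott", ["L. Aarvik", "E. K. Abbott", "Louis"]))
```

With L2-normalised vectors the Euclidean distance lies between 0 (identical trigram profile) and √2 (no trigram in common). A name whose trigrams are all unknown gets an empty vector, which is at distance 1 from every non-empty vector.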

### Perform the Matching

Perform the nearest neighbour (NN) matching on the (Plazi) collector names and create a data frame with the matches. Abbreviated and full names in the source are matched in separate runs, to better match source data and Wikidata. Each run converts the names to a matrix of TF-IDF features and then looks up the nearest neighbours (this can take 10 to 30 minutes).
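
The split into abbreviated and full names relies on the pattern `^\w{3,}`: a given name counts as spelled out when it starts with at least three word characters. A minimal sketch of that rule in plain Python follows; the helper `has_full_given_name` and the sample pairs are made up for this example, and missing given names are treated as abbreviated (mirroring `na=False`).

```python
import re

# Same pattern as in the notebook: the given name starts with three or more word characters.
FULL_GIVEN_NAME = re.compile(r"^\w{3,}")


def has_full_given_name(given):
    """Return True if the given name looks spelled out; missing values count as abbreviated."""
    return bool(given) and FULL_GIVEN_NAME.search(given) is not None


# Sample (given, family) pairs, for illustration only
parsed_names = [
    ("A.", "Ambros"),
    ("Aaron D.", "Smith"),
    (None, "Rowe"),
    ("A. A. Costa Silva", "Filho"),
]
full = [family for given, family in parsed_names if has_full_given_name(given)]
abbreviated = [family for given, family in parsed_names if not has_full_given_name(given)]
print("full:", full)
print("abbreviated:", abbreviated)
```

Note that a name such as “A. A. Costa Silva” lands in the abbreviated group, because only the start of the given name is inspected.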

In [15]:
print("Calculate matching for **abbreviated** names separately …")

criterion_fullnames = collectors_unique.given.str.contains('^\\w{3,}', na=False)
collectors_names = collectors_unique['canonical_string_collector_parsed'][~criterion_fullnames].values
# collectors_names = set(collectors_unique['canonical_string_collector_parsed'][[not fullname for fullname in criterion_fullnames]].values)
matches = calculateTFIDFmatchingOfData(collectors_names, wd_matchtest['canonical_string'], 5) # TODO what effect does n_neighbors have? The original source code sets it to 5, not 1; only the nearest neighbour of each query is kept in the result

matches = matches.sort_values(['namematch_distance'])
matches = matches.reset_index(names=['old_index'])

matches

Calculate matching for **abbreviated** names separately …
Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 4.244699001312256 s
Getting nearest neighbours of 107391 data with 5 neighbor sample(s)...
Completed after 231.50650453567505 s
Finding matches build new data frame ...
Building matches done after 234.51917362213135 s
Done


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,23568,Rowe,Rowe,0.0
1,23593,D. Robertson,D. Robertson,0.0
2,23588,A. Pons,A. Pons,0.0
3,23580,Holbrook,Holbrook,0.0
4,91646,Akhmedov,Akhmedov,0.0
...,...,...,...,...
107386,64658,El Bagante,Ж. Перес,1.0
107387,107378,P. Wieng Ko Sai N.,В. Б. О'Шонесси,1.0
107388,107381,Li Yajin,И. Ф. Земмельвейс,1.0
107389,6,Ortatopac,А. Г. Натгорст,1.0


In [16]:
# criterion_fullnames = collectors_unique.given.str.contains('^\\w{3,}', na=False)
print("Calculate matching for **full** names separately …")
collectors_fullnames = collectors_unique['canonical_string_collector_parsed'][criterion_fullnames].values
matches_fullnames = calculateTFIDFmatchingOfData(collectors_fullnames, wd_matchtest_fullnames['canonical_string_fullname'], 5) # TODO what effect does n_neighbors have? The original source code sets it to 5, not 1; only the nearest neighbour of each query is kept in the result

matches_fullnames = matches_fullnames.sort_values(['namematch_distance'])
matches_fullnames = matches_fullnames.reset_index(names=['old_index'])

matches_fullnames

Calculate matching for **full** names separately …
Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 6.0167036056518555 s
Getting nearest neighbours of 21843 data with 5 neighbor sample(s)...
Completed after 60.981823682785034 s
Finding matches build new data frame ...
Building matches done after 61.24055480957031 s
Done


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,5055,Llewelyn Williams,Llewelyn Williams,0.0
1,5819,Robert Dick,Robert Dick,0.0
2,15353,Herbert W. Levi,Herbert W. Levi,0.0
3,18464,Lloyd Martin,Lloyd Martin,0.0
4,8173,Zhuqiu Song,Zhuqiu Song,0.0
...,...,...,...,...
21838,10285,Ayu-Dag Kirichenko,Жорж Луи Леклерк де Бюффон,1.0
21839,10286,Nan-Yi Tsai,"Жермен де Сен-Пьер, Жак Николя Эрнест",1.0
21840,10287,Loma de la Plaza,Карл Кристиан Гмелин,1.0
21841,10288,Mashonaland East,Жозеф Рок,1.0


### Create Output Results

Join the matches data frame back to the (Plazi) collectors and Wikidata items …

In [17]:
# join matches data frame back to source collectors  dataframe 
collectors_matches = pd.merge(
    collectors_unique, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,...,MatCitDate_mean,MatCitDate_min,MatCitDate_max,MatCitYear_mean,MatCitYear_min,MatCitYear_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A,,,,,,,,A,194,...,1975-08-20 22:28:57.931,1-10-06 00:00:00.000,2021-01-29 00:00:00.000,1998.141176,1893.0,2021.0,78463,A,Г. Бaйцзе,0.0
1,Virginia,A,,,,,,,A Virginia,2,...,2019-08-01 12:00:00.000,2019-08-01 00:00:00.000,2019-08-02 00:00:00.000,2019.0,2019.0,2019.0,7119,A Virginia,Virginio,0.72
2,A. Ambros,A.,,,,,,,A. A. Ambros,2,...,1986-08-01 00:00:00.000,1986-08-01 00:00:00.000,1986-08-01 00:00:00.000,,,,99873,A. A. Ambros,Ambros.,0.65
3,A. C. Allyn,A.,,,,,,,A. A. C. Allyn,1,...,1972-06-01 00:00:00.000,1972-06-01 00:00:00.000,1972-06-01 00:00:00.000,1972.0,1972.0,1972.0,78630,A. A. C. Allyn,A. C. Allem,0.98
4,Filho,A. A. Costa Silva,,,,,,,A. A. Costa Silva Filho,1,...,2011-02-24 00:00:00.000,2011-02-24 00:00:00.000,2011-02-24 00:00:00.000,2011.0,2011.0,2011.0,738,A. A. Costa Silva Filho,Costa-Silva,0.77


In [18]:
# append full name matches
collectors_matches_fullname = pd.merge(
    collectors_unique, matches_fullnames, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches_fullname.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,...,MatCitDate_mean,MatCitDate_min,MatCitDate_max,MatCitYear_mean,MatCitYear_min,MatCitYear_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,Mohamad,Aadullah,,,,,,,Aadullah Mohamad,1,...,1979-04-25 00:00:00.000,1979-04-25 00:00:00.000,1979-04-25 00:00:00.000,1979.0,1979.0,1979.0,6698,Aadullah Mohamad,A.A.D.,0.98
1,Smith,Aaron D.,,,,,,,Aaron D. Smith,57,...,1967-12-29 19:23:04.616,1918-06-10 00:00:00.000,2014-11-16 00:00:00.000,1967.384615,1918.0,2014.0,1078,Aaron D. Smith,A. D. Smith,0.84
2,Fox,Aaron,,,,,,,Aaron Fox,4,...,2012-07-11 06:00:00.000,2005-12-22 00:00:00.000,2019-01-01 00:00:00.000,2012.25,2005.0,2019.0,18634,Aaron Fox,Fox,0.85
3,Bauer,Aaron M.,,,,,,,Aaron M. Bauer,6,...,2002-11-09 16:00:00.000,1998-01-13 00:00:00.000,2011-11-29 00:00:00.000,2002.166667,1998.0,2011.0,17210,Aaron M. Bauer,Barton M. Bauers,0.88
4,Prefecture,Aba,,,,,,,Aba Prefecture,1,...,1983-09-18 00:00:00.000,1983-09-18 00:00:00.000,1983-09-18 00:00:00.000,1983.0,1983.0,1983.0,6688,Aba Prefecture,Жюль Сезар Савиньи,1.0


In [19]:
collectors_all_matches=pd.concat([collectors_matches, collectors_matches_fullname])
collectors_all_matches.sort_values(by=['namematch_distance', 'family'], ascending=[True, True], inplace=True)
collectors_all_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,...,MatCitDate_mean,MatCitDate_min,MatCitDate_max,MatCitYear_mean,MatCitYear_min,MatCitYear_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A,,,,,,,,A,194,...,1975-08-20 22:28:57.931,1-10-06 00:00:00.000,2021-01-29 00:00:00.000,1998.141176,1893.0,2021.0,78463,A,Г. Бaйцзе,0.0
4170,A.G,,,,,,,,A.G,1,...,1895-09-01 00:00:00.000,1895-09-01 00:00:00.000,1895-09-01 00:00:00.000,1895.0,1895.0,1895.0,74587,A.G,Ag.,0.0
4310,A.H.O,,,,,,,,A.H.O,1,...,1876-09-01 00:00:00.000,1876-09-01 00:00:00.000,1876-09-01 00:00:00.000,1876.0,1876.0,1876.0,3228,A.H.O,Aho,0.0
5297,Aa,,,,,,,,Aa,7,...,2007-09-17 13:42:51.428,1972-09-18 00:00:00.000,2021-07-12 00:00:00.000,2007.142857,1972.0,2021.0,88120,Aa,Aa,0.0
48264,Aagaard,K.,,,,,,,K. Aagaard,2,...,1986-07-01 00:00:00.000,1986-07-01 00:00:00.000,1986-07-01 00:00:00.000,1986.0,1986.0,1986.0,6721,K. Aagaard,K. Aagaard,0.0


Save the results...

In [21]:
do_custom_data_aggregation=False
if do_custom_data_aggregation:
    import os
    os.makedirs('data', exist_ok=True)  # create the output directory if it is missing

    this_output_file='data/results_plazi_collectors_matches_wikidata-botanists_%s.csv' % (this_timestamp_for_data)

    collectors_all_matches.to_csv(this_output_file)

    print(
        "Wrote matches of collector names into %s (%d kB)" % 
        (this_output_file, os.path.getsize(this_output_file) >> 10) # right shift by 10 bits = integer division by 1024, i.e. the size in kB
    ) 

### Merge Matched Data

Combine abbreviated names and full names …

In [22]:
# now merge the matching data with the Wikidata data on the canonical name string
collectors_matches_tmp_names_abbreviated = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string'
)
collectors_matches_tmp_fullnames = pd.merge(
    collectors_matches_fullname, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string_fullname'
)
collectors_matches_g1_merged_wikidata = pd.concat(
    [collectors_matches_tmp_names_abbreviated, collectors_matches_tmp_fullnames]
    , ignore_index=True
)


In [23]:
pprint.pprint(collectors_matches_g1_merged_wikidata.columns)

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'DocCount_count', 'MatCitGbifOccurrenceId_firstsample', 'source_data',
       'MatCitDate_mean', 'MatCitDate_min', 'MatCitDate_max',
       'MatCitYear_mean', 'MatCitYear_min', 'MatCitYear_max', 'old_index',
       'namematch_source_data', 'namematch_resource_data',
       'namematch_distance', 'item', 'itemLabel', 'surname', 'initials',
       'canonical_string', 'canonical_string_fullname', 'orcid', 'viaf',
       'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod',
       'wikidata_link', 'orcid_link', 'harv_link', 'ipni_link',
       'bionomia_link'],
      dtype='str')


In [24]:
print("Show some name match examples (e.g. «Louis…» matching various names) …")
for testname in ['Aarvik', 'Louis', 'Abbot']:
    # na=False prevents errors in empty cells
    criterion = collectors_matches_g1_merged_wikidata['canonical_string_collector_parsed'].str.contains(testname, na=False)
    
    this_table = collectors_matches_g1_merged_wikidata[criterion][[
        # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
        'DocCount_count', 'MatCitGbifOccurrenceId_firstsample',
        'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
        # 'canonical_string_fullname', 
        'itemLabel', 'wikidata_link',
        'MatCitYear_min', 'MatCitYear_max',
        'yob', 'yod' # , 'wyb', 'wye'
    ]].sort_values(by=['namematch_distance'])
    
    print("# ---------------------------------------------\n# «%s…» as test name, %d matches found:" % (testname, criterion.sum()))
    # display(this_table)    
    display(this_table[[
        'namematch_source_data', 
        'namematch_resource_data', 
        'namematch_distance', 
        'itemLabel', 'wikidata_link',
        'MatCitYear_min', 'MatCitYear_max', 'yob', 'yod']]
    )


Show some name match examples (e.g. «Louis…» matching various names) …
# ---------------------------------------------
# «Aarvik…» as test name, 5 matches found:


Unnamed: 0,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,wikidata_link,MatCitYear_min,MatCitYear_max,yob,yod
6763,Aarvik,Aarvik,0.0,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1991.0,2016.0,1954,
64689,L. Aarvik,L. Aarvik,0.0,Lars Aarvik,http://www.wikidata.org/wiki/Q106823278,1936.0,2018.0,1892,1981.0
64690,L. Aarvik,L. Aarvik,0.0,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1936.0,2018.0,1954,
143041,Leif Aarvik,Leif Aarvik,0.0,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1993.0,2014.0,1954,
67002,L.A. Aarvik,Aarvik,0.58,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1992.0,1992.0,1954,


# ---------------------------------------------
# «Louis…» as test name, 26 matches found:


Unnamed: 0,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,wikidata_link,MatCitYear_min,MatCitYear_max,yob,yod
70951,Louis,Louis,0.0,Jean Laurent Prosper Louis,http://www.wikidata.org/wiki/Q5928759,1937.0,1991.0,1903.0,1947.0
143670,Louis A. Fuertes,Louis A. Fuertes,0.0,Louis Agassiz Fuertes,http://www.wikidata.org/wiki/Q1871480,1910.0,1910.0,1874.0,1927.0
53679,J. Louis,Louis,0.41,Jean Laurent Prosper Louis,http://www.wikidata.org/wiki/Q5928759,1937.0,1938.0,1903.0,1947.0
143680,Louise Russell,Louise M. Russell,0.53,Louise M. Russell,http://www.wikidata.org/wiki/Q21502595,,,1905.0,2009.0
6029,A.M. Louis,Louis,0.69,Jean Laurent Prosper Louis,http://www.wikidata.org/wiki/Q5928759,1983.0,2011.0,1903.0,1947.0
143671,Louis A. Hansen,Louis A. Hanic,0.74,Louis A. Hanic,http://www.wikidata.org/wiki/Q99674405,1993.0,1993.0,,
49565,I. Louis Philippe,Philippe,0.77,Xavier Philippe,http://www.wikidata.org/wiki/Q19001500,,,1802.0,1866.0
143674,Louis Hansen,Hans Hansen,0.78,Hans Nicholas Hansen,http://www.wikidata.org/wiki/Q21514638,1991.0,1991.0,1891.0,1960.0
143676,Louis La Pierre,Pierre-Louis Laudereau,0.83,Pierre-Louis Laudereau,http://www.wikidata.org/wiki/Q136526973,1996.0,1996.0,,
70953,Louisiana,Louis,0.83,Jean Laurent Prosper Louis,http://www.wikidata.org/wiki/Q5928759,1984.0,1984.0,1903.0,1947.0


# ---------------------------------------------
# «Abbot…» as test name, 25 matches found:


Unnamed: 0,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,wikidata_link,MatCitYear_min,MatCitYear_max,yob,yod
22,A. Abbott,A. Abbott,0.0,Alice A. Bartow,http://www.wikidata.org/wiki/Q87152672,1983.0,1995.0,1865.0,1951.0
6803,Abbott,Abbott,0.0,George Abbott,http://www.wikidata.org/wiki/Q47112598,1896.0,2006.0,,
102426,S. Abbott,S. Abbott,0.0,Sue Darwin Abbott,http://www.wikidata.org/wiki/Q105518299,1971.0,1971.0,1926.0,1976.0
102427,S. Abbott,S. Abbott,0.0,Sarah Rideout Abbott,http://www.wikidata.org/wiki/Q67079678,1971.0,1971.0,1871.0,1926.0
64692,L. Abbott,L. Abbott,0.0,Lynette K. Abbott,http://www.wikidata.org/wiki/Q36610629,,,,
102425,S. Abbott,S. Abbott,0.0,Sue Darwin Abbott,http://www.wikidata.org/wiki/Q105518299,1971.0,1971.0,1926.0,1976.0
120675,W. Abbott,W. Abbott,0.0,William Louis Abbott,http://www.wikidata.org/wiki/Q635604,1922.0,1922.0,1860.0,1936.0
120674,W. Abbott,W. Abbott,0.0,Walter Sidney Abbott,http://www.wikidata.org/wiki/Q55007517,1922.0,1922.0,1879.0,1942.0
136984,Edith Abbott,Edith Abbott,0.0,Edith Mae Abbott,http://www.wikidata.org/wiki/Q99342591,1984.0,1984.0,1909.0,2006.0
59788,K. Abbott,E. K. Abbott,0.39,Edwin Kirk Abbott,http://www.wikidata.org/wiki/Q81587932,1997.0,1997.0,1840.0,1918.0


## Output Mapping to DarwinCore Attribution Output

Here we map table data fields to fields of DarwinCore Attribution (<https://github.com/tdwg/attribution/>, <https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml>) 

## Scoring

The individual property scores should be balanced in such a way that one can simply add them up; for now, the calculated values still need to be assessed. A further problem is that the name matching yields a distance measure, i.e. the opposite of a similarity, and this distance can become greater than 1, so it must somehow be mapped to a range of 0 … 1 (or -1 … 0 … 1) (TODO review).

General thoughts: With a score of -1 to 1, it can be assumed that:
* -1 means full devaluation or no agreement
* 1 means full upvoting or agreement, and
* 0 can have several interpretations: it is in between, or no rating possible, or missing values.
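
One simple way to turn a distance into a similarity within 0 … 1 is a linear rescaling by the largest distance. The sketch below assumes that a maximum distance is known (for example the largest value observed in the data); the function name `distance_to_similarity` is made up for this example. The overall score computed later in this notebook uses the same kind of rescaling with the observed maximum of `namematch_distance`.

```python
def distance_to_similarity(distance, max_distance):
    """Map a distance in 0 … max_distance linearly to a similarity in 1 … 0."""
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    return (max_distance - distance) / max_distance


# Example values only: distances 0.0, 0.5 and 2.0 with a maximum distance of 2.0
for d in (0.0, 0.5, 2.0):
    print(d, "→", distance_to_similarity(d, 2.0))
```

A distance of 0 then becomes a similarity of 1, and the largest distance becomes 0.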

### Task to Be Solved in Evaluating the Life Time ~ Rating/Scoring

We have grouped the collection date (eventDate) by the name in the source data, so for (abbreviated) names, e.g. “Bachmann, F.”, the collection dates may belong to *several* persons, not just one. This must be taken into account when evaluating whether the life dates match the collection date. The life time is rated according to the following idea:

| Score (life time) | Remarks | 
|--|--|
| 1.0  | complete match                     |
| 0.5  | somewhat correct, but has errors or mistakes, indicating multiple person names    |
| 0.0   | no evaluation (or not possible) |
| -0.5 | is rather to be rejected, indicating multiple person names and possibly overlapping time spans of the collection date of different person names, or mistakes in the original data |
| -1.0 | completely rejected                |
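
The rating can be sketched as a small three-state function, where each check is `True`, `False` or unknown. The function name `lifetime_score` and the use of `None` for unknown values are assumptions for this example; the notebook applies the same rules with boolean masks on the data frame.

```python
def lifetime_score(born_early_enough, died_late_enough):
    """Score the life time check; each argument is True, False or None (unknown)."""
    checks = (born_early_enough, died_late_enough)
    if all(check is None for check in checks):
        return 0.0   # nothing known, no evaluation possible
    if False not in checks:
        return 1.0   # every known check agrees with the collection dates
    if True in checks:
        return 0.5   # one check agrees, the other one fails
    if None in checks:
        return -0.5  # one check fails, the other one is unknown
    return -1.0      # both checks fail


# All nine combinations of the two checks
for born in (True, False, None):
    for died in (True, False, None):
        print(born, died, lifetime_score(born, died))
```

Here `born_early_enough` stands for “year of birth plus a minimum collecting age lies before the first collection date”, and `died_late_enough` for “year of death lies after the last collection date”.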

### Task to Be Solved With Several Names ~ Assessment/Score

Since we do not know whether other possible names exist somewhere when only one name was found, we cannot assign a “1” (= full agreement) with certainty. It was therefore decided that a single name found is evaluated as zero, in the sense of no evaluation. So when evaluating multiple names, only the mismatches are scored, according to the following idea:

| Score (multiple names) | Remarks | 
|--|--|
| 1.0  | this value (= full upvoting or agreement) is never set in this regard, since we do not know all the full names of the cosmos ;-) and so cannot state a score of 1.0 with certainty |
| 0.0 | no evaluation, because only 1 name found | 
| less than 0 | multiple names found, i.e. deduction (perhaps just -0.5, as a decision needs to be made) | 
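
A minimal sketch of this deduction, assuming one row per (collector name, Wikidata item) match, so that a collector name occurring in several rows has several candidate persons. The function name `multiple_names_scores` and the default deduction of -0.5 are assumptions for this example.

```python
from collections import Counter


def multiple_names_scores(collector_names, deduction=-0.5):
    """Return one score per row: 0.0 for a unique collector name, `deduction` otherwise."""
    counts = Counter(collector_names)
    return [deduction if counts[name] > 1 else 0.0 for name in collector_names]


# "L. Aarvik" was matched to two different Wikidata items, "Aarvik" to one
rows = ["Aarvik", "L. Aarvik", "L. Aarvik"]
print(list(zip(rows, multiple_names_scores(rows))))
```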

---

TODO review interpretation:

- the fields are defined in <https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml>. With regard to this DwC attribution concept: is it correct to map them as follows? (`name` would represent the *interpreted* resource name (in long format), not the *source* collector `name` (in (theoretically) long format).)
    ```
    name          ← itemLabel (wikiData)
    alternateName ← canonical_string_collector_parsed (actual collector name)
    collectors_eventDate_mean → MatCitDate_mean
    collectors_eventDate_min  → MatCitDate_min
    collectors_eventDate_max  → MatCitDate_max
     → MatCitGbifOccurrenceId
    # occurrenceID_collectors_count= ('occurrenceID_count', 'sum'), # use count function
    occurrenceID_collectors_firstsample → MatCitGbifOccurrenceId_firstsample

    MatCitGbifOccurrenceId_firstsample=('MatCitGbifOccurrenceId', lambda x: list(x)[0]), # custom function, to get the first entry    
    ```

In [25]:
# TODO further evaluation or filtering, counting, clean-up etc.
pprint.pprint(collectors_matches_g1_merged_wikidata.columns)

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'DocCount_count', 'MatCitGbifOccurrenceId_firstsample', 'source_data',
       'MatCitDate_mean', 'MatCitDate_min', 'MatCitDate_max',
       'MatCitYear_mean', 'MatCitYear_min', 'MatCitYear_max', 'old_index',
       'namematch_source_data', 'namematch_resource_data',
       'namematch_distance', 'item', 'itemLabel', 'surname', 'initials',
       'canonical_string', 'canonical_string_fullname', 'orcid', 'viaf',
       'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod',
       'wikidata_link', 'orcid_link', 'harv_link', 'ipni_link',
       'bionomia_link'],
      dtype='str')


In [26]:
# yob_is_lt_eventDate_min ~ yob_is_lt_citeDate_min
# yod_is_gt_eventDate_max ~ yod_is_gt_citeDate_max

# refactor namematch_similarity → namematch_distance
# refactor namematch_similarity_annotation → namematch_distance_annotation
# refactor custom_namematch_similarity → custom_namematch_namematch
# refactor sort_values
collectors_wikidata_kmeans = collectors_matches_g1_merged_wikidata[
    ['canonical_string_collector_parsed', 'family', 'given',
     'MatCitGbifOccurrenceId_firstsample',
     'source_data',
    'namematch_source_data', 'namematch_resource_data', 'namematch_distance',
    'item', 'canonical_string', 'itemLabel',
    'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id',
    'MatCitDate_mean', 'MatCitDate_min', 'MatCitDate_max',
     'yob', 'yod' #, 'wyb'
    ]
]

# Order by name match distance (ascending), then by family and given name (ascending);
# reassigning instead of inplace=True avoids sorting a slice of another data frame in place
collectors_wikidata_kmeans = collectors_wikidata_kmeans.sort_values(
    by=['namematch_distance', 'family', 'given'],
    ascending=[True, True, True]
)

dwcagent_attr_output=collectors_wikidata_kmeans.get([
    "MatCitGbifOccurrenceId_firstsample",
    "canonical_string_collector_parsed",
    'family', 'given',
    "namematch_distance",
    "source_data",
    "itemLabel",
    "item",
    "MatCitDate_min",
    "MatCitDate_max",
    'yob', 'yod'
]).copy()

dwcagent_attr_output['canonical_string_collector_parsed'] = dwcagent_attr_output['canonical_string_collector_parsed'].astype(object)
dwcagent_attr_output['canonical_string_collector_parsed'] = dwcagent_attr_output['canonical_string_collector_parsed'].replace(
    to_replace=r'([^,]+),\s*(.+)',
    value=r'\2 \1',  # backreferences: swap the parts around the first comma (“family, given” → “given family”)
    regex=True
)

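# NOTE: despite the label, this is the nearest-neighbour distance from NearestNeighbors, not a k-means result
dwcagent_attr_output['namematch_distance_annotation'] = dwcagent_attr_output['namematch_distance'].astype(str).str.replace(r'(.+)', '\\1 (k-means distance)', regex=True)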
# dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'namematch_distance_annotation', '', allow_duplicates=True)

dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'life_time_periode', '', allow_duplicates=True)

combine_life_times = lambda this_df: ("%s-%s" % (this_df["yob"], this_df["yod"])).replace(r"<NA>", "?")
dwcagent_attr_output["life_time_periode"]=dwcagent_attr_output.apply(combine_life_times, axis="columns")

# dwcagent_attr_output["life_time_periode"]

years_from_birth_until_first_collection_activity = 10
dwcagent_attr_output["yob_is_lt_citeDate_min"] = dwcagent_attr_output["yob"] + years_from_birth_until_first_collection_activity < dwcagent_attr_output["MatCitDate_min"].dt.year
dwcagent_attr_output["yod_is_gt_citeDate_max"] = dwcagent_attr_output["yod"] > dwcagent_attr_output["MatCitDate_max"].dt.year
dwcagent_attr_output["custom_score_lifetime"] = 0.0
dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'custom_score_lifetime_annotation', '', allow_duplicates=True)

# df.loc[(df['column_of_interest'] … condition), 'fill_to_column'] = value

dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_citeDate_min"] & dwcagent_attr_output["yod_is_gt_citeDate_max"],
    "custom_score_lifetime"
] = 1.0
# True cases but <NA> missing values
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_citeDate_min"] & dwcagent_attr_output["yod_is_gt_citeDate_max"].isnull(),
    "custom_score_lifetime"
] = 1.0
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_citeDate_min"].isnull() & dwcagent_attr_output["yod_is_gt_citeDate_max"],
    "custom_score_lifetime"
] = 1.0
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_citeDate_min"].isnull() & dwcagent_attr_output["yod_is_gt_citeDate_max"].isnull(),
    "custom_score_lifetime"
] = 0.0

# False cases
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_citeDate_max"] == False),
    "custom_score_lifetime"
] = -1.0
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"]==True) & (dwcagent_attr_output["yod_is_gt_citeDate_max"] == False),
    "custom_score_lifetime"
] = 0.5
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_citeDate_max"] == True),
    "custom_score_lifetime"
] = 0.5

# False cases but <NA> missing values
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_citeDate_max"].isnull()),
    "custom_score_lifetime"
] = -0.5
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"].isnull()) & (dwcagent_attr_output["yod_is_gt_citeDate_max"] == False),
    "custom_score_lifetime"
] = -0.5

# annotations True cases
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_citeDate_min"] & dwcagent_attr_output["yod_is_gt_citeDate_max"],
    "custom_score_lifetime_annotation"
] = "full match"

# annotations True cases but <NA> missing values
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_citeDate_min"] & dwcagent_attr_output["yod_is_gt_citeDate_max"].isnull(),
    "custom_score_lifetime_annotation"
] = "OK? year of death is missing"
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_citeDate_min"].isnull() & dwcagent_attr_output["yod_is_gt_citeDate_max"],
    "custom_score_lifetime_annotation"
] = "OK? year of birth is missing"
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_citeDate_min"].isnull() & dwcagent_attr_output["yod_is_gt_citeDate_max"].isnull(),
    "custom_score_lifetime_annotation"
] = "unknown life time"

# annotations False cases
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_citeDate_max"] == False),
    "custom_score_lifetime_annotation"
] = "life time not matching any citeDate (yob + %s … yod)" % years_from_birth_until_first_collection_activity
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"]==True) & (dwcagent_attr_output["yod_is_gt_citeDate_max"] == False),
    "custom_score_lifetime_annotation"
] = "OK yob + %s, but yod not matching, check name and lifetime data" % years_from_birth_until_first_collection_activity
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_citeDate_max"] == True),
    "custom_score_lifetime_annotation"
] = "yob + %s not matching, OK yod, check name and liftime data" % years_from_birth_until_first_collection_activity
# annotations False cases but <NA> missing values
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_citeDate_max"].isnull()),
    "custom_score_lifetime_annotation"
] = "yob + %s not matching, yod unknown, check name and liftime data" % years_from_birth_until_first_collection_activity
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_citeDate_min"].isnull()) & (dwcagent_attr_output["yod_is_gt_citeDate_max"]==False),
    "custom_score_lifetime_annotation"
] = "yob unknown, yod not matching, check name and liftime data"

dwcagent_attr_output["custom_score_multiple_names"] = 0.0 # 0.0 means: nothing is known yet
dwcagent_attr_output.loc[
    (dwcagent_attr_output['canonical_string_collector_parsed'].duplicated(keep=False)),
    'custom_score_multiple_names'
] = -0.5 # several candidates share this collector name and one has to be chosen, so use only half of the -1 … 0 range (or include the number of candidates somehow?)

namematch_distance_max = dwcagent_attr_output['namematch_distance'].max()
dwcagent_attr_output['custom_score_overall'] = (
    # transform the distance (0 … max, range may be larger than 1) into a similarity (1 … 0, range of 1) for scoring,
    # then weight it with the mean of the life time score and the multiple-names score
    abs(dwcagent_attr_output['namematch_distance'] - namematch_distance_max) / namematch_distance_max
    * (
        (dwcagent_attr_output["custom_score_lifetime"] + dwcagent_attr_output['custom_score_multiple_names']) / 2
    )
).round(3)

dwcagent_attr_output['attributionRemarks'] = dwcagent_attr_output.apply(
    lambda row: "{similarity_distance_note};"
                " {score_overall:.2f} (score overall);"
                " {lifetime_periode} (life time);"
                " {lifetime_score:.1f} (life time score);"
                " {lifetime_score_annote} (life time score note);"
                " {score_multinames:.2f} (score multiple names);"
        .format(
    similarity_distance_note=row['namematch_distance_annotation'],
    lifetime_periode=row["life_time_periode"],
    lifetime_score=row["custom_score_lifetime"],
    lifetime_score_annote=row["custom_score_lifetime_annotation"],
    score_overall=row["custom_score_overall"],
    score_multinames=row["custom_score_multiple_names"]
    ), axis='columns'
)

# adjust dwcagent displayOrder also to overall score
dwcagent_attr_output.sort_values(
    by=['namematch_distance', 'family', 'given', 'custom_score_overall'],
    ascending=[True, True, True, False], inplace=True
)
# use the sorted canonical_string_collector_parsed to generate displayOrder:
# duplicated() marks the first occurrence as False and all later duplicates as True,
# so the cumulative sum of the Trues per name gives the order index
temp_duplicated = dwcagent_attr_output['canonical_string_collector_parsed'].duplicated()
temp_insert_value = temp_duplicated.groupby(dwcagent_attr_output['canonical_string_collector_parsed']).cumsum() + 1 # display order starts at 1, incrementing
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('canonical_string_collector_parsed') + 1, 'displayOrder', temp_insert_value, allow_duplicates=True)

# test and show example data
show_display_output=True
if show_display_output:
    display(dwcagent_attr_output.loc[dwcagent_attr_output['yob_is_lt_citeDate_min'] == True].get([
        # "MatCitGbifOccurrenceId_firstsample",
        "canonical_string_collector_parsed",
        'itemLabel',
        "custom_score_overall",
        "attributionRemarks",
        'custom_score_multiple_names',
        "namematch_distance",
        # 'yob', 'yod',
        "life_time_periode",
        'MatCitDate_min', 'MatCitDate_max',
        "yob_is_lt_citeDate_min" ,'yod_is_gt_citeDate_max',
        'custom_score_lifetime', 'custom_score_lifetime_annotation'
    ]).head(5))
    display(dwcagent_attr_output.loc[dwcagent_attr_output['yob_is_lt_citeDate_min'] == False].get([
        # "MatCitGbifOccurrenceId_firstsample",
        "canonical_string_collector_parsed",
        'itemLabel',
        "custom_score_overall",
        "attributionRemarks",
        'custom_score_multiple_names',
        "namematch_distance",
        # 'yob', 'yod',
        "life_time_periode",
        'MatCitDate_min', 'MatCitDate_max',
        "yob_is_lt_citeDate_min" ,'yod_is_gt_citeDate_max',
        'custom_score_lifetime', 'custom_score_lifetime_annotation'
    ]).head(5))

| | canonical_string_collector_parsed | itemLabel | custom_score_overall | attributionRemarks | custom_score_multiple_names | namematch_distance | life_time_periode | MatCitDate_min | MatCitDate_max | yob_is_lt_citeDate_min | yod_is_gt_citeDate_max | custom_score_lifetime | custom_score_lifetime_annotation |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| 5349 | A.G | Carl Adolph Agardh | 0.25 | 0.0 (k-means distance); 0.25 (score overall); ... | 0.0 | 0.0 | 1785-1859 | 1895-09-01 00:00:00.000 | 1895-09-01 00:00:00.000 | True | False | 0.5 | OK yob + 10, but yod not matching, check name ... |
| 6758 | Aa | Hubertus Antonius van der Aa | 0.25 | 0.0 (k-means distance); 0.25 (score overall); ... | 0.0 | 0.0 | 1935-2017 | 1972-09-18 00:00:00.000 | 2021-07-12 00:00:00.000 | True | False | 0.5 | OK yob + 10, but yod not matching, check name ... |
| 59786 | K. Aagaard | Kaare Aagaard | 0.5 | 0.0 (k-means distance); 0.50 (score overall); ... | 0.0 | 0.0 | 1947-? | 1986-07-01 00:00:00.000 | 1986-07-01 00:00:00.000 | True | | 1.0 | OK? year of death is missing |
| 64689 | L. Aarvik | Lars Aarvik | 0.0 | 0.0 (k-means distance); 0.00 (score overall); ... | -0.5 | 0.0 | 1892-1981 | 1936-03-01 00:00:00.000 | 2018-08-20 00:00:00.000 | True | False | 0.5 | OK yob + 10, but yod not matching, check name ... |
| 143041 | Leif Aarvik | Leif Aarvik | 0.5 | 0.0 (k-means distance); 0.50 (score overall); ... | 0.0 | 0.0 | 1954-? | 1993-01-01 00:00:00.000 | 2014-10-19 00:00:00.000 | True | | 1.0 | OK? year of death is missing |


| | canonical_string_collector_parsed | itemLabel | custom_score_overall | attributionRemarks | custom_score_multiple_names | namematch_distance | life_time_periode | MatCitDate_min | MatCitDate_max | yob_is_lt_citeDate_min | yod_is_gt_citeDate_max | custom_score_lifetime | custom_score_lifetime_annotation |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| 0 | A | Geng Bojie | -0.5 | 0.0 (k-means distance); -0.50 (score overall);... | 0.0 | 0.0 | 1917-1997 | 1-10-06 00:00:00.000 | 2021-01-29 00:00:00.000 | False | False | -1.0 | life time not matching any citeDate (yob + 10 ... |
| 141787 | Kaare Aagaard | Kaare Aagaard | -0.25 | 0.0 (k-means distance); -0.25 (score overall);... | 0.0 | 0.0 | 1947-? | NaT | NaT | False | | -0.5 | yob + 10 not matching, yod unknown, check name... |
| 64690 | L. Aarvik | Leif Aarvik | -0.5 | 0.0 (k-means distance); -0.50 (score overall);... | -0.5 | 0.0 | 1954-? | 1936-03-01 00:00:00.000 | 2018-08-20 00:00:00.000 | False | | -0.5 | yob + 10 not matching, yod unknown, check name... |
| 6799 | Abbe | Ernst Cleveland Abbe | 0.25 | 0.0 (k-means distance); 0.25 (score overall); ... | 0.0 | 0.0 | 1905-2000 | 1878-07-01 00:00:00.000 | 1878-07-01 00:00:00.000 | False | True | 0.5 | yob + 10 not matching, OK yod, check name and ... |
| 71331 | M. Abdel-Dayem | Mahmoud S. Abdel-Dayem | -0.5 | 0.0 (k-means distance); -0.50 (score overall);... | -0.5 | 0.0 | 2000-? | 2010-10-14 00:00:00.000 | 2016-05-07 00:00:00.000 | False | | -0.5 | yob + 10 not matching, yod unknown, check name... |
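
The overall scores in the output above can be checked by hand. The following is a minimal sketch, not the notebook's code: it restates the `custom_score_overall` formula for a single row in plain Python. The function name `overall_score` and the value `2.0` for the maximum distance are assumptions for illustration (the real maximum is not shown in this section), and the guard for a maximum of zero is an addition that the pandas expression does not have.

```python
def overall_score(distance, distance_max, score_lifetime, score_multiple_names):
    """Hypothetical single-row version of the custom_score_overall formula."""
    if distance_max == 0:
        # assumption: every distance is 0, so treat the match as fully similar
        similarity = 1.0
    else:
        # distance 0 -> similarity 1, distance == distance_max -> similarity 0
        similarity = abs(distance - distance_max) / distance_max
    return round(similarity * (score_lifetime + score_multiple_names) / 2, 3)


# values taken from the example rows above; distance_max=2.0 is made up
print(overall_score(0.0, 2.0, 0.5, 0.0))    # A.G / Carl Adolph Agardh -> 0.25
print(overall_score(0.0, 2.0, 1.0, 0.0))    # K. Aagaard / Kaare Aagaard -> 0.5
print(overall_score(0.0, 2.0, 0.5, -0.5))   # L. Aarvik / Lars Aarvik -> 0.0
print(overall_score(0.0, 2.0, -1.0, 0.0))   # A / Geng Bojie -> -0.5
```

Because all example rows have a distance of 0.0, their similarity factor is 1 and the overall score is simply the mean of the life time score and the multiple-names score.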


In [27]:
column_map_dwcagent_attr = {
    'MatCitGbifOccurrenceId_firstsample': 'occurrenceID',
    'canonical_string_collector_parsed':  'alternateName',
    'source_data':                        'verbatimName',
    'itemLabel':                          'name',
    'item':                               'identifier',
    'MatCitDate_min':                     'startedAtTime',
    'MatCitDate_max':                     'endedAtTime',
    'namematch_distance':                 'custom_namematch_distance'
}
dwcagent_attr_output.rename(
    mapper=column_map_dwcagent_attr,
    axis='columns',
    inplace=True)

dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('identifier')          + 1, 'agentIdentifierType', 'wikidata' , allow_duplicates=True)
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('agentIdentifierType') + 1, 'agentType'          , 'Person'   , allow_duplicates=True)
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('identifier')          + 1, 'action'             , 'collected', allow_duplicates=True)

show_display_output=False
if show_display_output:
    display(dwcagent_attr_output.head(20)) # a bare expression inside an if block is not rendered by the notebook

dwcagent_attr_output=dwcagent_attr_output.reindex(
    columns=[
        'occurrenceID', # no DwC agent standard (yet)?
        'verbatimName',
        'alternateName',
        'displayOrder', # shall start from 1, 2, 3 …
        'name',
        'attributionRemarks',
        'startedAtTime',
        'endedAtTime',
        'agentType',
        'action',
        'agentIdentifierType',
        'identifier',
        "custom_score_overall", # keep it for calculation convenience, no standard in DwC agent
        'custom_namematch_distance',# keep it for calculation convenience, no standard in DwC agent
        'custom_score_multiple_names',# keep it for calculation convenience, no standard in DwC agent
        'custom_score_lifetime' # keep it for calculation convenience, no standard in DwC agent
    ]
)
# column deletion not necessary after ….reindex(columns=[…])
# for this_column in ['yob', 'yod', 'life_time_periode', 'yob_is_lt_citeDate_min', 'yod_is_gt_citeDate_max', 'score_lifetime_annotation']:
#     del dwcagent_attr_output[this_column]


In [28]:
show_display_output=True
if show_display_output:
    # criterion = dwcagent_attr_output['alternateName'].str.contains('S. Ahmad')
    criterion = dwcagent_attr_output['custom_score_multiple_names'] < 0 # show matches with multiple names

    display(dwcagent_attr_output[criterion].head(20))

| | occurrenceID | verbatimName | alternateName | displayOrder | name | attributionRemarks | startedAtTime | endedAtTime | agentType | action | agentIdentifierType | identifier | custom_score_overall | custom_namematch_distance | custom_score_multiple_names | custom_score_lifetime |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| 64689 | 3712345314 | A. Bjornstad & L. Aarvik | L. Aarvik | 1 | Lars Aarvik | 0.0 (k-means distance); 0.00 (score overall); ... | 1936-03-01 00:00:00.000 | 2018-08-20 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q106823278 | 0.0 | 0.0 | -0.5 | 0.5 |
| 64690 | 3712345314 | A. Bjornstad & L. Aarvik | L. Aarvik | 2 | Leif Aarvik | 0.0 (k-means distance); -0.50 (score overall);... | 1936-03-01 00:00:00.000 | 2018-08-20 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q17114254 | -0.5 | 0.0 | -0.5 | -0.5 |
| 102425 | 3080394386 | S. Abbott | S. Abbott | 1 | Sue Darwin Abbott | 0.0 (k-means distance); 0.25 (score overall); ... | 1971-02-26 00:00:00.000 | 1971-02-26 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q105518299 | 0.25 | 0.0 | -0.5 | 1.0 |
| 102426 | 3080394386 | S. Abbott | S. Abbott | 2 | Sue Darwin Abbott | 0.0 (k-means distance); 0.25 (score overall); ... | 1971-02-26 00:00:00.000 | 1971-02-26 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q105518299 | 0.25 | 0.0 | -0.5 | 1.0 |
| 102427 | 3080394386 | S. Abbott | S. Abbott | 3 | Sarah Rideout Abbott | 0.0 (k-means distance); 0.00 (score overall); ... | 1971-02-26 00:00:00.000 | 1971-02-26 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q67079678 | 0.0 | 0.0 | -0.5 | 0.5 |
| 120674 | 3407812353 | W. Abbott | W. Abbott | 1 | Walter Sidney Abbott | 0.0 (k-means distance); 0.25 (score overall); ... | 1922-04-07 00:00:00.000 | 1922-04-07 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q55007517 | 0.25 | 0.0 | -0.5 | 1.0 |
| 120675 | 3407812353 | W. Abbott | W. Abbott | 2 | William Louis Abbott | 0.0 (k-means distance); 0.25 (score overall); ... | 1922-04-07 00:00:00.000 | 1922-04-07 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q635604 | 0.25 | 0.0 | -0.5 | 1.0 |
| 71329 | 2252300236 | Al Mandaq, W & Tourabah & Al Dhafer, H. & Abde... | M. Abdel Dayem | 1 | Mahmoud S. Abdel-Dayem | 0.0 (k-means distance); 0.25 (score overall); ... | 2012-05-03 00:00:00.000 | 2012-05-03 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q27921885 | 0.25 | 0.0 | -0.5 | 1.0 |
| 71330 | 2252300236 | Al Mandaq, W & Tourabah & Al Dhafer, H. & Abde... | M. Abdel Dayem | 2 | Mahmoud S. Abdel-Dayem | 0.0 (k-means distance); 0.25 (score overall); ... | 2012-05-03 00:00:00.000 | 2012-05-03 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q27921885 | 0.25 | 0.0 | -0.5 | 1.0 |
| 71331 | 3069288303 | Abdel-Dayem M | M. Abdel-Dayem | 1 | Mahmoud S. Abdel-Dayem | 0.0 (k-means distance); -0.50 (score overall);... | 2010-10-14 00:00:00.000 | 2016-05-07 00:00:00.000 | Person | collected | wikidata | http://www.wikidata.org/entity/Q27921885 | -0.5 | 0.0 | -0.5 | -0.5 |
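
How `custom_score_multiple_names` and `displayOrder` relate to repeated collector names can be illustrated without pandas. The sketch below is an illustration under assumptions, not the notebook's implementation: the helper names `multiple_name_scores` and `display_orders` and the toy name list are made up. It mirrors the idea of `duplicated(keep=False)` (flag every name that occurs more than once) and of `duplicated()` followed by `groupby(…).cumsum() + 1` (number the rows of each name in their sorted order).

```python
from collections import Counter, defaultdict


def multiple_name_scores(names):
    """Hypothetical helper: -0.5 for every name that occurs more than once, else 0.0."""
    counts = Counter(names)
    return [-0.5 if counts[name] > 1 else 0.0 for name in names]


def display_orders(names):
    """Hypothetical helper: running number per name, starting at 1, in the given row order."""
    seen = defaultdict(int)
    orders = []
    for name in names:
        seen[name] += 1
        orders.append(seen[name])
    return orders


# toy data: rows are assumed to be sorted already, as after sort_values()
names = ["L. Aarvik", "L. Aarvik", "S. Abbott", "S. Abbott", "S. Abbott", "Leif Aarvik"]
print(multiple_name_scores(names))  # [-0.5, -0.5, -0.5, -0.5, -0.5, 0.0]
print(display_orders(names))        # [1, 2, 1, 2, 3, 1]
```

The numbering depends entirely on the preceding sort, so a candidate with `displayOrder` 1 is only the first one in sort order, not a confirmed match.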


In [29]:
os.makedirs('data', exist_ok=True)

# this_timestamp_for_data=time.strftime('%Y%m%d') # 20230719
# this_timestamp_for_data=20231116
this_timestamp_for_data=20260210
this_output_file='data/results_plazi_collectors_citeDate_vs_wikidata-botanists_kneighbor_dwc-agent-output_%s.csv' % (
    this_timestamp_for_data
)

dwcagent_attr_output.to_csv(this_output_file, index=False)

print("Wrote matches of collector names as dwc-agent-output into %s (%d kB)" %
    (this_output_file, os.path.getsize(this_output_file) >> 10) # right shift by 10 bits = integer division by 1024, i.e. bytes to kilobytes
)

Wrote matches of collector names as dwc-agent-output into data/results_plazi_collectors_citeDate_vs_wikidata-botanists_kneighbor_dwc-agent-output_20260210.csv (58277 kB)
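
The written CSV can be read back with the standard library, for example to list the first-ranked candidate per collector name. This is a minimal sketch under assumptions: the helper name `first_candidates` is made up, and the in-memory sample only contains a few of the output columns, with values copied from the multiple-names example above.

```python
import csv
import io

# reduced sample with a subset of the dwc-agent output columns
sample_csv = """alternateName,displayOrder,name,identifier,custom_score_overall
L. Aarvik,1,Lars Aarvik,http://www.wikidata.org/entity/Q106823278,0.0
L. Aarvik,2,Leif Aarvik,http://www.wikidata.org/entity/Q17114254,-0.5
W. Abbott,1,Walter Sidney Abbott,http://www.wikidata.org/entity/Q55007517,0.25
W. Abbott,2,William Louis Abbott,http://www.wikidata.org/entity/Q635604,0.25
"""


def first_candidates(csv_file):
    """Hypothetical helper: keep only the rows with displayOrder 1."""
    return [row for row in csv.DictReader(csv_file) if int(row["displayOrder"]) == 1]


for row in first_candidates(io.StringIO(sample_csv)):
    print(row["alternateName"], "->", row["name"], row["custom_score_overall"])
```

For the real file, pass `open(this_output_file, newline='', encoding='utf-8')` instead of the `io.StringIO` sample. Ties such as the two `W. Abbott` candidates (both 0.25) still need a manual decision.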


## Documentation

Explanation of columns:

Column | Description
-|-
**Plazi data fields** | 
DocCount | number of documents
MatCitId | materials citation UUID (presumably Plazi-internal)
MatCitGbifOccurrenceId | related GBIF occurrence id
MatCitDate | date of the material cited
MatCitDecade | decade of the material cited
MatCitYear | year of the material cited
MatCitMonth | month of the material cited
MatCitCollector | collector of the cited material
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nick name from name parsing
appellation | appellation from name parsing
title | title from name parsing
TODO … | Year of first collection
TODO end_date | Year of last collection
TODO activity_span | Number of years between first and last collection
**Name matching** |
nammatch_collector | matched collector name from the data set
nammatch_wikidata | matched name, i.e. the Wikidata item label the collector name is matched to
name_match_distance | Nearest Neighbour distance between the name and matched name; the lower the value, the better the match
**DarwinCore Agent Output** | (☞ [agent_actions_v2020-09-08.xml](https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml))
occurrenceID | occurrence ID of the data item
name | the interpreted name match (https://github.com/tdwg/attribution/ The name of the item. In this case the *full name* as would be written on a legal document (without abbreviation), eg givenName familyName)
verbatimName | the source data name(s) (https://github.com/tdwg/attribution/ As written on occurrence, such as the collection or determination label.)
alternateName | the input name, collector source name (An alias for the item. Other full name agent may have been known under such as maiden name.)
displayOrder | here: rank of the candidate match within one collector name (the multiple-name cases), starting at 1 (https://github.com/tdwg/attribution/ The display order for the agent that executed the action when more than one agent was a participant.)
attributionRemarks | notes on the results (distance or similarity), including calculated value
agentType | The nature of the agent, e.g. "Person", "Organization", "SoftwareApplication"
action | The name of the single action written as a verb in past tense. Recommended best practice is to use a controlled vocabulary, examples "collected" or "identified"
agentIdentifierType | The type of identifier for the agent. (https://github.com/tdwg/attribution/ Recommended best practice is to use a controlled vocabulary, e.g. “ORCID”, “ISNI”, “Wikidata”, “VIAF”, “RoR”, “Ringgold”, “GRID”).
identifier | Wikidata ID (Recommended practice is to identify the resource by means of a string conforming to an identification system. Examples include International Standard Book Number (ISBN), Digital Object Identifier (DOI), and Uniform Resource Name (URN). Persistent identifiers should be provided as HTTP URIs.)
startedAtTime | (https://github.com/tdwg/attribution/ Start is when an action is deemed to have been started by an agent.) the first date of eventDate (supposedly the first sampling date), but grouped by collector name; in case of multiple name matches this first “sampling date” is less reliable and cannot be related reliably to the source collector’s life time.
endedAtTime | (https://github.com/tdwg/attribution/ End is when an action is deemed to have been ended by an agent.) the last date of eventDate (supposedly the last sampling date), but grouped by collector name; in case of multiple name matches this last “sampling date” is less reliable and cannot be related reliably to the source collector’s life time.
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))
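
How `yob` and `yod` feed into `custom_score_lifetime` can be sketched for a single record. This is a sketch under stated assumptions, not the notebook's code: the helper names are made up, the offset of 10 years is read off the “yob + 10” annotations, and the comparisons are assumed to be strict and made on whole years. The lookup table lists only the flag combinations whose score is visible in this document; any other combination returns `None` here.

```python
YEARS_FROM_BIRTH = 10  # assumption, taken from the "yob + 10" annotations


def lifetime_flags(yob, yod, cite_year_min, cite_year_max, offset=YEARS_FROM_BIRTH):
    """Hypothetical helper: (yob_is_lt_citeDate_min, yod_is_gt_citeDate_max), None if unknown."""
    yob_ok = None if yob is None else (yob + offset) < cite_year_min
    yod_ok = None if yod is None else yod > cite_year_max
    return yob_ok, yod_ok


# only the combinations whose score can be read from the code and output above
LIFETIME_SCORES = {
    (True, None): 1.0,     # "OK? year of death is missing"
    (True, False): 0.5,    # yob fits, yod does not
    (False, True): 0.5,    # yod fits, yob does not
    (False, False): -1.0,  # life time matches no cite date
    (False, None): -0.5,   # yob does not fit, yod unknown
    (None, False): -0.5,   # yob unknown, yod does not fit
}


def lifetime_score(yob, yod, cite_year_min, cite_year_max):
    """Hypothetical helper: score for one record, None for combinations not listed."""
    return LIFETIME_SCORES.get(lifetime_flags(yob, yod, cite_year_min, cite_year_max))


print(lifetime_score(1785, 1859, 1895, 1895))  # Carl Adolph Agardh -> 0.5
print(lifetime_score(1947, None, 1986, 1986))  # Kaare Aagaard -> 1.0
print(lifetime_score(1905, 2000, 1878, 1878))  # Ernst Cleveland Abbe -> 0.5
```

A lookup table keeps every flag combination and its score in one place, which makes gaps (combinations without a score) easier to spot than in a chain of `.loc[…]` assignments.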