# Match Meise Collectors to Wikidata Items Using *Nearest Neighbour*, `eventDate` Involved

In this example we add `eventDate` of the source data, when the sample/occurrence was collected, to have a time reference, when the collector should have been  alive.

Basically we …

- match of `canonical_string` of WikiData to `canonical_string` of the source collectors (abbreviated names and full names, if given), and
- parse collector source names beforehand to get individual names out of name lists in the source data, we have used <https://libraries.io/rubygems/dwc_agent>, and in general we
- follow the example of Niels Klazenga <https://github.com/nielsklazenga/avh-collectors/blob/master/match_names_to_wikidata_items.ipynb>
- calculate the name distance and also a similarity score for the matching of life span with the (sampling) `eventDate`, and finally we
- write the output to provide a DarwinCore attribution structure (for `verbatimName` we would need the `source_data` name(s))

Technical Notes — Review Code perhaps:

- TODO review score calculation of the matching of relating eventData with range of yob, yod
- TODO review DwC agent output, keep at this time custom columns for filter-sort-evaluation convenience
- (NN ⇌ Cosine) refactor relation: wd_matchtest ⇌ wikidata_unique (replaced wd_matchtest → wikidata_unique)

### Load Wikidata Data Set

Use Jupyter Notebook [create_wikidata_datasets_botanists-altlabel.ipynb](./create_wikidata_datasets_botanists-altlabel.ipynb) to generate matching data of botanists.

Now load the data and make them unique …

In [1]:
import pandas as pd
import pprint, time, os

explain_and_show_the_data = True
this_timestamp_for_data=20260210
# this_timestamp_for_data=time.strftime('%Y%m%d')

wikidata = pd.read_csv(
    # "data/wikidata_persons_botanists_20231030_1539.csv", # inverse match: [particle +] family, given
    # "data/wikidata_persons_botanists_20231116.csv",        # match: given [+ particle] + family[+ , suffix]
    # "data/wikidata_persons_botanists_20240312.csv", # with itemLabel + altLabel wyb, wye removed
    "data/wikidata_persons_botanists_20260210.csv", 
    index_col=0, low_memory=False,
    dtype={
        'yob':'Int32',
        'yod':'Int32',
        'wyb':'Int32',
        'wye':'Int32'
    }    
)
# # # # # # # # # # # # # # # # 
# def convert_to_datetime(year):
#     year = int(year)
#     days = int((year - int(year)) * 365.25)
#     base_date = pd.to_datetime(f'{year}-01-01', errors="ignore")
#     # ,format='%Y-%m-%d', errors='coerce'
#     # base_date = pd.Period(year, freq='D')
#     return base_date + pd.DateOffset(days=days)
# 
# def convert_to_datetime(year):
#     base_date = pd.to_datetime(f'{year}-01-01', errors="ignore")
# print(wikidata.dtypes)

# wikidata['yob_converted'] = wikidata['yob'].apply(convert_to_datetime)
# wikidata['yod_converted'] = wikidata['yod'].apply(lambda x: pd.Period(x, freq='Y')) # Given date string "-286" not likely a datetime
# # # # # # # # # # # # # # # # 

if explain_and_show_the_data:
    pprint.pprint(wikidata.columns)
    display(wikidata.head())

Index(['item', 'itemLabel', 'surname', 'initials', 'canonical_string',
       'canonical_string_fullname', 'orcid', 'viaf', 'isni', 'harv', 'ipni',
       'abbr', 'bionomia_id', 'yob', 'yod', 'wikidata_link', 'orcid_link',
       'harv_link', 'ipni_link', 'bionomia_link'],
      dtype='str')


Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q100142069,Frida Eggens,,,Eggens,Eggens,,,,,20045232-1,Eggens,,,,http://www.wikidata.org/wiki/Q100142069,,,https://www.ipni.org/a/20045232-1,
1,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Frida,F.,F. Eggens,Frida Eggens,,,,,20045232-1,Eggens,,,,http://www.wikidata.org/wiki/Q100142069,,,https://www.ipni.org/a/20045232-1,
2,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Elizabeth,E.,E. Harrison,Elizabeth Harrison,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
3,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,,,Mrs A. H.,Mrs A. H.,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
4,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs Arnold,M. A.,M. A. Harrison,Mrs Arnold Harrison,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795


In [2]:
# compile data having only unique canonical strings
# group by canonical name/string, count douplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()
wd_matchtest_fullnames = wikidata.groupby('canonical_string_fullname').agg({'item': ['count']}).reset_index()

if explain_and_show_the_data:
    display(wd_matchtest)
    display(wd_matchtest_fullnames)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"""F."" Ryser",1
1,"""N.A. Antipova"" (lapsus)",1
2,"""N.A.Antipova"" (lapsus)",1
3,"""The grandmother of female scientists in Ghana""",1
4,"""Н. А. Антипова"" (lapsus)",1
...,...,...
171443,赵云鹏,1
171444,郭亚龙,1
171445,金井弘夫(Hiroo Kanai),1
171446,金双 马,1


Unnamed: 0_level_0,canonical_string_fullname,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"""Fritz"" Ryser",1
1,"""N.A. Antipova"" (lapsus)",1
2,"""N.A.Antipova"" (lapsus)",1
3,"""The grandmother of female scientists in Ghana""",1
4,"""Н. А. Антипова"" (lapsus)",1
...,...,...
204788,赵云鹏,1
204789,郭亚龙,1
204790,金井弘夫(Hiroo Kanai),1
204791,金双 马,1


### Load Collectors Data Set

Data sources:

- Jupyter Notebook for [create_meise_gbif-occurrence_collectors_eventDate_dataset.ipynb](./create_meise_gbif-occurrence_collectors_eventDate_dataset.ipynb)

Then parse collector names to get single, separate collector names using `dwcagent`, use ruby gem package available at  <https://rubygems.org/gems/dwc_agent>:

- see [./bin/README.md](bin/README.md) to use ruby script `./bin/agent_parse4tsv.rb` for parsing text lines like `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`


In [3]:
# atomized names were parsed already by ruby gem package: dwcagent —
# they can contain also the same name accross multiple rows — 
# it’s probably better for the matching to make the name rows unique later on

# collectors = pd.read_csv("data/meise_collectors_20230510_1429_single-line_parsed_unique_names.tab", sep="\t")
collectors = pd.read_csv(
    "./data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv", 
    sep="\t", low_memory=False
)

collectors.dropna(subset=['family'], inplace=True) # remove where family was NA, e.g. from originally «??» aso.

# Out of bounds nanosecond timestamp: 1652-01-01T00:00:00
#  because date nanoseconds range limitations of pandas, see https://stackoverflow.com/a/69507200/1240387
print("modify time using pd.Periode(…) to get it work also on very old dates...")
for col in ['eventDate_mean', 'eventDate_min', 'eventDate_max']:
    print("- convert", col, "to pd.Period(...) in collectors")
    collectors[col] = collectors[col].apply(
        lambda x: pd.Period(
            x, freq='s' if col.lower().endswith('mean') else 'D' # Seconds or Day level
        )
    ) # D=day level
    # see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-period-aliases
print("modifing done.")

collectors.sort_values(by=['family', 'given','occurrenceID_firstsample'], inplace=True)
if explain_and_show_the_data: display(collectors)

modify time using pd.Periode(…) to get it work also on very old dates...
- convert eventDate_mean to pd.Period(...) in collectors
- convert eventDate_min to pd.Period(...) in collectors
- convert eventDate_max to pd.Period(...) in collectors
modifing done.


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
43639,A,H. M. Llo,,,,,,,H.M. L[lo][???]a,parsed:H.M.Llo a,cleaned:H. M. Llo A,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT
44219,A,Hameln,,,,,,,Hameln [a/d Weser],parsed:Hameln a<SEP>d Weser,cleaned:Hameln A<SEP>D Weser,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT
84528,A,R.,,,,,,,R. Decha[] & R. []a,parsed:R. Decha<SEP>R. a,cleaned:R. Decha<SEP>R. A,1,http://www.botanicalcollections.be/specimen/BR...,1988-11-19 00:00:00,1988-11-19,1988-11-19
18419,A,,,,,,,,"Church, A.C., Ismail, Ruskandi, A",parsed:A.C. Church<SEP>Ruskandi Ismail<SEP>A,cleaned:A.C. Church<SEP>Ruskandi Ismail<SEP>A,1,http://www.botanicalcollections.be/specimen/BR...,1995-10-26 00:00:00,1995-10-26,1995-10-26
87726,A,,,,,,,,Raynal R. & A,parsed:R. Raynal<SEP>A,cleaned:R. Raynal<SEP>A,1,http://www.botanicalcollections.be/specimen/BR...,1978-06-08 00:00:00,1978-06-08,1978-06-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24944,Žertová,A.,,,,,,Dr.,Dr. A. Žertová,parsed:A. Žertová,cleaned:A. Žertová,6,http://www.botanicalcollections.be/specimen/BR...,1959-01-01 04:00:00,1958-08-13,1960-06-23
58955,Žertová,,,,,,,,Klášterský & Žertová,parsed:Klášterský<SEP>Žertová,cleaned:Klášterský<SEP>Žertová,1,http://www.botanicalcollections.be/specimen/BR...,1958-05-20 00:00:00,1958-05-20,1958-05-20
91691,Ǿllgaard,B.,,,,,,,"S.P. Pinnerup, B. Ǿllgaard & L. Holm-Nielsen",parsed:S.P. Pinnerup<SEP>B. Ǿllgaard<SEP>L. Ho...,cleaned:S.P. Pinnerup<SEP>B. Ǿllgaard<SEP>L. H...,1,http://www.botanicalcollections.be/specimen/BR...,1975-08-29 00:00:00,1975-08-29,1975-08-29
118435,Т. Королева. V. Petrovsky,В.Пертовский,,и,,,,,В. Пертовский и Т. Королева. V. Petrovsky & T....,parsed:В.Пертовский и Т. Королева. V. Petrovsk...,cleaned:В.Пертовский и Т. Королева. V. Petrovs...,1,http://www.botanicalcollections.be/specimen/BR...,1973-06-24 00:00:00,1973-06-24,1973-06-24


#### Check Composition of Parsed Collector Data

In [4]:
# TODO review code of abbreviated names and full name matching
if explain_and_show_the_data: 
    criterion_fullnames = collectors.given.str.contains('^\\w{3,}', na=False)
    print("Show collecors given name has (propably) a full name (%s records) …" % len(collectors[criterion_fullnames].index))
    display(collectors[criterion_fullnames])

Show collecors given name has (propably) a full name (18092 records) …


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
44219,A,Hameln,,,,,,,Hameln [a/d Weser],parsed:Hameln a<SEP>d Weser,cleaned:Hameln A<SEP>D Weser,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT
64324,A. Brabant,Lochenies,,in,,,,,Lochenies in A. Brabant,parsed:Lochenies in A. Brabant,cleaned:Lochenies in A. Brabant,1,http://www.botanicalcollections.be/specimen/BR...,1892-01-01 00:00:00,1892-01-01,1892-01-01
34284,A. Chevalier,Fleury,,in,,,,,Fleury in A. Chevalier,parsed:Fleury in A. Chevalier,cleaned:Fleury in A. Chevalier,2,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT
22272,A. Cogniaux,Dandois,,in,,,,,Dandois in A. Cogniaux,parsed:Dandois in A. Cogniaux,cleaned:Dandois in A. Cogniaux,1,http://www.botanicalcollections.be/specimen/BR...,1866-06-01 00:00:00,1866-06-01,1866-06-01
39899,A. Cogniaux,Gilbert,,in,,,,,Gilbert in A. Cogniaux,parsed:Gilbert in A. Cogniaux,cleaned:Gilbert in A. Cogniaux,2,http://www.botanicalcollections.be/specimen/BR...,1864-08-06 00:00:00,1864-08-06,1864-08-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95209,Čermak,Stan,,,,,,,Stan. Čermak,parsed:Stan Čermak,cleaned:Stan Čermak,1,http://www.botanicalcollections.be/specimen/BR...,1926-05-01 00:00:00,1926-05-01,1926-05-01
95204,Čermák,Stan,,,,,,,Stan Čermák,parsed:Stan Čermák,cleaned:Stan Čermák,1,http://www.botanicalcollections.be/specimen/BR...,1925-10-12 00:00:00,1925-10-12,1925-10-12
97411,Şes-Frăţilescu,Tatiana,,,,,,,Tatiana Şes[]-Frăţilescu,parsed:Tatiana Şes-Frăţilescu,cleaned:Tatiana Şes-Frăţilescu,1,http://www.botanicalcollections.be/specimen/BR...,1972-06-13 00:00:00,1972-06-13,1972-06-13
97412,Şesan-Frăţilescu,Tatiana,,,,,,,Tatiana Şesan-Frăţilescu,parsed:Tatiana Şesan-Frăţilescu,cleaned:Tatiana Şesan-Frăţilescu,9,http://www.botanicalcollections.be/specimen/BR...,1972-04-06 08:00:00,1971-07-12,1973-06-12


In [5]:
# check the name-parsed columns if they are empty or need to be considerd as data for matching or not
if explain_and_show_the_data: 
    for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
        test_collectors = collectors.loc[(collectors[parsed_name_part].isna() == False)]
        print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
        display(test_collectors.head().get(["family", "given", "suffix", "particle", "dropping_particle", "nick", "appellation", "title"]))    


----------------------------------------
show names with **particle** found 8197 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
12654,A Robyns,Br,,in,,,,
69636,A. Anspach,L.,,in,,,,
64324,A. Brabant,Lochenies,,in,,,,
34284,A. Chevalier,Fleury,,in,,,,
49473,A. Chrková,J. Chrtek,,at,,,,



----------------------------------------
show names with **suffix** found 188 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
35091,Almeda,Frank,Jr.,,,,,
69175,Anderson,W.A.,Jr.,,,,,
52876,Bailey,L.H.,Jr.,,,,,
61664,Bailey,L.H.,Jr.,,,,,
35259,Barkley,Robert,II,,,,,



----------------------------------------
show names with **dropping_particle** found 0 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title



----------------------------------------
show names with **appellation** found 1127 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
34792,A. Hardy,Lebrun,,in,,,Fr.,
9305,A. Hardy,Lebrun,,in,,,Fr,
117482,Abbon,G.,,,,,fr.,
109095,Ahlberg,,,,,,Fr.,
34736,Ahlfvengren,E.,,,,,Fr.,


Compile and compose `canonical_string…` of the collector data that we will later match the WikiData names with:

In [6]:
if explain_and_show_the_data: print("combine parts of names similar to WikiData's given name labels")

# 1. Prepare canonical name with conditional spacing: add space/comma directly to the value …
given = collectors['given'].fillna('')
# .where(condition, value_if_true) ensures spaces only appear when data is present
part = collectors['particle'].where(collectors['particle'].isna(), " " + collectors['particle'])
fam  = collectors['family'].where(collectors['family'].isna(), " " + collectors['family'])
suff = collectors['suffix'].where(collectors['suffix'].isna(), ", " + collectors['suffix'])
# 2. Vectorized concatenation
collectors['canonical_string_collector_parsed'] = (
    given + part.fillna('') + fam.fillna('') + suff.fillna('')
).str.strip()
# Optional: Clean up multiple spaces if they exist in the source data
collectors['canonical_string_collector_parsed'] = (
    collectors['canonical_string_collector_parsed'].str.replace(r'\s+', ' ', regex=True)
)

criterion = collectors["particle"].str.contains("\\w+ \\w+", na=False)

if explain_and_show_the_data: 
    # display(collectors['canonical_string_collector_parsed'][criterion].head())
    display(collectors[['canonical_string_collector_parsed', 'particle']][criterion].drop_duplicates().head(10))

combine parts of names similar to WikiData's given name labels


Unnamed: 0,canonical_string_collector_parsed,particle
5111,A. coenen in A. Hardy,coenen in
1668,A. Hardy avec Mr. Fonsny in A. Hardy,avec Mr. Fonsny in
109857,H. Ponckier in L. Quadvlieg in A. Hardy,in L. Quadvlieg in
60164,L. Coututier in Simon in A. Hardy,in Simon in
62122,Laboulle de Verviers in A. Hardy,de Verviers in
63441,Lejeune in Sonnet in A. Hardy,in Sonnet in
51897,J. Van de Sande in A. Jans,Van de Sande in
53154,J.E. De Langhe in A. Jans,De Langhe in
56734,José Van Baelen in A. Lawalrée,Van Baelen in
37706,G.-C. van Haesendonck in A. Thielens,van Haesendonck in


In [7]:
# move canonical_string_collector_parsed after column title (title was the last of the parsing columns)
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)

these_columns=["family", "given", "suffix", "particle", "dropping_particle", "nick", "appellation", "title", 'canonical_string_collector_parsed']

if 'source_data' in collectors.columns:
    these_columns.append("source_data")

display(collectors.tail().get(these_columns))

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data
24944,Žertová,A.,,,,,,Dr.,A. Žertová,Dr. A. Žertová
58955,Žertová,,,,,,,,Žertová,Klášterský & Žertová
91691,Ǿllgaard,B.,,,,,,,B. Ǿllgaard,"S.P. Pinnerup, B. Ǿllgaard & L. Holm-Nielsen"
118435,Т. Королева. V. Petrovsky,В.Пертовский,,и,,,,,В.Пертовский и Т. Королева. V. Petrovsky,В. Пертовский и Т. Королева. V. Petrovsky & T....
52095,Ẅttewaall,J.,,,,,,,J. Ẅttewaall,J. [Ẅ]ttewaall


In [8]:
# group and aggregate data to have unique name rows only for the matching of names later on
collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    source_data=('source_data', lambda x: list(x)[0]),
    occurrenceID_collectors_count= ('occurrenceID_count', 'sum'), # use count function
    occurrenceID_collectors_firstsample=('occurrenceID_firstsample', lambda x: list(x)[0]), # custom function, to get the first entry
    collectors_eventDate_mean=('eventDate_mean', 'mean'),
    collectors_eventDate_min=('eventDate_min', 'min'),
    collectors_eventDate_max=('eventDate_max', 'max')
).reset_index()

# move canonical_string_collector_parsed after column title
col = collectors_unique.pop("canonical_string_collector_parsed")
collectors_unique.insert(collectors_unique.columns.get_loc('title') + 1, col.name, col)
collectors_unique.drop_duplicates(inplace=True)

if explain_and_show_the_data: display(collectors_unique)

# column naming perhaps more clear (because we condensed the data)?
# collectors=collectors.add_suffix('_namegrouped') \
#  if not any(col.endswith("_namegrouped") for col in list(collectors.columns))

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max
0,A,,,,,,,,A,"Church, A.C., Ismail, Ruskandi, A",222,http://www.botanicalcollections.be/specimen/BR...,1941-11-06 11:24:04,1809-01-01,2000-07-14
1,Aux,A,,,,,,,A Aux,[]a[] []aux,3,http://www.botanicalcollections.be/specimen/BR...,1968-01-31 08:00:00,1967-06-01,1968-06-02
2,Funk,A,,,,,,,A Funk,a Funk,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT
3,Isaac Holden,A,,,,,,,A Isaac Holden,a) Isaac Holden;b) W.A. Setchell,1,http://www.botanicalcollections.be/specimen/BR...,1892-09-01 00:00:00,1892-09-01,1892-09-01
4,Jablonski,A,,,,,,,A Jablonski,a Jablonski,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69216,Černev,Ž.,,,,,,,Ž. Černev,"T. Mešinev, Ž. Černev, C. Kacareva",1,http://www.botanicalcollections.be/specimen/BR...,1973-07-16 00:00:00,1973-07-16,1973-07-16
69217,Černeva,Ž.,,,,,,,Ž. Černeva,"Ž. Černeva, P. Gerginov",104,http://www.botanicalcollections.be/specimen/BR...,1975-11-15 09:21:13,1973-06-10,1979-09-20
69218,Žertová,,,,,,,,Žertová,Klášterský & Žertová,1,http://www.botanicalcollections.be/specimen/BR...,1958-05-20 00:00:00,1958-05-20,1958-05-20
69219,Péterfi,ϯ M.,,,,,,,ϯ M. Péterfi,Ϯ M. Péterfi & M. Pr[]c[],1,http://www.botanicalcollections.be/specimen/BR...,1921-06-02 00:00:00,1921-06-02,1921-06-02


In [9]:
# show collectors with highest occurrenceID_collectors_count
collectors_unique.sort_values(by=['occurrenceID_collectors_count', 'family', 'given'], ascending=[False, True, True]).head(10)

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max
27411,Vanderyst,H.,,,,,,,H. Vanderyst,Vanderyst H. & Polis C.,41259,http://www.botanicalcollections.be/specimen/BR...,1914-09-09 04:44:15,1819-06-07,1990-04-01
72,Aptroot,A.,,,,,,,A. Aptroot,Aptroot A.,32921,http://www.botanicalcollections.be/specimen/BR...,1989-07-24 07:01:45,1874-04-03,2022-08-03
32613,Louis,J.,,,,,,,J. Louis,[J.Louis?],30727,http://www.botanicalcollections.be/specimen/BR...,1938-05-08 00:55:43,1875-08-01,1987-08-29
9923,Berghen,C. Vanden,,,,,,,C. Vanden Berghen,C. Vanden Berghen [et DJ.],29259,http://www.botanicalcollections.be/specimen/BR...,1960-03-21 13:07:04,1856-04-03,2000-01-20
20188,Malaisse,F.,,,,,,,F. Malaisse,Malaisse F. & Saad L.,26240,http://www.botanicalcollections.be/specimen/BR...,1996-04-08 05:10:00,1802-01-01,2017-01-09
31772,Duvigneaud,J.,,,,,,,J. Duvigneaud,"P. Auquier, W. Belotte & J. Duvigneaud",23751,http://www.botanicalcollections.be/specimen/BR...,1975-08-27 06:43:43,1874-06-26,2021-09-15
16529,Coppejans,E.,,,,,,,E. Coppejans,Coppejans E.,23015,http://www.botanicalcollections.be/specimen/BR...,1995-07-08 23:20:34,1967-01-01,2014-03-13
38860,Delvosalle,L.,,,,,,,L. Delvosalle,Delvosalle L.,22529,http://www.botanicalcollections.be/specimen/BR...,1963-10-15 04:43:01,1900-06-24,2013-06-01
33400,Symoens,J.,,,,,,,J. Symoens,Hoffmann E. & Symoens J.,22073,http://www.botanicalcollections.be/specimen/BR...,1966-05-16 18:50:07,1848-06-01,2017-01-09
61836,Arts,T.,,,,,,,T. Arts,Arts T.,21420,http://www.botanicalcollections.be/specimen/BR...,1989-04-12 02:03:50,1888-03-15,2000-11-26


In [10]:
# Idea: Should we use data column suffixes to follow the data source after merging is done later?
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

### Set Up the Text Analysis

See https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536 for deeper understanding.

The `ngrams` function is used as an analyzer in the text search later.

In [11]:
import re
import unicodedata
!pip install ftfy # amazing text cleaning for decode issues..
from ftfy import fix_text

# Regex for unwanted special characters
CHARS_TO_REMOVE_RX = re.compile(r"[)(.|\[\]{}'„“”\"‚‘’›‹»«]|[,-./]|\sBD")
MULTIPLE_SPACES_RX = re.compile(r' +')

def ngrams(string, n=3):
    """
    Construct ngram(s) of a given text
     
    @param string: the text string to perform the ngram splitting on 
    @param n: character length of the particular (split) result text each
    @return: string as ngram
    """
    if not string:
        return []

    # (fix encoding errors)
    string = fix_text(string)

    # Normalization (IMPORTANT for Unicode)
    # NFC ensures that characters such as ‘é’ are treated as a single character
    # rather than as a combination of ‘e’ + accent.
    string = unicodedata.normalize('NFC', string)
    string = string.lower()    
    # (other languages often use different separators; kept generic here)
    string = string.replace('&', 'and').replace(',', ' ').replace('-', ' ')    
    # character cleanup but leave then as spaces
    string = CHARS_TO_REMOVE_RX.sub(' ', string)    
    # Normalization of spaces & case: ….title() also works with umlauts (e.g., “öllegard” -> “Öllegard”)
    string = string.title()
    string = MULTIPLE_SPACES_RX.sub(' ', string).strip()    
    # Padding & N-gram generation
    string = f" {string} "
    return [string[i : i + n] for i in range(len(string) - n + 1)]



In [12]:
import numpy as np
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def calculateTFIDFmatchingOfData(query_data, match_data, n_neighbors=1):
    """
    Calculate a TF-IDF (Term Frequency — Inverse Document Frequency) matching with getNearestN

    @param query_data: DataFrame usually a pandas data column to query names or strings for
    @param match_data: DataFrame against to match with
    @param n_neighbors: number of similar data neighbors to calculate for (output as well; 5 = 5 times more output data, therefore default: 1)

    @requires NearestNeighbors()
    @requires getNearestNeighbour()
    @requires ngrams()
    @requires TfidfVectorizer()
    @requires NearestNeighbors()

    @return: DataFrame a data frame of matches with columns 'namematch_source_data', 'namematch_resource_data', 'namematch_distance'
    """
    # TODO if n_neighbors > 1 then output data need more explaining columns, 
    # or ensure the right ordering, because as e.g. n_neighbors=5  
    # then 5 sub-samples of the same name shall be exist
    
    start = time.time()
    
    # Unique values for performance, but directly as a list for stable index
    query_list = list(set(query_data))
    match_values = match_data.values # Cache für schnellen Zugriff
    
    print(f'Vectorizing {len(match_data)} reference items...')
    # Tip: ‘ngrams’ must be defined globally or appear here as a lambda.
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
    tfidf_match_matrix = vectorizer.fit_transform(match_data)
    
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, n_jobs=-1, metric='cosine').fit(tfidf_match_matrix)
    
    print(f'Transforming {len(query_list)} queries...')
    tfidf_query_matrix = vectorizer.transform(query_list)
    
    # Start search
    distances, indices = nbrs.kneighbors(tfidf_query_matrix)
    
    print('Building DataFrame...')
    
    # We repeat each entry in query_list n_neighbors times
    # so that the column lengths for the DataFrame match
    query_repeated = np.repeat(query_list, n_neighbors)
    # TODO SUBSAMPLES n_neighbors: query_ids = np.repeat(np.arange(len(query_list)), n_neighbors)
    
    # Ensure that match_values is a flat NumPy array
    # .ravel() is the counterpart to .flatten(), but works more reliably
    flat_indices = indices.ravel()
    
    # We access the values and ensure that we have a NumPy format.
    match_hits = np.array(match_values)[flat_indices]
    
    matches = pd.DataFrame({
        # TODO SUBSAMPLES n_neighbors: 'group_index_id': query_ids,
        'namematch_source_data': query_repeated,
        'namematch_resource_data': match_hits,
        'namematch_distance': distances.ravel(),
        # TODO SUBSAMPLES n_neighbors: 'namematch_index': flat_indices
    })
    
    # Optional: Rounding the distance
    matches['namematch_distance'] = matches['namematch_distance'].round(4)

    print(f'Done in {time.time() - start:.2f}s')
    return matches

In [13]:
# some example data
samples = [
    ("simple name", "Klazenga, N."),
    ("simple name", "金井弘夫(Hiroo Kanai)"),
    ("simple name", "Н. А. Антипова"),
    ("data from collectors", collectors_unique["canonical_string_collector_parsed"].at[1]),
    ("data from match-test", collectors_unique["canonical_string_collector_parsed"].at[1]),
    ("data from match-test, full names", wd_matchtest_fullnames['canonical_string_fullname'].at[0])
]

print("Show ngram examples:")
for label, name in samples:
    print(f"- {label} “{name}”:", ngrams(name))

# some example data
print('\n(WikiData’s) canonical_string = (constructed) canonical_string_fullname:')
# We only take the first 5 lines of both columns.
short_names = wd_matchtest['canonical_string'].head(5)
long_names = wd_matchtest_fullnames['canonical_string_fullname'].head(5)

for short, long in zip(short_names, long_names):
    print(f"- {short} = {long}")

Show ngram examples:
- simple name “Klazenga, N.”: [' Kl', 'Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N', ' N ']
- simple name “金井弘夫(Hiroo Kanai)”: [' 金井', '金井弘', '井弘夫', '弘夫 ', '夫 H', ' Hi', 'Hir', 'iro', 'roo', 'oo ', 'o K', ' Ka', 'Kan', 'ana', 'nai', 'ai ']
- simple name “Н. А. Антипова”: [' Н ', 'Н А', ' А ', 'А А', ' Ан', 'Ант', 'нти', 'тип', 'ипо', 'пов', 'ова', 'ва ']
- data from collectors “A Aux”: [' A ', 'A A', ' Au', 'Aux', 'ux ']
- data from match-test “A Aux”: [' A ', 'A A', ' Au', 'Aux', 'ux ']
- data from match-test, full names “"Fritz" Ryser”: [' Fr', 'Fri', 'rit', 'itz', 'tz ', 'z R', ' Ry', 'Rys', 'yse', 'ser', 'er ']

(WikiData’s) canonical_string = (constructed) canonical_string_fullname:
- "F." Ryser = "Fritz" Ryser
- "N.A. Antipova" (lapsus) = "N.A. Antipova" (lapsus)
- "N.A.Antipova" (lapsus) = "N.A.Antipova" (lapsus)
- "The grandmother of female scientists in Ghana" = "The grandmother of female scientists in Ghana"
- "Н. А. Антипова" (lapsus) = "Н. А. А

Vectorize Wikidata names. Background: We use an information retrieval technique (Term Frequency — Inverse Document Frequency, blog [towardsdatascience.com/tf-idf-explained…](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275)) for matching the source names with WikiData names, for that a calculated dinsance measure of the name match will help to match similar names and distinguish names that are rather no match. In general see also https://scikit-learn.org, https://pypi.org/project/scikit-learn/.

### Perform the Matching

Perform the nearest neighbour (NN) matches on the (Meise) collector names and create a data frame with matches, and we try to distinguish abbreviated and full names in the source to better match source data and WikiData ... (can take 5 to 10 minutes)

Now convert a collection of raw documents to a matrix of TF-IDF features and set up the function that performs the matches...

In [14]:
criterion_fullnames = collectors_unique.given.str.contains('^\\w{3,}', na=False)
collectors_names = collectors_unique['canonical_string_collector_parsed'][[not fullname for fullname in criterion_fullnames]].values
# collectors_names = set(collectors_unique['canonical_string_collector_parsed'][[not fullname for fullname in criterion_fullnames]].values)
print("Calculate rather the abbreviated names only …")
matches = calculateTFIDFmatchingOfData(collectors_names, wd_matchtest['canonical_string']) 

matches = matches.sort_values(['namematch_distance'])
matches = matches.reset_index(names=['old_index'])

if explain_and_show_the_data: display(matches)

Calculate rather the abbreviated names only …
Vectorizing 171448 reference items...
Transforming 55872 queries...
Building DataFrame...
Done in 154.84s


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,32638,C.R. Orcutt,C. R. Orcutt,0.0
1,7993,W.M. Loerakker,W.M. Loerakker,0.0
2,7994,P. Dumée,P. Dumée,0.0
3,7999,Ker,Ker,0.0
4,32640,C.H. Chung,C.H.Chung,0.0
...,...,...,...,...
55867,39264,Fèc,百瀬静男(Sizuo Momose),1.0
55868,25550,I:B:B,百瀬静男(Sizuo Momose),1.0
55869,26668,Ux,百瀬静男(Sizuo Momose),1.0
55870,52963,Kűphï,百瀬静男(Sizuo Momose),1.0


In [15]:
# criterion_fullnames = collectors_unique.given.str.contains('^\\w{3,}', na=False)
collectors_fullnames = collectors_unique['canonical_string_collector_parsed'][criterion_fullnames].values

print("Calculate rather the full names only …")
matches_fullnames = calculateTFIDFmatchingOfData(collectors_fullnames, wd_matchtest_fullnames['canonical_string_fullname']) 

matches_fullnames = matches_fullnames.sort_values(['namematch_distance'])
matches_fullnames = matches_fullnames.reset_index(names=['old_index'])

if explain_and_show_the_data: display(matches_fullnames)

Calculate rather the full names only …
Vectorizing 204793 reference items...
Transforming 13349 queries...
Building DataFrame...
Done in 50.03s


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,11819,Alice Goen,Alice Goen,0.0000
1,4105,Edgar Anderson,Edgar Anderson,0.0000
2,3366,Richard W. Holm,Richard W. Holm,0.0000
3,11192,Léopold Reichling,Léopold Reichling,0.0000
4,11187,Jean-Louis de Sloover,Jean Louis De Sloover,0.0000
...,...,...,...,...
13344,10433,Reunião de Bot. Peninsular Coi,Greuning,0.7612
13345,3690,Abbí J. Jerrí,Abbe,0.7624
13346,13094,Sajy Garoefoiets,Sajo,0.7633
13347,6490,Frè Altigien,J.Alten,0.7648


### Create Output Results

Combine the matches data frame back to the (Meise) collectors and Wikidata items …

In [16]:
if explain_and_show_the_data: print("join matches data frame back to source collectors dataframe")
collectors_matches = pd.merge(
    collectors_unique, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

if explain_and_show_the_data: display(collectors_matches.head())

join matches data frame back to source collectors dataframe


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A,,,,,,,,A,"Church, A.C., Ismail, Ruskandi, A",222,http://www.botanicalcollections.be/specimen/BR...,1941-11-06 11:24:04,1809-01-01,2000-07-14,4830,A,A. A.,0.2696
1,Aux,A,,,,,,,A Aux,[]a[] []aux,3,http://www.botanicalcollections.be/specimen/BR...,1968-01-31 08:00:00,1967-06-01,1968-06-02,26839,A Aux,W. H. Auxer,0.5216
2,Funk,A,,,,,,,A Funk,a Funk,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT,18515,A Funk,A. Funk,0.0
3,Isaac Holden,A,,,,,,,A Isaac Holden,a) Isaac Holden;b) W.A. Setchell,1,http://www.botanicalcollections.be/specimen/BR...,1892-09-01 00:00:00,1892-09-01,1892-09-01,341,A Isaac Holden,Isaac,0.2921
4,Jablonski,A,,,,,,,A Jablonski,a Jablonski,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT,345,A Jablonski,E. Jablonski,0.0992


In [17]:
if explain_and_show_the_data: print("show full name matches, and append them to all matches")
collectors_matches_fullname = pd.merge(
    collectors_unique, matches_fullnames, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

if explain_and_show_the_data: display(collectors_matches_fullname.head())

show full name matches, and append them to all matches


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,Bunge,AAv,,,,,,,AAv Bunge,Bunge A.A.v.,688,http://www.botanicalcollections.be/specimen/BR...,1843-04-27 13:11:38,1839-01-01,1898-05-01,6529,AAv Bunge,Bunge,0.3762
1,Rudio,ABaun,,,,,,,ABaun Rudio,"Rudio, ABaun",1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT,4488,ABaun Rudio,Rudio,0.3467
2,Schneller,AHardy,,,,,,,AHardy Schneller,"Schneller, AHardy",1,http://www.botanicalcollections.be/specimen/BR...,1868-07-01 00:00:00,1868-07-01,1868-07-01,3216,AHardy Schneller,Schneller,0.3422
3,Guy,Aabert,,,,,,,Aabert Guy,[Aa]bert Guy,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT,10925,Aabert Guy,Per Aabel,0.5124
4,Smit,Aad,,,,,,,Aad Smit,Aad Smit,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT,13145,Aad Smit,Smit,0.3984


In [18]:
collectors_all_matches=pd.concat([collectors_matches, collectors_matches_fullname])
collectors_all_matches.sort_values(by=['namematch_distance', 'family'], ascending=[True, True], inplace=True)
if explain_and_show_the_data:
    print("show match results of all abbreviated and full names")
    display(collectors_all_matches.head())

show match results of all abbreviated and full names


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
34302,A. García,M.,,,,,,,M. A. García,"Rafael Torres C., A. García M., L. Cortes",10,http://www.botanicalcollections.be/specimen/BR...,1987-04-09 19:12:00,1987-04-07,1987-04-13,36448,M. A. García,M.A.García,0.0
2670,A.A,,,,,,,,A.A,A.A.,3,http://www.botanicalcollections.be/specimen/BR...,2013-12-23 08:00:00,2013-12-22,2013-12-24,25119,A.A,A. A.,0.0
2731,A.B,,,,,,,,A.B,A.B.,96,http://www.botanicalcollections.be/specimen/BR...,1862-02-12 06:49:05,1827-05-01,1904-05-01,34072,A.B,A. B.,0.0
2819,A.C.B,,,,,,,,A.C.B,?A.C.B.,1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT,53790,A.C.B,A C B,0.0
3434,A.R.-Smith,,,,,,,,A.R.-Smith,"G. Pope, A.R.-Smith & D.Goyder",1,http://www.botanicalcollections.be/specimen/BR...,NaT,NaT,NaT,45953,A.R.-Smith,A. R. Smith,0.0


### Merge Matched Data and WikiData’s

Review (TODO)
- evaluate time references: `eventDate` ~ `yob`, `wyb`—perhaps define a score value that could integrate all scores from properties we need for decision of the name matching (name distance, eventDate ~ year of birth/work year begin aso.)
- merge abbreviated and full name data properly, distinguish abbrevited match and full name match
- refactor `collectors_matches` or `collectors_matches_g1` aso. to `collectors_all_matches`
- refactor `collectors` to `collectors_unique`
- refactor `matches`to `matches_abbr` or distinguish `matches_fullname`

Now
1. merge now the matching data and the wiki data’s on the conaonical string name
2. later aggregate fine tuned, checking if multiple same (canonical string) names relate to multiple different persons (we use wd-items (the Q1233242 thing), and wd-item-labels to aggregate on) … aso.
3. save those data tables


In [19]:
if explain_and_show_the_data: print("merge now the matching data and the wiki data’s on the conaonical string name")
    
collectors_matches_tmp_names_abbreviated = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string'
)
collectors_matches_tmp_fullnames = pd.merge(
    collectors_matches_fullname, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string_fullname'
)
collectors_matches_g1_merged_wikidata = pd.concat(
    [collectors_matches_tmp_names_abbreviated, collectors_matches_tmp_fullnames]
    , ignore_index=True
)

merge now the matching data and the wiki data’s on the conaonical string name


In [20]:
# # # # # # # # # # # # # # # # # # # # # # # # # 
# custom grouping, analysing and saving of data:
# # # # # # # # # # # # # # # # # # # # # # # # # 
# ## cell split - markdown
# # Save the plain name matching results only ...
# 
# ## cell split - code
# 
# if not os.path.exists('data'):
#     print("Make data directory for saving …")
#     os.makedirs('data')
# 
# # Set some global varialbes
# # this_timestamp_for_data=time.strftime('%Y%m%d') # 20230719
# this_timestamp_for_data=20231030
# 
# this_output_file='data/results_meise_collectors-eventDate_vs_wikidata-botanists_kneighbor_plain-names_%s.csv' % (
#     this_timestamp_for_data
# )
# 
# collectors_all_matches.to_csv(this_output_file)
# 
# print("Wrote plain name matches of collector names into %s (%d kB)" % 
#     (this_output_file, os.path.getsize(this_output_file) >> 10 ) # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
# )
# 

In [21]:
if explain_and_show_the_data:
    for testname in ['Louis', 'Abbot']:
        print(f"Show some name match examples (e.g. «…{testname}…») …")
        criterion = collectors_matches_g1_merged_wikidata['canonical_string_collector_parsed'].str.contains(testname)
        this_table=collectors_matches_g1_merged_wikidata[criterion][[
            # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
            'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
            'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
            'itemLabel', 
            'canonical_string_fullname', # canonical_string_fullname contains the former itemMatchingLabel
            'wikidata_link',
            'collectors_eventDate_min', 'collectors_eventDate_max',
            'yob', 'yod',            
        ]].sort_values(by=['namematch_distance'], ascending=[True])
        print("# ---------------------------------------------\n# «%s…» as test name, %d collector names contain:" % (testname, criterion.sum()))    
        display(this_table)

Show some name match examples (e.g. «…Louis…») …
# ---------------------------------------------
# «Louis…» as test name, 236 collector names contain:


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,canonical_string_fullname,wikidata_link,collectors_eventDate_min,collectors_eventDate_max,yob,yod
4135,2348,http://www.botanicalcollections.be/specimen/BR...,A.M. Louis,A.M.Louis,0.0000,Adriaan M. Louis,A.M.Louis,http://www.wikidata.org/wiki/Q21338327,1902-11-28,1995-05-04,1944,
26695,4,http://www.botanicalcollections.be/specimen/BR...,H. Louis,H. Louis,0.0000,Herbert Louis Mason,Herbert Louis,http://www.wikidata.org/wiki/Q11925452,1860-08-01,1965-09-01,1896,1994
41614,12,http://www.botanicalcollections.be/specimen/BR...,Louis,Louis,0.0000,Jean Laurent Prosper Louis,Louis,http://www.wikidata.org/wiki/Q5928759,1878-06-01,1915-06-18,1903,1947
41620,9,http://www.botanicalcollections.be/specimen/BR...,Louis-Marie,Louis-Marie,0.0000,Louis-Marie Lalonde,Louis-Marie,http://www.wikidata.org/wiki/Q5981449,1934-08-03,1952-07-01,1896,1978
59526,3,http://www.botanicalcollections.be/specimen/BR...,Simon-Louis,Simon-Louis,0.0000,Léon L. Simon-Louis,Simon-Louis,http://www.wikidata.org/wiki/Q21608924,1908-01-01,1908-01-01,1834,1913
...,...,...,...,...,...,...,...,...,...,...,...,...
75972,1,http://www.botanicalcollections.be/specimen/BR...,Louis Dalmée in V. Lambert,M. Lambert,0.6062,M. Lambert,M. Lambert,http://www.wikidata.org/wiki/Q88838652,1936-06-23,1936-06-23,,
78524,3,http://www.botanicalcollections.be/specimen/BR...,Pierre-Louise Virton in M.-Th. Kerger,Pierre Louis Briot,0.6142,Pierre Louis Briot,Pierre Louis Briot,http://www.wikidata.org/wiki/Q21394994,1991-09-03,1992-08-10,1804,1888
76007,4,http://www.botanicalcollections.be/specimen/BR...,Louis Zoude in Intstituti Carnoy,Carnoy,0.6254,Jean-Baptiste Carnoy,Carnoy,http://www.wikidata.org/wiki/Q423006,1916-06-11,1916-07-31,1836,1899
40865,112,http://www.botanicalcollections.be/specimen/BR...,Le Cesve Raphaël Louis René,Louis,0.6295,Jean Laurent Prosper Louis,Louis,http://www.wikidata.org/wiki/Q5928759,1912-05-20,1939-06-10,1903,1947


Show some name match examples (e.g. «…Abbot…») …
# ---------------------------------------------
# «Abbot…» as test name, 13 collector names contain:


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,canonical_string_fullname,wikidata_link,collectors_eventDate_min,collectors_eventDate_max,yob,yod
4474,1,http://www.botanicalcollections.be/specimen/BR...,A.T.D. Abbott,A. T. D. Abbott,0.0,A. T. D. Abbott,A. T. D. Abbott,http://www.wikidata.org/wiki/Q117328147,2010-01-07,2010-01-07,1936.0,2013.0
4615,2,http://www.botanicalcollections.be/specimen/BR...,Abbott,Abbott,0.0,George Abbott,Abbott,http://www.wikidata.org/wiki/Q47112598,1982-11-16,1982-11-16,,
33170,1,http://www.botanicalcollections.be/specimen/BR...,J. Richard Abbott,J.Richard Abbott,0.0,J. Richard Abbott,J.Richard Abbott,http://www.wikidata.org/wiki/Q18982386,1995-03-22,1995-03-22,1968.0,
53098,2,http://www.botanicalcollections.be/specimen/BR...,R. Abbott,R. Abbott,0.0,R.J. Abbott,Richard Abbott,http://www.wikidata.org/wiki/Q33660683,1995-01-27,1995-06-20,1945.0,2024.0
65644,1,http://www.botanicalcollections.be/specimen/BR...,W.L. Abbott,W. L. Abbott,0.0,William Louis Abbott,William L. Abbott,http://www.wikidata.org/wiki/Q635604,1922-02-26,1922-02-26,1860.0,1936.0
65643,1,http://www.botanicalcollections.be/specimen/BR...,W.L. Abbott,W. L. Abbott,0.0,William Louis Abbott,W. L. Abbott,http://www.wikidata.org/wiki/Q635604,1922-02-26,1922-02-26,1860.0,1936.0
65642,1,http://www.botanicalcollections.be/specimen/BR...,W.L. Abbott,W. L. Abbott,0.0,William Louis Abbott,W L Abbott,http://www.wikidata.org/wiki/Q635604,1922-02-26,1922-02-26,1860.0,1936.0
65641,1,http://www.botanicalcollections.be/specimen/BR...,W.L. Abbott,W. L. Abbott,0.0,William Louis Abbott,W. L. Abbott,http://www.wikidata.org/wiki/Q635604,1922-02-26,1922-02-26,1860.0,1936.0
65645,1,http://www.botanicalcollections.be/specimen/BR...,W.L. Abbott,W. L. Abbott,0.0,William Louis Abbott,William Louis Abbott,http://www.wikidata.org/wiki/Q635604,1922-02-26,1922-02-26,1860.0,1936.0
65322,1,http://www.botanicalcollections.be/specimen/BR...,W.C. Abbott,C. Abbott,0.0766,Cecelia White Abbott,C. Abbott,http://www.wikidata.org/wiki/Q99340892,NaT,NaT,1936.0,2010.0


In [22]:
# ## cell split - markdown
# # Aggregate data to get atomized listings of multiple resource name matches joining by “|” aso.
# ## cell split - code
# 
# print('Group data by canonical names (abbreviated and full name):'
#       ' multiple related WD items (e.g. Q1232456), item labels, year of birth, year of death')
# for i, wd_matching_column in enumerate(['canonical_string', 'canonical_string_fullname']):
#     print('Run %s:   Group by wiki data’s %s, and aggregate/join item(s), labels, yob, yod '
#           'by “…|…”, add new columns “…_joined” ...' % (i + 1, wd_matching_column))
#     wdata_joined_items_and_others = wikidata.groupby([wd_matching_column]).agg(
#         items_joined = ('item', lambda x: '|'.join(x)),
#         item_labels_joined = ('itemLabel', lambda x: '|'.join(x)),
#         yob_joined = ('yob', lambda x: '|'.join([str(s) for s in list(x)]) ),
#         yod_joined = ('yod', lambda x: '|'.join([str(s) for s in list(x)]) )
#     ).reset_index()
# 
#     # print("Done. Show examples of items having multiple matching data «|» … ")
#     # criterion = wdata_joined_items['items'].map(lambda x: '|' in x)
#     # wdata_joined_items[criterion].head()
# 
#     print('Run %s:   Merge all based on namematch_resource_data, add item(s) data ...' % (i + 1))
#     collectors_matches_g2 = pd.merge(
#         collectors_matches_g1_merged_wikidata, wdata_joined_items_and_others,
#         left_on='namematch_resource_data', right_on=wd_matching_column
#         , suffixes=('__wikidata_merge', '__grp_by_items')
#         # append to left-data, right-data only when identical column names occur
#     )
# 
#     print('Run %s:   Build data frame “collectors_matches_group” ...' % (i + 1))
#     collectors_matches_group = collectors_matches_g2 \
#         if i == 0 \
#         else pd.concat([collectors_matches_group, collectors_matches_g2], ignore_index = True)
#     
# print('Done')
# ## cell split - code
# 
# print("Show examples of item_labels_joined having multiple matching data «|» … ")
# criterion = collectors_matches_group['item_labels_joined'].map(lambda x: '|' in x)
# 
# collectors_matches_group[criterion].get([ # empty 
#     # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
#     'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
#     'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
#     # 'canonical_string_fullname', 
#     'item_labels_joined', 'items_joined', 'yob_joined', 'yod_joined'
# ], default="...get: Are data empty or it has probably a wrong named column?")
# ## cell split - code
# 
# # check what columns we have and what we would keep for further analysis and what to drop
# pprint.pprint(collectors_matches_group.columns)
# # from merge: _x would means normally from left column, _y means from right column
# # in BASH fold text long lines; echo "${text}" | fold --spaces | sed 's@^@#  @'
# ## cell split - markdown
# 
# # Prepare data to save later on …
# ## cell split - code
# # Remove superfluous columns TODO check WARNING: A value is trying to be set on a copy of a slice from a DataFrame
# # TODO check duplicates
# collectors_matches_group_simplified = collectors_matches_group.get(
#     ['family', 'given', 'canonical_string_collector_parsed', 
#       'namematch_source_data', # redundant: 'namematch_source_data' == 'canonical_string_collector_parsed'
#       'namematch_resource_data', 'namematch_distance', 
#       'collectors_eventDate_mean', 'collectors_eventDate_min', 'collectors_eventDate_max', # collecors dates
#       'yob_joined', 'yod_joined', # WikiData dates
#       'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 
#       'items_joined', 'canonical_string', 'canonical_string_fullname', 'surname', 'initials', 'item_labels_joined'
#     ], default="...get: Are data empty or it has probably a wrong named column?"
# )
# # collectors_matches_group = collectors_matches_g3
# collectors_matches_group_simplified.sort_values(
#     by=['namematch_distance', 'canonical_string_collector_parsed']
#     , inplace=True
# )
# collectors_matches_group_simplified.drop_duplicates(inplace=True)
# collectors_matches_group_simplified.head()
# ## cell split - code
# 
# # old file meise_collectors_matches_wikidata_items_group_concat_%s.csv
# this_output_file='data/results_meise_collectors-eventDate_vs_wikidata-botanists_kneighbor_wditems_group_concat_wdlabels-joined_%s.csv' % (
#     this_timestamp_for_data
# )
# 
# collectors_matches_group.to_csv(this_output_file)
# 
# print("Wrote groups of collectors matches into %s (%d kB)" % 
#     (this_output_file, os.path.getsize(this_output_file) >> 10 ) # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
# )
# ## cell split - markdown

In [23]:
# # ### Merge Data to Individual WikiData Items
# # 
# # For this, merge by namematch_resource_data and focus to get individual WikiData items.
# 
# ## cell split - code
# 
# print('Merge simply namematch_resource_data to Wiki data for abbreviated and full names... ')
# for i, wd_matching_column in enumerate(['canonical_string', 'canonical_string_fullname']):
# 
#     # join wikidata items to avh collectors matches
#     #   avh_matches = pd.merge(avh, matches, left_on='label', right_on='name')
#     #   avh_matches_t1 = pd.merge(avh_matches, wikidata, left_on='matched_name', right_on='canonical_string')
#     # link counts of wikidata items with same canonical name string
#     #   avh_matches_t2 = pd.merge(avh_matches_t1, wd_test, left_on="matched_name", right_on="canonical_string")
#     #   avh_matches_t2.rename(columns = {list(avh_matches_t2.columns)[-1]: 'dup_count'}, inplace=True)
#     
#     print('Run %s:   Merge all (collectors matches) using namematch_resource_data, add wikidata ...' % (i + 1))
#     collectors_matches_wd1 = pd.merge(
#         collectors_all_matches, wikidata,
#         left_on='namematch_resource_data', right_on=wd_matching_column,
#         suffixes=('__coll_all_matches', '__wd')
#         # append to left-data, right-data only when identical column names occur
#     )
# 
#     print('Run %s:   Build data frame “collectors_matches_with_wdata” ...' % (i + 1))
#     collectors_matches_with_wdata = collectors_matches_wd1 \
#         if i == 0 \
#         else pd.concat([collectors_matches_with_wdata, collectors_matches_wd1], ignore_index=True)
# 
# print('Done')
# ## cell split - code
# 
# pprint.pprint(collectors_matches_with_wdata.columns)
# # echo "${text}" | fold --spaces | sed 's@^@#  @'
# ## cell split - code
# 
# collectors_matches_with_wdata.drop_duplicates(inplace=True)
# display(collectors_matches_with_wdata)
# ## cell split - markdown
# 
# # Save all columns for further analysis
# 
# ## cell split - code
# 
# # old meise_collectors_matches_wikidata-botanists_all-columns_%s.csv
# 
# this_output_file='data/results_meise_collectors-eventDate_vs_wikidata-botanists_kneighbor_names-atomized_all-columns_%s.csv' % (
#     this_timestamp_for_data
# )
# 
# collectors_matches_with_wdata.sort_values(
#     by=['namematch_distance', 'canonical_string_collector_parsed']
#     , inplace=True
# )
# collectors_matches_with_wdata.to_csv(
#     this_output_file, index=False # drop index column
# )
# 
# print("Wrote isolated WikiData items of collector matches into %s (%d kB)" % 
#     (this_output_file, os.path.getsize(this_output_file) >> 10 ) # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
# )
# ## cell split - code
# 
# # TODO meaningful?
# # remove redundant (duplicate (?or empty?)) columns that in any kind are duplicate data (i.e. that we usually do not need)
# # do it by transposing it (https://www.statology.org/pandas-drop-duplicate-columns/)
# compact_df_tmp=collectors_matches_with_wdata.transpose().drop_duplicates().transpose()
# compact_df_tmp.sort_values(
#     by=['namematch_distance', 'canonical_string_collector_parsed']
#     , inplace=True
# )
# 
# # old meise_collectors_matches_wikidata-botanists_all-columns-made-unique_%s.csv
# # results_meise_collectors_vs_wikidata-botanists_kneighbor_names-atomized_all-columns_%s.csv
# this_output_file='data/results_meise_collectors-eventDate_vs_wikidata-botanists_kneighbor_names-atomized_all-columns-compact_%s.csv' % (
#     this_timestamp_for_data
# )
# 
# compact_df_tmp.to_csv(
#     this_output_file, index=False # drop index column
# )
# 
# print("Wrote isolated WikiData items (unique columns) of collector matches into %s (%d kB)" % 
#     (this_output_file, os.path.getsize(this_output_file) >> 10 ) # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
# )

## Write DarwinCore Attribution Output

Here we map table data fields to fields of DarwinCore Attribution (<https://github.com/tdwg/attribution/>, <https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml>) 

### Scoring

Individual scored properties should actually be balanced in such a way that one can simply add up these different property scores; in this case, assessment of the calculated values is still necessary. The problem here with calculation with a distance measure is that we have the opposite of similarity, whose distance can become greater than 1, which must somehow be mapped to a scope of 0 … 1 (or -1 … 0 … 1) (TODO review).

General thoughts: With a score of -1 to 1, it can be assumed that:
* -1 means full devaluation or no agreement
* 1 means full upvoting or agreement, and
* 0 can have several interpretations: it is in between, or no rating possible, or missing values.

### Task to Be Solved in Evaluating the Life Time ~ Rating/Scoring

We have grouped the collection date (evenDate) to the name in the source data, so it may be that for (abbreviated) names, e.g. “Bachmann, F.”, the collection date is valid for *several* personal names, not just one. This must be taken into account when considering and evaluating whether the life data match the collection date. The rating of the life data has the following idea:

| Score (life time) | Remarks | 
|--|--|
| 1.0  | complete match                     |
| 0.5  | somewhat correct, but has errors or mistakes, indicating multiple person names    |
| 0.0   | no evaluation (or not possible) |
| -0.5 | is rather to be rejected, indicating multiple person names and possibly overlapping time spans of the collection date of different person names, or mistakes in the original data |
| -1.0 | completely rejected                |

### Task to Be Solved With Several Names ~ Assessment/Score

Since we do not know if there are other possible names somewhere when there is only one name, we cannot assign a “1” (= full agreement) with certainty, so it was decided that if only 1 name was found, this would be evaluated as zero, in the sense of no evaluation. So when evaluating the multiple names, only the mismatches are evaluated, according to the idea:

| Score (multiple names) | Remarks | 
|--|--|
| 1.0  | this value (=full upvoting or agreement) would never be set in this regard, since we do not know all the full names of the cosmos ;-), and could state this score certainty of 1.0 |
| 0.0 | no evaluation, because only 1 name found | 
| less than 0 | multiple names found, i.e. deduction (perhaps just -0.5, as a decision needs to be made) | 

---

TODO review interpretation:

- the fields are defined in <https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml> and regarding from this DwC-attribution concept: is it correct to map it like the following (`name` would represent the *interpreted* resource name (in long format), not the *source* collector `name` (in (theoretically) long format))?
    ```
    name          ← itemLabel (wikiData)
    alternateName ← canonical_string_collector_parsed (actual collector name)
    ```

In [24]:
# TODO review further evaluation or filtering, counting, clean up aso.
pprint.pprint(collectors_matches_g1_merged_wikidata.columns)

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'source_data', 'occurrenceID_collectors_count',
       'occurrenceID_collectors_firstsample', 'collectors_eventDate_mean',
       'collectors_eventDate_min', 'collectors_eventDate_max', 'old_index',
       'namematch_source_data', 'namematch_resource_data',
       'namematch_distance', 'item', 'itemLabel', 'surname', 'initials',
       'canonical_string', 'canonical_string_fullname', 'orcid', 'viaf',
       'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod',
       'wikidata_link', 'orcid_link', 'harv_link', 'ipni_link',
       'bionomia_link'],
      dtype='str')


In [25]:
# refactor namematch_similarity → namematch_distance
# refactor namematch_similarity_annotation → namematch_distance_annotation
# refactor custom_namematch_similarity → custom_namematch_namematch
# refactor sort_values
# refactor collectors_eventDate_mean
# refactor collectors_eventDate_min
# - refactor yob_is_lt_eventDate_min
# refactor collectors_eventDate_max
# - refactr yod_is_gt_eventDate_max
# refactor custom_score_lifetime            → custom_score_lifetime_data
# refactor custom_score_lifetime_annotation → custom_score_lifetime_data_annotation
# refactor namematch_similarity             → namematch_distance
# refactor namematch_similarity_annotation  → namematch_distance_annotation
# refactor custom_namematch_similarity → custom_namematch_namematch
# refactor sort_values
collectors_wikidata_cossimOrKmeans = collectors_matches_g1_merged_wikidata[
    [
        'canonical_string_collector_parsed', 'family', 'given', 
        'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
        'source_data',
        'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
        'item', 'canonical_string', 'itemLabel',
        'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 
        'collectors_eventDate_mean', 'collectors_eventDate_min', 'collectors_eventDate_max',
        'yob', 'yod',
        # 'wyb'
    ]
].copy()

# order by canonical_string_collector_parsed (actual collector name) (asc)
#   order by similarity (desc) or namematch_distance (asc), 
#     order by number of Wikidata items (asc) and 
#       order by number of collections (desc)
collectors_wikidata_cossimOrKmeans.sort_values(
    by=['canonical_string_collector_parsed', 'namematch_distance', 'family', 'given'], 
    ascending=[True, True, True, True], 
    inplace=True
)

dwcagent_attr_output=collectors_wikidata_cossimOrKmeans.get([
    "occurrenceID_collectors_firstsample", 
    "canonical_string_collector_parsed",
    'family', 'given',
    "namematch_distance", 
    "source_data", 
    "itemLabel", 
    "item",
    "collectors_eventDate_min",
    "collectors_eventDate_max",
    'yob', 'yod'
]).copy().drop_duplicates(ignore_index=True)


dwcagent_attr_output['canonical_string_collector_parsed'] = dwcagent_attr_output['canonical_string_collector_parsed'].astype(object)
dwcagent_attr_output['canonical_string_collector_parsed'] = dwcagent_attr_output['canonical_string_collector_parsed'].replace(
    to_replace=r'([^,]+),\s*(.+)',
    value=r'\\2 \\1',
    regex=True
)
dwcagent_attr_output['namematch_distance_annotation'] = dwcagent_attr_output['namematch_distance'].astype(str).str.replace(r'(.+)', '\\1 (k-means distance)', regex=True)
# dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'namematch_distance_annotation', '', allow_duplicates=True)

dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'life_time_periode', '', allow_duplicates=True)

combine_life_times = lambda this_df: ("%s-%s" % (this_df["yob"], this_df["yod"])).replace(r"<NA>", "?")
dwcagent_attr_output["life_time_periode"]=dwcagent_attr_output.apply(combine_life_times, axis="columns")

# dwcagent_attr_output["life_time_periode"]

years_from_birth_until_first_collection_activity = 10
dwcagent_attr_output["yob_is_lt_eventDate_min"] = dwcagent_attr_output["yob"] + years_from_birth_until_first_collection_activity < dwcagent_attr_output["collectors_eventDate_min"].dt.year
dwcagent_attr_output["yod_is_gt_eventDate_max"] = dwcagent_attr_output["yod"] > dwcagent_attr_output["collectors_eventDate_max"].dt.year
dwcagent_attr_output["custom_score_lifetime_data"] = 0.0
dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'custom_score_lifetime_data_annotation', '', allow_duplicates=True)

import numpy as np

# 1. Variables - Ensure they are converted to standard numpy booleans
# We fill NA with False for the logical check to satisfy numpy.select
y_to_activity = years_from_birth_until_first_collection_activity 

b_ok = dwcagent_attr_output["yob_is_lt_eventDate_min"].fillna(False).astype(bool)
d_ok = dwcagent_attr_output["yod_is_gt_eventDate_max"].fillna(False).astype(bool)

# To still detect the "NA" cases in your conditions, we need the original NA info:
b_isna = dwcagent_attr_output["yob_is_lt_eventDate_min"].isna()
d_isna = dwcagent_attr_output["yod_is_gt_eventDate_max"].isna()

# 2. Updated Conditions (using the sanitized booleans and the NA-checkers)
conditions = [
    (b_ok == True)  & (d_ok == True),   # Both match
    (b_ok == True)  & (d_isna),         # Birth matches, death unknown
    (b_isna)        & (d_ok == True),   # Birth unknown, death matches
    (b_isna)        & (d_isna),         # Both unknown
    (b_ok == True)  & (d_ok == False) & (~d_isna),  # Birth matches, death contradicts
    (b_ok == False) & (~b_isna) & (d_ok == True),   # Birth contradicts, death matches
    (b_ok == False) & (~b_isna) & (d_isna),         # Birth contradicts, death unknown
    (b_isna)        & (d_ok == False) & (~d_isna),  # Birth unknown, death contradicts
    (b_ok == False) & (~b_isna) & (d_ok == False) & (~d_isna) # Both contradict
]

scores = [1.0, 1.0, 1.0, 0.0, 0.5, 0.5, -0.5, -0.5, -1.0]

labels = [
    "full match",
    "OK? year of death is missing",
    "OK? year of birth is missing",
    "unknown life time",
    f"OK yob + {y_to_activity}, but yod not matching, check name and lifetime data",
    f"yob + {y_to_activity} not matching, OK yod, check name and lifetime data",
    f"yob + {y_to_activity} not matching, yod unknown, check name and lifetime data",
    f"yob unknown, yod not matching, check name and lifetime data",
    f"life time not matching any eventDate (yob + {y_to_activity} … yod)"
]

# 3. Apply Scoring and Annotations
dwcagent_attr_output["custom_score_lifetime_data"] = np.select(conditions, scores, default=0.0)
dwcagent_attr_output["custom_score_lifetime_data_annotation"] = np.select(conditions, labels, default="check data consistency")

# 4. Handle Ambiguity
is_duplicated = dwcagent_attr_output['canonical_string_collector_parsed'].duplicated(keep=False)
dwcagent_attr_output["custom_score_multiple_names"] = np.where(is_duplicated, -0.5, 0.0)

# 5. Overall Score Calculation
namematch_distance_max = dwcagent_attr_output['namematch_distance'].max()
similarity = 1.0 - (dwcagent_attr_output['namematch_distance'] / namematch_distance_max) if namematch_distance_max > 0 else 1.0

reliability_subtotal = dwcagent_attr_output[["custom_score_lifetime_data", "custom_score_multiple_names"]].mean(axis=1)
dwcagent_attr_output['custom_score_overall'] = (similarity * reliability_subtotal).round(3)
# print("Overall score calculation completed.")

# 1. Convert numbers to strings (once for the entire column)
score_overall = dwcagent_attr_output["custom_score_overall"].round(2).astype(str)
score_life = dwcagent_attr_output["custom_score_lifetime_data"].round(1).astype(str)
score_multi = dwcagent_attr_output["custom_score_multiple_names"].round(2).astype(str)

# 2. Vectorized merge
dwcagent_attr_output['attributionRemarks'] = (
    dwcagent_attr_output['namematch_distance_annotation'] + "; " +
    score_overall + " (score overall); " +
    dwcagent_attr_output["life_time_periode"] + " (life time); " +
    score_life    + " (life time score); " +
    dwcagent_attr_output["custom_score_lifetime_data_annotation"] + " (life time score note); " +
    score_multi   + " (score multiple names);"
)

# adjust dwcagent displayOrder also to olerall score
dwcagent_attr_output.sort_values(
    # by=['namematch_distance', 'family', 'given', 'custom_score_overall'], 
    # ascending=[True, True, True, False], 
    by=['canonical_string_collector_parsed', 'custom_score_overall', 'family', 'given'], 
    ascending=[True, False, True, True], 
    inplace=True
)
# use ordered canonical_string_collector_parsed to generate displayOrder
temp_duplicated = dwcagent_attr_output['canonical_string_collector_parsed'].duplicated() 
    # duplicated() keeps the first value False and mark all other duplicats as True, i.e. we can cumulate the Trues, it gives the order index
temp_insert_value=temp_duplicated.groupby(dwcagent_attr_output['canonical_string_collector_parsed']).cumsum() + 1 # display order starts at 1, incrementing
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('canonical_string_collector_parsed') + 1, 'displayOrder', temp_insert_value, allow_duplicates=True)

# test an show example data
if explain_and_show_the_data:
    print("example data: names having year of birth (yob) < enventDate_min")
    display(dwcagent_attr_output.loc[dwcagent_attr_output['yob_is_lt_eventDate_min'] == True].get([
        # "occurrenceID_collectors_firstsample",
        "canonical_string_collector_parsed",
        'itemLabel',
        "custom_score_overall",
        "attributionRemarks",
        'custom_score_multiple_names',
        "namematch_distance",
        # 'yob', 'yod',
        "life_time_periode", 
        'collectors_eventDate_min', 'collectors_eventDate_max', 
        "yob_is_lt_eventDate_min" ,'yod_is_gt_eventDate_max', 
        'custom_score_lifetime_data', 'custom_score_lifetime_data_annotation'
    ]).head(5))
    print("example data: names having year of birth (yob) > eventDate_min")
    display(dwcagent_attr_output.loc[dwcagent_attr_output['yob_is_lt_eventDate_min'] == False].get([
        # "occurrenceID_collectors_firstsample",
        "canonical_string_collector_parsed",
        'itemLabel',
        "custom_score_overall",
        "attributionRemarks",
        'custom_score_multiple_names',
        "namematch_distance",
        # 'yob', 'yod',
        "life_time_periode", 
        'collectors_eventDate_min', 'collectors_eventDate_max', 
        "yob_is_lt_eventDate_min" ,'yod_is_gt_eventDate_max', 
        'custom_score_lifetime_data', 'custom_score_lifetime_data_annotation'
    ]).head(5))


example data: names having year of birth (yob) < enventDate_min


Unnamed: 0,canonical_string_collector_parsed,itemLabel,custom_score_overall,attributionRemarks,custom_score_multiple_names,namematch_distance,life_time_periode,collectors_eventDate_min,collectors_eventDate_max,yob_is_lt_eventDate_min,yod_is_gt_eventDate_max,custom_score_lifetime_data,custom_score_lifetime_data_annotation
1,A Aux,William Hensel Auxer,0.12,0.5216 (k-means distance); 0.12 (score overall...,0.0,0.5216,1867-1953,1967-06-01,1968-06-02,True,False,0.5,"OK yob + 10, but yod not matching, check name ..."
8,A'd,Andrew Adie Dalglish,0.404,0.1923 (k-means distance); 0.4 (score overall)...,0.0,0.1923,1868-1924,1896-08-06,1896-08-06,True,True,1.0,full match
10,A-J,Alice J. Heading,0.186,0.2549 (k-means distance); 0.19 (score overall...,0.0,0.2549,1852-1945,1948-07-01,1948-07-01,True,False,0.5,"OK yob + 10, but yod not matching, check name ..."
15,A. A. Sagástegui,Abundio Sagástegui Alva,0.487,0.026 (k-means distance); 0.49 (score overall)...,0.0,0.026,1932-2012,1981-09-13,1986-07-12,True,True,1.0,full match
22,A. Abdezábal,Anita Hoffmann,0.225,0.5509 (k-means distance); 0.22 (score overall...,0.0,0.5509,1919-2007,1992-09-08,1992-09-08,True,True,1.0,full match


example data: names having year of birth (yob) > eventDate_min


Unnamed: 0,canonical_string_collector_parsed,itemLabel,custom_score_overall,attributionRemarks,custom_score_multiple_names,namematch_distance,life_time_periode,collectors_eventDate_min,collectors_eventDate_max,yob_is_lt_eventDate_min,yod_is_gt_eventDate_max,custom_score_lifetime_data,custom_score_lifetime_data_annotation
0,A,Anna Atkins,-0.365,0.2696 (k-means distance); -0.36 (score overal...,0.0,0.2696,1799-1871,1809-01-01,2000-07-14,False,False,-1.0,life time not matching any eventDate (yob + 10...
2,A Funk,Alvin Funk,0.25,0.0 (k-means distance); 0.25 (score overall); ...,0.0,0.0,1925-2010,NaT,NaT,False,True,0.5,"yob + 10 not matching, OK yod, check name and ..."
3,A Isaac Holden,Frances Margaret Leighton,0.177,0.2921 (k-means distance); 0.18 (score overall...,0.0,0.2921,1909-2006,1892-09-01,1892-09-01,False,True,0.5,"yob + 10 not matching, OK yod, check name and ..."
4,A Jablonski,Eugene Jablonszky,0.225,0.0992 (k-means distance); 0.22 (score overall...,0.0,0.0992,1892-1975,NaT,NaT,False,True,0.5,"yob + 10 not matching, OK yod, check name and ..."
7,A Kühlewein,Alexander Eberhard von Kühlewein,0.223,0.1078 (k-means distance); 0.22 (score overall...,0.0,0.1078,1791-1888,NaT,NaT,False,True,0.5,"yob + 10 not matching, OK yod, check name and ..."


In [26]:
column_map_dwcagent_attr = {
    'occurrenceID_collectors_firstsample':'occurrenceID',
    'canonical_string_collector_parsed':  'alternateName',
    'source_data':                        'verbatimName',
    'itemLabel':                          'name',
    'item':                               'identifier',
    'collectors_eventDate_min':           'startedAtTime',
    'collectors_eventDate_max':           'endedAtTime',
    'namematch_distance':                 'custom_namematch_distance'
}
dwcagent_attr_output.rename(
    mapper=column_map_dwcagent_attr,
    axis='columns',
    inplace=True)

dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('identifier')          + 1, 'agentIdentifierType', 'wikidata' , allow_duplicates=True)
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('agentIdentifierType') + 1, 'agentType'          , 'Person'   , allow_duplicates=True)
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('identifier')          + 1, 'action'             , 'collected', allow_duplicates=True)

if explain_and_show_the_data:
    print("the mapped DarwinCore attribution output examples, sorted by alternateName (=collector name) + displayOrder …")
    display(dwcagent_attr_output.head(20))

dwcagent_attr_output=dwcagent_attr_output.reindex(
    columns=[
        'occurrenceID', # no DwC agent standard (yet)?
        'verbatimName',
        'alternateName',
        'displayOrder', # shall start from 1, 2, 3 …
        'name',
        'attributionRemarks',
        'startedAtTime',
        'endedAtTime',
        'agentType',
        'action',
        'agentIdentifierType',
        'identifier',
        "custom_score_overall", # keep it for calculation convenience, no standard in DwC agent
        'custom_namematch_distance',# keep it for calculation convenience, no standard in DwC agent
        'custom_score_multiple_names',# keep it for calculation convenience, no standard in DwC agent
        'custom_score_lifetime_data' # keep it for calculation convenience, no standard in DwC agent
    ]
)
# column deletion not neccessary after ….reindex(columns=[…])
# for this_column in ['yob', 'yod', 'life_time_periode', 'yob_is_lt_eventDate_min', 'yod_is_gt_eventDate_max', 'score_lifetime_annotation']:
#     del dwcagent_attr_output[this_column]

the mapped DarwinCore attribution output examples, sorted by alternateName (=collector name) + displayOrder …


Unnamed: 0,occurrenceID,alternateName,displayOrder,family,given,custom_namematch_distance,verbatimName,name,identifier,action,...,yod,namematch_distance_annotation,life_time_periode,yob_is_lt_eventDate_min,yod_is_gt_eventDate_max,custom_score_lifetime_data,custom_score_lifetime_data_annotation,custom_score_multiple_names,custom_score_overall,attributionRemarks
0,http://www.botanicalcollections.be/specimen/BR...,A,1,A,,0.2696,"Church, A.C., Ismail, Ruskandi, A",Anna Atkins,http://www.wikidata.org/entity/Q264269,collected,...,1871.0,0.2696 (k-means distance),1799-1871,False,False,-1.0,life time not matching any eventDate (yob + 10...,0.0,-0.365,0.2696 (k-means distance); -0.36 (score overal...
1,http://www.botanicalcollections.be/specimen/BR...,A Aux,1,Aux,A,0.5216,[]a[] []aux,William Hensel Auxer,http://www.wikidata.org/entity/Q136543112,collected,...,1953.0,0.5216 (k-means distance),1867-1953,True,False,0.5,"OK yob + 10, but yod not matching, check name ...",0.0,0.12,0.5216 (k-means distance); 0.12 (score overall...
2,http://www.botanicalcollections.be/specimen/BR...,A Funk,1,Funk,A,0.0,a Funk,Alvin Funk,http://www.wikidata.org/entity/Q21337490,collected,...,2010.0,0.0 (k-means distance),1925-2010,False,True,0.5,"yob + 10 not matching, OK yod, check name and ...",0.0,0.25,0.0 (k-means distance); 0.25 (score overall); ...
3,http://www.botanicalcollections.be/specimen/BR...,A Isaac Holden,1,Isaac Holden,A,0.2921,a) Isaac Holden;b) W.A. Setchell,Frances Margaret Leighton,http://www.wikidata.org/entity/Q5864685,collected,...,2006.0,0.2921 (k-means distance),1909-2006,False,True,0.5,"yob + 10 not matching, OK yod, check name and ...",0.0,0.177,0.2921 (k-means distance); 0.18 (score overall...
4,http://www.botanicalcollections.be/specimen/BR...,A Jablonski,1,Jablonski,A,0.0992,a Jablonski,Eugene Jablonszky,http://www.wikidata.org/entity/Q3059563,collected,...,1975.0,0.0992 (k-means distance),1892-1975,False,True,0.5,"yob + 10 not matching, OK yod, check name and ...",0.0,0.225,0.0992 (k-means distance); 0.22 (score overall...
5,http://www.botanicalcollections.be/specimen/BR...,A Kuhlew,1,Kuhlew,A,0.3752,Dr. a Kuhlew.,Ralph Kuhlenkamp,http://www.wikidata.org/entity/Q36665275,collected,...,,0.3752 (k-means distance),?-?,,,0.0,unknown life time,0.0,0.0,0.3752 (k-means distance); 0.0 (score overall)...
6,http://www.botanicalcollections.be/specimen/BR...,A Kuhlewein,1,Kuhlewein,A,0.458,Dr. a Kuhlewein,Ralph Kuhlenkamp,http://www.wikidata.org/entity/Q36665275,collected,...,,0.458 (k-means distance),?-?,,,0.0,unknown life time,0.0,0.0,0.458 (k-means distance); 0.0 (score overall);...
7,http://www.botanicalcollections.be/specimen/BR...,A Kühlewein,1,Kühlewein,A,0.1078,Dr. a Kühlewein,Alexander Eberhard von Kühlewein,http://www.wikidata.org/entity/Q118869465,collected,...,1888.0,0.1078 (k-means distance),1791-1888,False,True,0.5,"yob + 10 not matching, OK yod, check name and ...",0.0,0.223,0.1078 (k-means distance); 0.22 (score overall...
8,http://www.botanicalcollections.be/specimen/BR...,A'd,1,A'd,,0.1923,A'[D],Andrew Adie Dalglish,http://www.wikidata.org/entity/Q114055421,collected,...,1924.0,0.1923 (k-means distance),1868-1924,True,True,1.0,full match,0.0,0.404,0.1923 (k-means distance); 0.4 (score overall)...
9,http://www.botanicalcollections.be/specimen/BR...,A'hlbon,1,A'hlbon,,0.6485,A'hlbon,Annette Hladik,http://www.wikidata.org/entity/Q113656903,collected,...,,0.6485 (k-means distance),?-?,,,0.0,unknown life time,0.0,0.0,0.6485 (k-means distance); 0.0 (score overall)...


In [27]:
if explain_and_show_the_data:
    print("show column-reduced examples of ?multiple name cases …")
    # criterion = dwcagent_attr_output['alternateName'].map(lambda x: x.startswith('S. Ahmad'))
    criterion = dwcagent_attr_output['custom_score_multiple_names'].map(lambda this_score: this_score < 0 )  # show multiple name cases
    
    # display(dwcagent_attr_output[criterion].head(20))
    display(dwcagent_attr_output[criterion].drop(['agentType', 'action', 'agentIdentifierType'], axis='columns').head(20))

show column-reduced examples of ?multiple name cases …


Unnamed: 0,occurrenceID,verbatimName,alternateName,displayOrder,name,attributionRemarks,startedAtTime,endedAtTime,identifier,custom_score_overall,custom_namematch_distance,custom_score_multiple_names,custom_score_lifetime_data
12,http://www.botanicalcollections.be/specimen/BR...,"Y.S. Abeid, A. Hernández A., H. Katandasha & B...",A. A. Hernández,1,Antonio Hernández,0.0436 (k-means distance); -0.24 (score overal...,2002-12-17,2002-12-17,http://www.wikidata.org/entity/Q21516463,-0.239,0.0436,-0.5,0.0
13,http://www.botanicalcollections.be/specimen/BR...,"Y.S. Abeid, A. Hernández A., H. Katandasha & B...",A. A. Hernández,2,Alexandra Hernández,0.0436 (k-means distance); -0.24 (score overal...,2002-12-17,2002-12-17,http://www.wikidata.org/entity/Q36503191,-0.239,0.0436,-0.5,0.0
20,http://www.botanicalcollections.be/specimen/BR...,A. Abbas,A. Abbas,1,Alia Abbas,0.0 (k-means distance); -0.25 (score overall);...,1869-05-04,2001-08-10,http://www.wikidata.org/entity/Q60141229,-0.25,0.0,-0.5,0.0
21,http://www.botanicalcollections.be/specimen/BR...,A. Abbas,A. Abbas,2,Abdulla Abbas,0.0 (k-means distance); -0.25 (score overall);...,1869-05-04,2001-08-10,http://www.wikidata.org/entity/Q88804360,-0.25,0.0,-0.5,0.0
58,http://www.botanicalcollections.be/specimen/BR...,Anderson A.,A. Anderson,1,Andrew Anderson,0.0 (k-means distance); 0.0 (score overall); 1...,NaT,NaT,http://www.wikidata.org/entity/Q123652849,0.0,0.0,-0.5,0.5
59,http://www.botanicalcollections.be/specimen/BR...,Anderson A.,A. Anderson,2,Alexander Anderson,0.0 (k-means distance); 0.0 (score overall); 1...,NaT,NaT,http://www.wikidata.org/entity/Q4718216,0.0,0.0,-0.5,0.5
60,http://www.botanicalcollections.be/specimen/BR...,"A. Andersson, J. Lå[uzzat]",A. Andersson,1,Axel Andersson,0.0 (k-means distance); -0.5 (score overall); ...,1869-06-23,1981-08-23,http://www.wikidata.org/entity/Q123652899,-0.5,0.0,-0.5,-0.5
62,http://www.botanicalcollections.be/specimen/BR...,"A. Andersson, J. Lå[uzzat]",A. Andersson,2,I. Anita Andersson,0.0 (k-means distance); -0.5 (score overall); ...,1869-06-23,1981-08-23,http://www.wikidata.org/entity/Q21505194,-0.5,0.0,-0.5,-0.5
61,http://www.botanicalcollections.be/specimen/BR...,"A. Andersson, J. Lå[uzzat]",A. Andersson,3,Karl Alfred Andersson,0.0 (k-means distance); -0.75 (score overall);...,1869-06-23,1981-08-23,http://www.wikidata.org/entity/Q131724787,-0.75,0.0,-0.5,-1.0
105,http://www.botanicalcollections.be/specimen/BR...,Aubert A.,A. Aubert,1,Gustave Aubert,0.0734 (k-means distance); 0.0 (score overall)...,NaT,NaT,http://www.wikidata.org/entity/Q21505424,0.0,0.0734,-0.5,0.5


In [28]:
# TODO further evaluation or filtering, counting, clean up aso.
if not os.path.exists('data'):
    os.makedirs('data')

# this_timestamp_for_data=time.strftime('%Y%m%d') # 20230719
# this_timestamp_for_data=20231116
this_output_file='data/results_meise_collectors-eventDate_vs_wikidata-botanists_kneighbor_dwc-agent-output_%s.csv' % (
    this_timestamp_for_data
)

dwcagent_attr_output.to_csv(this_output_file, index=False)

print("Wrote matches of collector names as dwc-agent-output into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
)

Wrote matches of collector names as dwc-agent-output into data/results_meise_collectors-eventDate_vs_wikidata-botanists_kneighbor_dwc-agent-output_20260210.csv (28467 kB)


## Documentation

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nick name from name parsing
appellation | appellation from name parsing
title | title from name parsing
eventDate | date of the sampling event (required by GBIF, ☞ https://www.gbif.org/data-quality-requirements-sampling-events)
eventDate_min | calculated earliest date of all the sampling events within the data
eventDate_max | calculated latest date of all the sampling events within the data
eventDate_mean | calculated mean date of all the sampling events within the data
TODO activity_span | Number of years between first and last collection
**Name matching** |
nammatch_collector | matched name of the data set
nammatch_wikidata | matched name; = Wikidata item label name is matched to
namematch_distance | Nearest Neighbour distance between the name and matched name; the lower the value, the better the match
**DarwinCore Agent Output** | (☞ [agent_actions_v2020-09-08.xml](https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml))
occurrenceID | occurrence ID of the data item
name | the interpreted name match (https://github.com/tdwg/attribution/ The name of the item. In this case the *full name* as would be written on a legal document (without abbreviation), eg givenName familyName)
verbatimName | the source data name(s) (https://github.com/tdwg/attribution/ As written on occurrence, such as the collection or determination label.)
alternateName | the input name, collector source name (An alias for the item. Other full name agent may have been known under such as maiden name.)
displayOrder | I guess ordering the multiple name cases (https://github.com/tdwg/attribution/ The display order for the agent that executed the action when more than one agent was a participant.)
attributionRemarks | notes on the results (distance or similarity), including calculated value
agentType | The nature of the agent, e.g. "Person", "Organization", "SoftwareApplication"
action | The name of the single action written as a verb in past tense. Recommended best practice is to use a controlled vocabulary, examples "collected" or "identified"
agentIdentifierType | The type of identifier for the agent. (https://github.com/tdwg/attribution/ Recommended best practice is to use a controlled vocabulary, e.g. “ORCID”, “ISNI”, “Wikidata”, “VIAF”, “RoR”, “Ringgold”, “GRID”).
identifier | Wikidata ID (Recommended practice is to identify the resource by means of a string conforming to an identification system. Examples include International Standard Book Number (ISBN), Digital Object Identifier (DOI), and Uniform Resource Name (URN). Persistent identifiers should be provided as HTTP URIs.)
startedAtTime | (https://github.com/tdwg/attribution/ Start is when an action is deemed to have been started by an agent.) the first date of eventDate (supposedly the first sampling date), but grouped from collector name—in case of multiple name matches this first “sampling date” is less reliable and be reliable to relate to the source collector’s life time.
endedAtTime | (https://github.com/tdwg/attribution/ End is when an action is deemed to have been ended by an agent.) the last date of eventDate (supposedly the last sampling date), but grouped from collector name—in case of multiple name matches this first “sampling date” is less reliable and be reliable to relate to the source collector’s life time.
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname	| Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P496))	
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob	| Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod	| Year of death (derived from [P496](https://www.wikidata.org/wiki/Property:P570))
wyb	| Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))

Refactoring from <https://github.com/nielsklazenga/avh-collectors/blob/master/match_names_to_wikidata_items.ipynb>

AVH | collector_matching (here)
-|-
avh_matches | collectors_all_matches
wd_test | wd_matchtest