# Match Naturalis Collectors to Wikidata Items Using *Nearest Neighbour*

In this example we add the `eventDate` of the source data (the date on which the sample was collected) as a time reference for when the collector was alive.

Basically, we attempt to match the `canonical_string` of Wikidata against the `canonical_string` of the collectors. The collector names were parsed into single names beforehand using <https://libraries.io/rubygems/dwc_agent>.

TODO:

- review code: match when the full collector name is given instead of an abbreviated collector name
- review code: evaluate the case where multiple names (Wikidata or collector data) are found
- also match the period of work (Wikidata) against the creation time of the herbarium sheet (if no other lifetime data are available)
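
The time reference mentioned above could be used as a plausibility check along the following lines. This is only a minimal sketch: the function name `collected_during_lifetime` is hypothetical, and it assumes the collection year is compared against the `yob` and `yod` columns (year of birth and year of death) of the Wikidata data set. The example values are those of Hans Hermann Behr (1818 to 1904) from the Wikidata table in this notebook.

```python
import pandas as pd


def collected_during_lifetime(event_year, yob, yod):
    """Return True or False, or None when the comparison is not possible.

    Hypothetical helper: event_year is the year of collection,
    yob and yod are the year of birth and year of death.
    """
    if pd.isna(event_year) or pd.isna(yob):
        return None  # nothing to compare against
    if event_year < yob:
        return False  # collected before the person was born
    if pd.notna(yod) and event_year > yod:
        return False  # collected after the person died
    return True


print(collected_during_lifetime(1850, 1818.0, 1904.0))
print(collected_during_lifetime(1950, 1818.0, 1904.0))
print(collected_during_lifetime(1850, float("nan"), 1904.0))
```

A missing year of death is treated as "still open" here, so only the year of birth limits the check in that case.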

### Load Wikidata Data Set

Use Jupyter Notebook [create_wikidata_datasets_botanists.ipynb](./create_wikidata_datasets_botanists.ipynb) to generate matching data of botanists.

Now load the data and make them unique …

In [1]:
import pandas as pd
import pprint, time, os
wikidata = pd.read_csv("data/wikidata_persons_botanists_20230703_1352.csv", index_col=0, low_memory=False)

pprint.pprint(wikidata.columns)
display(wikidata.head())

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.","Bieberstein, Friedrich August Marschall von",,43340073,0000 0001 1630 5464,1373,...,Q66612,1768.0,1826.0,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.","Behr, Hans Hermann",,20328622,0000 0001 1604 8680,42741,...,Q66934,1818.0,1904.0,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.","Schäffer, Jacob Christian",,47016953,0000 0000 8343 3899,1101,...,,1718.0,1790.0,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.","Klotzsch, Johann Friedrich",,20426762,0000 0001 1749 2732,135,...,Q67003,1805.0,1860.0,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.","Menge, Franz Anton",,59847236,0000 0001 1653 0899,73782,...,,1808.0,1880.0,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,


In [2]:
# Create data frame with unique canonical strings
# group by canonical name/string, count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()

wd_matchtest

# colls = list(wikidata.columns)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O.H.",1
1,"(1835-1906), G.A.F.E.",1
2,"(1873-1926), S.S.",1
3,"(1888–1973), G.A.",1
4,"(1904-1990), J.J.",1
...,...,...
61479,"Șerbanescu, I.",1
61480,"Ștefureac, T.",1
61481,"Țopa, E.",1
61482,"Ḥalwaǧī, R.",1
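
The two header rows in the output above come from passing a dictionary of lists to `agg`, which produces MultiIndex columns. As a sketch of an alternative, pandas named aggregation yields a single, flat header. The toy data and the column name `item_count` are assumptions for illustration; the second `Behr, H.H.` row is made up to show a duplicated name.

```python
import pandas as pd

# toy stand-in for the Wikidata frame; the second "Behr, H.H." item is hypothetical
toy_wikidata = pd.DataFrame({
    "item": ["Q66934", "Q-hypothetical", "Q66322"],
    "canonical_string": ["Behr, H.H.", "Behr, H.H.", "Menge, F.A."],
})

# named aggregation: output column name = (input column, aggregation function)
name_counts = (
    toy_wikidata
    .groupby("canonical_string")
    .agg(item_count=("item", "count"))
    .reset_index()
)

print(name_counts)
# names that belong to more than one Wikidata item
print(name_counts[name_counts["item_count"] > 1])
```

With flat columns, ambiguous names can be selected directly by `item_count`, which may help with the TODO about multiple names.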


In [3]:
wd_matchtest_fullnames = wikidata.groupby('canonical_string_fullname').agg({'item': ['count']}).reset_index()

wd_matchtest_fullnames


Unnamed: 0_level_0,canonical_string_fullname,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O Heylen",1
1,"(1835-1906), Gustav Adolf Ferdinand Eichler",1
2,"(1873-1926), Søren Sørensen",1
3,"(1888–1973), Georges André",1
4,"(1904-1990), Johannes Johannessen",1
...,...,...
63605,"Șerbanescu, Ioan",1
63606,"Ștefureac, Traian",1
63607,"Țopa, Emilian",1
63608,"Ḥalwaǧī, Riyāḍ",1


### Load Collectors Data Set

Data sources:

- Jupyter Notebook for [create_naturalis_gbif-occurrence_collectors_eventDate_dataset.ipynb](./create_naturalis_gbif-occurrence_collectors_eventDate_dataset.ipynb)

Then parse the collector names into single, separate names using `dwcagent`, a Ruby gem available at <https://rubygems.org/gems/dwc_agent>:

- see [./bin/README.md](bin/README.md) on how to use the Ruby script `./bin/agent_parse4tsv.rb` for parsing text lines such as `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`


In [4]:
# atomized names were already parsed by the ruby gem package dwcagent;
# the same name can still occur across multiple rows;
# it’s probably better for the matching to make the name rows unique later on

# collectors = pd.read_csv("data/naturalis_collectors_20230510_1429_single-line_parsed_unique_names.tab", sep="\t")
collectors = pd.read_csv("./data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913_parsed.tsv", sep="\t", low_memory=False)

collectors.dropna(subset=['family'], inplace=True) # remove rows where family is NA, e.g. originally «??» and the like

# Error seen otherwise: "Out of bounds nanosecond timestamp: 1652-01-01T00:00:00"
#  because of the nanosecond date-range limitation of pandas, see https://stackoverflow.com/a/69507200/1240387
#  workaround: use datetime or pd.Period(…)
print("modify time using pd.Period(…) to get it to work on very old dates as well...")
for col in ['eventDate_mean', 'eventDate_min', 'eventDate_max']:
    print("- convert", col, "to pd.Period(...) in collectors")
    collectors[col] = collectors[col].apply(lambda x: pd.Period(x, freq='ms'))
print("done modifying")

collectors.sort_values(by=['family', 'given','occurrenceID_firstsample'], inplace=True)
collectors

modify time using pd.Period(…) to get it to work on very old dates as well...
- convert eventDate_mean to pd.Period(...) in collectors
- convert eventDate_min to pd.Period(...) in collectors
- convert eventDate_max to pd.Period(...) in collectors
done modifying


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
81414,A,,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1899-08-07 00:00:00.000,1899-08-07 00:00:00.000,1899-08-07 00:00:00.000
170656,A,,,,,,,,5,https://data.biodiversitydata.nl/naturalis/spe...,1981-12-26 00:00:00.000,1981-03-20 00:00:00.000,1983-05-18 00:00:00.000
163328,A,,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
52199,A,,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1997-02-01 00:00:00.000,1997-02-01 00:00:00.000,1997-02-01 00:00:00.000
136326,A,,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1996-07-07 00:00:00.000,1996-07-07 00:00:00.000,1996-07-07 00:00:00.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
178779,Štepánek,J.,,,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1998-11-06 12:00:00.000,1992-05-15 00:00:00.000,2005-04-30 00:00:00.000
85001,Štepánek,J.,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1983-05-09 00:00:00.000,1983-05-09 00:00:00.000,1983-05-09 00:00:00.000
178764,Štepánek,J.,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,2003-06-01 00:00:00.000,2003-06-01 00:00:00.000,2003-06-01 00:00:00.000
62174,Štepánek,J.,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1978-05-17 00:00:00.000,1978-05-17 00:00:00.000,1978-05-17 00:00:00.000
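
The date workaround used in the cell above can be tried in isolation. The sketch below assumes nothing beyond pandas itself: nanosecond-resolution timestamps start in the year 1677, whereas a `pd.Period` with millisecond frequency can represent the date 1652-01-01 from the error message.

```python
import pandas as pd

# lower bound of nanosecond-resolution timestamps
print("earliest nanosecond timestamp:", pd.Timestamp.min)

# a Period with millisecond frequency can hold the older date
old_period = pd.Period("1652-01-01T00:00:00", freq="ms")
print("as period:", old_period, "year:", old_period.year)

# periods of the same frequency can be compared like dates
is_older = old_period < pd.Period("1899-08-07", freq="ms")
print("older than 1899-08-07:", is_older)

# missing values become NaT, as in the table above
missing = pd.Period(float("nan"), freq="ms")
print("missing value:", missing)
```

Comparisons and `min`/`max` aggregations keep working on such periods, which is what the later grouping of event dates relies on.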


#### Check Composition of Parsed Collector Data

In [5]:
# TODO review code of abbreviated names and full name matching
criterion_fullnames = collectors.given.str.contains(r'^\w{3,}', na=False)
print("Show collectors whose given name is (probably) a full name (%s records) …" % len(collectors[criterion_fullnames].index))
collectors[criterion_fullnames]

Show collectors whose given name is (probably) a full name (6730 records) …


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
133764,A-M-V-J,Renier,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
71597,A-ts'ai,Hsieh,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000
154401,A. Kneucker T,Stuckert,,in,,,,,4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000
78904,AFle,Jolis,,,,,,,420,https://data.biodiversitydata.nl/naturalis/spe...,1860-07-06 19:47:47.797,1800-01-01 00:00:00.000,1983-10-04 00:00:00.000
12995,Aaaa,Bellynck,,,,,,,6,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...
85287,Zwaan Jp,Kleiweg,,de,,,,,10,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
67933,d'Alleizette,Herb,,,,,,,32,https://data.biodiversitydata.nl/naturalis/spe...,1910-01-14 11:04:36.924,1901-11-01 00:00:00.000,1920-05-13 00:00:00.000
17082,d'Anty,Bons,,,,,,,3,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
124795,dePoicy,Pirey,,,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1919-02-01 00:00:00.000,1919-02-01 00:00:00.000,1919-02-01 00:00:00.000
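
To see what the full-name criterion above selects, here is a small sketch on a hand-made sample. The values are taken from the `given` column shown in this notebook; the variable names are assumptions. The pattern `^\w{3,}` only tests whether the string starts with at least three word characters.

```python
import pandas as pd

given_sample = pd.Series(["J.", "Renier", "H. A. van der", None, "Mdjm"])

# True where the given name starts with three or more word characters;
# na=False turns missing given names into False
looks_like_full_name = given_sample.str.contains(r"^\w{3,}", na=False)

for given, flag in zip(given_sample, looks_like_full_name):
    print(repr(given), "->", flag)
```

Note that run-together initials such as `Mdjm` also pass the test, so the criterion is a heuristic rather than a reliable full-name detector.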


In [6]:
# check whether the parsed columns are empty or need to be considered as data for the matching
for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
    test_collectors = collectors.loc[collectors[parsed_name_part].notna()]
    print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
    display(test_collectors.head())


----------------------------------------
show names with **particle** found 4006 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
154401,A. Kneucker T,Stuckert,,in,,,,,4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000
47,Aa,H. A. van der,,van,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1962-07-07 00:00:00.000,1962-07-07 00:00:00.000,1962-07-07 00:00:00.000
57,Aalst,Mdjm,,van,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1978-12-31 00:00:00.000,1975-06-01 00:00:00.000,1982-08-01 00:00:00.000
114071,Aaron,Native,,boy,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1912-03-01 00:00:00.000,1912-03-01 00:00:00.000,1912-03-01 00:00:00.000
4933,Abdilah,Rasit,,bin,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,2000-10-16 00:00:00.000,2000-10-16 00:00:00.000,2000-10-16 00:00:00.000



----------------------------------------
show names with **suffix** found 22 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
178054,Bakker,Zinderen,Sr.,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1965-01-31 00:00:00.000,1965-01-31 00:00:00.000,1965-01-31 00:00:00.000
58839,Gradstein,,SR,van,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1980-05-08 00:00:00.000,1980-05-08 00:00:00.000,1980-05-08 00:00:00.000
58837,Gradstein,,SR,van,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1973-11-01 00:00:00.000,1973-11-01 00:00:00.000,1973-11-01 00:00:00.000
84666,Leopold,King,III,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
150830,Maurit,Flora,II,,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1900-03-14 00:00:00.000,1900-03-14 00:00:00.000,1900-03-14 00:00:00.000



----------------------------------------
show names with **dropping_particle** found 0 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max



----------------------------------------
show names with **appellation** found 1 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
17782,McCullogh,,,,,,Mrs,,34,https://data.biodiversitydata.nl/naturalis/spe...,1975-01-30 09:10:35.294,1975-01-30 00:00:00.000,1975-01-31 00:00:00.000


Compile and compose the `canonical_string…` of the collector data, which we will later match against the Wikidata names:

In [7]:
collectors['canonical_string_collector_parsed'] = (
  # use collectors.family only where given name has NA values, otherwise use family name + given name
  collectors.family.where(
      # condition
      collectors.given.isna(),
      # any other
      # TODO improve the combined name for canonical_string_collector_parsed if any of the other dwc_parsed fields is not NaN
      # NOTE: any(collectors.particle.isna()) is one boolean for the whole column, not a per-row test;
      #   as soon as a single row has no particle, every row gets family + ", " + given
      other= (collectors.family + ", " + collectors.given) \
        if any(collectors.particle.isna()) \
        else collectors.particle + " " + collectors.family + ", " + collectors.given
  )
)
# move canonical_string_collector_parsed after column title (title was the last of the parsing columns)
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)
collectors.tail()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
178779,Štepánek,J.,,,,,,,"Štepánek, J.",2,https://data.biodiversitydata.nl/naturalis/spe...,1998-11-06 12:00:00.000,1992-05-15 00:00:00.000,2005-04-30 00:00:00.000
85001,Štepánek,J.,,,,,,,"Štepánek, J.",1,https://data.biodiversitydata.nl/naturalis/spe...,1983-05-09 00:00:00.000,1983-05-09 00:00:00.000,1983-05-09 00:00:00.000
178764,Štepánek,J.,,,,,,,"Štepánek, J.",1,https://data.biodiversitydata.nl/naturalis/spe...,2003-06-01 00:00:00.000,2003-06-01 00:00:00.000,2003-06-01 00:00:00.000
62174,Štepánek,J.,,,,,,,"Štepánek, J.",1,https://data.biodiversitydata.nl/naturalis/spe...,1978-05-17 00:00:00.000,1978-05-17 00:00:00.000,1978-05-17 00:00:00.000
146319,Šumberová,K.,,,,,,,"Šumberová, K.",17,https://data.biodiversitydata.nl/naturalis/spe...,2016-08-16 15:31:45.882,2016-08-16 00:00:00.000,2016-08-17 00:00:00.000
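
The cell above decides on the particle once for the whole column. A row-wise variant could look like the following sketch. It is not the notebook's method: the toy data and variable names are assumptions, the name order "particle family, given" is copied from the `else` branch above, and whether the Wikidata `canonical_string` carries particles in the same way would still need to be checked.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "family": ["Aa", "Štepánek", "A"],
    "given": ["H.A.", "J.", np.nan],
    "particle": ["van der", np.nan, np.nan],
})

# family only where the given name is missing, otherwise "family, given"
base = people["family"].where(
    people["given"].isna(),
    people["family"] + ", " + people["given"],
)

# prepend the particle only in rows that actually have one
canonical = base.where(
    people["particle"].isna(),
    people["particle"] + " " + base,
)

print(canonical.tolist())
```

Both `where` calls are evaluated per row, so rows without a particle keep the plain "family, given" form.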


In [8]:
# group and aggregate data to have unique name rows only for the matching of names later on
collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    occurrenceID_collectors_count= ('occurrenceID_count', 'sum'), # sum up the occurrence counts of all rows with the same name
    occurrenceID_collectors_firstsample=('occurrenceID_firstsample', lambda x: list(x)[0]), # custom function, to get the first entry
    collectors_eventDate_mean=('eventDate_mean', 'mean'),
    collectors_eventDate_min=('eventDate_min', 'min'),
    collectors_eventDate_max=('eventDate_max', 'max')
).reset_index()

# move canonical_string_collector_parsed after column title
col = collectors_unique.pop("canonical_string_collector_parsed")
collectors_unique.insert(collectors_unique.columns.get_loc('title') + 1, col.name, col)

collectors_unique

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max
0,A,,,,,,,,A,18,https://data.biodiversitydata.nl/naturalis/spe...,1981-04-19 16:00:00.000,1899-08-07 00:00:00.000,1999-12-10 00:00:00.000
1,A'buino'o,,,,,,,,A'buino'o,1,https://data.biodiversitydata.nl/naturalis/spe...,1965-08-24 00:00:00.000,1965-08-24 00:00:00.000,1965-08-24 00:00:00.000
2,A-M-V-J,Renier,,,,,,,"A-M-V-J, Renier",1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
3,A-ts'ai,Hsieh,,,,,,,"A-ts'ai, Hsieh",1,https://data.biodiversitydata.nl/naturalis/spe...,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000
4,A. Kneucker T,Stuckert,,in,,,,,"A. Kneucker T, Stuckert",4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57550,Širjaev,G.I.,,,,,,,"Širjaev, G.I.",32,https://data.biodiversitydata.nl/naturalis/spe...,1927-05-29 22:53:28.696,1924-05-01 00:00:00.000,1932-09-26 00:00:00.000
57551,Šmite,D.,,,,,,,"Šmite, D.",13,https://data.biodiversitydata.nl/naturalis/spe...,1978-01-12 16:30:00.000,1975-01-01 00:00:00.000,1980-09-08 00:00:00.000
57552,Špacek,J.,,,,,,,"Špacek, J.",2,https://data.biodiversitydata.nl/naturalis/spe...,1962-07-10 00:00:00.000,1962-07-10 00:00:00.000,1962-07-10 00:00:00.000
57553,Štepánek,J.,,,,,,,"Štepánek, J.",620,https://data.biodiversitydata.nl/naturalis/spe...,1988-06-14 10:28:20.310,1966-05-25 00:00:00.000,2006-07-13 00:00:00.000


In [9]:
# show collectors with highest occurrenceID_collectors_count
collectors_unique.sort_values(by=['occurrenceID_collectors_count', 'family'], ascending=[False, True]).head(10)

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max
5724,Boom,B.K.,,,,,,,"Boom, B.K.",51929,https://data.biodiversitydata.nl/naturalis/spe...,1956-02-26 18:12:10.768,1856-01-01 00:00:00.000,1997-04-11 00:00:00.000
6579,Breteler,F.J.,,,,,,,"Breteler, F.J.",41443,https://data.biodiversitydata.nl/naturalis/spe...,1988-09-24 10:56:52.051,1955-06-12 00:00:00.000,2020-03-06 00:00:00.000
32743,Maxwell,J.F.,,,,,,,"Maxwell, J.F.",38782,https://data.biodiversitydata.nl/naturalis/spe...,1996-08-29 12:11:24.527,1969-01-18 00:00:00.000,2013-04-11 00:00:00.000
26981,Koorders,S.H.,,,,,,,"Koorders, S.H.",34173,https://data.biodiversitydata.nl/naturalis/spe...,1915-03-09 11:07:43.928,1829-08-27 00:00:00.000,2012-11-11 00:00:00.000
29034,Leeuwenberg,A.J.M.,,,,,,,"Leeuwenberg, A.J.M.",32867,https://data.biodiversitydata.nl/naturalis/spe...,1973-07-14 13:58:10.508,1926-02-20 00:00:00.000,1999-11-16 00:00:00.000
48116,Soest,J.L.,,,,,,,"Soest, J.L.",31684,https://data.biodiversitydata.nl/naturalis/spe...,1947-10-12 23:09:55.812,1803-08-10 00:00:00.000,1999-06-06 00:00:00.000
613,Ajgh,Kostermans,,,,,,,"Ajgh, Kostermans",30712,https://data.biodiversitydata.nl/naturalis/spe...,1959-02-23 21:53:35.298,1892-09-30 00:00:00.000,1994-11-15 00:00:00.000
23478,Itinere,Stud,,biol Rheno-Trai in,,,,,"Itinere, Stud",29912,https://data.biodiversitydata.nl/naturalis/spe...,1966-03-04 03:14:54.417,1847-06-18 00:00:00.000,1996-07-08 00:00:00.000
55730,Wilde-Duyfjes,B.E.E.,,,,,,,"Wilde-Duyfjes, B.E.E.",29893,https://data.biodiversitydata.nl/naturalis/spe...,1986-10-15 13:20:06.923,1958-06-28 00:00:00.000,2019-09-04 00:00:00.000
55575,Wieringa,J.J.,,,,,,,"Wieringa, J.J.",23282,https://data.biodiversitydata.nl/naturalis/spe...,2006-07-29 21:16:17.781,1980-08-19 00:00:00.000,2022-11-12 00:00:00.000


In [10]:
# Idea: Should we use data column suffixes to follow the data source after merging is done later?
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

### Set Up the Text Analysis

See <https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536> for a deeper understanding.

The `ngrams` function is used as an analyzer in the text search later.

In [11]:
# some example data
print('(WikiData’s) canonical_string = (constructed) canonical_string_fullname')
for row in range(5):
    pprint.pprint("%s = %s" % (
        wd_matchtest['canonical_string'].at[row],
        wd_matchtest_fullnames['canonical_string_fullname'].at[row]
    ))

(WikiData’s) canonical_string = (constructed) canonical_string_fullname
'(-Walraevens), O.H. = (-Walraevens), O Heylen'
'(1835-1906), G.A.F.E. = (1835-1906), Gustav Adolf Ferdinand Eichler'
'(1873-1926), S.S. = (1873-1926), Søren Sørensen'
'(1888–1973), G.A. = (1888–1973), Georges André'
'(1904-1990), J.J. = (1904-1990), Johannes Johannessen'


In [12]:
import re
!pip install ftfy # ftfy repairs mojibake and other text decoding issues
from ftfy import fix_text

def ngrams(string, n=3):
    """
    Normalise a text and split it into character n-grams

    @param string: the text string to perform the n-gram splitting on
    @param n: number of characters of each n-gram
    @return: list of n-grams (strings of n characters each)
    """
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() # drop non-ASCII chars (e.g. "André" becomes "Andr")
    string = string.lower()
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title()  # normalise case - capital at start of each word
    string = re.sub(' +', ' ', string).strip() # get rid of multiple spaces and replace with a single
    string = ' ' + string + ' '  # pad names for ngrams...
    string = re.sub(r'[,\-./]|\sBD', r'', string) # drop leftover punctuation and the token " BD"
    this_ngrams = zip(*[string[i:] for i in range(n)]) # n shifted copies, zipped into n-character tuples
    return [''.join(ngram) for ngram in this_ngrams]


In [13]:
print("Show ngram examples:")
print("- simple name:", ngrams('Klazenga, N.'))
print("- data from collectors:", ngrams(collectors_unique["canonical_string_collector_parsed"].at[1])) 
print("- data from match-test:", ngrams(wd_matchtest['canonical_string'].at[0]))
print("- data from match-test (full name):", ngrams(wd_matchtest_fullnames['canonical_string_fullname'].at[3]))

Show ngram examples:
- simple name: [' Kl', 'Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N', ' N ']
- data from collectors: [' Ab', 'Abu', 'bui', 'uin', 'ino', 'noo', 'oo ']
- data from match-test: [' Wa', 'Wal', 'alr', 'lra', 'rae', 'aev', 'eve', 'ven', 'ens', 'ns ', 's O', ' Oh', 'Oh ']
- data from match-test (full name): [' 18', '188', '888', '881', '819', '197', '973', '73 ', '3 G', ' Ge', 'Geo', 'eor', 'org', 'rge', 'ges', 'es ', 's A', ' An', 'And', 'ndr', 'dr ']
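
The sliding-window idea behind the n-grams can be shown without the cleaning steps. The following sketch is a simplified stand-in for `ngrams` above (no ftfy, no normalisation); the names `char_ngrams` and `shared` are assumptions for illustration.

```python
def char_ngrams(text, n=3):
    """Split a text, padded with one space on each side, into character n-grams."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("Boom B"))

# two spellings of a name share most of their n-grams
shared = set(char_ngrams("Klazenga N")) & set(char_ngrams("Klasenga N"))
print(sorted(shared))
```

The padding spaces produce n-grams that mark the beginning and end of a name, so a matching first letter counts for more than a matching letter somewhere in the middle.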


Vectorize the Wikidata names. Background: we use an information retrieval technique, TF-IDF (Term Frequency, Inverse Document Frequency; see the blog post [towardsdatascience.com/tf-idf-explained…](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275)), to match the source names against the Wikidata names. The calculated distance measure of each name match helps to pair similar names and to set apart names that are probably no match. In general, see also https://scikit-learn.org and https://pypi.org/project/scikit-learn/.

### Perform the Matching

Perform the nearest neighbour (NN) matching on the (Naturalis) collector names and create a data frame of matches. We try to distinguish abbreviated and full names in the source to better match the source data against Wikidata (this can take 5 to 10 minutes).

Now convert a collection of raw documents into a matrix of TF-IDF features and set up the function that performs the matching...

In [14]:
# pip install --upgrade scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# nbrs_data = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf_vector_data) 
# tfidf_vector_data contains the vectorized wikidata names from the previous step


def getNearestNeighbour(query, this_vectorizer, this_nbrs_data):
    """Calculate the k-nearest distance for query data using package scikit-learn


    @param query: iterable of strings, the query data to vectorize and transform
    @param this_vectorizer: the fitted TfidfVectorizer
    @param this_nbrs_data: the fitted NearestNeighbors model
    @return: (distances, indices) distances and indices
    @rtype (numpy.ndarray, numpy.ndarray), each of shape (number of queries, n_neighbors)
    """
    queryTFIDF_ = this_vectorizer.transform(query)
    distances, indices = this_nbrs_data.kneighbors(queryTFIDF_)
    return distances, indices


def calculateTFIDFmatchingOfData(query_data, match_data, n_neighbors=1):
    """
    Calculate a TF-IDF (term frequency, inverse document frequency) matching with getNearestNeighbour()

    @param query_data: iterable of strings, usually a pandas data column, to query names or strings for
    @param match_data: pandas Series of strings to match against
    @param n_neighbors: number of neighbours requested from :meth:`kneighbors` (scikit-learn defaults to 5); only the nearest one is returned

    @requires getNearestNeighbour()
    @requires ngrams()
    @requires TfidfVectorizer()
    @requires NearestNeighbors()

    @return: DataFrame a data frame of matches with columns 'namematch_source_data', 'namematch_resource_data', 'namematch_distance'
    """

    import time
    start = time.time()
    query_data = set(query_data)
    # convert to a set to drop duplicates, so that every name is queried only once

    print('Vectorizing data. This may take a while...')
    # vectorize wikidata names
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
    tfidf_vector_data = vectorizer.fit_transform(match_data
        # wd_matchtest['canonical_string']
    )
    nbrs_data = NearestNeighbors(n_neighbors=n_neighbors, n_jobs=-1).fit(tfidf_vector_data)
    duration = time.time() - start
    print('Vectorizing completed: created a matrix of TF-IDF features after %s s' % duration)

    print('Getting nearest neighbours of %s data with %s neighbor sample(s)...' % (len(query_data), n_neighbors))
    distances, indices = getNearestNeighbour(query_data, vectorizer, nbrs_data)
    duration = time.time() - start
    print('Completed after %s s' % duration)

    query_data = list(query_data)  # convert back to list

    print('Finding matches build new data frame ...')
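    matches = []
    for i, j in enumerate(indices):
        # j holds the indices of all n_neighbors candidates, nearest first; only the nearest one ([0]) is kept
        temp = [query_data[i], match_data.values[j][0], round(distances[i][0], 2)]
        matches.append(temp)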

    duration = time.time() - start
    print('Building matches done after %s s' % duration)
    matches = pd.DataFrame(
        matches,
        columns=['namematch_source_data', 'namematch_resource_data', 'namematch_distance']
    )

    print('Done')
    return matches
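
The interplay of vectorizer and neighbour search can be tried on a handful of names. The sketch below is an assumption-laden toy run and not the notebook's pipeline: it swaps the custom `ngrams` analyzer for scikit-learn's built-in `char_wb` analyzer, uses four names from the Wikidata table above as reference data, and queries two made-up spelling variants. Unlike the function above, it prints all requested neighbours, not just the nearest one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

reference_names = ["Behr, H.H.", "Menge, F.A.", "Klotzsch, J.F.", "Schäffer, J.C."]
query_names = ["Behr, H.", "Klotsch, J.F."]  # hypothetical spelling variants

# character trigrams within word boundaries, weighted by TF-IDF
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
reference_matrix = vectorizer.fit_transform(reference_names)

# index the reference names and ask for the two closest ones per query
neighbour_index = NearestNeighbors(n_neighbors=2).fit(reference_matrix)
distances, indices = neighbour_index.kneighbors(vectorizer.transform(query_names))

for query, distance_row, index_row in zip(query_names, distances, indices):
    candidates = [
        (reference_names[j], round(float(d), 2))
        for d, j in zip(distance_row, index_row)
    ]
    print(query, "->", candidates)
```

Keeping the second-best candidate as well makes it possible to see how clearly the best match stands out, which is one possible use of `n_neighbors` larger than 1.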


In [15]:
criterion_fullnames = collectors_unique.given.str.contains(r'^\w{3,}', na=False)
collectors_names = collectors_unique['canonical_string_collector_parsed'][~criterion_fullnames].values
print("Calculate matching for **abbreviated** names separately …")
# collectors_names = set(collectors_unique['canonical_string_collector_parsed'][~criterion_fullnames].values)
matches = calculateTFIDFmatchingOfData(collectors_names, wd_matchtest['canonical_string'], 5) # NOTE n_neighbors is 5 as in the original source code; only the nearest neighbour is kept, so 5 instead of 1 does not change the returned matches

matches = matches.sort_values(['namematch_distance'])
matches = matches.reset_index(names=['old_index'])

matches

Calculate matching for **abbreviated** names separately …
Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 2.7093393802642822 s
Getting nearest neighbours of 55376 data with 5 neighbor sample(s)...
Completed after 368.9916708469391 s
Finding matches build new data frame ...
Building matches done after 369.5535762310028 s
Done


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,28258,"Fourcade, C.","Fourcade, C.",0.00
1,10484,"Peters, T.M.","Peters, T.M.",0.00
2,45161,"Hilse, F.W.","Hilse, F.W.",0.00
3,34046,"Tobe, H.","Tobe, H.",0.00
4,22172,"Draisma, S.G.A.","Draisma, S.G.A.",0.00
...,...,...,...,...
55371,27855,"Nakkuntod, M.","Markkula, M.",1.25
55372,53578,"Waidahnahsahp, B.","Wai, N.",1.25
55373,14955,"Alpha Sukmadimigrat, R.","Vladimirov, A.A.",1.26
55374,33118,"Watdahnahsahp, B.","Popp, B.",1.26


In [16]:
# criterion_fullnames = collectors_unique.given.str.contains(r'^\w{3,}', na=False)
print("Calculate matching for **full** names separately …")
collectors_fullnames = collectors_unique['canonical_string_collector_parsed'][criterion_fullnames].values
matches_fullnames = calculateTFIDFmatchingOfData(collectors_fullnames, wd_matchtest_fullnames['canonical_string_fullname'], 5) # NOTE n_neighbors is 5 as in the original source code; only the nearest neighbour is kept, so 5 instead of 1 does not change the returned matches

matches_fullnames = matches_fullnames.sort_values(['namematch_distance'])
matches_fullnames = matches_fullnames.reset_index(names=['old_index'])

matches_fullnames


Calculate matching for **full** names separately …
Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 3.1273386478424072 s
Getting nearest neighbours of 2179 data with 5 neighbor sample(s)...
Completed after 35.43817949295044 s
Finding matches build new data frame ...
Building matches done after 35.45609903335571 s
Done


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,1518,"Won, Hyosig","Won, Hyosig",0.00
1,1119,"Awas, Tesfaye","Awas, Tesfaye",0.00
2,25,"Lai, Ming-Jou","Lai, Ming-Jou",0.00
3,2065,"Chunghee, Lee","Chunghee, Lee",0.00
4,1153,"Singh, Amar","Singh, Amar",0.00
...,...,...,...,...
2174,85,"Landbouw, Stichting Ontwikkeling Machinale",Grandbois,1.26
2175,165,"Ontwikkelingslanden Rotterdam, Centrum Bevorde...","Stern, Kingsley Roland",1.27
2176,1365,"Soemberpoetjoeng, Boscharchitect","Soenarko, Soejatmi",1.27
2177,1882,"H.M.S. Sulphur, Voyage","Kurz, Wilhelm Sulpiz",1.27


### Create Output Results

Combine the matches data frame back to the (Naturalis) collectors and Wikidata items …

In [17]:
# join the matches data frame back to the source collectors data frame
collectors_matches = pd.merge(
    collectors_unique, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A,,,,,,,,A,18,https://data.biodiversitydata.nl/naturalis/spe...,1981-04-19 16:00:00.000,1899-08-07 00:00:00.000,1999-12-10 00:00:00.000,53717,A,"Mas, A.",1.16
1,A'buino'o,,,,,,,,A'buino'o,1,https://data.biodiversitydata.nl/naturalis/spe...,1965-08-24 00:00:00.000,1965-08-24 00:00:00.000,1965-08-24 00:00:00.000,13527,A'buino'o,"Abutalıbov, M.",1.22
2,Aa,H. A. van der,,van,,,,,"Aa, H. A. van der",2,https://data.biodiversitydata.nl/naturalis/spe...,1962-07-07 00:00:00.000,1962-07-07 00:00:00.000,1962-07-07 00:00:00.000,17733,"Aa, H. A. van der","Derbès, A.A.",1.06
3,Aafjes,J.,,,,,,,"Aafjes, J.",4,https://data.biodiversitydata.nl/naturalis/spe...,1941-10-19 00:00:00.000,1941-10-19 00:00:00.000,1941-10-19 00:00:00.000,48878,"Aafjes, J.","Kutafjeva, N.P.",1.15
4,Aajchomphoo,W.,,,,,,,"Aajchomphoo, W.",17,https://data.biodiversitydata.nl/naturalis/spe...,1987-03-17 00:00:00.000,1987-03-16 00:00:00.000,1987-03-18 00:00:00.000,39510,"Aajchomphoo, W.","Rajchenberg, M.",1.19
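
`pd.merge` defaults to an inner join, so rows without a counterpart in the other data frame are dropped from the result. The following is a minimal sketch on toy data (the frames and the name «Zzz, X.» are illustrative and not part of the notebook's data) of how `how='left'` together with `indicator=True` could make such rows visible:

```python
import pandas as pd

# Toy stand-ins for collectors_unique and matches (illustrative values only)
toy_collectors = pd.DataFrame({
    "canonical_string_collector_parsed": ["Abbe, E.C.", "Abbott, W.L.", "Zzz, X."],
})
toy_matches = pd.DataFrame({
    "namematch_source_data": ["Abbe, E.C.", "Abbott, W.L."],
    "namematch_resource_data": ["Abbe, E.C.", "Abbott, W.L."],
    "namematch_distance": [0.0, 0.0],
})

# A left join keeps every collector; the _merge column records where each row came from
toy_joined = pd.merge(
    toy_collectors, toy_matches,
    how="left",
    left_on="canonical_string_collector_parsed",
    right_on="namematch_source_data",
    indicator=True,
)

# Rows found only in the left frame have no match
toy_unmatched = toy_joined[toy_joined["_merge"] == "left_only"]
print(toy_unmatched["canonical_string_collector_parsed"].tolist())
```

Whether dropping unmatched rows is wanted depends on the step; here the abbreviated and the full names are matched separately, so an inner join per group may well be intended.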


In [18]:
# append full name matches
collectors_matches_fullname = pd.merge(
    collectors_unique, matches_fullnames, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches_fullname.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A-M-V-J,Renier,,,,,,,"A-M-V-J, Renier",1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT,1262,"A-M-V-J, Renier","Renner, Matt A M",1.14
1,A-ts'ai,Hsieh,,,,,,,"A-ts'ai, Hsieh",1,https://data.biodiversitydata.nl/naturalis/spe...,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000,447,"A-ts'ai, Hsieh","Hsieh, A Tsai",0.36
2,A. Kneucker T,Stuckert,,in,,,,,"A. Kneucker T, Stuckert",4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,801,"A. Kneucker T, Stuckert","Kneucker, Johann Andreas",0.95
3,AFle,Jolis,,,,,,,"AFle, Jolis",420,https://data.biodiversitydata.nl/naturalis/spe...,1860-07-06 19:47:47.797,1800-01-01 00:00:00.000,1983-10-04 00:00:00.000,745,"AFle, Jolis","Jolis, Auguste le",1.09
4,Aaaa,Bellynck,,,,,,,"Aaaa, Bellynck",6,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT,2163,"Aaaa, Bellynck",Beeynck,1.1


In [19]:
collectors_all_matches=pd.concat([collectors_matches, collectors_matches_fullname])
collectors_all_matches.sort_values(by=['namematch_distance', 'family'], ascending=[True, True], inplace=True)
collectors_all_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
13,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,https://data.biodiversitydata.nl/naturalis/spe...,1907-01-26 12:00:00.000,1906-12-06 00:00:00.000,1907-03-19 00:00:00.000,14602,"Aaronsohn, A.","Aaronsohn, A.",0.0
37,Abbas,A.,,,,,,,"Abbas, A.",378,https://data.biodiversitydata.nl/naturalis/spe...,1963-03-03 08:38:52.762,1936-02-11 00:00:00.000,1963-11-01 00:00:00.000,37429,"Abbas, A.","Abbas, A.",0.0
41,Abbe,E.C.,,,,,,,"Abbe, E.C.",537,https://data.biodiversitydata.nl/naturalis/spe...,1961-03-04 07:37:30.486,1932-01-01 00:00:00.000,1964-08-31 00:00:00.000,52651,"Abbe, E.C.","Abbe, E.C.",0.0
44,Abbiatti,D.,,,,,,,"Abbiatti, D.",2,https://data.biodiversitydata.nl/naturalis/spe...,1944-05-31 00:00:00.000,1937-10-01 00:00:00.000,1951-01-29 00:00:00.000,526,"Abbiatti, D.","Abbiatti, D.",0.0
55,Abbott,A.T.D.,,,,,,,"Abbott, A.T.D.",14,https://data.biodiversitydata.nl/naturalis/spe...,2002-12-17 01:36:00.000,1997-02-13 00:00:00.000,2010-05-27 00:00:00.000,47429,"Abbott, A.T.D.","Abbott, A.T.D.",0.0
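
After the concatenation, abbreviated and full-name matches can no longer be told apart. One possible way to keep that information is to label each frame before `pd.concat`. This is only a sketch on toy data; the column name `namematch_type` is a hypothetical addition and not part of the notebook's output:

```python
import pandas as pd

# Toy stand-ins for collectors_matches and collectors_matches_fullname
toy_abbreviated = pd.DataFrame({
    "namematch_source_data": ["Abbe, E.C."],
    "namematch_distance": [0.0],
})
toy_fullname = pd.DataFrame({
    "namematch_source_data": ["Won, Hyosig"],
    "namematch_distance": [0.0],
})

# namematch_type is a hypothetical helper column marking the origin of each row
toy_all = pd.concat(
    [
        toy_abbreviated.assign(namematch_type="abbreviated"),
        toy_fullname.assign(namematch_type="fullname"),
    ],
    ignore_index=True,
).sort_values(by=["namematch_distance", "namematch_source_data"])
print(toy_all)
```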


In [20]:
# criterion = collectors_all_matches['canonical_string_collector_parsed'].map(lambda x: x.startswith('Kotschy'))
# print("Show example of «Kotschy…» with namematch distances from 0.0 to 1.0 (in Cosine Similarity we had 0.5 … 1.0)")
# collectors_all_matches[criterion]

Save the plain name matching results only ...

In [21]:
if not os.path.exists('data'):
    print("Make data directory for saving …")
    os.makedirs('data')

# Set some global variables
# this_timestamp_for_data=time.strftime('%Y%m%d') # 20230719
this_timestamp_for_data=20230913

this_output_file='data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_kneighbor_plain-names_%s.csv' % (
    this_timestamp_for_data
)

collectors_all_matches.to_csv(this_output_file)

print("Wrote plain name matches of collector names into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # ">> 10" shifts right by 10 bits = integer division by 1024, giving kilobytes
)

Wrote plain name matches of collector names into data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_kneighbor_plain-names_20230913.csv (11778 kB)
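
The `>> 10` in the print statement is an integer right shift by 10 bits, which gives the same result as floor division by 1024. A small sketch with an illustrative byte count (not a measured file size):

```python
size_in_bytes = 12_060_672  # illustrative value, not a measured file size

kib_by_shift = size_in_bytes >> 10       # shift right by 10 bits
kib_by_division = size_in_bytes // 1024  # floor division by 2**10

print(kib_by_shift, kib_by_division)
```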


### Merge and Aggregate Matched Data with Wikidata

Review (TODO)
- evaluate time references: `eventDate` ~ `yob`, `wyb`; perhaps define a score that integrates all properties relevant to the name-matching decision (name distance, `eventDate` ~ year of birth or work period begin, and so on)
- merge abbreviated and full-name data properly; distinguish abbreviated matches from full-name matches
- refactor `collectors_matches` or `collectors_matches_g1` and so on to `collectors_all_matches`
- refactor `collectors` to `collectors_unique`
- refactor `matches` to `matches_abbr`, or otherwise distinguish it from `matches_fullname`
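
The first review item could start from a simple plausibility check: a collection date should fall within the lifetime of the matched person. The following is a minimal sketch, not an implemented part of this notebook; the function name `event_year_is_plausible` and the minimum collecting age of 10 years are assumptions:

```python
import math


def event_year_is_plausible(event_year, yob, yod, min_age=10):
    """Return True or False if the check can be made, None if no life dates are known.

    min_age is an assumed lower bound for the age at which a person collects.
    """
    def known(value):
        return value is not None and not (isinstance(value, float) and math.isnan(value))

    if not known(event_year) or not (known(yob) or known(yod)):
        return None
    if known(yob) and event_year < yob + min_age:
        return False
    if known(yod) and event_year > yod:
        return False
    return True


# Values from the «Louis…» example table in this notebook: "Louis, J.L.P." is
# matched to a person born 1903 who died 1947, with eventDates from 1900 to 1998
print(event_year_is_plausible(1900, 1903.0, 1947.0))
print(event_year_is_plausible(1998, 1903.0, 1947.0))
print(event_year_is_plausible(1930, 1903.0, 1947.0))
```

Such a flag could later be combined with `namematch_distance` into one score; how to weight the two is left open here.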

Now
1. merge the matching data with the Wikidata records on the canonical name string
2. later, aggregate in a more fine-grained way, checking whether identical canonical name strings relate to several different persons (we aggregate on the Wikidata items, i.e. Q-identifiers such as Q1233242, and on the item labels)
3. save those data tables

In [22]:
# merge the matching data with the Wikidata records on the canonical name string
collectors_matches_tmp_names_abbreviated = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string'
)
collectors_matches_tmp_fullnames = pd.merge(
    collectors_matches_fullname, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string_fullname'
)
collectors_matches_g1_merged_wikidata = pd.concat(
    [collectors_matches_tmp_names_abbreviated, collectors_matches_tmp_fullnames]
    , ignore_index=True
)

In [23]:
print("Show some name match examples (e.g. «Louis…» matching various names) …")
for testname in ['Louis', 'Abbot']:
    criterion = collectors_matches_g1_merged_wikidata['canonical_string_collector_parsed'].map(lambda x: x.startswith(testname))    
    this_table=collectors_matches_g1_merged_wikidata[criterion].get([
        # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
        'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
        'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
        # 'canonical_string_fullname', 
        'itemLabel', 'wikidata_link',
        'collectors_eventDate_min', 'collectors_eventDate_max',
        'yob', 'yod', 'wyb', 'wye'
    ]).sort_values(by=['namematch_distance'])
    print("# ---------------------------------------------\n# «%s…» as test name, %d collector names begin with:" % (testname, criterion.sum()))    
    display(this_table)

Show some name match examples (e.g. «Louis…» matching various names) …
# ---------------------------------------------
# «Louis…» as test name, 15 collector names begin with:


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,wikidata_link,collectors_eventDate_min,collectors_eventDate_max,yob,yod,wyb,wye
9302,2,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, A.","Louis, A.",0.0,A. Louis,http://www.wikidata.org/wiki/Q33682458,NaT,NaT,,,,
37515,10542,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, A.M.","Louis, A.M.",0.0,Adriaan M. Louis,http://www.wikidata.org/wiki/Q21338327,1969-04-10 00:00:00.000,2013-03-02 00:00:00.000,1944.0,,,
37516,3339,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, J.L.P.","Louis, J.L.P.",0.0,Jean Laurent Prosper Louis,http://www.wikidata.org/wiki/Q5928759,1900-01-01 00:00:00.000,1998-05-17 00:00:00.000,1903.0,1947.0,,
37517,51,https://data.biodiversitydata.nl/naturalis/spe...,"Louis-Marie, P.",Louis-Marie,0.36,Louis-Marie,http://www.wikidata.org/wiki/Q5981449,1925-07-28 00:00:00.000,1953-07-08 00:00:00.000,1896.0,1978.0,,
9301,59,https://data.biodiversitydata.nl/naturalis/spe...,Louis,"Louis, A.",0.41,A. Louis,http://www.wikidata.org/wiki/Q33682458,1904-05-28 00:00:00.000,1984-09-21 00:00:00.000,,,,
37518,1,https://data.biodiversitydata.nl/naturalis/spe...,"Louis-Marie, R.P.",Louis-Marie,0.52,Louis-Marie,http://www.wikidata.org/wiki/Q5981449,1934-07-09 00:00:00.000,1934-07-09 00:00:00.000,1896.0,1978.0,,
9303,14,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, F.","Louis, A.",0.6,A. Louis,http://www.wikidata.org/wiki/Q33682458,1910-01-01 00:00:00.000,1953-12-01 00:00:00.000,,,,
9305,3,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, H.","Louis, A.",0.6,A. Louis,http://www.wikidata.org/wiki/Q33682458,1907-06-01 00:00:00.000,1953-10-01 00:00:00.000,,,,
9307,4,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, O.","Louis, A.",0.65,A. Louis,http://www.wikidata.org/wiki/Q33682458,1937-07-14 00:00:00.000,1937-07-27 00:00:00.000,,,,
9304,116,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, F.H.","Louis, A.",0.75,A. Louis,http://www.wikidata.org/wiki/Q33682458,1853-08-03 00:00:00.000,1960-09-25 00:00:00.000,,,,


# ---------------------------------------------
# «Abbot…» as test name, 10 collector names begin with:


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,wikidata_link,collectors_eventDate_min,collectors_eventDate_max,yob,yod,wyb,wye
99,14,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, A.T.D.","Abbott, A.T.D.",0.0,A. T. D. Abbott,http://www.wikidata.org/wiki/Q117328147,1997-02-13 00:00:00.000,2010-05-27 00:00:00.000,1936.0,2013.0,,
100,2,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, E.K.","Abbott, E.K.",0.0,Edwin Kirk Abbott,http://www.wikidata.org/wiki/Q81587932,1889-01-01 00:00:00.000,1889-04-01 00:00:00.000,1840.0,1918.0,,
101,2,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, E.K.","Abbott, E.K.",0.0,Erwin Kirk Abbott,http://www.wikidata.org/wiki/Q113588322,1889-01-01 00:00:00.000,1889-04-01 00:00:00.000,1840.0,1918.0,,
103,10,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, W.L.","Abbott, W.L.",0.0,William Louis Abbott,http://www.wikidata.org/wiki/Q635604,1922-04-05 00:00:00.000,1922-04-30 00:00:00.000,1860.0,1936.0,,
96,1,https://data.biodiversitydata.nl/naturalis/spe...,Abbott,"Abbott, G.",0.43,George Abbott,http://www.wikidata.org/wiki/Q47112598,NaT,NaT,,,,
102,106,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, I.A.","Abbott, I.",0.57,Isabella Abbott,http://www.wikidata.org/wiki/Q6077932,1946-05-01 00:00:00.000,1995-02-22 00:00:00.000,1919.0,2010.0,,
97,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, D.P.","Abbott, G.",0.74,George Abbott,http://www.wikidata.org/wiki/Q47112598,1967-08-02 00:00:00.000,1967-08-02 00:00:00.000,,,,
87,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbot-Anderson, M.","Anderson, M.",0.79,Marilyn Anderson,http://www.wikidata.org/wiki/Q44754645,1933-06-21 00:00:00.000,1933-06-21 00:00:00.000,,,,
88,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbot-Anderson, M.","Anderson, M.",0.79,Mary Anderson,http://www.wikidata.org/wiki/Q111694258,1933-06-21 00:00:00.000,1933-06-21 00:00:00.000,1875.0,,,
89,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbot-Anderson, M.","Anderson, M.",0.79,Mark Anderson,http://www.wikidata.org/wiki/Q111990210,1933-06-21 00:00:00.000,1933-06-21 00:00:00.000,,,,
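
In the «Abbot…» table above, the collector "Abbott, E.K." appears twice because two Wikidata items share the same canonical string, so the merge yields one row per item. A minimal sketch on toy data (labels and links copied from the table above, the frame names are illustrative) that reproduces this effect and counts the items per collector name:

```python
import pandas as pd

toy_match = pd.DataFrame({
    "namematch_source_data": ["Abbott, E.K."],
    "namematch_resource_data": ["Abbott, E.K."],
})
# Two Wikidata items sharing one canonical string
toy_wikidata = pd.DataFrame({
    "canonical_string": ["Abbott, E.K.", "Abbott, E.K."],
    "itemLabel": ["Edwin Kirk Abbott", "Erwin Kirk Abbott"],
    "wikidata_link": [
        "http://www.wikidata.org/wiki/Q81587932",
        "http://www.wikidata.org/wiki/Q113588322",
    ],
})

# The single match row is repeated once per Wikidata item with the same key
toy_merged = pd.merge(
    toy_match, toy_wikidata,
    left_on="namematch_resource_data", right_on="canonical_string",
)

# Count how many distinct Wikidata items each collector name resolves to
items_per_name = toy_merged.groupby("namematch_source_data")["wikidata_link"].nunique()
print(items_per_name)
```

A count above 1 marks names that need a further criterion (for example dates) before one item can be chosen.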


Aggregate the data so that multiple Wikidata records matching the same name are listed in one field, joined by “|”.

In [24]:
print('Group data by canonical names (abbreviated and full name):'
      ' multiple related WD items (e.g. Q1232456), item labels, year of birth, year of death')
for i, wd_matching_column in enumerate(['canonical_string', 'canonical_string_fullname']):
    print('Run %s:   Group by wiki data’s %s, and aggregate/join item(s), labels, yob, yod '
          'by “…|…”, add new columns “…_joined” ...' % (i + 1, wd_matching_column))
    wdata_joined_items_and_others = wikidata.groupby([wd_matching_column]).agg(
        items_joined = ('item', lambda x: '|'.join(x)),
        item_labels_joined = ('itemLabel', lambda x: '|'.join(x)),
        yob_joined = ('yob', lambda x: '|'.join([str(s) for s in list(x)]) ),
        yod_joined = ('yod', lambda x: '|'.join([str(s) for s in list(x)]) )
    ).reset_index()

    # print("Done. Show examples of items having multiple matching data «|» … ")
    # criterion = wdata_joined_items_and_others['items_joined'].map(lambda x: '|' in x)
    # wdata_joined_items_and_others[criterion].head()

    print('Run %s:   Merge all based on namematch_resource_data, add item(s) data ...' % (i + 1))
    collectors_matches_g2 = pd.merge(
        collectors_matches_g1_merged_wikidata, wdata_joined_items_and_others,
        left_on='namematch_resource_data', right_on=wd_matching_column
        , suffixes=('__wikidata_merge', '__grp_by_items')
        # append to left-data, right-data only when identical column names occur
    )

    print('Run %s:   Build data frame “collectors_matches_group” ...' % (i + 1))
    collectors_matches_group = collectors_matches_g2 \
        if i == 0 \
        else pd.concat([collectors_matches_group, collectors_matches_g2], ignore_index = True)
    
print('Done')

Group data by canonical names (abbreviated and full name): multiple related WD items (e.g. Q1232456), item labels, year of birth, year of death
Run 1:   Group by wiki data’s canonical_string, and aggregate/join item(s), labels, yob, yod by “…|…”, add new columns “…_joined” ...
Run 1:   Merge all based on namematch_resource_data, add item(s) data ...
Run 1:   Build data frame “collectors_matches_group” ...
Run 2:   Group by wiki data’s canonical_string_fullname, and aggregate/join item(s), labels, yob, yod by “…|…”, add new columns “…_joined” ...
Run 2:   Merge all based on namematch_resource_data, add item(s) data ...
Run 2:   Build data frame “collectors_matches_group” ...
Done
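
The aggregation above applies `str()` to every year, so missing years end up as the literal text `nan` in `yob_joined` and `yod_joined`. As a sketch of an alternative, missing values could be skipped before joining; the helper name `join_non_missing` is hypothetical and the toy data use shortened item IDs:

```python
import pandas as pd


def join_non_missing(values, sep="|"):
    """Join the non-missing values of a Series as text (hypothetical helper)."""
    return sep.join(str(v) for v in values.dropna())


toy_wikidata = pd.DataFrame({
    "canonical_string": ["Abbas, A.", "Abbas, A.", "Abbe, E.C."],
    "item": ["Q60141229", "Q88804360", "Q10274118"],
    "yob": [float("nan"), float("nan"), 1905.0],
})

toy_joined = toy_wikidata.groupby("canonical_string").agg(
    items_joined=("item", lambda x: "|".join(x)),
    yob_joined=("yob", join_non_missing),
).reset_index()
print(toy_joined)
```

The trade-off is that the position of a year in `yob_joined` then no longer lines up with the position of its item in `items_joined`, which the original `nan` placeholders preserve.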


In [25]:
print("Show examples of item_labels_joined having multiple matching data «|» … ")
criterion = collectors_matches_group['item_labels_joined'].map(lambda x: '|' in x)

collectors_matches_group[criterion].get([ # empty 
    # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
    'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
    'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
    # 'canonical_string_fullname', 
    'item_labels_joined', 'items_joined', 'yob_joined', 'yod_joined'
], default="...get: Are data empty or it has probably a wrong named column?")


Show examples of item_labels_joined having multiple matching data «|» … 


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,item_labels_joined,items_joined,yob_joined,yod_joined
54,4,https://data.biodiversitydata.nl/naturalis/spe...,"Abas, A.N.","Abbas, A.",1.00,Alia Abbas|Abdulla Abbas,http://www.wikidata.org/entity/Q60141229|http:...,nan|nan,nan|nan
55,4,https://data.biodiversitydata.nl/naturalis/spe...,"Abas, A.N.","Abbas, A.",1.00,Alia Abbas|Abdulla Abbas,http://www.wikidata.org/entity/Q60141229|http:...,nan|nan,nan|nan
56,378,https://data.biodiversitydata.nl/naturalis/spe...,"Abbas, A.","Abbas, A.",0.00,Alia Abbas|Abdulla Abbas,http://www.wikidata.org/entity/Q60141229|http:...,nan|nan,nan|nan
57,378,https://data.biodiversitydata.nl/naturalis/spe...,"Abbas, A.","Abbas, A.",0.00,Alia Abbas|Abdulla Abbas,http://www.wikidata.org/entity/Q60141229|http:...,nan|nan,nan|nan
58,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbas, H.","Abbas, A.",0.58,Alia Abbas|Abdulla Abbas,http://www.wikidata.org/entity/Q60141229|http:...,nan|nan,nan|nan
...,...,...,...,...,...,...,...,...,...
61837,25,https://data.biodiversitydata.nl/naturalis/spe...,"Te, Hasseloo B.H.","Hassel, Kristian",1.15,Kristian Hassel|Kristian Hassel,http://www.wikidata.org/entity/Q59604134|http:...,nan|nan,nan|nan
61913,195,https://data.biodiversitydata.nl/naturalis/spe...,"Unyong, Asah","Strong, Asa B",1.08,Asa B. Strong|Asa B. Strong,http://www.wikidata.org/entity/Q36511637|http:...,nan|nan,nan|nan
61914,195,https://data.biodiversitydata.nl/naturalis/spe...,"Unyong, Asah","Strong, Asa B",1.08,Asa B. Strong|Asa B. Strong,http://www.wikidata.org/entity/Q36511637|http:...,nan|nan,nan|nan
62005,1,https://data.biodiversitydata.nl/naturalis/spe...,"Won, Hyosig","Won, Hyosig",0.00,Hyosig Won|Hyosig Won,http://www.wikidata.org/entity/Q88828745|http:...,nan|nan,nan|nan


In [26]:
# check what columns we have and what we would keep for further analysis and what to drop
pprint.pprint(collectors_matches_group.columns)
# after a merge, the suffix _x normally marks a column from the left data frame, _y one from the right data frame
# to fold long lines of text in Bash: echo "${text}" | fold --spaces | sed 's@^@#  @'

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
       'collectors_eventDate_mean', 'collectors_eventDate_min',
       'collectors_eventDate_max', 'old_index', 'namematch_source_data',
       'namematch_resource_data', 'namematch_distance', 'item', 'itemLabel',
       'surname', 'initials', 'canonical_string__wikidata_merge',
       'canonical_string_fullname', 'orcid', 'viaf', 'isni', 'harv', 'ipni',
       'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'wye', 'wikidata_link',
       'orcid_link', 'harv_link', 'ipni_link', 'bionomia_link',
       'canonical_string__grp_by_items', 'items_joined', 'item_labels_joined',
       'yob_joined', 'yod_joined', 'canonical_string',
       'canonical_string_fullname__wikidata_merge',
       'canonical_string_fullname__grp_by_items'],
      dtype='object')


Prepare data to save later on …

In [27]:
# Keep only the columns needed for further analysis. Selecting with [[...]] and .copy()
# yields an independent data frame, which avoids the pandas SettingWithCopyWarning
# ("A value is trying to be set on a copy of a slice from a DataFrame")
# TODO check duplicates
collectors_matches_group_simplified = collectors_matches_group[
    ['family', 'given', 'canonical_string_collector_parsed', 
      'namematch_source_data', # redundant: 'namematch_source_data' == 'canonical_string_collector_parsed'
      'namematch_resource_data', 'namematch_distance', 
      'collectors_eventDate_mean', 'collectors_eventDate_min', 'collectors_eventDate_max', # collectors’ dates
      'yob_joined', 'yod_joined', # WikiData dates
      'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 
      'items_joined', 'canonical_string', 'canonical_string_fullname', 'surname', 'initials', 'item_labels_joined'
    ]
].copy()
# collectors_matches_group = collectors_matches_g3
collectors_matches_group_simplified = collectors_matches_group_simplified.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed']
).drop_duplicates()
collectors_matches_group_simplified.head()


Unnamed: 0,family,given,canonical_string_collector_parsed,namematch_source_data,namematch_resource_data,namematch_distance,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,yob_joined,...,harv,ipni,abbr,bionomia_id,items_joined,canonical_string,canonical_string_fullname,surname,initials,item_labels_joined
31,Aaronsohn,A.,"Aaronsohn, A.","Aaronsohn, A.","Aaronsohn, A.",0.0,1907-01-26 12:00:00.000,1906-12-06 00:00:00.000,1907-03-19 00:00:00.000,1876.0,...,30592.0,23-1,Aarons.,Q2086130,http://www.wikidata.org/entity/Q2086130,,"Aaronsohn, Aaron",Aaronsohn,A.,Aaron Aaronsohn
56,Abbas,A.,"Abbas, A.","Abbas, A.","Abbas, A.",0.0,1963-03-03 08:38:52.762,1936-02-11 00:00:00.000,1963-11-01 00:00:00.000,nan|nan,...,,20034668-1,Al.Abbas,,http://www.wikidata.org/entity/Q60141229|http:...,,"Abbas, Alia",Abbas,A.,Alia Abbas|Abdulla Abbas
57,Abbas,A.,"Abbas, A.","Abbas, A.","Abbas, A.",0.0,1963-03-03 08:38:52.762,1936-02-11 00:00:00.000,1963-11-01 00:00:00.000,nan|nan,...,,20034420-1,A.Abbas,,http://www.wikidata.org/entity/Q60141229|http:...,,"Abbas, Abdulla",Abbas,A.,Alia Abbas|Abdulla Abbas
71,Abbe,E.C.,"Abbe, E.C.","Abbe, E.C.","Abbe, E.C.",0.0,1961-03-04 07:37:30.486,1932-01-01 00:00:00.000,1964-08-31 00:00:00.000,1905.0,...,30066.0,26-1,Abbe,Q10274118,http://www.wikidata.org/entity/Q10274118,,"Abbe, Ernst Cleveland",Abbe,E.C.,Ernst Cleveland Abbe
74,Abbiatti,D.,"Abbiatti, D.","Abbiatti, D.","Abbiatti, D.",0.0,1944-05-31 00:00:00.000,1937-10-01 00:00:00.000,1951-01-29 00:00:00.000,1918.0,...,3809.0,27-1,Abbiatti,,http://www.wikidata.org/entity/Q5801800,,"Abbiatti, Delia",Abbiatti,D.,Delia Abbiatti


In [28]:
# old file naturalis_collectors_matches_wikidata_items_group_concat_%s.csv
this_output_file='data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_kneighbor_wditems_group_concat_wdlabels-joined_%s.csv' % (
    this_timestamp_for_data
)

collectors_matches_group.to_csv(this_output_file)

print("Wrote groups of collectors matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # ">> 10" shifts right by 10 bits = integer division by 1024, giving kilobytes
)

Wrote groups of collectors matches into data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_kneighbor_wditems_group_concat_wdlabels-joined_20230913.csv (34448 kB)


### Merge Data to Individual WikiData Items

For this, merge on `namematch_resource_data` so that each row refers to one individual Wikidata item.

In [29]:
print('Merge simply namematch_resource_data to Wiki data for abbreviated and full names... ')
for i, wd_matching_column in enumerate(['canonical_string', 'canonical_string_fullname']):

    # join wikidata items to avh collectors matches
    #   avh_matches = pd.merge(avh, matches, left_on='label', right_on='name')
    #   avh_matches_t1 = pd.merge(avh_matches, wikidata, left_on='matched_name', right_on='canonical_string')
    # link counts of wikidata items with same canonical name string
    #   avh_matches_t2 = pd.merge(avh_matches_t1, wd_test, left_on="matched_name", right_on="canonical_string")
    #   avh_matches_t2.rename(columns = {list(avh_matches_t2.columns)[-1]: 'dup_count'}, inplace=True)
    
    print('Run %s:   Merge all (collectors matches) using namematch_resource_data, add wikidata ...' % (i + 1))
    collectors_matches_wd1 = pd.merge(
        collectors_all_matches, wikidata,
        left_on='namematch_resource_data', right_on=wd_matching_column,
        suffixes=('__coll_all_matches', '__wd')
        # append to left-data, right-data only when identical column names occur
    )

    print('Run %s:   Build data frame “collectors_matches_with_wdata” ...' % (i + 1))
    collectors_matches_with_wdata = collectors_matches_wd1 \
        if i == 0 \
        else pd.concat([collectors_matches_with_wdata, collectors_matches_wd1], ignore_index=True)

print('Done')


Merge simply namematch_resource_data to Wiki data for abbreviated and full names... 
Run 1:   Merge all (collectors matches) using namematch_resource_data, add wikidata ...
Run 1:   Build data frame “collectors_matches_with_wdata” ...
Run 2:   Merge all (collectors matches) using namematch_resource_data, add wikidata ...
Run 2:   Build data frame “collectors_matches_with_wdata” ...
Done
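
The loop above builds the result step by step and relies on `i == 0` to initialise the data frame. A possible alternative, shown here only as a sketch on toy data (frame names are illustrative), collects one merge per Wikidata name column in a list and concatenates once:

```python
import pandas as pd

toy_matches = pd.DataFrame({
    "namematch_resource_data": ["Abbe, E.C.", "Abbe, Ernst Cleveland"],
})
toy_wikidata = pd.DataFrame({
    "canonical_string": ["Abbe, E.C."],
    "canonical_string_fullname": ["Abbe, Ernst Cleveland"],
    "itemLabel": ["Ernst Cleveland Abbe"],
})

# One merge per Wikidata name column, collected in a list and concatenated once
merged_parts = [
    pd.merge(
        toy_matches, toy_wikidata,
        left_on="namematch_resource_data", right_on=wd_matching_column,
    )
    for wd_matching_column in ["canonical_string", "canonical_string_fullname"]
]
toy_with_wdata = pd.concat(merged_parts, ignore_index=True)
print(toy_with_wdata)
```

This avoids a stale `collectors_matches_with_wdata` from an earlier run being picked up if the cell is edited and the first iteration is skipped.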


In [30]:
pprint.pprint(collectors_matches_with_wdata.columns)
# echo "${text}" | fold --spaces | sed 's@^@#  @'

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
       'collectors_eventDate_mean', 'collectors_eventDate_min',
       'collectors_eventDate_max', 'old_index', 'namematch_source_data',
       'namematch_resource_data', 'namematch_distance', 'item', 'itemLabel',
       'surname', 'initials', 'canonical_string', 'canonical_string_fullname',
       'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob',
       'yod', 'wyb', 'wye', 'wikidata_link', 'orcid_link', 'harv_link',
       'ipni_link', 'bionomia_link'],
      dtype='object')


In [31]:
collectors_matches_with_wdata.drop_duplicates(inplace=True)
collectors_matches_with_wdata

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,...,Q2086130,1876.0,1919.0,,,http://www.wikidata.org/wiki/Q2086130,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23-1,https://bionomia.net/Q2086130
1,Abbas,A.,,,,,,,"Abbas, A.",378,...,,,,,,http://www.wikidata.org/wiki/Q60141229,,,https://www.ipni.org/a/20034668-1,
2,Abbas,A.,,,,,,,"Abbas, A.",378,...,,,,,,http://www.wikidata.org/wiki/Q88804360,,,https://www.ipni.org/a/20034420-1,
3,Abbas,H.,,,,,,,"Abbas, H.",1,...,,,,,,http://www.wikidata.org/wiki/Q60141229,,,https://www.ipni.org/a/20034668-1,
4,Abbas,H.,,,,,,,"Abbas, H.",1,...,,,,,,http://www.wikidata.org/wiki/Q88804360,,,https://www.ipni.org/a/20034420-1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62044,Brasilia,Taxonomy Class Universidade,,de,,,,,"Brasilia, Taxonomy Class Universidade",7,...,,1947.0,,,,http://www.wikidata.org/wiki/Q21608903,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/34757-1,
62045,Ephe-Cnrs,Lab Palyn,,,,,,,"Ephe-Cnrs, Lab Palyn",21,...,,,,,,http://www.wikidata.org/wiki/Q59603586,https://orcid.org/0000-0002-0980-1651,,https://www.ipni.org/a/20020083-2,
62046,Rotterdam,Plantenwerkgroep K.N.N.V.,,afd,,,,,"Rotterdam, Plantenwerkgroep K.N.N.V.",4,...,,,,,,http://www.wikidata.org/wiki/Q55313498,,,,
62047,H.M.S. Sulphur,Voyage,,of,,,,,"H.M.S. Sulphur, Voyage",1,...,Q2587569,1834.0,1878.0,,,http://www.wikidata.org/wiki/Q2587569,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/5157-1,https://bionomia.net/Q2587569


Save all columns for further analysis

In [32]:
# old naturalis_collectors_matches_wikidata-botanists_all-columns_%s.csv

this_output_file='data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_kneighbor_names-atomized_all-columns_%s.csv' % (
    this_timestamp_for_data
)

collectors_matches_with_wdata.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed']
    , inplace=True
)
collectors_matches_with_wdata.to_csv(
    this_output_file, index=False # drop index column
)

print("Wrote isolated WikiData items of collector matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # ">> 10" shifts right by 10 bits = integer division by 1024, giving kilobytes
)

Wrote isolated WikiData items of collector matches into data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_kneighbor_names-atomized_all-columns_20230913.csv (27976 kB)


In [33]:
# TODO meaningful?
# remove redundant columns, i.e. columns whose content duplicates another column (and that we usually do not need)
# do this by transposing the data frame (https://www.statology.org/pandas-drop-duplicate-columns/)
compact_df_tmp=collectors_matches_with_wdata.transpose().drop_duplicates().transpose()
compact_df_tmp.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed']
    , inplace=True
)

# old naturalis_collectors_matches_wikidata-botanists_all-columns-made-unique_%s.csv
# results_naturalis_collectors_vs_wikidata-botanists_kneighbor_names-atomized_all-columns_%s.csv
this_output_file='data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_kneighbor_names-atomized_all-columns-compact_%s.csv' % (
    this_timestamp_for_data
)

compact_df_tmp.to_csv(
    this_output_file, index=False # drop index column
)

print("Wrote isolated WikiData items (unique columns) of collector matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # ">> 10" shifts right by 10 bits = integer division by 1024, giving kilobytes
)

Wrote isolated WikiData items (unique columns) of collector matches into data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_kneighbor_names-atomized_all-columns-compact_20230913.csv (27130 kB)
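
The transpose trick treats each column as a row, so `drop_duplicates()` removes every column whose content repeats an earlier column. A minimal sketch on toy data (illustrative values) of what this does:

```python
import pandas as pd

toy = pd.DataFrame({
    "canonical_string_collector_parsed": ["Abbe, E.C.", "Abbas, A."],
    "namematch_source_data": ["Abbe, E.C.", "Abbas, A."],  # same content as the first column
    "namematch_distance": [0.0, 0.0],
})

# Columns become rows, duplicate rows are dropped (first kept), then transposed back
toy_compact = toy.transpose().drop_duplicates().transpose()
print(list(toy_compact.columns))
print(toy_compact.dtypes)
```

Note that transposing a frame with mixed column types usually turns the columns into generic `object` columns, as the printed dtypes show; calling `infer_objects()` afterwards may restore numeric types if they are needed.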


In [34]:
# TODO further evaluation or filtering, counting, clean up aso.

## Documentation

TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nick name from name parsing
appellation | appellation from name parsing
title | title from name parsing
eventDate | date of the sampling event (required by GBIF, see https://www.gbif.org/data-quality-requirements-sampling-events)
collectors_eventDate_min | calculated earliest date of all the sampling events within the data
collectors_eventDate_max | calculated latest date of all the sampling events within the data
collectors_eventDate_mean | calculated mean date of all the sampling events within the data
TODO activity_span | Number of years between first and last collection
**Name matching** |
namematch_source_data | collector name from the source data set that was matched
namematch_resource_data | Wikidata name (derived from the item label) that the collector name is matched to
namematch_distance | Nearest Neighbour distance between the name and matched name; the lower the value, the better the match
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label (perhaps similar to the full name)
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))
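
How `namematch_distance` relates to a cosine similarity depends on settings that are not shown in this section. Assuming the TF-IDF vectors are L2-normalised and the neighbour search uses the Euclidean distance (both are scikit-learn defaults, but this is an assumption here), the distance d and the cosine similarity c are linked by d² = 2 − 2c, so d would range from 0 (identical) to about 1.41 (nothing in common). A small sketch of the conversion; the function name is hypothetical:

```python
import math


def cosine_similarity_from_distance(distance):
    """Cosine similarity implied by a Euclidean distance between unit-length vectors.

    Only valid under the stated assumption (L2-normalised vectors, Euclidean metric).
    """
    return 1.0 - (distance ** 2) / 2.0


# Distances as they occur in the result tables of this notebook
for d in (0.0, 0.36, 1.0, 1.27):
    print(d, round(cosine_similarity_from_distance(d), 3))
```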

Refactoring from <https://github.com/nielsklazenga/avh-collectors/blob/master/match_names_to_wikidata_items.ipynb>

AVH | collector_matching (here)
-|-
avh_matches | collectors_all_matches
wd_test | wd_matchtest