# Match Naturalis Collectors to Wikidata Items Using *Cosine Similarity*

In short, we
- match the `canonical_string` of Wikidata items against the `canonical_string` of the source collectors (abbreviated names, and full names where given),
- parse the collector names beforehand to split name lists in the source data into individual names, using <https://libraries.io/rubygems/dwc_agent>, and
- generally follow the example of Niels Klazenga <https://github.com/nielsklazenga/avh-collectors/blob/master/cosine_similarity.ipynb>
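
The matching idea can be illustrated without any library: split each name into character trigrams, count them, and compare the count vectors by cosine similarity. The following is only a minimal sketch. The helper names `trigrams` and `cosine` are made up for illustration, and the notebook itself uses TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter

def trigrams(name, n=3):
    """Return the character n-grams of a lower-cased name (hypothetical helper)."""
    text = name.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cosine(a, b):
    """Cosine similarity of the n-gram count vectors of two strings."""
    va, vb = Counter(trigrams(a)), Counter(trigrams(b))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

same = cosine("Klazenga, N.", "Klazenga, N.")          # identical strings
similar = cosine("Borgstede, F.", "Borgström, F.L.")   # some shared trigrams
different = cosine("Klazenga, N.", "Boom, B.K.")       # no shared trigrams
print(same, similar, different)
```

Identical strings score (up to rounding) 1, strings without a common trigram score 0, and everything else falls in between.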

Technical notes (the code may need review):
- TODO: refactor some data files to results….csv
- done: run the matching on `canonical_string_fullname` in addition to `canonical_string` (abbreviated names)
- (NN ⇌ Cosine) refactor relation: wd_matchtest ⇌ wikidata_unique (replaced wd_matchtest → wikidata_unique)

## Load Wikidata Data Set

The data set is constructed with the Jupyter notebook [create_wikidata_datasets_botanists.ipynb](./create_wikidata_datasets_botanists.ipynb).

Out of the Wikidata items data set we create a data frame with unique canonical name strings and their counts.

In [28]:
import pandas as pd
import pprint, time, os

wikidata = pd.read_csv("data/wikidata_persons_botanists_20230703_1352.csv", index_col=0, low_memory=False)

pprint.pprint(wikidata.columns)
display(wikidata.head())

Index(['item', 'itemLabel', 'surname', 'initials', 'canonical_string',
       'canonical_string_fullname', 'orcid', 'viaf', 'isni', 'harv', 'ipni',
       'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'wye', 'wikidata_link',
       'orcid_link', 'harv_link', 'ipni_link', 'bionomia_link'],
      dtype='object')


Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.","Bieberstein, Friedrich August Marschall von",,43340073,0000 0001 1630 5464,1373,...,Q66612,1768.0,1826.0,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.","Behr, Hans Hermann",,20328622,0000 0001 1604 8680,42741,...,Q66934,1818.0,1904.0,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.","Schäffer, Jacob Christian",,47016953,0000 0000 8343 3899,1101,...,,1718.0,1790.0,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.","Klotzsch, Johann Friedrich",,20426762,0000 0001 1749 2732,135,...,Q67003,1805.0,1860.0,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.","Menge, Franz Anton",,59847236,0000 0001 1653 0899,73782,...,,1808.0,1880.0,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,


In [2]:
# compile data having only unique canonical strings:
# group by canonical name string and count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()

wd_matchtest
# cols = wd_matchtest.columns.tolist()

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O.H.",1
1,"(1835-1906), G.A.F.E.",1
2,"(1873-1926), S.S.",1
3,"(1888–1973), G.A.",1
4,"(1904-1990), J.J.",1
...,...,...
61479,"Șerbanescu, I.",1
61480,"Ștefureac, T.",1
61481,"Țopa, E.",1
61482,"Ḥalwaǧī, R.",1
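
The dictionary form `agg({'item': ['count']})` produces a two-level column header, which shows up as the `Unnamed: 0_level_0` / `Unnamed: 0_level_1` rows in the output above. As a sketch of an alternative, named aggregation yields flat column names. The data frame `wikidata_toy` and the column name `item_count` are hypothetical; the rest of the notebook keeps using the two-level version.

```python
import pandas as pd

# toy stand-in for the Wikidata data frame (values are illustrative only)
wikidata_toy = pd.DataFrame({
    "item": ["Q1", "Q2", "Q3"],
    "canonical_string": ["Behr, H.H.", "Behr, H.H.", "Menge, F.A."],
})

# named aggregation: one flat output column per keyword argument
unique_names = (
    wikidata_toy
    .groupby("canonical_string")
    .agg(item_count=("item", "count"))
    .reset_index()
)
print(unique_names)
```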


In [3]:
wd_matchtest_fullnames = wikidata.groupby('canonical_string_fullname').agg({'item': ['count']}).reset_index()

wd_matchtest_fullnames


Unnamed: 0_level_0,canonical_string_fullname,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O Heylen",1
1,"(1835-1906), Gustav Adolf Ferdinand Eichler",1
2,"(1873-1926), Søren Sørensen",1
3,"(1888–1973), Georges André",1
4,"(1904-1990), Johannes Johannessen",1
...,...,...
63605,"Șerbanescu, Ioan",1
63606,"Ștefureac, Traian",1
63607,"Țopa, Emilian",1
63608,"Ḥalwaǧī, Riyāḍ",1


## Load Collectors Data Set

**Data sources:**

- Jupyter Notebook for [create_naturalis_gbif-occurrence_collectors_eventDate_dataset.ipynb](./create_naturalis_gbif-occurrence_collectors_eventDate_dataset.ipynb)

Then parse the collector names into single, separate names with `dwcagent`, a Ruby gem available at <https://rubygems.org/gems/dwc_agent>:

- see [./bin/README.md](bin/README.md) on how to use the Ruby script `./bin/agent_parse4tsv.rb` to parse text lines like `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`
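
To make the parsing step more tangible, here is a deliberately naive Python sketch of what splitting such a line means. It is not the algorithm of `dwc_agent`, which handles far more cases (particles, titles, suffixes). The function name `split_name_list` and the separator assumptions are hypothetical.

```python
import re

def split_name_list(text):
    """Naively split a collector string into (family, given) pairs.

    Assumption: names are separated by '&' or by a comma followed by
    whitespace, while the comma inside a name has no space after it.
    """
    parts = re.split(r"\s*&\s*|,\s+", text.strip())
    names = []
    for part in parts:
        family, _, given = part.partition(",")
        names.append((family.strip(), given.strip()))
    return names

parsed = split_name_list("Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B.")
print(parsed)
```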

Technical notes:

- the corresponding objects and variable names in Niels’ Python code were:
```
refactor df_avh = → = collectors
refactor df_avh['label'] = → = collectors['canonical_string_collector_parsed']
…
```

In [4]:
# unique names parsed already by ruby gem package: dwcagent

collectors = pd.read_csv("data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913_parsed.tsv", sep="\t", low_memory=False)

collectors.dropna(subset=['family'], inplace=True) # remove rows where family is NA, e.g. originally «??» etc.

# Out of bounds nanosecond timestamp: 1652-01-01T00:00:00
#  caused by the nanosecond date range limits of pandas, see https://stackoverflow.com/a/69507200/1240387
#  workaround: use datetime or pd.Period(…)
print("modify time using pd.Period(…) to make it work also on very old dates...")
for col in ['eventDate_mean', 'eventDate_min', 'eventDate_max']:
    print("- convert", col, "to pd.Period(...) in collectors")
    collectors[col] = collectors[col].apply(lambda x: pd.Period(x, freq='ms'))
print("done modifying")

collectors.sort_values(by=['family', 'given','occurrenceID_firstsample'], inplace=True)
collectors

modify time using pd.Period(…) to make it work also on very old dates...
- convert eventDate_mean to pd.Period(...) in collectors
- convert eventDate_min to pd.Period(...) in collectors
- convert eventDate_max to pd.Period(...) in collectors
done modifying


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
81414,A,,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1899-08-07 00:00:00.000,1899-08-07 00:00:00.000,1899-08-07 00:00:00.000
170656,A,,,,,,,,5,https://data.biodiversitydata.nl/naturalis/spe...,1981-12-26 00:00:00.000,1981-03-20 00:00:00.000,1983-05-18 00:00:00.000
163328,A,,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
52199,A,,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1997-02-01 00:00:00.000,1997-02-01 00:00:00.000,1997-02-01 00:00:00.000
136326,A,,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1996-07-07 00:00:00.000,1996-07-07 00:00:00.000,1996-07-07 00:00:00.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
178779,Štepánek,J.,,,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1998-11-06 12:00:00.000,1992-05-15 00:00:00.000,2005-04-30 00:00:00.000
85001,Štepánek,J.,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1983-05-09 00:00:00.000,1983-05-09 00:00:00.000,1983-05-09 00:00:00.000
178764,Štepánek,J.,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,2003-06-01 00:00:00.000,2003-06-01 00:00:00.000,2003-06-01 00:00:00.000
62174,Štepánek,J.,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1978-05-17 00:00:00.000,1978-05-17 00:00:00.000,1978-05-17 00:00:00.000
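
The date workaround in the cell above can be checked in isolation. This small sketch assumes nothing beyond pandas itself: it prints the lower bound of the nanosecond-based `Timestamp` range and shows that a `Period` with millisecond frequency can hold a date from 1652.

```python
import pandas as pd

# lower bound of the nanosecond-based Timestamp range
print(pd.Timestamp.min)

# a Period with millisecond frequency can hold much older dates
old_date = pd.Period("1652-01-01T00:00:00", freq="ms")
print(old_date, old_date.year)

# missing values become NaT, as in the table above
missing = pd.Period(float("nan"), freq="ms")
print(missing)
```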


### Check Composition of Parsed Collector Data

In [5]:
# TODO review code of abbreviated names and full name matching
criterion_fullnames = collectors.given.str.contains(r'^\w{3,}', na=False)
print("Show collectors whose given name is (probably) a full name (%s records) …" % len(collectors[criterion_fullnames].index))
collectors[criterion_fullnames]

Show collectors whose given name is (probably) a full name (6730 records) …


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
133764,A-M-V-J,Renier,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
71597,A-ts'ai,Hsieh,,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000
154401,A. Kneucker T,Stuckert,,in,,,,,4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000
78904,AFle,Jolis,,,,,,,420,https://data.biodiversitydata.nl/naturalis/spe...,1860-07-06 19:47:47.797,1800-01-01 00:00:00.000,1983-10-04 00:00:00.000
12995,Aaaa,Bellynck,,,,,,,6,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...
85287,Zwaan Jp,Kleiweg,,de,,,,,10,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
67933,d'Alleizette,Herb,,,,,,,32,https://data.biodiversitydata.nl/naturalis/spe...,1910-01-14 11:04:36.924,1901-11-01 00:00:00.000,1920-05-13 00:00:00.000
17082,d'Anty,Bons,,,,,,,3,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
124795,dePoicy,Pirey,,,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1919-02-01 00:00:00.000,1919-02-01 00:00:00.000,1919-02-01 00:00:00.000
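
How the full-name criterion behaves is easiest to see on a handful of values. The sketch below applies the same regular expression to a toy series; the example values are chosen for illustration. It also shows why the result is only "probably" a full name: initials written without dots, such as `Mdjm`, are flagged as well.

```python
import pandas as pd

given = pd.Series(["J.", "Renier", "H. A. van der", "Mdjm", None])

# True where the given name starts with three or more word characters
looks_like_fullname = given.str.contains(r"^\w{3,}", na=False)
print(pd.DataFrame({"given": given, "looks_like_fullname": looks_like_fullname}))
```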


In [6]:
# check whether the parsed name-part columns are empty or need to be considered as data for matching
for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
    test_collectors = collectors.loc[collectors[parsed_name_part].notna()]
    print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
    display(test_collectors.head())


----------------------------------------
show names with **particle** found 4006 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
154401,A. Kneucker T,Stuckert,,in,,,,,4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000
47,Aa,H. A. van der,,van,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1962-07-07 00:00:00.000,1962-07-07 00:00:00.000,1962-07-07 00:00:00.000
57,Aalst,Mdjm,,van,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1978-12-31 00:00:00.000,1975-06-01 00:00:00.000,1982-08-01 00:00:00.000
114071,Aaron,Native,,boy,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1912-03-01 00:00:00.000,1912-03-01 00:00:00.000,1912-03-01 00:00:00.000
4933,Abdilah,Rasit,,bin,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,2000-10-16 00:00:00.000,2000-10-16 00:00:00.000,2000-10-16 00:00:00.000



----------------------------------------
show names with **suffix** found 22 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
178054,Bakker,Zinderen,Sr.,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1965-01-31 00:00:00.000,1965-01-31 00:00:00.000,1965-01-31 00:00:00.000
58839,Gradstein,,SR,van,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1980-05-08 00:00:00.000,1980-05-08 00:00:00.000,1980-05-08 00:00:00.000
58837,Gradstein,,SR,van,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,1973-11-01 00:00:00.000,1973-11-01 00:00:00.000,1973-11-01 00:00:00.000
84666,Leopold,King,III,,,,,,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
150830,Maurit,Flora,II,,,,,,2,https://data.biodiversitydata.nl/naturalis/spe...,1900-03-14 00:00:00.000,1900-03-14 00:00:00.000,1900-03-14 00:00:00.000



----------------------------------------
show names with **dropping_particle** found 0 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max



----------------------------------------
show names with **appellation** found 1 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
17782,McCullogh,,,,,,Mrs,,34,https://data.biodiversitydata.nl/naturalis/spe...,1975-01-30 09:10:35.294,1975-01-30 00:00:00.000,1975-01-31 00:00:00.000
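
If only the record counts per name part are of interest, they can be obtained in one step instead of a loop. This is a sketch on a toy data frame; `collectors_toy` and its values are hypothetical stand-ins for the parsed collectors.

```python
import pandas as pd

# toy stand-in for the parsed collectors (values are illustrative only)
collectors_toy = pd.DataFrame({
    "family": ["Aa", "Bakker", "McCullogh"],
    "particle": ["van", None, None],
    "suffix": [None, "Sr.", None],
    "dropping_particle": [None, None, None],
    "appellation": [None, None, "Mrs"],
})

name_parts = ["particle", "suffix", "dropping_particle", "appellation"]
# number of non-empty values per name-part column
filled_counts = collectors_toy[name_parts].notna().sum()
print(filled_counts)
```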


Compile the `canonical_string…` column of the collector data, against which the Wikidata names are matched later:

In [7]:
collectors['canonical_string_collector_parsed'] = (
  # use collectors.family only where given name has NA values, otherwise use family name + given name
  collectors.family.where(
      # condition
      collectors.given.isna(),
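      # any other
      # TODO improve the combined name for canonical_string_collector_parsed if any of the other dwc_parsed fields is not NaN
      # NOTE: the `if … else` below is evaluated once for the whole column, not row by row; as long as
      #  any particle is NaN, the first branch applies to all rows, i.e. particles are not included yet
      other= (collectors.family + ", " + collectors.given) \
        if any(collectors.particle.isna()) \
        else collectors.particle + " " + collectors.family + ", " + collectors.given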
  )
)
# move canonical_string_collector_parsed after column title (title was the last of the parsing columns)
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)
collectors.tail()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
178779,Štepánek,J.,,,,,,,"Štepánek, J.",2,https://data.biodiversitydata.nl/naturalis/spe...,1998-11-06 12:00:00.000,1992-05-15 00:00:00.000,2005-04-30 00:00:00.000
85001,Štepánek,J.,,,,,,,"Štepánek, J.",1,https://data.biodiversitydata.nl/naturalis/spe...,1983-05-09 00:00:00.000,1983-05-09 00:00:00.000,1983-05-09 00:00:00.000
178764,Štepánek,J.,,,,,,,"Štepánek, J.",1,https://data.biodiversitydata.nl/naturalis/spe...,2003-06-01 00:00:00.000,2003-06-01 00:00:00.000,2003-06-01 00:00:00.000
62174,Štepánek,J.,,,,,,,"Štepánek, J.",1,https://data.biodiversitydata.nl/naturalis/spe...,1978-05-17 00:00:00.000,1978-05-17 00:00:00.000,1978-05-17 00:00:00.000
146319,Šumberová,K.,,,,,,,"Šumberová, K.",17,https://data.biodiversitydata.nl/naturalis/spe...,2016-08-16 15:31:45.882,2016-08-16 00:00:00.000,2016-08-17 00:00:00.000
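
The TODO in the cell above asks for a combined name that takes the other parsed fields into account. One possible row-wise variant for the particle is sketched below on toy data, using the `particle + " " + family + ", " + given` pattern from the cell. It is an assumption about the intended result, not the notebook's current behaviour, and all names in the example are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "family": ["Aa", "Boom", "A"],
    "given": ["H. A.", "B.K.", None],
    "particle": ["van der", None, None],
})

# family only where given is missing, otherwise "family, given"
base = toy["family"].where(toy["given"].isna(), toy["family"] + ", " + toy["given"])

# prepend the particle row by row, but only where a particle exists
with_particle = toy["particle"] + " " + base
canonical = with_particle.where(toy["particle"].notna(), base)
print(canonical.tolist())
```

Whether a prepended particle actually improves the match against the Wikidata strings would have to be tested, since that depends on how those strings place particles.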


In [8]:
# group and aggregate data to have unique name rows only for the matching of names later on
collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    occurrenceID_collectors_count= ('occurrenceID_count', 'sum'), # sum up the occurrence counts of all rows of a name
    occurrenceID_collectors_firstsample=('occurrenceID_firstsample', lambda x: list(x)[0]), # custom function, to get the first entry
    collectors_eventDate_mean=('eventDate_mean', 'mean'),
    collectors_eventDate_min=('eventDate_min', 'min'),
    collectors_eventDate_max=('eventDate_max', 'max')
).reset_index()

# move canonical_string_collector_parsed after column title
col = collectors_unique.pop("canonical_string_collector_parsed")
collectors_unique.insert(collectors_unique.columns.get_loc('title') + 1, col.name, col)

display(collectors_unique)

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max
0,A,,,,,,,,A,18,https://data.biodiversitydata.nl/naturalis/spe...,1981-04-19 16:00:00.000,1899-08-07 00:00:00.000,1999-12-10 00:00:00.000
1,A'buino'o,,,,,,,,A'buino'o,1,https://data.biodiversitydata.nl/naturalis/spe...,1965-08-24 00:00:00.000,1965-08-24 00:00:00.000,1965-08-24 00:00:00.000
2,A-M-V-J,Renier,,,,,,,"A-M-V-J, Renier",1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
3,A-ts'ai,Hsieh,,,,,,,"A-ts'ai, Hsieh",1,https://data.biodiversitydata.nl/naturalis/spe...,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000
4,A. Kneucker T,Stuckert,,in,,,,,"A. Kneucker T, Stuckert",4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57550,Širjaev,G.I.,,,,,,,"Širjaev, G.I.",32,https://data.biodiversitydata.nl/naturalis/spe...,1927-05-29 22:53:28.696,1924-05-01 00:00:00.000,1932-09-26 00:00:00.000
57551,Šmite,D.,,,,,,,"Šmite, D.",13,https://data.biodiversitydata.nl/naturalis/spe...,1978-01-12 16:30:00.000,1975-01-01 00:00:00.000,1980-09-08 00:00:00.000
57552,Špacek,J.,,,,,,,"Špacek, J.",2,https://data.biodiversitydata.nl/naturalis/spe...,1962-07-10 00:00:00.000,1962-07-10 00:00:00.000,1962-07-10 00:00:00.000
57553,Štepánek,J.,,,,,,,"Štepánek, J.",620,https://data.biodiversitydata.nl/naturalis/spe...,1988-06-14 10:28:20.310,1966-05-25 00:00:00.000,2006-07-13 00:00:00.000
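
A note on the `lambda x: list(x)[0]` aggregations used above: they take the value of the first row of each group, even if it is missing. The built-in `'first'` aggregation of pandas behaves differently, because it returns the first non-null value. The sketch below contrasts the two on toy data; the column names `first_row` and `first_non_null` are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["Aa, H.", "Aa, H.", "Boom, B.K."],
    "particle": [np.nan, "van", np.nan],
})

picked = toy.groupby("name").agg(
    first_row=("particle", lambda x: x.iloc[0]),  # value of the first row, may be NaN
    first_non_null=("particle", "first"),         # first non-null value, if any
)
print(picked)
```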


In [9]:
# show collectors with highest occurrenceID_collectors_count
collectors_unique.sort_values(by=['occurrenceID_collectors_count', 'family'], ascending=[False, True]).head(10)

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max
5724,Boom,B.K.,,,,,,,"Boom, B.K.",51929,https://data.biodiversitydata.nl/naturalis/spe...,1956-02-26 18:12:10.768,1856-01-01 00:00:00.000,1997-04-11 00:00:00.000
6579,Breteler,F.J.,,,,,,,"Breteler, F.J.",41443,https://data.biodiversitydata.nl/naturalis/spe...,1988-09-24 10:56:52.051,1955-06-12 00:00:00.000,2020-03-06 00:00:00.000
32743,Maxwell,J.F.,,,,,,,"Maxwell, J.F.",38782,https://data.biodiversitydata.nl/naturalis/spe...,1996-08-29 12:11:24.527,1969-01-18 00:00:00.000,2013-04-11 00:00:00.000
26981,Koorders,S.H.,,,,,,,"Koorders, S.H.",34173,https://data.biodiversitydata.nl/naturalis/spe...,1915-03-09 11:07:43.928,1829-08-27 00:00:00.000,2012-11-11 00:00:00.000
29034,Leeuwenberg,A.J.M.,,,,,,,"Leeuwenberg, A.J.M.",32867,https://data.biodiversitydata.nl/naturalis/spe...,1973-07-14 13:58:10.508,1926-02-20 00:00:00.000,1999-11-16 00:00:00.000
48116,Soest,J.L.,,,,,,,"Soest, J.L.",31684,https://data.biodiversitydata.nl/naturalis/spe...,1947-10-12 23:09:55.812,1803-08-10 00:00:00.000,1999-06-06 00:00:00.000
613,Ajgh,Kostermans,,,,,,,"Ajgh, Kostermans",30712,https://data.biodiversitydata.nl/naturalis/spe...,1959-02-23 21:53:35.298,1892-09-30 00:00:00.000,1994-11-15 00:00:00.000
23478,Itinere,Stud,,biol Rheno-Trai in,,,,,"Itinere, Stud",29912,https://data.biodiversitydata.nl/naturalis/spe...,1966-03-04 03:14:54.417,1847-06-18 00:00:00.000,1996-07-08 00:00:00.000
55730,Wilde-Duyfjes,B.E.E.,,,,,,,"Wilde-Duyfjes, B.E.E.",29893,https://data.biodiversitydata.nl/naturalis/spe...,1986-10-15 13:20:06.923,1958-06-28 00:00:00.000,2019-09-04 00:00:00.000
55575,Wieringa,J.J.,,,,,,,"Wieringa, J.J.",23282,https://data.biodiversitydata.nl/naturalis/spe...,2006-07-29 21:16:17.781,1980-08-19 00:00:00.000,2022-11-12 00:00:00.000


In [10]:
# TODO continue 2023-08-21 10:28:54
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

## Set Up the Cosine Similarity and Text Search

See
- for the application code: https://github.com/nielsklazenga/avh-collectors/blob/master/cosine_similarity.ipynb
- for background reading: Taylor, Josh. 2019. ‘Fuzzy Matching at Scale’. Towards Data Science (blog). 2 July 2019. https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536.

The `ngrams` function is used as the analyzer of the text search later.

In [11]:
import pandas as pd, numpy as np, re
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn # pip install sparse-dot-topn

def get_matches_df(sparse_matrix, A, B, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = A[sparserows[index]]
        right_side[index] = B[sparsecols[index]]
        similarity[index] = round(sparse_matrix.data[index], 3)

    return pd.DataFrame({'namematch_source_data': left_side,
                         'namematch_resource_data': right_side,
                         'namematch_similarity': similarity})

!pip install ftfy
from ftfy import fix_text

def ngrams(string, n=3):
    """
    Normalise a given text and split it into character n-grams

    @param string: the text string to perform the n-gram splitting on
    @param n: number of characters of each n-gram (default 3)
    @return: list of n-gram strings
    """
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower()
    chars_to_remove = [")","(","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.replace('.', ' ')
    string = string.title()  # normalise case - capital at start of each word
    string = re.sub(' +', ' ', string).strip() # get rid of multiple spaces and replace with a single
    string = ' ' + string + ' '  # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string)
    string = string.strip()  # note: this removes the padding added above again
    this_ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in this_ngrams]

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try 'pacman -S
    python-xyz', where xyz is the package you are trying to
    install.

    If you wish to install a non-Arch-packaged Python package,
    create a virtual environment using 'python -m venv path/to/venv'.
    Then use path/to/venv/bin/python and path/to/venv/bin/pip.

    If you wish to install a non-Arch packaged Python application,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. Make sure you have python-pipx
    installed via pacman.

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-s

In [12]:
print("Show ngram examples:")
print("- simple name:", ngrams('Klazenga, N.'))
print("- data from collectors:", ngrams(collectors_unique["canonical_string_collector_parsed"].at[1])) 
print("- data from match-test:", ngrams(wd_matchtest['canonical_string'].at[0]))
print("- data from match-test (full name):", ngrams(wd_matchtest_fullnames['canonical_string_fullname'].at[3]))


Show ngram examples:
- simple name: ['Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N']
- data from collectors: ['Abu', 'bui', 'uin', 'ino', 'noo']
- data from match-test: ['Wal', 'alr', 'lra', 'rae', 'aev', 'eve', 'ven', 'ens', 'ns ', 's O', ' O ', 'O H']
- data from match-test (full name): ['188', '888', '881', '819', '197', '973', '73 ', '3 G', ' Ge', 'Geo', 'eor', 'org', 'rge', 'ges', 'es ', 's A', ' An', 'And', 'ndr']


In [13]:
# some example data
for i, row in enumerate(range(5)):
    if (i == 0):
        print('(WikiData’s) canonical_string = (constructed) canonical_string_fullname') 
    pprint.pprint("%s = %s" % (
        wd_matchtest['canonical_string'].at[row],
        wd_matchtest_fullnames['canonical_string_fullname'].at[row]
    ))


(WikiData’s) canonical_string = (constructed) canonical_string_fullname
'(-Walraevens), O.H. = (-Walraevens), O Heylen'
'(1835-1906), G.A.F.E. = (1835-1906), Gustav Adolf Ferdinand Eichler'
'(1873-1926), S.S. = (1873-1926), Søren Sørensen'
'(1888–1973), G.A. = (1888–1973), Georges André'
'(1904-1990), J.J. = (1904-1990), Johannes Johannessen'


In [14]:
def calculateTFIDFmatchingOfData(query_data, match_data, cossim_ntop=1, cossim_lower_bound=0.5):
    """
    Calculate a TF-IDF (Term Frequency, Inverse Document Frequency) matching with awesome_cossim_topn() and return the matched data

    @param query_data: array-like of strings (usually the values of a pandas data column) to query for
    @param match_data: array-like of strings (usually a pandas data column) to match against
    @param cossim_ntop: number of best matches to keep per query string (default 1, i.e. only the highest similarity); increase it to get
        alternative matches with lower similarity
    @param cossim_lower_bound: lower similarity cut-off; only matches with a similarity greater than this are kept (default 0.5)

    @requires get_matches_df()
    @requires ngrams()
    @requires awesome_cossim_topn()
    @requires TfidfVectorizer()

    @return: a data frame with the columns namematch_source_data, namematch_resource_data, namematch_similarity (@see get_matches_df())
    @rtype pd.DataFrame
    """

    import time
    time_start = time.time()

    # Vectorize Wikidata name (use fit_transform())
    print('Vectorizing data. This may take a while...')
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
    tf_idf_matrix_clean = vectorizer.fit_transform(match_data)
    # Vectorize collectors’ names (use transform())
    tf_idf_matrix_dirty = vectorizer.transform(query_data)

    duration = time.time() - time_start
    print('Vectorizing completed: created a matrix of TF-IDF features after %s s' % duration)

    # Calculate cosine similarity; keep only the cossim_ntop best matches per query string and only if the similarity is greater than cossim_lower_bound
    # (lower_bound: a threshold that the element of A*B must be greater than
    #  https://github.com/ing-bank/sparse_dot_topn/blob/3f40611b0553b50c27f23c7dcffc3ca9a9e8f5b5/sparse_dot_topn/awesome_cossim_topn.py#L26C9-L26C78)
    cossim_matches = awesome_cossim_topn(
        tf_idf_matrix_dirty,
        tf_idf_matrix_clean.transpose(),
        ntop=cossim_ntop,
        lower_bound=cossim_lower_bound
    )
    print("Cossim matches calculated after %s s" % (time.time() - time_start))

    print("Get all matches together ...")
    # construct the matching data frame
    matches_df = get_matches_df(
        cossim_matches,
        query_data,
        match_data,
        top=0
    )
    print("Done. Matches calculated after %s s" % (time.time() - time_start))

    return matches_df
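
To see what the function computes, the same steps can be sketched on a few names with scikit-learn alone. This is a simplified stand-in under stated assumptions, not the notebook's implementation: it uses the built-in character trigram analyzer instead of `ngrams()`, and a dense matrix product with `argmax` instead of `awesome_cossim_topn()`, which is only workable for small data. The example names and the variable `lower_bound` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reference = ["Klazenga, N.", "Boom, B.K.", "Koorders, S.H."]  # plays the role of match_data
queries = ["Klazenga, N.", "Boom, B."]                          # plays the role of query_data

# character trigrams as features; rows are L2-normalised by default
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
tfidf_reference = vectorizer.fit_transform(reference)
tfidf_queries = vectorizer.transform(queries)

# for L2-normalised rows the dot product equals the cosine similarity
similarities = (tfidf_queries @ tfidf_reference.T).toarray()

lower_bound = 0.5  # same role as cossim_lower_bound
best_matches = []
for row, query in enumerate(queries):
    col = int(np.argmax(similarities[row]))    # best reference name for this query
    score = float(similarities[row, col])
    if score > lower_bound:
        best_matches.append((query, reference[col], round(score, 3)))
print(best_matches)
```

The vectorizer is fitted on the reference names only, so trigrams that occur solely in a query string do not contribute to the dot product.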

In [15]:
criterion_fullnames = collectors_unique.given.str.contains(r'^\w{3,}', na=False)
collectors_names = collectors_unique['canonical_string_collector_parsed'][~criterion_fullnames].values

matches = calculateTFIDFmatchingOfData(
    collectors_names, 
    wd_matchtest['canonical_string'], 
    cossim_ntop=1 # e.g. cossim_ntop=3 would also return alternative matches with lower similarity; the result would grow up to 3 times
)
matches = matches.sort_values(by=['namematch_similarity'], ascending=[False])
matches = matches.reset_index(names=['old_index'])
matches

Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 4.976190090179443 s
Cossim matches calculated after 6.759381532669067 s
Get all matches together ...
Done. Matches calculated after 7.194205284118652 s


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,21176,"Laughton, F.S.","Laughton, F.S.",1.0
1,20196,"Krasnoperova, L.A.","Krasnoperova, L.A.",1.0
2,20125,"Koyama, M.","Koyama, M.",1.0
3,20145,"Kraemer, H.","Kraemer, H.",1.0
4,20148,"Kraft, G.T.","Kraft, G.T.",1.0
...,...,...,...,...
42347,19867,"Kokeil, F.","Keil, F.",0.5
42348,22278,"Lobscheid, W.","Hetterscheid, W.",0.5
42349,6450,"Chai-Anan, C.","Chaianan, C.",0.5
42350,4216,"Borgstede, F.","Borgström, F.L.",0.5


In [16]:
# criterion_fullnames = collectors_unique.given.str.contains('^\w{3,}', na=False)
collectors_fullnames = collectors_unique['canonical_string_collector_parsed'][criterion_fullnames].values
matches_fullnames = calculateTFIDFmatchingOfData(
    collectors_fullnames, 
    wd_matchtest_fullnames['canonical_string_fullname'], 
    cossim_ntop=1 # 10 would give more alternative matches also with lesser similarity
)

matches_fullnames = matches_fullnames.sort_values(by=['namematch_similarity'], ascending=[False])
matches_fullnames = matches_fullnames.reset_index(names=['old_index'])

matches_fullnames

Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 4.292505264282227 s
Cossim matches calculated after 4.46893310546875 s
Get all matches together ...
Done. Matches calculated after 4.475776672363281 s


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,77,"Busu, Baya","Busu, Baya",1.000
1,40,"Aweke, Getachew","Aweke, Getachew",1.000
2,100,"Chiou, Wen-liang","Chiou, Wen Liang",1.000
3,380,"Lu, Sheng-you","Lu, Sheng-You",1.000
4,104,"Chung, Shih-Wen","Chung, Shih Wen",1.000
...,...,...,...,...
623,567,"Unyong, Asah","Strong, Asa B",0.502
624,227,"Fries, Herb","Davies, Thomas Herbert",0.501
625,417,"Nancy, Jard Bot","Davis, Nancy Jane",0.501
626,122,"D', Argy C.","d'Argy, Charles",0.501


### Create Output Results

Combine the matches data frame back to the (Naturalis) collectors and Wikidata items …

Note: merging 18,770,000 collector matches with Wikidata earlier was too expensive to calculate. Hence the decision to make the data unique by `canonical_string_collector_parsed`.

In [17]:
# join (only) abbreviated name matches with collector source data
collectors_matches = pd.merge(
    collectors_unique, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data', 
    how='left'
)

collectors_matches.dropna(subset=['namematch_similarity'], inplace=True)
collectors_matches # 42298 rows × 18 columns

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
6,Aa,H. A. van der,,van,,,,,"Aa, H. A. van der",2,https://data.biodiversitydata.nl/naturalis/spe...,1962-07-07 00:00:00.000,1962-07-07 00:00:00.000,1962-07-07 00:00:00.000,0.0,"Aa, H. A. van der","Aa, H.A.v.d.",0.605
10,Aaku,A.,,,,,,,"Aaku, A.",3,https://data.biodiversitydata.nl/naturalis/spe...,1909-11-19 16:00:00.000,1909-05-28 00:00:00.000,1910-07-29 00:00:00.000,1.0,"Aaku, A.","Karakuş, T.",0.557
12,Aalders,A.,,,,,,,"Aalders, A.",1,https://data.biodiversitydata.nl/naturalis/spe...,1975-06-20 00:00:00.000,1975-06-20 00:00:00.000,1975-06-20 00:00:00.000,2.0,"Aalders, A.","Aalders, L.E.",0.797
13,Aalders,R.,,,,,,,"Aalders, R.",2,https://data.biodiversitydata.nl/naturalis/spe...,1966-06-15 12:00:00.000,1966-05-31 00:00:00.000,1966-07-01 00:00:00.000,3.0,"Aalders, R.","Aalders, L.E.",0.792
19,Aarding,W.,,,,,,,"Aarding, W.",1,https://data.biodiversitydata.nl/naturalis/spe...,2003-10-28 00:00:00.000,2003-10-28 00:00:00.000,2003-10-28 00:00:00.000,4.0,"Aarding, W.","Harding, W.",0.761
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57550,Širjaev,G.I.,,,,,,,"Širjaev, G.I.",32,https://data.biodiversitydata.nl/naturalis/spe...,1927-05-29 22:53:28.696,1924-05-01 00:00:00.000,1932-09-26 00:00:00.000,42347.0,"Širjaev, G.I.","Širjaev, G.I.",1.000
57551,Šmite,D.,,,,,,,"Šmite, D.",13,https://data.biodiversitydata.nl/naturalis/spe...,1978-01-12 16:30:00.000,1975-01-01 00:00:00.000,1980-09-08 00:00:00.000,42348.0,"Šmite, D.","Šmite, D.",1.000
57552,Špacek,J.,,,,,,,"Špacek, J.",2,https://data.biodiversitydata.nl/naturalis/spe...,1962-07-10 00:00:00.000,1962-07-10 00:00:00.000,1962-07-10 00:00:00.000,42349.0,"Špacek, J.","Machacek, J.E.",0.610
57553,Štepánek,J.,,,,,,,"Štepánek, J.",620,https://data.biodiversitydata.nl/naturalis/spe...,1988-06-14 10:28:20.310,1966-05-25 00:00:00.000,2006-07-13 00:00:00.000,42350.0,"Štepánek, J.","Štěpánek, J.",0.605


In [18]:
# join (only) full name matches with collector source data
collectors_matches_fullname = pd.merge(
    collectors_unique, matches_fullnames, 
    left_on='canonical_string_collector_parsed' , right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches_fullname # 628 rows × 18 columns

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,A-ts'ai,Hsieh,,,,,,,"A-ts'ai, Hsieh",1,https://data.biodiversitydata.nl/naturalis/spe...,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000,1929-05-21 00:00:00.000,0,"A-ts'ai, Hsieh","Hsieh, A Tsai",0.770
1,A. Kneucker T,Stuckert,,in,,,,,"A. Kneucker T, Stuckert",4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1902-01-01 00:00:00.000,1,"A. Kneucker T, Stuckert","Kneucker, Johann Andreas",0.520
2,Abbas,Damsah,,,,,,,"Abbas, Damsah",1,https://data.biodiversitydata.nl/naturalis/spe...,1967-06-20 00:00:00.000,1967-06-20 00:00:00.000,1967-06-20 00:00:00.000,2,"Abbas, Damsah","Abbas, Alia",0.511
3,Abdela,Ahmet,,,,,,,"Abdela, Ahmet",113,https://data.biodiversitydata.nl/naturalis/spe...,2007-10-20 00:12:44.601,2007-10-16 00:00:00.000,2007-10-24 00:00:00.000,3,"Abdela, Ahmet","İlçim, Ahmet",0.533
4,Abdullah A,Samat,,bin,,,,,"Abdullah A, Samat",110,https://data.biodiversitydata.nl/naturalis/spe...,1965-08-14 14:02:24.000,1961-08-03 00:00:00.000,2005-04-30 00:00:00.000,4,"Abdullah A, Samat","Abdullah, N",0.681
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
623,Zhong-tao,Wang,,,,,,,"Zhong-tao, Wang",60,https://data.biodiversitydata.nl/naturalis/spe...,1992-01-19 19:33:20.000,1986-08-03 00:00:00.000,2014-06-17 00:00:00.000,623,"Zhong-tao, Wang","Wang, Tao",0.699
624,Zhou,Hang,,,,,,,"Zhou, Hang",3,https://data.biodiversitydata.nl/naturalis/spe...,1974-08-25 16:00:00.000,1912-05-01 00:00:00.000,2006-04-23 00:00:00.000,624,"Zhou, Hang","Zhou, Hang",1.000
625,Zu,Solms-Laubach H.M.C.L.F.,,,,,,,"Zu, Solms-Laubach H.M.C.L.F.",18,https://data.biodiversitydata.nl/naturalis/spe...,1884-01-01 00:00:00.000,1884-01-01 00:00:00.000,1884-01-01 00:00:00.000,625,"Zu, Solms-Laubach H.M.C.L.F.","Solms-Laubach, Hermann zu",0.573
626,Zu,Wied-Neuwied M.A.P.,,,,,,,"Zu, Wied-Neuwied M.A.P.",16,https://data.biodiversitydata.nl/naturalis/spe...,1818-09-24 06:00:00.000,1816-12-01 00:00:00.000,1824-01-01 00:00:00.000,626,"Zu, Wied-Neuwied M.A.P.","Wied-Neuwied, Prince Maximilian of",0.568


In [19]:
# join all name matches together
collectors_all_matches=pd.concat([collectors_matches, collectors_matches_fullname])
collectors_all_matches.sort_values(by=['namematch_similarity', 'family'], ascending=[False, True], inplace=True)
collectors_all_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
21,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,https://data.biodiversitydata.nl/naturalis/spe...,1907-01-26 12:00:00.000,1906-12-06 00:00:00.000,1907-03-19 00:00:00.000,5.0,"Aaronsohn, A.","Aaronsohn, A.",1.0
47,Abbas,A.,,,,,,,"Abbas, A.",378,https://data.biodiversitydata.nl/naturalis/spe...,1963-03-03 08:38:52.762,1936-02-11 00:00:00.000,1963-11-01 00:00:00.000,18.0,"Abbas, A.","Abbas, A.",1.0
52,Abbe,E.C.,,,,,,,"Abbe, E.C.",537,https://data.biodiversitydata.nl/naturalis/spe...,1961-03-04 07:37:30.486,1932-01-01 00:00:00.000,1964-08-31 00:00:00.000,21.0,"Abbe, E.C.","Abbe, E.C.",1.0
55,Abbiatti,D.,,,,,,,"Abbiatti, D.",2,https://data.biodiversitydata.nl/naturalis/spe...,1944-05-31 00:00:00.000,1937-10-01 00:00:00.000,1951-01-29 00:00:00.000,23.0,"Abbiatti, D.","Abbiatti, D.",1.0
66,Abbott,A.T.D.,,,,,,,"Abbott, A.T.D.",14,https://data.biodiversitydata.nl/naturalis/spe...,2002-12-17 01:36:00.000,1997-02-13 00:00:00.000,2010-05-27 00:00:00.000,30.0,"Abbott, A.T.D.","Abbott, A.T.D.",1.0


Save the plain name matching results only ...

In [20]:
if not os.path.exists('data'):
    print("Make data directory for saving …")
    os.makedirs('data')

# Set some global variables
# this_timestamp_for_data=time.strftime('%Y%m%d') # 20230913
this_timestamp_for_data=20230913

this_output_file='data/results_naturalis_collectors_vs_wikidata-botanists_cossim-similarity_plain-names_%s.csv' % (
    this_timestamp_for_data
)

collectors_all_matches.to_csv(this_output_file)

print("Wrote plain name matches of collector names into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10)  # right shift by 10 bits = integer division by 1024 (bytes → kB)
)

Wrote plain name matches of collector names into data/results_naturalis_collectors_vs_wikidata-botanists_cossim-similarity_plain-names_20230913.csv (8901 kB)
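
The file size conversion above relies on a bit shift. A small sketch of the arithmetic, with an arbitrary example value (`size_in_bytes` is made up for illustration):

```python
# Shifting right by 10 bits is integer division by 2**10 = 1024
size_in_bytes = 10000
size_in_kib = size_in_bytes >> 10

print(size_in_kib)            # 9
print(size_in_bytes // 1024)  # 9, the same result
```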


In [21]:
# old code # Join Wikidata items
# df_avh_matches_wikidata = pd.merge(df_avh_matches, df_wikidata                , left_on='namematch_resource_data', right_on='canonical_string', how='left')
# df_avh_matches_wikidata = pd.merge(df_avh_matches_wikidata, df_wikidata_unique, left_on='namematch_resource_data', right_on='canonical_string', how='left')
# df_avh_matches_wikidata.rename(columns={df_avh_matches_wikidata.columns.tolist()[-1]: 'dup_count'}, inplace=True)


In [22]:
# merge now with Wikidata: join the match data and the Wikidata items on the canonical name string
collectors_matches_tmp_names_abbreviated = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string'
)
collectors_matches_tmp_fullnames = pd.merge(
    collectors_matches_fullname, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string_fullname'
)
collectors_matches_g1_merged_wikidata = pd.concat(
    [collectors_matches_tmp_names_abbreviated, collectors_matches_tmp_fullnames]
    , ignore_index=True
)
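
Because `canonical_string` is not unique in the Wikidata set (see «Abbott, E.K.» in the examples further down), the merge repeats a collector row once per Wikidata item that shares the name string. The following sketch with made-up tables shows one way to count these duplicates; the column name `dup_count` is borrowed from the commented-out older code, and `matches_sketch`, `wikidata_sketch` and `merged` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data mirroring the structure of the match and Wikidata tables
matches_sketch = pd.DataFrame({
    "namematch_source_data": ["Abbott, E.K.", "Abbe, E.C."],
    "namematch_resource_data": ["Abbott, E.K.", "Abbe, E.C."],
})
wikidata_sketch = pd.DataFrame({
    "canonical_string": ["Abbott, E.K.", "Abbott, E.K.", "Abbe, E.C."],
    "item": ["Q81587932", "Q113588322", "Q10274118"],
})

merged = pd.merge(
    matches_sketch, wikidata_sketch,
    left_on="namematch_resource_data", right_on="canonical_string",
)

# Number of Wikidata items sharing the matched name string (1 = unambiguous)
merged["dup_count"] = merged.groupby("canonical_string")["item"].transform("count")
print(merged[["namematch_source_data", "item", "dup_count"]])
```

A `dup_count` above 1 marks matches that still need disambiguation, even when the similarity is 1.0.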

In [23]:
print("Show some name match examples (e.g. «Louis…» matching various names) …")
for testname in ['Louis', 'Abbot']:
    criterion = collectors_matches_g1_merged_wikidata['canonical_string_collector_parsed'].map(lambda x: x.startswith(testname))    
    this_table=collectors_matches_g1_merged_wikidata[criterion].get([
        # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
        'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
        'namematch_source_data', 'namematch_resource_data', 'namematch_similarity', 
        # 'canonical_string_fullname', 
        'itemLabel', 'wikidata_link',
        'collectors_eventDate_min', 'collectors_eventDate_max',
        'yob', 'yod', 'wyb', 'wye'
    ]).sort_values(by=['namematch_similarity'], ascending=[False])
    print("# ---------------------------------------------\n# «%s…» as test name, %d collector names begin with:" % (testname, criterion.sum()))    
    display(this_table)

Show some name match examples (e.g. «Louis…» matching various names) …
# ---------------------------------------------
# «Louis…» as test name, 13 collector names begin with:


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_similarity,itemLabel,wikidata_link,collectors_eventDate_min,collectors_eventDate_max,yob,yod,wyb,wye
20089,2,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, A.","Louis, A.",1.0,A. Louis,http://www.wikidata.org/wiki/Q33682458,NaT,NaT,,,,
27698,10542,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, A.M.","Louis, A.M.",1.0,Adriaan M. Louis,http://www.wikidata.org/wiki/Q21338327,1969-04-10 00:00:00.000,2013-03-02 00:00:00.000,1944.0,,,
27699,3339,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, J.L.P.","Louis, J.L.P.",1.0,Jean Laurent Prosper Louis,http://www.wikidata.org/wiki/Q5928759,1900-01-01 00:00:00.000,1998-05-17 00:00:00.000,1903.0,1947.0,,
27700,51,https://data.biodiversitydata.nl/naturalis/spe...,"Louis-Marie, P.",Louis-Marie,0.914,Louis-Marie,http://www.wikidata.org/wiki/Q5981449,1925-07-28 00:00:00.000,1953-07-08 00:00:00.000,1896.0,1978.0,,
20088,59,https://data.biodiversitydata.nl/naturalis/spe...,Louis,"Louis, A.",0.868,A. Louis,http://www.wikidata.org/wiki/Q33682458,1904-05-28 00:00:00.000,1984-09-21 00:00:00.000,,,,
20092,3,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, H.","Louis, A.",0.858,A. Louis,http://www.wikidata.org/wiki/Q33682458,1907-06-01 00:00:00.000,1953-10-01 00:00:00.000,,,,
27701,1,https://data.biodiversitydata.nl/naturalis/spe...,"Louis-Marie, R.P.",Louis-Marie,0.858,Louis-Marie,http://www.wikidata.org/wiki/Q5981449,1934-07-09 00:00:00.000,1934-07-09 00:00:00.000,1896.0,1978.0,,
20090,14,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, F.","Louis, A.",0.857,A. Louis,http://www.wikidata.org/wiki/Q33682458,1910-01-01 00:00:00.000,1953-12-01 00:00:00.000,,,,
20094,4,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, O.","Louis, A.",0.835,A. Louis,http://www.wikidata.org/wiki/Q33682458,1937-07-14 00:00:00.000,1937-07-27 00:00:00.000,,,,
20091,116,https://data.biodiversitydata.nl/naturalis/spe...,"Louis, F.H.","Louis, A.",0.755,A. Louis,http://www.wikidata.org/wiki/Q33682458,1853-08-03 00:00:00.000,1960-09-25 00:00:00.000,,,,


# ---------------------------------------------
# «Abbot…» as test name, 10 collector names begin with:


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_similarity,itemLabel,wikidata_link,collectors_eventDate_min,collectors_eventDate_max,yob,yod,wyb,wye
51,14,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, A.T.D.","Abbott, A.T.D.",1.0,A. T. D. Abbott,http://www.wikidata.org/wiki/Q117328147,1997-02-13 00:00:00.000,2010-05-27 00:00:00.000,1936.0,2013.0,,
52,2,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, E.K.","Abbott, E.K.",1.0,Edwin Kirk Abbott,http://www.wikidata.org/wiki/Q81587932,1889-01-01 00:00:00.000,1889-04-01 00:00:00.000,1840.0,1918.0,,
53,2,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, E.K.","Abbott, E.K.",1.0,Erwin Kirk Abbott,http://www.wikidata.org/wiki/Q113588322,1889-01-01 00:00:00.000,1889-04-01 00:00:00.000,1840.0,1918.0,,
55,10,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, W.L.","Abbott, W.L.",1.0,William Louis Abbott,http://www.wikidata.org/wiki/Q635604,1922-04-05 00:00:00.000,1922-04-30 00:00:00.000,1860.0,1936.0,,
54,106,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, I.A.","Abbott, I.",0.888,Isabella Abbott,http://www.wikidata.org/wiki/Q6077932,1946-05-01 00:00:00.000,1995-02-22 00:00:00.000,1919.0,2010.0,,
49,1,https://data.biodiversitydata.nl/naturalis/spe...,Abbott,"Abbott, G.",0.867,George Abbott,http://www.wikidata.org/wiki/Q47112598,NaT,NaT,,,,
50,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbott, D.P.","Abbott, G.",0.764,George Abbott,http://www.wikidata.org/wiki/Q47112598,1967-08-02 00:00:00.000,1967-08-02 00:00:00.000,,,,
40,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbot-Anderson, M.","Anderson, M.",0.631,Marilyn Anderson,http://www.wikidata.org/wiki/Q44754645,1933-06-21 00:00:00.000,1933-06-21 00:00:00.000,,,,
41,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbot-Anderson, M.","Anderson, M.",0.631,Mary Anderson,http://www.wikidata.org/wiki/Q111694258,1933-06-21 00:00:00.000,1933-06-21 00:00:00.000,1875.0,,,
42,1,https://data.biodiversitydata.nl/naturalis/spe...,"Abbot-Anderson, M.","Anderson, M.",0.631,Mark Anderson,http://www.wikidata.org/wiki/Q111990210,1933-06-21 00:00:00.000,1933-06-21 00:00:00.000,,,,


In [24]:
pprint.pprint(collectors_matches_g1_merged_wikidata.columns)

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
       'collectors_eventDate_mean', 'collectors_eventDate_min',
       'collectors_eventDate_max', 'old_index', 'namematch_source_data',
       'namematch_resource_data', 'namematch_similarity', 'item', 'itemLabel',
       'surname', 'initials', 'canonical_string', 'canonical_string_fullname',
       'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob',
       'yod', 'wyb', 'wye', 'wikidata_link', 'orcid_link', 'harv_link',
       'ipni_link', 'bionomia_link'],
      dtype='object')


In [25]:
collectors_matches_g1_merged_wikidata.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,Aa,H. A. van der,,van,,,,,"Aa, H. A. van der",2,...,,1935.0,2017.0,,,http://www.wikidata.org/wiki/Q967491,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/30582-1,
1,Aaku,A.,,,,,,,"Aaku, A.",3,...,,,,,,http://www.wikidata.org/wiki/Q88884405,,,https://www.ipni.org/a/20039537-1,
2,Waku,T.,,,,,,,"Waku, T.",10,...,,,,,,http://www.wikidata.org/wiki/Q88884405,,,https://www.ipni.org/a/20039537-1,
3,Aalders,A.,,,,,,,"Aalders, A.",1,...,,1933.0,2005.0,,,http://www.wikidata.org/wiki/Q21340898,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/22-1,
4,Aalders,R.,,,,,,,"Aalders, R.",2,...,,1933.0,2005.0,,,http://www.wikidata.org/wiki/Q21340898,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/22-1,


In [26]:
# Select useful columns for data results
# (.copy() makes an independent data frame, so the in-place sort below does not act on a slice)
collectors_wikidata_cossim = collectors_matches_g1_merged_wikidata[
    ['canonical_string_collector_parsed', 'family', 'given', 
     'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
    'namematch_source_data', 'namematch_resource_data', 'namematch_similarity', 
    'item', 'canonical_string', 'itemLabel',
    'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 
    'collectors_eventDate_mean', 'collectors_eventDate_min', 'collectors_eventDate_max',
     'yob', 'yod', 'wyb'
    ]
].copy()

# Order by similarity (descending), then by family name (ascending)
collectors_wikidata_cossim.sort_values(by=['namematch_similarity', 'family'], ascending=[False, True], inplace=True)

collectors_wikidata_cossim # the match «Kotschy, Karl Georg Th» (collector data) →← «Kotschy, T» (Wikidata) has only 0.5 similarity but is the correct person we need


Unnamed: 0,canonical_string_collector_parsed,family,given,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_similarity,item,canonical_string,...,harv,ipni,abbr,bionomia_id,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,yob,yod,wyb
8,"Aaronsohn, A.",Aaronsohn,A.,3,https://data.biodiversitydata.nl/naturalis/spe...,"Aaronsohn, A.","Aaronsohn, A.",1.0,http://www.wikidata.org/entity/Q2086130,"Aaronsohn, A.",...,30592,23-1,Aarons.,Q2086130,1907-01-26 12:00:00.000,1906-12-06 00:00:00.000,1907-03-19 00:00:00.000,1876.0,1919.0,
25,"Abbas, A.",Abbas,A.,378,https://data.biodiversitydata.nl/naturalis/spe...,"Abbas, A.","Abbas, A.",1.0,http://www.wikidata.org/entity/Q60141229,"Abbas, A.",...,,20034668-1,Al.Abbas,,1963-03-03 08:38:52.762,1936-02-11 00:00:00.000,1963-11-01 00:00:00.000,,,
26,"Abbas, A.",Abbas,A.,378,https://data.biodiversitydata.nl/naturalis/spe...,"Abbas, A.","Abbas, A.",1.0,http://www.wikidata.org/entity/Q88804360,"Abbas, A.",...,,20034420-1,A.Abbas,,1963-03-03 08:38:52.762,1936-02-11 00:00:00.000,1963-11-01 00:00:00.000,,,
31,"Abbe, E.C.",Abbe,E.C.,537,https://data.biodiversitydata.nl/naturalis/spe...,"Abbe, E.C.","Abbe, E.C.",1.0,http://www.wikidata.org/entity/Q10274118,"Abbe, E.C.",...,30066,26-1,Abbe,Q10274118,1961-03-04 07:37:30.486,1932-01-01 00:00:00.000,1964-08-31 00:00:00.000,1905.0,2000.0,
33,"Abbiatti, D.",Abbiatti,D.,2,https://data.biodiversitydata.nl/naturalis/spe...,"Abbiatti, D.","Abbiatti, D.",1.0,http://www.wikidata.org/entity/Q5801800,"Abbiatti, D.",...,3809,27-1,Abbiatti,,1944-05-31 00:00:00.000,1937-10-01 00:00:00.000,1951-01-29 00:00:00.000,1918.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15817,"Siemelink, M.",Siemelink,M.,10,https://data.biodiversitydata.nl/naturalis/spe...,"Siemelink, M.","Semmelink, J.",0.5,http://www.wikidata.org/entity/Q2347342,"Semmelink, J.",...,10563,,,Q2347342,1979-02-08 06:51:25.714,1979-02-04 00:00:00.000,1979-02-10 00:00:00.000,1837.0,1912.0,
40399,"Stinson, E.B.",Stinson,E.B.,1,https://data.biodiversitydata.nl/naturalis/spe...,"Stinson, E.B.","Johnson, E.B.W.",0.5,http://www.wikidata.org/entity/Q100701296,"Johnson, E.B.W.",...,,,,Q100701296,1888-09-18 00:00:00.000,1888-09-18 00:00:00.000,1888-09-18 00:00:00.000,1897.0,1961.0,
42089,"Tulla, H.",Tulla,H.,4,https://data.biodiversitydata.nl/naturalis/spe...,"Tulla, H.","Bregulla, H.",0.5,http://www.wikidata.org/entity/Q1596829,"Bregulla, H.",...,,,,Q1596829,1932-05-14 16:00:00.000,1930-06-25 00:00:00.000,1935-08-21 00:00:00.000,1930.0,2013.0,
18828,"Valder, P.G.",Valder,P.G.,6,https://data.biodiversitydata.nl/naturalis/spe...,"Valder, P.G.","Halder, S.",0.5,http://www.wikidata.org/entity/Q21514516,"Halder, S.",...,97509,20013467-2,Halder,,1986-11-13 04:48:00.000,1978-08-01 00:00:00.000,1995-01-30 00:00:00.000,1987.0,,


In [27]:
# TODO further evaluation or filtering, counting, clean-up, etc.
if not os.path.exists('data'):
    os.makedirs('data')

# naturalis_collectors_cosine-similarity_wikidata-botanists_%s.csv
this_output_file='data/results_naturalis_collectors_vs_wikidata-botanists_cossim-similarity_merged-data_%s.csv' % (
    this_timestamp_for_data
)

collectors_wikidata_cossim.to_csv(this_output_file)

print("Wrote matches of collector names into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10)  # right shift by 10 bits = integer division by 1024 (bytes → kB)
)

Wrote matches of collector names into data/results_naturalis_collectors_vs_wikidata-botanists_cossim-similarity_merged-data_20230913.csv (14123 kB)
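
One possible direction for the evaluation noted in the TODO above is a date plausibility check: collecting dates before birth or after death contradict a match. The following is only a sketch under that assumption; the function name `is_plausible_match` and the minimum collecting age `min_age` are hypothetical and not part of this notebook.

```python
import math

def is_plausible_match(first_year, last_year, yob, yod, min_age=10):
    """Return False when collecting years contradict the Wikidata life dates.

    Missing values (None or NaN) are treated as unknown and never reject a match.
    """
    def known(value):
        return value is not None and not (isinstance(value, float) and math.isnan(value))

    if known(yob) and known(first_year) and first_year < yob + min_age:
        return False
    if known(yod) and known(last_year) and last_year > yod:
        return False
    return True

# «Stinson, E.B.» collected in 1888, the matched person lived 1897 to 1961
print(is_plausible_match(1888, 1888, 1897.0, 1961.0))  # False
# «Aaronsohn, A.» collected 1906 to 1907, the matched person lived 1876 to 1919
print(is_plausible_match(1906, 1907, 1876.0, 1919.0))  # True
```

Such a check is better used as a flag for manual review than as a hard filter, because collection dates in the source data can themselves be wrong or span several people with the same name string.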


## Documentation

TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nickname from name parsing
appellation | appellation from name parsing
title | title from name parsing
occurrenceID_collectors_count | number of occurrenceIDs for one particular collector name
occurrenceID_collectors_firstsample | one sample occurrenceID for this collector name
eventDate | date of the sampling event (required by GBIF, see https://www.gbif.org/data-quality-requirements-sampling-events)
collectors_eventDate_min | calculated earliest date of all the sampling events within the data
collectors_eventDate_max | calculated latest date of all the sampling events within the data
collectors_eventDate_mean | calculated mean date of all the sampling events within the data
TODO activity_span | Number of years between first and last collection
**Name matching** |
namematch_source_data | matched name from the collector data set
namematch_resource_data | name from Wikidata that the collector name was matched to
namematch_similarity | calculated cosine similarity
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label (perhaps similar to the full name)
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))
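
For orientation on `namematch_similarity`: cosine similarity compares two names as vectors and lies between 0 and 1 for non-negative counts. The sketch below uses raw character trigram counts in plain Python. The trigram size, the lowercasing and the absence of any TF-IDF weighting are assumptions, so its values will not reproduce the scores in the tables above.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b, n=3):
    """Cosine of the angle between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # strings shorter than n have no n-grams
    return dot / (norm_a * norm_b)

print(round(cosine_similarity("Zhou, Hang", "Zhou, Hang"), 3))  # 1.0
print(cosine_similarity("Louis, H.", "Louis, A.") < 1.0)        # True
```

Names that differ only in one initial still share most of their trigrams, which is why pairs such as «Louis, H.» and «Louis, A.» score high without referring to the same person.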