# Match Naturalis Collectors to Wikidata Items Using *Cosine Similarity*, `eventDate` Involved.

In this example we add the `eventDate` of the source data (the date the sample/occurrence was collected) as a time reference for when the collector should have been alive.

In short, we:

- match the `canonical_string` of Wikidata items against the `canonical_string` of the source collectors (abbreviated names and, where given, full names),
- parse the collector names of the source data beforehand to split name lists into individual names, using <https://libraries.io/rubygems/dwc_agent>,
- follow, in general, the example of Niels Klazenga <https://github.com/nielsklazenga/avh-collectors/blob/master/cosine_similarity.ipynb>, and
- write the output so that it provides a DarwinCore attribution structure (for `verbatimName` we would need the `source_data` name(s)).

Technical notes (the code may need review):
- TODO review the score calculation that relates `eventDate` to the range of `yob` and `yod` (year of birth, year of death)
- TODO review the DwC agent output; for now, keep the custom columns for convenient filtering, sorting and evaluation
- (NN ⇌ Cosine) refactor relation: wd_matchtest ⇌ wikidata_unique (replaced wd_matchtest → wikidata_unique)
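
The time reference mentioned above could be turned into a simple plausibility check. The following is only a minimal sketch under stated assumptions, not the score calculation used in this notebook: the function name `collected_during_lifetime` and the parameters `min_age` and `max_lifespan` are hypothetical, and the default values are arbitrary choices for illustration.

```python
import pandas as pd

def collected_during_lifetime(event_min, event_max, yob, yod,
                              min_age=10, max_lifespan=100):
    """Hypothetical helper: does the collecting period fit the person's lifetime?

    Returns True or False when a decision is possible and None when the
    collecting dates or the year of birth are missing.
    """
    if pd.isna(event_min) or pd.isna(event_max) or pd.isna(yob):
        return None
    earliest = int(yob) + min_age  # assumption: nobody collects before this age
    # assumption: without a year of death, allow at most max_lifespan years
    latest = int(yod) if not pd.isna(yod) else int(yob) + max_lifespan
    return earliest <= event_min.year and event_max.year <= latest

# Hans Hermann Behr (1818 to 1904, taken from the Wikidata sample shown below)
print(collected_during_lifetime(
    pd.Period("1850-05-01", freq="D"), pd.Period("1860-01-01", freq="D"), 1818, 1904))
print(collected_during_lifetime(
    pd.Period("1950-05-01", freq="D"), pd.Period("1960-01-01", freq="D"), 1818, 1904))
```

Returning `None` for missing data keeps "unknown" apart from "implausible", which may be useful when sorting or filtering the match results later.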

## Load Wikidata Data Set

The data set is constructed with the Jupyter notebook [create_wikidata_datasets_botanists.ipynb](./create_wikidata_datasets_botanists.ipynb).

From the Wikidata items data set we create a data frame of unique canonical name strings and their counts.

In [1]:
import pandas as pd
import pprint, time, os

wikidata = pd.read_csv(
    # "data/wikidata_persons_botanists_20231030_1539.csv", # inverse match: [particle +] family, given
    "data/wikidata_persons_botanists_20231116.csv",        # match: given [+ particle] + family[+ , suffix]
    index_col=0, low_memory=False,
    dtype={
        'yob':'Int32',
        'yod':'Int32',
        'wyb':'Int32',
        'wye':'Int32'
    }
)
pprint.pprint(wikidata.columns)
display(wikidata.head())

Index(['item', 'itemLabel', 'surname', 'initials', 'canonical_string',
       'canonical_string_fullname', 'orcid', 'viaf', 'isni', 'harv', 'ipni',
       'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'wye', 'wikidata_link',
       'orcid_link', 'harv_link', 'ipni_link', 'bionomia_link'],
      dtype='object')


Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Friedrich August Marschall von,F. A. M. v.,F. A. M. v. Bieberstein,Friedrich August Marschall von Bieberstein,,43340073,0000 0001 1630 5464,1373,...,Q66612,1768,1826,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Hans Hermann,H. H.,H. H. Behr,Hans Hermann Behr,,20328622,0000 0001 1604 8680,42741,...,Q66934,1818,1904,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Jacob Christian,J. C.,J. C. Schäffer,Jacob Christian Schäffer,,47016953,0000 0000 8343 3899,1101,...,,1718,1790,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Johann Friedrich,J. F.,J. F. Klotzsch,Johann Friedrich Klotzsch,,20426762,0000 0001 1749 2732,135,...,Q67003,1805,1860,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Franz Anton,F. A.,F. A. Menge,Franz Anton Menge,,59847236,0000 0001 1653 0899,73782,...,,1808,1880,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,


In [2]:
# compile data having only unique canonical strings
# group by canonical name/string, count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()
wd_matchtest_fullnames = wikidata.groupby('canonical_string_fullname').agg({'item': ['count']}).reset_index()

display(wd_matchtest)
display(wd_matchtest_fullnames)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,('W.') S. W. Wong,1
1,(A. A.) G. L. Monnier,1
2,(A.) F. Valet,1
3,(A.) H. (S.) Stenar,1
4,(A.) T. Wegelin,1
...,...,...
69099,Э. Э. Керн,1
69100,Ю. К. Шель,1
69101,Ю. П. Нюкша,1
69102,Я. Я. Алексеев,1


Unnamed: 0_level_0,canonical_string_fullname,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,('Wilson') Sze Wing Wong,1
1,(Alexandre Alexis) George Le Monnier,1
2,(Antonius) Theodoor Wegelin,1
3,(August) Friedrich Valet,1
4,(Axel) Helge (Svensson) Stenar,1
...,...,...
71394,Эдуард Эдуардович Керн,1
71395,Юлиан Карлович Шель,1
71396,Юлия Петровна Нюкша,1
71397,Яков Яковлевич Алексеев,1
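
The two header rows in the outputs above result from passing a list (`['count']`) to `agg`, which creates two-level column names. A minimal sketch with made-up toy data (the frame `toy` and its values are assumptions for illustration only) shows an alternative that yields flat column names:

```python
import pandas as pd

# toy data, not taken from the real Wikidata data set
toy = pd.DataFrame({
    "item": ["Q1", "Q2", "Q3"],
    "canonical_string": ["J. Smith", "J. Smith", "A. Jones"],
})

# size() counts the rows per group; reset_index(name=...) gives a flat 'count' column
counts = toy.groupby("canonical_string").size().reset_index(name="count")
print(counts)
```

Note that `size()` counts all rows of a group, whereas `count` on the `item` column only counts rows where `item` is not missing.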


## Load Collectors Data Set

**Data sources:**

- Jupyter Notebook for [create_naturalis_gbif-occurrence_collectors_eventDate_dataset.ipynb](./create_naturalis_gbif-occurrence_collectors_eventDate_dataset.ipynb)

Then parse the collector names into single, separate names with `dwc_agent`, a Ruby gem available at <https://rubygems.org/gems/dwc_agent>:

- see [./bin/README.md](bin/README.md) on how to use the Ruby script `./bin/agent_parse4tsv.rb` to parse text lines like `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`
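
The parsed file keeps the complete name list of a record in one text field, prefixed with `parsed:` and separated by `<SEP>` (see the `parsed_names` column in the output further below). As a minimal sketch, such a field could be split into individual names as follows; the helper `split_parsed_names` is hypothetical and not part of `dwc_agent` or of this notebook.

```python
def split_parsed_names(field, prefix="parsed:", sep="<SEP>"):
    """Hypothetical helper: split a '<SEP>'-separated name list into single names."""
    if not isinstance(field, str):
        return []  # missing values (NaN) yield an empty list
    if field.startswith(prefix):
        field = field[len(prefix):]
    # drop empty entries, which occur when a name could not be cleaned
    return [name.strip() for name in field.split(sep) if name.strip()]

print(split_parsed_names("parsed:T. Wendt<SEP>A. Villalobos C<SEP>A"))
print(split_parsed_names("cleaned:<SEP>A", prefix="cleaned:"))
```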

Technical notes:

- the corresponding objects and variable names in Niels’ Python code were:
```
refactor df_avh = → = collectors
refactor df_avh['label'] = → = collectors['canonical_string_collector_parsed']
…
```

In [3]:
# unique names parsed already by ruby gem package: dwcagent

collectors = pd.read_csv("data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913_parsed.tsv", sep="\t", low_memory=False)

collectors.dropna(subset=['family'], inplace=True) # remove rows where family is NA, e.g. originally «??» and so on

# Out of bounds nanosecond timestamp: 1652-01-01T00:00:00
#  caused by the nanosecond date range limitation of pandas, see https://stackoverflow.com/a/69507200/1240387
#  workaround: use datetime or pd.Period(…)
print("modify time using pd.Period(…) to make it work also on very old dates...")
for col in ['eventDate_mean', 'eventDate_min', 'eventDate_max']:
    print("- convert", col, "to pd.Period(...) in collectors")
    collectors[col] = collectors[col].apply(
        lambda x: pd.Period(
            x, freq='S' if col.lower().endswith('mean') else 'D' # Seconds or Day level
        )
    ) # D=day level
    # see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-period-aliases
print("modifying done.")

collectors.sort_values(by=['family', 'given','occurrenceID_firstsample'], inplace=True)
collectors

modify time using pd.Period(…) to make it work also on very old dates...
- convert eventDate_mean to pd.Period(...) in collectors
- convert eventDate_min to pd.Period(...) in collectors
- convert eventDate_max to pd.Period(...) in collectors
modifying done.


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
86130,A,,,,,,,,Kampen PN van; A,parsed:Kampen PN van<SEP>A,cleaned:<SEP>A,1,https://data.biodiversitydata.nl/naturalis/spe...,1899-08-07 00:00:00,1899-08-07,1899-08-07
180474,A,,,,,,,,Wendt T; Villalobos C A; A; Navarrete I,parsed:T. Wendt<SEP>A. Villalobos C<SEP>A<SEP>...,cleaned:T. Wendt<SEP>A. Villalobos C<SEP>A<SEP...,5,https://data.biodiversitydata.nl/naturalis/spe...,1981-12-26 00:00:00,1981-03-20,1983-05-18
172212,A,,,,,,,,Unknown; A,parsed:A,cleaned:A,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
55251,A,,,,,,,,Fuentes C; A; Rosa de la,parsed:C. Fuentes<SEP>A<SEP>Rosa de la,cleaned:C. Fuentes<SEP>A<SEP>,1,https://data.biodiversitydata.nl/naturalis/spe...,1997-02-01 00:00:00,1997-02-01,1997-02-01
143852,A,,,,,,,,Romero EM; Fuentes RE; A,parsed:E.M. Romero<SEP>R.E. Fuentes<SEP>A,cleaned:E.M. Romero<SEP>R.E. Fuentes<SEP>A,1,https://data.biodiversitydata.nl/naturalis/spe...,1996-07-07 00:00:00,1996-07-07,1996-07-07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189361,Štepánek,J.,,,,,,,Štepánek J; Trávnícek B,parsed:J. Štepánek<SEP>B. Trávnícek,cleaned:J. Štepánek<SEP>B. Trávnícek,2,https://data.biodiversitydata.nl/naturalis/spe...,1998-11-06 12:00:00,1992-05-15,2005-04-30
89846,Štepánek,J.,,,,,,,Kirschnerová L; Kirschner J; Štepánek J,parsed:L. Kirschnerová<SEP>J. Kirschner<SEP>J....,cleaned:L. Kirschnerová<SEP>J. Kirschner<SEP>J...,1,https://data.biodiversitydata.nl/naturalis/spe...,1983-05-09 00:00:00,1983-05-09,1983-05-09
189346,Štepánek,J.,,,,,,,Štepánek J; Jakl J,parsed:J. Štepánek<SEP>J. Jakl,cleaned:J. Štepánek<SEP>J. Jakl,1,https://data.biodiversitydata.nl/naturalis/spe...,2003-06-01 00:00:00,2003-06-01,2003-06-01
65694,Štepánek,J.,,,,,,,Hadinec J; Kirschner J; Sourkova M; Štepánek J,parsed:J. Hadinec<SEP>J. Kirschner<SEP>M. Sour...,cleaned:J. Hadinec<SEP>J. Kirschner<SEP>M. Sou...,1,https://data.biodiversitydata.nl/naturalis/spe...,1978-05-17 00:00:00,1978-05-17,1978-05-17
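
The conversion cell above works around the limited date range of nanosecond-based timestamps. A minimal sketch of why `pd.Period` helps here (the variable names are chosen for illustration only):

```python
import pandas as pd

# nanosecond-based timestamps cannot represent dates before the year 1677
earliest_ns = pd.Timestamp.min
print("earliest nanosecond timestamp:", earliest_ns)

# a Period with day frequency can represent much older dates
old_date = pd.Period("1652-01-01", freq="D")
print("period:", old_date, "year:", old_date.year)

# periods can still be compared, and missing values stay missing (NaT)
print(old_date < pd.Period("1899-08-07", freq="D"))
missing = pd.Period("NaT", freq="D")
print(pd.isna(missing))
```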


### Check Composition of Parsed Collector Data

In [4]:
# TODO review code of abbreviated names and full name matching
criterion_fullnames = collectors.given.str.contains(r'^\w{3,}', na=False)
print("Show collectors whose given name is (probably) a full name (%s records) …" % len(collectors[criterion_fullnames].index))
collectors[criterion_fullnames]

Show collectors whose given name is (probably) a full name (6757 records) …


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
141189,A-M-V-J,Renier,,,,,,,Renier A-M-V-J,parsed:Renier A-M-V-J,cleaned:Renier A-M-V-J,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
75652,A-ts'ai,Hsieh,,,,,,,Hsieh A-ts'ai,parsed:Hsieh A-ts'ai,cleaned:Hsieh A-ts'ai,1,https://data.biodiversitydata.nl/naturalis/spe...,1929-05-21 00:00:00,1929-05-21,1929-05-21
162946,A. Kneucker T,Stuckert,,in,,,,,Stuckert in A. Kneucker T,parsed:Stuckert in A. Kneucker T,cleaned:Stuckert in A. Kneucker T,4,https://data.biodiversitydata.nl/naturalis/spe...,1902-01-01 00:00:00,1902-01-01,1902-01-01
83422,AFle,Jolis,,,,,,,Jolis AFle,parsed:Jolis AFle,cleaned:Jolis AFle,420,https://data.biodiversitydata.nl/naturalis/spe...,1860-07-06 19:47:47,1800-01-01,1983-10-04
14090,Aaaa,Bellynck,,,,,,,Bellynck AAAA,parsed:Bellynck AAAA,cleaned:Bellynck Aaaa,6,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144648,le,Roux A.,,,,,,,Roux A le; Lloyd JW,parsed:Roux A. le<SEP>J.W. Lloyd,cleaned:Roux A. le<SEP>J.W. Lloyd,1,https://data.biodiversitydata.nl/naturalis/spe...,1985-03-09 00:00:00,1985-03-09,1985-03-09
144650,le,Roux A.,,,,,,,Roux A le; Ramsey M,parsed:Roux A. le<SEP>M. Ramsey,cleaned:Roux A. le<SEP>M. Ramsey,9,https://data.biodiversitydata.nl/naturalis/spe...,1977-12-16 21:20:00,1972-09-13,1978-09-13
163366,le,Sueur F.A.,,,,,,,Sueur FA le,parsed:Sueur F.A. le,cleaned:Sueur F.A. le,23,https://data.biodiversitydata.nl/naturalis/spe...,1977-01-09 19:12:00,1951-03-15,1981-04-01
167399,le,Testu G.M.P.C.,,,,,,,Testu GMPC le,parsed:Testu G.M.P.C. le,cleaned:Testu G.M.P.C. le,3,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
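
The criterion above regards a `given` value that starts with three or more word characters as a probable full name. A minimal sketch with toy values (partly borrowed from the outputs in this notebook) shows how the pattern behaves, including a false positive for initials that are written without dots:

```python
import pandas as pd

given = pd.Series(["J.", "H. A. van der", "Renier", None, "Mdjm", "F.A."])

# True where the value starts with at least three word characters
is_fullname = given.str.contains(r'^\w{3,}', na=False)

for value, flag in zip(given, is_fullname):
    print(repr(value), "->", flag)
```

`Mdjm` is flagged as a full name although it presumably represents run-together initials, so the result of this criterion should be treated as a heuristic.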


In [5]:
# check whether the name-parsed columns are empty or need to be considered as data for matching
for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
    test_collectors = collectors.loc[collectors[parsed_name_part].notna()]
    print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
    display(test_collectors.head().get(["family", "given", "suffix", "particle", "dropping_particle", "nick", "appellation", "title"]))


----------------------------------------
show names with **particle** found 4802 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
162946,A. Kneucker T,Stuckert,,in,,,,
68,Aa,H. A. van der,,van,,,,
52,Aa,H.A.,,van der,,,,
81,Aalst,Mdjm,,van,,,,
89,Aanen,D.K.,,van der,,,,



----------------------------------------
show names with **suffix** found 22 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
188594,Bakker,Zinderen,Sr.,,,,,
62158,Gradstein,,SR,van,,,,
62137,Gradstein,,SR,van,,,,
89493,Leopold,King,III,,,,,
159263,Maurit,Flora,II,,,,,



----------------------------------------
show names with **dropping_particle** found 0 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title



----------------------------------------
show names with **appellation** found 1 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
19154,McCullogh,,,,,,Mrs,


Compile the `canonical_string…` for the collector data, which we will later match against the Wikidata names:

In [6]:
# combine parts of names similar to WikiData's given name labels
collectors['canonical_string_collector_parsed'] = collectors[['given', 'particle', 'family', 'suffix']]\
    .fillna('')\
    .apply(
        lambda this_df: "{given}{particle}{family}{suffix}".format(
            given=this_df["given"],
            particle=" " + this_df["particle"] if this_df["particle"] else '', 
            family=" " + this_df["family"] if this_df["family"] else '', 
            suffix=", " + this_df["suffix"] if this_df["suffix"] else ''
        ), axis="columns"
    )

criterion = collectors["particle"].str.contains(r"\w+ \w+", na=False)

# display(collectors['canonical_string_collector_parsed'][criterion].head())
collectors[['canonical_string_collector_parsed', 'particle']][criterion].drop_duplicates().head(10)


Unnamed: 0,canonical_string_collector_parsed,particle
52,H.A. van der Aa,van der
89,D.K. van der Aanen,van der
92,P.J.M. van der Aart,van der
542,A. van der Abdullah,van der
683,Thaj van der Abeleven,van der
760,H. van der Abé,van der
2096,C. van der Alders,van der
2099,K. van der Alders,van der
2731,J. J. M. van van der Alphen,van der
2837,W. van der Altenburg,van der
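
The composition rule used above is `given [+ particle] + family [+ , suffix]`. As a minimal sketch, the same rule can be written as a small function; `build_canonical_string` is a hypothetical name and not part of this notebook. Unlike the cell above, which prefixes `family` with a space even when `given` is empty, the sketch joins only non-empty parts and therefore produces no leading space.

```python
def build_canonical_string(given="", particle="", family="", suffix=""):
    """Hypothetical helper: compose 'given [particle] family[, suffix]'."""
    # join only the non-empty parts with single spaces
    name = " ".join(part for part in (given, particle, family) if part)
    return f"{name}, {suffix}" if suffix else name

print(build_canonical_string("H.A.", "van der", "Aa"))
print(build_canonical_string(family="Aba"))
print(build_canonical_string("Zinderen", "", "Bakker", "Sr."))
```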


In [8]:
# move canonical_string_collector_parsed after column title (title was the last of the parsing columns)
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)

these_columns=["family", "given", "suffix", "particle", "dropping_particle", "nick", "appellation", "title", 'canonical_string_collector_parsed']

if 'source_data' in collectors.columns:
    these_columns.append("source_data")

display(collectors.tail().get(these_columns))

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data
189361,Štepánek,J.,,,,,,,J. Štepánek,Štepánek J; Trávnícek B
89846,Štepánek,J.,,,,,,,J. Štepánek,Kirschnerová L; Kirschner J; Štepánek J
189346,Štepánek,J.,,,,,,,J. Štepánek,Štepánek J; Jakl J
65694,Štepánek,J.,,,,,,,J. Štepánek,Hadinec J; Kirschner J; Sourkova M; Štepánek J
154410,Šumberová,K.,,,,,,,K. Šumberová,Simons ELAN; Šumberová K


In [9]:
# group and aggregate data to have unique name rows only for the matching of names later on
collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    source_data=('source_data', lambda x: list(x)[0]),
    occurrenceID_collectors_count= ('occurrenceID_count', 'sum'), # sum up the counts of all grouped rows
    occurrenceID_collectors_firstsample=('occurrenceID_firstsample', lambda x: list(x)[0]), # custom function, to get the first entry
    collectors_eventDate_mean=('eventDate_mean', 'mean'),
    collectors_eventDate_min=('eventDate_min', 'min'),
    collectors_eventDate_max=('eventDate_max', 'max')
).reset_index()

# move canonical_string_collector_parsed after column title
col = collectors_unique.pop("canonical_string_collector_parsed")
collectors_unique.insert(collectors_unique.columns.get_loc('title') + 1, col.name, col)

display(collectors_unique)

# column naming perhaps more clear (because we condensed the data)?
# collectors=collectors.add_suffix('_namegrouped') \
#  if not any(col.endswith("_namegrouped") for col in list(collectors.columns))

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max
0,A,,,,,,,,A,Kampen PN van; A,18,https://data.biodiversitydata.nl/naturalis/spe...,1981-04-19 16:00:00,1899-08-07,1999-12-10
1,A'buino'o,,,,,,,,A'buino'o,A'buino'o; Hunt PF,1,https://data.biodiversitydata.nl/naturalis/spe...,1965-08-24 00:00:00,1965-08-24,1965-08-24
2,Aalsmeer,,,,,,,,Aalsmeer,Aalsmeer,1,https://data.biodiversitydata.nl/naturalis/spe...,1950-09-02 00:00:00,1950-09-02,1950-09-02
3,Aba,,,,,,,,Aba,Aba,4,https://data.biodiversitydata.nl/naturalis/spe...,1949-03-01 00:00:00,1949-03-01,1949-03-01
4,Abai,,,,,,,,Abai,Abai; Madjib,1,https://data.biodiversitydata.nl/naturalis/spe...,1968-08-29 00:00:00,1968-08-29,1968-08-29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59029,Rozsemberszky,Ö.,,,,,,,Ö. Rozsemberszky,Rozsemberszky Ö,2,https://data.biodiversitydata.nl/naturalis/spe...,1915-08-28 00:00:00,1915-08-28,1915-08-28
59030,Szatala,Ö.,,,,,,,Ö. Szatala,Timkó G; Zsák Z; Szatala Ö,2,https://data.biodiversitydata.nl/naturalis/spe...,1920-06-01 00:00:00,1920-06-01,1920-06-01
59031,Johansen,Ø.,,,,,,,Ø. Johansen,Johansen Ø,5,https://data.biodiversitydata.nl/naturalis/spe...,1976-11-16 04:48:00,1975-12-19,1978-02-07
59032,Weholt,Ø.,,,,,,,Ø. Weholt,Weholt Ø,6,https://data.biodiversitydata.nl/naturalis/spe...,1974-01-03 12:00:00,1927-01-01,1984-08-18
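
The grouping above uses named aggregation, where each keyword becomes an output column built from an `(input column, function)` pair. A minimal sketch with toy data (all names and values are assumptions for illustration) compares the custom `lambda x: list(x)[0]` with the built-in `'first'`, which skips missing values:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "a", "b"],
    "given": [None, "X", "Y"],
    "n": [1, 2, 3],
})

out = toy.groupby("name").agg(
    given_first=("given", "first"),              # first non-missing value of the group
    given_raw=("given", lambda x: list(x)[0]),   # first value, even if it is missing
    n_sum=("n", "sum"),                          # sum of the per-row counts
).reset_index()
print(out)
```

Which of the two is preferable depends on whether a missing first entry should be kept or replaced by the next available value of the same group.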


In [10]:
# show collectors with highest occurrenceID_collectors_count
collectors_unique.sort_values(
    by=['occurrenceID_collectors_count', 'family', 'given'], 
    ascending=[False, True, True]
).head(10)

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max
13340,Boom,B.K.,,,,,,,B.K. Boom,Ooststroom SJ van; Boom BK,51103,https://data.biodiversitydata.nl/naturalis/spe...,1958-05-10 04:33:59,1856-01-01,1997-04-11
23108,Breteler,F.J.,,,,,,,F.J. Breteler,Maas PJM; Breteler FJ; Maas-van de Kamer H; Ni...,41100,https://data.biodiversitydata.nl/naturalis/spe...,1988-10-25 10:05:30,1955-06-12,2020-03-06
34677,Maxwell,J.F.,,,,,,,J.F. Maxwell,Maxwell JF; Sankamethawee W,38782,https://data.biodiversitydata.nl/naturalis/spe...,1996-08-29 12:11:24,1969-01-18,2013-04-11
53251,Koorders,S.H.,,,,,,,S.H. Koorders,Koorders SH; Valeton T,34147,https://data.biodiversitydata.nl/naturalis/spe...,1917-12-17 06:47:23,1829-08-27,2012-11-11
10978,Leeuwenberg,A.J.M.,,,,,,,A.J.M. Leeuwenberg,Sidiyasa K; Leeuwenberg AJM; Arbainsyah,32591,https://data.biodiversitydata.nl/naturalis/spe...,1973-10-23 01:24:35,1926-02-20,1999-11-16
38049,Ajgh,Kostermans,,,,,,,Kostermans Ajgh,Kostermans AJGH; Soegeng-Reksodihardjo W,30712,https://data.biodiversitydata.nl/naturalis/spe...,1959-02-23 21:53:36,1892-09-30,1994-11-15
13179,Wilde-Duyfjes,B.E.E.,,,,,,,B.E.E. Wilde-Duyfjes,Mennema J; Wilde-Duyfjes BEE de,29893,https://data.biodiversitydata.nl/naturalis/spe...,1986-10-15 13:20:06,1958-06-28,2019-09-04
53856,Itinere,Stud,,biol Rheno-Trai in,,,,,Stud biol Rheno-Trai in Itinere,Stud biol Rheno-Trai in itinere; Krüger JHJ,28222,https://data.biodiversitydata.nl/naturalis/spe...,1963-03-31 08:11:10,1847-06-18,1996-07-08
35541,Soest,J.L.,,van,,,,,J.L. van Soest,Unknown; Soest JL van,25106,https://data.biodiversitydata.nl/naturalis/spe...,1927-09-01 15:50:20,1803-08-10,1994-09-01
35300,Wieringa,J.J.,,,,,,,J.J. Wieringa,Wieringa JJ,22913,https://data.biodiversitydata.nl/naturalis/spe...,2006-10-31 04:52:04,1980-08-19,2022-11-12


In [11]:
# TODO continue 2023-08-21 10:28:54
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

## Set Up the Cosine Similarity and Text Search

See:
- for the application code: https://github.com/nielsklazenga/avh-collectors/blob/master/cosine_similarity.ipynb
- for background reading: Taylor, Josh. 2019. ‘Fuzzy Matching at Scale’. Towards Data Science (blog). 2 July 2019. https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536.

The `ngrams` function is used as an analyzer in the text search later.
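
To illustrate the principle on a small scale, the following minimal sketch combines a character trigram analyzer with TF-IDF and cosine similarity. It is a simplification under stated assumptions: `char_trigrams` is a hypothetical, reduced stand-in for the `ngrams` function defined below, and the dense `cosine_similarity` from scikit-learn replaces `awesome_cossim_topn`, which is only practical here because the toy data is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def char_trigrams(text, n=3):
    """Hypothetical, simplified analyzer: lower-case, dots to spaces, then n-grams."""
    text = " ".join(text.lower().replace(".", " ").split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

reference = ["J. F. Klotzsch", "H. H. Behr", "F. A. Menge"]  # names to match against
queries = ["J.F. Klotzsch", "H. Behr"]                        # names to look up

vectorizer = TfidfVectorizer(analyzer=char_trigrams)
ref_matrix = vectorizer.fit_transform(reference)  # learn vocabulary and weights
query_matrix = vectorizer.transform(queries)      # reuse the learned vocabulary

sim = cosine_similarity(query_matrix, ref_matrix)
best = sim.argmax(axis=1)  # index of the most similar reference name per query
for query, idx, row in zip(queries, best, sim):
    print(query, "->", reference[idx], round(float(row[idx]), 3))
```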

In [12]:
import pandas as pd, numpy as np, re
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn # pip install sparse-dot-topn

def get_matches_df(sparse_matrix, A, B, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = A[sparserows[index]]
        right_side[index] = B[sparsecols[index]]
        similarity[index] = round(sparse_matrix.data[index], 3)

    return pd.DataFrame({'namematch_source_data': left_side,
                         'namematch_resource_data': right_side,
                         'namematch_similarity': similarity})

!pip install ftfy
from ftfy import fix_text

def ngrams(string, n=3):
    """
    Construct the character n-grams of a given text

    @param string: the text string to split into n-grams
    @param n: character length of each n-gram (default 3)
    @return: list of n-gram strings
    """
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower()
    chars_to_remove = [")","(","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.replace('.', ' ')
    string = string.title()  # normalise case - capital at start of each word
    string = re.sub(' +', ' ', string).strip() # get rid of multiple spaces and replace with a single
    string = ' ' + string + ' '  # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string)
    string = string.strip()
    this_ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in this_ngrams]

In [13]:
def calculateTFIDFmatchingOfData(query_data, match_data, cossim_ntop=1, cossim_lower_bound=0.5):
    """
    Calculate a TF-IDF (term frequency-inverse document frequency) matching with awesome_cossim_topn() and return the matched data

    @param query_data: the names or strings to query for (usually a pandas data column or an array)
    @param match_data: the names or strings to match against (usually a pandas data column)
    @param cossim_ntop: how many of the best cosine similarity matches to keep per query string (default 1, i.e. only the highest similarity);
        increase it to get more alternative matches with lower similarity
    @param cossim_lower_bound: lower similarity cut-off above which data are regarded as similar (default 0.5)

    @requires get_matches_df()
    @requires ngrams()
    @requires awesome_cossim_topn()
    @requires TfidfVectorizer()

    @return: a data frame dictionary: namematch_source_data, namematch_resource_data, namematch_similarity (from @see get_matches_df())
    @rtype pd.DataFrame
    """

    import time
    time_start = time.time()

    # Vectorize Wikidata name (use fit_transform())
    print('Vectorizing data. This may take a while...')
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
    tf_idf_matrix_clean = vectorizer.fit_transform(match_data)
    # Vectorize collectors’ names (use transform())
    tf_idf_matrix_dirty = vectorizer.transform(query_data)

    duration = time.time() - time_start
    print('Vectorizing completed: created a matrix of TF-IDF features after %s s' % duration)

    # Calculate cosine similarity; by default keep only the best match (ntop=1) and only if the similarity is greater than 0.5 (lower_bound=0.5)
    # (lower_bound: a threshold that the element of A*B must be greater than
    #  https://github.com/ing-bank/sparse_dot_topn/blob/3f40611b0553b50c27f23c7dcffc3ca9a9e8f5b5/sparse_dot_topn/awesome_cossim_topn.py#L26C9-L26C78)
    cossim_matches = awesome_cossim_topn(
        tf_idf_matrix_dirty,
        tf_idf_matrix_clean.transpose(),
        ntop=cossim_ntop,
        lower_bound=cossim_lower_bound
    )
    print("Cossim matches calculated after %s s" % (time.time() - time_start))

    print("Get all matches together ...")
    # construct the matching data frame
    matches_df = get_matches_df(
        cossim_matches,
        query_data,
        match_data,
        top=0
    )
    print("Done. Matches calculated after %s s" % (time.time() - time_start))

    return matches_df
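
The `get_matches_df()` helper defined earlier pairs the row and column positions from `nonzero()` with the values in `sparse_matrix.data`. As a minimal alternative sketch (the function `matches_from_sparse` is hypothetical and not used in this notebook), the coordinate format of SciPy delivers rows, columns and values as aligned arrays, which avoids the explicit loop:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def matches_from_sparse(sparse_matrix, left_names, right_names):
    """Hypothetical helper: turn a sparse similarity matrix into a match table."""
    coo = sparse_matrix.tocoo()  # row, col and data arrays are aligned
    return pd.DataFrame({
        "namematch_source_data": np.asarray(left_names, dtype=object)[coo.row],
        "namematch_resource_data": np.asarray(right_names, dtype=object)[coo.col],
        "namematch_similarity": np.round(coo.data, 3),
    })

# toy similarity matrix: 3 query names (rows) against 3 reference names (columns)
demo = csr_matrix(np.array([
    [0.0, 0.701, 0.0],
    [0.0, 0.0, 0.0],   # no match above the lower bound
    [1.0, 0.0, 0.0],
]))
demo_matches = matches_from_sparse(
    demo,
    ["Abbe", "X. Unknown", "G. Bosc"],
    ["G. Bosc", "E. C. Abbe", "M. Bon"],
)
print(demo_matches)
```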

In [14]:
# some example data
print("Show ngram examples:")
print("- simple name:", ngrams('Klazenga, N.'))
print("- data from collectors:", ngrams(collectors_unique["canonical_string_collector_parsed"].at[1])) 
print("- data from match-test:", ngrams(wd_matchtest['canonical_string'].at[0]))
print("- data from match-test (full name):", ngrams(wd_matchtest_fullnames['canonical_string_fullname'].at[0]))

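# some example data
# NOTE: wd_matchtest and wd_matchtest_fullnames are grouped and sorted independently,
#   so rows at the same position do not necessarily describe the same person
for row in range(5):
    if (row == 0):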
        print('\n(WikiData’s) canonical_string = (constructed) canonical_string_fullname:') 
    print("- {short_name} = {long_name}".format(
        short_name=wd_matchtest['canonical_string'].at[row],
        long_name=wd_matchtest_fullnames['canonical_string_fullname'].at[row]
    ))


Show ngram examples:
- simple name: ['Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N']
- data from collectors: ['Abu', 'bui', 'uin', 'ino', 'noo']
- data from match-test: ['W S', ' S ', 'S W', ' W ', 'W W', ' Wo', 'Won', 'ong']
- data from match-test (full name): ['Wil', 'ils', 'lso', 'son', 'on ', 'n S', ' Sz', 'Sze', 'ze ', 'e W', ' Wi', 'Win', 'ing', 'ng ', 'g W', ' Wo', 'Won', 'ong']

(WikiData’s) canonical_string = (constructed) canonical_string_fullname:
- ('W.') S. W. Wong = ('Wilson') Sze Wing Wong
- (A. A.) G. L. Monnier = (Alexandre Alexis) George Le Monnier
- (A.) F. Valet = (Antonius) Theodoor Wegelin
- (A.) H. (S.) Stenar = (August) Friedrich Valet
- (A.) T. Wegelin = (Axel) Helge (Svensson) Stenar
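
Because the two grouped data frames are sorted independently, the pairs printed above do not necessarily belong together (compare `(A.) F. Valet` with `(Antonius) Theodoor Wegelin`). A minimal sketch of reading both strings from the same row instead, using a toy frame built from three rows of the Wikidata sample shown earlier (`wikidata_toy` is an assumption for illustration):

```python
import pandas as pd

wikidata_toy = pd.DataFrame({
    "canonical_string": ["J. F. Klotzsch", "H. H. Behr", "F. A. Menge"],
    "canonical_string_fullname": [
        "Johann Friedrich Klotzsch", "Hans Hermann Behr", "Franz Anton Menge"],
})

# take both name variants from the same row, so they describe the same item
pairs = (
    wikidata_toy[["canonical_string", "canonical_string_fullname"]]
    .drop_duplicates()
    .sort_values("canonical_string")
    .reset_index(drop=True)
)
for short_name, long_name in pairs.itertuples(index=False):
    print("- {} = {}".format(short_name, long_name))
```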


In [15]:
criterion_fullnames = collectors_unique.given.str.contains(r'^\w{3,}', na=False)
collectors_names = collectors_unique['canonical_string_collector_parsed'][~criterion_fullnames].values

matches = calculateTFIDFmatchingOfData(
    collectors_names, 
    wd_matchtest['canonical_string'], 
    cossim_ntop=1 # e.g. cossim_ntop=3 would give more alternative matches as well, having lower similarities, data would increase 3 times as well
)
matches = matches.sort_values(by=['namematch_similarity'], ascending=[False])
matches = matches.reset_index(names=['old_index'])
display(matches)

Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 5.532832860946655 s
Cossim matches calculated after 7.044798135757446 s
Get all matches together ...
Done. Matches calculated after 7.484648942947388 s


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,45817,Ø.H. Rustan,Ø. H. Rustan,1.0
1,18090,G. Bergold,G. Bergold,1.0
2,18121,G. Bosc,G. Bosc,1.0
3,18115,G. Bohus,G. Bohus,1.0
4,41297,S.C. Chen,S. C. Chen,1.0
...,...,...,...,...
45813,8096,A.M. Broeders,C. A. Schroeder,0.5
45814,10194,C. Cissé,M. A. Cisternas,0.5
45815,27390,J.M. Ackworth,G. C. Eickwort,0.5
45816,27407,J.M. Bonpard,M. Bon,0.5


In [16]:
# criterion_fullnames = collectors_unique.given.str.contains('^\w{3,}', na=False)
collectors_fullnames = collectors_unique['canonical_string_collector_parsed'][criterion_fullnames].values
matches_fullnames = calculateTFIDFmatchingOfData(
    collectors_fullnames, 
    wd_matchtest_fullnames['canonical_string_fullname'], 
    cossim_ntop=1 # 10 would give more alternative matches also with lesser similarity
)

matches_fullnames = matches_fullnames.sort_values(by=['namematch_similarity'], ascending=[False])
matches_fullnames = matches_fullnames.reset_index(names=['old_index'])

display(matches_fullnames)

Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 3.968977689743042 s
Cossim matches calculated after 4.174441576004028 s
Get all matches together ...
Done. Matches calculated after 4.183927536010742 s


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,357,Laily bin Din,Laily Bin Din,1.000
1,120,Ching-I. Peng,Ching I Peng,1.000
2,215,Hang Zhou,Hang Zhou,1.000
3,453,Oswaldo Handro,Oswaldo Handro,1.000
4,29,Amar Singh,Amar Singh,1.000
...,...,...,...,...
636,628,Yassin bin Dangi,Bin Bin Liu,0.501
637,46,Ban-Tung Lee,Shi Lin Tung,0.500
638,110,Cate R.S. Ten,R.S. Terry,0.500
639,249,Herb Kegelianum,W. Kegel,0.500


### Create Output Results

Join the matches data frame back to the (Naturalis) collectors and Wikidata items.

Note: merging the 18,770,000 collector matches with Wikidata at an earlier stage was too expensive to calculate. Hence the decision was to make the data unique by `canonical_string_collector_parsed`.
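
The join used in the following cells can be illustrated with toy data. This is a minimal sketch; the frames `collectors_toy` and `matches_toy` are assumptions that only mimic the column names of the real data. It also shows that a left join followed by dropping unmatched rows keeps the same names as an inner join.

```python
import pandas as pd

collectors_toy = pd.DataFrame({
    "canonical_string_collector_parsed": ["G. Bosc", "Abbe", "X. Unknown"],
    "occurrenceID_collectors_count": [3, 4, 1],
})
matches_toy = pd.DataFrame({
    "namematch_source_data": ["G. Bosc", "Abbe"],
    "namematch_resource_data": ["G. Bosc", "E. C. Abbe"],
    "namematch_similarity": [1.0, 0.701],
})

# left join keeps all collectors; unmatched ones get NaN and are dropped afterwards
left_then_drop = pd.merge(
    collectors_toy, matches_toy,
    left_on="canonical_string_collector_parsed", right_on="namematch_source_data",
    how="left",
).dropna(subset=["namematch_similarity"])

# inner join keeps only collectors that have a match
inner = pd.merge(
    collectors_toy, matches_toy,
    left_on="canonical_string_collector_parsed", right_on="namematch_source_data",
    how="inner",
)
print(left_then_drop)
print(inner)
```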

In [17]:
# join (only) abbreviated name matches with collector source data
collectors_matches = pd.merge(
    collectors_unique, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data', 
    how='left'
)

collectors_matches.dropna(subset=['namematch_similarity'], inplace=True)
display(collectors_matches) # 42298 rows × 18 columns

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
3,Aba,,,,,,,,Aba,Aba,4,https://data.biodiversitydata.nl/naturalis/spe...,1949-03-01 00:00:00,1949-03-01,1949-03-01,0.0,Aba,H. Şağban,0.607
4,Abai,,,,,,,,Abai,Abai; Madjib,1,https://data.biodiversitydata.nl/naturalis/spe...,1968-08-29 00:00:00,1968-08-29,1968-08-29,1.0,Abai,G. Ababaikeli,0.520
6,Abbe,,,,,,,,Abbe,Abbe,3,https://data.biodiversitydata.nl/naturalis/spe...,1928-01-01 00:00:00,1928-01-01,1928-01-01,2.0,Abbe,E. C. Abbe,0.701
8,Abbiw,,,,,,,,Abbiw,Cheek MR; Abbiw,27,https://data.biodiversitydata.nl/naturalis/spe...,1998-10-26 06:00:00,1998-10-21,1998-10-31,3.0,Abbiw,D. Abbiatti,0.562
9,Abbott,,,,,,,,Abbott,Abbott,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT,4.0,Abbott,G. Abbott,0.858
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59028,Rosemberszky,Ö.,,,,,,,Ö. Rosemberszky,Rosemberszky Ö,1,https://data.biodiversitydata.nl/naturalis/spe...,1918-04-01 00:00:00,1918-04-01,1918-04-01,45813.0,Ö. Rosemberszky,J.A. Rosemberg,0.606
59030,Szatala,Ö.,,,,,,,Ö. Szatala,Timkó G; Zsák Z; Szatala Ö,2,https://data.biodiversitydata.nl/naturalis/spe...,1920-06-01 00:00:00,1920-06-01,1920-06-01,45814.0,Ö. Szatala,Ö. Szatala,1.000
59031,Johansen,Ø.,,,,,,,Ø. Johansen,Johansen Ø,5,https://data.biodiversitydata.nl/naturalis/spe...,1976-11-16 04:48:00,1975-12-19,1978-02-07,45815.0,Ø. Johansen,F. Johansen,0.866
59032,Weholt,Ø.,,,,,,,Ø. Weholt,Weholt Ø,6,https://data.biodiversitydata.nl/naturalis/spe...,1974-01-03 12:00:00,1927-01-01,1984-08-18,45816.0,Ø. Weholt,Ø. Weholt,1.000


In [18]:
# join (only) full name matches with collector source data
collectors_matches_fullname = pd.merge(
    collectors_unique, matches_fullnames, 
    left_on='canonical_string_collector_parsed' , right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

display(collectors_matches_fullname) # 628 rows × 18 columns

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,Bergh Mh,Aarts-van,,den,,,,,Aarts-van den Bergh Mh,Aarts-van den Bergh MH,2,https://data.biodiversitydata.nl/naturalis/spe...,1970-04-02 00:00:00,1950-09-01,1989-11-01,0,Aarts-van den Bergh Mh,Pieter Johannes van den Bergh,0.514
1,Des,Abbayes H.R.N.,,,,,,,Abbayes H.R.N. Des,Abbayes HRN des; Stomps TJ,11,https://data.biodiversitydata.nl/naturalis/spe...,1941-04-23 06:24:00,1937-07-10,1948-11-22,1,Abbayes H.R.N. Des,R.N. De,0.512
2,Momin,Abdul Karim,,,,,,,Abdul Karim Momin,Sundaling D; Abdul Karim Momin,533,https://data.biodiversitydata.nl/naturalis/spe...,1980-04-19 10:14:04,1948-06-16,1994-11-17,2,Abdul Karim Momin,Abdul-Karim K.A. Al-Bermani,0.533
3,Kbin,Abdul,,,,,,,Abdul Kbin,Abdul Kbin,2,https://data.biodiversitydata.nl/naturalis/spe...,1950-07-24 00:00:00,1950-07-24,1950-07-24,3,Abdul Kbin,Abdul Kafi,0.567
4,Mabh,Abdullah,,,,,,,Abdullah Mabh,Abdullah MABH,2,https://data.biodiversitydata.nl/naturalis/spe...,1958-01-10 00:00:00,1953-02-28,1962-11-22,4,Abdullah Mabh,N. Abdullah,0.635
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
636,Bolhassan,Zainuddin,,,,,,,Zainuddin Bolhassan,Zainuddin Bolhassan,1,https://data.biodiversitydata.nl/naturalis/spe...,1959-11-17 00:00:00,1959-11-17,1959-11-17,636,Zainuddin Bolhassan,M.H. Bolhassan,0.575
637,Xinying,Zhang,,,,,,,Zhang Xinying,Zhang Xinying,2,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT,637,Zhang Xinying,Xing Xiang Zhang,0.593
638,Zhi-Sang,Zhu,,,,,,,Zhu Zhi-Sang,Zhu Zhi-Sang,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT,638,Zhu Zhi-Sang,Zhu Zhang,0.654
639,Sato,Zin,,,,,,,Zin Sato,Pleyte DR; Zin Sato,4,https://data.biodiversitydata.nl/naturalis/spe...,1965-04-03 00:00:00,1951-09-30,1978-10-06,639,Zin Sato,Ken Sato,0.647
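
The two merges above are written differently: the abbreviated names use `how='left'` followed by `dropna`, the full names use the default inner join. A small sketch with made-up rows (the frames `collectors` and `matches_demo` are illustrative only) suggests that both select the same rows, but the left merge turns integer columns such as `old_index` into floats, which would explain the `0.0` versus `0` seen in the two tables.

```python
import pandas as pd

collectors = pd.DataFrame({"canonical_string_collector_parsed": ["Aba", "Abai", "Nobody"]})
matches_demo = pd.DataFrame({
    "old_index": [0, 1],
    "namematch_source_data": ["Aba", "Abai"],
    "namematch_similarity": [0.607, 0.520],
})

# Variant 1: left merge, then drop collectors without a match
left_then_drop = pd.merge(
    collectors, matches_demo,
    left_on="canonical_string_collector_parsed", right_on="namematch_source_data",
    how="left",
).dropna(subset=["namematch_similarity"])

# Variant 2: inner merge keeps only matched collectors from the start
inner = pd.merge(
    collectors, matches_demo,
    left_on="canonical_string_collector_parsed", right_on="namematch_source_data",
    how="inner",
)

print(left_then_drop["old_index"].dtype, inner["old_index"].dtype)
```

If the integer index matters later on, the inner join avoids the intermediate missing values that force the cast to float.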


In [19]:
# join all name matches together
collectors_all_matches=pd.concat([collectors_matches, collectors_matches_fullname])
collectors_all_matches.sort_values(by=['namematch_similarity', 'family'], ascending=[False, True], inplace=True)
collectors_all_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,collectors_eventDate_mean,collectors_eventDate_min,collectors_eventDate_max,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
6998,Aaronsohn,A.,,,,,,,A. Aaronsohn,Aaronsohn A,3,https://data.biodiversitydata.nl/naturalis/spe...,1907-01-26 12:00:00,1906-12-06,1907-03-19,4654.0,A. Aaronsohn,A. Aaronsohn,1.0
7003,Abbas,A.,,,,,,,A. Abbas,Abdallah MS-A; Sa'ad FE-ZM; Mahdy M; Abbas A,378,https://data.biodiversitydata.nl/naturalis/spe...,1963-03-03 08:38:53,1936-02-11,1963-11-01,4659.0,A. Abbas,A. Abbas,1.0
20588,Abbe,E.C.,,,,,,,E.C. Abbe,Abbe EC; Lampangi,537,https://data.biodiversitydata.nl/naturalis/spe...,1961-03-04 07:37:31,1932-01-01,1964-08-31,15525.0,E.C. Abbe,E. C. Abbe,1.0
16808,Abbiatti,D.,,,,,,,D. Abbiatti,Abbiatti D,2,https://data.biodiversitydata.nl/naturalis/spe...,1944-05-31 00:00:00,1937-10-01,1951-01-29,12468.0,D. Abbiatti,D. Abbiatti,1.0
11681,Abbott,A.T.D.,,,,,,,A.T.D. Abbott,Brand R; Bosch A; Abbott ATD,14,https://data.biodiversitydata.nl/naturalis/spe...,2002-12-17 01:36:00,1997-02-13,2010-05-27,8519.0,A.T.D. Abbott,A. T. D. Abbott,1.0


In [20]:
# # Save the plain name matching results only ...
# 
# if not os.path.exists('data'):
#     print("Make data directory for saving …")
#     os.makedirs('data')
# 
# # Set some global variables
# # this_timestamp_for_data=time.strftime('%Y%m%d') # 20230913
# this_timestamp_for_data=20231030
# 
# this_output_file='data/results_naturalis_collectors_vs_wikidata-botanists_cossim-similarity_plain-names_%s.csv' % (
#     this_timestamp_for_data
# )
# 
# collectors_all_matches.to_csv(this_output_file)
# 
# print("Wrote plain name matches of collector names into %s (%d kB)" % 
#     (this_output_file, os.path.getsize(this_output_file) >> 10 ) # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
# )

In [21]:
# now merge with Wikidata: join the matching data to the Wikidata items on the canonical name string
collectors_matches_tmp_names_abbreviated = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string'
)
collectors_matches_tmp_fullnames = pd.merge(
    collectors_matches_fullname, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string_fullname'
)
collectors_matches_g1_merged_wikidata = pd.concat(
    [collectors_matches_tmp_names_abbreviated, collectors_matches_tmp_fullnames]
    , ignore_index=True
)

In [22]:
print("Show some name match examples (e.g. «Louis…» matching various names) …")
for testname in ['Louis', 'Abbot']:
    criterion = collectors_matches_g1_merged_wikidata['canonical_string_collector_parsed'].str.contains(testname)
    this_table=collectors_matches_g1_merged_wikidata[criterion].get([
        # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
        'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
        'namematch_source_data', 'namematch_resource_data', 'namematch_similarity', 
        # 'canonical_string_fullname', 
        'itemLabel', 'wikidata_link',
        'collectors_eventDate_min', 'collectors_eventDate_max',
        'yob', 'yod', 'wyb', 'wye'
    ]).sort_values(by=['namematch_similarity'], ascending=[False])
    print("# ---------------------------------------------\n# «%s…» as test name, %d collector names begin with:" % (testname, criterion.sum()))    
    display(this_table)

Show some name match examples (e.g. «Louis…» matching various names) …
# ---------------------------------------------
# «Louis…» as test name, 16 collector names begin with:


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_similarity,itemLabel,wikidata_link,collectors_eventDate_min,collectors_eventDate_max,yob,yod,wyb,wye
7454,2,https://data.biodiversitydata.nl/naturalis/spe...,A. Louis,A. Louis,1.0,A. Louis,http://www.wikidata.org/wiki/Q33682458,NaT,NaT,,,,
19634,10542,https://data.biodiversitydata.nl/naturalis/spe...,A.M. Louis,A. M. Louis,1.0,Adriaan M. Louis,http://www.wikidata.org/wiki/Q21338327,1969-04-10,2013-03-02,1944.0,,,
38701,3339,https://data.biodiversitydata.nl/naturalis/spe...,J.L.P. Louis,J. L. P. Louis,1.0,Jean Laurent Prosper Louis,http://www.wikidata.org/wiki/Q5928759,1900-01-01,1998-05-17,1903.0,1947.0,,
44197,51,https://data.biodiversitydata.nl/naturalis/spe...,P. Louis-Marie,Louis-Marie,0.926,Louis-Marie,http://www.wikidata.org/wiki/Q5981449,1925-07-28,1953-07-08,1896.0,1978.0,,
44198,1,https://data.biodiversitydata.nl/naturalis/spe...,R.P. Louis-Marie,Louis-Marie,0.877,Louis-Marie,http://www.wikidata.org/wiki/Q5981449,1934-07-09,1934-07-09,1896.0,1978.0,,
7460,3,https://data.biodiversitydata.nl/naturalis/spe...,H. Louis,A. Louis,0.859,A. Louis,http://www.wikidata.org/wiki/Q33682458,1907-06-01,1953-10-01,,,,
7453,59,https://data.biodiversitydata.nl/naturalis/spe...,Louis,A. Louis,0.858,A. Louis,http://www.wikidata.org/wiki/Q33682458,1904-05-28,1984-09-21,,,,
7456,14,https://data.biodiversitydata.nl/naturalis/spe...,F. Louis,A. Louis,0.855,A. Louis,http://www.wikidata.org/wiki/Q33682458,1910-01-01,1953-12-01,,,,
7461,3,https://data.biodiversitydata.nl/naturalis/spe...,J.B.A. Louis,A. Louis,0.834,A. Louis,http://www.wikidata.org/wiki/Q33682458,1878-05-01,1938-10-30,,,,
7462,4,https://data.biodiversitydata.nl/naturalis/spe...,O. Louis,A. Louis,0.826,A. Louis,http://www.wikidata.org/wiki/Q33682458,1937-07-14,1937-07-27,,,,


# ---------------------------------------------
# «Abbot…» as test name, 10 collector names begin with:


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_similarity,itemLabel,wikidata_link,collectors_eventDate_min,collectors_eventDate_max,yob,yod,wyb,wye
20166,14,https://data.biodiversitydata.nl/naturalis/spe...,A.T.D. Abbott,A. T. D. Abbott,1.0,A. T. D. Abbott,http://www.wikidata.org/wiki/Q117328147,1997-02-13,2010-05-27,1936.0,2013.0,,
26910,2,https://data.biodiversitydata.nl/naturalis/spe...,E.K. Abbott,E. K. Abbott,1.0,Edwin Kirk Abbott,http://www.wikidata.org/wiki/Q81587932,1889-01-01,1889-04-01,1840.0,1918.0,,
26911,2,https://data.biodiversitydata.nl/naturalis/spe...,E.K. Abbott,E. K. Abbott,1.0,Erwin Kirk Abbott,http://www.wikidata.org/wiki/Q113588322,1889-01-01,1889-04-01,1840.0,1918.0,,
48796,10,https://data.biodiversitydata.nl/naturalis/spe...,W.L. Abbott,W. L. Abbott,1.0,William Louis Abbott,http://www.wikidata.org/wiki/Q635604,1922-04-05,1922-04-30,1860.0,1936.0,,
35772,106,https://data.biodiversitydata.nl/naturalis/spe...,I.A. Abbott,I. Abbott,0.937,Isabella Abbott,http://www.wikidata.org/wiki/Q6077932,1946-05-01,1995-02-22,1919.0,2010.0,,
27036,1,https://data.biodiversitydata.nl/naturalis/spe...,D.P. Abbott,S. P. Abbott,0.894,Sean P. Abbott,http://www.wikidata.org/wiki/Q36672029,1967-08-02,1967-08-02,,,,
10,1,https://data.biodiversitydata.nl/naturalis/spe...,Abbott,G. Abbott,0.858,George Abbott,http://www.wikidata.org/wiki/Q47112598,NaT,NaT,,,,
392,1,https://data.biodiversitydata.nl/naturalis/spe...,M. Abbot-Anderson,M. Anderson,0.616,Marilyn Anderson,http://www.wikidata.org/wiki/Q44754645,1933-06-21,1933-06-21,,,,
393,1,https://data.biodiversitydata.nl/naturalis/spe...,M. Abbot-Anderson,M. Anderson,0.616,Mark Anderson,http://www.wikidata.org/wiki/Q111990210,1933-06-21,1933-06-21,,,,
394,1,https://data.biodiversitydata.nl/naturalis/spe...,M. Abbot-Anderson,M. Anderson,0.616,Mary Anderson,http://www.wikidata.org/wiki/Q111694258,1933-06-21,1933-06-21,1875.0,,,
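
Note that the test above selects names with `str.contains`, so the printed counts also include names that merely contain the test string, such as "M. Abbot-Anderson". A short sketch with a few names from the table (the series `names` is illustrative only) shows how this differs from a strict prefix test:

```python
import pandas as pd

names = pd.Series(["A.T.D. Abbott", "Abbott", "M. Abbot-Anderson", "A. Louis"])

contains_mask = names.str.contains("Abbot", regex=False)  # substring anywhere in the name
starts_mask = names.str.startswith("Abbot")               # only names that begin with it

print(contains_mask.sum(), starts_mask.sum())
```

Because the canonical strings put the initials first, a substring test is the more useful filter here, even though it is broader than "begins with".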


In [23]:
# # # # # # # # # # # # # # # # # # # # # # # # #
# display data and save custom columns
# # # # # # # # # # # # # # # # # # # # # # # # #
# pprint.pprint(collectors_matches_g1_merged_wikidata.columns)
# ## cell split - code
# collectors_matches_g1_merged_wikidata.head()
# ## cell split - code
# 
# # Select useful columns for data results
# collectors_wikidata_cossim = collectors_matches_g1_merged_wikidata[
#     ['canonical_string_collector_parsed', 'family', 'given', 
#      'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
#     'namematch_source_data', 'namematch_resource_data', 'namematch_similarity', 
#     'item', 'canonical_string', 'itemLabel',
#     'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 
#     'collectors_eventDate_mean', 'collectors_eventDate_min', 'collectors_eventDate_max',
#      'yob', 'yod', 'wyb'
#     ]
# ]
# 
# # Order by similarity (desc), then by family and given name (asc)
# collectors_wikidata_cossim.sort_values(by=['namematch_similarity', 'family', 'given'], ascending=[False, True, True], inplace=True)
# 
# collectors_wikidata_cossim # comparison-match of «Kotschy, Karl Georg Th» (collector data) →← «Kotschy, T» (Wikidata) has only 0.5 similarity but corresponds to the correct person name we need
# ## cell split - code
# 
# 
# # TODO further evaluation or filtering, counting, clean up aso.
# if not os.path.exists('data'):
#     os.makedirs('data')
# 
# # naturalis_collectors_cosine-similarity_wikidata-botanists_%s.csv
# this_output_file='data/results_naturalis_collectors_vs_wikidata-botanists_cossim-similarity_merged-data_%s.csv' % (
#     this_timestamp_for_data
# )
# 
# collectors_wikidata_cossim.to_csv(this_output_file)
# 
# print("Wrote matches of collector names into %s (%d kB)" % 
#     (this_output_file, os.path.getsize(this_output_file) >> 10 ) # 10000 >> 10 = bitshift operator, to get kilo bytes (10-bits=>1024)
# )

## Output Mapping to DarwinCore Attribution Output

Here we map table data fields to fields of DarwinCore Attribution (<https://github.com/tdwg/attribution/>, <https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml>) 

## Scoring

Ideally, the individually scored properties are balanced so that the property scores can simply be added up; at present, the calculated values still need to be assessed by hand. One difficulty with a distance measure is that it is the opposite of a similarity and can exceed 1, so it must somehow be mapped to a range of 0 … 1 (or -1 … 0 … 1) (TODO review).

General thoughts: With a score of -1 to 1, it can be assumed that:
* -1 means full devaluation or no agreement
* 1 means full upvoting or agreement, and
* 0 can have several interpretations: it is in between, or no rating possible, or missing values.
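
One possible way to bring an unbounded distance into these ranges is sketched below. Both functions are hypothetical helpers and not part of the notebook; the choice of `1 / (1 + distance)` is just one of several monotone mappings and would itself need review.

```python
def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance to the range (0, 1]; a distance of 0 gives 1.0."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


def to_signed_score(similarity: float) -> float:
    """Rescale a similarity in [0, 1] linearly to the signed range [-1, 1]."""
    return 2.0 * similarity - 1.0


for d in (0.0, 1.0, 3.0):
    s = distance_to_similarity(d)
    print(d, s, to_signed_score(s))
```

With this mapping a distance of 1 lands exactly on the neutral value 0 of the signed range, which may or may not be the desired interpretation of "in between".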

### Task to Be Solved in Evaluating the Life Time ~ Rating/Scoring

We grouped the collection date (`eventDate`) by the name in the source data, so for (abbreviated) names such as “Bachmann, F.” the collection dates may belong to *several* persons, not just one. This must be taken into account when evaluating whether the life dates match the collection dates. The life-time score is rated as follows:

| Score (life time) | Remarks | 
|--|--|
| 1.0  | complete match                     |
| 0.5  | somewhat correct, but has errors or mistakes, indicating multiple person names    |
| 0.0   | no evaluation (or not possible) |
| -0.5 | is rather to be rejected, indicating multiple person names and possibly overlapping time spans of the collection date of different person names, or mistakes in the original data |
| -1.0 | completely rejected                |
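
The rating in the table can be sketched as a plain function on two optional flags. This mirrors the case distinctions used in the scoring cell further down, but the function names are hypothetical, and treating a missing collection date as "unknown" is an assumption of this sketch.

```python
from typing import Optional, Tuple


def lifetime_flags(yob: Optional[int], yod: Optional[int],
                   first_year: Optional[int], last_year: Optional[int],
                   min_age: int = 10) -> Tuple[Optional[bool], Optional[bool]]:
    """Return (old enough at first collection, still alive at last collection); None = unknown."""
    born_ok = None if yob is None or first_year is None else yob + min_age < first_year
    died_ok = None if yod is None or last_year is None else yod > last_year
    return born_ok, died_ok


def lifetime_score(born_ok: Optional[bool], died_ok: Optional[bool]) -> float:
    """Score the agreement between life dates and collection dates."""
    if born_ok is None and died_ok is None:
        return 0.0                      # no evaluation possible
    if born_ok is None or died_ok is None:
        known = born_ok if died_ok is None else died_ok
        return 1.0 if known else -0.5   # only one side can be judged
    if born_ok and died_ok:
        return 1.0                      # complete match
    if not born_ok and not died_ok:
        return -1.0                     # completely rejected
    return 0.5                          # one side matches, the other does not


print(lifetime_score(*lifetime_flags(1876, 1919, 1906, 1907)))
print(lifetime_score(*lifetime_flags(1911, 1979, 1855, 1984)))
```

Keeping the rule in one function makes it easier to review the borderline cases (one date missing, one date contradicting) in a single place.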

### Task to Be Solved With Several Names ~ Assessment/Score

When only one name is found, we cannot know whether other candidate names exist elsewhere, so we cannot assign “1” (= full agreement) with certainty. It was therefore decided that a single match is scored as zero, in the sense of no evaluation. When evaluating multiple names, only the mismatches are scored, as follows:

| Score (multiple names) | Remarks | 
|--|--|
| 1.0  | this value (= full upvoting or agreement) is never set here, since we do not know all the full names of the cosmos ;-) and so cannot state a score of 1.0 with certainty |
| 0.0 | no evaluation, because only 1 name found | 
| less than 0 | multiple names found, i.e. deduction (perhaps just -0.5, as a decision needs to be made) | 
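
A minimal sketch of this deduction, using `duplicated(keep=False)` to flag every collector name that was matched to more than one Wikidata item. The frame `matches_demo` is made up from names in the tables above; the value -0.5 is the provisional deduction named in the table.

```python
import pandas as pd

matches_demo = pd.DataFrame({
    "canonical_string_collector_parsed": ["E.K. Abbott", "E.K. Abbott", "W.L. Abbott"],
    "itemLabel": ["Edwin Kirk Abbott", "Erwin Kirk Abbott", "William Louis Abbott"],
})

# keep=False marks all members of a duplicate group, not only the repeats
has_alternatives = matches_demo["canonical_string_collector_parsed"].duplicated(keep=False)
matches_demo["custom_score_multiple_names"] = has_alternatives.map({True: -0.5, False: 0.0})

print(matches_demo)
```

A count-dependent deduction (for example, a larger penalty for more candidates) could replace the fixed -0.5 once a rule has been agreed on.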

---

TODO review interpretation:

- the fields are defined in <https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml>. With regard to this DwC attribution concept: is it correct to map them as follows, so that `name` represents the *interpreted* resource name (in long format) rather than the *source* collector name (which is, in theory, also in long format)?
    ```
    name          ← itemLabel (wikiData)
    alternateName ← canonical_string_collector_parsed (actual collector name)
    ```

In [24]:
pprint.pprint(collectors_matches_g1_merged_wikidata.columns)

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'source_data', 'occurrenceID_collectors_count',
       'occurrenceID_collectors_firstsample', 'collectors_eventDate_mean',
       'collectors_eventDate_min', 'collectors_eventDate_max', 'old_index',
       'namematch_source_data', 'namematch_resource_data',
       'namematch_similarity', 'item', 'itemLabel', 'surname', 'initials',
       'canonical_string', 'canonical_string_fullname', 'orcid', 'viaf',
       'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb',
       'wye', 'wikidata_link', 'orcid_link', 'harv_link', 'ipni_link',
       'bionomia_link'],
      dtype='object')


In [25]:
# TODO review score (similarity + ((yob,yod) ~ (eventDate_min, eventDate_max))

collectors_wikidata_cossim = collectors_matches_g1_merged_wikidata[
    ['canonical_string_collector_parsed', 'family', 'given', 
     'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
     'source_data',
    'namematch_source_data', 'namematch_resource_data', 'namematch_similarity', 
    'item', 'canonical_string', 'itemLabel',
    'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 
    'collectors_eventDate_mean', 'collectors_eventDate_min', 'collectors_eventDate_max',
     'yob', 'yod', 'wyb'
    ]
]

# Order by similarity (desc), then by family and given name (asc)
collectors_wikidata_cossim.sort_values(
    by=['namematch_similarity', 'family', 'given'], 
    ascending=[False, True, True], 
    inplace=True
)

dwcagent_attr_output=collectors_wikidata_cossim.get([
    "occurrenceID_collectors_firstsample", 
    "canonical_string_collector_parsed",
    'family', 'given',
    "namematch_similarity", 
    "source_data", 
    "itemLabel", 
    "item",
    "collectors_eventDate_min",
    "collectors_eventDate_max",
    'yob', 'yod'
]).copy()

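# assign the result back instead of calling replace(inplace=True) on a column selection, which may act on a temporary copy
dwcagent_attr_output['canonical_string_collector_parsed'] = dwcagent_attr_output['canonical_string_collector_parsed'].replace(to_replace=r'([^,]+),\s*(.+)', value='\\2 \\1', regex=True)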
dwcagent_attr_output['namematch_similarity_annotation'] = dwcagent_attr_output['namematch_similarity'].astype(str).str.replace(r'(.+)', '\\1 (cos similarity)', regex=True)
# dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'namematch_similarity_annotation', '', allow_duplicates=True)

dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'life_time_periode', '', allow_duplicates=True)

combine_life_times = lambda this_df: ("%s-%s" % (this_df["yob"], this_df["yod"])).replace(r"<NA>", "?")
dwcagent_attr_output["life_time_periode"]=dwcagent_attr_output.apply(combine_life_times, axis="columns")

# dwcagent_attr_output["life_time_periode"]

years_from_birth_until_first_collection_activity = 10
dwcagent_attr_output["yob_is_lt_eventDate_min"] = dwcagent_attr_output["yob"] + years_from_birth_until_first_collection_activity < dwcagent_attr_output["collectors_eventDate_min"].dt.year
dwcagent_attr_output["yod_is_gt_eventDate_max"] = dwcagent_attr_output["yod"] > dwcagent_attr_output["collectors_eventDate_max"].dt.year
dwcagent_attr_output["custom_score_lifetime"] = 0
dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'custom_score_lifetime_annotation', '', allow_duplicates=True)

# df.loc[(df['column_of_interest'] … condition), 'fill_to_column'] = value 

dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_eventDate_min"] & dwcagent_attr_output["yod_is_gt_eventDate_max"],
    "custom_score_lifetime"
] = 1
# True cases but <NA> missing values
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_eventDate_min"] & dwcagent_attr_output["yod_is_gt_eventDate_max"].isnull(),
    "custom_score_lifetime"
] = 1
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_eventDate_min"].isnull() & dwcagent_attr_output["yod_is_gt_eventDate_max"],
    "custom_score_lifetime"
] = 1
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_eventDate_min"].isnull() & dwcagent_attr_output["yod_is_gt_eventDate_max"].isnull(),
    "custom_score_lifetime"
] = 0

# False cases
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_eventDate_max"] == False),
    "custom_score_lifetime"
] = -1
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"]==True) & (dwcagent_attr_output["yod_is_gt_eventDate_max"] == False),
    "custom_score_lifetime"
] = 0.5
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_eventDate_max"] == True),
    "custom_score_lifetime"
] = 0.5

# False cases but <NA> missing values
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_eventDate_max"].isnull()),
    "custom_score_lifetime"
] = -0.5
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"].isnull()) & (dwcagent_attr_output["yod_is_gt_eventDate_max"] == False),
    "custom_score_lifetime"
] = -0.5

# annotations True cases
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_eventDate_min"] & dwcagent_attr_output["yod_is_gt_eventDate_max"], 
    "custom_score_lifetime_annotation"
] = "full match"

# annotations True cases but <NA> missing values
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_eventDate_min"] & dwcagent_attr_output["yod_is_gt_eventDate_max"].isnull(), 
    "custom_score_lifetime_annotation"
] = "OK? year of death is missing"
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_eventDate_min"].isnull() & dwcagent_attr_output["yod_is_gt_eventDate_max"], 
    "custom_score_lifetime_annotation"
] = "OK? year of birth is missing"
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob_is_lt_eventDate_min"].isnull() & dwcagent_attr_output["yod_is_gt_eventDate_max"].isnull(), 
    "custom_score_lifetime_annotation"
] = "unknown life time"

# annotations False cases
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_eventDate_max"] == False), 
    "custom_score_lifetime_annotation"
] = "life time not matching any enventDate (yob + %s … yod)" % years_from_birth_until_first_collection_activity
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"]==True) & (dwcagent_attr_output["yod_is_gt_eventDate_max"] == False), 
    "custom_score_lifetime_annotation"
] = "OK yob + %s, but yod not matching, check name and lifetime data" % years_from_birth_until_first_collection_activity
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_eventDate_max"] == True), 
    "custom_score_lifetime_annotation"
] = "yob + %s not matching, OK yod, check name and lifetime data" % years_from_birth_until_first_collection_activity
# annotations False cases but <NA> missing values
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"]==False) & (dwcagent_attr_output["yod_is_gt_eventDate_max"].isnull()), 
    "custom_score_lifetime_annotation"
] = "yob + %s not matching, yod unknown, check name and lifetime data" % years_from_birth_until_first_collection_activity
dwcagent_attr_output.loc[
    (dwcagent_attr_output["yob_is_lt_eventDate_min"].isnull()) & (dwcagent_attr_output["yod_is_gt_eventDate_max"]==False), 
    "custom_score_lifetime_annotation"
] = "yob unknown, yod not matching, check name and lifetime data"

dwcagent_attr_output["custom_score_multiple_names"]=0 # 0 shall mean: we don’t know yet for real
dwcagent_attr_output.loc[
    (dwcagent_attr_output['canonical_string_collector_parsed'].duplicated(keep=False)),
    'custom_score_multiple_names'
] = -0.5 # one decision has to be made, so cut the range of -1 to 0 only into half (or include multiple count somehow?)

dwcagent_attr_output['custom_score_overall'] = (
    dwcagent_attr_output['namematch_similarity'] * \
    (
        ( dwcagent_attr_output["custom_score_lifetime"] + dwcagent_attr_output['custom_score_multiple_names']) / 2
    )
).round(3)

dwcagent_attr_output['attributionRemarks'] = dwcagent_attr_output.apply(
    lambda row: "{similarity_distance_note};"
                " {score_overall:.2f} (score overall);"
                " {lifetime_periode} (life time);"
                " {lifetime_score:.1f} (life time score);"
                " {lifetime_score_annote} (life time score note);"
                " {score_multinames:.2f} (score multiple names);"
        .format(
    similarity_distance_note=row['namematch_similarity_annotation'],
    lifetime_periode=row["life_time_periode"],
    lifetime_score=row["custom_score_lifetime"],
    lifetime_score_annote=row["custom_score_lifetime_annotation"],
    score_overall=row["custom_score_overall"],
    score_multinames=row["custom_score_multiple_names"]
    ), axis='columns'
)

# adjust dwcagent displayOrder also to overall score
dwcagent_attr_output.sort_values(
    by=['namematch_similarity', 'family', 'given', 'custom_score_overall'], 
    ascending=[False, True, True, False], 
    inplace=True
)
# use ordered canonical_string_collector_parsed to generate displayOrder
temp_duplicated = dwcagent_attr_output['canonical_string_collector_parsed'].duplicated() 
    # duplicated() marks the first occurrence as False and all further duplicates as True, so the cumulative sum of the True values gives the order index
temp_insert_value=temp_duplicated.groupby(dwcagent_attr_output['canonical_string_collector_parsed']).cumsum() + 1 # display order starts at 1, incrementing
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('canonical_string_collector_parsed') + 1, 'displayOrder', temp_insert_value, allow_duplicates=True)

# test and show example data
show_display_output=True
if show_display_output:
    display(dwcagent_attr_output.loc[dwcagent_attr_output['yob_is_lt_eventDate_min'] == True].get([
        # "occurrenceID_collectors_firstsample",
        "canonical_string_collector_parsed",
        'itemLabel',
        "custom_score_overall",
        "attributionRemarks",
        'custom_score_multiple_names',
        "namematch_similarity",
        # 'yob', 'yod',
        "life_time_periode", 
        'collectors_eventDate_min', 'collectors_eventDate_max', 
        "yob_is_lt_eventDate_min" ,'yod_is_gt_eventDate_max', 
        'custom_score_lifetime', 'custom_score_lifetime_annotation'
    ]).head(5))
    display(dwcagent_attr_output.loc[dwcagent_attr_output['yob_is_lt_eventDate_min'] == False].get([
        # "occurrenceID_collectors_firstsample",
        "canonical_string_collector_parsed",
        'itemLabel',
        "custom_score_overall",
        "attributionRemarks",
        'custom_score_multiple_names',
        "namematch_similarity",
        # 'yob', 'yod',
        "life_time_periode", 
        'collectors_eventDate_min', 'collectors_eventDate_max', 
        "yob_is_lt_eventDate_min" ,'yod_is_gt_eventDate_max', 
        'custom_score_lifetime', 'custom_score_lifetime_annotation'
    ]).head(5))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  collectors_wikidata_cossim.sort_values(
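
The warning above is raised because `sort_values(..., inplace=True)` is applied to a column selection taken from another data frame. A minimal sketch of one way to avoid it, assuming the selection is meant to be an independent table: take an explicit `.copy()` and assign the sorted result back. The frame `merged` is a made-up stand-in for `collectors_matches_g1_merged_wikidata`.

```python
import pandas as pd

merged = pd.DataFrame({
    "namematch_similarity": [0.9, 1.0],
    "family": ["Louis", "Abbe"],
    "extra": [1, 2],
})

# An explicit copy makes the selection independent of `merged`
selection = merged[["namematch_similarity", "family"]].copy()

# Assigning the sorted frame back avoids in-place modification altogether
selection = selection.sort_values(
    by=["namematch_similarity", "family"],
    ascending=[False, True],
)

print(selection)
```

The original frame keeps its row order and columns, and later column assignments on `selection` no longer trigger the warning.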


Unnamed: 0,canonical_string_collector_parsed,itemLabel,custom_score_overall,attributionRemarks,custom_score_multiple_names,namematch_similarity,life_time_periode,collectors_eventDate_min,collectors_eventDate_max,yob_is_lt_eventDate_min,yod_is_gt_eventDate_max,custom_score_lifetime,custom_score_lifetime_annotation
13824,A. Aaronsohn,Aaron Aaronsohn,0.5,1.0 (cos similarity); 0.50 (score overall); 18...,0.0,1.0,1876-1919,1906-12-06,1907-03-19,True,True,1.0,full match
7,E.C. Abbe,Ernst Cleveland Abbe,0.5,1.0 (cos similarity); 0.50 (score overall); 19...,0.0,1.0,1905-2000,1932-01-01,1964-08-31,True,True,1.0,full match
9,D. Abbiatti,Delia Abbiatti,0.5,1.0 (cos similarity); 0.50 (score overall); 19...,0.0,1.0,1918-?,1937-10-01,1951-01-29,True,,1.0,OK? year of death is missing
20166,A.T.D. Abbott,A. T. D. Abbott,0.5,1.0 (cos similarity); 0.50 (score overall); 19...,0.0,1.0,1936-2013,1997-02-13,2010-05-27,True,True,1.0,full match
26910,E.K. Abbott,Edwin Kirk Abbott,0.25,1.0 (cos similarity); 0.25 (score overall); 18...,-0.5,1.0,1840-1918,1889-01-01,1889-04-01,True,True,1.0,full match


Unnamed: 0,canonical_string_collector_parsed,itemLabel,custom_score_overall,attributionRemarks,custom_score_multiple_names,namematch_similarity,life_time_periode,collectors_eventDate_min,collectors_eventDate_max,yob_is_lt_eventDate_min,yod_is_gt_eventDate_max,custom_score_lifetime,custom_score_lifetime_annotation
39340,S.K. Abdullah,Samir K. Abdullah,-0.25,1.0 (cos similarity); -0.25 (score overall); 1...,0.0,1.0,1947-?,1921-03-01,1986-04-01,False,,-0.5,"yob + 10 not matching, yod unknown, check name..."
27198,E. Acharius,Erik Acharius,0.25,1.0 (cos similarity); 0.25 (score overall); 17...,0.0,1.0,1757-1819,NaT,NaT,False,True,0.5,"yob + 10 not matching, OK yod, check name and ..."
31285,G. Achoundong,Gaston Achoundong,-0.25,1.0 (cos similarity); -0.25 (score overall); 1...,0.0,1.0,1950-?,1960-04-21,2006-06-08,False,,-0.5,"yob + 10 not matching, yod unknown, check name..."
39027,J.P.H. Acocks,John Phillip Harison Acocks,-0.5,1.0 (cos similarity); -0.50 (score overall); 1...,0.0,1.0,1911-1979,1855-05-04,1984-06-29,False,False,-1.0,life time not matching any enventDate (yob + 1...
19519,A.M. Adamson,Alastair Martin Adamson,0.25,1.0 (cos similarity); 0.25 (score overall); 19...,0.0,1.0,1901-1945,NaT,NaT,False,True,0.5,"yob + 10 not matching, OK yod, check name and ..."
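
The `custom_score_overall` values in the tables above follow from the formula in the scoring cell: the name similarity multiplied by the mean of the life-time score and the multiple-names score. The function below is only a restatement of that formula for single values (`overall_score` is a hypothetical name), checked against rows from the output.

```python
def overall_score(similarity: float, lifetime: float, multiple_names: float) -> float:
    """similarity * mean(lifetime score, multiple-names score), rounded to 3 digits."""
    return round(similarity * (lifetime + multiple_names) / 2, 3)


print(overall_score(1.0, 1.0, 0.0))    # A. Aaronsohn: full match, single name
print(overall_score(1.0, 1.0, -0.5))   # E.K. Abbott: full match, several candidate items
print(overall_score(1.0, -1.0, 0.0))   # J.P.H. Acocks: life time rejected
```

Since the multiple-names score never exceeds 0, the overall score cannot exceed 0.5 with this formula, even for a perfect name match with fully consistent life dates; this is one of the points the TODO on score calculation would need to address.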


In [26]:
column_map_dwcagent_attr = {
    'occurrenceID_collectors_firstsample':'occurrenceID',
    'canonical_string_collector_parsed':  'alternateName',
    'source_data':                        'verbatimName',
    'itemLabel':                          'name',
    'item':                               'identifier',
    'collectors_eventDate_min':           'startedAtTime',
    'collectors_eventDate_max':           'endedAtTime',
    'namematch_similarity':               'custom_namematch_similarity'
}
dwcagent_attr_output.rename(
    mapper=column_map_dwcagent_attr,
    axis='columns',
    inplace=True)

dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('identifier')          + 1, 'agentIdentifierType', 'wikidata' , allow_duplicates=True)
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('agentIdentifierType') + 1, 'agentType'          , 'Person'   , allow_duplicates=True)
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('identifier')          + 1, 'action'             , 'collected', allow_duplicates=True)

show_display_output=False
if show_display_output:
    display(dwcagent_attr_output.head(20)) # a bare expression inside `if` is not rendered, hence display()

dwcagent_attr_output=dwcagent_attr_output.reindex(
    columns=[
        'occurrenceID', # no DwC agent standard (yet)?
        'verbatimName',
        'alternateName',
        'displayOrder', # shall start from 1, 2, 3 …
        'name',
        'attributionRemarks',
        'startedAtTime',
        'endedAtTime',
        'agentType',
        'action',
        'agentIdentifierType',
        'identifier',
        "custom_score_overall", # keep it for calculation convenience, no standard in DwC agent
        'custom_namematch_similarity',# keep it for calculation convenience, no standard in DwC agent
        'custom_score_multiple_names',# keep it for calculation convenience, no standard in DwC agent
        'custom_score_lifetime' # keep it for calculation convenience, no standard in DwC agent
    ]
)
# column deletion not necessary after ….reindex(columns=[…]), which keeps only the listed columns
# for this_column in ['yob', 'yod', 'life_time_periode', 'yob_is_lt_eventDate_min', 'yod_is_gt_eventDate_max', 'score_lifetime_annotation']:
#     del dwcagent_attr_output[this_column]

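A note on `reindex(columns=[…])` as used above: it keeps only the listed columns, in the listed order, and it silently creates any listed column that does not exist yet, filled with `NaN`. A misspelled target name in `column_map_dwcagent_attr` would therefore produce an empty column rather than an error. A minimal sketch on a toy data frame (the values and the placeholder identifier are made up for illustration; only the column names follow the cell above):

```python
import pandas as pd

# Toy data, made up for illustration; the identifier is a placeholder.
toy = pd.DataFrame({
    "item": ["http://www.wikidata.org/entity/Q_PLACEHOLDER"],
    "itemLabel": ["Erik Acharius"],
    "yob": [1757],
})
toy = toy.rename(columns={"item": "identifier", "itemLabel": "name"})
toy.insert(toy.columns.get_loc("identifier") + 1, "agentIdentifierType", "wikidata")

wanted = ["name", "displayOrder", "agentIdentifierType", "identifier"]
result = toy.reindex(columns=wanted)

# Columns that reindex had to create because they were missing before
created = [c for c in wanted if c not in toy.columns]
print(list(result.columns))
print("created as empty columns:", created)
```

Checking `created` before writing the file is a cheap way to catch a renamed or missing source column.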

In [27]:
show_display_output=True
if show_display_output:
    # criterion = dwcagent_attr_output['alternateName'].map(lambda x: x.startswith('S. Ahmad'))
    criterion = dwcagent_attr_output['custom_score_multiple_names'] < 0 # show multiple name cases
    
    display(dwcagent_attr_output[criterion].head(20))

Unnamed: 0,occurrenceID,verbatimName,alternateName,displayOrder,name,attributionRemarks,startedAtTime,endedAtTime,agentType,action,agentIdentifierType,identifier,custom_score_overall,custom_namematch_similarity,custom_score_multiple_names,custom_score_lifetime
13831,https://data.biodiversitydata.nl/naturalis/spe...,Abdallah MS-A; Sa'ad FE-ZM; Mahdy M; Abbas A,A. Abbas,1,Alia Abbas,1.0 (cos similarity); -0.25 (score overall); ?...,1936-02-11,1963-11-01,Person,collected,wikidata,http://www.wikidata.org/entity/Q60141229,-0.25,1.0,-0.5,0.0
13832,https://data.biodiversitydata.nl/naturalis/spe...,Abdallah MS-A; Sa'ad FE-ZM; Mahdy M; Abbas A,A. Abbas,2,Abdulla Abbas,1.0 (cos similarity); -0.25 (score overall); ?...,1936-02-11,1963-11-01,Person,collected,wikidata,http://www.wikidata.org/entity/Q88804360,-0.25,1.0,-0.5,0.0
26910,https://data.biodiversitydata.nl/naturalis/spe...,Abbott EK,E.K. Abbott,1,Edwin Kirk Abbott,1.0 (cos similarity); 0.25 (score overall); 18...,1889-01-01,1889-04-01,Person,collected,wikidata,http://www.wikidata.org/entity/Q81587932,0.25,1.0,-0.5,1.0
26911,https://data.biodiversitydata.nl/naturalis/spe...,Abbott EK,E.K. Abbott,2,Erwin Kirk Abbott,1.0 (cos similarity); 0.25 (score overall); 18...,1889-01-01,1889-04-01,Person,collected,wikidata,http://www.wikidata.org/entity/Q113588322,0.25,1.0,-0.5,1.0
31292,https://data.biodiversitydata.nl/naturalis/spe...,Ackermann M; et al.,M. Ackermann,1,Markus Ackermann,1.0 (cos similarity); 0.25 (score overall); 19...,2002-09-11,2005-10-01,Person,collected,wikidata,http://www.wikidata.org/entity/Q21504506,0.25,1.0,-0.5,1.0
31293,https://data.biodiversitydata.nl/naturalis/spe...,Ackermann M; et al.,M. Ackermann,2,Marianne Ackermann,1.0 (cos similarity); -0.25 (score overall); ?...,2002-09-11,2005-10-01,Person,collected,wikidata,http://www.wikidata.org/entity/Q36619087,-0.25,1.0,-0.5,0.0
31294,https://data.biodiversitydata.nl/naturalis/spe...,Ackermann M; et al.,M. Ackermann,3,Manfred Ackermann,1.0 (cos similarity); -0.25 (score overall); ?...,2002-09-11,2005-10-01,Person,collected,wikidata,http://www.wikidata.org/entity/Q47112660,-0.25,1.0,-0.5,0.0
46069,https://data.biodiversitydata.nl/naturalis/spe...,Blank D; Adams RP,R.P. Adams,1,Robert Phillip Adams,1.0 (cos similarity); 0.25 (score overall); 19...,1964-10-30,1996-11-29,Person,collected,wikidata,http://www.wikidata.org/entity/Q10363201,0.25,1.0,-0.5,1.0
46071,https://data.biodiversitydata.nl/naturalis/spe...,Blank D; Adams RP,R.P. Adams,2,Robert Perry Adams,1.0 (cos similarity); 0.00 (score overall); 18...,1964-10-30,1996-11-29,Person,collected,wikidata,http://www.wikidata.org/entity/Q21504568,0.0,1.0,-0.5,0.5
46070,https://data.biodiversitydata.nl/naturalis/spe...,Blank D; Adams RP,R.P. Adams,3,Robert Perry Adams,1.0 (cos similarity); -0.25 (score overall); ?...,1964-10-30,1996-11-29,Person,collected,wikidata,http://www.wikidata.org/entity/Q10363200,-0.25,1.0,-0.5,0.0

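The rows above are cases where one parsed collector name matches several Wikidata items. One possible follow-up filter, which is not part of the workflow in this notebook, is to keep per `occurrenceID` and `alternateName` only the candidates with the highest `custom_score_overall` and to flag ties for manual review. A minimal sketch on rows modelled on the output above (the `occurrenceID` values are placeholders and `needs_review` is a hypothetical column name):

```python
import pandas as pd

# Rows modelled on the output above; occurrenceID values are placeholders.
candidates = pd.DataFrame({
    "occurrenceID": ["occ-A", "occ-A", "occ-B", "occ-B", "occ-B"],
    "alternateName": ["A. Abbas", "A. Abbas",
                      "M. Ackermann", "M. Ackermann", "M. Ackermann"],
    "name": ["Alia Abbas", "Abdulla Abbas",
             "Markus Ackermann", "Marianne Ackermann", "Manfred Ackermann"],
    "custom_score_overall": [-0.25, -0.25, 0.25, -0.25, -0.25],
})

keys = ["occurrenceID", "alternateName"]

# Best score within each group, broadcast back to every row of the group
best_score = candidates.groupby(keys)["custom_score_overall"].transform("max")
best = candidates[candidates["custom_score_overall"] == best_score].copy()

# Several candidates sharing the best score cannot be decided automatically
best["needs_review"] = best.groupby(keys)["name"].transform("count") > 1
print(best[["alternateName", "name", "custom_score_overall", "needs_review"]])
```

In this sample "M. Ackermann" resolves to a single candidate, while both "A. Abbas" candidates remain and are flagged for review.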

In [28]:
# TODO further evaluation or filtering, counting, clean up etc.
os.makedirs('data', exist_ok=True) # creates the directory only if it is missing

# this_timestamp_for_data=time.strftime('%Y%m%d') # 20230719
this_timestamp_for_data=20231116
this_output_file='data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_cossim-similarity_dwc-agent-output_%s.csv' % (
    this_timestamp_for_data
)

dwcagent_attr_output.to_csv(this_output_file, index=False)

print("Wrote matches of collector names as dwc-agent-output into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # right shift by 10 bits = integer division by 1024 (bytes to kB)
)

Wrote matches of collector names as dwc-agent-output into data/results_naturalis_collectors-eventDate_vs_wikidata-botanists_cossim-similarity_dwc-agent-output_20231116.csv (19206 kB)

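When the written file is loaded again for further evaluation, `startedAtTime` and `endedAtTime` come back as plain strings unless they are parsed explicitly. A minimal sketch, using an in-memory stand-in for the CSV file (two rows made up from names shown above; an empty date field corresponds to `NaT` in the data frame):

```python
import io
import pandas as pd

# Made-up stand-in for the written CSV file; empty date fields become NaT.
csv_text = (
    "name,startedAtTime,endedAtTime,custom_score_overall\n"
    "Erik Acharius,,,0.25\n"
    "Markus Ackermann,2002-09-11,2005-10-01,0.25\n"
)

reloaded = pd.read_csv(
    io.StringIO(csv_text),  # replace with the path of the written file
    parse_dates=["startedAtTime", "endedAtTime"],
)
print(reloaded.dtypes)
```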

## Documentation

TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nick name from name parsing
appellation | appellation from name parsing
title | title from name parsing
occurrenceID_collectors_count | number of occurrenceIDs recorded for one particular collector name
occurrenceID_collectors_firstsample | one example occurrenceID for this collector name
eventDate | date of the sampling event (required by GBIF, ☞ https://www.gbif.org/data-quality-requirements-sampling-events)
eventDate_min | calculated earliest date of all the sampling events within the data
eventDate_max | calculated latest date of all the sampling events within the data
eventDate_mean | calculated mean date of all the sampling events within the data
activity_span (TODO) | number of years between first and last collection
**Name matching** |
namematch_source_data | matched name of the collector data set
namematch_resource_data | name from Wikidata that the collector name was matched against
namematch_similarity | calculated cosine-similarity
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label (perhaps similar to the full name)
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))
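
The `activity_span` column is listed in the table but not calculated yet. A minimal sketch of how it could be derived, assuming that `eventDate_min` and `eventDate_max` are datetime columns and that the difference of calendar years is what is wanted (both are assumptions, not the notebook's implementation):

```python
import pandas as pd

# Example dates taken from the output rows shown earlier;
# NaT marks a collector without any dated sampling event.
collectors = pd.DataFrame({
    "eventDate_min": pd.to_datetime(["1921-03-01", "1960-04-21", None]),
    "eventDate_max": pd.to_datetime(["1986-04-01", "2006-06-08", None]),
})

# Difference of calendar years; the nullable Int32 dtype keeps missing
# values as <NA>, like the dtype used for yob/yod when loading Wikidata.
collectors["activity_span"] = (
    collectors["eventDate_max"].dt.year - collectors["eventDate_min"].dt.year
).astype("Int32")
print(collectors)
```

A span that is much longer than a plausible working life hints at several people sharing one collector name string.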