# Match BGBM Collectors to Wikidata Items Using *Nearest Neighbour*, Just Name Comparison

See also [match_names_BGBM-dwcagent-parsed-eventDate_vs_WikiData_k-nearest.ipynb](./match_names_BGBM-dwcagent-parsed-eventDate_vs_WikiData_k-nearest.ipynb), which additionally uses the sampling `eventDate` as a lifetime reference when deciding name matches to Wikidata.

This notebook:

- matches the `canonical_string` of Wikidata against the `canonical_string` of the source collectors (abbreviated names and, if given, full names);
- relies on collector source names that were parsed beforehand with <https://libraries.io/rubygems/dwc_agent> to split the name lists in the source data into individual names;
- follows, in general, the example of Niels Klazenga <https://github.com/nielsklazenga/avh-collectors/blob/master/match_names_to_wikidata_items.ipynb>;
- writes the output as a DarwinCore attribution structure (for `verbatimName` we would need the `source_data` name(s)).

For the DarwinCore agent attribution output, reconsider `displayOrder` so that it represents first and foremost the data quality, *not* the name match itself.

Technical notes (code to review):
- TODO review the score calculation that relates `eventDate` to the range of `yob`, `yod`
- TODO review the DwC agent output; for now keep the custom columns for filter-sort-evaluation convenience
- (NN ⇌ Cosine) refactor relation: wd_matchtest ⇌ wikidata_unique (replaced wd_matchtest → wikidata_unique)

### Load Wikidata Data Set

Use the Jupyter notebook [create_wikidata_datasets_botanists-altlabel.ipynb](./create_wikidata_datasets_botanists-altlabel.ipynb) to generate the botanist data used for matching.

Now load the data and make them unique …

In [1]:
import pandas as pd
import pprint, time, os

explain_and_show_the_data = True
this_timestamp_for_data=20260210
# this_timestamp_for_data=time.strftime('%Y%m%d')

wikidata = pd.read_csv(
    # "data/wikidata_persons_botanists_20231030_1539.csv", # inverse match: [particle +] family, given
    # "data/wikidata_persons_botanists_20231116.csv",        # match: given [+ particle] + family[+ , suffix]
    # "data/wikidata_persons_botanists_20240312.csv", # with itemLabel + altLabel wyb, wye removed
    "data/wikidata_persons_botanists_20260210.csv",
    index_col=0, low_memory=False,
    dtype={
        'yob':'Int32',
        'yod':'Int32',
        'wyb':'Int32',
        'wye':'Int32'
    }    
)
if explain_and_show_the_data:
    pprint.pprint(wikidata.columns)
    display(wikidata.head())

Index(['item', 'itemLabel', 'surname', 'initials', 'canonical_string',
       'canonical_string_fullname', 'orcid', 'viaf', 'isni', 'harv', 'ipni',
       'abbr', 'bionomia_id', 'yob', 'yod', 'wikidata_link', 'orcid_link',
       'harv_link', 'ipni_link', 'bionomia_link'],
      dtype='str')


Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q100142069,Frida Eggens,,,Eggens,Eggens,,,,,20045232-1,Eggens,,,,http://www.wikidata.org/wiki/Q100142069,,,https://www.ipni.org/a/20045232-1,
1,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Frida,F.,F. Eggens,Frida Eggens,,,,,20045232-1,Eggens,,,,http://www.wikidata.org/wiki/Q100142069,,,https://www.ipni.org/a/20045232-1,
2,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Elizabeth,E.,E. Harrison,Elizabeth Harrison,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
3,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,,,Mrs A. H.,Mrs A. H.,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
4,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs Arnold,M. A.,M. A. Harrison,Mrs Arnold Harrison,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795


In [2]:
# Create data frames with unique canonical strings:
# group by canonical name/string and count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()
wd_matchtest_fullnames = wikidata.groupby('canonical_string_fullname').agg({'item': ['count']}).reset_index()

if explain_and_show_the_data:
    display(wd_matchtest )
    display(wd_matchtest_fullnames)

# colls = list(wikidata.columns)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"""F."" Ryser",1
1,"""N.A. Antipova"" (lapsus)",1
2,"""N.A.Antipova"" (lapsus)",1
3,"""The grandmother of female scientists in Ghana""",1
4,"""Н. А. Антипова"" (lapsus)",1
...,...,...
171443,赵云鹏,1
171444,郭亚龙,1
171445,金井弘夫(Hiroo Kanai),1
171446,金双 马,1


Unnamed: 0_level_0,canonical_string_fullname,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"""Fritz"" Ryser",1
1,"""N.A. Antipova"" (lapsus)",1
2,"""N.A.Antipova"" (lapsus)",1
3,"""The grandmother of female scientists in Ghana""",1
4,"""Н. А. Антипова"" (lapsus)",1
...,...,...
204788,赵云鹏,1
204789,郭亚龙,1
204790,金井弘夫(Hiroo Kanai),1
204791,金双 马,1
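
What the `count` column above records can be illustrated without pandas. This is a minimal sketch using a few hand-typed `(item, canonical_string)` pairs: the first three come from the table head shown earlier, and the last one (`Q_HYPOTHETICAL`) is invented here only to produce a duplicate name.

```python
from collections import Counter

# (item, canonical_string) pairs; the last pair is hypothetical
rows = [
    ("Q100142069", "Eggens"),
    ("Q100142069", "F. Eggens"),
    ("Q100146795", "E. Harrison"),
    ("Q_HYPOTHETICAL", "E. Harrison"),
]

# count how many rows share one canonical string (what groupby + count records)
name_counts = Counter(name for _item, name in rows)

# names with a count above 1 are ambiguous: they occur in more than one row
ambiguous = sorted(name for name, n in name_counts.items() if n > 1)

print(name_counts)
print(ambiguous)
```

A count above 1 means that a single matched string can stand for several Wikidata rows, which matters when the matches are joined back to Wikidata items.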


### Load Collectors Data Set

Data sources:

- option 1: Jupyter Notebook for [create_bgbm_gbif-occurrence_collectors_dataset.ipynb](./create_bgbm_gbif-occurrence_collectors_dataset.ipynb)
- option 2: Jupyter Notebook for `create_bgbm_botanypilot_collectors_dataset.ipynb` from SPARQL (not in this official documentation yet)

Then parse the collector names into single, separate collector names using the Ruby gem `dwc_agent`, available at <https://rubygems.org/gems/dwc_agent>:

- see [./bin/README.md](bin/README.md) for how to use the Ruby script `./bin/agent_parse4tsv.rb` to parse text lines like `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`

In [3]:
# atomized names were already parsed by the Ruby gem dwc_agent;
# the same name can still occur across multiple rows,
# so it is probably better for the matching to make the name rows unique later on

# collectors = pd.read_csv("data/bgbm_collectors_20230510_1429_single-line_parsed_unique_names.tab", sep="\t")
collectors = pd.read_csv("data/VHde_doi-10.15468-dl.tued2e/occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv", sep="\t", low_memory=False)

collectors.dropna(subset=['family'], inplace=True) # remove rows where family is NA, e.g. from originally «??» etc.
collectors.sort_values(by=['family', 'given','occurrenceID_firstsample'], inplace=True)

if explain_and_show_the_data: display(collectors)

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,occurrenceID_count,occurrenceID_firstsample
1100,A,,,,,,,,"Angeludis C., A.",parsed:C. Angeludis<SEP>A,cleaned:C. Angeludis<SEP>A,1,http://id.snsb.info/snsb/collection/462869/564...
18482,A,,,,,,,,"Gambia, A.",parsed:A,cleaned:A,6,http://id.snsb.info/snsb/collection/668020/726...
40642,A,,,,,,,,"Nobs, Malcolm,A. & Galen Smith,S.",parsed:Malcolm Nobs<SEP>A<SEP>S. Galen Smith,cleaned:Malcolm Nobs<SEP>A<SEP>S. Galen Smith,1,https://herbarium.bgbm.org/object/B100089760
62629,A,,,,,,,,"Weigend,M., Ackermann, M.Cano,A. & La Torre,M.I.",parsed:M. Weigend<SEP>M.Cano Ackermann<SEP>A<S...,cleaned:M. Weigend<SEP>M. Cano Ackermann<SEP>A...,2,https://herbarium.bgbm.org/object/B100137881
39131,A,,,,,,,,Monteagudo A.; Peña; A.; Francis; R. & Quintuy...,parsed:A. Monteagudo<SEP>Peña<SEP>A<SEP>Franci...,cleaned:A. Monteagudo<SEP>Peña<SEP>A<SEP>Franc...,1,https://herbarium.bgbm.org/object/B100575848
...,...,...,...,...,...,...,...,...,...,...,...,...,...
66687,Żelany,J.,,,,,,,"Żelany,J.",parsed:J. Żelany,cleaned:J. Żelany,1,https://herbarium.bgbm.org/object/B100220196
66688,Żelazny,J.,,,,,,,"Żelazny,J.",parsed:J. Żelazny,cleaned:J. Żelazny,4,https://herbarium.bgbm.org/object/B100344466
66689,Ždanova,O.,,,,,,,"Ždanova,O.",parsed:O. Ždanova,cleaned:O. Ždanova,5,https://herbarium.bgbm.org/object/B100263330
32958,Ždanova,O.,,,,,,,"Lomonosova,M., Ždanova,O. & Šaulo,D.",parsed:M. Lomonosova<SEP>O. Ždanova<SEP>D. Šaulo,cleaned:M. Lomonosova<SEP>O. Ždanova<SEP>D. Šaulo,1,https://herbarium.bgbm.org/object/B100263331
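
The `parsed_names` and `cleaned_names` columns keep all names of one `source_data` entry in a single string, prefixed with `parsed:` or `cleaned:` and separated by `<SEP>`. Below is a minimal sketch of how such a cell could be split in Python. The helper `split_parsed_names` is hypothetical; it is not part of this notebook or of `dwc_agent`.

```python
def split_parsed_names(cell, prefix="parsed:", separator="<SEP>"):
    """Return the individual names stored in one parsed_names/cleaned_names cell."""
    if cell.startswith(prefix):
        cell = cell[len(prefix):]
    # skip empty fragments, which remain when a name was cleaned away
    return [name.strip() for name in cell.split(separator) if name.strip()]


parsed_example = split_parsed_names("parsed:Malcolm Nobs<SEP>A<SEP>S. Galen Smith")
cleaned_example = split_parsed_names("cleaned:Schiers Ahagen<SEP>", prefix="cleaned:")

print(parsed_example)
print(cleaned_example)
```

Both example strings are copied from the table above.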


#### Check Composition of Parsed Collector Data

In [4]:
# TODO review code of abbreviated names and full name matching
if explain_and_show_the_data: 
    criterion_fullnames = collectors.given.str.contains('^\\w{3,}', na=False)
    print("Show collectors whose given name is (probably) a full name (%s records) …" % len(collectors[criterion_fullnames].index))
    display(collectors[criterion_fullnames])

Show collectors whose given name is (probably) a full name (1368 records) …


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,source_data,parsed_names,cleaned_names,occurrenceID_count,occurrenceID_firstsample
2314,Abdallah,Raffael,,,,,,,"Bally,P.R.O., Abdallah, Raffael & Reichstein, T.",parsed:P.R.O. Bally<SEP>Raffael Abdallah<SEP>T...,cleaned:P.R.O. Bally<SEP>Raffael Abdallah<SEP>...,1,https://herbarium.bgbm.org/object/B200125981
27183,Abdul,Kadir Bin,,,,,,,Kadir Bin Abdul,parsed:Kadir Bin Abdul,cleaned:Kadir Bin Abdul,1,https://herbarium.bgbm.org/object/B100184021
21155,Abreu,Guilherme,,de,,,,,Guilherme de Abreu (no. 103),parsed:Guilherme de Abreu,cleaned:Guilherme de Abreu,1,http://id.snsb.info/snsb/collection/22086/3086...
18579,Adá,García,,,,,,,"García Adá, Luceño, Rico,E., Romero,T. & Varga...",parsed:García Adá<SEP>Rico Luceño<SEP>E. Romer...,cleaned:García Adá<SEP>Rico Luceño<SEP>E. Rome...,1,https://herbarium.bgbm.org/object/B100296455
414,Ahagen,Schiers,,,,,,,"Ahagen, Schiers,C. & al.",parsed:Schiers Ahagen<SEP>C,cleaned:Schiers Ahagen<SEP>,1,https://herbarium.bgbm.org/object/B100194646
...,...,...,...,...,...,...,...,...,...,...,...,...,...
65900,Zickendrath,Ernst,,,,,,,"Zickendrath,Ernst",parsed:Ernst Zickendrath,cleaned:Ernst Zickendrath,3,https://je.jacq.org/JE04006629
65901,Zickendrath,Ernst,,,,,,,"Zickendrath,Ernst & Heyden,K.L.",parsed:Ernst Zickendrath<SEP>K.L. Heyden,cleaned:Ernst Zickendrath<SEP>K.L. Heyden,1,https://je.jacq.org/JE04007139
65987,Ziz,Johann Baptist,,,,,,,"Ziz,Johann Baptist",parsed:Johann Baptist Ziz,cleaned:Johann Baptist Ziz,1,https://je.jacq.org/JE00017744
66049,Zollinger,Heinrich,,,,,,,"Zollinger,Heinrich",parsed:Heinrich Zollinger,cleaned:Heinrich Zollinger,2,https://herbarium.bgbm.org/object/B101097046
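
The full-name criterion is the regular expression `^\w{3,}` applied to the `given` column: a given name counts as "full" when it starts with at least three word characters. The sketch below replays that test with the standard `re` module on a few given names taken from the data. The helper name `looks_like_full_name` is made up for illustration.

```python
import re

# three or more word characters at the very start of the given name
FULLNAME_PATTERN = re.compile(r"^\w{3,}")


def looks_like_full_name(given):
    """Mimic the notebook's full-name criterion for one given-name string."""
    return bool(FULLNAME_PATTERN.search(given))


results = {
    given: looks_like_full_name(given)
    for given in ["Raffael", "Kadir Bin", "M.L.", "Th", "ACVDd"]
}
for given, is_full in results.items():
    print(given, is_full)
```

The criterion is a heuristic. `Th` is treated as abbreviated, while `ACVDd` (parsed from `Orbigny,A.C.V.D.d'`) passes as a full name although it is a run of initials.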


In [5]:
# check whether the parsed columns are empty or need to be considered as data for matching
if explain_and_show_the_data:
    for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
        test_collectors = collectors.loc[(collectors[parsed_name_part].isna() == False)]
        print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
        display(test_collectors.head().get(["family", "given", "suffix", "particle", "dropping_particle", "nick", "appellation", "title"]))


----------------------------------------
show names with **particle** found 738 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
21155,Abreu,Guilherme,,de,,,,
363,Aghababyan,M.,,von,,,,
4192,Aguilar,M.L.,,Reyna de,,,,
60978,Aguilar,M.L.,,Reyna de,,,,
16871,Aguilar,M.L.,,Reyna de,,,,



----------------------------------------
show names with **suffix** found 15 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
17393,August,Friedrich,II.,,,,,
59026,Dogma,I.J.,Jr.,,,,,
17123,Forsyth,W.,jr.,,,,,
803,Grear,J.W.,Jr.,,,,,
26304,Grear,J.W.,Jr.,,,,,



----------------------------------------
show names with **dropping_particle** found 0 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title



----------------------------------------
show names with **appellation** found 1 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title
17225,Sennen,,,,,,Fr,


Compose the `canonical_string…` of the collector data, against which the Wikidata names will be matched later:

In [6]:
if explain_and_show_the_data: print("combine parts of names similar to WikiData's given name labels")
collectors['canonical_string_collector_parsed'] = collectors[['given', 'particle', 'family', 'suffix']]\
    .fillna('')\
    .apply(
        lambda this_df: "{given}{particle}{family}{suffix}".format(
            given=this_df["given"],
            particle=" " + this_df["particle"] if this_df["particle"] else '', 
            family=" " + this_df["family"] if this_df["family"] else '', 
            suffix=", " + this_df["suffix"] if this_df["suffix"] else ''
        ).strip(), axis="columns"
    )

criterion = collectors["particle"].str.contains("\\w+ \\w+", na=False)

# display(collectors['canonical_string_collector_parsed'][criterion].head())
if explain_and_show_the_data: 
    display(collectors[['canonical_string_collector_parsed', 'particle']][criterion].drop_duplicates().head(10))


combine parts of names similar to WikiData's given name labels


Unnamed: 0,canonical_string_collector_parsed,particle
4192,M.L. Reyna de Aguilar,Reyna de
8,H. Abbas al Ani,Abbas al
49192,J. Francisco del Aquila,Francisco del
34502,A.L.M. Marcailhou d' Aymeric,Marcailhou d'
47296,P.R.O. Ritchie in Bally,Ritchie in
42269,H. Perrier de la Bathie,Perrier de la
41631,A.M.F. Palisot de Beauvois,Palisot de
41632,A.M.F.J. Palisot de Beauvois,Palisot de
60536,D. Van der Ben,Van der
41643,F.A. Marschall von Bieberstein,Marschall von
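
The composition rule used above can be tried on single names without a data frame. This sketch wraps the same format string in a stand-alone function. The function name `compose_canonical_string` is an assumption made here; the name parts are taken from the tables above.

```python
def compose_canonical_string(given="", particle="", family="", suffix=""):
    """Join name parts as 'given particle family, suffix' and skip empty parts."""
    return "{given}{particle}{family}{suffix}".format(
        given=given,
        particle=" " + particle if particle else "",
        family=" " + family if family else "",
        suffix=", " + suffix if suffix else "",
    ).strip()


with_particle = compose_canonical_string("M.L.", "Reyna de", "Aguilar")
with_suffix = compose_canonical_string("J.W.", family="Grear", suffix="Jr.")
family_only = compose_canonical_string(family="Sennen")

print(with_particle)
print(with_suffix)
print(family_only)
```

The final `strip()` removes the leading space that would otherwise remain when the given name is empty.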


In [7]:
# move canonical_string_collector_parsed after column title (title was the last of the parsing columns)
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)

these_columns=["family", "given", "suffix", "particle", "dropping_particle", "nick", "appellation", "title", 'canonical_string_collector_parsed']

if 'source_data' in collectors.columns:
    these_columns.append("source_data")

if explain_and_show_the_data: display(collectors.tail().get(these_columns))

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data
66687,Żelany,J.,,,,,,,J. Żelany,"Żelany,J."
66688,Żelazny,J.,,,,,,,J. Żelazny,"Żelazny,J."
66689,Ždanova,O.,,,,,,,O. Ždanova,"Ždanova,O."
32958,Ždanova,O.,,,,,,,O. Ždanova,"Lomonosova,M., Ždanova,O. & Šaulo,D."
66690,Žíla,V.,,,,,,,V. Žíla,"Žíla,V."


In [8]:
if explain_and_show_the_data: print("group and aggregate data to have unique name rows only for the matching of names later on")

collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    source_data=('source_data', lambda x: list(x)[0]),
    occurrenceID_collectors_count= ('occurrenceID_count', 'sum'), # sum the occurrence counts of all grouped rows
    occurrenceID_collectors_firstsample=('occurrenceID_firstsample', lambda x: list(x)[0]) # custom function, to get the first entry
).reset_index()

# move canonical_string_collector_parsed after column title
col = collectors_unique.pop("canonical_string_collector_parsed")
collectors_unique.insert(collectors_unique.columns.get_loc('title') + 1, col.name, col)
collectors_unique.drop_duplicates(inplace=True)

if explain_and_show_the_data: display(collectors_unique)

# column naming perhaps more clear (because we condensed the data)?
# collectors=collectors.add_suffix('_namegrouped') \
#  if not any(col.endswith("_namegrouped") for col in list(collectors.columns))

group and aggregate data to have unique name rows only for the matching of names later on


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample
0,A,,,,,,,,A,"Angeludis C., A.",25,http://id.snsb.info/snsb/collection/462869/564...
1,Silveira,A. A. da,,,,,,,A. A. da Silveira,"Silveira,A.A. da",6,https://herbarium.bgbm.org/object/B200129406a
2,Thouars,A. A. du,,,,,,,A. A. du Thouars,"Thouars,A. A.du Petit-",34,https://herbarium.bgbm.org/object/BW20158010
3,Bunge,A. A. von,,,,,,,A. A. von Bunge,"Bunge, A.A. von (no. )",1,http://id.snsb.info/snsb/collection/511901/634...
4,Aaronsohn,A.,,,,,,,A. Aaronsohn,"Aaronsohn,A.",3,https://je.jacq.org/JE00010154
...,...,...,...,...,...,...,...,...,...,...,...,...
20966,Karjagin,İ.,,,,,,,İ. Karjagin,"Karjagin, İ.",1,https://herbarium.bgbm.org/object/B100601585
20967,Graf,Ž.,,,,,,,Ž. Graf,"Graf, Ž.",1,http://id.snsb.info/snsb/collection/480433/591...
20968,Černeva,Ž.,,,,,,,Ž. Černeva,"Markova,M. & Černeva,Ž.",1,https://herbarium.bgbm.org/object/B100310826
20969,Fedorova,В.,,,,,,,В. Fedorova,"Tikhomirov,V. [Тихомиров,В.], Fedorova,T. & St...",1,https://herbarium.bgbm.org/object/B100630485


In [9]:
if explain_and_show_the_data: 
    print("show collectors with highest occurrenceID_collectors_count")
    display(collectors_unique.sort_values(by=['occurrenceID_collectors_count', 'family', 'given'], ascending=[False, True, True]).head(10))

show collectors with highest occurrenceID_collectors_count


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample
5334,Willing,E.,,,,,,,E. Willing,"Willing,E. & Eisenblätter,R.",189994,https://herbarium.bgbm.org/object/B100145955
17109,Willing,R.,,,,,,,R. Willing,"Willing,R. & Willing,R.",188088,https://herbarium.bgbm.org/object/B100074611
15678,Hein,P.,,,,,,,P. Hein,"Kilian,N., Hein,P. & Oberprieler,C.",9594,https://herbarium.bgbm.org/object/B100113011
14836,Kilian,N.,,,,,,,N. Kilian,"Hunger,S. & Kilian,N.",9198,https://herbarium.bgbm.org/object/B100003448
3491,Martius,C.F.P.,,von,,,,,C.F.P. von Martius,"Martius, C.F.P. von",6097,http://id.snsb.info/snsb/collection/502732/623...
4206,Rodríguez,D.,,,,,,,D. Rodríguez,"Rodríguez,D., Monterrosa,J., Hernández,A. & Ma...",5206,https://herbarium.bgbm.org/object/B100038970
5263,Tempel,E.,,,,,,,E. Tempel,"Tempel,E.",4561,https://dr.jacq.org/DR073621
2824,Martius,C. F. P. von,,,,,,,C. F. P. von Martius,"Martius, C.F.P. von (no. Obs. 1490)",4545,http://id.snsb.info/snsb/collection/117775/176...
7707,Schimper,G.W.,,,,,,,G.W. Schimper,"Schimper, G.W. (no. 880)",4320,http://id.snsb.info/snsb/collection/108223/167...
8844,Haussknecht,H.K.,,,,,,,H.K. Haussknecht,"Haussknecht, H.K.",4288,http://id.snsb.info/snsb/collection/474055/583...
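
The table above lists `C.F.P. von Martius` and `C. F. P. von Martius` as two separate collectors, because the grouping into `collectors_unique` works on exact strings. The following plain-Python sketch replays that aggregation. The two `O. Ždanova` rows are taken from the collectors table shown earlier; the Martius counts are illustrative only.

```python
from collections import defaultdict

# (canonical string, occurrenceID_count) rows; the Martius counts are illustrative
rows = [
    ("O. Ždanova", 5),
    ("O. Ždanova", 1),
    ("C.F.P. von Martius", 2),
    ("C. F. P. von Martius", 3),
]

totals = defaultdict(int)
for name, count in rows:
    totals[name] += count  # same idea as the ('occurrenceID_count', 'sum') aggregation

print(dict(totals))
```

Identical strings are merged and their counts summed, while spelling variants that differ only in spacing remain separate rows and are matched independently.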


In [10]:
# Idea: Should we use data column suffixes to follow the data source after merging is done later?
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

### Set Up the Text Analysis

See https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536 for deeper understanding.

The `ngrams` function is used as an analyzer in the text search later.

In [11]:
import re
!pip install ftfy # amazing text cleaning for decode issues..
from ftfy import fix_text

def ngrams(string, n=3):
    """
    Construct character n-grams of a given text

    @param string: the text string to split into n-grams
    @param n: number of characters of each n-gram
    @return: list of n-gram strings
    """
    string = fix_text(string) # fix text
    # drop all non-ASCII characters: this removes accented letters and empties names written entirely in non-Latin scripts
    string = string.encode("ascii", errors="ignore").decode()
    string = string.lower()
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title()  # normalise case - capital at start of each word
    string = re.sub(' +', ' ', string).strip() # get rid of multiple spaces and replace with a single
    string = ' ' + string + ' '  # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD', r'', string)
    this_ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in this_ngrams]



In [12]:
# pip install --upgrade scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# nbrs_data = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf_vector_data) 
# tfidf_vector_data contains the vectorized wikidata names from the previous step


def getNearestNeighbour(query, this_vectorizer, this_nbrs_data):
    """Find the nearest neighbours of the query data using the package scikit-learn


    @param query: iterable of strings, the query data to vectorize and transform
    @param this_vectorizer: the fitted TfidfVectorizer
    @param this_nbrs_data: the fitted NearestNeighbors model
    @return: (distances, indices) distances and indices of the nearest neighbours
    @rtype: (numpy.ndarray, numpy.ndarray)
    """
    queryTFIDF_ = this_vectorizer.transform(query)
    distances, indices = this_nbrs_data.kneighbors(queryTFIDF_)
    return distances, indices


def calculateTFIDFmatchingOfData(query_data, match_data, n_neighbors=1):
    """
    Calculate a TF-IDF (term frequency-inverse document frequency) matching with getNearestNeighbour()

    @param query_data: iterable of strings, usually a pandas data column, with the names to query
    @param match_data: pandas Series of strings to match against
    @param n_neighbors: Number of neighbors required for each sample by default for :meth:`kneighbors` queries (originally 5).

    @requires getNearestNeighbour()
    @requires ngrams()
    @requires TfidfVectorizer()
    @requires NearestNeighbors()

    @return: DataFrame a data frame of matches with columns 'namematch_source_data', 'namematch_resource_data', 'namematch_distance'
    """

    import time
    start = time.time()
    query_data = set(query_data)
    # a set removes duplicate query strings; it is converted back to a list below

    print('Vectorizing data. This may take a while...')
    # vectorize wikidata names
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
    tfidf_vector_data = vectorizer.fit_transform(match_data
        # wd_matchtest['canonical_string']
    )
    nbrs_data = NearestNeighbors(n_neighbors=n_neighbors, n_jobs=-1).fit(tfidf_vector_data)
    duration = time.time() - start
    print('Vectorizing completed: created a matrix of TF-IDF features after %s s' % duration)

    print('Getting nearest neighbours of %s data with %s neighbor sample(s)...' % (len(query_data), n_neighbors))
    distances, indices = getNearestNeighbour(query_data, vectorizer, nbrs_data)
    duration = time.time() - start
    print('Completed after %s s' % duration)

    query_data = list(query_data)  # convert back to list

    print('Finding matches build new data frame ...')
    matches = []
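    for i, j in enumerate(indices):
        # j holds the indices of all n_neighbors candidates, nearest first;
        # only the nearest one (position 0) is kept here
        temp = [query_data[i], match_data.values[j][0], round(distances[i][0], 2)]
        matches.append(temp)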

    duration = time.time() - start
    print('Building matches done after %s s' % duration)
    matches = pd.DataFrame(
        matches,
        columns=['namematch_source_data', 'namematch_resource_data', 'namematch_distance']
    )

    print('Done')
    return matches


Vectorize the Wikidata names. Background: we use an information retrieval technique, TF-IDF (term frequency-inverse document frequency; see the blog post [towardsdatascience.com/tf-idf-explained…](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275)), to match the source names with Wikidata names. The calculated distance measure of each name match helps to accept similar names and to set apart names that are rather no match. In general see also https://scikit-learn.org and https://pypi.org/project/scikit-learn/.
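
To make the distance measure more tangible, here is a plain-Python sketch of TF-IDF on character trigrams followed by a Euclidean distance. It is not the notebook's implementation. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 normalisation that scikit-learn documents as `TfidfVectorizer` defaults. It uses a simplified `trigrams` helper instead of `ngrams`, and it fits on all example names at once, whereas the notebook fits on the Wikidata names only.

```python
import math
from collections import Counter


def trigrams(text, n=3):
    """Simplified stand-in for ngrams(): pad with spaces, no text cleaning."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(docs):
    """Return one L2-normalised {trigram: weight} dict per document."""
    counts = [Counter(trigrams(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(gram for count in counts for gram in count)
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    vectors = []
    for count in counts:
        vec = {g: tf * idf[g] for g, tf in count.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


names = ["T. Kotschy", "Kotschy", "J. Webster"]
vecs = tfidf_vectors(names)
d_same = euclidean(vecs[0], vecs[0])
d_close = euclidean(vecs[0], vecs[1])
d_far = euclidean(vecs[0], vecs[2])
print(d_same, d_close, d_far)
```

Names that share many trigrams end up close together. Two names without any common trigram are the square root of 2 apart, because both vectors have length 1.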

In [13]:
# some example data
print("Show ngram examples:")
print("- simple name:", ngrams('Klazenga, N.'))
print("- data from collectors:", ngrams(collectors_unique["canonical_string_collector_parsed"].at[1])) 
print("- data from match-test:", ngrams(wd_matchtest['canonical_string'].at[0]))
print("- data from match-test (full name):", ngrams(wd_matchtest_fullnames['canonical_string_fullname'].at[0]))

# some example data
for i, row in enumerate(range(5)):
    if (i == 0):
        print('\n(WikiData’s) canonical_string = (constructed) canonical_string_fullname:') 
    print("- {short_name} = {long_name}".format(
        short_name=wd_matchtest['canonical_string'].at[row],
        long_name=wd_matchtest_fullnames['canonical_string_fullname'].at[row]
    ))


Show ngram examples:
- simple name: [' Kl', 'Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N', ' N ']
- data from collectors: [' A ', 'A A', ' A ', 'A D', ' Da', 'Da ', 'a S', ' Si', 'Sil', 'ilv', 'lve', 'vei', 'eir', 'ira', 'ra ']
- data from match-test: [' "F', '"F"', 'F" ', '" R', ' Ry', 'Rys', 'yse', 'ser', 'er ']
- data from match-test (full name): [' "F', '"Fr', 'Fri', 'rit', 'itz', 'tz"', 'z" ', '" R', ' Ry', 'Rys', 'yse', 'ser', 'er ']

(WikiData’s) canonical_string = (constructed) canonical_string_fullname:
- "F." Ryser = "Fritz" Ryser
- "N.A. Antipova" (lapsus) = "N.A. Antipova" (lapsus)
- "N.A.Antipova" (lapsus) = "N.A.Antipova" (lapsus)
- "The grandmother of female scientists in Ghana" = "The grandmother of female scientists in Ghana"
- "Н. А. Антипова" (lapsus) = "Н. А. Антипова" (lapsus)


### Perform the Matching

Perform the nearest neighbour (NN) matching on the (BGBM) collector names and create a data frame with the matches. Abbreviated and full names in the source are matched separately to better align the source data with Wikidata (can take 5 to 10 minutes).

Now convert the collection of names to a matrix of TF-IDF features and run the matching function defined above...

In [14]:
criterion_fullnames = collectors_unique.given.str.contains('^\\w{3,}', na=False)
collectors_names = collectors_unique['canonical_string_collector_parsed'][~criterion_fullnames].values
# collectors_names = set(collectors_unique['canonical_string_collector_parsed'][[not fullname for fullname in criterion_fullnames]].values)
print("Calculate rather the abbreviated names only …")
matches = calculateTFIDFmatchingOfData(collectors_names, wd_matchtest['canonical_string'], 5) 
        # TODO what effect has n_neighbors ? originally in the very source code it is set to 5, not 1

matches = matches.sort_values(['namematch_distance'])
matches = matches.reset_index(names=['old_index'])

if explain_and_show_the_data: display(matches)

Calculate rather the abbreviated names only …
Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 4.260415077209473 s
Getting nearest neighbours of 20437 data with 5 neighbor sample(s)...
Completed after 67.35547757148743 s
Finding matches build new data frame ...
Building matches done after 67.51791548728943 s
Done


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,8372,M. Sanchez,M. Sanchez,0.0
1,4580,W. Wolf,W. Wolf,0.0
2,4579,J. Goldie,J. Goldie,0.0
3,16578,J. Webster,J. Webster,0.0
4,16574,Zuccarini,Zuccarini,0.0
...,...,...,...,...
20432,17609,J.B. Kreuzpointner,И.С. Решетова,1.0
20433,17631,L. Habicht,Ж. д. Герн,1.0
20434,17626,J. Pörzler,М. А. Фёдорович,1.0
20435,20430,J. Ukmar,Л. Бриттен,1.0
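
The rows with distance 1.0 pair Latin collector names with Cyrillic labels. One plausible reading is that `ngrams` drops every non-ASCII character, so a purely Cyrillic label yields no n-grams at all and becomes a zero vector. This assumes the default L2 normalisation of `TfidfVectorizer` and the default Euclidean metric of `NearestNeighbors`; the notebook overrides neither. The sketch uses `ascii_trigrams`, a simplified copy of `ngrams` without `ftfy`. The second example assumes that the `a` in the label `Г. Бaйцзе`, which appears in the join output further below, is a Latin letter.

```python
import math
import re


def ascii_trigrams(text, n=3):
    """Simplified copy of the notebook's ngrams() without ftfy.fix_text()."""
    text = text.encode("ascii", errors="ignore").decode()  # drops non-ASCII characters
    text = text.lower()
    text = re.sub(r"[)(.|\[\]{}']", "", text)
    text = text.replace("&", "and").replace(",", " ").replace("-", " ")
    text = " " + re.sub(" +", " ", text.title()).strip() + " "
    text = re.sub(r"[,-./]|\sBD", "", text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


cyrillic_only = "И.С. Решетова"
mixed = "Г. Б" + "a" + "йцзе"  # assumption: this "a" is a Latin letter

print(ascii_trigrams(cyrillic_only))            # no n-grams survive
print(ascii_trigrams(mixed), ascii_trigrams("A"))

# a non-empty TF-IDF row has length 1; an empty n-gram list gives the zero vector
query = [1.0, 0.0]
zero = [0.0, 0.0]
distance_to_zero = math.dist(query, zero)
print(distance_to_zero)
```

For two unit vectors the Euclidean distance is sqrt(2 - 2·cos), so a zero vector at distance 1.0 is nearer than any candidate whose cosine similarity is below 0.5. Under these assumptions a match at distance 1.0 to a non-Latin label reads as "no usable candidate found". The mixed-script case would also explain why collector `A` is paired with `Г. Бaйцзе` at distance 0.0.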


In [15]:
this_name = "Kotsch"
criterion = matches['namematch_source_data'].str.contains(this_name)
display(matches[criterion])

Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
2858,7940,Kotschy,Kotschy,0.0
5266,2575,T. Kotschy,T. Kotschy,0.0
10456,8552,Th Kotschy,Kotschy,0.6
11651,16534,K.G.T. Kotschy,Kotschy,0.67
11711,12374,C.G. Kotschy,Kotschy,0.67
13891,17768,C.G.T. Kotschy,Kotschy,0.76
17966,13946,C.G.T. Kotschyi,Kotschy,0.95
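
In `calculateTFIDFmatchingOfData` the loop reads only `match_data.values[j][0]` and `distances[i][0]`. The resulting data frame therefore holds a single nearest neighbour per query name, even when `n_neighbors` is 5. Below is a minimal sketch of how all candidates could be kept instead. The `indices` and `distances` lists are hand-written stand-ins for the `kneighbors` output. The candidate names, the distance values and the extra column `namematch_rank` are purely illustrative.

```python
query_names = ["Th Kotschy"]
match_names = ["Kotschy", "T. Kotschy", "J. Webster"]  # stand-in for the Wikidata names

# hand-written stand-ins for the kneighbors() output: one row per query,
# neighbours ordered from nearest to farthest (values are illustrative)
indices = [[0, 1]]
distances = [[0.6, 0.7]]

matches = []
for i, neighbour_ids in enumerate(indices):
    for rank, j in enumerate(neighbour_ids):
        matches.append({
            "namematch_source_data": query_names[i],
            "namematch_resource_data": match_names[j],
            "namematch_distance": round(distances[i][rank], 2),
            "namematch_rank": rank + 1,  # hypothetical extra column
        })

for row in matches:
    print(row)
```

Keeping the runners-up would allow a later step, for example a comparison of life dates, to choose among several close candidates rather than only accepting or rejecting the first one.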


In [16]:
# criterion_fullnames = collectors_unique.given.str.contains('^\\w{3,}', na=False)
collectors_fullnames = collectors_unique['canonical_string_collector_parsed'][criterion_fullnames].values

print("Calculate rather the full names only …")
matches_fullnames = calculateTFIDFmatchingOfData(collectors_fullnames, wd_matchtest_fullnames['canonical_string_fullname'], 5) 
         # TODO what effect has n_neighbors ? originally in the very source code it is set to 5, not 1

matches_fullnames = matches_fullnames.sort_values(['namematch_distance'])
matches_fullnames = matches_fullnames.reset_index(names=['old_index'])

if explain_and_show_the_data: display(matches_fullnames)


Calculate rather the full names only …
Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 5.026895761489868 s
Getting nearest neighbours of 534 data with 5 neighbor sample(s)...
Completed after 6.125038146972656 s
Finding matches build new data frame ...
Building matches done after 6.129297494888306 s
Done


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,0,Sánchez García,Sánchez-García,0.0
1,21,Gabriel Strobl,Gabriel Strobl,0.0
2,529,Franz Wilhelm Sieber,Franz Wilhelm Sieber,0.0
3,33,Arthur Krause,Arthur Krause,0.0
4,31,Sun Hang,Sun Hang,0.0
...,...,...,...,...
529,503,Synho Kirinco,Иоганн-Генрих-Фридрих Линк,1.0
530,34,Sumedha Mahadeo,Жерар Даниэль Вестендорп,1.0
531,32,FLl Herminier,Йоханн Вильгельм Мейген,1.0
532,28,Bierkamp Machatzi,Жозеф Декен,1.0


### Create Output Results

Join the matches data frame back to the (BGBM) collectors and Wikidata items …

In [17]:
if explain_and_show_the_data: print("join matches data frame back to source collectors dataframe")
collectors_matches = pd.merge(
    collectors_unique, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

if explain_and_show_the_data: display(collectors_matches.head())

join matches data frame back to source collectors dataframe


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A,,,,,,,,A,"Angeludis C., A.",25,http://id.snsb.info/snsb/collection/462869/564...,14848,A,Г. Бaйцзе,0.0
1,Silveira,A. A. da,,,,,,,A. A. da Silveira,"Silveira,A.A. da",6,https://herbarium.bgbm.org/object/B200129406a,687,A. A. da Silveira,A. A. d. Silveira,0.61
2,Thouars,A. A. du,,,,,,,A. A. du Thouars,"Thouars,A. A.du Petit-",34,https://herbarium.bgbm.org/object/BW20158010,8511,A. A. du Thouars,A. A. D. Thouars,0.69
3,Bunge,A. A. von,,,,,,,A. A. von Bunge,"Bunge, A.A. von (no. )",1,http://id.snsb.info/snsb/collection/511901/634...,6839,A. A. von Bunge,A. A. v. Bunge,0.73
4,Aaronsohn,A.,,,,,,,A. Aaronsohn,"Aaronsohn,A.",3,https://je.jacq.org/JE00010154,17163,A. Aaronsohn,A. Aaronsohn,0.0


In [18]:
if explain_and_show_the_data: print("show full name matches, and append them to all matches")
collectors_matches_fullname = pd.merge(
    collectors_unique, matches_fullnames, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

if explain_and_show_the_data: display(collectors_matches_fullname.head())

show full name matches, and append them to all matches


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,Orbigny,ACVDd,,,,,,,ACVDd Orbigny,"Orbigny,A.C.V.D.d'",10,https://herbarium.bgbm.org/object/B200097228,281,ACVDd Orbigny,Жозе Франсишку Коррея да Серра,1.0
1,Sauze,Abbé,,,,,,,Abbé Sauze,"Sauze,Abbé,J.",1,https://herbarium.bgbm.org/object/B100263049,143,Abbé Sauze,Sauzé,0.99
2,Hassan,Abdisalam Sheikh,,,,,,,Abdisalam Sheikh Hassan,"Friis,I., Vollesen,K. & Abdisalam Sheikh Hassan",7,https://herbarium.bgbm.org/object/B100003700,15,Abdisalam Sheikh Hassan,Abdisalam S. Hassan,0.67
3,Hassan,Abdisalem Sheikh,,,,,,,Abdisalem Sheikh Hassan,"Friis,I, Vollesen,K., & Abdisalem Sheikh Hassan",2,https://herbarium.bgbm.org/object/B100003663,393,Abdisalem Sheikh Hassan,Abdisalam S. Hassan,0.84
4,Khalek,Abel,,el,,,,,Abel el Khalek,Abel el Khalek,1,https://herbarium.bgbm.org/object/B100763849,193,Abel el Khalek,Abel,1.0


In [19]:
collectors_all_matches=pd.concat([collectors_matches, collectors_matches_fullname])
collectors_all_matches.sort_values(by=['namematch_distance', 'family'], ascending=[True, True], inplace=True)
if explain_and_show_the_data:
    print("show match results of all abbreviated and full names")
    display(collectors_all_matches.head())

show match results of all abbreviated and full names


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,source_data,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A,,,,,,,,A,"Angeludis C., A.",25,http://id.snsb.info/snsb/collection/462869/564...,14848,A,Г. Бaйцзе,0.0
1165,A.A,,,,,,,,A.A,"Magalhães G.,A.A.",3,https://herbarium.bgbm.org/object/B100247806,9271,A.A,Aa,0.0
4,Aaronsohn,A.,,,,,,,A. Aaronsohn,"Aaronsohn,A.",3,https://je.jacq.org/JE00010154,17163,A. Aaronsohn,A. Aaronsohn,0.0
1606,Abbot,,,,,,,,Abbot,Abbot,2,https://herbarium.bgbm.org/object/B100159967,9289,Abbot,Abbot,0.0
16048,Abbott,R.,,,,,,,R. Abbott,"Bisse,J. & Abbott,R.",3,https://je.jacq.org/JE00010651,18865,R. Abbott,R. Abbott,0.0


In [20]:
# criterion = collectors_all_matches['canonical_string_collector_parsed'].map(lambda x: x.startswith('Kotschy'))
# print("Show example of «Kotschy…» with namematch distances from 0.0 to 1.0 (in Cosine Similarity we had 0.5 … 1.0)")
# collectors_all_matches[criterion]

In [21]:
# Save the plain name matching results only ...

do_custom_data_aggregation=False
if do_custom_data_aggregation:
    if not os.path.exists('data'):
        print("Make data directory for saving …")
        os.makedirs('data')
        
    this_output_file='data/results_bgbm_collectors_vs_wikidata-botanists_kneighbor_plain-names_%s.csv' % (
        this_timestamp_for_data
    )
    
    collectors_all_matches.to_csv(this_output_file)
    
    print("Wrote plain name matches of collector names into %s (%d kB)" % 
        (this_output_file, os.path.getsize(this_output_file) >> 10)  # right shift by 10 bits = integer division by 1024, i.e. size in kB
    )

### Merge Matched Data with Wikidata

Review (TODO)
- merge abbreviated and full name data properly, distinguish abbreviated match and full name match
- refactor `collectors_matches`, `collectors_matches_g1` and so on to `collectors_all_matches`
- refactor `collectors` to `collectors_unique`
- refactor `matches` to `matches_abbr` or distinguish `matches_fullname`

Now
1. merge the matching data with the Wikidata records on the canonical name string
2. later, aggregate in a more fine-tuned way, checking whether identical canonical name strings relate to several different persons (we aggregate on Wikidata items, i.e. the `Q…` identifiers such as Q1233242, and on Wikidata item labels), and so on
3. save those data tables
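
The merge step can be illustrated on a tiny hand-made subset. This is only a sketch: the frames `matches_toy` and `wikidata_toy` are hypothetical stand-ins for the notebook's data frames, reduced to a few columns, with names and item IDs copied from the example outputs in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the match results (column names as in the notebook).
matches_toy = pd.DataFrame({
    "namematch_source_data":   ["T. Kotschy", "A. Andersson"],
    "namematch_resource_data": ["T. Kotschy", "A. Andersson"],
    "namematch_distance":      [0.0, 0.0],
})

# Hypothetical miniature of the Wikidata set: two items share one canonical string.
wikidata_toy = pd.DataFrame({
    "canonical_string": ["T. Kotschy", "A. Andersson", "A. Andersson"],
    "itemLabel":        ["Theodor Kotschy", "Axel Andersson", "Karl Alfred Andersson"],
    "item":             ["Q113299", "Q123652899", "Q131724787"],
})

# Inner join on the canonical name string:
# one output row per (matched name, Wikidata item) pair.
merged = pd.merge(
    matches_toy, wikidata_toy,
    left_on="namematch_resource_data", right_on="canonical_string",
)

# Names that now occur more than once relate to several Wikidata items
# and need a decision later on.
is_ambiguous = merged["namematch_source_data"].duplicated(keep=False)
ambiguous_names = sorted(set(merged.loc[is_ambiguous, "namematch_source_data"]))

print(merged[["namematch_source_data", "itemLabel", "item"]])
print("ambiguous names:", ambiguous_names)
```

Because the join multiplies rows for shared canonical strings, the row count after merging can exceed the number of matched collector names; that is the reason for the later aggregation step.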

In [22]:
if explain_and_show_the_data: print("merge now the matching data and the Wikidata records on the canonical name string")

collectors_matches_tmp_names_abbreviated = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string'
)
collectors_matches_tmp_fullnames = pd.merge(
    collectors_matches_fullname, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string_fullname'
)
collectors_matches_g1_merged_wikidata = pd.concat(
    [collectors_matches_tmp_names_abbreviated, collectors_matches_tmp_fullnames]
    , ignore_index=True
)

merge now the matching data and the Wikidata records on the canonical name string


In [23]:
if explain_and_show_the_data:
    # print(collectors_matches_g1_merged_wikidata.columns)
    print("Old matching: Show example data of «Kotschy…» with namematch distances from 0.0 to almost 1.0 (in Cosine Similarity we had 0.5 … 1.0)")
    print("Old matching: Interpretation: most of the matches seem correct (=Carl Georg Theodor) also with higher distances but we cannot be sure in mid ranges …")
    print("Old matching: Interestingly with an …yi: Kotschyi, C.G.T. (0.92) was matched to Carl Georg Theodor (=correct, and higher distance) …")
    print("Old matching:       … but Kotschy, C.G.T. (0.76) was calculated to the other person Carl Friedrich (=incorrect, lower distance … :-/)")
    this_name="Kotschy"
    print("example data: show matches of {}".format(this_name))
    
    criterion = collectors_matches_g1_merged_wikidata['canonical_string_collector_parsed'].str.contains(this_name)
    display(collectors_matches_g1_merged_wikidata[criterion].get([
        # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
        'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
        'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
        'itemLabel', 
        'canonical_string_fullname', # canonical_string_fullname contains the former itemMatchingLabel
        'wikidata_link',
    ]))

Old matching: Show example data of «Kotschy…» with namematch distances from 0.0 to almost 1.0 (in Cosine Similarity we had 0.5 … 1.0)
Old matching: Interpretation: most of the matches seem correct (=Carl Georg Theodor) also with higher distances but we cannot be sure in mid ranges …
Old matching: Interestingly with an …yi: Kotschyi, C.G.T. (0.92) was matched to Carl Georg Theodor (=correct, and higher distance) …
Old matching:       … but Kotschy, C.G.T. (0.76) was calculated to the other person Carl Friedrich (=incorrect, lower distance … :-/)
example data: show matches of Kotschy


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,canonical_string_fullname,wikidata_link
4524,1,http://id.snsb.info/snsb/collection/22980/3175...,C.G. Kotschy,Kotschy,0.67,Theodor Kotschy,Kotschy,http://www.wikidata.org/wiki/Q113299
4544,2494,http://id.snsb.info/snsb/collection/108230/167...,C.G.T. Kotschy,Kotschy,0.76,Theodor Kotschy,Kotschy,http://www.wikidata.org/wiki/Q113299
4545,1,https://herbarium.bgbm.org/object/B100160086,C.G.T. Kotschyi,Kotschy,0.95,Theodor Kotschy,Kotschy,http://www.wikidata.org/wiki/Q113299
15235,37,http://id.snsb.info/snsb/collection/16719/2549...,K.G.T. Kotschy,Kotschy,0.67,Theodor Kotschy,Kotschy,http://www.wikidata.org/wiki/Q113299
15504,2,https://dr.jacq.org/DR049432,Kotschy,Kotschy,0.0,Theodor Kotschy,Kotschy,http://www.wikidata.org/wiki/Q113299
23617,310,http://id.snsb.info/snsb/collection/117808/176...,T. Kotschy,T. Kotschy,0.0,Theodor Kotschy,T. Kotschy,http://www.wikidata.org/wiki/Q113299
23618,310,http://id.snsb.info/snsb/collection/117808/176...,T. Kotschy,T. Kotschy,0.0,Theodor Kotschy,Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
24040,5,https://herbarium.bgbm.org/object/B100526350,Th Kotschy,Kotschy,0.6,Theodor Kotschy,Kotschy,http://www.wikidata.org/wiki/Q113299
26289,2,https://je.jacq.org/JE00022436,Carl Georg Theodor Kotschy,Carl Georg Theodor Kotschy,0.0,Theodor Kotschy,Carl Georg Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
26539,1,https://herbarium.bgbm.org/object/B101113772,Karl Georg Th Kotschy,Karl Georg Theodor Kotschy,0.65,Theodor Kotschy,Karl Georg Theodor Kotschy,http://www.wikidata.org/wiki/Q113299


In [24]:
do_custom_data_aggregation=False
if do_custom_data_aggregation:
    if not os.path.exists('data'):
        print("Make data directory for saving …")
        os.makedirs('data')
        
    this_output_file='data/results_bgbm_collectors_vs_wikidata-botanists_kneighbor_merged-data_%s.csv' % (
        this_timestamp_for_data
    )
    
    collectors_matches_g1_merged_wikidata.to_csv(this_output_file)
    
    print("Wrote wiki-data merged name matches of collector names into %s (%d kB)" % 
        (this_output_file, os.path.getsize(this_output_file) >> 10)  # right shift by 10 bits = integer division by 1024, i.e. size in kB
    )

### Write DarwinCore Attribution Output

For this, merge on `namematch_resource_data` and focus on getting individual Wikidata items.

In [25]:
# refactor collectors_eventDate_mean
# refactor collectors_eventDate_min
# - refactor yob_is_lt_eventDate_min
# refactor collectors_eventDate_max
# - refactor yod_is_gt_eventDate_max
# refactor custom_score_lifetime            → custom_score_lifetime_data
# refactor custom_score_lifetime_annotation → custom_score_lifetime_data_annotation
# refactor namematch_similarity             → namematch_distance
# refactor namematch_similarity_annotation  → namematch_distance_annotation
collectors_wikidata_cossimOrKmeans = collectors_matches_g1_merged_wikidata[
    [
        'canonical_string_collector_parsed', 'family', 'given', 
        'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
        'source_data',
        'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
        'item', 'canonical_string', 'itemLabel',
        'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 
        'yob', 'yod',
        # 'wyb'
    ]
].copy()

# order by canonical_string_collector_parsed (actual collector name) (asc),
#   then by namematch_distance (asc; the lower, the better the match),
#     then by family and given name (asc)
collectors_wikidata_cossimOrKmeans.sort_values(
    by=['canonical_string_collector_parsed', 'namematch_distance', 'family', 'given'], 
    ascending=[True, True, True, True], 
    inplace=True
)

dwcagent_attr_output=collectors_wikidata_cossimOrKmeans.get([
    "occurrenceID_collectors_firstsample", 
    "canonical_string_collector_parsed",
    'family', 'given',
    "namematch_distance", 
    "source_data", 
    "itemLabel", 
    "item",
    'yob', 'yod'
]).drop_duplicates(ignore_index=True).copy()


# dwcagent_attr_output['canonical_string_collector_parsed'].replace(to_replace=r'([^,]+),\s*(.+)', value='\\2 \\1', inplace=True, regex=True)
dwcagent_attr_output['canonical_string_collector_parsed'] = dwcagent_attr_output['canonical_string_collector_parsed'].astype(object)
dwcagent_attr_output['canonical_string_collector_parsed'] = dwcagent_attr_output['canonical_string_collector_parsed'].replace(
    to_replace=r'([^,]+),\s*(.+)', 
    value=r'\2 \1',  # back-references; in a raw string a doubled backslash would insert a literal "\2 \1"
    regex=True
)
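# NOTE: the label reads "k-means distance", although namematch_distance is the nearest-neighbour distance (TODO review label)
dwcagent_attr_output['namematch_distance_annotation'] = dwcagent_attr_output['namematch_distance'].astype(str).str.replace(r'(.+)', '\\1 (k-means distance)', regex=True)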
# dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'namematch_distance_annotation', '', allow_duplicates=True)

dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'life_time_periode', '', allow_duplicates=True)

combine_life_times = lambda this_df: ("%s-%s" % (this_df["yob"], this_df["yod"])).replace(r"<NA>", "?")
dwcagent_attr_output["life_time_periode"]=dwcagent_attr_output.apply(combine_life_times, axis="columns")

# dwcagent_attr_output["life_time_periode"]

years_from_birth_until_first_collection_activity = 10
dwcagent_attr_output["custom_score_lifetime_data"] = 0.0
dwcagent_attr_output.insert(len(dwcagent_attr_output.columns), 'custom_score_lifetime_data_annotation', '', allow_duplicates=True)

# df.loc[(df['column_of_interest'] … condition), 'fill_to_column'] = value 

dwcagent_attr_output.loc[
    pd.notnull(dwcagent_attr_output["yob"]) & pd.notnull(dwcagent_attr_output["yod"]),
    "custom_score_lifetime_data"
] = 1.0
# True cases but <NA> missing values
dwcagent_attr_output.loc[
    pd.notnull(dwcagent_attr_output["yob"]) & dwcagent_attr_output["yod"].isnull(),
    "custom_score_lifetime_data"
] = 0.5
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob"].isnull() & pd.notnull(dwcagent_attr_output["yod"]),
    "custom_score_lifetime_data"
] = 0.5
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob"].isnull() & dwcagent_attr_output["yod"].isnull(),
    "custom_score_lifetime_data"
] = 0.0


# annotations True cases
dwcagent_attr_output.loc[
    pd.notnull(dwcagent_attr_output["yob"]) & pd.notnull(dwcagent_attr_output["yod"]), 
    "custom_score_lifetime_data_annotation"
] = "life time known"

# annotations True cases but <NA> missing values
dwcagent_attr_output.loc[
    pd.notnull(dwcagent_attr_output["yob"]) & dwcagent_attr_output["yod"].isnull(), 
    "custom_score_lifetime_data_annotation"
] = "year of death is missing"
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob"].isnull() & pd.notnull(dwcagent_attr_output["yod"]), 
    "custom_score_lifetime_data_annotation"
] = "year of birth is missing"
dwcagent_attr_output.loc[
    dwcagent_attr_output["yob"].isnull() & dwcagent_attr_output["yod"].isnull(), 
    "custom_score_lifetime_data_annotation"
] = "unknown life time"


dwcagent_attr_output["custom_score_multiple_names"] = 0.0 # 0 shall mean: we don’t know yet for real
dwcagent_attr_output.loc[
    (dwcagent_attr_output['canonical_string_collector_parsed'].duplicated(keep=False)),
    'custom_score_multiple_names'
] = -0.5 # one decision has to be made, so cut the range of -1 to 0 only into half (or include multiple count somehow?)

namematch_distance_max=dwcagent_attr_output['namematch_distance'].max()
dwcagent_attr_output['custom_score_overall'] = (
    # reconsider/transform distance (0 … xx, range larger than 1) to similarity (1 … 0, range of 1) for scoring
    abs( dwcagent_attr_output['namematch_distance'] - namematch_distance_max ) / namematch_distance_max * \
    (
        ( dwcagent_attr_output["custom_score_lifetime_data"] + dwcagent_attr_output['custom_score_multiple_names']) / 2
    )
).round(3)


dwcagent_attr_output['attributionRemarks'] = dwcagent_attr_output.apply(
    lambda row: "{similarity_distance_note};"
                " {score_overall:.2f} (score overall);"
                " {lifetime_periode} (life time);"
                " {lifetime_score:.1f} (life time score);"
                " {lifetime_score_annote} (life time score note);"
                " {score_multinames:.2f} (score multiple names);"
        .format(
    similarity_distance_note=row['namematch_distance_annotation'],
    lifetime_periode=row["life_time_periode"],
    lifetime_score=row["custom_score_lifetime_data"],
    lifetime_score_annote=row["custom_score_lifetime_data_annotation"],
    score_overall=row["custom_score_overall"],
    score_multinames=row["custom_score_multiple_names"]
    ), axis='columns'
)

# adjust dwcagent displayOrder also to overall score
dwcagent_attr_output.sort_values(
    # by=['namematch_distance', 'family', 'given', 'custom_score_overall'], 
    # ascending=[True, True, True, False], 
    by=['canonical_string_collector_parsed', 'custom_score_overall', 'family', 'given'], 
    ascending=[True, False, True, True], 
    inplace=True
)
# use ordered canonical_string_collector_parsed to generate displayOrder
temp_duplicated = dwcagent_attr_output['canonical_string_collector_parsed'].duplicated() 
    # duplicated() marks the first occurrence as False and all later duplicates as True; the cumulative sum of the Trues per name gives the order index
temp_insert_value=temp_duplicated.groupby(
    dwcagent_attr_output['canonical_string_collector_parsed']).cumsum() + 1 # display order starts at 1, incrementing
dwcagent_attr_output.insert(
    dwcagent_attr_output.columns.get_loc('canonical_string_collector_parsed') + 1, 
    'displayOrder', temp_insert_value, 
    allow_duplicates=True
)

# test and show example data
if explain_and_show_the_data:
    print("example data: names having year of birth (yob)")
    display(dwcagent_attr_output.loc[pd.notnull(dwcagent_attr_output['yob'])].get([
        # "occurrenceID_collectors_firstsample",
        "canonical_string_collector_parsed",
        'itemLabel',
        "custom_score_overall",
        "attributionRemarks",
        'custom_score_multiple_names',
        "namematch_distance",
        # 'yob', 'yod',
        "life_time_periode", 
        'custom_score_lifetime_data', 'custom_score_lifetime_data_annotation'
    ]).head(5))
    print("example data: names missing the year of birth (yob)")
    display(dwcagent_attr_output.loc[dwcagent_attr_output['yob'].isnull()].get([
        # "occurrenceID_collectors_firstsample",
        "canonical_string_collector_parsed",
        'itemLabel',
        "custom_score_overall",
        "attributionRemarks",
        'custom_score_multiple_names',
        "namematch_distance",
        # 'yob', 'yod',
        "life_time_periode", 
        'custom_score_lifetime_data', 'custom_score_lifetime_data_annotation'
    ]).head(5))


example data: names having year of birth (yob)


Unnamed: 0,canonical_string_collector_parsed,itemLabel,custom_score_overall,attributionRemarks,custom_score_multiple_names,namematch_distance,life_time_periode,custom_score_lifetime_data,custom_score_lifetime_data_annotation
0,A,Geng Bojie,0.5,0.0 (k-means distance); 0.50 (score overall); ...,0.0,0.0,1917-1997,1.0,life time known
1,A. A. da Silveira,Alvaro Astolpho da Silveira,0.195,0.61 (k-means distance); 0.20 (score overall);...,0.0,0.61,1867-1945,1.0,life time known
2,A. A. du Thouars,Abel Aubert Dupetit Thouars,0.155,0.69 (k-means distance); 0.15 (score overall);...,0.0,0.69,1793-1864,1.0,life time known
3,A. A. von Bunge,Alexander Bunge,0.135,0.73 (k-means distance); 0.14 (score overall);...,0.0,0.73,1803-1890,1.0,life time known
4,A. Aaronsohn,Aaron Aaronsohn,0.5,0.0 (k-means distance); 0.50 (score overall); ...,0.0,0.0,1876-1919,1.0,life time known


example data: names missing the year of birth (yob)


Unnamed: 0,canonical_string_collector_parsed,itemLabel,custom_score_overall,attributionRemarks,custom_score_multiple_names,namematch_distance,life_time_periode,custom_score_lifetime_data,custom_score_lifetime_data_annotation
15,A. Al-Farsi,Akhlaq A. Warsi,0.0,0.97 (k-means distance); 0.00 (score overall);...,0.0,0.97,?-?,0.0,unknown life time
17,A. Al-Tereiry,Luis Marión,0.0,1.0 (k-means distance); 0.00 (score overall); ...,0.0,1.0,?-?,0.0,unknown life time
20,A. Aldao Nuñez,Fernando Nuez Viñals,0.0,0.9 (k-means distance); 0.00 (score overall); ...,0.0,0.9,?-?,0.0,unknown life time
29,A. Alvarado Mendez,Marco Antonio Alvarado Vásquez,0.0,0.75 (k-means distance); 0.00 (score overall);...,0.0,0.75,?-?,0.0,unknown life time
39,A. Antonietti,Philippe Antonetti,0.0,0.87 (k-means distance); 0.00 (score overall);...,0.0,0.87,?-?,0.0,unknown life time
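
To make the `custom_score_overall` values in the tables above easier to check, the formula from the cell can be recomputed by hand for single rows. The sketch below is plain Python; the helper name `overall_score` is hypothetical, and the maximum distance of 1.0 is an assumption read off the example outputs (the notebook takes it from `namematch_distance.max()`).

```python
def overall_score(distance, distance_max, lifetime_score, multiple_names_score):
    """Recompute custom_score_overall for a single row (hypothetical helper)."""
    # Turn the distance (0 = identical) into a similarity in the range 1 … 0.
    similarity = abs(distance - distance_max) / distance_max
    # Average of the life time score (0, 0.5 or 1) and the multiple-names penalty (0 or -0.5).
    data_quality = (lifetime_score + multiple_names_score) / 2
    return round(similarity * data_quality, 3)


DISTANCE_MAX = 1.0  # assumption: largest distance seen in the example output

# "A. A. da Silveira": distance 0.61, life time known, unique name
silveira = overall_score(0.61, DISTANCE_MAX, 1.0, 0.0)
# "A. Andersson" → Karl Alfred Andersson: distance 0.0, life time known, several candidates
andersson_known = overall_score(0.0, DISTANCE_MAX, 1.0, -0.5)
# "A. Andersson" → Axel Andersson: distance 0.0, one life date missing, several candidates
andersson_partial = overall_score(0.0, DISTANCE_MAX, 0.5, -0.5)

print(silveira, andersson_known, andersson_partial)
```

With these inputs the best attainable score is 0.5 (distance 0.0, life time known, unique name), because the data-quality term is an average of a value of at most 1 and a penalty of at most 0.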


In [26]:
# refactor collectors_eventDate_mean
# refactor collectors_eventDate_min
# - refactor yob_is_lt_eventDate_min
# refactor collectors_eventDate_max
# - refactor yod_is_gt_eventDate_max
# refactor custom_score_lifetime            → custom_score_lifetime_data
# refactor custom_score_lifetime_annotation → custom_score_lifetime_data_annotation
# refactor namematch_similarity             → namematch_distance
# refactor namematch_similarity_annotation  → namematch_distance_annotation
# refactor custom_namematch_similarity      → custom_namematch_distance
column_map_dwcagent_attr = {
    'occurrenceID_collectors_firstsample':'occurrenceID',
    'canonical_string_collector_parsed':  'alternateName',
    'source_data':                        'verbatimName',
    'itemLabel':                          'name',
    'item':                               'identifier',
    'namematch_distance':                 'custom_namematch_distance'
}
dwcagent_attr_output.rename(
    mapper=column_map_dwcagent_attr,
    axis='columns',
    inplace=True)

dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('identifier')          + 1, 'agentIdentifierType', 'wikidata' , allow_duplicates=True)
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('agentIdentifierType') + 1, 'agentType'          , 'Person'   , allow_duplicates=True)
dwcagent_attr_output.insert(dwcagent_attr_output.columns.get_loc('identifier')          + 1, 'action'             , 'collected', allow_duplicates=True)

if explain_and_show_the_data:
    print("the mapped DarwinCore attribution output examples, sorted by alternateName (=collector name) + displayOrder …")
    display(dwcagent_attr_output.head(20))

dwcagent_attr_output=dwcagent_attr_output.reindex(
    columns=[
        'occurrenceID',  # no DwC agent standard (yet)?
        'verbatimName',  # source_data
        'alternateName', # canonical_string_collector_parsed (actual collector name)
        'displayOrder',  # shall start from 1, 2, 3 … represents the available data quality not the match in the first place
        'name', # itemLabel is interpreted as the preferred name
        'attributionRemarks',
        'agentType',
        'action',
        'agentIdentifierType',
        'identifier',
        "custom_score_overall", # keep it for calculation convenience, no standard in DwC agent
        'custom_namematch_distance',# keep it for calculation convenience, no standard in DwC agent
        'custom_score_multiple_names',# keep it for calculation convenience, no standard in DwC agent
        'custom_score_lifetime_data' # keep it for calculation convenience, no standard in DwC agent
    ]
)


the mapped DarwinCore attribution output examples, sorted by alternateName (=collector name) + displayOrder …


Unnamed: 0,occurrenceID,alternateName,displayOrder,family,given,custom_namematch_distance,verbatimName,name,identifier,action,...,agentType,yob,yod,namematch_distance_annotation,life_time_periode,custom_score_lifetime_data,custom_score_lifetime_data_annotation,custom_score_multiple_names,custom_score_overall,attributionRemarks
0,http://id.snsb.info/snsb/collection/462869/564...,A,1,A,,0.0,"Angeludis C., A.",Geng Bojie,http://www.wikidata.org/entity/Q2466938,collected,...,Person,1917.0,1997.0,0.0 (k-means distance),1917-1997,1.0,life time known,0.0,0.5,0.0 (k-means distance); 0.50 (score overall); ...
1,https://herbarium.bgbm.org/object/B200129406a,A. A. da Silveira,1,Silveira,A. A. da,0.61,"Silveira,A.A. da",Alvaro Astolpho da Silveira,http://www.wikidata.org/entity/Q19002102,collected,...,Person,1867.0,1945.0,0.61 (k-means distance),1867-1945,1.0,life time known,0.0,0.195,0.61 (k-means distance); 0.20 (score overall);...
2,https://herbarium.bgbm.org/object/BW20158010,A. A. du Thouars,1,Thouars,A. A. du,0.69,"Thouars,A. A.du Petit-",Abel Aubert Dupetit Thouars,http://www.wikidata.org/entity/Q2821491,collected,...,Person,1793.0,1864.0,0.69 (k-means distance),1793-1864,1.0,life time known,0.0,0.155,0.69 (k-means distance); 0.15 (score overall);...
3,http://id.snsb.info/snsb/collection/511901/634...,A. A. von Bunge,1,Bunge,A. A. von,0.73,"Bunge, A.A. von (no. )",Alexander Bunge,http://www.wikidata.org/entity/Q65899,collected,...,Person,1803.0,1890.0,0.73 (k-means distance),1803-1890,1.0,life time known,0.0,0.135,0.73 (k-means distance); 0.14 (score overall);...
4,https://je.jacq.org/JE00010154,A. Aaronsohn,1,Aaronsohn,A.,0.0,"Aaronsohn,A.",Aaron Aaronsohn,http://www.wikidata.org/entity/Q2086130,collected,...,Person,1876.0,1919.0,0.0 (k-means distance),1876-1919,1.0,life time known,0.0,0.5,0.0 (k-means distance); 0.50 (score overall); ...
5,https://herbarium.bgbm.org/object/B100217620,A. Abaouz,1,Abaouz,A.,1.0,"Jury,S.L., Abaouz,A. ,Lafkih,M.Ait & Griffiths...",Neonila Zenonovna Semenova-Tjan-Schanskaya,http://www.wikidata.org/entity/Q21608653,collected,...,Person,1906.0,1960.0,1.0 (k-means distance),1906-1960,1.0,life time known,0.0,0.0,1.0 (k-means distance); 0.00 (score overall); ...
6,http://id.snsb.info/snsb/collection/719469/786...,A. Acebey,1,Acebey,A.,0.0,"Acebey, A. (no. 563)",Amparo Acebey,http://www.wikidata.org/entity/Q5673219,collected,...,Person,1973.0,,0.0 (k-means distance),1973-?,0.5,year of death is missing,0.0,0.25,0.0 (k-means distance); 0.25 (score overall); ...
7,https://herbarium.bgbm.org/object/B100393538,A. Aceres,1,Aceres,A.,0.95,"Álvarez de Zayas,A., Aceres,A., Bässler,M., Bi...",Giuseppe Acerbi,http://www.wikidata.org/entity/Q55007624,collected,...,Person,1773.0,1846.0,0.95 (k-means distance),1773-1846,1.0,life time known,0.0,0.025,0.95 (k-means distance); 0.03 (score overall);...
8,https://herbarium.bgbm.org/object/B100092853,A. Ackermann,1,Ackermann,A.,0.34,"Weigend,M., Ackermann,A. & Castillo,J.A.",Jacob Fidelis Ackermann,http://www.wikidata.org/entity/Q98053,collected,...,Person,1765.0,1815.0,0.34 (k-means distance),1765-1815,1.0,life time known,0.0,0.33,0.34 (k-means distance); 0.33 (score overall);...
9,https://herbarium.bgbm.org/object/B100629441,A. Acosta,1,Acosta,A.,0.37,"Zardini,E.M. & Acosta,A.",Salvador Acosta Castellanos,http://www.wikidata.org/entity/Q10367096,collected,...,Person,1957.0,,0.37 (k-means distance),1957-?,0.5,year of death is missing,0.0,0.158,0.37 (k-means distance); 0.16 (score overall);...
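
How `displayOrder` is derived can be shown on a few rows. This is a minimal sketch with a hypothetical frame `toy`; it uses the same `duplicated()` plus grouped `cumsum()` idea as the notebook, and the scores are taken from the «A. Andersson» example.

```python
import pandas as pd

# Hypothetical mini table: three candidates for one name, one candidate for another.
toy = pd.DataFrame({
    "alternateName":        ["A. Andersson", "A. Andersson", "A. Andersson", "A. Asensi"],
    "name":                 ["Axel Andersson", "Karl Alfred Andersson", "I. Anita Andersson", "Alfredo Asensi"],
    "custom_score_overall": [0.0, 0.25, 0.0, 0.0],
})

# Best score first within each collector name.
toy = toy.sort_values(by=["alternateName", "custom_score_overall"], ascending=[True, False])

# duplicated() is False for the first row of each name and True for every later row;
# the running sum of the Trues per name counts 0, 1, 2, …, so adding 1 gives 1, 2, 3, …
is_repeat = toy["alternateName"].duplicated()
toy["displayOrder"] = is_repeat.groupby(toy["alternateName"]).cumsum() + 1

# The same numbering via cumcount(), which counts rows within each group from 0.
display_order_alt = toy.groupby("alternateName").cumcount() + 1

print(toy[["alternateName", "displayOrder", "name", "custom_score_overall"]])
```

Both variants number the candidates per name in the current row order, so the preceding sort decides which candidate gets `displayOrder` 1.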


In [27]:
if explain_and_show_the_data:
    # this_name = "S. Ahmad"
    # print("show examples of a name beginning with “{}” …".format(this_name))
    # criterion = dwcagent_attr_output['alternateName'].map(lambda x: x.startswith(this_name))
    # -------
    print("show column-reduced examples of ?multiple name cases …")
    criterion = dwcagent_attr_output['custom_score_multiple_names'].map(lambda this_score: this_score < 0 )
    
    display(dwcagent_attr_output[criterion].drop(['agentType', 'action', 'agentIdentifierType'], axis='columns').head(20))

show column-reduced examples of ?multiple name cases …


Unnamed: 0,occurrenceID,verbatimName,alternateName,displayOrder,name,attributionRemarks,identifier,custom_score_overall,custom_namematch_distance,custom_score_multiple_names,custom_score_lifetime_data
21,https://herbarium.bgbm.org/object/B100810598,"Fayvush,G. [Ֆայվուշի,Գ.; Файвуш Г.], Oganesian...",A. Aleksanyan,1,Anatolij Aleksandrovič Ničiporovič,0.88 (k-means distance); 0.03 (score overall);...,http://www.wikidata.org/entity/Q26244283,0.03,0.88,-0.5,1.0
22,https://herbarium.bgbm.org/object/B100810598,"Fayvush,G. [Ֆայվուշի,Գ.; Файвуш Г.], Oganesian...",A. Aleksanyan,2,A.A. Richter,0.88 (k-means distance); 0.03 (score overall);...,http://www.wikidata.org/entity/Q4394909,0.03,0.88,-0.5,1.0
24,https://herbarium.bgbm.org/object/B101141687,"Miller,A. [Alison]",A. Alison Miller,1,James S. Miller,0.92 (k-means distance); 0.02 (score overall);...,http://www.wikidata.org/entity/Q22112339,0.02,0.92,-0.5,1.0
25,https://herbarium.bgbm.org/object/B101141687,"Miller,A. [Alison]",A. Alison Miller,2,John Frederick Miller,0.92 (k-means distance); 0.02 (score overall);...,http://www.wikidata.org/entity/Q2700645,0.02,0.92,-0.5,1.0
26,https://herbarium.bgbm.org/object/B101141687,"Miller,A. [Alison]",A. Alison Miller,3,David Miller,0.92 (k-means distance); 0.02 (score overall);...,http://www.wikidata.org/entity/Q5237572,0.02,0.92,-0.5,1.0
27,https://herbarium.bgbm.org/object/B101141687,"Miller,A. [Alison]",A. Alison Miller,4,Gerrit Smith Miller,0.92 (k-means distance); 0.02 (score overall);...,http://www.wikidata.org/entity/Q538252,0.02,0.92,-0.5,1.0
34,https://herbarium.bgbm.org/object/B100212980,"Andersson,A. & Franzén,R.",A. Andersson,1,Karl Alfred Andersson,0.0 (k-means distance); 0.25 (score overall); ...,http://www.wikidata.org/entity/Q131724787,0.25,0.0,-0.5,1.0
33,https://herbarium.bgbm.org/object/B100212980,"Andersson,A. & Franzén,R.",A. Andersson,2,Axel Andersson,0.0 (k-means distance); 0.00 (score overall); ...,http://www.wikidata.org/entity/Q123652899,0.0,0.0,-0.5,0.5
35,https://herbarium.bgbm.org/object/B100212980,"Andersson,A. & Franzén,R.",A. Andersson,3,I. Anita Andersson,0.0 (k-means distance); 0.00 (score overall); ...,http://www.wikidata.org/entity/Q21505194,0.0,0.0,-0.5,0.5
52,https://herbarium.bgbm.org/object/B101206042,"Asensi,A. & Salvo,E.",A. Asensi,1,Alfredo Asensi,0.0 (k-means distance); 0.00 (score overall); ...,http://www.wikidata.org/entity/Q21505377,0.0,0.0,-0.5,0.5


In [28]:
if not os.path.exists('data'):
    os.makedirs('data')

this_output_file='data/results_bgbm_collectors_vs_wikidata-botanists_kneighbor_dwc-agent-output_%s.csv' % (
    this_timestamp_for_data
)

dwcagent_attr_output.to_csv(this_output_file, index=False)

print("Wrote matches of collector names as dwc-agent-output into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10)  # right shift by 10 bits = integer division by 1024, i.e. size in kB
)

Wrote matches of collector names as dwc-agent-output into data/results_bgbm_collectors_vs_wikidata-botanists_kneighbor_dwc-agent-output_20260210.csv (7905 kB)
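
The file size in the message is computed with a right shift. As a small sketch with a hypothetical value, shifting a non-negative integer right by 10 bits is the same as integer division by 1024:

```python
size_in_bytes = 10000            # hypothetical file size
size_in_kb = size_in_bytes >> 10  # drop the lowest 10 bits

# The shift and the floor division by 2**10 agree.
same_as_division = size_in_kb == size_in_bytes // 1024

print(size_in_kb, same_as_division)
```

The result is rounded down, so a file smaller than 1024 bytes is reported as 0 kB.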


## Documentation

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nick name from name parsing
appellation | appellation from name parsing
title | title from name parsing
eventDate | date of the sampling event (required by GBIF, ☞ https://www.gbif.org/data-quality-requirements-sampling-events)
eventDate_min | calculated earliest date of all the sampling events within the data
eventDate_max | calculated latest date of all the sampling events within the data
eventDate_mean | calculated mean date of all the sampling events within the data
TODO activity_span | Number of years between first and last collection
**Name matching** |
namematch_source_data | matched name of the source data set (the parsed collector name)
namematch_resource_data | matched name of the Wikidata data set (the canonical name string the collector name is matched to)
namematch_distance | Nearest Neighbour distance between the name and matched name; the lower the value, the better the match
**DarwinCore Agent Output** | (☞ [agent_actions_v2020-09-08.xml](https://github.com/tdwg/attribution/blob/master/people/dwc/agent_actions_v2020-09-08.xml))
occurrenceID | occurrence ID of the data item
name | the interpreted name match (https://github.com/tdwg/attribution/ The name of the item. In this case the *full name* as would be written on a legal document (without abbreviation), eg givenName familyName)
verbatimName | the source data name(s) (https://github.com/tdwg/attribution/ As written on occurrence, such as the collection or determination label.)
alternateName | the input name, collector source name (An alias for the item. Other full name agent may have been known under such as maiden name.)
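displayOrder | order of the candidates in multiple name cases, starting at 1 and ranked by `custom_score_overall`, so it reflects the available data quality rather than the name match alone (https://github.com/tdwg/attribution/ The display order for the agent that executed the action when more than one agent was a participant.)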
attributionRemarks | notes on the results (distance or similarity), including calculated value
agentType | The nature of the agent, e.g. "Person", "Organization", "SoftwareApplication"
action | The name of the single action written as a verb in past tense. Recommended best practice is to use a controlled vocabulary, examples "collected" or "identified"
agentIdentifierType | The type of identifier for the agent. (https://github.com/tdwg/attribution/ Recommended best practice is to use a controlled vocabulary, e.g. “ORCID”, “ISNI”, “Wikidata”, “VIAF”, “RoR”, “Ringgold”, “GRID”).
identifier | Wikidata ID (Recommended practice is to identify the resource by means of a string conforming to an identification system. Examples include International Standard Book Number (ISBN), Digital Object Identifier (DOI), and Uniform Resource Name (URN). Persistent identifiers should be provided as HTTP URIs.)
startedAtTime | (https://github.com/tdwg/attribution/ Start is when an action is deemed to have been started by an agent.) the first date of eventDate (supposedly the first sampling date), grouped by collector name; in case of multiple name matches this first “sampling date” is less reliable for relating to the source collector’s life time.
endedAtTime | (https://github.com/tdwg/attribution/ End is when an action is deemed to have been ended by an agent.) the last date of eventDate (supposedly the last sampling date), grouped by collector name; in case of multiple name matches this last “sampling date” is less reliable for relating to the source collector’s life time.
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))

Refactoring from <https://github.com/nielsklazenga/avh-collectors/blob/master/match_names_to_wikidata_items.ipynb>

AVH | collector_matching (here)
-|-
avh_matches | collectors_all_matches
wd_test | wd_matchtest